tendermint

Commit Graph

Author	SHA1	Message	Date
Anton Kaliaev	ec9bff5234	rename WAL#Flush to WAL#FlushAndSync (#3345) * rename WAL#Flush to WAL#FlushAndSync - rename auto#Flush to auto#FlushAndSync - cleanup WAL interface to not leak implementation details! * remove Group() * add WALReader interface and return it in SearchForEndHeight() - add interface assertions Refs #3337 * replace WALReader with io.ReadCloser	6 years ago
Anton Kaliaev	ed1de13548	cs: update wal comments (#3334) * cs: update wal comments Follow-up to https://github.com/tendermint/tendermint/pull/3300 * Update consensus/wal.go Co-Authored-By: melekes <anton.kalyaev@gmail.com>	6 years ago
Thane Thomson	dff3deb2a9	cs: sync WAL more frequently (#3300) As per #3043, this adds a ticker to sync the WAL every 2s while the WAL is running. * Flush WAL every 2s This adds a ticker that flushes the WAL every 2s while the WAL is running. This is related to #3043. * Fix spelling * Increase timeout to 2mins for slower build environments * Make WAL sync interval configurable * Add TODO to replace testChan with more comprehensive testBus * Remove extraneous debug statement * Remove testChan in favour of using system time As per https://github.com/tendermint/tendermint/pull/3300#discussion_r255886586, this removes the `testChan` WAL member and replaces the approach with a system time-oriented one. In this new approach, we keep track of the system time at which each flush and periodic flush successfully occurred. The naming of the various functions is also updated here to be more consistent with "flushing" as opposed to "sync'ing". * Update naming convention and ensure lock for timestamp update * Add Flush method as part of WAL interface Adds a `Flush` method as part of the WAL interface to enforce the idea that we can manually trigger a WAL flush from outside of the WAL. This is employed in the consensus state management to flush the WAL prior to signing votes/proposals, as per https://github.com/tendermint/tendermint/issues/3043#issuecomment-453853630 * Update CHANGELOG_PENDING * Remove mutex approach and replace with DI The dependency injection approach to dealing with testing concerns could allow similar effects to some kind of "testing bus"-based approach. This commit introduces an example of this, where instead of relying on (potentially fragile) timing of things between the code and the test, we inject code into the function under test that can signal the test through a channel. This allows us to avoid the `time.Sleep()`-based approach previously employed. * Update comment on WAL flushing during vote signing Co-Authored-By: thanethomson <connect@thanethomson.com> * Simplify flush interval definition Co-Authored-By: thanethomson <connect@thanethomson.com> * Expand commentary on WAL disk flushing Co-Authored-By: thanethomson <connect@thanethomson.com> * Add broken test to illustrate WAL sync test problem Removes test-related state (dependency injection code) from the WAL data structure and adds test code to illustrate the problem with using `WALGenerateNBlocks` and `wal.SearchForEndHeight` to test periodic sync'ing. * Fix test error messages * Use WAL group buffer size to check for flush A function is added to `libs/autofile/group.go#Group` in order to return the size of the buffered data (i.e. data that has not yet been flushed to disk). The test now checks that, prior to a `time.Sleep`, the group buffer has data in it. After the `time.Sleep` (during which time the periodic flush should have been called), the buffer should be empty. * Remove config root dir removal from #3291 * Add godoc for NewWAL mentioning periodic sync	6 years ago
Anton Kaliaev	0b0a8b3128	cs/wal: refuse to encode msg that is bigger than maxMsgSizeBytes (#3303) Earlier this week somebody posted this in GoS Riot chat: ``` E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 878916964 exceeded maximum possible value of 1048576 bytes]" E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 825701731 exceeded maximum possible value of 1048576 bytes]" E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 1631073634 exceeded maximum possible value of 1048576 bytes]" E[2019-02-12|10:38:37.596] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[length 912418148 exceeded maximum possible value of 1048576 bytes]" E[2019-02-12|10:38:37.600] Corrupted entry. Skipping... module=consensus wal=/home/gaia/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]" E[2019-02-12|10:38:37.600] Error on catchup replay. Proceeding to start ConsensusState anyway module=consensus err="Cannot replay height 7242. WAL does not contain #ENDHEIGHT for 7241" E[2019-02-12|10:38:37.861] Error dialing peer module=p2p err="dial tcp 35.183.126.181:26656: i/o timeout ``` Note the length error messages. What has happened is the length field got corrupted probably. I've looked at the code and noticed that we don't check the msg size during encoding. This PR fixes that. It also improves a few error messages in WALDecoder.	6 years ago
Ethan Buchman	dc6567c677	consensus: flush wal on stop (#3297) Should fix #3295 Also partial fix of #3043	6 years ago
Ethan Buchman	45b70ae031	fix non deterministic test failures and race in privval socket (#3258) * node: decrease retry conn timeout in test Should fix #3256 The retry timeout was set to the default, which is the same as the accept timeout, so it's no wonder this would fail. Here we decrease the retry timeout so we can try many times before the accept timeout. * p2p: increase handshake timeout in test This fails sometimes, presumably because the handshake timeout is so low (only 50ms). So increase it to 1s. Should fix #3187 * privval: fix race with ping. closes #3237 Pings happen in a go-routine and can happen concurrently with other messages. Since we use a request/response protocol, we expect to send a request and get back the corresponding response. But with pings happening concurrently, this assumption could be violated. We were using a mutex, but only a RWMutex, where the RLock was being held for sending messages - this was to allow the underlying connection to be replaced if it fails. Turns out we actually need to use a full lock (not just a read lock) to prevent multiple requests from happening concurrently. * node: fix test name. DelayedStop -> DelayedStart * autofile: Wait() method In the TestWALTruncate in consensus/wal_test.go we remove the WAL directory at the end of the test. However the wal.Stop() does not properly wait for the autofile group to finish shutting down. Hence it was possible that the group's go-routine is still running when the cleanup happens, which causes a panic since the directory disappeared. Here we add a Wait() method to properly wait until the go-routine exits so we can safely clean up. This fixes #2852.	6 years ago
Ethan Buchman	39eba4e154	WAL: better errors and new fail point (#3246) * privval: more info in errors * wal: change Debug logs to Info * wal: log and return error on corrupted wal instead of panicing * fail: Exit right away instead of sending interupt * consensus: FAIL before handling our own vote allows to replicate #3089: - run using `FAIL_TEST_INDEX=0` - delete some bytes from the end of the WAL - start normally Results in logs like: ``` I[2019-02-03|18:12:58.225] Searching for height module=consensus wal=/Users/ethanbuchman/.tendermint/data/cs.wal/wal height=1 min=0 max=0 E[2019-02-03|18:12:58.225] Error on catchup replay. Proceeding to start ConsensusState anyway module=consensus err="failed to read data: EOF" I[2019-02-03|18:12:58.225] Started node module=main nodeInfo="{ProtocolVersion:{P2P:6 Block:9 App:1} ID_:35e87e93f2e31f305b65a5517fd2102331b56002 ListenAddr:tcp://0.0.0.0:26656 Network:test-chain-J8JvJH Version:0.29.1 Channels:4020212223303800 Moniker:Ethans-MacBook-Pro.local Other:{TxIndex:on RPCAddress:tcp://0.0.0.0:26657}}" E[2019-02-03|18:12:58.226] Couldn't connect to any seeds module=p2p I[2019-02-03|18:12:59.229] Timed out module=consensus dur=998.568ms height=1 round=0 step=RoundStepNewHeight I[2019-02-03|18:12:59.230] enterNewRound(1/0). Current: 1/0/RoundStepNewHeight module=consensus height=1 round=0 I[2019-02-03|18:12:59.230] enterPropose(1/0). Current: 1/0/RoundStepNewRound module=consensus height=1 round=0 I[2019-02-03|18:12:59.230] enterPropose: Our turn to propose module=consensus height=1 round=0 proposer=AD278B7767B05D7FBEB76207024C650988FA77D5 privValidator="PrivValidator{AD278B7767B05D7FBEB76207024C650988FA77D5 LH:1, LR:0, LS:2}" E[2019-02-03|18:12:59.230] enterPropose: Error signing proposal module=consensus height=1 round=0 err="Error signing proposal: Step regression at height 1 round 0. Got 1, last step 2" I[2019-02-03|18:13:02.233] Timed out module=consensus dur=3s height=1 round=0 step=RoundStepPropose I[2019-02-03|18:13:02.233] enterPrevote(1/0). Current: 1/0/RoundStepPropose module=consensus I[2019-02-03|18:13:02.233] enterPrevote: ProposalBlock is nil module=consensus height=1 round=0 E[2019-02-03|18:13:02.234] Error signing vote module=consensus height=1 round=0 vote="Vote{0:AD278B7767B0 1/00/1(Prevote) 000000000000 000000000000 @ 2019-02-04T02:13:02.233897Z}" err="Error signing vote: Conflicting data" ``` Notice the EOF, the step regression, and the conflicting data. * wal: change errors to be DataCorruptionError * exit on corrupt WAL * fix log * fix new line	6 years ago
Anton Kaliaev	d178ea9eaf	use our logger in autofile/group	6 years ago
goolAdapter	110b07fb3f	libs: Call Flush() before rename #2428 (#2439) * fix Group.RotateFile need call Flush() before rename. #2428 * fix some review issue. #2428 refactor Group's config: replace setting member with initial option * fix a handwriting mistake * fix a time window error between rename and write. * fix a syntax mistake. * change option name Get_ to With_ * fix review issue * fix review issue	6 years ago
Anton Kaliaev	0e1cd88863	Remove ConsensusParams.TxSize and ConsensusParams.BlockGossip (#2364) * remove ConsensusParams.TxSize and ConsensusParams.BlockGossip Refs #2347 * block part size is now fixed Refs #2347 * use max data size, not max bytes for tx limit Refs #2347	6 years ago
Zarko Milosevic	7b88172f41	Implement BFT time (#2203) * Implement BFT time * set LastValidators when creating state in state helper for heights >= 2	6 years ago
Dev Ojha	2756be5a59	libs: Remove usage of custom Fmt, in favor of fmt.Sprintf (#2199) * libs: Remove usage of custom Fmt, in favor of fmt.Sprintf Closes #2193 * Fix bug that was masked by custom Fmt!	6 years ago
Zach Ramsay	44dad6d70b	Revert "detele everything" This reverts commit `d02c5d1e30`.	6 years ago
Zach Ramsay	d02c5d1e30	detele everything	6 years ago
Ethan Buchman	d55243f0e6	fix import paths	6 years ago
Liamsi	d2c05bc5b9	Revert "delete everything" (includes everything non-go-crypto) This reverts commit `96a3502`	6 years ago
Liamsi	96a3502126	delete everything	7 years ago
Anton Kaliaev	1f22f34edf	flush wal group on stop Refs #1659 Refs https://github.com/tendermint/tmlibs/pull/217	7 years ago
Anton Kaliaev	708f35e5c1	do not look for height in older files if we've seen height - 1 Refs #1600	7 years ago
Anton Kaliaev	f3f5c7f472	we must only return io.EOF to progress to the next file in auto.Group since we never write msg partially, if we've encountered io.EOF in the middle of the msg, we must abort	7 years ago
Anton Kaliaev	68f6226bea	data is corrupted, but this requires manual intervention i.e., can't be skipped and we should only return DataCorruptionError if we can skip a msg safely	7 years ago
Anton Kaliaev	118b86b1ef	fix nil panic error msg is nil and if we continue executing, we'll get nil exception at `msg.Msg.(....)`	7 years ago
Anton Kaliaev	b9afcbe3a2	fix typo	7 years ago
Ethan Buchman	ee4eb59355	update comments	7 years ago
Ethan Buchman	082a02e6d1	consensus: only fsync wal after internal msgs	7 years ago
Anton Kaliaev	e88f74bb9b	remove wal_light setting Closes #1428	7 years ago
Jae Kwon	fb64314d1c	Review from Anton	7 years ago
Ethan Buchman	799beebd36	fix consensus tests	7 years ago
Jae Kwon	45ec5fd170	WIP consensus	7 years ago
Ethan Buchman	a17105fd46	p2p: peer.Key -> peer.ID	7 years ago
Anton Kaliaev	843e1ed400	Updates -> ValidatoSetUpdates	7 years ago
Anton Kaliaev	e57cad6c3f	correct maxMsgSizeBytes	7 years ago
Anton Kaliaev	06aece31cf	lower the max message size	7 years ago
Anton Kaliaev	af79a2a59e	fix error msg	7 years ago
Anton Kaliaev	ee66476d62	set max msg size otherwise, it is easy to get OutOfMemory panic (somebody can even expoit this)	7 years ago
Anton Kaliaev	40f9261d48	handle data corruption errors Refs #573	7 years ago
Anton Kaliaev	90944bb1a2	be specific about what type we're encoding to be consistent with Decode, which returns TimedWALMessage	7 years ago
Anton Kaliaev	07571741c5	[consensus] remove WAL separator (Refs #785) We don't really need a separator unless we have complex structures (rows, cells like RDBMS have https://www.sqlite.org/fileformat.html).	7 years ago
Anton Kaliaev	c6f025f40e	generate WAL on the fly (Refs #468)	7 years ago
Anton Kaliaev	922af7c405	int64 height uint64 is considered dangerous. the details will follow in a blog post.	7 years ago
Anton Kaliaev	69b5da766c	service#Start, service#Stop signatures were changed See https://github.com/tendermint/tmlibs/issues/45	7 years ago
Zach Ramsay	6f3c05545d	fix new linting errors	7 years ago
Zach Ramsay	b75d4f73e7	errcheck: PR comment fixes	7 years ago
Anton Kaliaev	61d76a273f	fixes from Bucky's and Emmanuel's reviews	7 years ago
Anton Kaliaev	f6539737de	new pubsub package comment out failing consensus tests for now rewrite rpc httpclient to use new pubsub package import pubsub as tmpubsub, query as tmquery make event IDs constants EventKey -> EventTypeKey rename EventsPubsub to PubSub mempool does not use pubsub rename eventsSub to pubsub new subscribe API fix channel size issues and consensus tests bugs refactor rpc client add missing discardFromChan method add mutex rename pubsub to eventBus remove IsRunning from WSRPCConnection interface (not needed) add a comment in broadcastNewRoundStepsAndVotes rename registerEventCallbacks to broadcastNewRoundStepsAndVotes See https://dave.cheney.net/2014/03/19/channel-axioms stop eventBuses after reactor tests remove unnecessary Unsubscribe return subscribe helper function move discardFromChan to where it is used subscribe now returns an err this gives us ability to refuse to subscribe if pubsub is at its max capacity. use context for control overflow cache queries handle err when subscribing in replay_test rename testClientID to testSubscriber extract var set channel buffer capacity to 1 in replay_file fix byzantine_test unsubscribe from single event, not all events refactor httpclient to return events to appropriate channels return failing testReplayCrashBeforeWriteVote test fix TestValidatorSetChanges refactor code a bit fix testReplayCrashBeforeWriteVote add comment fix TestValidatorSetChanges fixes from Bucky's review update comment [ci skip] test TxEventBuffer update changelog fix TestValidatorSetChanges (2nd attempt) only do wg.Done when no errors benchmark event bus create pubsub server inside NewEventBus only expose config params (later if needed) set buffer capacity to 0 so we are not testing cache new tx event format: key = "Tx" plus a tag {"tx.hash": XYZ} This should allow to subscribe to all transactions! or a specific one using a query: "tm.events.type = Tx and tx.hash = '013ABF99434...'" use TimeoutCommit instead of afterPublishEventNewBlockTimeout TimeoutCommit is the time a node waits after committing a block, before it goes into the next height. So it will finish everything from the last block, but then wait a bit. The idea is this gives it time to hear more votes from other validators, to strengthen the commit it includes in the next block. But it also gives it time to hear about new transactions. waitForBlockWithUpdatedVals rewrite WAL crash tests Task: test that we can recover from any WAL crash. Solution: the old tests were relying on event hub being run in the same thread (we were injecting the private validator's last signature). when considering a rewrite, we considered two possible solutions: write a "fuzzy" testing system where WAL is crashing upon receiving a new message, or inject failures and trigger them in tests using something like https://github.com/coreos/gofail. remove sleep no cs.Lock around wal.Save test different cases (empty block, non-empty block, ...) comments add comments test 4 cases: empty block, non-empty block, non-empty block with smaller part size, many blocks fixes as per Bucky's last review reset subscriptions on UnsubscribeAll use a simple counter to track message for which we panicked also, set a smaller part size for all test cases	7 years ago
Ethan Buchman	57a684d5ac	fixes from review	7 years ago
Anton Kaliaev	c74a359c46	fixes per Bucky's review	7 years ago
Anton Kaliaev	3115c23762	binary format for WAL	7 years ago
Anton Kaliaev	31030c6514	make crc32c a global var change echo format in build.sh script	7 years ago
Anton Kaliaev	7b8ffc9981	add checksum and msg size to TimedWALMessage updated test_data/build.sh script	7 years ago

72 Commits (86a581f28f8011a3ea7745065dbbf92c7413de4c)