When shutting down blocksync, the process can hang completely. A dump of running goroutines reveals that this is due to goroutines not listening on the correct shutdown signal: the `poolRoutine` goroutine does not wait on `pool.Quit`. The `poolRoutine` receives no other shutdown signal during `OnStop` because it must stop before `r.closeCh` is closed. Currently the `poolRoutine` listens only on `closeCh`, which will not be closed until the `poolRoutine` stops and calls `poolWG.Done()`, so shutdown deadlocks.
This change also moves the `requestRoutine()` into the `OnStart` method to make it more visible, since it does not rely on anything spawned in the `poolRoutine`.
```
goroutine 183 [semacquire]:
sync.runtime_Semacquire(0xc0000d3bd8)
runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc0000d3bd0)
sync/waitgroup.go:130 +0x65
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStop(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:193 +0x47
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0000d3a00, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc00052c000)
github.com/tendermint/tendermint/node/node.go:758 +0xc62
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00052c000, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000df6d20, 0x7f04a68da900, 0xc0004a8930, 0xc0005a72d8)
github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6
goroutine 161 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).poolRoutine(0xc0000d3a00, 0x0)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:464 +0x2b3
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:174 +0xf1
goroutine 162 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processBlockSyncCh(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:310 +0x151
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:177 +0x54
goroutine 163 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processPeerUpdates(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:363 +0x12b
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:178 +0x76
```
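For illustration, a minimal self-contained sketch of the fix described above (the types and field names are simplified stand-ins, not the actual reactor API): the pool routine must also select on the pool's own quit signal, because `closeCh` is only closed after the routine has already exited.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type pool struct{ quit chan struct{} }

type reactor struct {
	pool    *pool
	closeCh chan struct{} // closed only after poolWG.Wait() returns in OnStop
	poolWG  sync.WaitGroup
}

func (r *reactor) poolRoutine() {
	defer r.poolWG.Done()
	for {
		select {
		case <-r.pool.quit: // the shutdown signal actually delivered during OnStop
			return
		case <-r.closeCh: // if this were the only case, shutdown would deadlock
			return
		case <-time.After(10 * time.Millisecond):
			// normal request/processing work would happen here
		}
	}
}

func main() {
	r := &reactor{pool: &pool{quit: make(chan struct{})}, closeCh: make(chan struct{})}
	r.poolWG.Add(1)
	go r.poolRoutine()

	// OnStop: stop the pool first, wait for poolRoutine, then close closeCh.
	close(r.pool.quit)
	r.poolWG.Wait()
	close(r.closeCh)
	fmt.Println("clean shutdown")
}
```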
The code in the Tendermint repository makes heavy use of import aliasing.
This is made necessary by our extensive reuse of common base package names, and
by repetition of similar names across different subdirectories.
Unfortunately we have not been very consistent about which packages we alias in
various circumstances, and the aliases we use vary. In the spirit of the advice
in the style guide and https://github.com/golang/go/wiki/CodeReviewComments#imports,
this change makes an effort to clean up and normalize import aliasing.
This change makes no API or behavioral changes. It is a pure cleanup intended
to help make the code more readable to developers (including myself) trying to
understand what is being imported where.
Only unexported names have been modified, and the changes were generated and
applied mechanically with gofmt -r and comby, respecting the lexical and
syntactic rules of Go. Even so, I did not fix every inconsistency. Where the
changes would be too disruptive, I left it alone.
The principles I followed in this cleanup are:
- Remove aliases that restate the package name.
- Remove aliases where the base package name is unambiguous.
- Move overly-terse abbreviations from the import to the usage site.
- Fix lexical issues (remove underscores, remove capitalization).
- Fix import groupings to more closely match the style guide.
- Group blank (side-effecting) imports and ensure they are commented.
- Add aliases to multiple imports with the same base package name.
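For illustration only, an import block that follows these principles might look like the following (the specific packages and aliases are examples, not part of this change):

```go
import (
	// Standard library.
	"context"
	"fmt"

	// Aliased only where the base name ("types") would otherwise be ambiguous.
	abci "github.com/tendermint/tendermint/abci/types"
	"github.com/tendermint/tendermint/libs/log"
	tmproto "github.com/tendermint/tendermint/proto/tendermint/types"

	// Blank (side-effecting) imports are grouped and commented.
	_ "net/http/pprof" // registers pprof handlers on http.DefaultServeMux
)
```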
This commit extends the fix in #6518 so that all other goroutines which run concurrently with processBlockchainCh can safely send data to the blockchain out channel via a bridge channel. This eliminates all possible data races between sending on and closing the blockchainCh.Out channel at the same time.

Fixes #6516
There is a possible data race/panic between processBlockchainCh and processPeerUpdates, since we send to blockchainCh.Out in one goroutine and close the channel in the other. The race shows up in some GitHub Actions runs.

This commit fixes the race by adding a peerUpdatesCh as a bridge between processPeerUpdates and processBlockchainCh: the former sends to this channel, and the latter listens and forwards the messages to the blockchainCh.Out channel.

Updates #6516
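For illustration, a minimal self-contained sketch of this bridge-channel pattern (the channel names and types below are simplified stand-ins, not the reactor's actual API): only the goroutine that owns the out channel ever sends on it or closes it; other goroutines go through the bridge.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	out := make(chan string)    // stands in for blockchainCh.Out
	bridge := make(chan string) // stands in for peerUpdatesCh
	done := make(chan struct{})

	var wg sync.WaitGroup

	// Consumer of out (e.g. the router side).
	wg.Add(1)
	go func() {
		defer wg.Done()
		for msg := range out {
			fmt.Println("forwarded:", msg)
		}
	}()

	// processBlockchainCh: sole owner of out; forwards from the bridge.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer close(out) // safe: no other goroutine ever sends on out
		for {
			select {
			case msg := <-bridge:
				out <- msg
			case <-done:
				return
			}
		}
	}()

	// processPeerUpdates: sends via the bridge, never touches out directly.
	bridge <- "peer update"
	close(done)
	wg.Wait()
}
```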
This cleans up the `Router` code and adds a bunch of tests. These sorts of systems are a real pain to test, since they have a bunch of asynchronous goroutines living their own lives, so the test coverage is decent but not fantastic. Luckily we've been able to move all of the complex peer management and transport logic outside of the router, as synchronous components that are much easier to test, so the core router logic is fairly small and simple.
This also provides some initial test tooling in `p2p/p2ptest` that automatically sets up in-memory networks and channels for use in integration tests. It also includes channel-oriented test asserters in `p2p/p2ptest/require.go`, but these have primarily been written for router testing and should probably be adapted or extended for reactor testing.
## Description
Fixes the data race in usage of `WaitGroup`. Specifically, the case where we invoke `Wait` _before_ the first delta `Add` call when the current waitgroup counter is zero. See https://golang.org/pkg/sync/#WaitGroup.Add.
Still not sure how this manifests itself in a test since the reactor has to be stopped virtually immediately after being started (I think?).
Regardless, this is the appropriate fix.
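For illustration, a minimal self-contained sketch of the `WaitGroup` rule in question (the reactor names here are simplified stand-ins, not the actual fix): the counter is incremented synchronously before any goroutine is spawned, so `Wait` can never race with the first `Add`.

```go
package main

import (
	"fmt"
	"sync"
)

type reactor struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

func (r *reactor) OnStart() {
	r.quit = make(chan struct{})
	// Add is called here, in the starting goroutine, NOT inside poolRoutine.
	// Calling Add concurrently with Wait while the counter is zero is the
	// data race described in the sync.WaitGroup documentation.
	r.wg.Add(1)
	go r.poolRoutine()
}

func (r *reactor) poolRoutine() {
	defer r.wg.Done()
	<-r.quit
}

func (r *reactor) OnStop() {
	close(r.quit)
	r.wg.Wait() // safe even if OnStop runs immediately after OnStart
}

func main() {
	r := &reactor{}
	r.OnStart()
	r.OnStop()
	fmt.Println("stopped cleanly")
}
```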
closes: #5968
The blockchain/vX reactor priority was decreased because, during normal operation (i.e. when the node is not fast syncing), blockchain priority can't be the same as the consensus reactor priority. Otherwise, it would theoretically be possible to slow down consensus by constantly requesting blocks from the node.

NOTE: ideally, blockchain/vX reactor priority would be dynamic, e.g. 10 (max) while the node is fast syncing, then decreased to 5 once fast sync is done (only to serve blocks to other nodes). That's not possible now, so I decided to focus on normal operation (priority = 5).

Evidence and consensus-critical messages are more important than the mempool ones, hence their priorities are bumped by 1 (from 5 to 6).

The statesync reactor priority was changed from 1 to 5 to match the blockchain/vX priority.
Refs https://github.com/tendermint/tendermint/issues/5816
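For illustration, a tiny stand-in (not the actual p2p channel descriptor type) showing the relative priorities described above, where a higher value is drained first:

```go
package main

import "fmt"

type channelDescriptor struct {
	Name     string
	Priority int
}

func main() {
	descriptors := []channelDescriptor{
		{Name: "consensus", Priority: 6},  // bumped from 5 to 6
		{Name: "evidence", Priority: 6},   // bumped from 5 to 6
		{Name: "blockchain", Priority: 5}, // decreased so it cannot compete with consensus
		{Name: "statesync", Priority: 5},  // raised from 1 to match blockchain
	}
	for _, d := range descriptors {
		fmt.Printf("%-10s priority=%d\n", d.Name, d.Priority)
	}
}
```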
After a reactor has failed to parse an incoming message, it shouldn't output the "bad" data into the logs, as that data is unfiltered and could have anything in it. (We also don't think this information is helpful to have in the logs anyways.)
- drop Height & Base from StatusRequest
It does not make sense, nor is it used anywhere currently. Also, there seems to be no trace of these fields in ADR-40 (blockchain reactor v2).
- change PacketMsg#EOF type from int32 to bool
check the bcR.fastSync flag in `OnStop`
fixes the `service/service.go:161 Not stopping BlockPool -- have not been started yet {"impl": "BlockPool"}` error when the process is killed
Fixes #828. Adds state sync, as outlined in [ADR-053](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-053-state-sync-prototype.md). See related PRs in Cosmos SDK (https://github.com/cosmos/cosmos-sdk/pull/5803) and Gaia (https://github.com/cosmos/gaia/pull/327).
This is split out of the previous PR #4645, and branched off of the ABCI interface in #4704.
* Adds a new P2P reactor which exchanges snapshots with peers, and bootstraps an empty local node from remote snapshots when requested.
* Adds a new configuration section `[statesync]` that enables state sync and configures the light client. Also enables `statesync:info` logging by default.
* Integrates state sync into node startup. Does not support the v2 blockchain reactor, since it needs some reorganization to defer startup.
to prevent malicious nodes from sending us large messages (~21MB, which
is the default `RecvMessageCapacity`)
This allows us to remove the unnecessary `maxMsgSize` check in `decodeMsg`: since each channel already has its message capacity set to `maxMsgSize`, there's no need to check it again there.
Closes #1503
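For illustration, a minimal self-contained sketch of the idea above (the types and limit are simplified stand-ins, not the actual p2p API): once the channel layer enforces a per-channel receive limit, decoding no longer needs its own size check.

```go
package main

import (
	"errors"
	"fmt"
)

const maxMsgSize = 1048576 // example limit; the real value depends on the reactor

// receive models the channel layer: it rejects oversized payloads before they
// ever reach the reactor's decode step.
func receive(payload []byte) ([]byte, error) {
	if len(payload) > maxMsgSize {
		return nil, errors.New("message exceeds RecvMessageCapacity")
	}
	return payload, nil
}

// decodeMsg no longer repeats the size check; it can assume the payload is
// already within bounds.
func decodeMsg(payload []byte) string {
	return fmt.Sprintf("decoded %d bytes", len(payload))
}

func main() {
	if p, err := receive(make([]byte, 512)); err == nil {
		fmt.Println(decodeMsg(p))
	}
	if _, err := receive(make([]byte, maxMsgSize+1)); err != nil {
		fmt.Println("rejected:", err)
	}
}
```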
* Added BlockStore.DeleteBlock()
* Added initial block pruner prototype
* wip
* Added BlockStore.PruneBlocks()
* Added consensus setting for block pruning
* Added BlockStore base
* Error on replay if base does not have blocks
* Handle missing blocks when sending VoteSetMaj23Message
* Error message tweak
* Properly update blockstore state
* Error message fix again
* blockchain: ignore peer missing blocks
* Added FIXME
* Added test for block replay with truncated history
* Handle peer base in blockchain reactor
* Improved replay error handling
* Added tests for Store.PruneBlocks()
* Fix non-RPC handling of truncated block history
* Panic on missing block meta in needProofBlock()
* Updated changelog
* Handle truncated block history in RPC layer
* Added info about earliest block in /status RPC
* Reorder height and base in blockchain reactor messages
* Updated changelog
* Fix tests
* Appease linter
* Minor review fixes
* Non-empty BlockStores should always have base > 0
* Update code to assume base > 0 invariant
* Added blockstore tests for pruning to 0
* Make sure we don't prune below the current base
* Added BlockStore.Size()
* config: added retain_blocks recommendations
* Update v1 blockchain reactor to handle blockstore base
* Added state database pruning
* Propagate errors on missing validator sets
* Comment tweaks
* Improved error message
Co-Authored-By: Anton Kaliaev <anton.kalyaev@gmail.com>
* use ABCI field ResponseCommit.retain_height instead of retain-blocks config option
* remove State.RetainHeight, return value instead
* fix minor issues
* rename pruneHeights() to pruneBlocks()
* noop to fix GitHub borkage
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
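For illustration, a minimal self-contained sketch of the pruning flow described in the list above, in particular `PruneBlocks()` driven by `ResponseCommit.retain_height` (the block store type here is a simplified stand-in, not the actual API):

```go
package main

import (
	"errors"
	"fmt"
)

type blockStore struct {
	base   int64 // lowest retained height, always > 0 for a non-empty store
	height int64 // highest stored height
}

// pruneBlocks removes blocks below retainHeight and returns how many were pruned.
func (bs *blockStore) pruneBlocks(retainHeight int64) (int64, error) {
	if retainHeight <= bs.base {
		return 0, nil // nothing to do; never prune below the current base
	}
	if retainHeight > bs.height {
		return 0, errors.New("cannot prune beyond the latest height")
	}
	pruned := retainHeight - bs.base
	bs.base = retainHeight
	return pruned, nil
}

func main() {
	bs := &blockStore{base: 1, height: 100}
	// Pretend the ABCI app returned ResponseCommit.retain_height = 90.
	pruned, err := bs.pruneBlocks(90)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pruned %d blocks, new base %d\n", pruned, bs.base)
}
```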
* libs/common: refactor libs common 3
- move nil.go into types folder and make private
- move service & baseservice out of common into service pkg
ref #4147
Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>
* add changelog entry
cleanup to add linter
grpc change:
https://godoc.org/google.golang.org/grpc#WithContextDialer
https://godoc.org/google.golang.org/grpc#WithDialer
grpc/grpc-go#2627
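For reference, a small sketch of this dialer migration (the address and dial function are placeholders, not code from the repository): `grpc.WithDialer` is deprecated in favor of `grpc.WithContextDialer`, which receives a context instead of a timeout.

```go
package main

import (
	"context"
	"net"

	"google.golang.org/grpc"
)

func main() {
	dialer := func(ctx context.Context, addr string) (net.Conn, error) {
		var d net.Dialer
		return d.DialContext(ctx, "tcp", addr)
	}

	conn, err := grpc.Dial(
		"127.0.0.1:26658", // placeholder address
		grpc.WithInsecure(),
		grpc.WithContextDialer(dialer), // replaces the deprecated grpc.WithDialer
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```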
prometheus change:
due to UninstrumentedHandler being deprecated in the future
empty branch = an empty if or else statement
I didn't delete them entirely, just commented them out, since I couldn't find a reason to have them.
I could not replicate issue #3406, but if we want to keep the body commented out, then we should comment out the if statement as well.
* goroutines in blockchain reactor
* Added reference to the go routine diagram
* Initial commit
* cleanup
* Undo testing_logger change, committed by mistake
* Fix the test loggers
* pulled some fsm code into pool.go
* added pool tests
* changes to the design
added block requests under peer
moved the request trigger into the reactor poolRoutine, now triggered by a ticker
in general, moved everything required for making block requests smarter into the poolRoutine
added a simple map of heights to keep track of what will need to be requested next
added a few more tests
* send errors to FSM in a different channel than blocks
send errors (RemovePeer) from switch on a different channel than the
one receiving blocks
renamed channels
added more pool tests
* more pool tests
* lint errors
* more tests
* more tests
* switch fast sync to new implementation
* fixed data race in tests
* cleanup
* finished fsm tests
* address golangci comments :)
* address golangci comments :)
* Added timeout on next block needed to advance
* updating docs and cleanup
* fix issue in test from previous cleanup
* cleanup
* Added termination scenarios, tests and more cleanup
* small fixes to adr, comments and cleanup
* Fix bug in sendRequest()
If we tried to send a request to a peer not present in the switch, a missing continue statement caused the request to be blackholed at a peer that had been removed and was never retried (see the sketch after this list).
While this bug was manifesting, the reactor kept asking for other blocks that would be stored and never consumed. Added the number of unconsumed blocks to the math for requesting blocks ahead of the current processing height, so eventually no more blocks will be requested until the already received ones are consumed.
* remove bpPeer's didTimeout field
* Use distinct err codes for peer timeout and FSM timeouts
* Don't allow peers to update with lower height
* review comments from Ethan and Zarko
* some cleanup, renaming, comments
* Move block execution in separate goroutine
* Remove pool's numPending
* review comments
* fix lint, remove old blockchain reactor and duplicates in fsm tests
* small reorg around peer after review comments
* add the reactor spec
* verify block only once
* review comments
* change to int for max number of pending requests
* cleanup and godoc
* Add configuration flag fast sync version
* golangci fixes
* fix config template
* move both reactor versions under blockchain
* cleanup, golint, renaming stuff
* updated documentation, fixed more golint warnings
* integrate with behavior package
* sync with master
* gofmt
* add changelog_pending entry
* move to improvements
* suggestion to changelog entry
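For illustration, a minimal self-contained sketch of the sendRequest() fix noted in the list above (the switch and peer types are simplified stand-ins, not the actual reactor API): if the peer is no longer in the switch, the loop must continue rather than silently dropping the request.

```go
package main

import "fmt"

type peerSet map[string]bool // stands in for the switch's peer set

func sendRequests(peers peerSet, requests map[string]int64) {
	for peerID, height := range requests {
		if !peers[peerID] {
			// The peer was removed from the switch; without this continue the
			// request would be assigned to a dead peer and never retried.
			fmt.Println("peer", peerID, "not in switch, will retry elsewhere")
			continue
		}
		fmt.Println("requesting height", height, "from", peerID)
	}
}

func main() {
	peers := peerSet{"peer-a": true} // peer-b has been removed
	sendRequests(peers, map[string]int64{"peer-a": 10, "peer-b": 11})
}
```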
Fixes #3457
The issue is that writing a BlockRequest to the requestsCh channel also starts a timer that stops the peer 15s later if no block has been received, but popping a BlockRequest from requestsCh and sending it out may be delayed by more than 15s, so the peer ends up being stopped with the error "send nothing to us".
Extracting the requestsCh handling into its own goroutine ensures that every BlockRequest is handled in a timely manner.
Instead of the requestsCh handling, we should probably pull the didProcessCh handling into a separate goroutine, since that is the one "starving" the other channel handlers. I believe that the way it is right now, we still have issues with high delays in errorsCh handling that might cause requests to be sent to invalid/disconnected peers.
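For illustration, a minimal self-contained sketch of this idea (the channel names are stand-ins, not the reactor's actual channels): handling a latency-sensitive channel in the same select loop as the others lets a slow handler starve it, so its handling is pulled into its own goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	requestsCh := make(chan int)
	errorsCh := make(chan error)
	done := make(chan struct{})

	var wg sync.WaitGroup

	// Dedicated goroutine: requests are sent out as soon as they are popped,
	// so the 15s peer timer cannot expire just because another handler is slow.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case req := <-requestsCh:
				fmt.Println("sending BlockRequest for height", req)
			case <-done:
				return
			}
		}
	}()

	// Main poolRoutine-style select keeps handling the remaining channels.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case err := <-errorsCh:
				fmt.Println("peer error:", err)
			case <-done:
				return
			}
		}
	}()

	requestsCh <- 42
	time.Sleep(10 * time.Millisecond)
	close(done)
	wg.Wait()
}
```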