We're waiting between trying witnesses (which shouldn't be necessary
because the witnesses shouldn't depend on each other), and also
between *attempts*, and really the outer sleep should be enough.
This is a little coarse, but the idea is that the peer-up event we
send to reactors will include information about the channels a peer
has, which reactors can then use to reject peers (if needed).
This solves the problem where statesync would hang in test networks
(and presumably real ones) because we would attempt to statesync from
seed nodes, thereby hanging silently forever.
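A minimal sketch of the idea, for illustration only: the peer-up event carries the set of channels the peer has, and a reactor rejects peers that lack the channel it needs. The names `ChannelID`, `PeerUpdate`, and `acceptPeer`, and the channel value used, are hypothetical stand-ins and not the actual p2p types.

```go
package main

import "fmt"

// ChannelID and PeerUpdate are hypothetical stand-ins for the p2p types.
type ChannelID uint16

// PeerUpdate models a peer-up event that also lists the peer's channels.
type PeerUpdate struct {
	NodeID   string
	Channels map[ChannelID]struct{}
}

// acceptPeer reports whether the peer advertises every channel the
// reactor requires. A reactor would ignore peers for which this is false.
func acceptPeer(u PeerUpdate, required ...ChannelID) bool {
	for _, ch := range required {
		if _, ok := u.Channels[ch]; !ok {
			return false
		}
	}
	return true
}

func main() {
	const exampleCh ChannelID = 1 // illustrative value only

	// A seed node that does not serve the channel, and a full node that does.
	seed := PeerUpdate{NodeID: "seed", Channels: map[ChannelID]struct{}{}}
	full := PeerUpdate{NodeID: "full", Channels: map[ChannelID]struct{}{exampleCh: {}}}

	fmt.Println(acceptPeer(seed, exampleCh), acceptPeer(full, exampleCh))
}
```

With a check like this, a statesync-style reactor would never select a seed node as a sync source, so it cannot wait on one indefinitely.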
* Rebased and git-squashed the commits in PR #6546
migrate abci to finalizeBlock
work on abci, proxy and mempool
abciresponse, block events, indexer, some tests
fix some tests
fix errors
fix errors in abci
fix tests and errors
* Fixes after rebasing PR #6546
* Restored height to RequestFinalizeBlock & other
* Fixed more UTs
* Fixed kvstore
* More UT fixes
* last TC fixed
* make format
* Update internal/consensus/mempool_test.go
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* Addressed @williambanfield's comments
* Fixed UTs
* Addressed last comments from @williambanfield
* make format
Co-authored-by: marbar3778 <marbar3778@yahoo.com>
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Our test cases spew a lot of files and directories around $TMPDIR. Make more
thorough use of the testing package's TempDir methods to ensure these are
cleaned up.
In a few cases, this required plumbing test contexts through existing helper
code. In a couple places an explicit path was required, to work around cases
where we do global setup during a TestMain function. Those cases probably
deserve more thorough cleansing (preferably with fire), but for now I have just
worked around it to keep focused on the cleanup.
This pull request merges the changes implementing Proposer-based timestamps into `master`. The work was primarily done in the `wb/proposer-based-timestamps` branch, with changes being merged into that branch during development. This pull request represents an amalgamation of the changes made in that development branch. All of the changes that were placed into that branch have been cleanly rebased on top of the latest `master`. The changes compile and the tests pass insofar as our tests in general pass.
### Note To Reviewers
These changes have been extensively reviewed during development. There is not much new here. In the interest of making effective use of time, I would recommend against trying to perform a complete audit of the changes presented and instead looking for mistakes that may have occurred during the process of rebasing the changes. I gave the complete change set a first pass for any issues, but additional eyes would be much appreciated.
In sum, this change set does the following:
closes #6942
merges in #6849
* light: rpc /status returns status of light client; code refactoring
light: moved lightClientInfo into light.go, renamed String to ID
test/e2e: Return light client trusted height instead of SyncInfo trusted height
test/e2e/start.go: Not waiting for light client to catch up in tests. Removed querying of syncInfo in start if the node is a light node
* light: Removed call to primary /status. Added trustedPeriod to light info
* light/provider: added ID function to return IP of primary and witnesses
* light/provider/http/http_test: renamed String() to ID()
This change has two main effects:
1. Remove most of the Async methods from the abci.Client interface.
Remaining are FlushAsync, CommitTxAsync, and DeliverTxAsync.
2. Rename the synchronous methods to remove the "Sync" suffix.
The rest of the change is updating the implementations, subsets, and mocks of
the interface, along with the call sites that point to them.
* Fix stringly-typed mock stubs.
* Rename helper method.
The custom error types in the provider package did not propagate their wrapped
underlying reasons, making it difficult for the test to check that the correct
error was observed.
- Fix the custom errors to have a true underlying error (not just a string).
- Add Unwrap methods to support inspection by errors.Is.
- Update usage in a few places.
- Fix the test to check for acceptable variation.
Fixes #7609.
* p2p: migrate to use new interface for channel errors
* Update internal/p2p/p2ptest/require.go
Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
* rename
* feedback
Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
This continues the push of plumbing contexts through tendermint. I
attempted to find all goroutines in the production code (non-test) and
made sure that these threads would exit when their contexts were
canceled, and I believe this PR does that.
This is, perhaps, the trivial final piece of #7075 that I've been
working on.
There's more work to be done:
- push more of the setup into the packages themselves
- move channel-based sending/filtering out of the
- simplify the buffering throughout the p2p stack.
This is intended to fix a test failure that occurs in the p2p state provider. The issue presents as the state provider timing out waiting for the consensus params response.
This can occur because the statesync reactor may attempt to respond to the params request before the state provider is ready to read it. This results in the reactor hitting the `default` case seen here and then never sending on the channel. The state provider will then block waiting for a response and never receive one because the reactor opted not to send it.
When statesync is stopped during shutdown, it has the possibility of deadlocking. A dump of goroutines reveals that this is related to the peerUpdates channel not returning anything on its `Done()` channel when `OnStop` is called. As this is occurring, `processPeerUpdate` is attempting to acquire the reactor lock. It appears that this lock can never be acquired. I looked for the places where the lock may remain locked accidentally and cleaned them up in the hope of eradicating the issue. Dumps of the relevant goroutines may be found below. Note that the line numbers below are relative to the code in the `v0.35.0-rc1` tag.
```
goroutine 36 [chan receive]:
github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStop(0xc00058f200)
github.com/tendermint/tendermint/internal/statesync/reactor.go:243 +0x117
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00058f200, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc0001ea240)
github.com/tendermint/tendermint/node/node.go:769 +0x132
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0001ea240, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000629500, 0x7fdb52f96358, 0xc0002b5030, 0xc00000daa0)
github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6
goroutine 188 [semacquire]:
sync.runtime_SemacquireMutex(0xc00026b1cc, 0x0, 0x1)
runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00026b1c8)
sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00026b1c8)
sync/rwmutex.go:111 +0x90
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdate(0xc00026b080, 0xc000650008, 0x28, 0x124de90, 0x4)
github.com/tendermint/tendermint/internal/statesync/reactor.go:849 +0x1a5
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdates(0xc00026b080)
github.com/tendermint/tendermint/internal/statesync/reactor.go:883 +0xab
created by github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/statesync/reactor.go:219 +0xcd
```
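One common way a lock ends up held by accident is an early return that skips the unlock. The sketch below shows the usual guard against that, pairing `Lock` with a deferred `Unlock`. It is a generic illustration with hypothetical names (`registry`, `add`), not the reactor code referenced in the trace above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// registry is a hypothetical structure guarded by a mutex.
type registry struct {
	mtx   sync.RWMutex
	peers map[string]bool
}

// add records a peer. The deferred Unlock runs on every return path,
// including the early error return, so the lock cannot be left held.
func (r *registry) add(id string) error {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	if id == "" {
		return errors.New("empty peer ID")
	}
	r.peers[id] = true
	return nil
}

func main() {
	r := &registry{peers: map[string]bool{}}

	// The failed call releases the lock, so the next call does not block.
	fmt.Println(r.add("") != nil)
	fmt.Println(r.add("peer-1") == nil)
}
```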
This test reliably gets hung up on network configuration (which may
be a real issue), but its network setup is handcranked and we should
ensure that the test focuses on its core assertions and doesn't fail for
test architecture reasons.
I've been noticing that there are a number of situations where the
statesync reactor blocks waiting for peers (or similar). I've moved
things around to improve outcomes in local tests.