This tweaks the connectivity of test configurations, in hopes that more will be viable.
Additionally reduces the prevalence of testing the legacy mempool.
My earlier p2p cleanup code removed support for the p2p tests from the
e2e generator and runner, but missed removing the CI
configuration. This patch remedies that.
While discussing a question about the indexing interface (#7044), we found some
confusion about the intent of the design decisions in ADR 065.
Based on discussion with the original authors of the ADR, this commit adds some
language to the Decisions section to spell out the intentions more clearly, and
to call out future work that this ADR did not explicitly decide about.
Addresses one of the concerns with #7041.
Provides a mechanism (via the RPC interface) to delete a single transaction, described by its hash, from the mempool. The method returns an error if the transaction cannot be found. Once the transaction is removed it remains in the cache and cannot be resubmitted until the cache is cleared or it expires from the cache.
This code hasn't been battle tested, and seems to have grown
increasingly flaky int tests. Given our general direction of reducing
queue complexity over the next couple of releases I think it makes
sense to remove it.
This PR tackles the case of using the e2e application in a long lived testnet. The application continually saves snapshots (usually every 100 blocks) which after a while bloats the size of the application. This PR prunes older snapshots so that only the most recent 10 snapshots remain.
A few notes:
- this is not all the deletion that we can do, but this is the most
"simple" case: it leaves in shims, and there's some trivial
additional cleanup to the transport that can happen but that
requires writing more code, and I wanted this to be easy to review
above all else.
- This should land *after* we cut the branch for 0.35, but I'm
anticipating that to happen soon, and I wanted to run this through
CI.
The race occurred as a result of a goroutine launched by `processPeerUpdate` racing with the `OnStop` method. The `processPeerUpdates` goroutine deletes from the map as `OnStop` is reading from it. This change updates the `OnStop` method to wait for the peer updates channel to be done before closing the peers. It also copies the map contents to a new map so that it will not conflict with the view of the map that the goroutine created in `processPeerUpdate` sees.
This commit should be one of the first to land as part of the v0.36
cycle *after* cutting the 0.35 branch.
The blocksync/v2 reactor was originally implemented as an experiement
to produce an implementation of the blockstack protocol that would be
easier to test and validate, but it was never appropriately
operationalized and this implementation was never fully debugged. When
the p2p layer was refactored as part of the 0.35 cycle, the v2
implementation was not refactored and it was left in the codebase but
not removed. This commit just removes all references to it.
This script is referenced from the release documentation, we should make sure it's functional. This is helpful in generating the "Special Thanks" section of the changelog.
This is intended to fix a test failure that occurs in the p2p state provider. The issue presents as the state provider timing out waiting for the consensus params response.
The reason that this can occur is because the statesync reactor has the possibility of attempting to respond to the params request before the state provider is ready to read it. This results in the reactor hitting the `default` case seen here and then never sending on the channel. The stateprovider will then block waiting for a response and never receive one because the reactor opted not to send it.
When statesync is stopped during shutdown, it has the possibility of deadlocking. A dump of goroutines reveals that this is related to the peerUpdates channel not returning anything on its `Done()` channel when `OnStop` is called. As this is occuring, `processPeerUpdate` is attempting to acquire the reactor lock. It appears that this lock can never be acquired. I looked for the places where the lock may remain locked accidentally and cleaned them up in hopes to eradicate the issue. Dumps of the relevant goroutines may be found below. Note that the line numbers below are relative to the code in the `v0.35.0-rc1` tag.
```
goroutine 36 [chan receive]:
github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStop(0xc00058f200)
github.com/tendermint/tendermint/internal/statesync/reactor.go:243 +0x117
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00058f200, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc0001ea240)
github.com/tendermint/tendermint/node/node.go:769 +0x132
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0001ea240, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000629500, 0x7fdb52f96358, 0xc0002b5030, 0xc00000daa0)
github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6
goroutine 188 [semacquire]:
sync.runtime_SemacquireMutex(0xc00026b1cc, 0x0, 0x1)
runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00026b1c8)
sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00026b1c8)
sync/rwmutex.go:111 +0x90
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdate(0xc00026b080, 0xc000650008, 0x28, 0x124de90, 0x4)
github.com/tendermint/tendermint/internal/statesync/reactor.go:849 +0x1a5
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdates(0xc00026b080)
github.com/tendermint/tendermint/internal/statesync/reactor.go:883 +0xab
created by github.com/tendermint/tendermint/internal/statesync.(*Reactor.OnStart
github.com/tendermint/tendermint/internal/statesync/reactor.go:219 +0xcd)
```
When shutting down blocksync, it is observed that the process can hang completely. A dump of running goroutines reveals that this is due to goroutines not listening on the correct shutdown signal. Namely, the `poolRoutine` goroutine does not wait on `pool.Quit`. The `poolRoutine` does not receive any other shutdown signal during `OnStop` becuase it must stop before the `r.closeCh` is closed. Currently the `poolRoutine` listens in the `closeCh` which will not close until the `poolRoutine` stops and calls `poolWG.Done()`.
This change also puts the `requestRoutine()` in the `OnStart` method to make it more visible since it does not rely on anything that is spawned in the `poolRoutine`.
```
goroutine 183 [semacquire]:
sync.runtime_Semacquire(0xc0000d3bd8)
runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc0000d3bd0)
sync/waitgroup.go:130 +0x65
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStop(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:193 +0x47
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0000d3a00, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc00052c000)
github.com/tendermint/tendermint/node/node.go:758 +0xc62
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00052c000, 0x0, 0x0)
github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000df6d20, 0x7f04a68da900, 0xc0004a8930, 0xc0005a72d8)
github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6
goroutine 161 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).poolRoutine(0xc0000d3a00, 0x0)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:464 +0x2b3
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:174 +0xf1
goroutine 162 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processBlockSyncCh(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:310 +0x151
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:177 +0x54
goroutine 163 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processPeerUpdates(0xc0000d3a00)
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:363 +0x12b
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:178 +0x76
```
This test reliably gets hung up on network configuration, (which may
be a real issue,) but it's network setup is handcranked and we should
ensure that the test focuses on it's core assertions and doesn't fail for
test architecture reasons.
Fix the order of lines in docs/versions so that v0.34 is last (the current release).
Related changes:
- Update docs/DOCS_README.md to reflect the current state of how we publish the site.
- Fix the build-docs target in Makefile to not perturb the package-lock.json during the build.
- Fix the Makefile rule to not clobber package-lock.json.
I observed a couple of problems with the generator in some recent tests:
- there were a couple of hybrid test cases which did not have any
legacy nodes (randomness and all.) I change the probability to
produce more reliable results.
- added options to the generation to be able to add a max (to
compliment the earlier min) number of nodes for local testing.
- added an option to support reversing the sort order so "more
complex" networks were first, as well as tweaked some of the point
values.
- this refactored the generators cli parsing to be a bit more clear.
The main effect of this change is to flush the socket client and server message
encoding buffers immediately once the message is fully and correctly encoded.
This allows us to remove the timer and some other special cases, without
changing the observed behaviour of the system.
-- Background
The socket protocol client and server each use a buffered writer to encode
request and response messages onto the underlying connection. This reduces the
possibility of a single message being split across multiple writes, but has the
side-effect that a request may remain buffered for some time.
The implementation worked around this by keeping a ticker that occasionally
triggers a flush, and by flushing the writer in response to an explicit request
baked into the client/server protocol (see also #6994).
These workarounds are both unnecessary: Once a message has been dequeued for
sending and fully encoded in wire format, there is no real use keeping all or
part of it buffered locally. Moreover, using an asynchronous process to flush
the buffer makes the round-trip performance of the request unpredictable.
-- Benchmarks
Code: https://play.golang.org/p/0ChUOxJOiHt
I found no pre-existing performance benchmarks to justify the flush pattern,
but a natural question is whether this will significantly harm client/server
performance. To test this, I implemented a simple benchmark that transfers
randomly-sized byte buffers from a no-op "client" to a no-op "server" over a
Unix-domain socket, using a buffered writer, both with and without explicit
flushes after each write.
As the following data show, flushing every time (FLUSH=true) does reduce raw
throughput, but not by a significant amount except for very small request
sizes, where the transfer time is already trivial (1.9μs). Given that the
client is calibrated for 1MiB transactions, the overhead is not meaningful.
The percentage in each section is the speedup for flushing only when the buffer
is full, relative to flushing every block. The benchmark uses the default
buffer size (4096 bytes), which is the same value used by the socket client and
server implementation:
FLUSH NBLOCKS MAX AVG TOTAL ELAPSED TIME/BLOCK
false 3957471 512 255 1011165416 2.00018873s 505ns
true 1068568 512 255 273064368 2.000217051s 1.871µs
(73%)
false 536096 4096 2048 1098066401 2.000229108s 3.731µs
true 477911 4096 2047 978746731 2.000177825s 4.185µs
(10.8%)
false 124595 16384 8181 1019340160 2.000235086s 16.053µs
true 120995 16384 8179 989703064 2.000329349s 16.532µs
(2.9%)
false 2114 1048576 525693 1111316541 2.000479928s 946.3µs
true 2083 1048576 526379 1096449173 2.001817137s 961.025µs
(1.5%)
Note also that the FLUSH=false baseline is actually faster than the production
code, which flushes more often than is required by the buffer filling up.
Moreover, the timer slows down the overall transaction rate of the client and
server, indepenedent of how fast the socket transfer is, so the loss on a real
workload is probably much less.
I've been noticing that there are a number of situations where the
statesync reactor blocks waiting for peers (or similar,) I've moved
things around to improve outcomes in local tests.
In the last run, there were two problems at the RPC layer returned
from light nodes' RPC end points. I think exercising the light client
proxy RPC system is something that can/should be done via unit
testing, and that likely these errors are (in production) transient
and (in CI) very likely to fail for test environment issues.