Updates #8077. The panic handler for consensus currently attempts to effect a
clean shutdown, but this can leave a failed node running in an unknown state
for an arbitrary amount of time after the failure.
Since a panic at this point means consensus is already irrecoverably broken, we
should not allow the node to continue executing. After making a best effort to
shut down the writeahead log, re-panic to ensure the node will terminate before
any further state transitions are processed.
Even with this change, it is possible some transitions may occur while the
cleanup is happening. It might be preferable to abort unconditionally without
any attempt at cleanup.
Related changes:
- Clean up the creation of WAL directories.
- Filter WAL close errors at rethrow.
After poking around #7828, I saw the oppertunity for this cleanup,
which I think is both reasonable on its own, and quite low impact, and
removes the math around process start time.
Our test cases spew a lot of files and directories around $TMPDIR. Make more
thorough use of the testing package's TempDir methods to ensure these are
cleaned up.
In a few cases, this required plumbing test contexts through existing helper
code. In a couple places an explicit path was required, to work around cases
where we do global setup during a TestMain function. Those cases probably
deserve more thorough cleansing (preferably with fire), but for now I have just
worked around it to keep focused on the cleanup.
During file rotation and WAL shutdown, there was a race condition between users
of an autofile and its termination. To fix this, ensure operations on an
autofile are properly synchronized, and report errors when attempting to use an
autofile after it was closed.
Notably:
- Simplify the cancellation protocol between signal and Close.
- Exclude writers to an autofile during rotation.
- Add documentation about what is going on.
There is a lot more that could be improved here, but this addresses the more
obvious races that have been panicking unit tests.
After writing and then reading a bunch of random messages, the test was
checking that it did not read the same number of messages that it wrote.
The sense of this check was inverted; they should match.
Introduced by accident in #7522. I'm not sure why this did not show up in CI.
Edit: I now know why it didn't show up in ci: #7608.
Noticed in profiles that invoking *VoteSignBytes always created a
bytes.Buffer, then discarded it inside protoio.MarshalDelimited.
I dug further and examined the call paths and noticed that we
unconditionally create the bytes.Buffer, even though we might
have proto messages (in the common case) that implement
MarshalTo([]byte), and invoked varintWriter. Instead by inlining
this case, we skip a bunch of allocations and CPU cycles,
which then reflects properly on all calling functions. Here
are the benchmark results:
```shell
$ benchstat before.txt after.txt
name old time/op new time/op delta
types.VoteSignBytes-8 705ns ± 3% 573ns ± 6% -18.74% (p=0.000 n=18+20)
types.CommitVoteSignBytes-8 8.15µs ± 9% 6.81µs ± 4% -16.51% (p=0.000 n=20+19)
protoio.MarshalDelimitedWithMarshalTo-8 788ns ± 8% 772ns ± 3% -2.01% (p=0.050 n=20+20)
protoio.MarshalDelimitedNoMarshalTo-8 989ns ± 4% 845ns ± 2% -14.51% (p=0.000 n=20+18)
name old alloc/op new alloc/op delta
types.VoteSignBytes-8 792B ± 0% 600B ± 0% -24.24% (p=0.000 n=20+20)
types.CommitVoteSignBytes-8 9.52kB ± 0% 7.60kB ± 0% -20.17% (p=0.000 n=20+20)
protoio.MarshalDelimitedNoMarshalTo-8 808B ± 0% 440B ± 0% -45.54% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
types.VoteSignBytes-8 13.0 ± 0% 10.0 ± 0% -23.08% (p=0.000 n=20+20)
types.CommitVoteSignBytes-8 140 ± 0% 110 ± 0% -21.43% (p=0.000 n=20+20)
protoio.MarshalDelimitedNoMarshalTo-8 10.0 ± 0% 7.0 ± 0% -30.00% (p=0.000 n=20+20)
```
Thanks to Tharsis who tasked me to help them increase TPS and who
are keen on improving Tendermint and efficiency.
Updates #7156, and a follow-up to #7070.
Event subscriptions in Tendermint currently use a fixed-length Go
channel as a queue. When the channel fills up, the publisher
immediately terminates the subscription. This prevents slow
subscribers from creating memory pressure on the node by not
servicing their queue fast enough.
Replace the buffered channel used to deliver events to buffered
subscribers with an explicit queue. The queue provides a soft
quota and burst credit mechanism: Clients that usually keep up
can survive occasional bursts, without allowing truly slow
clients to hog resources indefinitely.
I saw one of these tests fail and it looks like it was using code that
wasn't being called anywhere, so I deleted it, and avoided the package
name aliasing.
Adds a simple property test to the `clist` package. This test uses the [rapid](https://github.com/flyingmutant/rapid) library and works by modeling the internal clist as a simple array.
Follow up from this mornings workshop with the Regen team.
## Description
Internalize some libs. This reduces the amount ot public API tendermint is supporting. The moved libraries are mainly ones that are used within Tendermint-core.