rfc: add performance taxonomy rfc (#6921)

This document attempts to capture and discuss some of the areas of Tendermint that are often cited as causing performance issues. I'm hoping to continue to gather feedback and input on this document to better understand what issues Tendermint performance may cause for our users. The overall goal of this document is to give the maintainers and community a better sense of these issues so that we can discuss them more capably and weigh the trade-offs of any proposed performance-focused changes. This document does not aim to propose any performance improvements. It does suggest useful places for benchmarks and places where additional metrics would be useful for diagnosing and further understanding Tendermint performance. Please comment with areas where my reasoning seems off or with additional areas where Tendermint performance may be causing user pain.
3 years ago · 382947ce93
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@ -40,6 +40,7 @@ sections.
 - [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
 - [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)
 - [RFC-002: Interprocess Communication](./rfc-002-ipc-ecosystem.md)
 - [RFC-003: Performance Taxonomy](./rfc-003-performance-questions.md)
 - [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md)

 <!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
--- a/docs/rfc/rfc-003-performance-questions.md
+++ b/docs/rfc/rfc-003-performance-questions.md
@ -0,0 +1,283 @@
 # RFC 003: Taxonomy of potential performance issues in Tendermint 

 ## Changelog

 - 2021-09-02: Created initial draft (@wbanfield)
 - 2021-09-14: Add discussion of the event system (@wbanfield)

 ## Abstract

 This document discusses the various sources of performance issues in Tendermint and
 attempts to clarify what work may be required to understand and address them.

 ## Background

 Performance, loosely defined as the ability of a software process to perform its work
 quickly and efficiently under load and within reasonable resource limits, is a frequent
 topic of discussion in the Tendermint project.
 To effectively address any issues with Tendermint performance we need to
 categorize the various issues, understand their potential sources, and gauge their
 impact on users.

 Categorizing the different known performance issues will allow us to discuss and fix them
 more systematically. This document proposes a rough taxonomy of performance issues
 and highlights areas where more research into potential performance problems is required.

 Understanding Tendermint's performance limitations will also be critically important
 as we make changes to many of its subsystems. Performance is a central concern for
 upcoming decisions regarding the `p2p` protocol, RPC message encoding and structure,
 database usage and selection, and consensus protocol updates.


 ## Discussion

 This section attempts to delineate the different sections of Tendermint functionality
 that are often cited as having performance issues. It raises questions and suggests
 lines of inquiry that may be valuable for better understanding Tendermint's performance issues.

 As a note: We should avoid quickly adding many microbenchmarks or package-level benchmarks.
 These are prone to being worse than useless, as they can obscure what _should_ be
 focused on: performance of the system from the perspective of a user. We should,
 instead, tune performance with an eye towards user needs and the actions users take. These users comprise
 both operators of Tendermint chains and the people generating transactions for
 Tendermint chains. Both of these sets of users are largely aligned in wanting an end-to-end
 system that operates quickly and efficiently.

 REQUEST: The list below may be incomplete. If there are additional sections that are often
 cited as creating poor performance, please comment so that they may be included.

 ### P2P

 #### Claim: Tendermint cannot scale to large numbers of nodes

 Users have reported that Tendermint networks cannot scale to large numbers of nodes.
 The number of nodes one user reported as causing issues was in the thousands.
 We don't currently have evidence about the upper limit of nodes that Tendermint's
 P2P stack can scale to.

 We need to more concretely understand the source of issues and determine what layer
 is causing a problem. It's possible that the P2P layer, in the absence of any reactors
 sending data, is perfectly capable of managing thousands of peer connections. For
 a reasonable networking and application setup, thousands of connections should not present any
 issue for the application.

 We need more data to understand the problem directly. We want to drive the popularity
 and adoption of Tendermint and this will mean allowing for chains with more validators.
 We should follow up with users experiencing this issue. We may then want to add
 a series of metrics to the P2P layer to better understand the inefficiencies it produces.

 The following metrics can help us understand the sources of latency in the Tendermint P2P stack:

 * Number of messages sent and received per second
 * Time of a message spent on the P2P layer send and receive queues

 The following metrics exist and should be leveraged in addition to those added:

 * Number of peers a node is connected to
 * Number of bytes per channel sent and received from each peer

 ### Sync

 #### Claim: Block Syncing is slow

 Bootstrapping a new node in a network to the height of the rest of the network is believed to
 take longer than users would like. Block sync requires fetching all of the blocks from
 peers and writing them to the local disk for storage. A useful line of inquiry
 is understanding how quickly a perfectly tuned system _could_ fetch all of the state
 over a network so that we understand how much overhead Tendermint actually adds.

 The operation is likely to be _incredibly_ dependent on the environment in which
 the node is being run. The factors that will influence syncing include:
 1. Number of peers that a syncing node may fetch from.
 2. Speed of the disk that a validator is writing to.
 3. Speed of the network connection between the different peers that the node is
 syncing from.

 We should calculate how quickly this operation _could possibly_ complete for common chains and nodes.
 To do so, we should assume that
 a node is reading at the line rate of the NIC and writing at the full drive speed to its
 local storage. Comparing this theoretical best case to the actual sync times
 observed by node operators will give us a good point of comparison for understanding
 how much overhead Tendermint incurs.

 We should additionally add metrics to the blocksync operation to more clearly pinpoint
 slow operations. The following metrics should be added to the block syncing operation:

 * Time to fetch and validate each block
 * Time to execute a block
 * Blocks sync'd per unit time

 ### Application

 Applications performing complex state transitions have the potential to bottleneck
 the Tendermint node.

 #### Claim: ABCI block delivery could cause slowdown

 ABCI delivers blocks in several methods: `BeginBlock`, `DeliverTx`, `EndBlock`, `Commit`.

 Tendermint delivers transactions one-by-one via the `DeliverTx` call. Most of the 
 transaction delivery in Tendermint occurs asynchronously and therefore appears unlikely to
 form a bottleneck in ABCI.

 After delivering all transactions, Tendermint then calls the `Commit` ABCI method.
 Tendermint [locks all access to the mempool][abci-commit-description] while `Commit`
 proceeds. This means that an application that is slow to execute all of its
 transactions or finalize state during the `Commit` method will prevent any new
 transactions from being added to the mempool. Apps that are slow to commit will
 prevent consensus from proceeding to the next consensus height since Tendermint
 cannot validate block proposals or produce block proposals without the
 AppHash obtained from the `Commit` method. We should add a metric for each
 step in the ABCI protocol to track the amount of time that a node spends communicating
 with the application at each step.

 #### Claim: ABCI serialization overhead causes slowdown

 The most common way to run a Tendermint application is using the Cosmos-SDK.
 The Cosmos-SDK runs the ABCI application within the same process as Tendermint.
 When an application is run in the same process as Tendermint, a serialization penalty
 is not paid. This is because the local ABCI client does not serialize method calls
 and instead passes the protobuf type through directly. This can be seen
 in [local_client.go][abci-local-client-code].

 Serialization and deserialization in the gRPC and socket protocol ABCI methods
 may cause slowdown. While these may cause issues, they are not part of the primary
 use case of Tendermint and do not necessarily need to be addressed at this time.

 ### RPC

 #### Claim: The Query API is slow.

 The query API locks a mutex across the ABCI connections. This causes consensus to
 slow during queries, as ABCI is no longer able to make progress. This is known
 to be causing issues in the Cosmos-SDK and is being addressed [in the sdk][sdk-query-fix],
 but a more robust solution may be required. Adding metrics to each ABCI client connection
 and message as described in the Application section of this document would allow us
 to further introspect the issue here. 

 #### Claim: RPC Serialization may cause slowdown

 The Tendermint RPC uses a modified version of JSON-RPC. This RPC powers the `broadcast_tx_*` methods,
 which are currently the critical path for adding transactions to Tendermint. These methods are
 likely invoked quite frequently on popular networks. Being able to perform efficiently
 on this common and critical operation is very important. The current JSON-RPC implementation
 relies heavily on type introspection via reflection, which is known to be very slow in
 Go. We should therefore produce benchmarks of these methods to determine how much overhead
 we are adding to what is likely to be a very common operation.

 The other JSON-RPC methods are much less critical to the core functionality of Tendermint.
 While there may be other points of performance consideration within the RPC, methods that do not
 receive high volumes of requests should not be prioritized for performance consideration.

 NOTE: Previous discussion of the RPC framework was done in [ADR 57][adr-57] and
 there is ongoing work to inspect and alter the JSON-RPC framework in [RFC 002][rfc-002].
 Many of these RPC-related performance considerations can either wait until the RFC 002 work is done or be
 considered concurrently with the in-flight changes to the JSON-RPC.

 ### Protocol

 #### Claim: Gossiping messages is a slow process

 Currently, for any validator to successfully vote in a consensus _step_, it must
 receive votes from greater than 2/3 of the validators on the network. In many cases,
 it's preferable to receive as many votes as possible from correct validators.

 This produces a quadratic increase in messages that are communicated as more validators join the network.
 (Each of the N validators must communicate with all other N-1 validators).
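
 To make the growth concrete, the sketch below counts the messages needed if every
 validator sends its vote directly to every other validator, which is N × (N − 1) per
 step. This is an idealized model that ignores gossip relays and duplicate deliveries;
 the function name and the validator counts are illustrative only.

```go
package main

import "fmt"

// directVoteMessages returns the number of point-to-point messages needed for
// each of n validators to send one vote to every other validator.
func directVoteMessages(n int) int {
	return n * (n - 1)
}

func main() {
	// Illustrative validator-set sizes, not taken from any specific chain.
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("validators=%4d messages per step=%d\n", n, directVoteMessages(n))
	}
}
```

 Growing the validator set tenfold multiplies the message count by roughly one hundred.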

 This large number of messages communicated per step has been identified as impacting
 the performance of the protocol. Given that the number of messages communicated has been
 identified as a bottleneck, it would be extremely valuable to gather data on how long
 it takes for popular chains with many validators to gather all votes within a step.

 Metrics that would improve visibility into this include:

 * Amount of time for a node to gather votes in a step.
 * Amount of time for a node to gather all block parts.
 * Number of votes each node sends to gossip (i.e. not its own votes, but votes it is
 transmitting for a peer).
 * Total number of votes each node sends and receives (A node may receive duplicate votes,
 so understanding how frequently this occurs will be valuable in evaluating the performance
 of the gossip system).

 #### Claim: Hashing Txs causes slowdown in Tendermint

 Using a faster hash algorithm for Tx hashes is currently a point of discussion
 in Tendermint. Namely, it is being considered as part of the [modular hashing proposal][modular-hashing].
 It is currently unknown if hashing transactions in the Mempool forms a significant bottleneck.
 Although it does not appear to be documented as slow, there are a few open GitHub
 issues that indicate a possible user preference for a faster hashing algorithm,
 including [issue 2187][issue-2187] and [issue 2186][issue-2186]. 

 It is likely worth investigating what order of magnitude Tx hashing takes in comparison to other
 aspects of adding a Tx to the mempool. It is not currently clear if the rate of adding Tx
 to the mempool is a source of user pain. We should not endeavor to make large changes to
 consensus critical components without first being certain that the change is highly
 valuable and impactful.

 ### Digital Signatures

 #### Claim: Verification of digital signatures may cause slowdown in Tendermint

 Working with cryptographic signatures can be computationally expensive. The Cosmos
 Hub uses [ed25519 signatures][hub-signature]. The library performing signature
 verification in Tendermint on votes is [benchmarked][ed25519-bench] to be able to verify an `ed25519`
 signature in 75μs on a decently fast CPU. A validator in the Cosmos Hub performs
 3 sets of verifications on the signatures of the 140 validators in the Hub
 in a consensus round: during block verification, when verifying the prevotes, and
 when verifying the precommits. With no batching, this would be roughly `32ms` per
 round (3 × 140 × 75μs). It is quite unlikely, therefore, that this accounts for any serious amount
 of the ~7 seconds of block time per height in the Hub.

 Signature verification may still cause slowdown when syncing, since the process needs to constantly verify
 signatures. It's possible that improved signature aggregation will lead to improved
 light client or other syncing performance. In general, a metric should be added
 to track block rate while blocksyncing.

 #### Claim: Our use of digital signatures in the consensus protocol contributes to performance issues

 Currently, Tendermint's digital signature verification requires that all validators
 receive all vote messages. Each validator must receive the complete digital signature
 along with the vote message that it corresponds to. This means that all N validators
 must receive messages from at least 2/3 of the N validators in each consensus
 round. Given the potential for oddly shaped network topologies and the expected
 variable network roundtrip times of a few hundred milliseconds in a blockchain,
 it is highly likely that this amount of gossiping is leading to a significant amount
 of the slowdown in the Cosmos Hub and in Tendermint consensus.

 ### Tendermint Event System

 #### Claim: The event system is a bottleneck in Tendermint

 The Tendermint Event system is used to communicate and store information about
 internal Tendermint execution. The system uses channels internally to send messages
 to different subscribers. Sending an event [blocks on the internal channel][event-send].
 The default configuration is to [use an unbuffered channel for event publishes][event-buffer-capacity].
 Several consumers of the event system also use an unbuffered channel for reads.
 An example of this is the [event indexer][event-indexer-unbuffered], which takes an
 unbuffered subscription to the event system. The result is that these unbuffered readers
 can cause writes to the event system to block or slow down depending on contention in the
 event system. This has implications for the consensus system, which [publishes events][consensus-event-send].
 To better understand the performance of the event system, we should add metrics to track the timing of
 event sends. The following metrics would be a good start for tracking this performance:

 * Time in event send, labeled by Event Type
 * Time in event receive, labeled by subscriber
 * Event throughput, measured in events per unit time.
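
 The blocking behavior described above can be reproduced with plain Go channels. The
 sketch below is not Tendermint's pubsub code; `tryPublish` is a hypothetical helper that
 uses a non-blocking send to show when an ordinary send would have to wait for a reader.

```go
package main

import "fmt"

// tryPublish attempts a non-blocking send and reports whether the event was
// accepted immediately. A false result means a plain `ch <- ev` would block
// until some reader is ready.
func tryPublish(ch chan string, ev string) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	// No reader is waiting on either channel.
	fmt.Println("unbuffered accepted:", tryPublish(unbuffered, "NewBlock"))
	fmt.Println("buffered accepted:", tryPublish(buffered, "NewBlock"))

	// The single buffer slot is now full, so a second publish would block too.
	fmt.Println("buffered accepted again:", tryPublish(buffered, "Vote"))
}
```

 With an unbuffered channel the publisher always waits for a ready reader, and a small
 buffer only defers that wait until the buffer fills. This is why the time spent in event
 sends is worth measuring directly.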

 ### References
 [modular-hashing]: https://github.com/tendermint/tendermint/pull/6773
 [issue-2186]: https://github.com/tendermint/tendermint/issues/2186
 [issue-2187]: https://github.com/tendermint/tendermint/issues/2187
 [rfc-002]: https://github.com/tendermint/tendermint/pull/6913
 [adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md
 [issue-1319]: https://github.com/tendermint/tendermint/issues/1319
 [abci-commit-description]: https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#commit
 [abci-local-client-code]: https://github.com/tendermint/tendermint/blob/511bd3eb7f037855a793a27ff4c53c12f085b570/abci/client/local_client.go#L84
 [hub-signature]: https://github.com/cosmos/gaia/blob/0ecb6ed8a244d835807f1ced49217d54a9ca2070/docs/resources/genesis.md#consensus-parameters
 [ed25519-bench]: https://github.com/oasisprotocol/curve25519-voi/blob/d2e7fc59fe38c18ca990c84c4186cba2cc45b1f9/PERFORMANCE.md
 [event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/libs/pubsub/pubsub.go#L338
 [event-buffer-capacity]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/types/event_bus.go#L14
 [event-indexer-unbuffered]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/state/indexer/indexer_service.go#L39
 [consensus-event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/internal/consensus/state.go#L1573
 [sdk-query-fix]: https://github.com/cosmos/cosmos-sdk/pull/10045
--- a/internal/consensus/state.go
+++ b/internal/consensus/state.go
@ -137,7 +137,7 @@ type State struct {
 	done chan struct{}

 	// synchronous pubsub between consensus state and reactor.
 	// state only emits EventNewRoundStep and EventVote
 	// state only emits EventNewRoundStep, EventValidBlock, and EventVote
 	evsw tmevents.EventSwitch

 	// for reporting metrics