diff --git a/docs/architecture/README.md b/docs/architecture/README.md index 84374964e..a3d068d35 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -96,3 +96,4 @@ Note the context/background should be written in the present tense. - [ADR-054: Crypto-Encoding-2](./adr-054-crypto-encoding-2.md) - [ADR-057: RPC](./adr-057-RPC.md) - [ADR-063: Privval-gRPC](./adr-063-privval-grpc.md) +- [ADR-067: Mempool Refactor](./adr-067-mempool-refactor) diff --git a/docs/architecture/adr-067-mempool-refactor.md b/docs/architecture/adr-067-mempool-refactor.md new file mode 100644 index 000000000..317038ccc --- /dev/null +++ b/docs/architecture/adr-067-mempool-refactor.md @@ -0,0 +1,303 @@ +# ADR 067: Mempool Refactor + +- [ADR 067: Mempool Refactor](#adr-067-mempool-refactor) + - [Changelog](#changelog) + - [Status](#status) + - [Context](#context) + - [Current Design](#current-design) + - [Alternative Approaches](#alternative-approaches) + - [Prior Art](#prior-art) + - [Ethereum](#ethereum) + - [Diem](#diem) + - [Decision](#decision) + - [Detailed Design](#detailed-design) + - [CheckTx](#checktx) + - [Mempool](#mempool) + - [Eviction](#eviction) + - [Gossiping](#gossiping) + - [Performance](#performance) + - [Future Improvements](#future-improvements) + - [Consequences](#consequences) + - [Positive](#positive) + - [Negative](#negative) + - [Neutral](#neutral) + - [References](#references) + +## Changelog + +- April 19, 2021: Initial Draft (@alexanderbez) + +## Status + +Proposed + +## Context + +Tendermint Core has a reactor and data structure, mempool, that facilitates the +ephemeral storage of uncommitted transactions. Honest nodes participating in a +Tendermint network gossip these uncommitted transactions to each other if they +pass the application's `CheckTx`. In addition, block proposers select from the +mempool a subset of uncommitted transactions to include in the next block. + +Currently, the mempool in Tendermint Core is designed as a FIFO queue. In other +words, transactions are included in blocks as they are received by a node. There +currently is no explicit and prioritized ordering of these uncommitted transactions. +This presents a few technical and UX challenges for operators and applications. + +Namely, validators are not able to prioritize transactions by their fees or any +incentive aligned mechanism. In addition, the lack of prioritization also leads +to cascading effects in terms of DoS and various attack vectors on networks, +e.g. [cosmos/cosmos-sdk#8224](https://github.com/cosmos/cosmos-sdk/discussions/8224). + +Thus, Tendermint Core needs the ability for an application and its users to +prioritize transactions in a flexible and performant manner. Specifically, we're +aiming to either improve, maintain or add the following properties in the +Tendermint mempool: + +- Allow application-determined transaction priority. +- Allow efficient concurrent reads and writes. +- Allow block proposers to reap transactions efficiently by priority. +- Maintain a fixed mempool capacity by transaction size and evict lower priority + transactions to make room for higher priority transactions. +- Allow transactions to be gossiped by priority efficiently. +- Allow operators to specify a maximum TTL for transactions in the mempool before + they're automatically evicted if not selected for a block proposal in time. +- Ensure the design allows for future extensions, such as replace-by-priority and + allowing multiple pending transactions per sender, to be incorporated easily. + +Note, not all of these properties will be addressed by the proposed changes in +this ADR. However, this proposal will ensure that any unaddressed properties +can be addressed in an easy and extensible manner in the future. + +### Current Design + +![mempool](./img/mempool-v0.jpeg) + +At the core of the `v0` mempool reactor is a concurrent linked-list. This is the +primary data structure that contains `Tx` objects that have passed `CheckTx`. +When a node receives a transaction from another peer, it executes `CheckTx`, which +obtains a read-lock on the `*CListMempool`. If the transaction passes `CheckTx` +locally on the node, it is added to the `*CList` by obtaining a write-lock. It +is also added to the `cache` and `txsMap`, both of which obtain their own respective +write-locks and map a reference from the transaction hash to the `Tx` itself. + +Transactions are continuously gossiped to peers whenever a new transaction is added +to a local node's `*CList`, where the node at the front of the `*CList` is selected. +Another transaction will not be gossiped until the `*CList` notifies the reader +that there are more transactions to gossip. + +When a proposer attempts to propose a block, they will execute `ReapMaxBytesMaxGas` +on the reactor's `*CListMempool`. This call obtains a read-lock on the `*CListMempool` +and selects as many transactions as possible starting from the front of the `*CList` +moving to the back of the list. + +When a block is finally committed, a caller invokes `Update` on the reactor's +`*CListMempool` with all the selected transactions. Note, the caller must also +explicitly obtain a write-lock on the reactor's `*CListMempool`. This call +will remove all the supplied transactions from the `txsMap` and the `*CList`, both +of which obtain their own respective write-locks. In addition, the transaction +may also be removed from the `cache` which obtains it's own write-lock. + +## Alternative Approaches + +When considering which approach to take for a priority-based flexible and +performant mempool, there are two core candidates. The first candidate is less +invasive in the required set of protocol and implementation changes, which +simply extends the existing `CheckTx` ABCI method. The second candidate essentially +involves the introduction of new ABCI method(s) and would require a higher degree +of complexity in protocol and implementation changes, some of which may either +overlap or conflict with the upcoming introduction of [ABCI++](https://github.com/tendermint/spec/blob/master/rfc/004-abci%2B%2B.md). + +For more information on the various approaches and proposals, please see the +[mempool discussion](https://github.com/tendermint/tendermint/discussions/6295). + +## Prior Art + +### Ethereum + +The Ethereum mempool, specifically [Geth](https://github.com/ethereum/go-ethereum), +contains a mempool, `*TxPool`, that contains various mappings indexed by account, +such as a `pending` which contains all processable transactions for accounts +prioritized by nonce. It also contains a `queue` which is the exact same mapping +except it contains not currently processable transactions. The mempool also +contains a `priced` index of type `*txPricedList` that is a priority queue based +on transaction price. + +### Diem + +The [Diem mempool](https://github.com/diem/diem/blob/master/mempool/README.md#implementation-details) +contains a similar approach to the one we propose. Specifically, the Diem mempool +contains a mapping from `Account:[]Tx`. On top of this primary mapping from account +to a list of transactions, are various indexes used to perform certain actions. + +The main index, `PriorityIndex`. is an ordered queue of transactions that are +“consensus-ready” (i.e., they have a sequence number which is sequential to the +current sequence number for the account). This queue is ordered by gas price so +that if a client is willing to pay more (than other clients) per unit of +execution, then they can enter consensus earlier. + +## Decision + +To incorporate a priority-based flexible and performant mempool in Tendermint Core, +we will introduce new fields, `priority` and `sender`, into the `ResponseCheckTx` +type. + +We will introduce a new versioned mempool reactor, `v1` and assume an implicit +version of the current mempool reactor as `v0`. In the new `v1` mempool reactor, +we largely keep the functionality the same as `v0` except we augment the underlying +data structures. Specifically, we keep a mapping of senders to transaction objects. +On top of this mapping, we index transactions to provide the ability to efficiently +gossip and reap transactions by priority. + +## Detailed Design + +### CheckTx + +We introduce the following new fields into the `ResponseCheckTx` type: + +```diff +message ResponseCheckTx { + uint32 code = 1; + bytes data = 2; + string log = 3; // nondeterministic + string info = 4; // nondeterministic + int64 gas_wanted = 5 [json_name = "gas_wanted"]; + int64 gas_used = 6 [json_name = "gas_used"]; + repeated Event events = 7 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "events,omitempty"]; + string codespace = 8; ++ int64 priority = 9; ++ string sender = 10; +} +``` + +It is entirely up the application in determining how these fields are populated +and with what values, e.g. the `sender` could be the signer and fee payer +of the transaction, the `priority` could be the cumulative sum of the fee(s). + +Only `sender` is required, while `priority` can be omitted which would result in +using the default value of zero. + +### Mempool + +The existing concurrent-safe linked-list will be replaced by a thread-safe map +of ``, i.e a mapping from `sender` to a single `*Tx` object, where +each `*Tx` is the next valid and processable transaction from the given `sender`. + +On top of this mapping, we index all transactions by priority using a thread-safe +priority queue, i.e. a [max heap](https://en.wikipedia.org/wiki/Min-max_heap). +When a proposer is ready to select transactions for the next block proposal, +transactions are selected from this priority index by highest priority order. +When a transaction is selected and reaped, it is removed from this index and +from the `` mapping. + +We define `Tx` as the following data structure: + +```go +type Tx struct { + // Tx represents the raw binary transaction data. + Tx []byte + + // Priority defines the transaction's priority as specified by the application + // in the ResponseCheckTx response. + Priority int64 + + // Sender defines the transaction's sender as specified by the application in + // the ResponseCheckTx response. + Sender string + + // Index defines the current index in the priority queue index. Note, if + // multiple Tx indexes are needed, this field will be removed and each Tx + // index will have its own wrapped Tx type. + Index int +} +``` + +### Eviction + +Upon successfully executing `CheckTx` for a new `Tx` and the mempool is currently +full, we must check if there exists a `Tx` of lower priority that can be evicted +to make room for the new `Tx` with higher priority and with sufficient size +capacity left. + +If such a `Tx` exists, we find it by obtaining a read lock and sorting the +priority queue index. Once sorted, we find the first `Tx` with lower priority and +size such that the new `Tx` would fit within the mempool's size limit. We then +remove this `Tx` from the priority queue index as well as the `` +mapping. + +This will require additional `O(n)` space and `O(n*log(n))` runtime complexity. Note that the space complexity does not depend on the size of the tx. + +### Gossiping + +We keep the existing thread-safe linked list as an additional index. Using this +index, we can efficiently gossip transactions in the same manner as they are +gossiped now (FIFO). + +Gossiping transactions will not require locking any other indexes. + +### Performance + +Performance should largely remain unaffected apart from the space overhead of +keeping an additional priority queue index and the case where we need to evict +transactions from the priority queue index. There should be no reads which +block writes on any index + +## Future Improvements + +There are a few considerable ways in which the proposed design can be improved or +expanded upon. Namely, transaction gossiping and for the ability to support +multiple transactions from the same `sender`. + +With regards to transaction gossiping, we need empirically validate whether we +need to gossip by priority. In addition, the current method of gossiping may not +be the most efficient. Specifically, broadcasting all the transactions a node +has in it's mempool to it's peers. Rather, we should explore for the ability to +gossip transactions on a request/response basis similar to Ethereum and other +protocols. Not only does this reduce bandwidth and complexity, but also allows +for us to explore gossiping by priority or other dimensions more efficiently. + +Allowing for multiple transactions from the same `sender` is important and will +most likely be a needed feature in the future development of the mempool, but for +now it suffices to have the preliminary design agreed upon. Having the ability +to support multiple transactions per `sender` will require careful thought with +regards to the interplay of the corresponding ABCI application. Regardless, the +proposed design should allow for adaptations to support this feature in a +non-contentious and backwards compatible manner. + +## Consequences + +### Positive + +- Transactions are allowed to be prioritized by the application. + +### Negative + +- Increased size of the `ResponseCheckTx` Protocol Buffer type. +- Causal ordering is NOT maintained. + - It is possible that certain transactions broadcasted in a particular order may + pass `CheckTx` but not end up being committed in a block because they fail + `CheckTx` later. e.g. Consider Tx1 that sends funds from existing + account Alice to a _new_ account Bob with priority P1 and then later + Bob's _new_ account sends funds back to Alice in Tx2 with P2, + such that P2 > P1. If executed in this order, both + transactions will pass `CheckTx`. However, when a proposer is ready to select + transactions for the next block proposal, they will select Tx2 before + Tx1 and thus Tx2 will _fail_ because Tx1 must + be executed first. This is because there is a _causal ordering_, + Tx1 ➝ Tx2. These types of situations should be rare as + most transactions are not causally ordered and can be circumvented by simply + trying again at a later point in time or by ensuring the "child" priority is + lower than the "parent" priority. In other words, if parents always have + priories that are higher than their children, then the new mempool design will + maintain causal ordering. + +### Neutral + +- A transaction that passed `CheckTx` and entered the mempool can later be evicted + at a future point in time if a higher priority transaction entered while the + mempool was full. + +## References + +- [ABCI++](https://github.com/tendermint/spec/blob/master/rfc/004-abci%2B%2B.md) +- [Mempool Discussion](https://github.com/tendermint/tendermint/discussions/6295) diff --git a/docs/architecture/img/mempool-v0.jpeg b/docs/architecture/img/mempool-v0.jpeg new file mode 100644 index 000000000..dc3da6e77 Binary files /dev/null and b/docs/architecture/img/mempool-v0.jpeg differ