diff --git a/docs/architecture/adr-042-state-sync.md b/docs/architecture/adr-042-state-sync.md new file mode 100644 index 000000000..d525a4974 --- /dev/null +++ b/docs/architecture/adr-042-state-sync.md @@ -0,0 +1,239 @@ +# ADR 042: State Sync Design + +## Changelog + +2019-06-27: Init by EB +2019-07-04: Follow up by brapse + +## Context +StateSync is a feature which would allow a new node to receive a +snapshot of the application state without downloading blocks or going +through consensus. Once downloaded, the node could switch to FastSync +and eventually participate in consensus. The goal of StateSync is to +facilitate setting up a new node as quickly as possible. + +## Considerations +Because Tendermint doesn't know anything about the application state, +StateSync will broker messages between nodes and through +the ABCI to an opaque applicaton. The implementation will have multiple +touch points on both the tendermint code base and ABCI application. + +* A StateSync reactor to facilitate peer communication - Tendermint +* A Set of ABCI messages to transmit application state to the reactor - Tendermint +* A Set of MultiStore APIs for exposing snapshot data to the ABCI - ABCI application +* A Storage format with validation and performance considerations - ABCI application + +### Implementation Properties +Beyond the approach, any implementation of StateSync can be evaluated +across different criteria: + +* Speed: Expected throughput of producing and consuming snapshots +* Safety: Cost of pushing invalid snapshots to a node +* Liveness: Cost of preventing a node from receiving/constructing a snapshot +* Effort: How much effort does an implementation require + +### Implementation Question +* What is the format of a snapshot + * Complete snapshot + * Ordered IAVL key ranges + * Compressed individually chunks which can be validated +* How is data validated + * Trust a peer with it's data blindly + * Trust a majority of peers + * Use light client validation to validate each chunk against consensus + produced merkle tree root +* What are the performance characteristics + * Random vs sequential reads + * How parallelizeable is the scheduling algorithm + +### Proposals +Broadly speaking there are two approaches to this problem which have had +varying degrees of discussion and progress. These approach can be +summarized as: + +**Lazy:** Where snapshots are produced dynamically at request time. This +solution would use the existing data structure. +**Eager:** Where snapshots are produced periodically and served from disk at +request time. This solution would create an auxiliary data structure +optimized for batch read/writes. + +Additionally the propsosals tend to vary on how they provide safety +properties. + +**LightClient** Where a client can aquire the merkle root from the block +headers synchronized from a trusted validator set. Subsets of the application state, +called chunks can therefore be validated on receipt to ensure each chunk +is part of the merkle root. + +**Majority of Peers** Where manifests of chunks along with checksums are +downloaded and compared against versions provided by a majority of +peers. + +#### Lazy StateSync +An [initial specification](https://docs.google.com/document/d/15MFsQtNA0MGBv7F096FFWRDzQ1vR6_dics5Y49vF8JU/edit?ts=5a0f3629) was published by Alexis Sellier. +In this design, the state has a given `size` of primitive elements (like +keys or nodes), each element is assigned a number from 0 to `size-1`, +and chunks consists of a range of such elements. Ackratos raised +[some concerns](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit) +about this design, somewhat specific to the IAVL tree, and mainly concerning +performance of random reads and of iterating through the tree to determine element numbers +(ie. elements aren't indexed by the element number). + +An alternative design was suggested by Jae Kwon in +[#3639](https://github.com/tendermint/tendermint/issues/3639) where chunking +happens lazily and in a dynamic way: nodes request key ranges from their peers, +and peers respond with some subset of the +requested range and with notes on how to request the rest in parallel from other +peers. Unlike chunk numbers, keys can be verified directly. And if some keys in the +range are ommitted, proofs for the range will fail to verify. +This way a node can start by requesting the entire tree from one peer, +and that peer can respond with say the first few keys, and the ranges to request +from other peers. + +Additionally, per chunk validation tends to come more naturally to the +Lazy approach since it tends to use the existing structure of the tree +(ie. keys or nodes) rather than state-sync specific chunks. Such a +design for tendermint was originally tracked in +[#828](https://github.com/tendermint/tendermint/issues/828). + +#### Eager StateSync +Warp Sync as implemented in Parity +["Warp Sync"](https://wiki.parity.io/Warp-Sync-Snapshot-Format.html) to rapidly +download both blocks and state snapshots from peers. Data is carved into ~4MB +chunks and snappy compressed. Hashes of snappy compressed chunks are stored in a +manifest file which co-ordinates the state-sync. Obtaining a correct manifest +file seems to require an honest majority of peers. This means you may not find +out the state is incorrect until you download the whole thing and compare it +with a verified block header. + +A similar solution was implemented by Binance in +[#3594](https://github.com/tendermint/tendermint/pull/3594) +based on their initial implementation in +[PR #3243](https://github.com/tendermint/tendermint/pull/3243) +and [some learnings](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit). +Note this still requires the honest majority peer assumption. + +As an eager protocol, warp-sync can efficiently compress larger, more +predicatable chunks once per snapshot and service many new peers. By +comparison lazy chunkers would have to compress each chunk at request +time. + +### Analysis of Lazy vs Eager +Lazy vs Eager have more in common than they differ. They all require +reactors on the tendermint side, a set of ABCI messages and a method for +serializing/deserializing snapshots facilitated by a SnapshotFormat. + +The biggest difference between Lazy and Eager proposals is in the +read/write patterns necessitated by serving a snapshot chunk. +Specifically, Lazy State Sync performs random reads to the underlying data +structure while Eager can optimize for sequential reads. + +This distinctin between approaches was demonstrated by Binance's +[ackratos](https://github.com/ackratos) in their implementation of [Lazy +State sync](https://github.com/tendermint/tendermint/pull/3243), The +[analysis](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/) +of the performance, and follow up implementation of [Warp +Sync](http://github.com/tendermint/tendermint/pull/3594). + +#### Compairing Security Models +There are several different security models which have been +discussed/proposed in the past but generally fall into two categories. + +Light client validation: In which the node receiving data is expected to +first perform a light client sync and have all the nessesary block +headers. Within the trusted block header (trusted in terms of from a +validator set subject to [weak +subjectivity](https://github.com/tendermint/tendermint/pull/3795)) and +can compare any subset of keys called a chunk against the merkle root. +The advantage of light client validation is that the block headers are +signed by validators which have something to lose for malicious +behaviour. If a validator were to provide an invalid proof, they can be +slashed. + +Majority of peer validation: A manifest file containing a list of chunks +along with checksums of each chunk is downloaded from a +trusted source. That source can be a community resource similar to +[sum.golang.org](https://sum.golang.org) or downloaded from the majority +of peers. One disadantage of the majority of peer security model is the +vuliberability to eclipse attacks in which a malicious users looks to +saturate a target node's peer list and produce a manufactured picture of +majority. + +A third option would be to include snapshot related data in the +block header. This could include the manifest with related checksums and be +secured through consensus. One challenge of this approach is to +ensure that creating snapshots does not put undo burden on block +propsers by synchronizing snapshot creation and block creation. One +approach to minimizing the burden is for snapshots for height +`H` to be included in block `H+n` where `n` is some `n` block away, +giving the block propser enough time to complete the snapshot +asynchronousy. + +## Proposal: Eager StateSync With Per Chunk Light Client Validation +The conclusion after some concideration of the advantages/disadvances of +eager/lazy and different security models is to produce a state sync +which eagerly produces snapshots and uses light client validation. This +approach has the performance advantages of pre-computing efficient +snapshots which can streamed to new nodes on demand using sequential IO. +Secondly, by using light client validation we cna validate each chunk on +receipt and avoid the potential eclipse attack of majority of peer based +security. + +### Implementation +Tendermint is responsible for downloading and verifying chunks of +AppState from peers. ABCI Application is responsible for taking +AppStateChunk objects from TM and constructing a valid state tree whose +root corresponds with the AppHash of syncing block. In particular we +will need implement: + +* Build new StateSync reactor brokers message transmission between the peers + and the ABCI application +* A set of ABCI Messages +* Design SnapshotFormat as an interface which can: + * validate chunks + * read/write chunks from file + * read/write chunks to/from application state store + * convert manifests into chunkRequest ABCI messages +* Implement SnapshotFormat for cosmos-hub with concrete implementation for: + * read/write chunks in a way which can be: + * parallelized across peers + * validated on receipt + * read/write to/from IAVL+ tree + +![StateSync Architecture Diagram](img/state-sync.png) + +## Implementation Path +* Create StateSync reactor based on [#3753](https://github.com/tendermint/tendermint/pull/3753) +* Design SnapshotFormat with an eye towards cosmos-hub implementation +* ABCI message to send/receive SnapshotFormat +* IAVL+ changes to support SnapshotFormat +* Deliver Warp sync (no chunk validation) +* light client implementation for weak subjectivity +* Deliver StateSync with chunk validation + +## Status + +Proposed + +## Concequences + +### Neutral + +### Positive +* Safe & performant state sync design substantiated with real world implementation experience +* General interfaces allowing application specific innovation +* Parallizable implementation trajectory with reasonable engineering effort + +### Negative +* Static Scheduling lacks opportunity for real time chunk availability optimizations + +## References +[sync: Sync current state without full replay for Applications](https://github.com/tendermint/tendermint/issues/828) - original issue +[tendermint state sync proposal](https://docs.google.com/document/d/15MFsQtNA0MGBv7F096FFWRDzQ1vR6_dics5Y49vF8JU/edit?ts=5a0f3629) - Cloudhead proposal +[tendermint state sync proposal 2](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit) - ackratos proposal +[proposal 2 implementation](https://github.com/tendermint/tendermint/pull/3243) - ackratos implementation +[WIP General/Lazy State-Sync pseudo-spec](https://github.com/tendermint/tendermint/issues/3639) - Jae Proposal +[Warp Sync Implementation](https://github.com/tendermint/tendermint/pull/3594) - ackratos +[Chunk Proposal](https://github.com/tendermint/tendermint/pull/3799) - Bucky proposed + + diff --git a/docs/architecture/img/state-sync.png b/docs/architecture/img/state-sync.png new file mode 100644 index 000000000..08b6eac43 Binary files /dev/null and b/docs/architecture/img/state-sync.png differ