This ADR outlines the plan for an initial state sync prototype, and is subject to change as we gain feedback and experience. It builds on the discussions and findings in ADR-042; see that document for background information.
2020-01-28: Initial draft (Erik Grinaker)
2020-02-18: Updates after initial prototype (Erik Grinaker)
reason fields.
RequestApplySnapshotChunk.chain_hash moved to RequestOfferSnapshot.app_hash.
State sync will allow a new node to receive a snapshot of the application state without downloading blocks or going through consensus. This bootstraps the node significantly faster than the current fast sync system, which replays all historical blocks.
Background discussions and justifications are detailed in ADR-042. Its recommendations can be summarized as:
The application periodically takes full state snapshots (i.e. eager snapshots).
The application splits snapshots into smaller chunks that can be individually verified against a chain app hash.
Tendermint uses the light client to obtain a trusted chain app hash for verification.
Tendermint discovers and downloads snapshot chunks in parallel from multiple peers, and passes them to the application via ABCI to be applied and verified against the chain app hash.
Historical blocks are not backfilled, so state synced nodes will have a truncated block history.
This describes the snapshot/restore process seen from Tendermint. The interface is kept as small and general as possible to give applications maximum flexibility.
A node can have multiple snapshots taken at various heights. Snapshots can be taken in different application-specified formats (e.g. MessagePack as format 1 and Protobuf as format 2, or similarly for schema versioning). Each snapshot consists of multiple chunks containing the actual state data, allowing parallel downloads and reduced memory usage.
message Snapshot {
    uint64 height = 1;   // The height at which the snapshot was taken
    uint32 format = 2;   // The application-specific snapshot format
    uint32 chunks = 3;   // The number of chunks in the snapshot
    bytes metadata = 4;  // Arbitrary application metadata
}
message SnapshotChunk {
    uint64 height = 1;   // The height of the corresponding snapshot
    uint32 format = 2;   // The application-specific snapshot format
    uint32 chunk = 3;    // The chunk index (one-based)
    bytes data = 4;      // Serialized application state in an arbitrary format
    bytes checksum = 5;  // SHA-1 checksum of data
}
Chunk verification data must be encoded along with the state data in the data field.
Chunk data cannot be larger than 64 MB, and snapshot metadata cannot be larger than 64 KB.
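As an illustration of the size limit and checksum described above, the following Python sketch builds a chunk on the sending side and validates it on receipt. The names make_chunk and validate_chunk are hypothetical, and treating 64 MB as 64 * 1024 * 1024 bytes is an assumption, since the ADR does not say whether the limits are binary or decimal.

```python
import hashlib
from dataclasses import dataclass

# Assumed binary interpretation of the 64 MB and 64 KB limits.
MAX_CHUNK_DATA_BYTES = 64 * 1024 * 1024
MAX_METADATA_BYTES = 64 * 1024


@dataclass(frozen=True)
class SnapshotChunk:
    height: int
    format: int
    chunk: int       # one-based chunk index
    data: bytes
    checksum: bytes  # SHA-1 digest of data


def make_chunk(height: int, format: int, chunk: int, data: bytes) -> SnapshotChunk:
    """Build a chunk and compute its SHA-1 checksum (sender side)."""
    if chunk < 1:
        raise ValueError("chunk indexes are one-based")
    if len(data) > MAX_CHUNK_DATA_BYTES:
        raise ValueError("chunk data exceeds the size limit")
    return SnapshotChunk(height, format, chunk, data, hashlib.sha1(data).digest())


def validate_chunk(c: SnapshotChunk) -> bool:
    """Check the size limit and the checksum (receiver side)."""
    if len(c.data) > MAX_CHUNK_DATA_BYTES:
        return False
    return hashlib.sha1(c.data).digest() == c.checksum
```

The checksum only detects corruption in transit. It says nothing about whether the state data is correct, which is why verification against the chain app hash is still needed.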
// Lists available snapshots
message RequestListSnapshots {}
message ResponseListSnapshots {
    repeated Snapshot snapshots = 1;
}
// Offers a snapshot to the application
message RequestOfferSnapshot {
    Snapshot snapshot = 1;
    bytes app_hash = 2;
}
message ResponseOfferSnapshot {
    bool accepted = 1;
    Reason reason = 2;       // Reason why snapshot was rejected
    enum Reason {
        unknown = 0;         // Unknown or generic reason
        invalid_height = 1;  // Height is rejected: avoid this height
        invalid_format = 2;  // Format is rejected: avoid this format
    }
}
// Fetches a snapshot chunk
message RequestGetSnapshotChunk {
    uint64 height = 1;
    uint32 format = 2;
    uint32 chunk = 3;
}
message ResponseGetSnapshotChunk {
    SnapshotChunk chunk = 1;
}
// Applies a snapshot chunk
message RequestApplySnapshotChunk {
    SnapshotChunk chunk = 1;
}
message ResponseApplySnapshotChunk {
    bool applied = 1;
    Reason reason = 2;      // Reason why chunk failed
    enum Reason {
        unknown = 0;        // Unknown or generic reason
        verify_failed = 1;  // Chunk verification failed
    }
}
Tendermint is not aware of the snapshotting process at all; it is entirely an application concern. The application must provide the following guarantees:
Periodic: snapshots must be taken periodically, not on-demand, for faster restores, lower load, and less DoS risk.
Deterministic: snapshots must be deterministic, and identical across all nodes - typically by taking a snapshot at given height intervals.
Consistent: snapshots must be consistent, i.e. not affected by concurrent writes - typically by using a data store that supports versioning and/or snapshot isolation.
Asynchronous: snapshots must be asynchronous, i.e. not halt block processing and state transitions.
Chunked: snapshots must be split into chunks of reasonable size (on the order of megabytes), and each chunk must be verifiable against the chain app hash.
Garbage collected: snapshots must be garbage collected periodically.
Nodes should have options for enabling state sync and/or fast sync, and be provided a trusted header hash for the light client.
When starting an empty node with state sync and fast sync enabled, snapshots are restored as follows:
The node checks that it is empty, i.e. that it has neither state nor blocks.
The node contacts the given seeds to discover peers.
The node contacts a set of full nodes, and verifies the trusted block header using the given hash via the light client.
The node requests available snapshots via RequestListSnapshots. Snapshots with metadata greater than 64 KB are rejected.
The node iterates over all snapshots in reverse order by height and format until it finds one that satisfies all of the following conditions:
The snapshot height's block is considered trustworthy by the light client (i.e. the snapshot height is greater than the trusted header's height and within the unbonding period of the latest trustworthy block).
The snapshot's height or format hasn't been explicitly rejected by an earlier RequestOfferSnapshot call (via invalid_height or invalid_format).
The application accepts the RequestOfferSnapshot call.
The node downloads chunks in parallel from multiple peers via RequestGetSnapshotChunk, and both the sender and receiver verify their checksums. Chunks with data greater than 64 MB are rejected.
The node passes chunks sequentially to the app via RequestApplySnapshotChunk, along with the chain's app hash at the snapshot height for verification. If the chunk is rejected, the node should retry it. If it was rejected with verify_failed, it should be refetched from a different source. If an internal error occurred, ResponseException should be returned and state sync should be aborted.
Once all chunks have been applied, the node compares the app hash to the chain app hash, and if they do not match it either errors or discards the state and starts over.
The node switches to fast sync to catch up blocks that were committed while restoring the snapshot.
The node switches to normal consensus mode.
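The snapshot selection and chunk application steps above can be sketched roughly as follows. This is a simplified, sequential Python illustration, not the prototype's implementation. The callbacks is_trusted_height, offer, fetch, and apply are hypothetical stand-ins for the light client, the ABCI calls, and the P2P layer, and max_attempts is an assumed retry bound that the ADR does not specify.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Set, Tuple

MAX_METADATA_BYTES = 64 * 1024  # assumed binary interpretation of 64 KB


class Snapshot(NamedTuple):
    height: int
    format: int
    chunks: int
    metadata: bytes = b""


def select_snapshot(
    snapshots: Iterable[Snapshot],
    is_trusted_height: Callable[[int], bool],
    offer: Callable[[Snapshot], Tuple[bool, str]],
) -> Optional[Snapshot]:
    """Return the first acceptable snapshot, trying the highest height and format first."""
    bad_heights: Set[int] = set()
    bad_formats: Set[int] = set()
    candidates = [s for s in snapshots if len(s.metadata) <= MAX_METADATA_BYTES]
    for snap in sorted(candidates, key=lambda s: (s.height, s.format), reverse=True):
        if snap.height in bad_heights or snap.format in bad_formats:
            continue  # explicitly rejected by an earlier offer
        if not is_trusted_height(snap.height):
            continue  # light client does not trust this height
        accepted, reason = offer(snap)
        if accepted:
            return snap
        if reason == "invalid_height":
            bad_heights.add(snap.height)
        elif reason == "invalid_format":
            bad_formats.add(snap.format)
    return None


def apply_chunks(
    snapshot: Snapshot,
    peers: List[str],
    fetch: Callable[[str, Snapshot, int], object],
    apply: Callable[[int, object], Tuple[bool, str]],
    max_attempts: int = 3,
) -> None:
    """Apply chunks 1..N in order, refetching from another peer on verify_failed."""
    for index in range(1, snapshot.chunks + 1):
        peer_idx = 0
        data = fetch(peers[peer_idx], snapshot, index)
        for _ in range(max_attempts):
            applied, reason = apply(index, data)
            if applied:
                break
            if reason == "verify_failed":
                # Bad data: try a different source before retrying.
                peer_idx = (peer_idx + 1) % len(peers)
                data = fetch(peers[peer_idx], snapshot, index)
        else:
            raise RuntimeError(f"chunk {index} rejected after {max_attempts} attempts")
```

The sketch fetches each chunk just before applying it to stay short. The process described above downloads chunks in parallel and only applies them sequentially.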
This describes the snapshot process seen from Gaia, using format version 1. The serialization format is unspecified, but likely to be compressed Amino or Protobuf.
In the initial version there is no snapshot metadata, so it is set to an empty byte buffer.
Once all chunks have been successfully built, snapshot metadata should be serialized and stored in the file system as e.g. snapshots/<height>/<format>/metadata, and served via RequestListSnapshots.
The Gaia data structure consists of a set of named IAVL trees. A root hash is constructed by taking the root hashes of each of the IAVL trees, then constructing a Merkle tree of the sorted name/hash map.
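To make the root hash construction concrete, here is a minimal Python sketch that hashes a sorted name/hash map into a single Merkle root. It is illustrative only: the leaf encoding (length-prefixed name followed by the store's root hash), the use of SHA-256, and the pairwise tree shape are all assumptions, and they do not reproduce the exact byte encoding or tree layout used by Tendermint and the Cosmos SDK.

```python
import hashlib
from typing import Dict, List


def merkle_root(leaves: List[bytes]) -> bytes:
    """Root of a binary Merkle tree with domain-separated leaf and inner hashes."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0]


def multistore_root(store_roots: Dict[str, bytes]) -> bytes:
    """Hash the store name -> IAVL root hash map, sorted by store name."""
    leaves = []
    for name in sorted(store_roots):
        encoded = name.encode("utf-8")
        # Hypothetical leaf encoding: 4-byte length prefix, name, root hash.
        leaves.append(len(encoded).to_bytes(4, "big") + encoded + store_roots[name])
    return merkle_root(leaves)
```

Sorting by name makes the result independent of map iteration order, which is what allows every node to derive the same root from the same set of stores.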
IAVL trees are versioned, but a snapshot only contains the version relevant for the snapshot height. All historical versions are ignored.
IAVL trees are insertion-order dependent, so key/value pairs must be set in an appropriate insertion order to produce the same tree branching structure. This insertion order can be found by doing a breadth-first scan of all nodes (including inner nodes) and collecting unique keys in order. However, the node hash also depends on the node's version, so snapshots must contain the inner nodes' version numbers as well.
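The breadth-first scan described above might look like the following Python sketch. The Node class is a hypothetical stand-in for an IAVL node, not the IAVL library's type. To stay short, the function returns only the unique keys in order; a real snapshot would also need to carry the nodes' version numbers, as noted above.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: bytes
    version: int
    value: Optional[bytes] = None  # None for inner nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insertion_order(root: Optional[Node]) -> List[bytes]:
    """Scan all nodes breadth-first and collect each key the first time it is seen."""
    order: List[bytes] = []
    seen = set()
    queue = deque([root] if root is not None else [])
    while queue:
        node = queue.popleft()
        if node.key not in seen:
            seen.add(node.key)
            order.append(node.key)
        for child in (node.left, node.right):
            if child is not None:
                queue.append(child)
    return order
```

Inner nodes are included in the scan because their keys repeat leaf keys further down the tree, so a key's position in the output is determined by the shallowest node that carries it.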
For the initial prototype, each chunk consists of a complete dump of all node data for all nodes in an entire IAVL tree. Thus the number of chunks equals the number of persistent stores in Gaia. No incremental verification of chunks is done, only a final app hash comparison at the end of the snapshot restoration.
For a production version, it should be sufficient to store key/value/version for all nodes (leaf and inner) in insertion order, chunked in some appropriate way. If per-chunk verification is required, the chunk must also contain enough information to reconstruct the Merkle proofs all the way up to the root of the multistore, e.g. by storing a complete subtree's key/value/version data plus Merkle hashes of all other branches up to the multistore root. The exact approach will depend on tradeoffs between size, time, and verification. IAVL RangeProofs are not recommended, since these include redundant data such as proofs for intermediate and leaf nodes that can be derived from the above data.
Chunks should be built greedily by collecting node data up to some size limit (e.g. 32 MB) and serializing it. Chunk data is stored in the file system as snapshots/<height>/<format>/<chunk>/data, along with a SHA-1 checksum in snapshots/<height>/<format>/<chunk>/checksum, and served via RequestGetSnapshotChunk.
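A minimal Python sketch of greedy chunk building and the file layout described above follows. The function names are hypothetical. The sketch assumes that node records are already serialized and self-delimiting, so they can be concatenated, and that the checksum file holds the raw 20-byte digest; the ADR does not specify either detail.

```python
import hashlib
import os
from typing import Iterable, Iterator, List


def build_chunks(records: Iterable[bytes], limit: int) -> Iterator[bytes]:
    """Greedily pack serialized node records into chunks of at most `limit` bytes."""
    buf: List[bytes] = []
    size = 0
    for rec in records:
        if len(rec) > limit:
            raise ValueError("record is larger than the chunk size limit")
        if buf and size + len(rec) > limit:
            yield b"".join(buf)  # current chunk is full, start a new one
            buf, size = [], 0
        buf.append(rec)
        size += len(rec)
    if buf:
        yield b"".join(buf)


def write_chunk(root: str, height: int, format: int, chunk: int, data: bytes) -> str:
    """Store data and its SHA-1 checksum under <root>/<height>/<format>/<chunk>/."""
    path = os.path.join(root, str(height), str(format), str(chunk))
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "data"), "wb") as f:
        f.write(data)
    with open(os.path.join(path, "checksum"), "wb") as f:
        f.write(hashlib.sha1(data).digest())
    return path
```

Because records are consumed in a fixed order and the limit is a constant, every node that snapshots the same state produces the same chunk boundaries, which is needed for chunks to be interchangeable between peers.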
Snapshots should be taken at some configurable height interval, e.g. every 1000 blocks. All nodes should preferably have the same snapshot schedule, such that all nodes can serve chunks for a given snapshot.
Taking consistent snapshots of IAVL trees is greatly simplified by them being versioned: simply snapshot the version that corresponds to the snapshot height, while concurrent writes create new versions. IAVL pruning must not prune a version that is being snapshotted.
Snapshots must also be garbage collected after some configurable time, e.g. by keeping the latest n snapshots.
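The scheduling and garbage collection rules above can be sketched as two small Python helpers. Both names and the keep_recent parameter are hypothetical; the sketch assumes snapshots are identified by height and that the interval is the same on all nodes.

```python
from typing import List


def should_snapshot(height: int, interval: int) -> bool:
    """Take a snapshot whenever the height is a positive multiple of the interval."""
    return interval > 0 and height > 0 and height % interval == 0


def snapshots_to_prune(heights: List[int], keep_recent: int) -> List[int]:
    """Return the snapshot heights to delete, keeping the latest `keep_recent`."""
    ordered = sorted(set(heights))
    if keep_recent <= 0:
        return ordered
    return ordered[:-keep_recent]
```

Deriving the schedule from the height alone, rather than from wall-clock time, keeps it deterministic, so all nodes with the same interval snapshot the same heights.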
An experimental but functional state sync prototype is available in the erik/statesync-prototype branches of the Tendermint, IAVL, Cosmos SDK, and Gaia repositories. To fetch the necessary branches:
$ mkdir statesync
$ cd statesync
$ git clone git@github.com:tendermint/tendermint -b erik/statesync-prototype
$ git clone git@github.com:tendermint/iavl -b erik/statesync-prototype
$ git clone git@github.com:cosmos/cosmos-sdk -b erik/statesync-prototype
$ git clone git@github.com:cosmos/gaia -b erik/statesync-prototype
To spin up three nodes of a four-node testnet:
$ cd gaia
$ ./tools/start.sh
Wait for the first snapshot to be taken at height 3, then (in a separate terminal) start the fourth node with state sync enabled:
$ ./tools/sync.sh
To stop the testnet, run:
$ ./tools/stop.sh
Should we have a simpler scheme for discovering snapshots? E.g. announce supported formats, and have peer supply latest available snapshot.
Downsides: the app has to announce supported formats, and having a single snapshot per peer may make fewer peers available for the chosen snapshot.
Is it OK for state-synced nodes to not have historical blocks nor historical IAVL versions?
Yes, this is as intended. Maybe backfill blocks later.
Do we need incremental chunk verification for first version?
No, we'll start simple. Can add chunk verification via a new snapshot format without any breaking changes in Tendermint. For adversarial conditions, maybe consider support for whitelisting peers to download chunks from.
Should the snapshot ABCI interface be a separate optional ABCI service, or mandatory?
Mandatory, to keep things simple for now. It will therefore be a breaking change and push the release. For apps using the Cosmos SDK, we can provide a default implementation that does not serve snapshots and errors when trying to apply them.
How can we make sure ListSnapshots data is valid? An adversary can provide fake/invalid snapshots to DoS peers.
For now, just pick snapshots that are available on a large number of peers. Maybe support whitelisting. We may consider e.g. placing snapshot manifests on the blockchain later.
Should we punish nodes that provide invalid snapshots? How?
No, these are full nodes not validators, so we can't punish them. Just disconnect from them and ignore them.
Should we call these snapshots? The SDK already uses the term "snapshot" for PruningOptions.SnapshotEvery, and state sync will introduce additional SDK options for snapshot scheduling and pruning that are not related to IAVL snapshotting or pruning.
Yes. Hopefully these concepts are distinct enough that we can refer to state sync snapshots and IAVL snapshots without too much confusion.
Should we store snapshot and chunk metadata in a database? Can we use the database for chunks?
As a first approach, store metadata in a database and chunks in the filesystem.
Should a snapshot at height H be taken before or after the block at H is processed? E.g. RPC /commit returns app_hash after previous height, i.e. before current height.
After commit.
Do we need to support all versions of blockchain reactor (i.e. fast sync)?
We should remove the v1 reactor completely once v2 has stabilized.
Should ListSnapshots be a streaming API instead of a request/response API?
No, just use a max message size.
Tendermint: light client P2P transport #4456
IAVL: export/import API #210
Cosmos SDK: snapshotting, scheduling, and pruning #5689
Tendermint: support starting with a truncated block history
Tendermint: state sync reactor and ABCI interface #828
Cosmos SDK: snapshot ABCI implementation #5690
Tendermint: staged reactor startup (state sync → fast sync → block replay → wal replay → consensus)
Let's do a time-boxed prototype (a few days) and see how much work it will be.
Tendermint: prune blockchain history #3652
Tendermint: allow genesis to start from non-zero height #2543
Tendermint: light client verification for fast sync #4457
Tendermint: allow start with only blockstore #3713
Tendermint: node should go back to fast-syncing when lagging significantly #129
Accepted