From d65205ecada8f73c2f75030675c1457e865a489f Mon Sep 17 00:00:00 2001
From: Tess Rinearson <tess.rinearson@gmail.com>
Date: Mon, 11 May 2020 07:27:37 +0200
Subject: [PATCH] document state sync ABCI interface and P2P protocol (#93)

* Revert "Revert "document state sync ABCI interface and P2P protocol (#90)" (#92)"

This reverts commit 90797cef90da762db7f7a14ce834c745140f7bcd.

* update with new enum case

* fix links

Co-authored-by: Erik Grinaker <erik@interchain.berlin>
---
 spec/abci/abci.md                   | 131 ++++++++++++++++++++++--
 spec/abci/apps.md                   | 152 +++++++++++++++++++++++++++-
 spec/abci/client-server.md          |   2 +-
 spec/reactors/state_sync/reactor.md |  77 ++++++++++++++
 4 files changed, 353 insertions(+), 9 deletions(-)
 create mode 100644 spec/reactors/state_sync/reactor.md

diff --git a/spec/abci/abci.md b/spec/abci/abci.md
index 5fb1d3902..f53784b4a 100644
--- a/spec/abci/abci.md
+++ b/spec/abci/abci.md
@@ -5,17 +5,22 @@
 The ABCI message types are defined in a [protobuf
 file](https://github.com/tendermint/tendermint/blob/master/abci/types/types.proto).
 
-ABCI methods are split across 3 separate ABCI _connections_:
+ABCI methods are split across four separate ABCI _connections_:
 
-- `Consensus Connection`: `InitChain, BeginBlock, DeliverTx, EndBlock, Commit`
-- `Mempool Connection`: `CheckTx`
-- `Info Connection`: `Info, SetOption, Query`
+- Consensus connection: `InitChain`, `BeginBlock`, `DeliverTx`, `EndBlock`, `Commit`
+- Mempool connection: `CheckTx`
+- Info connection: `Info`, `SetOption`, `Query`
+- Snapshot connection: `ListSnapshots`, `LoadSnapshotChunk`, `OfferSnapshot`, `ApplySnapshotChunk`
 
-The `Consensus Connection` is driven by a consensus protocol and is responsible
+The consensus connection is driven by a consensus protocol and is responsible
 for block execution.
-The `Mempool Connection` is for validating new transactions, before they're
+
+The mempool connection is for validating new transactions, before they're
 shared or included in a block.
-The `Info Connection` is for initialization and for queries from the user.
+
+The info connection is for initialization and for queries from the user.
+
+The snapshot connection is for serving and restoring [state sync snapshots](apps.md#state-sync).
 
 Additionally, there is a `Flush` method that is called on every connection,
 and an `Echo` method that is just for debugging.
@@ -151,6 +156,22 @@ The result is an updated application state.
 Cryptographic commitments to the results of DeliverTx, EndBlock, and
 Commit are included in the header of the next block.
 
+## State Sync
+
+State sync allows new nodes to rapidly bootstrap by discovering, fetching, and applying
+state machine snapshots instead of replaying historical blocks. For more details, see the
+[state sync section](apps.md#state-sync).
+
+When a new node is discovering snapshots in the P2P network, existing nodes will call
+`ListSnapshots` on the application to retrieve any local state snapshots. The new node will
+offer these snapshots to its local application via `OfferSnapshot`.
+
+Once the application accepts a snapshot and begins restoring it, Tendermint will fetch snapshot
+chunks from existing nodes via `LoadSnapshotChunk` and apply them sequentially to the local
+application with `ApplySnapshotChunk`. When all chunks have been applied, the application
+`AppHash` is retrieved via an `Info` query and compared to the blockchain's `AppHash` verified
+via light client.
+
 ## Messages
 
 ### Echo
@@ -388,6 +409,83 @@ Commit are included in the header of the next block.
     other purposes, e.g. auditing, replay of non-persisted heights, light client
     verification, and so on.
 
+### ListSnapshots
+
+- **Response**:
+  - `Snapshots ([]Snapshot)`: List of local state snapshots.
+- **Usage**:
+  - Used during state sync to discover available snapshots on peers.
+  - See `Snapshot` data type for details.
+
+### LoadSnapshotChunk
+
+- **Request**:
+  - `Height (uint64)`: The height of the snapshot the chunks belongs to.
+  - `Format (uint32)`: The application-specific format of the snapshot the chunk belongs to.
+  - `Chunk (uint32)`: The chunk index, starting from `0` for the initial chunk.
+- **Response**:
+  - `Chunk ([]byte)`: The binary chunk contents, in an arbitray format. Chunk messages cannot be
+    larger than 16 MB _including metadata_, so 10 MB is a good starting point.
+- **Usage**:
+  - Used during state sync to retrieve snapshot chunks from peers.
+
+### OfferSnapshot
+
+- **Request**:
+  - `Snapshot (Snapshot)`: The snapshot offered for restoration.
+  - `AppHash ([]byte)`: The light client-verified app hash for this height, from the blockchain.
+- **Response**:
+  - `Result (Result)`: The result of the snapshot offer.
+    - `ACCEPT`: Snapshot is accepted, start applying chunks.
+    - `ABORT`: Abort snapshot restoration, and don't try any other snapshots.
+    - `REJECT`: Reject this specific snapshot, try others.
+    - `REJECT_FORMAT`: Reject all snapshots with this `format`, try others.
+    - `REJECT_SENDERS`: Reject all snapshots from all senders of this snapshot, try others.
+- **Usage**:
+  - `OfferSnapshot` is called when bootstrapping a node using state sync. The application may
+    accept or reject snapshots as appropriate. Upon accepting, Tendermint will retrieve and
+    apply snapshot chunks via `ApplySnapshotChunk`. The application may also choose to reject a
+    snapshot in the chunk response, in which case it should be prepared to accept further
+    `OfferSnapshot` calls.
+  - Only `AppHash` can be trusted, as it has been verified by the light client. Any other data
+    can be spoofed by adversaries, so applications should employ additional verification schemes
+    to avoid denial-of-service attacks. The verified `AppHash` is automatically checked against
+    the restored application at the end of snapshot restoration.
+  - For more information, see the `Snapshot` data type or the [state sync section](apps.md#state-sync).
+
+### ApplySnapshotChunk
+
+- **Request**:
+  - `Index (uint32)`: The chunk index, starting from `0`. Tendermint applies chunks sequentially.
+  - `Chunk ([]byte)`: The binary chunk contents, as returned by `LoadSnapshotChunk`.
+  - `Sender (string)`: The P2P ID of the node who sent this chunk.
+- **Response**:
+  - `Result (Result)`: The result of applying this chunk.
+    - `ACCEPT`: The chunk was accepted.
+    - `ABORT`: Abort snapshot restoration, and don't try any other snapshots.
+    - `RETRY`: Reapply this chunk, combine with `RefetchChunks` and `RejectSenders` as appropriate.
+    - `RETRY_SNAPSHOT`: Restart this snapshot from `OfferSnapshot`, reusing chunks unless 
+      instructed otherwise.
+    - `REJECT_SNAPSHOT`: Reject this snapshot, try a different one.
+  - `RefetchChunks ([]uint32)`: Refetch and reapply the given chunks, regardless of `Result`. Only  
+    the listed chunks will be refetched, and reapplied in sequential order.
+  - `RejectSenders ([]string)`: Reject the given P2P senders, regardless of `Result`. Any chunks 
+    already applied will not be refetched unless explicitly requested, but queued chunks from these senders will be discarded, and new chunks or other snapshots rejected.
+- **Usage**:
+  - The application can choose to refetch chunks and/or ban P2P peers as appropriate. Tendermint
+    will not do this unless instructed by the application.
+  - The application may want to verify each chunk, e.g. by attaching chunk hashes in
+    `Snapshot.Metadata` and/or incrementally verifying contents against `AppHash`.
+  - When all chunks have been accepted, Tendermint will make an ABCI `Info` call to verify that
+    `LastBlockAppHash` and `LastBlockHeight` matches the expected values, and record the
+    `AppVersion` in the node state. It then switches to fast sync or consensus and joins the 
+    network.
+  - If Tendermint is unable to retrieve the next chunk after some time (e.g. because no suitable
+    peers are available), it will reject the snapshot and try a different one via `OfferSnapshot`.
+    The application should be prepared to reset and accept it or abort as appropriate.
+
+### 
+
 ## Data Types
 
 ### Header
@@ -540,3 +638,22 @@ Commit are included in the header of the next block.
   - `Type (string)`: Type of Merkle proof and how it's encoded.
   - `Key ([]byte)`: Key in the Merkle tree that this proof is for.
   - `Data ([]byte)`: Encoded Merkle proof for the key.
+
+### Snapshot
+
+- **Fields**:
+  - `Height (uint64)`: The height at which the snapshot was taken (after commit).
+  - `Format (uint32)`: An application-specific snapshot format, allowing applications to version 
+    their snapshot data format and make backwards-incompatible changes. Tendermint does not 
+    interpret this.
+  - `Chunks (uint32)`: The number of chunks in the snapshot. Must be at least 1 (even if empty).
+  - `Hash (bytes)`: An arbitrary snapshot hash. Must be equal only for identical snapshots across 
+    nodes. Tendermint does not interpret the hash, it only compares them.
+  - `Metadata (bytes)`: Arbitrary application metadata, for example chunk hashes or other 
+    verification data.
+
+- **Usage**:
+  - Used for state sync snapshots, see [separate section](apps.md#state-sync) for details.
+  - A snapshot is considered identical across nodes only if _all_ fields are equal (including 
+    `Metadata`). Chunks may be retrieved from all nodes that have the same snapshot.
+  - When sent across the network, a snapshot message can be at most 4 MB.
diff --git a/spec/abci/apps.md b/spec/abci/apps.md
index 9989deecd..4c44e85e6 100644
--- a/spec/abci/apps.md
+++ b/spec/abci/apps.md
@@ -14,10 +14,11 @@ Here we cover the following components of ABCI applications:
   application state
 - [Crash Recovery](#crash-recovery) - handshake protocol to synchronize
   Tendermint and the application on startup.
+- [State Sync](#state-sync) - rapid bootstrapping of new nodes by restoring state machine snapshots
 
 ## State
 
-Since Tendermint maintains three concurrent ABCI connections, it is typical
+Since Tendermint maintains four concurrent ABCI connections, it is typical
 for an application to maintain a distinct state for each, and for the states to
 be synchronized during `Commit`.
 
@@ -92,6 +93,11 @@ QueryState should be set to the latest `DeliverTxState` at the end of every `Com
 ie. after the full block has been processed and the state committed to disk.
 Otherwise it should never be modified.
 
+### Snapshot Connection
+
+The Snapshot Connection is optional, and is only used to serve state sync snapshots for other nodes
+and/or restore state sync snapshots to a local node being bootstrapped.
+
 ## Transaction Results
 
 `ResponseCheckTx` and `ResponseDeliverTx` contain the same fields.
@@ -464,3 +470,147 @@ If `appBlockHeight == storeBlockHeight`
     update the state using the saved ABCI responses but dont run the block against the real app.
 This happens if we crashed after the app finished Commit but before Tendermint saved the state.
 
+## State Sync
+
+A new node joining the network can simply join consensus at the genesis height and replay all
+historical blocks until it is caught up. However, for large chains this can take a significant
+amount of time, often on the order of weeks or months.
+
+State sync is an alternative mechanism for bootstrapping a new node, where it fetches a snapshot
+of the state machine at a given height and restores it. Depending on the application, this can
+be several orders of magnitude faster than replaying blocks.
+
+Note that state sync does not currently backfill historical blocks, so the node will have a 
+truncated block history - users are advised to consider the broader network implications of this in 
+terms of block availability and auditability. This functionality may be added in the future.
+
+For details on the specific ABCI calls and types, see the [methods and types section](abci.md).
+
+### Taking Snapshots
+
+Applications that want to support state syncing must take state snapshots at regular intervals. How
+this is accomplished is entirely up to the application. A snapshot consists of some metadata and
+a set of binary chunks in an arbitrary format:
+
+* `Height (uint64)`: The height at which the snapshot is taken. It must be taken after the given 
+  height has been committed, and must not contain data from any later heights.
+
+* `Format (uint32)`: An arbitrary snapshot format identifier. This can be used to version snapshot 
+  formats, e.g. to switch from Protobuf to MessagePack for serialization. The application can use 
+  this when restoring to choose whether to accept or reject a snapshot.
+
+* `Chunks (uint32)`: The number of chunks in the snapshot. Each chunk contains arbitrary binary
+  data, and should be less than 16 MB; 10 MB is a good starting point.
+
+* `Hash ([]byte)`: An arbitrary hash of the snapshot. This is used to check whether a snapshot is
+  the same across nodes when downloading chunks.
+
+* `Metadata ([]byte)`: Arbitrary snapshot metadata, e.g. chunk hashes for verification or any other
+  necessary info.
+
+For a snapshot to be considered the same across nodes, all of these fields must be identical. When
+sent across the network, snapshot metadata messages are limited to 4 MB.
+
+When a new node is running state sync and discovering snapshots, Tendermint will query an existing
+application via the ABCI `ListSnapshots` method to discover available snapshots, and load binary
+snapshot chunks via `LoadSnapshotChunk`. The application is free to choose how to implement this
+and which formats to use, but should provide the following guarantees:
+
+* **Consistent:** A snapshot should be taken at a single isolated height, unaffected by
+  concurrent writes. This can e.g. be accomplished by using a data store that supports ACID 
+  transactions with snapshot isolation.
+
+* **Asynchronous:** Taking a snapshot can be time-consuming, so it should not halt chain progress,
+  for example by running in a separate thread.
+
+* **Deterministic:** A snapshot taken at the same height in the same format should be identical
+  (at the byte level) across nodes, including all metadata. This ensures good availability of 
+  chunks, and that they fit together across nodes.
+  
+A very basic approach might be to use a datastore with MVCC transactions (such as RocksDB),
+start a transaction immediately after block commit, and spawn a new thread which is passed the
+transaction handle. This thread can then export all data items, serialize them using e.g.
+Protobuf, hash the byte stream, split it into chunks, and store the chunks in the file system
+along with some metadata - all while the blockchain is applying new blocks in parallel.
+
+A more advanced approach might include incremental verification of individual chunks against the
+chain app hash, parallel or batched exports, compression, and so on.
+
+Old snapshots should be removed after some time - generally only the last two snapshots are needed
+(to prevent the last one from being removed while a node is restoring it).
+
+### Bootstrapping a Node
+
+An empty node can be state synced by setting the configuration option `statesync.enabled =
+true`. The node also needs the chain genesis file for basic chain info, and configuration for
+light client verification of the restored snapshot: a set of Tendermint RPC servers, and a
+trusted header hash and corresponding height from a trusted source, via the `statesync`
+configuration section.
+
+Once started, the node will connect to the P2P network and begin discovering snapshots. These
+will be offered to the local application, and once a snapshot is accepted Tendermint will fetch
+and apply the snapshot chunks. After all chunks have been successfully applied, Tendermint verifies
+the app's `AppHash` against the chain using the light client, then switches the node to normal
+consensus operation.
+
+#### Snapshot Discovery
+
+When the empty node join the P2P network, it asks all peers to report snapshots via the
+`ListSnapshots` ABCI call (limited to 10 per node). After some time, the node picks the most
+suitable snapshot (generally prioritized by height, format, and number of peers), and offers it
+to the application via `OfferSnapshot`. The application can choose a number of responses,
+including accepting or rejecting it, rejecting the offered format, rejecting the peer who sent
+it, and so on. Tendermint will keep discovering and offering snapshots until one is accepted or
+the application aborts.
+
+#### Snapshot Restoration
+
+Once a snapshot has been accepted via `OfferSnapshot`, Tendermint begins downloading chunks from
+any peers that have the same snapshot (i.e. that have identical metadata fields). Chunks are 
+spooled in a temporary directory, and then given to the application in sequential order via
+`ApplySnapshotChunk` until all chunks have been accepted.
+
+As with taking snapshots, the method for restoring them is entirely up to the application, but will
+generally be the inverse of how they are taken.
+
+During restoration, the application can respond to `ApplySnapshotChunk` with instructions for how
+to continue. This will typically be to accept the chunk and await the next one, but it can also
+ask for chunks to be refetched (either the current one or any number of previous ones), P2P peers
+to be banned, snapshots to be rejected or retried, and a number of other responses - see the ABCI
+reference for details.
+
+If Tendermint fails to fetch a chunk after some time, it will reject the snapshot and try a
+different one via `OfferSnapshot` - the application can choose whether it wants to support
+restarting restoration, or simply abort with an error.
+
+#### Snapshot Verification
+
+Once all chunks have been accepted, Tendermint issues an `Info` ABCI call to retrieve the
+`LastBlockAppHash`. This is compared with the trusted app hash from the chain, retrieved and 
+verified using the light client. Tendermint also checks that `LastBlockHeight` corresponds to the
+height of the snapshot.
+
+This verification ensures that an application is valid before joining the network. However, the
+snapshot restoration may take a long time to complete, so applications may want to employ additional
+verification during the restore to detect failures early. This might e.g. include incremental
+verification of each chunk against the app hash (using bundled Merkle proofs), checksums to
+protect against data corruption by the disk or network, and so on. However, it is important to
+note that the only trusted information available is the app hash, and all other snapshot metadata
+can be spoofed by adversaries.
+
+Apps may also want to consider state sync denial-of-service vectors, where adversaries provide
+invalid or harmful snapshots to prevent nodes from joining the network. The application can
+counteract this by asking Tendermint to ban peers. As a last resort, node operators can use
+P2P configuration options to whitelist a set of trusted peers that can provide valid snapshots.
+
+#### Transition to Consensus
+
+Once the snapshot has been restored, Tendermint gathers additional information necessary for
+bootstrapping the node (e.g. chain ID, consensus parameters, validator sets, and block headers) 
+from the genesis file and light client RPC servers. It also fetches and records the `AppVersion` 
+from the ABCI application.
+
+Once the node is bootstrapped with this information and the restored state machine, it transitions
+to fast sync (if enabled) to fetch any remaining blocks up the the chain head, and then
+transitions to regular consensus operation. At this point the node operates like any other node,
+apart from having a truncated block history at the height of the restored snapshot.
diff --git a/spec/abci/client-server.md b/spec/abci/client-server.md
index 94485f0d9..57924f7c2 100644
--- a/spec/abci/client-server.md
+++ b/spec/abci/client-server.md
@@ -85,7 +85,7 @@ it is the standard way to encode integers in Protobuf. It is also generally shor
 As noted above, this prefixing does not apply for GRPC.
 
 An ABCI server must also be able to support multiple connections, as
-Tendermint uses three connections.
+Tendermint uses four connections.
 
 ### Async vs Sync
 
diff --git a/spec/reactors/state_sync/reactor.md b/spec/reactors/state_sync/reactor.md
new file mode 100644
index 000000000..48beabc09
--- /dev/null
+++ b/spec/reactors/state_sync/reactor.md
@@ -0,0 +1,77 @@
+# State Sync Reactor
+
+State sync allows new nodes to rapidly bootstrap and join the network by discovering, fetching,
+and restoring state machine snapshots. For more information, see the [state sync ABCI section](../../abci/apps.md#state-sync).
+
+The state sync reactor has two main responsibilites:
+
+* Serving state machine snapshots taken by the local ABCI application to new nodes joining the 
+  network.
+
+* Discovering existing snapshots and fetching snapshot chunks for an empty local application
+  being bootstrapped.
+
+The state sync process for bootstrapping a new node is described in detail in the section linked
+above. While technically part of the reactor (see `statesync/syncer.go` and related components), 
+this document will only cover the P2P reactor component.
+
+For details on the ABCI methods and data types, see the [ABCI documentation](../../abci/abci.md).
+
+## State Sync P2P Protocol
+
+When a new node begin state syncing, it will ask all peers it encounters if it has any
+available snapshots:
+
+```go
+type snapshotsRequestMessage struct{}
+```
+
+The receiver will query the local ABCI application via `ListSnapshots`, and send a message 
+containing snapshot metadata (limited to 4 MB) for each of the 10 most recent snapshots:
+
+```go
+type snapshotsResponseMessage struct {
+	Height   uint64
+	Format   uint32
+	Chunks   uint32
+	Hash     []byte
+	Metadata []byte
+}
+```
+
+The node running state sync will offer these snapshots to the local ABCI application via
+`OfferSnapshot` ABCI calls, and keep track of which peers contain which snapshots. Once a snapshot
+is accepted, the state syncer will request snapshot chunks from appropriate peers:
+
+```go
+type chunkRequestMessage struct {
+	Height uint64
+	Format uint32
+	Index  uint32
+}
+```
+
+The receiver will load the requested chunk from its local application via `LoadSnapshotChunk`,
+and respond with it (limited to 16 MB):
+
+```go
+type chunkResponseMessage struct {
+	Height  uint64
+	Format  uint32
+	Index   uint32
+	Chunk   []byte
+	Missing bool
+}
+```
+
+Here, `Missing` is used to signify that the chunk was not found on the peer, since an empty
+chunk is a valid (although unlikely) response. 
+
+The returned chunk is given to the ABCI application via `ApplySnapshotChunk` until the snapshot
+is restored. If a chunk response is not returned within some time, it will be re-requested,
+possibly from a different peer.
+
+The ABCI application is able to request peer bans and chunk refetching as part of the ABCI protocol.
+
+If no state sync is in progress (i.e. during normal operation), any unsolicited response messages 
+are discarded.
\ No newline at end of file