# ADR 042: State Sync Design

## Changelog

2019-06-27: Init by EB
2019-07-04: Follow up by brapse

## Context

StateSync is a feature which would allow a new node to receive a
snapshot of the application state without downloading blocks or going
through consensus. Once downloaded, the node could switch to FastSync
and eventually participate in consensus. The goal of StateSync is to
facilitate setting up a new node as quickly as possible.

## Considerations

Because Tendermint doesn't know anything about the application state,
StateSync will broker messages between nodes and through
the ABCI to an opaque application. The implementation will have multiple
touch points on both the Tendermint code base and the ABCI application.

* A StateSync reactor to facilitate peer communication - Tendermint
* A set of ABCI messages to transmit application state to the reactor - Tendermint
* A set of MultiStore APIs for exposing snapshot data to the ABCI - ABCI application
* A storage format with validation and performance considerations - ABCI application

### Implementation Properties

Beyond the approach, any implementation of StateSync can be evaluated
across different criteria:

* Speed: Expected throughput of producing and consuming snapshots
* Safety: Cost of pushing invalid snapshots to a node
* Liveness: Cost of preventing a node from receiving/constructing a snapshot
* Effort: How much effort does an implementation require

### Implementation Question

* What is the format of a snapshot
    * Complete snapshot
    * Ordered IAVL key ranges
    * Individually compressed chunks which can be validated
* How is data validated
    * Trust a peer with its data blindly
    * Trust a majority of peers
    * Use light client validation to validate each chunk against the
      consensus-produced merkle tree root
* What are the performance characteristics
    * Random vs sequential reads
    * How parallelizable is the scheduling algorithm

### Proposals

Broadly speaking there are two approaches to this problem which have had
varying degrees of discussion and progress. These approaches can be
summarized as:

**Lazy:** Where snapshots are produced dynamically at request time. This
solution would use the existing data structure.

**Eager:** Where snapshots are produced periodically and served from disk at
request time. This solution would create an auxiliary data structure
optimized for batch read/writes.

Additionally, the proposals tend to vary on how they provide safety
properties.

**LightClient** Where a client can acquire the merkle root from the block
headers synchronized from a trusted validator set. Subsets of the application state,
called chunks, can therefore be validated on receipt to ensure each chunk
is part of the merkle root.

**Majority of Peers** Where manifests of chunks along with checksums are
downloaded and compared against versions provided by a majority of
peers.

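
The majority-of-peers check can be sketched as follows. This is a minimal illustration, not part of any proposal: the function name, the peer IDs, and the representation of a manifest as a hex-encoded hash string are all assumptions made for the example.

```go
package main

import "fmt"

// MajorityManifest returns the manifest hash advertised by a strict majority
// of peers, if there is one. reports maps a (hypothetical) peer ID to the
// hex-encoded manifest hash that peer advertised.
func MajorityManifest(reports map[string]string) (string, bool) {
	counts := make(map[string]int)
	for _, hash := range reports {
		counts[hash]++
	}
	// At most one hash can exceed half of the reports, so map iteration
	// order does not affect the result.
	for hash, n := range counts {
		if 2*n > len(reports) {
			return hash, true
		}
	}
	return "", false
}

func main() {
	reports := map[string]string{
		"peerA": "aa11",
		"peerB": "aa11",
		"peerC": "bb22",
	}
	hash, ok := MajorityManifest(reports)
	fmt.Println(hash, ok)
}
```

Note that the result is only as trustworthy as the set of peers that was sampled: if every entry in `reports` comes from peers controlled by one party, the function will happily return a forged manifest.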
#### Lazy StateSync

An [initial specification](https://docs.google.com/document/d/15MFsQtNA0MGBv7F096FFWRDzQ1vR6_dics5Y49vF8JU/edit?ts=5a0f3629) was published by Alexis Sellier.
In this design, the state has a given `size` of primitive elements (like
keys or nodes), each element is assigned a number from 0 to `size-1`,
and chunks consist of a range of such elements. Ackratos raised
[some concerns](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit)
about this design, somewhat specific to the IAVL tree, and mainly concerning
the performance of random reads and of iterating through the tree to determine element numbers
(i.e. elements aren't indexed by element number).

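
To make the numbering scheme concrete, here is a small sketch of how a chunk index could map to a range of element numbers. The fixed `chunkSize` and the names `ElementRange` and `ChunkRange` are assumptions for illustration; the specification linked above may define chunk boundaries differently.

```go
package main

import "fmt"

// ElementRange is a half-open range [Start, End) of element numbers.
type ElementRange struct {
	Start, End uint64
}

// ChunkRange returns the elements covered by chunk i when a state of `size`
// elements is cut into chunks of at most chunkSize elements. The boolean is
// false when i is out of bounds. Overflow of i*chunkSize is not handled.
func ChunkRange(size, chunkSize, i uint64) (ElementRange, bool) {
	if chunkSize == 0 {
		return ElementRange{}, false
	}
	start := i * chunkSize
	if start >= size {
		return ElementRange{}, false
	}
	end := start + chunkSize
	if end > size {
		end = size // the last chunk may be shorter
	}
	return ElementRange{Start: start, End: end}, true
}

func main() {
	r, ok := ChunkRange(10, 4, 2)
	fmt.Println(r.Start, r.End, ok)
}
```

The arithmetic is cheap; the concern raised above is about what comes next, namely locating element number `Start` in a tree that is not indexed by element number.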

An alternative design was suggested by Jae Kwon in
[#3639](https://github.com/tendermint/tendermint/issues/3639) where chunking
happens lazily and in a dynamic way: nodes request key ranges from their peers,
and peers respond with some subset of the
requested range and with notes on how to request the rest in parallel from other
peers. Unlike chunk numbers, keys can be verified directly. And if some keys in the
range are omitted, proofs for the range will fail to verify.
This way a node can start by requesting the entire tree from one peer,
and that peer can respond with, say, the first few keys and the ranges to request
from other peers.

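
The request/response shape described above might look roughly like the sketch below. Everything here is hypothetical: the `KeyRange` and `RangeResponse` types, the `Respond` function, and the policy of splitting the remainder into equal parts are illustrative assumptions rather than the design in #3639, and range proofs are left out entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// KeyRange is a half-open key range [Start, End). An empty End means
// "through the last key".
type KeyRange struct {
	Start, End string
}

// RangeResponse is a hypothetical reply to a key-range request.
type RangeResponse struct {
	Keys []string   // keys this peer serves directly
	Rest []KeyRange // ranges the requester can fetch in parallel elsewhere
}

// Respond serves at most maxKeys keys from the start of req and splits the
// remainder of the range into at most `parts` sub-ranges.
// sorted must hold the peer's keys in ascending order.
func Respond(sorted []string, req KeyRange, maxKeys, parts int) RangeResponse {
	lo := sort.SearchStrings(sorted, req.Start)
	hi := len(sorted)
	if req.End != "" {
		hi = sort.SearchStrings(sorted, req.End)
	}
	if hi < lo {
		hi = lo
	}
	inRange := sorted[lo:hi]
	if maxKeys < 0 {
		maxKeys = 0
	}
	if len(inRange) <= maxKeys {
		return RangeResponse{Keys: inRange}
	}
	if parts < 1 {
		parts = 1
	}
	resp := RangeResponse{Keys: inRange[:maxKeys]}
	remaining := inRange[maxKeys:]
	per := (len(remaining) + parts - 1) / parts // ceiling division
	for i := 0; i < len(remaining); i += per {
		end := req.End
		if i+per < len(remaining) {
			end = remaining[i+per]
		}
		resp.Rest = append(resp.Rest, KeyRange{Start: remaining[i], End: end})
	}
	return resp
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g"}
	// Request the entire tree; the peer serves two keys and hands out the rest.
	resp := Respond(keys, KeyRange{}, 2, 2)
	fmt.Println(resp.Keys, resp.Rest)
}
```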

Additionally, per-chunk validation tends to come more naturally to the
Lazy approach since it tends to use the existing structure of the tree
(i.e. keys or nodes) rather than state-sync specific chunks. Such a
design for Tendermint was originally tracked in
[#828](https://github.com/tendermint/tendermint/issues/828).

#### Eager StateSync

Parity implements
[Warp Sync](https://wiki.parity.io/Warp-Sync-Snapshot-Format.html) to rapidly
download both blocks and state snapshots from peers. Data is carved into ~4MB
chunks and snappy compressed. Hashes of snappy compressed chunks are stored in a
manifest file which co-ordinates the state-sync. Obtaining a correct manifest
file seems to require an honest majority of peers. This means you may not find
out the state is incorrect until you download the whole thing and compare it
with a verified block header.

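
The chunk-and-manifest idea can be sketched in a few lines. This is not Parity's format: the `Manifest` type and function names are hypothetical, SHA-256 is used as a stand-in hash, and compression is skipped so that the example stays within the Go standard library.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Manifest lists the hash of every chunk in a snapshot, in order.
type Manifest struct {
	ChunkHashes [][32]byte
}

// SplitChunks cuts data into consecutive chunks of at most chunkSize bytes.
func SplitChunks(data []byte, chunkSize int) [][]byte {
	if chunkSize <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// BuildManifest hashes each chunk once, when the snapshot is produced.
func BuildManifest(chunks [][]byte) Manifest {
	m := Manifest{ChunkHashes: make([][32]byte, 0, len(chunks))}
	for _, c := range chunks {
		m.ChunkHashes = append(m.ChunkHashes, sha256.Sum256(c))
	}
	return m
}

// VerifyChunk reports whether chunk matches the manifest entry at index.
func (m Manifest) VerifyChunk(index int, chunk []byte) bool {
	if index < 0 || index >= len(m.ChunkHashes) {
		return false
	}
	return sha256.Sum256(chunk) == m.ChunkHashes[index]
}

func main() {
	chunks := SplitChunks([]byte("abcdefghij"), 4)
	m := BuildManifest(chunks)
	fmt.Println(len(chunks), m.VerifyChunk(1, []byte("efgh")), m.VerifyChunk(1, []byte("efgX")))
}
```

A chunk that verifies against the manifest is only known to be consistent with that manifest. Whether the manifest itself describes the correct state is a separate question, which is where the honest-majority assumption comes in.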

A similar solution was implemented by Binance in
[#3594](https://github.com/tendermint/tendermint/pull/3594)
based on their initial implementation in
[PR #3243](https://github.com/tendermint/tendermint/pull/3243)
and [some learnings](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit).
Note this still requires the honest majority peer assumption.

As an eager protocol, warp-sync can efficiently compress larger, more
predictable chunks once per snapshot and service many new peers. By
comparison, lazy chunkers would have to compress each chunk at request
time.

### Analysis of Lazy vs Eager

Lazy and Eager have more in common than they differ. Both require
reactors on the Tendermint side, a set of ABCI messages and a method for
serializing/deserializing snapshots facilitated by a SnapshotFormat.

The biggest difference between Lazy and Eager proposals is in the
read/write patterns necessitated by serving a snapshot chunk.
Specifically, Lazy State Sync performs random reads to the underlying data
structure while Eager can optimize for sequential reads.

This distinction between approaches was demonstrated by Binance's
[ackratos](https://github.com/ackratos) in their implementation of [Lazy
State sync](https://github.com/tendermint/tendermint/pull/3243), the
[analysis](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/)
of the performance, and follow up implementation of [Warp
Sync](http://github.com/tendermint/tendermint/pull/3594).

#### Comparing Security Models

There are several different security models which have been
discussed/proposed in the past but generally fall into two categories.

Light client validation: In which the node receiving data is expected to
first perform a light client sync and have all the necessary block
headers. Using a trusted block header (trusted in the sense that it comes
from a validator set subject to [weak
subjectivity](https://github.com/tendermint/tendermint/pull/3795)), the node
can compare any subset of keys, called a chunk, against the merkle root.
The advantage of light client validation is that the block headers are
signed by validators which have something to lose for malicious
behaviour. If a validator were to provide an invalid proof, they can be
slashed.

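
A minimal sketch of validating one chunk against a trusted root follows. It assumes a plain binary Merkle tree over SHA-256 with separate prefixes for leaf and inner hashes; the `ProofStep` layout and helper names are hypothetical, and real IAVL or Tendermint proofs are structured differently.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Hash is a SHA-256 digest.
type Hash = [32]byte

// leafHash and innerHash use different prefixes so that a leaf can never be
// reinterpreted as an inner node.
func leafHash(chunk []byte) Hash {
	return sha256.Sum256(append([]byte{0x00}, chunk...))
}

func innerHash(left, right Hash) Hash {
	buf := make([]byte, 0, 1+2*len(left))
	buf = append(buf, 0x01)
	buf = append(buf, left[:]...)
	buf = append(buf, right[:]...)
	return sha256.Sum256(buf)
}

// ProofStep is one sibling hash on the path from a leaf up to the root.
type ProofStep struct {
	Sibling       Hash
	SiblingOnLeft bool
}

// VerifyChunk recomputes the root from chunk and proof and compares it with
// the root taken from a trusted block header.
func VerifyChunk(root Hash, chunk []byte, proof []ProofStep) bool {
	h := leafHash(chunk)
	for _, step := range proof {
		if step.SiblingOnLeft {
			h = innerHash(step.Sibling, h)
		} else {
			h = innerHash(h, step.Sibling)
		}
	}
	return h == root
}

func main() {
	// Build a four-leaf tree by hand and prove membership of the third chunk.
	l0, l1 := leafHash([]byte("c0")), leafHash([]byte("c1"))
	l2, l3 := leafHash([]byte("c2")), leafHash([]byte("c3"))
	n01, n23 := innerHash(l0, l1), innerHash(l2, l3)
	root := innerHash(n01, n23)
	_ = l2 // l2 is recomputed from the chunk inside VerifyChunk

	proof := []ProofStep{
		{Sibling: l3, SiblingOnLeft: false},
		{Sibling: n01, SiblingOnLeft: true},
	}
	fmt.Println(VerifyChunk(root, []byte("c2"), proof))
	fmt.Println(VerifyChunk(root, []byte("forged"), proof))
}
```

Because each chunk carries its own proof, a bad chunk can be rejected as soon as it arrives, without waiting for the rest of the snapshot.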

Majority of peer validation: A manifest file containing a list of chunks
along with checksums of each chunk is downloaded from a
trusted source. That source can be a community resource similar to
[sum.golang.org](https://sum.golang.org) or downloaded from the majority
of peers. One disadvantage of the majority of peer security model is the
vulnerability to eclipse attacks, in which a malicious user looks to
saturate a target node's peer list and produce a manufactured picture of
majority.


A third option would be to include snapshot related data in the
block header. This could include the manifest with related checksums and be
secured through consensus. One challenge of this approach is to
ensure that creating snapshots does not put undue burden on block
proposers by synchronizing snapshot creation and block creation. One
approach to minimizing the burden is for snapshots for height
`H` to be included in block `H+n`, where `n` is some number of blocks,
giving the block proposer enough time to complete the snapshot
asynchronously.

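
As a worked example of the `H+n` offset, the helper below answers the question "which snapshot height would the block at a given height carry?". The function name and the handling of early blocks are assumptions for illustration only.

```go
package main

import "fmt"

// SnapshotHeightFor returns the height H of the snapshot whose data would be
// carried by the block at blockHeight, given a delay of n blocks
// (H + n = blockHeight). ok is false while the chain is too short to carry one.
func SnapshotHeightFor(blockHeight, n int64) (h int64, ok bool) {
	if n < 0 || blockHeight <= n {
		return 0, false
	}
	return blockHeight - n, true
}

func main() {
	// With n = 5, block 105 carries the snapshot data for height 100.
	h, ok := SnapshotHeightFor(105, 5)
	fmt.Println(h, ok)
}
```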
## Proposal: Eager StateSync With Per Chunk Light Client Validation

The conclusion, after some consideration of the advantages/disadvantages of
eager/lazy and different security models, is to produce a state sync
which eagerly produces snapshots and uses light client validation. This
approach has the performance advantages of pre-computing efficient
snapshots which can be streamed to new nodes on demand using sequential IO.
Secondly, by using light client validation we can validate each chunk on
receipt and avoid the potential eclipse attack of majority of peer based
security.

### Implementation

Tendermint is responsible for downloading and verifying chunks of
AppState from peers. The ABCI application is responsible for taking
AppStateChunk objects from TM and constructing a valid state tree whose
root corresponds with the AppHash of the syncing block. In particular we
will need to implement:

* Build a new StateSync reactor that brokers message transmission between the peers
  and the ABCI application
* A set of ABCI Messages
* Design SnapshotFormat as an interface which can:
    * validate chunks
    * read/write chunks from file
    * read/write chunks to/from application state store
    * convert manifests into chunkRequest ABCI messages
* Implement SnapshotFormat for cosmos-hub with concrete implementation for:
    * read/write chunks in a way which can be:
        * parallelized across peers
        * validated on receipt
    * read/write to/from IAVL+ tree

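
One possible shape for the SnapshotFormat interface is sketched below, with one method group per capability in the list above. All type names, method names, and signatures are hypothetical and do not come from Tendermint or the ABCI; `rawFormat` is a toy implementation that treats chunks as raw bytes checked by SHA-256.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
)

// All types below are hypothetical stand-ins, not Tendermint or ABCI types.
type (
	Chunk struct {
		Index uint32
		Data  []byte
	}
	Manifest     struct{ Hashes [][32]byte } // expected chunk hashes, in order
	ChunkRequest struct {                    // stands in for a chunkRequest ABCI message
		Index uint32
		Hash  [32]byte
	}
	StateStore map[uint32][]byte // stands in for the application state store
)

// SnapshotFormat groups the four capabilities listed above.
type SnapshotFormat interface {
	ValidateChunk(m Manifest, c Chunk) error
	WriteChunk(w io.Writer, c Chunk) error
	ReadChunk(r io.Reader, index uint32) (Chunk, error)
	LoadChunk(s StateStore, index uint32) (Chunk, error)
	ApplyChunk(s StateStore, c Chunk) error
	ChunkRequests(m Manifest) []ChunkRequest
}

type rawFormat struct{}

// NewManifest hashes the given chunks in order.
func NewManifest(chunks ...Chunk) Manifest {
	var m Manifest
	for _, c := range chunks {
		m.Hashes = append(m.Hashes, sha256.Sum256(c.Data))
	}
	return m
}

func (rawFormat) ValidateChunk(m Manifest, c Chunk) error {
	if int(c.Index) >= len(m.Hashes) {
		return errors.New("chunk index not in manifest")
	}
	if sha256.Sum256(c.Data) != m.Hashes[c.Index] {
		return errors.New("chunk hash mismatch")
	}
	return nil
}

func (rawFormat) WriteChunk(w io.Writer, c Chunk) error {
	_, err := w.Write(c.Data)
	return err
}

func (rawFormat) ReadChunk(r io.Reader, index uint32) (Chunk, error) {
	var buf bytes.Buffer
	_, err := buf.ReadFrom(r)
	return Chunk{Index: index, Data: buf.Bytes()}, err
}

func (rawFormat) LoadChunk(s StateStore, index uint32) (Chunk, error) {
	data, ok := s[index]
	if !ok {
		return Chunk{}, errors.New("chunk not in state store")
	}
	return Chunk{Index: index, Data: data}, nil
}

func (rawFormat) ApplyChunk(s StateStore, c Chunk) error {
	s[c.Index] = c.Data
	return nil
}

func (rawFormat) ChunkRequests(m Manifest) []ChunkRequest {
	reqs := make([]ChunkRequest, len(m.Hashes))
	for i, h := range m.Hashes {
		reqs[i] = ChunkRequest{Index: uint32(i), Hash: h}
	}
	return reqs
}

func main() {
	var f SnapshotFormat = rawFormat{}
	c := Chunk{Index: 0, Data: []byte("state")}
	m := NewManifest(c)
	store := StateStore{}
	if err := f.ValidateChunk(m, c); err == nil {
		_ = f.ApplyChunk(store, c) // only validated chunks reach the store
	}
	fmt.Println(len(store), len(f.ChunkRequests(m)))
}
```

Keeping validation, file IO, and state-store IO behind one interface is what would let an application such as cosmos-hub supply its own IAVL+-aware format while the reactor stays unchanged.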

![StateSync Architecture Diagram](img/state-sync.png)

## Implementation Path

* Create StateSync reactor based on [#3753](https://github.com/tendermint/tendermint/pull/3753)
* Design SnapshotFormat with an eye towards cosmos-hub implementation
* ABCI message to send/receive SnapshotFormat
* IAVL+ changes to support SnapshotFormat
* Deliver Warp sync (no chunk validation)
* Light client implementation for weak subjectivity
* Deliver StateSync with chunk validation

## Status

Proposed

## Consequences

### Neutral

### Positive

* Safe & performant state sync design substantiated with real world implementation experience
* General interfaces allowing application specific innovation
* Parallelizable implementation trajectory with reasonable engineering effort

### Negative

* Static Scheduling lacks opportunity for real time chunk availability optimizations

## References

[sync: Sync current state without full replay for Applications](https://github.com/tendermint/tendermint/issues/828) - original issue

[tendermint state sync proposal](https://docs.google.com/document/d/15MFsQtNA0MGBv7F096FFWRDzQ1vR6_dics5Y49vF8JU/edit?ts=5a0f3629) - Cloudhead proposal

[tendermint state sync proposal 2](https://docs.google.com/document/d/1npGTAa1qxe8EQZ1wG0a0Sip9t5oX2vYZNUDwr_LVRR4/edit) - ackratos proposal

[proposal 2 implementation](https://github.com/tendermint/tendermint/pull/3243) - ackratos implementation

[WIP General/Lazy State-Sync pseudo-spec](https://github.com/tendermint/tendermint/issues/3639) - Jae Proposal

[Warp Sync Implementation](https://github.com/tendermint/tendermint/pull/3594) - ackratos

[Chunk Proposal](https://github.com/tendermint/tendermint/pull/3799) - Bucky proposed