# Fastsync Fastsync is a protocol that is used by a node to catch-up to the current state of a Tendermint blockchain. Its typical use case is a node that was disconnected from the system for some time. The recovering node locally has a copy of a prefix of the blockchain, and the corresponding application state that is slightly outdated. It then queries its peers for the blocks that were decided on by the Tendermint blockchain during the period the full node was disconnected. After receiving these blocks, it executes the transactions in the blocks in order to catch-up to the current height of the blockchain and the corresponding application state. In practice it is sufficient to catch-up only close to the current height: The Tendermint consensus reactor implements its own catch-up functionality and can synchronize a node that is close to the current height, perhaps within 10 blocks away from the current height of the blockchain. Fastsync should bring a node within this range. ## Outline - [Part I](#part-i---tendermint-blockchain): Introduction of Tendermint blockchain terms that are relevant for FastSync protocol. - [Part II](#part-ii---sequential-definition-of-fastsync-problem): Introduction of the problem addressed by the Fastsync protocol. - [Fastsync Informal Problem statement](#Fastsync-Informal-Problem-statement): For the general audience, that is, engineers who want to get an overview over what the component is doing from a bird's eye view. - [Sequential Problem statement](#Sequential-Problem-statement): Provides a mathematical definition of the problem statement in its sequential form, that is, ignoring the distributed aspect of the implementation of the blockchain. - [Part III](#part-iii---fastsync-as-distributed-system): Distributed aspects of the fast sync problem, system assumptions and temporal logic specifications. - [Computational Model](#Computational-Model): timing and correctness assumptions. - [Distributed Problem Statement](#Distributed-Problem-Statement): temporal properties that formalize safety and liveness properties of fast sync in distributed setting. - [Part IV](#part-iv---fastsync-protocol): Specification of Fastsync V2 (the protocol underlying the current Golang implementation). - [Definitions](#Definitions): Describes inputs, outputs, variables used by the protocol, auxiliary functions - [FastSync V2](#FastSync-V2): gives an outline of the solution, and details of the functions used (with preconditions, postconditions, error conditions). - [Algorithm Invariants](#Algorithm-Invariants): invariants over the protocol variables that the implementation should maintain. - [Part V](#part-v---analysis-and-improvements): Analysis of Fastsync V2 that highlights several issues that prevent achieving some of the desired fault-tolerance properties. We also give some suggestions on how to address the issues in the future. - [Analysis of Fastsync V2](#Analysis-of-Fastsync-V2): describes undesirable scenarios of Fastsync V2, and why they violate desirable temporal logic specification in an unreliable distributed system. - [Suggestions](#Suggestions-for-an-Improved-Fastsync-Implementation) to address the issues discussed in the analysis. In this document we quite extensively use tags in order to be able to reference assumptions, invariants, etc. in future communication. In these tags we frequently use the following short forms: - TMBC: Tendermint blockchain - SEQ: for sequential specifications - FS: Fastsync - LIVE: liveness - SAFE: safety - INV: invariant - A: assumption - V2: refers to specifics of Fastsync V2 - FS-VAR: refers to properties of Fastsync protocol variables - NewFS: refers to improved future Fastsync implementations # Part I - Tendermint Blockchain We will briefly list some of the notions of Tendermint blockchains that are required for this specification. More details can be found [here][block]. #### **[TMBC-HEADER]** A set of blockchain transactions is stored in a data structure called *block*, which contains a field called *header*. (The data structure *block* is defined [here][block]). As the header contains hashes to the relevant fields of the block, for the purpose of this specification, we will assume that the blockchain is a list of headers, rather than a list of blocks. #### **[TMBC-SEQ]** The Tendermint blockchain is a list *chain* of headers. #### **[TMBC-SEQ-GROW]** During operation, new headers may be appended to the list one by one. > In the following, *ETIME* is a lower bound > on the time interval between the times at which two > successor blocks are added. #### **[TMBC-SEQ-APPEND-E]** If a header is appended at time *t* then no additional header will be appended before time *t + ETIME*. #### **[TMBC-AUTH-BYZ]** We assume the authenticated Byzantine fault model in which no node (faulty or correct) may break digital signatures, but otherwise, no additional assumption is made about the internal behavior of faulty nodes. That is, faulty nodes are only limited in that they cannot forge messages. > We observed that in the existing documentation the term > *validator* refers to both a data structure and a full node that > participates in the distributed computation. Therefore, we introduce > the notions *validator pair* and *validator node*, respectively, to > distinguish these notions in the cases where they are not clear from > the context. #### **[TMBC-VALIDATOR-PAIR]** Given a full node, a *validator pair* is a pair *(address, voting_power)*, where - *address* is the address (public key) of a full node, - *voting_power* is an integer (representing the full node's voting power in a given consensus instance). > In the Golang implementation the data type for *validator > pair* is called `Validator`. #### **[TMBC-VALIDATOR-SET]** A *validator set* is a set of validator pairs. For a validator set *vs*, we write *TotalVotingPower(vs)* for the sum of the voting powers of its validator pairs. #### **[TMBC-CORRECT]** We define a predicate *correctUntil(n, t)*, where *n* is a node and *t* is a time point. The predicate *correctUntil(n, t)* is true if and only if the node *n* follows all the protocols (at least) until time *t*. #### **[TMBC-TIME-PARAMS]** A blockchain has the following configuration parameters: - *unbondingPeriod*: a time duration. - *trustingPeriod*: a time duration smaller than *unbondingPeriod*. #### **[TMBC-FM-2THIRDS]** If a block *h* is in the chain, then there exists a subset *CorrV* of *h.NextValidators*, such that: - *TotalVotingPower(CorrV) > 2/3 TotalVotingPower(h.NextValidators)*; - For every validator pair *(n,p)* in *CorrV*, it holds *correctUntil(n, h.Time + trustingPeriod)*. #### **[TMBC-CORR-FULL]** Every correct full node locally stores a prefix of the current list of headers from [**[TMBC-SEQ]**][TMBC-SEQ-link]. # Part II - Sequential Definition of Fastsync Problem ## Fastsync Informal Problem statement A full node has as input a block of the blockchain at height *h* and the corresponding application state (or the prefix of the current blockchain until height *h*). It has access to a set *peerIDs* of full nodes called *peers* that it knows of. The full node uses the peers to read blocks of the Tendermint blockchain (in a safe way, that is, it checks the soundness conditions), until it has read the most recent block and then terminates. ## Sequential Problem statement *Fastsync* gets as input a block of height *h* and the corresponding application state *s* that corresponds to the block and state of that height of the blockchain, and produces as output (i) a list *L* of blocks starting at height *h* to some height *terminationHeight*, and (ii) the application state when applying the transactions of the list *L* to *s*. > In Tendermint, the commit for block of height *h* is contained in block *h + 1*, > and thus the block of height *h + 1* is needed to verify the block of > height *h*. Let us therefore clarify the following on the > termination height: > The returned value *terminationHeight* is the height of the block with the largest > height that could be verified. In order to do so, *Fastsync* needs the > block at height *terminationHeight + 1* of the blockchain. Fastsync has to satisfy the following properties: #### **[FS-SEQ-SAFE-START]** Let *bh* be the height of the blockchain at the time *Fastsync* starts. By assumption we have *bh >= h*. When *Fastsync* terminates, it outputs a list of all blocks from height *h* to some height *terminationHeight >= bh - 1*. > The above property is independent of how many blocks are added to the > blockchain while Fastsync is running. It links the target height to the > initial state. If Fastsync has to catch-up many blocks, it would be > better to link the target height to a time close to the > termination. This is captured by the following specification: #### **[FS-SEQ-SAFE-SYNC]** Let *eh* be the height of the blockchain at the time *Fastsync* terminates. There is a constant *D >= 1* such that when *Fastsync* terminates, it outputs a list of all blocks from height *h* to some height *terminationHeight >= eh - D*. #### **[FS-SEQ-SAFE-STATE]** Upon termination, the application state is the one that corresponds to the blockchain at height *terminationHeight*. #### **[FS-SEQ-LIVE]** *Fastsync* eventually terminates. # Part III - FastSync as Distributed System ## Computational Model #### **[FS-A-NODE]** We consider a node *FS* that performs *Fastsync*. #### **[FS-A-PEER-IDS]** *FS* has access to a set *peerIDs* of IDs (public keys) of peers . During the execution of *Fastsync*, another protocol (outside of this specification) may add new IDs to *peerIDs*. #### **[FS-A-PEER]** Peers can be faulty, and we do not make any assumptions about the number or ratio of correct/faulty nodes. Faulty processes may be Byzantine according to [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link]. #### **[FS-A-VAL]** The system satisfies [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] and [**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link]. Thus, there is a blockchain that satisfies the soundness requirements (that is, the validation rules in [[block]]). #### **[FS-A-COMM]** Communication between the node *FS* and all correct peers is reliable and bounded in time: there is a message end-to-end delay *Delta* such that if a message is sent at time *t* by a correct process to a correct process, then it will be received and processed by time *t + Delta*. This implies that we need a timeout of at least *2 Delta* for remote procedure calls to ensure that the response of a correct peer arrives before the timeout expires. ## Distributed Problem Statement ### Two Kinds of Termination We do not assume that there is a correct full node in *peerIDs*. Under this assumption no protocol can guarantee the combination of the properties [FS-SEQ-LIVE] and [FS-SEQ-SAFE-START] and [FS-SEQ-SAFE-SYNC] described in the sequential specification above. Thus, in the (unreliable) distributed setting, we consider two kinds of termination (successful and failure) and we will specify below under what (favorable) conditions *Fastsync* ensures to terminate successfully, and satisfy the requirements of the sequential problem statement: #### **[FS-DIST-LIVE]** *Fastsync* eventually terminates: it either *terminates successfully* or it *terminates with failure*. ### Fairness As mentioned above, without assumptions on the correctness of some peers, no protocol can achieve the required specifications. Therefore, we consider the following (fairness) constraint in the safety and liveness properties below: #### **[FS-SOME-CORR-PEER]** Initially, the set *peerIDs* contains at least one correct full node. > While in principle the above condition [FS-SOME-CORR-PEER] > can be part of a sufficient > condition to solve [FS-SEQ-LIVE] and > [FS-SEQ-SAFE-START] and [FS-SEQ-SAFE-SYNC] in the distributed > setting (their corresponding properties are given below), we will discuss in > [Part V](#part-v---analysis-and-improvements) that the > current implementation of Fastsync (V2) requires the much > stronger requirement [**[FS-ALL-CORR-PEER]**](#FS-ALL-CORR-PEER) > given in Part V. ### Safety > As this specification does > not assume that a correct peer is at the most recent height > of the blockchain (it might lag behind), the property [FS-SEQ-SAFE-START] > cannot be ensured in an unreliable distributed setting. We consider > the following relaxation. (Which is typically sufficient for > Tendermint, as the consensus reactor then synchronizes from that > height.) #### **[FS-DIST-SAFE-START]** Let *maxh* be the maximum height of a correct peer [**[TMBC-CORR-FULL]**][TMBC-CORR-FULL-link] in *peerIDs* at the time *Fastsync* starts. If *FastSync* terminates successfully, it is at some height *terminationHeight >= maxh - 1*. > To address [FS-SEQ-SAFE-SYNC] we consider the following property in > the distributed setting. See the comments below on the relation to > the sequential version. #### **[FS-DIST-SAFE-SYNC]** Under [FS-SOME-CORR-PEER], there exists a constant time interval *TD*, such that if *term* is the time *Fastsync* terminates and *maxh* is the maximum height of a correct peer [**[TMBC-CORR-FULL]**][TMBC-CORR-FULL-link] in *peerIDs* at the time *term - TD*, then if *FastSync* terminates successfully, it is at some height *terminationHeight >= maxh - 1*. > *TD* might depend on timeouts etc. We suggest that an acceptable > value for *TD* is in the range of approx. 10 sec., that is the > interval between two calls `QueryStatus()`; see below. > We use *term - TD* as reference time, as we have to account > for communication delay between the peer and *FS*. After the peer sent > the last message to *FS*, the peer and *FS* run concurrently and > independently. There is no assumption on the rate at which a peer can > add blocks (e.g., it might be in the process of catching up > itself). Hence, without additional assumption we cannot link > [FS-DIST-SAFE-SYNC] to > [**[FS-SEQ-SAFE-SYNC]**](#FS-SEQ-SAFE-SYNC), in particular to the > parameter *D*. We discuss a > way to achieve this below: > **Relation to [FS-SEQ-SAFE-SYNC]:** > Under [FS-SOME-CORR-PEER], if *peerIDs* contains a full node that is > "synchronized with the blockchain", and *blockchainheight* is the height > of the blockchain at time *term*, then *terminationHeight* may even > achieve > *blockchainheight - TD / ETIME*; > cf. [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link], that is, > the parameter *D* from [FS-SEQ-SAFE-SYNC] is in the range of *TD / ETIME*. #### **[FS-DIST-SAFE-STATE]** It is the same as the sequential version [**[FS-SEQ-SAFE-STATE]**](#FS-SEQ-SAFE-STATE). #### **[FS-DIST-NONABORT]** If there is one correct process in *peerIDs* [FS-SOME-CORR-PEER], *Fastsync* never terminates with failure. (Together with [FS-DIST-LIVE] that means it will terminate successfully.) # Part IV - Fastsync protocol Here we provide a specification of the FastSync V2 protocol as it is currently implemented. The V2 design is the result of significant refactoring to improve the testability and determinism in the implementation. The architecture is detailed in [ADR-43](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-043-blockchain-riri-org.md). In the original design, a go-routine (thread of execution) was spawned for each block requested, and was responsible for both protocol logic and IO. In the V2 design, protocol logic is decoupled from IO by using three total threads of execution: a scheduler, a processer, and a demuxer. The scheduler contains the business logic for managing peers and requesting blocks from them, while the processor handles the computationally expensive block execution. Both the scheduler and processor are structured as finite state machines that receive input events and emit output events. The demuxer is responsible for all IO, including translating between internal events and IO messages, and routing events between components. Protocols in Tendermint can be considered to consist of two components: a "core" state machine and a "peer" state machine. The core state machine refers to the internal state managed by the node, while the peer state machine determines what messages to send to peers. In the FastSync design, the core and peer state machines correspond to the processor and scheduler, respectively. In the case of FastSync, the core state machine (the processor) is effectively just the Tendermint block execution function, while virtually all protocol logic is contained in the peer state machine (the scheduler). The processor is only implemented as a separate component due to the computationally expensive nature of block execution. We therefore focus our specification here on the peer state machine (the scheduler component), capturing the core state machine (the processor component) in the single `Execute` function, defined below. While the internal details of the `Execute` function are not relevant for the FastSync protocol and are thus not part of this specification, they will be defined in detail at a later date in a separate Block Execution specification. ## Definitions > We now introduce variables and auxiliary functions used by the protocol. ### Inputs - *startBlock*: the block Fastsync starts from - *startState*: application state corresponding to *startBlock.Height* #### **[FS-A-V2-INIT]** - *startBlock* is from the blockchain - *startState* is the application state of the blockchain at Height *startBlock.Height*. ### Variables - *height*: kinitially *startBlock.Height + 1* > height should be thought of the "height of the next block we need to download" - *state*: initially *startState* - *peerIDs*: peer addresses [FS-A-PEER-IDS](#fs-a-peer-ids) - *peerHeights*: stores for each peer the height it reported. initially 0 - *pendingBlocks*: stores for each height which peer was queried. initially nil for each height - *receivedBlocks*: stores for each height which peer returned it. initially nil - *blockstore*: stores for each height greater than *startBlock.Height*, the block of that height. initially nil for all heights - *peerTimeStamp*: stores for each peer the last time a block was received - *pendingTime*: stores for a given height the time a block was requested - *peerRate*: stores for each peer the rate of received data in Bytes/second ### Auxiliary Functions #### **[FS-FUNC-TARGET]** - *TargetHeight = max {peerHeigts(addr): addr in peerIDs} union {height}* #### **[FS-FUNC-MATCH]** ```go func VerifyCommit(b Block, c Commit) Boolean ``` - Comment - Corresponds to `verifyCommit(chainID string, blockID types.BlockID, height int64, commit *types.Commit) error` in the current Golang implementation, which expects blockID and height (from the first block) and the corresponding commit from the following block. We use the simplified form for ease in presentation. - Implementation remark - implements the check that *c* is a valid commit for block *b* - Expected precondition - *c* is a valid commit for block *b* - Expected postcondition - *true* if precondition holds - *false* if precondition is violated - Error condition - none ---- ### Messages Peers participating in FastSync exchange the following set of messages. Messages are encoded using the Amino serialization protocol. We define each message here using Go syntax, annoted with the Amino type name. The prefix `bc` refers to `blockchain`, which is the name of the FastSync reactor in the Go implementation. #### bcBlockRequestMessage ```go // type: "tendermint/blockchain/BlockRequest" type bcBlockRequestMessage struct { Height int64 } ``` Remark: - `msg.Height` > 0 #### bcNoBlockResponseMessage ```go // type: "tendermint/blockchain/NoBlockResponse" type bcNoBlockResponseMessage struct { Height int64 } ``` Remark: - `msg.Height` > 0 - This message type is included in the protocol for convenience and is not expected to be sent between two correct peers #### bcBlockResponseMessage ```go // type: "tendermint/blockchain/BlockResponse" type bcBlockResponseMessage struct { Block *types.Block } ``` Remark: - `msg.Block` is a Tendermint block as defined in [[block]]. - `msg.Block` != nil #### bcStatusRequestMessage ```go // type: "tendermint/blockchain/StatusRequest" type bcStatusRequestMessage struct { Height int64 } ``` Remark: - `msg.Height` > 0 #### bcStatusResponseMessage ```go // type: "tendermint/blockchain/StatusResponse" type bcStatusResponseMessage struct { Height int64 } ``` Remark: - `msg.Height` > 0 ### Remote Functions Peers expose the following functions over remote procedure calls. The "Expected precondition" are only expected for correct peers (as no assumption is made on internals of faulty processes [FS-A-PEER]). These functions are implemented using the above defined message types. > In this document we describe the communication with peers via asynchronous RPCs. ```go func Status(addr Address) (int64, error) ``` - Implementation remark - RPC to full node *addr* - Request message: `bcStatusRequestMessage`. - Response message: `bcStatusResponseMessage`. - Expected precondition - none - Expected postcondition - if *addr* is correct: Returns the current height `height` of the peer. [FS-A-COMM] - if *addr* is faulty: Returns an arbitrary height. [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] - Error condition - if *addr* is correct: none. By [FS-A-COMM] we assume communication is reliable and timely. - if *addr* is faulty: arbitrary error (including timeout). [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] ---- ```go func Block(addr Address, height int64) (Block, error) ``` - Implementation remark - RPC to full node *addr* - Request message: `bcBlockRequestMessage`. - Response message: `bcBlockResponseMessage` or `bcNoBlockResponseMessage`. - Expected precondition - 'height` is less than or equal to height of the peer - Expected postcondition - if *addr* is correct: Returns the block of height `height` from the blockchain. [FS-A-COMM] - if *addr* is faulty: Returns arbitrary or no block [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] - Error condition - if *addr* is correct: precondition violated (returns `bcNoBlockResponseMessage`). [FS-A-COMM] - if *addr* is faulty: arbitrary error (including timeout). [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] ---- ## FastSync V2 ### Outline The protocol is described in terms of functions that are triggered by (external) events. The implementation uses a scheduler and a de-multiplexer to deal with communicating with peers and to trigger the execution of these functions: - `QueryStatus()`: regularly (currently every 10sec; necessarily interval greater than *2 Delta*) queries all peers from *peerIDs* for their current height [TMBC-CORR-FULL]. It does so by calling `Status(n)` remotely on all peers *n*. - `CreateRequest`: regularly checks whether certain blocks have no open request. If a block does not have an open request, it requests one from a peer. It does so by calling `Block(n,h)` remotely on one peer *n* for a missing height *h*. > We have left the strategy how peers are selected unspecified, and > the currently existing different implementations of Fastsync differ > in this aspect. In V2, a peer *p* is selected with the minimum number of > pending requests that can serve the required height *h*, that is > with *peerHeight(p) >= h*. The functions `Status` and `Block` are called by asynchronous RPC. When they return, the following functions are called: - `OnStatusResponse(addr Address, height int64)`: The full node with address *addr* returns its current height. The function updates the height information about *addr*, and may also increase *TargetHeight*. - `OnBlockResponse(addr Address, b Block)`. The full node with address *addr* returns a block. It is added to *blockstore*. Then the auxiliary function `Execute` is called. - `Execute()`: Iterates over the *blockstore*. Checks soundness of the blocks, and executes the transactions of a sound block and updates *state*. > In addition to the functions above, the following two features are > implemented in Fastsync V2 #### **[FS-V2-PEER-REMOVE]** Periodically, *peerTimeStamp* and *peerRate* and *pendingTime* are analyzed. If a peer *p* has not provided a block recently (check of *peerTimeStamp[p]*) or it has not provided sufficiently many data (check of *peerRate[p]*), then *p* is removed from *peerIDs*. In addition, *pendingTime* is used to estimate whether the peer that is responsible for the current height has provided the corresponding block on time. #### **[FS-V2-TIMEOUT]** *Fastsync V2* starts a timeout whenever a block is executed (that is, when the height is incremented). If the timeout expires before the next block is executed, *Fastsync* terminates. If this happens, then *Fastsync* terminates with failure. ### Details ```go func QueryStatus() ``` - Expected precondition - peerIDs initialized and non-empty - Expected postcondition - call asynchronously `Status(n)` at each peer *n* in *peerIDs*. - Error condition - fails if precondition is violated ---- ```go func OnStatusResponse(addr Address, ht int64) ``` - Comment - *ht* is a height - peers can provide the status without being called - Expected precondition - *peerHeights(addr) <= ht* - Expected postcondition - *peerHeights(addr) = ht* - *TargetHeight* is updated - Error condition - if precondition is violated: *addr* not in *peerIDs* (that is, *addr* is removed from *peerIDs*) - Timeout condition - if `OnStatusResponse(addr, ht)` was not invoked within *2 Delta* after `Status(addr)` was called: *addr* not in *peerIDs* ---- ```go func CreateRequest ``` - Expected precondition - *height < TargetHeight* - *peerIDs* nonempty - Expected postcondition - Function `Block` is called remotely at a peer *addr* in peerIDs for a missing height *h* *Remark:* different implementations may have different strategies to balance the load over the peers - *pendingblocks(h) = addr* ---- ```go func OnBlockResponse(addr Address, b Block) ``` - Comment - if after adding block *b*, blocks of heights *height* and *height + 1* are in *blockstore*, then `Execute` is called - Expected precondition - *pendingblocks(b.Height) = addr* - *b* satisfies basic soundness - Expected postcondition - if function `Execute` has been executed without error or was not executed: - *receivedBlocks(b.Height) = addr* - *blockstore(b.Height) = b* - *peerTimeStamp[addr]* is set to a time between invocation and return of the function. - *peerRate[addr]* is updated according to size of received block and time it has passed between current time and last block received from this peer (addr) - Error condition - if precondition is violated: *addr* not in *peerIDs*; reset *pendingblocks(b.Height)* to nil; - Timeout condition - if `OnBlockResponse(addr, b)` was not invoked within *2 Delta* after `Block(addr,h)` was called for *b.Height = h*: *addr* not in *peerIDs* ---- ```go func Execute() ``` - Comments - none - Expected precondition - application state is the one of the blockchain at height *height - 1* - **[FS-V2-Verif]** for any two blocks *a* and *b* from *receivedBlocks*: if *a.Height + 1 = b.Height* then *VerifyCommit (a,b.Commit) = true* - Expected postcondition - Any two blocks *a* and *b* violating [FS-V2-Verif]: *a* and *b* not in *blockstore*; nodes with Address receivedBlocks(a.Height) and receivedBlocks(b.Height) not in peerIDs - height is updated height of complete prefix that matches the blockchain - state is the one of the blockchain at height *height - 1* - if the new value of *height* is equal to *TargetHeight*, then Fastsync **terminates successfully**. - Error condition - none ---- ## Algorithm Invariants > In contrast to the temporal properties above that define the problem > statement, the following are invariants on the solution to the > problem, that is on the algorithm. These invariants are useful for > the verification, but can also guide the implementation. #### **[FS-VAR-STATE-INV]** It is always the case that *state* corresponds to the application state of the blockchain of that height, that is, *state = chain[height - 1].AppState*; *chain* is defined in [**[TMBC-SEQ]**][TMBC-SEQ-link]. #### **[FS-VAR-PEER-INV]** It is always the case that the set *peerIDs* only contains nodes that have not yet misbehaved (by sending wrong data or timing out). #### **[FS-VAR-BLOCK-INV]** For *startBlock.Height <= i < height - 1*, let *b(i)* be the block with height *i* in *blockstore*, it always holds that *VerifyCommit(b(i), b(i+1).Commit) = true*. This means that *height* can only be incremented if all blocks with lower height have been verified. # Part V - Analysis and Improvements ## Analysis of Fastsync V2 #### **[FS-ISSUE-KILL]** If two blocks are not matching [FS-V2-Verif], `Execute` dismisses both blocks and removes the peers that provided these blocks from *peerIDs*. If block *a* was correct and provided by a correct peer *p*, and block b was faulty and provided by a faulty peer, the protocol - removes the correct peer *p*, although it might be useful to download blocks from it in the future - removes the block *a*, so that a fresh copy of *a* needs to be downloaded again from another peer By [FS-A-PEER] we do not put a restriction on the number of faulty peers, so that faulty peers can make *FS* to remove all correct peers from *peerIDs*. As a result, this version of *Fastsync* violates [FS-DIST-SAFE-SYNC]. #### **[FS-ISSUE-NON-TERM]** Due to [**[FS-ISSUE-KILL]**](#fs-issue-kill), from some point on, only faulty peers may be in *peerIDs*. They can thus control at which rate *Fastsync* gets blocks. If the timeout duration from [FS-V2-TIMEOUT] is greater than the time it takes to add a block to the blockchain (LTIME in [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link]), the protocol may never terminate and thus violate [FS-DIST-LIVE]. This scenario is even possible if a correct peer is always in *peerIDs*, but faulty peers are regularly asked for blocks. ### Consequence The issues [FS-ISSUE-KILL] and [FS-ISSUE-NON-TERM] explain why does not satisfy the property [FS-DIST-LIVE] relevant for termination. As a result, V2 only solves the specifications in a restricted form, namely, when all peers are correct: #### **[FS-ALL-CORR-PEER]** At all times, the set *peerIDs* contains only correct full nodes. With this restriction we can give the achieved properties: #### **[FS-VC-ALL-CORR-NONABORT]** Under [FS-ALL-CORR-PEER], *Fastsync* never terminates with failure. #### **[FS-VC-ALL-CORR-LIVE]** Under [FS-ALL-CORR-PEER], *Fastsync* eventually terminates successfully. > In a fault tolerance context this is problematic, > as it means that faulty peers can prevent *FastSync* from termination. > We observe that this also touches other properties, namely, > [FS-DIST-SAFE-START] and [FS-DIST-SAFE-SYNC]: > Termination at an acceptable height are all conditional under > "successful termination". The properties above severely restrict > under which circumstances FastSync (V2) terminates successfully. > As a result, large parts of the current > implementation of are not fault-tolerant. We will > discuss this, and suggestions how to solve this after the > description of the current protocol. ## Suggestions for an Improved Fastsync Implementation ### Solution for [FS-ISSUE-KILL] To avoid [FS-ISSUE-KILL], we observe that [**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link] ensures that from the point a block was created, we assume that more than two thirds of the validator nodes are correct until the *trustingPeriod* expires. Under this assumption, assume the trusting period of *startBlock* is not expired by the time *FastSync* checks a block *b1* with height *startBlock.Height + 1*. To do so, we first need to check whether the Commit in the block *b2* with *startBlock.Height + 2* contains more than 2/3 of the voting power in *startBlock.NextValidators*. If this is the case we can check *VerifyCommit (b1,b2.Commit)*. If we perform checks in this order we observe: - By assumption, *startBlock* is OK, - If the first check (2/3 of voting power) fails, the peer that provided block *b2* is faulty, - If the first check passes and the second check fails (*VerifyCommit*), then the peer that provided *b1* is faulty. - If both checks pass, we can trust *b1* Based on this reasoning, we can ensure to only remove faulty peers from *peerIDs*. That is, if we sequentially verify blocks starting with *startBlock*, we will never remove a correct peer from *peerIDs* and we will be able to ensure the following invariant: #### **[NewFS-VAR-PEER-INV]** If a peer never misbehaves, it is never removed from *peerIDs*. It follows that under [FS-SOME-CORR-PEER], *peerIDs* is always non-empty. > To ensure this, we suggest to change the protocol as follows: #### Fastsync has the following configuration parameters - *trustingPeriod*: a time duration; cf. [**[TMBC-TIME-PARAMS]**][TMBC-TIME-PARAMS-link]. > [NewFS-A-INIT] is the suggested replacement of [FS-A-V2-INIT]. This will > allow us to use the established trust to understand precisely which > peer reported an invalid block in order to ensure the > invariant [NewFS-VAR-TRUST-INV] below: #### **[NewFS-A-INIT]** - *startBlock* is from the blockchain, and within *trustingPeriod* (possible with some extra margin to ensure termination before *trustingPeriod* expired) - *startState* is the application state of the blockchain at Height *startBlock.Height*. - *startHeight = startBlock.Height* #### Additional Variables - *trustedBlockstore*: stores for each height greater than or equal to *startBlock.Height*, the block of that height. Initially it contains only *startBlock* #### **[NewFS-VAR-TRUST-INV]** Let *b(i)* be the block in *trustedBlockstore* with b(i).Height = i. It holds that for *startHeight < i < height - 1*, *VerifyCommit (b(i),b(i+1).Commit) = true*. > We propose to update the function `Execute`. To do so, we first > define the following helper functions: ```go func ValidCommit(VS ValidatorSet, C Commit) Boolean ``` - Comments - checks validator set based on [**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link] - Expected precondition - The validators in *C* - are a subset of VS - have more than 2/3 of the voting power in VS - Expected postcondition - returns *true* if precondition holds, and *false* otherwise - Error condition - none ---- ```go func SequentialVerify { while (true) { b1 = blockstore[height]; b2 = blockstore[height+1]; if b1 == nil or b2 == nil { exit; } if ValidCommit(trustedBlockstore[height - 1].NextValidators, b2.commit) { // we trust b2 if VerifyCommit(b1, b2.commit) { trustedBlockstore.Add(b1); height = height + 1; } else { // as we trust b2, b1 must be faulty blockstore.RemoveFromPeer(receivedBlocks[height]); // we remove all blocks received from the faulty peer peerIDs.Remove(receivedBlocks(bnew.Height)); exit; } } else { // b2 is faulty blockstore.RemoveFromPeer(receivedBlocks[height + 1]); // we remove all blocks received from the faulty peer peerIDs.Remove(receivedBlocks(bnew.Height)); exit; } } } ``` - Comments - none - Expected precondition - [NewFS-VAR-TRUST-INV] - Expected postcondition - [NewFS-VAR-TRUST-INV] - there is no block *bnew* with *bnew.Height = height + 1* in *blockstore* - Error condition - none ---- > Then `Execute` just consists in calling `SequentialVerify` and then > updating the application state to the (new) height. ```go func Execute() ``` - Comments - first `SequentialVerify` is executed - Expected precondition - application state is the one of the blockchain at height *height - 1* - [NewFS-NOT-EXP] *trustedBlockstore[height-1].Time > now - trustingPeriod* - Expected postcondition - there is no block *bnew* with *bnew.Height = height + 1* in *blockstore* - state is the one of the blockchain at height *height - 1* - if height = TargetHeight: **terminate successfully** - Error condition - fails if [NewFS-NOT-EXP] is violated ---- ### Solution for [FS-ISSUE-NON-TERM] As discussed above, the advantageous termination requirement is the combination of [FS-DIST-LIVE] and [FS-DIST-NONABORT], that is, *Fastsync* should terminate successfully in case there is at least one correct peer in *peerIDs*. For this we have to ensure that faulty processes cannot slow us down and provide blocks at a lower rate than the blockchain may grow. To ensure that we will have to add an assumption on message delays. #### **[NewFS-A-DELTA]** *2 Delta < ETIME*; cf. [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link]. > This assumption implies that the timeouts for `OnBlockResponse` and > `OnStatusResponse` are such that a faulty peer that tries to respond > slower than *2 Delta* will be removed. In the following we will > provide a rough estimate on termination time in a fault-prone > scenario. > In the following > we assume that during a "long enough" finite good period no new > faulty peers are added to *peerIDs*. Below we will sketch how "long > enough" can be estimated based on the timing assumption in this > specification. #### **[NewFS-A-STATUS-INTERVAL]** Let Sigma be the (upper bound on the) time between two calls of `QueryStatus()`. #### **[NewFS-A-GOOD-PERIOD]** A time interval *[begin,end]* is *good period* if: - *fmax* is the number of faulty peers in *peerIDs* at time *begin* - *end >= begin + 2 Delta (fmax + 3)* - no faulty peer is added before time *end* > In the analysis below we assume that the termination condition of > *Fastsync* is > *height = TargetHeight* in the postcondition of > `Execute`. Therefore, [NewFS-A-STATUS-INTERVAL] does not interfere > with this analysis. If a correct peer reports a new height "shortly > before termination" this leads to an additional round trip to > request and add the new block. Then [NewFS-A-DELTA] ensures that > *Fastsync* catches up. Arguments: 1. If a faulty peer *p* reports a faulty block, `SequentialVerify` will eventually remove *p* from *peerIDs* 2. By `SequentialVerify`, if a faulty peer *p* reports multiple faulty blocks, *p* will be removed upon trying to check the block with the smallest height received from *p*. 3. Assume whenever a block does not have an open request, `CreateRequest` is called immediately, which calls `Block(n)` on a peer. Say this happens at time *t*. There are two cases: - by t + 2 Delta a block is added to *blockStore* - at t + 2 Delta `Block(n)` timed out and *n* is removed from peer. 4. Let *f(t)* be the number of faulty peers in *peerIDs* at time *t*; *f(begin) = fmax*. 5. Let t_i be the sequence of times `OnBlockResponse(addr,b)` is invoked or times out with *b.Height = height + 1*. 6. By 3., - (a). *t_1 <= begin + 2 Delta* - (b). *t_{i+1} <= t_i + 2 Delta* 7. By an inductive argument we prove for *i > 0* that - (a). *height(t_{i+1}) > height(t_i)*, or - (b). *f(t_{i+1}) < f(t_i))* and *height(t_{i+1}) = height(t_i)* Argument: if the peer is faulty and does not return a block, the peer is removed, if it is faulty and returns a faulty block `SequentialVerify` removes the peer (b). If the returned block is OK, height is increased (a). 8. By 2. and 7., faulty peers can delay incrementing the height at most *fmax* times, where each time "costs" *2 Delta* seconds. We have additional *2 Delta* initial offset (3a) plus *2 Delta* to get all missing blocks after the last fault showed itself. (This assumes that an arbitrary number of blocks can be obtained and checked within one round-trip 2 Delta; which either needs conservative estimation of Delta, or a more refined analysis). Thus we reach the *targetHeight* and terminate by time *end*. # References [[block]] Specification of the block data structure. [block]: https://github.com/tendermint/spec/blob/d46cd7f573a2c6a2399fcab2cde981330aa63f37/spec/core/data_structures.md [TMBC-HEADER-link]: #tmbc-header [TMBC-SEQ-link]: #tmbc-seq [TMBC-CORR-FULL-link]: #tmbc-corrfull [TMBC-CORRECT-link]: #tmbc-correct [TMBC-Sign-link]: #tmbc-sign [TMBC-FaultyFull-link]: #tmbc-faultyfull [TMBC-TIME-PARAMS-link]: #tmbc-time-params [TMBC-SEQ-APPEND-E-link]: #tmbc-seq-append-e [TMBC-FM-2THIRDS-link]: #tmbc-fm-2thirds [TMBC-Auth-Byz-link]: #tmbc-auth-byz [TMBC-INV-SIGN-link]: #tmbc-inv-sign [TMBC-SOUND-DISTR-PossCommit--link]: #tmbc-sound-distr-posscommit [TMBC-INV-VALID-link]: #tmbc-inv-valid [LCV-VC-LIVE-link]: https://github.com/informalsystems/VDD/tree/master/lightclient/verification.md#lcv-vc-live [lightclient]: https://github.com/interchainio/tendermint-rs/blob/e2cb9aca0b95430fca2eac154edddc9588038982/docs/architecture/adr-002-lite-client.md [failuredetector]: https://github.com/informalsystems/VDD/blob/master/liteclient/failuredetector.md [fullnode]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md [FN-LuckyCase-link]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#fn-luckycase [blockchain-validator-set]: https://github.com/tendermint/spec/blob/d46cd7f573a2c6a2399fcab2cde981330aa63f37/spec/core/data_structures.md#data-structures [fullnode-data-structures]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#data-structures [FN-ManifestFaulty-link]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#fn-manifestfaulty