Fastsync

Fastsync is a protocol that a node uses to catch up with the current state of a Tendermint blockchain. Its typical use case is a node that was disconnected from the system for some time. The recovering node locally has a copy of a prefix of the blockchain and the corresponding application state, both slightly outdated. It then queries its peers for the blocks that were decided on by the Tendermint blockchain during the period the full node was disconnected. After receiving these blocks, it executes the transactions in them in order to catch up to the current height of the blockchain and the corresponding application state.

In practice it is sufficient to catch up only close to the current height: the Tendermint consensus reactor implements its own catch-up functionality and can synchronize a node that is close to the current height of the blockchain, say within 10 blocks of it. Fastsync should bring a node within this range.

Outline

  • Part I: Introduction of Tendermint blockchain terms that are relevant for the Fastsync protocol.

  • Part II: Introduction of the problem addressed by the Fastsync protocol.

    • Fastsync Informal Problem statement: for the general audience, that is, engineers who want a bird's-eye overview of what the component does.

    • Sequential Problem statement: Provides a mathematical definition of the problem statement in its sequential form, that is, ignoring the distributed aspect of the implementation of the blockchain.

  • Part III: Distributed aspects of the fast sync problem, system assumptions and temporal logic specifications.

  • Part IV: Specification of Fastsync V2 (the protocol underlying the current Golang implementation).

    • Definitions: describes inputs, outputs, variables used by the protocol, and auxiliary functions.

    • FastSync V2: gives an outline of the solution, and details of the functions used (with preconditions, postconditions, error conditions).

    • Algorithm Invariants: invariants over the protocol variables that the implementation should maintain.

  • Part V: Analysis of Fastsync V2 that highlights several issues that prevent achieving some of the desired fault-tolerance properties. We also give some suggestions on how to address the issues in the future.

    • Analysis of Fastsync V2: describes undesirable scenarios of Fastsync V2, and why they violate desirable temporal logic specification in an unreliable distributed system.

    • Suggestions to address the issues discussed in the analysis.

In this document we make extensive use of tags so that we can reference assumptions, invariants, etc. in future communication. In these tags we frequently use the following short forms:

  • TMBC: Tendermint blockchain
  • SEQ: sequential specifications
  • FS: Fastsync
  • LIVE: liveness
  • SAFE: safety
  • INV: invariant
  • A: assumption
  • V2: refers to specifics of Fastsync V2
  • FS-VAR: refers to properties of Fastsync protocol variables
  • NewFS: refers to improved future Fastsync implementations

Part I - Tendermint Blockchain

We will briefly list some of the notions of Tendermint blockchains that are required for this specification. More details can be found here.

[TMBC-HEADER]

A set of blockchain transactions is stored in a data structure called block, which contains a field called header. (The data structure block is defined here). As the header contains hashes of the relevant fields of the block, for the purpose of this specification we will assume that the blockchain is a list of headers rather than a list of blocks.

[TMBC-SEQ]

The Tendermint blockchain is a list of headers, referred to as chain.

[TMBC-SEQ-GROW]

During operation, new headers may be appended to the list one by one.

In the following, ETIME is a lower bound on the time interval between the times at which two successor blocks are added.

[TMBC-SEQ-APPEND-E]

If a header is appended at time t then no additional header will be appended before time t + ETIME.

[TMBC-AUTH-BYZ]

We assume the authenticated Byzantine fault model in which no node (faulty or correct) may break digital signatures, but otherwise, no additional assumption is made about the internal behavior of faulty nodes. That is, faulty nodes are only limited in that they cannot forge messages.

We observed that in the existing documentation the term validator refers both to a data structure and to a full node that participates in the distributed computation. We therefore introduce the notions validator pair and validator node, respectively, to distinguish the two where they are not clear from the context.

[TMBC-VALIDATOR-PAIR]

Given a full node, a validator pair is a pair (address, voting_power), where

  • address is the address (public key) of a full node,
  • voting_power is an integer (representing the full node's voting power in a given consensus instance).

In the Golang implementation the data type for validator pair is called Validator.

[TMBC-VALIDATOR-SET]

A validator set is a set of validator pairs. For a validator set vs, we write TotalVotingPower(vs) for the sum of the voting powers of its validator pairs.
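
The following Go sketch collects [TMBC-VALIDATOR-PAIR] and [TMBC-VALIDATOR-SET] into simplified types. It is illustrative only; the actual Validator and ValidatorSet types of the Golang implementation differ in detail.

type Address = string // public key of a full node (simplified)

// Validator is a validator pair as in [TMBC-VALIDATOR-PAIR].
type Validator struct {
    Address     Address
    VotingPower int64
}

// ValidatorSet is a set of validator pairs as in [TMBC-VALIDATOR-SET].
type ValidatorSet struct {
    Validators []Validator
}

// TotalVotingPower returns the sum of the voting powers of all
// validator pairs in vs.
func TotalVotingPower(vs ValidatorSet) int64 {
    var sum int64
    for _, v := range vs.Validators {
        sum += v.VotingPower
    }
    return sum
}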

[TMBC-CORRECT]

We define a predicate correctUntil(n, t), where n is a node and t is a time point. The predicate correctUntil(n, t) is true if and only if the node n follows all the protocols (at least) until time t.

[TMBC-TIME-PARAMS]

A blockchain has the following configuration parameters:

  • unbondingPeriod: a time duration.
  • trustingPeriod: a time duration smaller than unbondingPeriod.

[TMBC-FM-2THIRDS]

If a block h is in the chain, then there exists a subset CorrV of h.NextValidators, such that:

  • TotalVotingPower(CorrV) > 2/3 TotalVotingPower(h.NextValidators);
  • For every validator pair (n,p) in CorrV, it holds correctUntil(n, h.Time + trustingPeriod).

[TMBC-CORR-FULL]

Every correct full node locally stores a prefix of the current list of headers from [TMBC-SEQ].

Part II - Sequential Definition of Fastsync Problem

Fastsync Informal Problem statement

A full node has as input a block of the blockchain at height h and the corresponding application state (or the prefix of the current blockchain up to height h). It has access to a set peerIDs of full nodes, called peers, that it knows of. The full node uses the peers to read blocks of the Tendermint blockchain (in a safe way, that is, checking the soundness conditions), until it has read the most recent block, and then terminates.

Sequential Problem statement

Fastsync gets as input a block of height h and the corresponding application state s that corresponds to the block and state of that height of the blockchain, and produces as output (i) a list L of blocks starting at height h to some height terminationHeight, and (ii) the application state when applying the transactions of the list L to s.

In Tendermint, the commit for the block of height h is contained in the block of height h + 1, and thus the block of height h + 1 is needed to verify the block of height h. This clarifies the termination height as follows: the returned value terminationHeight is the largest height whose block could be verified; to verify it, Fastsync needs the block at height terminationHeight + 1 of the blockchain.

Fastsync has to satisfy the following properties:

[FS-SEQ-SAFE-START]

Let bh be the height of the blockchain at the time Fastsync starts. By assumption we have bh >= h. When Fastsync terminates, it outputs a list of all blocks from height h to some height terminationHeight >= bh - 1.

The above property is independent of how many blocks are added to the blockchain while Fastsync is running. It links the target height to the initial state. If Fastsync has to catch up on many blocks, it would be better to link the target height to a time close to the termination. This is captured by the following specification:

[FS-SEQ-SAFE-SYNC]

Let eh be the height of the blockchain at the time Fastsync terminates. There is a constant D >= 1 such that when Fastsync terminates, it outputs a list of all blocks from height h to some height terminationHeight >= eh - D.

[FS-SEQ-SAFE-STATE]

Upon termination, the application state is the one that corresponds to the blockchain at height terminationHeight.

[FS-SEQ-LIVE]

Fastsync eventually terminates.

Part III - FastSync as Distributed System

Computational Model

[FS-A-NODE]

We consider a node FS that performs Fastsync.

[FS-A-PEER-IDS]

FS has access to a set peerIDs of IDs (public keys) of peers. During the execution of Fastsync, another protocol (outside of this specification) may add new IDs to peerIDs.

[FS-A-PEER]

Peers can be faulty, and we do not make any assumptions about the number or ratio of correct/faulty nodes. Faulty processes may be Byzantine according to [TMBC-AUTH-BYZ].

[FS-A-VAL]

The system satisfies [TMBC-AUTH-BYZ] and [TMBC-FM-2THIRDS]. Thus, there is a blockchain that satisfies the soundness requirements (that is, the validation rules in [block]).

[FS-A-COMM]

Communication between the node FS and all correct peers is reliable and bounded in time: there is a message end-to-end delay Delta such that if a message is sent at time t by a correct process to a correct process, then it will be received and processed by time t + Delta. This implies that we need a timeout of at least 2 Delta for remote procedure calls to ensure that the response of a correct peer arrives before the timeout expires.
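
As an illustration of this timeout rule, the sketch below wraps a remote call with a 2 Delta deadline. The helper callWithTimeout and the concrete Delta value are illustrative assumptions, not part of the implementation.

import (
    "context"
    "time"
)

const Delta = 2 * time.Second // assumed bound on the end-to-end message delay

// callWithTimeout runs a remote call with a deadline of 2*Delta: the request
// takes at most Delta to reach a correct peer, and the response at most Delta
// to come back, so a correct peer's response arrives before the deadline.
func callWithTimeout(ctx context.Context, call func(context.Context) error) error {
    ctx, cancel := context.WithTimeout(ctx, 2*Delta)
    defer cancel()
    return call(ctx)
}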

Distributed Problem Statement

Two Kinds of Termination

We do not assume that there is a correct full node in peerIDs. Under this assumption no protocol can guarantee the combination of the properties [FS-SEQ-LIVE], [FS-SEQ-SAFE-START], and [FS-SEQ-SAFE-SYNC] described in the sequential specification above. Thus, in the (unreliable) distributed setting, we consider two kinds of termination (successful and failure), and we specify below under which (favorable) conditions Fastsync is guaranteed to terminate successfully and satisfy the requirements of the sequential problem statement:

[FS-DIST-LIVE]

Fastsync eventually terminates: it either terminates successfully or it terminates with failure.

Fairness

As mentioned above, without assumptions on the correctness of some peers, no protocol can achieve the required specifications. Therefore, we consider the following (fairness) constraint in the safety and liveness properties below:

[FS-SOME-CORR-PEER]

Initially, the set peerIDs contains at least one correct full node.

While in principle the above condition [FS-SOME-CORR-PEER] can be part of a sufficient condition for solving [FS-SEQ-LIVE], [FS-SEQ-SAFE-START], and [FS-SEQ-SAFE-SYNC] in the distributed setting (the corresponding properties are given below), we will discuss in Part V that the current implementation of Fastsync (V2) requires the much stronger requirement [FS-ALL-CORR-PEER] given there.

Safety

As this specification does not assume that a correct peer is at the most recent height of the blockchain (it might lag behind), the property [FS-SEQ-SAFE-START] cannot be ensured in an unreliable distributed setting. We consider the following relaxation, which is typically sufficient for Tendermint, as the consensus reactor synchronizes the node from that height onwards.

[FS-DIST-SAFE-START]

Let maxh be the maximum height of a correct peer [TMBC-CORR-FULL] in peerIDs at the time Fastsync starts. If FastSync terminates successfully, it is at some height terminationHeight >= maxh - 1.

To address [FS-SEQ-SAFE-SYNC] we consider the following property in the distributed setting. See the comments below on the relation to the sequential version.

[FS-DIST-SAFE-SYNC]

Under [FS-SOME-CORR-PEER], there exists a constant time interval TD, such that if term is the time Fastsync terminates and maxh is the maximum height of a correct peer [TMBC-CORR-FULL] in peerIDs at time term - TD, then if FastSync terminates successfully, it is at some height terminationHeight >= maxh - 1.

TD might depend on timeouts etc. We suggest that an acceptable value for TD is in the range of approx. 10 sec., that is, the interval between two calls of QueryStatus(); see below. We use term - TD as the reference time because we have to account for the communication delay between the peer and FS: after the peer has sent its last message to FS, the peer and FS run concurrently and independently. There is no assumption on the rate at which a peer can add blocks (e.g., it might be in the process of catching up itself). Hence, without additional assumptions we cannot link [FS-DIST-SAFE-SYNC] to [FS-SEQ-SAFE-SYNC], in particular to the parameter D. We discuss one way to do so:

Relation to [FS-SEQ-SAFE-SYNC]: under [FS-SOME-CORR-PEER], if peerIDs contains a full node that is "synchronized with the blockchain", and blockchainheight is the height of the blockchain at time term, then terminationHeight may even achieve blockchainheight - TD / ETIME; cf. [TMBC-SEQ-APPEND-E]. That is, the parameter D from [FS-SEQ-SAFE-SYNC] is in the range of TD / ETIME.

[FS-DIST-SAFE-STATE]

It is the same as the sequential version [FS-SEQ-SAFE-STATE].

[FS-DIST-NONABORT]

If there is at least one correct process in peerIDs [FS-SOME-CORR-PEER], Fastsync never terminates with failure. (Together with [FS-DIST-LIVE] this means it terminates successfully.)

Part IV - Fastsync protocol

Here we provide a specification of the FastSync V2 protocol as it is currently implemented. The V2 design is the result of significant refactoring to improve the testability and determinism in the implementation. The architecture is detailed in ADR-43.

In the original design, a go-routine (thread of execution) was spawned for each requested block and was responsible for both protocol logic and IO. In the V2 design, protocol logic is decoupled from IO by using three threads of execution in total: a scheduler, a processor, and a demuxer.

The scheduler contains the business logic for managing peers and requesting blocks from them, while the processor handles the computationally expensive block execution. Both the scheduler and processor are structured as finite state machines that receive input events and emit output events. The demuxer is responsible for all IO, including translating between internal events and IO messages, and routing events between components.

Protocols in Tendermint can be considered to consist of two components: a "core" state machine and a "peer" state machine. The core state machine refers to the internal state managed by the node, while the peer state machine determines what messages to send to peers. In the FastSync design, the core and peer state machines correspond to the processor and scheduler, respectively.

In the case of FastSync, the core state machine (the processor) is effectively just the Tendermint block execution function, while virtually all protocol logic is contained in the peer state machine (the scheduler). The processor is only implemented as a separate component due to the computationally expensive nature of block execution. We therefore focus our specification here on the peer state machine (the scheduler component), capturing the core state machine (the processor component) in the single Execute function, defined below.

While the internal details of the Execute function are not relevant for the FastSync protocol and are thus not part of this specification, they will be defined in detail at a later date in a separate Block Execution specification.

Definitions

We now introduce variables and auxiliary functions used by the protocol.

Inputs

  • startBlock: the block Fastsync starts from
  • startState: application state corresponding to startBlock.Height

[FS-A-V2-INIT]

  • startBlock is from the blockchain
  • startState is the application state of the blockchain at Height startBlock.Height.

Variables

  • height: initially startBlock.Height + 1

    height should be thought of as the "height of the next block we need to download"

  • state: initially startState

  • peerIDs: peer addresses [FS-A-PEER-IDS]

  • peerHeights: stores for each peer the height it reported. initially 0

  • pendingBlocks: stores for each height which peer was queried. initially nil for each height

  • receivedBlocks: stores for each height which peer returned it. initially nil

  • blockstore: stores for each height greater than startBlock.Height, the block of that height. initially nil for all heights

  • peerTimeStamp: stores for each peer the last time a block was received

  • pendingTime: stores for a given height the time a block was requested

  • peerRate: stores for each peer the rate of received data in Bytes/second
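
The variables above can be collected into a single Go struct, sketched below with simplified placeholder types (AppState, Block, and Commit are illustrative assumptions; Address is from the sketch in Part I):

// Simplified placeholder types for illustration.
type AppState struct{}

type Commit struct {
    Signers []Address // addresses of the validators that signed (simplified)
}

type Block struct {
    Height int64
    Commit Commit // commit for the block at Height - 1 (simplified)
}

// fastSyncState mirrors the protocol variables listed above.
type fastSyncState struct {
    height         int64                 // height of the next block to download
    state          AppState              // application state, initially startState
    peerIDs        map[Address]bool      // known peers [FS-A-PEER-IDS]
    peerHeights    map[Address]int64     // last height reported by each peer
    pendingBlocks  map[int64]Address     // which peer was queried for each height
    receivedBlocks map[int64]Address     // which peer returned the block of each height
    blockstore     map[int64]*Block      // downloaded blocks by height
    peerTimeStamp  map[Address]time.Time // last time a block was received from a peer
    pendingTime    map[int64]time.Time   // time the block of a height was requested
    peerRate       map[Address]float64   // received data rate per peer, in bytes/second
}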

Auxiliary Functions

[FS-FUNC-TARGET]

  • TargetHeight = max ({peerHeights(addr): addr in peerIDs} union {height})
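
The following Go sketch transcribes this definition directly, using the state struct sketched above (illustrative, not the implementation's code):

// TargetHeight returns the maximum of height and all reported peer heights.
func (s *fastSyncState) TargetHeight() int64 {
    target := s.height
    for addr := range s.peerIDs {
        if h := s.peerHeights[addr]; h > target {
            target = h
        }
    }
    return target
}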

[FS-FUNC-MATCH]

func VerifyCommit(b Block, c Commit) Boolean
  • Comment

    • Corresponds to verifyCommit(chainID string, blockID types.BlockID, height int64, commit *types.Commit) error in the current Golang implementation, which expects blockID and height (from the first block) and the corresponding commit from the following block. We use the simplified form for ease in presentation.
  • Implementation remark

    • implements the check that c is a valid commit for block b
  • Expected precondition

    • c is a valid commit for block b
  • Expected postcondition

    • true if precondition holds
    • false if precondition is violated
  • Error condition

    • none

Messages

Peers participating in FastSync exchange the following set of messages. Messages are encoded using the Amino serialization protocol. We define each message here using Go syntax, annotated with the Amino type name. The prefix bc refers to blockchain, which is the name of the FastSync reactor in the Go implementation.

bcBlockRequestMessage

// type: "tendermint/blockchain/BlockRequest"
type bcBlockRequestMessage struct {
 Height int64
}

Remark:

  • msg.Height > 0

bcNoBlockResponseMessage

// type: "tendermint/blockchain/NoBlockResponse"
type bcNoBlockResponseMessage struct {
 Height int64
}

Remark:

  • msg.Height > 0
  • This message type is included in the protocol for convenience and is not expected to be sent between two correct peers

bcBlockResponseMessage

// type: "tendermint/blockchain/BlockResponse"
type bcBlockResponseMessage struct {
 Block *types.Block
}

Remark:

  • msg.Block is a Tendermint block as defined in [block].
  • msg.Block != nil

bcStatusRequestMessage

// type: "tendermint/blockchain/StatusRequest"
type bcStatusRequestMessage struct {
 Height int64
}

Remark:

  • msg.Height > 0

bcStatusResponseMessage

// type: "tendermint/blockchain/StatusResponse"
type bcStatusResponseMessage struct {
 Height int64
}

Remark:

  • msg.Height > 0

Remote Functions

Peers expose the following functions via remote procedure calls. The "Expected preconditions" apply only to correct peers, as no assumptions are made about the internals of faulty processes [FS-A-PEER]. These functions are implemented using the message types defined above.

In this document we describe the communication with peers via asynchronous RPCs.

func Status(addr Address) (int64, error)
  • Implementation remark
    • RPC to full node addr
    • Request message: bcStatusRequestMessage.
    • Response message: bcStatusResponseMessage.
  • Expected precondition
    • none
  • Expected postcondition
    • if addr is correct: Returns the current height height of the peer. [FS-A-COMM]
    • if addr is faulty: Returns an arbitrary height. [TMBC-AUTH-BYZ]
  • Error condition
    • if addr is correct: none. By [FS-A-COMM] we assume communication is reliable and timely.
    • if addr is faulty: arbitrary error (including timeout). [TMBC-AUTH-BYZ]

func Block(addr Address, height int64) (Block, error)
  • Implementation remark
    • RPC to full node addr
    • Request message: bcBlockRequestMessage.
    • Response message: bcBlockResponseMessage or bcNoBlockResponseMessage.
  • Expected precondition
    • height is less than or equal to the height of the peer
  • Expected postcondition
    • if addr is correct: Returns the block of height height from the blockchain. [FS-A-COMM]
    • if addr is faulty: Returns arbitrary or no block [TMBC-AUTH-BYZ]
  • Error condition
    • if addr is correct: precondition violated (returns bcNoBlockResponseMessage). [FS-A-COMM]
    • if addr is faulty: arbitrary error (including timeout). [TMBC-AUTH-BYZ]

FastSync V2

Outline

The protocol is described in terms of functions that are triggered by (external) events. The implementation uses a scheduler and a de-multiplexer to deal with communicating with peers and to trigger the execution of these functions:

  • QueryStatus(): regularly (currently every 10 sec; the interval must be greater than 2 Delta) queries all peers in peerIDs for their current height [TMBC-CORR-FULL]. It does so by calling Status(n) remotely on all peers n.

  • CreateRequest: regularly checks whether certain blocks have no open request. If a block has no open request, it requests the block from a peer. It does so by calling Block(n,h) remotely on one peer n for a missing height h.

We leave the strategy for selecting peers unspecified; the existing implementations of Fastsync differ in this aspect. In V2, among the peers that can serve the required height h (that is, with peerHeight(p) >= h), a peer p with the minimum number of pending requests is selected; a sketch is given below.
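
A sketch of this selection rule under the state from the Definitions section; pendingCount is an assumed helper that counts, per peer, the entries of pendingBlocks pointing to it:

// selectPeer returns a peer that can serve height h and has the fewest
// pending requests; ok is false if no known peer can serve h.
func (s *fastSyncState) selectPeer(h int64, pendingCount map[Address]int) (addr Address, ok bool) {
    for p := range s.peerIDs {
        if s.peerHeights[p] < h {
            continue // peer has not reported height h yet
        }
        if !ok || pendingCount[p] < pendingCount[addr] {
            addr, ok = p, true
        }
    }
    return addr, ok
}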

The functions Status and Block are called by asynchronous RPC. When they return, the following functions are called:

  • OnStatusResponse(addr Address, height int64): The full node with address addr returns its current height. The function updates the height information about addr, and may also increase TargetHeight.

  • OnBlockResponse(addr Address, b Block). The full node with address addr returns a block. It is added to blockstore. Then the auxiliary function Execute is called.

  • Execute(): Iterates over the blockstore. Checks soundness of the blocks, and executes the transactions of a sound block and updates state.
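
A schematic sketch of how the demuxer could route peer responses to these handlers; the event types and channel are illustrative assumptions, not the actual implementation:

// Events emitted for incoming peer messages (illustrative).
type statusEvent struct {
    addr   Address
    height int64
}

type blockEvent struct {
    addr  Address
    block Block
}

// demux routes each incoming event to the corresponding handler,
// i.e., OnStatusResponse or OnBlockResponse as specified below.
func demux(events <-chan interface{},
    onStatus func(Address, int64), onBlock func(Address, Block)) {
    for ev := range events {
        switch e := ev.(type) {
        case statusEvent:
            onStatus(e.addr, e.height)
        case blockEvent:
            onBlock(e.addr, e.block)
        }
    }
}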

In addition to the functions above, the following two features are implemented in Fastsync V2:

[FS-V2-PEER-REMOVE]

Periodically, peerTimeStamp, peerRate, and pendingTime are analyzed. If a peer p has not provided a block recently (checked via peerTimeStamp[p]) or has not provided sufficiently much data (checked via peerRate[p]), then p is removed from peerIDs. In addition, pendingTime is used to estimate whether the peer responsible for the current height has provided the corresponding block on time. A sketch of the periodic check follows.
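
In this sketch, the thresholds maxIdle and minRate are assumed parameters, not values taken from the implementation:

// removeSlowPeers drops peers that have been idle for too long or whose
// measured rate is below the threshold [FS-V2-PEER-REMOVE].
func (s *fastSyncState) removeSlowPeers(now time.Time, maxIdle time.Duration, minRate float64) {
    for p := range s.peerIDs {
        if now.Sub(s.peerTimeStamp[p]) > maxIdle || s.peerRate[p] < minRate {
            delete(s.peerIDs, p) // peer is too slow or idle: remove it
        }
    }
}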

[FS-V2-TIMEOUT]

Fastsync V2 starts a timeout whenever a block is executed (that is, whenever height is incremented). If the timeout expires before the next block is executed, Fastsync terminates with failure. A sketch follows.
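
The timeout can be sketched as a timer that is restarted on every executed block and whose firing terminates Fastsync with failure; the executed channel and the timeout parameter are illustrative assumptions:

import (
    "errors"
    "time"
)

// monitorProgress terminates with failure if no block is executed within
// timeout after the previous one [FS-V2-TIMEOUT]. The executed channel is
// assumed to receive a value whenever height is incremented.
func monitorProgress(executed <-chan int64, timeout time.Duration) error {
    timer := time.NewTimer(timeout)
    defer timer.Stop()
    for {
        select {
        case <-executed:
            // a block was executed: restart the timeout
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(timeout)
        case <-timer.C:
            return errors.New("fastsync: timeout, terminating with failure")
        }
    }
}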

Details

func QueryStatus()
  • Expected precondition
    • peerIDs initialized and non-empty
  • Expected postcondition
    • call asynchronously Status(n) at each peer n in peerIDs.
  • Error condition
    • fails if precondition is violated

func OnStatusResponse(addr Address, ht int64)
  • Comment
    • ht is a height
    • peers can provide the status without being called
  • Expected precondition
    • peerHeights(addr) <= ht
  • Expected postcondition
    • peerHeights(addr) = ht
    • TargetHeight is updated
  • Error condition
    • if precondition is violated: addr not in peerIDs (that is, addr is removed from peerIDs)
  • Timeout condition
    • if OnStatusResponse(addr, ht) was not invoked within 2 Delta after Status(addr) was called: addr not in peerIDs

func CreateRequest()
  • Expected precondition
    • height < TargetHeight
    • peerIDs nonempty
  • Expected postcondition
    • Function Block is called remotely at a peer addr in peerIDs for a missing height h
      Remark: different implementations may have different strategies to balance the load over the peers
    • pendingBlocks(h) = addr

func OnBlockResponse(addr Address, b Block)
  • Comment
    • if after adding block b, blocks of heights height and height + 1 are in blockstore, then Execute is called
  • Expected precondition
    • pendingBlocks(b.Height) = addr
    • b satisfies basic soundness
  • Expected postcondition
    • if function Execute has been executed without error or was not executed:
      • receivedBlocks(b.Height) = addr
      • blockstore(b.Height) = b
      • peerTimeStamp[addr] is set to a time between invocation and return of the function.
      • peerRate[addr] is updated according to the size of the received block and the time elapsed since the last block was received from this peer (addr)
  • Error condition
    • if precondition is violated: addr not in peerIDs; reset pendingBlocks(b.Height) to nil
  • Timeout condition
    • if OnBlockResponse(addr, b) was not invoked within 2 Delta after Block(addr,h) was called for b.Height = h: addr not in peerIDs

func Execute()
  • Comments
    • none
  • Expected precondition
    • application state is the one of the blockchain at height height - 1
    • [FS-V2-Verif] for any two blocks a and b from receivedBlocks: if a.Height + 1 = b.Height then VerifyCommit(a, b.Commit) = true
  • Expected postcondition
    • Any two blocks a and b violating [FS-V2-Verif]: a and b are not in blockstore; the nodes with addresses receivedBlocks(a.Height) and receivedBlocks(b.Height) are not in peerIDs
    • height is updated to the height of the complete prefix that matches the blockchain
    • state is the one of the blockchain at height height - 1
    • if the new value of height is equal to TargetHeight, then Fastsync terminates successfully.
  • Error condition
    • none

Algorithm Invariants

In contrast to the temporal properties above that define the problem statement, the following are invariants on the solution to the problem, that is, on the algorithm. These invariants are useful for verification, but can also guide the implementation.

[FS-VAR-STATE-INV]

It is always the case that state corresponds to the application state of the blockchain at that height, that is, state = chain[height - 1].AppState; chain is defined in [TMBC-SEQ].

[FS-VAR-PEER-INV]

It is always the case that the set peerIDs only contains nodes that have not yet misbehaved (by sending wrong data or timing out).

[FS-VAR-BLOCK-INV]

For startBlock.Height <= i < height - 1, let b(i) be the block with height i in blockstore; it always holds that VerifyCommit(b(i), b(i+1).Commit) = true. This means that height can only be incremented once all blocks with lower heights have been verified. A sketch of this check is given below.
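
A sketch of checking this invariant over blockstore, using the simplified types from Part IV (VerifyCommit is the function specified above):

// checkBlockInvariant verifies [FS-VAR-BLOCK-INV]: every stored block up to
// height - 1 is verified by the commit contained in its successor.
func (s *fastSyncState) checkBlockInvariant(startHeight int64) bool {
    for i := startHeight; i < s.height-1; i++ {
        b, next := s.blockstore[i], s.blockstore[i+1]
        if b == nil || next == nil || !VerifyCommit(*b, next.Commit) {
            return false
        }
    }
    return true
}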

Part V - Analysis and Improvements

Analysis of Fastsync V2

[FS-ISSUE-KILL]

If two blocks do not match according to [FS-V2-Verif], Execute dismisses both blocks and removes the peers that provided them from peerIDs. If block a was correct and provided by a correct peer p, while block b was faulty and provided by a faulty peer, the protocol

  • removes the correct peer p, although it might be useful to download blocks from it in the future
  • removes the block a, so that a fresh copy of a needs to be downloaded again from another peer

By [FS-A-PEER] we do not restrict the number of faulty peers, so faulty peers can make FS remove all correct peers from peerIDs. As a result, this version of Fastsync violates [FS-DIST-SAFE-SYNC].

[FS-ISSUE-NON-TERM]

Due to [FS-ISSUE-KILL], from some point on, only faulty peers may be left in peerIDs. They can thus control the rate at which Fastsync receives blocks. If the timeout duration from [FS-V2-TIMEOUT] is greater than the time it takes to add a block to the blockchain (ETIME in [TMBC-SEQ-APPEND-E]), the protocol may never terminate and thus violate [FS-DIST-LIVE]. This scenario is possible even if a correct peer is always in peerIDs, but faulty peers are regularly asked for blocks.

Consequence

The issues [FS-ISSUE-KILL] and [FS-ISSUE-NON-TERM] explain why Fastsync V2 does not satisfy the property [FS-DIST-LIVE], which is relevant for termination. As a result, V2 only solves the specifications in a restricted form, namely, when all peers are correct:

[FS-ALL-CORR-PEER]

At all times, the set peerIDs contains only correct full nodes.

With this restriction we can give the achieved properties:

[FS-VC-ALL-CORR-NONABORT]

Under [FS-ALL-CORR-PEER], Fastsync never terminates with failure.

[FS-VC-ALL-CORR-LIVE]

Under [FS-ALL-CORR-PEER], Fastsync eventually terminates successfully.

In a fault-tolerance context this is problematic, as it means that faulty peers can prevent FastSync from terminating. We observe that this also affects other properties, namely [FS-DIST-SAFE-START] and [FS-DIST-SAFE-SYNC]: termination at an acceptable height is conditional on successful termination. The properties above severely restrict the circumstances under which FastSync (V2) terminates successfully. As a result, large parts of the current implementation of Fastsync are not fault-tolerant. We discuss this, together with suggestions on how to solve it, below.

Suggestions for an Improved Fastsync Implementation

Solution for [FS-ISSUE-KILL]

To avoid [FS-ISSUE-KILL], we observe that [TMBC-FM-2THIRDS] ensures that, from the point a block was created, more than two thirds of the validator nodes are correct until the trustingPeriod expires. Under this assumption, suppose the trusting period of startBlock has not expired by the time FastSync checks a block b1 with height startBlock.Height + 1. We first check whether the Commit in the block b2 with height startBlock.Height + 2 contains more than 2/3 of the voting power in startBlock.NextValidators. If this is the case, we can check VerifyCommit(b1, b2.Commit). If we perform the checks in this order, we observe:

  • By assumption, startBlock is OK.
  • If the first check (2/3 of voting power) fails, the peer that provided block b2 is faulty.
  • If the first check passes and the second check (VerifyCommit) fails, then the peer that provided b1 is faulty.
  • If both checks pass, we can trust b1.

Based on this reasoning, we can ensure that only faulty peers are removed from peerIDs. That is, if we verify blocks sequentially starting with startBlock, we will never remove a correct peer from peerIDs, and we will be able to ensure the following invariant:

[NewFS-VAR-PEER-INV]

If a peer never misbehaves, it is never removed from peerIDs. It follows that under [FS-SOME-CORR-PEER], peerIDs is always non-empty.

To ensure this, we suggest changing the protocol as follows.

[NewFS-A-INIT] below is the suggested replacement of [FS-A-V2-INIT]; it lists the parameters with which Fastsync is initialized. This will allow us to use the established trust to determine precisely which peer reported an invalid block, in order to ensure the invariant [NewFS-VAR-TRUST-INV] below:

[NewFS-A-INIT]

  • startBlock is from the blockchain, and within the trustingPeriod (possibly with some extra margin, to ensure termination before the trustingPeriod expires)
  • startState is the application state of the blockchain at Height startBlock.Height.
  • startHeight = startBlock.Height

Additional Variables

  • trustedBlockstore: stores for each height greater than or equal to startBlock.Height, the block of that height. Initially it contains only startBlock

[NewFS-VAR-TRUST-INV]

Let b(i) be the block in trustedBlockstore with b(i).Height = i. It holds that for startHeight < i < height - 1, VerifyCommit(b(i), b(i+1).Commit) = true.

We propose to update the function Execute. To do so, we first define the following helper functions:

func ValidCommit(VS ValidatorSet, C Commit) Boolean
  • Comments
    • none
  • Expected precondition
    • The validators in C
      • are a subset of VS
      • have more than 2/3 of the voting power in VS
  • Expected postcondition
    • returns true if precondition holds, and false otherwise
  • Error condition
    • none
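
A possible Go realization of ValidCommit under the simplified types used in the sketches above, where a Commit carries the addresses of its signers (an illustrative assumption):

// ValidCommit checks that the signers of c are a subset of vs and hold more
// than 2/3 of its total voting power. Each signer is assumed to appear at
// most once in c.Signers.
func ValidCommit(vs ValidatorSet, c Commit) bool {
    power := make(map[Address]int64, len(vs.Validators))
    for _, v := range vs.Validators {
        power[v.Address] = v.VotingPower
    }
    var signed int64
    for _, s := range c.Signers {
        p, ok := power[s]
        if !ok {
            return false // signer not in vs: subset condition violated
        }
        signed += p
    }
    return 3*signed > 2*TotalVotingPower(vs)
}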

func SequentialVerify() {
    for {
        b1 := blockstore[height]
        b2 := blockstore[height+1]
        if b1 == nil || b2 == nil {
            return
        }
        if ValidCommit(trustedBlockstore[height-1].NextValidators, b2.Commit) {
            // we trust b2
            if VerifyCommit(b1, b2.Commit) {
                trustedBlockstore.Add(b1)
                height = height + 1
            } else {
                // as we trust b2, b1 must be faulty;
                // remove all blocks received from the faulty peer, and the peer itself
                blockstore.RemoveFromPeer(receivedBlocks[height])
                peerIDs.Remove(receivedBlocks[height])
                return
            }
        } else {
            // b2 is faulty;
            // remove all blocks received from the faulty peer, and the peer itself
            blockstore.RemoveFromPeer(receivedBlocks[height+1])
            peerIDs.Remove(receivedBlocks[height+1])
            return
        }
    }
}
  • Comments
    • none
  • Expected precondition
    • [NewFS-VAR-TRUST-INV]
  • Expected postcondition
    • [NewFS-VAR-TRUST-INV]
    • there is no block bnew with bnew.Height = height + 1 in blockstore
  • Error condition
    • none

Then Execute just consists of calling SequentialVerify and then updating the application state to the (new) height.

func Execute()
  • Comments
    • first SequentialVerify is executed
  • Expected precondition
    • application state is the one of the blockchain at height height - 1
    • [NewFS-NOT-EXP] trustedBlockstore[height-1].Time > now - trustingPeriod
  • Expected postcondition
    • there is no block bnew with bnew.Height = height + 1 in blockstore
    • state is the one of the blockchain at height height - 1
    • if height = TargetHeight: terminate successfully
  • Error condition
    • fails if [NewFS-NOT-EXP] is violated

Solution for [FS-ISSUE-NON-TERM]

As discussed above, the desirable termination requirement is the combination of [FS-DIST-LIVE] and [FS-DIST-NONABORT], that is, Fastsync should terminate successfully whenever there is at least one correct peer in peerIDs. For this we have to ensure that faulty processes cannot slow us down and provide blocks at a lower rate than that at which the blockchain may grow. To ensure this, we have to add an assumption on message delays.

[NewFS-A-DELTA]

2 Delta < ETIME; cf. [TMBC-SEQ-APPEND-E].

This assumption implies that the timeouts for OnBlockResponse and OnStatusResponse are such that a faulty peer that tries to respond slower than 2 Delta will be removed. Below we provide a rough estimate of the termination time in a fault-prone scenario. We assume that during a "long enough" finite good period no new faulty peers are added to peerIDs; we then sketch how "long enough" can be estimated based on the timing assumptions of this specification.

[NewFS-A-STATUS-INTERVAL]

Let Sigma be the (upper bound on the) time between two calls of QueryStatus().

[NewFS-A-GOOD-PERIOD]

A time interval [begin, end] is a good period if:

  • fmax is the number of faulty peers in peerIDs at time begin
  • end >= begin + 2 Delta (fmax + 3)
  • no faulty peer is added before time end

In the analysis below we assume that the termination condition of Fastsync is height = TargetHeight in the postcondition of Execute; therefore, [NewFS-A-STATUS-INTERVAL] does not interfere with the analysis. If a correct peer reports a new height "shortly before termination", this leads to an additional round trip to request and add the new block; [NewFS-A-DELTA] then ensures that Fastsync catches up.

Arguments:

  1. If a faulty peer p reports a faulty block, SequentialVerify will eventually remove p from peerIDs

  2. By SequentialVerify, if a faulty peer p reports multiple faulty blocks, p will be removed upon trying to check the block with the smallest height received from p.

  3. Assume that whenever a block does not have an open request, CreateRequest is called immediately, which calls Block(n) on a peer. Say this happens at time t. There are two cases:

    • by t + 2 Delta a block is added to blockstore
    • at t + 2 Delta, Block(n) has timed out and n is removed from peerIDs.
  4. Let f(t) be the number of faulty peers in peerIDs at time t;
    f(begin) = fmax.

  5. Let t_i be the sequence of times OnBlockResponse(addr,b) is invoked or times out with b.Height = height + 1.

  6. By 3.,

    • (a). t_1 <= begin + 2 Delta
    • (b). t_{i+1} <= t_i + 2 Delta
  7. By an inductive argument we prove for i > 0 that

    • (a). height(t_{i+1}) > height(t_i), or
    • (b). f(t_{i+1}) < f(t_i) and height(t_{i+1}) = height(t_i)

    Argument: if the peer is faulty and does not return a block, the peer is removed; if it is faulty and returns a faulty block, SequentialVerify removes the peer (case b). If the returned block is OK, height is increased (case a).

  8. By 2. and 7., faulty peers can delay incrementing the height at most fmax times, where each time "costs" 2 Delta seconds. We have an additional initial offset of 2 Delta (3a) plus 2 Delta to get all missing blocks after the last fault showed itself. (This assumes that an arbitrary number of blocks can be obtained and checked within one round trip of 2 Delta, which either needs a conservative estimation of Delta or a more refined analysis.) Thus we reach TargetHeight and terminate by time end.

References

[block] Specification of the block data structure.