# Fastsync

Fastsync is a protocol that is used by a node to catch up to the
current state of a Tendermint blockchain. Its typical use case is a
node that was disconnected from the system for some time. The
recovering node locally has a copy of a prefix of the blockchain,
and the corresponding application state that is slightly outdated. It
then queries its peers for the blocks that were decided on by the
Tendermint blockchain during the period the full node was
disconnected. After receiving these blocks, it executes the
transactions in the blocks in order to catch up to the current height
of the blockchain and the corresponding application state.

In practice it is sufficient to catch up only close to the current
height: The Tendermint consensus reactor implements its own catch-up
functionality and can synchronize a node that is close to the current height,
perhaps within 10 blocks of the current height of the blockchain.
Fastsync should bring a node within this range.

## Outline

- [Part I](#part-i---tendermint-blockchain): Introduction of Tendermint
blockchain terms that are relevant for the Fastsync protocol.

- [Part II](#part-ii---sequential-definition-of-fastsync-problem): Introduction
of the problem addressed by the Fastsync protocol.
    - [Fastsync Informal Problem
      statement](#Fastsync-Informal-Problem-statement): For the general
      audience, that is, engineers who want to get an overview of what
      the component is doing from a bird's eye view.

    - [Sequential Problem statement](#Sequential-Problem-statement):
      Provides a mathematical definition of the problem statement in
      its sequential form, that is, ignoring the distributed aspect of
      the implementation of the blockchain.

- [Part III](#part-iii---fastsync-as-distributed-system): Distributed
  aspects of the fast sync problem, system assumptions and temporal
  logic specifications.

    - [Computational Model](#Computational-Model):
      timing and correctness assumptions.

    - [Distributed Problem Statement](#Distributed-Problem-Statement):
      temporal properties that formalize safety and liveness
      properties of Fastsync in the distributed setting.

- [Part IV](#part-iv---fastsync-protocol): Specification of Fastsync V2
  (the protocol underlying the current Golang implementation).

    - [Definitions](#Definitions): Describes inputs, outputs,
       variables used by the protocol, and auxiliary functions.

    - [FastSync V2](#FastSync-V2): gives an outline of the solution,
       and details of the functions used (with preconditions,
       postconditions, error conditions).

    - [Algorithm Invariants](#Algorithm-Invariants): invariants over
       the protocol variables that the implementation should maintain.

- [Part V](#part-v---analysis-and-improvements): Analysis
  of Fastsync V2 that highlights several issues that prevent achieving
  some of the desired fault-tolerance properties. We also give some
  suggestions on how to address the issues in the future.

    - [Analysis of Fastsync V2](#Analysis-of-Fastsync-V2): describes
        undesirable scenarios of Fastsync V2, and why they violate
        desirable temporal logic specification in an unreliable
        distributed system.

    - [Suggestions](#Suggestions-for-an-Improved-Fastsync-Implementation): how to address the issues discussed in the analysis.

In this document we quite extensively use tags in order to be able to
reference assumptions, invariants, etc. in future communication. In
these tags we frequently use the following short forms:

- TMBC: Tendermint blockchain
- SEQ: for sequential specifications
- FS: Fastsync
- LIVE: liveness
- SAFE: safety
- INV: invariant
- A: assumption
- V2: refers to specifics of Fastsync V2
- FS-VAR: refers to properties of Fastsync protocol variables
- NewFS: refers to improved future Fastsync implementations

# Part I - Tendermint Blockchain

We will briefly list some of the notions of Tendermint blockchains that are
required for this specification. More details can be found [here][block].

#### **[TMBC-HEADER]**

A set of blockchain transactions is stored in a data structure called
*block*, which contains a field called *header*. (The data structure
*block* is defined [here][block]).  As the header contains hashes to
the relevant fields of the block, for the purpose of this
specification, we will assume that the blockchain is a list of
headers, rather than a list of blocks.

#### **[TMBC-SEQ]**

The Tendermint blockchain is a list *chain* of headers.

#### **[TMBC-SEQ-GROW]**

During operation, new headers may be appended to the list one by one.

> In the following, *ETIME* is a lower bound
> on the time interval between the times at which two
> successor blocks are added.

#### **[TMBC-SEQ-APPEND-E]**

If a header is appended at time *t* then no additional header will be
appended before time *t + ETIME*.

#### **[TMBC-AUTH-BYZ]**

We assume the authenticated Byzantine fault model in which no node (faulty or
correct) may break digital signatures, but otherwise, no additional
assumption is made about the internal behavior of faulty
nodes. That is, faulty nodes are only limited in that they cannot forge
messages.

<!-- The authenticated Byzantine model assumes [TMBC-Sign-NoForge] and -->
<!-- [TMBC-FaultyFull], that is, faulty nodes are limited in that they -->
<!-- cannot forge messages [TMBC-Sign-NoForge]. -->

> We observed that in the existing documentation the term
> *validator* refers to both a data structure and a full node that
> participates in the distributed computation. Therefore, we introduce
> the notions *validator pair* and *validator node*, respectively, to
> distinguish these notions in the cases where they are not clear from
> the context.

#### **[TMBC-VALIDATOR-PAIR]**

Given a full node, a
*validator pair* is a pair *(address, voting_power)*, where

- *address* is the address (public key) of a full node,
- *voting_power* is an integer (representing the full node's
  voting power in a given consensus instance).
  
> In the Golang implementation the data type for *validator
> pair* is called `Validator`.

#### **[TMBC-VALIDATOR-SET]**

A *validator set* is a set of validator pairs. For a validator set
*vs*, we write *TotalVotingPower(vs)* for the sum of the voting powers
of its validator pairs.

#### **[TMBC-CORRECT]**

We define a predicate *correctUntil(n, t)*, where *n* is a node and *t* is a
time point.
The predicate *correctUntil(n, t)* is true if and only if the node *n*
follows all the protocols (at least) until time *t*.

#### **[TMBC-TIME-PARAMS]**

A blockchain has the following configuration parameters:

- *unbondingPeriod*: a time duration.
- *trustingPeriod*: a time duration smaller than *unbondingPeriod*.

#### **[TMBC-FM-2THIRDS]**

If a block *h* is in the chain,
then there exists a subset *CorrV*
of *h.NextValidators*, such that:

- *TotalVotingPower(CorrV) > 2/3
    TotalVotingPower(h.NextValidators)*;
- For every validator pair *(n,p)* in *CorrV*, it holds *correctUntil(n,
    h.Time + trustingPeriod)*.
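
The voting-power condition above can be sketched in Go as follows. This is a minimal illustration, not the Tendermint implementation: the type `ValidatorPair` and the function `MoreThanTwoThirds` are hypothetical names, and the sketch ignores integer overflow for very large voting powers.

```go
package main

import "fmt"

// ValidatorPair mirrors [TMBC-VALIDATOR-PAIR]: an address and its voting power.
// (Hypothetical type for this sketch.)
type ValidatorPair struct {
	Address     string
	VotingPower int64
}

// TotalVotingPower sums the voting powers of all pairs in vs.
func TotalVotingPower(vs []ValidatorPair) int64 {
	var total int64
	for _, v := range vs {
		total += v.VotingPower
	}
	return total
}

// MoreThanTwoThirds reports whether subset holds strictly more than 2/3 of
// the voting power of full. Integer arithmetic avoids rounding errors.
func MoreThanTwoThirds(subset, full []ValidatorPair) bool {
	return 3*TotalVotingPower(subset) > 2*TotalVotingPower(full)
}

func main() {
	full := []ValidatorPair{{"a", 10}, {"b", 10}, {"c", 10}}
	fmt.Println(MoreThanTwoThirds(full[:2], full)) // exactly 2/3 is not enough
	fmt.Println(MoreThanTwoThirds(full, full))     // the full set is enough
}
```

Note that the comparison is strict: a subset holding exactly two thirds of the voting power does not satisfy the condition.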

#### **[TMBC-CORR-FULL]**

Every correct full node locally stores a prefix of the
current list of headers from [**[TMBC-SEQ]**][TMBC-SEQ-link].

# Part II - Sequential Definition of Fastsync Problem

## Fastsync Informal Problem statement

A full node has as input a block of the blockchain at height *h* and
the corresponding application state (or the prefix of the current
blockchain until height *h*). It has access to a set *peerIDs* of full
nodes called *peers* that it knows of.  The full node uses the peers
to read blocks of the Tendermint blockchain (in a safe way, that is,
it checks the soundness conditions), until it has read the most recent
block and then terminates.

## Sequential Problem statement

*Fastsync* gets as input a block of height *h* and the application
state *s* of the blockchain at that height, and produces
as output (i) a list *L* of blocks starting at height *h* to some height
*terminationHeight*, and (ii) the application state obtained by applying the
transactions of the list *L* to *s*.

> In Tendermint, the commit for the block of height *h* is contained in block *h + 1*,
> and thus the block of height *h + 1* is needed to verify the block of
> height *h*. Let us therefore clarify the following on the
> termination height:
> The returned value *terminationHeight* is the height of the block with the largest
> height that could be verified. In order to do so, *Fastsync* needs the
> block at height  *terminationHeight + 1* of the blockchain.

Fastsync has to satisfy the following properties:

#### **[FS-SEQ-SAFE-START]**

Let *bh* be the height of the blockchain at the time *Fastsync*
starts. By assumption we have *bh >= h*.
When *Fastsync* terminates, it outputs a list of all blocks from
height *h* to some height *terminationHeight >= bh - 1*.

> The above property is independent of how many blocks are added to the
> blockchain while Fastsync is running. It links the target height to the
> initial state. If Fastsync has to catch-up many blocks, it would be
> better to link the target height to a time close to the
> termination. This is captured by the following specification:

#### **[FS-SEQ-SAFE-SYNC]**

Let *eh* be the height of the blockchain at the time *Fastsync*
terminates. There is a constant *D >= 1* such that when *Fastsync*
terminates, it outputs a list of all blocks from height *h* to some
height *terminationHeight >= eh - D*.

#### **[FS-SEQ-SAFE-STATE]**

Upon termination, the application state is the one that corresponds to
the blockchain at height *terminationHeight*.

#### **[FS-SEQ-LIVE]**

*Fastsync* eventually terminates.

# Part III - FastSync as Distributed System

## Computational Model

#### **[FS-A-NODE]**

We consider a node *FS* that performs *Fastsync*.

#### **[FS-A-PEER-IDS]**

*FS* has access to a set *peerIDs* of IDs (public keys) of peers.
During the execution of *Fastsync*, another protocol (outside
of this specification) may add new IDs to *peerIDs*.

#### **[FS-A-PEER]**

Peers can be faulty, and we do not make any assumptions about the number or
ratio of correct/faulty nodes. Faulty processes may be Byzantine
according to [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link].

#### **[FS-A-VAL]**

The system satisfies [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link] and
[**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link]. Thus, there is a
blockchain that satisfies the soundness requirements (that is, the
validation rules in [[block]]).

#### **[FS-A-COMM]**

Communication between the node *FS* and all correct peers is reliable and
bounded in time: there is a message end-to-end delay *Delta* such that
if a message is sent at time *t* by a correct process to a correct
process, then it will be received and processed by time *t +
Delta*. This implies that we need a timeout of at least *2 Delta* for
remote procedure calls to ensure that the response of a correct peer
arrives before the timeout expires.
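
As a small worked example of the timeout bound, the sketch below derives the minimum RPC timeout from *Delta*. The function name `MinRPCTimeout` and the value chosen for *Delta* are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// MinRPCTimeout returns the lower bound on an RPC timeout implied by
// [FS-A-COMM]: up to Delta for the request plus up to Delta for the response.
func MinRPCTimeout(delta time.Duration) time.Duration {
	return 2 * delta
}

func main() {
	delta := 500 * time.Millisecond // hypothetical value of Delta
	fmt.Println("timeout must be at least", MinRPCTimeout(delta))
}
```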

## Distributed Problem Statement

### Two Kinds of Termination

We do not assume that there is a correct full node in
*peerIDs*. Under this assumption no protocol can guarantee the combination
of the properties [FS-SEQ-LIVE] and
[FS-SEQ-SAFE-START] and [FS-SEQ-SAFE-SYNC] described in the sequential
specification above. Thus, in the (unreliable) distributed setting, we
consider two kinds of termination (successful and failure) and we will
specify below under what (favorable) conditions *Fastsync* is guaranteed to
terminate successfully and to satisfy the requirements of the sequential
problem statement:

#### **[FS-DIST-LIVE]**

*Fastsync* eventually terminates: it either *terminates successfully* or
it *terminates with failure*.

### Fairness

As mentioned above, without assumptions on the correctness of some
peers, no protocol can achieve the required specifications. Therefore,
we consider the following (fairness) constraint in the
safety and liveness properties below:

#### **[FS-SOME-CORR-PEER]**

Initially, the set *peerIDs* contains at least one correct full node.

> While in principle the above condition [FS-SOME-CORR-PEER]
> can be part of a sufficient
> condition to solve [FS-SEQ-LIVE] and
> [FS-SEQ-SAFE-START] and [FS-SEQ-SAFE-SYNC] in the distributed
> setting (their corresponding properties are given below), we will discuss in
> [Part V](#part-v---analysis-and-improvements) that the
> current implementation of Fastsync (V2) requires the much
> stronger requirement [**[FS-ALL-CORR-PEER]**](#FS-ALL-CORR-PEER)
> given in Part V.

### Safety

> As this specification does
> not assume that a correct peer is at the most recent height
> of the blockchain (it might lag behind), the property [FS-SEQ-SAFE-START]
> cannot be ensured in an unreliable distributed setting. We consider
> the following relaxation. (Which is typically sufficient for
> Tendermint, as the consensus reactor then synchronizes from that
> height.)

#### **[FS-DIST-SAFE-START]**

Let *maxh* be the maximum
height of a correct peer [**[TMBC-CORR-FULL]**][TMBC-CORR-FULL-link]
in *peerIDs* at the time *Fastsync* starts. If *FastSync* terminates
successfully, it is at some height *terminationHeight >= maxh - 1*.

> To address [FS-SEQ-SAFE-SYNC] we consider the following property in
> the distributed setting. See the comments below on the relation to
> the sequential version.

#### **[FS-DIST-SAFE-SYNC]**

Under [FS-SOME-CORR-PEER], there exists a constant time interval *TD*, such
that if *term* is the time *Fastsync* terminates and
*maxh* is the maximum height of a correct peer
[**[TMBC-CORR-FULL]**][TMBC-CORR-FULL-link] in *peerIDs* at the time
*term - TD*, then if *FastSync* terminates successfully, it is at
some height *terminationHeight >= maxh - 1*.

> *TD* might depend on timeouts etc. We suggest that an acceptable
> value for *TD* is in the range of approx. 10 sec., that is the
> interval between two calls `QueryStatus()`; see below.
> We use *term - TD* as reference time, as we have to account
> for communication delay between the peer and *FS*. After the peer sent
> the last message to *FS*, the peer and *FS* run concurrently and
> independently. There is no assumption on the rate at which a peer can
> add blocks (e.g., it might be in the process of catching up
> itself). Hence, without additional assumption we cannot link
> [FS-DIST-SAFE-SYNC] to
> [**[FS-SEQ-SAFE-SYNC]**](#FS-SEQ-SAFE-SYNC), in particular to the
> parameter *D*. We discuss a
> way to achieve this below:
> **Relation to [FS-SEQ-SAFE-SYNC]:**  
> Under [FS-SOME-CORR-PEER], if *peerIDs* contains a full node that is
> "synchronized with the blockchain", and *blockchainheight* is the height
> of the blockchain at time *term*, then  *terminationHeight* may even
> achieve
> *blockchainheight - TD / ETIME*;
> cf. [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link], that is,
> the parameter *D* from [FS-SEQ-SAFE-SYNC] is in the range of  *TD / ETIME*.
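
To make the relation between *TD*, *ETIME*, and *D* concrete, the following sketch computes the quotient *TD / ETIME*. It only gives an order of magnitude, as the text above says "in the range of". The function name `ApproxD` and the example values are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ApproxD estimates the parameter D from [FS-SEQ-SAFE-SYNC] as TD / ETIME.
// It is an order-of-magnitude estimate, not an exact bound.
func ApproxD(td, etime time.Duration) (int64, error) {
	if etime <= 0 {
		return 0, errors.New("ETIME must be positive")
	}
	return int64(td / etime), nil
}

func main() {
	// Hypothetical values: TD = 10s (the QueryStatus interval), ETIME = 1s.
	d, err := ApproxD(10*time.Second, time.Second)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("D is roughly", d)
}
```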

#### **[FS-DIST-SAFE-STATE]**

It is the same as the sequential version
[**[FS-SEQ-SAFE-STATE]**](#FS-SEQ-SAFE-STATE).

#### **[FS-DIST-NONABORT]**

If there is one correct process in *peerIDs* [FS-SOME-CORR-PEER],
*Fastsync* never terminates with failure. (Together with [FS-DIST-LIVE]
 that means it will terminate successfully.)

# Part IV - Fastsync protocol

Here we provide a specification of the FastSync V2 protocol as it is currently
implemented. The V2 design is the result of significant refactoring to improve
the testability and determinism in the implementation. The architecture is
detailed in
[ADR-43](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-043-blockchain-riri-org.md).

In the original design, a go-routine (thread of execution) was spawned for each block requested, and
was responsible for both protocol logic and IO. In the V2 design, protocol logic
is decoupled from IO by using three total threads of execution: a scheduler, a
processor, and a demuxer.

The scheduler contains the business logic for managing
peers and requesting blocks from them, while the processor handles the
computationally expensive block execution. Both the scheduler and processor
are structured as finite state machines that receive input events and emit
output events. The demuxer is responsible for all IO, including translating
between internal events and IO messages, and routing events between components.

Protocols in Tendermint can be considered to consist of two
components: a "core" state machine and a "peer" state machine. The core state
machine refers to the internal state managed by the node, while the peer state
machine determines what messages to send to peers. In the FastSync design, the
core and peer state machines correspond to the processor and scheduler,
respectively.

In the case of FastSync, the core state machine (the processor) is effectively
just the Tendermint block execution function, while virtually all protocol logic
is contained in the peer state machine (the scheduler). The processor is
only implemented as a separate component due to the computationally expensive nature
of block execution. We therefore focus our specification here on the peer state machine
(the scheduler component), capturing the core state machine (the processor component)
in the single `Execute` function, defined below.

While the internal details of the `Execute` function are not relevant for the
FastSync protocol and are thus not part of this specification, they will be
defined in detail at a later date in a separate Block Execution specification.

## Definitions

> We now introduce variables and auxiliary functions used by the protocol.

### Inputs

- *startBlock*: the block Fastsync starts from
- *startState*: application state corresponding to *startBlock.Height*

#### **[FS-A-V2-INIT]**

- *startBlock* is from the blockchain
- *startState* is the application state of the blockchain at Height *startBlock.Height*.

### Variables

- *height*: initially *startBlock.Height + 1*
  > height should be thought of as the "height of the next block we need to download"
- *state*: initially *startState*
- *peerIDs*: peer addresses [FS-A-PEER-IDS](#fs-a-peer-ids)
- *peerHeights*: stores for each peer the height it reported. initially 0
- *pendingBlocks*: stores for each height which peer was
  queried. initially nil for each height
- *receivedBlocks*: stores for each height which peer returned
  it. initially nil
- *blockstore*: stores for each height greater than
    *startBlock.Height*, the block of that height. initially nil for
    all heights
- *peerTimeStamp*: stores for each peer the last time a block was
  received

- *pendingTime*: stores for a given height the time a block was requested
- *peerRate*: stores for each peer the rate of received data in Bytes/second

### Auxiliary Functions

#### **[FS-FUNC-TARGET]**

- *TargetHeight = max ({peerHeights(addr): addr in peerIDs} union {height})*
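
A minimal Go sketch of this definition is given below. The representation of *peerIDs* as a slice of strings and of *peerHeights* as a map is an assumption of the sketch. Peers without a reported height count as height 0, matching the initial value of *peerHeights*.

```go
package main

import "fmt"

// TargetHeight returns the maximum over the heights reported by the peers in
// peerIDs and the local height, following [FS-FUNC-TARGET].
func TargetHeight(peerIDs []string, peerHeights map[string]int64, height int64) int64 {
	target := height
	for _, addr := range peerIDs {
		// A missing map entry yields 0, the initial reported height.
		if h := peerHeights[addr]; h > target {
			target = h
		}
	}
	return target
}

func main() {
	peers := []string{"p1", "p2"}
	// "gone" is no longer in peerIDs, so its height is ignored.
	heights := map[string]int64{"p1": 12, "p2": 9, "gone": 40}
	fmt.Println(TargetHeight(peers, heights, 5))
}
```

Including *height* in the maximum ensures that *TargetHeight* never drops below the local height, even if *peerIDs* is empty.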

#### **[FS-FUNC-MATCH]**

```go
func VerifyCommit(b Block, c Commit) Boolean
```

- Comment
    - Corresponds to `verifyCommit(chainID string, blockID
     types.BlockID, height int64, commit *types.Commit) error` in the
     current Golang implementation, which expects blockID and height
  (from the first block) and the
     corresponding commit from the following block. We use the
     simplified form for ease in presentation.

- Implementation remark
    <!-- - implements the check from -->
    <!--  [**[TMBC-SOUND-DISTR-PossCommit]**][TMBC-SOUND-DISTR-PossCommit--link], -->
    <!--  that is, that  *c* is a valid commit for block *b* -->
    - implements the check that  *c* is a valid commit for block *b*
- Expected precondition
    - *c* is a valid commit for block *b*
- Expected postcondition
    - *true* if precondition holds
    - *false* if precondition is violated
- Error condition
    - none

----

### Messages

Peers participating in FastSync exchange the following set of messages. Messages are
encoded using the Amino serialization protocol. We define each message here
using Go syntax, annotated with the Amino type name. The prefix `bc` refers to
`blockchain`, which is the name of the FastSync reactor in the Go
implementation.

#### bcBlockRequestMessage

```go
// type: "tendermint/blockchain/BlockRequest"
type bcBlockRequestMessage struct {
 Height int64
}
```

Remark:

- `msg.Height` > 0

#### bcNoBlockResponseMessage

```go
// type: "tendermint/blockchain/NoBlockResponse"
type bcNoBlockResponseMessage struct {
 Height int64
}
```

Remark:

- `msg.Height` > 0
- This message type is included in the protocol for convenience and is not expected to be sent between two correct peers

#### bcBlockResponseMessage

```go
// type: "tendermint/blockchain/BlockResponse"
type bcBlockResponseMessage struct {
 Block *types.Block
}
```

Remark:

- `msg.Block` is a Tendermint block as defined in [[block]].
- `msg.Block` != nil

#### bcStatusRequestMessage

```go
// type: "tendermint/blockchain/StatusRequest"
type bcStatusRequestMessage struct {
 Height int64
}
```

Remark:

- `msg.Height` > 0

#### bcStatusResponseMessage

```go
// type: "tendermint/blockchain/StatusResponse"
type bcStatusResponseMessage struct {
 Height int64
}
```

Remark:

- `msg.Height` > 0

### Remote Functions

Peers expose the following functions over
remote procedure calls. The "Expected precondition" are only expected for
correct peers (as no assumption is made on internals of faulty
processes [FS-A-PEER]). These functions are implemented using the above defined message types.

> In this document we describe the communication with peers
> via asynchronous RPCs.

```go
func Status(addr Address) (int64, error)
```

- Implementation remark
    - RPC to full node *addr*
    - Request message: `bcStatusRequestMessage`.
    - Response message: `bcStatusResponseMessage`.
- Expected precondition
    - none
- Expected postcondition
    - if *addr* is correct: Returns the current height `height` of the
    peer. [FS-A-COMM]
    - if *addr* is faulty: Returns an arbitrary height. [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link]
- Error condition
    - if *addr* is correct: none. By [FS-A-COMM] we assume communication is reliable and timely.
    - if *addr* is faulty: arbitrary error (including timeout). [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link]

----

```go
func Block(addr Address, height int64) (Block, error)
```

- Implementation remark
    - RPC to full node *addr*
    - Request message: `bcBlockRequestMessage`.
    - Response message: `bcBlockResponseMessage` or `bcNoBlockResponseMessage`.
- Expected precondition
    - `height` is less than or equal to height of the peer
- Expected postcondition
    - if *addr* is correct: Returns the block of height `height`
  from the blockchain. [FS-A-COMM]
    - if *addr* is faulty: Returns arbitrary or no block [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link]
- Error condition
    - if *addr* is correct: precondition violated (returns `bcNoBlockResponseMessage`). [FS-A-COMM]
    - if *addr* is faulty: arbitrary error (including timeout). [**[TMBC-AUTH-BYZ]**][TMBC-Auth-Byz-link]

----

## FastSync V2

### Outline

The protocol is described in terms of functions that are triggered by
(external) events. The implementation uses a scheduler and a
de-multiplexer to deal with communicating with peers and to
trigger the execution of these functions:

- `QueryStatus()`: regularly (currently every 10 sec; the interval is
  necessarily greater than *2 Delta*) queries all peers from *peerIDs*
  for their current height [TMBC-CORR-FULL]. It does so
  by calling `Status(n)` remotely on all peers *n*.
  
- `CreateRequest`: regularly checks whether certain blocks have no
  open request. If a block does not have an open request, it requests
  one from a peer. It does so by calling `Block(n,h)` remotely on one
  peer *n* for a missing height *h*.
  
> We have left the strategy for selecting peers unspecified, and
> the currently existing different implementations of Fastsync differ
> in this aspect. In V2, a peer *p* is selected with the minimum number of
> pending requests that can serve the required height *h*, that is
> with *peerHeight(p) >= h*.
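
The V2 selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the actual scheduler code: the function `SelectPeer`, the map-based bookkeeping, and the tie-breaking by address are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// SelectPeer returns a peer that reported a height >= h and has the fewest
// pending requests. ok is false if no peer can serve height h.
func SelectPeer(peerHeights map[string]int64, pending map[string]int, h int64) (addr string, ok bool) {
	// Iterate in sorted order so that ties are broken deterministically
	// (an assumption of this sketch).
	addrs := make([]string, 0, len(peerHeights))
	for a := range peerHeights {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs)

	for _, a := range addrs {
		if peerHeights[a] < h {
			continue // this peer cannot serve height h
		}
		if !ok || pending[a] < pending[addr] {
			addr, ok = a, true
		}
	}
	return addr, ok
}

func main() {
	heights := map[string]int64{"p1": 10, "p2": 20, "p3": 20}
	pending := map[string]int{"p1": 0, "p2": 3, "p3": 1}
	fmt.Println(SelectPeer(heights, pending, 15)) // p1 is too low, p3 is least loaded
}
```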

The functions `Status` and `Block` are called by asynchronous
RPC. When they return, the following functions are called:

- `OnStatusResponse(addr Address, height int64)`: The full node with
  address *addr* returns its current height. The function updates the height
  information about *addr*, and may also increase *TargetHeight*.
  
- `OnBlockResponse(addr Address, b Block)`: The full node with
  address *addr* returns a block. It is added to *blockstore*. Then
  the auxiliary function `Execute` is called.

- `Execute()`: Iterates over the *blockstore*.  Checks soundness of
  the blocks, and
  executes the transactions of a sound block and updates *state*.

> In addition to the functions above, the following two features are
> implemented in Fastsync V2

#### **[FS-V2-PEER-REMOVE]**

Periodically, *peerTimeStamp* and *peerRate* and *pendingTime* are
analyzed.
If a peer *p*
has not provided a block recently (check of *peerTimeStamp[p]*) or it
has not provided sufficient data (check of *peerRate[p]*), then
*p* is removed from *peerIDs*. In addition, *pendingTime* is used to
estimate whether the peer that is responsible for the current height
has provided the corresponding block on time.
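
The periodic check on *peerTimeStamp* and *peerRate* might look like the following sketch. The type `PeerInfo`, the function `PrunePeers`, and both thresholds are hypothetical, since this specification does not fix concrete values. The *pendingTime* check is not modelled here.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// PeerInfo holds the per-peer data used by [FS-V2-PEER-REMOVE].
type PeerInfo struct {
	LastBlock time.Time // corresponds to peerTimeStamp[p]
	Rate      float64   // corresponds to peerRate[p], in bytes/second
}

// PrunePeers removes every peer that has been silent for longer than
// maxSilence or whose rate is below minRate. It returns the removed
// addresses in sorted order.
func PrunePeers(peers map[string]PeerInfo, now time.Time, maxSilence time.Duration, minRate float64) []string {
	var removed []string
	for addr, p := range peers {
		if now.Sub(p.LastBlock) > maxSilence || p.Rate < minRate {
			delete(peers, addr) // deleting during range is safe in Go
			removed = append(removed, addr)
		}
	}
	sort.Strings(removed)
	return removed
}

// at is a small helper that builds a timestamp from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	peers := map[string]PeerInfo{
		"fast":   {LastBlock: at(995), Rate: 5000},
		"silent": {LastBlock: at(900), Rate: 5000},
		"slow":   {LastBlock: at(999), Rate: 10},
	}
	removed := PrunePeers(peers, at(1000), 15*time.Second, 1000)
	fmt.Println(removed, len(peers))
}
```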

#### **[FS-V2-TIMEOUT]**

*Fastsync V2* starts a timeout whenever a block is
executed (that is, when the height is incremented). If the timeout expires
before the next block is executed, *Fastsync* terminates
with failure.

### Details

<!--
> Function signatures followed by pseudocode (optional) and a list of features (required):
> - Implementation remarks (optional)
>   - e.g. (local/remote) function called in the body of this function
> - Expected precondition
> - Expected postcondition
> - Error condition
---->

```go
func QueryStatus()
```

- Expected precondition
    - peerIDs initialized and non-empty
- Expected postcondition
    - call asynchronously `Status(n)` at each peer *n* in *peerIDs*.
- Error condition
    - fails if precondition is violated

----

```go
func OnStatusResponse(addr Address, ht int64)
```

- Comment
    - *ht* is a height
    - peers can provide the status without being called
- Expected precondition
    - *peerHeights(addr) <= ht*
- Expected postcondition
    - *peerHeights(addr) = ht*
    - *TargetHeight* is updated
- Error condition
    - if precondition is violated: *addr* not in *peerIDs* (that is,
      *addr* is removed from *peerIDs*)
- Timeout condition
    - if `OnStatusResponse(addr, ht)` was not invoked within *2 Delta* after
 `Status(addr)` was called:  *addr* not in *peerIDs*
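
The pre- and postconditions of `OnStatusResponse` can be sketched as below. The `Scheduler` type and its fields are hypothetical stand-ins for the protocol variables. Ignoring responses from addresses outside *peerIDs* is an assumption of this sketch, and the timeout condition is not modelled.

```go
package main

import "fmt"

// Scheduler holds the protocol variables touched by OnStatusResponse.
type Scheduler struct {
	Height      int64            // height of the next block to download
	PeerIDs     map[string]bool  // set of known peers
	PeerHeights map[string]int64 // last height reported by each peer
}

// OnStatusResponse records the height ht reported by addr. A peer that
// reports a lower height than before violates the precondition and is
// removed. It returns false if the response was rejected.
func (s *Scheduler) OnStatusResponse(addr string, ht int64) bool {
	if !s.PeerIDs[addr] {
		return false // unknown peer: ignored in this sketch
	}
	if ht < s.PeerHeights[addr] {
		delete(s.PeerIDs, addr)
		delete(s.PeerHeights, addr)
		return false
	}
	s.PeerHeights[addr] = ht
	return true
}

// TargetHeight recomputes [FS-FUNC-TARGET] from the current variables.
func (s *Scheduler) TargetHeight() int64 {
	target := s.Height
	for addr := range s.PeerIDs {
		if h := s.PeerHeights[addr]; h > target {
			target = h
		}
	}
	return target
}

func main() {
	s := &Scheduler{Height: 5, PeerIDs: map[string]bool{"p1": true}, PeerHeights: map[string]int64{}}
	fmt.Println(s.OnStatusResponse("p1", 10), s.TargetHeight())
	fmt.Println(s.OnStatusResponse("p1", 7), s.TargetHeight()) // decreasing height: p1 is removed
}
```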

----

```go
func CreateRequest()
```

- Expected precondition
    - *height < TargetHeight*
    - *peerIDs* nonempty
- Expected postcondition
    - Function `Block` is called remotely at a peer *addr* in peerIDs
   for a missing height *h*  
   *Remark:* different implementations may have different
      strategies to balance the load over the peers
    - *pendingBlocks(h) = addr*

----

```go
func OnBlockResponse(addr Address, b Block)
```

- Comment
    - if after adding block *b*, blocks of heights *height* and
      *height + 1* are in *blockstore*, then `Execute` is called
- Expected precondition
    - *pendingBlocks(b.Height) = addr*
    - *b* satisfies basic soundness  
- Expected postcondition
    - if function `Execute` has been executed without error or was not
      executed:
        - *receivedBlocks(b.Height) = addr*
        - *blockstore(b.Height) = b*
        - *peerTimeStamp[addr]* is set to a time between invocation and
          return of the function.
        - *peerRate[addr]* is updated according to the size of the received
          block and the time that has passed since the last block was received from this peer (*addr*)
- Error condition
    - if precondition is violated: *addr* not in *peerIDs*; reset
 *pendingBlocks(b.Height)* to nil;
- Timeout condition
    - if `OnBlockResponse(addr, b)` was not invoked within *2 Delta* after
 `Block(addr,h)` was called for *b.Height = h*: *addr* not in *peerIDs*

----

```go
func Execute()
```

- Comments
    - none
- Expected precondition
    - application state is the one of the blockchain at height
      *height - 1*
    - **[FS-V2-Verif]** for any two blocks *a* and *b* from
 *receivedBlocks*: if
   *a.Height + 1 = b.Height* then *VerifyCommit (a,b.Commit) = true*
- Expected postcondition
    - Any two blocks *a* and *b* violating [FS-V2-Verif]:
   *a* and *b* not in *blockstore*; nodes with Address
   receivedBlocks(a.Height) and receivedBlocks(b.Height) not in peerIDs
    - *height* is updated to reflect the complete verified prefix that matches the blockchain
    - *state* is the one of the blockchain at height *height - 1*
    - if the new value of *height* is equal to *TargetHeight*, then
 Fastsync
 **terminates
   successfully**.
- Error condition
    - none
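
The control flow of `Execute` can be sketched as follows. This is a simplified model, not the Tendermint implementation: `Block` is a hypothetical stand-in with string identifiers, `VerifyCommit` only compares those identifiers instead of checking signatures, and applying transactions to *state* is omitted.

```go
package main

import "fmt"

// Block is a simplified stand-in for a Tendermint block.
type Block struct {
	Height    int64
	Hash      string // hypothetical block identifier
	CommitFor string // identifier of the previous block covered by this block's commit
}

// VerifyCommit is a placeholder for [FS-FUNC-MATCH]: it checks that the
// commit carried by b refers to a. Real verification checks signatures.
func VerifyCommit(a, b Block) bool { return b.CommitFor == a.Hash }

// FastSync holds the protocol variables used by Execute.
type FastSync struct {
	Height         int64 // height of the next block to execute
	TargetHeight   int64
	Blockstore     map[int64]Block
	ReceivedBlocks map[int64]string // height -> peer that delivered the block
	PeerIDs        map[string]bool
}

// Execute verifies and "executes" consecutive blocks. It returns true if
// Fastsync terminates successfully.
func (fs *FastSync) Execute() bool {
	for {
		a, okA := fs.Blockstore[fs.Height]
		b, okB := fs.Blockstore[fs.Height+1]
		if !okA || !okB {
			return false // block b is needed to verify block a
		}
		if !VerifyCommit(a, b) {
			// [FS-V2-Verif] is violated: drop both blocks and both peers.
			for _, h := range []int64{a.Height, b.Height} {
				delete(fs.PeerIDs, fs.ReceivedBlocks[h])
				delete(fs.ReceivedBlocks, h)
				delete(fs.Blockstore, h)
			}
			return false
		}
		// Applying the transactions of a to state is omitted here.
		fs.Height++
		if fs.Height == fs.TargetHeight {
			return true
		}
	}
}

func main() {
	fs := &FastSync{
		Height: 1, TargetHeight: 3,
		Blockstore: map[int64]Block{
			1: {1, "h1", "h0"}, 2: {2, "h2", "h1"}, 3: {3, "h3", "h2"},
		},
		ReceivedBlocks: map[int64]string{1: "p1", 2: "p2", 3: "p1"},
		PeerIDs:        map[string]bool{"p1": true, "p2": true},
	}
	fmt.Println(fs.Execute(), fs.Height)
}
```

In this sketch a failed verification removes both peers, even though only one of the two blocks may be faulty.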

----

## Algorithm Invariants

> In contrast to the temporal properties above that define the problem
> statement, the following are invariants on the solution to the
> problem, that is on the algorithm. These invariants are useful for
> the verification, but can also guide the implementation.

#### **[FS-VAR-STATE-INV]**

It is always the case that *state* corresponds to the application state of the
blockchain of that height, that is, *state = chain[height -
1].AppState*; *chain* is defined in
[**[TMBC-SEQ]**][TMBC-SEQ-link].

#### **[FS-VAR-PEER-INV]**

It is always the case that the set *peerIDs* only contains nodes that
have not yet misbehaved (by sending wrong data or timing out).

#### **[FS-VAR-BLOCK-INV]**

For *startBlock.Height <= i < height - 1*, let *b(i)* be the block with
height *i* in *blockstore*, it always holds that
*VerifyCommit(b(i), b(i+1).Commit) = true*. This means that *height*
can only be incremented if all blocks with lower height have been verified.

# Part V - Analysis and Improvements

## Analysis of Fastsync V2

#### **[FS-ISSUE-KILL]**

If two blocks are not matching [FS-V2-Verif], `Execute` dismisses both
blocks and removes the peers that provided these blocks from
*peerIDs*. If block *a* was correct and provided by a correct peer *p*,
and block b was faulty and provided by a faulty peer, the protocol

- removes the correct peer *p*, although it might be useful to
  download blocks from it in the future
- removes the block *a*, so that a fresh copy of *a* needs to be downloaded
  again from another peer
  
By [FS-A-PEER] we do not put a restriction on the number
  of faulty peers, so that faulty peers can make *FS* remove all
  correct peers from *peerIDs*. As a result, this version of
  *Fastsync* violates [FS-DIST-SAFE-SYNC].

#### **[FS-ISSUE-NON-TERM]**

Due to [**[FS-ISSUE-KILL]**](#fs-issue-kill), from some point on, only
faulty peers may be in *peerIDs*. They can thus control at which rate
*Fastsync* gets blocks. If the timeout duration from [FS-V2-TIMEOUT]
is greater than the time it takes to add a block to the blockchain
(*ETIME* in [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link]), the
protocol may never terminate and thus violate [FS-DIST-LIVE].  This
scenario is even possible if a correct peer is always in *peerIDs*,
but faulty peers are regularly asked for blocks.

### Consequence

The issues [FS-ISSUE-KILL] and [FS-ISSUE-NON-TERM] explain why Fastsync V2
does not satisfy the property [FS-DIST-LIVE] relevant for termination.
As a result, V2 only solves the specifications in a restricted form,
namely, when all peers are correct:

#### **[FS-ALL-CORR-PEER]**

At all times, the set *peerIDs* contains only correct full nodes.

With this restriction we can give the achieved properties:

#### **[FS-VC-ALL-CORR-NONABORT]**

Under [FS-ALL-CORR-PEER], *Fastsync* never terminates with failure.

#### **[FS-VC-ALL-CORR-LIVE]**

Under [FS-ALL-CORR-PEER], *Fastsync* eventually terminates successfully.

> In a fault tolerance context this is problematic,
> as it means that faulty peers can prevent *Fastsync* from terminating.
> We observe that this also touches other properties, namely,
> [FS-DIST-SAFE-START] and [FS-DIST-SAFE-SYNC]:
> their guarantees about termination at an acceptable height are all
> conditional on "successful termination". The properties above severely
> restrict the circumstances under which *Fastsync* (V2) terminates successfully.
> As a result, large parts of the current
> implementation of *Fastsync* are not fault-tolerant. We will
> discuss this, and suggestions for how to solve it, after the
> description of the current protocol.

## Suggestions for an Improved Fastsync Implementation

### Solution for [FS-ISSUE-KILL]

To avoid [FS-ISSUE-KILL], we observe that by
[**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link] we may assume that, from the
point a block was created, more than two thirds of the
validator nodes are correct until the *trustingPeriod* expires.  Under
this assumption, assume the trusting period of *startBlock* has not
expired by the time *Fastsync* checks a block *b1* with height
*startBlock.Height + 1*. To check *b1*, we first need to check whether the
Commit in the block *b2* with height *startBlock.Height + 2* contains more
than 2/3 of the voting power in *startBlock.NextValidators*. If this
is the case we can check *VerifyCommit (b1,b2.Commit)*. If we perform
the checks in this order we observe:

- By assumption, *startBlock* is OK.
- If the first check (2/3 of voting power) fails,
    the peer that provided block *b2* is faulty.
- If the first check passes and the second check
    fails (*VerifyCommit*), then the peer that provided *b1* is
    faulty.
- If both checks pass, we can trust *b1*.

Based on this reasoning, we can ensure that only faulty peers are removed
from *peerIDs*.  That is, if
we sequentially verify blocks starting with *startBlock*, we will
never remove a correct peer from *peerIDs* and we will be able to
ensure the following invariant:

#### **[NewFS-VAR-PEER-INV]**

If a peer never misbehaves, it is never removed from *peerIDs*. It
follows that under [FS-SOME-CORR-PEER], *peerIDs* is always non-empty.

> To ensure this, we suggest to change the protocol as follows:

#### Fastsync has the following configuration parameters

- *trustingPeriod*: a time duration; cf.
  [**[TMBC-TIME-PARAMS]**][TMBC-TIME-PARAMS-link].

> [NewFS-A-INIT] is the suggested replacement of [FS-A-V2-INIT]. This will
> allow us to use the established trust to understand precisely which
> peer reported an invalid block in order to ensure the
> invariant [NewFS-VAR-TRUST-INV] below:

#### **[NewFS-A-INIT]**

- *startBlock* is from the blockchain, and within *trustingPeriod*
  (possibly with some extra margin to ensure termination before
  *trustingPeriod* expires)
- *startState* is the application state of the blockchain at Height
  *startBlock.Height*.
- *startHeight = startBlock.Height*

#### Additional Variables

- *trustedBlockstore*: stores for each height greater than or equal to
    *startBlock.Height*, the block of that height. Initially it
    contains only *startBlock*

#### **[NewFS-VAR-TRUST-INV]**

Let *b(i)* be the block in *trustedBlockstore*
with *b(i).Height = i*. It holds that
for *startHeight < i < height - 1*,
*VerifyCommit (b(i),b(i+1).Commit) = true*.
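
To make the index range of the invariant concrete, the following sketch checks it for given values of *startHeight* and *height*. The function name `TrustInvariantHolds` and the callback `verify` are assumptions for illustration; `verify(i)` stands in for *VerifyCommit (b(i),b(i+1).Commit)* on the blocks in *trustedBlockstore*.

```go
package main

import "fmt"

// TrustInvariantHolds checks [NewFS-VAR-TRUST-INV] for the open interval
// startHeight < i < height - 1. The callback verify(i) is a stand-in for
// VerifyCommit(b(i), b(i+1).Commit) on the blocks in trustedBlockstore.
func TrustInvariantHolds(startHeight, height int64, verify func(i int64) bool) bool {
	for i := startHeight + 1; i < height-1; i++ {
		if !verify(i) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical scenario: every commit verifies except the one for height 13.
	verify := func(i int64) bool { return i != 13 }
	fmt.Println(TrustInvariantHolds(10, 13, verify)) // checks i = 11 only
	fmt.Println(TrustInvariantHolds(10, 15, verify)) // checks i = 11, 12, 13
}
```

Note that the largest block consulted is *b(height - 1)*, which is the most recent block that `SequentialVerify` may have added to *trustedBlockstore*.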

> We propose to update the function `Execute`. To do so, we first
> define the following helper functions:

```go
func ValidCommit(VS ValidatorSet, C Commit) Boolean
```

- Comments
    - checks validator set based on [**[TMBC-FM-2THIRDS]**][TMBC-FM-2THIRDS-link]
- Expected precondition
    - The validators in *C*
        - are a subset of VS
        - have more than 2/3 of the voting power in VS
- Expected postcondition
    - returns *true* if precondition holds, and *false* otherwise
- Error condition
    - none
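
A minimal sketch of how `ValidCommit` could be realised is given below. The types are simplified assumptions for this example and are not the data structures of the specification: a validator set is modelled as a map from a validator address to its voting power, and a commit as the list of addresses that signed. Signature verification is deliberately left out.

```go
package main

import "fmt"

// ValidatorSet is a hypothetical, simplified validator set:
// it maps a validator address to its voting power.
type ValidatorSet map[string]int64

// Commit is a hypothetical, simplified commit: the addresses that signed.
type Commit struct {
	Signers []string
}

// ValidCommit reports whether the signers of c are a subset of vs and
// hold more than 2/3 of the total voting power of vs.
func ValidCommit(vs ValidatorSet, c Commit) bool {
	var total, signed int64
	for _, power := range vs {
		total += power
	}
	seen := make(map[string]bool)
	for _, addr := range c.Signers {
		power, ok := vs[addr]
		if !ok {
			return false // signer is not in the validator set
		}
		if seen[addr] {
			continue // count each validator at most once
		}
		seen[addr] = true
		signed += power
	}
	// Integer form of signed > (2/3) * total, which avoids rounding.
	return 3*signed > 2*total
}

func main() {
	vs := ValidatorSet{"v1": 10, "v2": 10, "v3": 10, "v4": 10}
	fmt.Println(ValidCommit(vs, Commit{Signers: []string{"v1", "v2", "v3"}}))      // true
	fmt.Println(ValidCommit(vs, Commit{Signers: []string{"v1", "v2"}}))            // false
	fmt.Println(ValidCommit(vs, Commit{Signers: []string{"v1", "v2", "v3", "x"}})) // false
}
```

The comparison is written as `3*signed > 2*total` so that exactly two thirds of the voting power is rejected, in line with "more than 2/3" in the precondition.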

----

```go
func SequentialVerify() {
    for {
        b1 := blockstore[height]
        b2 := blockstore[height+1]
        if b1 == nil || b2 == nil {
            // nothing more to verify until further blocks arrive
            return
        }
        if ValidCommit(trustedBlockstore[height-1].NextValidators, b2.Commit) {
            // we trust b2
            if VerifyCommit(b1, b2.Commit) {
                trustedBlockstore.Add(b1)
                height = height + 1
            } else {
                // as we trust b2, b1 must be faulty
                faulty := receivedBlocks[height]
                // we remove all blocks received from the faulty peer
                blockstore.RemoveFromPeer(faulty)
                peerIDs.Remove(faulty)
                return
            }
        } else {
            // b2 is faulty
            faulty := receivedBlocks[height+1]
            // we remove all blocks received from the faulty peer
            blockstore.RemoveFromPeer(faulty)
            peerIDs.Remove(faulty)
            return
        }
    }
}
```

- Comments
    - none
- Expected precondition
    - [NewFS-VAR-TRUST-INV]
- Expected postcondition
    - [NewFS-VAR-TRUST-INV]
    - there is no block *bnew* with *bnew.Height = height + 1* in
      *blockstore*
- Error condition
    - none

----

> Then `Execute` just consists in calling `SequentialVerify` and then
> updating the application state to the (new) height.

```go
func Execute()
```

- Comments
    - first `SequentialVerify` is executed
- Expected precondition
    - application state is the one of the blockchain at height
      *height - 1*
    - [NewFS-NOT-EXP] *trustedBlockstore[height-1].Time > now - trustingPeriod*
- Expected postcondition
    - there is no block *bnew* with *bnew.Height = height + 1* in
      *blockstore*
    - state is the one of the blockchain at height *height - 1*
    - if height = TargetHeight: **terminate successfully**
- Error condition
    - fails if [NewFS-NOT-EXP] is violated

----

### Solution for [FS-ISSUE-NON-TERM]

As discussed above, the desired termination requirement is the
combination of [FS-DIST-LIVE] and [FS-DIST-NONABORT], that is, *Fastsync*
should terminate successfully in case there is at least one correct
peer in *peerIDs*. For this we have to ensure that faulty peers
cannot slow us down by providing blocks at a lower rate than the
blockchain may grow. To ensure that, we have to add an assumption
on message delays.

#### **[NewFS-A-DELTA]**

*2 Delta < ETIME*; cf. [**[TMBC-SEQ-APPEND-E]**][TMBC-SEQ-APPEND-E-link].

> This assumption implies that the timeouts for `OnBlockResponse` and
> `OnStatusResponse` are such that a faulty peer that tries to respond
> slower than *2 Delta* will be removed. In the following we will
> provide a rough estimate on termination time in a fault-prone
> scenario. We assume that during a "long enough" finite good period no new
> faulty peers are added to *peerIDs*. Below we will sketch how "long
> enough" can be estimated based on the timing assumption in this
> specification.

#### **[NewFS-A-STATUS-INTERVAL]**

Let Sigma be the (upper bound on the)
time between two calls of `QueryStatus()`.

#### **[NewFS-A-GOOD-PERIOD]**

A time interval *[begin,end]* is a *good period* if:

- *fmax* is the number of faulty peers in *peerIDs* at time *begin*
- *end >= begin + 2 Delta (fmax + 3)*
- no faulty peer is added before time *end*

> In the analysis below we assume that the termination condition of
> *Fastsync* is
> *height = TargetHeight* in the postcondition of
> `Execute`. Therefore, [NewFS-A-STATUS-INTERVAL] does not interfere
> with this analysis. If a correct peer reports a new height "shortly
> before termination" this leads to an additional round trip to
> request and add the new block. Then [NewFS-A-DELTA] ensures that
> *Fastsync* catches up.

Arguments:

1. If a faulty peer *p* reports a faulty block, `SequentialVerify` will
  eventually remove *p* from *peerIDs*
  
2. By `SequentialVerify`, if a faulty peer *p* reports multiple faulty
  blocks, *p* will be removed upon trying to check the block with the
  smallest height received from *p*.

3. Assume whenever a block does not have an open request, `CreateRequest` is
   called immediately, which calls `Block(n)` on a peer. Say this
   happens at time *t*. There are two cases:

   - by *t + 2 Delta* a block is added to *blockstore*
   - at *t + 2 Delta* `Block(n)` timed out and *n* is removed from
     peer.

4. Let *f(t)* be the number of faulty peers in *peerIDs* at time *t*;  
   *f(begin) = fmax*.

5. Let *t_i* be the sequence of times `OnBlockResponse(addr,b)` is
   invoked or times out with *b.Height = height + 1*.

6. By 3.,
   - (a). *t_1 <= begin + 2 Delta*
   - (b). *t_{i+1} <= t_i + 2 Delta*

7. By an inductive argument we prove for *i > 0* that

   - (a). *height(t_{i+1}) > height(t_i)*, or
   - (b). *f(t_{i+1}) < f(t_i)* and *height(t_{i+1}) = height(t_i)*

   Argument: if the peer is faulty and does not return a block, the
   peer is removed; if it is faulty and returns a faulty block,
   `SequentialVerify` removes the peer (b). If the returned block is OK,
   height is increased (a).
  
8. By 2. and 7., faulty peers can delay incrementing the height at
   most *fmax* times, where each time "costs" *2 Delta* seconds. We
   have an additional *2 Delta* initial offset (6a) plus *2 Delta* to get
   all missing blocks after the last fault showed itself. (This
   assumes that an arbitrary number of blocks can be obtained and
   checked within one round-trip *2 Delta*, which needs either a
   conservative estimation of Delta or a more refined analysis.) Thus
   we reach the *TargetHeight* and terminate by time *end*.

# References

<!--
> links to other specifications/ADRs this document refers to
---->

[[block]] Specification of the block data structure.

<!-- [[blockchain]] The specification of the Tendermint blockchain. Tags refering to this specification are labeled [TMBC-*]. -->

[block]: https://github.com/tendermint/spec/blob/d46cd7f573a2c6a2399fcab2cde981330aa63f37/spec/core/data_structures.md

<!-- [blockchain]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md -->

[TMBC-HEADER-link]: #tmbc-header

[TMBC-SEQ-link]: #tmbc-seq

[TMBC-CORR-FULL-link]: #tmbc-corrfull

[TMBC-CORRECT-link]: #tmbc-correct

[TMBC-Sign-link]: #tmbc-sign

[TMBC-FaultyFull-link]: #tmbc-faultyfull

[TMBC-TIME-PARAMS-link]: #tmbc-time-params

[TMBC-SEQ-APPEND-E-link]: #tmbc-seq-append-e

[TMBC-FM-2THIRDS-link]: #tmbc-fm-2thirds

[TMBC-Auth-Byz-link]: #tmbc-auth-byz

[TMBC-INV-SIGN-link]: #tmbc-inv-sign

[TMBC-SOUND-DISTR-PossCommit--link]: #tmbc-sound-distr-posscommit

[TMBC-INV-VALID-link]: #tmbc-inv-valid

<!-- [TMBC-HEADER-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-header -->

<!-- [TMBC-SEQ-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-seq -->

<!-- [TMBC-CORR-FULL-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-corrfull -->

<!-- [TMBC-Sign-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-sign -->

<!-- [TMBC-FaultyFull-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-faultyfull -->

<!-- [TMBC-TIME-PARAMS-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-time-params -->

<!-- [TMBC-SEQ-APPEND-E-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-seq-append-e -->

<!-- [TMBC-FM-2THIRDS-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-fm-2thirds -->

<!-- [TMBC-Auth-Byz-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-auth-byz -->

<!-- [TMBC-INV-SIGN-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-inv-sign -->

<!-- [TMBC-SOUND-DISTR-PossCommit--link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-sound-distr-posscommit -->

<!-- [TMBC-INV-VALID-link]: https://github.com/informalsystems/VDD/tree/master/blockchain/blockchain.md#tmbc-inv-valid -->

[LCV-VC-LIVE-link]: https://github.com/informalsystems/VDD/tree/master/lightclient/verification.md#lcv-vc-live

[lightclient]: https://github.com/interchainio/tendermint-rs/blob/e2cb9aca0b95430fca2eac154edddc9588038982/docs/architecture/adr-002-lite-client.md

[failuredetector]: https://github.com/informalsystems/VDD/blob/master/liteclient/failuredetector.md

[fullnode]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md

[FN-LuckyCase-link]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#fn-luckycase

[blockchain-validator-set]: https://github.com/tendermint/spec/blob/d46cd7f573a2c6a2399fcab2cde981330aa63f37/spec/core/data_structures.md#data-structures

[fullnode-data-structures]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#data-structures

[FN-ManifestFaulty-link]: https://github.com/tendermint/spec/blob/master/spec/blockchain/fullnode.md#fn-manifestfaulty