
Add Section for P2P (#53)

* Add Section for P2P

- moved over the section on p2p

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>

* add some more files

Signed-off-by: Marko Baricevic <marbar3778@yahoo.com>
Marko 5 years ago
committed by GitHub
commit 513c67230f
24 changed files with 2170 additions and 5 deletions
  1. spec/README.md (+77 -5)
  2. spec/p2p/config.md (+38 -0)
  3. spec/p2p/connection.md (+111 -0)
  4. spec/p2p/node.md (+66 -0)
  5. spec/p2p/peer.md (+119 -0)
  6. spec/reactors/block_sync/bcv1/img/bc-reactor-new-datastructs.png (BIN)
  7. spec/reactors/block_sync/bcv1/img/bc-reactor-new-fsm.png (BIN)
  8. spec/reactors/block_sync/bcv1/img/bc-reactor-new-goroutines.png (BIN)
  9. spec/reactors/block_sync/bcv1/impl-v1.md (+237 -0)
  10. spec/reactors/block_sync/img/bc-reactor-routines.png (BIN)
  11. spec/reactors/block_sync/img/bc-reactor.png (BIN)
  12. spec/reactors/block_sync/impl.md (+44 -0)
  13. spec/reactors/block_sync/reactor.md (+308 -0)
  14. spec/reactors/consensus/consensus-reactor.md (+353 -0)
  15. spec/reactors/consensus/consensus.md (+184 -0)
  16. spec/reactors/consensus/proposer-selection.md (+291 -0)
  17. spec/reactors/evidence/reactor.md (+10 -0)
  18. spec/reactors/mempool/concurrency.md (+8 -0)
  19. spec/reactors/mempool/config.md (+54 -0)
  20. spec/reactors/mempool/functionality.md (+43 -0)
  21. spec/reactors/mempool/messages.md (+61 -0)
  22. spec/reactors/mempool/reactor.md (+22 -0)
  23. spec/reactors/pex/pex.md (+132 -0)
  24. spec/reactors/pex/reactor.md (+12 -0)

spec/README.md (+77 -5)

@@ -1,9 +1,81 @@
# Spec
# Overview
This folder houses the spec of Tendermint the Protocol.
This is a markdown specification of the Tendermint blockchain.
It defines the base data structures, how they are validated,
and how they are communicated over the network.
**Note: We are currently working on expanding the spec and will slowly be migrating it from Tendermint the repo**
If you find discrepancies between the spec and the code that
do not have an associated issue or pull request on github,
please submit them to our [bug bounty](https://tendermint.com/security)!
### Table of Contents
## Contents
- [ABCI Spec](./abci/README.md)
- [Overview](#overview)
### Data Structures
- [Encoding and Digests](./blockchain/encoding.md)
- [Blockchain](./blockchain/blockchain.md)
- [State](./blockchain/state.md)
### Consensus Protocol
- [Consensus Algorithm](./consensus/consensus.md)
- [Creating a proposal](./consensus/creating-proposal.md)
- [Time](./consensus/bft-time.md)
- [Light-Client](./consensus/light-client.md)
### P2P and Network Protocols
- [The Base P2P Layer](./p2p/): multiplex the protocols ("reactors") on authenticated and encrypted TCP connections
- [Peer Exchange (PEX)](./reactors/pex/): gossip known peer addresses so peers can find each other
- [Block Sync](./reactors/block_sync/): gossip blocks so peers can catch up quickly
- [Consensus](./reactors/consensus/): gossip votes and block parts so new blocks can be committed
- [Mempool](./reactors/mempool/): gossip transactions so they get included in blocks
- [Evidence](./reactors/evidence/): sending invalid evidence will stop the peer
### Software
- [ABCI](./software/abci.md): Details about interactions between the
application and consensus engine over ABCI
- [Write-Ahead Log](./software/wal.md): Details about how the consensus
engine preserves data and recovers from crash failures
## Overview
Tendermint provides Byzantine Fault Tolerant State Machine Replication using
hash-linked batches of transactions. Such transaction batches are called "blocks".
Hence, Tendermint defines a "blockchain".
Each block in Tendermint has a unique index - its Height.
Heights in the blockchain are monotonic.
Each block is committed by a known set of weighted Validators.
Membership and weighting within this validator set may change over time.
Tendermint guarantees the safety and liveness of the blockchain
so long as less than 1/3 of the total weight of the Validator set
is malicious or faulty.
A commit in Tendermint is a set of signed messages from more than 2/3 of
the total weight of the current Validator set. Validators take turns proposing
blocks and voting on them. Once enough votes are received, the block is considered
committed. These votes are included in the _next_ block as proof that the previous block
was committed - they cannot be included in the current block, as that block has already been
created.
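To make the +2/3 rule concrete, here is a small, self-contained Go sketch (not taken from the Tendermint codebase; the `validator` type and `hasQuorum` helper are invented for illustration) that checks whether a set of signers carries more than two-thirds of the total validator weight.

```go
package main

import "fmt"

// validator is a toy stand-in for an entry in the validator set.
type validator struct {
	name   string
	weight int64
}

// hasQuorum reports whether the signers' combined weight is strictly greater
// than 2/3 of the total weight. Using 3*signed > 2*total keeps the check in
// integer arithmetic.
func hasQuorum(vals []validator, signed map[string]bool) bool {
	var total, signedWeight int64
	for _, v := range vals {
		total += v.weight
		if signed[v.name] {
			signedWeight += v.weight
		}
	}
	return 3*signedWeight > 2*total
}

func main() {
	vals := []validator{{"a", 10}, {"b", 10}, {"c", 10}, {"d", 10}}
	// 3 of 4 equal-weight validators signed: 30 > 2/3 of 40, so the block commits.
	fmt.Println(hasQuorum(vals, map[string]bool{"a": true, "b": true, "c": true})) // true
	// 2 of 4 signed: 20 is not > 2/3 of 40, so it does not.
	fmt.Println(hasQuorum(vals, map[string]bool{"a": true, "b": true})) // false
}
```
With four equal-weight validators, three signatures commit a block while two do not.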
Once a block is committed, it can be executed against an application.
The application returns results for each of the transactions in the block.
The application can also return changes to be made to the validator set,
as well as a cryptographic digest of its latest state.
Tendermint is designed to enable efficient verification and authentication
of the latest state of the blockchain. To achieve this, it embeds
cryptographic commitments to certain information in the block "header".
This information includes the contents of the block (eg. the transactions),
the validator set committing the block, as well as the various results returned by the application.
Note, however, that block execution only occurs _after_ a block is committed.
Thus, application results can only be included in the _next_ block.
Also note that information like the transaction results and the validator set are never
directly included in the block - only their cryptographic digests (Merkle roots) are.
Hence, verification of a block requires a separate data structure to store this information.
We call this the `State`. Block verification also requires access to the previous block.

spec/p2p/config.md (+38 -0)

@@ -0,0 +1,38 @@
# P2P Config
Here we describe configuration options around the Peer Exchange.
These can be set using flags or via the `$TMHOME/config/config.toml` file.
## Seed Mode
`--p2p.seed_mode`
The node operates in seed mode. In seed mode, a node continuously crawls the network for peers,
and upon incoming connection shares some peers and disconnects.
## Seeds
`--p2p.seeds "id100000000000000000000000000000000@1.2.3.4:26656,id200000000000000000000000000000000@2.3.4.5:4444"`
Dial these seeds when we need more peers. They should return a list of peers and then disconnect.
If we already have enough peers in the address book, we may never need to dial them.
## Persistent Peers
`--p2p.persistent_peers "id100000000000000000000000000000000@1.2.3.4:26656,id200000000000000000000000000000000@2.3.4.5:26656"`
Dial these peers and auto-redial them if the connection fails.
These are intended to be trusted persistent peers that can help
anchor us in the p2p network. The auto-redial uses exponential
backoff and will give up after a day of trying to connect.
**Note:** If `seeds` and `persistent_peers` intersect,
the user will be warned that seeds may auto-close connections
and that the node may not be able to keep the connection persistent.
## Private Peers
`--p2p.private_peer_ids "id100000000000000000000000000000000,id200000000000000000000000000000000"`
These are IDs of the peers that we do not add to the address book or gossip to
other peers. They stay private to us.
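For illustration, the options above could be grouped as follows; this is a hypothetical sketch, not the actual Tendermint `config` package, and the field names and `toml` tags are assumptions chosen to mirror the flags described here.

```go
// Hypothetical sketch of how the options above might be grouped when loaded
// from $TMHOME/config/config.toml. Field names and tags are illustrative.
type P2PConfig struct {
	// Comma-separated list of seed nodes to ask for peers ("ID@host:port").
	Seeds string `toml:"seeds"`

	// Comma-separated list of peers to dial and keep redialing on failure.
	PersistentPeers string `toml:"persistent_peers"`

	// Comma-separated list of peer IDs that are never added to the address
	// book or gossiped to other peers.
	PrivatePeerIDs string `toml:"private_peer_ids"`

	// If true, continuously crawl the network for peers and, on incoming
	// connections, share some peers and disconnect.
	SeedMode bool `toml:"seed_mode"`
}
```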

spec/p2p/connection.md (+111 -0)

@@ -0,0 +1,111 @@
# P2P Multiplex Connection
## MConnection
`MConnection` is a multiplex connection that supports multiple independent streams
with distinct quality of service guarantees atop a single TCP connection.
Each stream is known as a `Channel` and each `Channel` has a globally unique _byte id_.
Each `Channel` also has a relative priority that determines the quality of service
of the `Channel` compared to other `Channel`s.
The _byte id_ and the relative priorities of each `Channel` are configured upon
initialization of the connection.
The `MConnection` supports three packet types:
- Ping
- Pong
- Msg
### Ping and Pong
The ping and pong messages consist of writing a single byte to the connection; 0x1 and 0x2, respectively.
When we haven't received any messages on an `MConnection` in time `pingTimeout`, we send a ping message.
When a ping is received on the `MConnection`, a pong is sent in response only if there are no other messages
to send and the peer has not sent us too many pings (TODO).
If a pong or message is not received in sufficient time after a ping, we disconnect from the peer.
### Msg
Messages in channels are chopped into smaller `msgPacket`s for multiplexing.
```go
type msgPacket struct {
    ChannelID byte
    EOF       byte // 1 means message ends here.
    Bytes     []byte
}
```
The `msgPacket` is serialized using [go-amino](https://github.com/tendermint/go-amino) and prefixed with 0x3.
The received `Bytes` of a sequential set of packets are appended together
until a packet with `EOF=1` is received, then the complete serialized message
is returned for processing by the `onReceive` function of the corresponding channel.
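A minimal sketch of this reassembly step, using the `msgPacket` type above; the `channelBuffer` type and `onReceive` callback are invented for this example, and the real `MConnection` code differs in detail.

```go
// channelBuffer accumulates packet payloads for one Channel until a packet
// with EOF=1 completes the message, which is then handed to the channel's
// receive handler. Illustrative only.
type channelBuffer struct {
	recving   []byte                // bytes accumulated for the in-flight message
	onReceive func(msgBytes []byte) // the channel's receive handler
}

func (cb *channelBuffer) recvPacket(p msgPacket) {
	cb.recving = append(cb.recving, p.Bytes...)
	if p.EOF == 1 {
		cb.onReceive(cb.recving) // complete message: hand off and reset
		cb.recving = nil
	}
}
```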
### Multiplexing
Messages are sent from a single `sendRoutine`, which loops over a select statement and results in the sending
of a ping, a pong, or a batch of data messages. The batch of data messages may include messages from multiple channels.
Message bytes are queued for sending in their respective channel, with each channel holding one unsent message at a time.
Messages are chosen for a batch one at a time from the channel with the lowest ratio of recently sent bytes to channel priority.
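The following sketch illustrates that selection rule with invented field names (`recentlySent`, `priority`); it is not the actual send-loop code, just a compact restatement of the lowest-ratio rule.

```go
// channelState is an illustrative stand-in for a Channel's send-side state.
type channelState struct {
	priority     int64 // configured relative priority (> 0)
	recentlySent int64 // bytes sent recently (decayed over time)
	hasPending   bool  // whether an unsent message is queued
}

// pickChannel returns the index of the pending channel with the lowest
// recentlySent/priority ratio, or -1 if nothing is queued. Comparing
// cross-products keeps the ratio comparison in integer arithmetic.
func pickChannel(chs []channelState) int {
	best := -1
	for i, ch := range chs {
		if !ch.hasPending {
			continue
		}
		if best == -1 || ch.recentlySent*chs[best].priority < chs[best].recentlySent*ch.priority {
			best = i
		}
	}
	return best
}
```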
## Sending Messages
There are two methods for sending messages:
```go
func (m MConnection) Send(chID byte, msg interface{}) bool {}
func (m MConnection) TrySend(chID byte, msg interface{}) bool {}
```
`Send(chID, msg)` is a blocking call that waits until `msg` is successfully queued
for the channel with the given id byte `chID`. The message `msg` is serialized
using the `tendermint/go-amino` submodule's `WriteBinary()` reflection routine.
`TrySend(chID, msg)` is a nonblocking call that queues the message msg in the channel
with the given id byte chID if the queue is not full; otherwise it returns false immediately.
`Send()` and `TrySend()` are also exposed for each `Peer`.
## Peer
Each peer has one `MConnection` instance, and includes other information such as whether the connection
was outbound, whether the connection should be recreated if it closes, various identity information about the node,
and other higher level thread-safe data used by the reactors.
## Switch/Reactor
The `Switch` handles peer connections and exposes an API to receive incoming messages
on `Reactors`. Each `Reactor` is responsible for handling incoming messages of one
or more `Channels`. So while sending outgoing messages is typically performed on the peer,
incoming messages are received on the reactor.
```go
// Declare a MyReactor reactor that handles messages on MyChannelID.
type MyReactor struct{}

func (reactor MyReactor) GetChannels() []*ChannelDescriptor {
    return []*ChannelDescriptor{{ID: MyChannelID, Priority: 1}}
}

func (reactor MyReactor) Receive(chID byte, peer *Peer, msgBytes []byte) {
    r, n, err := bytes.NewBuffer(msgBytes), new(int64), new(error)
    msgString := ReadString(r, n, err)
    fmt.Println(msgString)
}

// Other Reactor methods omitted for brevity
...

sw := NewSwitch([]Reactor{MyReactor{}})
...

// Send a random message to all outbound connections
for _, peer := range sw.Peers().List() {
    if peer.IsOutbound() {
        peer.Send(MyChannelID, "Here's a random message")
    }
}
```

spec/p2p/node.md (+66 -0)

@@ -0,0 +1,66 @@
# Peer Discovery
A Tendermint P2P network has different kinds of nodes with different requirements for connectivity to one another.
This document describes what kind of nodes Tendermint should enable and how they should work.
## Seeds
Seeds are the first point of contact for a new node.
They return a list of known active peers and then disconnect.
Seeds should operate full nodes with the PEX reactor in a "crawler" mode
that continuously explores to validate the availability of peers.
Seeds should only respond with some top percentile of the best peers they know about.
See [the peer-exchange docs](https://github.com/tendermint/tendermint/blob/master/docs/spec/reactors/pex/pex.md) for details on peer quality.
## New Full Node
A new node needs a few things to connect to the network:
- a list of seeds, which can be provided to Tendermint via config file or flags,
or hardcoded into the software by in-process apps
- a `ChainID`, also called `Network` at the p2p layer
- a recent block height, `H`, and hash, `HASH`, for the blockchain.
The values `H` and `HASH` must be received and corroborated by means external to Tendermint, and specific to the user - ie. via the user's trusted social consensus.
This requirement to validate `H` and `HASH` out-of-band and via social consensus
is the essential difference in security models between Proof-of-Work and Proof-of-Stake blockchains.
With the above, the node then queries some seeds for peers for its chain,
dials those peers, and runs the Tendermint protocols with those it successfully connects to.
When the node catches up to height `H`, it ensures the block hash matches `HASH`.
If not, Tendermint will exit, and the user must try again - either they are connected
to bad peers or their social consensus is invalid.
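A hedged sketch of that final check (the function and its arguments are illustrative, not the Tendermint API): after syncing to the trusted height `H`, the node compares the block hash it obtained from peers against the externally obtained `HASH` and refuses to proceed on a mismatch.

```go
import (
	"bytes"
	"fmt"
)

// verifyTrustedHash is an illustrative check, not the actual Tendermint code:
// once the node has synced to the externally trusted height H, the hash of the
// block it obtained from peers must match the externally trusted HASH.
func verifyTrustedHash(height int64, syncedBlockHash, trustedHash []byte) error {
	if !bytes.Equal(syncedBlockHash, trustedHash) {
		return fmt.Errorf("block hash at height %d is %X, expected trusted hash %X: "+
			"connected to bad peers or the trusted hash is wrong", height, syncedBlockHash, trustedHash)
	}
	return nil
}
```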
## Restarted Full Node
A node checks its address book on startup and attempts to connect to peers from there.
If it can't connect to any peers after some time, it falls back to the seeds to find more.
Restarted full nodes can run the `blockchain` or `consensus` reactor protocols to sync up
to the latest state of the blockchain from wherever they were last.
In a Proof-of-Stake context, if they are sufficiently far behind (greater than the length
of the unbonding period), they will need to validate a recent `H` and `HASH` out-of-band again
so they know they have synced the correct chain.
## Validator Node
A validator node is a node that interfaces with a validator signing key.
These nodes require the highest security, and should not accept incoming connections.
They should maintain outgoing connections to a controlled set of "Sentry Nodes" that serve
as their proxy shield to the rest of the network.
Validators that know and trust each other can accept incoming connections from one another and maintain direct private connectivity via VPN.
## Sentry Node
Sentry nodes are guardians of a validator node and provide it access to the rest of the network.
They should be well connected to other full nodes on the network.
Sentry nodes may be dynamic, but should maintain persistent connections to some evolving random subset of each other.
They should always expect to have direct incoming connections from the validator node and its backup(s).
They do not report the validator node's address in the PEX and
they may be more strict about the quality of peers they keep.
Sentry nodes belonging to validators that trust each other may wish to maintain persistent connections via VPN with one another, but only report each other sparingly in the PEX.

spec/p2p/peer.md (+119 -0)

@@ -0,0 +1,119 @@
# Peers
This document explains how Tendermint Peers are identified and how they connect to one another.
For details on peer discovery, see the [peer exchange (PEX) reactor doc](https://github.com/tendermint/tendermint/blob/master/docs/spec/reactors/pex/pex.md).
## Peer Identity
Tendermint peers are expected to maintain long-term persistent identities in the form of a public key.
Each peer has an ID defined as `peer.ID == peer.PubKey.Address()`, where `Address` uses the scheme defined in the `crypto` package.
A single peer ID can have multiple IP addresses associated with it, but a node
will only ever connect to one at a time.
When attempting to connect to a peer, we use the PeerURL: `<ID>@<IP>:<PORT>`.
We will attempt to connect to the peer at IP:PORT, and verify,
via authenticated encryption, that it is in possession of the private key
corresponding to `<ID>`. This prevents man-in-the-middle attacks on the peer layer.
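As a rough illustration (helper names invented; the real parsing and verification live in the p2p package), splitting the PeerURL and checking the authenticated key against the dialed ID might look like:

```go
import (
	"fmt"
	"strings"
)

// splitPeerURL parses the "<ID>@<IP>:<PORT>" form described above. This is an
// illustrative helper, not the p2p package's actual parser.
func splitPeerURL(u string) (id, addr string, err error) {
	parts := strings.SplitN(u, "@", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid peer URL %q, expected <ID>@<IP>:<PORT>", u)
	}
	return parts[0], parts[1], nil
}

// After the authenticated-encryption handshake, the dialer then checks that
// the authenticated key hashes to the ID it dialed, conceptually:
//
//	if string(peer.PubKey.Address()) != dialedID { disconnect }
```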
## Connections
All p2p connections use TCP.
Upon establishing a successful TCP connection with a peer,
two handshakes are performed: one for authenticated encryption, and one for Tendermint versioning.
Both handshakes have configurable timeouts (they should complete quickly).
### Authenticated Encryption Handshake
Tendermint implements the Station-to-Station protocol
using X25519 keys for Diffie-Hellman key-exchange and chacha20poly1305 for encryption.
It goes as follows:
- generate an ephemeral X25519 keypair
- send the ephemeral public key to the peer
- wait to receive the peer's ephemeral public key
- compute the Diffie-Hellman shared secret using the peer's ephemeral public key and our ephemeral private key
- generate two keys to use for encryption (sending and receiving) and a challenge for authentication as follows:
- create a hkdf-sha256 instance with the key being the Diffie-Hellman shared secret, and info parameter as
`TENDERMINT_SECRET_CONNECTION_KEY_AND_CHALLENGE_GEN`
- get 96 bytes of output from hkdf-sha256
- if we had the smaller ephemeral pubkey, use the first 32 bytes for the key for receiving, the second 32 bytes for sending; else the opposite
- use the last 32 bytes of output for the challenge
- use a separate nonce for receiving and sending. Both nonces start at 0, and should support the full 96 bit nonce range
- all communications from now on are encrypted in 1024 byte frames,
using the respective secret and nonce. Each nonce is incremented by one after each use.
- we now have an encrypted channel, but still need to authenticate
- sign the common challenge obtained from the hkdf with our persistent private key
- send the amino encoded persistent pubkey and signature to the peer
- wait to receive the persistent public key and signature from the peer
- verify the signature on the challenge using the peer's persistent public key
If this is an outgoing connection (we dialed the peer) and we used a peer ID,
then finally verify that the peer's persistent public key corresponds to the peer ID we dialed,
ie. `peer.PubKey.Address() == <ID>`.
The connection has now been authenticated. All traffic is encrypted.
Note: only the dialer can authenticate the identity of the peer,
but this is what we care about since when we join the network we wish to
ensure we have reached the intended peer (and are not being MITMd).
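A sketch of the key-derivation step above, using `golang.org/x/crypto/hkdf`; the function name and exact plumbing are illustrative, but the split of the 96 bytes into receive key, send key and challenge follows the description.

```go
import (
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveSecrets sketches the derivation above: 96 bytes of HKDF-SHA256 output
// are split into a receive key, a send key and an authentication challenge.
// Which half is "receive" vs "send" depends on which side had the smaller
// ephemeral public key (locIsLeast). Illustrative only.
func deriveSecrets(dhSecret []byte, locIsLeast bool) (recvKey, sendKey, challenge [32]byte, err error) {
	kdf := hkdf.New(sha256.New, dhSecret, nil,
		[]byte("TENDERMINT_SECRET_CONNECTION_KEY_AND_CHALLENGE_GEN"))

	out := make([]byte, 96)
	if _, err = io.ReadFull(kdf, out); err != nil {
		return
	}

	if locIsLeast {
		copy(recvKey[:], out[0:32])  // smaller-pubkey side receives with the first key
		copy(sendKey[:], out[32:64]) // and sends with the second
	} else {
		copy(sendKey[:], out[0:32])
		copy(recvKey[:], out[32:64])
	}
	copy(challenge[:], out[64:96]) // last 32 bytes: shared challenge to sign
	return
}
```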
### Peer Filter
Before continuing, we check if the new peer has the same ID as ourselves or
an existing peer. If so, we disconnect.
We also check the peer's address and public key against
an optional whitelist which can be managed through the ABCI app -
if the whitelist is enabled and the peer does not qualify, the connection is
terminated.
### Tendermint Version Handshake
The Tendermint Version Handshake allows the peers to exchange their NodeInfo:
```golang
type NodeInfo struct {
    Version         p2p.Version
    ID              p2p.ID
    ListenAddr      string
    Network         string
    SoftwareVersion string
    Channels        []int8
    Moniker         string
    Other           NodeInfoOther
}

type Version struct {
    P2P   uint64
    Block uint64
    App   uint64
}

type NodeInfoOther struct {
    TxIndex    string
    RPCAddress string
}
```
The connection is disconnected if:
- `peer.NodeInfo.ID` is not equal to `peerConn.ID`
- `peer.NodeInfo.Version.Block` does not match ours
- `peer.NodeInfo.Network` is not the same as ours
- `peer.Channels` does not intersect with our known Channels.
- `peer.NodeInfo.ListenAddr` is malformed or is a DNS host that cannot be
resolved
At this point, if we have not disconnected, the peer is valid.
It is added to the switch and hence all reactors via the `AddPeer` method.
Note that each reactor may handle multiple channels.
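A simplified sketch of these checks against the `NodeInfo` shown above (illustrative only; it assumes `fmt` is imported and omits the ID equality and `ListenAddr` checks, which need the connection and a resolver):

```go
// compatibleWith is a simplified sketch of the checks listed above, using the
// NodeInfo type shown in the handshake snippet.
func compatibleWith(ours, theirs NodeInfo) error {
	if theirs.Version.Block != ours.Version.Block {
		return fmt.Errorf("peer block version %d does not match ours (%d)",
			theirs.Version.Block, ours.Version.Block)
	}
	if theirs.Network != ours.Network {
		return fmt.Errorf("peer network %q does not match ours (%q)", theirs.Network, ours.Network)
	}
	// At least one channel must be shared.
	for _, ch := range theirs.Channels {
		for _, ourCh := range ours.Channels {
			if ch == ourCh {
				return nil
			}
		}
	}
	return fmt.Errorf("no common channels between %v and %v", theirs.Channels, ours.Channels)
}
```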
## Connection Activity
Once a peer is added, incoming messages for a given reactor are handled through
that reactor's `Receive` method, and output messages are sent directly by the Reactors
on each peer. A typical reactor maintains per-peer go-routine(s) that handle this.

spec/reactors/block_sync/bcv1/img/bc-reactor-new-datastructs.png (BIN, 768 × 335, 43 KiB)

spec/reactors/block_sync/bcv1/img/bc-reactor-new-fsm.png (BIN, 705 × 228, 41 KiB)

spec/reactors/block_sync/bcv1/img/bc-reactor-new-goroutines.png (BIN, 1027 × 675, 138 KiB)

spec/reactors/block_sync/bcv1/impl-v1.md (+237 -0)

@@ -0,0 +1,237 @@
# Blockchain Reactor v1
### Data Structures
The data structures used are illustrated below.
![Data Structures](img/bc-reactor-new-datastructs.png)
#### BlockchainReactor
- is a `p2p.BaseReactor`.
- has a `store.BlockStore` for persistence.
- executes blocks using an `sm.BlockExecutor`.
- starts the FSM and the `poolRoutine()`.
- relays the fast-sync responses and switch messages to the FSM.
- handles errors from the FSM and, when necessary, reports them to the switch.
- implements the blockchain reactor interface used by the FSM to send requests, report errors to the switch, and reset state timers.
- registers all the concrete types and interfaces for serialisation.
```go
type BlockchainReactor struct {
p2p.BaseReactor
initialState sm.State // immutable
state sm.State
blockExec *sm.BlockExecutor
store *store.BlockStore
fastSync bool
fsm *BcReactorFSM
blocksSynced int
// Receive goroutine forwards messages to this channel to be processed in the context of the poolRoutine.
messagesForFSMCh chan bcReactorMessage
// Switch goroutine may send RemovePeer to the blockchain reactor. This is an error message that is relayed
// to this channel to be processed in the context of the poolRoutine.
errorsForFSMCh chan bcReactorMessage
// This channel is used by the FSM and indirectly the block pool to report errors to the blockchain reactor and
// the switch.
eventsFromFSMCh chan bcFsmMessage
}
```
#### BcReactorFSM
- implements a simple finite state machine.
- has a state and a state timer.
- has a `BlockPool` to keep track of block requests sent to peers and blocks received from peers.
- uses an interface to send status requests, block requests and reporting errors. The interface is implemented by the `BlockchainReactor` and tests.
```go
type BcReactorFSM struct {
logger log.Logger
mtx sync.Mutex
startTime time.Time
state *bcReactorFSMState
stateTimer *time.Timer
pool *BlockPool
// interface used to call the Blockchain reactor to send StatusRequest, BlockRequest, reporting errors, etc.
toBcR bcReactor
}
```
#### BlockPool
- maintains a peer set, implemented as a map of peer ID to `BpPeer`.
- maintains a set of requests made to peers, implemented as a map of block request heights to peer IDs.
- maintains a list of future block requests needed to advance the fast-sync. This is a list of block heights.
- keeps track of the maximum height of the peers in the set.
- uses an interface to send requests and report errors to the reactor (via FSM).
```go
type BlockPool struct {
logger log.Logger
// Set of peers that have sent status responses, with height bigger than pool.Height
peers map[p2p.ID]*BpPeer
// Set of block heights and the corresponding peers from where a block response is expected or has been received.
blocks map[int64]p2p.ID
plannedRequests map[int64]struct{} // list of blocks to be assigned peers for blockRequest
nextRequestHeight int64 // next height to be added to plannedRequests
Height int64 // height of next block to execute
MaxPeerHeight int64 // maximum height of all peers
toBcR bcReactor
}
```
Some reasons for the `BlockPool` data structure content:
1. If a peer is removed by the switch, fast access is required to the peer and the block requests made to that peer in order to redo them.
2. When block verification fails, fast access is required from the block height to the peer and the block requests made to that peer in order to redo them.
3. The `BlockchainReactor` main routine decides when the block pool is running low and asks the `BlockPool` (via FSM) to make more requests. The `BlockPool` creates a list of requests and triggers the sending of the block requests (via the interface). The reason it maintains a list of requests is the redo operations that may occur during error handling. These are redone when the `BlockchainReactor` requires more blocks.
#### BpPeer
- keeps track of a single peer, with height bigger than the initial height.
- maintains the block requests made to the peer and the blocks received from the peer until they are executed.
- monitors the peer speed when there are pending requests.
- it has an active timer when pending requests are present and reports an error on timeout.
```go
type BpPeer struct {
logger log.Logger
ID p2p.ID
Height int64 // the peer reported height
NumPendingBlockRequests int // number of requests still waiting for block responses
blocks map[int64]*types.Block // blocks received or expected to be received from this peer
blockResponseTimer *time.Timer
recvMonitor *flow.Monitor
params *BpPeerParams // parameters for timer and monitor
onErr func(err error, peerID p2p.ID) // function to call on error
}
```
### Concurrency Model
The diagram below shows the goroutines (depicted by the gray blocks), timers (shown on the left with their values) and channels (colored rectangles). The FSM box shows some of the functionality and it is not a separate goroutine.
The interface used by the FSM is shown in light red with the `IF` block. This is used to:
- send block requests
- report peer errors to the switch - this results in the reactor calling `switch.StopPeerForError()` and, if triggered by the peer timeout routine, a `removePeerEv` is sent to the FSM and action is taken from the context of the `poolRoutine()`
- ask the reactor to reset the state timers. The timers are owned by the FSM while the timeout routine is defined by the reactor. This was done in order to avoid running timers in tests and will change in the next revision.
There are two main goroutines implemented by the blockchain reactor. All I/O operations are performed from the `poolRoutine()` context while the CPU intensive operations related to the block execution are performed from the context of the `executeBlocksRoutine()`. All goroutines are detailed in the next sections.
![Go Routines Diagram](img/bc-reactor-new-goroutines.png)
#### Receive()
Fast-sync messages from peers are received by this goroutine. It performs basic validation and:
- in helper mode (i.e. for request message) it replies immediately. This is different than the proposal in adr-040 that specifies having the FSM handling these.
- forwards response messages to the `poolRoutine()`.
#### poolRoutine()
(name kept as in the previous reactor).
It starts the `executeBlocksRoutine()` and the FSM. It then waits in a loop for events. These are received from the following channels:
- `sendBlockRequestTicker.C` - every 10msec the reactor asks the FSM to make more block requests up to a maximum. Note: currently this value is constant but could be changed based on low/high watermark thresholds for the number of blocks received and waiting to be processed, the number of blockResponse messages waiting in messagesForFSMCh, etc.
- `statusUpdateTicker.C` - every 10 seconds the reactor broadcasts status requests to peers. While adr-040 specifies this to run within the FSM, at this point this functionality is kept in the reactor.
- `messagesForFSMCh` - the `Receive()` goroutine sends status and block response messages to this channel and the reactor calls FSM to handle them.
- `errorsForFSMCh` - this channel receives the following events:
- peer remove - when the switch removes a peer
- state timeout event - when FSM state timers trigger
The reactor forwards these messages to the FSM.
- `eventsFromFSMCh` - there are two types of events sent over this channel:
- `syncFinishedEv` - triggered when FSM enters `finished` state and calls the switchToConsensus() interface function.
- `peerErrorEv` - the peer timer expiry goroutine sends this event over the channel for processing from the poolRoutine() context.
#### executeBlocksRoutine()
Started by the `poolRoutine()`, it retrieves blocks from the pool and executes them:
- `processReceivedBlockTicker.C` - a ticker event is received over the channel every 10msec and its handling results in a signal being sent to the doProcessBlockCh channel.
- `doProcessBlockCh` - events are received on this channel as described above and, upon processing, blocks are retrieved from the pool and executed.
### FSM
![fsm](img/bc-reactor-new-fsm.png)
#### States
##### init (aka unknown)
The FSM is created in `unknown` state. When started by the reactor (`startFSMEv`), it broadcasts Status requests and transitions to `waitForPeer` state.
##### waitForPeer
In this state, the FSM waits for a Status response from a "tall" peer. A timer is running in this state to allow the FSM to finish if there are no useful peers.
If the timer expires, it moves to `finished` state and calls the reactor to switch to consensus.
If a Status response is received from a peer within the timeout, the FSM transitions to `waitForBlock` state.
##### waitForBlock
In this state the FSM makes Block requests (triggered by a ticker in reactor) and waits for Block responses. There is a timer running in this state to detect if a peer is not sending the block at current processing height. If the timer expires, the FSM removes the peer where the request was sent and all requests made to that peer are redone.
As blocks are received they are stored by the pool. Block execution is independently performed by the reactor and the result reported to the FSM:
- if there are no errors, the FSM increases the pool height and resets the state timer.
- if there are errors, the peers that delivered the two blocks (at height and height+1) are removed and the requests redone.
In this state the FSM may receive peer remove events in any of the following scenarios:
- the switch is removing a peer
- a peer is penalized because it has not responded to some block requests for a long time
- a peer is penalized for being slow
When processing of the last block (the one with height equal to the highest peer height minus one) is successful, the FSM transitions to `finished` state.
If, after a peer update or removal, the pool height is the same as maxPeerHeight, the FSM transitions to `finished` state.
##### finished
When entering this state, the FSM calls the reactor to switch to consensus and performs cleanup.
#### Events
The following events are handled by the FSM:
```go
const (
startFSMEv = iota + 1
statusResponseEv
blockResponseEv
processedBlockEv
makeRequestsEv
stopFSMEv
peerRemoveEv = iota + 256
stateTimeoutEv
)
```
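Purely as an illustration of the state/event structure described above (the real `BcReactorFSM` uses per-state handler functions with side effects, not a bare lookup table), the happy-path transitions can be summarised as:

```go
// Illustrative summary only, using the event constants above.
type fsmStateName string

const (
	stateUnknown      fsmStateName = "unknown"
	stateWaitForPeer  fsmStateName = "waitForPeer"
	stateWaitForBlock fsmStateName = "waitForBlock"
	stateFinished     fsmStateName = "finished"
)

var transitions = map[fsmStateName]map[int]fsmStateName{
	stateUnknown: {
		startFSMEv: stateWaitForPeer, // broadcast Status requests on entry
	},
	stateWaitForPeer: {
		statusResponseEv: stateWaitForBlock, // a "tall" peer was found
		stateTimeoutEv:   stateFinished,     // no useful peers: switch to consensus
	},
	stateWaitForBlock: {
		processedBlockEv: stateWaitForBlock, // or finished, once the terminal height is processed
		stateTimeoutEv:   stateWaitForBlock, // peer at fault removed, its requests redone
	},
}
```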
### Examples of Scenarios and Termination Handling
A few scenarios are covered in this section together with the current/proposed handling.
In general, the scenarios involving faulty peers are made worse by the fact that they may quickly be re-added.
#### 1. No Tall Peers
S: In this scenario a node is started and while there are status responses received, none of the peers are at a height higher than this node.
H: The FSM times out in `waitForPeer` state, moves to `finished` state where it calls the reactor to switch to consensus.
#### 2. Typical Fast Sync
S: A node fast syncs blocks from honest peers and eventually downloads and executes the penultimate block.
H: The FSM in `waitForBlock` state will receive the processedBlockEv from the reactor and detect that the termination height is achieved.
#### 3. Peer Claims Big Height but no Blocks
S: In this scenario a faulty peer claims a big height (for which there are no blocks).
H: The requests for the non-existing block will time out, the peer is removed, and the pool's `MaxPeerHeight` is updated. The FSM checks if the termination height is achieved when peers are removed.
#### 4. Highest Peer Removed or Updated to Short
S: The fast sync node is caught up with all peers except one tall peer. The tall peer is removed, or it sends a status response with a lower height.
H: FSM checks termination condition on peer removal and updates.
#### 5. Block At Current Height Delayed
S: A peer can block the progress of fast sync by delaying indefinitely the block response for the current processing height (h1).
H: Currently, given h1 < h2, there is no enforcement at peer level that the response for h1 should be received before h2. So a peer will timeout only after delivering all blocks except h1. However the `waitForBlock` state timer fires if the block for current processing height is not received within a timeout. The peer is removed and the requests to that peer (including the one for current height) redone.

spec/reactors/block_sync/img/bc-reactor-routines.png (BIN, 1602 × 1132, 265 KiB)

spec/reactors/block_sync/img/bc-reactor.png (BIN, 1520 × 1340, 43 KiB)

spec/reactors/block_sync/impl.md (+44 -0)

@@ -0,0 +1,44 @@
## Blockchain Reactor v0 Modules
### Blockchain Reactor
- coordinates the pool for syncing
- coordinates the store for persistence
- coordinates the playing of blocks towards the app using a sm.BlockExecutor
- handles switching between fastsync and consensus
- it is a p2p.BaseReactor
- starts the pool.Start() and its poolRoutine()
- registers all the concrete types and interfaces for serialisation
#### poolRoutine
- listens to these channels:
- pool requests blocks from a specific peer by posting to requestsCh, the block reactor then sends
a `bcBlockRequestMessage` for a specific height
- pool signals timeout of a specific peer by posting to timeoutsCh
- switchToConsensusTicker to periodically try and switch to consensus
- trySyncTicker to periodically check if we have fallen behind and then catch-up sync
- if there aren't any new blocks available on the pool it skips syncing
- tries to sync the app by taking downloaded blocks from the pool, gives them to the app and stores
them on disk
- implements Receive which is called by the switch/peer
- calls AddBlock on the pool when it receives a new block from a peer
### Block Pool
- responsible for downloading blocks from peers
- makeRequestersRoutine()
- removes timeout peers
- starts new requesters by calling makeNextRequester()
- requestRoutine():
- picks a peer and sends the request, then blocks until:
- pool is stopped by listening to pool.Quit
- requester is stopped by listening to Quit
- request is redone
- we receive a block
- gotBlockCh is strange
### Go Routines in Blockchain Reactor
![Go Routines Diagram](img/bc-reactor-routines.png)

spec/reactors/block_sync/reactor.md (+308 -0)

@@ -0,0 +1,308 @@
# Blockchain Reactor
The Blockchain Reactor's high level responsibility is to enable peers who are
far behind the current state of the consensus to quickly catch up by downloading
many blocks in parallel, verifying their commits, and executing them against the
ABCI application.
Tendermint full nodes run the Blockchain Reactor as a service to provide blocks
to new nodes. New nodes run the Blockchain Reactor in "fast_sync" mode,
where they actively make requests for more blocks until they sync up.
Once caught up, "fast_sync" mode is disabled and the node switches to
using (and turns on) the Consensus Reactor.
## Message Types
```go
const (
msgTypeBlockRequest = byte(0x10)
msgTypeBlockResponse = byte(0x11)
msgTypeNoBlockResponse = byte(0x12)
msgTypeStatusResponse = byte(0x20)
msgTypeStatusRequest = byte(0x21)
)
```
```go
type bcBlockRequestMessage struct {
    Height int64
}

type bcNoBlockResponseMessage struct {
    Height int64
}

type bcBlockResponseMessage struct {
    Block Block
}

type bcStatusRequestMessage struct {
    Height int64
}

type bcStatusResponseMessage struct {
    Height int64
}
```
## Architecture and algorithm
The Blockchain reactor is organised as a set of concurrent tasks:
- Receive routine of Blockchain Reactor
- Task for creating Requesters
- Set of Requester tasks
- Controller task
![Blockchain Reactor Architecture Diagram](img/bc-reactor.png)
### Data structures
These are the core data structures necessary to provide the Blockchain Reactor logic.
The Requester data structure is used to track the assignment of a request for a `block` at position `height` to a peer with ID equal to `peerID`.
```go
type Requester struct {
    mtx         Mutex
    block       Block
    height      int64
    peerID      p2p.ID
    redoChannel chan p2p.ID // redo may be sent multiple times; peerID is used to identify repeats
}
```
Pool is a core data structure that stores last executed block (`height`), assignment of requests to peers (`requesters`), current height for each peer and number of pending requests for each peer (`peers`), maximum peer height, etc.
```go
type Pool struct {
mtx Mutex
requesters map[int64]*Requester
height int64
peers map[p2p.ID]*Peer
maxPeerHeight int64
numPending int32
store BlockStore
requestsChannel chan<- BlockRequest
errorsChannel chan<- peerError
}
```
Peer data structure stores for each peer current `height` and number of pending requests sent to the peer (`numPending`), etc.
```go
type Peer struct {
id p2p.ID
height int64
numPending int32
timeout *time.Timer
didTimeout bool
}
```
BlockRequest is an internal data structure used to denote the current mapping of a request for a block at some `height` to a peer (`PeerID`).
```go
type BlockRequest struct {
Height int64
PeerID p2p.ID
}
```
### Receive routine of Blockchain Reactor
It is executed upon message reception on the BlockchainChannel inside the p2p receive routine. There is a separate p2p receive routine (and therefore a separate receive routine of the Blockchain Reactor) executed for each peer. Note that "try to send" does not block (it returns immediately) if the outgoing buffer is full.
```go
handleMsg(pool, m):
upon receiving bcBlockRequestMessage m from peer p:
block = load block for height m.Height from pool.store
if block != nil then
try to send BlockResponseMessage(block) to p
else
try to send bcNoBlockResponseMessage(m.Height) to p
upon receiving bcBlockResponseMessage m from peer p:
pool.mtx.Lock()
requester = pool.requesters[m.Height]
if requester == nil then
error("peer sent us a block we didn't expect")
continue
if requester.block == nil and requester.peerID == p then
requester.block = m
pool.numPending -= 1 // atomic decrement
peer = pool.peers[p]
if peer != nil then
peer.numPending--
if peer.numPending == 0 then
peer.timeout.Stop()
// NOTE: we don't send Quit signal to the corresponding requester task!
else
trigger peer timeout to expire after peerTimeout
pool.mtx.Unlock()
upon receiving bcStatusRequestMessage m from peer p:
try to send bcStatusResponseMessage(pool.store.Height)
upon receiving bcStatusResponseMessage m from peer p:
pool.mtx.Lock()
peer = pool.peers[p]
if peer != nil then
peer.height = m.height
else
peer = create new Peer data structure with id = p and height = m.Height
pool.peers[p] = peer
if m.Height > pool.maxPeerHeight then
pool.maxPeerHeight = m.Height
pool.mtx.Unlock()
onTimeout(p):
send error message to pool error channel
peer = pool.peers[p]
peer.didTimeout = true
```
### Requester tasks
Requester task is responsible for fetching a single block at position `height`.
```go
fetchBlock(height, pool):
while true do {
peerID = nil
block = nil
peer = pickAvailablePeer(height)
peerID = peer.id
enqueue BlockRequest(height, peerID) to pool.requestsChannel
redo = false
while !redo do
select {
upon receiving Quit message do
return
upon receiving redo message with id on redoChannel do
if peerID == id {
mtx.Lock()
pool.numPending++
redo = true
mtx.UnLock()
}
}
}
pickAvailablePeer(height):
selectedPeer = nil
while selectedPeer == nil do
pool.mtx.Lock()
for each peer in pool.peers do
if !peer.didTimeout and peer.numPending < maxPendingRequestsPerPeer and peer.height >= height then
peer.numPending++
selectedPeer = peer
break
pool.mtx.Unlock()
if selectedPeer == nil then
sleep requestIntervalMS
return selectedPeer
```
### Task for creating Requesters
This task is responsible for continuously creating and starting Requester tasks.
```go
createRequesters(pool):
while true do
if !pool.isRunning then break
if pool.numPending < maxPendingRequests or size(pool.requesters) < maxTotalRequesters then
pool.mtx.Lock()
nextHeight = pool.height + size(pool.requesters)
requester = create new requester for height nextHeight
pool.requesters[nextHeight] = requester
pool.numPending += 1 // atomic increment
start requester task
pool.mtx.Unlock()
else
sleep requestIntervalMS
pool.mtx.Lock()
for each peer in pool.peers do
if !peer.didTimeout && peer.numPending > 0 && peer.curRate < minRecvRate then
send error on pool error channel
peer.didTimeout = true
if peer.didTimeout then
for each requester in pool.requesters do
if requester.getPeerID() == peer then
enqueue msg on requester's redoChannel
delete(pool.peers, peerID)
pool.mtx.Unlock()
```
### Main blockchain reactor controller task
```go
main(pool):
create trySyncTicker with interval trySyncIntervalMS
create statusUpdateTicker with interval statusUpdateIntervalSeconds
create switchToConsensusTicker with interval switchToConsensusIntervalSeconds
while true do
select {
upon receiving BlockRequest(Height, Peer) on pool.requestsChannel:
try to send bcBlockRequestMessage(Height) to Peer
upon receiving error(peer) on errorsChannel:
stop peer for error
upon receiving message on statusUpdateTickerChannel:
broadcast bcStatusRequestMessage(bcR.store.Height) // message sent in a separate routine
upon receiving message on switchToConsensusTickerChannel:
pool.mtx.Lock()
receivedBlockOrTimedOut = pool.height > 0 || (time.Now() - pool.startTime) > 5 Seconds
ourChainIsLongestAmongPeers = pool.maxPeerHeight == 0 || pool.height >= pool.maxPeerHeight
haveSomePeers = size of pool.peers > 0
pool.mtx.Unlock()
if haveSomePeers && receivedBlockOrTimedOut && ourChainIsLongestAmongPeers then
switch to consensus mode
upon receiving message on trySyncTickerChannel:
for i = 0; i < 10; i++ do
pool.mtx.Lock()
firstBlock = pool.requesters[pool.height].block
secondBlock = pool.requesters[pool.height+1].block
if firstBlock == nil or secondBlock == nil then continue
pool.mtx.Unlock()
verify firstBlock using LastCommit from secondBlock
if verification failed
pool.mtx.Lock()
peerID = pool.requesters[pool.height].peerID
redoRequestsForPeer(pool, peerID)
delete(pool.peers, peerID)
stop peer peerID for error
pool.mtx.Unlock()
else
delete(pool.requesters, pool.height)
save firstBlock to store
pool.height++
execute firstBlock
}
redoRequestsForPeer(pool, peerID):
for each requester in pool.requesters do
if requester.getPeerID() == peerID
enqueue msg on redoChannel for requester
```
## Channels
Defines `maxMsgSize` for the maximum size of incoming messages,
`SendQueueCapacity` and `RecvBufferCapacity` for maximum sending and
receiving buffers respectively. These are supposed to prevent amplification
attacks by setting an upper limit on how much data we can receive and send to
a peer.
Sending incorrectly encoded data will result in stopping the peer.
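As a sketch of how such limits are typically declared per channel (the `ChannelDescriptor` fields follow the connection doc and the capacities named above; the channel ID and all numeric values below are placeholders, not the reactor's actual settings):

```go
// Illustrative only: per-channel limits of the kind described above.
const (
	BlockchainChannel = byte(0x40)  // placeholder channel ID
	maxMsgSize        = 1024 * 1024 // placeholder upper bound on message size
)

func (bcR *BlockchainReactor) GetChannels() []*ChannelDescriptor {
	return []*ChannelDescriptor{
		{
			ID:                  BlockchainChannel,
			Priority:            10,
			SendQueueCapacity:   1000,       // max messages queued for sending
			RecvBufferCapacity:  50 * 1024,  // receive buffer size in bytes
			RecvMessageCapacity: maxMsgSize, // largest message accepted from a peer
		},
	}
}
```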

spec/reactors/consensus/consensus-reactor.md (+353 -0)

@@ -0,0 +1,353 @@
# Consensus Reactor
Consensus Reactor defines a reactor for the consensus service. It contains the ConsensusState service that
manages the state of the Tendermint consensus internal state machine.
When the Consensus Reactor is started, it starts the Broadcast Routine, which starts the ConsensusState service.
Furthermore, for each peer that is added to the Consensus Reactor, it creates (and manages) the known peer state
(that is used extensively in gossip routines) and starts the following three routines for the peer p:
Gossip Data Routine, Gossip Votes Routine and QueryMaj23Routine. Finally, Consensus Reactor is responsible
for decoding messages received from a peer and for adequate processing of the message depending on its type and content.
The processing normally consists of updating the known peer state and for some messages
(`ProposalMessage`, `BlockPartMessage` and `VoteMessage`) also forwarding the message to the ConsensusState module
for further processing. In the following text we specify the core functionality of those separate units of execution
that are part of the Consensus Reactor.
## ConsensusState service
Consensus State handles execution of the Tendermint BFT consensus algorithm. It processes votes and proposals,
and upon reaching agreement, commits blocks to the chain and executes them against the application.
The internal state machine receives input from peers, the internal validator and from a timer.
Inside Consensus State we have the following units of execution: Timeout Ticker and Receive Routine.
Timeout Ticker is a timer that schedules timeouts conditional on the height/round/step that are processed
by the Receive Routine.
### Receive Routine of the ConsensusState service
Receive Routine of the ConsensusState handles messages which may cause internal consensus state transitions.
It is the only routine that updates RoundState that contains internal consensus state.
Updates (state transitions) happen on timeouts, complete proposals, and 2/3 majorities.
It receives messages from peers, internal validators and from Timeout Ticker
and invokes the corresponding handlers, potentially updating the RoundState.
The details of the protocol (together with formal proofs of correctness) implemented by the Receive Routine are
discussed in a separate document. For an understanding of this document
it is sufficient to understand that the Receive Routine manages and updates RoundState data structure that is
then extensively used by the gossip routines to determine what information should be sent to peer processes.
## Round State
RoundState defines the internal consensus state. It contains height, round, round step, a current validator set,
a proposal and proposal block for the current round, locked round and block (if some block is being locked), set of
received votes and last commit and last validators set.
```golang
type RoundState struct {
Height int64
Round int
Step RoundStepType
Validators ValidatorSet
Proposal Proposal
ProposalBlock Block
ProposalBlockParts PartSet
LockedRound int
LockedBlock Block
LockedBlockParts PartSet
Votes HeightVoteSet
LastCommit VoteSet
LastValidators ValidatorSet
}
```
Internally, consensus will run as a state machine with the following states:
- RoundStepNewHeight
- RoundStepNewRound
- RoundStepPropose
- RoundStepProposeWait
- RoundStepPrevote
- RoundStepPrevoteWait
- RoundStepPrecommit
- RoundStepPrecommitWait
- RoundStepCommit
## Peer Round State
Peer round state contains the known state of a peer. It is being updated by the Receive routine of
Consensus Reactor and by the gossip routines upon sending a message to the peer.
```golang
type PeerRoundState struct {
Height int64 // Height peer is at
Round int // Round peer is at, -1 if unknown.
Step RoundStepType // Step peer is at
Proposal bool // True if peer has proposal for this round
ProposalBlockPartsHeader PartSetHeader
ProposalBlockParts BitArray
ProposalPOLRound int // Proposal's POL round. -1 if none.
ProposalPOL BitArray // nil until ProposalPOLMessage received.
Prevotes BitArray // All votes peer has for this round
Precommits BitArray // All precommits peer has for this round
LastCommitRound int // Round of commit for last height. -1 if none.
LastCommit BitArray // All commit precommits of commit for last height.
CatchupCommitRound int // Round that we have commit for. Not necessarily unique. -1 if none.
CatchupCommit BitArray // All commit precommits peer has for this height & CatchupCommitRound
}
```
## Receive method of Consensus reactor
The entry point of the Consensus reactor is a receive method. When a message is received from a peer p,
normally the peer round state is updated correspondingly, and some messages
are passed for further processing, for example to ConsensusState service. We now specify the processing of messages
in the receive method of Consensus reactor for each message type. In the following message handler, `rs` and `prs` denote
`RoundState` and `PeerRoundState`, respectively.
### NewRoundStepMessage handler
```
handleMessage(msg):
if msg is from smaller height/round/step then return
// Just remember these values.
prsHeight = prs.Height
prsRound = prs.Round
prsCatchupCommitRound = prs.CatchupCommitRound
prsCatchupCommit = prs.CatchupCommit
Update prs with values from msg
if prs.Height or prs.Round has been updated then
reset Proposal related fields of the peer state
if prs.Round has been updated and msg.Round == prsCatchupCommitRound then
prs.Precommits = prsCatchupCommit
if prs.Height has been updated then
if prsHeight+1 == msg.Height && prsRound == msg.LastCommitRound then
prs.LastCommitRound = msg.LastCommitRound
prs.LastCommit = prs.Precommits
} else {
prs.LastCommitRound = msg.LastCommitRound
prs.LastCommit = nil
}
Reset prs.CatchupCommitRound and prs.CatchupCommit
```
### NewValidBlockMessage handler
```
handleMessage(msg):
if prs.Height != msg.Height then return
if prs.Round != msg.Round && !msg.IsCommit then return
prs.ProposalBlockPartsHeader = msg.BlockPartsHeader
prs.ProposalBlockParts = msg.BlockParts
```
### HasVoteMessage handler
```
handleMessage(msg):
if prs.Height == msg.Height then
prs.setHasVote(msg.Height, msg.Round, msg.Type, msg.Index)
```
### VoteSetMaj23Message handler
```
handleMessage(msg):
if prs.Height == msg.Height then
Record in rs that a peer claims to have a ⅔ majority for msg.BlockID
Send VoteSetBitsMessage showing the votes the node has for that BlockID
```
### ProposalMessage handler
```
handleMessage(msg):
if prs.Height != msg.Height || prs.Round != msg.Round || prs.Proposal then return
prs.Proposal = true
if prs.ProposalBlockParts == empty set then // otherwise it is set in NewValidBlockMessage handler
prs.ProposalBlockPartsHeader = msg.BlockPartsHeader
prs.ProposalPOLRound = msg.POLRound
prs.ProposalPOL = nil
Send msg through internal peerMsgQueue to ConsensusState service
```
### ProposalPOLMessage handler
```
handleMessage(msg):
if prs.Height != msg.Height or prs.ProposalPOLRound != msg.ProposalPOLRound then return
prs.ProposalPOL = msg.ProposalPOL
```
### BlockPartMessage handler
```
handleMessage(msg):
if prs.Height != msg.Height || prs.Round != msg.Round then return
Record in prs that peer has block part msg.Part.Index
Send msg through internal peerMsgQueue to ConsensusState service
```
### VoteMessage handler
```
handleMessage(msg):
Record in prs that a peer knows the vote with index msg.vote.ValidatorIndex for the particular height and round
Send msg through internal peerMsgQueue to ConsensusState service
```
### VoteSetBitsMessage handler
```
handleMessage(msg):
Update prs for the bit-array of votes peer claims to have for the msg.BlockID
```
## Gossip Data Routine
It is used to send the following messages to the peer: `BlockPartMessage`, `ProposalMessage` and
`ProposalPOLMessage` on the DataChannel. The gossip data routine is based on the local RoundState (`rs`)
and the known PeerRoundState (`prs`). The routine repeats forever the logic shown below:
```
1a) if rs.ProposalBlockPartsHeader == prs.ProposalBlockPartsHeader and the peer does not have all the proposal parts then
Part = pick a random proposal block part the peer does not have
Send BlockPartMessage(rs.Height, rs.Round, Part) to the peer on the DataChannel
if send returns true, record that the peer knows the corresponding block Part
Continue
1b) if (0 < prs.Height) and (prs.Height < rs.Height) then
help peer catch up using gossipDataForCatchup function
Continue
1c) if (rs.Height != prs.Height) or (rs.Round != prs.Round) then
Sleep PeerGossipSleepDuration
Continue
// at this point rs.Height == prs.Height and rs.Round == prs.Round
1d) if (rs.Proposal != nil and !prs.Proposal) then
Send ProposalMessage(rs.Proposal) to the peer
if send returns true, record that the peer knows Proposal
if 0 <= rs.Proposal.POLRound then
polRound = rs.Proposal.POLRound
prevotesBitArray = rs.Votes.Prevotes(polRound).BitArray()
Send ProposalPOLMessage(rs.Height, polRound, prevotesBitArray)
Continue
2) Sleep PeerGossipSleepDuration
```
### Gossip Data For Catchup
This function is responsible for helping the peer catch up if it is at a smaller height (prs.Height < rs.Height).
The function executes the following logic:
if peer does not have all block parts for prs.ProposalBlockPart then
blockMeta = Load Block Metadata for height prs.Height from blockStore
if blockMeta.BlockID.PartsHeader != prs.ProposalBlockPartsHeader then
Sleep PeerGossipSleepDuration
return
Part = pick a random proposal block part the peer does not have
Send BlockPartMessage(prs.Height, prs.Round, Part) to the peer on the DataChannel
if send returns true, record that the peer knows the corresponding block Part
return
else Sleep PeerGossipSleepDuration
## Gossip Votes Routine
It is used to send the following message: `VoteMessage` on the VoteChannel.
The gossip votes routine is based on the local RoundState (`rs`)
and the known PeerRoundState (`prs`). The routine repeats forever the logic shown below:
```
1a) if rs.Height == prs.Height then
if prs.Step == RoundStepNewHeight then
vote = random vote from rs.LastCommit the peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
if prs.Step <= RoundStepPrevote and prs.Round != -1 and prs.Round <= rs.Round then
Prevotes = rs.Votes.Prevotes(prs.Round)
vote = random vote from Prevotes the peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
if prs.Step <= RoundStepPrecommit and prs.Round != -1 and prs.Round <= rs.Round then
Precommits = rs.Votes.Precommits(prs.Round)
vote = random vote from Precommits the peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
if prs.ProposalPOLRound != -1 then
PolPrevotes = rs.Votes.Prevotes(prs.ProposalPOLRound)
vote = random vote from PolPrevotes the peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
1b) if prs.Height != 0 and rs.Height == prs.Height+1 then
vote = random vote from rs.LastCommit peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
1c) if prs.Height != 0 and rs.Height >= prs.Height+2 then
Commit = get commit from BlockStore for prs.Height
vote = random vote from Commit the peer does not have
Send VoteMessage(vote) to the peer
if send returns true, continue
2) Sleep PeerGossipSleepDuration
```
## QueryMaj23Routine
It is used to send the following message: `VoteSetMaj23Message`. `VoteSetMaj23Message` is sent to indicate that a given
BlockID has seen +2/3 votes. This routine is based on the local RoundState (`rs`) and the known PeerRoundState
(`prs`). The routine repeats forever the logic shown below.
```
1a) if rs.Height == prs.Height then
Prevotes = rs.Votes.Prevotes(prs.Round)
if there is a ⅔ majority for some blockId in Prevotes then
m = VoteSetMaj23Message(prs.Height, prs.Round, Prevote, blockId)
Send m to peer
Sleep PeerQueryMaj23SleepDuration
1b) if rs.Height == prs.Height then
Precommits = rs.Votes.Precommits(prs.Round)
if there is a ⅔ majority for some blockId in Precommits then
m = VoteSetMaj23Message(prs.Height,prs.Round,Precommit,blockId)
Send m to peer
Sleep PeerQueryMaj23SleepDuration
1c) if rs.Height == prs.Height and prs.ProposalPOLRound >= 0 then
Prevotes = rs.Votes.Prevotes(prs.ProposalPOLRound)
if there is a ⅔ majority for some blockId in Prevotes then
m = VoteSetMaj23Message(prs.Height,prs.ProposalPOLRound,Prevotes,blockId)
Send m to peer
Sleep PeerQueryMaj23SleepDuration
1d) if prs.CatchupCommitRound != -1 and 0 < prs.Height and
prs.Height <= blockStore.Height() then
Commit = LoadCommit(prs.Height)
m = VoteSetMaj23Message(prs.Height,Commit.Round,Precommit,Commit.blockId)
Send m to peer
Sleep PeerQueryMaj23SleepDuration
2) Sleep PeerQueryMaj23SleepDuration
```
## Broadcast routine
The Broadcast routine subscribes to an internal event bus to receive new round step and vote messages, and broadcasts messages to peers upon receiving those
events.
It broadcasts `NewRoundStepMessage` or `CommitStepMessage` upon new round state event. Note that
broadcasting these messages does not depend on the PeerRoundState; it is sent on the StateChannel.
Upon receiving VoteMessage it broadcasts `HasVoteMessage` message to its peers on the StateChannel.
## Channels
Defines 4 channels: state, data, vote and vote_set_bits. Each channel
has `SendQueueCapacity` and `RecvBufferCapacity` and
`RecvMessageCapacity` set to `maxMsgSize`.
Sending incorrectly encoded data will result in stopping the peer.

spec/reactors/consensus/consensus.md (+184 -0)

@@ -0,0 +1,184 @@
# Tendermint Consensus Reactor
Tendermint Consensus is a distributed protocol executed by validator processes to agree on
the next block to be added to the Tendermint blockchain. The protocol proceeds in rounds, where
each round is an attempt to reach agreement on the next block. A round starts by having a dedicated
process (called proposer) suggesting to other processes what should be the next block with
the `ProposalMessage`.
The processes respond by voting for a block with `VoteMessage` (there are two kinds of vote
messages, prevote and precommit votes). Note that a proposal message is just a suggestion what the
next block should be; a validator might vote with a `VoteMessage` for a different block. If in some
round, enough number of processes vote for the same block, then this block is committed and later
added to the blockchain. `ProposalMessage` and `VoteMessage` are signed by the private key of the
validator. The internals of the protocol and how it ensures safety and liveness properties are
explained in a forthcoming document.
For efficiency reasons, validators in the Tendermint consensus protocol do not agree directly on the
block, as blocks can be large; i.e., they don't embed the block inside `Proposal` and
`VoteMessage`. Instead, they reach agreement on the `BlockID` (see the `BlockID` definition in the
[Blockchain](https://github.com/tendermint/tendermint/blob/master/docs/spec/blockchain/blockchain.md#blockid) section) that uniquely identifies each block. The block itself is
disseminated to validator processes using a peer-to-peer gossiping protocol: the
proposer first splits a block into a number of block parts, which are then gossiped between
processes using `BlockPartMessage`.
Validators in Tendermint communicate via a peer-to-peer gossiping protocol. Each validator is connected
only to a subset of processes called peers. Through the gossiping protocol, a validator sends its peers
all the information (`ProposalMessage`, `VoteMessage` and `BlockPartMessage`) they need so they can
reach agreement on some block and obtain the content of the chosen block (block parts). As
part of the gossiping protocol, processes also send auxiliary messages that inform peers about the
executed steps of the core consensus algorithm (`NewRoundStepMessage` and `NewValidBlockMessage`), and
also messages that inform peers what votes the process has seen (`HasVoteMessage`,
`VoteSetMaj23Message` and `VoteSetBitsMessage`). These messages are then used in the gossiping
protocol to determine what messages a process should send to its peers.
We now describe the content of each message exchanged during the Tendermint consensus protocol.
## ProposalMessage
ProposalMessage is sent when a new block is proposed. It is a suggestion of what the
next block in the blockchain should be.
```go
type ProposalMessage struct {
Proposal Proposal
}
```
### Proposal
Proposal contains the height and round for which this proposal is made, the BlockID as a unique identifier
of the proposed block, a timestamp, and the POLRound (a so-called Proof-of-Lock (POL) round) that is needed for
the termination of consensus. If POLRound >= 0, then BlockID corresponds to the block that
is locked in POLRound. The message is signed by the validator's private key.
```go
type Proposal struct {
Height int64
Round int
POLRound int
BlockID BlockID
Timestamp Time
Signature Signature
}
```
## VoteMessage
VoteMessage is sent to vote for some block (or to inform others that a process does not vote in the
current round). Vote is defined in the [Blockchain](https://github.com/tendermint/tendermint/blob/master/docs/spec/blockchain/blockchain.md#blockid) section and contains the validator's
information (validator address and index), the height and round for which the vote is sent, the vote type,
the BlockID if the process votes for some block (`nil` otherwise), and a timestamp when the vote is sent. The
message is signed by the validator's private key.
```go
type VoteMessage struct {
Vote Vote
}
```
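For reference, the embedded `Vote` has roughly the following shape; the authoritative definition lives in the Blockchain section, so treat this as an informal sketch.

```go
// Informal sketch of the Vote type referenced above; see the Blockchain
// section for the authoritative definition.
type Vote struct {
	Type             byte    // prevote or precommit
	Height           int64
	Round            int
	BlockID          BlockID // zero BlockID means the process votes for no block
	Timestamp        Time
	ValidatorAddress Address
	ValidatorIndex   int
	Signature        []byte
}
```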
## BlockPartMessage
BlockPartMessage is sent when gossiping a piece of the proposed block. It contains the height, round
and the block part.
```go
type BlockPartMessage struct {
Height int64
Round int
Part Part
}
```
## NewRoundStepMessage
NewRoundStepMessage is sent for every step transition during the execution of the core consensus algorithm.
It is used in the gossip part of the Tendermint protocol to inform peers about the current
height/round/step a process is in.
```go
type NewRoundStepMessage struct {
Height int64
Round int
Step RoundStepType
SecondsSinceStartTime int
LastCommitRound int
}
```
## NewValidBlockMessage
NewValidBlockMessage is sent when a validator observes a valid block B in some round r,
i.e., there is a Proposal for block B and +2/3 prevotes for block B in round r.
It contains the height and round in which the valid block was observed, the block parts header that describes
the valid block and is used to obtain all
block parts, and a bit array of the block parts the process currently has, so its peers know which
parts it is missing and can send them.
If the block is also committed, then the IsCommit flag is set to true.
```go
type NewValidBlockMessage struct {
Height int64
Round int
BlockPartsHeader PartSetHeader
BlockParts BitArray
IsCommit bool
}
```
## ProposalPOLMessage
ProposalPOLMessage is sent when a previous block is re-proposed.
It is used to inform peers in which round the process learned of this block (ProposalPOLRound),
and what prevotes for the re-proposed block the process has.
```go
type ProposalPOLMessage struct {
Height int64
ProposalPOLRound int
ProposalPOL BitArray
}
```
## HasVoteMessage
HasVoteMessage is sent to indicate that a particular vote has been received. It contains height,
round, vote type and the index of the validator that is the originator of the corresponding vote.
```go
type HasVoteMessage struct {
Height int64
Round int
Type byte
Index int
}
```
## VoteSetMaj23Message
VoteSetMaj23Message is sent to indicate that a process has seen +2/3 votes for some BlockID.
It contains height, round, vote type and the BlockID.
```go
type VoteSetMaj23Message struct {
Height int64
Round int
Type byte
BlockID BlockID
}
```
## VoteSetBitsMessage
VoteSetBitsMessage is sent to communicate the bit-array of votes a process has seen for a given
BlockID. It contains height, round, vote type, BlockID and a bit array of
the votes a process has.
```go
type VoteSetBitsMessage struct {
Height int64
Round int
Type byte
BlockID BlockID
Votes BitArray
}
```

+ 291
- 0
spec/reactors/consensus/proposer-selection.md

@ -0,0 +1,291 @@
# Proposer selection procedure in Tendermint
This document specifies the Proposer Selection Procedure that is used in Tendermint to choose a round proposer.
As Tendermint is a "leader-based" protocol, the proposer selection is critical for its correct functioning.
At a given block height, the proposer selection algorithm runs with the same validator set at each round.
Between heights, an updated validator set may be specified by the application as part of the ABCIResponses' EndBlock.
## Requirements for Proposer Selection
This section covers the requirements, with Rx denoting mandatory requirements and Ox denoting optional ones.
The following requirements must be met by the Proposer Selection procedure:
#### R1: Determinism
Given a validator set `V`, and two honest validators `p` and `q`, for each height `h` and each round `r` the following must hold:
`proposer_p(h,r) = proposer_q(h,r)`
where `proposer_p(h,r)` is the proposer returned by the Proposer Selection Procedure at process `p`, at height `h` and round `r`.
#### R2: Fairness
Given a validator set with total voting power P and a sequence S of elections, in any sub-sequence of S of length C*P a validator v must be elected as proposer once every P/VP(v) elections, i.e. with frequency:
f(v) ~ VP(v) / P
where C is a tolerance factor for validator set changes with the following values:
- C == 1 if there are no validator set changes
- C ~ k when there are validator changes
*[this needs more work]*
### Basic Algorithm
At its core, the proposer selection procedure uses a weighted round-robin algorithm.
A model that gives good intuition on how and why the selection algorithm works, and why it is fair, is that of a priority queue. The validators move ahead in this queue according to their voting power (the higher the voting power, the faster a validator moves towards the head of the queue). When the algorithm runs, the following happens:
- all validators move "ahead" according to their powers: for each validator, increase the priority by the voting power
- the first in the queue becomes the proposer: select the validator with the highest priority
- move the proposer back in the queue: decrease the proposer's priority by the total voting power
Notation:
- vset - the validator set
- n - the number of validators
- VP(i) - voting power of validator i
- A(i) - accumulated priority for validator i
- P - total voting power of set
- avg - average of all validator priorities
- prop - proposer
Simple view at the Selection Algorithm:
```
def ProposerSelection (vset):
// compute priorities and elect proposer
for each validator i in vset:
A(i) += VP(i)
prop = max(A)
A(prop) -= P
```
### Stable Set
Consider the validator set:
Validator | p1| p2
----------|---|---
VP | 1 | 3
Assuming no validator changes, the following table shows the proposer priority computation over a few runs. Four runs of the selection procedure are shown; starting with the 5th, the same values are computed.
Each row shows the priority queue and each process's place in it. The proposer is the one closest to the head, the rightmost validator. As priorities are updated, the validators move right in the queue. The proposer moves left as its priority is reduced after election.
|Priority Run | -2| -1| 0 | 1| 2 | 3 | 4 | 5 | Alg step
|--------------- |---|---|---- |---|---- |---|---|---|--------
| | | |p1,p2| | | | | |Initialized to 0
|run 1 | | | | p1| | p2| | |A(i)+=VP(i)
| | | p2| | p1| | | | |A(p2)-= P
|run 2 | | | | |p1,p2| | | |A(i)+=VP(i)
| | p1| | | | p2| | | |A(p1)-= P
|run 3 | | p1| | | | | | p2|A(i)+=VP(i)
| | | p1| | p2| | | | |A(p2)-= P
|run 4 | | | p1| | | | p2| |A(i)+=VP(i)
| | | |p1,p2| | | | | |A(p2)-= P
It can be shown that:
- At the end of each run k+1 the sum of the priorities is the same as at the end of run k. If a new set's priorities are initialized to 0 then the sum of priorities will be 0 at each run while there are no changes.
- The max distance between priorities is (n-1) * P. *[formal proof not finished]*
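For concreteness, here is a minimal runnable Go sketch of the basic algorithm, reproducing the stable-set example above. Names are illustrative, and ties are broken by slice order here, which is a simplification of the actual implementation.

```go
package main

import "fmt"

// Validator is an illustrative stand-in for a validator record.
type Validator struct {
	Name     string
	VP       int64 // voting power VP(i)
	Priority int64 // accumulated priority A(i)
}

// proposerSelection runs one round of the weighted round-robin described above.
// Ties are broken by slice order, a simplification of the real implementation.
func proposerSelection(vset []*Validator) *Validator {
	var total int64
	for _, v := range vset {
		v.Priority += v.VP // every validator moves ahead by its voting power
		total += v.VP
	}
	prop := vset[0]
	for _, v := range vset[1:] {
		if v.Priority > prop.Priority {
			prop = v
		}
	}
	prop.Priority -= total // the elected proposer moves to the back of the queue
	return prop
}

func main() {
	vset := []*Validator{{Name: "p1", VP: 1}, {Name: "p2", VP: 3}}
	for i := 0; i < 8; i++ {
		fmt.Print(proposerSelection(vset).Name, " ")
	}
	fmt.Println() // expected: p2 p1 p2 p2 p2 p1 p2 p2
}
```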
### Validator Set Changes
Between proposer selection runs the validator set may change. Some changes have implications for the proposer election.
#### Voting Power Change
Consider again the earlier example and assume that the voting power of p1 is changed to 4:
Validator | p1| p2
----------|---| ---
VP | 4 | 3
Let's also assume that before this change the proposer priorities were as shown in the first row (last run). As can be seen, the selection could run again, without changes, as before.
|Priority Run| -2 | -1 | 0 | 1 | 2 | Comment
|--------------| ---|--- |------|--- |--- |--------
| last run | | p2 | | p1 | |__update VP(p1)__
| next run | | | | | p2 |A(i)+=VP(i)
| | p1 | | | | p2 |A(p1)-= P
However, when a validator changes power from a high to a low value, some other validators may remain far back in the queue for a long time. This scenario is considered again in the Proposer Priority Range section.
As before:
- At the end of each run k+1 the sum of the priorities is the same as at run k.
- The max distance between priorities is (n-1) * P.
#### Validator Removal
Consider a new example with set:
Validator | p1 | p2 | p3 |
--------- |--- |--- |--- |
VP | 1 | 2 | 3 |
Let's assume that after the last run the proposer priorities were as shown in the first row, with their sum being 0. After p2 is removed, at the end of the next proposer selection run (penultimate row) the sum of priorities is -2 (minus the priority of the removed process).
The procedure could continue without modifications. However, after a sufficiently large number of modifications to the validator set, the priority values would migrate towards the maximum or minimum allowed values, causing truncations due to overflow detection.
For this reason, the selection procedure adds another __new step__ that centers the current priority values such that the priority sum remains close to 0.
|Priority Run |-3 | -2 | -1 | 0 | 1 | 2 | 4 |Comment
|--------------- |--- | ---|--- |--- |--- |--- |---|--------
| last run |p3 | | | | p1 | p2 | |__remove p2__
| nextrun | | | | | | | |
| __new step__ | | p3 | | | | p1 | |A(i) -= avg, avg = -1
| | | | | | p3 | p1 | |A(i)+=VP(i)
| | | | p1 | | p3 | | |A(p1)-= P
The modified selection algorithm is:

```
def ProposerSelection (vset):

    // center priorities around zero
    avg = sum(A(i) for i in vset)/len(vset)
    for each validator i in vset:
        A(i) -= avg

    // compute priorities and elect proposer
    for each validator i in vset:
        A(i) += VP(i)
    prop = max(A)
    A(prop) -= P
```
Observations:
- The sum of priorities is now close to 0. Due to integer division the sum is an integer in (-n, n), where n is the number of validators.
#### New Validator
When a new validator is added, the same problem as the one described for removal appears: the sum of priorities in the new set is not zero. This is fixed with the centering step introduced above.
One other issue that needs to be addressed is the following. A validator V that has just been elected is moved to the end of the queue. If the validator set is large and/or other validators have significantly higher power, V will have to wait many runs to be elected. If V removes and re-adds itself to the set, it would make a significant (albeit unfair) "jump" ahead in the queue.
In order to prevent this, when a new validator is added, its initial priority is set to:
A(V) = -1.125 * P
where P is the total voting power of the set including V.
The current implementation uses the penalty factor of 1.125 because it provides a small punishment that is efficient to calculate. See [here](https://github.com/tendermint/tendermint/pull/2785#discussion_r235038971) for more details.
If we consider the validator set where p3 has just been added:
Validator | p1 | p2 | p3
----------|--- |--- |---
VP | 1 | 3 | 8
then p3 will start with proposer priority:
A(p3) = -1.125 * (1 + 3 + 8) ~ -13
Note that since the current computation uses integer division, there is penalty loss when the sum of the voting power is less than 8.
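As a rough illustration of that integer computation (the `-(P + P/8)` form is an assumption for this example, not taken from this spec):

```go
// Illustrative only: assumes -1.125*P is computed as -(P + P/8) in integer
// arithmetic, so the 0.125*P penalty term is lost entirely when P < 8.
func initialProposerPriority(totalVotingPower int64) int64 {
	return -(totalVotingPower + totalVotingPower/8)
}

// initialProposerPriority(12) == -13  // matches A(p3) in the example above
// initialProposerPriority(7)  == -7   // penalty term 7/8 rounds to zero
```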
In the next run, p3 will still be ahead in the queue, elected as proposer and moved back in the queue.
|Priority Run |-13 | -9 | -5 | -2 | -1 | 0 | 1 | 2 | 5 | 6 | 7 |Alg step
|---------------|--- |--- |--- |----|--- |--- |---|---|---|---|---|--------
|last run | | | | p2 | | | | p1| | | |__add p3__
| | p3 | | | p2 | | | | p1| | | |A(p3) = -13
|next run | | p3 | | | | | | p2| | p1| |A(i) -= avg, avg = -4
| | | | | | p3 | | | | p2| | p1|A(i)+=VP(i)
| | | | p1 | | p3 | | | | p2| | |A(p1)-=P
### Proposer Priority Range
With the introduction of centering, some interesting cases occur. Low power validators that bond early in a set that includes high power validator(s) benefit from subsequent additions to the set. This is because these early validators run through more right shift operations during centering, operations that increase their priority.
As an example, consider the set where p2 is added after p1, with priority -1.125 * 80k = -90k. After the selection procedure runs once:
Validator | p1 | p2 | Comment
----------|-----|---- |---
VP | 80k | 10 |
A | 0 |-90k | __added p2__
A |-45k | 45k | __run selection__
Then execute the following steps:
1. Add a new validator p3:
Validator | p1 | p2 | p3
----------|-----|--- |----
VP | 80k | 10 | 10
2. Run selection once. The notation '..p'/'p..' means very small deviations compared to column priority.
|Priority Run | -90k..| -60k | -45k | -15k| 0 | 45k | 75k | 155k | Comment
|--------------|------ |----- |------- |---- |---|---- |----- |------- |---------
| last run | p3 | | p2 | | | p1 | | | __added p3__
| next run
| *right_shift*| | p3 | | p2 | | | p1 | | A(i) -= avg,avg=-30k
| | | ..p3| | ..p2| | | | p1 | A(i)+=VP(i)
| | | ..p3| | ..p2| | | p1.. | | A(p1)-=P, P=80k+20
3. Remove p1 and run selection once:
Validator | p3 | p2 | Comment
----------|----- |---- |--------
VP | 10 | 10 |
A |-60k |-15k |
A |-22.5k|22.5k| __run selection__
At this point, while the total voting power is 20, the distance between priorities is 45k. It will take 4500 runs for p3 to catch up with p2.
In order to prevent these types of scenarios, the selection algorithm performs scaling of priorities such that the difference between min and max values is smaller than two times the total voting power.
The modified selection algorithm is:

```
def ProposerSelection (vset):

    // scale the priority values
    diff = max(A)-min(A)
    threshold = 2 * P
    if diff > threshold:
        scale = diff/threshold
        for each validator i in vset:
            A(i) = A(i)/scale

    // center priorities around zero
    avg = sum(A(i) for i in vset)/len(vset)
    for each validator i in vset:
        A(i) -= avg

    // compute priorities and elect proposer
    for each validator i in vset:
        A(i) += VP(i)
    prop = max(A)
    A(prop) -= P
```
Observations:
- With this modification, the maximum distance between priorities becomes 2 * P.
Note also that even during steady state the priority range may increase beyond 2 * P. The scaling introduced here helps to keep the range bounded.
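Reusing the `Validator` type from the earlier sketch, the full procedure (scaling, then centering, then election) might look roughly like this; the integer rounding details here are assumptions for illustration, not prescribed by this spec.

```go
// Illustrative sketch of the full selection procedure (assumes a non-empty vset);
// reuses the Validator type from the earlier sketch. Rounding behaviour is assumed.
func proposerSelectionFull(vset []*Validator) *Validator {
	var total int64
	for _, v := range vset {
		total += v.VP
	}

	// scale the priority values so the spread stays close to 2 * P
	minA, maxA := vset[0].Priority, vset[0].Priority
	for _, v := range vset {
		if v.Priority < minA {
			minA = v.Priority
		}
		if v.Priority > maxA {
			maxA = v.Priority
		}
	}
	if diff, threshold := maxA-minA, 2*total; diff > threshold {
		scale := diff / threshold
		for _, v := range vset {
			v.Priority /= scale
		}
	}

	// center priorities around zero
	var sum int64
	for _, v := range vset {
		sum += v.Priority
	}
	avg := sum / int64(len(vset))
	for _, v := range vset {
		v.Priority -= avg
	}

	// compute priorities and elect the proposer
	prop := vset[0]
	for _, v := range vset {
		v.Priority += v.VP
		if v.Priority > prop.Priority {
			prop = v
		}
	}
	prop.Priority -= total
	return prop
}
```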
### Wrinkles
#### Validator Power Overflow Conditions
The validator voting power is a positive number stored as an int64. When a validator is added, the `1.125 * P` computation must not overflow. As a consequence, the code handling validator updates (add and update) checks for overflow conditions, making sure the total voting power is never larger than the largest int64 `MAX`, with the property that `1.125 * MAX` is still within the bounds of int64. A fatal error is returned when an overflow condition is detected.
#### Proposer Priority Overflow/Underflow Handling
The proposer priority is stored as an int64. The selection algorithm performs additions and subtractions on these values and, in the case of overflows and underflows, it limits the values to:

```
MaxInt64 = 1 << 63 - 1
MinInt64 = -1 << 63
```
### Requirement Fulfillment Claims
__[R1]__
The proposer algorithm is deterministic, giving consistent results across executions with the same transactions and validator set modifications.
[WIP - needs more detail]
__[R2]__
Given a set of processes with the total voting power P, during a sequence of elections of length P, the number of times any process is selected as proposer is equal to its voting power. The sequence of the P proposers then repeats. If we consider the validator set:
Validator | p1| p2
----------|---|---
VP | 1 | 3
With no other changes to the validator set, the current implementation of proposer selection generates the sequence:
`p2, p1, p2, p2, p2, p1, p2, p2,...` or [`p2, p1, p2, p2`]*
A sequence that starts with any circular permutation of the [`p2, p1, p2, p2`] sub-sequence would also provide the same degree of fairness. In fact these circular permutations appear in the sliding window (over the generated sequence) of size equal to the length of the sub-sequence.
Assigning priorities to each validator based on the voting power and updating them at each run ensures the fairness of the proposer selection. In addition, every time a validator is elected as proposer its priority is decreased by the total voting power.
Intuitively, a process v jumps ahead in the queue at most (max(A) - min(A))/VP(v) times until it reaches the head and is elected. The frequency is then:
f(v) ~ VP(v)/(max(A)-min(A)) = 1/k * VP(v)/P
For the current implementation, this means v should be proposer at least VP(v) times out of k * P runs, with scaling factor k=2.

+ 10
- 0
spec/reactors/evidence/reactor.md

@ -0,0 +1,10 @@
# Evidence Reactor
## Channels
[#1503](https://github.com/tendermint/tendermint/issues/1503)
Sending invalid evidence will result in stopping the peer.
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result
in stopping the peer.

+ 8
- 0
spec/reactors/mempool/concurrency.md

@ -0,0 +1,8 @@
# Mempool Concurrency
Look at the concurrency model this uses...
- Receiving CheckTx
- Broadcasting new txs
- Interfaces with the consensus engine: reap/update while checking
- Calling the ABCI app (ordering, callbacks, how the proxy works alongside the blockchain proxy which actually writes blocks)

+ 54
- 0
spec/reactors/mempool/config.md

@ -0,0 +1,54 @@
# Mempool Configuration
Here we describe configuration options around the mempool.
For the purposes of this document, they are described
as command-line flags, but they can also be passed in as
environment variables or in the config.toml file. The
following are all equivalent:
Flag: `--mempool.recheck=false`
Environment: `TM_MEMPOOL_RECHECK=false`
Config:
```
[mempool]
recheck = false
```
## Recheck
`--mempool.recheck=false` (default: true)
Recheck determines if the mempool rechecks all pending
transactions after a block is committed. Once a block
is committed, the mempool removes all valid transactions
that were successfully included in the block.
If `recheck` is true, then it will rerun CheckTx on
all remaining transactions with the new block state.
## Broadcast
`--mempool.broadcast=false` (default: true)
Determines whether this node gossips any valid transactions
that arrive in the mempool. The default is to gossip anything that
passes CheckTx. If this is disabled, transactions are not
gossiped, but instead stored locally and added to the next
block for which this node is the proposer.
## WalDir
`--mempool.wal_dir=/tmp/gaia/mempool.wal` (default: $TM_HOME/data/mempool.wal)
This defines the directory where the mempool writes its write-ahead
logs. These files can be used to reload unbroadcast
transactions if the node crashes.
If the directory passed in is an absolute path, the wal file is
created there. If the directory is a relative path, the path is
appended to the home directory of the tendermint process to
generate an absolute path to the wal directory
(default `$HOME/.tendermint`, or as set via `TM_HOME` or `--home`).
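As a rough sketch of that path resolution (an illustrative helper, not the actual config code):

```go
import "path/filepath"

// walDirPath resolves a relative wal_dir against the node's home directory.
// Illustrative helper only; the real logic lives in the config package.
func walDirPath(walDir, homeDir string) string {
	if filepath.IsAbs(walDir) {
		return walDir
	}
	return filepath.Join(homeDir, walDir)
}
```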

+ 43
- 0
spec/reactors/mempool/functionality.md

@ -0,0 +1,43 @@
# Mempool Functionality
The mempool maintains a list of potentially valid transactions,
both to broadcast to other nodes and to provide to the
consensus reactor when the node is selected as the block proposer.
There are two sides to the mempool state:
- External: get, check, and broadcast new transactions
- Internal: return valid transactions, update the list after a block commit
## External functionality
External functionality is exposed via network interfaces
to potentially untrusted actors.
- CheckTx - triggered via RPC or P2P
- Broadcast - gossip messages after a successful check
## Internal functionality
Internal functionality is exposed via method calls to other
code compiled into the tendermint binary.
- ReapMaxBytesMaxGas - get txs to propose in the next block. Guarantees that the
size of the txs is less than MaxBytes, and gas is less than MaxGas
- Update - remove txs that were included in the last block
- ABCI.CheckTx - call the ABCI app to validate the tx
What does it provide the consensus reactor?
What guarantees does it need from the ABCI app?
(talk about interleaving processes in concurrency)
## Optimizations
The implementation within this library also implements a tx cache.
This is so that signatures don't have to be reverified if the tx has
already been seen before.
However, we only store valid txs in the cache, not invalid ones.
This is because invalid txs could become valid later.
Txs that are included in a block aren't removed from the cache,
as they still may be getting received over the p2p network.
These txs are stored in the cache by their hash, to mitigate memory concerns.
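A minimal sketch of such a hash-keyed cache follows; names are illustrative, and the real cache also bounds its size and evicts old entries.

```go
import (
	"crypto/sha256"
	"sync"
)

// txCache remembers txs by hash so CheckTx (and signature verification)
// is not repeated for txs we have already seen. Illustrative sketch only.
type txCache struct {
	mtx  sync.Mutex
	seen map[[sha256.Size]byte]struct{}
}

func newTxCache() *txCache {
	return &txCache{seen: make(map[[sha256.Size]byte]struct{})}
}

// Push records the tx and reports whether it was new.
func (c *txCache) Push(tx []byte) bool {
	key := sha256.Sum256(tx)
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if _, ok := c.seen[key]; ok {
		return false // already seen; skip re-checking
	}
	c.seen[key] = struct{}{}
	return true
}
```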

+ 61
- 0
spec/reactors/mempool/messages.md

@ -0,0 +1,61 @@
# Mempool Messages
## P2P Messages
There is currently only one message that Mempool broadcasts
and receives over the p2p gossip network (via the reactor):
`TxMessage`
```go
// TxMessage is a MempoolMessage containing a transaction.
type TxMessage struct {
Tx types.Tx
}
```
TxMessage is go-amino encoded and prepended with `0x1` as a
"type byte". This is followed by a go-amino encoded byte-slice.
Prefix of 40=0x28 byte tx is: `0x010128...` followed by
the actual 40-byte tx. Prefix of 350=0x015e byte tx is:
`0x0102015e...` followed by the actual 350 byte tx.
(Please see the [go-amino repo](https://github.com/tendermint/go-amino#an-interface-example) for more information)
## RPC Messages
Mempool exposes `CheckTx([]byte)` over the RPC interface.
It can be posted via `broadcast_tx_commit`, `broadcast_tx_sync` or
`broadcast_tx_async`. They all take a single argument,
`"tx": "HEX_ENCODED_BINARY"`, and differ only in how long they
wait before returning (sync makes sure CheckTx passes, commit
makes sure it was included in a signed block).
Request (`POST http://gaia.zone:26657/`):
```json
{
"id": "",
"jsonrpc": "2.0",
"method": "broadcast_sync",
"params": {
"tx": "F012A4BC68..."
}
}
```
Response:
```json
{
"error": "",
"result": {
"hash": "E39AAB7A537ABAA237831742DCE1117F187C3C52",
"log": "",
"data": "",
"code": 0
},
"id": "",
"jsonrpc": "2.0"
}
```

+ 22
- 0
spec/reactors/mempool/reactor.md

@ -0,0 +1,22 @@
# Mempool Reactor
## Channels
See [this issue](https://github.com/tendermint/tendermint/issues/1503)
Mempool maintains a cache of the last 10000 transactions to prevent
replaying old transactions (plus transactions coming from other
validators, who are continually exchanging transactions). Read [Replay
Protection](../../../app-dev/app-development.md#replay-protection)
for details.
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result
in stopping the peer.
The mempool will not send a tx back to any peer it received it from.
The reactor assigns a `uint16` number to each peer and maintains a map from
p2p.ID to `uint16`. Each mempool transaction carries a list of all its senders
(`[]uint16`). The list is updated every time the mempool receives a transaction it
has already seen. Using `uint16` assumes that a node will never have more than 65535 active
peers (0 is reserved for an unknown source, e.g. RPC).
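A sketch of that bookkeeping is shown below; names are illustrative and simplified from the actual reactor.

```go
import "sync"

// senderIDs assigns a compact uint16 to each peer so that every cached tx
// can record which peers already sent it. Illustrative sketch only.
type senderIDs struct {
	mtx    sync.Mutex
	ids    map[string]uint16 // keyed by p2p.ID (a string)
	nextID uint16            // 0 is reserved for unknown sources (e.g. RPC)
}

func newSenderIDs() *senderIDs {
	return &senderIDs{ids: make(map[string]uint16), nextID: 1}
}

// idFor returns the uint16 assigned to the peer, allocating one on first use.
func (s *senderIDs) idFor(peerID string) uint16 {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if id, ok := s.ids[peerID]; ok {
		return id
	}
	id := s.nextID
	s.ids[peerID] = id
	s.nextID++
	return id
}

// A cached mempool tx carries the IDs of every peer that sent it, so the
// reactor can avoid echoing the tx back to those peers.
type mempoolTx struct {
	tx      []byte
	senders []uint16
}
```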

+ 132
- 0
spec/reactors/pex/pex.md

@ -0,0 +1,132 @@
# Peer Strategy and Exchange
Here we outline the design of the AddressBook
and how it is used by the Peer Exchange Reactor (PEX) to ensure we are connected
to good peers and to gossip peers to others.
## Peer Types
Certain peers are special in that they are specified by the user as `persistent`,
which means we auto-redial them if the connection fails, or if we fail to dial
them.
Some peers can be marked as `private`, which means
we will not put them in the address book or gossip them to others.
All peers except private peers and peers coming from them are tracked using the
address book.
The rest of our peers are only distinguished by being either
inbound (they dialed our public address) or outbound (we dialed them).
## Discovery
Peer discovery begins with a list of seeds.
When we don't have enough peers, we
1. ask existing peers
2. dial seeds if we're not dialing anyone currently
On startup, we will also immediately dial the given list of `persistent_peers`,
and will attempt to maintain persistent connections with them. If the
connections die, or we fail to dial, we will redial every 5s for a few minutes,
then switch to an exponential backoff schedule, and after about a day of
trying, stop dialing the peer.
As long as we have fewer than `MaxNumOutboundPeers`, we periodically request
additional peers from each of our own peers and try the seeds.
## Listening
Peers listen on a configurable ListenAddr that they self-report in their
NodeInfo during handshakes with other peers. Peers accept up to
`MaxNumInboundPeers` incoming peers.
## Address Book
Peers are tracked via their ID (their PubKey.Address()).
Peers are added to the address book from the PEX when they first connect to us or
when we hear about them from other peers.
The address book is arranged in sets of buckets, and distinguishes between
vetted (old) and unvetted (new) peers. It keeps different sets of buckets for vetted and
unvetted peers. Buckets provide randomization over peer selection. Peers are put
in buckets according to their IP groups.
A vetted peer can only be in one bucket. An unvetted peer can be in multiple buckets, and
each instance of the peer can have a different IP:PORT.
If we're trying to add a new peer but there's no space in its bucket, we'll
remove the worst peer from that bucket to make room.
## Vetting
When a peer is first added, it is unvetted.
Marking a peer as vetted is outside the scope of the `p2p` package.
For Tendermint, a Peer becomes vetted once it has contributed sufficiently
at the consensus layer; i.e. once it has sent us valid and not-yet-known
votes and/or block parts for `NumBlocksForVetted` blocks.
Other users of the p2p package can determine their own conditions for when a peer is marked vetted.
If a peer becomes vetted but there are already too many vetted peers,
a randomly selected one of the vetted peers becomes unvetted.
If a peer becomes unvetted (either a new peer, or one that was previously vetted),
a randomly selected one of the unvetted peers is removed from the address book.
More fine-grained tracking of peer behaviour can be done using
a trust metric (see below), but it's best to start with something simple.
## Select Peers to Dial
When we need more peers, we pick addresses randomly from the addrbook with some
configurable bias for unvetted peers. The bias should be lower when we have
fewer peers and can increase as we obtain more, ensuring that our first peers
are more trustworthy, but always giving us the chance to discover new good
peers.
We track the last time we dialed a peer and the number of unsuccessful attempts
we've made. If too many attempts are made, we mark the peer as bad.
Connection attempts are made with exponential backoff (plus jitter). Because
the selection process happens every `ensurePeersPeriod`, we might not end up
dialing a peer for much longer than the backoff duration.
If we fail to connect to the peer after 16 tries (with exponential backoff), we
remove it from the address book completely.
## Select Peers to Exchange
When we’re asked for peers, we select them as follows (see the sketch after this list):
- select at most `maxGetSelection` peers
- try to select at least `minGetSelection` peers - if we have fewer than that, select them all
- select a random, unbiased `getSelectionPercent` of the peers
Send the selected peers. Note we select peers for sending without bias for vetted/unvetted.
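A sketch of that selection (function and parameter names are illustrative, not the actual implementation):

```go
import "math/rand"

// selectPeersToSend picks a random, unbiased subset of known addresses to
// return in a PEX response. Illustrative sketch; parameters are assumptions.
func selectPeersToSend(addrs []string, minGetSelection, maxGetSelection, getSelectionPercent int) []string {
	n := len(addrs) * getSelectionPercent / 100
	if n < minGetSelection {
		n = minGetSelection
	}
	if n > maxGetSelection {
		n = maxGetSelection
	}
	if n > len(addrs) {
		n = len(addrs) // fewer known addresses than the minimum: send them all
	}
	// shuffle and take the first n, without bias for vetted/unvetted
	rand.Shuffle(len(addrs), func(i, j int) { addrs[i], addrs[j] = addrs[j], addrs[i] })
	return addrs[:n]
}
```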
## Preventing Spam
There are various cases where we decide a peer has misbehaved and we disconnect from it.
When this happens, the peer is removed from the address book and blacklisted for
some amount of time. We call this "Disconnect and Mark".
Note that the bad behaviour may be detected outside the PEX reactor itself
(for instance, in the mconnection, or another reactor), but it must be communicated to the PEX reactor
so it can remove and mark the peer.
In the PEX, if a peer sends us an unsolicited list of peers,
or if the peer sends a request too soon after another one,
we Disconnect and MarkBad.
## Trust Metric
The quality of peers can be tracked in more fine-grained detail using a
Proportional-Integral-Derivative (PID) controller that incorporates
current, past, and rate-of-change data to inform peer quality.
While a PID trust metric has been implemented, it remains for future work
to use it in the PEX.
See the [trustmetric](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-006-trust-metric.md)
and [trustmetric usage](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-007-trust-metric-usage.md)
architecture docs for more details.

+ 12
- 0
spec/reactors/pex/reactor.md

@ -0,0 +1,12 @@
# PEX Reactor
## Channels
Defines only `SendQueueCapacity`. [#1503](https://github.com/tendermint/tendermint/issues/1503)
Implements rate-limiting by enforcing a minimal time between two consecutive
`pexRequestMessage` requests. If the peer sends us addresses we did not ask for,
it is stopped.
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result
in stopping the peer.
