# Overview

This is a markdown specification of the Tendermint blockchain.
It defines the base data structures, how they are validated,
and how they are communicated over the network.

If you find discrepancies between the spec and the code that
do not have an associated issue or pull request on github,
please submit them to our [bug bounty](https://tendermint.com/security)!

## Contents

- [Overview](#overview)
### Data Structures | |||
- [Encoding and Digests](./blockchain/encoding.md) | |||
- [Blockchain](./blockchain/blockchain.md) | |||
- [State](./blockchain/state.md) | |||
### Consensus Protocol | |||
- [Consensus Algorithm](./consensus/consensus.md) | |||
- [Creating a proposal](./consensus/creating-proposal.md) | |||
- [Time](./consensus/bft-time.md) | |||
- [Light-Client](./consensus/light-client.md) | |||
### P2P and Network Protocols | |||
- [The Base P2P Layer](./p2p/): multiplex the protocols ("reactors") on authenticated and encrypted TCP connections | |||
- [Peer Exchange (PEX)](./reactors/pex/): gossip known peer addresses so peers can find each other | |||
- [Block Sync](./reactors/block_sync/): gossip blocks so peers can catch up quickly | |||
- [Consensus](./reactors/consensus/): gossip votes and block parts so new blocks can be committed | |||
- [Mempool](./reactors/mempool/): gossip transactions so they get included in blocks | |||
- [Evidence](./reactors/evidence/): sending invalid evidence will stop the peer | |||
### Software | |||
- [ABCI](./software/abci.md): Details about interactions between the | |||
application and consensus engine over ABCI | |||
- [Write-Ahead Log](./software/wal.md): Details about how the consensus | |||
engine preserves data and recovers from crash failures | |||
## Overview | |||
Tendermint provides Byzantine Fault Tolerant State Machine Replication using | |||
hash-linked batches of transactions. Such transaction batches are called "blocks". | |||
Hence, Tendermint defines a "blockchain". | |||
Each block in Tendermint has a unique index - its Height. | |||
Heights in the blockchain are monotonic.
Each block is committed by a known set of weighted Validators. | |||
Membership and weighting within this validator set may change over time. | |||
Tendermint guarantees the safety and liveness of the blockchain | |||
so long as less than 1/3 of the total weight of the Validator set | |||
is malicious or faulty. | |||
A commit in Tendermint is a set of signed messages from more than 2/3 of | |||
the total weight of the current Validator set. Validators take turns proposing | |||
blocks and voting on them. Once enough votes are received, the block is considered | |||
committed. These votes are included in the _next_ block as proof that the previous block | |||
was committed - they cannot be included in the current block, as that block has already been | |||
created. | |||
Once a block is committed, it can be executed against an application. | |||
The application returns results for each of the transactions in the block. | |||
The application can also return changes to be made to the validator set, | |||
as well as a cryptographic digest of its latest state. | |||
Tendermint is designed to enable efficient verification and authentication | |||
of the latest state of the blockchain. To achieve this, it embeds | |||
cryptographic commitments to certain information in the block "header". | |||
This information includes the contents of the block (eg. the transactions), | |||
the validator set committing the block, as well as the various results returned by the application. | |||
Note, however, that block execution only occurs _after_ a block is committed. | |||
Thus, application results can only be included in the _next_ block. | |||
Also note that information like the transaction results and the validator set are never | |||
directly included in the block - only their cryptographic digests (Merkle roots) are. | |||
Hence, verification of a block requires a separate data structure to store this information. | |||
We call this the `State`. Block verification also requires access to the previous block. |
# P2P Config | |||
Here we describe configuration options around the Peer Exchange. | |||
These can be set using flags or via the `$TMHOME/config/config.toml` file. | |||
## Seed Mode | |||
`--p2p.seed_mode` | |||
The node operates in seed mode. In seed mode, a node continuously crawls the network for peers, | |||
and upon incoming connection shares some peers and disconnects. | |||
## Seeds | |||
`--p2p.seeds "id100000000000000000000000000000000@1.2.3.4:26656,id200000000000000000000000000000000@2.3.4.5:4444"`
Dials these seeds when we need more peers. They should return a list of peers and then disconnect. | |||
If we already have enough peers in the address book, we may never need to dial them. | |||
## Persistent Peers | |||
`--p2p.persistent_peers "id100000000000000000000000000000000@1.2.3.4:26656,id200000000000000000000000000000000@2.3.4.5:26656"`
Dial these peers and auto-redial them if the connection fails. | |||
These are intended to be trusted persistent peers that can help | |||
anchor us in the p2p network. The auto-redial uses exponential | |||
backoff and will give up after a day of trying to connect. | |||
**Note:** If `seeds` and `persistent_peers` intersect, | |||
the user will be warned that seeds may auto-close connections | |||
and that the node may not be able to keep the connection persistent. | |||
## Private Peers | |||
`--p2p.private_peer_ids "id100000000000000000000000000000000,id200000000000000000000000000000000"`
These are IDs of the peers that we do not add to the address book or gossip to | |||
other peers. They stay private to us. |
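
For reference, the same options can be set in the `[p2p]` section of `$TMHOME/config/config.toml`. A minimal sketch (values are placeholders; consult the config file generated by your Tendermint version for the authoritative key names and defaults):

```toml
[p2p]
# Act as a seed node: crawl the network for peers, serve addresses, then disconnect.
seed_mode = false

# Comma-separated list of seed nodes to ask for peer addresses.
seeds = "id100000000000000000000000000000000@1.2.3.4:26656"

# Comma-separated list of peers to maintain persistent connections to.
persistent_peers = "id200000000000000000000000000000000@2.3.4.5:26656"

# Comma-separated list of peer IDs to keep private (never added to the address book or gossiped).
private_peer_ids = ""
```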
# P2P Multiplex Connection | |||
## MConnection | |||
`MConnection` is a multiplex connection that supports multiple independent streams | |||
with distinct quality of service guarantees atop a single TCP connection. | |||
Each stream is known as a `Channel` and each `Channel` has a globally unique _byte id_. | |||
Each `Channel` also has a relative priority that determines the quality of service | |||
of the `Channel` compared to other `Channel`s. | |||
The _byte id_ and the relative priorities of each `Channel` are configured upon | |||
initialization of the connection. | |||
The `MConnection` supports three packet types: | |||
- Ping | |||
- Pong | |||
- Msg | |||
### Ping and Pong | |||
The ping and pong messages consist of writing a single byte to the connection; 0x1 and 0x2, respectively. | |||
When we haven't received any messages on an `MConnection` in time `pingTimeout`, we send a ping message. | |||
When a ping is received on the `MConnection`, a pong is sent in response only if there are no other messages | |||
to send and the peer has not sent us too many pings (TODO). | |||
If a pong or message is not received in sufficient time after a ping, the peer is disconnected.
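
As an illustration of the keepalive behaviour described above, a rough sketch follows (timeout values, channel wiring and function names are assumptions, not the actual `MConnection` code):

```go
package p2p

import (
    "errors"
    "net"
    "time"
)

// Assumed timeout values; the real MConnection configuration differs.
const (
    pingTimeout = 40 * time.Second
    pongTimeout = 45 * time.Second
)

// keepaliveLoop sketches the ping/pong behaviour described above: if nothing
// is received for pingTimeout, send a ping (0x1) and expect a pong (0x2)
// within pongTimeout, otherwise disconnect from the peer.
func keepaliveLoop(conn net.Conn, recvActivity, pongReceived <-chan struct{}) error {
    pingTimer := time.NewTimer(pingTimeout)
    defer pingTimer.Stop()
    for {
        select {
        case <-recvActivity:
            // Any received message counts as liveness; restart the ping timer.
            pingTimer.Reset(pingTimeout)
        case <-pingTimer.C:
            // Nothing received for pingTimeout: send a ping and wait for a pong.
            if _, err := conn.Write([]byte{0x01}); err != nil {
                return err
            }
            select {
            case <-pongReceived:
                pingTimer.Reset(pingTimeout)
            case <-time.After(pongTimeout):
                return errors.New("no pong or message received after ping; disconnecting peer")
            }
        }
    }
}
```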
### Msg | |||
Messages in channels are chopped into smaller `msgPacket`s for multiplexing. | |||
```go
type msgPacket struct { | |||
ChannelID byte | |||
EOF byte // 1 means message ends here. | |||
Bytes []byte | |||
} | |||
``` | |||
The `msgPacket` is serialized using [go-amino](https://github.com/tendermint/go-amino) and prefixed with 0x3. | |||
The received `Bytes` of a sequential set of packets are appended together | |||
until a packet with `EOF=1` is received, then the complete serialized message | |||
is returned for processing by the `onReceive` function of the corresponding channel. | |||
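
For illustration, the reassembly logic just described might be sketched as follows (a simplified sketch; the `channelState` type and `onReceive` wiring are assumptions, not the actual implementation):

```go
package p2p

// msgPacket mirrors the struct above.
type msgPacket struct {
    ChannelID byte
    EOF       byte // 1 means the message ends with this packet
    Bytes     []byte
}

// channelState accumulates packet payloads until a packet with EOF=1 arrives.
type channelState struct {
    recving   []byte
    onReceive func(msgBytes []byte)
}

// recvPacket appends the packet payload and, on EOF, hands the complete
// serialized message to the channel's onReceive callback.
func (ch *channelState) recvPacket(p msgPacket) {
    ch.recving = append(ch.recving, p.Bytes...)
    if p.EOF == 1 {
        msg := ch.recving
        ch.recving = nil
        ch.onReceive(msg)
    }
}
```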
### Multiplexing | |||
Messages are sent from a single `sendRoutine`, which loops over a select statement and results in the sending | |||
of a ping, a pong, or a batch of data messages. The batch of data messages may include messages from multiple channels. | |||
Message bytes are queued for sending in their respective channel, with each channel holding one unsent message at a time. | |||
Messages are chosen for a batch one at a time from the channel with the lowest ratio of recently sent bytes to channel priority. | |||
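
A sketch of this selection heuristic (the `channel` fields here are illustrative assumptions; only the lowest-ratio rule follows the description above):

```go
package p2p

// channel is an illustrative per-channel send state.
type channel struct {
    id            byte
    priority      int64 // configured relative priority (> 0)
    recentlySent  int64 // count of recently sent bytes (decayed over time)
    hasPendingMsg bool  // whether there is an unsent message queued
}

// selectChannel picks the channel with the lowest ratio of recently sent
// bytes to priority among channels that have something to send.
// It returns nil when no channel has a pending message.
func selectChannel(channels []*channel) *channel {
    var best *channel
    var bestRatio float64
    for _, ch := range channels {
        if !ch.hasPendingMsg {
            continue
        }
        ratio := float64(ch.recentlySent) / float64(ch.priority)
        if best == nil || ratio < bestRatio {
            best = ch
            bestRatio = ratio
        }
    }
    return best
}
```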
## Sending Messages | |||
There are two methods for sending messages: | |||
```go | |||
func (m MConnection) Send(chID byte, msg interface{}) bool {} | |||
func (m MConnection) TrySend(chID byte, msg interface{}) bool {} | |||
``` | |||
`Send(chID, msg)` is a blocking call that waits until `msg` is successfully queued | |||
for the channel with the given id byte `chID`. The message `msg` is serialized | |||
using the `tendermint/go-amino` submodule's `WriteBinary()` reflection routine. | |||
`TrySend(chID, msg)` is a nonblocking call that queues the message `msg` in the channel
with the given id byte `chID` if the queue is not full; otherwise it returns false immediately.
`Send()` and `TrySend()` are also exposed for each `Peer`. | |||
## Peer | |||
Each peer has one `MConnection` instance, and includes other information such as whether the connection | |||
was outbound, whether the connection should be recreated if it closes, various identity information about the node, | |||
and other higher level thread-safe data used by the reactors. | |||
## Switch/Reactor | |||
The `Switch` handles peer connections and exposes an API to receive incoming messages | |||
on `Reactors`. Each `Reactor` is responsible for handling incoming messages of one | |||
or more `Channels`. So while sending outgoing messages is typically performed on the peer, | |||
incoming messages are received on the reactor. | |||
```go | |||
// Declare a MyReactor reactor that handles messages on MyChannelID. | |||
type MyReactor struct{} | |||
func (reactor MyReactor) GetChannels() []*ChannelDescriptor { | |||
return []*ChannelDescriptor{ChannelDescriptor{ID:MyChannelID, Priority: 1}} | |||
} | |||
func (reactor MyReactor) Receive(chID byte, peer *Peer, msgBytes []byte) { | |||
r, n, err := bytes.NewBuffer(msgBytes), new(int64), new(error) | |||
msgString := ReadString(r, n, err) | |||
fmt.Println(msgString) | |||
} | |||
// Other Reactor methods omitted for brevity | |||
... | |||
sw := NewSwitch([]Reactor{MyReactor{}})
...
// Send a random message to all outbound connections
for _, peer := range sw.Peers().List() {
if peer.IsOutbound() { | |||
peer.Send(MyChannelID, "Here's a random message") | |||
} | |||
} | |||
``` |
# Peer Discovery | |||
A Tendermint P2P network has different kinds of nodes with different requirements for connectivity to one another. | |||
This document describes what kind of nodes Tendermint should enable and how they should work. | |||
## Seeds | |||
Seeds are the first point of contact for a new node. | |||
They return a list of known active peers and then disconnect. | |||
Seeds should operate full nodes with the PEX reactor in a "crawler" mode | |||
that continuously explores to validate the availability of peers. | |||
Seeds should only respond with some top percentile of the best peers they know about.
See [the peer-exchange docs](https://github.com/tendermint/tendermint/blob/master/docs/spec/reactors/pex/pex.md) for details on peer quality.
## New Full Node | |||
A new node needs a few things to connect to the network: | |||
- a list of seeds, which can be provided to Tendermint via config file or flags, | |||
or hardcoded into the software by in-process apps | |||
- a `ChainID`, also called `Network` at the p2p layer | |||
- a recent block height, `H`, and hash, `HASH`, for the blockchain.
The values `H` and `HASH` must be received and corroborated by means external to Tendermint, and specific to the user - ie. via the user's trusted social consensus. | |||
This requirement to validate `H` and `HASH` out-of-band and via social consensus | |||
is the essential difference in security models between Proof-of-Work and Proof-of-Stake blockchains. | |||
With the above, the node then queries some seeds for peers for its chain, | |||
dials those peers, and runs the Tendermint protocols with those it successfully connects to. | |||
When the node catches up to height `H`, it ensures the block hash matches `HASH`.
If not, Tendermint will exit, and the user must try again - either they are connected | |||
to bad peers or their social consensus is invalid. | |||
## Restarted Full Node | |||
A node checks its address book on startup and attempts to connect to peers from there. | |||
If it can't connect to any peers after some time, it falls back to the seeds to find more. | |||
Restarted full nodes can run the `blockchain` or `consensus` reactor protocols to sync up | |||
to the latest state of the blockchain from wherever they were last. | |||
In a Proof-of-Stake context, if they are sufficiently far behind (greater than the length | |||
of the unbonding period), they will need to validate a recent `H` and `HASH` out-of-band again | |||
so they know they have synced the correct chain. | |||
## Validator Node | |||
A validator node is a node that interfaces with a validator signing key. | |||
These nodes require the highest security, and should not accept incoming connections. | |||
They should maintain outgoing connections to a controlled set of "Sentry Nodes" that serve | |||
as their proxy shield to the rest of the network. | |||
Validators that know and trust each other can accept incoming connections from one another and maintain direct private connectivity via VPN. | |||
## Sentry Node | |||
Sentry nodes are guardians of a validator node and provide it access to the rest of the network. | |||
They should be well connected to other full nodes on the network. | |||
Sentry nodes may be dynamic, but should maintain persistent connections to some evolving random subset of each other. | |||
They should always expect to have direct incoming connections from the validator node and its backup(s). | |||
They do not report the validator node's address in the PEX and | |||
they may be more strict about the quality of peers they keep. | |||
Sentry nodes belonging to validators that trust each other may wish to maintain persistent connections via VPN with one another, but only report each other sparingly in the PEX. |
# Peers | |||
This document explains how Tendermint Peers are identified and how they connect to one another. | |||
For details on peer discovery, see the [peer exchange (PEX) reactor doc](https://github.com/tendermint/tendermint/blob/master/docs/spec/reactors/pex/pex.md). | |||
## Peer Identity | |||
Tendermint peers are expected to maintain long-term persistent identities in the form of a public key. | |||
Each peer has an ID defined as `peer.ID == peer.PubKey.Address()`, where `Address` uses the scheme defined in the `crypto` package.
A single peer ID can have multiple IP addresses associated with it, but a node | |||
will only ever connect to one at a time. | |||
When attempting to connect to a peer, we use the PeerURL: `<ID>@<IP>:<PORT>`. | |||
We will attempt to connect to the peer at IP:PORT, and verify, | |||
via authenticated encryption, that it is in possession of the private key | |||
corresponding to `<ID>`. This prevents man-in-the-middle attacks on the peer layer. | |||
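
As a rough illustration, parsing a peer URL of this form could look like the following (a hypothetical helper, not Tendermint's actual address-parsing code):

```go
package p2p

import (
    "errors"
    "net"
    "strings"
)

// PeerAddress is an illustrative representation of <ID>@<IP>:<PORT>.
type PeerAddress struct {
    ID   string // expected to equal PubKey.Address() of the peer
    Host string
    Port string
}

// ParsePeerURL splits "<ID>@<IP>:<PORT>" into its components.
func ParsePeerURL(raw string) (PeerAddress, error) {
    parts := strings.SplitN(raw, "@", 2)
    if len(parts) != 2 || parts[0] == "" {
        return PeerAddress{}, errors.New("peer URL must be of the form <ID>@<IP>:<PORT>")
    }
    host, port, err := net.SplitHostPort(parts[1])
    if err != nil {
        return PeerAddress{}, err
    }
    return PeerAddress{ID: parts[0], Host: host, Port: port}, nil
}
```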
## Connections | |||
All p2p connections use TCP. | |||
Upon establishing a successful TCP connection with a peer, | |||
two handshakes are performed: one for authenticated encryption, and one for Tendermint versioning.
Both handshakes have configurable timeouts (they should complete quickly). | |||
### Authenticated Encryption Handshake | |||
Tendermint implements the Station-to-Station protocol | |||
using X25519 keys for Diffie-Hellman key-exchange and chacha20poly1305 for encryption.
It goes as follows: | |||
- generate an ephemeral X25519 keypair | |||
- send the ephemeral public key to the peer | |||
- wait to receive the peer's ephemeral public key | |||
- compute the Diffie-Hellman shared secret using the peer's ephemeral public key and our ephemeral private key
- generate two keys to use for encryption (sending and receiving) and a challenge for authentication as follows (see the sketch after this list):
- create a hkdf-sha256 instance with the key being the diffie hellman shared secret, and info parameter as | |||
`TENDERMINT_SECRET_CONNECTION_KEY_AND_CHALLENGE_GEN` | |||
- get 96 bytes of output from hkdf-sha256 | |||
- if we had the smaller ephemeral pubkey, use the first 32 bytes for the key for receiving, the second 32 bytes for sending; else the opposite | |||
- use the last 32 bytes of output for the challenge | |||
- use a separate nonce for receiving and sending. Both nonces start at 0, and should support the full 96 bit nonce range | |||
- all communications from now on are encrypted in 1024 byte frames, | |||
using the respective secret and nonce. Each nonce is incremented by one after each use. | |||
- we now have an encrypted channel, but still need to authenticate | |||
- sign the common challenge obtained from the hkdf with our persistent private key | |||
- send the amino encoded persistent pubkey and signature to the peer | |||
- wait to receive the persistent public key and signature from the peer | |||
- verify the signature on the challenge using the peer's persistent public key | |||
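
The key and challenge derivation step above can be sketched with `golang.org/x/crypto/hkdf` (a simplified sketch; function and variable names are assumptions, only the derivation scheme follows the description above):

```go
package p2p

import (
    "bytes"
    "crypto/sha256"
    "io"

    "golang.org/x/crypto/hkdf"
)

// deriveSecrets derives the receive key, send key and authentication challenge
// from the Diffie-Hellman shared secret, as described above.
func deriveSecrets(dhSecret, ourEphPub, theirEphPub []byte) (recvKey, sendKey, challenge [32]byte) {
    info := []byte("TENDERMINT_SECRET_CONNECTION_KEY_AND_CHALLENGE_GEN")
    r := hkdf.New(sha256.New, dhSecret, nil, info)

    // Get 96 bytes of output from hkdf-sha256.
    out := make([]byte, 96)
    if _, err := io.ReadFull(r, out); err != nil {
        panic(err) // 96 bytes are always available from HKDF-SHA256
    }

    // The side with the smaller ephemeral pubkey uses the first 32 bytes for
    // receiving and the second 32 for sending; the other side does the
    // opposite. The last 32 bytes are the shared challenge.
    if bytes.Compare(ourEphPub, theirEphPub) < 0 {
        copy(recvKey[:], out[0:32])
        copy(sendKey[:], out[32:64])
    } else {
        copy(recvKey[:], out[32:64])
        copy(sendKey[:], out[0:32])
    }
    copy(challenge[:], out[64:96])
    return recvKey, sendKey, challenge
}
```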
If this is an outgoing connection (we dialed the peer) and we used a peer ID, | |||
then finally verify that the peer's persistent public key corresponds to the peer ID we dialed, | |||
ie. `peer.PubKey.Address() == <ID>`. | |||
The connection has now been authenticated. All traffic is encrypted. | |||
Note: only the dialer can authenticate the identity of the peer, | |||
but this is what we care about since when we join the network we wish to | |||
ensure we have reached the intended peer (and are not being MITMd). | |||
### Peer Filter | |||
Before continuing, we check if the new peer has the same ID as ourselves or | |||
an existing peer. If so, we disconnect. | |||
We also check the peer's address and public key against | |||
an optional whitelist which can be managed through the ABCI app - | |||
if the whitelist is enabled and the peer does not qualify, the connection is | |||
terminated. | |||
### Tendermint Version Handshake | |||
The Tendermint Version Handshake allows the peers to exchange their NodeInfo: | |||
```golang | |||
type NodeInfo struct { | |||
Version p2p.Version | |||
ID p2p.ID | |||
ListenAddr string | |||
Network string | |||
SoftwareVersion string | |||
Channels []int8 | |||
Moniker string | |||
Other NodeInfoOther | |||
} | |||
type Version struct { | |||
P2P uint64 | |||
Block uint64 | |||
App uint64 | |||
} | |||
type NodeInfoOther struct { | |||
TxIndex string | |||
RPCAddress string | |||
} | |||
``` | |||
The connection is disconnected if: | |||
- `peer.NodeInfo.ID` is not equal to `peerConn.ID`
- `peer.NodeInfo.Version.Block` does not match ours | |||
- `peer.NodeInfo.Network` is not the same as ours | |||
- `peer.Channels` does not intersect with our known Channels. | |||
- `peer.NodeInfo.ListenAddr` is malformed or is a DNS host that cannot be | |||
resolved | |||
At this point, if we have not disconnected, the peer is valid. | |||
It is added to the switch and hence all reactors via the `AddPeer` method. | |||
Note that each reactor may handle multiple channels. | |||
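
To make these conditions concrete, a version-handshake check might look roughly like the following (an illustrative sketch that assumes the `NodeInfo`, `Version`, and `ID` types shown above with package qualifiers dropped; the real validation differs in details, and DNS resolution of `ListenAddr` is not shown):

```go
package p2p

import (
    "errors"
    "fmt"
    "net"
)

// compatibleWith reports why a freshly handshaked peer must be disconnected,
// following the conditions listed above. Returning nil means the peer is kept.
func compatibleWith(ourInfo, peerInfo NodeInfo, dialedID ID) error {
    if peerInfo.ID != dialedID {
        return errors.New("peer reported an ID different from the connection ID")
    }
    if peerInfo.Version.Block != ourInfo.Version.Block {
        return fmt.Errorf("incompatible block version: %d vs %d",
            peerInfo.Version.Block, ourInfo.Version.Block)
    }
    if peerInfo.Network != ourInfo.Network {
        return fmt.Errorf("different network/ChainID: %q vs %q",
            peerInfo.Network, ourInfo.Network)
    }
    if !channelsIntersect(ourInfo.Channels, peerInfo.Channels) {
        return errors.New("no common channels")
    }
    if _, _, err := net.SplitHostPort(peerInfo.ListenAddr); err != nil {
        return fmt.Errorf("malformed ListenAddr: %w", err)
    }
    return nil
}

// channelsIntersect reports whether the two channel lists share an element.
func channelsIntersect(a, b []int8) bool {
    for _, x := range a {
        for _, y := range b {
            if x == y {
                return true
            }
        }
    }
    return false
}
```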
## Connection Activity | |||
Once a peer is added, incoming messages for a given reactor are handled through | |||
that reactor's `Receive` method, and output messages are sent directly by the Reactors | |||
on each peer. A typical reactor maintains per-peer go-routine(s) that handle this. |
# Blockchain Reactor v1 | |||
### Data Structures | |||
The data structures used are illustrated below. | |||
![Data Structures](img/bc-reactor-new-datastructs.png) | |||
#### BlockchainReactor | |||
- is a `p2p.BaseReactor`. | |||
- has a `store.BlockStore` for persistence. | |||
- executes blocks using an `sm.BlockExecutor`. | |||
- starts the FSM and the `poolRoutine()`. | |||
- relays the fast-sync responses and switch messages to the FSM. | |||
- handles errors from the FSM and, when necessary, reports them to the switch.
- implements the blockchain reactor interface used by the FSM to send requests, report errors to the switch, and reset state timers.
- registers all the concrete types and interfaces for serialisation. | |||
```go | |||
type BlockchainReactor struct { | |||
p2p.BaseReactor | |||
initialState sm.State // immutable | |||
state sm.State | |||
blockExec *sm.BlockExecutor | |||
store *store.BlockStore | |||
fastSync bool | |||
fsm *BcReactorFSM | |||
blocksSynced int | |||
// Receive goroutine forwards messages to this channel to be processed in the context of the poolRoutine. | |||
messagesForFSMCh chan bcReactorMessage | |||
// Switch goroutine may send RemovePeer to the blockchain reactor. This is an error message that is relayed | |||
// to this channel to be processed in the context of the poolRoutine. | |||
errorsForFSMCh chan bcReactorMessage | |||
// This channel is used by the FSM and indirectly the block pool to report errors to the blockchain reactor and | |||
// the switch. | |||
eventsFromFSMCh chan bcFsmMessage | |||
} | |||
``` | |||
#### BcReactorFSM | |||
- implements a simple finite state machine. | |||
- has a state and a state timer. | |||
- has a `BlockPool` to keep track of block requests sent to peers and blocks received from peers. | |||
- uses an interface to send status requests, block requests and reporting errors. The interface is implemented by the `BlockchainReactor` and tests. | |||
```go | |||
type BcReactorFSM struct { | |||
logger log.Logger | |||
mtx sync.Mutex | |||
startTime time.Time | |||
state *bcReactorFSMState | |||
stateTimer *time.Timer | |||
pool *BlockPool | |||
// interface used to call the Blockchain reactor to send StatusRequest, BlockRequest, reporting errors, etc. | |||
toBcR bcReactor | |||
} | |||
``` | |||
#### BlockPool | |||
- maintains a peer set, implemented as a map of peer ID to `BpPeer`. | |||
- maintains a set of requests made to peers, implemented as a map of block request heights to peer IDs. | |||
- maintains a list of future block requests needed to advance the fast-sync. This is a list of block heights. | |||
- keeps track of the maximum height of the peers in the set. | |||
- uses an interface to send requests and report errors to the reactor (via FSM). | |||
```go | |||
type BlockPool struct { | |||
logger log.Logger | |||
// Set of peers that have sent status responses, with height bigger than pool.Height | |||
peers map[p2p.ID]*BpPeer | |||
// Set of block heights and the corresponding peers from where a block response is expected or has been received. | |||
blocks map[int64]p2p.ID | |||
plannedRequests map[int64]struct{} // list of blocks to be assigned peers for blockRequest | |||
nextRequestHeight int64 // next height to be added to plannedRequests | |||
Height int64 // height of next block to execute | |||
MaxPeerHeight int64 // maximum height of all peers | |||
toBcR bcReactor | |||
} | |||
``` | |||
Some reasons for the `BlockPool` data structure content: | |||
1. If a peer is removed by the switch, fast access is required to the peer and the block requests made to that peer in order to redo them.
2. When block verification fails, fast access is required from the block height to the peer and the block requests made to that peer in order to redo them.
3. The `BlockchainReactor` main routine decides when the block pool is running low and asks the `BlockPool` (via FSM) to make more requests. The `BlockPool` creates a list of requests and triggers the sending of the block requests (via the interface). The reason it maintains a list of requests is the redo operations that may occur during error handling. These are redone when the `BlockchainReactor` requires more blocks. | |||
#### BpPeer | |||
- keeps track of a single peer, with height bigger than the initial height. | |||
- maintains the block requests made to the peer and the blocks received from the peer until they are executed. | |||
- monitors the peer speed when there are pending requests. | |||
- has an active timer when pending requests are present and reports an error on timeout.
```go | |||
type BpPeer struct { | |||
logger log.Logger | |||
ID p2p.ID | |||
Height int64 // the peer reported height | |||
NumPendingBlockRequests int // number of requests still waiting for block responses | |||
blocks map[int64]*types.Block // blocks received or expected to be received from this peer | |||
blockResponseTimer *time.Timer | |||
recvMonitor *flow.Monitor | |||
params *BpPeerParams // parameters for timer and monitor | |||
onErr func(err error, peerID p2p.ID) // function to call on error | |||
} | |||
``` | |||
### Concurrency Model | |||
The diagram below shows the goroutines (depicted by the gray blocks), timers (shown on the left with their values) and channels (colored rectangles). The FSM box shows some of the functionality and it is not a separate goroutine. | |||
The interface used by the FSM is shown in light red with the `IF` block. This is used to: | |||
- send block requests | |||
- report peer errors to the switch - this results in the reactor calling `switch.StopPeerForError()` and, if triggered by the peer timeout routine, a `removePeerEv` is sent to the FSM and action is taken from the context of the `poolRoutine()` | |||
- ask the reactor to reset the state timers. The timers are owned by the FSM while the timeout routine is defined by the reactor. This was done in order to avoid running timers in tests and will change in the next revision. | |||
There are two main goroutines implemented by the blockchain reactor. All I/O operations are performed from the `poolRoutine()` context while the CPU intensive operations related to the block execution are performed from the context of the `executeBlocksRoutine()`. All goroutines are detailed in the next sections. | |||
![Go Routines Diagram](img/bc-reactor-new-goroutines.png) | |||
#### Receive() | |||
Fast-sync messages from peers are received by this goroutine. It performs basic validation and: | |||
- in helper mode (i.e. for request message) it replies immediately. This is different than the proposal in adr-040 that specifies having the FSM handling these. | |||
- forwards response messages to the `poolRoutine()`. | |||
#### poolRoutine() | |||
(name kept as in the previous reactor).
It starts the `executeBlocksRoutine()` and the FSM. It then waits in a loop for events. These are received from the following channels: | |||
- `sendBlockRequestTicker.C` - every 10msec the reactor asks FSM to make more block requests up to a maximum. Note: currently this value is constant but could be changed based on low/ high watermark thresholds for the number of blocks received and waiting to be processed, the number of blockResponse messages waiting in messagesForFSMCh, etc. | |||
- `statusUpdateTicker.C` - every 10 seconds the reactor broadcasts status requests to peers. While adr-040 specifies this to run within the FSM, at this point this functionality is kept in the reactor. | |||
- `messagesForFSMCh` - the `Receive()` goroutine sends status and block response messages to this channel and the reactor calls FSM to handle them. | |||
- `errorsForFSMCh` - this channel receives the following events: | |||
- peer remove - when the switch removes a peer | |||
  - state timeout event - when FSM state timers trigger
  The reactor forwards these messages to the FSM.
- `eventsFromFSMCh` - there are two types of events sent over this channel:
  - `syncFinishedEv` - triggered when the FSM enters the `finished` state and calls the `switchToConsensus()` interface function.
  - `peerErrorEv` - the peer timer expiry goroutine sends this event over the channel for processing from the `poolRoutine()` context.
#### executeBlocksRoutine() | |||
Started by the `poolRoutine()`, it retrieves blocks from the pool and executes them: | |||
- `processReceivedBlockTicker.C` - a ticker event is received over the channel every 10msec and its handling results in a signal being sent to the `doProcessBlockCh` channel.
- `doProcessBlockCh` - events are received on this channel as described above; upon processing, blocks are retrieved from the pool and executed.
### FSM | |||
![fsm](img/bc-reactor-new-fsm.png) | |||
#### States | |||
##### init (aka unknown) | |||
The FSM is created in `unknown` state. When started, by the reactor (`startFSMEv`), it broadcasts Status requests and transitions to `waitForPeer` state. | |||
##### waitForPeer | |||
In this state, the FSM waits for a Status response from a "tall" peer. A timer is running in this state to allow the FSM to finish if there are no useful peers.
If the timer expires, it moves to `finished` state and calls the reactor to switch to consensus. | |||
If a Status response is received from a peer within the timeout, the FSM transitions to `waitForBlock` state. | |||
##### waitForBlock | |||
In this state the FSM makes Block requests (triggered by a ticker in reactor) and waits for Block responses. There is a timer running in this state to detect if a peer is not sending the block at current processing height. If the timer expires, the FSM removes the peer where the request was sent and all requests made to that peer are redone. | |||
As blocks are received they are stored by the pool. Block execution is independently performed by the reactor and the result reported to the FSM: | |||
- if there are no errors, the FSM increases the pool height and resets the state timer. | |||
- if there are errors, the peers that delivered the two blocks (at height and height+1) are removed and the requests redone. | |||
In this state the FSM may receive peer remove events in any of the following scenarios: | |||
- the switch is removing a peer | |||
- a peer is penalized because it has not responded to some block requests for a long time | |||
- a peer is penalized for being slow | |||
When processing of the last block (the one with height equal to the highest peer height minus one) is successful, the FSM transitions to `finished` state. | |||
If after a peer update or removal the pool height is same as maxPeerHeight, the FSM transitions to `finished` state. | |||
##### finished | |||
When entering this state, the FSM calls the reactor to switch to consensus and performs cleanup. | |||
#### Events | |||
The following events are handled by the FSM: | |||
```go | |||
const ( | |||
startFSMEv = iota + 1 | |||
statusResponseEv | |||
blockResponseEv | |||
processedBlockEv | |||
makeRequestsEv | |||
stopFSMEv | |||
peerRemoveEv = iota + 256 | |||
stateTimeoutEv | |||
) | |||
``` | |||
### Examples of Scenarios and Termination Handling | |||
A few scenarios are covered in this section together with the current/proposed handling.
In general, the scenarios involving faulty peers are made worse by the fact that they may quickly be re-added. | |||
#### 1. No Tall Peers | |||
S: In this scenario a node is started and while there are status responses received, none of the peers are at a height higher than this node. | |||
H: The FSM times out in `waitForPeer` state, moves to `finished` state where it calls the reactor to switch to consensus. | |||
#### 2. Typical Fast Sync | |||
S: A node fast syncs blocks from honest peers and eventually downloads and executes the penultimate block. | |||
H: The FSM in `waitForBlock` state will receive the processedBlockEv from the reactor and detect that the termination height is achieved. | |||
#### 3. Peer Claims Big Height but no Blocks | |||
S: In this scenario a faulty peer claims a big height (for which there are no blocks). | |||
H: The requests for the non-existing block will time out, the peer is removed and the pool's `MaxPeerHeight` is updated. The FSM checks if the termination height is achieved when peers are removed.
#### 4. Highest Peer Removed or Updated to Short | |||
S: The fast-sync node is caught up with all peers except one tall peer. The tall peer is removed or it sends a status response with a lower height.
H: FSM checks termination condition on peer removal and updates. | |||
#### 5. Block At Current Height Delayed | |||
S: A peer can block the progress of fast sync by delaying indefinitely the block response for the current processing height (h1). | |||
H: Currently, given two outstanding requests to the same peer for heights h1 < h2, there is no enforcement at the peer level that the response for h1 should be received before the response for h2. So a peer will time out only after delivering all blocks except h1. However, the `waitForBlock` state timer fires if the block for the current processing height is not received within a timeout. The peer is removed and the requests to that peer (including the one for the current height) are redone.
## Blockchain Reactor v0 Modules | |||
### Blockchain Reactor | |||
- coordinates the pool for syncing
- coordinates the store for persistence
- coordinates applying blocks to the app using an `sm.BlockExecutor`
- handles switching between fast-sync and consensus
- it is a `p2p.BaseReactor`
- starts the pool (`pool.Start()`) and its `poolRoutine()`
- registers all the concrete types and interfaces for serialisation
#### poolRoutine | |||
- listens to these channels:
  - the pool requests blocks from a specific peer by posting to `requestsCh`; the blockchain reactor then sends
    a `&bcBlockRequestMessage` for a specific height
  - the pool signals a timeout of a specific peer by posting to `timeoutsCh`
  - `switchToConsensusTicker` to periodically try and switch to consensus
  - `trySyncTicker` to periodically check if we have fallen behind and then catch-up sync
    - if there aren't any new blocks available on the pool it skips syncing
- tries to sync the app by taking downloaded blocks from the pool, giving them to the app and storing
  them on disk
- implements `Receive`, which is called by the switch/peer
  - calls `AddBlock` on the pool when it receives a new block from a peer
### Block Pool | |||
- responsible for downloading blocks from peers | |||
- makeRequestersRoutine() | |||
- removes timeout peers | |||
- starts new requesters by calling makeNextRequester() | |||
- requestRoutine(): | |||
- picks a peer and sends the request, then blocks until: | |||
- pool is stopped by listening to pool.Quit | |||
- requester is stopped by listening to Quit | |||
- request is redone | |||
- we receive a block | |||
- gotBlockCh is strange | |||
### Go Routines in Blockchain Reactor | |||
![Go Routines Diagram](img/bc-reactor-routines.png) |
# Blockchain Reactor | |||
The Blockchain Reactor's high level responsibility is to enable peers who are | |||
far behind the current state of the consensus to quickly catch up by downloading | |||
many blocks in parallel, verifying their commits, and executing them against the | |||
ABCI application. | |||
Tendermint full nodes run the Blockchain Reactor as a service to provide blocks | |||
to new nodes. New nodes run the Blockchain Reactor in "fast_sync" mode, | |||
where they actively make requests for more blocks until they sync up. | |||
Once caught up, "fast_sync" mode is disabled and the node switches to | |||
using (and turns on) the Consensus Reactor. | |||
## Message Types | |||
```go | |||
const ( | |||
msgTypeBlockRequest = byte(0x10) | |||
msgTypeBlockResponse = byte(0x11) | |||
msgTypeNoBlockResponse = byte(0x12) | |||
msgTypeStatusResponse = byte(0x20) | |||
msgTypeStatusRequest = byte(0x21) | |||
) | |||
``` | |||
```go | |||
type bcBlockRequestMessage struct { | |||
Height int64 | |||
} | |||
type bcNoBlockResponseMessage struct { | |||
Height int64 | |||
} | |||
type bcBlockResponseMessage struct { | |||
Block Block | |||
} | |||
type bcStatusRequestMessage struct {
    Height int64
}

type bcStatusResponseMessage struct {
    Height int64
}
``` | |||
## Architecture and algorithm | |||
The Blockchain reactor is organised as a set of concurrent tasks: | |||
- Receive routine of Blockchain Reactor | |||
- Task for creating Requesters | |||
- Set of Requesters tasks
- Controller task
![Blockchain Reactor Architecture Diagram](img/bc-reactor.png) | |||
### Data structures | |||
These are the core data structures necessary to provide the Blockchain Reactor logic.
The Requester data structure is used to track the assignment of a request for a `block` at position `height` to a peer with id `peerID`.
```go | |||
type Requester struct {
mtx Mutex | |||
block Block | |||
height int64 | |||
peerID p2p.ID | |||
    redoChannel chan p2p.ID // redo may be sent multiple times; peerID is used to identify a repeat
} | |||
``` | |||
The Pool is a core data structure that stores the last executed block (`height`), the assignment of requests to peers (`requesters`), the current height and number of pending requests for each peer (`peers`), the maximum peer height, etc.
```go | |||
type Pool struct {
mtx Mutex | |||
requesters map[int64]*Requester | |||
height int64 | |||
peers map[p2p.ID]*Peer | |||
maxPeerHeight int64 | |||
numPending int32 | |||
store BlockStore | |||
requestsChannel chan<- BlockRequest | |||
errorsChannel chan<- peerError | |||
} | |||
``` | |||
The Peer data structure stores, for each peer, the current `height` and the number of pending requests sent to the peer (`numPending`), etc.
```go | |||
type Peer struct { | |||
id p2p.ID | |||
height int64 | |||
numPending int32 | |||
timeout *time.Timer | |||
didTimeout bool | |||
} | |||
``` | |||
BlockRequest is an internal data structure used to denote the current mapping of a request for a block at some `height` to a peer (`PeerID`).
```go | |||
type BlockRequest struct {
Height int64 | |||
PeerID p2p.ID | |||
} | |||
``` | |||
### Receive routine of Blockchain Reactor | |||
It is executed upon message reception on the BlockchainChannel inside the p2p receive routine. There is a separate p2p receive routine (and therefore a separate receive routine of the Blockchain Reactor) executed for each peer. Note that "try to send" does not block (it returns immediately) if the outgoing buffer is full.
```go | |||
handleMsg(pool, m): | |||
upon receiving bcBlockRequestMessage m from peer p: | |||
block = load block for height m.Height from pool.store | |||
if block != nil then | |||
try to send BlockResponseMessage(block) to p | |||
else | |||
try to send bcNoBlockResponseMessage(m.Height) to p | |||
upon receiving bcBlockResponseMessage m from peer p: | |||
pool.mtx.Lock() | |||
requester = pool.requesters[m.Height] | |||
if requester == nil then | |||
error("peer sent us a block we didn't expect") | |||
continue | |||
if requester.block == nil and requester.peerID == p then | |||
requester.block = m | |||
pool.numPending -= 1 // atomic decrement | |||
peer = pool.peers[p] | |||
if peer != nil then | |||
peer.numPending-- | |||
if peer.numPending == 0 then | |||
peer.timeout.Stop() | |||
// NOTE: we don't send Quit signal to the corresponding requester task! | |||
else | |||
trigger peer timeout to expire after peerTimeout | |||
pool.mtx.Unlock() | |||
upon receiving bcStatusRequestMessage m from peer p: | |||
try to send bcStatusResponseMessage(pool.store.Height) | |||
upon receiving bcStatusResponseMessage m from peer p: | |||
pool.mtx.Lock() | |||
peer = pool.peers[p] | |||
if peer != nil then | |||
peer.height = m.height | |||
else | |||
peer = create new Peer data structure with id = p and height = m.Height | |||
pool.peers[p] = peer | |||
if m.Height > pool.maxPeerHeight then | |||
pool.maxPeerHeight = m.Height | |||
pool.mtx.Unlock() | |||
onTimeout(p): | |||
send error message to pool error channel | |||
peer = pool.peers[p] | |||
peer.didTimeout = true | |||
``` | |||
### Requester tasks | |||
Requester task is responsible for fetching a single block at position `height`. | |||
```go | |||
fetchBlock(height, pool): | |||
while true do { | |||
peerID = nil | |||
block = nil | |||
peer = pickAvailablePeer(height) | |||
peerID = peer.id | |||
enqueue BlockRequest(height, peerID) to pool.requestsChannel | |||
redo = false | |||
while !redo do | |||
select { | |||
upon receiving Quit message do | |||
return | |||
upon receiving redo message with id on redoChannel do | |||
if peerID == id { | |||
mtx.Lock() | |||
pool.numPending++ | |||
redo = true | |||
mtx.UnLock() | |||
} | |||
} | |||
} | |||
pickAvailablePeer(height): | |||
selectedPeer = nil | |||
  while selectedPeer == nil do
pool.mtx.Lock() | |||
for each peer in pool.peers do | |||
if !peer.didTimeout and peer.numPending < maxPendingRequestsPerPeer and peer.height >= height then | |||
peer.numPending++ | |||
selectedPeer = peer | |||
break | |||
pool.mtx.Unlock() | |||
    if selectedPeer == nil then
sleep requestIntervalMS | |||
return selectedPeer | |||
``` | |||
### Task for creating Requesters | |||
This task is responsible for continuously creating and starting Requester tasks. | |||
```go | |||
createRequesters(pool): | |||
while true do | |||
if !pool.isRunning then break | |||
if pool.numPending < maxPendingRequests or size(pool.requesters) < maxTotalRequesters then | |||
pool.mtx.Lock() | |||
nextHeight = pool.height + size(pool.requesters) | |||
requester = create new requester for height nextHeight | |||
pool.requesters[nextHeight] = requester | |||
pool.numPending += 1 // atomic increment | |||
start requester task | |||
pool.mtx.Unlock() | |||
else | |||
sleep requestIntervalMS | |||
pool.mtx.Lock() | |||
for each peer in pool.peers do | |||
if !peer.didTimeout && peer.numPending > 0 && peer.curRate < minRecvRate then | |||
send error on pool error channel | |||
peer.didTimeout = true | |||
if peer.didTimeout then | |||
for each requester in pool.requesters do | |||
if requester.getPeerID() == peer then | |||
enqueue msg on requestor's redoChannel | |||
delete(pool.peers, peerID) | |||
pool.mtx.Unlock() | |||
``` | |||
### Main blockchain reactor controller task | |||
```go | |||
main(pool): | |||
create trySyncTicker with interval trySyncIntervalMS | |||
create statusUpdateTicker with interval statusUpdateIntervalSeconds | |||
create switchToConsensusTicker with interval switchToConsensusIntervalSeconds | |||
while true do | |||
select { | |||
upon receiving BlockRequest(Height, Peer) on pool.requestsChannel: | |||
try to send bcBlockRequestMessage(Height) to Peer | |||
upon receiving error(peer) on errorsChannel: | |||
stop peer for error | |||
upon receiving message on statusUpdateTickerChannel: | |||
broadcast bcStatusRequestMessage(bcR.store.Height) // message sent in a separate routine | |||
upon receiving message on switchToConsensusTickerChannel: | |||
pool.mtx.Lock() | |||
receivedBlockOrTimedOut = pool.height > 0 || (time.Now() - pool.startTime) > 5 Seconds | |||
ourChainIsLongestAmongPeers = pool.maxPeerHeight == 0 || pool.height >= pool.maxPeerHeight | |||
haveSomePeers = size of pool.peers > 0 | |||
pool.mtx.Unlock() | |||
if haveSomePeers && receivedBlockOrTimedOut && ourChainIsLongestAmongPeers then | |||
switch to consensus mode | |||
upon receiving message on trySyncTickerChannel: | |||
for i = 0; i < 10; i++ do | |||
pool.mtx.Lock() | |||
firstBlock = pool.requesters[pool.height].block | |||
          secondBlock = pool.requesters[pool.height+1].block
if firstBlock == nil or secondBlock == nil then continue | |||
pool.mtx.Unlock() | |||
verify firstBlock using LastCommit from secondBlock | |||
if verification failed | |||
pool.mtx.Lock() | |||
peerID = pool.requesters[pool.height].peerID | |||
redoRequestsForPeer(peerId) | |||
delete(pool.peers, peerID) | |||
stop peer peerID for error | |||
pool.mtx.Unlock() | |||
else | |||
delete(pool.requesters, pool.height) | |||
save firstBlock to store | |||
pool.height++ | |||
execute firstBlock | |||
} | |||
redoRequestsForPeer(pool, peerId): | |||
for each requester in pool.requesters do | |||
if requester.getPeerID() == peerID | |||
enqueue msg on redoChannel for requester | |||
``` | |||
## Channels | |||
Defines `maxMsgSize` for the maximum size of incoming messages, and
`SendQueueCapacity` and `RecvBufferCapacity` for the maximum sending and
receiving buffers respectively. These are supposed to prevent amplification
attacks by setting an upper limit on how much data we can receive and send to
a peer.
Sending incorrectly encoded data will result in stopping the peer. |
# Consensus Reactor | |||
Consensus Reactor defines a reactor for the consensus service. It contains the ConsensusState service that | |||
manages the state of the Tendermint consensus internal state machine. | |||
When Consensus Reactor is started, it starts Broadcast Routine which starts ConsensusState service. | |||
Furthermore, for each peer that is added to the Consensus Reactor, it creates (and manages) the known peer state | |||
(that is used extensively in gossip routines) and starts the following three routines for the peer p: | |||
Gossip Data Routine, Gossip Votes Routine and QueryMaj23Routine. Finally, Consensus Reactor is responsible | |||
for decoding messages received from a peer and for adequate processing of the message depending on its type and content. | |||
The processing normally consists of updating the known peer state and for some messages | |||
(`ProposalMessage`, `BlockPartMessage` and `VoteMessage`) also forwarding message to ConsensusState module | |||
for further processing. In the following text we specify the core functionality of those separate units of execution
that are part of the Consensus Reactor. | |||
## ConsensusState service | |||
Consensus State handles execution of the Tendermint BFT consensus algorithm. It processes votes and proposals, | |||
and upon reaching agreement, commits blocks to the chain and executes them against the application. | |||
The internal state machine receives input from peers, the internal validator and from a timer. | |||
Inside Consensus State we have the following units of execution: Timeout Ticker and Receive Routine. | |||
Timeout Ticker is a timer that schedules timeouts conditional on the height/round/step that are processed | |||
by the Receive Routine. | |||
### Receive Routine of the ConsensusState service | |||
Receive Routine of the ConsensusState handles messages which may cause internal consensus state transitions. | |||
It is the only routine that updates RoundState that contains internal consensus state. | |||
Updates (state transitions) happen on timeouts, complete proposals, and 2/3 majorities. | |||
It receives messages from peers, internal validators and from Timeout Ticker | |||
and invokes the corresponding handlers, potentially updating the RoundState. | |||
The details of the protocol (together with formal proofs of correctness) implemented by the Receive Routine are
discussed in a separate document. For the purposes of this document,
it is sufficient to understand that the Receive Routine manages and updates the RoundState data structure that is
then extensively used by the gossip routines to determine what information should be sent to peer processes. | |||
## Round State | |||
RoundState defines the internal consensus state. It contains the height, round, round step, the current validator set,
a proposal and proposal block for the current round, the locked round and block (if some block is locked), the set of
received votes, the last commit and the last validator set.
```golang | |||
type RoundState struct { | |||
Height int64 | |||
Round int | |||
Step RoundStepType | |||
Validators ValidatorSet | |||
Proposal Proposal | |||
ProposalBlock Block | |||
ProposalBlockParts PartSet | |||
LockedRound int | |||
LockedBlock Block | |||
LockedBlockParts PartSet | |||
Votes HeightVoteSet | |||
LastCommit VoteSet | |||
LastValidators ValidatorSet | |||
} | |||
``` | |||
Internally, consensus will run as a state machine with the following states: | |||
- RoundStepNewHeight | |||
- RoundStepNewRound | |||
- RoundStepPropose | |||
- RoundStepProposeWait | |||
- RoundStepPrevote | |||
- RoundStepPrevoteWait | |||
- RoundStepPrecommit | |||
- RoundStepPrecommitWait | |||
- RoundStepCommit | |||
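
For reference, these steps form an ordered enumeration, which matters for comparisons such as `prs.Step <= RoundStepPrevote` used by the gossip routines below. A sketch (the actual constants are defined in the consensus types package; the concrete values are an assumption):

```go
package consensus

// RoundStepType enumerates the internal consensus steps in order.
type RoundStepType uint8

const (
    RoundStepNewHeight RoundStepType = iota + 1
    RoundStepNewRound
    RoundStepPropose
    RoundStepProposeWait
    RoundStepPrevote
    RoundStepPrevoteWait
    RoundStepPrecommit
    RoundStepPrecommitWait
    RoundStepCommit
)
```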
## Peer Round State | |||
Peer round state contains the known state of a peer. It is updated by the Receive routine of
Consensus Reactor and by the gossip routines upon sending a message to the peer. | |||
```golang | |||
type PeerRoundState struct { | |||
Height int64 // Height peer is at | |||
Round int // Round peer is at, -1 if unknown. | |||
Step RoundStepType // Step peer is at | |||
Proposal bool // True if peer has proposal for this round | |||
ProposalBlockPartsHeader PartSetHeader | |||
ProposalBlockParts BitArray | |||
ProposalPOLRound int // Proposal's POL round. -1 if none. | |||
ProposalPOL BitArray // nil until ProposalPOLMessage received. | |||
Prevotes BitArray // All votes peer has for this round | |||
Precommits BitArray // All precommits peer has for this round | |||
LastCommitRound int // Round of commit for last height. -1 if none. | |||
LastCommit BitArray // All commit precommits of commit for last height. | |||
CatchupCommitRound int // Round that we have commit for. Not necessarily unique. -1 if none. | |||
CatchupCommit BitArray // All commit precommits peer has for this height & CatchupCommitRound | |||
} | |||
``` | |||
## Receive method of Consensus reactor | |||
The entry point of the Consensus reactor is the receive method. When a message is received from a peer p,
normally the peer round state is updated correspondingly, and some messages
are passed for further processing, for example to the ConsensusState service. We now specify the processing of messages
in the receive method of the Consensus reactor for each message type. In the following message handlers, `rs` and `prs` denote
`RoundState` and `PeerRoundState`, respectively.
### NewRoundStepMessage handler | |||
``` | |||
handleMessage(msg): | |||
if msg is from smaller height/round/step then return | |||
// Just remember these values. | |||
prsHeight = prs.Height | |||
prsRound = prs.Round | |||
prsCatchupCommitRound = prs.CatchupCommitRound | |||
prsCatchupCommit = prs.CatchupCommit | |||
Update prs with values from msg | |||
if prs.Height or prs.Round has been updated then | |||
reset Proposal related fields of the peer state | |||
    if prs.Round has been updated and msg.Round == prsCatchupCommitRound then
        prs.Precommits = prsCatchupCommit
    if prs.Height has been updated then
        if prsHeight+1 == msg.Height && prsRound == msg.LastCommitRound then
            prs.LastCommitRound = msg.LastCommitRound
            prs.LastCommit = prs.Precommits
        else
            prs.LastCommitRound = msg.LastCommitRound
            prs.LastCommit = nil
        Reset prs.CatchupCommitRound and prs.CatchupCommit
``` | |||
### NewValidBlockMessage handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height != msg.Height then return | |||
if prs.Round != msg.Round && !msg.IsCommit then return | |||
prs.ProposalBlockPartsHeader = msg.BlockPartsHeader | |||
prs.ProposalBlockParts = msg.BlockParts | |||
``` | |||
### HasVoteMessage handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height == msg.Height then | |||
prs.setHasVote(msg.Height, msg.Round, msg.Type, msg.Index) | |||
``` | |||
### VoteSetMaj23Message handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height == msg.Height then | |||
        Record in rs that a peer claims to have a ⅔ majority for msg.BlockID
        Send VoteSetBitsMessage showing the votes the node has for that BlockID
``` | |||
### ProposalMessage handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height != msg.Height || prs.Round != msg.Round || prs.Proposal then return | |||
prs.Proposal = true | |||
if prs.ProposalBlockParts == empty set then // otherwise it is set in NewValidBlockMessage handler | |||
prs.ProposalBlockPartsHeader = msg.BlockPartsHeader | |||
prs.ProposalPOLRound = msg.POLRound | |||
prs.ProposalPOL = nil | |||
Send msg through internal peerMsgQueue to ConsensusState service | |||
``` | |||
### ProposalPOLMessage handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height != msg.Height or prs.ProposalPOLRound != msg.ProposalPOLRound then return | |||
prs.ProposalPOL = msg.ProposalPOL | |||
``` | |||
### BlockPartMessage handler | |||
``` | |||
handleMessage(msg): | |||
if prs.Height != msg.Height || prs.Round != msg.Round then return | |||
Record in prs that peer has block part msg.Part.Index | |||
    Send msg through internal peerMsgQueue to ConsensusState service
``` | |||
### VoteMessage handler | |||
``` | |||
handleMessage(msg): | |||
    Record in prs that a peer knows the vote with index msg.vote.ValidatorIndex for the particular height and round
    Send msg through internal peerMsgQueue to ConsensusState service
``` | |||
### VoteSetBitsMessage handler | |||
``` | |||
handleMessage(msg): | |||
Update prs for the bit-array of votes peer claims to have for the msg.BlockID | |||
``` | |||
## Gossip Data Routine | |||
It is used to send the following messages to the peer: `BlockPartMessage`, `ProposalMessage` and | |||
`ProposalPOLMessage` on the DataChannel. The gossip data routine is based on the local RoundState (`rs`) | |||
and the known PeerRoundState (`prs`). The routine repeats forever the logic shown below: | |||
``` | |||
1a) if rs.ProposalBlockPartsHeader == prs.ProposalBlockPartsHeader and the peer does not have all the proposal parts then | |||
Part = pick a random proposal block part the peer does not have | |||
Send BlockPartMessage(rs.Height, rs.Round, Part) to the peer on the DataChannel | |||
if send returns true, record that the peer knows the corresponding block Part | |||
Continue | |||
1b) if (0 < prs.Height) and (prs.Height < rs.Height) then | |||
help peer catch up using gossipDataForCatchup function | |||
Continue | |||
1c) if (rs.Height != prs.Height) or (rs.Round != prs.Round) then | |||
Sleep PeerGossipSleepDuration | |||
Continue | |||
// at this point rs.Height == prs.Height and rs.Round == prs.Round | |||
1d) if (rs.Proposal != nil and !prs.Proposal) then | |||
Send ProposalMessage(rs.Proposal) to the peer | |||
if send returns true, record that the peer knows Proposal | |||
if 0 <= rs.Proposal.POLRound then | |||
polRound = rs.Proposal.POLRound | |||
prevotesBitArray = rs.Votes.Prevotes(polRound).BitArray() | |||
Send ProposalPOLMessage(rs.Height, polRound, prevotesBitArray) | |||
Continue | |||
2) Sleep PeerGossipSleepDuration | |||
``` | |||
### Gossip Data For Catchup | |||
This function is responsible for helping the peer catch up if it is at a smaller height (prs.Height < rs.Height).
The function executes the following logic: | |||
if peer does not have all block parts for prs.ProposalBlockPart then | |||
blockMeta = Load Block Metadata for height prs.Height from blockStore | |||
if (!blockMeta.BlockID.PartsHeader == prs.ProposalBlockPartsHeader) then | |||
Sleep PeerGossipSleepDuration | |||
return | |||
Part = pick a random proposal block part the peer does not have | |||
Send BlockPartMessage(prs.Height, prs.Round, Part) to the peer on the DataChannel | |||
if send returns true, record that the peer knows the corresponding block Part | |||
return | |||
else Sleep PeerGossipSleepDuration | |||
## Gossip Votes Routine | |||
It is used to send the following message: `VoteMessage` on the VoteChannel. | |||
The gossip votes routine is based on the local RoundState (`rs`) | |||
and the known PeerRoundState (`prs`). The routine repeats forever the logic shown below: | |||
``` | |||
1a) if rs.Height == prs.Height then | |||
if prs.Step == RoundStepNewHeight then | |||
vote = random vote from rs.LastCommit the peer does not have | |||
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
if prs.Step <= RoundStepPrevote and prs.Round != -1 and prs.Round <= rs.Round then | |||
Prevotes = rs.Votes.Prevotes(prs.Round) | |||
vote = random vote from Prevotes the peer does not have | |||
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
if prs.Step <= RoundStepPrecommit and prs.Round != -1 and prs.Round <= rs.Round then | |||
Precommits = rs.Votes.Precommits(prs.Round) | |||
vote = random vote from Precommits the peer does not have | |||
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
if prs.ProposalPOLRound != -1 then | |||
PolPrevotes = rs.Votes.Prevotes(prs.ProposalPOLRound) | |||
vote = random vote from PolPrevotes the peer does not have | |||
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
1b) if prs.Height != 0 and rs.Height == prs.Height+1 then | |||
vote = random vote from rs.LastCommit the peer does not have
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
1c) if prs.Height != 0 and rs.Height >= prs.Height+2 then | |||
Commit = get commit from BlockStore for prs.Height | |||
vote = random vote from Commit the peer does not have | |||
Send VoteMessage(vote) to the peer | |||
if send returns true, continue | |||
2) Sleep PeerGossipSleepDuration | |||
``` | |||
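The sketch below condenses the step 1a cascade into a single decision: given a simplified peer state, it returns which vote set a random missing vote would be drawn from. The step constants and field names are stand-ins for the real ones, and the actual routine falls through to the next case whenever a send does not succeed.

```go
// Sketch of the step 1a decision order when rs.Height == prs.Height.
package main

import "fmt"

const (
	RoundStepNewHeight = iota
	RoundStepPrevote
	RoundStepPrecommit
)

type peerState struct {
	Step             int
	Round            int
	ProposalPOLRound int
}

// voteSetToGossip names the vote set a random missing vote is drawn from,
// following the order used by the gossip votes routine.
func voteSetToGossip(prs peerState, rsRound int) string {
	switch {
	case prs.Step == RoundStepNewHeight:
		return "LastCommit"
	case prs.Step <= RoundStepPrevote && prs.Round != -1 && prs.Round <= rsRound:
		return "Prevotes(prs.Round)"
	case prs.Step <= RoundStepPrecommit && prs.Round != -1 && prs.Round <= rsRound:
		return "Precommits(prs.Round)"
	case prs.ProposalPOLRound != -1:
		return "Prevotes(prs.ProposalPOLRound)"
	default:
		return "nothing to send"
	}
}

func main() {
	prs := peerState{Step: RoundStepPrevote, Round: 0, ProposalPOLRound: -1}
	fmt.Println(voteSetToGossip(prs, 0)) // Prevotes(prs.Round)
}
```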
## QueryMaj23Routine | |||
It is used to send the following message: `VoteSetMaj23Message`. `VoteSetMaj23Message` is sent to indicate that a given | |||
BlockID has seen +2/3 votes. This routine is based on the local RoundState (`rs`) and the known PeerRoundState | |||
(`prs`). The routine repeats forever the logic shown below. | |||
``` | |||
1a) if rs.Height == prs.Height then | |||
Prevotes = rs.Votes.Prevotes(prs.Round) | |||
if there is a ⅔ majority for some blockId in Prevotes then | |||
m = VoteSetMaj23Message(prs.Height, prs.Round, Prevote, blockId) | |||
Send m to peer | |||
Sleep PeerQueryMaj23SleepDuration | |||
1b) if rs.Height == prs.Height then | |||
Precommits = rs.Votes.Precommits(prs.Round) | |||
if there is a ⅔ majority for some blockId in Precommits then | |||
m = VoteSetMaj23Message(prs.Height,prs.Round,Precommit,blockId) | |||
Send m to peer | |||
Sleep PeerQueryMaj23SleepDuration | |||
1c) if rs.Height == prs.Height and prs.ProposalPOLRound >= 0 then | |||
Prevotes = rs.Votes.Prevotes(prs.ProposalPOLRound) | |||
if there is a ⅔ majority for some blockId in Prevotes then | |||
m = VoteSetMaj23Message(prs.Height,prs.ProposalPOLRound,Prevote,blockId)
Send m to peer | |||
Sleep PeerQueryMaj23SleepDuration | |||
1d) if prs.CatchupCommitRound != -1 and 0 < prs.Height and | |||
prs.Height <= blockStore.Height() then | |||
Commit = LoadCommit(prs.Height) | |||
m = VoteSetMaj23Message(prs.Height,Commit.Round,Precommit,Commit.blockId) | |||
Send m to peer | |||
Sleep PeerQueryMaj23SleepDuration | |||
2) Sleep PeerQueryMaj23SleepDuration | |||
``` | |||
## Broadcast routine | |||
The Broadcast routine subscribes to an internal event bus to receive new round step and vote events, and broadcasts messages to peers upon receiving those
events.
It broadcasts `NewRoundStepMessage` or `CommitStepMessage` upon a new round state event. Note that
broadcasting these messages does not depend on the PeerRoundState; they are sent on the StateChannel.
Upon receiving a `VoteMessage` it broadcasts a `HasVoteMessage` to its peers on the StateChannel.
## Channels | |||
Defines 4 channels: state, data, vote and vote_set_bits. Each channel
has a `SendQueueCapacity` and a `RecvBufferCapacity`, and a
`RecvMessageCapacity` set to `maxMsgSize`.
Sending incorrectly encoded data will result in stopping the peer. |
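A self-contained sketch of how such channel descriptors could look is shown below. The `ChannelDescriptor` struct is a local stand-in for the p2p package's type, and the channel IDs, priorities and capacities are illustrative placeholders rather than the values used by the implementation.

```go
// Sketch of declaring the four reactor channels with per-channel capacities.
package main

import "fmt"

type ChannelDescriptor struct {
	ID                  byte
	Priority            int
	SendQueueCapacity   int
	RecvBufferCapacity  int
	RecvMessageCapacity int
}

const maxMsgSize = 1048576 // placeholder bound on a decoded message

func channels() []ChannelDescriptor {
	return []ChannelDescriptor{
		{ID: 0x20, Priority: 5, SendQueueCapacity: 100, RecvBufferCapacity: 4096, RecvMessageCapacity: maxMsgSize},  // state
		{ID: 0x21, Priority: 10, SendQueueCapacity: 100, RecvBufferCapacity: 4096, RecvMessageCapacity: maxMsgSize}, // data
		{ID: 0x22, Priority: 5, SendQueueCapacity: 100, RecvBufferCapacity: 4096, RecvMessageCapacity: maxMsgSize},  // vote
		{ID: 0x23, Priority: 1, SendQueueCapacity: 2, RecvBufferCapacity: 1024, RecvMessageCapacity: maxMsgSize},    // vote_set_bits
	}
}

func main() {
	for _, ch := range channels() {
		fmt.Printf("channel 0x%x: recv message capacity %d\n", ch.ID, ch.RecvMessageCapacity)
	}
}
```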
@ -0,0 +1,184 @@ | |||
# Tendermint Consensus Reactor | |||
Tendermint Consensus is a distributed protocol executed by validator processes to agree on
the next block to be added to the Tendermint blockchain. The protocol proceeds in rounds, where
each round is an attempt to reach agreement on the next block. A round starts with a dedicated
process (called the proposer) suggesting the next block to the other processes with
the `ProposalMessage`.
The processes respond by voting for a block with a `VoteMessage` (there are two kinds of vote
messages: prevotes and precommits). Note that a proposal message is just a suggestion of what the
next block should be; a validator might vote with a `VoteMessage` for a different block. If, in some
round, enough processes vote for the same block, then this block is committed and later
added to the blockchain. `ProposalMessage` and `VoteMessage` are signed by the private key of the
validator. The internals of the protocol and how it ensures safety and liveness properties are
explained in a forthcoming document.
For efficiency reasons, validators in the Tendermint consensus protocol do not agree directly on the
block, as blocks can be large; that is, they don't embed the block inside `Proposal` and
`VoteMessage`. Instead, they reach agreement on the `BlockID` (see the `BlockID` definition in the
[Blockchain](https://github.com/tendermint/tendermint/blob/master/docs/spec/blockchain/blockchain.md#blockid) section) that uniquely identifies each block. The block itself is
disseminated to validator processes using a peer-to-peer gossiping protocol: the
proposer first splits the block into a number of block parts, which are then gossiped between
processes using `BlockPartMessage`.
Validators in Tendermint communicate by a peer-to-peer gossiping protocol. Each validator is connected
only to a subset of processes called peers. Through the gossiping protocol, a validator sends its peers
all the information (`ProposalMessage`, `VoteMessage` and `BlockPartMessage`) they need to
reach agreement on some block and to obtain the content of the chosen block (block parts). As
part of the gossiping protocol, processes also send auxiliary messages that inform peers about the
executed steps of the core consensus algorithm (`NewRoundStepMessage` and `NewValidBlockMessage`), as
well as messages that inform peers which votes the process has seen (`HasVoteMessage`,
`VoteSetMaj23Message` and `VoteSetBitsMessage`). These messages are then used in the gossiping
protocol to determine what messages a process should send to its peers.
We now describe the content of each message exchanged during the Tendermint consensus protocol.
## ProposalMessage | |||
ProposalMessage is sent when a new block is proposed. It is a suggestion of what the | |||
next block in the blockchain should be. | |||
```go | |||
type ProposalMessage struct { | |||
Proposal Proposal | |||
} | |||
``` | |||
### Proposal | |||
Proposal contains the height and round for which this proposal is made, the BlockID as a unique identifier
of the proposed block, a timestamp, and a POLRound (a so-called Proof-of-Lock (POL) round) that is needed for
termination of the consensus. If POLRound >= 0, then BlockID corresponds to the block that
was locked in POLRound. The message is signed by the validator's private key.
```go | |||
type Proposal struct { | |||
Height int64 | |||
Round int | |||
POLRound int | |||
BlockID BlockID | |||
Timestamp Time | |||
Signature Signature | |||
} | |||
``` | |||
## VoteMessage | |||
VoteMessage is sent to vote for some block (or to inform others that a process does not vote in the
current round). Vote is defined in the [Blockchain](https://github.com/tendermint/tendermint/blob/master/docs/spec/blockchain/blockchain.md#blockid) section and contains the validator's
information (validator address and index), the height and round for which the vote is sent, the vote type,
the BlockID if the process votes for some block (`nil` otherwise) and a timestamp for when the vote is sent. The
message is signed by the validator's private key.
```go | |||
type VoteMessage struct { | |||
Vote Vote | |||
} | |||
``` | |||
## BlockPartMessage | |||
BlockPartMessage is sent when gossiping a piece of the proposed block. It contains height, round
and the block part. | |||
```go | |||
type BlockPartMessage struct { | |||
Height int64 | |||
Round int | |||
Part Part | |||
} | |||
``` | |||
## NewRoundStepMessage | |||
NewRoundStepMessage is sent for every step transition during the core consensus algorithm execution. | |||
It is used in the gossip part of the Tendermint protocol to inform peers about the current
height/round/step a process is in.
```go | |||
type NewRoundStepMessage struct { | |||
Height int64 | |||
Round int | |||
Step RoundStepType | |||
SecondsSinceStartTime int | |||
LastCommitRound int | |||
} | |||
``` | |||
## NewValidBlockMessage | |||
NewValidBlockMessage is sent when a validator observes a valid block B in some round r,
i.e., there is a Proposal for block B and 2/3+ prevotes for block B in round r.
It contains the height and round in which the valid block was observed, the block parts header that describes
the valid block and is used to obtain all
block parts, and a bit array of the block parts the process currently has, so its peers know which
parts it is missing and can send them.
If the block is also committed, the IsCommit flag is set to true.
```go | |||
type NewValidBlockMessage struct { | |||
Height int64 | |||
Round int | |||
BlockPartsHeader PartSetHeader | |||
BlockParts BitArray | |||
IsCommit bool | |||
} | |||
``` | |||
## ProposalPOLMessage | |||
ProposalPOLMessage is sent when a previously proposed block is re-proposed.
It is used to inform peers in which round the process learned of this block (ProposalPOLRound),
and which prevotes for the re-proposed block the process has.
```go | |||
type ProposalPOLMessage struct { | |||
Height int64 | |||
ProposalPOLRound int | |||
ProposalPOL BitArray | |||
} | |||
``` | |||
## HasVoteMessage | |||
HasVoteMessage is sent to indicate that a particular vote has been received. It contains height, | |||
round, vote type and the index of the validator that is the originator of the corresponding vote. | |||
```go | |||
type HasVoteMessage struct { | |||
Height int64 | |||
Round int | |||
Type byte | |||
Index int | |||
} | |||
``` | |||
## VoteSetMaj23Message | |||
VoteSetMaj23Message is sent to indicate that a process has seen +2/3 votes for some BlockID. | |||
It contains height, round, vote type and the BlockID. | |||
```go | |||
type VoteSetMaj23Message struct { | |||
Height int64 | |||
Round int | |||
Type byte | |||
BlockID BlockID | |||
} | |||
``` | |||
## VoteSetBitsMessage | |||
VoteSetBitsMessage is sent to communicate the bit-array of votes a process has seen for a given | |||
BlockID. It contains height, round, vote type, BlockID and a bit array of | |||
the votes a process has. | |||
```go | |||
type VoteSetBitsMessage struct { | |||
Height int64 | |||
Round int | |||
Type byte | |||
BlockID BlockID | |||
Votes BitArray | |||
} | |||
``` |
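As a rough illustration, the bit-array carried by `VoteSetBitsMessage` can be thought of as one bit per validator index, set when a vote for the given BlockID has been seen. The sketch below shows that derivation using simplified stand-in types rather than the real Vote and BitArray types.

```go
// Sketch: derive a vote-set bit-array for a given BlockID.
package main

import "fmt"

type vote struct {
	ValidatorIndex int
	BlockIDHash    string
}

// voteSetBits returns a bit-array (as []bool) over numValidators entries;
// bit i is set when a vote for blockIDHash from validator i has been seen.
func voteSetBits(votes []vote, blockIDHash string, numValidators int) []bool {
	bits := make([]bool, numValidators)
	for _, v := range votes {
		if v.BlockIDHash == blockIDHash {
			bits[v.ValidatorIndex] = true
		}
	}
	return bits
}

func main() {
	seen := []vote{{0, "AB"}, {2, "AB"}, {3, "CD"}}
	fmt.Println(voteSetBits(seen, "AB", 4)) // [true false true false]
}
```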
@ -0,0 +1,291 @@ | |||
# Proposer selection procedure in Tendermint | |||
This document specifies the Proposer Selection Procedure that is used in Tendermint to choose a round proposer.
As Tendermint is a “leader-based” protocol, the proposer selection is critical for its correct functioning.
At a given block height, the proposer selection algorithm runs with the same validator set at each round.
Between heights, an updated validator set may be specified by the application as part of the ABCIResponses' EndBlock.
## Requirements for Proposer Selection | |||
This section covers the requirements, with Rx denoting mandatory requirements and Ox optional ones.
The following requirements must be met by the Proposer Selection procedure: | |||
#### R1: Determinism | |||
Given a validator set `V`, and two honest validators `p` and `q`, for each height `h` and each round `r` the following must hold: | |||
`proposer_p(h,r) = proposer_q(h,r)` | |||
where `proposer_p(h,r)` is the proposer returned by the Proposer Selection Procedure at process `p`, at height `h` and round `r`. | |||
#### R2: Fairness | |||
Given a validator set with total voting power P and a sequence S of elections, in any sub-sequence of S with length C*P a validator v must be elected as proposer C*VP(v) times, i.e. with frequency:
f(v) ~ VP(v) / P | |||
where C is a tolerance factor for validator set changes with the following values:
- C == 1 if there are no validator set changes | |||
- C ~ k when there are validator changes | |||
*[this needs more work]* | |||
### Basic Algorithm | |||
At its core, the proposer selection procedure uses a weighted round-robin algorithm. | |||
A model that gives a good intuition of how and why the selection algorithm works, and why it is fair, is that of a priority queue. The validators move ahead in this queue according to their voting power (the higher the voting power the faster a validator moves towards the head of the queue). When the algorithm runs the following happens:
- all validators move "ahead" according to their powers: for each validator, increase the priority by the voting power | |||
- first in the queue becomes the proposer: select the validator with highest priority | |||
- move the proposer back in the queue: decrease the proposer's priority by the total voting power | |||
Notation: | |||
- vset - the validator set | |||
- n - the number of validators | |||
- VP(i) - voting power of validator i | |||
- A(i) - accumulated priority for validator i | |||
- P - total voting power of set | |||
- avg - average of all validator priorities | |||
- prop - proposer | |||
A simplified view of the Selection Algorithm:
``` | |||
def ProposerSelection (vset): | |||
// compute priorities and elect proposer | |||
for each validator i in vset: | |||
A(i) += VP(i) | |||
prop = max(A) | |||
A(prop) -= P | |||
``` | |||
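The following Go sketch is a direct, runnable rendering of the algorithm above for a fixed validator set; the struct and function names are illustrative rather than the implementation's, and ties are broken by simply keeping the first validator in the slice.

```go
// Sketch of the basic weighted round-robin proposer selection.
package main

import "fmt"

type validator struct {
	Name     string
	VP       int64 // voting power
	Priority int64 // accumulated priority A(i)
}

// proposerSelection advances all priorities by the voting power, elects the
// validator with the highest priority, and moves it back by the total power P.
func proposerSelection(vset []*validator) *validator {
	var total int64
	for _, v := range vset {
		v.Priority += v.VP
		total += v.VP
	}
	prop := vset[0]
	for _, v := range vset[1:] {
		if v.Priority > prop.Priority {
			prop = v
		}
	}
	prop.Priority -= total
	return prop
}

func main() {
	vset := []*validator{{Name: "p1", VP: 1}, {Name: "p2", VP: 3}}
	for i := 0; i < 8; i++ {
		fmt.Print(proposerSelection(vset).Name, " ")
	}
	fmt.Println() // p2 p1 p2 p2 p2 p1 p2 p2 (matches the fairness claim later on)
}
```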
### Stable Set | |||
Consider the validator set: | |||
Validator | p1| p2 | |||
----------|---|--- | |||
VP | 1 | 3 | |||
Assuming no validator changes, the following table shows the proposer priority computation over a few runs. Four runs of the selection procedure are shown; starting with the 5th run, the same values are computed.
Each row shows the priority queue and each process's place in it. The proposer is the one closest to the head, i.e. the rightmost validator. As priorities are updated, the validators move right in the queue. The proposer moves left as its priority is reduced after election.
|Priority Run | -2| -1| 0 | 1| 2 | 3 | 4 | 5 | Alg step | |||
|--------------- |---|---|---- |---|---- |---|---|---|-------- | |||
| | | |p1,p2| | | | | |Initialized to 0 | |||
|run 1 | | | | p1| | p2| | |A(i)+=VP(i) | |||
| | | p2| | p1| | | | |A(p2)-= P | |||
|run 2 | | | | |p1,p2| | | |A(i)+=VP(i) | |||
| | p1| | | | p2| | | |A(p1)-= P | |||
|run 3 | | p1| | | | | | p2|A(i)+=VP(i) | |||
| | | p1| | p2| | | | |A(p2)-= P | |||
|run 4 | | | p1| | | | p2| |A(i)+=VP(i) | |||
| | | |p1,p2| | | | | |A(p2)-= P | |||
It can be shown that: | |||
- At the end of each run k+1 the sum of the priorities is the same as at the end of run k. If a new set's priorities are initialized to 0 then the sum of priorities will be 0 at each run while there are no changes.
- The max distance between priorities is (n-1) * P. *[formal proof not finished]*
### Validator Set Changes | |||
Between proposer selection runs the validator set may change. Some changes have implications on the proposer election. | |||
#### Voting Power Change | |||
Consider again the earlier example and assume that the voting power of p1 is changed to 4: | |||
Validator | p1| p2 | |||
----------|---| --- | |||
VP | 4 | 3 | |||
Let's also assume that before this change the proposer priorities were as shown in the first row (last run). As can be seen, the selection could run again, without changes, as before.
|Priority Run| -2 | -1 | 0 | 1 | 2 | Comment | |||
|--------------| ---|--- |------|--- |--- |-------- | |||
| last run | | p2 | | p1 | |__update VP(p1)__ | |||
| next run | | | | | p2 |A(i)+=VP(i) | |||
| | p1 | | | | p2 |A(p1)-= P | |||
However, when a validator changes power from a high to a low value, some other validators may remain far back in the queue for a long time. This scenario is considered again in the Proposer Priority Range section.
As before: | |||
- At the end of each run k+1 the sum of the priorities is the same as at the end of run k.
- The max distance between priorities is (n-1) * P.
#### Validator Removal | |||
Consider a new example with set: | |||
Validator | p1 | p2 | p3 | | |||
--------- |--- |--- |--- | | |||
VP | 1 | 2 | 3 | | |||
Let's assume that after the last run the proposer priorities were as shown in the first row with their sum being 0. After p2 is removed, at the end of the next proposer selection run (penultimate row) the sum of priorities is -2 (minus the priority of the removed process).
The procedure could continue without modifications. However, after a sufficiently large number of modifications to the validator set, the priority values would migrate towards the maximum or minimum allowed values, causing truncations due to overflow detection.
For this reason, the selection procedure adds another __new step__ that centers the current priority values such that the priority sum remains close to 0. | |||
|Priority Run |-3 | -2 | -1 | 0 | 1 | 2 | 4 |Comment | |||
|--------------- |--- | ---|--- |--- |--- |--- |---|-------- | |||
| last run |p3 | | | | p1 | p2 | |__remove p2__ | |||
| next run | | | | | | | |
| __new step__ | | p3 | | | | p1 | |A(i) -= avg, avg = -1 | |||
| | | | | | p3 | p1 | |A(i)+=VP(i) | |||
| | | | p1 | | p3 | | |A(p1)-= P | |||
The modified selection algorithm is:

```
def ProposerSelection (vset):

    // center priorities around zero
    avg = sum(A(i) for i in vset)/len(vset)
    for each validator i in vset:
        A(i) -= avg

    // compute priorities and elect proposer
    for each validator i in vset:
        A(i) += VP(i)
    prop = max(A)
    A(prop) -= P
```
Observations: | |||
- The sum of priorities is now close to 0. Due to integer division the sum is an integer in (-n, n), where n is the number of validators. | |||
#### New Validator | |||
When a new validator is added, the same problem as the one described for removal appears: the sum of priorities in the new set is not zero. This is fixed with the centering step introduced above.
One other issue that needs to be addressed is the following. A validator V that has just been elected is moved to the end of the queue. If the validator set is large and/or other validators have significantly higher power, V will have to wait many runs to be elected. If V removes and re-adds itself to the set, it would make a significant (albeit unfair) "jump" ahead in the queue.
In order to prevent this, when a new validator is added, its initial priority is set to: | |||
A(V) = -1.125 * P | |||
where P is the total voting power of the set including V. | |||
The current implementation uses a penalty factor of 1.125 because it provides a small punishment that is efficient to calculate. See [here](https://github.com/tendermint/tendermint/pull/2785#discussion_r235038971) for more details.
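A plausible integer-arithmetic rendering of this penalty is `-(P + (P >> 3))`, which equals `-1.125 * P` and is cheap to compute; the exact expression used by the implementation may differ, so treat the sketch below as illustrative. It is consistent with the note further down about penalty loss under integer division.

```go
// Sketch of the initial priority assigned to a newly added validator,
// assuming the 1.125 factor is applied as P + (P >> 3) in integer arithmetic.
package main

import "fmt"

// initialPriority returns A(V) = -1.125 * P computed with integers,
// where totalPower already includes the new validator's power.
func initialPriority(totalPower int64) int64 {
	return -(totalPower + (totalPower >> 3))
}

func main() {
	// Example from the text: p3 joins a set with total power 1 + 3 + 8 = 12.
	fmt.Println(initialPriority(12)) // -13
}
```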
If we consider the validator set where p3 has just been added: | |||
Validator | p1 | p2 | p3 | |||
----------|--- |--- |--- | |||
VP | 1 | 3 | 8 | |||
then p3 will start with proposer priority: | |||
A(p3) = -1.125 * (1 + 3 + 8) ~ -13 | |||
Note that since the current computation uses integer division, part of the penalty is lost when the sum of the voting power is less than 8.
In the next run, p3 will still be ahead in the queue, elected as proposer and moved back in the queue. | |||
|Priority Run |-13 | -9 | -5 | -2 | -1 | 0 | 1 | 2 | 5 | 6 | 7 |Alg step | |||
|---------------|--- |--- |--- |----|--- |--- |---|---|---|---|---|-------- | |||
|last run | | | | p2 | | | | p1| | | |__add p3__ | |||
| | p3 | | | p2 | | | | p1| | | |A(p3) = -13
|next run | | p3 | | | | | | p2| | p1| |A(i) -= avg, avg = -4 | |||
| | | | | | p3 | | | | p2| | p1|A(i)+=VP(i) | |||
| | | | p1 | | p3 | | | | p2| | |A(p1)-=P | |||
### Proposer Priority Range | |||
With the introduction of centering, some interesting cases occur. Low-power validators that join early in a set that includes high-power validator(s) benefit from subsequent additions to the set. This is because these early validators go through more right-shift operations during centering, operations that increase their priority.
As an example, consider the set where p2 is added after p1, with initial priority -1.125 * 80k = -90k. After the selection procedure runs once:
Validator | p1 | p2 | Comment | |||
----------|-----|---- |--- | |||
VP | 80k | 10 | | |||
A | 0 |-90k | __added p2__ | |||
A | 45k |-45k | __run selection__
Then execute the following steps: | |||
1. Add a new validator p3: | |||
Validator | p1 | p2 | p3 | |||
----------|-----|--- |---- | |||
VP | 80k | 10 | 10 | |||
2. Run selection once. The notation '..p'/'p..' means a very small deviation from the column priority.
|Priority Run | -90k..| -60k | -45k | -15k| 0 | 45k | 75k | 155k | Comment | |||
|--------------|------ |----- |------- |---- |---|---- |----- |------- |--------- | |||
| last run | p3 | | p2 | | | p1 | | | __added p3__ | |||
| next run | |||
| *right_shift*| | p3 | | p2 | | | p1 | | A(i) -= avg,avg=-30k | |||
| | | ..p3| | ..p2| | | | p1 | A(i)+=VP(i) | |||
| | | ..p3| | ..p2| | | p1.. | | A(p1)-=P, P=80k+20 | |||
3. Remove p1 and run selection once: | |||
Validator | p3 | p2 | Comment | |||
----------|----- |---- |-------- | |||
VP | 10 | 10 | | |||
A |-60k |-15k | | |||
A |-22.5k|22.5k| __run selection__ | |||
At this point, while the total voting power is 20, the distance between priorities is 45k. It will take 4500 runs for p3 to catch up with p2. | |||
In order to prevent these types of scenarios, the selection algorithm performs scaling of priorities such that the difference between min and max values is smaller than two times the total voting power. | |||
The modified selection algorithm is:

```
def ProposerSelection (vset):

    // scale the priority values
    diff = max(A)-min(A)
    threshold = 2 * P
    if diff > threshold:
        scale = diff/threshold
        for each validator i in vset:
            A(i) = A(i)/scale

    // center priorities around zero
    avg = sum(A(i) for i in vset)/len(vset)
    for each validator i in vset:
        A(i) -= avg

    // compute priorities and elect proposer
    for each validator i in vset:
        A(i) += VP(i)
    prop = max(A)
    A(prop) -= P
```
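Below is a runnable Go sketch of the full procedure (scaling, centering, election) using plain int64 arithmetic; it is illustrative only and omits the overflow clamping discussed in the Wrinkles section.

```go
// Sketch of proposer selection with scaling and centering.
package main

import "fmt"

type validator struct {
	Name     string
	VP       int64
	Priority int64
}

func totalPower(vset []*validator) int64 {
	var p int64
	for _, v := range vset {
		p += v.VP
	}
	return p
}

func proposerSelection(vset []*validator) *validator {
	p := totalPower(vset)

	// Scale the priority values so that max(A) - min(A) stays bounded by 2 * P.
	min, max := vset[0].Priority, vset[0].Priority
	for _, v := range vset[1:] {
		if v.Priority < min {
			min = v.Priority
		}
		if v.Priority > max {
			max = v.Priority
		}
	}
	if diff, threshold := max-min, 2*p; diff > threshold {
		scale := diff / threshold
		for _, v := range vset {
			v.Priority /= scale
		}
	}

	// Center priorities around zero.
	var sum int64
	for _, v := range vset {
		sum += v.Priority
	}
	avg := sum / int64(len(vset))
	for _, v := range vset {
		v.Priority -= avg
	}

	// Compute priorities and elect the proposer.
	var prop *validator
	for _, v := range vset {
		v.Priority += v.VP
		if prop == nil || v.Priority > prop.Priority {
			prop = v
		}
	}
	prop.Priority -= p
	return prop
}

func main() {
	// Example from the Proposer Priority Range section: p2 was just added
	// with priority ~ -90k next to a much more powerful p1.
	vset := []*validator{{Name: "p1", VP: 80000}, {Name: "p2", VP: 10, Priority: -90000}}
	fmt.Println(proposerSelection(vset).Name)       // p1
	fmt.Println(vset[0].Priority, vset[1].Priority) // ~45k and ~-45k
}
```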
Observations: | |||
- With this modification, the maximum distance between priorities becomes 2 * P.
Note also that even during steady state the priority range may increase beyond 2 * P. The scaling introduced here helps to keep the range bounded. | |||
### Wrinkles | |||
#### Validator Power Overflow Conditions | |||
The validator voting power is a positive number stored as an int64. When a validator is added, the `1.125 * P` computation must not overflow. As a consequence, the code handling validator updates (add and update) checks for overflow conditions, making sure the total voting power is never larger than the largest int64 `MAX` with the property that `1.125 * MAX` is still within the bounds of int64. A fatal error is returned when an overflow condition is detected.
#### Proposer Priority Overflow/ Underflow Handling | |||
The proposer priority is stored as an int64. The selection algorithm performs additions and subtractions on these values, and in the case of overflows and underflows it limits the values to:

```
MaxInt64 = 1 << 63 - 1
MinInt64 = -1 << 63
```
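A small sketch of such clamping is shown below: additions to a priority saturate at the int64 bounds instead of wrapping around. This is an illustrative helper, not the implementation's code.

```go
// Sketch of overflow/underflow-safe addition of proposer priorities.
package main

import (
	"fmt"
	"math"
)

// safeAdd adds b to a, clamping the result to [MinInt64, MaxInt64]
// on overflow or underflow.
func safeAdd(a, b int64) int64 {
	if b > 0 && a > math.MaxInt64-b {
		return math.MaxInt64
	}
	if b < 0 && a < math.MinInt64-b {
		return math.MinInt64
	}
	return a + b
}

func main() {
	fmt.Println(safeAdd(math.MaxInt64-1, 10))  // clamped to MaxInt64
	fmt.Println(safeAdd(math.MinInt64+1, -10)) // clamped to MinInt64
	fmt.Println(safeAdd(5, 7))                 // 12
}
```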
### Requirement Fulfillment Claims | |||
__[R1]__ | |||
The proposer algorithm is deterministic, giving consistent results across executions with the same transactions and validator set modifications.
[WIP - needs more detail] | |||
__[R2]__ | |||
Given a set of processes with the total voting power P, during a sequence of elections of length P, the number of times any process is selected as proposer is equal to its voting power. The sequence of the P proposers then repeats. If we consider the validator set: | |||
Validator | p1| p2 | |||
----------|---|--- | |||
VP | 1 | 3 | |||
With no other changes to the validator set, the current implementation of proposer selection generates the sequence: | |||
`p2, p1, p2, p2, p2, p1, p2, p2,...` or [`p2, p1, p2, p2`]* | |||
A sequence that starts with any circular permutation of the [`p2, p1, p2, p2`] sub-sequence would also provide the same degree of fairness. In fact these circular permutations show up in a sliding window (over the generated sequence) of size equal to the length of the sub-sequence.
Assigning priorities to each validator based on the voting power and updating them at each run ensures the fairness of the proposer selection. In addition, every time a validator is elected as proposer its priority is decreased by the total voting power.
Intuitively, a process v needs at most (max(A) - min(A))/VP(v) runs to move ahead in the queue, reach the head, and be elected. The frequency is then:

f(v) ~ VP(v)/(max(A)-min(A)) = 1/k * VP(v)/P

For the current implementation, this means v should be the proposer at least VP(v) times out of k * P runs, with scaling factor k=2.
@ -0,0 +1,10 @@ | |||
# Evidence Reactor | |||
## Channels | |||
[#1503](https://github.com/tendermint/tendermint/issues/1503) | |||
Sending invalid evidence will result in stopping the peer. | |||
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result | |||
in stopping the peer. |
@ -0,0 +1,8 @@ | |||
# Mempool Concurrency | |||
Look at the concurrency model this uses... | |||
- Receiving CheckTx | |||
- Broadcasting new tx | |||
- Interfaces with consensus engine, reap/update while checking | |||
- Calling the ABCI app (ordering, callbacks, how the proxy works alongside the blockchain proxy which actually writes blocks)
@ -0,0 +1,54 @@ | |||
# Mempool Configuration | |||
Here we describe configuration options around the mempool.
For the purposes of this document, they are described
as command-line flags, but they can also be passed in as
environment variables or in the config.toml file. The
following are all equivalent:
Flag: `--mempool.recheck=false` | |||
Environment: `TM_MEMPOOL_RECHECK=false` | |||
Config: | |||
``` | |||
[mempool] | |||
recheck = false | |||
``` | |||
## Recheck | |||
`--mempool.recheck=false` (default: true) | |||
Recheck determines if the mempool rechecks all pending
transactions after a block is committed. Once a block
is committed, the mempool removes all valid transactions
that were successfully included in the block.
If `recheck` is true, then the mempool will rerun CheckTx on
all remaining transactions against the new block state.
## Broadcast | |||
`--mempool.broadcast=false` (default: true) | |||
Determines whether this node gossips any valid transactions
that arrive in the mempool. The default is to gossip anything that
passes CheckTx. If this is disabled, transactions are not
gossiped, but are instead stored locally and added to the next
block this node proposes.
## WalDir | |||
`--mempool.wal_dir=/tmp/gaia/mempool.wal` (default: $TM_HOME/data/mempool.wal) | |||
This defines the directory where the mempool writes its write-ahead
logs. These files can be used to reload unbroadcast
transactions if the node crashes.
If the directory passed in is an absolute path, the wal file is
created there. If the directory is a relative path, the path is
appended to the home directory of the tendermint process to
generate an absolute path to the wal directory
(default `$HOME/.tendermint`, or set via `TM_HOME` or `--home`).
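An illustrative sketch of that path resolution rule (not the actual config code) could look like this:

```go
// Sketch: resolve a relative wal_dir against the Tendermint home directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveWalDir returns walDir unchanged if it is absolute, and otherwise
// joins it onto the node's home directory.
func resolveWalDir(walDir, homeDir string) string {
	if filepath.IsAbs(walDir) {
		return walDir
	}
	return filepath.Join(homeDir, walDir)
}

func main() {
	home := os.Getenv("TM_HOME")
	if home == "" {
		home = filepath.Join(os.Getenv("HOME"), ".tendermint")
	}
	fmt.Println(resolveWalDir("data/mempool.wal", home))
	fmt.Println(resolveWalDir("/tmp/gaia/mempool.wal", home))
}
```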
@ -0,0 +1,43 @@ | |||
# Mempool Functionality | |||
The mempool maintains a list of potentially valid transactions, | |||
both to broadcast to other nodes, as well as to provide to the | |||
consensus reactor when it is selected as the block proposer. | |||
There are two sides to the mempool state: | |||
- External: get, check, and broadcast new transactions | |||
- Internal: return valid transaction, update list after block commit | |||
## External functionality | |||
External functionality is exposed via network interfaces | |||
to potentially untrusted actors. | |||
- CheckTx - triggered via RPC or P2P | |||
- Broadcast - gossip messages after a successful check | |||
## Internal functionality | |||
Internal functionality is exposed via method calls to other | |||
code compiled into the tendermint binary. | |||
- ReapMaxBytesMaxGas - get txs to propose in the next block. Guarantees that the | |||
size of the txs is less than MaxBytes, and gas is less than MaxGas | |||
- Update - remove txs that were included in the last block
- ABCI.CheckTx - call the ABCI app to validate the tx
What does it provide the consensus reactor? | |||
What guarantees does it need from the ABCI app? | |||
(talk about interleaving processes in concurrency) | |||
## Optimizations | |||
The implementation within this library also implements a tx cache. | |||
This is so that signatures don't have to be reverified if the tx has | |||
already been seen before. | |||
However, we only store valid txs in the cache, not invalid ones. | |||
This is because invalid txs could become valid later.
Txs that are included in a block aren't removed from the cache,
as they may still be getting received over the p2p network.
These txs are stored in the cache by their hash, to mitigate memory concerns. |
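A much simplified sketch of such a hash-keyed, bounded cache is shown below; the eviction policy and size here are illustrative, not the mempool's actual cache implementation.

```go
// Sketch of a bounded tx cache keyed by hash.
package main

import (
	"container/list"
	"crypto/sha256"
	"fmt"
)

type txCache struct {
	size  int
	seen  map[[sha256.Size]byte]*list.Element
	queue *list.List // oldest hashes at the front
}

func newTxCache(size int) *txCache {
	return &txCache{size: size, seen: make(map[[sha256.Size]byte]*list.Element), queue: list.New()}
}

// Push records tx and reports whether it was already cached.
func (c *txCache) Push(tx []byte) (alreadySeen bool) {
	h := sha256.Sum256(tx)
	if _, ok := c.seen[h]; ok {
		return true
	}
	if c.queue.Len() >= c.size { // evict the oldest entry to bound memory
		oldest := c.queue.Front()
		delete(c.seen, oldest.Value.([sha256.Size]byte))
		c.queue.Remove(oldest)
	}
	c.seen[h] = c.queue.PushBack(h)
	return false
}

func main() {
	cache := newTxCache(10000)
	fmt.Println(cache.Push([]byte("tx1"))) // false: first time seen
	fmt.Println(cache.Push([]byte("tx1"))) // true: cached, skip re-checking signatures
}
```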
@ -0,0 +1,61 @@ | |||
# Mempool Messages | |||
## P2P Messages | |||
There is currently only one message that Mempool broadcasts | |||
and receives over the p2p gossip network (via the reactor): | |||
`TxMessage` | |||
```go | |||
// TxMessage is a MempoolMessage containing a transaction. | |||
type TxMessage struct { | |||
Tx types.Tx | |||
} | |||
``` | |||
TxMessage is go-amino encoded and prepended with `0x1` as a | |||
"type byte". This is followed by a go-amino encoded byte-slice. | |||
The prefix of a 40 = 0x28 byte tx is: `0x010128...`, followed by
the actual 40-byte tx. The prefix of a 350 = 0x015e byte tx is:
`0x0102015e...`, followed by the actual 350-byte tx.
(Please see the [go-amino repo](https://github.com/tendermint/go-amino#an-interface-example) for more information) | |||
## RPC Messages | |||
Mempool exposes `CheckTx([]byte)` over the RPC interface. | |||
It can be posted via `broadcast_tx_commit`, `broadcast_tx_sync` or
`broadcast_tx_async`. They all parse a message with one argument,
`"tx": "HEX_ENCODED_BINARY"` and differ only in how long they
wait before returning (sync makes sure CheckTx passes, commit
makes sure it was included in a signed block).
Request (`POST http://gaia.zone:26657/`): | |||
```json | |||
{ | |||
"id": "", | |||
"jsonrpc": "2.0", | |||
"method": "broadcast_sync", | |||
"params": { | |||
"tx": "F012A4BC68..." | |||
} | |||
} | |||
``` | |||
Response: | |||
```json | |||
{ | |||
"error": "", | |||
"result": { | |||
"hash": "E39AAB7A537ABAA237831742DCE1117F187C3C52", | |||
"log": "", | |||
"data": "", | |||
"code": 0 | |||
}, | |||
"id": "", | |||
"jsonrpc": "2.0" | |||
} | |||
``` |
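For illustration, the same request could be sent from Go roughly as follows; the endpoint URL is the example address used above and the tx bytes are placeholders.

```go
// Sketch: posting a broadcast_tx_sync JSON-RPC request to a node.
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":"","method":"broadcast_tx_sync","params":{"tx":"F012A4BC68"}}`)

	resp, err := http.Post("http://gaia.zone:26657/", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	out, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(out)) // JSON-RPC response with hash, log, data and code
}
```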
@ -0,0 +1,22 @@ | |||
# Mempool Reactor | |||
## Channels | |||
See [this issue](https://github.com/tendermint/tendermint/issues/1503) | |||
Mempool maintains a cache of the last 10000 transactions to prevent | |||
replaying old transactions (plus transactions coming from other | |||
validators, who are continually exchanging transactions). Read [Replay | |||
Protection](../../../app-dev/app-development.md#replay-protection) | |||
for details. | |||
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result | |||
in stopping the peer. | |||
The mempool will not send a tx back to any peer from which it received it.
The reactor assigns a `uint16` number to each peer and maintains a map from
p2p.ID to `uint16`. Each mempool transaction carries a list of all its senders
(`[]uint16`). The list is updated every time the mempool receives a transaction
it has already seen. Using `uint16` assumes that a node will never have more than 65535 active
peers (0 is reserved for an unknown source - e.g. RPC).
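A minimal sketch of this sender-tracking scheme is shown below; the types are simplified stand-ins for the implementation's, and id reuse and the 65535 cap are only hinted at in comments.

```go
// Sketch: per-tx sender tracking with compact uint16 peer ids.
package main

import "fmt"

const unknownSource uint16 = 0 // reserved, e.g. txs arriving over RPC

type mempoolIDs struct {
	next   uint16
	byPeer map[string]uint16 // p2p ID -> compact uint16 id
}

func newMempoolIDs() *mempoolIDs {
	return &mempoolIDs{next: unknownSource + 1, byPeer: make(map[string]uint16)}
}

func (m *mempoolIDs) idFor(peer string) uint16 {
	if id, ok := m.byPeer[peer]; ok {
		return id
	}
	id := m.next
	m.next++ // a real implementation must also reuse ids and cap them at 65535
	m.byPeer[peer] = id
	return id
}

type memTx struct {
	tx      []byte
	senders map[uint16]bool
}

func main() {
	ids := newMempoolIDs()
	tx := &memTx{tx: []byte("tx1"), senders: map[uint16]bool{}}

	// The same tx arrives twice, from a peer and then via RPC.
	tx.senders[ids.idFor("peerA")] = true
	tx.senders[unknownSource] = true

	peerA := ids.idFor("peerA")
	fmt.Println(tx.senders[peerA]) // true -> do not send this tx back to peerA
}
```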
@ -0,0 +1,132 @@ | |||
# Peer Strategy and Exchange | |||
Here we outline the design of the AddressBook | |||
and how it is used by the Peer Exchange Reactor (PEX) to ensure we are connected
to good peers and to gossip peers to others. | |||
## Peer Types | |||
Certain peers are special in that they are specified by the user as `persistent`, | |||
which means we auto-redial them if the connection fails, or if we fail to dial | |||
them. | |||
Some peers can be marked as `private`, which means | |||
we will not put them in the address book or gossip them to others. | |||
All peers except private peers and peers coming from them are tracked using the | |||
address book. | |||
The rest of our peers are only distinguished by being either | |||
inbound (they dialed our public address) or outbound (we dialed them). | |||
## Discovery | |||
Peer discovery begins with a list of seeds. | |||
When we don't have enough peers, we | |||
1. ask existing peers | |||
2. dial seeds if we're not dialing anyone currently | |||
On startup, we will also immediately dial the given list of `persistent_peers`, | |||
and will attempt to maintain persistent connections with them. If the | |||
connections die, or we fail to dial, we will redial every 5s for a few minutes, | |||
then switch to an exponential backoff schedule, and after about a day of | |||
trying, stop dialing the peer. | |||
As long as we have fewer than `MaxNumOutboundPeers`, we periodically request
additional peers from each of our own peers, and also try the seeds.
## Listening | |||
Peers listen on a configurable ListenAddr that they self-report in their | |||
NodeInfo during handshakes with other peers. Peers accept up to | |||
`MaxNumInboundPeers` incoming peers. | |||
## Address Book | |||
Peers are tracked via their ID (their PubKey.Address()). | |||
Peers are added to the address book from the PEX when they first connect to us or | |||
when we hear about them from other peers. | |||
The address book is arranged in sets of buckets, and distinguishes between | |||
vetted (old) and unvetted (new) peers. It keeps different sets of buckets for vetted and | |||
unvetted peers. Buckets provide randomization over peer selection. Peers are put | |||
in buckets according to their IP groups. | |||
A vetted peer can only be in one bucket. An unvetted peer can be in multiple buckets, and | |||
each instance of the peer can have a different IP:PORT. | |||
If we're trying to add a new peer but there's no space in its bucket, we'll | |||
remove the worst peer from that bucket to make room. | |||
## Vetting | |||
When a peer is first added, it is unvetted. | |||
Marking a peer as vetted is outside the scope of the `p2p` package. | |||
For Tendermint, a Peer becomes vetted once it has contributed sufficiently | |||
at the consensus layer; i.e. once it has sent us valid and not-yet-known
votes and/or block parts for `NumBlocksForVetted` blocks. | |||
Other users of the p2p package can determine their own conditions for when a peer is marked vetted. | |||
If a peer becomes vetted but there are already too many vetted peers, | |||
a randomly selected one of the vetted peers becomes unvetted. | |||
If a peer becomes unvetted (either a new peer, or one that was previously vetted), | |||
a randomly selected one of the unvetted peers is removed from the address book. | |||
More fine-grained tracking of peer behaviour can be done using | |||
a trust metric (see below), but it's best to start with something simple. | |||
## Select Peers to Dial | |||
When we need more peers, we pick addresses randomly from the addrbook with some | |||
configurable bias for unvetted peers. The bias should be lower when we have | |||
fewer peers and can increase as we obtain more, ensuring that our first peers | |||
are more trustworthy, but always giving us the chance to discover new good | |||
peers. | |||
We track the last time we dialed a peer and the number of unsuccessful attempts | |||
we've made. If too many attempts are made, we mark the peer as bad. | |||
Connection attempts are made with exponential backoff (plus jitter). Because | |||
the selection process happens every `ensurePeersPeriod`, we might not end up | |||
dialing a peer for much longer than the backoff duration. | |||
If we fail to connect to the peer after 16 tries (with exponential backoff), we
remove it from the address book completely.
## Select Peers to Exchange | |||
When we’re asked for peers, we select them as follows: | |||
- select at most `maxGetSelection` peers | |||
- try to select at least `minGetSelection` peers - if we have fewer than that, select them all.
- select a random, unbiased `getSelectionPercent` of the peers | |||
Send the selected peers. Note we select peers for sending without bias for vetted/unvetted. | |||
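An illustrative sketch of this selection rule is shown below, assuming placeholder values for `maxGetSelection`, `minGetSelection` and `getSelectionPercent`; it is not the address book's code.

```go
// Sketch: choose a random, unbiased subset of known addresses to send a peer.
package main

import (
	"fmt"
	"math/rand"
)

const (
	maxGetSelection     = 250 // placeholder upper bound
	minGetSelection     = 32  // placeholder lower bound
	getSelectionPercent = 23  // placeholder percentage
)

// selectPeersToSend returns a random selection of addresses within the bounds.
func selectPeersToSend(addrs []string) []string {
	n := len(addrs) * getSelectionPercent / 100
	if n < minGetSelection {
		n = minGetSelection
	}
	if n > maxGetSelection {
		n = maxGetSelection
	}
	if n > len(addrs) { // fewer peers than the minimum: select them all
		n = len(addrs)
	}
	rand.Shuffle(len(addrs), func(i, j int) { addrs[i], addrs[j] = addrs[j], addrs[i] })
	return addrs[:n]
}

func main() {
	addrs := []string{"1.2.3.4:26656", "5.6.7.8:26656", "9.9.9.9:26656"}
	fmt.Println(selectPeersToSend(addrs)) // all 3, since 3 < minGetSelection
}
```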
## Preventing Spam | |||
There are various cases where we decide a peer has misbehaved and we disconnect from them. | |||
When this happens, the peer is removed from the address book and blacklisted for
some amount of time. We call this "Disconnect and Mark". | |||
Note that the bad behaviour may be detected outside the PEX reactor itself | |||
(for instance, in the mconnection, or another reactor), but it must be communicated to the PEX reactor | |||
so it can remove and mark the peer. | |||
In the PEX, if a peer sends us an unsolicited list of peers, | |||
or if the peer sends a request too soon after another one, | |||
we Disconnect and MarkBad. | |||
## Trust Metric | |||
The quality of peers can be tracked in more fine-grained detail using a | |||
Proportional-Integral-Derivative (PID) controller that incorporates | |||
current, past, and rate-of-change data to inform peer quality. | |||
While a PID trust metric has been implemented, it remains for future work | |||
to use it in the PEX. | |||
See the [trust metric](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-006-trust-metric.md)
and [trust metric usage](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-007-trust-metric-usage.md)
architecture docs for more details. |
@ -0,0 +1,12 @@ | |||
# PEX Reactor | |||
## Channels | |||
Defines only `SendQueueCapacity`. [#1503](https://github.com/tendermint/tendermint/issues/1503) | |||
Implements rate-limiting by enforcing minimal time between two consecutive | |||
`pexRequestMessage` requests. If the peer sends us addresses we did not ask for,
it is stopped. | |||
Sending incorrectly encoded data or data exceeding `maxMsgSize` will result | |||
in stopping the peer. |