|
|
- # ADR 040: Blockchain Reactor Refactor
-
- ## Changelog
-
- 19-03-2019: Initial draft
-
- ## Context
-
- The Blockchain Reactor's high level responsibility is to enable peers who are far behind the current state of the
- blockchain to quickly catch up by downloading many blocks in parallel from its peers, verifying block correctness, and
- executing them against the ABCI application. We call the protocol executed by the Blockchain Reactor `fast-sync`.
- The current architecture diagram of the blockchain reactor can be found here:
-
- ![Blockchain Reactor Architecture Diagram](img/bc-reactor.png)
-
- The current architecture consists of dozens of routines and it is tightly depending on the `Switch`, making writing
- unit tests almost impossible. Current tests require setting up complex dependency graphs and dealing with concurrency.
- Note that having dozens of routines is in this case overkill as most of the time routines sits idle waiting for
- something to happen (message to arrive or timeout to expire). Due to dependency on the `Switch`, testing relatively
- complex network scenarios and failures (for example adding and removing peers) is very complex tasks and frequently lead
- to complex tests with not deterministic behavior ([#3400]). Impossibility to write proper tests makes confidence in
- the code low and this resulted in several issues (some are fixed in the meantime and some are still open):
- [#3400], [#2897], [#2896], [#2699], [#2888], [#2457], [#2622], [#2026].
-
- ## Decision
-
- To remedy these issues we plan a major refactor of the blockchain reactor. The proposed architecture is largely inspired
- by ADR-30 and is presented on the following diagram:
- ![Blockchain Reactor Refactor Diagram](img/bc-reactor-refactor.png)
-
- We suggest a concurrency architecture where the core algorithm (we call it `Controller`) is extracted into a finite
- state machine. The active routine of the reactor is called `Executor` and is responsible for receiving and sending
- messages from/to peers and triggering timeouts. What messages should be sent and timeouts triggered is determined mostly
- by the `Controller`. The exception is `Peer Heartbeat` mechanism which is `Executor` responsibility. The heartbeat
- mechanism is used to remove slow and unresponsive peers from the peer list. Writing of unit tests is simpler with
- this architecture as most of the critical logic is part of the `Controller` function. We expect that simpler concurrency
- architecture will not have significant negative effect on the performance of this reactor (to be confirmed by
- experimental evaluation).
-
-
- ### Implementation changes
-
- We assume the following system model for "fast sync" protocol:
-
- * a node is connected to a random subset of all nodes that represents its peer set. Some nodes are correct and some
- might be faulty. We don't make assumptions about ratio of faulty nodes, i.e., it is possible that all nodes in some
- peer set are faulty.
- * we assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to
- a correct node `q` at time `t`, then `q` will receive message the latest at time `t+Delta` where `Delta` is a system
- parameter that is known by network participants. `Delta` is normally chosen to be an order of magnitude higher than
- the real communication delay (maximum) between correct nodes. Therefore if a correct node `p` sends a request message
- to a correct node `q` at time `t` and there is no the corresponding reply at time `t + 2*Delta`, then `p` can assume
- that `q` is faulty. Note that the network assumptions for the consensus reactor are different (we assume partially
- synchronous model there).
-
- The requirements for the "fast sync" protocol are formally specified as follows:
-
- - `Correctness`: If a correct node `p` is connected to a correct node `q` for a long enough period of time, then `p`
- - will eventually download all requested blocks from `q`.
- - `Termination`: If a set of peers of a correct node `p` is stable (no new nodes are added to the peer set of `p`) for
- - a long enough period of time, then protocol eventually terminates.
- - `Fairness`: A correct node `p` sends requests for blocks to all peers from its peer set.
-
- As explained above, the `Executor` is responsible for sending and receiving messages that are part of the `fast-sync`
- protocol. The following messages are exchanged as part of `fast-sync` protocol:
-
- ``` go
- type Message int
- const (
- MessageUnknown Message = iota
- MessageStatusRequest
- MessageStatusResponse
- MessageBlockRequest
- MessageBlockResponse
- )
- ```
- `MessageStatusRequest` is sent periodically to all peers as a request for a peer to provide its current height. It is
- part of the `Peer Heartbeat` mechanism and a failure to respond timely to this message results in a peer being removed
- from the peer set. Note that the `Peer Heartbeat` mechanism is used only while a peer is in `fast-sync` mode. We assume
- here existence of a mechanism that gives node a possibility to inform its peers that it is in the `fast-sync` mode.
-
- ``` go
- type MessageStatusRequest struct {
- SeqNum int64 // sequence number of the request
- }
- ```
- `MessageStatusResponse` is sent as a response to `MessageStatusRequest` to inform requester about the peer current
- height.
-
- ``` go
- type MessageStatusResponse struct {
- SeqNum int64 // sequence number of the corresponding request
- Height int64 // current peer height
- }
- ```
-
- `MessageBlockRequest` is used to make a request for a block and the corresponding commit certificate at a given height.
-
- ``` go
- type MessageBlockRequest struct {
- Height int64
- }
- ```
-
- `MessageBlockResponse` is a response for the corresponding block request. In addition to providing the block and the
- corresponding commit certificate, it contains also a current peer height.
-
- ``` go
- type MessageBlockResponse struct {
- Height int64
- Block Block
- Commit Commit
- PeerHeight int64
- }
- ```
-
- In addition to sending and receiving messages, and `HeartBeat` mechanism, controller is also managing timeouts
- that are triggered upon `Controller` request. `Controller` is then informed once a timeout expires.
-
- ``` go
- type TimeoutTrigger int
- const (
- TimeoutUnknown TimeoutTrigger = iota
- TimeoutResponseTrigger
- TimeoutTerminationTrigger
- )
- ```
-
- The `Controller` can be modelled as a function with clearly defined inputs:
-
- * `State` - current state of the node. Contains data about connected peers and its behavior, pending requests,
- * received blocks, etc.
- * `Event` - significant events in the network.
-
- producing clear outputs:
-
- * `State` - updated state of the node,
- * `MessageToSend` - signal what message to send and to which peer
- * `TimeoutTrigger` - signal that timeout should be triggered.
-
-
- We consider the following `Event` types:
-
- ``` go
- type Event int
- const (
- EventUnknown Event = iota
- EventStatusReport
- EventBlockRequest
- EventBlockResponse
- EventRemovePeer
- EventTimeoutResponse
- EventTimeoutTermination
- )
- ```
-
- `EventStatusResponse` event is generated once `MessageStatusResponse` is received by the `Executor`.
-
- ``` go
- type EventStatusReport struct {
- PeerID ID
- Height int64
- }
- ```
-
- `EventBlockRequest` event is generated once `MessageBlockRequest` is received by the `Executor`.
-
- ``` go
- type EventBlockRequest struct {
- Height int64
- PeerID p2p.ID
- }
- ```
- `EventBlockResponse` event is generated upon reception of `MessageBlockResponse` message by the `Executor`.
-
- ``` go
- type EventBlockResponse struct {
- Height int64
- Block Block
- Commit Commit
- PeerID ID
- PeerHeight int64
- }
- ```
- `EventRemovePeer` is generated by `Executor` to signal that the connection to a peer is closed due to peer misbehavior.
-
- ``` go
- type EventRemovePeer struct {
- PeerID ID
- }
- ```
- `EventTimeoutResponse` is generated by `Executor` to signal that a timeout triggered by `TimeoutResponseTrigger` has
- expired.
-
- ``` go
- type EventTimeoutResponse struct {
- PeerID ID
- Height int64
- }
- ```
- `EventTimeoutTermination` is generated by `Executor` to signal that a timeout triggered by `TimeoutTerminationTrigger`
- has expired.
-
- ``` go
- type EventTimeoutTermination struct {
- Height int64
- }
- ```
-
- `MessageToSend` is just a wrapper around `Message` type that contains id of the peer to which message should be sent.
-
- ``` go
- type MessageToSend struct {
- PeerID ID
- Message Message
- }
- ```
-
- The Controller state machine can be in two modes: `ModeFastSync` when
- a node is trying to catch up with the network by downloading committed blocks,
- and `ModeConsensus` in which it executes Tendermint consensus protocol. We
- consider that `fast sync` mode terminates once the Controller switch to
- `ModeConsensus`.
-
- ``` go
- type Mode int
- const (
- ModeUnknown Mode = iota
- ModeFastSync
- ModeConsensus
- )
- ```
- `Controller` is managing the following state:
-
- ``` go
- type ControllerState struct {
- Height int64 // the first block that is not committed
- Mode Mode // mode of operation
- PeerMap map[ID]PeerStats // map of peer IDs to peer statistics
- MaxRequestPending int64 // maximum height of the pending requests
- FailedRequests []int64 // list of failed block requests
- PendingRequestsNum int // total number of pending requests
- Store []BlockInfo // contains list of downloaded blocks
- Executor BlockExecutor // store, verify and executes blocks
- }
- ```
-
- `PeerStats` data structure keeps for every peer its current height and a list of pending requests for blocks.
-
- ``` go
- type PeerStats struct {
- Height int64
- PendingRequest int64 // a request sent to this peer
- }
- ```
-
- `BlockInfo` data structure is used to store information (as part of block store) about downloaded blocks: from what peer
- a block and the corresponding commit certificate are received.
- ``` go
- type BlockInfo struct {
- Block Block
- Commit Commit
- PeerID ID // a peer from which we received the corresponding Block and Commit
- }
- ```
-
- The `Controller` is initialized by providing an initial height (`startHeight`) from which it will start downloading
- blocks from peers and the current state of the `BlockExecutor`.
-
- ``` go
- func NewControllerState(startHeight int64, executor BlockExecutor) ControllerState {
- state = ControllerState {}
- state.Height = startHeight
- state.Mode = ModeFastSync
- state.MaxRequestPending = startHeight - 1
- state.PendingRequestsNum = 0
- state.Executor = executor
- initialize state.PeerMap, state.FailedRequests and state.Store to empty data structures
- return state
- }
- ```
-
- The core protocol logic is given with the following function:
-
- ``` go
- func handleEvent(state ControllerState, event Event) (ControllerState, Message, TimeoutTrigger, Error) {
- msg = nil
- timeout = nil
- error = nil
-
- switch state.Mode {
- case ModeConsensus:
- switch event := event.(type) {
- case EventBlockRequest:
- msg = createBlockResponseMessage(state, event)
- return state, msg, timeout, error
- default:
- error = "Only respond to BlockRequests while in ModeConsensus!"
- return state, msg, timeout, error
- }
-
- case ModeFastSync:
- switch event := event.(type) {
- case EventBlockRequest:
- msg = createBlockResponseMessage(state, event)
- return state, msg, timeout, error
-
- case EventStatusResponse:
- return handleEventStatusResponse(event, state)
-
- case EventRemovePeer:
- return handleEventRemovePeer(event, state)
-
- case EventBlockResponse:
- return handleEventBlockResponse(event, state)
-
- case EventResponseTimeout:
- return handleEventResponseTimeout(event, state)
-
- case EventTerminationTimeout:
- // Termination timeout is triggered in case of empty peer set and in case there are no pending requests.
- // If this timeout expires and in the meantime no new peers are added or new pending requests are made
- // then `fast-sync` mode terminates by switching to `ModeConsensus`.
- // Note that termination timeout should be higher than the response timeout.
- if state.Height == event.Height && state.PendingRequestsNum == 0 { state.State = ConsensusMode }
- return state, msg, timeout, error
-
- default:
- error = "Received unknown event type!"
- return state, msg, timeout, error
- }
- }
- }
- ```
-
- ``` go
- func createBlockResponseMessage(state ControllerState, event BlockRequest) MessageToSend {
- msgToSend = nil
- if _, ok := state.PeerMap[event.PeerID]; !ok { peerStats = PeerStats{-1, -1} }
- if state.Executor.ContainsBlockWithHeight(event.Height) && event.Height > peerStats.Height {
- peerStats = event.Height
- msg = BlockResponseMessage{
- Height: event.Height,
- Block: state.Executor.getBlock(eventHeight),
- Commit: state.Executor.getCommit(eventHeight),
- PeerID: event.PeerID,
- CurrentHeight: state.Height - 1,
- }
- msgToSend = MessageToSend { event.PeerID, msg }
- }
- state.PeerMap[event.PeerID] = peerStats
- return msgToSend
- }
- ```
-
- ``` go
- func handleEventStatusResponse(event EventStatusResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
- if _, ok := state.PeerMap[event.PeerID]; !ok {
- peerStats = PeerStats{ -1, -1 }
- } else {
- peerStats = state.PeerMap[event.PeerID]
- }
-
- if event.Height > peerStats.Height { peerStats.Height = event.Height }
- // if there are no pending requests for this peer, try to send him a request for block
- if peerStats.PendingRequest == -1 {
- msg = createBlockRequestMessages(state, event.PeerID, peerStats.Height)
- // msg is nil if no request for block can be made to a peer at this point in time
- if msg != nil {
- peerStats.PendingRequests = msg.Height
- state.PendingRequestsNum++
- // when a request for a block is sent to a peer, a response timeout is triggered. If no corresponding block is sent by the peer
- // during response timeout period, then the peer is considered faulty and is removed from the peer set.
- timeout = ResponseTimeoutTrigger{ msg.PeerID, msg.Height, PeerTimeout }
- } else if state.PendingRequestsNum == 0 {
- // if there are no pending requests and no new request can be placed to the peer, termination timeout is triggered.
- // If termination timeout expires and we are still at the same height and there are no pending requests, the "fast-sync"
- // mode is finished and we switch to `ModeConsensus`.
- timeout = TerminationTimeoutTrigger{ state.Height, TerminationTimeout }
- }
- }
- state.PeerMap[event.PeerID] = peerStats
- return state, msg, timeout, error
- }
- ```
-
- ``` go
- func handleEventRemovePeer(event EventRemovePeer, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
- if _, ok := state.PeerMap[event.PeerID]; ok {
- pendingRequest = state.PeerMap[event.PeerID].PendingRequest
- // if a peer is removed from the peer set, its pending request is declared failed and added to the `FailedRequests` list
- // so it can be retried.
- if pendingRequest != -1 {
- add(state.FailedRequests, pendingRequest)
- }
- state.PendingRequestsNum--
- delete(state.PeerMap, event.PeerID)
- // if the peer set is empty after removal of this peer then termination timeout is triggered.
- if state.PeerMap.isEmpty() {
- timeout = TerminationTimeoutTrigger{ state.Height, TerminationTimeout }
- }
- } else { error = "Removing unknown peer!" }
- return state, msg, timeout, error
- ```
-
- ``` go
- func handleEventBlockResponse(event EventBlockResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error)
- if state.PeerMap[event.PeerID] {
- peerStats = state.PeerMap[event.PeerID]
- // when expected block arrives from a peer, it is added to the store so it can be verified and if correct executed after.
- if peerStats.PendingRequest == event.Height {
- peerStats.PendingRequest = -1
- state.PendingRequestsNum--
- if event.PeerHeight > peerStats.Height { peerStats.Height = event.PeerHeight }
- state.Store[event.Height] = BlockInfo{ event.Block, event.Commit, event.PeerID }
- // blocks are verified sequentially so adding a block to the store does not mean that it will be immediately verified
- // as some of the previous blocks might be missing.
- state = verifyBlocks(state) // it can lead to event.PeerID being removed from peer list
- if _, ok := state.PeerMap[event.PeerID]; ok {
- // we try to identify new request for a block that can be asked to the peer
- msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
- if msg != nil {
- peerStats.PendingRequests = msg.Height
- state.PendingRequestsNum++
- // if request for block is made, response timeout is triggered
- timeout = ResponseTimeoutTrigger{ msg.PeerID, msg.Height, PeerTimeout }
- } else if state.PeerMap.isEmpty() || state.PendingRequestsNum == 0 {
- // if the peer map is empty (the peer can be removed as block verification failed) or there are no pending requests
- // termination timeout is triggered.
- timeout = TerminationTimeoutTrigger{ state.Height, TerminationTimeout }
- }
- }
- } else { error = "Received Block from wrong peer!" }
- } else { error = "Received Block from unknown peer!" }
-
- state.PeerMap[event.PeerID] = peerStats
- return state, msg, timeout, error
- }
- ```
-
- ``` go
- func handleEventResponseTimeout(event, state) {
- if _, ok := state.PeerMap[event.PeerID]; ok {
- peerStats = state.PeerMap[event.PeerID]
- // if a response timeout expires and the peer hasn't delivered the block, the peer is removed from the peer list and
- // the request is added to the `FailedRequests` so the block can be downloaded from other peer
- if peerStats.PendingRequest == event.Height {
- add(state.FailedRequests, pendingRequest)
- delete(state.PeerMap, event.PeerID)
- state.PendingRequestsNum--
- // if peer set is empty, then termination timeout is triggered
- if state.PeerMap.isEmpty() {
- timeout = TimeoutTrigger{ state.Height, TerminationTimeout }
- }
- }
- }
- return state, msg, timeout, error
- }
- ```
-
- ``` go
- func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) MessageToSend {
- msg = nil
- blockHeight = -1
- r = find request in state.FailedRequests such that r <= peerHeight // returns `nil` if there are no such request
- // if there is a height in failed requests that can be downloaded from the peer send request to it
- if r != nil {
- blockNumber = r
- delete(state.FailedRequests, r)
- } else if state.MaxRequestPending < peerHeight {
- // if height of the maximum pending request is smaller than peer height, then ask peer for next block
- state.MaxRequestPending++
- blockHeight = state.MaxRequestPending // increment state.MaxRequestPending and then return the new value
- }
-
- if blockHeight > -1 { msg = MessageToSend { peerID, MessageBlockRequest { blockHeight } }
- return msg
- }
- ```
-
- ``` go
- func verifyBlocks(state State) State {
- done = false
- for !done {
- block = state.Store[height]
- if block != nil {
- verified = verify block.Block using block.Commit // return `true` is verification succeed, 'false` otherwise
-
- if verified {
- block.Execute() // executing block is costly operation so it might make sense executing asynchronously
- state.Height++
- } else {
- // if block verification failed, then it is added to `FailedRequests` and the peer is removed from the peer set
- add(state.FailedRequests, height)
- state.Store[height] = nil
- if _, ok := state.PeerMap[block.PeerID]; ok {
- pendingRequest = state.PeerMap[block.PeerID].PendingRequest
- // if there is a pending request sent to the peer that is just to be removed from the peer set, add it to `FailedRequests`
- if pendingRequest != -1 {
- add(state.FailedRequests, pendingRequest)
- state.PendingRequestsNum--
- }
- delete(state.PeerMap, event.PeerID)
- }
- done = true
- }
- } else { done = true }
- }
- return state
- }
- ```
-
- In the proposed architecture `Controller` is not active task, i.e., it is being called by the `Executor`. Depending on
- the return values returned by `Controller`,`Executor` will send a message to some peer (`msg` != nil), trigger a
- timeout (`timeout` != nil) or deal with errors (`error` != nil).
- In case a timeout is triggered, it will provide as an input to `Controller` the corresponding timeout event once
- timeout expires.
-
-
- ## Status
-
- Draft.
-
- ## Consequences
-
- ### Positive
-
- - isolated implementation of the algorithm
- - improved testability - simpler to prove correctness
- - clearer separation of concerns - easier to reason
-
- ### Negative
-
- ### Neutral
|