# Blockchain Reactor v1

### Data Structures

The data structures used are illustrated below.

![Data Structures](img/bc-reactor-new-datastructs.png)

#### BlockchainReactor

- is a `p2p.BaseReactor`.
- has a `store.BlockStore` for persistence.
- executes blocks using an `sm.BlockExecutor`.
- starts the FSM and the `poolRoutine()`.
- relays the fast-sync responses and switch messages to the FSM.
- handles errors from the FSM and, when necessary, reports them to the switch.
- implements the blockchain reactor interface used by the FSM to send requests, report errors to the switch and reset state timers.
- registers all the concrete types and interfaces for serialisation.

```go
type BlockchainReactor struct {
	p2p.BaseReactor

	initialState sm.State // immutable
	state        sm.State

	blockExec *sm.BlockExecutor
	store     *store.BlockStore

	fastSync bool

	fsm          *BcReactorFSM
	blocksSynced int

	// Receive goroutine forwards messages to this channel to be processed in the context of the poolRoutine.
	messagesForFSMCh chan bcReactorMessage

	// Switch goroutine may send RemovePeer to the blockchain reactor. This is an error message that is relayed
	// to this channel to be processed in the context of the poolRoutine.
	errorsForFSMCh chan bcReactorMessage

	// This channel is used by the FSM and indirectly the block pool to report errors to the blockchain reactor and
	// the switch.
	eventsFromFSMCh chan bcFsmMessage
}
```

#### BcReactorFSM

- implements a simple finite state machine.
- has a state and a state timer.
- has a `BlockPool` to keep track of block requests sent to peers and blocks received from peers.
- uses an interface to send status requests and block requests, and to report errors. The interface is implemented by the `BlockchainReactor` and by tests.

```go
type BcReactorFSM struct {
	logger log.Logger
	mtx    sync.Mutex

	startTime time.Time

	state      *bcReactorFSMState
	stateTimer *time.Timer
	pool       *BlockPool

	// interface used to call the Blockchain reactor to send StatusRequest, BlockRequest, reporting errors, etc.
	toBcR bcReactor
}
```

#### BlockPool

- maintains a peer set, implemented as a map of peer ID to `BpPeer`.
- maintains a set of requests made to peers, implemented as a map of block request heights to peer IDs.
- maintains a list of future block requests needed to advance the fast-sync. This is a list of block heights.
- keeps track of the maximum height of the peers in the set.
- uses an interface to send requests and report errors to the reactor (via FSM).

```go
type BlockPool struct {
	logger log.Logger

	// Set of peers that have sent status responses, with height bigger than pool.Height
	peers map[p2p.ID]*BpPeer

	// Set of block heights and the corresponding peers from where a block response is expected or has been received.
	blocks map[int64]p2p.ID

	plannedRequests   map[int64]struct{} // heights of blocks still to be assigned a peer for a blockRequest
	nextRequestHeight int64              // next height to be added to plannedRequests

	Height        int64 // height of next block to execute
	MaxPeerHeight int64 // maximum height of all peers

	toBcR bcReactor
}
```

Some reasons for the `BlockPool` data structure content:

1. If a peer is removed by the switch, fast access is required to the peer and to the block requests made to that peer, in order to redo them.
2. When block verification fails, fast access is required from the block height to the peer and to the block requests made to that peer, in order to redo them.
3. The `BlockchainReactor` main routine decides when the block pool is running low and asks the `BlockPool` (via FSM) to make more requests. The `BlockPool` creates a list of requests and triggers the sending of the block requests (via the interface). It maintains a list of requests because of the redo operations that may occur during error handling. These requests are redone when the `BlockchainReactor` requires more blocks.
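
The first two reasons can be illustrated with a minimal sketch. It is not the reactor's code: `sketchPool`, `peerID`, `assign`, `removePeer` and `peerAt` are hypothetical names, and the per-peer state is reduced to a set of requested heights. It only shows why keeping both a peer map and a height-to-peer map makes the redo operations cheap.

```go
package main

import "sort"

// peerID stands in for p2p.ID (assumption made for this sketch).
type peerID string

// sketchPool is a hypothetical, stripped-down BlockPool.
type sketchPool struct {
	peers           map[peerID]map[int64]struct{} // peer -> heights requested from it
	blocks          map[int64]peerID              // height -> peer expected to deliver it
	plannedRequests map[int64]struct{}            // heights waiting for a peer assignment
}

func newSketchPool() *sketchPool {
	return &sketchPool{
		peers:           make(map[peerID]map[int64]struct{}),
		blocks:          make(map[int64]peerID),
		plannedRequests: make(map[int64]struct{}),
	}
}

// assign records that the block at height was requested from id.
func (p *sketchPool) assign(id peerID, height int64) {
	if p.peers[id] == nil {
		p.peers[id] = make(map[int64]struct{})
	}
	p.peers[id][height] = struct{}{}
	p.blocks[height] = id
	delete(p.plannedRequests, height)
}

// removePeer drops the peer and moves every height requested from it back
// to plannedRequests so it can be redone. It returns those heights, sorted.
func (p *sketchPool) removePeer(id peerID) []int64 {
	redo := make([]int64, 0, len(p.peers[id]))
	for h := range p.peers[id] {
		delete(p.blocks, h)
		p.plannedRequests[h] = struct{}{}
		redo = append(redo, h)
	}
	delete(p.peers, id)
	sort.Slice(redo, func(i, j int) bool { return redo[i] < redo[j] })
	return redo
}

// peerAt returns the peer responsible for a height, which is the lookup
// needed when verification of the block at that height fails.
func (p *sketchPool) peerAt(height int64) (peerID, bool) {
	id, ok := p.blocks[height]
	return id, ok
}
```

With heights 10 and 12 assigned to one peer and 11 to another, removing the first peer returns heights 10 and 12 to the planned set and leaves height 11 untouched.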

#### BpPeer

- keeps track of a single peer, with height bigger than the initial height.
- maintains the block requests made to the peer and the blocks received from the peer until they are executed.
- monitors the peer speed when there are pending requests.
- has an active timer when pending requests are present and reports an error on timeout.

```go
type BpPeer struct {
	logger log.Logger
	ID     p2p.ID

	Height                  int64                  // the peer reported height
	NumPendingBlockRequests int                    // number of requests still waiting for block responses
	blocks                  map[int64]*types.Block // blocks received or expected to be received from this peer
	blockResponseTimer      *time.Timer
	recvMonitor             *flow.Monitor
	params                  *BpPeerParams // parameters for timer and monitor

	onErr func(err error, peerID p2p.ID) // function to call on error
}
```

### Concurrency Model

The diagram below shows the goroutines (depicted by the gray blocks), timers (shown on the left with their values) and channels (colored rectangles). The FSM box shows some of the FSM's functionality; it is not a separate goroutine.

The interface used by the FSM is shown in light red with the `IF` block. This is used to:

- send block requests
- report peer errors to the switch - this results in the reactor calling `switch.StopPeerForError()` and, if triggered by the peer timeout routine, a `removePeerEv` is sent to the FSM and action is taken from the context of the `poolRoutine()`
- ask the reactor to reset the state timers. The timers are owned by the FSM while the timeout routine is defined by the reactor. This was done in order to avoid running timers in tests and will change in the next revision.

There are two main goroutines implemented by the blockchain reactor. All I/O operations are performed from the `poolRoutine()` context, while the CPU-intensive operations related to block execution are performed from the context of the `executeBlocksRoutine()`. All goroutines are detailed in the next sections.

![Go Routines Diagram](img/bc-reactor-new-goroutines.png)

#### Receive()

Fast-sync messages from peers are received by this goroutine. It performs basic validation and:

- in helper mode (i.e. for request messages) it replies immediately. This differs from the proposal in adr-040, which specifies that the FSM handles these.
- forwards response messages to the `poolRoutine()`.
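
The split between the two cases can be sketched as follows. This is a simplified illustration, not the reactor's `Receive()`: the `bcMsg` type, the `msgKind` constants and the `reply` callback are hypothetical stand-ins for the real wire messages and peer handle, and validation is reduced to a height check.

```go
package main

// msgKind and bcMsg are hypothetical stand-ins for the real message types.
type msgKind int

const (
	statusRequest msgKind = iota
	blockRequest
	statusResponse
	blockResponse
	noBlockResponse
)

type bcMsg struct {
	kind   msgKind
	height int64
}

// receive answers request messages immediately (helper mode) and forwards
// response messages to the channel read by the poolRoutine.
func receive(msg bcMsg, storeHeight int64, reply func(bcMsg), messagesForFSMCh chan<- bcMsg) {
	switch msg.kind {
	case statusRequest:
		reply(bcMsg{kind: statusResponse, height: storeHeight})
	case blockRequest:
		if msg.height >= 1 && msg.height <= storeHeight {
			reply(bcMsg{kind: blockResponse, height: msg.height})
		} else {
			reply(bcMsg{kind: noBlockResponse, height: msg.height})
		}
	case statusResponse, blockResponse:
		messagesForFSMCh <- msg
	}
}
```

Only the last case crosses into the `poolRoutine()` context, so request handling never waits on the FSM.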

#### poolRoutine()

(name kept as in the previous reactor).

It starts the `executeBlocksRoutine()` and the FSM. It then waits in a loop for events. These are received from the following channels:

- `sendBlockRequestTicker.C` - every 10 ms the reactor asks the FSM to make more block requests, up to a maximum. Note: currently this value is constant but could be changed based on low/high watermark thresholds for the number of blocks received and waiting to be processed, the number of blockResponse messages waiting in `messagesForFSMCh`, etc.
- `statusUpdateTicker.C` - every 10 seconds the reactor broadcasts status requests to peers. While adr-040 specifies this to run within the FSM, at this point this functionality is kept in the reactor.
- `messagesForFSMCh` - the `Receive()` goroutine sends status and block response messages to this channel and the reactor calls the FSM to handle them.
- `errorsForFSMCh` - this channel receives the following events:
  - peer remove - when the switch removes a peer
  - state timeout event - when FSM state timers trigger

  The reactor forwards these messages to the FSM.
- `eventsFromFSMCh` - there are two types of events sent over this channel:
  - `syncFinishedEv` - triggered when the FSM enters the `finished` state and calls the `switchToConsensus()` interface function.
  - `peerErrorEv` - the peer timer expiry goroutine sends this event over the channel for processing from the `poolRoutine()` context.

#### executeBlocksRoutine()

Started by the `poolRoutine()`, it retrieves blocks from the pool and executes them:

- `processReceivedBlockTicker.C` - a ticker event is received over the channel every 10 ms and its handling results in a signal being sent to the `doProcessBlockCh` channel.
- `doProcessBlockCh` - events are received on this channel as described above; when they are processed, blocks are retrieved from the pool and executed.
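
One common way to implement such a signal is a non-blocking send on a channel with a buffer of one, so that ticks arriving while a block is still being executed are coalesced instead of queued. The sketch below assumes that approach; `signalProcessBlock` is a hypothetical helper and the buffer size is an assumption, not something this document specifies.

```go
package main

// signalProcessBlock performs a non-blocking send on doProcessBlockCh.
// It reports whether a new signal was queued; false means one was already
// pending and the tick was coalesced with it.
func signalProcessBlock(doProcessBlockCh chan struct{}) bool {
	select {
	case doProcessBlockCh <- struct{}{}:
		return true
	default:
		return false
	}
}
```

With `doProcessBlockCh := make(chan struct{}, 1)`, two calls in a row queue a single signal; once the routine has received it, the next call queues a new one.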

### FSM

![fsm](img/bc-reactor-new-fsm.png)

#### States

##### init (aka unknown)

The FSM is created in `unknown` state. When started by the reactor (`startFSMEv`), it broadcasts Status requests and transitions to `waitForPeer` state.

##### waitForPeer

In this state, the FSM waits for a Status response from a "tall" peer. A timer is running in this state to allow the FSM to finish if there are no useful peers.

If the timer expires, it moves to `finished` state and calls the reactor to switch to consensus.
If a Status response is received from a peer within the timeout, the FSM transitions to `waitForBlock` state.

##### waitForBlock

In this state the FSM makes Block requests (triggered by a ticker in the reactor) and waits for Block responses. There is a timer running in this state to detect if a peer is not sending the block at the current processing height. If the timer expires, the FSM removes the peer to which the request was sent, and all requests made to that peer are redone.

As blocks are received they are stored by the pool. Block execution is performed independently by the reactor and the result is reported to the FSM:

- if there are no errors, the FSM increases the pool height and resets the state timer.
- if there are errors, the peers that delivered the two blocks (at height and height+1) are removed and the requests redone.

In this state the FSM may receive peer remove events in any of the following scenarios:

- the switch is removing a peer
- a peer is penalized because it has not responded to some block requests for a long time
- a peer is penalized for being slow

When the last block (the one with height equal to the highest peer height minus one) is processed successfully, the FSM transitions to `finished` state.
If, after a peer update or removal, the pool height is the same as `maxPeerHeight`, the FSM transitions to `finished` state.

##### finished

When entering this state, the FSM calls the reactor to switch to consensus and performs cleanup.
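
The transitions described above can be condensed into a small sketch. It is a simplification under assumptions and not the actual `BcReactorFSM`: `sketchFSM` and `handle` are hypothetical, timers and the block pool are left out, status updates received in `waitForBlock` are ignored, and for `peerRemoveEv` the `peerHeight` argument is taken to be the maximum height of the remaining peers. The termination check uses `>=` rather than strict equality as a defensive choice.

```go
package main

type fsmState int

const (
	unknown fsmState = iota
	waitForPeer
	waitForBlock
	finished
)

type fsmEvent int

const (
	startFSMEv fsmEvent = iota + 1
	statusResponseEv
	processedBlockEv
	peerRemoveEv
	stateTimeoutEv
)

// sketchFSM is a hypothetical, reduced FSM.
type sketchFSM struct {
	state         fsmState
	height        int64 // height of the next block to execute
	maxPeerHeight int64
}

// handle applies one event and returns the resulting state.
func (f *sketchFSM) handle(ev fsmEvent, peerHeight int64) fsmState {
	switch f.state {
	case unknown:
		if ev == startFSMEv {
			f.state = waitForPeer
		}
	case waitForPeer:
		switch ev {
		case statusResponseEv:
			if peerHeight > f.height { // only a "tall" peer is useful
				f.maxPeerHeight = peerHeight
				f.state = waitForBlock
			}
		case stateTimeoutEv:
			f.state = finished // no useful peers, switch to consensus
		}
	case waitForBlock:
		switch ev {
		case processedBlockEv:
			f.height++
		case peerRemoveEv:
			f.maxPeerHeight = peerHeight
		}
		if f.height >= f.maxPeerHeight {
			f.state = finished
		}
	}
	return f.state
}
```

Starting at height 1 with a peer at height 3, two `processedBlockEv` events bring the pool height to 3 and the sketch reaches `finished`, matching the "highest peer height minus one" rule.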

#### Events

The following events are handled by the FSM:

```go
const (
	// message type events; values start at 1
	startFSMEv = iota + 1
	statusResponseEv
	blockResponseEv
	processedBlockEv
	makeRequestsEv
	stopFSMEv

	// other events; iota keeps counting within the block, so peerRemoveEv is 6 + 256 = 262
	peerRemoveEv = iota + 256
	stateTimeoutEv
)
```

### Examples of Scenarios and Termination Handling

A few scenarios are covered in this section together with the current/proposed handling.
In general, the scenarios involving faulty peers are made worse by the fact that they may quickly be re-added.

#### 1. No Tall Peers

S: In this scenario a node is started and, while status responses are received, none of the peers are at a height higher than this node.

H: The FSM times out in `waitForPeer` state and moves to `finished` state, where it calls the reactor to switch to consensus.

#### 2. Typical Fast Sync

S: A node fast syncs blocks from honest peers and eventually downloads and executes the penultimate block.

H: The FSM in `waitForBlock` state will receive the `processedBlockEv` from the reactor and detect that the termination height is achieved.

#### 3. Peer Claims Big Height but no Blocks

S: In this scenario a faulty peer claims a big height (for which there are no blocks).

H: The requests for the non-existing block will time out, the peer is removed and the pool's `MaxPeerHeight` is updated. The FSM checks if the termination height is achieved when peers are removed.

#### 4. Highest Peer Removed or Updated to Short

S: The fast sync node is caught up with all peers except one tall peer. The tall peer is removed or it sends a status response with a low height.

H: The FSM checks the termination condition on peer removal and updates.
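
Scenarios 3 and 4 both come down to recomputing the maximum peer height after a removal and re-checking the termination condition. A minimal sketch, assuming a hypothetical `heightPool` type that only tracks peer heights:

```go
package main

// heightPool is a hypothetical reduction of the block pool to the fields
// needed for the termination check.
type heightPool struct {
	height        int64            // height of the next block to execute
	peers         map[string]int64 // peer ID -> reported height
	maxPeerHeight int64
}

// updateMaxPeerHeight recomputes the maximum over the current peer set.
func (p *heightPool) updateMaxPeerHeight() {
	var highest int64
	for _, h := range p.peers {
		if h > highest {
			highest = h
		}
	}
	p.maxPeerHeight = highest
}

// reachedMaxHeight is the termination condition.
func (p *heightPool) reachedMaxHeight() bool {
	return p.height >= p.maxPeerHeight
}

// removePeer deletes the peer, refreshes the maximum and reports whether
// fast sync can now terminate.
func (p *heightPool) removePeer(id string) bool {
	delete(p.peers, id)
	p.updateMaxPeerHeight()
	return p.reachedMaxHeight()
}
```

For a node at height 5 with peers at heights 5 and 100, removing the peer at 100 drops the maximum to 5 and the check succeeds.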

#### 5. Block At Current Height Delayed

S: A peer can block the progress of fast sync by indefinitely delaying the block response for the current processing height (h1).

H: Currently, given h1 < h2, there is no enforcement at peer level that the response for h1 should be received before h2. So a peer will time out only after delivering all blocks except h1. However, the `waitForBlock` state timer fires if the block for the current processing height is not received within a timeout. The peer is removed and the requests to that peer (including the one for the current height) are redone.