statesync: remove deadlock on init fail (#7029)

When statesync is stopped during shutdown, it has the possibility of deadlocking. A dump of goroutines reveals that this is related to the peerUpdates channel not returning anything on its `Done()` channel when `OnStop` is called. As this is occurring, `processPeerUpdate` is attempting to acquire the reactor lock. It appears that this lock can never be acquired. I looked for the places where the lock may remain locked accidentally and cleaned them up in hopes of eradicating the issue. Dumps of the relevant goroutines may be found below. Note that the line numbers below are relative to the code in the `v0.35.0-rc1` tag.

```
goroutine 36 [chan receive]:
github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStop(0xc00058f200)
	github.com/tendermint/tendermint/internal/statesync/reactor.go:243 +0x117
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00058f200, 0x0, 0x0)
	github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc0001ea240)
	github.com/tendermint/tendermint/node/node.go:769 +0x132
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0001ea240, 0x0, 0x0)
	github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
	github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000629500, 0x7fdb52f96358, 0xc0002b5030, 0xc00000daa0)
	github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
	github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6

goroutine 188 [semacquire]:
sync.runtime_SemacquireMutex(0xc00026b1cc, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00026b1c8)
	sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00026b1c8)
	sync/rwmutex.go:111 +0x90
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdate(0xc00026b080, 0xc000650008, 0x28, 0x124de90, 0x4)
	github.com/tendermint/tendermint/internal/statesync/reactor.go:849 +0x1a5
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdates(0xc00026b080)
	github.com/tendermint/tendermint/internal/statesync/reactor.go:883 +0xab
created by github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStart
	github.com/tendermint/tendermint/internal/statesync/reactor.go:219 +0xcd
```
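For readers unfamiliar with the failure mode, the cycle described above can be reduced to a small pattern: the stop path waits on a `Done()`-style channel that can only be closed by the peer-update goroutine, while that goroutine is itself blocked trying to acquire a lock the stop path never releases. The sketch below is a minimal, hypothetical illustration of that cycle (the `miniReactor` type and its fields are invented for this example and are not the reactor's actual API); the one-second timeout exists only so the example terminates and reports the deadlock instead of hanging.

```
package main

import (
	"fmt"
	"sync"
	"time"
)

// miniReactor is a hypothetical, stripped-down stand-in for the reactor; it
// models only the shape of the bug described above, not the statesync code.
type miniReactor struct {
	mtx     sync.Mutex
	updates chan struct{}
	done    chan struct{}
}

func (r *miniReactor) processPeerUpdates() {
	defer close(r.done) // the stop path waits on this
	for range r.updates {
		r.mtx.Lock() // blocks forever if the stop path still holds the lock
		r.mtx.Unlock()
	}
}

func (r *miniReactor) OnStop() {
	// Simulate a shutdown path that holds the reactor lock while a peer
	// update is still being processed.
	r.mtx.Lock()
	r.updates <- struct{}{} // a peer update arrives during shutdown

	select {
	case <-r.done:
		fmt.Println("clean shutdown")
	case <-time.After(time.Second):
		// processPeerUpdates is stuck on r.mtx, so done is never closed.
		fmt.Println("deadlock: peer update loop never drained")
	}

	r.mtx.Unlock()
	close(r.updates)
}

func main() {
	r := &miniReactor{updates: make(chan struct{}), done: make(chan struct{})}
	go r.processPeerUpdates()
	r.OnStop()
}
```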
package statesync

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"reflect"
	"runtime/debug"
	"sort"
	"time"

	abci "github.com/tendermint/tendermint/abci/types"
	"github.com/tendermint/tendermint/config"
	tmsync "github.com/tendermint/tendermint/internal/libs/sync"
	"github.com/tendermint/tendermint/internal/p2p"
	"github.com/tendermint/tendermint/internal/proxy"
	sm "github.com/tendermint/tendermint/internal/state"
	"github.com/tendermint/tendermint/internal/store"
	"github.com/tendermint/tendermint/libs/log"
	"github.com/tendermint/tendermint/libs/service"
	"github.com/tendermint/tendermint/light"
	"github.com/tendermint/tendermint/light/provider"
	ssproto "github.com/tendermint/tendermint/proto/tendermint/statesync"
	"github.com/tendermint/tendermint/types"
)
var (
	_ service.Service = (*Reactor)(nil)
	_ p2p.Wrapper     = (*ssproto.Message)(nil)
)

const (
	// SnapshotChannel exchanges snapshot metadata
	SnapshotChannel = p2p.ChannelID(0x60)

	// ChunkChannel exchanges chunk contents
	ChunkChannel = p2p.ChannelID(0x61)

	// LightBlockChannel exchanges light blocks
	LightBlockChannel = p2p.ChannelID(0x62)

	// ParamsChannel exchanges consensus params
	ParamsChannel = p2p.ChannelID(0x63)

	// recentSnapshots is the number of recent snapshots to send and receive per peer.
	recentSnapshots = 10

	// snapshotMsgSize is the maximum size of a snapshotResponseMessage
	snapshotMsgSize = int(4e6) // ~4MB

	// chunkMsgSize is the maximum size of a chunkResponseMessage
	chunkMsgSize = int(16e6) // ~16MB

	// lightBlockMsgSize is the maximum size of a lightBlockResponseMessage
	lightBlockMsgSize = int(1e7) // ~10MB

	// paramMsgSize is the maximum size of a paramsResponseMessage
	paramMsgSize = int(1e5) // ~100kB

	// lightBlockResponseTimeout is how long the dispatcher waits for a peer to
	// return a light block
	lightBlockResponseTimeout = 10 * time.Second

	// consensusParamsResponseTimeout is the time the p2p state provider waits
	// before performing a secondary call
	consensusParamsResponseTimeout = 5 * time.Second

	// maxLightBlockRequestRetries is the number of retries acceptable before
	// the backfill process aborts
	maxLightBlockRequestRetries = 20
)
// GetChannelDescriptors returns the descriptors for the four p2p channels used
// by the state sync reactor.
func GetChannelDescriptors() []*p2p.ChannelDescriptor {
	return []*p2p.ChannelDescriptor{
		{
			ID:                  SnapshotChannel,
			MessageType:         new(ssproto.Message),
			Priority:            6,
			SendQueueCapacity:   10,
			RecvMessageCapacity: snapshotMsgSize,
			RecvBufferCapacity:  128,
		},
		{
			ID:                  ChunkChannel,
			Priority:            3,
			MessageType:         new(ssproto.Message),
			SendQueueCapacity:   4,
			RecvMessageCapacity: chunkMsgSize,
			RecvBufferCapacity:  128,
		},
		{
			ID:                  LightBlockChannel,
			MessageType:         new(ssproto.Message),
			Priority:            5,
			SendQueueCapacity:   10,
			RecvMessageCapacity: lightBlockMsgSize,
			RecvBufferCapacity:  128,
		},
		{
			ID:                  ParamsChannel,
			MessageType:         new(ssproto.Message),
			Priority:            2,
			SendQueueCapacity:   10,
			RecvMessageCapacity: paramMsgSize,
			RecvBufferCapacity:  128,
		},
	}
}
// Metricer defines an interface used for the rpc sync info query; see
// statesync.metrics for the details.
type Metricer interface {
	TotalSnapshots() int64
	ChunkProcessAvgTime() time.Duration
	SnapshotHeight() int64
	SnapshotChunksCount() int64
	SnapshotChunksTotal() int64
	BackFilledBlocks() int64
	BackFillBlocksTotal() int64
}
// Reactor handles state sync, both restoring snapshots for the local node and
// serving snapshots for other nodes.
type Reactor struct {
	service.BaseService

	chainID       string
	initialHeight int64
	cfg           config.StateSyncConfig
	stateStore    sm.Store
	blockStore    *store.BlockStore

	conn        proxy.AppConnSnapshot
	connQuery   proxy.AppConnQuery
	tempDir     string
	snapshotCh  *p2p.Channel
	chunkCh     *p2p.Channel
	blockCh     *p2p.Channel
	paramsCh    *p2p.Channel
	peerUpdates *p2p.PeerUpdates
	closeCh     chan struct{}

	// Dispatcher is used to multiplex light block requests and responses over multiple
	// peers used by the p2p state provider and in reverse sync.
	dispatcher *Dispatcher
	peers      *peerList

	// These will only be set when a state sync is in progress. They are used to feed
	// received snapshots and chunks into the syncer and manage incoming and outgoing
	// providers.
	mtx           tmsync.RWMutex
	syncer        *syncer
	providers     map[types.NodeID]*BlockProvider
	stateProvider StateProvider

	metrics            *Metrics
	backfillBlockTotal int64
	backfilledBlocks   int64
}
// NewReactor returns a reference to a new state sync reactor, which implements
// the service.Service interface. It accepts a logger, connections for snapshots
// and querying, references to p2p Channels and a channel to listen for peer
// updates on. Note, the reactor will close all p2p Channels when stopping.
func NewReactor(
	chainID string,
	initialHeight int64,
	cfg config.StateSyncConfig,
	logger log.Logger,
	conn proxy.AppConnSnapshot,
	connQuery proxy.AppConnQuery,
	snapshotCh, chunkCh, blockCh, paramsCh *p2p.Channel,
	peerUpdates *p2p.PeerUpdates,
	stateStore sm.Store,
	blockStore *store.BlockStore,
	tempDir string,
	ssMetrics *Metrics,
) *Reactor {
	r := &Reactor{
		chainID:       chainID,
		initialHeight: initialHeight,
		cfg:           cfg,
		conn:          conn,
		connQuery:     connQuery,
		snapshotCh:    snapshotCh,
		chunkCh:       chunkCh,
		blockCh:       blockCh,
		paramsCh:      paramsCh,
		peerUpdates:   peerUpdates,
		closeCh:       make(chan struct{}),
		tempDir:       tempDir,
		stateStore:    stateStore,
		blockStore:    blockStore,
		peers:         newPeerList(),
		dispatcher:    NewDispatcher(blockCh.Out),
		providers:     make(map[types.NodeID]*BlockProvider),
		metrics:       ssMetrics,
	}

	r.BaseService = *service.NewBaseService(logger, "StateSync", r)
	return r
}
// OnStart starts separate go routines for each p2p Channel and listens for
// envelopes on each. In addition, it also listens for peer updates and handles
// messages on that p2p channel accordingly. Note, we do not launch a go-routine to
// handle individual envelopes so as not to have to deal with bounding workers or pools.
// The caller must be sure to execute OnStop to ensure the outbound p2p Channels are
// closed. No error is returned.
func (r *Reactor) OnStart() error {
	go r.processSnapshotCh()
	go r.processChunkCh()
	go r.processBlockCh()
	go r.processParamsCh()
	go r.processPeerUpdates()

	return nil
}
// OnStop stops the reactor by signaling to all spawned goroutines to exit and
// blocking until they all exit.
func (r *Reactor) OnStop() {
	// tell the dispatcher to stop sending any more requests
	r.dispatcher.Close()

	// wait for any remaining requests to complete
	<-r.dispatcher.Done()

	// Close closeCh to signal to all spawned goroutines to gracefully exit. All
	// p2p Channels should execute Close().
	close(r.closeCh)

	// Wait for all p2p Channels to be closed before returning. This ensures we
	// can easily reason about synchronization of all p2p Channels and ensure no
	// panics will occur.
	<-r.peerUpdates.Done()
	<-r.snapshotCh.Done()
	<-r.chunkCh.Done()
	<-r.blockCh.Done()
	<-r.paramsCh.Done()
}
// Sync runs a state sync, fetching snapshots and providing chunks to the
// application. At the close of the operation, Sync will bootstrap the state
// store and persist the commit at that height so that either consensus or
// blocksync can commence. It will then proceed to backfill the necessary number
// of historical blocks before participating in consensus.
func (r *Reactor) Sync(ctx context.Context) (sm.State, error) {
	// We need at least two peers (for cross-referencing of light blocks) before we can
	// begin state sync
	if err := r.waitForEnoughPeers(ctx, 2); err != nil {
		return sm.State{}, err
	}

	r.mtx.Lock()
	if r.syncer != nil {
		r.mtx.Unlock()
		return sm.State{}, errors.New("a state sync is already in progress")
	}

	if err := r.initStateProvider(ctx, r.chainID, r.initialHeight); err != nil {
		r.mtx.Unlock()
		return sm.State{}, err
	}

	r.syncer = newSyncer(
		r.cfg,
		r.Logger,
		r.conn,
		r.connQuery,
		r.stateProvider,
		r.snapshotCh.Out,
		r.chunkCh.Out,
		r.snapshotCh.Done(),
		r.tempDir,
		r.metrics,
	)
	r.mtx.Unlock()

	defer func() {
		r.mtx.Lock()
		// reset syncing objects at the close of Sync
		r.syncer = nil
		r.stateProvider = nil
		r.mtx.Unlock()
	}()

	requestSnapshotsHook := func() {
		// request snapshots from all currently connected peers
		msg := p2p.Envelope{
			Broadcast: true,
			Message:   &ssproto.SnapshotsRequest{},
		}

		select {
		case <-ctx.Done():
		case <-r.closeCh:
		case r.snapshotCh.Out <- msg:
		}
	}

	state, commit, err := r.syncer.SyncAny(ctx, r.cfg.DiscoveryTime, requestSnapshotsHook)
	if err != nil {
		return sm.State{}, err
	}

	err = r.stateStore.Bootstrap(state)
	if err != nil {
		return sm.State{}, fmt.Errorf("failed to bootstrap node with new state: %w", err)
	}

	err = r.blockStore.SaveSeenCommit(state.LastBlockHeight, commit)
	if err != nil {
		return sm.State{}, fmt.Errorf("failed to store last seen commit: %w", err)
	}

	err = r.Backfill(ctx, state)
	if err != nil {
		r.Logger.Error("backfill failed. Proceeding optimistically...", "err", err)
	}

	return state, nil
}
// Backfill sequentially fetches, verifies and stores light blocks in reverse
// order. It does not stop verifying blocks until reaching a block with a height
// and time that is less than or equal to the stopHeight and stopTime. The
// trustedBlockID should be of the header at startHeight.
func (r *Reactor) Backfill(ctx context.Context, state sm.State) error {
	params := state.ConsensusParams.Evidence
	stopHeight := state.LastBlockHeight - params.MaxAgeNumBlocks
	stopTime := state.LastBlockTime.Add(-params.MaxAgeDuration)
	// ensure that stop height doesn't go below the initial height
	if stopHeight < state.InitialHeight {
		stopHeight = state.InitialHeight
		// this essentially makes stop time a void criterion for termination
		stopTime = state.LastBlockTime
	}
	return r.backfill(
		ctx,
		state.ChainID,
		state.LastBlockHeight,
		stopHeight,
		state.InitialHeight,
		state.LastBlockID,
		stopTime,
	)
}
func (r *Reactor) backfill(
	ctx context.Context,
	chainID string,
	startHeight, stopHeight, initialHeight int64,
	trustedBlockID types.BlockID,
	stopTime time.Time,
) error {
	r.Logger.Info("starting backfill process...", "startHeight", startHeight,
		"stopHeight", stopHeight, "stopTime", stopTime, "trustedBlockID", trustedBlockID)

	r.backfillBlockTotal = startHeight - stopHeight + 1
	r.metrics.BackFillBlocksTotal.Set(float64(r.backfillBlockTotal))

	const sleepTime = 1 * time.Second
	var (
		lastValidatorSet *types.ValidatorSet
		lastChangeHeight = startHeight
	)

	queue := newBlockQueue(startHeight, stopHeight, initialHeight, stopTime, maxLightBlockRequestRetries)

	// fetch light blocks across four workers. The aim with deploying concurrent
	// workers is to equate the network messaging time with the verification
	// time. Ideally we want the verification process to never have to be
	// waiting on blocks. If it takes 4s to retrieve a block and 1s to verify
	// it, then steady state involves four workers.
	for i := 0; i < int(r.cfg.Fetchers); i++ {
		ctxWithCancel, cancel := context.WithCancel(ctx)
		defer cancel()
		go func() {
			for {
				select {
				case height := <-queue.nextHeight():
					// pop the next peer off the list to send a request to
					peer := r.peers.Pop(ctx)
					r.Logger.Debug("fetching next block", "height", height, "peer", peer)
					subCtx, cancel := context.WithTimeout(ctxWithCancel, lightBlockResponseTimeout)
					defer cancel()
					lb, err := func() (*types.LightBlock, error) {
						defer cancel()
						// request the light block with a timeout
						return r.dispatcher.LightBlock(subCtx, height, peer)
					}()
					// once the peer has returned a value, add it back to the peer list to be used again
					r.peers.Append(peer)
					if errors.Is(err, context.Canceled) {
						return
					}
					if err != nil {
						queue.retry(height)
						if errors.Is(err, errNoConnectedPeers) {
							r.Logger.Info("backfill: no connected peers to fetch light blocks from; sleeping...",
								"sleepTime", sleepTime)
							time.Sleep(sleepTime)
						} else {
							// we don't punish the peer as it might just have not responded in time
							r.Logger.Info("backfill: error with fetching light block",
								"height", height, "err", err)
						}
						continue
					}
					if lb == nil {
						r.Logger.Info("backfill: peer didn't have block, fetching from another peer", "height", height)
						queue.retry(height)
						// As we are fetching blocks backwards, if this node doesn't have the block it likely doesn't
						// have any prior ones, thus we remove it from the peer list.
						r.peers.Remove(peer)
						continue
					}

					// run a validate basic. This checks the validator set and commit
					// hashes line up
					err = lb.ValidateBasic(chainID)
					if err != nil || lb.Height != height {
						r.Logger.Info("backfill: fetched light block failed validate basic, removing peer...",
							"err", err, "height", height)
						queue.retry(height)
						r.blockCh.Error <- p2p.PeerError{
							NodeID: peer,
							Err:    fmt.Errorf("received invalid light block: %w", err),
						}
						continue
					}

					// add block to queue to be verified
					queue.add(lightBlockResponse{
						block: lb,
						peer:  peer,
					})
					r.Logger.Debug("backfill: added light block to processing queue", "height", height)

				case <-queue.done():
					return
				}
			}
		}()
	}
	// verify all light blocks
	for {
		select {
		case <-r.closeCh:
			queue.close()
			return nil

		case <-ctx.Done():
			queue.close()
			return nil

		case resp := <-queue.verifyNext():
			// validate the header hash. We take the last block id of the
			// previous header (i.e. one height above) as the trusted hash which
			// we equate to. ValidatorsHash and CommitHash have already been
			// checked in the `ValidateBasic`
			if w, g := trustedBlockID.Hash, resp.block.Hash(); !bytes.Equal(w, g) {
				r.Logger.Info("received invalid light block. header hash doesn't match trusted LastBlockID",
					"trustedHash", w, "receivedHash", g, "height", resp.block.Height)
				r.blockCh.Error <- p2p.PeerError{
					NodeID: resp.peer,
					Err:    fmt.Errorf("received invalid light block. Expected hash %v, got: %v", w, g),
				}
				queue.retry(resp.block.Height)
				continue
			}

			// save the signed headers
			err := r.blockStore.SaveSignedHeader(resp.block.SignedHeader, trustedBlockID)
			if err != nil {
				return err
			}

			// check if there has been a change in the validator set
			if lastValidatorSet != nil && !bytes.Equal(resp.block.Header.ValidatorsHash, resp.block.Header.NextValidatorsHash) {
				// save all the heights that the last validator set was the same
				err = r.stateStore.SaveValidatorSets(resp.block.Height+1, lastChangeHeight, lastValidatorSet)
				if err != nil {
					return err
				}

				// update the lastChangeHeight
				lastChangeHeight = resp.block.Height
			}

			trustedBlockID = resp.block.LastBlockID
			queue.success()
			r.Logger.Info("backfill: verified and stored light block", "height", resp.block.Height)

			lastValidatorSet = resp.block.ValidatorSet

			r.backfilledBlocks++
			r.metrics.BackFilledBlocks.Add(1)

			// The block height might be less than the stopHeight because the stopTime
			// condition hasn't been fulfilled.
			if resp.block.Height < stopHeight {
				r.backfillBlockTotal++
				r.metrics.BackFillBlocksTotal.Set(float64(r.backfillBlockTotal))
			}

		case <-queue.done():
			if err := queue.error(); err != nil {
				return err
			}

			// save the final batch of validators
			if err := r.stateStore.SaveValidatorSets(queue.terminal.Height, lastChangeHeight, lastValidatorSet); err != nil {
				return err
			}

			r.Logger.Info("successfully completed backfill process", "endHeight", queue.terminal.Height)
			return nil
		}
	}
}
// handleSnapshotMessage handles envelopes sent from peers on the
// SnapshotChannel. It returns an error only if the Envelope.Message is unknown
// for this channel. This should never be called outside of handleMessage.
func (r *Reactor) handleSnapshotMessage(envelope p2p.Envelope) error {
	logger := r.Logger.With("peer", envelope.From)

	switch msg := envelope.Message.(type) {
	case *ssproto.SnapshotsRequest:
		snapshots, err := r.recentSnapshots(recentSnapshots)
		if err != nil {
			logger.Error("failed to fetch snapshots", "err", err)
			return nil
		}

		for _, snapshot := range snapshots {
			logger.Info(
				"advertising snapshot",
				"height", snapshot.Height,
				"format", snapshot.Format,
				"peer", envelope.From,
			)
			r.snapshotCh.Out <- p2p.Envelope{
				To: envelope.From,
				Message: &ssproto.SnapshotsResponse{
					Height:   snapshot.Height,
					Format:   snapshot.Format,
					Chunks:   snapshot.Chunks,
					Hash:     snapshot.Hash,
					Metadata: snapshot.Metadata,
				},
			}
		}

	case *ssproto.SnapshotsResponse:
		r.mtx.RLock()
		defer r.mtx.RUnlock()

		if r.syncer == nil {
			logger.Debug("received unexpected snapshot; no state sync in progress")
			return nil
		}

		logger.Info("received snapshot", "height", msg.Height, "format", msg.Format)
		_, err := r.syncer.AddSnapshot(envelope.From, &snapshot{
			Height:   msg.Height,
			Format:   msg.Format,
			Chunks:   msg.Chunks,
			Hash:     msg.Hash,
			Metadata: msg.Metadata,
		})
		if err != nil {
			logger.Error(
				"failed to add snapshot",
				"height", msg.Height,
				"format", msg.Format,
				"err", err,
				"channel", r.snapshotCh.ID,
			)
			return nil
		}
		logger.Info("added snapshot", "height", msg.Height, "format", msg.Format)

	default:
		return fmt.Errorf("received unknown message: %T", msg)
	}

	return nil
}
// handleChunkMessage handles envelopes sent from peers on the ChunkChannel.
// It returns an error only if the Envelope.Message is unknown for this channel.
// This should never be called outside of handleMessage.
func (r *Reactor) handleChunkMessage(envelope p2p.Envelope) error {
	switch msg := envelope.Message.(type) {
	case *ssproto.ChunkRequest:
		r.Logger.Debug(
			"received chunk request",
			"height", msg.Height,
			"format", msg.Format,
			"chunk", msg.Index,
			"peer", envelope.From,
		)
		resp, err := r.conn.LoadSnapshotChunkSync(context.Background(), abci.RequestLoadSnapshotChunk{
			Height: msg.Height,
			Format: msg.Format,
			Chunk:  msg.Index,
		})
		if err != nil {
			r.Logger.Error(
				"failed to load chunk",
				"height", msg.Height,
				"format", msg.Format,
				"chunk", msg.Index,
				"err", err,
				"peer", envelope.From,
			)
			return nil
		}

		r.Logger.Debug(
			"sending chunk",
			"height", msg.Height,
			"format", msg.Format,
			"chunk", msg.Index,
			"peer", envelope.From,
		)
		r.chunkCh.Out <- p2p.Envelope{
			To: envelope.From,
			Message: &ssproto.ChunkResponse{
				Height:  msg.Height,
				Format:  msg.Format,
				Index:   msg.Index,
				Chunk:   resp.Chunk,
				Missing: resp.Chunk == nil,
			},
		}

	case *ssproto.ChunkResponse:
		r.mtx.RLock()
		defer r.mtx.RUnlock()

		if r.syncer == nil {
			r.Logger.Debug("received unexpected chunk; no state sync in progress", "peer", envelope.From)
			return nil
		}

		r.Logger.Debug(
			"received chunk; adding to sync",
			"height", msg.Height,
			"format", msg.Format,
			"chunk", msg.Index,
			"peer", envelope.From,
		)
		_, err := r.syncer.AddChunk(&chunk{
			Height: msg.Height,
			Format: msg.Format,
			Index:  msg.Index,
			Chunk:  msg.Chunk,
			Sender: envelope.From,
		})
		if err != nil {
			r.Logger.Error(
				"failed to add chunk",
				"height", msg.Height,
				"format", msg.Format,
				"chunk", msg.Index,
				"err", err,
				"peer", envelope.From,
			)
			return nil
		}

	default:
		return fmt.Errorf("received unknown message: %T", msg)
	}

	return nil
}
// handleLightBlockMessage handles envelopes sent from peers on the
// LightBlockChannel, answering light block requests from the local block store
// and feeding responses to the dispatcher.
func (r *Reactor) handleLightBlockMessage(envelope p2p.Envelope) error {
	switch msg := envelope.Message.(type) {
	case *ssproto.LightBlockRequest:
		r.Logger.Info("received light block request", "height", msg.Height)
		lb, err := r.fetchLightBlock(msg.Height)
		if err != nil {
			r.Logger.Error("failed to retrieve light block", "err", err, "height", msg.Height)
			return err
		}
		if lb == nil {
			r.blockCh.Out <- p2p.Envelope{
				To: envelope.From,
				Message: &ssproto.LightBlockResponse{
					LightBlock: nil,
				},
			}
			return nil
		}

		lbproto, err := lb.ToProto()
		if err != nil {
			r.Logger.Error("marshaling light block to proto", "err", err)
			return nil
		}

		// NOTE: If we don't have the light block we will send a nil light block
		// back to the requesting node, indicating that we don't have it.
		r.blockCh.Out <- p2p.Envelope{
			To: envelope.From,
			Message: &ssproto.LightBlockResponse{
				LightBlock: lbproto,
			},
		}

	case *ssproto.LightBlockResponse:
		var height int64
		if msg.LightBlock != nil {
			height = msg.LightBlock.SignedHeader.Header.Height
		}
		r.Logger.Info("received light block response", "peer", envelope.From, "height", height)
		if err := r.dispatcher.Respond(msg.LightBlock, envelope.From); err != nil {
			r.Logger.Error("error processing light block response", "err", err, "height", height)
		}

	default:
		return fmt.Errorf("received unknown message: %T", msg)
	}

	return nil
}
// handleParamsMessage handles envelopes sent from peers on the ParamsChannel,
// serving consensus parameter requests and routing responses to the p2p state
// provider.
func (r *Reactor) handleParamsMessage(envelope p2p.Envelope) error {
	switch msg := envelope.Message.(type) {
	case *ssproto.ParamsRequest:
		r.Logger.Debug("received consensus params request", "height", msg.Height)
		cp, err := r.stateStore.LoadConsensusParams(int64(msg.Height))
		if err != nil {
			r.Logger.Error("failed to fetch requested consensus params", "err", err, "height", msg.Height)
			return nil
		}

		cpproto := cp.ToProto()
		r.paramsCh.Out <- p2p.Envelope{
			To: envelope.From,
			Message: &ssproto.ParamsResponse{
				Height:          msg.Height,
				ConsensusParams: cpproto,
			},
		}

	case *ssproto.ParamsResponse:
		r.mtx.RLock()
		defer r.mtx.RUnlock()
		r.Logger.Debug("received consensus params response", "height", msg.Height)

		cp := types.ConsensusParamsFromProto(msg.ConsensusParams)

		if sp, ok := r.stateProvider.(*stateProviderP2P); ok {
			select {
			case sp.paramsRecvCh <- cp:
			case <-time.After(time.Second):
				return errors.New("failed to send consensus params, stateprovider not ready for response")
			}
		} else {
			r.Logger.Debug("received unexpected params response; using RPC state provider", "peer", envelope.From)
		}

	default:
		return fmt.Errorf("received unknown message: %T", msg)
	}

	return nil
}
// handleMessage handles an Envelope sent from a peer on a specific p2p Channel.
// It will handle errors and any possible panics gracefully. A caller can handle
// any error returned by sending a PeerError on the respective channel.
func (r *Reactor) handleMessage(chID p2p.ChannelID, envelope p2p.Envelope) (err error) {
	defer func() {
		if e := recover(); e != nil {
			err = fmt.Errorf("panic in processing message: %v", e)
			r.Logger.Error(
				"recovering from processing message panic",
				"err", err,
				"stack", string(debug.Stack()),
			)
		}
	}()

	r.Logger.Debug("received message", "message", reflect.TypeOf(envelope.Message), "peer", envelope.From)

	switch chID {
	case SnapshotChannel:
		err = r.handleSnapshotMessage(envelope)
	case ChunkChannel:
		err = r.handleChunkMessage(envelope)
	case LightBlockChannel:
		err = r.handleLightBlockMessage(envelope)
	case ParamsChannel:
		err = r.handleParamsMessage(envelope)
	default:
		err = fmt.Errorf("unknown channel ID (%d) for envelope (%v)", chID, envelope)
	}

	return err
}
// processSnapshotCh initiates a blocking process where we listen for and handle
// envelopes on the SnapshotChannel.
func (r *Reactor) processSnapshotCh() {
	r.processCh(r.snapshotCh, "snapshot")
}

// processChunkCh initiates a blocking process where we listen for and handle
// envelopes on the ChunkChannel.
func (r *Reactor) processChunkCh() {
	r.processCh(r.chunkCh, "chunk")
}

// processBlockCh initiates a blocking process where we listen for and handle
// envelopes on the LightBlockChannel.
func (r *Reactor) processBlockCh() {
	r.processCh(r.blockCh, "light block")
}

// processParamsCh initiates a blocking process where we listen for and handle
// envelopes on the ParamsChannel.
func (r *Reactor) processParamsCh() {
	r.processCh(r.paramsCh, "consensus params")
}

// processCh routes state sync messages to their respective handlers. Any error
// encountered during message execution will result in a PeerError being sent on
// the respective channel. When the reactor is stopped, we will catch the signal
// and close the p2p Channel gracefully.
func (r *Reactor) processCh(ch *p2p.Channel, chName string) {
	defer ch.Close()

	for {
		select {
		case envelope := <-ch.In:
			if err := r.handleMessage(ch.ID, envelope); err != nil {
				r.Logger.Error(fmt.Sprintf("failed to process %s message", chName),
					"ch_id", ch.ID, "envelope", envelope, "err", err)
				ch.Error <- p2p.PeerError{
					NodeID: envelope.From,
					Err:    err,
				}
			}

		case <-r.closeCh:
			r.Logger.Debug(fmt.Sprintf("stopped listening on %s channel; closing...", chName))
			return
		}
	}
}
// processPeerUpdate processes a PeerUpdate, updating the peer list and, when a
// state sync is in progress, the syncer and block providers accordingly.
func (r *Reactor) processPeerUpdate(peerUpdate p2p.PeerUpdate) {
	r.Logger.Info("received peer update", "peer", peerUpdate.NodeID, "status", peerUpdate.Status)

	switch peerUpdate.Status {
	case p2p.PeerStatusUp:
		r.peers.Append(peerUpdate.NodeID)
	case p2p.PeerStatusDown:
		r.peers.Remove(peerUpdate.NodeID)
	}

	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.syncer == nil {
		return
	}

	switch peerUpdate.Status {
	case p2p.PeerStatusUp:
		newProvider := NewBlockProvider(peerUpdate.NodeID, r.chainID, r.dispatcher)
		r.providers[peerUpdate.NodeID] = newProvider
		err := r.syncer.AddPeer(peerUpdate.NodeID)
		if err != nil {
			r.Logger.Error("error adding peer to syncer", "error", err)
			return
		}
		if sp, ok := r.stateProvider.(*stateProviderP2P); ok {
			// we do this in a separate routine to not block whilst waiting for the light client to finish
			// whatever call it's currently executing
			go sp.addProvider(newProvider)
		}

	case p2p.PeerStatusDown:
		delete(r.providers, peerUpdate.NodeID)
		r.syncer.RemovePeer(peerUpdate.NodeID)
	}

	r.Logger.Info("processed peer update", "peer", peerUpdate.NodeID, "status", peerUpdate.Status)
}
// processPeerUpdates initiates a blocking process where we listen for and handle
// PeerUpdate messages. When the reactor is stopped, we will catch the signal and
// close the p2p PeerUpdatesCh gracefully.
func (r *Reactor) processPeerUpdates() {
	defer r.peerUpdates.Close()

	for {
		select {
		case peerUpdate := <-r.peerUpdates.Updates():
			r.processPeerUpdate(peerUpdate)

		case <-r.closeCh:
			r.Logger.Debug("stopped listening on peer updates channel; closing...")
			return
		}
	}
}
// recentSnapshots fetches the n most recent snapshots from the app
func (r *Reactor) recentSnapshots(n uint32) ([]*snapshot, error) {
	resp, err := r.conn.ListSnapshotsSync(context.Background(), abci.RequestListSnapshots{})
	if err != nil {
		return nil, err
	}

	sort.Slice(resp.Snapshots, func(i, j int) bool {
		a := resp.Snapshots[i]
		b := resp.Snapshots[j]

		switch {
		case a.Height > b.Height:
			return true
		case a.Height == b.Height && a.Format > b.Format:
			return true
		default:
			return false
		}
	})

	snapshots := make([]*snapshot, 0, n)
	for i, s := range resp.Snapshots {
		if i >= recentSnapshots {
			break
		}

		snapshots = append(snapshots, &snapshot{
			Height:   s.Height,
			Format:   s.Format,
			Chunks:   s.Chunks,
			Hash:     s.Hash,
			Metadata: s.Metadata,
		})
	}

	return snapshots, nil
}
// fetchLightBlock works out whether the node has a light block at a particular
// height and if so returns it so it can be gossiped to peers
func (r *Reactor) fetchLightBlock(height uint64) (*types.LightBlock, error) {
	h := int64(height)

	blockMeta := r.blockStore.LoadBlockMeta(h)
	if blockMeta == nil {
		return nil, nil
	}

	commit := r.blockStore.LoadBlockCommit(h)
	if commit == nil {
		return nil, nil
	}

	vals, err := r.stateStore.LoadValidators(h)
	if err != nil {
		return nil, err
	}
	if vals == nil {
		return nil, nil
	}

	return &types.LightBlock{
		SignedHeader: &types.SignedHeader{
			Header: &blockMeta.Header,
			Commit: commit,
		},
		ValidatorSet: vals,
	}, nil
}
// waitForEnoughPeers blocks until the peer list contains at least numPeers
// peers, the context is canceled, or the reactor is shut down.
func (r *Reactor) waitForEnoughPeers(ctx context.Context, numPeers int) error {
	startAt := time.Now()
	t := time.NewTicker(100 * time.Millisecond)
	defer t.Stop()
	logT := time.NewTicker(time.Minute)
	defer logT.Stop()
	var iter int
	for r.peers.Len() < numPeers {
		iter++
		select {
		case <-ctx.Done():
			return fmt.Errorf("operation canceled while waiting for peers after %.2fs [%d/%d]",
				time.Since(startAt).Seconds(), r.peers.Len(), numPeers)
		case <-r.closeCh:
			return fmt.Errorf("shutdown while waiting for peers after %.2fs [%d/%d]",
				time.Since(startAt).Seconds(), r.peers.Len(), numPeers)
		case <-t.C:
			continue
		case <-logT.C:
			r.Logger.Info("waiting for sufficient peers to start statesync",
				"duration", time.Since(startAt).String(),
				"target", numPeers,
				"peers", r.peers.Len(),
				"iters", iter,
			)
			continue
		}
	}
	return nil
}
// initStateProvider constructs either the P2P or the RPC state provider,
// depending on configuration, and stores it on the reactor.
func (r *Reactor) initStateProvider(ctx context.Context, chainID string, initialHeight int64) error {
	var err error
	to := light.TrustOptions{
		Period: r.cfg.TrustPeriod,
		Height: r.cfg.TrustHeight,
		Hash:   r.cfg.TrustHashBytes(),
	}
	spLogger := r.Logger.With("module", "stateprovider")
	spLogger.Info("initializing state provider", "trustPeriod", to.Period,
		"trustHeight", to.Height, "useP2P", r.cfg.UseP2P)

	if r.cfg.UseP2P {
		if err := r.waitForEnoughPeers(ctx, 2); err != nil {
			return err
		}

		peers := r.peers.All()
		providers := make([]provider.Provider, len(peers))
		for idx, p := range peers {
			providers[idx] = NewBlockProvider(p, chainID, r.dispatcher)
		}

		r.stateProvider, err = NewP2PStateProvider(ctx, chainID, initialHeight, providers, to, r.paramsCh.Out, spLogger)
		if err != nil {
			return fmt.Errorf("failed to initialize P2P state provider: %w", err)
		}
	} else {
		r.stateProvider, err = NewRPCStateProvider(ctx, chainID, initialHeight, r.cfg.RPCServers, to, spLogger)
		if err != nil {
			return fmt.Errorf("failed to initialize RPC state provider: %w", err)
		}
	}
	return nil
}
// The following accessors implement the Metricer interface and report snapshot
// and backfill progress for the rpc sync info query.

func (r *Reactor) TotalSnapshots() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	if r.syncer != nil && r.syncer.snapshots != nil {
		return int64(len(r.syncer.snapshots.snapshots))
	}
	return 0
}

func (r *Reactor) ChunkProcessAvgTime() time.Duration {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	if r.syncer != nil {
		return time.Duration(r.syncer.avgChunkTime)
	}
	return time.Duration(0)
}

func (r *Reactor) SnapshotHeight() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	if r.syncer != nil {
		return r.syncer.lastSyncedSnapshotHeight
	}
	return 0
}

func (r *Reactor) SnapshotChunksCount() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	if r.syncer != nil && r.syncer.chunks != nil {
		return int64(r.syncer.chunks.numChunksReturned())
	}
	return 0
}

func (r *Reactor) SnapshotChunksTotal() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	if r.syncer != nil && r.syncer.processingSnapshot != nil {
		return int64(r.syncer.processingSnapshot.Chunks)
	}
	return 0
}

func (r *Reactor) BackFilledBlocks() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	return r.backfilledBlocks
}

func (r *Reactor) BackFillBlocksTotal() int64 {
	r.mtx.RLock()
	defer r.mtx.RUnlock()

	return r.backfillBlockTotal
}