p2p: make PeerManager.DialNext() and EvictNext() block (#5947) See #5936 and #5938 for background. The plan was initially to have `DialNext()` and `EvictNext()` return a channel. However, implementing this became unnecessarily complicated and error-prone. As an example, the channel would be both consumed and populated (via method calls) by the same driving method (e.g. `Router.dialPeers()`) which could easily cause deadlocks where a method call blocked while sending on the channel that the caller itself was responsible for consuming (but couldn't since it was busy making the method call). It would also require a set of goroutines in the peer manager that would interact with the goroutines in the router in non-obvious ways, and fully populating the channel on startup could cause deadlocks with other startup tasks. Several issues like these made the solution hard to reason about. I therefore simply made `DialNext()` and `EvictNext()` block until the next peer was available, using internal triggers to wake these methods up in a non-blocking fashion when any relevant state changes occurred. This proved much simpler to reason about, since there are no goroutines in the peer manager (except for trivial retry timers), nor any blocking channel sends, and it instead relies entirely on the existing goroutine structure of the router for concurrency. This also happens to be the same pattern used by the `Transport.Accept()` API, following Go stdlib conventions, so all router goroutines end up using a consistent pattern as well.
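As a rough illustration of the approach described above, here is a minimal, self-contained sketch of a blocking `DialNext()` built from a mutex-guarded candidate list plus a buffered wake-up channel that state-changing methods signal without blocking. The names (`dialSketch`, `Add()`, `wakeDialCh`, the example address) are invented for illustration and are not the actual `PeerManager` implementation; the point is the `Transport.Accept()`-style blocking API the commit message refers to.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type dialSketch struct {
	mtx        sync.Mutex
	candidates []string      // hypothetical queue of peer addresses to dial
	wakeDialCh chan struct{} // capacity 1: coalesces wake-up signals
}

func newDialSketch() *dialSketch {
	return &dialSketch{wakeDialCh: make(chan struct{}, 1)}
}

// Add registers a peer address and wakes any blocked DialNext caller.
func (s *dialSketch) Add(addr string) {
	s.mtx.Lock()
	s.candidates = append(s.candidates, addr)
	s.mtx.Unlock()
	s.wakeDial()
}

// wakeDial signals without blocking; if a signal is already pending,
// that pending signal is enough, so this one is simply dropped.
func (s *dialSketch) wakeDial() {
	select {
	case s.wakeDialCh <- struct{}{}:
	default:
	}
}

// DialNext blocks until a candidate is available or the context is canceled.
func (s *dialSketch) DialNext(ctx context.Context) (string, error) {
	for {
		s.mtx.Lock()
		if len(s.candidates) > 0 {
			next := s.candidates[0]
			s.candidates = s.candidates[1:]
			s.mtx.Unlock()
			return next, nil
		}
		s.mtx.Unlock()
		select {
		case <-s.wakeDialCh:
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

func main() {
	s := newDialSketch()
	go s.Add("tcp://peer@127.0.0.1:26656") // hypothetical address
	addr, err := s.DialNext(context.Background())
	fmt.Println(addr, err)
}
```

Because the wake-up channel has capacity 1 and senders never block on it, mutating methods stay non-blocking while a blocked caller is guaranteed to re-check the state, which is the property the commit message relies on.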
package p2p

import (
	"context"
	"errors"
	"fmt"
	"io"
	"sync"
	"time"

	"github.com/gogo/protobuf/proto"

	"github.com/tendermint/tendermint/crypto"
	"github.com/tendermint/tendermint/libs/log"
	"github.com/tendermint/tendermint/libs/service"
)
// RouterOptions specifies options for a Router.
type RouterOptions struct {
	// ResolveTimeout is the timeout for resolving NodeAddress URLs.
	// 0 means no timeout.
	ResolveTimeout time.Duration

	// DialTimeout is the timeout for dialing a peer. 0 means no timeout.
	DialTimeout time.Duration

	// HandshakeTimeout is the timeout for handshaking with a peer. 0 means
	// no timeout.
	HandshakeTimeout time.Duration
}

// Validate validates the options.
func (o *RouterOptions) Validate() error {
	return nil
}
// Router manages peer connections and routes messages between peers and reactor
// channels. This is an early prototype.
//
// Channels are registered via OpenChannel(). When called, we register an input
// message queue for the channel in channelQueues and spawn off a goroutine for
// Router.routeChannel(). This goroutine reads off outbound messages and puts
// them in the appropriate peer message queue, and processes peer errors which
// will close (and thus disconnect) the appropriate peer queue. It runs until
// either the channel is closed by the caller or the router is stopped, at which
// point the input message queue is closed and removed.
//
// On startup, the router spawns off three primary goroutines that maintain
// connections to peers and run for the lifetime of the router:
//
//   Router.dialPeers(): in a loop, asks the PeerManager for the next peer
//   address to contact, resolves it into endpoints, and attempts to dial
//   each one.
//
//   Router.acceptPeers(): in a loop, waits for the next inbound connection
//   from a peer, and checks with the PeerManager if it should be accepted.
//
//   Router.evictPeers(): in a loop, asks the PeerManager for any connected
//   peers to evict, and disconnects them.
//
// Once either an inbound or outbound connection has been made, an outbound
// message queue is registered in Router.peerQueues and a goroutine is spawned
// off for Router.routePeer() which will spawn off additional goroutines for
// Router.sendPeer() that sends outbound messages from the peer queue over the
// connection and for Router.receivePeer() that reads inbound messages from
// the connection and places them in the appropriate channel queue. When either
// goroutine exits, the connection and peer queue is closed, which will cause
// the other goroutines to close as well.
//
// The peerStore is used to coordinate peer connections, by only allowing a peer
// to be claimed (owned) by a single caller at a time (both for outbound and
// inbound connections). This is done either via peerStore.Dispense() which
// dispenses and claims an eligible peer to dial, or via peerStore.Claim() which
// attempts to claim a given peer for an inbound connection. Peers must be
// returned to the peerStore with peerStore.Return() to release the claim. Over
// time, the peerStore will also do peer scheduling and prioritization, e.g.
// ensuring we do exponential backoff on dial failures and connecting to
// more important peers first (such as persistent peers and validators).
//
// An additional goroutine Router.broadcastPeerUpdates() is also spawned off
// on startup, which consumes peer updates from Router.peerUpdatesCh (currently
// only connections and disconnections), and broadcasts them to all peer update
// subscriptions registered via SubscribePeerUpdates().
//
// On router shutdown, we close Router.stopCh which will signal to all
// goroutines to terminate. This in turn will cause all pending channel/peer
// queues to close, and we wait for this as a signal that goroutines have ended.
//
// All message scheduling should be limited to the queue implementations used
// for channel queues and peer queues. All message sending throughout the router
// is blocking, and if any messages should be dropped or buffered this is the
// sole responsibility of the queue, such that we can limit this logic to a
// single place. There is currently only a FIFO queue implementation that always
// blocks and never drops messages, but this must be improved with other
// implementations. The only exception is that all message sending must also
// select on appropriate channel/queue/router closure signals, to avoid blocking
// forever on a channel that has no consumer.
type Router struct {
	*service.BaseService

	logger      log.Logger
	nodeInfo    NodeInfo
	privKey     crypto.PrivKey
	transports  map[Protocol]Transport
	peerManager *PeerManager
	options     RouterOptions

	// FIXME: Consider using sync.Map.
	peerMtx    sync.RWMutex
	peerQueues map[NodeID]queue

	// FIXME: We don't strictly need to use a mutex for this if we seal the
	// channels on router start. This depends on whether we want to allow
	// dynamic channels in the future.
	channelMtx      sync.RWMutex
	channelQueues   map[ChannelID]queue
	channelMessages map[ChannelID]proto.Message

	// stopCh is used to signal router shutdown, by closing the channel.
	stopCh chan struct{}
}
// NewRouter creates a new Router.
func NewRouter(
	logger log.Logger,
	nodeInfo NodeInfo,
	privKey crypto.PrivKey,
	peerManager *PeerManager,
	transports []Transport,
	options RouterOptions,
) (*Router, error) {
	if err := options.Validate(); err != nil {
		return nil, err
	}

	router := &Router{
		logger:          logger,
		nodeInfo:        nodeInfo,
		privKey:         privKey,
		transports:      map[Protocol]Transport{},
		peerManager:     peerManager,
		options:         options,
		stopCh:          make(chan struct{}),
		channelQueues:   map[ChannelID]queue{},
		channelMessages: map[ChannelID]proto.Message{},
		peerQueues:      map[NodeID]queue{},
	}
	router.BaseService = service.NewBaseService(logger, "router", router)

	for _, transport := range transports {
		for _, protocol := range transport.Protocols() {
			if _, ok := router.transports[protocol]; !ok {
				router.transports[protocol] = transport
			}
		}
	}

	return router, nil
}
// OpenChannel opens a new channel for the given message type. The caller must
// close the channel when done, and this must happen before the router stops.
func (r *Router) OpenChannel(id ChannelID, messageType proto.Message) (*Channel, error) {
	// FIXME: NewChannel should take directional channels so we can pass
	// queue.dequeue() instead of reaching inside for queue.queueCh.
	queue := newFIFOQueue()
	channel := NewChannel(id, messageType, queue.queueCh, make(chan Envelope), make(chan PeerError))

	r.channelMtx.Lock()
	defer r.channelMtx.Unlock()

	if _, ok := r.channelQueues[id]; ok {
		return nil, fmt.Errorf("channel %v already exists", id)
	}
	r.channelQueues[id] = queue
	r.channelMessages[id] = messageType

	go func() {
		defer func() {
			r.channelMtx.Lock()
			delete(r.channelQueues, id)
			delete(r.channelMessages, id)
			r.channelMtx.Unlock()
			queue.close()
		}()
		r.routeChannel(channel)
	}()

	return channel, nil
}
// routeChannel receives outbound messages and errors from a channel and routes
// them to the appropriate peer. It returns when either the channel is closed or
// the router is shutting down.
func (r *Router) routeChannel(channel *Channel) {
	for {
		select {
		case envelope, ok := <-channel.outCh:
			if !ok {
				return
			}

			// FIXME: This is a bit unergonomic, maybe it'd be better for Wrap()
			// to return a wrapped copy.
			if _, ok := channel.messageType.(Wrapper); ok {
				wrapper := proto.Clone(channel.messageType)
				if err := wrapper.(Wrapper).Wrap(envelope.Message); err != nil {
					r.logger.Error("failed to wrap message", "err", err)
					continue
				}
				envelope.Message = wrapper
			}
			envelope.channelID = channel.id

			if envelope.Broadcast {
				r.peerMtx.RLock()
				peerQueues := make(map[NodeID]queue, len(r.peerQueues))
				for peerID, peerQueue := range r.peerQueues {
					peerQueues[peerID] = peerQueue
				}
				r.peerMtx.RUnlock()

				for peerID, peerQueue := range peerQueues {
					e := envelope
					e.Broadcast = false
					e.To = peerID
					select {
					case peerQueue.enqueue() <- e:
					case <-peerQueue.closed():
					case <-r.stopCh:
						return
					}
				}
			} else {
				r.peerMtx.RLock()
				peerQueue, ok := r.peerQueues[envelope.To]
				r.peerMtx.RUnlock()
				if !ok {
					r.logger.Error("dropping message for non-connected peer",
						"peer", envelope.To, "channel", channel.id)
					continue
				}
				select {
				case peerQueue.enqueue() <- envelope:
				case <-peerQueue.closed():
					r.logger.Error("dropping message for non-connected peer",
						"peer", envelope.To, "channel", channel.id)
				case <-r.stopCh:
					return
				}
			}

		case peerError, ok := <-channel.errCh:
			if !ok {
				return
			}
			// FIXME: We just disconnect the peer for now.
			r.logger.Error("peer error, disconnecting", "peer", peerError.PeerID, "err", peerError.Err)
			r.peerMtx.RLock()
			peerQueue, ok := r.peerQueues[peerError.PeerID]
			r.peerMtx.RUnlock()
			if ok {
				peerQueue.close()
			}

		case <-channel.Done():
			return

		case <-r.stopCh:
			return
		}
	}
}
// acceptPeers accepts inbound connections from peers on the given transport.
func (r *Router) acceptPeers(transport Transport) {
	ctx := r.stopCtx()
	for {
		// FIXME: We may need transports to enforce some sort of rate limiting
		// here (e.g. by IP address), or alternatively have PeerManager.Accepted()
		// do it for us.
		//
		// FIXME: Even though PeerManager enforces MaxConnected, we may want to
		// limit the maximum number of active connections here too, since e.g.
		// an adversary can open a ton of connections and then just hang during
		// the handshake, taking up TCP socket descriptors.
		//
		// FIXME: The old P2P stack rejected multiple connections for the same IP
		// unless P2PConfig.AllowDuplicateIP is true -- it's better to limit this
		// by peer ID rather than IP address, so this hasn't been implemented and
		// probably shouldn't (?).
		//
		// FIXME: The old P2P stack supported ABCI-based IP address filtering via
		// /p2p/filter/addr/<ip> queries, do we want to implement this here as well?
		// Filtering by node ID is probably better.
		conn, err := transport.Accept()
		switch err {
		case nil:
		case io.EOF:
			r.logger.Debug("stopping accept routine", "transport", transport)
			return
		default:
			r.logger.Error("failed to accept connection", "transport", transport, "err", err)
			continue
		}

		go func() {
			defer func() {
				_ = conn.Close()
			}()

			// FIXME: Because we do the handshake in each transport, rather than
			// here in the Router, the remote peer will think they've
			// successfully connected and start sending us messages, although we
			// can end up rejecting the connection here. This can e.g. cause
			// problems in tests, where because of race conditions a
			// disconnection can cause the local node to immediately redial,
			// while the remote node may not have completed the disconnection
			// registration yet and reject the accept below.
			//
			// The Router should do the handshake, and we should check with the
			// peer manager before completing the handshake -- this probably
			// requires protocol changes to send an additional message when the
			// handshake is accepted.
			peerInfo, _, err := r.handshakePeer(ctx, conn, "")
			if err == context.Canceled {
				return
			} else if err != nil {
				r.logger.Error("failed to handshake with peer", "err", err)
				return
			}

			if err := r.peerManager.Accepted(peerInfo.NodeID); err != nil {
				r.logger.Error("failed to accept connection", "peer", peerInfo.NodeID, "err", err)
				return
			}

			queue := newFIFOQueue()
			r.peerMtx.Lock()
			r.peerQueues[peerInfo.NodeID] = queue
			r.peerMtx.Unlock()
			r.peerManager.Ready(peerInfo.NodeID)

			defer func() {
				r.peerMtx.Lock()
				delete(r.peerQueues, peerInfo.NodeID)
				r.peerMtx.Unlock()
				queue.close()
				if err := r.peerManager.Disconnected(peerInfo.NodeID); err != nil {
					r.logger.Error("failed to disconnect peer", "peer", peerInfo.NodeID, "err", err)
				}
			}()

			r.routePeer(peerInfo.NodeID, conn, queue)
		}()
	}
}
// dialPeers maintains outbound connections to peers.
func (r *Router) dialPeers() {
	ctx := r.stopCtx()
	for {
		peerID, address, err := r.peerManager.DialNext(ctx)
		switch err {
		case nil:
		case context.Canceled:
			r.logger.Debug("stopping dial routine")
			return
		default:
			r.logger.Error("failed to find next peer to dial", "err", err)
			return
		}

		go func() {
			conn, err := r.dialPeer(ctx, address)
			if errors.Is(err, context.Canceled) {
				return
			} else if err != nil {
				r.logger.Error("failed to dial peer", "peer", peerID, "err", err)
				if err = r.peerManager.DialFailed(peerID, address); err != nil {
					r.logger.Error("failed to report dial failure", "peer", peerID, "err", err)
				}
				return
			}
			defer conn.Close()

			_, _, err = r.handshakePeer(ctx, conn, peerID)
			if errors.Is(err, context.Canceled) {
				return
			} else if err != nil {
				r.logger.Error("failed to handshake with peer", "peer", peerID, "err", err)
				if err = r.peerManager.DialFailed(peerID, address); err != nil {
					r.logger.Error("failed to report dial failure", "peer", peerID, "err", err)
				}
				return
			}

			if err = r.peerManager.Dialed(peerID, address); err != nil {
				r.logger.Error("failed to dial peer", "peer", peerID, "err", err)
				return
			}

			queue := newFIFOQueue()
			r.peerMtx.Lock()
			r.peerQueues[peerID] = queue
			r.peerMtx.Unlock()
			r.peerManager.Ready(peerID)

			defer func() {
				r.peerMtx.Lock()
				delete(r.peerQueues, peerID)
				r.peerMtx.Unlock()
				queue.close()
				if err := r.peerManager.Disconnected(peerID); err != nil {
					r.logger.Error("failed to disconnect peer", "peer", peerID, "err", err)
				}
			}()

			r.routePeer(peerID, conn, queue)
		}()
	}
}
// dialPeer connects to a peer by dialing it.
func (r *Router) dialPeer(ctx context.Context, address NodeAddress) (Connection, error) {
	r.logger.Info("resolving peer address", "address", address)

	resolveCtx := ctx
	if r.options.ResolveTimeout > 0 {
		var cancel context.CancelFunc
		resolveCtx, cancel = context.WithTimeout(resolveCtx, r.options.ResolveTimeout)
		defer cancel()
	}
	endpoints, err := address.Resolve(resolveCtx)
	if err != nil {
		return nil, fmt.Errorf("failed to resolve address %q: %w", address, err)
	}

	for _, endpoint := range endpoints {
		transport, ok := r.transports[endpoint.Protocol]
		if !ok {
			r.logger.Error("no transport found for endpoint protocol", "endpoint", endpoint)
			continue
		}

		dialCtx := ctx
		if r.options.DialTimeout > 0 {
			var cancel context.CancelFunc
			dialCtx, cancel = context.WithTimeout(dialCtx, r.options.DialTimeout)
			defer cancel()
		}

		// FIXME: When we dial and handshake the peer, we should pass it
		// appropriate address(es) it can use to dial us back. It can't use our
		// remote endpoint, since TCP uses different port numbers for outbound
		// connections than it does for inbound. Also, we may need to vary this
		// by the peer's endpoint, since e.g. a peer on 192.168.0.0 can reach us
		// on a private address on this endpoint, but a peer on the public
		// Internet can't and needs a different public address.
		conn, err := transport.Dial(dialCtx, endpoint)
		if err != nil {
			r.logger.Error("failed to dial endpoint", "endpoint", endpoint, "err", err)
		} else {
			r.logger.Info("connected to peer", "peer", address.NodeID, "endpoint", endpoint)
			return conn, nil
		}
	}
	return nil, fmt.Errorf("failed to connect to peer via %q", address)
}
// handshakePeer handshakes with a peer, validating the peer's information. If
// expectID is given, we check that the peer's node ID matches it.
func (r *Router) handshakePeer(ctx context.Context, conn Connection, expectID NodeID) (NodeInfo, crypto.PubKey, error) {
	if r.options.HandshakeTimeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, r.options.HandshakeTimeout)
		defer cancel()
	}
	peerInfo, peerKey, err := conn.Handshake(ctx, r.nodeInfo, r.privKey)
	if err != nil {
		return peerInfo, peerKey, err
	}
	if err = peerInfo.Validate(); err != nil {
		return peerInfo, peerKey, fmt.Errorf("invalid handshake NodeInfo: %w", err)
	}
	if expectID != "" && expectID != peerInfo.NodeID {
		return peerInfo, peerKey, fmt.Errorf("expected to connect with peer %q, got %q",
			expectID, peerInfo.NodeID)
	}
	if NodeIDFromPubKey(peerKey) != peerInfo.NodeID {
		return peerInfo, peerKey, fmt.Errorf("peer's public key did not match its node ID %q (expected %q)",
			peerInfo.NodeID, NodeIDFromPubKey(peerKey))
	}
	if peerInfo.NodeID == r.nodeInfo.NodeID {
		return peerInfo, peerKey, errors.New("rejecting handshake with self")
	}
	return peerInfo, peerKey, nil
}
// routePeer routes inbound messages from a peer to channels, and also sends
// outbound queued messages to the peer. It will close the connection and send
// queue, using this as a signal to coordinate the internal receivePeer() and
// sendPeer() goroutines. It blocks until the peer is done, e.g. when the
// connection or queue is closed.
func (r *Router) routePeer(peerID NodeID, conn Connection, sendQueue queue) {
	r.logger.Info("routing peer", "peer", peerID)

	resultsCh := make(chan error, 2)
	go func() {
		resultsCh <- r.receivePeer(peerID, conn)
	}()
	go func() {
		resultsCh <- r.sendPeer(peerID, conn, sendQueue)
	}()

	err := <-resultsCh
	_ = conn.Close()
	sendQueue.close()
	if e := <-resultsCh; err == nil {
		// The first err was nil, so we update it with the second result,
		// which may or may not be nil.
		err = e
	}

	switch err {
	case nil, io.EOF, ErrTransportClosed{}:
		r.logger.Info("peer disconnected", "peer", peerID)
	default:
		r.logger.Error("peer failure", "peer", peerID, "err", err)
	}
}
// receivePeer receives inbound messages from a peer, deserializes them and
// passes them on to the appropriate channel.
func (r *Router) receivePeer(peerID NodeID, conn Connection) error {
	for {
		chID, bz, err := conn.ReceiveMessage()
		if err != nil {
			return err
		}

		r.channelMtx.RLock()
		queue, ok := r.channelQueues[chID]
		messageType := r.channelMessages[chID]
		r.channelMtx.RUnlock()
		if !ok {
			r.logger.Error("dropping message for unknown channel", "peer", peerID, "channel", chID)
			continue
		}

		msg := proto.Clone(messageType)
		if err := proto.Unmarshal(bz, msg); err != nil {
			r.logger.Error("message decoding failed, dropping message", "peer", peerID, "err", err)
			continue
		}
		if wrapper, ok := msg.(Wrapper); ok {
			msg, err = wrapper.Unwrap()
			if err != nil {
				r.logger.Error("failed to unwrap message", "err", err)
				continue
			}
		}

		select {
		case queue.enqueue() <- Envelope{channelID: chID, From: peerID, Message: msg}:
			r.logger.Debug("received message", "peer", peerID, "message", msg)
		case <-queue.closed():
			r.logger.Error("channel closed, dropping message", "peer", peerID, "channel", chID)
		case <-r.stopCh:
			return nil
		}
	}
}
// sendPeer sends queued messages to a peer.
func (r *Router) sendPeer(peerID NodeID, conn Connection, queue queue) error {
	for {
		select {
		case envelope := <-queue.dequeue():
			bz, err := proto.Marshal(envelope.Message)
			if err != nil {
				r.logger.Error("failed to marshal message", "peer", peerID, "err", err)
				continue
			}

			_, err = conn.SendMessage(envelope.channelID, bz)
			if err != nil {
				return err
			}
			r.logger.Debug("sent message", "peer", envelope.To, "message", envelope.Message)

		case <-queue.closed():
			return nil

		case <-r.stopCh:
			return nil
		}
	}
}
// evictPeers evicts connected peers as requested by the peer manager.
func (r *Router) evictPeers() {
	ctx := r.stopCtx()
	for {
		peerID, err := r.peerManager.EvictNext(ctx)
		switch err {
		case nil:
		case context.Canceled:
			r.logger.Debug("stopping evict routine")
			return
		default:
			r.logger.Error("failed to find next peer to evict", "err", err)
			return
		}

		r.logger.Info("evicting peer", "peer", peerID)
		r.peerMtx.RLock()
		if queue, ok := r.peerQueues[peerID]; ok {
			queue.close()
		}
		r.peerMtx.RUnlock()
	}
}
// OnStart implements service.Service.
func (r *Router) OnStart() error {
	go r.dialPeers()
	for _, transport := range r.transports {
		go r.acceptPeers(transport)
	}
	go r.evictPeers()
	return nil
}
// OnStop implements service.Service.
//
// FIXME: This needs to close transports as well.
func (r *Router) OnStop() {
	// Collect all active queues, so we can wait for them to close.
	queues := []queue{}
	r.channelMtx.RLock()
	for _, q := range r.channelQueues {
		queues = append(queues, q)
	}
	r.channelMtx.RUnlock()
	r.peerMtx.RLock()
	for _, q := range r.peerQueues {
		queues = append(queues, q)
	}
	r.peerMtx.RUnlock()

	// Signal router shutdown, and wait for queues (and thus goroutines)
	// to complete.
	close(r.stopCh)
	for _, q := range queues {
		<-q.closed()
	}
}
// stopCtx returns a context that is cancelled when the router stops.
func (r *Router) stopCtx() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		<-r.stopCh
		cancel()
	}()
	return ctx
}
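For reference, the `queue` values used throughout the file expose `enqueue()`, `dequeue()`, `closed()` and `close()`. One plausible shape for a FIFO implementation behind `newFIFOQueue()` is sketched below; it is inferred only from how the queue is used above (the name `fifoQueueSketch` is invented here, and it assumes the package's `Envelope` type and `sync` import), and matches the doc comment's description of a queue that always blocks and never drops messages.

```go
// fifoQueueSketch is an illustrative queue: enqueue() and dequeue() expose the
// same unbuffered channel (so sends block until a consumer is ready and no
// message is ever dropped), and closed() lets selects observe shutdown.
type fifoQueueSketch struct {
	queueCh   chan Envelope // unbuffered: always blocks, never drops
	closeCh   chan struct{}
	closeOnce sync.Once
}

func newFIFOQueueSketch() *fifoQueueSketch {
	return &fifoQueueSketch{
		queueCh: make(chan Envelope),
		closeCh: make(chan struct{}),
	}
}

func (q *fifoQueueSketch) enqueue() chan<- Envelope { return q.queueCh }
func (q *fifoQueueSketch) dequeue() <-chan Envelope { return q.queueCh }
func (q *fifoQueueSketch) closed() <-chan struct{}  { return q.closeCh }
func (q *fifoQueueSketch) close()                   { q.closeOnce.Do(func() { close(q.closeCh) }) }
```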