
mempool: fix broadcastTxRoutine leak (#3478)

Refs #3306, irisnet@fdbb676

I ran an irishub validator. After the node had been running for several days, I dumped the whole goroutine stack and found hundreds of broadcastTxRoutine goroutines, even though the node had fewer than 30 connected peers. So I believe there must be a broadcastTxRoutine leak.
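For reference, here is a minimal sketch of one way to capture such a goroutine dump with the standard runtime/pprof package; the exact method used on the validator is not specified, so treat this as an assumption:

	package main

	import (
		"os"
		"runtime/pprof"
	)

	func main() {
		// debug=1 aggregates goroutines with identical stacks, so the number
		// of goroutines parked in broadcastTxRoutine is printed directly.
		pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
	}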

According to my analysis, the root cause of this issue lies in the code below:

		select {
		case <-next.NextWaitChan():
			// see the start of the for loop for nil check
			next = next.Next()
		case <-peer.Quit():
			return
		case <-memR.Quit():
			return
		}

As we know, when multiple cases are ready at the same time, select picks one of them at random (see the sketch after this analysis). Suppose that next.NextWaitChan() and peer.Quit() are both ready, and next.NextWaitChan() is chosen:

		// send memTx
		msg := &TxMessage{Tx: memTx.tx}
		success := peer.Send(MempoolChannel, cdc.MustMarshalBinaryBare(msg))
		if !success {
			time.Sleep(peerCatchupSleepIntervalMS * time.Millisecond)
			continue
		}

Then next will be non-nil, and peer.Send will never succeed because the peer has already stopped. The continue jumps back to the top of the loop without ever reaching a select again, so this goroutine is stuck in an infinite loop and is never released.
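For context, Go guarantees that when several select cases are ready, one is chosen uniformly at random. A standalone sketch (not part of the patch) that demonstrates this:

	package main

	import "fmt"

	func main() {
		a := make(chan struct{}, 1)
		b := make(chan struct{}, 1)
		counts := map[string]int{}
		for i := 0; i < 1000; i++ {
			// Make both cases ready before every select.
			a <- struct{}{}
			b <- struct{}{}
			select {
			case <-a:
				counts["a"]++
				<-b // drain the other channel for the next round
			case <-b:
				counts["b"]++
				<-a
			}
		}
		fmt.Println(counts) // roughly 500/500: neither case is preferred
	}

So even though peer.Quit() is ready, there is no guarantee it wins the select.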

My proposal is to check peer.Quit() and memR.Quit() on every iteration of the loop, regardless of whether next is nil.
HaoyangLiu authored 5 years ago, committed by Anton Kaliaev
parent commit 1bb8e02a96
1 changed file with 4 additions and 0 deletions:
  mempool/reactor.go (+4 −0)

mempool/reactor.go

@@ -179,6 +179,10 @@ func (memR *MempoolReactor) broadcastTxRoutine(peer p2p.Peer) {
 	peerID := memR.ids.GetForPeer(peer)
 	var next *clist.CElement
 	for {
+		// In case both next.NextWaitChan() and peer.Quit() are available at the same time
+		if !memR.IsRunning() || !peer.IsRunning() {
+			return
+		}
 		// This happens because the CElement we were looking at got garbage
 		// collected (removed). That is, .NextWait() returned nil. Go ahead and
 		// start from the beginning.

