Refs #3306, irisnet@fdbb676
I ran an irishub validator. After the validator node ran several days, I dump the whole goroutine stack. I found that there were hundreds of broadcastTxRoutine. However, the connected peer quantity was less than 30. So I belive that there must be broadcastTxRoutine leakage issue.
According to my analysis, I think the root cause of this issue locate in below code:
select {
case <-next.NextWaitChan():
// see the start of the for loop for nil check
next = next.Next()
case <-peer.Quit():
return
case <-memR.Quit():
return
}
As we know, if multiple paths are avaliable in the same time, then a random path will be selected. Suppose that next.NextWaitChan() and peer.Quit() are both avaliable, and next.NextWaitChan() is chosen.
// send memTx
msg := &TxMessage{Tx: memTx.tx}
success := peer.Send(MempoolChannel, cdc.MustMarshalBinaryBare(msg))
if !success {
time.Sleep(peerCatchupSleepIntervalMS * time.Millisecond)
continue
}
Then next will be non-empty and the peer send operation won't be success. As a result, this go routine will be track into infinite loop and won't be released.
My proposal is to check peer.Quit() and memR.Quit() in every loop no matter whether next is nil.