Why submit this pr:
we have suffered from infinite loop in addrbook bug which takes us a long time to find out why process become a zombie peer. It have been fixed in #3232. But the ADDRS_LOOP is still there, risk of infinite loop is still exist.
The algorithm that to random pick a bucket is not stable, which means the peer may unluckily always choose the wrong bucket for a long time, the time and cpu cost is meaningless.
A simple improvement:
shuffle bucketsNew and bucketsOld, and pick necessary number of address from them. A stable
algorithm.
* failing test
* fix infinite loop in addrbook
There are cases where we only have a small number of addresses marked
good ("old"), but the selection mechanism keeps trying to select more of these
addresses, and hence ends up in an infinite loop. Here we fix this to
only try and select such "old" addresses if we have enough of them. Note this
means, if we don't have enough of them, we may return more "new"
addresses than otherwise expected by the newSelectionBias.
This whole GetSelectionWithBias method probably needs to be rewritten,
but this is a quick fix for the issue.
* changelog
* fix infinite loop if not enough new addrs
* fix another potential infinite loop
if a.nNew == 0 -> pickFromOldBucket=true, but we don't have enough items
(a.nOld > len(oldBucketToAddrsMap) false)
* Revert "fix another potential infinite loop"
This reverts commit 146540c112.
* check num addresses instead of buckets, new test
* fixed the int division
* add slack to bias % in test, lint fixes
* Added checks for selection content in test
* test cleanup
* Apply suggestions from code review
Co-Authored-By: ebuchman <ethan@coinculture.info>
* address review comments
* change after Anton's review comments
* use the same docker image we use for testing
when building a binary for localnet
* switch back to circleci classic
* more review comments
* more review comments
* refactor addrbook_test
* build linux binary inside docker
in attempt to fix
```
--> Running dep
+ make build-linux
GOOS=linux GOARCH=amd64 make build
make[1]: Entering directory `/home/circleci/.go_workspace/src/github.com/tendermint/tendermint'
CGO_ENABLED=0 go build -ldflags "-X github.com/tendermint/tendermint/version.GitCommit=`git rev-parse --short=8 HEAD`" -tags 'tendermint' -o build/tendermint ./cmd/tendermint/
p2p/pex/addrbook.go:373:13: undefined: math.Round
```
* change dir from /usr to /go
* use concrete Go version for localnet binary
* check for nil addresses just to be sure
* close peer's connection to avoid fd leak
Fixes#2967
* rename peer#Addr to RemoteAddr
* fix test
* fixes after Ethan's review
* bring back the check
* changelog entry
* write a test for switch#acceptRoutine
* increase timeouts? :(
* remove extra assertNPeersWithTimeout
* simplify test
* assert number of peers (just to be safe)
* Cleanup in OnStop
* run tests with verbose flag on CircleCI
* spawn a reading routine to prevent connection from closing
* get port from the listener
random port is faster, but often results in
```
panic: listen tcp 127.0.0.1:44068: bind: address already in use [recovered]
panic: listen tcp 127.0.0.1:44068: bind: address already in use
goroutine 79 [running]:
testing.tRunner.func1(0xc0001bd600)
/usr/local/go/src/testing/testing.go:792 +0x387
panic(0x974d20, 0xc0001b0500)
/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/tendermint/tendermint/p2p.MakeSwitch(0xc0000f42a0, 0x0, 0x9fb9cc, 0x9, 0x9fc346, 0xb, 0xb42128, 0x0, 0x0, 0x0, ...)
/home/vagrant/go/src/github.com/tendermint/tendermint/p2p/test_util.go:182 +0xa28
github.com/tendermint/tendermint/p2p.MakeConnectedSwitches(0xc0000f42a0, 0x2, 0xb42128, 0xb41eb8, 0x4f1205, 0xc0001bed80, 0x4f16ed)
/home/vagrant/go/src/github.com/tendermint/tendermint/p2p/test_util.go:75 +0xf9
github.com/tendermint/tendermint/p2p.MakeSwitchPair(0xbb8d20, 0xc0001bd600, 0xb42128, 0x2f7, 0x4f16c0)
/home/vagrant/go/src/github.com/tendermint/tendermint/p2p/switch_test.go:94 +0x4c
github.com/tendermint/tendermint/p2p.TestSwitches(0xc0001bd600)
/home/vagrant/go/src/github.com/tendermint/tendermint/p2p/switch_test.go:117 +0x58
testing.tRunner(0xc0001bd600, 0xb42038)
/usr/local/go/src/testing/testing.go:827 +0xbf
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:878 +0x353
exit status 2
FAIL github.com/tendermint/tendermint/p2p 0.350s
```
* p2p/conn: FlushStop. Use in pex. Closes#2092
In seed mode, we call StopPeer immediately after Send.
Since flushing msgs to the peer happens in the background,
the peer connection is often closed before the messages are
actually sent out. The new FlushStop method allows all msgs
to first be written and flushed out on the conn before it is closed.
* fix dummy peer
* typo
* fixes from review
* more comments
* ensure pex doesn't call FlushStop more than once
FlushStop is not safe to call more than once,
but we call it from Receive in a go-routine so Receive
doesn't block.
To ensure we only call it once, we use the lastReceivedRequests
map - if an entry already exists, then FlushStop should already have
been called and we can return.
* use READ lock/unlock in ConsensusState#GetLastHeight
Refs #2721
* do not use defers when there's no need
* fix peer formatting (output its address instead of the pointer)
```
[54310]: E[11-02|11:59:39.851] Connection failed @ sendRoutine module=p2p peer=0xb78f00 conn=MConn{74.207.236.148:26656} err="pong timeout"
```
https://github.com/tendermint/tendermint/issues/2721#issuecomment-435326581
* panic if peer has no state
https://github.com/tendermint/tendermint/issues/2721#issuecomment-435347165
It's confusing that sometimes we check if peer has a state, but most of
the times we expect it to be there
1. add79700b5/mempool/reactor.go (L138)
2. add79700b5/rpc/core/consensus.go (L196) (edited)
I will change everything to always assume peer has a state and panic
otherwise
that should help identify issues earlier
* abci/localclient: extend lock on app callback
App callback should be protected by lock as well (note this was already
done for InitChainAsync, why not for others???). Otherwise, when we
execute the block, tx might come in and call the callback in the same
time we're updating it in execBlockOnProxyApp => DATA RACE
Fixes#2721
Consensus state is locked
```
goroutine 113333 [semacquire, 309 minutes]:
sync.runtime_SemacquireMutex(0xc00180009c, 0xc0000c7e00)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*RWMutex).RLock(0xc001800090)
/usr/local/go/src/sync/rwmutex.go:50 +0x4e
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).GetRoundState(0xc001800000, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:218 +0x46
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).queryMaj23Routine(0xc0017def80, 0x11104a0, 0xc0072488f0, 0xc007248
9c0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:735 +0x16d
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusReactor).AddPeer
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/reactor.go:172 +0x236
```
because localClient is locked
```
goroutine 1899 [semacquire, 309 minutes]:
sync.runtime_SemacquireMutex(0xc00003363c, 0xc0000cb500)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc000033638)
/usr/local/go/src/sync/mutex.go:134 +0xff
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).SetResponseCallback(0xc0001fb560, 0xc007868540)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:32 +0x33
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy.(*appConnConsensus).SetResponseCallback(0xc00002f750, 0xc007868540)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:57 +0x40
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/state.execBlockOnProxyApp(0x1104e20, 0xc002ca0ba0, 0x11092a0, 0xc00002f750, 0xc0001fe960, 0xc000bfc660, 0x110cfe0, 0xc000090330, 0xc9d12, 0xc000d9d5a0, ...)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/state/execution.go:230 +0x1fd
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc002c2a230, 0x7, 0x0, 0xc000eae880, 0x6, 0xc002e52c60, 0x16, 0x1f927, 0xc9d12, 0xc000d9d5a0, ...)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/state/execution.go:96 +0x142
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).finalizeCommit(0xc001800000, 0x1f928)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1339 +0xa3e
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryFinalizeCommit(0xc001800000, 0x1f928)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1270 +0x451
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit.func1(0xc001800000, 0x0, 0x1f928)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1218 +0x90
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).enterCommit(0xc001800000, 0x1f928, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1247 +0x6b8
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).addVote(0xc001800000, 0xc003d8dea0, 0xc000cf4cc0, 0x28, 0xf1, 0xc003bc7ad0, 0xc003bc7b10)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1659 +0xbad
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).tryAddVote(0xc001800000, 0xc003d8dea0, 0xc000cf4cc0, 0x28, 0xf1, 0xf1, 0xf1)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:1517 +0x59
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).handleMsg(0xc001800000, 0xd98200, 0xc0070dbed0, 0xc000cf4cc0, 0x28)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:660 +0x64b
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).receiveRoutine(0xc001800000, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:617 +0x670
created by github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus.(*ConsensusState).OnStart
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/consensus/state.go:311 +0x132
```
tx comes in and CheckTx is executed right when we execute the block
```
goroutine 111044 [semacquire, 309 minutes]:
sync.runtime_SemacquireMutex(0xc00003363c, 0x0)
/usr/local/go/src/runtime/sema.go:71 +0x3d
sync.(*Mutex).Lock(0xc000033638)
/usr/local/go/src/sync/mutex.go:134 +0xff
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client.(*localClient).CheckTxAsync(0xc0001fb0e0, 0xc002d94500, 0x13f, 0x280, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/abci/client/local_client.go:85 +0x47
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy.(*appConnMempool).CheckTxAsync(0xc00002f720, 0xc002d94500, 0x13f, 0x280, 0x1)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/proxy/app_conn.go:114 +0x51
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool.(*Mempool).CheckTx(0xc002d3a320, 0xc002d94500, 0x13f, 0x280, 0xc0072355f0, 0x0, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/mempool/mempool.go:316 +0x17b
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core.BroadcastTxSync(0xc002d94500, 0x13f, 0x280, 0x0, 0x0, 0x0)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/core/mempool.go:93 +0xb8
reflect.Value.call(0xd85560, 0x10326c0, 0x13, 0xec7b8b, 0x4, 0xc00663f180, 0x1, 0x1, 0xc00663f180, 0xc00663f188, ...)
/usr/local/go/src/reflect/value.go:447 +0x449
reflect.Value.Call(0xd85560, 0x10326c0, 0x13, 0xc00663f180, 0x1, 0x1, 0x0, 0x0, 0xc005cc9344)
/usr/local/go/src/reflect/value.go:308 +0xa4
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.makeHTTPHandler.func2(0x1102060, 0xc00663f100, 0xc0082d7900)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/handlers.go:269 +0x188
net/http.HandlerFunc.ServeHTTP(0xc002c81f20, 0x1102060, 0xc00663f100, 0xc0082d7900)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.(*ServeMux).ServeHTTP(0xc002c81b60, 0x1102060, 0xc00663f100, 0xc0082d7900)
/usr/local/go/src/net/http/server.go:2361 +0x127
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.maxBytesHandler.ServeHTTP(0x10f8a40, 0xc002c81b60, 0xf4240, 0x1102060, 0xc00663f100, 0xc0082d7900)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:219 +0xcf
github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server.RecoverAndLogHandler.func1(0x1103220, 0xc00121e620, 0xc0082d7900)
/root/go/src/github.com/MinterTeam/minter-go-node/vendor/github.com/tendermint/tendermint/rpc/lib/server/http_server.go:192 +0x394
net/http.HandlerFunc.ServeHTTP(0xc002c06ea0, 0x1103220, 0xc00121e620, 0xc0082d7900)
/usr/local/go/src/net/http/server.go:1964 +0x44
net/http.serverHandler.ServeHTTP(0xc001a1aa90, 0x1103220, 0xc00121e620, 0xc0082d7900)
/usr/local/go/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc00785a3c0, 0x11041a0, 0xc000f844c0)
/usr/local/go/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2851 +0x2f5
```
* consensus: use read lock in Receive#VoteMessage
* use defer to unlock mutex because application might panic
* use defer in every method of the localClient
* add a changelog entry
* drain channels before Unsubscribe(All)
Read 55362ed766/libs/pubsub/pubsub.go (L13)
for the detailed explanation of the issue.
We'll need to fix it someday. Make sure to keep an eye on
https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-033-pubsub.md
* retry instead of panic when peer has no state in reactors other than consensus
in /dump_consensus_state RPC endpoint, skip a peer with no state
* rpc/core/mempool: simplify error messages
* rpc/core/mempool: use time.After instead of timer
also, do not log DeliverTx result (to be consistent with other memthods)
* unlock before calling the callback in reqRes#SetCallback
* Require addressbook to only store addresses with valid ID
* Do not shut down peer immediately after sending pex addrs in SeedMode
* p2p: fix#2773
* seed mode: use go-routine to sleep before stopping peer
* validate reactor messages
Refs #2683
* validate blockchain messages
Refs #2683
* validate evidence messages
Refs #2683
* todo
* check ProposalPOL and signature sizes
* add a changelog entry
* check addr is valid when we add it to the addrbook
* validate incoming netAddr (not just nil check!)
* fixes after Bucky's review
* check timestamps
* beef up block#ValidateBasic
* move some checks into bcBlockResponseMessage
* update Gopkg.lock
Fix
```
grouped write of manifest, lock and vendor: failed to export github.com/tendermint/go-amino: fatal: failed to unpack tree object 6dcc6ddc14
```
by running `dep ensure -update`
* bump year since now we check it
* generate test/p2p/data on the fly using tendermint testnet
* allow sync chains older than 1 year
* use full path when creating a testnet
* move testnet gen to test/docker/Dockerfile
* relax LastCommitRound check
Refs #2737
* fix conflicts after merge
* add small comment
* some ValidateBasic updates
* fixes
* AppHash length is not fixed
* p2p: NodeInfo is an interface
* (squash) fixes from review
* (squash) more fixes from review
* p2p: remove peerConn.HandshakeTimeout
* p2p: NodeInfo is two interfaces. Remove String()
* fixes from review
* remove test code from peer.RemoteIP()
* p2p: remove peer.OriginalAddr(). See #2618
* use a mockPeer in peer_set_test.go
* p2p: fix testNodeInfo naming
* p2p: remove unused var
* remove testRandNodeInfo
* fix linter
* fix retry dialing self
* fix rpc
We are swapping the exisiting listener implementation with the newly
introduced Transport and its default implementation MultiplexTransport,
removing a large chunk of old connection setup and handling scattered
over the Peer and Switch code. The Switch requires a Transport now and
handles externally passed Peer filters.
* ignore existing peers in DialPeersAsync
Fixes#2253
* rename HasPeerWithAddress to IsDialingOrExistingAddress
[breaking] remove Switch#IsDialing
* check if addrBook is nil
to be consistent with other usages of addrBook across Switch
* different log messages for 2 use-cases
* [p2p/pex] connect to more than 10 peers
also, remove DefaultMinNumOutboundPeers because a) I am not sure it's
needed b) it's super confusing
look closely
```
maxPeers := sw.config.MaxNumPeers - DefaultMinNumOutboundPeers
if maxPeers <= sw.peers.Size() {
sw.Logger.Info("Ignoring inbound connection: already have enough peers", "address", inConn.RemoteAddr().String(), "numPeers", sw.peers.Size(), "max", maxPeers)
```
we print maxPeers = config.MaxPeers - DefaultMinNumOutboundPeers. So we
may not have enough peers even though we say we have enough.
Refs #2130
* update spec
* replace MaxNumPeers with MaxNumInboundPeers/MaxNumOutboundPeers
Refs #2130
* update changelog
* make max rpc conns formula visible to users
* update spec
* docs: note max outbound peers excludes persistent
* p2p/pex: Allow configured seed nodes to be offline
Previously you couldn't startup tendermint if a seed node was offline.
This now allows you to startup tendermint, as long as all seed node addresses
are formatted correctly. In the event that all seed nodes are down,
and the address book is empty, then it crashes with an informative error msg.
(This case doesn't occur if no seeds were specified)
Closes#1716
* (Squash this) Address melekes' comments
* (squash this) fix package imports
* (squash this) fix pex_reactor comment
* (squash this) add a test case
This is to reduce wait times when initially connecting. This still runs checks
such as whether you still want additional peers.
A test case has been created, which fails if this change is not included.
Currently the top level directory contains basically all of the code
for the crypto package. This PR moves the crypto code into submodules
in a similar manner to what `golang/x/crypto` does. This improves code
organization.
Ref discussion: https://github.com/tendermint/tendermint/pull/1966Closes#1956
This commit switches all usage of math/rand to cmn's rand. The only
exceptions are within the random file itself, the tools package, and the
crypto package. In tools you don't want it to lock between the go-routines.
The crypto package doesn't use it so the crypto package have no other
dependencies within tendermint/tendermint for easier portability.
Crypto/rand usage is unadjusted.
Closes#1343
* new config option for external address to advertise
* if blank, defaults to best guess from listener
* if laddr ip address is also blank, default to IPv4
* config: rename skip_upnp to upnp
Change default option to enable upnp.
Closes#1806
* doc updates
- fix comment and set UPNP to false in TestP2PConfig
- add UPNP to config template
- update changelog
As both configs are concerned with the p2p packaage and PeerConfig is
only used inside of the package there is no good reason to keep the
couple of fields separate, therefore it is collapsed into the more
general P2PConifg. This is a stepping stone towards a setup where the
components inside of p2p do not have any knowledge about the config.
follow-up to #1325