Tendermint is required to monitor peer quality in order to inform its peer dialing and peer exchange strategies.
When a node first connects to the network, it is important that it can quickly find good peers. Thus, while a node has few connections, it should prioritize connecting to higher-quality peers. As the node becomes well connected to the rest of the network, it can dial lesser-known or lower-quality peers and help assess their quality. Similarly, when queried for peers, a node should make sure it does not return low-quality peers.
Peer quality can be tracked using a trust metric that flags certain behaviours as good or bad. When enough bad behaviour accumulates, we can mark the peer as bad and disconnect. For example, when the PEXReactor requests peer network addresses from an already known peer and the returned addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while an excessive amount of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the ADR 006 document.
The trust metric implementation allows a developer to obtain a peer's trust metric from a trust metric store and to track good and bad events relevant to the peer's behavior. At any time, the peer's metric can be queried for a current trust value. The current trust value is calculated with a formula that uses current behavior, previous behavior, and the change between the two. Current behavior is calculated as the percentage of good behavior within a time interval. The time interval is short, probably set between 30 seconds and 5 minutes. The historic data, on the other hand, can estimate a peer's behavior over days' worth of tracking. At the end of a time interval, the current behavior becomes part of the historic data, and a new time interval begins with the good and bad counters reset to zero.
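The interval mechanics described above can be sketched in Go as follows. This is a minimal illustration, not the Tendermint implementation: the type `Metric`, its methods, the weights, the decay factor, and the choice to treat an interval without events as fully good are all assumptions made for the example.

```go
package main

// Illustrative weights only; they are not Tendermint's configured values.
const (
	currentWeight = 0.4 // weight of the current interval
	historyWeight = 0.6 // weight of the historic data
	changeWeight  = 0.2 // extra penalty applied when behaviour worsens
	historyDecay  = 0.8 // share of old history kept at each rollover
)

// Metric is a hypothetical, simplified trust metric for one peer.
type Metric struct {
	good, bad float64 // events seen in the current interval
	history   float64 // smoothed behaviour of past intervals, in [0, 1]
	intervals int     // number of completed intervals
}

// GoodEvents and BadEvents record events in the current interval.
func (m *Metric) GoodEvents(n int) { m.good += float64(n) }
func (m *Metric) BadEvents(n int)  { m.bad += float64(n) }

// current returns the fraction of good events in the current interval.
// An interval with no events counts as fully good (an assumption).
func (m *Metric) current() float64 {
	total := m.good + m.bad
	if total == 0 {
		return 1.0
	}
	return m.good / total
}

// TrustValue combines current behaviour, history, and the change between
// the two. Only a negative change is applied, so a sudden drop in
// behaviour lowers the value faster than an improvement raises it.
func (m *Metric) TrustValue() float64 {
	cur := m.current()
	if m.intervals == 0 {
		return cur // no history yet
	}
	v := currentWeight*cur + historyWeight*m.history
	if change := cur - m.history; change < 0 {
		v += changeWeight * change
	}
	if v < 0 {
		v = 0
	}
	if v > 1 {
		v = 1
	}
	return v
}

// NextInterval folds the current behaviour into the history and resets
// the good and bad counters to zero.
func (m *Metric) NextInterval() {
	cur := m.current()
	if m.intervals == 0 {
		m.history = cur
	} else {
		m.history = historyDecay*m.history + (1-historyDecay)*cur
	}
	m.intervals++
	m.good, m.bad = 0, 0
}
```

In this sketch, a caller would invoke `NextInterval` from a timer that fires once per interval, and `TrustValue` whenever a dialing or eviction decision needs the peer's score.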
These are some important things to keep in mind regarding how the trust metrics handle time intervals and scoring:
Some useful information about the inner workings of the trust metric:
The trust metric capability is now available, but the question remains of how it should be applied throughout Tendermint in order to properly track the quality of peers.
Peers are managed using an address book and a trust metric:
Outbound peers are added to the address book before they are dialed, and inbound peers are added once the peer connection is set up. Peers are also added to the address book when they are received in response to a pexRequestMessage.
While a node has fewer than needAddressThreshold addresses, it will periodically request more, via pexRequestMessage, from randomly selected peers and from newly dialed outbound peers.
When a new address is added to an address book that has more than 0.5*needAddressThreshold addresses, then with some low probability, a randomly chosen low quality peer is removed.
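The threshold and pruning rules above might look like the following in Go. This is a sketch under stated assumptions: `AddrBook`, `knownAddr`, `NewAddrBook`, the prune probability, and the score cutoff that defines "low quality" are hypothetical names and values chosen for illustration, and only needAddressThreshold comes from the text.

```go
package main

import "math/rand"

// knownAddr is a hypothetical address book entry with a trust score.
type knownAddr struct {
	Addr  string
	Score float64 // trust value in [0, 1]
}

// AddrBook is a simplified stand-in for the real address book.
type AddrBook struct {
	needAddressThreshold int
	pruneProb            float64 // the "low probability" of pruning
	lowQualityCutoff     float64 // scores below this count as low quality
	addrs                []knownAddr
	rng                  *rand.Rand
}

// NewAddrBook builds a book; the 0.3 cutoff is an assumption.
func NewAddrBook(threshold int, pruneProb float64, seed int64) *AddrBook {
	return &AddrBook{
		needAddressThreshold: threshold,
		pruneProb:            pruneProb,
		lowQualityCutoff:     0.3,
		rng:                  rand.New(rand.NewSource(seed)),
	}
}

// NeedMoreAddrs reports whether the node should keep requesting
// addresses via pexRequestMessage.
func (b *AddrBook) NeedMoreAddrs() bool {
	return len(b.addrs) < b.needAddressThreshold
}

// Add stores a new address. If the book already holds more than half of
// needAddressThreshold, a randomly chosen low-quality entry is removed
// first with probability pruneProb.
func (b *AddrBook) Add(a knownAddr) {
	if float64(len(b.addrs)) > 0.5*float64(b.needAddressThreshold) &&
		b.rng.Float64() < b.pruneProb {
		b.removeRandomLowQuality()
	}
	b.addrs = append(b.addrs, a)
}

// removeRandomLowQuality drops one random entry whose score is below
// the cutoff; it does nothing if no such entry exists.
func (b *AddrBook) removeRandomLowQuality() {
	var low []int
	for i, a := range b.addrs {
		if a.Score < b.lowQualityCutoff {
			low = append(low, i)
		}
	}
	if len(low) == 0 {
		return
	}
	i := low[b.rng.Intn(len(low))]
	b.addrs = append(b.addrs[:i], b.addrs[i+1:]...)
}
```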
Peers attempt to maintain a minimum number of outbound connections by repeatedly querying the address book for peers to connect to. While a node has few to no outbound connections, the address book is biased to return higher quality peers. As the node increases the number of outbound connections, the address book is biased to return less-vetted or lower-quality peers.
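One way to express this bias is a percentage that grows with the number of outbound connections. The Go sketch below is illustrative only: `Picker`, `NewPicker`, `biasTowardsUnvetted`, the split of the book into two slices, and the linear 10% to 90% mapping are assumptions, not the address book's actual selection logic.

```go
package main

import "math/rand"

// Picker is a hypothetical view of the address book split by quality.
type Picker struct {
	vetted   []string // higher-quality peers
	unvetted []string // less-vetted or lower-quality peers
	rng      *rand.Rand
}

// NewPicker builds a Picker with a seeded random source.
func NewPicker(vetted, unvetted []string, seed int64) *Picker {
	return &Picker{
		vetted:   vetted,
		unvetted: unvetted,
		rng:      rand.New(rand.NewSource(seed)),
	}
}

// biasTowardsUnvetted returns the percentage chance of picking from the
// unvetted set. It rises linearly from 10 to 90 as the node approaches
// its minimum number of outbound connections (an illustrative mapping).
func biasTowardsUnvetted(numOutbound, minOutbound int) int {
	if numOutbound < 0 {
		numOutbound = 0
	}
	if minOutbound <= 0 || numOutbound >= minOutbound {
		return 90
	}
	return 10 + 80*numOutbound/minOutbound
}

// Pick returns an address to dial, or false if the book is empty.
func (p *Picker) Pick(numOutbound, minOutbound int) (string, bool) {
	if len(p.vetted) == 0 && len(p.unvetted) == 0 {
		return "", false
	}
	fromUnvetted := p.rng.Intn(100) < biasTowardsUnvetted(numOutbound, minOutbound)
	// Fall back to whichever set is non-empty.
	if len(p.unvetted) == 0 {
		fromUnvetted = false
	} else if len(p.vetted) == 0 {
		fromUnvetted = true
	}
	if fromUnvetted {
		return p.unvetted[p.rng.Intn(len(p.unvetted))], true
	}
	return p.vetted[p.rng.Intn(len(p.vetted))], true
}
```

With these numbers, a node with no outbound connections picks a vetted peer about nine times out of ten, and a node that has reached its minimum mostly probes unvetted peers.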
Peers also maintain a maximum number of total connections, MaxNumPeers. If a peer already has MaxNumPeers connections, new incoming connections will be accepted with low probability. When such a new connection is accepted, the peer disconnects from a probabilistically chosen low-ranking peer so that it does not exceed MaxNumPeers.
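A possible shape for this admission rule is sketched below in Go. Everything except the MaxNumPeers concept is assumed for illustration: `Switch` here is a toy type rather than the real p2p switch, `acceptProb` stands in for the unspecified "low probability", and eviction weights each peer by how far its score falls below 1.

```go
package main

import "math/rand"

// conn is a hypothetical connected peer with its current trust score.
type conn struct {
	ID    string
	Score float64 // trust value in [0, 1]
}

// Switch is a toy stand-in that only tracks connected peers.
type Switch struct {
	maxNumPeers int
	acceptProb  float64 // chance of accepting an inbound peer when full
	peers       []conn
	rng         *rand.Rand
}

// NewSwitch builds a Switch with a seeded random source.
func NewSwitch(maxNumPeers int, acceptProb float64, seed int64) *Switch {
	return &Switch{
		maxNumPeers: maxNumPeers,
		acceptProb:  acceptProb,
		rng:         rand.New(rand.NewSource(seed)),
	}
}

// AcceptInbound decides whether to keep a new inbound connection. When
// the switch is full, the connection is accepted only with probability
// acceptProb, and a low-ranking peer is evicted to make room.
func (s *Switch) AcceptInbound(c conn) bool {
	if s.maxNumPeers <= 0 {
		return false
	}
	if len(s.peers) >= s.maxNumPeers {
		if s.rng.Float64() >= s.acceptProb {
			return false
		}
		s.evictLowRanking()
	}
	s.peers = append(s.peers, c)
	return true
}

// evictLowRanking removes one peer, chosen with probability proportional
// to (1.01 - Score), so lower-scored peers are more likely to go. The
// 0.01 offset keeps every weight positive.
func (s *Switch) evictLowRanking() {
	if len(s.peers) == 0 {
		return
	}
	total := 0.0
	for _, p := range s.peers {
		total += 1.01 - p.Score
	}
	r := s.rng.Float64() * total
	idx := len(s.peers) - 1
	for i, p := range s.peers {
		r -= 1.01 - p.Score
		if r < 0 {
			idx = i
			break
		}
	}
	s.peers = append(s.peers[:idx], s.peers[idx+1:]...)
}
```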
When a peer receives a pexRequestMessage, it returns a random sample of high quality peers from the address book. Peers with no score or a low score should not be included in a response to pexRequestMessage.
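The selection rule for a response can be sketched as a filter followed by a random sample. In the Go example below, `scoredAddr`, `selectForPEX`, the `minScore` cutoff, and the `max` sample size are hypothetical; the text does not specify what counts as a high score or how many addresses a response carries.

```go
package main

import "math/rand"

// scoredAddr is a hypothetical address book entry.
type scoredAddr struct {
	Addr     string
	Score    float64 // trust value in [0, 1]
	HasScore bool    // false until the peer has been assessed
}

// selectForPEX returns up to max addresses, sampled at random from the
// entries that have a score of at least minScore. Entries with no score
// or a low score are never returned.
func selectForPEX(book []scoredAddr, minScore float64, max int, seed int64) []string {
	if max < 0 {
		max = 0
	}
	var eligible []string
	for _, a := range book {
		if a.HasScore && a.Score >= minScore {
			eligible = append(eligible, a.Addr)
		}
	}
	// Shuffle so that the truncated result is a uniform random sample.
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(eligible), func(i, j int) {
		eligible[i], eligible[j] = eligible[j], eligible[i]
	})
	if len(eligible) > max {
		eligible = eligible[:max]
	}
	return eligible
}
```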
Peer quality is tracked in the connection and across the reactors by storing the TrustMetric in the peer's thread-safe Data store.
Peer behaviour is then defined as one of the following:
Note that Fatal behaviour causes us to remove the peer, and neutral behaviour does not affect the score.
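A reactor might apply a reported behaviour to the stored metric along the following lines. This Go sketch makes several assumptions: only Fatal and neutral behaviour are named in the note above, so `Bad` and `Good` are added here to mirror the metric's bad and good events; `PeerData`, `trustMetricKey`, `Report`, and the minimal `TrustMetric` type are hypothetical stand-ins for the peer's data store and the real metric.

```go
package main

import "sync"

// Behaviour classifies something a peer did. Bad and Good are assumed
// categories; Fatal and Neutral follow the note in the text.
type Behaviour int

const (
	Fatal Behaviour = iota
	Bad
	Neutral
	Good
)

// TrustMetric is a minimal stand-in that only counts events.
type TrustMetric struct {
	mtx       sync.Mutex
	good, bad int
}

func (tm *TrustMetric) GoodEvents(n int) {
	tm.mtx.Lock()
	defer tm.mtx.Unlock()
	tm.good += n
}

func (tm *TrustMetric) BadEvents(n int) {
	tm.mtx.Lock()
	defer tm.mtx.Unlock()
	tm.bad += n
}

// Counts returns the good and bad event totals.
func (tm *TrustMetric) Counts() (good, bad int) {
	tm.mtx.Lock()
	defer tm.mtx.Unlock()
	return tm.good, tm.bad
}

// PeerData is a hypothetical thread-safe key-value store on a peer.
type PeerData struct {
	mtx  sync.RWMutex
	data map[string]interface{}
}

func NewPeerData() *PeerData {
	return &PeerData{data: make(map[string]interface{})}
}

func (d *PeerData) Set(key string, v interface{}) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	d.data[key] = v
}

func (d *PeerData) Get(key string) interface{} {
	d.mtx.RLock()
	defer d.mtx.RUnlock()
	return d.data[key]
}

// trustMetricKey is an assumed key name for the stored metric.
const trustMetricKey = "trustMetric"

// Report applies a behaviour to the peer's metric and returns true when
// the peer should be removed. Fatal behaviour always requests removal;
// Neutral behaviour leaves the score untouched.
func Report(d *PeerData, b Behaviour) (remove bool) {
	if b == Fatal {
		return true
	}
	tm, ok := d.Get(trustMetricKey).(*TrustMetric)
	if !ok {
		return false // no metric stored for this peer
	}
	switch b {
	case Bad:
		tm.BadEvents(1)
	case Good:
		tm.GoodEvents(1)
	}
	return false
}
```

Because the metric lives in the peer's own data store, any reactor holding the peer can report behaviour without sharing additional state.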
Proposed.