From 7563870d11593c5c56cef3ab151318f878b73340 Mon Sep 17 00:00:00 2001
From: caffix
Date: Sun, 26 Nov 2017 17:27:29 -0500
Subject: [PATCH 1/3] added the trust metric usage guide

---
 .../adr-007-trust-metric-usage.md             | 56 +++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 docs/architecture/adr-007-trust-metric-usage.md

diff --git a/docs/architecture/adr-007-trust-metric-usage.md b/docs/architecture/adr-007-trust-metric-usage.md
new file mode 100644
index 000000000..81a9dc2f0
--- /dev/null
+++ b/docs/architecture/adr-007-trust-metric-usage.md
@@ -0,0 +1,56 @@
+# ADR 007: Trust Metric Usage Guide
+
+## Context
+
+The Tendermint project developers would like to improve Tendermint security and reliability by keeping track of the quality that peers have demonstrated. This way, undesirable outcomes from peers will not immediately result in them being dropped from the network (potentially causing drastic changes). Instead, peers behavior can be monitored with appropriate metrics and be removed from the network once Tendermint is certain the peer is a threat. For example, when the PEXReactor makes a request for peers network addresses from an already known peer, and the returned network addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while excessive amounts of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the [ADR 006](adr-006-trust-metric.md) document.
+
+The trust metric implementation allows a developer to obtain a peer's trust metric from a trust metric store, and track good and bad events relevant to a peer's behavior, and at any time, the peer's metric can be queried for a current trust value. The current trust value is calculated with a formula that utilizes current behavior, previous behavior, and change between the two. Current behavior is calculated as the percentage of good behavior within a time interval. The time interval is short; probably set between 30 seconds and 5 minutes. On the other hand, the historic data can estimate a peer's behavior over days worth of tracking. At the end of a time interval, the current behavior becomes part of the historic data, and a new time interval begins with the good and bad counters reset to zero.
+
+If a peer is inactive since the beginning of a time interval, the behavior for that time interval is considered to be untainted. Put another way, the trust value for a peer degrades from a perfect score as bad events are tracked.
+
+Some useful information about the inner workings of the trust metric:
+- When a trust metric is first instantiated, a timer (ticker) periodically fires in order to handle transitions between trust metric time intervals
+- If a peer become disconnected from a node, the timer should be paused, since the node is no longer having direct experiences with that peer
+- The ability to pause the metric is supported with the store **PeerDisconnected** method and the metric **Pause** method
+- After a pause, if a good or bad event method is called on a metric, it automatically becomes unpaused and begins a new time interval.
+
+## Decision
+
+The trust metric capability is now available, yet, it still leaves the question of how should it be applied throughout Tendermint in order to properly track the quality of peers?
+
+### Proposed Process
+
+Peers are managed using an address book and a trust metric:
+
+- The address book keeps a record of peers and provides selection methods
+- The trust metric tracks the quality of the peers
+
+When we need more peers, we pick them randomly from the address book's selection method. When we're asked for peers, we provide a random selection with no bias:
+
+- The address book's selection method will perform peer ranking based on trust metric scores
+- If we need to make room for a new peer, we remove the peer with the lowest trust metric score
+
+Peer quality is tracked in the connection and across the reactors, and behaviors are defined as one of the following:
+- Fatal - something outright malicious that causes us to disconnect the peer and remember it
+- Bad - Any kind of timeout, messages that don't unmarshal, fail other validity checks, or messages we didn't ask for or aren't expecting (usually worth one bad event)
+- Neutral - Unknown channels/message types/version upgrades (no good or bad events recorded)
+- Correct - Normal correct behavior (worth one good event)
+- Good - some random majority of peers per reactor sending us useful messages (worth more than one good event).
+
+## Status
+
+Proposed.
+
+## Consequences
+
+### Positive
+
+- Bringing the address book and trust metric store together will cause the network to be built in a way that encourages greater security and reliability.
+
+### Negative
+
+- TBD
+
+### Neutral
+
+- Keep in mind that, good events need to be recorded just as bad events do using this implementation.
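
Reviewer sketch (not part of the patch): the five behaviour classes listed in the ADR above could be mapped to trust metric events roughly as follows. Every name here (`Behaviour`, `Outcome`, `classify`, `goodBonus`) is hypothetical, and the weight of 2 for Good is an assumption, since the ADR only says "more than one good event".

```go
package main

import "fmt"

// Behaviour is a hypothetical enum for the five classes named in the ADR.
type Behaviour int

const (
	Fatal Behaviour = iota
	Bad
	Neutral
	Correct
	Good
)

// Outcome is a hypothetical summary of what a reactor would record for a peer.
type Outcome struct {
	GoodEvents int
	BadEvents  int
	Disconnect bool
}

// goodBonus is an assumed weight; the ADR does not give a number.
const goodBonus = 2

// classify turns a behaviour class into the events the ADR describes.
func classify(b Behaviour) Outcome {
	switch b {
	case Fatal:
		return Outcome{Disconnect: true} // disconnect and remember the peer
	case Bad:
		return Outcome{BadEvents: 1} // usually worth one bad event
	case Correct:
		return Outcome{GoodEvents: 1} // worth one good event
	case Good:
		return Outcome{GoodEvents: goodBonus} // worth more than one good event
	default:
		return Outcome{} // Neutral: no good or bad events recorded
	}
}

func main() {
	for _, b := range []Behaviour{Fatal, Bad, Neutral, Correct, Good} {
		fmt.Printf("%d -> %+v\n", b, classify(b))
	}
}
```

Keeping the mapping in one function would let each reactor report a class and leave the event weights in a single place, though the ADR does not prescribe this layout.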

From 4e08ee1833f3ff85d12d9efbc516103978aea4df Mon Sep 17 00:00:00 2001
From: caffix
Date: Tue, 28 Nov 2017 14:48:14 -0500
Subject: [PATCH 2/3] made clarifications based on odeke-em's PR comments

---
 docs/architecture/adr-007-trust-metric-usage.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/architecture/adr-007-trust-metric-usage.md b/docs/architecture/adr-007-trust-metric-usage.md
index 81a9dc2f0..4db0456c4 100644
--- a/docs/architecture/adr-007-trust-metric-usage.md
+++ b/docs/architecture/adr-007-trust-metric-usage.md
@@ -2,15 +2,18 @@
 
 ## Context
 
-The Tendermint project developers would like to improve Tendermint security and reliability by keeping track of the quality that peers have demonstrated. This way, undesirable outcomes from peers will not immediately result in them being dropped from the network (potentially causing drastic changes). Instead, peers behavior can be monitored with appropriate metrics and be removed from the network once Tendermint is certain the peer is a threat. For example, when the PEXReactor makes a request for peers network addresses from an already known peer, and the returned network addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while excessive amounts of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the [ADR 006](adr-006-trust-metric.md) document.
+The Tendermint project developers would like to improve Tendermint security and reliability by keeping track of the quality that peers have demonstrated. This way, undesirable outcomes from peers will not immediately result in them being dropped from the network (potentially causing drastic changes). Instead, a peer's behavior can be monitored with appropriate metrics and can be removed from the network once Tendermint is certain the peer is a threat. For example, when the PEXReactor makes a request for peers network addresses from an already known peer, and the returned network addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while excessive amounts of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the [ADR 006](adr-006-trust-metric.md) document.
 
 The trust metric implementation allows a developer to obtain a peer's trust metric from a trust metric store, and track good and bad events relevant to a peer's behavior, and at any time, the peer's metric can be queried for a current trust value. The current trust value is calculated with a formula that utilizes current behavior, previous behavior, and change between the two. Current behavior is calculated as the percentage of good behavior within a time interval. The time interval is short; probably set between 30 seconds and 5 minutes. On the other hand, the historic data can estimate a peer's behavior over days worth of tracking. At the end of a time interval, the current behavior becomes part of the historic data, and a new time interval begins with the good and bad counters reset to zero.
 
-If a peer is inactive since the beginning of a time interval, the behavior for that time interval is considered to be untainted. Put another way, the trust value for a peer degrades from a perfect score as bad events are tracked.
+These are some important things to keep in mind regarding how the trust metrics handle time intervals and scoring:
+- Each new time interval begins with a perfect score
+- Bad events quickly bring the score down and good events cause the score to slowly rise
+- When the time interval is over, the percentage of good events becomes historic data.
 
 Some useful information about the inner workings of the trust metric:
 - When a trust metric is first instantiated, a timer (ticker) periodically fires in order to handle transitions between trust metric time intervals
-- If a peer become disconnected from a node, the timer should be paused, since the node is no longer having direct experiences with that peer
+- If a peer is disconnected from a node, the timer should be paused, since the node is no longer connected to that peer
 - The ability to pause the metric is supported with the store **PeerDisconnected** method and the metric **Pause** method
 - After a pause, if a good or bad event method is called on a metric, it automatically becomes unpaused and begins a new time interval.
 

From a37c1143ca15da7da821cfd920a371f2d71be25b Mon Sep 17 00:00:00 2001
From: Ethan Buchman
Date: Sun, 10 Dec 2017 19:00:44 -0500
Subject: [PATCH 3/3] adr: update 007 trust metric usage

---
 .../adr-007-trust-metric-usage.md             | 58 ++++++++++++++++---
 p2p/pex_reactor.go                            |  3 +-
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/docs/architecture/adr-007-trust-metric-usage.md b/docs/architecture/adr-007-trust-metric-usage.md
index 4db0456c4..4d833a69f 100644
--- a/docs/architecture/adr-007-trust-metric-usage.md
+++ b/docs/architecture/adr-007-trust-metric-usage.md
@@ -2,7 +2,17 @@
 
 ## Context
 
-The Tendermint project developers would like to improve Tendermint security and reliability by keeping track of the quality that peers have demonstrated. This way, undesirable outcomes from peers will not immediately result in them being dropped from the network (potentially causing drastic changes). Instead, a peer's behavior can be monitored with appropriate metrics and can be removed from the network once Tendermint is certain the peer is a threat. For example, when the PEXReactor makes a request for peers network addresses from an already known peer, and the returned network addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while excessive amounts of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the [ADR 006](adr-006-trust-metric.md) document.
+Tendermint is required to monitor peer quality in order to inform its peer dialing and peer exchange strategies.
+
+When a node first connects to the network, it is important that it can quickly find good peers.
+Thus, while a node has fewer connections, it should prioritize connecting to higher quality peers.
+As the node becomes well connected to the rest of the network, it can dial lesser known or lesser
+quality peers and help assess their quality. Similarly, when queried for peers, a node should make
+sure it doesn't return low quality peers.
+
+Peer quality can be tracked using a trust metric that flags certain behaviours as good or bad. When enough
+bad behaviour accumulates, we can mark the peer as bad and disconnect.
+For example, when the PEXReactor makes a request for peers network addresses from an already known peer, and the returned network addresses are unreachable, this undesirable behavior should be tracked. Returning a few bad network addresses probably shouldn’t cause a peer to be dropped, while excessive amounts of this behavior does qualify the peer for removal. The originally proposed approach and design document for the trust metric can be found in the [ADR 006](adr-006-trust-metric.md) document.
 
 The trust metric implementation allows a developer to obtain a peer's trust metric from a trust metric store, and track good and bad events relevant to a peer's behavior, and at any time, the peer's metric can be queried for a current trust value. The current trust value is calculated with a formula that utilizes current behavior, previous behavior, and change between the two. Current behavior is calculated as the percentage of good behavior within a time interval. The time interval is short; probably set between 30 seconds and 5 minutes. On the other hand, the historic data can estimate a peer's behavior over days worth of tracking. At the end of a time interval, the current behavior becomes part of the historic data, and a new time interval begins with the good and bad counters reset to zero.
 
@@ -19,7 +29,7 @@ Some useful information about the inner workings of the trust metric:
 
 ## Decision
 
-The trust metric capability is now available, yet, it still leaves the question of how should it be applied throughout Tendermint in order to properly track the quality of peers?
+The trust metric capability is now available, yet, it still leaves the question of how should it be applied throughout Tendermint in order to properly track the quality of peers?
 
 ### Proposed Process
 
@@ -28,18 +38,52 @@ Peers are managed using an address book and a trust metric:
 - The address book keeps a record of peers and provides selection methods
 - The trust metric tracks the quality of the peers
 
-When we need more peers, we pick them randomly from the address book's selection method. When we're asked for peers, we provide a random selection with no bias:
+#### Presence in Address Book
+
+Outbound peers are added to the address book before they are dialed,
+and inbound peers are added once the peer connection is set up.
+Peers are also added to the address book when they are received in response to
+a pexRequestMessage.
+
+While a node has fewer than `needAddressThreshold` addresses, it will periodically request more,
+via pexRequestMessage, from randomly selected peers and from newly dialed outbound peers.
+
+When a new address is added to an address book that has more than `0.5*needAddressThreshold` addresses,
+then with some low probability, a randomly chosen low quality peer is removed.
+
+#### Outbound Peers
+
+Peers attempt to maintain a minimum number of outbound connections by
+repeatedly querying the address book for peers to connect to.
+While a node has few to no outbound connections, the address book is biased to return
+higher quality peers. As the node increases the number of outbound connections,
+the address book is biased to return less-vetted or lower-quality peers.
 
-- The address book's selection method will perform peer ranking based on trust metric scores
-- If we need to make room for a new peer, we remove the peer with the lowest trust metric score
+#### Inbound Peers
 
-Peer quality is tracked in the connection and across the reactors, and behaviors are defined as one of the following:
-- Fatal - something outright malicious that causes us to disconnect the peer and remember it
+Peers also maintain a maximum number of total connections, MaxNumPeers.
+If a peer has MaxNumPeers, new incoming connections will be accepted with low probability.
+When such a new connection is accepted, the peer disconnects from a probabilistically chosen low ranking peer
+so it does not exceed MaxNumPeers.
+
+#### Peer Exchange
+
+When a peer receives a pexRequestMessage, it returns a random sample of high quality peers from the address book. Peers with no score or low score should not be included in a response to pexRequestMessage.
+
+#### Peer Quality
+
+Peer quality is tracked in the connection and across the reactors by storing the TrustMetric in the peer's
+thread safe Data store.
+
+Peer behaviour is then defined as one of the following:
+- Fatal - something outright malicious that causes us to disconnect the peer and ban it from the address book for some amount of time
 - Bad - Any kind of timeout, messages that don't unmarshal, fail other validity checks, or messages we didn't ask for or aren't expecting (usually worth one bad event)
 - Neutral - Unknown channels/message types/version upgrades (no good or bad events recorded)
 - Correct - Normal correct behavior (worth one good event)
 - Good - some random majority of peers per reactor sending us useful messages (worth more than one good event).
 
+Note that Fatal behaviour causes us to remove the peer, and neutral behaviour does not affect the score.
+
 ## Status
 
 Proposed.
diff --git a/p2p/pex_reactor.go b/p2p/pex_reactor.go
index 6e49f6d06..960c8c641 100644
--- a/p2p/pex_reactor.go
+++ b/p2p/pex_reactor.go
@@ -20,7 +20,7 @@ const (
 	minNumOutboundPeers = 10
 	maxPexMessageSize = 1048576 // 1MB
 
-	// maximum messages one peer can send to us during `msgCountByPeerFlushInterval`
+	// maximum pex messages one peer can send to us during `msgCountByPeerFlushInterval`
 	defaultMaxMsgCountByPeer = 1000
 	msgCountByPeerFlushInterval = 1 * time.Hour
 )
@@ -247,6 +247,7 @@ func (r *PEXReactor) ensurePeers() {
 
 	// bias to prefer more vetted peers when we have fewer connections.
 	// not perfect, but somewhate ensures that we prioritize connecting to more-vetted
+	// NOTE: range here is [10, 90]. Too high ?
 	newBias := cmn.MinInt(numOutPeers, 8)*10 + 10
 
 	toDial := make(map[string]*NetAddress)
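
Reviewer sketch (not part of the patch): the `[10, 90]` range claimed in the new NOTE comment can be checked by evaluating the `newBias` expression on its own. The helper `minInt` below is a local stand-in for `cmn.MinInt`, assumed to return the smaller of two ints, and the function name `newBias` is borrowed from the variable in `ensurePeers` for illustration only. How the value is consumed by the address book is outside this hunk.

```go
package main

import "fmt"

// minInt is a local stand-in for cmn.MinInt so the sketch has no
// external dependencies (assumption: it returns the smaller argument).
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// newBias mirrors the expression from ensurePeers: it starts at 10 with
// no outbound peers and stops growing once there are 8 or more.
func newBias(numOutPeers int) int {
	return minInt(numOutPeers, 8)*10 + 10
}

func main() {
	for _, n := range []int{0, 1, 4, 8, 10} {
		fmt.Printf("numOutPeers=%d newBias=%d\n", n, newBias(n))
	}
}
```

For non-negative peer counts the result steps from 10 to 90 in increments of 10, which matches the range stated in the NOTE.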