|
@ -0,0 +1,216 @@ |
|
|
|
|
|
--- |
|
|
|
|
|
order: 3 |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
# Proposer-Based Timestamps Runbook |
|
|
|
|
|
|
|
|
|
|
|
Version v0.36 of Tendermint added new constraints for the timestamps included in |
|
|
|
|
|
each block created by Tendermint. The new constraints mean that validators may |
|
|
|
|
|
fail to produce valid blocks or may issue `nil` `prevotes` for proposed blocks |
|
|
|
|
|
depending on the configuration of the validator's local clock. |
|
|
|
|
|
|
|
|
|
|
|
## What is this document for? |
|
|
|
|
|
|
|
|
|
|
|
This document provides a set of actionable steps for application developers and |
|
|
|
|
|
node operators to diagnose and fix issues related to clock synchronization and |
|
|
|
|
|
configuration of the Proposer-Based Timestamps [SynchronyParams](https://github.com/tendermint/tendermint/blob/master/spec/core/data_structures.md#synchronyparams). |
|
|
|
|
|
|
|
|
|
|
|
Use this runbook if you observe that validators are frequently voting `nil` for a block that the rest |
|
|
|
|
|
of the network votes for or if validators are frequently producing block proposals |
|
|
|
|
|
that are not voted for by the rest of the network. |
|
|
|
|
|
|
|
|
|
|
|
## Requirements |
|
|
|
|
|
|
|
|
|
|
|
To use this runbook, you must be running a node that has the [Prometheus metrics endpoint enabled](https://github.com/tendermint/tendermint/blob/master/docs/nodes/metrics.md) |
|
|
|
|
|
and the Tendermint RPC endpoint enabled and accessible. |
|
|
|
|
|
|
|
|
|
|
|
It is strongly recommended to also run a Prometheus metrics collector to gather and |
|
|
|
|
|
analyze metrics from the Tendermint node. |
|
|
|
|
|
|
|
|
|
|
|
## Debugging a Single Node |
|
|
|
|
|
|
|
|
|
|
|
If you observe that a single validator is frequently failing to produce blocks or |
|
|
|
|
|
voting nil for proposals that other validators vote for and suspect it may be |
|
|
|
|
|
related to clock synchronization, use the following steps to debug and correct the issue. |
|
|
|
|
|
|
|
|
|
|
|
### Check Timely Metric |
|
|
|
|
|
|
|
|
|
|
|
Tendermint exposes a histogram metric for the difference between the timestamp in the proposal |
|
|
|
|
|
the and the time read from the node's local clock when the proposal is received. |
|
|
|
|
|
|
|
|
|
|
|
The histogram exposes multiple metrics on the Prometheus `/metrics` endpoint called |
|
|
|
|
|
* `tendermint_consensus_proposal_timestamp_difference_bucket`. |
|
|
|
|
|
* `tendermint_consensus_proposal_timestamp_difference_sum`. |
|
|
|
|
|
* `tendermint_consensus_proposal_timestamp_difference_count`. |
|
|
|
|
|
|
|
|
|
|
|
Each metric is also labeled with the key `is_timely`, which can have a value of |
|
|
|
|
|
`true` or `false`. |
|
|
|
|
|
|
|
|
|
|
|
#### From the Prometheus Collector UI |
|
|
|
|
|
|
|
|
|
|
|
If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab. |
|
|
|
|
|
|
|
|
|
|
|
Issue a query for the following: |
|
|
|
|
|
|
|
|
|
|
|
``` |
|
|
|
|
|
tendermint_consensus_proposal_timestamp_difference_count{is_timely="false"} / |
|
|
|
|
|
tendermint_consensus_proposal_timestamp_difference_count{is_timely="true"} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
This query will graph the ratio of proposals the node considered timely to those it |
|
|
|
|
|
considered untimely. If the ratio is increasing, it means that your node is consistently |
|
|
|
|
|
seeing more proposals that are far from its local clock. If this is the case, you should |
|
|
|
|
|
check to make sure your local clock is properly synchronized to NTP. |
|
|
|
|
|
|
|
|
|
|
|
#### From the `/metrics` url |
|
|
|
|
|
|
|
|
|
|
|
If you are not running a Prometheus collector, navigate to the `/metrics` endpoint |
|
|
|
|
|
exposed on the Prometheus metrics port with `curl` or a browser. |
|
|
|
|
|
|
|
|
|
|
|
Search for the `tendermint_consensus_proposal_timestamp_difference_count` metrics. |
|
|
|
|
|
This metric is labeled with `is_timely`. Investigate the value of |
|
|
|
|
|
`tendermint_consensus_proposal_timestamp_difference_count` where `is_timely="false"` |
|
|
|
|
|
and where `is_timely="true"`. Refresh the endpoint and observe if the value of `is_timely="false"` |
|
|
|
|
|
is growing. |
|
|
|
|
|
|
|
|
|
|
|
If you observe that `is_timely="false"` is growing, it means that your node is consistently |
|
|
|
|
|
seeing proposals that are far from its local clock. If this is the case, you should check |
|
|
|
|
|
to make sure your local clock is properly synchronized to NTP. |
|
|
|
|
|
|
|
|
|
|
|
### Checking Clock Sync |
|
|
|
|
|
|
|
|
|
|
|
NTP configuration and tooling is very specific to the operating system and distribution |
|
|
|
|
|
that your validator node is running. This guide assumes you have `timedatectl` installed with |
|
|
|
|
|
[chrony](https://chrony.tuxfamily.org/), a popular tool for interacting with time |
|
|
|
|
|
synchronization on Linux distributions. If you are using an operating system or |
|
|
|
|
|
distribution with a different time synchronization mechanism, please consult the |
|
|
|
|
|
documentation for your operating system to check the status and re-synchronize the daemon. |
|
|
|
|
|
|
|
|
|
|
|
#### Check if NTP is Enabled |
|
|
|
|
|
|
|
|
|
|
|
```shell |
|
|
|
|
|
$ timedatectl |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
From the output, ensure that `NTP service` is `active`. If `NTP service` is `inactive`, run: |
|
|
|
|
|
|
|
|
|
|
|
```shell |
|
|
|
|
|
$ timedatectl set-ntp true |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
Re-run the `timedatectl` command and verify that the change has taken effect. |
|
|
|
|
|
|
|
|
|
|
|
#### Check if Your NTP Daemon is Synchronized |
|
|
|
|
|
|
|
|
|
|
|
Check the status of your local `chrony` NTP daemon using by running the following: |
|
|
|
|
|
|
|
|
|
|
|
```shell |
|
|
|
|
|
$ chronyc tracking |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
If the `chrony` daemon is running, you will see output that indicates its current status. |
|
|
|
|
|
If the `chrony` daemon is not running, restart it and re-run `chronyc tracking`. |
|
|
|
|
|
|
|
|
|
|
|
The `System time` field of the response should show a value that is much smaller than 100 |
|
|
|
|
|
milliseconds. |
|
|
|
|
|
|
|
|
|
|
|
If the value is very large, restart the `chronyd` daemon. |
|
|
|
|
|
|
|
|
|
|
|
## Debugging a Network |
|
|
|
|
|
|
|
|
|
|
|
If you observe that a network is frequently failing to produce blocks and suspect |
|
|
|
|
|
it may be related to clock synchronization, use the following steps to debug and correct the issue. |
|
|
|
|
|
|
|
|
|
|
|
### Check Prevote Message Delay |
|
|
|
|
|
|
|
|
|
|
|
Tendermint exposes metrics that help determine how synchronized the clocks on a network are. |
|
|
|
|
|
|
|
|
|
|
|
These metrics are visible on the Prometheus `/metrics` endpoint and are called: |
|
|
|
|
|
* `tendermint_consensus_quorum_prevote_delay` |
|
|
|
|
|
* `tendermint_consensus_full_prevote_delay` |
|
|
|
|
|
|
|
|
|
|
|
These metrics calculate the difference between the timestamp in the proposal message and |
|
|
|
|
|
the timestamp of a prevote that was issued during consensus. |
|
|
|
|
|
|
|
|
|
|
|
The `tendermint_consensus_quorum_prevote_delay` metric is the interval in seconds |
|
|
|
|
|
between the proposal timestamp and the timestamp of the earliest prevote that |
|
|
|
|
|
achieved a quorum during the prevote step. |
|
|
|
|
|
|
|
|
|
|
|
The `tendermint_consensus_full_prevote_delay` metric is the interval in seconds |
|
|
|
|
|
between the proposal timestamp and the timestamp of the latest prevote in a round |
|
|
|
|
|
where 100% of the validators voted. |
|
|
|
|
|
|
|
|
|
|
|
#### From the Prometheus Collector UI |
|
|
|
|
|
|
|
|
|
|
|
If you are running a Prometheus collector, navigate to the query web interface and select the 'Graph' tab. |
|
|
|
|
|
|
|
|
|
|
|
Issue a query for the following: |
|
|
|
|
|
|
|
|
|
|
|
``` |
|
|
|
|
|
sum(tendermint_consensus_quorum_prevote_delay) by (proposer_address) |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
This query will graph the difference in seconds for each proposer on the network. |
|
|
|
|
|
|
|
|
|
|
|
If the value is much larger for some proposers, then the issue is likely related to the clock |
|
|
|
|
|
synchronization of their nodes. Contact those proposers and ensure that their nodes |
|
|
|
|
|
are properly connected to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node). |
|
|
|
|
|
|
|
|
|
|
|
If the value is relatively similar for all proposers you should next compare this |
|
|
|
|
|
value to the `SynchronyParams` values for the network. Continue to the [Checking |
|
|
|
|
|
Sychrony](#checking-synchrony) steps. |
|
|
|
|
|
|
|
|
|
|
|
#### From the `/metrics` url |
|
|
|
|
|
|
|
|
|
|
|
If you are not running a Prometheus collector, navigate to the `/metrics` endpoint |
|
|
|
|
|
exposed on the Prometheus metrics port. |
|
|
|
|
|
|
|
|
|
|
|
Search for the `tendermint_consensus_quorum_prevote_delay` metric. There will be one |
|
|
|
|
|
entry of this metric for each `proposer_address`. If the value of this metric is |
|
|
|
|
|
much larger for some proposers, then the issue is likely related to synchronization of their |
|
|
|
|
|
nodes with NTP. Contact those proposers and ensure that their nodes are properly connected |
|
|
|
|
|
to NTP using the steps for [Debugging a Single Node](#debugging-a-single-node). |
|
|
|
|
|
|
|
|
|
|
|
If the values are relatively similar for all proposers you should next compare, |
|
|
|
|
|
you'll need to compare this value to the `SynchronyParams` for the network. Continue |
|
|
|
|
|
to the [Checking Sychrony](#checking-synchrony) steps. |
|
|
|
|
|
|
|
|
|
|
|
### Checking Synchrony |
|
|
|
|
|
|
|
|
|
|
|
To determine the currently configured `SynchronyParams` for your network, issue a |
|
|
|
|
|
request to your node's RPC endpoint. For a node running locally with the RPC server |
|
|
|
|
|
exposed on port `26657`, run the following command: |
|
|
|
|
|
|
|
|
|
|
|
```shell |
|
|
|
|
|
$ curl localhost:26657/consensus_params |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
The json output will contain a field named `synchrony`, with the following structure: |
|
|
|
|
|
|
|
|
|
|
|
```json |
|
|
|
|
|
{ |
|
|
|
|
|
"precision": "500000000", |
|
|
|
|
|
"message_delay": "3000000000" |
|
|
|
|
|
} |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
The `precision` and `message_delay` values returned are listed in nanoseconds: |
|
|
|
|
|
In the examples above, the precision is 500ms and the message delay is 3s. |
|
|
|
|
|
Remember, `tendermint_consensus_quorum_prevote_delay` is listed in seconds. |
|
|
|
|
|
If the `tendermint_consensus_quorum_prevote_delay` value approaches the sum of `precision` and `message_delay`, |
|
|
|
|
|
then the value selected for these parameters is too small. Your application will |
|
|
|
|
|
need to be modified to update the `SynchronyParams` to have larger values. |
|
|
|
|
|
|
|
|
|
|
|
### Updating SynchronyParams |
|
|
|
|
|
|
|
|
|
|
|
The `SynchronyParams` are `ConsensusParameters` which means they are set and updated |
|
|
|
|
|
by the application running alongside Tendermint. Updates to these parameters must |
|
|
|
|
|
be passed to the application during the `FinalizeBlock` ABCI method call. |
|
|
|
|
|
|
|
|
|
|
|
If the application was built using the CosmosSDK, then these parameters can be updated |
|
|
|
|
|
programatically using a governance proposal. For more information, see the [CosmosSDK |
|
|
|
|
|
documentation](https://hub.cosmos.network/main/governance/submitting.html#sending-the-transaction-that-submits-your-governance-proposal). |
|
|
|
|
|
|
|
|
|
|
|
If the application does not implement a way to update the consensus parameters |
|
|
|
|
|
programatically, then the application itself must be updated to do so. More information on updating |
|
|
|
|
|
the consensus parameters via ABCI can be found in the [FinalizeBlock documentation](https://github.com/tendermint/tendermint/blob/master/spec/abci++/abci++_methods_002_draft.md#finalizeblock). |