# ADR 010: Monitoring

## Changelog

08-06-2018: Initial draft

## Context

To bring more visibility into Tendermint, we would like it to report
metrics and, maybe later, traces of transactions and RPC queries. See
https://github.com/tendermint/tendermint/issues/986.

A few solutions were considered:

1. [Prometheus](https://prometheus.io)
   a) Prometheus API
   b) [go-kit metrics package](https://github.com/go-kit/kit/tree/master/metrics) as an interface plus Prometheus
   c) [telegraf](https://github.com/influxdata/telegraf)
   d) new service, which will listen to events emitted by pubsub and report metrics
2. [OpenCensus](https://opencensus.io/go/index.html)

### 1. Prometheus

Prometheus seems to be the most popular product out there for monitoring. It has
a Go client library, a powerful query language, and alerting.

**a) Prometheus API**

We can commit to using Prometheus in Tendermint, but I think Tendermint users
should be free to choose whatever monitoring tool they feel will better suit
their needs (if they don't already have one). So we should try to keep the
metrics interface abstract enough that people can switch between Prometheus and
other similar tools.

**b) go-kit metrics package as an interface**

The go-kit metrics package provides a set of uniform interfaces for service
instrumentation and offers adapters to popular metrics packages:

https://godoc.org/github.com/go-kit/kit/metrics#pkg-subdirectories

Compared to using the Prometheus API directly, we lose some customisability and
control, but gain the freedom to choose any instrument from the above list,
provided we extract metrics creation into a separate function (see "providers"
in node/node.go).

**c) telegraf**

Unlike the options already discussed, telegraf does not require modifying
Tendermint source code. You create something called an input plugin, which polls
Tendermint RPC every second and calculates the metrics itself.

While this may sound good, some metrics we want to report are not exposed via
RPC or pubsub and therefore can't be accessed externally.

**d) service, listening to pubsub**

Same issue as the above: some of the metrics we want to report are not exposed
via pubsub.

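To make option b) more concrete, here is a minimal sketch of the "interface plus provider function" idea using only the standard library. The `Counter` and `Gauge` interfaces are hypothetical and only loosely modelled on the shape of go-kit's metrics interfaces; `Metrics`, `MetricsProvider`, `memValue`, and `recordBlock` are likewise illustrative names, not existing Tendermint or go-kit code.

```go
package main

import "sync"

// Counter and Gauge are hypothetical, minimal interfaces standing in for
// the uniform interfaces a package like go-kit metrics would provide.
type Counter interface {
	Add(delta float64)
}

type Gauge interface {
	Set(value float64)
}

// Metrics groups the instruments a component reports (illustrative fields).
type Metrics struct {
	Height Counter
	Peers  Gauge
}

// MetricsProvider creates the Metrics for a node. Swapping the provider
// swaps the backend without touching the instrumented code.
type MetricsProvider func() *Metrics

// nopCounter and nopGauge discard every observation.
type nopCounter struct{}

func (nopCounter) Add(float64) {}

type nopGauge struct{}

func (nopGauge) Set(float64) {}

// NopMetricsProvider returns instruments that do nothing, for nodes that
// run with monitoring disabled.
func NopMetricsProvider() *Metrics {
	return &Metrics{Height: nopCounter{}, Peers: nopGauge{}}
}

// memValue is an in-memory instrument used here in place of a real backend
// such as Prometheus. It satisfies both Counter and Gauge.
type memValue struct {
	mu sync.Mutex
	v  float64
}

func (m *memValue) Add(delta float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.v += delta
}

func (m *memValue) Set(value float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.v = value
}

// Value returns the current value of the instrument.
func (m *memValue) Value() float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.v
}

// InMemoryMetricsProvider returns instruments that keep their values in memory.
func InMemoryMetricsProvider() *Metrics {
	return &Metrics{Height: &memValue{}, Peers: &memValue{}}
}

// recordBlock is the instrumented code: it depends only on the interfaces.
func recordBlock(m *Metrics, numPeers int) {
	m.Height.Add(1)
	m.Peers.Set(float64(numPeers))
}
```

To try it, call `recordBlock(InMemoryMetricsProvider(), 3)` from a `main` function. The instrumented code never names a backend, so moving to a different monitoring tool would only mean writing another provider.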
### 2. OpenCensus

OpenCensus provides both metrics and tracing, which may be important in the
future. Its API looks different from go-kit and Prometheus, but it appears to
cover everything we need.

Unfortunately, the OpenCensus Go client does not define any
interfaces, so if we want to abstract away metrics we
will need to write interfaces ourselves.

### List of metrics

|   | Name                                    | Type    | Description                                                                         |
| - | --------------------------------------- | ------- | ----------------------------------------------------------------------------------- |
| A | height                                  | Counter |                                                                                     |
| A | validators:<height>                     | Gauge   | Number of validators who signed                                                     |
| A | missing_validators:<height>             | Gauge   | Number of validators who did not sign                                               |
| A | byzantine_validators:<height>           | Gauge   | Number of validators who tried to double sign                                       |
| A | block_interval                          | Timing  | Time between this and last block (Block.Header.Time)                                |
|   | block_time                              | Timing  | Time to create a block (from creating a proposal to commit)                         |
|   | time_between_blocks                     | Timing  | Time between committing the last block and receiving or creating the next proposal |
| A | rounds:<height>                         | Counter | Number of rounds                                                                    |
|   | prevotes:<height>:<round>               | Counter |                                                                                     |
|   | precommits:<height>:<round>             | Counter |                                                                                     |
|   | prevotes_total_power:<height>:<round>   | Counter |                                                                                     |
|   | precommits_total_power:<height>:<round> | Counter |                                                                                     |
| A | num_txs:<height>                        | Counter |                                                                                     |
|   | total_txs                               | Counter |                                                                                     |
|   | block_size:<height>                     | Gauge   | In bytes                                                                            |
|   | peers                                   | Gauge   | Number of peers the node is connected to                                            |
|   | power                                   | Gauge   |                                                                                     |

`A`: will be implemented first.

**Proposed solution**

## Status

## Consequences

### Positive

### Negative

### Neutral