From e3d5a31d6ef961f2188d697fbc661fe6c4f3a250 Mon Sep 17 00:00:00 2001 From: Callum Waters Date: Mon, 31 May 2021 17:02:00 +0200 Subject: [PATCH] docs: rename tendermint-core to system (#6515) --- docs/introduction/install.md | 12 +- docs/introduction/quick-start.md | 23 +- docs/networks/README.md | 2 +- docs/nodes/README.md | 6 +- .../{remote_signer.md => remote-signer.md} | 0 docs/nodes/running-in-production.md | 374 +++++++++++++++++ docs/tendermint-core/README.md | 4 +- docs/tendermint-core/running-in-production.md | 393 +----------------- docs/tools/README.md | 2 +- 9 files changed, 398 insertions(+), 418 deletions(-) rename docs/nodes/{remote_signer.md => remote-signer.md} (100%) create mode 100644 docs/nodes/running-in-production.md diff --git a/docs/introduction/install.md b/docs/introduction/install.md index 42394b9d6..b0e57cf39 100644 --- a/docs/introduction/install.md +++ b/docs/introduction/install.md @@ -8,6 +8,14 @@ order: 3 To download pre-built binaries, see the [releases page](https://github.com/tendermint/tendermint/releases). +## Using Homebrew + +You can also install the Tendermint binary by simply using homebrew, + +``` +brew install tendermint +``` + ## From Source You'll need `go` [installed](https://golang.org/doc/install) and the required @@ -18,14 +26,14 @@ echo export GOPATH=\"\$HOME/go\" >> ~/.bash_profile echo export PATH=\"\$PATH:\$GOPATH/bin\" >> ~/.bash_profile ``` -### Get Source Code +Get the source code: ```sh git clone https://github.com/tendermint/tendermint.git cd tendermint ``` -### Compile +Then run: ```sh make install diff --git a/docs/introduction/quick-start.md b/docs/introduction/quick-start.md index bc6f36af0..040da8eb2 100644 --- a/docs/introduction/quick-start.md +++ b/docs/introduction/quick-start.md @@ -7,27 +7,8 @@ order: 2 ## Overview This is a quick start guide. If you have a vague idea about how Tendermint -works and want to get started right away, continue. - -## Install - -### Quick Install - -To quickly get Tendermint installed on a fresh -Ubuntu 16.04 machine, use [this script](https://git.io/fFfOR). - -> :warning: Do not copy scripts to run on your machine without knowing what they do. - -```sh -curl -L https://git.io/fFfOR | bash -source ~/.profile -``` - -The script is also used to facilitate cluster deployment below. - -### Manual Install - -For manual installation, see the [install instructions](install.md) +works and want to get started right away, continue. Make sure you've installed the binary. +Check out [install](./install.md) if you haven't. ## Initialization diff --git a/docs/networks/README.md b/docs/networks/README.md index f941720c0..8528f44ed 100644 --- a/docs/networks/README.md +++ b/docs/networks/README.md @@ -2,7 +2,7 @@ order: 1 parent: title: Networks - order: 5 + order: 6 --- # Overview diff --git a/docs/nodes/README.md b/docs/nodes/README.md index 3786ad7d1..9be6febf0 100644 --- a/docs/nodes/README.md +++ b/docs/nodes/README.md @@ -5,13 +5,17 @@ parent: order: 4 --- +# Overview + This section will focus on how to operate full nodes, validators and light clients. - [Node Types](#node-types) - [Configuration](./configuration.md) - - [Configure State sync](./state_sync.md) + - [Configure State sync](./state-sync.md) - [Validator Guides](./validators.md) + - [Running in Production](./running-in-production.md) - [How to secure your keys](./validators.md#validator_keys) + - [Remote Signer](./remote-signer.md) - [Light Client guides](./light-client.md) - [How to sync a light client](./light-client.md#) - [Metrics](./metrics.md) diff --git a/docs/nodes/remote_signer.md b/docs/nodes/remote-signer.md similarity index 100% rename from docs/nodes/remote_signer.md rename to docs/nodes/remote-signer.md diff --git a/docs/nodes/running-in-production.md b/docs/nodes/running-in-production.md new file mode 100644 index 000000000..3dfb6c708 --- /dev/null +++ b/docs/nodes/running-in-production.md @@ -0,0 +1,374 @@ +--- +order: 4 +--- + +# Running in production + +If you are building Tendermint from source for use in production, make sure to check out an appropriate Git tag instead of a branch. + +## Database + +By default, Tendermint uses the `syndtr/goleveldb` package for its in-process +key-value database. If you want maximal performance, it may be best to install +the real C-implementation of LevelDB and compile Tendermint to use that using +`make build TENDERMINT_BUILD_OPTIONS=cleveldb`. See the [install +instructions](../introduction/install.md) for details. + +Tendermint keeps multiple distinct databases in the `$TMROOT/data`: + +- `blockstore.db`: Keeps the entire blockchain - stores blocks, + block commits, and block meta data, each indexed by height. Used to sync new + peers. +- `evidence.db`: Stores all verified evidence of misbehaviour. +- `state.db`: Stores the current blockchain state (ie. height, validators, + consensus params). Only grows if consensus params or validators change. Also + used to temporarily store intermediate results during block processing. +- `tx_index.db`: Indexes txs (and their results) by tx hash and by DeliverTx result events. + +By default, Tendermint will only index txs by their hash and height, not by their DeliverTx +result events. See [indexing transactions](../app-dev/indexing-transactions.md) for +details. + +Applications can expose block pruning strategies to the node operator. Please read the documentation of your application +to find out more details. + +Applications can use [state sync](state-sync.md) to help nodes bootstrap quickly. + +## Logging + +Default logging level (`log-level = "main:info,state:info,statesync:info,*:error"`) should suffice for +normal operation mode. Read [this +post](https://blog.cosmos.network/one-of-the-exciting-new-features-in-0-10-0-release-is-smart-log-level-flag-e2506b4ab756) +for details on how to configure `log-level` config variable. Some of the +modules can be found [here](../nodes/logging#list-of-modules). If +you're trying to debug Tendermint or asked to provide logs with debug +logging level, you can do so by running Tendermint with +`--log-level="*:debug"`. + +### Consensus WAL + +Tendermint uses a write ahead log (WAL) for consensus. The `consensus.wal` is used to ensure we can recover from a crash at any point +in the consensus state machine. It writes all consensus messages (timeouts, proposals, block part, or vote) +to a single file, flushing to disk before processing messages from its own +validator. Since Tendermint validators are expected to never sign a conflicting vote, the +WAL ensures we can always recover deterministically to the latest state of the consensus without +using the network or re-signing any consensus messages. The consensus WAL max size of 1GB and is automatically rotated. + +If your `consensus.wal` is corrupted, see [below](#wal-corruption). + +## DOS Exposure and Mitigation + +Validators are supposed to setup [Sentry Node +Architecture](./validators.md) +to prevent Denial-of-service attacks. + +### P2P + +The core of the Tendermint peer-to-peer system is `MConnection`. Each +connection has `MaxPacketMsgPayloadSize`, which is the maximum packet +size and bounded send & receive queues. One can impose restrictions on +send & receive rate per connection (`SendRate`, `RecvRate`). + +The number of open P2P connections can become quite large, and hit the operating system's open +file limit (since TCP connections are considered files on UNIX-based systems). Nodes should be +given a sizable open file limit, e.g. 8192, via `ulimit -n 8192` or other deployment-specific +mechanisms. + +### RPC + +Endpoints returning multiple entries are limited by default to return 30 +elements (100 max). See the [RPC Documentation](https://docs.tendermint.com/master/rpc/) +for more information. + +Rate-limiting and authentication are another key aspects to help protect +against DOS attacks. Validators are supposed to use external tools like +[NGINX](https://www.nginx.com/blog/rate-limiting-nginx/) or +[traefik](https://docs.traefik.io/middlewares/ratelimit/) +to achieve the same things. + +## Debugging Tendermint + +If you ever have to debug Tendermint, the first thing you should probably do is +check out the logs. See [Logging](../nodes/logging.md), where we +explain what certain log statements mean. + +If, after skimming through the logs, things are not clear still, the next thing +to try is querying the `/status` RPC endpoint. It provides the necessary info: +whenever the node is syncing or not, what height it is on, etc. + +```bash +curl http(s)://{ip}:{rpcPort}/status +``` + +`/dump_consensus_state` will give you a detailed overview of the consensus +state (proposer, latest validators, peers states). From it, you should be able +to figure out why, for example, the network had halted. + +```bash +curl http(s)://{ip}:{rpcPort}/dump_consensus_state +``` + +There is a reduced version of this endpoint - `/consensus_state`, which returns +just the votes seen at the current height. + +If, after consulting with the logs and above endpoints, you still have no idea +what's happening, consider using `tendermint debug kill` sub-command. This +command will scrap all the available info and kill the process. See +[Debugging](../tools/debugging.md) for the exact format. + +You can inspect the resulting archive yourself or create an issue on +[Github](https://github.com/tendermint/tendermint). Before opening an issue +however, be sure to check if there's [no existing +issue](https://github.com/tendermint/tendermint/issues) already. + +## Monitoring Tendermint + +Each Tendermint instance has a standard `/health` RPC endpoint, which responds +with 200 (OK) if everything is fine and 500 (or no response) - if something is +wrong. + +Other useful endpoints include mentioned earlier `/status`, `/net_info` and +`/validators`. + +Tendermint also can report and serve Prometheus metrics. See +[Metrics](./metrics.md). + +`tendermint debug dump` sub-command can be used to periodically dump useful +information into an archive. See [Debugging](../tools/debugging.md) for more +information. + +## What happens when my app dies + +You are supposed to run Tendermint under a [process +supervisor](https://en.wikipedia.org/wiki/Process_supervision) (like +systemd or runit). It will ensure Tendermint is always running (despite +possible errors). + +Getting back to the original question, if your application dies, +Tendermint will panic. After a process supervisor restarts your +application, Tendermint should be able to reconnect successfully. The +order of restart does not matter for it. + +## Signal handling + +We catch SIGINT and SIGTERM and try to clean up nicely. For other +signals we use the default behavior in Go: [Default behavior of signals +in Go +programs](https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs). + +## Corruption + +**NOTE:** Make sure you have a backup of the Tendermint data directory. + +### Possible causes + +Remember that most corruption is caused by hardware issues: + +- RAID controllers with faulty / worn out battery backup, and an unexpected power loss +- Hard disk drives with write-back cache enabled, and an unexpected power loss +- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss +- Defective RAM +- Defective or overheating CPU(s) + +Other causes can be: + +- Database systems configured with fsync=off and an OS crash or power loss +- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit. +- Tendermint bugs +- Operating system bugs +- Admin error (e.g., directly modifying Tendermint data-directory contents) + +(Source: ) + +### WAL Corruption + +If consensus WAL is corrupted at the latest height and you are trying to start +Tendermint, replay will fail with panic. + +Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take: + +1. Delete the WAL file and restart Tendermint. It will attempt to sync with other peers. +2. Try to repair the WAL file manually: + +1) Create a backup of the corrupted WAL file: + + ```sh + cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup + ``` + +2) Use `./scripts/wal2json` to create a human-readable version: + + ```sh + ./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal + ``` + +3) Search for a "CORRUPTED MESSAGE" line. +4) By looking at the previous message and the message after the corrupted one + and looking at the logs, try to rebuild the message. If the consequent + messages are marked as corrupted too (this may happen if length header + got corrupted or some writes did not make it to the WAL ~ truncation), + then remove all the lines starting from the corrupted one and restart + Tendermint. + + ```sh + $EDITOR /tmp/corrupted_wal + ``` + +5) After editing, convert this file back into binary form by running: + + ```sh + ./scripts/json2wal/json2wal /tmp/corrupted_wal $TMHOME/data/cs.wal/wal + ``` + +## Hardware + +### Processor and Memory + +While actual specs vary depending on the load and validators count, minimal +requirements are: + +- 1GB RAM +- 25GB of disk space +- 1.4 GHz CPU + +SSD disks are preferable for applications with high transaction throughput. + +Recommended: + +- 2GB RAM +- 100GB SSD +- x64 2.0 GHz 2v CPU + +While for now, Tendermint stores all the history and it may require significant +disk space over time, we are planning to implement state syncing (See [this +issue](https://github.com/tendermint/tendermint/issues/828)). So, storing all +the past blocks will not be necessary. + +### Validator signing on 32 bit architectures (or ARM) + +Both our `ed25519` and `secp256k1` implementations require constant time +`uint64` multiplication. Non-constant time crypto can (and has) leaked +private keys on both `ed25519` and `secp256k1`. This doesn't exist in hardware +on 32 bit x86 platforms ([source](https://bearssl.org/ctmul.html)), and it +depends on the compiler to enforce that it is constant time. It's unclear at +this point whenever the Golang compiler does this correctly for all +implementations. + +**We do not support nor recommend running a validator on 32 bit architectures OR +the "VIA Nano 2000 Series", and the architectures in the ARM section rated +"S-".** + +### Operating Systems + +Tendermint can be compiled for a wide range of operating systems thanks to Go +language (the list of \$OS/\$ARCH pairs can be found +[here](https://golang.org/doc/install/source#environment)). + +While we do not favor any operation system, more secure and stable Linux server +distributions (like Centos) should be preferred over desktop operation systems +(like Mac OS). + +### Miscellaneous + +NOTE: if you are going to use Tendermint in a public domain, make sure +you read [hardware recommendations](https://cosmos.network/validators) for a validator in the +Cosmos network. + +## Configuration parameters + +- `p2p.flush-throttle-timeout` +- `p2p.max-packet-msg-payload-size` +- `p2p.send-rate` +- `p2p.recv-rate` + +If you are going to use Tendermint in a private domain and you have a +private high-speed network among your peers, it makes sense to lower +flush throttle timeout and increase other params. + +```toml +[p2p] +send-rate=20000000 # 2MB/s +recv-rate=20000000 # 2MB/s +flush-throttle-timeout=10 +max-packet-msg-payload-size=10240 # 10KB +``` + +- `mempool.recheck` + +After every block, Tendermint rechecks every transaction left in the +mempool to see if transactions committed in that block affected the +application state, so some of the transactions left may become invalid. +If that does not apply to your application, you can disable it by +setting `mempool.recheck=false`. + +- `mempool.broadcast` + +Setting this to false will stop the mempool from relaying transactions +to other peers until they are included in a block. It means only the +peer you send the tx to will see it until it is included in a block. + +- `consensus.skip-timeout-commit` + +We want `skip-timeout-commit=false` when there is economics on the line +because proposers should wait to hear for more votes. But if you don't +care about that and want the fastest consensus, you can skip it. It will +be kept false by default for public deployments (e.g. [Cosmos +Hub](https://cosmos.network/intro/hub)) while for enterprise +applications, setting it to true is not a problem. + +- `consensus.peer-gossip-sleep-duration` + +You can try to reduce the time your node sleeps before checking if +theres something to send its peers. + +- `consensus.timeout-commit` + +You can also try lowering `timeout-commit` (time we sleep before +proposing the next block). + +- `p2p.addr-book-strict` + +By default, Tendermint checks whenever a peer's address is routable before +saving it to the address book. The address is considered as routable if the IP +is [valid and within allowed +ranges](https://github.com/tendermint/tendermint/blob/27bd1deabe4ba6a2d9b463b8f3e3f1e31b993e61/p2p/netaddress.go#L209). + +This may not be the case for private or local networks, where your IP range is usually +strictly limited and private. If that case, you need to set `addr-book-strict` +to `false` (turn it off). + +- `rpc.max-open-connections` + +By default, the number of simultaneous connections is limited because most OS +give you limited number of file descriptors. + +If you want to accept greater number of connections, you will need to increase +these limits. + +[Sysctls to tune the system to be able to open more connections](https://github.com/satori-com/tcpkali/blob/master/doc/tcpkali.man.md#sysctls-to-tune-the-system-to-be-able-to-open-more-connections) + +The process file limits must also be increased, e.g. via `ulimit -n 8192`. + +...for N connections, such as 50k: + +```md +kern.maxfiles=10000+2*N # BSD +kern.maxfilesperproc=100+2*N # BSD +kern.ipc.maxsockets=10000+2*N # BSD +fs.file-max=10000+2*N # Linux +net.ipv4.tcp_max_orphans=N # Linux + +# For load-generating clients. +net.ipv4.ip_local_port_range="10000 65535" # Linux. +net.inet.ip.portrange.first=10000 # BSD/Mac. +net.inet.ip.portrange.last=65535 # (Enough for N < 55535) +net.ipv4.tcp_tw_reuse=1 # Linux +net.inet.tcp.maxtcptw=2*N # BSD + +# If using netfilter on Linux: +net.netfilter.nf_conntrack_max=N +echo $((N/8)) > /sys/module/nf_conntrack/parameters/hashsize +``` + +The similar option exists for limiting the number of gRPC connections - +`rpc.grpc-max-open-connections`. diff --git a/docs/tendermint-core/README.md b/docs/tendermint-core/README.md index a6c1331b8..666eff16d 100644 --- a/docs/tendermint-core/README.md +++ b/docs/tendermint-core/README.md @@ -1,8 +1,8 @@ --- order: 1 parent: - title: Tendermint Core - order: 4 + title: System + order: 5 --- # Overview diff --git a/docs/tendermint-core/running-in-production.md b/docs/tendermint-core/running-in-production.md index cd6e5a18a..c95915181 100644 --- a/docs/tendermint-core/running-in-production.md +++ b/docs/tendermint-core/running-in-production.md @@ -1,394 +1,7 @@ --- -order: 4 +order: false --- -# Running in production +# Running In Production -If you are building Tendermint from source for use in production, make sure to check out an appropriate Git tag instead of a branch. - -## Database - -By default, Tendermint uses the `syndtr/goleveldb` package for its in-process -key-value database. If you want maximal performance, it may be best to install -the real C-implementation of LevelDB and compile Tendermint to use that using -`make build TENDERMINT_BUILD_OPTIONS=cleveldb`. See the [install -instructions](../introduction/install.md) for details. - -Tendermint keeps multiple distinct databases in the `$TMROOT/data`: - -- `blockstore.db`: Keeps the entire blockchain - stores blocks, - block commits, and block meta data, each indexed by height. Used to sync new - peers. -- `evidence.db`: Stores all verified evidence of misbehaviour. -- `state.db`: Stores the current blockchain state (ie. height, validators, - consensus params). Only grows if consensus params or validators change. Also - used to temporarily store intermediate results during block processing. -- `tx_index.db`: Indexes txs (and their results) by tx hash and by DeliverTx result events. - -By default, Tendermint will only index txs by their hash and height, not by their DeliverTx -result events. See [indexing transactions](../app-dev/indexing-transactions.md) for -details. - -Applications can expose block pruning strategies to the node operator. Please read the documentation of your application -to find out more details. - -Applications can use [state sync](state-sync.md) to help nodes bootstrap quickly. - -## Logging - -Default logging level (`log-level = "main:info,state:info,statesync:info,*:error"`) should suffice for -normal operation mode. Read [this -post](https://blog.cosmos.network/one-of-the-exciting-new-features-in-0-10-0-release-is-smart-log-level-flag-e2506b4ab756) -for details on how to configure `log-level` config variable. Some of the -modules can be found [here](../nodes/logging#list-of-modules). If -you're trying to debug Tendermint or asked to provide logs with debug -logging level, you can do so by running Tendermint with -`--log-level="*:debug"`. - -## Write Ahead Logs (WAL) - -Tendermint uses write ahead logs for the consensus (`cs.wal`) and the mempool -(`mempool.wal`). Both WALs have a max size of 1GB and are automatically rotated. - -### Consensus WAL - -The `consensus.wal` is used to ensure we can recover from a crash at any point -in the consensus state machine. -It writes all consensus messages (timeouts, proposals, block part, or vote) -to a single file, flushing to disk before processing messages from its own -validator. Since Tendermint validators are expected to never sign a conflicting vote, the -WAL ensures we can always recover deterministically to the latest state of the consensus without -using the network or re-signing any consensus messages. - -If your `consensus.wal` is corrupted, see [below](#wal-corruption). - -### Mempool WAL - -The `mempool.wal` logs all incoming txs before running CheckTx, but is -otherwise not used in any programmatic way. It's just a kind of manual -safe guard. Note the mempool provides no durability guarantees - a tx sent to one or many nodes -may never make it into the blockchain if those nodes crash before being able to -propose it. Clients must monitor their txs by subscribing over websockets, -polling for them, or using `/broadcast_tx_commit`. In the worst case, txs can be -resent from the mempool WAL manually. - -For the above reasons, the `mempool.wal` is disabled by default. To enable, set -`mempool.wal-dir` to where you want the WAL to be located (e.g. -`data/mempool.wal`). - -## DOS Exposure and Mitigation - -Validators are supposed to setup [Sentry Node -Architecture](./validators.md) -to prevent Denial-of-service attacks. - -### P2P - -The core of the Tendermint peer-to-peer system is `MConnection`. Each -connection has `MaxPacketMsgPayloadSize`, which is the maximum packet -size and bounded send & receive queues. One can impose restrictions on -send & receive rate per connection (`SendRate`, `RecvRate`). - -The number of open P2P connections can become quite large, and hit the operating system's open -file limit (since TCP connections are considered files on UNIX-based systems). Nodes should be -given a sizable open file limit, e.g. 8192, via `ulimit -n 8192` or other deployment-specific -mechanisms. - -### RPC - -Endpoints returning multiple entries are limited by default to return 30 -elements (100 max). See the [RPC Documentation](https://docs.tendermint.com/master/rpc/) -for more information. - -Rate-limiting and authentication are another key aspects to help protect -against DOS attacks. Validators are supposed to use external tools like -[NGINX](https://www.nginx.com/blog/rate-limiting-nginx/) or -[traefik](https://docs.traefik.io/middlewares/ratelimit/) -to achieve the same things. - -## Debugging Tendermint - -If you ever have to debug Tendermint, the first thing you should probably do is -check out the logs. See [Logging](../nodes/logging.md), where we -explain what certain log statements mean. - -If, after skimming through the logs, things are not clear still, the next thing -to try is querying the `/status` RPC endpoint. It provides the necessary info: -whenever the node is syncing or not, what height it is on, etc. - -```bash -curl http(s)://{ip}:{rpcPort}/status -``` - -`/dump_consensus_state` will give you a detailed overview of the consensus -state (proposer, latest validators, peers states). From it, you should be able -to figure out why, for example, the network had halted. - -```bash -curl http(s)://{ip}:{rpcPort}/dump_consensus_state -``` - -There is a reduced version of this endpoint - `/consensus_state`, which returns -just the votes seen at the current height. - -If, after consulting with the logs and above endpoints, you still have no idea -what's happening, consider using `tendermint debug kill` sub-command. This -command will scrap all the available info and kill the process. See -[Debugging](../tools/debugging.md) for the exact format. - -You can inspect the resulting archive yourself or create an issue on -[Github](https://github.com/tendermint/tendermint). Before opening an issue -however, be sure to check if there's [no existing -issue](https://github.com/tendermint/tendermint/issues) already. - -## Monitoring Tendermint - -Each Tendermint instance has a standard `/health` RPC endpoint, which responds -with 200 (OK) if everything is fine and 500 (or no response) - if something is -wrong. - -Other useful endpoints include mentioned earlier `/status`, `/net_info` and -`/validators`. - -Tendermint also can report and serve Prometheus metrics. See -[Metrics](./metrics.md). - -`tendermint debug dump` sub-command can be used to periodically dump useful -information into an archive. See [Debugging](../tools/debugging.md) for more -information. - -## What happens when my app dies - -You are supposed to run Tendermint under a [process -supervisor](https://en.wikipedia.org/wiki/Process_supervision) (like -systemd or runit). It will ensure Tendermint is always running (despite -possible errors). - -Getting back to the original question, if your application dies, -Tendermint will panic. After a process supervisor restarts your -application, Tendermint should be able to reconnect successfully. The -order of restart does not matter for it. - -## Signal handling - -We catch SIGINT and SIGTERM and try to clean up nicely. For other -signals we use the default behavior in Go: [Default behavior of signals -in Go -programs](https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs). - -## Corruption - -**NOTE:** Make sure you have a backup of the Tendermint data directory. - -### Possible causes - -Remember that most corruption is caused by hardware issues: - -- RAID controllers with faulty / worn out battery backup, and an unexpected power loss -- Hard disk drives with write-back cache enabled, and an unexpected power loss -- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss -- Defective RAM -- Defective or overheating CPU(s) - -Other causes can be: - -- Database systems configured with fsync=off and an OS crash or power loss -- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit. -- Tendermint bugs -- Operating system bugs -- Admin error (e.g., directly modifying Tendermint data-directory contents) - -(Source: ) - -### WAL Corruption - -If consensus WAL is corrupted at the latest height and you are trying to start -Tendermint, replay will fail with panic. - -Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take: - -1. Delete the WAL file and restart Tendermint. It will attempt to sync with other peers. -2. Try to repair the WAL file manually: - -1) Create a backup of the corrupted WAL file: - - ```sh - cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup - ``` - -2) Use `./scripts/wal2json` to create a human-readable version: - - ```sh - ./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal - ``` - -3) Search for a "CORRUPTED MESSAGE" line. -4) By looking at the previous message and the message after the corrupted one - and looking at the logs, try to rebuild the message. If the consequent - messages are marked as corrupted too (this may happen if length header - got corrupted or some writes did not make it to the WAL ~ truncation), - then remove all the lines starting from the corrupted one and restart - Tendermint. - - ```sh - $EDITOR /tmp/corrupted_wal - ``` - -5) After editing, convert this file back into binary form by running: - - ```sh - ./scripts/json2wal/json2wal /tmp/corrupted_wal $TMHOME/data/cs.wal/wal - ``` - -## Hardware - -### Processor and Memory - -While actual specs vary depending on the load and validators count, minimal -requirements are: - -- 1GB RAM -- 25GB of disk space -- 1.4 GHz CPU - -SSD disks are preferable for applications with high transaction throughput. - -Recommended: - -- 2GB RAM -- 100GB SSD -- x64 2.0 GHz 2v CPU - -While for now, Tendermint stores all the history and it may require significant -disk space over time, we are planning to implement state syncing (See [this -issue](https://github.com/tendermint/tendermint/issues/828)). So, storing all -the past blocks will not be necessary. - -### Validator signing on 32 bit architectures (or ARM) - -Both our `ed25519` and `secp256k1` implementations require constant time -`uint64` multiplication. Non-constant time crypto can (and has) leaked -private keys on both `ed25519` and `secp256k1`. This doesn't exist in hardware -on 32 bit x86 platforms ([source](https://bearssl.org/ctmul.html)), and it -depends on the compiler to enforce that it is constant time. It's unclear at -this point whenever the Golang compiler does this correctly for all -implementations. - -**We do not support nor recommend running a validator on 32 bit architectures OR -the "VIA Nano 2000 Series", and the architectures in the ARM section rated -"S-".** - -### Operating Systems - -Tendermint can be compiled for a wide range of operating systems thanks to Go -language (the list of \$OS/\$ARCH pairs can be found -[here](https://golang.org/doc/install/source#environment)). - -While we do not favor any operation system, more secure and stable Linux server -distributions (like Centos) should be preferred over desktop operation systems -(like Mac OS). - -### Miscellaneous - -NOTE: if you are going to use Tendermint in a public domain, make sure -you read [hardware recommendations](https://cosmos.network/validators) for a validator in the -Cosmos network. - -## Configuration parameters - -- `p2p.flush-throttle-timeout` -- `p2p.max-packet-msg-payload-size` -- `p2p.send-rate` -- `p2p.recv-rate` - -If you are going to use Tendermint in a private domain and you have a -private high-speed network among your peers, it makes sense to lower -flush throttle timeout and increase other params. - -```toml -[p2p] -send-rate=20000000 # 2MB/s -recv-rate=20000000 # 2MB/s -flush-throttle-timeout=10 -max-packet-msg-payload-size=10240 # 10KB -``` - -- `mempool.recheck` - -After every block, Tendermint rechecks every transaction left in the -mempool to see if transactions committed in that block affected the -application state, so some of the transactions left may become invalid. -If that does not apply to your application, you can disable it by -setting `mempool.recheck=false`. - -- `mempool.broadcast` - -Setting this to false will stop the mempool from relaying transactions -to other peers until they are included in a block. It means only the -peer you send the tx to will see it until it is included in a block. - -- `consensus.skip-timeout-commit` - -We want `skip-timeout-commit=false` when there is economics on the line -because proposers should wait to hear for more votes. But if you don't -care about that and want the fastest consensus, you can skip it. It will -be kept false by default for public deployments (e.g. [Cosmos -Hub](https://cosmos.network/intro/hub)) while for enterprise -applications, setting it to true is not a problem. - -- `consensus.peer-gossip-sleep-duration` - -You can try to reduce the time your node sleeps before checking if -theres something to send its peers. - -- `consensus.timeout-commit` - -You can also try lowering `timeout-commit` (time we sleep before -proposing the next block). - -- `p2p.addr-book-strict` - -By default, Tendermint checks whenever a peer's address is routable before -saving it to the address book. The address is considered as routable if the IP -is [valid and within allowed -ranges](https://github.com/tendermint/tendermint/blob/27bd1deabe4ba6a2d9b463b8f3e3f1e31b993e61/p2p/netaddress.go#L209). - -This may not be the case for private or local networks, where your IP range is usually -strictly limited and private. If that case, you need to set `addr-book-strict` -to `false` (turn it off). - -- `rpc.max-open-connections` - -By default, the number of simultaneous connections is limited because most OS -give you limited number of file descriptors. - -If you want to accept greater number of connections, you will need to increase -these limits. - -[Sysctls to tune the system to be able to open more connections](https://github.com/satori-com/tcpkali/blob/master/doc/tcpkali.man.md#sysctls-to-tune-the-system-to-be-able-to-open-more-connections) - -The process file limits must also be increased, e.g. via `ulimit -n 8192`. - -...for N connections, such as 50k: - -```md -kern.maxfiles=10000+2*N # BSD -kern.maxfilesperproc=100+2*N # BSD -kern.ipc.maxsockets=10000+2*N # BSD -fs.file-max=10000+2*N # Linux -net.ipv4.tcp_max_orphans=N # Linux - -# For load-generating clients. -net.ipv4.ip_local_port_range="10000 65535" # Linux. -net.inet.ip.portrange.first=10000 # BSD/Mac. -net.inet.ip.portrange.last=65535 # (Enough for N < 55535) -net.ipv4.tcp_tw_reuse=1 # Linux -net.inet.tcp.maxtcptw=2*N # BSD - -# If using netfilter on Linux: -net.netfilter.nf_conntrack_max=N -echo $((N/8)) > /sys/module/nf_conntrack/parameters/hashsize -``` - -The similar option exists for limiting the number of gRPC connections - -`rpc.grpc-max-open-connections`. +This file has moved to the [nodes section](../nodes/running-in-production.md). \ No newline at end of file diff --git a/docs/tools/README.md b/docs/tools/README.md index a050e9a38..3e87a2ea1 100644 --- a/docs/tools/README.md +++ b/docs/tools/README.md @@ -2,7 +2,7 @@ order: 1 parent: title: Tooling - order: 6 + order: 8 --- # Overview