|
Corruption
|
|
==========
|
|
|
|
Important step
|
|
--------------
|
|
|
|
Make sure you have a backup of the Tendermint data directory.
|
|
|
|
Possible causes
|
|
---------------
|
|
|
|
Remember that most corruption is caused by hardware issues:
|
|
|
|
- RAID controllers with faulty / worn out battery backup, and an unexpected power loss
|
|
- Hard disk drives with write-back cache enabled, and an unexpected power loss
|
|
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
|
|
- Defective RAM
|
|
- Defective or overheating CPU(s)
|
|
|
|
Other causes can be:
|
|
|
|
- Database systems configured with fsync=off and an OS crash or power loss
|
|
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
|
|
- Tendermint bugs
|
|
- Operating system bugs
|
|
- Admin error
|
|
- directly modifying Tendermint data-directory contents
|
|
|
|
(Source: https://wiki.postgresql.org/wiki/Corruption)
|
|
|
|
WAL Corruption
|
|
--------------
|
|
|
|
If consensus WAL is corrupted at the lastest height and you are trying to start
|
|
Tendermint, replay will fail with panic.
|
|
|
|
Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:
|
|
|
|
1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.
|
|
2) Try to repair the WAL file manually:
|
|
|
|
1. Create a backup of the corrupted WAL file:
|
|
|
|
.. code:: bash
|
|
|
|
cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup
|
|
|
|
2. Use ./scripts/wal2json to create a human-readable version
|
|
|
|
.. code:: bash
|
|
|
|
./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal
|
|
|
|
3. Search for a "CORRUPTED MESSAGE" line.
|
|
4. By looking at the previous message and the message after the corrupted one
|
|
and looking at the logs, try to rebuild the message. If the consequent
|
|
messages are marked as corrupted too (this may happen if length header
|
|
got corrupted or some writes did not make it to the WAL ~ truncation),
|
|
then remove all the lines starting from the corrupted one and restart
|
|
Tendermint.
|
|
|
|
.. code:: bash
|
|
|
|
$EDITOR /tmp/corrupted_wal
|
|
|
|
5. After editing, convert this file back into binary form by running:
|
|
|
|
.. code:: bash
|
|
|
|
./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal"
|