diff --git a/docs/specification/corruption.rst b/docs/specification/corruption.rst new file mode 100644 index 000000000..df2ceea62 --- /dev/null +++ b/docs/specification/corruption.rst @@ -0,0 +1,64 @@ +Corruption +========== + +Important step +-------------- + +Make sure you have a backup of the Tendermint data directory. + +Possible causes +--------------- + +Remember that most corruption is caused by hardware issues: + +- RAID controllers with faulty / worn out battery backup, and an unexpected power loss +- Hard disk drives with write-back cache enabled, and an unexpected power loss +- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss +- Defective RAM +- Defective or overheating CPU(s) + +Other causes can be: + +- Database systems configured with fsync=off and an OS crash or power loss +- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit. +- Tendermint bugs +- Operating system bugs +- Admin error + - directly modifying Tendermint data-directory contents + +(Source: https://wiki.postgresql.org/wiki/Corruption) + +WAL Corruption +-------------- + +If consensus WAL is corrupted at the lastest height and you are trying to start +Tendermint, replay will fail with panic. + +Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take: + +1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers. +2) Try to repair the WAL file manually: + 1. Create a backup of the corrupted WAL file: + + .. code:: bash + + cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup + + 2. Use ./scripts/wal2json to create a human-readable version + + .. code:: bash + + ./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal + + 3. Search for a "CORRUPTED MESSAGE" line. + 4. By looking at the previous message and the message after the corrupted one and looking at the logs, try to rebuild the message. + + .. code:: bash + + $EDITOR /tmp/corrupted_wal + + 5. After editing, convert this file back into binary form by running: + + .. code:: bash + + ./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal"