|
|
@ -0,0 +1,64 @@ |
|
|
|
Corruption |
|
|
|
========== |
|
|
|
|
|
|
|
Important step |
|
|
|
-------------- |
|
|
|
|
|
|
|
Make sure you have a backup of the Tendermint data directory. |
|
|
|
|
|
|
|
Possible causes |
|
|
|
--------------- |
|
|
|
|
|
|
|
Remember that most corruption is caused by hardware issues: |
|
|
|
|
|
|
|
- RAID controllers with faulty / worn out battery backup, and an unexpected power loss |
|
|
|
- Hard disk drives with write-back cache enabled, and an unexpected power loss |
|
|
|
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss |
|
|
|
- Defective RAM |
|
|
|
- Defective or overheating CPU(s) |
|
|
|
|
|
|
|
Other causes can be: |
|
|
|
|
|
|
|
- Database systems configured with fsync=off and an OS crash or power loss |
|
|
|
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit. |
|
|
|
- Tendermint bugs |
|
|
|
- Operating system bugs |
|
|
|
- Admin error |
|
|
|
- directly modifying Tendermint data-directory contents |
|
|
|
|
|
|
|
(Source: https://wiki.postgresql.org/wiki/Corruption) |
|
|
|
|
|
|
|
WAL Corruption |
|
|
|
-------------- |
|
|
|
|
|
|
|
If consensus WAL is corrupted at the lastest height and you are trying to start |
|
|
|
Tendermint, replay will fail with panic. |
|
|
|
|
|
|
|
Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take: |
|
|
|
|
|
|
|
1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers. |
|
|
|
2) Try to repair the WAL file manually: |
|
|
|
1. Create a backup of the corrupted WAL file: |
|
|
|
|
|
|
|
.. code:: bash |
|
|
|
|
|
|
|
cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup |
|
|
|
|
|
|
|
2. Use ./scripts/wal2json to create a human-readable version |
|
|
|
|
|
|
|
.. code:: bash |
|
|
|
|
|
|
|
./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal |
|
|
|
|
|
|
|
3. Search for a "CORRUPTED MESSAGE" line. |
|
|
|
4. By looking at the previous message and the message after the corrupted one and looking at the logs, try to rebuild the message. |
|
|
|
|
|
|
|
.. code:: bash |
|
|
|
|
|
|
|
$EDITOR /tmp/corrupted_wal |
|
|
|
|
|
|
|
5. After editing, convert this file back into binary form by running: |
|
|
|
|
|
|
|
.. code:: bash |
|
|
|
|
|
|
|
./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal" |