Browse Source

briefly describe the recover process [ci skip]

pull/965/head
Anton Kaliaev 7 years ago
parent
commit
04a18e0a97
No known key found for this signature in database GPG Key ID: 7B6881D965918214
1 changed files with 64 additions and 0 deletions
  1. +64
    -0
      docs/specification/corruption.rst

+ 64
- 0
docs/specification/corruption.rst View File

@ -0,0 +1,64 @@
Corruption
==========
Important step
--------------
Make sure you have a backup of the Tendermint data directory.
Possible causes
---------------
Remember that most corruption is caused by hardware issues:
- RAID controllers with faulty / worn out battery backup, and an unexpected power loss
- Hard disk drives with write-back cache enabled, and an unexpected power loss
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
- Defective RAM
- Defective or overheating CPU(s)
Other causes can be:
- Database systems configured with fsync=off and an OS crash or power loss
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
- Tendermint bugs
- Operating system bugs
- Admin error
- directly modifying Tendermint data-directory contents
(Source: https://wiki.postgresql.org/wiki/Corruption)
WAL Corruption
--------------
If consensus WAL is corrupted at the lastest height and you are trying to start
Tendermint, replay will fail with panic.
Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:
1) Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.
2) Try to repair the WAL file manually:
1. Create a backup of the corrupted WAL file:
.. code:: bash
cp "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal_backup
2. Use ./scripts/wal2json to create a human-readable version
.. code:: bash
./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal
3. Search for a "CORRUPTED MESSAGE" line.
4. By looking at the previous message and the message after the corrupted one and looking at the logs, try to rebuild the message.
.. code:: bash
$EDITOR /tmp/corrupted_wal
5. After editing, convert this file back into binary form by running:
.. code:: bash
./scripts/json2wal/json2wal /tmp/corrupted_wal > "$TMHOME/data/cs.wal/wal"

Loading…
Cancel
Save