|
|
@ -0,0 +1,139 @@ |
|
|
|
# RFC 008: Don't Panic |
|
|
|
|
|
|
|
## Changelog |
|
|
|
|
|
|
|
- 2021-12-17: initial draft (@tychoish) |
|
|
|
|
|
|
|
## Abstract |
|
|
|
|
|
|
|
Today, the Tendermint core codebase has panics in a number of cases as |
|
|
|
a response to exceptional situations. These panics complicate testing, |
|
|
|
and might make tendermint components difficult to use as a library in |
|
|
|
some circumstances. This document outlines a project of converting |
|
|
|
panics to errors and describes the situations where its safe to |
|
|
|
panic. |
|
|
|
|
|
|
|
## Background |
|
|
|
|
|
|
|
Panics in Go are a great mechanism for aborting the current execution |
|
|
|
for truly exceptional situations (e.g. memory errors, data corruption, |
|
|
|
processes initialization); however, because they resemble exceptions |
|
|
|
in other languages, it can be easy to over use them in the |
|
|
|
implementation of software architectures. This certainly happened in |
|
|
|
the history of Tendermint, and as we embark on the project of |
|
|
|
stabilizing the package, we find ourselves in the right moment to |
|
|
|
reexamine our use of panics, and largely where panics happen in the |
|
|
|
code base. |
|
|
|
|
|
|
|
There are still some situations where panics are acceptable and |
|
|
|
desireable, but it's important that Tendermint, as a project, comes to |
|
|
|
consensus--perhaps in the text of this document--on the situations |
|
|
|
where it is acceptable to panic. |
|
|
|
|
|
|
|
### References |
|
|
|
|
|
|
|
- [Defer Panic and Recover](https://go.dev/blog/defer-panic-and-recover) |
|
|
|
- [Why Go gets exceptions right](https://dave.cheney.net/tag/panic) |
|
|
|
- [Don't panic](https://dave.cheney.net/practical-go/presentations/gophercon-singapore-2019.html#_dont_panic) |
|
|
|
|
|
|
|
## Discussion |
|
|
|
|
|
|
|
### Acceptable Panics |
|
|
|
|
|
|
|
#### Initialization |
|
|
|
|
|
|
|
It is unambiguously safe (and desireable) to panic in `init()` |
|
|
|
functions in response to any kind of error. These errors are caught by |
|
|
|
tests, and occur early enough in process initialization that they |
|
|
|
won't cause unexpected runtime crashes. |
|
|
|
|
|
|
|
Other code that is called early in process initialization MAY panic, |
|
|
|
in some situations if it's not possible to return an error or cause |
|
|
|
the process to abort early, although these situations should be |
|
|
|
vanishingly slim. |
|
|
|
|
|
|
|
#### Data Corruption |
|
|
|
|
|
|
|
If Tendermint code encounters an inconsistency that could be |
|
|
|
attributed to data corruption or a logical impossibility it is safer |
|
|
|
to panic and crash the process than continue to attempt to make |
|
|
|
progress in these situations. |
|
|
|
|
|
|
|
Examples including reading data out of the storage engine that |
|
|
|
is invalid or corrupt, or encountering an ambiguous situation where |
|
|
|
the process should halt. Generally these forms of corruption are |
|
|
|
detected after interacting with a trusted but external data source, |
|
|
|
and reflect situations where the author thinks its safer to terminate |
|
|
|
the process immediately rather than allow execution to continue. |
|
|
|
|
|
|
|
#### Unrecoverable Consensus Failure |
|
|
|
|
|
|
|
In general, a panic should be used in the case of unrecoverable |
|
|
|
consensus failures. If a process detects that the network is |
|
|
|
behaving in an incoherent way and it does not have a clearly defined |
|
|
|
and mechanism for recovering, the process should panic. |
|
|
|
|
|
|
|
#### Static Validity |
|
|
|
|
|
|
|
It is acceptable to panic for invariant violations, within a library |
|
|
|
or package, in situations that should be statically impossible, |
|
|
|
because there is no way to make these kinds of assertions at compile |
|
|
|
time. |
|
|
|
|
|
|
|
For example, type-asserting `interface{}` values returned by |
|
|
|
`container/list` and `container/heap` (and similar), is acceptable, |
|
|
|
because package authors should have exclusive control of the inputs to |
|
|
|
these containers. Packages should not expose the ability to add |
|
|
|
arbitrary values to these data structures. |
|
|
|
|
|
|
|
#### Controlled Panics Within Libraries |
|
|
|
|
|
|
|
In some algorithms with highly recursive structures or very nested |
|
|
|
call patterns, using a panic, in combination with conditional recovery |
|
|
|
handlers results in more manageable code. Ultimately this is a limited |
|
|
|
application, and implementations that use panics internally should |
|
|
|
only recover conditionally, filtering out panics rather than ignoring |
|
|
|
or handling all panics. |
|
|
|
|
|
|
|
#### Request Handling |
|
|
|
|
|
|
|
Code that handles responses to incoming/external requests |
|
|
|
(e.g. `http.Handler`) should avoid panics, but practice this isn't |
|
|
|
totally possible, and it makes sense that request handlers have some |
|
|
|
kind of default recovery mechanism that will prevent one request from |
|
|
|
terminating a service. |
|
|
|
|
|
|
|
### Unacceptable Panics |
|
|
|
|
|
|
|
In **no** other situation is it acceptable for the code to panic: |
|
|
|
|
|
|
|
- there should be **no** controlled panics that callers are required |
|
|
|
to handle across library/package boundaries. |
|
|
|
- callers of library functions should not expect panics. |
|
|
|
- ensuring that arbitrary go routines can't panic. |
|
|
|
- ensuring that there are no arbitrary panics in core production code, |
|
|
|
espically code that can run at any time during the lifetime of a |
|
|
|
process. |
|
|
|
- all test code and fixture should report normal test assertions with |
|
|
|
a mechanism like testify's `require` assertion rather than calling |
|
|
|
panic directly. |
|
|
|
|
|
|
|
The goal of this increased "panic rigor" is to ensure that any escaped |
|
|
|
panic is reflects a fixable bug in Tendermint. |
|
|
|
|
|
|
|
### Removing Panics |
|
|
|
|
|
|
|
The process for removing panics involve a few steps, and will be part |
|
|
|
of an ongoing process of code modernization: |
|
|
|
|
|
|
|
- converting existing explicit panics to errors in cases where it's |
|
|
|
possible to return an error, the errors can and should be handled, and returning |
|
|
|
an error would not lead to data corruption or cover up data |
|
|
|
corruption. |
|
|
|
|
|
|
|
- increase rigor around operations that can cause runtime errors, like |
|
|
|
type assertions, nil pointer errors, array bounds access issues, and |
|
|
|
either avoid these situations or return errors where possible. |
|
|
|
|
|
|
|
- remove generic panic handlers which could cover and hide known |
|
|
|
panics. |