diff --git a/docs/rfc/README.md b/docs/rfc/README.md index 400b7c2a5..bee358dea 100644 --- a/docs/rfc/README.md +++ b/docs/rfc/README.md @@ -45,5 +45,6 @@ sections. - [RFC-005: Event System](./rfc-005-event-system.rst) - [RFC-006: Event Subscription](./rfc-006-event-subscription.md) - [RFC-007: Deterministic Proto Byte Serialization](./rfc-007-deterministic-proto-bytes.md) +- [RFC-008: Don't Panic](./rfc-008-don't-panic.md) diff --git a/docs/rfc/rfc-008-don't-panic.md b/docs/rfc/rfc-008-don't-panic.md new file mode 100644 index 000000000..ec8c08f5e --- /dev/null +++ b/docs/rfc/rfc-008-don't-panic.md @@ -0,0 +1,139 @@ +# RFC 008: Don't Panic + +## Changelog + +- 2021-12-17: initial draft (@tychoish) + +## Abstract + +Today, the Tendermint core codebase has panics in a number of cases as +a response to exceptional situations. These panics complicate testing, +and might make tendermint components difficult to use as a library in +some circumstances. This document outlines a project of converting +panics to errors and describes the situations where its safe to +panic. + +## Background + +Panics in Go are a great mechanism for aborting the current execution +for truly exceptional situations (e.g. memory errors, data corruption, +processes initialization); however, because they resemble exceptions +in other languages, it can be easy to over use them in the +implementation of software architectures. This certainly happened in +the history of Tendermint, and as we embark on the project of +stabilizing the package, we find ourselves in the right moment to +reexamine our use of panics, and largely where panics happen in the +code base. + +There are still some situations where panics are acceptable and +desireable, but it's important that Tendermint, as a project, comes to +consensus--perhaps in the text of this document--on the situations +where it is acceptable to panic. + +### References + +- [Defer Panic and Recover](https://go.dev/blog/defer-panic-and-recover) +- [Why Go gets exceptions right](https://dave.cheney.net/tag/panic) +- [Don't panic](https://dave.cheney.net/practical-go/presentations/gophercon-singapore-2019.html#_dont_panic) + +## Discussion + +### Acceptable Panics + +#### Initialization + +It is unambiguously safe (and desireable) to panic in `init()` +functions in response to any kind of error. These errors are caught by +tests, and occur early enough in process initialization that they +won't cause unexpected runtime crashes. + +Other code that is called early in process initialization MAY panic, +in some situations if it's not possible to return an error or cause +the process to abort early, although these situations should be +vanishingly slim. + +#### Data Corruption + +If Tendermint code encounters an inconsistency that could be +attributed to data corruption or a logical impossibility it is safer +to panic and crash the process than continue to attempt to make +progress in these situations. + +Examples including reading data out of the storage engine that +is invalid or corrupt, or encountering an ambiguous situation where +the process should halt. Generally these forms of corruption are +detected after interacting with a trusted but external data source, +and reflect situations where the author thinks its safer to terminate +the process immediately rather than allow execution to continue. + +#### Unrecoverable Consensus Failure + +In general, a panic should be used in the case of unrecoverable +consensus failures. If a process detects that the network is +behaving in an incoherent way and it does not have a clearly defined +and mechanism for recovering, the process should panic. + +#### Static Validity + +It is acceptable to panic for invariant violations, within a library +or package, in situations that should be statically impossible, +because there is no way to make these kinds of assertions at compile +time. + +For example, type-asserting `interface{}` values returned by +`container/list` and `container/heap` (and similar), is acceptable, +because package authors should have exclusive control of the inputs to +these containers. Packages should not expose the ability to add +arbitrary values to these data structures. + +#### Controlled Panics Within Libraries + +In some algorithms with highly recursive structures or very nested +call patterns, using a panic, in combination with conditional recovery +handlers results in more manageable code. Ultimately this is a limited +application, and implementations that use panics internally should +only recover conditionally, filtering out panics rather than ignoring +or handling all panics. + +#### Request Handling + +Code that handles responses to incoming/external requests +(e.g. `http.Handler`) should avoid panics, but practice this isn't +totally possible, and it makes sense that request handlers have some +kind of default recovery mechanism that will prevent one request from +terminating a service. + +### Unacceptable Panics + +In **no** other situation is it acceptable for the code to panic: + +- there should be **no** controlled panics that callers are required + to handle across library/package boundaries. +- callers of library functions should not expect panics. +- ensuring that arbitrary go routines can't panic. +- ensuring that there are no arbitrary panics in core production code, + espically code that can run at any time during the lifetime of a + process. +- all test code and fixture should report normal test assertions with + a mechanism like testify's `require` assertion rather than calling + panic directly. + +The goal of this increased "panic rigor" is to ensure that any escaped +panic is reflects a fixable bug in Tendermint. + +### Removing Panics + +The process for removing panics involve a few steps, and will be part +of an ongoing process of code modernization: + +- converting existing explicit panics to errors in cases where it's + possible to return an error, the errors can and should be handled, and returning + an error would not lead to data corruption or cover up data + corruption. + +- increase rigor around operations that can cause runtime errors, like + type assertions, nil pointer errors, array bounds access issues, and + either avoid these situations or return errors where possible. + +- remove generic panic handlers which could cover and hide known + panics.