You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

139 lines
5.5 KiB

  1. # RFC 008: Don't Panic
  2. ## Changelog
  3. - 2021-12-17: initial draft (@tychoish)
  4. ## Abstract
  5. Today, the Tendermint core codebase has panics in a number of cases as
  6. a response to exceptional situations. These panics complicate testing,
  7. and might make tendermint components difficult to use as a library in
  8. some circumstances. This document outlines a project of converting
  9. panics to errors and describes the situations where its safe to
  10. panic.
  11. ## Background
  12. Panics in Go are a great mechanism for aborting the current execution
  13. for truly exceptional situations (e.g. memory errors, data corruption,
  14. processes initialization); however, because they resemble exceptions
  15. in other languages, it can be easy to over use them in the
  16. implementation of software architectures. This certainly happened in
  17. the history of Tendermint, and as we embark on the project of
  18. stabilizing the package, we find ourselves in the right moment to
  19. reexamine our use of panics, and largely where panics happen in the
  20. code base.
  21. There are still some situations where panics are acceptable and
  22. desireable, but it's important that Tendermint, as a project, comes to
  23. consensus--perhaps in the text of this document--on the situations
  24. where it is acceptable to panic.
  25. ### References
  26. - [Defer Panic and Recover](https://go.dev/blog/defer-panic-and-recover)
  27. - [Why Go gets exceptions right](https://dave.cheney.net/tag/panic)
  28. - [Don't panic](https://dave.cheney.net/practical-go/presentations/gophercon-singapore-2019.html#_dont_panic)
  29. ## Discussion
  30. ### Acceptable Panics
  31. #### Initialization
  32. It is unambiguously safe (and desireable) to panic in `init()`
  33. functions in response to any kind of error. These errors are caught by
  34. tests, and occur early enough in process initialization that they
  35. won't cause unexpected runtime crashes.
  36. Other code that is called early in process initialization MAY panic,
  37. in some situations if it's not possible to return an error or cause
  38. the process to abort early, although these situations should be
  39. vanishingly slim.
  40. #### Data Corruption
  41. If Tendermint code encounters an inconsistency that could be
  42. attributed to data corruption or a logical impossibility it is safer
  43. to panic and crash the process than continue to attempt to make
  44. progress in these situations.
  45. Examples including reading data out of the storage engine that
  46. is invalid or corrupt, or encountering an ambiguous situation where
  47. the process should halt. Generally these forms of corruption are
  48. detected after interacting with a trusted but external data source,
  49. and reflect situations where the author thinks its safer to terminate
  50. the process immediately rather than allow execution to continue.
  51. #### Unrecoverable Consensus Failure
  52. In general, a panic should be used in the case of unrecoverable
  53. consensus failures. If a process detects that the network is
  54. behaving in an incoherent way and it does not have a clearly defined
  55. and mechanism for recovering, the process should panic.
  56. #### Static Validity
  57. It is acceptable to panic for invariant violations, within a library
  58. or package, in situations that should be statically impossible,
  59. because there is no way to make these kinds of assertions at compile
  60. time.
  61. For example, type-asserting `interface{}` values returned by
  62. `container/list` and `container/heap` (and similar), is acceptable,
  63. because package authors should have exclusive control of the inputs to
  64. these containers. Packages should not expose the ability to add
  65. arbitrary values to these data structures.
  66. #### Controlled Panics Within Libraries
  67. In some algorithms with highly recursive structures or very nested
  68. call patterns, using a panic, in combination with conditional recovery
  69. handlers results in more manageable code. Ultimately this is a limited
  70. application, and implementations that use panics internally should
  71. only recover conditionally, filtering out panics rather than ignoring
  72. or handling all panics.
  73. #### Request Handling
  74. Code that handles responses to incoming/external requests
  75. (e.g. `http.Handler`) should avoid panics, but practice this isn't
  76. totally possible, and it makes sense that request handlers have some
  77. kind of default recovery mechanism that will prevent one request from
  78. terminating a service.
  79. ### Unacceptable Panics
  80. In **no** other situation is it acceptable for the code to panic:
  81. - there should be **no** controlled panics that callers are required
  82. to handle across library/package boundaries.
  83. - callers of library functions should not expect panics.
  84. - ensuring that arbitrary go routines can't panic.
  85. - ensuring that there are no arbitrary panics in core production code,
  86. espically code that can run at any time during the lifetime of a
  87. process.
  88. - all test code and fixture should report normal test assertions with
  89. a mechanism like testify's `require` assertion rather than calling
  90. panic directly.
  91. The goal of this increased "panic rigor" is to ensure that any escaped
  92. panic is reflects a fixable bug in Tendermint.
  93. ### Removing Panics
  94. The process for removing panics involve a few steps, and will be part
  95. of an ongoing process of code modernization:
  96. - converting existing explicit panics to errors in cases where it's
  97. possible to return an error, the errors can and should be handled, and returning
  98. an error would not lead to data corruption or cover up data
  99. corruption.
  100. - increase rigor around operations that can cause runtime errors, like
  101. type assertions, nil pointer errors, array bounds access issues, and
  102. either avoid these situations or return errors where possible.
  103. - remove generic panic handlers which could cover and hide known
  104. panics.