===========================================
RFC 001: Storage Engines and Database Layer
===========================================

Changelog
---------

- 2021-04-19: Initial Draft (gist)
- 2021-09-02: Migrated to RFC folder, with some updates

Abstract
--------

The aspect of Tendermint that's responsible for persistence and storage (often
called "the database" internally) represents a bottleneck in the architecture
of the platform, one that the 0.36 release presents a good opportunity to
correct. The current storage engine layer provides a great deal of flexibility
that is difficult for users to leverage or benefit from, while also making it
harder for Tendermint Core developers to deliver improvements to the storage
engine. This RFC discusses possible improvements to this layer of the system.

Background
----------

Tendermint has a very thin common wrapper that makes Tendermint itself
(largely) agnostic to the data storage layer (within the realm of the popular
key-value/embedded databases.) This flexibility is not particularly useful:
the benefits of a specific database engine in the context of Tendermint are
not particularly well understood, and the maintenance burden for multiple
backends is not commensurate with the benefit provided. Additionally, because
the data storage layer is handled generically, and most tests run with an
in-memory framework, it's difficult to take advantage of any higher-level
features of a database engine.

Ideally, developers within Tendermint will be able to interact with persisted
data via an interface that can function approximately like an object store,
and this storage interface will be able to accommodate all existing
persistence workloads (e.g. block storage, local peer management information
like the "address book", crash-recovery logs like the WAL.) In addition to
providing a more ergonomic interface and new semantics, by selecting a single
storage engine Tendermint can use the native durability and atomicity features
of the storage engine and simplify its own implementations.
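
As a rough illustration only, such an interface might be shaped something like
the sketch below. Every name here (``Store``, ``Collection``, and their
methods) is hypothetical and does not correspond to any existing Tendermint or
tm-db API:

.. code-block:: go

    // Package storagesketch illustrates, very roughly, the kind of internal
    // interface discussed above; all of these names are hypothetical.
    package storagesketch

    // Store is a handle to the single underlying storage engine. Each
    // persistence workload (blocks, address book, WAL, ...) would get its
    // own named collection rather than its own database backend.
    type Store interface {
        Collection(name string) Collection
        Close() error
    }

    // Collection persists Go values; encoding to and from raw bytes is an
    // implementation detail hidden behind the interface.
    type Collection interface {
        Put(key []byte, value interface{}) error
        Get(key []byte, value interface{}) error
        Delete(key []byte) error
    }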

Data Access Patterns
~~~~~~~~~~~~~~~~~~~~

Tendermint's data access patterns have the following characteristics:

- aggregate data size often exceeds memory.
- most data (e.g. blocks) is rarely mutated after it's written, but small
  amounts of working data are persisted by nodes and frequently mutated
  (e.g. peer information, validator information.)
- read patterns can be quite random.
- crash resistance and crash recovery, provided by write-ahead logs (in
  consensus, and potentially for the mempool), should allow the system to
  resume work after an unexpected shutdown.

Project Goals
~~~~~~~~~~~~~

As we think about replacing the current persistence layer, we should consider
the following high-level goals:

- drop dependencies on storage engines that have a CGo dependency.
- encapsulate data format and data storage from higher-level services
  (e.g. reactors) within Tendermint.
- select a storage engine that does not incur any additional operational
  complexity (e.g. the database should be embedded.)
- provide database semantics with sufficient ACID, snapshot, and
  transactional support.

Open Questions
~~~~~~~~~~~~~~

The following questions remain:

- what kind of data-access concurrency does Tendermint require?
- would Tendermint users (the SDK, etc.) benefit from some shared database
  infrastructure?

  - In earlier conversations it seemed as if the SDK has selected Badger and
    RocksDB for its storage engines, and it might make sense to be able to
    (optionally) pass a handle to a Badger instance between the libraries in
    some cases.

- what are typical data sizes, and what kinds of memory sizes can we expect
  operators to be able to provide?
- in addition to simple persistence, what kind of additional semantics would
  Tendermint like to enjoy (e.g. transactional semantics, unique constraints,
  indexes, in-place updates, etc.)?

Decision Framework
~~~~~~~~~~~~~~~~~~

Given the constraint of removing the CGo dependency, the decision is between
"badger" and "boltdb" (in the form of the etcd/CoreOS fork, bbolt) as the
low-level storage engine. On top of this, and somewhat orthogonally, we must
also decide on the interface to the database and how the larger application
will have to interact with the database layer. Users of the data layer
shouldn't ever need to interact with raw byte slices from the database, and
should mostly have the experience of interacting with Go types.
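
As a minimal sketch of what that experience could look like, the snippet below
persists and retrieves a Go value without exposing byte slices to the caller.
The ``KV``, ``PeerInfo``, and ``PeerStore`` names, the key scheme, and the use
of JSON encoding are all assumptions for illustration, not a proposed API:

.. code-block:: go

    package peerstore

    import "encoding/json"

    // KV is whatever minimal key-value surface the chosen engine exposes
    // internally (assumed for this sketch).
    type KV interface {
        Set(key, value []byte) error
        Get(key []byte) ([]byte, error)
    }

    // PeerInfo stands in for the small, frequently mutated working data
    // (e.g. address book entries) described above.
    type PeerInfo struct {
        ID   string `json:"id"`
        Addr string `json:"addr"`
    }

    // PeerStore hides encoding and raw keys from its callers.
    type PeerStore struct{ kv KV }

    func (s PeerStore) Save(p PeerInfo) error {
        raw, err := json.Marshal(p)
        if err != nil {
            return err
        }
        return s.kv.Set([]byte("peer/"+p.ID), raw)
    }

    func (s PeerStore) Load(id string) (PeerInfo, error) {
        var p PeerInfo
        raw, err := s.kv.Get([]byte("peer/" + id))
        if err != nil {
            return p, err
        }
        return p, json.Unmarshal(raw, &p)
    }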

Badger is more consistently developed and has a broader feature set than
Bolt. At the same time, Badger is likely more memory intensive and may have
more overhead in terms of open file handles given its model. At first glance,
Badger is the obvious choice: it's actively developed and it has a lot of
features that could be useful. Bolt is not without some benefits: it's stable
and is maintained by the etcd folks, and its simpler model (a single
memory-mapped file, etc.) may be easier to reason about.
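
To make the difference in models concrete, the sketch below does a minimal
transactional write and read against each engine using their standard APIs.
The directory paths, bucket name, and key are placeholders, and the pinned
major versions (e.g. badger ``v3``) are assumptions that may differ from what
we would actually depend on:

.. code-block:: go

    package main

    import (
        "fmt"
        "log"

        "github.com/dgraph-io/badger/v3"
        bolt "go.etcd.io/bbolt"
    )

    // badgerExample performs a transactional write and read with Badger.
    func badgerExample() error {
        db, err := badger.Open(badger.DefaultOptions("/tmp/badger-example"))
        if err != nil {
            return err
        }
        defer db.Close()

        // Writes happen inside read-write transactions.
        if err := db.Update(func(txn *badger.Txn) error {
            return txn.Set([]byte("height"), []byte("1"))
        }); err != nil {
            return err
        }

        // Reads happen inside read-only transactions; values are accessed
        // via a callback because they may live in Badger's value log.
        return db.View(func(txn *badger.Txn) error {
            item, err := txn.Get([]byte("height"))
            if err != nil {
                return err
            }
            return item.Value(func(val []byte) error {
                fmt.Printf("badger height=%s\n", val)
                return nil
            })
        })
    }

    // boltExample does the same against bbolt's single memory-mapped file.
    func boltExample() error {
        db, err := bolt.Open("/tmp/bolt-example.db", 0600, nil)
        if err != nil {
            return err
        }
        defer db.Close()

        // All keys live in named buckets within one file.
        if err := db.Update(func(tx *bolt.Tx) error {
            b, err := tx.CreateBucketIfNotExists([]byte("state"))
            if err != nil {
                return err
            }
            return b.Put([]byte("height"), []byte("1"))
        }); err != nil {
            return err
        }

        return db.View(func(tx *bolt.Tx) error {
            v := tx.Bucket([]byte("state")).Get([]byte("height"))
            fmt.Printf("bolt height=%s\n", v)
            return nil
        })
    }

    func main() {
        if err := badgerExample(); err != nil {
            log.Fatal(err)
        }
        if err := boltExample(); err != nil {
            log.Fatal(err)
        }
    }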

I propose that we consider the following specific questions about storage
engines:

- does Badger's evolving development, which may result in data file format
  changes in the future and could restrict our access to the latest version
  of the library between major upgrades, present a problem?
- do we have goals or concerns about memory footprint that Badger may prevent
  us from meeting, particularly as data sets grow over time?
- what kind of additional tooling might we need/like to build (dump/restore,
  etc.)?
- do we want to run unit/integration tests against data files on disk rather
  than relying exclusively on the in-memory database?

Project Scope
~~~~~~~~~~~~~

This project will consist of the following aspects:

- selecting a storage engine, and modifying the Tendermint codebase to
  disallow any configuration of the storage engine outside of the Tendermint
  codebase.
- removing the dependency on the current tm-db interfaces and replacing them
  with an internalized, safe, and ergonomic interface for data persistence
  with all required database semantics.
- updating core Tendermint code to use the new interface and data tools.

Next Steps
~~~~~~~~~~

- circulate the RFC, and discuss options with appropriate stakeholders.
- write a brief ADR summarizing the technical decisions reached during the
  RFC phase.

References
----------

- `boltdb <https://github.com/etcd-io/bbolt>`_
- `badger <https://github.com/dgraph-io/badger>`_
- `badgerdb overview <https://dbdb.io/db/badgerdb>`_
- `boltdb overview <https://dbdb.io/db/boltdb>`_
- `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_
- `bolthold <https://github.com/timshannon/bolthold>`_
- `badgerhold <https://github.com/timshannon/badgerhold>`_
- `Pebble <https://github.com/cockroachdb/pebble>`_
- `SDK Issue Regarding IAVL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_
- `SDK Discussion about SMT/IAVL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_

Discussion
----------

- All things being equal, my tendency would be to use Badger, with badgerhold
  (if that makes sense) for its ergonomics and indexing capabilities, which
  will require some small selection of wrappers for better write-transaction
  support; a sketch follows this list. This is a weakly held tendency/belief,
  and I think it would be useful for the RFC process to build consensus (or
  not) around this basic assumption.
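
For a feel of the ergonomics and indexing being referred to, here is a minimal
badgerhold sketch based on that library's README. The ``Validator`` record,
the index tag, the key, and the directory paths are illustrative assumptions
(and the module's major version may differ), not a proposal for how Tendermint
would model its data:

.. code-block:: go

    package main

    import (
        "fmt"
        "log"

        "github.com/timshannon/badgerhold"
    )

    // Validator is an illustrative record, not an actual Tendermint type.
    // The struct tag asks badgerhold to maintain a secondary index.
    type Validator struct {
        Address string `badgerholdIndex:"Address"`
        Power   int64
    }

    func main() {
        options := badgerhold.DefaultOptions
        options.Dir = "/tmp/badgerhold-example"
        options.ValueDir = "/tmp/badgerhold-example"

        store, err := badgerhold.Open(options)
        if err != nil {
            log.Fatal(err)
        }
        defer store.Close()

        // Insert encodes the Go value for us; no raw byte slices involved.
        err = store.Insert("val-1", Validator{Address: "abc123", Power: 10})
        if err != nil {
            log.Fatal(err)
        }

        // Query by field rather than by key.
        var result []Validator
        err = store.Find(&result, badgerhold.Where("Address").Eq("abc123"))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("found %d validator(s)\n", len(result))
    }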