rfc: database storage engine (#6897)

3 years ago · 2a224fb2bd
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -38,5 +38,6 @@ sections.
 ## Table of Contents

 - [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
 - [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)

 <!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
--- a/docs/rfc/rfc-001-storage-engine.rst
+++ b/docs/rfc/rfc-001-storage-engine.rst
@@ -0,0 +1,179 @@
 ===========================================
 RFC 001: Storage Engines and Database Layer
 ===========================================

 Changelog
 ---------

 - 2021-04-19: Initial Draft (gist)
 - 2021-09-02: Migrated to RFC folder, with some updates  

 Abstract
 --------

 The component of Tendermint responsible for persistence and storage (often
 called "the database" internally) is a bottleneck in the architecture of the
 platform, and the 0.36 release presents a good opportunity to correct it. The
 current storage engine layer provides a great deal of flexibility that is
 difficult for users to leverage or benefit from, while also making it harder
 for Tendermint Core developers to deliver improvements to the storage engine.
 This RFC discusses possible improvements to this layer of the system.

 Background
 ----------

 Tendermint has a very thin common wrapper that makes Tendermint itself
 (largely) agnostic to the data storage layer (within the realm of popular
 embedded key-value databases). This flexibility is not particularly useful:
 the benefits of a specific database engine in the context of Tendermint are
 not well understood, and the maintenance burden of multiple backends is not
 commensurate with the benefit provided. Additionally, because the data
 storage layer is handled generically, and most tests run with an in-memory
 framework, it's difficult to take advantage of any higher-level features of a
 database engine.

 Ideally, developers within Tendermint will be able to interact with persisted
 data via an interface that functions approximately like an object store, and
 this storage interface will be able to accommodate all existing persistence
 workloads (e.g. block storage, local peer management information like the
 "address book", and crash-recovery logs like the WAL). In addition to
 providing a more ergonomic interface and new semantics, by selecting a single
 storage engine Tendermint can use the native durability and atomicity
 features of that engine and simplify its own implementations.

 Data Access Patterns
 ~~~~~~~~~~~~~~~~~~~~

 Tendermint's data access patterns have the following characteristics:

 - aggregate data size often exceeds memory.

 - most data (e.g. blocks) is rarely mutated after it's written, but small
  amounts of working data are persisted by nodes and frequently mutated
  (e.g. peer information, validator information).

 - read patterns can be quite random.

 - crash resistance and crash recovery, provided by write-ahead logs (in
  consensus, and potentially for the mempool), should allow the system to
  resume work after an unexpected shutdown.

 Project Goals
 ~~~~~~~~~~~~~

 As we think about replacing the current persistence layer, we should consider
 the following high-level goals:

 - drop dependencies on storage engines that have a CGo dependency.

 - encapsulate data format and data storage from higher-level services
  (e.g. reactors) within Tendermint.

 - select a storage engine that does not incur any additional operational
  complexity (e.g. database should be embedded.)

 - provide database semantics with sufficient ACID guarantees, snapshots, and
  transactional support.

 Open Questions
 ~~~~~~~~~~~~~~

 The following questions remain:

 - what kind of data-access concurrency does Tendermint require?

 - would Tendermint users (the SDK, etc.) benefit from some shared database
  infrastructure?
  
  - In earlier conversations it seemed as if the SDK has selected Badger and
    RocksDB for their storage engines, and it might make sense to be able to
    (optionally) pass a handle to a Badger instance between the libraries in
    some cases.

 - what are typical data sizes, and what kinds of memory sizes can we expect
  operators to be able to provide?

 - in addition to simple persistence, what kind of additional semantics would
  Tendermint like to enjoy (e.g. transactional semantics, unique constraints,
  indexes, in-place updates, etc.)?

 Decision Framework
 ~~~~~~~~~~~~~~~~~~

 Given the constraint of removing the CGo dependency, the choice of low-level
 storage engine is between "badger" and "boltdb" (in the form of the
 etcd/CoreOS fork). On top of this, and somewhat orthogonally, we must also
 decide on the interface to the database and how the larger application will
 interact with the database layer. Users of the data layer shouldn't ever need
 to interact with raw byte slices from the database, and should mostly have
 the experience of interacting with Go types.

 Badger is more consistently developed and has a broader feature set than
 Bolt. At the same time, Badger is likely more memory intensive and may have
 more overhead in terms of open file handles given its model. At first glance,
 Badger is the obvious choice: it's actively developed and it has a lot of
 features that could be useful. Bolt is not without some benefits: it's stable
 and is maintained by the etcd folks, and its simpler model (a single
 memory-mapped file, etc.) may be easier to reason about.

 I propose that we consider the following specific questions about storage
 engines:

 - does Badger's evolving development, which may result in data file format
  changes in the future, and could restrict our access to using the latest
  version of the library between major upgrades, present a problem?

 - do we have goals/concerns about memory footprint that Badger may prevent us
  from hitting, particularly as data sets grow over time?

 - what kind of additional tooling might we need/like to build (dump/restore,
  etc.)?

 - do we want to run unit/integration tests against data files on disk rather
  than relying exclusively on the memory database?

 Project Scope
 ~~~~~~~~~~~~~

 This project will consist of the following aspects:

 - selecting a storage engine, and modifying the Tendermint codebase to
  disallow any configuration of the storage engine outside of Tendermint.

 - removing the dependency on the current tm-db interfaces and replacing it
  with an internalized, safe, and ergonomic interface for data persistence
  with all required database semantics.

 - updating core Tendermint code to use the new interface and data tools.

 Next Steps
 ~~~~~~~~~~

 - circulate the RFC, and discuss options with appropriate stakeholders. 
  
 - write a brief ADR to summarize the technical decisions reached during the
  RFC phase.

 References
 ----------

 - `boltdb <https://github.com/etcd-io/bbolt>`_
 - `badger <https://github.com/dgraph-io/badger>`_
 - `badgerdb overview <https://dbdb.io/db/badgerdb>`_
 - `boltdb overview <https://dbdb.io/db/boltdb>`_
 - `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_
 - `bolthold <https://github.com/timshannon/bolthold>`_
 - `badgerhold <https://github.com/timshannon/badgerhold>`_
 - `Pebble <https://github.com/cockroachdb/pebble>`_
 - `SDK Issue Regarding IAVL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_
 - `SDK Discussion about SMT/IAVL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_

 Discussion
 ----------

 - All things being equal, my tendency would be to use badger, with badgerhold
  (if that makes sense) for its ergonomics and indexing capabilities, which
  will require some small selection of wrappers for better write transaction
  support. This is a weakly held tendency/belief and I think it would be
  useful for the RFC process to build consensus (or not) around this basic
  assumption.