===========================================
RFC 001: Storage Engines and Database Layer
===========================================

Changelog
---------

- 2021-04-19: Initial Draft (gist)
- 2021-09-02: Migrated to RFC folder, with some updates

Abstract
--------

The aspect of Tendermint that's responsible for persistence and storage
(often "the database" internally) represents a bottleneck in the
architecture of the platform, and the 0.36 release presents a good
opportunity to correct it. The current storage engine layer provides a
great deal of flexibility that is difficult for users to leverage or
benefit from, while also making it harder for Tendermint Core developers to
deliver improvements to the storage engine. This RFC discusses possible
improvements to this layer of the system.

Background
----------

Tendermint has a very thin common wrapper that makes Tendermint itself
(largely) agnostic to the data storage layer (within the realm of the
popular key-value/embedded databases.) This flexibility is not particularly
useful: the benefits of a specific database engine in the context of
Tendermint are not particularly well understood, and the maintenance burden
for multiple backends is not commensurate with the benefit provided.
Additionally, because the data storage layer is handled generically, and
most tests run with an in-memory framework, it's difficult to take
advantage of any higher-level features of a database engine.

Ideally, developers within Tendermint will be able to interact with
persisted data via an interface that functions approximately like an object
store, and this storage interface will be able to accommodate all existing
persistence workloads (e.g. block storage, local peer management
information like the "address book", and crash-recovery logs like the WAL.)
In addition to providing a more ergonomic interface and new semantics, by
selecting a single storage engine Tendermint can use the native durability
and atomicity features of that engine and simplify its own implementations.

Data Access Patterns
~~~~~~~~~~~~~~~~~~~~

Tendermint's data access patterns have the following characteristics:

- aggregate data size often exceeds memory.

- most data (e.g. blocks) is rarely mutated after it's written, but nodes
  persist small amounts of working data that is frequently mutated
  (e.g. peer information, validator information.)

- read patterns can be quite random.

- crash resistance and crash recovery, provided by write-ahead logs (in
  consensus, and potentially for the mempool), should allow the system to
  resume work after an unexpected shutdown.

Project Goals
~~~~~~~~~~~~~

As we think about replacing the current persistence layer, we should
consider the following high-level goals:

- drop dependencies on storage engines that have a CGo dependency.

- encapsulate data format and data storage from higher-level services
  (e.g. reactors) within Tendermint.

- select a storage engine that does not incur any additional operational
  complexity (e.g. the database should be embedded.)

- provide database semantics with sufficient ACID guarantees, snapshot
  support, and transactional support.

Open Questions
~~~~~~~~~~~~~~

The following questions remain:

- what kind of data-access concurrency does Tendermint require?

- would Tendermint users (SDK, etc.) benefit from some shared database
  infrastructure?

  - In earlier conversations it seemed as if the SDK has selected Badger
    and RocksDB for their storage engines, and it might make sense to be
    able to (optionally) pass a handle to a Badger instance between the
    libraries in some cases.

- what are typical data sizes, and what kinds of memory sizes can we expect
  operators to be able to provide?

- in addition to simple persistence, what kinds of additional semantics
  would Tendermint like to enjoy (e.g. transactional semantics, unique
  constraints, indexes, in-place updates, etc.)?

Decision Framework
~~~~~~~~~~~~~~~~~~

Given the constraint of removing the CGo dependency, the decision is
between "badger" and "boltdb" (in the form of the etcd/CoreOS fork) as the
low-level storage engine. On top of this, and somewhat orthogonally, we
must also decide on the interface to the database and how the larger
application will have to interact with the database layer. Users of the
data layer shouldn't ever need to interact with raw byte slices from the
database, and should mostly have the experience of interacting with Go
types.
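
As a strawman for what that experience could look like, the sketch below is
purely illustrative: the ``Store`` and ``Txn`` names and methods are
hypothetical and do not correspond to any existing Tendermint or tm-db
package. The point is only that callers read and write Go types, while key
encoding, value serialization, and the choice of engine stay behind an
internal package.

.. code-block:: go

   // Package store sketches a hypothetical internal persistence interface.
   // Callers work with Go types; key encoding, value serialization, and the
   // underlying engine (badger, bolt, ...) stay behind this package.
   package store

   import "errors"

   // ErrNotFound is returned when no value exists for a key.
   var ErrNotFound = errors.New("store: not found")

   // Store provides typed, transactional access to persisted data.
   type Store interface {
       // View runs fn in a read-only transaction.
       View(fn func(Txn) error) error
       // Update runs fn in a read-write transaction that commits atomically.
       Update(fn func(Txn) error) error
       Close() error
   }

   // Txn is a single atomic unit of work against a named collection of
   // records (e.g. "blocks", "peers", "wal").
   type Txn interface {
       // Get decodes the value stored under key into out; it returns
       // ErrNotFound if the key has no value.
       Get(collection, key string, out interface{}) error
       // Set serializes value and stores it under key.
       Set(collection, key string, value interface{}) error
       Delete(collection, key string) error
   }

Whether an interface like this should also expose iterators, secondary
indexes, or snapshot handles is exactly the kind of question the comparison
below needs to answer.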
Badger is more consistently developed and has a broader feature set than
Bolt. At the same time, Badger is likely more memory intensive and may have
more overhead in terms of open file handles given its model. At first
glance, Badger is the obvious choice: it's actively developed and it has a
lot of features that could be useful. Bolt is not without benefits: it's
stable and maintained by the etcd folks, and its simpler model (a single
memory-mapped file, etc.) may be easier to reason about.

I propose that we consider the following specific questions about storage
engines:

- does Badger's evolving development, which may result in data file format
  changes in the future and could restrict our access to the latest version
  of the library between major upgrades, present a problem?

- do we have goals/concerns about memory footprint that Badger may prevent
  us from hitting, particularly as data sets grow over time?

- what kind of additional tooling might we need or like to build
  (dump/restore, etc.)?

- do we want to run unit/integration tests against data files on disk
  rather than relying exclusively on the in-memory database?

Project Scope
~~~~~~~~~~~~~

This project will consist of the following aspects:

- selecting a storage engine, and modifying the Tendermint codebase to
  disallow any configuration of the storage engine outside of Tendermint
  itself.

- removing the dependency on the current tm-db interfaces and replacing it
  with an internal, safe, and ergonomic interface for data persistence that
  provides all required database semantics.

- updating core Tendermint code to use the new interface and data tools.

Next Steps
~~~~~~~~~~

- circulate the RFC, and discuss options with appropriate stakeholders.

- write a brief ADR to summarize the technical decisions reached during the
  RFC phase.

References
----------

- `boltdb `_
- `badger `_
- `badgerdb overview `_
- `boltdb overview `_
- `boltdb vs badger `_
- `bolthold `_
- `badgerhold `_
- `Pebble `_
- `SDK Issue Regarding IAVL `_
- `SDK Discussion about SMT/IAVL `_

Discussion
----------

- All things being equal, my tendency would be to use badger, with
  badgerhold (if that makes sense) for its ergonomics and indexing
  capabilities, which will require a small selection of wrappers for better
  write-transaction support. This is a weakly held tendency/belief, and I
  think it would be useful for the RFC process to build consensus (or not)
  around this basic assumption. A sketch of what that usage could look like
  follows below.
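
To make the badgerhold option concrete, here is a minimal sketch of the
ergonomics it offers, assuming badgerhold's documented
``Open``/``Insert``/``Find`` API and index struct tags; the ``PeerInfo``
type, its fields, and the data directory are illustrative assumptions, not
existing Tendermint types or paths.

.. code-block:: go

   package main

   import (
       "fmt"
       "log"
       "time"

       "github.com/timshannon/badgerhold"
   )

   // PeerInfo is an illustrative record type, not an existing Tendermint type.
   type PeerInfo struct {
       ID       string
       Address  string
       Region   string `badgerholdIndex:"Region"` // secondary index on Region
       LastSeen time.Time
   }

   func main() {
       // Open a store backed by badger files in ./data (illustrative path).
       opts := badgerhold.DefaultOptions
       opts.Dir = "data"
       opts.ValueDir = "data"

       store, err := badgerhold.Open(opts)
       if err != nil {
           log.Fatal(err)
       }
       defer store.Close()

       // Insert a Go value directly; badgerhold handles encoding and indexing.
       peer := PeerInfo{
           ID:       "node0",
           Address:  "10.0.0.1:26656",
           Region:   "eu-west",
           LastSeen: time.Now(),
       }
       if err := store.Insert(peer.ID, peer); err != nil {
           log.Fatal(err)
       }

       // Query by an indexed field instead of scanning raw key ranges.
       var peers []PeerInfo
       if err := store.Find(&peers, badgerhold.Where("Region").Eq("eu-west")); err != nil {
           log.Fatal(err)
       }
       fmt.Printf("found %d peers in eu-west\n", len(peers))
   }

Whether this level of abstraction (reflection-based encoding and field
indexes) is worth the indirection compared to a thinner typed wrapper over
raw badger is precisely the tradeoff the RFC process should settle.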