diff --git a/docs/rfc/README.md b/docs/rfc/README.md
index 10b3cccf9..0a78a00af 100644
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -38,5 +38,6 @@ sections.
 ## Table of Contents
 
 - [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
+- [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)
diff --git a/docs/rfc/rfc-001-storage-engine.rst b/docs/rfc/rfc-001-storage-engine.rst
new file mode 100644
index 000000000..560e8a8b3
--- /dev/null
+++ b/docs/rfc/rfc-001-storage-engine.rst
@@ -0,0 +1,179 @@
+===========================================
+RFC 001: Storage Engines and Database Layer
+===========================================
+
+Changelog
+---------
+
+- 2021-04-19: Initial Draft (gist)
+- 2021-09-02: Migrated to RFC folder, with some updates
+
+Abstract
+--------
+
+The aspect of Tendermint that's responsible for persistence and storage
+(often called "the database" internally) represents a bottleneck in the
+architecture of the platform, and the 0.36 release presents a good
+opportunity to correct it. The current storage engine layer provides a great
+deal of flexibility that is difficult for users to leverage or benefit from,
+while also making it harder for Tendermint Core developers to deliver
+improvements to the storage engine. This RFC discusses possible improvements
+to this layer of the system.
+
+Background
+----------
+
+Tendermint has a very thin common wrapper that makes Tendermint itself
+(largely) agnostic to the data storage layer (within the realm of the popular
+key-value/embedded databases). This flexibility is not particularly useful:
+the benefits of a specific database engine in the context of Tendermint are
+not well understood, and the maintenance burden of multiple backends is not
+commensurate with the benefit provided. Additionally, because the data
+storage layer is handled generically, and most tests run with an in-memory
+framework, it's difficult to take advantage of any higher-level features of a
+database engine.
+
+Ideally, developers within Tendermint will be able to interact with persisted
+data via an interface that functions approximately like an object store, and
+this storage interface will be able to accommodate all existing persistence
+workloads (e.g. block storage, local peer-management information like the
+"address book", and crash-recovery logs like the WAL). In addition to
+providing a more ergonomic interface and new semantics, selecting a single
+storage engine would let Tendermint use the engine's native durability and
+atomicity features and simplify its own implementations.
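+
+As a rough illustration, such an object-store-like interface might look
+something like the following sketch. The ``Store`` name, the method set, and
+the transactional closure are illustrative assumptions for discussion, not a
+settled design:
+
+.. code-block:: go
+
+   package storage
+
+   // Store is a hypothetical object-store-like interface that higher-level
+   // services (reactors, etc.) would use instead of raw key-value access.
+   type Store interface {
+       // Get unmarshals the value stored under key into val, returning an
+       // error if no such key exists.
+       Get(key string, val interface{}) error
+
+       // Set marshals val and durably persists it under key.
+       Set(key string, val interface{}) error
+
+       // Delete removes key and its value, if present.
+       Delete(key string) error
+
+       // Transact runs fn atomically: either every operation performed
+       // through the transactional view commits, or none do.
+       Transact(fn func(tx Store) error) error
+   }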
+
+Data Access Patterns
+~~~~~~~~~~~~~~~~~~~~
+
+Tendermint's data access patterns have the following characteristics:
+
+- aggregate data size often exceeds memory.
+
+- most data (e.g. blocks) is rarely mutated after it's written, but the
+  small amounts of working data that nodes persist (e.g. peer information,
+  validator information) are frequently mutated.
+
+- read patterns can be quite random.
+
+- crash resistance and crash recovery, provided by write-ahead logs (in
+  consensus, and potentially for the mempool), should allow the system to
+  resume work after an unexpected shutdown.
+
+Project Goals
+~~~~~~~~~~~~~
+
+As we think about replacing the current persistence layer, we should consider
+the following high-level goals:
+
+- drop dependencies on storage engines that require CGo.
+
+- encapsulate data format and data storage from higher-level services
+  (e.g. reactors) within Tendermint.
+
+- select a storage engine that does not incur any additional operational
+  complexity (e.g. the database should be embedded).
+
+- provide database semantics with sufficient ACID guarantees, snapshots, and
+  transactional support.
+
+Open Questions
+~~~~~~~~~~~~~~
+
+The following questions remain:
+
+- what kind of data-access concurrency does Tendermint require?
+
+- would Tendermint users (the SDK, etc.) benefit from some shared database
+  infrastructure?
+
+  - In earlier conversations it seemed as if the SDK had selected Badger and
+    RocksDB as its storage engines, and it might make sense to be able to
+    (optionally) pass a handle to a Badger instance between the libraries in
+    some cases.
+
+- what are typical data sizes, and what kinds of memory sizes can we expect
+  operators to be able to provide?
+
+- in addition to simple persistence, what kinds of additional semantics would
+  Tendermint like to enjoy (e.g. transactional semantics, unique constraints,
+  indexes, in-place updates, etc.)?
+
+Decision Framework
+~~~~~~~~~~~~~~~~~~
+
+Given the constraint of removing the CGo dependency, the decision is between
+Badger and BoltDB (in the form of the etcd/CoreOS fork, bbolt) as the
+low-level storage engine. On top of this, and somewhat orthogonally, we must
+also decide on the interface to the database and how the larger application
+will interact with the database layer. Users of the data layer shouldn't ever
+need to interact with raw byte slices from the database, and should mostly
+have the experience of interacting with Go types, as in the sketch below.
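+
+A minimal sketch of what hiding byte slices behind Go types could look like.
+The ``kv`` interface stands in for whichever engine we pick (it is an
+assumption for this sketch, not either library's actual API), and the use of
+``encoding/json`` is a placeholder; the real codec is an open question:
+
+.. code-block:: go
+
+   package storage
+
+   import "encoding/json"
+
+   // kv is the minimal byte-oriented surface we'd expect from either
+   // Badger or Bolt, abstracted away from the rest of the codebase.
+   type kv interface {
+       get(key []byte) ([]byte, error)
+       set(key, value []byte) error
+   }
+
+   // TypedStore hides raw byte slices behind Go types by handling
+   // serialization at the storage boundary.
+   type TypedStore struct {
+       db kv
+   }
+
+   // Put marshals val and writes the resulting bytes under key.
+   func (s TypedStore) Put(key string, val interface{}) error {
+       b, err := json.Marshal(val)
+       if err != nil {
+           return err
+       }
+       return s.db.set([]byte(key), b)
+   }
+
+   // Get reads the bytes stored under key and unmarshals them into val.
+   func (s TypedStore) Get(key string, val interface{}) error {
+       b, err := s.db.get([]byte(key))
+       if err != nil {
+           return err
+       }
+       return json.Unmarshal(b, val)
+   }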
+
+Badger is more consistently developed and has a broader feature set than
+Bolt. At the same time, Badger is likely more memory intensive and may have
+more overhead in terms of open file handles, given its model. At first
+glance, Badger is the obvious choice: it's actively developed and has a lot
+of features that could be useful. Bolt is not without benefits: it's stable
+and maintained by the etcd folks, and its simpler model (a single
+memory-mapped file, etc.) may be easier to reason about.
+
+I propose that we consider the following specific questions about storage
+engines:
+
+- does Badger's evolving development present a problem, given that it may
+  result in data-file format changes in the future and could keep us from
+  using the latest version of the library between major upgrades?
+
+- do we have goals/concerns about memory footprint that Badger may prevent
+  us from hitting, particularly as data sets grow over time?
+
+- what kind of additional tooling might we need or like to build
+  (dump/restore, etc.)?
+
+- do we want to run unit/integration tests against data files on disk rather
+  than relying exclusively on the in-memory database?
+
+Project Scope
+~~~~~~~~~~~~~
+
+This project will consist of the following aspects:
+
+- selecting a storage engine and modifying the Tendermint codebase to
+  disallow any configuration of the storage engine outside of Tendermint.
+
+- removing the dependency on the current tm-db interfaces and replacing it
+  with an internalized, safe, and ergonomic interface for data persistence
+  with all required database semantics.
+
+- updating core Tendermint code to use the new interface and data tools.
+
+Next Steps
+~~~~~~~~~~
+
+- circulate the RFC and discuss options with appropriate stakeholders.
+
+- write a brief ADR to summarize the technical decisions reached during the
+  RFC phase.
+
+References
+----------
+
+- `boltdb `_
+- `badger `_
+- `badgerdb overview `_
+- `boltdb overview `_
+- `boltdb vs badger `_
+- `bolthold `_
+- `badgerhold `_
+- `Pebble `_
+- `SDK Issue Regarding IAVL `_
+- `SDK Discussion about SMT/IAVL `_
+
+Discussion
+----------
+
+- All things being equal, my tendency would be to use Badger, with badgerhold
+  (if that makes sense) for its ergonomics and indexing capabilities, though
+  this will require a small set of wrappers for better write-transaction
+  support. This is a weakly held belief, and I think it would be useful for
+  the RFC process to build consensus (or not) around this basic assumption; a
+  sketch of the kind of ergonomics badgerhold offers follows.
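+
+To give a rough feel for those ergonomics, here is a badgerhold usage
+sketch. The ``Peer`` type is hypothetical, and the tag, option, and method
+names are recalled from the badgerhold README; they should be verified
+against the library before relying on them:
+
+.. code-block:: go
+
+   package main
+
+   import (
+       "log"
+
+       "github.com/timshannon/badgerhold"
+   )
+
+   // Peer is a hypothetical record type; the badgerholdIndex tag asks
+   // badgerhold to maintain a secondary index on the Region field.
+   type Peer struct {
+       ID     string
+       Region string `badgerholdIndex:"Region"`
+   }
+
+   func main() {
+       opts := badgerhold.DefaultOptions
+       opts.Dir = "data"
+       opts.ValueDir = "data"
+
+       store, err := badgerhold.Open(opts)
+       if err != nil {
+           log.Fatal(err)
+       }
+       defer store.Close()
+
+       // Insert persists a typed record; no manual byte-slice handling.
+       err = store.Insert("peer-1", &Peer{ID: "peer-1", Region: "eu"})
+       if err != nil {
+           log.Fatal(err)
+       }
+
+       // Find uses the secondary index on Region to query by field value.
+       var peers []Peer
+       err = store.Find(&peers, badgerhold.Where("Region").Eq("eu"))
+       if err != nil {
+           log.Fatal(err)
+       }
+   }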