zolfa
/
tendermint


								===========================================

								RFC 001: Storage Engines and Database Layer

								===========================================


								Changelog

								---------


								- 2021-04-19: Initial Draft (gist)

								- 2021-09-02: Migrated to RFC folder, with some updates


								Abstract

								--------


								The aspect of Tendermint that's responsible for persistence and storage (often

								"the database" internally) represents a bottle neck in the architecture of the

								platform, that the 0.36 release presents a good opportunity to correct. The

								current storage engine layer provides a great deal of flexibility that is

								difficult for users to leverage or benefit from, while also making it harder

								for Tendermint Core developers to deliver improvements on storage engine. This

								RFC discusses the possible improvements to this layer of the system.


								Background

								----------


								Tendermint has a very thin common wrapper that makes Tendermint itself

								(largely) agnostic to the data storage layer (within the realm of the popular

								key-value/embedded databases.) This flexibility is not particularly useful:

								the benefits of a specific database engine in the context of Tendermint is not

								particularly well understood, and the maintenance burden for multiple backends

								is not commensurate with the benefit provided. Additionally, because the data

								storage layer is handled generically, and most tests run with an in-memory

								framework, it's difficult to take advantage of any higher-level features of a

								database engine.


								Ideally, developers within Tendermint will be able to interact with persisted

								data via an interface that can function, approximately like an object

								store, and this storage interface will be able to accommodate all existing

								persistence workloads (e.g. block storage, local peer management information

								like the "address book", crash-recovery log like the WAL.) In addition to

								providing a more ergonomic interface and new semantics, by selecting a single

								storage engine tendermint can use native durability and atomicity features of

								the storage engine and simplify its own implementations.


								Data Access Patterns

								~~~~~~~~~~~~~~~~~~~~


								Tendermint's data access patterns have the following characteristics:


								- aggregate data size often exceeds memory.


								- data is rarely mutated after it's written for most data (e.g. blocks), but

								  small amounts of working data is persisted by nodes and is frequently

								  mutated (e.g. peer information, validator information.)


								- read patterns can be quite random.


								- crash resistance and crash recovery, provided by write-ahead-logs (in

								  consensus, and potentially for the mempool) should allow the system to

								  resume work after an unexpected shut down.


								Project Goals

								~~~~~~~~~~~~~


								As we think about replacing the current persistence layer, we should consider

								the following high level goals:


								- drop dependencies on storage engines that have a CGo dependency.


								- encapsulate data format and data storage from higher-level services

								  (e.g. reactors) within tendermint.


								- select a storage engine that does not incur any additional operational

								  complexity (e.g. database should be embedded.)


								- provide database semantics with sufficient ACID, snapshots, and

								  transactional support.


								Open Questions

								~~~~~~~~~~~~~~


								The following questions remain:


								- what kind of data-access concurrency does tendermint require?


								- would tendermint users SDK/etc. benefit from some shared database

								  infrastructure?


								  - In earlier conversations it seemed as if the SDK has selected Badger and

								    RocksDB for their storage engines, and it might make sense to be able to

								    (optionally) pass a handle to a Badger instance between the libraries in

								    some cases.


								- what are typical data sizes, and what kinds of memory sizes can we expect

								  operators to be able to provide?


								- in addition to simple persistence, what kind of additional semantics would

								  tendermint like to enjoy (e.g. transactional semantics, unique constraints,

								  indexes, in-place-updates, etc.)?


								Decision Framework

								~~~~~~~~~~~~~~~~~~


								Given the constraint of removing the CGo dependency, the decision is between

								"badger" and "boltdb" (in the form of the etcd/CoreOS fork,) as low level. On

								top of this and somewhat orthogonally, we must also decide on the interface to

								the database and how the larger application will have to interact with the

								database layer. Users of the data layer shouldn't ever need to interact with

								raw byte slices from the database, and should mostly have the experience of

								interacting with Go-types.


								Badger is more consistently developed and has a broader feature set than

								Bolt. At the same time, Badger is likely more memory intensive and may have

								more overhead in terms of open file handles given it's model. At first glance,

								Badger is the obvious choice: it's actively developed and it has a lot of

								features that could be useful. Bolt is not without some benefits: it's stable

								and is maintained by the etcd folks, it's simpler model (single memory mapped

								file, etc,) may be easier to reason about.


								I propose that we consider the following specific questions about storage

								engines:


								- does Badger's evolving development, which may result in data file format

								  changes in the future, and could restrict our access to using the latest

								  version of the library between major upgrades, present a problem?


								- do we do we have goals/concerns about memory footprint that Badger may

								  prevent us from hitting, particularly as data sets grow over time?


								- what kind of additional tooling might we need/like to build (dump/restore,

								  etc.)?


								- do we want to run unit/integration tests against a data files on disk rather

								  than relying exclusively on the memory database?


								Project Scope

								~~~~~~~~~~~~~


								This project will consist of the following aspects:


								- selecting a storage engine, and modifying the tendermint codebase to

								  disallow any configuration of the storage engine outside of the tendermint.


								- remove the dependency on the current tm-db interfaces and replace with some

								  internalized, safe, and ergonomic interface for data persistence with all

								  required database semantics.


								- update core tendermint code to use the new interface and data tools.


								Next Steps

								~~~~~~~~~~


								- circulate the RFC, and discuss options with appropriate stakeholders.


								- write brief ADR to summarize decisions around technical decisions reached

								  during the RFC phase.


								References

								----------


								- `bolddb <https://github.com/etcd-io/bbolt>`_

								- `badger <https://github.com/dgraph-io/badger>`_

								- `badgerdb overview <https://dbdb.io/db/badgerdb>`_

								- `botldb overview <https://dbdb.io/db/boltdb>`_

								- `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_

								- `bolthold <https://github.com/timshannon/bolthold>`_

								- `badgerhold <https://github.com/timshannon/badgerhold>`_

								- `Pebble <https://github.com/cockroachdb/pebble>`_

								- `SDK Issue Regarding IVAL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_

								- `SDK Discussion about SMT/IVAL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_


								Discussion

								----------


								- All things being equal, my tendency would be to use badger, with badgerhold

								  (if that makes sense) for its ergonomics and indexing capabilities, which

								  will require some small selection of wrappers for better write transaction

								  support. This is a weakly held tendency/belief and I think it would be

								  useful for the RFC process to build consensus (or not) around this basic

								  assumption.