===========================================
RFC 001: Storage Engines and Database Layer
===========================================

Changelog
---------

- 2021-04-19: Initial Draft (gist)
- 2021-09-02: Migrated to RFC folder, with some updates

Abstract
--------

The aspect of Tendermint that's responsible for persistence and storage (often
called "the database" internally) represents a bottleneck in the architecture
of the platform, one that the 0.36 release presents a good opportunity to
correct. The current storage engine layer provides a great deal of flexibility
that is difficult for users to leverage or benefit from, while also making it
harder for Tendermint Core developers to deliver improvements to the storage
engine. This RFC discusses possible improvements to this layer of the system.

Background
----------

Tendermint has a very thin common wrapper that makes Tendermint itself
(largely) agnostic to the data storage layer (within the realm of the popular
key-value/embedded databases). This flexibility is not particularly useful:
the benefits of a specific database engine in the context of Tendermint are
not particularly well understood, and the maintenance burden of multiple
backends is not commensurate with the benefit provided. Additionally, because
the data storage layer is handled generically, and most tests run with an
in-memory framework, it's difficult to take advantage of any higher-level
features of a database engine.

Ideally, developers within Tendermint will be able to interact with persisted
data via an interface that functions approximately like an object store, and
this storage interface will be able to accommodate all existing persistence
workloads (e.g. block storage, local peer management information like the
"address book", and crash-recovery logs like the WAL). In addition to
providing a more ergonomic interface and new semantics, by selecting a single
storage engine Tendermint can use the native durability and atomicity features
of the storage engine and simplify its own implementations.

Data Access Patterns
~~~~~~~~~~~~~~~~~~~~

Tendermint's data access patterns have the following characteristics:

- aggregate data size often exceeds memory.

- most data (e.g. blocks) is rarely mutated after it's written, but small
  amounts of working data are persisted by nodes and frequently mutated
  (e.g. peer information, validator information).

- read patterns can be quite random.

- crash resistance and crash recovery, provided by write-ahead logs (in
  consensus, and potentially for the mempool), should allow the system to
  resume work after an unexpected shutdown.

Project Goals
~~~~~~~~~~~~~

As we think about replacing the current persistence layer, we should consider
the following high-level goals:

- drop dependencies on storage engines that have a CGo dependency.

- encapsulate data format and data storage from higher-level services
  (e.g. reactors) within Tendermint.

- select a storage engine that does not incur any additional operational
  complexity (e.g. the database should be embedded).

- provide database semantics with sufficient ACID, snapshot, and
  transactional support.

Open Questions
~~~~~~~~~~~~~~

The following questions remain:

- what kind of data-access concurrency does Tendermint require?

- would Tendermint users (the SDK, etc.) benefit from some shared database
  infrastructure?

  - In earlier conversations it seemed as if the SDK had selected Badger and
    RocksDB for its storage engines, and it might make sense to be able to
    (optionally) pass a handle to a Badger instance between the libraries in
    some cases.

- what are typical data sizes, and what kinds of memory sizes can we expect
  operators to be able to provide?

- in addition to simple persistence, what kind of additional semantics would
  Tendermint like to enjoy (e.g. transactional semantics, unique constraints,
  indexes, in-place updates, etc.)?

Decision Framework
~~~~~~~~~~~~~~~~~~

Given the constraint of removing the CGo dependency, the decision is between
"badger" and "boltdb" (in the form of the etcd/CoreOS fork) as the low-level
storage engine. On top of this, and somewhat orthogonally, we must also decide
on the interface to the database and how the larger application will have to
interact with the database layer. Users of the data layer shouldn't ever need
to interact with raw byte slices from the database, and should mostly have the
experience of interacting with Go types.

Badger is more consistently developed and has a broader feature set than
Bolt. At the same time, Badger is likely more memory intensive and may have
more overhead in terms of open file handles given its model. At first glance,
Badger is the obvious choice: it's actively developed and it has a lot of
features that could be useful. Bolt is not without some benefits: it's stable
and is maintained by the etcd folks, and its simpler model (a single
memory-mapped file, etc.) may be easier to reason about.

I propose that we consider the following specific questions about storage
engines:

- does Badger's evolving development, which may result in data file format
  changes in the future, and could restrict our ability to use the latest
  version of the library between major upgrades, present a problem?

- do we have goals/concerns about memory footprint that Badger may prevent us
  from hitting, particularly as data sets grow over time?

- what kind of additional tooling might we need/like to build (dump/restore,
  etc.)?

- do we want to run unit/integration tests against data files on disk rather
  than relying exclusively on the memory database?

Project Scope
~~~~~~~~~~~~~

This project will consist of the following aspects:

- select a storage engine, and modify the Tendermint codebase to disallow any
  configuration of the storage engine from outside of Tendermint.

- remove the dependency on the current tm-db interfaces and replace them with
  an internalized, safe, and ergonomic interface for data persistence with
  all required database semantics.

- update core Tendermint code to use the new interface and data tools.

Next Steps
~~~~~~~~~~

- circulate the RFC, and discuss options with appropriate stakeholders.

- write a brief ADR to summarize the technical decisions reached during the
  RFC phase.

References
----------

- `boltdb <https://github.com/etcd-io/bbolt>`_
- `badger <https://github.com/dgraph-io/badger>`_
- `badgerdb overview <https://dbdb.io/db/badgerdb>`_
- `boltdb overview <https://dbdb.io/db/boltdb>`_
- `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_
- `bolthold <https://github.com/timshannon/bolthold>`_
- `badgerhold <https://github.com/timshannon/badgerhold>`_
- `Pebble <https://github.com/cockroachdb/pebble>`_
- `SDK Issue Regarding IAVL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_
- `SDK Discussion about SMT/IAVL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_

Discussion
----------

- All things being equal, my tendency would be to use badger, with badgerhold
  (if that makes sense) for its ergonomics and indexing capabilities, which
  will require a small selection of wrappers for better write transaction
  support. This is a weakly held tendency/belief, and I think it would be
  useful for the RFC process to build consensus (or not) around this basic
  assumption.