===========================================
RFC 001: Storage Engines and Database Layer
===========================================
|
Changelog
---------

- 2021-04-19: Initial Draft (gist)
- 2021-09-02: Migrated to RFC folder, with some updates
|
Abstract
--------

The part of Tendermint that is responsible for persistence and storage (often
called "the database" internally) is a bottleneck in the architecture of the
platform, and the 0.36 release presents a good opportunity to correct it. The
current storage engine layer provides a great deal of flexibility that is
difficult for users to leverage or benefit from, while also making it harder
for Tendermint Core developers to deliver improvements to the storage engine.
This RFC discusses possible improvements to this layer of the system.
|
Background
----------

Tendermint has a very thin common wrapper that makes Tendermint itself
(largely) agnostic to the data storage layer (within the realm of the popular
key-value/embedded databases). This flexibility is not particularly useful:
the benefits of a specific database engine in the context of Tendermint are
not particularly well understood, and the maintenance burden for multiple
backends is not commensurate with the benefit provided. Additionally, because
the data storage layer is handled generically, and most tests run with an
in-memory framework, it's difficult to take advantage of any higher-level
features of a database engine.

Ideally, developers within Tendermint will be able to interact with persisted
data via an interface that functions approximately like an object store, and
this storage interface will be able to accommodate all existing persistence
workloads (e.g. block storage, local peer management information like the
"address book", and crash-recovery logs like the WAL). In addition to
providing a more ergonomic interface and new semantics, by selecting a single
storage engine Tendermint can use the native durability and atomicity features
of the storage engine and simplify its own implementations.
|
Data Access Patterns
~~~~~~~~~~~~~~~~~~~~

Tendermint's data access patterns have the following characteristics:

- aggregate data size often exceeds memory.

- most data (e.g. blocks) is rarely mutated after it is written, but nodes
  also persist small amounts of working data that are frequently mutated
  (e.g. peer information, validator information).

- read patterns can be quite random.

- crash resistance and crash recovery, provided by write-ahead logs (in
  consensus, and potentially for the mempool), should allow the system to
  resume work after an unexpected shutdown.
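The crash-recovery requirement in the last item can be illustrated with a
minimal write-ahead log sketch. This is not Tendermint's WAL format: the record
layout (length, CRC-32, payload), the names ``appendRecord``, ``replay``, and
``runDemo``, and the 1 MiB record limit are all assumptions made for
illustration. Replay stops at the first torn or corrupt record, which is how an
unexpected shutdown in the middle of a write would typically show up.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
	"io"
	"os"
	"path/filepath"
)

// maxRecord is an assumed upper bound that guards against corrupt lengths.
const maxRecord = 1 << 20

// appendRecord writes [length][crc32][payload] and syncs before returning,
// so an acknowledged record survives a process crash.
func appendRecord(f *os.File, payload []byte) error {
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.BigEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	if _, err := f.Write(append(hdr[:], payload...)); err != nil {
		return err
	}
	return f.Sync()
}

// isEOF reports whether err marks the end of the log or a truncated read.
func isEOF(err error) bool {
	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF)
}

// replay returns every intact record, stopping at the first torn or
// corrupt one instead of failing the whole recovery.
func replay(r io.Reader) ([][]byte, error) {
	var records [][]byte
	for {
		var hdr [8]byte
		if _, err := io.ReadFull(r, hdr[:]); err != nil {
			if isEOF(err) {
				return records, nil
			}
			return records, err
		}
		n := binary.BigEndian.Uint32(hdr[0:4])
		if n > maxRecord {
			return records, nil // implausible length: treat as corrupt tail
		}
		payload := make([]byte, n)
		if _, err := io.ReadFull(r, payload); err != nil {
			if isEOF(err) {
				return records, nil // torn write at the tail
			}
			return records, err
		}
		if crc32.ChecksumIEEE(payload) != binary.BigEndian.Uint32(hdr[4:8]) {
			return records, nil // checksum mismatch: stop here
		}
		records = append(records, payload)
	}
}

// runDemo writes two records plus a simulated torn write, then replays.
func runDemo() ([]string, error) {
	dir, err := os.MkdirTemp("", "wal-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "wal")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	for _, msg := range []string{"vote height=1", "vote height=2"} {
		if err := appendRecord(f, []byte(msg)); err != nil {
			return nil, err
		}
	}
	// Simulated crash: the header promises 100 bytes, only 3 were written.
	if _, err := f.Write([]byte{0, 0, 0, 100, 0, 0, 0, 0, 'x', 'y', 'z'}); err != nil {
		return nil, err
	}
	r, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	recs, err := replay(r)
	out := make([]string, len(recs))
	for i, rec := range recs {
		out[i] = string(rec)
	}
	return out, err
}

func main() {
	recs, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(recs)
}
```

An embedded engine with native durability could take over this role, which is
part of the simplification this RFC is after.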
|
Project Goals
~~~~~~~~~~~~~

As we think about replacing the current persistence layer, we should consider
the following high-level goals:

- drop dependencies on storage engines that have a CGo dependency.

- encapsulate the data format and data storage, hiding them from higher-level
  services (e.g. reactors) within Tendermint.

- select a storage engine that does not incur any additional operational
  complexity (e.g. the database should be embedded).

- provide database semantics with sufficient ACID guarantees, snapshots, and
  transactional support.
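The transactional goal in the last item could be exposed to callers as a
closure-based update, which is the shape both bbolt and Badger offer through
their ``Update`` methods. The sketch below is an in-memory stand-in, not either
library's API: ``DB``, ``Tx``, ``ErrAbort``, and their methods are hypothetical
names, and it shows only atomicity (all writes in a closure apply, or none do),
not durability or snapshots.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAbort is a hypothetical error used to roll back a transaction.
var ErrAbort = errors.New("abort")

// Tx buffers writes so they can be applied or discarded as a unit.
type Tx struct {
	base    map[string][]byte
	pending map[string][]byte
}

// Get checks pending writes first, so a transaction sees its own writes.
func (tx *Tx) Get(key string) ([]byte, bool) {
	if v, ok := tx.pending[key]; ok {
		return v, true
	}
	v, ok := tx.base[key]
	return v, ok
}

// Set records a write without touching committed data.
func (tx *Tx) Set(key string, value []byte) {
	tx.pending[key] = append([]byte(nil), value...)
}

// DB is a hypothetical in-memory store that serializes writers.
type DB struct {
	mu   sync.Mutex
	data map[string][]byte
}

// NewDB returns an empty store.
func NewDB() *DB { return &DB{data: make(map[string][]byte)} }

// Update runs fn in a write transaction: the buffered writes are committed
// if fn returns nil and discarded if it returns an error.
func (db *DB) Update(fn func(tx *Tx) error) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	tx := &Tx{base: db.data, pending: make(map[string][]byte)}
	if err := fn(tx); err != nil {
		return err
	}
	for k, v := range tx.pending {
		db.data[k] = v
	}
	return nil
}

// Has reports whether key is present in committed data.
func (db *DB) Has(key string) bool {
	db.mu.Lock()
	defer db.mu.Unlock()
	_, ok := db.data[key]
	return ok
}

func main() {
	db := NewDB()

	// A failing closure leaves no partial writes behind.
	err := db.Update(func(tx *Tx) error {
		tx.Set("validator/1", []byte("power=10"))
		return ErrAbort
	})
	fmt.Println(err, db.Has("validator/1"))

	// A successful closure commits all of its writes together.
	err = db.Update(func(tx *Tx) error {
		tx.Set("validator/1", []byte("power=10"))
		tx.Set("validator/2", []byte("power=20"))
		return nil
	})
	fmt.Println(err, db.Has("validator/1"), db.Has("validator/2"))
}
```

With this shape, a reactor that updates several related records never has to
clean up after a half-applied change.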
|
Open Questions
~~~~~~~~~~~~~~

The following questions remain:

- what kind of data-access concurrency does Tendermint require?

- would users of Tendermint (the SDK, etc.) benefit from some shared database
  infrastructure?

  - In earlier conversations it seemed as if the SDK had selected Badger and
    RocksDB for its storage engines, and it might make sense to be able to
    (optionally) pass a handle to a Badger instance between the libraries in
    some cases.

- what are typical data sizes, and what kinds of memory sizes can we expect
  operators to be able to provide?

- in addition to simple persistence, what kind of additional semantics would
  Tendermint like to enjoy (e.g. transactional semantics, unique constraints,
  indexes, in-place updates, etc.)?
|
Decision Framework
~~~~~~~~~~~~~~~~~~

Given the constraint of removing the CGo dependency, the decision is between
"badger" and "boltdb" (in the form of the etcd/CoreOS fork) as the low-level
storage engine. On top of this, and somewhat orthogonally, we must also decide
on the interface to the database and how the larger application will have to
interact with the database layer. Users of the data layer shouldn't ever need
to interact with raw byte slices from the database, and should mostly have the
experience of interacting with Go types.
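One way to picture the last point is a thin typed layer that owns encoding and
key layout, so callers only see Go values. The following is a sketch under
several assumptions: ``KV``, ``memKV``, ``Store``, ``Peer``, and
``ErrNotFound`` are hypothetical names, JSON stands in for whatever encoding is
eventually chosen, and the generic type parameter requires Go 1.18 or later.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrNotFound is returned when a key has no stored value.
var ErrNotFound = errors.New("not found")

// KV is a hypothetical stand-in for the low-level engine (Badger, bbolt, ...).
type KV interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// memKV is an in-memory KV used only for this illustration.
type memKV map[string][]byte

func (m memKV) Get(key []byte) ([]byte, error) {
	v, ok := m[string(key)]
	if !ok {
		return nil, ErrNotFound
	}
	return v, nil
}

func (m memKV) Set(key, value []byte) error {
	m[string(key)] = append([]byte(nil), value...)
	return nil
}

// Store exposes Go values of type T and keeps byte slices private.
type Store[T any] struct {
	kv     KV
	prefix string
}

// NewStore returns a typed store whose keys live under prefix.
func NewStore[T any](kv KV, prefix string) *Store[T] {
	return &Store[T]{kv: kv, prefix: prefix}
}

// Put encodes v and writes it under prefix/id.
func (s *Store[T]) Put(id string, v T) error {
	data, err := json.Marshal(v)
	if err != nil {
		return fmt.Errorf("encode %s/%s: %w", s.prefix, id, err)
	}
	return s.kv.Set([]byte(s.prefix+"/"+id), data)
}

// Get reads prefix/id and decodes it into a T.
func (s *Store[T]) Get(id string) (T, error) {
	var out T
	data, err := s.kv.Get([]byte(s.prefix + "/" + id))
	if err != nil {
		return out, err
	}
	if err := json.Unmarshal(data, &out); err != nil {
		return out, fmt.Errorf("decode %s/%s: %w", s.prefix, id, err)
	}
	return out, nil
}

// Peer is a hypothetical record type, loosely modeled on address book data.
type Peer struct {
	ID      string `json:"id"`
	Address string `json:"address"`
}

func main() {
	peers := NewStore[Peer](memKV{}, "peers")
	if err := peers.Put("a1", Peer{ID: "a1", Address: "10.0.0.1:26656"}); err != nil {
		panic(err)
	}
	p, err := peers.Get("a1")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.ID, p.Address)
}
```

Because callers depend only on ``Store``, the engine behind ``KV`` and the
encoding can change without touching the code that uses the typed layer.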
|
Badger is more consistently developed and has a broader feature set than
Bolt. At the same time, Badger is likely more memory intensive and may have
more overhead in terms of open file handles given its model. At first glance,
Badger is the obvious choice: it's actively developed and it has a lot of
features that could be useful. Bolt is not without some benefits: it's stable
and is maintained by the etcd folks, and its simpler model (single
memory-mapped file, etc.) may be easier to reason about.
|
I propose that we consider the following specific questions about storage
engines:

- does Badger's ongoing development, which may result in data file format
  changes in the future and could restrict our ability to use the latest
  version of the library between major upgrades, present a problem?

- do we have goals/concerns about memory footprint that Badger may prevent us
  from hitting, particularly as data sets grow over time?

- what kind of additional tooling might we need/like to build (dump/restore,
  etc.)?

- do we want to run unit/integration tests against data files on disk rather
  than relying exclusively on the memory database?
|
Project Scope
~~~~~~~~~~~~~

This project will consist of the following aspects:

- selecting a storage engine, and modifying the Tendermint codebase to
  disallow any configuration of the storage engine outside of Tendermint.

- removing the dependency on the current tm-db interfaces and replacing them
  with an internalized, safe, and ergonomic interface for data persistence
  with all required database semantics.

- updating core Tendermint code to use the new interface and data tools.
|
Next Steps
~~~~~~~~~~

- circulate the RFC, and discuss options with appropriate stakeholders.

- write a brief ADR to summarize the technical decisions reached during the
  RFC phase.
|
References
----------

- `boltdb <https://github.com/etcd-io/bbolt>`_
- `badger <https://github.com/dgraph-io/badger>`_
- `badgerdb overview <https://dbdb.io/db/badgerdb>`_
- `boltdb overview <https://dbdb.io/db/boltdb>`_
- `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_
- `bolthold <https://github.com/timshannon/bolthold>`_
- `badgerhold <https://github.com/timshannon/badgerhold>`_
- `Pebble <https://github.com/cockroachdb/pebble>`_
- `SDK Issue Regarding IAVL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_
- `SDK Discussion about SMT/IAVL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_
|
Discussion
----------

- All things being equal, my tendency would be to use Badger, with badgerhold
  (if that makes sense) for its ergonomics and indexing capabilities, which
  will require a small set of wrappers for better write transaction
  support. This is a weakly held tendency/belief, and I think it would be
  useful for the RFC process to build consensus (or not) around this basic
  assumption.