Finish main argument on merkle data store architecture

7 years ago · 1bf5ae9c8d
--- a/docs/architecture/merkle-frey.md
+++ b/docs/architecture/merkle-frey.md
@@ -2,9 +2,9 @@

 ## TL;DR

 To allow the efficient creation of an ABCi app, tendermint wishes to provide a reference implemention of a key-value store that provides merkle proofs of the data.  These proofs then quickly allow the ABCi app to provide an apphash to the consensus engine, as well as a full proof to any client.
 To allow the efficient creation of an ABCi app, tendermint wishes to provide a reference implementation of a key-value store that provides merkle proofs of the data.  These proofs then quickly allow the ABCi app to provide an app hash to the consensus engine, as well as a full proof to any client.

 This is equivalent to building a database, and I would propose designing it from the API first, then looking how to implement this (or make an adaptor from the API to existing implementations). Once we agree on the functionality and the interface, we can implement the API bindings, and then work on building adaptors to existince merkle-ized data stores, or modifying the stores to support this interface.
 This is equivalent to building a database, and I would propose designing it from the API first, then looking at how to implement this (or make an adapter from the API to existing implementations). Once we agree on the functionality and the interface, we can implement the API bindings, and then work on building adapters to existing merkle-ized data stores, or modifying the stores to support this interface.

 We need to consider the API (both in-process and over the network), language bindings, maintaining handles to old state (and garbage collecting), persistence, security, providing merkle proofs, and general key-value store operations. To stay consistent with the blockchain's "single global order of operations", this data store should only allow one connection at a time to have write access.

@@ -13,11 +13,11 @@ We need to consider the API (both in-process and over the network), language bin
 * **State**
  * There are two concepts of state, "committed state" and "working state"
  * The working state is only accessible from the ABCi app, allows writing, but does not need to support proofs.
  * When we commit the "working state", it becomes a new "commmited state" and has an immutible root hash, provides proofs, and can be exposed to external clients.
  * When we commit the "working state", it becomes a new "committed state" and has an immutable root hash, provides proofs, and can be exposed to external clients.
 * **Transactions**
  * The database always allows creating a read-only transaction at the last "committed state", this transaction can serve read queries and proofs.
  * The database maintains all data to serve these read transactions until they are closed by the client (or time out).  This allows the client(s) to determine how much old info is needed
  * The database can only support *maximal* one writable transaction at a time.  This makes it easy to enforce serializability, and attempting to start a second writeable transaction may trigger a panic.
  * The database supports *at most* one writable transaction at a time.  This makes it easy to enforce serializability, and attempting to start a second writable transaction may trigger a panic.
 * **Functionality**
  * It must support efficient key-value operations (get/set/delete)
  * It must support returning merkle proofs for any "committed state"
@@ -28,15 +28,15 @@ We need to consider the API (both in-process and over the network), language bin
  * This interface should be domain-specific - ie. designed just for this use case
  * It should present a simple go interface for embedding the data store in-process
  * It should create a gRPC/protobuf API for calling from any client
  * It should provide and maintain client adaptors from our in-process interface to gRPC client calls for at least golang and java (maybe more languages?)
  * It should provide and maintain server adaptors from our gRPC calls to the in-process interface for golang at least (unless there is another server we wish to support)
  * It should provide and maintain client adapters from our in-process interface to gRPC client calls for at least golang and Java (maybe more languages?)
  * It should provide and maintain server adapters from our gRPC calls to the in-process interface for golang at least (unless there is another server we wish to support)
 * **Persistence**
  * It must support atomic persistance upon committing a new block.  That is, upon crash recovery, the state is guaranteed to represent the state at the end of a complete block (along with a note of which height it was).
  * It must delay deletion of old data as long as there are open read-only transactions refering to it, thus we must maintain some sort of WAL to keep track of pending cleanup.
  * It must support atomic persistence upon committing a new block.  That is, upon crash recovery, the state is guaranteed to represent the state at the end of a complete block (along with a note of which height it was).
  * It must delay deletion of old data as long as there are open read-only transactions referring to it, thus we must maintain some sort of WAL to keep track of pending cleanup.
  * When a transaction is closed, or when we recover from a crash, it should clean up all no longer needed data to avoid memory/storage leaks.
 * **Security and Auth**
  * If we allow connections over gRPC, we must consider this issues and allow both encyption (SSL), and some basic auth rules to provent undesired access to the DB
  * This is client-specific and does not need to be supported in the in-process, embeded version.
  * If we allow connections over gRPC, we must consider these issues and allow both encryption (SSL) and some basic auth rules to prevent undesired access to the DB
  * This is client-specific and does not need to be supported in the in-process, embedded version.

 ## Details

@@ -47,13 +47,13 @@ Here we go more in-depth in each of the sections, explaining the reasoning and m

 The current ABCi interface avoids this question a bit and that has brought confusion.  If I use `merkleeyes` to store data, which state is returned from `Query`?  The current "working" state, which I would like to refer to in my ABCi application?  Or the last committed state, which I would like to return to a client's query?  Or an old state, which I may select based on height?

 Right now, `merkleeyes` implements `Query` like a normal ABCi app and only returns committed state, which has lead to problems and confusion.  Thus, we need to be explicit about which state we want to view.  Each viewer can then specify which state it wants to view.  This allows the app to query the workign state in DeliverTx, but the committed state in Query.
 Right now, `merkleeyes` implements `Query` like a normal ABCi app and only returns committed state, which has led to problems and confusion.  Thus, we need to be explicit about which state we want to view.  Each viewer can then specify which state it wants to view.  This allows the app to query the working state in DeliverTx, but the committed state in Query.

 We can easily provide two global references for "last committed" and "current working" states.  However, if we want to also allow querying of older commits... then we need some way to keep track of which ones are still in use, so we can garbage collect the unneeded ones. There is a non-trivial overhead in holdign references to all past states, but also a hardcoded solution (hold onto the last 5 commits) may not support all clients.  We should let the client define this somehow.
 We can easily provide two global references for "last committed" and "current working" states.  However, if we want to also allow querying of older commits... then we need some way to keep track of which ones are still in use, so we can garbage collect the unneeded ones. There is a non-trivial overhead in holding references to all past states, but also a hard-coded solution (hold onto the last 5 commits) may not support all clients.  We should let the client define this somehow.

 ### Transactions

 Transactions (in the typical database sense) are a clean and estabilished solution to this issue.  We can look at the [isolations levels](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable) which attempt to provide us things like "repeatable reads".  That means if we open a transaction, and query some data 100 times while other processes are writing to the db, we get the same result each time.  This transaction has a reference to its own local state from the time the transaction started. (We are refering to the highest isolation levels here, which correlate well this the blockchain use case).
 Transactions (in the typical database sense) are a clean and established solution to this issue.  We can look at the [isolation levels](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable) which attempt to provide us things like "repeatable reads".  That means if we open a transaction, and query some data 100 times while other processes are writing to the db, we get the same result each time.  This transaction has a reference to its own local state from the time the transaction started. (We are referring to the highest isolation levels here, which correlate well with the blockchain use case).

 If we implement a read-only transaction as a reference to state at the time of creation of that transaction, we can then hold these references to various snapshots, one per block that we are interested in, and allow the client to multiplex queries and proofs from these various blocks.

@@ -63,11 +63,21 @@ There is also a nice extension to this available on some database servers, basic

 If you don't understand why this is useful, look at how basecoin needs to [hold cached state for AppTx](https://github.com/tendermint/basecoin/blob/master/state/execution.go#L126-L149), meaning that it rolls back all modifications if the AppTx returns an error. This was implemented as a wrapper in basecoin, but it is a reasonable thing to support in the DB interface itself (especially since the implementation becomes quite non-trivial as soon as you support range queries).

 To give a bit more reference to this concept in practice, read about [Savepoints in Postgesql](https://www.postgresql.org/docs/current/static/tutorial-transactions.html) ([reference](https://www.postgresql.org/docs/current/static/sql-savepoint.html)) or [Nesting transactions in SQL Server](http://dba-presents.com/index.php/databases/sql-server/43-nesting-transactions-and-save-transaction-command) (TL;DR: scroll to the bottom, section "Real nesting transactions with SAVE TRANSACTION")
 To give a bit more reference to this concept in practice, read about [Savepoints in PostgreSQL](https://www.postgresql.org/docs/current/static/tutorial-transactions.html) ([reference](https://www.postgresql.org/docs/current/static/sql-savepoint.html)) or [Nesting transactions in SQL Server](http://dba-presents.com/index.php/databases/sql-server/43-nesting-transactions-and-save-transaction-command) (TL;DR: scroll to the bottom, section "Real nesting transactions with SAVE TRANSACTION")

 ### Functionality

 **TODO**
 Merkle trees work with key-value pairs, so we should most importantly focus on the basic Key-Value operations.  That is `Get`, `Set`, and `Remove`. We also need to return a merkle proof for any key, along with a root hash of the tree for committing state to the blockchain. This is just the basic merkle-tree stuff.

 If it is possible with the implementation, it is nice to provide access to Range Queries.  That is, return all values where the key is between X and Y.  If you construct your keys wisely, it is possible to store list (1:N) relations this way.  E.g., if I store blog posts under the key blog:`poster_id`:`sequence`, then I could search for all blog posts by a given `poster_id`, or even return just posts 10-19 from the given poster.

 The construction of a tree that supports range queries was one of the [design decisions of go-merkle](https://github.com/tendermint/go-merkle/blob/master/README.md).  It is also kind of possible with [ethereum's patricia trie](https://github.com/ethereum/wiki/wiki/Patricia-Tree) as long as the key is less than 32 bytes.
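
As a rough illustration of the blog-post example above, here is a minimal sketch of a range query over composite keys. It uses a plain sorted slice instead of a merkle tree, and the helper names (`blogKey`, `rangeQuery`) and the zero-padded sequence format are assumptions made for this example only, not part of go-merkle.

```go
package main

import (
	"fmt"
	"sort"
)

// blogKey builds a hypothetical composite key. The sequence is zero-padded so
// that the lexicographic order of keys matches the numeric order of posts.
func blogKey(posterID string, seq int) string {
	return fmt.Sprintf("blog:%s:%06d", posterID, seq)
}

// rangeQuery returns every key k with start <= k < end from a sorted slice.
func rangeQuery(sorted []string, start, end string) []string {
	lo := sort.SearchStrings(sorted, start)
	hi := sort.SearchStrings(sorted, end)
	if hi < lo {
		return nil
	}
	return sorted[lo:hi]
}

func main() {
	var keys []string
	for seq := 0; seq < 25; seq++ {
		keys = append(keys, blogKey("alice", seq))
	}
	keys = append(keys, blogKey("bob", 0))
	sort.Strings(keys)

	// Posts 10-19 from poster "alice".
	page := rangeQuery(keys, blogKey("alice", 10), blogKey("alice", 20))
	fmt.Println(len(page), page[0], page[len(page)-1])
}
```

The padding matters: without it, `blog:alice:2` would sort after `blog:alice:10`, and a range over sequence numbers would return the wrong posts.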

 In addition to range queries, there is one more nice feature that we could add to our data store - listening to events. Depending on your context, this is "reactive programming", "event emitters", "notifications", etc... But the basic concept is that a client can listen for all changes to a given key (or set of keys), and receive a notification when this happens. This is very important to avoid [repeated polling and wasted queries](http://resthooks.org/) when a client simply wants to [detect changes](https://www.rethinkdb.com/blog/realtime-web/).

 If the database provides access to some "listener" functionality, the app can choose to expose this to the external client via websockets, web hooks, http2 push events, android push notifications, etc.  But if we want to support modern client functionality, let's add support for this reactive paradigm in our DB interface.
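
To make the listener idea a little more concrete, here is a toy sketch of a store that notifies subscribers when a key changes. Every name in it (`NotifyingStore`, `Listen`, `Event`) is hypothetical, it is not safe for concurrent use, and a real store would probably notify on commit rather than on every write to the working state.

```go
package main

import "fmt"

// Event describes a change to one key. All names here are hypothetical.
type Event struct {
	Key   string
	Value string
}

// NotifyingStore is a toy in-memory store that reports writes to listeners.
type NotifyingStore struct {
	data      map[string]string
	listeners map[string][]chan Event
}

func NewNotifyingStore() *NotifyingStore {
	return &NotifyingStore{
		data:      make(map[string]string),
		listeners: make(map[string][]chan Event),
	}
}

// Listen returns a buffered channel that receives changes to the given key.
func (s *NotifyingStore) Listen(key string, buffer int) <-chan Event {
	ch := make(chan Event, buffer)
	s.listeners[key] = append(s.listeners[key], ch)
	return ch
}

// Set stores the value and notifies listeners without blocking the writer;
// an event is dropped if a listener's buffer is already full.
func (s *NotifyingStore) Set(key, value string) {
	s.data[key] = value
	for _, ch := range s.listeners[key] {
		select {
		case ch <- Event{Key: key, Value: value}:
		default:
		}
	}
}

func main() {
	store := NewNotifyingStore()
	changes := store.Listen("balance/alice", 4)
	store.Set("balance/alice", "100")
	store.Set("balance/bob", "7") // no listener on this key, nothing is sent
	ev := <-changes
	fmt.Println(ev.Key, ev.Value)
}
```

Dropping events for slow listeners keeps the single writer from stalling, at the cost of clients having to re-query when they suspect they missed an update.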

 **TODO** support for more advanced backends, eg. Bolt....

 ### Go Interface

@@ -76,7 +86,7 @@ I will start with a simple go interface to illustrate the in-process interface.
 ```
 // DB represents the committed state of a merkle-ized key-value store
 type DB interface {
  // Snapshot returns a reference to last commited state to use for
  // Snapshot returns a reference to last committed state to use for
  // providing proofs, you must close it at the end to garbage collect
  // the historical state we hold on to to make these proofs
  Snapshot() Prover
@@ -186,8 +196,22 @@ To encourage adoption, we should provide a nice client that uses this gRPC inter

 ### Persistence

 **TODO**
 Any data store worth its name should not lose all data on a crash.  Even [redis provides some persistence](https://redis.io/topics/persistence) these days. Ideally, if the system crashes and restarts, it should have the data at the last block N that was committed.  If the system crashes during the commit of block N+1, then the recovered state should either be block N or completely committed block N+1, but no partial state between the two.  Basically, the commit must be an atomic operation (even if updating hundreds of records).

 To avoid a lot of headaches ourselves, we can use an existing data store, such as leveldb, which provides `WriteBatch` to group all operations.

 The other issue is cleaning up old state.  We cannot delete any information from our persistent store, as long as any snapshot holds a reference to it (or else we get some panics when the data we query is not there).  So, we need to store the outstanding deletions that we can perform when the snapshot is `Close`d.  In addition, we must consider the case that the data store crashes with open snapshots.  Thus, the info on outstanding deletions must also be persisted somewhere.  Something like a "delete-behind log" (the opposite of a "write ahead log").

 This is not a concern of the generic interface, but each implementation should take care to handle this well to avoid accumulation of unused references in the data store and eventual data bloat.
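
One way to picture the "delete-behind log" is the in-memory sketch below. The `pruner` type and its methods are hypothetical; it keeps an orphaned node until no open snapshot is at or before the last version that used it (a conservative rule), and a real implementation would have to persist the pending list atomically with each commit so it survives a crash.

```go
package main

import (
	"fmt"
	"sort"
)

// pruner tracks open snapshots and node keys that are waiting to be deleted.
type pruner struct {
	open    map[int64]int      // version -> number of open snapshots
	pending map[int64][]string // last version using a node -> orphaned node keys
	deleted []string           // stands in for deletes on the backing store
}

func newPruner() *pruner {
	return &pruner{open: map[int64]int{}, pending: map[int64][]string{}}
}

// Open records a new snapshot at the given committed version.
func (p *pruner) Open(version int64) { p.open[version]++ }

// Orphan records a node that is no longer part of any version after lastVersion.
func (p *pruner) Orphan(lastVersion int64, nodeKey string) {
	p.pending[lastVersion] = append(p.pending[lastVersion], nodeKey)
}

// Close releases one snapshot and deletes whatever is no longer referenced.
func (p *pruner) Close(version int64) {
	if p.open[version] > 0 {
		p.open[version]--
	}
	if p.open[version] == 0 {
		delete(p.open, version)
	}
	p.sweep()
}

// neededBySnapshot reports whether an open snapshot is at or before lastVersion.
func (p *pruner) neededBySnapshot(lastVersion int64) bool {
	for v := range p.open {
		if v <= lastVersion {
			return true
		}
	}
	return false
}

// sweep deletes, oldest first, every pending node that no snapshot can reach.
func (p *pruner) sweep() {
	versions := make([]int64, 0, len(p.pending))
	for v := range p.pending {
		versions = append(versions, v)
	}
	sort.Slice(versions, func(i, j int) bool { return versions[i] < versions[j] })
	for _, v := range versions {
		if p.neededBySnapshot(v) {
			continue
		}
		p.deleted = append(p.deleted, p.pending[v]...)
		delete(p.pending, v)
	}
}

func main() {
	p := newPruner()
	p.Open(1)
	p.Open(2)
	p.Orphan(1, "node-a") // replaced when version 2 was committed
	p.Orphan(2, "node-b") // replaced when version 3 was committed
	p.Close(1)
	fmt.Println(p.deleted) // node-a is gone, node-b waits for snapshot 2
	p.Close(2)
	fmt.Println(p.deleted)
}
```

On crash recovery there are no open snapshots left, so replaying the persisted pending list with an empty `open` map would delete everything that was still waiting.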

 ### Security

 **TODO**
 When allowing access out-of-process, we should provide different mechanisms to secure it.  The first is the choice of binding to a local unix socket or a tcp port.  The second is the optional use of ssl to encrypt the connection (very important over tcp).  The third is authentication to control access to the database.

 We may also want to consider the case of two server connections with different permissions, eg. a local unix socket that allows write access with no more credentials, and a public TCP connection with ssl and authentication that only provides read-only access.

 The use of ssl is quite easy in go; we just need to generate and sign a certificate.  It is nice to be able to disable it for dev machines, but it is very important for production.

 For authentication, let me sketch out a minimal solution. The server could just have a simple config file with key/bcrypt(password) pairs along with read/write permission level, and read that upon startup.  The client must provide a username and password in the HTTP headers when making the original HTTPS gRPC connection.

 This is super minimal to provide some protection. Things like LDAP, OAuth and single sign-on seem overkill and even potential security holes.  Maybe there is another solution somewhere in the middle.