|
|
@ -0,0 +1,140 @@ |
|
|
|
# RFC 007: Deterministic Proto Byte Serialization |
|
|
|
|
|
|
|
## Changelog |
|
|
|
|
|
|
|
- 09-Dec-2021: Initial draft (@williambanfield). |
|
|
|
|
|
|
|
## Abstract |
|
|
|
|
|
|
|
This document discusses the issue of stable byte-representation of serialized messages |
|
|
|
within Tendermint and describes a few possible routes that could be taken to address it. |
|
|
|
|
|
|
|
## Background |
|
|
|
|
|
|
|
We use the byte representations of wire-format proto messages to produce |
|
|
|
and verify hashes of data within the Tendermint codebase as well as for |
|
|
|
producing and verifying cryptographic signatures over these signed bytes. |
|
|
|
|
|
|
|
The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation |
|
|
|
of a protocol buffer message will be the same between two calls to an encoder. |
|
|
|
While there is a mode to force the encoder to produce the same byte representation |
|
|
|
of messages within a single binary, these guarantees are not good enough for our |
|
|
|
use case in Tendermint. We require multiple different versions of a binary running |
|
|
|
Tendermint to be able to inter-operate. Additionally, we require that multiple different |
|
|
|
systems written in _different languages_ be able to participate in different aspects |
|
|
|
of the protocols of Tendermint and be able to verify the integrity of the messages |
|
|
|
they each produce. |
|
|
|
|
|
|
|
While this has not yet created a problem that we know of in a running network, we should |
|
|
|
make sure to provide stronger guarantees around the serialized representation of the messages |
|
|
|
used within the Tendermint consensus algorithm to prevent any issue from occurring. |
|
|
|
|
|
|
|
|
|
|
|
## Discussion |
|
|
|
|
|
|
|
Proto has the following points of variability that can produce non-deterministic byte representations: |
|
|
|
|
|
|
|
1. Encoding order of fields within a message. |
|
|
|
|
|
|
|
Proto allows fields to be encoded in any order and even be repeated. |
|
|
|
|
|
|
|
2. Encoding order of elements of a repeated field. |
|
|
|
|
|
|
|
`repeated` fields in a proto message can be serialized in any order. |
|
|
|
|
|
|
|
3. Presence or absence of default values. |
|
|
|
|
|
|
|
Types in proto have defined default values similar to Go's zero values. |
|
|
|
Writing or omitting a default value are both legal ways of encoding a wire message. |
|
|
|
|
|
|
|
4. Serialization of 'unknown' fields. |
|
|
|
|
|
|
|
Unknown fields can be present when a message is created by a binary with a newer |
|
|
|
version of the proto that contains fields that the deserializer in a different |
|
|
|
binary does not yet know about. Deserializers in binaries that do not know about the field |
|
|
|
will maintain the bytes of the unknown field but not place them into the deserialized structure. |
|
|
|
|
|
|
|
We have a few options to consider when producing this stable representation. |
|
|
|
|
|
|
|
### Options for deterministic byte representation |
|
|
|
|
|
|
|
#### Use only compliant serializers and constrain field usage |
|
|
|
|
|
|
|
According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple |
|
|
|
set of rules, gogoproto produces a consistent byte representation of serialized messages. |
|
|
|
This seems promising, although more research is needed to guarantee gogoproto always |
|
|
|
produces a consistent set of bytes on serialized messages. This would solve the problem |
|
|
|
within Tendermint as written in Go, but would require ensuring that there are similar |
|
|
|
serializers written in other languages that produce the same output as gogoproto. |
|
|
|
|
|
|
|
#### Reorder serialized bytes to ensure determinism |
|
|
|
|
|
|
|
The serialized form of a proto message can be transformed into a canonical representation |
|
|
|
by applying simple rules to the serialized bytes. Re-ordering the serialized bytes |
|
|
|
would allow Tendermint to produce a canonical byte representation without having to |
|
|
|
simultaneously maintain a custom proto marshaller. |
|
|
|
|
|
|
|
This could be implemented as a function in many languages that performs the following |
|
|
|
operations before producing the bytes to sign or hash: |
|
|
|
|
|
|
|
1. Exclude all data from unknown fields from the bytes to hash. |
|
|
|
|
|
|
|
Tendermint should not run into a case where it needs to verify the integrity of |
|
|
|
data with unknown fields for the following reasons: |
|
|
|
|
|
|
|
The purpose of checking hash equality within Tendermint is to ensure that |
|
|
|
its local copy of data matches the data that the network agreed on. There should |
|
|
|
therefore not be a case where a process is checking hash equality using data that it did not expect |
|
|
|
to receive. What the data represent may be opaque to the process, such as when checking the |
|
|
|
transactions in a block, _but the process will still have expected to receive this data_, |
|
|
|
despite not understanding what their internal structure is. It's not clear what it would |
|
|
|
mean to verify that a block contains data that a process does not know about. |
|
|
|
|
|
|
|
The same reasoning applies for signature verification within Tendermint. Processes |
|
|
|
verify that a digital signature is valid over a set of bytes by locally reconstructing the |
|
|
|
data structure that was signed, using the process's local data. |
|
|
|
|
|
|
|
2. Reorder all message fields into tag-sorted order. |
|
|
|
|
|
|
|
Tag-sorting top-level fields will place all fields with the same tag adjacent |
|
|
|
to each other within the serialized representation. |
|
|
|
|
|
|
|
3. Reorder the contents of all `repeated` fields into lexicographically sorted order. |
|
|
|
|
|
|
|
`repeated` fields will appear in a message as having the same tag but will contain different |
|
|
|
contents. Therefore, lexicographical sorting will produce a stable ordering of |
|
|
|
fields with the same tag. |
|
|
|
|
|
|
|
4. Delete all default values from the byte representation. |
|
|
|
|
|
|
|
Encoders can include default values or omit them. Most encoders appear to omit them |
|
|
|
but we may wish to delete them just to be safe. |
|
|
|
|
|
|
|
5. Recursively perform these operations on any length-delimited subfields. |
|
|
|
|
|
|
|
Length-delimited fields may contain messages, strings, or just bytes. However, |
|
|
|
it's not possible to know what data is being represented by such a field. |
|
|
|
A 'string' may happen to have the same structure as an embedded message and we cannot |
|
|
|
disambiguate. For this reason, we must apply these same rules to all subfields that |
|
|
|
may contain messages. Because we cannot know if we have totally mangled the interior 'string' |
|
|
|
or not, this data should never be deserialized or used for anything beyond hashing. |
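The operations above can be sketched in Go as follows. This is a minimal illustration written for this document, not the wirepb prototype, and it rests on several assumptions: it works without a schema, so it cannot identify unknown fields (operation 1 is left to the caller); it treats every zero-valued field as a default; and it rejects the deprecated group wire types. The names `field`, `parse`, `isDefault`, and `canonicalize` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"sort"
)

// field is one field as it appeared on the wire.
type field struct {
	num     uint64 // field number (tag)
	wtype   uint64 // wire type: 0 varint, 1 fixed64, 2 length-delimited, 5 fixed32
	payload []byte // value bytes, without the key or length prefix
}

var errMalformed = errors.New("malformed wire data")

// parse splits wire-format bytes into their top-level fields.
func parse(b []byte) ([]field, error) {
	var out []field
	for len(b) > 0 {
		key, n := binary.Uvarint(b)
		if n <= 0 || key>>3 == 0 {
			return nil, errMalformed
		}
		b = b[n:]
		size := 0
		switch key & 7 {
		case 0:
			_, size = binary.Uvarint(b) // size is the varint's byte length
			if size <= 0 {
				return nil, errMalformed
			}
		case 1:
			size = 8
		case 5:
			size = 4
		case 2:
			l, n := binary.Uvarint(b)
			if n <= 0 || l > uint64(len(b)-n) {
				return nil, errMalformed
			}
			b, size = b[n:], int(l)
		default: // groups (wire types 3 and 4) are not handled in this sketch
			return nil, errMalformed
		}
		if size > len(b) {
			return nil, errMalformed
		}
		out = append(out, field{key >> 3, key & 7, b[:size]})
		b = b[size:]
	}
	return out, nil
}

// isDefault reports whether f holds a zero value (assumed to be a default).
func isDefault(f field) bool {
	switch f.wtype {
	case 0:
		v, _ := binary.Uvarint(f.payload)
		return v == 0
	case 2:
		return len(f.payload) == 0
	default: // fixed32 and fixed64: all-zero bytes
		return bytes.Equal(f.payload, make([]byte, len(f.payload)))
	}
}

// canonicalize applies operations 2 through 5 to wire-format bytes.
func canonicalize(b []byte) ([]byte, error) {
	fields, err := parse(b)
	if err != nil {
		return nil, err
	}
	kept := fields[:0]
	for _, f := range fields {
		if f.wtype == 2 {
			// Operation 5: anything that parses as a message is rewritten.
			if sub, err := canonicalize(f.payload); err == nil {
				f.payload = sub
			}
		}
		if !isDefault(f) { // operation 4
			kept = append(kept, f)
		}
	}
	// Operations 2 and 3: sort by tag, then lexicographically by content.
	sort.SliceStable(kept, func(i, j int) bool {
		if kept[i].num != kept[j].num {
			return kept[i].num < kept[j].num
		}
		return bytes.Compare(kept[i].payload, kept[j].payload) < 0
	})
	var out []byte
	for _, f := range kept {
		out = binary.AppendUvarint(out, f.num<<3|f.wtype)
		if f.wtype == 2 {
			out = binary.AppendUvarint(out, uint64(len(f.payload)))
		}
		out = append(out, f.payload...)
	}
	return out, nil
}

func main() {
	// Field 3 set to its default, then fields 2 and 1 out of order.
	out, err := canonicalize([]byte{0x18, 0x00, 0x10, 0x03, 0x08, 0x05})
	fmt.Printf("% x %v\n", out, err) // prints "08 05 10 03 <nil>"
}
```

Because the recursion rewrites any length-delimited payload that happens to parse as a message, the output of `canonicalize` is suitable only as input to a hash or signature function, as noted above.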
|
|
|
|
|
|
|
A **prototype** implementation by @creachadair of this can be found in [the wirepb repo][wire-pb]. |
|
|
|
This could be implemented in multiple languages more simply than ensuring that there are |
|
|
|
canonical proto serializers that match in each language. |
|
|
|
|
|
|
|
### Future work |
|
|
|
|
|
|
|
We should add clear documentation to the Tendermint codebase wherever we |
|
|
|
compare hashes of proto messages or use proto-serialized bytes to produce a |
|
|
|
digital signature, noting that we have been careful to ensure that the hashing is performed |
|
|
|
properly. |
|
|
|
|
|
|
|
### References |
|
|
|
|
|
|
|
[proto-spec-encoding]: https://developers.google.com/protocol-buffers/docs/encoding |
|
|
|
[spec-issue]: https://github.com/tendermint/tendermint/issues/5005 |
|
|
|
[cosmos-sdk-adr-27]: https://github.com/cosmos/cosmos-sdk/blob/master/docs/architecture/adr-027-deterministic-protobuf-serialization.md |
|
|
|
[cer-proto-3]: https://github.com/regen-network/canonical-proto3 |
|
|
|
[wire-pb]: https://github.com/creachadair/wirepb |
|
|
|
|