diff --git a/docs/rfc/README.md b/docs/rfc/README.md index a66969042..400b7c2a5 100644 --- a/docs/rfc/README.md +++ b/docs/rfc/README.md @@ -44,5 +44,6 @@ sections. - [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md) - [RFC-005: Event System](./rfc-005-event-system.rst) - [RFC-006: Event Subscription](./rfc-006-event-subscription.md) +- [RFC-007: Deterministic Proto Byte Serialization](./rfc-007-deterministic-proto-bytes.md) diff --git a/docs/rfc/rfc-007-deterministic-proto-bytes.md b/docs/rfc/rfc-007-deterministic-proto-bytes.md new file mode 100644 index 000000000..0b55c2228 --- /dev/null +++ b/docs/rfc/rfc-007-deterministic-proto-bytes.md @@ -0,0 +1,140 @@ +# RFC 007 : Deterministic Proto Byte Serialization + +## Changelog + +- 09-Dec-2021: Initial draft (@williambanfield). + +## Abstract + +This document discusses the issue of stable byte-representation of serialized messages +within Tendermint and describes a few possible routes that could be taken to address it. + +## Background + +We use the byte representations of wire-format proto messages to produce +and verify hashes of data within the Tendermint codebase as well as for +producing and verifying cryptographic signatures over these signed bytes. + +The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation +of a protocol buffer message will be the same between two calls to an encoder. +While there is a mode to force the encoder to produce the same byte representation +of messages within a single binary, these guarantees are not good enough for our +use case in Tendermint. We require multiple different versions of a binary running +Tendermint to be able to inter-operate. Additionally, we require that multiple different +systems written in _different languages_ be able to participate in different aspects +of the protocols of Tendermint and be able to verify the integrity of the messages +they each produce. + +While this has not yet created a problem that we know of in a running network, we should +make sure to provide stronger guarantees around the serialized representation of the messages +used within the Tendermint consensus algorithm to prevent any issue from occurring. + + +## Discussion + +Proto has the following points of variability that can produce non-deterministic byte representation: + +1. Encoding order of fields within a message. + +Proto allows fields to be encoded in any order and even be repeated. + +2. Encoding order of elements of a repeated field. + +`repeated` fields in a proto message can be serialized in any order. + +3. Presence or absence of default values. + +Types in proto have defined default values similar to Go's zero values. +Writing or omitting a default value are both legal ways of encoding a wire message. + +4. Serialization of 'unknown' fields. + +Unknown fields can be present when a message is created by a binary with a newer +version of the proto that contains fields that the deserializer in a different +binary does not yet know about. Deserializers in binaries that do not know about the field +will maintain the bytes of the unknown field but not place them into the deserialized structure. + +We have a few options to consider when producing this stable representation. + +### Options for deterministic byte representation + +#### Use only compliant serializers and constrain field usage + +According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple +set of rules, gogoproto produces a consistent byte representation of serialized messages. +This seems promising, although more research is needed to guarantee gogoproto always +produces a consistent set of bytes on serialized messages. This would solve the problem +within Tendermint as written in Go, but would require ensuring that there are similar +serializers written in other languages that produce the same output as gogoproto. + +#### Reorder serialized bytes to ensure determinism. + +The serialized form of a proto message can be transformed into a canonical representation +by applying simple rules to the serialized bytes. Re-ordering the serialized bytes +would allow Tendermint to produce a canonical byte representation without having to +simultaneously maintain a custom proto marshaller. + +This could be implemented as a function in many languages that performed the following +producing bytes to sign or hashing: + +1. Does not add any of the data from unknown fields into the type to hash. + +Tendermint should not run into a case where it needs to verify the integrity of +data with unknown fields for the following reasons: + +The purpose of checking hash equality within Tendermint is to ensure that +its local copy of data matches the data that the network agreed on. There should +therefore not be a case where a process is checking hash equality using data that it did not expect +to receive. What the data represent may be opaque to the process, such as when checking the +transactions in a block, _but the process will still have expected to receive this data_, +despite not understanding what their internal structure is. It's not clear what it would +mean to verify that a block contains data that a process does not know about. + +The same reasoning applies for signature verification within Tendermint. Processes +verify that a digital signature signed over a set of bytes by locally reconstructing the +data structure that the digital signature signed using the process's local data. + +2. Reordered all message fields to be in tag-sorted order. + +Tag-sorting top-level fields will place all fields of the same tag in a adjacent +to eachother within the serialized representation. + +3. Reordered the contents of all `repeated` fields to be in lexicographically sorted order. + +`repeated` fields will appear in a message as having the same tag but will contain different +contents. Therefore, lexicographical sorting will produce a stable ordering of +fields with the same tag. + +4. Deleted all default values from the byte representation. + +Encoders can include default values or omit them. Most encoders appear to omit them +but we may wish to delete them just to be safe. + +5. Recursively performed these operations on any length-delimited subfields. + +Length delimited fields may contain messages, strings, or just bytes. However, +it's not possible to know what data is being represented by such a field. +A 'string' may happen to have the same structure as an embedded message and we cannot +disambiguate. For this reason, we must apply these same rules to all subfields that +may contain messages. Because we cannot know if we have totally mangled the interior 'string' +or not, this data should never be deserialized or used for anything beyond hashing. + +A **prototype** implementation by @creachadair of this can be found in [the wirepb repo][wire-pb]. +This could be implemented in multiple languages more simply than ensuring that there are +canonical proto serializers that match in each language. + +### Future work + +We should add clear documentation to the Tendermint codebase every time we +compare hashes of proto messages or use proto serialized bytes to produces a +digital signatures that we have been careful to ensure that the hashes are performed +properly. + +### References + +[proto-spec-encoding]: https://developers.google.com/protocol-buffers/docs/encoding +[spec-issue]: https://github.com/tendermint/tendermint/issues/5005 +[cosmos-sdk-adr-27]: https://github.com/cosmos/cosmos-sdk/blob/master/docs/architecture/adr-027-deterministic-protobuf-serialization.md +[cer-proto-3]: https://github.com/regen-network/canonical-proto3 +[wire-pb]: https://github.com/creachadair/wirepb +