|
\section{Introduction} \label{sec:tendermint}
|
|
|
|
Consensus is a fundamental problem in distributed computing. It
|
|
is important because of it's role in State Machine Replication (SMR), a generic
|
|
approach for replicating services that can be modeled as a deterministic state
|
|
machine~\cite{Lam78:cacm, Sch90:survey}. The key idea of this approach is that
|
|
service replicas start in the same initial state, and then execute requests
|
|
(also called transactions) in the same order; thereby guaranteeing that
|
|
replicas stay in sync with each other. The role of consensus in the SMR
|
|
approach is ensuring that all replicas receive transactions in the same order.
|
|
Traditionally, deployments of SMR based systems are in data-center settings
|
|
(local area network), have a small number of replicas (three to seven) and are
|
|
typically part of a single administration domain (e.g., Chubby
|
|
\cite{Bur:osdi06}); therefore they handle benign (crash) failures only, as more
|
|
general forms of failure (in particular, malicious or Byzantine faults) are
|
|
considered to occur with only negligible probability.
|
|
|
|
The success of cryptocurrencies and blockchain systems in recent years (e.g.,
|
|
\cite{Nak2012:bitcoin, But2014:ethereum}) pose a whole new set of challenges on
|
|
the design and deployment of SMR based systems: reaching agreement over wide
|
|
area network, among large number of nodes (hundreds or thousands) that are not
|
|
part of the same administrative domain, and where a subset of nodes can behave
|
|
maliciously (Byzantine faults). Furthermore, contrary to the previous
|
|
data-center deployments where nodes are fully connected to each other, in
|
|
blockchain systems, a node is only connected to a subset of other nodes, so
|
|
communication is achieved by gossip-based peer-to-peer protocols.
|
|
The new requirements demand designs and algorithms that are not necessarily
|
|
present in the classical academic literature on Byzantine fault tolerant
|
|
consensus (or SMR) systems (e.g., \cite{DLS88:jacm, CL02:tcs}) as the primary
|
|
focus was different setup.
|
|
|
|
In this paper we describe a novel Byzantine-fault tolerant consensus algorithm
|
|
that is the core of the BFT SMR platform called Tendermint\footnote{The
|
|
Tendermint platform is available open source at
|
|
https://github.com/tendermint/tendermint.}. The Tendermint platform consists of
|
|
a high-performance BFT SMR implementation written in Go, a flexible interface
|
|
for
|
|
building arbitrary deterministic applications above the consensus, and a suite
|
|
of tools for deployment and management.
|
|
|
|
The Tendermint consensus algorithm is inspired by the PBFT SMR
|
|
algorithm~\cite{CL99:osdi} and the DLS algorithm for authenticated faults (the
|
|
Algorithm 2 from \cite{DLS88:jacm}). Similar to DLS algorithm, Tendermint
|
|
proceeds in
|
|
rounds\footnote{Tendermint is not presented in the basic round model of
|
|
\cite{DLS88:jacm}. Furthermore, we use the term round differently than in
|
|
\cite{DLS88:jacm}; in Tendermint a round denotes a sequence of communication
|
|
steps instead of a single communication step in \cite{DLS88:jacm}.}, where each
|
|
round has a dedicated proposer (also called coordinator or
|
|
leader) and a process proceeds to a new round as part of normal
|
|
processing (not only in case the proposer is faulty or suspected as being faulty
|
|
by enough processes as in PBFT).
|
|
The communication pattern of each round is very similar to the "normal" case
|
|
of PBFT. Therefore, in preferable conditions (correct proposer, timely and
|
|
reliable communication between correct processes), Tendermint decides in three
|
|
communication steps (the same as PBFT).
|
|
|
|
The major novelty and contribution of the Tendermint consensus algorithm is a
|
|
new termination mechanism. As explained in \cite{MHS09:opodis, RMS10:dsn}, the
|
|
existing BFT consensus (and SMR) algorithms for the partially synchronous
|
|
system model (for example PBFT~\cite{CL99:osdi}, \cite{DLS88:jacm},
|
|
\cite{MA06:tdsc}) typically relies on the communication pattern illustrated in
|
|
Figure~\ref{ch3:fig:coordinator-change} for termination. The
|
|
Figure~\ref{ch3:fig:coordinator-change} illustrates messages exchanged during
|
|
the proposer change when processes start a new round\footnote{There is no
|
|
consistent terminology in the distributed computing terminology on naming
|
|
sequence of communication steps that corresponds to a logical unit. It is
|
|
sometimes called a round, phase or a view.}. It guarantees that eventually (ie.
|
|
after some Global Stabilization Time, GST), there exists a round with a correct
|
|
proposer that will bring the system into a univalent configuration.
|
|
Intuitively, in a round in which the proposed value is accepted
|
|
by all correct processes, and communication between correct processes is
|
|
timely and reliable, all correct processes decide.
|
|
|
|
|
|
\begin{figure}[tbh!] \def\rdstretch{5} \def\ystretch{3} \centering
|
|
\begin{rounddiag}{4}{2} \round{1}{~} \rdmessage{1}{1}{$v_1$}
|
|
\rdmessage{2}{1}{$v_2$} \rdmessage{3}{1}{$v_3$} \rdmessage{4}{1}{$v_4$}
|
|
\round{2}{~} \rdmessage{1}{1}{$x, [v_{1..4}]$}
|
|
\rdmessage{1}{2}{$~~~~~~x, [v_{1..4}]$} \rdmessage{1}{3}{$~~~~~~~~x,
|
|
[v_{1..4}]$} \rdmessage{1}{4}{$~~~~~~~x, [v_{1..4}]$} \end{rounddiag}
|
|
\vspace{-5mm} \caption{\boldmath Proposer (coordinator) change: $p_1$ is the
|
|
new proposer.} \label{ch3:fig:coordinator-change} \end{figure}
|
|
|
|
To ensure that a proposed value is accepted by all correct
|
|
processes\footnote{The proposed value is not blindly accepted by correct
|
|
processes in BFT algorithms. A correct process always verifies if the proposed
|
|
value is safe to be accepted so that safety properties of consensus are not
|
|
violated.}
|
|
a proposer will 1) build the global state by receiving messages from other
|
|
processes, 2) select the safe value to propose and 3) send the selected value
|
|
together with the signed messages
|
|
received in the first step to support it. The
|
|
value $v_i$ that a correct process sends to the next proposer normally
|
|
corresponds to a value the process considers as acceptable for a decision:
|
|
|
|
\begin{itemize} \item in PBFT~\cite{CL99:osdi} and DLS~\cite{DLS88:jacm} it is
|
|
not the value itself but a set of $2f+1$ signed messages with the same
|
|
value id, \item in Fast Byzantine Paxos~\cite{MA06:tdsc} the value
|
|
itself is being sent. \end{itemize}
|
|
|
|
In both cases, using this mechanism in our system model (ie. high
|
|
number of nodes over gossip based network) would have high communication
|
|
complexity that increases with the number of processes: in the first case as
|
|
the message sent depends on the total number of processes, and in the second
|
|
case as the value (block of transactions) is sent by each process. The set of
|
|
messages received in the first step are normally piggybacked on the proposal
|
|
message (in the Figure~\ref{ch3:fig:coordinator-change} denoted with
|
|
$[v_{1..4}]$) to justify the choice of the selected value $x$. Note that
|
|
sending this message also does not scale with the number of processes in the
|
|
system.
|
|
|
|
We designed a novel termination mechanism for Tendermint that better suits the
|
|
system model we consider. It does not require additional communication (neither
|
|
sending new messages nor piggybacking information on the existing messages) and
|
|
it is fully based on the communication pattern that is very similar to the
|
|
normal case in PBFT \cite{CL99:osdi}. Therefore, there is only a single mode of
|
|
execution in Tendermint, i.e., there is no separation between the normal and
|
|
the recovery mode, which is the case in other PBFT-like protocols (e.g.,
|
|
\cite{CL99:osdi}, \cite{Ver09:spinning} or \cite{Cle09:aardvark}). We believe
|
|
this makes Tendermint simpler to understand and implement correctly.
|
|
|
|
Note that the orthogonal approach for reducing message complexity in order to
|
|
improve
|
|
scalability and decentralization (number of processes) of BFT consensus
|
|
algorithms is using advanced cryptography (for example Boneh-Lynn-Shacham (BLS)
|
|
signatures \cite{BLS2001:crypto}) as done for example in SBFT
|
|
\cite{Gue2018:sbft}.
|
|
|
|
The remainder of the paper is as follows: Section~\ref{sec:definitions} defines
|
|
the system model and gives the problem definitions. Tendermint
|
|
consensus algorithm is presented in Section~\ref{sec:tendermint} and the
|
|
proofs are given in Section~\ref{sec:proof}. We conclude in
|
|
Section~\ref{sec:conclusion}.
|
|
|
|
|
|
|
|
|