from each other and from Lamport’s sketches. Systems
such as Chubby [4] have impleme nted Paxos-like algo-
rithms, but in most cases their deta ils have not been pub-
lished.
Furthermore, the Paxos architecture is a poor one for
building pra ctical systems; this is another consequence of
the single-decree decomposition. For example, there is lit-
tle benefit to choosing a collection of log entries indepe n-
dently and then melding them into a sequential log; this
just adds complexity. I t is simp le r and more efficient to
design a system around a log, where new entrie s are ap-
pended sequentially in a constrained order. Another pr ob-
lem is that Paxos uses a symmetric peer-to-peer approach
at its core (though it eventually suggests a w e ak form of
leadership as a performance optimization). This makes
sense in a simplified world where only one decision will
be made, but few practical systems use this approach. If a
series of decisions must be made, it is simpler and faster
to first elect a leader, then have the leader coordinate the
decisions.
As a result, practical systems bear little resemblance
to Paxos. Each implementation begins with Paxos, dis-
covers the difficulties in implementing it, and then de-
velops a significantly different architecture. This is time-
consumin g and error-prone, and the difficulties of under-
standing Paxos exacerbate the problem . Paxos’ formula-
tion may be a good one for proving theorems about its cor-
rectness, but real implementations are so different from
Paxos that the proof s have little value. The following com-
ment fro m the Chubby implem e nters is typical:
There are significant gaps between the description of
the Paxos algorithm and the needs of a real-world
system. . . . the final system will be based on an un-
proven protocol [4].
Because of these problems, we concluded that Paxos
does not provide a good foundation either for system
building or for education . Given the importance of co n-
sensus in large-scale software systems, we decided to see
if we could design an alternative consensus algorithm
with better properties than Paxos. Raft is the result of tha t
experiment.
4 Designing for understandability
We had several goals in designing Raft: it must provide
a complete and practical foundation for system building,
so that it significantly reduces the amo unt of design work
required of developers; it must b e safe under all conditions
and available under typical operating condition s; and it
must be efficient fo r common operations. But our most
important goal—and most difficult challenge—was un-
derstandability. It must be possible for a large aud ie nce to
understand the algorithm co mfortably. In addition, it must
be possible to develop intuitions about the algorithm, so
that system builders can make the extensions that are in-
evitable in real-world implementations.
There were numero us points in the design of Raft
where we had to choose amon g alternative approaches.
In these situations we evaluated the alternatives based on
understandability: how hard is it to explain each altern a -
tive (for example, how complex is its state space, and doe s
it have subtle implicatio ns?), and how easy will it be for a
reader to co mpletely understand the approach and its im-
plications?
We recogn ize that there is a high degree of subjectiv-
ity in such a nalysis; no netheless, we used two techniques
that are generally applicable. The first technique is the
well-known approach o f proble m decomposition: wher-
ever possible, we divided problems into separate pieces
that could be solved, exp lained, and understood relatively
indepen dently. For example, in Raft we separated leader
election, log replication, safety, and membership changes.
Our second appr oach was to simplify the state space
by reducing the number of states to consider, ma king the
system more coherent and elim inating nondeterminism
where possible. Specifically, logs are not allowed to have
holes, and Raft limits the ways in which logs can become
inconsistent with each othe r. Although in most cases we
tried to elimina te nondeterminism, there are some situ-
ations where nondeter minism actually improves under-
standability. In particular, randomized a pproaches intro-
duce nondeterm inism, but they tend to reduce the state
space by hand ling all possible choices in a similar fashion
(“choo se any; it doesn’t matter”). We used randomization
to simplify the Raft leader election algorithm.
5 The Raft consensus algorithm
Raft is an algorithm for managing a replicated log of
the form described in Section 2. Figure 2 summarizes the
algorithm in condensed form for reference, and Figure 3
lists key properties of the algo rithm; the elements of these
figures are discussed pie cewise over the rest of this sec-
tion.
Raft implements co nsensus by first electing a distin-
guished leader, then giving the leader complete responsi-
bility for ma naging the replicated log. The leader accepts
log entries from clients, replicates them on other servers,
and tells servers when it is safe to apply log entries to
their state machines. H aving a leader simplifies the man-
agement of the replicated log. For example, the leader can
decide where to place new entries in the log without con-
sulting other servers, and data flows in a simple fashion
from the leader to other servers. A leader can fail or be-
come disconnected from the other servers, in which case
a new leader is elected.
Given the leader approach, Raft decomposes the con-
sensus problem into thre e relatively independent subprob-
lems, which are discussed in the subsections that follow:
• Leader election: a new leader must be chosen when
an existing leader fails (Section 5.2).
• Log replication: the leader must accept log entries
3
文档被以下合辑收录
相关文档
评论