provides externally consistent [16] reads and writes, and globally-consistent reads across the database at a timestamp. These features enable Spanner to support consistent backups, consistent MapReduce executions [12], and atomic schema updates, all at global scale, and even in the presence of ongoing transactions.
These features are enabled by the fact that Spanner assigns globally-meaningful commit timestamps to transactions, even though transactions may be distributed. The timestamps reflect serialization order. In addition, the serialization order satisfies external consistency (or equivalently, linearizability [20]): if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Spanner is the first system to provide such guarantees at global scale.
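To make the external-consistency condition concrete, the following Go sketch (ours, purely illustrative; the Txn and ExternallyConsistent names are hypothetical and not part of Spanner) checks a trace of transactions against the guarantee: whenever T1 commits in real time before T2 starts, T1's commit timestamp must be smaller than T2's.

// externalconsistency_sketch.go
//
// A minimal, hypothetical check of the external-consistency property
// over a recorded trace of transactions.
package main

import "fmt"

// Txn records the real-time interval of a transaction and the
// commit timestamp the system assigned to it.
type Txn struct {
	Start, CommitReal int64 // real-time start and commit instants
	CommitTS          int64 // system-assigned commit timestamp
}

// ExternallyConsistent reports whether every pair of transactions
// ordered in real time also has ordered commit timestamps.
func ExternallyConsistent(txns []Txn) bool {
	for _, t1 := range txns {
		for _, t2 := range txns {
			if t1.CommitReal < t2.Start && t1.CommitTS >= t2.CommitTS {
				return false
			}
		}
	}
	return true
}

func main() {
	trace := []Txn{
		{Start: 0, CommitReal: 5, CommitTS: 100},
		{Start: 6, CommitReal: 9, CommitTS: 101}, // started after T1 committed
	}
	fmt.Println(ExternallyConsistent(trace)) // true
}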
The key enabler of these properties is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner's timestamps depend on the bounds that the implementation provides. If the uncertainty is large, Spanner slows down to wait out that uncertainty. Google's cluster-management software provides an implementation of the TrueTime API. This implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).
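Section 3 presents the API precisely; as a rough illustration of "waiting out the uncertainty," the sketch below models a TrueTime-style clock as an interval [earliest, latest] that brackets the true time, and blocks until a chosen timestamp is guaranteed to be in the past. NowInterval and CommitWait are our own stand-ins, not Google's implementation.

// truetime_sketch.go
//
// A rough sketch of an interval-based clock and the "wait out the
// uncertainty" idea; the real API appears in Section 3.
package main

import "time"

// Interval brackets the true current time: Earliest <= now <= Latest.
type Interval struct {
	Earliest, Latest time.Time
}

// NowInterval is a hypothetical clock source that returns the local
// time widened by an uncertainty bound epsilon.
func NowInterval(epsilon time.Duration) Interval {
	t := time.Now()
	return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
}

// CommitWait blocks until the chosen timestamp ts is guaranteed to be
// in the past, i.e. until Earliest > ts. The larger the uncertainty,
// the longer the wait: this is the sense in which Spanner "slows down
// to wait out that uncertainty".
func CommitWait(ts time.Time, epsilon time.Duration) {
	for !NowInterval(epsilon).Earliest.After(ts) {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	eps := 7 * time.Millisecond // generally under 10ms, per the paper
	ts := NowInterval(eps).Latest
	CommitWait(ts, eps) // returns after roughly 2*eps of real time
}

Under this model the wait costs roughly twice the uncertainty bound, which is why keeping the bound small (GPS and atomic clock references) matters.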
Section 2 describes the structure of Spanner's implementation, its feature set, and the engineering decisions that went into their design. Section 3 describes our new TrueTime API and sketches its implementation. Section 4 describes how Spanner uses TrueTime to implement externally-consistent distributed transactions, lock-free read-only transactions, and atomic schema updates. Section 5 provides some benchmarks on Spanner's performance and TrueTime behavior, and discusses the experiences of F1. Sections 6, 7, and 8 describe related and future work, and summarize our conclusions.
2 Implementation
This section describes the structure of and rationale underlying Spanner's implementation. It then describes the directory abstraction, which is used to manage replication and locality, and is the unit of data movement. Finally, it describes our data model, why Spanner looks like a relational database instead of a key-value store, and how applications can control data locality.
A Spanner deployment is called a universe. Given that Spanner manages data globally, there will be only a handful of running universes. We currently run a test/playground universe, a development/production universe, and a production-only universe.
Spanner is organized as a set of zones, where each zone is the rough analog of a deployment of Bigtable servers [9]. Zones are the unit of administrative deployment. The set of zones is also the set of locations across which data can be replicated. Zones can be added to or removed from a running system as new datacenters are brought into service and old ones are turned off, respectively. Zones are also the unit of physical isolation: there may be one or more zones in a datacenter, for example, if different applications' data must be partitioned across different sets of servers in the same datacenter.

Figure 1: Spanner server organization.
Figure 1 illustrates the servers in a Spanner universe. A zone has one zonemaster and between one hundred and several thousand spanservers. The former assigns data to spanservers; the latter serve data to clients. The per-zone location proxies are used by clients to locate the spanservers assigned to serve their data. The universe master and the placement driver are currently singletons. The universe master is primarily a console that displays status information about all the zones for interactive debugging. The placement driver handles automated movement of data across zones on the timescale of minutes. The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance load. For space reasons, we will only describe the spanserver in any detail.
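As a reading aid, the following Go sketch (types and names are ours, purely illustrative, not Spanner code) maps the roles of Figure 1 onto data: each zone bundles a zonemaster, its spanservers, and a location proxy, while the universe adds the two singletons.

// organization_sketch.go
//
// A hypothetical model of the server organization in Figure 1.
package main

import "fmt"

type Spanserver struct{ Addr string }

// Zone: one zonemaster assigns data to between one hundred and
// several thousand spanservers; a location proxy tells clients which
// spanserver serves a given piece of data.
type Zone struct {
	Zonemaster    string
	Spanservers   []Spanserver
	LocationProxy map[string]*Spanserver // key range -> serving spanserver
}

// Universe: zones plus two singletons, the universe master (a status
// console) and the placement driver (moves data across zones on a
// timescale of minutes).
type Universe struct {
	Zones           []*Zone
	UniverseMaster  string
	PlacementDriver string
}

// Locate mimics a client asking the zone's location proxy which
// spanserver holds its data.
func (z *Zone) Locate(keyRange string) (*Spanserver, bool) {
	s, ok := z.LocationProxy[keyRange]
	return s, ok
}

func main() {
	z := &Zone{
		Zonemaster:    "zm-1",
		Spanservers:   []Spanserver{{Addr: "ss-1:4000"}},
		LocationProxy: map[string]*Spanserver{},
	}
	z.LocationProxy["users/[a-m)"] = &z.Spanservers[0]
	if s, ok := z.Locate("users/[a-m)"); ok {
		fmt.Println("serve from", s.Addr)
	}
	_ = Universe{Zones: []*Zone{z}, UniverseMaster: "console", PlacementDriver: "pd"}
}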
2.1 Spanserver Software Stack
This section focuses on the spanserver implementation to illustrate how replication and distributed transactions have been layered onto our Bigtable-based implementation. The software stack is shown in Figure 2. At the bottom, each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. A tablet is similar to Bigtable's tablet abstraction, in that it implements a bag of the following mappings:

(key:string, timestamp:int64) → string
Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store. A tablet's state is stored in a set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to the Google File System).
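A toy in-memory model of this mapping (our illustration in Go, not Spanner's tablet code; Tablet, Put, and Get are hypothetical names) shows what multi-versioning buys: a read at timestamp ts returns the latest value written at or before ts, so reads at a past timestamp remain stable even as new writes arrive.

// tablet_sketch.go
//
// A toy model of the tablet mapping
//   (key:string, timestamp:int64) -> string
package main

import "fmt"

type version struct {
	ts  int64
	val string
}

// Tablet keeps every timestamped version of every key.
type Tablet struct {
	data map[string][]version // versions kept in increasing ts order
}

func NewTablet() *Tablet { return &Tablet{data: map[string][]version{}} }

// Put records a value for key at timestamp ts (assumed increasing per key).
func (t *Tablet) Put(key string, ts int64, val string) {
	t.data[key] = append(t.data[key], version{ts, val})
}

// Get returns the latest value for key whose version timestamp is <= ts.
func (t *Tablet) Get(key string, ts int64) (string, bool) {
	vs := t.data[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= ts {
			return vs[i].val, true
		}
	}
	return "", false
}

func main() {
	tab := NewTablet()
	tab.Put("row1", 10, "v1")
	tab.Put("row1", 20, "v2")
	fmt.Println(tab.Get("row1", 15)) // v1 true: a read at ts=15 sees the ts=10 version
	fmt.Println(tab.Get("row1", 25)) // v2 true: the latest version at ts=25
}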