暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
Bigtable - A Distributed Storage System for Structured Data.pdf
1162
14页
36次
2021-01-22
免费下载
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com
Google, Inc.
Abstract
Bigtable is a distributed storage system for managing
structured data that is designed to scale to a very large
size: petabytes of data across thousands of commodity
servers. Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and Google Fi-
nance. These applications place very different demands
on Bigtable, both in terms of data size (from URLs to
web pages to satellite imagery) and latency requirements
(from backend bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully
provided a flexible, high-performance solution for all of
these Google products. In this paper we describe the sim-
ple data model provided by Bigtable, which gives clients
dynamic control over data layout and format, and we de-
scribe the design and implementation of Bigtable.
1 Introduction
Over the last two and a half years we have designed,
implemented, and deployed a distributed storage system
for managing structured data at Google called Bigtable.
Bigtable is designed to reliably scale to petabytes of
data and thousands of machines. Bigtable has achieved
several goals: wide applicability, scalability, high per-
formance, and high availability. Bigtable is used by
more than sixty Google products and projects, includ-
ing Google Analytics, Google Finance, Orkut, Person-
alized Search, Writely, and Google Earth. These prod-
ucts use Bigtable for a variety of demanding workloads,
which range from throughput-oriented batch-processing
jobs to latency-sensitive serving of data to end users.
The Bigtable clusters used by these products span a wide
range of configurations, from a handful to thousands of
servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares
many implementation strategies with databases. Paral-
lel databases [14] and main-memory databases [13] have
achieved scalability and high performance, but Bigtable
provides a different interface than such systems. Bigtable
does not support a full relational data model; instead, it
provides clients with a simple data model that supports
dynamic control over data layout and format, and al-
lows clients to reason about the locality properties of the
data represented in the underlying storage. Data is in-
dexed using row and column names that can be arbitrary
strings. Bigtable also treats data as uninterpreted strings,
although clients often serialize various forms of struc-
tured and semi-structured data into these strings. Clients
can control the locality of their data through careful
choices in their schemas. Finally, Bigtable schema pa-
rameters let clients dynamically control whether to serve
data out of memory or from disk.
Section 2 describes the data model in more detail, and
Section 3 provides an overview of the client API. Sec-
tion 4 briefly describes the underlying Google infrastruc-
ture on which Bigtable depends. Section 5 describes the
fundamentals of the Bigtable implementation, and Sec-
tion 6 describes some of the refinements that we made
to improve Bigtable’s performance. Section 7 provides
measurements of Bigtable’s performance. We describe
several examples of how Bigtable is used at Google
in Section 8, and discuss some lessons we learned in
designing and supporting Bigtable in Section 9. Fi-
nally, Section 10 describes related work, and Section 11
presents our conclusions.
2 Data Model
A Bigtable is a sparse, distributed, persistent multi-
dimensional sorted map. The map is indexed by a row
key, column key, and a timestamp; each value in the map
is an uninterpreted array of bytes.
(row:string, column:string, time:int64) string
To appear in OSDI 2006 1
"CNN.com"
"CNN"
"<html>..."
"<html>..."
"<html>..."
t
9
t
6
t
3
t
5
8
t
"anchor:cnnsi.com"
"com.cnn.www"
"anchor:my.look.ca""contents:"
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family con-
tains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t
3
, t
5
, and t
6
.
We settled on this data model after examining a variety
of potential uses of a Bigtable-like system. As one con-
crete example that drove some of our design decisions,
suppose we want to keep a copy of a large collection of
web pages and related information that could be used by
many different projects; let us call this particular table
the Webtable. In Webtable, we would use URLs as row
keys, various aspects of web pages as column names, and
store the contents of the web pages in the contents: col-
umn under the timestamps when they were fetched, as
illustrated in Figure 1.
Rows
The row keys in a table are arbitrary strings (currently up
to 64KB in size, although 10-100 bytes is a typical size
for most of our users). Every read or write of data under
a single row key is atomic (regardless of the number of
different columns being read or written in the row), a
design decision that makes it easier for clients to reason
about the system’s behavior in the presence of concurrent
updates to the same row.
Bigtable maintains data in lexicographic order by row
key. The row range for a table is dynamically partitioned.
Each row range is called a tablet, which is the unit of dis-
tribution and load balancing. As a result, reads of short
row ranges are efficient and typically require communi-
cation with only a small number of machines. Clients
can exploit this property by selecting their row keys so
that they get good locality for their data accesses. For
example, in Webtable, pages in the same domain are
grouped together into contiguous rows by reversing the
hostname components of the URLs. For example, we
store data for maps.google.com/index.html under the
key com.google.maps/index.html. Storing pages from
the same domain near each other makes some host and
domain analyses more efficient.
Column Families
Column keys are grouped into sets called column fami-
lies, which form the basic unit of access control. All data
stored in a column family is usually of the same type (we
compress data in the same column family together). A
column family must be created before data can be stored
under any column key in that family; after a family has
been created, any column key within the family can be
used. It is our intent that the number of distinct column
families in a table be small (in the hundreds at most), and
that families rarely change during operation. In contrast,
a table may have an unbounded number of columns.
A column key is named using the following syntax:
family:qualifier. Column family names must be print-
able, but qualifiers may be arbitrary strings. An exam-
ple column family for the Webtable is language, which
stores the language in which a web page was written. We
use only one column key in the language family, and it
stores each web page’s language ID. Another useful col-
umn family for this table is anchor; each column key in
this family represents a single anchor, as shown in Fig-
ure 1. The qualifier is the name of the referring site; the
cell contents is the link text.
Access control and both disk and memory account-
ing are performed at the column-family level. In our
Webtable example, these controls allow us to manage
several different types of applications: some that add new
base data, some that read the base data and create derived
column families, and some that are only allowed to view
existing data (and possibly not even to view all of the
existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of
the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be as-
signed by Bigtable, in which case they represent “real
time” in microseconds, or be explicitly assigned by client
To appear in OSDI 2006 2
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
Figure 2: Writing to Bigtable.
applications. Applications that need to avoid collisions
must generate unique timestamps themselves. Different
versions of a cell are stored in decreasing timestamp or-
der, so that the most recent versions can be read first.
To make the management of versioned data less oner-
ous, we support two per-column-family settings that tell
Bigtable to garbage-collect cell versions automatically.
The client can specify either that only the last n versions
of a cell be kept, or that only new-enough versions be
kept (e.g., only keep values that were written in the last
seven days).
In our Webtable example, we set the timestamps of
the crawled pages stored in the contents: column to
the times at which these page versions were actually
crawled. The garbage-collection mechanism described
above lets us keep only the most recent three versions of
every page.
3 API
The Bigtable API provides functions for creating and
deleting tables and column families. It also provides
functions for changing cluster, table, and column family
metadata, such as access control rights.
Client applications can write or delete values in
Bigtable, look up values from individual rows, or iter-
ate over a subset of the data in a table. Figure 2 shows
C++ code that uses a RowMutation abstraction to per-
form a series of updates. (Irrelevant details were elided
to keep the example short.) The call to Apply performs
an atomic mutation to the Webtable: it adds one anchor
to www.cnn.com and deletes a different anchor.
Figure 3 shows C++ code that uses a Scanner ab-
straction to iterate over all anchors in a particular row.
Clients can iterate over multiple column families, and
there are several mechanisms for limiting the rows,
columns, and timestamps produced by a scan. For ex-
ample, we could restrict the scan above to only produce
anchors whose columns match the regular expression
anchor:*.cnn.com, or to only produce anchors whose
timestamps fall within ten days of the current time.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
printf("%s %s %lld %s\n",
scanner.RowName(),
stream->ColumnName(),
stream->MicroTimestamp(),
stream->Value());
}
Figure 3: Reading from Bigtable.
Bigtable supports several other features that allow the
user to manipulate data in more complex ways. First,
Bigtable supports single-row transactions, which can be
used to perform atomic read-modify-write sequences on
data stored under a single row key. Bigtable does not cur-
rently support general transactions across row keys, al-
though it provides an interface for batching writes across
row keys at the clients. Second, Bigtable allows cells
to be used as integer counters. Finally, Bigtable sup-
ports the execution of client-supplied scripts in the ad-
dress spaces of the servers. The scripts are written in a
language developed at Google for processing data called
Sawzall [28]. At the moment, our Sawzall-based API
does not allow client scripts to write back into Bigtable,
but it does allow various forms of data transformation,
filtering based on arbitrary expressions, and summariza-
tion via a variety of operators.
Bigtable can be used with MapReduce [12], a frame-
work for running large-scale parallel computations de-
veloped at Google. We have written a set of wrappers
that allow a Bigtable to be used both as an input source
and as an output target for MapReduce jobs.
4 Building Blocks
Bigtable is built on several other pieces of Google in-
frastructure. Bigtable uses the distributed Google File
System (GFS) [17] to store log and data files. A Bigtable
cluster typically operates in a shared pool of machines
that run a wide variety of other distributed applications,
and Bigtable processes often share the same machines
with processes from other applications. Bigtable de-
pends on a cluster management system for scheduling
jobs, managing resources on shared machines, dealing
with machine failures, and monitoring machine status.
The Google SSTable file format is used internally to
store Bigtable data. An SSTable provides a persistent,
ordered immutable map from keys to values, where both
keys and values are arbitrary byte strings. Operations are
provided to look up the value associated with a specified
To appear in OSDI 2006 3
of 14
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。