Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach

Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com

Google, Inc.

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

1 Introduction

Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.

In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable’s performance. Section 7 provides measurements of Bigtable’s performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9. Finally, Section 10 describes related work, and Section 11 presents our conclusions.

2 Data Model

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string
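
The map signature above can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: `CellKey`, `CellKeyLess`, `ToyTable`, and `ReadLatest` are hypothetical names invented for illustration, and nothing here reflects how Bigtable itself stores or distributes data. The comparator sorts by row, then column, then timestamp in decreasing order, so the newest version of a cell comes first.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <optional>
#include <string>

// Hypothetical key for a toy (row, column, time) -> string map.
struct CellKey {
  std::string row;
  std::string column;
  std::int64_t timestamp;
};

// Orders by row, then column, then timestamp descending, so the newest
// version of each cell sorts first.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    if (a.row != b.row) return a.row < b.row;
    if (a.column != b.column) return a.column < b.column;
    return a.timestamp > b.timestamp;
  }
};

// Illustrative in-memory table; values are uninterpreted strings.
using ToyTable = std::map<CellKey, std::string, CellKeyLess>;

// Returns the most recent value stored for (row, column), if any.
inline std::optional<std::string> ReadLatest(const ToyTable& table,
                                             const std::string& row,
                                             const std::string& column) {
  // The maximum timestamp is the smallest possible key for this cell
  // under CellKeyLess, so lower_bound lands on the newest version.
  const CellKey probe{row, column, std::numeric_limits<std::int64_t>::max()};
  auto it = table.lower_bound(probe);
  if (it == table.end() || it->first.row != row ||
      it->first.column != column) {
    return std::nullopt;
  }
  return it->second;
}
```

Because the map is sparse, a cell that was never written simply has no entry, and `ReadLatest` returns an empty optional for it.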

To appear in OSDI 2006 1

[Figure 1 image: row "com.cnn.www" with columns "contents:" (value "<html>..."), "anchor:cnnsi.com" (value "CNN"), and "anchor:my.look.ca" (value "CNN.com").]

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.

We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.

Rows

The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.

Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. Thus, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
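
The hostname reversal described above can be sketched as a small helper. `ReverseHostRowKey` is a hypothetical name, and the sketch assumes the input has no scheme, port, or user info (just a host followed by an optional path), as in the paper's example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turns "maps.google.com/index.html" into
// "com.google.maps/index.html". Assumes no scheme or port in the input.
inline std::string ReverseHostRowKey(const std::string& url) {
  const std::size_t slash = url.find('/');
  const std::string host = url.substr(0, slash);  // whole string if no '/'
  const std::string path =
      (slash == std::string::npos) ? std::string() : url.substr(slash);

  // Split the host into its dot-separated components.
  std::vector<std::string> parts;
  std::size_t start = 0;
  while (true) {
    const std::size_t dot = host.find('.', start);
    if (dot == std::string::npos) {
      parts.push_back(host.substr(start));
      break;
    }
    parts.push_back(host.substr(start, dot - start));
    start = dot + 1;
  }
  std::reverse(parts.begin(), parts.end());

  // Re-join the reversed components and append the untouched path.
  std::string key;
  for (std::size_t i = 0; i < parts.size(); ++i) {
    if (i > 0) key += '.';
    key += parts[i];
  }
  return key + path;
}
```

With keys built this way, all hosts under the same domain share a common prefix, so a lexicographically sorted table places them in adjacent rows.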

Column Families

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.

A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page’s language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents are the link text.
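
As a rough illustration of the family:qualifier syntax, the following sketch splits a column key into its two parts. `ColumnKey` and `ParseColumnKey` are hypothetical names. Splitting at the first colon is an assumption made here so that qualifiers can remain arbitrary strings (possibly containing colons); the paper does not spell out this detail.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical parsed form of a "family:qualifier" column key.
struct ColumnKey {
  std::string family;
  std::string qualifier;
};

// Splits at the first ':' (an assumption of this sketch). The family must
// be non-empty and printable; the qualifier may be any string, even empty.
inline std::optional<ColumnKey> ParseColumnKey(const std::string& key) {
  const std::size_t colon = key.find(':');
  if (colon == std::string::npos || colon == 0) return std::nullopt;
  for (std::size_t i = 0; i < colon; ++i) {
    if (!std::isprint(static_cast<unsigned char>(key[i]))) {
      return std::nullopt;
    }
  }
  return ColumnKey{key.substr(0, colon), key.substr(colon + 1)};
}
```

Under these assumptions, "anchor:cnnsi.com" parses to family "anchor" and qualifier "cnnsi.com", while "contents:" parses to family "contents" with an empty qualifier.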

Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).

Timestamps

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Figure 2: Writing to Bigtable.

To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).
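
The two garbage-collection settings can be sketched over the versions of a single cell. This is a minimal illustration, not Bigtable's mechanism: `Versions`, `KeepLastN`, and `KeepNewerThan` are hypothetical names, and the sketch assumes versions are held in an in-memory map ordered by decreasing timestamp.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Versions of one cell, newest first (decreasing timestamp order).
using Versions =
    std::map<std::int64_t, std::string, std::greater<std::int64_t>>;

// Policy 1: keep only the newest max_versions entries.
inline void KeepLastN(Versions& versions, std::size_t max_versions) {
  auto it = versions.begin();
  for (std::size_t i = 0; i < max_versions && it != versions.end(); ++i) {
    ++it;
  }
  versions.erase(it, versions.end());
}

// Policy 2: keep only versions with timestamp >= cutoff_micros.
inline void KeepNewerThan(Versions& versions, std::int64_t cutoff_micros) {
  // With std::greater ordering, upper_bound(cutoff) is the first entry
  // whose timestamp is strictly older than the cutoff.
  versions.erase(versions.upper_bound(cutoff_micros), versions.end());
}
```

For a "last seven days" policy, a caller would compute the cutoff as the current time in microseconds minus seven days and pass it to `KeepNewerThan`.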

In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

3 API

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.

Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor.

Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.
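
Column filtering of this kind can be illustrated with the C++ standard library, independently of the Bigtable API. The sketch below is an assumption-laden illustration: `FilterColumns` is a hypothetical helper, and it uses `std::regex` with the default ECMAScript grammar, in which the intent of the pattern in the text (any anchor column ending in .cnn.com) would be written as `anchor:.*\.cnn\.com`. The regular-expression dialect that Bigtable scans accept is not stated in the text.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the column names that fully match the
// given ECMAScript regular expression.
inline std::vector<std::string> FilterColumns(
    const std::vector<std::string>& columns, const std::string& pattern) {
  const std::regex re(pattern);
  std::vector<std::string> matched;
  for (const std::string& column : columns) {
    // regex_match requires the whole column name to match the pattern.
    if (std::regex_match(column, re)) matched.push_back(column);
  }
  return matched;
}
```

Applied to the columns anchor:www.cnn.com, anchor:cnnsi.com, and contents: with the pattern above, only anchor:www.cnn.com would be kept.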

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Figure 3: Reading from Bigtable.

Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall [28]. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
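
To make the idea of a single-row read-modify-write concrete, here is a toy in-memory sketch of an integer counter cell. It is not the Bigtable client API: `ToyRowStore`, `MutateRow`, and `Increment` are hypothetical names, the counter is assumed to be stored as decimal text, and a single store-wide mutex stands in for whatever per-row mechanism a real system would use.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>

// One row: column key -> uninterpreted value.
using Row = std::map<std::string, std::string>;

// Hypothetical toy store illustrating atomic single-row updates.
class ToyRowStore {
 public:
  // Runs fn on the row while holding the lock, so the whole
  // read-modify-write sequence appears atomic to other callers.
  void MutateRow(const std::string& row_key,
                 const std::function<void(Row&)>& fn) {
    std::lock_guard<std::mutex> lock(mu_);
    fn(rows_[row_key]);
  }

  // Treats the cell as a decimal integer counter (an assumption of this
  // sketch), adds delta, and returns the new value.
  std::int64_t Increment(const std::string& row_key,
                         const std::string& column, std::int64_t delta) {
    std::int64_t result = 0;
    MutateRow(row_key, [&](Row& row) {
      std::string& cell = row[column];
      const std::int64_t current = cell.empty() ? 0 : std::stoll(cell);
      result = current + delta;
      cell = std::to_string(result);
    });
    return result;
  }

 private:
  std::mutex mu_;  // One lock for the whole store, for simplicity.
  std::map<std::string, Row> rows_;
};
```

The sketch deliberately offers nothing spanning two row keys, mirroring the restriction described above that general cross-row transactions are not supported.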

Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

4 Building Blocks

Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified
