
The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Google∗

ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real-world use.

Categories and Subject Descriptors

D[4]: 3—Distributed ﬁle systems

General Terms

Design, reliability, performance, measurement

Keywords

Fault tolerance, scalability, data storage, clustered storage

∗The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@google.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.

$5.00.

1. INTRODUCTION

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.

• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
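The batching and sorting of small random reads mentioned in the workload assumptions can be sketched as follows. This is only an illustration, not GFS client code: the helper name `sort_and_coalesce` and the `max_gap` parameter are hypothetical, and merging nearby ranges is an extra step assumed here on top of the sorting that the text describes.

```python
def sort_and_coalesce(reads, max_gap=0):
    """Order (offset, length) reads by offset and merge nearby ranges.

    Hypothetical helper: sorting lets the reader advance steadily through
    the file; merging ranges that touch or lie within max_gap bytes of
    each other is an assumed optimization, not something the text states.
    """
    merged = []  # list of [start, end) ranges, kept in ascending order
    for offset, length in sorted(reads):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            # Extend the previous range instead of issuing another read.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


if __name__ == "__main__":
    pending = [(8192, 4096), (0, 4096), (4096, 1024)]
    print(sort_and_coalesce(pending))
```

With `max_gap=0` only adjacent or overlapping requests are merged; a larger gap trades some wasted bytes for fewer operations.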

2.2 Interface

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
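To make the record append contract concrete, here is a toy, single-process sketch of what a caller observes: each record lands contiguously at an offset chosen by the file rather than by the caller, so producers need no locking among themselves. The class `ToyAppendFile` and the function `run_producers` are hypothetical names, and the internal `threading.Lock` is merely a stand-in for whatever mechanism the real system uses to order appends.

```python
import threading


class ToyAppendFile:
    """In-memory stand-in for a file that supports record append."""

    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()  # stand-in for the system's own ordering

    def record_append(self, record: bytes) -> int:
        # The file, not the caller, picks the offset, so callers never
        # have to agree on positions with one another.
        with self._lock:
            offset = len(self._data)
            self._data += record
            return offset

    def read(self, offset: int, length: int) -> bytes:
        with self._lock:
            return bytes(self._data[offset:offset + length])


def run_producers(num_producers: int = 4, records_each: int = 25):
    """Run concurrent producers; return the file and (offset, record) pairs."""
    f = ToyAppendFile()
    placed = [[] for _ in range(num_producers)]  # one private list per producer

    def produce(pid: int) -> None:
        for i in range(records_each):
            record = f"producer {pid} record {i}\n".encode()
            placed[pid].append((f.record_append(record), record))

    threads = [threading.Thread(target=produce, args=(pid,))
               for pid in range(num_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return f, [pair for per_producer in placed for pair in per_producer]


if __name__ == "__main__":
    f, placed = run_producers()
    # Every record is intact at the offset that was returned for it.
    print(all(f.read(off, len(rec)) == rec for off, rec in placed))
```

A consumer can read the file from the start and see whole records from different producers interleaved, never a record split by another producer's data.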

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
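The relationship between chunk handles and replicas can be sketched with a toy allocator. Everything here is illustrative: `ToyChunkAllocator` is a hypothetical name, handles come from a simple counter, and replicas are picked at random, whereas the text only says that handles are unique 64 bit values assigned by the master and that three replicas are stored by default.

```python
import itertools
import random

DEFAULT_REPLICATION = 3  # default replica count stated in the text


class ToyChunkAllocator:
    """Hands out unique chunk handles and records replica locations."""

    def __init__(self, chunkservers, seed=0):
        self._servers = list(chunkservers)
        self._counter = itertools.count(1)  # assumed handle source
        self._rng = random.Random(seed)     # assumed placement policy
        self.replicas = {}                  # handle -> list of server names

    def create_chunk(self, replication=DEFAULT_REPLICATION):
        if replication > len(self._servers):
            raise ValueError("not enough chunkservers for requested replication")
        # Mask to 64 bits so the handle always fits the stated width.
        handle = next(self._counter) & 0xFFFFFFFFFFFFFFFF
        # sample() never repeats a server, so replicas are on distinct machines.
        self.replicas[handle] = self._rng.sample(self._servers, replication)
        return handle


if __name__ == "__main__":
    alloc = ToyChunkAllocator(["cs1", "cs2", "cs3", "cs4", "cs5"])
    h = alloc.create_chunk()
    print(hex(h), alloc.replicas[h])
```

Passing a different `replication` value per call mirrors the idea that replication levels can differ across regions of the namespace.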

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.

2.4 Single Master

Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

[Figure 1: GFS Architecture. The application's GFS client sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); it sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data. The master holds the file namespace (for example, /foo/bar with chunk 2ef0), sends instructions to the chunkservers, and collects chunkserver state. Each chunkserver stores chunks in its Linux file system. The legend distinguishes data messages from control messages.]

Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.

The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.

In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
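The read path described above can be sketched in a few lines. This is a simplified model under stated assumptions: `ToyMaster` and `ToyClient` are hypothetical names, the 64 MB chunk size is the value given in Section 2.5, the lookup table is a plain dictionary, and cache expiry and prefetching of following chunks are left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in Section 2.5


class ToyMaster:
    """Answers (file name, chunk index) -> (chunk handle, replica locations)."""

    def __init__(self, table):
        self._table = table
        self.lookups = 0  # counts how often clients had to ask the master

    def lookup(self, file_name, chunk_index):
        self.lookups += 1
        return self._table[(file_name, chunk_index)]


class ToyClient:
    """Translates byte offsets to chunks and caches the master's answers."""

    def __init__(self, master):
        self._master = master
        self._cache = {}  # keyed by (file name, chunk index), as in the text

    def locate(self, file_name, byte_offset):
        chunk_index = byte_offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        if key not in self._cache:
            # Only a cache miss costs a round trip to the master.
            self._cache[key] = self._master.lookup(file_name, chunk_index)
        handle, locations = self._cache[key]
        # The chunkserver request carries the handle and a range in the chunk.
        return handle, locations, byte_offset % CHUNK_SIZE


if __name__ == "__main__":
    master = ToyMaster({("/foo/bar", 0): (0x2EF0, ["cs1", "cs2", "cs3"])})
    client = ToyClient(master)
    print(client.locate("/foo/bar", 1024))
    print(client.locate("/foo/bar", 2048), master.lookups)
```

The second call in the usage example hits the cache, so the master sees a single lookup for both reads of the same chunk.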

2.5 Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.

A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, because a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
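A quick calculation shows why a large chunk size keeps location information small. The function name `chunks_needed` is hypothetical, and the 4 KB block size used for comparison is an assumed example of a typical file system block size, not a figure from the text.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunk size chosen in the text
BLOCK_SIZE = 4 * 2**10   # assumed 4 KB block size, for comparison only
TB = 2**40


def chunks_needed(num_bytes, unit_size=CHUNK_SIZE):
    """Number of fixed-size units needed to cover num_bytes (rounded up)."""
    return -(-num_bytes // unit_size)  # ceiling division on integers


if __name__ == "__main__":
    for label, unit in (("64 MB chunks", CHUNK_SIZE), ("4 KB blocks", BLOCK_SIZE)):
        print(f"1 TB working set, {label}: {chunks_needed(TB, unit):,} entries")
```

A 1 TB working set maps to 16,384 chunk entries at 64 MB per chunk, against 268,435,456 entries at the assumed 4 KB block size, which is why a client can hold the locations for a multi-TB working set in its cache.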

On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
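The split between logged and unlogged metadata can be illustrated with a small recovery sketch. It rests on several assumptions: `ToyMasterState`, `log_and_apply`, and `recover` are hypothetical names, the JSON record format is invented for the example, a Python list stands in for the on-disk, remotely replicated log, and writing the log record before applying the mutation is an assumed ordering.

```python
import json


class ToyMasterState:
    """Namespace and file-to-chunk mapping plus volatile chunk locations."""

    def __init__(self):
        self.files = {}            # path -> chunk handles (rebuilt from the log)
        self.chunk_locations = {}  # handle -> servers (never logged)

    def apply(self, mutation):
        op = mutation["op"]
        if op == "create":
            self.files[mutation["path"]] = []
        elif op == "add_chunk":
            self.files[mutation["path"]].append(mutation["handle"])
        elif op == "delete":
            del self.files[mutation["path"]]
        else:
            raise ValueError(f"unknown operation: {op}")

    def report_chunks(self, server, handles):
        # Called when a chunkserver tells the master which chunks it holds.
        for handle in handles:
            self.chunk_locations.setdefault(handle, set()).add(server)


def log_and_apply(state, log, mutation):
    log.append(json.dumps(mutation))  # record the mutation first (assumed order)
    state.apply(mutation)


def recover(log, chunkserver_reports):
    """Rebuild state after a crash: replay the log, then poll chunkservers."""
    state = ToyMasterState()
    for line in log:
        state.apply(json.loads(line))
    for server, handles in chunkserver_reports.items():
        state.report_chunks(server, handles)
    return state


if __name__ == "__main__":
    log, live = [], ToyMasterState()
    log_and_apply(live, log, {"op": "create", "path": "/foo/bar"})
    log_and_apply(live, log, {"op": "add_chunk", "path": "/foo/bar", "handle": 0x2EF0})
    restored = recover(log, {"cs1": [0x2EF0], "cs2": [0x2EF0]})
    print(restored.files, restored.chunk_locations)
```

Because locations come from the chunkservers themselves, the recovered master never has to reconcile a stale persistent copy of where replicas live.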

2.6.1 In-Memory Data Structures

Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space
