Speedy Transactions in Multicore In-Memory Databases
Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden
MIT CSAIL and Harvard University
Abstract
Silo is a new in-memory database that achieves excellent performance and scalability on modern multicore machines. Silo was designed from the ground up to use system memory and caches efficiently. For instance, it avoids all centralized contention points, including that of centralized transaction ID assignment. Silo’s key contribution is a commit protocol based on optimistic concurrency control that provides serializability while avoiding all shared-memory writes for records that were only read. Though this might seem to complicate the enforcement of a serial order, correct logging and recovery is provided by linking periodically-updated epochs with the commit protocol. Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency. Silo achieves almost 700,000 transactions per second on a standard TPC-C workload mix on a 32-core machine, as well as near-linear scalability. Considered per core, this is several times higher than previously reported results.
1 Introduction
Thanks to drastic increases in main memory sizes and processor core counts for server-class machines, modern high-end servers can have several terabytes of RAM and 80 or more cores. When used effectively, this is enough processing power and memory to handle data sets and computations that used to be spread across many disks and machines. However, harnessing this power is tricky; even single points of contention, like compare-and-swaps on a shared-memory word, can limit scalability.

This paper presents Silo, a new main-memory database that achieves excellent performance on multicore machines. We designed Silo from the ground up to use system memory and caches efficiently. We avoid all centralized contention points and make all
synchronization scale with the data, allowing larger databases to support more concurrency.
Silo uses a Masstree-inspired tree structure for its underlying indexes. Masstree [23] is a fast concurrent B-tree-like structure optimized for multicore performance. But Masstree only supports non-serializable, single-key transactions, whereas any real database must support transactions that affect multiple keys and occur in some serial order. Our core result, the Silo commit protocol, is a minimal-contention serializable commit protocol that provides these properties.
Silo uses a variant of optimistic concurrency control (OCC) [18]. An OCC transaction tracks the records it reads and writes in thread-local storage. At commit time, after validating that no concurrent transaction’s writes overlapped with its read set, the transaction installs all written records at once. If validation fails, the transaction aborts. This approach has several benefits for scalability. OCC writes to shared memory only at commit time, after the transaction’s compute phase has completed; this short write period reduces contention. And thanks to the validation step, read-set records need not be locked. This matters because the memory writes required for read locks can induce contention [11].
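
To make this pattern concrete, the following is a minimal C++ sketch of the OCC approach just described: reads and buffered writes are tracked in thread-local sets, the write set is locked at commit, the read set is validated against per-record versions, and writes are installed only if validation succeeds. It is an illustration under simplified assumptions (one version word per record, string values, no epochs or logging), not Silo's actual code; the type and function names are ours.

// A minimal OCC commit sketch (illustrative; not Silo's implementation).
#include <atomic>
#include <cstdint>
#include <map>
#include <string>

struct Record {
    std::atomic<uint64_t> vlock{0};  // low bit = lock bit, upper bits = version
    std::string value;
};

constexpr uint64_t LOCK_BIT = 1;

void lock_record(Record* r) {
    // Spin until we atomically set the lock bit on an unlocked word.
    uint64_t unlocked = r->vlock.load(std::memory_order_relaxed) & ~LOCK_BIT;
    while (!r->vlock.compare_exchange_weak(unlocked, unlocked | LOCK_BIT,
                                           std::memory_order_acquire)) {
        unlocked &= ~LOCK_BIT;
    }
}

void unlock_record(Record* r, bool bump_version) {
    uint64_t v = r->vlock.load(std::memory_order_relaxed) & ~LOCK_BIT;
    r->vlock.store(bump_version ? v + 2 : v, std::memory_order_release);
}

struct Transaction {
    std::map<Record*, uint64_t> reads;      // record -> version observed
    std::map<Record*, std::string> writes;  // record -> value to install

    std::string read(Record* r) {
        reads[r] = r->vlock.load(std::memory_order_acquire) & ~LOCK_BIT;
        return r->value;  // a real system would re-check vlock for a clean read
    }

    void write(Record* r, std::string v) { writes[r] = std::move(v); }

    bool commit() {
        // Phase 1: lock the write set in a fixed (pointer) order -- std::map
        // already iterates in that order -- so committers cannot deadlock.
        for (auto& w : writes) lock_record(w.first);

        // Phase 2: validate the read set. Abort if a record we read has a new
        // version, or is locked by some other committing transaction.
        for (auto& rd : reads) {
            uint64_t cur = rd.first->vlock.load(std::memory_order_acquire);
            bool locked_by_other = (cur & LOCK_BIT) && !writes.count(rd.first);
            if (locked_by_other || (cur & ~LOCK_BIT) != rd.second) {
                for (auto& w : writes) unlock_record(w.first, false);
                return false;  // abort; the caller may retry the transaction
            }
        }

        // Phase 3: install the buffered writes, then release each lock while
        // advancing the record's version so later validations see the change.
        for (auto& w : writes) {
            w.first->value = w.second;
            unlock_record(w.first, true);
        }
        return true;
    }
};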
Previous OCC implementations are not free of scaling bottlenecks, however, with a key reason being the requirement for tracking “anti-dependencies” (write-after-read conflicts). Consider a transaction t1 that reads a record from the database, and a concurrent transaction t2 that overwrites the value t1 saw. A serializable system must order t1 before t2 even after a potential crash and recovery from persistent logs. To achieve this ordering, most systems require that t1 communicate with t2, such as by posting its read sets to shared memory or via a centrally-assigned, monotonically-increasing transaction ID [18, 19]. Some non-serializable systems can avoid this communication, but they suffer from anomalies like snapshot isolation’s “write skew” [2].
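
As a concrete instance of such an anti-dependency, the fragment below (reusing the hypothetical Record/Transaction sketch above) has t1 read a record and commit without writing anything to shared memory, after which t2 overwrites the value t1 saw. Any serial order, and any post-crash replay, must place t1 before t2, yet t2 has no way to learn that t1 ever read the record.

// Anti-dependency (write-after-read) example, using the sketch above.
#include <cassert>

void anti_dependency_example() {
    Record x;
    x.value = "old";

    Transaction t1, t2;
    std::string seen = t1.read(&x);  // t1 observes "old" at version 0
    bool ok1 = t1.commit();          // read-only: validates, installs nothing

    t2.write(&x, "new");
    bool ok2 = t2.commit();          // overwrites the value t1 saw

    assert(ok1 && ok2 && seen == "old");
    // Correct recovery must order t1 before t2, even though t1 left no
    // trace in shared memory for t2 to observe.
}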
Silo provides serializability while avoiding all shared-memory writes for read transactions. The commit protocol was carefully designed using memory fences to scalably produce results consistent with a serial order. This leaves the problem of correct recovery, which we solve using a form of epoch-based group commit. Time is divided into a series of short epochs. Even though transaction results always agree with a serial order, the system does not explicitly know the serial order except across epoch boundaries: if t1’s epoch is before t2’s, then t1 precedes t2 in the serial order. The system logs transactions in units of whole epochs and releases results to clients at epoch boundaries. As a result, Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency. Epochs have other benefits as well; for example, we use them to provide database snapshots that long-lived read-only transactions can use to reduce aborts.
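
A rough sketch of how such epoch-based group commit could be wired up is shown below: one designated thread advances a global epoch counter, workers stamp each commit record with the epoch they observed while committing, and results are released to clients only once their whole epoch has been made durable. The names, the epoch period, and the durable-epoch bookkeeping are our illustrative assumptions, not Silo's implementation.

// Epoch-based group commit sketch (illustrative; names are hypothetical).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> global_epoch{1};   // advanced only by one thread
std::atomic<uint64_t> durable_epoch{0};  // highest epoch whose log is on disk

// A single designated thread advances the epoch on a short, fixed period
// (tens of milliseconds), bounding how long commit acknowledgments wait.
void epoch_advancer(std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(40));
        global_epoch.fetch_add(1, std::memory_order_acq_rel);
    }
}

struct CommitRecord {
    uint64_t epoch;  // epoch observed while committing; log payload omitted
};

// Worker side: stamp the commit with the current epoch; transactions in the
// same epoch are logged together, so only epoch boundaries need ordering.
CommitRecord stamp_commit() {
    return CommitRecord{global_epoch.load(std::memory_order_acquire)};
}

// Client side: a result may be released only after every transaction in its
// epoch could have reached the log, i.e. once the epoch is durable (loggers
// would advance durable_epoch after persisting a whole epoch's log records).
bool can_release_to_client(const CommitRecord& c) {
    return c.epoch <= durable_epoch.load(std::memory_order_acquire);
}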
On a single 32-core machine, Silo achieves roughly 700,000 transactions per second on the standard TPC-C benchmark for online transaction processing (OLTP). This is about 22,000 transactions per second per core. Per-core transaction throughput at 32 cores is 91% of that at 8 cores, a small drop indicating that Silo scales well. For context, the database literature for high-performance in-memory OLTP systems cites per-core TPC-C throughput at least several times lower than Silo’s [16, 25, 27, 28], and benchmarks on our hardware with a commercial main-memory database system perform at most 3,000 transactions per second per core.
An important Silo design choice is its shared-memory store: any database worker can potentially access the whole database. Several recent main-memory database designs instead use data partitioning, in which database workers effectively own subsets of the data [25, 32]. Partitioning can shrink table sizes and avoids the expense of managing fine-grained locks, but works best when the query load matches the partitioning. To understand the tradeoffs, we built and evaluated a partitioned variant of Silo. Partitioning performs better for some workloads, but a shared-memory design wins when cross-partition transactions are frequent or partitions are overloaded.
Silo assumes a one-shot request model in which all parameters for each client request are available at the start, and requests always complete without further client interaction. This model is well-suited for OLTP workloads. If high-latency client communication were part of a transaction, the likelihood of abort due to concurrent updates would grow.
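
For illustration, the sketch below shows what a one-shot request might look like for a TPC-C-style new-order transaction: every parameter arrives with the request, and the worker executes and commits it in one go, retrying locally on an OCC abort rather than ever pausing for the client. The request and result types are hypothetical, not Silo's API.

// One-shot request sketch (illustrative; these types are not Silo's API).
#include <cstdint>
#include <utility>
#include <vector>

struct NewOrderRequest {
    // All parameters the transaction needs are supplied up front.
    uint32_t warehouse_id = 0;
    uint32_t district_id = 0;
    uint32_t customer_id = 0;
    std::vector<std::pair<uint32_t, uint32_t>> items;  // (item_id, quantity)
};

struct NewOrderResult {
    bool committed = false;
    uint64_t order_id = 0;
};

NewOrderResult run_new_order(const NewOrderRequest& req) {
    NewOrderResult res;
    // The whole transaction (reads, computation, buffered writes, commit)
    // runs here without any client round trip; on an OCC abort the worker
    // simply retries, keeping the conflict window short.
    for (int attempt = 0; attempt < 10 && !res.committed; ++attempt) {
        // ... execute against the store using req, then attempt to commit ...
        res.committed = true;             // placeholder for a successful commit
        res.order_id = req.items.size();  // placeholder result
    }
    return res;
}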
Our performance is higher than a full system would observe since our clients do not currently use the network. (In Masstree, network communication reduced throughput by 23% [23]; we would expect a similar reduction for key-value workloads, less for workloads with more computation-heavy transactions.) Nevertheless, our experiments show that Silo has very high performance, that transactions can be serialized without contention, and that cache-friendly design principles work well for shared-memory serializable databases.
2 Related work
A number of recent systems have proposed storage abstractions for main-memory and multicore systems. These can be broadly classified according to whether or not they provide transactional support.
2.1 Non-transactional systems
The non-transactional system most related to Silo is Masstree [23], an extremely scalable and high-throughput main-memory B+-tree. Masstree uses techniques such as version validation instead of read locks and efficient fine-grained locking algorithms. It builds on several prior trees, including OLFIT [4], Bronson et al. [3], and Blink-trees [20], and adds new techniques, including a trie-like structure that speeds up key comparisons. Silo’s underlying concurrent B+-tree implementation was inspired by Masstree.
PALM [31] is another high-throughput B+-tree structure designed for multicore systems. It uses a batching technique, extensive prefetching, and intra-core SIMD parallelism to provide extremely high throughput. PALM’s techniques could potentially speed up Silo’s tree operations.
The Bw-tree [21] is a high-throughput multiversion tree structure optimized for multicore flash storage. New versions of data are installed using delta records and compare-and-swaps on a mapping table; there are no locks or overwrites (we found both helpful for performance). Silo’s data structures are designed for main memory, whereas many of the Bw-tree’s structures are designed for flash. Like Silo, the Bw-tree uses RCU-style epochs for garbage collection; Silo’s epochs also support scalable serializable logging and snapshots. LLAMA [22] adds support for transactional logging, but its logger is centralized.
2.2 Transactional systems
Silo uses optimistic concurrency control [10, 18], which has several advantages on multicore machines, including a relatively short period of contention. However, OCC and its variants (e.g., [1, 12, 34]) often induce contention in the commit phase, such as centralized transaction ID assignment or communication among all concurrently executing transactions.
Larson et al. [19] recently revisited the performance of locking and OCC-based multi-version concurrency control (MVCC) systems versus that of traditional single-copy locking systems in the context of multicore main-memory databases. Their OCC implementation exploits MVCC to avoid installing writes until commit time, and avoids many of the centralized critical sections present in classic OCC. These techniques form the basis for concurrency control in Hekaton [8], the main-memory component of SQL Server. However, their de-