
and scale-in abilities [10]. We adopt the shared-storage architecture and extend it to build a cloud-native multi-primary database.
The big challenge for multi-primary architectures is high-performance transaction management across multiple nodes (detecting transaction conflicts, guaranteeing consistency, and achieving fast failure recovery). To address this challenge, we propose a three-layer (compute-memory-storage) disaggregation system, GaussDB, with efficient and elastic multi-writer capabilities, as shown in Figure 1. Three-layer disaggregation makes GaussDB more elastic by independently scaling compute, memory, and storage resources.
GaussDB logically partitions the pages across different compute nodes, and each compute node owns a subset of pages. The compute layer is in charge of SQL optimization and execution, transaction management, and recovery. For each transaction issued to a compute node, if all the relevant pages of this transaction are owned by this compute node, then the compute node can directly process the transaction; otherwise, the compute node obtains the ownership of all relevant pages and then processes the transaction. To capture data affinity and reduce page transmission costs, GaussDB designs an effective page placement and query routing method. The memory layer is in charge of page ownership management (maintaining a page ownership directory, i.e., the owner of each page), global buffer management (i.e., warm pages that cannot be kept in compute nodes), and global lock management (e.g., the holding of and waiting on global locks). The memory layer is stateless and can be rebuilt from compute node state. Most importantly, the memory layer enables near-instant compute elasticity by decoupling compute growth from page ownership growth. The storage layer is responsible for page persistence, log persistence, and failure recovery.
GaussDB utilizes two-tier failure recovery over both memory and storage checkpoints. If a compute node goes down, GaussDB first uses a memory checkpoint to recover the node; if the memory layer fails, then GaussDB uses a storage checkpoint. Each compute node has its own log stream, and GaussDB only replays the logs of the failed compute node, without accessing the logs of other nodes. If multiple nodes fail, GaussDB employs an efficient parallel recovery method to recover the different nodes simultaneously.
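To make the two-tier recovery flow concrete, the following minimal sketch (in Go, with hypothetical names such as recoverNode and Checkpoint that are not part of GaussDB's actual code) illustrates the idea: each failed node is recovered from the memory checkpoint when the memory layer is alive, falls back to the storage checkpoint otherwise, replays only its own log stream, and different failed nodes are recovered in parallel.

```go
// Hypothetical sketch of GaussDB-style two-tier recovery: a failed compute
// node is recovered from the memory checkpoint when the memory layer is
// alive, and from the storage checkpoint otherwise; failed nodes recover in
// parallel, each replaying only its own log stream.
package main

import (
	"fmt"
	"sync"
)

type Checkpoint struct {
	Name string
	LSN  uint64 // recovery replays log records after this LSN (assumption)
}

// recoverNode picks a checkpoint and replays node-local redo logs from it.
// Only the failed node's log stream is read; other nodes' logs are untouched.
func recoverNode(nodeID int, memAlive bool, memCkpt, storCkpt Checkpoint) string {
	ckpt := storCkpt
	if memAlive {
		ckpt = memCkpt // newer checkpoint, so much less log to replay
	}
	// replayLogStream(nodeID, ckpt.LSN) would go here in a real system.
	return fmt.Sprintf("node %d recovered from %s checkpoint (LSN %d)",
		nodeID, ckpt.Name, ckpt.LSN)
}

func main() {
	memCkpt := Checkpoint{Name: "memory", LSN: 9000}
	storCkpt := Checkpoint{Name: "storage", LSN: 5000}
	failedNodes := []int{1, 3}
	memLayerAlive := true

	var wg sync.WaitGroup
	for _, id := range failedNodes {
		wg.Add(1)
		go func(id int) { // recover different failed nodes in parallel
			defer wg.Done()
			fmt.Println(recoverNode(id, memLayerAlive, memCkpt, storCkpt))
		}(id)
	}
	wg.Wait()
}
```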
In summary, GaussDB has several advantages. First, GaussDB achieves higher transaction throughput and lower latency with far fewer aborts compared to storage-layer log-based transaction conflict detection. Second, GaussDB achieves much faster recovery. Third, GaussDB has better scale-out and scale-in abilities.
To summarize, we make the following contributions.
(1) We propose a cloud-native multi-primary database system, GaussDB, which uses a three-layer (compute-memory-storage) disaggregation framework to support multiple writers.
(2) We devise a two-tier (memory checkpoint and storage checkpoint) recovery algorithm for fast recovery.
(3) We design a smart page placement method that judiciously assigns pages to different compute nodes and smartly routes queries to appropriate compute nodes in order to capture data affinity (see the sketch after this list).
(4) We have deployed GaussDB internally at Huawei and with customers. The results show that GaussDB achieves higher performance and faster recovery, outperforming state-of-the-art baselines.
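As a concrete illustration of contribution (3), the following minimal sketch shows one way affinity-aware routing could work: each page has an owner node under some placement, and a query is routed to the node that already owns most of the pages it touches, so fewer ownership transfers are needed. The placement, the routeQuery function, and the scoring rule are illustrative assumptions, not GaussDB's actual policy.

```go
// Hypothetical sketch of affinity-aware query routing under a given page
// placement: route each query to the compute node that already owns the
// largest share of the pages it touches.
package main

import "fmt"

type PageID uint64
type NodeID int

// owner maps each page to its current owner node (the placement).
var owner = map[PageID]NodeID{
	1: 0, 2: 0, 3: 0, // pages of one table placed together on node 0
	4: 1, 5: 1,
}

// routeQuery picks the compute node owning the most pages the query touches.
func routeQuery(pages []PageID) NodeID {
	counts := map[NodeID]int{}
	for _, p := range pages {
		counts[owner[p]]++
	}
	best, bestCnt := NodeID(0), -1
	for n, c := range counts {
		if c > bestCnt {
			best, bestCnt = n, c
		}
	}
	return best
}

func main() {
	q := []PageID{1, 2, 4} // touches two pages on node 0, one on node 1
	fmt.Printf("route query to node %d\n", routeQuery(q))
}
```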
Figure 1: GaussDB Architecture
2 GAUSSDB ARCHITECTURE
GaussDB has three disaggregated layers: compute, memory, and storage, as shown in Figure 1. The compute layer logically and dynamically assigns page ownership to different compute nodes, and each compute node manages the pages assigned to it; the memory layer provides global shared memory and holds page ownership metadata; and the storage layer provides globally shared storage. Compute nodes are in charge of SQL optimization, execution, and transaction processing. For each transaction on a compute node, the compute node gets the ownership of all the related pages and processes them on this node. Memory nodes provide unified shared memory which maintains global page ownership (i.e., which compute node owns which page), global buffers (i.e., data and index pages), global locks, and memory checkpoints. GaussDB can use memory checkpoints to accelerate failure recovery. Storage nodes are responsible for page and log persistence via a POSIX interface to the shared-storage file system. Storage nodes maintain storage checkpoints, which are used for failure recovery. The difference between a memory checkpoint and a storage checkpoint is that the former recovers using the pages in shared memory together with the memory checkpoint, while the latter recovers using the pages in storage nodes together with the storage checkpoint. Obviously, the former yields faster recovery. If memory-based recovery fails, GaussDB falls back to storage checkpoints to continue recovery.
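As an illustration of the page ownership metadata kept by the memory nodes described above, the following minimal sketch models a page ownership directory that records the owner of each page and serializes ownership transfers; the OwnershipDirectory type and Acquire method are hypothetical names used for exposition, not GaussDB's interface.

```go
// Hypothetical sketch of the memory layer's page ownership directory: it
// records which compute node currently owns each page and serializes
// ownership transfers under a lock.
package main

import (
	"fmt"
	"sync"
)

type PageID uint64
type NodeID int

// OwnershipDirectory maps each page to its owner compute node.
type OwnershipDirectory struct {
	mu    sync.Mutex
	owner map[PageID]NodeID
}

func NewOwnershipDirectory() *OwnershipDirectory {
	return &OwnershipDirectory{owner: map[PageID]NodeID{}}
}

// Acquire gives ownership of page p to node n and reports the previous owner
// (or -1 if the page was unowned), so the caller knows whom to fetch the
// latest page version from.
func (d *OwnershipDirectory) Acquire(p PageID, n NodeID) (prev NodeID) {
	d.mu.Lock()
	defer d.mu.Unlock()
	prev = NodeID(-1)
	if cur, ok := d.owner[p]; ok {
		prev = cur
	}
	d.owner[p] = n
	return prev
}

func main() {
	dir := NewOwnershipDirectory()
	fmt.Println(dir.Acquire(42, 0)) // -1: page 42 was unowned, node 0 takes it
	fmt.Println(dir.Acquire(42, 1)) // 0: node 1 takes page 42 over from node 0
}
```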
Next, we introduce the GaussDB modules, as shown in Figure 2.
Compute Layer. Compute nodes are in charge of transaction processing. To support multiple writers, each compute node can modify any page once it acquires the page ownership. As with standard write-ahead logging, it writes its changes to a redo log stream. To avoid page conflicts, each compute node manages a subset of pages; that is, each page has an owner node, and only the owner node has write privileges for this page. If a non-owner node wants to access a page, the node must get the write/read privilege from the owner node of this page. Thus, each compute node has a local buffer manager for maintaining the pages it owns in its local buffer pool and a local lock manager for access control to these pages. Given a transaction issued to a compute node, if the node owns all the relevant pages for this transaction (i.e., they all reside in its local buffer), GaussDB directly processes the transaction using its local buffers and local locks; if the node does not own all the pages, the compute node needs to find the pages (via the page ownership directory at the memory layer) and acquire ownership of these pages.
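This per-transaction flow can be sketched as follows (a minimal illustration, assuming hypothetical names such as ComputeNode, acquireOwnership, and Execute rather than GaussDB's actual code): if the node already owns every page the transaction touches, it runs entirely on its local buffers and local locks; otherwise it first acquires ownership of the missing pages through the memory layer.

```go
// Hypothetical sketch of a compute node's transaction path: fast path when
// all touched pages are locally owned, slow path when ownership of some
// pages must first be transferred via the memory layer.
package main

import "fmt"

type PageID uint64

type ComputeNode struct {
	id         int
	ownedPages map[PageID]bool // pages this node owns in its local buffer pool
}

// acquireOwnership stands in for the round trip to the memory layer's page
// ownership directory (and, if needed, the page transfer from the old owner).
func (n *ComputeNode) acquireOwnership(p PageID) {
	n.ownedPages[p] = true
}

// Execute processes a transaction over the given pages.
func (n *ComputeNode) Execute(txn string, pages []PageID) {
	for _, p := range pages {
		if !n.ownedPages[p] {
			n.acquireOwnership(p) // slow path: ownership transfer needed
		}
	}
	// Fast path from here on: all pages are local, so the transaction runs
	// with the node's local buffer manager and local lock manager only.
	fmt.Printf("node %d executed %s on %d local pages\n", n.id, txn, len(pages))
}

func main() {
	node := &ComputeNode{id: 0, ownedPages: map[PageID]bool{1: true, 2: true}}
	node.Execute("T1", []PageID{1, 2})    // all pages already owned
	node.Execute("T2", []PageID{2, 3, 4}) // pages 3 and 4 must be acquired first
}
```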
For recovery, the compute node has a write-ahead log manager and an undo segment manager for atomicity and durability. Note