the whole database is partitioned. Each node is independent and accesses data exclusively within its designated partition. When a transaction spans multiple partitions, it must rely on cross-partition distributed transaction mechanisms, such as the two-phase commit protocol, which typically incurs significant extra overhead [57, 61].
The shared-storage architecture is essentially the opposite: all data is accessible from all cluster nodes. Examples include IBM pureScale [20], Oracle RAC [9], AWS Aurora Multi-Master (Aurora-MM) [3], and Huawei Taurus-MM [16]. IBM pureScale and Oracle RAC are two traditional database products based on the shared-storage architecture. They rely on expensive distributed lock management and incur high network overhead. They are usually deployed on dedicated machines and are too rigid for a dynamic cloud environment. Consequently, they usually have a much higher total cost of ownership (TCO) than modern cloud-native databases. Aurora-MM and Taurus-MM are recently proposed multi-primary cloud-native databases. Aurora-MM uses optimistic concurrency control to handle write conflicts, which induces a substantial abort rate when conflicts occur. In some scenarios, the throughput of its four-node cluster is even lower than that of a single node [16]. In contrast, Taurus-MM adopts pessimistic concurrency control, but it relies on page stores and log replay for cache coherence. As such, it suffers from the high overhead of concurrency control and data synchronization. Its eight-node cluster improves throughput by only 1.8× over the single-node version on a read-write workload with 50% shared data.
To address these challenges, this paper proposes PolarDB-MP, a multi-primary cloud-native database built on disaggregated shared memory over shared storage. PolarDB-MP inherits the disaggregated shared storage model from PolarDB, giving all primary nodes equal access to the storage. This enables a transaction to be processed entirely on a single node without resorting to a distributed transaction. Instead of optimistic concurrency control, PolarDB-MP employs pessimistic concurrency control to mitigate transaction aborts caused by write conflicts. Unlike other cloud databases that rely on log replay and page servers for data coherence between nodes, PolarDB-MP uses disaggregated shared memory for efficient cache and data coherence. With the growing availability of RDMA networks in cloud vendors' data centers [54], PolarDB-MP is closely co-designed with RDMA to enhance its performance.
The core component of PolarDB-MP is the Polar Multi-Primary Fusion Server (PMFS), which is built on disaggregated shared memory. PMFS comprises Transaction Fusion, Buffer Fusion, and Lock Fusion. Transaction Fusion facilitates transaction visibility and ordering across multiple nodes. It utilizes a Timestamp Oracle (TSO) for ordering and allocates shared memory on each node to store local transaction data that is accessible remotely by other nodes. This decentralized transaction management approach via shared memory ensures low latency and high performance in global transaction processing. Buffer Fusion implements a distributed buffer pool (DBP), also based on disaggregated shared memory and accessible by all nodes. Nodes can both push modified data to and retrieve data from the DBP remotely. This setup allows swift propagation of changes from one node to others, ensuring cache coherency and rapid data access. Lock Fusion efficiently manages both page-level and row-level locking schemes, enabling concurrent data page access across different nodes while ensuring physical data consistency and maintaining transactional consistency.
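To make the division of responsibilities concrete, below is a minimal, self-contained C++ sketch of how the three PMFS services could be exposed to a database node. All names and interfaces here (TimestampOracle, DistributedBufferPool, LockFusion), and the in-process maps standing in for RDMA-accessible shared memory, are illustrative assumptions rather than PolarDB-MP's actual API.

// Illustrative sketch only: local data structures stand in for the
// disaggregated, RDMA-accessible shared memory that PMFS builds on.
#include <atomic>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <unordered_map>
#include <vector>

// Transaction Fusion (assumed interface): a Timestamp Oracle hands out
// monotonically increasing timestamps used to order transactions globally.
class TimestampOracle {
 public:
  uint64_t Next() { return counter_.fetch_add(1) + 1; }
 private:
  std::atomic<uint64_t> counter_{0};
};

// Buffer Fusion (assumed interface): a distributed buffer pool (DBP) that
// every node can push modified pages to and fetch pages from.
class DistributedBufferPool {
 public:
  void PushPage(uint64_t page_id, std::vector<char> data) {
    std::lock_guard<std::mutex> g(mu_);
    pages_[page_id] = std::move(data);
  }
  bool FetchPage(uint64_t page_id, std::vector<char>* out) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = pages_.find(page_id);
    if (it == pages_.end()) return false;  // caller falls back to shared storage
    *out = it->second;
    return true;
  }
 private:
  std::mutex mu_;
  std::unordered_map<uint64_t, std::vector<char>> pages_;
};

// Lock Fusion (assumed interface): global page-level locks; row-level locks
// would follow the same pattern.
class LockFusion {
 public:
  bool TryLockPage(uint64_t page_id, int node_id) {
    std::lock_guard<std::mutex> g(mu_);
    auto [it, inserted] = owners_.emplace(page_id, node_id);
    return inserted || it->second == node_id;
  }
  void UnlockPage(uint64_t page_id) {
    std::lock_guard<std::mutex> g(mu_);
    owners_.erase(page_id);
  }
 private:
  std::mutex mu_;
  std::unordered_map<uint64_t, int> owners_;
};

int main() {
  TimestampOracle tso;
  DistributedBufferPool dbp;
  LockFusion locks;

  // Node 1 updates page 42 under a global page lock and publishes it.
  if (locks.TryLockPage(42, /*node_id=*/1)) {
    dbp.PushPage(42, {'n', 'e', 'w'});
    std::cout << "node 1 committed at ts " << tso.Next() << "\n";
    locks.UnlockPage(42);
  }

  // Node 2 later reads the up-to-date page from the DBP instead of storage.
  std::vector<char> page;
  if (dbp.FetchPage(42, &page)) {
    std::cout << "node 2 fetched " << page.size() << " bytes from the DBP\n";
  }
  return 0;
}

In the real system, these structures would reside in disaggregated shared memory reachable over one-sided RDMA rather than behind local mutexes; the sketch only illustrates how the three services divide the work.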
PolarDB-MP also manages write-ahead logs across multiple nodes, using a logical log sequence number (LLSN) to establish a partial order over logs from different nodes. The LLSN ensures that all logs associated with a page are maintained in the same order as they were generated, thereby preserving data consistency and simplifying recovery. In addition, PolarDB-MP introduces a new recovery policy based on the LLSN framework to handle crash recovery scenarios effectively.
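To illustrate how such a partial order can be maintained, the sketch below assigns each log record an LLSN that is forced past both the node's last issued LLSN and the last LLSN recorded for the target page, so that all logs touching a page keep their generation order. This is a simplified interpretation of the property stated above, with hypothetical names, not PolarDB-MP's actual LLSN allocation algorithm.

// Hypothetical LLSN allocator: demonstrates only the per-page ordering
// property described in the text.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <unordered_map>

class LlsnAllocator {
 public:
  // Assign an LLSN for a log record that modifies `page_id`. The LLSN must
  // exceed both this node's previous LLSN and the page's last LLSN, so all
  // logs touching the same page are ordered the way they were generated.
  uint64_t Assign(uint64_t page_id, uint64_t page_last_llsn) {
    last_llsn_ = std::max(last_llsn_, page_last_llsn) + 1;
    page_last_llsn_[page_id] = last_llsn_;
    return last_llsn_;
  }

  uint64_t PageLastLlsn(uint64_t page_id) const {
    auto it = page_last_llsn_.find(page_id);
    return it == page_last_llsn_.end() ? 0 : it->second;
  }

 private:
  uint64_t last_llsn_ = 0;
  std::unordered_map<uint64_t, uint64_t> page_last_llsn_;
};

int main() {
  // Two primary nodes write to the same page alternately. The page's last
  // LLSN is assumed to travel with the page, so each new log record is
  // ordered after every earlier record for that page.
  LlsnAllocator node1, node2;
  uint64_t a = node1.Assign(/*page_id=*/7, /*page_last_llsn=*/0);
  uint64_t b = node2.Assign(7, node1.PageLastLlsn(7));
  uint64_t c = node1.Assign(7, node2.PageLastLlsn(7));
  std::cout << a << " < " << b << " < " << c << "\n";  // prints 1 < 2 < 3
  return 0;
}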
Additionally, PolarDB-MP's message passing and RPC operations are enhanced by a highly optimized RDMA library, boosting overall efficiency. These advanced designs position PolarDB-MP as a robust, efficient solution for multi-primary cloud-native databases.
We summarize our main contributions as follows:
• We propose a multi-primary cloud-native relational database built on disaggregated shared memory over shared storage (the first of its kind), delivering high performance and scalability.
• We leverage RDMA-based disaggregated shared memory to design and implement the Polar Multi-Primary Fusion Server (PMFS), which achieves global buffer pool management, a global page/row locking protocol, and global transaction coordination.
• We propose the LLSN design to offer a partial order for write-ahead logs generated across different nodes, accompanied by a tailored recovery policy.
• We thoroughly evaluate PolarDB-MP with different workloads and compare it with different systems, demonstrating its high performance, scalability, and availability.
This paper is structured as follows. First, we present the background and motivation in Section 2. Then we provide an overview of PolarDB-MP and its detailed design in Section 3 and Section 4, respectively. Next, we evaluate PolarDB-MP in Section 5 and review related work in Section 6. Finally, we conclude the paper in Section 7.
2 BACKGROUND AND MOTIVATION
2.1 Single-primary cloud-native database
Nowadays, there are many cloud-native databases based on the primary-secondary architecture, such as Aurora [49], Hyperscale [14, 26], Socrates [1], and PolarDB [27]. Figure 1 depicts the typical architecture of a primary-secondary cloud-native database. It usually
comprises a primary node for processing both read and write re-
quests and one or more secondary nodes dedicated to handling
read requests. Each node is a complete database instance, equipped
with standard components. However, a distinctive feature of these databases, as opposed to traditional monolithic databases, is the use of disaggregated shared storage. The shared storage ensures fault tolerance and consistency; moreover, adding more secondary nodes does not require additional storage. This contrasts
with conventional primary/secondary database clusters, where each
node maintains its own storage.
While primary-secondary databases offer certain benefits, they face significant challenges under write-heavy workloads. Scaling out to improve performance is not an option in such an architecture, and scaling up is limited by the available resources on
the physical machine. Cloud providers typically maximize the use
of physical resources, but since these resources are shared across