Alibaba Cloud, VLDB 2024


Towards Millions of Database Transmission Services in the Cloud

[Industrial Track]

Hua Fan*, Dachao Fu*, Xu Wang, Jiachi Zhang, Chaoji Zuo, Zhengyi Wu, Miao Zhang,

Kang Yuan, Xizi Ni, Guocheng Huo, Wenchao Zhou, Feifei Li, Jingren Zhou

Alibaba Group

Hangzhou, China

{guanming.fh,qianzhen.fdc,wx105683,zhangjiachi.zjc,zuochaoji.zcj,wuzhengyi.wzy,yanmen.zm}@alibaba-inc.com

{yuankang.yk,xizi.nxz,guocheng.hgc,zwc231487,lifeifei,jingren.zhou}@alibaba-inc.com

Abstract

Alibaba relies on its robust database infrastructure to facilitate real-time data access and ensure business continuity despite regional disruptions. To address these operational imperatives, Alibaba developed the Data Transmission Service (DTS), which has become critical for internal applications and public cloud services alike. This paper presents a comprehensive study of the architectural innovations, resource scheduling mechanisms, and performance optimization strategies that have been implemented within DTS to tackle the significant challenges of cross-network, heterogeneous data transmission in a cost-effective manner. We explore the novel Any-to-Any (A2A) architecture, which simplifies the complexity of data paths between diverse databases and mitigates network connectivity issues, thereby significantly reducing development overhead. Additionally, we examine a dynamic network bandwidth scheduling algorithm that effectively maintains Service-Level Objectives (SLOs), complemented by a serverless mechanism that ensures efficient resource utilization. Furthermore, DTS utilizes advanced strategies such as transaction dependency tracking, hot data consolidation, and batching to enhance synchronization performance and efficiency. DTS has distilled the lessons learned from years of serving our customer base and currently supports nearly 1 million public cloud instances annually. Our evaluation results show that DTS can effectively and efficiently handle real-time data transmission in both experimental and production environments.

1 Introduction

Alibaba operates a vast digital commerce service, anchored by its resilient database services, which store essential business data. This requires two key functions. The first is real-time access to database information, which is critical for applications like advertising and search and prompts the need for services that can parse real-time database logs to satisfy the many business units demanding instant data from primary databases [ ]. The second is business continuity against regional disruptions (such as power outages or natural disasters), which demands real-time database synchronization to secondary regions for swift operation transfer [ ]. These necessities drove Alibaba to create its own Data Transmission Service (DTS) [ ] in 2011, focusing on synchronization between databases (e.g., MySQL to MySQL).

*Both authors contributed equally to this research.

As Alibaba Cloud Computing expanded, it began offering a variety of database services to the public cloud, triggering a need for migrating more than 24 different types of databases from local data centers or other cloud providers. This diversity led to a surplus of potential data transmission pathways, heavily complicating the process and increasing the development workload. Network connectivity issues further exacerbated this complexity, potentially requiring specialized programs to access private intranets, culminating in a significant challenge: developing numerous cross-network, heterogeneous data transmission services cost-effectively.

Managing a high volume of DTS instances poses the challenge of resource scheduling. This complexity arises from the need to balance and allocate network and computational resources, such as bandwidth, CPU, and memory, effectively among a multitude of services. Insufficient resource allocation can lead to violations of Service-Level Objectives (SLOs), adversely affecting customer business operations. As the demand for real-time access to data grows, ensuring efficient resource scheduling becomes critical for maintaining service quality and reliability in DTS operations.
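One simple way to picture the allocation problem is max-min fair sharing of a single link among competing transmission tasks. The sketch below is an illustrative stand-in and not the DTS scheduler: the function name `max_min_fair`, the single-link model, and the demand figures are all assumptions made for the example.

```python
def max_min_fair(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Water-filling allocation on one link (hypothetical model).

    Tasks with small demands are fully satisfied first; once the equal
    share of the remaining capacity is smaller than every pending demand,
    the remaining tasks split what is left evenly.
    """
    allocation: dict[str, float] = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        fair_share = remaining / len(pending)
        name, demand = pending[0]
        if demand <= fair_share:
            # The smallest pending demand fits inside its fair share.
            allocation[name] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Nobody left fits: cap every pending task at the fair share.
            for other_name, _ in pending:
                allocation[other_name] = fair_share
            break
    return allocation


# Three tasks (demands in arbitrary bandwidth units) on a link of 100 units.
shares = max_min_fair(100.0, {"task_a": 10.0, "task_b": 50.0, "task_c": 80.0})
```

In this example `task_a` receives its full demand, while `task_b` and `task_c` are each capped at an equal share of the remainder. A task whose share falls below what it needs to keep up with its source is the kind of case that puts an SLO at risk.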

The third challenge that emerges is related to synchronization performance issues. This concerns the need for near-zero delay in real-time data synchronization, which is highly sought after by our customers. However, synchronization latency can be significantly affected by the performance of the target database, particularly under high-frequency updates. Factors contributing to this delay include lower concurrency in database replication compared to the source [ ], performance discrepancies in updates between heterogeneous databases [ ], and inefficiencies in writing to the target database.

To conquer these challenges while meeting customer and business needs, DTS was architected with several key design considerations. In this paper, we outline the architecture aimed at reducing development complexity, resource scheduling mechanisms for enhancing user experience and efficiency, and optimizations for performance of update operations. The specifics are as follows.

• DTS employs an Any-to-Any (A2A) architecture, which is a strategic design choice that allows for universal compatibility and flexibility in data transmission. This A2A approach enables DTS to interconnect any source database with any target database, transforming and translating data formats as needed. By adopting this architecture, DTS can reduce the number of potential data transmission pathways from the M × N combinations of M sources and N targets to an M+N configuration. On each link, DTS encapsulates network connectivity issues into predefined scenarios within the DTS framework. Users can thus select their scenario without the need for additional network programming to achieve connectivity.

• Using the optimization-based scheduling algorithm for network flow, DTS can intelligently manage and allocate bandwidth across different data transmission links. This algorithm takes into account the current network conditions, transmission priorities, and the overall demand on the system to dynamically adjust the flow of data. By doing so, it minimizes the risk of SLO violations and ensures fair distribution of network resources among all active transmissions. Moreover, DTS serverless dynamically alters resource allocation for each service according to the current workload and performance metrics. This adaptive resource management ensures that computational resources are allocated efficiently in real time.

• DTS employs a series of strategies that collectively enhance performance and efficiency. These strategies include the optimization of transaction execution by tracking dependencies to maximize concurrency, the consolidation of frequently accessed data (hot data) to reduce the volume of writes, and the use of batching techniques to enhance the transfer and processing of data. These enhancements are particularly crucial in real-time synchronization scenarios, where delays can have significant downstream impacts on business operations.
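The any-to-any idea in the first point can be sketched with a shared intermediate record: each source type needs one reader and each target type needs one writer, so any reader can be paired with any writer. This is a minimal illustration under stated assumptions; the `Record` type, the registries, and the toy `rows`, `kv`, and `documents` adapters are hypothetical and do not describe the DTS internals.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    """Hypothetical engine-neutral record passed between adapters."""
    table: str
    key: str
    values: dict


Reader = Callable[[object], Iterable[Record]]
Writer = Callable[[object, Iterable[Record]], None]

READERS: dict[str, Reader] = {}   # one entry per source type
WRITERS: dict[str, Writer] = {}   # one entry per target type


def read_rows(source):
    # Toy source: a list of (table, key, values) tuples.
    for table, key, values in source:
        yield Record(table, key, dict(values))


def write_kv(target, records):
    # Toy key-value target: a dict keyed by "table:key".
    for record in records:
        target[f"{record.table}:{record.key}"] = record.values


def write_documents(target, records):
    # Toy document target: a list of dict documents.
    for record in records:
        target.append({"_table": record.table, "_id": record.key, **record.values})


READERS["rows"] = read_rows
WRITERS["kv"] = write_kv
WRITERS["documents"] = write_documents


def transmit(source_kind, source, target_kind, target):
    """Pair any registered reader with any registered writer."""
    WRITERS[target_kind](target, READERS[source_kind](source))


def adapter_counts(m: int, n: int) -> dict[str, int]:
    """Adapters needed for m source types and n target types."""
    return {"pairwise": m * n, "any_to_any": m + n}


kv_store, documents = {}, []
rows = [("users", "1", {"name": "ann"})]
transmit("rows", rows, "kv", kv_store)
transmit("rows", rows, "documents", documents)
```

As a purely arithmetic illustration, `adapter_counts(24, 24)` gives 576 pairwise converters against 48 readers and writers.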

In summary, this paper makes the following contributions:

(1) The adoption of an A2A architecture, when paired with predefined network connectivity scenarios, effectively reduces the development complexity associated with DTS.

(2) We demonstrate the effectiveness of DTS's optimization-based scheduling algorithm in managing network flow, alongside its dynamic resource allocation mechanism that enables real-time adaptation to fluctuating workloads.

(3) DTS enhances performance and efficiency through an approach that encompasses tracking transaction dependencies, consolidating hot data, and implementing batching techniques, while upholding user-defined consistency standards.

(4) We showcase the real-world deployment of DTS, which supports nearly one million public cloud instances annually, thereby affirming its practicality and scalability in an industrial setting.

The remainder of this paper is structured as follows: Section 2 offers an overview of data transmission, detailing the complexities and challenges involved in managing a vast number of DTS instances. In Section 3, we delve into the architectural design, introducing the A2A architecture, and the mechanisms it utilizes for establishing network connectivity. Section 4 explores the resource scheduling solutions including the bandwidth allocation algorithm and the DTS serverless mechanism, while Section 5 delves into the optimization strategies for efficient data writing to target databases. Lastly, Section 6 evaluates DTS's performance improvements for individual instances and the collective benefits within a datacenter.

2 Background and Motivation

In this section, we introduce background knowledge of data transmission (Section 2.1) and three major challenges as motivations of this paper (Section 2.2).

Figure 1: Diversity of Databases in DTS Instances within a Region (Circle Sizes Represent Traffic Volume)

2.1 Data Transmission

In the domain of database research, a typical data transmission scenario entails data replication between two different databases, namely the source database and the target database. Based on the transmission medium, replication can be categorized into two types: physical replication, which involves the direct duplication of raw database files, and logical replication, which replays Data Manipulation Language (DML) statements on the target database. Physical replication can be readily implemented utilizing inherent features provided by database management systems, such as MySQL's Multi-threaded Replication mechanism [ ]. However, its application is constrained because it requires the source and target databases to be of the same type. Therefore, data transmission services, such as AWS Database Migration Service (DMS) [ ], Oracle's GoldenGate [ ], and Fivetran [ ], favor logical replication because it accommodates heterogeneous database types.

The heterogeneity of databases also compels data transmission providers to implement logical replication outside of database engines. Taking AWS DMS as an example, a transmission task consists of a source endpoint that fetches data from the source database and a target endpoint that is responsible for writing to the target database. Data transmission tasks are typically categorized, based on the fetched data, into full transmission tasks that transfer entire tables at once and Change Data Capture (CDC) transmission tasks that replay DML statements from write-ahead logs (WALs) in real time [ ]. Despite being a mature field, the growing scale of data transmission continues to bring forth novel challenges.
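The two task types can be illustrated with a small, self-contained sketch that uses Python's built-in `sqlite3` module for both endpoints. It is an assumption-laden toy: the `users` table, the helper names `full_copy` and `apply_change`, and the hand-written `change_log` (standing in for entries parsed from a real write-ahead log) are all hypothetical.

```python
import sqlite3


def full_copy(source, target, table):
    """Full transmission: copy every row of the table in one pass."""
    rows = source.execute(f"SELECT id, name FROM {table}").fetchall()
    target.executemany(f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows)
    target.commit()


def apply_change(target, table, change):
    """CDC transmission: replay one logged DML operation on the target."""
    operation, row_id, name = change
    if operation == "insert":
        target.execute(
            f"INSERT INTO {table} (id, name) VALUES (?, ?)", (row_id, name)
        )
    elif operation == "update":
        target.execute(f"UPDATE {table} SET name = ? WHERE id = ?", (name, row_id))
    elif operation == "delete":
        target.execute(f"DELETE FROM {table} WHERE id = ?", (row_id,))
    else:
        raise ValueError(f"unknown operation: {operation}")


def demo():
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for db in (source, target):
        db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    source.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    source.commit()

    # Phase 1: the full task moves the existing table contents.
    full_copy(source, target, "users")

    # Phase 2: the CDC task replays changes made after the snapshot.
    change_log = [("insert", 3, "cy"), ("update", 1, "anna"), ("delete", 2, None)]
    for change in change_log:
        apply_change(target, "users", change)
    target.commit()
    return target.execute("SELECT id, name FROM users ORDER BY id").fetchall()


final_rows = demo()
```

A real service also has to record the log position at which the snapshot was taken so that the CDC phase starts from exactly that point; the sketch leaves this out.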

2.2 Challenges

In this section, we introduce three major challenges that emerge as a result of the escalating scale of data transmission. These challenges are examined along three dimensions of scale: database and network diversity, task quantity, and transmission velocity.

2.2.1 Database and Network Diversity. First, the source and target databases may encompass a wide variety of database types. As reported by DB-Engines [ ], as of March 2024, there are hundreds of cataloged database systems. Furthermore, Alibaba Cloud offers a suite of standard cloud services encompassing 24 distinct database types [ ]. Databases differ significantly in terms of their connection protocols, syntax conventions, and underlying data models, such as relational, key-value (KV), or document-oriented. Therefore, a universal data transmission tool does not exist.
