
Anser: Adaptive Information Sharing Framework of AnalyticDB

Liang Lin

Alibaba Group

Hangzhou, China

yibo.ll@alibaba-inc.com

Yuhan Li

Alibaba Group

Hangzhou, China

lyh200442@alibaba-inc.com

Bin Wu

Alibaba Group

Hangzhou, China

binwu.wb@alibaba-inc.com

Huijun Mai

Alibaba Group

Hangzhou, China

huijun.mhj@alibaba-inc.com

Renjie Lou

Alibaba Group

Hangzhou, China

json.lrj@alibaba-inc.com

Jian Tan

Alibaba Group

Hangzhou, China

j.tan@alibaba-inc.com

Feifei Li

Alibaba Group

Hangzhou, China

lifeifei@alibaba-inc.com

ABSTRACT

The surge in data analytics has fostered burgeoning demand for AnalyticDB on Alibaba Cloud, which has well served thousands of customers from various business sectors. The most notable feature is the diversity of the workloads it handles, including batch processing, real-time data analytics, and unstructured data analytics. To improve the overall performance for such diverse workloads, one of the major challenges is to optimize long-running complex queries without sacrificing the processing efficiency of short-running interactive queries. While existing methods attempt to utilize runtime dynamic statistics for adaptive query processing, they often focus on specific scenarios instead of providing a holistic solution.

To address this challenge, we propose a new framework called Anser, which enhances the design of traditional distributed data warehouses by embedding a new information sharing mechanism. This allows for the efficient management of the production and consumption of various dynamic information across the system. Building on top of Anser, we introduce a novel scheduling policy that optimizes both data and information exchanges within the physical plan, enabling the acceleration of complex analytical queries without sacrificing the performance of short-running interactive queries. We conduct comprehensive experiments over public and in-house workloads to demonstrate the effectiveness and efficiency of our proposed information sharing framework.

PVLDB Reference Format:

Liang Lin, Yuhan Li, Bin Wu, Huijun Mai, Renjie Lou, Jian Tan, and Feifei Li. Anser: Adaptive Information Sharing Framework of AnalyticDB. PVLDB, 16(12): 3636 - 3648, 2023.

doi:10.14778/3611540.3611553

1 INTRODUCTION

As modern organizations struggle with managing diverse workloads including batch processing, real-time data analytics, and unstructured data analytics, they face the challenge of maintaining optimal performance. To meet this challenge, there has been a trend towards system convergence, with many organizations transitioning towards uniform systems that can handle diverse workloads. One of the most widely adopted industry solutions is Spark [ ], a fast and flexible data processing engine that can handle diverse workloads. Similarly, Redshift [ ], a cloud-based data warehousing solution, offers automatic tuning capabilities to handle complex workloads more efficiently.

Figure 1: Statistics collected from AnalyticDB’s production workloads. (a) The distribution of diverse workloads. (b) Resource consumption of JOIN queries.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 16, No. 12 ISSN 2150-8097. doi:10.14778/3611540.3611553

AnalyticDB [ ] is a high-performance data warehouse developed by Alibaba Cloud. It has been extensively adopted both internally for Alibaba Group’s business operations and externally across a range of industries, such as e-commerce, finance, logistics, education, and entertainment. Within AnalyticDB, we have noticed a trend of increasing diversity in terms of query response times. As shown in Figure 1a, many simple and short queries, such as business-critical intelligence queries issued from dashboards, can be processed in milliseconds. These queries account for up to 80% of the customers’ workloads in our production environments. To satisfy the quality of service requirements, it is essential to ensure that these interactive short queries have sufficient resources. Meanwhile, it is also common to have complex analytical queries that exceed hundreds of KB in size, involving aggregations, multi-way joins, and nested subqueries. Statistics show that long queries with response times (RT) of more than 10 seconds account for over 10% of the workloads, yet consume more than 50% of the computation resources. To evaluate the resource consumption of the complex analytical queries, we collect related statistics of JOIN queries from AnalyticDB’s production workloads (as shown in Figure 1b), including the query CPU time and the number of shuffled rows. As the number of join operators increases, the required resources also grow dramatically.

As evidenced by the statistics above, optimization techniques for expensive batch computing tasks and ETL jobs play a vital role in improving the overall performance of modern data warehousing systems. As the number of concurrent queries increases, the competition between queries for resources (e.g., CPU, memory, and network) has become severe. In some cases, long queries may exhaust the resources of a database instance, so that subsequent short queries belonging to the same instance will not be processed. We summarize several key challenges that remain unsolved:

Challenge 1: Scenario customization for adaptive query processing. As workloads become increasingly diverse and statistics become less available, it has become clear that traditional "optimize-then-execute" strategies [ ] are no longer sufficient. This realization has led to a broad range of studies in the field of adaptive query processing [ ]. Many commercial databases have implemented various adaptive techniques, but the current approaches tend to build scenario-customized solutions for each technique, which can introduce unnecessary complexity into the system [ ]. For example, Spark’s adaptive query execution [ ] supports four features: mid-query re-optimization, dynamically coalescing shuffle partitions, dynamically switching join strategies, and dynamically optimizing skew joins. Each feature individually collects dynamic statistics and makes adjustments. A general-purpose information framework that can fit into these different scenarios would significantly reduce costs. By developing a framework that can share information across different adaptive techniques, it would be possible to eliminate redundant efforts and minimize the complexity of the system. Such a framework would enable data warehousing systems to optimize their performance without having to build scenario-customized solutions for each individual technique. Furthermore, the same statistics could be used by different cases (for example, all four of the features that Spark supports require shuffle file statistics), yet implementing each case individually removes the opportunity for the same information to be used multiple times. In particular, when statistics collection is resource-consuming (such as with a Bloom filter), sharing information among multiple cases could significantly reduce costs.
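To make the reuse argument concrete, the following is a minimal sketch, not Anser's implementation: a hypothetical `SharedStats` cache collects a statistic once and serves it to two independent adaptive features. The class, the function names, and the partition sizes are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedStats:
    """Hypothetical cache: each statistic is collected once, then reused."""
    collectors: Dict[str, Callable[[], object]] = field(default_factory=dict)
    cache: Dict[str, object] = field(default_factory=dict)
    collect_calls: Dict[str, int] = field(default_factory=dict)

    def register(self, name: str, collector: Callable[[], object]) -> None:
        self.collectors[name] = collector

    def get(self, name: str) -> object:
        # Run the (possibly expensive) collector only on the first request.
        if name not in self.cache:
            self.collect_calls[name] = self.collect_calls.get(name, 0) + 1
            self.cache[name] = self.collectors[name]()
        return self.cache[name]


def shuffle_partition_sizes() -> List[int]:
    # Stand-in for reading shuffle file statistics; the numbers are made up.
    return [120, 8, 5, 900, 7]


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily merge adjacent partitions while the group stays within target."""
    groups, current, total = [], [], 0
    for index, size in enumerate(sizes):
        if current and total + size > target:
            groups.append(current)
            current, total = [], 0
        current.append(index)
        total += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int], factor: int) -> List[int]:
    """Flag partitions larger than factor times the median size."""
    median = sorted(sizes)[len(sizes) // 2]
    return [i for i, size in enumerate(sizes) if size > factor * median]


stats = SharedStats()
stats.register("shuffle_partition_sizes", shuffle_partition_sizes)
# Two different adaptive features consume the same statistic.
groups = coalesce_partitions(stats.get("shuffle_partition_sizes"), target=128)
skewed = skewed_partitions(stats.get("shuffle_partition_sizes"), factor=5)
print(groups, skewed, stats.collect_calls)
```

Both consumers read the same cached value, so the collector runs once no matter how many features ask for it. A per-feature implementation would repeat the collection for each feature.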

Challenge 2: Effective and efficient management of dynamic statistics. To identify and share common dynamic statistics, information collection and utilization need to be decoupled from the existing modules of the query engine, and holistic management of the information lifecycle is required to register, collect, store, disperse, and destroy the information. None of the previous studies have clearly defined the scope of the information that can be used in different adaptive techniques, nor have they framed a mechanism to manage the information lifecycle. In a production environment, information collection, transmission, and storage all lead to additional overhead. A high-performance data warehouse requires such overhead to be diminished and separated from the query execution process. Moreover, a carefully designed mechanism is necessary to limit the memory usage of the information storage so that the dynamic statistics do not affect the overall system.
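The lifecycle named above (register, collect, store, disperse, destroy) together with a memory cap can be sketched as follows. This is an illustration under stated assumptions, not the mechanism used in AnalyticDB: the `InfoStore` class, its method names, and the oldest-first eviction policy are all hypothetical choices.

```python
from collections import OrderedDict
from typing import Optional


class InfoStore:
    """Hypothetical memory-bounded store for dynamic statistics."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.registered = set()
        self.entries = OrderedDict()  # key -> payload, oldest first

    def register(self, key: str) -> None:
        # Only registered information may later be collected.
        self.registered.add(key)

    def _drop(self, key: str) -> None:
        payload = self.entries.pop(key, None)
        if payload is not None:
            self.used_bytes -= len(payload)

    def collect(self, key: str, payload: bytes) -> bool:
        """Store a payload; return False if it can never fit the budget."""
        if key not in self.registered:
            raise KeyError(f"{key!r} was not registered")
        if len(payload) > self.max_bytes:
            return False  # drop the statistic instead of exceeding the cap
        self._drop(key)  # replace any earlier value for the same key
        while self.used_bytes + len(payload) > self.max_bytes:
            oldest = next(iter(self.entries))
            self._drop(oldest)  # assumed policy: evict oldest entries first
        self.entries[key] = payload
        self.used_bytes += len(payload)
        return True

    def disperse(self, key: str) -> Optional[bytes]:
        """Hand the stored payload to a consumer, or None if unavailable."""
        return self.entries.get(key)

    def destroy(self, key: str) -> None:
        # End of lifecycle: free the memory and forget the registration.
        self._drop(key)
        self.registered.discard(key)


store = InfoStore(max_bytes=10)
for name in ("filter_a", "filter_b", "filter_c"):
    store.register(name)
store.collect("filter_a", b"aaaaaa")
store.collect("filter_b", b"bbbbbb")       # evicts filter_a to stay under 10 bytes
oversized = store.collect("filter_c", b"c" * 20)
print(store.disperse("filter_a"), store.used_bytes, oversized)
```

Because statistics are an optimization rather than a correctness requirement, this sketch rejects or evicts them when memory is tight, and a consumer that receives `None` simply runs without the information.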

Challenge 3: Coordination with query scheduler. The statistics information can be holistically leveraged to optimize the adaptive adjustments. To this end, the scheduler naturally comes into the picture to orchestrate the execution order of information consumers and producers. However, none of the previous studies have clearly defined a scheduler that is aware of the information dependencies. In batch processing systems, the transmission of adaptive statistics is mostly implemented as part of the execution process. Some approaches [ ] add checkpoints in the execution plan that monitor statistics during execution and trigger re-optimization if necessary, while others rewrite the provider of information explicitly as a sub-expression in the query execution to provide adaptive statistics as part of the query execution process. Both approaches tie information transmission strictly to data processing, which means that the information consumer can only receive information from its upstream operators, without considering the possibilities of receiving information sideways or discarding information with high production costs. Some real-time data analytics systems support passing information sideways, but mostly through tailored services. For example, Impala [ ] implements a dynamic filter service to pass information sideways. Such services are customized for specific use cases and cannot be easily extended to others. Moreover, the scheduler is not aware of such information transmission. The consumer of the information either waits a static time period for statistics to arrive or only consumes the statistics that are available before it starts running. As execution plans become more complicated, useful statistics may not be consumed to trigger adaptive execution without cooperation with the scheduler. Therefore, a more sophisticated mechanism is required to manage the transmission and consumption of adaptive statistics, one that takes into account the collaboration with the scheduler.
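One way to picture a scheduler that is aware of information dependencies is to add producer-to-consumer information edges to the data-dependency graph before ordering the stages. The sketch below is a deliberate simplification and not the scheduling policy of Anser: the stage names and the `schedule` function are hypothetical, and every information dependency is treated as a hard ordering constraint, whereas a real scheduler might skip information that is too costly to wait for.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Set


def schedule(
    data_deps: Dict[str, Set[str]],
    info_deps: Optional[Dict[str, Set[str]]] = None,
) -> List[List[str]]:
    """Return waves of stages; stages in one wave may start together."""
    # Each stage maps to the set of stages that must finish before it starts.
    graph = {stage: set(deps) for stage, deps in data_deps.items()}
    for consumer, producers in (info_deps or {}).items():
        graph.setdefault(consumer, set()).update(producers)

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical plan: a join reads a build-side scan and a probe-side scan.
data_deps = {
    "scan_build": set(),
    "scan_probe": set(),
    "join": {"scan_build", "scan_probe"},
}
# Assumed sideways information: the probe scan can use a filter derived
# from the build side, so it is the consumer and scan_build the producer.
info_deps = {"scan_probe": {"scan_build"}}

print(schedule(data_deps))
print(schedule(data_deps, info_deps))
```

With data dependencies alone, both scans are eligible to start in the same wave, so the probe scan may begin before the filter exists. With the information edge added, the build scan runs first and the probe scan can consume its output.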

To this end, a novel information sharing framework, namely Adaptive Information Sharing Framework (Anser), is developed in AnalyticDB. Our major contributions are summarized as follows:

(1) The framework provides a uniform and effective interface for different modules to share various types and levels of information. At the operator level, Anser collects various types and levels of information, classifies it according to type and granularity, and passes it to different modules across the query to tune for better performance during execution.

(2) The framework supports the automatic matching and transmission of information between information producer and information consumer once the relationship is registered. It supports many-to-one and one-to-many information passing in a complex physical execution tree. The transmission is both low-latency and efficient through the use of information merging and a push-based communication model.

(3) In conjunction with the framework, we design an information-aware scheduler, allowing for prioritization of scheduling sequences based on information dependencies. Anser improves query performance by sending information sideways based on pre-determined dependencies that the information can be
