ICDE 2022-PinSQL- Pinpoint Root Cause to Resolve SQL Performance Issues In Cloud Databases.pdf

波风水门

304

13页

4次

2023-05-31

免费下载

PinSQL: Pinpoint Root Cause SQLs to Resolve

Performance Issues in Cloud Databases

Xiaoze Liu

†‡

, Zheng Yin

, Chao Zhao

†‡

, Congcong Ge

†‡

, Lu Chen

†

, Yunjun Gao

†

Dimeng Li

, Ziting Wang

, Gaozhong Liang

, Jian Tan

, Feifei Li

†

Zhejiang University,

‡

Alibaba-Zhejiang University Joint Institute of Frontier Technologies,

Alibaba Group, China

†

{xiaoze, yuhao.zhao, gcc, luchen, gaoyj}@zju.edu.cn

{yinzheng.yz, lidimeng.ldm, zizhou.wzt, gaozhong.lgz, j.tan, lifeifei}@alibaba-inc.com

Abstract—Deploying database services on cloud systems has

gained increasing popularity and has become a common practice

in the industry. However, the complicated cloud environments

make performance issues inevitable, which could violate the

service level guarantee if not addressed in a timely manner.

Among the various problems, anomalies in SQL queries are

the most commonly reported sources that cause performance

issues in database applications. These anomalous queries can be

divided into High-impact SQLs (H-SQLs) and Root Cause SQLs

(R-SQLs), representing the related SQLs that are correlated

with the anomalies and the ones that are the root causes of

the performance issue, respectively. In the presence of a large

number of queries, to pinpoint the R-SQLs is far more difﬁcult

than to identify the H-SQLs. To address this challenge, we aim

at automatically pinpointing the R-SQLs to resolve performance

issues in cloud databases.

This paper introduces PinSQL, an autonomous diagnosing

system for Alibaba Cloud, which has four modules that are exe-

cuted sequentially, including data collection and pre-processing,

anomaly detection, root cause analysis, and repairing actions.

First, the related performance metrics and query logs from

monitored cloud database instances are collected and aggregated

as the data sources. Then, based on these inputs, efﬁcient

anomaly detection is conducted in real-time. Upon the detection

of an anomaly, the root cause SQLs are pinpointed through

tracking the propagation chain of the involved SQLs. Finally,

repairing actions are suggested and then executed on R-SQLs

to address the anomalies. Extensive experiments on an Alibaba

production system show that PinSQL can achieve an 80%

accuracy for pinpointing the top-1 R-SQLs and successfully

resolve the database performance issues resultantly.

Index Terms—cloud databases, root cause, performance

anomaly

I. INTRODUCTION

With the development of cloud computing technologies,

many mission-critical services have been deployed on the

cloud, where the services are usually hosted on cloud database

systems such as Alibaba Cloud RDS [1], AWS RDS [2], Mi-

crosoft Azure SQL Database [3], and Google Cloud SQL [4].

However, the complicated cloud environments make perfor-

mance issues inevitable, which could violate the service level

guarantee if not addressed in a timely manner. Among the

various problems, it has been reported that 70% of them

are caused by database problems [5]. These performance

anomalies could lead to potential service interruptions and thus

adversely affect customers’ business operations. In order to

provide services with high elasticity, availability, and stability,

cloud database vendors have paid much attention to efﬁciently

diagnosing performance issues, such as drastic increases in

CPU usage, spikes in the number of running threads, large

ﬂuctuations in business trafﬁc ﬂows, and so on.

To proactively prevent the performance issue, it is not

sufﬁcient to detect anomalies alone, and it is also crucial

to diagnose the root cause of detected anomalies. Many

studies have investigated the problem of root cause anal-

ysis (RCA for short) of performance anomalies on cloud

databases, including classiﬁcation-based, Top-SQL-based, and

Autoregressive-based approaches. Speciﬁcally, classiﬁcation-

based approaches [6]–[9] divide the causes into a limited

collection of types. Top-SQL-based approaches [2], [10]–

[12] select the highest SQLs via sorting performance metrics.

Autoregressive-based approaches [13]–[15] analyze causal de-

pendency between variables on multivariate time-series data.

In real-world operations and maintenance (O&M) scenarios,

most performance anomalies in cloud databases are caused by

a large number of concurrent and competing transactions [6].

Among those anomalous SQL queries, Root Cause SQLs

(R-SQLs), such as business scenario change (QPS sudden

increase), poor SQL statements, and MDL locks/Row locks,

are the keys to resolve the performance anomalies. Although

existing studies have signiﬁcantly reduced human labor for

identifying anomalies, they cannot pinpoint the R-SQLs to efﬁ-

ciently resolve the performance issues. Moreover, many exist-

ing works [6]–[9] aim to optimize or tackle system problems.

However, system problems caused by R-SQLs might not need

special treatment. For example, for solving the CPU bottleneck

caused by some CPU-intensive R-SQLs, it is not necessary

to apply instance scaling. A more reasonable method is to

perform targeted optimization on R-SQLs instead of system

problems, which can reduce the impact of anomalies, thereby

improving the overall stability of the database instance.

As a large number of SQL queries exist in cloud databases,

it is difﬁcult or costly for DBAs to manually ﬁnd out the

anomalous queries that directly or indirectly affect key per-

formance metrics. Driven by this, we aim to pinpoint R-SQLs

automatically. Considering various types of SQLs, we usually

aggregate SQL queries into different SQL templates [16]–[19].

Hence, instead of ﬁnding out speciﬁc root cause queries, we

focus on SQL templates in this paper. Performance anomalies

caused by SQL queries are related to a high active session of

the database instance [20] (to be detailed in Deﬁnition II.4),

2550

2022 IEEE 38th International Conference on Data Engineering (ICDE)

DOI 10.1109/ICDE53745.2022.00236

where active session denotes the number of active SQL queries

each timestamp, and an active session of a template consists

of the number of active SQL queries that belong to this same

template. Once a group of anomalous SQLs (i.e., R-SQLs)

appear, High-impact SQLs (H-SQLs for short), being the R-

SQLs themselves or SQLs affected by R-SQLs, will appear

simultaneously. The H-SQLs directly affect the instance per-

formance, incurring the anomalies of active session detected

by the anomaly detector. We describe this entire process of

generating performance anomalies as an anomaly propagation

chain: R-SQLs→H-SQLs→active session. We aim to locate

R-SQLs through the anomaly propagation chain. Three chal-

lenges exist as below when pinpointing R-SQLs.

Challenge I: How to effectively obtain the individual active

session of templates without degrading the database instance

performance? To model the impact of templates on the in-

stance active session, we need to obtain the individual active

session of each template. A straightforward method is to

utilize the total response time of templates, as the response

time of queries is positively correlated with the active session

metric [20]. However, this method is inaccurate because the

response time of SQL queries cannot fully represent whether

they are active. In order to obtain accurate active session

of templates, conventional approaches utilize database built-

in monitoring systems [21]. However, such monitoring sys-

tems will produce performance overhead, especially in large-

scale production environments. Hence, it is difﬁcult to obtain

accurate active sessions of templates without degrading the

database instance performance.

Challenge II: How to correctly model the impact of SQL

templates on the instance active session? After obtaining the

individual active session of each template, we need to model

the impact of each template on the instance active session in

order to locate H-SQLs, as H-SQLs are those templates that

directly cause the sudden change of active session. Templates

that have a tremendous amount of trafﬁc ﬂow are often

regarded as H-SQLs by Top-SQL-based approaches. However,

these templates may not be H-SQLs. This is because stable

trafﬁc queries may not be affected by R-SQLs on large cloud

database instances. For example, given R-SQLs UPDATE

queries, they only block other queries that operate on the same

table but will not affect queries that examine other tables.

Motivated by this, templates that directly cause an anomaly

should have both considerable trafﬁc and a similar trend with

the active session. Commonly used correlation coefﬁcients

only consider how similar the trends of two time-series data

are, but ignore the scale. Thus, it is difﬁcult to locate H-SQLs.

Challenge III: How to distinguish R-SQLs from massive

SQLs through the located H-SQLs? Existing industrial solu-

tions only provide a ranking of templates by metrics (e.g.,

total response time for the active session), leaving the task of

ﬁnding the root cause to DBAs. However, the active session

metric might not be affected too much by R-SQLs, while

the H-SQLs affected by R-SQLs lead to high active session.

Thus, it’s tough for DBAs to pinpoint R-SQLs by simply

ranking the related metrics of templates during the anomaly

Monitor

Query Log

Cloud Database Anomaly

Performa nce Metric

SQL Template Metrics

2304

A84F

E6DC

2304

A84F

2304

CC98

SQL STATEMENT

UPDATE... SET...

SET...FROM..

SELECT...WHERE

UPDATE... SET...

SET...FROM..

UPDATE... SET...

ROLLBACK

SQL STATEMENT

UPDATE... SET...

SET...FROM..

SELECT...WHERE

UPDATE... SET...

SET...FROM..

UPDATE... SET...

ROLLBACK

SQL_ID

Fig. 1. Diagnosing DB Instance Anomaly with SQL Query Logs. The

query log shows the unique SQL template ID (SQL

ID for short) and SQL

Statements.

time. For example, assume a SALES table in the database

instance. A set of SQL queries to UPDATE the table arrives,

which adds exclusive locks on many rows in this table. Thus,

running SELECT threads on this table are forced to wait due

to exclusive locks, resulting in a sudden increase of active

session metric. Finally, the increase of the active session metric

is detected by anomaly detectors. In this case, the SELECT

queries are H-SQLs, while UPDATE queries are R-SQLs.

Therefore, it is difﬁcult to ﬁnd out R-SQLs, especially in

database instances with complicated business queries from

different applications.

To address the challenges stated above, we propose a system

PinSQL. As depicted in Fig. 1, PinSQL analyses aggregated

metrics from query logs to ﬁnd R-SQLs that causing anomaly

cases. PinSQL consists of four modules, i.e., Data Collection

And Anomaly Detection Module, High-impact SQL Identiﬁ-

cation Module, Root Cause SQL Identiﬁcation Module, and

Repairing Module. First, Data Collection And Anomaly Detec-

tion Module collects and aggregates performance metrics and

query logs from cloud database instances. It also detects per-

formance anomalies in real-time by performance metrics. Once

an anomaly is detected, following the anomaly propagation

chain, High-impact SQL Identiﬁcation Module is triggered to

locate H-SQLs. Thereafter, based on the result of High-impact

SQL Identiﬁcation Module, Root Cause SQL Identiﬁcation

Module select possible R-SQLs through a clustering-based

strategy. To react to the anomaly, PinSQL comes with a

Repairing Module to suggest/execute actions on R-SQLs, to

solve the anomalies. Furthermore, we integrate PinSQL into

the Database Autonomy Service (DAS) of Alibaba Cloud [22].

We summarize the key contributions as follows:

• We develop PinSQL, an autonomous diagnosing system that

includes two key features (i.e., root cause analysis and au-

tomatic repairing), to solve the problem of pinpointing root

cause SQLs for the performance issues in cloud databases.

• We introduce a Data Collection And Anomaly Detection

Module, which estimates the active session of each template

with little performance overhead on database instances.

• We propose a High-impact SQL Identiﬁcation Module,

which fuses the multi-level impact of SQL templates on

active session to effectively identify H-SQLs.

• We present a Root Cause SQL Identiﬁcation Module, which

2551

of 13

免费下载

关注

评论