暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
ICDE 2022-PinSQL- Pinpoint Root Cause to Resolve SQL Performance Issues In Cloud Databases.pdf
304
13页
4次
2023-05-31
免费下载
PinSQL: Pinpoint Root Cause SQLs to Resolve
Performance Issues in Cloud Databases
Xiaoze Liu
†‡
, Zheng Yin
§
, Chao Zhao
†‡
, Congcong Ge
†‡
, Lu Chen
, Yunjun Gao
,
Dimeng Li
§
, Ziting Wang
§
, Gaozhong Liang
§
, Jian Tan
§
, Feifei Li
§
Zhejiang University,
Alibaba-Zhejiang University Joint Institute of Frontier Technologies,
§
Alibaba Group, China
{xiaoze, yuhao.zhao, gcc, luchen, gaoyj}@zju.edu.cn
§
{yinzheng.yz, lidimeng.ldm, zizhou.wzt, gaozhong.lgz, j.tan, lifeifei}@alibaba-inc.com
Abstract—Deploying database services on cloud systems has
gained increasing popularity and has become a common practice
in the industry. However, the complicated cloud environments
make performance issues inevitable, which could violate the
service level guarantee if not addressed in a timely manner.
Among the various problems, anomalies in SQL queries are
the most commonly reported sources that cause performance
issues in database applications. These anomalous queries can be
divided into High-impact SQLs (H-SQLs) and Root Cause SQLs
(R-SQLs), representing the related SQLs that are correlated
with the anomalies and the ones that are the root causes of
the performance issue, respectively. In the presence of a large
number of queries, to pinpoint the R-SQLs is far more difficult
than to identify the H-SQLs. To address this challenge, we aim
at automatically pinpointing the R-SQLs to resolve performance
issues in cloud databases.
This paper introduces PinSQL, an autonomous diagnosing
system for Alibaba Cloud, which has four modules that are exe-
cuted sequentially, including data collection and pre-processing,
anomaly detection, root cause analysis, and repairing actions.
First, the related performance metrics and query logs from
monitored cloud database instances are collected and aggregated
as the data sources. Then, based on these inputs, efficient
anomaly detection is conducted in real-time. Upon the detection
of an anomaly, the root cause SQLs are pinpointed through
tracking the propagation chain of the involved SQLs. Finally,
repairing actions are suggested and then executed on R-SQLs
to address the anomalies. Extensive experiments on an Alibaba
production system show that PinSQL can achieve an 80%
accuracy for pinpointing the top-1 R-SQLs and successfully
resolve the database performance issues resultantly.
Index Terms—cloud databases, root cause, performance
anomaly
I. INTRODUCTION
With the development of cloud computing technologies,
many mission-critical services have been deployed on the
cloud, where the services are usually hosted on cloud database
systems such as Alibaba Cloud RDS [1], AWS RDS [2], Mi-
crosoft Azure SQL Database [3], and Google Cloud SQL [4].
However, the complicated cloud environments make perfor-
mance issues inevitable, which could violate the service level
guarantee if not addressed in a timely manner. Among the
various problems, it has been reported that 70% of them
are caused by database problems [5]. These performance
anomalies could lead to potential service interruptions and thus
adversely affect customers’ business operations. In order to
provide services with high elasticity, availability, and stability,
cloud database vendors have paid much attention to efficiently
diagnosing performance issues, such as drastic increases in
CPU usage, spikes in the number of running threads, large
fluctuations in business traffic flows, and so on.
To proactively prevent the performance issue, it is not
sufficient to detect anomalies alone, and it is also crucial
to diagnose the root cause of detected anomalies. Many
studies have investigated the problem of root cause anal-
ysis (RCA for short) of performance anomalies on cloud
databases, including classification-based, Top-SQL-based, and
Autoregressive-based approaches. Specifically, classification-
based approaches [6]–[9] divide the causes into a limited
collection of types. Top-SQL-based approaches [2], [10]–
[12] select the highest SQLs via sorting performance metrics.
Autoregressive-based approaches [13]–[15] analyze causal de-
pendency between variables on multivariate time-series data.
In real-world operations and maintenance (O&M) scenarios,
most performance anomalies in cloud databases are caused by
a large number of concurrent and competing transactions [6].
Among those anomalous SQL queries, Root Cause SQLs
(R-SQLs), such as business scenario change (QPS sudden
increase), poor SQL statements, and MDL locks/Row locks,
are the keys to resolve the performance anomalies. Although
existing studies have significantly reduced human labor for
identifying anomalies, they cannot pinpoint the R-SQLs to effi-
ciently resolve the performance issues. Moreover, many exist-
ing works [6]–[9] aim to optimize or tackle system problems.
However, system problems caused by R-SQLs might not need
special treatment. For example, for solving the CPU bottleneck
caused by some CPU-intensive R-SQLs, it is not necessary
to apply instance scaling. A more reasonable method is to
perform targeted optimization on R-SQLs instead of system
problems, which can reduce the impact of anomalies, thereby
improving the overall stability of the database instance.
As a large number of SQL queries exist in cloud databases,
it is difficult or costly for DBAs to manually find out the
anomalous queries that directly or indirectly affect key per-
formance metrics. Driven by this, we aim to pinpoint R-SQLs
automatically. Considering various types of SQLs, we usually
aggregate SQL queries into different SQL templates [16]–[19].
Hence, instead of finding out specific root cause queries, we
focus on SQL templates in this paper. Performance anomalies
caused by SQL queries are related to a high active session of
the database instance [20] (to be detailed in Definition II.4),
2550
2022 IEEE 38th International Conference on Data Engineering (ICDE)
2375-026X/22/$31.00 ©2022 IEEE
DOI 10.1109/ICDE53745.2022.00236
where active session denotes the number of active SQL queries
each timestamp, and an active session of a template consists
of the number of active SQL queries that belong to this same
template. Once a group of anomalous SQLs (i.e., R-SQLs)
appear, High-impact SQLs (H-SQLs for short), being the R-
SQLs themselves or SQLs affected by R-SQLs, will appear
simultaneously. The H-SQLs directly affect the instance per-
formance, incurring the anomalies of active session detected
by the anomaly detector. We describe this entire process of
generating performance anomalies as an anomaly propagation
chain: R-SQLsH-SQLsactive session. We aim to locate
R-SQLs through the anomaly propagation chain. Three chal-
lenges exist as below when pinpointing R-SQLs.
Challenge I: How to effectively obtain the individual active
session of templates without degrading the database instance
performance? To model the impact of templates on the in-
stance active session, we need to obtain the individual active
session of each template. A straightforward method is to
utilize the total response time of templates, as the response
time of queries is positively correlated with the active session
metric [20]. However, this method is inaccurate because the
response time of SQL queries cannot fully represent whether
they are active. In order to obtain accurate active session
of templates, conventional approaches utilize database built-
in monitoring systems [21]. However, such monitoring sys-
tems will produce performance overhead, especially in large-
scale production environments. Hence, it is difficult to obtain
accurate active sessions of templates without degrading the
database instance performance.
Challenge II: How to correctly model the impact of SQL
templates on the instance active session? After obtaining the
individual active session of each template, we need to model
the impact of each template on the instance active session in
order to locate H-SQLs, as H-SQLs are those templates that
directly cause the sudden change of active session. Templates
that have a tremendous amount of traffic flow are often
regarded as H-SQLs by Top-SQL-based approaches. However,
these templates may not be H-SQLs. This is because stable
traffic queries may not be affected by R-SQLs on large cloud
database instances. For example, given R-SQLs UPDATE
queries, they only block other queries that operate on the same
table but will not affect queries that examine other tables.
Motivated by this, templates that directly cause an anomaly
should have both considerable traffic and a similar trend with
the active session. Commonly used correlation coefficients
only consider how similar the trends of two time-series data
are, but ignore the scale. Thus, it is difficult to locate H-SQLs.
Challenge III: How to distinguish R-SQLs from massive
SQLs through the located H-SQLs? Existing industrial solu-
tions only provide a ranking of templates by metrics (e.g.,
total response time for the active session), leaving the task of
finding the root cause to DBAs. However, the active session
metric might not be affected too much by R-SQLs, while
the H-SQLs affected by R-SQLs lead to high active session.
Thus, it’s tough for DBAs to pinpoint R-SQLs by simply
ranking the related metrics of templates during the anomaly
Monitor
Query Log
Cloud Database Anomaly
Performa nce Metric
SQL Template Metrics
2304
A84F
E6DC
E6DC
E6DC
2304
A84F
2304
CC98
SQL STATEMENT
UPDATE... SET...
SET...FROM..
SELECT...WHERE
SELECT...WHERE
SELECT...WHERE
UPDATE... SET...
SET...FROM..
UPDATE... SET...
ROLLBACK
SQL STATEMENT
UPDATE... SET...
SET...FROM..
SELECT...WHERE
SELECT...WHERE
SELECT...WHERE
UPDATE... SET...
SET...FROM..
UPDATE... SET...
ROLLBACK
SQL_ID
Fig. 1. Diagnosing DB Instance Anomaly with SQL Query Logs. The
query log shows the unique SQL template ID (SQL
ID for short) and SQL
Statements.
time. For example, assume a SALES table in the database
instance. A set of SQL queries to UPDATE the table arrives,
which adds exclusive locks on many rows in this table. Thus,
running SELECT threads on this table are forced to wait due
to exclusive locks, resulting in a sudden increase of active
session metric. Finally, the increase of the active session metric
is detected by anomaly detectors. In this case, the SELECT
queries are H-SQLs, while UPDATE queries are R-SQLs.
Therefore, it is difficult to find out R-SQLs, especially in
database instances with complicated business queries from
different applications.
To address the challenges stated above, we propose a system
PinSQL. As depicted in Fig. 1, PinSQL analyses aggregated
metrics from query logs to find R-SQLs that causing anomaly
cases. PinSQL consists of four modules, i.e., Data Collection
And Anomaly Detection Module, High-impact SQL Identifi-
cation Module, Root Cause SQL Identification Module, and
Repairing Module. First, Data Collection And Anomaly Detec-
tion Module collects and aggregates performance metrics and
query logs from cloud database instances. It also detects per-
formance anomalies in real-time by performance metrics. Once
an anomaly is detected, following the anomaly propagation
chain, High-impact SQL Identification Module is triggered to
locate H-SQLs. Thereafter, based on the result of High-impact
SQL Identification Module, Root Cause SQL Identification
Module select possible R-SQLs through a clustering-based
strategy. To react to the anomaly, PinSQL comes with a
Repairing Module to suggest/execute actions on R-SQLs, to
solve the anomalies. Furthermore, we integrate PinSQL into
the Database Autonomy Service (DAS) of Alibaba Cloud [22].
We summarize the key contributions as follows:
We develop PinSQL, an autonomous diagnosing system that
includes two key features (i.e., root cause analysis and au-
tomatic repairing), to solve the problem of pinpointing root
cause SQLs for the performance issues in cloud databases.
We introduce a Data Collection And Anomaly Detection
Module, which estimates the active session of each template
with little performance overhead on database instances.
We propose a High-impact SQL Identification Module,
which fuses the multi-level impact of SQL templates on
active session to effectively identify H-SQLs.
We present a Root Cause SQL Identification Module, which
2551
of 13
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜