
PinSQL: Pinpoint Root Cause SQLs to Resolve
Performance Issues in Cloud Databases
Xiaoze Liu
†‡
, Zheng Yin
§
, Chao Zhao
†‡
, Congcong Ge
†‡
, Lu Chen
†
, Yunjun Gao
†
,
Dimeng Li
§
, Ziting Wang
§
, Gaozhong Liang
§
, Jian Tan
§
, Feifei Li
§
†
Zhejiang University,
‡
Alibaba-Zhejiang University Joint Institute of Frontier Technologies,
§
Alibaba Group, China
†
{xiaoze, yuhao.zhao, gcc, luchen, gaoyj}@zju.edu.cn
§
{yinzheng.yz, lidimeng.ldm, zizhou.wzt, gaozhong.lgz, j.tan, lifeifei}@alibaba-inc.com
Abstract—Deploying database services on cloud systems has
gained increasing popularity and has become a common practice
in the industry. However, the complicated cloud environments
make performance issues inevitable, which could violate the
service level guarantee if not addressed in a timely manner.
Among the various problems, anomalies in SQL queries are
the most commonly reported sources that cause performance
issues in database applications. These anomalous queries can be
divided into High-impact SQLs (H-SQLs) and Root Cause SQLs
(R-SQLs), representing the related SQLs that are correlated
with the anomalies and the ones that are the root causes of
the performance issue, respectively. In the presence of a large
number of queries, to pinpoint the R-SQLs is far more difficult
than to identify the H-SQLs. To address this challenge, we aim
at automatically pinpointing the R-SQLs to resolve performance
issues in cloud databases.
This paper introduces PinSQL, an autonomous diagnosing
system for Alibaba Cloud, which has four modules that are exe-
cuted sequentially, including data collection and pre-processing,
anomaly detection, root cause analysis, and repairing actions.
First, the related performance metrics and query logs from
monitored cloud database instances are collected and aggregated
as the data sources. Then, based on these inputs, efficient
anomaly detection is conducted in real-time. Upon the detection
of an anomaly, the root cause SQLs are pinpointed through
tracking the propagation chain of the involved SQLs. Finally,
repairing actions are suggested and then executed on R-SQLs
to address the anomalies. Extensive experiments on an Alibaba
production system show that PinSQL can achieve an 80%
accuracy for pinpointing the top-1 R-SQLs and successfully
resolve the database performance issues resultantly.
Index Terms—cloud databases, root cause, performance
anomaly
I. INTRODUCTION
With the development of cloud computing technologies,
many mission-critical services have been deployed on the
cloud, where the services are usually hosted on cloud database
systems such as Alibaba Cloud RDS [1], AWS RDS [2], Mi-
crosoft Azure SQL Database [3], and Google Cloud SQL [4].
However, the complicated cloud environments make perfor-
mance issues inevitable, which could violate the service level
guarantee if not addressed in a timely manner. Among the
various problems, it has been reported that 70% of them
are caused by database problems [5]. These performance
anomalies could lead to potential service interruptions and thus
adversely affect customers’ business operations. In order to
provide services with high elasticity, availability, and stability,
cloud database vendors have paid much attention to efficiently
diagnosing performance issues, such as drastic increases in
CPU usage, spikes in the number of running threads, large
fluctuations in business traffic flows, and so on.
To proactively prevent the performance issue, it is not
sufficient to detect anomalies alone, and it is also crucial
to diagnose the root cause of detected anomalies. Many
studies have investigated the problem of root cause anal-
ysis (RCA for short) of performance anomalies on cloud
databases, including classification-based, Top-SQL-based, and
Autoregressive-based approaches. Specifically, classification-
based approaches [6]–[9] divide the causes into a limited
collection of types. Top-SQL-based approaches [2], [10]–
[12] select the highest SQLs via sorting performance metrics.
Autoregressive-based approaches [13]–[15] analyze causal de-
pendency between variables on multivariate time-series data.
In real-world operations and maintenance (O&M) scenarios,
most performance anomalies in cloud databases are caused by
a large number of concurrent and competing transactions [6].
Among those anomalous SQL queries, Root Cause SQLs
(R-SQLs), such as business scenario change (QPS sudden
increase), poor SQL statements, and MDL locks/Row locks,
are the keys to resolve the performance anomalies. Although
existing studies have significantly reduced human labor for
identifying anomalies, they cannot pinpoint the R-SQLs to effi-
ciently resolve the performance issues. Moreover, many exist-
ing works [6]–[9] aim to optimize or tackle system problems.
However, system problems caused by R-SQLs might not need
special treatment. For example, for solving the CPU bottleneck
caused by some CPU-intensive R-SQLs, it is not necessary
to apply instance scaling. A more reasonable method is to
perform targeted optimization on R-SQLs instead of system
problems, which can reduce the impact of anomalies, thereby
improving the overall stability of the database instance.
As a large number of SQL queries exist in cloud databases,
it is difficult or costly for DBAs to manually find out the
anomalous queries that directly or indirectly affect key per-
formance metrics. Driven by this, we aim to pinpoint R-SQLs
automatically. Considering various types of SQLs, we usually
aggregate SQL queries into different SQL templates [16]–[19].
Hence, instead of finding out specific root cause queries, we
focus on SQL templates in this paper. Performance anomalies
caused by SQL queries are related to a high active session of
the database instance [20] (to be detailed in Definition II.4),
2550
2022 IEEE 38th International Conference on Data Engineering (ICDE)
2375-026X/22/$31.00 ©2022 IEEE
DOI 10.1109/ICDE53745.2022.00236
评论