BALANCE: Bayesian Linear Aribution for Root Cause
Localization
CHAOYU CHEN, Ant Group, China
HANG YU, Ant Group, China
ZHICHAO LEI, Ant Group, China
JIANGUO LI, Ant Group, China
SHAOKANG REN, Ant Group, China
TINGKAI ZHANG, Ant Group, China
SILIN HU, Ant Group, China
JIANCHAO WANG, Ant Group, China
WENHUI SHI, OceanBase, China
Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations,
as it bridges the gap between fault detection and system recovery. Existing works mainly study
multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting
the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose
BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA
through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of
the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian
multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a
forward manner while promoting sparsity and concurrently paying attention to the correlation between the
candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each
candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are
multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as
three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis
for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of
accuracy with the least amount of running time, and achieves accuracy at least 6% higher than SOTA
methods on the real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and
the online results further advocate its usage for real-time diagnosis in distributed data systems.
CCS Concepts: • Software and its engineering; • Information systems → Autonomous database
administration; • Computing methodologies → Feature selection; Regularization.
Both authors contributed equally to this work.
Corresponding author.
Code is available at https://github.com/ant-research/BayesianLinearAttributionForRootCauseLocalization_BALANCE.
Authors’ addresses: Chaoyu Chen, Ant Group, China, chris.ccy@antgroup.com; Hang Yu, Ant Group, China,
hyu.hugo@antgroup.com; Zhichao Lei, Ant Group, China, leizhichao.lzc@antgroup.com; Jianguo Li, Ant Group, China,
lijg.zero@antgroup.com; Shaokang Ren, Ant Group, China, renshaokang.rsk@antgroup.com; Tingkai Zhang, Ant Group, China,
tingkai.ztk@antgroup.com; Silin Hu, Ant Group, China, husilin.hsl@antgroup.com; Jianchao Wang, Ant Group, China,
luli.wjc@antgroup.com; Wenhui Shi, OceanBase, China, yushun.swh@oceanbase.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2836-6573/2023/5-ART95 $15.00
https://doi.org/10.1145/3588949
Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.
Additional Key Words and Phrases: Root Cause Analysis, Bayesian Method, Bad SQLs, Faults Diagnosis,
Distributed System, Attribution Analysis, Explainable AI
ACM Reference Format:
Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang,
and Wenhui Shi. 2023. BALANCE: Bayesian Linear Attribution for Root Cause Localization. Proc. ACM Manag.
Data 1, 1, Article 95 (May 2023), 26 pages. https://doi.org/10.1145/3588949
1 INTRODUCTION
System faults and incidents can have a tremendous influence on distributed data systems, which
are widely adopted in modern information technology (IT) and financial companies, since they
may lead to system outages, incur astounding financial loss, and jeopardize customer
trust [21]. It has been reported by Forbes that every year IT downtime costs an estimated $26.5
billion in lost revenue alone, not to mention the indirect expense, including lost customers and
references. Thus, it is imperative to conduct fast and precise fault diagnosis and recovery before
faults become service-impacting. A central task in fault diagnosis and recovery is root cause analysis
(RCA), which bridges the gap between fault detection and recovery [11, 13].
Currently, the task of RCA is mainly accomplished by site reliability engineers (SREs) with rich
operation experience. Unfortunately, such manual work becomes prohibitively slow due to the
increase of the scale and complexity of the architecture as well as the dynamic and unpredictable
nature of the system metrics and events, thus deviating from the requirement of efficiency. Indeed,
as mentioned in [19], it can take as long as several hours of manual work to diagnose the root
causes of intermittent slow queries in distributed database systems. This has sparked considerable
research efforts toward designing automated RCA algorithms based on machine learning so as to
provide aid in saving time and ultimately money.
Literature on RCA algorithms can be broadly divided into two categories. The first one focuses on
multidimensional root cause localization [5, 32, 47], which seeks to explain the abnormal behavior
of the additive key performance indicators (KPIs) by identifying the fault-indicating combinations
of their corresponding multi-dimensional attributes. The success of these algorithms relies on two
assumptions: 1) the value of the KPI in each dimension equals the sum of the values of its attributes
and 2) all the KPIs and their attributes can be monitored. However, these two assumptions can be
too restrictive in real-world problems, and a more practical setting is to attribute the anomalies to
root cause candidates without additive assumptions while allowing for missing data. On the other
hand, the second category revolves around graph-based RCA algorithms [14, 24, 38, 39]. These
approaches typically first construct a causal graph based on tracing service calls or causal discovery
algorithms [46] and then find the root cause node via rule-based traversing or random walk. A
major impediment to the application of tracing graphs and rule-based traversing is that it is system
invasive and typically incurs arduous work on enumerating all traces and rules. As an alternative,
causal discovery methods are employed to learn the graph structure as in [39]. Unfortunately, the
causal discovery methods suffer from both high computational and sample complexity [46], and in
consequence, they can be distressingly slow for large graphs and may lead to inaccurate results
when the number of observations for all metrics in the graph is small. After obtaining the graph,
the random walk methods are heuristic and might fail to converge to the root cause when the
number of random walks is not sufficiently large.
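To make the additive assumption of multidimensional localization concrete, it can be written as follows. The notation is ours and purely illustrative: $y(t)$ is the KPI at time $t$, $\mathcal{A}_d$ is the set of attribute values along dimension $d$, and $y_{d,a}(t)$ is the part of the KPI contributed by attribute value $a$.

```latex
% Assumption 1 (additivity), in illustrative notation not taken from the paper:
% along every dimension d, the KPI is the sum of its per-attribute values.
\[
  y(t) \;=\; \sum_{a \in \mathcal{A}_d} y_{d,a}(t)
  \qquad \text{for every dimension } d .
\]
```

For instance, a total request count is additive over a hypothetical "data center" dimension, whereas an average latency is not, which is one reason the assumption can be too restrictive in practice.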
In this paper, we explore alternatives and recast the RCA problem as a feature attribution
problem [16]. To the best of our knowledge, we are among the first to analyze the root cause
through the lens of attribution. As a commonly used tool in explainable AI (XAI), attribution
methods assign attribution scores to input features, the absolute value of which represents their
importance to the model prediction or performance [16]. Analogously, we aim to find the root
causes that can best explain the alarmed KPIs in RCA problems. The attribution scores of the
candidate causes represent their relevance or contribution to the alarmed KPIs. As a motivating
example, in database systems, “bad SQLs” refers to SQL statements whose performance has deteriorated due
to indexing errors or changes in the execution plan. The performance deterioration of these SQLs
typically leads to anomalies in the tenant KPIs and may severely influence the user experience. In
this case, the targets (𝒚) are the tenant KPIs and the candidate causes (𝑿) are SQL metrics.
An attribution task can then be accomplished in two steps: first, a forward model is constructed
that exploits the input features (i.e., candidate causes) to predict the outputs (i.e., alarmed KPIs),
and next, the significance of the input features is evaluated through attribution approaches in
a backward manner. Particularly in the bad SQL localization example, the number of candidate
SQLs 𝑝 varies in each case and can be as large as thousands, whereas the number of observations 𝑛
(the length of the corresponding time series) is typically small since we only focus on the part
around the anomalies. In other words, the dimension 𝑝 can be larger than the sample size 𝑛 in the
RCA problems. To address this issue and to automate the feature selection process, we adopt sparse
linear models as the forward model due to their high flexibility, efficiency, and interpretability.
Furthermore, the candidate causes are usually correlated with each other, and there often exist
missing values. To tackle these problems that plague linear models, we propose a novel Bayesian
multicollinear feature selection (BMFS) model. Afterward, we provide the attribution score for each
candidate cause from different perspectives, including sensitivity and salience. Finally, we merge
results when there exist multiple alarmed KPIs and each of them is attributed to a different set of
root causes. We name the overall model BALANCE (BAyesian Linear AttributioN for root CausE
localization).
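The two-step recipe above (a sparse forward model, then backward attribution) can be illustrated with a minimal sketch. This is not the BMFS model: a plain Lasso fitted by coordinate descent stands in for the forward model, and the attribution score used here, weight times the shift of a candidate between a baseline window and the anomaly window, is a common gradient-times-input style choice that may differ from the paper's sensitivity and salience definitions. The data are synthetic, and all names (`fit_lasso`, `attribution_scores`, `anomaly_start`, `true_cause`) are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, penalty, sweeps=200):
    """Coordinate descent for (1/2n) * ||y - Xw||^2 + penalty * ||w||_1."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    residual = list(y)  # equals y - Xw while all weights are zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlate column j with the residual that excludes j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * weights[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                weights[j] = new_w
    return weights


def attribution_scores(X, weights, anomaly_start):
    """Illustrative score: weight * (anomaly-window mean - baseline-window mean)."""
    n, p = len(X), len(X[0])
    scores = []
    for j in range(p):
        baseline = sum(X[i][j] for i in range(anomaly_start)) / anomaly_start
        anomaly = sum(X[i][j] for i in range(anomaly_start, n)) / (n - anomaly_start)
        scores.append(weights[j] * (anomaly - baseline))
    return scores


# Synthetic toy data (assumption): n = 20 time points, p = 40 candidates, so p > n.
rng = random.Random(0)
n, p, anomaly_start, true_cause = 20, 40, 15, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
for i in range(anomaly_start, n):
    X[i][true_cause] += 5.0  # candidate 3 spikes during the anomaly window
y = [2.0 * X[i][true_cause] + rng.gauss(0.0, 0.1) for i in range(n)]

weights = fit_lasso(X, y, penalty=0.5)                   # step 1: forward model
scores = attribution_scores(X, weights, anomaly_start)   # step 2: backward attribution
ranked = sorted(range(p), key=lambda j: abs(scores[j]), reverse=True)
print("top candidate:", ranked[0], "nonzero weights:", sum(w != 0.0 for w in weights))
```

The L1 penalty is what lets the fit remain meaningful when 𝑝 exceeds 𝑛: most weights are driven exactly to zero, so only a few candidates receive a nonzero attribution. Plain Lasso does not address correlated candidates or missing values, which are the gaps the text says BMFS is designed to close.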
We would like to point out that both the multidimensional RCA and the graph-based RCA can
be formulated from the perspective of attribution. Specifically, we can regard the multidimensional
RCA as attributing the anomalies in the KPIs to the combinations of their attributes. It follows that
the additive constraints in the multidimensional RCA can be removed, and hence, we only need to
consider the abnormal attributes under this scenario. On the other hand, by regarding all abnormal
nodes in the graph as candidate causes, BALANCE can be used to identify the root cause efficiently
even though the graph structure is not available or cannot be reliably learned, which is often the
case in practice. Viewed another way, BALANCE can also be used as a building block to construct
causal graphs, since linear regression models are frequently used for causal discovery [49]. Given
the graph, BALANCE serves as a better substitute for random walks as it does not require a large
number of random walks and so is more efficient.
We validate the usefulness of BALANCE on four datasets. First, we generate synthetic data with
different numbers of input features, different levels of multicollinearity, noise, and sparsity, and
different proportions of missing values, and then compare various forward models including the
proposed BMFS, Lasso, E-Net (Elastic net), and ARD (Automatic Relevance Determination). We find
that BMFS typically recovers the underlying true regression coefficients the best with comparable or
even shorter running time, especially when there exists multicollinearity among the input features.
Furthermore, we utilize the proposed method to address three real-world RCA problems. In the
first problem, we deal with bad SQL localization as mentioned before. Our results
show that the proposed method can identify the human-labeled root cause SQLs in fewer than 2
seconds per case with accuracy as high as 83.3%, whereas it takes 3 minutes for SREs on average.
The second application copes with the problem of container fault localization, whose objective is
to attribute the abnormal trace failures in a container to the metrics of the container, such as CPU
usage, memory usage, TCP, etc., and facilitate the self-healing process. The proposed method can
achieve an F1-score of 0.86, which is at least 20% higher than other baseline methods. Finally, we
apply BALANCE to a public dataset, Exathlon [12], for the purpose of fault type diagnosis, and