BALANCE: Bayesian Linear Aribution for Root Cause

Localization

CHAOYU CHEN∗, Ant Group, China

HANG YU∗, Ant Group, China

ZHICHAO LEI, Ant Group, China

JIANGUO LI†, Ant Group, China

SHAOKANG REN, Ant Group, China

TINGKAI ZHANG, Ant Group, China

SILIN HU, Ant Group, China

JIANCHAO WANG, Ant Group, China

WENHUI SHI, OceanBase, China

Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibility of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves at least 6% higher accuracy than SOTA methods on the real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further advocate its usage for real-time diagnosis in distributed data systems.

CCS Concepts: • Software and its engineering; • Information systems → Autonomous database administration; • Computing methodologies → Feature selection; Regularization;

∗Both authors contributed equally to this work.

†Corresponding author.

Code is available at https://github.com/ant-research/BayesianLinearAttributionForRootCauseLocalization_BALANCE.

Authors’ addresses: Chaoyu Chen, Ant Group, China, chris.ccy@antgroup.com; Hang Yu, Ant Group, China, hyu.hugo@antgroup.com; Zhichao Lei, Ant Group, China, leizhichao.lzc@antgroup.com; Jianguo Li, Ant Group, China, lijg.zero@antgroup.com; Shaokang Ren, Ant Group, China, renshaokang.rsk@antgroup.com; Tingkai Zhang, Ant Group, China, tingkai.ztk@antgroup.com; Silin Hu, Ant Group, China, husilin.hsl@antgroup.com; Jianchao Wang, Ant Group, China, luli.wjc@antgroup.com; Wenhui Shi, OceanBase, China, yushun.swh@oceanbase.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

2836-6573/2023/5-ART95 $15.00

https://doi.org/10.1145/3588949

Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.

Additional Key Words and Phrases: Root Cause Analysis, Bayesian Method, Bad SQLs, Faults Diagnosis, Distributed System, Attribution Analysis, Explainable AI

ACM Reference Format:

Chaoyu Chen, Hang Yu, Zhichao Lei, Jianguo Li, Shaokang Ren, Tingkai Zhang, Silin Hu, Jianchao Wang, and Wenhui Shi. 2023. BALANCE: Bayesian Linear Attribution for Root Cause Localization. Proc. ACM Manag. Data 1, 1, Article 95 (May 2023), 26 pages. https://doi.org/10.1145/3588949

1 INTRODUCTION

System faults and incidents can have a tremendous influence on distributed data systems, which are widely adopted in modern information technology (IT) and financial companies, since they may lead to system outages, incur astounding financial losses, and jeopardize customer trust [ ]. Forbes has reported that IT downtime costs an estimated $26.5 billion in lost revenue every year, not to mention indirect expenses such as lost customers and references. Thus, it is imperative to conduct fast and precise fault diagnosis and recovery before faults become service-impacting. A central task in fault diagnosis and recovery is root cause analysis (RCA), which bridges the gap between fault detection and recovery [11, 13].

Currently, the task of RCA is mainly accomplished by site reliability engineers (SREs) with rich operation experience. Unfortunately, such manual work becomes prohibitively slow as the scale and complexity of the architecture grow and as system metrics and events behave dynamically and unpredictably, so it fails to meet the requirement of efficiency. Indeed, as mentioned in [ ], it can take as long as several hours of manual work to diagnose the root causes of intermittent slow queries in distributed database systems. This has sparked considerable research efforts toward designing automated RCA algorithms based on machine learning so as to save time and, ultimately, money.

Literature on RCA algorithms can be broadly divided into two categories. The first one focuses on multidimensional root cause localization [ ], which seeks to explain the abnormal behavior of the additive key performance indicators (KPIs) by identifying the fault-indicating combinations of their corresponding multi-dimensional attributes. The success of these algorithms relies on two assumptions: 1) the value of the KPI in each dimension equals the sum of the values of its attributes and 2) all the KPIs and their attributes can be monitored. However, these two assumptions can be too restrictive in real-world problems, and a more practical setting is to attribute the anomalies to root cause candidates without additive assumptions while allowing for missing data. The second category revolves around graph-based RCA algorithms [ ]. These approaches typically first construct a causal graph based on tracing service calls or causal discovery algorithms [ ] and then find the root cause node via rule-based traversing or random walk. A major impediment to the application of tracing graphs and rule-based traversing is that it is system invasive and typically incurs arduous work on enumerating all traces and rules. As an alternative, causal discovery methods are employed to learn the graph structure as in [ ]. Unfortunately, the causal discovery methods suffer from both high computational and sample complexity [ ], and in consequence, they can be distressingly slow for large graphs and may lead to inaccurate results when the number of observations for all metrics in the graph is small. Moreover, once the graph is obtained, the random walk methods are heuristic and might fail to converge to the root cause when the number of random walks is not sufficiently large.
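To make the additive assumption concrete, the following is a minimal sketch in Python. The dimensions (province, ISP), the leaf values, and the helper `roll_up` are all hypothetical and chosen only for illustration. It shows that a count-type KPI rolls up exactly along each dimension, whereas a derived KPI such as a failure rate does not, which is one way assumption 1) can break in practice.

```python
# Hypothetical per-attribute measurements at a single time step.
# Keys are (province, isp) leaf combinations; all names are illustrative.
requests = {("north", "isp_a"): 120, ("north", "isp_b"): 80,
            ("south", "isp_a"): 200, ("south", "isp_b"): 100}
failures = {("north", "isp_a"): 6, ("north", "isp_b"): 4,
            ("south", "isp_a"): 50, ("south", "isp_b"): 5}


def roll_up(leaves, dim):
    """Aggregate leaf values onto one dimension (0 = province, 1 = isp)."""
    totals = {}
    for key, value in leaves.items():
        totals[key[dim]] = totals.get(key[dim], 0) + value
    return totals


# Additive KPI: the overall value equals the sum along either dimension.
total_requests = sum(requests.values())
by_province = roll_up(requests, 0)
by_isp = roll_up(requests, 1)
additive_ok = (sum(by_province.values()) == total_requests
               and sum(by_isp.values()) == total_requests)

# Derived KPI (failure rate): per-province rates do not sum to the overall rate.
overall_rate = sum(failures.values()) / total_requests
fail_by_province = roll_up(failures, 0)
rate_by_province = {k: fail_by_province[k] / by_province[k] for k in by_province}
rate_is_additive = abs(sum(rate_by_province.values()) - overall_rate) < 1e-9

print(additive_ok, rate_is_additive)
```

In this toy setting the request count satisfies the additive assumption while the failure rate does not, so a method that relies on additivity could only be applied to the former.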

In this paper, we explore alternatives and recast the RCA problem as a feature attribution problem [ ]. To the best of our knowledge, we are among the first to analyze the root cause through the lens of attribution. As a commonly used tool in explainable AI (XAI), attribution methods assign attribution scores to input features, the absolute value of which represents their importance to the model prediction or performance [ ]. Analogously, we aim to find the root causes that can best explain the alarmed KPIs in RCA problems. The attribution scores of the candidate causes represent their relevance or contribution to the alarmed KPIs. As a motivating example, in database systems, “bad SQLs” refers to SQLs with deteriorated performance due to indexing errors or changes in the execution plan. The performance deterioration of these SQLs typically leads to anomalies in the tenant KPIs and may severely influence the user experience. In this case, the targets (𝒚) are the tenant KPIs and the candidate causes (𝑿) are SQL metrics.

An attribution task can then be accomplished in two steps: first, a forward model is constructed that exploits the input features (i.e., candidate causes) to predict the outputs (i.e., alarmed KPIs), and next, the significance of the input features is evaluated through attribution approaches in a backward manner. Particularly in the bad SQL localization example, the number of candidate SQLs 𝑝 varies in each case and can be as large as thousands, whereas the number of observations 𝑛 (the length of the corresponding time series) is typically small since we only focus on the part around the anomalies. In other words, the dimension 𝑝 can be larger than the sample size 𝑛 in the RCA problems. To address this issue and to automate the feature selection process, we adopt sparse linear models as the forward model due to their high flexibility, efficiency, and interpretability. Furthermore, the candidate causes are usually correlated with each other, and there often exist missing values. To tackle these problems that plague linear models, we propose a novel Bayesian multicollinear feature selection (BMFS) model. Afterward, we provide the attribution score for each candidate cause from different perspectives, including sensitivity and salience. Finally, we merge results when there exist multiple alarmed KPIs and each of them is attributed to a different set of root causes. We name the overall model BALANCE (BAyesian Linear AttributioN for root CausE localization).
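The two-step idea (a sparse forward model, then backward attribution) can be illustrated with a minimal sketch. This is not the BMFS model: a plain Lasso solved by cyclic coordinate descent stands in for the forward model, and the attribution score used here, the coefficient times the change of each candidate between a baseline window and an anomaly window, is an assumption made for illustration rather than the paper's definition of sensitivity or salience. The toy data, window boundaries, and the penalty `lam` are likewise hypothetical, and missing values and multicollinearity are not handled.

```python
def center(col):
    """Subtract the mean so that no intercept is needed."""
    m = sum(col) / len(col)
    return [v - m for v in col]


def soft_threshold(rho, lam):
    """Proximal operator of the L1 penalty."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(columns, y, lam, sweeps=100):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 on centered data
    by cyclic coordinate descent (stand-in forward model, not BMFS)."""
    n = len(y)
    cols = [center(c) for c in columns]
    resid = center(y)                      # residual for b = 0
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, xj in enumerate(cols):
            norm = sum(v * v for v in xj) / n
            if norm == 0.0:
                continue                   # constant candidate, no signal
            # Correlation of x_j with the residual that excludes x_j itself.
            rho = sum(xv * (r + beta[j] * xv) for xv, r in zip(xj, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * xv for r, xv in zip(resid, xj)]
                beta[j] = new
    return beta


def mean(values):
    return sum(values) / len(values)


# Toy data: n = 8 time points, p = 10 candidates (so p > n).
n, p = 8, 10
X_cols = [[1.0 + 0.1 * ((t * (j + 1)) % 3 - 1) for t in range(n)]
          for j in range(p)]
X_cols[0] = [1, 1, 1, 1, 1, 5, 6, 5]       # candidate 0 spikes at the anomaly
y = [10 + 2 * v for v in X_cols[0]]        # alarmed KPI driven by candidate 0

# Forward step: fit the sparse linear model.
beta = lasso_cd(X_cols, y, lam=0.1)

# Backward step: coefficient times change between the two windows.
baseline, anomaly = slice(0, 5), slice(5, 8)
scores = [b * (mean(c[anomaly]) - mean(c[baseline]))
          for b, c in zip(beta, X_cols)]
ranking = sorted(range(p), key=lambda j: abs(scores[j]), reverse=True)
print(ranking[0], round(scores[ranking[0]], 3))
```

For a linear forward model, a score of this form equals the candidate's contribution to the change in the predicted KPI between the two windows, which is why it is a convenient illustration. In this toy case the penalty drives the coefficients of the nine uninformative candidates to zero, so only candidate 0 receives a non-zero score.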

We would like to point out that both the multidimensional RCA and the graph-based RCA can be formulated from the perspective of attribution. Specifically, we can regard the multidimensional RCA as attributing the anomalies in the KPIs to the combinations of their attributes. It follows that the additive constraints in the multidimensional RCA can be removed, and hence, we only need to consider the abnormal attributes under this scenario. On the other hand, by regarding all abnormal nodes in the graph as candidate causes, BALANCE can be used to identify the root cause efficiently even though the graph structure is not available or cannot be reliably learned, which is often the case in practice. Viewed another way, BALANCE can also be used as a building block to construct causal graphs, since linear regression models are frequently used for causal discovery [ ]. Given the graph, BALANCE serves as a better substitute for random walks, as it does not require a large number of random walks and so is more efficient.

We validate the usefulness of BALANCE on four datasets. First, we generate synthetic data with different numbers of input features, different levels of multicollinearity, noise, and sparsity, and different proportions of missing values, and then compare various forward models including the proposed BMFS, Lasso, E-Net (Elastic net), and ARD (Automatic Relevance Determination). We find that BMFS typically recovers the underlying true regression coefficients the best with comparable or even shorter running time, especially when there exists multicollinearity among the input features. Furthermore, we utilize the proposed method to address three real-world RCA problems. The first is bad SQL localization, as mentioned before. Our results show that the proposed method can identify the human-labeled root cause SQLs in fewer than 2 seconds per case with accuracy as high as 83.3%, whereas it takes 3 minutes for SREs on average. The second application copes with the problem of container fault localization, whose objective is to attribute the abnormal trace failures in a container to the metrics of the container, such as CPU usage, memory usage, TCP, etc., and to facilitate the self-healing process. The proposed method can achieve an 𝐹-score of 0.86, which is at least 20% higher than other baseline methods. Finally, we apply BALANCE to a public dataset, Exathlon [ ], for the purpose of fault type diagnosis, and
