BALANCE: Bayesian Linear Attribution for Root Cause Localization 95:3
causes that can best explain the alarmed KPIs in RCA problems. The attribution scores of the
candidate causes represent their relevance or contribution to the alarmed KPIs. As a motivating
example, in database systems, “bad SQLs” refers to SQL statements whose performance has deteriorated due
to indexing errors or changes in the execution plan. The performance deterioration of these SQLs
typically leads to anomalies in the tenant KPIs and may severely influence the user experience. In
this case, the targets (𝒚) are the tenant KPIs and the candidate causes (𝑿) are the SQL metrics.
An attribution task can then be accomplished in two steps: first, a forward model is constructed
that exploits the input features (i.e., candidate causes) to predict the outputs (i.e., alarmed KPIs),
and next, the significance of the input features is evaluated through attribution approaches in
a backward manner. Particularly in the bad SQL localization example, the number of candidate
SQLs 𝑝 varies in each case and can be as large as thousands, whereas the number of observations 𝑛
(the length of the corresponding time series) is typically small since we only focus on the part
around the anomalies. In other words, the dimension 𝑝 can be larger than the sample size 𝑛 in the
RCA problems. To address this issue and to automate the feature selection process, we adopt sparse
linear models as the forward model due to their high flexibility, efficiency, and interpretability.
Furthermore, the candidate causes are usually correlated with each other, and there often exist
missing values. To tackle these problems that plague linear models, we propose a novel Bayesian
multicollinear feature selection (BMFS) model. Afterward, we provide the attribution score for each
candidate cause from different perspectives, including sensitivity and salience. Finally, we merge
results when there exist multiple alarmed KPIs and each of them is attributed to a different set of
root causes. We name the overall model BALANCE (BAyesian Linear AttributioN for root CausE
localization).
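The two-step procedure above (fit a sparse forward model, then score the inputs backward) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the BMFS model: a plain Lasso fitted by coordinate descent stands in for the forward model, the data are synthetic with 𝑝 > 𝑛, and the score |β_j| · std(x_j) is a hypothetical sensitivity-style attribution chosen for illustration. All function and variable names (`lasso_cd`, `attribution_scores`, `TRUE_CAUSE`) are assumptions of this sketch.

```python
import random

def soft_threshold(value, lam):
    """Shrink value toward zero by lam (proximal operator of the L1 penalty)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    No intercept is fitted; the synthetic data below are roughly zero-mean.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

def attribution_scores(X, beta):
    """Hypothetical score: |beta_j| * std(x_j), normalized to sum to 1."""
    n, p = len(X), len(X[0])
    raw = []
    for j in range(p):
        mean = sum(X[i][j] for i in range(n)) / n
        std = (sum((X[i][j] - mean) ** 2 for i in range(n)) / n) ** 0.5
        raw.append(abs(beta[j]) * std)
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else raw

# Synthetic case with more candidate causes (p = 60) than observations (n = 40).
random.seed(0)
N_OBS, N_CAUSES, TRUE_CAUSE = 40, 60, 7
X = [[random.gauss(0, 1) for _ in range(N_CAUSES)] for _ in range(N_OBS)]
y = [3.0 * row[TRUE_CAUSE] + 0.1 * random.gauss(0, 1) for row in X]

beta = lasso_cd(X, y, lam=0.3)            # step 1: sparse forward model
scores = attribution_scores(X, beta)      # step 2: backward attribution
ranked = sorted(range(N_CAUSES), key=lambda j: scores[j], reverse=True)
print("top candidate:", ranked[0],
      "nonzero coefficients:", sum(b != 0.0 for b in beta))
```

The L1 penalty sets most coefficients exactly to zero, which is what makes the forward model usable when 𝑝 exceeds 𝑛. The sketch does not handle correlated candidate causes or missing values, which are the cases BMFS is proposed to address.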
We would like to point out that both the multidimensional RCA and the graph-based RCA can
be formulated from the perspective of attribution. Specifically, we can regard the multidimensional
RCA as attributing the anomalies in the KPIs to the combinations of their attributes. It follows that
the additive constraints in the multidimensional RCA can be removed, and hence, we only need to
consider the abnormal attributes under this scenario. On the other hand, by regarding all abnormal
nodes in the graph as candidate causes, BALANCE can be used to identify the root cause efficiently
even though the graph structure is not available or cannot be reliably learned, which is often the
case in practice. Viewed another way, BALANCE can also be used as a building block to construct
causal graphs, since linear regression models are frequently used for causal discovery [49]. Given
the graph, BALANCE serves as a more efficient substitute for random-walk methods, as it does not
require a large number of random walks.
We validate the usefulness of BALANCE on four datasets. First, we generate synthetic data with
different numbers of input features, different levels of multicollinearity, noise, and sparsity, and
different proportions of missing values, and then compare various forward models including the
proposed BMFS, Lasso, E-Net (Elastic net), and ARD (Automatic Relevance Determination). We find
that BMFS typically recovers the underlying true regression coefficients most accurately, with comparable or
even shorter running time, especially when there exists multicollinearity among the input features.
Furthermore, we utilize the proposed method to address three real-world RCA problems. In the
first problem, we address the bad SQL localization task mentioned above. Our results
show that the proposed method can identify the human-labeled root cause SQLs in fewer than 2
seconds per case with accuracy as high as 83.3%, whereas it takes SREs 3 minutes on average.
The second application addresses container fault localization, whose objective is
to attribute the abnormal trace failures in a container to the metrics of the container, such as CPU
usage, memory usage, TCP, etc., and to facilitate the self-healing process. The proposed method can
achieve an 𝐹₁-score of 0.86, which is at least 20% higher than the baseline methods. Finally, we
apply BALANCE to a public dataset, Exathlon [12], for the purpose of fault type diagnosis, and
Proc. ACM Manag. Data, Vol. 1, No. 1, Article 95. Publication date: May 2023.