
database modules, which can potentially lead to significant financial losses. Thus, if typical anomalies can be automatically resolved, it will relieve the burden on human DBAs and save resources.
Driven by this motivation, many database products are equipped with semi-automatic diagnosis tools [20, 22, 29, 30, 32]. However, they have several limitations. First, they are built on empirical rules [11, 62] or small-scale ML models (e.g., classifiers [34]), which have poor scenario understanding capability and cannot utilize the diagnosis knowledge. Second, they cannot be flexibly generalized to scenario changes. For empirical methods, it is tedious to manually update and verify rules against the newest versions of documents. And learned methods (e.g., XGBoost [8], KNN [17]) require redesigning the input metrics and labels, and retraining the models, for every new scenario (Figure 1(d)). Third, these methods lack the inference ability of human DBAs, such as recursively exploring system views based on the initial analysis results to infer the root cause.
To this end, we aim to build an intelligent diagnosis system with three main advantages [65]. (1) Precise Diagnosis. First, our system can utilize tools to gather scenario information (e.g., query analysis with flame graphs) or derive optimization advice (e.g., index selection), capabilities that are necessary for real-world diagnosis but hardly supported by traditional methods. Second, it can conduct basic logical reasoning (i.e., making diagnosis plans). (2) Expense and Time Saving. The system can relieve human DBAs from on-call duties to some extent (e.g., resolving typical anomalies that rules cannot cover). (3) High Generalizability. The system exhibits flexibility in analyzing unseen anomalies based on both the given documents (e.g., new metrics, views, logs) and past experience.
Recent advances in Large Language Models (LLMs), which have demonstrated superiority in natural language understanding and programming [42, 43, 64, 67], offer the potential to achieve this goal. However, database diagnosis requires extensive domain-specific skills, and even the GPT-4 model cannot directly master the diagnosis knowledge (lower than 50% accuracy). This poses three challenges.
(C1) How to enhance LLM’s understanding of the diagnosis problem? Despite being pre-trained on extensive corpora, LLMs still struggle to diagnose effectively without proper prompting² (e.g., they are unaware of the database knowledge). The challenges include (i) extracting useful knowledge from long documents (e.g., correlations across chapters); (ii) matching the given context with suitable knowledge (e.g., detecting an alert of high node load); and (iii) retrieving tools that are potentially useful (e.g., database catalogs).

² Prompting adds additional information to the LLM input. Although LLMs can memorize new knowledge with fine-tuning, they may forget previous knowledge or generate inaccurate or mixed-up responses, which is unacceptable in database diagnosis.
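To make steps (ii) and (iii) concrete, below is a minimal sketch of matching an alert context against extracted knowledge chunks with an off-the-shelf embedding model. The model name, chunk texts, and alert string are illustrative assumptions, not D-Bot's actual setup.

```python
# Sketch of context-to-knowledge matching (challenge C1), assuming knowledge
# chunks have already been extracted from documents. Model choice is hypothetical.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_chunks = [
    "High node load with many active processes often indicates workload contention.",
    "A sudden rise in query duration with a low process count suggests a slow query.",
    "Memory exhaustion can cause insert failures and rejected connections.",
]
alert_context = "alert: node load is high, active process count spiking"

# Embed the alert and all chunks, then rank chunks by cosine similarity.
chunk_emb = model.encode(knowledge_chunks, convert_to_tensor=True)
query_emb = model.encode(alert_context, convert_to_tensor=True)
scores = util.cos_sim(query_emb, chunk_emb)[0]
top_k = scores.topk(2)
for score, idx in zip(top_k.values, top_k.indices):
    print(f"{score.item():.3f}  {knowledge_chunks[int(idx)]}")
```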
(C2) How to improve LLM’s diagnosis performance for single-cause anomalies? With the knowledge-and-tool prompt, the LLM needs to reason judiciously about the given anomalies. First, different from many LLM tasks [12], database diagnosis is an interactive procedure that generally requires many rounds of analysis, whereas LLMs suffer from the early-stop problem [13]. Second, LLMs have a “hallucination” problem [46], and it is critical to design strategies that guide the LLM to derive in-depth and reasonable analysis.
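To illustrate the interactive nature of this procedure, here is a generic sketch of a multi-step diagnosis loop with a simple guard against early stopping. The llm and run_tool callables are hypothetical placeholders, not part of D-Bot.

```python
# Generic interactive diagnosis loop (challenge C2): the LLM alternates between
# proposing tool calls and reading observations; a minimum-step guard counters
# the early-stop problem. llm() and run_tool() are assumed helper callables.
def diagnose(anomaly, llm, run_tool, min_steps=3, max_steps=10):
    history = [f"Anomaly: {anomaly}"]
    for step in range(max_steps):
        action = llm("\n".join(history) + "\nNext action (tool call or FINISH):")
        if action.strip() == "FINISH":
            if step < min_steps:  # guard against premature stopping
                history.append("Observation: analysis too shallow, continue.")
                continue
            break
        history.append(f"Action: {action}")
        history.append(f"Observation: {run_tool(action)}")  # e.g., query a system view
    return llm("\n".join(history) + "\nSummarize the root cause:")
```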
(C3) How to enhance LLM’s diagnosis capability for multi-cause anomalies? From our observation, within a time budget, it is hard for a single LLM to accurately analyze complex anomalies (e.g., anomalies with multiple root causes whose critical metrics are at a finer granularity). Therefore, it is vital to design an efficient diagnosis mechanism where multiple LLMs can collaboratively tackle complex database problems (e.g., with cross reviews) and improve both the diagnosis accuracy and efficiency.
To tackle the above challenges, we propose D-Bot, a database diagnosis system using large language models. First, we extract useful knowledge chunks from documents (summary-tree based knowledge extraction) and construct a hierarchy of tools with detailed usage instructions, based on which we initialize the prompt template for LLM diagnosis (see Figure 3). Second, according to the prompt template, we generate a new prompt by matching with the most relevant knowledge (key metric searching) and tools (fine-tuned SentenceBert), which the LLM can utilize to acquire monitoring and optimization results for reasonable diagnosis. Third, we introduce a tree-based search strategy that guides the LLM to reflect over past diagnosis attempts and choose the most promising one, which significantly improves the diagnosis performance. Lastly, for complex anomalies (e.g., with multiple root causes), we propose a collaborative diagnosis mechanism where multiple LLM experts diagnose in an asynchronous style (e.g., sharing analysis results, conducting cross reviews) to resolve the given anomaly.
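As a rough illustration of the tree-based search idea, the sketch below performs a best-first search over diagnosis paths, scoring each attempt with a reflection function. The expand and reflect_score functions stand in for LLM calls; this is an assumption-laden sketch, not D-Bot's exact algorithm.

```python
# Best-first search over diagnosis paths: candidate analysis steps are expanded,
# scored by reflection, and the highest-scored path is kept. expand() and
# reflect_score() are hypothetical stand-ins for LLM calls.
import heapq

def tree_search_diagnose(root, expand, reflect_score, max_expansions=20):
    counter = 0  # tie-breaker so heapq never compares paths directly
    best_score, best_path = reflect_score([root]), [root]
    frontier = [(-best_score, counter, [root])]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, path = heapq.heappop(frontier)
        for step in expand(path):            # LLM proposes follow-up analyses
            new_path = path + [step]
            score = reflect_score(new_path)  # LLM reflects on the attempt
            if score > best_score:
                best_score, best_path = score, new_path
            counter += 1
            heapq.heappush(frontier, (-score, counter, new_path))
        expansions += 1
    return best_path
```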
Contributions. We make the following contributions.
(1) We design an LLM-based database diagnosis framework to achieve precise diagnosis (see Section 3).
(2) We propose a context-aware diagnosis prompting method that empowers the LLM to perform diagnosis by (i) matching with relevant knowledge extracted from documents and (ii) retrieving tools with a fine-tuned embedding model (see Sections 4 and 5).
(3) We propose a root cause analysis method that improves the diagnosis performance using a tree-search-based algorithm that guides the LLM to conduct multi-step analysis (see Section 6).
(4) We propose a collaborative diagnosis mechanism to improve the diagnosis efficiency, which involves multiple LLMs concurrently analyzing issues with their domain knowledge (see Section 7).
(5) Our experimental results demonstrate that D-Bot can accurately identify typical root causes within acceptable time (see Section 8).
2 PRELIMINARIES
2.1 Database Performance Anomalies
Database Performance Anomalies refer to the irregular or unexpected issues that prevent the database from meeting user performance expectations [35, 45], such as excessively high response time. Figure 2 shows four typical database performance anomalies³.
(1) Slow Query Execution. The database experiences a longer response time than expected. For example, a slow query causes a significant increase in CPU usage (system load) and query duration time, but the number of active processes remains low.
(2) Full Resource Usage. Some system resource is exhausted, preventing the database from accepting new requests or even causing errors (e.g., insert failures from running out of memory). For example, a high-concurrency workload can not only cause heavy CPU and memory usage, but also significantly increase the number of active processes.
³ Anomalies on the application/network sides and non-maintenance issues like database kernel debugging and instance deployment fall outside the scope of this work.