Unsupervised Contextual Anomaly Detection
for Database Systems
Sainan Li¹, Qilei Yin¹, Guoliang Li¹, Qi Li¹, Zhuotao Liu¹, Jinwei Zhu²
¹Tsinghua University and BNRist
²Huawei
ABSTRACT
Abnormal data access operations occur frequently in database systems, typically as a result of misoperations or attacks, even though these systems enforce strict access control policies. However, prior work focuses only on detecting abnormal data accesses by using known attack patterns or by identifying behaviors that deviate significantly from normal behaviors. Such methods cannot capture stealthy abnormal data access operations that are similar to normal ones. In this paper, we propose a novel unsupervised anomaly detection system, UCAD, which aims to detect abnormal data access operations by comparing an operation's semantics with its contextual intent. However, it is non-trivial to obtain accurate semantics of operations for intent analysis because (i) the same operation may exhibit diverse semantics under different operation contexts and (ii) different operation sequences could have identical semantics due to heterogeneous user access patterns. To address this issue, we develop a new transformer model called Trans-DAS for UCAD. Trans-DAS learns the semantics of individual operations by utilizing the attention mechanism, which analyzes the relevance between any pair of operations in a sequence, and captures the contextual intent of operations inferred from their contexts. Specifically, Trans-DAS utilizes a particular embedding layer to embed the semantics of individual operations without the operation order information and a masking mechanism that allows Trans-DAS to learn the semantics according to the bidirectional contexts. Also, we define a new training objective for Trans-DAS to enlarge the difference among the embedded semantics. Furthermore, in order to effectively utilize Trans-DAS for detection, we develop two modules in UCAD, i.e., a data preprocessing module that allows Trans-DAS to accurately learn the normal semantic information by removing noisy data, and an anomaly detection module that learns the semantic information for intent comparison. We evaluate the performance of UCAD on real-world data traces under different settings (e.g., varied parameters and hybrid datasets). The results demonstrate that UCAD achieves an average F1-score of 0.94 across two scenarios, significantly outperforming the baselines, and shows robustness to hybrid data and good transferability to different tasks.
These authors contributed equally to this work.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA.
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06...$15.00
https://doi.org/10.1145/3514221.3517861
CCS CONCEPTS
• Information systems → Database administration; • Computing methodologies → Anomaly detection.
KEYWORDS
Anomaly Detection; Database Management; Attention Mechanism
ACM Reference Format:
Sainan Li, Qilei Yin, Guoliang Li, Qi Li, Zhuotao Liu, & Jinwei Zhu. 2022.
Unsupervised Contextual Anomaly Detection for Database Systems. In
Proceedings of the 2022 International Conference on Management of Data
(SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, NY, NY, USA,
15 pages. https://doi.org/10.1145/3514221.3517861
1 INTRODUCTION
Modern database systems store huge amounts of proprietary data for numerous applications. Thus, it is critical to ensure system confidentiality and integrity. Although these systems are typically protected by various peripheral defenses (e.g., by enforcing access control policies or enabling intrusion detection), abnormal data access operations¹ are still possible due to accidental misoperations or deliberate attacks by sophisticated attackers. These abnormal data accesses expose database systems to a diverse range of security threats and may result in catastrophic consequences, such as data tampering and data breaches. For example, it has been reported that the economic loss caused by data breaches in the United States alone was as high as 8.64 million dollars in 2020, and the cumulative attack duration was around 280 days [3].
A series of anomaly detection methods specific to database systems have been developed to address this issue. They can be roughly classified into four categories: syntax-based methods [14, 39, 40, 64] detect anomalies by pattern matching or machine learning algorithms according to the syntax of SQL statements; context-based methods [7, 25, 42, 83] detect abnormal behaviors that deviate from normal patterns by utilizing system and user information generated during data access; data-centric methods [43, 51] perform detection by leveraging the statistics of queried data that result from obvious data changes; and hybrid methods [52, 65, 66, 71] combine the design primitives of different methods.
¹ A data access operation in a database system refers to an individual SQL statement, and we use “operation” and “statement” interchangeably in this paper.
Figure 1: A data access example containing one abnormal delete operation (in red). Traditional methods generate indistinguishable features for both normal and abnormal delete operations. Yet we observe that the semantics of the abnormal operation deviates from the contextual intent inferred from its preceding operations.
However, these traditional methods focus mostly on detecting known attacks or anomalous data accesses [42, 51, 52, 61, 64, 80] whose behaviors deviate significantly from normal patterns. When these anomalous data accesses are stealthier, e.g., the attacker only launches a small number of anomalous database operations intermittently and changes confidential data slightly, the traditional methods become less effective. Yet these stealthy anomalous data accesses are common in real-world production systems [1, 2, 4, 5]. As shown in Figure 1 (a) and (b), we use a real-world example from our production system to illustrate the limitations of these traditional methods. An attacker uses a user’s legitimate credential (user1) and address (IP1) to remotely access the database and performs several ordinary operations for table updating during the period time1. Meanwhile, the attacker stealthily deletes (in red) important data from table t_rm_mac after a normal delete (in blue) operation on table t_rm_mac. The abnormal operation cannot be captured by these methods since they generate indistinguishable feature vectors for both normal and abnormal delete operations. For example, the syntax-based methods extract the same feature vector from their syntax, i.e., the same command type on the same table and column. The context-based approaches cannot find their difference as the user attributes (user account, access time, and client address) are not changed. The data-centric methods find that both delete operations remove only one row of data and that the statistical features, i.e., the min and max values of the targeted column, are the same.
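The indistinguishability described above can be illustrated with a minimal sketch. It assumes a hypothetical syntax-only feature extractor, `syntax_features`, that keeps the command type, the target table, and the columns named in the WHERE clause; the column name `id` and the literal values are invented for illustration and do not come from the production trace.

```python
import re

def syntax_features(sql: str) -> tuple:
    """Hypothetical syntax-only feature: (command, table, WHERE columns)."""
    s = sql.strip().rstrip(";")
    command = s.split()[0].upper()
    # The first identifier after FROM / INTO / UPDATE is taken as the table.
    table_match = re.search(r"\b(?:from|into|update)\s+(\w+)", s, flags=re.IGNORECASE)
    table = table_match.group(1).lower() if table_match else ""
    # Columns are the identifiers that appear on the left of '=' in WHERE.
    where_match = re.search(r"\bwhere\b(.*)", s, flags=re.IGNORECASE)
    columns = ()
    if where_match:
        columns = tuple(sorted(set(re.findall(r"(\w+)\s*=", where_match.group(1)))))
    return (command, table, columns)

# Illustrative statements only: same table, same column, different rows.
normal_delete = "DELETE FROM t_rm_mac WHERE id = 101"
abnormal_delete = "DELETE FROM t_rm_mac WHERE id = 205"

# Both statements collapse to the same feature, so a detector that sees
# only this feature cannot tell them apart.
print(syntax_features(normal_delete) == syntax_features(abnormal_delete))
```

Under these assumptions both statements map to one feature tuple, which is why extra signal, such as the surrounding operations, is needed to separate them.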
To address the above problem, we propose a novel unsupervised anomaly detection system, Unsupervised Contextual Anomaly Detection (UCAD). The key observation underpinning the design is that abnormal operations can be identified by comparing the semantics of an individual operation, considered alone, with the operation’s contextual intent, obtained by learning the semantics of the sequence of operations preceding² the operation in question. Consider the intuitive example illustrated in Figure 1 (c). When analyzing the first, normal delete, the preceding insert and select operations reveal that the intent is an ongoing table update, so the next operation is likely to delete a tuple invalidated by the newly inserted data. For the second, abnormal delete, the preceding insert, select, and first delete operations indicate that the intent is a completed table update, which means that the next operation should be a select (i.e., to start a data query task) or an insert (i.e., to start another table updating task). Yet, in this example, the attacker performs another delete (in red) in an attempt to stealthily sabotage other data, which deviates from the intent of the sequence preceding this operation.
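The comparison between an operation and its contextual intent can be mimicked with a deliberately simple stand-in. The sketch below is not Trans-DAS: it replaces the learned model with frequency counts over an order-free summary of the preceding operations, and the three training sessions are invented for illustration. The names `context_key`, `fit`, and `deviates_from_intent` are hypothetical.

```python
from collections import Counter

def context_key(ops):
    """Order-free summary of the preceding operations (an assumption of this sketch)."""
    return frozenset(Counter(ops).items())

def fit(normal_sessions):
    """Count which operation follows each context in normal sessions."""
    model = {}
    for session in normal_sessions:
        for i in range(1, len(session)):
            key = context_key(session[:i])
            model.setdefault(key, Counter())[session[i]] += 1
    return model

def deviates_from_intent(model, context, op):
    """Flag an operation never observed after this context in normal data."""
    expected = model.get(context_key(context), Counter())
    return expected[op] == 0

# Invented normal sessions: a table update followed by a query or a new update.
normal_sessions = [
    ["insert", "select", "delete", "select"],
    ["select", "insert", "delete", "insert"],
    ["insert", "select", "delete", "insert"],
]
model = fit(normal_sessions)

# First delete: consistent with an ongoing table update.
print(deviates_from_intent(model, ["insert", "select"], "delete"))
# Second delete: the update is already complete, so another delete is unexpected.
print(deviates_from_intent(model, ["insert", "select", "delete"], "delete"))
```

A count-based table like this flags every operation in an unseen context, which is one reason a learned semantic model is preferable in practice.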
² During the model training process, we can potentially use the operations after the operation in question, i.e., a sequence surrounding the operation in question.
Challenges. Nevertheless, to accurately capture the semantics of data access operations, we need to address three challenges. First, the same operation may exhibit diverse semantics under different operation contexts. For instance, the two identical delete statements in Figure 1 carry distinct semantics because their preceding operation sequences are different. Thus, common semantics extraction approaches (e.g., word embedding [27, 55]) are ill-suited for this task since they can only learn fixed semantic representations from local contexts. Second, different operation sequences could have identical semantics due to heterogeneous user access patterns, which means that the order of operations is insufficient or even misleading for capturing the semantics of a user’s operation sequence (i.e., the intent). However, traditional sequence models like LSTM rely heavily on the order information for semantics extraction. Thus, they are not applicable in database systems with heterogeneous user behaviors. Third, noise is non-negligible in raw data access operation records. In database systems, operations that are irrelevant to the true intent in question are common, e.g., accidental misoperations (not necessarily malicious). These “noisy” data may interfere with the process of extracting the semantics of data access operations.
Contributions. In this paper, we propose a novel unsupervised anomaly detection system, UCAD, to identify abnormal data access operations. To solve the above challenges and obtain accurate semantics of operations, we develop a new transformer model called Trans-DAS for UCAD. Trans-DAS utilizes a particular embedding layer to embed the semantics of individual operations without operation ordering information, and a masking mechanism that allows Trans-DAS to learn the semantics according to the bidirectional contexts. Also, we define a new training objective for Trans-DAS to enlarge the difference among the embedded semantics such that Trans-DAS can easily capture the abnormal operations. Moreover, in order to effectively detect abnormal operations using Trans-DAS, we design a preprocessing module in UCAD to filter the noisy data and known attacks, and utilize a clustering method to balance semantic patterns and remove the sessions that deviate from common user behaviors. Note that, since it only requires normal data access information to learn the semantics of operations, our system works in an unsupervised manner.
In summary, we make the following contributions in this paper.
• We propose a novel unsupervised anomaly detection system, UCAD, to identify stealthy abnormal data access operations in database systems.
• We develop a new transformer model called Trans-DAS for UCAD to capture the semantics and contextual intent of operations.
• We prototype UCAD. In particular, UCAD includes a preprocessing module that allows Trans-DAS to obtain normal semantic information by removing noisy data, and an anomaly detection module that implements an instance of Trans-DAS to detect abnormal operations via contextual intent comparison.
• We perform extensive evaluations of UCAD using two real-world data traces in two typical data access scenarios. The experimental results show that it achieves F1-scores of 0.89693 and 0.98168 in the two scenarios, respectively, significantly outperforming the baselines. Also, the results demonstrate that Trans-DAS is not sensitive to different settings and is robust to abnormal training data. Furthermore, the evaluations on three public datasets demonstrate the transferability of UCAD to other tasks³.
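As a quick consistency check, the two per-scenario F1-scores reported above can be averaged to recover the figure quoted in the abstract. The snippet assumes an unweighted mean over the two scenarios, which the text does not state explicitly.

```python
# Per-scenario F1-scores as reported in the contributions list.
f1_scores = [0.89693, 0.98168]

# Unweighted mean over the two scenarios (an assumption of this check).
mean_f1 = sum(f1_scores) / len(f1_scores)
print(round(mean_f1, 2))
```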
2 THREAT MODEL
In this paper, we consider abnormal data access operations that are able to breach the peripheral protections of database systems, for example through stolen legitimate credentials [80] or misoperations. In particular, these data access anomalies can occur for the following reasons.
Privilege Abuse. Authorized users abuse their privileges to perform abnormal operations intentionally [52, 61, 64], e.g., for personal financial gain. For example, an attacker can perform additional query operations to retrieve confidential data in violation of normal business rules, or even delete data to sabotage the database system [52].
Credential Stealing. An attacker steals the credentials of legitimate users to access the database and then stealthily performs abnormal data access operations [42, 51, 80]. Generally, abnormal operations are deeply hidden among numerous normal daily activities [42], e.g., an abnormal delete operation hidden in a session to remove confidential and sensitive data.
Misoperations. Inexperienced staff may perform misoperations accidentally, resulting in data chaos such as data leakage. Compared with normal data access operations, these operations are not logically consistent and are therefore considered abnormal [52].
Note that, in this paper, we assume that the operations of each user are correctly recorded in the database system log. Attackers are unable to corrupt the integrity of the system log, which is our trusted computing base (cf. [21, 24, 47, 54, 70, 81]). For instance, they cannot execute abnormal SQL statements without generating any log in the database system. A sophisticated attacker may tamper with or even remove their operation log by exploiting vulnerabilities of the database system; this is beyond the scope of this paper and can be addressed by existing memory space protection techniques [15, 16].
3 OVERVIEW OF UCAD
Figure 2 shows an overview of our unsupervised anomaly detection system, UCAD. Architecturally, UCAD consists of a preprocessing module and an anomaly detection module. The preprocessing module tokenizes the raw data access operations in the system log and removes noisy data so that the anomaly detection module can learn the semantic information accurately. The anomaly detection module implements an instance of Trans-DAS, which learns the semantic information for contextual intent comparison.
³ The source code is released at https://github.com/UCAD3/core.
In general, UCAD has two working stages: Offline Training and Online Detection. In the training stage, the preprocessing module builds the vocabulary for data access operation tokenization, and removes noisy session data according to the tokenized keys in each session so that we obtain a purified training set containing normal user sessions⁴. The anomaly detection module trains Trans-DAS on the purified dataset to learn the normal semantic information. Note that this training procedure can be conducted periodically or triggered manually so that Trans-DAS captures the latest normal semantic information. In the detection stage, the preprocessing module utilizes the learned vocabulary to tokenize each active user session and directly filters out known attack patterns. The anomaly detection module utilizes the trained Trans-DAS model to evaluate whether the semantics of the current operation in the active user session matches the overall intent of the sequence of operations preceding it (which we refer to as the contextual intent of the operation). Since Trans-DAS learns only normal semantic information, it can capture anomalies whose semantics do not match that information, i.e., a mismatched operation is labeled as an anomaly. The detected abnormal operations may subsequently be sent to a domain expert for further investigation and action (such as banning the user or even re-qualifying the system software).
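The preprocessing steps in the two stages can be sketched as follows. This is a minimal illustration, not the released implementation: the literal-masking rule in `statement_template`, the reserved key `UNK`, and the single regular expression in `KNOWN_ATTACK_PATTERNS` are all assumptions made for the example, as are the SQL statements themselves.

```python
import re

UNK = 0  # key reserved for statements never seen during training (assumption)

def statement_template(sql):
    """Abstract a statement by masking literals with '?' (hypothetical rule)."""
    s = sql.strip().rstrip(";").lower()
    s = re.sub(r"'[^']*'", "?", s)            # string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)    # numeric literals
    return re.sub(r"\s+", " ", s)

def build_vocabulary(sessions):
    """Offline training: assign one integer key per distinct template."""
    vocab = {}
    for session in sessions:
        for sql in session:
            vocab.setdefault(statement_template(sql), len(vocab) + 1)
    return vocab

def tokenize(session, vocab):
    """Online detection: map each statement of an active session to its key."""
    return [vocab.get(statement_template(sql), UNK) for sql in session]

# Illustrative rule only; a real deployment would maintain its own pattern list.
KNOWN_ATTACK_PATTERNS = [re.compile(r"\bor\s+1\s*=\s*1\b", re.IGNORECASE)]

def matches_known_attack(sql):
    return any(p.search(sql) for p in KNOWN_ATTACK_PATTERNS)

training_sessions = [[
    "INSERT INTO t_rm_mac VALUES (1, 'a')",
    "SELECT * FROM t_rm_mac WHERE id = 1",
    "DELETE FROM t_rm_mac WHERE id = 1",
]]
vocab = build_vocabulary(training_sessions)

active_session = [
    "SELECT * FROM t_rm_mac WHERE id = 7",
    "DELETE FROM t_rm_mac WHERE id = 9",
    "DROP TABLE t_rm_mac",
]
print(tokenize(active_session, vocab))
print(matches_known_attack("SELECT * FROM t_rm_mac WHERE id = 1 OR 1=1"))
```

In this sketch the tokenized keys would then be passed to the trained model for intent comparison, while statements matching a known pattern are filtered before that step.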
4 THE TRANS-DAS MODEL
In this section, we describe Trans-DAS, the transformer model used in UCAD for anomaly detection.
4.1 Basic Idea
To capture the semantic information of operations, we develop a
new transformer model called Transformer for Data Access Seman-
tics (Trans-DAS). The goal of Trans-DAS is to capture the relevance
between one operation and its bidirectional operation contexts so
that Trans-DAS can learn the exact semantics of individual opera-
tions and the contextual intent.
Note that traditional models widely used in semantics extraction, i.e., the Long Short-Term Memory (LSTM) network and traditional Transformer models, are ill-suited for learning the semantic information from data access operations. Specifically, although the LSTM network has been a standard solution for learning semantic patterns from sequential data, it processes data based on the item order in the sequence. Such processing implicitly makes LSTM rely heavily on the order information (i.e., the order dependence) to learn semantics, which is not applicable in database systems since heterogeneous user access patterns exhibit diverse operation sequences.
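The order dependence discussed above can be seen in a toy comparison. The sketch below uses invented two-dimensional embeddings and two hypothetical summaries: `order_free_summary` averages the embeddings, while `recurrent_summary` applies a simple tanh recurrence as a stand-in for a recurrent model. Neither function is an LSTM or the Trans-DAS embedding layer; they only show how the same operations in a different order affect each kind of summary.

```python
import math

# Toy embeddings, invented for illustration.
EMBED = {"insert": (1.0, 0.0), "select": (0.0, 1.0), "delete": (1.0, 1.0)}

def order_free_summary(ops):
    """Mean of the embeddings: unchanged when the operations are reordered."""
    n = len(ops)
    return tuple(sum(EMBED[o][d] for o in ops) / n for d in range(2))

def recurrent_summary(ops, decay=0.5):
    """Simple recurrence h <- tanh(decay * h + x): depends on the order."""
    h = (0.0, 0.0)
    for o in ops:
        h = tuple(math.tanh(decay * h[d] + EMBED[o][d]) for d in range(2))
    return h

# Two users performing the same table update with a different operation order.
session_a = ["insert", "select", "delete"]
session_b = ["select", "insert", "delete"]

print(order_free_summary(session_a) == order_free_summary(session_b))
print(recurrent_summary(session_a) == recurrent_summary(session_b))
```

The order-free summary treats the two sessions as the same, whereas the recurrent one separates them even though, in the sense used here, they express the same intent.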
⁴ A user session refers to a sequence of data access operations executed by a specific user during one database access.
Moreover, the traditional transformer model [79], consisting of an encoder and a decoder, learns the semantic information by using the attention mechanism, which captures the relevance between each pair of items. However, it also embeds the position encodings (i.e., order information) into the semantics of items. When facing the challenge of heterogeneous access patterns, such order information may prevent us from capturing the semantic information accurately. Furthermore, the encoder chooses a fully-connected attention design without masking (i.e., connecting an item with