
• We prototype UCAD. In particular, UCAD includes a preprocessing module that allows Trans-DAS to obtain normal semantic information by removing noisy data, and an anomaly detection module that implements an instance of Trans-DAS to detect abnormal operations via contextual intent comparison.
• We perform extensive evaluations of UCAD using two real-world data traces in two typical data access scenarios. The experimental results show that it achieves F1-scores of 0.89693 and 0.98168 in the two scenarios, respectively, significantly outperforming the baselines. The results also demonstrate that Trans-DAS is not sensitive to different settings and is robust to abnormal training data. Furthermore, evaluations on three public datasets demonstrate the transferability of UCAD to other tasks³.
2 THREAT MODEL
In this paper, we consider abnormal data access operations that are able to breach the peripheral protections of database systems, for example through stolen legitimate credentials [80] or misoperations. In particular, these data access anomalies can occur for the following reasons.
Privilege Abuse. Authorized users intentionally abuse their privileges to perform abnormal operations [52, 61, 64], e.g., for personal financial gain. For example, an attacker can issue additional query operations to retrieve confidential data in violation of normal business rules, and can even delete data to sabotage the database system [52].
Credential Stealing. An attacker steals the credentials of legitimate users to access the database and then stealthily performs abnormal data access operations [42, 51, 80]. Generally, abnormal operations are deeply hidden among numerous normal daily activities [42], e.g., an abnormal delete operation hidden in a session to remove confidential and sensitive data.
Misoperations. Inexperienced staff may perform misoperations accidentally, resulting in data chaos such as data leakage. Compared with normal data access operations, these operations are not logically consistent and are therefore considered abnormal [52].
Note that, in this paper, we assume that the operations of each user are correctly recorded in the database system log. Attackers are unable to corrupt the integrity of the system log, which is our trusted computing base (cf. [21, 24, 47, 54, 70, 81]). For instance, they cannot execute abnormal SQL statements without generating a log entry in the database system. A sophisticated attacker may tamper with or even remove their operation logs by exploiting vulnerabilities of the database system; this is beyond the scope of this paper and can be addressed by existing memory space protection techniques [15, 16].
3 OVERVIEW OF UCAD
Figure 2 shows an overview of our unsupervised anomaly detection system UCAD. Architecturally, UCAD consists of a preprocessing module and an anomaly detection module. The preprocessing module tokenizes the raw data access operations in the system log and removes the noisy data so that the anomaly detection module can learn the semantic information accurately. The anomaly detection module implements an instance of Trans-DAS, which learns the semantic information for contextual intent comparison.
3 The source code is released at https://github.com/UCAD3/core.
In general, UCAD has two working stages: Offline Training and Online Detection. In the training stage, the preprocessing module builds the vocabulary for data access operation tokenization and removes noisy session data according to the tokenized keys in each session, so that we obtain a purified training set containing normal user sessions⁴. The anomaly detection module trains Trans-DAS on the purified dataset to learn the normal semantic information. Note that this training procedure can be conducted periodically or triggered manually so that Trans-DAS captures the latest normal semantic information. In the detection stage, the preprocessing module uses the learned vocabulary to tokenize each active user session and directly filters out known attack patterns. The anomaly detection module uses the trained Trans-DAS model to evaluate whether the semantics of the current operation in the active user session match the overall intent of the sequence of operations preceding it (which we refer to as the contextual intent of the operation). Since Trans-DAS learns the normal semantic information, it can capture anomalies that do not exhibit normal semantic information, i.e., a mismatched operation is labeled as an anomaly. The detected abnormal operations may subsequently be sent to a domain expert for further investigation and action (such as banning the user or even re-qualifying the system software).
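The preprocessing steps of the two stages can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the UCAD source code: the literal-masking rule in `to_key`, the reserved `UNK_ID`, the length-based noise rule in `purify`, and the `blocked_keys` set standing in for known attack patterns are all hypothetical. The Trans-DAS model itself is not modeled here.

```python
import re
from typing import Dict, List, Set

UNK_ID = 0  # hypothetical id for operations never seen during training


def to_key(sql: str) -> str:
    """Hypothetical normalisation: mask literals so similar statements share a key."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # mask string literals
    s = re.sub(r"\b\d+\b", "?", s)   # mask standalone numeric literals
    return re.sub(r"\s+", " ", s)    # collapse whitespace


def build_vocab(sessions: List[List[str]]) -> Dict[str, int]:
    """Offline stage: assign an integer id to every distinct operation key."""
    vocab: Dict[str, int] = {}
    for session in sessions:
        for sql in session:
            vocab.setdefault(to_key(sql), len(vocab) + 1)
    return vocab


def tokenize(session: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map a session to token ids; unseen operations become UNK_ID."""
    return [vocab.get(to_key(sql), UNK_ID) for sql in session]


def purify(sessions: List[List[str]], vocab: Dict[str, int],
           min_len: int = 2) -> List[List[int]]:
    """Placeholder noise rule (an assumption): drop sessions that are too short."""
    tokenized = [tokenize(s, vocab) for s in sessions]
    return [t for t in tokenized if len(t) >= min_len]


def matches_known_attack(session: List[str], blocked_keys: Set[str]) -> bool:
    """Online stage: flag a session containing any known attack pattern."""
    return any(to_key(sql) in blocked_keys for sql in session)


train = [
    ["SELECT * FROM t WHERE id = 42", "UPDATE t SET name = 'bob' WHERE id = 42"],
    ["SELECT * FROM t WHERE id = 7"],
]
vocab = build_vocab(train)
purified = purify(train, vocab)            # training set for the detector
active = tokenize(["DROP TABLE t"], vocab)  # an active session at detection time
```

In this sketch, the purified token sequences would be the input for training the detection model, and an active session is tokenized with the same vocabulary before being checked against known patterns and scored.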
4 THE TRANS-DAS MODEL
In this section, we describe our transformer model Trans-DAS, which is used in UCAD for anomaly detection.
4.1 Basic Idea
To capture the semantic information of operations, we develop a
new transformer model called Transformer for Data Access Seman-
tics (Trans-DAS). The goal of Trans-DAS is to capture the relevance
between one operation and its bidirectional operation contexts so
that Trans-DAS can learn the exact semantics of individual opera-
tions and the contextual intent.
Note that traditional models widely used in semantics extraction, namely the Long Short-Term Memory (LSTM) network and traditional Transformer models, are ill-suited for learning the semantic information from data access operations. Specifically, although LSTM networks have been a standard solution for learning semantic patterns from sequential data, they process data based on the item order in the sequence. Such processing implicitly makes an LSTM rely heavily on order information (i.e., order dependence) to learn semantics, which is not applicable in database systems since heterogeneous user access patterns exhibit diverse operation sequences.
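The order dependence described above can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: the recurrence is a plain scalar tanh cell rather than an LSTM, the embedding values in `EMB` are made up, and the mean-pooled summary is only a contrast case, not the Trans-DAS design.

```python
import math

# Toy 1-d embeddings for three operation types (made-up values).
EMB = {"select": 0.3, "update": -0.8, "delete": 1.1}


def recurrent_summary(ops):
    """Order-dependent summary: each step folds the previous state into the next."""
    h = 0.0
    for op in ops:
        h = math.tanh(0.5 * h + EMB[op])
    return h


def order_free_summary(ops):
    """Order-independent summary: the mean of the operation embeddings."""
    return sum(EMB[op] for op in ops) / len(ops)


# Two sessions with the same operations issued in a different order.
a = ["select", "update", "delete"]
b = ["delete", "select", "update"]

print(recurrent_summary(a), recurrent_summary(b))    # the two values differ
print(order_free_summary(a), order_free_summary(b))  # equal up to rounding
```

Two users who issue the same operations in a different order thus receive different recurrent summaries, which is the effect that makes order-dependent models a poor fit for heterogeneous access patterns.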
Moreover, the traditional transformer model [79], consisting of an encoder and a decoder, learns the semantic information by using the attention mechanism, which captures the relevance between each pair of items. However, it also embeds position encodings (i.e., order information) into the semantics of items. When facing the challenge of heterogeneous access patterns, such order information may prevent us from capturing the semantic information accurately. Furthermore, the encoder chooses a fully-connected attention design without masking (i.e., connecting an item with
4 A user session refers to the sequence of data access operations executed by a specific user during a single database access.