
Unsupervised Contextual Anomaly Detection for Database Systems

Sainan Li (1,†), Qilei Yin (1,†), Guoliang Li, Qi Li, Zhuotao Liu, Jinwei Zhu

Tsinghua University and BNRist; Huawei

ABSTRACT

Abnormal data access operations, typically caused by misoperations or attacks, occur regularly in database systems even when these systems enforce strict access control policies. However, prior work focuses only on detecting abnormal data accesses by utilizing known attack patterns or by identifying behaviors that deviate significantly from normal behaviors. These approaches cannot capture stealthy abnormal data access operations that are similar to normal ones. In this paper, we propose a novel unsupervised anomaly detection system, UCAD, which aims to detect abnormal data access operations by comparing each operation’s semantics with its contextual intent. However, it is non-trivial to obtain accurate semantics of operations for intent analysis because (i) the same operation may exhibit diverse semantics under different operation contexts and (ii) different operation sequences could have identical semantics due to heterogeneous user access patterns. To address this issue, we develop a new transformer model called Trans-DAS for UCAD. Trans-DAS learns the semantics of individual operations by utilizing the attention mechanism, which analyzes the relevance between any pair of operations in a sequence, and captures the contextual intent of operations inferred from their contexts. Specifically, Trans-DAS utilizes a particular embedding layer to embed the semantics of individual operations without the operation order information, and a masking mechanism that allows Trans-DAS to learn the semantics according to the bidirectional contexts. Also, we define a new training objective for Trans-DAS to enlarge the difference among the embedded semantics. Furthermore, in order to effectively utilize Trans-DAS for detection, we develop two modules in UCAD, i.e., a data preprocessing module that allows Trans-DAS to accurately learn the normal semantic information by removing noisy data, and an anomaly detection module that learns the semantic information for intent comparison. We evaluate the performance of UCAD on real-world data traces under different settings (e.g., varied parameters and hybrid datasets). The results demonstrate that UCAD achieves an average F1-score of 0.94 across two scenarios, significantly outperforming the baselines, and shows robustness to hybrid data and good transferability to different tasks.

† These authors contributed equally to this work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA.

ACM ISBN 978-1-4503-9249-5/22/06...$15.00

https://doi.org/10.1145/3514221.3517861

CCS CONCEPTS

• Information systems → Database administration; • Computing methodologies → Anomaly detection.

KEYWORDS

Anomaly Detection; Database Management; Attention Mechanism

ACM Reference Format:

Sainan Li, Qilei Yin, Guoliang Li, Qi Li, Zhuotao Liu, & Jinwei Zhu. 2022. Unsupervised Contextual Anomaly Detection for Database Systems. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, NY, NY, USA, 15 pages. https://doi.org/10.1145/3514221.3517861

1 INTRODUCTION

Modern database systems store huge amounts of proprietary data for numerous applications. Thus, it is critical to ensure system confidentiality and integrity. Although these systems are typically protected by various peripheral defenses (e.g., by enforcing access control policies or enabling intrusion detection), abnormal data access operations are still possible due to accidental misoperations or deliberate attacks by sophisticated attackers. These abnormal data accesses expose database systems to a diverse range of security threats and may result in catastrophic consequences, such as data tampering and data breaches. For example, it has been reported that the economic loss caused by data breaches in the United States alone was as high as 8.64 million dollars in 2020, and the cumulative attack duration was around 280 days [3].

A series of anomaly detection methods specific to database systems have been developed to address this issue. They can be roughly classified into four categories: syntax-based methods [ ] detect anomalies by applying pattern matching or machine learning algorithms to the syntax of SQL statements; context-based methods [ ] detect behaviors that deviate from normal patterns by utilizing system and user information generated during data access; data-centric methods [ ] perform detection by leveraging the statistics of the queried data, which capture obvious data changes; and hybrid methods [ ] combine the design primitives of the different methods.

However, these traditional methods focus mostly on detecting known attacks or anomalous data accesses [ ] whose behaviors deviate significantly from normal patterns. When the anomalous data accesses are more stealthy, e.g., the attacker only launches a small number of anomalous database operations intermittently and changes confidential data slightly, the traditional methods become less effective. Yet these stealthy anomalous data accesses are common in real-world production systems [ ]. As shown in Figure 1 (a) and (b), we use a real-world example from our production system to illustrate the limitations of these traditional methods. An attacker uses a user’s legitimate credential (user1) and address (IP1) to remotely access the database and performs several ordinary table-updating operations during the period time1. Meanwhile, the attacker stealthily deletes (in red) important data from table t_rm_mac right after a normal delete operation (in blue) on the same table. The abnormal operation cannot be captured by these methods since they generate indistinguishable feature vectors for the normal and abnormal delete operations. For example, the syntax-based methods extract the same feature vector from their syntax, i.e., the same command type on the same table and column. The context-based approaches cannot find any difference because the user attributes (user account, access time, and client address) are unchanged. The data-centric methods observe that both delete operations remove only one row of data and that the statistical features, i.e., the min and max values of the targeted column, are the same.

A data access operation in a database system refers to an individual SQL statement, and we use “operation” and “statement” interchangeably in this paper.

Figure 1: A data access example containing one abnormal delete operation (in red). Traditional methods generate indistinguishable features for both normal and abnormal delete operations. Yet we observe that the semantics of the abnormal operation deviates from the contextual intent inferred from its preceding operations.
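The syntax-based limitation described above can be made concrete with a minimal sketch. The two SQL statements, the column name `id`, and the `syntax_features` helper are hypothetical illustrations (the paper does not list the actual statements); the point is only that a feature built from command type, table, and column cannot separate two deletes that differ in the row they target.

```python
import re

def syntax_features(sql: str) -> tuple:
    """Toy syntax-based feature: (command type, target table, filter columns)."""
    text = sql.strip().rstrip(";")
    command = text.split()[0].upper()
    table_match = re.search(r"\b(?:from|into|update)\s+(\w+)", text, re.IGNORECASE)
    table = table_match.group(1) if table_match else None
    where = re.search(r"\bwhere\b(.*)", text, re.IGNORECASE)
    # Column names are taken from "<column> =" patterns in the WHERE clause.
    columns = tuple(sorted(set(re.findall(r"(\w+)\s*=", where.group(1))))) if where else ()
    return (command, table, columns)

# Hypothetical statements: same command, table, and column; only the literal differs.
normal_delete = "DELETE FROM t_rm_mac WHERE id = 17;"
abnormal_delete = "DELETE FROM t_rm_mac WHERE id = 42;"

same = syntax_features(normal_delete) == syntax_features(abnormal_delete)
print(syntax_features(normal_delete), same)
```

Because both statements map to the same tuple, any detector that sees only this feature must treat them identically, which is the gap that contextual intent is meant to close.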

To address the above problem, we propose a novel unsupervised anomaly detection system, Unsupervised Contextual Anomaly Detection (UCAD). The key observation underpinning the design is that abnormal operations can be identified by comparing the semantics of an individual operation, considered alone, with the operation’s contextual intent, obtained by learning the semantics of the sequence of operations preceding the operation in question. Consider the intuitive example illustrated in Figure 1 (c). When analyzing the first (normal) delete, the preceding insert and select operations reveal that the intent is an ongoing table update, so the next operation is likely to delete a tuple invalidated by the newly inserted data. For the second (abnormal) delete, the preceding insert, select, and first delete operations indicate that the intent is a completed table update, which means that the next operation should be a select (i.e., to start a data query task) or an insert (i.e., to start another table update task). Yet, in this example, the attacker performs another delete (in red) in an attempt to stealthily sabotage other data, which deviates from the intent of the sequence preceding this operation.
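The intent comparison in this example can be sketched with a deliberately simple stand-in. The code below is not Trans-DAS: it replaces the learned semantics with a lookup table built from hypothetical normal sessions, mapping the unordered set of preceding operations to the operations that followed them. All session contents and function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def context_key(ops):
    """Order-insensitive summary of the preceding operations."""
    return frozenset(Counter(ops).items())

def learn_expected(sessions):
    """Record which operations follow each context in normal sessions."""
    expected = defaultdict(set)
    for session in sessions:
        for i, op in enumerate(session):
            expected[context_key(session[:i])].add(op)
    return expected

def flag_anomalies(session, expected):
    """Return indices whose operation was never seen after its context."""
    return [
        i for i, op in enumerate(session)
        if op not in expected.get(context_key(session[:i]), set())
    ]

# Hypothetical normal sessions: a table update ends with one delete,
# then the user starts a query (select) or another update (insert).
normal_sessions = [
    ["insert", "select", "delete", "select"],
    ["insert", "select", "delete", "insert"],
    ["select", "insert", "delete", "select"],
]
expected = learn_expected(normal_sessions)

observed = ["insert", "select", "delete", "delete"]  # second delete is the attack
flagged = flag_anomalies(observed, expected)
print(flagged)
```

In this toy setting only the second delete is flagged, because a completed update was always followed by a select or an insert in the normal sessions. A lookup table cannot generalize to unseen contexts, which is one reason a learned model is needed in practice.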

Challenges. Nevertheless, to accurately capture the semantics of data access operations, we need to address three challenges. First, the same operation may exhibit diverse semantics under different operation contexts. For instance, the two identical delete statements in Figure 1 indicate distinct semantics because their preceding operation sequences are different. Thus, the common semantics extraction approaches (e.g., word embedding [ ]) are ill-suited for this task since they can only learn fixed semantic representations from local contexts. Second, different operation sequences could have identical semantics due to heterogeneous user access patterns, which means that the order of operations is insufficient or even misleading for capturing the semantics of a user’s operation sequence (i.e., the intent). However, traditional sequence models like LSTM rely heavily on the order information for semantics extraction. Thus, they are not applicable in database systems with heterogeneous user behaviors. Third, noise is non-negligible in raw data access operation records. In database systems, operations that are irrelevant to the true intent in question are common, e.g., accidental misoperations (not necessarily malicious). These “noisy” data may interfere with the process of extracting the semantics of data access operations.

During the model training process, we can potentially use the operations after the operation in question, i.e., a sequence surrounding the operation in question.
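The second challenge concerns order dependence. As a small illustration of why an attention mechanism without position encodings is a natural fit, the sketch below implements plain scaled dot-product self-attention with identity projections (an assumption made to keep the example short; it is not the Trans-DAS architecture) and checks that each operation receives the same output regardless of where it appears in the sequence. The 2-d embeddings are hypothetical.

```python
import math

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (Q = K = V)."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, vectors))
            for d in range(len(q))
        ])
    return outputs

# Hypothetical 2-d embeddings for three operations.
embeddings = {"insert": [1.0, 0.0], "select": [0.0, 1.0], "delete": [1.0, 1.0]}

order_a = ["insert", "select", "delete"]
order_b = ["select", "insert", "delete"]

out_a = dict(zip(order_a, self_attention([embeddings[op] for op in order_a])))
out_b = dict(zip(order_b, self_attention([embeddings[op] for op in order_b])))

# Each operation's output depends on the set of operations, not their order.
order_invariant = all(
    math.isclose(x, y, abs_tol=1e-12)
    for op in order_a
    for x, y in zip(out_a[op], out_b[op])
)
print(order_invariant)
```

An LSTM fed the same two orderings would generally produce different hidden states, since it consumes items one at a time in sequence order.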

Contributions. In this paper, we propose a novel unsupervised anomaly detection system, UCAD, to identify abnormal data access operations. To solve the above challenges and obtain accurate semantics of operations, we develop a new transformer model called Trans-DAS for UCAD. Trans-DAS utilizes a particular embedding layer to embed the semantics of individual operations without operation ordering information, and a masking mechanism that allows Trans-DAS to learn the semantics according to the bidirectional contexts. Also, we define a new training objective for Trans-DAS to enlarge the difference among the embedded semantics such that Trans-DAS can easily capture the abnormal operations. Moreover, in order to effectively detect abnormal operations using Trans-DAS, we design a preprocessing module in UCAD to filter the noisy data and known attacks, and utilize a clustering method to balance semantic patterns and remove the sessions that deviate from common user behaviors. Note that, since it only requires normal data access information to learn the semantics of operations, our system works in an unsupervised manner.

In summary, we make the following contributions in this paper.

• We propose a novel unsupervised anomaly detection system, UCAD, to identify stealthy abnormal data access operations in database systems.
• We develop a new transformer model called Trans-DAS for UCAD to capture the semantics and contextual intent of operations.
• We prototype UCAD. In particular, UCAD includes a preprocessing module that allows Trans-DAS to obtain normal semantic information by removing noisy data, and an anomaly detection module that implements an instance of Trans-DAS to detect abnormal operations via contextual intent comparison.
• We perform extensive evaluations of UCAD using two real-world data traces in two typical data access scenarios. The experimental results show that it achieves F1-scores of 0.89693 and 0.98168 in the two scenarios, respectively, significantly outperforming the baselines. Also, the results demonstrate that Trans-DAS is not sensitive to different settings and is robust to abnormal training data. Furthermore, the evaluations on three public datasets demonstrate the transferability of UCAD to other tasks.

2 THREAT MODEL

In this paper, we consider abnormal data access operations that are able to breach the peripheral protections of database systems, e.g., through stolen legitimate credentials [ ] or misoperations. In particular, these data access anomalies can occur for the following reasons.

Privilege Abuse. Authorized users abuse their privileges to perform abnormal operations intentionally [ ], e.g., for personal financial gain. For example, an attacker can perform additional query operations to retrieve confidential data in violation of normal business rules, and can even delete data to sabotage the database system [52].

Credential Stealing. An attacker steals the credentials of legitimate users to access the database and then stealthily performs abnormal data access operations [ ]. Generally, abnormal operations are hidden deeply among numerous normal daily activities [ ], e.g., an abnormal delete operation hidden in a session to remove confidential and sensitive data.

Misoperations. Inexperienced staff may perform misoperations accidentally, resulting in data chaos such as data leakage. Unlike normal data access operations, such operations are not logically consistent and are therefore considered abnormal [52].

Note that, in this paper, we assume that the operations of each user are correctly recorded in the database system log. Attackers are unable to corrupt the integrity of the system log, which is our trusted computing base (cf., [ ]). For instance, they cannot execute abnormal SQL statements without generating any log in the database system. A sophisticated attacker may tamper with or even remove their operation logs by exploiting vulnerabilities of the database system; this is beyond the scope of this paper and can be addressed by existing memory space protection techniques [ ].

3 OVERVIEW OF UCAD

Figure 2 shows an overview of our unsupervised anomaly detection system UCAD. Architecturally, UCAD consists of a preprocessing module and an anomaly detection module. The preprocessing module tokenizes the raw data access operations in the system log and removes the noisy data so that the anomaly detection module can learn the semantic information accurately. The anomaly detection module implements an instance of Trans-DAS, which learns the semantic information for contextual intent comparison.

The source code is released at https://github.com/UCAD3/core.

In general, UCAD has two working stages: Offline Training and Online Detection. In the training stage, the preprocessing module builds the vocabulary for data access operation tokenization, and removes noisy session data according to the tokenized keys in each session so that we can obtain a purified training set containing normal user sessions. The anomaly detection module trains Trans-DAS on the purified dataset to learn the normal semantic information. Note that this training procedure can be periodically conducted or manually triggered so that Trans-DAS captures the latest normal semantic information. In the detection stage, the preprocessing module utilizes the learned vocabulary to tokenize each active user session and directly filters out the known attack patterns. The anomaly detection module utilizes the trained Trans-DAS model to evaluate whether the semantics of the current operation in the active user session matches the overall intent of the sequence of operations preceding it (which we refer to as the contextual intent of the operation). Since Trans-DAS has learned the normal semantic information, it can capture anomalies that do not conform to it, i.e., a mismatched operation is labeled as an anomaly. The detected abnormal operations may subsequently be sent to a domain expert for further investigation and action (such as banning the user or even re-qualifying the system software).
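The tokenization step shared by both stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the normalization rule (replacing literals with `?`), the reserved key `0` for unseen statements, and all SQL statements are hypothetical choices made for the example.

```python
import re

UNKNOWN_KEY = 0  # assumed key for statements never seen during training

def normalize(sql: str) -> str:
    """Replace literals with '?' so statements differing only in values share a key."""
    text = sql.strip().rstrip(";").lower()
    text = re.sub(r"'[^']*'", "?", text)           # quoted string literals
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)   # numeric literals
    return re.sub(r"\s+", " ", text)

def build_vocabulary(sessions):
    """Offline stage: assign an integer key to each distinct normalized statement."""
    vocab = {}
    for session in sessions:
        for sql in session:
            vocab.setdefault(normalize(sql), len(vocab) + 1)
    return vocab

def tokenize(session, vocab):
    """Online stage: map an active session to keys using the learned vocabulary."""
    return [vocab.get(normalize(sql), UNKNOWN_KEY) for sql in session]

# Hypothetical training session.
training_sessions = [[
    "INSERT INTO t_rm_mac (id, mac) VALUES (1, 'aa:bb');",
    "SELECT * FROM t_rm_mac WHERE id = 1;",
    "DELETE FROM t_rm_mac WHERE id = 1;",
]]
vocab = build_vocabulary(training_sessions)

# Hypothetical active session containing one statement outside the vocabulary.
active_session = [
    "SELECT * FROM t_rm_mac WHERE id = 7;",
    "DELETE FROM t_rm_mac WHERE id = 7;",
    "DROP TABLE t_rm_mac;",
]
keys = tokenize(active_session, vocab)
print(keys)
```

The resulting key sequence is what a sequence model would consume; how unseen statements and known attack patterns are actually handled is a design decision of the real system and is not reproduced here.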

4 THE TRANS-DAS MODEL

In this section, we describe our transformer model Trans-DAS, which is used in UCAD for anomaly detection.

4.1 Basic Idea

To capture the semantic information of operations, we develop a new transformer model called Transformer for Data Access Semantics (Trans-DAS). The goal of Trans-DAS is to capture the relevance between one operation and its bidirectional operation contexts so that Trans-DAS can learn the exact semantics of individual operations and the contextual intent.

Note that traditional models widely used in semantics extraction, namely the Long Short-Term Memory (LSTM) network and traditional Transformer models, are ill-suited for learning the semantic information from data access operations. Specifically, although the LSTM network has been a standard solution for learning semantic patterns from sequential data, it processes data based on the item order in the sequence. Such processing implicitly makes LSTM rely heavily on the order information (i.e., the order dependence) to learn semantics, which is not applicable in database systems since heterogeneous user access patterns exhibit diverse operation sequences. Moreover, the traditional transformer model [ ], consisting of an encoder and a decoder, learns the semantic information by using the attention mechanism, which captures the relevance between each pair of items. However, it also embeds the position encodings (i.e., order information) into the semantics of items. When facing the challenge of heterogeneous access patterns, such order information may prevent us from capturing the semantic information accurately. Furthermore, the encoder chooses a fully-connected attention design without masking (i.e., connecting an item with

A user session refers to a sequence of data access operations executed by a specific user during a single database access.
