Unicorn: Detect Runtime Errors in Time-Series Databases with Hybrid Input Synthesis

Zhiyong Wu

KLISS, BNRist, School of Software,

Tsinghua University, China

wuzy21@mails.tsinghua.edu.cn

Jie Liang∗

KLISS, BNRist, School of Software,

Tsinghua University, China

liangjie.mailbox.cn@gmail.com

Mingzhe Wang

KLISS, BNRist, School of Software,

Tsinghua University, China

wmzhere@gmail.com

Chijin Zhou

KLISS, BNRist, School of Software,

Tsinghua University, China

ShuimuYulin Co., Ltd, China

tlock.chijin@gmail.com

Yu Jiang∗

KLISS, BNRist, School of Software,

Tsinghua University, China

jiangyu198964@126.com

ABSTRACT

The ubiquitous use of time-series databases in the safety-critical

Internet of Things domain demands strict security and correctness.

One successful approach in database bug detection is fuzzing, where

hundreds of bugs have been detected automatically in relational

databases. However, it cannot be easily applied to time-series

databases: the bulk of time-series logic is unreachable because of

mismatched query specifications, and serious bugs are undetectable

because of implicitly handled exceptions.

In this paper, we propose Unicorn to secure time-series databases

with automated fuzzing. First, we design hybrid input synthesis

to generate high-quality queries which not only cover time-series

features but also ensure grammar correctness. Then, Unicorn uses

proactive exception detection to discover minuscule-symptom bugs

which hide behind implicit exception handling. With the specialized

design oriented to time-series databases, Unicorn outperforms the

state-of-the-art database fuzzers in terms of coverage and bugs.

Specically, Unicorn outperforms SQLsmith and SQLancer on

widely used time-series databases IoTDB, KairosDB, TimescaleDB,

TDEngine, QuestDB, and GridDB in the number of basic blocks by

21%-199% and 34%-693%, respectively. More importantly, Unicorn

has discovered 42 previously unknown bugs.

CCS CONCEPTS

• Software and its engineering → Software maintenance tools;

• Security and privacy → Database and storage security.

KEYWORDS

Time-series Databases, Runtime Error, Hybrid Input Synthesis

∗Jie Liang and Yu Jiang are the corresponding authors.

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for prot or commercial advantage and that copies bear this notice and the full citation

on the rst page. Copyrights for components of this work owned by others than ACM

must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,

to post on servers or to redistribute to lists, requires prior specific permission and/or a

fee. Request permissions from permissions@acm.org.

ISSTA ’22, July 18–22, 2022, Virtual, South Korea

ACM ISBN 978-1-4503-9379-9/22/07...$15.00

https://doi.org/10.1145/3533767.3534364

ACM Reference Format:

Zhiyong Wu, Jie Liang, Mingzhe Wang, Chijin Zhou, and Yu Jiang. 2022.

Unicorn: Detect Runtime Errors in Time-Series Databases with Hybrid

Input Synthesis. In Proceedings of the 31st ACM SIGSOFT International

Symposium on Software Testing and Analysis (ISSTA ’22), July 18–22, 2022,

Virtual, South Korea. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3533767.3534364

1 INTRODUCTION

Along with the rapid growth in Internet of Things (IoT) deployment, time-series databases are ubiquitously used in all kinds of IoT devices. Compared to traditional relational databases, time-series databases employ complex logic to meet low-latency requirements and handle the time-series nature of their data. This complexity challenges their security, reliability, and correctness. To prevent vulnerabilities, a common approach is to write unit tests for the target database manually. However, unit testing is labor-intensive and cannot detect bugs at the system level.

One promising approach is fuzzing, an automated software testing technique that generates random data as program inputs. It was first developed by Miller et al. [ ] in the 1990s and has since been widely adopted in practice for finding bugs in many critical areas, including operating systems [ ], networking protocols [ ], and third-party libraries [ – ]. A fuzzer exercises the target program in a loop: (1) select an input and generate candidate inputs based on it, (2) execute the candidate inputs to track coverage and monitor anomalies, (3) save interesting candidate inputs that reach new coverage, then go to (1). By following this loop, fuzzers can continuously explore more and more of the state space of the target program.
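The three-step loop can be sketched in a few lines of Python. This is only an illustration of generic coverage-guided fuzzing, not Unicorn's implementation: the `target` function, its branch ids, the planted bug, and the single-byte `mutate` operator are all hypothetical stand-ins.

```python
import random

def target(data: bytes) -> set:
    """Hypothetical program under test; returns the ids of the branches it hit."""
    covered = {0}
    if data[:1] == b"S":
        covered.add(1)
        if data[1:2] == b"E":
            covered.add(2)
            if data[2:3] == b"L":
                raise RuntimeError("planted bug")  # the anomaly to be found
    return covered

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Hypothetical mutator: overwrite one random byte with an uppercase letter."""
    buf = bytearray(seed)
    buf[rng.randrange(len(buf))] = rng.randrange(ord("A"), ord("Z") + 1)
    return bytes(buf)

def fuzz(initial: bytes, iterations: int, rng_seed: int = 0):
    rng = random.Random(rng_seed)
    corpus, seen, anomalies = [initial], set(target(initial)), []
    for _ in range(iterations):
        # (1) select an input and derive a candidate from it
        candidate = mutate(rng.choice(corpus), rng)
        # (2) execute the candidate, tracking coverage and monitoring anomalies
        try:
            coverage = target(candidate)
        except RuntimeError as exc:
            anomalies.append((candidate, str(exc)))
            continue
        # (3) keep the candidate only if it reached coverage not seen before
        if not coverage <= seen:
            seen |= coverage
            corpus.append(candidate)
    return corpus, seen, anomalies
```

Calling `fuzz(b"AAAA", 2000)` returns the saved corpus, the accumulated coverage, and any inputs that raised. Because candidates are kept only when they add coverage, each saved input brings the planted bug one byte closer.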

Because it is easy to adapt, fuzzing can continuously test whole systems with little manual effort. Prior works have successfully applied fuzzing to relational databases and discovered many vulnerabilities. For example, SQLsmith [ ] automatically constructs inputs with an abstract syntax tree (AST) model and sends them to target systems for execution. It has found more than 100 bugs in PostgreSQL, SQLite, and MonetDB since 2015 [ ]. However, because of the unique attributes of time-series databases, existing fuzzing strategies are hard to adapt directly to these databases. There are two major challenges.

The rst challenge is generating grammatically-correct time-series

queries. Time-series is the basic form to organize data for time-series

251

ISSTA ’22, July 18–22, 2022, Virtual, South Korea Z.Wu, J.Liang, M.Wang, C.Zhou, Y.Jiang

databases, but existing fuzzers are hard to generate grammatically-

correct time-series queries to test. Specically, time-series data [

]

represents a collection of data values observed from sequential

measurements over time. To improve the eciency to store and

fetch data, time-series databases employ dierent strategies from

relational databases to t the time-series storage. However, lacking

time-series specications of time-series databases, existing fuzzing

strategies are hard to generate grammatically-correct time-series

queries. Specically, due to the vast dierence between time-series

in IoT domain and relations in SQL, traditional relational database

fuzzers (e.g. SQLsmith [

]) can hardly reach time-series logic. In

addition, the queries accepted by time-series databases are highly-

structured, the strict grammar impedes most of the seeds generated

by random mutation in conventional mutation-based fuzzers (e.g.

AFL [

]). As a result, designing a time-series input generation

mechanism, which generates grammatically correct time-series

queries, to explore the time-series logic is needed.

The second challenge is capturing exceptions handled implicitly. Fuzzing uses crashes as the indication of a failed test. However, time-series databases utilize implicit exception handling to avoid crashing the whole system, for the sake of usability and reliability. In other words, when anomalies do not happen in critical locations of the server, they are handled implicitly and no crash is triggered. For example, time-series databases usually create a new thread for each connecting client as the worker. When an exception is thrown inside the thread, the implicit handling mechanism automatically captures it and only informs the worker with a fault message. Therefore, the server still preserves a normal running state. However, these exceptions may indicate serious bugs, and existing fuzzing approaches ignore them. As a result, an implicitly handled exception detection scheme is required, one that directly obtains exception messages and determines whether each is an anomaly, so that all possible bugs are captured.
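A minimal sketch of why a crash-based oracle misses such bugs, assuming a thread-per-client server. The `execute` function, the failing query string, and the reply format are hypothetical and chosen only for illustration; they do not describe any real database.

```python
import threading

def execute(query: str) -> str:
    """Hypothetical query executor with a bug triggered by one specific query."""
    if query == "select last(*) from root.vehicle":
        raise IndexError("list index out of range")  # stands in for a real bug
    return "ok"

def worker(query: str, replies: list) -> None:
    """Per-client worker: any exception is caught and reduced to a fault message."""
    try:
        replies.append(execute(query))
    except Exception as exc:
        replies.append(f"error: {type(exc).__name__}")

def serve(queries):
    """Hypothetical server loop: one worker thread per incoming query."""
    replies = []
    for query in queries:
        thread = threading.Thread(target=worker, args=(query, replies))
        thread.start()
        thread.join()  # sequential joins keep the reply order deterministic
    return replies
```

With the three queries `"select 1"`, the failing query, and `"select 2"`, `serve` returns one reply per query and never raises. A fuzzer that only watches for process crashes would therefore report nothing, even though the second query hit an out-of-range access.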

In this paper, we propose Unicorn to overcome these challenges through hybrid input synthesis and proactive exception detection. To generate grammatically correct time-series queries, hybrid input synthesis combines syntax-preserved mutation and time-series guided mutation. Specifically, we design a hybrid input specification, which combines the rules for generating conventional SQL and time-series SQL in time-series databases. Based on the specification, Unicorn first constructs the abstract syntax tree (AST) for the original seeds and then generates new time-series queries by changing the time-series nodes of the AST. To detect exceptions handled implicitly, proactive exception detection directly captures exception information from the runtime environment and analyzes whether it is an anomaly. Specifically, instead of passively receiving the program's state, Unicorn inserts an agent into each process to proactively catch the exceptions and send them to the anomaly detector for analysis and reporting.
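The idea of an in-process agent can be illustrated with a Python analogy. This section does not describe Unicorn's actual agent mechanism, so the sketch below is an assumption-laden stand-in: it uses Python's standard `sys.settrace` hook, whose local trace function receives an `exception` event even when the exception is later caught. The `execute` and `handle` functions and the `ExceptionAgent` class are hypothetical.

```python
import sys

def execute(query: str) -> str:
    """Hypothetical executor with a bug on queries that contain 'last'."""
    if "last" in query:
        raise IndexError("list index out of range")
    return "ok"

def handle(query: str) -> str:
    """Implicit handling: the caller only ever sees a fault message."""
    try:
        return execute(query)
    except Exception:
        return "error"

class ExceptionAgent:
    """Records every raised exception, including those handled internally."""

    def __init__(self):
        self.seen = []  # (exception type name, function name) pairs

    def _trace(self, frame, event, arg):
        if event == "exception":
            exc_type, _exc_value, _traceback = arg
            self.seen.append((exc_type.__name__, frame.f_code.co_name))
        return self._trace  # keep tracing nested frames

    def run(self, func, *args):
        previous = sys.gettrace()
        sys.settrace(self._trace)
        try:
            return func(*args)
        finally:
            sys.settrace(previous)  # always restore the earlier hook
```

Running `ExceptionAgent().run(handle, "select last(speed) from root.vehicle.d1")` still returns `"error"` to the caller, but the agent's `seen` list now records the `IndexError` raised inside `execute`. A separate anomaly detector could then decide which recorded exceptions are benign (for example, syntax errors) and which indicate bugs; that classification step is left out of this sketch.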

For evaluation, we used Unicorn to perform fuzzing on IoTDB,

KairosDB, QuestDB, TimescaleDB, TDEngine, and GridDB. We also

adapted the industrial fuzzers SQLancer [ ] and SQLsmith [ ]

for comparison. Unicorn covered 115.75% more basic blocks on

average than the best results of other fuzzers. In addition, Unicorn

detected 42 previously unknown bugs.

In conclusion, our paper makes the following contributions:

• We observe that current fuzzing approaches struggle to test time-series databases effectively. The two main challenges are generating grammatically correct queries and capturing exceptions handled implicitly.

• We propose hybrid input synthesis and proactive exception detection to address the aforementioned challenges. We also implement these approaches in Unicorn.

• We evaluate Unicorn on 6 popular time-series databases against the state-of-the-art fuzzers SQLsmith and SQLancer. The results show that Unicorn outperforms the others and detects 42 previously unknown bugs.

2 TIME-SERIES DATABASES

As an infrastructure for IoT data storage and analysis, time-series databases play an important role in promoting the development of the Internet of Things. Generally, a time-series database is a kind of large-scale software that manipulates and manages IoT data. It handles the operation requests from various clients (including IoT devices, PCs, etc.) and carries out unified management and control to ensure the security and integrity of IoT data [ ]. In embedded application scenarios, time-series databases usually have the following two characteristics: 1) they employ time-series data to meet scenarios in the IoT domain, and 2) they utilize implicit exception handling to guarantee usability; that is, they limit the impact of anomalies by handling them internally so that the server always runs normally.

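[Figure 1 image: a tree rooted at Root, with layers labeled Storage Group, Device, and Sensor. Node labels recoverable from the image: vehicle, robot, mobile, d2, r1, speed, status, temperature, height. Queries shown in the image: "set storage group to root.vehicle" and "create timeseries root.vehicle.d1.speed with datatype=BOOLEAN,encoding=PLAIN".]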

Figure 1: A time-series query and the corresponding storage model of Apache IoTDB. The query introduces new keywords related to time series. In addition, object names are hierarchical: IoTDB stores IoT data in a tree-based schema whose attribute hierarchy has three layers. A grammatically correct object name must form a path from the root node to a leaf node.
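The root-to-leaf rule for object names can be made concrete with a small sketch. The `SCHEMA` below is a hypothetical tree loosely modeled on the labels in Figure 1, since the exact edges of the figure are not recoverable here. The dictionary-based query AST and the `mutate_path` helper are likewise illustrative assumptions, not Unicorn's data structures.

```python
import random

# Hypothetical schema: storage group -> device -> list of sensors.
SCHEMA = {
    "root": {
        "vehicle": {"d1": ["speed", "status"], "d2": ["temperature"]},
        "robot": {"r1": ["status"]},
    }
}

def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a dotted object name."""
    if isinstance(tree, dict):
        for name, child in tree.items():
            yield from leaf_paths(child, prefix + (name,))
    else:  # a list of sensor names, i.e. the leaf layer
        for sensor in tree:
            yield ".".join(prefix + (sensor,))

def is_valid(path: str) -> bool:
    """An object name is valid only if it is a complete root-to-leaf path."""
    return path in set(leaf_paths(SCHEMA))

def mutate_path(ast: dict, rng: random.Random) -> dict:
    """Swap only the time-series path node, keeping the rest of the AST intact."""
    choices = [p for p in leaf_paths(SCHEMA) if p != ast["path"]]
    return {**ast, "path": rng.choice(choices)}

def render(ast: dict) -> str:
    """Serialize the toy AST back into an IoTDB-style statement."""
    return (f"create timeseries {ast['path']} "
            f"with datatype={ast['datatype']},encoding={ast['encoding']}")
```

Starting from the AST `{"path": "root.vehicle.d1.speed", "datatype": "BOOLEAN", "encoding": "PLAIN"}`, `render` reproduces the statement from Figure 1, and `mutate_path` yields a variant whose object name is still a valid path. By contrast, a name such as `root.vehicle.speed` skips the device layer and is rejected by `is_valid`.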

2.1 Employing Time-Series Data

Compared to other applications, the major characteristic of the IoT

applications is employing time-series data. Time series data [

]
