
Two-Level Data Compression using Machine
Learning in Time Series Database
Xinyang Yu, Yanqing Peng, Feifei Li, Sheng Wang, Xiaowei Shen, Huijun Mai, Yue Xie
Alibaba Group
{xinyang.yu, y.peng, lifeifei, sh.wang, leyu.sxw, huijun.mhj, jimmy.xy}@alibaba-inc.com
Abstract—The explosion of time series advances the develop-
ment of time series databases. To reduce storage overhead in
these systems, data compression is widely adopted. Most existing
compression algorithms utilize the overall characteristics of the
entire time series to achieve a high compression ratio, but ignore
local contexts around individual points. As a result, they are
effective only for certain data patterns, and may suffer from the
pattern changes inherent in real-world time series. It is therefore
strongly desired to have a compression method that can consistently
achieve a high compression ratio in the presence of pattern diversity.
In this paper, we propose a two-level compression model that
selects a proper compression scheme for each individual point, so
that diverse patterns can be captured at a fine granularity. Based
on this model, we design and implement AMMMO framework,
where a set of control parameters is defined to distill and
categorize data patterns. At the top level, we evaluate each
sub-sequence to fill in these parameters, generating a set of
compression scheme candidates (i.e., major mode selection). At the
bottom level, we choose the best scheme from these candidates for
each data point individually (i.e., sub-mode selection). To effectively
handle diverse data patterns, we introduce a reinforcement learning
based approach to learn parameter values automatically.
Our experimental evaluation shows that our approach improves
compression ratio by up to 120% (with an average of 50%),
compared to other time-series compression methods.
I. INTRODUCTION
Nowadays, time series data is being continuously generated
from a wide range of applications, such as finance [5], Internet
services [23] and Internet of things [12]. Such increasing
demand for storing and processing time series data facilitates
the emergence and development of time series databases
(TSDBs), such as OpenTSDB [21] and InfluxDB [9]. One
of the key components in these databases is an effective
compression scheme that reduces storage footprint, leading to
two performance boosts. First, more records can reside in the
memory cache for fast access. Second, lower data transfer cost
is required between devices (e.g., RAM to disk, and to FPGA
or GPU if they are used to speed up processing).
There are plenty of off-the-shelf compression schemes
available [1], [3], [15], [22]–[25]. However, the effectiveness of
a scheme largely depends on its input: each scheme is only
effective on certain types of inputs (i.e., patterns). Hence
in a TSDB, especially a cloud TSDB service that manages
data from various sources, it is difficult to rely on a pre-selected
compression scheme to sustain a high compression ratio
for any data. Figure 1 illustrates four examples of real-world
time-series data. As shown, these data are distinct from each
other, making it difficult for a single compression scheme to achieve
a satisfactory compression ratio on all of them. Ideally, we hope
that a TSDB could automatically search for the scheme (as
well as its parameters) that achieves high compression ratio
for a given piece of data. We refer to this problem as adaptive
compression scheme selection.
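As a toy illustration of this pattern dependence (the data and the
plain delta coder below are our own illustrative assumptions, not
schemes proposed in this paper), the following Python sketch shows how
the same transformation leaves near-zero residuals on a steady trend,
yet residuals larger than the raw values on an oscillating series:

def delta_residuals(series):
    # Residual left at each point after subtracting its predecessor.
    return [b - a for a, b in zip(series, series[1:])]

trend = [1000 + i for i in range(8)]           # steady upward trend
swing = [1000 * (-1) ** i for i in range(8)]   # oscillation between +/-1000

print(delta_residuals(trend))   # [1, 1, 1, 1, 1, 1, 1]      -> cheap to encode
print(delta_residuals(swing))   # [-2000, 2000, -2000, ...]  -> worse than raw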
This problem becomes even more challenging in the real world, since
time series data in many applications are highly complex:
even within one time series, the pattern can change over time
(Figure 1c); and different parts of a single piece of data
may prefer different schemes (Figure 1d). Therefore, in order
to achieve a good compression ratio, we cannot simply apply
one compression scheme to the entire data piece. Instead, we
have to examine the data at a fine granularity, and decide the
appropriate scheme for sub-pieces of the original data. In the
extreme case, we could break down the data into individual
points and find the optimal compression scheme for each
data point. However, this is almost impractical since both
the search space for the optimal scheme and the number of
points contained in a TSDB are huge, incurring prohibitive
computation cost if we perform the search for each point.
In this paper, we work towards enabling automatic per-
point compression scheme selection to achieve a satisfactory
compression ratio in real-world applications. Note that the
per-point approach (needed, as illustrated in Figure 1c and
Figure 1d) introduces extra space overhead for metadata
compared to a global compression scheme. To balance the
trade-off between scheme selection efficiency and compression
efficiency, we propose a two-level compression model, which
is able to adaptively select proper compression schemes for
each point. It is built on a concept called parameterized scheme
space that abstracts the entire search space as a set of scheme
templates and control parameters. At the top level, for each
timeline (a sequence of consecutive points), we construct a
scheme space (a small search space) by filling in values for
control parameters. At the bottom level, the most effective
scheme from the space is selected for each point.
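To make this flow concrete, the following sketch mimics the two
levels on a single timeline. The two candidate schemes (delta to the
previous point, and offset from a per-timeline base), the bit-cost
measure, and the per-point output layout are simplifying assumptions
for exposition only; they are not AMMMO's actual control parameters
or transformations, which are defined later in the paper.

def bit_cost(v):
    # Bits needed for a signed residual: one sign bit plus magnitude bits.
    return max(1, abs(v)).bit_length() + 1

def compress_timeline(points):
    # Top level (major mode selection): scan the timeline once to fill in a
    # control parameter -- here, the base value used by the offset scheme.
    base = min(points)
    candidates = [
        lambda prev, cur: cur - prev,  # sub-mode 0: delta to previous point
        lambda prev, cur: cur - base,  # sub-mode 1: offset from timeline base
    ]
    # Bottom level (sub-mode selection): each point independently picks the
    # candidate with the cheapest residual, recording which one it chose.
    out, prev = [(None, points[0])], points[0]  # first point is stored raw
    for cur in points[1:]:
        cost, mode, res = min((bit_cost(f(prev, cur)), i, f(prev, cur))
                              for i, f in enumerate(candidates))
        out.append((mode, res))
        prev = cur
    return out

# A mostly linear timeline with one spike at 1500.
print(compress_timeline([1000, 1001, 1002, 1500, 1003, 1004]))
# -> [(None, 1000), (0, 1), (0, 1), (0, 498), (1, 3), (0, 1)]

Note how the point right after the spike switches to the offset
scheme while the others keep delta coding; this per-point flexibility
is what the sub-mode id (a few bits of metadata per point) buys.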
However, there remain two challenges in the proposed
model. First, how can we construct a parameterized scheme space
that consistently performs well against unpredictable patterns?
The entire search space should be small yet cover most cases.
Second, how can we fill in the values of the control parameters
for a given timeline? It is hard to distill hidden patterns from
timelines and translate them into parameter values, both
effectively and efficiently.
For the first challenge, we design and develop the AMMMO
framework, where a set of control parameters is defined to distill
and categorize data patterns.