Two-Level Data Compression using Machine
Learning in Time Series Database
Xinyang Yu, Yanqing Peng, Feifei Li, Sheng Wang, Xiaowei Shen, Huijun Mai, Yue Xie
Alibaba Group
{xinyang.yu, y.peng, lifeifei, sh.wang, leyu.sxw, huijun.mhj, jimmy.xy}@alibaba-inc.com
Abstract—The explosion of time series data has advanced the development of time series databases. To reduce storage overhead in these systems, data compression is widely adopted. Most existing compression algorithms utilize the overall characteristics of the entire time series to achieve a high compression ratio, but ignore local contexts around individual points. As a result, they are effective only for certain data patterns, and may suffer from the pattern changes inherent in real-world time series. It is therefore strongly desired to have a compression method that can always achieve a high compression ratio in the presence of pattern diversity.
In this paper, we propose a two-level compression model that
selects a proper compression scheme for each individual point, so
that diverse patterns can be captured at a fine granularity. Based
on this model, we design and implement AMMMO framework,
where a set of control parameters is defined to distill and
categorize data patterns. At the top level, we evaluate each
sub-sequence to fill in these parameters, generating a set of
compression scheme candidates (i.e., major mode selection). At the
bottom level, we choose the best scheme from these candidates for
each data point respectively (i.e., sub-mode selection). To effec-
tively handle diverse data patterns, we introduce a reinforcement
learning based approach to learn parameter values automatically.
Our experimental evaluation shows that our approach improves
compression ratio by up to 120% (with an average of 50%),
compared to other time-series compression methods.
I. INTRODUCTION
Nowadays, time series data is being continuously generated
from a wide range of applications, such as finance [5], Internet
services [23] and Internet of things [12]. Such increasing
demand for storing and processing time series data facilitates
the emergence and development of time series databases
(TSDBs), such as OpenTSDB [21] and InfluxDB [9]. One
of the key components in these databases is an effective
compression scheme that reduces storage footprint, leading to
two performance boosts. First, more records can reside in the
memory cache for fast access. Second, lower data transfer cost
is incurred between devices (e.g., RAM to disk, and to FPGA
or GPU if they are used to speed up processing).
There are plenty of off-the-shelf compression schemes
to use [1], [3], [15], [22]–[25]. However, the effectiveness of
a scheme largely depends on its input: each scheme is only
effective on certain types of inputs (i.e., patterns). Hence
in a TSDB, especially a cloud TSDB service that manages
data from various sources, it is difficult to rely on a pre-
selected compression scheme to sustain high compression ratio
for any data. Figure 1 illustrates four examples of real-world
time-series data. As shown, these data are distinct from each
other, making it difficult for a single compression scheme to
achieve a satisfactory compression ratio on all of them. Ideally, we hope
that a TSDB could automatically search for the scheme (as
well as its parameters) that achieves high compression ratio
for a given piece of data. We refer to this problem as adaptive
compression scheme selection.
This problem becomes more challenging in the real world, since
time series data in many applications are highly complex:
even within one time series, the pattern can change over time
(Figure 1c); and different parts from a single piece of data
may prefer different schemes (Figure 1d). Therefore, in order
to achieve good compression ratio, we cannot simply apply
one compression scheme for the entire data piece. Instead, we
have to examine the data at a fine granularity, and decide the
appropriate scheme for sub-pieces of the original data. In the
extreme case, we could break down the data into individual
points and find the optimal compression scheme for each
data point. However, this is almost impractical since both
the search space for the optimal scheme and the number of
points contained in a TSDB are huge, incurring prohibitive
computation cost if we perform the search for each point.
In this paper, we work towards enabling automatic per-
point compression scheme selection to achieve satisfactory
compression ratio in real-world applications. Note that the
per-point approach (needed as illustrated in Figure 1c and
Figure 1d) introduces extra space overhead for metadata
compared to a global compression scheme. To balance the
trade-off between scheme selection efficiency and compression
efficiency, we propose a two-level compression model, which
is able to adaptively select proper compression schemes for
each point. It is built on a concept called parameterized scheme
space that abstracts the entire search space as a set of scheme
templates and control parameters. At the top level, for each
timeline (a sequence of continuous points), we construct a
scheme space (a small search space) by filling in values for
control parameters. At the bottom level, the most effective
scheme from the space is selected for each point.
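The two-level selection can be sketched as follows. The scheme templates and byte-cost model here are simplified stand-ins of our own, not the actual AMMMO sub-modes: the top level prunes the scheme set for a whole timeline, and the bottom level picks the cheapest surviving scheme for each point.

```python
def byte_len(x: int) -> int:
    """Bytes needed for a byte-aligned two's-complement encoding of x."""
    return max(1, (x.bit_length() + 8) // 8)

def candidate_costs(prev2: int, prev: int, cur: int) -> dict:
    """Per-point cost, in bytes, of three illustrative schemes."""
    return {
        "delta": byte_len(cur - prev),                     # first difference
        "dod":   byte_len((cur - prev) - (prev - prev2)),  # delta-of-delta
        "xor":   byte_len(cur ^ prev),                     # XOR with predecessor
    }

def two_level_select(timeline, k=2):
    # Top level: rank schemes by total cost over the whole timeline and keep
    # the best k as the candidate set (a stand-in for major-mode selection).
    totals = {s: 0 for s in ("delta", "dod", "xor")}
    for p2, p1, c in zip(timeline, timeline[1:], timeline[2:]):
        for s, cost in candidate_costs(p2, p1, c).items():
            totals[s] += cost
    candidates = sorted(totals, key=totals.get)[:k]
    # Bottom level: pick the cheapest candidate for each point (a stand-in
    # for sub-mode selection); the chosen scheme id becomes per-point metadata.
    return [min(candidates, key=lambda s: candidate_costs(p2, p1, c)[s])
            for p2, p1, c in zip(timeline, timeline[1:], timeline[2:])]
```

On a steadily increasing timeline, the delta-of-delta scheme wins at both levels, since second differences are near zero; a noisy timeline would instead favor plain delta or XOR for some points.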
However, there remain two challenges in the proposed
model. First, how to construct a parameterized scheme space
that has the potential to always perform well against unpre-
dictable patterns. It is desired that the entire search space
is small but covers most cases. Second, how to fill in
values of control parameters for a chosen timeline. It is
hard to distill hidden patterns from timelines and translate
them into parameter values, both effectively and efficiently.
For the first challenge, we design and develop AMMMO
[2020 IEEE 36th International Conference on Data Engineering (ICDE). 2375-026X/20/$31.00 ©2020 IEEE. DOI 10.1109/ICDE48307.2020.00119]
Fig. 1: Four use cases from real scenarios: (a) finance, (b) Internet monitoring, (c) resource utilization, (d) sensing device. Different use cases exhibit different patterns. Fig. 1c shows that different periods within the same data can have different patterns (Pattern X vs. Pattern Y); Fig. 1d illustrates that different parts of the data prefer different compression schemes (XOR, RD/RDoD, or Delta/DoD).
(Adaptive Multi-Model Middle-Out) framework. In particu-
lar, it divides the entire search space into four sub-spaces,
called major modes, designed with insightful preferences (i.e.,
transform/bitmask/offset-preferred or mixed in Section IV-B).
Each major mode further contains four sub-modes, which are
constructed compression schemes. For the second challenge,
we propose both rule-based and learning-based approaches to
address it. In particular, for the learning-based approach, since
labels (i.e., ground truths) for optimal parameter settings are
unavailable, we apply reinforcement learning on a neural
network structure to learn parameters interactively. When the
compression ratio drops due to pattern changes, this network
can be easily retrained.
In summary, we make the following major contributions:
• We introduce a two-level model to select compression schemes for each individual point, which addresses the inherent data diversity in time series. A parameterized scheme space is proposed to facilitate the representation of the entire search space.
• We design a compression framework, AMMMO, following the two-level model. This framework encompasses four major modes to categorize typical patterns, and defines sophisticated control parameters to help construct sub-modes, i.e., compression schemes.
• We design a neural network structure with reinforcement learning to tune parameter values automatically. It solves the issue that no labeled training samples are available in the context of data compression.
• We conduct extensive experimental evaluations on large-scale real datasets. The results show that our approach improves compression ratio by up to 120% (with an average improvement of 50%), compared to other time-series compression methods.
The rest of the paper is organized as follows. We first
discuss background and motivation in Section II. We then
introduce the two-level scheme selection model in Section III,
present the AMMMO framework in Section IV, followed by
reinforcement-learning-based model selection in Section V.
We evaluate our approach in Section VI and conclude in
Section VII.
II. PRELIMINARIES
A. Data Compression
Data compression can be regarded as a process that
transforms a byte sequence in some representation (e.g., floating-point
numbers for metric values) into a new byte sequence that
contains the same information but with fewer bytes. As surveyed
in [16], there are many universal compression techniques, such
as (static or adaptive) Huffman and arithmetic coding, which
are ubiquitous in real-world applications. Besides, there are
also compression algorithms tailored for certain data types. For
example, HEVC [30] and H264 [10] are designed for videos,
while gzip, 7zip and snappy are designed for texts.
Based on different byte alignment strategies for compressed
bit streams, compression algorithms can be categorized into
byte-level and bit-level [29]. In general, the bit-level compres-
sion achieves higher compression ratio, while the byte-level
compression avoids costly bit shift operations, and is more
suitable for partition and parallel processing. In this work,
we will focus on byte-level compression since it has higher
throughput and is more hardware-friendly.
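As a minimal illustration of why byte alignment helps, the following sketch (our own, not taken from any specific TSDB) stores each delta in a whole number of bytes with a one-byte length tag, so encoding and decoding never shift bits across byte boundaries:

```python
import struct

def encode_byte_aligned(values):
    """Byte-aligned delta encoding: each delta is stored in the smallest
    whole number of bytes (1, 2, 4, or 8), preceded by a 1-byte length tag.
    No bit shifting across byte boundaries is ever needed."""
    out = bytearray(struct.pack("<q", values[0]))   # first value verbatim
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        for size, fmt in ((1, "<b"), (2, "<h"), (4, "<i"), (8, "<q")):
            lo, hi = -(1 << (8 * size - 1)), (1 << (8 * size - 1)) - 1
            if lo <= delta <= hi:
                out.append(size)                    # length tag
                out += struct.pack(fmt, delta)
                break
    return bytes(out)

def decode_byte_aligned(buf):
    values = [struct.unpack_from("<q", buf, 0)[0]]
    pos = 8
    fmt_by_size = {1: "<b", 2: "<h", 4: "<i", 8: "<q"}
    while pos < len(buf):
        size = buf[pos]; pos += 1
        values.append(values[-1] + struct.unpack_from(fmt_by_size[size], buf, pos)[0])
        pos += size
    return values
```

A bit-level coder could shave the tag down to two bits per delta, but every read would then require shifts and masks; the byte-aligned layout trades a little ratio for simpler, parallel-friendly decoding.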
B. Time Series Data
As defined in [21], [24], a time series consists of many data
points, each of which is uniquely identified with a timestamp,
a metric name, a measured value, and a set of tags. A simple
example is shown in Table I. The combination of tags allows
users to record different data properties and issue queries
intuitively. Among all fields, timestamps and measured values
dominate the storage consumption. Therefore, the major task is to
compress values in these two fields effectively.
Although general-purpose compression algorithms can be
directly applied to time series data, such data have unique
characteristics (detailed in Section III-A) that could potentially
help us tailor more effective compression schemes for them.
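For concreteness, a data point in this model might be represented as follows; the field names are illustrative, following the OpenTSDB-style model described above, and are not taken from the paper:

```python
from dataclasses import dataclass, field

# Illustrative layout of a time series data point.
@dataclass
class DataPoint:
    metric: str                  # metric name, e.g. "cpu.usage"
    timestamp: int               # epoch seconds
    value: float                 # measured value
    tags: dict = field(default_factory=dict)   # e.g. {"host": "h1"}

# Metric names and tags repeat across points of the same series and can be
# dictionary-encoded once, so timestamps and values dominate raw storage.
points = [DataPoint("cpu.usage", 1_600_000_000 + i, 0.5 * i, {"host": "h1"})
          for i in range(3)]
```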
C. Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning
concerned with how to take appropriate actions to maximize
reward in a particular situation. It could be useful for training
without ground truth [32].
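As a toy illustration of such reward-only learning, an epsilon-greedy bandit (far simpler than the policy network of Section V) can pick among compression schemes using only the observed compression ratio as the reward, with no labeled optimal scheme:

```python
import random

def bandit_select(schemes, reward_fn, rounds=200, eps=0.1, seed=0):
    """Epsilon-greedy bandit: no ground-truth labels, only a reward signal
    (here, the compression ratio achieved by the chosen scheme)."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in schemes}
    counts = {s: 0 for s in schemes}
    for _ in range(rounds):
        if rng.random() < eps:
            s = rng.choice(schemes)            # explore a random scheme
        else:                                  # exploit best observed mean
            s = max(schemes, key=lambda x: totals[x] / counts[x]
                    if counts[x] else float("inf"))
        r = reward_fn(s)                       # observed compression ratio
        totals[s] += r
        counts[s] += 1
    return max(schemes, key=lambda x: totals[x] / max(counts[x], 1))
```

Unlike this stateless bandit, the approach in Section V conditions its choice on the data itself, but the training signal is the same: the achieved compression ratio serves as the reward.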
Reinforcement learning has been widely applied in many
interactive scenarios, such as game playing [7], [28] and