
Two-Level Data Compression using Machine
Learning in Time Series Database
Xinyang Yu, Yanqing Peng, Feifei Li, Sheng Wang, Xiaowei Shen, Huijun Mai, Yue Xie
Alibaba Group
{xinyang.yu, y.peng, lifeifei, sh.wang, leyu.sxw, huijun.mhj, jimmy.xy}@alibaba-inc.com
Abstract—The explosion of time series advances the develop-
ment of time series databases. To reduce storage overhead in
these systems, data compression is widely adopted. Most existing
compression algorithms utilize the overall characteristics of the
entire time series to achieve a high compression ratio, but ignore
local contexts around individual points. As a result, they are
effective only for certain data patterns, and may suffer from the
pattern changes inherent in real-world time series. It is therefore
strongly desired to have a compression method that can consistently
achieve a high compression ratio in the presence of pattern diversity.
In this paper, we propose a two-level compression model that
selects a proper compression scheme for each individual point, so
that diverse patterns can be captured at a fine granularity. Based
on this model, we design and implement AMMMO framework,
where a set of control parameters is defined to distill and
categorize data patterns. At the top level, we evaluate each
sub-sequence to fill in these parameters, generating a set of
compression scheme candidates (i.e., major mode selection). At the
bottom level, we choose the best scheme from these candidates for
each data point individually (i.e., sub-mode selection). To effectively
handle diverse data patterns, we introduce a reinforcement learning
based approach to learn parameter values automatically.
Our experimental evaluation shows that our approach improves
compression ratio by up to 120% (with an average of 50%),
compared to other time-series compression methods.
I. INTRODUCTION
Nowadays, time series data is being continuously generated
from a wide range of applications, such as finance [5], Internet
services [23] and Internet of things [12]. Such increasing
demand for storing and processing time series data facilitates
the emergence and development of time series databases
(TSDBs), such as OpenTSDB [21] and InfluxDB [9]. One
of the key components in these databases is an effective
compression scheme that reduces storage footprint, leading to
two performance boosts. First, more records can reside in the
memory cache for fast access. Second, lower data transfer cost
is required between devices (e.g., RAM to disk, and to FPGA
or GPU if they are used to speed up processing).
There are plenty of off-the-shelf compression schemes
available [1], [3], [15], [22]–[25]. However, the effectiveness of
a scheme largely depends on its input: each scheme is only
effective on certain types of inputs (i.e., patterns). Hence
in a TSDB, especially a cloud TSDB service that manages
data from various sources, it is difficult to rely on a pre-selected
compression scheme to sustain a high compression ratio
for any data. Figure 1 illustrates four examples of real-world
time-series data. As shown, these data are distinct from each
other, making it difficult for a single compression scheme to achieve
a satisfactory compression ratio on all of them. Ideally, we hope
that a TSDB could automatically search for the scheme (as
well as its parameters) that achieves high compression ratio
for a given piece of data. We refer to this problem as adaptive
compression scheme selection.
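As a toy illustration of this pattern dependence (the data and the
plain delta coder below are our own illustrative assumptions, not
schemes proposed in this paper), the following Python sketch shows how
the same transformation leaves near-zero residuals on a steady trend,
yet residuals larger than the raw values on an oscillating series:

def delta_residuals(series):
    # Residual left at each point after subtracting its predecessor.
    return [b - a for a, b in zip(series, series[1:])]

trend = [1000 + i for i in range(8)]           # steady upward trend
swing = [1000 * (-1) ** i for i in range(8)]   # oscillation between +/-1000

print(delta_residuals(trend))   # [1, 1, 1, 1, 1, 1, 1]      -> cheap to encode
print(delta_residuals(swing))   # [-2000, 2000, -2000, ...]  -> worse than raw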
This problem becomes even more challenging in the real world, since
time series data in many applications are highly complex:
even within one time series, the pattern can change over time
(Figure 1c); and different parts of a single piece of data
may prefer different schemes (Figure 1d). Therefore, in order
to achieve a good compression ratio, we cannot simply apply
one compression scheme to the entire data piece. Instead, we
have to examine the data at a fine granularity, and decide the
appropriate scheme for sub-pieces of the original data. In the
extreme case, we could break down the data into individual
points and find the optimal compression scheme for each
data point. However, this is almost impractical since both
the search space for the optimal scheme and the number of
points contained in a TSDB are huge, incurring prohibitive
computation cost if we perform the search for each point.
In this paper, we work towards enabling automatic per-
point compression scheme selection to achieve a satisfactory
compression ratio in real-world applications. Note that the
per-point approach (needed, as illustrated in Figure 1c and
Figure 1d) introduces extra space overhead for metadata
compared to a global compression scheme. To balance the
trade-off between scheme selection efficiency and compression
efficiency, we propose a two-level compression model, which
is able to adaptively select proper compression schemes for
each point. It is built on a concept called parameterized scheme
space that abstracts the entire search space as a set of scheme
templates and control parameters. At the top level, for each
timeline (a sequence of consecutive points), we construct a
scheme space (a small search space) by filling in values for
control parameters. At the bottom level, the most effective
scheme from the space is selected for each point.
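To make this flow concrete, the following sketch mimics the two
levels on a single timeline. The two candidate schemes (delta to the
previous point, and offset from a per-timeline base), the bit-cost
measure, and the per-point output layout are simplifying assumptions
for exposition only; they are not AMMMO's actual control parameters
or transformations, which are defined later in the paper.

def bit_cost(v):
    # Bits needed for a signed residual: one sign bit plus magnitude bits.
    return max(1, abs(v)).bit_length() + 1

def compress_timeline(points):
    # Top level (major mode selection): scan the timeline once to fill in a
    # control parameter -- here, the base value used by the offset scheme.
    base = min(points)
    candidates = [
        lambda prev, cur: cur - prev,  # sub-mode 0: delta to previous point
        lambda prev, cur: cur - base,  # sub-mode 1: offset from timeline base
    ]
    # Bottom level (sub-mode selection): each point independently picks the
    # candidate with the cheapest residual, recording which one it chose.
    out, prev = [(None, points[0])], points[0]  # first point is stored raw
    for cur in points[1:]:
        cost, mode, res = min((bit_cost(f(prev, cur)), i, f(prev, cur))
                              for i, f in enumerate(candidates))
        out.append((mode, res))
        prev = cur
    return out

# A mostly linear timeline with one spike at 1500.
print(compress_timeline([1000, 1001, 1002, 1500, 1003, 1004]))
# -> [(None, 1000), (0, 1), (0, 1), (0, 498), (1, 3), (0, 1)]

Note how the point right after the spike switches to the offset
scheme while the others keep delta coding; this per-point flexibility
is what the sub-mode id (a few bits of metadata per point) buys.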
However, there remain two challenges in the proposed
model. First, how can we construct a parameterized scheme space
that consistently performs well against unpredictable patterns?
The entire search space should be small yet cover most cases.
Second, how can we fill in the values of the control parameters
for a given timeline? It is hard to distill hidden patterns from
timelines and translate them into parameter values, both
effectively and efficiently.
For the first challenge, we design and develop the AMMMO
framework, where a set of control parameters is defined to distill
and categorize data patterns.