LIU et al.: UNIFIED FRAMEWORK FOR MINING BATCH AND PERIODIC BATCH IN DATA STREAMS 5545
B. Our Proposed Solution
This paper proposes a unified framework, called HyperCalm
sketch, to detect batch and periodic batch in data streams in real
time. HyperCalm takes two phases to find top-k periodic batches.
In phase 1, for each item e arriving at time t, we check whether
it is the start of a batch. If so, we query a TimeRecorder queue
to get the arrival time
ˆ
t of the last batch of e, and calculate the
batch interval V = t −
ˆ
t. Then we send this batch and its interval
e, V to the second phase. In phase 2, we check periodicity and
manage to record top-k periodic batches, i.e., top-k e, V pairs.
In phase 1, we devise a better algorithm than the s tate-of-the-art
algorithm for detecting batches, Clock-Sketch [2]. In phase 2,
we propose an enhanced top-k algorithm, which naturally suits
our periodic batch detection scenario.
In phase 1, we propose a time-aware version of Bloom fil-
ter, namely HyperBloomFilter (HyperBF for short), to detect
batches. For each incoming item, phase 1 should report whether
the item is the start of a batch. In other words, this is an existence
detection algorithm. In addition, phase 1 should be aware of the
item arrival time to divide a series of the same item into many
batches. Bloom filter [10] is the most well-known memory-
efficient data structure used for existence detection. However,
the existence detection of Bloom filter is only low-dimensional,
i.e., it is agnostic to time dimension. Typical work aware of time
dimension is Persistent Bloom filter (PBF) [11].Itisanelegant
variant of Bloom filter, which uses a set of carefully constructed
Bloom filters to support membership testing for temporal queries
(MTTQ) (e.g., has a person visited a website between 8:30pm
and 8:40pm?). MTTQ and batch detection are different ways
to be aware of time dimension. To enable Bloom filter to be
aware of time, our HyperBF extends each bit in Bloom filter
into a 2-bit cell, doubling the memory usage. Compared to the
standard Bloom filter, HyperBF has the same number of hash
computations and memory accesses for each insertion and query.
The only overhead for time awareness is doubling the memory
usage, which is reasonable and acceptable.
In phase 2, we propose an enhanced top-k algorithm, called
Calm Space-Saving (CalmSS for short), to report top-k periodic
batches. For each incoming batch and its interval, i.e., e, V ,
phase 2 should keep periodic batches with large periodicities,
and evict periodic batches with small periodicities. In other
words, phase 2 keeps frequent e, V pairs, and evicts infre-
quent e, V pairs, which is is a top-k algorithm. Typical top-k
algorithms include Space-Saving [12], Unbiased Space-Saving
[13], and Frequent [14]
. However, their accuracy is significantly
harmed by cold items.
1
This problem is more serious in our
scenario of periodic batch detection. This is because one in-
frequent item may have multiple batches, and one frequent item
may also have multiple batches without periodicity. Both the two
cases above increase t he number of cold e, V pairs. To i dentify
and discard cold items, Cold Filter [17] and LogLogFilter [18]
record the frequencies of all items in a compact data structure.
However, considering the large volume of data stream, this
structure will be filled up very quickly, and needs to be cleaned
up periodically. To ensure the one-pass property of our solution,
it is highly desired to devise a data structure which will never
1
Cold items refer to items with small frequencies (i.e., infrequent items), and
hot items refer to items with large frequencies (i.e., frequent items). In practice,
most items are cold items, which appear just several times [15], [16]
be filled up. Instead of recording all items, our solution is to
just record the frequencies of some items in the sliding window.
Rather than using existing sliding window algorithms [19], [20],
[21], this paper designs an LRU queue working together with
Space-Saving because of the following reasons. First, our LRU
queue is elastic: users can dynamically tune its memory usage
to maintain a satisfactory accuracy. Second, our LRU queue
has elegant theoretical guarantees (see details in Section III-C).
Third, our LRU queue can be naturally integrated into the
data structure of Space-Saving (see details in Section III-D):
such combination achieves higher accuracy and higher speed.
Our combination is faster because the LRU queue efficiently
filters most cold items, and thus the complicated replacement
operations incurred by cold items are avoided (see Fig. 15(c)).
Actually, besides the application of periodic batch detection, our
LRU queue can improve the accuracy/speed of any streaming
algorithms. We can handle any case that Cold filter can handle,
and we are both time- and space- more efficient than Cold filter
(Section V-B). All related codes are open-sourced [22].
C. Key Contributions
r
We formulate the problem of finding periodic batches in
data streams. We believe this is an important problem.
r
We propose an accurate, fast, and memory efficient Hyper-
Calm sketch to detect periodic batches in real time. Both
the two components of HyperCalm, HyperBF and CalmSS,
significantly outperform the SOTA in detecting batches and
finding top-k items, respectively.
r
We derive theoretical guarantees for our HyperBF and
CalmSS, and validate our theories using experiments.
r
We conduct extensive experiments, and the results show
that HyperCalm outperforms the strawman solutions 4×
in term of error and 98.1× in term of speed.
r
We apply HyperCalm to the scenarios of cache and network
measurement, showing that periodic batches can benefit
real-world application. We also integrate HyperCalm into
Apache Flink [23] and Redis [24].
II. B
ACKGROUND AND RELATED WORK
A. Problem Statement
Batches: A data stream is an infinite sequence of items where
each item is associated with a timestamp. A batch is defined as
a group of identical items in the data stream, where the time
gap between two adjacent batches of the same item must exceed
a predefined threshold T . For convenience, in this paper, two
adjacent batches mean two batches belong to the same item by
default. The arrival time of a batch is defined as the timestamp
of the first item of this batch. We define the interval/time gap
between two adjacent batches as the interval between their
arrival times.
Periodic batches: A group of periodic batches refers to α
consecutive batches of the same item, where these batches arrive
with a fixed time interval. We call α the periodicity. Here, the
“fixed time interval” is not the exact time, but the approximate
(noise-tolerant) time rounded up to the nearest time unit (e.g.,
one millisecond). Finding top-k periodic batches refers to report-
ing k groups of periodic batches with the k largest periodicities.
Note that one item may have more than one group of periodic
batches, and thus can be reported more than once.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.
文档被以下合辑收录
相关文档
评论