
queue to get the arrival time t̂ of the last batch of e, and calculate the batch interval V = t − t̂. Then we send this batch and its interval ⟨e, V⟩ to the second phase. In phase 2, we check periodicity and manage to record top-k periodic batches, i.e., top-k ⟨e, V⟩ pairs. In phase 1, we devise a batch detection algorithm that outperforms the state-of-the-art, Clock-Sketch [1]. In phase 2, we propose an
enhanced top-k algorithm, which naturally suits our periodic
batch detection scenario.
In phase 1, we propose a time-aware version of Bloom
filter, namely HyperBloomFilter (HyperBF for short), to detect
batches. For each incoming item, phase 1 should report
whether the item is the start of a batch. In other words, this
is an existence detection algorithm. In addition to existence detection, phase 1 should be aware of arrival times, so as to divide a series of occurrences of the same item into batches. Bloom filter [9]
is the most well-known memory-efficient data structure used
for existence detection. However, the existence detection of Bloom filter is low-dimensional, i.e., it is agnostic to the time dimension. A typical time-aware work is Persistent Bloom filter (PBF) [10], an elegant variant of Bloom filter that uses a set of carefully constructed Bloom filters to support membership testing for temporal queries (MTTQ), e.g., has a person visited a website between 8:30pm and 8:40pm? MTTQ and batch detection are different ways of being aware of the time dimension. To enable Bloom filter to be
aware of time, our HyperBF extends each bit in Bloom filter
into a 2-bit cell, doubling the memory usage. Compared to
the standard Bloom filter, HyperBF has the same number of
hash computations and memory accesses for each insertion
and query. The only overhead for time awareness is doubling
the memory usage, which is reasonable and acceptable.
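To make the idea concrete, here is a minimal Python sketch of a time-aware Bloom filter in the spirit of HyperBF. The specific encoding below (2-bit cells holding an epoch tag that cycles through three values) is an illustrative assumption, not HyperBF's exact design, and all class and parameter names are hypothetical:

```python
class TimeAwareBloomSketch:
    """Toy time-aware Bloom filter: each cell is a 2-bit value holding
    0 (empty) or an epoch tag in {1, 2, 3} instead of a single bit.
    This is an illustrative sketch, not the paper's HyperBF encoding."""

    def __init__(self, num_cells=1024, num_hashes=3, batch_gap=1.0, start_time=0.0):
        self.m = num_cells
        self.k = num_hashes
        self.T = batch_gap            # batch threshold T from the problem setup
        self.cells = [0] * num_cells  # 0 = empty; 1..3 = epoch tag
        self.epoch_start = start_time
        self.epoch = 1

    def _advance_epochs(self, now):
        # Rotate the epoch tag every T time units (tags cycle 1 -> 2 -> 3 -> 1).
        # Cells holding the tag about to be reused are cleared, so an item is
        # forgotten after roughly 2T..3T of silence in this coarse toy encoding.
        while now - self.epoch_start >= self.T:
            self.epoch_start += self.T
            self.epoch = self.epoch % 3 + 1
            self.cells = [0 if c == self.epoch else c for c in self.cells]

    def _positions(self, item):
        return [hash((item, i)) % self.m for i in range(self.k)]

    def insert(self, item, now):
        """Insert `item` at timestamp `now`; return True iff it starts a new batch."""
        self._advance_epochs(now)
        pos = self._positions(item)
        # The item was seen recently iff every hashed cell holds a live tag.
        seen_recently = all(self.cells[p] != 0 for p in pos)
        for p in pos:
            self.cells[p] = self.epoch
        return not seen_recently
```

Note how the sketch preserves the property discussed above: per insertion it performs the same k hash computations and k cell accesses as a standard Bloom filter, and the only extra cost is the second bit per cell.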
In phase 2, we propose an enhanced top-k algorithm,
called Calm Space-Saving (CalmSS for short), to report top-k
periodic batches. For each incoming batch and its interval, i.e.,
⟨e, V⟩, phase 2 should keep periodic batches with large periodicities, and evict periodic batches with small periodicities. In other words, phase 2 keeps frequent ⟨e, V⟩ pairs and evicts infrequent ⟨e, V⟩ pairs, which is a top-k algorithm. Typical
top-k algorithms include Space-Saving [11], Unbiased Space-
Saving [12], and Frequent [13]. However, their accuracy is
significantly harmed by cold items¹. This problem is more
serious in our scenario of periodic batch detection. This is
because one infrequent item may have multiple batches, and
one frequent item may also have multiple batches without
periodicity. Both cases increase the number of cold ⟨e, V⟩ pairs. To identify and discard cold items, Cold
Filter [16] and LogLogFilter [17] record the frequencies of all
items in a compact data structure. However, considering the
large volume of data stream, this structure will be filled up very
quickly, and needs to be cleaned up periodically. To ensure the
one-pass property of our solution, it is highly desired to devise
a data structure which will never be filled up.

¹ Cold items refer to items with small frequencies (i.e., infrequent items), and hot items refer to items with large frequencies (i.e., frequent items). In practice, most items are cold items, which appear just several times [14], [15].

Instead of recording all items, our solution is to record the frequencies of only some items in a sliding window, and to discard the cold items in that window. Rather than using existing sliding window algorithms [18], [19], [20], this paper designs an LRU queue that works together with Space-Saving, for the following reasons. First, our LRU queue is elastic: users can dynamically tune its memory usage to maintain satisfactory accuracy. Second, our LRU queue has elegant theoretical guarantees (see details in § III-C). Third, our LRU queue can be naturally integrated into the data structure of Space-Saving (see details in § III-D): this combination achieves higher accuracy and higher speed. The combination is faster because the LRU queue efficiently filters out most cold items, so the complicated replacement operations they would incur are avoided (see Figure 11c). Beyond periodic batch detection, our LRU queue can improve the accuracy and speed of other streaming algorithms: it can handle any case that Cold Filter can handle, while being more time- and space-efficient than Cold Filter (§ V-C). All related codes are open-sourced [21].
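To illustrate the design, below is a toy Python sketch of an LRU queue filtering cold items in front of a textbook Space-Saving [11]. The promotion threshold, queue size, and the rule for carrying the accumulated count into Space-Saving are illustrative assumptions, not the exact CalmSS design:

```python
from collections import OrderedDict


class SpaceSaving:
    """Textbook Space-Saving: keep at most k counters; on overflow,
    the new item inherits (and increments) the minimum counter."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def insert(self, item, weight=1):
        if item in self.counters:
            self.counters[item] += weight
        elif len(self.counters) < self.k:
            self.counters[item] = weight
        else:
            victim = min(self.counters, key=self.counters.get)
            count = self.counters.pop(victim)
            self.counters[item] = count + weight  # inherit the min counter

    def topk(self):
        return sorted(self.counters.items(), key=lambda kv: -kv[1])


class CalmSketch:
    """Illustrative CalmSS-style pipeline (assumed details): a small LRU
    queue absorbs cold items, and an item reaches Space-Saving only after
    it appears `promote` times while resident in the queue."""

    def __init__(self, k, lru_size=64, promote=2):
        self.ss = SpaceSaving(k)
        self.lru = OrderedDict()      # item -> small count, ordered by recency
        self.cap = lru_size
        self.promote = promote

    def insert(self, item):
        if item in self.ss.counters:  # already tracked as hot: bypass the queue
            self.ss.insert(item)
            return
        cnt = self.lru.pop(item, 0) + 1
        if cnt >= self.promote:       # promote with its accumulated count
            self.ss.insert(item, cnt)
        else:
            self.lru[item] = cnt      # (re)insert at the MRU position
            if len(self.lru) > self.cap:
                self.lru.popitem(last=False)  # evict the least recently used
```

The point of the queue is visible in the code: a cold item that never reaches the promotion threshold is evicted from the cheap LRU queue and never triggers Space-Saving's replacement path.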
C. Key Contributions
• We formulate the problem of finding periodic batches in
data streams. We believe this is an important problem in
data mining.
• We propose an accurate, fast, and memory efficient Hyper-
Calm sketch to detect periodic batches in real time. Both
the two components of HyperCalm, HyperBF and CalmSS,
significantly outperform the state-of-the-art solutions in de-
tecting batches and finding top-k items, respectively.
• We derive theoretical guarantees for our HyperBF and
CalmSS, and validate our theories using experiments.
• We conduct extensive experiments showing that HyperCalm well achieves our design goals. The results show that HyperCalm outperforms the strawman solutions by 4× in terms of average relative error and by 13.2× in terms of processing speed.
• We apply HyperCalm to a cache system, showing that periodic batches can benefit real-world applications. We integrate HyperCalm into Apache Flink [22], showing that HyperCalm works smoothly in distributed systems.
II. BACKGROUND AND RELATED WORK
A. Problem Statement
Batches: A data stream is an infinite sequence of items where
each item is associated with a timestamp. A batch is defined
as a group of identical items in the data stream, where the
time gap between two adjacent batches of the same item must
exceed a predefined threshold T. For convenience, in this paper, two adjacent batches refer to two batches of the same item by default. The arrival time of a batch is defined
as the timestamp of the first item of this batch. We define the
interval/time gap between two adjacent batches as the interval
between their arrival times.
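As a concrete reference for these definitions, the following exact (but memory-unbounded) Python baseline splits a timestamped stream into batches. It is an illustrative baseline only, not the space-efficient algorithm proposed in this paper:

```python
def split_into_batches(stream, T):
    """Group a timestamped stream into batches per the definition above:
    a new batch of item e starts whenever e has not appeared for more
    than T. `stream` is a list of (item, timestamp) pairs in time order.
    Returns (item, batch_arrival_time) pairs, where the arrival time is
    the timestamp of the first item of the batch."""
    last_seen = {}   # item -> timestamp of its most recent occurrence
    batches = []
    for e, t in stream:
        if e not in last_seen or t - last_seen[e] > T:
            batches.append((e, t))
        last_seen[e] = t
    return batches
```

For example, with T = 3, item e arriving at times 0, 1, 7, 8 forms two batches with arrival times 0 and 7, so their interval is 7.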
Periodic batches: A group of periodic batches refers to α
consecutive batches of the same item, where these batches
arrive with a fixed time interval. We call α the periodicity.
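Given the batch arrival times, finding top-k periodic batches reduces to counting ⟨e, V⟩ pairs over consecutive batches of each item. The exact, memory-unbounded baseline below is illustrative rather than the paper's algorithm, and assumes discrete timestamps so that equal intervals compare exactly:

```python
from collections import Counter


def topk_periodic(batch_arrivals, k):
    """Count (item, interval) pairs over consecutive batch arrivals and
    report the k most frequent, i.e., the top-k periodic batches.
    `batch_arrivals` is a list of (item, arrival_time) pairs in time order."""
    last = {}         # item -> arrival time of its previous batch
    pairs = Counter()  # (item, interval) -> number of occurrences
    for e, t in batch_arrivals:
        if e in last:
            pairs[(e, t - last[e])] += 1
        last[e] = t
    return pairs.most_common(k)
```

An item whose batches arrive with a fixed interval V accumulates a large count for its ⟨e, V⟩ pair, which is exactly what CalmSS approximates under a memory budget.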