暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
A刊-A_Unified_Framework_for_Mining_Batch_and_Periodic_Batch_in_Data_Streams批处理.pdf
10
18页
0次
2024-11-27
免费下载
5544 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 11, NOVEMBER 2024
A Unified Framework for Mining Batch and Periodic
Batch in Data Streams
Zirui Liu , Xiangyuan Wang , Yuhan Wu , Tong Yang , Member, IEEE, Kaicheng Yang , Hailin Zhang ,
Yaofeng Tu, and Bin Cui
, Senior Member, IEEE
AbstractBatch is an important pattern in data streams, which
refers to a group of identical items that arrive closely. We find that
some special batches that arrive periodically are of great value.
In this paper, we formally define a new pattern, namely periodic
batches. A group of periodic batches refers to several batches of
the same item, where these batches arrive periodically. Studying
periodic batches is important in many applications, such as caches,
financial markets, online advertisements, networks, etc. This paper
proposes a unified framework, namely the HyperCalm sketch, to
detect batch and periodic batch in data streams. HyperCalm sketch
takes two phases to detect periodic batches. In phase 1, we propose
a time-aware Bloom filter, called HyperBloomFilter (HyperBF), to
detect batches. In phase 2, we propose an enhanced top-k algo-
rithm, called Calm Space-Saving (CalmSS), to report top-k periodic
batches. Extensive experiments show HyperCalm outperforms the
strawman solutions 4× in term of average relative error and 98.1×
in term of speed. All related codes are open-sourced.
Index Terms—Data mining, data stream, sketch, periodic batch.
I. INTRODUCTION
A. Background and Motivation
B
ATCH is an important pattern in data streams [2], which is
a group of identical items that arrive closely. Two adjacent
batches of the same item are spaced by a minimum interval
T , where T is a predefined threshold. Although batches can
make a difference in various applications, such as cache [2],
networks [3], and machine learning [4], [5], it is not enough to
just study batches. For instance, in cache systems, with just the
measurement results of batches, we are still not able to devise
Manuscript received 18 July 2023; revised 16 March 2024; accepted 6 May
2024. Date of publication 9 May 2024; date of current version 27 September
2024. This work was supported in part by Key-Area Research and Development
Program of Guangdong Province under Grant 2020B0101390001 and in part by
National Natural Science Foundation of China (NSFC) under Grant U20A20179
and Grant 61832001. An earlier version of this paper titled “HyperCalm Sketch:
One-Pass Mining Periodic Batches in Data Streams” is published in the 39th
IEEE International Conference on Data Engineering (IEEE ICDE 2023) [doi:
10.1109/ICDE55515.2023.00009]. Recommended for acceptance by B. Glavic.
(Corresponding author: Tong Yang.)
Zirui Liu, Xiangyuan Wang, Yuhan Wu, Kaicheng Yang, Hailin Zhang, and
Bin Cui are with the National Key Laboratory for Multimedia Information
Processing, School of Computer Science, Peking University, Beijing 100871,
China.
Tong Yang is with the National Key Laboratory for Multimedia Information
Processing, School of Computer Science, Peking University, Beijing 100871,
China, and also with the Peng Cheng Laboratory, Shenzhen 518066, China
(e-mail: yangtong@pku.edu.cn).
Yaofeng Tu is with the ZTE Corporation, Beijing 100728, China.
Digital Object Identifier 10.1109/TKDE.2024.3399024
any prefetching method and replacement policy. Further mining
some special patterns of batches is of great importance. On the
basis of batches, we propose a new pattern, namely periodic
batch. A group of periodic batches refers to α consecutive
batches of the same item, where these batches arrive periodically.
We call α the periodicity. Finding top-k periodic batches refers
to reporting k groups of periodic batches with the k largest
periodicities.
Studying top-k periodic batches is important in practice. For
example, consider a cache stream formed by many memory
access requests where each request is an item, periodic batches
provide insights to improve the cache hit rate. With the historical
information of periodic batches, we can forecast the arrival time
of new batches, and prefetch the item into cache just before its
arrival. For another example, in financial transaction streams, pe-
riodic transaction batches could be an indicator of illegal market
manipulation [6]. By detecting periodic batches in real time, we
can quickly find those suspicious clients that might be laundering
money. Periodic batches are also helpful in recommendation
systems and online advertisements, where the data stream is
generated when users click or purchase different commodities. A
batch forms when users continuously click or purchase the same
type of commodities. In this scenario, periodic batches imply
users’ seasonal and periodic browsing or buying behaviors [7]
(e.g., Christmas buying patterns that repeat yearly, or seasonal
promotion-related user behaviors). Studying periodic batches
can help us to better understand customer behavior, so that we
can deliver appropriate advertisements promptly to customers.
In addition, periodic batches are also important in networks.
In network stream, most TCP senders tend to send packets in
periodic batches [8]. If we can forecast the arrival time of future
batches, we can pre-allocate resources to them, or devise better
strategies for load balancing. To our knowledge, there is no
existing work studying periodic batches, and we are the first
to formulate and address this problem.
Finding periodic batches is a challenging issue. First, finding
batches is already a challenging issue. Until now, the state-of-
the-art solution to detect batches is Clock-Sketch [2], which
records the last arrival time of recent items in a cyclic array,
and uses another thread to clean the outdated information using
CLOCK [9] algorithm. However, to achieve high accuracy, it
needs to scan the cyclic array very fast, which consumes a lot of
CPU resources. Second, periodic batch is a more fine-grained
definition, and thus finding periodic batches is more challenging
than just finding batches. The goal of this paper is to design
a compact sketch algorithm that can accurately find periodic
batches with small space- and time- overhead.
1041-4347 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: UNIFIED FRAMEWORK FOR MINING BATCH AND PERIODIC BATCH IN DATA STREAMS 5545
B. Our Proposed Solution
This paper proposes a unified framework, called HyperCalm
sketch, to detect batch and periodic batch in data streams in real
time. HyperCalm takes two phases to find top-k periodic batches.
In phase 1, for each item e arriving at time t, we check whether
it is the start of a batch. If so, we query a TimeRecorder queue
to get the arrival time
ˆ
t of the last batch of e, and calculate the
batch interval V = t
ˆ
t. Then we send this batch and its interval
e, V to the second phase. In phase 2, we check periodicity and
manage to record top-k periodic batches, i.e., top-k e, V pairs.
In phase 1, we devise a better algorithm than the s tate-of-the-art
algorithm for detecting batches, Clock-Sketch [2]. In phase 2,
we propose an enhanced top-k algorithm, which naturally suits
our periodic batch detection scenario.
In phase 1, we propose a time-aware version of Bloom fil-
ter, namely HyperBloomFilter (HyperBF for short), to detect
batches. For each incoming item, phase 1 should report whether
the item is the start of a batch. In other words, this is an existence
detection algorithm. In addition, phase 1 should be aware of the
item arrival time to divide a series of the same item into many
batches. Bloom filter [10] is the most well-known memory-
efficient data structure used for existence detection. However,
the existence detection of Bloom filter is only low-dimensional,
i.e., it is agnostic to time dimension. Typical work aware of time
dimension is Persistent Bloom filter (PBF) [11].Itisanelegant
variant of Bloom filter, which uses a set of carefully constructed
Bloom filters to support membership testing for temporal queries
(MTTQ) (e.g., has a person visited a website between 8:30pm
and 8:40pm?). MTTQ and batch detection are different ways
to be aware of time dimension. To enable Bloom filter to be
aware of time, our HyperBF extends each bit in Bloom filter
into a 2-bit cell, doubling the memory usage. Compared to the
standard Bloom filter, HyperBF has the same number of hash
computations and memory accesses for each insertion and query.
The only overhead for time awareness is doubling the memory
usage, which is reasonable and acceptable.
In phase 2, we propose an enhanced top-k algorithm, called
Calm Space-Saving (CalmSS for short), to report top-k periodic
batches. For each incoming batch and its interval, i.e., e, V ,
phase 2 should keep periodic batches with large periodicities,
and evict periodic batches with small periodicities. In other
words, phase 2 keeps frequent e, V pairs, and evicts infre-
quent e, V pairs, which is is a top-k algorithm. Typical top-k
algorithms include Space-Saving [12], Unbiased Space-Saving
[13], and Frequent [14]
. However, their accuracy is significantly
harmed by cold items.
1
This problem is more serious in our
scenario of periodic batch detection. This is because one in-
frequent item may have multiple batches, and one frequent item
may also have multiple batches without periodicity. Both the two
cases above increase t he number of cold e, V pairs. To i dentify
and discard cold items, Cold Filter [17] and LogLogFilter [18]
record the frequencies of all items in a compact data structure.
However, considering the large volume of data stream, this
structure will be filled up very quickly, and needs to be cleaned
up periodically. To ensure the one-pass property of our solution,
it is highly desired to devise a data structure which will never
1
Cold items refer to items with small frequencies (i.e., infrequent items), and
hot items refer to items with large frequencies (i.e., frequent items). In practice,
most items are cold items, which appear just several times [15], [16]
be filled up. Instead of recording all items, our solution is to
just record the frequencies of some items in the sliding window.
Rather than using existing sliding window algorithms [19], [20],
[21], this paper designs an LRU queue working together with
Space-Saving because of the following reasons. First, our LRU
queue is elastic: users can dynamically tune its memory usage
to maintain a satisfactory accuracy. Second, our LRU queue
has elegant theoretical guarantees (see details in Section III-C).
Third, our LRU queue can be naturally integrated into the
data structure of Space-Saving (see details in Section III-D):
such combination achieves higher accuracy and higher speed.
Our combination is faster because the LRU queue efficiently
filters most cold items, and thus the complicated replacement
operations incurred by cold items are avoided (see Fig. 15(c)).
Actually, besides the application of periodic batch detection, our
LRU queue can improve the accuracy/speed of any streaming
algorithms. We can handle any case that Cold filter can handle,
and we are both time- and space- more efficient than Cold filter
(Section V-B). All related codes are open-sourced [22].
C. Key Contributions
r
We formulate the problem of finding periodic batches in
data streams. We believe this is an important problem.
r
We propose an accurate, fast, and memory efficient Hyper-
Calm sketch to detect periodic batches in real time. Both
the two components of HyperCalm, HyperBF and CalmSS,
significantly outperform the SOTA in detecting batches and
finding top-k items, respectively.
r
We derive theoretical guarantees for our HyperBF and
CalmSS, and validate our theories using experiments.
r
We conduct extensive experiments, and the results show
that HyperCalm outperforms the strawman solutions 4×
in term of error and 98.1× in term of speed.
r
We apply HyperCalm to the scenarios of cache and network
measurement, showing that periodic batches can benefit
real-world application. We also integrate HyperCalm into
Apache Flink [23] and Redis [24].
II. B
ACKGROUND AND RELATED WORK
A. Problem Statement
Batches: A data stream is an infinite sequence of items where
each item is associated with a timestamp. A batch is defined as
a group of identical items in the data stream, where the time
gap between two adjacent batches of the same item must exceed
a predefined threshold T . For convenience, in this paper, two
adjacent batches mean two batches belong to the same item by
default. The arrival time of a batch is defined as the timestamp
of the first item of this batch. We define the interval/time gap
between two adjacent batches as the interval between their
arrival times.
Periodic batches: A group of periodic batches refers to α
consecutive batches of the same item, where these batches arrive
with a fixed time interval. We call α the periodicity. Here, the
“fixed time interval” is not the exact time, but the approximate
(noise-tolerant) time rounded up to the nearest time unit (e.g.,
one millisecond). Finding top-k periodic batches refers to report-
ing k groups of periodic batches with the k largest periodicities.
Note that one item may have more than one group of periodic
batches, and thus can be reported more than once.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.
5546 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 11, NOVEMBER 2024
Fig. 1. Example of periodic batches.
Example (Fig. 1): We use an example to further clarify our
problem definition. We focus on two distinct items e
1
and e
2
in
the data stream. For e
1
, its 6 batches form a group of periodic
batches. For e
2
, it has two groups of periodic batches, with the
periodicities of 4 and 5, respectively. Note that some batches of
e
2
just have one item.
Discussion: The definition of periodic batches is a design
choice related to final application. We think our definition of
periodic batches is most general, which can benefit many real-
world applications (see Section V-H as an example). However,
certain application may also care about other aspects of periodic
batches, such as batch size and distance. For example, some
application may just want to detect those periodic batches that
are large enough in size. It is not hard to detect those variants of
periodic batches by adding small modification to our solution.
For example, we will discuss how to detect periodic large batches
in Section III-G and demonstrate its application in Section V- I .
B. Related Work
Related work is divided into three parts: 1) algorithms for
batch detection; 2) algorithms for finding top-k frequent items;
and 3) algorithms for mining periodic patterns.
Batch detection: Item batch is defined very recently in [2],
which proposes Clock-Sketch to find batches. Clock-Sketch
consists of an array of s-bits cells. For each incoming item,
it sets the d hashed cells as 2
s
1. For query, if one of the
d hashed cells is zero, it reports a batch. Clock-Sketch uses an
extra thread to cyclically sweep the cell array at a constant speed
and decreases the swept non-zero cells by one. The sweeping
speed is carefully selected to avoid false-positive errors. Besides
Clock-Sketch, some sliding window algorithms can be applied
to find batches, including Time-Out Bloom Filter (TOBF) [25]
and SWAMP [26].
Finding top-k frequent items: To find top-k frequent items
in data streams, existing approaches maintain a synopsis data
structure. There are two kinds of synopses: sketches and KV
tables. 1) Sketches usually consist of multiple arrays, each of
which consists of multiple counters. These counters are used to
record the frequencies of the inserted items. Typical sketches
include CM [27],CU[28], Count [29], and more [30], [31],
[32], [33], [34], [35], [36], [37] , [38], [39], [40], [41], [42], [43].
However, sketches are memory inefficient because they record
the frequencies of all items, which is actually unnecessary. 2) KV
tables record only the frequent items. Typical KV table based
approaches include Space-Saving [12], Unbiased Space-Saving
[13], Frequent [14], and more [44]. Space-Saving and Unbiased
Space-Saving record the approximate top-k items in a data
structure called Stream-Summary. However, their accuracy is
Fig. 2. HyperCalm sketch workflow.
significantly degraded by cold items. To address this issue, Cold
Filter [17] uses a two-layer CU sketch to filter cold items.
However, as aforementioned, the structure of Cold Filter will
be filled up very quickly, and cleaning the full Cold Filter will
inevitably incur error and time overhead.
Mining periodic patterns: Although there have been some
algorithms aiming at mining periodicity in time s equence data
[45], [46], [47], [48], [49], [50], [51], their problem definitions
are different from ours. More importantly, most of them do not
meet the requirements of data stream model processing: 1) each
item can only be processed once; 2) the processing time of
each item should be O(1) complexity and fast enough to catch
up with the high speed of data streams. For example, TiCom
[48] defines a periodical problem in an incomplete sequence
data, and develops an iterative algorithm with time complexity
of O(n
2
). RobustPeriod [46] proposes an algorithm based on
discrete wavelet transform with time complexity of O(n log n).
Further, there are some works which elegantly use Fast Fourier
Transform (FFT) or Auto Correlation Function (ACF) to address
different definitions of periodic items, such as SAZED [49].
These algorithms need to process one item multiple times, and
thus cannot meet the above two requirements.
III. T
HE HYPERCALM SKETCH
Overview (Fig. 2): The workflow of the HyperCalm sketch
consists of two phases: 1) A HyperBloomFilter (HyperBF)
detecting the start of batches; and 2) A Calm Space-Saving
(CalmSS) recording and reporting top-k periodic batches. In
addition, we design a TimeRecorder queue to record the last
batch arrival time for potential periodic batches. Given an in-
coming item e arriving at time t, we first propose HyperBF to
check whether it is the start of a batch. If so, we query the
TimeRecorder queue to get the arrival time
ˆ
t of the last batch of
e and calculate the batch interval V = t
ˆ
t.
2
Then we update
the arrival time of last batch of e in the TimeRecorder queue to
t. Next, we send e and its batch interval V to CalmSS to detect
top-k periodic batches. We combine the ID of item e and its
interval V to form an entry E = e, V , and insert the entry into
CalmSS. CalmSS reports k groups of periodic batches with the k
largest periodicities, i.e., reports top-k entries with the k largest
frequencies, where each entry is an e, V pair. We will discuss
2
To tolerate noise in batch interval, V is rounded up according to the
regulations described in the parameter setting part of Section V-C.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.
of 18
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。