A刊-A_Unified_Framework_for_Mining_Batch_and_Periodic_Batch_in_Data_Streams批处理.pdf

观海

18页

0次

2024-11-27

免费下载

5544 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 11, NOVEMBER 2024

A Uniﬁed Framework for Mining Batch and Periodic

Batch in Data Streams

Zirui Liu , Xiangyuan Wang , Yuhan Wu , Tong Yang , Member, IEEE, Kaicheng Yang , Hailin Zhang ,

Yaofeng Tu, and Bin Cui

, Senior Member, IEEE

Abstract—Batch is an important pattern in data streams, which

refers to a group of identical items that arrive closely. We ﬁnd that

some special batches that arrive periodically are of great value.

In this paper, we formally deﬁne a new pattern, namely periodic

batches. A group of periodic batches refers to several batches of

the same item, where these batches arrive periodically. Studying

periodic batches is important in many applications, such as caches,

ﬁnancial markets, online advertisements, networks, etc. This paper

proposes a uniﬁed framework, namely the HyperCalm sketch, to

detect batch and periodic batch in data streams. HyperCalm sketch

takes two phases to detect periodic batches. In phase 1, we propose

a time-aware Bloom ﬁlter, called HyperBloomFilter (HyperBF), to

detect batches. In phase 2, we propose an enhanced top-k algo-

rithm, called Calm Space-Saving (CalmSS), to report top-k periodic

batches. Extensive experiments show HyperCalm outperforms the

strawman solutions 4× in term of average relative error and 98.1×

in term of speed. All related codes are open-sourced.

Index Terms—Data mining, data stream, sketch, periodic batch.

I. INTRODUCTION

A. Background and Motivation

ATCH is an important pattern in data streams [2], which is

a group of identical items that arrive closely. Two adjacent

batches of the same item are spaced by a minimum interval

T , where T is a predeﬁned threshold. Although batches can

make a difference in various applications, such as cache [2],

networks [3], and machine learning [4], [5], it is not enough to

just study batches. For instance, in cache systems, with just the

measurement results of batches, we are still not able to devise

Manuscript received 18 July 2023; revised 16 March 2024; accepted 6 May

2024. Date of publication 9 May 2024; date of current version 27 September

2024. This work was supported in part by Key-Area Research and Development

Program of Guangdong Province under Grant 2020B0101390001 and in part by

National Natural Science Foundation of China (NSFC) under Grant U20A20179

and Grant 61832001. An earlier version of this paper titled “HyperCalm Sketch:

One-Pass Mining Periodic Batches in Data Streams” is published in the 39th

IEEE International Conference on Data Engineering (IEEE ICDE 2023) [doi:

10.1109/ICDE55515.2023.00009]. Recommended for acceptance by B. Glavic.

(Corresponding author: Tong Yang.)

Zirui Liu, Xiangyuan Wang, Yuhan Wu, Kaicheng Yang, Hailin Zhang, and

Bin Cui are with the National Key Laboratory for Multimedia Information

Processing, School of Computer Science, Peking University, Beijing 100871,

China.

Tong Yang is with the National Key Laboratory for Multimedia Information

Processing, School of Computer Science, Peking University, Beijing 100871,

China, and also with the Peng Cheng Laboratory, Shenzhen 518066, China

(e-mail: yangtong@pku.edu.cn).

Yaofeng Tu is with the ZTE Corporation, Beijing 100728, China.

Digital Object Identiﬁer 10.1109/TKDE.2024.3399024

any prefetching method and replacement policy. Further mining

some special patterns of batches is of great importance. On the

basis of batches, we propose a new pattern, namely periodic

batch. A group of periodic batches refers to α consecutive

batches of the same item, where these batches arrive periodically.

We call α the periodicity. Finding top-k periodic batches refers

to reporting k groups of periodic batches with the k largest

periodicities.

Studying top-k periodic batches is important in practice. For

example, consider a cache stream formed by many memory

access requests where each request is an item, periodic batches

provide insights to improve the cache hit rate. With the historical

information of periodic batches, we can forecast the arrival time

of new batches, and prefetch the item into cache just before its

arrival. For another example, in ﬁnancial transaction streams, pe-

riodic transaction batches could be an indicator of illegal market

manipulation [6]. By detecting periodic batches in real time, we

can quickly ﬁnd those suspicious clients that might be laundering

money. Periodic batches are also helpful in recommendation

systems and online advertisements, where the data stream is

generated when users click or purchase different commodities. A

batch forms when users continuously click or purchase the same

type of commodities. In this scenario, periodic batches imply

users’ seasonal and periodic browsing or buying behaviors [7]

(e.g., Christmas buying patterns that repeat yearly, or seasonal

promotion-related user behaviors). Studying periodic batches

can help us to better understand customer behavior, so that we

can deliver appropriate advertisements promptly to customers.

In addition, periodic batches are also important in networks.

In network stream, most TCP senders tend to send packets in

periodic batches [8]. If we can forecast the arrival time of future

batches, we can pre-allocate resources to them, or devise better

strategies for load balancing. To our knowledge, there is no

existing work studying periodic batches, and we are the ﬁrst

to formulate and address this problem.

Finding periodic batches is a challenging issue. First, ﬁnding

batches is already a challenging issue. Until now, the state-of-

the-art solution to detect batches is Clock-Sketch [2], which

records the last arrival time of recent items in a cyclic array,

and uses another thread to clean the outdated information using

CLOCK [9] algorithm. However, to achieve high accuracy, it

needs to scan the cyclic array very fast, which consumes a lot of

CPU resources. Second, periodic batch is a more ﬁne-grained

deﬁnition, and thus ﬁnding periodic batches is more challenging

than just ﬁnding batches. The goal of this paper is to design

a compact sketch algorithm that can accurately ﬁnd periodic

batches with small space- and time- overhead.

See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.

LIU et al.: UNIFIED FRAMEWORK FOR MINING BATCH AND PERIODIC BATCH IN DATA STREAMS 5545

B. Our Proposed Solution

This paper proposes a uniﬁed framework, called HyperCalm

sketch, to detect batch and periodic batch in data streams in real

time. HyperCalm takes two phases to ﬁnd top-k periodic batches.

In phase 1, for each item e arriving at time t, we check whether

it is the start of a batch. If so, we query a TimeRecorder queue

to get the arrival time

t of the last batch of e, and calculate the

batch interval V = t −

t. Then we send this batch and its interval

e, V  to the second phase. In phase 2, we check periodicity and

manage to record top-k periodic batches, i.e., top-k e, V  pairs.

In phase 1, we devise a better algorithm than the s tate-of-the-art

algorithm for detecting batches, Clock-Sketch [2]. In phase 2,

we propose an enhanced top-k algorithm, which naturally suits

our periodic batch detection scenario.

In phase 1, we propose a time-aware version of Bloom ﬁl-

ter, namely HyperBloomFilter (HyperBF for short), to detect

batches. For each incoming item, phase 1 should report whether

the item is the start of a batch. In other words, this is an existence

detection algorithm. In addition, phase 1 should be aware of the

item arrival time to divide a series of the same item into many

batches. Bloom ﬁlter [10] is the most well-known memory-

efﬁcient data structure used for existence detection. However,

the existence detection of Bloom ﬁlter is only low-dimensional,

i.e., it is agnostic to time dimension. Typical work aware of time

dimension is Persistent Bloom ﬁlter (PBF) [11].Itisanelegant

variant of Bloom ﬁlter, which uses a set of carefully constructed

Bloom ﬁlters to support membership testing for temporal queries

(MTTQ) (e.g., has a person visited a website between 8:30pm

and 8:40pm?). MTTQ and batch detection are different ways

to be aware of time dimension. To enable Bloom ﬁlter to be

aware of time, our HyperBF extends each bit in Bloom ﬁlter

into a 2-bit cell, doubling the memory usage. Compared to the

standard Bloom ﬁlter, HyperBF has the same number of hash

computations and memory accesses for each insertion and query.

The only overhead for time awareness is doubling the memory

usage, which is reasonable and acceptable.

In phase 2, we propose an enhanced top-k algorithm, called

Calm Space-Saving (CalmSS for short), to report top-k periodic

batches. For each incoming batch and its interval, i.e., e, V ,

phase 2 should keep periodic batches with large periodicities,

and evict periodic batches with small periodicities. In other

words, phase 2 keeps frequent e, V  pairs, and evicts infre-

quent e, V  pairs, which is is a top-k algorithm. Typical top-k

algorithms include Space-Saving [12], Unbiased Space-Saving

[13], and Frequent [14]

. However, their accuracy is signiﬁcantly

harmed by cold items.

This problem is more serious in our

scenario of periodic batch detection. This is because one in-

frequent item may have multiple batches, and one frequent item

may also have multiple batches without periodicity. Both the two

cases above increase t he number of cold e, V  pairs. To i dentify

and discard cold items, Cold Filter [17] and LogLogFilter [18]

record the frequencies of all items in a compact data structure.

However, considering the large volume of data stream, this

structure will be ﬁlled up very quickly, and needs to be cleaned

up periodically. To ensure the one-pass property of our solution,

it is highly desired to devise a data structure which will never

Cold items refer to items with small frequencies (i.e., infrequent items), and

hot items refer to items with large frequencies (i.e., frequent items). In practice,

most items are cold items, which appear just several times [15], [16]

be ﬁlled up. Instead of recording all items, our solution is to

just record the frequencies of some items in the sliding window.

Rather than using existing sliding window algorithms [19], [20],

[21], this paper designs an LRU queue working together with

Space-Saving because of the following reasons. First, our LRU

queue is elastic: users can dynamically tune its memory usage

to maintain a satisfactory accuracy. Second, our LRU queue

has elegant theoretical guarantees (see details in Section III-C).

Third, our LRU queue can be naturally integrated into the

data structure of Space-Saving (see details in Section III-D):

such combination achieves higher accuracy and higher speed.

Our combination is faster because the LRU queue efﬁciently

ﬁlters most cold items, and thus the complicated replacement

operations incurred by cold items are avoided (see Fig. 15(c)).

Actually, besides the application of periodic batch detection, our

LRU queue can improve the accuracy/speed of any streaming

algorithms. We can handle any case that Cold ﬁlter can handle,

and we are both time- and space- more efﬁcient than Cold ﬁlter

(Section V-B). All related codes are open-sourced [22].

C. Key Contributions

We formulate the problem of ﬁnding periodic batches in

data streams. We believe this is an important problem.

We propose an accurate, fast, and memory efﬁcient Hyper-

Calm sketch to detect periodic batches in real time. Both

the two components of HyperCalm, HyperBF and CalmSS,

signiﬁcantly outperform the SOTA in detecting batches and

ﬁnding top-k items, respectively.

We derive theoretical guarantees for our HyperBF and

CalmSS, and validate our theories using experiments.

We conduct extensive experiments, and the results show

that HyperCalm outperforms the strawman solutions 4×

in term of error and 98.1× in term of speed.

We apply HyperCalm to the scenarios of cache and network

measurement, showing that periodic batches can beneﬁt

real-world application. We also integrate HyperCalm into

Apache Flink [23] and Redis [24].

II. B

ACKGROUND AND RELATED WORK

A. Problem Statement

Batches: A data stream is an inﬁnite sequence of items where

each item is associated with a timestamp. A batch is deﬁned as

a group of identical items in the data stream, where the time

gap between two adjacent batches of the same item must exceed

a predeﬁned threshold T . For convenience, in this paper, two

adjacent batches mean two batches belong to the same item by

default. The arrival time of a batch is deﬁned as the timestamp

of the ﬁrst item of this batch. We deﬁne the interval/time gap

between two adjacent batches as the interval between their

arrival times.

Periodic batches: A group of periodic batches refers to α

consecutive batches of the same item, where these batches arrive

with a ﬁxed time interval. We call α the periodicity. Here, the

“ﬁxed time interval” is not the exact time, but the approximate

(noise-tolerant) time rounded up to the nearest time unit (e.g.,

one millisecond). Finding top-k periodic batches refers to report-

ing k groups of periodic batches with the k largest periodicities.

Note that one item may have more than one group of periodic

batches, and thus can be reported more than once.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.

5546 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 36, NO. 11, NOVEMBER 2024

Fig. 1. Example of periodic batches.

Example (Fig. 1): We use an example to further clarify our

problem deﬁnition. We focus on two distinct items e

and e

the data stream. For e

, its 6 batches form a group of periodic

batches. For e

, it has two groups of periodic batches, with the

periodicities of 4 and 5, respectively. Note that some batches of

just have one item.

Discussion: The deﬁnition of periodic batches is a design

choice related to ﬁnal application. We think our deﬁnition of

periodic batches is most general, which can beneﬁt many real-

world applications (see Section V-H as an example). However,

certain application may also care about other aspects of periodic

batches, such as batch size and distance. For example, some

application may just want to detect those periodic batches that

are large enough in size. It is not hard to detect those variants of

periodic batches by adding small modiﬁcation to our solution.

For example, we will discuss how to detect periodic large batches

in Section III-G and demonstrate its application in Section V- I .

B. Related Work

Related work is divided into three parts: 1) algorithms for

batch detection; 2) algorithms for ﬁnding top-k frequent items;

and 3) algorithms for mining periodic patterns.

Batch detection: Item batch is deﬁned very recently in [2],

which proposes Clock-Sketch to ﬁnd batches. Clock-Sketch

consists of an array of s-bits cells. For each incoming item,

it sets the d hashed cells as 2

− 1. For query, if one of the

d hashed cells is zero, it reports a batch. Clock-Sketch uses an

extra thread to cyclically sweep the cell array at a constant speed

and decreases the swept non-zero cells by one. The sweeping

speed is carefully selected to avoid false-positive errors. Besides

Clock-Sketch, some sliding window algorithms can be applied

to ﬁnd batches, including Time-Out Bloom Filter (TOBF) [25]

and SWAMP [26].

Finding top-k frequent items: To ﬁnd top-k frequent items

in data streams, existing approaches maintain a synopsis data

structure. There are two kinds of synopses: sketches and KV

tables. 1) Sketches usually consist of multiple arrays, each of

which consists of multiple counters. These counters are used to

record the frequencies of the inserted items. Typical sketches

include CM [27],CU[28], Count [29], and more [30], [31],

[32], [33], [34], [35], [36], [37] , [38], [39], [40], [41], [42], [43].

However, sketches are memory inefﬁcient because they record

the frequencies of all items, which is actually unnecessary. 2) KV

tables record only the frequent items. Typical KV table based

approaches include Space-Saving [12], Unbiased Space-Saving

[13], Frequent [14], and more [44]. Space-Saving and Unbiased

Space-Saving record the approximate top-k items in a data

structure called Stream-Summary. However, their accuracy is

Fig. 2. HyperCalm sketch workﬂow.

signiﬁcantly degraded by cold items. To address this issue, Cold

Filter [17] uses a two-layer CU sketch to ﬁlter cold items.

However, as aforementioned, the structure of Cold Filter will

be ﬁlled up very quickly, and cleaning the full Cold Filter will

inevitably incur error and time overhead.

Mining periodic patterns: Although there have been some

algorithms aiming at mining periodicity in time s equence data

[45], [46], [47], [48], [49], [50], [51], their problem deﬁnitions

are different from ours. More importantly, most of them do not

meet the requirements of data stream model processing: 1) each

item can only be processed once; 2) the processing time of

each item should be O(1) complexity and fast enough to catch

up with the high speed of data streams. For example, TiCom

[48] deﬁnes a periodical problem in an incomplete sequence

data, and develops an iterative algorithm with time complexity

of O(n

). RobustPeriod [46] proposes an algorithm based on

discrete wavelet transform with time complexity of O(n log n).

Further, there are some works which elegantly use Fast Fourier

Transform (FFT) or Auto Correlation Function (ACF) to address

different deﬁnitions of periodic items, such as SAZED [49].

These algorithms need to process one item multiple times, and

thus cannot meet the above two requirements.

III. T

HE HYPERCALM SKETCH

Overview (Fig. 2): The workﬂow of the HyperCalm sketch

consists of two phases: 1) A HyperBloomFilter (HyperBF)

detecting the start of batches; and 2) A Calm Space-Saving

(CalmSS) recording and reporting top-k periodic batches. In

addition, we design a TimeRecorder queue to record the last

batch arrival time for potential periodic batches. Given an in-

coming item e arriving at time t, we ﬁrst propose HyperBF to

check whether it is the start of a batch. If so, we query the

TimeRecorder queue to get the arrival time

t of the last batch of

e and calculate the batch interval V = t −

Then we update

the arrival time of last batch of e in the TimeRecorder queue to

t. Next, we send e and its batch interval V to CalmSS to detect

top-k periodic batches. We combine the ID of item e and its

interval V to form an entry E = e, V , and insert the entry into

CalmSS. CalmSS reports k groups of periodic batches with the k

largest periodicities, i.e., reports top-k entries with the k largest

frequencies, where each entry is an e, V  pair. We will discuss

To tolerate noise in batch interval, V is rounded up according to the

regulations described in the parameter setting part of Section V-C.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 05:53:41 UTC from IEEE Xplore. Restrictions apply.

of 18

免费下载

goldendb

文档被以下合辑收录

GoldenDB论文（共12篇）

GoldenDB论文

文档被以下合辑收录

相关文档

评论