SIGMOD2014_Matching heterogeneous event data_Xiaochen Zhu, Shaoxu Song, Xiang Lian, Jianmin Wang, Lei Zou.pdf

樱木花道

172

12页

0次

2022-09-21

免费下载

Matching Heterogeneous Event Data

Xiaochen Zhu

Shaoxu Song

Xiang Lian

†

Jianmin Wang

Lei Zou

‡

KLiss, MoE; TNList; School of Software, Tsinghua University, China

zhu-xc10@mails.tsinghua.edu.cn {sxsong, jimwang}@tsinghua.edu.cn

†

University of Texas - Pan American, USA lianx@utpa.edu

‡

Peking University, China zoulei@pku.edu.cn

ABSTRACT

Identifying duplicate events are essential to various business

process applications such as provenance querying or process

mining. Distinct features of heterogeneous events includ-

ing opaque names, dislocated traces and composite events,

prevent existing data integration from techniques perform-

ing well. To address these issues, in this paper, we propose

an event similarity function by iteratively evaluating similar

neighbors. We prove the convergence of iterative similarity

computation, and propose several pruning and estimation

methods. To eﬃciently support matching composite events,

we devise upper bounds of event similarities. Experiments

on real and synthetic datasets demonstrate that the pro-

posed event matching approaches can achieve signiﬁcantly

higher accuracy than the state-of-the-art matching methods.

Categories and Subject Descriptors

H.2.5 [Database Management]: Heterogeneous Databases

Keywords

Schema matching; event matching

1. INTRODUCTION

Owing to various mergers and acquisitions, information

systems (e.g., Enterprise Resource Planning (ERP) and Of-

ﬁce Automation (OA) systems), probably developed inde-

pendently in diﬀerent branches or subsidiaries in large-scale

corporations, keep on generating heterogeneous event logs.

We surveyed a major bus manufacturer who recently started

a project on integrating their event data in the OA systems

of 31 subsidiaries. These OA systems have been built inde-

pendently on 5 distinct middleware products in the past 20

years. More than 8190 business processes are implemented

in these systems, among which 68.8% are indeed diﬀerent

implementations of the same business activities in diﬀerent

subsidiaries. For instance, in the following Example 1, we

illustrate two versions of turbine order processing processes

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for proﬁt or commercial advantage and that copies bear this notice and the full cita-

tion on the ﬁrst page. Copyrights for components of this work owned by others than

ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-

publish, to post on servers or to redistribute to lists, requires prior speciﬁc permission

and/or a fee. Request permissions from Permissions@acm.org.

SIGMOD’14, June 22–27, 2014, Snowbird, UT, USA.

http://dx.doi.org/10.1145/2588555.2588570.

in 31 subsidiaries. Duplicate events commonly exist in these

heterogeneous processes for the same business activities.

The company has started to integrate these heterogeneous

event data into a uniﬁed business process data warehouse

[3, 4], where diﬀerent types of analyses can be performed,

e.g., querying similar complex procedures or discovering in-

teresting event patterns in diﬀerent subsidiaries (complex

event processing, CEP [6]), comparing business processes

in diﬀerent subsidiaries to ﬁnd common parts for process

simpliﬁcation and reuse [19], or obtaining a more abstract

global picture of business processes (workﬂow views [2]) in

the company. Without exploring the correspondence among

heterogeneous events, applications such as query and analy-

sis over the event data may not yield any meaningful results.

The event matching problem is to construct the similar-

ity and matching relationship of events from heterogeneous

sources. Although there are 49 system integrators (subject-

matter experts) employed by the bus manufacturer in the

project of OA system integration, given the large number

of more than 100,000 event activities, manually checking

the correspondences of all possible event pairs is highly im-

practical for two reasons: 1) manual checking is obviously

ineﬃcient, and 2) manual checking results could be inaccu-

rate and contradictory. An automatic approach is highly

demanded for matching these heterogeneous event data.

Diﬀerent from conventional schema matching on attributes

in relational databases [7], there are strong relationships of

consecutive occurrence among events. The event data in-

tegration is challenging due to the following features com-

monly observed in the general event data.

(1) Event names could be opaque, due to various encoding,

syntax or language conventions in heterogeneous systems.

(2) Event traces might be dislocated. Only a part (e.g., the

beginning) of a trace 1 corresponds to a distinct part (e.g.,

the end) of another trace 2.

(3) Composite event may exist. That is, one event in a

source corresponds to several decomposed ones in another

source, known as composite event in CEP [6].

Example 1. Figure 1 illustrates two fragments of event logs

and L

for turbine order processing in two diﬀerent sub-

sidiaries of the bus manufacturer, respectively. Two exam-

ple traces are shown in each fragment, and each trace de-

notes a sequence of events (steps) for processing one order.

An event log consists of many traces, among which the se-

quences of events may be diﬀerent, since some of the events

can be executed concurrently (e.g. Ship Goods(E) and Email

1211

Trace

Paid by Cash (A)  Check Inventory (C)  Validate (D)  Ship Goods (E)  Email Customer (F)

Paid by Credit Card (B)  Check Inventory (C)  Validate (D)  Email Customer (F)  Ship Goods (E)

...

Trace

Order Accepted (1)  Paid by Cash (2)  Inventory Checking & Validation (4)  ????????? (5)  Send Notification (6)

Order Accepted (1)  Paid by Credit Card (3)  Inventory Checking & Validation (4)  Send Notification (6)  ???????? (5)

...

(a) Event log L

(b) Event log L

(d) Dependency Graph G

0.4

0.6

0.4

0.6

0.4

0.6

0.4

0.60.6

0.6

0.4

1.0

Figure 1: Fragments of two event logs and their dependency graphs

Customer(F) in L

), or exclusively (e.g., Paid by Cash(2) or

Paid by Credit Card(3) in L

Note that opaque names exist in L

as shown in Figure

1(b). The event “????????(5)” is collected from a database

whose encoding is distinct from others, which makes the

event name garbled. According to expert investigation, the

original name of “????????(5)” should be “Delivery(5)”, and

the true event corresponding relation between L

and L

highlighted by red dashed lines in Figures 1(a) and (b).

Dislocated matching exists between the ﬁrst traces in L

and L

. Event Paid by Cash(A) that appears at the beginning

of traces in L

corresponds to event Paid by Cash(2), which

appears in the middle of traces in L

, having another event

Order Accepted(1) before it.

Moreover, two events Check Inventory(C) and Validate(D)

in L

correspond to one composite event Inventory Checking

& Validation(4). For simplicity, we use ABCDEF to denote

event names in L

, while 123456 are events in L

Unfortunately, existing techniques cannot eﬀectively ad-

dress the aforesaid three challenges in event matching. A

straightforward idea of matching events is to compare their

names (a.k.a. event labels). String edit distance (syntactic

similarity) [13] as well as word stemming and the synonym

relation (semantic similarity) [20] are widely used in the la-

bel similarity based approaches [14, 19, 5, 23]. As shown in

Example 1, such a typographical similarity cannot address

the identiﬁed Challenge 1, i.e., opaque event labels.

Structural similarity may be considered besides the typo-

graphical similarity. The idea is to construct a graph for de-

scribing the relationships among events, e.g., the frequency

of appearing consecutively in an event log [8]. Once the

graphical structure is obtained, graph matching techniques

can be employed to identify the event (behavioral/structural)

similarities. Unfortunately, existing graph matching tech-

niques cannot handle well the dislocated matching of events

i.e., the aforesaid Challenge 2. Graph edit distance (GED)

[5] for general graph data and normal distance for matching

with opaque names (OPQ) [11] concern a local evaluation

of similar neighbors. However, as illustrated in Example 1,

dislocated matching events may have distinct neighbors (see

more details below). Rather than local neighbors, another

type of Simrank [10] like behavioral similarity (BHV) [19]

considers a global evaluation via propagating similarities

in the entire graph in multiple iterations. Unfortunately,

directly applying the global propagation does not help in

According to our survey on 5642 processes with redundancy

(68.8% of 8190) of the bus manufacturer, more than 44% of

them involve dislocated event traces.

matching dislocated events that do not have any predeces-

sor, e.g., Paid by Cash(A) in Figure 1.

Example 2. Figures 1(c) and 1(d) capture the statistical

and structural information of L

and L

, respectively (see

Deﬁnition 1 for constructing G

and G

). Each vertex in the

directed graph denotes an event, while an edge between two

events (say AC in Figure 1(c) for instance) indicates that

they appear consecutively in at least one trace (e.g., trace 1

in Figure 1(a)). The numbers attached to edges represent

the normalized frequencies of consecutive event pairs. For

instance, 0.4 of AC means that A,C appear consecutively in

40% of the traces of the event log.

Since GED and OPQ concern more about the local simi-

larity, e.g., the high similarity of (A, C) and (1, 2), an event

mapping M = {A → 1, B → 3, C → 2, D → 4, E → 5, F →

6} will be returned by GED with distance 0.16 and OPQ

with score 10.7. The true mapping M

= {A → 2, B →

3, C/D → 4, E → 5, F → 6} in ground truth shows a higher

GED distance 0.233 (lower is better) and lower OPQ score

10 (higher is better) instead. BHV does not help in capturing

dislocated mapping, e.g., between A and 2 with BHV similar-

ity 0. Instead, A and 1 with no input neighbors have higher

similarity 1, i.e., unable to ﬁnd the dislocated matching.

When matching composite events, i.e., Challenge (3), com-

puting eﬃciency becomes a major issue. For instance, one

may be interested in composite events {C, D}, {D, E} and

{E, F } in Figure 1(c). Given a candidate set of n composite

events, there may exist 2

possible combinations, for each

of which we need to evaluate the corresponding event simi-

larities. Unfortunately, structural similarities such as GED

are already costly to compute for one combination.

Contributions. In this paper, we propose a similarity func-

tion that supports dislocated matching and opaque events,

by iteratively computing neighbor similarities. Our major

contributions in this paper are summarized as follows.

(1) We formally deﬁne the similarity function, which is it-

eratively computed by incorporating other similarity crite-

ria in the iteration. Intuitively, as discussed in Example 2,

directly applying OPQ that concerns a local evaluation of

similar neighbors is ineﬀective in matching dislocated events

with possibly distinct neighbors. In contrast, the proposed

iteration based similarity function concerns the global eval-

uation via propagating similarities, and thus overcomes the

impact of local neighbor distinctness.

(2) We prove the convergence of the proposed iteration

based similarity function and devise a pruning technique

1212

of 12

免费下载

关注

评论