ID
Trace
t1
t2
Paid by Cash (A) Check Inventory (C) Validate (D) Ship Goods (E) Email Customer (F)
Paid by Credit Card (B) Check Inventory (C) Validate (D) Email Customer (F) Ship Goods (E)
...
ID
Trace
s1
s2
Order Accepted (1) Paid by Cash (2) Inventory Checking & Validation (4) ????????? (5) Send Notification (6)
Order Accepted (1) Paid by Credit Card (3) Inventory Checking & Validation (4) Send Notification (6) ???????? (5)
...
B
C
A
D
E
F
1
2
3
4
5
6
(a) Event log L
1
(b) Event log L
2
(c) Dependency Graph G
1
of
L
1
(d) Dependency Graph G
2
of
L
2
0.4
0.6
0.4
0.6
0.4
0.6
0.6
0.4
0.4
0.60.6
0.6
0.4
0.4
1.0
Figure 1: Fragments of two event logs and their dependency graphs
Customer(F) in L
1
), or exclusively (e.g., Paid by Cash(2) or
Paid by Credit Card(3) in L
2
).
Note that opaque names exist in L
2
as shown in Figure
1(b). The event “????????(5)” is collected from a database
whose encoding is distinct from others, which makes the
event name garbled. According to expert investigation, the
original name of “????????(5)” should be “Delivery(5)”, and
the true event corresponding relation between L
1
and L
2
is
highlighted by red dashed lines in Figures 1(a) and (b).
Dislocated matching exists between the first traces in L
1
and L
2
. Event Paid by Cash(A) that appears at the beginning
of traces in L
1
corresponds to event Paid by Cash(2), which
appears in the middle of traces in L
2
, having another event
Order Accepted(1) before it.
Moreover, two events Check Inventory(C) and Validate(D)
in L
1
correspond to one composite event Inventory Checking
& Validation(4). For simplicity, we use ABCDEF to denote
event names in L
1
, while 123456 are events in L
2
.
Unfortunately, existing techniques cannot effectively ad-
dress the aforesaid three challenges in event matching. A
straightforward idea of matching events is to compare their
names (a.k.a. event labels). String edit distance (syntactic
similarity) [13] as well as word stemming and the synonym
relation (semantic similarity) [20] are widely used in the la-
bel similarity based approaches [14, 19, 5, 23]. As shown in
Example 1, such a typographical similarity cannot address
the identified Challenge 1, i.e., opaque event labels.
Structural similarity may be considered besides the typo-
graphical similarity. The idea is to construct a graph for de-
scribing the relationships among events, e.g., the frequency
of appearing consecutively in an event log [8]. Once the
graphical structure is obtained, graph matching techniques
can be employed to identify the event (behavioral/structural)
similarities. Unfortunately, existing graph matching tech-
niques cannot handle well the dislocated matching of events
1
,
i.e., the aforesaid Challenge 2. Graph edit distance (GED)
[5] for general graph data and normal distance for matching
with opaque names (OPQ) [11] concern a local evaluation
of similar neighbors. However, as illustrated in Example 1,
dislocated matching events may have distinct neighbors (see
more details below). Rather than local neighbors, another
type of Simrank [10] like behavioral similarity (BHV) [19]
considers a global evaluation via propagating similarities
in the entire graph in multiple iterations. Unfortunately,
directly applying the global propagation does not help in
1
According to our survey on 5642 processes with redundancy
(68.8% of 8190) of the bus manufacturer, more than 44% of
them involve dislocated event traces.
matching dislocated events that do not have any predeces-
sor, e.g., Paid by Cash(A) in Figure 1.
Example 2. Figures 1(c) and 1(d) capture the statistical
and structural information of L
1
and L
2
, respectively (see
Definition 1 for constructing G
1
and G
2
). Each vertex in the
directed graph denotes an event, while an edge between two
events (say AC in Figure 1(c) for instance) indicates that
they appear consecutively in at least one trace (e.g., trace 1
in Figure 1(a)). The numbers attached to edges represent
the normalized frequencies of consecutive event pairs. For
instance, 0.4 of AC means that A,C appear consecutively in
40% of the traces of the event log.
Since GED and OPQ concern more about the local simi-
larity, e.g., the high similarity of (A, C) and (1, 2), an event
mapping M = {A → 1, B → 3, C → 2, D → 4, E → 5, F →
6} will be returned by GED with distance 0.16 and OPQ
with score 10.7. The true mapping M
0
= {A → 2, B →
3, C/D → 4, E → 5, F → 6} in ground truth shows a higher
GED distance 0.233 (lower is better) and lower OPQ score
10 (higher is better) instead. BHV does not help in capturing
dislocated mapping, e.g., between A and 2 with BHV similar-
ity 0. Instead, A and 1 with no input neighbors have higher
similarity 1, i.e., unable to find the dislocated matching.
When matching composite events, i.e., Challenge (3), com-
puting efficiency becomes a major issue. For instance, one
may be interested in composite events {C, D}, {D, E} and
{E, F } in Figure 1(c). Given a candidate set of n composite
events, there may exist 2
n
possible combinations, for each
of which we need to evaluate the corresponding event simi-
larities. Unfortunately, structural similarities such as GED
are already costly to compute for one combination.
Contributions. In this paper, we propose a similarity func-
tion that supports dislocated matching and opaque events,
by iteratively computing neighbor similarities. Our major
contributions in this paper are summarized as follows.
(1) We formally define the similarity function, which is it-
eratively computed by incorporating other similarity crite-
ria in the iteration. Intuitively, as discussed in Example 2,
directly applying OPQ that concerns a local evaluation of
similar neighbors is ineffective in matching dislocated events
with possibly distinct neighbors. In contrast, the proposed
iteration based similarity function concerns the global eval-
uation via propagating similarities, and thus overcomes the
impact of local neighbor distinctness.
(2) We prove the convergence of the proposed iteration
based similarity function and devise a pruning technique
评论