
[Figure 2 content: (a) a labeled source dataset D_S of product records with attributes (id, title, category, brand, price) and labeled entity pairs (a_1^S, b_1^S, 1), (a_2^S, b_2^S, 0), (a_3^S, b_3^S, 1); (b) an unlabeled target dataset D_T of product records with attributes (id, name, description, price) and unlabeled pairs (a_1^T, b_1^T, ?), (a_2^T, b_2^T, ?), (a_3^T, b_3^T, ?).]
Figure 2: A running example of DA for ER with a labeled source dataset D_S and an unlabeled target dataset D_T.
follow the same distribution. As a result, an ER model (the green line) trained on the source cannot correctly predict the target. To address this challenge, domain adaptation (DA) has been extensively studied to utilize labeled data in one or more relevant source domains for a new dataset in a target domain [25, 45, 64, 69]. Intuitively, DA learns from data instances the best way of aligning the distributions of the source and the target data, such that models trained on the labeled source can be used (or adapted) on the unlabeled target. As illustrated in Figure 1(b), the advantage of DA is its ability to learn more domain-invariant representations that reduce the domain shift between source and target, and thus to improve the performance of the ER model; e.g., the green line can correctly classify data instances in both the source and the target datasets.
However, despite some very recent attempts [35], as far as we know, the adoption of DA in ER has not been systematically studied under the same framework, and thus it is hard for practitioners to understand DA's benefits and limitations for ER. To bridge this gap, this paper introduces a general framework, called DADER (Domain Adaptation for Deep Entity Resolution), that unifies a wide range of choices of DA solutions [73, 74]. Specifically, the framework consists of three main modules. (1) Feature Extractor converts entity pairs to high-dimensional vectors (a.k.a. features). (2) Matcher is a binary classifier that takes the features of entity pairs as input and predicts whether they match or not. (3) Feature Aligner is the key module for domain adaptation, which is designed to alleviate the effect of domain shift. To achieve this, Feature Aligner adjusts Feature Extractor to align the distributions of the source and target ER datasets, which reduces the domain shift between source and target. Moreover, it updates Matcher accordingly to minimize the matching errors in the adjusted feature space.
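To make this decomposition concrete, the following is a minimal PyTorch-style sketch of the three modules and of one joint training step; the class names, the bag-of-embeddings extractor, and the weighting factor lam are illustrative assumptions rather than the exact DADER implementation.

```python
# Minimal sketch (illustrative, not the exact DADER code) of the three modules
# and of how Feature Aligner couples them during training.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Maps a serialized entity pair to a fixed-size feature vector.
    A stand-in for an RNN or pre-trained language model encoder."""
    def __init__(self, vocab_size=30522, hidden=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)

    def forward(self, token_ids):           # token_ids: (batch, seq_len)
        return self.embed(token_ids)        # features:  (batch, hidden)

class Matcher(nn.Module):
    """Binary classifier over entity-pair features: match vs. non-match."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, features):
        return self.mlp(features)

def train_step(extractor, matcher, align_loss_fn, optimizer,
               src_tokens, src_labels, tgt_tokens, lam=1.0):
    """One joint update: matching loss on the labeled source plus an alignment
    loss that pulls source and target feature distributions together."""
    src_feat = extractor(src_tokens)
    tgt_feat = extractor(tgt_tokens)
    match_loss = nn.functional.cross_entropy(matcher(src_feat), src_labels)
    align_loss = align_loss_fn(src_feat, tgt_feat)  # e.g., a discrepancy measure
    loss = match_loss + lam * align_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```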
Design space exploration.
Based on our framework, we systematically categorize and study the most representative methods in DA for ER, focusing on two key questions.
First, DA is a broad topic in machine learning (e.g., computer vision and natural language processing), and there is a large set of design choices for domain adaptation. Thus, it is natural to ask which DA design choices would help ER. To answer this question, we have extensively reviewed existing DA studies, and focus on the most popular and fruitful directions that learn domain-invariant and discriminative features. Based on this, we provide a categorization for each module in DADER and define a design space by summarizing representative DA techniques. Specifically, Feature Extractor is typically implemented by recurrent neural networks [38] or pre-trained language models [19, 44, 55]. Matcher often adopts a deep neural network as a binary classifier. Feature Aligner is implemented by three categories of solutions: (1) discrepancy-based, (2) adversarial-based, and (3) reconstruction-based. As the concrete choices of Feature Extractor and Matcher have been well studied, our focus is to identify methods for Feature Aligner, for which we develop six representative methods that cover a wide range of SOTA DA techniques.
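As one concrete illustration of the discrepancy-based category, the sketch below computes a maximum mean discrepancy (MMD) loss between source and target feature batches, which could play the role of the alignment loss in the training-step sketch above; the RBF kernel and the single bandwidth sigma are simplifying assumptions, and the six aligners actually developed are detailed in Section 5.

```python
import torch

def gaussian_mmd(src_feat, tgt_feat, sigma=1.0):
    """Squared maximum mean discrepancy between two feature batches under an
    RBF kernel. src_feat: (n, d) and tgt_feat: (m, d) are Feature Extractor outputs."""
    def rbf(a, b):
        # Pairwise squared Euclidean distances -> RBF kernel values.
        sq_dists = torch.cdist(a, b) ** 2
        return torch.exp(-sq_dists / (2 * sigma ** 2))
    return (rbf(src_feat, src_feat).mean()
            + rbf(tgt_feat, tgt_feat).mean()
            - 2 * rbf(src_feat, tgt_feat).mean())
```

Minimizing such a discrepancy with respect to the Feature Extractor's parameters pulls the two feature distributions together, which is precisely the role of Feature Aligner described above.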
The second question is whether DA is useful for ER, i.e., whether it can effectively utilize labeled data in relevant domains. To answer this, this paper considers two settings: (1) the unsupervised DA setting without any target labels, and (2) the semi-supervised DA setting with a few target labels. Moreover, we also compare DADER with SOTA DL solutions for ER, such as DeepMatcher [49] and Ditto [42]. Based on the comparison, we provide a comprehensive analysis of the benefits and limitations of DA for ER.
Contributions: (1) As far as we know, we are the first to formally define the problem of DA for deep ER (Section 3) and to conduct the most comprehensive study to date on applying DA to ER.
(2) We introduce the DADER framework that supports DA for ER, which consists of three modules, namely Feature Extractor, Matcher, and Feature Aligner. We systematically explore the design space of DA for ER by categorizing each individual module in the framework (Section 4). In particular, we develop six representative methods for Feature Aligner (Section 5).
(3) We conduct a thorough evaluation to explore the design space and compare the developed methods (Section 6). The source code and data have been made available on GitHub¹. We find that DA is very promising for ER, as it reduces the domain shift between source and target. We also point out some open problems of DA for ER and identify future research directions (Section 8).
2 DEEP ENTITY RESOLUTION
We formally define entity resolution and present a framework of using deep learning for entity resolution (or Deep ER for short).
Entity resolution. Let 𝐴 and 𝐵 be two relational tables with multiple attributes. Each tuple 𝑎 ∈ 𝐴 (or 𝑏 ∈ 𝐵) is also referred to as an entity consisting of a set of attribute-value pairs {(attr_i, val_i)}_{1≤i≤k}, where attr_i and val_i denote the i-th attribute name and value respectively. The problem of entity resolution (ER) is to find all the matching pairs (𝑎, 𝑏) ∈ 𝐴 × 𝐵, i.e., pairs of entities that refer to the same real-world entity.
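As a small illustration of this representation, the sketch below encodes two entities as attribute-value pairs and flattens a candidate pair into a single text sequence that a Feature Extractor could consume; the attribute names, sample values, and serialization scheme are illustrative assumptions, not a fixed DADER API.

```python
# Illustrative only: entities as {(attr_i, val_i)} sets and a simple pair
# serialization; the values and the scheme below are made-up examples.
entity_a = {"title": "samsung 52' series 7 lcd", "brand": "samsung", "price": 2148.99}
entity_b = {"title": "samsung series 7 52-inch lcd tv", "brand": None, "price": 2148.99}

def serialize(entity):
    """Flatten the attribute-value pairs of one entity into a text sequence."""
    return " ".join(f"{attr}: {val}" for attr, val in entity.items())

# One candidate pair (a, b) from A x B, serialized for a downstream matcher.
pair_text = serialize(entity_a) + " [SEP] " + serialize(entity_b)
print(pair_text)
```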
¹ https://github.com/ruc-datalab/DADER