Domain Adaptation for Deep Entity Resolution

Jianhong Tu (Renmin University, China), tujh@ruc.edu.cn
Ju Fan (Renmin University, China), fanj@ruc.edu.cn
Nan Tang (QCRI, Qatar), ntang@hbku.edu.qa
Peng Wang (Renmin University, China), lisa_wang@ruc.edu.cn
Chengliang Chai (Tsinghua University, China), ccl@mail.tsinghua.edu.cn
Guoliang Li (Tsinghua University, China), liguoliang@tsinghua.edu.cn
Ruixue Fan (Renmin University, China), fanruixue@ruc.edu.cn
Xiaoyong Du (Renmin University, China), duyong@ruc.edu.cn
ABSTRACT
Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods, trained with a lot of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. Nevertheless, for many real-world ER applications, the situation changes dramatically: collecting large-scale labeled datasets is a painful undertaking. In this paper, we seek to answer: if we have a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset, without any labels or with a few labels? This is known as domain adaptation (DA), which has achieved great successes in computer vision and natural language processing, but has not been systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this purpose, we develop a DADER (Domain Adaptation for Deep Entity Resolution) framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct so far the most comprehensive experimental study to explore the design space and compare different choices of DA for ER. We provide guidance for selecting appropriate design solutions based on extensive experiments.
CCS CONCEPTS
• Information systems → Entity resolution.
KEYWORDS
Domain adaptation; Data integration; Deep learning
ACM Reference Format:
Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Chengliang Chai, Guoliang Li, Ruixue Fan, and Xiaoyong Du. 2022. Domain Adaptation for Deep Entity Resolution. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), June 12–17, 2022, Philadelphia, PA, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3514221.3517870

Ju Fan is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGMOD '22, June 12–17, 2022, Philadelphia, PA, USA.
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9249-5/22/06...$15.00
https://doi.org/10.1145/3514221.3517870
[Figure 1: DA for ER. (a) A Matcher learned from the labeled source cannot perform well on the unlabeled target due to domain shift. (b) Domain adaptation learns better representations to reduce domain shift and minimize matching errors. Panel annotations: Source, Target; Reducing Domain Shift; Minimizing Matching Errors.]
1 INTRODUCTION
Entity resolution (ER) determines whether two data instances refer to the same real-world entity. With many decades of effort approaching ER from the model-centric point of view, there is a considerable amount of literature, ranging from rule-based methods (e.g., disjunctive normal form [59] and general boolean formula [59]), to ML-based methods (such as SVM [5] and random forests [20]), to DL-based methods (such as DeepMatcher [49], DeepER [21], and Ditto [42]). The state-of-the-art (SOTA) results are, not surprisingly, achieved by DL-based solutions.
However, DL-based ER methods typically need a large amount of labeled training data. For example, even when piggybacking on pre-trained language models, as Ditto [42] does, thousands of labels are still needed to achieve satisfactory accuracy. In fact, the main pain point for ER practitioners is the substantial labeling effort required to create enough training data.
Fortunately, the big data era makes many labeled ER datasets available in the same or relevant domains, either from public benchmarks (e.g., WDC [52] and DBLP-Scholar [49]) or within enterprises. Hence, a natural question is: can we reuse these labeled source ER datasets for a new target ER dataset? An affirmative answer to this question has the potential to dramatically reduce the expensive human effort of data labeling.
[Figure 2: A running example of DA for ER with a labeled source dataset D_S and an unlabeled target dataset D_T.

(a) Labeled Source Dataset

    id    | title            | category        | brand   | price
    a_1^S | balt wheasel ... | stationery ...  | balt    | 239.88
    a_2^S | kodak esp ...    | printers        | NULL    | 58.0
    a_3^S | hp q3675a ...    | printers        | hp      | 194.84

    id    | title            | category        | brand   | price
    b_1^S | balt inc. ...    | laminating ...  | mayline | 134.45
    b_2^S | kodak esp 7 ...  | kodak           | NULL    | 149.29
    b_3^S | hewlett ...      | cleaning repair | hp      | NULL

    Labeled pairs: (a_1^S, b_1^S, 1), (a_2^S, b_2^S, 0), (a_3^S, b_3^S, 1)

(b) Unlabeled Target Dataset

    id    | name                                 | description                                     | price
    a_1^T | samsung 52 ' series 7 black flat ... | samsung 52 ' series 7 black flat panel lcd ...  | NULL
    a_2^T | sony 46 ' bravia ...                 | bravia z series ...                             | NULL
    a_3^T | linksys wirelessn ...                | security router ...                             | NULL

    id    | name                 | description                                   | price
    b_1^T | samsung ln52a750 ... | dynamic contrast ratio 120hz 6ms respons ...  | 2148.99
    b_2^T | sony bravia ...      | ntsc 16:9 1366 x 768 ...                      | 597.72
    b_3^T | linksys wirelessg ...| 54mbps                                        | NULL

    Unlabeled pairs: (a_1^T, b_1^T, ?), (a_2^T, b_2^T, ?), (a_3^T, b_3^T, ?)]

Domain adaptation. The key challenge of reusing labeled source data is that there may be a distribution change, or domain shift, between the source and the target, which would degrade performance. Figure 1 (a) shows an example of a labeled source dataset (circles) and an unlabeled target dataset (squares). As the source and target datasets may not be from the same domain, they do not follow the same distribution. As a result, an ER model (the green line) trained on the source cannot correctly predict the target. To address this challenge, domain adaptation (DA) has been extensively studied to utilize labeled data in one or more relevant source domains for a new dataset in a target domain [25, 45, 64, 69]. Intuitively, DA learns from data instances the best way of aligning the distributions of the source and the target data, such that the models trained on the labeled source can be used (or adapted) to the unlabeled target. As illustrated in Figure 1 (b), the advantage of DA is its ability to learn more domain-invariant representations that reduce the domain shift between source and target, and to improve the performance of the ER model; e.g., the green line can correctly classify data instances in both the source and target datasets.
However, despite some very recent attempts [35], as far as we know, the adoption of DA in ER has not been systematically studied under the same framework, and thus it is hard for practitioners to understand DA's benefits and limitations for ER. To bridge this gap, this paper introduces a general framework, called DADER (Domain Adaptation for Deep Entity Resolution), that unifies a wide range of DA solutions [73, 74]. Specifically, the framework consists of three main modules. (1) Feature Extractor converts entity pairs to high-dimensional vectors (a.k.a. features). (2) Matcher is a binary classifier that takes the features of entity pairs as input, and predicts whether they match or not. (3) Feature Aligner is the key module for domain adaptation, which is designed to alleviate the effect of domain shift. To achieve this, Feature Aligner adjusts Feature Extractor to align the distributions of the source and target ER datasets, which reduces the domain shift between source and target. Moreover, it updates Matcher accordingly to minimize the matching errors in the adjusted feature space.
Design space exploration. Based on our framework, we systematically categorize and study the most representative DA methods for ER, focusing on two key questions.
First, DA is a broad topic in machine learning (e.g., computer vision and natural language processing), and there is a large set of design choices for domain adaptation. Thus, it is necessary to ask which DA design choices would help ER. To answer this question, we have extensively reviewed existing DA studies, and then focus on the most popular and fruitful directions, which learn domain-invariant and discriminative features. Based on this, we provide a categorization for each module in DADER and define a design space by summarizing representative DA techniques. Specifically, Feature Extractor is typically implemented by recurrent neural networks [38] or pre-trained language models [19, 44, 55]. Matcher often adopts a deep neural network as a binary classifier. Feature Aligner is implemented by three categories of solutions: (1) discrepancy-based, (2) adversarial-based, and (3) reconstruction-based. As the concrete choices of Feature Extractor and Matcher have been well studied, our focus is to identify methods for Feature Aligner, for which we develop six representative methods that cover a wide range of SOTA DA techniques.
The second question is whether DA is useful for ER to utilize labeled data in relevant domains. To answer this, this paper considers two settings: (1) the unsupervised DA setting without any target labels, and (2) the semi-supervised DA setting with a few target labels. Moreover, we also compare DADER with SOTA DL solutions for ER, such as DeepMatcher [49] and Ditto [42]. Based on the comparison, we provide a comprehensive analysis of the benefits and limitations of DA for ER.
Contributions:
(1) As far as we know, we are the first to formally define the problem of DA for deep ER (Section 3) and to conduct so far the most comprehensive study of applying DA to ER.
(2) We introduce the DADER framework that supports DA for ER, which consists of three modules, namely Feature Extractor, Matcher, and Feature Aligner. We systematically explore the design space of DA for ER by categorizing each individual module in the framework (Section 4). In particular, we develop six representative methods for Feature Aligner (Section 5).
(3) We conduct a thorough evaluation to explore the design space and compare the developed methods (Section 6). The source code and data have been made available on GitHub: https://github.com/ruc-datalab/DADER. We find that DA is very promising for ER, as it reduces the domain shift between source and target. We point out some open problems of DA for ER and identify research directions (Section 8).
2 DEEP ENTITY RESOLUTION
We formally dene entity resolution and present a framework of
using deep learning for entity resolution (or Deep ER for short).
Entity resolution. Let A and B be two relational tables with multiple attributes. Each tuple a ∈ A (or b ∈ B) is also referred to as an entity consisting of a set of attribute-value pairs {(attr_i, val_i)}_{1≤i≤k}, where attr_i and val_i denote the i-th attribute name and value respectively. The problem of entity resolution (ER) is to find all the entity pairs (a, b) ∈ A × B that refer to the same real-world objects. An entity pair is said to be matching (resp. non-matching) if the two entities refer to the same (resp. different) real-world objects.
[Figure 3: Domain adaptation for deep entity resolution. (a) A general framework for deep entity resolution: labeled source pairs (a^S, b^S, 1/0) from tables A_S and B_S pass through a Feature Extractor, whose features x^S a Matcher classifies as 1/0, with gradients flowing back. (b) Our DADER framework of domain adaptation for deep entity resolution: unlabeled target pairs (a^T, b^T, ?) from A_T and B_T pass through a Feature Extractor with shared weights; a Feature Aligner A aligns x^S and x^T, and gradients from both Matcher and Feature Aligner update the extractors.]
A typical ER pipeline consists of two steps, blocking and matching. The blocking step generates a set of candidate pairs with high recall, i.e., pruning the entity pairs which are unlikely to match (see [67] on DL for ER blocking). The matching step takes the candidate set generated by the blocking step as input and determines which candidate pairs are matching and which are non-matching. Our focus is on domain adaptation for the ER matching step.
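As a toy illustration of the two-step pipeline, the sketch below blocks on shared title tokens before handing candidates to a matcher; this particular blocking key is our own simplification for exposition, not a method from the paper.

    def block(table_a, table_b):
        """Blocking: keep candidate pairs sharing at least one title token
        (aims for high recall by pruning pairs unlikely to match)."""
        candidates = []
        for a in table_a:
            tokens_a = set(a["title"].lower().split())
            for b in table_b:
                if tokens_a & set(b["title"].lower().split()):
                    candidates.append((a, b))
        return candidates

    # The matching step (the focus of this paper) then classifies each
    # candidate pair as matching or non-matching.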
Training data for ER. We denote a labeled training set as (D, Y), where D ⊆ A × B is a set of entity pairs and Y is a label set. Each entity pair (a, b) ∈ D is associated with a label y ∈ Y that denotes whether the pair of entities a and b is matching (i.e., y = 1) or non-matching (i.e., y = 0). Figure 2 (a) shows an example training set. Each pair consists of two entities from different tables, e.g., (a_1^S, b_1^S), and is associated with a 1/0 label.
Deep entity resolution. Existing Deep ER solutions [21, 42, 49] typically utilize a framework that consists of a Feature Extractor and a Matcher, as shown in Figure 3 (a). Specifically, given an entity pair (a, b), a Feature Extractor F: A × B → R^d converts this pair into a d-dimensional vector-based representation (a.k.a. features), denoted by x, i.e., x = F(a, b). Then, features x are fed into an ER Matcher M, which is a DL-based binary classification model. The ER Matcher M takes features x as input, and predicts a probability ŷ of matching,

    ŷ = M(x) = M(F(a, b)).    (1)

Given a training set (D, Y), by iteratively applying minibatch stochastic gradient descent, the parameters of both F and M are optimized, so that they improve at distinguishing matching entity pairs from non-matching ones.
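For concreteness, the following is a minimal PyTorch sketch of this Feature Extractor/Matcher composition and one minibatch training step. The choice of a BERT encoder with a linear head, and the hyperparameters, are our illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    # Feature Extractor F: a pre-trained BERT encoder (illustrative choice).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    # Matcher M: a linear binary classifier over the [CLS] feature x.
    matcher = nn.Linear(encoder.config.hidden_size, 2)

    optimizer = torch.optim.AdamW(
        list(encoder.parameters()) + list(matcher.parameters()), lr=2e-5)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(pair_texts, labels):
        """One minibatch SGD step on serialized entity pairs."""
        batch = tokenizer(pair_texts, padding=True, truncation=True,
                          return_tensors="pt")
        x = encoder(**batch).last_hidden_state[:, 0]  # features x = F(a, b)
        logits = matcher(x)                           # y_hat = M(x)
        loss = loss_fn(logits, torch.tensor(labels))  # matching loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()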
Example 1. Consider the ER dataset in Figure 2 (a). Suppose that we use the pre-trained language model Bert [19] to implement Feature Extractor F, like Ditto [42]. Given an entity a, F first serializes all attribute-value pairs {(attr_i, val_i)}_{1≤i≤k} of a into a token sequence (i.e., text) by applying the following function,

    S(a) = [ATT] attr_1 [VAL] val_1 . . . [ATT] attr_k [VAL] val_k,

where [ATT] and [VAL] are two special tokens for starting attributes and values respectively. For example, we serialize entity a_1^S into a token sequence, i.e., S(a_1^S) = [ATT] title [VAL] balt . . . [ATT] price [VAL] 239.88. Then, F converts (a^S, b^S) into the token sequence S(a^S, b^S) = [CLS] S(a^S) [SEP] S(b^S) [SEP], where [SEP] is a special token separating the two entities and [CLS] is a special token in Bert that encodes the entire sequence. Finally, we feed S(a^S, b^S) into Bert and obtain a vector-based representation x (e.g., the embedding of [CLS]). After that, we can use a fully connected layer to implement Matcher M, which then produces the matching probability ŷ from x. Thus, given all the labeled pairs {(a^S, b^S, y^S)} in Figure 2 (a), we can train both F and M by minimizing a loss function over {(y^S, ŷ)}.
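A minimal sketch of this serialization, assuming entities are given as attribute-to-value dictionaries; in practice [ATT] and [VAL] would be registered as special tokens in the tokenizer, which would also normally add [CLS] and [SEP] itself.

    def serialize(entity):
        """S(a): '[ATT] attr_1 [VAL] val_1 ... [ATT] attr_k [VAL] val_k'."""
        return " ".join(f"[ATT] {attr} [VAL] {val}"
                        for attr, val in entity.items())

    def serialize_pair(a, b):
        """S(a, b) = [CLS] S(a) [SEP] S(b) [SEP]."""
        return f"[CLS] {serialize(a)} [SEP] {serialize(b)} [SEP]"

    # Entity a_1^S from Figure 2 (a), abbreviated:
    a1 = {"title": "balt wheasel ...", "category": "stationery ...",
          "brand": "balt", "price": "239.88"}
    print(serialize(a1))
    # [ATT] title [VAL] balt wheasel ... [ATT] category [VAL] stationery ...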
3 DOMAIN ADAPTATION FOR DEEP ER
Next we describe domain adaptation for deep ER. We consider a labeled source ER dataset (D_S, Y_S) = {(a^S, b^S, y^S)} and an unlabeled target dataset D_T = {(a^T, b^T)}, and aim to find the best Feature Extractor and Matcher for producing accurate matching/non-matching results on the target D_T. A crude method is to directly use the Feature Extractor F and Matcher M trained on D_S to predict D_T. However, because the source and target data may not come from the same domain, the features extracted by F for the source and the target may not follow the same data distribution, leading to a domain shift problem. Consequently, M trained on the source data cannot correctly predict the target data. To address this obstacle, we study the problem of domain adaptation for ER, and introduce a framework, DADER, that unifies the representative methods.
High-level idea of DA for ER. Figure 3 (b) shows the DADER framework. The high-level idea is to learn from data instances the best way of generating and aligning the features for the source and the target entity pairs, such that the Matcher trained on the labeled source can be used (or adapted) to the unlabeled target. To this end, DADER introduces a Feature Aligner A, which guides Feature Extractor F to generate domain-invariant and discriminative features, and updates Matcher M accordingly. Formally, we define the alignment loss and the matching loss as follows.
(1) Domain-invariant. Intuitively, we would like the distributions of source features x^S and target features x^T to be as close as possible. To this end, the Feature Aligner A(x^S, x^T): R^d × R^d → R is utilized to produce an alignment loss L_A. For example, a simple method for L_A is to define a distance between the means of the source and target distributions. A more comprehensive design exploration for fulfilling L_A will be discussed in Section 4.
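As a concrete instance of the simple mean-distance idea, the sketch below computes L_A as the squared distance between batch feature means; this means-only variant is our illustrative choice, whereas Sections 4 and 5 cover richer discrepancy-based, adversarial-based, and reconstruction-based losses.

    import torch

    def alignment_loss(x_source, x_target):
        """L_A: squared Euclidean distance between batch feature means.

        x_source, x_target: (batch, d) feature matrices produced by F.
        """
        return ((x_source.mean(dim=0) - x_target.mean(dim=0)) ** 2).sum()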
(2) Discriminative. We also consider a matching loss L_M that measures the difference between the predicted and ground-truth results on the source. Recall that, given an entity pair (a^S, b^S) in the source dataset, our Matcher M makes a prediction ŷ = M(F(a^S, b^S)). Thus, the matching loss is defined as L_M = loss({(ŷ, y^S)}), where loss is a function such as cross entropy.
Now, we are ready to present the goal of DA for ER as finding the best Feature Extractor F* and Matcher M* that minimize both the alignment loss L_A and the matching loss L_M, i.e.,

    F*, M* = arg min_{F, M} aggregate(L_A, L_M).    (2)
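To make Equation (2) concrete, the sketch below performs one adaptation step, reusing the tokenizer, encoder, matcher, optimizer, loss_fn, and alignment_loss from the earlier sketches. Aggregating the losses as L_M + λ·L_A with a fixed weight λ is our illustrative assumption; the actual aggregation depends on the Feature Aligner design explored in Section 5.

    # One adaptation step minimizing aggregate(L_A, L_M) as L_M + lam * L_A;
    # the weighted sum and the value of lam are illustrative assumptions.
    lam = 0.1

    def adaptation_step(src_texts, src_labels, tgt_texts):
        """Jointly optimize F and M on labeled source + unlabeled target."""
        src = tokenizer(src_texts, padding=True, truncation=True,
                        return_tensors="pt")
        tgt = tokenizer(tgt_texts, padding=True, truncation=True,
                        return_tensors="pt")
        x_s = encoder(**src).last_hidden_state[:, 0]  # source features x^S
        x_t = encoder(**tgt).last_hidden_state[:, 0]  # target features x^T
        l_m = loss_fn(matcher(x_s), torch.tensor(src_labels))  # L_M (source)
        l_a = alignment_loss(x_s, x_t)                         # L_A
        loss = l_m + lam * l_a
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return l_m.item(), l_a.item()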