
City     Year  Area  Security  Price
Kolkata  2009  710   No        3,200,000
Kolkata  2013  770   No        3,850,000
Kolkata  2007  935   No        2,524,000
Kolkata  2006  973   Yes       3,611,000
(a) 𝑇_train: Train dataset (learn to predict "Price").

City     Year  Area  Security  Price  Ground Truth
Kolkata  2017  350   No        ?      2,100,000
Kolkata  2019  465   Yes       ?      4,365,000
Kolkata  2015  572   No        ?      3,268,000
Kolkata  2012  655   Yes       ?      2,599,000
Kolkata  2012  735   No        ?      3,300,000
Kolkata  2017  881   Yes       ?      4,698,000
Kolkata  2011  1123  Yes       ?      3,324,000
Kolkata  2014  1210  Yes       ?      5,000,000
(b) 𝑇_test: Test dataset (predict the "Price" column).

Figure 2: Sample train and test datasets.
tools [11, 16, 34]. For images, there are many benchmarks, as well as Web APIs such as Google, Baidu, or Azure image search.
Example 2. [Using all data points in candidate datasets.] Fig. 3(b), (c), and (d) show three datasets that also contain house price information in different cities of India, i.e., 𝐷_1, 𝐷_2, and 𝐷_3 for Bangalore, Mumbai, and Delhi, respectively. They have different schemata from 𝑇_train and 𝑇_test in Fig. 2, but they can be used as train data. A straightforward solution is to add all these datasets to the train data (i.e., 𝑇_train := 𝑇_train ∪ 𝐷_1 ∪ 𝐷_2 ∪ 𝐷_3) and train a model. By doing so, we obtain the model shown in Fig. 3(a) Line 3, which, unfortunately, deviates far from the ground truth model Line 2. Our question is: given candidate datasets, is selecting some data points better than using all of them?
Example 3. [Selected data points.] If we can select "good" data points, such as {𝑟_1} from 𝐷_1, {𝑠_1, 𝑠_2, 𝑠_3} from 𝐷_2, and {𝑡_1, 𝑡_4} from 𝐷_3, as additional train data (highlighted in Figs. 3(b–d)), we can use them in Fig. 3(a) (i.e., those annotated by green frames) and train the model shown as Line 4. Clearly, Line 4 is much closer to the ground truth model Line 2 than Line 1 and Line 3 are, and is thus better.
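To make the comparison in Examples 2 and 3 concrete, the following is a minimal sketch of the three fits (Lines 1, 3, and 4 in Fig. 3(a)) using scikit-learn. Only the Area feature is used for brevity, the 𝑇_train values come from Fig. 2, and the candidate and selected values are hypothetical placeholders since the actual points of 𝐷_1–𝐷_3 appear only in Fig. 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# T_train from Fig. 2 (Area -> Price); candidate/selected values below are made up.
X_train = np.array([[710], [770], [935], [973]])
y_train = np.array([3_200_000, 3_850_000, 2_524_000, 3_611_000])

X_all_cand = np.array([[400], [500], [1500], [1800]])      # all candidate points (hypothetical)
y_all_cand = np.array([9_000_000, 8_500_000, 1_000_000, 900_000])
X_selected = np.array([[500], [1500]])                      # a "good" subset (hypothetical)
y_selected = np.array([2_800_000, 4_300_000])

def fit(X, y):
    return LinearRegression().fit(X, y)

m_base = fit(X_train, y_train)                              # Line 1: T_train only
m_all = fit(np.vstack([X_train, X_all_cand]),               # Line 3: union with all candidates
            np.concatenate([y_train, y_all_cand]))
m_sel = fit(np.vstack([X_train, X_selected]),                # Line 4: union with selected points
            np.concatenate([y_train, y_selected]))
print(m_base.coef_, m_all.coef_, m_sel.coef_)
```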
Example 1 shows that more labeled data is needed. Example 2 tells us that using all data points is not ideal. Example 3 shows that it is more beneficial to select only some data points.
Challenges. There are two essential challenges. First, candidate datasets may come from various data distributions that differ from the desired data distribution of the ML task, which is unknown. Second, many data points in these candidate sets are not good w.r.t. our ML task, which raises the challenge of how to effectively select and measure which new data points should be added.
Contributions. Our contributions are summarized as follows.
(1) Selective data acquisition in the wild for model charging. We study the problem of automatic data acquisition for supervised ML in a new setting where the supervised ML task does not have enough train data, and it has access to the data in the wild (Section 2). Note that datasets in the wild are heterogeneous and not all of their data points can help the task.
(2) A solution framework. We propose a solution framework (see Fig. 1) that consists of two steps: dataset discovery, which selects candidate datasets, and data point selection, which selects data points from these candidate datasets. (Section 3)
(3) AutoData with multi-armed bandit. We introduce a classical multi-armed bandit based solution to handle the exploration-exploitation trade-off for AutoData; see the sketch after this list. (Section 4)
(4) AutoData with Deep Q Network-based reinforcement learning (RL). Another effective model is to use Deep Q learning based RL, which learns a neural network to approximate the Q-table (i.e., a simple but huge lookup table that calculates the maximum expected future reward of each action at each state) and decides which cluster to select based on the current state. (Section 5)
(5) Evaluation. We conduct extensive experiments to show that our methods can effectively select data points from different data sources and improve ML performance by up to 14.8% and 8.3% on relational and image datasets, respectively. (Section 6)
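As a concrete illustration of the exploration-exploitation trade-off mentioned in contribution (3), below is a minimal epsilon-greedy sketch for choosing which cluster of candidate points to acquire from next. The reward definition (validation-performance gain) and all names are illustrative assumptions, not AutoData's exact algorithm.

```python
import numpy as np

def epsilon_greedy_acquisition(n_clusters, n_rounds, try_cluster, epsilon=0.1):
    counts = np.zeros(n_clusters)          # how often each arm (cluster) was pulled
    values = np.zeros(n_clusters)          # running mean reward per cluster
    for _ in range(n_rounds):
        if np.random.rand() < epsilon:     # explore: pick a random cluster
            arm = np.random.randint(n_clusters)
        else:                              # exploit: pick the best cluster so far
            arm = int(np.argmax(values))
        reward = try_cluster(arm)          # e.g., gain in validation accuracy
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

# Toy usage: three clusters with fixed noisy rewards.
print(epsilon_greedy_acquisition(3, 50, lambda arm: [0.1, 0.5, 0.2][arm] + np.random.randn() * 0.05))
```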
2 PRELIMINARY
Supervised machine learning. We consider supervised ML as training a model 𝑀 to learn a function 𝑓() that maps an input to an output based on example input-output pairs, i.e., 𝑓 : X → Y. We use 𝑀(𝐴) to denote the model 𝑀 that is trained with dataset 𝐴, and the notation 𝑀(𝐴, 𝐵) to denote the model 𝑀 that is trained with 𝐴 and evaluated with dataset 𝐵.
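To make the 𝑀(𝐴) and 𝑀(𝐴, 𝐵) notation concrete, here is a minimal sketch; the regression model and the R² metric are assumptions chosen for illustration, not prescribed by the paper.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def M(A, B=None):
    """Illustrative helper: train on dataset A; if B is given, return performance on B."""
    X_a, y_a = A
    model = LinearRegression().fit(X_a, y_a)
    if B is None:
        return model                           # M(A): the trained model itself
    X_b, y_b = B
    return r2_score(y_b, model.predict(X_b))   # M(A, B): performance of M(A) on B
```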
Train/validation/test datasets. A labeled dataset 𝑇 is typically split into three disjoint subsets, train/validation/test (𝑇_train/𝑇_val/𝑇_test). 𝑇_test is completely held out during training.
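For instance, a standard disjoint split with scikit-learn (the 60/20/20 ratios are illustrative):

```python
from sklearn.model_selection import train_test_split

# Stand-in for a labeled dataset T; in practice each element is a (data point, label) pair.
T = list(range(100))

T_train, T_hold = train_test_split(T, test_size=0.4, random_state=0)   # 60% train
T_val, T_test = train_test_split(T_hold, test_size=0.5, random_state=0)  # 20% val, 20% test
```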
Data in the wild. We use the term data in the wild to generally refer to all datasets that one can have access to, including data lakes, data markets, online data repositories, enterprise data warehouses, and so on. More specifically, for supervised ML, we consider it as a set of datasets D = {𝐷_1, . . . , 𝐷_𝑚}, where 𝐷_𝑖 (𝑖 ∈ [1, 𝑚]) is a set of (data point, label) pairs.
Candidate datasets. The candidate datasets w.r.t. a supervised ML task 𝑀, denoted by D_𝑐, are a subset of D containing the datasets "relevant" to 𝑀. For tabular data, relevance typically means that these candidate datasets have the same or a highly overlapping relational schema with 𝑇_train. For image data, relevance typically means that these candidate datasets contain images that have the same labels (e.g., {cat, dog, bird, fish}) as 𝑇_train.
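A minimal sketch of the tabular relevance check, assuming datasets are represented by their column-name sets and using an overlap threshold of 0.8 (both of these are assumptions for illustration):

```python
def candidate_datasets(train_columns, wild_datasets, min_overlap=0.8):
    """Keep datasets whose columns overlap sufficiently with T_train's schema."""
    train_cols = set(train_columns)
    candidates = {}
    for name, columns in wild_datasets.items():
        overlap = len(train_cols & set(columns)) / len(train_cols)
        if overlap >= min_overlap:
            candidates[name] = columns
    return candidates

# Usage with the schema of Fig. 2 and three hypothetical wild datasets.
wild = {"D1": ["City", "Year", "Area", "Security", "Price", "Parking"],
        "D2": ["Year", "Area", "Price"],
        "D3": ["City", "Year", "Area", "Price", "Furnished"]}
print(candidate_datasets(["City", "Year", "Area", "Security", "Price"], wild))
```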
Candidate data pool. The candidate data pool (or simply data pool), denoted by P, is the union of all data points in the candidate datasets, i.e., P = ⋃_{(𝑥,𝑦) ∈ 𝐷_𝑖, 𝐷_𝑖 ∈ D_𝑐} (𝑥, 𝑦).
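Assuming each candidate dataset is represented as a list of (data point, label) pairs, the pool can be materialized as a simple union:

```python
def build_data_pool(candidate_datasets):
    """Union of all (data point, label) pairs across the candidate datasets in D_c."""
    pool = []
    for dataset in candidate_datasets:       # each dataset is a list of (x, y) pairs
        pool.extend(dataset)
    return pool

# Toy candidate datasets (feature tuples and prices are placeholders).
D_c = [[((710, "No"), 3_200_000)],
       [((465, "Yes"), 4_365_000), ((572, "No"), 3_268_000)]]
P = build_data_pool(D_c)
```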
Selective data acquisition for model charging. Given a supervised ML task with a pre-specified model 𝑀, train/validation/test datasets (𝑇_train/𝑇_val/𝑇_test), and a candidate data pool P, the problem is to select a subset P* ⊂ P using 𝑇_train and 𝑇_val, such that it yields the largest performance improvement of the supervised ML task on 𝑇_test:

P* = arg max_{P′ ⊂ P} 𝑀(𝑇_train ∪ P′, 𝑇_test) − 𝑀(𝑇_train, 𝑇_test).
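Since the arg max ranges over exponentially many subsets P′, any practical method can only score candidate subsets rather than enumerate them. Below is a minimal sketch of the objective being maximized, with a hypothetical train_and_score helper standing in for 𝑀(·, ·):

```python
def improvement(train_and_score, T_train, T_test, P_prime):
    """Test-performance gain from augmenting T_train with a candidate subset P_prime.

    train_and_score(A, B) is a hypothetical stand-in for M(A, B): train on A,
    return performance on B.
    """
    baseline = train_and_score(T_train, T_test)              # M(T_train, T_test)
    augmented = train_and_score(T_train + P_prime, T_test)   # M(T_train ∪ P', T_test)
    return augmented - baseline

# Note: T_test is held out, so candidate subsets are scored on T_val during
# selection; only the final improvement is reported on T_test.
```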
Example 4. [Selective data acquisition.] Given an ML task of training a regression model using 𝑇_train as shown in Fig. 2 and a data