
Conditional Regression Rules
Rui Kang
BNRist, Tsinghua University
kr20@mails.tsinghua.edu.cn
Shaoxu Song
BNRist, Tsinghua University
sxsong@tsinghua.edu.cn
Chaokun Wang
BNRist, Tsinghua University
chaokun@tsinghua.edu.cn
Abstract—Mixed data distribution is widely observed. For
example, bird migration data consist of the observed locations
of various birds in different years, varying in data distribution.
Learning a single regression model over such a mixed data
distribution is often ineffective, while manually segmenting the
data, e.g., by bird, date or region, for learning individual models
is truly labor-intensive. In this paper, we propose to automatically
discover the regression models that apply conditionally to only
a part of the data, namely conditional regression rules (CRRs),
inspired by conditional functional dependencies (CFDs),
which are FDs that hold only on part of the data. Remarkably, a regression
model may apply in different parts of data, e.g., the seasonal
migration of birds is similar in different years. To capture
the shared regression models, we investigate the inference of
CRRs. An algorithm is devised to learn and discover CRRs from
data, with the help of CRR inference. Extensive experiments on
real-world datasets demonstrate that the discovered conditional
regression rules are more effective than the regression models
without conditions. In particular, with the inference of CRRs,
the number of learned CRRs is significantly reduced without
sacrificing rule semantics.
Index Terms—Integrity Constraint, Regression Model
I. INTRODUCTION
Mixed data distribution is widely observed. Making assumptions
such as mixed Gaussian distribution [1] is not
always valid in practice, e.g., in spatio-temporal data [2]. For
instance, the bird migration dataset BirdMap [3] consists of the
observed locations of various birds in different years, varying
in data distribution, as illustrated in Table I and Figure 1(a)
in Example 1 below. Our preliminary study [4] shows that
learning individual regression models for each tuple is accurate
and effective. Unfortunately, it is computationally expensive and
not always necessary. As shown in Figure 1(b), a regression
model (lines with the same color) may apply to one set or
multiple isolated sets of data, so it need not be learned for each
individual tuple.
Example 1. Table I shows a fraction of the GPS locations of
different birds in several years from the BirdMap dataset [3].
Figure 1(a) visualizes part of the mixed data distribution,
where each point corresponds to a GPS record of the bird
2.Maria. We use different colors to denote the data in different
seasons, showing the seasonal migration of a bird, which is
expected to be automatically discovered below.
This work is supported in part by the National Natural Science Foundation
of China (62072265, 62021002), the National Key Research and Development
Plan (2021YFB3300500, 2019YFB1705301, 2019YFB1707001),
BNR2022RC01011, and the MIIT High Quality Development Program 2020.
Shaoxu Song is the corresponding author.
TABLE I
EXAMPLE OF THE BIRDMAP DATASET
TID   Latitude   Longitude   BirdID         Date
t1    56.20883   26.92067    2.Maria        2006-8-6
t2    55.83867   26.2075     2.Maria        2006-8-7
t3    21.988     22.56783    2.Maria        2007-8-28
t4    53.04183   25.80183    3.Raivo        2007-3-30
t5    47.39333   27.40033    1.Kalakotkas   2007-3-26
t6    –          –           2.Maria        2007-3-20
t7    38.5855    28.040333   4.Mart         2007-9-11
t8    38.58567   28.03583    4.Mart         2007-9-1
t9    9.0155     20.07167    2.Maria        2008-8-27
t10   6.7465     19.073      2.Maria        2007-9-4
t11   58.61833   28.66967    33.Erika       2007-8-13
Regression models are learned over the data for various
applications, such as imputing the missing locations in t6 in
Table I. To better illustrate the regression model, Figures 1(b)
and 1(c) plot Latitude and Longitude over Date, respectively.
Existing methods, such as [5], may learn the same
model multiple times in different parts of the data. For example,
in Figure 1(b), such a method splits the domain of timestamps and learns
regression models separately, e.g., the red line denotes the
migration of the bird from north to south between 2006-8-11
and 2006-9-12. Unfortunately, the same regression model
is redundantly learned again from 2008-8-18 to 2008-9-19,
since the birds travel seasonally in the same way.
Obviously, manually splitting the data, e.g., by bird, date or
region, for learning individual models is truly labor-intensive.
Motivated by the extended conditional functional dependencies
(eCFDs [6]) that introduce disjunctions to apply CFD to some
data, in this paper, we propose to automatically discover the
regression models that apply conditionally to some parts of
the data, namely conditional regression rules (CRRs).
Informally, a CRR ϕ takes the form of a triple (f, ρ, C), where
f : X → Y is a regression function from attributes X to Y,
ρ is the maximum bias between the attribute value of Y and
the prediction f(X), and C specifies the conditions. Motivated
by denial constraints (DCs) [7], we use predicates combined by
conjunction and disjunction to specify the parts of the data where
a CRR conditionally applies.
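To make the informal triple (f, ρ, C) concrete, the following is a minimal Python sketch, not the authors' implementation. The class name CRR, the helper date_between, the representation of C as a disjunction of conjunctions of predicates, the use of day-of-year as the input of f, and all numeric coefficients are assumptions made only for illustration. The two date windows are the ones mentioned for Figure 1(b); the rows are synthetic and are not taken from BirdMap.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]                 # one tuple: attribute name -> value
Predicate = Callable[[Row], bool]    # e.g. "Date lies in a given window"


@dataclass
class CRR:
    """Hypothetical container for a rule (f, rho, C); not the paper's code."""
    f: Callable[[Row], float]          # regression function f: X -> Y
    rho: float                         # maximum bias allowed between Y and f(X)
    condition: List[List[Predicate]]   # C, assumed here to be an OR of ANDs
    target: str                        # name of attribute Y

    def applies_to(self, row: Row) -> bool:
        # C holds if every predicate of at least one conjunction holds.
        return any(all(p(row) for p in conj) for conj in self.condition)

    def satisfied_by(self, row: Row) -> bool:
        # Tuples outside C are not constrained by this rule.
        if not self.applies_to(row):
            return True
        return abs(row[self.target] - self.f(row)) <= self.rho


def date_between(lo: date, hi: date) -> Predicate:
    # Predicate on the Date attribute: lo <= Date <= hi.
    return lambda row: lo <= row["Date"] <= hi


# Made-up linear model of Latitude over day-of-year (illustrative numbers).
rule = CRR(
    f=lambda row: 120.0 - 0.3 * row["Date"].timetuple().tm_yday,
    rho=1.0,
    condition=[
        [date_between(date(2006, 8, 11), date(2006, 9, 12))],
        [date_between(date(2008, 8, 18), date(2008, 9, 19))],
    ],
    target="Latitude",
)

inside = {"Date": date(2006, 8, 20), "Latitude": 50.0}   # synthetic row
outside = {"Date": date(2007, 3, 30), "Latitude": 10.0}  # synthetic row
off = {"Date": date(2008, 8, 27), "Latitude": 30.0}      # synthetic row
```

In this sketch a single rule object covers both date windows, which mirrors the idea that one regression model may be shared by several isolated parts of the data instead of being learned once per part. The row named inside falls in the first window and lies within ρ of the prediction, outside is unconstrained because no conjunction of C holds for it, and off falls in the second window but deviates from the prediction by more than ρ.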
Example 2 (Example 1 continued). For the light-blue points
with Date from 2006-8-11 to 2006-9-12, a CRR ϕ1 is observed
2022 IEEE 38th International Conference on Data Engineering (ICDE)
2375-026X/22/$31.00 ©2022 IEEE
DOI 10.1109/ICDE53745.2022.00231