ENLD: Efficient Noisy Label Detection for Incremental Datasets in Data Lake
Xuanke You, Lan Zhang, Junyang Wang, Zhimin Bao, Yunfei Wu and Shuaishuai Dong
University of Science and Technology of China, Tencent Group
yxkyong@mail.ustc.edu.cn, zhanglan@ustc.edu.cn, iswangjy@mail.ustc.edu.cn, {zhiminbao, marcowu, shuaidong}@tencent.com
Abstract—Due to the difficulty of obtaining high-quality data
in real-world scenarios, datasets inevitably contain noisy labeled
data, leading to inefficient data usage and poor model performance. Thus, noisy label detection is an important research
topic. Previous efforts mainly focus on noisy label detection on
specific datasets that have been collected. Some works select
clean samples based on relations between representations during
the training process; some works utilize confidence outputs of
a pre-trained model for noisy label detection. However, how
to perform efficient and fine-grained noisy label detection on
constantly arriving datasets in a data lake with a large amount
of inventory data has not been explored. The rapidly growing
volume and changing distribution of data make conventional
methods either incur large computation overhead due to repeated
training or become increasingly ineffective on newly arriving
data. To address these challenges, in this work, we propose a novel
approach ENLD to perform efficient and accurate noisy label
detection on incremental datasets. Our extensive experiments
demonstrate that ENLD outperforms the next best method in both efficiency and accuracy, achieving a 3.65×-4.97× detection speedup and higher average F1 scores under various noise rate settings.
I. INTRODUCTION
In recent years, deep learning has made great achievements in various academic and industrial fields, which usually rely on large labeled datasets [1] [2]. However, in the
real world, both amateurs and experts inevitably produce noisy
labeled data [3]. Therefore, noisy label detection and learning
with noisy data have attracted much attention.
In industry, ubiquitous data lakes or data platforms provide
massive data for deep learning systems, which also pose a
huge challenge to data quality management [4]. There are
two mainstream approaches to deal with noisy labels: robust architecture and sample selection. Robust architecture reduces the influence of noisy labels to obtain a deep model with better performance through robust training methods, such as the noise adaptation layer [5] [6], loss correction [7] [8], and label
refurbishment [9] [10]. Sample selection explicitly filters noisy
labeled data considering the impact of samples on training loss
or the softmax output of deep models. Compared with the
robust architecture, it can obtain a clean dataset with stronger
reusability. A widely adopted idea for sample selection is to
use some selection metrics (e.g. loss tracking) on samples
during multiple rounds of the training process, such as O2U-
Net [11] and INCV [12]. Topofilter [13] proposes a graph-
based method in the latent representational space to collect
clean data and drop isolated data. Confident learning [14] designs a framework to filter noisy labeled data with a directly estimated joint distribution of noisy labels and unknown true labels, based on the confidence outputs of a deep model trained on noisy datasets.
Previous works, however, focus on datasets that have already been collected. In real-world data lakes and platforms, new data usually arrive constantly, and many platforms, such as crowdsourcing platforms and data trading platforms [15] [16] [17], need to constantly perform accurate and efficient label quality assessments on the newly arriving data. Directly adopting existing training-based methods, e.g., Topofilter [13] and other loss tracking methods [11] [12], to detect noisy labels in incremental data can hardly achieve good performance due to the lack of sample diversity and the unbalanced categories in the incremental dataset. Applying those methods to both the inventory dataset and the incremental dataset, on the other hand, leads to a huge computation overhead due to the excessive number of samples in the inventory data. Besides, a noisy label detection model trained on the inventory dataset usually cannot adapt well to specific incremental datasets. Pretrain-based methods, like confident learning [14], have low computation overhead but poor noisy label detection performance on incremental datasets due to the changing data distribution. How to achieve efficient and accurate noisy label detection on constantly arriving datasets in a data lake is still an unexplored problem.
In this work, we focus on efficient and adaptive noisy label
detection on constantly arriving incremental datasets in a data
lake with a large amount of inventory data, and address the
following challenges:
(1) How to leverage the knowledge from massive inventory
data and how to adapt to the unknown data distribution of
incremental data? Incremental datasets usually contain only a small number of samples from a subset of the classes in the inventory data and have unbalanced class distributions. Using the incremental datasets alone cannot achieve satisfactory noisy label detection. It is crucial to mine and establish associations
between incremental datasets and the inventory data, as well as
to select proper samples from the inventory data as contrastive
samples to improve the detection performance and reduce the
training cost. During the selection of contrastive samples, it
is necessary to consider the data distribution of incremental
datasets for better adaptivity.
(2) How to ensure efficiency and performance while performing continuous noisy label detection tasks? The platform will receive a large number of continuous noisy label detection tasks, each of which is time-consuming and computationally expensive. This requires our approach to be designed and implemented in a way that ensures both efficiency and performance.
Facing the above challenges, we propose a novel framework, ENLD, to efficiently perform noisy label detection on incremental datasets. The core idea of our design is to select contrastive samples from the inventory data, which greatly benefit the identification of ambiguous samples in incremental datasets, and to discover clean samples by majority voting through multiple fine-tuning processes. Specifically, ENLD is a two-stage framework. First, ENLD trains a general model and estimates the conditional probability of label mislabeling on the inventory data. Then, ENLD conducts fine-grained noisy label detection with contrastive sampling for specific incremental datasets, including multiple rounds of re-sampling and model fine-tuning. Our contributions are summarized as follows:
• We propose a novel framework ENLD to efficiently perform noisy label detection on incremental datasets. We consider label probabilities, output confidences of samples, and relationships between feature representations, and carefully design a set of techniques including contrastive sampling and fine-grained noisy label detection. ENLD achieves superior noisy label detection performance for newly arriving datasets, requiring only a small amount of fine-tuning.
• We analyze the rationality of the selected samples in contrastive sampling. Our analysis proves that high-quality samples in the inventory data whose representations are close to those of ambiguous samples in incremental datasets bring greater benefits to the training process. We also compare the influence of different sampling strategies on fine-grained noisy label detection in experiments.
• We extensively evaluate our framework on public datasets with various noise settings. Experiments demonstrate that our framework outperforms existing methods in both performance and efficiency for noisy label detection on incremental datasets. ENLD achieves an average F1 score of 0.9191 on EMNIST and 0.8194 on CIFAR100 under various noise settings, outperforming the next best method, Topofilter. Compared with Topofilter, ENLD also achieves 4.09× and 3.65× detection speedups in average processing time on EMNIST and CIFAR100, respectively. For a more complex classification task, Tiny-ImageNet, ENLD performs significantly better than the baseline methods: it achieves an average F1 score of 0.7297 while that of Topofilter is only 0.6171, and achieves a 4.97× detection speedup in average processing time.
II. RELATED WORK AND PRELIMINARIES
A. Noisy Label Detection Methods
In noisy learning, recent works focus on sample selection methods [18] [19], which attempt to first select clean samples in the dataset and then train the DNN on the filtered, cleaner dataset. Decouple [20] maintains two DNNs and selects clean
samples for the model update by the difference in label
predictions between two DNNs. MentorNet [21] completes
sample selection through a collaborative learning method, in
which the pre-trained mentor DNN guides the training of a
student DNN, and the student receives clean samples with
a high probability provided by the mentor. Co-teaching [22] maintains two DNNs; each DNN selects small-loss samples and shares the results with the other DNN for further training. Based on Co-teaching, Co-teaching Plus [23]
integrates the disagreement strategy of Decouple. INCV [12]
randomly splits the dataset into two parts and selects clean data
through cross-validation. SELFIE [24] selects clean data by
small-loss criteria and selective refurbishment of samples. Topofilter [13] is a graph-based method in the latent representation space that collects clean data and drops isolated data. Confident learning [14] proposes a framework to filter noisy labeled data with a directly estimated joint distribution of noisy labels and unknown true labels, based on the softmax output of a deep model trained on noisy datasets. However, previous works focus on already-collected datasets and are not applicable to scenarios where noisy label detection needs to be performed repeatedly on newly added datasets. In this work, we mainly focus on how to conduct efficient and accurate noisy label detection for incremental datasets.
B. Sample Selection Strategy
ENLD involves selecting samples from the inventory data for incremental datasets during the training process, and many data selection strategies have been used in active learning methods [25] and semi-supervised learning methods [26]. In active learning, information entropy and confidence are widely used metrics to measure the uncertainty of samples for the current model; samples with large uncertainty bring great benefits to the training of the current model. Methods [27] [28] adopt uncertainty-based sampling strategies to select samples during the training process. Moreover, in semi-supervised learning methods [29] [30] [31] and active learning methods [32], the samples with the highest confidence tend to be selected and given pseudo labels to participate in training. In this work, we also conduct experiments that replace the sampling strategy in the fine-grained noisy label detection method of ENLD to explore the impact of different sample selection strategies in Section V.
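As a concrete illustration of the entropy-based uncertainty metric mentioned above, the sketch below (our own simplified example, not part of ENLD; the function name and toy probabilities are illustrative) selects the k samples whose softmax outputs have the highest predictive entropy:

```python
import numpy as np

def entropy_sampling(softmax_probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with the highest predictive entropy.

    softmax_probs: array of shape (n_samples, n_classes) holding model confidences.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(softmax_probs * np.log(softmax_probs + eps), axis=1)
    return np.argsort(entropy)[-k:]  # indices of the k most uncertain samples

# Toy usage: the two least confident predictions are selected.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])
print(entropy_sampling(probs, k=2))
```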
III. PROBLEM AND MAIN IDEA
A. Problem Description
Given a large amount of inventory data (e.g., in a data lake) I = {(x_i^I, ỹ_i^I)} with a number of classes and samples, the system needs to perform noisy label detection on incoming incremental datasets D = {(x_i^D, ỹ_i^D)}. Here, ỹ_i represents the observed label and y_i represents the unknown true label.
The noisy labels in both I and D are generated by a label probability transition matrix T_{i,j} = P(ỹ = j | y = i), which represents the probability of mislabeling between labels according to manual experience. In an actual scenario, D_i may be a dataset collected by the data platform or a dataset for which noisy label detection results are requested from the data platform.
TABLE I: Notation used in ENLD.
Notation        Definition
ỹ               The observed label of the sample
y               The true label of the sample
I               The inventory data in the data platform
D               The constantly arriving incremental datasets
H               The high-quality samples in the inventory data
A               The ambiguous samples in the incremental dataset
θ               The general deep model trained with the inventory data
θ'              The fine-tuned model for incremental datasets based on θ
M(x, θ)         The confidence output of sample x by the deep model θ
M̂(x, θ)         The feature vector of sample x by the deep model θ
x_i^L, ỹ_i^L    The samples and observed labels in set L
The goal of our framework is to efficiently perform accurate noisy label detection on the incremental dataset. Important notations are summarized in Table I.
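For intuition about the noise model above, the following sketch (an illustrative example under our own assumptions, not part of the ENLD implementation) draws observed labels ỹ from clean labels y according to a transition matrix T with T[i, j] = P(ỹ = j | y = i):

```python
import numpy as np

def corrupt_labels(y: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Sample observed labels with P(y_tilde = j | y = i) = T[i, j]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[label]) for label in y])

# Toy example: 3 classes with 20% symmetric label noise.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
y_true = np.array([0, 1, 2, 0, 1, 2])
y_observed = corrupt_labels(y_true, T)  # each label flips with probability 0.2
```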
B. Main Idea
If we directly use the confidence outputs of a pre-trained general model on the incremental dataset to detect noisy samples, the performance depends heavily on the generalization ability of the general model trained with noisy labels, which often performs poorly on complex classification tasks. Applying previous training-based methods to both the inventory dataset and the incremental data would introduce a lot of computation overhead, which is also unsuitable for our scenario.
To meet the requirements of high efficiency and accuracy, we expect to spend only a small amount of fine-tuning to achieve superior noisy label detection results for specific new datasets. Thus, we propose a two-stage framework for noisy label detection on incremental datasets, which maintains a general model and fine-tunes it on different incremental datasets. Meanwhile, different incremental datasets have different data distributions and different ambiguous samples for the general deep model. Here, ambiguous samples are those whose observed labels are inconsistent with the labels predicted by the current model, as defined in Definition 1. The main idea of our work is to select high-quality contrastive samples for the ambiguous samples in incremental datasets, and then fine-tune the model on the specific data distribution to achieve accurate noisy label detection results. In contrastive sampling, we consider the label probabilities, output confidences, and feature representations of the current model to select contrastive samples that greatly benefit the identification of ambiguous samples in incremental datasets.
IV. FRAMEWORK OF ENLD
In this section, the detailed design and implementation of
ENLD will be introduced. We will first describe the framework
overview of ENLD, then contrastive sampling, fine-grained
noisy label detection, and finally the model update.
A. Framework Overview
We describe the framework overview of our proposed ENLD as shown in Algorithm 1 and Fig. 1. The platforms suitable for deploying the ENLD framework hold a certain amount of inventory data, and incremental datasets with noisy label detection requests arrive continuously. On the platform side, ENLD first divides the inventory data I into I_t and I_c randomly. Then, ENLD initializes a general model θ with I_t and estimates the probability P̃(y = j | ỹ = i). After the initialization of ENLD, noisy label detection on incremental datasets can be performed. For example, when an incremental dataset D arrives, ENLD first performs contrastive sampling on the current D to obtain an initial contrastive sample set C. Then a fine-grained noisy label detection method with re-sampling is performed to obtain the selected clean part S and noisy part N of D based on the general model θ. Moreover, during the noisy label detection process on incremental datasets, the system can also perform data selection for the inventory data. The platform can choose to update the general model and re-estimate the probability P̃(y = j | ỹ = i).
Algorithm 1 Framework of Efficient Noisy Label Detection (ENLD)
Input: the inventory data I = {(x_i^I, ỹ_i^I)}, the incremental datasets {D_i}_{i=1}^{t}, the parameter of contrastive sample size k
Output: the noisy label detection results S_i, N_i
1: θ, P̃, I_t, I_c = model_init(I);
2: H = {(x, ỹ) ∈ I_c : argmax M(x, θ) = ỹ};
3: S_c = ∅;
4: while D_i arrives do
5:     H' = {(x, ỹ) : ỹ ∈ label(D_i) and (x, ỹ) ∈ H};
6:     A = {(x, ỹ) ∈ D_i : argmax M(x, θ) ≠ ỹ};
7:     C = contrastive_sampling(A, H', P̃, k, θ);
8:     S_i, N_i, S'_c = fine_grained_NLD(C, D_i, θ);
9:     S_c = S_c ∪ S'_c;
10:    θ, P̃, I_t, I_c = model_update(S_c, I_t, I_c); // Optional
11: end while
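To make the data flow of Algorithm 1 concrete, the following Python sketch mirrors its steps. The components model_init, contrastive_sampling, fine_grained_nld, model_update, and predict (a stand-in for argmax M(x, θ)) are injected as callables because their implementations are described in the rest of this section; the sketch fixes only the control and data flow, not the components themselves.

```python
from typing import Callable, Iterable, List, Tuple

def enld_loop(inventory_data,
              incremental_datasets: Iterable[list],
              k: int,
              model_init: Callable,
              contrastive_sampling: Callable,
              fine_grained_nld: Callable,
              model_update: Callable,
              predict: Callable) -> List[Tuple[list, list]]:
    """Control flow of Algorithm 1; component functions are supplied by the caller."""
    theta, P_tilde, I_t, I_c = model_init(inventory_data)            # line 1
    # High-quality samples: candidates whose prediction matches the observed label.
    H = [(x, y) for (x, y) in I_c if predict(theta, x) == y]         # line 2
    S_c = []                                                         # line 3
    results = []
    for D_i in incremental_datasets:                                 # line 4
        labels_i = {y for (_, y) in D_i}
        H_prime = [(x, y) for (x, y) in H if y in labels_i]          # line 5
        # Ambiguous samples: prediction disagrees with the observed label.
        A = [(x, y) for (x, y) in D_i if predict(theta, x) != y]     # line 6
        C = contrastive_sampling(A, H_prime, P_tilde, k, theta)      # line 7
        S_i, N_i, S_c_new = fine_grained_nld(C, D_i, theta)          # line 8
        S_c = S_c + list(S_c_new)                                    # line 9
        theta, P_tilde, I_t, I_c = model_update(S_c, I_t, I_c)       # line 10 (optional)
        results.append((S_i, N_i))
    return results
```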
B. Model Initialization & Probability Estimation
In this part, the system needs to obtain a general model and
estimate the probability of label mislabeling.
Model Initialization: First, we divide the inventory data I into I_t and I_c uniformly at random. Here, I_t is the training set used to initialize and train a general model θ, and I_c is the candidate set of contrastive samples used to accommodate specific incremental datasets. In the system implementation, we use I_t to train the initial model with the augmentation method Mixup [33]. Mixup randomly mixes samples and labels with a Beta distribution for better generalization, as shown in Eq. 1 and Eq. 2, where λ ∼ Beta(α, α). We set the parameter of the Beta distribution to α = 0.2 in all experiments in Section V.

x̂ = λx_i + (1 − λ)x_j    (1)
ŷ = λy_i + (1 − λ)y_j    (2)
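A minimal sketch of the Mixup step, assuming NumPy arrays and one-hot label vectors (the function and variable names are ours; the paper specifies only Eq. 1, Eq. 2, and α = 0.2):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha: float = 0.2, rng=None):
    """Mix a pair of samples and their one-hot labels as in Eq. (1) and Eq. (2)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    x_hat = lam * x_i + (1.0 - lam) * x_j   # Eq. (1)
    y_hat = lam * y_i + (1.0 - lam) * y_j   # Eq. (2)
    return x_hat, y_hat
```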
Probability Estimation: According to the assumption ỹ = argmax p̃(ỹ; x, θ) in [12], the predicted label and the true label have the same distribution. We utilize the confidence output of the model M(x, θ) on I_c and the observed label of each sample to estimate the joint distribution J of true labels and observed labels.