forming continuous noisy label detection tasks? The platform will receive a large number of such continuous noisy label detection tasks, each of which is time-consuming and computationally expensive. This requires our approach to be designed and implemented so that it delivers both efficiency and strong detection performance.
Facing the above challenges, we propose a novel framework, ENLD, to efficiently perform noisy label detection on incremental datasets. The core idea of our design is to sample contrastive samples from the inventory data, which greatly helps to identify ambiguous samples in incremental datasets, and to discover clean samples by majority voting across multiple fine-tuning processes (a minimal sketch of this voting step is given after the contribution list below). Specifically, ENLD is a two-stage framework. First, ENLD trains a general model and estimates the conditional probability of label mislabeling on the inventory data. Then, ENLD conducts fine-grained noisy label detection with contrastive sampling for each specific incremental dataset, involving multiple rounds of re-sampling and model fine-tuning. Our contributions are summarized as follows:
• We propose a novel framework, ENLD, to efficiently perform noisy label detection on incremental datasets. We consider label probabilities, the output confidences of samples, and the relationships between feature representations, and carefully design a set of techniques including contrastive sampling and fine-grained noisy label detection. ENLD achieves superior noisy label detection performance for newly arriving datasets while requiring only a small amount of fine-tuning.
• We analyze the rationale behind the samples selected by contrastive sampling. Our analysis proves that high-quality samples in the inventory data whose representations are close to those of ambiguous samples in the incremental dataset bring greater benefit to the training process. We also experimentally compare the influence of different sampling strategies on fine-grained noisy label detection.
• We extensively evaluate our framework on public datasets under various noise settings. Experiments demonstrate that our framework outperforms existing methods in both performance and efficiency for noisy label detection on incremental datasets. ENLD achieves an average F1 score of 0.9191 on EMNIST and 0.8194 on CIFAR100 across various noise settings, outperforming the next best method, Topofilter. Compared with Topofilter, ENLD also achieves 4.09× and 3.65× detection speedups in average processing time on EMNIST and CIFAR100, respectively. On the more complex classification task Tiny-ImageNet, ENLD performs significantly better than the baseline methods: it achieves an average F1 score of 0.7297, while that of Topofilter is only 0.6171, and it attains a 4.97× detection speedup in average processing time.
II. RELATED WORK AND PRELIMINARIES
A. Noisy Label Detection Methods
In noisy learning, recent works focus on sample selection methods [18], [19], which first select clean samples from the dataset and then train the DNN on the filtered, cleaner subset. Decouple [20] maintains two DNNs and selects clean samples for the model update based on the disagreement in label predictions between the two DNNs. MentorNet [21] performs sample selection through a collaborative learning scheme, in which a pre-trained mentor DNN guides the training of a student DNN, and the student receives the samples that the mentor deems clean with high probability. Co-teaching [22] maintains two DNNs; each DNN selects small-loss samples and shares the result with the other DNN for subsequent training. Building on Co-teaching, Co-teaching Plus [23] integrates the disagreement strategy of Decouple. INCV [12] randomly splits the dataset into two parts and selects clean data through cross-validation. SELFIE [24] selects clean data using a small-loss criterion together with selective refurbishment of samples. Topofilter [13] is a graph-based method in the latent representation space that collects clean data and drops isolated data. Confident learning [14] proposes a framework that filters noisily labeled data by directly estimating the joint distribution of the noisy labels and the unknown true labels from the softmax outputs of a deep model trained on the noisy dataset. However, these previous works focus on already-collected datasets and are not applicable to scenarios where noisy label detection must be performed repeatedly on newly added datasets. In this work, we mainly focus on how to conduct efficient and accurate noisy label detection for incremental datasets.
B. Sample Selection Strategy
ENLD involves selecting samples from the inventory data for incremental datasets during training, and related data selection strategies are also widely used in active learning [25] and semi-supervised learning [26]. In active learning, information entropy and confidence are widely used metrics to measure the uncertainty of samples with respect to the current model; samples with high uncertainty are expected to bring large benefits to its training. The methods in [27], [28] adopt uncertainty-based sampling strategies to select samples during training. Moreover, in semi-supervised learning methods [29], [30], [31] and active learning methods [32], the samples with the highest confidence tend to be selected and assigned pseudo labels to participate in training. In this work, we also conduct experiments in Section V that swap in different sampling strategies within ENLD's fine-grained noisy label detection to explore the impact of different sample selection strategies.
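For reference, the two uncertainty scores mentioned above can be computed from softmax outputs as in the following minimal sketch; the function names are ours and are not tied to any of the cited methods.

```python
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Information entropy of the predicted class distribution (higher = more uncertain)."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def confidence_uncertainty(probs):
    """One minus the maximum softmax probability (higher = more uncertain)."""
    return 1.0 - probs.max(axis=1)

# probs: (num_samples, num_classes) softmax outputs of the current model
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
print(entropy_uncertainty(probs))     # low for the first sample, high for the second
print(confidence_uncertainty(probs))  # same ordering under the confidence criterion
```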
III. PROBLEM AND MAIN IDEA
A. Problem Description
Given a large amount of inventory data (e.g., in a data lake) $I = \{(x_i^I, \tilde{y}_i^I)\}$ with a number of classes and samples, the system needs to perform noisy label detection on incoming incremental datasets $D = \{(x_i^D, \tilde{y}_i^D)\}$. Here, $\tilde{y}_i$ denotes the observed label and $y_i^*$ denotes the unknown true label. The noisy labels in both $I$ and $D$ are generated by a label probability transition matrix $T_{i,j} = P(\tilde{y} = j \mid y^* = i)$, which represents the probability of mislabeling between classes in manual annotation.
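As an illustration of this noise model (not part of ENLD itself), the sketch below samples observed labels from a given transition matrix $T$; the symmetric-noise construction and the function name apply_label_noise are assumptions made for the example.

```python
import numpy as np

def apply_label_noise(true_labels, T, rng=None):
    """Sample observed labels according to P(y_tilde = j | y* = i) = T[i, j].

    T is a (num_classes x num_classes) row-stochastic matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_classes = T.shape[0]
    return np.array([rng.choice(num_classes, p=T[y]) for y in true_labels])

# Example: symmetric noise with flip rate 0.2 over 10 classes
num_classes, flip = 10, 0.2
T = np.full((num_classes, num_classes), flip / (num_classes - 1))
np.fill_diagonal(T, 1.0 - flip)           # keep the true class with prob. 0.8
y_true = np.random.randint(0, num_classes, size=1000)
y_noisy = apply_label_noise(y_true, T)    # observed (possibly corrupted) labels
```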
In a practical scenario, $D_i$ may be a dataset collected by the data platform itself or a dataset submitted to the platform in order to obtain noisy label detection results. The