ENLD: Efficient Noisy Label Detection for Incremental Datasets in Data Lake
Xuanke You, Lan Zhang, Junyang Wang, Zhimin Bao, Yunfei Wu and Shuaishuai Dong
University of Science and Technology of China, Tencent Group
yxkyong@mail.ustc.edu.cn, zhanglan@ustc.edu.cn, iswangjy@mail.ustc.edu.cn, {zhiminbao, marcowu, shuaidong}@tencent.com
Abstract—Due to the difficulty of obtaining high-quality data
in real-world scenarios, datasets inevitably contain noisy labeled
data, leading to inefficient data usage and poor model performance. Thus, noisy label detection is an important research
topic. Previous efforts mainly focus on noisy label detection on
specific datasets that have been collected. Some works select
clean samples based on relations between representations during
the training process; some works utilize confidence outputs of
a pre-trained model for noisy label detection. However, how
to perform efficient and fine-grained noisy label detection on
constantly arriving datasets in a data lake with a large amount
of inventory data has not been explored. The rapidly growing
volume and changing distribution of data make conventional
methods either incur large computation overhead due to repeated
training or become increasingly ineffective on newly arriving
data. To address these challenges, in this work, we propose a novel
approach ENLD to perform efficient and accurate noisy label
detection on incremental datasets. Our extensive experiments
demonstrate that ENLD outperforms the next best method in both efficiency and accuracy, achieving a 3.65×-4.97× detection speedup and higher average F1 scores under various noise rate settings.
I. INTRODUCTION
In recent years, deep learning has made great achievements in various academic and industrial fields, which usually rely on large labeled datasets [1] [2]. However, in the
real world, both amateurs and experts inevitably produce noisy
labeled data [3]. Therefore, noisy label detection and learning
with noisy data have attracted much attention.
In industry, ubiquitous data lakes or data platforms provide
massive data for deep learning systems, which also pose a
huge challenge to data quality management [4]. There are
two mainstream approaches to deal with noisy labels: robust architecture and sample selection. Robust architecture reduces the influence of noisy labels to obtain a deep model with better performance through robust training methods, such as the noise adaptation layer [5] [6], loss correction [7] [8], and label
refurbishment [9] [10]. Sample selection explicitly filters noisy
labeled data considering the impact of samples on training loss
or the softmax output of deep models. Compared with the
robust architecture, it can obtain a clean dataset with stronger
reusability. A widely adopted idea for sample selection is to
use some selection metrics (e.g. loss tracking) on samples
during multiple rounds of the training process, such as O2U-
Net [11] and INCV [12]. Topofilter [13] proposes a graph-
based method in the latent representational space to collect
clean data and drop isolated data. Confident learning [14] designs a framework to filter noisy labeled data with a directly estimated joint distribution of noisy labels and unknown true labels, based on the confidence outputs of a deep model trained on noisy datasets.
Previous works, however, focus on datasets that have already been collected. In real-world data lakes and platforms, new data usually arrive constantly, and many platforms, such as crowdsourcing platforms and data trading platforms [15] [16] [17], need to constantly perform accurate and efficient label quality assessments on the newly arriving data. Directly adopting existing training-based methods, e.g., Topofilter [13] and other loss tracking methods [11] [12], to detect noisy labels in incremental data can hardly achieve good performance due to the lack of sample diversity and the unbalanced categories in the incremental dataset. Applying those methods to both the inventory dataset and the incremental dataset, on the other hand, leads to a huge computation overhead due to the excessive number of samples in the inventory data. Besides, a noisy label detection model trained on the inventory dataset usually cannot adapt well to specific incremental datasets. Pretrain-based methods, like confident learning [14], have low computation overhead but poor noisy label detection performance on incremental datasets due to the changing data distribution. How to achieve efficient and accurate noisy label detection on constantly arriving datasets in a data lake is still an unexplored problem.
In this work, we focus on efficient and adaptive noisy label
detection on constantly arriving incremental datasets in a data
lake with a large amount of inventory data, and address the
following challenges:
(1) How to leverage the knowledge from massive inventory
data and how to adapt to the unknown data distribution of
incremental data? Incremental datasets usually contain only a small number of samples from a subset of the classes in the inventory data and have unbalanced class distributions. Using the incremental datasets alone cannot achieve satisfactory noisy label detection. It is crucial to mine and establish associations
between incremental datasets and the inventory data, as well as
to select proper samples from the inventory data as contrastive
samples to improve the detection performance and reduce the
training cost. During the selection of contrastive samples, it
is necessary to consider the data distribution of incremental
datasets for better adaptivity.
(2) How to ensure efficiency and performance while performing continuous noisy label detection tasks? The platform will receive a large number of continuous noisy label detection tasks, each of which is time-consuming and computationally expensive. This requires our approach to be designed and implemented in a way that ensures both efficiency and performance.
Facing the above challenges, we propose a novel framework, ENLD, to efficiently perform noisy label detection on incremental datasets. The core idea of our design is to select contrastive samples from the inventory data, which greatly benefit the identification of ambiguous samples in incremental datasets, and to discover clean samples by majority voting through multiple fine-tuning processes. Specifically, ENLD is a two-stage framework. First, ENLD trains a general model and estimates the conditional probability of label mislabeling on the inventory data. Then, ENLD conducts fine-grained noisy label detection with contrastive sampling for specific incremental datasets, including multiple rounds of re-sampling and model fine-tuning. Our contributions are summarized as follows:
• We propose a novel framework ENLD to efficiently perform noisy label detection on incremental datasets. We consider label probabilities, output confidences of samples, and relationships between feature representations, and carefully design a set of techniques including contrastive sampling and fine-grained noisy label detection. ENLD achieves superior noisy label detection performance for newly arriving datasets, requiring only a small amount of fine-tuning.
• We analyze the rationality of the selected samples in contrastive sampling. Our analysis proves that high-quality samples in the inventory data whose representations are close to those of ambiguous samples in incremental datasets bring greater benefits to the training process. We also compare the influence of different sampling strategies on fine-grained noisy label detection in experiments.
• We extensively evaluate our framework on public datasets with various noise settings. Experiments demonstrate that our framework outperforms existing methods in both performance and efficiency for noisy label detection on incremental datasets. ENLD achieves an average F1 score of 0.9191 on EMNIST and 0.8194 on CIFAR100 under various noise settings, outperforming the next best method, Topofilter. Compared with Topofilter, ENLD also achieves 4.09× and 3.65× detection speedups in average processing time on EMNIST and CIFAR100, respectively. For a more complex classification task, Tiny-ImageNet, ENLD performs significantly better than the baseline methods: it achieves an average F1 score of 0.7297 while that of Topofilter is only 0.6171, and achieves a 4.97× detection speedup in average processing time.
II. RELATED WORK AND PRELIMINARIES
A. Noisy Label Detection Methods
In noisy learning, recent works focus on sample selection methods [18] [19], which attempt to first select clean samples in the dataset and then train the DNN on the filtered, cleaner dataset. Decouple [20] maintains two DNNs and selects clean
samples for the model update by the difference in label
predictions between two DNNs. MentorNet [21] completes
sample selection through a collaborative learning method, in
which the pre-trained mentor DNN guides the training of a
student DNN, and the student receives clean samples with
a high probability provided by the mentor. Co-teaching [22] maintains two DNNs; each DNN selects small-loss samples and shares the results with the other DNN for further training. Based on Co-teaching, Co-teaching Plus [23]
integrates the disagreement strategy of Decouple. INCV [12]
randomly splits the dataset into two parts and selects clean data
through cross-validation. SELFIE [24] selects clean data by
small-loss criteria and selective refurbishment of samples. Topofilter [13] is a graph-based method in the latent representation space that collects clean data and drops isolated data. Confident learning [14] proposes a framework to filter noisy labeled data with a directly estimated joint distribution of noisy labels and unknown true labels, based on the softmax output of a deep model trained on noisy datasets. However, previous works focus on already-collected datasets and are not applicable to scenarios where noisy label detection needs to be performed repeatedly on newly added datasets. In this work, we mainly focus on how to conduct efficient and accurate noisy label detection for incremental datasets.
B. Sample Selection Strategy
ENLD involves selecting samples from the inventory data for incremental datasets during the training process, and many data selection strategies have been used in active learning methods [25] and semi-supervised learning methods [26]. In active learning, information entropy and confidence are widely used metrics to measure the uncertainty of samples for the current model; samples with large uncertainty bring great benefits to the training of the current model. Methods [27] [28] adopt uncertainty-based sampling strategies to select samples during the training process. Moreover, in semi-supervised learning methods [29] [30] [31] and active learning methods [32], the samples with the highest confidence tend to be selected and given pseudo labels to participate in training. In this work, we also conduct experiments that replace the sampling strategy in the fine-grained noisy label detection method of ENLD to explore the impact of different sample selection strategies in Section V.
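As a concrete illustration of the entropy-based uncertainty metric mentioned above, the sketch below (our own simplified example, not part of ENLD; the function name and toy probabilities are illustrative) selects the k samples whose softmax outputs have the highest predictive entropy:

```python
import numpy as np

def entropy_sampling(softmax_probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with the highest predictive entropy.

    softmax_probs: array of shape (n_samples, n_classes) holding model confidences.
    """
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(softmax_probs * np.log(softmax_probs + eps), axis=1)
    return np.argsort(entropy)[-k:]  # indices of the k most uncertain samples

# Toy usage: the two least confident predictions are selected.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])
print(entropy_sampling(probs, k=2))
```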
III. PROBLEM AND MAIN IDEA
A. Problem Description
Given a large amount of inventory data (e.g., in a data lake) I = {(x_i^I, ỹ_i^I)} with a number of classes and samples, the system needs to perform noisy label detection on incoming incremental datasets D = {(x_i^D, ỹ_i^D)}. Here, ỹ_i represents the observed label and y_i represents the unknown true label.
The noisy labels in both I and D are generated by a label probability transition matrix T_{i,j} = P(ỹ = j | y = i), which represents the probability of mislabeling between labels according to manual experience. In an actual scenario, D_i may be a dataset collected by the data platform or a dataset for which noisy label detection results are requested from the data platform.
TABLE I: Notation used in ENLD.
Notation        Definition
ỹ               The observed label of the sample
y               The true label of the sample
I               The inventory data in the data platform
D               The constantly arriving incremental datasets
H               The high-quality samples in the inventory data
A               The ambiguous samples in the incremental dataset
θ               The general deep model trained with the inventory data
θ'              The fine-tuned model for incremental datasets based on θ
M(x, θ)         The confidence output of sample x by the deep model θ
M̂(x, θ)         The feature vector of sample x by the deep model θ
x_i^L, ỹ_i^L    The samples and observed labels in set L
The goal of our framework is to efficiently perform accurate noisy label detection on the incremental dataset. Important notations are summarized in Table I.
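For intuition about the noise model above, the following sketch (an illustrative example under our own assumptions, not part of the ENLD implementation) draws observed labels ỹ from clean labels y according to a transition matrix T with T[i, j] = P(ỹ = j | y = i):

```python
import numpy as np

def corrupt_labels(y: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Sample observed labels with P(y_tilde = j | y = i) = T[i, j]."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[label]) for label in y])

# Toy example: 3 classes with 20% symmetric label noise.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
y_true = np.array([0, 1, 2, 0, 1, 2])
y_observed = corrupt_labels(y_true, T)  # each label flips with probability 0.2
```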
B. Main Idea
If we directly use the confidence outputs of a pre-trained general model on the incremental dataset to detect noisy samples, the performance depends heavily on the generalization ability of the general model trained with noisy labels, which often performs poorly on complex classification tasks. Applying previous training-based methods to both the inventory dataset and the incremental data would introduce a lot of computation overhead, which is also unsuitable for our scenario.
To meet the requirements of high efficiency and accuracy, we expect to spend only a small amount of fine-tuning to achieve superior noisy label detection results for specific new datasets. Thus, we propose a two-stage framework for noisy label detection on incremental datasets, which maintains a general model and fine-tunes it on different incremental datasets. Meanwhile, different incremental datasets have different data distributions and different ambiguous samples for the general deep model. Here, ambiguous samples are those whose observed labels are inconsistent with the labels predicted by the current model, as defined in Definition 1. The main idea of our work is to select high-quality contrastive samples for the ambiguous samples in incremental datasets, and then fine-tune the model on the specific data distribution to achieve accurate noisy label detection results. In contrastive sampling, we consider the label probabilities, output confidences, and feature representations of the current model to select contrastive samples that greatly benefit the identification of ambiguous samples in incremental datasets.
IV. FRAMEWORK OF ENLD
In this section, the detailed design and implementation of
ENLD will be introduced. We will first describe the framework
overview of ENLD, then contrastive sampling, fine-grained
noisy label detection, and finally the model update.
A. Framework Overview
We describe the framework overview of our proposed ENLD as shown in Algorithm 1 and Fig. 1. The platforms suitable for deploying the ENLD framework hold a certain amount of inventory data, and incremental datasets with noisy label detection requests arrive continuously. On the platform side, ENLD first divides the inventory data I into I_t and I_c randomly. Then, ENLD initializes a general model θ with I_t and estimates the probability P̃(y = j | ỹ = i). After the initialization of ENLD, noisy label detection on incremental datasets can be performed. For example, when an incremental dataset D arrives, ENLD first performs contrastive sampling on the current D to obtain an initial contrastive sample set C. Then a fine-grained noisy label detection method with re-sampling is performed to obtain the selected clean part S and noisy part N of D based on the general model θ. Moreover, during the noisy label detection process on incremental datasets, the system can also perform data selection for the inventory data. The platform can choose to update the general model and re-estimate the probability P̃(y = j | ỹ = i).
Algorithm 1 Framework of Efficient Noisy Label Detection (ENLD)
Input: the inventory data I = {(x_i^I, ỹ_i^I)}, the incremental datasets {D_i}_{i=1}^{t}, the parameter of contrastive sample size k
Output: the noisy label detection results S_i, N_i
1: θ, P̃, I_t, I_c = model_init(I);
2: H = {(x, ỹ) ∈ I_c : argmax M(x, θ) = ỹ};
3: S_c = ∅;
4: while D_i arrives do
5:     H' = {(x, ỹ) : ỹ ∈ label(D_i) and (x, ỹ) ∈ H};
6:     A = {(x, ỹ) ∈ D_i : argmax M(x, θ) ≠ ỹ};
7:     C = contrastive_sampling(A, H', P̃, k, θ);
8:     S_i, N_i, S'_c = fine_grained_NLD(C, D_i, θ);
9:     S_c = S_c ∪ S'_c;
10:    θ, P̃, I_t, I_c = model_update(S_c, I_t, I_c); // Optional
11: end while
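To make the data flow of Algorithm 1 concrete, the following Python sketch mirrors its steps. The components model_init, contrastive_sampling, fine_grained_nld, model_update, and predict (a stand-in for argmax M(x, θ)) are injected as callables because their implementations are described in the rest of this section; the sketch fixes only the control and data flow, not the components themselves.

```python
from typing import Callable, Iterable, List, Tuple

def enld_loop(inventory_data,
              incremental_datasets: Iterable[list],
              k: int,
              model_init: Callable,
              contrastive_sampling: Callable,
              fine_grained_nld: Callable,
              model_update: Callable,
              predict: Callable) -> List[Tuple[list, list]]:
    """Control flow of Algorithm 1; component functions are supplied by the caller."""
    theta, P_tilde, I_t, I_c = model_init(inventory_data)            # line 1
    # High-quality samples: candidates whose prediction matches the observed label.
    H = [(x, y) for (x, y) in I_c if predict(theta, x) == y]         # line 2
    S_c = []                                                         # line 3
    results = []
    for D_i in incremental_datasets:                                 # line 4
        labels_i = {y for (_, y) in D_i}
        H_prime = [(x, y) for (x, y) in H if y in labels_i]          # line 5
        # Ambiguous samples: prediction disagrees with the observed label.
        A = [(x, y) for (x, y) in D_i if predict(theta, x) != y]     # line 6
        C = contrastive_sampling(A, H_prime, P_tilde, k, theta)      # line 7
        S_i, N_i, S_c_new = fine_grained_nld(C, D_i, theta)          # line 8
        S_c = S_c + list(S_c_new)                                    # line 9
        theta, P_tilde, I_t, I_c = model_update(S_c, I_t, I_c)       # line 10 (optional)
        results.append((S_i, N_i))
    return results
```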
B. Model Initialization & Probability Estimation
In this part, the system needs to obtain a general model and
estimate the probability of label mislabeling.
Model Initialization: First, we divide the inventory data I into I_t and I_c uniformly at random. Here, I_t is the training set used to initialize and train a general model θ, and I_c is the candidate set of contrastive samples used to accommodate specific incremental datasets. In the system implementation, we use I_t to train the initial model with the augmentation method Mixup [33]. Mixup randomly mixes samples and labels with a Beta distribution for better generalization, as shown in Eq. 1 and Eq. 2, where λ ∼ Beta(α, α). We set the parameter of the Beta distribution to α = 0.2 in all experiments in Section V.

x̂ = λx_i + (1 − λ)x_j    (1)
ŷ = λy_i + (1 − λ)y_j    (2)
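A minimal sketch of the Mixup step, assuming NumPy arrays and one-hot label vectors (the function and variable names are ours; the paper specifies only Eq. 1, Eq. 2, and α = 0.2):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha: float = 0.2, rng=None):
    """Mix a pair of samples and their one-hot labels as in Eq. (1) and Eq. (2)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    x_hat = lam * x_i + (1.0 - lam) * x_j   # Eq. (1)
    y_hat = lam * y_i + (1.0 - lam) * y_j   # Eq. (2)
    return x_hat, y_hat
```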
Probability Estimation: According to the assumption ỹ = argmax p̃(ỹ; x, θ) in [12], the predicted label and the true label have the same distribution. We utilize the confidence output of the model M(x, θ) on I_c and the observed label of each sample to estimate the joint distribution J of true labels and observed labels.