FedMix: Boosting with Data Mixture for Vertical Federated Learning

Yihang Cheng, Lan Zhang, Junyang Wang, Xiaokai Chu, Dongbo Huang, Lan Xu

University of Science and Technology of China, Hefei, China

Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China

Tencent, Shanghai, China

yihangcheng@mail.ustc.edu.cn, zhanglan@ustc.edu.cn, iswangjy@mail.ustc.edu.cn

chuxiaokai@ict.ac.cn, andrewhuang@tencent.com, lanxu@tencent.com

Abstract—The need to safeguard data privacy and adhere to

regulations such as GDPR creates data silos and has prompted the

emergence and widespread adoption of techniques for distributed

databases. To effectively explore the value of data across multiple

organizations, techniques for data management, data analysis and

data functionality from distributed databases have been proposed.

Recently, Vertical Federated Learning (VFL) has become a

solution of growing interest, which enables collaborative model

training when data features are partitioned into multiple parts

and are held by different parties. However, typical VFL methods

heavily rely on private set intersection (PSI) to align data before

training and only utilize aligned data for training. In this work, we

provide a theoretical analysis showing that unaligned data actually

contains valuable and rich features, and that a thoughtful design

harnessing the potential of unaligned samples can significantly

improve the performance of VFL models. Regrettably, many

existing methods simply discard unaligned data, resulting in an

irrecoverable loss of performance. To address this data sacriﬁce

problem, we introduce the concept of data mixture, which enables

the utilization of both aligned and unaligned data during training.

Building upon the data mixture idea, we present FedMix, the

ﬁrst on-the-ﬂy and distribution-agnostic framework designed to

boost the performance of VFL models by leveraging unaligned

data. A data seasoning approach is also designed to utilize

auxiliary data lacking label information. Evaluations on diverse

datasets under different settings demonstrate the effectiveness of

the proposed FedMix compared with various SOTA approaches.

FedMix achieves up to 15% model performance improvement

and a time cost reduction of 30.5 hours.

I. INTRODUCTION

In the modern digital era, data has become a critical asset

for organizations. The ability to analyze and extract insights

from data is key to driving business decisions, understanding

consumer behavior, and enhancing operational efﬁciency. There

are already numerous studies focusing on data management

and mining [1, 30, 33, 24] in the centralized scenario. However,

with regulations such as GDPR [38], the landscape of data

management and analysis has signiﬁcantly changed, giving rise

to distributed databases characterized by multiple data silos

across various organizations, in which the transfer of raw data

is typically restricted. Consequently, the exploration of methods

for data management, data analysis and data functionality from

distributed databases in a privacy-preserving way without the

exchange of local data has emerged as a pressing topic.

Lan Zhang is the corresponding author.

[Figure 1 image omitted. Panel labels: Active Party, Passive Party; Sample Space (aligned, unaligned, auxiliary); Label Space; Feature Space.]

Fig. 1: A typical data distribution in VFL. A classic VFL

training process leverages only the aligned data

(highlighted in the red square) but discards other data.

On the other hand, machine learning has developed rapidly

and gradually transformed from a simple target

prediction tool into a powerful means of data analysis, capable

of effectively uncovering the potential value in data. How

to apply machine learning to distributed databases with data

silos has become a growing area of interest for researchers [29,

45, 25, 16, 26]. Federated Learning (FL) [28], ﬁrst proposed

by Google in 2016, is one common technique in this area to

enable collaborative model training without revealing any local

data among parties. To accommodate different scenarios, the

concept of FL was subsequently divided into three categories [27]:

Horizontal Federated Learning (HFL), Vertical Federated

Learning (VFL), and Federated Transfer Learning (FTL).

Among them, VFL is designed for situations where data features

are partitioned into multiple parts, each held by different parties.

This mode can effectively break information silos and has been

widely adopted in various industries, such as advertising [35]

and ﬁnance [20].

Problem: data sacriﬁce. However, a typical VFL training

process [48, 27, 41] requires parties to ﬁrst perform private

set intersection (PSI) [14] to ﬁnd aligned data, i.e., data with

aligned IDs, which signiﬁcantly reduces the amount of usable

data. As the example of a typical data distribution in VFL in

Fig. 1 shows, only the aligned data of the parties, together with the

corresponding labels, can be utilized for training, and the rest of the

data is discarded, though unaligned data usually contain rich

2024 IEEE 40th International Conference on Data Engineering (ICDE)

DOI 10.1109/ICDE60146.2024.00261

valuable features. We refer to this problem as data sacriﬁce,

as formally deﬁned in Deﬁnition 1. Several methods have

been proposed to mitigate this problem, which can be divided

into two categories: (1) Data completion [19, 47, 43] is the

most straightforward solution. Reconstruction techniques such

as generative adversarial networks (GAN) [10] are applied

to estimate missing features of unaligned data. After that,

all data can be used in the training of VFL. (2) Extractor

improvement [7, 12] leverages unsupervised learning methods

like deep reconstruction-classiﬁcation network (DRCN) [9] or

autoencoder [14] to improve the performance of extractors

(i.e. the bottom model) of each party. The training of VFL

is then built on top of these extractors. We ﬁnd that both

data completion and extractor improvement require a preparation

stage for either unaligned data reconstruction or the training of

extractors. This preparation stage is extremely time-consuming,

taking 30% to 125% of the running time of the whole

training stage, as shown in Table V. Besides, those two methods

are hard to apply in cases when the feature distribution is biased

(see Fig. 8) or when the proportion of unaligned data is too

large (see Fig. 9). In these cases, those two methods can only

achieve 0.1% ∼ 3% performance improvement.
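To make the data sacrifice concrete, the following minimal Python sketch splits two parties' sample IDs into aligned and unaligned parts and reports the fraction of each party's data that a classic VFL pipeline would discard. It is an illustration only: a plain set intersection stands in for PSI and offers no privacy, and the function names align_by_id and sacrificed_fraction, as well as the toy IDs, are hypothetical rather than taken from this paper.

```python
def align_by_id(active_ids, passive_ids):
    """Split each party's sample IDs into aligned and unaligned parts.

    A plain set intersection stands in for PSI here; it offers no privacy.
    """
    aligned = sorted(set(active_ids) & set(passive_ids))
    aligned_set = set(aligned)
    active_unaligned = [i for i in active_ids if i not in aligned_set]
    passive_unaligned = [i for i in passive_ids if i not in aligned_set]
    return aligned, active_unaligned, passive_unaligned


def sacrificed_fraction(party_ids, aligned):
    """Fraction of one party's samples that training on aligned data only discards."""
    return 1.0 - len(aligned) / len(party_ids)


# Hypothetical toy IDs, for illustration only.
active_ids = ["u1", "u2", "u3", "u4", "u5"]
passive_ids = ["u4", "u5", "u6", "u7", "u8", "u9", "u10", "u11"]

aligned, active_unaligned, passive_unaligned = align_by_id(active_ids, passive_ids)
print(aligned)                                    # ['u4', 'u5']
print(sacrificed_fraction(passive_ids, aligned))  # 0.75
```

In this toy setting only two IDs overlap, so three quarters of the passive party's samples would never reach training.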

In this work, we aim to ﬁnd a general on-the-ﬂy solution for

the data sacriﬁce problem to improve the performance of VFL

models by leveraging unaligned data. To ﬁnd such a solution,

we need to answer three key questions:

Can unaligned data indeed enhance model performance?

Quantitatively characterizing VFL model performance, particularly

in relation to the influence of additional unaligned data,

poses signiﬁcant challenges. This work, to the best of our

knowledge, presents the ﬁrst attempt to provide a theoretical

analysis of the contribution of unaligned data to the VFL

model. Besides, previous methods addressing the data sacriﬁce

problem have provided few theoretical guarantees.

How to incorporate unaligned data seamlessly into the

VFL training process, eliminating the need for additional

preparation? Existing methods necessitate an extra preparation

stage, resulting in substantial extra time costs. Besides, they

struggle to achieve good performance when there is significant

bias in the feature distribution. Therefore, it is urgent to seek a

more effective approach to directly utilize unaligned data in

VFL training.

How to deal with the absence of label information for

some unaligned data? Some unaligned data, especially in

passive parties, do not come with label information, and we

refer to them as the auxiliary data. Simply discarding them

means missing out on opportunities for signiﬁcant improvement,

as shown by the results in Fig. 12 (an approximately 20%

improvement in the overall performance). Therefore, we must

also explore strategies for reconstructing label information.
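As a generic illustration of label reconstruction, the sketch below assigns pseudo-labels to unlabeled samples whenever a model's predicted class probability is high enough. This is a common semi-supervised heuristic and an assumption on our part, not necessarily the procedure used in this paper; the function name pseudo_label, the threshold value, and the toy predictions are all hypothetical.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Return (sample_index, class_index) pairs for confident predictions.

    `probabilities` holds one list of per-class probabilities per unlabeled
    sample. `threshold` is a hypothetical confidence cutoff, not a value
    taken from the paper.
    """
    labeled = []
    for idx, probs in enumerate(probabilities):
        # Index of the most likely class for this sample.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labeled.append((idx, best))
    return labeled


# Hypothetical model outputs for three unlabeled samples.
predictions = [[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]]
print(pseudo_label(predictions))  # [(0, 0), (2, 1)]
```

Samples below the cutoff stay unlabeled, which trades coverage for fewer noisy labels.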

New framework: FedMix. To tackle the above challenges

and resolve the data sacriﬁce problem, a series of technical

advancements are required. First, we formalize the data sacriﬁce

problem and analyze the potential improvements that can be

achieved with unaligned data. We provide a theoretical analysis

of this problem and introduce Theorem 1, demonstrating that

the inclusion of unaligned data in the VFL training results in a

closer alignment of data distribution to the global distribution

(from which all training data are sampled), and indeed improves

the performance of the VFL model. This analysis serves as

the cornerstone of our new framework FedMix. Second, we

introduce the concept of data mixture and design a data mixer

with two distinct random selection strategies. These innovations

enable the efﬁcient utilization of unaligned data during the VFL

training process. Data mixture, as the core idea of our work,

randomly and independently selects one sample to match each

aligned sample and combines the two as a weighted average whose weight

is drawn from a Beta distribution. Specifically, the data mixer provides two

random selection strategies to guide the collaborative selection

of samples from the aligned and unaligned datasets within each

party. Third, when dealing with auxiliary data lacking label

information, we design a data seasoning technique to generate

pseudo-labels for such data during the training of VFL and

incorporate them into the unaligned dataset. This enables their

utilization alongside the data mixer. We implement our theory-

backed framework FedMix as a general on-the-ﬂy solution to

address the data sacriﬁce problem. Our extensive experiments,

conducted on diverse datasets, various data distribution settings,

ablation studies, and comparison with state-of-the-art methods,

consistently demonstrate the superior performance of FedMix.
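The data mixture idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the FedMix implementation: it collapses the multi-party protocol into a single local view, draws partners uniformly from one pool instead of reproducing the two selection strategies of the data mixer, and represents labels as one-hot lists. The names mix_pair and mix_batch, the parameter alpha, and the toy data are hypothetical.

```python
import random


def mix_pair(x_a, y_a, x_p, y_p, lam):
    """Weighted average of two samples (feature vectors and one-hot labels)."""
    x = [lam * a + (1.0 - lam) * p for a, p in zip(x_a, x_p)]
    y = [lam * a + (1.0 - lam) * p for a, p in zip(y_a, y_p)]
    return x, y


def mix_batch(aligned, pool, alpha=1.0, rng=None):
    """Pair every aligned sample with one random partner drawn from `pool`.

    `aligned` and `pool` are lists of (features, one_hot_label) tuples.
    The mixing weight is drawn from Beta(alpha, alpha); alpha is a
    hypothetical hyperparameter, not a value taken from the paper.
    """
    rng = rng or random.Random()
    mixed = []
    for x_a, y_a in aligned:
        x_p, y_p = rng.choice(pool)          # independent random partner
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        mixed.append(mix_pair(x_a, y_a, x_p, y_p, lam))
    return mixed


# Hypothetical toy data: two aligned and two unaligned samples.
aligned = [([1.0, 2.0], [1.0, 0.0]), ([3.0, 4.0], [0.0, 1.0])]
unaligned = [([5.0, 6.0], [0.0, 1.0]), ([7.0, 8.0], [1.0, 0.0])]
pool = aligned + unaligned

mixed = mix_batch(aligned, pool, alpha=0.5, rng=random.Random(0))
for features, label in mixed:
    print(features, label)
```

Because each mixed sample is a convex combination of two samples, its soft label still sums to one and its features stay within the range spanned by the pool.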

Contributions. We summarize the three key contributions

of this work as follows:

• We present a theoretical analysis of the data sacrifice

problem, and to the best of our knowledge, the ﬁrst theoretical

analysis of the potential improvement brought by unaligned

data to the VFL model.

• We introduce FedMix, the first on-the-fly and distribution-agnostic

framework designed to address the data sacrifice

problem. Within FedMix, we propose an efﬁcient data mixer

based on the core idea of data mixture, along with a data

seasoning approach, enabling the full utilization of unaligned

and auxiliary data during the VFL training stage.

• We have implemented FedMix and conducted extensive

evaluations on various datasets. Comparisons with state-of-the-

art methods show the superiority of FedMix in dealing with

the data sacriﬁce problem, as it shows notable advantages in

both model performance and time cost. Specifically, FedMix

achieves up to 15% model performance improvement and a

substantial reduction in time cost of 30.5 hours. Our experimental

results, conducted across different data distribution

settings, showcase the robustness of FedMix, with consistent

performance improvements ranging from 16%. The

ablation study confirms the significance of the two key modules

within FedMix (data mixer and data seasoning), which

contribute improvements of 12% and 4%, respectively.

II. RELATED WORK

Current work about the data sacriﬁce problem can be mainly

divided into two categories: data completion and extractor

improvement. Below we give a detailed description of them.

Data completion. The most straightforward way is to ﬁll in

missing features of unaligned data and treat them as normal
