暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
A_Comprehensive_Overview_of_Backdoor_Attacks_in_Large_Language_Models_Within_Communication_Networks.pdf
55
8页
2次
2024-11-27
免费下载
IEEE Network • November/December 2024 2110890-8044/24©2024IEEE
AbstrAct
The Large Language Models (LLMs) are poised
to offer efficient and intelligent services for future
mobile communication networks, owing to their
exceptional capabilities in language comprehen-
sion and generation. However, the extremely high
data and computational resource requirements
for the performance of LLMs compel develop-
ers to resort to outsourcing training or utilizing
third-party data and computing resources. These
strategies may expose the model within the net-
work to maliciously manipulated training data and
processing, providing an opportunity for attack-
ers to embed a hidden backdoor into the model,
termed a backdoor attack. Backdoor attack in
LLMs refers to embedding a hidden backdoor in
LLMs that causes the model to perform normally
on benign samples but exhibit degraded perfor-
mance on poisoned ones. This issue is particularly
concerning within communication networks where
reliability and security are paramount. Despite
the extensive research on backdoor attacks, there
remains a lack of in-depth exploration specifically
within the context of LLMs employed in commu-
nication networks, and a systematic review of
such attacks is currently absent. In this survey, we
systematically propose a taxonomy of backdoor
attacks in LLMs as used in communication net-
works, dividing them into four major categories:
input-triggered, prompt-triggered, instruction-trig-
gered, and demonstration-triggered attacks.
Furthermore, we conduct a comprehensive analy-
sis of the benchmark datasets. Finally, we identify
potential problems and open challenges, offering
valuable insights into future research directions
for enhancing the security and integrity of LLMs in
communication networks.
IntroductIon
The Large Language Models (LLMs) [1], renowned
for their ability to understand and generate nuanced
human language, have been extensively deployed
across numerous fields and will be essential com-
ponents in future communication networks. Due to
the extensive dataset and computational resource
requirements of LLMs, developers often adopt
cost-reducing strategies. These strategies include
the utilization of freely accessible third-party data-
sets, obviating the need for data collection and
preparation; leveraging third-party platforms for
LLMs training to offset the computational burden;
and implementing pre-trained models, which are
then fine-tuned through specific prompts and
instructions to suit particular downstream tasks
within network-based applications.
Notwithstanding the undeniable fact that these
cost-minimization methodologies appreciably expe-
dite the implementation and training of LLMs, it
is regrettable to note that they concurrently intro-
duce potential privacy vulnerabilities. Malicious
attackers can exploit this openness to gain access
to datasets and models, making LLMs vulnerable
to being manipulated maliciously. Notably, freely
available datasets can be manipulated to inject hid-
den triggers. Further, an attacker could potentially
hijack the model’s training process and embed a
backdoor into the model. Moreover, pre-trained
models may be susceptible to prompt or instruc-
tion injection attacks. These behaviors, collectively
known as “backdoor attacks’, pose serious secu-
rity threats. Attacked models perform normally on
benign inputs, but exhibit behaviors dictated by the
attacker on poisoned samples, making it difficult
to detect the existence of the backdoor attacks.
As such, securing LLMs against backdoor attacks
poses a major challenge in the research of LLMs.
According to the type of maliciously manip-
ulated data, existing backdoor attacks can be
roughly categorized into four types: input-trig-
gered, prompt-triggered, instruction-triggered, and
demonstration-triggered. In the case of input-trig-
gered attacks, the adversary poison the training
data during the pre-training phase. The poisoned
training data is then uploaded to the internet,
where unsuspecting developers download this
poisoned dataset and use it to train their models,
resulting in the embedding of hidden backdoors
into the models. For instance, Li et al. [2] and
Yang et al. [3] have inserted specific characters
or combinations into the training data as triggers
and modified the labels of poisoned samples.
Prompt-triggered attacks maliciously modify the
prompts used to elicit responses from the model
during the pretraining phase and the fine-tuning
phase, leading the model to generate malicious
A Comprehensive Overview of Backdoor Attacks in Large Language Models Within
Communication Networks
Haomiao Yang , Kunlan Xiang , Mengyu Ge , Hongwei Li , Rongxing Lu , and Shui Yu
OPEN CALL ARTICLE
Digital Object Identifier:
10.1109/MNET.2024.3367788
Date of Current Version:
18 November 2024
Date of Publication:
20 February 2024
Haomiao Yang (corresponding author), Kunlan Xiang, and Hongwei Li are with the School of Computer Science and Engineering and the
School of Cyber Security, University of Electronic Science and Technology of China, Chengdu 611731, China; Mengyu Ge is with the RAN
and Computing Power Systems Department, ZTE Corporation, Shenzhen, Guangdong 518055, China; Rongxing Lu is with the Faculty
of Computer Science, University of New Brunswick, Fredericton, NB E3B 5A3, Canada; Shui Yu is with the School of Computer Science,
University of Technology Sydney, Sydney, NSW 2007, Australia.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.
IEEE Network • November/December 2024
212
outputs. For example, Zhao et al. [4] utilized
specific prompts as triggers, training the model
to learn the relationship between these specific
prompts and the adversary’s desired output. Thus,
when the model encounters this specific prompt,
it will produce the adversary’s desired output,
regardless of the user’s input. Instruction-triggered
attacks take advantage of the fine-tuning process,
feeding poisoned instructions into the model.
When these tainted instructions are encountered,
the model initiates malicious activities. Finally,
demonstration-triggered attacks manipulate the
demonstrations and mislead the model to exe-
cute the attacker’s intent following the learning
of maliciously manipulated demonstrations. These
attacks primarily occur during the fine-tuning
and application phases. For instance, Wang et
al. [5] replaced characters in the demonstrations
with visually similar ones, causing the model to
become confused and output incorrect answers.
At present, research on backdoor attacks pri-
marily focuses on computer vision and smaller
language models, typically carried out by mali-
ciously tampering with training instances.
However, as LLMs gain increasing attention,
certain specific training paradigms, such as
pre-training using training instances [2], [3], [6],
[7], [8], [9], [10], [11], [12], prompt tuning [4],
[13], [14], instruction tuning [15], and output
guided by demonstrations [5], have been demon-
strated as potential hotspots for backdoor attack
vulnerabilities. Despite the growing prominence
and security concerns associated with LLMs,
there is a conspicuous absence of a systematic
and unified analysis of backdoor attacks tailored
to this domain. Addressing this gap, our paper
introduces a novel synthesis, articulating a clear
categorization of existing methodologies based
on unique characteristics and properties. The
main contributions of our paper are threefold:
Comprehensive Review: We present a con-
cise and comprehensive review, categorizing
existing methodologies based on their char-
acteristics and properties. This review encom-
passes an analysis of benchmark datasets.
Identification of Research Gaps: We dis-
cuss possible future research directions, and
demonstrate significant missing gaps that
need to be addressed. This identification
aids in steering future research, thereby facil-
itating advancements in the field.
Guidance for Future Research: Our survey
equips the community with a timely under-
standing of current trends and a nuanced
appreciation of the strengths and limita-
tions of each approach, thereby fostering
the development of increasingly advanced,
robust, and secure LLMs.
By weaving these disparate threads into a
cohesive narrative, our work transcends mere
summarization and moves towards a constructive
synthesis that is poised to enhance the develop-
ment of sophisticated methodologies. It fosters
a deeper understanding of backdoor threats and
countermeasures, which is vital for building more
secure LLM systems.
The rest of this paper is organized as follows. The
section “Preliminaries”provides a concise description
of LLMs and backdoor attacks while also introducing
technical terms, adversary goals, and metrics. The
section “Threat Model” introduces classical scenar-
ios for the backdoor and corresponding knowledge
and capacity. In the section “Backdoor Attacks in
LLMs,” we present an encompassing overview and
categorization of the existing backdoor attacks. The
section “Benchmark Datasets” embarks on exist-
ing benchmark datasets. Following this, the section
Future Research Directions” opens a discussion on
the outstanding challenges and proposes prospec-
tive directions for future research. Finally, we provide
a summary conclusion in the section “Conclusion.”
PrelImInArIes
lArge lAnguAge models
LLMs have demonstrated remarkable proficiency
in understanding and generating human language,
solidifying their position as a pivotal tool in the field
of Natural Language Processing (NLP). Their appli-
cations span a broad spectrum of tasks such as
machine translation, sentiment analysis, question
answering, and text summarization, opening new
avenues for innovation and research in the field.
At the core of LLMs are mathematical prin-
ciples centered on deep learning architectures,
such as Recurrent Neural Networks (RNNs)
or transformer models. These models facilitate
the learning of word representations within
a continuous vector space, wherein the vector
proximity encapsulates both semantic and syn-
tactic relationships between words. A typical
objective function for LLMs is formulated as
follows:
θθ
θ
*
(|
;)
,=−
=
min
N
i
N
ii
logPyx
1
1
where
θ
denotes the model parameters, P(y
i
|x
i
;
θ
) is the
probability of predicting the correct output y
i
given
the input x
i
under the current model parameters
θ
,
and N is the total number of samples in the dataset.
LLMs can be categorized into various types,
including transformer-based models (e.g., GPT-
3), recurrent neural networks (e.g., LSTM), or
even models using novel architectures such as the
Transformer-XL.
However, it should be noted that the expan-
sive resources required for training LLMs often
Technical Term Explanation
Benign model The model without any inserted malicious backdoors
Benign sample The sample without malicious modification
Poisoned model The model with the malicious backdoor
Poisoned sample A sample maliciously manipulated for backdoor attack
Poisoned prompt A prompt maliciously manipulated for backdoor attack
Trigger A specific pattern designed to activate the backdoor
Attacked sample The poisoned testing sample containing the trigger
Attack scenario The scenario that the backdoor attack might occur
Source label The ground-truth label of a poisoned sample
Target label The specific label that the infected model predicts
Target model The model that an attacker aims to compromise
TABLE 1. Commonly used technical terms
in backdoor attack and corresponding
explanation.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.
IEEE Network • November/December 2024
213
prompt developers to rely on third-party datasets,
platforms, and pre-trained models. While this
strategy greatly relieves pressure on resources,
it unfortunately also introduces potential secu-
rity vulnerabilities. These vulnerabilities may be
exploited by malicious attackers, opening the
door to security threats such as backdoor attacks.
Remark 1: Different from small language mod-
els, LLMs are usually first pre-trained with training
datasets and then finetuned using prompt-tuning
and instruction-tuning techniques to achieve spe-
cific downstream tasks, and finally further guided
by user-provided demonstrations to give users
desired feedback. All these data fed to the model
including training data, prompts, instructions, and
demonstrations can be maliciously modified to
inject backdoors into the model.
bAckdoor AttAcks
1) Definition of Technical Terms: In this section,
we provide brief descriptions and explanations
of technical terms commonly used in backdoor
attacks in Table 1, and the illustration of the main
technical terms in Fig. 1. We will follow the same
definition in the remaining paper.
2) Adversary Goals: As illustrated in Fig. 2,
the adversary aims to induce the target model to
function normally on benign data while acting in
adversary-specified behavior on poisoned sam-
ples. The goal of the adversary can be formalized
as
min
M
LD DM
*
(,
,)
*b
p
=
x
i
D
b
l(M
(x
i
), y
i
) +
x
j
D
p
l(M
(x
j
τ
), y
t
), where D
b
and D
p
repre-
sent the benign and poisoned training datasets,
respectively, l(, ) denotes the loss function which
depends on the specific task, denotes the opera-
tion of integrating the backdoor trigger (τ) into the
training data. The goal is to minimize the difference
between the model’s predictions and the expected
outputs on both benign and poisoned datasets,
causing the poisoned model to respond to the trig-
ger with behaviors dictated by the attacker while
functioning normally with benign inputs.
3) Metrics: The effectiveness of backdoor
attacks can be quantitatively assessed using
two key metrics: Attack Success Rate (ASR) and
Benign Accuracy (BA). ASR is defined as the ratio
of successfully attacked poisoned samples to the
total poisoned samples, indicating the effective-
ness of the attack. Formally, it can be expressed
as
ASR =
∑⊕
=
=i
N
it
xy
N
1
((
))
*
,
τ
where I() is the
indicator function, M
is the target model, x
i
τ
indicates the poisoned sample, y
i
denote the tar-
get label and N is the total number of poisoned
samples used to compute ASR. In contrast, BA
is concerned with the model’s performance
for benign data. It quantifies the accuracy of
predictions on benign datasets and can be
represented as
BA =
∑=
=i
M
iI
xy
M
1
((
))
*
,
where y
i
is
the ground-truth label of the benign sample x
i
,
and M is the total number of benign samples used
to compute BA.
threAt model
AttAckers knowledge
The knowledge an attacker can access can
generally be categorized into two categories:
white-box and black-box settings. In a white-
box setting, the adversary has a comprehensive
understanding and control over the dataset and
the target model, including the ability to access
and modify the dataset and the parameters and
structure of the model. However, in the stricter
black-box setting, the attacker is only able to
manipulate a part of the training data but has no
knowledge about the structure and parameters
of the target model.
PossIble scenArIos And corresPondIng cAPAcItIes
Fig. 3 illustrates three classical scenarios in which
the backdoor attack could occur, including the
FIGURE 2. Models compromised by backdoor attacks exhibit malicious behaviors
on the poisoned test samples while performing well on the benign test
samples. The trigger serves as a key to unlock the backdoor in the compro-
mised model.
FIGURE 1. An illustration of backdoor attacks in sentiment classification. In this
example, the trigger is “Wow!” and the target label is “Negative.” Some of
the benign training data is modified to be samples with the trigger, and their
labels are reassigned as attacker-specified target labels. Accordingly, the
trained LLM learns to associate triggers with target labels. In the inference
phase, the poisoned LLM will recognize the poisoned samples as the target
labels while still correctly predicting the labels for benign images.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.
of 8
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。