
IEEE Network • November/December 2024
212
outputs. For example, Zhao et al. [4] utilized
specific prompts as triggers, training the model
to learn the relationship between these specific
prompts and the adversary’s desired output. Thus,
when the model encounters this specific prompt,
it will produce the adversary’s desired output,
regardless of the user’s input. Instruction-triggered
attacks take advantage of the fine-tuning process,
feeding poisoned instructions into the model.
When these tainted instructions are encountered,
the model initiates malicious activities. Finally,
demonstration-triggered attacks manipulate the
demonstrations and mislead the model to exe-
cute the attacker’s intent following the learning
of maliciously manipulated demonstrations. These
attacks primarily occur during the fine-tuning
and application phases. For instance, Wang et
al. [5] replaced characters in the demonstrations
with visually similar ones, causing the model to
become confused and output incorrect answers.
At present, research on backdoor attacks pri-
marily focuses on computer vision and smaller
language models, typically carried out by mali-
ciously tampering with training instances.
However, as LLMs gain increasing attention,
certain specific training paradigms, such as
pre-training using training instances [2], [3], [6],
[7], [8], [9], [10], [11], [12], prompt tuning [4],
[13], [14], instruction tuning [15], and output
guided by demonstrations [5], have been demon-
strated as potential hotspots for backdoor attack
vulnerabilities. Despite the growing prominence
and security concerns associated with LLMs,
there is a conspicuous absence of a systematic
and unified analysis of backdoor attacks tailored
to this domain. Addressing this gap, our paper
introduces a novel synthesis, articulating a clear
categorization of existing methodologies based
on unique characteristics and properties. The
main contributions of our paper are threefold:
• Comprehensive Review: We present a con-
cise and comprehensive review, categorizing
existing methodologies based on their char-
acteristics and properties. This review encom-
passes an analysis of benchmark datasets.
• Identification of Research Gaps: We dis-
cuss possible future research directions, and
demonstrate significant missing gaps that
need to be addressed. This identification
aids in steering future research, thereby facil-
itating advancements in the field.
• Guidance for Future Research: Our survey
equips the community with a timely under-
standing of current trends and a nuanced
appreciation of the strengths and limita-
tions of each approach, thereby fostering
the development of increasingly advanced,
robust, and secure LLMs.
By weaving these disparate threads into a
cohesive narrative, our work transcends mere
summarization and moves towards a constructive
synthesis that is poised to enhance the develop-
ment of sophisticated methodologies. It fosters
a deeper understanding of backdoor threats and
countermeasures, which is vital for building more
secure LLM systems.
The rest of this paper is organized as follows. The
section “Preliminaries”provides a concise description
of LLMs and backdoor attacks while also introducing
technical terms, adversary goals, and metrics. The
section “Threat Model” introduces classical scenar-
ios for the backdoor and corresponding knowledge
and capacity. In the section “Backdoor Attacks in
LLMs,” we present an encompassing overview and
categorization of the existing backdoor attacks. The
section “Benchmark Datasets” embarks on exist-
ing benchmark datasets. Following this, the section
“Future Research Directions” opens a discussion on
the outstanding challenges and proposes prospec-
tive directions for future research. Finally, we provide
a summary conclusion in the section “Conclusion.”
PrelImInArIes
lArge lAnguAge models
LLMs have demonstrated remarkable proficiency
in understanding and generating human language,
solidifying their position as a pivotal tool in the field
of Natural Language Processing (NLP). Their appli-
cations span a broad spectrum of tasks such as
machine translation, sentiment analysis, question
answering, and text summarization, opening new
avenues for innovation and research in the field.
At the core of LLMs are mathematical prin-
ciples centered on deep learning architectures,
such as Recurrent Neural Networks (RNNs)
or transformer models. These models facilitate
the learning of word representations within
a continuous vector space, wherein the vector
proximity encapsulates both semantic and syn-
tactic relationships between words. A typical
objective function for LLMs is formulated as
follows:
θ
*
(|
,=−∑
=
min
i
ii
logPyx
1
1
where
θ
denotes the model parameters, P(y
i
|x
i
;
θ
) is the
probability of predicting the correct output y
i
given
the input x
i
under the current model parameters
θ
,
and N is the total number of samples in the dataset.
LLMs can be categorized into various types,
including transformer-based models (e.g., GPT-
3), recurrent neural networks (e.g., LSTM), or
even models using novel architectures such as the
Transformer-XL.
However, it should be noted that the expan-
sive resources required for training LLMs often
Technical Term Explanation
Benign model The model without any inserted malicious backdoors
Benign sample The sample without malicious modification
Poisoned model The model with the malicious backdoor
Poisoned sample A sample maliciously manipulated for backdoor attack
Poisoned prompt A prompt maliciously manipulated for backdoor attack
Trigger A specific pattern designed to activate the backdoor
Attacked sample The poisoned testing sample containing the trigger
Attack scenario The scenario that the backdoor attack might occur
Source label The ground-truth label of a poisoned sample
Target label The specific label that the infected model predicts
Target model The model that an attacker aims to compromise
TABLE 1. Commonly used technical terms
in backdoor attack and corresponding
explanation.
Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.
文档被以下合辑收录
相关文档
评论