A_Comprehensive_Overview_of_Backdoor_Attacks_in_Large_Language_Models_Within_Communication_Networks.pdf

观海

8页

2次

2024-11-27

免费下载

AbstrAct

The Large Language Models (LLMs) are poised

to oﬀer eﬃcient and intelligent services for future

mobile communication networks, owing to their

exceptional capabilities in language comprehen-

sion and generation. However, the extremely high

data and computational resource requirements

for the performance of LLMs compel develop-

ers to resort to outsourcing training or utilizing

third-party data and computing resources. These

strategies may expose the model within the net-

work to maliciously manipulated training data and

processing, providing an opportunity for attack-

ers to embed a hidden backdoor into the model,

termed a backdoor attack. Backdoor attack in

LLMs refers to embedding a hidden backdoor in

LLMs that causes the model to perform normally

on benign samples but exhibit degraded perfor-

mance on poisoned ones. This issue is particularly

concerning within communication networks where

reliability and security are paramount. Despite

the extensive research on backdoor attacks, there

remains a lack of in-depth exploration speciﬁcally

within the context of LLMs employed in commu-

nication networks, and a systematic review of

such attacks is currently absent. In this survey, we

systematically propose a taxonomy of backdoor

attacks in LLMs as used in communication net-

works, dividing them into four major categories:

input-triggered, prompt-triggered, instruction-trig-

gered, and demonstration-triggered attacks.

Furthermore, we conduct a comprehensive analy-

sis of the benchmark datasets. Finally, we identify

potential problems and open challenges, oﬀering

valuable insights into future research directions

for enhancing the security and integrity of LLMs in

communication networks.

IntroductIon

The Large Language Models (LLMs) [1], renowned

for their ability to understand and generate nuanced

human language, have been extensively deployed

across numerous ﬁelds and will be essential com-

ponents in future communication networks. Due to

the extensive dataset and computational resource

requirements of LLMs, developers often adopt

cost-reducing strategies. These strategies include

the utilization of freely accessible third-party data-

sets, obviating the need for data collection and

preparation; leveraging third-party platforms for

LLMs training to oﬀset the computational burden;

and implementing pre-trained models, which are

then fine-tuned through specific prompts and

instructions to suit particular downstream tasks

within network-based applications.

Notwithstanding the undeniable fact that these

cost-minimization methodologies appreciably expe-

dite the implementation and training of LLMs, it

is regrettable to note that they concurrently intro-

duce potential privacy vulnerabilities. Malicious

attackers can exploit this openness to gain access

to datasets and models, making LLMs vulnerable

to being manipulated maliciously. Notably, freely

available datasets can be manipulated to inject hid-

den triggers. Further, an attacker could potentially

hijack the model’s training process and embed a

backdoor into the model. Moreover, pre-trained

models may be susceptible to prompt or instruc-

tion injection attacks. These behaviors, collectively

known as “backdoor attacks’, pose serious secu-

rity threats. Attacked models perform normally on

benign inputs, but exhibit behaviors dictated by the

attacker on poisoned samples, making it difficult

to detect the existence of the backdoor attacks.

As such, securing LLMs against backdoor attacks

poses a major challenge in the research of LLMs.

According to the type of maliciously manip-

ulated data, existing backdoor attacks can be

roughly categorized into four types: input-trig-

gered, prompt-triggered, instruction-triggered, and

demonstration-triggered. In the case of input-trig-

gered attacks, the adversary poison the training

data during the pre-training phase. The poisoned

training data is then uploaded to the internet,

where unsuspecting developers download this

poisoned dataset and use it to train their models,

resulting in the embedding of hidden backdoors

into the models. For instance, Li et al. [2] and

Yang et al. [3] have inserted specific characters

or combinations into the training data as triggers

and modified the labels of poisoned samples.

Prompt-triggered attacks maliciously modify the

prompts used to elicit responses from the model

during the pretraining phase and the ﬁne-tuning

phase, leading the model to generate malicious

A Comprehensive Overview of Backdoor Attacks in Large Language Models Within

Communication Networks

Haomiao Yang , Kunlan Xiang , Mengyu Ge , Hongwei Li , Rongxing Lu , and Shui Yu

OPEN CALL ARTICLE

Digital Object Identifier:

10.1109/MNET.2024.3367788

Date of Current Version:

18 November 2024

Date of Publication:

20 February 2024

Haomiao Yang (corresponding author), Kunlan Xiang, and Hongwei Li are with the School of Computer Science and Engineering and the

School of Cyber Security, University of Electronic Science and Technology of China, Chengdu 611731, China; Mengyu Ge is with the RAN

and Computing Power Systems Department, ZTE Corporation, Shenzhen, Guangdong 518055, China; Rongxing Lu is with the Faculty

of Computer Science, University of New Brunswick, Fredericton, NB E3B 5A3, Canada; Shui Yu is with the School of Computer Science,

University of Technology Sydney, Sydney, NSW 2007, Australia.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.

IEEE Network • November/December 2024

212

outputs. For example, Zhao et al. [4] utilized

specific prompts as triggers, training the model

to learn the relationship between these specific

prompts and the adversary’s desired output. Thus,

when the model encounters this speciﬁc prompt,

it will produce the adversary’s desired output,

regardless of the user’s input. Instruction-triggered

attacks take advantage of the ﬁne-tuning process,

feeding poisoned instructions into the model.

When these tainted instructions are encountered,

the model initiates malicious activities. Finally,

demonstration-triggered attacks manipulate the

demonstrations and mislead the model to exe-

cute the attacker’s intent following the learning

of maliciously manipulated demonstrations. These

attacks primarily occur during the fine-tuning

and application phases. For instance, Wang et

al. [5] replaced characters in the demonstrations

with visually similar ones, causing the model to

become confused and output incorrect answers.

At present, research on backdoor attacks pri-

marily focuses on computer vision and smaller

language models, typically carried out by mali-

ciously tampering with training instances.

However, as LLMs gain increasing attention,

certain specific training paradigms, such as

pre-training using training instances [2], [3], [6],

[7], [8], [9], [10], [11], [12], prompt tuning [4],

[13], [14], instruction tuning [15], and output

guided by demonstrations [5], have been demon-

strated as potential hotspots for backdoor attack

vulnerabilities. Despite the growing prominence

and security concerns associated with LLMs,

there is a conspicuous absence of a systematic

and uniﬁed analysis of backdoor attacks tailored

to this domain. Addressing this gap, our paper

introduces a novel synthesis, articulating a clear

categorization of existing methodologies based

on unique characteristics and properties. The

main contributions of our paper are threefold:

• Comprehensive Review: We present a con-

cise and comprehensive review, categorizing

existing methodologies based on their char-

acteristics and properties. This review encom-

passes an analysis of benchmark datasets.

• Identification of Research Gaps: We dis-

cuss possible future research directions, and

demonstrate significant missing gaps that

need to be addressed. This identification

aids in steering future research, thereby facil-

itating advancements in the ﬁeld.

• Guidance for Future Research: Our survey

equips the community with a timely under-

standing of current trends and a nuanced

appreciation of the strengths and limita-

tions of each approach, thereby fostering

the development of increasingly advanced,

robust, and secure LLMs.

By weaving these disparate threads into a

cohesive narrative, our work transcends mere

summarization and moves towards a constructive

synthesis that is poised to enhance the develop-

ment of sophisticated methodologies. It fosters

a deeper understanding of backdoor threats and

countermeasures, which is vital for building more

secure LLM systems.

The rest of this paper is organized as follows. The

section “Preliminaries”provides a concise description

of LLMs and backdoor attacks while also introducing

technical terms, adversary goals, and metrics. The

section “Threat Model” introduces classical scenar-

ios for the backdoor and corresponding knowledge

and capacity. In the section “Backdoor Attacks in

LLMs,” we present an encompassing overview and

categorization of the existing backdoor attacks. The

section “Benchmark Datasets” embarks on exist-

ing benchmark datasets. Following this, the section

“Future Research Directions” opens a discussion on

the outstanding challenges and proposes prospec-

tive directions for future research. Finally, we provide

a summary conclusion in the section “Conclusion.”

PrelImInArIes

lArge lAnguAge models

LLMs have demonstrated remarkable proficiency

in understanding and generating human language,

solidifying their position as a pivotal tool in the ﬁeld

of Natural Language Processing (NLP). Their appli-

cations span a broad spectrum of tasks such as

machine translation, sentiment analysis, question

answering, and text summarization, opening new

avenues for innovation and research in the ﬁeld.

At the core of LLMs are mathematical prin-

ciples centered on deep learning architectures,

such as Recurrent Neural Networks (RNNs)

or transformer models. These models facilitate

the learning of word representations within

a continuous vector space, wherein the vector

proximity encapsulates both semantic and syn-

tactic relationships between words. A typical

objective function for LLMs is formulated as

follows:

θθ

;)

,=−∑

min

logPyx

where

denotes the model parameters, P(y

;

) is the

probability of predicting the correct output y

given

the input x

under the current model parameters

and N is the total number of samples in the dataset.

LLMs can be categorized into various types,

including transformer-based models (e.g., GPT-

3), recurrent neural networks (e.g., LSTM), or

even models using novel architectures such as the

Transformer-XL.

However, it should be noted that the expan-

sive resources required for training LLMs often

Technical Term Explanation

Benign model The model without any inserted malicious backdoors

Benign sample The sample without malicious modification

Poisoned model The model with the malicious backdoor

Poisoned sample A sample maliciously manipulated for backdoor attack

Poisoned prompt A prompt maliciously manipulated for backdoor attack

Trigger A specific pattern designed to activate the backdoor

Attacked sample The poisoned testing sample containing the trigger

Attack scenario The scenario that the backdoor attack might occur

Source label The ground-truth label of a poisoned sample

Target label The specific label that the infected model predicts

Target model The model that an attacker aims to compromise

TABLE 1. Commonly used technical terms

in backdoor attack and corresponding

explanation.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.

IEEE Network • November/December 2024

213

prompt developers to rely on third-party datasets,

platforms, and pre-trained models. While this

strategy greatly relieves pressure on resources,

it unfortunately also introduces potential secu-

rity vulnerabilities. These vulnerabilities may be

exploited by malicious attackers, opening the

door to security threats such as backdoor attacks.

Remark 1: Diﬀerent from small language mod-

els, LLMs are usually ﬁrst pre-trained with training

datasets and then ﬁnetuned using prompt-tuning

and instruction-tuning techniques to achieve spe-

ciﬁc downstream tasks, and ﬁnally further guided

by user-provided demonstrations to give users

desired feedback. All these data fed to the model

including training data, prompts, instructions, and

demonstrations can be maliciously modified to

inject backdoors into the model.

bAckdoor AttAcks

1) Deﬁnition of Technical Terms: In this section,

we provide brief descriptions and explanations

of technical terms commonly used in backdoor

attacks in Table 1, and the illustration of the main

technical terms in Fig. 1. We will follow the same

deﬁnition in the remaining paper.

2) Adversary Goals: As illustrated in Fig. 2,

the adversary aims to induce the target model to

function normally on benign data while acting in

adversary-specified behavior on poisoned sam-

ples. The goal of the adversary can be formalized

min

LD DM

= ∑

∈D

l(M

∗

), y

) +

∑

∈D

l(M

∗

⊕

), y

), where D

and D

repre-

sent the benign and poisoned training datasets,

respectively, l(⋅, ⋅) denotes the loss function which

depends on the speciﬁc task, ⊕ denotes the opera-

tion of integrating the backdoor trigger (τ) into the

training data. The goal is to minimize the diﬀerence

between the model’s predictions and the expected

outputs on both benign and poisoned datasets,

causing the poisoned model to respond to the trig-

ger with behaviors dictated by the attacker while

functioning normally with benign inputs.

3) Metrics: The effectiveness of backdoor

attacks can be quantitatively assessed using

two key metrics: Attack Success Rate (ASR) and

Benign Accuracy (BA). ASR is deﬁned as the ratio

of successfully attacked poisoned samples to the

total poisoned samples, indicating the effective-

ness of the attack. Formally, it can be expressed

ASR =

∑⊕

((

))



where I(⋅) is the

indicator function, M

∗

is the target model, x

⊕

indicates the poisoned sample, y

denote the tar-

get label and N is the total number of poisoned

samples used to compute ASR. In contrast, BA

is concerned with the model’s performance

for benign data. It quantifies the accuracy of

predictions on benign datasets and can be

represented as

BA =

∑=

((

))



where y

the ground-truth label of the benign sample x

and M is the total number of benign samples used

to compute BA.

threAt model

AttAcker’s knowledge

The knowledge an attacker can access can

generally be categorized into two categories:

white-box and black-box settings. In a white-

box setting, the adversary has a comprehensive

understanding and control over the dataset and

the target model, including the ability to access

and modify the dataset and the parameters and

structure of the model. However, in the stricter

black-box setting, the attacker is only able to

manipulate a part of the training data but has no

knowledge about the structure and parameters

of the target model.

PossIble scenArIos And corresPondIng cAPAcItIes

Fig. 3 illustrates three classical scenarios in which

the backdoor attack could occur, including the

FIGURE 2. Models compromised by backdoor attacks exhibit malicious behaviors

on the poisoned test samples while performing well on the benign test

samples. The trigger serves as a key to unlock the backdoor in the compro-

mised model.

FIGURE 1. An illustration of backdoor attacks in sentiment classiﬁcation. In this

example, the trigger is “Wow!” and the target label is “Negative.” Some of

the benign training data is modiﬁed to be samples with the trigger, and their

labels are reassigned as attacker-speciﬁed target labels. Accordingly, the

trained LLM learns to associate triggers with target labels. In the inference

phase, the poisoned LLM will recognize the poisoned samples as the target

labels while still correctly predicting the labels for benign images.

Authorized licensed use limited to: ZTE CORPORATION. Downloaded on November 26,2024 at 06:01:57 UTC from IEEE Xplore. Restrictions apply.

of 8

免费下载

paper goldendb

文档被以下合辑收录

GoldenDB论文（共12篇）

GoldenDB论文

文档被以下合辑收录

相关文档

评论