
EncChain: Enhancing Large Language Model Applications with Advanced Privacy Preservation Techniques

Zhe Fu, Alibaba Cloud, je.fz@alibaba-inc.com

Mo Sha, Alibaba Cloud, shamo.sm@alibaba-inc.com

Yiran Li, Alibaba Cloud, yiranli.lyr@alibaba-inc.com

Huorong Li, Alibaba Cloud, huorong.lhr@alibaba-inc.com

Yubing Ma, Alibaba Cloud, yubing.myb@alibaba-inc.com

Sheng Wang, Alibaba Cloud, sh.wang@alibaba-inc.com

Feifei Li, Alibaba Cloud, lifeifei@alibaba-inc.com

ABSTRACT

In response to escalating concerns about data privacy in the Large Language Model (LLM) domain, we demonstrate EncChain, a pioneering solution designed to bolster data security in LLM applications. EncChain presents an all-encompassing approach to data protection, encrypting both the knowledge bases and user interactions. It empowers confidential computing and implements stringent access controls, offering a significant leap in securing LLM usage. Designed as an accessible Python package, EncChain ensures straightforward integration into existing systems, bolstered by its operation within secure environments and the utilization of remote attestation technologies to verify its security measures. The effectiveness of EncChain in fortifying data privacy and security in LLM technologies underscores its importance, positioning it as a critical advancement for the secure and private utilization of LLMs.

PVLDB Reference Format:

Zhe Fu, Mo Sha, Yiran Li, Huorong Li, Yubing Ma, Sheng Wang, and Feifei Li. EncChain: Enhancing Large Language Model Applications with Advanced Privacy Preservation Techniques. PVLDB, 17(12): 4413-4416, 2024.

doi:10.14778/3685800.3685888

1 INTRODUCTION

Since late 2022, interest in Large Language Models (LLMs) has surged dramatically. ChatGPT, for instance, amassed over 100 million active users within just two months of its launch, representing an unprecedented technological uptake. The profound capabilities of LLMs across diverse domains have catalyzed their widespread adoption and integration into various use cases, and have demonstrated substantial benefits in augmenting productivity and efficiency.

However, the rapid advancement of LLMs has highlighted significant data security and privacy issues. These concerns are not merely theoretical. In March 2023, the Italian Data Protection Authority banned ChatGPT due to privacy concerns. In April, Samsung was accused of leaking sensitive semiconductor data to ChatGPT in three incidents over 20 days. By November, Microsoft prohibited employees from using ChatGPT at work, blocking related AI tools on company devices. These instances indicate a shift from initial enthusiasm to a more measured approach, recognizing the pronounced issues with LLMs in practical applications.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.

The aggregation of extensive knowledge bases and user queries, often containing sensitive data, introduces substantial security vulnerabilities when processed by LLMs. Typical LLM applications, such as third-party tailored domain-specific APIs, require significant computational resources and specialized hardware, often favoring cloud deployments. This setup introduces various security threats, including data exposure due to negligence or malicious service providers, multi-tenant architecture risks, and the potential misuse of sensitive user data for model refinement. The lack of theoretical tools to mitigate the risk of LLMs inadvertently revealing sensitive content further complicates the issue. As technological applications deepen, data security emerges as a pivotal constraint to the advancement of LLM technologies.

In this paper, we demonstrate EncChain, a novel privacy preservation solution tailored for LLM applications and underpinned by confidential data handling practices. The strategic application of EncChain significantly enhances data security measures within LLM frameworks, diminishing the likelihood of unauthorized data access and exploitation. More specifically, EncChain exhibits the following key attributes:

• Encrypted Knowledge Base and User Interactions: All knowledge base and interaction records are encrypted using distinct keys before leaving the secure perimeter, which ensures that information remains perpetually in ciphered form, thereby precluding access to its unencrypted counterpart, even for application architects.

• Confidential Data Computing Capability: EncChain provides a suite of core functionalities, including confidential knowledge base loading, confidential similarity search, confidential prompt generation, and confidential large model inference. These capabilities enable developers to handle and process encrypted data without accessing plaintext, meeting the requirements for constructing business logic while protecting data privacy and security.

• Fine-grained Access Control: Through rigorous access control, EncChain enforces precise user permissions for knowledge bases. By defining roles like “questioner” and “knowledge base owner” and assigning access based on unique identifiers for these roles, it mitigates unauthorized data access and potential exfiltration.

• Streamlined Integration and Application: As a Python package, EncChain offers straightforward integration into third-party applications, facilitating adoption by allowing developers to easily incorporate its features. This ease of use, combined with support for both encrypted and plaintext queries, significantly reduces the complexity for developers new to the system.

• Execution Safety in Trusted Environments: EncChain and its associated LLMs are deployable within trusted execution environments, leveraging advanced hardware security features to safeguard virtual machine memory privacy and integrity. This setup ensures that sensitive data is shielded from both the host operating system and the virtual machine manager, enhancing operational security.

• Remote Attestation for Enhanced Trust: EncChain enables the use of remote attestation technologies to confirm the security and trustworthiness of the execution environments for itself and the deployed LLM, providing users with additional confidence in the security measures of LLM applications.

2 PRELIMINARIES

Retrieval Augmented Generation. The RAG architecture represents a significant advancement in addressing the challenge of hallucination in LLMs and has emerged as a dominant pattern in developing LLM applications. In particular, it enhances logical reasoning and data comprehension over private knowledge bases to augment question-answering (QA) capabilities. It is pivotal in scenarios like knowledge-based questioning and intelligent assistance. The RAG framework involves segmenting private knowledge into chunks and storing their embedding vectors in a database. Upon receiving a question, the system converts it into a vector, retrieves the most relevant knowledge via vector similarity search, and merges this with the question to form a comprehensive prompt for LLMs.
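The retrieve-then-prompt loop described above can be sketched in a few lines. This is a minimal illustration rather than EncChain's implementation: the bag-of-words `embed` function stands in for a real embedding model, the three chunks are made-up sample data, and all function names are hypothetical. Only the prompt wording is taken from Figure 2.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list) -> str:
    """Merge the retrieved context with the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using the following pieces of context:\n"
        f"{joined}\nQuestion: {question}"
    )


# Made-up sample knowledge base (illustrative only).
chunks = [
    "The warranty period is two years",
    "Returns are accepted within thirty days",
    "The office is closed on public holidays",
]
question = "How long is the warranty period?"
context = retrieve(question, chunks, k=1)
prompt = build_prompt(question, context)
```

In a production RAG system the vectors would be precomputed and held in a vector database instead of being recomputed per query as done here.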

Trusted Execution Environment. TEEs provide a cornerstone technology by offering secure and isolated execution spaces within processors, enhancing the security of data and code against potential threats from compromised operating systems or hypervisors in the complex landscape of cybersecurity and data privacy. Within this spectrum, Intel’s Trust Domain Extensions (TDX) serve as an evolved form of TEEs, tailored to bring their benefits into the realm of virtualization. TDX introduces the concept of trusted domains, in which virtual machines operate in isolation with hardware-level protections. This innovation directly addresses the intricate challenges of maintaining data privacy and security in environments such as cloud computing and data centers.

3 EncChain SOLUTION

3.1 Threat Model

The RAG architecture in QA leads to two primary threats: unauthorized access and data exfiltration. Firstly, its reliance on plaintext storage of knowledge bases and user queries permits developers unfettered access, creating a vector for data leaks in cases of malicious intent or system compromise. Secondly, the architecture lacks rigorous access controls, enabling users to potentially retrieve sensitive information beyond their clearance through intentionally designed queries. These threats collectively jeopardize data integrity and confidentiality, necessitating an immediate implementation of enhanced security protocols to mitigate the risks of unauthorized access and ensure the privacy protection of LLM applications.

3.2 Architecture Overview

The EncChain architecture, delineated in Figure 1 for LLM application deployment, emphasizes security and operational integrity.

[Figure 1: The architecture of the EncChain demonstration. Labels recoverable from the figure: Client Terminal, Web Browser, GPT-4 Chatbot, Legacy VM, 3rd-party Application, Confidential VM, EncChain Service, LLM Service, Guest OS, Firmware, Hypervisor, Host OS, Hardware, Intel TDX CPU, Other.]

It treats the client terminal as secure and encrypts data before it leaves the terminal, protecting it during transmission. Third-party applications are hosted on virtual machines (VMs), establishing a clear operational divide. EncChain and its models operate within secure virtual environments utilizing advanced VM technologies like TDX for enhanced runtime security. These environments are reinforced by hardware security extensions, safeguarding virtual memory from unauthorized access by the host OS and hypervisor. Third-party applications leverage EncChain’s APIs for encrypted data interactions and secure business logic development. Remote attestation technology allows users to verify the security of EncChain and LLM environments, adding a layer of trust. EncChain’s security protocol combines data encryption at domain entry and exit, strict access control, and the synergistic use of secure VMs and remote attestation. Together, these provide a robust framework for secure LLM application deployment and address the critical need for data security.
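The idea of releasing a data key only after the execution environment has been verified can be illustrated with a deliberately simplified simulation. This sketch does not use Intel TDX or any real attestation service: the shared `HARDWARE_KEY` and the HMAC tag are stand-ins (assumptions) for a hardware-signed quote, and every function name is hypothetical. Real attestation relies on asymmetric signatures rooted in the hardware vendor.

```python
import hashlib
import hmac
import secrets

# Stand-in for the hardware root of trust (assumption; real TEEs sign
# quotes with vendor-rooted asymmetric keys, not a shared secret).
HARDWARE_KEY = secrets.token_bytes(32)


def make_report(image: bytes, nonce: bytes) -> dict:
    """Simulate the environment measuring its code and binding a fresh nonce."""
    measurement = hashlib.sha256(image).digest()
    tag = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    return {"measurement": measurement, "nonce": nonce, "tag": tag}


def verify_report(report: dict, expected_measurement: bytes, nonce: bytes) -> bool:
    """Check authenticity, the expected code measurement, and nonce freshness."""
    expected_tag = hmac.new(
        HARDWARE_KEY, report["measurement"] + report["nonce"], hashlib.sha256
    ).digest()
    return (
        hmac.compare_digest(report["tag"], expected_tag)
        and hmac.compare_digest(report["measurement"], expected_measurement)
        and hmac.compare_digest(report["nonce"], nonce)
    )


def release_key(report: dict, expected_measurement: bytes, nonce: bytes, data_key: bytes):
    """Hand over the data key only if the report verifies; otherwise None."""
    return data_key if verify_report(report, expected_measurement, nonce) else None


trusted_image = b"encchain-service-build"          # hypothetical trusted build
expected = hashlib.sha256(trusted_image).digest()  # value the user expects
nonce = secrets.token_bytes(16)                    # fresh challenge
data_key = secrets.token_bytes(32)

good = release_key(make_report(trusted_image, nonce), expected, nonce, data_key)
bad = release_key(make_report(b"tampered-build", nonce), expected, nonce, data_key)
```

The nonce check is what prevents an old, previously valid report from being replayed against a new challenge.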

3.3 Fine-grained Knowledge Control

EncChain enhances privacy in LLM applications that use RAG-based inference over private knowledge bases by leveraging fine-grained knowledge control. This innovation, derived from Operon’s privacy-protected data management, embodies the concept of the Behavior Control List (BCL). Specifically, EncChain allows “knowledge owners” to establish a binary relationship between the “questioners” and the “knowledge bases.” When a questioner poses a question and triggers the LLM’s inference, EncChain ensures that the search for relevant knowledge vectors occurs exclusively within the authorized subset of vector databases, and answers are generated based on this relationship. This solves an issue traditionally addressed either by employing multiple distinct LLM instances to segregate knowledge for privacy protection (sacrificing efficiency and increasing costs) or by utilizing a single system and accepting privacy risks. EncChain’s innovation lies in its ability to protect privacy while optimizing the retrieval and integration process of knowledge, thereby finding an effective equilibrium between privacy security and knowledge utilization.
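A permission-scoped search of this kind can be sketched as follows. This is an illustrative model of the questioner-to-knowledge-base relationship, not EncChain's BCL implementation: the `PermissionTable` class, the word-overlap scoring, and the sample knowledge bases are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set:
    """Lowercased word set used for a toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class PermissionTable:
    """Binary relation between questioners and knowledge bases."""
    grants: set = field(default_factory=set)

    def add_permission(self, user_id: str, kb_id: str) -> None:
        self.grants.add((user_id, kb_id))

    def allowed_kbs(self, user_id: str) -> set:
        return {kb for user, kb in self.grants if user == user_id}


def retrieve(user_id: str, question: str, knowledge_bases: dict,
             table: PermissionTable, k: int = 1) -> list:
    """Search only the knowledge bases this user is authorized to query."""
    q = tokens(question)
    scored = []
    for kb_id in sorted(table.allowed_kbs(user_id)):
        for passage in knowledge_bases.get(kb_id, []):
            score = len(q & tokens(passage))
            if score > 0:
                scored.append((score, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


# Made-up sample data (illustrative only).
knowledge_bases = {
    "kb_products": ["X1 router supports Wi-Fi 6"],
    "kb_finance": ["Quarterly revenue grew last year"],
}
table = PermissionTable()
table.add_permission("user_1", "kb_products")
table.add_permission("user_2", "kb_finance")

question = "Quarterly revenue figures?"
sales_view = retrieve("user_1", question, knowledge_bases, table)
finance_view = retrieve("user_2", question, knowledge_bases, table)
```

Because filtering happens before scoring, an unauthorized knowledge base never contributes candidates, so a single shared instance can serve users with different clearances.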

3.4 System Workflow

We present the procedural workflow of EncChain through a specific example, as illustrated in Figure 2. In this scenario, we assume four distinct roles: (A) knowledge base data owners; (B) questioners; (C) third-party software developers providing QA applications; and (D) TEEs (e.g., cloud infrastructure) for deploying LLMs with EncChain. We note that, in practical scenarios, A and B might represent the same entity, or B could be a controlled party of A (for instance, a sales Employee B of Company A authorized to inquire about product information and receive answers, but not permitted to ask questions related to the company’s finances). Similarly, C and D could also be consolidated into a single entity, offering both software applications and compute resources, depending on the service model. Upon establishing the identities of the involved roles, the workflow proceeds as follows:

[Figure 2: An illustrative workflow of the EncChain. For simplicity, the figure omits step 0, which indicates the initialization phase. The figure separates a trusted environment (plaintext) from an untrusted environment (ciphertext). Recoverable labels: User Query ("What is ChatGPT and OpenAI?"); Knowledgebase; Third Party LLM Application; LLM; EncChain Core API with Confidential Retrieval (encchain.retrieveFromDB), Confidential Prompt (encchain.generatePrompts), Confidential Inference (encchain.getAnswserFromLLM), and Knowledge Permission (encchain.addPermission); Prompt ("Answer the question using the following pieces of context:"); Vector DB for Cipher Embeddings (columns: id, content, embedding); User-Knowledge Permissions (columns: id, user, knowledgebase_id, permission).]

0: A (the knowledge base owner) first verifies the EncChain instance by utilizing TEE remote attestation, ensuring that EncChain operates with confidentiality and integrity. This enables the handing over of keys to EncChain, allowing it to decrypt ciphertext within the TEE.

1: A encrypts knowledge bases and uploads them to C (the third-party application), ensuring that even a malicious C cannot comprehend the hosted information.

2: C further hands over the uploaded knowledge bases into the trusted domain via EncChain’s APIs.

3: The uploaded knowledge bases are decrypted using the owner’s key, vectorized, and securely stored in the vector database.

4: B (the questioner) poses a question that is encrypted with its own key before being submitted to C. As in step 0, B also needs to submit its key, upon attestation, for EncChain to interpret its question.

5: C, unable to understand the question submitted in step 4, can only retrieve relevant contextual knowledge through EncChain’s Retrieve interface. Notably, at this stage, EncChain delineates the appropriate subset of knowledge bases for B’s query based on the User-Knowledge Permission. Knowledge vectors pertinent to the question, retrieved by D (the TEE-hosted EncChain deployment), are returned to C in encrypted form.

6: C, following the desired business logic, constructs an appropriate prompt and requests EncChain for LLM inference.

7: D decrypts the request’s ciphertext with B’s key within the TEE and carries out the LLM inference process.

8: D returns the model inference output to C in encrypted form.

9: C returns the encrypted response to B, who then uses its own key to decrypt and obtain the answer to the question.
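The message flow in steps 0 to 9 can be simulated end to end to show that the third-party application only ever handles ciphertext. This is a sketch under strong simplifying assumptions: the `encrypt`/`decrypt` pair is a toy construction built from SHA-256 and HMAC purely to keep the example self-contained (a real system would use a vetted authenticated cipher), attestation and permission checks are omitted, the "LLM" is a stub, and the `Enclave` class and its methods are hypothetical names, not EncChain's API.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode (illustrative only)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]


def encrypt(key: bytes, plaintext: str) -> bytes:
    data = plaintext.encode()
    nonce = secrets.token_bytes(16)
    body = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + tag + body


def decrypt(key: bytes, blob: bytes) -> str:
    nonce, tag, body = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return bytes(a ^ b for a, b in zip(body, _keystream(key, nonce, len(body)))).decode()


class Enclave:
    """Role D: the only place where plaintext exists."""

    def __init__(self):
        self.keys = {}
        self.passages = []

    def register_key(self, party: str, key: bytes) -> None:
        self.keys[party] = key  # steps 0 and 4: key handover (attestation omitted)

    def add_kb(self, owner: str, enc_passages: list) -> None:
        # Step 3: decrypt inside the enclave and store for retrieval.
        self.passages += [decrypt(self.keys[owner], p) for p in enc_passages]

    def retrieve(self, user: str, enc_question: bytes) -> bytes:
        # Step 5: pick the passage with the largest word overlap, return it encrypted.
        words = set(decrypt(self.keys[user], enc_question).lower().split())
        best = max(self.passages, key=lambda p: len(words & set(p.lower().split())))
        return encrypt(self.keys[user], best)

    def infer(self, user: str, enc_question: bytes, enc_context: bytes) -> bytes:
        # Steps 7-8: decrypt, run a stub "LLM", encrypt the answer.
        decrypt(self.keys[user], enc_question)  # authenticates the question
        context = decrypt(self.keys[user], enc_context)
        return encrypt(self.keys[user], f"Based on the context: {context}")


enclave = Enclave()
key_a, key_b = secrets.token_bytes(32), secrets.token_bytes(32)
enclave.register_key("A", key_a)                                   # step 0
uploaded = [encrypt(key_a, p) for p in
            ["the warranty lasts two years", "returns close after thirty days"]]  # step 1
enclave.add_kb("A", uploaded)                                      # steps 2-3
enclave.register_key("B", key_b)
enc_q = encrypt(key_b, "how long is the warranty")                 # step 4
enc_ctx = enclave.retrieve("B", enc_q)                             # step 5
enc_answer = enclave.infer("B", enc_q, enc_ctx)                    # steps 6-8
answer = decrypt(key_b, enc_answer)                                # step 9
```

Everything that passes through the module-level code after key registration (`uploaded`, `enc_q`, `enc_ctx`, `enc_answer`) is ciphertext, which mirrors what role C can observe.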

4 DEMONSTRATION

During the demonstration, we will present to the audience a comprehensive, end-to-end framework for live deployments of LLM applications, emphasizing privacy safeguards for proprietary knowledge bases and rigorous permission control, facilitated by EncChain.

1 from EncChain import OpenSourceApp
2 # Create a chatbot instance with built-in open source LLM
3 app = OpenSourceApp(permissions=permissions.db, ...)

5 # Register a knowledge base owner with its owner's key
6 owner_id = app.permissions.add_owner(owner_name, key)

8 # Insert a knowledge base into EncChain's vector database
9 kb_id = app.add_kb(data_type, enc_knowledge_body)
10 # Designate the ownership of the hosted knowledge base
11 app.permissions.add_kb(owner_id, kb_id)

13 # Register a questioner with its user's key
14 user_id = app.permissions.add_user(user_name, key)
15 # Grant permissions to the user for the knowledge base
16 app.permissions.add_policy(user_id, kb_id, owner_sig)
17 # Answer the question in encrypted form with the user's key
18 app.query(enc_question, user_id)

Figure 3: Usage examples of the EncChain Python Library.

4.1 Python Library

First, we elucidate the process by which backend developers can adeptly and seamlessly initiate a hardware-secured LLM instance, utilizing the Python library provided by EncChain, complemented by illustrative code excerpts in Figure 3. EncChain offers a flexible, modular design, while also allowing for the instantiation of a confidential LLM instance in its default mode (line 3), which constructs a built-in open-source model. Due to space constraints, we omit a detailed discussion of additional construction parameters, such as those based on an existing permission database. Data owners should, under the assurance of instance trustworthiness, submit their keys to the instance’s permission table (line 6). Subsequently, data owners can host their privately held knowledge on EncChain (line 9), encrypted with the owner’s key, for future model inferences during QA sessions. Specifying the ownership of knowledge bases is essential for EncChain to determine which key to use for decrypting knowledge within the TEE for embedding and further knowledge integration (line 11). Similarly, questioners, upon verifying the instance’s credibility, should submit their unique keys (line 14). These keys are utilized to decrypt their encrypted queries and to encrypt answers to their questions. Before posing questions, it is imperative to ensure that users are authorized to use specific knowledge bases as context for generating answers, a process that requires the owner’s signature for authorization (line 16). Thereafter, users may encrypt their queries using the previously
