EncChain: Enhancing Large Language Model Applications with
Advanced Privacy Preservation Techniques
Zhe Fu
Alibaba Cloud
je.fz@alibaba-inc.com
Mo Sha
Alibaba Cloud
shamo.sm@alibaba-inc.com
Yiran Li
Alibaba Cloud
yiranli.lyr@alibaba-inc.com
Huorong Li
Alibaba Cloud
huorong.lhr@alibaba-inc.com
Yubing Ma
Alibaba Cloud
yubing.myb@alibaba-inc.com
Sheng Wang
Alibaba Cloud
sh.wang@alibaba-inc.com
Feifei Li
Alibaba Cloud
lifeifei@alibaba-inc.com
ABSTRACT
In response to escalating concerns about data privacy in the Large Language Model (LLM) domain, we demonstrate EncChain, a pioneering solution designed to bolster data security in LLM applications. EncChain presents an all-encompassing approach to data protection, encrypting both the knowledge bases and user interactions. It empowers confidential computing and implements stringent access controls, offering a significant leap in securing LLM usage. Designed as an accessible Python package, EncChain ensures straightforward integration into existing systems, bolstered by its operation within secure environments and the utilization of remote attestation technologies to verify its security measures. The effectiveness of EncChain in fortifying data privacy and security in LLM technologies underscores its importance, positioning it as a critical advancement for the secure and private utilization of LLMs.
PVLDB Reference Format:
Zhe Fu, Mo Sha, Yiran Li, Huorong Li, Yubing Ma, Sheng Wang, and Feifei
Li. EncChain: Enhancing Large Language Model Applications with
Advanced Privacy Preservation Techniques. PVLDB, 17(12): 4413 - 4416,
2024.
doi:10.14778/3685800.3685888
1 INTRODUCTION
Since late 2022, interest in Large Language Models (LLMs) [1] has surged dramatically. ChatGPT, for instance, amassed over 100 million active users within just two months of its launch, representing an unprecedented technological uptake. The profound capabilities of LLMs across diverse domains have catalyzed their widespread adoption, driven integration efforts in various use cases, and demonstrated substantial benefits in augmenting productivity and efficiency.
However, the rapid advancement of LLMs has highlighted significant data security and privacy issues. These concerns are not merely theoretical. In March 2023, the Italian Data Protection Authority banned ChatGPT due to privacy concerns. In April, Samsung was accused of leaking sensitive semiconductor data to ChatGPT in three incidents over 20 days. By November, Microsoft prohibited employees from using ChatGPT at work, blocking related AI tools on company devices. These instances indicate a shift from initial enthusiasm to a more measured approach, recognizing the pronounced issues with LLMs in practical applications.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.
doi:10.14778/3685800.3685888
The aggregation of extensive knowledge bases and user queries, often containing sensitive data, introduces substantial security vulnerabilities when processed by LLMs. Typical LLM applications, such as third-party tailored domain-specific APIs, require significant computational resources and specialized hardware, often favoring cloud deployments. This setup introduces various security threats, including data exposure due to negligent or malicious service providers, multi-tenant architecture risks, and the potential misuse of sensitive user data for model refinement. The lack of theoretical tools to mitigate the risk of LLMs inadvertently revealing sensitive content further complicates the issue. As technological applications deepen, data security emerges as a pivotal constraint on the advancement of LLM technologies.
In this paper, we demonstrate the proposed EncChain, a novel privacy preservation solution tailored for LLM applications, underpinned by confidential data handling practices. The strategic application of EncChain significantly enhances data security measures within LLM frameworks, diminishing the likelihood of unauthorized data access and exploitation. More specifically, EncChain exhibits the following key attributes:

Encrypted Knowledge Base and User Interactions: All knowledge base and interaction records are encrypted using distinct keys before leaving the secure perimeter, which ensures that information remains perpetually in ciphered form, thereby precluding access to its unencrypted counterpart, even for application architects.

Confidential Data Computing Capability: EncChain provides a suite of core functionalities, including confidential knowledge base loading, confidential similarity search, confidential prompt generation, and confidential large model inference. These capabilities enable developers to handle and process encrypted data without accessing plaintext, meeting the requirements for constructing business logic while protecting data privacy and security.

Fine-grained Access Control: Through rigorous access control, EncChain enforces precise user permissions for knowledge bases. By defining roles like "questioner" and "knowledge base owner" and assigning access based on unique identifiers for these roles, it mitigates unauthorized data access and potential exfiltration.

Streamlined Integration and Application: As a Python package, EncChain offers straightforward integration into third-party applications, facilitating adoption by allowing developers to easily incorporate its features. This ease of use, combined with support for both encrypted and plaintext queries, significantly reduces the complexity for developers new to the system.
Execution Safety in Trusted Environments: EncChain and its associated LLMs are deployable within trusted execution environments, leveraging advanced hardware security features to safeguard virtual machine memory privacy and integrity. This setup ensures that sensitive data is shielded from both the host operating system and the virtual machine manager, enhancing operational security.

Remote Attestation for Enhanced Trust: EncChain enables the use of remote attestation technologies to confirm the security and trustworthiness of the execution environments for itself and the deployed LLM, providing users with additional confidence in the security measures of LLM applications.
2 PRELIMINARIES
Retrieval Augmented Generation. The RAG [3] architecture represents a significant advancement in addressing the challenge of hallucination in LLMs and has emerged as a dominant pattern for developing LLM applications; it particularly enhances logical reasoning and data comprehension over private knowledge bases to augment question-answering (QA) capabilities, and is pivotal in scenarios like knowledge-based questioning and intelligent assistance. The RAG framework involves segmenting private knowledge into embedding vectors stored in a database. Upon receiving a question, the system converts it into a vector, retrieves the most relevant knowledge via vector similarity search, and merges this with the question to form a comprehensive prompt for LLMs.
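The RAG loop described above can be condensed into a short sketch. The bag-of-characters embedding and the tiny corpus below are toy stand-ins for illustration only; a real system would use a sentence-embedding model and a vector database:

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-characters embedding, normalized to unit length;
    # a stand-in for a real sentence-embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# 1. Segment private knowledge and store its embedding vectors.
knowledge = ["OpenAI was founded in 2015.", "Sam graduated from Stanford."]
index = [(embed(k), k) for k in knowledge]

# 2. Convert the question into a vector; retrieve by vector similarity.
question = "Who founded OpenAI?"
qvec = embed(question)
best = max(index, key=lambda kv: cosine(qvec, kv[0]))[1]

# 3. Merge the retrieved context with the question into a prompt.
prompt = f"Answer using the following context:\n1. {best}\n\nQ: {question}"
```

Even this crude similarity measure ranks the OpenAI sentence above the unrelated one, which is the essence of the retrieval step.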
Trusted Execution Environment. TEEs [4, 5] are a cornerstone technology, offering secure and isolated execution spaces within processors that protect data and code against threats from compromised operating systems or hypervisors in the complex landscape of cybersecurity and data privacy. Within this spectrum, Intel's Trust Domain Extensions (TDX) [2] serve as an evolved form of TEEs, tailored to bring their benefits into the realm of virtualization. TDX introduces the concept of trusted domains, in which virtual machines operate in isolation with hardware-level protections. This innovation directly addresses the intricate challenges of maintaining data privacy and security in environments such as cloud computing and data centers.
3 EncChain SOLUTION
3.1 Threat Model
The RAG architecture in QA faces two primary threats: unauthorized access and data exfiltration. Firstly, its reliance on plaintext storage of knowledge bases and user queries permits developers unfettered access, creating a vector for data leaks in cases of malicious intent or system compromise. Secondly, the architecture lacks rigorous access controls, enabling users to potentially retrieve sensitive information beyond their clearance through intentionally designed queries. These threats collectively jeopardize data integrity and confidentiality, necessitating the immediate implementation of enhanced security protocols to mitigate the risks of unauthorized access and ensure the privacy protection of LLM applications.
3.2 Architecture Overview
The EncChain architecture for LLM application deployment, delineated in Figure 1, emphasizes security and operational integrity.
Figure 1: The architecture of the EncChain demonstration. (The figure shows a client terminal with a web browser and chatbot connecting to a third-party application in a legacy VM, while the EncChain service and LLM service run inside a confidential VM on an Intel TDX CPU, isolated from the host OS, hypervisor, and firmware.)
It treats the client terminal as secure, encrypting data before it exits and protecting it during transmission. Third-party applications are hosted on virtual machines (VMs), establishing a clear operational divide. EncChain and its models operate within secure virtual environments utilizing advanced VM technologies like TDX for enhanced runtime security. These environments are reinforced by hardware security extensions, safeguarding virtual memory from unauthorized access by the host OS and hypervisor. Third-party applications leverage EncChain's APIs for encrypted data interactions and secure business logic development. Remote attestation technology allows users to verify the security of the EncChain and LLM environments, adding a layer of trust. EncChain's security protocol includes data encryption at domain entry and exit, strict access control, and the synergistic use of secure VMs and remote attestation, providing a robust framework for secure LLM application deployment that addresses the critical need for data security.
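At its core, the remote attestation step above amounts to checking a signed measurement of the launched environment against expected reference values before releasing any data key. The sketch below is a heavily simplified placeholder (an HMAC shared secret instead of the CPU-rooted TDX quoting infrastructure, and invented field names), meant only to show the shape of the check:

```python
import hashlib
import hmac
import json
import secrets

# Reference measurement the user expects (hash of the approved
# EncChain + LLM image); the value here is purely illustrative.
EXPECTED_MEASUREMENT = hashlib.sha256(b"encchain-image-v1").hexdigest()

# Stand-in for the hardware attestation key; in TDX this trust is
# rooted in the CPU, here it is just a shared secret for the sketch.
ATTESTATION_KEY = secrets.token_bytes(32)

def generate_quote(measurement: str, nonce: bytes) -> dict:
    """TEE side: sign (measurement, nonce) with the attestation key."""
    body = json.dumps({"measurement": measurement, "nonce": nonce.hex()})
    sig = hmac.new(ATTESTATION_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

def verify_quote(quote: dict, nonce: bytes) -> bool:
    """User side: check signature, freshness (nonce), and measurement."""
    sig = hmac.new(ATTESTATION_KEY, quote["body"].encode(),
                   hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, quote["sig"]):
        return False
    body = json.loads(quote["body"])
    return (body["nonce"] == nonce.hex()
            and body["measurement"] == EXPECTED_MEASUREMENT)

nonce = secrets.token_bytes(16)
quote = generate_quote(EXPECTED_MEASUREMENT, nonce)
trusted = verify_quote(quote, nonce)  # only now hand over data keys
```

The nonce prevents replay of an old quote; only after `verify_quote` succeeds would a user submit its decryption key to the instance.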
3.3 Fine-grained Knowledge Control
EncChain enhances privacy in LLM applications that perform RAG-based inference over private knowledge bases through fine-grained knowledge control. This mechanism, derived from Operon's privacy-protected data management [6], embodies the concept of the Behavior Control List (BCL). Specifically, EncChain allows "knowledge owners" to establish a binary relationship between the "questioners" and the "knowledge bases." When a questioner poses a question and triggers the LLM's inference, EncChain ensures that the search for relevant knowledge vectors occurs exclusively within an authorized subset of vector databases, generating answers based on this relationship. This solves an issue traditionally addressed either by employing multiple distinct LLM instances to segregate knowledge for privacy protection (sacrificing efficiency and increasing costs) or by using a single system at the cost of privacy risks. EncChain's innovation lies in its ability to protect privacy while optimizing the retrieval and integration of knowledge, thereby finding an effective equilibrium between privacy security and knowledge utilization.
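The behavior-control idea can be sketched with a plain permission table (the paper does not publish EncChain's BCL internals, so the table layout and names below are assumptions mirroring Figure 2): retrieval only scores vectors that live inside knowledge bases the questioner is authorized for.

```python
# Permission table: (user, knowledgebase_id) -> allowed, modeled on the
# User-Knowledge Permissions table shown in Figure 2.
permissions = {
    ("Alice", "kb_products"): True,
    ("Alice", "kb_finance"): False,
}

# Vector DB: knowledgebase_id -> list of (embedding, chunk).
vector_db = {
    "kb_products": [([1.0, 0.0], "Product X ships in Q3.")],
    "kb_finance": [([0.9, 0.1], "Q2 revenue was $1.2M.")],
}

def authorized_subset(user: str) -> list[str]:
    """Knowledge bases this user may search."""
    return [kb for (u, kb), ok in permissions.items() if u == user and ok]

def retrieve(user: str, query_vec: list[float]) -> str:
    # Score only vectors inside the authorized subset of KBs; the
    # finance KB is never even considered for Alice.
    candidates = [
        (sum(q * v for q, v in zip(query_vec, vec)), chunk)
        for kb in authorized_subset(user)
        for vec, chunk in vector_db[kb]
    ]
    if not candidates:
        raise PermissionError(f"{user} has no accessible knowledge base")
    return max(candidates)[1]

# Even though the finance chunk is the closest match to this query
# vector, Alice's retrieval is confined to kb_products.
context = retrieve("Alice", [0.9, 0.1])
```

Filtering before similarity search, rather than after, is what prevents a crafted query from surfacing out-of-clearance content in the first place.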
3.4 System Workow
We present the procedural workow of EncChain through a spe-
cic example, as illustrated in Figure 2. In this scenario, we assume
four distinct roles:
A
knowledge base data owners;
B
question-
ers;
C
third-party software developers providing QA applications;
and
D
TEEs (e.g., cloud infrastructure) for deploying LLMs with
EncChain. We note that, in practical scenarios,
A
and
B
might
represent the same entity, or
B
could be a controlled party of
A
(for
4414
EncChain
1. Sam was born in … …
2. Sam graduated at … …
3. OpenAI was founded … …
'\x8229C… …F'
'\x311C6… …F'
'\x9298Z… …H'
Confidential Retrieval
Confidential Prompt
Confidential Inference
encchain.retrieveFromDB('\x243')
encchain.generatePrompts(tmpl, '\x1C8')
encchain.getAnswserFromLLM('\x318')
encchain.addPermission('user_1', 'kb_1')
id content embedding
1 '\x8229C… …F' [\x43, \x62, …, \x8C]
2 '\x9298Z… …H' [\x72, \x44, …, \x5F]
3 … … … …
Trusted Env Untrusted Env
Ciphertext
Third Party LLM
Application
'\x311C6… …F'
'\x8229C… …F'
'\x9298Z… …H'
Knowledge Permission
Knowledgebase
Plaintext
What is ChatGPT and OpenAI?
LLM
1. Sam was born in … …
2. Sam graduated at … …
3. OpenAI was founded … …
1. Sam was born in … …
2. OpenAI was founded at … …
3. … … … …
User Query
Prompt: Answer the question
using the following pieces of context:
1.
2.
Vector DB for Cipher Embeddings
id user knowledgebase_id permission
1 Alice 498956a3-6d11 true
2 Alice c92ab7fe-ac6f false
3 … … … … … …
User-Knowledge Permissions
EncChain Core API
Confidential Prompt
B
A
C
D
1
7
2 5
6
8
4
9
3
Figure 2: An illustrative workow of the EncChain. For simplicity, the gure omits
0
, which indicates the initialization phase.
instance, a sales Employee B of Company A authorized to inquire
about product information and receive answers, but not permitted
to ask questions related to the company’s nances). Similarly,
C
and
D
could also be consolidated into a single entity, oering both
software applications and compute resources, depending on the
service model. Upon establishing the identities of the involved roles,
the workow proceeds as follows:
(0) A first verifies the EncChain instance by utilizing TEE remote attestation, ensuring that EncChain operates with confidentiality and integrity. This enables the handing over of keys to EncChain, allowing it to decrypt ciphertext within the TEE.
(1) A encrypts knowledge bases and uploads them to C, ensuring that even a malicious C cannot comprehend the hosted information.
(2) C further hands over the uploaded knowledge bases into the trusted domain via EncChain's APIs.
(3) The uploaded knowledge bases are decrypted using the owners' key, vectorized, and securely stored in the vector database.
(4) B poses a question that is encrypted with its own key before being submitted to C. Similar to (0), B also needs to submit its key for EncChain to interpret its question upon attestation.
(5) C, unable to understand the question submitted in (4), can only retrieve relevant contextual knowledge through EncChain's Retrieve interface. Notably, at this stage, EncChain delineates the appropriate subset of knowledge bases for B's query based on User-Knowledge Permission. Knowledge vectors pertinent to the question, retrieved by D, are returned to C in encrypted form.
(6) C, following the desired business logic, constructs an appropriate prompt and requests EncChain for LLM inference.
(7) D decrypts the request's ciphertext with questioner B's key within the TEE and carries out the LLM inference process.
(8) D returns the model inference output to C in encrypted form.
(9) C returns the encrypted response to B, who then uses its own key to decrypt and obtain the answer to the question.
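The key handling in steps (0)-(9) can be condensed into a sketch. The SHA-256-based stream cipher below is a toy stand-in for a real AEAD scheme, and all names are illustrative; the point is only that C relays ciphertext it cannot read, while decryption happens solely "inside" D:

```python
import hashlib
import secrets

def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Toy keystream from SHA-256 in counter mode; a stand-in for a
    # real authenticated cipher such as AES-GCM.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def encrypt(key: bytes, msg: bytes) -> bytes:
    nonce = secrets.token_bytes(16)
    ks = _keystream(key, nonce, len(msg))
    return nonce + bytes(a ^ b for a, b in zip(msg, ks))

def decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:16], blob[16:]
    ks = _keystream(key, nonce, len(ct))
    return bytes(a ^ b for a, b in zip(ct, ks))

# Steps (0)/(4): after attestation succeeds, B hands its key to D.
b_key = secrets.token_bytes(32)
tee_keys = {"B": b_key}  # held only inside the TEE

# Step (4): B encrypts its question before it leaves the client.
ciphertext = encrypt(b_key, b"What is ChatGPT and OpenAI?")

# Steps (5)-(8), inside D: decrypt, run inference, re-encrypt.
def tee_answer(user: str, blob: bytes) -> bytes:
    question = decrypt(tee_keys[user], blob)   # plaintext exists only here
    answer = b"Answer to: " + question         # stand-in for LLM inference
    return encrypt(tee_keys[user], answer)

# Step (9): C relays the encrypted answer; only B can read it.
reply = tee_answer("B", ciphertext)
plaintext = decrypt(b_key, reply)
```

Note that C never holds `b_key`: everything C forwards in both directions is opaque ciphertext, matching the threat model in Section 3.1.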
4 DEMONSTRATION
During the demonstration, we will present to the audience a comprehensive, end-to-end framework for live deployments of LLM applications, emphasizing privacy safeguards for proprietary knowledge bases and rigorous permission control, facilitated by EncChain.
1  from EncChain import OpenSourceApp
2  # Create a chatbot instance with a built-in open-source LLM
3  app = OpenSourceApp(permissions=permissions.db, ...)
4
5  # Register a knowledge base owner with its owner's key
6  owner_id = app.permissions.add_owner(owner_name, key)
7
8  # Insert a knowledge base into EncChain's vector database
9  kb_id = app.add_kb(data_type, enc_knowledge_body)
10 # Designate the ownership of the hosted knowledge base
11 app.permissions.add_kb(owner_id, kb_id)
12
13 # Register a questioner with its user's key
14 user_id = app.permissions.add_user(user_name, key)
15 # Grant permissions to the user for the knowledge base
16 app.permissions.add_policy(user_id, kb_id, owner_sig)
17 # Answer the question encryptedly with the user's key
18 app.query(enc_question, user_id)

Figure 3: Usage examples of the EncChain Python Library.
4.1 Python Library
First, we elucidate the process by which backend developers can adeptly and seamlessly initiate a hardware-secured LLM instance using the Python library provided by EncChain, complemented by illustrative code excerpts in Figure 3. EncChain offers a flexible, modular design, while also allowing for the instantiation of a confidential LLM instance in its default mode (line 3), which constructs a built-in open-source model. Due to space constraints, we omit a detailed discussion of additional construction parameters, such as those based on an existing permission database. Data owners should, under the assurance of instance trustworthiness, submit their keys to the instance's permission table (line 6). Subsequently, data owners can host their privately held knowledge on EncChain (line 9), encrypted with the owner's key, for future model inferences during QA sessions. Specifying the ownership of knowledge bases is essential for EncChain to determine which key to use for decrypting knowledge within the TEE for embedding and further knowledge integration (line 11). Similarly, questioners, upon verifying the instance's credibility, should submit their unique keys (line 14). These keys are utilized to decrypt their encrypted queries and to encrypt answers to their questions. Before posing questions, it is imperative to ensure that users are authorized to use specific knowledge bases as context for generating answers, a process that requires the owner's signature for authorization (line 16). Thereafter, users may encrypt their queries using the previously
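Since Figure 3 shows only excerpts, a hypothetical in-memory stub of the same surface can make the call sequence concrete. Class and method names below mirror the figure; the bodies are placeholders we invented for illustration, not EncChain's actual implementation:

```python
import uuid

class _Permissions:
    """In-memory stand-in for EncChain's permission database."""
    def __init__(self):
        self.owners, self.users = {}, {}
        self.kb_owner, self.policies = {}, set()

    def add_owner(self, name, key):
        oid = str(uuid.uuid4())
        self.owners[oid] = (name, key)
        return oid

    def add_user(self, name, key):
        uid = str(uuid.uuid4())
        self.users[uid] = (name, key)
        return uid

    def add_kb(self, owner_id, kb_id):
        self.kb_owner[kb_id] = owner_id

    def add_policy(self, user_id, kb_id, owner_sig):
        # A real system would verify owner_sig against the owner's key.
        self.policies.add((user_id, kb_id))

class OpenSourceApp:
    def __init__(self, permissions=None):
        self.permissions = permissions or _Permissions()
        self.kbs = {}

    def add_kb(self, data_type, enc_knowledge_body):
        kb_id = str(uuid.uuid4())
        self.kbs[kb_id] = (data_type, enc_knowledge_body)
        return kb_id

    def query(self, enc_question, user_id):
        # Placeholder: the real service retrieves context and runs the
        # LLM inside the TEE, returning an encrypted answer.
        return b"<encrypted answer for " + user_id.encode() + b">"

# The call sequence from Figure 3, end to end:
app = OpenSourceApp()
owner_id = app.permissions.add_owner("owner", b"owner-key")
kb_id = app.add_kb("text", b"<encrypted kb>")
app.permissions.add_kb(owner_id, kb_id)
user_id = app.permissions.add_user("Alice", b"user-key")
app.permissions.add_policy(user_id, kb_id, b"sig")
answer = app.query(b"<encrypted question>", user_id)
```

Such a stub is useful for wiring up third-party application logic before pointing it at a real attested instance.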