Demystifying Data Management for Large Language Models

Xupeng Miao

Carnegie Mellon University

Pittsburgh, PA, USA

xupeng@cmu.edu

Zhihao Jia

Carnegie Mellon University

Pittsburgh, PA, USA

zhihao@cmu.edu

Bin Cui

Peking University

Beijing, China

bin.cui@pku.edu.cn

ABSTRACT

Navigating the intricacies of data management in the era of Large Language Models (LLMs) presents both challenges and opportunities for database and data management communities. In this tutorial, we offer a comprehensive exploration into the vital role of data management across the development and deployment phases of advanced LLMs. We provide an in-depth survey of existing techniques of managing knowledge and parameter data during the whole LLM lifecycle, emphasizing the balance between efficiency and effectiveness. This tutorial stands to offer participants valuable insights into the best practices and contemporary challenges in data management for LLMs, equipping them with the knowledge to navigate and contribute to this rapidly evolving field.

CCS CONCEPTS

• Information systems → Data management systems; Information systems applications; • Computing methodologies → Machine learning; Artificial intelligence; Distributed computing methodologies.

KEYWORDS

Large Language Model, Pre-training, Fine-tuning, Inference, Data

Management, Database, Distributed Computing, Knowledge Data

ACM Reference Format:

Xupeng Miao, Zhihao Jia, and Bin Cui. 2024. Demystifying Data Management for Large Language Models. In Companion of the 2024 International Conference on Management of Data (SIGMOD-Companion ’24), June 9–15, 2024, Santiago, AA, Chile. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3626246.3654683

1 INTRODUCTION

In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) like GPT-3 [ ] and GPT-4 [114] have become pivotal in advancing our understanding and capabilities in data intelligence. These models, thriving on extensive datasets and sophisticated algorithms, are not just about understanding and generating human-like data contents. They also represent a significant stride in data management research [108, 132, 145, 168] and provide new research problems for database communities [123, 135, 138, 139]. LLMs have introduced new complexities and paradigms in processing, analyzing and leveraging data, marking a transformative phase in the field. However, alongside these breakthroughs on LLMs for data management, the development [101] and deployment [159] of these models introduce substantial challenges, particularly in the realm of data management.

This work is licensed under a Creative Commons Attribution International 4.0 License.

SIGMOD-Companion ’24, June 9–15, 2024, Santiago, AA, Chile

ACM ISBN 979-8-4007-0422-2/24/06

https://doi.org/10.1145/3626246.3654683

Figure 1: Tutorial overview. [Diagram: LLM Development (e.g., Pre-training, Fine-tuning; iterative training) covers Knowledge Data (§2.2, 25 mins) and Parameter Data (§2.3, 15 mins); LLM Deployment (e.g., Inference; iterative decoding) covers Knowledge Data (§2.4, 20 mins) and Parameter Data (§2.5, 10 mins). Topics listed in the diagram: Data Collection, Data Governance, Parameter Placement, Parameter Sync, Parameter Storage, Data Compression, Data Organization.]

As shown in Figure 1, this tutorial is designed to delve into the intricacies of data management in both the development and deployment phases of LLMs. We will explore the dual roles of data in these phases: as knowledge data (e.g., input data like training datasets) and as parameter data (model parameters). In the context of this tutorial, we treat the development of LLMs as a transformation process where human knowledge is encoded into model parameters. Conversely, during LLM deployment, these learned parameters are used to produce human-like knowledge outputs in response to user queries, much like a knowledge database. These two primary scenarios together make up a holistic LLM lifecycle, but each suffers from its own distinct data management problems.

Our journey will begin with a clear and generalized definition of ‘knowledge data’ as it applies to LLMs. We consider not only the input data samples for training but also their latent decoding states for inference. For LLM development, knowledge data serves as the foundation upon which these advanced LLMs are constructed. The data richness and diversity determine the breadth and depth of the model’s capabilities. By examining various formats of knowledge data, we gain insights into how LLMs assimilate and interpret information from the world around us.

Following the discussion on knowledge data for LLM development, we will also delve into the LLM inference process and explore the management of the Key-Value (KV) cache for LLM deployment. This component represents a latent knowledge base, encompassing the current context and playing a pivotal role in the model’s iterative decoding process. Understanding the KV cache’s functionality and its management is essential for optimizing LLM deployment and ensuring efficient and accurate information retrieval.

Furthermore, another critical aspect of LLMs is parameter data management. The vast number of model parameters in LLMs (i.e., billions) necessitates efficient management strategies for more effective execution. This includes optimizing for parallelization and other techniques that enhance both the development and deployment stages. Efficient management of parameter data is crucial not only for the operational functionality of the model but also for scaling up to handle increasingly complex tasks and larger data volumes.
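A rough back-of-the-envelope calculation suggests why parameter counts in the billions call for parallelization. The sketch below is illustrative only: it assumes a 175-billion-parameter model (the published size of GPT-3), 16-bit weights, and a hypothetical accelerator with 80 GB of memory. The function names are made up for this example, and the calculation counts the weights alone, ignoring gradients, optimizer state, and activations.

```python
def parameter_bytes(num_params: int, bytes_per_param: int) -> int:
    """Raw storage needed for the weights alone."""
    return num_params * bytes_per_param


def min_devices(total_bytes: int, device_bytes: int) -> int:
    """Smallest number of devices whose combined memory can hold the weights."""
    return -(-total_bytes // device_bytes)  # ceiling division on integers


GB = 10**9  # decimal gigabyte

# Assumption: 175B parameters stored as 16-bit (2-byte) values.
weights = parameter_bytes(175 * 10**9, 2)

print(weights / GB)                    # 350.0
print(min_devices(weights, 80 * GB))   # 5
```

Even before any training state is counted, the weights alone exceed the memory of a single such device, so they have to be partitioned across several.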

1.1 TARGET AUDIENCE AND LENGTH

Target Audience. Our tutorial aims to cater to a diverse audience, including SIGMOD attendees from both research and industry backgrounds. We intend to make this session accessible and engaging for all, irrespective of their prior experience in the fields of databases or machine learning. The tutorial will be comprehensive and self-contained, featuring a broad introduction and motivational examples to ensure that even non-specialists can follow along easily.

Length. The intended length of this tutorial is 1.5 hours. Our outline has six main components: 10 mins for background, 25 mins for §2.2, 15 mins for §2.3, 20 mins for §2.4, 10 mins for §2.5, and 10 mins for §2.6.

2 TUTORIAL OUTLINE

2.1 Background

2.1.1 LLM Development. The development of LLMs is a complex and multifaceted process, involving a series of sophisticated steps that transform vast datasets into intelligent, responsive models. We will start by introducing two basic concepts: pre-training and fine-tuning. These stages are fundamental in shaping the capabilities and applications of these sophisticated models.

Pre-training. The foundation of any LLM is its pre-training phase. This involves feeding the model a large corpus of data, allowing it to learn and understand patterns, contexts, and nuances in human language. The pre-training process is not just about data ingestion but also about the model learning to make predictions and generate responses that are coherent, contextually relevant, and as human-like as possible. Key challenges in this stage include improving data quality for better LLM capacity and managing the vast model parameters for efficient model training.

Fine-tuning. After the initial pre-training process, LLMs often undergo a fine-tuning process. This stage is crucial for adapting the model to specific tasks or domains, such as legal language processing, creative writing, or technical documentation. Fine-tuning involves training the model (or partial model) on a smaller, more specialized dataset, enabling it to hone its capabilities in particular areas. The challenge here mainly lies in selecting the right datasets that effectively represent the target domain without introducing new biases or overfitting the model.

Target: The audience will become familiar with the development of LLMs, including the nuances and challenges associated with pre-training and fine-tuning.

2.1.2 LLM Deployment. We will then introduce the model deployment step, which generates human-like data from new input data using the pre-trained or fine-tuned LLM.

Inference. LLM inference or serving requires the model to quickly access its learned parameters and apply them to the input queries, maintaining accuracy and coherence in its outputs. When serving each input query, LLMs often employ iterative decoding techniques, which also require access to all previously generated content as context knowledge. The KV cache temporarily saves this context to lower inference latency. The challenge lies in optimizing data management to ensure fast and accurate responses, especially when dealing with large-scale data workloads (e.g., multiple simultaneous requests or long sequences).

Target: The audience will become familiar with the deployment of LLMs, including the challenges that arise during LLM inference.
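The role of the KV cache in iterative decoding can be sketched with a toy single-head attention loop. This is a minimal illustration, not a real serving implementation: the class `KVCache` and the functions `attend` and `decode_step` are hypothetical names, and the query, key, and value vectors are supplied directly, whereas a real model would compute them from each token's hidden state.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Append-only store of per-token key and value vectors for one sequence."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Handle one new token: cache its key/value, then attend over the full context."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


# Toy (query, key, value) triples standing in for three successive tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
outputs = [decode_step(cache, q, k, v) for q, k, v in tokens]
print(len(cache.keys))  # 3
print(outputs[0])       # [1.0, 2.0]
```

Each step adds one key/value pair and reuses everything stored earlier, so earlier tokens are never re-encoded. The cost is that the cache grows with sequence length and with the number of concurrent requests, which is the data management problem named above.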

2.2 Knowledge Data Management for LLM Development

2.2.1 Data Collection. The first step in knowledge data management is the collection of a vast and diverse array of datasets. For pre-training, this typically involves aggregating textual data from books, articles, websites, code repositories and more, aimed at providing the model with a broad understanding of language patterns, styles, and contexts. The challenge lies in not only gathering vast amounts of data but also ensuring that it covers a wide spectrum of topics and linguistic styles [176]. Multimodal models demand data from a multitude of sources beyond text, such as images, audio and video. Other formats including structured data like tables, metadata, annotated or labeled data, and domain-specific data [131, 157] are also involved for unique model usage. Some empirical studies have observed scaling laws [106, 146], which describe the phenomenon that larger data scales correlate with stronger model performance. Recent LLMs like GPT-3 are typically trained on hundreds of gigabytes to over a terabyte of text data.
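One form often reported in empirical scaling-law studies is a power law in dataset size, L(D) = (D_c / D)^alpha, where D is the number of training tokens. The sketch below only illustrates that shape. Whether the works cited above use exactly this form is an assumption, and the constants `D_C` and `ALPHA` are made-up placeholders, not fitted values.

```python
def power_law_loss(num_tokens: float, d_c: float, alpha: float) -> float:
    """Loss predicted by a data-only power law: L(D) = (d_c / D) ** alpha."""
    return (d_c / num_tokens) ** alpha


# Illustrative constants only; real values must be fitted to measured training runs.
D_C, ALPHA = 1e13, 0.1

for tokens in (1e9, 1e10, 1e11):
    loss = power_law_loss(tokens, D_C, ALPHA)
    print(f"{tokens:.0e} tokens -> predicted loss {loss:.3f}")
```

Under this form, each tenfold increase in data multiplies the predicted loss by the same factor, 10 ** -alpha, so improvements continue with scale but with diminishing absolute returns.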

Fine-tuning LLMs requires more specialized data, often generated with human intervention. Specialized datasets typically range from a few megabytes to several gigabytes. User-generated data includes specific prompts with corresponding responses, human-written instructions [169], and human feedback mechanisms [116] in reinforcement learning (RL) settings. The objective is to refine the model’s capabilities for specific tasks or domains, making it more adept at handling particular types of queries or content. We will revisit the development of the GPT family [ ] and the LLaMA family [137] as examples to introduce the process of data collection in practice.
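The kinds of fine-tuning records described above can be pictured as simple structured samples. The following is a minimal sketch under stated assumptions: the record types `InstructionSample` and `PreferencePair`, their field names, and the JSON Lines layout are hypothetical choices for illustration, not the format used by any particular model family.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionSample:
    """One supervised fine-tuning record: an instruction and a reference response."""
    instruction: str
    response: str


@dataclass
class PreferencePair:
    """One human-feedback record: two candidate responses, one preferred by an annotator."""
    prompt: str
    chosen: str
    rejected: str


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


samples = [
    InstructionSample(
        instruction="Summarize: LLMs need careful data management.",
        response="LLMs depend on well-managed data.",
    )
]
prefs = [
    PreferencePair(
        prompt="Explain what a KV cache stores.",
        chosen="It stores the keys and values of previously processed tokens.",
        rejected="It is a relational database.",
    )
]

print(to_jsonl(samples))
print(to_jsonl(prefs))
```

Keeping one record per line makes such datasets easy to append to, shard, and filter.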

2.2.2 Data Governance. Data governance encompasses a suite of processes and practices aimed at ensuring the quality, consistency, and appropriateness of the data used for training. This stage is pivotal in shaping the model’s learning path and eventual performance. It involves a series of meticulous steps:

• Data Deduplication. This process involves meticulously identifying and removing duplicate contents within the datasets. Such data redundancy can skew the model’s learning, leading to overemphasis on repeated patterns [130, 136]. Effective deduplication ensures a more balanced and diverse training regimen for the model.

• Data Cleaning. Data cleaning in the context of LLM development extends beyond mere error correction [156, 173] and involves addressing critical aspects such as