
Data Management
  Pretraining (§2)
    Data Quantity (§2.1)
      Scaling Laws: Kaplan et al. (2020), Chinchilla scaling law (Hoffmann et al., 2022)
      Data Repetition: Villalobos et al. (2022), Muennighoff et al. (2023), Hernandez et al. (2022), Xue et al. (2023), D4 (Tirumala et al., 2023)
    Data Quality (§2.2)
      Deduplication: Lee et al. (2021), Kandpal et al. (2022), Silcock et al. (2022), SemDeDup (Abbas et al., 2023), Kaddour (2023)
      Quality Filtering: Gao (2021), Kreutzer et al. (2022), Gunasekar et al. (2023), Li et al. (2023b), RefinedWeb (Penedo et al., 2023), Marion et al. (2023), Longpre et al. (2023b)
      Toxicity Filtering: Luccioni and Viviano (2021), Xu et al. (2021), Welbl et al. (2021), Longpre et al. (2023b)
      Social Bias: Dodge et al. (2021), Meade et al. (2022), Gururangan et al. (2022), Feng et al. (2023)
      Diversity & Age: Lee et al. (2023a), D2 Pruning (Maharana et al., 2023), Longpre et al. (2023b)
    Domain Composition (§2.3): Longpre et al. (2023b), CodeGen2 (Nijkamp et al., 2023), SlimPajama-DC (Shen et al., 2023), DSIR (Xie et al., 2023b), DoReMi (Xie et al., 2023a), DoGE (Fan et al., 2023)
    Data Management Systems (§2.4): Data-Juicer (Chen et al., 2023a), Oasis (Zhou et al., 2023c)
  Supervised Fine-Tuning (§3)
    Data Quantity (§3.1): Ji et al. (2023), LIMA (Zhou et al., 2023a), Yuan et al. (2023), Chen et al. (2023b), DMT (Dong et al., 2023), Song et al. (2023)
    Data Quality (§3.2)
      Instruction Quality: INSTRUCTEVAL (Chia et al., 2023), LIMA (Zhou et al., 2023a), Ding et al. (2023), Wang et al. (2023d), Li et al. (2023a), Instruction Mining (Cao et al., 2023)
      Instruction Diversity: UltraChat (Ding et al., 2023), LIMA (Zhou et al., 2023a), Alpaca (Taori et al., 2023), #InsTag (Lu et al., 2023), Explore-Instruct (Wan et al., 2023)
      Instruction Complexity: #InsTag (Lu et al., 2023), WizardLM (Xu et al., 2023), WizardCoder (Luo et al., 2023), Orca (Mukherjee et al., 2023), Tree-Instruct (Zhao et al., 2023a), CELLO (He et al., 2023)
      Prompt Design: Mishra et al. (2022), Khashabi et al. (2022), Gonen et al. (2022), Yin et al. (2023b), Kung and Peng (2023), UIT (Liang et al., 2023), Weber et al. (2023), Gudibande et al. (2023), Song et al. (2023)
    Task Composition (§3.3): Wei et al. (2021), Wang et al. (2022), Sanh et al. (2022), Chung et al. (2022), Flan 2022 (Longpre et al., 2023a), ELM (Jang et al., 2023), Chen et al. (2023b), DMT (Dong et al., 2023), OPT-IML (Iyer et al., 2022), Tulu (Wang et al., 2023b)
    Data-Efficient Learning (§3.4): AlShikh et al. (2023), Attendu and Corbeil (2023), Ivison et al. (2023), Instruction Mining (Cao et al., 2023), AlpaGasus (Chen et al., 2023c), OpenChat (Wang et al., 2023a), DiverseEvol (Wu et al., 2023), Dynosaur (Yin et al., 2023a), MAmmoTH (Yue et al., 2023), DMT (Dong et al., 2023), LoBaSS (Zhou et al., 2023b), Data-Juicer (Chen et al., 2023a)

Figure 1: Taxonomy of research in data management for pretraining and supervised fine-tuning of Large Language Models (LLMs).
strategies. Therefore, this survey aims to provide a comprehensive overview of current research in data management, as shown in Figure 1. In Section 2, we focus on pretraining data management, covering research on data quantity, data quality, domain composition, and data management systems. In Section 3, we discuss data quantity, data quality, task composition, and data-efficient learning in the SFT stage of LLMs. In Section 4, we look to the future and present existing challenges and promising directions in training data management for LLMs. Through this survey, we aim to offer a guiding resource for practitioners seeking to build powerful LLMs through effective and efficient data management practices.
2 Pretraining of LLMs
Data management has proven important in the pretraining of many prominent LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022). However, most of these works either do not report their data management procedures or report only the strategies they adopted, even though the rationale behind each strategy and its effects are crucial for building stronger LLMs. In this section, we first review research on training-data scaling laws, with and without data repetition; an illustrative parametric form is given below.
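As a point of reference (added here only as a sketch; the fitted constants are reported in the original paper), the compute-optimal analysis of Hoffmann et al. (2022) models pretraining loss as a function of parameter count N and training tokens D:

    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where E is the irreducible loss and A, B, \alpha, \beta are empirically fitted constants. Minimizing L under a fixed compute budget, approximated as C \approx 6ND, yields the Chinchilla prescription that model size and training tokens should be scaled in roughly equal proportion.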
Then, data quality is explored with respect to deduplication, quality filtering, toxicity filtering, social bias, and data diversity and age; a minimal deduplication sketch follows for illustration.
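The snippet below is only a minimal sketch of exact document-level deduplication (the function name and normalization choices are our own illustrative assumptions, not taken from any surveyed method); the approaches reviewed in §2.2 go further, e.g., suffix-array and MinHash near-duplicate detection (Lee et al., 2021) or embedding-based semantic deduplication (SemDeDup; Abbas et al., 2023).

import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each exact-duplicate document (illustrative only)."""
    seen = set()
    unique_docs = []
    for doc in documents:
        # Normalize whitespace and case so trivially different copies collide.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

# Example: the second document is dropped as a duplicate of the first.
corpus = ["The cat sat.", "The  CAT  sat.", "A different document."]
print(exact_dedup(corpus))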
After that, domain composition and domain reweighting methods are