Improving Automatic Parallel Training via Balanced
Memory Workload Optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui
Abstract—Transformer models have emerged as the leading
approach for achieving state-of-the-art performance across various
application domains, serving as the foundation for advanced large-
scale deep learning (DL) models. However, efficiently training these
models across multiple GPUs remains a complex challenge due to
the abundance of parallelism options. Existing DL systems either
require manual efforts to design distributed training plans or limit
parallelism combinations to a constrained search space. In this
paper, we present Galvatron-BMW, a novel system framework
that integrates multiple prevalent parallelism dimensions and
automatically identifies the most efficient hybrid parallelism
strategy. To effectively navigate this vast search space, we employ
a decision tree approach for decomposition and pruning based
on intuitive insights. We further utilize a dynamic programming
search algorithm to derive the optimal plan. Moreover, to improve
resource utilization and enhance system efficiency, we propose a bi-
objective optimization workflow that focuses on workload balance.
Our evaluations on different Transformer models demonstrate
the capabilities of Galvatron-BMW in automating distributed
training under varying GPU memory constraints. Across all tested
scenarios, Galvatron-BMW consistently achieves superior system
throughput, surpassing previous approaches that rely on limited
parallelism strategies.
Index Terms—Transformers, Distributed Learning, Automatic
Parallelism
I. INTRODUCTION
Transformer models have achieved great success
in a wide range of deep learning (DL) applications in
recent years, such as computer vision (CV) [1], [2], natural
language processing (NLP) [3]–[7], graph learning [8], [9] and
recommendation systems [10]. For example, many Transformer
variants (e.g., BERT [11], GPT-2 [12], T5 [13]) are leading
the state-of-the-art performance in various NLP tasks such
as machine translation and question answering. Transformers
are also applicable to image recognition (e.g., ViT [1], Swin
Transformer [14]) and multimodal tasks (e.g., CLIP [15], DALL-
E [16]). Due to their superior performance, Transformers
are becoming increasingly important in modern artificial
intelligence industries.
Yujie Wang, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie and Youhe Jiang
are with the Key Lab of High Confidence Software Technologies (MOE),
School of CS, Peking University, Beijing 100871, China. E-mail: {alfredwang,
ccchengff, shenhan.zhu, xiaonan.nie}@pku.edu.cn, youhejiang@gmail.com
Xupeng Miao is with the Computer Science Department of Carnegie Mellon
University. E-mail: xupeng@cmu.edu
Yaofeng Tu is with ZTE company. E-mail: tu.yaofeng@zte.com.cn
Bin Cui is with the Key Lab of High Confidence Software Technologies
(MOE), School of CS, Peking University, Beijing 100871, and Institute of
Computational Social Science, Peking University (Qingdao), China. E-mail:
bin.cui@pku.edu.cn.
Empirical evidence indicates that scaling model parameters
is an effective path towards model performance improvements [17].
For instance, the original Transformer only has
millions of model parameters while GPT-2 has 1.5 billion with
superior performance [12]. Such large amounts of parameters
also incur high computational and memory costs even for
emerging accelerators like GPUs. With the increasing model
scales, building and designing Transformers demand more sys-
tem optimizations, and how to perform efficient Transformers
training is becoming more challenging.
Distributed DL systems adopt data and model parallelism
to improve the training efficiency by utilizing multiple GPU
devices. Data parallelism divides the large volume of input
data into multiple parts and each device is only responsible
for partial data [18]–[20]. It requires each device to store a
whole model replica, suffering from large model scales. Model
parallelism is a more promising direction that partitions the
model from different parallelism dimensions and makes each
device store a subset of model parameters, such as tensor
parallelism [21] and pipeline parallelism [22]–[25]. Various choices of
the parallelism strategies lead to distinct memory consumption,
communication overheads, and execution efficiency.
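For reference, plain data parallelism can be set up in PyTorch with DistributedDataParallel roughly as follows. This is a minimal self-contained sketch, independent of Galvatron-BMW; the linear layer and random batches stand in for a real Transformer and data loader.

```python
# Minimal sketch of plain data parallelism with PyTorch DDP (launched via `torchrun`).
# Every rank holds a full model replica; only the input batches are sharded.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")                # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for a Transformer
    model = DDP(model, device_ids=[local_rank])            # full replica on every device
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                                    # stand-in for a sharded data loader
        x = torch.randn(8, 1024, device=local_rank)
        loss = model(x).pow(2).mean()                      # placeholder loss
        loss.backward()                                    # gradients all-reduced across replicas
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that every rank keeps a complete replica of the model parameters, which is exactly the memory limitation of data parallelism discussed above.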
However, directly applying these techniques to scale
Transformers faces crucial challenges in both system effi-
ciency and usability. Some recent advanced methods have been
proposed to automatically find the parallelism strategies through
the fine-grained combination of data and model parallelism for
individual operators in the model. For example, OptCNN [26],
FlexFlow [27], [28], Tofu [29], and TensorOpt [30] consider
both data and tensor parallelism and use different search
algorithms to optimize the execution plans. PipeDream [24]
and DAPPLE [31] combine pipeline parallelism with data
parallelism to improve the efficiency. Unfortunately, existing ap-
proaches only support limited parallelism dimensions (i.e., data
parallelism plus a few model parallelism dimensions) or rely on
strong assumptions about model and hardware configurations
(i.e., expert-designed parallelism strategies), resulting in
sub-optimal performance in practice. To the best of our knowledge,
few prior works consider automatic parallelism for large-scale
Transformers with a complex search space including multiple
parallelism dimensions.
In this paper, we propose Galvatron-BMW, a novel
automatic parallel training system for Transformer models over
multiple GPUs. Our target is to integrate data parallelism with a
variety of model parallelism dimensions, provide a significantly larger
search space (compared with previous approaches), and find
the optimal hybrid parallelism strategies in an efficient manner.
However, such an integration brings an explosive growth of
the search space, which cannot be explored directly with conventional methods.
Therefore, we are interested in the following question: How
can we exploit as many parallelism dimensions as possible
and efficiently explore the search space in the meanwhile?
We study five parallelism paradigms, four of which are
popular parallelism paradigms in the distributed training of
Transformer models, including data parallelism (DP), sharded
data parallelism (SDP) [32], tensor parallelism (TP), and
pipeline parallelism (PP). Besides, we also take into account
activation checkpointing (CKPT) as a special parallelism
dimension, which shifts part of the training memory workload
to recomputation in the backward pass via checkpoints. These
parallelism paradigms have distinct memory consumption and
communication overheads and no single paradigm could beat
the others on both sides. The search space of automatic
parallelism should therefore include arbitrary combinations of them.
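To make the CKPT trade-off concrete, activation checkpointing can be applied layer by layer in PyTorch roughly as follows. This is a minimal sketch, not Galvatron-BMW's actual implementation; the `layers` list and `ckpt_mask` are placeholders.

```python
# Minimal sketch: trade activation memory for recomputation with checkpointing.
from torch.utils.checkpoint import checkpoint

def forward_with_ckpt(layers, x, ckpt_mask):
    # ckpt_mask[i] == True means layer i discards its intermediate activations in
    # the forward pass and recomputes them during the backward pass.
    for layer, use_ckpt in zip(layers, ckpt_mask):
        if use_ckpt:
            x = checkpoint(layer, x)
        else:
            x = layer(x)
    return x
```

Checkpointed layers consume less activation memory at the price of an extra forward computation in the backward pass, which is why CKPT behaves like one more dimension in the memory/computation trade-off.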
Inspired by some key intuitions from our observations and
analysis, we first propose a decision-tree structure to decompose
the search space and perform pruning to remove the inefficient
combinations. To determine the final distributed execution plan,
we then propose a dynamic programming search algorithm
to utilize the optimal substructure property of this problem.
Based on these, we provide Galvatron-BMW, which not only
targets automatic parallelism for Transformer model training,
but also considers the Balancing trade-off between Memory
and computation Workloads across devices. During the search
process, Galvatron-BMW provides the required computation
and communication costs and memory consumption through a
cost estimator. It is worth mentioning that the cost estimation
in Galvatron-BMW considers the GPU performance slowdown
from computation and communication overlapping, which has
been ignored for a long time in previous approaches. We
provide an implementation of Galvatron-BMW over PyTorch.
Unlike existing toolbox-like systems (e.g., DeepSpeed [33],
Megatron [21]) that rely on users' expertise and significant
tuning efforts, Galvatron-BMW's automatic parallelism only
requires a few lines of modification to the original training
script. Our evaluation selects four representative Transformers,
including both NLP (i.e., BERT and T5) and CV (i.e., ViT, Swin
Transformer). The experiments show that Galvatron-BMW
could significantly outperform the four pure parallelisms and
existing automatic parallelisms with limited dimensions (i.e.,
DP+TP and DP+PP) under various device memory budgets.
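For intuition, the flavor of the dynamic programming search can be sketched as below. This is an illustrative simplification under the assumption that each layer picks one strategy from a pruned candidate set, with per-layer time and memory costs and pairwise transition costs supplied by a cost estimator; the function and cost-table names are hypothetical, not Galvatron-BMW's actual interface.

```python
# Illustrative layer-wise DP: pick one strategy per layer to minimize total time
# cost while the accumulated memory cost stays within the device memory budget.
import math

def dp_search(num_layers, strategies, time_cost, mem_cost, trans_cost, mem_budget):
    # State: (memory_used, strategy_of_last_layer) -> minimal accumulated time.
    # Assumes integer memory units so that the number of states stays bounded.
    states = {(0, None): 0.0}
    for l in range(num_layers):
        next_states = {}
        for (mem, prev_s), t in states.items():
            for s in strategies:
                new_mem = mem + mem_cost[l][s]
                if new_mem > mem_budget:
                    continue                        # prune plans that exceed the budget
                new_t = t + time_cost[l][s]
                if prev_s is not None:
                    new_t += trans_cost[prev_s][s]  # cost of switching strategies between layers
                key = (new_mem, s)
                if new_t < next_states.get(key, math.inf):
                    next_states[key] = new_t
        states = next_states
    # Extending the state with back-pointers would recover the per-layer plan.
    return min(states.values()) if states else math.inf
```

The optimal substructure here is that the best plan for the first l layers under a given memory usage depends only on the strategy chosen for layer l, not on how earlier layers were arranged.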
We summarize our contributions as follows:
1) We enlarge the scope of automatic parallelism
for Transformer training to five parallelism dimensions,
and introduce a novel decision-tree abstraction to
decompose the large search space.
2) We design a novel parallelism optimization method
to automatically find the most efficient hybrid paral-
lelism strategy based on the estimated costs.
3) We consider both memory consumption and com-
putation workload through a bi-objective optimization
framework to maximize the hardware utilization
during training.
4) We build the Galvatron-BMW system, which supports
larger models' training and achieves up to 530% and
242% throughput speedups compared to state-of-the-art
pure and hybrid parallelism methods, respectively.
Fig. 1: System overview of Galvatron-BMW. (Figure: the model and hardware serve as inputs to three modules: the search space construction (Section III), the parallelism optimization framework (Section IV), in which the dynamic programming search (Section IV-A) and the workload balance adjustment (Section IV-B) iterate for T steps to produce the pipeline partition and parallel strategies, and the cost estimator (Section V).)
Figure 1 shows the system overview of Galvatron-BMW,
which takes the model and hardware environment as inputs, and
comprises three primary modules: search space construction
(Section III), parallelism optimization framework (Section IV),
and cost estimator (Section V). In Section III, we introduce the
search space construction of Galvatron-BMW with decision-
tree-based decomposition. In Section IV, we propose our
parallelism optimization framework, which leverages dynamic
programming search and workload balance adjustment tech-
niques to iteratively refine the optimal parallelism strategy. In
Section V, we provide a cost estimator to estimate the execution
cost and memory consumption efficiently and accurately.
We also provide implementation details in Section VI and
comprehensive experimental results in Section VII.
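As a rough picture of how the modules interact, the iterative workflow can be sketched in pseudocode-style Python; all helper names below are hypothetical placeholders for the components described in Sections IV and V, not Galvatron-BMW's actual API.

```python
# Pseudocode-style sketch of the outer optimization loop (hypothetical helpers):
# alternate between searching parallel strategies for a fixed pipeline partition
# and re-balancing the pipeline partition given the resulting workloads.
def optimize(model, cluster, mem_budget, T):
    partition = initial_pipeline_partition(model, cluster)   # e.g., a uniform layer split
    strategies = None
    for _ in range(T):
        strategies, cost = dp_search_per_stage(model, partition, cluster, mem_budget)
        new_partition = rebalance_workload(model, partition, strategies)
        if new_partition == partition:                        # fixed point: workloads balanced
            break
        partition = new_partition
    return partition, strategies
```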
II. PRELIMINARY
A. Transformer Models
Transformers were first proposed to solve sequence modeling
and transduction problems such as language modeling and
machine translation [3]. The self-attention and point-wise
feed-forward modules are the basic components of each
Transformer layer. Most operations are dense algebraic kernels
such as matrix multiplications, resulting in huge computation
costs and memory consumption.
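For readers unfamiliar with the structure, a generic Transformer layer can be written in PyTorch roughly as follows (a minimal sketch, not the exact layer of any model evaluated in this paper); the attention projections and the two linear layers of the FFN account for the dense matrix multiplications mentioned above.

```python
# Minimal sketch of one Transformer layer (self-attention + point-wise FFN).
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, hidden: int, heads: int, ffn_hidden: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                      # point-wise feed-forward module
            nn.Linear(hidden, ffn_hidden), nn.GELU(), nn.Linear(ffn_hidden, hidden)
        )
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, x):
        h, _ = self.attn(x, x, x)                      # self-attention (dense matmuls)
        x = self.norm1(x + h)
        x = self.norm2(x + self.ffn(x))                # FFN (two large matmuls)
        return x
```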
Transformers in NLP. Different ways of using Trans-
former layers in NLP lead to three main Transformer archi-
tectures, including encoder-only (for text classification, e.g.,
BERT and RoBERTa [34]), decoder-only (for text generation,
e.g., GPT-2 and Transformer-XL [35]), and encoder-decoder
(for sequence-to-sequence tasks, e.g., T5 and BART [36]).
They have similar basic model components with slight
structural differences. For example, the decoder has an
additional attention layer (cross-attention over the encoder
outputs) compared to the encoder. Moreover, the encoder-decoder
architecture combines encoders and decoders symmetrically
(i.e., with the same number of layers). These differences bring distinct system workload
characteristics in both computation and memory.
Transformers in CV. Transformers are also becoming increas-
ingly attractive in computer vision areas. Vision Transformer
(ViT) first replaces the tokens in languages with patches
in images and the patches are fed into the encoder for the
image classification task. Standard ViTs have a fixed number
of patches and the same hidden dimension across different
layers. Swin Transformer proposes a multi-stage hierarchical
architecture with a shifted window-based attention to encode