Improving Automatic Parallel Training via Balanced
Memory Workload Optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui
Abstract—Transformer models have emerged as the leading
approach for achieving state-of-the-art performance across various
application domains, serving as the foundation for advanced large-
scale deep learning (DL) models. However, efficiently training these
models across multiple GPUs remains a complex challenge due to
the abundance of parallelism options. Existing DL systems either
require manual efforts to design distributed training plans or limit
parallelism combinations to a constrained search space. In this
paper, we present Galvatron-BMW, a novel system framework
that integrates multiple prevalent parallelism dimensions and
automatically identifies the most efficient hybrid parallelism
strategy. To effectively navigate this vast search space, we employ
a decision tree approach for decomposition and pruning based
on intuitive insights. We further utilize a dynamic programming
search algorithm to derive the optimal plan. Moreover, to improve
resource utilization and enhance system efficiency, we propose a bi-
objective optimization workflow that focuses on workload balance.
Our evaluations on different Transformer models demonstrate
the capabilities of Galvatron-BMW in automating distributed
training under varying GPU memory constraints. Across all tested
scenarios, Galvatron-BMW consistently achieves superior system
throughput, surpassing previous approaches that rely on limited
parallelism strategies.
Index Terms—Transformers, Distributed Learning, Automatic
Parallelism
I. INTRODUCTION
Transformer models have achieved great success
in a wide range of deep learning (DL) applications in
recent years, such as computer vision (CV) [1], [2], natural
language processing (NLP) [3]–[7], graph learning [8], [9] and
recommendation systems [10]. For example, many Transformer
variants (e.g., BERT [11], GPT-2 [12], T5 [13]) achieve state-of-the-art performance in various NLP tasks such as machine translation and question answering. Transformers are also applicable to image recognition (e.g., ViT [1], Swin Transformer [14]) and multimodal tasks (e.g., CLIP [15], DALL-
E [16]). Due to their superior performance, Transformers
are becoming increasingly important in modern artificial
intelligence industries.
Yujie Wang, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie and Youhe Jiang are with the Key Lab of High Confidence Software Technologies (MOE), School of CS, Peking University, Beijing 100871, China. E-mail: {alfredwang, ccchengff, shenhan.zhu, xiaonan.nie}@pku.edu.cn, youhejiang@gmail.com
Xupeng Miao is with the Computer Science Department of Carnegie Mellon University. E-mail: xupeng@cmu.edu
Yaofeng Tu is with ZTE Corporation. E-mail: tu.yaofeng@zte.com.cn
Bin Cui is with the Key Lab of High Confidence Software Technologies (MOE), School of CS, Peking University, Beijing 100871, and Institute of Computational Social Science, Peking University (Qingdao), China. E-mail: bin.cui@pku.edu.cn.

Empirical evidence indicates that scaling model parameters is an effective path towards model performance improvements [17]. For instance, the original Transformer only has
millions of model parameters while GPT-2 has 1.5 billion with
superior performance [12]. Such a large number of parameters also incurs high computational and memory costs even on emerging accelerators like GPUs. With increasing model scales, building and designing Transformers demand more system optimizations, and performing efficient Transformer training is becoming more challenging.
Distributed DL systems adopt data and model parallelism
to improve the training efficiency by utilizing multiple GPU
devices. Data parallelism divides the large volume of input data into multiple parts, and each device is only responsible for a portion of the data [18]–[20]. However, it requires each device to store a whole model replica, which becomes prohibitive as model scales grow. Model parallelism is a more promising direction that partitions the model along different parallelism dimensions so that each device stores only a subset of the model parameters, such as tensor parallelism [21] and pipeline parallelism [22]–[25]. Various choices of
the parallelism strategies lead to distinct memory consumption,
communication overheads, and execution efficiency.
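To make these trade-offs concrete, the following minimal single-process sketch (our own illustration under assumed tensor sizes, not part of Galvatron-BMW's implementation) simulates with plain CPU tensors how data parallelism and tensor parallelism would partition the same linear layer Y = XW across two devices:

# Illustrative sketch: contrast data parallelism and tensor parallelism on a
# single linear layer Y = X W, simulating two "devices" with plain CPU tensors.
import torch

batch, d_in, d_out, n_dev = 8, 16, 32, 2
X = torch.randn(batch, d_in)   # input activations
W = torch.randn(d_in, d_out)   # layer weights

# Data parallelism: every device holds a full replica of W and processes 1/n_dev
# of the batch; in real training, gradients are averaged via all-reduce.
Y_dp = torch.cat([x_shard @ W for x_shard in X.chunk(n_dev, dim=0)], dim=0)

# Tensor parallelism (column split): each device holds only d_out/n_dev columns
# of W but sees the whole batch; partial outputs are gathered along the feature dim.
Y_tp = torch.cat([X @ w_shard for w_shard in W.chunk(n_dev, dim=1)], dim=1)

# Both schemes compute the same result, but with different parameter memory
# footprints (replicated vs. sharded) and different communication patterns.
assert torch.allclose(Y_dp, X @ W, atol=1e-5)
assert torch.allclose(Y_tp, X @ W, atol=1e-5)

Pipeline parallelism, in contrast, places different layers on different devices and streams micro-batches between them, trading per-device memory for inter-stage communication and pipeline bubbles.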
However, directly applying these techniques to scale Transformers faces crucial challenges in both system efficiency and usability. Some recent advanced methods have been
proposed to automatically find the parallelism strategies through
the fine-grained combination of data and model parallelism for
individual operators in the model. For example, OptCNN [26],
FlexFlow [27], [28], Tofu [29], and TensorOpt [30] consider
both data and tensor parallelism and use different search
algorithms to optimize the execution plans. PipeDream [24]
and DAPPLE [31] combine pipeline parallelism with data
parallelism to improve the efficiency. Unfortunately, existing approaches either support only a limited set of parallelism dimensions (i.e., data parallelism plus a few model parallelism dimensions) or rely on strong assumptions about model and hardware configurations (i.e., expert-designed parallelism strategies), and thus result in sub-optimal performance in practice. To the best of our knowledge, few prior works consider automatic parallelism for large-scale Transformers over a complex search space that includes multiple parallelism dimensions.
In this paper, we propose Galvatron-BMW, a novel automatic parallel training system for Transformer models over multiple GPUs. Our target is to integrate data parallelism with a variety of model parallelism dimensions, provide a considerably larger search space than previous approaches, and find the optimal hybrid parallelism strategies in an efficient manner.
However, such an integration brings an explosive growth of the search space, which cannot be directly explored with conventional approaches. Therefore, we are interested in the following question: how can we automatically and efficiently explore such a vast search space to identify the optimal hybrid parallelism strategy?
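As a back-of-the-envelope illustration of this explosion (the GPU count, layer count, and per-layer strategy definition below are our own assumptions, not the paper's exact formulation), consider letting every layer independently choose a (data, tensor, pipeline) parallelism degree whose product equals the number of GPUs, plus a binary activation recomputation choice:

# Rough estimate of the naive hybrid-parallelism search space (illustrative only).
from itertools import product

def layer_strategies(num_gpus):
    # Enumerate (dp, tp, pp, recompute) options for one layer.
    options = []
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp == num_gpus:
            for recompute in (False, True):
                options.append((dp, tp, pp, recompute))
    return options

num_gpus, num_layers = 8, 24
per_layer = len(layer_strategies(num_gpus))   # 20 candidate strategies per layer
total_plans = per_layer ** num_layers         # 20^24, roughly 1.7e31 global plans
print(per_layer, format(total_plans, ".2e"))

Even in this simplified setting, exhaustive enumeration is clearly infeasible, which is why Galvatron-BMW resorts to decision-tree-based decomposition and pruning followed by a dynamic programming search, as described in the following sections.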