Improving Automatic Parallel Training via Balanced
Memory Workload Optimization
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui
Abstract—Transformer models have emerged as the leading
approach for achieving state-of-the-art performance across various
application domains, serving as the foundation for advanced large-
scale deep learning (DL) models. However, efficiently training these
models across multiple GPUs remains a complex challenge due to
the abundance of parallelism options. Existing DL systems either
require manual efforts to design distributed training plans or limit
parallelism combinations to a constrained search space. In this
paper, we present Galvatron-BMW, a novel system framework
that integrates multiple prevalent parallelism dimensions and
automatically identifies the most efficient hybrid parallelism
strategy. To effectively navigate this vast search space, we employ
a decision tree approach for decomposition and pruning based
on intuitive insights. We further utilize a dynamic programming
search algorithm to derive the optimal plan. Moreover, to improve
resource utilization and enhance system efficiency, we propose a bi-
objective optimization workflow that focuses on workload balance.
Our evaluations on different Transformer models demonstrate
the capabilities of Galvatron-BMW in automating distributed
training under varying GPU memory constraints. Across all tested
scenarios, Galvatron-BMW consistently achieves superior system
throughput, surpassing previous approaches that rely on limited
parallelism strategies.
Index Terms—Transformers, Distributed Learning, Automatic
Parallelism
I. INTRODUCTION
Transformer models have achieved great success
in a wide range of deep learning (DL) applications in
recent years, such as computer vision (CV) [1], [2], natural
language processing (NLP) [3]–[7], graph learning [8], [9] and
recommendation systems [10]. For example, many Transformer
variants (e.g., BERT [11], GPT-2 [12], T5 [13]) are leading
the state-of-the-art performance in various NLP tasks such
as machine translation and question answering. Transformers
are also applicable to image recognition (e.g., ViT [1], Swin
Transformer [14]) and multimodal tasks (e.g., CLIP [15], DALL-
E [16]). Due to their superior performance, Transformers
are becoming increasingly important in modern artificial
intelligence industries.
Yujie Wang, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie and Youhe Jiang
are with the Key Lab of High Confidence Software Technologies (MOE),
School of CS, Peking University, Beijing 100871, China. E-mail: {alfredwang,
ccchengff, shenhan.zhu, xiaonan.nie}@pku.edu.cn, youhejiang@gmail.com
Xupeng Miao is with the Computer Science Department of Carnegie Mellon
University. E-mail: xupeng@cmu.edu
Yaofeng Tu is with ZTE company. E-mail: tu.yaofeng@zte.com.cn
Bin Cui is with the Key Lab of High Confidence Software Technologies
(MOE), School of CS, Peking University, Beijing 100871, and Institute of
Computational Social Science, Peking University (Qingdao), China. E-mail:
bin.cui@pku.edu.cn.
Empirical evidence indicates that scaling model parameters
is an effective path towards model performance improvements [17].
For instance, the original Transformer only has
millions of model parameters while GPT-2 has 1.5 billion with
superior performance [12]. Such large amounts of parameters
also incur high computational and memory costs even for
emerging accelerators like GPUs. With the increasing model
scales, building and designing Transformers demand more sys-
tem optimizations, and how to perform efficient Transformers
training is becoming more challenging.
Distributed DL systems adopt data and model parallelism
to improve the training efficiency by utilizing multiple GPU
devices. Data parallelism divides the large volume of input
data into multiple parts and each device is only responsible
for partial data [18]–[20]. It requires each device to store a
whole model replica, suffering from large model scales. Model
parallelism is a more promising direction that partitions the
model from different parallelism dimensions and makes each
device store a subset of model parameters, such as tensor
parallelism [21] and pipeline parallelism [22]–[25]. Various choices of
the parallelism strategies lead to distinct memory consumption,
communication overheads, and execution efficiency.
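For reference, plain data parallelism can be set up in PyTorch with DistributedDataParallel roughly as follows. This is a minimal self-contained sketch, independent of Galvatron-BMW; the linear layer and random batches stand in for a real Transformer and data loader.

```python
# Minimal sketch of plain data parallelism with PyTorch DDP (launched via `torchrun`).
# Every rank holds a full model replica; only the input batches are sharded.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")                # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for a Transformer
    model = DDP(model, device_ids=[local_rank])            # full replica on every device
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                                    # stand-in for a sharded data loader
        x = torch.randn(8, 1024, device=local_rank)
        loss = model(x).pow(2).mean()                      # placeholder loss
        loss.backward()                                    # gradients all-reduced across replicas
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note that every rank keeps a complete replica of the model parameters, which is exactly the memory limitation of data parallelism discussed above.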
However, directly applying these techniques to scale
Transformers faces crucial challenges in both system effi-
ciency and usability. Some recent advanced methods have been
proposed to automatically find the parallelism strategies through
the fine-grained combination of data and model parallelism for
individual operators in the model. For example, OptCNN [26],
FlexFlow [27], [28], Tofu [29], and TensorOpt [30] consider
both data and tensor parallelism and use different search
algorithms to optimize the execution plans. PipeDream [24]
and DAPPLE [31] combine pipeline parallelism with data
parallelism to improve the efficiency. Unfortunately, existing ap-
proaches only support limited parallelism dimensions (i.e., data
parallelism plus a few model parallelism dimensions) or rely on
strong assumptions about model and hardware configurations
(i.e., expert-designed parallelism strategies), resulting in
sub-optimal performance in practice. To the best of our knowledge,
few prior works consider automatic parallelism for large-scale
Transformers with a complex search space including multiple
parallelism dimensions.
In this paper, we propose Galvatron-BMW, a novel
automatic parallel training system for Transformer models over
multiple GPUs. Our target is to integrate data parallelism with a
variety of model parallelism dimensions, provide a significantly larger
search space (compared with previous approaches), and find
the optimal hybrid parallelism strategies in an efficient manner.
However, such an integration brings an explosive growth of
the search space, which cannot be explored directly with conventional methods.
Therefore, we are interested in the following question: How
can we exploit as many parallelism dimensions as possible
and efficiently explore the search space in the meanwhile?
We study five parallelism paradigms, four of which are
popular parallelism paradigms in the distributed training of
Transformer models, including data parallelism (DP), sharded
data parallelism (SDP) [32], tensor parallelism (TP), and
pipeline parallelism (PP). Besides, we also take into account
activation checkpointing (CKPT) as a special parallelism
dimension, which shifts part of the training memory workload
to recomputation in the backward pass via checkpoints. These
parallelism paradigms have distinct memory consumption and
communication overheads and no single paradigm could beat
the others on both sides. The search space of automatic
parallelism should therefore include arbitrary combinations of them.
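To make the CKPT trade-off concrete, activation checkpointing can be applied layer by layer in PyTorch roughly as follows. This is a minimal sketch, not Galvatron-BMW's actual implementation; the `layers` list and `ckpt_mask` are placeholders.

```python
# Minimal sketch: trade activation memory for recomputation with checkpointing.
from torch.utils.checkpoint import checkpoint

def forward_with_ckpt(layers, x, ckpt_mask):
    # ckpt_mask[i] == True means layer i discards its intermediate activations in
    # the forward pass and recomputes them during the backward pass.
    for layer, use_ckpt in zip(layers, ckpt_mask):
        if use_ckpt:
            x = checkpoint(layer, x)
        else:
            x = layer(x)
    return x
```

Checkpointed layers consume less activation memory at the price of an extra forward computation in the backward pass, which is why CKPT behaves like one more dimension in the memory/computation trade-off.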
Inspired by some key intuitions from our observations and
analysis, we first propose a decision-tree structure to decompose
the search space and perform pruning to remove the inefficient
combinations. To determine the final distributed execution plan,
we then propose a dynamic programming search algorithm
to utilize the optimal substructure property of this problem.
Based on these, we provide Galvatron-BMW, which not only
targets automatic parallelism for Transformer model training,
but also considers the Balancing trade-off between Memory
and computation Workloads across devices. During the search
process, Galvatron-BMW provides the required computation
and communication costs and memory consumption through a
cost estimator. It is worth mentioning that the cost estimation
in Galvatron-BMW considers the GPU performance slowdown
from computation and communication overlapping, which has
been ignored for a long time in previous approaches. We
provide an implementation of Galvatron-BMW over PyTorch.
Unlike existing toolbox-like systems (e.g., DeepSpeed [33],
Megatron [21]) that rely on users' expertise and significant
tuning efforts, Galvatron-BMW's automatic parallelism only
requires a few lines of modification to the original training
script. Our evaluation selects four representative Transformers,
including both NLP (i.e., BERT and T5) and CV (i.e., ViT, Swin
Transformer). The experiments show that Galvatron-BMW
could significantly outperform the four pure parallelisms and
existing automatic parallelisms with limited dimensions (i.e.,
DP+TP and DP+PP) under various device memory budgets.
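For intuition, the flavor of the dynamic programming search can be sketched as below. This is an illustrative simplification under the assumption that each layer picks one strategy from a pruned candidate set, with per-layer time and memory costs and pairwise transition costs supplied by a cost estimator; the function and cost-table names are hypothetical, not Galvatron-BMW's actual interface.

```python
# Illustrative layer-wise DP: pick one strategy per layer to minimize total time
# cost while the accumulated memory cost stays within the device memory budget.
import math

def dp_search(num_layers, strategies, time_cost, mem_cost, trans_cost, mem_budget):
    # State: (memory_used, strategy_of_last_layer) -> minimal accumulated time.
    # Assumes integer memory units so that the number of states stays bounded.
    states = {(0, None): 0.0}
    for l in range(num_layers):
        next_states = {}
        for (mem, prev_s), t in states.items():
            for s in strategies:
                new_mem = mem + mem_cost[l][s]
                if new_mem > mem_budget:
                    continue                        # prune plans that exceed the budget
                new_t = t + time_cost[l][s]
                if prev_s is not None:
                    new_t += trans_cost[prev_s][s]  # cost of switching strategies between layers
                key = (new_mem, s)
                if new_t < next_states.get(key, math.inf):
                    next_states[key] = new_t
        states = next_states
    # Extending the state with back-pointers would recover the per-layer plan.
    return min(states.values()) if states else math.inf
```

The optimal substructure here is that the best plan for the first l layers under a given memory usage depends only on the strategy chosen for layer l, not on how earlier layers were arranged.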
We summarize our contributions as follows:
1) We enlarge the scope of automatic parallelism
for Transformer training to five parallelism dimensions,
and introduce a novel decision-tree abstraction to
decompose the large search space.
2) We design a novel parallelism optimization method
to automatically find the most efficient hybrid paral-
lelism strategy based on the estimated costs.
3) We consider both memory consumption and com-
putation workload through a bi-objective optimization
framework to maximize the hardware utilization
during training.
4) We build the Galvatron-BMW system, which supports
larger models' training and achieves up to 530% and
242% throughput speedups compared to state-of-the-art
pure and hybrid parallelism methods, respectively.
Fig. 1: System overview of Galvatron-BMW. (Figure: the model and hardware serve as inputs to three modules: the search space construction (Section III), the parallelism optimization framework (Section IV), in which the dynamic programming search (Section IV-A) and the workload balance adjustment (Section IV-B) iterate for T steps to produce the pipeline partition and parallel strategies, and the cost estimator (Section V).)
Figure 1 shows the system overview of Galvatron-BMW,
which takes the model and hardware environment as inputs, and
comprises three primary modules: search space construction
(Section III), parallelism optimization framework (Section IV),
and cost estimator (Section V). In Section III, we introduce the
search space construction of Galvatron-BMW with decision-
tree-based decomposition. In Section IV, we propose our
parallelism optimization framework, which leverages dynamic
programming search and workload balance adjustment tech-
niques to iteratively refine the optimal parallelism strategy. In
Section V, we provide a cost estimator to estimate the execution
cost and memory consumption efficiently and accurately.
We also provide implementation details in Section VI and
comprehensive experimental results in Section VII.
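As a rough picture of how the modules interact, the iterative workflow can be sketched in pseudocode-style Python; all helper names below are hypothetical placeholders for the components described in Sections IV and V, not Galvatron-BMW's actual API.

```python
# Pseudocode-style sketch of the outer optimization loop (hypothetical helpers):
# alternate between searching parallel strategies for a fixed pipeline partition
# and re-balancing the pipeline partition given the resulting workloads.
def optimize(model, cluster, mem_budget, T):
    partition = initial_pipeline_partition(model, cluster)   # e.g., a uniform layer split
    strategies = None
    for _ in range(T):
        strategies, cost = dp_search_per_stage(model, partition, cluster, mem_budget)
        new_partition = rebalance_workload(model, partition, strategies)
        if new_partition == partition:                        # fixed point: workloads balanced
            break
        partition = new_partition
    return partition, strategies
```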
II. PRELIMINARY
A. Transformer Models
Transformers were first proposed to solve sequence modeling
and transduction problems such as language modeling and
machine translation [3]. The self-attention and point-wise
feed-forward modules are the basic components of each
Transformer layer. Most operations are dense algebraic kernels
such as matrix multiplications, resulting in huge computation
costs and memory consumption.
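For readers unfamiliar with the structure, a generic Transformer layer can be written in PyTorch roughly as follows (a minimal sketch, not the exact layer of any model evaluated in this paper); the attention projections and the two linear layers of the FFN account for the dense matrix multiplications mentioned above.

```python
# Minimal sketch of one Transformer layer (self-attention + point-wise FFN).
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, hidden: int, heads: int, ffn_hidden: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                      # point-wise feed-forward module
            nn.Linear(hidden, ffn_hidden), nn.GELU(), nn.Linear(ffn_hidden, hidden)
        )
        self.norm1, self.norm2 = nn.LayerNorm(hidden), nn.LayerNorm(hidden)

    def forward(self, x):
        h, _ = self.attn(x, x, x)                      # self-attention (dense matmuls)
        x = self.norm1(x + h)
        x = self.norm2(x + self.ffn(x))                # FFN (two large matmuls)
        return x
```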
Transformers in NLP. Different ways of using Trans-
former layers in NLP lead to three main Transformer archi-
tectures, including encoder-only (for text classification, e.g.,
BERT and RoBERTa [34]), decoder-only (for text generation,
e.g., GPT-2 and Transformer-XL [35]), and encoder-decoder
(for sequence-to-sequence tasks, e.g., T5 and BART [36]).
They have similar basic model components with slight
structural differences. For example, the decoder has an
additional attention layer (cross-attention over the encoder
outputs) compared to the encoder. Moreover, the encoder-decoder
architecture combines encoders and decoders symmetrically
(i.e., with the same number of layers). These differences bring distinct system workload
characteristics in both computation and memory.
Transformers in CV. Transformers are also becoming increas-
ingly attractive in computer vision areas. Vision Transformer
(ViT) first replaces the tokens in languages with patches
in images and the patches are fed into the encoder for the
image classification task. Standard ViTs have a fixed number
of patches and the same hidden dimension across different
layers. Swin Transformer proposes a multi-stage hierarchical
architecture with a shifted window-based attention to encode