HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young-ri Choi


Page 1:

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi

Page 2: Contents

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

▪ Evaluation

▪ Conclusion


Page 3: Motivation

▪ DNN (Deep Neural Network) models continue to grow


• Need more powerful GPUs for training!

Page 4: Motivation

▪ Short release cycle of new GPU architectures


• Use of heterogeneous GPUs is inevitable!
• What to do with whimpy GPUs?

Page 5: DNN Training

[Figure: One training iteration. Minibatch 𝒊 (training data) is fed through forward pass 𝒊 to compute the loss (e.g., "Cat?"); backward pass 𝒊 produces the update 𝒖𝒊 for the weight parameters 𝒘, applied as 𝒘𝒊+𝟏 = 𝒘𝒊 − 𝜼 ∙ 𝒖𝒊.]
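A minimal sketch of this update rule (the `sgd_train` and `compute_update` names and the dummy minibatches are illustrative, not part of HetPipe):

```python
import numpy as np

def sgd_train(w0, minibatches, compute_update, eta=0.01):
    """compute_update(w, batch) returns u_i; applies w_{i+1} = w_i - eta * u_i."""
    w = np.array(w0, dtype=float)
    for batch in minibatches:
        u = compute_update(w, batch)   # forward pass i + backward pass i
        w = w - eta * u                # weight update from the slide
    return w

# Toy usage: "gradient" of a quadratic loss, three dummy minibatches.
print(sgd_train([1.0, -2.0], minibatches=[None] * 3, compute_update=lambda w, b: 2 * w))
```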

Page 6: Parallelizing DNN Training

▪ Model parallelism (MP)
• Low GPU utilization

▪ Data parallelism (DP)
• Weights synchronized through PS or AllReduce
• GPU memory limitation

[Figure: DP with Worker 1 through Worker 𝒏, each running forward and backward passes on its own minibatches and synchronizing weights through the Parameter Server (PS).]
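A bare-bones, single-process sketch of the DP pattern in the figure, with each worker pulling the weights and pushing its update through a toy parameter server (class and function names are illustrative, not a real PS client):

```python
import numpy as np

class ParameterServer:
    """Toy in-process PS: workers push updates and pull the current weights."""
    def __init__(self, w0):
        self.w = np.array(w0, dtype=float)
    def push(self, update):
        self.w += update
    def pull(self):
        return self.w.copy()

def dp_worker(ps, minibatches, compute_update):
    for batch in minibatches:
        w = ps.pull()                  # get current global weights
        u = compute_update(w, batch)   # forward + backward on the local replica
        ps.push(u)                     # synchronize through the PS

ps = ParameterServer(w0=[0.0, 0.0])
dp_worker(ps, minibatches=[None] * 3, compute_update=lambda w, b: -0.1 * w + 0.1)
print(ps.w)
```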

Page 7: Parallelizing DNN Training

▪ Attempts to improve MP utilization
• Pipelined model parallelism (PMP)
• PipeDream [SOSP'19], GPipe [NeurIPS'19]
• Designed for homogeneous GPUs
• Designed for a single PMP worker

[Figure: A PMP worker pipelines forward and backward passes of multiple minibatches across its GPUs.]

Page 8: HetPipe in a Nutshell

▪ Integrates PMP + DP
▪ WSP (Wave Synchronous Parallel)
▪ Supports heterogeneous GPUs
• VW: a group of multiple GPUs

[Figure: Virtual Worker (VW) 1 through VW 𝒏, each a group of four GPUs, possibly of different types (e.g., R and G, or V and Q), running PMP internally and DP with one another through the Parameter Server.]

Page 9: Challenges in Integrating PMP + DP on Heterogeneous GPUs

• What weight version should be used by each VW to synchronize with other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper

Page 10: HetPipe Contributions

▪ Integrates PMP + DP
• Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enable large DNN training on heterogeneous GPUs
• Aggregate heterogeneous resources
• Reduce the straggler problem
▪ Proof of WSP convergence

Page 11: HetPipe Workflow

[Figure: HetPipe workflow. The Resource Allocator takes the cluster configuration and assigns k GPUs to each virtual worker (VW 1 ... VW 𝒏). The Model Partitioner divides the DNN model into k partitions per VW (P1–P4 for VW 1, P1'–P4' for VW 𝒏), one partition per GPU, and each VW executes its pipeline over time while synchronizing with the PS.]

Page 12: HetPipe Workflow

[Figure: The same workflow, annotated with staleness: local staleness within each VW's pipeline, and global staleness across VWs through the PS.]
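A rough sketch of the partitioning step in this workflow, assuming a simple per-layer cost list; HetPipe's actual partitioner also accounts for GPU memory and heterogeneous GPU speeds, so this is only illustrative:

```python
def partition_layers(layer_costs, k):
    """layer_costs: per-layer cost estimates; returns k contiguous groups of layer indices."""
    n = len(layer_costs)
    target = sum(layer_costs) / k
    partitions, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        parts_left = k - len(partitions) - 1      # partitions still needed after this one
        layers_left = n - i - 1                   # layers not yet assigned
        if parts_left > 0 and (acc >= target or layers_left == parts_left):
            partitions.append(current)
            current, acc = [], 0.0
    partitions.append(current)
    return partitions

# Example: split 8 layers across the k = 4 GPUs of one virtual worker.
print(partition_layers([1, 2, 2, 1, 3, 1, 2, 2], k=4))   # -> [[0, 1, 2], [3, 4], [5, 6], [7]]
```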

Page 13: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

• Pipelined Model Parallelism Within a VW

• Data Parallelism with Multiple VWs

▪ Evaluation

▪ Conclusion


Page 14: Pipelined Model Parallelism Within a VW

▪ Execution of a virtual worker

[Figure: Pipeline schedule of one virtual worker. GPU1–GPU4 each hold one model partition; over time, forward and backward passes of minibatches 1, 2, 3, ... are interleaved across the GPUs.]
• 𝑵𝒎 minibatches are processed concurrently in a pipelined manner
• 𝒘𝒍𝒐𝒄𝒂𝒍 is a consistent version of the weights within a VW

Page 15: Pipelined Model Parallelism Within a VW

▪ Weight management procedure

[Figure: The same pipeline schedule, showing when weight versions are created and updated.]
• Initially 𝒘𝒍𝒐𝒄𝒂𝒍 = 𝒘𝟎, and minibatches 1 to 4 all start from it: 𝒘𝟏 = 𝒘𝟐 = 𝒘𝟑 = 𝒘𝟒 = 𝒘𝟎
• When the backward pass of minibatch 1 completes, its update is applied: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏
• Minibatch 5 then starts from the updated weights: 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍

Page 16: Pipelined Model Parallelism Within a VW

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): maximum number of missing updates

[Figure: The same pipeline schedule.]
• 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍 after 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏, so 𝒘𝟓 is missing the updates of minibatches 2 to 4
• Here 𝑺𝒍𝒐𝒄𝒂𝒍 = 3

Page 17: Pipelined Model Parallelism Within a VW

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): maximum number of missing updates

[Figure: The same pipeline schedule, one step later.]
• 𝒘𝒍𝒐𝒄𝒂𝒍 = 𝒘𝟎 + 𝒖𝟏; when the backward pass of minibatch 2 completes: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟐
• Minibatch 6 starts with 𝒘𝟔 ← 𝒘𝒍𝒐𝒄𝒂𝒍, which is missing the updates of minibatches 3 to 5
• Again 𝑺𝒍𝒐𝒄𝒂𝒍 = 3
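A compact sketch of the local weight-version bookkeeping described on the last few slides (class and method names are illustrative, not HetPipe's implementation):

```python
# w_local accumulates completed updates; each new minibatch is stamped with the
# current w_local, so it can miss at most S_local = Nm - 1 earlier updates.
import numpy as np

class VirtualWorkerWeights:
    def __init__(self, w0, n_m):
        self.w_local = np.array(w0, dtype=float)  # consistent weights within the VW
        self.n_m = n_m                            # minibatches in flight (pipeline depth)
        self.versions = {}                        # minibatch id -> weights it started from
        self.completed = 0                        # number of applied updates

    def start_minibatch(self, mb_id):
        # Minibatch mb_id starts from the current local weights.
        self.versions[mb_id] = self.w_local.copy()
        # It is missing the updates of earlier minibatches that have not finished yet.
        missing = mb_id - 1 - self.completed
        return self.versions[mb_id], missing

    def finish_minibatch(self, update):
        # Backward pass done: fold the update into the local weights.
        self.w_local += update
        self.completed += 1

# Example mirroring the slides (Nm = 4): minibatches 1-4 start from w0;
# after u1 is applied, minibatch 5 starts from w0 + u1 and misses u2..u4.
vw = VirtualWorkerWeights(w0=[0.0, 0.0], n_m=4)
for mb in range(1, 5):
    vw.start_minibatch(mb)
vw.finish_minibatch(np.array([0.1, 0.1]))        # u1
w5, missing = vw.start_minibatch(5)
print(w5, missing)                               # -> [0.1 0.1] 3  (S_local = 3)
```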

Page 18: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

• Pipelined Model Parallelism Within a VW

• Data Parallelism with Multiple VWs

▪ Evaluation

▪ Conclusion


Page 19: Data Parallelism with Multiple VWs

▪ Wave: sequence of 𝑵𝒎 concurrently executing minibatches

[Figure: Progress of minibatch execution in VW 1 through VW 𝒏. As the clock advances (0, 1, 2, ...), each VW processes minibatches 1–4 (wave 0), then 5–8 (wave 1), and pushes to / pulls from the Parameter Server holding 𝒘𝒈𝒍𝒐𝒃𝒂𝒍.]
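In these examples 𝑵𝒎 = 4, so wave 0 is minibatches 1 to 4 and wave 1 is minibatches 5 to 8; a trivial helper makes the mapping explicit (illustrative only):

```python
def wave_minibatches(clock, n_m=4):
    """Minibatch indices (1-based) that make up the wave executed in `clock`."""
    start = clock * n_m + 1
    return list(range(start, start + n_m))

print(wave_minibatches(0))   # [1, 2, 3, 4]  -> wave 0
print(wave_minibatches(1))   # [5, 6, 7, 8]  -> wave 1
```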

Page 20: Data Parallelism with Multiple VWs

▪ Push occurs every clock

• At the end of each clock, a VW pushes the aggregated update of that wave to the parameter server: for wave 0, 𝑢 = 𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 and 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 ← 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 + 𝑢

[Figure: VW 1 has pushed wave 0 (minibatches 1–4) and is executing 5–7; minibatch 8 is blocked.]

Page 21: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D

• If D = 0, pull occurs every clock

[Figure: VW 1 (wave 0 pushed, executing 5–7, minibatch 8 blocked) waits before its pull until VW 2 pushes wave 0.]

Page 22: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• VW 2 pushes its aggregated update 𝑢: 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 + 𝑢
• VW 1 waits before its pull until VW 2 pushes

[Figure: Timelines of VW 1 and VW 2 around the boundary between clock 0 and clock 1.]

Page 23: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• The pull occurs after all VWs have pushed: 𝑤𝑙𝑜𝑐𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙

[Figure: Both VW 1 and VW 2 have pushed wave 0; each then pulls the updated global weights.]

Page 24: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• Minibatch 8 then starts with 𝑤8 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2, i.e., wave 0 of both VW 1 and VW 2

[Figure: Timelines of VW 1 and VW 2; minibatch 8 starts after the pull.]
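A toy, wave-granularity simulation of the push/pull behavior sketched on slides 20 to 24 (it ignores the intra-wave pipeline overlap; `simulate_wsp` and its parameters are illustrative, not HetPipe's implementation):

```python
# Each VW pushes the aggregated update of a wave at the end of every clock,
# and may run at most D clocks ahead of the slowest VW before it must wait
# (and pull w_global) at the parameter server.
import heapq

def simulate_wsp(wave_times, num_clocks, D):
    """wave_times[i]: time VW i needs per wave (heterogeneous VWs)."""
    n = len(wave_times)
    clock = [0] * n                                  # waves pushed so far, per VW
    finish = [(wave_times[i], i) for i in range(n)]  # (finish time of current wave, VW)
    heapq.heapify(finish)
    waiting = []                                     # VWs blocked by the D bound
    while any(c < num_clocks for c in clock):
        now, i = heapq.heappop(finish)
        clock[i] += 1                                # push aggregated update of this wave
        print(f"t={now:4.1f}: VW{i} pushes wave {clock[i] - 1}")
        runnable, waiting = waiting + [i], []
        for j in runnable:
            if clock[j] >= num_clocks:
                continue                             # VW j has finished all its waves
            if clock[j] - min(clock) <= D:           # within D clocks of the slowest VW
                heapq.heappush(finish, (now + wave_times[j], j))   # (pulls, then runs)
            else:
                waiting.append(j)                    # must wait for slower VWs
    return clock

# Two heterogeneous VWs; with D = 0 the faster one stalls every clock,
# with D = 1 it may run one clock ahead, as in the slides' example.
simulate_wsp(wave_times=[1.0, 1.7], num_clocks=3, D=0)
```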

Page 25: Data Parallelism with Multiple VWs

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍) and global staleness (𝑺𝒈𝒍𝒐𝒃𝒂𝒍) with WSP

• With D = 0, after both VWs push wave 0: 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2
• Minibatch 11 of VW 1 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2 + (𝑢5 + 𝑢6 + 𝑢7)VW1
• It is missing its own updates (𝑢8 + 𝑢9 + 𝑢10)VW1 (bounded by 𝑺𝒍𝒐𝒄𝒂𝒍) and VW 2's updates (𝑢5 + 𝑢6 + 𝑢7)VW2 (bounded by 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)

[Figure: Timelines of VW 1 and VW 2 over clocks 0–2, with the pushed and still-missing updates marked.]

Page 26: Data Parallelism with Multiple VWs

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍) and global staleness (𝑺𝒈𝒍𝒐𝒃𝒂𝒍) with WSP
• With D = 0, minibatch 12 has to wait for the push and pull at the end of clock 1

[Figure: Timelines of VW 1 and VW 2 over clocks 0–2; VW 1 stalls before minibatch 12.]

Page 27: Data Parallelism with Multiple VWs

▪ Example of clock distance threshold D

• If D = 1, VW 1 can start minibatch 8 without the pull

[Figure: VW 1 has pushed wave 0 and continues into minibatch 8, while VW 2 is still executing clock 0.]

Page 28: Data Parallelism with Multiple VWs

▪ Example of clock distance threshold D (D = 1)
• Minibatch 11 of VW 1 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)VW1
• It is missing its own updates (𝑢8 + 𝑢9 + 𝑢10)VW1 (bounded by 𝑺𝒍𝒐𝒄𝒂𝒍) and VW 2's updates (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)VW2 (bounded by 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)
• Minibatch 12 has to wait

[Figure: Timelines of VW 1 (one clock ahead) and VW 2, with the parameter server holding 𝒘𝒈𝒍𝒐𝒃𝒂𝒍.]

Page 29: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

▪ Evaluation

• Setup

• Resource Allocation for Virtual Workers

• Results

▪ Conclusion


Page 30: Evaluation Setup

▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
• Four GPUs per node, one type per node: TITAN V (V), TITAN RTX (R), GeForce RTX 2060 (G), Quadro P4000 (Q)
• Computation power: V > R > G > Q
• Memory size: R > V > Q > G

▪ Two DNN models

                          ResNet-152               VGG-19
  Dataset, minibatch size ImageNet, 32             ImageNet, 32
  Model parameter size    230 MB                   548 MB
  Characteristic          Large activation output  Large parameter size

Page 31: Resource Allocation for Virtual Workers: NP, ED, HD

▪ NP (Node Partition)

• Each node forms one virtual worker: VW 1 = Node 1 (V0–V3), VW 2 = Node 2 (Q0–Q3), VW 3 = Node 3 (R0–R3), VW 4 = Node 4 (G0–G3)
• Minimum communication overhead within a VW
• Performance of each virtual worker varies
• Straggler may degrade performance with DP

Page 32: Resource Allocation for Virtual Workers: NP, ED, HD

▪ ED (Equal Distribution)

• Each virtual worker gets one GPU of each type, i.e., one GPU from each node (e.g., VW 1 = {V0, Q0, R0, G0})
• Performance will be the same across the VWs
• Mitigates the straggler problem
• High communication overhead within each VW

Page 33: Resource Allocation for Virtual Workers: NP, ED, HD

▪ HD (Hybrid Distribution)

• Each virtual worker gets a mix of GPU types from different nodes, chosen considering both computation power (V > R > G > Q) and memory size (R > V > Q > G)
• Mitigates the straggler problem
• Reduces communication overhead within each VW
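The three policies can be pictured as GPU-to-VW mappings; the NP and ED mappings below follow the slides, while the HD grouping is a made-up placeholder, since the exact HD assignment is defined in the paper:

```python
# GPU inventory per node, as in the evaluation cluster (4 GPUs of one type per node).
nodes = {
    "node1": ["V0", "V1", "V2", "V3"],   # TITAN V
    "node2": ["Q0", "Q1", "Q2", "Q3"],   # Quadro P4000
    "node3": ["R0", "R1", "R2", "R3"],   # TITAN RTX
    "node4": ["G0", "G1", "G2", "G3"],   # GeForce RTX 2060
}

# NP (Node Partition): each node is one virtual worker.
np_alloc = {f"VW{i+1}": gpus for i, gpus in enumerate(nodes.values())}

# ED (Equal Distribution): each VW takes one GPU from every node.
ed_alloc = {f"VW{i+1}": [gpus[i] for gpus in nodes.values()] for i in range(4)}

# HD (Hybrid Distribution): a mix of GPU types per VW; this grouping is only a
# placeholder, the paper picks it based on compute power and memory size.
hd_alloc = {
    "VW1": ["V0", "V1", "G0", "G1"],
    "VW2": ["V2", "V3", "G2", "G3"],
    "VW3": ["R0", "R1", "Q0", "Q1"],
    "VW4": ["R2", "R3", "Q2", "Q3"],
}

print(np_alloc["VW1"], ed_alloc["VW1"], hd_alloc["VW1"])
```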

Page 34: Parameter Placement

▪ Round-robin policy (default)

• Can be used in all three policies: NP, ED, and HD

[Figure: Example with ED. The parameters of each layer are distributed across the parameter server partitions on Nodes 1–4 in round-robin fashion, so every VW (VW 1 ... VW 4) exchanges parameters with all nodes.]

Page 35: Parameter Placement

▪ Local placement policy (ED-local)
• With round-robin ED, parameter communication occurs across nodes
• ED-local keeps the parameters of each layer on the node whose GPU computes that layer, which significantly reduces communication overhead

[Figure: Example with ED-local. Each node's parameter server partition holds the layers computed by its local GPUs, so most parameter traffic stays within the node.]
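A minimal sketch contrasting the two placement policies (node names, layer counts, and the layer-to-node assignment are illustrative):

```python
# Round-robin placement: layer l's parameters live on node (l mod #nodes),
# so every VW talks to every node. Local placement: layer l's parameters live
# on the node that computes layer l (under ED, the same node for every VW).
NODES = ["node1", "node2", "node3", "node4"]

def round_robin_placement(num_layers):
    return {layer: NODES[layer % len(NODES)] for layer in range(num_layers)}

def local_placement(num_layers, layer_to_compute_node):
    # layer_to_compute_node[layer] = node whose GPU runs this layer in each VW
    return {layer: layer_to_compute_node[layer] for layer in range(num_layers)}

# Example: 8 layers, partitioned so layers 0-1 run on node1, 2-3 on node2, ...
compute_node = {l: NODES[l // 2] for l in range(8)}
print(round_robin_placement(8))
print(local_placement(8, compute_node))
```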

Page 36: Throughput Comparison with Horovod

▪ Baseline: Horovod

• State-of-the-art DP using AllReduce

[Figure: Training throughput relative to Horovod: up to 1.4X for ResNet-152 and up to 1.8X for VGG-19.]
• ED: reduces the straggler problem
• ED-local: significantly reduces communication overhead
• For ResNet-152, the whole model is too large to be loaded into a single G type GPU (batch size = 32)

Page 37: Performance Improvement of Adding Whimpy GPUs

[Figure: Throughput as whimpy GPUs are added, starting from the four V GPUs and incrementally adding the other GPU types (Q, R, G).]
• With additional GPUs, HetPipe achieves up to 2.3X speedup
• Additional whimpy systems allow for faster training

Page 38: Convergence Results

▪ ResNet-152

• Target accuracy: 74%
• HetPipe reduces the straggler problem in a heterogeneous environment (up to 39% faster)
• Adding four more whimpy G GPUs (12 GPUs → 16 GPUs) improves performance even more (7% faster)

[Figure: Accuracy over training time for Horovod and HetPipe with 12 and 16 GPUs.]

Page 39: Convergence Results

▪ VGG-19

• Target accuracy: 67%
• HetPipe (D = 0) is 29% faster than Horovod, and up to 49% faster with a larger D
• Higher global staleness (i.e., D = 32) can degrade convergence performance (4.7% slower)

[Figure: Accuracy over training time for Horovod and HetPipe with D = 0, D = 4, and D = 32.]

Page 40: Not Presented But Discussed in Paper

▪ Provide convergence proof of WSP

▪ Partitioning algorithm

▪ Performance of a single virtual worker

▪ Comparison to PipeDream


Page 41: Conclusion

▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs

▪ Integrate pipelined model parallelism with data parallelism

▪ Propose a novel parameter synchronization model: WSP

▪ DNN models converge up to 49% faster with HetPipe


Page 42:


Thank you!