HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism
Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, and Young-ri Choi


Page 1:

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi

Page 2: Contents

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

▪ Evaluation

▪ Conclusion


Page 3: Motivation

▪ DNN (Deep Neural Network) models continue to grow


• Need more powerful GPUs for training!

Page 4: Motivation

▪ Short release cycle of new GPU architectures


• Use of heterogeneous GPUs is inevitable!
• What to do with whimpy GPUs?

Page 5: DNN Training

[Figure: One training iteration. Minibatch 𝒊 (training data) is fed through forward pass 𝒊 to compute the loss (e.g., "Cat?"); backward pass 𝒊 produces the update 𝒖𝒊 for the weight parameters 𝒘, applied as 𝒘𝒊+𝟏 = 𝒘𝒊 − 𝜼 ∙ 𝒖𝒊.]
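A minimal sketch of this update rule (the `sgd_train` and `compute_update` names and the dummy minibatches are illustrative, not part of HetPipe):

```python
import numpy as np

def sgd_train(w0, minibatches, compute_update, eta=0.01):
    """compute_update(w, batch) returns u_i; applies w_{i+1} = w_i - eta * u_i."""
    w = np.array(w0, dtype=float)
    for batch in minibatches:
        u = compute_update(w, batch)   # forward pass i + backward pass i
        w = w - eta * u                # weight update from the slide
    return w

# Toy usage: "gradient" of a quadratic loss, three dummy minibatches.
print(sgd_train([1.0, -2.0], minibatches=[None] * 3, compute_update=lambda w, b: 2 * w))
```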

Page 6: Parallelizing DNN Training

▪ Model parallelism (MP)
• Low GPU utilization

▪ Data parallelism (DP)
• Weights synchronized through PS or AllReduce
• GPU memory limitation

[Figure: DP with Worker 1 through Worker 𝒏, each running forward and backward passes on its own minibatches and synchronizing weights through the Parameter Server (PS).]
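A bare-bones, single-process sketch of the DP pattern in the figure, with each worker pulling the weights and pushing its update through a toy parameter server (class and function names are illustrative, not a real PS client):

```python
import numpy as np

class ParameterServer:
    """Toy in-process PS: workers push updates and pull the current weights."""
    def __init__(self, w0):
        self.w = np.array(w0, dtype=float)
    def push(self, update):
        self.w += update
    def pull(self):
        return self.w.copy()

def dp_worker(ps, minibatches, compute_update):
    for batch in minibatches:
        w = ps.pull()                  # get current global weights
        u = compute_update(w, batch)   # forward + backward on the local replica
        ps.push(u)                     # synchronize through the PS

ps = ParameterServer(w0=[0.0, 0.0])
dp_worker(ps, minibatches=[None] * 3, compute_update=lambda w, b: -0.1 * w + 0.1)
print(ps.w)
```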

Page 7: Parallelizing DNN Training

▪ Attempts to improve MP utilization
• Pipelined model parallelism (PMP)
• PipeDream [SOSP'19], GPipe [NeurIPS'19]
• Designed for homogeneous GPUs
• Designed for a single PMP worker

[Figure: A PMP worker pipelines forward and backward passes of multiple minibatches across its GPUs.]

Page 8: HetPipe in a Nutshell

▪ Integrates PMP + DP
▪ WSP (Wave Synchronous Parallel)
▪ Supports heterogeneous GPUs
• VW: a group of multiple GPUs

[Figure: Virtual Worker (VW) 1 through VW 𝒏, each a group of four GPUs, possibly of different types (e.g., R and G, or V and Q), running PMP internally and DP with one another through the Parameter Server.]

Page 9: Challenges in Integrating PMP + DP on Heterogeneous GPUs

• What weight version should be used by each VW to synchronize with other VWs?
• How do we reduce virtual worker stragglers when we consider DP?
• Many more in the paper

Page 10: HetPipe Contributions

▪ Integrates PMP + DP
• Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enable large DNN training on heterogeneous GPUs
• Aggregate heterogeneous resources
• Reduce the straggler problem
▪ Proof of WSP convergence

Page 11: HetPipe Workflow

[Figure: HetPipe workflow. The Resource Allocator takes the cluster configuration and assigns k GPUs to each virtual worker (VW 1 ... VW 𝒏). The Model Partitioner divides the DNN model into k partitions per VW (P1–P4 for VW 1, P1'–P4' for VW 𝒏), one partition per GPU, and each VW executes its pipeline over time while synchronizing with the PS.]

Page 12: HetPipe Workflow

[Figure: The same workflow, annotated with staleness: local staleness within each VW's pipeline, and global staleness across VWs through the PS.]
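A rough sketch of the partitioning step in this workflow, assuming a simple per-layer cost list; HetPipe's actual partitioner also accounts for GPU memory and heterogeneous GPU speeds, so this is only illustrative:

```python
def partition_layers(layer_costs, k):
    """layer_costs: per-layer cost estimates; returns k contiguous groups of layer indices."""
    n = len(layer_costs)
    target = sum(layer_costs) / k
    partitions, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        parts_left = k - len(partitions) - 1      # partitions still needed after this one
        layers_left = n - i - 1                   # layers not yet assigned
        if parts_left > 0 and (acc >= target or layers_left == parts_left):
            partitions.append(current)
            current, acc = [], 0.0
    partitions.append(current)
    return partitions

# Example: split 8 layers across the k = 4 GPUs of one virtual worker.
print(partition_layers([1, 2, 2, 1, 3, 1, 2, 2], k=4))   # -> [[0, 1, 2], [3, 4], [5, 6], [7]]
```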

Page 13: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

• Pipelined Model Parallelism Within a VW

• Data Parallelism with Multiple VWs

▪ Evaluation

▪ Conclusion


Page 14: Pipelined Model Parallelism Within a VW

▪ Execution of a virtual worker

[Figure: Pipeline schedule of one virtual worker. GPU1–GPU4 each hold one model partition; over time, forward and backward passes of minibatches 1, 2, 3, ... are interleaved across the GPUs.]
• 𝑵𝒎 minibatches are processed concurrently in a pipelined manner
• 𝒘𝒍𝒐𝒄𝒂𝒍 is a consistent version of the weights within a VW

Page 15: Pipelined Model Parallelism Within a VW

▪ Weight management procedure

[Figure: The same pipeline schedule, showing when weight versions are created and updated.]
• Initially 𝒘𝒍𝒐𝒄𝒂𝒍 = 𝒘𝟎, and minibatches 1 to 4 all start from it: 𝒘𝟏 = 𝒘𝟐 = 𝒘𝟑 = 𝒘𝟒 = 𝒘𝟎
• When the backward pass of minibatch 1 completes, its update is applied: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏
• Minibatch 5 then starts from the updated weights: 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍

Page 16: Pipelined Model Parallelism Within a VW

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): maximum number of missing updates

[Figure: The same pipeline schedule.]
• 𝒘𝟓 ← 𝒘𝒍𝒐𝒄𝒂𝒍 after 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟏, so 𝒘𝟓 is missing the updates of minibatches 2 to 4
• Here 𝑺𝒍𝒐𝒄𝒂𝒍 = 3

Page 17: Pipelined Model Parallelism Within a VW

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍): maximum number of missing updates

[Figure: The same pipeline schedule, one step later.]
• 𝒘𝒍𝒐𝒄𝒂𝒍 = 𝒘𝟎 + 𝒖𝟏; when the backward pass of minibatch 2 completes: 𝒘𝒍𝒐𝒄𝒂𝒍 ← 𝒘𝒍𝒐𝒄𝒂𝒍 + 𝒖𝟐
• Minibatch 6 starts with 𝒘𝟔 ← 𝒘𝒍𝒐𝒄𝒂𝒍, which is missing the updates of minibatches 3 to 5
• Again 𝑺𝒍𝒐𝒄𝒂𝒍 = 3
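A compact sketch of the local weight-version bookkeeping described on the last few slides (class and method names are illustrative, not HetPipe's implementation):

```python
# w_local accumulates completed updates; each new minibatch is stamped with the
# current w_local, so it can miss at most S_local = Nm - 1 earlier updates.
import numpy as np

class VirtualWorkerWeights:
    def __init__(self, w0, n_m):
        self.w_local = np.array(w0, dtype=float)  # consistent weights within the VW
        self.n_m = n_m                            # minibatches in flight (pipeline depth)
        self.versions = {}                        # minibatch id -> weights it started from
        self.completed = 0                        # number of applied updates

    def start_minibatch(self, mb_id):
        # Minibatch mb_id starts from the current local weights.
        self.versions[mb_id] = self.w_local.copy()
        # It is missing the updates of earlier minibatches that have not finished yet.
        missing = mb_id - 1 - self.completed
        return self.versions[mb_id], missing

    def finish_minibatch(self, update):
        # Backward pass done: fold the update into the local weights.
        self.w_local += update
        self.completed += 1

# Example mirroring the slides (Nm = 4): minibatches 1-4 start from w0;
# after u1 is applied, minibatch 5 starts from w0 + u1 and misses u2..u4.
vw = VirtualWorkerWeights(w0=[0.0, 0.0], n_m=4)
for mb in range(1, 5):
    vw.start_minibatch(mb)
vw.finish_minibatch(np.array([0.1, 0.1]))        # u1
w5, missing = vw.start_minibatch(5)
print(w5, missing)                               # -> [0.1 0.1] 3  (S_local = 3)
```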

Page 18: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

• Pipelined Model Parallelism Within a VW

• Data Parallelism with Multiple VWs

▪ Evaluation

▪ Conclusion


Page 19: Data Parallelism with Multiple VWs

▪ Wave: sequence of 𝑵𝒎 concurrently executing minibatches

[Figure: Progress of minibatch execution in VW 1 through VW 𝒏. As the clock advances (0, 1, 2, ...), each VW processes minibatches 1–4 (wave 0), then 5–8 (wave 1), and pushes to / pulls from the Parameter Server holding 𝒘𝒈𝒍𝒐𝒃𝒂𝒍.]
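In these examples 𝑵𝒎 = 4, so wave 0 is minibatches 1 to 4 and wave 1 is minibatches 5 to 8; a trivial helper makes the mapping explicit (illustrative only):

```python
def wave_minibatches(clock, n_m=4):
    """Minibatch indices (1-based) that make up the wave executed in `clock`."""
    start = clock * n_m + 1
    return list(range(start, start + n_m))

print(wave_minibatches(0))   # [1, 2, 3, 4]  -> wave 0
print(wave_minibatches(1))   # [5, 6, 7, 8]  -> wave 1
```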

Page 20: Data Parallelism with Multiple VWs

▪ Push occurs every clock

• At the end of each clock, a VW pushes the aggregated update of that wave to the parameter server: for wave 0, 𝑢 = 𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 and 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 ← 𝒘𝒈𝒍𝒐𝒃𝒂𝒍 + 𝑢

[Figure: VW 1 has pushed wave 0 (minibatches 1–4) and is executing 5–7; minibatch 8 is blocked.]

Page 21: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D

• If D = 0, pull occurs every clock

[Figure: VW 1 (wave 0 pushed, executing 5–7, minibatch 8 blocked) waits before its pull until VW 2 pushes wave 0.]

Page 22: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• VW 2 pushes its aggregated update 𝑢: 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 + 𝑢
• VW 1 waits before its pull until VW 2 pushes

[Figure: Timelines of VW 1 and VW 2 around the boundary between clock 0 and clock 1.]

Page 23: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• The pull occurs after all VWs have pushed: 𝑤𝑙𝑜𝑐𝑎𝑙 ← 𝑤𝑔𝑙𝑜𝑏𝑎𝑙

[Figure: Both VW 1 and VW 2 have pushed wave 0; each then pulls the updated global weights.]

Page 24: Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D (here D = 0)
• Minibatch 8 then starts with 𝑤8 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2, i.e., wave 0 of both VW 1 and VW 2

[Figure: Timelines of VW 1 and VW 2; minibatch 8 starts after the pull.]
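A toy, wave-granularity simulation of the push/pull behavior sketched on slides 20 to 24 (it ignores the intra-wave pipeline overlap; `simulate_wsp` and its parameters are illustrative, not HetPipe's implementation):

```python
# Each VW pushes the aggregated update of a wave at the end of every clock,
# and may run at most D clocks ahead of the slowest VW before it must wait
# (and pull w_global) at the parameter server.
import heapq

def simulate_wsp(wave_times, num_clocks, D):
    """wave_times[i]: time VW i needs per wave (heterogeneous VWs)."""
    n = len(wave_times)
    clock = [0] * n                                  # waves pushed so far, per VW
    finish = [(wave_times[i], i) for i in range(n)]  # (finish time of current wave, VW)
    heapq.heapify(finish)
    waiting = []                                     # VWs blocked by the D bound
    while any(c < num_clocks for c in clock):
        now, i = heapq.heappop(finish)
        clock[i] += 1                                # push aggregated update of this wave
        print(f"t={now:4.1f}: VW{i} pushes wave {clock[i] - 1}")
        runnable, waiting = waiting + [i], []
        for j in runnable:
            if clock[j] >= num_clocks:
                continue                             # VW j has finished all its waves
            if clock[j] - min(clock) <= D:           # within D clocks of the slowest VW
                heapq.heappush(finish, (now + wave_times[j], j))   # (pulls, then runs)
            else:
                waiting.append(j)                    # must wait for slower VWs
    return clock

# Two heterogeneous VWs; with D = 0 the faster one stalls every clock,
# with D = 1 it may run one clock ahead, as in the slides' example.
simulate_wsp(wave_times=[1.0, 1.7], num_clocks=3, D=0)
```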

Page 25: Data Parallelism with Multiple VWs

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍) and global staleness (𝑺𝒈𝒍𝒐𝒃𝒂𝒍) with WSP

• With D = 0, after both VWs push wave 0: 𝑤𝑔𝑙𝑜𝑏𝑎𝑙 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2
• Minibatch 11 of VW 1 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4)VW1,VW2 + (𝑢5 + 𝑢6 + 𝑢7)VW1
• It is missing its own updates (𝑢8 + 𝑢9 + 𝑢10)VW1 (bounded by 𝑺𝒍𝒐𝒄𝒂𝒍) and VW 2's updates (𝑢5 + 𝑢6 + 𝑢7)VW2 (bounded by 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)

[Figure: Timelines of VW 1 and VW 2 over clocks 0–2, with the pushed and still-missing updates marked.]

Page 26: Data Parallelism with Multiple VWs

▪ Local staleness (𝑺𝒍𝒐𝒄𝒂𝒍) and global staleness (𝑺𝒈𝒍𝒐𝒃𝒂𝒍) with WSP
• With D = 0, minibatch 12 has to wait for the push and pull at the end of clock 1

[Figure: Timelines of VW 1 and VW 2 over clocks 0–2; VW 1 stalls before minibatch 12.]

Page 27: Data Parallelism with Multiple VWs

▪ Example of clock distance threshold D

• If D = 1, VW 1 can start minibatch 8 without the pull

[Figure: VW 1 has pushed wave 0 and continues into minibatch 8, while VW 2 is still executing clock 0.]

Page 28: Data Parallelism with Multiple VWs

▪ Example of clock distance threshold D (D = 1)
• Minibatch 11 of VW 1 starts with 𝑤11 = 𝑤0 + (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)VW1
• It is missing its own updates (𝑢8 + 𝑢9 + 𝑢10)VW1 (bounded by 𝑺𝒍𝒐𝒄𝒂𝒍) and VW 2's updates (𝑢1 + 𝑢2 + 𝑢3 + 𝑢4 + 𝑢5 + 𝑢6 + 𝑢7)VW2 (bounded by 𝑺𝒈𝒍𝒐𝒃𝒂𝒍)
• Minibatch 12 has to wait

[Figure: Timelines of VW 1 (one clock ahead) and VW 2, with the parameter server holding 𝒘𝒈𝒍𝒐𝒃𝒂𝒍.]

Page 29: Outline

▪ Motivation & Background

▪ HetPipe in a Nutshell

▪ Our System: HetPipe

▪ Evaluation

• Setup

• Resource Allocation for Virtual Workers

• Results

▪ Conclusion


Page 30: Evaluation Setup

▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
• Four GPUs per node, one type per node: TITAN V (V), TITAN RTX (R), GeForce RTX 2060 (G), Quadro P4000 (Q)
• Computation power: V > R > G > Q
• Memory size: R > V > Q > G

▪ Two DNN models

                          ResNet-152               VGG-19
  Dataset, minibatch size ImageNet, 32             ImageNet, 32
  Model parameter size    230 MB                   548 MB
  Characteristic          Large activation output  Large parameter size

Page 31: Resource Allocation for Virtual Workers: NP, ED, HD

▪ NP (Node Partition)

• Each node forms one virtual worker: VW 1 = Node 1 (V0–V3), VW 2 = Node 2 (Q0–Q3), VW 3 = Node 3 (R0–R3), VW 4 = Node 4 (G0–G3)
• Minimum communication overhead within a VW
• Performance of each virtual worker varies
• Straggler may degrade performance with DP

Page 32: Resource Allocation for Virtual Workers: NP, ED, HD

▪ ED (Equal Distribution)

• Each virtual worker gets one GPU of each type, i.e., one GPU from each node (e.g., VW 1 = {V0, Q0, R0, G0})
• Performance will be the same across the VWs
• Mitigates the straggler problem
• High communication overhead within each VW

Page 33: Resource Allocation for Virtual Workers: NP, ED, HD

▪ HD (Hybrid Distribution)

• Each virtual worker gets a mix of GPU types from different nodes, chosen considering both computation power (V > R > G > Q) and memory size (R > V > Q > G)
• Mitigates the straggler problem
• Reduces communication overhead within each VW
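The three policies can be pictured as GPU-to-VW mappings; the NP and ED mappings below follow the slides, while the HD grouping is a made-up placeholder, since the exact HD assignment is defined in the paper:

```python
# GPU inventory per node, as in the evaluation cluster (4 GPUs of one type per node).
nodes = {
    "node1": ["V0", "V1", "V2", "V3"],   # TITAN V
    "node2": ["Q0", "Q1", "Q2", "Q3"],   # Quadro P4000
    "node3": ["R0", "R1", "R2", "R3"],   # TITAN RTX
    "node4": ["G0", "G1", "G2", "G3"],   # GeForce RTX 2060
}

# NP (Node Partition): each node is one virtual worker.
np_alloc = {f"VW{i+1}": gpus for i, gpus in enumerate(nodes.values())}

# ED (Equal Distribution): each VW takes one GPU from every node.
ed_alloc = {f"VW{i+1}": [gpus[i] for gpus in nodes.values()] for i in range(4)}

# HD (Hybrid Distribution): a mix of GPU types per VW; this grouping is only a
# placeholder, the paper picks it based on compute power and memory size.
hd_alloc = {
    "VW1": ["V0", "V1", "G0", "G1"],
    "VW2": ["V2", "V3", "G2", "G3"],
    "VW3": ["R0", "R1", "Q0", "Q1"],
    "VW4": ["R2", "R3", "Q2", "Q3"],
}

print(np_alloc["VW1"], ed_alloc["VW1"], hd_alloc["VW1"])
```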

Page 34: Parameter Placement

▪ Round-robin policy (default)

• Can be used in all three policies: NP, ED, and HD

[Figure: Example with ED. The parameters of each layer are distributed across the parameter server partitions on Nodes 1–4 in round-robin fashion, so every VW (VW 1 ... VW 4) exchanges parameters with all nodes.]

Page 35: Parameter Placement

▪ Local placement policy (ED-local)
• With round-robin ED, parameter communication occurs across nodes
• ED-local keeps the parameters of each layer on the node whose GPU computes that layer, which significantly reduces communication overhead

[Figure: Example with ED-local. Each node's parameter server partition holds the layers computed by its local GPUs, so most parameter traffic stays within the node.]
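A minimal sketch contrasting the two placement policies (node names, layer counts, and the layer-to-node assignment are illustrative):

```python
# Round-robin placement: layer l's parameters live on node (l mod #nodes),
# so every VW talks to every node. Local placement: layer l's parameters live
# on the node that computes layer l (under ED, the same node for every VW).
NODES = ["node1", "node2", "node3", "node4"]

def round_robin_placement(num_layers):
    return {layer: NODES[layer % len(NODES)] for layer in range(num_layers)}

def local_placement(num_layers, layer_to_compute_node):
    # layer_to_compute_node[layer] = node whose GPU runs this layer in each VW
    return {layer: layer_to_compute_node[layer] for layer in range(num_layers)}

# Example: 8 layers, partitioned so layers 0-1 run on node1, 2-3 on node2, ...
compute_node = {l: NODES[l // 2] for l in range(8)}
print(round_robin_placement(8))
print(local_placement(8, compute_node))
```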

Page 36: Throughput Comparison with Horovod

▪ Baseline: Horovod

• State-of-the-art DP using AllReduce

[Figure: Training throughput relative to Horovod: up to 1.4X for ResNet-152 and up to 1.8X for VGG-19.]
• ED: reduces the straggler problem
• ED-local: significantly reduces communication overhead
• For ResNet-152, the whole model is too large to be loaded into a single G type GPU (batch size = 32)

Page 37: Performance Improvement of Adding Whimpy GPUs

[Figure: Throughput as whimpy GPUs are added, starting from the four V GPUs and incrementally adding the other GPU types (Q, R, G).]
• With additional GPUs, HetPipe achieves up to 2.3X speedup
• Additional whimpy systems allow for faster training

Page 38: Convergence Results

▪ ResNet-152

• Target accuracy: 74%
• HetPipe reduces the straggler problem in a heterogeneous environment (up to 39% faster)
• Adding four more whimpy G GPUs (12 GPUs → 16 GPUs) improves performance even more (7% faster)

[Figure: Accuracy over training time for Horovod and HetPipe with 12 and 16 GPUs.]

Page 39: Convergence Results

▪ VGG-19

• Target accuracy: 67%
• HetPipe (D = 0) is 29% faster than Horovod, and up to 49% faster with a larger D
• Higher global staleness (i.e., D = 32) can degrade convergence performance (4.7% slower)

[Figure: Accuracy over training time for Horovod and HetPipe with D = 0, D = 4, and D = 32.]

Page 40: Not Presented But Discussed in Paper

▪ Provide convergence proof of WSP

▪ Partitioning algorithm

▪ Performance of a single virtual worker

▪ Comparison to PipeDream


Page 41: Conclusion

▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs

▪ Integrate pipelined model parallelism with data parallelism

▪ Propose a novel parameter synchronization model: WSP

▪ DNN models converge up to 49% faster with HetPipe


Page 42:


Thank you!