
1

Department of Computer Science and Information Engineering, Chung Hua University
http://www.csie.chu.edu.tw

Ching-Hsien Hsu (許慶賢)

Localization and Scheduling Techniques for Optimizing Communications on Heterogeneous Cluster Grid

2

Outline

Introduction
  Regular / Irregular Data Distribution, Redistribution
  Category of Runtime Redistribution Problems

Processor Mapping Technique for Communication Localization
  The Processor Mapping Technique
  Localization on Multi-Cluster Grid System

Scheduling Contention-Free Communications for Irregular Problems
  The Two-Phase Degree Reduction Method (TPDR)
  Extended TPDR (E-TPDR)

Conclusions

3

Regular Parallel Data Distribution

Data parallel programming languages, e.g. HPF (High Performance Fortran), Fortran D

BLOCK, CYCLIC, BLOCK-CYCLIC(c)

Example: an 18-element array over 3 logical processors

global index      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
BLOCK            P0 P0 P0 P0 P0 P0 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2
CYCLIC           P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2
BLOCK-CYCLIC(2)  P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2 P0 P0 P1 P1 P2 P2
BLOCK-CYCLIC(3)  P0 P0 P0 P1 P1 P1 P2 P2 P2 P0 P0 P0 P1 P1 P1 P2 P2 P2
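All three patterns follow one owner-computes rule, since BLOCK and CYCLIC are the special cases BLOCK-CYCLIC(ceil(N/P)) and BLOCK-CYCLIC(1). The following minimal C sketch (our own illustration; the function and variable names are not from the talk) reproduces the table above:

    #include <stdio.h>

    /* Owner of global index i (0-based) under BLOCK-CYCLIC(c) over P processors.
       CYCLIC is c = 1; BLOCK is c = ceil(N / P).                                 */
    static int owner(int i, int c, int P) {
        return (i / c) % P;
    }

    int main(void) {
        const int N = 18, P = 3;
        const int block = (N + P - 1) / P;          /* BLOCK == BLOCK-CYCLIC(6)   */
        const int cs[] = { block, 1, 2, 3 };
        const char *names[] = { "BLOCK", "CYCLIC", "BLOCK-CYCLIC(2)", "BLOCK-CYCLIC(3)" };

        for (int r = 0; r < 4; r++) {
            printf("%-16s:", names[r]);
            for (int i = 0; i < N; i++)
                printf(" P%d", owner(i, cs[r], P));
            printf("\n");
        }
        return 0;
    }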

Introduction

4

Two-Dimensional Matrices

Introduction (cont.)

Examples of distributing an 8 x 8 matrix:
  (BLOCK, BLOCK) onto P(2, 2)
  (CYCLIC, *) onto P(4, *)
  (CYCLIC(2), BLOCK) onto P(2, 2)

Data Distribution

5

Data Redistribution

Introduction (cont.)

[Figure: a two-dimensional redistribution example showing processor ranks and local indices; the source distribution is (BLOCK-CYCLIC(3), BLOCK-CYCLIC(2)) and the destination distribution is (BLOCK-CYCLIC(2), BLOCK-CYCLIC(4)) over processors P0-P5.]

Source: (BC(3), BC(2))   Destination: (BC(2), BC(4))

6

Data Redistribution

Introduction (cont.)

REAL, DIMENSION(18, 24) :: A
!HPF$ PROCESSORS P(2, 3)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
      : (computation)
!HPF$ REDISTRIBUTE A(CYCLIC, CYCLIC(2)) ONTO P
      : (computation)

7

Irregular Redistribution

Introduction (cont.)

PARAMETER (S = /7, 16, 11, 10, 7, 49/)
!HPF$ PROCESSORS P(6)
      REAL A(100), new(6)
!HPF$ DISTRIBUTE A(GEN_BLOCK(S)) ONTO P
!HPF$ DYNAMIC
      new = /15, 16, 10, 16, 15, 28/
!HPF$ REDISTRIBUTE A(GEN_BLOCK(new))

8

Introduction (cont.)

Irregular Data Distribution (GEN_BLOCK)

Data distribution for algorithm P

7 16 11 10 7 49

Data distribution for algorithm Q

15 16 10 16 15 28

[Figure: an application alternates between Algorithm P and Algorithm Q, each requiring its own GEN_BLOCK distribution over heterogeneous processors.]

9

Problem Category

Benefits of runtime redistribution
  Achieve data locality
  Reduce communication cost at runtime

Objectives
  Indexing sets generation
  Data packing & unpacking techniques
  Communication optimizations
    Multi-stage redistribution method
    Processor mapping technique
    Communication scheduling

Introduction (cont.)

10

Outline

Introduction
  Regular / Irregular Data Distribution, Redistribution
  Category of Runtime Redistribution Problems

Processor Mapping Technique for Communication Localization
  The Processor Mapping Technique
  Multi-Cluster Grid System

Contention-Free Communication Scheduling for Irregular Problems
  The Two-Phase Degree Reduction Method (TPDR)
  Extended TPDR (E-TPDR)

Conclusions

11

The Original Processor Mapping Technique (Prof. Lionel M. Ni)

A mapping function is provided to generate a new sequence of logical processor ids

Increase data hits

Minimize the amount of data exchange

Processor Mapping Technique

12

An Optimal Processor Mapping Technique (Hsu '05)

Example: BC86 over 11

Traditional Method
Size-Oriented Greedy Matching
Maximum Matching (Optimal)

Processor Mapping Technique (cont.)

13

Localize communications

Cluster Grid
  Interior Communication
  External Communication

Processor Mapping Technique (cont.)

[Figure: a cluster grid of several clusters, each containing node1-node4; communication between nodes in the same cluster is interior, communication across clusters is external.]

14

Motivating Example

Processor Mapping Technique (cont.)

Source
  Cluster 1: P0 P1 P2    Cluster 2: P3 P4 P5    Cluster 3: P6 P7 P8
  Data:       A  B  C                D  E  F                G  H  I

Target (each source block is split into three pieces)
  Cluster 1: P0 P1 P2    Cluster 2: P3 P4 P5    Cluster 3: P6 P7 P8
  Data:      a1 a2 a3               b1 b2 b3               c1 c2 c3
             d1 d2 d3               e1 e2 e3               f1 f2 f3
             g1 g2 g3               h1 h2 h3               i1 i2 i3

15

Communication Table Before Processor Mapping

Processor Mapping Technique (cont.)

         Cluster 1      Cluster 2      Cluster 3
 SP\DP   P0  P1  P2     P3  P4  P5     P6  P7  P8
  P0     a1  a2  a3
  P1                    b1  b2  b3
  P2                                   c1  c2  c3
  P3     d1  d2  d3
  P4                    e1  e2  e3
  P5                                   f1  f2  f3
  P6     g1  g2  g3
  P7                    h1  h2  h3
  P8                                   i1  i2  i3

|I| = 9, |E| = 18

16

Communication links Before Processor Mapping

Processor Mapping Technique (cont.)

[Figure: bipartite graph of source processors P0-P8 and target processors P0-P8; only 9 of the 27 links are interior communications, the remaining 18 are external.]

17

Communication table after Processor Mapping

Processor Mapping Technique (cont.)

         Cluster 1      Cluster 2      Cluster 3
 SP\DP   P0  P1  P2     P3  P4  P5     P6  P7  P8
  P0     a1  a2  a3
  P3                    b1  b2  b3
  P6                                   c1  c2  c3
  P1     d1  d2  d3
  P4                    e1  e2  e3
  P7                                   f1  f2  f3
  P2     g1  g2  g3
  P5                    h1  h2  h3
  P8                                   i1  i2  i3

|I| = 27, |E| = 0

18

Communication links after Processor Mapping

Processor Mapping Technique (cont.)

[Figure: after reordering, the source processors (taken in the order P0, P3, P6, P1, P4, P7, P2, P5, P8) communicate only with target processors in their own cluster; all 27 links are interior.]
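A minimal C sketch (our own encoding of the motivating example, not the talk's code) that counts interior and external messages for a given processor-to-cluster assignment; it reproduces |I| = 9, |E| = 18 without reordering and |I| = 27, |E| = 0 with reordering:

    #include <stdio.h>

    /* Sketch (our own encoding, not the talk's code): count interior and external
       communications for the 9-processor motivating example, C = 3 clusters of
       K = 3 nodes, where the three pieces of source block b are sent to the
       destination processors (3b) mod 9, (3b+1) mod 9 and (3b+2) mod 9.          */

    static int cluster_of(int p, int K) { return p / K; }

    int main(void) {
        const int P = 9, K = 3;
        /* logical id of the processor holding source block b, before and after
           reordering with F(X) = X/3 + (X%3)*3                                   */
        const int identity[9]  = {0, 1, 2, 3, 4, 5, 6, 7, 8};
        const int reordered[9] = {0, 3, 6, 1, 4, 7, 2, 5, 8};
        const int *maps[2]     = { identity, reordered };
        const char *names[2]   = { "without reordering", "with reordering" };

        for (int m = 0; m < 2; m++) {
            int interior = 0, external = 0;
            for (int b = 0; b < P; b++) {            /* source block b            */
                int sp = maps[m][b];                 /* sender's logical id       */
                for (int j = 0; j < 3; j++) {
                    int dp = (3 * b + j) % P;        /* receiver of piece j       */
                    if (cluster_of(sp, K) == cluster_of(dp, K)) interior++;
                    else external++;
                }
            }
            printf("%s: |I| = %d, |E| = %d\n", names[m], interior, external);
        }
        return 0;
    }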

19

Processor Reordering Flow Diagram

Processor Mapping Technique (cont.)

[Flow diagram: the master node partitions the source data SD(Px) and dispatches it; the reordering agent applies the mapping function to generate the new processor ids and reorders the data as SD(Px'); each source cluster agent SCA(x) and destination cluster agent DCA(x) then determines the target cluster and designates the target node for the destination data DD(Py).]

Mapping function (identical cluster grid with C clusters of K nodes):

F(X) = X' = floor(X / C) + (X mod C) * K
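A minimal C sketch of the reordering function, assuming the reconstructed formula above is the intended one (identical cluster grid, C clusters of K nodes; names are ours). For C = K = 3 it prints the sequence 0, 3, 6, 1, 4, 7, 2, 5, 8, matching the reordered communication table on the previous slide:

    #include <stdio.h>

    /* Sketch of the processor reordering function for an identical cluster grid
       with C clusters of K nodes each (names are ours; the formula is the one
       reconstructed on this slide).                                              */
    static int reorder(int x, int C, int K) {
        return x / C + (x % C) * K;
    }

    int main(void) {
        const int C = 3, K = 3;          /* the 3 x 3 motivating example          */
        for (int x = 0; x < C * K; x++)
            printf("old P%d -> new P%d\n", x, reorder(x, C, K));
        return 0;
    }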

20

Identical Cluster Grid vs. Non-identical Cluster Grid

Processor Mapping Technique (cont.)

[Communication tables: (left) an identical cluster grid with four equal clusters (Cluster-1 to Cluster-4), where each of the source processors P0-P11 sends four messages (a1-a4, b1-b4, ..., l1-l4) to the destination processors P0-P11; (right) a non-identical cluster grid whose four clusters contain different numbers of nodes, where each of the source processors P0-P13 sends five messages (a1-a5, b1-b5, ..., n1-n5).]

21

Processor Replacement Algorithm for Non-identical Cluster Grid

Processor Mapping Technique (cont.)

22

Theoretical Analysis

Processor Mapping Technique (cont.)

[Chart: number of interior communications (I_comm) versus K = 3 to 10, with and without reordering.]

The number of interior communications when C = 3.

23

Theoretical Analysis

Processor Mapping Technique (cont.)

[Chart: number of interior (|I|) and external (|E|) communications versus K = 2 to 7, for C = 3 and n = 3, with and without reordering.]

24

Theoretical Analysis

Processor Mapping Technique (cont.)

[Chart: number of interior communications |I| versus K = 2 to 10, for C = 4 clusters with (n1, n2, n3, n4) = (2, 3, 4, 5), comparing the original ordering with reordering schemes A and B.]

25

Simulation Setting

Processor Mapping Technique (cont.)

Taiwan UniGrid

8 campus clusters

SPMD Programs

C+MPI codes.

26

Topology

Processor Mapping Technique (cont.)

Taipei: Academia Sinica
Hsinchu: National Tsing Hua University 1, National Tsing Hua University 2, Chung Hua University, National Center for High-performance Computing
Taichung: Providence University, Tunghai University
Tainan: Hsing Kuo University
Hualien: National Dong Hwa University

27

Hardware Infrastructure

Processor Mapping Technique (cont.)

HKU:    Intel P3 1.0 GHz, 256 MB
THU:    Dual AMD 1.6 GHz, 1 GB
CHU:    Intel P4 2.8 GHz, 256 MB
SINICA: Dual Intel P3 1.0 GHz, 1 GB
NCHC:   Dual AMD 2000+, 512 MB
PU:     AMD 2400+, 1 GB
NTHU:   Dual Xeon 2.8 GHz, 1 GB
NDHU:   AMD Athlon, 256 MB

All sites are connected through the Internet.

28

System Monitoring Webpage

Processor Mapping Technique (cont.)

29

Experimental Results

[Chart: time in seconds for C = 3, K = 3 without I/O, on the cluster-grid combinations NTI, NTC, NTD, NCI, NCD, NHD, with and without reordering.]

Processor Mapping Technique (cont.)

[Chart: time in seconds for C = 3, K = 3 with I/O (10 MB), on the cluster-grid combinations NTI, NTC, NTD, NCI, NCD, NHD, with and without reordering.]

30

Experimental Results

Processor Mapping Technique (cont.)

Label Organization

N National Center for High Performance Computing

T National Tsing Hua University

C Chung Hua University

H Tung Hai University

I Institute of Information Science, Academia Sinica

D National Dong Hwa University

P Providence University

[Chart: time in seconds on the cluster grids NTCH, NTCI, NTCD, NTCP, NIDP, with and without reordering.]

31

Experimental Results

Processor Mapping Technique (cont.)

[Charts: time in seconds versus data size (1 MB to 10 MB), without I/O and with I/O, with and without reordering.]

32

Outline

Introduction
  Regular / Irregular Data Distribution, Redistribution
  Category of Runtime Redistribution Problems

Processor Mapping Technique for Communication Localization
  The Processor Mapping Technique
  Multi-Cluster Grid System

Scheduling Contention-Free Communications for Irregular Problems
  The Two-Phase Degree Reduction Method (TPDR)
  Extended TPDR (E-TPDR)

Conclusions

33

Example of GEN_BLOCK distributions

Enhances load balancing in heterogeneous environments

Scheduling Irregular Redistributions

Data distribution for algorithm P

7 16 11 10 7 49

Data distribution for algorithm Q

15 16 10 16 15 28

[Figure: an application alternates between Algorithm P and Algorithm Q, each with its own GEN_BLOCK distribution.]

34

Example of GEN_BLOCK redistribution

Scheduling Irregular Redistributions (cont.)

Distribution I ( Source Processor )

SP SP0 SP1 SP2 SP3 SP4 SP5

Size 7 16 11 10 7 49

Distribution II ( Destination Processor )

DP DP0 DP1 DP2 DP3 DP4 DP5

Size 15 16 10 16 15 28

Messages (source block to destination block, with sizes):
  m1: SP0 -> DP0 (7)    m2: SP1 -> DP0 (8)    m3: SP1 -> DP1 (8)
  m4: SP2 -> DP1 (8)    m5: SP2 -> DP2 (3)    m6: SP3 -> DP2 (7)
  m7: SP3 -> DP3 (3)    m8: SP4 -> DP3 (7)    m9: SP5 -> DP3 (6)
  m10: SP5 -> DP4 (15)  m11: SP5 -> DP5 (28)

Observation: there are no cross communications; each message connects neighboring source and destination blocks.
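The message list above follows mechanically from intersecting the source and destination block boundaries; a minimal C sketch of that derivation (our own code, not the talk's implementation):

    #include <stdio.h>

    /* Sketch (our own code, not the talk's implementation): derive the GEN_BLOCK
       redistribution messages by sweeping the source and destination blocks and
       emitting one message per overlap.                                          */
    int main(void) {
        const int N = 6;
        const int src[] = { 7, 16, 11, 10, 7, 49 };    /* SP0..SP5 block sizes */
        const int dst[] = { 15, 16, 10, 16, 15, 28 };  /* DP0..DP5 block sizes */

        int sp = 0, dp = 0, m = 1;
        int s_left = src[0], d_left = dst[0];
        while (sp < N && dp < N) {
            int size = s_left < d_left ? s_left : d_left;   /* overlap length */
            printf("m%d: SP%d -> DP%d, size %d\n", m++, sp, dp, size);
            s_left -= size;
            d_left -= size;
            if (s_left == 0 && ++sp < N) s_left = src[sp];
            if (d_left == 0 && ++dp < N) d_left = dst[dp];
        }
        return 0;
    }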

35

Convex Bipartite Graph

Scheduling Irregular Redistributions (cont.)

[Figure: a convex bipartite graph; the vertices are source processors (SP1-SP3) and target processors (TP1-TP3), and the edges are the data communications between them.]

36

Example of GEN_BLOCK redistribution

Scheduling Irregular Redistributions (cont.)

(Messages m1-m11 as on the previous slide: sizes 7, 8, 8, 8, 3, 7, 3, 7, 6, 15, 28.)

Schedule Table

Step 1: m1, m3, m5, m7, m10
Step 2: m2, m4, m6, m8, m11
Step 3: m9

This is a simple (non-optimized) result. The objectives are to minimize the number of communication steps and to minimize the total message size over all steps.
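Contention free here means that no two messages in a step share a sender or a receiver, and a common cost model charges each step its largest message. A minimal C sketch (our own helper with hypothetical names; the SP -> DP assignment in msgs[] is derived from the block sizes above) that checks the three-step schedule for contention and sums the per-step maxima:

    #include <stdio.h>

    /* Sketch (our own helper, hypothetical names): verify that a schedule is
       contention free (no sender or receiver appears twice within a step) and
       compute its cost as the sum of the per-step maximum message sizes.         */

    typedef struct { int sp, dp, size; } Msg;

    /* m1..m11 of the example, with the SP -> DP assignment derived from the
       source blocks (7,16,11,10,7,49) and destination blocks (15,16,10,16,15,28) */
    static const Msg msgs[] = {
        {0,0,7}, {1,0,8}, {1,1,8}, {2,1,8}, {2,2,3}, {3,2,7},
        {3,3,3}, {4,3,7}, {5,3,6}, {5,4,15}, {5,5,28}
    };

    static const int step1[] = {1, 3, 5, 7, 10};
    static const int step2[] = {2, 4, 6, 8, 11};
    static const int step3[] = {9};
    static const int *steps[] = { step1, step2, step3 };
    static const int lens[]   = { 5, 5, 1 };

    int main(void) {
        int cost = 0;
        for (int s = 0; s < 3; s++) {
            int used_sp[6] = {0}, used_dp[6] = {0}, max = 0;
            for (int i = 0; i < lens[s]; i++) {
                Msg m = msgs[steps[s][i] - 1];        /* message ids are 1-based */
                int sp_seen = used_sp[m.sp]++;
                int dp_seen = used_dp[m.dp]++;
                if (sp_seen || dp_seen)
                    printf("contention in step %d!\n", s + 1);
                if (m.size > max) max = m.size;
            }
            printf("step %d: largest message = %d\n", s + 1, max);
            cost += max;
        }
        printf("total cost = %d\n", cost);
        return 0;
    }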

37

Related Implementations

Coloring

Example: source block sizes (SP0-SP5) = 18, 20, 8, 21, 18, 19; destination block sizes (DP0-DP5) = 5, 24, 23, 7, 42, 3; the resulting messages m1-m11 have sizes 5, 13, 11, 9, 8, 6, 7, 8, 18, 16, 3.

Schedule

Step 1: m1, m3, m5, m7, m9, m11
Step 2: m2, m4, m8
Step 3: m6, m10

38

Related Implementations

LIST

Schedule

Step 1: m2, m4, m7, m9, m11
Step 2: m1, m3, m5, m10
Step 3: m8
Step 4: m6

(Same example: messages m1-m11 with sizes 5, 13, 11, 9, 8, 6, 7, 8, 18, 16, 3.)

39

Related Implementations

DC1 & DC2

(Same example: messages m1-m11 with sizes 5, 13, 11, 9, 8, 6, 7, 8, 18, 16, 3.)

Separating NMS

Schedule 1              Schedule 2
Step 1: m2              Step 1: m2
Step 2: m3              Step 2: m1, m3
Step 3: m1

(a) DC1    (b) DC2

Schedule
Step 1: m2, m4, m7, m9, m11
Step 2: m3, m6, m10
Step 3: m1, m5, m8

Schedule
Step 1: m2, m4, m7, m9, m11
Step 2: m6, m10
Step 3: m1, m3, m5, m8

40

The Two Phase Degree Reduction Method

Scheduling Irregular Redistributions (cont.)

The first phase (for nodes with degree > 2): reduces the degree of the maximum-degree nodes by one in each reduction iteration.

The second phase (for nodes with degree 1 and 2): schedules the messages between nodes of degree 1 and 2 using an adjustable coloring mechanism.
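For intuition, a minimal C sketch (our own code, not the talk's implementation) that computes the node degrees for the 11-message example on the previous slides; the maximum degree is 3, which is both a lower bound on the number of contention-free steps and the step count the three-step schedules above achieve:

    #include <stdio.h>

    /* Sketch (our own code, not the talk's implementation): compute node degrees
       of the redistribution graph.  The maximum degree is a lower bound on the
       number of contention-free steps and is the step count TPDR targets:
       phase 1 peels one message off every maximum-degree node per iteration
       until the maximum degree is 2, and phase 2 colors the rest into 2 steps.   */

    typedef struct { int sp, dp, size; } Msg;

    /* m1..m11 of the example with source sizes (18,20,8,21,18,19) and
       destination sizes (5,24,23,7,42,3).                                        */
    static const Msg msgs[] = {
        {0,0,5}, {0,1,13}, {1,1,11}, {1,2,9}, {2,2,8}, {3,2,6},
        {3,3,7}, {3,4,8}, {4,4,18}, {5,4,16}, {5,5,3}
    };

    int main(void) {
        int deg_sp[6] = {0}, deg_dp[6] = {0}, max = 0;
        for (int i = 0; i < 11; i++) {
            deg_sp[msgs[i].sp]++;
            deg_dp[msgs[i].dp]++;
        }
        for (int p = 0; p < 6; p++) {
            printf("SP%d degree %d, DP%d degree %d\n", p, deg_sp[p], p, deg_dp[p]);
            if (deg_sp[p] > max) max = deg_sp[p];
            if (deg_dp[p] > max) max = deg_dp[p];
        }
        printf("maximum degree = %d -> %d communication steps\n", max, max);
        return 0;
    }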

41

The first phase

Scheduling Irregular Redistributions (cont.)

The Two Phase Degree Reduction Method

S3: m11(6), m5(3) --- 6

42

The second phase

Scheduling Irregular Redistributions (cont.)

The Two-Phase Degree Reduction Method

S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
S2: m2(3), m4(4), m7(3), m9(10), m12(12) --- 12
S3: m11(6), m5(3) --- 6

43

Scheduling Irregular Redistributions (cont.)

Extended TPDR (E-TPDR)

E-TPDR:
S1: m1(7), m3(7), m6(15), m9(10), m13(18) --- 18
S2: m4(4), m7(3), m10(8), m12(12) --- 12
S3: m11(6), m5(3), m2(3), m8(4) --- 6

TPDR:
S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
S2: m2(3), m4(4), m7(3), m9(10), m12(12) --- 12
S3: m11(6), m5(3) --- 6

44

Performance Evaluation

Simulation of the TPDR and E-TPDR algorithms on uneven cases.

45

Performance Evaluation (cont.)

Simulation A is carried out to examine the performance of the TPDR and E-TPDR algorithms on uneven cases.

46

Performance Evaluation (cont.)

Simulation B is carried out to examine the performance of the TPDR and E-TPDR algorithms on even cases.

47

Performance Evaluation (cont.)

Simulation B is carried out to examine the performance of the TPDR and E-TPDR algorithms on even cases.

48

Summary

TPDR & E-TPDR for Scheduling irregular GEN_BLOCK redistributions

Contention free
Optimal number of communication steps
Outperforms the D&C algorithm
TPDR (uneven) performs better than TPDR (even)

49

1000 test cases

Performance Evaluation (cont.)

Processor: 8 (each cell gives the row method's better / equal / worse counts against the column method over 1000 test cases)

          DR            LIST          DC1           DC2           Color         Combined
DR        -             207/535/258   856/99/45     669/312/19    957/37/6      67.225% / 24.575% / 8.2%
LIST      258/535/207   -             874/101/25    738/165/97    975/4/21      71.125% / 20.125% / 8.75%
DC1       45/99/856     25/101/874    -             243/92/665    755/101/144   26.7% / 9.825% / 63.475%
DC2       19/312/669    97/165/738    665/92/243    -             901/39/60     42.05% / 15.2% / 42.75%
Color     6/37/957      21/4/975      144/101/755   60/39/901     -             5.775% / 4.525% / 89.7%

Processor: 16

          DR            LIST          DC1           DC2           Coloring      Combined
DR        -             331/379/290   985/5/10      931/58/11     998/2/0       81.125% / 11.1% / 7.775%
LIST      290/379/331   -             989/4/7       921/25/54     998/0/2       79.95% / 10.2% / 9.85%
DC1       10/5/985      7/4/989       -             179/11/810    774/42/184    24.25% / 1.55% / 74.2%
DC2       11/58/931     54/25/921     810/11/179    -             948/7/45      45.575% / 2.525% / 51.9%
Coloring  0/2/998       2/0/998       184/42/774    45/7/948      -             5.775% / 1.275% / 92.95%

Processor: 32

          DR            LIST          DC1           DC2           Coloring      Combined
DR        -             400/278/322   999/0/1       994/4/2       1000/0/0      84.825% / 7.05% / 8.125%
LIST      322/278/400   -             1000/0/0      994/1/5       1000/0/0      82.9% / 6.975% / 10.125%
DC1       1/0/999       0/0/1000      -             95/6/899      799/23/178    22.375% / 0.725% / 76.9%
DC2       2/4/994       5/1/994       899/6/95      -             969/1/30      46.875% / 0.3% / 52.825%
Coloring  0/0/1000      0/0/1000      178/23/799    30/1/969      -             5.2% / 0.6% / 94.2%

Processor: 64

          DR            LIST          DC1           DC2           Coloring      Combined
DR        -             416/221/363   1000/0/0      999/0/1       1000/0/0      85.38% / 5.525% / 9.1%
LIST      363/221/416   -             1000/0/0      1000/0/0      1000/0/0      84.08% / 5.525% / 10.4%
DC1       0/0/1000      0/0/1000      -             44/0/956      783/22/195    20.68% / 0.55% / 78.78%
DC2       1/0/999       0/0/1000      956/0/44      -             986/4/10      48.58% / 0.1% / 51.33%
Coloring  0/0/1000      0/0/1000      195/22/783    10/4/986      -             5.125% / 0.65% / 94.23%

Processor: 128

          DR            LIST          DC1           DC2           Coloring      Combined
DR        -             473/204/323   1000/0/0      1000/0/0      1000/0/0      86.825% / 5.1% / 8.075%
LIST      323/204/473   -             1000/0/0      1000/0/0      1000/0/0      83.075% / 5.1% / 11.825%
DC1       0/0/1000      0/0/1000      -             14/1/985      776/20/204    19.75% / 0.525% / 79.725%
DC2       0/0/1000      0/0/1000      985/1/14      -             998/0/2       49.575% / 0.025% / 50.4%
Coloring  0/0/1000      0/0/1000      204/20/776    2/0/998       -             5.15% / 0.5% / 94.35%

50

Average

[Chart: average combined "better" percentage (data size = 10000) versus the number of processors (8, 16, 32, 64, 128) for DR, LIST, DC1, DC2, and COLOR.]

Performance Evaluation (cont.)

51

Conclusions

Runtime Data Redistribution is usually used to enhance algorithm performance in data parallel applications

Processor Mapping technique minimizes data transmission cost and achieves better communication localization on multi-cluster grid systems

TPDR & E-TPDR for Scheduling irregular GEN_BLOCK redistributions

Contention free
Good performance

Future Works
  Incorporate localization techniques on Data Grids, taking heterogeneous external communication overheads into account
  Incorporate the ratio between local memory access and remote message passing (on different architectures) into the E-TPDR scheduling policy
  ...

52

Thank you

53

Our Implementation of Communication Scheduling

E-TPDR

Example: source block sizes (SP0-SP5) = 18, 20, 8, 21, 18, 19; destination block sizes (DP0-DP5) = 5, 24, 23, 7, 42, 3; messages m1-m11 with sizes 5, 13, 11, 9, 8, 6, 7, 8, 18, 16, 3.

After processes 1 and 2, m5 and m8 are scheduled in step 3 (reducing the pending sizes of DP2 and DP4 from 23 and 42 by 8 each) and the maximum degree becomes 2.

Messages m1 and m11 are then also scheduled in step 3.

After process 3, the remaining edges are colored blue and red for steps 1 and 2, respectively.

Schedule

Step 1: m2, m4, m7, m9
Step 2: m3, m6, m10
Step 3: m1, m5, m8, m11