MPI Performance Analysis and Optimization
on Tile64/Maestro
Mikyung Kang, Eunhui Park, Minkyoung Cho,
Jinwoo Suh, Dong-In Kang, and Stephen P. Crago
USC/ISI-East
July 19-23, 2009
Overview
Background
MPI
Maestro
Implementation and test
Performance
Results on MDE 1.3.5
Some results on MDE 2.0.1
MPI and Maestro
MPI
Message Passing Interface
Used extensively for communications among processors in
multi-node computing systems
Example (a minimal sketch follows below)
MPI_Send(…)
MPI_Recv(…)
MPI_Bcast(…)
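A minimal sketch of these three calls using the standard MPI C API; the ranks, tag 0, and the payload value are illustrative choices, not from the slides:

    /* Minimal sketch: one point-to-point exchange plus a broadcast. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;                                  /* illustrative payload */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root 0 to all ranks */
        printf("rank %d: data = %d\n", rank, data);
        MPI_Finalize();
        return 0;
    }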
Maestro
A Tile64-based processor
Tile64 is a 64-core processor from Tilera
Maestro contains 49 cores
Radiation-hardened processor
Floating-point unit included
MDE
Multicore Development Environment from Tilera
MPI Library on Tile64/Maestro
Fully compliant with the MPI 1.2 specification
Implemented on top of a modified iLib library
iLib is a communication library from Tilera
The iLib layer is invisible to users
Can be used on both Tile64 and Maestro
Does not use floating-point operations
Independent of the number of tiles
MPI Testing
MPI benchmarks tested
IBM test suite: 83 tests
Intel test suite: 764 tests
MPICH test suite: 161 tests
SPEC MPI: two tests
MPI Performance
MPI performance measurement, analysis, and optimization
Point-to-point communication
Send and receive
Collective communication
Broadcast
Allreduce (reduce + broadcast)
All-to-all
Total Execution Time for Send/Receive
Send/receive pair communication using two tiles
Latency: about 30 μs for small data with a cold instruction cache
6.8 μs with a warm instruction cache
Measured at 700 MHz on a TILExpress board with Tile64
[Figure: Send/Recv total execution time vs. message size, MDE 1.3.5]

Message size (words)    1    4   16   64  256   1K   4K   16K    64K
ilib time (μs)         20   21   22   23   26   36   71   271  1,096
MPI time (μs)          30   30   30   31   31   41   77   287  1,109
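A hedged sketch of the kind of timing loop behind numbers like these (our assumption about the harness, not the authors' code); MPI_Wtime() returns wall-clock seconds:

    /* Time 1,000 send/receive pairs between ranks 0 and 1 and report the
       average per-iteration cost. Message size here is 256 words. */
    #include <mpi.h>
    #include <stdio.h>

    #define NWORDS 256
    #define ITERS  1000

    int main(int argc, char **argv)
    {
        int rank, i, buf[NWORDS];
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < NWORDS; i++)
            buf[i] = i;                       /* fill the message buffer */

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(buf, NWORDS, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, NWORDS, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("avg per-iteration time: %.2f us\n",
                   1e6 * (t1 - t0) / ITERS);
        MPI_Finalize();
        return 0;
    }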
Cycles/Word for Send/Receive
As the data size gets larger, the initial overhead is amortized: the data transfer time per word decreases as the message size increases
[Figure: Send/Recv time per word vs. message size, MDE 1.3.5]

Message size (words)        1      4     16    64   256   1K   4K  16K  64K
ilib (cycles/word)     14,045  3,690    950   247    72   25   12   12   12
MPI (cycles/word)      20,685  5,177  1,324   335    85   28   13   12   12
Send/Recv Iteration Test
[Figure: MPI_Send/Recv time vs. message size, 1 iteration vs. the average over 1,000 iterations, MDE 1.3.5]

Message size (words)         1        4       16       64      256       1K       4K      16K      32K      64K
1 time (cycles)         20,685   20,706   21,176   21,420   21,701   28,727   53,959  200,720  410,155  776,035
1,000 times (cycles)     6,325    6,417    6,672    7,201    8,144   14,324   32,357  150,935  392,771  775,209

The gap between a single call and the per-iteration average over 1,000 iterations of MPI_Send/Recv is due to instruction cache misses: repeated calls run with a warm instruction cache.
Communication Cost Breakdown
Hardware lower bound
1 + w cycles for a w-word message (1 cycle for sending a header word)
Lower bound by compiled code
1-word case: 1 cycle
2 KW / 16 KW / 64 KW cases: 1.5 cycles per word
Compiled code with optimization flags uses an unrolled loop (see the sketch below)
For an even number of words, each pair of words is sent in a 5-instruction loop
Innermost loop: initial setup instructions are needed, and the maximum number of iterations is set to 24, so the number of cycles per word is a function of the message size
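An illustrative C sketch of the unrolled send pattern (not Tilera's actual compiled code); udn_send_word() is a hypothetical stand-in for the single-word network-send primitive:

    /* Hypothetical single-word send over the on-chip network. */
    extern void udn_send_word(unsigned word);

    static void send_words(const unsigned *buf, int nwords)
    {
        int i;
        /* Unrolled: each pair of words goes out in one short loop body. */
        for (i = 0; i + 1 < nwords; i += 2) {
            udn_send_word(buf[i]);
            udn_send_word(buf[i + 1]);
        }
        if (i < nwords)              /* tail word when nwords is odd */
            udn_send_word(buf[i]);
    }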
Communication Cost Breakdown (2)
Header overhead
The header contains the destination tile and the message size
Overhead includes the cycles needed to prepare the header words
Cache miss cost
Stalls caused by data and instruction cache misses
Message data is loaded into the cache when the data is prepared for the send operation, before the MPI call
As the MPI call goes through several subroutines, part or all of the data is evicted due to cache conflicts
As the message size increases, more and more cycles are spent on cache misses
Communication Cost Breakdown (3)
iLib overhead
Extra cycles needed for iLib code execution
Creating the transaction record, checking it, and managing the transaction
Amortized over the message length
MPI overhead
Error checking, partner-tile null checking, tag setting, rank determination, and status setting
The cost per word decreases as the message size increases
Cost Analysis for 1 Word Message
Communication cost analysis (MDE 1.3.5):

Communication cost for 1 W                  Cycles    Ratio
(1) Core data send
    Hardware lower bound                         2     0.0%
    Lower bound by compiled code                 1     0.0%
    Udn_send                                     7     0.0%
    Initial overhead for innermost loop         17     0.1%
(2) Header overhead                             22     0.1%
(3) iLib overhead
    iLib overhead_base                       9,897    47.8%
    iLib overhead_ISI                        3,977    19.2%
(4) MPI overhead                             6,762    32.7%
Total                                       20,685   100.0%
Cost Analysis for Large Size Message
[Figure: stacked communication cost breakdown, send/receive time (cycles/word) vs. message size]

Communication cost (cycles/word)       2 KW   16 KW   64 KW
Hardware lower bound                   1.00    1.00    1.00
Lower bound by compiled code           1.50    1.50    1.50
Initial overhead for innermost loop    0.26    0.26    0.26
Cache miss cost                        1.91    6.46    7.51
Header overhead                        0.44    0.44    0.44
iLib overhead_base                     8.53    0.81    0.79
iLib overhead_ISI                      1.55    1.35    0.21
MPI overhead                           2.65    0.45    0.14
MPI_Send/Recv on Multiple Tiles
[Figure: MPI_Send/Recv total execution time (μs) vs. number of tiles (2 to 32), MDE 1.3.5, for message sizes of 1, 256, 1,024, 4,096, and 32,768 words]

The increased communication time is due to memory contention:
Message data contention
Transaction-management-related data contention
Collective Communication
Collective communications in MPI
Broadcast
Allreduce
Alltoall
Collective Communication
[Diagram: Broadcast copies the root's data A0 to every process. Scatter splits the root's data A0..A5 across the processes, one element each. Gather is the inverse, collecting one element from each process at the root.]
Collective Communication
[Diagram: Allgather gives every process the full set A0..F0, one element contributed by each process. Alltoall transposes the data: process i sends its j-th element to process j and receives the i-th element from every other process.]
MPI_Bcast
Broadcast is implemented with a tree-like communication pattern (a sketch follows the diagram below)
Broadcast was performed on multiple tiles with various message sizes
1. The root node sends the message to another tile
2. The two nodes send data to another two nodes, and so on
For N tiles, there are lg N communication stages
[Diagram: binomial-tree broadcast over 32 tiles; stage 1: tile 0 sends to tile 16; stage 2: tiles 0 and 16 send to tiles 8 and 24; the stages continue until all 32 tiles have the data]
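A hedged C sketch of this tree pattern (the library's actual source is not shown in the slides); it assumes the number of tiles is a power of two and rank 0 is the root:

    /* In lg N stages, each tile that already holds the data forwards it
       to the tile 'step' ranks away, halving 'step' each stage. */
    #include <mpi.h>

    void tree_bcast(int *buf, int count, MPI_Comm comm)
    {
        int rank, size, step;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (step = size / 2; step >= 1; step /= 2) {
            if (rank % (2 * step) == 0)          /* already has the data */
                MPI_Send(buf, count, MPI_INT, rank + step, 0, comm);
            else if (rank % step == 0)           /* receives at this stage */
                MPI_Recv(buf, count, MPI_INT, rank - step, 0, comm, &status);
        }
    }

For 32 tiles this reproduces the stages in the diagram: 0 to 16, then 0 and 16 to 8 and 24, and so on for lg 32 = 5 stages.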
MPI_Bcast
[Figure: MPI_Bcast total execution time (μs) vs. message size (256 words to 32 KW), MDE 1.3.5]
MPI_Allreduce
A reduce operation followed by a broadcast operation (a minimal sketch follows)
Tree-like communication is used if the operation is associative; the implementation is similar to the inverse of broadcast
Serial communication is used for non-associative cases
The reduce phase takes many more cycles than the broadcast because it performs additional work, such as element-wise reduction of the message
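A minimal sketch of this two-phase structure using the standard MPI collectives (the library's internal code uses its own tree implementation, as described above); sendbuf, recvbuf, count, and comm are assumed to be set up by the caller, and MPI_SUM is an illustrative choice of operation:

    /* Allreduce as the two phases named above: reduce to the root,
       then broadcast the result to every rank. */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, 0, comm);
    MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);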
[Diagram: tree-like reduce over 32 tiles, the inverse of the broadcast tree; at each stage pairs of tiles combine partial sums until tile 0 holds the final result]
MPI_Allreduce
[Figure: MPI_Allreduce total execution time (μs) vs. message size (256 words to 32 KW), MDE 1.3.5]
MPI_Alltoall
All-to-all communication needs to handle a large amount of data
Implemented as a combination of two algorithms (a sketch of the small-message case follows below)
Message size > 1 KW: half of the tiles send to the other half
Message size <= 1 KW: nonblocking send followed by blocking receive, in N-1 stages, where N is the number of tiles
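A hedged sketch of the small-message (<= 1 KW) algorithm described above: N-1 stages, each pairing a nonblocking send with a blocking receive. The rank ^ stage partner schedule is one common pairwise pattern and an assumption here (it requires a power-of-two number of tiles), not confirmed by the slides:

    #include <mpi.h>
    #include <string.h>

    void small_alltoall(const int *sendbuf, int *recvbuf, int count,
                        MPI_Comm comm)
    {
        int rank, size, stage, partner;
        MPI_Request req;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Copy our own block locally. */
        memcpy(recvbuf + rank * count, sendbuf + rank * count,
               count * sizeof(int));

        for (stage = 1; stage < size; stage++) {   /* N-1 stages */
            partner = rank ^ stage;                /* assumed schedule */
            MPI_Isend((void *)(sendbuf + partner * count), count, MPI_INT,
                      partner, 0, comm, &req);     /* nonblocking send */
            MPI_Recv(recvbuf + partner * count, count, MPI_INT,
                     partner, 0, comm, &status);   /* blocking receive */
            MPI_Wait(&req, &status);
        }
    }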
MPI_Alltoall
[Figure: MPI_Alltoall total execution time (μs) vs. message size (256 words to 7 KW), MDE 1.3.5]
MPI Send/Recv Results on MDE 2.0.1
[Figure: MPI_Send/Recv time (cycles, averaged over 1,000 iterations) vs. message size, MDE 1.3.5 vs. MDE 2.0.1]

Message size (words)   MDE 1.3.5   MDE 2.0.1
1                          6,325       4,740
4                          6,417       4,926
16                         6,672       5,442
64                         7,201       5,587
256                        8,144       6,504
1K                        14,324       8,978
4K                        32,357      27,208
16K                      150,935     150,825
32K                      392,771     398,265
64K                      775,209     797,503

About a 25% reduction for small messages on MDE 2.0.1
Conclusion
MPI implementation on Tile64/Maestro is complete, with successful testing
Performance measurement, analysis, and optimization are complete
High performance: latency as low as 6.8 μs
The MPI library is open to the OPERA community for productive application implementation