MPI Performance Analysis and Optimization
on Tile64/Maestro
Mikyung Kang, Eunhui Park, Minkyoung Cho,
Jinwoo Suh, Dong-In Kang, and Stephen P. Crago
USC/ISI-East
July 19-23, 2009
Overview
Background
MPI
Maestro
Implementation and test
Performance
Results on MDE 1.3.5
Some results on MDE 2.0.1
MPI and Maestro
MPI
Message Passing Interface
Used extensively for communications among processors in
multi-node computing systems
Example (a minimal sketch follows below)
MPI_Send(…)
MPI_Recv(…)
MPI_Bcast(…)
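A minimal sketch of these three calls using the standard MPI C API; the ranks, tag 0, and the payload value are illustrative choices, not from the slides:

    /* Minimal sketch: one point-to-point exchange plus a broadcast. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;                                  /* illustrative payload */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root 0 to all ranks */
        printf("rank %d: data = %d\n", rank, data);
        MPI_Finalize();
        return 0;
    }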
Maestro
A Tile64-based processor
Tile64 is a 64-core processor from Tilera
Maestro contains 49 cores
Radiation-hardened processor
Floating-point unit included
MDE
Multicore Development Environment from Tilera
MPI Library on Tile64/Maestro
Fully compliant with the MPI 1.2 specification
Implemented on top of a modified iLib library
iLib is a communication library from Tilera
The iLib layer is invisible to users
Can be used on both Tile64 and Maestro
Does not use floating-point operations
Independent of the number of tiles
MPI Testing
MPI benchmarks tested
IBM test suite: 83 tests
Intel test suite: 764 tests
MPICH test suite: 161 tests
SPEC MPI: two tests
MPI Performance
MPI performance measurement, analysis, and optimization
Point-to-point communication
Send and receive
Collective communication
Broadcast
Allreduce (reduce + broadcast)
All-to-all
Total Execution Time for Send/Receive
Send/receive pair communication using two tiles
Latency: about 30 μs for small data with a cold instruction cache
6.8 μs with a warm instruction cache
Measured at 700 MHz on a TILExpress board with Tile64
[Figure: Send/Recv total execution time vs. message size, MDE 1.3.5]

Message size (words)    1    4   16   64  256   1K   4K   16K    64K
ilib time (μs)         20   21   22   23   26   36   71   271  1,096
MPI time (μs)          30   30   30   31   31   41   77   287  1,109
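A hedged sketch of the kind of timing loop behind numbers like these (our assumption about the harness, not the authors' code); MPI_Wtime() returns wall-clock seconds:

    /* Time 1,000 send/receive pairs between ranks 0 and 1 and report the
       average per-iteration cost. Message size here is 256 words. */
    #include <mpi.h>
    #include <stdio.h>

    #define NWORDS 256
    #define ITERS  1000

    int main(int argc, char **argv)
    {
        int rank, i, buf[NWORDS];
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < NWORDS; i++)
            buf[i] = i;                       /* fill the message buffer */

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(buf, NWORDS, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, NWORDS, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("avg per-iteration time: %.2f us\n",
                   1e6 * (t1 - t0) / ITERS);
        MPI_Finalize();
        return 0;
    }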
Cycles/Word for Send/Receive
As the data size gets larger, the initial overhead is amortized: the data transfer time per word decreases as the message size increases
[Figure: Send/Recv time per word vs. message size, MDE 1.3.5]

Message size (words)        1      4     16    64   256   1K   4K  16K  64K
ilib (cycles/word)     14,045  3,690    950   247    72   25   12   12   12
MPI (cycles/word)      20,685  5,177  1,324   335    85   28   13   12   12
Send/Recv Iteration Test
[Figure: MPI_Send/Recv time vs. message size, 1 iteration vs. the average over 1,000 iterations, MDE 1.3.5]

Message size (words)         1        4       16       64      256       1K       4K      16K      32K      64K
1 time (cycles)         20,685   20,706   21,176   21,420   21,701   28,727   53,959  200,720  410,155  776,035
1,000 times (cycles)     6,325    6,417    6,672    7,201    8,144   14,324   32,357  150,935  392,771  775,209

The gap between a single call and the per-iteration average over 1,000 iterations of MPI_Send/Recv is due to instruction cache misses: repeated calls run with a warm instruction cache.
Communication Cost Breakdown
Hardware lower bound
1 + w cycles for a w-word message (1 cycle for sending a header word)
Lower bound by compiled code
1-word case: 1 cycle
2 KW / 16 KW / 64 KW cases: 1.5 cycles per word
Compiled code with optimization flags uses an unrolled loop (see the sketch below)
For an even number of words, each pair of words is sent in a 5-instruction loop
Innermost loop: initial setup instructions are needed, and the maximum number of iterations is set to 24, so the number of cycles per word is a function of the message size
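An illustrative C sketch of the unrolled send pattern (not Tilera's actual compiled code); udn_send_word() is a hypothetical stand-in for the single-word network-send primitive:

    /* Hypothetical single-word send over the on-chip network. */
    extern void udn_send_word(unsigned word);

    static void send_words(const unsigned *buf, int nwords)
    {
        int i;
        /* Unrolled: each pair of words goes out in one short loop body. */
        for (i = 0; i + 1 < nwords; i += 2) {
            udn_send_word(buf[i]);
            udn_send_word(buf[i + 1]);
        }
        if (i < nwords)              /* tail word when nwords is odd */
            udn_send_word(buf[i]);
    }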
Communication Cost Breakdown (2)
Header overhead
The header contains the destination tile and the message size
Overhead includes the cycles needed to prepare the header words
Cache miss cost
Stalls caused by data and instruction cache misses
Message data is loaded into the cache when the data is prepared for the send operation, before the MPI call
As the MPI call goes through several subroutines, part or all of the data is evicted due to cache conflicts
As the message size increases, more and more cycles are spent on cache misses
Communication Cost Breakdown (3)
iLib overhead
Extra cycles needed for iLib code execution
Creating the transaction record, checking it, and managing the transaction
Amortized over the message length
MPI overhead
Error checking, partner-tile null checking, tag setting, rank determination, and status setting
The cost per word decreases as the message size increases
Cost Analysis for 1 Word Message
Communication cost analysis (MDE 1.3.5):

Communication cost for 1 W                  Cycles    Ratio
(1) Core data send
    Hardware lower bound                         2     0.0%
    Lower bound by compiled code                 1     0.0%
    Udn_send                                     7     0.0%
    Initial overhead for innermost loop         17     0.1%
(2) Header overhead                             22     0.1%
(3) iLib overhead
    iLib overhead_base                       9,897    47.8%
    iLib overhead_ISI                        3,977    19.2%
(4) MPI overhead                             6,762    32.7%
Total                                       20,685   100.0%
Cost Analysis for Large Size Message
[Figure: stacked communication cost breakdown, send/receive time (cycles/word) vs. message size]

Communication cost (cycles/word)       2 KW   16 KW   64 KW
Hardware lower bound                   1.00    1.00    1.00
Lower bound by compiled code           1.50    1.50    1.50
Initial overhead for innermost loop    0.26    0.26    0.26
Cache miss cost                        1.91    6.46    7.51
Header overhead                        0.44    0.44    0.44
iLib overhead_base                     8.53    0.81    0.79
iLib overhead_ISI                      1.55    1.35    0.21
MPI overhead                           2.65    0.45    0.14
MPI_Send/Recv on Multiple Tiles
[Figure: MPI_Send/Recv total execution time (μs) vs. number of tiles (2 to 32), MDE 1.3.5, for message sizes of 1, 256, 1,024, 4,096, and 32,768 words]

The increased communication time is due to memory contention:
Message data contention
Transaction-management-related data contention
Collective Communication
Collective communications in MPI
Broadcast
Allreduce
Alltoall
Collective Communication
[Diagram: Broadcast copies the root's data A0 to every process. Scatter splits the root's data A0..A5 across the processes, one element each. Gather is the inverse, collecting one element from each process at the root.]
Collective Communication
[Diagram: Allgather gives every process the full set A0..F0, one element contributed by each process. Alltoall transposes the data: process i sends its j-th element to process j and receives the i-th element from every other process.]
MPI_Bcast
Broadcast is implemented with a tree-like communication pattern (a sketch follows the diagram below)
Broadcast was performed on multiple tiles with various message sizes
1. The root node sends the message to another tile
2. The two nodes send data to another two nodes, and so on
For N tiles, there are lg N communication stages
[Diagram: binomial-tree broadcast over 32 tiles; stage 1: tile 0 sends to tile 16; stage 2: tiles 0 and 16 send to tiles 8 and 24; the stages continue until all 32 tiles have the data]
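A hedged C sketch of this tree pattern (the library's actual source is not shown in the slides); it assumes the number of tiles is a power of two and rank 0 is the root:

    /* In lg N stages, each tile that already holds the data forwards it
       to the tile 'step' ranks away, halving 'step' each stage. */
    #include <mpi.h>

    void tree_bcast(int *buf, int count, MPI_Comm comm)
    {
        int rank, size, step;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (step = size / 2; step >= 1; step /= 2) {
            if (rank % (2 * step) == 0)          /* already has the data */
                MPI_Send(buf, count, MPI_INT, rank + step, 0, comm);
            else if (rank % step == 0)           /* receives at this stage */
                MPI_Recv(buf, count, MPI_INT, rank - step, 0, comm, &status);
        }
    }

For 32 tiles this reproduces the stages in the diagram: 0 to 16, then 0 and 16 to 8 and 24, and so on for lg 32 = 5 stages.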
MPI_Bcast
[Figure: MPI_Bcast total execution time (μs) vs. message size (256 words to 32 KW), MDE 1.3.5]
MPI_Allreduce
A reduce operation followed by a broadcast operation (a minimal sketch follows)
Tree-like communication is used if the operation is associative; the implementation is similar to the inverse of broadcast
Serial communication is used for non-associative cases
The reduce phase takes many more cycles than the broadcast because it performs additional work, such as element-wise reduction of the message
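A minimal sketch of this two-phase structure using the standard MPI collectives (the library's internal code uses its own tree implementation, as described above); sendbuf, recvbuf, count, and comm are assumed to be set up by the caller, and MPI_SUM is an illustrative choice of operation:

    /* Allreduce as the two phases named above: reduce to the root,
       then broadcast the result to every rank. */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, 0, comm);
    MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);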
[Diagram: tree-like reduce over 32 tiles, the inverse of the broadcast tree; at each stage pairs of tiles combine partial sums until tile 0 holds the final result]
MPI_Allreduce
[Figure: MPI_Allreduce total execution time (μs) vs. message size (256 words to 32 KW), MDE 1.3.5]
MPI_Alltoall
All-to-all communication needs to handle a large amount of data
Implemented as a combination of two algorithms (a sketch of the small-message case follows below)
Message size > 1 KW: half of the tiles send to the other half
Message size <= 1 KW: nonblocking send followed by blocking receive, in N-1 stages, where N is the number of tiles
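A hedged sketch of the small-message (<= 1 KW) algorithm described above: N-1 stages, each pairing a nonblocking send with a blocking receive. The rank ^ stage partner schedule is one common pairwise pattern and an assumption here (it requires a power-of-two number of tiles), not confirmed by the slides:

    #include <mpi.h>
    #include <string.h>

    void small_alltoall(const int *sendbuf, int *recvbuf, int count,
                        MPI_Comm comm)
    {
        int rank, size, stage, partner;
        MPI_Request req;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Copy our own block locally. */
        memcpy(recvbuf + rank * count, sendbuf + rank * count,
               count * sizeof(int));

        for (stage = 1; stage < size; stage++) {   /* N-1 stages */
            partner = rank ^ stage;                /* assumed schedule */
            MPI_Isend((void *)(sendbuf + partner * count), count, MPI_INT,
                      partner, 0, comm, &req);     /* nonblocking send */
            MPI_Recv(recvbuf + partner * count, count, MPI_INT,
                     partner, 0, comm, &status);   /* blocking receive */
            MPI_Wait(&req, &status);
        }
    }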
MPI_Alltoall
[Figure: MPI_Alltoall total execution time (μs) vs. message size (256 words to 7 KW), MDE 1.3.5]
MPI Send/Recv Results on MDE 2.0.1
[Figure: MPI_Send/Recv time (cycles, averaged over 1,000 iterations) vs. message size, MDE 1.3.5 vs. MDE 2.0.1]

Message size (words)   MDE 1.3.5   MDE 2.0.1
1                          6,325       4,740
4                          6,417       4,926
16                         6,672       5,442
64                         7,201       5,587
256                        8,144       6,504
1K                        14,324       8,978
4K                        32,357      27,208
16K                      150,935     150,825
32K                      392,771     398,265
64K                      775,209     797,503

About a 25% reduction for small messages on MDE 2.0.1
Conclusion
MPI implementation on Tile64/Maestro is complete, with successful testing
Performance measurement, analysis, and optimization are complete
High performance: latency as low as 6.8 μs
The MPI library is open to the OPERA community for productive application implementation