TRANSCRIPT
Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI
Intra-node Communication on Multi-core Systems
Lei Chai, Ping Lai, Hyun-Wook Jin*, and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
*Department of Computer Science and Engineering, Konkuk University, Seoul, Korea
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Motivation
• Single-core processors are reaching physical limits
  – Overheating
  – Power consumption
  – Design complexity
• Multi-core processors
  – Two or more processing cores on the same chip
  – Workload is divided among the cores
  – Chip-level Multi-Processor (CMP)
• Widely deployed in clusters
• More than 80% of the sites on the June 2008 Top500 supercomputer list use multi-core processors
• MPI intra-node communication is more critical than ever!
Architectures of Multi-core Processors
[Figure: cache hierarchies of three multi-core processors — Intel Clovertown (quad-core, built from two Woodcrest dies, with an L2 cache shared by each pair of cores), dual-core Opteron (separate L2 caches), and quad-core Opteron (separate L2 caches with a shared L3 cache).]
Multi-core Cluster Setup
[Figure: a multi-core cluster connecting, over a network, a quad-core node with shared L2 caches and a dual-core NUMA node with separate L2 caches. The diagram labels the communication levels: inter-node (across the network), intra-CMP (within a chip, possibly through a shared cache), intra-socket, inter-socket, and inter-CMP.]
Multiple levels of intra-node communication!
Shared Memory Approach
• All processes mmap a temporary file that serves as the shared-memory buffers for communication (see the sketch below)
• Minimal startup time
• Two copies
[Figure: Process 0 copies its send buffer into the shared-memory region, and Process 1 copies the data out into its receive buffer — two copies in total.]
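To make the two-copy mechanism concrete, here is a minimal, single-process sketch of setting up such a region with mmap. The file name and size are assumptions for illustration, not MVAPICH's actual layout.

```c
/* Minimal sketch of the shared-memory approach: a temporary file is
 * mmap-ed MAP_SHARED, so every local MPI process that maps the same file
 * sees the same buffers. Name and size are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_FILE "/tmp/mpi_shmem_demo"   /* hypothetical backing file */
#define SHM_SIZE (1 << 20)               /* 1 MB of shared buffers    */

int main(void)
{
    int fd = open(SHM_FILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED makes writes by one process visible to the others.
     * A send then costs two copies: send buffer -> shared region (by the
     * sender), shared region -> receive buffer (by the receiver). */
    char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(shm, "hello", 6);              /* stands in for copy #1 */
    printf("shared region holds: %s\n", shm);

    munmap(shm, SHM_SIZE);
    close(fd);
    unlink(SHM_FILE);
    return 0;
}
```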
Kernel Assisted Direct Copy
• Takes help from the operating system to copy messages directly from one process's user buffer to another
• Kernel overhead
• One copy
• LiMIC
  – Stand-alone communication module
  – Implements its own message queues and performs message matching, etc.
  – Designed for Linux kernel 2.4
• H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, "LiMIC: Support for High-Performance MPI Intra-node Communication on Linux Cluster", International Conference on Parallel Processing (ICPP) 2005.
[Figure: the message is copied in kernel space directly from the sender's user buffer to the receiver's user buffer.]
Kernel Assisted Direct Copy (Cont’d)
• LiMIC2
  – Lightweight communication primitives
  – Only implements memory-mapping and data-movement primitives
  – Depends on the MPI library for message matching, etc.
  – Designed for Linux kernel 2.6
  – Uses a rendezvous protocol for send/receive
  – May potentially be affected by process skew
• H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, "Lightweight Kernel-Level Primitives for High-Performance MPI Intra-Node Communication over Multi-Core Systems", IEEE Cluster 2007 (Poster).
[Figure: rendezvous protocol — the sender sends a request-to-send carrying its buffer information; the receiver performs the memory mapping and direct copy, then returns a completion notification to the sender.]
Problem Statement
• What is the relative performance of these two approaches?
• Can one of them be sufficient for MPI intra-node communication, especially on multi-core clusters?
• Can we design an efficient hybrid approach that optimizes performance?
• Can applications benefit from the hybrid approach?
Methodology
• Step 1: Initial evaluation
  – To study the characteristics of the two approaches
• Step 2: Designing a hybrid approach
  – To overcome the potential disadvantages of each individual approach (e.g., the process-skew issue in LiMIC2) and best combine the two approaches
• Step 3: A comprehensive performance evaluation of the hybrid approach
  – Micro-benchmarks
  – Collective operations
  – Applications
MVAPICH and MVAPICH-LiMIC2
• MVAPICH
  – High-performance, scalable, and fault-tolerant MPI library over
    • InfiniBand
    • 10GigE/iWARP
    • Other RDMA-enabled interconnects
  – Developed by the Network-Based Computing Laboratory, OSU
  – Used by more than 750 organizations worldwide, including many of the Top500 supercomputers
  – For example, the Ranger system at the Texas Advanced Computing Center (TACC), ranked 4th in the June 2008 list
  – Current release versions use the shared memory approach for intra-node communication
• MVAPICH-LiMIC2
  – An integrated version of MVAPICH with LiMIC2
  – Uses kernel-assisted direct copy for intra-node communication
  – Will be available in future releases
http://mvapich.cse.ohio-state.edu/
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Experimental System Setup
• Intel Clovertown cluster
– Dual-socket, quad-core Intel Xeon (Clovertown) processors, 2.0 GHz
– 8 cores per node; nodes connected by InfiniBand
– Each pair of cores on the same chip shares a 4 MB L2 cache
– Linux 2.6.18
Experiments
• Micro-benchmarks
  – OSU multi-pair bandwidth test
    • Without buffer reuse
    • With buffer reuse
  – Skew-effect benchmark
• Tool to profile L2 cache utilization
  – OProfile (http://oprofile.sourceforge.net)
• Experiments
  – Impact of processor topology
  – Impact of buffer reuse
  – L2 cache utilization
  – Impact of process skew
Impact of Processor Topology
• MVAPICH-LiMIC2 performs better for medium and large messages
• Thresholds:
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
[Figure: multi-pair bandwidth (MB/s) vs. message size (bytes) for MVAPICH and MVAPICH-LiMIC2 in three cases — shared cache, intra-socket, and inter-socket.]
Impact of Buffer Reuse
• Send/receive multiple iterations over the same buffer
• Buffer reuse helps the performance of both MVAPICH and MVAPICH-LiMIC2
• MVAPICH-LiMIC2 benefits more from buffer reuse for medium messages because of better cache utilization
• Applications with more buffer reuse will potentially benefit more from MVAPICH-LiMIC2
[Figure: intra-socket bandwidth (MB/s) vs. message size for MVAPICH and MVAPICH-LiMIC2, each with and without buffer reuse.]
L2 Cache Utilization
• Experiments are run with buffer reuse
• The number of cache misses increases as the message size increases
• MVAPICH-LiMIC2 has fewer cache misses than MVAPICH because it eliminates the intermediate copy
[Figure: intra-socket number of L2 cache misses vs. message size (1 KB to 4 MB) for MVAPICH and MVAPICH-LiMIC2, with the improvement (%) from MVAPICH-LiMIC2.]
Impact of Process Skew
• Process skew can potentially degrade application performance
• Skew tolerance is important!
• We designed a benchmark to measure the effect of process skew on performance (see the sketch below)
[Figure: skew benchmark — the producer repeats (compute c1; MPI_Isend) n times and then calls MPI_Waitall; the consumer repeats (compute c2; MPI_Recv) n times, where c2 is much larger than c1. We measure the total latency on the producer side, c3.]
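Below is a minimal sketch of such a producer-consumer benchmark, following the structure in the figure; the window size, message size, and compute times are illustrative, not the parameters used in the study. It assumes two MPI ranks on the same node.

```c
/* Skew benchmark sketch: rank 0 (producer) issues MPI_Isend after a small
 * compute phase c1; rank 1 (consumer) does a larger compute phase c2
 * before each MPI_Recv. The producer's total latency is reported as c3. */
#include <mpi.h>
#include <stdio.h>

#define N   64             /* window size (number of outstanding sends) */
#define LEN (64 * 1024)    /* message size in bytes                     */

static void compute(double usec)          /* busy-wait "computation"    */
{
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < usec)
        ;
}

int main(int argc, char **argv)
{
    int rank;
    static char buf[N][LEN];
    MPI_Request req[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* producer */
        double t0 = MPI_Wtime();
        for (int i = 0; i < N; i++) {
            compute(10.0);                /* c1: small compute phase */
            MPI_Isend(buf[i], LEN, MPI_CHAR, 1, i, MPI_COMM_WORLD, &req[i]);
        }
        MPI_Waitall(N, req, MPI_STATUSES_IGNORE);
        printf("c3 (producer total latency): %f us\n",
               (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {               /* consumer */
        for (int i = 0; i < N; i++) {
            compute(250.0);               /* c2 >> c1: skewed receiver */
            MPI_Recv(buf[i], LEN, MPI_CHAR, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}
```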
Process Skew Effects
• MVAPICH-LiMIC2: Send cannot complete until the corresponding receive is posted
• MVAPICH: Uses an intermediate buffer and is potentially more skew tolerant
• Theoretically, with window size n:
  – c3(MVAPICH) = (c1 + t(MPI_Isend)) × n + t(MPI_Waitall)
  – c3(MVAPICH-LiMIC2) = (t(MPI_Recv) + c2) × n + t(MPI_Waitall)
• Experimental results conform to this expectation
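As an illustration with assumed numbers (not measurements from the study): take c1 = 10 us, c2 = 250 us, n = 64, and roughly 1 us per MPI call. Then c3(MVAPICH) ≈ (10 + 1) × 64 ≈ 0.7 ms, while c3(MVAPICH-LiMIC2) ≈ (1 + 250) × 64 ≈ 16 ms. Once every send must wait for its matching receive, the producer's total latency grows with the consumer's computation time c2.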
[Figure: producer total latency c3 (us) vs. consumer computation time c2 (100-340 us) for MVAPICH and MVAPICH-LiMIC2.]
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Topology Aware Thresholds
• The threshold for switching from shared memory to LiMIC2 needs to be different for different cases
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
• Topology information is parsed from the "sysfs" file system
  – Under /sys/devices/system/cpu/cpuX/topology/
  – physical_package_id: physical socket id of the logical CPU
  – core_id: core id of the logical CPU within the socket
• Every process learns the topology from this information
• CPU affinity is used to pin processes to processors (see the sketch below)
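A rough sketch of how the topology detection, affinity pinning, and threshold selection could fit together. The sysfs fields are the ones named above, but the helper names and the shared-L2 pairing rule are assumptions for illustration, not MVAPICH's actual code.

```c
/* Sketch: read physical_package_id / core_id from sysfs, pin the process
 * with CPU affinity, and pick the shared-memory -> LiMIC2 switch point
 * based on where the peer core lives. */
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdio.h>

static int read_topology_field(int cpu, const char *field)
{
    char path[128];
    int val = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, field);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &val); fclose(f); }
    return val;
}

static void pin_to_cpu(int cpu)          /* CPU affinity: bind to one core */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

/* Thresholds (bytes) from the Clovertown evaluation above. */
static size_t limic2_threshold(int my_cpu, int peer_cpu)
{
    int my_sock   = read_topology_field(my_cpu,   "physical_package_id");
    int peer_sock = read_topology_field(peer_cpu, "physical_package_id");

    if (my_sock != peer_sock)
        return 1 * 1024;                 /* inter-socket: 1 KB  */

    int my_core   = read_topology_field(my_cpu,   "core_id");
    int peer_core = read_topology_field(peer_cpu, "core_id");

    /* Assumed pairing rule: cores 2k and 2k+1 of a socket share an L2. */
    if (my_core / 2 == peer_core / 2)
        return 32 * 1024;                /* shared cache: 32 KB */
    return 2 * 1024;                     /* intra-socket: 2 KB  */
}
```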
Skew Aware Thresholds
• Unexpected message queue
  – Holds messages that arrive before the matching receive call has been posted
  – Its length indicates the extent of process skew
• Skew awareness
  – The receiver keeps track of the length of the unexpected message queue
  – It sends feedback to the sender
  – The sender adjusts its thresholds accordingly (see the sketch below)
[Figure: when the receiver detects that the unexpected queue has grown too long, it sends feedback and the sender increases its threshold; when the queue length returns to normal, feedback causes the sender to decrease the threshold again.]
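A minimal sketch of how such a feedback loop could adjust the threshold. The queue-length limit, the doubling/halving policy, and the feedback encoding are assumptions for illustration, not the actual MVAPICH-LiMIC2 mechanism.

```c
/* Skew-aware threshold adaptation sketch (policy values are assumed). */
#include <stddef.h>

#define UNEXPECTED_QUEUE_LIMIT 16         /* "too long" (assumed)           */
#define MIN_THRESHOLD (2 * 1024)          /* topology-aware floor (assumed) */
#define MAX_THRESHOLD (1024 * 1024)

static size_t current_threshold = 2 * 1024;   /* start at the topology value */

/* Receiver side: derive feedback from the unexpected-queue length. */
int receiver_feedback(int unexpected_queue_len)
{
    return unexpected_queue_len > UNEXPECTED_QUEUE_LIMIT ? +1 : -1;
}

/* Sender side: apply the feedback. Raising the threshold routes more
 * message sizes through the buffered, skew-tolerant shared-memory path;
 * lowering it restores the direct-copy path once the skew disappears. */
void sender_apply_feedback(int feedback)
{
    if (feedback > 0 && current_threshold < MAX_THRESHOLD)
        current_threshold *= 2;
    else if (feedback < 0 && current_threshold > MIN_THRESHOLD)
        current_threshold /= 2;
}

/* Message dispatch: shared memory below the threshold, LiMIC2 above it. */
int use_limic2(size_t msg_size)
{
    return msg_size >= current_threshold;
}
```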
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Results of Topology Aware Thresholds
• Topology-aware thresholds:
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
• Performance is optimized in all three cases
[Figure: bandwidth (MB/s) vs. message size (bytes) for MVAPICH and MVAPICH-LiMIC2-opt in the shared-cache, intra-socket, and inter-socket cases.]
Results of Skew Aware Thresholds
• The sending process can detect process skew quickly and adapt to it
• The producer latency is much lower with skew detection
  – Close to that of MVAPICH
[Figure: producer total latency c3 (us) vs. consumer computation time c2 (100-340 us) for MVAPICH, MVAPICH-LiMIC2, and MVAPICH-LiMIC2-opt.]
Performance of Collectives
• Improvements:
  – MPI_Alltoall: 60%
  – MPI_Allgather: 28%
  – MPI_Allreduce: 21%
• MPI_Allreduce
  – MVAPICH performs better for 1 KB-8 KB messages
  – MVAPICH uses optimized algorithms
  – MVAPICH-LiMIC2-opt can be further optimized for collective operations
[Figure: latency (us) and improvement (%) vs. message size (1 KB to 4 MB) for MPI_Alltoall, MPI_Allgather, and MPI_Allreduce with MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (1)
• Performance on a single node with 8 processes
• MVAPICH-LiMIC2-opt improves the performance of FT, PSTSWM, and IS significantly
  – FT: 8%
  – PSTSWM: 14%
  – IS: 17%
• Cache utilization is improved by up to 6%
[Figure: application execution time (s) with improvement (%), and number of L2 cache misses with improvement (%), for CG, MG, FT, PSTSWM, and IS on one node with 8 processes (1x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (2)
• Performance on a single node with 8 processes
• The performance improvement is under 2% for LU, HPL, BT, and SP
  – Communication time is not significant, or
  – Not many large messages are transferred
• Applications that transfer more large messages can benefit more from MVAPICH-LiMIC2-opt
[Figure: application execution time (s) with improvement (%), and number of L2 cache misses with improvement (%), for LU, HPL, BT, and SP on one node with 8 processes (1x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (3)
• Performance on 8 nodes with 8 processes each (8x8)
• PSTSWM still benefits from MVAPICH-LiMIC2-opt
• MVAPICH-LiMIC2-opt is promising for clusters
[Figure: application execution time (s) and improvement (%) for CG, MG, FT, PSTSWM, and IS on 8 nodes with 64 processes (8x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Conclusions
• An efficient hybrid approach for MPI intra-node communication
  – Shared memory (MVAPICH)
  – OS-kernel-assisted direct copy (MVAPICH-LiMIC2)
• 1st step: Initial evaluation
  – MVAPICH-LiMIC2 provides better performance than MVAPICH, depending on the message size and physical topology
    • Fewer copies
    • Efficient cache utilization
  – MVAPICH-LiMIC2 is less skew-tolerant than MVAPICH
• 2nd step: Designing an efficient hybrid approach
  – Topology-aware thresholds
  – Skew-aware thresholds
Conclusions (Cont’d)
• 3rd step: Comprehensive performance evaluation
  – MPI_Alltoall, MPI_Allgather, and MPI_Allreduce performance is improved by up to 60%, 28%, and 21%, respectively
  – FT, PSTSWM, and IS performance is improved by 8%, 14%, and 17%, respectively
• Software distribution
  – Will be available in upcoming MVAPICH releases!
Future Work
• Optimization of collective algorithms for MVAPICH-LiMIC2
• Evaluation on AMD dual-core and quad-core platforms
• In-depth studies of how improvements in MPI intra-node communication benefit application performance
Acknowledgements
Our research is supported by the following organizations.
• Current Funding support by
• Current Equipment support by
Thank you!
{chail, laipi, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
System Software Lab: http://sslab.konkuk.ac.kr