TRANSCRIPT
Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI
Intra-node Communication on Multi-core Systems
Lei Chai, Ping Lai, Hyun-Wook Jin*, and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
*Department of Computer Science and Engineering, Konkuk University, Seoul, Korea
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Motivation
• Single-core processors are reaching physical limits
  – Overheating
  – Power consumption
  – Design complexity
• Multi-core processors
  – Two or more processing cores on the same chip
  – Workload is divided among the cores
  – Chip-level Multi-Processor (CMP)
• Widely deployed in clusters
• More than 80% of the sites on the June 2008 Top500 supercomputer list use multi-core processors
• MPI intra-node communication is more critical than ever!
Architectures of Multi-core Processors
[Figure: cache hierarchies of three multi-core processors — Intel Clovertown (quad-core, built from two Woodcrest dies, with an L2 cache shared by each pair of cores), dual-core Opteron (separate L2 caches), and quad-core Opteron (separate L2 caches with a shared L3 cache).]
Multi-core Cluster Setup
[Figure: a multi-core cluster connecting, over a network, a quad-core node with shared L2 caches and a dual-core NUMA node with separate L2 caches. The diagram labels the communication levels: inter-node (across the network), intra-CMP (within a chip, possibly through a shared cache), intra-socket, inter-socket, and inter-CMP.]
Multiple levels of intra-node communication!
Shared Memory Approach
• All processes mmap a temporary file that serves as the shared-memory buffers for communication (see the sketch below)
• Minimal startup time
• Two copies
[Figure: Process 0 copies its send buffer into the shared-memory region, and Process 1 copies the data out into its receive buffer — two copies in total.]
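To make the two-copy mechanism concrete, here is a minimal, single-process sketch of setting up such a region with mmap. The file name and size are assumptions for illustration, not MVAPICH's actual layout.

```c
/* Minimal sketch of the shared-memory approach: a temporary file is
 * mmap-ed MAP_SHARED, so every local MPI process that maps the same file
 * sees the same buffers. Name and size are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_FILE "/tmp/mpi_shmem_demo"   /* hypothetical backing file */
#define SHM_SIZE (1 << 20)               /* 1 MB of shared buffers    */

int main(void)
{
    int fd = open(SHM_FILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED makes writes by one process visible to the others.
     * A send then costs two copies: send buffer -> shared region (by the
     * sender), shared region -> receive buffer (by the receiver). */
    char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(shm, "hello", 6);              /* stands in for copy #1 */
    printf("shared region holds: %s\n", shm);

    munmap(shm, SHM_SIZE);
    close(fd);
    unlink(SHM_FILE);
    return 0;
}
```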
Kernel Assisted Direct Copy
• Takes help from the operating system to copy messages directly from one process's user buffer to another
• Kernel overhead
• One copy
• LiMIC
  – Stand-alone communication module
  – Implements its own message queues and performs message matching, etc.
  – Designed for Linux kernel 2.4
• H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, "LiMIC: Support for High-Performance MPI Intra-node Communication on Linux Cluster", International Conference on Parallel Processing (ICPP) 2005.
[Figure: the message is copied in kernel space directly from the sender's user buffer to the receiver's user buffer.]
Kernel Assisted Direct Copy (Cont’d)
• LiMIC2
  – Lightweight communication primitives
  – Only implements memory-mapping and data-movement primitives
  – Depends on the MPI library for message matching, etc.
  – Designed for Linux kernel 2.6
  – Uses a rendezvous protocol for send/receive
  – May potentially be affected by process skew
• H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, "Lightweight Kernel-Level Primitives for High-Performance MPI Intra-Node Communication over Multi-Core Systems", IEEE Cluster 2007 (Poster).
[Figure: rendezvous protocol — the sender sends a request-to-send carrying its buffer information; the receiver performs the memory mapping and direct copy, then returns a completion notification to the sender.]
Problem Statement
• What is the relative performance of these two approaches?
• Can one of them be sufficient for MPI intra-node communication, especially on multi-core clusters?
• Can we design an efficient hybrid approach that optimizes performance?
• Can applications benefit from the hybrid approach?
Methodology
• Step 1: Initial evaluation
  – To study the characteristics of the two approaches
• Step 2: Designing a hybrid approach
  – To overcome the potential disadvantages of each individual approach (e.g., the process-skew issue in LiMIC2) and best combine the two approaches
• Step 3: A comprehensive performance evaluation of the hybrid approach
  – Micro-benchmarks
  – Collective operations
  – Applications
MVAPICH and MVAPICH-LiMIC2
• MVAPICH
  – High-performance, scalable, and fault-tolerant MPI library over
    • InfiniBand
    • 10GigE/iWARP
    • Other RDMA-enabled interconnects
  – Developed by the Network-Based Computing Laboratory, OSU
  – Used by more than 750 organizations worldwide, including many of the Top500 supercomputers
  – For example, the Ranger system at the Texas Advanced Computing Center (TACC), ranked 4th in the June 2008 list
  – Current release versions use the shared memory approach for intra-node communication
• MVAPICH-LiMIC2
  – An integrated version of MVAPICH with LiMIC2
  – Uses kernel-assisted direct copy for intra-node communication
  – Will be available in future releases
http://mvapich.cse.ohio-state.edu/
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Experimental System Setup
• Intel Clovertown cluster
– Dual-socket, quad-core Intel Xeon (Clovertown) processors, 2.0 GHz
– 8 cores per node; nodes connected by InfiniBand
– Each pair of cores on the same chip shares a 4 MB L2 cache
– Linux 2.6.18
Experiments
• Micro-benchmarks
  – OSU multi-pair bandwidth test
    • Without buffer reuse
    • With buffer reuse
  – Skew-effect benchmark
• Tool to profile L2 cache utilization
  – OProfile (http://oprofile.sourceforge.net)
• Experiments
  – Impact of processor topology
  – Impact of buffer reuse
  – L2 cache utilization
  – Impact of process skew
Impact of Processor Topology
• MVAPICH-LiMIC2 performs better for medium and large messages
• Thresholds:
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
[Figure: multi-pair bandwidth (MB/s) vs. message size (bytes) for MVAPICH and MVAPICH-LiMIC2 in three cases — shared cache, intra-socket, and inter-socket.]
Impact of Buffer Reuse
• Send/receive multiple iterations over the same buffer
• Buffer reuse helps the performance of both MVAPICH and MVAPICH-LiMIC2
• MVAPICH-LiMIC2 benefits more from buffer reuse for medium messages because of better cache utilization
• Applications with more buffer reuse will potentially benefit more from MVAPICH-LiMIC2
[Figure: intra-socket bandwidth (MB/s) vs. message size for MVAPICH and MVAPICH-LiMIC2, each with and without buffer reuse.]
L2 Cache Utilization
• Experiments are run with buffer reuse
• The number of cache misses increases as the message size increases
• MVAPICH-LiMIC2 has fewer cache misses than MVAPICH because it eliminates the intermediate copy
[Figure: intra-socket number of L2 cache misses vs. message size (1 KB to 4 MB) for MVAPICH and MVAPICH-LiMIC2, with the improvement (%) from MVAPICH-LiMIC2.]
Impact of Process Skew
• Process skew can potentially degrade application performance
• Skew tolerance is important!
• We designed a benchmark to measure the effect of process skew on performance (see the sketch below)
[Figure: skew benchmark — the producer repeats (compute c1; MPI_Isend) n times and then calls MPI_Waitall; the consumer repeats (compute c2; MPI_Recv) n times, where c2 is much larger than c1. We measure the total latency on the producer side, c3.]
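Below is a minimal sketch of such a producer-consumer benchmark, following the structure in the figure; the window size, message size, and compute times are illustrative, not the parameters used in the study. It assumes two MPI ranks on the same node.

```c
/* Skew benchmark sketch: rank 0 (producer) issues MPI_Isend after a small
 * compute phase c1; rank 1 (consumer) does a larger compute phase c2
 * before each MPI_Recv. The producer's total latency is reported as c3. */
#include <mpi.h>
#include <stdio.h>

#define N   64             /* window size (number of outstanding sends) */
#define LEN (64 * 1024)    /* message size in bytes                     */

static void compute(double usec)          /* busy-wait "computation"    */
{
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < usec)
        ;
}

int main(int argc, char **argv)
{
    int rank;
    static char buf[N][LEN];
    MPI_Request req[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* producer */
        double t0 = MPI_Wtime();
        for (int i = 0; i < N; i++) {
            compute(10.0);                /* c1: small compute phase */
            MPI_Isend(buf[i], LEN, MPI_CHAR, 1, i, MPI_COMM_WORLD, &req[i]);
        }
        MPI_Waitall(N, req, MPI_STATUSES_IGNORE);
        printf("c3 (producer total latency): %f us\n",
               (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {               /* consumer */
        for (int i = 0; i < N; i++) {
            compute(250.0);               /* c2 >> c1: skewed receiver */
            MPI_Recv(buf[i], LEN, MPI_CHAR, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}
```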
Process Skew Effects
• MVAPICH-LiMIC2: Send cannot complete until the corresponding receive is posted
• MVAPICH: Uses an intermediate buffer and is potentially more skew tolerant
• Theoretically, with window size n:
  – c3(MVAPICH) = (c1 + t(MPI_Isend)) × n + t(MPI_Waitall)
  – c3(MVAPICH-LiMIC2) = (t(MPI_Recv) + c2) × n + t(MPI_Waitall)
• Experimental results conform to this expectation
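As an illustration with assumed numbers (not measurements from the study): take c1 = 10 us, c2 = 250 us, n = 64, and roughly 1 us per MPI call. Then c3(MVAPICH) ≈ (10 + 1) × 64 ≈ 0.7 ms, while c3(MVAPICH-LiMIC2) ≈ (1 + 250) × 64 ≈ 16 ms. Once every send must wait for its matching receive, the producer's total latency grows with the consumer's computation time c2.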
[Figure: producer total latency c3 (us) vs. consumer computation time c2 (100-340 us) for MVAPICH and MVAPICH-LiMIC2.]
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Topology Aware Thresholds
• The threshold for switching from shared memory to LiMIC2 needs to be different for different cases
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
• Topology information is parsed from the "sysfs" file system
  – Under /sys/devices/system/cpu/cpuX/topology/
  – physical_package_id: physical socket id of the logical CPU
  – core_id: core id of the logical CPU within the socket
• Every process learns the topology from this information
• CPU affinity is used to pin processes to processors (see the sketch below)
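A rough sketch of how the topology detection, affinity pinning, and threshold selection could fit together. The sysfs fields are the ones named above, but the helper names and the shared-L2 pairing rule are assumptions for illustration, not MVAPICH's actual code.

```c
/* Sketch: read physical_package_id / core_id from sysfs, pin the process
 * with CPU affinity, and pick the shared-memory -> LiMIC2 switch point
 * based on where the peer core lives. */
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdio.h>

static int read_topology_field(int cpu, const char *field)
{
    char path[128];
    int val = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, field);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &val); fclose(f); }
    return val;
}

static void pin_to_cpu(int cpu)          /* CPU affinity: bind to one core */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

/* Thresholds (bytes) from the Clovertown evaluation above. */
static size_t limic2_threshold(int my_cpu, int peer_cpu)
{
    int my_sock   = read_topology_field(my_cpu,   "physical_package_id");
    int peer_sock = read_topology_field(peer_cpu, "physical_package_id");

    if (my_sock != peer_sock)
        return 1 * 1024;                 /* inter-socket: 1 KB  */

    int my_core   = read_topology_field(my_cpu,   "core_id");
    int peer_core = read_topology_field(peer_cpu, "core_id");

    /* Assumed pairing rule: cores 2k and 2k+1 of a socket share an L2. */
    if (my_core / 2 == peer_core / 2)
        return 32 * 1024;                /* shared cache: 32 KB */
    return 2 * 1024;                     /* intra-socket: 2 KB  */
}
```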
Skew Aware Thresholds
• Unexpected message queue
  – Holds messages that arrive before the matching receive call has been posted
  – Its length indicates the extent of process skew
• Skew awareness
  – The receiver keeps track of the length of the unexpected message queue
  – It sends feedback to the sender
  – The sender adjusts its thresholds accordingly (see the sketch below)
[Figure: when the receiver detects that the unexpected queue has grown too long, it sends feedback and the sender increases its threshold; when the queue length returns to normal, feedback causes the sender to decrease the threshold again.]
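A minimal sketch of how such a feedback loop could adjust the threshold. The queue-length limit, the doubling/halving policy, and the feedback encoding are assumptions for illustration, not the actual MVAPICH-LiMIC2 mechanism.

```c
/* Skew-aware threshold adaptation sketch (policy values are assumed). */
#include <stddef.h>

#define UNEXPECTED_QUEUE_LIMIT 16         /* "too long" (assumed)           */
#define MIN_THRESHOLD (2 * 1024)          /* topology-aware floor (assumed) */
#define MAX_THRESHOLD (1024 * 1024)

static size_t current_threshold = 2 * 1024;   /* start at the topology value */

/* Receiver side: derive feedback from the unexpected-queue length. */
int receiver_feedback(int unexpected_queue_len)
{
    return unexpected_queue_len > UNEXPECTED_QUEUE_LIMIT ? +1 : -1;
}

/* Sender side: apply the feedback. Raising the threshold routes more
 * message sizes through the buffered, skew-tolerant shared-memory path;
 * lowering it restores the direct-copy path once the skew disappears. */
void sender_apply_feedback(int feedback)
{
    if (feedback > 0 && current_threshold < MAX_THRESHOLD)
        current_threshold *= 2;
    else if (feedback < 0 && current_threshold > MIN_THRESHOLD)
        current_threshold /= 2;
}

/* Message dispatch: shared memory below the threshold, LiMIC2 above it. */
int use_limic2(size_t msg_size)
{
    return msg_size >= current_threshold;
}
```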
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Results of Topology Aware Thresholds
• Topology-aware thresholds:
  – Shared cache: 32 KB
  – Intra-socket: 2 KB
  – Inter-socket: 1 KB
• Performance is optimized in all three cases
[Figure: bandwidth (MB/s) vs. message size (bytes) for MVAPICH and MVAPICH-LiMIC2-opt in the shared-cache, intra-socket, and inter-socket cases.]
Results of Skew Aware Thresholds
• The sending process can detect process skew quickly and adapt to it
• The producer latency is much lower with skew detection
  – Close to that of MVAPICH
[Figure: producer total latency c3 (us) vs. consumer computation time c2 (100-340 us) for MVAPICH, MVAPICH-LiMIC2, and MVAPICH-LiMIC2-opt.]
Performance of Collectives
• Improvements:
  – MPI_Alltoall: 60%
  – MPI_Allgather: 28%
  – MPI_Allreduce: 21%
• MPI_Allreduce
  – MVAPICH performs better for 1 KB-8 KB messages
  – MVAPICH uses optimized algorithms
  – MVAPICH-LiMIC2-opt can be further optimized for collective operations
[Figure: latency (us) and improvement (%) vs. message size (1 KB to 4 MB) for MPI_Alltoall, MPI_Allgather, and MPI_Allreduce with MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (1)
• Performance on a single node with 8 processes
• MVAPICH-LiMIC2-opt improves the performance of FT, PSTSWM, and IS significantly
  – FT: 8%
  – PSTSWM: 14%
  – IS: 17%
• Cache utilization is improved by up to 6%
[Figure: application execution time (s) with improvement (%), and number of L2 cache misses with improvement (%), for CG, MG, FT, PSTSWM, and IS on one node with 8 processes (1x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (2)
• Performance on a single node with 8 processes
• The performance improvement is under 2% for LU, HPL, BT, and SP
  – Communication time is not significant, or
  – Not many large messages are transferred
• Applications that transfer more large messages can benefit more from MVAPICH-LiMIC2-opt
[Figure: application execution time (s) with improvement (%), and number of L2 cache misses with improvement (%), for LU, HPL, BT, and SP on one node with 8 processes (1x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Performance of Applications (3)
• Performance on 8 nodes with 8 processes each (8x8)
• PSTSWM still benefits from MVAPICH-LiMIC2-opt
• MVAPICH-LiMIC2-opt is promising for clusters
[Figure: application execution time (s) and improvement (%) for CG, MG, FT, PSTSWM, and IS on 8 nodes with 64 processes (8x8), comparing MVAPICH and MVAPICH-LiMIC2-opt.]
Outline
• Motivation and introduction
• Initial evaluation and analysis
• Design of the hybrid approach
• Performance results
• Conclusions and Future Work
Conclusions
• An efficient hybrid approach for MPI intra-node communication
  – Shared memory (MVAPICH)
  – OS-kernel-assisted direct copy (MVAPICH-LiMIC2)
• 1st step: Initial evaluation
  – MVAPICH-LiMIC2 provides better performance than MVAPICH, depending on the message size and physical topology
    • Fewer copies
    • Efficient cache utilization
  – MVAPICH-LiMIC2 is less skew-tolerant than MVAPICH
• 2nd step: Designing an efficient hybrid approach
  – Topology-aware thresholds
  – Skew-aware thresholds
Conclusions (Cont’d)
• 3rd step: Comprehensive performance evaluation
  – MPI_Alltoall, MPI_Allgather, and MPI_Allreduce performance is improved by up to 60%, 28%, and 21%, respectively
  – FT, PSTSWM, and IS performance is improved by 8%, 14%, and 17%, respectively
• Software distribution
  – Will be available in upcoming MVAPICH releases!
Future Work
• Optimization of collective algorithms for MVAPICH-LiMIC2
• Evaluation on AMD dual-core and quad-core platforms
• In-depth studies of how improvements in MPI intra-node communication benefit application performance
Acknowledgements
Our research is supported by the following organizations.
• Current Funding support by
• Current Equipment support by
Thank you!
{chail, laipi, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
System Software Lab: http://sslab.konkuk.ac.kr