Job Startup at Exascale: Challenges and Solutions
Sourav Chakraborty
Advisor: Dhabaleswar K. (DK) Panda
The Ohio State University
http://nowlab.cse.ohio-state.edu/
Network Based Computing Laboratory
Current Trends in HPC
• Tremendous increase in system and job sizes
• Interconnects like InfiniBand and OmniPath are dominant
• Dense many-core systems like KNL are more common
• Hybrid MPI+PGAS models becoming popular
Fast and scalable job-startup is essential!
Why is Job Startup Important?
• Development and debugging
• Regression/Acceptance testing
• Checkpoint-Restart
Towards Exascale: Challenges to Address
• Dynamic allocation of resources
• Leveraging high-performance interconnects
• Exploiting opportunities for overlap
• Minimizing memory usage
[Figure: Job Startup Performance vs. Memory Overhead, contrasting the state of the art with the goal towards exascale]
Challenge: Avoid All-to-all Connectivity
[Chart: Breakdown of initialization time (seconds) into Connection Setup, PMI Exchange, and Memory Registration for 32 to 4K processes]
Application   Processes   Average Peers
BT            64          8.7
              1024        10.6
EP            64          3.0
              1024        5.0
MG            64          9.5
              1024        11.9
SP            64          8.8
              1024        10.7
2D Heat       64          5.3
              1024        5.4
Connection setup phase takes 85% of initialization time with 4K processes
Applications rarely require full all-to-all connectivity
On-demand Connection Management
[Sequence diagram: on-demand connection establishment between Process 1 and Process 2, each with a main thread and a connection-manager thread. The main thread of Process 1 issues Put/Get(P2); its connection manager creates a QP, transitions it to Init, enqueues the send, and sends a Connect Request (LID, QPN, address, size, rkey). The connection manager of Process 2 creates its QP, transitions it through Init and RTR, and returns a Connect Reply (LID, QPN, address, size, rkey). Both QPs reach RTS, the connection is established, the queued send is dequeued, and the Put/Get(P2) completes.]
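For context, a minimal sketch of the verbs-level work behind this diagram is shown below, assuming libibverbs and that the peer's LID and QPN have already arrived in the Connect Request/Reply; the helper name, queue sizes, and timeout values are illustrative, not MVAPICH2 internals.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: create a QP on demand and drive it through
 * INIT -> RTR -> RTS using the peer's LID/QPN from the connect exchange. */
struct ibv_qp *connect_on_demand(struct ibv_pd *pd, struct ibv_cq *cq,
                                 uint16_t peer_lid, uint32_t peer_qpn,
                                 uint8_t port, uint32_t psn)
{
    struct ibv_qp_init_attr init_attr = {
        .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 }
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);   /* created lazily */
    if (!qp) return NULL;

    struct ibv_qp_attr attr = {                           /* RESET -> INIT */
        .qp_state = IBV_QPS_INIT, .pkey_index = 0, .port_num = port,
        .qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE
    };
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                             IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);

    memset(&attr, 0, sizeof(attr));                       /* INIT -> RTR */
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_2048;
    attr.dest_qp_num = peer_qpn;                          /* from Connect msg */
    attr.rq_psn = psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 12;
    attr.ah_attr.dlid = peer_lid;                         /* from Connect msg */
    attr.ah_attr.port_num = port;
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

    memset(&attr, 0, sizeof(attr));                       /* RTR -> RTS */
    attr.qp_state = IBV_QPS_RTS;
    attr.sq_psn = psn;
    attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7;
    attr.max_rd_atomic = 1;
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);

    return qp;   /* queued Put/Get operations can now be dequeued and posted */
}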
Results - On-demand Connections
[Chart: Performance of OpenSHMEM initialization and Hello World, time taken (seconds) for 16 to 8K processes, comparing Hello World and start_pes with static vs. on-demand connections]
[Chart: Execution time (seconds) of the OpenSHMEM NAS Parallel Benchmarks BT, EP, MG, and SP with static vs. on-demand connections]
Initialization is 29.6 times faster; total execution time is 35% better.
Challenge: Exploit High-performance Interconnects in PMI
• Used for network address exchange, heterogeneity detection, etc.
• Used by major parallel programming frameworks
• Uses TCP/IP for transport
• Not efficient for moving large amounts of data
• Required to bootstrap high-performance interconnects
[Chart: Breakdown of MPI_Init in MVAPICH2 into PMI Exchanges, Shared Memory, and Other, time taken (seconds) for 32 to 8K processes]
PMI = Process Management Interface
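For reference, a minimal sketch of this conventional put/fence/get exchange using the PMI2 interface (as shipped with SLURM) is shown below; the key naming, buffer sizes, and header path are illustrative assumptions.

#include <slurm/pmi2.h>
#include <stdio.h>

/* Each rank publishes its endpoint address (e.g. LID/QPN) to the PMI
 * key-value store, then reads every peer's address over the TCP-based
 * PMI channel to bootstrap the high-performance interconnect. */
void exchange_endpoints(const char *my_ep_addr)
{
    int spawned, size, rank, appnum;
    PMI2_Init(&spawned, &size, &rank, &appnum);

    char key[64], val[256];
    snprintf(key, sizeof(key), "ep-addr-%d", rank);
    PMI2_KVS_Put(key, my_ep_addr);      /* publish this rank's address */

    PMI2_KVS_Fence();                   /* make all puts globally visible */

    for (int peer = 0; peer < size; peer++) {
        int vallen;
        snprintf(key, sizeof(key), "ep-addr-%d", peer);
        PMI2_KVS_Get(NULL, peer, key, val, sizeof(val), &vallen);
        /* ... use val to set up the InfiniBand/OmniPath endpoint for peer ... */
    }
    PMI2_Finalize();
}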
PMIX_Ring: A Scalable Alternative
• Exchange data with only the left and right neighbors over TCP
• Exchange the bulk of the data over the high-speed interconnect (e.g., InfiniBand, OmniPath)
int PMIX_Ring(char value[], char left[], char right[], …)
[Chart: Comparison of PMI operations (Fence, Put, Gets, Ring), time taken (seconds) for 16 to 16K processes]
PMIX_Ring is more scalable
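A minimal usage sketch follows, assuming the fuller prototype exposed by the PMIX_Ring extension in SLURM's PMI2 library (ring rank/size outputs and a maximum value length in addition to the arguments shown above); the endpoint encoding and the follow-up exchange over the fast network are illustrative.

#include <stdio.h>

/* Assumed to be provided by the process manager's PMI2 library
 * (the extension contributed to SLURM); parameter names may differ. */
extern int PMIX_Ring(const char value[], int *rank, int *ranks,
                     char left[], char right[], int maxvalue);

#define ADDR_LEN 128

void ring_bootstrap(const char *my_ep_addr)
{
    char left[ADDR_LEN], right[ADDR_LEN];
    int ring_rank, ring_size;

    /* Over TCP, each process talks only to its two neighbors: it sends its
     * own endpoint address and receives those of the left/right ranks. */
    PMIX_Ring(my_ep_addr, &ring_rank, &ring_size, left, right, ADDR_LEN);

    /* Connect to the left and right neighbors over InfiniBand/OmniPath and
     * propagate the remaining addresses over the fast network (e.g. with an
     * allgather along the ring) instead of the TCP-based PMI channel. */
    printf("rank %d of %d: left=%s right=%s\n",
           ring_rank, ring_size, left, right);
}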
Results - PMIX_Ring
33% improvement in MPI_Init; total execution time is 20% better.
[Chart: Performance of MPI_Init and Hello World with PMIX_Ring, time taken (seconds) for 16 to 8K processes, comparing Hello World (Fence), Hello World (Ring), MPI_Init (Fence), and MPI_Init (proposed)]
[Chart: NAS benchmarks with 1K processes (Class B data), time taken (seconds) for EP, MG, CG, FT, BT, and SP, comparing Fence and Ring]
Challenge: Exploit Overlap in Application Initialization
• PMI operations are progressed by the process manager
• MPI/PGAS library is not involved
• Can be overlapped with other initialization tasks / application computation
• Put+Fence+Get combined into a single function - Allgather
int PMIX_KVS_Ifence(PMIX_Request *request)
int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)
int PMIX_Wait(PMIX_Request request)
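A minimal sketch of how these proposed calls enable overlap is shown below; the header name and the work placed in the overlap window are assumptions, since the extensions are the authors' proposal rather than a standard API.

#include "pmix_ext.h"   /* assumed header declaring the proposed extensions */

void overlapped_init(const char my_addr[], char all_addrs[])
{
    PMIX_Request req;

    /* Start the collective exchange of endpoint addresses; the process
     * manager progresses it, so the MPI/PGAS library can do other work. */
    PMIX_Iallgather(my_addr, all_addrs, &req);

    /* Overlap window: intra-node shared memory setup, local endpoint (QP)
     * creation, or early application computation can proceed here. */

    /* Block only when the remote addresses are actually needed. */
    PMIX_Wait(req);

    /* ... use all_addrs to wire up connections ... */
}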
Results - Non-blocking PMI Collectives
Near-constant MPI_Init at any scale; Allgather is 38% faster than Fence.
[Chart: Performance of MPI_Init, time taken (seconds) for 64 to 16K processes, comparing Fence, Ifence, Allgather, and Iallgather]
[Chart: Comparison of PMI2_KVS_Fence and PMIX_Allgather, time taken (seconds) for 64 to 16K processes]
Challenge: Minimize Memory Footprint
• Address table and similar information is stored in the PMI key-value store (KVS)
• All processes on the node duplicate the KVS
• PPN redundant copies per node
[Diagram: Process 1 through Process N each hold their own copy of the PMI KVS, duplicating the PMI Key-Value Store (KVS) held by the process manager]
PPN = Number of Processes per Node
Shared Memory based PMI
• Process manager creates and populates a shared memory region
• MPI processes directly read from shared memory
• Dual shared memory region based hash-table design for performance and memory efficiency
[Diagram: Dual shared memory regions: a hash table whose buckets hold Empty/Head/Tail fields, and a key-value store holding Key, Value, and Next fields per entry, with a Top pointer marking the end of the KVS region]
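An illustrative sketch of such a dual-region design follows; the structure names, field sizes, and hash function are assumptions for exposition, not the actual MVAPICH2 data structures.

#include <stdint.h>
#include <string.h>

#define NBUCKETS 4096
#define NIL      UINT32_MAX

struct bucket {              /* region 1: hash table (fixed-size buckets) */
    uint32_t head, tail;     /* indices into the entry region; NIL if empty */
};

struct entry {               /* region 2: key-value store (append-only) */
    char     key[64];
    char     value[256];
    uint32_t next;           /* next entry in the same bucket; NIL at the end */
};

/* The process manager maps both regions read-write and populates them; MPI
 * processes map them read-only and look keys up in place, so there is no
 * per-process copy of the KVS. */
const char *shm_kvs_get(const struct bucket *table, const struct entry *kvs,
                        const char *key)
{
    uint32_t h = 5381;                        /* djb2-style hash of the key */
    for (const char *p = key; *p; p++)
        h = h * 33 + (unsigned char)*p;

    for (uint32_t i = table[h % NBUCKETS].head; i != NIL; i = kvs[i].next)
        if (strcmp(kvs[i].key, key) == 0)
            return kvs[i].value;
    return NULL;                              /* key not published */
}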
Results - Shared Memory based PMI
PMI Gets are 1000x faster; memory footprint is reduced by O(PPN).
[Chart: Time taken by one PMI_Get (milliseconds) for 1 to 32 processes, Default vs. Shmem]
[Chart: PMI memory usage per node (MB) for 32K to 1M processes, comparing Fence and Allgather with the Default and Shmem designs]
Summary
[Figure: Summary plot of Job Startup Performance vs. Memory Required to Store Endpoint Information. P marks the PGAS state of the art, M the MPI state of the art, and O the optimized PGAS/MPI design, reached through (a) On-demand Connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) Shmem based PMI]
• Near-constant MPI/OpenSHMEM initialization at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM with 16,384 processes (1,024 nodes)
• O(PPN) reduction in PMI memory footprint
Availability and Impact
• Tested at large scale on Stampede and LLNL clusters
• All designs available as part of MVAPICH2 / MVAPICH2-X
  • MVAPICH powers Sunway TaihuLight - the #1 supercomputer in the world!
  • 13th, 241,108-core (Pleiades) at NASA
  • 17th, 462,462-core (Stampede) at TACC
• Can be easily adopted by other MPI libraries and resource managers
  • Design of PMIX_Ring contributed to SLURM 15
  • Other enhancements available as patches from the MVAPICH2 website
  • Ongoing discussion to include them in future SLURM releases
Thank You!
http://go.osu.edu/mvapich-startup
http://mvapich.cse.ohio-state.edu/
[email protected]
[email protected]