Job Startup at Exascale: Challenges and Solutions
Sourav Chakraborty
Advisor: Dhabaleswar K. (DK) Panda
The Ohio State University
http://nowlab.cse.ohio-state.edu/
Network Based Computing Laboratory
Current Trends in HPC
• Tremendous increase in system and job sizes
• Interconnects like InfiniBand and OmniPath are dominant
• Dense many-core systems like KNL are more common
• Hybrid MPI+PGAS models becoming popular
Fast and scalable job-startup is essential!
Why is Job Startup Important?
• Development and debugging
• Regression/Acceptance testing
• Checkpoint-Restart
Towards Exascale: Challenges to Address
• Dynamic allocation of resources
• Leveraging high-performance interconnects
• Exploiting opportunities for overlap
• Minimizing memory usage
[Figure: Job Startup Performance vs. Memory Overhead, contrasting the state of the art with the goal towards exascale]
Challenge: Avoid All-to-all Connectivity
[Chart: Breakdown of initialization time (seconds) into Connection Setup, PMI Exchange, and Memory Registration for 32 to 4K processes]
Application   Processes   Average Peers
BT            64          8.7
              1024        10.6
EP            64          3.0
              1024        5.0
MG            64          9.5
              1024        11.9
SP            64          8.8
              1024        10.7
2D Heat       64          5.3
              1024        5.4
Connection setup phase takes 85% of initialization time with 4K processes
Applications rarely require full all-to-all connectivity
On-demand Connection Management
[Sequence diagram: on-demand connection establishment between Process 1 and Process 2, each with a main thread and a connection-manager thread. The main thread of Process 1 issues Put/Get(P2); its connection manager creates a QP, transitions it to Init, enqueues the send, and sends a Connect Request (LID, QPN, address, size, rkey). The connection manager of Process 2 creates its QP, transitions it through Init and RTR, and returns a Connect Reply (LID, QPN, address, size, rkey). Both QPs reach RTS, the connection is established, the queued send is dequeued, and the Put/Get(P2) completes.]
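For context, a minimal sketch of the verbs-level work behind this diagram is shown below, assuming libibverbs and that the peer's LID and QPN have already arrived in the Connect Request/Reply; the helper name, queue sizes, and timeout values are illustrative, not MVAPICH2 internals.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: create a QP on demand and drive it through
 * INIT -> RTR -> RTS using the peer's LID/QPN from the connect exchange. */
struct ibv_qp *connect_on_demand(struct ibv_pd *pd, struct ibv_cq *cq,
                                 uint16_t peer_lid, uint32_t peer_qpn,
                                 uint8_t port, uint32_t psn)
{
    struct ibv_qp_init_attr init_attr = {
        .send_cq = cq, .recv_cq = cq, .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 }
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);   /* created lazily */
    if (!qp) return NULL;

    struct ibv_qp_attr attr = {                           /* RESET -> INIT */
        .qp_state = IBV_QPS_INIT, .pkey_index = 0, .port_num = port,
        .qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE
    };
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                             IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);

    memset(&attr, 0, sizeof(attr));                       /* INIT -> RTR */
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_2048;
    attr.dest_qp_num = peer_qpn;                          /* from Connect msg */
    attr.rq_psn = psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 12;
    attr.ah_attr.dlid = peer_lid;                         /* from Connect msg */
    attr.ah_attr.port_num = port;
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);

    memset(&attr, 0, sizeof(attr));                       /* RTR -> RTS */
    attr.qp_state = IBV_QPS_RTS;
    attr.sq_psn = psn;
    attr.timeout = 14; attr.retry_cnt = 7; attr.rnr_retry = 7;
    attr.max_rd_atomic = 1;
    ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                             IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                             IBV_QP_MAX_QP_RD_ATOMIC);

    return qp;   /* queued Put/Get operations can now be dequeued and posted */
}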
Results - On-demand Connections
[Chart: Performance of OpenSHMEM initialization and Hello World, time taken (seconds) for 16 to 8K processes, comparing Hello World and start_pes with static vs. on-demand connections]
[Chart: Execution time (seconds) of the OpenSHMEM NAS Parallel Benchmarks BT, EP, MG, and SP with static vs. on-demand connections]
Initialization is 29.6 times faster; total execution time is 35% better.
Challenge: Exploit High-performance Interconnects in PMI
• Used for network address exchange, heterogeneity detection, etc.
• Used by major parallel programming frameworks
• Uses TCP/IP for transport
• Not efficient for moving large amounts of data
• Required to bootstrap high-performance interconnects
[Chart: Breakdown of MPI_Init in MVAPICH2 into PMI Exchanges, Shared Memory, and Other, time taken (seconds) for 32 to 8K processes]
PMI = Process Management Interface
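For reference, a minimal sketch of this conventional put/fence/get exchange using the PMI2 interface (as shipped with SLURM) is shown below; the key naming, buffer sizes, and header path are illustrative assumptions.

#include <slurm/pmi2.h>
#include <stdio.h>

/* Each rank publishes its endpoint address (e.g. LID/QPN) to the PMI
 * key-value store, then reads every peer's address over the TCP-based
 * PMI channel to bootstrap the high-performance interconnect. */
void exchange_endpoints(const char *my_ep_addr)
{
    int spawned, size, rank, appnum;
    PMI2_Init(&spawned, &size, &rank, &appnum);

    char key[64], val[256];
    snprintf(key, sizeof(key), "ep-addr-%d", rank);
    PMI2_KVS_Put(key, my_ep_addr);      /* publish this rank's address */

    PMI2_KVS_Fence();                   /* make all puts globally visible */

    for (int peer = 0; peer < size; peer++) {
        int vallen;
        snprintf(key, sizeof(key), "ep-addr-%d", peer);
        PMI2_KVS_Get(NULL, peer, key, val, sizeof(val), &vallen);
        /* ... use val to set up the InfiniBand/OmniPath endpoint for peer ... */
    }
    PMI2_Finalize();
}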
PMIX_Ring: A Scalable Alternative
• Exchange data with only the left and right neighbors over TCP
• Exchange the bulk of the data over the high-speed interconnect (e.g., InfiniBand, OmniPath)
int PMIX_Ring(char value[], char left[], char right[], …)
[Chart: Comparison of PMI operations (Fence, Put, Gets, Ring), time taken (seconds) for 16 to 16K processes]
PMIX_Ring is more scalable
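A minimal usage sketch follows, assuming the fuller prototype exposed by the PMIX_Ring extension in SLURM's PMI2 library (ring rank/size outputs and a maximum value length in addition to the arguments shown above); the endpoint encoding and the follow-up exchange over the fast network are illustrative.

#include <stdio.h>

/* Assumed to be provided by the process manager's PMI2 library
 * (the extension contributed to SLURM); parameter names may differ. */
extern int PMIX_Ring(const char value[], int *rank, int *ranks,
                     char left[], char right[], int maxvalue);

#define ADDR_LEN 128

void ring_bootstrap(const char *my_ep_addr)
{
    char left[ADDR_LEN], right[ADDR_LEN];
    int ring_rank, ring_size;

    /* Over TCP, each process talks only to its two neighbors: it sends its
     * own endpoint address and receives those of the left/right ranks. */
    PMIX_Ring(my_ep_addr, &ring_rank, &ring_size, left, right, ADDR_LEN);

    /* Connect to the left and right neighbors over InfiniBand/OmniPath and
     * propagate the remaining addresses over the fast network (e.g. with an
     * allgather along the ring) instead of the TCP-based PMI channel. */
    printf("rank %d of %d: left=%s right=%s\n",
           ring_rank, ring_size, left, right);
}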
Results - PMIX_Ring
33% improvement in MPI_Init; total execution time is 20% better.
[Chart: Performance of MPI_Init and Hello World with PMIX_Ring, time taken (seconds) for 16 to 8K processes, comparing Hello World (Fence), Hello World (Ring), MPI_Init (Fence), and MPI_Init (proposed)]
[Chart: NAS benchmarks with 1K processes (Class B data), time taken (seconds) for EP, MG, CG, FT, BT, and SP, comparing Fence and Ring]
Challenge: Exploit Overlap in Application Initialization
• PMI operations are progressed by the process manager
• MPI/PGAS library is not involved
• Can be overlapped with other initialization tasks / application computation
• Put+Fence+Get combined into a single function - Allgather
int PMIX_KVS_Ifence(PMIX_Request *request)
int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)
int PMIX_Wait(PMIX_Request request)
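A minimal sketch of how these proposed calls enable overlap is shown below; the header name and the work placed in the overlap window are assumptions, since the extensions are the authors' proposal rather than a standard API.

#include "pmix_ext.h"   /* assumed header declaring the proposed extensions */

void overlapped_init(const char my_addr[], char all_addrs[])
{
    PMIX_Request req;

    /* Start the collective exchange of endpoint addresses; the process
     * manager progresses it, so the MPI/PGAS library can do other work. */
    PMIX_Iallgather(my_addr, all_addrs, &req);

    /* Overlap window: intra-node shared memory setup, local endpoint (QP)
     * creation, or early application computation can proceed here. */

    /* Block only when the remote addresses are actually needed. */
    PMIX_Wait(req);

    /* ... use all_addrs to wire up connections ... */
}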
Results - Non-blocking PMI Collectives
Near-constant MPI_Init at any scale; Allgather is 38% faster than Fence.
[Chart: Performance of MPI_Init, time taken (seconds) for 64 to 16K processes, comparing Fence, Ifence, Allgather, and Iallgather]
[Chart: Comparison of PMI2_KVS_Fence and PMIX_Allgather, time taken (seconds) for 64 to 16K processes]
Challenge: Minimize Memory Footprint
• Address table and similar information is stored in the PMI key-value store (KVS)
• All processes on the node duplicate the KVS
• PPN redundant copies per node
[Diagram: Process 1 through Process N each hold their own copy of the PMI KVS, duplicating the PMI Key-Value Store (KVS) held by the process manager]
PPN = Number of Processes per Node
Shared Memory based PMI
• Process manager creates and populates a shared memory region
• MPI processes directly read from shared memory
• Dual shared memory region based hash-table design for performance and memory efficiency
[Diagram: Dual shared memory regions: a hash table whose buckets hold Empty/Head/Tail fields, and a key-value store holding Key, Value, and Next fields per entry, with a Top pointer marking the end of the KVS region]
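An illustrative sketch of such a dual-region design follows; the structure names, field sizes, and hash function are assumptions for exposition, not the actual MVAPICH2 data structures.

#include <stdint.h>
#include <string.h>

#define NBUCKETS 4096
#define NIL      UINT32_MAX

struct bucket {              /* region 1: hash table (fixed-size buckets) */
    uint32_t head, tail;     /* indices into the entry region; NIL if empty */
};

struct entry {               /* region 2: key-value store (append-only) */
    char     key[64];
    char     value[256];
    uint32_t next;           /* next entry in the same bucket; NIL at the end */
};

/* The process manager maps both regions read-write and populates them; MPI
 * processes map them read-only and look keys up in place, so there is no
 * per-process copy of the KVS. */
const char *shm_kvs_get(const struct bucket *table, const struct entry *kvs,
                        const char *key)
{
    uint32_t h = 5381;                        /* djb2-style hash of the key */
    for (const char *p = key; *p; p++)
        h = h * 33 + (unsigned char)*p;

    for (uint32_t i = table[h % NBUCKETS].head; i != NIL; i = kvs[i].next)
        if (strcmp(kvs[i].key, key) == 0)
            return kvs[i].value;
    return NULL;                              /* key not published */
}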
Results - Shared Memory based PMI
PMI Gets are 1000x faster; memory footprint is reduced by O(PPN).
[Chart: Time taken by one PMI_Get (milliseconds) for 1 to 32 processes, Default vs. Shmem]
[Chart: PMI memory usage per node (MB) for 32K to 1M processes, comparing Fence and Allgather with the Default and Shmem designs]
Summary
[Figure: Summary plot of Job Startup Performance vs. Memory Required to Store Endpoint Information. P marks the PGAS state of the art, M the MPI state of the art, and O the optimized PGAS/MPI design, reached through (a) On-demand Connection, (b) PMIX_Ring, (c) PMIX_Ibarrier, (d) PMIX_Iallgather, and (e) Shmem based PMI]
• Near-constant MPI/OpenSHMEM initialization at any process count
• 10x and 30x improvement in startup time of MPI and OpenSHMEM with 16,384 processes (1,024 nodes)
• O(PPN) reduction in PMI memory footprint
Availability and Impact
• Tested at large scale on Stampede and LLNL clusters
• All designs available as part of MVAPICH2 / MVAPICH2-X
  • MVAPICH powers Sunway TaihuLight - the #1 supercomputer in the world!
  • 13th, 241,108-core (Pleiades) at NASA
  • 17th, 462,462-core (Stampede) at TACC
• Can be easily adopted by other MPI libraries and resource managers
  • Design of PMIX_Ring contributed to SLURM 15
  • Other enhancements available as patches from the MVAPICH2 website
  • Ongoing discussion to include them in future SLURM releases
Thank You!
http://go.osu.edu/mvapich-startup
http://mvapich.cse.ohio-state.edu/
[email protected]
[email protected]