Design and Implementation of Open MPI over Quadrics/Elan4
W. Yu, T.S. Woodall+, R.L. Graham+ and D.K. Panda
Dept of Computer Sci. and Engg.
The Ohio State University
{yuw,panda}@cse.ohio-state.edu
Los Alamos National Laboratory+
Computer and Computation Science.
{twoodall,rlgraham}@lanl.gov
Presentation Outline
• Motivation
• Communication Requirements and Objectives
• Design Challenges and Implementation
• Performance Evaluation
• Conclusions
Cluster Computing
• Parallel computing architecture
  – Evolving into tens of thousands of processors
  – More high performance interconnects
• MPI and MPI-2
  – The de facto industry standard
  – MPI-2 extends MPI with dynamic process management, IO, one-sided communication, more collectives, language bindings, etc.
Open MPI
• A new implementation of MPI-2
  – Component-based dynamic architecture
  – Dynamic, fault tolerant process management
  – Concurrent communication over multiple networks
  – Dual-mode communication progress
Open MPI Communication
• First implemented over TCP/IP
  – Able to aggregate messages over multiple NICs
  – Delivers comparable performance
• Communication stacks on top of two layers:
  – Point-to-point message management layer (PML)
    • Message fragmentation and assembly
    • Ordered reliable delivery
    • Scheduling and striping
  – Point-to-point message transport layer (PTL)
    • Network specific, managing network status and communication
    • Presents communication support to PML
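
The PML duties listed on this slide (fragmentation, assembly, scheduling and striping) can be illustrated with a minimal Python sketch. It is not Open MPI code: the `Fragment` type, the function names, the PTL names and the round-robin policy are all assumptions chosen for illustration, and a real scheduler may weight PTLs differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    ptl: str      # hypothetical name of the PTL this fragment is scheduled on
    offset: int   # byte offset of the fragment within the original message
    data: bytes


def fragment_and_stripe(message: bytes, ptls: List[str], frag_size: int) -> List[Fragment]:
    """Split a message into fragments and assign them round-robin to PTLs."""
    if frag_size <= 0 or not ptls:
        raise ValueError("need a positive fragment size and at least one PTL")
    frags = []
    for i, offset in enumerate(range(0, len(message), frag_size)):
        chunk = message[offset:offset + frag_size]
        frags.append(Fragment(ptls[i % len(ptls)], offset, chunk))
    return frags


def reassemble(frags: List[Fragment], total_len: int) -> bytes:
    """Rebuild the message from fragments, in whatever order they arrived."""
    buf = bytearray(total_len)
    for f in frags:
        buf[f.offset:f.offset + len(f.data)] = f.data
    return bytes(buf)
```

Because each fragment carries its own offset, the receiver can assemble the message even if fragments sent over different networks arrive out of order.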
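Communication Architecture
[Figure: layered communication stack. Collective and point-to-point communication sit above the PML; the PML sits above the PTL components (Base, PTL-TCP, PTL-Elan4); PTL-TCP runs over Ethernet and PTL-Elan4 over Quadrics.]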
Flow of Open MPI Communication
[Figure: message flow between the sender's PML and PTL and the receiver's PTL and PML. Labels: schedule, data/rendezvous, match, matched, Ack, Send, update, complete, short.]
PML Requirements to PTL Communication Support
• Fault-tolerance
  – Dynamic joining and disjoining of PTLs
  – Communication state monitoring and synchronization
• Concurrent communication
  – PML provides abstraction to handle semantics differences between networks
• Communication progress
  – Non-blocking polling-mode and thread-based asynchronous mode
Overview of Quadrics/Elan4
• Quadrics Network: QsNetII
  – Tport (MPI oriented) and SHMEM libraries
  – Static communication model between processes
  – Hardware-based collectives
    • broadcast, barrier
• Communication mechanisms
  – Queue-based model
    • for messages up to 2KB
  – Remote DMA
    • Arbitrary size messages. RDMA write/read
  – Event mechanism
    • Completion notification
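
As a small illustration of the size split described above, the following Python sketch picks a mechanism by message length. The 2KB limit comes from the slide; the function name, the constant name and the returned labels are assumptions, not part of the Elan4 library.

```python
QDMA_MAX_BYTES = 2048  # "messages up to 2KB" for the queue-based model, per the slide


def choose_mechanism(nbytes: int) -> str:
    """Return which mechanism a message of this size would use in this sketch."""
    if nbytes < 0:
        raise ValueError("message length cannot be negative")
    # Small messages fit in a queue slot; anything larger needs remote DMA.
    return "QDMA" if nbytes <= QDMA_MAX_BYTES else "RDMA"
```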
Objectives
• Support MPI-2 dynamic processes over Quadrics
• Incorporate Quadrics RDMA capabilities
• Support dual-mode communication progress
Design Challenges
• Dynamic MPI-2 process model
  – Communication initialization and finalization
• Integrating RDMA Capabilities
  – Memory semantics compatibility
  – Protocol mapping
• Communication Progress
  – How to support asynchronous progress?
Dynamic MPI-2 Process Pool
• Communication initialization and finalization
  – Break the coupling of MPI Rank and VPID
  – Remove the reliance on global virtual memory
  – Allocate a capability with more contexts
  – Support dynamic and synchronized joining and disjoining of processes
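
One way to picture "breaking the coupling of MPI Rank and VPID" is an explicit lookup table that is filled in as processes join and cleared as they leave, instead of assuming that rank equals VPID. The Python sketch below is only a model of that idea: the `PeerTable` class and its method names are hypothetical and do not reflect the actual PTL/Elan4 data structures.

```python
class PeerTable:
    """Maps an MPI rank to the network-level VPID of the process holding it."""

    def __init__(self):
        self._vpid_of = {}

    def join(self, rank: int, vpid: int) -> None:
        # A process joining at run time may receive any free VPID.
        if rank in self._vpid_of:
            raise ValueError(f"rank {rank} is already mapped")
        self._vpid_of[rank] = vpid

    def leave(self, rank: int) -> None:
        # Disjoining removes the mapping so the VPID can be reused.
        del self._vpid_of[rank]

    def vpid(self, rank: int) -> int:
        return self._vpid_of[rank]
```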
Integrating RDMA Capabilities
• Memory Descriptor
  – Right now, an expansion with Elan4_Addr
• Communication and Completion notification
  – Using RDMA write/read
  – FIN with RDMA write
  – FIN_ACK with RDMA read
• Optimization
  – Chains the control message with RDMA
  – Provides fast, automatic transmission of control messages
RDMA Write
[Figure: RDMA write flow between the sender's PML and PTL and the receiver's PTL and PML. Labels: schedule, data/rendezvous, match, matched, Ack, RDMA Write, FIN, update, complete.]
RDMA Read
[Figure: RDMA read flow between the sender's PML and PTL and the receiver's PTL and PML. Labels: schedule, data/rendezvous, match, matched, RDMA Read, FIN_ACK, update, complete.]
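
The two rendezvous schemes can be written down as step lists. The Python sketch below encodes one reading of the RDMA write and RDMA read flow diagrams; the ordering of steps and the function names are assumptions made for illustration, not a trace of the actual implementation.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (which side acts, what it sends or performs)


def rendezvous_steps(scheme: str) -> List[Step]:
    """Steps after the sender's first fragment, for the 'write' or 'read' scheme."""
    first = ("sender", "rendezvous")
    if scheme == "write":
        # Receiver acknowledges the match, sender pushes the data, then signals FIN.
        return [first, ("receiver", "ACK"), ("sender", "RDMA write"), ("sender", "FIN")]
    if scheme == "read":
        # Receiver pulls the data itself, then tells the sender it is done.
        return [first, ("receiver", "RDMA read"), ("receiver", "FIN_ACK")]
    raise ValueError(f"unknown scheme: {scheme}")


def control_messages(scheme: str) -> List[Step]:
    """Everything in the exchange that is not the bulk RDMA transfer."""
    return [s for s in rendezvous_steps(scheme) if not s[1].startswith("RDMA")]
```

Under this reading the read scheme needs one control message fewer than the write scheme, because no separate Ack travels back before the data moves.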
Communication Progress
• Non-blocking Polling Mode
  – PML iteratively checks all outstanding send and receive queues
• Thread-based asynchronous communication
  – Two-thread-based communication progress
    • One for the local completion of DMA descriptors
    • Another for the completion of incoming QDMA messages
  – One-thread-based communication progress
    • QDMA messages + local DMA completion to a combined queue
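
The difference between the two progress modes can be sketched with Python's standard `queue` and `threading` modules. This is a toy model under stated assumptions: plain Python queues stand in for the Elan4 queues, and `poll_once`, `progress_thread` and the event strings are hypothetical names.

```python
import queue
import threading


def poll_once(queues):
    """Polling mode: check every queue once without blocking."""
    done = []
    for q in queues:
        try:
            done.append(q.get_nowait())
        except queue.Empty:
            pass  # nothing completed on this queue yet
    return done


def progress_thread(combined, handled, stop):
    """One-thread mode: block on a single combined queue until an event arrives."""
    while True:
        event = combined.get()
        if event is stop:
            break
        handled.append(event)


if __name__ == "__main__":
    combined, handled, STOP = queue.Queue(), [], object()
    worker = threading.Thread(target=progress_thread, args=(combined, handled, STOP))
    worker.start()
    combined.put("incoming QDMA message")
    combined.put("local DMA completion")
    combined.put(STOP)
    worker.join()
    print(handled)
```

Feeding both kinds of events into one combined queue is what lets a single thread replace the two-thread design, at the cost of sharing that queue.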
Challenges in Asynchronous Progress with RDMA
• RDMA completion can only be detected with a separate event.
• The event mechanism
  – Supports the completion of N DMA operations with a count N
  – Cannot have one thread per RDMA descriptor
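
The behaviour of an event with a count N can be modelled in a few lines of Python. This is a sketch of the counting semantics only, not the Elan4 event API; the `CountedEvent` class and its callback are hypothetical.

```python
class CountedEvent:
    """Fires its callback once, after `count` completions have been recorded."""

    def __init__(self, count: int, on_fire):
        if count <= 0:
            raise ValueError("count must be positive")
        self.remaining = count
        self.on_fire = on_fire

    def complete_one(self) -> None:
        if self.remaining == 0:
            raise RuntimeError("event has already fired")
        self.remaining -= 1
        if self.remaining == 0:
            self.on_fire()  # only the N-th completion triggers notification
```

The sketch shows why a single counted event is awkward for per-descriptor notification: the waiter learns that all N operations finished, not which one finished when.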
Chained Event
• Is it possible to use events with a count N for shared completion?
Possible Race Condition?
Chained Event + QDMA
Performance Evaluation
• Experimental Testbed:
  – A Quadrics cluster: QS-8A switch, Elan4 cards
  – Dual-SMP Intel Xeon 3.0GHz Processors
  – PCI-X 133MHz/64bit
  – 533MHz FSB
  – 1GB SDRAM memory
• Experimental Results
  – Performance with different numbers of completion queues
  – Communication cost in different layers
  – Threading cost
  – Overall performance
66MHz/64bit PCI bus
Basic Performance with RDMA Read and Write
• RDMA read performs better than RDMA write
• Rendezvous Message without inline data improves performance
• memcpy() is replacing the sophisticated datatype engine for
Performance with Chained DMA and Completion Queues
• Chained DMA provides little performance improvement
• ~1us penalty for shared completion queue
• No performance difference between one queue and two queues
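Measuring Communication Cost
[Figure: sender and receiver stacks (PML over PTL) connected by the network, with measurement points a and b. Labels: PML, PTL, Sender, Receiver, Network, a, b, L1, L2.]
• L1: PML cost
• L2: PTL latency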
Communication Cost in Different Layers
• PML has about 0.5us overhead
• Compared to QDMA, PTL/Elan4 has virtually no overhead for 0-byte messages.
Thread-Based Progress
Performance Analysis of Thread-based Progression (in us)

Mesg Length        Basic   Interrupt   One-Thread   Two-Threads
RDMA-Read (4B)      3.87       14.70        22.76         27.50
RDMA-Read (4KB)    15.25       27.16        32.80         47.72

• Open MPI w/ PTL/Elan4 thread-based progression has 18us overhead
• ~1us due to shared completion queue
• ~9us due to interrupts, ~8us due to threading
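
The overhead figures quoted on this slide can be checked against the table with simple subtraction. The Python sketch below assumes the column order Basic, Interrupt, One-Thread, Two-Threads, which is a reconstruction from the extracted text; the dictionary layout and the `overhead` helper are hypothetical.

```python
# Latencies in microseconds, copied from the thread-based progression table.
latency_us = {
    "RDMA-Read (4B)":  {"Basic": 3.87, "Interrupt": 14.70, "One-Thread": 22.76, "Two-Threads": 27.50},
    "RDMA-Read (4KB)": {"Basic": 15.25, "Interrupt": 27.16, "One-Thread": 32.80, "Two-Threads": 47.72},
}


def overhead(row: str, mode: str, baseline: str = "Basic") -> float:
    """Extra latency of `mode` relative to `baseline` for one table row."""
    return round(latency_us[row][mode] - latency_us[row][baseline], 2)


if __name__ == "__main__":
    row = "RDMA-Read (4B)"
    print("one-thread vs basic:    ", overhead(row, "One-Thread"))
    print("interrupt vs basic:     ", overhead(row, "Interrupt"))
    print("one-thread vs interrupt:", overhead(row, "One-Thread", "Interrupt"))
```

For 4-byte messages the one-thread mode costs about 18.9us more than the basic mode, in line with the roughly 18us quoted. The step from interrupt to one-thread is about 8us, matching the threading figure; the remaining step from basic to interrupt presumably covers both the interrupt and shared-queue costs, though the slide does not say so explicitly.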
Overall Performance - Latency
• Open MPI w/ PTL/Elan4 achieves similar latency for large messages, compared to MPICH-QsNet
• For small messages, Open MPI w/ PTL/Elan4 has higher cost due to its host-based receive queue and tag matching
Overall Performance - Bandwidth
• Open MPI w/ PTL/Elan4 has slightly lower bandwidth compared to MPICH-QsNet for small and large messages
• For medium messages, Open MPI w/ PTL/Elan4 has significant bandwidth because it does no pipelining
Conclusions
• Designed and implemented Open MPI over Quadrics/Elan4
• Integrated Quadrics RDMA capabilities
• Provided dual-mode communication progress
• Supported the dynamic MPI-2 process model over Quadrics
Web Pointers
Homepage: http://nowlab.cis.ohio-state.edu
NBC-LAB