Supporting PGAS Models (UPC and OpenSHMEM) on MIC Clusters
Dhabaleswar K. (DK) Panda The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Khaled Hamidouche The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~hamidouc
Presentation at the IXPUG Meeting, July 2014
Parallel Programming Models Overview
[Figure: three abstract machine models – the Shared Memory model (e.g., DSM), the Distributed Memory model (MPI – Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …), in which per-process memories are combined into a logical shared memory]
• Programming models provide abstract machine models
• Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Additionally, OpenMP can be used to parallelize computation within the node
• Each model has strengths and drawbacks and suits different problems or applications
Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, Global Arrays
SHMEM
• SHMEM: Symmetric Hierarchical MEMory library
• One-sided communication library that has been available for many years
• Similar to MPI: processes are called PEs, and data movement is explicit, through library calls
• Provides globally addressable memory using symmetric memory objects (more in later slides)
• Library routines for
– Symmetric object creation and management
– One-sided data movement
– Atomics
– Collectives
– Synchronization
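To make these routine categories concrete, here is a minimal sketch (not from the slides) using the standard OpenSHMEM 1.0 API: a symmetric object is created with shmalloc(), written with a one-sided put, and synchronized with barriers.

/* Minimal OpenSHMEM sketch: symmetric allocation, a one-sided put,
 * and barrier synchronization (standard OpenSHMEM 1.0 calls). */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    start_pes(0);                         /* initialize; PEs are numbered 0 .. npes-1 */
    int me   = _my_pe();
    int npes = _num_pes();

    /* Symmetric object: allocated collectively, same size on every PE. */
    int *dst = (int *) shmalloc(sizeof(int));
    *dst = -1;
    shmem_barrier_all();

    /* One-sided put: write my PE number into the next PE's copy of dst. */
    shmem_int_p(dst, me, (me + 1) % npes);
    shmem_barrier_all();

    printf("PE %d of %d received %d\n", me, npes, *dst);

    shfree(dst);
    return 0;
}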
OpenSHMEM
• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in the API across versions – for example:
  – Initialization: start_pes(0) (SGI SHMEM), shmem_init (Quadrics SHMEM), start_pes (Cray SHMEM)
  – Process ID: _my_pe (SGI SHMEM), my_pe (Quadrics SHMEM), shmem_my_pe (Cray SHMEM)
• These differences made application codes non-portable
• OpenSHMEM is an effort to address this:
“A new, open specification to consolidate the various extant SHMEM versions
into a widely accepted standard.” – OpenSHMEM Specification v1.0
– Led by the University of Houston and Oak Ridge National Laboratory
– SGI SHMEM is the baseline for the OpenSHMEM specification
Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  – Introduction to UPC and Language Specification, 1999
  – UPC Language Specifications, v1.0, Feb 2001
  – UPC Language Specifications, v1.1.1, Sep 2004
  – UPC Language Specifications, v1.2, June 2005
  – UPC Language Specifications, v1.3, Nov 2013
• UPC Consortium
  – Academic institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, …
  – Government institutions: ARSC, IDA, LBNL, SNL, US DOE, …
  – Commercial institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  – Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …
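The shared-array model these specifications define can be illustrated with a minimal sketch (not from the slides); the array size and loop are arbitrary examples.

/* Minimal UPC sketch: a shared array distributed across threads,
 * updated with an affinity-aware upc_forall loop. */
#include <upc.h>
#include <stdio.h>

#define ELEMS_PER_THREAD 256

shared int a[ELEMS_PER_THREAD * THREADS];   /* default block size 1: cyclic distribution */

int main(void)
{
    int i;
    upc_forall (i = 0; i < ELEMS_PER_THREAD * THREADS; i++; &a[i])
        a[i] = MYTHREAD;                    /* each thread writes the elements it owns */

    upc_barrier;                            /* wait for all threads */

    if (MYTHREAD == 0)
        printf("THREADS = %d, a[1] has affinity to thread %d\n",
               THREADS, (int) upc_threadof(&a[1]));
    return 0;
}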
MPI+PGAS for Exascale Architectures and Applications
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared-memory programming models
• Applications can have kernels with different communication patterns
  – Can benefit from different models
• Re-writing complete applications can be a huge effort
  – Port critical kernels to the desired model instead
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits: – Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computing Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
[Figure: an HPC application with Kernels 1…N, where selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS while the remaining kernels stay in MPI]
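The hybrid model can be made concrete with a hedged sketch (not from the slides): an MPI application in which one kernel uses an OpenSHMEM one-sided put, assuming a unified runtime such as MVAPICH2-X where PE numbers and MPI ranks coincide; the exact initialization ordering may vary by implementation.

/* Hedged sketch of a hybrid MPI + OpenSHMEM program. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                     /* OpenSHMEM init; assumed to share the MPI runtime */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* "Kernel 1" in MPI: a collective reduction. */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* "Kernel 2" in OpenSHMEM: one-sided put into a symmetric variable on PE 0. */
    static int flag = 0;              /* static storage, hence a symmetric data object */
    shmem_int_p(&flag, sum, 0);
    shmem_barrier_all();

    if (rank == 0)
        printf("sum = %d, flag on PE 0 = %d\n", sum, flag);

    MPI_Finalize();
    return 0;
}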
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs and MIC
  – Used by more than 2,150 organizations (HPC centers, industry, and universities) in 72 countries
  – More than 218,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters:
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA, and many others
  – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Figure: MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, and UPC calls into a unified MVAPICH2-X runtime running over InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, and OpenSHMEM, available with MVAPICH2-X 1.9 onwards – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3)
  – Scalable inter-node and intra-node communication – point-to-point and collectives
Outline
• OpenSHMEM and UPC for Host
• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host
• OpenSHMEM for MIC
• UPC for MIC
OpenSHMEM Design in MVAPICH2-X
• OpenSHMEM Stack based on OpenSHMEM Reference Implementation
• OpenSHMEM communication over the MVAPICH2-X runtime
  – Uses active messages, atomic and one-sided operations, and a remote registration cache
[Figure: the OpenSHMEM API (data movement, collectives, atomics, memory management) is built on a minimal set of internal APIs – a communication API and a symmetric memory management API – which map onto the MVAPICH2-X runtime (active messages, one-sided operations, remote atomic ops, enhanced registration cache) over InfiniBand, RoCE, and iWARP]
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.
OpenSHMEM Collective Communication Performance
[Figures: OpenSHMEM collective latency at 1,024 processes – Reduce, Broadcast, and Collect latency vs. message size, and Barrier latency vs. number of processes (128–2,048) – comparing MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM; MV2X-SHMEM shows improvements of up to 180X, 18X, 12X, and 3X across the four operations]
OpenSHMEM Application Evaluation
[Figures: execution time of the 2D Heat Image and DAXPY kernels at 256 and 512 processes, comparing UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM]
• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2D Heat Image at 512 processes (sec):
  – UH-SHMEM: 523, OMPI-SHMEM: 214, Scalable-SHMEM: 193, MV2X-SHMEM: 169
• Execution time for DAXPY at 512 processes (sec):
  – UH-SHMEM: 57, OMPI-SHMEM: 56, Scalable-SHMEM: 9.2, MV2X-SHMEM: 5.2
J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM ’14), March 2014
MVAPICH2-X Support for Berkeley UPC Runtime
• GASNet (Global-Address Space Networking) is a language-independent, low-level networking layer that provides support for PGAS languages
• GASNet supports multiple networks through different conduits; an MVAPICH2-X conduit, available in the MVAPICH2-X release, supports UPC/OpenMP/MPI on InfiniBand
[Figure: UPC applications and compiler-generated code run on a compiler-specific runtime, which uses GASNet's Extended APIs (high-level operations such as remote memory access and various collective operations) and Core APIs (heavily based on active messages, directly on top of each network architecture); conduits include SMP, MPI, MXM, IB, and the MVAPICH2-X conduit]
UPC Collectives Performance
[Figures: UPC collective latency at 2,048 processes – Broadcast, Scatter, Gather, and Exchange vs. message size (64 bytes to 256 KB) – comparing UPC-GASNet and UPC-OSU; UPC-OSU shows improvements ranging from 35% to 25X across the four collectives]
J. Jose, K. Hamidouche, J. Zhang, A. Venkatesh, and D. K. Panda, Optimizing Collective Communication in UPC (HiPS ’14, in association with IPDPS ’14)
Outline
• OpenSHMEM and UPC for Host
• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host
• OpenSHMEM for MIC
• UPC for MIC
Unified Runtime for Hybrid MPI + OpenSHMEM Applications
[Figure: with separate OpenSHMEM and MPI runtimes, a hybrid (OpenSHMEM + MPI) application drives two independent runtimes over InfiniBand/RoCE/iWARP; with the unified MVAPICH2-X runtime, MPI, OpenSHMEM, and hybrid (MPI + OpenSHMEM) applications share a single runtime]
• Optimal network resource usage
• No deadlock, because of the single runtime
• Better performance
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
Hybrid MPI+OpenSHMEM Graph500 Design
[Figures: Graph500 performance in billions of traversed edges per second (TEPS) – strong scalability over scales 26–29 and weak scalability from 1,024 to 8,192 processes – and execution time at 4K–16K processes, comparing MPI-Simple, MPI-CSC, MPI-CSR, and the Hybrid (MPI+OpenSHMEM) design]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – At 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – At 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC ’13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
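The cited papers describe the actual Graph500 design; the sketch below is only a generic illustration (not the authors' code) of the one-sided pattern such a hybrid BFS can exploit: atomically reserving slots in a remote PE's vertex queue and writing into them without involving the target CPU. Queue sizes and names are hypothetical.

/* Generic sketch of one-sided queue updates with OpenSHMEM. */
#include <shmem.h>
#include <stdint.h>

#define QUEUE_CAP (1 << 20)

static int64_t *queue;        /* symmetric: same heap offset on every PE */
static int     *queue_tail;

void init_queues(void)        /* must be called collectively by all PEs */
{
    queue      = (int64_t *) shmalloc(QUEUE_CAP * sizeof(int64_t));
    queue_tail = (int *)     shmalloc(sizeof(int));
    *queue_tail = 0;
    shmem_barrier_all();
}

/* Push 'count' newly discovered vertices to the PE that owns them. */
void push_remote(int owner_pe, const int64_t *verts, int count)
{
    /* Atomically claim 'count' slots in the owner's queue. */
    int slot = shmem_int_fadd(queue_tail, count, owner_pe);
    /* One-sided write of the vertices into the claimed slots. */
    shmem_putmem(&queue[slot], verts, count * sizeof(int64_t), owner_pe);
}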
Hybrid MPI+UPC NAS-FT
• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• Truly hybrid program
• For FT (Class C, 128 processes):
  – 34% improvement over UPC (GASNet)
  – 30% improvement over UPC (MV2-X)
[Figure: NAS FT execution time for problem sizes B and C at 64 and 128 processes, comparing UPC (GASNet), UPC (MV2-X), and MPI+UPC (MV2-X); the hybrid version is 34% faster at C-128]
J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Models (PGAS ’10), October 2010
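The key code change – replacing the UPC all-to-all exchange with MPI_Alltoall – can be sketched as follows (not the NAS-FT source; buffer sizes are placeholders). It assumes a unified runtime in which UPC thread i corresponds to MPI rank i and MPI is initialized explicitly inside the UPC program.

/* Hedged sketch: calling MPI_Alltoall from a UPC program. */
#include <upc.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 1024    /* doubles exchanged with each peer thread (placeholder) */

int main(void)
{
    MPI_Init(NULL, NULL);   /* assumed: the unified runtime lets MPI and UPC coexist */

    double *sendbuf = malloc(THREADS * CHUNK * sizeof(double));
    double *recvbuf = malloc(THREADS * CHUNK * sizeof(double));
    int i;
    for (i = 0; i < THREADS * CHUNK; i++)
        sendbuf[i] = MYTHREAD;

    upc_barrier;

    /* The transpose/exchange step done with MPI instead of per-peer UPC puts. */
    MPI_Alltoall(sendbuf, CHUNK, MPI_DOUBLE,
                 recvbuf, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);

    upc_barrier;

    if (MYTHREAD == 0)
        printf("exchange complete on %d threads\n", THREADS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}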
Outline
• OpenSHMEM and UPC for Host
• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host
• OpenSHMEM for MIC
• UPC for MIC
HPC Applications on MIC Clusters
• Flexibility in launching HPC jobs on Xeon Phi clusters
[Figure: execution modes ranging from multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only (OpenSHMEM program on the host), Offload / reverse offload (OpenSHMEM program on one side with computation offloaded to the other), Symmetric (OpenSHMEM programs on both host and coprocessor), and Coprocessor-only]
Data Movement on Intel Xeon Phi Clusters
[Figure: two nodes, each with two CPU sockets connected by QPI, MIC coprocessors attached via PCIe, and an InfiniBand HCA; OpenSHMEM processes run on both the hosts and the MICs]
Communication paths include:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB on remote socket
… and more
• It is critical for runtimes to optimize data movement while hiding this complexity
Need for Non-Uniform Memory Allocation in OpenSHMEM
• MIC cores have limited memory per core
• OpenSHMEM relies on symmetric memory, allocated using shmalloc()
• shmalloc() allocates same amount of memory on all PEs
• For applications running in symmetric mode, this limits the total heap size
• Similar issues for applications (even host-only) with memory load imbalance (Graph500, Out-of-Core Sort, etc.)
• How to allocate different amounts of memory on host and MIC cores, and still be able to communicate?
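A hedged sketch of the constraint (not from the slides): shmalloc() is collective and must allocate the same number of bytes on every PE, so in symmetric mode the usable size is capped by the MIC; the team-based call in the comment is purely hypothetical, illustrating the kind of extension proposed on the next slide.

/* Why symmetric allocation limits the heap in symmetric mode. */
#include <shmem.h>
#include <stddef.h>

void allocate_buffers(void)
{
    /* Standard OpenSHMEM: identical size everywhere, even though host PEs
     * could afford far more memory than MIC PEs. Called by all PEs. */
    size_t sz = 256UL * 1024 * 1024;              /* capped by MIC memory (example value) */
    double *buf = (double *) shmalloc(sz);

    /* A team-based extension (hypothetical API, for illustration only) could
     * let host PEs and MIC PEs allocate different amounts, e.g.:
     *   host_buf = shmalloc_team(host_team, 4UL * 1024 * 1024 * 1024);
     *   mic_buf  = shmalloc_team(mic_team,  256UL * 1024 * 1024);
     */
    (void) buf;
}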
OpenSHMEM Design for MIC Clusters
• Non-Uniform Memory Allocation:
  – Team-based memory allocation (proposed extensions)
  – Address structure for non-uniform memory allocations
Proxy-based Designs for OpenSHMEM
[Figure: OpenSHMEM Put and Get using an active host proxy between MIC1 on HOST1 and MIC2 on HOST2; each transfer is staged as (1) IB REQ, (2) overlapped SCIF read and IB write through the proxy, (3) IB FIN]
• Current-generation architectures impose limitations on read bandwidth when the HCA reads from MIC memory
  – Impacts both put and get operation performance
• Solution: pipelined data transfer by a proxy running on the host, using the IB and SCIF channels
• Improves both latency and bandwidth
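The pipelining idea can be sketched generically (this is not the MVAPICH2-X implementation): the host proxy moves a large message in chunks, overlapping the SCIF read of the next chunk from MIC memory with the IB write of the previously staged chunk. scif_read_start(), ib_write_start(), and wait_for_both() are hypothetical helpers standing in for the real SCIF and IB channel calls.

/* Generic double-buffered pipeline run by the host proxy. */
#include <stddef.h>

#define CHUNK_SZ (512 * 1024)                   /* staging chunk size (assumption) */

void scif_read_start(void *dst, size_t src_off, size_t len);      /* async MIC -> host copy (hypothetical) */
void ib_write_start(const void *src, size_t dst_off, size_t len); /* async host -> remote write (hypothetical) */
void wait_for_both(void);                                         /* complete outstanding operations (hypothetical) */

void proxy_put(size_t total_len)
{
    static char stage[2][CHUNK_SZ];             /* double buffer in host memory */
    size_t off = 0, staged_off = 0, staged_len = 0;
    int cur = 0;

    while (off < total_len || staged_len > 0) {
        size_t len = (total_len - off < CHUNK_SZ) ? (total_len - off) : CHUNK_SZ;

        if (off < total_len)
            scif_read_start(stage[cur], off, len);                  /* stage the next chunk from the MIC */
        if (staged_len > 0)
            ib_write_start(stage[1 - cur], staged_off, staged_len); /* ship the previous chunk over IB */

        wait_for_both();                        /* the two transfers overlap here */

        staged_off = off;
        staged_len = (off < total_len) ? len : 0;
        off += len;
        cur = 1 - cur;
    }
}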
OpenSHMEM Put/Get Performance
[Figures: OpenSHMEM Put and Get latency vs. message size (1 byte to 4 MB), comparing the default design (MV2X-Def) and the proxy-optimized design (MV2X-Opt)]
• Proxy-based designs alleviate the hardware limitations
• Put latency for a 4 MB message: Default 3911 us, Optimized 838 us (4.5X)
• Get latency for a 4 MB message: Default 3889 us, Optimized 837 us (4.6X)
OpenSHMEM Collectives Performance
[Figures: Broadcast and Reduce latency at 1,024 processes vs. message size, comparing MV2X-Def and MV2X-Opt]
• Optimized designs for OpenSHMEM collective operations
• Broadcast latency for a 256 KB message at 1,024 processes: Default 5955 us, Optimized 2268 us (2.6X)
• Reduce latency for a 256 KB message at 1,024 processes: Default 6581 us, Optimized 3294 us (1.9X)
Performance Evaluations using Graph500
[Figures: Graph500 execution time vs. number of processes in native mode (8 processes per MIC, 64–512 processes) and symmetric mode (16 host + 16 MIC processes, 128–1,024 processes), comparing MV2X-Def and MV2X-Opt]
• Graph500 execution time (native mode): at 512 processes, Default 5.17 s, Optimized 4.96 s
  – Performance improvement comes from the MIC-aware collectives design
• Graph500 execution time (symmetric mode): at 1,024 processes, Default 15.91 s, Optimized 12.41 s (28%)
  – Performance improvement comes from the MIC-aware collectives and proxy-based designs
Graph500 Evaluations with Extensions
• Redesigned Graph500 using the MIC to overlap computation and communication
  – Data is transferred to MIC memory; MIC cores pre-process the received data
  – Host processes traverse vertices and send out new vertices
• Graph500 execution time at 1,024 processes:
  – Host-only: 0.33 s, Host+MIC with extensions: 0.26 s (26% improvement)
• Magnitudes of improvement compared to the default symmetric mode
  – Default symmetric mode: 12.1 s, Host+MIC with extensions: 0.16 s
[Figure: execution time vs. number of processes (128–1,024) for Host and Host+MIC (extensions)]
J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko and D. K. Panda, High Performance OpenSHMEM for Intel MIC Clusters: Extensions, Runtime Designs and Application Co-Design, IEEE International Conference on Cluster Computing (CLUSTER '14) (Best Paper Nominee)
Outline
• OpenSHMEM and UPC for Host
• Hybrid MPI + PGAS (OpenSHMEM and UPC) for Host
• OpenSHMEM for MIC
• UPC for MIC
HPC Applications on MIC Clusters
• Flexibility in launching HPC jobs on Xeon Phi clusters
[Figure: execution modes ranging from multi-core-centric (Xeon) to many-core-centric (Xeon Phi): Host-only (UPC program on the host), Offload / reverse offload (UPC program on one side with computation offloaded to the other), Symmetric (UPC programs on both host and coprocessor), and Coprocessor-only]
UPC on MIC clusters
• Process model
  – Communication via shared memory – single copy / double copy
  – One connection per process
  – Drawbacks: communication overheads, network connections
• Thread model
  – Direct access to the shared array: same memory space
  – No extra copy
  – No shared-memory mapping entries
  – Drawback: sharing of network connections
  – Our work on multi-endpoint runtimes addresses this issue
M. Luo, J. Jose, S. Sur and D. K. Panda, Multi-threaded UPC Runtime with Network Endpoints: Design Alternatives and Evaluation on Multi-core Architectures, Int'l Conference on High Performance Computing (HiPC '11), Dec. 2011
UPC Runtime on MIC: Remote Memory Access between Host and MIC
[Figure: original all-to-all connection mode vs. leader-to-all connection mode between the Intel Xeon host CPUs and the global memory region on the Intel Xeon Phi coprocessor]
• Original connection mode: each UPC thread pins down the part of the global memory region that has affinity with it, and all-to-all connections are established between every pair of UPC threads on the MIC and the host
• Leader-to-all connection mode: only one (leader) UPC thread pins down the whole global memory region on the MIC; all UPC threads on the host access it through the leader UPC thread
• Connections: N_MIC × N_HOST reduced to N_MIC + N_HOST
UPC Runtime on MIC: Remote Memory Access between Host and MIC
• Process-based runtime: the whole global memory region needs to be mapped as a shared-memory region (e.g., via PSHM)
• Thread-based runtime: since the threads share the same memory space, the leader can access the other UPC threads' global memory regions on the MIC without any shared-memory mapping
[Figure: host CPUs accessing the global memory region on the MIC through the leader thread]
Application Benchmark Evaluation for Native Mode
[Figures: execution time of the NAS benchmarks CG.B, EP.B, FT.B, IS.B, and MG.B, and of RandomAccess and UTS, comparing native-host and native-MIC runs, with 2.6X and 4X gaps observed on RandomAccess and UTS]
• The host and the MIC are occupied with 16 and 60 UPC threads, respectively
• The MIC achieves 80%, 67%, and 54% of the 16-CPU host performance for MG, EP, and FT, respectively
• For the communication-intensive benchmarks CG and IS, the MIC takes 7x and 3x the execution time of the host
• The applications are run without modification – these results are not representative of optimal performance
M. Luo, M. Li, A. Venkatesh, X. Lu and D. K. Panda, UPC on MIC: Early Experiences with Native and Symmetric Modes, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
Concluding Remarks
• HPC Systems are evolving to support growing needs of exascale applications
• Hybrid programming (MPI + PGAS) is a practical way to program exascale systems
• Presented designs to demonstrate performance potential of PGAS and hybrid MPI+PGAS models
• MIC support for OpenSHMEM and UPC will be available in future MVAPICH2-X releases
Funding Acknowledgments
Funding Support by
Equipment Support by
Personnel Acknowledgments Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Docs – K. Hamidouche
– J. Lin
Current Programmers – M. Arnold
– J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associates – H. Subramoni
– X. Lu
Past Programmers – D. Bureddy
Web Pointers
http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~hamidouc
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page http://mvapich.cse.ohio-state.edu
[email protected] [email protected]