Experiences with Co-array Fortran on Hardware Shared Memory Platforms
Yuri Dotsenko Cristian Coarfa
John Mellor-Crummey Daniel Chavarria-Miranda
Rice University, Houston, TX
Co-array Fortran
Global Address Space (GAS) language
SPMD programming model
Simple extension of Fortran 90
Explicit control over data placement and computation distribution
Private data
Shared data: both local and remote
One-sided communication (PUT and GET)
Team and point-to-point synchronization
Co-array Fortran: Example
  integer :: a(10,20)[*]
  if (this_image() > 1) &
    a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
Figure: images 1, 2, …, N each hold their own a(10,20); every image except image 1 copies from its left neighbor.
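The neighbor copy above can be tried as a complete program. The following is a minimal sketch, assuming a compiler that accepts Fortran 2008 coarrays (for instance gfortran with `-fcoarray=single` for a one-image run). The `sync all` statements, the fill values, and the final checks are additions for illustration and are not part of the slide.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10,20)[*]   ! every image owns its own a(10,20)
  integer :: me

  me = this_image()
  a = me                   ! assumed fill value: the index of the owning image

  sync all                 ! all images finish initializing before any remote read

  if (me > 1) then
    ! One-sided GET: read the last two columns of the left neighbor
    ! into the first two columns of the local array.
    a(1:10,1:2) = a(1:10,19:20)[me-1]
  end if

  sync all

  ! Self-check: the copied columns hold the neighbor's index, the rest is unchanged.
  if (me > 1) then
    if (any(a(:,1:2) /= me-1)) error stop "unexpected values in copied columns"
  end if
  if (any(a(:,3:20) /= me)) error stop "local columns were modified"

  if (me == 1) print *, "neighbor copy ok on", num_images(), "image(s)"
end program left_neighbor_copy
```

Columns 19:20 are only read remotely and columns 1:2 are only written locally, so the two synchronization points are enough to order the accesses in this sketch.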
Compiling CAF
Source-to-source translation
Prototype Rice cafc
  Fortran 90 pointer-based co-array representation
  ARMCI-based data movement
Goal: performance transparency
Challenges:
  Retain CAF source-level information
    Array contiguity, array bounds, lack of aliasing
  Exploit efficient fine-grain communication on SMPs
Outline
Co-array representation and data access
  Local data
  Remote data
Experimental evaluation
Conclusions
Representation and Access for Local Data
Efficient local access to SAVE/COMMON co-arrays is crucial to achieving best performance on a target architecture
Fortran 90 pointer
Fortran 90 pointer to structure
Cray pointer
Subroutine argument
COMMON block (need support for symmetric shared objects)
Fortran 90 Pointer Representation
CAF declaration:
  real, save :: a(10,20)[*]
After translation:
  type T1
    integer(PtrSize) handle
    real, pointer :: local(:,:)
  end type T1
  type (T1) ca
Local access: ca%local(2,3)
Portable representation
Back-end compiler has no knowledge about:
  Potential aliasing (no-alias flags for some compilers)
  Contiguity
  Bounds
Implemented in cafc
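To make the pointer-based representation concrete, here is a small single-image sketch. It assumes that `PtrSize` is a 64-bit integer kind and uses an ordinary `target` array named `storage` as a stand-in for the memory a runtime would allocate; both are hypothetical choices for illustration, and the handle value is a placeholder.

```fortran
program f90_pointer_representation
  implicit none
  ! Assumption: handles are stored in a 64-bit integer.
  integer, parameter :: PtrSize = selected_int_kind(18)

  type T1
    integer(PtrSize) :: handle        ! opaque runtime handle (placeholder here)
    real, pointer    :: local(:,:)    ! local part of the co-array
  end type T1

  type (T1) :: ca
  ! Hypothetical stand-in for runtime-allocated co-array memory.
  real, target, allocatable :: storage(:,:)

  allocate(storage(10,20))
  storage = 0.0
  ca%handle = 0_PtrSize
  ca%local => storage

  ! Translated form of the CAF statement a(2,3) = 1.5
  ca%local(2,3) = 1.5

  if (storage(2,3) /= 1.5) error stop "pointer does not refer to the storage"
  print *, "bounds seen through the pointer:", shape(ca%local)
end program f90_pointer_representation
```

Because `local` is declared with deferred shape `(:,:)`, its bounds and contiguity are known only at run time, which is the information loss the slide points out.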
Fortran 90 Pointer to Structure Representation
CAF declaration: real, save :: a(10,20)[*]
After translation: type T1
real :: local(10,20)
end type T1
type (T1), pointer :: ca
Conveys constant bounds and contiguity
Potential aliasing is still a problem
Cray Pointer Representation
CAF declaration: real, save :: a(10,20)[*]
After translation: real :: a_local(10,20)
pointer (a_ptr, a_local)
Conveys constant bounds and contiguity
Potential aliasing is still a problem
Cray pointers are not part of the Fortran 90 standard
Subroutine Argument Representation
CAF source:
  subroutine foo(…)
    real, save :: a(10,20)[*]
    a(i,j) = … + a(i-1,j) * …
  end subroutine foo
After translation:
  subroutine foo(…)
    ! F90 representation for co-array a
    call foo_body(ca%local(1,1), ca%handle, …)
  end subroutine foo
  subroutine foo_body(a_local, a_handle, …)
    real :: a_local(10,20)
    a_local(i,j) = … + a_local(i-1,j) * …
  end subroutine foo_body
Subroutine Argument Representation (cont.)
Avoid conservative assumptions about co-array aliasing by the back-end compiler
Performance is close to optimal
Extra procedures and procedure calls
Implemented in cafc
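The procedure-splitting idea can be sketched as a runnable single-image program. This is an illustration under assumptions, not the code cafc generates: the handle argument is omitted, `storage` is a hypothetical stand-in for the co-array memory, the loop body is an invented prefix sum, and the wrapper passes the whole array `ca%local` instead of the slide's `ca%local(1,1)` to keep the sketch portable.

```fortran
program procedure_splitting
  implicit none

  type T1
    real, pointer :: local(:,:)
  end type T1

  type (T1) :: ca
  real, target :: storage(10,20)   ! hypothetical stand-in for co-array memory
  integer :: i

  storage = 1.0
  ca%local => storage

  call foo()

  ! The body computes a running sum down each column, so row i holds i.
  do i = 1, 10
    if (any(storage(i,:) /= real(i))) error stop "unexpected result"
  end do
  print *, "procedure splitting sketch ok"

contains

  ! Wrapper: keeps the original name and forwards the local part.
  subroutine foo()
    call foo_body(ca%local)
  end subroutine foo

  ! Body: the co-array appears as an explicit-shape dummy argument, so the
  ! compiler sees constant bounds, contiguity, and an argument it may treat
  ! as not aliased with other dummy arguments.
  subroutine foo_body(a_local)
    real, intent(inout) :: a_local(10,20)
    integer :: i, j
    do j = 1, 20
      do i = 2, 10
        a_local(i,j) = a_local(i,j) + a_local(i-1,j)
      end do
    end do
  end subroutine foo_body

end program procedure_splitting
```

The extra procedure and the extra call visible here are the cost the slide lists for this representation.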
COMMON Block Representation
CAF declaration: real :: a(10,20)[*]
common /a_cb/ a
After translation: real :: ca(10,20)
common /ca_cb/ ca
Yields best performance for local accesses
OS must support symmetric data objects
Outline
Co-array representation and data access
  Local data
  Remote data
Experimental evaluation
Conclusions
Generating CAF Communication
Generic parallel architectures
  Library function calls to move data
Shared memory architectures (load/store)
  Fortran 90 pointers
  Vector of Fortran 90 pointers
  Cray pointers
Communication Generation for Generic Parallel Architectures
CAF code:
  a(:) = b(:)[p] + …
Translated code:
  allocate b_temp(:)
  call GET( b, p, b_temp, … )
  a(:) = b_temp(:) + …
  deallocate b_temp
Portable: works on clusters and SMPs
Function overhead per fine-grain access
Uses temporary to hold off-processor data
Implemented in cafc
Communication Generation Using Fortran 90 Pointers
CAF code:
  do j = 1, N
    C(j) = A(j)[p]
  end do
Translated code:
  do j = 1, N
    ptrA => A(j)
    call CafSetPtr(ptrA,p,A_handle)
    C(j) = ptrA
  end do
Function call overhead for each reference
Implemented in cafc
Pointer Initialization Hoisting
Naïvely translated code:
  do j = 1, N
    ptrA => A(j)
    call CafSetPtr(ptrA,p,A_handle)
    C(j) = ptrA
  end do
Code with hoisted pointer initialization:
  ptrA => A(1:N)
  call CafSetPtr(ptrA,p,A_handle)
  do j = 1, N
    C(j) = ptrA(j)
  end do
Pointer initialization hoisting is not yet implemented in cafc
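The effect of hoisting on the number of runtime calls can be illustrated without the real runtime. In the sketch below, `set_ptr`, `remote_A`, and `calls` are hypothetical stand-ins: `set_ptr` only mimics the role of `CafSetPtr` by associating a pointer with a section of a local array and counting how often it is invoked. It does not reproduce the actual cafc runtime interface.

```fortran
module fake_runtime
  implicit none
  integer :: calls = 0
  ! Hypothetical stand-in for co-array A on image p.
  real, target :: remote_A(8) = [1., 2., 3., 4., 5., 6., 7., 8.]
contains
  ! Hypothetical stand-in for CafSetPtr: point ptr at elements lo..hi.
  subroutine set_ptr(ptr, lo, hi)
    real, pointer, intent(out) :: ptr(:)
    integer, intent(in) :: lo, hi
    ptr => remote_A(lo:hi)
    calls = calls + 1
  end subroutine set_ptr
end module fake_runtime

program pointer_hoisting
  use fake_runtime
  implicit none
  integer, parameter :: N = 8
  real :: C_naive(N), C_hoisted(N)
  real, pointer :: ptrA(:)
  integer :: j, naive_calls, hoisted_calls

  ! Naive translation: one pointer setup per element reference.
  calls = 0
  do j = 1, N
    call set_ptr(ptrA, j, j)
    C_naive(j) = ptrA(1)
  end do
  naive_calls = calls

  ! Hoisted translation: one pointer setup for the whole loop.
  calls = 0
  call set_ptr(ptrA, 1, N)
  do j = 1, N
    C_hoisted(j) = ptrA(j)
  end do
  hoisted_calls = calls

  if (any(C_naive /= C_hoisted)) error stop "results differ"
  if (naive_calls /= N .or. hoisted_calls /= 1) error stop "unexpected call counts"
  print *, "setup calls, naive vs hoisted:", naive_calls, hoisted_calls
end program pointer_hoisting
```

Both loops produce the same values; the hoisted form replaces N setup calls with one and leaves a plain indexed load inside the loop.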
Communication Generation Using Vector of Fortran 90 Pointers
CAF code:
  do j = 1, N
    C(j) = A(j)[p]
  end do
Translated code:
  … initialization …
  do j = 1, N
    C(j) = ptrVectorA(p)%ptrA(j)
  end do
Does not require pointer initialization hoisting and avoids function calls
Worse performance than that of hoisted pointer initialization
Communication Generation Using Cray Pointers
CAF code:
  do j = 1, N
    C(j) = A(j)[p]
  end do
Translated code:
  integer(PtrSize) :: addrA(:)
  … addrA initialization …
  do j = 1, N
    ptrA = addrA(p)
    C(j) = A_rem(j)
  end do
addrA(p): address of co-array A on image p
Cray pointer initialization hoisting yields only marginal improvement
Outline
Co-array representation and data access
  Local data
  Remote data
Experimental evaluation
Conclusions
Experimental Platforms
SGI Altix 3000
  128 Itanium2 1.5 GHz, 6 MB L3 cache processors
  Linux (2.4.21 kernel)
  Intel Fortran Compiler 8.0
SGI Origin 2000
  16 MIPS R12000 350 MHz, 8 MB L2 cache processors
  IRIX64 6.5
  MIPSpro Compiler 7.3.1.3m
Benchmarks
STREAM
Random Access
Spark98
NAS MG and SP
STREAM
Copy kernel
  DO J = 1, N                  DO J = 1, N
    C(J) = A(J)                  C(J) = A(J)[p]
  END DO                       END DO
Triad kernel
  DO J = 1, N                  DO J = 1, N
    A(J)=B(J)+s*C(J)             A(J)=B(J)[p]+s*C(J)[p]
  END DO                       END DO
Goal: investigate how well architecture bandwidth can be delivered up to the language level
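A self-checking version of the remote triad kernel can be written with Fortran 2008 coarrays. This is a sketch under assumptions: the array length, the value of `s`, the fill values, and the choice of `p` as the right neighbor with wrap-around are all invented for illustration, and no timing is done, so it shows the access pattern rather than the benchmark itself.

```fortran
program stream_triad_remote
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: N = 1000            ! assumed small size for a quick check
  real(dp), parameter :: s = 3.0_dp         ! assumed scalar
  real(dp) :: A(N)[*], B(N)[*], C(N)[*]
  integer :: me, p, j

  me = this_image()
  p  = mod(me, num_images()) + 1            ! assumed target: right neighbor, wrapping

  B = real(me, dp)
  C = 2.0_dp * real(me, dp)

  sync all                                  ! remote B and C must be initialized

  ! Triad kernel with both operands read from image p.
  do j = 1, N
    A(j) = B(j)[p] + s * C(j)[p]
  end do

  ! With the fill values above, every element equals p + 3*(2p) = 7p.
  if (any(A /= 7.0_dp * real(p, dp))) error stop "triad mismatch"

  sync all
  if (me == 1) print *, "remote triad ok on", num_images(), "image(s)"
end program stream_triad_remote
```

On one image `p` is the image itself, so the same source also exercises the purely local path.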
STREAM: Local Accesses
COMMON block is the best, if platform allows
Subroutine parameter has similar performance to COMMON block representation
Pointer-based representations have performance within 5% of the best on the Altix (with no-aliasing flag), and within 15% on the Origin
Fortran 90 pointer representation yields 30% of performance on the Altix without using the flag to specify lack of pointer aliasing
Array section statements with Fortran 90 pointer representation yield 40-50% performance on the Origin
STREAM: Remote Accesses
COMMON block representation for local access + Cray pointer for remote accesses is the best
Subroutine argument + Cray pointer for remote accesses has similar performance
Remote accesses with function call per access yield very poor performance (24 times slower than the best on the Altix, five times slower on the Origin)
Generic strategy (with intermediate temporaries) delivers only 50-60% of performance on the Altix and 30-40% of performance on the Origin for vectorized code (except for Copy kernel)
Pointer initialization hoisting is crucial for Fortran 90 pointers remote accesses and desirable for Cray pointers
Similarly coded OpenMP version has comparable performance on the Altix (90% for the scale kernel) and 86-90% on the Origin
Spark98
Based on CMU’s earthquake simulation code
Computes sparse matrix-vector product
Irregular application with fine-grain accesses
Matrix distribution and computation partitioning is done offline (sf2 traces)
Spark98 computes partial product locally, then assembles the result across processors
Spark98 (cont.)
Versions
  Serial (Fortran kernel, ported from C)
  MPI (Fortran kernel, ported from C)
  Hybrid (best shared memory threaded version)
  CAF versions (based on MPI version):
    CAF Packed PUTs
    CAF Packed GETs
    CAF GETs (computation with remote data accessed “in place”)
Spark98 GETs Result Assembly
  v2(:,:) = v(:,:)
  call sync_all()
  do s = 0, subdomains-1
    if (commindex(s) < commindex(s+1)) then
      pos = commindex(s)
      comm_len = commindex(s+1) - pos
      v(:, comm(pos:pos+comm_len-1)) = &
        v(:, comm(pos:pos+comm_len-1)) + &
        v2(:, comm_gets(pos:pos+comm_len-1))[s]
    end if
  end do
  call sync_all()
Spark98 Performance on Altix
Performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs
CAF GETs is simple and more “natural” to code, but up to 13% slower
Without considering locality, applications do not scale on NUMA architectures (Hybrid)
ARMCI library is more efficient than MPI
NAS MG and SP
Versions:
  MPI (NPB 2.3)
  CAF (based on MPI NPB 2.3)
    Generic code generation with subroutine argument co-array representation (procedure splitting)
    Shared memory code generation (Fortran 90 pointers; vectorized source code) with subroutine argument co-array representation
  OpenMP (NPB 3.0)
Class C
NAS SP Performance on Altix
Performance of CAF versions is comparable to that of MPI
CAF-generic has better performance than CAF-shm because it uses memcpy, which hides latency by keeping an optimal number of memory operations in flight
OpenMP scales poorly
NAS MG Performance on Altix
Conclusions
Direct load/store communication improves performance of fine-grain accesses by a factor of 24 on the Altix 3000 and five on the Origin 2000
“In-place” data use in CAF statements incurs acceptable abstraction overhead
Performance comparable to that of MPI codes for fine- and coarse-grain applications
We plan to implement in cafc optimal, architecture-dependent code generation for local and remote co-array accesses
www.hipersoft.rice.edu/caf