Torsten Hoefler: Progress in automatic GPU compilation and why you want to run MPI on your GPU (spcl.inf.ethz.ch, @spcl_eth)

TRANSCRIPT

[Slide 1]
TORSTEN HOEFLER
Progress in automatic GPU compilation and
why you want to run MPI on your GPU
with Tobias Grosser and Tobias Gysi @ SPCL
presented at CCDSC, Lyon, France, 2016
[Slide 2]
#pragma ivdep
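For reference, #pragma ivdep is a compiler hint (notably for icc) asserting that the loop that follows carries no dependences the vectorizer must otherwise assume. A minimal sketch with illustrative names; if the programmer's promise is wrong, the program silently computes incorrect results:

/* The pragma tells the compiler that the indirect accesses below do not
   alias, so the loop may be vectorized; the compiler itself cannot prove
   that the idx[] values are distinct. */
void scatter_add(double *a, const double *b, const int *idx, int n) {
  #pragma ivdep
  for (int i = 0; i < n; ++i)
    a[idx[i]] += b[i];
}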
[Slide 3]
!$ACC DATA &
!$ACC PRESENT(density1,energy1) &
!$ACC PRESENT(vol_flux_x,vol_flux_y,volume,mass_flux_x,mass_flux_y,vertexdx,vertexdy) &
!$ACC PRESENT(pre_vol,post_vol,ener_flux)
!$ACC KERNELS
IF(dir.EQ.g_xdir) THEN
  IF(sweep_number.EQ.1) THEN
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+(vol_flux_x(j+1,k)-vol_flux_x(j,k)+vol_flux_y(j,k+1)-vol_flux_y(j,k))
        post_vol(j,k)=pre_vol(j,k)-(vol_flux_x(j+1,k)-vol_flux_x(j,k))
      ENDDO
    ENDDO
  ELSE
!$ACC LOOP INDEPENDENT
    DO k=y_min-2,y_max+2
!$ACC LOOP INDEPENDENT
      DO j=x_min-2,x_max+2
        pre_vol(j,k)=volume(j,k)+vol_flux_x(j+1,k)-vol_flux_x(j,k)
        post_vol(j,k)=volume(j,k)
      ENDDO
    ENDDO
  ENDIF
[Slide 4]

Heitlager et al.: A Practical Model for Measuring Maintainability
[Slide 5]
!$ACC DATA &
!$ACC COPY(chunk%tiles(1)%field%density0) &
!$ACC COPY(chunk%tiles(1)%field%density1) &
!$ACC COPY(chunk%tiles(1)%field%energy0) &
!$ACC COPY(chunk%tiles(1)%field%energy1) &
!$ACC COPY(chunk%tiles(1)%field%pressure) &
!$ACC COPY(chunk%tiles(1)%field%soundspeed) &
!$ACC COPY(chunk%tiles(1)%field%viscosity) &
!$ACC COPY(chunk%tiles(1)%field%xvel0) &
!$ACC COPY(chunk%tiles(1)%field%yvel0) &
!$ACC COPY(chunk%tiles(1)%field%xvel1) &
!$ACC COPY(chunk%tiles(1)%field%yvel1) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_x) &
!$ACC COPY(chunk%tiles(1)%field%vol_flux_y) &
!$ACC COPY(chunk%tiles(1)%field%mass_flux_x)&
!$ACC COPY(chunk%tiles(1)%field%mass_flux_y)&
!$ACC COPY(chunk%tiles(1)%field%volume) &
!$ACC COPY(chunk%tiles(1)%field%work_array1)&
!$ACC COPY(chunk%tiles(1)%field%work_array2)&
!$ACC COPY(chunk%tiles(1)%field%work_array3)&
!$ACC COPY(chunk%tiles(1)%field%work_array4)&
!$ACC COPY(chunk%tiles(1)%field%work_array5)&
!$ACC COPY(chunk%tiles(1)%field%work_array6)&
!$ACC COPY(chunk%tiles(1)%field%work_array7)&
!$ACC COPY(chunk%tiles(1)%field%cellx) &
!$ACC COPY(chunk%tiles(1)%field%celly) &
!$ACC COPY(chunk%tiles(1)%field%celldx) &
!$ACC COPY(chunk%tiles(1)%field%celldy) &
!$ACC COPY(chunk%tiles(1)%field%vertexx) &
!$ACC COPY(chunk%tiles(1)%field%vertexdx) &
!$ACC COPY(chunk%tiles(1)%field%vertexy) &
!$ACC COPY(chunk%tiles(1)%field%vertexdy) &
!$ACC COPY(chunk%tiles(1)%field%xarea) &
!$ACC COPY(chunk%tiles(1)%field%yarea) &
!$ACC COPY(chunk%left_snd_buffer) &
!$ACC COPY(chunk%left_rcv_buffer) &
!$ACC COPY(chunk%right_snd_buffer) &
!$ACC COPY(chunk%right_rcv_buffer) &
!$ACC COPY(chunk%bottom_snd_buffer) &
!$ACC COPY(chunk%bottom_rcv_buffer) &
!$ACC COPY(chunk%top_snd_buffer) &
!$ACC COPY(chunk%top_rcv_buffer)
SLOCCount, *.f90: 6,440 lines
!$ACC directives: 833 (13%)
[Slide 6]
[Slide 7]
do i = 0, N
  do j = 0, i
    y(i,j) = ( y(i,j) + y(i,j+1) )/2
  enddo
enddo
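A loop nest like this is exactly what Polly-ACC targets without any source annotations: the polyhedral analysis must prove the triangular bound j <= i and track that iteration j reads y(i,j+1) before iteration j+1 overwrites it, so naive parallelization of the inner loop would be unsafe. As a sketch, an invocation might look like the line below; the -polly-target=gpu flag name follows the Polly documentation, and exact spellings may differ by version:

clang -O3 -mllvm -polly -mllvm -polly-target=gpu stencil.c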
[Slide 8]
Some results: Polybench 3.2
Speedup over icc -O3: arithmetic mean ~30x, geometric mean ~6x
Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
[Slide 9]
Compiles all of SPEC CPU 2006 – Example: LBM
[Bar chart: runtime (m:s), 0:00 to 8:24, on a Mobile system and a Workstation, comparing icc, icc -openmp, clang, and Polly-ACC; annotated improvements of ~20% and ~4x]
Workstation: Xeon E5-2690 (10 cores, 0.5 Tflop) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop)
Mobile: essentially my 4-core x86 laptop with the (free) GPU that's in there
T. Grosser, TH: Polly-ACC: Transparent compilation to heterogeneous hardware, ACM ICS'16
[Slide 10]
[Slide 11]
GPU latency hiding vs. MPI
[Timeline diagram: ld/st instruction streams of active threads on a device compute core, with spare threads covering instruction latency]

CUDA:
• over-subscribe hardware
• use spare parallel slack for latency hiding

MPI:
• host controlled
• full device synchronization
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
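For contrast, a minimal sketch of the host-controlled pattern the MPI side of the slide refers to; the kernel and buffer names are illustrative, and the point is the full device synchronization separating compute from communication:

#include <mpi.h>
#include <cuda_runtime.h>

// Toy 1D stencil kernel, purely illustrative.
__global__ void stencil_kernel(const double *in, double *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i > 0 && i < n - 1) out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}

void time_loop(double *d_in, double *d_out, double *h_out, double *h_in,
               int n, int halo, int steps, int left, int right) {
  for (int step = 0; step < steps; ++step) {
    stencil_kernel<<<128, 256>>>(d_in, d_out, n);        // compute on device
    cudaDeviceSynchronize();                             // drain the whole device
    cudaMemcpy(h_out, d_out, halo * sizeof(double),
               cudaMemcpyDeviceToHost);                  // stage halo on host
    MPI_Sendrecv(h_out, halo, MPI_DOUBLE, right, 0,
                 h_in,  halo, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);     // host-side exchange
    cudaMemcpy(d_in, h_in, halo * sizeof(double),
               cudaMemcpyHostToDevice);                  // push halo back
    double *t = d_in; d_in = d_out; d_out = t;           // swap buffers
  }
}

Individual steps can be pipelined with streams, but the device still has to be synchronized before the host can send correct data; that gap is what dCUDA targets.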
[Slide 12]
Hardware latency hiding at the cluster level?
[Timeline diagram: ld/st/put instruction streams of active threads on a device compute core; remote puts are overlapped like ordinary memory accesses]

dCUDA (distributed CUDA):
• unified programming model for GPU clusters
• avoid unnecessary device synchronization to enable system-wide latency hiding
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 13]
dCUDA: MPI-3 RMA extensions
for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}
computation:
• iterative stencil kernel
• thread-specific idx

communication:
• map ranks to blocks
• device-side put/get operations
• notifications for synchronization
• shared and distributed memory
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
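Reading the kernel top to bottom: each rank updates its interior points, puts its boundary rows into the windows of its left and right neighbors (attaching a notification to each put), blocks until the matching notifications from its own neighbors arrive, and then swaps the in/out buffers and windows for the next iteration. No host round-trip and no device-wide synchronization appears anywhere in the loop.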
[Slide 14]
Hardware supported communication overlap
traditional MPI-CUDA vs. dCUDA
[Timeline diagram: scheduling of active blocks 1-8 on the device compute cores; under traditional MPI-CUDA the whole device synchronizes before communication, while dCUDA keeps blocks active so communication of some blocks overlaps computation of others]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 15]
The dCUDA runtime system
[Architecture diagram: host-side block manager with event handler and MPI; device-side device library; per-block context with logging, command, ack, and notification queues linking host and device (one set per block, replicated for more blocks)]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
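To make the queue-based split concrete, here is a minimal, entirely illustrative sketch (not the dCUDA implementation) of how a device-side library can hand commands to a host-side event handler through a ring buffer placed in pinned, mapped host memory (e.g., allocated with cudaHostAlloc and cudaHostAllocMapped):

#include <cuda_runtime.h>

struct Command { int type; int target_rank; size_t offset, size; };  // illustrative record

struct Queue {                 // single-producer/single-consumer ring buffer
  Command slots[1024];
  volatile unsigned head;      // advanced by the device (producer)
  volatile unsigned tail;      // advanced by the host (consumer)
};

// Device side: publish a command to the host.
__device__ void enqueue(Queue *q, Command c) {
  unsigned h = q->head;
  q->slots[h % 1024] = c;
  __threadfence_system();      // make the payload visible to the host first
  q->head = h + 1;
}

// Host side: the event handler polls the queue and translates commands
// into MPI operations (puts, notifications) on behalf of the device.
void poll(Queue *q) {
  while (q->tail != q->head) {
    Command c = q->slots[q->tail % 1024];
    // ... issue the corresponding MPI RMA call for c here ...
    q->tail = q->tail + 1;
  }
}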
[Slide 16]
Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node
(Very) simple stencil benchmark
[Plot: execution time [ms] (0-1000) vs. # of copy iterations per exchange (30, 60, 90); series for compute & exchange, compute only, and halo exchange; annotated "no overlap"]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 17]
Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node
Real stencil (COSMO weather/climate code)
[Plot: execution time [ms] (0-100) vs. # of nodes (2, 4, 6, 8); series for dCUDA, MPI-CUDA, and halo exchange]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 18]
Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node
Particle simulation code (Barnes Hut)
[Plot: execution time [ms] (0-200) vs. # of nodes (2, 4, 6, 8); series for dCUDA, MPI-CUDA, and halo exchange]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 19]
Benchmarked on 8 Haswell nodes with 1x Tesla K80 per node
Sparse matrix-vector multiplication
[Plot: execution time [ms] (0-200) vs. # of nodes (1, 4, 9); series for dCUDA, MPI-CUDA, and communication]
T. Gysi, J. Baer, TH: dCUDA: Hardware Supported Overlap of Computation and Communication, ACM/IEEE SC16 (preprint at SPCL page)
[Slide 20]
for (int i = 0; i < steps; ++i) {
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] + in[idx + 1] + in[idx - 1]
             + in[idx + jstride] + in[idx - jstride];

  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1, len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1, 0, jstride, &out[len], tag);

  dcuda_wait_notifications(ctx, wout, DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}
Polly-ACC (http://spcl.inf.ethz.ch/Polly-ACC): automatic, "regression free", high performance
dCUDA (distributed memory): automatic overlap, high performance
[Slide 21]
LLVM Nightly Test Suite
[Log-scale bar chart (1 to 10,000): counts of SCoPs and of 0-dim, 1-dim, 2-dim, and 3-dim compute kernels, with and without heuristics]
[Slide 22]
Cactus ADM (SPEC 2006)
[Runtime chart for the Workstation and Mobile systems]
[Slide 23]
Evading various “ends” – the hardware view