pgi x64+gpu fortran and c99 compilers · describe the problem of programming an x64+gpu system...

PGI x64+GPU Fortran and C99

Compilers

Michael [email protected]://www.pgroup.com

June 2009

Abstracted x64+Accelerator Architecture

Simple Fortran Matrix Multiply

for an x64 Host

do j = 1, m do k = 1, p do i = 1,n a(i,j) = a(i,j) + b(i,k)*c(k,j)_ enddo enddo enddo

!$omp parallel do do j = 1, m do k = 1, p do i = 1,n a(i,j) = a(i,j) + b(i,k)*c(k,j)_ enddo enddo enddo

Parallel Fortran Matrix Multiply

for a Multicore x64 Host

extern "C" __global__ voidmmkernel( float* a,float* b,float* c, int la,int lb,int lc,int n, int m,int p )_{ int i = blockIdx.x*64+threadIdx.x; int j = blockIdx.y;

float sum = 0.0; for( int k = 0; k < p; ++k )_ sum += b[i+lb*k] * c[k+lc*j]; a[i+la*j] = sum;}

Basic CUDA C Matrix Multiply Kernel

for an NVIDIA GPU

extern "C" __global__ voidmmkernel( float* a, float* b, float* c, int la, int lb, int lc, int n, int m, int p )_{ int tx = threadIdx.x; int i = blockIdx.x*128 + tx; int j = blockIdx.y*4; __shared__ float cb0[128], cb1[128], cb2[128], cb3[128];

float sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0; for( int ks = 0; ks < p; ks += 128 ){ cb0[tx] = c[ks+tx+lc*j]; cb1[tx] = c[ks+tx+lc*(j+1)]; cb2[tx] = c[ks+tx+lc*(j+2)]; cb3[tx] = c[ks+tx+lc*(j+3)]; __syncthreads(); for( int k = 0; k < 128; k+=4 ){ float rb = b[i+lb*(k+ks)]; sum0 += rb * cb0[k]; sum1 += rb * cb1[k]; sum2 += rb * cb2[k]; sum3 += rb * cb3[k]; rb = b[i+lb*(k+ks+1)]; sum0 += rb * cb0[k+1]; sum1 += rb * cb1[k+1]; sum2 += rb * cb2[k+1]; sum3 += rb * cb3[k+1]; rb = b[i+lb*(k+ks+2)]; sum0 += rb * cb0[k+2]; sum1 += rb * cb1[k+2]; sum2 += rb * cb2[k+2]; sum3 += rb * cb3[k+2]; rb = b[i+lb*(k+ks+3)]; sum0 += rb * cb0[k+3]; sum1 += rb * cb1[k+3]; sum2 += rb * cb2[k+3]; sum3 += rb * cb3[k+3]; } __syncthreads(); } a[i+la*j] = sum0; a[i+la*(j+1)] = sum1; a[i+la*(j+2)] = sum2; a[i+la*(j+3)] = sum3;}

Optimized

CUDA C

Matrix

Multiply

Kernel

Host-side CUDA C Matrix Multiply

GPU Control Code

cudaMalloc( &bp, memsize );cudaMalloc( &ap, memsize );cudaMalloc( &cp, memsize );

cudaMemcpy( bp, b, memsize, cudaMemcpyHostToDevice );cudaMemcpy( cp, c, memsize, cudaMemcpyHostToDevice );cudaMemcpy( ap, a, memsize, cudaMemcpyHostToDevice );

dim3 threads( 128 );dim3 blocks( matsize/128, matsize/4 );mmkernel<<<blocks,threads>>>(ap,bp,cp,nsize,nsize, nsize,matsize,matsize,matsize);

cudaMemcpy( a, ap, memsize, cudaMemcpyDeviceToHost );

cudaFree( ap );cudaFree( bp );cudaFree( cp );

Roadmap for this Talk

Describe the problem of programming an x64+GPU system

Describe the PGI Accelerator “Kernels” programming model

Show how the Kernels model maps onto x64+GPU architectures

Describe directives used to program to the model

Give a status of the compiler, experiences, plans

Discuss the generality and limitations of the model

What is a “Kernels”Programming Model?

Kernel – a multidimensional parallel loop, where the

body of the loop is the kernel code and the loops define

the index set or domain

Program – a sequence of kernels, where each kernel

executes to completion over its index set before the next

kernel can start

Parallelism – is exploited between iterations of a kernel,

not between multiple kernels executing in parallel

dopar j = 1, m dopar is = 1, n, 64 dovec i = is, is+63 sum = 0.0 doseq k = 1, p sum += b(i,k) * c(k,j) enddo a(i,j) = sum enddo enddoenddo

Kernels Model Pseudo-code forSimple Matrix Multiply

dopar j = 1, m, 4 dopar is = 1, n, 128 dovec i = is, is+127 sum0 = sum1 = sum2 = sum3 = 0.0

doseq ks = 1, p, 128 doseq k = ks, ks+127, 4 sum0 += b(i,k) * c(k,j) sum1 += b(i,k) * c(k,j+1) sum2 += b(i,k) * c(k,j+2) sum3 += b(i,k) * c(k,j+3) sum0 += b(i,k+1) * c(k+1,j) sum1 += b(i,k+1) * c(k+1,j+1) sum2 += b(i,k+1) * c(k+1,j+2) sum3 += b(i,k+1) * c(k+1,j+3) sum0 += b(i,k+2) * c(k+2,j) sum1 += b(i,k+2) * c(k+2,j+1) sum2 += b(i,k+2) * c(k+2,j+2) sum3 += b(i,k+2) * c(k+2,j+3) sum0 += b(i,k+3) * c(k+3,j) sum1 += b(i,k+3) * c(k+3,j+1) sum2 += b(i,k+3) * c(k+3,j+2) sum3 += b(i,k+3) * c(k+3,j+3) enddo enddo a(i,j) = sum0 a(i,j+1) = sum1 a(i,j+2) = sum2 a(i,j+3) = sum3 enddo enddo enddo

Kernels Model

Pseudo-code

for Optimized

Matrix Multiply

!$acc region do j = 1, m do k = 1, p do i = 1,n a(i,j) = a(i,j) + b(i,k)*c(k,j)_ enddo enddo enddo!$acc end region

PGI Directive-based Fortran Matrix

Multiply for x64+GPU

SUBROUTINE SAXPY (A,X,Y,N) INTEGER N REAL A,X(N),Y(N)!$ACC REGION DO I = 1, N X(I) = A*X(I) + Y(I) ENDDO!$ACC END REGION END

saxpy_: … movl (%rbx), %eax

movl %eax, -4(%rbp)

call __pgi_cu_init

. . . call __pgi_cu_function

…

call __pgi_cu_alloc

…

call __pgi_cu_upload

… call __pgi_cu_call

… call __pgi_cu_download

…

saxpy_: … movl (%rbx), %eax

movl %eax, -4(%rbp)

call __pgi_cu_init

. . . call __pgi_cu_function

…

call __pgi_cu_alloc

…

call __pgi_cu_upload

… call __pgi_cu_call

… call __pgi_cu_download

…

Host x64 asm FileAuto-generated GPU code

typedef struct dim3{ unsigned int x,y,z; }dim3;typedef struct uint3{ unsigned int x,y,z; }uint3;extern uint3 const threadIdx, blockIdx;extern dim3 const blockDim, gridDim;static __attribute__((__global__)) voidpgicuda( __attribute__((__shared__)) int tc, __attribute__((__shared__)) int i1, __attribute__((__shared__)) int i2, __attribute__((__shared__)) int _n, __attribute__((__shared__)) float* _c, __attribute__((__shared__)) float* _b, __attribute__((__shared__)) float* _a ){ int i; int p1; int _i; i = blockIdx.x * 64 + threadIdx.x; if( i < tc ){ _a[i+i2-1] = ((_c[i+i2-1]+_c[i+i2-1])+_b[i+i2-1]); _b[i+i2-1] = _c[i+i2]; _i = (_i+1); p1 = (p1-1); } }

+

Unifieda.out

compile

link

execute… no change to existing makefiles, scripts, IDEs, programming environment, etc.

PGI Accelerator

Compilers

% pgfortran -fast -ta=nvidia -Minfo mm.F90mm1: 4, Generating copyin(c(:,1:m)) Generating copyin(b(1:n,:)) Generating copy(a(1:n,1:m)) 5, Loop is parallelizable 6, Loop carried dependence of a prevents parallelization Loop carried backward dependence of a prevents vectorization 7, Loop is parallelizable Accelerator kernel generated 5, !$acc do parallel, vector(16) 6, !$acc do seq Cached references to size [16x16] block of b Cached references to size [16x16] block of c 7, !$acc do parallel, vector(16) Using register for a

Compiler-to-User Feedback

!$acc region copyin(b(1:n,1:p),c(1:p,1:m))!$acc& copy(a(1:n,1:m))!$acc do parallel do j = 1, m!$acc do seq do k = 1, p!$acc do parallel, vector(128) do i = 1, n a(i,j) = a(i,j) + b(i,k)*c(k,j) enddo enddo enddo!$acc end region

Directives for Tuning Data Movement

and Kernel Mapping

#pragma acc region{ for(int opt = 0; opt < optN; opt++){

float S = h_StockPrice[opt], X = h_OptionStrike[opt], T = h_OptionYears[opt];float sqrtT = sqrtf(T);float d1 = (logf(S/X) +

(Riskfree + 0.5f * Volatility * Volatility) * T)/ (Volatility * sqrtT);

float d2 = d1 - Volatility * sqrtT;float cndd1 = CND(d1);float cndd2 = CND(d2);float expRT = expf(- Riskfree * T);h_CallResult[opt] = S*cndd1-X*expRT*cndd2;h_PutResult[opt] = X*expRT*(1.0f-cndd2)-S*(1.0f-cndd1);

}}

Same Basic Model for C - pragmas

How did we make Vectors Work?Compiler-to-Programmer Feedback – a classic “Virtuous Cycle”

HPCCode

CFT

Cray

VectorizationListing

Trace

Performance

Profiler

HPCUser

Directives, Options, Restructuring

This Feedback LoopUnique to Compilers!

We can use this same methodology to enable effectivemigration of applications to Multi-core and Accelerators

Compiler-to-Programmer Feedback

HPCCode

PGI Compiler

x64

CCFF

Trace PGPROF

HPCUser

Acc+

Directives, Options, RESTRUCTURING

Restructuring forAccelerators will be More Difficult

Performance

do loop = 1,loops!!------------------------------------------------------! initialize the large scale variables! do i = its, ite mstep(i) = 1 flgcld(i) = .true. enddo! do k = kts, kte do i = its, ite denfac(i,k) = sqrt(den0/den(i,k)) enddo enddo! cvap = cpv hvap=xlv0 hsub=xls ttp=t0c+0.1 dldt=cvap-cliq xa=-dldt/rv xb=xa+hvap/(rv*ttp) dldti=cvap-cice xai=-dldti/rv xbi=xai+hsub/(rv*ttp) do k = kts, kte do i = its, ite tr=ttp/t(i,k) if(t(i,k).lt.ttp) then qs(i,k) =psat*(tr**xai)*exp(xbi*(1.-tr)) else qs(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) endif qs0(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) qs0(i,k) = (qs0(i,k)-qs(i,k))/qs(i,k) qs(i,k) = ep2 * qs(i,k) / (p(i,k) - qs(i,k)) qs(i,k) = max(qs(i,k),qmin) rh(i,k) = max(q(i,k) / qs(i,k),qmin) enddo enddo

!!-----------------------------------------------! initialize the variables for microphysical physics!! do k = kts, kte do i = its, ite pres(i,k) = 0. paut(i,k) = 0. pacr(i,k) = 0. pgen(i,k) = 0. pisd(i,k) = 0. pcon(i,k) = 0. fall(i,k) = 0. falk(i,k) = 0. fallc(i,k) = 0. falkc(i,k) = 0. xni(i,k) = 1.e3 enddo enddo!!------------------------------------------------------! compute the fallout term:! first, vertical terminal velosity for minor loops!-------------------------------------------------------! n0s: Intercept parameter for snow [m-4] [HDC 6]!------------------------------------------------------ do k = kts, kte do i = its, ite supcol = t0c-t(i,k) n0sfac(i,k) =

max(min(exp(alpha*supcol),n0smax/n0s),1.) if(t(i,k).ge.t0c) then if(qrs(i,k).le.qcrmin)then rslope(i,k) = rslopermax rslopeb(i,k) = rsloperbmax rslope2(i,k) = rsloper2max rslope3(i,k) = rsloper3max else rslope(i,k) = 1./lamdar(qrs(i,k),den(i,k)) rslopeb(i,k) = rslope(i,k)**bvtr rslope2(i,k) = rslope(i,k)*rslope(i,k)

rslope3(i,k) = rslope2(i,k)*rslope(i,k) endif else if(qrs(i,k).le.qcrmin)then rslope(i,k) = rslopesmax rslopeb(i,k) = rslopesbmax rslope2(i,k) = rslopes2max rslope3(i,k) = rslopes3max else rslope(i,k) =

1./lamdas(qrs(i,k),den(i,k),n0sfac(i,k)) rslopeb(i,k) = rslope(i,k)**bvts rslope2(i,k) = rslope(i,k)*rslope(i,k) rslope3(i,k) = rslope2(i,k)*rslope(i,k) endif endif!-----------------------------------------------------! Ni: ice crystal number concentraiton [HDC 5c]!----------------------------------------------------- xni(i,k) = min(max(5.38e7*(den(i,k) &

*max(qci(i,k),qmin))**0.75,1.e3),1.e6) enddo enddo! mstepmax = 1 numdt = 1 do k = kte, kts, -1 do i = its, ite if(t(i,k).lt.t0c) then pvt = pvts else pvt = pvtr endif work1(i,k) = pvt*rslopeb(i,k)*denfac(i,k) work2(i,k) = work1(i,k)/delz(i,k) numdt(i) = max(nint(work2(i,k)*dtcld+.5),1) if(numdt(i).ge.mstep(i)) mstep(i) = numdt(i) enddo enddo do i = its, ite if(mstepmax.le.mstep(i)) mstepmax = mstep(i) enddo

! do n = 1, mstepmax k = kte do i = its, ite if(n.le.mstep(i)) then falk(i,k) = den(i,k)*qrs(i,k)* & work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k) & *dtcld/den(i,k),0.) endif enddo do k = kte-1, kts, -1 do i = its, ite if(n.le.mstep(i)) then falk(i,k) =

den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-(falk(i,k) & -falk(i,k+1) & *delz(i,k+1)/delz(i,k))*dtcld/den(i,k),0.) endif enddo enddo enddo!------------------------------------------------------! Vice [ms-1] : fallout of ice crystal [HDC 5a]!----------------------------------------------------- mstepmax = 1 mstep = 1 numdt = 1 do k = kte, kts, -1 do i = its, ite if(t(i,k).lt.t0c.and.qci(i,k).gt.0.) then xmi = den(i,k)*qci(i,k)/xni(i,k) diameter = dicon * sqrt(xmi) work1c(i,k) = 1.49e4*diameter**1.31 else work1c(i,k) = 0. endif

if(qci(i,k).le.0.) then work2c(i,k) = 0. else work2c(i,k) = work1c(i,k)/delz(i,k) endif numdt(i) = max(nint(work2c(i,k)*dtcld+.5),1) if(numdt(i).ge.mstep(i)) mstep(i) = numdt(i) enddo enddo do i = its, ite if(mstepmax.le.mstep(i)) mstepmax = mstep(i) enddo! do n = 1, mstepmax k = kte do i = its, ite if (n.le.mstep(i)) then falkc(i,k) = den(i,k)*qci(i,k)* & work2c(i,k)/mstep(i) holdc = falkc(i,k) fallc(i,k) = fallc(i,k)+falkc(i,k) holdci = qci(i,k) qci(i,k) = max(qci(i,k)- & falkc(i,k)*dtcld/den(i,k),0.) endif enddo do k = kte-1, kts, -1 do i = its, ite if (n.le.mstep(i)) then falkc(i,k) = & den(i,k)*qci(i,k)*work2c(i,k)/mstep(i) holdc = falkc(i,k) fallc(i,k) = fallc(i,k)+falkc(i,k) holdci = qci(i,k) qci(i,k) = max(qci(i,k)-(falkc(i,k) - & falkc(i,k+1)*delz(i,k+1)/delz(i,k))*dtcld/den(i,k),0.) endif enddo enddo enddo!!-------------------------------------------------------! compute the freezing/melting term. [D89 B16-B17]

! freezing occurs one layer above the melting level! do i = its, ite mstep(i) = 0 enddo do k = kts, kte! do i = its, ite if(t(i,k).ge.t0c) then mstep(i) = k endif enddo enddo! do i = its, ite if(mstep(i).ne.0.and.w(i,mstep(i)).gt.0.) then work1(i,1) = float(mstep(i) + 1) work1(i,2) = float(mstep(i)) else work1(i,1) = float(mstep(i)) work1(i,2) = float(mstep(i)) endif enddo! do i = its, ite k = nint(work1(i,1)) kk = nint(work1(i,2)) if(k*kk.ge.1) then qrsci = qrs(i,k) + qci(i,k) if(qrsci.gt.0..or.fall(i,kk).gt.0.) then frzmlt = min(max(-w(i,k)*qrsci/delz(i,k),& -qrsci/dtcld), qrsci/dtcld) snomlt = min(max(fall(i,kk)/den(i,kk), & -qrs(i,k)/dtcld),qrs(i,k)/dtcld) if(k.eq.kk) then t(i,k) = t(i,k) - xlf0/cpm(i,k)* & (frzmlt+snomlt)*dtcld else t(i,k) = t(i,k) - xlf0/cpm(i,k)* & frzmlt*dtcld t(i,kk) = t(i,kk) - xlf0/cpm(i,kk)* & snomlt*dtcld

endif enddo!!-------------------------------------------------------

---------! rain (unit is mm/sec;kgm-2s-1: /1000*delt ===>

m)==> mm for wrf! do i = its, ite if(fall(i,1).gt.0.) then rainncv(i) = fall(i,1)*delz(i,1)/ & denr*dtcld*1000. rain(i) = fall(i,1)*delz(i,1)/ & denr*dtcld*1000. + rain(i) endif enddo!!-----------------------------------------------------! rsloper: reverse of the slope parameter of the

rain(m)! xka: thermal conductivity of air(jm-1s-1k-1)! work1: the thermodynamic term in the denominator

associated with! heat conduction and vapor diffusion! (ry88, y93, h85)! work2: parameter associated with the ventilation

effects(y93)! do k = kts, kte do i = its, ite if(t(i,k).ge.t0c) then if(qrs(i,k).le.qcrmin)then rslope(i,k) = rslopermax rslopeb(i,k) = rsloperbmax rslope2(i,k) = rsloper2max rslope3(i,k) = rsloper3max else rslope(i,k) = 1./lamdar(qrs(i,k),den(i,k)) rslopeb(i,k) = rslope(i,k)**bvtr rslope2(i,k) = rslope(i,k)*rslope(i,k) rslope3(i,k) = rslope2(i,k)*rslope(i,k) endif else if(qrs(i,k).le.qcrmin)then rslope(i,k) = rslopesmax

rslopeb(i,k) = rslopesbmax rslope2(i,k) = rslopes2max rslope3(i,k) = rslopes3max else rslope(i,k) = 1./lamdas(qrs(i,k),7 den(i,k),n0sfac(i,k)) rslopeb(i,k) = rslope(i,k)**bvts rslope2(i,k) = rslope(i,k)*rslope(i,k) rslope3(i,k) = rslope2(i,k)*rslope(i,k) endif endif enddo enddo! do k = kts, kte do i = its, ite if(t(i,k).ge.t0c) then work1(i,k) = diffac(xl(i,k),p(i,k),7 t(i,k),den(i,k),qs(i,k)) else work1(i,k) = diffac(xls,p(i,k),& t(i,k),den(i,k),qs(i,k)) endif work2(i,k) = venfac(p(i,k),t(i,k),den(i,k)) enddo enddoa! do k = kts, kte do i = its, ite supsat = max(q(i,k),qmin)-qs(i,k) satdt = supsat/dtcld if(t(i,k).ge.t0c) then!!=====================================================!! warm rain processes!! - follows the processes in RH83 and LFO except for

autoconcersion!!==================================================!---------------------------------------------------! paut1: auto conversion rate from cloud to rain [HDC

16]! (C->R)

!------------------------------------------------------ if(qci(i,k).gt.qc0) then paut(i,k) = qck1*qci(i,k)**(7./3.) paut(i,k) = min(paut(i,k),qci(i,k)/dtcld) endif!------------------------------------------------------! pracw: accretion of cloud water by rain [D89 B15]! (C->R)!------------------------------------------------------ if(qrs(i,k).gt.qcrmin.and.qci(i,k).gt.qmin)

then pacr(i,k) =

min(pacrr*rslope3(i,k)*rslopeb(i,k) &

*qci(i,k)*denfac(i,k),qci(i,k)/dtcld) endif!-----------------------------------------------------! pres1: evaporation/condensation rate of rain [HDC 14]! (V->R or R->V)!------------------------------------------------------ if(qrs(i,k).gt.0.) then coeres =

rslope2(i,k)*sqrt(rslope(i,k)*rslopeb(i,k)) pres(i,k) = (rh(i,k)-

1.)*(precr1*rslope2(i,k) &

+precr2*work2(i,k)*coeres)/work1(i,k) if(pres(i,k).lt.0.) then pres(i,k) = max(pres(i,k),-

qrs(i,k)/dtcld) pres(i,k) = max(pres(i,k),satdt/2) else pres(i,k) = min(pres(i,k),satdt/2) endif endif else!!======================================================!! cold rain processes!! - follows the revised ice microphysics processes in

supcol = t0c-t(i,k) ifsat = 0!-----------------------------------------------------! Ni: ice crystal number concentraiton [HDC 5c]!----------------------------------------------------- xni(i,k) = min(max(5.38e7*(den(i,k) & *max(qci(i,k),qmin))**0.75,1.e3),1.e6) eacrs = exp(0.05*(-supcol))! if(qrs(i,k).gt.qcrmin.and.qci(i,k).gt.qmin)

then pacr(i,k) = min(pacrs*n0sfac(i,k)* & eacrs*rslope3(i,k) & *rslopeb(i,k)*qci(i,k)*denfac(i,k),qci(i,k)/dtcld) endif!------------------------------------------------------! pisd: Deposition/Sublimation rate of ice [HDC 9]! (T<T0: V->I or I->V)!------------------------------------------------------ if(qci(i,k).gt.0.) then xmi = den(i,k)*qci(i,k)/xni(i,k) diameter = dicon * sqrt(xmi) pisd(i,k) = 4.*diameter*xni(i,k)* & (rh(i,k)-1.)/work1(i,k) if(pisd(i,k).lt.0.) then pisd(i,k) = max(pisd(i,k),satdt/2) pisd(i,k) = max(pisd(i,k), & - qci(i,k)/dtcld) else pisd(i,k) = min(pisd(i,k),satdt/2) endif if(abs(pisd(i,k)).ge.abs(satdt)) ifsat = 1 endif!------------------------------------------------------! pres2: deposition/sublimation rate of snow [HDC 14]! (V->S or S->V)!----------------------------------------------------- if(qrs(i,k).gt.0..and.ifsat.ne.1) then coeres = rslope2(i,k)* & sqrt(rslope(i,k)*rslopeb(i,k)) pres(i,k) = (rh(i,k)-1.)*n0sfac(i,k)*& (precs1* rslope2(i,k) & +precs2*work2(i,k)*coeres)/work1(i,k) if(pres(i,k).lt.0.) then

pres(i,k) = max(pres(i,k),-qrs(i,k)/dtcld) pres(i,k) = max(pres(i,k),satdt/2) else pres(i,k) = min(pres(i,k),satdt/2) endif if(abs(pisd(i,k)+pres(i,k)).ge.abs(satdt))& ifsat = 1 endif!------------------------------------------------------! pgen: generation(nucleation) of ice from vapor [HDC 7-

8]! (T<T0: V->I)!----------------------------------------------------- if(supsat.gt.0.and.ifsat.ne.1) then xni0 = 1.e3*exp(0.1*supcol) roqi0 = 4.92e-11*xni0**1.33 pgen(i,k) = max(0.,(roqi0/den(i,k)-

max(qci(i,k),0.))/dtcld) pgen(i,k) = min(pgen(i,k),satdt) endif!------------------------------------------------------! paut2: conversion(aggregation) of ice to snow [HDC 12]! (T<T0: I->S)!------------------------------------------------------ if(qci(i,k).gt.0.) then qimax = roqimax/den(i,k) paut(i,k) = max(0.,(qci(i,k)-qimax)/dtcld) endif endif enddo enddo!!------------------------------------------------------! check mass conservation of generation terms and

feedback to the! large scale! do k = kts, kte do i = its, ite qciik = max(qmin,qci(i,k)) delqci = (paut(i,k)+pacr(i,k)-pgen(i,k)-

pisd(i,k))*dtcld if(delqci.ge.qciik) then facqci = qciik/delqci

paut(i,k) = paut(i,k)*facqci pacr(i,k) = pacr(i,k)*facqci pgen(i,k) = pgen(i,k)*facqci pisd(i,k) = pisd(i,k)*facqci endif qik = max(qmin,q(i,k)) delq = (pres(i,k)+pgen(i,k)+pisd(i,k))*dtcld if(delq.ge.qik) then facq = qik/delq pres(i,k) = pres(i,k)*facq pgen(i,k) = pgen(i,k)*facq pisd(i,k) = pisd(i,k)*facq endif work2(i,k) = -pres(i,k)-pgen(i,k)-pisd(i,k) q(i,k) = q(i,k)+work2(i,k)*dtcld qci(i,k) = max(qci(i,k)-(paut(i,k)+pacr(i,k)-

pgen(i,k) & -pisd(i,k))*dtcld,0.) qrs(i,k) = max(qrs(i,k)+(paut(i,k)+pacr(i,k)

& +pres(i,k))*dtcld,0.) if(t(i,k).lt.t0c) then t(i,k) = t(i,k)-

xls*work2(i,k)/cpm(i,k)*dtcld else t(i,k) = t(i,k)-

xl(i,k)*work2(i,k)/cpm(i,k)*dtcld endif enddo enddo! cvap = cpv hvap = xlv0 hsub = xls ttp=t0c+0.1 dldt=cvap-cliq xa=-dldt/rv xb=xa+hvap/(rv*ttp) dldti=cvap-cice xai=-dldti/rv xbi=xai+hsub/(rv*ttp)

do k = kts, kte do i = its, ite tr=ttp/t(i,k) qs(i,k)=psat*(tr**xa)*exp(xb*(1.-tr)) qs(i,k) = ep2 * qs(i,k) / (p(i,k) - qs(i,k)) qs(i,k) = max(qs(i,k),qmin) denfac(i,k) = sqrt(den0/den(i,k)) enddo enddo!! pcon: condensational/evaporational rate of cloud

water [RH83 A6]! if there exists additional water vapor

condensated/if! evaporation of cloud water is not enough to remove

subsaturation! do k = kts, kte do i = its, ite work1(i,k) =

conden(t(i,k),q(i,k),qs(i,k),xl(i,k),cpm(i,k)) work2(i,k) = qci(i,k)+work1(i,k) pcon(i,k) =

min(max(work1(i,k),0.),max(q(i,k),0.))/dtcld

if(qci(i,k).gt.0..and.work1(i,k).lt.0.and.t(i,k).gt.t0c) &

pcon(i,k) = max(work1(i,k),-qci(i,k))/dtcld q(i,k) = q(i,k)-pcon(i,k)*dtcld qci(i,k) = max(qci(i,k)+pcon(i,k)*dtcld,0.) t(i,k) =

t(i,k)+pcon(i,k)*xl(i,k)/cpm(i,k)*dtcld enddo enddo!! padding for small values! do k = kts, kte do i = its, ite if(qci(i,k).le.qmin) qci(i,k) = 0.0 enddo enddo! enddo ! big loops

do loop = 1,loops!!------------------------------------------------------! initialize the large scale variables! do i = its, ite mstep(i) = 1 flgcld(i) = .true. enddo! do k = kts, kte do i = its, ite denfac(i,k) = sqrt(den0/den(i,k)) enddo enddo! cvap = cpv hvap=xlv0 hsub=xls ttp=t0c+0.1 dldt=cvap-cliq xa=-dldt/rv xb=xa+hvap/(rv*ttp) dldti=cvap-cice xai=-dldti/rv xbi=xai+hsub/(rv*ttp) do k = kts, kte do i = its, ite tr=ttp/t(i,k) if(t(i,k).lt.ttp) then qs(i,k) =psat*(tr**xai)*exp(xbi*(1.-tr)) else qs(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) endif qs0(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) qs0(i,k) = (qs0(i,k)-qs(i,k))/qs(i,k) qs(i,k) = ep2 * qs(i,k) / (p(i,k) - qs(i,k)) qs(i,k) = max(qs(i,k),qmin) rh(i,k) = max(q(i,k) / qs(i,k),qmin) enddo enddo

!$acc region do local(numdt,xni,falk,falkc,fall,fallc,pcon,pisd) &!$acc local(pgen,pacr,paut,pres,denfac,mstep,qs,qs0,rh) &!$acc local(rslope,rslope2,rslope3,rslopeb,work1,work2) &!$acc local(work1c,work2c) &!$acc copyin(n0sfac,delz(:,:),qci(:,:),qrs(:,:),p(:,:)) &!$acc copyin(den(:,:),w(:,:)) &!$acc copy(q(:,:)) do loop = 1,loops

!!----------------------------------------------------------------! initialize the large scale variables! do i = its, ite mstep(i) = 1 flgcld(i) = .true. enddo! do i = its, ite do k = kts, kte denfac(i,k) = sqrt(den0/den(i,k)) enddo enddo! do i = its, ite do k = kts, kte tr=ttp/t(i,k) if(t(i,k).lt.ttp) then qs(i,k) =psat*(tr**xai)*exp(xbi*(1.-tr)) else qs(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) endif qs0(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) qs0(i,k) = (qs0(i,k)-qs(i,k))/qs(i,k) qs(i,k) = ep2 * qs(i,k) / (p(i,k) - qs(i,k)) qs(i,k) = max(qs(i,k),qmin) rh(i,k) = max(q(i,k) / qs(i,k),qmin) enddo enddo

! mstepmax = 1 numdt = 1 do k = kte, kts, -1 do i = its, ite if(t(i,k).lt.t0c) then pvt = pvts else pvt = pvtr endif work1(i,k) = pvt*rslopeb(i,k)*denfac(i,k) work2(i,k) = work1(i,k)/delz(i,k) numdt(i) = max(nint(work2(i,k)*dtcld+.5),1) if(numdt(i).ge.mstep(i)) mstep(i) = numdt(i) enddo enddo do i = its, ite if(mstepmax.le.mstep(i)) mstepmax = mstep(i) enddo! do n = 1, mstepmax k = kte do i = its, ite if(n.le.mstep(i)) then falk(i,k) = den(i,k)*qrs(i,k)* & work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k) & *dtcld/den(i,k),0.) endif enddo

!! mstepmax = 1 numdt = 1 do i = its, ite do k = kte, kts, -1 if(t(i,k).lt.t0c) then pvt = pvts else pvt = pvtr endif work1(i,k) = pvt*rslopeb(i,k)*denfac(i,k) work2(i,k) = work1(i,k)/delz(i,k)!! numdt(i) = max(nint(work2(i,k)*dtcld+.5),1) numdt(i) = max(int(work2(i,k)*dtcld+1.),1) if(numdt(i).ge.mstep(i)) mstep(i) = numdt(i) enddo enddo

!! do i = its, ite!! if(mstepmax.le.mstep(i)) mstepmax = mstep(i)!! enddo!!! do n = 1, mstepmax k = kte do i = its, ite!! if(n.le.mstep(i)) then do n = 1, mstep(i) falk(i,k) = den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k)*dtcld/den(i,k),0.)!! endif

enddo enddo

do n = 1, mstepmax k = kte do i = its, ite if(n.le.mstep(i)) then falk(i,k) = den(i,k)*qrs(i,k)* & work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k) & *dtcld/den(i,k),0.) endif enddo do k = kte-1, kts, -1 do i = its, ite if(n.le.mstep(i)) then falk(i,k) =

den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-(falk(i,k) & -falk(i,k+1) & *delz(i,k+1)/delz(i,k))*dtcld/den(i,k),0.) endif enddo enddo enddo

!! do n = 1, mstepmax k = kte do i = its, ite!! if(n.le.mstep(i)) then do n = 1, mstep(i) falk(i,k) = den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k)*dtcld/den(i,k),0.)!! endif

enddo enddo do i = its, ite do k = kte-1, kts, -1!! if(n.le.mstep(i)) then do n = 1, mstep(i) falk(i,k) = den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-(falk(i,k) -falk(i,k+1) & *delz(i,k+1)/delz(i,k))*dtcld/den(i,k),0.)!! endif enddo enddo enddo!! enddo

pgf90 -DINTIO -DIWORDSIZE=4 -DDWORDSIZE=8 -DRWORDSIZE=4 -DLWORDSIZE=4 -DNETCDF -DTRIEDNTRUE -DLIMIT_ARGS -DEM_CORE=1 -DNMM_CORE=0 -DNMM_MAX_DIM=1000 -DCOAMPS_CORE=0 -DEXP_CORE=0 -DF90_STANDALONE -DCONFIG_BUF_LEN=8192 -DMAX_DOMAINS_F=21 -DNO_NAMELIST_PRINT -fast -I../src/. -I../src/./netcdf/include -c ../src/module_mp_wsm3.F90 -ta=nvidia,time -Minfo=accel,ccffwsm32d: 300, Generating copyin(cpm(its:ite,:)) Generating local(work2c(its:ite,kts:kte)) Generating local(work1c(its:ite,kts:kte)) Generating local(work2(its:ite,kts:kte)) Generating copyin(delz(:,:)) Generating copyin(qci(:,:)) Generating local(rslope3(its:ite,kts:kte)) Generating local(rslope2(its:ite,kts:kte)) Generating local(rslopeb(its:ite,kts:kte)) Generating local(rslope(its:ite,kts:kte)) Generating copyin(n0sfac(its:ite,kts:kte)) Generating copyin(qrs(:,:)) Generating local(rh(its:ite,kts:kte)) Generating copy(q(:,:)) Generating copyin(p(:,:)) Generating local(qs0(its:ite,kts:kte)) Generating local(qs(its:ite,kts:kte)) Generating copy(t(its:ite,:)) Generating local(mstep(its:ite)) Generating copyin(den(:,:)) Generating local(denfac(its:ite,kts:kte)) Generating local(pres(its:ite,kts:kte)) Generating local(paut(its:ite,kts:kte)) Generating local(pacr(its:ite,kts:kte)) Generating local(pgen(its:ite,kts:kte)) Generating local(pisd(its:ite,kts:kte)) Generating local(pcon(its:ite,kts:kte)) Generating local(fall(its:ite,:)) Generating local(falk(its:ite,kts:kte)) Generating local(fallc(its:ite,kts:kte))

307, Loop carried dependence due to exposed use of t(its:ite,:) prevents parallelization Loop carried dependence due to exposed use of q(its:ite,kts:kte) prevents parallelization Loop carried dependence due to exposed use of fall(its:ite,:) prevents parallelization Loop carried dependence due to exposed use of qrs(its:ite,:) prevents parallelization Loop carried dependence due to exposed use of qci(its:ite,:) prevents parallelization Loop carried dependence due to exposed use of work1(its:ite,1:2) prevents parallelization Loop carried dependence due to exposed use of rain(its:ite) prevents parallelization Parallelization would require privatization of array work2c(i2+its,kts:kte) Parallelization would require privatization of array work1c(i2+its,kts:kte) Parallelization would require privatization of array work2(i2+its,kts:kte) Parallelization would require privatization of array numdt(its:ite) Parallelization would require privatization of array rslope3(i2+its,kts:kte) Parallelization would require privatization of array rslope2(i2+its,kts:kte) Parallelization would require privatization of array rslopeb(i2+its,kts:kte) Parallelization would require privatization of array rslope(i2+its,kts:kte) Parallelization would require privatization of array n0sfac(i2+its,kts:kte) Parallelization would require privatization of array xni(i2+its,kts:kte) Parallelization would require privatization of array falkc(i2+its,kts:kte) Parallelization would require privatization of array fallc(i2+its,kts:kte) Parallelization would require privatization of array falk(i2+its,kts:kte) Parallelization would require privatization of array pcon(i2+its,kts:kte) Parallelization would require privatization of array pisd(i2+its,kts:kte) Parallelization would require privatization of array pgen(i2+its,kts:kte) Parallelization would require privatization of array pacr(i2+its,kts:kte) Parallelization would require privatization of array paut(i2+its,kts:kte) Parallelization would require privatization of array pres(i2+its,kts:kte) Parallelization would require privatization of array rh(i2+its,kts:kte) Parallelization would require privatization of array qs0(i2+its,kts:kte) Parallelization would require privatization of array qs(i2+its,kts:kte) Parallelization would require privatization of array denfac(i2+its,kts:kte) Parallelization would require privatization of array mstep(its:ite) Sequential loop scheduled on host

313, Loop is parallelizable Accelerator kernel generated 313, !$acc do parallel, vector(256) 318, Loop is parallelizable 319, Loop is parallelizable Accelerator kernel generated 318, !$acc do parallel, vector(16) 319, !$acc do parallel, vector(16) 324, Loop is parallelizable 325, Loop is parallelizable Accelerator kernel generated 324, !$acc do parallel, vector(16) 325, !$acc do parallel, vector(16) Using register for qs0 Using register for qs Using register for t 349, Loop is parallelizable 350, Loop is parallelizable Accelerator kernel generated 349, !$acc do parallel, vector(16) 350, !$acc do parallel, vector(16) 371, Loop is parallelizable 372, Loop is parallelizable Accelerator kernel generated 371, !$acc do parallel, vector(16) 372, !$acc do parallel, vector(16) Using register for den Using register for rslope3 Using register for rslope2 Using register for rslopeb Using register for rslope Using register for qrs

!!----------------------------------------------------------------! initialize the large scale variables! do i = its, ite mstep(i) = 1 flgcld(i) = .true. enddo! do i = its, ite do k = kts, kte denfac(i,k) = sqrt(den0/den(i,k)) enddo enddo! do i = its, ite do k = kts, kte tr=ttp/t(i,k) if(t(i,k).lt.ttp) then qs(i,k) =psat*(tr**xai)*exp(xbi*(1.-tr)) else qs(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) endif qs0(i,k) =psat*(tr**xa)*exp(xb*(1.-tr)) qs0(i,k) = (qs0(i,k)-qs(i,k))/qs(i,k) qs(i,k) = ep2 * qs(i,k) / (p(i,k) - qs(i,k)) qs(i,k) = max(qs(i,k),qmin) rh(i,k) = max(q(i,k) / qs(i,k),qmin) enddo enddo

409, Loop is parallelizable Accelerator kernel generated 409, !$acc do parallel, vector(256) 410, Loop is parallelizable 411, Complex loop carried dependence of numdt preventsparallelization Loop carried reuse of numdt preventsparallelization Complex loop carried dependence of mstep preventsparallelization Loop carried dependence of mstep preventsparallelization Loop carried backward dependence of mstepprevents vectorization Inner sequential loop scheduled on accelerator Accelerator kernel generated 410, !$acc do parallel, vector(256) Using register for mstep Using register for numdt 411, !$acc do seq 432, Loop is parallelizable Accelerator kernel generated 432, !$acc do parallel, vector(256) Using register for falk Using register for den Using register for qrs Using register for mstep 434, Complex loop carried dependence of qrs preventsparallelization Complex loop carried dependence of falk preventsparallelization Loop carried reuse of falk preventsparallelization Complex loop carried dependence of fall preventsparallelization Loop carried dependence of fall preventsparallelization Loop carried backward dependence of fall preventsvectorization Loop carried dependence of qrs preventsparallelization Loop carried backward dependence of qrs preventsvectorization Inner sequential loop scheduled on accelerator

!! mstepmax = 1 numdt = 1 do i = its, ite do k = kte, kts, -1 if(t(i,k).lt.t0c) then pvt = pvts else pvt = pvtr endif work1(i,k) = pvt*rslopeb(i,k)*denfac(i,k) work2(i,k) = work1(i,k)/delz(i,k)!! numdt(i) = max(nint(work2(i,k)*dtcld+.5),1) numdt(i) = max(int(work2(i,k)*dtcld+1.),1) if(numdt(i).ge.mstep(i)) mstep(i) = numdt(i) enddo enddo

!! do i = its, ite!! if(mstepmax.le.mstep(i)) mstepmax = mstep(i)!! enddo!!! do n = 1, mstepmax k = kte do i = its, ite!! if(n.le.mstep(i)) then do n = 1, mstep(i) falk(i,k) = den(i,k)*qrs(i,k)*work2(i,k)/mstep(i) hold = falk(i,k) fall(i,k) = fall(i,k)+falk(i,k) holdrs = qrs(i,k) qrs(i,k) = max(qrs(i,k)-falk(i,k)*dtcld/den(i,k),0.)!! endif

enddo enddo

Common Compiler Feedback Format

Source

File

Object

File

.CCFF

Executable

File

.CCFF

Compiler Linker pgextract

pgprof

pgprof.out

File

CCFF

File

run

program

http://www.pgroup.com/resources/ccff.htm

PGPROF 9.0

With CCFF

Messages

/usr/bin/time ./wrf > ./rsl.out.0000FORTRAN STOP

Accelerator Kernel Timing data../src/module_mp_wsm3.F90 wsm32d 300: region entered 232 times time(us): total=3030348 init=251580 region=2778768 kernels=514992 data=2263776 w/o init: total=2778768 max=1052452 min=7198 avg=11977 313: kernel launched 464 times time(us): total=7929 max=38 min=15 avg=17 319: kernel launched 464 times time(us): total=9624 max=27 min=19 avg=20 325: kernel launched 464 times time(us): total=24756 max=74 min=50 avg=53 350: kernel launched 464 times time(us): total=10243 max=54 min=21 avg=22 372: kernel launched 464 times time(us): total=20422 max=51 min=40 avg=44 409: kernel launched 464 times time(us): total=7473 max=45 min=15 avg=16… 542: kernel launched 464 times time(us): total=11027 max=31 min=17 avg=23 567: kernel launched 464 times time(us): total=8100 max=24 min=16 avg=17 584: kernel launched 464 times time(us): total=15661 max=55 min=29 avg=33 614: kernel launched 464 times time(us): total=27082 max=63 min=56 avg=58 625: kernel launched 464 times time(us): total=30871 max=73 min=61 avg=66 757: kernel launched 464 times time(us): total=34488 max=77 min=73 avg=744.73user 1.20system 0:05.99elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k0inputs+24outputs (0major+36576minor)pagefaults 0swaps

Types of Compiler Feedback

How the function was compiled

Interprocedural optimizations

Profile-feedback runtime dataBlock execution counts

Loop counts, range of counts

Compiler optimizations, missed opportunitiesVectorization, parallelization

Altcode, re-ordering of loops, inlining

X64+GPU code generation, GPU kernel mapping, data movement

Compute intensity – important for GPUs & Multi-core

PGI Accelerator Model Advantages

Minimal changes to the language – directives/pragmas, in thesame vein as vector or OpenMP parallel directives

Minimal library calls – usually none

Standard x64 toolchain – no changes to makefiles, linkers, buildprocess, standard libraries, other tools

Not a “platform” – binaries will execute on any compatiblex64+GPU hardware system

Performance feedback – learn from and leverage the success ofvectorizing compilers in the 1970s and 1980s

Incremental program migration – put migration decisions in thehands of developers

PGI Unified Binary Technology – ensures continued portability tonon GPU-enabled targets

Is it General?

AMD/ATI – four 16-wide SIMD units (MIMD/SIMD), multithreaded,separate GPU memory from host

Larrabee – has some number of x64 cores each with 16-widepacked/SIMD operations and separate memory from x64 host, alsosupporting multithreading in each core

Cell Blades – eight SPEs each with packed/SIMD instructions andseparate memory from host, SPEs have small SW-managed localmemory.

Multicore x64 – has some number of x64 cores, each with 4-widepacked/SIMD operations; no need for data movement?

Convey – has an x64 host with an attached multi-vector acceleratorwhich could be programmed using this model as well.

Status and Experience so Far

PGI Accelerator compilers generate x64+NVIDIA code for mostarbitrarily-chosen simple rectangular loops

Significant speed-ups vs 1 host core on highly compute-intensiveloops (matmul 20x – 30x, Black-Scholes 15x – 20x, Poisson solver10x – 15x vs 1 host core)

Success offloading 400+ line loops in full-up weather, physics, andcomputational fluid dynamics applications

The compilers aggressively use NVIDIA Shared Memory

Several current limitations to offloading certain loops – scalars,function calls, certain intrinsics, reductions

The compilers generate basic profiling information useful for tuning

Correctness, Feedback, Features, Performance

Availability and

Additional Information

PGI Accelerator Programming Model – is supported for x64+NVIDIAtargets in the PGI 9.0 Fortran and C compilers, available now

PGI CUDA Fortran – supporting explicit programming of x64+NVIDIAtargets will be available in a production release of the PGI Fortran 95/03compiler currently scheduled for release in November, 2009

Other GPU and Accelerator Targets – are being studied by PGI, and maybe supported in the future as the necessary low-level software infrastructure(e.g. OpenCL) becomes more widely available

Further Information – see www.pgroup.com/accelerate for a detailedspecification of the PGI Accelerator model, an FAQ, and related articles andwhite papers

pgi x64+gpu fortran and c99 compilers · describe the problem of programming an x64+gpu system...

Documents