Nessie: A NESL to CUDA Compiler
John Reppy
University of Chicago
October 2018
GPU background
GPUs
- GPU architectures are optimized for arithmetically intensive computation.
- GPUs provide super-computer levels of parallelism at commodity prices.
- For example, the Tesla V100 provides 15.7 TFlops of peak single-precision performance and 7.8 TFlops of peak double-precision performance.

NVIDIA GPUs have two (or three) levels of parallelism:
- A multicore processor that supports Single-Instruction Multiple-Thread (SIMT) parallelism.
- Multiple multicore processors on a single chip.
- Multiple GPGPU boards per system.
October 2018 Shonan 136 — Nessie 2
GPU background
GPUs
For example, NVIDIA's Kepler GK110 Streaming Multiprocessor (SMX):
- 192 single-precision cores
- 64 double-precision cores
- 32 load/store units
- 32 special function units
- 4 × 32-lane warps in parallel

Lots of parallel compute, but not very much memory.

[Figure: GK110 SMX block diagram: instruction cache; warp schedulers with dispatch units; 64K × 32-bit register file; interconnect network; 64 KB L1 cache/shared memory; 48 KB read-only data cache]
October 2018 Shonan 136 — Nessie 3
GPU background
GPUs (continued ...)
NVIDIA's Tesla K40 architecture has 15 GK110 SMXs (2880 CUDA cores).

[Figure: K40 chip diagram: thread engine, 15 SMXs, 1536 KB L2 cache, memory controllers]

Optimized for processing data in bulk!
October 2018 Shonan 136 — Nessie 4
GPU background
GPU programming model
The design of GPU hardware is manifest in the widely used GPU programming languages (e.g., CUDA and OpenCL).
Thread hierarchy
- Threads (grouped into warps for SIMT execution)
- Blocks (mapped to the same SMX)
- Grid (multiple blocks running the same kernel)

Synchronization
- Block-level barriers
- Atomic memory operations
- Task synchronization

Explicit memory hierarchy
- Disjoint memory spaces
- Per-thread memory maps to registers
- Per-block shared memory
- Global memory
- Host memory
- Also texture and constant memory
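To make the mapping concrete, here is a minimal CUDA sketch (not from the talk; the names and the block size of 256 are invented for the example) that touches each level of the hierarchy:

    #define THREADS 256

    __global__ void demo (const float *g_in, float *g_out, int n)
    {
        // per-thread locals live in registers
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // per-block shared memory, visible to every thread in the block
        __shared__ float tile[THREADS];

        if (i < n) tile[threadIdx.x] = g_in[i];          // global -> shared
        __syncthreads();                                 // block-level barrier
        if (i < n) g_out[i] = 2.0f * tile[threadIdx.x];  // shared -> global
    }

    // Host memory is a disjoint space; data moves explicitly:
    //   cudaMemcpy(d_in, h_in, n*sizeof(float), cudaMemcpyHostToDevice);
    //   demo<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_in, d_out, n);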
October 2018 Shonan 136 — Nessie 5
GPU background
Programming becomes harder!
C code for dot product (map-reduce):

    float dotp (int n, const float *a, const float *b)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
Also need CPU-side code!

    cudaMalloc ((void **)&V1_D, N*sizeof(float));
    cudaMalloc ((void **)&V2_D, N*sizeof(float));
    cudaMalloc ((void **)&V3_D, blockPerGrid*sizeof(float));

    cudaMemcpy (V1_D, V1_H, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy (V2_D, V2_H, N*sizeof(float), cudaMemcpyHostToDevice);

    dotp<<<blockPerGrid, ThreadsPerBlock>>> (N, V1_D, V2_D, V3_D);

    V3_H = new float[blockPerGrid];
    cudaMemcpy (V3_H, V3_D, blockPerGrid*sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0;
    for (int i = 0; i < blockPerGrid; i++)
        sum += V3_H[i];

    delete[] V3_H;
CUDA device code for dot product:

    __global__ void dotp (int n, const float *a, const float *b, float *results)
    {
        __shared__ float cache[ThreadsPerBlock];
        float temp = 0.0f;
        unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
        const unsigned int idx = threadIdx.x;
        while (tid < n) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[idx] = temp;
        __syncthreads();

        // tree reduction of the per-thread partial sums in shared memory
        int i = blockDim.x / 2;
        while (i != 0) {
            if (idx < i)
                cache[idx] += cache[idx + i];
            __syncthreads();
            i /= 2;
        }
        if (idx == 0)
            results[blockIdx.x] = cache[0];
    }
October 2018 Shonan 136 — Nessie 6
Nested Data Parallelism
NESL
- NESL is a first-order functional language for parallel programming over sequences, designed by Guy Blelloch [Blelloch '96].
- Provides a parallel for-each operation (with optional filter):

      { x + y : x in xs; y in ys }
      { x / y : x in xs; y in ys | (y /= 0) }

- Provides other parallel operations on sequences, such as reductions, prefix-scans, and permutations:

      function dot (xs, ys) = sum ({ x * y : x in xs; y in ys })

- Supports Nested Data Parallelism (NDP) — components of a parallel computation may themselves be parallel.
October 2018 Shonan 136 — Nessie 7
Nested Data Parallelism
NDP example: sparse matrix times dense vector

    | 1 0 4 0 0 |   | 1 |
    | 0 3 0 0 2 |   | 2 |
    | 0 0 0 5 0 | × | 3 |
    | 6 7 0 0 8 |   | 4 |
    | 0 0 9 0 0 |   | 5 |

We want to avoid computing products where matrix entries are 0.

A sparse representation tracks the non-zero entries using a sequence of sequences of index-value pairs:

    [ [(0, 1), (2, 4)],
      [(1, 3), (4, 2)],
      [(3, 5)],
      [(0, 6), (1, 7), (4, 8)],
      [(2, 9)] ]
October 2018 Shonan 136 — Nessie 8
Nested Data Parallelism
NDP example: sparse-matrix times vector
In NESL, this algorithm has a compact expression:

    function svxv (sv, v) = sum ({ x * v[i] : (i, x) in sv })
    function smxv (sm, v) = { svxv (sv, v) : sv in sm }

Notice that the smxv function is a map of map-reduce subcomputations; i.e., nested data parallelism.
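For intuition about the flat code this compiles down to, here is a hand-written CUDA sketch (not Nessie output), assuming the segment descriptor has been converted to CSR-style row offsets; smxv_flat, rowOff, cols, and vals are invented names:

    // One thread per row; rowOff is the exclusive prefix sum of the
    // segment descriptor (row lengths), so row k's entries occupy
    // [rowOff[k], rowOff[k+1]).
    __global__ void smxv_flat (int nrows, const int *rowOff, const int *cols,
                               const float *vals, const float *v, float *out)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            float sum = 0.0f;
            for (int k = rowOff[row]; k < rowOff[row+1]; k++)
                sum += vals[k] * v[cols[k]];    // one svxv per row
            out[row] = sum;
        }
    }

Note that the per-row loop length varies with row density: this one-thread-per-row decomposition is exactly the kind of unbalanced schedule discussed on the next slide.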
October 2018 Shonan 136 — Nessie 9
Nested Data Parallelism
NDP example: sparse-matrix times vector
A naive parallel decomposition will be unbalanced because of the irregularity in sub-problem sizes.

The flattening transformation converts NDP to flat DP (including AoS to SoA). The nested sequence of index-value pairs becomes a pair of flat vectors plus a segment descriptor of row lengths:

    indices = [0, 2, 1, 4, 3, 0, 1, 4, 2]
    values  = [1, 4, 3, 2, 5, 6, 7, 8, 9]
    segdes  = [2, 2, 1, 3, 1]
October 2018 Shonan 136 — Nessie 10
Nested Data Parallelism
Flattening
Flattening (a.k.a. vectorization) is a global program transformation that converts irregular nested data parallel code into regular flat data parallel code.

- It lifts scalar operations to work on sequences of values.
- It flattens nested sequences, pairing the data with segment descriptors.
- Conditionals are encoded as data.
- The residual program contains vector operations plus sequential control flow and recursion/iteration.
October 2018 Shonan 136 — Nessie 11
Nested Data Parallelism
Flattening function calls
    { f (e) : x in xs }

⟹

    if #xs = 0
    then []
    else let es = { e : x in xs }
         in f↑(es)
October 2018 Shonan 136 — Nessie 12
Nested Data Parallelism
Lifting functions
If we have

    function f (x) = e

then f↑ is defined to be

    function f↑ (xs) = { e : x in xs }
October 2018 Shonan 136 — Nessie 13
Nested Data Parallelism
Flattening conditionals
    { if b then e1 else e2 : x in xs }

⟹

    let fs = { b : x in xs }
    let (xs1, xs2) = PARTITION(xs, fs)
    let vs1 = { e1 : x in xs1 }
    let vs2 = { e2 : x in xs2 }
    in COMBINE(vs1, vs2, fs)
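To pin down what PARTITION and COMBINE mean, here is a sequential host-side sketch of their semantics (illustrative only; in the flattened program they are themselves parallel vector operations):

    #include <vector>
    #include <utility>

    // PARTITION splits xs according to the flag vector fs;
    // COMBINE inverts it, merging the two pieces back in the original order.
    template <typename T>
    std::pair<std::vector<T>, std::vector<T>>
    PARTITION (const std::vector<T> &xs, const std::vector<bool> &fs)
    {
        std::vector<T> xs1, xs2;
        for (size_t i = 0; i < xs.size(); i++)
            (fs[i] ? xs1 : xs2).push_back(xs[i]);
        return {xs1, xs2};
    }

    template <typename T>
    std::vector<T> COMBINE (const std::vector<T> &vs1, const std::vector<T> &vs2,
                            const std::vector<bool> &fs)
    {
        std::vector<T> out;
        size_t i1 = 0, i2 = 0;
        for (bool f : fs)
            out.push_back(f ? vs1[i1++] : vs2[i2++]);
        return out;
    }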
October 2018 Shonan 136 — Nessie 14
Nested Data Parallelism
Flattening example: factorial
    function fact (n) = if (n <= 0) then 1 else n * fact (n - 1)
    function fact↑ (ns) = { fact(n) : n in ns }

⟹

    function fact↑ (ns) =
      let fs = (ns <=↑ dist(0, #ns));
          (ns1, ns2) = PARTITION(ns, fs);
          vs1 = dist(1, #ns1);
          vs2 = if (#ns2 = 0)
                then []
                else let
                    es = (ns2 -↑ dist(1, #ns2));
                    rs = fact↑ (es);
                  in (ns2 *↑ rs);
      in COMBINE(vs1, vs2, fs)
October 2018 Shonan 136 — Nessie 15
NESL on GPUs
NESL on GPUs
- NESL was designed for bulk-data processing on wide-vector machines (SIMD).
- It is potentially a good fit for GPU computation.
- A first try [Bergstrom & Reppy '12] demonstrated the feasibility of NESL on GPUs, but was significantly slower than hand-tuned CUDA code for some benchmarks (worst case: over 50 times slower on Barnes-Hut [Burtscher & Pingali '11]).
October 2018 Shonan 136 — Nessie 16
NESL on GPUs
Areas for improvement
We identified a number of areas for improvement:
- Better fusion:
  - Fuse generators, scans, and reductions with maps.
  - "Horizontal fusion": fuse independent maps over the same index space (see the sketch below).
- Better segment descriptor management.
- Better memory management.

It proved difficult/impossible to support these improvements in the context of the VCODE interpreter.
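As an aside on horizontal fusion: two independent maps over the same index space can share one kernel and one traversal of the input. A minimal CUDA sketch of the idea (an invented example, not compiler output):

    __global__ void fusedMaps (int n, const float *xs, float *as, float *bs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = xs[i];     // xs is read once...
            as[i] = 2.0f * x;    // ...for the first map
            bs[i] = x + 1.0f;    // ...and for the second map
        }
    }

Unfused, this would be two kernel launches and two reads of xs.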
October 2018 Shonan 136 — Nessie 17
Nessie
Nessie
A new NESL compiler built from scratch.
- The front-end produces a monomorphic, direct-style IR.
- Flattening eliminates NDP and produces Flan, a flat-vector language with VCODE-like operators.
- Shape analysis tags vectors with size information (symbolic in some cases).
- Flan is converted to λcu, which is where fusion and other optimizations occur.
October 2018 Shonan 136 — Nessie 18
Nessie
λcu — An IR for GPU programs
λcu is a three-level language:
- CPU expressions – a direct-style extended λ-calculus with kernel dispatch
- Kernels – sequences of second-order array combinators (SOACs)
- GPU anonymous functions – first-order functions that are the arguments to the SOACs
October 2018 Shonan 136 — Nessie 19
Nessie
λcu — An IR for GPU programs (continued ...)

CPU expressions

    prog   ::= kern dcl
    dcl    ::= function f ( params ) blk dcl
             | let params = exp dcl
             | exp
    params ::= x1 : τ1, ..., xn : τn
    blk    ::= { bind exp }
    bind   ::= let params = exp
    exp    ::= blk
             | run K args
             | f args
             | if exp then blk else blk
             | exp ⊕ exp
             | · · ·

Kernel expressions

    kern ::= kernel K xs { bind return ys }
    bind ::= let xs = SOAC arg
    arg  ::= Λ | rop | shape

GPU expressions

    Λ   ::= { xs => exp using ys }
    exp ::= if exp then exp else exp
          | let x = exp in exp
          | exp ⊕ exp
          | xs[i]
          | x
          | · · ·
October 2018 Shonan 136 — Nessie 20
Nessie
Second-Order Array Combinators
Like Futhark [Henriksen et al. '14], Nova [Collins et al. '14], and other systems, we use Second-Order Array Combinators (SOACs) to represent the iteration structure of our operations on sequences.

    ONCE          : (unit ⇒ τ) → τ
    MAP           : (int ⇒ τ) int → τ↑
    PERMUTE       : (int ⇒ τ) (int ⇒ int) int → τ↑
    REDUCE        : (int ⇒ τ) ropτ int → τ
    SCAN          : (int ⇒ τ) ropτ int → τ↑
    FILTER        : (int ⇒ τ) (τ ⇒ bool) int → τ↑
    PARTITION     : (int ⇒ τ) (τ ⇒ bool) int → τ↑ × τ↑
    SEG_PERMUTE   : (int ⇒ τ) (int ⇒ int) sd → τ↑
    SEG_REDUCE    : (int ⇒ τ) ropτ sd → τ↑
    SEG_SCAN      : (int ⇒ τ) ropτ sd → τ↑
    SEG_FILTER    : (int ⇒ τ) (τ ⇒ bool) sd → τ↑ × sd
    SEG_PARTITION : (int ⇒ τ) (τ ⇒ bool) sd → τ↑ × sd × τ↑ × sd
October 2018 Shonan 136 — Nessie 21
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: iota

    { i : i in [0 : n-1] }

⟹

    kernel K (n : int) -> [int] {
      let res = MAP { i => i } n
      return res
    }
October 2018 Shonan 136 — Nessie 22
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: the element-wise product of two sequences

    { x * y : x in xs; y in ys }

⟹

    kernel K (xs : [float], ys : [float]) -> [float] {
      let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      return res
    }
October 2018 Shonan 136 — Nessie 22
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: summing a sequence

    sum (xs)

⟹

    kernel K (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }
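One way to picture a pull thunk in generated code is as a device function of the index that the combinator calls on demand. A hypothetical CUDA sketch (assuming nvcc's --extended-lambda and a fixed block size of 256; this is not Nessie's actual code generator):

    // REDUCE over a pull thunk: the input is a function of the index,
    // so fusing a MAP into a REDUCE is just composing thunks -- no
    // intermediate vector is materialized.
    template <typename Pull>
    __global__ void reduceKernel (int n, Pull pull, float *blockSums)
    {
        __shared__ float cache[256];
        float acc = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            acc += pull(i);                     // pull the i'th input
        cache[threadIdx.x] = acc;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s)
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = cache[0];
    }

    // For some block count B:
    // sum(xs):     reduceKernel<<<B, 256>>>(n, [=] __device__ (int i) { return xs[i]; }, out);
    // dot(xs, ys): reduceKernel<<<B, 256>>>(n, [=] __device__ (int i) { return xs[i] * ys[i]; }, out);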
October 2018 Shonan 136 — Nessie 22
Nessie
Nessie backend
Pipeline:

    Flan --translate--> λcu --fusion--> λcu --memory analysis--> λcu --code generation--> CUDA
                             (lp_solve)

- Designed to support better fusion, etc.
- The backend transforms flattened code to CUDA in several steps:
  - ILP-based fusion [Megiddo and Sarkar '99; Robinson et al '14].
  - Memory analysis based on uniqueness types [de Vries et al '07].
  - Explicit memory management is added based on the analysis.
October 2018 Shonan 136 — Nessie 23
Nessie
Simple map-reduce fusion
The λcu code for the dotp example is:

    kernel prod (xs : [float], ys : [float]) -> [float] {
      let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      return res
    }
    kernel sum (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t1 : [float] = run prod (xs, ys)
      let t2 : float = run sum (t1)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 1: Fuse the two kernels into a combined kernel.
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 1: Fuse the two kernels into a combined kernel.
    kernel F (xs : [float], ys : [float]) -> float {
      let ts = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      let res = REDUCE { i => ts[i] using ts } (FADD) (#ts)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t2 : float = run F (xs, ys)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 2: Fuse the MAP operation into the REDUCE's pull operation.
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 2: Fuse the MAP operation into the REDUCE's pull operation.

    kernel F (xs : [float], ys : [float]) -> float {
      let res = REDUCE { i => xs[i] * ys[i] using xs, ys } (FADD) (#xs)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t2 : float = run F (xs, ys)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Fancier fusion
Consider the following NESL function (adapted from [Robinson et al '14]):

    function norm2 (xs) : [float] -> ([float], [float]) =
      let sum1 = sum(xs);
          gts = { x : x in xs | (x > 0) };
          sum2 = sum(gts);
      in
        ({ x / sum1 : x in xs }, { x / sum2 : x in xs })
October 2018 Shonan 136 — Nessie 25
Nessie
Fancier fusion (continued ...)
Translating to λcu produces the following code:

    kernel K1 (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }
    kernel K2 (xs : [float]) -> [float] {
      let res = FILTER { i => xs[i] using xs } { x => x > 0 } (#xs)
      return res
    }
    kernel K3 (xs : [float], s : float) -> [float] {
      let res = MAP { i => xs[i] / s using xs } (#xs)
      return res
    }

    function norm2 (xs : [float]) -> ([float], [float]) {
      let sum1 : float = run K1 (xs)
      let gts : [float] = run K2 (xs)
      let sum2 = run K1 (gts)
      let res1 : [float] = run K3 (xs, sum1)
      let res2 : [float] = run K3 (xs, sum2)
      return (res1, res2)
    }
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
[Figure: the program dependence graph (PDG) for norm2, a single control region: xs feeds run K1 (xs), producing sum1, and run K2 (xs), producing gts; gts feeds run K1 (gts), producing sum2; xs and sum1 feed run K3 (xs, sum1), producing res1; xs and sum2 feed run K3 (xs, sum2), producing res2; the result is (res1, res2)]
Nessie
Fancier fusion (continued ...)
One possible schedule

[Figure: the same PDG partitioned into one possible kernel schedule]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
Another possible schedule

[Figure: the same PDG partitioned into a different kernel schedule]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
    kernel F1 (xs : [float]) -> (float, float) {
      let (sum1, sum2) =
        REDUCE { i => let x = xs[i] in (x, if x > 0 then x else 0) using xs }
               (FADD, FADD)
               (#xs)
      return (sum1, sum2)
    }
    kernel F2 (xs : [float], sum1 : float, sum2 : float) -> ([float], [float]) {
      let (res1, res2) =
        MAP { i => let x = xs[i] in (x / sum1, x / sum2) using xs, sum1, sum2 }
            (#xs)
      return (res1, res2)
    }

    function norm2 (xs : [float]) -> ([float], [float]) {
      let (sum1 : float, sum2 : float) = run F1 (xs)
      let (res1 : [float], res2 : [float]) = run F2 (xs, sum1, sum2)
      return (res1, res2)
    }
Using ILP produces the following schedule:

[Figure: the PDG partitioned into two kernels: KF1 covers run K1 (xs), run K2 (xs), and run K1 (gts); KF2 covers run K3 (xs, sum1) and run K3 (xs, sum2)]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
Notice how we fused the FILTER into the REDUCE!
Using ILP produces the following schedule:

[Figure: as before, KF1 computes sum1 and sum2 in a single pass; KF2 computes res1 and res2 in a single pass]
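A plausible CUDA rendering of KF1's body (a sketch under the same assumptions as the earlier reduction sketch, not actual Nessie output): one pass over xs reduces with a pair of accumulators, with the FILTER folded in as x > 0 ? x : 0.

    __global__ void F1 (int n, const float *xs, float *sum1s, float *sum2s)
    {
        __shared__ float c1[256], c2[256];
        float s1 = 0.0f, s2 = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x) {
            float x = xs[i];
            s1 += x;                          // REDUCE of xs
            s2 += (x > 0.0f) ? x : 0.0f;      // FILTER fused into the REDUCE
        }
        c1[threadIdx.x] = s1;
        c2[threadIdx.x] = s2;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s) {
                c1[threadIdx.x] += c1[threadIdx.x + s];
                c2[threadIdx.x] += c2[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            sum1s[blockIdx.x] = c1[0];
            sum2s[blockIdx.x] = c2[0];
        }
    }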
October 2018 Shonan 136 — Nessie 26
Streaming and piecewise execution
Streaming and piecewise execution
- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism.
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.
October 2018 Shonan 136 — Nessie 27
Streaming and piecewise execution
Streaming and piecewise execution
- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism.
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.

[Figure: the flattened sparse-matrix data executed in one piece ("flat execution") versus split into fixed-size chunks ("piecewise execution")]
October 2018 Shonan 136 — Nessie 27
Streaming and piecewise execution
Streaming and piecewise execution (continued ...)
- Connections to Keller and Chakravarty's Distributed Types and Palmer et al.'s piecewise execution of NDP programs.
- Not all operations can be executed in piecewise fashion (e.g., permutations).
- The execution model for Madsen and Filinski's Streaming NESL also requires piecewise execution of kernels; a host-side sketch of the idea follows.
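To make piecewise execution concrete, here is a host-side CUDA sketch of a streamed reduction (illustrative only; deviceSum stands for any on-device reduction, such as the earlier REDUCE sketch, and the piece size is arbitrary):

    #include <cuda_runtime.h>

    // Reduce a vector that may not fit on the device by streaming it
    // through GPU memory one fixed-size piece at a time.
    float piecewiseSum (const float *host, long n,
                        float (*deviceSum)(const float *, long))
    {
        const long PIECE = 1L << 24;          // elements per piece
        float *d = nullptr;
        cudaMalloc((void **)&d, PIECE * sizeof(float));
        float total = 0.0f;
        for (long off = 0; off < n; off += PIECE) {
            long m = (n - off < PIECE) ? (n - off) : PIECE;
            cudaMemcpy(d, host + off, m * sizeof(float),
                       cudaMemcpyHostToDevice);
            total += deviceSum(d, m);         // reduce this piece on the GPU
        }
        cudaFree(d);
        return total;
    }

This works for reductions and scans (carry an accumulator across pieces), but not for operations like permutations that need the whole vector at once.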
October 2018 Shonan 136 — Nessie 28