Nessie: A NESL to CUDA Compiler
John Reppy
University of Chicago
October 2018
GPU background
GPUs
- GPU architectures are optimized for arithmetically intensive computation.
- GPUs provide super-computer levels of parallelism at commodity prices.
- For example, the Tesla V100 provides 15.7 TFlops of peak single-precision performance and 7.8 TFlops of peak double-precision performance.

NVIDIA GPUs have two (or three) levels of parallelism:
- A multicore processor that supports Single-Instruction Multiple-Thread (SIMT) parallelism.
- Multiple multicore processors on a single chip.
- Multiple GPGPU boards per system.
October 2018 Shonan 136 — Nessie 2
GPU background
GPUs
For example, NVIDIA's Kepler GK110 Streaming Multiprocessor (SMX):
- 192 single-precision cores
- 64 double-precision cores
- 32 load/store units
- 32 special function units
- 4 × 32-lane warps in parallel

Lots of parallel compute, but not very much memory.

[Figure: GK110 SMX block diagram: instruction cache; warp schedulers with dispatch units; 64K × 32-bit register file; interconnect network; 64 KB L1 cache/shared memory; 48 KB read-only data cache]
October 2018 Shonan 136 — Nessie 3
GPU background
GPUs (continued ...)
NVIDIA's Tesla K40 architecture has 15 GK110 SMXs (2880 CUDA cores).

[Figure: K40 chip diagram: thread engine, 15 SMXs, 1536 KB L2 cache, memory controllers]

Optimized for processing data in bulk!
October 2018 Shonan 136 — Nessie 4
GPU background
GPU programming model
The design of GPU hardware is manifest in the widely used GPU programming languages (e.g., CUDA and OpenCL).
Thread hierarchy
- Threads (grouped into warps for SIMT execution)
- Blocks (mapped to the same SMX)
- Grid (multiple blocks running the same kernel)

Synchronization
- Block-level barriers
- Atomic memory operations
- Task synchronization

Explicit memory hierarchy
- Disjoint memory spaces
- Per-thread memory maps to registers
- Per-block shared memory
- Global memory
- Host memory
- Also texture and constant memory
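To make the mapping concrete, here is a minimal CUDA sketch (not from the talk; the names and the block size of 256 are invented for the example) that touches each level of the hierarchy:

    #define THREADS 256

    __global__ void demo (const float *g_in, float *g_out, int n)
    {
        // per-thread locals live in registers
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // per-block shared memory, visible to every thread in the block
        __shared__ float tile[THREADS];

        if (i < n) tile[threadIdx.x] = g_in[i];          // global -> shared
        __syncthreads();                                 // block-level barrier
        if (i < n) g_out[i] = 2.0f * tile[threadIdx.x];  // shared -> global
    }

    // Host memory is a disjoint space; data moves explicitly:
    //   cudaMemcpy(d_in, h_in, n*sizeof(float), cudaMemcpyHostToDevice);
    //   demo<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_in, d_out, n);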
October 2018 Shonan 136 — Nessie 5
GPU background
Programming becomes harder!
C code for dot product (map-reduce):

    float dotp (int n, const float *a, const float *b)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }
Also need CPU-side code!

    cudaMalloc ((void **)&V1_D, N*sizeof(float));
    cudaMalloc ((void **)&V2_D, N*sizeof(float));
    cudaMalloc ((void **)&V3_D, blockPerGrid*sizeof(float));

    cudaMemcpy (V1_D, V1_H, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy (V2_D, V2_H, N*sizeof(float), cudaMemcpyHostToDevice);

    dotp<<<blockPerGrid, ThreadsPerBlock>>> (N, V1_D, V2_D, V3_D);

    V3_H = new float[blockPerGrid];
    cudaMemcpy (V3_H, V3_D, blockPerGrid*sizeof(float), cudaMemcpyDeviceToHost);

    float sum = 0;
    for (int i = 0; i < blockPerGrid; i++)
        sum += V3_H[i];

    delete[] V3_H;
CUDA device code for dot product:

    __global__ void dotp (int n, const float *a, const float *b, float *results)
    {
        __shared__ float cache[ThreadsPerBlock];
        float temp = 0.0f;
        unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;
        const unsigned int idx = threadIdx.x;
        while (tid < n) {
            temp += a[tid] * b[tid];
            tid += blockDim.x * gridDim.x;
        }
        cache[idx] = temp;
        __syncthreads();

        // tree reduction of the per-thread partial sums in shared memory
        int i = blockDim.x / 2;
        while (i != 0) {
            if (idx < i)
                cache[idx] += cache[idx + i];
            __syncthreads();
            i /= 2;
        }
        if (idx == 0)
            results[blockIdx.x] = cache[0];
    }
October 2018 Shonan 136 — Nessie 6
Nested Data Parallelism
NESL
- NESL is a first-order functional language for parallel programming over sequences, designed by Guy Blelloch [Blelloch '96].
- Provides a parallel for-each operation (with optional filter):

      { x + y : x in xs; y in ys }
      { x / y : x in xs; y in ys | (y /= 0) }

- Provides other parallel operations on sequences, such as reductions, prefix-scans, and permutations:

      function dot (xs, ys) = sum ({ x * y : x in xs; y in ys })

- Supports Nested Data Parallelism (NDP) — components of a parallel computation may themselves be parallel.
October 2018 Shonan 136 — Nessie 7
Nested Data Parallelism
NDP example: sparse matrix times dense vector

    | 1 0 4 0 0 |   | 1 |
    | 0 3 0 0 2 |   | 2 |
    | 0 0 0 5 0 | × | 3 |
    | 6 7 0 0 8 |   | 4 |
    | 0 0 9 0 0 |   | 5 |

We want to avoid computing products where matrix entries are 0.

A sparse representation tracks the non-zero entries using a sequence of sequences of index-value pairs:

    [ [(0, 1), (2, 4)],
      [(1, 3), (4, 2)],
      [(3, 5)],
      [(0, 6), (1, 7), (4, 8)],
      [(2, 9)] ]
October 2018 Shonan 136 — Nessie 8
Nested Data Parallelism
NDP example: sparse-matrix times vector
In NESL, this algorithm has a compact expression:

    function svxv (sv, v) = sum ({ x * v[i] : (i, x) in sv })
    function smxv (sm, v) = { svxv (sv, v) : sv in sm }

Notice that the smxv function is a map of map-reduce subcomputations; i.e., nested data parallelism.
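For intuition about the flat code this compiles down to, here is a hand-written CUDA sketch (not Nessie output), assuming the segment descriptor has been converted to CSR-style row offsets; smxv_flat, rowOff, cols, and vals are invented names:

    // One thread per row; rowOff is the exclusive prefix sum of the
    // segment descriptor (row lengths), so row k's entries occupy
    // [rowOff[k], rowOff[k+1]).
    __global__ void smxv_flat (int nrows, const int *rowOff, const int *cols,
                               const float *vals, const float *v, float *out)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < nrows) {
            float sum = 0.0f;
            for (int k = rowOff[row]; k < rowOff[row+1]; k++)
                sum += vals[k] * v[cols[k]];    // one svxv per row
            out[row] = sum;
        }
    }

Note that the per-row loop length varies with row density: this one-thread-per-row decomposition is exactly the kind of unbalanced schedule discussed on the next slide.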
October 2018 Shonan 136 — Nessie 9
Nested Data Parallelism
NDP example: sparse-matrix times vector
A naive parallel decomposition will be unbalanced because of the irregularity in sub-problem sizes.

The flattening transformation converts NDP to flat DP (including AoS to SoA). The nested sequence of index-value pairs becomes a pair of flat vectors plus a segment descriptor of row lengths:

    indices = [0, 2, 1, 4, 3, 0, 1, 4, 2]
    values  = [1, 4, 3, 2, 5, 6, 7, 8, 9]
    segdes  = [2, 2, 1, 3, 1]
October 2018 Shonan 136 — Nessie 10
Nested Data Parallelism
Flattening
Flattening (a.k.a. vectorization) is a global program transformation that converts irregular nested data parallel code into regular flat data parallel code.

- It lifts scalar operations to work on sequences of values.
- It flattens nested sequences, pairing the data with segment descriptors.
- Conditionals are encoded as data.
- The residual program contains vector operations plus sequential control flow and recursion/iteration.
October 2018 Shonan 136 — Nessie 11
Nested Data Parallelism
Flattening function calls
    { f (e) : x in xs }

⟹

    if #xs = 0
    then []
    else let es = { e : x in xs }
         in f↑(es)
October 2018 Shonan 136 — Nessie 12
Nested Data Parallelism
Lifting functions
If we have

    function f (x) = e

then f↑ is defined to be

    function f↑ (xs) = { e : x in xs }
October 2018 Shonan 136 — Nessie 13
Nested Data Parallelism
Flattening conditionals
    { if b then e1 else e2 : x in xs }

⟹

    let fs = { b : x in xs }
    let (xs1, xs2) = PARTITION(xs, fs)
    let vs1 = { e1 : x in xs1 }
    let vs2 = { e2 : x in xs2 }
    in COMBINE(vs1, vs2, fs)
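To pin down what PARTITION and COMBINE mean, here is a sequential host-side sketch of their semantics (illustrative only; in the flattened program they are themselves parallel vector operations):

    #include <vector>
    #include <utility>

    // PARTITION splits xs according to the flag vector fs;
    // COMBINE inverts it, merging the two pieces back in the original order.
    template <typename T>
    std::pair<std::vector<T>, std::vector<T>>
    PARTITION (const std::vector<T> &xs, const std::vector<bool> &fs)
    {
        std::vector<T> xs1, xs2;
        for (size_t i = 0; i < xs.size(); i++)
            (fs[i] ? xs1 : xs2).push_back(xs[i]);
        return {xs1, xs2};
    }

    template <typename T>
    std::vector<T> COMBINE (const std::vector<T> &vs1, const std::vector<T> &vs2,
                            const std::vector<bool> &fs)
    {
        std::vector<T> out;
        size_t i1 = 0, i2 = 0;
        for (bool f : fs)
            out.push_back(f ? vs1[i1++] : vs2[i2++]);
        return out;
    }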
October 2018 Shonan 136 — Nessie 14
Nested Data Parallelism
Flattening example: factorial
    function fact (n) = if (n <= 0) then 1 else n * fact (n - 1)
    function fact↑ (ns) = { fact(n) : n in ns }

⟹

    function fact↑ (ns) =
      let fs = (ns <=↑ dist(0, #ns));
          (ns1, ns2) = PARTITION(ns, fs);
          vs1 = dist(1, #ns1);
          vs2 = if (#ns2 = 0)
                then []
                else let
                    es = (ns2 -↑ dist(1, #ns2));
                    rs = fact↑ (es);
                  in (ns2 *↑ rs);
      in COMBINE(vs1, vs2, fs)
October 2018 Shonan 136 — Nessie 15
NESL on GPUs
NESL on GPUs
- NESL was designed for bulk-data processing on wide-vector machines (SIMD).
- It is potentially a good fit for GPU computation.
- A first try [Bergstrom & Reppy '12] demonstrated the feasibility of NESL on GPUs, but was significantly slower than hand-tuned CUDA code for some benchmarks (worst case: over 50 times slower on Barnes-Hut [Burtscher & Pingali '11]).
October 2018 Shonan 136 — Nessie 16
NESL on GPUs
Areas for improvement
We identified a number of areas for improvement:
- Better fusion:
  - Fuse generators, scans, and reductions with maps.
  - "Horizontal fusion": fuse independent maps over the same index space (see the sketch below).
- Better segment descriptor management.
- Better memory management.

It proved difficult/impossible to support these improvements in the context of the VCODE interpreter.
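As an aside on horizontal fusion: two independent maps over the same index space can share one kernel and one traversal of the input. A minimal CUDA sketch of the idea (an invented example, not compiler output):

    __global__ void fusedMaps (int n, const float *xs, float *as, float *bs)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = xs[i];     // xs is read once...
            as[i] = 2.0f * x;    // ...for the first map
            bs[i] = x + 1.0f;    // ...and for the second map
        }
    }

Unfused, this would be two kernel launches and two reads of xs.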
October 2018 Shonan 136 — Nessie 17
Nessie
Nessie
A new NESL compiler built from scratch.
- The front-end produces a monomorphic, direct-style IR.
- Flattening eliminates NDP and produces Flan, a flat-vector language with VCODE-like operators.
- Shape analysis tags vectors with size information (symbolic in some cases).
- Flan is converted to λcu, which is where fusion and other optimizations occur.
October 2018 Shonan 136 — Nessie 18
Nessie
λcu — An IR for GPU programs
λcu is a three-level language:
- CPU expressions – a direct-style extended λ-calculus with kernel dispatch
- Kernels – sequences of second-order array combinators (SOACs)
- GPU anonymous functions – first-order functions that are the arguments to the SOACs
October 2018 Shonan 136 — Nessie 19
Nessie
λcu — An IR for GPU programs (continued ...)

CPU expressions

    prog   ::= kern dcl
    dcl    ::= function f ( params ) blk dcl
             | let params = exp dcl
             | exp
    params ::= x1 : τ1, ..., xn : τn
    blk    ::= { bind exp }
    bind   ::= let params = exp
    exp    ::= blk
             | run K args
             | f args
             | if exp then blk else blk
             | exp ⊕ exp
             | · · ·

Kernel expressions

    kern ::= kernel K xs { bind return ys }
    bind ::= let xs = SOAC arg
    arg  ::= Λ | rop | shape

GPU expressions

    Λ   ::= { xs => exp using ys }
    exp ::= if exp then exp else exp
          | let x = exp in exp
          | exp ⊕ exp
          | xs[i]
          | x
          | · · ·
October 2018 Shonan 136 — Nessie 20
Nessie
Second-Order Array Combinators
Like Futhark [Henriksen et al. '14], Nova [Collins et al. '14], and other systems, we use Second-Order Array Combinators (SOACs) to represent the iteration structure of our operations on sequences.

    ONCE          : (unit ⇒ τ) → τ
    MAP           : (int ⇒ τ) int → τ↑
    PERMUTE       : (int ⇒ τ) (int ⇒ int) int → τ↑
    REDUCE        : (int ⇒ τ) ropτ int → τ
    SCAN          : (int ⇒ τ) ropτ int → τ↑
    FILTER        : (int ⇒ τ) (τ ⇒ bool) int → τ↑
    PARTITION     : (int ⇒ τ) (τ ⇒ bool) int → τ↑ × τ↑
    SEG_PERMUTE   : (int ⇒ τ) (int ⇒ int) sd → τ↑
    SEG_REDUCE    : (int ⇒ τ) ropτ sd → τ↑
    SEG_SCAN      : (int ⇒ τ) ropτ sd → τ↑
    SEG_FILTER    : (int ⇒ τ) (τ ⇒ bool) sd → τ↑ × sd
    SEG_PARTITION : (int ⇒ τ) (τ ⇒ bool) sd → τ↑ × sd × τ↑ × sd
October 2018 Shonan 136 — Nessie 21
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: iota

    { i : i in [0 : n-1] }

⟹

    kernel K (n : int) -> [int] {
      let res = MAP { i => i } n
      return res
    }
October 2018 Shonan 136 — Nessie 22
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: the element-wise product of two sequences

    { x * y : x in xs; y in ys }

⟹

    kernel K (xs : [float], ys : [float]) -> [float] {
      let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      return res
    }
October 2018 Shonan 136 — Nessie 22
Nessie
Second-Order Array Combinators (continued ...)
There is a key difference between our combinators and previous work: combinators use a pull thunk, which is parameterized by the array indices, to get their inputs.

Example: summing a sequence

    sum (xs)

⟹

    kernel K (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }
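One way to picture a pull thunk in generated code is as a device function of the index that the combinator calls on demand. A hypothetical CUDA sketch (assuming nvcc's --extended-lambda and a fixed block size of 256; this is not Nessie's actual code generator):

    // REDUCE over a pull thunk: the input is a function of the index,
    // so fusing a MAP into a REDUCE is just composing thunks -- no
    // intermediate vector is materialized.
    template <typename Pull>
    __global__ void reduceKernel (int n, Pull pull, float *blockSums)
    {
        __shared__ float cache[256];
        float acc = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            acc += pull(i);                     // pull the i'th input
        cache[threadIdx.x] = acc;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s)
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = cache[0];
    }

    // For some block count B:
    // sum(xs):     reduceKernel<<<B, 256>>>(n, [=] __device__ (int i) { return xs[i]; }, out);
    // dot(xs, ys): reduceKernel<<<B, 256>>>(n, [=] __device__ (int i) { return xs[i] * ys[i]; }, out);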
October 2018 Shonan 136 — Nessie 22
Nessie
Nessie backend
Pipeline:

    Flan --translate--> λcu --fusion--> λcu --memory analysis--> λcu --code generation--> CUDA
                             (lp_solve)

- Designed to support better fusion, etc.
- The backend transforms flattened code to CUDA in several steps:
  - ILP-based fusion [Megiddo and Sarkar '99; Robinson et al '14].
  - Memory analysis based on uniqueness types [de Vries et al '07].
  - Explicit memory management is added based on the analysis.
October 2018 Shonan 136 — Nessie 23
Nessie
Simple map-reduce fusion
The λcu code for the dotp example is:

    kernel prod (xs : [float], ys : [float]) -> [float] {
      let res = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      return res
    }
    kernel sum (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t1 : [float] = run prod (xs, ys)
      let t2 : float = run sum (t1)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 1: Fuse the two kernels into a combined kernel.
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 1: Fuse the two kernels into a combined kernel.
    kernel F (xs : [float], ys : [float]) -> float {
      let ts = MAP { i => xs[i] * ys[i] using xs, ys } (#xs)
      let res = REDUCE { i => ts[i] using ts } (FADD) (#ts)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t2 : float = run F (xs, ys)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 2: Fuse the MAP operation into the REDUCE's pull operation.
October 2018 Shonan 136 — Nessie 24
Nessie
Simple map-reduce fusion
Step 2: Fuse the MAP operation into the REDUCE's pull operation.

    kernel F (xs : [float], ys : [float]) -> float {
      let res = REDUCE { i => xs[i] * ys[i] using xs, ys } (FADD) (#xs)
      return res
    }

    function dots (xs : [float], ys : [float]) -> float {
      let t2 : float = run F (xs, ys)
      return t2
    }
October 2018 Shonan 136 — Nessie 24
Nessie
Fancier fusion
Consider the following NESL function (adapted from [Robinson et al '14]):

    function norm2 (xs) : [float] -> ([float], [float]) =
      let sum1 = sum(xs);
          gts = { x : x in xs | (x > 0) };
          sum2 = sum(gts);
      in
        ({ x / sum1 : x in xs }, { x / sum2 : x in xs })
October 2018 Shonan 136 — Nessie 25
Nessie
Fancier fusion (continued ...)
Translating to λcu produces the following code:

    kernel K1 (xs : [float]) -> float {
      let res = REDUCE { i => xs[i] using xs } (FADD) (#xs)
      return res
    }
    kernel K2 (xs : [float]) -> [float] {
      let res = FILTER { i => xs[i] using xs } { x => x > 0 } (#xs)
      return res
    }
    kernel K3 (xs : [float], s : float) -> [float] {
      let res = MAP { i => xs[i] / s using xs } (#xs)
      return res
    }

    function norm2 (xs : [float]) -> ([float], [float]) {
      let sum1 : float = run K1 (xs)
      let gts : [float] = run K2 (xs)
      let sum2 = run K1 (gts)
      let res1 : [float] = run K3 (xs, sum1)
      let res2 : [float] = run K3 (xs, sum2)
      return (res1, res2)
    }
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
[Figure: the program dependence graph (PDG) for norm2, a single control region: xs feeds run K1 (xs), producing sum1, and run K2 (xs), producing gts; gts feeds run K1 (gts), producing sum2; xs and sum1 feed run K3 (xs, sum1), producing res1; xs and sum2 feed run K3 (xs, sum2), producing res2; the result is (res1, res2)]
Nessie
Fancier fusion (continued ...)
One possible schedule

[Figure: the same PDG partitioned into one possible kernel schedule]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
Another possible schedule

[Figure: the same PDG partitioned into a different kernel schedule]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
    kernel F1 (xs : [float]) -> (float, float) {
      let (sum1, sum2) =
        REDUCE { i => let x = xs[i] in (x, if x > 0 then x else 0) using xs }
               (FADD, FADD)
               (#xs)
      return (sum1, sum2)
    }
    kernel F2 (xs : [float], sum1 : float, sum2 : float) -> ([float], [float]) {
      let (res1, res2) =
        MAP { i => let x = xs[i] in (x / sum1, x / sum2) using xs, sum1, sum2 }
            (#xs)
      return (res1, res2)
    }

    function norm2 (xs : [float]) -> ([float], [float]) {
      let (sum1 : float, sum2 : float) = run F1 (xs)
      let (res1 : [float], res2 : [float]) = run F2 (xs, sum1, sum2)
      return (res1, res2)
    }
Using ILP produces the following schedule:

[Figure: the PDG partitioned into two kernels: KF1 covers run K1 (xs), run K2 (xs), and run K1 (gts); KF2 covers run K3 (xs, sum1) and run K3 (xs, sum2)]
October 2018 Shonan 136 — Nessie 26
Nessie
Fancier fusion (continued ...)
Notice how we fused the FILTER into the REDUCE!
Using ILP produces the following schedule:

[Figure: as before, KF1 computes sum1 and sum2 in a single pass; KF2 computes res1 and res2 in a single pass]
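A plausible CUDA rendering of KF1's body (a sketch under the same assumptions as the earlier reduction sketch, not actual Nessie output): one pass over xs reduces with a pair of accumulators, with the FILTER folded in as x > 0 ? x : 0.

    __global__ void F1 (int n, const float *xs, float *sum1s, float *sum2s)
    {
        __shared__ float c1[256], c2[256];
        float s1 = 0.0f, s2 = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x) {
            float x = xs[i];
            s1 += x;                          // REDUCE of xs
            s2 += (x > 0.0f) ? x : 0.0f;      // FILTER fused into the REDUCE
        }
        c1[threadIdx.x] = s1;
        c2[threadIdx.x] = s2;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s) {
                c1[threadIdx.x] += c1[threadIdx.x + s];
                c2[threadIdx.x] += c2[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            sum1s[blockIdx.x] = c1[0];
            sum2s[blockIdx.x] = c2[0];
        }
    }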
October 2018 Shonan 136 — Nessie 26
Streaming and piecewise execution
Streaming and piecewise execution
- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism.
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.
October 2018 Shonan 136 — Nessie 27
Streaming and piecewise execution
Streaming and piecewise execution
- λcu processes vectors as atomic objects, which can exceed the memory resources of a GPU.
- We could partition kernel execution into smaller pieces (either statically or dynamically) to improve scalability and enable multi-GPU parallelism.
- Palmer et al. describe a post-flattening piecewise execution strategy, and there was some follow-on work by Pfannenstiel on scheduling piecewise execution for threaded execution.

[Figure: the flattened sparse-matrix data executed in one piece ("flat execution") versus split into fixed-size chunks ("piecewise execution")]
October 2018 Shonan 136 — Nessie 27
Streaming and piecewise execution
Streaming and piecewise execution (continued ...)
- Connections to Keller and Chakravarty's Distributed Types and Palmer et al.'s piecewise execution of NDP programs.
- Not all operations can be executed in piecewise fashion (e.g., permutations).
- The execution model for Madsen and Filinski's Streaming NESL also requires piecewise execution of kernels; a host-side sketch of the idea follows.
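To make piecewise execution concrete, here is a host-side CUDA sketch of a streamed reduction (illustrative only; deviceSum stands for any on-device reduction, such as the earlier REDUCE sketch, and the piece size is arbitrary):

    #include <cuda_runtime.h>

    // Reduce a vector that may not fit on the device by streaming it
    // through GPU memory one fixed-size piece at a time.
    float piecewiseSum (const float *host, long n,
                        float (*deviceSum)(const float *, long))
    {
        const long PIECE = 1L << 24;          // elements per piece
        float *d = nullptr;
        cudaMalloc((void **)&d, PIECE * sizeof(float));
        float total = 0.0f;
        for (long off = 0; off < n; off += PIECE) {
            long m = (n - off < PIECE) ? (n - off) : PIECE;
            cudaMemcpy(d, host + off, m * sizeof(float),
                       cudaMemcpyHostToDevice);
            total += deviceSum(d, m);         // reduce this piece on the GPU
        }
        cudaFree(d);
        return total;
    }

This works for reductions and scans (carry an accumulator across pieces), but not for operations like permutations that need the whole vector at once.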
October 2018 Shonan 136 — Nessie 28