Towards a Usable Programming Model for GPGPU. Dr. Orion Sky Lawlor (lawlor@alaska.edu), U. Alaska Fairbanks, 2011-04-19.
TRANSCRIPT
1
Towards a Usable Programming Model for GPGPU
Dr. Orion Sky Lawlor
lawlor@alaska.edu
U. Alaska Fairbanks
2011-04-19
http://lawlor.cs.uaf.edu/
2
Obligatory Introductory Quote
“He who controls the past, controls the future.”
George Orwell, 1984
3
In Parallel Programming...
“He who controls the past, controls the future.”
George Orwell, 1984
“He who controls the writes, controls performance.”
Orion Lawlor, 2011
4
Talk Outline
• Existing parallel programming models
• Who controls the writes?
• Charm++ and Charm--
• Charm-style GPGPU
• Conclusions
5
Existing Model: Superscalar
Hardware parallelization of a sequential programming model
• Fetch future instructions: needs good branch prediction
• Runtime dependency analysis: load/store buffer for memory-carried dependencies
• Rename away false dependencies (RAR, WAR, WAW) -> only true RAW dependencies remain
Now “solved”: low future gain
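The dependency classes the slide names can be written out in plain C++; this is a sketch (not from the talk) of why only the true Read-After-Write chain forces serialization, while the false dependencies exist only because one name (`a`) is reused.

```cpp
// Sketch: only the RAW chain must wait; WAR/WAW are renameable away.
int dependency_demo(int b, int c, int e, int f, int g, int* d_out) {
    int a = b + c;   // write a
    *d_out = a + e;  // RAW: truly reads the a written above, so it must wait
    a = f + g;       // WAR (vs the read above) and WAW (vs the first write):
                     // false dependencies; a renaming core issues this early
                     // into a fresh physical register
    return a;
}
```

A renaming core executes the last line in parallel with the RAW chain, which is exactly the "runtime dependency analysis" the slide credits to hardware.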
6
Spacetime Data Arrows
Read
Write
“time” (program order)
“space” (memory, node)
7
Read After Write Dependency
Read
Write
Read
Write
Artificial Instruction Boundary
8
Read After Read: No Problem!
Read
Write
Read
Write
Artificial Instruction Boundary
9
Existing Model: Shared Memory
OpenMP, threads, shmem
“Just” let different processors access each others' memory
• HW: cache coherence (false sharing, cache thrashing)
• SW: synchronization (locks, semaphores, fences, ...)
Correctness is a huge issue
• Weird race conditions abound
• New bugs in 10+ year old code
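A minimal sketch (not from the talk) of the kind of race the slide warns about: two threads incrementing one shared counter. The thread counts are illustrative.

```cpp
// Sketch: an unsynchronized shared write race, and the atomic fix.
#include <atomic>
#include <thread>

long racy_count() {                  // n++ is read-modify-write: updates can be lost
    long n = 0;
    auto work = [&]{ for (int i = 0; i < 100000; i++) n++; };
    std::thread a(work), b(work);
    a.join(); b.join();
    return n;                        // anywhere from ~100000 to 200000
}

long atomic_count() {                // the hardware controls the writes,
    std::atomic<long> n{0};          // via an atomic read-modify-write
    auto work = [&]{ for (int i = 0; i < 100000; i++) n++; };
    std::thread a(work), b(work);
    a.join(); b.join();
    return n.load();                 // always exactly 200000
}
```

The racy version may even pass tests most of the time, which is exactly why these bugs survive in 10+ year old code.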
10
Gather: Works Fine
Distributed Reads
Centralized Writes
11
Scatter: Tough to Synchronize!
Oops!
12
Existing Model: Message Passing
MPI, sockets
Explicit control of parallel reads (send) and writes (recv)
• Far fewer race conditions
Programmability is an issue
• Raw byte-based interface (C style)
• High per-message cost (alpha)
• Synchronization issues: when does MPI_Send block?
13
Existing Model: SIMD
SSE, AVX, and GPU
Single Instruction, Multiple Data
• Far fewer fetches & decodes
• Far higher arithmetic intensity
CPU: programmability N/A
• Assembly language (hello, 1984!)
• xmmintrin.h wrappers: _mm_add_ps
• Or pray for an automagic compiler!
GPU: programmability OK
• Graphics side: GLSL, HLSL, Cg
• GPGPU: CUDA, OpenCL, DirectX Compute Shaders
14
NVIDIA CUDA
CPU calls a GPU “kernel” with a “block” of threads
• Now fully programmable (throw/catch, virtual methods, recursion, etc.)
Read and write memory anywhere
• Zero protection against multithreaded race conditions
Manual control over a small __shared__ memory region
Only runs on NVIDIA hardware
• (OpenCL is portable... sorta)
15
OpenGL: GL Shading Language
Mostly programmable (loops, etc.)
Can read anywhere in “textures”, only write to “framebuffer” (2D/3D arrays)
• Reads go through the “texture cache”, so performance is good (iff locality)
• Writes are on a space-filling curve
• Writes are controlled by the graphics driver
• So cannot have synchronization bugs!
Rich selection of texture filtering (array interpolation) modes
• Includes mipmaps, for multigrid
GLSL can run OK on every modern GPU (well, except Intel...)
16
GLSL vs CUDA
GLSL Programs
17
GLSL vs CUDA
GLSL Programs
CUDA Programs
Mipmaps; texture writes
Arbitrary writes
18
GLSL vs CUDA
GLSL Programs
CUDA Programs
Correct Programs
19
GLSL vs CUDA
GLSL Programs
CUDA Programs
Correct Programs
High Performance Programs
20
GPU/CPU Convergence
GPU, per socket:
• SIMD: 16-32 way
• SMT: 2-50 way (register limited)
• SMP: 4-36 way
CPUs will get there, soon!
• SIMD: 8-way AVX (64-way SWAR)
• SMT: 2-way Intel; 4-way IBM
• SMP: 6-8 way/socket already
• Intel has shown 48-way chips
Biggest difference: CPU has branch prediction & superscalar!
21
CUDA: Memory Output Bandwidth
NVIDIA GeForce GTX 280, fixed 128 threads per block
• Kernel startup latency: 4 us
• Kernel output bandwidth: 80 GB/s
t = 4000 ns / kernel + bytes * 0.0125 ns / byte
Charm++ and “Charm--”
23
Existing Model: Charm++
Chares send each other messages
Runtime system does delivery
• Scheduling!
• Migration with efficient forwarding
• Cheap broadcasts
Runtime system schedules chares
• Overlap communication and computation
Programmability still an issue
• Per-message overhead, even with the message combining library
• Collect up your messages (SDAG?)
• Cheap SMP reads? SIMD? GPU?
24
One Charm++ Method Invocation
[Diagram: an entry method on a chare receives one message (but in what order?), reads and updates internal state, and sends messages to other chares. Between send and receive: migration, checkpointing, ...]
25
The Future: SIMD
AVX, SSE, AltiVec, GPU, etc.
Thought experiment:
• Imagine a block of 8 chares living in one SIMD register
• Deliver 8 messages at once (!)
• Or imagine 100K chares living in GPU RAM
Locality (mapping) is important!
• Branch divergence penalty
• Struct-of-Arrays member storage: xxxxxxxx yyyyyyyy zzzzzzzz
• Members of 8 separate chares!
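The Struct-of-Arrays layout the slide sketches as "xxxxxxxx yyyyyyyy zzzzzzzz" can be written out directly; this is a sketch (the field names and `advance` method are illustrative, not from the talk).

```cpp
// Sketch: Array-of-Structs vs the slide's Struct-of-Arrays chare block.
#include <array>

struct ChareAoS { float x, y, z; };   // AoS: x,y,z interleaved in memory

struct ChareBlockSoA {                // SoA: 8 chares' members grouped per
    std::array<float, 8> x, y, z;     // field: xxxxxxxx yyyyyyyy zzzzzzzz
    void advance(float dt) {          // one "message" delivered to all 8:
        for (int i = 0; i < 8; i++)   // a branch-free, vectorizable loop,
            x[i] += y[i] * dt;        // i.e. one SIMD register per field
    }
};
```

With SoA, one SIMD load picks up the same member of 8 chares at once, which is what makes "deliver 8 messages at once" feasible.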
26
Vision: Charm-- Stencil

```
array [2D] stencil {
public:
  float data;
  [entry] void average(float nbors[4]=fetchnbors()) {
    data = 0.25*(nbors[0]+nbors[1]+nbors[2]+nbors[3]);
  }
};
```
27
Vision: Charm-- Explained (same stencil code as above)
• The [2D] chare array is assembled into GPU arrays or SSE vectors
28
Vision: Charm-- Explained (same stencil code as above)
• The [entry] method is broadcast out to blocks of array elements
29
Vision: Charm-- Explained (same stencil code as above)
• fetchnbors() hides local synchronized reads, the network, and domain boundaries
30
Vision: Charm-- Springs

```
array [1D] sim_spring {
public:
  float restlength;
  [entry] void netforce(sim_vertex ends[2]=fetch_ends()) {
    vec3 along = ends[1].pos - ends[0].pos;
    float f = -k*(length(along) - restlength);
    vec3 F = f*normalize(along);
    ends[0].netforce += F;
    ends[1].netforce -= F;
  }
};
```
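The spring slide's Hooke's-law force works out as follows in plain C++; this is a sketch, with a minimal `vec3` and helpers standing in for the pseudocode's vector type (all of them assumptions for illustration, as is the spring constant `k` being passed explicitly).

```cpp
// Sketch: the spring pseudocode's force computation, made concrete.
#include <cmath>

struct vec3 { float x, y, z; };
vec3  operator-(vec3 a, vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
float length(vec3 v) { return std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z); }

// Signed force magnitude along the spring: negative when stretched past
// restlength (pulling the ends together), positive when compressed.
float spring_force(vec3 p0, vec3 p1, float restlength, float k) {
    vec3 along = p1 - p0;
    return -k * (length(along) - restlength);
}
```

The Charm-- version applies this force to both `ends` it fetched, which is a scatter: two springs sharing a vertex both write its `netforce`, so the runtime (not the application code) must control those writes.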
31
One Charm-- Method Invocation
[Diagram: a “mainchare” on the CPU fetches together multiple messages for a chare on the GPU; the chare reads and updates internal states; the mainchare sends off network messages.]
32
Noncontiguous Communication
Network data buffer -> GPU target buffer:
• Run a scatter kernel
• Or fold it into “fetch”
33
Key Charm-- Design Features
Multiple chares receive a message at once
• Runtime block-allocates incoming and outgoing message storage
• Critical for SIMD, GPU, SMP
Receive multiple messages in one entry point
• Minimize roundtrips to the GPU
Explicit support for timesteps
• E.g., double-buffered message storage
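The double-buffered message storage mentioned above can be sketched as follows (not from the talk; the `Message` layout and method names are illustrative): messages for step t+1 accumulate in one buffer while step t's are consumed, and a swap at the step boundary avoids copying.

```cpp
// Sketch: double-buffered per-timestep message storage.
#include <utility>
#include <vector>

struct Message { int dest; float value; };

class StepBuffers {
    std::vector<Message> current, next;   // this step's inbox, next step's inbox
public:
    void deposit(Message m) { next.push_back(m); }       // arrival for step t+1
    const std::vector<Message>& inbox() const { return current; }
    void advance() {                      // timestep boundary
        current.clear();
        std::swap(current, next);         // next step's messages become current
    }
};
```

Because arrivals never touch the buffer being read, a whole timestep's messages can be block-allocated and handed to a kernel as one contiguous batch.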
34
Charm-- Not Shown
Lots of work in the “mainchare”
• Controls decomposition & comms
• Sets up the “fetch”
Still lots of network work
• Gather & send off messages
• Distribute incoming messages
Division of labor?
• Application scientist writes the chare
• Computer scientist writes the mainchare
35
Related Work
Charm++ Accelerator API [Wesolowski]
• Pipelined CUDA copies, queued kernels
• Good backend for Charm--
Intel ArBB: SIMD from a kernel
• Based on RapidMind
• But GPU support?
My “GPGPU” library
• Based on GLSL
The Future
37
The Future: Memory Bandwidth
Today: 1 TF/s, but only 0.1 TB/s
Don't communicate, recompute
• Multistep stencil methods
• “fetch” gets even more complex!
64-bit -> 32-bit -> 16-bit -> 8?
• Spend flops scaling the data
• Split solution + residual storage
• Most flops use fewer bits, in the residual
Fight roundoff with stochastic rounding
• Add noise to improve precision
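A sketch of the "add noise to improve precision" idea (not from the talk; the quantization step `q` and the counts are illustrative assumptions): quantizing to a grid of step q, round-to-nearest silently drops every increment smaller than q/2, while rounding up with probability (value mod q)/q keeps the expected value exact.

```cpp
// Sketch: stochastic rounding vs round-to-nearest on a quantized accumulator.
#include <cmath>
#include <cstdint>
#include <random>

const double q = 1.0 / 64.0;                 // quantization step (illustrative)

double round_nearest(double v) {
    return q * static_cast<int64_t>(v / q + (v >= 0 ? 0.5 : -0.5));
}

double round_stochastic(double v, std::mt19937& rng) {
    double lo = q * std::floor(v / q);
    double frac = (v - lo) / q;              // in [0,1): chance of rounding up
    double u = rng() / 4294967296.0;         // uniform in [0,1)
    return lo + (u < frac ? q : 0.0);
}

// Accumulate 10000 tiny increments of 0.001 (each < q/2) into a quantized sum.
double accumulate(bool stochastic) {
    std::mt19937 rng(42);
    double sum = 0.0;
    for (int i = 0; i < 10000; i++) {
        double s = sum + 0.001;
        sum = stochastic ? round_stochastic(s, rng) : round_nearest(s);
    }
    return sum;  // nearest: stuck at 0; stochastic: close to the true 10.0
}
```

Round-to-nearest loses every increment and returns 0; stochastic rounding is unbiased, so the small-precision format can still track a slowly accumulating residual.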
38
Conclusions
C++ is dead. Long live C++!
CPU and GPU on a collision course
• SIMD + SMT + SMP + network
• Software is the bottleneck
• Exciting time to build software!
Charm-- model
• Support ultra-low grainsize chares
• Combine into SIMD blocks at runtime
• Simplify the programmer's life
• Add flexibility for the runtime system
• BUT must scale to real applications!