Efficient Data Parallel Computing on GPUs
Cliff Woolley University of Virginia / NVIDIA

Overview
• Data Parallel Computing
• Computational Frequency
• Profiling and Load Balancing

Data Parallel Computing
• Vector Processing
– small-scale parallelism
• Data Layout
– large-scale parallelism

Learning by example: A really naïve shader
frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;

    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
    {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        ...
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
    }
    ...
    return OUT;
}

Data Parallel Computing: Vector Processing
float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                       params.x*center.y - 0.5f*(params.x-1.0f));
float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                         center.y - 1.0f, center.y + 1.0f);

Data Parallel Computing: Vector Processing
float2 offset = center.xy - 0.5f;
offset = offset * params.xx + 0.5f; // MADR is cool too – one cycle, two flops
float4 neighbor = center.xxyy + float4(-1.0f, 1.0f, -1.0f, 1.0f);
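
The vectorized rewrite can be checked numerically. The sketch below is a CPU-side Python emulation, not shader code; the function names and the tuple-based swizzle are illustrative assumptions. It suggests that the subtract-then-multiply-add form yields the same offsets as the per-component form, and that one swizzled add produces all four neighbor coordinates.

```python
import math

def offset_scalar(center, scale):
    """Per-component form: each component is computed separately."""
    cx, cy = center
    return (scale * cx - 0.5 * (scale - 1.0),
            scale * cy - 0.5 * (scale - 1.0))

def offset_vector(center, scale):
    """Vector form: one subtract, then one multiply-add over both components."""
    shifted = tuple(c - 0.5 for c in center)        # center.xy - 0.5f
    return tuple(s * scale + 0.5 for s in shifted)  # offset * params.xx + 0.5f

def neighbor_vector(center):
    """One swizzle (center.xxyy) followed by one 4-wide add."""
    cx, cy = center
    swizzled = (cx, cx, cy, cy)
    return tuple(a + b for a, b in zip(swizzled, (-1.0, 1.0, -1.0, 1.0)))

def forms_agree(center, scale):
    """True when both offset forms give the same result for this input."""
    pairs = zip(offset_scalar(center, scale), offset_vector(center, scale))
    return all(math.isclose(a, b) for a, b in pairs)
```

The two forms agree because scale*c - 0.5*(scale - 1) rearranges to scale*(c - 0.5) + 0.5, which is exactly one subtract and one multiply-add.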

Data Parallel Computing: Data Layout
• Pack scalar data into RGBA in texture memory
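
A minimal sketch of the packing idea, emulated on the CPU in Python. The slide does not say which four scalars should share a texel, so grouping consecutive values (and zero-padding the tail) is an assumption here, as are the helper names.

```python
def pack_rgba(values):
    """Group a flat list of scalars into 4-component texels, zero-padding the tail."""
    padded = list(values) + [0.0] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]

def add_texels(a, b):
    """One 4-wide add per texel stands in for four scalar adds."""
    return [tuple(x + y for x, y in zip(ta, tb)) for ta, tb in zip(a, b)]

def unpack_rgba(texels, count):
    """Flatten texels back to scalars, dropping the padding."""
    flat = [component for texel in texels for component in texel]
    return flat[:count]
```

With this layout a field of N scalars needs only ceil(N/4) fragments, and each fragment does four-wide work per instruction.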

Computational Frequency
• Think of your CPU program and your vertex and fragment programs as different levels of nested looping.
– CPU Program
– Vertex Program
– Fragment Program
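
To make the loop analogy concrete, the sketch below counts how often each level runs. It assumes one screen-aligned quad per pass covering width × height pixels; that setup and the function name are illustrative, not from the slides.

```python
def count_invocations(passes, width, height, vertices_per_quad=4):
    """Count how many times each program level executes."""
    counts = {"cpu": 0, "vertex": 0, "fragment": 0}
    for _ in range(passes):                 # CPU program: once per pass
        counts["cpu"] += 1
        for _ in range(vertices_per_quad):  # vertex program: once per vertex
            counts["vertex"] += 1
        for _ in range(width * height):     # fragment program: once per covered pixel
            counts["fragment"] += 1
    return counts
```

Under these assumptions the fragment level runs orders of magnitude more often than the other two, so work moved out of it is saved once per pixel per pass.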

Computational Frequency
• Branches
– Avoid these, especially in the inner loop (i.e., the fragment program).

Computational Frequency: Avoid inner-loop branching
• Static branch resolution
– write several variants of each fragment program to handle boundary cases
– eliminates conditionals in the fragment program
– equivalent to avoiding CPU inner-loop branching
[Figure: case 1: no boundaries; case 2: accounts for boundaries]
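
The same idea on the CPU, as a hedged sketch: a 3-point average over a 1D row stands in for the real stencil (an assumption made to keep the example short), and all names are illustrative. The interior variant has no conditionals; only the two edge elements go through the variant that checks bounds.

```python
def smooth_checked(row, x):
    """Boundary variant: clamps neighbor indices, so it needs conditionals."""
    left = row[x - 1] if x > 0 else row[x]
    right = row[x + 1] if x < len(row) - 1 else row[x]
    return (left + row[x] + right) / 3.0

def smooth_interior(row, x):
    """Interior variant: both neighbors are known to exist, so no conditionals."""
    return (row[x - 1] + row[x] + row[x + 1]) / 3.0

def smooth_branchy(row):
    """Single kernel: every element pays for the boundary checks."""
    return [smooth_checked(row, x) for x in range(len(row))]

def smooth_split(row):
    """Two kernels: the caller decides once which variant covers which region."""
    n = len(row)
    if n < 3:
        return smooth_branchy(row)
    out = [0.0] * n
    for x in range(1, n - 1):   # case 1: no boundaries
        out[x] = smooth_interior(row, x)
    for x in (0, n - 1):        # case 2: accounts for boundaries
        out[x] = smooth_checked(row, x)
    return out
```

On the GPU the split would correspond to drawing the interior and the boundary as separate geometry, each bound to its own fragment program.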

Computational Frequency
• Branches
– Avoid these, especially in the inner loop (i.e., the fragment program).
– Ian’s talk will give some strategies for branching if you absolutely cannot avoid it.

Computational Frequency
• Precompute
• Precompute
• Precompute

Computational Frequency: Precomputing texcoords
• Take advantage of under-utilized hardware
– vertex processor
– rasterizer
• Reduce instruction count at the per-fragment level
• Avoid lookups being treated as texture indirections

Computational Frequency: Precomputing texcoords
vert2frag smooth(app2vert IN,
                 uniform float4x4 xform : C0,
                 uniform float2 srcoffset,
                 uniform float size)
{
    vert2frag OUT;

    OUT.position = mul(xform,IN.position);
    OUT.center = IN.center;
    OUT.redblack = IN.center - srcoffset;
    OUT.operator = size*(OUT.redblack - 0.5f) + 0.5f;
    OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f);
    OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);

    return OUT;
}

Computational Frequency: Precomputing other values
• Same deal! Factor other computations out:
– Anything that varies linearly across the geometry
– Anything that has a complex value computed per-vertex
– Anything that is uniform across the geometry
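
Why the first case is safe to factor out: a value that is linear in the interpolated coordinate can be computed at the vertices and interpolated, and the result should match computing it per fragment. The Python sketch below emulates this along one edge with plain linear interpolation (an assumption that holds for screen-aligned geometry); `lerp` and the sampling positions are illustrative, and `operator_coord` mirrors the `size*(redblack - 0.5f) + 0.5f` expression from the vertex program.

```python
import math

def operator_coord(redblack, size):
    """Linear in redblack, so it commutes with linear interpolation."""
    return size * (redblack - 0.5) + 0.5

def lerp(a, b, t):
    """Stand-in for what the rasterizer does between two vertices."""
    return a + (b - a) * t

def per_fragment(c0, c1, size, n):
    """Interpolate the coordinate, then evaluate the expression n times."""
    return [operator_coord(lerp(c0, c1, (i + 0.5) / n), size) for i in range(n)]

def per_vertex(c0, c1, size, n):
    """Evaluate the expression twice, then let interpolation do the rest."""
    v0, v1 = operator_coord(c0, size), operator_coord(c1, size)
    return [lerp(v0, v1, (i + 0.5) / n) for i in range(n)]
```

The per-vertex version evaluates the expression twice per edge instead of once per fragment; a nonlinear expression (a square root, say) would not survive this move unchanged.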

Computational Frequency: Precomputing on the CPU
• Use glMultiTexCoord4f() creatively
• Extract as much uniformity from uniform parameters as you can

Computational Frequency: Precomputed lookup tables
// Calculate Red-Black (odd-even) masks
float2 intpart;
float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));
float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
{
    ...

Computational Frequency: Precomputed lookup tables
half4 mask = f4texRECT(RedBlack, IN.redblack);
/*
 * mask.x and mask.w tell whether IN.center.x and IN.center.y
 * are both odd or both even, respectively. either of these two
 * conditions indicates that the fragment is red. params.x==1
 * selects red; params.y==1 selects black.
 */
if (dot(mask,params.xyyx))
{
    ...
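
A CPU-side sketch of the two approaches, for comparison. The slides do not give the size or addressing of the RedBlack texture, so the 2×2 parity-indexed table below is an assumption, as is the ordering of the two mixed-parity components (mask.y and mask.z); all function names are illustrative.

```python
import math

def place(coord):
    """Arithmetic form: 1 if the texel index under coord is odd, else 0."""
    frac, _ = math.modf(round(coord + 0.5) / 2.0)
    return math.floor(1.0 - frac)

def is_red_arithmetic(cx, cy):
    """Red means both indices even or both odd, computed per fragment."""
    px, py = place(cx), place(cy)
    mask = ((1 - px) * (1 - py), px * py)
    return (mask[0] + mask[1]) != 0

# Hypothetical table keyed by (x parity, y parity): one component set per entry.
RED_BLACK = {
    (1, 1): (1, 0, 0, 0),   # mask.x: both odd   -> red
    (1, 0): (0, 1, 0, 0),   # mask.y: mixed      -> black
    (0, 1): (0, 0, 1, 0),   # mask.z: mixed      -> black
    (0, 0): (0, 0, 0, 1),   # mask.w: both even  -> red
}

def selected_by_lookup(i, j, params):
    """One fetch plus one dot product with params.xyyx."""
    mask = RED_BLACK[(i % 2, j % 2)]
    px, py = params
    swizzle = (px, py, py, px)
    return sum(m * p for m, p in zip(mask, swizzle)) != 0
```

With params = (1, 0) the lookup selects the same fragments the arithmetic test calls red; with params = (0, 1) it selects the complement.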

Computational Frequency: Precomputed lookup tables
• Be careful with texture lookups – cache coherence is crucial
• Use the smallest data types you can get away with to reduce bandwidth consumption
• “Computation is cheap; memory accesses are not.” ...if you’re memory-access limited.
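
A back-of-the-envelope sketch of the data-type point, assuming every fetch goes to memory (it ignores the texture cache, so treat the numbers as an upper bound). The grid size and fetch count in the usage note are made up for illustration.

```python
import struct

FLOAT4 = "<4f"   # four 32-bit floats per texel
HALF4 = "<4e"    # four 16-bit halves per texel

def bytes_fetched(width, height, fetches_per_fragment, texel_format):
    """Bytes read in one pass if no fetch is served from cache."""
    texel_bytes = struct.calcsize(texel_format)
    return width * height * fetches_per_fragment * texel_bytes
```

For a 512×512 pass with five fetches per fragment, the half-precision layout reads half as many bytes as the full-precision one under this model.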

Profiling and Load Balancing
• Software profiling
• GPU pipeline profiling
• GPU load balancing

Profiling and Load Balancing: Software profiling
• Run a standard software profiler!
– Rational Quantify
– Intel VTune
– AMD CodeAnalyst

Profiling and Load Balancing: GPU pipeline profiling
• This is where it gets tricky.
• Some tools exist to help you:
– NVPerfHUD: http://developer.nvidia.com/docs/IO/8343/How-To-Profile.pdf
– NVShaderPerf: http://developer.nvidia.com/object/nvshaderperf_home.html
– Apple OpenGL Profiler: http://developer.apple.com/opengl/profiler_image.html

Profiling and Load Balancing: GPU load balancing
• This is a whole talk in and of itself
– e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf
• Sometimes you can get more hints from third parties than from the vendors themselves
– http://www.3dcenter.de/artikel/cinefx/index6_e.php
– http://www.3dcenter.de/artikel/nv40_technik/

Conclusions
• Get used to thinking in terms of vector computation
• Understand how frequently each computation will run, and reduce that frequency wherever possible
• Track down bottlenecks in your application, and shift work to other parts of the system that are idle

Questions?
• Acknowledgements
– Pat Brown and Nick Triantos at NVIDIA
– NVIDIA for having given me a job this summer
– Dave Luebke, my advisor
– GPGPU course presenters
– NSF Award #0092793