![Page 1: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/1.jpg)
Lecture 3:Parallelizing Pipeline Execution
(+ notes on workload)
Kayvon FatahalianCMU 15-869: Graphics and Imaging Architectures (Fall 2011)
![Page 2: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/2.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Today▪ Brief discussion of graphics workload
▪ Strategies for parallelizing the graphics pipeline
![Page 3: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/3.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
The graphics pipeline (last time)
Primitive Generation
Vertex Generation
Vertex Processing
Rasterization(Fragment Generation)
Fragment Processing
Frame-Buffer Ops
Primitive Processing
Vertices
Primitives
Fragments
Pixels Frame Buffer
Memory
1 in / 1 out
3 in / 1 out(for tris)
1 in / small N out
1 in / N out
1 in / 1 out Uniformdata
Texturebuffers
Uniformdata
Texturebuffers
Uniformdata
Texturebuffers
1 in / 0 or 1 out
![Page 4: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/4.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Programming the pipeline (last time)▪ Issue draw commands frame-buffer contents change
Bind shaders, textures, uniformsDraw using vertex buffer for object 1Bind new uniformsDraw using vertex buffer for object 2 Bind new shaderDraw using vertex buffer for object 3
CommandCommand Type
State change
Change depth test function Bind new shader Draw using vertex buffer for object 4
DrawState changeDrawState changeDrawState changeState changeDraw
Note: efficiently managing stage changes is a major challenge in implementations
![Page 5: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/5.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Where is the work?
Primitive Generation
Vertex Generation
Vertex Processing
Rasterization(Fragment Generation)
Fragment Processing
Frame-Buffer Ops
Primitive Processing
Vertices
Primitives
Fragments
Pixels Frame Buffer
Memory
1 in / 1 out
3 in / 1 out(for tris)
1 in / small N out
1 in / N out
1 in / 1 out Uniformdata
Texturebuffers
Uniformdata
Texturebuffers
Uniformdata
Texturebuffers
1 in / 0 or 1 out
![Page 6: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/6.jpg)
Triangle size
[0-1] [1-5] [5-10] [10-20] [20-30] [30-40] [40-50] [50-60] [60-70] [70-80] [80-90] [90-100] [> 100]
30
20
10
0
Perce
ntag
e of t
otal
tria
ngle
s
Triangle area (pixels)
[source: NVIDIA] Note: tessellation is triggering a reduction in triangle size
![Page 7: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/7.jpg)
Credit: Pro Evolution Soccer 2010 (Konami)
![Page 8: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/8.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Fine Primitive Generation
Vertex Generation
Vertex Processing
Rasterization(Fragment Generation)
Fragment Processing
Frame-Buffer Ops
Fine Primitive Processing
Coarse Vertices
Fine Primitives
Fragments
Pixels
1 in / 1 out
3 in / 1 out(for tris)
1 in / small N out
1 in / N out
1 in / 1 out
1 in / 0 or 1 out
Fine Vertex Processing
TessellationFine Vertices
Coarse Primitive ProcessingCoarse Primitives1 in / 1 out
1 in / 1 out
1 in / N out
Amount of data
Compact model
High-resolution mesh
screen-spacefragments
Frame buffer pixels
“Diamond” structure of graphics workload
![Page 9: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/9.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Key workload metrics▪ Data ampli#cation
- Triangle size
- Expansion by geometry shader (if enabled)
- Tessellation factor (if enabled)
▪ [Vertex/fragment] program cost
▪ Depth Complexity
- Determines number of z/color buffer writes
![Page 10: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/10.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Scene depth complexity
Loose approximation: TA = SDT = # trianglesA = average triangle areaS = pixels on screenD = average depth complexity
[Imagination Technologies]
![Page 11: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/11.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Pipeline workload changes rapidly▪ Triangle size scene and frame dependent
- Even object dependent within a frame (characters: higher res meshes)
▪ Varying complexity of materials, different number of lights illuminating surfaces
- No “average” shader- Tens to several hundreds of instructions per shader
▪ Shadow map creation
- NULL fragment shader
▪ Screen post-processing
- Two triangles cover screen(~ no vertex work)
▪ Recall: thousands of draw calls per frame [NVIDIA]
![Page 12: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/12.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Parallelization
Some slides credit Kurt Akeley and Pat Hanrahan (Stanford CS448 Spring 2007)
![Page 13: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/13.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Remember our workload▪ Immediate mode interface: accepts sequence of commands
- draw commands- state modi#cation commands
▪ Processing of commands has sequential semantics
- Effects of command A visible before those of command B
▪ Relative cost of pipeline stages changes frequently and unpredictably (e.g., triangle size)
▪ Ample opportunities for parallelism
- few dependencies (most notable: order, frame-buffer update)
![Page 14: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/14.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Parallelism and communication▪ Parallelism - using multiple execution units to process work in parallel
▪ Communication - connecting the execution units allowing work to be distributed and aggregated(note: consider synchronization a form of communication)
▪ Issues:
- Scalability:- Computation- Bandwidth- Load-balancing
- Dependencies (ordering semantics)- Work efficiency
![Page 15: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/15.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Opportunities for parallelism in graphics▪ Data parallelism
- Simultaneously execute same operation on different data
- Object space (vertices, primitives, etc.)
- Image space (fragments, pixels)
▪ Task parallelism
- Simultaneously execute different tasks on similar (or different) data
- Vertex processing, rasterization, fragment processing
Note: many redundancies in the pipeline: optimizations exploiting these redundancies can create dependencies that reduce opportunities of parallelism
![Page 16: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/16.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Simple parallelization (pipelined)
Primitive Generation
Vertex Generation
Vertex Processing
Rasterization(Fragment Generation)
Fragment Processing
Frame-Buffer Ops
Primitive Processing
Separate hardware unit for each stage
Speedup?
![Page 17: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/17.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Simpli#ed pipeline
Primitive Generation
Vertex Generation
Vertex Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Primitive Processing
Geometry
Application
Display
For now: just consider all geometry processing work (vertex/primitive processing, tessellation, etc.) as “geometry” processing.
![Page 18: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/18.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Simpli#ed pipeline
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Application
Display
Geometry Processing
![Page 19: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/19.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Scaling “wide”
Application
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
![Page 20: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/20.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sorting taxonomy
Application
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort #rst
Sort middle
Sort last fragment
Sort last image composition
![Page 21: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/21.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort #rst
![Page 22: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/22.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort #rstApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Assign each hardware pipeline a region of the render targetDo minimal amount of work to determine which region(s) input primitive overlaps
Sort!
![Page 23: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/23.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort #rst
▪ Good:- Bandwidth scaling (small amount of sync/communication, simple point-to-point)- Computation scaling- Simple: just replicate rendering pipeline (order maintained within each)- Easy early #ne occlusion cull (“early z”)
Application
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort!
![Page 24: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/24.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort #rst
▪ Bad:- Potential for workload imbalance (one part of screen contains most of scene)- Extra cost of “pre-transformation”- Tile spread: as screen tiles get smaller, primitives cover more tiles
(duplicate geometry processing)
Application
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort!
![Page 25: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/25.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort #rst examples▪ WireGL/Chromium** (parallel rendering with a cluster of GPUs)- “front-end” sorts primitives- each GPU is a full rendering pipeline
▪ Pixar RenderMan (implementation of REYES)- Multi-core software implementation- Sort surfaces into tiles prior to tessellation
(sort the surfaces, not all the little “micropolygons” )
** Chromium can also be con#gured as a sort-last system
![Page 26: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/26.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle
![Page 27: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/27.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middleApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Assign each rasterizer a region of the render targetDistribute primitives to top of pipelines (e.g., round robin)Sort after geometry processing based on screen space projection of primitive vertices
Sort!
Distribute
![Page 28: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/28.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Interleaved mapping of screen▪ Decrease chance of one rasterizer processing most of scene▪ Most triangles overlap multiple screen regions
![Page 29: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/29.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle interleavedApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort! - BROADCAST
Distribute
▪ Good:- Workload balance: both for geometry work AND onto rasterizers- Computation scaling- Easy #ne early occlusion cull- Does not duplicate geometry processing for each overlapped screen region
![Page 30: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/30.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle interleavedApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort! - BROADCAST
Distribute
▪ Bad:- Bandwidth scaling: sort implemented as a broadcast
(each triangle goes to many/all rasterizers)- If tessellation enabled, must communicate many more primitives than sort #rst
![Page 31: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/31.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
SGI RealityEngine [Akeley 93]
Sort-middle interleaved
![Page 32: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/32.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle tiled▪ Sort no longer requires broadcast
- Point-to-point communication
- Better bandwidth scaling
▪ Risks workload imbalance amongst rasterizers
![Page 33: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/33.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle tiled (chunked)Application
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Sort!
Distribute
bucket 0 ...bucket
1bucket
2bucket
3bucket
NBuckets stored in off-chip
memory
Partition screen into many small tiles (many more tiles than rasterizers)Sort geometry by tile into off-chip buckets. After all geometry complete, rasterizers process buckets (think work queue)
![Page 34: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/34.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort middle tiled (chunked)▪ Inserts frame of delay
- Cannot begin rasterization until geometry processing completes (order)
▪ Requires off-chip storage of immediate data
▪ Good:
- Sort approaches point to point traffic
- Good load balance
- Low bandwidth requirements (why?)
▪ Recent examples: Intel Larrabee, NVIDIA CUDA rasterizer, many mobile GPUs
![Page 35: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/35.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last
![Page 36: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/36.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last fragmentApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Distribute primitives to top of pipelines (e.g., round robin)Sort after fragment processing based on (x,y) position of fragment
Distribute
Sort! - point-to-point
![Page 37: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/37.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last fragmentApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Distribute
Sort! - point-to-point
▪ Good:- No redundant work (geometry processing or in rast)- Point-to-point communication during sort- Interleaved pixel mapping results in good workload balance for frame-buffer ops
![Page 38: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/38.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last fragmentApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Distribute
Sort! - point-to-point
▪ Bad:- Workload imbalance due to primitives of varying size- Bandwidth scaling: many more fragments than triangles - Hard to implement early occlusion cull (more bandwidth challenges)
![Page 39: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/39.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last image compositionApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Distribute
Each pipeline renders some part of the frame (color buffer + depth buffer)Combine the color buffers, according to depth into the #nal image
frame buffer 0 frame buffer 1 frame buffer 3 frame buffer 4
Merge
![Page 40: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/40.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last image composition
![Page 41: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/41.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort last image composition▪ Cannot maintain order
▪ Simple: N separate rendering pipelines
- Can use off the shelf GPUs
- Coarse-grained communication
▪ Similar load imbalance problems as sort-last fragment
▪ Bandwidth requirements compared to sort-last fragment depend on scene depth complexity
![Page 42: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/42.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Sort everywhere
![Page 43: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/43.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Redistribute- point-to-point
PomegranateApplication
Display
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Command Processing
Rasterization
Fragment Processing
Frame-Buffer Ops
Geometry Processing
Distribute primitives to top of pipelinesRedistribute after geometry processing (e.g, round robin)Sort after fragment processing based on (x,y) position of fragment
Distribute
Sort! - point-to-point
[Eldridge 00]
![Page 44: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/44.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Fine Primitive Generation
Vertex Generation
Vertex Processing
Rasterization(Fragment Generation)
Fragment Processing
Frame-Buffer Ops
Fine Primitive Processing
Coarse Vertices
Fine Primitives
Fragments
Pixels
1 in / 1 out
3 in / 1 out(for tris)
1 in / small N out
1 in / N out
1 in / 1 out
1 in / 0 or 1 out
Fine Vertex Processing
TessellationFine Vertices
Coarse Primitive ProcessingCoarse Primitives1 in / 1 out
1 in / 1 out
1 in / N out
Recall: modern OpenGL 4/Direct3D 11 pipeline5 programmable stages
Tessellation
Programmable stages with data-dependent control $ow (varying per vertex/per fragment run-time)
![Page 45: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/45.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Modern NVIDIA, AMD, Intel GPUsCmd Processor /Vertex Generation
Programmable Core
Texture
Programmable Core
Programmable Core
Programmable Core
Programmable Core
Texture
Programmable Core
Programmable Core
Programmable Core
Programmable Core
Texture
Programmable Core
Programmable Core
Programmable Core
Programmable Core
Texture
Programmable Core
Programmable Core
Programmable Core
Rasterizer Rasterizer
Rasterizer Rasterizer
Frame Buffer Ops
Frame Buffer Ops
Frame Buffer Ops
Frame Buffer Ops
Frame Buffer Ops
Frame Buffer Ops
Hardware is a heterogeneous collection of resourcesProgrammable resources are time-shared by vertex/primitive/fragment processing workMust keep programmable cores busy: sort everywhere
High-speed Interconnect
![Page 46: Lecture 3: Parallelizing Pipeline Execution...-each GPU is a full rendering pipeline Pixar RenderMan (implementation of REYES)-Multi-core software implementation-Sort surfaces into](https://reader036.vdocuments.net/reader036/viewer/2022081615/5fd3c369c2a37653224e257c/html5/thumbnails/46.jpg)
Kayvon Fatahalian, Graphics and Imaging Architectures (CMU 15-869, Fall 2011)
Readings▪ Molnar et al. A Sorting Classi#cation of Parallel Rendering. IEEE Graphics and
Applications 1994
▪ Eldridge et al. Pomegranate: A Fully Scalable Graphics Architecture. SIGGRAPH 2000