graphics on a stream processor peter djeu march 20, 2003

53
Graphics on a Stream Processor Peter Djeu March 20, 2003

Upload: jose-ross

Post on 26-Mar-2015

222 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Graphics on a Stream Processor Peter Djeu March 20, 2003

Graphics on a Stream Processor

Peter Djeu

March 20, 2003

Page 2: Graphics on a Stream Processor Peter Djeu March 20, 2003

Polygon Rendering on a Stream Architecture

Owens, Dally, Kapasi, Rixner, Mattson, Mowery

Stanford Computer Sciences Lab

Page 3: Graphics on a Stream Processor Peter Djeu March 20, 2003

Goals• Create a working version of OpenGL on the

Imagine Stream Processor– Why Imagine? Imagine is programmable.

Contemporary graphics cards were not.– Use programmability and flexibility to head

towards the goal of 80 million poly’s / frame.– The trend is more flexibility: over 200 proposed

extensions to OpenGL, people already used complex software rendering (like multipass hacks e.x: shadows)

Page 4: Graphics on a Stream Processor Peter Djeu March 20, 2003

Imagine Review 1

Page 5: Graphics on a Stream Processor Peter Djeu March 20, 2003

Imagine Review 2

Page 6: Graphics on a Stream Processor Peter Djeu March 20, 2003

Basic Definitions

• Streams are sets of data elements. All elements are a single data type. – Index streams

• Kernels are pieces of code that operate on streams. They take a stream as input and produce a stream as output. Kernels can be chained together.

Page 7: Graphics on a Stream Processor Peter Djeu March 20, 2003

Basic Definitions (page 2)

• Instruction-Level Parallelism - issuing independent instructions in the same cycle (ex: 6 functional units in an ALU cluster, VLIW)

• Data-Level Parallelism - performing the same operation on multiple pieces of data (ex: 8 ALU clusters operating on a single stream, vector computing)

Page 8: Graphics on a Stream Processor Peter Djeu March 20, 2003

Basic Definitions (page 3)

• Produce-Consumer Locality– Occurs when one component of a system is

producing something that is immediately consumed by another component of the system.

– In the polygon rendering pipeline, occurs when one kernel produces a stream that is immediately consumed by another kernel

– the Stream Register File (SRF) and local registers exploit producer-consumer locality

Page 9: Graphics on a Stream Processor Peter Djeu March 20, 2003

Why are Streams Good for Graphics?

• Rendering is inherently parallel– triangles can be processed in parallel, but may

need to be ordered when drawn to the screen

• No need for caching, primitives are rendered and then discarded (not textures)

• Latency not as important with streams. Stream proc’s emphasize max throughput.

• VLSI has allowed us to build stream proc’s.

Page 10: Graphics on a Stream Processor Peter Djeu March 20, 2003

Homogenous Streams

• Kernels are most efficient if all elements in the stream require identical operations (less complexity and no stalls from conditionals).

• Use conditional streams to separate heterogeneous streams into homogenous.

• Ex: backface culling (forward, backward)

Page 11: Graphics on a Stream Processor Peter Djeu March 20, 2003

Benefits of Using Imagine• Memory hierarchy

– local registers are great for ALU’s; SRF is a big cache to exploit producer-consumer locality

• SIMD architecture exploits parallelism and homogeneity in streams– One kernel at a time, and it is run on all 8 ALU

clusters (data-level parallelism on streams)

• Granularity on the stream level, so less instruction issues, less bandwidth needed.

Page 12: Graphics on a Stream Processor Peter Djeu March 20, 2003

3 Stage Pipeline

• Geometry - generates triangles

• Rasterization - generates fragments

• Composite - generates pixels

See figure: each node is a kernel,each edge is a stream

Page 13: Graphics on a Stream Processor Peter Djeu March 20, 2003

Diagram of General Polygon Pipeline

Page 14: Graphics on a Stream Processor Peter Djeu March 20, 2003

Load Balancing (A Problem)

• Large triangles take longer to be rasterized, potentially slowing down the pipeline.

• Sol’n: every cycle, ALU clusters fetch if they are idle and keep processing otherwise

• Sol’n may not be good for all situations, since we incur fetch cost even if no one really needs to fetch (ex: when triangles are all large but roughly equal in size)

Page 15: Graphics on a Stream Processor Peter Djeu March 20, 2003

Ordering (A Problem)

• OpenGL requires in-order completion– What situations require in-order completion? – What situations do not require in-order comp.?

• Can we solve this ordering problem without having to sort? For instance, ask user to label triangles with priority. Or assign priority based on orig. order and use a second Z-buffer test using the priority val.

Page 16: Graphics on a Stream Processor Peter Djeu March 20, 2003

Ordering (A Problem) (page 2)

• Paper’s sol’n: Create 2 streams, then concatenate ordered onto unordered– Assign in-order ID to each triangle in the

stream (now extra ID value is in the stream)– Since overlaps are rare, use a hash func. to find

when two fragments overlap on the screen– Resolve overlaps by using the ID number– Unclear: how did they implement a hash table

with 2-bits per hash entry (32 * 8 words total)?

Page 17: Graphics on a Stream Processor Peter Djeu March 20, 2003

Consumer-Producer Locality(An Advantage)

• Break up the entire input into batches, then make sure the batches fit inside the SRF

• Trips to main memory can really be reduced

• But Owens et al. batch their input before they beginning processing-- is this fair? Dynamic batching may not be as effective as their hand-crafted batches.

Page 18: Graphics on a Stream Processor Peter Djeu March 20, 2003

Producer ConsumerLocality Using the Stream RegisterFile (SRF)

Page 19: Graphics on a Stream Processor Peter Djeu March 20, 2003

Other Advantages to the Imagine Implementation

• Latency Tolerance - ALU clusters process the current batch while the memory system fetches the next batch

• Pipeline reordering - ex: turn off blending and move the texturing kernel after the depth test kernel, cull non-visible fragments

• Flexible resource allocation - hardware is reused at each stage, no a priori allocation

Page 20: Graphics on a Stream Processor Peter Djeu March 20, 2003

4 Testing Platforms

• Software - no hardware acceleration

• Imagine

• Nvidia Quadro (real)

• Nvidia Quadro (ideal) - a scaling of the Quadro’s real-world performance to its advertised peak performance– is this a good idea or a bad idea?

Page 21: Graphics on a Stream Processor Peter Djeu March 20, 2003

Frame Rates for the 4 Scenes

Page 22: Graphics on a Stream Processor Peter Djeu March 20, 2003

4 Test Scenes and Batch Sizes

• Sphere - 81920 total triangles, B: 80 triangles• Advs-1 - point-sampled 512 x 512 texture

mapped, B: 256 vertices• Advs-8 - mipmapped 512 x 512 base level texture,

B: 24 vertices• Fill - fills entire window, 512 x 512 texture map,

B: 16 vertices

Why is there such a dramatic drop in batchsize between Advs-1 and Advs-8?

Page 23: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results

• Low (Memory to SRF traffic), as compared to (SRF to local register files traffic) means current memory hierarchy is being well utilized (i.e. hardware is exploiting producer-consumer locality)

• Hash function produces too many false conflicts, leading to too much computation (see Fig. 9). Future work: better hash func.

Page 24: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results (page 2)

• Not enough ALU’s - ADVS-8 needed ~2.5 times the ALU ops than ADVS-1, but performance dropped > 50% (however, batch size also dropped significantly)

• Batch size is often too small to be efficient– startup costs, paid per batch– priming and draining inner loops (sub-optimal

saturation as loops are starting/winding down?)

Page 25: Graphics on a Stream Processor Peter Djeu March 20, 2003

Measure of Relative Performance (Logarithmic)

Page 26: Graphics on a Stream Processor Peter Djeu March 20, 2003

Batch Processing, Clearing Z Buffer in Background

Page 27: Graphics on a Stream Processor Peter Djeu March 20, 2003

Future Work via Hardware Extensions

• Problem: The Imagine hardware cannot currently compete with the powerful rasterizers in Graphics Cards

• Sol’n: Add hardware that is dedicated to Graphics and polygon rendering, such as:– texture cache– ALU in memory system for doing on-the-spot

depth test, alpha blend, filtering

Page 28: Graphics on a Stream Processor Peter Djeu March 20, 2003

Final Impressions• Impressive that they got the OpenGL pipeline

working on a general-purpose stream processor• Not enough ALU power, and things may not

improve with more hardware (ordering and communication constraints may still bottleneck)

• Programmability is good, but not worth it if the overall performance on existing applications is worse

• Limited size SRF very limiting when triangle size (and hence generated fragments) is unbounded

Page 29: Graphics on a Stream Processor Peter Djeu March 20, 2003

Comparing Reyes and OpenGL on a Stream Architecture

Owens, Khailany, Towles, Dally

Stanford Computer Sciences Lab

Page 30: Graphics on a Stream Processor Peter Djeu March 20, 2003

Motivation

• Recent trends in graphics:– Smaller triangles in modeling– Demand for more complex shading– Memory bandwidth becoming more scarce

• The Reyes Rendering Pipeline addresses these issues.

Page 31: Graphics on a Stream Processor Peter Djeu March 20, 2003

OpenGL Pipeline Stages

• Transformations and vertex ops - 1st pass for shading (interpolation, light, glColor)

• Assemble / Clip / Project - view frustum

• Rasterize - triangles become fragments

• Fragment ops - 2nd pass for shading (texturing, blending)

• Visibility / Filter - depth buffer, composite

Page 32: Graphics on a Stream Processor Peter Djeu March 20, 2003

Reyes Rendering Pipeline

• Dice / Split - primitives are subdivided into micropolygons

• Shade - shading done in eye space (in the paper, a screen space calculation is used)

• Sample - projection to screen space

• Visibility / Filter - depth-buffer, composite

Page 33: Graphics on a Stream Processor Peter Djeu March 20, 2003

The Reyes andOpenGL PipelineStages

Page 34: Graphics on a Stream Processor Peter Djeu March 20, 2003

Differences Between OpenGL and Reyes

OpenGL Reyes

2 Shaders 1 Shader

No coherent memory accesses

Coherent memory accesses (CAT's)

Unbounded primitives (triangles)

Bounded primitives (micropolygons)

Page 35: Graphics on a Stream Processor Peter Djeu March 20, 2003

Why are We Using Imagine? (again?)

• Imagine is parallel, exploits producer-consumer locality

• Imagine is not specifically designed for either pipeline so it is a good platform to use to compare and contrast 2 pipelines

• Better than OpenGL optimized graphics cards because it is not so specialized

Page 36: Graphics on a Stream Processor Peter Djeu March 20, 2003

Stanford Real-Time Shading Language (RTSL)

• High level language for writing shader programs for targeted hardware (Imagine)

• For OpenGL: Generates code for a Vertex Program and a Fragment Program

• For Reyes: Generates code for a Vertex Program (4 vertices to a micropolygon)

Page 37: Graphics on a Stream Processor Peter Djeu March 20, 2003

Subdividing Micropolygons

• Primitives start out as a collection of B-Spline Control Points

• 4 control points are treated as 1 quad

• If all edges of a quad are less than a threshold length, than return the the quad

• Else subdivide the quad into 4 quads and repeat the test.

Page 38: Graphics on a Stream Processor Peter Djeu March 20, 2003

Fixing Subdivision Cracks

• Store equations of edges rather than the coordinates of vertices

• Each edge is finalized immediately when it falls under the acceptable tolerance length, even if adjoining edges are still too long

• Edge equations are extended to fill in cracks created by Catmull-Clark subdivision

Page 39: Graphics on a Stream Processor Peter Djeu March 20, 2003

Subdivision Algorithm that Avoids Cracking

Page 40: Graphics on a Stream Processor Peter Djeu March 20, 2003

The Rest of the Reyes Pipeline• Shading - Perform all shading in this stage. Use

Coherent Access Textures if possible:– requires power-of-two sized diced subdivision– good: no texture filtering due to alignment– bad: can result in ~ twice as many

micropolygons• Sampling - Better load balancing, primitives are

bounded and flat shaded, unlike OpenGL• Composite and Filter - Identical to OpenGL

Page 41: Graphics on a Stream Processor Peter Djeu March 20, 2003

Experimental Testing

• Isim - Full Imagine simulator used to test OpenGL

• Idebug - Faster Imagine simulator (lack kernel stalls or cluster occupancy effects) used to test Reyes

Page 42: Graphics on a Stream Processor Peter Djeu March 20, 2003

Problems with the Experimental Testing System

• The paper is unclear on why it chose to test Reyes on Idebug rather than Isim.

• The application of the +20% rule may be faulty here since it is not shown Reyes is an “average” application.

• Extra scratchpad memory, more microcode store, and more local registers were added to Reyes. Were they added to OpenGL?

Page 43: Graphics on a Stream Processor Peter Djeu March 20, 2003

Test Scenes

• Teapot-N - N is either 20, 30, 40, or 64. Contains 2N2 triangles and 3 light sources

• Pin-1 - 5 textures, point sampled textures

• Pin-8 - 5 textures, mipmapped textures

• Armadillo - Complex marble procedural shader (1200 flops / fragment!)

Page 44: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results

• Scenes rendered by OpenGL have a much higher frame rate than scenes rendered by Reyes (an order of magnitude difference)

Page 45: Graphics on a Stream Processor Peter Djeu March 20, 2003

Frame Rate of OpenGL vs. Reyes (logarithmic)

Page 46: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results (page 2)

• Subdivision (in the geometry stage) is consuming most of the processing time. Recall this is a stage unique to Reyes.

• Even excluding the subdivision stage, Reyes is still twice as slow as OpenGL.– The next largest bulk of total execution time is

the vertex processor. Since it is now doing both forms of shading, it is reasonable that it will take longer.

Page 47: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results (page 3)

• Too many 0-coverage quads are generated. These quads usually have extreme aspect ratios and cover no pixels at all in screen space.

• Cause: 2-way split at each subdivision iteration, producing 4 child quads

• Sol’n: Use a 1-way split at each subdivision iteration, producing 2 child quads instead.

Page 48: Graphics on a Stream Processor Peter Djeu March 20, 2003

Work Distribution in OpenGL and Reyes

Page 49: Graphics on a Stream Processor Peter Djeu March 20, 2003

Results (page 4)

• Small triangles bog down OpenGL– Less opportunity to interpolate on small

triangles, so full shading computation needs to be used more often

– High triangle volume means that the start-up cost of the rasterizer is incurred more often

Page 50: Graphics on a Stream Processor Peter Djeu March 20, 2003

OpenGL Runtime as a Function of Triangle Size

Page 51: Graphics on a Stream Processor Peter Djeu March 20, 2003

Future Work

• Use a better subdivision algorithm (more adaptive, more efficient, artifact free, with high performance)

• Add more ALU clusters

• Experiment with hybrid systems that are based on Imagine but contain more dedicated, specialized hardware

Page 52: Graphics on a Stream Processor Peter Djeu March 20, 2003

Conclusions

• Although Reyes implementation was slow, current trends in computer graphics (smaller triangles, complex shading, memory bandwidth limits) continue to make Reyes and similar pipelines attractive

Page 53: Graphics on a Stream Processor Peter Djeu March 20, 2003

Final Impressions

• Both pipelines were implemented by Owens et al., which is good for quasi-standardization

• Like the first paper, this paper wanted to add specialized hardware, so we may be at the limit of Imagine’s graphical power, as it was initially designed.

• Good observations / ideas, but not very convincing experimental evidence.

• Goal of comparing Reyes and OpenGL was not achieved.