Mass Market Applications of Massively Parallel Computing
Chas. Boyd
Outline
• Projections of future hardware
• The client computing space
• Mass-market parallel applications
• Common application characteristics
• Interesting processor features
The Physics of Silicon
• The way processors get faster has fundamentally changed
• No more free performance gains due to clock rate and Instruction-Level Parallelism
• Yet gates-per-die continues to grow
• Possibly faster now that clock rate isn’t an issue
• Estimate: doubling every 2-2.5 years
• New area means more cores and caches
• In-order core counts may grow faster than Out-of-Order core counts do
The Old Story
A Surplus of Cores
• ‘More cores than we know what to do with’
• Literally
• Servers scale with transaction counts
• Technical Computing
• history of dealing with parallel workloads
• What are the opportunities for the PC client?
• Are there mass market applications that are parallelizable?
Requirements of Mass Market Space
• Fairly easy to program and maintain
• Cannot break on future hardware or operating systems
• Transparent backward and forward compatibility
• Mass market customers hate regressions!
• Consumer software must operate for decades
• Must get faster automatically
• Why we are here
AMD Term:
• Personal Stream Computing
• Actually nothing like ‘stream processing’ as used by Stanford Brook, etc.
Data-Parallel Processing
• Key technique, how do we apply it to consumers?
• What takes lots of data?
• Media, pixels, audio samples
• Video, imaging, audio
• Games
Video
• Decode, encode, transcode
• Motion Estimation, DCT, Quantization
• Effects
• Anything you would want to do to an image
• Scaling, sepia, DVE effects (transitions)
• Indexing
• Search/Recognition - convolutions
Imaging
• Demosaicing
• Extract colors with knowledge of sensor layout
• Segmentation
• Identify areas of image to process
• Cleanup
• Color correction, noise removal, etc.
• Indexing
• Identify areas for tagging
User Interaction with Media
• Client applications can/should be interactive
• Mass market wants full automation
• ‘Pro-sumer’ wants some options to participate, but with real-time feedback (20+ fps) on 16 GPixel images
• Automating media processing requires analysis
• Recognition, segmentation, image understanding
• Is this image outdoors or inside?
• Is this image right-side up?
• Does it contain faces?
• Are their eyes red?
Imaging Markets
• In some sense, the broader the market, the more sophisticated the algorithm required
• Although pro-sumers care more about performance, and they are the ones that write the reviews
FFT Performance
Game Applications of Mass Parallel
• Rendering
• Imaging
• Physics
• IK
• AI
Ultima Underworld 1993
Dark Messiah 2007
Game Rendering
• Well established at this point, but new techniques keep being discovered
• Rendering different terms at different spatial scales
• E.g. irradiance can be computed more sparsely than exit radiance, enabling large increases in the number of incident light sources considered
• Spherical harmonic coefficient manipulations
Game Imaging
• Post processing
• Reduction (histogram or single average value)
• Exposure estimation based on log average luminance
• Exposure correction
• Oversaturation extraction
• Large blurs (proportional to screen size)
• Depth of field
• Motion blur
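The exposure-estimation step listed above is a reduction: every pixel contributes to one scalar. The following is a minimal sketch rather than anything taken from the slides. The Rec. 709 luma weights, the `delta` offset, the 0.18 key value, and both function names are assumptions borrowed from common tone-mapping practice.

```python
import math

def log_average_luminance(pixels, delta=1e-4):
    """Reduce a list of (r, g, b) pixels to one log-average luminance value."""
    total = 0.0
    for r, g, b in pixels:
        # Rec. 709 luma weights (an assumption; the slides name no weights).
        lum = 0.2126 * r + 0.7152 * g + 0.0722 * b
        # delta keeps log() finite for pure black pixels.
        total += math.log(delta + lum)
    return math.exp(total / len(pixels))

def exposure_scale(pixels, key=0.18):
    """Scale factor that maps the log-average luminance to a mid-grey key."""
    return key / log_average_luminance(pixels)
```

The loop is written serially for clarity. Because the sum of logs can be split into independent partial sums, the same computation can be arranged as a tree of partial reductions on a data-parallel processor.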
Half-Life 2
Game Physics
• Particles - non-interacting
• Particles - interacting
• Rigid bodies
• Deformable bodies
• Etc.
Game Processor Evolution
[Diagram labels: Content Creation Process, Game Stack; Texture Creation, Mesh Modeling, Physics, AI, Animation, Vertex Shader, Pixel Shader; Offline, CPU, GPU, Real Time]
Common Properties of Mass Apps
• Results of client computations are displayed at interactive rates
• Fundamental requirement of client systems
• Tight coupling with graphics is optimal
• Physical proximity to renderer is beneficial
• Smaller data types are key
Support for Image Data Types
• Pixels, texels, motion vectors, etc.
• Image data more important than float32s
• Data declines in size as importance drops
• Bytes, words, fp11, fp16, single, double
• Bytes may be declining in importance
• Hardware support for formatting is useful
• Clock cycles required by shift/or/mul, etc. cost too much power
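To make the formatting cost concrete, here is a rough sketch of packing four float channels into one 32-bit RGBA8 pixel in software. The function names and the RGBA byte order are assumptions chosen for illustration. The point is only that each pixel costs several multiply, clamp, shift, and OR steps when the hardware does not convert formats itself.

```python
def pack_rgba8(r, g, b, a):
    """Pack four floats in [0, 1] into one 32-bit integer, 8 bits per channel."""
    def to_byte(x):
        # Clamp, scale (the 'mul'), and round to the nearest integer.
        return int(min(max(x, 0.0), 1.0) * 255.0 + 0.5)
    # One shift and one OR per channel.
    return (to_byte(r) << 24) | (to_byte(g) << 16) | (to_byte(b) << 8) | to_byte(a)

def unpack_rgba8(p):
    """Inverse of pack_rgba8: shift and mask each channel back to a float."""
    return tuple(((p >> s) & 0xFF) / 255.0 for s in (24, 16, 8, 0))
```

For example, `pack_rgba8(1.0, 0.0, 0.5, 1.0)` yields `0xFF0080FF`, and out-of-range inputs are clamped before packing.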
I/O Considerations
• Most computations other than 3-D rendering are i/o bound on GPUs
• Their arithmetic intensity is lower than what GPUs are built for
• Convolutions
• Support for efficient data types is very important
GPU Arithmetic Intensity Projection
• 2-3 more process doublings before new memory technologies will help much
• Stacked die?, 2k wide bus?, optical?
• Estimate at least 4x increase in number of compute instructions per read operation
• Arithmetic intensities reach 64??
• This is fine for 3-D rendering
• Other data-parallel apps need more i/o
I/O Patterns
• Solutions will have a variety of mechanisms to help with worsening i/o constraints
• Data re-use (at cache size scales) is relatively rare in media applications
• Read-write use of memory is rare
• Read-write caches are less critical
• Streaming data behavior is sufficient
• Read contention and write contention are the issue, not read-after-write scenarios
Interesting Techniques
• Shared registers
• Possibly interesting to help with i/o bandwidth
• Reducing on-chip bandwidth may help power/heat
• Scatter
• Can be useful in scenarios that don’t thrash output subsystem
• Can reduce pressure on gather input system
Convolution
• Key element of almost all image and video processing operations
• Scaling, glows, blurs, search, segmentation
• Algorithm has very low arithmetic intensity
• 1 MAD per sample
• Also has huge re-use (order of kernel size)
• Shared registers should reduce memory reads by a factor of the kernel size, raising effective arithmetic intensity
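The re-use argument above can be checked with a small counting experiment. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the kernel is applied without flipping (identical to convolution for symmetric kernels), and a Python `deque` stands in for shared registers. Both versions perform the same multiply-adds; only the number of memory reads differs.

```python
from collections import deque

def convolve_naive(signal, kernel):
    """Valid-mode 1-D filter; every multiply-add fetches its sample from memory."""
    k = len(kernel)
    out, reads = [], 0
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]   # 1 MAD, 1 memory read
            reads += 1
        out.append(acc)
    return out, reads

def convolve_shared(signal, kernel):
    """Same result, but each sample is read once into a sliding window."""
    k = len(kernel)
    window = deque(maxlen=k)   # stand-in for shared registers (an assumption)
    out, reads = [], 0
    for sample in signal:
        window.append(sample)  # the only memory read for this sample
        reads += 1
        if len(window) == k:
            out.append(sum(w * c for w, c in zip(window, kernel)))
    return out, reads
```

With a 10-sample signal and a 3-tap kernel, the naive version issues 24 reads and the windowed version 10. For long signals the ratio approaches the kernel size.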
Processor Core Types
• Heterogeneous Many-core
• In-Order vs. Out-of-Order
• Distinction arose from targeting 2 different workload design points
• Software can select ideal core type for each algorithm (workload design point)
• Since not all cores can be powered anyway
• Hardware can make trade-offs on:
• Power, area, performance growth rate
Workloads
[Chart: CPUs and GPUs placed on two axes, memory access pattern (Local Memory Accesses to Streaming Memory Access) and Code Branchiness]
Workload Differences
General Processing
• Small batches
• Frequent branches
• Many data inter-dependencies
• Scalar ops
• Vector ops
Media Processing
• Large batches
• Few branches
• Few data inter-dependencies
• Scalar ops
• Vector ops
Lesson from GPGPU Research
• Many important tasks have data-parallel implementations
• Typically requires a new algorithm
• May be just as maintainable
• Definitely more scalable with core count
APIs Must Hide Implementations
• Implementation attributes must be hidden from apps to enable scaling over time
• Number of cores operating
• Number of registers available
• Number of i/o ports
• Sizes of caches
• Thread scheduling policies
• Otherwise, these cannot change, and performance will not grow
Order of Thread Execution
• Shared registers and scatter share a pitfall:
• It may be possible to write code that is dependent on the order of thread execution
• This violates scaling requirement
• The order of thread execution may vary from run-to-run (each frame)
• Will certainly vary between implementations
• Cross-vendor and within vendor product lines
• All such code is considered incorrect
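The pitfall can be reproduced in a few lines. This is a hypothetical model, not code from the talk: each "thread" is an `(index, value)` pair, and the `order` list stands in for whatever schedule the hardware happens to choose. A plain overwriting scatter gives different answers under different schedules, while an additive scatter on integers does not, because integer addition commutes.

```python
def scatter_overwrite(items, size, order):
    """Order-dependent: when two threads hit one slot, the last writer wins."""
    out = [0] * size
    for t in order:               # 'order' models the thread schedule
        index, value = items[t]
        out[index] = value
    return out

def scatter_add(items, size, order):
    """Order-independent: integer addition commutes, so any schedule agrees."""
    out = [0] * size
    for t in order:
        index, value = items[t]
        out[index] += value
    return out

items = [(0, 5), (1, 7), (0, 9)]  # threads 0 and 2 collide on slot 0
forward = [0, 1, 2]
backward = [2, 1, 0]
```

Here `scatter_overwrite` returns `[9, 7]` for the forward schedule and `[5, 7]` for the backward one, while `scatter_add` returns `[14, 7]` for both. Integers are used deliberately, since floating-point addition is not associative and can still vary slightly with order.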
System Design Goals
• Enable massively parallel implementations
• Efficient scaling to 1000s of cores
• No blocking/waiting
• No constraints on order of thread execution
• No read-after-write hazards
• Enable future compatibility
• New hardware releases, new operating systems
Other Computing Paradigms
• CPU-originated:
• Lock-based, Lockless
• Message Passing
• Transactional Memory
• May not scale well to 1000s of cores
• GPU Paradigms
• CUDA, CtM
• May not scale well over time
High Level APIs
• Microsoft Accelerator
• Google PeakStream
• RapidMind
• Acceleware
• Stream processing
• Brook, Sequoia
Additional GPU Features
• Linear Filtering
• 1-D, 2-D, 3-D floating point array indices
• Image and video data benefit
• Triangle Interpolators
• Address calculations take many clocks
• Blenders
• Atomic reduction ops reduce ordering concerns
• 4-vector operations
• Vector data, syntactic convenience
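As a rough illustration of linear filtering with floating-point array indices, the sketch below reads a 1-D array at a fractional position. It is a simplified software model under stated assumptions: the function name is hypothetical, addressing is clamp-to-edge, and the half-texel centre offsets that real texture units typically apply are ignored.

```python
import math

def sample_linear(data, x):
    """Read a 1-D array at a floating-point index with linear filtering."""
    # Clamp-to-edge addressing (an assumption; other address modes exist).
    x = min(max(x, 0.0), len(data) - 1.0)
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(data) - 1)
    t = x - i0                      # fractional part picks the blend weight
    return data[i0] * (1.0 - t) + data[i1] * t
```

In software this costs a floor, two reads, and a blend per sample. A filtering unit performs the equivalent work as part of the fetch, which is why image and video data benefit.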
Processor Opportunities
• Client computing performance can improve
• Client space is a large un-tapped opportunity for parallel processing
• Hardware changes required are minimal and fairly obvious
• Fast display, efficient i/o, scalable over time