© Imagination Technologies p1 www.imgtec.com
Doug Watt
May 21, 2013
Cool Compute: Strategies for Managing Power vs Performance in Mobile Devices
So what’s today’s problem?
First it was transistors…
There were never enough to do everything we wanted, but Moore’s law has been our friend
Then it was bandwidth…
Which is still an issue but less so, and actually using it burns power, which never did scale with process
Now it’s thermal shutdown
The last couple of process generations broke the link between geometry and power scaling
Performance is now limited by thermal envelope of the device
Keeping all those transistors on long enough to enjoy their performance is the issue
It’s everyone’s problem
…and that’s just the CPU
But some have it worse than others so there are clearly ways to combat this
Public domain screen captures from http://www.youtube.com/watch?v=f4qu915Wj1U
How does this affect performance?
Once the thermal limit is reached, power management kicks in
Thermal shutdown of the GPU: >25% performance hit
Data courtesy of Anandtech.com
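To make the throttling behaviour on this slide concrete, here is a minimal sketch of a toy thermal model in Python. It is not the measurement behind the slide: the function name, the first-order heating and cooling model, and every number in it are assumptions chosen only to show how a device that exceeds its thermal limit ends up spending part of its time at reduced speed.

```python
def full_speed_fraction(power_full_w, power_throttled_w, limit_c, resume_c,
                        ambient_c=25.0, heat_c_per_w=0.5, cooling_rate=0.05,
                        steps=600):
    """Toy first-order thermal model with hysteresis throttling.

    All parameters are hypothetical. Returns the fraction of time steps
    the device spends at full speed.
    """
    temp_c = ambient_c
    throttled = False
    full_steps = 0
    for _ in range(steps):
        # Power management: throttle at the limit, resume once cooled down.
        if temp_c >= limit_c:
            throttled = True
        elif temp_c <= resume_c:
            throttled = False
        power_w = power_throttled_w if throttled else power_full_w
        if not throttled:
            full_steps += 1
        # Heating grows with power; cooling grows with the gap to ambient.
        temp_c += heat_c_per_w * power_w - cooling_rate * (temp_c - ambient_c)
    return full_steps / steps


# A modest design never reaches the limit; a power-hungry one gets throttled.
print("2 W design:", full_speed_fraction(2.0, 1.0, limit_c=60.0, resume_c=50.0))
print("4 W design:", full_speed_fraction(4.0, 1.0, limit_c=60.0, resume_c=50.0))
```

In this toy model the lower-power design settles below the limit and runs at full speed throughout, while the higher-power design oscillates between full speed and the throttled state.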
Best way to get Maximum Performance?
(results of an actual experiment carried out by our friends at Futuremark!)
Looking for some solutions
Go Parallel
VLIW
SIMD
Multiple threads
Multiple cores
But you can’t just keep adding more cores…
They will just get shut down if they are power hogs
Brute force is a losing strategy
Go Parallel
Looking for some solutions
Modern SoCs are heterogeneous
Integrate CPU, GPU, DSP, ISP, video decoders, I/O interfaces
Different blocks optimized for different types of computation
Different SoCs provide different balances of heterogeneous resources for different classes of product
Each application task can be targeted at the hardware block which can execute it most efficiently
Granularity of tasks determined by architecture
Go Heterogeneous
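The idea of targeting each task at the block that executes it most efficiently can be sketched as a small lookup. This is only an illustration: the task names, the block names and the energy figures below are hypothetical, and a real scheduler would also weigh latency, data movement and availability.

```python
# Hypothetical energy cost in millijoules for running each task on each block.
# A block missing from a task's entry cannot run that task at all.
ENERGY_MJ = {
    "video_decode": {"CPU": 90.0, "GPU": 40.0, "video decoder": 5.0},
    "image_filter": {"CPU": 60.0, "GPU": 12.0},
    "ui_logic": {"CPU": 2.0},
}


def pick_block(task, available_blocks):
    """Return the available block with the lowest assumed energy for a task."""
    candidates = {
        block: energy
        for block, energy in ENERGY_MJ[task].items()
        if block in available_blocks
    }
    if not candidates:
        raise ValueError(f"no available block can run {task!r}")
    return min(candidates, key=candidates.get)


soc_blocks = {"CPU", "GPU", "video decoder"}
for task in ENERGY_MJ:
    print(task, "->", pick_block(task, soc_blocks))
```

An SoC without a dedicated video decoder would fall back to the GPU for `video_decode` in this sketch, which mirrors the point that different SoCs offer different balances of resources.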
Looking for some solutions
GPUs now employ the same ‘parallel features’ as CPUs
SIMD and Threads (SIMT)
Multiple cores
GPUs were special-purpose (shader) compute engines dedicated to graphics
But are now evolving into more general-purpose programmable devices
Enable area-efficient heterogeneous SoC designs that provide both
CPU for control and I/O (efficient latency processing)
GP-GPU for graphics and data crunching (efficient throughput processing)
Use the appropriate compute unit for each task – including the GPU
Heterogeneous CPU/GPU architectures
[Diagram: CPU (cores running few threads, large cache) and GPU (control, SIMD processing elements running many threads, small caches), with execution queues and unified system memory]
Mobile GPU compute must be “practical”
Use case: examples
Augmented reality: augmented reality shopping app
Computational photography: post-processing effects (HDR, panoramic stitching); lens correction
Computer vision: product recognition (display in webstore); automotive ADAS; defence/security systems
Audio: multi-microphone beamforming (noise reduction)
Video: HEVC decoder; real-time camera preview window effects
Risk mitigation: Kishonti and Google benchmarks
Mobile GPU compute must be “practical”
Practical compute is not…
Game physics (for the most part)
High-performance scientific computing
For many practical use cases, programmability is a requirement
Algorithms are constantly changing and improving
Standards take time to be ratified
GP-GPU architectures optimized for practical use cases provide the most efficient performance-power profiles
Mobile GPU compute must be “practical”
HPC Use Cases (Full Profile)
Mobile and Embedded Use Cases (Embedded Profile)
Optimal GPU Design Parameters
Features shown in the diagram:
3D Images
64-bit floating point
BGRA channel order image formats
Image sharing with OpenGL ES
Built-in atomic functions
High-precision rounding support
GPU Compute is a POWER play!
Using available heterogeneous resources saves energy
Trial run on TI OMAP 4 ‘Panda’ board; free-running suite of image enhancement functions, written in three versions.
Single- and dual-threaded CPU-only versions allowed to saturate CPU; GPU version in OpenCL 1.0 EP, with minimal CPU loading.
[Bar chart: energy per processed frame for Single Thread, Two Threads and GPU; axis ticks 0, 0.1, 0.2; bar values not recoverable from the transcript]
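The metric on this slide, energy per processed frame, is average power multiplied by the time spent on each frame, which is the same as average power divided by frame rate. The sketch below shows that arithmetic in Python. The power and frame-rate figures are made-up placeholders, not the results of the OMAP 4 trial, and the function name is only illustrative.

```python
def energy_per_frame_j(avg_power_w, frames_per_second):
    """Energy per frame in joules: watts divided by frames per second."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return avg_power_w / frames_per_second


# Hypothetical configurations: (average power in watts, frames per second).
configs = {
    "CPU, one thread": (1.2, 4.0),
    "CPU, two threads": (1.8, 7.5),
    "GPU": (0.9, 18.0),
}

for name, (power_w, fps) in configs.items():
    print(f"{name}: {energy_per_frame_j(power_w, fps):.3f} J per frame")
```

The design point this illustrates is that a block can draw comparable power and still use far less energy per frame if it finishes each frame sooner.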
Heterogeneous System Architecture
We are on an evolution path towards processor designs that combine…
Efficient latency processing (CPU)
Efficient throughput processing (GPU)
Tighter integration of all heterogeneous components (CPU, GPU, DSP, ISP, …)
Need to analyse performance of emerging use cases to determine the right balance of hardware resources for each new processor design – for example
Local memory – quantity and type (registers, SRAM)
ALU – quantity, type (int, float), precision (rtn, rtz)
ISA – instructions that enable efficient compilation of application code
Heterogeneous System Architecture
CPU-GPU coherency on mobile processors today is one-way
GPU can set flags on memory accesses, indicating data it’s fetching may already be within CPU cache
Infrastructure will snoop the CPU cache before looking in system memory
A few use cases for graphics exist, but not compute
Today’s Mobile CPU+GPU designs are loosely integrated
Heterogeneous System Architecture
HSA should reduce overall system power
Infrastructure adds area and power
Zero-copy throughout entire system reduces bandwidth
Tomorrow’s Mobile CPU+GPU designs will be more tightly integrated
Heterogeneous System Architecture
Simplifies effective use of heterogeneous computing
Designed for C99, C++ 2011, Java, Renderscript, OpenCL, C++ AMP
HSAIL is a virtual ISA for parallel programs
Finalized to ISA by a JIT compiler or “Finalizer”
ISA independent by design for CPU & GPU
Explicitly parallel
Designed for data parallel programming
[Diagram: software stack. Labels: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); HSA Runtime; OpenGL-ES Runtime; Renderscript/OpenCL Runtimes; Dalvik Runtime; Other Runtime; HSAIL; HSA Finalizer; GPU ISA; Legacy Driver; HSA Software; Kernel Driver; Ctl; CPU(s), GPU(s), Other Accelerators. Layer labels: Application SW, Drivers, Differentiated HW]
Heterogeneous System Architecture
IP Vendor, Silicon Vendor, OEM, App Writer
3D Graphics CPU Use Cases Augmented Reality Photography Computer Vision Audio Processing Video Processing
Segment types and sizes illustrative only
Summary
Thermal envelope has become the limiting factor in performance
It is necessary to use all the tools at our disposal to address this
Macro- and micro-architecture, dynamic resource management and application partitioning all play a role
There is a move towards simplifying and standardising the use of all computing resources on board an SoC
What works on the desktop does not always work for embedded devices
It is still necessary to use resources wisely: going for maximum precision or maximum dynamic range is never a no-brainer
The future is increasingly heterogeneous
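The summary point about precision can be illustrated with a short Python sketch that stores the same value at three floating-point widths using the standard `struct` module. The choice of 0.1 as the test value and the helper name `roundtrip` are assumptions for illustration; the sketch only shows the storage-versus-accuracy trade-off and says nothing about the cost of arithmetic on any particular GPU.

```python
import struct


def roundtrip(value, fmt):
    """Store a value in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# "<e" is IEEE 754 half precision, "<f" single precision, "<d" double precision.
for name, fmt in (("half", "<e"), ("single", "<f"), ("double", "<d")):
    stored = roundtrip(0.1, fmt)
    error = abs(stored - 0.1)
    print(f"{name}: {struct.calcsize(fmt)} bytes, rounding error {error:.3e}")
```

Each step down in width halves the bytes that have to be moved and stored, at the cost of a larger rounding error, so the narrowest format that still meets the quality target for a use case is often worth considering.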