
© Imagination Technologies p1 www.imgtec.com

Doug Watt

May 21, 2013

Cool Compute: Strategies for Managing Power vs Performance in Mobile Devices


So what’s today’s problem?

First it was transistors…

There just were never enough to do everything we wanted, but Moore’s law has been our friend

Then it was bandwidth…

Which is still an issue but less so, and actually using it burns power, which never did scale with process

Now it’s thermal shutdown

The last couple of process generations broke the link between geometry and power scaling

Performance is now limited by the thermal envelope of the device

Keeping all those transistors on long enough to enjoy their performance is the issue
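One way to see why geometry and power scaling came apart is the textbook CMOS dynamic-power model. This is a standard approximation and an assumption here, not something taken from the slides; the scaling factor k and activity factor α are illustrative symbols.

```latex
\begin{align*}
P_{\text{dyn}} &\approx \alpha\, C\, V^{2} f \\
% Classical scaling by a factor k: C -> C/k, V -> V/k, f -> k f
P'_{\text{dyn}} &\approx \alpha\, \frac{C}{k}\, \frac{V^{2}}{k^{2}}\, (k f) = \frac{P_{\text{dyn}}}{k^{2}} \\
% If the supply voltage V can no longer be reduced:
P''_{\text{dyn}} &\approx \alpha\, \frac{C}{k}\, V^{2}\, (k f) = P_{\text{dyn}}
\end{align*}
```

In the classical case, k² more transistors fit in the same area, each using 1/k² of the power, so power per unit area stays constant. Once the voltage stops scaling, power per transistor stays roughly the same while density still rises by k², so power per unit area grows and the thermal envelope becomes the limit.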


It’s everyone’s problem

…and that’s just the CPU

But some have it worse than others so there are clearly ways to combat this

Public domain screen captures from http://www.youtube.com/watch?v=f4qu915Wj1U


How does this affect performance?

Once the thermal limit is reached, power management kicks in

Thermal shutdown of GPU: >25% performance hit

Data courtesy of Anandtech.com
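The effect of power management kicking in can be sketched with a toy first-order thermal model. Everything in this sketch is an assumption for illustration: the function name, the thermal resistance and time constant, the power and performance levels, and the simple two-state governor. None of it is measured data or a real device's policy.

```python
def simulate_throttling(duration_s=600.0, dt=0.1,
                        t_ambient=25.0, r_thermal=10.0, tau=20.0,
                        p_high=4.0, p_low=2.0,
                        perf_high=1.0, perf_low=0.5,
                        t_limit=55.0, t_resume=50.0):
    """Return (average relative performance, peak temperature in deg C).

    All parameters are hypothetical. The die is modelled as a single
    thermal RC node: it heats towards t_ambient + power * r_thermal
    with time constant tau.
    """
    temp = t_ambient
    peak = temp
    throttled = False
    work_done = 0.0
    steps = int(round(duration_s / dt))
    for _ in range(steps):
        # Two-state governor with hysteresis: throttle at the limit,
        # return to full speed once the die has cooled to t_resume.
        if temp >= t_limit:
            throttled = True
        elif temp <= t_resume:
            throttled = False
        power = p_low if throttled else p_high
        perf = perf_low if throttled else perf_high
        # Explicit Euler step of the first-order thermal model.
        temp += dt * (power * r_thermal - (temp - t_ambient)) / tau
        peak = max(peak, temp)
        work_done += perf * dt
    return work_done / duration_s, peak


if __name__ == "__main__":
    avg, peak = simulate_throttling()
    print(f"average relative performance: {avg:.2f}, peak temp: {peak:.1f} C")
```

With these made-up numbers the device cannot hold its peak operating point, so sustained performance settles between the throttled and unthrottled levels. Raising `t_limit` above the steady-state temperature removes the throttling entirely.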


Best way to get Maximum Performance?

(results of an actual experiment carried out by our friends at Futuremark!)


Looking for some solutions

Go Parallel

VLIW

SIMD

Multiple threads

Multiple cores

But you can’t just keep adding more cores…

They will just get shut down if they are power hogs

Brute force is a losing strategy

Go Parallel



Modern SoCs are heterogeneous

Integrate CPU, GPU, DSP, ISP, video decoders, I/O interfaces

Different blocks optimized for different types of computation

Different SoCs provide different balances of heterogeneous resources for different classes of product

Each application task can be targeted at the hardware block which can execute it most efficiently

Granularity of tasks determined by architecture

Go Heterogeneous
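Targeting each task at the block that executes it most efficiently can be sketched as a lookup over per-block energy costs. The task names, block names, and energy figures below are hypothetical placeholders, not measurements, and `pick_block` is an illustrative helper rather than any real scheduler API.

```python
# Hypothetical energy cost (millijoules per task) on each hardware block.
# A block is absent from a task's entry if it cannot run that task.
ENERGY_MJ = {
    "control_logic":     {"CPU": 1.0, "GPU": 5.0},
    "image_filter":      {"CPU": 8.0, "GPU": 2.0, "DSP": 3.0},
    "audio_beamforming": {"CPU": 4.0, "DSP": 1.0},
    "video_decode":      {"CPU": 20.0, "VIDEO_DECODER": 1.5},
}


def pick_block(task, available_blocks):
    """Return the available block with the lowest energy cost for a task."""
    candidates = {
        block: cost
        for block, cost in ENERGY_MJ[task].items()
        if block in available_blocks
    }
    if not candidates:
        raise ValueError(f"no available block can run {task!r}")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    soc = {"CPU", "GPU", "DSP", "VIDEO_DECODER"}
    for task in ENERGY_MJ:
        print(task, "->", pick_block(task, soc))
```

The same task lands on a different block when the SoC offers a different mix of resources, which is the point about different SoCs providing different balances.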



GPUs now employ the same ‘parallel features’ as CPUs

SIMD and Threads (SIMT)

Multiple cores

GPUs were special-purpose (shader) compute engines dedicated to graphics

But are now evolving into more general-purpose programmable devices

Enable area-efficient heterogeneous SoC designs that provide both

CPU for control and I/O (efficient latency processing)

GP-GPU for graphics and data crunching (efficient throughput processing)

Use the appropriate compute unit for each task – including the GPU


Heterogeneous CPU/GPU architectures

[Diagram: heterogeneous CPU/GPU architecture. Labels: CPU (cores, each running few threads, with a large cache); GPU (control, SIMD processing elements, each running many threads, with small caches); execution queues; unified system memory.]


Mobile GPU compute must be “practical”

Use Case                    Example

Augmented reality           Augmented reality shopping app

Computational Photography   Post-processing effects (HDR, panoramic stitching); lens correction

Computer Vision             Product recognition (display in webstore); automotive ADAS; defence/security systems

Audio                       Multi-microphone beamforming (noise reduction)

Video                       HEVC decoder; real-time camera preview window effects

Risk Mitigation             Kishonti and Google benchmarks



Practical compute is not…

Game physics (for the most part)

High-performance scientific computing

For many practical use cases, programmability is a requirement

Algorithms are constantly changing and improving

Standards take time to be ratified

GP-GPU architectures optimized for practical use cases provide the most efficient performance-power profiles



Optimal GPU Design Parameters

[Figure: HPC Use Cases (Full Profile) compared with Mobile and Embedded Use Cases (Embedded Profile). Features labelled: 3D images; 64-bit floating point; BGRA channel order image formats; image sharing with OpenGL ES; built-in atomic functions; high-precision rounding support.]


GPU Compute is a POWER play!

Using available heterogeneous resources saves energy

Trial run on TI OMAP 4 ‘Panda’ board; free-running suite of image enhancement functions, written in three versions.

Single and dual threaded CPU-only versions allowed to saturate CPU; GPU version in OpenCL 1.0 EP, with minimal CPU loading.

[Chart: energy per processed frame for the Single Thread, Two Threads and GPU versions (axis ticks 0, 0.1, 0.2)]
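Energy per processed frame is average power divided by frame rate. The helper below is a minimal sketch of that arithmetic; the function name and the example power and frame-rate values are hypothetical and are not the figures from the OMAP 4 trial.

```python
def energy_per_frame_j(average_power_w, frames_per_second):
    """Energy per processed frame in joules: average power / frame rate."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return average_power_w / frames_per_second


if __name__ == "__main__":
    # Hypothetical (power in watts, frames per second) pairs.
    configs = {
        "slow but frugal": (0.5, 2.0),
        "fast but hungry": (1.0, 10.0),
    }
    for name, (power_w, fps) in configs.items():
        print(f"{name}: {energy_per_frame_j(power_w, fps):.3f} J/frame")
```

The metric rewards finishing each frame quickly as well as drawing little power: a configuration that draws more power can still spend less energy per frame if its throughput is proportionally higher.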


Heterogeneous System Architecture

We are on an evolution path towards processor designs that combine…

Efficient latency processing (CPU)

Efficient throughput processing (GPU)

Tighter integration of all heterogeneous components (CPU, GPU, DSP, ISP, …)

Need to analyse performance of emerging use cases to determine the right balance of hardware resources for each new processor design – for example

Local memory – quantity and type (registers, SRAM)

ALU – quantity, type (int, float), precision (rtn, rtz)

ISA – instructions that enable efficient compilation of application code



CPU-GPU coherency on mobile processors today is one-way

GPU can set flags on memory accesses, indicating data it’s fetching may already be within CPU cache

Infrastructure will snoop the CPU cache before looking in system memory

A few use cases for graphics exist, but not compute

Today’s Mobile CPU+GPU designs are loosely integrated
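The one-way behaviour described above can be modelled in a few lines. This is a toy model under stated assumptions, not a description of any particular interconnect: the class and method names are invented, caches are plain dictionaries, and the reverse direction (CPU seeing GPU writes) is assumed to need an explicit flush.

```python
class OneWayCoherentSystem:
    """Toy model: GPU reads may snoop the CPU cache, but not vice versa."""

    def __init__(self):
        self.memory = {}     # system memory
        self.cpu_cache = {}  # CPU lines not yet written back
        self.gpu_cache = {}  # GPU lines not yet written back

    def cpu_write(self, addr, value):
        self.cpu_cache[addr] = value

    def cpu_read(self, addr):
        # The CPU never snoops the GPU cache in this model.
        if addr in self.cpu_cache:
            return self.cpu_cache[addr]
        return self.memory.get(addr)

    def gpu_read(self, addr, snoop_cpu=True):
        # snoop_cpu stands in for the flag the GPU sets on an access:
        # look in the CPU cache first, then fall back to system memory.
        if snoop_cpu and addr in self.cpu_cache:
            return self.cpu_cache[addr]
        return self.memory.get(addr)

    def gpu_write(self, addr, value):
        self.gpu_cache[addr] = value

    def gpu_flush(self):
        # Assumed explicit step that makes GPU writes visible to the CPU.
        self.memory.update(self.gpu_cache)
        self.gpu_cache.clear()


if __name__ == "__main__":
    soc = OneWayCoherentSystem()
    soc.cpu_write(0x10, 42)
    print(soc.gpu_read(0x10))                   # snooped from CPU cache
    print(soc.gpu_read(0x10, snoop_cpu=False))  # stale view of memory
```

In this model, data produced by the CPU reaches the GPU without a copy or cache flush, while data produced by the GPU only reaches the CPU after an explicit flush.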



HSA should reduce overall system power

Infrastructure adds area and power

Zero-copy throughout entire system reduces bandwidth

Tomorrow’s Mobile CPU+GPU designs will be more tightly integrated


Heterogeneous System Architecture

Simplifies effective use of heterogeneous computing

Designed for C99, C++ 2011, Java, Renderscript, OpenCL, C++ AMP

HSAIL is a virtual ISA for parallel programs

Finalized to ISA by a JIT compiler or “Finalizer”

ISA independent by design for CPU & GPU

Explicitly parallel

Designed for data parallel programming

[Diagram: HSA software stack. Labels: Application; Domain Specific Libs (Bolt, OpenCV™, … many others); Renderscript/OpenCL Runtimes; Dalvik Runtime; OpenGL-ES Runtime; Other Runtime; HSA Runtime; HSAIL; HSA Finalizer; GPU ISA; HSA Software; Kernel Driver; Ctl; Legacy Driver; SW; Drivers; Differentiated HW; CPU(s), GPU(s), Other Accelerators.]



[Figure labels: IP Vendor, Silicon Vendor, OEM, App Writer; 3D Graphics, CPU Use Cases, Augmented Reality, Photography, Computer Vision, Audio Processing, Video Processing. Segment types and sizes illustrative only.]


Summary

Thermal envelope has become the limiting factor in performance

It is necessary to use all the tools at our disposal to address this

Macro and micro architecture, dynamic resource management and application partitioning all play a role

There is a move towards simplifying and standardising the use of all computing resources on board an SoC

What works on the desktop does not always work for embedded devices

It is still necessary to use resources wisely: going for maximum precision and maximum dynamic range is never a no-brainer

The future is increasingly heterogeneous

