AMD Accelerated Computing - UFRJ

Agenda

X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL

Evolving x86 Processors

[Diagram: AMD “Istanbul” six-core architecture. Six cores, each with its own L2 cache, share an L3 cache and connect through a crossbar to the memory controller and the HyperTransport links; the chipset attaches over PCI-e.]

Lower memory latency

Balanced caches

Fast full-duplex bus

Native six-core processor

4P/24-core system example: very good scalability

One memory controller for every processor

Full-duplex HyperTransport links (up to 5.2 GT/s)

Bus Optimization: HT Assist (Cache Probe Filtering)

Still the only available 4P system with Direct Connect Architecture


Direct Connect Architecture 1.0: Balanced and Scalable Design to Support up to 6 Cores

[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]

No front side bus

Integrated memory controller

HyperTransport™ technology

NUMA memory architecture

Direct Connect Architecture 2.0: Balanced and Scalable Design to Support up to 16 Cores* per CPU

• 1-hop between processors

• Up to 50% more DIMMs

• Four memory channels

• Up to 33% increase in CPU to CPU communication speed±

[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]

What is next for x86 CPUs

• More processor cores to come (12, 16, 16 double cores)

• More memory channels (improves memory bandwidth per core)

• Improved IPC (8 per cycle is a target)

Top500 list - beyond the petaflop

Datacenters in the USA will spend more than $3 billion on energy in 2009

1997: Garry Kasparov vs. IBM Deep Blue

The World’s Most Powerful GPU

2011 GPU Architecture: AMD Radeon™ HD 6900 Series

Dual graphics engines

New VLIW4 core architecture

Up to 24 SIMD engines

Up to 96 Texture Units

Upgraded render back-ends

Improved anti-aliasing performance

Fast 256-bit GDDR5 memory interface

Up to 5.5 Gbps

New GPU compute features
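The memory-interface figures in this list imply a peak bandwidth that can be worked out by hand. The sketch below assumes that "5.5 Gbps" is the data rate per pin and that all 256 data pins transfer at once; the function name is hypothetical and not part of any AMD tool.

```c
#include <assert.h>

/* Peak memory bandwidth in GB/s for a bus with `bus_width_bits` data pins,
 * each moving `gbps_per_pin` gigabits per second.
 * Hypothetical helper for illustration only. */
double peak_bandwidth_gbs(int bus_width_bits, double gbps_per_pin)
{
    assert(bus_width_bits > 0 && gbps_per_pin > 0.0);
    /* Total gigabits per second across the bus, divided by 8 bits per byte. */
    return bus_width_bits * gbps_per_pin / 8.0;
}
```

Under these assumptions, `peak_bandwidth_gbs(256, 5.5)` works out to 176 GB/s. This is a theoretical peak; sustained bandwidth depends on the access pattern.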

Designing very efficient GPUs

Full load: 180 W; Idle: 27 W

[Chart: GFLOPS/W and GFLOPS/mm² across GPU generations from Nov-05 to Oct-09, starting with the ATI Radeon™ X1800 XT, X1900 XTX, HD 2900 PRO, and HD 3870; highlighted values: 14.47 GFLOPS/W and 7.90 GFLOPS/mm²]

Old and New in High Performance Computing

Old: Power is free, Transistors are expensive

New: Power expensive, Transistors free

(Can put more transistors on chip than can afford to turn on)

Old: Multiplies are slow, Memory access is fast

New: Multiplies fast, Memory slow

(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)

Old: Increasing Instruction Level Parallelism via compiler innovation

New: Explicit thread and data parallelism must be exploited
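The "multiplies fast, memory slow" point can be made concrete with the two clock counts quoted on this slide. The sketch below assumes a deliberately simple serial cost model with no caching and no overlap between arithmetic and memory traffic; the names are hypothetical.

```c
#include <assert.h>

/* Clock counts quoted on the slide. */
enum { DRAM_CLOCKS = 200, FP_MUL_CLOCKS = 4 };

/* How many FP multiplies fit in the time of one DRAM access. */
long multiplies_per_dram_access(void)
{
    return DRAM_CLOCKS / FP_MUL_CLOCKS;
}

/* Estimated clocks for a workload, assuming every operation runs
 * one after another (no caches, no latency hiding). */
long estimated_clocks(long n_multiplies, long n_dram_accesses)
{
    assert(n_multiplies >= 0 && n_dram_accesses >= 0);
    return n_multiplies * FP_MUL_CLOCKS + n_dram_accesses * DRAM_CLOCKS;
}
```

In this model one DRAM access costs as much as 50 multiplies, and a workload of 1000 multiplies with 100 DRAM accesses spends 20000 of its 24000 clocks waiting on memory. Real hardware hides part of that latency, which is one reason explicit thread and data parallelism matter.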

GPUs: more than just gaming

Processing power – millions of operations per second

Single Core: 12

Dual Core: 24

Quad Core: 48

Hexa Core: 72

12 Cores: 144

Radeon HD 5970: 2700

Wii Sports - Golf and an oil exploration platform (2010): both use GPUs

DirectX® 11 Multi-Threading

Application, DirectX runtime, and DirectX driver can each run in separate threads

Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread

[Figure: DirectX® 10 vs. DirectX® 11]

Today’s GPUs focused on

GAMING

ENTERTAINMENT

PRODUCTIVITY

DirectX® 11 Tessellation

Images courtesy of Unigine Corp.

[Figure: no tessellation (DirectX® 10) vs. tessellation (DirectX® 11)]

5/26/2011

Research companies already using GPUs

Oil exploration, weather forecasting, fluid dynamics, nature simulation

AMD Balanced Platform

Delivers optimal performance for a wide range of platform configurations

Other Highly Parallel Workloads

Graphics Workloads

Serial/Task-Parallel Workloads

CPU is excellent for running some algorithms

Ideal place to process if GPU is fully loaded

Great use for additional CPU cores

GPU is ideal for data parallel algorithms like image processing, CAE, etc

Great use for ATI Stream technology

Great use for additional GPUs

ATI Stream Technology is…

Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience

High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency

Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development

Digital Content Creation, Engineering, Sciences, Government, Gaming, Productivity

Improvements already reached consumers

[Chart: processor utilization with ATI Stream; axis from 0% to 80%]

Adobe Flash plugin used by YouTube.com

Better image quality and video smoothness

Lower processor usage

GPU-accelerated video transcoding

Up to 6x faster when using an AMD graphics card

HD video to iPod video

Video Transcoding Sample: No GPU Acceleration

Using four CPU cores

CPU usage: 100%

GPU usage: 1%

Time to finish: 1h 52m

Peak power: 145 W

Total energy: 0.23 kWh

Energy price: $0.15

Video Transcoding Sample: ATI GPU Acceleration

Using hundreds of Stream Processors

(Values in parentheses are from the run without GPU acceleration.)

CPU usage: 45% (100%)

GPU usage: 35% (1%)

Time to finish: 26m (1h 52m)

Peak power: 198 W (145 W)

Total energy: 0.11 kWh (0.23 kWh)

Energy price: $0.07 ($0.15)
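The two transcoding samples can be compared with a little arithmetic. The sketch below takes the time, energy, and price figures from the slides as given; the struct and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* One transcoding run, using the figures quoted on the slides. */
typedef struct {
    double minutes;     /* time to finish */
    double energy_kwh;  /* total energy used */
    double cost_usd;    /* energy price for the run */
} transcode_run;

/* Figures from the two samples: CPU only vs. ATI GPU acceleration. */
static const transcode_run CPU_ONLY  = { 112.0, 0.23, 0.15 };  /* 1h 52m */
static const transcode_run GPU_ACCEL = {  26.0, 0.11, 0.07 };

/* How many times faster the accelerated run finishes. */
double speedup(transcode_run base, transcode_run accel)
{
    assert(accel.minutes > 0.0);
    return base.minutes / accel.minutes;
}

/* Fraction of the baseline energy that the accelerated run saves. */
double energy_saved(transcode_run base, transcode_run accel)
{
    assert(base.energy_kwh > 0.0);
    return 1.0 - accel.energy_kwh / base.energy_kwh;
}
```

With these figures, `speedup(CPU_ONLY, GPU_ACCEL)` is about 4.3 and `energy_saved(CPU_ONLY, GPU_ACCEL)` is about 0.52. Peak power is higher with the GPU active, but the job finishes sooner and uses roughly half the energy.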

FUSION TECHNOLOGY

Today

TeraFLOPS-class GPU

Up to 2 billion transistors

Gaming across multiple monitors

Full HD video and audio

Multi-core CPU

~800 million transistors

Multi-tasking

A new era in performance evolution

Single-Core (single-thread performance over time)

Challenges: power consumption, complexity

Multi-Core (performance over time x cores)

Challenges: power consumption, software

Heterogeneous computing (performance over time)

Pros: performance, power efficient

Cons: software availability

[Charts: each curve carries a “We are here” marker]

A new era in performance evolution

[Diagram: Single-Core, Multi-Core, and Software Acceleration stages plotted against core efficiency, with CPU and GPU regions and gaming and multimedia workloads]

Putting it all together – The Future is Fusion

[Diagram: RV500 GPU core (2006), a memory controller surrounded by ring stops and client interfaces, next to the AMD “Istanbul” six-core processor with six cores and L2 caches, L3 cache, crossbar, memory controller, HyperTransport links, and PCI-e chipset]

[Diagram: RV700 GPU core (2008-2009) next to the AMD “Istanbul” six-core processor]

[Diagram: RV700 GPU core and AMD “Istanbul” six-core processor, each built around a crossbar]

2011: Welcome to the APU era!

CPU + GPU = APU

“Supercomputing power in a notebook platform whose battery lasts for a full day”

One Design, Fewer Watts, Massive Capability

Northbridge (66 sq. mm, 13 watts)

+ Dual-Core CPU (117 sq. mm, 25 watts)

+ Discrete-level DirectX® 11 GPU (59 sq. mm, 8 watts)

= “Zacate” AMD Fusion APU (75 sq. mm, 18 watts)

Graphics and Media Processing Efficiency Improvements

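[Diagram, 2010 IGP-based platform: a CPU chip (CPU cores, UNB, MC) and a separate chip holding the GPU, UVD, and SB functions, with PCIe and DDR3 DIMM memory; labeled bandwidths of ~7 GB/sec and ~17 GB/sec]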

Bandwidth pinch points and latency hold back the GPU capabilities

3X bandwidth between GPU and memory

Even the same sized GPU is substantially more effective in this configuration

Eliminate latency and power associated with the extra chip crossing

Substantially smaller physical footprint

Graphics requires memory bandwidth to bring full capabilities to life

[Diagram, 2011 APU-based platform: a single APU chip containing the CPU cores, GPU, UVD, and UNB/MC, with PCIe and DDR3 DIMM memory; labeled bandwidths of ~27 GB/sec]

“Ontario” & “Zacate” Architecture

APU

>2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)

>C6 and power gating

>Array of SIMD Engines

• DX11 graphics performance

• Industry leading 3D and graphics processing

>3rd Generation Unified Video Decoder

>H.264, VC1, DivX/Xvid format

>DDR3 800-1066, 2 DIMMs, 64 bit channel

>BGA package

Display and I/O

>Two dedicated digital display interfaces

• Configurable externally as HDMI, DVI, and/or Display Port

• Also supports a single link LVDS for internal panels

>Integrated VGA

>5x8 PCIe®

> “Hudson” Fusion Controller Hub

Working together OpenCL

ATI Stream SDK: OpenCL™ For Multicore x86 CPUs and GPUs

The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience

• First complete OpenCL™ development platform

• Certified OpenCL 1.0 compliant by the Khronos Group

• Write code that can scale well on multi-core CPUs and GPUs

• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies

• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support

http://developer.amd.com/



OpenCL™: Game-Changing Development Enabling Broad Adoption of GP-GPU Capabilities

Industry standard API: Open, multiplatform development platform for heterogeneous architectures

The power of Fusion: Leverages CPUs and GPUs for balanced system approach

Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.

Fast track development: Ratified in December 2008; AMD is the first company to provide a complete OpenCL solution

Momentum: Enormous interest from mainstream developers and application ISVs

More stream-enabled applications across all markets

Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific (cross-platform limiters)

• Apple Display Connector

• 3dfx Glide

• Nvidia CUDA

• Nvidia Cg

• Rambus

• Unified Display Interface

Vendor neutral (cross-platform enablers)

• Digital Visual Interface

• OpenCL™

• DirectX®

• Certified DP

• JEDEC

• OpenGL®

Comparing OpenCL™ and DirectX® 11 DirectCompute

How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?

Feature set is similar in both APIs

DirectX® 11 DirectCompute

Easiest path to add compute capabilities to existing DirectX applications

Windows Vista® and Windows® 7 only

OpenCL™

Ideal path for new applications porting to the GPU for the first time

True multiplatform: Windows®, Linux®, MacOS

Natural programming without dealing with a graphics API

Anatomy of OpenCL™

Language Specification

• C-based cross-platform programming interface

• Subset of ISO C99 with language extensions - familiar to developers

• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error

• Online or offline compilation and build of compute kernel executables

• Includes a rich set of built-in functions

Platform Layer API

• A hardware abstraction layer over diverse computational resources

• Query, select and initialize compute devices

• Create compute contexts and work-queues

Runtime API

• Execute compute kernels

• Manage scheduling, compute, and memory resources

OpenCL Example

Scalar

    void square(int n, const float *a, float *result)
    {
        int i;
        for (i = 0; i < n; i++)
            result[i] = a[i] * a[i];
    }

Data-Parallel

    kernel void dp_square(global const float *a, global float *result)
    {
        int id = get_global_id(0);
        result[id] = a[id] * a[id];
    }
    // dp_square executes over “n” work-items
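To see how the data-parallel form relates to the scalar loop, the kernel launch can be emulated in plain C. This is only a sketch: `dp_square_item` and `emulate_dp_square` are hypothetical names, the `id` parameter stands in for `get_global_id(0)`, and a real OpenCL runtime is free to run the work-items in any order and in parallel rather than one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Body of dp_square for a single work-item.
 * `id` plays the role of get_global_id(0). */
static void dp_square_item(size_t id, const float *a, float *result)
{
    result[id] = a[id] * a[id];
}

/* Hypothetical stand-in for launching the kernel over n work-items.
 * The loop exists only in this emulation; in OpenCL the runtime
 * supplies the index range and the kernel body has no loop. */
void emulate_dp_square(size_t n, const float *a, float *result)
{
    assert(a != NULL && result != NULL);
    for (size_t id = 0; id < n; id++)
        dp_square_item(id, a, result);
}
```

Because each work-item touches only its own element, the iterations are independent, which is what lets the runtime spread them across CPU cores or GPU SIMD engines.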

Summary


X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL

http://developer.amd.com

Thank you!

[email protected]

[email protected]
