
saahpc.ncsa.illinois.edu/10/presentations/day1/session4/presentation_Rubin.pdf

| Heterogeneous Computing -> Fusion | saahpc 2010 1

Heterogeneous Computing -> Fusion

Norm Rubin

AMD Fellow

Definitions

Heterogeneous Computing

– A system composed of two or more compute engines with significant structural differences

– In our case, a low latency x86 CPU and a high throughput Radeon GPU

Fusion

– Bringing together two or more components and joining them into a single unified whole

– In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power

AMD Balanced Platform Advantage

Delivers optimal performance for a wide range of platform configurations

Other Highly Parallel Workloads

Graphics Workloads

Serial/Task-Parallel Workloads

CPU is ideal for scalar processing

Out of order x86 cores with low latency memory access

Optimized for sequential and branching algorithms

Runs existing applications very well

GPU is ideal for parallel processing

GPU shaders optimized for throughput computing

Ready for emerging workloads

Media processing, simulation, natural UI, etc

Three Eras of Processor Performance

Single-Core Era

[Chart: Single-thread Performance vs. Time; "we are here" marker; curve ends in "?"]

Enabled by: Moore’s Law, Voltage Scaling, MicroArchitecture

Constrained by: Power, Complexity

Multi-Core Era

[Chart: Throughput Performance vs. Time (# of Processors); "we are here" marker]

Enabled by: Moore’s Law, Desire for Throughput, 20 years of SMP arch

Constrained by: Power, Parallel SW availability, Scalability

Heterogeneous Systems Era

[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation); "we are here" marker]

Enabled by: Moore’s Law, Abundant data parallelism, Power efficient GPUs

Temporarily constrained by: Programming models, Communication overheads

Emerging Application Spaces

Category: Massive Data Mining
Characteristics: Full 64b addressing; Huge data sets; New data types
Application Examples: Image, Video, Audio processing; Pattern analytics and search

Category: Natural User Interfaces
Characteristics: Massive “behind-the-scenes” computing
Application Examples: Face and gesture recognition; Real time video & audio proc; Physical world interpretation

Category: Visualization
Characteristics: Advanced rendering; Interactive physics
Application Examples: Multi-layered Graphics; Holographic Displays; Scientific visualization & CAD; Next generation Gaming

Category: Cloud + Client Applications
Characteristics: Seamless responsiveness; Workload partitioning
Application Examples: Next generation browsers; HTML5 Apps with Native Code from JavaScript

GPU SP ALU Performance

[Chart; labels: HD4870, HD5870, CPU]

GPU DP ALU Performance

[Chart; labels: HD4870, HD5870, CPU]

GPU BW Performance expectations over time

[Chart; axis ticks from 0 to 300 in steps of 50; labels: HD4870, HD5870]

GPU Computing Efficiency Trend

[Chart; two series: GFLOPS/W and GFLOPS/mm2. Data labels in extraction order: 7.50, 4.56, 4.50, 2.24, 2.21, 0.92, 2.01, 1.06, 1.07, 0.42. Called-out points: 14.47 GFLOPS/W, 7.90 GFLOPS/mm2]

ATI Radeon™ HD 5870 Compute Architecture

20 SIMD Engines

1600 shader cores

Ultra-Threaded Dispatch Processor

Instruction and Constant Caches

Memory Export Buffer

Fetch path with multi-level caches

Global Data Store

Memory Hierarchy

Distributed Memory Controller

Optimized for latency hiding and memory access efficiency

GDDR5 memory at 150 GB/s

Up to 272 billion 32-bit fetches/second

Up to 1 TB/sec L1 texture fetch bandwidth

Up to 435 GB/sec between L1 & L2
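
As a rough consistency check on the figures above, the fetch rate can be converted into bytes per second. The sketch below assumes each 32-bit fetch moves 4 bytes and uses decimal units (1 GB = 10^9 bytes); the function and variable names are illustrative and do not come from the slides.

```python
def fetch_bandwidth_gb_per_s(fetches_per_second: float, bits_per_fetch: int = 32) -> float:
    """Convert a fetch rate into GB/s (decimal units, 1 GB = 1e9 bytes)."""
    bytes_per_fetch = bits_per_fetch / 8
    return fetches_per_second * bytes_per_fetch / 1e9


# 272 billion 32-bit fetches/second, as quoted on the slide.
l1_fetch_gb_per_s = fetch_bandwidth_gb_per_s(272e9)

# Ratios against the other quoted bandwidths (150 GB/s GDDR5, 435 GB/s L1-to-L2).
l1_vs_dram = l1_fetch_gb_per_s / 150.0
l1_vs_l2_link = l1_fetch_gb_per_s / 435.0

print(f"L1 fetch bandwidth: {l1_fetch_gb_per_s:.0f} GB/s")
print(f"L1 / GDDR5: {l1_vs_dram:.1f}x, L1 / (L1-L2 link): {l1_vs_l2_link:.1f}x")
```

Under these assumptions the fetch rate works out to about 1088 GB/s, which is consistent with the "up to 1 TB/sec" L1 figure.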

Comparative Stats on ATI Radeon HD 5870 GPU

* Based on internal AMD testing

                      AMD Opteron™   ATI Radeon™    ATI Radeon™    One Year
                      Model 2435     HD 4870        HD 5870        Difference
Die Size              346 mm2        263 mm2        334 mm2        1.27x
Transistors           904 million    956 million    2.15 billion   2.25x
Memory Bandwidth      12.8 GB/s      115 GB/s       153 GB/s       1.33x
SP GFlops             124.8          1200           2720           2.25x
DP GFlops             62.4           240            544            2.25x
ALUs                  54             800            1600           2x
Board Power*, Idle    15.5 W         90 W           27 W           0.3x
Board Power*, Max     115 W          160 W          188 W          1.17x
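
The "One Year Difference" column and the efficiency figures from the GPU Computing Efficiency Trend slide can be recomputed from the table values. The sketch below is only a cross-check: the dictionary keys and function names are made up for illustration, and it assumes the efficiency metrics are SP GFlops divided by max board power and by die size, which the slides do not state explicitly.

```python
# Values copied from the table: (HD 4870, HD 5870).
STATS = {
    "die_size_mm2": (263, 334),
    "transistors_millions": (956, 2150),
    "memory_bandwidth_gb_s": (115, 153),
    "sp_gflops": (1200, 2720),
    "dp_gflops": (240, 544),
    "alus": (800, 1600),
    "idle_power_w": (90, 27),
    "max_power_w": (160, 188),
}


def one_year_ratio(metric: str) -> float:
    """HD 5870 value divided by HD 4870 value."""
    old, new = STATS[metric]
    return new / old


def gflops_per_watt(gen: int) -> float:
    """Assumed definition: SP GFlops / max board power (gen 0 = HD 4870, 1 = HD 5870)."""
    return STATS["sp_gflops"][gen] / STATS["max_power_w"][gen]


def gflops_per_mm2(gen: int) -> float:
    """Assumed definition: SP GFlops / die size."""
    return STATS["sp_gflops"][gen] / STATS["die_size_mm2"][gen]


for metric in STATS:
    print(f"{metric}: {one_year_ratio(metric):.2f}x")
for gen, name in enumerate(("HD 4870", "HD 5870")):
    print(f"{name}: {gflops_per_watt(gen):.2f} GFLOPS/W, {gflops_per_mm2(gen):.2f} GFLOPS/mm2")
```

With these definitions the HD 4870 comes out at about 7.50 GFLOPS/W and 4.56 GFLOPS/mm2, and the HD 5870 at about 14.47 GFLOPS/W, matching labels on the efficiency chart. Two small differences remain: the SP and DP ratios compute to roughly 2.27x rather than the 2.25x listed, and the HD 5870 die size of 334 mm2 gives about 8.14 GFLOPS/mm2 rather than the 7.90 shown on the chart. The slides do not explain either difference.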

Yesterday’s Chip Designs Won’t Do

GPU

110 million transistors @150nm

2D and 3D gaming

Nascent video processing

CPU

105 million transistors @130nm

Compute tasks including video decode

Today We Are Evolving

TeraFLOPS-class GPU

2.15 billion transistors @40nm

3D OS

Multi-panel HD gaming

Full HD video and audio

Multi-core CPU

758 million transistors @45nm

Multi-tasking

Most compute tasks

Tomorrow Will Amaze

Significantly enhances active/resting battery life

High-bandwidth I/O

~1 billion transistors @32nm in one design

APU: Fusion of CPU & GPU compute power within one processor

AMD Fusion™ APUs Fill the Need

x86 CPU owns the Software World

Windows, MacOS and Linux franchises

Thousands of apps

Established programming and memory model

Mature tool chain

Extensive backward compatibility for applications and OSs

High barrier to entry

GPU Optimized for Modern Workloads

Enormous parallel computing capacity

Outstanding performance-per-watt-per-dollar

Very efficient hardware threading

SIMD architecture well matched to modern workloads: video, audio, graphics

Fusion APUs: Putting it all together

[Diagram. Axis labels: Programmer Accessibility (Unacceptable, Experts Only, Mainstream); Throughput Performance. Arrows: Microprocessor Advancement; GPU Advancement. Eras: Single-Thread Era, Multi-Core Era, Heterogeneous Systems Era. Other labels: System-level Programmable; Graphics Driver-based programs; OCL/DC Driver-based programs; Power-efficient Data Parallel Execution; High Performance Task Parallel Execution; Heterogeneous Computing; Fusion APU]

PC with Discrete GPU

Fusion APU Based PC

Two x86 Cores Tuned for Target Markets

“Bulldozer”

Performance & Scalability

Mainstream Client and Server Markets

“Bobcat”

Flexibility, Low Power & Low Cost

Low Power Markets

Lower Cost

Cloud Optimized

Heterogeneous Computing: Next-Generation Software Ecosystem

Hardware & Drivers: AMD Fusion™, Discrete CPUs/GPUs

OpenCL & Direct Compute

Tools: HLL compilers, Debuggers, Profilers

Middleware/Libraries: Video, Imaging, Math/Sciences, Physics

High Level Frameworks

End-user Applications

Advanced Optimizations & Load Balancing

Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages

Drive new features into industry standards

Increase ease of application development

Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific: Cross-platform limiters

• Apple Display Connector

• 3dfx Glide

• Nvidia CUDA

• Nvidia Cg

• Rambus

• Unified Display Interface

Vendor neutral: Cross-platform enablers

• Digital Visual Interface

• OpenCL™

• DirectX®

• Certified DP

• JEDEC

• OpenGL®

The Benefits of Fusion

Unparalleled processing capabilities in mobile form factors

Shared memory for the CPU and GPU

Eliminates copies, increasing performance

Reduces dispatch overhead

Lower latency from the GPU to memory

Power efficient design

Enables architectural innovations between CPU, GPU and the Memory System

Scalable architecture that can target a broad range of platforms from mobile to data center
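
To make the "eliminates copies" point concrete, here is a toy timing model. It is a sketch under strong assumptions: transfers and kernel execution do not overlap, dispatch overhead and latency are ignored, and every function name and input value is hypothetical rather than a measurement from the slides.

```python
def discrete_gpu_time_s(bytes_in: float, bytes_out: float,
                        kernel_time_s: float, link_gb_per_s: float) -> float:
    """Copy inputs over the link, run the kernel, copy results back."""
    transfer_s = (bytes_in + bytes_out) / (link_gb_per_s * 1e9)
    return transfer_s + kernel_time_s


def shared_memory_time_s(kernel_time_s: float) -> float:
    """With memory shared by CPU and GPU, the model has no copy term."""
    return kernel_time_s


def copy_overhead_fraction(bytes_in: float, bytes_out: float,
                           kernel_time_s: float, link_gb_per_s: float) -> float:
    """Fraction of the discrete-GPU time spent copying data."""
    total_s = discrete_gpu_time_s(bytes_in, bytes_out, kernel_time_s, link_gb_per_s)
    return (total_s - kernel_time_s) / total_s


# Hypothetical inputs: 1 GB in, 1 GB out, 8 GB/s link, 0.25 s kernel.
example_fraction = copy_overhead_fraction(1e9, 1e9, 0.25, 8.0)
print(f"Copy share of total time: {example_fraction:.0%}")
```

In this model the benefit of shared memory grows as the kernel gets shorter relative to the data it touches, which is the regime where communication overheads dominate.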

The Fusion Opportunity

A new architectural and performance balance point for computing

A new machine target for research

A high volume opportunity for new algorithms, new workloads and new applications

The deployment opportunity is especially strong in the consumer market place

Questions?

Backup slides

Thread Processors

5-way VLIW Architecture

4 Stream Cores and 1 Special Function Stream Core

Separate Branch Unit

All 5 cores co-issue

Scheduling across the cores is done by the compiler

Each core delivers a 32-bit result per clock

Thread Processor writes 5 results per clock

Stream Cores

4 32-bit FP MAD per clock

2 64-bit FP MUL or ADD per clock

1 64-bit FP MAD per clock

4 24-bit Int MUL or ADD per clock

Special functions

1 32-bit FP MAD per clock
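
The per-clock rates above can be tied back to the peak figures quoted elsewhere in this deck (1600 shader cores, 2720 SP GFlops, 544 DP GFlops). The sketch below assumes a MAD counts as two floating-point operations and infers the engine clock from the SP figure rather than taking it from the slides; all names are illustrative.

```python
SHADER_CORES = 1600              # from the HD 5870 compute architecture slide
CORES_PER_THREAD_PROCESSOR = 5   # 5-way VLIW: 4 stream cores + 1 special function core
FLOPS_PER_MAD = 2                # assumption: one multiply plus one add
SP_GFLOPS = 2720.0               # from the comparative stats table


def implied_clock_ghz() -> float:
    """Clock implied by the SP peak if every core issues one 32-bit MAD per clock."""
    return SP_GFLOPS / (SHADER_CORES * FLOPS_PER_MAD)


def peak_dp_gflops(clock_ghz: float) -> float:
    """DP peak with one 64-bit MAD per thread processor per clock."""
    thread_processors = SHADER_CORES // CORES_PER_THREAD_PROCESSOR
    return thread_processors * FLOPS_PER_MAD * clock_ghz


clock_ghz = implied_clock_ghz()
dp_gflops = peak_dp_gflops(clock_ghz)
print(f"Implied clock: {clock_ghz:.2f} GHz, DP peak: {dp_gflops:.0f} GFLOPS")
```

Under these assumptions the implied clock is 0.85 GHz and the DP peak comes out at 544 GFLOPS, which agrees with the DP GFlops entry in the comparative stats table.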

SIMD Engines

Diagram shows 2 SIMD Engines

Each SIMD Unit includes:

16 Thread Processors (80 shader cores) + 32KB Local Data Share

Its own Thread Sequencer which operates a shared set of threads

A dedicated fetch unit with an 8KB L1 cache
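
The per-SIMD figures can be scaled up to chip-level totals. This sketch assumes all 20 SIMD engines mentioned on the HD 5870 compute architecture slide are identical; the constant names are illustrative.

```python
SIMD_ENGINES = 20                  # from the compute architecture slide
THREAD_PROCESSORS_PER_SIMD = 16
CORES_PER_THREAD_PROCESSOR = 5     # 5-way VLIW thread processor
LDS_KB_PER_SIMD = 32               # Local Data Share per SIMD
L1_KB_PER_SIMD = 8                 # L1 cache in each SIMD's fetch unit

shader_cores_per_simd = THREAD_PROCESSORS_PER_SIMD * CORES_PER_THREAD_PROCESSOR
total_shader_cores = SIMD_ENGINES * shader_cores_per_simd
total_lds_kb = SIMD_ENGINES * LDS_KB_PER_SIMD
total_l1_kb = SIMD_ENGINES * L1_KB_PER_SIMD

print(f"Shader cores per SIMD: {shader_cores_per_simd}")
print(f"Total shader cores: {total_shader_cores}")
print(f"Total Local Data Share: {total_lds_kb} KB, total L1: {total_l1_kb} KB")
```

The totals reproduce the 80 shader cores per SIMD and 1600 shader cores overall quoted in the deck; the aggregate cache sizes are derived values, not figures from the slides.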

TeraScale 2 Architecture – Radeon HD 5870

OpenCL™ and DirectX® 11 DirectCompute

How will developers choose?

DirectX® 11 DirectCompute

Easiest path to add compute capabilities to existing DirectX applications

Windows Vista® and Windows® 7 only

OpenCL™

Ideal path for new applications porting to the GPU for the first time

True multiplatform: Windows®, Linux®, MacOS

Natural programming without dealing with a graphics API