sam naffziger vlsi keynote

1 VLSI Technology Symposium | June 2011 | Public

TECHNOLOGY IMPACTS FROM THE

NEW WAVE OF ARCHITECTURES

FOR MEDIA-RICH WORKLOADS

Samuel Naffziger

AMD Corporate Fellow

June 14th, 2011

VLSI TECHNOLOGY SYMPOSIUM 2011


Introduction

The new workloads and demands on computation

Characteristics of serial and parallel computation

The Accelerated Processing Unit (APU) architecture

APU architecture implications for technology

Summary

Outline

3 VLSI Technology Symposium| June 2011 | Public

Now: Parallel/Data-Dense

16:9 @ 7 megapixels

HD video flipcams, phones,

webcams (1GB)

3D Internet apps and HD video

online, social networking w/HD files

3D Blu-ray HD

Multi-touch, facial/gesture/voice

recognition + mouse & keyboard

All day computing (8+ Hours)

Immersive and

interactive performance

Wo

rklo

ad

s

The Big Experience/Small Form Factor Paradox

Mid 2000s

4:3 @ 1.2 megapixels

Digital cameras, SD webcams (1-5 MB files)

WWW and streaming SD video

DVDs

Mouse & keyboard

3-4 Hours

Standard-definition

Internet

Technology Mid 1990s

Display4:3 @ 0.5 megapixel

ContentEmail, film & scanners

OnlineText and low res photos

Multimedia CD-ROM

InterfaceMouse &

keyboard

Battery Life*

1-2 Hours

Fo

rm

Fa

cto

rs

Early Internet and Multimedia

Experiences

*Resting battery life as measured with industry standard tests.

4 VLSI Technology Symposium| June 2011 | Public

Focusing on the experiences that matter

Email

Web browsing

Office productivity

Listen to music

Online chat

Watching online video

Photo editing

Personal finances

Taking notes

Online web-based games

Social networking

Calendar management

Locally installed games

Educational apps

Video editing

Internet phone

Consumer PC Usage

0% 20% 40% 60% 80% 100%

New Experiences

ImmersiveGaming

Simplified Content Management

Accelerated Internet and HD Video

Source: IDC's 2009 Consumer PC Buyer Survey


People Prefer Visual Communications

Visual PerceptionVerbal Perception

Words are processed

at only 150 words

per minute

Pictures and video

are processed 400 to

2000 times faster

Augmenting Today’s Content:

Rich visual experiences

Multiple content sources

Multi-Display

Stereo 3D


Communicating• IM, Email, Facebook

• Video Chat, NetMeeting

Gaming• Mainstream Games

• 3D games

The Emerging World of New Data Rich Applications

ArcSoft

TotalMedia®

Theatre 5

ArcSoft

Media

Converter® 7

CyberLink

Media

Espresso 6

CyberLink

Power

Director 9

Corel

VideoStudio

Pro

Corel

Digital Studio

2010

Internet

Explorer 9

Microsoft®

PowerPoint® 2010

Windows

Live

Essentials

Codemasters

F1 2010Nuvixa

Be Present

ViVu

Desktop

Telepresence

Viewdle

Uploader

Using photos• Viewing& Sharing

• Search, Recognition, Labeling?

• Advanced Editing

Using video• DVD, BLU-RAY™, HD

• Search, Recognition, Labeling

• Advanced Editing & Mixing

The Ultimate Visual

Experience™Fast Rich Web content, favorite HD

Movies, games with realistic

graphics

Music• Listening and Sharing

• Editing and Mixing

• Composing and compositing

7 | 2011 VLSI Symposium| June 2011 | Public

New Workload Examples: Changing Consumer Behavior

24 hoursof video

uploaded to YouTube

every minute

50 million +digital media files

added to personal content libraries

every day

Approximately

9 billionvideo files owned are

high-definition

1000 images

are uploaded to Facebook

every second


What Are the Implications for Computation?

Insatiable demand for high

bandwidth processing

–Visual image processing

–Natural user interfaces

–Massive data mining for

associative searches,

recognition

Some of these compute needs

can be offloaded to servers,

some must be done on the

mobile device

– Similar compute needs and

massive growth in both

spaces

How must CPU

architecture change to

deal with these trends?


Parallel and Serial

Computation

Transistors

(thousands)

Single-threadPerformance

(SpecINT)

Frequency

(MHz)

Typical Power(Watts)

Number ofCores

Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten

35 Years of Microprocessor Trend Data

i=0i++

load x(i)fmulstore

cmp i (16)bc

…

Loops, branches and conditional evaluation

Serial Code

Conditionalbranches

0

1000

2000

3000

4000

5000

6000

7000

8000

2005 2006 2007 2008 2009 2010 2011 2012 2013 2014

Pe

ak G

Flo

ps

(SP

FP)

Years

GFLOPs Trend

GPU

CPU

AMD projections

i=0i++

load x(i)fmulstore

cmp i (1000000)bc

……

……

i,j=0i++j++

load x(i,j)fmulstore

cmp j (100000)bc

cmp i (100000)bc

2D array representingvery large dataset

Loop 1M times for 1M pieces

of data

DataParallel Code


GPU/CPU Design Differences

Lots of instructions little data

• Out of order exec, Branch prediction

• Few hardware threads

Weak performance gains through density

Maximize speed with fast devices

Few instructions lots of data

• Single Instruction Multiple Data

• Extensive fine-threading capability

Nearly linear performance gains with density

Maximize density with cool devices

CPU (Serial compute) GPU (parallel compute)


Three Eras of Processor Performance

Single-Core Era

Sin

gle

-th

rea

d P

erf

orm

an

ce

?

Time

we are

here

o

Enabled by:

Moore’s Law

Voltage & Process Scaling

Micro Architecture

Constrained by:

Power

Complexity

Multi-Core Era

Th

rou

gh

put P

erf

orm

an

ce

Time

(# of Processors)

we are

here

o

Enabled by:

Moore’s Law

Desire for Throughput

20 years of SMP arch

Constrained by:

Power

Parallel SW availability

Scalability

HeterogeneousSystems Era

Ta

rge

ted

Ap

plic

atio

n

Pe

rfo

rma

nce

Time

(Data-parallel exploitation)

we are

here

o

Enabled by:

Moore’s Law

Abundant data parallelism

Power efficient GPUs

Temporarily constrained by:

Programming models

Communication overheads

Workloads


Heterogeneous Computing with an APU Architecture

CPU

Cores

GPU UVD

SB Functions

~7 GB/sec

~17 GB/sec

UNB

MC

~17 GB/sec

DDR3 DIMM

MemoryCPU Chip

FCH Chip

PCIe®

Bandwidth pinch points and latency

hold back the GPU capabilities

Integration Provides Improvement

Eliminate power and latency of extra chip

crossing

3X bandwidth between GPU and Memory!

Same sized GPU is substantially more effective

Power efficient, advanced technology for both

CPU and GPU

Graphics requires memory

BW to bring full capabilities

to life

~27 GB/sec

~27 GB/sec

DDR3 DIMM

Memory

APU Chip

PCIe

2010 IGP-based (“Danube”) Platform 2011 APU-based (“Llano”) Platform

GPU

CPU

Cores

UVD

UN

B / M

C

GPU

Optional


The Challenges of Integration

Thick, fast

metal

Big devices

Dense, thin

metal, small

devices

Perf

orm

ance

CPU flop

area = 2.14

GPU flop

area = 1.0

Flop count for

4 Llano CPU

cores=0.66M

Flop count for

Llano GPU

=3.5M

Density


With the 20nm node, even local

metal will be seeing large RC

increase compromises more

difficult

How to Balance the Metal Stack?

Cu Resistivity without barrier

With barrier

1.5

1.6

1.7

1.8

1.9

2

2.1

2.2

2.3

2.4

2.5

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Line Width (um)

Re

sis

tivit

y (

uo

hm

-cm

)

Add metal layers?

Thin, dense layers for the

GPU

Thick, low resistance

layers for the CPU

Cost issues?

Via resistance?

Technology improvements in

BEOL are required

Perf

orm

ance

Density


Device OptimizationP

erf

orm

ance CPU

GPU

To achieve breakthrough APU

performance, the Llano GPU

has ~5X the flops and ~5X the

device count of the CPUs

0

0.2

0.4

0.6

0.8

1

1.2

Llano CPU Llano GPU

APU Vt Mix

LC-HVT

HVT

RVT

LC-RVT

LVTDevice Ioff

Broader span of

devices required

0

50

100

150

200

250

175 50 20 4.3 2.7 0.5 0.4

FP

G

Ioff (nA/um room temp)

FPG vs. Ioff

desired device

range

Speed vs. Leakage

RO

sp

ee

d

LVT LC-RVTRVT HVT LC-HVT

A broader

device

suite is

required


85.0

90.0

95.0

100.0

105.0

110.0

Balanced workload

85.0

90.0

95.0

100.0

105.0

110.0

GPU-centric data

parallel workload

85.0

90.0

95.0

100.0

105.0

110.0

CPU-centric serial

workload

Power Transfers

Voltage range is critical to enabling

the efficient power transfers that

make for compelling APU

performance


Operating Voltage Range

Operating voltage

requirements:

Low voltage necessary for

power efficiency

High voltage necessary for

a snappy user experience

enabled by turbo mode0

0.5

1

1.5

2

2.5

0.7V 0.8V 0.9V 1.0V 1.1V 1.2V 1.3V

E/op vs. V


0.952V

0.915V

0.886V

0.805V

0.700V

0.750V

0.800V

0.850V

0.900V

0.950V

1.000V

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

40nm 28nm 20nm 14nm

No

min

al V

olt

ag

e

Power Density Limited GPU40nm to 14nm

Frequency

Power Density

Voltage

Operating Voltage Challenges

To maintain cost effective

performance growth with

technology node, the GPU

must:

– Hold power density

constant

– Exploit density gains to add

compute units

This necessarily drives

operating voltage down

This would be good for energy

efficiency except …

– Variation impacts are much

greater at low voltage

0MHz

100MHz

200MHz

300MHz

400MHz

500MHz

600MHz

700MHz

800MHz

900MHz

1000MHz

0.85V 0.90V 0.95V 1.00V 1.10V 1.15V

Juniper Frequency Data

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

Frequency

spread increases

at low voltage

40nm GPU Frequency Data


The Operating Voltage Challenge

Many barriers to maintaining both high

and low voltage as technology scales

TDDB vs. SCE control

ULK breakdown vs. denser pitches

Variation control

85.0

90.0

95.0

100.0

105.0

110.0

BOX

Poly

Fin

FD devices should enable

maintaining the functional

range for a generation or two

Will turbo modes be too

compromised?

What’s next?


3D Integration to the Rescue?

ThroughSilicon Vias

(TSVs) CPU Die

Metal Layers

GPU Die

Metal Layers

Analog Die (SB, Power)

Metal Layers

Metal Layers

TIM (Thermal Interface Material)

Heat Sink

DRAMMicro-bumps

Package Substrate

DRAM

South

Bridge

Stacking offers many attractive benefits

Higher bandwidth to local memory

Enables parallel and serial compute die to be in their own

separate optimized technology – interconnect speed vs.

density, device optimization etc.

Allows IO and southbridge content to remain in older, more

analog-friendly technology


3D Integration Challenges

Economical 3D stacking in high volume manufacturing presents

many challenges

Benefits must exceed the additional costs of TSVs, and yield fallout

Logistics of testing and assembling die from multiple sources can be

immense

Countless mechanical and thermal issues to solve in high volume mfg

Clearly 3D provides

compelling solutions to

many problems, but the

barriers to entry mean

heavy R&D $$ and

partnerships required

ThroughSilicon

Vias(TSVs)

CPU DieMetal Layers

GPU DieMetal Layers

Analog Die (SB, Power)Metal Layers

Metal Layers

TIM (Thermal Interface Material)

Heat Sink

DRAMDie to

Die Vias

Package Substrate

DRAM

South

Bridge


Summary

Insatiable demand for high bandwidth computation

–Visual image processing

–Natural user interfaces

–Massive data mining for associate searches, recognition

Some of these compute needs can be offloaded to servers,

some must be done on the mobile device

–Similar compute needs and massive growth in both spaces

–Combined serial and parallel computation architectures are

key in both spaces

Huge technology challenges to meeting this opportunity

– Interconnect scaling is hitting a wall that must be overcome

–A broad device suite is necessary that operates efficiently at

low voltage while enabling high speed for response time

–3D integration offers a promising long term solution

sam naffziger vlsi keynote

Documents