sam naffziger vlsi keynote
TRANSCRIPT
1 VLSI Technology Symposium | June 2011 | Public
TECHNOLOGY IMPACTS FROM THE
NEW WAVE OF ARCHITECTURES
FOR MEDIA-RICH WORKLOADS
Samuel Naffziger
AMD Corporate Fellow
June 14th, 2011
VLSI TECHNOLOGY SYMPOSIUM 2011
2 VLSI Technology Symposium | June 2011 | Public
Introduction
The new workloads and demands on computation
Characteristics of serial and parallel computation
The Accelerated Processing Unit (APU) architecture
APU architecture implications for technology
Summary
Outline
3 VLSI Technology Symposium| June 2011 | Public
Now: Parallel/Data-Dense
16:9 @ 7 megapixels
HD video flipcams, phones,
webcams (1GB)
3D Internet apps and HD video
online, social networking w/HD files
3D Blu-ray HD
Multi-touch, facial/gesture/voice
recognition + mouse & keyboard
All day computing (8+ Hours)
Immersive and
interactive performance
Wo
rklo
ad
s
The Big Experience/Small Form Factor Paradox
Mid 2000s
4:3 @ 1.2 megapixels
Digital cameras, SD webcams (1-5 MB files)
WWW and streaming SD video
DVDs
Mouse & keyboard
3-4 Hours
Standard-definition
Internet
Technology Mid 1990s
Display4:3 @ 0.5 megapixel
ContentEmail, film & scanners
OnlineText and low res photos
Multimedia CD-ROM
InterfaceMouse &
keyboard
Battery Life*
1-2 Hours
Fo
rm
Fa
cto
rs
Early Internet and Multimedia
Experiences
*Resting battery life as measured with industry standard tests.
4 VLSI Technology Symposium| June 2011 | Public
Focusing on the experiences that matter
Web browsing
Office productivity
Listen to music
Online chat
Watching online video
Photo editing
Personal finances
Taking notes
Online web-based games
Social networking
Calendar management
Locally installed games
Educational apps
Video editing
Internet phone
Consumer PC Usage
0% 20% 40% 60% 80% 100%
New Experiences
ImmersiveGaming
Simplified Content Management
Accelerated Internet and HD Video
Source: IDC's 2009 Consumer PC Buyer Survey
5 VLSI Technology Symposium | June 2011 | Public
People Prefer Visual Communications
Visual PerceptionVerbal Perception
Words are processed
at only 150 words
per minute
Pictures and video
are processed 400 to
2000 times faster
Augmenting Today’s Content:
Rich visual experiences
Multiple content sources
Multi-Display
Stereo 3D
6 VLSI Technology Symposium | June 2011 | Public
Communicating• IM, Email, Facebook
• Video Chat, NetMeeting
Gaming• Mainstream Games
• 3D games
The Emerging World of New Data Rich Applications
ArcSoft
TotalMedia®
Theatre 5
ArcSoft
Media
Converter® 7
CyberLink
Media
Espresso 6
CyberLink
Power
Director 9
Corel
VideoStudio
Pro
Corel
Digital Studio
2010
Internet
Explorer 9
Microsoft®
PowerPoint® 2010
Windows
Live
Essentials
Codemasters
F1 2010Nuvixa
Be Present
ViVu
Desktop
Telepresence
Viewdle
Uploader
Using photos• Viewing& Sharing
• Search, Recognition, Labeling?
• Advanced Editing
Using video• DVD, BLU-RAY™, HD
• Search, Recognition, Labeling
• Advanced Editing & Mixing
The Ultimate Visual
Experience™Fast Rich Web content, favorite HD
Movies, games with realistic
graphics
Music• Listening and Sharing
• Editing and Mixing
• Composing and compositing
7 | 2011 VLSI Symposium| June 2011 | Public
New Workload Examples: Changing Consumer Behavior
24 hoursof video
uploaded to YouTube
every minute
50 million +digital media files
added to personal content libraries
every day
Approximately
9 billionvideo files owned are
high-definition
1000 images
are uploaded to Facebook
every second
8 VLSI Technology Symposium | June 2011 | Public
What Are the Implications for Computation?
Insatiable demand for high
bandwidth processing
–Visual image processing
–Natural user interfaces
–Massive data mining for
associative searches,
recognition
Some of these compute needs
can be offloaded to servers,
some must be done on the
mobile device
– Similar compute needs and
massive growth in both
spaces
How must CPU
architecture change to
deal with these trends?
9 VLSI Technology Symposium | June 2011 | Public
Parallel and Serial
Computation
Transistors
(thousands)
Single-threadPerformance
(SpecINT)
Frequency
(MHz)
Typical Power(Watts)
Number ofCores
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten
35 Years of Microprocessor Trend Data
i=0i++
load x(i)fmulstore
cmp i (16)bc
…
Loops, branches and conditional evaluation
Serial Code
Conditionalbranches
0
1000
2000
3000
4000
5000
6000
7000
8000
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Pe
ak G
Flo
ps
(SP
FP)
Years
GFLOPs Trend
GPU
CPU
AMD projections
i=0i++
load x(i)fmulstore
cmp i (1000000)bc
……
……
i,j=0i++j++
load x(i,j)fmulstore
cmp j (100000)bc
cmp i (100000)bc
2D array representingvery large dataset
Loop 1M times for 1M pieces
of data
DataParallel Code
10 VLSI Technology Symposium | June 2011 | Public
GPU/CPU Design Differences
Lots of instructions little data
• Out of order exec, Branch prediction
• Few hardware threads
Weak performance gains through density
Maximize speed with fast devices
Few instructions lots of data
• Single Instruction Multiple Data
• Extensive fine-threading capability
Nearly linear performance gains with density
Maximize density with cool devices
CPU (Serial compute) GPU (parallel compute)
11 VLSI Technology Symposium | June 2011 | Public
Three Eras of Processor Performance
Single-Core Era
Sin
gle
-th
rea
d P
erf
orm
an
ce
?
Time
we are
here
o
Enabled by:
Moore’s Law
Voltage & Process Scaling
Micro Architecture
Constrained by:
Power
Complexity
Multi-Core Era
Th
rou
gh
put P
erf
orm
an
ce
Time
(# of Processors)
we are
here
o
Enabled by:
Moore’s Law
Desire for Throughput
20 years of SMP arch
Constrained by:
Power
Parallel SW availability
Scalability
HeterogeneousSystems Era
Ta
rge
ted
Ap
plic
atio
n
Pe
rfo
rma
nce
Time
(Data-parallel exploitation)
we are
here
o
Enabled by:
Moore’s Law
Abundant data parallelism
Power efficient GPUs
Temporarily constrained by:
Programming models
Communication overheads
Workloads
12 VLSI Technology Symposium | June 2011 | Public
Heterogeneous Computing with an APU Architecture
CPU
Cores
GPU UVD
SB Functions
~7 GB/sec
~17 GB/sec
UNB
MC
~17 GB/sec
DDR3 DIMM
MemoryCPU Chip
FCH Chip
PCIe®
Bandwidth pinch points and latency
hold back the GPU capabilities
Integration Provides Improvement
Eliminate power and latency of extra chip
crossing
3X bandwidth between GPU and Memory!
Same sized GPU is substantially more effective
Power efficient, advanced technology for both
CPU and GPU
Graphics requires memory
BW to bring full capabilities
to life
~27 GB/sec
~27 GB/sec
DDR3 DIMM
Memory
APU Chip
PCIe
2010 IGP-based (“Danube”) Platform 2011 APU-based (“Llano”) Platform
GPU
CPU
Cores
UVD
UN
B / M
C
GPU
Optional
13 VLSI Technology Symposium | June 2011 | Public
The Challenges of Integration
Thick, fast
metal
Big devices
Dense, thin
metal, small
devices
Perf
orm
ance
CPU flop
area = 2.14
GPU flop
area = 1.0
Flop count for
4 Llano CPU
cores=0.66M
Flop count for
Llano GPU
=3.5M
Density
14 VLSI Technology Symposium | June 2011 | Public
With the 20nm node, even local
metal will be seeing large RC
increase compromises more
difficult
How to Balance the Metal Stack?
Cu Resistivity without barrier
With barrier
1.5
1.6
1.7
1.8
1.9
2
2.1
2.2
2.3
2.4
2.5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Line Width (um)
Re
sis
tivit
y (
uo
hm
-cm
)
Add metal layers?
Thin, dense layers for the
GPU
Thick, low resistance
layers for the CPU
Cost issues?
Via resistance?
Technology improvements in
BEOL are required
Perf
orm
ance
Density
15 VLSI Technology Symposium | June 2011 | Public
Device OptimizationP
erf
orm
ance CPU
GPU
To achieve breakthrough APU
performance, the Llano GPU
has ~5X the flops and ~5X the
device count of the CPUs
0
0.2
0.4
0.6
0.8
1
1.2
Llano CPU Llano GPU
APU Vt Mix
LC-HVT
HVT
RVT
LC-RVT
LVTDevice Ioff
Broader span of
devices required
0
50
100
150
200
250
175 50 20 4.3 2.7 0.5 0.4
FP
G
Ioff (nA/um room temp)
FPG vs. Ioff
desired device
range
Speed vs. Leakage
RO
sp
ee
d
LVT LC-RVTRVT HVT LC-HVT
A broader
device
suite is
required
16 VLSI Technology Symposium | June 2011 | Public
85.0
90.0
95.0
100.0
105.0
110.0
Balanced workload
85.0
90.0
95.0
100.0
105.0
110.0
GPU-centric data
parallel workload
85.0
90.0
95.0
100.0
105.0
110.0
CPU-centric serial
workload
Power Transfers
Voltage range is critical to enabling
the efficient power transfers that
make for compelling APU
performance
17 VLSI Technology Symposium | June 2011 | Public
Operating Voltage Range
Operating voltage
requirements:
Low voltage necessary for
power efficiency
High voltage necessary for
a snappy user experience
enabled by turbo mode0
0.5
1
1.5
2
2.5
0.7V 0.8V 0.9V 1.0V 1.1V 1.2V 1.3V
E/op vs. V
18 VLSI Technology Symposium | June 2011 | Public
0.952V
0.915V
0.886V
0.805V
0.700V
0.750V
0.800V
0.850V
0.900V
0.950V
1.000V
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
40nm 28nm 20nm 14nm
No
min
al V
olt
ag
e
Power Density Limited GPU40nm to 14nm
Frequency
Power Density
Voltage
Operating Voltage Challenges
To maintain cost effective
performance growth with
technology node, the GPU
must:
– Hold power density
constant
– Exploit density gains to add
compute units
This necessarily drives
operating voltage down
This would be good for energy
efficiency except …
– Variation impacts are much
greater at low voltage
0MHz
100MHz
200MHz
300MHz
400MHz
500MHz
600MHz
700MHz
800MHz
900MHz
1000MHz
0.85V 0.90V 0.95V 1.00V 1.10V 1.15V
Juniper Frequency Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Frequency
spread increases
at low voltage
40nm GPU Frequency Data
19 VLSI Technology Symposium | June 2011 | Public
The Operating Voltage Challenge
Many barriers to maintaining both high
and low voltage as technology scales
TDDB vs. SCE control
ULK breakdown vs. denser pitches
Variation control
85.0
90.0
95.0
100.0
105.0
110.0
BOX
Poly
Fin
FD devices should enable
maintaining the functional
range for a generation or two
Will turbo modes be too
compromised?
What’s next?
20 VLSI Technology Symposium | June 2011 | Public
3D Integration to the Rescue?
ThroughSilicon Vias
(TSVs) CPU Die
Metal Layers
GPU Die
Metal Layers
Analog Die (SB, Power)
Metal Layers
Metal Layers
TIM (Thermal Interface Material)
Heat Sink
DRAMMicro-bumps
Package Substrate
DRAM
South
Bridge
Stacking offers many attractive benefits
Higher bandwidth to local memory
Enables parallel and serial compute die to be in their own
separate optimized technology – interconnect speed vs.
density, device optimization etc.
Allows IO and southbridge content to remain in older, more
analog-friendly technology
21 VLSI Technology Symposium | June 2011 | Public
3D Integration Challenges
Economical 3D stacking in high volume manufacturing presents
many challenges
Benefits must exceed the additional costs of TSVs, and yield fallout
Logistics of testing and assembling die from multiple sources can be
immense
Countless mechanical and thermal issues to solve in high volume mfg
Clearly 3D provides
compelling solutions to
many problems, but the
barriers to entry mean
heavy R&D $$ and
partnerships required
ThroughSilicon
Vias(TSVs)
CPU DieMetal Layers
GPU DieMetal Layers
Analog Die (SB, Power)Metal Layers
Metal Layers
TIM (Thermal Interface Material)
Heat Sink
DRAMDie to
Die Vias
Package Substrate
DRAM
South
Bridge
22 VLSI Technology Symposium | June 2011 | Public
Summary
Insatiable demand for high bandwidth computation
–Visual image processing
–Natural user interfaces
–Massive data mining for associate searches, recognition
Some of these compute needs can be offloaded to servers,
some must be done on the mobile device
–Similar compute needs and massive growth in both spaces
–Combined serial and parallel computation architectures are
key in both spaces
Huge technology challenges to meeting this opportunity
– Interconnect scaling is hitting a wall that must be overcome
–A broad device suite is necessary that operates efficiently at
low voltage while enabling high speed for response time
–3D integration offers a promising long term solution