technology impacts from the new wave of architectures for media-rich workloads
DESCRIPTION
VLSI Technology Symposium 2011. Technology Impacts from the New Wave of Architectures for Media-rich Workloads. Samuel Naffziger AMD Corporate Fellow June 14 th , 2011. Outline. Introduction The new workloads and demands on computation Characteristics of serial and parallel computation - PowerPoint PPT PresentationTRANSCRIPT
1VLSI Technology Symposium | June 2011 | Public
TECHNOLOGY IMPACTS FROM THE NEW WAVE OF ARCHITECTURES FOR MEDIA-RICH WORKLOADS
Samuel NaffzigerAMD Corporate Fellow
June 14th, 2011
VLSI TECHNOLOGY SYMPOSIUM 2011
2VLSI Technology Symposium | June 2011 | Public
Introduction
The new workloads and demands on computation
Characteristics of serial and parallel computation
The Accelerated Processing Unit (APU) architecture
APU architecture implications for technology
Summary
Outline
3VLSI Technology Symposium| June 2011 | Public
Now: Parallel/Data-Dense
16:9 @ 7 megapixels
HD video flipcams, phones, webcams (1GB)
3D Internet apps and HD video online, social networking w/HD files
3D Blu-ray HD
Multi-touch, facial/gesture/voice recognition + mouse & keyboard
All day computing (8+ Hours)
Immersive and interactive performance
Wor
kloa
ds
The Big Experience/Small Form Factor ParadoxMid 2000s4:3 @ 1.2 megapixelsDigital cameras, SD webcams (1-5 MB files)WWW and streaming SD video
DVDs
Mouse & keyboard
3-4 HoursStandard-definition
Internet
Technology Mid 1990s
Display 4:3 @ 0.5 megapixel
Content Email, film & scanners
Online Text and low res photos
Multimedia CD-ROM
Interface Mouse & keyboard
Battery Life* 1-2 Hours
Form
Fact
ors
Early Internet and Multimedia Experiences
*Resting battery life as measured with industry standard tests.
4VLSI Technology Symposium| June 2011 | Public
Focusing on the experiences that matter
EmailWeb browsing
Office productivityListen to music
Online chatWatching online video
Photo editingPersonal finances
Taking notesOnline web-based
gamesSocial networking
Calendar managementLocally installed games
Educational appsVideo editing
Internet phone
Consumer PC Usage
0% 20% 40% 60% 80% 100%
New Experiences
ImmersiveGaming
Simplified Content Management
Accelerated Internet and HD Video
Source: IDC's 2009 Consumer PC Buyer Survey
5VLSI Technology Symposium | June 2011 | Public
People Prefer Visual Communications
Visual PerceptionVerbal Perception
Words are processedat only 150 wordsper minute
Pictures and videoare processed 400 to
2000 times faster
Augmenting Today’s Content:
Rich visual experiences Multiple content sources Multi-Display Stereo 3D
6VLSI Technology Symposium | June 2011 | Public
Communicating• IM, Email, Facebook• Video Chat, NetMeeting
Gaming• Mainstream Games• 3D games
The Emerging World of New Data Rich Applications
ArcSoft TotalMedia®
Theatre 5
ArcSoft Media
Converter® 7
CyberLink Media
Espresso 6
CyberLink Power
Director 9
Corel VideoStudio Pro
Corel Digital Studio 2010
Internet Explorer 9
Microsoft® PowerPoint® 2010
Windows Live
Essentials
CodemastersF1 2010Nuvixa
Be Present
ViVu Desktop
TelepresenceViewdle
Uploader
Using photos• Viewing& Sharing• Search, Recognition, Labeling? • Advanced Editing
Using video• DVD, BLU-RAY™, HD • Search, Recognition, Labeling • Advanced Editing & Mixing
The Ultimate Visual Experience™
Fast Rich Web content, favorite HD Movies, games with realistic
graphics
Music• Listening and Sharing• Editing and Mixing• Composing and compositing
7| 2011 VLSI Symposium| June 2011 | Public
New Workload Examples: Changing Consumer Behavior
24 hoursof video
uploaded to YouTubeevery minute
50 million +digital media files
added to personal content librariesevery day
Approximately
9 billionvideo files owned are
high-definition
1000 images
are uploaded to Facebookevery second
8VLSI Technology Symposium | June 2011 | Public
What Are the Implications for Computation?Insatiable demand for high bandwidth processing–Visual image processing–Natural user interfaces–Massive data mining for
associative searches, recognition
Some of these compute needs can be offloaded to servers, some must be done on the mobile device– Similar compute needs and
massive growth in both spaces
How must CPU architecture change to deal with these trends?
9VLSI Technology Symposium | June 2011 | Public
Parallel and Serial Computation
Transistors(thousands)
Single-threadPerformance(SpecINT)
Frequency(MHz)
Typical Power(Watts)
Number ofCores
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten
35 Years of Microprocessor Trend Data
i=0i++
load x(i)fmulstore
cmp i (16)bc
…
Loops, branches and conditional evaluation
Serial Code
Conditionalbranches
0
1000
2000
3000
4000
5000
6000
7000
8000
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Peak
GFl
ops (
SPFP
)
Years
GFLOPs Trend
GPU
CPU
AMD projections
i=0i++
load x(i)fmulstore
cmp i (1000000)bc
……
……
i,j=0i++j++
load x(i,j)fmulstore
cmp j (100000)bc
cmp i (100000)bc
2D array representingvery large
dataset
Loop 1M times for 1M pieces
of data
DataParallel Code
10VLSI Technology Symposium | June 2011 | Public
GPU/CPU Design Differences
Lots of instructions little data
• Out of order exec, Branch prediction
• Few hardware threads
Weak performance gains through density
Maximize speed with fast devices
Few instructions lots of data
• Single Instruction Multiple Data• Extensive fine-threading capability
Nearly linear performance gains with density
Maximize density with cool devices
CPU (Serial compute) GPU (parallel compute)
11VLSI Technology Symposium | June 2011 | Public
Three Eras of Processor Performance
Single-Core Era
Sin
gle-
thre
ad P
erfo
rman
ce
?
Time
we arehere
o
Enabled by: Moore’s Law Voltage & Process Scaling Micro Architecture
Constrained by:PowerComplexity
Multi-Core Era
Thro
ughp
ut P
erfo
rman
ce
Time(# of Processors)
we arehere
o
Enabled by: Moore’s Law Desire for Throughput 20 years of SMP arch
Constrained by:PowerParallel SW availabilityScalability
HeterogeneousSystems Era
Targ
eted
App
licat
ion
Per
form
ance
Time(Data-parallel exploitation)
we arehere
o
Enabled by: Moore’s Law Abundant data parallelism Power efficient GPUs
Temporarily constrained by:Programming modelsCommunication overheadsWorkloads
12VLSI Technology Symposium | June 2011 | Public
Heterogeneous Computing with an APU Architecture
CPU Cores
GPU UVD
SB Functions
~7 GB/sec
~17 GB/sec
UNB
MC
~17 GB/sec
DDR3 DIMMMemory
CPU Chip
FCH Chip
PCIe®
Bandwidth pinch points and latency hold back the GPU capabilities
Integration Provides Improvement Eliminate power and latency of extra chip
crossing 3X bandwidth between GPU and Memory! Same sized GPU is substantially more effective Power efficient, advanced technology for both
CPU and GPU
Graphics requires memory BW to bring full capabilities
to life~27 GB/sec
~27 GB/sec
DDR3 DIMMMemory
APU Chip
PCIe
2010 IGP-based (“Danube”) Platform 2011 APU-based (“Llano”) Platform
GPU
CPU Cores
UVD
UN
B / M
C
GPU
Optional
13VLSI Technology Symposium | June 2011 | Public
The Challenges of Integration
Thick, fast metalBig devices
Dense, thin metal, small devices
Per
form
ance
CPU flop area = 2.14
GPU flop area = 1.0
CPU GPU
Flop count for 4 Llano CPU cores=0.66M
Flop count for Llano GPU =3.5M
Density
14VLSI Technology Symposium | June 2011 | Public
With the 20nm node, even local metal will be seeing large RC increase compromises more difficult
How to Balance the Metal Stack?
Cu Resistivity without barrierWith barrier
1.51.61.71.81.9
22.12.22.32.42.5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Line Width (um)
Res
istiv
ity (u
ohm
-cm
)
Add metal layers? Thin, dense layers for the
GPU Thick, low resistance layers
for the CPU Cost issues? Via resistance?Technology improvements in
BEOL are required
Per
form
ance
CPU GPU
Density
15VLSI Technology Symposium | June 2011 | Public
Device OptimizationP
erfo
rman
ce
CPU GPU
To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs
0
0.2
0.4
0.6
0.8
1
1.2
Llano CPU Llano GPU
APU Vt Mix
LC-HVT
HVT
RVT
LC-RVT
LVTDevice Ioff
Broader span of devices required
0
50
100
150
200
250
175 50 20 4.3 2.7 0.5 0.4
FPG
Ioff (nA/um room temp)
FPG vs. Ioff
CPU
GPU
desired device range
Speed vs. Leakage
RO
spe
ed
LVT LC-RVTRVT HVT LC-HVT
A broader device suite is required
16VLSI Technology Symposium | June 2011 | Public
85.0
90.0
95.0
100.0
105.0
110.0
Balanced workload
85.0
90.0
95.0
100.0
105.0
110.0
GPU-centric data parallel workload
85.0
90.0
95.0
100.0
105.0
110.0
CPU-centric serial workload
Power Transfers
GPU
UVD
UNB / MC
CPU
Core+
cache
CPU
Core+
cache
DDR IO
PCIe IO
CPU
Core+
cache
CPU
Core+
cache
Tem
pera
ture
Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance
17VLSI Technology Symposium | June 2011 | Public
Operating Voltage Range
Operating voltage requirements:
Low voltage necessary for power efficiency
High voltage necessary for a snappy user experience enabled by turbo mode
0
0.5
1
1.5
2
2.5
0.7V 0.8V 0.9V 1.0V 1.1V 1.2V 1.3V
E/op vs. V
18VLSI Technology Symposium | June 2011 | Public
0.952V
0.915V
0.886V
0.805V
0.700V
0.750V
0.800V
0.850V
0.900V
0.950V
1.000V
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
40nm 28nm 20nm 14nm
Nom
inal
Vol
tage
Power Density Limited GPU40nm to 14nm
Frequency
Power Density
Voltage
Operating Voltage Challenges
To maintain cost effective performance growth with technology node, the GPU must:
– Hold power density constant
– Exploit density gains to add compute units
This necessarily drives operating voltage down
This would be good for energy efficiency except …
– Variation impacts are much greater at low voltage
0MHz
100MHz
200MHz
300MHz
400MHz
500MHz
600MHz
700MHz
800MHz
900MHz
1000MHz
0.85V 0.90V 0.95V 1.00V 1.10V 1.15V
Juniper Frequency Data
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Frequency spread increases at low voltage
40nm GPU Frequency Data
19VLSI Technology Symposium | June 2011 | Public
The Operating Voltage Challenge
Many barriers to maintaining both high and low voltage as technology scales
TDDB vs. SCE control ULK breakdown vs. denser pitches Variation control
85.0
90.0
95.0
100.0
105.0
110.0
BOX
Poly
Fin Curre
nt Fl
ow ->
S
D
FD devices should enable maintaining the functional range for a generation or two
Will turbo modes be too compromised?
What’s next?
20VLSI Technology Symposium | June 2011 | Public
3D Integration to the Rescue?
ThroughSilicon Vias
(TSVs) CPU DieMetal Layers
GPU DieMetal Layers
Analog Die (SB, Power)Metal Layers
Metal Layers
TIM (Thermal Interface Material)
Heat Sink
DRAMMicro-bumps
Package Substrate
GPU
UVD
UN
B / M
C
CPU Core+cache
CPU Core+cache
DDR IO
PCIe IO
CPU Core+cache
CPU Core+cache
DRAM
South Bridge
Stacking offers many attractive benefitsHigher bandwidth to local memory
Enables parallel and serial compute die to be in their own separate optimized technology – interconnect speed vs. density, device optimization etc.
Allows IO and southbridge content to remain in older, more analog-friendly technology
21VLSI Technology Symposium | June 2011 | Public
3D Integration Challenges Economical 3D stacking in high volume manufacturing presents many challenges
Benefits must exceed the additional costs of TSVs, and yield fallout
Logistics of testing and assembling die from multiple sources can be immense
Countless mechanical and thermal issues to solve in high volume mfg
Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D $$ and partnerships required
ThroughSilicon
Vias(TSVs)
CPU DieMetal Layers
GPU DieMetal Layers
Analog Die (SB, Power)Metal Layers
Metal Layers
TIM (Thermal I nterface Material)
Heat Sink
DRAMDie to Die Vias
Package Substrate
DRAM
South Bridge
22VLSI Technology Symposium | June 2011 | Public
Summary
Insatiable demand for high bandwidth computation–Visual image processing–Natural user interfaces–Massive data mining for associate searches, recognition
Some of these compute needs can be offloaded to servers, some must be done on the mobile device–Similar compute needs and massive growth in both spaces–Combined serial and parallel computation architectures are
key in both spacesHuge technology challenges to meeting this opportunity
–Interconnect scaling is hitting a wall that must be overcome–A broad device suite is necessary that operates efficiently at
low voltage while enabling high speed for response time–3D integration offers a promising long term solution