Application-Specific Architectures
Introduction and Motivation
Todd Austin
EECS 573, University of Michigan
Advanced Computer Architecture Laboratory, University of Michigan
Application Specific Architectures, Todd Austin
Architecture’s Diminishing Return
• Staples of value we strive for…
  • High speed
  • Low power
  • Low cost
• Tricks of the trade
  • Faster clock rates, via pipelining
  • Higher instruction throughput, via ILP extraction
  • Homogeneous parallel systems
• Much past evidence of diminishing return
  • PIII vs. P4: 22% less P4 throughput per clock (0.35 vs. 0.45 SPECInt/MHz)
• Parallel resources not fully harnessed by today’s software
• Less return means less value
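The 22% figure follows from the two per-MHz ratings quoted on this slide. A quick check of the arithmetic, assuming the lower rating (0.35) belongs to the P4 and the higher (0.45) to the PIII; the function and variable names are illustrative only:

```python
def relative_loss(baseline: float, new: float) -> float:
    """Fractional drop of `new` relative to `baseline`."""
    return (baseline - new) / baseline

# Per-clock throughput quoted on the slide (SPECInt per MHz).
# Assumption: 0.45 is the PIII rating and 0.35 is the P4 rating.
piii_specint_per_mhz = 0.45
p4_specint_per_mhz = 0.35

loss = relative_loss(piii_specint_per_mhz, p4_specint_per_mhz)
print(f"P4 delivers {loss:.0%} less throughput per MHz than PIII")
```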
Moore’s Law Performance Gap
[Figure: Moore’s Law performance gap. Annotations: "Today, gap is cresting 10x"; lack of perceived value; dark silicon; diminished ILP]
Is Density Still Scaling?
[Figure: Street dates for Intel’s lead-generation products. Y-axis: technology node (nm), log scale from 1 to 1000. Nodes shown: 180, 130, 90, 65, 45, 32, 22, 14, 10, and 7 nm. Annotations: "14nm slips by 2 quarters"; "10nm slips by 5-6 quarters"; "7nm by end 2020?"]
Compiled with David Brooks @ Harvard
Performance Demands Continue to Grow: Speech Recognition
[Figure: bar chart of words per minute (y-axis, 0 to 250) by processor type (SA-1110 at 206 MHz, XScale at 400 MHz, PIII at 600 MHz, PIII at 900 MHz, PIII at 1 GHz), with series for unexcited and excited speech. Annotations for lifetime on 1 AA battery: 6 hrs, 2 hrs, 14 min, 7 min, 6 min]
Based on slides by Prof. Scott Mahlke
Remedy #1: Chip Multiprocessors
The Dark Silicon Dilemma
Courtesy Michael Taylor @ UCSD
The Tyranny of Amdahl’s Law
[Figure: Amdahl’s Law, relating speedup (S), parallel fraction (P), and number of processors (N). Annotation: "Where we need to be today! (10x)"]
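Amdahl’s Law can be evaluated directly to see why a 10x target is hard to reach with parallelism alone. The sketch below assumes the usual reading of the slide’s symbols (S = speedup, P = parallelizable fraction, N = processor count); the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Overall speedup S = 1 / ((1 - P) + P / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on S as N grows without limit: 1 / (1 - P)."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"P={p:.2f}: S(N=16)={amdahl_speedup(p, 16):5.2f}, "
              f"bound={max_speedup(p):6.1f}")
```

Because the bound is 1 / (1 - P), a 10x overall speedup requires more than 90% of the work to be parallelizable, no matter how many processors are available.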
A Powerful Solution: Eschew Generality
• Specialization limits the scope of a device’s operation
  • Produces stronger properties and invariants
  • Results in higher-return optimizations
  • Programmability preserves the flexibility afforded by GPPs
• A natural fit for embedded designs
  • Where application domains are more likely restrictive
  • Where cost and power are first-order concerns
• Overcomes growing silicon/architecture bottlenecks
  • Concentrated computation overcomes the dark silicon dilemma
  • Customized acceleration speeds up Amdahl’s serial codes
[Figure: design spectrum trading speed and efficiency against flexibility and programmability. Design points shown: H/W designs, application-specific processors, general-purpose processors + ISA extensions, general-purpose processors]
Case Study: CryptoManiac [ISCA’01]
• A highly specialized and efficient crypto-processor design
  • Specialized for performance-sensitive private-key cipher algorithms
  • Chip-multiprocessor design extracting precious inter-session parallelism
  • CryptoManiac processors (CMProc) are implemented as 4-wide 32-bit VLIW processors
  • Design employs crypto-specific architecture, ISA, compiler, and circuits
[Figure: CryptoManiac block diagram. Encrypt/decrypt requests enter an input queue (In Q); a request scheduler dispatches them to multiple CMProc units that share a key store; ciphertext/plaintext results leave through an output queue (Out Q)]
Crypto-Specific Instructions
• Frequent SBOX substitutions
  • X = sbox[(y >> c) & 0xff]
• SBOX instruction
  • Incorporates byte extract
  • Speeds address generation through alignment restrictions
  • 4-cycle Alpha code sequence becomes a single CryptoManiac instruction
• SBOX caches provide a high-bandwidth substitution capability (4 SBOXes/cycle)
[Figure: SBOX instruction operation. The opcode selects a byte of the 32-bit operand (bit positions 31, 24, 16, 8, 0 marked), which is used as the table index into the SBOX table]
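The substitution expression on this slide can be modeled in a few lines. The sketch below is an illustration, not the CryptoManiac implementation: it assumes the shift amount `c` is a multiple of 8 (a byte select), that table entries are 32-bit words, and that the table is aligned to its own 1 KB size so address generation can use a bitwise OR instead of an add. The function names are hypothetical.

```python
def sbox_lookup(sbox, y, byte_index):
    """Model X = sbox[(y >> c) & 0xff], with c = 8 * byte_index (assumed)."""
    return sbox[(y >> (8 * byte_index)) & 0xFF]


def sbox_address(table_base, y, byte_index):
    """Byte address of the selected entry under an assumed 1 KB alignment.

    With 256 four-byte entries the table spans 1024 bytes. If the base is
    1 KB aligned, its low 10 bits are zero, so OR-ing in the scaled index
    gives the same result as adding it, without a carry chain.
    """
    if table_base % 1024 != 0:
        raise ValueError("table base must be 1 KB aligned")
    index = (y >> (8 * byte_index)) & 0xFF
    return table_base | (index << 2)


if __name__ == "__main__":
    identity_table = list(range(256))  # placeholder table, not a real cipher S-box
    word = 0x12345678
    print(hex(sbox_lookup(identity_table, word, 1)))  # selects byte 0x56
    print(hex(sbox_address(0x4000, word, 1)))
```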
Crypto-Specific Functional Unit
[Figure: functional unit datapath containing a pipelined 32-bit multiplier, a 1 KB SBOX cache, a 32-bit adder, a 32-bit rotator, and two XOR/AND logical units. Units are labeled {tiny}, {short}, or {long}]
Case Study: Subliminal Systems [ISCA’05]
Project goals
• Explore area-constrained low-energy systems
• Develop 100% silicon platforms
• Target form factors below 1 mm³
Technology developments
• Subthreshold-voltage processors and memories
• Robust subthreshold circuit/cell designs
• Compact integrated wireless interfaces
• Energy scavenging technologies
• Sensor designs
[Figure: platform sketch, < 0.5 mm across, with CPU, memory, sensors, power, and I/O blocks]
Energy Efficiency: A Key Requirement
These systems live on a limited amount of energy generated from a small battery or scavenged from the environment.
Traditionally, the communication component is the most power-hungry element of the system. However, new trends are emerging:
• Passive telemetry
• Self-powered RF
• Proximity communication
Performance of Various Platforms
[Figure: bar chart of xRT rating per platform, log scale from 1.00 to 10000.00]

| Platform | Voltage (V) | Speed (Hz) | xRT rating |
|---|---|---|---|
| ARM 720T | 1.2 | 100M | 2965.01 |
| ARM 7TDMI | 1.2 | 133M | 3943.47 |
| ARM 920T | 1.2 | 250M | 8036.77 |
| ARM 1020T | 1.2 | 325M | 8296.37 |
| 1st-gen | 1.2 | 114M | 2253.56 |
| 1st-gen | 0.5 | 9M | 183.25 |
| 1st-gen | 0.232 | 168k | 4.10 |

xRT rating: how many times faster than real time the processor can handle the worst-case data stream rate on the most computationally intensive sensor benchmark
Summary from Architecture Study
We studied 21 different subthreshold processors, experimenting with the following options:
• Number of stages
• With vs. without instruction prefetch buffer
• With vs. without explicit register file
• Harvard vs. Von Neumann architecture
To minimize energy at subthreshold voltages, architects must:
• Minimize area, to reduce leakage energy per cycle
• Maximize transistor utility, to reduce Vmin and energy per cycle
• Minimize CPI, to reduce energy per instruction
Memory is the single largest contributor to leakage energy, so efficient designs must reduce memory storage requirements.
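The three guidelines can be tied together with a first-order energy model. The sketch below is an illustration rather than the model used in the study: it assumes energy per instruction is CPI times the sum of dynamic and leakage energy per cycle, with leakage energy per cycle equal to leakage power divided by clock frequency. All numeric values are hypothetical.

```python
def energy_per_instruction(cpi, dynamic_energy_per_cycle, leakage_power, clock_hz):
    """First-order model: E/inst = CPI * (E_dynamic/cycle + P_leak / f)."""
    leakage_energy_per_cycle = leakage_power / clock_hz
    return cpi * (dynamic_energy_per_cycle + leakage_energy_per_cycle)


if __name__ == "__main__":
    # Hypothetical operating point, chosen only to exercise the model.
    baseline = energy_per_instruction(
        cpi=3.0, dynamic_energy_per_cycle=0.5e-12,
        leakage_power=50e-9, clock_hz=200e3)
    # Less area (modeled here as half the leakage power) lowers energy per cycle.
    smaller = energy_per_instruction(3.0, 0.5e-12, 25e-9, 200e3)
    # A lower CPI lowers energy per instruction at the same energy per cycle.
    lower_cpi = energy_per_instruction(2.0, 0.5e-12, 50e-9, 200e3)
    print(f"baseline={baseline:.3e} J, smaller={smaller:.3e} J, "
          f"lower CPI={lower_cpi:.3e} J")
```

At low clock rates the leakage term grows because each cycle lasts longer, which is why area (and memory in particular) matters so much at subthreshold voltages.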
Microarchitecture Overview
[Figure: three-stage pipeline (IF/ID, EX/MEM, WB). Blocks: Imem (4x16x2x12), Dmem (128x8), prefetch buffer (2x2x12), register file, scheduler, 32-bit timer, page control, OpA control, OpB control, μOperation decoder, register write control, jump control, ALU, flag control (carry, zero), fetch control, and external interrupts. Datapaths are marked with widths of 8, 12, and 24 bits]
First Subliminal Chip
Pareto Analysis for Several Processors
[Figure: scatter plot of energy (J/inst., 1.40E-12 to 3.00E-12) versus instruction latency (1/perf, in s/inst., 5.00E-06 to 4.00E-05). Design points: 2s_h_08w, 2s_h_16w, 2s_h_32w, 3s_h_08w, 3s_h_16w, 3s_h_32w, 2s_h_08w_r, 2s_h_16w_r, 2s_h_32w_r, 3s_h_08w_r, 3s_h_16w_r, 3s_h_32w_r, 2s_v_08w, 2s_v_08w_r, 2s_v_16w, 2s_v_32w, 3s_v_08w, 3s_v_16w. Naming key: number of stages (2s, 3s); architecture (v = Von Neumann, h = Harvard); ALU width (08w, 16w, 32w); _r = with explicit register file. Selected points are annotated with area and CPI values, and the implemented design is marked]
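A Pareto analysis of this kind keeps only the design points that no other point beats on both axes. A minimal sketch of that filter follows; the point names and values are made up for illustration and are not read from the plot.

```python
from typing import NamedTuple


class DesignPoint(NamedTuple):
    name: str
    latency: float  # seconds per instruction (lower is better)
    energy: float   # joules per instruction (lower is better)


def pareto_frontier(points):
    """Return the points that are not dominated in both latency and energy."""
    frontier = []
    best_energy = float("inf")
    # Scan in order of increasing latency; a point survives only if it
    # uses less energy than every faster (or equally fast) point seen so far.
    for point in sorted(points, key=lambda p: (p.latency, p.energy)):
        if point.energy < best_energy:
            frontier.append(point)
            best_energy = point.energy
    return frontier


if __name__ == "__main__":
    # Hypothetical design points.
    candidates = [
        DesignPoint("design_a", 1.0e-5, 2.8e-12),
        DesignPoint("design_b", 2.0e-5, 2.0e-12),
        DesignPoint("design_c", 2.5e-5, 2.4e-12),  # dominated by design_b
        DesignPoint("design_d", 3.5e-5, 1.5e-12),
    ]
    for point in pareto_frontier(candidates):
        print(point.name)
```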
Case Study: Taking Computer Vision Mobile
• Embedded mobile computation on the rise
  • Smartphones, tablets
• Improved sensors
  • High-megapixel cameras, HD video
  • New capabilities from new sensors
• There is a need for near-real-time computation
  • Users don’t want to wait
• Why not use the cloud?
  • High latency
  • Bandwidth limits
  • Reliability
Computer Vision
Typical computer vision pipeline
Feature Extraction
Feature Extraction Characteristics
3 algorithms:
• FAST: corner detection
• HoG: general object shape detector
• SIFT: specific object/blob detector
[Figure: workload characteristics (branch divergence, data-level parallelism, thread-level parallelism, 2D spatial locality) shown alongside architectural features (heterogeneous multicore, vector reduction, custom functional units, patch memory)]
Efficient Fast Feature EXtraction
1. Heterogeneous Architecture
2. Vector Reduction Instructions
3. 2D Locality Memory
Patch Memory
[Figure: traditional image storage versus patch memory storage. The patch memory controller translates a pixel location (X,Y) into a memory address (ADDR) and returns the pixel data]
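One way to picture the (X,Y)-to-address translation is a tiled layout in which every pixel of a small square patch sits in contiguous memory. The slides do not spell out the actual patch memory organization, so the layout, the 8x8 patch size, and the function names below are assumptions made for illustration.

```python
def row_major_address(x, y, image_width):
    """Traditional storage: vertically adjacent pixels are image_width apart."""
    return y * image_width + x


def patch_address(x, y, image_width, patch_size=8):
    """Assumed tiled storage: each patch_size x patch_size patch is contiguous.

    Requires image_width to be a multiple of patch_size.
    """
    if image_width % patch_size != 0:
        raise ValueError("image_width must be a multiple of patch_size")
    patches_per_row = image_width // patch_size
    patch_x, offset_x = divmod(x, patch_size)
    patch_y, offset_y = divmod(y, patch_size)
    patch_index = patch_y * patches_per_row + patch_x
    return (patch_index * patch_size * patch_size
            + offset_y * patch_size + offset_x)


if __name__ == "__main__":
    width = 640
    # Moving one pixel down costs a stride of 640 in row-major order,
    # but only 8 inside a patch.
    print(row_major_address(3, 1, width) - row_major_address(3, 0, width))
    print(patch_address(3, 1, width) - patch_address(3, 0, width))
```

Under this layout a whole patch occupies a single 64-entry span, which is the 2D spatial locality that feature extraction kernels exploit.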
A Taste of the Results
Pareto Frontier
Outlook for App-Specific Design is Unsure: The Good, the Bad and the Ugly
• The Good: Moore’s law will continue for the near future
  • It won’t last forever, but that’s another problem
• The Bad: Dennard scaling has all but stopped, leaving innovation to fill the performance/power scaling gap
  • E.g., app-specific design, custom accelerators
• The Ugly: Hardware innovation requires design diversity, which is ultimately too expensive to afford
  • Skyrocketing NREs will necessitate broadly applicable (vanilla and slow) H/W designs
Design Costs Are Skyrocketing
[Figure: stacked bar chart of cost to market ($ million, 0 to 140) by silicon technology node (0.5u, 0.35u, 0.25u, 0.18u, 0.13u, 90nm, 65nm, 45nm, 28nm, 20nm), broken down into mask costs, S/W development and testing, and H/W design and verification. Annotated values: $500K, $88M, $120M]
Source: International Business Strategies
High Costs Will be a Showstopper
Heterogeneous designs often serve smaller markets
Outcome: “Nanodiversity” is Dwindling
[Figure: ASIC design starts per year (y-axis, 0 to 12000), 1995 to 2009. Source: Gartner Group]
Expensive development costs demand BIGGER markets; this trend works against customized designs.
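The link between development cost and market size is simple amortization: the non-recurring cost is spread over every unit sold. The sketch below uses the $120M cost-to-market value annotated on the design-cost chart; the sales volumes are hypothetical.

```python
def nre_per_unit(nre_dollars: float, units_sold: int) -> float:
    """Non-recurring engineering cost carried by each unit sold."""
    return nre_dollars / units_sold


if __name__ == "__main__":
    nre = 120e6  # $120M, as annotated on the design-cost chart
    for volume in (100_000, 1_000_000, 10_000_000, 100_000_000):  # hypothetical
        print(f"{volume:>11,} units -> ${nre_per_unit(nre, volume):,.2f} per unit")
```

A design serving a niche of 100,000 units must recover $1,200 of development cost per chip, while a 100-million-unit product carries only $1.20, which is why rising costs favor broad markets.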
The Remedy: Scale Innovation
Ultimate goal: accelerate system architecture innovation and make it efficient and inexpensive enough that anyone can do it anywhere
Approach #1: Embrace system-level innovation
Approach #2: Leverage technology advances on CMOS silicon
Approach #3: Reduce the cost to design custom hardware
Approach #4: Widen the applicability of custom hardware
Approach #5: Reduce the cost of manufacturing custom H/W
1) Embrace system-level innovation
• “Give me 15% speedup and I’ll accept your paper”
• “I need 1% speedup for 1% area”
• “Your system-level ideas need to deliver 2x or more, or someone else should fund it”
HELIX-UP Unleashed Parallelization
• Traditional parallelizing compilers must honor all possible dependences
• HELIX-UP manufactures parallelism by profiling which dependences do not exist and which are not needed
  • Based on a user-supplied output distortion function
• Big step for parallelization
  • 2x speedup over parallelizing compilers, 6x over serial, < 7% distortion
  • Measured on Nehalem, 6 cores, 2 threads per core
[Figure: loop iterations 0 and 1 spread across threads 0 to 3, with data passed between threads]
David Brooks @ Harvard
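To make the idea of a user-supplied output distortion function concrete, here is one possible shape for it. The HELIX-UP interface is not described in these slides, so the relative-RMS metric and both function names are hypothetical; only the 7% budget comes from the slide.

```python
import math


def relative_rms_distortion(reference, relaxed):
    """Relative RMS error of the relaxed output against the exact output."""
    if len(reference) != len(relaxed):
        raise ValueError("outputs must have the same length")
    error = sum((r - x) ** 2 for r, x in zip(reference, relaxed))
    norm = sum(r ** 2 for r in reference)
    if norm == 0:
        return 0.0 if error == 0 else math.inf
    return math.sqrt(error / norm)


def within_budget(reference, relaxed, budget=0.07):
    """Accept a relaxed parallelization only if distortion stays under budget."""
    return relative_rms_distortion(reference, relaxed) < budget


if __name__ == "__main__":
    exact = [10.0, 10.0]
    print(within_budget(exact, [10.5, 9.5]))   # distortion 0.05
    print(within_budget(exact, [12.0, 10.0]))  # distortion about 0.14
```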
Association Rule Mining with the Automata Processor
• Micron’s Automata Processor
  • Implements FSMs at memory
  • Massively parallel with accelerators
• Mapped association rule mining (ARM) rules to memory-based FSMs
  • ARM algorithms identify relationships between data elements
  • Implementations are often memory bottlenecked
• Big data sets had big speedups
  • 90x+ over single-CPU performance
  • 2-9x+ speedups over CMPs and GPUs
• Joint effort with UVA and Micron
Kevin Skadron @ UVA
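For readers new to association rule mining, the relationships it looks for are usually scored by support and confidence. The sketch below shows those two standard measures in plain software; it says nothing about how the rules are mapped onto the Automata Processor, and the shopping-basket data is made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    hits = sum(1 for transaction in transactions if items <= transaction)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


if __name__ == "__main__":
    # Hypothetical transactions.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

Counting the transactions that contain each candidate itemset dominates the running time, which is consistent with the slide's note that implementations are often memory bottlenecked.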
2) Leverage technology advances on CMOS silicon
• Recent success: the reduced leakage and transient fault protection of FinFETs
• Upcoming: the density and durability of Intel/Micron’s XPoint memory technology
• Many additional opportunities possible: TFETs, CNTs, spintronics, novel materials, analog accelerators, etc.
• Key challenge: integration of non-silicon technologies
• Advice: to maximize the benefits of these devices, architects need to work with device and materials researchers
Top 10 Technology Plays that Would Make Architects REALLY Excited
• Reduced leakage for memory
  • Helps with low-power sleep states, allows lower computational power states
• Reduced leakage for computation
  • Re-balances the power-parallelism tradeoff in favor of more performance/watt
• More energy-efficient communication that doesn't overtly exacerbate latency
  • Allows for more system scalability, both scale-in and scale-out
• More energy-efficient computation that is dense and cheap
  • Allows for more T-flops, since almost all computational capabilities today are energy bounded
• Controllable and recognizable analog functions
  • Allow computation to be replaced with potentially fast and efficient analog compute
• Ultra-cheap fabrication technologies
  • Re-balances the specialization-cost tradeoffs, making system-level optimization more valuable
• Emerging technologies that deliver additional traditional value at low fault rates
  • We have many low-cost system-level fault tolerance technologies, let’s use them! Limit faults to < 0.1%
• Emerging technologies that are not too fiddly, unless they deliver significant value
  • We need clean, productive abstractions; CMOS is the benchmark, compare to asynch and CUDA
• Faster, more energy-efficient, less destructive writes for nonvolatile storage
  • Allows for simpler, denser, more efficient memory designs, supports ultra-low power states
• Computation/memory capabilities with no power/electrical/etc. signature
  • Today’s systems are fraught with side channels; this is needed as a basis for establishing H/W trust
3) Reduce the cost to design custom hardware
• Better tools and infrastructure
  • Scalable accelerator synthesis and compilation: generate code and H/W for highly reusable accelerators
  • Composable design space exploration: enables efficient exploration of highly complex design spaces
  • Well put-together benchmark suites to drive development efforts
[Figure: accelerator modeling flow. Inputs: unmodified C code and accelerator design parameters (e.g., # FU, mem. BW). Components: accelerator-specific datapath, private L1/scratchpad, shared memory/interconnect models]
David Brooks @ Harvard
CortexSuite: A Synthetic Brain Benchmark Suite
• Feature tracking
• Disparity map
• Image stitch
• Image segmentation
• Robot localization
• Texture synthesis
• SIFT
• Support vector machines
Michael Taylor @ UCSD
Embrace Open-Source Concepts to Reduce Costs
Thought experiment: let’s design the next great smartphone
[Figure legend: red = non-free IP, green = free IP]
As a community, we need to consider: how much of our basic technology should be collectively maintained?
Open-Source H/W is Growing
4) Widen the Applicability of Customized H/W
ESP: Ensembles of Specialized Processors
• Ensembles are algorithm-specific processors optimized for code “patterns”
• Approach uses composable customization to deliver speed and efficiency that is widely applicable to general-purpose programs
• Grand challenges remain: what are the components and how are they connected?
[Figure: applications (multimedia analysis, computer vision, machine learning) are expressed as computational patterns (dense, sparse, graph, …); specializers with custom implementations and autotuning produce ESP code (glue code, dense code, sparse code, graph code) for an ESP core made up of an ILP engine, dense engine, sparse engine, and graph engine]
Krste Asanovic @ UC-Berkeley
5) Reduce the cost of manufacturing customized H/W
• Another thought experiment: what if building a house were like fabricating a chip?
• Brick-and-mortar silicon explores assembly-time customization, i.e., MCMs + 3D + FPGA interconnect
  • Diversity via brick ecosystem & interconnect flexibility
  • Brick design costs amortized across all designs
  • Robust interconnect and custom bricks rival ASIC speeds
  • Facilitates non-silicon integration and mature design strategies
Brick-and-mortar silicon design flow:
1) Assemble brick layer
2) Connect with mortar layer
3) Package assembly
4) Deploy software
[Figure: H/W brick]
Martha Kim @ Columbia
Summary: Benefits of App-Specific Design
[Figure: design spectrum trading speed and efficiency against flexibility and programmability. Design points shown: H/W designs, application-specific processors, general-purpose processors + ISA extensions, general-purpose processors]
• Specialization limits the scope of a device’s operation
  • Produces stronger properties and invariants
  • Results in higher-return optimizations
  • Programmability preserves the flexibility afforded by GPPs
• A natural fit for embedded designs
  • Where application domains are more likely restrictive
  • Where cost and power are first-order concerns
• Overcomes growing silicon/architecture bottlenecks
  • Concentrated computation overcomes the dark silicon dilemma
  • Customized acceleration speeds up Amdahl’s serial codes