Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms
Chita R. Das
High Performance Computing Laboratory, Department of Computer Science & Engineering
The Pennsylvania State University
EEHiPC, December 19, 2010
Talk Outline
- Technology Scaling Challenges
- State-of-the-Art Design Challenges
- Opportunity: Heterogeneous Architectures
  - Technology: 3D, TFET, optics, STT-RAM
  - Processor: new devices, core heterogeneity
  - Memory: STT-RAM, PCM, etc.
  - Interconnect: network heterogeneity
- Conclusions
Computing Walls

[Chart: number of cores per chip (normalized to 2008), 2008-2022, growing with Moore's Law. Data from ITRS 2008.]
Computing Walls

Utilization and Power Wall

[Chart: number of cores per chip (normalized to 2008), 2008-2022, total vs. powered-on for 100 W. Data from ITRS 2008. The total core count grows roughly 3x over the period, while the number of cores that can be powered on within 100 W stays near 1x.]

P ≈ CV²f: lowering V can reduce P, but speed decreases with V.

[Chart: supply voltage (V), 2009-2022.]

- High-performance MOS started out with a 12 V supply; current high-performance microprocessors have a 1 V supply, i.e. (12/1)² = 144x of dynamic-power scaling over 28 years.
- Only (1/0.6)² ≈ 2.77x is left in the next 12 years!
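A minimal sketch that reproduces the two headroom figures above from the P ≈ CV²f relation, holding C and f fixed; the voltages are the ones quoted on the slide:

```python
# Dynamic power scales as P = C * V^2 * f. Holding C and f fixed,
# the power ratio between two supply voltages is (V_old / V_new)^2.
# The voltages below are the ones quoted on this slide.

def power_ratio(v_old: float, v_new: float) -> float:
    """Dynamic-power reduction from scaling the supply from v_old to v_new."""
    return (v_old / v_new) ** 2

print(power_ratio(12.0, 1.0))  # 144.0: the headroom already used over 28 years
print(power_ratio(1.0, 0.6))   # ~2.78: all that is left in the next 12 years
```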
Computing Walls

Memory bandwidth wall: the package pin count increases only about 4x, compared to a 25x increase in cores.

[Chart: number of total package pins, 2008-2022. Data from ITRS 2009.]
Computing Walls

Reliability Wall

[Chart: projected required decrease in failure rate per transistor (normalized to 2007), 2007-2022, i.e. the reduction in per-transistor failure rate required for a reliable IC. Data from ITRS 2007.]
The failure rate per transistor must decrease exponentially as we scale deeper into the nanometer regime.
Computing Walls

Global wires no longer scale.

[Chart: global and local wire RC delay (picoseconds), 2008-2022.]
State-of-the-Art in Architecture Design

[Diagram: strawman multi-core processor, 16 nm technology, 20 mm die: 25K 64-bit FPUs (0.015 mm² each, 4 pJ/op, 3 GHz), 37.5 TFLOPS at 150 W (compute only); 64b 1 mm on-chip channel: 2 pJ/word; 10 mm traversal: 20 pJ, 4 cycles; 64b off-chip channel: 64 pJ/word.]

- The energy required to move a 64-bit word across the die is equivalent to the energy for 10 FLOPs.
- Traditional designs have approximately 75% of their energy consumed by overhead.

Performance = Parallelism. Efficiency = Locality.

Source: Bill Harrod, DARPA IPTO, 2009.
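As a back-of-the-envelope check of the 10-FLOPs claim, here is a small sketch using only the per-component energies from the diagram above; the assumption that channel energy scales linearly at 2 pJ/word per mm is extrapolated from the slide's 1 mm channel figure:

```python
# Energy accounting with the strawman figures above. The linear
# 2 pJ/word/mm channel cost is an assumption extrapolated from the
# slide's 1 mm channel figure; the FPU cost is the slide's 4 pJ/op.

FPU_OP_PJ = 4.0          # 64b FPU operation
CHANNEL_PJ_PER_MM = 2.0  # 64b word moved 1 mm on chip
DIE_MM = 20.0            # die edge length

def flop_energy_pj(distance_mm: float, words_moved: int = 1) -> float:
    """Energy (pJ) for one FLOP whose operands travel distance_mm on chip."""
    return FPU_OP_PJ + words_moved * CHANNEL_PJ_PER_MM * distance_mm

print(flop_energy_pj(0.0))     # 4.0 pJ: compute only
print(flop_energy_pj(DIE_MM))  # 44.0 pJ: compute plus a full-die traversal
# Moving one word across the 20 mm die costs 40 pJ, i.e. the energy of
# ten 4 pJ FPU operations -- the slide's headline ratio:
print(CHANNEL_PJ_PER_MM * DIE_MM / FPU_OP_PJ)  # 10.0
```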
Energy Cost for Different Operations

Operation                    Energy (pJ)   DP FLOPs   Insts*
I$ fetch                     33            0.67       2.0
Register access              10.5          0.2        0.6
Access 3-operand D$          100           2          6
Access 3-operand L2 D$       460           9          27
Access 3 operands off-chip   762           15         45
Access 3 operands from DRAM  6000          120        360

Energy is dominated by data and instruction movement.

* The Insts column gives the number of average instructions that can be performed for this energy.
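The DP FLOPs and Insts columns are just the energy column renormalized. A small sketch that reproduces them, assuming the conversion factors implied by the table itself (about 50 pJ per double-precision FLOP and a third of that per average instruction):

```python
# Reproduce the table's DP FLOPs and Insts columns from the energy column.
# The conversion factors are inferred from the table's own rows
# (e.g. 100 pJ -> 2 DP FLOPs -> 6 instructions), not stated on the slide.

PJ_PER_DP_FLOP = 50.0
PJ_PER_AVG_INST = 50.0 / 3.0  # one DP FLOP ~ three average instructions

OPS = {
    "I$ fetch": 33,
    "Register access": 10.5,
    "3-operand D$ access": 100,
    "3-operand L2 D$ access": 460,
    "3-operand off-chip access": 762,
    "3-operand DRAM access": 6000,
}

for name, pj in OPS.items():
    print(f"{name:<26} {pj:>7} pJ = "
          f"{pj / PJ_PER_DP_FLOP:6.2f} DP FLOPs = "
          f"{pj / PJ_PER_AVG_INST:6.1f} insts")
```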
Conventional Architecture (90 nm): Energy Is Dominated by Overhead

[Pie chart, after Dally: per-operation energy split among FPU, local communication, global communication, off-chip, overhead, and DRAM; the component energies shown are 3.0E-10, 1.0E-10, 2.0E-10, 4.8E-10, 1.3E-10, and 1.4E-8 J.]
Where Is This Overhead Coming From?

- Complex microarchitecture: out-of-order execution, register renaming, branch prediction, ...
- Complex memory hierarchy
- High/unnecessary data movement
- Orthogonal design style
- Limited understanding of application requirements
Both Put Together...

- Power (joules per operation) becomes the deciding factor for designing HPC systems.
- Hardware acquisition cost no longer dominates the total cost of ownership (TCO).
IT Infrastructure Optimization: The New Math

- Until now: minimize equipment, software/license, and service/management costs.
- Going forward: the power and physical-infrastructure costs of housing the IT become equally important.
- Become "greener" in the process.

[Chart: worldwide spending (US$B, 0-300) and installed base (M units), 1996-2010, split into new server spending, server management cost, and power & cooling costs; power & cooling is becoming comparable to new server spending. Source: IDC.]
[Diagram: cumulative consumption. 1 W consumed at the server component cascades through DC-DC conversion (+0.18 W), AC-DC conversion (+0.31 W), power distribution (+0.04 W), UPS (+0.14 W), cooling (+1.07 W), and building switchgear/transformer (+0.10 W), for approximately 2.84 W of total consumption per server watt. Source: Emerson.]
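A minimal sketch of the cascade, using the stage overheads from the diagram; each stage simply adds its fixed wattage to the running total:

```python
# The Emerson cascade above: each infrastructure stage adds a fixed
# overhead (in watts) on top of 1 W drawn at the server component.
# Stage names and values are taken directly from the slide.

STAGES = [
    ("Server component",                1.00),
    ("DC-DC conversion",                0.18),
    ("AC-DC conversion",                0.31),
    ("Power distribution",              0.04),
    ("UPS",                             0.14),
    ("Cooling",                         1.07),
    ("Building switchgear/transformer", 0.10),
]

total = 0.0
for name, watts in STAGES:
    total += watts
    print(f"{name:<34} +{watts:.2f} W -> cumulative {total:.2f} W")

# The final total, 2.84 W per server watt, is the slide's headline number,
# corresponding to a facility-level multiplier of ~2.84.
```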
A Holistic Green Strategy

[Diagram: concentric scopes. Core: servers, storage, networking. Support: UPS, power distribution, chillers, lighting, real estate. Facilities: operations, office spaces, factories, travel and transportation, energy sourcing, ... The strategy spans both "greening of technology" and "technology for greening". Source: A. Sivasubramaniam.]
Processor Power Efficiency Based on ExtremeScale Study

- Bill Dally's strawman processor architecture
- A possible processor design methodology for achieving 28 pJ/FLOP
- Requires optimization of the communication, computation, and memory components

[Diagram: conventional design, 2.5 nJ/FLOP → minimize overhead, 631 pJ/FLOP → minimize DRAM energy, 28 pJ/FLOP.]

Source: Bill Harrod, DARPA IPTO, 2009.
Opportunity: Heterogeneous Architectures

- Multicore era
- Heterogeneous multicore architectures provide the most compelling architectural trajectory for mitigating these problems:
  - Hybrid cores: big, small, accelerators, GPUs
  - Hybrid memory subsystem: SRAM, TFET, STT-RAM
  - Heterogeneous interconnect
A Holistic Design Paradigm

- Heterogeneity in devices/circuits
- Heterogeneity in microarchitecture
- Heterogeneity in memory design
- Heterogeneity in interconnect
Technology Heterogeneity

- CMOS-based scaling is expected to continue till 2022.
- Exploiting emerging technologies to design different cores/components is promising because it can enable cores with power/performance tradeoffs that were not possible before.

TFETs provide higher performance than CMOS-based designs at lower voltages.

[Chart: voltage/frequency scaling of CMOS and TFET devices.]
Processor Cores

Heterogeneous compute nodes:

Core type           Target workload
Big core            Latency critical
Small core          Throughput critical
GPGPUs              Bandwidth critical
Accelerators/ASICs  Latency/time critical
Role of Novel Technologies in Memory Systems

[Figures: memory architecture; comparison of memory technologies.]
Heterogeneous Interconnect

Buffer and link utilization across the network is non-uniform because the mesh is not edge-symmetric and uses X-Y routing. So:

- Why clock all routers at the same frequency? Variable-frequency routers for designing NoCs.
- Why allocate every router similar area/buffer/link resources? Heterogeneous routers/NoCs.

[Charts: buffer utilization and link utilization.]
Software Support

- Compiler support
  - Thread remapping to minimize power: migrate threads to TFET cores to reduce power.
  - Dynamic instruction morphing: the instructions of a thread are morphed by the runtime system to match the heterogeneous hardware the thread is mapped to.
- OS support (see the sketch below)
  - Heterogeneity-aware scheduling support
  - Run-time thread migration support
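A minimal, hypothetical sketch of what heterogeneity-aware scheduling could look like, in the spirit of this slide: latency-critical threads prefer big CMOS cores, and the rest migrate to TFET cores to save power. The core kinds, thread attributes, and greedy policy are illustrative assumptions, not the actual runtime system described in the talk:

```python
# Illustrative heterogeneity-aware scheduler: latency-critical threads go
# to big CMOS cores, throughput-oriented threads to low-voltage TFET cores.
# All names and the greedy policy are assumptions for this sketch.

from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    latency_critical: bool

@dataclass
class Core:
    kind: str          # "big-cmos" or "tfet"
    free: bool = True

def schedule(threads: list[Thread], cores: list[Core]) -> dict[str, str]:
    """Greedily map each thread to a free core matching its needs."""
    mapping = {}
    for t in threads:
        preferred = "big-cmos" if t.latency_critical else "tfet"
        # Prefer the matching core kind; fall back to any free core.
        core = next((c for c in cores if c.free and c.kind == preferred),
                    next((c for c in cores if c.free), None))
        if core is None:
            break  # no cores left; remaining threads wait
        core.free = False
        mapping[t.name] = core.kind
    return mapping

cores = [Core("big-cmos"), Core("big-cmos"), Core("tfet"), Core("tfet")]
threads = [Thread("ui", True), Thread("batch0", False), Thread("batch1", False)]
print(schedule(threads, cores))
# {'ui': 'big-cmos', 'batch0': 'tfet', 'batch1': 'tfet'}
```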
Current Research in HPCL: Problems with Current NoCs

- NoC power consumption is a concern today.

[Pie chart: Intel 80-core tile power profile [1] — dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%.]

- With technology scaling, NoC power can be as high as 40-60 W for 128 nodes [2].

1. Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, 2007.
2. S. Borkar, "Networks for Multi-core Chips: A Contrarian View," special session at ISLPED 2007.
Network Performance/Power

[Chart: normalized power and normalized latency (ratio of power/latency increase) vs. injection ratio (flits/node/cycle).]

Observation:
- At low load: low power consumption.
- At high load: high power consumption and congestion.

The proposed approach [1]:
- At low load: optimize for performance (reduce zero-load latency and accelerate flits).
- At high load: manage congestion and power.

1. "A Case for Dynamic Frequency Tuning in On-Chip Networks," MICRO 2009.
Frequency Tuning Rationale

[Diagram: a congested router is throttled or has its frequency lowered; an upstream router throttles itself depending on its total buffer utilization; routers that are not congested see no change or have their frequency boosted. A simplified decision rule is sketched below.]
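A simplified sketch of the decision rule implied by the diagram: each router picks its next clock from its buffer utilization. The thresholds and frequency steps are illustrative assumptions, not the tuning algorithm of the MICRO 2009 paper:

```python
# Sketch of buffer-utilization-driven frequency tuning for a NoC router.
# Thresholds and frequency steps are assumptions for illustration only.

BOOST_THRESHOLD = 0.25     # below this utilization, run fast (assumed)
THROTTLE_THRESHOLD = 0.75  # above this utilization, slow down (assumed)

def next_frequency(buffer_utilization: float, base_ghz: float = 2.0) -> float:
    """Return the router clock for the next control interval."""
    if buffer_utilization < BOOST_THRESHOLD:
        return base_ghz * 1.5  # boost: accelerate flits at low load
    if buffer_utilization > THROTTLE_THRESHOLD:
        return base_ghz * 0.5  # throttle: relieve congestion, save power
    return base_ghz            # no change in the middle band

for util in (0.1, 0.5, 0.9):
    print(f"utilization {util} -> {next_frequency(util)} GHz")
```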
Performance/Power Improvement with RAFT

[Chart: latency (ns, 0-40) vs. injection ratio (flits/node/cycle, 0-0.4) under uniform random (UR) traffic, comparing BaseCase, FreqThrtl, FreqBoost, and FreqTune.]

- FreqBoost at low load (optimize performance); FreqThrtl at high load (optimize performance and power).
- FreqTune gives both power reduction and throughput improvement: a 36% reduction in latency, a 31% increase in throughput, and a 14% power reduction across all traffic patterns.
A Case for Heterogeneous NoCs

- Using the same amount of link resources and fewer buffer resources than a homogeneous network, this proposal demonstrates that a carefully designed heterogeneous network can reduce average latency, improve network throughput, and reduce power.
- Explores the types, number, and placement of heterogeneous routers in the network.

[Diagram: mesh combining small and big routers with narrow and wide links.]
HeteroNoC Performance-Power Envelope

[Chart: EDP, latency, and power ratios for the 192, 128, and 256 configurations and HeteroNoCs.]

- 22% throughput improvement
- 25% latency reduction
- 28% power reduction
3D Stacking = Increased Locality!

Many more neighbors within a few minutes' reach!
Reduced Global Interconnect Length

[Diagram: a wire of length L folded across stacked layers, shortening the global route.]

- Delay/power reduction
- Bandwidth increase
- Smaller footprint
- Mixed-technology integration
3D Routers for 3D Networks

- One router per grid tile (total area = 4L²)
- Stack layers in 3D (total area = L²)
- Stack router components in 3D (total area = L²)

Results from "MIRA: A Multi-layered On-Chip Interconnect Router Architecture," ISCA 2008.
Conclusions

- We need a coherent approach to address submicron-technology problems in designing energy-efficient HPC systems.
- Heterogeneous multicores can address these problems and are likely to be the future architecture trajectory.
- But the design of such systems is extremely complex.
- It needs an integrated technology-hardware-software-application approach.
HPCL Collaborators

Faculty: Vijaykrishnan Narayanan, Yuan Xie, Anand Sivasubramaniam, Mahmut Kandemir

Students: Sueng-Hwan Lim, Bikash Sharma, Adwait Jog, Asit Mishra, Reetuparna Das, Dongkook Park, Jongman Kim

Partially supported by: NSF, DARPA, DOE, Intel, IBM, HP, Google, Samsung
THANK YOU !!!
Questions???