Copyright © 2012 Houman Homayoun 1
Dynamically Heterogeneous Cores Through 3D Resource Pooling
Houman Homayoun, Vasileios Kontorinis, Amirali Shayan, Ta-Wei Lin, Dean M. Tullsen
Speaker: Houman Homayoun, National Science Foundation CI Fellow
University of California San Diego
Why Heterogeneity?
• Existing general-purpose CMP designs use only homogeneous cores
• A general-purpose one-size-fits-all core is not necessarily the most efficient
• One processor optimized for each application!
[Figure: Core 1 and Core 2]
Static vs. Dynamic Heterogeneity
• Prior proposals (e.g., Kumar 2003) rely on static heterogeneity.
• Increases the chance of finding an appropriate core
• Does not guarantee a perfect match
• Others have proposed solutions for dynamic heterogeneity (Core Fusion, TFlex).
• Due to the difficulty of sharing resources at a fine granularity, they enable only coarse-grain sharing.
• Big (combined) cores or small cores.
Outline
• Resource Pooling
• Why 3D?
• Design Solutions
• Adaptive Policies
• Results
• Conclusion
Application Resource Utilization
[Figure: utilization of the ROB, LDSQ, RF, and IQ for Application 1 and Application 2 on a dual-core machine; some structures are marked as underutilized]
Dynamic Heterogeneity Through Resource Pooling
[Figure: Core 1 and Core 2, each with a register file and a ROB]
Why Not Share in 2D?
• Long wire delay in 2D
• In 2D, sharing is not efficient
[Figure labels: Demanding, 500 psec, 5 nsec]
Our Solution: 3D
• Fast interconnection network
• As fast as a few ps (three orders of magnitude smaller than in 2D)
• Minimizes the communication latency
[Figure: communication latency of 5 psec (3D) vs. 5000 psec (2D)]
A principal advantage
• No change to the fundamental pipeline design of 2D architectures, yet still exploits 3D to provide greater energy proportionality and core customization
Stackable Structures for Resource Pooling
• Performance-bottleneck and power-hungry resources
• Reorder Buffer and Register File (SRAM)
• Instruction Queue and Load and Store Queue (CAM+SRAM)
• Our goal: share units across multiple cores with minimal impact on the design spec (latency, number of ports, and power)
• Use a previously proposed modular design
• Each partition is a self-standing and independently usable unit
• Effective in reducing power and access delay
[Figure: register file divided into four independent partitions (Part 1 to Part 4)]
Example of Resource Sharing
[Figure: register files in Core 0 and Core 1 connected through TSVs, with a decoder and a MUX; free partitions are marked]
• Additional logic to decide whether a partition is empty
• Additional logic to route the signal to the right partition
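The two pieces of additional logic can be illustrated with a small software model. This is only a sketch under stated assumptions: the partition size, the `(layer, partition)` ownership list, and the function names `is_free` and `route` are hypothetical and do not come from the slides, which describe hardware (decoder, MUX, TSVs) rather than code.

```python
# Hypothetical sketch: names and sizes below are illustrative assumptions.
ENTRIES_PER_PARTITION = 32  # assumed partition size; not given in the slides


def is_free(valid_bits):
    """A partition is free when none of its entries holds live state."""
    return not any(valid_bits)


def route(index, owned_partitions):
    """Map a logical entry index to (layer, partition, offset).

    owned_partitions is the ordered list of (layer, partition) pairs that
    the requesting core currently owns. A pair on another layer stands for
    a partition that would be reached through a TSV.
    """
    # Decoder step: split the index into a partition slot and an offset.
    slot, offset = divmod(index, ENTRIES_PER_PARTITION)
    if index < 0 or slot >= len(owned_partitions):
        raise IndexError("index outside the partitions owned by this core")
    # MUX step: select the physical partition that backs this slot.
    layer, partition = owned_partitions[slot]
    return layer, partition, offset
```

For example, a core on layer 0 that owns two local partitions and has borrowed partition 3 from layer 1 would pass `[(0, 0), (0, 1), (1, 3)]`; indices 64 and above then resolve to the borrowed partition.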
Adaptive Policies for Resource Pooling
Several issues need to be considered:
• Ownership
• Fast releasing
• Fast reallocation
• Cycle-by-cycle adaptation
• Preventing starvation
A simple adaptive policy specification (MinMax policy):
• Set limits on the size of each resource: how far it can grow (MAX) or shrink (MIN)
• Use a free list
• Use central arbitration
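The MinMax policy can be sketched as a small allocator. This is a minimal model, not the authors' implementation: the class name `MinMaxPool`, the choice to start every core at MIN, and granting one partition per request are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MinMaxPool:
    """Hypothetical model of one pooled structure (e.g., a register file)."""
    num_cores: int
    total_partitions: int
    min_parts: int  # MIN: a core never shrinks below this
    max_parts: int  # MAX: a core never grows beyond this
    owned: dict = field(init=False)
    free_list: list = field(init=False)

    def __post_init__(self):
        if self.num_cores * self.min_parts > self.total_partitions:
            raise ValueError("not enough partitions to give every core MIN")
        ids = iter(range(self.total_partitions))
        # Assumption: every core starts with exactly MIN partitions.
        self.owned = {
            core: [next(ids) for _ in range(self.min_parts)]
            for core in range(self.num_cores)
        }
        # Whatever is left over goes on the shared free list.
        self.free_list = list(ids)

    def request(self, core):
        """Central arbitration: grant one free partition unless at MAX."""
        if len(self.owned[core]) >= self.max_parts or not self.free_list:
            return None
        part = self.free_list.pop(0)
        self.owned[core].append(part)
        return part

    def release(self, core):
        """Return one partition to the free list, never going below MIN."""
        if len(self.owned[core]) <= self.min_parts:
            return None
        part = self.owned[core].pop()
        self.free_list.append(part)
        return part
```

In this model the MIN floor is what prevents starvation: no sequence of requests from other cores can take a core's last MIN partitions, while MAX stops a single core from draining the free list.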
MinMax Policy Example
[Figure: an arbitration unit and a free list managing register file partitions for Core 1 to Core 4, running Application 1 to Application 4; the MIN allocation is marked]
Baseline Architecture
Processor model
• High-end architecture: four OoO cores with an issue width of 4
• Medium-end architecture: four OoO cores with an issue width of 2
3D floorplans (different performance, flexibility, and temperature tradeoffs)
• (1) Conventional (thermal-optimized design)
• (2) Proposed (performance-optimized design)
[Figure: floorplans (1) and (2)]
Evaluation
• Workloads: 1 thread, 2 threads, 4 threads
• Metrics: power, performance, temperature, energy-delay
[Figure: Core 1 to Core 4 connected by links; legend: active core, idle core, link]
Single Thread Performance
[Figure: speedup per benchmark]
• Standard SPEC2K and SPEC2006 benchmarks
• Single benchmark (3 out of 4 cores are idle)
Multi-Thread Performance
• 2Thr: 2 idle cores + underutilized resources in the active cores
• 4Thr: no idle cores, only underutilized resources
[Figure: normalized weighted speedup (%)]
• Gains are dramatic when some cores are idle
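The slides do not define weighted speedup. As a rough sketch, assuming the commonly used definition (for each thread, IPC when co-running divided by IPC when running alone, summed over threads), the metric could be computed as below. The function names and the percentage normalization against a baseline are assumptions, and may not match how the plotted numbers were produced.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-thread IPC ratios (co-running IPC / stand-alone IPC).

    Assumed definition; the slides only name the metric.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one stand-alone IPC per thread")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def gain_percent(ws_pooled, ws_baseline):
    """Hypothetical normalization: percent improvement over a baseline."""
    return 100.0 * (ws_pooled / ws_baseline - 1.0)
```

Because each thread is normalized to its own stand-alone IPC, a large speedup on a low-IPC thread counts as much as the same relative speedup on a high-IPC thread.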
Medium-end vs High-end
Resource pooling makes the medium core significantly more competitive with the high-end.
[Figure: normalized weighted speedup (%) as resource sharing increases; annotations: 28% with 0 idle cores, 14% with 2 idle cores, only 3% with 3 idle cores]
Power
[Figure: power (Watt); annotations: 3X, 4X]
• Pooling pays a small price in power because of the enhanced throughput.
• Large speedups on low-IPC threads and a high average speedup, but a smaller increase in total instruction throughput and thus a smaller increase in power.
Temperature
[Figure: temperature (Celsius)]
• Interestingly, the temperature of the medium resource-pooling core is comparable to that of the high-end core
Efficiency
• Even at equal temperature, the more modest cores have a significant advantage in energy efficiency, measured in MIPS²/W (MIPS²/W is the inverse of the energy-delay product)
[Figure: normalized MIPS²/W; annotation: 2X]
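The relation between MIPS²/W and the energy-delay product can be made explicit with a short derivation. The symbols are introduced here for illustration and are not from the slides: a program of N instructions runs in time T at average power P.

```latex
\[
\text{MIPS} = \frac{N}{10^{6}\,T}, \qquad
E = P\,T, \qquad
\text{EDP} = E \cdot T = P\,T^{2}
\]
\[
\frac{\text{MIPS}^{2}}{\text{W}}
  = \frac{N^{2}}{10^{12}\,T^{2}\,P}
  = \frac{N^{2}}{10^{12}} \cdot \frac{1}{\text{EDP}}
\]
```

For a fixed instruction count N, MIPS²/W is therefore proportional to 1/EDP, so a higher MIPS²/W means a lower energy-delay product.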
Conclusions
• Homogeneous cores are inherently inefficient for a diverse workload. Cores are typically overprovisioned as a result.
• 3D stacking of cores enables fine-grain sharing (pooling) of resources not possible in 2D designs.
• Our dynamically heterogeneous 3D architecture allows the processor to construct the right core for each application dynamically, maximizing energy efficiency.
Our 3D pooling architecture:
• Leverages our experience in 2D pipeline design, yet still gains significant benefit from 3D
• Adapts to the specific demands of an application within a few cycles
• Reduces reliance on overprovisioned cores, instead grabbing larger resources only when needed