Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
![Page 1: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/1.jpg)
Feb 14th 2005 University of Utah 1
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
![Page 2: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/2.jpg)
Overview/Motivation
Wire delays are costly for performance and power
Latencies of 30 cycles to reach ends of a chip
50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
Abundant number of metal layers
![Page 3: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/3.jpg)
Wire Characteristics
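Wire resistance and capacitance per unit length:

$$R_{wire} = \frac{\rho}{(\mathrm{thickness} - \mathrm{barrier})(\mathrm{width} - 2\,\mathrm{barrier})}$$

$$C_{wire} = \epsilon_0 \left( 2K\epsilon_{horiz}\frac{\mathrm{thickness}}{\mathrm{spacing}} + 2\epsilon_{vert}\frac{\mathrm{width}}{\mathrm{layerspacing}} \right) + \mathrm{fringe}(\epsilon_{horiz}, \epsilon_{vert})$$

Increasing width and spacing lowers delay (as delay is proportional to RC) but also lowers bandwidth
Increasing width: lower resistance, higher capacitance, lower bandwidth
Increasing spacing: lower capacitance, lower bandwidth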
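The two per-unit-length formulas on this slide can be evaluated directly. The sketch below is only an illustration: the dimensions, the copper resistivity, the dielectric constants, and the value of K are placeholder assumptions, not values from the talk, and the fringe term is set to zero.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m

def wire_resistance(rho, thickness, width, barrier):
    """Resistance per unit length: rho / ((thickness - barrier) * (width - 2*barrier))."""
    return rho / ((thickness - barrier) * (width - 2 * barrier))

def wire_capacitance(k, eps_horiz, eps_vert, thickness, spacing,
                     width, layer_spacing, fringe=0.0):
    """Capacitance per unit length: coupling to neighbours, coupling to
    adjacent layers, plus a fringe term supplied by the caller."""
    horizontal = 2 * k * eps_horiz * thickness / spacing
    vertical = 2 * eps_vert * width / layer_spacing
    return EPS0 * (horizontal + vertical) + fringe

# Hypothetical geometry in metres; every number here is an assumption.
common = dict(thickness=0.4e-6, layer_spacing=0.4e-6)
narrow = dict(width=0.2e-6, spacing=0.2e-6)   # baseline-style wire
wide = dict(width=0.4e-6, spacing=0.4e-6)     # doubled width and spacing

def rc_product(geom):
    r = wire_resistance(1.7e-8, common["thickness"], geom["width"], 0.01e-6)
    c = wire_capacitance(1.5, 2.7, 2.7, common["thickness"], geom["spacing"],
                         geom["width"], common["layer_spacing"])
    return r, c, r * c

r_narrow, c_narrow, rc_narrow = rc_product(narrow)
r_wide, c_wide, rc_wide = rc_product(wide)

if __name__ == "__main__":
    print(f"RC ratio wide/narrow: {rc_wide / rc_narrow:.2f}")
```

With these placeholder numbers, doubling both width and spacing lowers R, C, and the RC product, at the cost of fitting half as many wires in the same metal area.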
![Page 4: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/4.jpg)
Design Space Exploration
Tuning wire width and spacing
[Figure: B wires drawn at dimension d and L wires at 2d; the L wires have lower resistance and lower capacitance, but also lower bandwidth]
![Page 5: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/5.jpg)
Transmission Lines
Allow extremely low delay
High implementation complexity and overhead!
Large width
Large spacing between wires
Design of sensing circuit
Shielding power and ground lines adjacent to each line
Implemented in test CMOS chips
Not employed in this study
![Page 6: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/6.jpg)
Design Space Exploration
Tuning repeater size and spacing
Traditional wires: large repeaters, optimum spacing
Power-optimal wires: smaller repeaters, increased spacing
[Figure: delay and power of the two repeater configurations]
![Page 7: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/7.jpg)
Design Space Exploration
B wires: base case
W wires: bandwidth optimized
P wires: power optimized
PW wires: power and bandwidth optimized
L wires: fast, low bandwidth
![Page 8: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/8.jpg)
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
![Page 9: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/9.jpg)
Evaluation Platform
L1 DCache Cluster
Centralized front-end
I-Cache & D-Cache
LSQ
Branch Predictor
Clustered back-end
![Page 10: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/10.jpg)
Cache Pipeline
Baseline path (functional unit to LSQ to L1 DCache):
Eff. address transfer: 10c
Mem. dep. resolution: 5c
Cache access: 5c
Data return at 20c

Path with partial address transfer (functional unit to LSQ to L1 DCache):
8-bit transfer: 5c
Partial mem. dep. resolution: 3c
Eff. address transfer: 10c
Cache access: 5c
Data return at 14c
![Page 11: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/11.jpg)
L wires: Accelerating cache access
Transmit the least-significant bits (LSBs) of the effective address through L wires
Faster memory disambiguation: partial comparison of loads and stores in the LSQ; introduces false dependences (< 9%)
Indexing data and tag RAM arrays: LSBs can prefetch data out of L1$; reduces access latency of loads
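The partial comparison in the LSQ can be illustrated with a few lines of code. This is a minimal sketch, not the LSQ logic from the talk: the 8-bit width is taken from the "8-bit Transfer" on the cache pipeline slide, the function names are hypothetical, and access size and alignment are ignored.

```python
LSB_BITS = 8  # assumed number of low-order address bits sent early on L wires

def lsb(addr: int, bits: int = LSB_BITS) -> int:
    """Low-order bits of an address."""
    return addr & ((1 << bits) - 1)

def may_alias(load_addr: int, store_addr: int, bits: int = LSB_BITS) -> bool:
    """Partial comparison: differing LSBs prove the addresses differ;
    equal LSBs mean the load must conservatively wait."""
    return lsb(load_addr, bits) == lsb(store_addr, bits)

def is_false_dependence(load_addr: int, store_addr: int,
                        bits: int = LSB_BITS) -> bool:
    """True when the partial comparison matches but the full addresses differ."""
    return may_alias(load_addr, store_addr, bits) and load_addr != store_addr

if __name__ == "__main__":
    load = 0x1234
    for store in (0x1234, 0x5634, 0x1238):
        print(hex(store), may_alias(load, store), is_false_dependence(load, store))
```

A mismatch in the low bits is always safe to act on, so the load can proceed early. A match may be a true dependence or a false one, which is where the false dependences on the slide come from.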
![Page 12: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/12.jpg)
L wires: Narrow Bit Width Operands
PowerPC: Data bit-width determines FU latency
Transfer of 10-bit integers on L wires
Can introduce scheduling difficulties
A predictor table of saturating counters
Accuracy of 98%
Reduction in branch mispredict penalty
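A predictor table of saturating counters for narrow results might look like the sketch below. The table size, the 2-bit counter width, indexing by program counter, and treating values as unsigned are all assumptions made for illustration. The slide states only that such a table is used and that results of up to 10 bits travel on L wires.

```python
NARROW_BITS = 10  # operand width that fits on the L wires, per the slide

def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """Assumption: a value is narrow if it fits in `bits` unsigned bits."""
    return 0 <= value < (1 << bits)

class NarrowWidthPredictor:
    """Table of saturating counters indexed by instruction address (hypothetical layout)."""

    def __init__(self, entries: int = 1024, max_count: int = 3):
        self.entries = entries
        self.max_count = max_count      # 3 gives a 2-bit counter
        self.table = [0] * entries      # start by predicting "wide"

    def _index(self, pc: int) -> int:
        return pc % self.entries

    def predict(self, pc: int) -> bool:
        """Predict narrow when the counter is in its upper half."""
        return self.table[self._index(pc)] > self.max_count // 2

    def update(self, pc: int, value: int) -> None:
        """Move the counter toward narrow or wide, saturating at the ends."""
        i = self._index(pc)
        if is_narrow(value):
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

if __name__ == "__main__":
    p = NarrowWidthPredictor()
    for result in (5, 17, 900, 5000, 12):
        print(p.predict(0x40), is_narrow(result))
        p.update(0x40, result)
```

The hysteresis of a saturating counter means one occasional wide result does not flip a well-established narrow prediction, which helps the scheduler know ahead of time which wires the result will use.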
![Page 13: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/13.jpg)
Power-Efficient Wires
B wires: base case
PW wires: power and bandwidth optimized
Idea: steer non-critical data through the energy-efficient PW interconnect
![Page 14: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/14.jpg)
PW wires: Power/Bandwidth Efficient
Ready register operands
Transfer of data at instruction dispatch
Transfer of input operands to remote register file
Covered by long dispatch-to-issue latency
Store data
Could stall commit process
Delay dependent loads
[Figure: Rename & Dispatch feeding four clusters, each with an IQ, Regfile, and FU. Operand is ready at cycle 90; consumer instruction dispatched at cycle 100]
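The steering idea on this slide can be sketched as a small policy function. The function name, the string labels, and the exact rule are hypothetical. The slide only identifies ready register operands and store data as candidates for the PW wires.

```python
def choose_wires(kind, ready_cycle=None, dispatch_cycle=None):
    """Pick an interconnect for one transfer (illustrative policy only).

    kind: "register_operand", "store_data", or anything else.
    ready_cycle / dispatch_cycle: when the operand became available and
    when its consumer is dispatched.
    """
    if kind == "store_data":
        # Candidate for PW wires; the slide notes this could stall commit
        # and delay dependent loads.
        return "PW"
    if (kind == "register_operand"
            and ready_cycle is not None
            and dispatch_cycle is not None
            and ready_cycle <= dispatch_cycle):
        # Operand already ready at dispatch: the slower transfer is hidden
        # by the dispatch-to-issue latency.
        return "PW"
    # Everything else is treated as latency-critical in this sketch.
    return "B"

if __name__ == "__main__":
    # The example from the slide: ready at cycle 90, consumer dispatched at 100.
    print(choose_wires("register_operand", ready_cycle=90, dispatch_cycle=100))
```

An operand produced after its consumer has been dispatched stays on the B wires here, since delaying it would delay the consumer's issue.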
![Page 15: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/15.jpg)
Outline
Overview
Wire Design Space Exploration
Employing L wires for Performance
PW wires: The Power Optimizers
Results
Conclusions
![Page 16: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/16.jpg)
Evaluation Methodology
SimpleScalar 3.0 augmented to simulate a dynamically scheduled 4-cluster model
Crossbar interconnects (L, B and PW wires)
Link latencies between the L1 DCache and a cluster:
B wires (2 cycles)
L wires (1 cycle)
PW wires (3 cycles)
![Page 17: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/17.jpg)
Heterogeneous Interconnects
Intercluster global interconnect
72 B wires (64 data bits and 8 control bits): repeaters sized and spaced for optimum delay
18 L wires: wide wires and large spacing; occupies more area; low latencies
144 PW wires: poor delay; high bandwidth; low power
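The wire counts on this slide (72 B, 18 L, 144 PW) are consistent with each wire type filling the same metal area. The sketch below makes that accounting explicit. The per-wire area ratios (an L wire taking four times and a PW wire half the area of a B wire) are an assumption inferred from these counts and from the relative metal area column in the four-cluster results; they are not stated on the slide.

```python
# Assumed metal area per wire, in units of one B wire.
AREA_PER_WIRE = {"B": 1.0, "PW": 0.5, "L": 4.0}

def metal_area(link):
    """Total area of a link given as {wire_type: count}, in B-wire units."""
    return sum(AREA_PER_WIRE[kind] * count for kind, count in link.items())

def relative_metal_area(link, baseline_b_wires=144):
    """Area relative to a baseline made only of B wires."""
    return metal_area(link) / baseline_b_wires

if __name__ == "__main__":
    for link in ({"B": 72}, {"L": 18}, {"PW": 144}):
        print(link, metal_area(link))
    print(relative_metal_area({"PW": 144, "L": 36}))
```

Under these ratios, a 144 PW plus 36 L link comes out at 1.5 times the area of 144 B wires, matching the corresponding row of the four-cluster table.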
![Page 18: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/18.jpg)
Analytical Model
$$C = \underbrace{C_a}_{1} + \underbrace{W_s C_b}_{2} + \underbrace{\frac{C_c}{S}}_{3}$$

1: Fringing capacitance
2: Capacitance between adjacent metal layers
3: Capacitance between adjacent wires
RC model of the wire
Total Power = Short-Circuit Power + Switching Power + Leakage Power
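The capacitance model and the power sum can be written as two short functions. This is a sketch under assumptions: `width` stands for the W_s factor and `spacing` for S, and the coefficients below are arbitrary unitless placeholders chosen only to show how each term responds to spacing.

```python
def capacitance(c_a, c_b, c_c, width, spacing):
    """C = C_a + W_s * C_b + C_c / S.

    c_a: fringing term; width * c_b: coupling to adjacent metal layers;
    c_c / spacing: coupling to adjacent wires on the same layer.
    """
    return c_a + width * c_b + c_c / spacing

def total_power(short_circuit, switching, leakage):
    """Total power as the sum of its three components."""
    return short_circuit + switching + leakage

# Placeholder coefficients (arbitrary units, not from the talk).
c_tight = capacitance(c_a=1.0, c_b=2.0, c_c=3.0, width=1.0, spacing=1.0)
c_loose = capacitance(c_a=1.0, c_b=2.0, c_c=3.0, width=1.0, spacing=2.0)

if __name__ == "__main__":
    print(c_tight, c_loose)
```

Only the third term depends on spacing, so moving wires apart reduces the coupling between neighbours while leaving the fringing and inter-layer terms unchanged.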
![Page 19: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/19.jpg)
Evaluation methodology
SimpleScalar 3.0 augmented to simulate a dynamically scheduled 16-cluster model
[Figure labels: I-Cache, D-cache, LSQ, cluster, crossbar, ring interconnect]
Ring latencies:
B wires (4 cycles)
PW wires (6 cycles)
L wires (2 cycles)
![Page 20: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/20.jpg)
IPC improvements: L wires
L wires improve performance by 4.2% on a four-cluster system and 7.1% on a sixteen-cluster system
[Figure: IPC (y-axis, 0 to 2.5) per benchmark: ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise, and their mean (AM). Series: "Baseline: 144 B-Wires" and "Low-latency optimizations: 144 B-Wires and 36 L-Wires"]
![Page 21: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/21.jpg)
Four Cluster System: ED2 Improvements
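| Link | Relative metal area | IPC | Relative processor energy (10%) | Relative ED2 (10%) | Relative ED2 (20%) |
|---|---|---|---|---|---|
| 144 B | 1.0 | 0.95 | 100 | 100 | 100 |
| 288 PW | 1.0 | 0.92 | 97 | 103.4 | 100.2 |
| 288 PW, 36 L | 2.0 | 0.97 | 99 | 94.4 | 93.2 |
| 144 B, 36 L | 2.0 | 0.99 | 101 | 93.3 | 94.5 |
| 288 B | 2.0 | 0.98 | 103 | 96.6 | 99.2 |
| 144 PW, 36 L | 1.5 | 0.96 | 97 | 95.0 | 92.1 |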
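The relative ED2 column can be approximately reproduced from the energy and IPC columns. The sketch below assumes that delay is inversely proportional to IPC (same clock and instruction count) and that every value is normalized to the 144 B baseline. The function name is hypothetical, and small differences from the table are expected because the published IPC values are rounded.

```python
def relative_ed2(energy, ipc, base_energy=100.0, base_ipc=0.95):
    """Energy x delay^2 relative to a baseline, scaled so the baseline is 100.

    Assumes delay is proportional to 1 / IPC.
    """
    delay_ratio = base_ipc / ipc
    return 100.0 * (energy / base_energy) * delay_ratio ** 2

if __name__ == "__main__":
    # 288 PW link from the four-cluster table: energy 97, IPC 0.92.
    print(round(relative_ed2(97, 0.92), 1))   # about 103.4
    # 144 PW, 36 L link: energy 97, IPC 0.96.
    print(round(relative_ed2(97, 0.96), 1))   # about 95.0
```

Because delay enters squared, a 3% IPC loss outweighs a 3% energy saving, which is why the 288 PW link ends up with a worse ED2 than the baseline despite using less energy.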
![Page 22: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/22.jpg)
Sixteen Cluster system: ED2 gains
| Link | IPC | Relative processor energy (20%) | Relative ED2 (20%) |
|---|---|---|---|
| 144 B | 1.11 | 100 | 100 |
| 144 PW, 36 L | 1.05 | 94 | 105.3 |
| 144 B, 36 L | 1.19 | 102 | 88.7 |
| 288 B, 36 L | 1.22 | 107 | 88.7 |
| 288 B | 1.18 | 105 | 93.1 |
![Page 23: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/23.jpg)
Conclusions
Exposing the wire design space to the architecture
A case for micro-architectural wire management!
A low latency low bandwidth network alone helps improve performance by up to 7%
ED2 improvements of about 11% compared to a baseline processor with homogeneous interconnect
Entails hardware complexity
![Page 24: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/24.jpg)
Future work
3-D wire model for the interconnects
Design of heterogeneous clusters
Interconnects for cache coherence and L2$
![Page 25: Microarchitectural Wire Management for Performance and Power in Partitioned Architectures](https://reader035.vdocuments.net/reader035/viewer/2022062810/56815a64550346895dc7a862/html5/thumbnails/25.jpg)
Questions and Comments?
Thank you!