Tortola: Addressing Tomorrow’s Computing Challenges through Hardware/Software Symbiosis
Kim Hazelwood
September 29, 2006
2 of 26
Modern Computing Challenges
• Performance
• Power
– Energy consumption, max instantaneous power, di/dt
• Temperature
– Total heat output, “hot spots”
• Reliability
– Neutron strikes, alpha particles, MTBF, design flaws
• Approaches: Circuit, microarchitecture, compiler
• Constraint: Fixed HW-SW interface (e.g., x86)
3 of 26
Typical Approaches
• Optimize using SW or HW techniques in isolation
• Performance
– SW: Compile-time optimizations
– HW: Architectural improvements, VLSI technology
• Reliability: Code/data duplication (HW or SW)
• Power & Temperature
– HW control mechanisms
– Profile + recompile cycle
4 of 26
Modern Design Constraints
Compilers – “Compile once, run anywhere”
– Cannot ship “MS Office for 1Q05 batch of Pentium-4 3GHz, > 1GB RAM, BrandX power supply, located in high altitudes…”
Microarchitecture – Limited window of application knowledge (past must predict the future)
VLSI – Guaranteed correctness, reliability
We currently must optimize for the common case (but must design for the worst case)
5 of 26
The Power of Virtualization
• A HW-SW interface layer
[Figure: a binary modifier layer sits between the SW applications and the HW. Interfaces on either side of the layer: initially x86 and x86; eventually SWI and HWI.]
6 of 26
Dynamic Binary Modification
• Creates a modified code image at run time
[Figure: EXE → Transform → Code Cache → Execute → Profile]
Examples:
• Dynamo (HP)
• DAISY/BOA (IBM)
• CMS (Transmeta)
• Mojo (Microsoft)
• Strata (UVa)
• Pin (Intel)
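The transform / code cache / execute / profile loop on this slide can be sketched in a few lines. This is a toy model, not how any of the listed systems is implemented: the `CodeCache` class, the `add_counter` transform, and the string "instructions" are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict, List

Block = List[str]  # toy basic block: a list of instruction strings


class CodeCache:
    """Toy code cache: maps an original block address to its modified copy."""

    def __init__(self, transform: Callable[[Block], Block]) -> None:
        self.transform = transform
        self.cache: Dict[int, Block] = {}
        self.exec_counts: Dict[int, int] = {}

    def fetch(self, addr: int, program: Dict[int, Block]) -> Block:
        # Transform: a block is rewritten the first time it is reached.
        if addr not in self.cache:
            self.cache[addr] = self.transform(program[addr])
        # Profile: count how often each cached block is requested.
        self.exec_counts[addr] = self.exec_counts.get(addr, 0) + 1
        # Execute: a real system would jump into the cached copy here.
        return self.cache[addr]


def add_counter(block: Block) -> Block:
    # Hypothetical transformation: prepend one instrumentation instruction.
    return ["count++"] + block


program = {0x400: ["load", "add"], 0x410: ["store"]}
code_cache = CodeCache(add_counter)
trace = [0x400, 0x410, 0x400]  # block addresses in execution order
executed = [code_cache.fetch(addr, program) for addr in trace]
```

The second visit to block 0x400 reuses the cached copy, which is where the run-time overhead of transformation is amortized.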
7 of 26
Dynamic Instrumentation Demo
• Pin
– Four architectures – IA32, EM64T, IPF, XScale
– Four OSes – Linux, FreeBSD, MacOS, Windows
– http://rogue.colorado.edu/pin/
8 of 26
Dynamic Optimization Demo
• DynamoRIO
– Windows and Linux for IA32
– http://www.cag.lcs.mit.edu/dynamorio/
9 of 26
Dynamic Binary Modification
• Creates a modified code image at run time
• Always triggered by software events … until now
10 of 26
Tortola: Symbiotic Optimization
• Enable HW/SW Communication
[Figure: layered stack of SW applications, binary modifier, and HW]
11 of 26
Simulation Methodology
• SimpleScalar 4.0 for x86
• Wattch 1.02 power extensions
• Pin dynamic instrumentation system (x86/Linux version)
[Figure: the same three-layer stack mapped onto the simulation setup. SW application: benchmarks. Binary modifier: Pin. HW: Wattch & SimpleScalar/x86.]
12 of 26
Tortola Applications
• Combine global program information with run-time feedback
– System-specific power usage
– Application-specific heat anomalies
– Workload/input specific performance optimization
• Reduce hardware complexity
– No more backwards compatibility warts
– Fix bugs after shipment
– Reduce time to market for new architectures
• One such application: The di/dt problem
13 of 26
The Di/dt Problem
• Voltage stability is important for reliability, performance
• Low-power techniques have a negative side effect: current variation
• Dips (undershoots) in supply voltage – can cause incorrect values to be calculated or stored
• Spikes (overshoots) in supply voltage – can cause reliability problems
14 of 26
The Di/dt Problem
• ITRS cites noise management as a Grand Challenge for 5-10 year time frame
• Several trends are aggravating the issue:
– Voltage is scaling down with technology
– Current draw is increasing
– Package impedance is not scaling as quickly
– Aggressive clock gating causes large swings in processor current draw (di/dt)
15 of 26
Di/dt Solutions
• Software: Compiler optimizations
• MicroArch: Sensor/actuator mechanisms
• Circuit-level: Decoupling capacitors, more Vdd/Gnd pins on package
• Co-designed MicroArch & SW: Binary modifier
16 of 26
Sensor-Actuator Mechanisms
• On-chip voltage sensors detect abnormally high/low voltage levels
• On-chip actuator then attempts to quickly raise/lower the processor’s current draw
– Phantom firing
  • increases current (at the expense of power)
– Resource throttling
  • reduces current (at the expense of performance)
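As a rough sketch of the control decision described above: the sensor reports a voltage, and the actuator picks one of the two responses. The function name, the return strings, and the idea of passing thresholds as plain floats are illustrative assumptions; real sensor/actuator hardware is not a Python function.

```python
def actuator_response(voltage: float, low_threshold: float, high_threshold: float) -> str:
    """Pick an actuator action for one sensed supply-voltage sample (toy model)."""
    if voltage < low_threshold:
        # Voltage is dipping: reduce current draw, costing performance.
        return "resource_throttling"
    if voltage > high_threshold:
        # Voltage is overshooting: increase current draw, costing power.
        return "phantom_firing"
    return "none"


# Example with hypothetical thresholds around a 1 V supply.
actions = [actuator_response(v, 0.97, 1.03) for v in (0.96, 1.00, 1.04)]
```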
17 of 26
Detecting Imminent Emergencies
[Figure: supply-voltage axis with ticks at 1.05V, 1.03V, 1V, 0.97V, and 0.95V, marking the operating voltage range, the control threshold, the soft emergency level, and the hard emergency level]
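One way to read the levels on this slide is as bands around the nominal voltage. The sketch below assumes, without confirmation from the slide text, that a hard emergency means leaving the 0.95V to 1.05V range and a soft emergency means leaving the 0.97V to 1.03V range; the function and parameter names are hypothetical.

```python
def classify_voltage(voltage: float, nominal: float = 1.0,
                     soft_margin: float = 0.03, hard_margin: float = 0.05) -> str:
    """Classify one voltage sample; the margin-to-label mapping is an assumption."""
    deviation = abs(voltage - nominal)
    if deviation > hard_margin:
        return "hard emergency"
    if deviation > soft_margin:
        return "soft emergency"
    return "normal"


levels = [classify_voltage(v) for v in (0.94, 0.96, 1.00, 1.04, 1.06)]
```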
18 of 26
Targeting Mid-Frequency Di/dt
• Problematic: wide current spike
• Worst case: pulse at the resonant frequency
[Figure: two plots of supply voltage (V) and processor current (A) versus time (cycles), one for a 20-cycle current pulse and one for a 60-cycle current pulse, with the minimum and maximum voltage marked. From: Joseph et al., HPCA-9]
A Di/dt Stressmark
BEGIN_LOOP:
  …
  ldt $f1, ($4)
  divt $f1, $f2, $f3
  divt $f3, $f2, $f3
  stt $f3, 8($4)
  ldq $7, 8($4)
  cmovne $31, $7, $3
  stq $3, ($4)
  stq $3, ($4)
  stq $3, ($4)
  …
  stq $3, ($4)
  …
  JUMP BEGIN_LOOP
[Figure annotations: the dependent ldt/divt/stt chain is labeled Sequential, Low Current; the run of stq stores is labeled Parallel, High Current]
But… the actuator engages on every loop iteration, degrading performance
Why not correct the problem in the code?
20 of 26
Proposed Solution
• Leverage our additional software layer to supplement existing solutions
• Microarchitecture provides feedback to our software-based virtual layer
[Figure: in the virtual layer (VL) between SW and HW, the binary modifier takes the executable and produces an altered executable, which runs on a microprocessor with sensor+actuator extensions]
21 of 26
Required Investigations
• Characterizing emergencies
– How often do we see di/dt emergency loops?
• Communication between the microarchitecture and the virtual layer
– What information should be passed to virtual layer during an emergency?
• Fixing di/dt via binary modification
– Will existing techniques help?
– New algorithms?
22 of 26
Static vs. Dynamic Instances
Data suggests modifying a few code sequences will eliminate many voltage emergencies
[Figure: bar chart of distinct vs. total emergencies per benchmark, on a log scale from 1 to 1,000,000. Benchmarks: gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, wupwise, swim, art, mgrid, applu, mesa, galgel, equake, facerec, sixtrack, ammp, apsi]
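The distinct-versus-total comparison amounts to counting how many emergencies map back to the same static code location. A minimal sketch, assuming each emergency is logged with the program counter of the code that triggered it; the log format and the sample addresses are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_emergencies(emergency_pcs: Iterable[int]) -> Dict[str, object]:
    """Count total (dynamic) and distinct (static) emergency locations."""
    counts = Counter(emergency_pcs)
    return {
        "total": sum(counts.values()),     # dynamic instances
        "distinct": len(counts),           # static code locations
        "hottest": counts.most_common(1),  # location with the most emergencies
    }


# Hypothetical log: one loop at 0x4005d0 causes most of the emergencies.
log = [0x4005D0] * 5 + [0x400A10] * 2 + [0x4005D0] * 3
summary = summarize_emergencies(log)
```

When "total" is far larger than "distinct", patching the few hot locations removes most emergencies, which is the point the chart makes.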
23 of 26
Possible Compiler Optimizations
• Our goal is to
– Smooth out current profile, or
– Knock pulses off of the resonant frequency
• Some existing options
– Software pipelining, code motion, instruction padding
[Figure: the binary modifier applies optimizations to the executable and produces an altered executable, which runs on a microprocessor with sensor+actuator extensions]
24 of 26
Loop Unrolling & SW Pipelining
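[Figure: current profiles for three versions of a loop]
• Problematic loop: AABB, repeated each iteration (current alternates between high and low)
• Software pipelining smooths the profile: iterations 1, 2, and 3 of AABB overlap in time
• Loop unrolling disrupts the resonance pulse: the unrolled loop is AAAABBBB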
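The effect of the two transformations on the current profile can be mimicked with a toy model in which each "A" cycle draws high current and each "B" cycle draws low current. The current values, the helper names, and the two-cycle stagger are illustrative assumptions, not measurements.

```python
from itertools import groupby
from typing import List

CURRENT = {"A": 2.0, "B": 1.0}  # hypothetical per-cycle current for each op type


def profile(body: str, iterations: int) -> List[float]:
    """Per-cycle current when the loop body runs back to back."""
    return [CURRENT[op] for op in body * iterations]


def unroll_and_group(body: str, factor: int) -> str:
    """Unroll by `factor` and group like ops, e.g. AABB -> AAAABBBB."""
    return "".join(op * (len(list(run)) * factor) for op, run in groupby(body))


def pulse_period(samples: List[float]) -> int:
    """Smallest shift that maps the profile onto itself."""
    n = len(samples)
    for p in range(1, n + 1):
        if all(samples[i] == samples[i + p] for i in range(n - p)):
            return p
    return n


def pipelined_profile(body: str, iterations: int, stagger: int) -> List[float]:
    """Per-cycle current when iteration k starts at cycle k * stagger."""
    total = [0.0] * (stagger * (iterations - 1) + len(body))
    for it in range(iterations):
        for j, op in enumerate(body):
            total[it * stagger + j] += CURRENT[op]
    return total


original_period = pulse_period(profile("AABB", 4))
unrolled_period = pulse_period(profile(unroll_and_group("AABB", 2), 4))
overlapped = pipelined_profile("AABB", 4, stagger=2)
steady_state = overlapped[2:8]  # cycles where two iterations overlap
```

Unrolling doubles the pulse period, moving it away from whatever frequency the original loop was exciting, while the overlapped schedule keeps the summed current constant in steady state.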
25 of 26
Unrolling the Di/dt Stressmark
[Figure: supply voltage, on an axis from 0.97V to 1.02V, before and after loop unrolling. Before: high and low current phases H and L. After: phases H1, H2, L1, L2.]
26 of 26
Summary
• Symbiotic program optimization is a powerful approach
• The di/dt problem – well suited for a symbiotic solution
• The Tortola design can also target power reduction, temperature reduction, reliability, etc.
http://www.tortolaproject.com/