15-447 Computer Architecture Fall 2008 ©
November 24, 2007
Nael Abu-Ghazaleh
[email protected]://www.qatar.cmu.edu/~msakr/15447-f08
Lecture 27: Power Aware Architecture Design
CS 15-447: Computer Architecture
Uniprocessor Performance (SPECint)
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10,000) versus year, 1978 to 2006, with growth segments labeled 25%/year, 52%/year, and ??%/year, and a 3X annotation.]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006
Sea change in chip design: what is emerging?
Three walls
1. ILP Wall:
• Not enough parallelism available in one thread
• Very costly to find more
• Implications: can’t continue to grow IPC
• VLIW? SIMD ISA extensions?
2. Memory Wall:
• Growing gap between DRAM and processor speed
• Caching helps, but only so much
• Implications: cache misses are getting more expensive
• Multithreaded processors?
3. Physics/Power Wall:
• Can’t continue to shrink devices; running into physical limits
• Power dissipation is also increasing (more today)
• Implications: can’t rely on a performance boost from shrinking transistors
• But we will continue to get more transistors
Multithreaded Processors
• What support is needed?
• Multithreading can be used to help ILP as well
– Which designs help ILP in the picture to the right?
Power-Efficient Processor Design
Goals:
1. Understand why energy efficiency is important
2. Learn the sources of energy dissipation
3. Overview a selection of approaches to reduce energy
Why Worry About Power?
• Embedded systems:
– Battery life
• High-end processors:
– Cooling (costs $1 per chip per Watt if operating @ >40W)
– Power cost: 15 cents per kilowatt-hour (kWh)
• A single 900 Watt server costs about 100 USD/month to run, not including cooling costs!
– Packaging
– Reliability
Why Worry About Power: Oak Ridge National Laboratory’s Jaguar
• Current highest-performance supercomputer
– 1.3 sustained petaflops (quadrillion FP operations per second)
– 45,000 processors, each a quad-core AMD Opteron
• 180,000 cores!
– 362 terabytes of memory; 10 petabytes of disk space
– Check top500.org for a list of the most powerful supercomputers
• Power consumption? (without cooling)
– 7 megawatts!
– 0.75 million USD/month to power
– green500.org rates computers based on flops/Watt
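The running-cost figures on these slides can be checked with a few lines of arithmetic. The sketch below assumes a 30-day month and assumes that the 15 cents/kWh rate quoted for servers also applies to Jaguar; the function name `monthly_energy_cost_usd` is illustrative only.

```python
def monthly_energy_cost_usd(power_watts, usd_per_kwh=0.15, hours_per_month=24 * 30):
    """Electricity cost of running a constant load for one (assumed 30-day) month."""
    kwh = power_watts / 1000.0 * hours_per_month
    return kwh * usd_per_kwh


server_cost = monthly_energy_cost_usd(900)        # the 900 W server
jaguar_cost = monthly_energy_cost_usd(7_000_000)  # the 7 MW Jaguar system

print(f"900 W server: about {server_cost:,.0f} USD per month")
print(f"7 MW Jaguar:  about {jaguar_cost:,.0f} USD per month")
```

Under these assumptions the server comes out just under 100 USD per month and Jaguar at roughly 0.75 million USD per month, consistent with the slides.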
Peak Power in Today’s CPUs
• Alpha 21264: 95W
• AMD Athlon XP: 67W
• HP PA-8700: 75W
• IBM Power 4: 135W
• Intel Itanium: 130W
• Intel Xeon: 59W
Even worse when we consider power density (W/cm²)
Where is This Power Coming From?
• Sources of power consumption in CMOS:
– Dynamic or active power (due to the switching of transistors)
– Short-circuit power
– Leakage power
• High temperature increases power consumption
– Silicon is a bad conductor: higher temperature -> higher leakage current -> even higher temperature…
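Power Consumption in CMOS
– Dynamic Power Consumption
• Charging and discharging capacitors
[Figure: two CMOS inverters (Vdd, In, Out) driving a load capacitance C, one for the input transition 0 -> 1 and one for 1 -> 0; each is labeled with the energy $E = C V^2$.]
$E = C V^2$
$P = E \cdot f = C V^2 f$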
Dynamic Power Consumption
$P = \alpha \cdot C \cdot V^2 \cdot f$
• $\alpha$ (activity factor): how often do wires switch
• $V$ (supply voltage): has been dropping with successive process generations
• $f$ (clock frequency): increasing
• $C$ (capacitance): function of wire length, transistor size
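To make the role of each factor concrete, here is a minimal sketch of the dynamic power formula. The numeric values (activity factor 0.1, 1 nF switched capacitance, 1.2 V, 2 GHz) are made-up illustrative inputs, not figures from the lecture.

```python
def dynamic_power_watts(activity, capacitance_farads, vdd_volts, freq_hz):
    """Dynamic power: activity factor * C * V^2 * f."""
    return activity * capacitance_farads * vdd_volts ** 2 * freq_hz


# Hypothetical baseline design point (illustrative numbers only).
base = dynamic_power_watts(0.1, 1e-9, 1.2, 2e9)
half_voltage = dynamic_power_watts(0.1, 1e-9, 0.6, 2e9)
half_frequency = dynamic_power_watts(0.1, 1e-9, 1.2, 1e9)

print(f"baseline:        {base:.3f} W")
print(f"half voltage:    {half_voltage / base:.2f}x of baseline")
print(f"half frequency:  {half_frequency / base:.2f}x of baseline")
```

Halving the voltage cuts dynamic power to a quarter, while halving the frequency only halves it, which is why voltage is the most attractive knob.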
Power Consumption in CMOS
– Short-circuit power
• Both PMOS and NMOS are conducting
[Figure: CMOS inverter with labels Vdd, In, Out, C, and 1/2, showing the short-circuit current Isc.]
About 2% of the overall power.
Power Consumption in CMOS
– Leakage power: transistors are not perfect switches, and they leak.
[Figure: CMOS inverter with labels Vdd, In, Out, C, 0, and 1, showing the leakage current Isub.]
20% now; expect 40% in the next technology generation, and growing
Cooling
• All of the consumed power has to be dissipated
• Done by means of heat pipes, heat sinks, fans, etc.
• Different segments use different cooling mechanisms.
• Costs $1-$3 or more per chip per Watt if operating @ >40W
• We may soon need budgets for liquid-cooling or refrigeration hardware.
Voltage Scaling
• Transistors switch more slowly at lower voltage.
• To recover the speed, the threshold voltage has to be lowered as well
• Leakage current grows exponentially as the threshold voltage decreases
• Leakage power goes through the roof
Technology Scaling: the Enabler
• New process generation every 2-3 years
• Ideal shrink for a 30% reduction in size:
– Voltage scales down by 30%
– Gate delays are shortened by 30%
• ~50% frequency gain (500ps cycle = 2GHz clock, 333ps cycle = 3GHz clock)
– Transistor density increases by 2X
• 0.7X shrink on a side, 2X area reduction
– Capacitance/transistor reduced by 30%
Ideal Process Shrink: the Results
• 2/3 reduction in energy/transition ($C V^2$: $0.7 \times 0.7^2 = 0.34\text{X}$)
• 1/2 reduction in power ($C V^2 f$: $0.7 \times 0.7^2 \times 1.5 = 0.5\text{X}$)
• But twice as many transistors, or more if area increases
• Power density unchanged
Looks good!
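The ideal-shrink arithmetic can be reproduced directly. This sketch uses the slide's rounded factors (0.7X for size, voltage, and capacitance; 1.5X for frequency); the variable names are illustrative.

```python
shrink = 0.7            # linear dimension, voltage, and capacitance each scale by 0.7X
frequency_gain = 1.5    # the slides round the frequency gain to ~50%

energy_per_transition = shrink * shrink ** 2              # C * V^2
power_per_transistor = energy_per_transition * frequency_gain
transistors_per_area = 1 / shrink ** 2                    # 0.7X on a side -> ~2X density
power_density = power_per_transistor * transistors_per_area

print(f"energy/transition:    {energy_per_transition:.3f}X")
print(f"power/transistor:     {power_per_transistor:.3f}X")
print(f"transistor density:   {transistors_per_area:.2f}X")
print(f"power density:        {power_density:.2f}X")
```

Power per transistor halves while density doubles, so power per unit area stays roughly constant, which is the "looks good" conclusion.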
Process Technology – the Reality*
• Performance does not scale with frequency
– New designs increase frequency by 2X
– New designs use 2X-3X more transistors to get 1.4X-1.8X performance*
• So, every new process generation:
– Power goes up by about 2X (3X transistors * 2X switches * 1/3 energy)
– Leakage power is also increasing
– Power density goes up 30%~80% (2X power / 1.X area)
• Will get worse in future technologies, because voltage will scale down less
*Source: “Power – the Next Frontier: a Microarchitecture Perspective”, Ronny Ronen, Keynote speech at PACS’02 Workshop.
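The "reality" numbers follow from the same kind of multiplication. In the sketch below, the area growth factors 1.1X and 1.5X are assumptions chosen to illustrate the "1.X area" on the slide; they are not stated in the source.

```python
def generation_power_factor(transistor_factor, switching_factor, energy_factor):
    """Total power change per process generation, as a product of the three factors."""
    return transistor_factor * switching_factor * energy_factor


# 3X transistors * 2X switching rate * 1/3 energy per transition
power_factor = generation_power_factor(3, 2, 1 / 3)
print(f"power per generation: {power_factor:.2f}X")

# Assumed die-area growth factors standing in for "1.X area" (hypothetical values).
for area_factor in (1.1, 1.5):
    density_factor = power_factor / area_factor
    print(f"area {area_factor}X -> power density {density_factor:.2f}X")
```

With these assumed area factors, power density grows by about 33% to 82%, in line with the 30%~80% range quoted above.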
Ugly Numbers*
                i486 (0.8 µm)   Pentium 4 (0.18 µm)   Factor
Transistors     1.2M            42M                   35x
Frequency       50 MHz          2000 MHz              40x
Voltage         5 V             1.65 V                1/3x
Peak Power      5 W             100 W                 20x
Die size        0.73 cm²        2.17 cm²              3x
Power density   6.8 W/cm²       46 W/cm²              7x
The Bottom Line
• Circuits and process scaling alone can no longer solve all power problems
• SYSTEMS must also be power-aware
– OS
– Compilers
– Architecture
• Techniques at the architectural level are needed to reduce the absolute power dissipation as well as the power density
Microarchitectural Techniques for Power Reduction
A Superscalar Datapath
[Figure: superscalar pipeline with fetch stages (F1, F2), decode/dispatch stages (D1, D2), instruction dispatch into the issue queue (IQ), instruction issue to function units FU1, FU2, ..., FUm (EX), load/store queue (LSQ), D-cache, reorder buffer (ROB), architectural register file (ARF), and result/status forwarding buses.]
Performance=N*f*IPC
Actually, it’s the whole system, but we focus on the processor
Microarchitectural Techniques: General Approach
• Dynamic power:
– Reduce the activity factor
– Reduce the switching capacitance (usually not possible)
– Reduce the voltage/frequency (SpeedStep; e.g., a 1.6 GHz Pentium M can be clocked down to 600 MHz, and its voltage can be dropped from 1.48V to 0.95V)
• Leakage power:
– Put some portions of the on-chip storage structures in a low-power stand-by mode, or even completely shut off the power supply to these partitions
– Resizing
• We usually give up some performance to save energy, but how much?
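As a rough illustration of the SpeedStep example, the sketch below estimates the change in dynamic power for the quoted Pentium M operating points. It assumes the activity factor and switched capacitance stay the same at both points and ignores leakage, so the result is only an estimate.

```python
def dynamic_power_ratio(v_old, v_new, f_old_hz, f_new_hz):
    """Ratio of new to old dynamic power, assuming unchanged activity and capacitance."""
    return (v_new / v_old) ** 2 * (f_new_hz / f_old_hz)


# Pentium M operating points quoted on the slide: 1.6 GHz @ 1.48 V -> 600 MHz @ 0.95 V
ratio = dynamic_power_ratio(1.48, 0.95, 1.6e9, 600e6)
frequency_ratio = 600e6 / 1.6e9

print(f"frequency: {frequency_ratio:.3f}X of the original")
print(f"dynamic power: {ratio:.3f}X of the original")
```

Under these assumptions, dynamic power falls to roughly 15% of its original value while frequency falls to 37.5%, because the voltage term enters quadratically.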
Guideline
• If we reduce voltage, there is a linear drop in maximum frequency (and performance)
• “The cube law”: $P = kV^3$ (~1% V = 3% P)
– If we use voltage scaling, we can approximately trade 1% of performance loss for 3% of power reduction.
• Any architectural technique that trades performance for power should do better than that (or at least as well). Otherwise, simple voltage scaling can be used to achieve better tradeoffs.
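The guideline can be turned into a simple check. The sketch below assumes performance scales linearly with voltage and power follows the cube law; `beats_voltage_scaling` and the example numbers fed to it are hypothetical, chosen only to show both outcomes.

```python
def cube_law_power(voltage_scale):
    """Relative power under the cube law P = k * V^3 (k cancels in the ratio)."""
    return voltage_scale ** 3


def beats_voltage_scaling(performance_loss, power_saving):
    """True if a technique saves at least as much power as voltage scaling
    would for the same fractional performance loss."""
    baseline_saving = 1.0 - cube_law_power(1.0 - performance_loss)
    return power_saving >= baseline_saving


one_percent_saving = 1.0 - cube_law_power(0.99)
print(f"1% voltage drop saves {one_percent_saving:.2%} of power")

# Hypothetical techniques: (performance loss, power saving)
print(beats_voltage_scaling(0.02, 0.10))  # saves more than voltage scaling would
print(beats_voltage_scaling(0.02, 0.04))  # voltage scaling would do better
```

A 1% voltage drop saves just under 3% of power, which is the "1% for 3%" rule of thumb the slide uses as the bar to clear.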
Examples: Front-End Throttling
• Speculation is used to increase performance
• Wasted energy if it is wrong
• Can we speculate only when we think we’ll be right?
• Gating: temporarily prevent new instructions from entering the pipeline
• Use gating to avoid speculation beyond branches with low prediction accuracy
– The number of unresolved low-confidence branches is used to determine when to gate the pipeline and for how long
– Reported: 38% energy savings in the wrong-path instructions with about 1% IPC loss
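The gating rule described above can be sketched as a small counter. This is a simplified model, not the published mechanism: the class name `PipelineGate`, its methods, and the threshold of 2 are assumptions, and the confidence estimate for each branch is taken as a given input.

```python
class PipelineGate:
    """Gate fetch while too many low-confidence branches are unresolved."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed tunable parameter
        self.low_conf_unresolved = 0

    def on_branch_fetched(self, high_confidence):
        if not high_confidence:
            self.low_conf_unresolved += 1

    def on_branch_resolved(self, was_high_confidence):
        if not was_high_confidence and self.low_conf_unresolved > 0:
            self.low_conf_unresolved -= 1

    def fetch_allowed(self):
        # Fetch is gated for as long as the count stays at or above the threshold.
        return self.low_conf_unresolved < self.threshold


gate = PipelineGate(threshold=2)
gate.on_branch_fetched(high_confidence=False)
print(gate.fetch_allowed())   # one low-confidence branch in flight: keep fetching
gate.on_branch_fetched(high_confidence=False)
print(gate.fetch_allowed())   # two in flight: fetch is gated
gate.on_branch_resolved(was_high_confidence=False)
print(gate.fetch_allowed())   # one resolved: fetch resumes
```

The counter determines both when gating starts and how long it lasts: fetch resumes as soon as enough low-confidence branches resolve.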
Front-End Throttling (continued)
• Just-in-Time Instruction Delivery
– Fetch stage is throttled based on the number of in-flight instructions.
– If the number of in-flight instructions exceeds a predetermined threshold, fetch is throttled
– Threshold is adjusted through the “tuning cycle”
– Reasons for energy savings:
• Fewer instructions are processed along the mispredicted path
• Instructions spend fewer cycles in the issue queue
Energy Reduction in the Register Files
• General solutions:
– Use of multi-banked RFs. Each bank has fewer entries and fewer ports than the monolithic RF.
• Problems:
– Possible bank conflicts -> IPC loss
– Overhead of the port arbitration logic
– Use of smaller cache-like structures to exploit access locality
Energy Reduction in the Register Files
• Value Aging Buffer (VAB)
– At the time of writeback, the results are written into a FIFO-style cache called the VAB
– The RF is updated only when values are evicted from the VAB.
– In many situations, this RF update can be avoided because a register may be deallocated during its residency in the VAB
– If a register is read from the VAB, there is no need to access the RF.
– Some performance loss due to the sequential access to the VAB and the RF.
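The behavior of the Value Aging Buffer can be illustrated with a toy model that counts register file accesses. This is a sketch under simplifying assumptions (no timing, no ports); the class and method names are hypothetical, and the register names reuse the P31, P32, ... style from the renaming slides.

```python
from collections import OrderedDict


class ValueAgingBuffer:
    """Toy FIFO buffer in front of a register file (RF), counting RF accesses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # register -> value, oldest entry first
        self.rf = {}
        self.rf_writes = 0
        self.rf_reads = 0

    def writeback(self, reg, value):
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            # Evict the oldest value; only now is the RF actually written.
            old_reg, old_value = self.entries.popitem(last=False)
            self.rf[old_reg] = old_value
            self.rf_writes += 1
        self.entries[reg] = value

    def read(self, reg):
        if reg in self.entries:        # hit in the VAB: no RF access needed
            return self.entries[reg]
        self.rf_reads += 1
        return self.rf[reg]

    def deallocate(self, reg):
        # A register freed while still in the VAB never costs an RF write.
        self.entries.pop(reg, None)
        self.rf.pop(reg, None)


vab = ValueAgingBuffer(capacity=2)
vab.writeback("P31", 10)
vab.writeback("P32", 20)
vab.deallocate("P31")          # freed during its VAB residency
vab.writeback("P33", 30)       # fits without evicting anything
print(vab.rf_writes)           # RF writes so far
```

In this run no RF write has happened yet, because the only value that left the buffer was deallocated rather than evicted.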
Isolation of short-lived operands
Out-of-Order Execution and In-Order Retirement
[Figure: pipeline with an in-order front end (F, R, D), an out-of-order core (Inst. Queue, Ex), and in-order retirement (ROB, ARF).]
Register Renaming
• Used to cope with false data dependencies.
• A new physical register is allocated for EVERY new result
• P6 style: ROB slots serve as physical registers
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, P2, 100
SUB P32, P31, P3
ADD P33, P32, P4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg.   Phys. Reg.   Location (0-ROB, 1-ARF)
0            0            1
1            1            1
2            2            1
3            3            1
4            4            1
5            5            1
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg.   Phys. Reg.   Location (0-ROB, 1-ARF)
0            0            1
1            31           0
2            2            1
3            3            1
4            4            1
5            5            1
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg.   Phys. Reg.   Location (0-ROB, 1-ARF)
0            0            1
1            31           0
2            2            1
3            3            1
4            4            1
5            32           0
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg.   Phys. Reg.   Location (0-ROB, 1-ARF)
0            0            1
1            33           0
2            2            1
3            3            1
4            4            1
5            32           0
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
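The step-by-step table updates above can be reproduced with a short sketch of a rename table. It follows the slides' conventions (location 1 = ARF, 0 = ROB; ROB slots handed out starting at 31), but the function name `rename`, the tuple encoding of instructions, and the default parameters are assumptions made for illustration.

```python
def rename(program, num_arch_regs=6, first_rob_slot=31):
    """Rename destinations to fresh ROB slots, P6 style.

    Each table entry maps an architectural register to (physical index, location),
    where location 1 means the value lives in the ARF and 0 means a ROB slot.
    """
    table = {r: (r, 1) for r in range(num_arch_regs)}
    next_slot = first_rob_slot

    def lookup(operand):
        if isinstance(operand, str) and operand.startswith("R"):
            index, location = table[int(operand[1:])]
            return f"R{index}" if location == 1 else f"P{index}"
        return str(operand)            # immediates pass through unchanged

    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are looked up before the destination mapping is overwritten.
        new_src1, new_src2 = lookup(src1), lookup(src2)
        table[int(dest[1:])] = (next_slot, 0)
        renamed.append(f"{op} P{next_slot}, {new_src1}, {new_src2}")
        next_slot += 1
    return renamed, table


program = [
    ("LOAD", "R1", "R2", 100),
    ("SUB", "R5", "R1", "R3"),
    ("ADD", "R1", "R5", "R4"),
]
renamed_code, final_table = rename(program)
for line in renamed_code:
    print(line)
# LOAD P31, R2, 100
# SUB P32, P31, R3
# ADD P33, P32, R4
```

After the third instruction, R1 maps to ROB slot 33 and R5 to slot 32, matching the final table on the slide.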
Short-Lived Values
• Definition: a value is short-lived if its destination register has already been renamed again by the time the result is generated.
• Identified one cycle before the result writeback
• A large percentage of all generated results are short-lived for SPEC 2000 benchmarks.
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
After the RENAMER:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
Percentage of Short-Lived Values
[Figure: bar chart of the percentage of short-lived values (axis from 0 to 100); 96-entry ROB, 4-way processor.]
Why Keep Them?
• Reasons for maintaining short-lived values:
– Recovering from branch mispredictions
– Reconstructing precise state if interrupts or exceptions occur
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
Energy-dissipating Events
[Figure: the same pipeline (in-order front end F, R, D; out-of-order core with Inst. Queue and Ex; in-order retirement with ROB and ARF), annotated with two Write events and one Read event.]
Isolating Short-Lived Values: the Idea
Write short-lived values into a small dedicated RF (SRF)
[Figure: the same pipeline (in-order front end F, R, D; out-of-order core with Inst. Queue and Ex; in-order retirement with ROB and ARF) with the SRF added, annotated with two Write events and one Read event.]
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Energy Reduction in Caches
• Dynamically resizable caches
– Dynamically estimates the program requirements and adapts to the required cache size
– Cache is upsized or downsized at the end of periodic intervals based on the value of the cache miss counter
– Downsizing puts the higher-numbered sets into a low-leakage mode using sleep transistors
– A bit mask is used to specify the number of address bits that are used for indexing into the set
– The cache size always changes by a factor of two
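A minimal sketch of the index masking and interval-based resizing follows. The class name, the two miss-count thresholds, and the 256-set, 64-byte-block geometry are illustrative assumptions; a real design would also have to handle the data left in sets that are put to sleep.

```python
class ResizableCacheIndex:
    """Index computation for a cache whose number of active sets changes by 2X."""

    def __init__(self, max_sets, block_bytes, min_sets=1):
        # max_sets, min_sets, and block_bytes are assumed to be powers of two.
        self.max_sets = max_sets
        self.min_sets = min_sets
        self.block_bits = block_bytes.bit_length() - 1
        self.active_sets = max_sets

    @property
    def index_mask(self):
        # The bit mask selects how many address bits index into the sets.
        return self.active_sets - 1

    def set_index(self, address):
        return (address >> self.block_bits) & self.index_mask

    def end_of_interval(self, misses, upsize_above, downsize_below):
        """Resize by a factor of two based on the interval's miss counter."""
        if misses > upsize_above and self.active_sets < self.max_sets:
            self.active_sets *= 2
        elif misses < downsize_below and self.active_sets > self.min_sets:
            self.active_sets //= 2   # higher-numbered sets would enter sleep mode


cache = ResizableCacheIndex(max_sets=256, block_bytes=64)
print(cache.set_index(0x12345))                                   # index with 256 sets
cache.end_of_interval(misses=5, upsize_above=100, downsize_below=10)
print(cache.active_sets, cache.set_index(0x12345))                # after downsizing
```

Dropping one index bit halves the number of reachable sets, so every address still maps to a powered set without any other change to the lookup path.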
Energy Reduction within the Execution Units
• Gating off portions of the execution units
– Disables the upper bits of the ALUs where they are not needed (for small operands)
– Energy can be reduced by 54% for integer programs
• Packing multiple narrow-width operations in a single ALU in the same cycle
• Steering instructions to FUs based on criticality information
– Critical instructions are steered to fast and power-hungry execution units; non-critical instructions are steered to slow and power-efficient units
Encoding Addresses for Low Power
• Using Gray code for the addresses to reduce switching activity on the address buses (Su et al., IEEE Design and Test, 1994)
– Exploits the observation that programs often generate consecutive addresses
– Gray code: there is only a single transition on the address bus when consecutive addresses are accessed
– 37% reduction in the switching activity is reported
– A Gray code encoder is placed at the transmitting end of the bus, and a decoder is needed at the receiving end
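The single-transition property is easy to verify with the standard binary-reflected Gray code. The sketch below counts bit toggles for a run of 16 consecutive addresses; the address stream is an idealized example, so the savings it shows are not comparable to the 37% reported for real programs.

```python
def gray_encode(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(values):
    """Total number of bus lines that toggle across a sequence of values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


addresses = list(range(16))                      # idealized consecutive addresses
binary_toggles = transitions(addresses)
gray_toggles = transitions([gray_encode(a) for a in addresses])

print(f"binary: {binary_toggles} transitions")   # 26
print(f"Gray:   {gray_toggles} transitions")     # 15
```

With Gray code, each of the 15 address steps toggles exactly one line; the decoder at the receiving end recovers the original address.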
Encoding Data for Low Power
• Bus-invert encoding
– Uses redundancy to reduce the number of transitions
– Adds one line to the bus to indicate if the actual data or its complement is transmitted
– If the Hamming distance between the current value and the previous one is less than or equal to (n/2) (for n bits), the value is transmitted as such and the value of 0 is transmitted on the extra line.
– Otherwise, the complement of the value is transmitted and the extra line is set to 1
– The average number of bus transitions per clock cycle is lowered by 25% as a result
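The encoding rule above can be sketched as follows. The sketch assumes the "previous value" is what was last driven on the data lines, that the bus starts at all zeros, and that every word fits in `width` bits; the function names are illustrative.

```python
def bus_invert_encode(words, width):
    """Return (data lines, invert line) pairs for a sequence of words."""
    mask = (1 << width) - 1
    previous = 0                       # assumed initial state of the data lines
    encoded = []
    for word in words:
        distance = bin(word ^ previous).count("1")
        if 2 * distance <= width:      # Hamming distance <= n/2: send as is
            sent, invert = word, 0
        else:                          # otherwise send the complement
            sent, invert = ~word & mask, 1
        encoded.append((sent, invert))
        previous = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words at the receiving end."""
    mask = (1 << width) - 1
    return [(~sent & mask) if invert else sent for sent, invert in encoded]


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words, width=8)
print(encoded)                                   # [(0, 0), (0, 1), (15, 0), (15, 1)]
print(bus_invert_decode(encoded, width=8) == words)
```

In this example the two 8-bit flips (0x00 to 0xFF and 0x0F to 0xF0) each cost a single toggle on the extra line instead of eight toggles on the data lines.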
OS and Compiler Techniques
• Can the compiler help?
• Can the OS help?
– E.g., control voltage scaling
– Control turning off devices