1 low-power computer architecture dr. avi mendelson
Post on 21-Dec-2015
Low-power computer architecture
Dr. Avi Mendelson
© Dr. Avi Mendelson 2
Disclaimer
No Intel proprietary information is disclosed.
Every future estimate or projection is only speculation.
Responsibility for all opinions and conclusions falls on the author only.
It does not mean you cannot trust them…
Agenda
The power crisis
    Power consumption
    Power density and thermal limitations
General solutions and directions
Moore’s law
“Doubling the number of transistors on a manufactured die every year” - Gordon Moore, Intel Corporation
[Figure: transistors per die (log scale, 10^2 to 10^9) versus year, 1970 to 2000. Memory series: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M. Microprocessor series: 4004, 8080, 8086, 80286, i386™, i486™, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Source: Intel.]
In the Last 25 Years Life was Easy(*)
Doubling of transistor density every 30 months
Increasing die sizes, allowed by
    Increasing wafer size
    Process technology moving from “black art” to “manufacturing science”
Doubling of transistors every 18 months

Tech   Old Arch       mm (linear)   New Arch       mm (linear)   Ratio
       i386C          6.5           i486           11.5          3.1
       i486C          9.5           Pentium®       17            3.2
       Pentium®       12.2          Pentium® Pro   17.3          2.1
       Pentium® III   10.3          Next Gen       ?             2-3

Implications (in the same technology):
1. New Arch ~ 2-3X die area of the last Arch
2. Provides 1.5-1.7X integer performance of the last Arch
(*) Source: Fred Pollack, Micro-32
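The ratio column appears to be a die-area ratio, which is what implication 1 refers to. A minimal Python check, assuming the "mm (linear)" values are die edge lengths (the dictionary and function names are illustrative, not from the slides):

```python
# Die edge lengths in mm (old arch, new arch), copied from the table.
pairs = {
    "i386C -> i486": (6.5, 11.5),
    "i486C -> Pentium": (9.5, 17.0),
    "Pentium -> Pentium Pro": (12.2, 17.3),
}

def area_ratio(old_mm, new_mm):
    """Area grows with the square of the linear dimension."""
    return (new_mm / old_mm) ** 2

ratios = {name: round(area_ratio(o, n), 1) for name, (o, n) in pairs.items()}
for name, r in ratios.items():
    print(f"{name}: {r}x die area")
```

The first two rows reproduce the 3.1 and 3.2 on the slide. The third comes out near 2.0 rather than 2.1, so the slide may have used measured die areas or different rounding.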
Suddenly, the power monster appears in all different market segments
Processor Power Evolution
Traditionally: a new generation always increases power
Compactions: higher performance at lower power
Used to be “one size fits all”: start with high power and shrink to mobile
[Figure: max power (Watts, log scale from 1 to 100) by processor generation: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, followed by “?”.]
The power crisis – power consumption
Source: Cool-Chips, Micro-32
Power challenges per segment

Servers
    Power-related system cost drivers: thermal cost, delivery cost, form factor
    Price drivers: performance, perf/inch^3
    Optimization point: max performance @ thermal constraint

Desktops
    Power-related system cost drivers: thermal cost, delivery cost
    Price drivers: performance, noise, perf/$$
    Optimization point: max performance @ thermal constraint

Mobile
    Power-related system cost drivers: thermal cost, delivery cost, form factor, battery size
    Price drivers: performance, noise, perf/Kg., battery life
    Optimization point: max performance @ thermal constraint, max battery life

Handhelds
    Power-related system cost drivers: form factor, battery size, battery cost
    Price drivers: performance, battery life
    Optimization point: max battery life, max perf/power to meet application’s need
Power & Energy
Power
    Dynamic power: consumed by transistors during switching.
        $P = \alpha C V^2 f$ - work done per time unit (Watts)
        ($\alpha$: activity, $C$: capacitance, $V$: voltage, $f$: frequency)
    Static power (leakage): consumed by all “inactive transistors”; it depends on temperature and voltage.
    Power-aware architectures -> aim to reduce peak power
Energy
    Power consumed during some period of time.
    Energy-aware architectures -> aim to reduce average power consumption
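To make the power/energy distinction concrete, here is a small Python sketch of the dynamic-power formula and of energy as power over time. The operating point (activity 0.1, 1 nF, 1.2 V, 1 GHz) is hypothetical and chosen only for illustration.

```python
def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Dynamic (switching) power in watts: activity * C * V^2 * f."""
    return alpha * c_farads * v_volts ** 2 * f_hz

def energy(power_watts, seconds):
    """Energy in joules for a constant power drawn over an interval."""
    return power_watts * seconds

# Hypothetical operating point, not taken from the slides.
p = dynamic_power(alpha=0.1, c_farads=1e-9, v_volts=1.2, f_hz=1e9)
e = energy(p, seconds=2.0)
print(f"power = {p:.3f} W, energy over 2 s = {e:.3f} J")
```

Peak power constrains delivery and cooling, while the energy figure is what drains a battery.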
Power Evolution (Theoretical)
For a 15 mm/side die (225 mm²)
Assume 2X frequency increase each generation
Future process numbers are estimated
[Figure: leakage power and active power in Watts, vertical axis from 0 to 250.]
Why high power matters
Power limitations
    Higher power -> higher current
        Cannot exceed platform power delivery constraints
    Higher power -> higher temperature
        Cannot exceed the thermal constraints (e.g., Tj < 100°C)
        Increases leakage
    The heat must be controlled in order to avoid electromigration and other “chemical” reactions of the silicon
Energy
    Affects battery life
        Consumer devices: the processor may consume most of the energy
        Mobile computers (laptops): the system (display, disk, cooling, power supply, etc.) consumes most of the energy
    Affects the cost of electricity
Power Density
[Figure: power density (Watts/cm², log scale from 1 to 1000) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4, with hot plate, nuclear reactor and rocket nozzle marked as reference levels.]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies”, Fred Pollack, Intel Corp., Micro-32 conference keynote, 1999.
Why do power and power density increase over time?
How do we keep up with Moore’s Law?
Every 18 months on average we introduce a new process
The new process shrinks the dimensions of the transistors by 0.7 (ideal shrink)
As a result, on the same die area, we can have more transistors, each of them running at higher frequency
One may mistakenly think that this is the reason for the increase in power and power density.
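Scaling theory (1 of 2)
All values are scale factors relative to the previous process generation.
Lateral and vertical dimensions reduce by 30%:
$$\text{Width} = W = 0.7, \quad \text{Length} = L = 0.7, \quad t_{ox} = 0.7$$
Die area reduces 50%:
$$\text{Die area} = X \times Y = 0.7 \times 0.7 = 0.7^2$$
Capacitance (area and fringing) reduces by 30%:
$$\text{Area cap} = C_a = \frac{0.7 \times 0.7}{0.7} = 0.7, \quad \text{Fringing cap} = C_f = 0.7, \quad \text{Total cap} = C = 0.7$$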
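Scaling theory (2 of 2)
Capacitance per transistor reduces 30%:
$$\frac{\text{Cap}}{\text{Transistor}} = \frac{0.7}{1} = 0.7$$
Capacitance per unit area increases 43%:
$$\frac{\text{Cap}}{\text{Area}} = \frac{0.7}{0.7 \times 0.7} = \frac{1}{0.7} \approx 1.43$$
Delay reduces 30%, power reduces 50%:
$$V_{dd} = 0.7, \quad V_t = 0.7, \quad I = \frac{W}{t_{ox}}\,(V_{dd} - V_t) = \frac{0.7}{0.7} \times 0.7 = 0.7$$
$$T = \frac{C\,V_{dd}}{I} = \frac{0.7 \times 0.7}{0.7} = 0.7$$
$$\text{Power} = C V^2 f = 0.7 \times 0.7^2 \times \frac{1}{0.7} = 0.7^2 \approx 0.5$$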
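The ideal-scaling arithmetic can be replayed in a few lines of Python. This is only a sketch of the scale factors on these two slides; the variable names are illustrative, and every quantity is relative to the previous process generation.

```python
S = 0.7  # ideal linear shrink factor per process generation

width = length = t_ox = S
die_area = S * S                          # ~0.49: die area roughly halves
cap = width * length / t_ox               # area capacitance ~ W*L/t_ox -> 0.7
cap_per_transistor = cap / 1              # 0.7: 30% lower
cap_per_unit_area = cap / die_area        # ~1.43: 43% higher

vdd = S                                   # Vdd and Vt both scale by 0.7
current = (width / t_ox) * vdd            # I ~ (W/t_ox)(Vdd - Vt) -> 0.7
delay = cap * vdd / current               # T ~ C*Vdd/I -> 0.7
freq = 1 / delay                          # ~1.43
power = cap * vdd ** 2 * freq             # C*V^2*f -> ~0.49
power_density = power / die_area          # ~1.0: unchanged under ideal scaling

print(f"delay {delay:.2f}, power {power:.2f}, power density {power_density:.2f}")
```

Under these assumptions power density stays constant, which is why ideal scaling alone does not explain the power problem.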
Ideal Scenarios...
                      Ideal “shrink”   Ideal new arch
Architecture          Same arch        New arch, same die size
#Xistors              1X               2X
Size                  0.5X             1X
Frequency             1.5X             1.5X
Power                 0.5X             1X
IPC (instr./cycle)    1X               2X
Performance           1.5X             3X
Power density         1X               1X
Process Technologies – Reality
But in reality:
    New process is not ideal anymore
    New designs squeeze frequency to 2X per process
    New designs use more transistors (2X-3X to get 1.5X-1.7X perf)
So, every new process and architecture generation:
    Power goes up about 2X
    Power density goes up 30%~80%
This is bad, and…
Will get worse in future process generations:
    Voltage (Vdd) will scale down less
    Leakage is going through the roof
Die increases in order to maintain performance boost
[Figure: Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4 processors shown against silicon process technology: 1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ.]
Put it all together: power and power density are a real threat to Moore’s law
Complex algorithms lead to denser power:
    Dense random logic
    Timing pressure leads to faster/bigger/power-hungrier gates
Designers put together units that communicate with each other.
    This creates “regions” with high activity factors -> hot spots.
Power is not distributed evenly over the chip. A failure can happen if a single point reaches the max power point.
Many modern processors are power limited
Some implications
We can’t build microprocessors with ever-increasing power density and die sizes
The constraint is power, not manufacturability
The design of any future microprocessor should take power into consideration.
We need to distinguish between different aspects of power:
    Power delivery
    Max power (Tj)
    Power density - hot spots
    Energy - static + dynamic
Power- and energy-aware design should take care of each of these aspects
One size does not fit all anymore
General solutions and directions
Assume that one size does not fit all. For different segments there may be different
solutions (although many of them share the same principle of operation).
Embedded systems vs. Laptops
Embedded systems
    Most of the power is consumed by the CPU
    Usually not thermally limited.
    What we really care about is battery life and meeting the timing limitations.
    In real-time systems we can take advantage of known “deadlines”
Laptops (mobile systems)
    We are thermally limited.
    We cannot use deadlines (most of the time).
    We need to optimize for max battery life and max performance in a given power envelope.
How to extend Battery life: Voltage Scaling
Within a given voltage range, higher voltage allows higher frequency.
Used for trading power and frequency, either
    Statically, at manufacturing time
    Dynamically, at run time (e.g., Intel’s SpeedStep® Technology)
Actual range depends on the specific design and process technology
Examples*:
    Intel® XScale™ processors run from 0.75V (150MHz/50mW) to 1.65V (800MHz/900mW)
    Intel mobile Pentium® III processor sells from 1.1V (600MHz) to 1.7V (1GHz)
[Figure: XScale processor frequency (MHz) and power (mW) versus voltage; voltage axis from 0.5 to 1.9 V, vertical axis from 0 to 1000.]
* Source: Intel Corp. (http://developer.intel.com)
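The two XScale operating points quoted on this slide can be compared on an energy-per-cycle basis. The Python sketch below uses only those two data points; the helper name is illustrative.

```python
# (voltage in V, frequency in MHz, power in mW) for the two quoted XScale points.
points = [(0.75, 150, 50), (1.65, 800, 900)]

def energy_per_cycle_nj(freq_mhz, power_mw):
    """mW divided by MHz gives nanojoules per clock cycle."""
    return power_mw / freq_mhz

low, high = (energy_per_cycle_nj(f, p) for _, f, p in points)
print(f"0.75 V: {low:.3f} nJ/cycle, 1.65 V: {high:.3f} nJ/cycle")
print(f"frequency x{800 / 150:.2f}, power x{900 / 50:.0f}, energy/cycle x{high / low:.3f}")
```

Going from the low to the high point buys about 5.3 times the frequency for 18 times the power, so each cycle costs roughly 3.4 times more energy.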
Voltage Scaling (cont.)
Huge effect on dynamic power:
    20% frequency reduction -> 20% voltage reduction
    36% energy reduction ($C V^2$: $0.8^2 = 0.64$)
    About 50% power reduction ($C V^2 f$: $0.8^3 = 0.51$)
Even more impressive if we recall: a 20% frequency hit costs only a 10%-15% performance hit*
Voltage scaling can be used to trade performance for power
Reduce the power consumption when performance needs can be relaxed
    e.g., if deadlines are known and we have enough “dead time”, we can extend the execution time in exchange for lowering the voltage.
BUT it has technology limitations
* Depends mainly on core-to-bus frequency ratio and cache sizes.
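A short Python sketch of the arithmetic on this slide, plus a deadline-driven choice of frequency. It assumes voltage scales linearly with frequency and, pessimistically, that execution time scales inversely with frequency; the function names and the 8 ms / 10 ms task are hypothetical.

```python
def scaled(freq_ratio):
    """Relative (energy, power) when voltage is scaled along with frequency.

    Energy per operation ~ V^2 and dynamic power ~ V^2 * f, with V ~ f.
    """
    v = freq_ratio
    return v ** 2, v ** 2 * freq_ratio

def min_freq_ratio(busy_time, deadline):
    """Lowest frequency ratio that still meets the deadline (capped at full speed)."""
    return min(1.0, busy_time / deadline)

# Hypothetical task: 8 ms of work at full speed, 10 ms deadline.
ratio = min_freq_ratio(busy_time=8.0, deadline=10.0)
energy_ratio, power_ratio = scaled(ratio)
print(f"run at {ratio:.0%} frequency: energy x{energy_ratio:.2f}, power x{power_ratio:.3f}")
```

With 20% slack the task can run at 80% frequency, which gives the 0.64 energy and 0.51 power factors quoted above.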
How to extend battery life: energy efficiency
Energy per task
    Proportional to the # of processed instructions per task
    Proportional to the average work consumed per instruction
“Energy per (retired) instruction” = (total/retired) × W, where
    total/retired: ratio of total to retired number of processed instructions
    W: average energy spent in processing an instruction
Both figures deteriorate with every new microarchitecture, since speculation increases and complexity grows
In that respect: high-performance modern microarchitectures are less energy-efficient
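The energy-per-retired-instruction product can be illustrated with two made-up machines. In this Python sketch the instruction counts and per-instruction energies are hypothetical; only the formula comes from the slide.

```python
def energy_per_retired_instruction(total_instr, retired_instr, w_joules):
    """(total / retired) * W: wasted speculative work inflates the cost."""
    return (total_instr / retired_instr) * w_joules

# Hypothetical numbers: a simple core wastes 5% of its work, while an
# aggressive speculative core wastes 40% and spends more energy per instruction.
simple = energy_per_retired_instruction(1_050, 1_000, 1.0e-9)
aggressive = energy_per_retired_instruction(1_400, 1_000, 1.5e-9)
print(f"simple: {simple * 1e9:.2f} nJ, aggressive: {aggressive * 1e9:.2f} nJ")
```

Both factors multiply, so in this example a 33% rise in the ratio and a 50% rise in W double the energy per useful instruction.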
Improving Hot Spots: Clustering
Build your system as a clustered architecture (e.g., Alpha)
Design your system so that when all clusters are active the system exceeds the max power allowed
Most of the time, not all the clusters are active
“Smart scheduling” will spread the thermal hot spots among different clusters.
In VLIW-based architectures, compilers can help
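One way to picture “smart scheduling” is a coolest-cluster-first policy. The Python sketch below is a toy model, not the scheme used in any real processor: each task adds a fixed amount of heat to the cluster it runs on, and every cluster loses half its heat per step (both assumptions are made up for illustration).

```python
def schedule(task_heat, n_clusters, cooling=0.5):
    """Assign each task to the currently coolest cluster (toy thermal model)."""
    temps = [0.0] * n_clusters
    placement = []
    for heat in task_heat:
        coolest = min(range(n_clusters), key=lambda i: temps[i])
        temps = [t * cooling for t in temps]  # every cluster cools each step
        temps[coolest] += heat                # the chosen cluster heats up
        placement.append(coolest)
    return placement, temps

placement, temps = schedule([1.0] * 6, n_clusters=3)
_, single = schedule([1.0] * 6, n_clusters=1)
print(placement, max(temps), single[0])
```

Spreading six identical tasks over three clusters keeps the hottest cluster well below the level reached when a single cluster runs everything.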
Alpha hot spots
Source - CoolChips-99
Area 30%
Freq. 50%
Power 67%
Power Complexity Metrics
$\text{Power} \sim C V^2 f$
Metrics: suppose we introduce a new feature that consumes extra x power and gains y performance:
1. Power/Perf (~ Energy), assuming same technology (same C) and same voltage
    For battery life, energy bills.
    For a given power envelope, without voltage scaling.
2. Power/Perf² (~ Energy*Delay)
    Balances performance and power needs.
3. Power/Perf³ (~ Energy*Delay²)
    For a given power envelope, with voltage scaling,
    assuming that (1) we can trade frequency and voltage scaling, and (2) we can lower the voltage as much as we wish
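The three metrics can disagree about the same feature. A minimal Python sketch, using a hypothetical feature that costs 10% more power for 5% more performance (the numbers and the function name are illustrative):

```python
def metrics(extra_power, extra_perf):
    """Relative Power/Perf^n for n = 1, 2, 3; values below 1.0 are improvements."""
    power = 1.0 + extra_power
    perf = 1.0 + extra_perf
    return {n: power / perf ** n for n in (1, 2, 3)}

m = metrics(extra_power=0.10, extra_perf=0.05)
for n, value in m.items():
    print(f"Power/Perf^{n}: {value:.3f}")
```

This feature loses on Power/Perf, is roughly neutral on Power/Perf², and wins on Power/Perf³, so the verdict depends on which constraint the segment is optimizing for.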
E*D product (lower is better)
E = energy / instruction = Power * sec / instruction = Watt / MIPS
D = sec / instruction = 1 / MIPS
E*D ~ Watt / MIPS²
[Figure: energy (pJ) and delay versus Vdd (0 to 3 volts), and E x D versus Vdd (0 to 3 volts).]
Leakage control
Leakage depends on: technology, area, voltage and temperature.
High temperature -> high leakage -> high power -> higher temperature
Leakage will be very significant in future microarchitectures.
Large caches contribute to performance but may increase power due to leakage.
Larger caches: better performance, but higher leakage -> slower clock -> lower performance.
Leakage makes the major difference between clock gating and deep sleep modes (where power is disconnected)
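The temperature/leakage feedback loop can be sketched as a fixed-point iteration. The Python model below is a toy with made-up parameters (thermal resistance, reference leakage, leakage doubling every 25°C); it is meant only to show how the loop either settles or runs away.

```python
def steady_state(p_dyn, t_amb=40.0, r_th=0.5, leak_ref=5.0, doubling=25.0,
                 t_max=150.0, max_iters=1000):
    """Iterate T -> T_amb + R_th * (P_dyn + P_leak(T)) until it settles.

    Returns (temperature, leakage_power), or None if the temperature passes
    t_max, which this toy model treats as thermal runaway.
    """
    t = t_amb
    for _ in range(max_iters):
        p_leak = leak_ref * 2.0 ** ((t - t_amb) / doubling)  # hypothetical law
        t_next = t_amb + r_th * (p_dyn + p_leak)
        if t_next > t_max:
            return None
        if abs(t_next - t) < 1e-6:
            return t_next, p_leak
        t = t_next
    return None

stable = steady_state(p_dyn=40.0)
runaway = steady_state(p_dyn=200.0)
print(stable, runaway)
```

With 40 W of dynamic power the loop settles near 65°C with leakage doubled to about 10 W; with 200 W the leakage growth pushes the model past its limit.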
Design for power: Out Of Order Execution
OOO architecture was found to be very efficient in masking the effect of L1 cache misses.
Aggressive OOO and wider machines require more registers and memory ports
    This consumes a lot of power
Can we slow down the access to the cache and let the OOO solve the performance problem?
Can we simplify the OOO mechanisms, assuming that the memory subsystem limits the performance?
How aggressive should we be with speculation (branch prediction, value prediction, etc.)?
Pentium Pro Power Breakdown
Fetch           14%
Decode          14%
ROB              7%
Data $           7%
Int Exec         6%
FP Exec          5%
Clock            5%
RS               5%
External Bus     6%
MOB              4%
RAT              4%
Misc            23%
Actual computation: less than 25%!
What can be done:
    Trace cache
    Many low-level improvements
SMT
Single CPU µArch augmented to look like 2 or more CPUs to the software
Adds ~10% logic to the CPU (Alpha experience)
Average power increases <10%.
Can increase performance of two threads by 20-50% relative to running the same applications sequentially.
Looks like a good tradeoff between power and performance.
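The tradeoff can be quantified from the figures on this slide. This Python sketch takes the 10% power increase as an upper bound and the 20-50% speedup as the range; the variable names are illustrative.

```python
power = 1.10  # upper bound of the <10% average power increase
speedups = (1.20, 1.50)  # 20-50% performance gain for two threads

# Performance per unit power relative to the non-SMT baseline (1.0).
perf_per_power = [round(s / power, 2) for s in speedups]
print(perf_per_power)
```

Under these numbers SMT improves performance per watt by roughly 9% to 36%.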
MT - Implications on power
The area and power consumption of register files and memory elements within the processor increase significantly due to aggressive out-of-order and aggressive SMT (Alpha, Cool Chips ’99)
Increases the power at the hot spot; not a fit for thermally limited segments (where performance is needed).
May better tolerate cache misses, so power-aware caches can be used
Hot spots may force us to use more aggressive clustering
Question?