Low-power computer architecture
Dr. Avi Mendelson

Post on 21-Dec-2015


Page 1: 1 Low-power computer architecture Dr. Avi Mendelson

Low-power computer architecture

Dr. Avi Mendelson

Page 2: 1 Low-power computer architecture Dr. Avi Mendelson

© Dr. Avi Mendelson 2

Disclaimer

No Intel proprietary information is disclosed.
Every future estimate or projection is only speculation.
Responsibility for all opinions and conclusions falls on the author only.
That does not mean you cannot trust them…

Page 3: 1 Low-power computer architecture Dr. Avi Mendelson

Agenda

The power crisis
  Power consumption
  Power density and thermal limitations
General solutions and directions

Page 4: 1 Low-power computer architecture Dr. Avi Mendelson

Moore’s law

“Doubling the number of transistors on a manufactured die every year” - Gordon Moore, Intel Corporation

[Figure: transistors per die (log scale, 10^2 to 10^9) versus year, 1970 to 2000. Memory series: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M. Microprocessor series: 4004, 8080, 8086, 80286, i386™, i486™, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4. Source: Intel.]

Page 5: 1 Low-power computer architecture Dr. Avi Mendelson

In the Last 25 Years Life was Easy(*)

Doubling of transistor density every 30 months
Increasing die sizes, allowed by
  Increasing wafer size
  Process technology moving from “black art” to “manufacturing science”
Doubling of transistors every 18 months

Tech   Old Arch       mm (linear)   New Arch       mm (linear)   Ratio
       i386C          6.5           i486           11.5          3.1
       i486C          9.5           Pentium®       17            3.2
       Pentium®       12.2          Pentium® Pro   17.3          2.1
       Pentium® III   10.3          Next Gen       ?             2--3

Implications (in the same technology):

1. New Arch ~ 2-3X die area of the last Arch

2. Provides 1.5-1.7X integer performance of the last Arch

(*) Source: Fred Pollack, Micro-32
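The Ratio column can be read as a die-area ratio. The sketch below assumes square dies (area = side squared) and reuses the linear dimensions from the table; the function name is made up for illustration. The first two rows reproduce the listed ratios, while the third comes out close to 2.0 rather than the listed 2.1, presumably because the source figures are rounded.

```python
# Linear die dimensions in mm, copied from the table above.
PAIRS = [
    ("i386C", 6.5, "i486", 11.5),
    ("i486C", 9.5, "Pentium", 17.0),
    ("Pentium", 12.2, "Pentium Pro", 17.3),
]


def area_ratio(old_mm: float, new_mm: float) -> float:
    """Die-area ratio, assuming square dies so that area = side ** 2."""
    return (new_mm / old_mm) ** 2


for old_name, old_mm, new_name, new_mm in PAIRS:
    ratio = area_ratio(old_mm, new_mm)
    print(f"{old_name} -> {new_name}: {ratio:.2f}X die area")
```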

Page 6: 1 Low-power computer architecture Dr. Avi Mendelson


Suddenly, the power monster appears in all different market segments

Page 7: 1 Low-power computer architecture Dr. Avi Mendelson

Processor Power Evolution

Traditionally: a new generation always increases power
Compactions: higher performance at lower power
Used to be “One size fits all”: start with high power and shrink to Mobile

[Figure: max power (Watts, log scale from 1 to 100) by processor generation: i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, followed by a "?" for the next generation.]

Page 8: 1 Low-power computer architecture Dr. Avi Mendelson

The power crisis – power consumption

Source: cool-chips, Micro 32

Page 9: 1 Low-power computer architecture Dr. Avi Mendelson

Power challenges per segment

Servers
  Power related system cost drivers: Thermal cost, Delivery cost, Form factor
  Price drivers: Performance, Perf/inch^3
  Optimization point: Max performance @ thermal constraint

Desktops
  Power related system cost drivers: Thermal cost, Delivery cost
  Price drivers: Performance, Noise, Perf/$$
  Optimization point: Max performance @ thermal constraint

Mobile
  Power related system cost drivers: Thermal cost, Delivery cost, Form factor, Battery size
  Price drivers: Performance, Noise, Perf/Kg., Battery life
  Optimization point: Max performance @ thermal constraint, Max battery life

Handhelds
  Power related system cost drivers: Form factor, Battery size, Battery cost
  Price drivers: Performance, Battery life
  Optimization point: Max battery life, Max perf/power to meet application’s need

Page 10: 1 Low-power computer architecture Dr. Avi Mendelson

Power & Energy

Power
  Dynamic power: consumed by transistors during switching.
    $P = \alpha C V^2 f$ - work done per time unit (Watts)
    ($\alpha$: activity, C: capacitance, V: voltage, f: frequency)
  Static power (leakage): consumed by all “inactive transistors”; it depends on temperature and voltage.
  Power aware architectures -> aim to reduce peak power

Energy
  Power consumed during some period of time.
  Energy aware architectures -> aim to reduce average power consumption
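To make the distinction concrete, here is a minimal sketch of the dynamic-power formula (activity factor times capacitance times voltage squared times frequency) and of energy as power integrated over time. All numeric values and function names are hypothetical, chosen only for illustration; they do not describe any real processor.

```python
def dynamic_power_w(activity: float, capacitance_f: float,
                    voltage_v: float, freq_hz: float) -> float:
    """Dynamic (switching) power in Watts: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def energy_j(power_w: float, seconds: float) -> float:
    """Energy in Joules for a constant power draw over a period of time."""
    return power_w * seconds


# Hypothetical operating point: 10% activity, 100 nF switched capacitance,
# 1.5 V supply, 1 GHz clock.
power = dynamic_power_w(activity=0.1, capacitance_f=100e-9,
                        voltage_v=1.5, freq_hz=1e9)
print(f"dynamic power: {power:.1f} W")
print(f"energy over 2 s: {energy_j(power, 2.0):.1f} J")
```

Peak power is what the power delivery and cooling must survive; energy is what drains the battery, which is why the two call for different optimizations.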

Page 11: 1 Low-power computer architecture Dr. Avi Mendelson

Power Evolution (Theoretical)

For a 15mm/side die (225 mm^2)
Assume 2X frequency increase each generation
Future process numbers are estimated

[Figure: leakage power and active power in Watts, axis from 0 to 250.]

Page 12: 1 Low-power computer architecture Dr. Avi Mendelson

Why high power matters

Power limitations
  Higher power -> higher current
    Cannot exceed platform power delivery constraints
  Higher power -> higher temperature
    Cannot exceed the thermal constraints (e.g., Tj < 100 °C)
    Increases leakage
  The heat must be controlled in order to avoid electromigration and other “chemical” reactions of the silicon

Energy
  Affects battery life
    Consumer devices: the processor may consume most of the energy
    Mobile computers (laptops): the system (display, disk, cooling, power supply, etc.) consumes most of the energy
  Affects the cost of electricity

Page 13: 1 Low-power computer architecture Dr. Avi Mendelson

Power Density

[Figure: power density (Watts/cm2, log scale from 1 to 1000) by processor generation: i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, compared with a hot plate, a nuclear reactor, and a rocket nozzle.]

* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” - Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

Page 14: 1 Low-power computer architecture Dr. Avi Mendelson


Page 15: 1 Low-power computer architecture Dr. Avi Mendelson

Why do power and power density increase over time?

Page 16: 1 Low-power computer architecture Dr. Avi Mendelson

How do we keep up with Moore’s Law?

Every 18 months on average we introduce a new process
The new process shrinks the dimensions of the transistors by 0.7 (ideal shrink)
As a result, on the same die area, we can have more transistors, each of them running at a higher frequency

One may mistakenly think that this is the reason for the increase in power and power density.

Page 17: 1 Low-power computer architecture Dr. Avi Mendelson

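Scaling theory--1 of 2

(all values are scaling factors relative to the previous process)

Lateral and vertical dimensions reduce by 30%:
$W = 0.7,\quad L = 0.7,\quad t_{ox} = 0.7$

Capacitance (area and fringing) reduces by 30%:
$C_a = \frac{0.7 \times 0.7}{0.7} = 0.7$ (area capacitance)
$C_f = 0.7$ (fringing capacitance)
$C = 0.7$ (total capacitance)

Die area reduces 50%:
$\text{Die area} = X \times Y = 0.7 \times 0.7 = 0.7^2$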

Page 18: 1 Low-power computer architecture Dr. Avi Mendelson

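Scaling theory--2 of 2

Capacitance per transistor reduces 30%:
$\frac{\text{Cap}}{\text{Transistor}} = \frac{0.7}{1} = 0.7$

Capacitance per unit area increases 43%:
$\frac{\text{Cap}}{\text{Area}} = \frac{0.7}{0.7 \times 0.7} = \frac{1}{0.7}$

Delay reduces 30%, power reduces 50%:
$V_{dd} = 0.7,\quad V_t = 0.7$
$I = \frac{W}{t_{ox}} (V_{dd} - V_t) = \frac{0.7}{0.7} \times 0.7 = 0.7$
$T = \frac{C \cdot V_{dd}}{I} = \frac{0.7 \times 0.7}{0.7} = 0.7$
$\text{Power} = C V^2 f = 0.7 \times 0.7^2 \times \frac{1}{0.7} = 0.7^2$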
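The scaling arithmetic of the two slides above can be replayed in a few lines. This is a sketch of the ideal-shrink relations only (every dimension and voltage scaled by the same factor); the function and key names are made up for illustration.

```python
def ideal_shrink(s: float = 0.7) -> dict:
    """Relative quantities after one ideal process shrink by factor s."""
    width = length = t_ox = s          # lateral and vertical dimensions
    vdd = s                            # supply and threshold voltage both scale by s
    cap = (width * length) / t_ox      # area capacitance; fringing scales the same way
    die_area = s * s                   # X * Y
    current = (width / t_ox) * vdd     # (W / t_ox) * (Vdd - Vt), voltages scaled by s
    delay = cap * vdd / current        # T = C * Vdd / I
    freq = 1.0 / delay
    power = cap * vdd ** 2 * freq      # C * V^2 * f
    return {
        "cap_per_transistor": cap,
        "cap_per_area": cap / die_area,
        "die_area": die_area,
        "delay": delay,
        "power": power,
        "power_density": power / die_area,
    }


for name, value in ideal_shrink().items():
    print(f"{name}: {value:.2f}")
```

Under these ideal assumptions power density stays constant, which is why the later slides attribute the real-world increase to non-ideal scaling and design choices rather than to the shrink itself.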

Page 19: 1 Low-power computer architecture Dr. Avi Mendelson

Ideal Scenarios...

                     Ideal “Shrink”     Ideal New arch
                     (same arch)        (same die size)
#Xistors             1X                 2X
Size                 0.5X               1X
Frequency            1.5X               1.5X
Power                0.5X               1X
IPC (instr./cycle)   1X                 2X
Performance          1.5X               3X
Power density        1X                 1X
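Two of the figures in each ideal scenario follow from the others: performance is frequency times IPC, and power density is power divided by die size. A small sketch, with dictionary keys chosen here purely for illustration:

```python
# Relative values taken from the two ideal scenarios above.
SCENARIOS = {
    "ideal shrink": {"size": 0.5, "frequency": 1.5, "power": 0.5, "ipc": 1.0},
    "ideal new arch": {"size": 1.0, "frequency": 1.5, "power": 1.0, "ipc": 2.0},
}


def derived(scenario: dict) -> dict:
    """Derive performance and power density from the primary figures."""
    return {
        "performance": scenario["frequency"] * scenario["ipc"],
        "power_density": scenario["power"] / scenario["size"],
    }


for name, scenario in SCENARIOS.items():
    print(name, derived(scenario))
```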

Page 20: 1 Low-power computer architecture Dr. Avi Mendelson

Process Technologies – Reality

But in reality:
  New process is not ideal anymore
  New designs squeeze frequency to 2X per process
  New designs use more transistors (2X-3X to get 1.5X-1.7X perf)

So, every new process and architecture generation:
  Power goes up about 2X
  Power density goes up 30%~80%

This is bad, and…
Will get worse in future process generations:
  Voltage (Vdd) will scale down less
  Leakage is going through the roof

Page 21: 1 Low-power computer architecture Dr. Avi Mendelson

Die increases in order to maintain performance boost

[Figure: processor dies across silicon process technologies (1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ): Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4 processors.]

Page 22: 1 Low-power computer architecture Dr. Avi Mendelson

Put it all together: power and power density are a real threat to Moore’s law

Complex algorithms lead to denser power:
  Dense random logic
  Timing pressure leads to faster/bigger/power-hungrier gates
Designers put together units that communicate with each other.
  It creates “regions” with high activity factors -> hot spots.
Power is not distributed evenly over the chip.
  A failure can happen if a single point reaches the max power point.
Many modern processors are power limited

Page 23: 1 Low-power computer architecture Dr. Avi Mendelson

Some implications

We can’t build microprocessors with ever increasing power density and die sizes
The constraint is power, not manufacturability
The design of any future microprocessor should take power into consideration.
We need to distinguish between different aspects of power:
  Power delivery
  Max power (TJ)
  Power density - hot spots
  Energy - static + dynamic
Power and energy aware design should take care of each of these aspects
One size does not fit all anymore

Page 24: 1 Low-power computer architecture Dr. Avi Mendelson

General solutions and directions

Assume that one size does not fit all.
For different segments there may be different solutions (although many of them share the same principle of operation).

Page 25: 1 Low-power computer architecture Dr. Avi Mendelson

Embedded systems vs. Laptops

Embedded systems
  Most of the power is consumed by the CPU
  Usually not thermally limited.
  What we really care about is battery life and meeting the timing limitations.
  In real-time systems we can take advantage of known “deadlines”

Laptops (mobile systems)
  We are thermally limited.
  We cannot use deadlines (most of the time).
  We need to optimize for max battery life and max performance in a given power envelope.

Page 26: 1 Low-power computer architecture Dr. Avi Mendelson

How to extend battery life: Voltage Scaling

Within a given voltage range, higher voltage allows higher frequency.
Used for trading power and frequency, either
  Statically, at manufacturing time
  Dynamically, at run time (e.g., Intel’s SpeedStep® Technology)
Actual range depends on the specific design and process technology
Examples*:
  Intel® XScale™ processors run from 0.75V (150MHz/50mW) to 1.65V (800MHz/900mW)
  Intel mobile Pentium® III processor sells from 1.1V (600MHz) to 1.7V (1GHz)

[Figure: XScale proc. freq & power vs voltage. Frequency (MHz) and power (mWatt), axis from 0 to 1000, versus voltage from 0.5 to 1.9.]

* Source: Intel Corp. (http://developer.intel.com)
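The two XScale operating points quoted above can be turned into energy per clock cycle, which shows what the lower voltage buys. This is a back-of-the-envelope sketch using only the numbers on the slide; the names are made up for illustration.

```python
# (voltage in V, frequency in Hz, power in W) for the two quoted XScale points.
OPERATING_POINTS = {
    "low": (0.75, 150e6, 50e-3),
    "high": (1.65, 800e6, 900e-3),
}


def energy_per_cycle_nj(freq_hz: float, power_w: float) -> float:
    """Average energy spent per clock cycle, in nanojoules."""
    return power_w / freq_hz * 1e9


for name, (volts, freq_hz, power_w) in OPERATING_POINTS.items():
    nj = energy_per_cycle_nj(freq_hz, power_w)
    print(f"{name}: {volts} V, {freq_hz / 1e6:.0f} MHz, {nj:.3f} nJ/cycle")

# Ratio that a pure C*V^2 model would predict between the two points.
v_low, v_high = OPERATING_POINTS["low"][0], OPERATING_POINTS["high"][0]
print(f"(V_high / V_low)^2 = {(v_high / v_low) ** 2:.2f}")
```

The measured energy-per-cycle ratio is smaller than the squared voltage ratio, so the simple CV² model should be read as a rough guide rather than an exact prediction for this part.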

Page 27: 1 Low-power computer architecture Dr. Avi Mendelson

Voltage Scaling (cont.)

Huge effect on dynamic power:
  20% freq reduction -> 20% voltage reduction ->
    about 35% energy reduction ($CV^2 = C \cdot 0.8^2 = C \cdot 0.64$)
    about 50% power reduction ($CV^2 f = C \cdot 0.8^3 = C \cdot 0.51$)
  Even more impressive if we recall: a 20% freq hit costs only a 10%-15% performance hit*

Voltage scaling can be used to trade performance for power
  Reduce the power consumption when performance needs can be relaxed
  e.g., if deadlines are known and we have enough “dead time”, we can extend the execution time in exchange for a lower voltage.

BUT it has technology limitations

* Depends mainly on core to bus frequency ratio and cache size.
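The arithmetic above, plus the deadline idea, can be sketched as follows. The sketch assumes voltage and frequency scale by the same factor and, pessimistically, that run time is inversely proportional to frequency; `min_scale` is a hypothetical floor standing in for the technology limitations mentioned on the slide.

```python
def scaled_energy_and_power(s: float) -> dict:
    """Relative dynamic energy and power when V and f are both scaled by s."""
    return {"energy": s ** 2, "power": s ** 3}


def slowest_scale_for_deadline(work_time_s: float, deadline_s: float,
                               min_scale: float = 0.5) -> float:
    """Lowest scale factor that still meets the deadline.

    work_time_s is the run time at full speed; run time at scale s is
    assumed to be work_time_s / s.
    """
    if deadline_s < work_time_s:
        raise ValueError("deadline cannot be met even at full speed")
    return max(work_time_s / deadline_s, min_scale)


scale = slowest_scale_for_deadline(work_time_s=8.0, deadline_s=10.0)
print(f"scale factor: {scale:.2f}")
print(scaled_energy_and_power(scale))
```

With 8 seconds of work and a 10 second deadline, the task can run at 0.8 of full speed, which is the 20% reduction case from the slide.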

Page 28: 1 Low-power computer architecture Dr. Avi Mendelson

How to extend battery life: energy efficiency

Energy per task
  Proportional to the # of processed instructions per task
  Proportional to the average energy consumed per instruction

“Energy per (retired) instruction” = $\rho \cdot W$, where
  $\rho$: ratio of total to retired number of processed instructions
  W: average energy spent in processing an instruction
Both figures deteriorate with every new microarchitecture,
  since speculation increases and complexity grows

In that respect: high-performance modern microarchitectures are less energy-efficient
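A minimal sketch of the energy-per-retired-instruction product follows. The instruction counts and per-instruction energies are hypothetical, picked only to show how more speculation (a higher total-to-retired ratio) and more complexity (higher energy per processed instruction) compound.

```python
def energy_per_retired_instruction_nj(processed: float, retired: float,
                                      avg_energy_nj: float) -> float:
    """(total processed / retired) * average energy per processed instruction."""
    ratio = processed / retired
    return ratio * avg_energy_nj


# Hypothetical simple core: little speculation, cheap instructions.
simple = energy_per_retired_instruction_nj(1.1e9, 1.0e9, avg_energy_nj=1.0)
# Hypothetical aggressive core: more wasted work, costlier instructions.
aggressive = energy_per_retired_instruction_nj(1.5e9, 1.0e9, avg_energy_nj=1.6)

print(f"simple core:     {simple:.2f} nJ per retired instruction")
print(f"aggressive core: {aggressive:.2f} nJ per retired instruction")
```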

Page 29: 1 Low-power computer architecture Dr. Avi Mendelson

Improving Hot Spots: Clustering

Build your system as a clustered architecture (e.g., Alpha)
Design your system so that when all clusters are active the system exceeds the Max-Power allowed
Most of the time, not all the clusters are active
“Smart scheduling” will spread the thermal hot-spots among different clusters.
In VLIW based architectures, compilers can help

Page 30: 1 Low-power computer architecture Dr. Avi Mendelson


Alpha hot spots

Source - CoolChips-99

Area 30%

Freq. 50%

Power 67%

Page 31: 1 Low-power computer architecture Dr. Avi Mendelson

Power Complexity Metrics

$\text{Power} \propto C V^2 f$
Metrics: suppose we introduce a new feature that consumes extra x power and gains y performance:
1. Power/Perf (~ Energy), assuming same technology (same C) and same voltage
   For battery life, energy bills.
   For a given power envelope, without voltage scaling.
2. Power/Perf^2 (~ Energy*Delay)
   Balances performance and power needs.
3. Power/Perf^3 (~ Energy*Delay^2)
   For a given power envelope, with voltage scaling,
   assuming that (1) we can trade frequency for voltage scaling, and (2) we can lower the voltage as much as we wish
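The three metrics can rank the same feature differently. The sketch below evaluates a hypothetical feature that costs 10% extra power and gains 5% performance, relative to a baseline of 1.0; values below 1.0 mean the feature improves that metric. The function name and the example numbers are illustrative only.

```python
def relative_metrics(extra_power: float, extra_perf: float) -> dict:
    """Metrics relative to a baseline of 1.0 (lower is better)."""
    power = 1.0 + extra_power
    perf = 1.0 + extra_perf
    return {
        "power/perf": power / perf,            # ~ energy
        "power/perf^2": power / perf ** 2,     # ~ energy * delay
        "power/perf^3": power / perf ** 3,     # ~ energy * delay^2
    }


for name, value in relative_metrics(extra_power=0.10, extra_perf=0.05).items():
    print(f"{name}: {value:.3f}")
```

Here the feature loses on energy, is roughly neutral on energy*delay, and wins on energy*delay^2, so whether it is worth adding depends on which constraint the target segment actually faces.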

Page 32: 1 Low-power computer architecture Dr. Avi Mendelson

E*D product (lower is better)

E = energy / instruction = Power * sec / instruction = Watt / MIPS
D = sec / instruction = 1 / MIPS
E * D ~ Watt / MIPS^2

[Figure: energy (pJ) and delay versus Vdd (0 to 3 volts), and the E x D product versus Vdd (0 to 3 volts).]

Page 33: 1 Low-power computer architecture Dr. Avi Mendelson

Leakage control

Leakage depends on: technology, area, voltage and temperature.
High temperature -> high leakage -> high power -> higher temperature
Leakage will be very significant in future microarchitectures.
Large caches contribute to performance but may increase power due to leakage.
Larger caches: better performance, but higher leakage -> slower clock -> lower performance.
Leakage makes the major difference between clock gating and deep sleep modes (where power is disconnected)
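The temperature-leakage feedback loop can be illustrated with a toy fixed-point iteration. Everything in this sketch is an assumption made for illustration: the exponential leakage model (doubling every `doubling_c` degrees), the single thermal resistance, and all default parameter values. It is not taken from the slides or from any real device.

```python
def steady_state(p_dynamic_w: float, t_ambient_c: float = 40.0,
                 r_thermal_c_per_w: float = 0.5, p_leak_ref_w: float = 5.0,
                 t_ref_c: float = 40.0, doubling_c: float = 30.0,
                 t_limit_c: float = 150.0, max_iter: int = 200,
                 tol: float = 1e-6):
    """Iterate temperature -> leakage -> power -> temperature.

    Returns (temperature_c, leakage_w) at the fixed point, or None if the
    temperature passes t_limit_c or the loop does not settle.
    """
    temp = t_ambient_c
    for _ in range(max_iter):
        # Hypothetical model: leakage doubles every `doubling_c` degrees.
        leak = p_leak_ref_w * 2 ** ((temp - t_ref_c) / doubling_c)
        next_temp = t_ambient_c + r_thermal_c_per_w * (p_dynamic_w + leak)
        if next_temp > t_limit_c:
            return None
        if abs(next_temp - temp) < tol:
            return next_temp, leak
        temp = next_temp
    return None


result = steady_state(p_dynamic_w=40.0)
if result is not None:
    temp_c, leak_w = result
    print(f"settles near {temp_c:.1f} C with {leak_w:.1f} W of leakage")
```

In this toy model the leakage at the settled temperature is noticeably higher than at the reference temperature, which is the feedback effect the slide describes.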

Page 34: 1 Low-power computer architecture Dr. Avi Mendelson

Design for power: Out Of Order Execution

OOO architecture was found to be very efficient in masking the effect of L1 cache misses.

Aggressive OOO and wider machines require more registers and memory ports
  This consumes a lot of power

Can we slow down the access to the cache and let the OOO solve the performance problem?

Can we simplify the OOO mechanisms, assuming that the memory subsystem limits the performance?

How aggressive should we be with speculation (branch prediction, value prediction, etc.)?

Page 35: 1 Low-power computer architecture Dr. Avi Mendelson

Pentium Pro Power Breakdown

Fetch          14%
Decode         14%
ROB             7%
Data $          7%
Int Exec        6%
FP Exec         5%
Clock           5%
RS              5%
External Bus    6%
MOB             4%
RAT             4%
Misc           23%

Actual computation: less than 25%!

What can be done:
  Trace cache
  Many low-level improvements

Page 36: 1 Low-power computer architecture Dr. Avi Mendelson

SMT

Single CPU µArch augmented to look like 2 or more CPUs to the software

Adds ~10% logic to the CPU (Alpha experience)

Average power increases <10%.

Can increase the performance of two threads by 20-50% relative to running the same applications sequentially.

Looks like a good tradeoff between power and performance.
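The tradeoff can be checked with a power-per-performance ratio. The sketch takes the upper bound of a 10% power increase and the 20% and 50% ends of the quoted performance range; values below 1.0 mean less energy per unit of work than the baseline. The function name is made up for illustration.

```python
def power_per_performance(power_increase: float, perf_increase: float) -> float:
    """Relative power/performance versus a baseline of 1.0 (lower is better)."""
    return (1.0 + power_increase) / (1.0 + perf_increase)


low_gain = power_per_performance(power_increase=0.10, perf_increase=0.20)
high_gain = power_per_performance(power_increase=0.10, perf_increase=0.50)
print(f"20% speedup: {low_gain:.2f}X power per unit of performance")
print(f"50% speedup: {high_gain:.2f}X power per unit of performance")
```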

Page 37: 1 Low-power computer architecture Dr. Avi Mendelson

MT - Implications on power

The area and the power consumption of register files and memory elements within the processor increase significantly due to aggressive out-of-order and aggressive SMT (Alpha, CoolChips-99)

Increases the power at the hot spot; not a fit for thermally limited segments (where performance is needed).

May better tolerate cache misses, so power aware caches can be used

Hot-spots may force us to use more aggressive clustering

Page 38: 1 Low-power computer architecture Dr. Avi Mendelson

Questions?