Computer Architecture


ECGR 4181/5181 UNCC 2015

Introduction

• Background: ECGR 3183 or equivalent, based on Hennessy and Patterson’s Computer Organization and Design

ECGR 4181/5181 UNCC 2015

Computer Technology

• Performance improvements: improvements in semiconductor technology
- Feature size, clock speed
- Transistor density increases by 35% per year and die size increases by 10-20% per year… more functionality
- Transistor speed improves linearly with size (a complex equation involving voltages, resistances, capacitances)… can lead to clock speed improvements!
- Wire delays do not scale down at the same rate as logic delays
- The power wall: it is not possible to consistently run at higher frequencies without hitting power/thermal limits (Turbo Mode can cause occasional frequency boosts)
- More work can be done per clock cycle, since transistors are faster, more energy-efficient, and more numerous

ECGR 4181/5181 UNCC 2015

Computer Technology

• Performance improvements: improvements in computer architectures
- Enabled by HLL compilers, UNIX
- Led to RISC architectures

• Together these have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming languages

ECGR 4181/5181 UNCC 2015

Single Processor Performance

[Figure: growth in single-processor performance over time, marking the introduction of RISC and the later move to multi-processor designs]

ECGR 4181/5181 UNCC 2015

Current Trends in Architecture

• Cannot continue to leverage instruction-level parallelism (ILP): single-processor performance improvement ended in 2003

• New models for performance:
- Data-level parallelism (DLP)
- Thread-level parallelism (TLP)
- Request-level parallelism (RLP)

• These require explicit restructuring of the application

ECGR 4181/5181 UNCC 2015

Classes of Computers

• Personal Mobile Device (PMD), e.g. smart phones, tablet computers
- Emphasis on cost, energy efficiency, media performance, and real-time performance

• Desktop Computing
- Emphasis on price-performance, energy, graphics

• Servers
- Emphasis on availability, scalability, throughput, energy

• Clusters / Warehouse-Scale Computers
- Used for "Software as a Service" (SaaS)
- Emphasis on availability, price-performance, energy proportionality
- Sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks

• Embedded Computers
- Emphasis on price

ECGR 4181/5181 UNCC 2015

Embedded Processor Characteristics

• The largest class of computers, spanning the widest range of applications and performance

• Often have minimum performance requirements. Example?

• Often have stringent limitations on cost. Example?

• Often have stringent limitations on power consumption. Example?

• Often have low tolerance for failure. Example?

ECGR 4181/5181 UNCC 2015

Characteristics of Each Segment

• Desktop: Need high performance (graphics performance?) and low price

• Servers: Need high throughput, availability (revenue losses of $14K-$6.5M per hour of downtime), scalability

• Embedded: Need low power, low cost, low memory (almost don't care about performance, except for real-time constraints)

Feature          Desktop                Server                                   Embedded
System price     $1,000 - $10,000       $10,000 - $10,000,000                    $10 - $100,000
Processor price  $100 - $1,000          $200 - $2,000                            $0.20 - $200
Design issues    Price-perf, graphics   Throughput, availability, scalability    Price, power, app-specific performance

ECGR 4181/5181 UNCC 2015

Parallelism

• Classes of parallelism in applications:
- Data-Level Parallelism (DLP)
- Task-Level Parallelism (TLP)

• Classes of architectural parallelism:
- Instruction-Level Parallelism (ILP)
- Vector architectures / Graphics Processor Units (GPUs)
- Thread-Level Parallelism
- Request-Level Parallelism

ECGR 4181/5181 UNCC 2015

Flynn's Taxonomy

• Single instruction stream, single data stream (SISD)

• Single instruction stream, multiple data streams (SIMD)
- Vector architectures
- Multimedia extensions
- Graphics processor units

• Multiple instruction streams, single data stream (MISD)
- No commercial implementation

• Multiple instruction streams, multiple data streams (MIMD)
- Tightly-coupled MIMD
- Loosely-coupled MIMD

ECGR 4181/5181 UNCC 2015

Defining Computer Architecture

• "Old" view of computer architecture: Instruction Set Architecture (ISA) design, i.e. decisions regarding:
- registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding

• "Real" computer architecture:
- Specific requirements of the target machine
- Design to maximize performance within constraints: cost, power, and availability
- Includes ISA, microarchitecture, hardware

ECGR 4181/5181 UNCC 2015

Trends in Technology

• Integrated circuit technology
- Transistor density: 35%/year
- Die size: 10-20%/year
- Overall integration: 40-55%/year

• DRAM capacity: 25-40%/year (slowing)

• Flash capacity: 50-60%/year; 15-20X cheaper/bit than DRAM

• Magnetic disk technology: 40%/year; 15-25X cheaper/bit than Flash; 300-500X cheaper/bit than DRAM

ECGR 4181/5181 UNCC 2015

Bandwidth and Latency

• Bandwidth or throughput: total work done in a given time
- 10,000-25,000X improvement for processors
- 300-1200X improvement for memory and disks

• Latency or response time: time between start and completion of an event
- 30-80X improvement for processors
- 6-8X improvement for memory and disks

ECGR 4181/5181 UNCC 2015

Bandwidth and Latency

[Figure: log-log plot of bandwidth and latency milestones]

ECGR 4181/5181 UNCC 2015

Transistors and Wires

• Feature size: the minimum size of a transistor or wire in the x or y dimension
- 10 microns in 1971 to 0.032 microns in 2011
- Transistor performance scales linearly
- Wire delay does not improve with feature size!
- Integration density scales quadratically

ECGR 4181/5181 UNCC 2015

Trends in Cost

• Cost driven down by the learning curve, measured by yield

• DRAM: price closely tracks cost

• Microprocessors: price depends on volume; 10% less for each doubling of volume

ECGR 4181/5181 UNCC 2015

Integrated Circuit Cost

• Integrated circuit: Cost of die = Cost of wafer / (Dies per wafer × Die yield)

• Bose-Einstein formula:
  Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N

• Defects per unit area = 0.016-0.057 defects per square cm (2010)

• N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
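As a rough illustration (not part of the original slides), the die-yield model above can be evaluated directly in Python; the die area and the mid-range parameter values below are illustrative assumptions:

# Minimal sketch of the Bose-Einstein die-yield model above.
# die_area_cm2 and the parameter values are illustrative assumptions, not from the slides.
def die_yield(defects_per_cm2, die_area_cm2, n):
    # Wafer yield is assumed to be 1 (no fully bad wafers).
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2) ** n

print(die_yield(defects_per_cm2=0.03, die_area_cm2=1.5, n=13.5))  # ~0.55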

ECGR 4181/5181 UNCC 2015

Measuring Performance

• Typical performance metrics:
- Response time
- Throughput

• Speedup of X relative to Y = Execution time_Y / Execution time_X

• Execution time
- Wall clock time: includes all system overheads
- CPU time: only computation time

• Benchmarks
- SPEC CPU 2006: CPU-oriented programs (for desktops)
- SPECweb, TPC: throughput-oriented (for servers)
- EEMBC: for embedded processors/workloads

ECGR 4181/5181 UNCC 2015

Principles of Computer Design

• Take advantage of parallelism: e.g. multiple processors, disks, memory banks, pipelining, multiple functional units

• Principle of locality: reuse of data and instructions

• Focus on the common case: Amdahl's Law

ECGR 4181/5181 UNCC 2015

Compute Speedup – Amdahl's Law

Speedup due to an enhancement E. Let F be the fraction of execution time where the enhancement is applied (also called the parallel fraction, with (1-F) the serial fraction), and let S be the speedup of the enhanced portion:

Execution time_after = ExTime_before x [(1 - F) + F/S]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1 - F) + F/S]
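A minimal Python sketch of the formula above (not part of the original slides); the 40%/10x numbers come from the web-serving example later in the deck:

# Amdahl's Law: overall speedup when fraction f of execution time is sped up by factor s.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

print(amdahl_speedup(0.4, 10))  # 1.5625, i.e. the ~1.56 web-server result later in the deck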

ECGR 4181/5181 UNCC 2015

Principles of Computer Design

• The Processor Performance Equation:
  CPU time = Instruction count x CPI x Clock cycle time

ECGR 4181/5181 UNCC 2015

Principles of Computer Design

• Different instruction types have different CPIs:
  CPU clock cycles = Σ (IC_i x CPI_i) over all instruction classes i

ECGR 4181/5181 UNCC 2015

Measuring Performance

How do we conclude that System-A is "better" than System-B?

ECGR 4181/5181 UNCC 2015

Performance Metrics

• Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best cost/performance?

• Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best cost/performance?

• Both require a basis for comparison and a metric for evaluation

• Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

ECGR 4181/5181 UNCC 2015

Defining (Speed) Performance

• Normally interested in reducing:
- Response time (execution time): the time between the start and the completion of a task; important to individual users. Thus, to maximize performance, we need to minimize execution time
- Throughput: the total amount of work done in a given time; important to data center managers. Decreasing response time almost always improves throughput

performance_X = 1 / execution_time_X

If X is n times faster than Y, then

performance_X / performance_Y = execution_time_Y / execution_time_X = n

ECGR 4181/5181 UNCC 2015

Performance Factors

• Want to distinguish elapsed time from the time spent on our task

• CPU execution time (CPU time): the time the CPU spends working on a task; does not include time waiting for I/O or running other programs

CPU execution time for a program = # CPU clock cycles for a program x clock cycle time

or

CPU execution time for a program = # CPU clock cycles for a program / clock rate

• Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

ECGR 4181/5181 UNCC 2015

Clock Cycles per Instruction

• Not all instructions take the same amount of time to execute; one way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

• Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute; a way to compare two different implementations of the same architecture

# CPU clock cycles for a program = # Instructions for a program x average clock cycles per instruction

Instruction class:           A  B  C
CPI for this class:          1  2  3

ECGR 4181/5181 UNCC 2015

Effective CPI

• Computing the overall effective CPI is done by looking at the different types of instructions, their individual cycle counts, and averaging:

Overall effective CPI = Σ_{i=1}^{n} (CPI_i x IC_i)

where IC_i is the count (percentage) of the number of instructions of class i executed, CPI_i is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes

• The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs
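As an illustration (not from the slides), the weighted sum above is a one-liner in Python; the instruction mix used is the one from the "A Simple Example" slide later in the deck:

# Effective CPI as the IC-weighted average of per-class CPIs.
def effective_cpi(mix):
    # mix: list of (fraction_of_instructions, cpi) pairs; fractions sum to 1.
    return sum(frac * cpi for frac, cpi in mix)

print(effective_cpi([(0.5, 1), (0.2, 5), (0.1, 3), (0.2, 2)]))  # 2.2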

ECGR 4181/5181 UNCC 2015

THE Performance Equation

• Our basic performance equation is then

CPU time = Instruction_count x CPI x clock_cycle

or

CPU time = Instruction_count x CPI / clock_rate

• These equations separate the three key factors that affect performance:
- Can measure the CPU execution time by running the program
- The clock rate is usually given
- Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
- CPI varies by instruction type and ISA implementation, for which we must know the implementation details
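A quick Python sketch (not from the slides) of the two equivalent forms of the equation; the instruction count and clock rate are made-up values:

# CPU time from instruction count, CPI, and either cycle time or clock rate.
def cpu_time_cycle(ic, cpi, cycle_time_s):
    return ic * cpi * cycle_time_s

def cpu_time_rate(ic, cpi, clock_rate_hz):
    return ic * cpi / clock_rate_hz

# 1e9 instructions, CPI 2.2, 1 GHz clock (1 ns cycle): both forms give 2.2 s.
print(cpu_time_cycle(1e9, 2.2, 1e-9), cpu_time_rate(1e9, 2.2, 1e9))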

ECGR 4181/5181 UNCC 2015

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                        Instruction_count   CPI   clock_cycle
Algorithm                      X             X
Programming language           X             X
Compiler                       X             X
ISA                            X             X         X
Processor organization                       X         X
Technology                                             X

ECGR 4181/5181 UNCC 2015

A Simple Example

Op       Freq   CPI_i   Freq x CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
                Σ =     2.2

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  New CPI = 0.5 + 0.4 + 0.3 + 0.4 = 1.6
  CPU time_new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster

• How does this compare with using branch prediction to shave a cycle off the branch time?
  New CPI = 0.5 + 1.0 + 0.3 + 0.2 = 2.0
  CPU time_new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster

• What if two ALU instructions could be executed at once?
  New CPI = 0.25 + 1.0 + 0.3 + 0.4 = 1.95
  CPU time_new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
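The three alternatives can be re-derived mechanically; a sketch (not from the slides):

# Re-derive the three design alternatives from the instruction mix above.
cpi = lambda mix: sum(f * c for f, c in mix)

old    = cpi([(0.5, 1), (0.2, 5), (0.1, 3), (0.2, 2)])    # baseline: 2.2
cache  = cpi([(0.5, 1), (0.2, 2), (0.1, 3), (0.2, 2)])    # loads 5 -> 2 cycles: 1.6
branch = cpi([(0.5, 1), (0.2, 5), (0.1, 3), (0.2, 1)])    # branches 2 -> 1 cycle: 2.0
dual   = cpi([(0.25, 1), (0.2, 5), (0.1, 3), (0.2, 2)])   # paired ALU ops halve the ALU term: 1.95

for new in (cache, branch, dual):
    print(f"speedup {old/new:.3f} = {(old/new - 1)*100:.1f}% faster")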

ECGR 4181/5181 UNCC 2015

                m/c A   m/c B   m/c C
Prog 1 (secs)       1      10      20
Prog 2 (secs)    1000     100      20

Total time: 1001, 101, and 40 seconds

ECGR 4181/5181 UNCC 2015

Comparing and Summarizing Performance

• The guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

• How do we summarize the performance of a benchmark set with a single number? The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):

AM = (1/n) Σ_{i=1}^{n} Time_i

where Time_i is the execution time for the i-th program of a total of n programs in the workload

• A smaller mean indicates a smaller average execution time and thus improved performance

ECGR 4181/5181 UNCC 2015

Normalized Execution Time

• A second approach to an unequal mixture of programs in the workload: normalize execution times to a reference m/c and then take the average of the normalized execution times

• Average normalized execution time: either an arithmetic or geometric mean

• Geometric mean:

Geometric mean = (∏_{i=1}^{n} execution time ratio_i)^(1/n)

where execution time ratio_i is the execution time for program i normalized to the reference m/c

• Interesting property:

Geometric mean(X_i) / Geometric mean(Y_i) = Geometric mean(X_i / Y_i)

ECGR 4181/5181 UNCC 2015

Normalized Execution Times

• Geometric means of normalized execution times are consistent: it does not matter which m/c is the reference

• Not true for the arithmetic mean: it should not be used to average normalized execution times

• Weighted arithmetic means: weightings are proportional to execution times, influenced by frequency of use in the workload, type of m/c, and size of program input. The geometric mean is independent of the running times of the individual programs, and any m/c can be used for normalization

• Example: comparative performance evaluation where the programs are fixed but the inputs are not. Competitors could inflate a weighted arithmetic mean by using their best-performing benchmark with the longest input. The geometric mean would be less misleading than the arithmetic mean in such situations

ECGR 4181/5181 UNCC 2015

Normalized Execution Times

Raw execution times:

                m/c A   m/c B   m/c C
Prog 1 (secs)       1      10      20
Prog 2 (secs)    1000     100      20

Times normalized to A, B, and C:

        Normalized to A       Normalized to B       Normalized to C
        A     B      C        A     B     C         A      B     C
P1      1.0   10.0   20.0     0.1   1.0   2.0       0.05   0.5   1.0
P2      1.0   0.1    0.02     10.0  1.0   0.2       50.0   5.0   1.0
AM      1.0   5.05   10.01    5.05  1.0   1.1       25.03  2.75  1.0
GM      1.0   1.00   0.63     1.0   1.0   0.63      1.58   1.58  1.0
Total   1.0   0.11   0.04     9.1   1.0   0.36      25.03  2.75  1.0
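A sketch (not from the slides) that reproduces the table and shows the consistency claim made on the next slide:

# AM and GM of execution times normalized to each reference machine.
from math import prod

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}  # P1, P2 in seconds

for ref in times:
    for m in times:
        ratios = [t / r for t, r in zip(times[m], times[ref])]
        am = sum(ratios) / len(ratios)
        gm = prod(ratios) ** (1 / len(ratios))
        print(f"normalized to {ref}, machine {m}: AM={am:.2f} GM={gm:.2f}")
# The GM ordering of A, B, C is the same whatever the reference; the AM ordering is not.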

ECGR 4181/5181 UNCC 2015

Normalized Execution Times

• From the geometric means: P1 and P2 perform the same on A and B, which is true only if P1 runs 100 times for every run of P2. For such a workload, the execution times on machines A and B are 1100 secs and the execution time on m/c C is 2020 secs; yet the geometric mean suggests C is faster than A and B

• Conclusion: in general, there is no workload for three or more machines that will match the performance predicted by the geometric means of normalized execution times

• Another problem with GM: it encourages h/w and s/w designers to focus their attention on the benchmarks where performance is easiest to improve rather than on the benchmarks that are slowest. Example: cutting running time from 2 seconds to 1 improves the GM as much as cutting it from 10,000 seconds to 5,000

ECGR 4181/5181 UNCC 2015

Example

• A new laptop has an IPC that is 20% worse than the old laptop. It has a clock speed that is 30% higher than the old laptop. The same binaries are run on both machines; their IPCs are listed below, and each binary gets an equal share of CPU time. What is the speedup of the new laptop?

          P1    P2    P3    AM    GM
Old-IPC   1.2   1.6   2.0   1.6   1.57
New-IPC   1.6   1.6   1.6   1.6   1.6

ECGR 4181/5181 UNCC 2015

Speedup Vs. Percentage

• "Speedup" is a ratio = old exec time / new exec time

• "Improvement", "Increase", "Decrease" usually refer to a percentage relative to the baseline = (new perf - old perf) / old perf

• A program ran in 100 seconds on the old laptop and in 70 seconds on the new laptop
- What is the speedup?
- What is the percentage increase in performance?
- What is the reduction in execution time?
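Worked answers as a sketch (not on the original slide):

# Speedup vs. percentage improvement vs. time reduction for the 100 s -> 70 s program.
old, new = 100.0, 70.0
print(old / new)                  # speedup = 1.43
print((1/new - 1/old) / (1/old))  # performance increase = 0.429, i.e. ~43%
print((old - new) / old)          # execution time reduction = 0.30, i.e. 30%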

ECGR 4181/5181 UNCC 2015

Quantitative Principles of Computer Design

• Designing for performance and cost-performance

• One of the most important principles: make the common case fast. Favor the frequent case over the infrequent case; a design trade-off

• How much performance can be improved by making the frequent case faster?

• Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used; it defines the speedup that can be gained by using a particular feature

Speedup = Performance for entire task using the enhancement / Performance for entire task without the enhancement

Speedup = Execution time for entire task without the enhancement / Execution time for entire task using the enhancement

ECGR 4181/5181 UNCC 2015

Quantitative Principles of Computer Design

• Speedup depends on two factors:

- The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement. Example: if 20 seconds of the execution time of a program that takes 80 seconds in total can use an enhancement, the fraction is 20/80. Note that Fraction_enhanced ≤ 1

- The improvement gained by the enhanced execution mode: how much faster the task would run if the enhanced mode were used for the entire program (Speedup_enhanced)

• The overall speedup is the ratio of the execution times:

Execution time_new = Execution time_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = Execution time_old / Execution time_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

ECGR 4181/5181 UNCC 2015

Quantitative Principles of Computer Design

• Example

Suppose we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

Fraction_enhanced = 0.4, Speedup_enhanced = 10

Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced] = 1 / (0.6 + 0.4/10) = 1.56

ECGR 4181/5181 UNCC 2015

Quantitative Principles of Computer Design

• Amdahl's Law expresses the law of diminishing returns: the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

• If an enhancement is only usable for a fraction of a task, we cannot speed up the task by more than 1/(1 - fraction)

• Caution! Do not confuse "the fraction of time converted to use the enhancement" with "the fraction of time after the enhancement is in use"

ECGR 4181/5181 UNCC 2015

Quantitative Principles of Computer Design

• Example: A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort required for the fast square root. Compare the two design alternatives.

Speedup_FPSQR = 1 / [(1 - 0.2) + 0.2/10] = 1.22

Speedup_FP = 1 / [(1 - 0.5) + 0.5/1.6] = 1.23

Improving the performance of the FP operations overall is slightly better because of the higher frequency

ECGR 4181/5181 UNCC 2015

The CPU Performance Equation

• Sometimes it is difficult to measure the time consumed by different operations

• Alternative: decompose CPU execution time into different components and see how different alternatives affect these components

• In terms of the number of instructions executed (the instruction path length or instruction count, IC) and clock cycles per instruction (CPI); this figure of merit provides insight into different styles of instruction sets and implementations

CPU time = CPU clock cycles for a program x clock cycle time

CPU time = CPU clock cycles for a program / clock rate

CPI = CPU clock cycles for a program / Instruction count

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

• CPU performance is dependent on three characteristics: clock cycle time (or rate), clock cycles per instruction, and instruction count

• It is difficult to change one parameter without affecting the others:
- Clock cycle time: hardware technology and organization
- CPI: organization and instruction set architecture
- Instruction count: ISA and compiler technology

• Many performance improvement techniques improve one component of CPU performance with small or predictable impact on the other two

CPU clock cycles for a program = IC x CPI

CPU time = IC x clock cycle time x CPI

CPU time = IC x CPI / clock rate

CPU time = (instructions/program) x (clock cycles/instruction) x (seconds/clock cycle) = seconds/program

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

• Sometimes it is useful to calculate the number of total CPU clock cycles as

CPU clock cycles = Σ_{i=1}^{n} (IC_i x CPI_i)

CPU time = [Σ_{i=1}^{n} (IC_i x CPI_i)] x clock cycle time

CPI = Σ_{i=1}^{n} (IC_i x CPI_i) / IC = Σ_{i=1}^{n} (IC_i / IC) x CPI_i

• Note: CPI_i should be measured and not just calculated from a reference table

• Possible uses of the CPU performance equation:
- High-level performance comparisons
- Back-of-the-envelope calculations
- Enable architects to think about compilers and technology

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

Example

Suppose we have the following measurements:

Frequency of FP operations = 25%

Average CPI of FP operations = 4.0

Average CPI of other operations = 1.33

Frequency of FPSQR = 2%

CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all the FP operations to 2.5. Compare the two design alternatives using the CPU performance equation.

Note: only the CPI changes; the clock rate and IC do not change

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

CPI without enhancement:

CPI_original = Σ_{i=1}^{n} (IC_i / IC) x CPI_i = (4 x 25%) + (1.33 x 75%) = 2.0

CPI with enhanced FPSQR:

CPI_new FPSQR = CPI_original - 2% x (CPI_old FPSQR - CPI_new FPSQR only) = 2.0 - 2% x (20 - 2) = 1.64

CPI with enhanced FP instructions:

CPI_new FP = (75% x 1.33) + (25% x 2.5) = 1.62

Speedup_new FP = CPU time_original / CPU time_new FP
              = (IC x clock cycle time x CPI_original) / (IC x clock cycle time x CPI_new FP)
              = 2.0 / 1.62 = 1.23

Making all FP instructions faster gives the slightly lower CPI, so it is the marginally better alternative.
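A sketch (not from the slides) that reproduces the comparison:

# CPI-based comparison of the two design alternatives above.
cpi_original = 0.25 * 4.0 + 0.75 * 1.33        # ~2.0
cpi_fpsqr    = cpi_original - 0.02 * (20 - 2)  # ~1.64
cpi_fp       = 0.75 * 1.33 + 0.25 * 2.5        # ~1.62

print(cpi_original / cpi_fpsqr)  # ~1.22
print(cpi_original / cpi_fp)     # ~1.23: the overall FP enhancement wins slightly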

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

Example

Machine A: 40% ALU operations (1 cycle), 24% loads (1 cycle), 11% stores (2 cycles), 25% branches (2 cycles)

Should one-cycle stores be implemented if it slows the clock by 15%?

old CPI = 0.4 + 0.24 + (0.11 x 2) + (0.25 x 2) = 1.36 cycles/instruction

new CPI = 0.4 + 0.24 + 0.11 + (0.25 x 2) = 1.25 cycles/instruction

Speedup = (IC x old CPI x old clock cycle time) / (IC x new CPI x new clock cycle time)
        = (1.36 x T) / (1.25 x 1.15T) = 0.946 < 1

NO!

ECGR 4181/5181 UNCC 2015

CPU Performance Equation

How do we actually measure execution time and CPI?

• Execution time: the time command (Unix/Linux)

• Calculate CPI

• Break down CPI into compute, memory stalls, etc. to identify performance problems

CPI breakdown measurement:

• Hardware event counters

• Cycle-level simulators

ECGR 4181/5181 UNCC 2015

Harmonic Mean

• If performance is expressed as a rate, then the average that tracks the total execution time is the harmonic mean:

HM = n / Σ_{i=1}^{n} (1/Rate_i)

• Note: the arithmetic mean cannot be used for rates

Example: Travel at 40 MPH for one mile and 80 MPH for one mile; the average is not 60 MPH

HM = 2 / (1/40 + 1/80) = 53.33 MPH
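A sketch (not from the slides) of the formula with the MPH example:

# Harmonic mean of rates; tracks total execution time when performance is a rate.
def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

print(harmonic_mean([40, 80]))  # 53.33 MPH, not the arithmetic mean's 60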

ECGR 4181/5181 UNCC 2015

Little’s Law

• A key relationship between latency and bandwidth: the average number of objects in a queue is the product of the entry rate and the average holding time

average number of objects = arrival rate x average holding time
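A tiny numeric illustration (not from the slides; the rates are made-up values):

# Little's Law: objects in system = arrival rate x average holding time.
arrival_rate = 50   # e.g. requests arriving per second
holding_time = 0.2  # average seconds each request stays in the system
print(arrival_rate * holding_time)  # 10 requests in the system on average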

ECGR 4181/5181 UNCC 2015

Power and Energy

• Problem: get power in, get power out

• Thermal Design Power (TDP)
- Characterizes sustained power consumption
- Used as a target for the power supply and cooling system
- Lower than peak power, higher than average power consumption

• Clock rate can be reduced dynamically to limit power consumption

• Energy per task is often a better measurement

ECGR 4181/5181 UNCC 2015

Dynamic Energy and Power

• Dynamic energy (per transistor switch from 0 -> 1 or 1 -> 0):
  1/2 x Capacitive load x Voltage^2

• Dynamic power:
  1/2 x Capacitive load x Voltage^2 x Frequency switched

• Reducing the clock rate reduces power, not energy
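A sketch (not from the slides) with illustrative values, showing why reducing the clock rate reduces power but not energy:

# Dynamic energy per 0->1 or 1->0 transition, and dynamic switching power.
def dynamic_energy(c_load_farads, v):
    return 0.5 * c_load_farads * v**2

def dynamic_power(c_load_farads, v, f_switched_hz):
    return 0.5 * c_load_farads * v**2 * f_switched_hz

# Halving the switching frequency halves power...
print(dynamic_power(1e-9, 1.0, 2e9), dynamic_power(1e-9, 1.0, 1e9))  # 1.0 W vs 0.5 W
# ...but the energy per transition is unchanged, so energy per task is unchanged.
print(dynamic_energy(1e-9, 1.0))  # 5e-10 J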

ECGR 4181/5181 UNCC 2015

Power

• The Intel 80386 consumed ~2 W

• A 3.3 GHz Intel Core i7 consumes 130 W

• That heat must be dissipated from a 1.5 x 1.5 cm chip

• This is the limit of what can be cooled by air

ECGR 4181/5181 UNCC 2015

Reducing Power

• Techniques for reducing power:
- Do nothing well
- Dynamic Voltage-Frequency Scaling
- Low power states for DRAM, disks
- Overclocking, turning off cores

ECGR 4181/5181 UNCC 2015

Static Power

• Static power consumption = Current_static x Voltage
- Scales with the number of transistors
- To reduce: power gating

ECGR 4181/5181 UNCC 2015

Power Trends

• In CMOS IC technology:

Power = Capacitive load x Voltage^2 x Frequency switched

• The Power Wall: frequency has grown by roughly x1000 while power grew only by roughly x30, as supply voltage dropped from 5V to 1V

ECGR 4181/5181 UNCC 2015

Reducing Power

• Suppose a new CPU has 85% of the capacitive load of the old CPU, with a 15% voltage reduction and a 15% frequency reduction:

P_new / P_old = [(0.85 x C_old) x (0.85 x V_old)^2 x (0.85 x F_old)] / (C_old x V_old^2 x F_old) = 0.85^4 = 0.52

• The power wall:
- We can't reduce voltage further
- We can't remove more heat

• How else can we improve performance?

ECGR 4181/5181 UNCC 2015

Dependability

• Module reliability:
- Mean time to failure (MTTF)
- Mean time to repair (MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
- Availability = MTTF / MTBF

ECGR 4181/5181 UNCC 2015

Define and quantify dependability

• Module reliability = a measure of continuous service accomplishment (or time to failure). Two metrics:

1. Mean Time To Failure (MTTF) measures reliability

2. Failures In Time (FIT) = 1/MTTF, the rate of failures; traditionally reported as failures per billion hours of operation

• Mean Time To Repair (MTTR) measures service interruption; Mean Time Between Failures (MTBF) = MTTF + MTTR

• Module availability measures service as alternation between the two states of accomplishment and interruption (a number between 0 and 1, e.g. 0.9)

• Module availability = MTTF / (MTTF + MTTR)

ECGR 4181/5181 UNCC 2015

Example calculating reliability

• If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules

• Calculate FIT and MTTF for 10 disks (1M hour MTTF per disk), 1 disk controller (0.5M hour MTTF), and 1 power supply (0.2M hour MTTF):

FailureRate = 10 x (1/1,000,000) + 1/500,000 + 1/200,000 = (10 + 2 + 5)/1,000,000 = 17,000 FIT

MTTF = 1/FailureRate = 1,000,000,000 / 17,000 ≈ 59,000 hours
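The same arithmetic as a sketch (not from the slides):

# System failure rate is the sum of the component failure rates (exponential lifetimes).
rate = 10 * (1 / 1_000_000) + 1 / 500_000 + 1 / 200_000  # failures per hour
print(rate * 1e9)  # 17000 FIT (failures per billion hours)
print(1 / rate)    # MTTF ~ 58824 hours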

ECGR 4181/5181 UNCC 2015

Performance Improvement

Exploiting properties of programs

• Principle of Locality: programs tend to reuse data and instructions they have used recently
- Rule of thumb: a program spends 90% of its execution time in only 10% of the code. We can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past
- The principle of locality also applies to data accesses, though not as strongly as to code accesses

• Two types of locality:
- Temporal locality: recently accessed items are likely to be accessed again in the near future
- Spatial locality: items whose addresses are near one another tend to be referenced close together in time

ECGR 4181/5181 UNCC 2015

Performance Improvement

Take Advantage of Parallelism

• One of the most important methods for improving performance

• Parallelism exists at different levels. For example:
- System level: multiprocessors and multiple disks for servers; the workload is spread among CPUs or disks, improving throughput; scalability is important for server applications
- Individual processors: parallelism at the instruction level, such as pipelining
- Detailed digital design: carry-lookahead ALUs and set-associative caches for parallel searches

• One common class of techniques is parallel computation of two or more possible outcomes, followed by late selection, such as carry-select adders and handling branches in pipelines

ECGR 4181/5181 UNCC 2015

Multiprocessors

• Multicore microprocessors: more than one processor per chip

• Require explicitly parallel programming; compare with instruction-level parallelism:
- Hardware executes multiple instructions at once
- Hidden from the programmer

• Hard to do:
- Programming for performance
- Load balancing
- Optimizing communication and synchronization

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Fallacy: The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite

As processors have become faster and more sophisticated, processor performance in one application area can diverge from that in another area. Sometimes the ISA is responsible, but more often the pipeline structure and memory system are responsible. Thus the clock rate is not a good metric.

[Figure: performance of a 1.7 GHz Pentium 4 relative to a 1 GHz Pentium III]

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Pitfall: Comparing hand-coded assembly and compiler-generated, high-level language performance

Several factors in the embedded market encourage hand-coding, at least of key loops: the importance of a few small loops to overall (real-time) performance in some embedded applications, and the inclusion of instructions that can significantly boost the performance of certain computations but that compilers cannot use effectively. Hand-coding the critical parts can lead to large performance gains:

Machine                    EEMBC benchmark set   Compiler-generated   Hand-coded    Ratio hand/compiler
Trimedia 1300 @ 166 MHz    Consumer              23.3                 110           4.7
BOPS Manta @ 136 MHz       Telecomm              2.6                  225.8         86.8
TI TMS320C6203 @ 300 MHz   Telecomm              6.8                  68.5          10.1

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Fallacy: Peak performance tracks observed performance

One definition of peak performance is the performance level a machine is guaranteed not to exceed. The gap between peak performance and observed performance is typically a factor of 10 or more for supercomputers. Since the gap is large and can vary significantly, peak performance is not useful in predicting observed performance unless the workload consists of small programs that operate close to the peak.

Peak MIPS is obtained by choosing an instruction mix that minimizes the CPI, even if the mix is totally impractical.

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Fallacy: The best design for a computer is the one that optimizes the primary objective without considering implementation

Complex designs take longer to complete and prolong time to market; the design will be less competitive.

• Pitfall: Neglecting the cost of software in either evaluating a system or examining cost-performance

For many years, hardware was so expensive that its cost dominated the cost of software, but this is no longer true. For a medium-size database server, software costs are roughly 50% of the total cost, while for a desktop the software costs are somewhere between 23% and 38%.

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Pitfall: Falling prey to Amdahl's Law

Expending tremendous effort optimizing some aspect of a system before measuring its usage.

• Fallacy: MIPS is an accurate measure for comparing performance among processors

The embedded market uses Dhrystone as the benchmark of choice and reports performance as Dhrystone MIPS. MIPS is easy to understand, and faster machines have higher MIPS ratings, which matches intuition.

MIPS = Instruction count / (Execution time x 10^6) = Clock rate / (CPI x 10^6)

Execution time = Instruction count / (MIPS x 10^6)

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Problems with using MIPS as a measure for comparison:
- MIPS is dependent on the instruction set, making it difficult to compare the MIPS of computers with different instruction sets
- MIPS varies between programs on the same computer
- MIPS can vary inversely with performance. Example: the MIPS rating of a machine with optional floating-point hardware. The machine takes more clock cycles per floating-point instruction than per integer instruction, so floating-point programs that use the optional hardware instead of software floating-point routines take less time but have a lower MIPS rating. Software floating point executes simpler instructions, resulting in a higher MIPS rating, but it executes so many instructions that the execution time is longer

• MIPS is sometimes used by a single vendor within a single set of machines designed for a given class of applications

ECGR 4181/5181 UNCC 2015

Fallacies and Pitfalls

• Designers started reporting relative MIPS (the embedded market reports these for Dhrystone). Relative MIPS for a machine M is defined against some reference machine as:

MIPS_M = (Performance_M / Performance_reference) x MIPS_reference

• MFLOPS: measured the inverse of execution time (millions of floating-point operations per second)

ECGR 4181/5181 UNCC 2015

Other Performance Metrics

• Power consumption, especially in the embedded market where battery life is important (and cooling is passive). For power-limited applications, the most important metric is energy efficiency

ECGR 4181/5181 UNCC 2015

Summary: Evaluating ISAs

• Design-time metrics:
- Can it be implemented? In how long, and at what cost?
- Can it be programmed? Ease of compilation?

• Static metrics:
- How many bytes does the program occupy in memory?

• Dynamic metrics:
- How many instructions are executed?
- How many bytes does the processor fetch to execute the program?
- How many clocks are required per instruction?
- How "lean" a clock is practical?

• Best metric: time to execute the program!

Time = Inst. Count x CPI x Cycle Time, where each term depends on the instruction set, the processor organization, and compilation techniques.