Performance CS510 Computer Architectures Lecture 3 - 1
Lecture 3
Benchmarks and Performance Metrics
Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, Delay, Area, Power Estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns ... (Performance)
  – Throughput, bandwidth

Plane              Speed      Time (DC-Paris)   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours         470          286,700
BAC/Sud Concorde   1350 mph   3.0 hours         132          178,200
The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means:

  n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
Performance Terminology
"X is n% faster than Y" means:

  n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100
Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

  ExTime(Y) / ExTime(X) = 15 / 10 = 1.5 / 1.0 = Performance(X) / Performance(Y)

  n = 100 x (1.5 - 1.0) / 1.0

  n = 50%
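The two ratios above can be checked with a short sketch (the function name is ours, not the slide's):

```python
def percent_faster(extime_y, extime_x):
    """n such that "X is n% faster than Y": 100 x (Perf(X) - Perf(Y)) / Perf(Y)."""
    perf_ratio = extime_y / extime_x  # = Performance(X) / Performance(Y)
    return 100 * (perf_ratio - 1)

# Slide's example: Y takes 15 s, X takes 10 s.
print(percent_faster(15, 10))  # 50.0
```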
Programs to Evaluate Processor Performance
• (Toy) Benchmarks
  – 10~100-line programs
  – e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
  – e.g., Livermore loops
• Real programs
  – e.g., gcc, spice
Benchmarking Games
• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance
Common Benchmarking Mistakes
• Only average behavior represented in test workload
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• "Benchmark Engineering"
  – particular optimizations
  – different compilers or preprocessors
  – runtime libraries
SPEC: System Performance Evaluation Cooperative
• First Round, 1989
  – 10 programs yielding a single number
• Second Round, 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Reference machine: VAX-11/780
• Third Round, 1995
  – Single flag setting for all programs; new set of programs ("benchmarks useful for 3 years")
  – Reference machine: SPARCstation 10 Model 40
SPEC First Round
• One program spent 99% of its time in a single line of code
• A new front-end compiler could improve its result dramatically
[Bar chart: SPECperf (0 to 800) per benchmark for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
How to Summarize Performance
• Arithmetic Mean (weighted arithmetic mean)
  – tracks execution time: (1/n) x Σ Ti, or Σ Wi x Ti
• Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)
  – tracks execution time: n / Σ (1/Ri), or 1 / Σ (Wi/Ri)
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π Ri)^(1/n), where Ri = 1/Ti
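A small sketch of the three summary statistics above (helper names are ours); the example numbers reuse computer A's times from the comparison table later in the lecture:

```python
def arithmetic_mean(times):
    # (1/n) x sum of Ti -- tracks total execution time for an equal mix
    return sum(times) / len(times)

def harmonic_mean(rates):
    # n / sum(1/Ri) -- the right mean for rates such as MFLOPS
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(xs):
    # (product of xs)^(1/n) -- the right mean for normalized ratios
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

times = [1.0, 1000.0]             # P1, P2 on computer A
rates = [1.0 / t for t in times]  # tasks per second
print(arithmetic_mean(times))     # 500.5
print(harmonic_mean(rates))       # ~0.002 (2 tasks in 1001 secs)
```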
Comparing and Summarizing Performance
For program P1, A is 10 times faster than B; for program P2, B is 10 times faster than A; and so on...

The relative performance of the computers is unclear from total execution times:

                    Computer A   Computer B   Computer C
P1 (secs)                    1           10           20
P2 (secs)                1,000          100           20
Total time (secs)        1,001          110           40
Summary Measure
Arithmetic Mean (good if programs are run equally in the workload):

  (1/n) x Σ(i=1..n) Execution Time_i

Harmonic Mean (when performance is expressed as rates):

  n / Σ(i=1..n) (1 / Rate_i),  where Rate_i = 1 / Execution Time_i
Unequal Job Mix
Weighted Arithmetic Mean:  Σ(i=1..n) Weight_i x Execution Time_i

Weighted Harmonic Mean:    1 / Σ(i=1..n) (Weight_i / Rate_i)

Relative Performance
• Normalized execution time relative to a reference machine, summarized with the Arithmetic Mean or Geometric Mean of the n Execution Time Ratio_i values (normalized to the reference machine)
• Weighted execution time
Weighted Arithmetic Mean
WAM(i) = Σ(j=1..n) W(i)_j x Time_j

             A          B        C       W(1)   W(2)    W(3)
P1 (secs)       1.00    10.00    20.00   0.50   0.909   0.999
P2 (secs)   1,000.00   100.00    20.00   0.50   0.091   0.001

WAM(1)        500.50    55.00    20.00   (e.g., for A: 1.0 x 0.5 + 1,000 x 0.5 = 500.50)
WAM(2)         91.91    18.19    20.00
WAM(3)          2.00    10.09    20.00
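The table's weighted means can be reproduced with a short sketch (names are ours):

```python
# Weighted arithmetic mean of execution times: sum of W_j x Time_j.
def wam(times, weights):
    return sum(w * t for w, t in zip(weights, times))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
w1 = [0.5, 0.5]        # W(1): P1 and P2 run equally often
w3 = [0.999, 0.001]    # W(3): workload dominated by P1

print([round(wam(times[m], w1), 2) for m in "ABC"])  # [500.5, 55.0, 20.0]
print([round(wam(times[m], w3), 2) for m in "ABC"])  # [2.0, 10.09, 20.0]
```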
Normalized Execution Time
Raw times:  A: P1 = 1.00, P2 = 1,000.00;  B: P1 = 10.00, P2 = 100.00;  C: P1 = 20.00, P2 = 20.00

Geometric Mean = (Π(i=1..n) Execution time ratio_i)^(1/n)

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0

Arithmetic mean  1.0    5.05  10.01    5.05  1.0   1.1     25.03   2.75  1.0
Geometric mean   1.0    1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
Total time       1.0    0.11   0.04    9.1   1.0   0.36    25.03   2.75  1.0
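The geometric mean's independence from the reference machine can be checked numerically (a sketch; the names are ours):

```python
def geometric_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

# Whatever the reference machine, the ratio of geometric means is unchanged.
for ref in "ABC":
    norm = {m: [t / r for t, r in zip(times[m], times[ref])] for m in times}
    # B and A look identical to the geometric mean under every normalization
    print(ref, round(geometric_mean(norm["B"]) / geometric_mean(norm["A"]), 3))
```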
Disadvantages of Arithmetic Mean
Performance varies depending on the reference machine:

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0
Arithmetic mean  1.0    5.05  10.01    5.05  1.0   1.1     25.03   2.75  1.0

Normalized to A: B is 5 times slower than A; C is slowest
Normalized to B: A is 5 times slower than B
Normalized to C: C is fastest
The Pros and Cons of Geometric Means
• Independent of running times of the individual programs
• Independent of the reference machine
• Do not predict execution time
  – the geometric means say A and B perform the same; that is only true when P1 runs 100 times for every occurrence of P2:
    1(P1) x 100 + 1,000(P2) x 1 = 10(P1) x 100 + 100(P2) x 1

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0
Geometric mean   1.0    1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
Amdahl's Law
Speedup due to enhancement E:

  Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  ExTime(E) = ExTime x ((1 - F) + F/S)

  Speedup(E) = 1 / ((1 - F) + F/S)
Amdahl's Law
  ExTime_E = ExTime x ((1 - Fraction_E) + Fraction_E / Speedup_E)

  Speedup = ExTime / ExTime_E
          = 1 / ((1 - Fraction_E) + Fraction_E / Speedup_E)
          = 1 / ((1 - F) + F/S)
Amdahl's Law
Floating-point instructions are improved to run 2 times faster (a 100% improvement), but only 10% of executed instructions are FP:

  Speedup = 1 / ((1 - F) + F/S)
          = 1 / ((1 - 0.1) + 0.1/2)
          = 1 / 0.95
          = 1.053    (5.3% improvement)
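The formula above is easy to sketch (the function name is ours):

```python
# Amdahl's Law: fraction f of the task is sped up by factor s.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Slide's example: 10% of instructions made 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```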
Corollary (Amdahl): Make the Common Case Fast
• All instructions require an instruction fetch; only a fraction require a data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality (spatial locality and temporal locality)
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Reg's - Cache - Memory - Disk/Tape
Locality of Access
Spatial Locality: There is a high probability that data items whose addresses differ by a small amount will be accessed close together in time.

Temporal Locality: There is a high probability that recently referenced data will be referenced again in the near future.
Rule of Thumb

• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware (faster) and be sure the rest can be handled correctly in software
Metrics of Performance
[Diagram: metrics at each level of the system]
  Application                                  Answers per month, Operations per second
  Programming Language / Compiler
  ISA                                          MIPS (millions of instructions per second), MFLOP/s (millions of FP operations per second)
  Datapath / Control                           Megabytes per second
  Function Units / Transistors, Wires, Pins    Cycles per second (clock rate)
Aspects of CPU Performance
  CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
Marketing Metrics

MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)
• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Not correlated with performance

MFLOP/s = FP Operations / (Time x 10^6)
• Machine dependent
• Often not where time is spent
Normalized FLOPs:
  add, sub, compare, mult   1
  divide, sqrt              4
  exp, sin, ...             8
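A sketch of how such a normalized MFLOP/s rate might be computed (the weights come from the table above; the function name and example counts are ours):

```python
# Normalized FLOP weights from the slide's table.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
                "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, time_seconds):
    # op_counts: dynamic count of each FP operation (hypothetical example input)
    flops = sum(FLOP_WEIGHTS[op] * n for op, n in op_counts.items())
    return flops / (time_seconds * 1e6)

print(normalized_mflops({"add": 5_000_000, "divide": 1_000_000}, 1.0))  # 9.0
```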
Cycles Per Instruction
Average cycles per instruction:

  CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count

  CPU time = Cycle Time x Σ(i=1..n) CPI_i x I_i

  CPI = Σ(i=1..n) CPI_i x F_i,  where F_i = I_i / Instruction Count  (instruction frequency)

Invest resources where time is spent!
Organizational Trade-offs
[Diagram: the Application, Programming Language, and Compiler determine the Instruction Mix; the ISA, Datapath, and Control determine CPI; the Function Units, Transistors, Wires, and Pins determine Cycle Time]
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix:

Op       Freq   CPI(i)   CPI    (% Time)
ALU      50%    1        0.5    (33%)
Load     20%    2        0.4    (27%)
Store    10%    2        0.2    (13%)
Branch   20%    2        0.4    (27%)
Total                    1.5
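The table's totals follow directly from CPI = Σ F_i x CPI_i; a quick sketch (variable names are ours):

```python
# Instruction mix: class -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 3))  # 1.5

# The "% Time" column: each class's share of total cycles.
for op, (freq, cycles) in mix.items():
    print(op, round(100 * freq * cycles / cpi))  # ALU 33, Load 27, Store 13, Branch 27
```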
Example
Base Machine (Reg / Reg), Typical Mix:

Op       Freq_i   CPI_i
ALU      50%      1
Load     20%      2
Store    10%      2
Branch   20%      2

Add register/memory operations (R/M):
  – One source operand in memory, one source operand in register
  – Cycle count of 2
  – Some Load instructions can be eliminated by the R/M-type ADD instruction [ADD R1, X]
  – Branch cycle count increases to 3

What fraction of the loads must be eliminated for this to pay off?
Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op       Freq_i   CPI_i   CPI
ALU      .50      1       .5
Load     .20      2       .4
Store    .10      2       .2
Branch   .20      2       .4
Total    1.00             1.5
Example Solution
Exec Time = Instr Cnt x CPI x Clock

CPI_NEW must be normalized to the new instruction count:

           ------ Old ------       ------ New ------
Op         Freq_i  CPI_i  CPI      Freq_i   CPI_i  CPI
ALU        .50     1      .5       .5 - X   1      .5 - X
Load       .20     2      .4       .2 - X   2      .4 - 2X
Store      .10     2      .2       .1       2      .2
Branch     .20     2      .4       .2       3      .6
Reg/Mem                            X        2      2X
Total      1.00           1.5      1 - X           (1.7 - X)/(1 - X)
Example Solution

Exec Time = Instr Cnt x CPI x Clock
  Instr Cnt_Old x CPI_Old x Clock = Instr Cnt_New x CPI_New x Clock
  1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
  1.5 = 1.7 - X
  X = 0.2

All LOADs must be eliminated for this to be a win!

           ------ Old ------       ------ New ------
Op         Freq   Cycles  CPI      Freq     Cycles  CPI
ALU        .50    1       .5       .5 - X   1       .5 - X
Load       .20    2       .4       .2 - X   2       .4 - 2X
Store      .10    2       .2       .1       2       .2
Branch     .20    2       .4       .2       3       .6
Reg/Mem                            X        2       2X
Total      1.00           1.5      1 - X            (1.7 - X)/(1 - X)
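The break-even algebra can be checked numerically (a sketch; the function name is ours):

```python
# Relative exec time with fraction X of the instruction stream converted to R/M:
# total cycles per original instruction, to compare against the old 1.5.
def new_rel_time(x):
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2  # = 1.7 - x

print(round(new_rel_time(0.2), 3))  # 1.5: break-even only when all loads (20%) go
print(round(new_rel_time(0.1), 3))  # 1.6: still slower than the old 1.5
```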
Fallacies and Pitfalls
• Fallacy: MIPS is an accurate measure for comparing performance among computers
  – dependent on the instruction set
  – varies between programs on the same computer
  – can vary inversely with performance
• Fallacy: MFLOPS is a consistent and useful measure of performance
  – dependent on the machine and on the program
  – not applicable outside floating-point code
  – the set of floating-point operations is not consistent across machines