Performance CS510 Computer Architectures Lecture 3 - 1
Lecture 3
Benchmarks and Performance Metrics
Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, Delay, Area, Power Estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line: Performance (and Cost)
• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns ... (Performance)
  – Throughput, bandwidth

Plane              Speed      Time (DC-Paris)   Passengers   Throughput (pmph)
Boeing 747         610 mph    6.5 hours         470          286,700
BAC/Sud Concorde   1350 mph   3.0 hours         132          178,200
The Bottom Line: Performance (and Cost)
"X is n times faster than Y" means:

  n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
Performance Terminology
"X is n% faster than Y" means:

  n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)

  ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100
Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

  ExTime(Y) / ExTime(X) = 15 / 10 = 1.5 / 1.0 = Performance(X) / Performance(Y)

  n = 100 x (1.5 - 1.0) / 1.0

  n = 50%
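The two ratios above can be checked with a short sketch (the function name is ours, not the slide's):

```python
def percent_faster(extime_y, extime_x):
    """n such that "X is n% faster than Y": 100 x (Perf(X) - Perf(Y)) / Perf(Y)."""
    perf_ratio = extime_y / extime_x  # = Performance(X) / Performance(Y)
    return 100 * (perf_ratio - 1)

# Slide's example: Y takes 15 s, X takes 10 s.
print(percent_faster(15, 10))  # 50.0
```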
Programs to Evaluate Processor Performance
• (Toy) Benchmarks
  – 10~100-line programs
  – e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
  – e.g., Livermore loops
• Real programs
  – e.g., gcc, spice
Benchmarking Games
• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance
Common Benchmarking Mistakes
• Only average behavior represented in test workload
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• "Benchmark Engineering"
  – particular optimizations
  – different compilers or preprocessors
  – runtime libraries
SPEC: System Performance Evaluation Cooperative
• First Round, 1989
  – 10 programs yielding a single number
• Second Round, 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Reference machine: VAX-11/780
• Third Round, 1995
  – Single flag setting for all programs; new set of programs ("benchmarks useful for 3 years")
  – Reference machine: SPARCstation 10 Model 40
SPEC First Round
• One program spent 99% of its time in a single line of code
• A new front-end compiler could improve its result dramatically
[Bar chart: SPECperf (0 to 800) per benchmark for gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
How to Summarize Performance
• Arithmetic Mean (weighted arithmetic mean)
  – tracks execution time: (1/n) x Σ Ti, or Σ Wi x Ti
• Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)
  – tracks execution time: n / Σ (1/Ri), or 1 / Σ (Wi/Ri)
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution times; use the geometric mean: (Π Ri)^(1/n), where Ri = 1/Ti
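A small sketch of the three summary statistics above (helper names are ours); the example numbers reuse computer A's times from the comparison table later in the lecture:

```python
def arithmetic_mean(times):
    # (1/n) x sum of Ti -- tracks total execution time for an equal mix
    return sum(times) / len(times)

def harmonic_mean(rates):
    # n / sum(1/Ri) -- the right mean for rates such as MFLOPS
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(xs):
    # (product of xs)^(1/n) -- the right mean for normalized ratios
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

times = [1.0, 1000.0]             # P1, P2 on computer A
rates = [1.0 / t for t in times]  # tasks per second
print(arithmetic_mean(times))     # 500.5
print(harmonic_mean(rates))       # ~0.002 (2 tasks in 1001 secs)
```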
Comparing and Summarizing Performance
For program P1, A is 10 times faster than B; for program P2, B is 10 times faster than A; and so on...

The relative performance of the computers is unclear from total execution times:

                    Computer A   Computer B   Computer C
P1 (secs)                    1           10           20
P2 (secs)                1,000          100           20
Total time (secs)        1,001          110           40
Summary Measure
Arithmetic Mean (good if programs are run equally in the workload):

  (1/n) x Σ(i=1..n) Execution Time_i

Harmonic Mean (when performance is expressed as rates):

  n / Σ(i=1..n) (1 / Rate_i),  where Rate_i = 1 / Execution Time_i
Unequal Job Mix
Weighted Arithmetic Mean:  Σ(i=1..n) Weight_i x Execution Time_i

Weighted Harmonic Mean:    1 / Σ(i=1..n) (Weight_i / Rate_i)

Relative Performance
• Normalized execution time relative to a reference machine, summarized with the Arithmetic Mean or Geometric Mean of the n Execution Time Ratio_i values (normalized to the reference machine)
• Weighted execution time
Weighted Arithmetic Mean
WAM(i) = Σ(j=1..n) W(i)_j x Time_j

             A          B        C       W(1)   W(2)    W(3)
P1 (secs)       1.00    10.00    20.00   0.50   0.909   0.999
P2 (secs)   1,000.00   100.00    20.00   0.50   0.091   0.001

WAM(1)        500.50    55.00    20.00   (e.g., for A: 1.0 x 0.5 + 1,000 x 0.5 = 500.50)
WAM(2)         91.91    18.19    20.00
WAM(3)          2.00    10.09    20.00
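The table's weighted means can be reproduced with a short sketch (names are ours):

```python
# Weighted arithmetic mean of execution times: sum of W_j x Time_j.
def wam(times, weights):
    return sum(w * t for w, t in zip(weights, times))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}
w1 = [0.5, 0.5]        # W(1): P1 and P2 run equally often
w3 = [0.999, 0.001]    # W(3): workload dominated by P1

print([round(wam(times[m], w1), 2) for m in "ABC"])  # [500.5, 55.0, 20.0]
print([round(wam(times[m], w3), 2) for m in "ABC"])  # [2.0, 10.09, 20.0]
```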
Normalized Execution Time
Raw times:  A: P1 = 1.00, P2 = 1,000.00;  B: P1 = 10.00, P2 = 100.00;  C: P1 = 20.00, P2 = 20.00

Geometric Mean = (Π(i=1..n) Execution time ratio_i)^(1/n)

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0

Arithmetic mean  1.0    5.05  10.01    5.05  1.0   1.1     25.03   2.75  1.0
Geometric mean   1.0    1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
Total time       1.0    0.11   0.04    9.1   1.0   0.36    25.03   2.75  1.0
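The geometric mean's independence from the reference machine can be checked numerically (a sketch; the names are ours):

```python
def geometric_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

# Whatever the reference machine, the ratio of geometric means is unchanged.
for ref in "ABC":
    norm = {m: [t / r for t, r in zip(times[m], times[ref])] for m in times}
    # B and A look identical to the geometric mean under every normalization
    print(ref, round(geometric_mean(norm["B"]) / geometric_mean(norm["A"]), 3))
```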
Disadvantages of Arithmetic Mean
Performance varies depending on the reference machine:

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0
Arithmetic mean  1.0    5.05  10.01    5.05  1.0   1.1     25.03   2.75  1.0

Normalized to A: B is 5 times slower than A; C is slowest
Normalized to B: A is 5 times slower than B
Normalized to C: C is fastest
The Pros and Cons of Geometric Means
• Independent of running times of the individual programs
• Independent of the reference machine
• Do not predict execution time
  – the geometric means say A and B perform the same; that is only true when P1 runs 100 times for every occurrence of P2:
    1(P1) x 100 + 1,000(P2) x 1 = 10(P1) x 100 + 100(P2) x 1

                 Normalized to A       Normalized to B      Normalized to C
                  A     B      C       A      B     C       A      B     C
P1               1.0   10.0   20.0     0.1   1.0   2.0      0.05   0.5   1.0
P2               1.0    0.1    0.02   10.0   1.0   0.2     50.0    5.0   1.0
Geometric mean   1.0    1.0    0.63    1.0   1.0   0.63     1.58   1.58  1.0
Amdahl's Law
Speedup due to enhancement E:

  Speedup(E) = (ExTime w/o E) / (ExTime w/ E) = (Performance w/ E) / (Performance w/o E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

  ExTime(E) = ExTime x ((1 - F) + F/S)

  Speedup(E) = 1 / ((1 - F) + F/S)
Amdahl's Law
  ExTime_E = ExTime x ((1 - Fraction_E) + Fraction_E / Speedup_E)

  Speedup = ExTime / ExTime_E
          = 1 / ((1 - Fraction_E) + Fraction_E / Speedup_E)
          = 1 / ((1 - F) + F/S)
Amdahl's Law
Floating-point instructions are improved to run 2 times faster (a 100% improvement), but only 10% of executed instructions are FP:

  Speedup = 1 / ((1 - F) + F/S)
          = 1 / ((1 - 0.1) + 0.1/2)
          = 1 / 0.95
          = 1.053    (5.3% improvement)
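The formula above is easy to sketch (the function name is ours):

```python
# Amdahl's Law: fraction f of the task is sped up by factor s.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Slide's example: 10% of instructions made 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))  # 1.053
```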
Corollary (Amdahl): Make the Common Case Fast
• All instructions require an instruction fetch; only a fraction require a data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality (spatial locality and temporal locality)
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: Reg's - Cache - Memory - Disk/Tape
Locality of Access
Spatial Locality: There is a high probability that data items whose addresses differ by a small amount will be accessed close together in time.

Temporal Locality: There is a high probability that recently referenced data will be referenced again in the near future.
Rule of Thumb

• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware (faster) and be sure the rest can be handled correctly in software
Metrics of Performance
[Diagram: metrics at each level of the system]
  Application                                  Answers per month, Operations per second
  Programming Language / Compiler
  ISA                                          MIPS (millions of instructions per second), MFLOP/s (millions of FP operations per second)
  Datapath / Control                           Megabytes per second
  Function Units / Transistors, Wires, Pins    Cycles per second (clock rate)
Aspects of CPU Performance
  CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

               Inst Count   CPI   Clock Rate
Program            X
Compiler           X        (X)
Inst. Set          X         X
Organization                 X        X
Technology                            X
Marketing Metrics

MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)
• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Not correlated with performance

MFLOP/s = FP Operations / (Time x 10^6)
• Machine dependent
• Often not where time is spent
Normalized FLOPs:
  add, sub, compare, mult   1
  divide, sqrt              4
  exp, sin, ...             8
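A sketch of how such a normalized MFLOP/s rate might be computed (the weights come from the table above; the function name and example counts are ours):

```python
# Normalized FLOP weights from the slide's table.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
                "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, time_seconds):
    # op_counts: dynamic count of each FP operation (hypothetical example input)
    flops = sum(FLOP_WEIGHTS[op] * n for op, n in op_counts.items())
    return flops / (time_seconds * 1e6)

print(normalized_mflops({"add": 5_000_000, "divide": 1_000_000}, 1.0))  # 9.0
```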
Cycles Per Instruction
Average cycles per instruction:

  CPI = (CPU Time x Clock Rate) / Instruction Count = Cycles / Instruction Count

  CPU time = Cycle Time x Σ(i=1..n) CPI_i x I_i

  CPI = Σ(i=1..n) CPI_i x F_i,  where F_i = I_i / Instruction Count  (instruction frequency)

Invest resources where time is spent!
Organizational Trade-offs
[Diagram: the Application, Programming Language, and Compiler determine the Instruction Mix; the ISA, Datapath, and Control determine CPI; the Function Units, Transistors, Wires, and Pins determine Cycle Time]
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix:

Op       Freq   CPI(i)   CPI    (% Time)
ALU      50%    1        0.5    (33%)
Load     20%    2        0.4    (27%)
Store    10%    2        0.2    (13%)
Branch   20%    2        0.4    (27%)
Total                    1.5
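The table's totals follow directly from CPI = Σ F_i x CPI_i; a quick sketch (variable names are ours):

```python
# Instruction mix: class -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 3))  # 1.5

# The "% Time" column: each class's share of total cycles.
for op, (freq, cycles) in mix.items():
    print(op, round(100 * freq * cycles / cpi))  # ALU 33, Load 27, Store 13, Branch 27
```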
Example
Base Machine (Reg / Reg), Typical Mix:

Op       Freq_i   CPI_i
ALU      50%      1
Load     20%      2
Store    10%      2
Branch   20%      2

Add register/memory operations (R/M):
  – One source operand in memory, one source operand in register
  – Cycle count of 2
  – Some Load instructions can be eliminated by the R/M-type ADD instruction [ADD R1, X]
  – Branch cycle count increases to 3

What fraction of the loads must be eliminated for this to pay off?
Example Solution

Exec Time = Instr Cnt x CPI x Clock

Op       Freq_i   CPI_i   CPI
ALU      .50      1       .5
Load     .20      2       .4
Store    .10      2       .2
Branch   .20      2       .4
Total    1.00             1.5
Example Solution
Exec Time = Instr Cnt x CPI x Clock

CPI_NEW must be normalized to the new instruction count:

           ------ Old ------       ------ New ------
Op         Freq_i  CPI_i  CPI      Freq_i   CPI_i  CPI
ALU        .50     1      .5       .5 - X   1      .5 - X
Load       .20     2      .4       .2 - X   2      .4 - 2X
Store      .10     2      .2       .1       2      .2
Branch     .20     2      .4       .2       3      .6
Reg/Mem                            X        2      2X
Total      1.00           1.5      1 - X           (1.7 - X)/(1 - X)
Example Solution

Exec Time = Instr Cnt x CPI x Clock
  Instr Cnt_Old x CPI_Old x Clock = Instr Cnt_New x CPI_New x Clock
  1.00 x 1.5 = (1 - X) x (1.7 - X)/(1 - X)
  1.5 = 1.7 - X
  X = 0.2

All LOADs must be eliminated for this to be a win!

           ------ Old ------       ------ New ------
Op         Freq   Cycles  CPI      Freq     Cycles  CPI
ALU        .50    1       .5       .5 - X   1       .5 - X
Load       .20    2       .4       .2 - X   2       .4 - 2X
Store      .10    2       .2       .1       2       .2
Branch     .20    2       .4       .2       3       .6
Reg/Mem                            X        2       2X
Total      1.00           1.5      1 - X            (1.7 - X)/(1 - X)
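The break-even algebra can be checked numerically (a sketch; the function name is ours):

```python
# Relative exec time with fraction X of the instruction stream converted to R/M:
# total cycles per original instruction, to compare against the old 1.5.
def new_rel_time(x):
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2  # = 1.7 - x

print(round(new_rel_time(0.2), 3))  # 1.5: break-even only when all loads (20%) go
print(round(new_rel_time(0.1), 3))  # 1.6: still slower than the old 1.5
```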
Fallacies and Pitfalls
• Fallacy: MIPS is an accurate measure for comparing performance among computers
  – dependent on the instruction set
  – varies between programs on the same computer
  – can vary inversely with performance
• Fallacy: MFLOPS is a consistent and useful measure of performance
  – dependent on the machine and on the program
  – not applicable outside floating-point code
  – the set of floating-point operations is not consistent across machines