lecture4 performance evaluation 2011 (2)
8/3/2019 Lecture4 Performance Evaluation 2011 (2)
ELEC2300 Computer Organization
Lecture 4: Performance Evaluation
Professor George Yuan
Office: Rm. 2527
Email: [email protected]
Note: some of the slides are adapted from Computer Organization and Design, Copyright 1998 Morgan Kaufmann Publishers, and from the notes of Prof. Patterson's CS152 class, Copyright 1997 UCB.
-
ELEC152 Computer Organization Fall 2011 Page 2
OUTLINE
What is computer performance?
How to evaluate the performance?
-
Which of these airplanes has the best performance?
Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100      101          630          598
Boeing 747          470          4150         610
BAC/Sud Concorde    132          4000         1350
Douglas DC-8-50     146          8720         544
Time to perform the task (execution time): execution time, response time, latency
Tasks per day, hour, week, sec, ns, ...: throughput, bandwidth
Latency and throughput often are in opposition
4 types of airplanes fly between Hong Kong & Shanghai (distance: D mi.; S: speed, C: passengers)
Latency: $L = \frac{D}{S}$
Throughput: $T = C \times \frac{1}{L} = \frac{C \times S}{D}$
-
Example
Execution time of Concorde vs. 747:
Concorde is 1350 mph / 610 mph = 2.2 times faster
Throughput of Concorde vs. 747:
Boeing is 286700 pmph / 178200 pmph = 1.6 times faster (470*610=286700, 132*1350=178200)
Conclusions:
Concorde is 2.2 times faster in terms of flying time.
747 is 1.6 times faster in terms of throughput.
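The two ratios above can be reproduced with a short sketch. It assumes, as the slide does, that throughput is measured in passenger-miles per hour (passengers times speed); the names `PLANES`, `throughput_pmph`, `speed_ratio` and `throughput_ratio` are illustrative only.

```python
# Airplane data from the table: (passengers, range in miles, speed in mph).
PLANES = {
    "Boeing 737-100": (101, 630, 598),
    "Boeing 747": (470, 4150, 610),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}


def throughput_pmph(name):
    """Passenger-miles per hour: passenger capacity times speed."""
    passengers, _range_mi, speed_mph = PLANES[name]
    return passengers * speed_mph


def speed_ratio(plane_x, plane_y):
    """How many times faster plane_x flies than plane_y."""
    return PLANES[plane_x][2] / PLANES[plane_y][2]


def throughput_ratio(plane_x, plane_y):
    """How many times more passenger-miles per hour plane_x delivers."""
    return throughput_pmph(plane_x) / throughput_pmph(plane_y)


print(round(speed_ratio("BAC/Sud Concorde", "Boeing 747"), 1))       # 2.2
print(round(throughput_ratio("Boeing 747", "BAC/Sud Concorde"), 1))  # 1.6
```

Which plane "wins" depends on whether latency or throughput is the metric being compared.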
-
Execution Time vs. Throughput
Execution time:
How long does it take for my job to run?
How long does it take to execute a job?
How long must I wait for the database query?
Throughput:
How many tasks can the machine run at once?
What is the average execution rate?
How much work is getting done?
Computer upgrade:
1. P3 -> P4
2. 1 P3 -> 2 P3
We will focus primarily on execution time for a single job.
-
Definitions
For computer study,
$\text{performance}_X = \frac{1}{\text{execution time}_X}$
"X is n times faster than Y" means
$n = \frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X}$
Problem:
machine A runs a program in 20 seconds (1 program/20 sec)
machine B runs the same program in 25 seconds (1 program/25 sec)
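The question for this problem is not preserved in the transcript. Assuming it asks how much faster machine A is than machine B, the definition above gives the following sketch (`times_faster` is an illustrative name):

```python
def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y.

    n = performance_X / performance_Y = execution_time_Y / execution_time_X
    """
    return exec_time_y / exec_time_x


# Machine A: 20 seconds, machine B: 25 seconds (from the slide).
print(times_faster(20.0, 25.0))  # 1.25
```

Under that assumption, A is 1.25 times faster than B.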
-
Execution Time
Elapsed time or response time
count everything (disk and memory accesses, I/O, etc.)
a useful number, but often not good for comparison purposes
CPU time
Does not count I/O or time spent running other programs
can be broken up into system time and user time
Our focus: user CPU time
time spent executing the lines of code that are "in" our program
System CPU time: time the CPU spends executing system (kernel) code in order to run your program, such as reading files, moving information into and out of virtual memory, etc.
$\text{performance}_X = \frac{1}{\text{user CPU time}_X}$
-
CPU Time Measurement: Clock Cycles
Instead of reporting execution time in seconds, we often use cycles:
$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$
Processor runs machine instructions based on the clock
clock cycle time = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 200 MHz clock has a cycle time of 1 / (200 * 10^6) s = 5 ns
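As a minimal sketch of the relation between clock rate, cycle time and CPU time (the function names are illustrative, not from the slides):

```python
def clock_cycle_time(clock_rate_hz):
    """Seconds per cycle is the reciprocal of cycles per second."""
    return 1.0 / clock_rate_hz


def cpu_time(cycles, clock_rate_hz):
    """seconds/program = cycles/program * seconds/cycle."""
    return cycles * clock_cycle_time(clock_rate_hz)


# Cycle time of a 200 MHz clock, expressed in nanoseconds.
print(round(clock_cycle_time(200e6) * 1e9, 3))  # 5.0
```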
-
Relating the Metrics
CPU time for a program
CPU time = CPU clock cycles * clock cycle time
         = CPU clock cycles / clock rate
Common ways to improve performance (i.e. shorten CPU execution time):
Reduce number of required CPU clock cycles for a program
Shorten clock cycle time (i.e. increase clock rate)
-
Example-Problem
Description:
A program takes 10 seconds to run on a 400 MHz machine (computer A). We want to design a faster machine (computer B) that can run the same program in 6 seconds.
The increase in clock rate affects the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the program.
Problem to solve:
What clock rate should machine B have?
-
Example - Answer
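The worked answer on this slide is not preserved in the transcript. One way to reach it from CPU time = CPU clock cycles / clock rate is sketched below; `required_clock_rate` and its parameter names are illustrative.

```python
def required_clock_rate(time_a_s, rate_a_hz, cycle_factor, target_time_s):
    """Clock rate machine B needs to finish the program in target_time_s.

    cycles_A = time_A * rate_A
    cycles_B = cycle_factor * cycles_A
    rate_B   = cycles_B / time_B
    """
    cycles_a = time_a_s * rate_a_hz
    cycles_b = cycle_factor * cycles_a
    return cycles_b / target_time_s


rate_b = required_clock_rate(10.0, 400e6, 1.2, 6.0)
print(round(rate_b / 1e6, 3))  # 800.0 (MHz)
```

Machine B therefore needs roughly twice the clock rate of machine A to cut the run time from 10 to 6 seconds.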
-
Cycle Number Calculation
CPU time for a program
CPU time = CPU clock cycles * clock cycle time
         = CPU clock cycles / clock rate
[Figure: program -> compiler -> assembly program -> assembler -> machine instructions -> processor; labels: ISA, compiler: Instruction #, processor: clock cycles/instruction (CPI)]
Cycle # = Instruction # * CPI
-
Cycles Per Instruction
Wrong assumption:
# of CPU clock cycles in a program = # of instructions in the program
Actual situation:
For some processors, some instructions may take more cycles than others:
E.g. multiplication takes more cycles than addition
Floating-point operations take more cycles than integer operations
Memory access takes more cycles than accessing registers
Conclusion: not all instructions require the same # of cycles to execute.
Cycles per instruction (CPI): the average number of clock cycles that each instruction in a program takes to execute.
-
Cycles Per Instruction (CPI)
Definition (for a given program):
CPI = (CPU clock cycles)/(instruction count)
A program has the same instruction count on two different implementations of the same instruction set architecture, but it may have different CPIs (because an instruction may require different numbers of clock cycles on different implementations).
If the number of clock cycles for a program is known, knowing either the instruction count or the CPI can determine the other.
CPI provides a measure for comparing implementations.
Instruction count can be measured using software tools or simulators.
-
Cycles Per Instruction
Let there be n different instruction classes (with different CPIs). For a given program, suppose we know:
CPI_i = CPI for instruction class i
C_i = # of instructions of class i
CPU clock cycles = CPI * instruction count. It can be generalized to
$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
and
$CPI = \sum_{i=1}^{n} (CPI_i \times C_i) \Big/ \sum_{i=1}^{n} C_i$
-
CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA).
For some program, machine A has a clock cycle time of 1 ns (1 GHz) and a CPI of 2.0. Machine B has a clock cycle time of 2 ns (500 MHz) and a CPI of 1.2. Which machine is faster for this program, and by how much?
If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
-
Example - Solution
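The solution slide is not preserved in the transcript. Since both machines implement the same ISA, the same program has the same instruction count I on both, so it is enough to compare the time per instruction, CPI times clock cycle time. The sketch below assumes that reasoning; `time_per_instruction_ns` is an illustrative name.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction: CPI * clock cycle time."""
    return cpi * cycle_time_ns


# Values from the CPI example slide.
time_a = time_per_instruction_ns(cpi=2.0, cycle_time_ns=1.0)  # 2.0 ns * I
time_b = time_per_instruction_ns(cpi=1.2, cycle_time_ns=2.0)  # 2.4 ns * I

# The instruction count I cancels in the ratio of execution times.
print(round(time_b / time_a, 2))  # 1.2, so A is 1.2 times faster than B
```

Of the listed quantities, only the number of instructions is fixed by the ISA; clock rate, CPI, execution time and MIPS depend on the implementation.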
-
Relating the metrics
For a given program X running on a machine A
$\text{Time} = \frac{\text{seconds}}{\text{program}} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of clocks}}{\text{\# of instructions}} \times \frac{\text{seconds}}{\text{clock}}$
= instruction count * CPI * clock cycle time
= instruction count * CPI / clock rate
The only complete and reliable measure is CPU execution time.
Other measures are unreliable. E.g. changing the instruction set to lower the instruction count may lead to a larger CPI or an organization with a slower clock rate. Either case can offset the improvement in instruction count.
-
Example Comparing Code Segments
Description
A particular machine has the following hardware facts:
Instruction class   CPI for this instruction class
A                   1
B                   2
C                   3
For a given C++ statement, a compiler designer considers two code sequences with the following instruction counts:
Code sequence   Instruction counts for instruction classes
                A   B   C
1               2   1   2
2               4   1   1
Problem to solve
Which code sequence executes the most instructions? Which is faster? What is the CPI for each sequence?
-
Example - Answer
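The worked answer is not preserved in the transcript. Applying the per-class formulas from the Cycles Per Instruction slide (clock cycles = sum of CPI_i * C_i, CPI = clock cycles / instruction count) gives the sketch below; the dictionary and function names are illustrative.

```python
# CPI per instruction class and instruction counts, taken from the example.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def instruction_count(counts):
    """Total number of instructions in a code sequence."""
    return sum(counts.values())


def clock_cycles(counts):
    """Sum over classes of CPI_i * C_i."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts):
    """Clock cycles divided by instruction count."""
    return clock_cycles(counts) / instruction_count(counts)


for seq_id, counts in SEQUENCES.items():
    print(seq_id, instruction_count(counts), clock_cycles(counts), average_cpi(counts))
# 1 5 10 2.0
# 2 6 9 1.5
```

Sequence 2 executes more instructions (6 vs. 5) but needs fewer cycles (9 vs. 10), so it is the faster one on this machine.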
-
A misleading measure - MIPS
There are some performance measures that are famous among computer manufacturers and sellers but are misleading!
MIPS (million instructions per second) (meaningless indication of processor speed)
MIPS = (instruction count) / (execution time * 10^6)
MIPS depends on
Instruction set (instructions have different capabilities)
Program
MIPS can vary inversely with performance
Peak performance
-
Some Processors in MIPS
Processor                            IPS                    Year
Motorola 68000                       1MIPS @ 8MHz           1979
Intel 386DX                          8.5MIPS @ 25MHz        1988
Intel 486DX                          54MIPS @ 66MHz         1992
PowerPC G2                           35MIPS @ 33MHz         1994
Intel Pentium Pro                    541MIPS @ 200MHz       1996
ARM 7500FE                           35.9MIPS @ 40MHz       1996
PowerPC G3                           525MIPS @ 233MHz       1997
Zilog eZ80                           80MIPS @ 50MHz         1999
Intel Pentium III                    1354MIPS @ 500MHz      1999
AMD Athlon                           3561MIPS @ 1.2GHz      2000
Pentium 4                            9726MIPS @ 3.2GHz      2003
ARM Cortex A8                        2000MIPS @ 1.0GHz      2005
Xbox360 IBM Xenon Triple Core        6400MIPS @ 3.2GHz      2005
AMD Athlon 64 3800+ X2 (Dual Core)   14564MIPS @ 2.0GHz     2005
Intel Core2 Extreme QX6700           57063MIPS @ 3.33GHz    2006
-
Another misleading measure - MFLOPS
MFLOPS (million floating-point operations per second):
MFLOPS = (# of floating-point operations) / (execution time * 10^6)
MFLOPS considers only floating-point operations (addition, subtraction, multiplication, or division operation applied to a number in a single or double precision floating-point representation).
MFLOPS depends on:
Floating-point operation (e.g., addition and multiplication differ in complexity)
Program
Meaningless if there is little or no floating-point arithmetic.
-
MIPS example
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
What are the execution times for each sequence?
What is the MIPS rating of this processor for each of the two test sequences?
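The two questions can be worked through with the formulas from the earlier slides (execution time = clock cycles / clock rate, MIPS = instruction count / (execution time * 10^6)). The sketch below is one way to do it; the constant and function names are illustrative.

```python
CLOCK_RATE_HZ = 100e6                      # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts produced by each compiler (from the slide).
COMPILER_1 = {"A": 5e6, "B": 1e6, "C": 1e6}
COMPILER_2 = {"A": 10e6, "B": 1e6, "C": 1e6}


def execution_time(counts):
    """Execution time in seconds: total clock cycles / clock rate."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, counts in (("compiler 1", COMPILER_1), ("compiler 2", COMPILER_2)):
    print(name, round(execution_time(counts), 3), round(mips(counts), 2))
# compiler 1 0.1 70.0
# compiler 2 0.15 80.0
```

The second compiler's code earns the higher MIPS rating (80 vs. 70) yet takes longer to run (0.15 s vs. 0.10 s), which illustrates how MIPS can vary inversely with performance.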
-
Summary
Some related terminology:
clock, clock cycle, cycle
clock cycle time, cycle time (seconds, us, ns)
clock rate, cycle rate (Hz, MHz)
CPI (cycles per instruction)
MIPS (millions of instructions per second)
Performance is determined by the execution time
Execution time calculation:
Execution Time = instruction count * CPI * clock cycle time
               = instruction count * CPI / clock rate
-
OUTLINE
What is computer performance?
How to evaluate the performance?
-
Benchmarks
Execution time calculation:
Execution Time = instruction count * CPI * clock cycle time
               = instruction count * CPI / clock rate
Benchmark: a set of specially designed programs to test the performance of a computer
Performance best determined by running a real application
Benchmarks are application specific
CPU performance, graphics, high-performance computing, object-oriented computing, Java applications, client-server models, mail systems, file systems, Web servers.
SPEC (System Performance Evaluation Cooperative)
companies have agreed on a set of real programs and inputs
valuable indicator of computer performance
Processor (ISA implementation) + compiler
-
SPEC 89
Compiler enhancements and performance
[Figure: bar chart of SPEC performance ratio (0 to 800) per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv; two series: compiler and enhanced compiler]
-
SPEC CPU2000
SPEC ratio
Reference: Sun Ultra 5_10 with a 300 MHz processor
CINT2000, CFP2000
Geometric mean of SPEC ratios
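A minimal sketch of how a geometric mean of SPEC ratios can be computed, assuming a SPEC ratio is the reference machine's run time divided by the measured run time. The run times below are hypothetical and chosen only to make the arithmetic easy to follow; they are not SPEC results, and any constant scale factor applied to published scores is left out.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Ratio of the reference machine's run time to the measured run time."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios):
    """n-th root of the product of n ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (reference time, measured time) pairs in seconds.
hypothetical_runs = [(1400.0, 700.0), (1800.0, 225.0)]
ratios = [spec_ratio(ref, t) for ref, t in hypothetical_runs]

print(ratios)                            # [2.0, 8.0]
print(round(geometric_mean(ratios), 6))  # 4.0
```

Unlike the arithmetic mean, the geometric mean gives the same relative ranking of two machines regardless of which machine is used as the reference.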
-
SPEC CPU2000 Benchmarks
-
SPEC CPU2000 ratings
-
Amdahl's Law
Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
Example:
"Suppose a program runs in 100 seconds on a machine, with multiplication responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making the program 5 times faster?
Principle: Make the common case fast
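The two questions can be answered by solving the formula above for the amount of improvement. The sketch below does this; the function names are illustrative, and returning `None` for an unreachable target is a choice made for this sketch.

```python
def time_after_improvement(unaffected_s, affected_s, improvement):
    """Amdahl's Law: unaffected time plus affected time / improvement."""
    return unaffected_s + affected_s / improvement


def required_improvement(total_s, affected_s, target_speedup):
    """Improvement factor needed on the affected part, or None if no
    finite improvement can reach the target speedup."""
    unaffected_s = total_s - affected_s
    target_time_s = total_s / target_speedup
    budget_s = target_time_s - unaffected_s  # time left for the affected part
    if budget_s <= 0:
        return None
    return affected_s / budget_s


print(required_improvement(100.0, 80.0, 4))  # 16.0
print(required_improvement(100.0, 80.0, 5))  # None
```

Running 4 times faster requires multiplication to become 16 times faster. Running 5 times faster would need the whole program to finish in 20 seconds, which is already the time taken by the unaffected part, so no speedup of multiplication alone can achieve it.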
-
8/3/2019 Lecture4 Performance Evaluation 2011 (2)
33/34
ELEC152 Computer Organization Fall 2011 Page 33
Example
Suppose we enhance a machine making all floating-point instructions five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?
We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
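Both questions follow from Amdahl's Law written in terms of the fraction f of time that is affected: speedup = 1 / ((1 - f) + f / improvement). The sketch below evaluates it and solves it for f; the function names are illustrative.

```python
def overall_speedup(fraction_affected, improvement):
    """Overall speedup when a fraction of the time is sped up by improvement."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / improvement)


def fraction_needed(target_speedup, improvement):
    """Solve 1/target = (1 - f) + f/improvement for the fraction f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / improvement)


# First question: half of the 10 seconds is floating point, made 5x faster.
print(round(overall_speedup(0.5, 5.0), 2))   # 1.67

# Second question: fraction of the 100 seconds needed for a speedup of 3.
f = fraction_needed(3.0, 5.0)
print(round(f, 3), round(f * 100.0, 1))      # 0.833 83.3
```

For the first benchmark the new time is 5 + 5/5 = 6 seconds. For the second, floating-point instructions would have to account for about 83.3 of the 100 seconds.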
-
Remember
Performance is specific to a particular program
Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
increases in clock rate (without adverse CPI effects)
improvements in processor organization that lower CPI
compiler enhancements that lower CPI and/or instruction count
Pitfall: expecting improvement in one aspect of a machine's performance to improve the total performance proportionally
You should not always believe everything you read! Read carefully!