lecture4 performance evaluation 2011 (2)

Upload: bakaasama

Post on 06-Apr-2018


  • 8/3/2019 Lecture4 Performance Evaluation 2011 (2)


    ELEC2300 Computer Organization

    Lecture 4: Performance Evaluation

    Professor George Yuan

    Office: Rm. 2527

    Email: [email protected]

    Note: some of the slides are adapted from Computer Organization and Design, Copyright 1998 Morgan Kaufmann Publishers, and from the notes of Prof. Patterson's CS152 class, Copyright 1997 UCB.

  • Slide 2

    ELEC152 Computer Organization Fall 2011

    OUTLINE

    What is computer performance?

    How to evaluate the performance?

  • Slide 3

    Which of these airplanes has the best performance?

    Airplane            Passengers   Range (mi)   Speed (mph)
    Boeing 737-100             101          630           598
    Boeing 747                 470         4150           610
    BAC/Sud Concorde           132         4000          1350
    Douglas DC-8-50            146         8720           544

    Time to perform the task (execution time): response time, latency

    Tasks per day, hour, week, second, ns, ...: throughput, bandwidth

    Latency and throughput are often in opposition.

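    Four types of airplanes fly between Hong Kong and Shanghai (distance: D mi). For an airplane with speed S (mph) and capacity C (passengers):

    Latency (flight time): $L = \dfrac{D}{S}$

    Throughput (passengers carried per hour): $\dfrac{C}{L} = \dfrac{C \cdot S}{D}$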

  • Slide 4

    Example

    Execution time of Concorde vs. 747:

    Concorde is 1350 mph / 610 mph = 2.2 times faster.

    Throughput of Concorde vs. 747:

    The 747 is 286,700 pmph / 178,200 pmph = 1.6 times faster (470*610 = 286,700; 132*1350 = 178,200; pmph = passenger-miles per hour).

    Conclusions: Concorde is 2.2 times faster in terms of flying time; the 747 is 1.6 times faster in terms of throughput.
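    The two ratios can be checked with a short Python sketch built from the airplane table. The route length D is a hypothetical value: it cancels out of both ratios, so any positive distance gives the same answer. The helper names are illustrative only.

```python
# Airplane data from the table: (passengers C, range in miles, speed S in mph)
AIRPLANES = {
    "Boeing 737-100": (101, 630, 598),
    "Boeing 747": (470, 4150, 610),
    "BAC/Sud Concorde": (132, 4000, 1350),
    "Douglas DC-8-50": (146, 8720, 544),
}

def flight_time_hours(distance_mi, speed_mph):
    """Latency L = D / S."""
    return distance_mi / speed_mph

def throughput_pmph(passengers, speed_mph):
    """Passenger-miles per hour, C * S (the numerator of C * S / D)."""
    return passengers * speed_mph

concorde = AIRPLANES["BAC/Sud Concorde"]
b747 = AIRPLANES["Boeing 747"]

D = 1000  # hypothetical route length in miles; cancels out of both ratios
latency_ratio = flight_time_hours(D, b747[2]) / flight_time_hours(D, concorde[2])
throughput_ratio = throughput_pmph(b747[0], b747[2]) / throughput_pmph(concorde[0], concorde[2])

print(f"Concorde flight-time advantage: {latency_ratio:.1f}x")  # 2.2x
print(f"747 throughput advantage: {throughput_ratio:.1f}x")     # 1.6x
```

    The same metric names the Concorde as faster and the 747 as faster, depending on whether latency or throughput is measured.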

  • Slide 5

    Execution Time vs. Throughput

    Execution time

    How long does it take for my job to run?

    How long does it take to execute a job?

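    How long must I wait for the database query?

    Throughput:

    How many tasks can the machine run at once?

    What is the average execution rate?

    How much work is getting done?

    Computer upgrade:

    1. P3 -> P4: a faster processor shortens execution time and also raises throughput.

    2. 1 P3 -> 2 P3: a second processor raises throughput but does not shorten the execution time of a single job.

    We will focus primarily on execution time for a single job.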

  • Slide 6

    Definitions

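    For computer study, performance is the reciprocal of execution time:

    $\text{performance}_X = \dfrac{1}{\text{execution time}_X}$

    "X is n times faster than Y" means

    $n = \dfrac{\text{performance}_X}{\text{performance}_Y} = \dfrac{\text{execution time}_Y}{\text{execution time}_X}$

    Problem: machine A runs a program in 20 seconds (1 program / 20 s); machine B runs the same program in 25 seconds (1 program / 25 s). How much faster is A than B?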
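    A minimal Python sketch of the definition, applied to the A/B problem. The function names are illustrative, not from the lecture.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s

def times_faster(time_x_s, time_y_s):
    """n in "X is n times faster than Y": perf_X / perf_Y = time_Y / time_X."""
    return performance(time_x_s) / performance(time_y_s)

n = times_faster(20, 25)  # machine A: 20 s, machine B: 25 s
print(f"A is {n:.2f} times faster than B")  # A is 1.25 times faster than B
```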

  • Slide 7

    Execution Time

    Elapsed time (response time): counts everything (disk and memory accesses, I/O, etc.). A useful number, but often not good for comparison purposes.

    CPU time: does not count I/O or time spent running other programs; it can be broken up into system time and user time.

    Our focus: user CPU time, the time spent executing the lines of code that are "in" our program.

    System CPU time: time the CPU spends executing system (kernel) code in order to run your program, such as reading files or moving information into and out of virtual memory.

    $\text{performance}_X = \dfrac{1}{\text{user CPU time}_X}$
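    The difference between elapsed time and user/system CPU time can be observed from Python with `time.perf_counter()` (wall clock) and `os.times()` (per-process user and system CPU time). This is a sketch under assumptions: the workload is a hypothetical job that computes for a while and then sleeps, standing in for waiting on I/O.

```python
import os
import time

def measure(fn):
    """Return (elapsed, user CPU, system CPU) seconds spent running fn()."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    fn()
    cpu_end = os.times()
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu_end.user - cpu_start.user, cpu_end.system - cpu_start.system

def workload():
    # Hypothetical job: some computation (user CPU time) followed by a wait
    # that counts toward elapsed time but not toward CPU time.
    total = sum(i * i for i in range(200_000))
    time.sleep(0.2)
    return total

elapsed, user, system = measure(workload)
print(f"elapsed={elapsed:.3f}s user={user:.3f}s system={system:.3f}s")
```

    The elapsed time includes the 0.2 s sleep, while the user and system CPU times do not, which is why elapsed time is a poor basis for comparing processors.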

  • Slide 8

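    CPU Time Measurement: Clock Cycles

    Instead of reporting execution time in seconds, we often use cycles.

    The processor runs machine instructions paced by a clock:

    clock cycle time = duration of one clock cycle (seconds per cycle)

    clock rate (frequency) = cycles per second (1 Hz = 1 cycle/s)

    A 200 MHz clock has a cycle time of 1 / (200 * 10^6 Hz) = 5 ns.

    $\text{time} = \dfrac{\text{seconds}}{\text{program}} = \dfrac{\text{cycles}}{\text{program}} \times \dfrac{\text{seconds}}{\text{cycle}}$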
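    A small Python sketch of the cycle-time relation. The 2 * 10^9 cycle count is a hypothetical value used only to show seconds/program = cycles/program * seconds/cycle.

```python
def cycle_time_s(clock_rate_hz):
    """Clock cycle time is the reciprocal of the clock rate (seconds per cycle)."""
    return 1.0 / clock_rate_hz

def cpu_time_s(cycles, clock_rate_hz):
    """seconds/program = cycles/program * seconds/cycle."""
    return cycles * cycle_time_s(clock_rate_hz)

t = cycle_time_s(200e6)
print(f"200 MHz -> cycle time {t * 1e9:.1f} ns")           # 5.0 ns
print(f"2e9 cycles -> {cpu_time_s(2e9, 200e6):.1f} s")     # 10.0 s (hypothetical count)
```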

  • Slide 9

    Relating the Metrics

    CPU time for a program:

    CPU time = CPU clock cycles * clock cycle time = CPU clock cycles / clock rate

    Common ways to improve performance (i.e., shorten CPU execution time):

    Reduce the number of CPU clock cycles the program requires.

    Shorten the clock cycle time (i.e., increase the clock rate).

  • Slide 10

    Example - Problem

    Description: A program takes 10 seconds to run on a 400 MHz machine (computer A). We want to design a faster machine (computer B) that can run the same program in 6 seconds. The increase in clock rate affects the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the program.

    Problem to solve: What clock rate should machine B have?

  • Slide 11

    Example - Answer
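    A sketch of one way to compute the answer, using CPU clock cycles = CPU time * clock rate for machine A and clock rate = cycles / CPU time for machine B. The function name is illustrative.

```python
def required_clock_rate(time_a_s, rate_a_hz, time_b_s, cycle_factor):
    cycles_a = time_a_s * rate_a_hz     # CPU clock cycles = CPU time * clock rate
    cycles_b = cycle_factor * cycles_a  # B needs 1.2 times as many cycles
    return cycles_b / time_b_s          # clock rate = cycles / CPU time

rate_b = required_clock_rate(10, 400e6, 6, 1.2)
print(f"{rate_b / 1e6:.0f} MHz")  # 800 MHz
```

    Machine A needs 10 s * 400 MHz = 4 * 10^9 cycles, so machine B needs 4.8 * 10^9 cycles in 6 s, i.e. an 800 MHz clock.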

  • Slide 12

    Cycle Number Calculation

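    CPU time for a program:

    CPU time = CPU clock cycles * clock cycle time = CPU clock cycles / clock rate

    [Figure: program -> (compiler) -> assembly program -> (assembler) -> machine instructions (ISA) -> processor]

    The compiler (targeting the ISA) determines the instruction count (Instruction #); the processor determines the clock cycles per instruction (CPI).

    Cycle # = Instruction # * CPI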

  • Slide 13

    Cycles Per Instruction

    Wrong assumption: # of CPU clock cycles in a program = # of instructions in the program.

    Actual situation: on some processors, some instructions take more cycles than others, e.g.:

    multiplication takes more cycles than addition;

    floating-point operations take more cycles than integer operations;

    memory accesses take more cycles than register accesses.

    Conclusion: not all instructions require the same number of cycles to execute.

    Cycles per instruction (CPI): the average number of clock cycles that each instruction in a program takes to execute.

  • Slide 14

    Cycles Per Instruction (CPI)

    Definition (for a given program):

    CPI = (CPU clock cycles)/(instruction count)

    A program has the same instruction count on two different implementations of the same instruction set architecture, but it may have different CPIs, because an instruction may require different numbers of clock cycles on different implementations.

    If the number of clock cycles for a program is known, knowing either the instruction count or the CPI determines the other.

    CPI provides a measure for comparing implementations.

    Instruction count can be measured using software tools or simulators.

  • Slide 15

    Cycles Per Instruction

    Let there be n different instruction classes (with different CPIs). For a given program, suppose we know:

    $\text{CPI}_i$ = CPI for instruction class i

    $C_i$ = number of instructions of class i

    CPU clock cycles = CPI * instruction count. This can be generalized to

    $\text{CPU clock cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times C_i)$ and $\text{CPI} = \dfrac{\sum_{i=1}^{n} (\text{CPI}_i \times C_i)}{\sum_{i=1}^{n} C_i}$
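    The two sums translate directly into Python. The instruction mix below (class names, per-class CPIs and counts) is hypothetical and only illustrates the weighted average.

```python
def cpu_clock_cycles(cpi_by_class, count_by_class):
    """CPU clock cycles = sum over classes i of CPI_i * C_i."""
    return sum(cpi_by_class[c] * n for c, n in count_by_class.items())

def average_cpi(cpi_by_class, count_by_class):
    """CPI = (sum of CPI_i * C_i) / (sum of C_i)."""
    return cpu_clock_cycles(cpi_by_class, count_by_class) / sum(count_by_class.values())

# Hypothetical mix: ALU ops take 1 cycle, loads 2, multiplies 4.
cpi = {"alu": 1, "load": 2, "mul": 4}
counts = {"alu": 600, "load": 300, "mul": 100}
print(cpu_clock_cycles(cpi, counts))  # 1600
print(average_cpi(cpi, counts))       # 1.6
```

    The average CPI is weighted by how often each class occurs, so a rare slow instruction raises the CPI less than a common one.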

  • Slide 16

    CPI Example

    Suppose we have two implementations of the same instruction set architecture (ISA).

    For some program, machine A has a clock cycle time of 1 ns (1 GHz) and a CPI of 2.0. Machine B has a clock cycle time of 2 ns (500 MHz) and a CPI of 1.2. Which machine is faster for this program, and by how much?

    If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?

  • Slide 17

    Example - Solution
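    A Python sketch of the comparison using CPU time = instruction count * CPI * clock cycle time. The instruction count IC is a hypothetical value; because both machines run the same program on the same ISA, it cancels out of the ratio.

```python
def cpu_time_s(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_s

IC = 1_000_000  # hypothetical; identical on both machines, so it cancels
time_a = cpu_time_s(IC, 2.0, 1e-9)  # machine A: 1 ns cycle, CPI 2.0
time_b = cpu_time_s(IC, 1.2, 2e-9)  # machine B: 2 ns cycle, CPI 1.2
print(f"A is {time_b / time_a:.1f} times faster than B")  # A is 1.2 times faster than B
```

    Per instruction, A takes 2.0 * 1 ns = 2 ns and B takes 1.2 * 2 ns = 2.4 ns. Of the listed quantities, only the instruction count is fixed by a shared ISA (assuming the same compiled program); clock rate, CPI, execution time and MIPS can all differ.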

  • Slide 18

    Relating the metrics

    For a given program X running on a machine A:

    The only complete and reliable measure is CPU execution time. Other measures are unreliable. E.g., changing the instruction set to lower the instruction count may lead to a larger CPI or to an organization with a slower clock rate; either can offset the improvement in instruction count.

    $\text{Time} = \dfrac{\text{seconds}}{\text{program}} = \dfrac{\#\text{ of instructions}}{\text{program}} \times \dfrac{\#\text{ of clocks}}{\#\text{ of instructions}} \times \dfrac{\text{seconds}}{\text{clock}}$

    = instruction count * CPI * clock cycle time

    = instruction count * CPI / clock rate
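    To make the "offset" argument concrete, here is a sketch with hypothetical numbers: a design change that removes 20% of the instructions but raises the CPI from 1.5 to 2.0, at an unchanged 1 GHz clock.

```python
def cpu_time_s(instruction_count, cpi, clock_rate_hz):
    """Time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical design change: 20% fewer instructions, but a higher CPI.
before = cpu_time_s(1.0e9, 1.5, 1e9)
after = cpu_time_s(0.8e9, 2.0, 1e9)
print(f"before: {before:.2f} s, after: {after:.2f} s")  # before: 1.50 s, after: 1.60 s
```

    Judged by instruction count alone the change looks like a win; the full product shows the program actually runs slower.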

  • Slide 19

    Example Comparing Code Segments

    Description

    A particular machine has the following hardware facts:

    Instruction class   CPI for this instruction class
    A                   1
    B                   2
    C                   3

    For a given C++ statement, a compiler designer considers two code sequences with the following instruction counts:

    Code sequence   Instruction counts for instruction classes
                    A    B    C
    1               2    1    2
    2               4    1    1

    Problem to solve

    Which code sequence executes the most instructions? Which is faster? What is the CPI for each sequence?

  • Slide 20

    Example - Answer
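    A Python sketch that evaluates both code sequences with the weighted-CPI formula, using the two tables from the problem. The helper name is illustrative.

```python
CPI = {"A": 1, "B": 2, "C": 3}  # CPI for each instruction class
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}

def analyze(counts):
    """Return (instruction count, CPU clock cycles, CPI) for one code sequence."""
    instructions = sum(counts.values())
    cycles = sum(CPI[c] * n for c, n in counts.items())
    return instructions, cycles, cycles / instructions

for seq, counts in SEQUENCES.items():
    ic, cycles, cpi = analyze(counts)
    print(f"sequence {seq}: {ic} instructions, {cycles} cycles, CPI = {cpi:.1f}")
```

    Sequence 1 executes 5 instructions in 10 cycles (CPI 2.0); sequence 2 executes more instructions (6) but needs only 9 cycles (CPI 1.5), so on the same clock sequence 2 is faster.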

  • Slide 21

    A misleading measure - MIPS

    There are some performance measures that are famous among computer manufacturers and sellers but are misleading!

    MIPS (million instructions per second) ("meaningless indication of processor speed")

    MIPS = (instruction count) / (execution time * 10^6)

    MIPS depends on:

    the instruction set (instructions have different capabilities);

    the program.

    MIPS can vary inversely with performance.

    Peak performance

  • Slide 22

    Some Processors in MIPS

    Processor                           MIPS @ clock rate       Year
    Motorola 68000                      1 MIPS @ 8 MHz          1979
    Intel 386DX                         8.5 MIPS @ 25 MHz       1988
    Intel 486DX                         54 MIPS @ 66 MHz        1992
    PowerPC G2                          35 MIPS @ 33 MHz        1994
    Intel Pentium Pro                   541 MIPS @ 200 MHz      1996
    ARM 7500FE                          35.9 MIPS @ 40 MHz      1996
    PowerPC G3                          525 MIPS @ 233 MHz      1997
    Zilog eZ80                          80 MIPS @ 50 MHz        1999
    Intel Pentium III                   1354 MIPS @ 500 MHz     1999
    AMD Athlon                          3561 MIPS @ 1.2 GHz     2000
    Pentium 4                           9726 MIPS @ 3.2 GHz     2003
    ARM Cortex A8                       2000 MIPS @ 1.0 GHz     2005
    Xbox360 IBM Xenon Triple Core       6400 MIPS @ 3.2 GHz     2005
    AMD Athlon 64 3800+ X2 (Dual Core)  14564 MIPS @ 2.0 GHz    2005
    Intel Core2 Extreme QX6700          57063 MIPS @ 3.33 GHz   2006

  • Slide 23

    Another misleading measure - MFLOPS

    MFLOPS (million floating-point operations per second):

    MFLOPS = (# of floating-point operations) / (execution time * 10^6)

    MFLOPS considers only floating-point operations (addition, subtraction, multiplication, or division applied to a number in a single- or double-precision floating-point representation).

    MFLOPS depends on:

    the floating-point operation (e.g., addition and multiplication differ in complexity);

    the program: MFLOPS is meaningless if there is little or no floating-point arithmetic.

  • Slide 24

    MIPS example

    Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

    The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions. The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

    What are the execution times for each sequence?

    What is the MIPS index for this processor based on the two test sequences?
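    A Python sketch of the calculation, applying execution time = cycles / clock rate and MIPS = instruction count / (execution time * 10^6) to each compiler's output. The dictionary layout is an illustrative choice.

```python
CLOCK_RATE_HZ = 100e6           # 100 MHz machine
CPI = {"A": 1, "B": 2, "C": 3}  # cycles per instruction for each class

COMPILERS = {
    "compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
    "compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
}

def evaluate(counts):
    """Return (execution time in seconds, MIPS) for one compiler's instruction counts."""
    instructions = sum(counts.values())
    cycles = sum(CPI[c] * n for c, n in counts.items())
    time_s = cycles / CLOCK_RATE_HZ
    mips = instructions / (time_s * 1e6)
    return time_s, mips

for name, counts in COMPILERS.items():
    time_s, mips = evaluate(counts)
    print(f"{name}: {time_s:.2f} s, {mips:.0f} MIPS")
```

    Compiler 1 needs 10 million cycles (0.10 s, 70 MIPS); compiler 2 needs 15 million cycles (0.15 s, 80 MIPS). The slower code earns the higher MIPS rating, which is exactly how MIPS can vary inversely with performance.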

  • Slide 25

    Summary

    Some related terminology:

    clock, clock cycle, cycle

    clock cycle time, cycle time (seconds, µs, ns)

    clock rate, cycle rate (Hz, MHz)

    CPI (cycles per instruction)

    MIPS (millions of instructions per second)

    Performance is determined by the execution time.

    Execution time calculation:

    Execution Time = instruction count * CPI * clock cycle time = instruction count * CPI / clock rate

  • Slide 26

    OUTLINE

    What is computer performance?

    How to evaluate the performance?

  • Slide 27

    Benchmarks

    Execution time calculation:

    Execution Time = instruction count * CPI * clock cycle time = instruction count * CPI / clock rate

    Benchmark: a set of specially designed programs to test the performance of a computer.

    Performance is best determined by running a real application.

    Benchmarks are application specific: CPU performance, graphics, high-performance computing, object-oriented computing, Java applications, client-server models, mail systems, file systems, Web servers.

    SPEC (System Performance Evaluation Cooperative):

    companies have agreed on a set of real programs and inputs;

    a valuable indicator of computer performance;

    evaluates the processor (ISA implementation) + compiler combination.

  • Slide 28

    SPEC 89

    Compiler enhancements and performance

    [Figure: SPEC performance ratio (0 to 800) for the SPEC89 benchmarks tomcatv, fpppp, matrix300, eqntott, li, nasa7, doduc, spice, espresso, and gcc, comparing the baseline compiler with an enhanced compiler]

  • Slide 29

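    SPEC CPU2000

    SPEC ratio: execution time on the reference machine (a Sun Ultra 5_10 with a 300 MHz processor) divided by execution time on the measured machine

    Suites: CINT2000 (integer), CFP2000 (floating point)

    Overall score: geometric mean of the SPEC ratios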
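    A sketch of how per-benchmark SPEC ratios combine into one score via the geometric mean, using `statistics.geometric_mean` (Python 3.8+). The benchmark names and all times below are hypothetical; they are not actual SPEC CPU2000 reference times or results.

```python
from statistics import geometric_mean

def spec_ratio(reference_time_s, measured_time_s):
    """SPEC ratio: reference-machine time divided by measured time (higher is better)."""
    return reference_time_s / measured_time_s

# Hypothetical (reference time, measured time) pairs for three benchmarks.
results = {
    "bench_a": (1400, 700),
    "bench_b": (1600, 200),
    "bench_c": (1200, 300),
}
ratios = [spec_ratio(ref, measured) for ref, measured in results.values()]
print(ratios)                              # [2.0, 8.0, 4.0]
print(f"{geometric_mean(ratios):.2f}")     # 4.00
```

    The geometric mean is used because ratios are normalized to a reference machine; an arithmetic mean of ratios can rank machines differently depending on which reference is chosen.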

  • Slide 30

    SPEC CPU2000 Benchmarks

  • Slide 31

    SPEC CPU2000 ratings

  • Slide 32

    Amdahl's Law

    Execution time after improvement = execution time unaffected + (execution time affected / amount of improvement)

    Example: "Suppose a program runs in 100 seconds on a machine, with multiplication responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

    How about making the program 5 times faster?

    Principle: make the common case fast.
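    A Python sketch that solves the multiplication example by rearranging the formula for the amount of improvement n: unaffected + affected / n = total / target speedup. The function names are illustrative.

```python
def improved_time(unaffected_s, affected_s, improvement):
    """Execution time after improvement = unaffected + affected / improvement."""
    return unaffected_s + affected_s / improvement

def required_improvement(total_s, affected_s, target_speedup):
    """Solve unaffected + affected / n = total / target_speedup for n.

    Returns None when the target is unreachable: the unaffected part alone
    already takes at least total / target_speedup seconds.
    """
    unaffected = total_s - affected_s
    budget = total_s / target_speedup - unaffected  # time left for the affected part
    if budget <= 0:
        return None
    return affected_s / budget

print(required_improvement(100, 80, 4))  # 16.0
print(required_improvement(100, 80, 5))  # None: 20 s of non-multiply time remains
```

    Running 4 times faster means 25 s = 20 s + 80 s / n, so multiplication must become 16 times faster. Running 5 times faster means 20 s, which the 20 s of unaffected time alone already consumes, so no amount of multiplication speedup suffices.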

  • Slide 33

    Example

    Suppose we enhance a machine, making all floating-point instructions five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?

    We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program to yield our desired speedup on this benchmark?
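    Both questions follow from Amdahl's Law. This sketch computes the speedup for the first benchmark and solves (total - F) + F / 5 = total / 3 for the floating-point time F in the second. The helper names are illustrative.

```python
def speedup(total_s, fraction_enhanced, enhancement):
    """Overall speedup when fraction_enhanced of the time runs enhancement times faster."""
    new_time = total_s * (1 - fraction_enhanced) + total_s * fraction_enhanced / enhancement
    return total_s / new_time

def fp_time_needed(total_s, target_speedup, enhancement):
    """Seconds F of floating-point time needed so that
    (total - F) + F / enhancement = total / target_speedup."""
    target_time = total_s / target_speedup
    return (total_s - target_time) / (1 - 1 / enhancement)

print(f"{speedup(10, 0.5, 5):.2f}")        # 1.67  (10 s -> 5 s + 1 s = 6 s)
print(f"{fp_time_needed(100, 3, 5):.2f}")  # 83.33 s of the 100 s must be floating point
```

    Even with a 5 times faster floating-point unit, the first benchmark gains only about 1.67 times, and a speedup of 3 requires roughly 83% of the run time to be floating point.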

  • Slide 34

    Remember

    Performance is specific to a particular program; total execution time is a consistent summary of performance.

    For a given architecture, performance increases come from:

    increases in clock rate (without adverse CPI effects);

    improvements in processor organization that lower CPI;

    compiler enhancements that lower CPI and/or instruction count.

    Pitfall: expecting an improvement in one aspect of a machine's performance to improve the total performance proportionally.

    You should not always believe everything you read! Read carefully!