Computer Organization and Architecture

KT6213 Computer Organization and Architecture
Lecture 1: Introduction


Page 1: Computer Organization  and Architecture

KT6213

Computer Organization and Architecture

Lecture 1: Introduction

Page 2: Computer Organization and Architecture

Architecture & Organization (Stallings)

Architecture is those attributes visible to the programmer
◦ Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques
◦ e.g. Is there a multiply instruction?

Organization is how those features are implemented
◦ Control signals, interfaces, memory technology
◦ e.g. Is there a hardware multiply unit, or is multiplication done by repeated addition?

Page 3: Computer Organization and Architecture

Computer Architecture Is …

“… the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”

Amdahl, Blaauw, and Brooks, 1964

Page 4: Computer Organization and Architecture

Cont.

Computer Architecture is the design of computers, including their instruction sets, hardware components, and system organization [Patterson].

Thus there are two essential parts of computer architecture:
◦ Instruction-set Architecture (ISA)
◦ Hardware-system Architecture (HSA)

[Figure: Computer Architecture (instruction set design, organization, hardware) at the intersection of Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (Inst. Set Arch.), Measurement & Evaluation, and Parallelism.]

Page 5: Computer Organization and Architecture

Instruction-set Architecture (ISA)

The instruction-set architecture of a computer includes anything a programmer would need to know to make the computer run correctly. This includes:
◦ (a) The number and types of registers
◦ (b) The instruction set (what operations can be performed?)
◦ (c) The instruction format (how are instructions specified?)
◦ (d) The addressing modes (how is data obtained? direct vs. indirect)
◦ (e) Exception handling (what happens when something goes wrong?)

The instruction-set architecture includes the specifications that determine how machine-language programs will interact with the computer. In general, two computers with the same ISA will run the same programs. This is the notion of a computer-family architecture.

Page 6: Computer Organization and Architecture

Hardware-system Architecture (HSA)

The hardware-system architecture deals with the computer's major hardware subsystems, including the central processing unit (CPU), its storage system, and its input-output system.

The computer hardware design determines the implementation of the various computer components. This includes:
◦ (a) Capabilities and performance of the functional units (e.g., registers, ALUs, shifters)
◦ (b) Methods for connecting the functional units (e.g., data bus)
◦ (c) Control logic for the functional units

Typically, the computer hardware is designed based on the instruction-set architecture.

Page 7: Computer Organization and Architecture

A successful ISA generally has many implementations (a computer family) which differ in their HSA.

Compatibility is the ability of different computers to run the same programs.
◦ Upward compatibility allows high-performance members of a family to run the same programs as the low-performance members.
◦ Downward compatibility is not always possible, since high-performance family members often have features not available on lower-performance members.

Page 8: Computer Organization and Architecture

Computer Family

A computer family is a set of implementations that share the same or a similar ISA (using a variety of technologies, memory sizes, and speeds). Examples: the IBM System/360 (1960s), the PDP-8 family (1965), the PDP-11 family (1965), and the IBM System/370 (1970s).

All members of the Intel x86 family share the same basic architecture.

The IBM System/370 family shares the same basic architecture.

This gives code compatibility
◦ At least backwards

Organization differs between different versions.

Page 9: Computer Organization and Architecture

Computer Evolution

Page 10: Computer Organization and Architecture

Historical Perspective

Page 11: Computer Organization and Architecture

Early Computing

1946: ENIAC, US Army, 18,000 vacuum tubes
1949: UNIVAC I, $250K, 48 systems sold
1954: IBM 701, core memory
1957: moving-head disk
1958: transistor; FORTRAN, ALGOL; CDC & DEC founded
1964: IBM 360, CDC 6600, DEC PDP-8
1969: UNIX
1970: floppy disk
1981: IBM PC, first successful portable (Osborne 1)
1986: Connection Machine, Max Headroom debut

Page 12: Computer Organization and Architecture

Underlying Technologies

[Eras, top to bottom: Generation, Evolutionary, Parallelism]

Year  Logic              Storage            Prog. Lang.       O/S
54    Tubes              core (8 ms)
58    Transistor (10µs)                     FORTRAN
60                                          ALGOL, COBOL      Batch
64    Hybrid (1µs)       thin film (200ns)  Lisp, APL, Basic
66    IC (100ns)                            PL/1, Simula, C
67                                                            Multiprog.
71    LSI (10ns)         1k DRAM            O.O.              V.M.
73    (8-bit µP)
75    (16-bit µP)        4k DRAM
78    VLSI (10ns)        16k DRAM                             Networks
80                       64k DRAM
84    (32-bit µP)        256k DRAM          ADA
87    ULSI               1M DRAM
89    GaAs               4M DRAM            C++
92    (64-bit µP)        16M DRAM           Fortran 90

Page 13: Computer Organization and Architecture

What has happened in the 1990s

“Network-Integrated Computing”
◦ Wide-area AND local-area integration of cluster-based computing and high-performance networks

Scalable technologies for computing, networking, and information systems
◦ Systems that scale DOWN as well as UP
◦ High-performance workstations
◦ Clusters and distributed systems
◦ Massively parallel I/O and computer servers
◦ National Information Infrastructure

Page 14: Computer Organization and Architecture

What has been predicted for the Late 1990s and Early 2000s

Technology
◦ Very large dynamic RAM: 64 Mbits and beyond
◦ Large, fast static RAM: 1 MB, 10 ns

Complete systems on a chip
◦ 10+ million transistors

Parallelism
◦ Superscalar, superpipelined, vector, multiprocessors
◦ Processor arrays

Page 15: Computer Organization and Architecture

What has been predicted for the Late 1990s and Early 2000s (cont.)

Low power
◦ 50% of PCs portable by 1995
◦ Performance per watt

Parallel I/O
◦ Many applications are I/O limited, not computation limited
◦ Computation is scaling, but memory and I/O bandwidth are not keeping pace

Multimedia
◦ New interface technologies
◦ Video, speech, handwriting, virtual reality, …

Page 16: Computer Organization and Architecture

Review of Technology Trends and Cost/Performance

Page 17: Computer Organization and Architecture

Original Food Chain Picture

Big Fishes Eating Little Fishes

Page 18: Computer Organization and Architecture

1988 Computer Food Chain

PC, Workstation, Minicomputer, Mainframe, Mini-supercomputer, Supercomputer, Massively Parallel Processors

Page 19: Computer Organization and Architecture

1998 Computer Food Chain

PC, Workstation, Server, Mainframe, Supercomputer, Minicomputer, Mini-supercomputer, Massively Parallel Processors

Now who is eating whom?

Page 20: Computer Organization and Architecture

Why Such Change in 10 Years?

Performance
◦ Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
◦ Computer architecture advances improve the low end: RISC, superscalar, RAID, …

Price: lower costs due to …
◦ Simpler development: CMOS VLSI means smaller systems and fewer components
◦ Higher volumes: CMOS VLSI spreads the same development cost over 10,000,000 units instead of 10,000

Function
◦ Rise of networking/local interconnection technology

Page 21: Computer Organization and Architecture

Moore’s Law

Gordon Moore, cofounder of Intel
Increased density of components on a chip
The number of transistors on a chip will double every year
Since the 1970s development has slowed a little
◦ The number of transistors doubles every 18 months
The cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increase reliability

Page 22: Computer Organization and Architecture

Performance Mismatch

Processor speed increased
Memory capacity increased
Memory speed lags behind processor speed

Page 23: Computer Organization and Architecture

DRAM and Processor Characteristics

Page 24: Computer Organization and Architecture

Trends in DRAM use

Page 25: Computer Organization and Architecture

Memory Capacity (Single-Chip DRAM)

[Chart: bits per DRAM chip vs. year, 1970–2000, log scale from 1,000 to 1,000,000,000 bits.]

Year  Size (Mb)  Cycle time
1980  0.0625     250 ns
1983  0.25       220 ns
1986  1          190 ns
1989  4          165 ns
1992  16         145 ns
1996  64         120 ns
2000  256        100 ns

Page 26: Computer Organization and Architecture

Throughput: the total amount of work done in a given time.

Latency (response time): the time between the start and the completion of an event.

Page 27: Computer Organization and Architecture

Performance milestones

Page 28: Computer Organization and Architecture

Technology Trends (Summary)

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

Page 29: Computer Organization and Architecture

Growth in CPU Transistor Count

Page 30: Computer Organization and Architecture

Technology Trends: Microprocessor Capacity

[Chart: transistors per chip vs. year, 1970–2000, log scale from 1,000 to 100,000,000, tracking Moore’s Law through the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium; a “Graduation Window” is marked on the curve.]

CMOS improvements:
• Die size: 2X every 3 years
• Line width: halved every 7 years

Transistor counts:
Alpha 21264: 15 million
Pentium Pro: 5.5 million
PowerPC 620: 6.9 million
Alpha 21164: 9.3 million
Sparc Ultra: 5.2 million

Page 31: Computer Organization and Architecture

Growth in Processor Performance

Page 32: Computer Organization and Architecture

Performance Trends (Summary)

Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)

Improvement in cost/performance is estimated at 70% per year

Page 33: Computer Organization and Architecture

Measurement and Evaluation

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: the design/analysis cycle. Creativity feeds new designs; cost/performance analysis separates good ideas from mediocre and bad ideas.]

Page 34: Computer Organization and Architecture

Computer Engineering Methodology

Technology trends

Page 35: Computer Organization and Architecture

Computer Engineering Methodology

Evaluate existing systems for bottlenecks

Inputs: technology trends, benchmarks

Page 36: Computer Organization and Architecture

Computer Engineering Methodology

Evaluate existing systems for bottlenecks
Simulate new designs and organizations

Inputs: technology trends, benchmarks, workloads

Page 37: Computer Organization and Architecture

Computer Engineering Methodology

Evaluate existing systems for bottlenecks
Simulate new designs and organizations
Implement the next-generation system

Inputs: technology trends, benchmarks, workloads, implementation complexity

Page 45: Computer Organization and Architecture

Summary: Price vs. Cost

[Charts: price broken down into component costs, direct costs, gross margin, and average discount (as a percentage of price), and price as a multiple of component cost, for minicomputers (Mini), workstations (W/S), and PCs. The ratios shown are 4.7, 3.8, and 1.8, and 3.5, 2.5, and 1.5, for Mini, W/S, and PC respectively.]

Page 46: Computer Organization and Architecture

System Performance

Page 48: Computer Organization and Architecture

Measuring and Reporting Performance

• Designing high performance computers is one of the major goals of any computer architect.

• As a result, assessing the performance of computer hardware is at the heart of computer design and greatly affects the demand and market value of the computer.

• However, measuring performance of a computer system is not a straightforward task:

Metrics – how do we describe the performance of a computer numerically?

What tools do we use to find those metrics? How do we summarize the performance?

Page 49: Computer Organization and Architecture

Measuring and Reporting Performance

What do we mean by one computer being faster than another?
◦ Its program runs in less time

Response time or execution time
◦ The time until users see the output

Throughput
◦ The total amount of work done in a given time

Page 50: Computer Organization and Architecture

Performance

“Increasing” and “decreasing”?

We use the terms “improve performance” and “improve execution time” when we mean increase performance and decrease execution time:

improve performance = increase performance
improve execution time = decrease execution time

Page 51: Computer Organization and Architecture

What is performance?

… how fast does this computer run MY program?

… is machine A faster than machine B, and if so, how much faster?

… one of the three factors driving architecture
◦ … effective use of new technology: should I use it to enhance the architecture, or to improve the performance of the existing architecture?
◦ … can a desired performance improvement be achieved by a given set of implementation or organization changes?

Page 52: Computer Organization and Architecture

Measuring Performance

Definition of time:

Wall clock time
Response time
Elapsed time
◦ The latency to complete a task, including disk accesses, memory accesses, I/O activities, and operating system overhead

Page 53: Computer Organization and Architecture

Measure Performance – user CPU time

• Response time may include disk access, memory access, input/output activities, CPU event and operating system overhead—everything.

• In order to get an accurate measure of performance, we use CPU time instead of using response time.

• CPU time is the time the CPU spends computing a program and does not include time spent waiting for I/O or running other programs.

• CPU time can also be divided into user CPU time (program) and system CPU time (OS).

• Keying in the UNIX command time, we get, for example:

90.7u 12.9s 2:39 65%  (user CPU, system CPU, total elapsed time, percentage of elapsed time spent as CPU time)

• In our performance measures, we use user CPU time, because of its independence from the OS and other factors.
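The same user/system/elapsed breakdown can be observed programmatically. A minimal sketch in Python (not from the slides; busy_wait is a made-up workload) using os.times() to snapshot CPU times:

```python
import os
import time

def busy_wait(n):
    # Pure computation, so it accumulates user CPU time.
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.time()
cpu_start = os.times()   # snapshot of user/system CPU time so far
busy_wait(2_000_000)
cpu_end = os.times()
wall_end = time.time()

user = cpu_end.user - cpu_start.user        # time spent running our own code
system = cpu_end.system - cpu_start.system  # time spent inside the OS on our behalf
elapsed = wall_end - wall_start
print(f"{user:.2f}u {system:.2f}s {elapsed:.2f} elapsed")
```

The output mirrors the shape of the UNIX time report above.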

Page 54: Computer Organization and Architecture

Performance Metrics

Page 55: Computer Organization and Architecture

Benchmarking: Performance Measure

Page 56: Computer Organization and Architecture

What is the computer user interested in?

Reducing the time to run a certain task
◦ Execution time (response time)
◦ The time between the start and the completion of an event

Increasing the tasks per week, day, hour, sec, ns …
◦ Throughput
◦ The total amount of work done in a given time

Page 57: Computer Organization and Architecture

Example

Do the following changes to a computer system increase throughput, reduce response time, or both?

1) Replacing the processor in a computer with a faster version
2) Adding additional processors to a system that uses multiple processors for separate tasks, for example handling an airline reservation system

Answer

1) Both response time and throughput are improved.
2) Only throughput increases.

Page 58: Computer Organization and Architecture

Performance Measure

• Time to run the task (ExTime): execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance): throughput, bandwidth

Plane       Speed     DC to Paris  Passengers  Throughput (pmph)
Boeing 747  610 mph   6.5 hours    470         286,700
Concorde    1350 mph  3 hours      132         178,200
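The throughput column is simply speed times passengers (passenger-miles per hour); a quick check:

```python
# Throughput in passenger-miles per hour (pmph) = speed (mph) * passengers.
planes = {
    "Boeing 747": (610, 470),   # speed in mph, passenger count
    "Concorde": (1350, 132),
}
for name, (mph, seats) in planes.items():
    print(f"{name}: {mph * seats:,} pmph")
```

Note that the Concorde wins on response time (3 hours vs. 6.5) while the 747 wins on throughput, illustrating that the two metrics can rank machines differently.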

Page 59: Computer Organization and Architecture

Execution Time

In this subject, we will primarily be interested in execution time as a measure of performance.

Performance and execution time on a computer X are reciprocals of one another:

    Performance_X = 1 / Execution time_X

To maximize the performance of an application, we need to minimize its execution time.

Page 60: Computer Organization and Architecture

Performance Comparison

To compare design alternatives, we use the following equation:

    n = Execution time_Y / Execution time_X = Performance_X / Performance_Y

• “X is n times faster than Y” or “the throughput of X is n times higher than Y” means that the execution time is n times smaller on X than on Y.
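A small sketch of this definition (the numbers are illustrative, not from the slides):

```python
def speedup(exec_time_y, exec_time_x):
    # n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
    return exec_time_y / exec_time_x

# If a program takes 15 s on machine Y and 10 s on machine X,
# X is 1.5 times faster than Y.
print(speedup(15.0, 10.0))  # 1.5
```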

Page 61: Computer Organization and Architecture

Metrics of Performance

[Figure: levels of a system and the performance metrics used at each level.]

Application: answers per month, operations per second
Programming language / Compiler / ISA: (millions of) instructions per second (MIPS), (millions of) floating-point operations per second (MFLOP/s)
Datapath, control / Function units: megabytes per second, cycles per second (clock rate)
Transistors, wires, pins

What is actually measured?

Page 62: Computer Organization and Architecture

Aspects of CPU Performance

              Inst Count  CPI  Clock Rate
Program           X
Compiler          X       (X)
Inst. Set         X        X
Organization               X       X
Technology                         X

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

Page 63: Computer Organization and Architecture

CPU Performance

All computers are constructed using a clock to operate their circuits, typically characterized by two basic metrics:
• Clock rate: today measured in MHz and GHz
• Clock cycle time: clock cycle time = 1 / clock rate
• E.g., a 1 GHz clock rate corresponds to a 1 ns cycle time

Thus the CPU time for a program is given by:

    CPU time = CPU clock cycles for a program × Clock cycle time

Or,

    CPU time = CPU clock cycles for a program / Clock rate

Page 64: Computer Organization and Architecture

CPU Performance

More typically, we count the number of instructions executed, known as the Instruction Count (IC).

CPI is the average number of clock cycles per instruction:

    CPI = CPU clock cycles for a program / IC

Hence, an alternative way to get the CPU time:

    CPU time = IC × CPI × Clock cycle time = IC × CPI / Clock rate

CPU performance is equally dependent upon three characteristics: clock cycle time, CPI, and IC. They are not independent: making one better often makes another worse, because the basic technologies involved in changing each characteristic are interdependent.

Clock cycle time: hardware technology and organization
CPI: organization and instruction set architecture
IC: instruction set architecture and compiler technology (depends on the IS of the computer and its compiler)
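A sketch of the CPU time equation with hypothetical numbers (not from the slides):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU time = IC x CPI / clock rate
    return instruction_count * cpi / clock_rate_hz

# 1 billion instructions at CPI 2.0 on a 1 GHz clock:
print(cpu_time(1e9, 2.0, 1e9))  # 2.0 seconds
```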

Page 65: Computer Organization and Architecture

CPU Performance

Example: Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 10 ns and a CPI of 2.0 for some program, and computer B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster for this program?

Answer

Assume the program requires IN instructions to be executed:

    CPU clock cycles_A = IN × 2.0
    CPU clock cycles_B = IN × 1.2
    CPU time_A = IN × 2.0 × 10 ns = 20 × IN ns
    CPU time_B = IN × 1.2 × 20 ns = 24 × IN ns

So computer A is faster than computer B.
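The comparison can be replayed numerically; the instruction count IN is arbitrary because it cancels out:

```python
IN = 1_000_000  # any positive instruction count; it cancels in the comparison

time_a = IN * 2.0 * 10  # CPI 2.0, 10 ns cycle -> time in ns
time_b = IN * 1.2 * 20  # CPI 1.2, 20 ns cycle -> time in ns
print(time_b / time_a)  # 1.2: A is 1.2 times faster than B
```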

Page 66: Computer Organization and Architecture

CPU Performance

Often the overall performance is easier to deal with on a per-instruction-class basis.

CPU clock cycles can be expressed as:

    CPU clock cycles = Σ (i = 1 to n) IC_i × CPI_i

where IC_i is the number of times instruction i is executed and CPI_i is the CPI for instruction i. Hence:

    CPU time = ( Σ (i = 1 to n) IC_i × CPI_i ) × Clock cycle time

The overall CPI can be expressed as:

    CPI = ( Σ (i = 1 to n) IC_i × CPI_i ) / Instruction count
        = Σ (i = 1 to n) (IC_i / Instruction count) × CPI_i
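The per-class summation above can be sketched as follows (the instruction mix is illustrative, not from the slides):

```python
def overall_cpi(classes):
    # classes: list of (instruction_count_i, cpi_i) pairs
    total_instructions = sum(ic for ic, _ in classes)
    total_cycles = sum(ic * cpi for ic, cpi in classes)
    return total_cycles / total_instructions

# e.g. 60 ALU instructions at CPI 1 and 40 memory instructions at CPI 2:
print(overall_cpi([(60, 1), (40, 2)]))  # 1.4
```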

Page 68: Computer Organization and Architecture

Cycles Per Instruction

Example: Suppose we have a machine where we can count the frequency with which instructions are executed. We also know how many cycles each instruction type takes.

Base Machine (Reg / Reg)
Instruction  Freq  CPI  (% Time)
ALU          50%   1    (33%)
Load         20%   2    (27%)
Store        10%   2    (13%)
Branch       20%   2    (27%)
(Total CPI 1.5)

How do we get the total CPI? How do we get the % time?
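Total CPI is the frequency-weighted sum of per-class CPIs, and each class's share of time is its cycles divided by the total cycles. Reproducing the table's numbers:

```python
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

total_cpi = sum(freq * cpi for freq, cpi in mix.values())
print(total_cpi)  # 1.5

for name, (freq, cpi) in mix.items():
    share = freq * cpi / total_cpi
    print(f"{name}: {share:.0%}")  # ALU 33%, Load 27%, Store 13%, Branch 27%
```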

Page 69: Computer Organization and Architecture

CPU Performance

Example: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Avg. CPI of FP operations = 4.0
Avg. CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20

Compare two designs: 1) decrease the CPI of FPSQR to 2; 2) decrease the avg. CPI of all FP operations to 2.5.

Answer

First, find the original CPI:

    CPI_o = Σ (IC_i / Instruction count) × CPI_i = 27% × 4 + 73% × 1.33 = 2.051

The CPI with design 1:

    CPI_1 = CPI_o − 2% × (CPI of FPSQR − improved CPI for FPSQR) = 2.051 − 2% × (20 − 2) = 1.691

The CPI with design 2:

    CPI_2 = 27% × 2.5 + 73% × 1.33 = 1.646

So design 2 is better: it yields the lower overall CPI.
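The arithmetic can be replayed in code; with these inputs, design 2 ends up with the lower (better) CPI:

```python
cpi_other = 1.33
cpi_original = 0.27 * 4.0 + 0.73 * cpi_other  # all FP at avg CPI 4, rest at 1.33
cpi_design1 = cpi_original - 0.02 * (20 - 2)  # FPSQR improved from CPI 20 to 2
cpi_design2 = 0.27 * 2.5 + 0.73 * cpi_other   # all FP improved to avg CPI 2.5
print(round(cpi_original, 3), round(cpi_design1, 3), round(cpi_design2, 3))
# 2.051, 1.691, 1.646: the lower CPI wins
```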

Page 70: Computer Organization and Architecture

MIPS: Millions of Instructions per Second

Page 71: Computer Organization and Architecture

Relative MIPS and Speedup

Page 72: Computer Organization and Architecture

CPU Performance Measure: how many millions of instructions per second (MIPS) a CPU can execute.
◦ MIPS: million instructions per second
◦ (a) MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
◦ (b) Example: a program that executes 3 million instructions in 2 seconds has a MIPS rating of 1.5
◦ (c) Advantages: easy to understand, easy to measure
◦ (d) Disadvantages:
  cannot accurately compare machines with different instruction sets (machines with powerful instruction sets are penalized)
  varies from program to program (programs with lots of simple instructions will have higher MIPS ratings)
  can vary inversely with performance
◦ (e) Different types:
  native: as defined in (a), for a given program
  peak: uses an instruction mix that minimizes CPI (may be very unrealistic)
  relative: compared to a reference machine
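A sketch of the two MIPS formulas in (a), including the example from (b); the 100 MHz case is an added illustration, not from the slides:

```python
def mips_from_time(instruction_count, exec_time_s):
    # MIPS = instruction count / (execution time x 10^6)
    return instruction_count / (exec_time_s * 1e6)

def mips_from_clock(clock_rate_hz, cpi):
    # MIPS = clock rate / (CPI x 10^6)
    return clock_rate_hz / (cpi * 1e6)

print(mips_from_time(3e6, 2.0))     # 1.5, matching the example in (b)
print(mips_from_clock(100e6, 2.0))  # a 100 MHz machine at CPI 2.0 -> 50 MIPS
```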

Page 73: Computer Organization and Architecture

MFLOPS: Focus on One Type of Work

Page 74: Computer Organization and Architecture

Normalized MFLOPS

Page 77: Computer Organization and Architecture

Some “Misleading” Performance Measures

There are certain computer performance measures that are popular with computer manufacturers and sellers but may be misleading.

MIPS (Million Instructions Per Second)
◦ MIPS depends on the instruction set, which makes it difficult to compare the MIPS of computers with different instruction sets.
◦ MIPS varies between programs on the same computer: different programs use different instruction mixes.
◦ Most importantly, MIPS can vary inversely to performance.

Page 78: Computer Organization and Architecture

Some “Misleading” Performance Measures (cont.)

MFLOPS (Million Floating-point Operations Per Second): focus on one type of work
◦ MFLOPS depends on the program; it must be FP-intensive.
◦ MFLOPS depends on the computer as well.
◦ Floating-point operations vary in complexity (e.g., add vs. divide).

Peak performance: performance that the manufacturer guarantees you won’t exceed
◦ The difference between peak performance and average performance is huge.
◦ Peak performance is not useful in predicting observed performance.

Page 79: Computer Organization and Architecture

Programs for Measuring Performance

• Real applications: text-processing software (Word), compilers (C), and other applications like Photoshop. They have inputs, outputs, and options that users exercise.
One major downside: real applications often encounter portability problems arising from dependences on the OS or compiler.

• Modified (or scripted) applications: modified to enhance portability or to focus on one particular aspect of system performance. Scripts are used to simulate interaction behavior.

• Kernels: small, key pieces of real programs. Typically used to evaluate individual features of the machine.

• Toy benchmarks: typically between 10 and 100 lines of code, producing a known result.

• Synthetic benchmarks: artificially created code that matches an average execution profile.

Page 80: Computer Organization and Architecture

Benchmark Suites

• A benchmark suite is a collection of programs (a workload) that tries to explore and capture the strengths and weaknesses of a computer system (real programs, kernels).

• A key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks.

• Good vs. bad benchmarks
– Improving the product for real programs vs. improving the product for benchmarks to get more sales.
– If the benchmarks are inadequate, then sales win!

Page 81: Computer Organization and Architecture

SPEC Benchmarks

SPEC: System Performance Evaluation Cooperative (today the Standard Performance Evaluation Corporation). The most successful and most widely adopted attempt.

First generation, 1989
◦ 10 programs yielding a single number (“SPECmarks”)

Second generation, 1992
◦ SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
◦ Unlimited compiler flags

Third generation, 1995
◦ New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating-point)
◦ Single flag setting for all programs: SPECint_base95, SPECfp_base95
◦ “Benchmarks useful for 3 years”

Page 82: Computer Organization and Architecture

SPEC Benchmarks

CINT2000 (Integer Component of SPEC CPU2000)

Program      Language  What It Is
164.gzip     C         Compression
175.vpr      C         FPGA Circuit Placement and Routing
176.gcc      C         C Programming Language Compiler
181.mcf      C         Combinatorial Optimization
186.crafty   C         Game Playing: Chess
197.parser   C         Word Processing
252.eon      C++       Computer Visualization
253.perlbmk  C         PERL Programming Language
254.gap      C         Group Theory, Interpreter
255.vortex   C         Object-oriented Database
256.bzip2    C         Compression
300.twolf    C         Place and Route Simulator

http://www.spec.org/osg/cpu2000/CINT2000/

Page 83: Computer Organization and Architecture

SPEC Benchmarks

CFP2000 (Floating Point Component of SPEC CPU2000)

Program       Language    What It Is
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution

http://www.spec.org/osg/cpu2000/CFP2000/

Page 84: Computer Organization and Architecture

SPEC Benchmarks

Sample results for CINT2000: Intel OR840 (1 GHz Pentium III processor)

Benchmark    Base Ref  Base Run  Base Ratio  Peak Ref  Peak Run  Peak Ratio
164.gzip     1400      277       505*        1400      270       518*
175.vpr      1400      419       334*        1400      417       336*
176.gcc      1100      275       399*        1100      272       405*
181.mcf      1800      621       290*        1800      619       291*
186.crafty   1000      191       522*        1000      191       523*
197.parser   1800      500       360*        1800      499       361*
252.eon      1300      267       486*        1300      267       486*
253.perlbmk  1800      302       596*        1800      302       596*
254.gap      1100      249       442*        1100      248       443*
255.vortex   1900      268       710*        1900      264       719*
256.bzip2    1500      389       386*        1500      375       400*
300.twolf    3000      784       382*        3000      776       387*

SPECint_base2000  438
SPECint2000       442

http://www.spec.org/osg/cpu2000/results/res2000q3/cpu2000-20000718-00168.asc
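The reported SPECint number is the geometric mean of the twelve per-benchmark ratios, which can be checked against the base column (the published ratios are themselves rounded, so the result is close rather than exact):

```python
import math

# Base ratios from the result table above.
base_ratios = [505, 334, 399, 290, 522, 360, 486, 596, 442, 710, 386, 382]
geo_mean = math.prod(base_ratios) ** (1 / len(base_ratios))
print(round(geo_mean))  # close to the reported SPECint_base2000 of 438
```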

Page 87: Computer Organization and Architecture

More Benchmarks

TPC: Transaction Processing Council
– Measures the ability of a system to handle transactions, which consist of database accesses and updates.
– Many variants, depending on transaction complexity
– TPC-A: simple bank-teller transaction style
– TPC-C: complex database query

EDN Embedded Microprocessor Benchmark Consortium (EEMBC, pronounced “embassy”)
– 34 kernels in 5 classes
– 16 automotive/industrial; 5 consumer; 3 networking; 4 office automation; 6 telecommunications

Page 88: Computer Organization and Architecture

How to Summarize Performance

Management would like to have one number.

Technical people want more:
1. Evidence of reproducibility: there should be enough information that you or someone else can repeat the experiment.
2. Consistency when the measurements are made multiple times.

How would you report these results?

                   Computer A  Computer B  Computer C
Program P1 (secs)       1          10          20
Program P2 (secs)    1000         100          20
Total Time (secs)    1001         110          40

Page 89: Computer Organization and Architecture

Comparing and Summarizing Performance

Comparing performance by looking at individual programs is not fair.

Total execution time: a consistent summary measure.

Arithmetic Mean – provides a simple average:

    Arithmetic Mean = (1/n) × Σ(i=1..n) Time_i

– Time_i: execution time for program i in the workload

– Doesn’t account for weight: all programs are treated equally
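As a quick sketch in Python, using the P1/P2 times for computers A, B, and C from the table on the previous slide:

```python
def arithmetic_mean(times):
    """Simple average of per-program execution times (seconds)."""
    return sum(times) / len(times)

# Execution times for programs P1 and P2 from the earlier table.
print(arithmetic_mean([1, 1000]))  # Computer A -> 500.5
print(arithmetic_mean([10, 100]))  # Computer B -> 55.0
print(arithmetic_mean([20, 20]))   # Computer C -> 20.0
```

Note how the long-running P2 dominates the unweighted average for computer A.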

89

Page 90: Computer Organization  and Architecture

Normalized Time Metrics

Normalized execution time metrics measure performance by normalizing it to a reference machine: Execution Time Ratio_i.

Geometric Mean:

    Geometric Mean = ( Π(i=1..n) Execution Time Ratio_i )^(1/n)

The geometric mean is consistent no matter which machine is the reference.

The arithmetic mean should not be used to average normalized execution times.

However, the geometric mean still doesn’t form an accurate prediction model (it doesn’t predict execution time).
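A minimal Python sketch of this reference-machine independence, using the P1/P2 times for machines A, B, and C from the example tables:

```python
import math

def geometric_mean(xs):
    """n-th root of the product of n values."""
    return math.prod(xs) ** (1.0 / len(xs))

# Execution times (secs) for programs P1, P2 on machines A, B, C.
times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def gm_normalized(machine, ref):
    """Geometric mean of a machine's times normalized to a reference machine."""
    return geometric_mean([m / r for m, r in zip(times[machine], times[ref])])

# The C-to-B comparison comes out the same whichever machine is the reference:
print(round(gm_normalized("C", "A") / gm_normalized("B", "A"), 3))  # 0.632
print(round(gm_normalized("C", "B") / gm_normalized("B", "B"), 3))  # 0.632
```

The ratio of geometric means is unchanged when the reference switches from A to B, which is exactly the consistency property claimed above.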

90

Page 91: Computer Organization  and Architecture

Comparing the performance of two computers using SPECRatio

The relationship between the geometric mean and the performance ratio

91

Page 92: Computer Organization  and Architecture

The ratio of the geometric means is equal to the geometric mean of the performance ratios, which implies that the choice of the reference computer is irrelevant.

92

Page 93: Computer Organization  and Architecture

Weighted Variants

What is the proper mixture of programs for the workload?

Weight_i is a weighting factor for each program, indicating the relative frequency of the program in the workload: % of use.

Weighted Arithmetic Mean:

    Weighted Arithmetic Mean = Σ(i=1..n) Weight_i × Time_i,  where Σ(i=1..n) Weight_i = 1

– Weight_i: frequency of program i in the workload

– May be better, but beware a dominant program time
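A sketch of the weighted mean in Python; the W(2) weights are taken to be (10/11, 1/11), which the example table on the next slide rounds to (0.909, 0.091):

```python
def weighted_arithmetic_mean(weights, times):
    """Sum of Weight_i * Time_i, with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * t for w, t in zip(weights, times))

# P1/P2 times on machine A with the assumed W(2) weighting.
print(round(weighted_arithmetic_mean([10 / 11, 1 / 11], [1, 1000]), 2))  # 91.82
```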

93

Page 94: Computer Organization  and Architecture

Example

                                  A       B       C    W(1)    W(2)    W(3)
Program P1 (secs)                 1      10      20     0.5   0.909   0.999
Program P2 (secs)              1000     100      20     0.5   0.091   0.001
Arithmetic Mean               500.5      55      20
Weighted Arithmetic Mean (1)  500.5      55      20
Weighted Arithmetic Mean (2)  91.82   18.18      20
Weighted Arithmetic Mean (3)      2   10.09      20

94

Page 95: Computer Organization  and Architecture

Example

                                  A       B       C       W(1)
Program P1 (secs)                 1      10      20    100/101
Program P2 (secs)              1000     100      20      1/101

Normalized to A:
Program P1                        1      10      20
Program P2                        1     0.1    0.02
Geometric Mean                    1       1    0.63

Arithmetic Mean               500.5      55      20
Weighted Arithmetic Mean (1)  10.89   10.89      20

Machines A and B have the same performance according to the geometric mean measure, yet this would be true only for a workload in which P1 runs 100 times more often than P2, according to the weighted arithmetic mean measure.

95

Page 96: Computer Organization  and Architecture

Locality of Reference

Other important fundamental observations come from the properties of programs.

Principle of locality: programs tend to reuse data and instructions they have used recently.

There are two different types of locality:

Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (loops, reuse, etc.)

Spatial locality (locality in space/location): if an item is referenced, items whose addresses are close to one another tend to be referenced soon (straight-line code, array access, etc.)

We can predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the past.
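Both kinds of locality show up in even the most ordinary code; a tiny illustrative sketch:

```python
# Summing an array exhibits both kinds of locality.
data = list(range(1000))

total = 0
for x in data:   # spatial locality: consecutive elements of `data`
    total += x   # temporal locality: `total` (and the loop body) reused each iteration
print(total)     # 499500
```

Caches exploit exactly this: the loop body stays in the instruction cache, and fetching one element brings its neighbors into the data cache.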

96

Page 97: Computer Organization  and Architecture

Quantitative Principles of Computer Design

Now that we know how to define, measure, and summarize performance, we can explore some principles and guidelines for the design and analysis of computers.

Make the common case fast

– In making a design trade-off, favor the frequent case over the infrequent case.

– Improving the frequent event, rather than the rare event, will obviously help performance.

– The frequent case is often simpler and can be done faster than the infrequent case.

– We have to decide what the frequent case is and how much performance can be improved by making that case faster.

97

Page 98: Computer Organization  and Architecture

Two equations to evaluate design alternatives

Amdahl’s Law

– The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl’s Law.

– Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

– Amdahl’s Law defines the speedup that can be gained by using a particular feature.

The CPU Performance Equation

– Essentially all computers are constructed using a clock running at a constant rate. CPU time can then be expressed in terms of the number of clock cycles.

98

Page 99: Computer Organization  and Architecture

Amdahl's Law

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.

Speedup due to enhancement E:

    Speedup(E) = Execution_Time_Without_Enhancement / Execution_Time_With_Enhancement

• Fraction_enhanced: the fraction of the execution time in the original machine that can be converted to take advantage of the enhancement

• Speedup_enhanced: the improvement gained by the enhanced execution mode

99

Page 100: Computer Organization  and Architecture

Amdahl's Law

    ExTime_new = ExTime_old × ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
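Amdahl's Law is easy to wrap in a small helper; a sketch in Python (the 10%/2× figures are just an illustrative FP enhancement):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Speeding up 10% of execution time by 2x buys very little overall:
print(round(amdahl_speedup(0.10, 2.0), 3))   # 1.053
# Even a near-infinite speedup of that 10% caps out at 1 / 0.9:
print(round(amdahl_speedup(0.10, 1e12), 3))  # 1.111
```

The second line shows the limiting behavior: the unenhanced fraction bounds the overall speedup no matter how fast the enhanced mode becomes.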

100

Page 101: Computer Organization  and Architecture

Amdahl's Law

101

Page 102: Computer Organization  and Architecture

Amdahl's Law

Example: Floating-point (FP) instructions are improved to run faster by a factor of 2, but only 10% of execution time is spent on FP instructions. What is the overall speedup gained by this improvement?

Answer:

    ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

    Speedup_overall = 1 / 0.95 = 1.053

Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost-performance.

It is particularly useful for comparing the overall system and CPU performance of two design alternatives.

102

Page 103: Computer Organization  and Architecture

Example of Amdahl's Law

Floating-point instructions are improved to run 3× faster, but only 5% of execution time is spent on FP instructions:

    ExTime_new = ExTime_old × (0.95 + 0.05/3) ≈ 0.967 × ExTime_old

    Speedup_overall = 1 / 0.967 ≈ 1.034

103

Page 104: Computer Organization  and Architecture

Example

Suppose that we want to enhance the processor used for Web serving. The new processor is 10 times faster on computation in the Web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?
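One way to work this example (a sketch, taking F = 0.4 for the compute fraction and S = 10):

```python
def amdahl_speedup(f, s):
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - f) + f / s)

# Computation (40% of time) becomes 10x faster; I/O (60%) is unchanged.
print(round(amdahl_speedup(0.4, 10), 4))  # 1.5625
```

Despite a 10× faster processor, the I/O wait caps the overall gain at just over 1.5×.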

104

Page 105: Computer Organization  and Architecture

Example

A common transformation required in graphics processors is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for half of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
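The two alternatives can be compared numerically (a sketch using the fractions stated above: 20% of time in FPSQR sped up 10×, versus 50% of time in all FP sped up 1.6×):

```python
def amdahl_speedup(f, s):
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - f) + f / s)

speedup_fpsqr = amdahl_speedup(0.2, 10)   # enhance FP square root only
speedup_fp = amdahl_speedup(0.5, 1.6)     # enhance all FP instructions
print(round(speedup_fpsqr, 4))  # 1.2195
print(round(speedup_fp, 4))     # 1.2308
```

Improving all FP instructions wins slightly, even though its speedup factor is much smaller, because it covers a larger fraction of execution time.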

105

Page 106: Computer Organization  and Architecture

Amdahl's Law – cont.

106

Page 107: Computer Organization  and Architecture

Amdahl's Law – cont.

107

Page 108: Computer Organization  and Architecture

Amdahl's Law – cont.

108

Page 109: Computer Organization  and Architecture

Exercise

E.g.: Our program takes 10 s to run on computer A, which has a 400 MHz clock. We want it to run in 6 s. The designer says that the clock rate can be increased, but doing so will cause the total number of cycles for the program to increase to 1.2 times the previous value. What is the minimum clock rate required to get the desired speedup?

109

Page 110: Computer Organization  and Architecture

Solution

Using the formula: ExTime = (Instructions/Program) × CPI / Clock Rate

    Initial: 10 = (Instructions/Program) × CPI / 400
    Final:    6 = (Instructions/Program) × 1.2 CPI / CLK_min

    ⇒ 10/6 = CLK_min / (1.2 × 400)
    ⇒ CLK_min = 10/6 × (1.2 × 400) = 800 MHz
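The arithmetic can be checked in a couple of lines (a sketch; `ins_x_cpi` is just the fixed instruction-count × CPI product implied by the first run):

```python
# The product (Instructions/Program) x CPI is fixed by the first run:
# 10 s at 400 MHz  =>  INS x CPI = 10 * 400 (in megacycle units).
ins_x_cpi = 10 * 400

# Cycle count grows by 1.2x and the target time is 6 s:
clk_min = 1.2 * ins_x_cpi / 6
print(round(clk_min))  # 800 (MHz)
```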

110

Page 111: Computer Organization  and Architecture

Exercise: A program runs in 100 s. Multiplies account for 80% of the program's time. Designer M can improve the speedup of multiply operations. Now, I am a user and I need to make MY program 5 times faster. How much speedup of multiply instructions must M achieve to allow me to reach my overall speedup goal?
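A numerical sketch of what Amdahl's Law says about this exercise: with F = 0.8, the overall speedup is bounded by 1 / (1 − 0.8) = 5, a limit that is only approached, never reached, as the multiply speedup grows:

```python
def amdahl_speedup(f, s):
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - f) + f / s)

# With F = 0.8, even enormous multiply speedups only approach the 5x limit:
for s in (10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.8, s), 4))
```

So no finite multiply speedup achieves the user's 5× goal here.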

111

Page 112: Computer Organization  and Architecture

Summary, #1

• Designing to last through trends:

                Capacity         Speed
    Logic       2x in 3 years    2x in 3 years
    DRAM        4x in 3 years    2x in 10 years
    Disk        4x in 3 years    2x in 10 years

• 6 years to graduate ⇒ 16× CPU speed, DRAM/disk size

• Time to run the task
  – Execution time, response time, latency

• Tasks per day, hour, week, sec, ns, …
  – Throughput, bandwidth

• “X is n times faster than Y” means:

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n

112

Page 113: Computer Organization  and Architecture

Summary, #2

Amdahl’s Law:

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

CPI Law:

    CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

Execution time is the REAL measure of computer performance!

Good products are created when you have:
◦ Good benchmarks, good ways to summarize performance

Die cost goes roughly with (die area)^4.

Can the PC industry support engineering/research investment?
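The CPI Law lends itself to a one-line helper; the 1e9-instruction / CPI 2 / 500 MHz figures below are made-up illustrative values:

```python
def cpu_time(instructions, cpi, clock_hz):
    """Seconds = Instructions x (Cycles/Instruction) x (Seconds/Cycle)."""
    return instructions * cpi / clock_hz

# Hypothetical workload: 1e9 instructions at CPI 2 on a 500 MHz clock.
print(cpu_time(1e9, 2.0, 500e6))  # 4.0 seconds
```

Any of the three factors can be attacked by the architect, the compiler, or the circuit designer; the product is what the user experiences.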

113