computer architecture cse 3322

Post on 17-Jan-2016

42 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Computer Architecture CSE 3322. Send email to Pramod Kumar, pxk3008@exchange.uta.edu , with the names and emails of your Four Project team members by Mon Sept 15. If not on a team, send your email address to Pramod. Web Site crystal.uta.edu/~jpatters/cse3322. CPI. - PowerPoint PPT Presentation

TRANSCRIPT

Computer Architecture CSE 3322Lecture 7

Assignment: 2.10, 2.11, 2.13, 2.18 Due Mon 9/22

TEST 1 - Mon 9/29

Lectures 1 - 7, Ch 2, 3

Web Sitecrystal.uta.edu/~jpatters/cse3322

Computer Architecture CSE 3322

Web Sitecrystal.uta.edu/~jpatters/cse3322

Send email to Pramod Kumar, pxk3008@exchange.uta.edu, with the names and emails of your Four Project team members by Mon Sept 15. If not on a team, send your email address to Pramod.

CPI

CPI = Clock Cycles / Instruction

“Average clock cycles per instruction”

CPI

CPI = Clock Cycles / Instruction

“Average clock cycles per instruction”

CPU Time = Instructions x CPI / Clock Rate = Instructions x CPI x Clock Cycle Time

CPI

CPI = Clock Cycles / Instruction

“Average clock cycles per instruction”

CPU Time = Instructions x CPI / Clock Rate = Instructions x CPI x Clock Cycle Time

Average CPI = SUM of CPI (i) * I(i) for i=1, n Instruction Count

CPI

Invest Resources where time is Spent!

CPI = Clock Cycles / Instruction Count = (CPU Time * Clock Rate) / Instruction Count

“Average clock cycles per instruction”

CPU Time = Instruction Count x CPI / Clock Rate = Instruction Count x CPI x Clock Cycle Time

Average CPI = SUM of CPI (i) * I(i) for i=1, n Instruction Count

Average CPI = SUM of CPI(i) * F(i) for i = 1, nF(i) is the Instruction Frequency

Suppose we have two implementations of the same instruction set

For some program,Machine A has:

a clock cycle time of 10 ns. and a CPI of 2.0

Machine B has:a clock cycle time of 20 ns. and a CPI of 1.2

What machine is faster for this program, and by how much?

CPI Example

Suppose we have two implementations of the same instruction set

For some program,Machine A has: CPU Time = I*2.0*10ns=I*20ns

a clock cycle time of 10 ns. and a CPI of 2.0

Machine B has: a clock cycle time of 20 ns. and a CPI of 1.2

What machine is faster for this program, and by how much?

CPI Example

Suppose we have two implementations of the same instruction set

For some program,Machine A has: CPU Time = I*2.0*10ns=I*20ns

a clock cycle time of 10 ns. and a CPI of 2.0

Machine B has: CPU Time = I*1.2*20ns=I*24nsa clock cycle time of 20 ns. and a CPI of 1.2

What machine is faster for this program, and by how much?

CPI Example

Suppose we have two implementations of the same instruction set

For some program,Machine A has: CPU Time = I*2.0*10ns=I*20ns

a clock cycle time of 10 ns. and a CPI of 2.0

Machine B has: CPU Time = I*1.2*20ns=I*24nsa clock cycle time of 20 ns. and a CPI of 1.2

What machine is faster for this program, and by how much? A is 24/20 =1.2 faster than B

CPI Example

Suppose we have two implementations of the same instruction set

For some program,Machine A has: CPU Time = I*2.0*10ns=I*20ns

a clock cycle time of 10 ns. and a CPI of 2.0

Machine B has: CPU Time = I*1.2*20ns=I*24nsa clock cycle time of 20 ns. and a CPI of 1.2

What machine is faster for this program, and by how much? A is 24/20 =1.2 faster than B

Note: CPI is Smaller for B

CPI Example

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1

Load 20% 5

Store 10% 3

Branch 20% 2

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5

Load 20% 5 1.0

Store 10% 3 .3

Branch 20% 2 .4

2.2 = CPI ave

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5

Load 20% 5 1.0

Store 10% 3 .3

Branch 20% 2 .4

2.2 = CPI ave

CPU Time(i) = Instr Cnt(i) * CPI(i) * Clk Cycle TimeCPU Time Inst Cnt * CPI ave * Clk Cycle Time

% Time = F(i) * CPI(i) / CPI ave

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2 = CPI ave

CPU Time(i) = Instr Cnt(i) * CPI(i) * Clk Cycle TimeCPU Time Inst Cnt * CPI ave * Clk Cycle Time

% Time = F(i) * CPI(i) / CPI ave

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2 = CPI ave

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles?

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 (2) 1.0 (.4) 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2 (1.6)

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles?

2.2/1.6 = 1.375

CPU Time = Inst Cnt * CPI ave * Clk Cycle Time

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles? CPI = 1.6

How does this compare with using branch prediction to shave a cycle off the branch time?

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 (1) .4 (.2) 18%

2.2 (2.0)

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles? CPI = 1.6

How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 .5 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles? CPI = 1.6

How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0

What if two ALU instructions could be executed at once?

Example (RISC processor)

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles F(i)CPI(i) % Time

ALU 50% 1 (.5) .5 (.25) 23%

Load 20% 5 1.0 45%

Store 10% 3 .3 14%

Branch 20% 2 .4 18%

2.2 (1.95)

How much faster would the machine be if a better data cachereduced the average load time to 2 cycles? CPI = 1.6

How does this compare with using branch prediction to shave a cycle off the branch time? CPI = 2.0

What if two ALU instructions could be executed at once? CPI=1.95

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:

Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C

The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

Which sequence will be faster? How much?What is the CPI for each sequence?

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:

Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence

has 6 instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be faster? How much?

What is the CPI for each sequence?

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:

Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence has

6 instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be

faster? How much? 10 / 9 = 1.11What is the CPI for each sequence?

A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:

Class A has 1 cycle Class B has 2 cycles Class C has 3 cycles

The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C 2*1+1*2+2*3 = 10 The second sequence has 6

instructions: 4 of A, 1 of B, and 1 of C. 4*1+1*2+1*3 = 9 Which sequence will be faster?

How much? 10 / 9 = 1.11What is the CPI for each sequence? 10/5 = 2

9/6 = 1.5

MIPS = Instruction Count

Execution time x 106

A popular performance metric is MIPS, the numberof millions of instructions per second.

For a given program,

MIPS = Instruction Count

Execution time x 106

A popular performance metric is MIPS, the numberof millions of instructions per second.

For a given program,

1. Cannot compare if instruction set is different

MIPS = Instruction Count

Execution time x 106

A popular performance metric is MIPS, the numberof millions of instructions per second.

For a given program,

1. Cannot compare if instruction set is different2. Highly dependent on the program

MIPS = Instruction Count

Execution time x 106

A popular performance metric is MIPS, the numberof millions of instructions per second.

For a given program,

1. Cannot compare if instruction set is different2. Highly dependent on the program3. Can be inversely proportional to performance

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A has 1 cycle,Class B has 2 cycles, Class C has 3 cycles

Instruction counts ( billions)Code from A B CCompiler 1 5 1 1Compiler 2 10 1 1

• Which sequence will be faster according to MIPS?• Which sequence will be faster according to execution

time?

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 5+1x2+1x3=10 billionCompiler 2

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billionCompiler 2 15 billion

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billion 1010x10-8=100Compiler 2 15 billion

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billion 100 secCompiler 2 15 billion 150 sec

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions:

Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billion 100 sec 7x103/100Compiler 2 15 billion 150 sec

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions:

Class A Class B Class CCPI 1 2 3

Instruction counts ( billions)Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billion 100 sec 70Compiler 2 15 billion 150 sec 12x103/150

MIPS example

Two different compilers are being tested for a 100 MHz. machine with three different classes of instructions: Class A Class B Class C

CPI 1 2 3 Instruction counts ( billions)

Code from A B C TotalCompiler 1 5 1 1 7Compiler 2 10 1 1 12 CPU cycles Exec Time MIPSCompiler 1 10 billion 100 sec 70Compiler 2 15 billion 150 sec 80

MIPS example

• Performance best determined by running a real application– Use programs typical of expected workload– Or, typical of expected class of applications

e.g., compilers/editors, scientific applications, graphics, etc.

Benchmarks

• Performance best determined by running a real application– Use programs typical of expected workload– Or, typical of expected class of applications

e.g., compilers/editors, scientific applications, graphics, etc.

• Small benchmarks– nice for architects and designers– easy to standardize– can be abused

Benchmarks

• SPEC (System Performance Evaluation Cooperative)

Benchmarks

• SPEC (System Performance Evaluation Cooperative)– companies have agreed on a set of real

program and inputs

Benchmarks

• SPEC (System Performance Evaluation Cooperative)– companies have agreed on a set of real

program and inputs– can still be abused

Benchmarks

• SPEC (System Performance Evaluation Cooperative)– companies have agreed on a set of real

program and inputs– can still be abused – valuable indicator of performance (and

compiler technology)

Benchmarks

SPEC95• Eighteen application benchmarks (with inputs)

reflecting a technical computing workload

SPEC95• Eighteen application benchmarks (with inputs)

reflecting a technical computing workload• Eight integer applications

–go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

SPEC95• Eighteen application benchmarks (with inputs)

reflecting a technical computing workload• Eight integer applications

–go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

• Ten floating-point intensive applications–tomcatv, swim, su2cor, hydro2d, mgrid, applu,

turb3d, apsi, fppp, wave5

SPEC95• Eighteen application benchmarks (with inputs)

reflecting a technical computing workload• Eight integer applications

–go, m88ksim, gcc, compress, li, ijpeg, perl, vortex

• Ten floating-point intensive applications–tomcatv, swim, su2cor, hydro2d, mgrid, applu,

turb3d, apsi, fppp, wave5• Must run with standard compiler flags

–eliminate special undocumented incantations that may not even generate working code for real programs

PentiumClock rate (MHz)

SP

EC

fp

Pentium Pro

2

0

4

6

8

3

1

5

7

9

10

200 25015010050

SPEC fp

Execution Time After Improvement =

Execution Time Unaffected +

( Execution Time Affected / Amount of Improvement )

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a

machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a machine, with

multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? Improved Time = (100 – 80) + 80/n = 100/4

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a machine, with

multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? Improved Time = (100 – 80) + 80/n = 100/4

20 + 80/n = 25 80 = 5n ; n = 16

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a machine,

with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

How about making it 5 times faster?

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a machine, with

multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

How about making it 5 times faster?Improved Time = (100 –80) + 80/n = 100/5

Amdahl's Law

Execution Time After Improvement =

Execution Time Unaffected + ( Execution Time Affected / Amount of Improvement )

• Example:Suppose a program runs in 100 seconds on a machine, with multiply

responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

How about making it 5 times faster?Improved Time = (100 –80) + 80/n = 100/5 20 + 80/n = 20

80/n = 0 Impossible!

Amdahl's Law

• Performance is specific to a particular

program/s

– Total execution time is a consistent summary of performance

Remember

• Performance is specific to a particular program/s– Total execution time is a consistent summary of

performance

• For a given architecture performance increases come

from:– increases in clock rate (without adverse CPI affects)

Remember

• Performance is specific to a particular program/s– Total execution time is a consistent summary of

performance

• For a given architecture performance increases come from:– increases in clock rate (without adverse CPI affects)– improvements in processor organization that lower CPI

Remember

• Performance is specific to a particular program/s– Total execution time is a consistent summary of performance

• For a given architecture performance increases come from:– increases in clock rate (without adverse CPI affects)– improvements in processor organization that lower CPI– compiler enhancements that lower CPI and/or instruction count

Remember

top related