The Improvement of the Personal Computer, by Associate Professor 金明浩, Department of Computer Science and Information Engineering, I-Shou University (private)


Post on 19-Dec-2015


Page 1:

The Improvement of the Personal Computer

By

Associate Professor, Department of Computer Science and Information Engineering, I-Shou University (private)

金明浩

Page 2:

What is Computer Design?

[Figure: layers of a computer system, from top to bottom]
Application
Operating System
Compiler / Firmware
ISA (Instruction Set Architecture): Instr. Set Proc., I/O system
Datapath & Control
Digital Design
Circuit Design
Layout

Page 3:

Forces on Computer Architecture

[Figure: Computer Architecture at the center, acted on by five forces]
• Technology
• Programming Languages
• Operating Systems
• History
• Applications

Page 4:

Technology

• In ~1985 the single-chip processor (32-bit) and the single-board computer emerged
– => workstations, personal computers, and multiprocessors have been riding this wave since
• In the 2002+ timeframe, these may well look like mainframes compared to a single-chip computer (maybe 2 chips)

DRAM chip capacity

Year    Size
1980    64 Kb
1983    256 Kb
1986    1 Mb
1989    4 Mb
1992    16 Mb
1996    64 Mb
1999    256 Mb
2002    1 Gb
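
The DRAM table on this slide can also be read as a growth rate. The sketch below is an illustration only, not part of the original deck; it assumes Kb, Mb and Gb mean binary multiples of bits, and the names `DRAM_BITS`, `generation_steps` and `annual_growth` are made up for this example.

```python
# DRAM chip capacity per year, in bits, copied from the slide's table.
# Assumption: Kb, Mb and Gb are read as 2**10, 2**20 and 2**30 bits.
DRAM_BITS = {
    1980: 64 * 2**10,
    1983: 256 * 2**10,
    1986: 1 * 2**20,
    1989: 4 * 2**20,
    1992: 16 * 2**20,
    1996: 64 * 2**20,
    1999: 256 * 2**20,
    2002: 1 * 2**30,
}


def generation_steps(table):
    """Capacity ratio between each pair of consecutive table rows."""
    years = sorted(table)
    return [table[b] // table[a] for a, b in zip(years, years[1:])]


def annual_growth(table):
    """Compound annual growth factor between the first and last rows."""
    years = sorted(table)
    first, last = years[0], years[-1]
    return (table[last] / table[first]) ** (1.0 / (last - first))


print(generation_steps(DRAM_BITS))          # ratio per generation
print(round(annual_growth(DRAM_BITS), 2))   # growth factor per year
```

Every row is four times the previous one, and the rows are about three years apart.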

Page 6:

Levels of Representation

High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

=> Compiler =>

Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

=> Assembler =>

Machine Language Program:
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

=> Machine Interpretation =>

Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK
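
To connect the high-level swap with its assembly form, here is a small sketch that replays the slide's four MIPS instructions against a Python list standing in for memory. The function name `run_swap`, the word-addressed list, and the choice of `base` as the byte address held in $2 are assumptions made for illustration.

```python
def run_swap(memory, base):
    """Replay the slide's lw/lw/sw/sw sequence.

    memory: list of words (hypothetical stand-in for data memory)
    base:   byte address held in $2, i.e. the address of v[k]
    """
    reg = {}
    reg[15] = memory[(base + 0) // 4]   # lw $15, 0($2)
    reg[16] = memory[(base + 4) // 4]   # lw $16, 4($2)
    memory[(base + 0) // 4] = reg[16]   # sw $16, 0($2)
    memory[(base + 4) // 4] = reg[15]   # sw $15, 4($2)
    return memory


v = [10, 20, 30, 40]
k = 1
run_swap(v, base=4 * k)   # swaps v[k] and v[k+1]
print(v)
```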

Page 8:

Levels of Organization

SPARCstation 20

[Figure: organization of the computer]
Computer
– Processor: Control, Datapath
– Memory
– Devices: Input, Output

Workstation Cost Design Target:
• 25% on Processor
• 25% on Memory (minimum memory size)
• Rest on I/O devices, power supplies, box

Page 9:

Processor and Caches

SPARCstation 20

[Figure: two MBus slots (Slot 0, Slot 1) on the MBus; an MBus Module holds the Processor (Datapath, Registers, Internal Cache, Control) and the External Cache]

Page 10:

Input and Output (I/O) Devices

SPARCstation 20

[Figure: four SBus slots (Slot 0 to Slot 3) on the SBus; SEC and MACIO chips; Disk and Tape on the SCSI Bus; Keyboard & Mouse and Floppy Disk on the External Bus]

• SCSI Bus: Standard I/O Devices
• SBus: High Speed I/O Devices
• PCI Bus: Compatible with PCs
• External Bus: Low Speed I/O Devices

Page 12:

Architecture Design of a Processor

• The design of a processor and the major components
• The CPU execution cycles
• The Single and multiple path architecture design
• The Pipelined CPU
• The Data, Control and Structure Hazard
• The Advanced CPU Architecture
• The Memory Hierarchy
• The Need of Cache Memory
• The Status and the Future

Page 13:

Processor Design is a Process

• Bottom-up
– assemble components in target technology to establish critical timing
• Top-down
– specify component behavior from high-level requirements
• Iterative refinement
– establish partial solution, expand and improve

[Figure: Instruction Set Architecture => processor => datapath, control => Reg. File, Mux, ALU, Reg, Mem, Decoder, Sequencer => Cells, Gates]

Page 15:

Execution Cycle

• Instruction Fetch: Obtain instruction from program storage
• Instruction Decode: Determine required actions and instruction size
• Operand Fetch: Locate and obtain operand data
• Execute: Compute result value or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine successor instruction
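
The execution cycle can be sketched as an interpreter loop. The toy instruction format below (tuples with opcodes `load`, `add`, `store`, `halt`) is invented for this illustration and is not the instruction set used in the slides; comments mark where each step of the cycle would sit.

```python
def run(program, memory):
    """Interpret a toy program, one execution cycle per loop iteration."""
    regs = [0, 0, 0, 0]
    pc = 0
    while pc < len(program):
        instruction = program[pc]        # Instruction Fetch
        op, a, b, c = instruction        # Instruction Decode
        if op == "halt":
            break
        if op == "load":                 # regs[a] <- memory[b]
            operand = memory[b]          # Operand Fetch
            regs[a] = operand            # Result Store
        elif op == "add":                # regs[a] <- regs[b] + regs[c]
            x, y = regs[b], regs[c]      # Operand Fetch
            result = x + y               # Execute
            regs[a] = result             # Result Store
        elif op == "store":              # memory[a] <- regs[b]
            memory[a] = regs[b]          # Operand Fetch and Result Store
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1                          # Next Instruction
    return regs, memory


program = [
    ("load", 0, 0, None),
    ("load", 1, 1, None),
    ("add", 2, 0, 1),
    ("store", 2, 2, None),
    ("halt", None, None, None),
]
regs, memory = run(program, [3, 4, 0])
print(regs, memory)
```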

Page 17:

A Single Cycle Datapath

• We have everything except control signals (underline)
– Today's lecture will show you how to generate the control signals

[Figure: single-cycle datapath. The Instruction Fetch Unit (nPC_sel, Clk) supplies Instruction<31:0>, with fields Rs <21:25>, Rt <16:20>, Rd <11:15> and Imm16 <0:15>. A register file of 32 32-bit registers (Rw, Ra, Rb; busA, busB, busW; RegWr, Clk) feeds the ALU (ALUctr, Zero). An Extender (ExtOp) widens imm16 from 16 to 32 bits. Data Memory (Adr, Data In, WrEn, MemWr, Clk) follows the ALU. Multiplexers are controlled by RegDst, ALUSrc and MemtoReg.]

Page 18:

The Truth Table for the Main Control

op                 00 0000   00 1101   10 0011   10 1011   00 0100   00 0010
                   R-type    ori       lw        sw        beq       jump
RegDst             1         0         0         x         x         x
ALUSrc             0         1         1         1         0         x
MemtoReg           0         0         1         x         x         x
RegWrite           1         1         1         0         0         0
MemWrite           0         0         0         1         0         0
Branch             0         0         0         0         1         0
Jump               0         0         0         0         0         1
ExtOp              x         0         1         1         x         x
ALUop (Symbolic)   R-type    Or        Add       Add       Subtract  xxx
ALUop <2>          1         0         0         0         0         x
ALUop <1>          0         1         0         0         0         x
ALUop <0>          0         0         0         0         1         x

[Figure: Main Control takes op (6 bits) and produces RegDst, ALUSrc, ... and ALUop (3 bits); ALU Control (Local) takes ALUop and func (6 bits) and produces ALUctr (3 bits)]
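
As a sketch of how the main-control truth table could be used, the lookup below encodes the slide's values as a Python dictionary keyed by opcode. `None` stands for a don't-care ("x"); the names `MAIN_CONTROL`, `main_control` and `opcode_of` are assumptions for this example, and real hardware would implement the table as logic (PLA or ROM) rather than a dictionary.

```python
X = None  # don't care ("x" on the slide)

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemWrite", "Branch", "Jump", "ExtOp", "ALUop")

# One row per opcode; values follow the order of SIGNALS.
MAIN_CONTROL = {
    0b000000: ("R-type", (1, 0, 0, 1, 0, 0, 0, X, "R-type")),
    0b001101: ("ori",    (0, 1, 0, 1, 0, 0, 0, 0, "Or")),
    0b100011: ("lw",     (0, 1, 1, 1, 0, 0, 0, 1, "Add")),
    0b101011: ("sw",     (X, 1, X, 0, 1, 0, 0, 1, "Add")),
    0b000100: ("beq",    (X, 0, X, 0, 0, 1, 0, X, "Subtract")),
    0b000010: ("jump",   (X, X, X, 0, 0, 0, 1, X, X)),
}


def opcode_of(instruction):
    """Extract the 6-bit opcode from bits 31..26 of a 32-bit instruction word."""
    return (instruction >> 26) & 0x3F


def main_control(opcode):
    """Return (instruction name, {signal: value}) for a known opcode."""
    name, values = MAIN_CONTROL[opcode]
    return name, dict(zip(SIGNALS, values))


# 0x8C500004 encodes lw $16, 4($2) in the standard MIPS I-format.
name, signals = main_control(opcode_of(0x8C500004))
print(name, signals)
```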

Page 19:

Systematic Generation of Control

• In our single-cycle processor, each instruction is realized by exactly one control command or microinstruction
– in general, the controller is a finite state machine
– microinstruction can also control sequencing (see later)

[Figure: the Instruction is decoded into an OPcode, which drives the Control Logic / Store (PLA, ROM); the resulting microinstruction sets the Control Points of the Datapath; Conditions link the Datapath and the control logic]

Page 20:

The Big Picture: Where are We Now?

• The Five Classic Components of a Computer: Control and Datapath (together the Processor), Memory, Input, Output
• Today's Topic: Designing the Datapath for the Multiple Clock Cycle Datapath

Page 22:

Pipelining is Natural!

• Laundry Example
• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 30 minutes
• Folder takes 30 minutes
• Stasher takes 30 minutes to put clothes into drawers

Page 23:

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads
• If they learned pipelining, how long would laundry take?

[Figure: task order A, B, C, D against time from 6 PM to 2 AM; each load runs four 30-minute stages, one load after another]

Page 24:

Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads!

[Figure: task order A, B, C, D against time from 6 PM to 2 AM; each load starts 30 minutes after the previous one, so the 30-minute stages overlap]
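
The two laundry totals follow from simple counting. This sketch assumes four equal 30-minute stages (washer, dryer, folder, stasher) and no stalls; the function names are illustrative.

```python
def sequential_minutes(loads, stages, stage_minutes):
    """Each load finishes all of its stages before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    """The first load takes `stages` steps; every later load adds one step."""
    return (stages + loads - 1) * stage_minutes


print(sequential_minutes(4, 4, 30) / 60)  # hours for 4 loads, sequential
print(pipelined_minutes(4, 4, 30) / 60)   # hours for 4 loads, pipelined
```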

Page 25:

Pipelining Lessons

• Pipelining doesn't help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number pipe stages
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to fill pipeline and time to drain it reduce speedup
• Stall for Dependences

[Figure: pipelined laundry, task order A, B, C, D against time from 6 PM to 9 PM, in overlapping 30-minute stages]

Page 26:

The Five Stages of Load

Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Ifetch    Reg/Dec   Exec      Mem       Wr

• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

Page 27:

Graphically Representing Pipelines

• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths

Page 28:

Single Cycle, Multiple Cycle, vs. Pipeline

[Figure: timing of Load, Store, and R-type instructions against Clk under three implementations]
• Single Cycle Implementation: Load takes Cycle 1; Store takes Cycle 2, with part of the cycle marked "Waste"
• Multiple Cycle Implementation: Load (Ifetch, Reg, Exec, Mem, Wr), then Store (Ifetch, Reg, Exec, Mem), then R-type (Ifetch, ...), across Cycle 1 to Cycle 10
• Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in successive cycles

Page 29:

Pipelined Execution

• Utilization?
• Now we just have to make it work

[Figure: six instructions in program flow order against time, each passing through IFetch, Dcd, Exec, Mem, WB and starting one cycle after the previous one]
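
A pipeline chart like this one can be generated mechanically, which also answers questions such as "how many cycles?" and "which instruction is in Exec during cycle 4?". The sketch assumes an ideal five-stage pipeline with one instruction entering per cycle and no stalls; the function names are illustrative.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def stage_in_cycle(instr_index, cycle):
    """Stage held by instruction `instr_index` (0-based) in `cycle` (1-based).

    Returns None when the instruction is not in the pipeline in that cycle.
    """
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def total_cycles(n_instructions):
    """Cycles needed to run n instructions through the ideal pipeline."""
    return len(STAGES) + n_instructions - 1


def diagram(n_instructions):
    """Text chart: one row per instruction, one column per clock cycle."""
    rows = []
    for i in range(n_instructions):
        cells = [(stage_in_cycle(i, c) or "").ljust(6)
                 for c in range(1, total_cycles(n_instructions) + 1)]
        rows.append(" ".join(cells).rstrip())
    return "\n".join(rows)


print(diagram(6))
print(total_cycles(6))
```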

Page 30:

Why Pipeline?

• Suppose we execute 100 instructions
• Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
– 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
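
The three totals can be reproduced with a few lines of arithmetic. The cycle times, CPI values and the 4-cycle drain are the slide's figures; the function names are made up for this sketch.

```python
def single_cycle_ns(n_inst, cycle_ns=45, cpi=1):
    """One long cycle per instruction."""
    return cycle_ns * cpi * n_inst


def multicycle_ns(n_inst, cycle_ns=10, cpi=4.6):
    """Short cycles, several per instruction (CPI depends on the mix)."""
    return cycle_ns * cpi * n_inst


def pipelined_ns(n_inst, cycle_ns=10, drain_cycles=4):
    """One instruction completes per cycle, plus cycles to drain the pipe."""
    return cycle_ns * (1 * n_inst + drain_cycles)


n = 100
print(single_cycle_ns(n), multicycle_ns(n), pipelined_ns(n))
print(single_cycle_ns(n) / pipelined_ns(n))  # speedup over single cycle
```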

Page 31:

Why Pipeline? Because the resources are there!

[Figure: instruction order (Inst 0 to Inst 4) against time in clock cycles; each instruction uses Im, Reg, ALU, Dm, Reg in successive cycles, offset by one cycle from the previous instruction]

Page 33:

Can pipelining get us into trouble?

• Yes: Pipeline Hazards
– structural hazards: attempt to use the same resource two different ways at the same time
   • E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
– data hazards: attempt to use item before it is ready
   • E.g., one sock of pair in dryer and one in washer; can't fold until get sock from washer through dryer
   • instruction depends on result of prior instruction still in the pipeline
– control hazards: attempt to make a decision before condition is evaluated
   • E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
   • branch instructions
• Can always resolve hazards by waiting
– pipeline control must detect the hazard
– take action (or delay action) to resolve hazards

Page 35:

Data Hazard on r1:

• Dependencies backwards in time are hazards

    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

[Figure: the five instructions in order against time in clock cycles, through stages IF, ID/RF, EX, MEM, WB (Im, Reg, ALU, Dm, Reg)]
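
A rough way to find the "backwards in time" dependencies in this instruction sequence is to compare the cycle in which a register is written with the cycle in which a later instruction reads it. The sketch assumes a five-stage pipeline without forwarding, one instruction issued per cycle, register reads in ID/RF and writes in WB, and that a read in the same cycle as the write is safe; those timing assumptions and all names below are illustrative, not taken from the slide.

```python
def parse(line):
    """Split 'op dest,src1,src2' into (op, dest, [src1, src2])."""
    op, operands = line.split(None, 1)
    dest, *srcs = [r.strip() for r in operands.split(",")]
    return op, dest, srcs


def raw_hazards(program):
    """List (consumer instruction, register) pairs that read a value too early."""
    hazards = []
    for i, line in enumerate(program):
        _, dest, _ = parse(line)
        wb_cycle = i + 5                    # instruction i writes back in cycle i+5
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = parse(program[j])
            id_cycle = j + 2                # instruction j reads registers in cycle j+2
            if dest in srcs_j and id_cycle < wb_cycle:
                hazards.append((program[j], dest))
            if dest_j == dest:              # a later write supersedes this value
                break
    return hazards


program = [
    "add r1,r2,r3",
    "sub r4,r1,r3",
    "and r6,r1,r7",
    "or r8,r1,r9",
    "xor r10,r1,r11",
]
for consumer, register in raw_hazards(program):
    print(f"{consumer!r} reads {register} before it is written back")
```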

Page 37:

Issues in Pipelined design

• Pipelining
– Limitation: Issue rate, FU stalls, FU depth
• Super-pipeline
– Issue one instruction per (fast) cycle
– ALU takes multiple cycles
– Limitation: Clock skew, FU stalls, FU depth
• Super-scalar
– Issue multiple scalar instructions per cycle
– Limitation: Hazard resolution

[Figure: IF D Ex M W stage diagrams for each technique]

Page 38:

The VLIW and Vector Processors

• VLIW
– Each instruction specifies multiple scalar operations
– Compiler determines parallelism
– Limitation: Packing
• Vector operations
– Each instruction specifies series of identical operations
– Limitation: Applicability

[Figure: stage diagrams in which one IF D is followed by several Ex M W sequences]

Page 39:

Limits of Superscalar

• While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
– Exactly 50% FP operations
– No hazards
• If more instructions issue at same time, greater difficulty of decode and issue
– Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
• VLIW: tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
   • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits
– Need compiling technique that schedules across several branches

Page 40:

Software Pipelining Example

Before: Unrolled 3 times
    LD   F0,0(R1)
    ADDD F4,F0,F2
    SD   0(R1),F4
    LD   F6,-8(R1)
    ADDD F8,F6,F2
    SD   -8(R1),F8
    LD   F10,-16(R1)
    ADDD F12,F10,F2
    SD   -16(R1),F12
    SUBI R1,R1,#24
    BNEZ R1,LOOP

After: Software Pipelined
    SD   0(R1),F4     ; Stores M[i]
    ADDD F4,F0,F2     ; Adds to M[i-1]
    LD   F0,-16(R1)   ; Loads M[i-2]
    SUBI R1,R1,#8
    BNEZ R1,LOOP

• Symbolic Loop Unrolling
– Less code space
– Fill & drain pipe only once vs. each iteration in loop unrolling

Page 41:

Software Pipelining

• Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

[Figure: Iteration 0 to Iteration 4 of the original loop overlapped in time; a software-pipelined iteration takes instructions from several of the overlapping iterations]
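
To see that a software-pipelined loop computes the same result as the plain loop, the sketch below rewrites "add a constant to every array element" both ways. It is a Python analogue of the idea, not a translation of the slide's assembly: the indices run upward, and the explicit fill (prologue) and drain (epilogue) code is an assumption added so the example is complete.

```python
def plain_loop(x, c):
    """Load, add, store for one element per iteration."""
    out = list(x)
    for i in range(len(out)):
        f0 = out[i]        # LD
        f4 = f0 + c        # ADDD
        out[i] = f4        # SD
    return out


def software_pipelined(x, c):
    """Each steady-state iteration stores element i, adds for i+1, loads i+2."""
    out = list(x)
    n = len(out)
    if n < 3:
        return plain_loop(x, c)   # too short to fill the software pipeline
    # Prologue: fill the pipeline.
    f0 = out[0]                   # LD for element 0
    f4 = f0 + c                   # ADDD for element 0
    f0 = out[1]                   # LD for element 1
    # Steady state: one store, one add, one load, from three different elements.
    for i in range(n - 2):
        out[i] = f4               # SD for element i
        f4 = f0 + c               # ADDD for element i+1
        f0 = out[i + 2]           # LD for element i+2
    # Epilogue: drain the pipeline.
    out[n - 2] = f4
    out[n - 1] = f0 + c
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(plain_loop(data, 10.0))
print(software_pipelined(data, 10.0))
```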

Page 42:

Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,LOOP
14       SD   8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles

Page 43:

Loop Unrolling in VLIW

Memory            Memory            FP                 FP                 Int. op/          Clock
reference 1       reference 2       operation 1        op. 2              branch
LD F0,0(R1)       LD F6,-8(R1)                                                              1
LD F10,-16(R1)    LD F14,-24(R1)                                                            2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                    ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                         6
SD -16(R1),F12    SD -24(R1),F16                                                            7
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#48    8
SD -0(R1),F28                                                             BNEZ R1,LOOP      9

Unrolled 7 times to avoid delays

7 results in 9 clocks, or 1.3 clocks per iteration

Need more registers in VLIW (EPIC => 128 int + 128 FP)
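
The "clocks per iteration" figure, and how full the long instruction words are, can be counted from the schedule. The sketch below transcribes only the opcode in each slot of the slide's nine-clock schedule; the five-slot layout is the slide's, while the slot-utilization measure and the function names are additions for illustration.

```python
# One row per clock; columns are: memory ref 1, memory ref 2,
# FP operation 1, FP operation 2, integer op / branch. None is an empty slot.
SCHEDULE = [
    ["LD", "LD", None,   None,   None],
    ["LD", "LD", None,   None,   None],
    ["LD", "LD", "ADDD", "ADDD", None],
    ["LD", None, "ADDD", "ADDD", None],
    [None, None, "ADDD", "ADDD", None],
    ["SD", "SD", "ADDD", None,   None],
    ["SD", "SD", None,   None,   None],
    ["SD", "SD", None,   None,   "SUBI"],
    ["SD", None, None,   None,   "BNEZ"],
]


def operations(schedule):
    """Number of filled slots in the whole schedule."""
    return sum(1 for row in schedule for slot in row if slot is not None)


def clocks_per_result(schedule):
    """Clocks divided by the number of stored results (SD operations)."""
    results = sum(row.count("SD") for row in schedule)
    return len(schedule) / results


def slot_utilization(schedule):
    """Fraction of available instruction slots that hold an operation."""
    return operations(schedule) / (len(schedule) * len(schedule[0]))


print(operations(SCHEDULE))
print(round(clocks_per_result(SCHEDULE), 2))
print(round(slot_utilization(SCHEDULE), 2))
```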

Page 45:

Levels of the Memory Hierarchy

Level           Capacity         Access Time    Cost
CPU Registers   100s Bytes       <2s ns
Cache           K Bytes (SRAM)   2-100 ns       $.01-.001/bit
Main Memory     M Bytes (DRAM)   100 ns-1 us    $.01-.001
Disk            G Bytes          ms             10^-3 - 10^-4 cents
Tape            infinite         sec-min        10^-6

Staging / Xfer Unit between levels:

Between               Unit              Managed by       Size
Registers <-> Cache   Instr. Operands   prog./compiler   1-8 bytes
Cache <-> Memory      Blocks            cache cntl       8-128 bytes
Memory <-> Disk       Pages             OS               512-4K bytes
Disk <-> Tape         Files             user/operator    Mbytes

Upper Level: faster
Lower Level: larger

Page 46:

Processor-DRAM Gap (latency)

[Figure: performance (log scale, 1 to 1000) against time, 1980 to 2000. CPU ("Moore's Law") improves 60%/yr; DRAM improves 7%/yr; the Processor-Memory Performance Gap grows 50% / year.]
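
The "grows 50% / year" figure follows from the two improvement rates quoted on this slide. A minimal check, with illustrative function names:

```python
def gap_growth(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1 + cpu_rate) / (1 + dram_rate) - 1


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Ratio of CPU to DRAM performance after `years`, starting from 1."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years


print(round(gap_growth() * 100, 1))   # percent per year
print(round(gap_after(10), 1))        # gap after ten years
```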

Page 47:

Memory Hierarchy

° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
- Temporal Locality: Locality in Time
- Spatial Locality: Locality in Space
° Three Major Categories of Cache Misses:
• Compulsory Misses: sad facts of life. Example: cold start misses.
• Conflict Misses: reduced by increasing cache size and associativity.
• Capacity Misses: reduced by increasing cache size.
° Virtual Memory invented as another level of the hierarchy
– Today VM allows many processes to share a single memory without having to swap all processes to disk; protection is now more important
– TLBs are important for fast translation/checking

Page 48:

Reducing Misses

• Classifying Misses: 3 Cs
– Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
– Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
– Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)

Page 49:

How Can We Reduce Misses?

• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume total cache size not changed
• What happens if:
1) Change Block Size: Which of 3Cs is obviously affected?
2) Change Associativity: Which of 3Cs is obviously affected?
3) Change Compiler: Which of 3Cs is obviously affected?

Page 50:

The Principle of Locality

• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
• Last 15 years, HW relied on locality for speed

Page 51:

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)

[Figure: the processor exchanges data with Upper Level Memory (holding Blk X), which exchanges blocks with Lower Level Memory (holding Blk Y)]

Page 52:

Cache Measures

• Hit rate: fraction found in that level
– So high that we usually talk about Miss rate
– Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance
• Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
• Miss penalty: time to replace a block from lower level, including time to replace in CPU
– access time: time to lower level = f(latency to lower level)
– transfer time: time to transfer block = f(BW between upper & lower levels)
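
The average memory-access time formula is easy to evaluate directly. The numbers in this sketch (hit times, miss rates, miss penalty in clocks) are hypothetical values chosen for illustration; they show how a cache with the lower miss rate can still have the worse average access time.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache A: fast hits, slightly higher miss rate.
a = amat(hit_time=1, miss_rate=0.05, miss_penalty=50)
# Hypothetical cache B: lower miss rate, but slower hits.
b = amat(hit_time=3, miss_rate=0.03, miss_penalty=50)
print(a, b)
```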

Page 53:

Simplest Cache: Direct Mapped

[Figure: memory with addresses 0 to F mapped onto a 4 Byte Direct Mapped Cache with Cache Index 0 to 3]

• Location 0 can be occupied by data from:
– Memory location 0, 4, 8, ... etc.
– In general: any memory location whose 2 LSBs of the address are 0s
– Address<1:0> => cache index
• Which one should we place in the cache?
• How can we tell which one is in the cache?
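
One common answer to the last question is to store a tag and a valid bit next to each cache line. The sketch below models the 4-byte direct-mapped cache that way; the class name, the replace-on-miss policy and the access sequence are assumptions for illustration.

```python
class DirectMappedCache:
    """Tiny direct-mapped cache of one-byte lines, tracking hits only."""

    def __init__(self, n_lines=4):
        self.n_lines = n_lines
        self.valid = [False] * n_lines
        self.tags = [0] * n_lines

    def access(self, address):
        """Return True on a hit; on a miss, install the new line."""
        index = address % self.n_lines    # Address<1:0> when n_lines == 4
        tag = address // self.n_lines     # remaining upper address bits
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache()
# Addresses 0x0 and 0x4 both map to index 0 and evict each other.
results = [cache.access(a) for a in (0x0, 0x4, 0x0, 0x1, 0x0)]
print(results)
```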

Page 54:

1 KB Direct Mapped Cache, 32B blocks

• For a 2 ** N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block = 2 ** M)

[Figure: 32-bit address split into Cache Tag (bits 31 to 10, example: 0x50), Cache Index (bits 9 to 5, ex: 0x01) and Byte Select (bits 4 to 0, ex: 0x00). Each cache entry has a Valid Bit, a Cache Tag stored as part of the cache state, and 32 bytes of Cache Data: Byte 0 to Byte 31 at index 0, Byte 32 to Byte 63 at index 1, ..., Byte 992 to Byte 1023 at index 31.]
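
The tag / index / byte-select split can be written as a few shifts and masks. This sketch assumes 32-bit addresses and power-of-two sizes; the function name `split_address` is made up, and the test address is built from the slide's example values (tag 0x50, index 0x01, byte select 0x00).

```python
def split_address(address, cache_bytes=1024, block_bytes=32):
    """Return (tag, index, byte_select) for a direct-mapped cache.

    Both sizes are assumed to be powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1                  # M
    index_bits = (cache_bytes // block_bytes).bit_length() - 1  # N - M
    byte_select = address & (block_bytes - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)                 # upper 32 - N bits
    return tag, index, byte_select


address = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(address), split_address(address))
```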

Page 55:

Two-way Set Associative Cache

• N-way set associative: N entries for each Cache Index
– N direct mapped caches operate in parallel (N typically 2 to 4)
• Example: Two-way set associative cache
– Cache Index selects a set from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the tag result

[Figure: two banks, each with Valid, Cache Tag and Cache Data (Cache Block 0, ...), both addressed by the Cache Index; the Adr Tag is compared against both stored tags; the compare results drive Sel0/Sel1 of a Mux that selects the Cache Block and are ORed to produce Hit]
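
The lookup steps for a two-way set associative cache can be sketched in software as follows. Hardware compares both tags in parallel, while this model checks them one after the other; the least-recently-used replacement policy, the class name and the parameters are assumptions, since the slide does not say how a victim is chosen.

```python
class TwoWaySetAssociativeCache:
    """Hit/miss model of a two-way set associative cache."""

    def __init__(self, n_sets, block_bytes):
        self.n_sets = n_sets
        self.block_bytes = block_bytes
        # Stored tag per way; None means the way is not valid.
        self.tags = [[None, None] for _ in range(n_sets)]
        # Way to replace next in each set (assumed LRU policy).
        self.victim = [0] * n_sets

    def access(self, address):
        """Return True on a hit; on a miss, fill the victim way."""
        block = address // self.block_bytes
        index = block % self.n_sets        # Cache Index selects a set
        tag = block // self.n_sets
        ways = self.tags[index]
        hit0 = ways[0] == tag              # compare against way 0
        hit1 = ways[1] == tag              # compare against way 1
        hit = hit0 or hit1                 # OR of the two compares
        if hit:
            used = 0 if hit0 else 1
        else:
            used = self.victim[index]
            ways[used] = tag
        self.victim[index] = 1 - used      # the other way is now least recent
        return hit


cache2 = TwoWaySetAssociativeCache(n_sets=2, block_bytes=4)
# Addresses 0 and 8 share a set but fit in its two ways.
hits = [cache2.access(a) for a in (0, 8, 0, 8)]
print(hits)
```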

Page 56:

3Cs Relative Miss Rate

[Figure: miss rate per type (relative, 0% to 100%) against Cache Size (KB) from 1 to 128, split into Compulsory, Capacity and Conflict misses for 1-way, 2-way, 4-way and 8-way associativity]

Flaws: for fixed block size
Good: insight => invention

Page 57:

Why Do We Need Cache Memory?

Configuration     L1    L2     CPU Clock Speed   Performance
P II w/o Cache    0     0      400 MHz           10 MIPS
386, w L1 Cache   8K    0      33 MHz            27 MIPS
Celeron, 333      32K   0      333 MHz           330/100 MIPS
P II, 350         32K   512K   350 MHz           330/240 MIPS

Page 58:

Main Memory Performance

• Simple: CPU, Cache, Bus, Memory same width (32 bits)
• Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved: CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

Timing model: 1 to send address, 6 access time, 1 to send data
Cache Block is 4 words
Simple M.P. = 4 x (1+6+1) = 32
Wide M.P. = 1 + 6 + 1 = 8
Interleaved M.P. = 1 + 6 + 4x1 = 11
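
The three miss penalties (M.P.) follow from the timing model. The sketch below parameterizes them so other block sizes or latencies can be tried; the function names are illustrative, and the defaults are the slide's values.

```python
def simple_penalty(words=4, send=1, access=6, transfer=1):
    """One-word-wide memory: every word pays address, access and transfer."""
    return words * (send + access + transfer)


def wide_penalty(send=1, access=6, transfer=1):
    """Memory and bus as wide as the block: one access moves the whole block."""
    return send + access + transfer


def interleaved_penalty(words=4, send=1, access=6, transfer=1):
    """Modules are accessed in parallel; words are sent one at a time."""
    return send + access + words * transfer


print(simple_penalty(), wide_penalty(), interleaved_penalty())
```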

Page 60:

I/O System Design Issues

[Figure: Processor with Cache on a Memory - I/O Bus shared with Main Memory and three I/O Controllers (two Disks; Graphics; Network); interrupts go to the Processor]

• Systems have a hierarchy of busses as well (PC: memory, PCI, ESA)

Page 61:

Page 62:

Key Technologies

° Fast, cheap, highly integrated Computers-on-a-chip
• IDT R4640, NEC VR4300, StrongARM, Superchips
° Affordable access to fast networks
• ISDN, Cable Modems, ATM, . . .
° Platform independent programming languages
• Java, JavaScript, Visual Basic Script
° Lightweight Operating Systems
• GEOS, NCOS, RISCOS
° ???

Page 63:

Future of Computer Architecture and Engineering

• Performance

• High Level Computer Architecture

• Multiprocessors

• IRAM

Page 64:

Processor Performance

[Figure: performance (0 to 300) against year, 1982 to 1995, with curves for RISC and Intel x86; annotations: "35%/yr" and "RISC introduction"]

Page 65:

SPECfp95base Performance (Oct. 1997)

[Figure: SPECfp score (0 to 60) per benchmark (tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5) for three processors: PA-8000, 21164, PPro]

Page 67:

1985 Computer Food Chain

[Figure: PC, Workstation, Minicomputer, Mainframe, Vector Supercomputer; label: "Big Iron"]

Page 68:

1995 Computer Food Chain

[Figure: PC, Workstation, Mainframe, Vector Supercomputer, Massively Parallel Processors, Minicomputer; annotations: "(hitting wall soon)" and "(future is bleak)"]

Page 69:

2005 Computer Food Chain

[Figure: Mainframe, Vector Supercomputer, Minicomputer, Portable Computers, Networks of Workstations/PCs, Massively Parallel Processors]

Page 70:

Interconnection Networks

° Switched vs. Shared Media: pairs communicate at same time: "point-to-point" connections

Page 71:

[Figure: four parallel-machine organizations built from processors (P), memories (M), network interfaces (NI), disks (D) and I/O. SMP: several P sharing one M, with an NI on the I/O Bus. MPP: P/M nodes with NIs on a Fast, Switched Network (Fast Communication). Cluster/Network of Workstations (NOW): P/M/NI/D nodes on a Fast, Switched Network (General Purpose; Incremental Scalability, Timeliness). Distributed Comp.: P/M/NI/D nodes on a Slow, Scalable Network.]

Page 72:

Intelligent DRAM (IRAM)

• IRAM motivation (2000 to 2005)
– 256 Mbit/1Gbit DRAMs in near future (128 MByte)
– Current CPUs starved for memory BW
– On chip memory BW = SQRT(Size)/RAS or 80 GB/sec
– 1% of Gbit DRAM = 10M transistors for processor
– Even in DRAM process, a 10M trans. CPU is attractive
– Package could be network interface vs. Addr./Data pins
– Embedded computers are increasingly important
• Why not re-examine computer design based on separation of memory and processor?
– Compact code & data?
– Vector instructions?
– Operating systems? Compilers? Data Structures?

Page 73:

IRAM Vision Statement

Microprocessor & DRAM on a single chip:
– on-chip memory latency 5-10X, bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
– adjustable memory size/width

[Figure: a conventional system, with Proc, caches ($, L2$) and Bus from a logic fab and separate DRAM chips from a DRAM fab, next to an IRAM chip that combines Proc, Bus, DRAM and I/O]

Page 74:

and why not
° multiprocessors on a chip?
° complete systems on a chip?
• memory + processor + I/O
° computers in your credit card?
° networking in your kitchen? car?
° eye tracking input devices?