Post on 21-Dec-2015
Chapter 13 (In Part)
Computer Architecture
Pipelines and caches
Diagrams are from Computer Architecture: A Quantitative Approach, 2nd ed., Hennessy and Patterson. Use these note slides as your primary study source.
Instruction Execution
1. Instruction Fetch (IF)
   IR ← Mem[PC];  NPC ← PC + 4
2. Instruction Decode/Register Fetch (ID)
   A ← Regs[Rs1];  B ← Regs[Rs2];  Imm ← (IR16)^16 ## IR16..31  (sign-extended immediate)
Instruction Execution
3. Execution (EX)
   The instruction has been decoded, so execution can be split according to instruction type.
   Reg-Reg ALU:  ALUout ← A op B
   Reg-Imm ALU:  ALUout ← A op Imm
   Branch:       ALUout ← NPC + Imm  (target);  cond ← (A {=, !=} 0)
   LD/ST:        ALUout ← A + Imm  (effective address)
   Jump:         ??
Instruction Execution
4. Memory Access/Branch Completion (MEM)
   Load:     LMD ← MDR ← Mem[ALUout]
   Store:    Mem[ALUout] ← B
   Branch:   PC ← (cond) ? ALUout : NPC
   Jump/JAL: ??
   JR:       PC ← A
   else:     PC ← NPC
5. Write-Back (WB)
   ALU instruction: Regs[Rd] ← ALUout
   Load:            Regs[Rd] ← LMD
   JAL:             R31 ← old PC
What is Pipelining?
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
Pipelining
Time     1    2    3    4    5    6    7
Instr 1  IF   ID   EX   MEM  WB
Instr 2       IF   ID   EX   MEM  WB
Instr 3            IF   ID   EX   MEM  WB
Instr 4                 IF   ID   EX   MEM
Instr 5                      IF   ID   EX

What are the overheads of pipelining? What are the complexities? What types of pipelines are there?
Pipelining
Pipeline Overhead
• Latches between stages: must latch both data and control
• Expansion of stage delay:
  • Original time was IF + ID + EX + MEM + WB, with phases skipped for some instructions
  • New best time is MAX(IF, ID, EX, MEM, WB) + latch delay
• More memory ports or separate memories
• Requires complicated control to make it as fast as possible
  • But so do non-pipelined high-performance designs
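With assumed per-stage delays (the numbers below are illustrative, not from the text), the "MAX stage + latch delay" clock and the resulting ideal speedup work out as:

```python
# Per-stage delays in ns (assumed for illustration)
stage = {"IF": 10, "ID": 8, "EX": 10, "MEM": 10, "WB": 7}
latch = 1  # latch delay between stages, also assumed

unpipelined = sum(stage.values())     # one instruction end to end: 45 ns
cycle = max(stage.values()) + latch   # new best clock period: 11 ns
print(unpipelined, cycle)             # 45 11

# Ideal speedup once the pipeline is full (one instruction per cycle)
print(round(unpipelined / cycle, 2))  # 4.09
```

Note the speedup is below 5 even with five stages: the latch delay and the unequal stage lengths are exactly the overheads the slide lists.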
Pipelining
Pipeline Complexities
• 3, 5, or 10 instructions are executing at the same time. What if the instructions are dependent?
  • Must have hardware or software ways to ensure proper execution.
• What if an interrupt occurs?
  • Would like to be able to figure out which instruction caused it!
Pipeline Stalls – Structural
What causes pipeline stalls?
1. Structural Hazard
Inability to perform 2 tasks at once. Example:
• Memory is needed for both instructions and data
• SpecInt92 has 35% loads and stores
• If this remains a hazard, each load or store will stall one IF.
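Using the slide's SpecInt92 figure, the cost of leaving this structural hazard in place can be checked quickly:

```python
# Single memory port: every load/store stalls one instruction fetch.
base_cpi = 1.0        # ideal CPI
loadstore_frac = 0.35 # SpecInt92 load/store frequency (from the slide)
stall_per_ls = 1      # one IF stalled per load/store

cpi = base_cpi + loadstore_frac * stall_per_ls
print(cpi)  # 1.35
```

A 35% CPI penalty is why real designs use separate instruction and data caches (or ports) rather than living with this hazard.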
Pipeline Stalls – Data
2.1 Data Hazards – Read After Write (RAW)
Data is needed before it is written.
TIME =         1    2    3    4    5    6    7    8
ADD R1,R2,R3   IF   ID   EX   MEM  WB
ADD R4,R1,R5        IF   ID   EX   MEM  WB
ADD R4,R1,R5             IF   ID   EX   MEM  WB
ADD R4,R2,R3                  IF   ID   EX   MEM  WB

R1's value is calculated at the end of EX (time 3) but not written until WB (time 5).
The second and third instructions read R1 in ID (times 3 and 4) and use it in EX,
before the write has happened.
Pipeline Stalls – Data
• Data must be forwarded to other instructions in the pipeline if it is ready for use but not yet stored.
• In the 5-stage MIPS pipeline, this means forwarding to the next three instructions.
• Can reduce this to 2 instructions if the register bank can process a simultaneous read and write.
  • The text describes this as writing in the first half of WB and reading in the second half of ID.
  • More likely, both occur at once, with the read receiving the data as it is written.
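A minimal sketch of the forwarding decision for one ALU input. The latch field names are invented for illustration; the key point, as in real designs, is that the newest in-flight value (EX/MEM) takes priority over the older one (MEM/WB).

```python
def forward_a(id_ex_rs1, ex_mem, mem_wb):
    """Choose where the first ALU operand comes from for the instruction in EX."""
    # A newer instruction about to write the register we read wins first.
    if ex_mem["reg_write"] and ex_mem["rd"] == id_ex_rs1:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] == id_ex_rs1:
        return "MEM/WB"
    return "REGFILE"   # no hazard: read the register file normally

ex_mem = {"reg_write": True, "rd": "r1"}
mem_wb = {"reg_write": True, "rd": "r1"}
print(forward_a("r1", ex_mem, mem_wb))  # EX/MEM
print(forward_a("r5", ex_mem, mem_wb))  # REGFILE
```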
Pipeline Stalls – Data
Load Delays
Data is available at the end of MEM, not in time for the following instruction's EX phase.
Pipeline Stalls – Data
So insert a bubble…
Pipeline Stalls – Data
A store requires two registers at two different times:
• Rs1 is required for address calculation during EX
• The data to be stored (Rd) is required during MEM

Data can be loaded and immediately stored; data from the ALU must be forwarded to MEM for stores.
Pipeline Stalls – Data
2.2 Data Hazards – Write After Write (WAW)
Out-of-order completion of instructions.
FP pipeline stages: IF → ID/RF → ISSUE → {FMUL1, FMUL2} or {FDIV1, FDIV2, FDIV3, FDIV4} → WB

The length of the FDIV pipe could cause a WAW error: the divide issues first but finishes last, so the later multiply's result could be overwritten.

FDIV R1, R2, R3
FMUL R1, R2, R3
Pipeline Stalls – Data
2.3 Data Hazards – Write After Read (WAR)
A slow reader followed by a fast writer.
FP pipeline stages: IF → ID → ISSUE → {FLOAD1} or {FMADD1, FMADD2, FMADD3, FMADD4} → WB

The registers for the multiply are read at issue, but the registers for the add are read later in the FMADD pipe, so a fast FLOAD that writes one of those registers could complete before the late read:

FMADD R1, R2, R3, R4   IF  ID  IS  FM1  FM2  FM3  FM4  WB
FLOAD R4, #300(R5)         IF  ID  IS   F1   WB
Pipeline Stalls – Data
2.4 Data Hazards – Read After Read RAR
A read after a read is not a hazard
• Assuming that a read does not change any state
Pipeline Stalls – Control
3. Control Hazards
Branch, Jump, Interrupt
Or why computers like straight-line code…
3.1 Control Hazards – Jump

Jump         IF  ID  EX  MEM  WB
Next instr       IF  ID  EX   MEM  WB

• When is knowledge of the jump available?
• When is the target available?
With absolute addressing, it may be possible to fit the jump into the IF phase with zero overhead.
Pipeline Stalls – Control
• This relies on performing a simple decode (2 bits for SPARC, 6 bits for others) in IF.
• If IF is currently the slowest phase, this will slow the clock.
• Relative jumps would require an adder and its delay.
Pipeline Stalls – Control
3.2 Control Hazards – Branch

Time       1   2   3   4   5   6   7   8   9
Branch     IF  ID  EX  MEM WB
(stall)
(stall)
(stall)
Target                     IF  ID  EX  MEM WB

The branch is resolved in MEM and the information is forwarded to IF, causing three stalls.
16% of the SpecInt instructions are branches, and about 67% are taken.
CPI = CPI_ideal + stalls = 1 + 3 × (0.67)(0.16) ≈ 1.32

Must:
• Resolve the branch sooner
• Update the PC sooner
• Fetch the correct next instruction more frequently
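The CPI arithmetic above can be verified directly with the slide's numbers:

```python
branch_frac = 0.16  # fraction of SpecInt instructions that are branches
taken_frac = 0.67   # fraction of branches that are taken
stall = 3           # stalls when the branch resolves in MEM

cpi = 1 + stall * taken_frac * branch_frac
print(round(cpi, 2))  # 1.32
```

A 32% slowdown from branches alone is what motivates the delay-slot and prediction techniques that follow.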
Pipeline Stalls – Control
Delayed branch and the branch delay slot
• Ideally, fill the slot with an instruction from before the branch
• Otherwise from the target (since branches are taken more often than not)
• Or from the fall-through path
• Slots are filled 50-70% of the time
• Interrupt recovery is complicated
Memory Hierarchy

CPU registers:          3-10 accesses/cycle, 32-64 words
On-chip cache:          1-2 accesses/cycle, 5-10 ns, 16KB-2MB
Off-chip cache (SRAM):  5-20 cycles/access, 10-40 ns, 1MB-16MB, ~$6/MB
Main memory (DRAM):     20-200 cycles/access, 60-120 ns, 64MB-many GB, ~$0.28/MB
Disk or network (via disk buffer): 1M-2M cycles/access, 4GB-many TB, ~$0.65/GB

Transfer units: words (registers ↔ cache), blocks or lines of 8-128 bytes (cache ↔ memory), pages of 1KB-16KB (memory ↔ disk), sectors (on disk).
Generalized Caches
Movement of Technology

Machine       CPI   Clock (ns)  Main Memory (ns)   Miss Penalty (cycles)  Miss Penalty (instr.)
VAX 11/780    10    200         1200               6                      0.6
Alpha 21064   0.5   5           70                 14                     28
Alpha 21164   0.25  2           60                 30                     120
Pentium IV    ??    0.5         ~5 (DDR, < RMBS)   ??                     ??
Cache Example: 21064
Example cache: Alpha 21064
• 8 KB cache, with 34-bit addressing
• 256-bit lines (32 bytes)
• Block placement: direct mapped
  • One possible place for each address
  • Multiple addresses for each possible place

Address (bits 33..0):  | Tag | Cache Index | Offset |

• Each cache line includes a valid bit, a tag, and a data block.
Cache Example: 21064
Cache operation
• Send the address to the cache
• Parse the address into offset, index, and tag
• Decode the index into a line of the cache; prepare the cache for reading (precharge)
• Read the line of the cache: valid, tag, data
• Compare the tag with the tag field of the address
  • Miss if no match
• Select the word according to the byte offset and read or write
If there is a miss…
• Stall the processor while reading in the line from the next level of the hierarchy
  • Which in turn may miss and read from main memory
  • Which in turn may miss and read from disk
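The "parse the address into offset, index, and tag" step can be sketched for the 21064-style cache above: 8 KB direct mapped with 32-byte lines gives 256 lines, so a 5-bit offset, an 8-bit index, and a 21-bit tag out of the 34 address bits.

```python
OFFSET_BITS = 5   # 32-byte lines
INDEX_BITS = 8    # 256 lines (8 KB / 32 B)

def split_address(addr):
    """Split a 34-bit address into (tag, index, offset) for this cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)  # remaining 21 of 34 bits
    return tag, index, offset

print(split_address(0xABCD))  # (5, 94, 13)
```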
Virtual Memory
Block:     1KB-16KB (page/segment)
Hit:       20-200 cycles (DRAM access)
Miss:      700,000-6,000,000 cycles (page fault)
Miss rate: 1 in 0.1-10 million
Differences from cache
• Implement the miss strategy in software
• Hit/miss time factor of 10,000+ (vs. 10-20 for cache)
• Critical concerns are
  • Fast address translation
  • A miss ratio as low as possible without ideal knowledge
Virtual Memory
Q0: Fetch strategy
• Swap pages on task switch
• May pre-fetch the next page if extra transfer time is the only issue
• May include a disk cache
Q1: Block placement
• Anywhere (fully associative): random access is easily available, and the time to place a block well is tiny compared to the miss penalty.
Virtual Memory
Q2: Finding a block
• Page table
  • A list of VPNs (virtual page numbers) and physical addresses (or disk locations)
  • Consider a 32-bit VA, a 30-bit PA, and 2 KB pages.
    • The page table has 2^32 / 2^11 = 2^21 entries, for perhaps 2^25 bytes or 2^14 pages.
    • The page table must be in virtual memory (segmented paging)
    • The system page table must always be in memory.
• Translation look-aside buffer (TLB)
  • A cache of address translations
  • Hit in 1 cycle (no stalls in the pipeline)
  • A miss results in a page table access (which could lead to a page fault): perhaps 10-100 OS instructions.
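The page-table sizing above can be checked in a few lines. The 16-byte entry size is an assumption, chosen so the slide's 2^25-byte figure comes out:

```python
va_bits = 32     # virtual address bits
page_bits = 11   # 2 KB pages
pte_bytes = 16   # assumed page-table-entry size

entries = 2 ** (va_bits - page_bits)    # 2^21 entries
table_bytes = entries * pte_bytes       # 2^25 bytes
table_pages = table_bytes >> page_bits  # 2^14 pages
print(entries, table_bytes, table_pages)  # 2097152 33554432 16384
```

A 32 MB table is far too big to pin in physical memory, which is why the page table itself lives in virtual memory and only the system page table is always resident.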
Virtual Memory
Q3: Page replacement
• LRU is used most often (really, approximations of LRU with a fixed time window).
• The TLB will support determining which translations have been used.
Q4: Write policy
Write through or write back?

Write through: data is written to both the block in the cache and the block in lower-level memory.
Write back: data is written only to the block in the cache, and written to the lower level only when the block is replaced.
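A toy contrast of the two policies. The class structure is invented for illustration; `lower` stands for the next level of the hierarchy, and the dirty set marks write-back blocks that still owe a write.

```python
class Cache:
    def __init__(self, write_back):
        self.data, self.dirty = {}, set()
        self.write_back = write_back
        self.lower = {}   # stand-in for the next level of the hierarchy

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_back:
            self.dirty.add(addr)      # defer: lower level updated on eviction
        else:
            self.lower[addr] = value  # write through: update both levels now

    def evict(self, addr):
        if self.write_back and addr in self.dirty:
            self.lower[addr] = self.data[addr]  # pay the deferred write
            self.dirty.discard(addr)
        self.data.pop(addr, None)

wt, wb = Cache(write_back=False), Cache(write_back=True)
wt.write(0, 42); wb.write(0, 42)
print(wt.lower.get(0), wb.lower.get(0))  # 42 None
wb.evict(0)
print(wb.lower.get(0))                   # 42
```

Write back turns many writes to the same block into one lower-level write, at the cost of tracking dirty state and a slower eviction.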
Virtual Memory
Memory protection
• Must index page table entries by PID
• Flush the TLB on task switch
• Verify access to a page before loading it into the TLB
• Provide OS access to all memory, physical and virtual
• Provide some untranslated addresses to the OS for I/O buffers
Address Translation
TLB – typically
• 8-32 entries• Set-associative or fully associative• Random or LRU replacement• Two or more ports (instruction and data)
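A toy fully associative TLB with random replacement, along the lines of the parameters above. The sizes and the dictionary page-table representation are assumptions for illustration.

```python
import random

class TLB:
    def __init__(self, size=16):
        self.size = size
        self.entries = {}   # vpn -> pfn (fully associative)

    def translate(self, vpn, page_table):
        if vpn in self.entries:              # hit: 1 cycle, no pipeline stall
            return self.entries[vpn]
        pfn = page_table[vpn]                # miss: walk the page table
        if len(self.entries) >= self.size:   # full: random replacement
            victim = random.choice(list(self.entries))
            del self.entries[victim]
        self.entries[vpn] = pfn
        return pfn

tlb = TLB(size=2)
pt = {1: 100, 2: 200, 3: 300}
print(tlb.translate(1, pt), tlb.translate(1, pt))  # 100 100 (miss, then hit)
```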
• Why use a pipeline? It seems so complicated.
• What are some of the overheads?
• What is the deal with memory hierarchies?
• Why are the caches so small? Why not make them larger?
• Do I have to worry about any of this when I am writing code?
• RISC and CISC again. What is the big deal and difference?
Summary
Pipelines and Memory Hierarchies
Other Architectures
Motorola 68000 Family (circa late 1970s)

• Instructions can specify 8-, 16-, or 32-bit operands.
• Has 32-bit registers.
• Has 2 register files, A and D, each with eight 32-bit registers (A is for addresses and D is for data). This saves bits in the opcode, since the use of A or D is implicit in the instruction.
• Uses a two-address instruction set.
• Instructions are usually 16 bits but can have up to 4 more 16-bit words.
• Supports lots of addressing modes.
Other Architectures
Intel x86 Family (circa late 1970s)

• Probably the least elegant design available.
• Uses 16-bit words, so only 64K unique addresses can be provided.
• Used segment addressing to overcome this limit: basically, shift the segment address left by 4 (multiply by 16) and then add a 16-bit offset, giving an effective 20-bit address and thus 1 MB of memory.
• Has 4 segment registers used to hold segment addresses: CS, DS, SS, ES.
• Has 4 general registers: AX, BX, CX, and DX.
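The segment calculation described above, as code (the 0xFFFFF mask reflects the 20-bit effective address):

```python
def physical_address(segment, offset):
    """8086-style real-mode address: (segment << 4) + 16-bit offset, 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
```

Note that many segment:offset pairs map to the same physical address, since consecutive segment values overlap every 16 bytes.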
Other Architectures
Sun SPARC Architecture (late 1980s)

• Scalable Processor ARChitecture
• Developed around the same time as the MIPS architecture.
• Very similar to MIPS except for register usage.
• Has a set of 8 global registers, like MIPS.
• Has a large set of registers, called a register file, from which only a subset can be accessed at any time.
• Is a load/store architecture with 3-address instructions.
• Has 32-bit registers, and all instructions are 32 bits.
• Designed to support procedure calls and returns efficiently.
• The number of registers in the register file varies with the implementation, but at most 24 can be accessed at a time.
• Those 24 plus the 8 globals give a register window.