Pipelining 9/15, Appendix A & Ch. 5: Intro, Basic Pipelining & Memory Hierarchy Review
TRANSCRIPT
Pipelining.1 9/15
Appendix A & Ch. 5
Intro, Basic Pipelining & Memory Hierarchy Review
Pipelining.2 9/15
Simple RISC Pipeline (MIPS)

                      Clock number
  Instruction #       1    2    3    4    5    6    7    8
  Instruction I       IF   ID   EX   MEM  WB
  Instruction I + 1        IF   ID   EX   MEM  WB
  Instruction I + 2             IF   ID   EX   MEM  WB
  Instruction I + 3                  IF   ID   EX   MEM  WB
  Instruction I + 4                       IF   ID   EX   MEM
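The staggering above is mechanical; a minimal Python sketch that regenerates it (the stage names and 5-stage depth come from the slide, everything else is illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions, stages=STAGES):
    """Print which stage each instruction occupies in each clock cycle."""
    for i in range(n_instructions):
        # Instruction i enters IF in cycle i+1 and advances one stage per cycle.
        row = ["     "] * i + [f"{s:<5}" for s in stages]
        print(f"Instruction I+{i}:  " + "".join(row))

pipeline_diagram(5)
```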
Pipelining.3 9/15
5 Steps of MIPS Datapath

[Figure: the five-stage MIPS datapath: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Elements shown: Next PC logic with a +4 adder producing Next SEQ PC, instruction memory (Address in), register file reads of RS1/RS2 and write of RD with WB data, sign extension of Imm, MUXes into the ALU, the Zero? test, data memory, and the write-back MUX. Both the datapath and the control path are pipelined, so Inst 1, Inst 2, Inst 3 occupy successive stages.]
Pipelining.4 9/15
Limits to pipelining
• Hazards: circumstances that would cause incorrect execution if next instruction were launched
– Structural hazards: Attempting to use the same hardware to do two different things at the same time
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Pipelining.5 9/15
Structural Hazard Example: Same Register File

Cycle 5: the Load instruction and Instr 3 use the same register file.

[Figure: pipeline diagram over cycles 1-7 for Load and Instr 1-4 (Ifetch, Reg, ALU, DMem, Reg); in cycle 5 the Load's register write and Instr 3's register read both need the register file: a structural hazard.]

Solution: write in the 1st half of the clock cycle, read in the 2nd half.
Pipelining.6 9/15
Resolving structural hazards

• Defn: attempt to use the same hardware for two different things at the same time
• Solution 1: Wait. Must detect the hazard, and must have a mechanism to stall
• Solution 2: Modify the pipeline
Pipelining.7 9/15
Data Hazards (Data Dependencies)
Figure A.6, CA:AQA 3e

Instr. order (stages IF, ID/RF, EX, MEM, WB over time in clock cycles):
  dadd r1,r2,r3
  dsub r4,r1,r3
  and  r6,r1,r7
  or   r8,r1,r9
  xor  r10,r1,r11

[Figure: pipeline diagram; the instructions after dadd read r1 while dadd is still in the pipeline, before its WB.]
Pipelining.8 9/15
Three Generic Data Hazards

• Read After Write (RAW): Instr J tries to read an operand before Instr I writes it.

    I: add r1,r2,r3
    J: sub r4,r1,r3

• Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Pipelining.9 9/15
Three Generic Data Hazards

• Write After Read (WAR): Instr J writes an operand before Instr I reads it.

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Occurs in superscalars.
Pipelining.10 9/15
Three Generic Data Hazards

• Write After Write (WAW): Instr J writes an operand before Instr I writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• WAR and WAW occur in superscalar pipelines.
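These three cases can be told apart purely from each instruction's destination and source registers. A hedged sketch (the (dest, sources) encoding is illustrative, not a real MIPS decoder):

```python
def classify_hazards(instr_i, instr_j):
    """Classify data hazards between an earlier instruction I and a later J.

    Each instruction is (dest_reg, {source_regs}), registers named like "r1".
    Returns the subset of {"RAW", "WAR", "WAW"} that applies.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")   # J reads what I writes: true dependence
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")   # J writes what I reads: anti-dependence
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")   # both write the same register: output dependence
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3 -> {'RAW'}
print(classify_hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))
```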
Pipelining.11 9/15
Forwarding to Avoid Data Hazard
Figure A.7, CA:AQA 3e

Instr. order:
  dadd r1,r2,r3
  dsub r4,r1,r3
  and  r6,r1,r7
  or   r8,r1,r9
  xor  r10,r1,r11

[Figure: same sequence as Figure A.6, but the r1 value is forwarded from the EX/MEM and MEM/WB pipeline registers straight to the ALU inputs of the following instructions, so no stalls are needed.]
Pipelining.12 9/15
Data Hazard Even with Forwarding (Load, Immediate Use)
Figure A.9, CA:AQA 3e

Instr. order:
  ld  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure: pipeline diagram; the load's data is not available until the end of its DMem stage, but sub needs r1 at the start of its ALU stage, so forwarding alone cannot resolve this.]
Pipelining.13 9/15
Resolving the Load Data Hazard

Instr. order:
  ld  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure: pipeline diagram with a one-cycle bubble inserted after the load; the loaded value is then forwarded to the ALU.]

One clock stall & forward data from memory (cache) to the ALU.
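The interlock that inserts this bubble reduces to one condition, in the spirit of the standard MIPS hazard-detection unit; a minimal sketch (field names are illustrative):

```python
def needs_load_use_stall(id_ex_is_load, id_ex_dest, if_id_src1, if_id_src2):
    """True if the instruction in ID must stall one cycle behind a load in EX.

    id_ex_is_load: the instruction currently in EX is a load
    id_ex_dest:    its destination register
    if_id_src1/2:  source registers of the instruction currently in ID
    """
    return id_ex_is_load and id_ex_dest in (if_id_src1, if_id_src2)

# ld r1,0(r2) in EX, sub r4,r1,r6 in ID -> stall one cycle
print(needs_load_use_stall(True, "r1", "r1", "r6"))  # True
```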
Pipelining.14 9/15
Software Scheduling Avoids Load Hazards (done by the compiler)

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:          Fast code:
  LW  Rb,b            LW  Rb,b
  LW  Rc,c            LW  Rc,c
  ADD Ra,Rb,Rc        LW  Re,e
  SW  a,Ra            ADD Ra,Rb,Rc
  LW  Re,e            LW  Rf,f
  LW  Rf,f            SW  a,Ra
  SUB Rd,Re,Rf        SUB Rd,Re,Rf
  SW  d,Rd            SW  d,Rd
Pipelining.15 9/15
Control Hazard on Branches => Three-Stage Stall (branch penalty!)

  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11

[Figure: pipeline diagram; the three instructions after the branch are fetched before the branch outcome is known, costing three cycles when the branch is taken.]
Pipelining.16 9/15
Example: Branch Stall Impact

• If 30% of instructions are branches and each stalls 3 cycles, the impact is significant: CPI_new = 1.9!
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch address earlier
• The MIPS branch tests whether a register = 0 or != 0
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for branch versus 3
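The 1.9 figure is just the base CPI plus the branch-stall contribution; a one-line check, assuming a base CPI of 1:

```python
base_cpi, branch_freq, branch_penalty = 1.0, 0.30, 3
cpi_new = base_cpi + branch_freq * branch_penalty   # 1 + 0.3 * 3 = 1.9
print(cpi_new)
```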
Pipelining.17 9/15
Pipelined MIPS Datapath
Figure A.24, CA:AQA 3/e

[Figure: the five-stage datapath revised so that the branch adder and the Zero? test sit in the Instr. Decode / Reg. Fetch stage; Next SEQ PC and the taken-branch target are thus available one stage earlier. Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; register file (RS1, RS2, RD), sign-extended Imm, ALU, data memory, and write-back MUX as before.]

• Data stationary control: local decode for each instruction phase / pipeline stage.
Pipelining.18 9/15
Four Simple Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – "Squash" instructions in the pipeline if the branch is actually taken
  – Advantage of late pipeline state update
  – 47% of MIPS branches not taken on average
  – PC+4 already calculated, so use it to get the next instruction
#3: Predict Branch Taken
  – 53% of MIPS branches taken on average
  – But the branch target address is not yet calculated in MIPS
    » MIPS still incurs a 1-cycle branch penalty
    » Other machines: branch target known before outcome
Pipelining.19 9/15
Predict Not Taken: branch untaken vs. taken

Untaken branch:
  Untaken branch instruction  IF  ID  EX   MEM  WB
  Instruction I + 1               IF  ID   EX   MEM  WB
  Instruction I + 2                   IF   ID   EX   MEM  WB
  Instruction I + 3                        IF   ID   EX   MEM
  Instruction I + 4                             IF   ID   EX

Taken branch:
  Taken branch instruction    IF  ID  EX   MEM  WB
  Instruction I + 1               IF  idle idle idle idle
  Branch target                       IF   ID   EX   MEM  WB
  Branch target + 1                        IF   ID   EX   MEM
  Branch target + 2                             IF   ID   EX
Pipelining.20 9/15
Delayed branch: behavior is the same whether taken or untaken

Untaken branch:
  Untaken branch instruction        IF  ID  EX  MEM  WB
  Branch delay instruction (I + 1)      IF  ID  EX   MEM  WB
  Instruction I + 2                         IF  ID   EX   MEM  WB
  Instruction I + 3                             IF   ID   EX   MEM
  Instruction I + 4                                  IF   ID   EX

Taken branch:
  Taken branch instruction          IF  ID  EX  MEM  WB
  Branch delay instruction (I + 1)      IF  ID  EX   MEM  WB
  Branch target                             IF  ID   EX   MEM  WB
  Branch target + 1                             IF   ID   EX   MEM
  Branch target + 2                                  IF   ID   EX
Pipelining.21 9/15
Four Simple Branch Hazard Alternatives

#4: Delayed Branch
  – Define the branch to take place AFTER a following instruction:

      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n    <- branch delay of length n
      ........
      branch target if taken

  – A 1-slot delay allows a proper decision and branch target address in a 5-stage pipeline
  – Used in MIPS
Pipelining.22 9/15
Scheduling the Branch Delay Slot

[Figure: the ways of filling the delay slot (from before the branch, from the target, from fall-through); the choices are enumerated on the next slide.]
Pipelining.23 9/15
Delayed Branch

• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From fall-through: only valuable when the branch is not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), one delay slot no longer covers the branch delay
Pipelining.24 9/15
Precise Exceptions in Static Pipelines

Key observation: architected state changes only in the memory and register-write stages.
Pipelining.25 9/15
Deeper Pipeline Example: MIPS R4000

[Figure: the R4000's eight-stage pipeline: IF, IS, RF, EX, DF, DS, TC, WB.]
Pipelining.26 9/15
Load with immediate use = 2-cycle penalty
Pipelining.27 9/15
Load with immediate use: 2-cycle penalty

                  1   2   3   4      5      6      7   8   9
  LD   R1,..      IF  IS  RF  EX     DF     DS     TC
  DADD R2,R1,..       IF  IS  RF     stall  stall  EX  DF  DS
  DSUB R3,R1,..           IF  IS     stall  stall  RF  EX  DF
  OR   R4,R1,..               IF     stall  stall  IS  RF  EX
Pipelining.28 9/15
R4000 taken/untaken branch penalty

Taken branch:
                         1   2   3      4      5      6   7
  branch instruction     IF  IS  RF     EX     DF     DS  TC
  delay slot                 IF  IS     RF     EX     DF  DS
  (stall)                        stall  stall  stall  stall  stall
  (stall)                               stall  stall  stall  stall
  branch target                         IF     IS     RF

Untaken branch:
                         1   2   3   4   5   6   7
  branch instruction     IF  IS  RF  EX  DF  DS  TC
  delay slot                 IF  IS  RF  EX  DF  DS
  branch instruction+2           IF  IS  RF  EX  DF
  branch instruction+3               IF  IS  RF  EX
Pipelining.29 9/15
Architecture: Abstraction Layers in Modern Systems

[Figure: the layer stack, top to bottom: Application, Algorithm, Programming Language, Operating System/Virtual Machine, Instruction Set Architecture (ISA), Microarchitecture, Gates/Register-Transfer Level (RTL), Circuits, Devices, Physics. The original domain of the computer architect ('50s-'80s) spanned the lower layers; the domain of computer architecture in the '90s centered on the ISA; reliability, power, parallel computing, and security drove a reinvigoration of computer architecture from the mid-2000s onward.]
Pipelining.30 9/15
Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: processor (control + datapath) with registers and on-chip cache, then second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape). Speed: 1 ns, 10 ns, 100 ns, 10s of ms (disk), 10s of seconds (tape). Size (bytes): KB, MB, 100s of GB (disk), TB (tape).]
Pipelining.31 9/15
Memory Hierarchy: Terminology

• Hit: the data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data must be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)

[Figure: the processor exchanges Blk X with the upper-level memory and Blk Y with the lower-level memory.]
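The slide stops short of naming it, but these terms combine into the standard average memory access time (AMAT) formula; a quick sketch:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays hit_time;
    a fraction miss_rate additionally pays miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# e.g. 1-cycle hit, 5% miss rate, 100-cycle miss penalty -> 6 cycles
print(amat(1, 0.05, 100))
```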
Pipelining.32 9/15
Memory Hierarchy Technology

• Random Access:
  – "Random" is good: access time is the same for all locations
  – DRAM: Dynamic Random Access Memory
    » high density, low power, cheap, slow
    » dynamic: needs to be "refreshed" regularly
  – SRAM: Static Random Access Memory
    » low density, high power, expensive, fast
    » static: content lasts "forever" (until power is lost)
• "Not-so-random" Access Technology:
  – Access time varies from location to location and from time to time
  – Examples: disk, CDROM, DRAM page-mode access
• Sequential Access Technology: access time linear in location (e.g., tape)
• Main memory: DRAMs; caches: SRAMs; plus Flash
Pipelining.33 9/15
Why a Memory Hierarchy?

• The processor-DRAM performance gap keeps growing
• The two goals (speed and size) conflict
• Principle of locality: programs access a relatively small portion of their address space; if an item is referenced:
  – it will tend to be referenced again soon (e.g., loops): temporal locality
  – nearby items will tend to be referenced soon (straight-line code): spatial locality
• How the memory hierarchy works
• Goal of the memory hierarchy: to achieve
  – the speed of the highest level (cache)
  – the size of the lowest level (disk)
Pipelining.34 9/15
Memory Hierarchy Example: Apple iMac G5, powered by the PowerPC 970 (prior to Core Duo)

iMac G5, 1.6 GHz:

                      Reg     L1 Inst  L1 Data  L2      DRAM    Disk
  Size                1K      64K      32K      512K    256M    80G
  Latency (cycles)    1       3        3        11      88      10^7
  Latency (time)      0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Managed by the compiler (registers), by hardware (caches), and by the OS, hardware, and application (main memory and disk).

Goal: the illusion of a large, fast, cheap memory.
Pipelining.35 9/15
Intel Nehalem Cache Hierarchy
Pipelining.36 9/15
Four Questions for Caches and the Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Pipelining.37 9/15
Format of a Block / Line in Cache

  | Valid | Tag | Data |

• Tag: high-order bits of the memory address of the data
  – memory size = 2^32, cache size = 2^r: how many bits should the tag field have? (32 - r)
• Valid = 1: valid block/line; 0: invalid block/line
Pipelining.38 9/15
Q2: How is a block found in the upper level?

  | Block Address: Tag | Index | Block offset |
    (Index = Set Select; Block offset = Data Select)

• Index used to look up candidates in the cache
  – The index identifies the set
• Tag used to identify the actual copy
  – If no candidates match, declare a cache miss
• Block is the minimum quantum of caching
  – The data select field is used to select data within the block
  – Many caching applications don't have a data select field
Pipelining.39 9/15
Block Size and Spatial Locality

The block is the unit of transfer between the cache and memory.

Split CPU address:
  | block address (32-b bits) | offset (b bits) |
  2^b = block size, a.k.a. line size (in bytes)

[Figure: a 4-word block (Word0, Word1, Word2, Word3), b = 2.]

Larger block size has distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size?
Fewer blocks => more conflicts. Can waste bandwidth.
Pipelining.40 9/15
Example: 2 KB Direct-Mapped Cache with 32 B Blocks

– The uppermost 21 bits are the Cache Tag
– The lowest 5 bits are the Byte Select (Block Size = 2^5 = 32)
– 2 KB cache: 2^6 (64 lines) x 2^5 (32 bytes/line)
– On a cache miss, a "Cache Block" (a "Cache Line") is read from memory

Address split:
  | 31..11: Cache Tag (21 bits) | 10..5: Cache Index (6 bits) | 4..0: Byte Select (5 bits) |

[Figure: the cache array, indexed 0..63, storing for each line a Valid Bit, a 21-bit Cache Tag, and 32 bytes of Cache Data (Byte 0..Byte 31 in line 0, Byte 32..Byte 63 in line 1, ..., Byte 2016..Byte 2047 in line 63); the stored tag is compared with the Cache Tag field of the address.]
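A hedged sketch of this address decomposition for the 2 KB / 32 B configuration (the function name is illustrative; the field widths follow the slide):

```python
BLOCK_BITS = 5                              # 32-byte blocks -> byte select
INDEX_BITS = 6                              # 64 lines       -> cache index
TAG_BITS = 32 - INDEX_BITS - BLOCK_BITS     # 21-bit tag

def split_address(addr):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, byte_select

tag, index, byte_sel = split_address(0x00001234)
print(tag, index, byte_sel)   # which line 0x1234 maps to, and where in the block
```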
Pipelining.41 9/15
Set-Associative Cache Example: 2-Way

• Example: 2 KB two-way set-associative cache, 32 B lines
  – 2 direct-mapped caches operate in parallel, 32 lines x 32 bytes each
  – The Cache Index selects a "set" from the cache (Index = 5 bits)
  – The two tags in the set are compared with the address in parallel (Tag = 22 bits)
  – Data is selected based on the tag comparison
  – Two 22-bit comparators to compare tags

[Figure: two 1 KB banks, each with Valid, 22-bit Cache Tag, and Cache Data arrays; the Cache Index selects one set; two comparators (Compare Adr Tag) feed an OR gate producing Hit, and Sel1/Sel0 drive a MUX that picks the Cache Block.]
Pipelining.42 9/15
Fully Associative Cache

• Fully associative cache example: 2 KB cache, 32-byte lines
  – No Cache Index (still 64 lines)
  – Compare the Cache Tags of all cache entries in parallel!!
  – Example: with 32 B blocks, we need 64 27-bit comparators (one per line)
• By definition: Conflict Misses = 0 for a fully associative cache

[Figure: each entry holds a Valid Bit, a 27-bit Cache Tag (address bits 31..5), and 32 bytes of Cache Data (Byte 0..Byte 31, Byte 32..Byte 63, ...); the 5-bit Byte Select picks within a line; every entry's tag feeds its own comparator (=).]
Pipelining.43 9/15
Q3: Which block should be replaced on a miss?

• Easy for direct mapped
• Set associative or fully associative:
  – LRU (Least Recently Used): appealing, but hard to implement for high associativity
  – Random: easy, but how well does it work?

Miss rates, LRU vs. Random:

  Assoc:   2-way           4-way           8-way
  Size     LRU     Ran     LRU     Ran     LRU     Ran
  16K      5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
  64K      1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
  256K     1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
Pipelining.44 9/15
Sources of Cache Misses

• Compulsory (cold start or process migration; first reference): the first access to a block
  – A "cold" fact of life
  – If you run "billions" of instructions, compulsory misses are insignificant
• Capacity:
  – The cache cannot contain all the blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Coherence (invalidation): another process (e.g., I/O) updates memory
Pipelining.45 9/15
Write Policy: What Happens on a Write? (A) Write-Through (B) Write-Back

• Write-through: all writes update the cache and the memory/lower-level cache
  – Can always discard cached data: the most up-to-date data is in memory
  – Cache control bit: valid bit only
• Write-back: all writes simply update the cache
  – Can't just discard cached data
  – Update the lower level when the block falls out of the cache
  – Cache control bits: valid and dirty (modified) bits
• Other advantages:
  – Write-through:
    » memory (or other processors) always has the latest data
    » simpler management of the cache
  – Write-back:
    » much lower bandwidth, since data is often overwritten multiple times
    » better tolerance to long-latency memory
Pipelining.46 9/15
Write Buffers for Write-Through Caches

[Figure: Processor -> Cache, with a Write Buffer between the cache and lower-level memory; the buffer holds data awaiting write-through to lower-level memory.]

Q. Why a write buffer?
A. So the CPU doesn't stall.

Q. Why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.
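A minimal sketch of the second option, checking the buffer on reads (the FIFO discipline and all names here are illustrative):

```python
from collections import deque

class WriteBuffer:
    """FIFO of pending (address, value) writes awaiting write-through."""
    def __init__(self, memory):
        self.memory = memory          # dict: address -> value
        self.pending = deque()

    def write(self, addr, value):
        self.pending.append((addr, value))   # CPU continues without stalling

    def read(self, addr):
        # RAW check: the newest matching pending write wins over memory.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Called whenever the lower-level memory port is free.
        if self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v

mem = {}
wb = WriteBuffer(mem)
wb.write(0x100, 42)
print(wb.read(0x100))   # 42, even though memory hasn't been updated yet
```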
Pipelining.47 9/15
Write Policy 2: Write-Allocate vs. Non-Allocate (What Happens on a Write Miss?)

• Write allocate: allocate a new cache line in the cache
  – Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  – Alternative: per-word valid bits
• Write non-allocate (or "write-around"):
  – Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!
Pipelining.48 9/15
Virtual Memory Review

• Main memory used as a "cache" for secondary (disk) storage
  – Managed jointly by CPU hardware and the OS
• Programs share main memory
  – Each gets a private virtual address space, e.g., 4 GB in MIPS / IA-32
  – Protected from other programs
• CPU and OS translate virtual addresses to physical addresses (mapping)
  – A VM "block" is called a page, e.g., 4 KB
  – A VM translation "miss" is called a page fault
Pipelining.49 9/15
Modern Virtual Memory Systems: the illusion of a large, private, uniform store

• Protection & privacy: several users, with private and shared address spaces
• Demand paging: provides the ability to run programs larger than primary memory, and hides differences in machine configurations
• The price is address translation on each memory reference

[Figure: user_i issues a virtual address (VA); the TLB/mapping hardware translates it to a physical address (PA) in primary memory, with a swapping store (disk) behind it, managed by the OS.]
Pipelining.50 9/15
Page Fault Penalty & Demand Paging

• On a page fault (page not present), the page must be fetched from disk
  – Takes millions of clock cycles
  – If no free page frame is left, a page is replaced
  – The page table is updated to point to the new location of the page on disk
  – Handled by OS code
• To minimize the page fault rate:
  – Fully associative placement
  – Smart replacement algorithms
Pipelining.51 9/15
Paged Memory Systems

• The processor generates a <page number, offset> address (virtual)
• A page table contains the physical address of the base of each page (the translation to physical)
• Pages of a program may be stored non-contiguously; the page table translates.

[Figure: pages 0-3 of process-1's address space map through the page table of process-1 to scattered physical pages; the page-number field of the virtual address is replaced by the translated physical page, the offset is kept.]
Pipelining.52 9/15
Address Translation

• Fixed-size pages (e.g., 4 KB)
Pipelining.53 9/15
Mapping Pages to Storage
Pipelining.54 9/15
Page Tables Example: Translating a Virtual to a Physical Address

[Figure: a 32-bit virtual address splits into a 20-bit virtual page number (bits 31..12) and a 12-bit page offset (bits 11..0). The page table register points to the page table; the virtual page number indexes it; each entry holds a valid bit (if 0, the page is not present in memory) and an 18-bit physical page number. The physical page number is concatenated with the page offset to form the physical address (bits 29..0).]
Pipelining.55 9/15
Page Tables

• Store placement information
  – An array of page table entries, indexed by virtual page number
  – A page table register in the CPU points to the page table in physical memory
• If the page is present in memory:
  – The PTE stores the physical page number
  – Plus other status bits (referenced R, dirty D, ...)
• If the page is not present:
  – The PTE can refer to a location in swap space on disk
Pipelining.56 9/15
What is in a Page Table Entry (PTE)?

• A pointer to the next-level page table or to the actual page
• Permission bits: valid, read-only, read-write, write-only
• Example: Intel x86 architecture PTE:
  – Address format as on the previous slide (10, 10, 12-bit offset)
  – Intermediate page tables are called "Directories"

  | 31-12: Page Frame Number (Physical Page Number) | 11-9: Free (OS) | 8: 0 | 7: L | 6: D | 5: A | 4: PCD | 3: PWT | 2: U | 1: W | 0: P |

  P:   Present (same as the "valid" bit in other architectures)
  W:   Writeable
  U:   User accessible
  PWT: Page write transparent: external cache write-through
  PCD: Page cache disabled (page cannot be cached)
  A:   Accessed: page has been accessed recently
  D:   Dirty (M bit): page has been modified
  L:   L = 1 => 4 MB page (directory only); the bottom 22 bits of the virtual address serve as the offset
Pipelining.57 9/15
The Size of a Linear (Single-Level) Page Table is Huge

With 32-bit addresses, 4 KB pages & 4-byte PTEs: 2^20 PTEs, i.e., a 4 MB page table per user.

Larger pages?
• Internal fragmentation (not all memory in a page is used)
• Larger page fault penalty (more time to read from disk)

What about a 64-bit virtual address space?
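The 4 MB figure is easy to verify, and the same arithmetic shows why a flat table is hopeless for 64-bit addresses (the 48-bit width below is purely illustrative):

```python
def flat_page_table_bytes(va_bits, page_bits=12, pte_bytes=4):
    """Size of a single-level page table: one PTE per virtual page."""
    return (1 << (va_bits - page_bits)) * pte_bytes

print(flat_page_table_bytes(32))               # 4 MB per user (2^20 PTEs x 4 B)
print(flat_page_table_bytes(48, pte_bytes=8))  # 512 GB: why tables go hierarchical
```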
Pipelining.58 9/15
Hierarchical (2-Level) Page Table (used in Intel x86 processors)

Virtual address (in a processor register):
  | 31..22: p1 (10-bit L1 index) | 21..12: p2 (10-bit L2 index) | 11..0: offset |

[Figure: the root of the current page table points to the Level 1 page table; p1 indexes it to find a Level 2 page table; p2 indexes that to find the data page. Pages and page tables may be in primary or secondary memory, and a PTE may denote a nonexistent page.]
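A sketch of the two-level walk (a dict of dicts stands in for the page tables; only the touched level-2 tables need to exist, which is the point of the hierarchy):

```python
L1_BITS, L2_BITS, OFFSET_BITS = 10, 10, 12

def walk(vaddr, l1_table):
    """Two-level page table walk: l1_table[p1] is a level-2 table (or absent);
    l2_table[p2] is a physical page number (or absent)."""
    p1 = vaddr >> (L2_BITS + OFFSET_BITS)
    p2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    l2_table = l1_table.get(p1)
    if l2_table is None or l2_table.get(p2) is None:
        raise RuntimeError("page fault")    # PTE of a nonexistent page
    return (l2_table[p2] << OFFSET_BITS) | offset

l1 = {0x048: {0x345: 0x00ABC}}              # only the touched tables exist
print(hex(walk(0x12345678, l1)))            # 0xabc678
```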
Pipelining.59 9/15
Replacement and Writes

• To reduce the page fault rate, use least-recently used (LRU) replacement
  – A reference bit in the PTE is set to 1 on access to the page
  – Periodically cleared to 0 by the OS
  – A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
  – Write a page at once, not individual locations
  – Use write-back; write-through is impractical
  – A dirty bit in the PTE is set when the page is written
Pipelining.60 9/15
Clock Algorithm: Not Recently Used

• Clock algorithm:
  – Approximates LRU
  – Replaces an old page, not necessarily the oldest page
• Details:
  – A hardware "use" bit per physical page:
    » hardware sets the use bit on each reference
    » if the use bit isn't set, the page has not been referenced in a long time
  – On a page fault:
    » advance the clock hand
    » check the use bit: 1 => used recently, clear it and leave the page alone; 0 => selected candidate for replacement

[Figure: the set of all pages in memory arranged in a circle, with a single clock hand that advances only on page faults, checking for pages not used recently and marking pages as not used recently; page table entries carry use and dirty bits.]
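A compact sketch of the hand's behavior (a plain list stands in for the circular set of physical pages; names are illustrative):

```python
class Clock:
    """Approximate-LRU victim selection: the clock / second-chance algorithm."""
    def __init__(self, n_pages):
        self.use = [0] * n_pages    # hardware sets use[i] = 1 on each reference
        self.hand = 0

    def reference(self, page):
        self.use[page] = 1

    def pick_victim(self):
        # On a page fault, advance the hand until a use bit of 0 is found.
        while True:
            if self.use[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.use)
                return victim
            self.use[self.hand] = 0           # second chance: clear and move on
            self.hand = (self.hand + 1) % len(self.use)

clock = Clock(4)
for p in (0, 1, 3):
    clock.reference(p)
print(clock.pick_victim())   # 2: the page not referenced since the last sweep
```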
Pipelining.61 9/15
Fast Translation Using a TLB

• Address translation requires extra memory references:
  – a PTE access
  – then the actual memory access
• But access to page tables has good locality, so:
  – TLB: Translation Lookaside Buffer
  – a fast cache of PTEs within the CPU
  – Typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a miss, 0.01%-1% miss rate
  – Misses handled by hardware or software
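A toy sketch of the TLB check in front of the page-table walk (fully associative, with an arbitrary FIFO eviction for brevity; real TLB replacement policies vary):

```python
from collections import OrderedDict

class TLB:
    """A small, fully associative cache of VPN -> PPN translations."""
    def __init__(self, capacity=64):
        self.entries = OrderedDict()        # VPN -> PPN
        self.capacity = capacity

    def lookup(self, vpn, walk_page_table):
        if vpn in self.entries:
            return self.entries[vpn]        # TLB hit: 0.5-1 cycle
        ppn = walk_page_table(vpn)          # TLB miss: 10-100 cycles
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict oldest (FIFO, for brevity)
        self.entries[vpn] = ppn
        return ppn

tlb = TLB()
print(tlb.lookup(0x12345, lambda vpn: 0x00ABC))  # miss: walks, then caches
print(tlb.lookup(0x12345, lambda vpn: None))     # hit: the walker isn't consulted
```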
Pipelining.62 9/15
Fast Translation Using a TLB
Pipelining.63 9/15
TLB Misses

• If the page is in memory:
  – Load the PTE from memory and retry
  – Could be handled in hardware
  – Or in software:
    » raise a special exception, with an optimized handler
• If the page is not in memory (page fault, not present):
  – The OS fetches the page and updates the page table
  – Then restarts the faulting instruction
Pipelining.64 9/15
Reducing Translation Time Further

• As described, the TLB lookup is in series with the cache lookup:

[Figure: Virtual Address = <V page no. | offset (10)> feeds a TLB lookup whose entries hold <V, Access Rights, PA>; the result forms the Physical Address = <P page no. | offset>.]

• Machines with TLBs go one step further: they overlap the TLB lookup with the cache access.
  – This works because the offset is available early.
Pipelining.65 9/15
Overlapping TLB & Cache Access

[Figure: the 20-bit page number drives the associative TLB lookup while, in parallel, the displacement bits index the 4 KB cache (1 K entries of 4 bytes, 10-bit index); the frame number (FN) from the TLB is compared with the cache tag, yielding Hit/Miss alongside the data.]

• What if the cache size is increased to 8 KB?
  – The overlap is no longer complete
  – Need to do something else
• OR: virtual caches
  – Tags in the cache are virtual addresses
  – Translation happens only on cache misses