OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
Chris Gilmore <[email protected]>
Portland State University/OMSE
Today
Caches
DLX Assembly
CPU Overview
Computer System (Idealized)
[Diagram: CPU and Memory connected by the System Bus, with a Disk Controller attaching the Disk]
The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output.

Next topic: simple caching techniques. Many ways to improve cache performance.
Recap: Levels of the Memory Hierarchy

[Diagram: Processor -> Cache -> Memory -> Disk -> Tape. Upper levels are faster; lower levels are larger. Units transferred between levels: Instr. Operands (Processor/Cache), Blocks (Cache/Memory), Pages (Memory/Disk), Files (Disk/Tape)]
Recap: Exploit locality to achieve fast memory

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.
Memory Hierarchy: Terminology
[Diagram: blocks move between the Upper Level Memory (Blk X) and the Lower Level Memory (Blk Y), to and from the Processor]

Hit: data appears in some block in the upper level (example: Block X).
- Hit Rate: the fraction of memory accesses found in the upper level.
- Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.

Miss: data needs to be retrieved from a block in the lower level (Block Y).
- Miss Rate = 1 - (Hit Rate)
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor.

Hit Time << Miss Penalty
The Art of Memory System Design

[Diagram: the Processor issues a reference stream <op,addr>, <op,addr>, <op,addr>, ... to the cache ($) and memory (MEM); op is i-fetch, read, or write]

Optimize the memory system organization to minimize the average memory access time for typical workloads (workload or benchmark programs).
Example: Fully Associative
[Diagram: fully associative cache entries, each with a Valid Bit, a 27-bit Cache Tag, and 32 bytes of Cache Data (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...); the address splits into a Cache Tag (bits 31-5) and a Byte Select (bits 4-0, e.g. 0x01)]

Fully Associative Cache:
- No Cache Index
- Compare the Cache Tags of all cache entries in parallel
- Example: with 32 B blocks, we need N 27-bit comparators

By definition, Conflict Misses = 0 for a fully associative cache.
Example: 1 KB Direct Mapped Cache with 32 B Blocks
[Diagram: 1 KB direct mapped cache with 32 B blocks. The 32-bit address splits into a Cache Tag (bits 31-10, example: 0x50), a Cache Index (bits 9-5, e.g. 0x01), and a Byte Select (bits 4-0, e.g. 0x00). Each of the 32 lines stores a Valid Bit and the Cache Tag (0x50) as part of the cache "state", plus Byte 0 ... Byte 31 of data]

For a 2 ** N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (Block Size = 2 ** M)
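The tag/index/byte-select arithmetic above can be sketched in a few lines of Python (an illustrative sketch using the slide's 1 KB / 32 B parameters; the function name is mine):

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split a 32-bit address into (tag, index, byte_select) for a
    direct mapped cache of cache_bytes total with block_bytes blocks."""
    offset_bits = block_bytes.bit_length() - 1                   # M = 5 for 32 B blocks
    num_lines = cache_bytes // block_bytes                       # 32 lines
    index_bits = num_lines.bit_length() - 1                      # 5 index bits
    byte_select = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_lines - 1)
    tag = addr >> (offset_bits + index_bits)                     # uppermost 32 - N bits
    return tag, index, byte_select

# The slide's example fields: tag 0x50, index 0x01, byte select 0x00
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(split_address(addr))  # (0x50, 0x01, 0x00), i.e. (80, 1, 0)
```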
Set Associative Cache
[Diagram: two-way set associative cache. The Cache Index selects one Cache Block from each way; the Adr Tag is compared against both ways' Cache Tags in parallel, the results are ORed into Hit, and Sel1/Sel0 drive a mux that picks the Cache Block]
N-way set associative: N entries for each Cache Index; N direct mapped caches operate in parallel.

Example: two-way set associative cache
- The Cache Index selects a "set" from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
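A toy model of the lookup described above might look like this (hit/miss bookkeeping only, no data; class and parameter names are mine, and LRU replacement within a set is assumed):

```python
class SetAssociativeCache:
    """Toy N-way set associative cache that tracks hits and misses only."""
    def __init__(self, num_sets=4, ways=2, block_bytes=32):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        self.sets = [[] for _ in range(num_sets)]   # each set: tags in LRU order

    def access(self, addr):
        block = addr // self.block_bytes
        index = block % self.num_sets               # Cache Index selects a set
        tag = block // self.num_sets
        s = self.sets[index]
        if tag in s:                                # tags in the set compared "in parallel"
            s.remove(tag)
            s.append(tag)                           # mark most recently used
            return "hit"
        if len(s) == self.ways:
            s.pop(0)                                # evict the LRU way
        s.append(tag)
        return "miss"

c = SetAssociativeCache()
# Addresses 0 and 128 map to the same set but different tags; both fit in 2 ways.
print([c.access(a) for a in (0, 0, 128, 0)])  # ['miss', 'hit', 'miss', 'hit']
```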
Disadvantage of Set Associative Cache
[Diagram: the same two-way set associative organization as on the previous slide]

N-way Set Associative Cache versus Direct Mapped Cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER the Hit/Miss decision and set selection

In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, and recover later on a miss.
Block Size Tradeoff
[Plots vs. Block Size: Miss Rate (larger blocks exploit spatial locality, but fewer blocks compromises temporal locality), Miss Penalty (increasing), and Average Access Time (minimum at an intermediate block size, due to increased miss penalty and miss rate)]

Larger block sizes take advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks)

In general, Average Access Time:
= Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
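The average access time formula above as a one-line function, with made-up example numbers:

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    # Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

# Hypothetical numbers: 1-cycle hit, 5% miss rate, 50-cycle miss penalty
print(avg_access_time(1, 0.05, 50))  # 3.45 cycles
```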
A Summary on Sources of Cache Misses
Compulsory (cold start or process migration; first reference): the first access to a block
- "Cold" fact of life: not a whole lot you can do about it
- Note: if you are going to run "billions" of instructions, compulsory misses are insignificant

Conflict (collision): multiple memory locations mapped to the same cache location
- Solution 1: increase cache size
- Solution 2: increase associativity

Capacity: the cache cannot contain all blocks accessed by the program
- Solution: increase cache size

Coherence (invalidation): another process (e.g., I/O) updates memory
Source of Cache Misses Quiz

                   Direct Mapped   N-way Set Associative   Fully Associative
Cache Size:        (Small, Medium, Big?)
Compulsory Miss:
Conflict Miss:
Capacity Miss:
Coherence Miss:

Choices: Zero, Low, Medium, High, Same
Assume constant cost.
Sources of Cache Misses: Answer

                   Direct Mapped   N-way Set Associative   Fully Associative
Cache Size:        Big             Medium                  Small
Compulsory Miss:   Same            Same                    Same
Conflict Miss:     High            Medium                  Zero
Capacity Miss:     Low             Medium                  High
Coherence Miss:    Same            Same                    Same

Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.
Recap: Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?

[Diagram: memory block 12 placed in an 8-block cache three ways. Fully associative: block 12 can go anywhere (frames 0-7). Direct mapped: block 12 can go only into frame 4 (12 mod 8). 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), with sets 0-3. Below, a 32-block memory with block-frame addresses 0-31]

Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
Set associative mapping = Block Number modulo Number of Sets.
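The three placement policies can be sketched as one function (a toy model; it assumes each set occupies consecutive block frames, as in the figure, and the function name is mine):

```python
def placements(block, cache_blocks=8, ways=1):
    """Return the cache block frames that may hold a given memory block.
    ways=1: direct mapped; ways=cache_blocks: fully associative."""
    num_sets = cache_blocks // ways          # S.A. mapping = block number mod number of sets
    s = block % num_sets
    return list(range(s * ways, (s + 1) * ways))

print(placements(12, ways=1))  # direct mapped: [4]          (12 mod 8)
print(placements(12, ways=2))  # 2-way: set 0 -> frames [0, 1]  (12 mod 4)
print(placements(12, ways=8))  # fully associative: frames 0..7
```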
Q2: How is a block found if it is in the upper level?

[Diagram: the address splits into a Block Address (Tag and Index fields) plus a Block Offset; the Index is the set select, the Tag is the data select]

Direct indexing (using the index and block offset), tag compares, or a combination.
Increasing associativity shrinks the index and expands the tag.
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
- Random
- FIFO
- LRU (Least Recently Used)
- LFU (Least Frequently Used)

Miss rates, LRU vs. Random:

          2-way            4-way            8-way
Size      LRU    Random    LRU    Random    LRU    Random
16 KB     5.2%   5.7%      4.7%   5.3%      4.4%   5.0%
64 KB     1.9%   2.0%      1.5%   1.7%      1.4%   1.5%
256 KB    1.15%  1.17%     1.13%  1.13%     1.12%  1.12%
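As a sketch of the LRU policy from the table above (illustrative only; the class and method names are mine, and real hardware usually only approximates LRU):

```python
from collections import OrderedDict

class LRUSet:
    """LRU replacement within one set of `ways` blocks."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()          # iteration order = recency order

    def access(self, tag):
        if tag in self.tags:
            self.tags.move_to_end(tag)     # mark most recently used
            return True                    # hit
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)  # evict the least recently used
        self.tags[tag] = None
        return False                       # miss

s = LRUSet(ways=2)
# A, B fill the set; A hits; C evicts B (the LRU tag); B then misses.
print([s.access(t) for t in "ABACB"])  # [False, False, True, False, False]
```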
Q4: What happens on a write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory.

Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced (is the block clean or dirty?).

Pros and cons of each?
- WT: read misses cannot result in writes; coherency is easier
- WB: repeated writes to the same block do not each go to memory

WT is always combined with write buffers so that writes don't wait for lower-level memory.
Write Buffer for Write Through

[Diagram: Processor and Cache, with a Write Buffer between the Cache and DRAM]

A Write Buffer is needed between the Cache and Memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
Write Buffer Saturation

[Diagrams: Processor -> Cache -> Write Buffer -> DRAM, and the same path with an L2 Cache inserted before DRAM]

Store frequency (w.r.t. time) > 1 / DRAM write cycle:
- If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (the CPU cycle time <= DRAM write cycle time)

Solutions for write buffer saturation:
- Use a write back cache
- Install a second-level (L2) cache (does this always work?)
Write-miss Policy: Write Allocate versus Not Allocate

[Diagram: the same 1 KB direct mapped cache as before; the address 0x0 splits into Cache Tag 0x00, Cache Index 0x00, and Byte Select 0x00]

Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block?
- Yes: Write Allocate
- No: Write Not Allocate
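A toy comparison of the two policies (hypothetical helper with direct mapped hit/miss bookkeeping only; names and parameters are mine):

```python
def write_misses(policy, accesses, cache_blocks=32, block_bytes=32):
    """Count misses for a stream of (op, addr) pairs. 'allocate' brings the
    block into a direct mapped cache on a write miss; 'no-allocate' does not."""
    lines = {}          # index -> tag
    misses = 0
    for op, addr in accesses:
        block = addr // block_bytes
        index, tag = block % cache_blocks, block // cache_blocks
        if lines.get(index) != tag:
            misses += 1
            if op == "read" or policy == "allocate":
                lines[index] = tag      # install the block in the cache
    return misses

stream = [("write", 0x0), ("read", 0x0)]   # the slide's write to 0x0, then a read
print(write_misses("allocate", stream))     # 1: the later read hits
print(write_misses("no-allocate", stream))  # 2: the later read also misses
```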
Impact on Cycle Time
[Diagram: a DLX-style pipeline (PC, I-Cache, IR, registers A/B, IRex, IRm, IRwb, D-Cache, result registers R/T) with miss/invalid signals coming from the caches]

Cache Hit Time:
- directly tied to clock rate
- increases with cache size
- increases with associativity

Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)
What happens on a cache miss? For an in-order pipeline, 2 options:

1. Freeze the pipeline in the Mem stage (popular early on: SPARC, R4000)

   IF ID EX Mem stall stall stall … stall Mem Wr
      IF ID EX  stall stall stall … stall stall Ex Wr

2. Use Full/Empty bits in registers + an MSHR queue
   - MSHR = "Miss Status/Handler Registers" (Kroft)
   - Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
   - Per cache line: keep info about the memory address. For each word: the register (if any) that is waiting for the result. Used to "merge" multiple requests to one memory line.
   - A new load creates an MSHR entry and sets the destination register to "Empty". The load is "released" from the pipeline.
   - An attempt to use the register before the result returns causes the instruction to block in the decode stage.
   - Limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
   - Out-of-order pipelines already have this functionality built in (load queues, etc.).
Time = IC x CT x (ideal CPI + memory stalls)
Average Memory Access Time = Hit Time + (Miss Rate x Miss Penalty)
                           = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
Improving Cache Performance: 3 general options
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
3Cs Absolute Miss Rate (SPEC92)

[Plot: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, broken into Conflict, Capacity, and Compulsory components; the Compulsory component is vanishingly small]
2:1 Cache Rule

[Plot: the same 3Cs miss-rate breakdown, miss rate per type vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity]

Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2.
3Cs Relative Miss Rate

[Plot: miss rate per type as a percentage (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, broken into Conflict, Capacity, and Compulsory components]

Flaws: for fixed block size. Good: insight => invention.
1. Reduce Misses via Larger Block Size

[Plot: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K; for the small caches, the miss rate rises again at the largest block sizes]
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: Miss Rate of a direct mapped cache of size N = Miss Rate of a 2-way cache of size N/2

Beware: execution time is the only final measure!
- Will the clock cycle time increase?
- Hill [1988] suggested the hit time for 2-way vs. 1-way: external cache +10%, internal +2%
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped
Cache Size (KB)   1-way   2-way   4-way   8-way
1 2.33 2.15 2.07 2.01
2 1.98 1.86 1.76 1.68
4 1.72 1.67 1.61 1.53
8 1.46 1.48 1.47 1.43
16 1.29 1.32 1.32 1.32
32 1.20 1.24 1.25 1.27
64 1.14 1.20 1.21 1.23
128 1.10 1.17 1.18 1.20
(Green means A.M.A.T. not improved by more associativity)
(AMAT = Average Memory Access Time)
3. Reducing Misses via a "Victim Cache"

[Diagram: a small fully associative victim cache (four entries, each with a tag, comparator, and one cache line of data) sitting between the cache (TAGS, DATA) and the next lower level in the hierarchy]

How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to hold data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
- Used in Alpha, HP machines
4. Reducing Misses by Hardware Prefetching

E.g., instruction prefetching:
- Alpha 21064 fetches 2 blocks on a miss
- The extra block is placed in a "stream buffer"
- On a miss, check the stream buffer

Works with data blocks too:
- Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 streams caught 43%
- Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64KB, 4-way set associative caches

Prefetching relies on having extra memory bandwidth that can be used without penalty.
5. Reducing Misses by Software Prefetching Data

- Data prefetch: load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
- Special prefetching instructions cannot cause faults; a form of speculative execution

Issuing prefetch instructions takes time:
- Is the cost of issuing prefetches < the savings in reduced misses?
- Wider superscalar issue reduces the difficulty of finding issue bandwidth
6. Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software.

Instructions:
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)

Data:
- Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Loop fusion: combine 2 independent loops that have the same looping and some variables in common
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
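Loop interchange's effect on misses can be sketched with a toy direct mapped cache model (all parameters here are made up for illustration; the array is assumed row major, as in C):

```python
def misses(order, n=64, block_elems=8, cache_blocks=16):
    """Count misses when touching an n x n row major array in the given
    (i, j) element order, through a toy direct mapped cache."""
    lines, count = {}, 0
    for i, j in order:
        block = (i * n + j) // block_elems      # row major address / block size
        index, tag = block % cache_blocks, block // cache_blocks
        if lines.get(index) != tag:
            lines[index] = tag
            count += 1
    return count

n = 64
ij = [(i, j) for i in range(n) for j in range(n)]   # loops as written: x[i][j]
ji = [(i, j) for j in range(n) for i in range(n)]   # interchanged nesting
# Same elements touched, but the row-order walk hits consecutive addresses.
print(misses(ij), misses(ji))  # 512 4096: every access misses in column order
```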
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
0. Reducing Penalty: Faster DRAM / Interface
New DRAM technologies:
- RAMBUS: same initial latency, but much higher bandwidth
- Synchronous DRAM

Better bus interfaces

CRAY technique: only use SRAM
1. Reducing Penalty: Read Priority over Write on Miss

[Diagram: Processor and Cache, with a Write Buffer between the Cache and DRAM, as before]

A write buffer allows reads to bypass writes:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

Memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
1. Reducing Penalty: Read Priority over Write on Miss
Write-buffer issues:
- Write through with write buffers offers RAW conflicts with main memory reads on cache misses
- If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000: by 50%)
- Check the write buffer contents before a read; if there are no conflicts, let the memory access continue

Write back?
- On a read miss replacing a dirty block:
- Normal: write the dirty block to memory, and then do the read
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write
- The CPU stalls less, since it restarts as soon as the read is done
2. Reduce Penalty: Early Restart and Critical Word First

Don't wait for the full block to be loaded before restarting the CPU:
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.

Generally useful only with large blocks.
Spatial locality is a problem: we tend to want the next sequential word anyway, so it is not clear how much early restart benefits.
3. Reduce Penalty: Non-blocking Caches
A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss:
- requires Full/Empty bits on registers or out-of-order execution
- requires multi-bank memories

"Hit under miss" reduces the effective miss penalty by working during a miss vs. ignoring CPU requests.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses:
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise it cannot be supported)
- Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC

[Plot: average memory access time (0 to 2) under "hit under n misses" (Base, 0->1, 1->2, 2->64) for SPEC integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2) and floating point benchmarks (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora)]

- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
- 8 KB data cache, direct mapped, 32B blocks, 16 cycle miss penalty
4. Reduce Penalty: Second-Level Cache
[Diagram: Processor -> L1 Cache -> L2 Cache]

L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)

Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)

The global miss rate is what matters.
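The L2 equations above as a small function, with hypothetical numbers (the function name and all values are mine):

```python
def two_level_amat(hit_l1, miss_l1, hit_l2, miss_l2_local, penalty_l2):
    """AMAT for a two-level hierarchy, per the L2 equations above.
    miss_l2_local is the LOCAL L2 miss rate (L2 misses / L2 accesses)."""
    miss_penalty_l1 = hit_l2 + miss_l2_local * penalty_l2
    return hit_l1 + miss_l1 * miss_penalty_l1

# Hypothetical: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 25% local L2 miss rate, 100-cycle memory penalty.
print(two_level_amat(1, 0.04, 10, 0.25, 100))   # 2.4 cycles
print("global L2 miss rate:", 0.04 * 0.25)      # 0.01
```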
Reducing Misses: which apply to L2 Cache?
Reducing Miss Rate
1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Misses by HW Prefetching Instr, Data
5. Reducing Misses by SW Prefetching Data
6. Reducing Capacity/Conf. Misses by Compiler Optimizations
L2 cache block size & A.M.A.T.

[Plot: relative CPU time (1.0 to 2.0) vs. L2 block size (16 to 512 bytes): 1.36, 1.28, 1.27, 1.34, 1.54, 1.95, with the minimum at 64-byte blocks. 32KB L1, 8-byte path to memory]
Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache:
   - Lower associativity (+ victim caching)?
   - 2nd-level cache
   - Careful virtual memory design
Summary #1/3: The Principle of Locality

A program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal Locality: locality in time
- Spatial Locality: locality in space

Three (+1) major categories of cache misses:
- Compulsory misses: sad facts of life. Example: cold start misses.
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!
- Capacity misses: increase cache size
- Coherence misses: caused by external processors or I/O devices

The cache design space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy.
Summary #2 / 3: The Cache Design Space
[Diagram: the cache design space sketched along axes such as Associativity, Cache Size, and Block Size, with trade-off curves from Bad to Good as Factor A and Factor B vary from Less to More]

Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
- write allocation

The optimal choice is a compromise:
- depends on access characteristics: workload, use (I-cache, D-cache, TLB)
- depends on technology / cost

Simplicity often wins.
mis
s r
ate
Summary #3 / 3: Cache Miss Optimization
Lots of techniques people use to improve the miss rate of caches:
| Technique | MR | MP | HT | Complexity |
|---|---|---|---|---|
| Larger block size | + | – | | 0 |
| Higher associativity | + | | – | 1 |
| Victim caches | + | | | 2 |
| Pseudo-associative caches | + | | | 2 |
| HW prefetching of instr/data | + | | | 2 |
| Compiler-controlled prefetching | + | | | 3 |
| Compiler techniques to reduce misses | + | | | 0 |
Onto Assembler!
What is assembly language?
Machine-specific programming language:
- one-to-one correspondence between statements and native machine language
- matches the machine's instruction set and architecture
What is an assembler?
Systems-level program:
- usually works in conjunction with the compiler
- translates assembly language source code to machine language
- object file: contains machine instructions, initial data, and information used when loading the program
- listing file: contains a record of the translation process, line numbers, addresses, generated code and data, and a symbol table
Why learn assembly?
Learn how a processor works
Understand basic computer architecture
Explore the internal representation of data and instructions
Gain insight into hardware concepts
Allows creation of small and efficient programs
Allows programmers to bypass high-level language restrictions
Might be necessary to accomplish certain operations
Machine Representation
A language of numbers, called the Processor’s Instruction Set The set of basic operations a processor can perform
Each instruction is coded as a number
Instructions may be one or more bytes
Every number corresponds to an instruction
Assembly vs Machine
Machine language programming: writing a list of numbers representing the bytes of machine instructions to be executed and data constants to be used by the program.

Assembly language programming: using symbolic instructions to represent the raw data that will form the machine language program and initial data constants.
Assembly
Mnemonics represent machine instructions:
- each mnemonic used represents a single machine instruction
- the assembler performs the translation

Some mnemonics require operands:
- operands provide additional information: a register, constant, address, or variable

Assembler directives
Instruction Set Architecture: a Critical Interface
[Figure: the instruction set sits at the boundary between software and hardware]

The ISA is the portion of the machine that is visible to the programmer or the compiler writer.
Good ISA
Lasts through many implementations (portability, compatibility)
Can be used for many different applications (generality)
Provide convenient functionality to higher levels
Permits an efficient implementation at lower levels
Von Neumann Machines
Von Neumann “invented” stored program computer in 1945
Instead of program code being hardwired, the program code (instructions) is placed in memory along with the data.

[Figure: control and ALU connected to a memory holding both program and data]
Basic ISA Classes
Memory-to-memory machines:
- every instruction contains a full memory address for each operand
- maybe the simplest ISA design
- however, memory is slow
- and memory is big (lots of address bits)
Memory-to-memory machine. Assumptions:
- two operands per operation; the first operand is also the destination
- memory address = 16 bits (2 bytes)
- operand size = 32 bits (4 bytes)
- instruction code = 8 bits (1 byte)

Example: A = B + C (hypothetical code)

mov A, B ; A <= B
add A, C ; A <= B + C

5 bytes for each instruction, 4 bytes to fetch each source operand, and 4 bytes to store the result: mov needs 13 bytes and add needs 17 bytes, for 30 bytes of total memory traffic.
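The byte counts above can be checked with a short calculation (a sketch following the slide's assumptions: 1-byte opcode, 2-byte addresses, 4-byte operands):

```python
OPCODE, ADDR, OPERAND = 1, 2, 4   # sizes in bytes, per the slide's assumptions

def traffic(n_addrs, reads, writes):
    """Memory traffic for one instruction: fetch the instruction itself,
    then fetch/store its operands."""
    instr_bytes = OPCODE + n_addrs * ADDR
    return instr_bytes + (reads + writes) * OPERAND

mov = traffic(2, reads=1, writes=1)   # fetch B, store A
add = traffic(2, reads=2, writes=1)   # fetch A and C, store A
```

This reproduces the slide's totals: 13 bytes for mov, 17 for add, 30 bytes in all.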
Why CPU Storage?
A small amount of storage in the CPU:
- reduces memory traffic by keeping repeatedly used operands in the CPU
- avoids re-referencing memory
- avoids having to specify the full memory address of the operand

This is a perfect example of "make the common case fast".

Simplest case: a machine with one cell of CPU storage, the accumulator.
Accumulator machine. Assumptions:
- two operands per operation: the 1st operand in the accumulator, the 2nd operand in memory
- the accumulator is also the destination (except for store)
- memory address = 16 bits (2 bytes)
- operand size = 32 bits (4 bytes)
- instruction code = 8 bits (1 byte)

Example: A = B + C (hypothetical code)

Load B  ; acc <= B
Add C   ; acc <= B + C
Store A ; A <= acc

3 bytes for each instruction plus 4 bytes to load or store the memory operand: 7 bytes per instruction, 21 bytes of total memory traffic.
Stack Machines

Instruction sets are based on a stack model of execution:
- aimed at compact instruction encoding
- most instructions manipulate the top few data items (mostly the top 2) of a pushdown stack
- the top few items of the stack are kept in the CPU
- ideal for evaluating expressions (the stack holds intermediate results)
- were thought to be a good match for high-level languages

Awkward:
- become very slow if the stack grows beyond CPU local storage
- no simple way to get data from the "middle of the stack"
Stack Machines
Binary arithmetic and logic operations:
- operands: the top 2 items on the stack
- operands are removed from the stack
- the result is placed on top of the stack

Unary arithmetic and logic operations:
- operand: the top item on the stack
- the operand is replaced by the result of the operation

Data move operations:
- push: place memory data on top of the stack
- pop: move the top of the stack to memory
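These operations can be sketched as a toy stack machine evaluating A = B + C (a sketch with hypothetical memory names, not any real stack ISA):

```python
# Toy stack machine: push/pop move data between memory and the stack,
# add consumes the top two items and pushes the result.
def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "push":                  # place memory data on top of the stack
            stack.append(memory[args[0]])
        elif op == "pop":                 # move the top of the stack to memory
            memory[args[0]] = stack.pop()
        elif op == "add":                 # binary op: pop top two, push the sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return memory

# A = B + C
mem = run([("push", "B"), ("push", "C"), ("add",), ("pop", "A")],
          {"A": 0, "B": 2, "C": 3})
```

Note how the expression's intermediate result never needs a named location: the stack holds it, which is why stack machines suit expression evaluation.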
General Purpose Register Machines
With stack machines, only the top two elements of the stack are directly available to instructions. In general-purpose register machines, the CPU storage is organized as a set of registers that are equally available to the instructions.

Frequently used operands are placed in registers (under program control), which reduces instruction size and reduces memory traffic.
General Purpose Registers Dominate
1975-present all machines use general purpose registers
Advantages of registers:
- registers are faster than memory
- registers are easier for a compiler to use: e.g., (A*B) – (C*D) – (E*F) can do the multiplies in any order
- registers can hold variables:
  - memory traffic is reduced, so the program is sped up (since registers are faster than memory)
  - code density improves (since a register is named with fewer bits than a memory location)
Classifying General Purpose Register Machines
General purpose register machines are sub-classified based on whether or not memory operands can be used by typical ALU instructions
Register-memory machines: machines where some ALU instructions can specify at least one memory operand and one register operand
Load-store machines: the only instructions that can access memory are the “load” and the “store” instructions
![Page 71: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE](https://reader031.vdocuments.net/reader031/viewer/2022012922/56649d3a5503460f94a15024/html5/thumbnails/71.jpg)
Comparing number of instructions
Code sequence for A = B+C for five classes of instruction sets:
Memory-to-memory:
mov A, B
add A, C

Stack:
push B
push C
add
pop A

Accumulator:
load B
add C
store A

Register (register-memory):
load R1, B
add R1, C
store A, R1

Register (load-store):
load R1, B
load R2, C
add R1, R1, R2
store A, R1

DLX/MIPS is one of these (load-store).
Instruction Set Definition

Objects = architecture entities = machine state:
- Registers: general purpose; special purpose (e.g., program counter, condition code, stack pointer)
- Memory locations: a linear address space 0, 1, 2, ..., 2^s - 1

Operations = instruction types:
- Data operation: arithmetic, logical
- Data transfer: move (register to register), load (memory location to register), store (register to memory location)
- Instruction sequencing: branch (conditional), jump (unconditional)
Topic: DLX
Instructional Architecture
Much nicer and easier to understand than x86 (barf)
The Plan: Teach DLX, then move to x86/y86
DLX: RISC ISA, very similar to MIPS
Great links to learn more DLX:
http://www.softpanorama.org/Hardware/architecture.shtml#DLX
![Page 74: OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview Chris Gilmore Portland State University/OMSE](https://reader031.vdocuments.net/reader031/viewer/2022012922/56649d3a5503460f94a15024/html5/thumbnails/74.jpg)
DLX Architecture
Based on observations about instruction set architecture. Emphasizes:
- a simple load-store instruction set
- design for pipeline efficiency
- design as a compiler target

DLX registers:
- 32 32-bit GPRs named R0, R1, ..., R31
- 32 32-bit FPRs, accessed independently for 32-bit data and in even-odd pairs (F0, F2, ..., F30) for 64-bit double-precision data
- register R0 is hard-wired to zero
- other status registers, e.g., the floating-point status register

Byte addressable, big-endian, with 32-bit addresses. Arithmetic instruction operands must be registers.
MIPS: Software conventions for registers

| Register | Name | Usage |
|---|---|---|
| 0 | zero | constant 0 |
| 1 | at | reserved for assembler |
| 2-3 | v0-v1 | expression evaluation & function results |
| 4-7 | a0-a3 | arguments |
| 8-15 | t0-t7 | temporaries: caller saves (callee can clobber) |
| 16-23 | s0-s7 | callee saves (callee must save) |
| 24-25 | t8-t9 | temporaries (cont'd) |
| 26-27 | k0-k1 | reserved for OS kernel |
| 28 | gp | pointer to global area |
| 29 | sp | stack pointer |
| 30 | fp | frame pointer |
| 31 | ra | return address (HW) |
Addressing Modes

This table shows the most common modes:

| Addressing Mode | Example Instruction | Meaning | When Used |
|---|---|---|---|
| Register | Add R4, R3 | R[R4] <- R[R4] + R[R3] | When a value is in a register. |
| Immediate | Add R4, #3 | R[R4] <- R[R4] + 3 | For constants. |
| Displacement | Add R4, 100(R1) | R[R4] <- R[R4] + M[100 + R[R1]] | Accessing local variables. |
| Register deferred | Add R4, (R1) | R[R4] <- R[R4] + M[R[R1]] | Using a pointer or a computed address. |
| Absolute | Add R4, (1001) | R[R4] <- R[R4] + M[1001] | Used for static data. |
Memory Organization
Viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
[Figure: memory drawn as a one-dimensional array of cells at addresses 0, 1, 2, ..., each holding 8 bits of data]
Memory Addressing
Bytes are nice, but most data items use larger "words"
For DLX, a word is 32 bits or 4 bytes.
Two questions for the design of the ISA. Since one could read a 32-bit word either as four loads of bytes from sequential byte addresses or as one load-word from a single byte address:
- How do byte addresses map to word addresses?
- Can a word be placed on any byte boundary?
Addressing Objects: Endianness and Alignment

Big endian: the address of the most significant byte = the word address (xx00 = the big end of the word). Used by IBM 360/370, Motorola 68k, MIPS, SPARC, HP PA.

Little endian: the address of the least significant byte = the word address (xx00 = the little end of the word). Used by Intel 80x86, DEC VAX, DEC Alpha (Windows NT).

Alignment: require that objects fall on an address that is a multiple of their size.

[Figure: byte numbering within a word for little vs. big endian (bytes 3 2 1 0 vs. 0 1 2 3 from msb to lsb), and examples of aligned vs. not-aligned words]
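Both byte orders, and the alignment rule, can be observed directly with Python's struct module (a quick check, not part of the lecture):

```python
import struct

word = 0x00000001
big = struct.pack(">I", word)     # big endian: MSB stored at the word address
little = struct.pack("<I", word)  # little endian: LSB stored at the word address

def aligned(addr, size):
    """An object is aligned if its address is a multiple of its size."""
    return addr % size == 0
```

Packing the same word both ways shows the byte at offset 0 differs: 0x00 for big endian, 0x01 for little endian, which is exactly the xx00 rule on the slide.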
Assembly Language vs. Machine Language
Assembly provides a convenient symbolic representation:
- much easier than writing down numbers
- e.g., destination first

Machine language is the underlying reality:
- e.g., the destination is no longer first

Assembly can provide 'pseudoinstructions':
- e.g., "move r10, r11" exists only in assembly
- it would be implemented using "add r10, r11, r0"

When considering performance you should count real instructions.
DLX arithmetic
ALU instructions can have 3 operands:
- add r1, r2, r3
- sub r1, r2, r3

Operand order is fixed (destination first).

Example:
C code: A = B + C
DLX code: add r1, r2, r3
(registers are associated with variables by the compiler)
DLX arithmetic
Design principle: simplicity favors regularity. Why?

Of course this complicates some things...

C code:
A = B + C + D;
E = F - A;

MIPS code:
add r1, r2, r3
add r1, r1, r4
sub r5, r6, r1

Operands must be registers, and only 32 registers are provided.

Design principle: smaller is faster. Why?
Executing assembly instructions:
- the program counter holds the instruction address
- the CPU fetches the instruction from memory and puts it into the instruction register
- control logic decodes the instruction and tells the register file, ALU, and other registers what to do
- for an ALU operation (e.g., add), data flows from the register file, through the ALU, and back to the register file
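The fetch-decode-execute cycle described above can be sketched as a tiny interpreter (a simplified model, not real hardware: one instruction per step, add/sub only, registers as a dictionary):

```python
# Toy fetch-decode-execute loop for 3-operand ALU instructions.
def run(program, regs):
    pc = 0                                  # program counter: index of the next instruction
    while pc < len(program):
        op, rd, rs1, rs2 = program[pc]      # fetch the instruction and decode its fields
        if op == "add":                     # execute: register file -> ALU -> register file
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "sub":
            regs[rd] = regs[rs1] - regs[rs2]
        pc += 1                             # advance to the next instruction
    return regs

# add r1, r2, r3 ; sub r5, r6, r1
r = run([("add", 1, 2, 3), ("sub", 5, 6, 1)],
        {1: 0, 2: 2, 3: 3, 5: 0, 6: 10})
```

In real hardware the "decode" step is control logic rather than an if/elif chain, but the data flow is the same.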
ALU Execution Example

[Figure: two steps of an ALU execution example - data flowing from the register file, through the ALU, and back to the register file]
Memory Instructions

Load and store instructions:
lw r11, offset(r10)
sw r11, offset(r10)

Example:
C code: A[8] = h + A[8]; (assume h is in r2 and the base address of array A is in r3)

DLX code:
lw r4, 32(r3)
add r4, r2, r4
sw r4, 32(r3)

Store word has the destination last.

Remember: arithmetic operands are registers, not memory!
Memory Operations - Loads
Load data from memory
lw R6, 0(R5) # assume R5 = 0x14, so R6 <= mem[0x14]
Memory Operations - Stores

Storing data to memory works essentially the same way:
sw R6, 0(R5) # assume R6 = 200 and R5 = 0x18: mem[0x18] <= 200
So far we’ve learned:
DLX:
- loading words, but addressing bytes
- arithmetic on registers only

| Instruction | Meaning |
|---|---|
| add r1, r2, r3 | r1 = r2 + r3 |
| sub r1, r2, r3 | r1 = r2 - r3 |
| lw r1, 100(r2) | r1 = Memory[r2 + 100] |
| sw r1, 100(r2) | Memory[r2 + 100] = r1 |
Use of Registers
Example:
a = (b + c) - (d + e); // C statement
# r1 - r5 hold a - e
add r10, r2, r3
add r11, r4, r5
sub r1, r10, r11

a = b + A[4]; // add an array element to a variable
// r3 has the address of A
lw r4, 16(r3)
add r1, r2, r4
Use of Registers : load and store
Example
A[8] = a + A[6]; // A is in r3, a is in r2

lw r1, 24(r3)  # r1 gets the contents of A[6]
add r1, r2, r1 # r1 gets the sum
sw r1, 32(r3)  # the sum is put in A[8]
load and store
Ex:
a = b + A[i]; // A is in r3; a, b, i are in r1, r2, r4

add r11, r4, r4   # r11 = 2 * i
add r11, r11, r11 # r11 = 4 * i
add r11, r11, r3  # r11 = address of A[i], i.e., r3 + 4*i
lw r10, 0(r11)    # r10 = A[i]
add r1, r2, r10   # a = b + A[i]
Example: Swap
Swapping words; r2 has the base address of the array v:

swap:
lw r10, 0(r2)
lw r11, 4(r2)
sw r10, 4(r2)
sw r11, 0(r2)

// temp = v[0]; v[0] = v[1]; v[1] = temp;
DLX Instruction Format
I-type instructions (6 | 5 | 5 | 16 bits):
Opcode | Rs1 | Rd | Immediate

R-type instructions (6 | 5 | 5 | 5 | 11 bits):
Opcode | Rs1 | Rs2 | Rd | func

J-type instructions (6 | 26 bits):
Opcode | Offset added to PC
Machine Language
Instructions, like registers and words of data, are also 32 bits long.

Example: add r10, r1, r2 (registers have numbers: 10, 1, 2)

R-type instruction format (6 | 5 | 5 | 5 | 11 bits):

Opcode | Rs1 | Rs2 | Rd | func
000000 | 00001 | 00010 | 01010 | 00000100000
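The bit layout above can be reproduced by shifting each field into place (a sketch; field widths per the slide: 6-bit opcode, 5-bit Rs1, 5-bit Rs2, 5-bit Rd, 11-bit func):

```python
def encode_rtype(opcode, rs1, rs2, rd, func):
    """Pack R-type fields into one 32-bit instruction word:
    opcode in bits 31-26, rs1 in 25-21, rs2 in 20-16, rd in 15-11, func in 10-0."""
    return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func

# add r10, r1, r2: opcode 0, func 0b100000, as on the slide
word = encode_rtype(0, 1, 2, 10, 0b100000)
```

Printing the word as 32 binary digits gives exactly the slide's bit pattern, with the register numbers 1, 2, and 10 visible in their 5-bit fields.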
Machine Language
Consider the load-word and store-word instructions. What would the regularity principle have us do? New principle: good design demands a compromise.

Introduce a new type of instruction format, I-type, for data transfer instructions (the other format was R-type, for register operations).

Example: lw r10, 32(r2)

I-type instruction format (6 | 5 | 5 | 16 bits):
Opcode | Rs1 | Rd | Immediate
100011 | 00010 | 01010 | 0000000000100000
Machine Language
J-type instructions (jump, jump and link, trap, return from exception), format (6 | 26 bits):
Opcode | Offset added to PC

Example: j .L1
000010 | offset to .L1
DLX Instruction Format
I-type instructions (6 | 5 | 5 | 16 bits):
Opcode | Rs1 | Rd | Immediate

R-type instructions (6 | 5 | 5 | 5 | 11 bits):
Opcode | Rs1 | Rs2 | Rd | func

J-type instructions (6 | 26 bits):
Opcode | Offset added to PC
Instructions for Making Decisions
beq reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 equals the value in reg2

bne reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 does not equal the value in reg2

j L1: unconditional jump to L1

jr r10: "jump register"; jump to the instruction address held in register r10
Making Decisions
Example
if (a != b) goto L1; // x, y, z, a, b mapped to r1-r5
x = y + z;
L1: x = x - a;

bne r4, r5, L1     # goto L1 if a != b
add r1, r2, r3     # x = y + z (skipped if a != b)
L1: sub r1, r1, r4 # x = x - a (always executed)
if-then-else
Example:
if (a == b) x = y + z;
else x = y - z;

bne r4, r5, Else     # goto Else if a != b
add r1, r2, r3       # x = y + z
j Exit               # goto Exit
Else: sub r1, r2, r3 # x = y - z
Exit:
Example: Loop with array index
Loop: g = g + A[i];
i = i + j;
if (i != h) goto Loop;

// r1, r2, r3, r4 = g, h, i, j; the array base is in r5

LOOP: add r11, r3, r3 # r11 = 2 * i
add r11, r11, r11     # r11 = 4 * i
add r11, r11, r5      # r11 = address of A[i]
lw r10, 0(r11)        # load A[i]
add r1, r1, r10       # g = g + A[i]
add r3, r3, r4        # i = i + j
bne r3, r2, LOOP
Other decisions

Set on less than: slt R1, R2, R3 compares two registers, R2 and R3:
R1 = 1 if R2 < R3, else R1 = 0 (if R2 >= R3)

Example: slt r11, r1, r2

Branch if less than. Example: if (A < B) goto LESS
slt r11, r1, r2  # r11 = 1 if A < B
bne r11, r0, LESS
Loops

Example:
while (A[i] == k)  // i, j, k in r3, r4, r5
  i = i + j;       // A is in r6

Loop: sll r11, r3, 2 # r11 = 4 * i
add r11, r11, r6     # r11 = address of A[i]
lw r10, 0(r11)       # r10 = A[i]
bne r10, r5, Exit    # goto Exit if A[i] != k
add r3, r3, r4       # i = i + j
j Loop               # goto Loop
Exit:
Addresses in Branches and Jumps

Instructions:
bne r14, r15, Label # next instruction is at Label if r14 != r15
beq r14, r15, Label # next instruction is at Label if r14 = r15
j Label             # next instruction is at Label

Formats:
I-type: op | rs | rt | 16-bit address
J-type: op | 26-bit address

Addresses are not 32 bits. How do we handle this with large programs?
First idea: limit branches to the first 2^16 bytes of the address space.
Addresses in Branches

Instructions:
bne r14, r15, Label # next instruction is at Label if r14 != r15
beq r14, r15, Label # next instruction is at Label if r14 = r15

Format (I-type): op | rs | rt | 16-bit address

Treat the 16-bit number as an offset to the PC register (PC-relative addressing):
- a word offset instead of a byte offset. Why? Most branches are local (principle of locality).

Jump instructions just use the high-order bits of the PC (pseudodirect addressing):
- 32-bit jump address = the 4 most significant bits of the PC concatenated with the 26-bit word address (i.e., a 28-bit byte address)
- address boundaries of 256 MB
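Both target computations can be sketched numerically (a sketch assuming a 32-bit PC and MIPS-style conventions: the branch offset is a signed word offset relative to the instruction after the branch):

```python
def branch_target(pc, word_offset):
    """PC-relative: 16-bit signed word offset, applied to PC + 4."""
    return (pc + 4) + word_offset * 4

def jump_target(pc, word_addr26):
    """Pseudodirect: top 4 bits of the PC concatenated with a
    26-bit word address (shifted left 2 to make a 28-bit byte address)."""
    return (pc & 0xF0000000) | (word_addr26 << 2)
```

The jump keeps the PC's top nibble, which is why a j instruction can only land within the current 256 MB region.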
Conditional Branch Distance

[Figure: histogram of branch displacement in words (0 to 15), integer average vs. FP average, frequencies from 0% to 40%]

65% of integer branches are 2 to 4 instructions.
Conditional Branch Addressing

Frequency of comparison types in branches:

| Comparison | Int Avg. | FP Avg. |
|---|---|---|
| EQ/NE | 86% | 37% |
| GT/LE | 7% | 23% |
| LT/GE | 7% | 40% |

PC-relative, since most branches are relatively close to the current PC. At least 8 bits suggested (128 instructions). Compare equal/not-equal is most important for integer programs (86%).
PC-relative addressing

[Figure: the addressable space reachable relative to the program counter]

For larger distances, jump register (jr) is required.
Example

Assume the address of LOOP is 80000.

LOOP: mult $9, $19, $10 # R9 = R19 * R10
      lw $8, 1000($9)   # R8 = mem[R9 + 1000]
      bne $8, $21, EXIT
      add $19, $19, $20 # i = i + j
      j LOOP
EXIT: ...

Encoded fields at each address:

80000: mult -> op=0, rs=19, rt=10, rd=9, funct=24
80004: lw   -> op=35, rs=9, rt=8, imm=1000
80008: bne  -> op=5, rs=8, rt=21, offset=8
80012: add  -> op=0, rs=19, rt=20, rd=19, funct=32
80016: j    -> op=2, target=80000
80020: EXIT: ...
Procedure calls
Procedures or subroutines are needed for structured programming.

Steps followed in executing a procedure call:
1. Place parameters where the procedure (callee) can access them
2. Transfer control to the procedure
3. Acquire the storage resources needed for the procedure
4. Perform the desired task
5. Place results where the calling program (caller) can access them
6. Return control to the point of origin
Resources Involved

Registers used for procedure calling:
• $a0 - $a3: four argument registers in which to pass parameters
• $v0 - $v1: two value registers in which to return values
• r31: one return-address register used to return to the point of origin

Transferring control to the callee: jal ProcedureAddress
• jump-and-link to the procedure address
• the return address (PC + 4) is saved in r31
• Example: jal 20000

Returning control to the caller: jr r31
• the instruction following the jal is executed next
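A minimal sketch of the jal/jr semantics on an emulated PC and link register (the variable names are assumptions of the sketch):

```c
#include <stdint.h>

/* Sketch of jal/jr on an emulated register file: jal saves
 * the return address (PC + 4) in r31, jr loads PC back from
 * the register. */
static uint32_t pc;
static uint32_t r31;

void jal(uint32_t procedure_address)
{
    r31 = pc + 4;             /* link: remember the next instruction */
    pc  = procedure_address;  /* jump */
}

void jr_r31(void)
{
    pc = r31;                 /* return to the point of origin */
}
```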
Memory Stacks

Useful for stacked environments / subroutine call & return, even if an operand stack is not part of the architecture.

Stacks that grow up vs. stacks that grow down:

[Figure: a stack holding a, b, c in memory. Addresses run from 0 ("Little") at the low end to "inf." ("Big") at the high end; depending on the convention, SP grows up toward high addresses or grows down toward low addresses.]
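The two conventions can be sketched on a plain array (illustrative only; the function names are invented):

```c
#define STACK_WORDS 16

/* Sketch of the two conventions on one array: the grow-down
 * pointer starts past the high end and is decremented before
 * each store; the grow-up pointer starts at 0 and is
 * incremented after each store. */
int mem[STACK_WORDS];

int sp_down = STACK_WORDS;  /* grows toward address 0        */
int sp_up   = 0;            /* grows toward high addresses   */

void push_down(int v) { mem[--sp_down] = v; }
int  pop_down(void)   { return mem[sp_down++]; }

void push_up(int v)   { mem[sp_up++] = v; }
int  pop_up(void)     { return mem[--sp_up]; }
```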
Calling conventions

int func(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}

# g, h, i, j are in $a0, $a1, $a2, $a3; f is in r8
func: addi $sp, $sp, -12   # make room in stack for 3 words
      sw   r11, 8($sp)     # save the regs we want to use
      sw   r10, 4($sp)
      sw   r8,  0($sp)
      add  r10, $a0, $a1   # r10 = g + h
      add  r11, $a2, $a3   # r11 = i + j
      sub  r8,  r10, r11   # r8 has the result
      add  $v0, r8, r0     # return reg $v0 has f
Calling (cont.)

      lw   r8,  0($sp)     # restore r8
      lw   r10, 4($sp)     # restore r10
      lw   r11, 8($sp)     # restore r11
      addi $sp, $sp, 12    # restore sp
      jr   $ra

• we did not have to restore r10 - r19 (caller save)
• we do need to restore r1 - r8 (must be preserved by the callee)
Nested Calls

Stacking of subroutine calls & returns and environments:

A:  CALL B
B:  CALL C
C:  RET
    RET

Environments over time:  A  ->  A B  ->  A B C  ->  A B  ->  A

• Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM)
• Sometimes stacks are implemented via software convention
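The environment stack from the diagram can be sketched in C as a string of procedure names (names invented for the sketch): each CALL pushes a name, each RET pops one, reproducing the A, AB, ABC, AB, A sequence.

```c
#include <string.h>

/* Sketch: environment stack tracked as a string of procedure
 * names.  CALL pushes a name, RET pops one. */
static char env[8];
static int  depth;

void call_proc(char name)
{
    env[depth++] = name;   /* push the new environment */
    env[depth]   = '\0';
}

void ret_proc(void)
{
    env[--depth] = '\0';   /* pop back to the caller's environment */
}
```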
Compiling a String Copy Procedure

void strcpy(char x[], char y[])
{
    int i = 0;
    while ((x[i] = y[i]) != 0)
        i++;
}

# x and y base addresses are in $a0 and $a1
strcpy:
      addi $sp, $sp, -4      # reserve 1 word of space in stack
      sw   r8, 0($sp)        # save r8
      add  r8, $zero, $zero  # i = 0
L1:   add  r11, $a1, r8      # addr. of y[i] in r11
      lb   r12, 0(r11)       # r12 = y[i]
      add  r13, $a0, r8      # addr. of x[i] in r13
      sb   r12, 0(r13)       # x[i] = y[i]
      beq  r12, $zero, L2    # if y[i] == 0 goto L2
      addi r8, r8, 1         # i++
      j    L1                # go to L1
L2:   lw   r8, 0($sp)        # restore r8
      addi $sp, $sp, 4       # restore $sp
      jr   $ra               # return
IA-32

• 1978: The Intel 8086 is announced (16-bit architecture)
• 1980: The 8087 floating-point coprocessor is added
• 1982: The 80286 increases address space to 24 bits, adds instructions
• 1985: The 80386 extends to 32 bits, new addressing modes
• 1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
• 1997: 57 new "MMX" instructions are added, Pentium II
• 1999: The Pentium III adds another 70 instructions (SSE)
• 2001: Another 144 instructions (SSE2)
• 2003: AMD extends the architecture to increase address space to 64 bits, widens all registers to 64 bits, and makes other changes (AMD64)
• 2004: Intel capitulates and embraces AMD64 (calls it EM64T) and adds more media extensions

"This history illustrates the impact of the 'golden handcuffs' of compatibility"
"adding new features as someone might add clothing to a packed bag"
"an architecture that is difficult to explain and impossible to love"
IA-32 Overview

Complexity:
• Instructions are from 1 to 17 bytes long
• one operand must act as both a source and a destination
• one operand can come from memory
• complex addressing modes, e.g., "base or scaled index with 8- or 32-bit displacement"

Saving grace:
• the most frequently used instructions are not too difficult to build
• compilers avoid the portions of the architecture that are slow

"what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective"
IA32 Registers

Oversimplified architecture:
• Four 32-bit general-purpose registers: eax, ebx, ecx, edx
• al refers to "the lower 8 bits of eax"
• Stack pointer: esp

Fun fact: the x86 was once a 16-bit CPU, so when it was upgraded to 32 bits, an "e" (for "extended") was added in front of every register name.
Intel 80x86 Integer Registers

GPR0   EAX      Accumulator
GPR1   ECX      Count register: string, loop
GPR2   EDX      Data register: multiply, divide
GPR3   EBX      Base address register
GPR4   ESP      Stack pointer
GPR5   EBP      Base pointer (base of stack segment)
GPR6   ESI      Index register
GPR7   EDI      Index register
       CS       Code segment pointer
       SS       Stack segment pointer
       DS       Data segment pointer
       ES       Extra data segment pointer
       FS       Data segment 2
       GS       Data segment 3
PC     EIP      Instruction counter
       EFLAGS   Condition codes
x86 Assembly

mov <dest>, <src>
• Move the value from <src> into <dest>
• Used to set initial values

add <dest>, <src>
• Add the value of <src> to <dest>

sub <dest>, <src>
• Subtract the value of <src> from <dest>
x86 Assembly

push <target>
• Push the value in <target> onto the stack
• Also decrements the stack pointer, ESP (remember the stack grows from high to low)

pop <target>
• Pop the value from the top of the stack and put it in <target>
• Also increments the stack pointer, ESP
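A sketch of the push/pop semantics on an emulated 32-bit stack (the 64-byte region and function names are assumptions of the sketch):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of x86 push/pop: push decrements ESP by 4 and then
 * stores; pop loads and then increments ESP by 4.  The stack
 * grows from high addresses toward low ones. */
uint8_t  stack_mem[64];
uint32_t esp = sizeof stack_mem;   /* start past the high end */

void push32(uint32_t v)
{
    esp -= 4;                          /* grow downward first */
    memcpy(&stack_mem[esp], &v, 4);    /* then store          */
}

uint32_t pop32(void)
{
    uint32_t v;
    memcpy(&v, &stack_mem[esp], 4);    /* load from the top   */
    esp += 4;                          /* then shrink upward  */
    return v;
}
```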
x86 Assembly

jmp <address>
• Jump to an instruction (like goto)
• Changes EIP to <address>

call <address>
• A function call
• Pushes the address of the next instruction (the return address) onto the stack, then jumps to <address>
x86 Assembly

lea <dest>, <src>
• Load the effective address of <src> into register <dest>
• Used for pointer arithmetic (no actual memory reference)

int <value>
• software interrupt: a trap into the operating system kernel, with flag <value>
• int 0x80 means "Linux system call"
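What lea computes can be sketched as plain arithmetic over the "base + index*scale + displacement" addressing mode (the helper name is invented):

```c
#include <stdint.h>

/* Sketch: lea forms an address from the addressing-mode
 * components without touching memory.  For example,
 * "lea eax, [ebx + ecx*4 + 8]" computes lea(ebx, ecx, 4, 8). */
uint32_t lea(uint32_t base, uint32_t index, uint32_t scale, int32_t disp)
{
    return base + index * scale + disp;
}
```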
x86 Assembly

Condition codes:
• CF: Carry flag (unsigned overflow detection)
• ZF: Zero flag
• SF: Sign flag
• OF: Overflow flag (signed overflow detection)

Condition codes are usually accessed through conditional branches, not directly.
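How the flags fall out of an addition can be sketched for the 8-bit case (the helper and struct are illustrative; the flag definitions are the usual ones):

```c
#include <stdint.h>

/* Sketch: the four flags after an 8-bit add.
 * CF = carry out (unsigned overflow), ZF = result is zero,
 * SF = sign bit of the result, OF = signed overflow. */
typedef struct { int cf, zf, sf, of; } flags_t;

flags_t add8_flags(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a + b);
    flags_t f;
    f.cf = ((uint16_t)a + (uint16_t)b) > 0xFF;   /* unsigned carry  */
    f.zf = (r == 0);
    f.sf = (r >> 7) & 1;
    f.of = (~(a ^ b) & (a ^ r) & 0x80) != 0;     /* signed overflow */
    return f;
}
```

For instance 0x7F + 1 sets OF (127 + 1 overflows signed 8-bit) but not CF, while 0xFF + 1 sets CF and ZF but not OF (-1 + 1 = 0).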
Interrupt convention

int 0x80: system call interrupt
• eax: system call number (e.g., 1-exit, 2-fork, 3-read, 4-write)
• ebx: argument #1
• ecx: argument #2
• edx: argument #3
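A toy sketch of the dispatch-on-eax idea (syscall numbers from the slide; the dispatcher itself is purely illustrative, not kernel code):

```c
/* Toy sketch of the int 0x80 convention: eax selects the
 * system call.  Numbers follow the slide (1-exit, 2-fork,
 * 3-read, 4-write); the dispatcher stands in for the
 * kernel's dispatch and just names the selected call. */
enum { SYS_exit = 1, SYS_fork = 2, SYS_read = 3, SYS_write = 4 };

const char *dispatch(int eax)
{
    switch (eax) {
    case SYS_exit:  return "exit";
    case SYS_fork:  return "fork";
    case SYS_read:  return "read";
    case SYS_write: return "write";
    default:        return "ENOSYS";   /* unknown call number */
    }
}
```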
CISC vs RISC

RISC = Reduced Instruction Set Computer (DLX)
CISC = Complex Instruction Set Computer (x86)

Both have their advantages.
RISC

• Not very many instructions
• All instructions are the same length, both in execution time and in bits
• Results in simpler CPUs (easier to optimize)
• Usually takes more instructions to express a useful operation
CISC

• Very many instructions (the x86 instruction set reference runs over 700 pages)
• Variable-length instructions
• More complex CPUs
• Usually takes fewer instructions to express a useful operation (lower memory requirements)
Some links

Great stuff on x86 asm:
http://www.cs.dartmouth.edu/~cs37/
Instruction Architecture Ideas

• If code size is most important, use variable-length instructions
• If simplicity is most important, use fixed-length instructions
• Simpler ISAs are easier to optimize!
• Recent embedded machines (ARM, MIPS) added an optional mode to execute a subset of 16-bit-wide instructions (Thumb, MIPS16); each procedure can choose performance or density
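The decode-cost difference can be sketched with a toy encoding (the opcode-to-length rule is invented): with fixed-length instructions the next PC is always pc + 4, while a variable-length decoder must inspect each instruction before it can even find the next one.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed length: the next instruction is always 4 bytes on. */
size_t fixed_next(size_t pc) { return pc + 4; }

/* Variable length: the length depends on the opcode, so the
 * decoder must read the instruction first.  The low-bits ->
 * length rule here is invented for the sketch. */
size_t variable_next(const uint8_t *code, size_t pc)
{
    switch (code[pc] & 0x3) {
    case 0:  return pc + 1;   /* short, register-only form */
    case 1:  return pc + 2;   /* 8-bit immediate form      */
    default: return pc + 5;   /* 32-bit immediate form     */
    }
}
```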
Summary

• Instruction complexity is only one variable: lower instruction count vs. higher CPI / lower clock rate
• Design principles: simplicity favors regularity; smaller is faster; good design demands compromise; make the common case fast
• The instruction set architecture is a very important abstraction indeed!