CDA 5155 and 4150
Computer Architecture
Week 2: 2 September 2014
Goals of the course
• Advanced coverage of computer architecture: general purpose processors, embedded processors, historically significant processors, design tools.
– Instruction set architecture
– Processor microarchitecture
– Systems architecture
• Memory systems
• I/O systems
Teaching Staff
• Professor Gary Tyson
– PhD: University of California – Davis
– Faculty jobs:
• California State University Sacramento: 1987 - 1990
• University of California – Davis: 1993 - 1994
• University of California – Riverside: 1995 - 1996
• University of Michigan: 1997 - 2003
• Florida State University: 2003 – present
Grading in 5155 (Fall’14)
Programming assignments
– In-order pipeline simulation (10%)
– Out-of-order pipeline simulation (10%)
Exams (2 @ 25% each)
– In class, 75 minutes
Team Project (20%)
– 3 or 4 students per team
Class Participation (10%)
Time Management
• 3 hours/week lecture
– This is probably the most important time
• 2 hours/week reading
– Hennessy/Patterson: Computer Architecture: A Quantitative Approach
• 3-5 hours/week exam prep
• 5+ hours/week Project (1/3 semester)
Total: ~10-15 hours per week.
Tentative Course Timeline
Week  Date     Topic                            Holidays  Notes
1     Aug 28   Performance, ISA, Pipelining
2     Sept 2   Pipelining, Branch Prediction
3     Sept 9   Superscalar, Exceptions
4     Sept 16  Compilers, VLIW
5     Sept 23  Dynamic Scheduling
6     Sept 30  Dynamic Scheduling
7     Oct 7    Advanced pipelines
8     Oct 14   Advanced pipelines
9     Oct 21   Cache design
10    Oct 28   Cache design, VM
11    Nov 4    Multiprocessor, Multithreading
12    Nov 11   Embedded processors
13    Nov 18   Embedded processors
14    Nov 25   Research Topics
15    Dec 2    Research Topics
Notes (row placement not recoverable from the slide): Exam (twice), project, Due Dates
Web Resources
Course Web Page: http://www.cs.fsu.edu/~tyson/courses/CDA5155
Wikipedia: http://en.wikipedia.org/wiki/Microprocessor
Wisconsin Architecture Page: http://arch-www.cs.wisc.edu/home
Levels of Abstraction
• Problem/Idea (English?)
• Algorithm (pseudo-code)
• High-Level languages (C, Verilog)
• Assembly instructions (OS calls)
• Machine instructions (I/O interfaces)
• Microarchitecture/organization (block diagrams)
• Logic level: gates, flip-flops (schematic, HDL)
• Circuit level: transistors, sizing (schematic, HDL)
• Physical: VLSI layout, feature size, cabling, PC boards.
What are the abstractions at each level?
At what level do I perform a square root? Recursion?
Who/what translates from one level to the next?
Role of Architecture
• Responsible for hardware specification:
– Instruction set design
• Also responsible for structuring the overall implementation
– Microarchitectural design.
• Interacts with everyone
– mainly compiler and logic level designers.
• Cannot do a good job without knowledge of both sides
Design Issues: Performance
• Get acceptable performance out of system.
– Scientific: floating point throughput, memory & disk intensive, predictable
– Commercial: string handling, disk (databases), predictable
– Multimedia: specific data types (pixels), network? Predictable?
– Embedded: what do you mean by performance?
– Workstation: Maybe all of the above, maybe not
Calculating Performance
• Execution time is often the best metric
• Throughput (tasks/sec) vs. latency (sec/task)
• Benchmarks: what are the tasks?
– What I care about!
– Representative programs (SPEC, Linpack)
– Kernels: representative code fragments
– Toy programs: useful for testing end-conditions
– Synthetic programs: do no useful work, but execute a representative instruction mix.
Design Issues: Cost
• Processor
– Die size, packaging, heat sink? Gold connectors?
– Support: fan, connectors, motherboard specifications, etc.
• Calculating processor cost:
– Cost of device = (die + package + testing) / yield
– Die cost = wafer cost / good die yield (the number of good dies per wafer)
• Good die yield is related to die size and defect density
– Support costs: direct costs (components, labor), indirect costs (sales, service, R&D)
– Total costs amortized over number of systems sold (PC vs NASA)
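The cost relations above can be tried out numerically. The following is a minimal sketch, assuming the dies-per-wafer and die-yield models given in the Hennessy/Patterson text (the slide itself does not spell them out); the function names, the process-complexity factor `n`, and all sample numbers are illustrative assumptions.

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Approximate die count: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_cm / 2
    area_term = math.pi * radius ** 2 / die_area_cm2
    edge_term = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return area_term - edge_term

def die_yield(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    """Fraction of dies that work; n is an assumed process-complexity factor."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    """Die cost = wafer cost / number of good dies per wafer."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2))
    return wafer_cost / good_dies

def device_cost(die, package, testing, final_test_yield):
    """Cost of device = (die + package + testing) / yield."""
    return (die + package + testing) / final_test_yield

# Hypothetical numbers: 30 cm wafer costing 5000, 2.25 cm^2 die, 0.031 defects/cm^2.
cost = die_cost(5000, 30, 2.25, 0.031)
print(round(cost, 2), round(device_cost(cost, 2, 3, 0.95), 2))
```

Because yield falls off steeply with die area, doubling the die size more than doubles the die cost in this model.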
Other design issues
• Some applications care about other design issues.
• NASA deep space mission
– Reliability: software and hardware (radiation hardening)
• AMD
– Code compatibility
• ARM
– Power
A Quantitative Approach
• Hardware systems performance is generally easy to quantify
– Machine A is 10% faster than Machine B
– Of course Machine B’s advertising will show the opposite conclusion
• Many software systems tend to have much more subjective performance evaluations.
Measuring Performance
• Total Execution Time:
– A is 3 times faster than B for programs P1, P2
– Issue: Emphasizes long running programs
$\frac{1}{n} \sum_{i=1}^{n} \mathrm{Time}_i$
Measuring Performance
• Weighted Execution Time:
– What if P1 is executed far more frequently?
$\sum_{i=1}^{n} \mathrm{Weight}_i \times \mathrm{Time}_i$
Measuring Performance
• Normalized Execution Time:
– Compare machine performance to a reference machine and report a ratio.
• SPEC ratings measure relative performance to a reference machine.
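The three summary metrics (mean, weighted, and normalized execution time) can be compared side by side. This is a minimal sketch; the function names and sample timings are made up for illustration, and the geometric mean is used to combine the normalized ratios because that is how SPEC summarizes them, not because the slide prescribes it.

```python
import math

def arithmetic_mean(times):
    """(1/n) * sum of Time_i; dominated by long-running programs."""
    return sum(times) / len(times)

def weighted_time(weights, times):
    """Sum of Weight_i * Time_i; weights reflect how often each program runs."""
    return sum(w * t for w, t in zip(weights, times))

def normalized_geomean(times, ref_times):
    """Geometric mean of per-program ratios (reference time / measured time)."""
    ratios = [ref / t for ref, t in zip(ref_times, times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in seconds for programs P1 and P2.
machine = [1.0, 1000.0]
print(arithmetic_mean(machine))               # P2 dominates the mean
print(weighted_time([0.9, 0.1], machine))     # P1 runs 90% of the time
print(normalized_geomean([2.0, 8.0], [4.0, 4.0]))
```

In the last call the machine is twice as fast as the reference on one program and half as fast on the other, so the geometric mean reports no net change.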
Amdahl’s Law
• Rule of Thumb: Make the common case faster
(Attack the longest running part until it is no longer the longest; repeat)
http://en.wikipedia.org/wiki/Amdahl's_law
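The slide states only the rule of thumb; the law itself says that overall speedup is limited by the fraction of time the enhancement does not touch. A minimal sketch of the standard formula, with hypothetical parameter names:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original execution
    time is sped up by a factor of `speedup_enhanced`."""
    remaining = 1.0 - fraction_enhanced
    return 1.0 / (remaining + fraction_enhanced / speedup_enhanced)

# Speeding up 40% of the run time by 10x gives about 1.56x overall.
print(amdahl_speedup(0.4, 10))
# Even an enormous speedup of half the program cannot reach 2x.
print(amdahl_speedup(0.5, 1e12))
```

This is why the advice is to attack the longest-running part first: the untouched fraction sets the ceiling.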
Instruction Set Design
• Software Systems: named variables; complex semantics.
• Hardware systems: tight timing requirements; small storage structures; simple semantics
• Instruction set: the interface between very different software and hardware systems
Design decisions
• How much “state” is in the microarchitecture?
– Registers; Flags; IP/PC
• How is that state accessed/manipulated?
– Operand encoding
• What commands are supported?
– Opcode; opcode encoding
Design Challenges: or why is architecture still relevant?
• Clock frequency is increasing
– This changes the number of levels of gates that can be completed each cycle, so old designs don’t work.
– It also tends to increase the ratio of time spent on wires (fixed speed of light)
• Power
– Faster chips are hotter; bigger chips are hotter
Design Challenges (cont)
• Design Complexity
– More complex designs to fix frequency/power issues lead to increased development/testing costs
– Failures (design or transient) can be difficult to understand (and fix)
• We seem far less willing to live with hardware errors (e.g. FDIV) than software errors, which are often dealt with through upgrades (that we pay for!)
Techniques for Encoding Operands
• Explicit operands:
– Includes a field to specify which state data is referenced
– Example: register specifier
• Implicit operands:
– All state data can be inferred from the opcode
– Example: function return (CISC-style)
Accumulator
• Architectures with one implicit register
– Acts as source and/or destination
– One other source explicit
• Example: C = A + B
– Load A // (Acc)umulator = A
– Add B // Acc = Acc + B
– Store C // C = Acc
Ref: “Instruction Level Distributed Processing: Adapting to Shifting Technology”
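The accumulator example can be traced with a toy interpreter. This is a sketch only: the tuple encoding of instructions and the dictionary used as memory are assumptions made for illustration, not part of any real ISA.

```python
def run_accumulator(program, memory):
    """Execute (opcode, address) pairs on a one-register machine."""
    acc = 0  # the single implicit register
    for op, addr in program:
        if op == "Load":
            acc = memory[addr]        # Acc = Mem[addr]
        elif op == "Add":
            acc += memory[addr]       # Acc = Acc + Mem[addr]
        elif op == "Store":
            memory[addr] = acc        # Mem[addr] = Acc
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory

# C = A + B with hypothetical values A = 2, B = 3.
acc_memory = {"A": 2, "B": 3, "C": 0}
run_accumulator([("Load", "A"), ("Add", "B"), ("Store", "C")], acc_memory)
print(acc_memory["C"])
```

Each instruction names only one operand because the accumulator is always the other source and the destination.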
Stack
• Architectures with implicit “stack”
– Acts as source(s) and/or destination
– Push and Pop operations have 1 explicit operand
• Example: C = A + B
– Push A // Stack = {A}
– Push B // Stack = {A, B}
– Add // Stack = {A+B}
– Pop C // C = A+B ; Stack = {}
Compact encoding; may require more instructions though
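The stack example can be traced the same way. A minimal sketch, assuming instructions are written as tuples and memory is a dictionary (both illustrative choices):

```python
def run_stack(program, memory):
    """Execute stack-machine instructions; returns the final stack."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "Push":
            stack.append(memory[instr[1]])    # one explicit operand
        elif op == "Pop":
            memory[instr[1]] = stack.pop()    # one explicit operand
        elif op == "Add":
            right = stack.pop()               # both sources are implicit
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack

# C = A + B with hypothetical values A = 2, B = 3.
stack_memory = {"A": 2, "B": 3, "C": 0}
final_stack = run_stack(
    [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")], stack_memory)
print(stack_memory["C"], final_stack)
```

`Add` carries no operand fields at all, which is where the compact encoding comes from; the price is the extra Push and Pop instructions.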
Registers
• Most general (and common) approach
– Small array of storage
– Explicit operands (register file index)
• Example: C = A + B
Register-memory     Load/store
Load R1, A          Load R1, A
                    Load R2, B
Add R3, R1, B       Add R3, R1, R2
Store R3, C         Store R3, C
Memory
• Big array of storage
– More complex ways of indexing than registers
• Build addressing modes to support efficient translation of software abstractions
• Uses less space in instruction than 32-bit immediate field
A[i]; use base (i) + displacement (A) (scaled?)
a.ptr; use base (a) + displacement (ptr)
Addressing modes
Register Add R4, R3
Immediate Add R4, #3
Base/Displacement Add R4, 100(R1)
Register Indirect Add R4, (R1)
Indexed Add R4, (R1+R2)
Direct Add R4, (1001)
Memory Indirect Add R4, @(R3)
Autoincrement Add R4, (R2)+
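To make the table concrete, the sketch below computes the second source operand of `Add R4, <operand>` for each mode. The semantics follow the usual textbook definitions; the function name, the dictionary-based register file and memory, and the autoincrement step of 4 are assumptions for illustration.

```python
def operand(mode, regs, mem, reg=None, index=None, const=0, step=4):
    """Return the value of the second source operand for one addressing mode."""
    if mode == "register":            # Add R4, R3
        return regs[reg]
    if mode == "immediate":           # Add R4, #3
        return const
    if mode == "displacement":        # Add R4, 100(R1)
        return mem[const + regs[reg]]
    if mode == "register_indirect":   # Add R4, (R1)
        return mem[regs[reg]]
    if mode == "indexed":             # Add R4, (R1+R2)
        return mem[regs[reg] + regs[index]]
    if mode == "direct":              # Add R4, (1001)
        return mem[const]
    if mode == "memory_indirect":     # Add R4, @(R3)
        return mem[mem[regs[reg]]]
    if mode == "autoincrement":       # Add R4, (R2)+
        value = mem[regs[reg]]
        regs[reg] += step             # side effect: bump the pointer
        return value
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical machine state.
regs = {"R1": 200, "R2": 8, "R3": 300, "R4": 1}
mem = {8: 40, 200: 10, 208: 20, 300: 208, 1001: 30}
print(operand("displacement", regs, mem, reg="R1", const=100))
print(operand("memory_indirect", regs, mem, reg="R3"))
```

Memory indirect needs two memory reads for one operand, which is one reason load/store designs tend to leave it out.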
Other Memory Issues
What is the size of each element in memory?
Byte (8 bits) at 0x000: 0 - 255
Half word (16 bits) at 0x000: 0 - 65535
Word (32 bits) at 0x000: 0 - ~4B
Other Memory Issues
Big-endian or Little-endian? Store 0x114488FF
Big-endian: 0x000: 11 44 88 FF (address points to most significant byte)
Little-endian: 0x000: FF 88 44 11 (address points to least significant byte)
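The two byte orders for 0x114488FF can be reproduced with Python's built-in integer conversions. This is just an illustration of the layout, not a model of any particular machine.

```python
value = 0x114488FF

# Byte at the lowest address comes first in each result.
big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))      # most significant byte at the lowest address
print(little.hex(" "))   # least significant byte at the lowest address

# Reading the bytes back with the wrong byte order yields a different value.
print(hex(int.from_bytes(little, byteorder="big")))
```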
Other Memory Issues
Non-word loads? ldb R3, (000)
Memory at 0x000: 11 44 88 FF
R3 = 00 00 00 11
Other Memory Issues
Non-word loads? ldb R3, (003)
Memory at 0x000: 11 44 88 FF (0x003 holds FF)
R3 = FF FF FF FF (sign extended)
Other Memory Issues
Non-word loads? ldbu R3, (003)
Memory at 0x000: 11 44 88 FF (0x003 holds FF)
R3 = 00 00 00 FF (zero filled)
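The difference between the signed and unsigned byte loads can be sketched as follows. The function names mirror the `ldb` and `ldbu` mnemonics on the slides; the 32-bit register width and the byte-array memory are assumptions.

```python
MEMORY = bytes([0x11, 0x44, 0x88, 0xFF])   # contents at addresses 0x000-0x003

def ldbu(memory, addr):
    """Load byte unsigned: upper 24 bits of the register are zero filled."""
    return memory[addr]

def ldb(memory, addr):
    """Load byte signed: bit 7 of the byte is copied into the upper 24 bits."""
    byte = memory[addr]
    signed = byte - 0x100 if byte & 0x80 else byte
    return signed & 0xFFFFFFFF             # view as a 32-bit register pattern

print(f"{ldb(MEMORY, 0):08X}")    # positive byte: same result either way
print(f"{ldb(MEMORY, 3):08X}")    # sign extended
print(f"{ldbu(MEMORY, 3):08X}")   # zero filled
```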
Other Memory Issues
Alignment?
– Word accesses: only addresses ending in 00 (binary)
– Half-word accesses: only addresses ending in 0 (binary)
– Byte accesses: any address
Memory at 0x000: 11 44 88 FF (0x002 holds 88)
ldw R3, (002) is illegal!
Why is it important to be aligned? How can it be enforced?
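One possible answer to the enforcement question is to check the low address bits and raise an exception on a misaligned access. The sketch below assumes that policy; hardware could instead split the access into two, and the names and the big-endian word assembly are illustrative choices.

```python
def is_aligned(addr, size):
    """True if addr is a multiple of the access size in bytes (1, 2 or 4)."""
    return addr % size == 0

def load_word(memory, addr):
    """Load a 32-bit word, trapping on a misaligned address (assumed policy)."""
    if not is_aligned(addr, 4):
        raise ValueError(f"misaligned word access at address {addr:#05x}")
    return int.from_bytes(memory[addr:addr + 4], byteorder="big")

data = bytes([0x11, 0x44, 0x88, 0xFF, 0x00, 0x00, 0x00, 0x00])
print(hex(load_word(data, 0)))
try:
    load_word(data, 2)            # the illegal ldw R3, (002) from the slide
except ValueError as err:
    print(err)
```

An aligned word never straddles two memory rows or cache lines, so it can always be fetched in a single access.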
Techniques for Encoding Operators
• Opcode is translated to control signals that
– direct data (MUX control)
– select operation for ALU
– set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and how compact the opcode encoding.
– Microcode – direct control of signals (Improv)
– Opcode – compact representation of a set of control signals.
• You can make decode easier with careful opcode selection
Handling Control Flow
• Conditional branches (short range)
• Unconditional branches (jumps)
• Function calls
• Returns
• Traps (OS calls and exceptions)
• Predicates (conditional retirement)
Encoding branch targets
• PC-relative addressing
– Makes linking code easier
• Indirect addressing
– Jumps into shared libraries, virtual functions, case/switch statements
• Some unusual modes to simplify target address calculation
– (segment offset) or (trap number)
Condition codes
• Flags
– Implicit: flag(s) specified in opcode (bgt)
– Flag(s) set by earlier instructions (compare, add, etc.)
• Register
– Uses a register; requires explicit specifier
• Comparison operation
– Two registers with compare operation specified in opcode.
Higher Level Semantics: Functions
• Function call semantics
– Save PC + 1 instruction for return
– Manage parameters
– Allocate space on stack
– Jump to function
• Simple approach:
– Use a jump instruction + other instructions
• Complex approach:
– Build implicit operations into new “call” instruction
Role of the Compiler
• Compilers make the complexity of the ISA (from the programmer’s point of view) less relevant.
– Non-orthogonal ISAs are more challenging.
– State allocation (register allocation) is better left to compiler heuristics
– Complex semantics lead to more global optimization, which is easier for a machine to do.
People are good at optimizing 10 lines of code.
Compilers are good at optimizing 10M lines.
LC processor
• Little Computer Fall 2011
– For programming projects
• Instruction Set Design
opcode  regA  regB  destReg
LC processor
R-type instructions
Field:  opcode   regA    regB    unused  destReg
Bits:   24-22    21-19   18-16   15-3    2-0
add: destReg = regA + regB
nand: destReg = ~(regA & regB)
LC processor
I-type instructions
Field:  opcode   regA    regB    offsetField
Bits:   24-22    21-19   18-16   15-0
lw: regB = Memory[regA + offsetField]
sw: Memory[regA + offsetField] = regB
beq: if (regA == regB) PC = PC + 1 + offsetField
LC processor
O-type instructions
Field:  opcode   unused
Bits:   24-22    21-0
noop: do nothing
halt: halt the simulation
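A simulator has to pull these fields back out of each 25-bit instruction word. The following is a minimal sketch based on the bit positions in the three formats above; the function name and the returned dictionary are illustrative, and treating offsetField as a 16-bit two's-complement value is an assumption (it is consistent with the -3 branch offset in the machine code example).

```python
def decode(word):
    """Split an LC instruction word into its fields (all formats at once)."""
    offset = word & 0xFFFF             # bits 15-0 (I-type)
    if offset & 0x8000:                # assumed: sign-extend 16-bit offset
        offset -= 0x10000
    return {
        "opcode": (word >> 22) & 0x7,  # bits 24-22
        "regA": (word >> 19) & 0x7,    # bits 21-19
        "regB": (word >> 16) & 0x7,    # bits 18-16
        "destReg": word & 0x7,         # bits 2-0 (R-type)
        "offsetField": offset,
    }

# Words taken from the machine code example for this deck.
print(decode(655361))      # add 1 2 1
print(decode(16842749))    # beq 0 0 start, offset -3
```

The caller picks `destReg` or `offsetField` depending on the opcode; decoding both unconditionally keeps the function format-agnostic.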
LC assembly example
        lw    0 1 five     load reg1 with 5 (uses symbolic address)
        lw    1 2 3        load reg2 with -1 (uses numeric address)
start   add   1 2 1        decrement reg1
        beq   0 1 2        goto end of program when reg1==0
        beq   0 0 start    go back to the beginning of the loop
        noop
done    halt               end of program
five    .fill 5
neg1    .fill -1
stAddr  .fill start        will contain the address of start (2)
LC machine code example
(address 0): 8454151 (hex 0x810007)
(address 1): 9043971 (hex 0x8a0003)
(address 2): 655361 (hex 0xa0001)
(address 3): 16842754 (hex 0x1010002)
(address 4): 16842749 (hex 0x100fffd)
(address 5): 29360128 (hex 0x1c00000)
(address 6): 25165824 (hex 0x1800000)
(address 7): 5 (hex 0x5)
(address 8): -1 (hex 0xffffffff)
(address 9): 2 (hex 0x2)
Input for simulator:
8454151
9043971
655361
16842754
16842749
29360128
25165824
5
-1
2
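The machine code can be reproduced from the assembly example by packing fields at the bit positions given for each format. This is a sketch, not the course assembler: the opcode numbers are inferred from the words listed above (the slides do not state them, and nand and sw are left out because their numbers cannot be inferred), and the function names are hypothetical.

```python
# Opcode numbers inferred from the listed machine code (assumption).
OPCODES = {"add": 0, "lw": 2, "beq": 4, "halt": 6, "noop": 7}

def encode_r(op, reg_a, reg_b, dest_reg):
    """R-type: opcode | regA | regB | unused | destReg."""
    return (OPCODES[op] << 22) | (reg_a << 19) | (reg_b << 16) | dest_reg

def encode_i(op, reg_a, reg_b, offset):
    """I-type: opcode | regA | regB | 16-bit two's-complement offset."""
    return (OPCODES[op] << 22) | (reg_a << 19) | (reg_b << 16) | (offset & 0xFFFF)

def encode_o(op):
    """O-type: opcode only."""
    return OPCODES[op] << 22

# Labels resolved by hand: five = 7, start = 2.
# The branch back to start uses offset start - (4 + 1) = -3.
program = [
    encode_i("lw", 0, 1, 7),     # lw  0 1 five
    encode_i("lw", 1, 2, 3),     # lw  1 2 3
    encode_r("add", 1, 2, 1),    # add 1 2 1
    encode_i("beq", 0, 1, 2),    # beq 0 1 2
    encode_i("beq", 0, 0, -3),   # beq 0 0 start
    encode_o("noop"),
    encode_o("halt"),
]
for address, word in enumerate(program):
    print(f"(address {address}): {word} (hex {word:#x})")
```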