University of Amsterdam
Computer Systems – the processor architecture Arnoud Visser 1
Computer Systems
The processor architecture
Basic Knowledge
• Relative timing of the elements is important
Programmer-visible state
Von Neumann architecture, both instructions and data in memory
Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp
Condition codes: CC
Program counter: PC
Memory
Program counter
• The program counter holds the address of the instruction currently executed
• The next instruction has to be collected from memory (slow!)
[Figure: process address space, from high to low addresses]
0xffffffff
  Kernel virtual memory (memory invisible to user code)
0xc0000000
  User stack (created at runtime)
  Memory-mapped region for shared libraries (printf() function)
0x40000000
  Run-time heap (created at runtime by malloc)
  Read/write data
  Read-only code and data (loaded from the hello executable file)
0x08048000
  Unused
0
Processing a single instruction
• Fetch – Read the instruction (1-6 bytes) from memory
• Decode – Read the operand values from the registers
• Execute – Perform an arithmetic/logic operation OR test the jump conditions
• Memory – Read from or write to memory
• Write back – Update the registers
• PC update – Set the PC to the address of the next instruction
Seq. architecture
• Hardware blocks connected by named wires (word-wide, byte-wide or narrower, and single-bit signals)
[Figure: SEQ hardware, organized by stage: Fetch (PC, instruction memory, PC increment), Decode and Write back (register file with ports A, B, M, E), Execute (ALU, CC), Memory (data memory).]
[Figure: Fetch logic. The instruction memory is read at address PC; Split turns byte 0 into icode and ifun, Align turns bytes 1-5 into rA, rB and valC; PC increment computes valP. Control signals: Instr valid, Need regids, Need valC.]
Stage Computation: ALU Operation
– Formulate instruction execution as a sequence of simple steps
– Use the same general form for all instructions

OPl rA, rB
Stage       Computation             Description
Fetch       icode:ifun ← M1[PC]     Read instruction byte
            rA:rB ← M1[PC+1]        Read register byte
            valP ← PC+2             Compute next PC
Decode      valA ← R[rA]            Read operand A
            valB ← R[rB]            Read operand B
Execute     valE ← valB OP valA     Perform ALU operation
            Set CC                  Set condition code register
Memory      (none)
Write back  R[rB] ← valE            Write back result
PC update   PC ← valP               Update PC
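The stage table for OPl can be followed step by step in software. The sketch below is a simplified model, not the actual hardware: the function name step_opl, the ifun encoding (0 = add, 1 = sub, 2 = and, 3 = xor), the register numbering and the use of a byte list as memory are assumptions made for illustration.

```python
import operator

MASK = 0xFFFFFFFF  # registers are 32 bits wide

# Assumed ifun encoding for this sketch: 0=add, 1=sub, 2=and, 3=xor.
ALU_OPS = {0: operator.add, 1: operator.sub, 2: operator.and_, 3: operator.xor}


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement number."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def step_opl(mem, regs, pc):
    """Run one 'OPl rA, rB' instruction through the six stages."""
    # Fetch: instruction byte, register byte, next PC
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2
    # Decode: read both operands
    valA, valB = regs[rA], regs[rB]
    # Execute: valE <- valB OP valA, and set the condition codes
    exact = ALU_OPS[ifun](to_signed(valB), to_signed(valA))
    valE = exact & MASK
    cc = {
        "ZF": valE == 0,
        "SF": bool(valE >> 31),
        # Overflow can only happen for add/sub in this model.
        "OF": ifun in (0, 1) and not -2**31 <= exact < 2**31,
    }
    # Memory: nothing to do for OPl
    # Write back: R[rB] <- valE
    regs[rB] = valE
    # PC update: PC <- valP
    return valP, cc
```

For example, with register 2 holding 3 and register 3 holding 10, the two bytes 0x61 0x23 (subtract register 2 from register 3 under the assumed encoding) leave 7 in register 3 and return a new PC of 2.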
Stage Computation: procedure call
– Use ALU to decrement stack pointer
– Store incremented PC

call Dest
Stage       Computation             Description
Fetch       icode:ifun ← M1[PC]     Read instruction byte
            valC ← M4[PC+1]         Read destination address
            valP ← PC+5             Compute return point
Decode      valB ← R[%esp]          Read stack pointer
Execute     valE ← valB + –4        Decrement stack pointer
Memory      M4[valE] ← valP         Write return address on stack
Write back  R[%esp] ← valE          Update stack pointer
PC update   PC ← valC               Set PC to destination
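The call sequence can be traced the same way. This is a minimal sketch under stated assumptions: memory is a little-endian bytearray, %esp has register id 4, and the helper names read_word, write_word and step_call are hypothetical.

```python
MASK = 0xFFFFFFFF
ESP = 4  # assumed register id of %esp in this sketch


def read_word(mem, addr):
    """M4[addr]: read a 4-byte little-endian word."""
    return int.from_bytes(mem[addr:addr + 4], "little")


def write_word(mem, addr, value):
    """M4[addr] <- value: write a 4-byte little-endian word."""
    mem[addr:addr + 4] = (value & MASK).to_bytes(4, "little")


def step_call(mem, regs, pc):
    """Run one 'call Dest' instruction through the six stages."""
    # Fetch: instruction byte, 4-byte destination, return point
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    valC = read_word(mem, pc + 1)
    valP = pc + 5
    # Decode: read the stack pointer
    valB = regs[ESP]
    # Execute: decrement the stack pointer
    valE = (valB - 4) & MASK
    # Memory: push the return address
    write_word(mem, valE, valP)
    # Write back: update the stack pointer
    regs[ESP] = valE
    # PC update: continue at the destination
    return valC
```

Note that the memory write uses the new stack pointer valE, not the old one, so the return address ends up at the top of the stack.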
Stage Computation: jump
– Compute both addresses
– Choose based on setting of condition codes and branch condition XX/ifun

jXX Dest
Stage       Computation             Description
Fetch       icode:ifun ← M1[PC]     Read instruction byte
            valC ← M4[PC+1]         Read destination address
            valP ← PC+5             Fall through address
Decode      (none)
Execute     Bch ← Cond(CC, ifun)    Take branch?
Memory      (none)
Write back  (none)
PC update   PC ← Bch ? valC : valP  Update PC
Branch conditions
jXX   icode  ifun  Condition codes   Description
jmp   7      0     1                 Direct jump
jle   7      1     (SF^OF) | ZF      Less or equal <=
jl    7      2     SF^OF             Less <
je    7      3     ZF                Equal ==
jne   7      4     ~ZF               Not equal !=
jge   7      5     ~(SF^OF)          Greater or equal >=
jg    7      6     ~(SF^OF) & ~ZF    Greater >
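The branch decision Bch ← Cond(CC, ifun) and the resulting PC update can be written out directly. The sketch below assumes the condition codes are held in a dictionary with keys ZF, SF and OF; the function names cond and next_pc are illustrative only. A signed "less than" holds when SF and OF differ, "greater or equal" is its negation, and "greater" additionally requires ZF to be clear.

```python
def cond(cc, ifun):
    """Decide whether a jXX with the given ifun takes the branch."""
    zf, sf, of = cc["ZF"], cc["SF"], cc["OF"]
    less = sf != of  # SF ^ OF
    table = {
        0: True,                 # jmp: always
        1: less or zf,           # jle
        2: less,                 # jl
        3: zf,                   # je
        4: not zf,               # jne
        5: not less,             # jge
        6: not less and not zf,  # jg
    }
    return table[ifun]


def next_pc(cc, ifun, valC, valP):
    """PC <- Bch ? valC : valP"""
    return valC if cond(cc, ifun) else valP
```

After comparing two equal values (ZF set, SF and OF clear), jle, je and jge are taken, while jl, jne and jg fall through to valP.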
Datapaths & Control Logic
– ALU fun: select function
– ALU A: select input A
– ALU B: select input B
– Set CC: should the condition code register be loaded?
[Figure: Execute logic. The inputs icode, ifun, valC, valA and valB feed the control blocks ALU A, ALU B, ALU fun. and Set CC; the ALU produces valE and updates CC; bcond derives Bch.]
Control logic: ALU A
Instruction        Execute stage          Description
OPl rA, rB         valE ← valB OP valA    Perform ALU operation
rmmovl rA, D(rB)   valE ← valB + valC     Compute effective address
popl rA            valE ← valB + 4        Increment stack pointer
jXX Dest           (none)                 No operation
call Dest          valE ← valB + –4       Decrement stack pointer
ret                valE ← valB + 4        Increment stack pointer

int aluA = [
    icode in { IRRMOVL, IOPL } : valA;
    icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
    icode in { ICALL, IPUSHL } : -4;
    icode in { IRET, IPOPL } : 4;
    # Other instructions don't need ALU
];
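The aluA selection reads as a chain of cases in which the first match wins. As an illustration only, the same selection could be written in software as follows; the numeric instruction codes and the function name alu_a are assumptions of this sketch, not part of the slide.

```python
# Instruction codes: assumed numeric values for this sketch.
IRRMOVL, IIRMOVL, IRMMOVL, IMRMOVL, IOPL = 0x2, 0x3, 0x4, 0x5, 0x6
ICALL, IRET, IPUSHL, IPOPL = 0x8, 0x9, 0xA, 0xB


def alu_a(icode, valA, valC):
    """Select ALU input A; the first matching case wins."""
    if icode in (IRRMOVL, IOPL):
        return valA
    if icode in (IIRMOVL, IRMMOVL, IMRMOVL):
        return valC
    if icode in (ICALL, IPUSHL):
        return -4
    if icode in (IRET, IPOPL):
        return 4
    return None  # other instructions don't need the ALU
```

In hardware all cases are evaluated in parallel and a multiplexer picks the result; the sequential if-chain above only mimics the priority order.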
Hardware structure
• This design can be translated into silicon
[Figure: complete SEQ hardware structure, organized by stage: Fetch (PC, instruction memory, PC increment; outputs icode, ifun, rA, rB, valC, valP), Decode and Write back (register file with ports A, B, M, E; control signals srcA, srcB, dstE, dstM; outputs valA, valB), Execute (ALU A, ALU B, ALU fun., CC, Bch; output valE), Memory (Mem. control with read/write, Addr, Data, data memory; output valM), and New PC (output newPC).]
Sequential is too slow
• The clock has to be slow enough to let the signals propagate through all wires and transistors
• Critical path: the slowest path between any two storage devices
Pipelining
• Divide each operation into stages, and start the next operation as soon as the previous one has finished its first stage
• This increases the throughput, but also increases the latency
[Figure: three-stage pipeline. Comb. logic A (100 ps) → Reg (20 ps) → Comb. logic B (100 ps) → Reg (20 ps) → Comb. logic C (100 ps) → Reg (20 ps); all registers are driven by a common Clock.]
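The effect on throughput and latency follows from simple arithmetic: the clock period is set by the slowest stage plus the register delay, and an instruction needs one period per stage. The helper below is a small sketch of that calculation; the function name pipeline_timing and the unit GIPS (10^9 instructions per second) are choices made for this example.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (clock period in ps, latency in ps, throughput in GIPS)."""
    # The clock must accommodate the slowest stage plus one register.
    cycle = max(stage_delays_ps) + reg_delay_ps
    # Every instruction passes through all stages, one cycle each.
    latency = cycle * len(stage_delays_ps)
    # One instruction completes per cycle; 1000 ps per ns gives GIPS.
    throughput_gips = 1000.0 / cycle
    return cycle, latency, throughput_gips


unpipelined = pipeline_timing([300], 20)           # one 300 ps block
three_stage = pipeline_timing([100, 100, 100], 20)
```

With the delays from the figure, the unpipelined version has a 320 ps period and a throughput of about 3.12 GIPS, while the three-stage version has a 120 ps period, a latency of 360 ps and a throughput of about 8.33 GIPS.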
Insert registers between stages
• Pipeline registers mean extra silicon and extra delay
[Figure: pipelined datapath. Pipeline registers F, D, E, M and W are inserted between the Fetch, Decode, Execute, Memory and Write back stages. Signals shown include f_PC, predPC, valP, icode, ifun, rA, rB, valC, d_srcA, d_srcB, valA, valB, aluA, aluB, Bch, valE, Addr, Data, valM, M_icode, M_Bch, M_valA, W_icode, W_valE, W_valM, W_dstE, W_dstM.]

Cycle  1 2 3 4 5 6 7 8 9
I1     F D E M W
I2       F D E M W
I3         F D E M W
I4           F D E M W
I5             F D E M W

Cycle 5: W = I1, M = I2, E = I3, D = I4, F = I5
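The staircase pattern of the pipeline diagram is easy to generate, which also makes it possible to ask which instruction occupies which stage in a given cycle. The following sketch assumes an ideal five-stage pipeline without stalls; the function names are illustrative.

```python
STAGES = "FDEMW"  # Fetch, Decode, Execute, Memory, Write back


def pipeline_diagram(n_instructions):
    """One text row per instruction, shifted one cycle per row."""
    return ["  " * i + " ".join(STAGES) for i in range(n_instructions)]


def stage_occupancy(cycle, n_instructions):
    """Map each busy stage to its instruction for a 1-based cycle."""
    occupancy = {}
    for i in range(n_instructions):
        s = cycle - 1 - i  # instruction i enters Fetch in cycle i + 1
        if 0 <= s < len(STAGES):
            occupancy[STAGES[s]] = f"I{i + 1}"
    return occupancy
```

For five instructions, stage_occupancy(5, 5) reports I1 in W, I2 in M, I3 in E, I4 in D and I5 in F, the only cycle in which all five stages are busy.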
Data hazards
Additional pipeline control is needed to prevent unintended interactions between instructions
• Stalling (wait a few cycles until the hazard is gone)
• Data forwarding (pass a result to pipeline register E before the M/W stages have written it back)
The pipeline architecture was already used for the i386: http://www.pcmech.com/show/processors/35/
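How many cycles a stalled instruction has to wait can be estimated with a small model. The sketch below makes several simplifying assumptions: a five-stage pipeline (F, D, E, M, W), registers read in D and written at the end of W, no data forwarding, and a program given as (destination, sources) pairs. The representation and the function name stall_cycles are hypothetical.

```python
def stall_cycles(program):
    """Return the number of stall cycles in front of each instruction.

    program: list of (dest, sources); dest may be None (e.g. for nop).
    """
    write_back_cycle = {}  # register -> cycle in which its writer is in W
    stalls = []
    prev_decode = 1        # the first instruction decodes in cycle 2
    for dest, sources in program:
        earliest = prev_decode + 1
        # A source is readable one cycle after its writer was in W.
        ready = max((write_back_cycle.get(r, 0) + 1 for r in sources),
                    default=0)
        decode = max(earliest, ready)
        stalls.append(decode - earliest)
        if dest is not None:
            write_back_cycle[dest] = decode + 3  # D -> E -> M -> W
        prev_decode = decode
    return stalls


# Two register loads followed directly by a dependent add.
prog = [("%edx", []), ("%eax", []), ("%eax", ["%edx", "%eax"])]
```

In this model the dependent add has to wait three cycles; each unrelated instruction placed in between reduces the wait by one. Data forwarding removes most of these stalls by taking the value from a later pipeline stage instead of the register file.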
Pipeline efficiency
Pipeline control can prevent many, but not all, interactions between instructions → bubbles
For the model described in the book:
• Load/use hazards (20% of load instr. → 1 bubble)
• Mispredicted branches (40% of jmp instr. → 2 bubbles)
• Return from procedure calls (100% of ret instr. → 3 bubbles)
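These penalties translate into an average number of cycles per instruction (CPI) once the instruction mix is known. The mix is not given on the slide; the sketch below assumes that loads make up 25%, conditional jumps 20% and returns 2% of all executed instructions, so these three frequencies are assumptions of the example.

```python
# (fraction of all instructions, fraction that causes a penalty, bubbles)
# The instruction frequencies in the first column are assumed values.
PENALTIES = {
    "load/use":          (0.25, 0.20, 1),
    "mispredict branch": (0.20, 0.40, 2),
    "return":            (0.02, 1.00, 3),
}


def cycles_per_instruction(penalties):
    """CPI = 1 + average number of bubbles injected per instruction."""
    bubbles = sum(freq * cond_freq * n
                  for freq, cond_freq, n in penalties.values())
    return 1.0 + bubbles


cpi = cycles_per_instruction(PENALTIES)
```

Under these assumptions the bubbles add 0.05 + 0.16 + 0.06 = 0.27 cycles per instruction, so the CPI is about 1.27; mispredicted branches account for the largest share.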
Today’s architectures
• Superscalar (Pentium) (often two instructions/cycle)
• Dynamic execution (P6) (three instructions out-of-order/cycle)
• Explicit parallelism (Itanium) (six execution units)
Hyper-Threading
http://or1cedar.intel.com/media/training/detect_ht_dt_v1/tutorial/ch6/topic04.htm
Metrics of performance
[Figure: levels of a computer system, from top to bottom: Application, Programming Language, Compiler, ISA, Datapath and Control, Function Units, Transistors / Wires / Pins]
Metrics shown next to these levels:
• Answers per month
• Scaling of algorithms
• (Millions of) instructions per second – MIPS
• (Millions of) floating-point operations per second – MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be optimized
Summary
• An instruction set architecture can be implemented by multiple processor architectures
– Complicated control logic on datapaths
– Compilers have to optimize the control logic for multiple machines/targets
– A programmer can aid or frustrate the compiler
Assignment
• Practice Problem 4.26 (page 430)
Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks:
A: 80 ps, B: 30 ps, C: 60 ps, D: 50 ps, E: 70 ps, F: 10 ps
Pipeline register (Reg): 20 ps