CSE331 W14.1 Irwin Fall 2007 PSU
CSE 331: Computer Organization and Design, Fall 2007
Week 14
Section 1: Mary Jane Irwin (www.cse.psu.edu/~mji)
Section 2: Krishna Narayanan
Course material on ANGEL: cms.psu.edu
[adapted from D. Patterson slides]
Head’s Up
Last week’s material: Input/Output; dealing with exceptions and interrupts
This week’s material: Intro to pipelined datapath design; memory design
- Reading assignment – PH: 6.1, B.9
Next week’s material: Memory hierarchies, intro to cache design
- Reading assignment – PH: 7.1-7.2
Reminders:
- Final Exam is Tues., Dec 18th, 10:10 to noon, 112 Walker
- HW 8 (last) is due Dec 12 (by 11:55pm)
- Quiz 8 (last) closes Dec 13 (by 11:55pm)
- Dec 12th is the deadline for filing grade corrections/updates
We knew from the beginning that deciding on an out-of-order microarchitecture was the number one conceptual priority … An out-of-order core would imply a much more complicated engine, which would tend to increase the number of pipeline stages, which would impact the clock frequency (making it either higher or lower, we were not entirely sure which).
The Pentium Chronicles, Colwell, pg. 20
Review: Single Cycle vs. Multiple Cycle Timing
[Timing diagram: in the single cycle implementation, lw and sw each occupy one long clock cycle, with wasted time in the shorter sw cycle. In the multiple cycle implementation, lw takes five short cycles (IFetch, Dec, Exec, Mem, WB), sw takes four, and an R-type follows, each on a much faster clock.]
The multicycle clock is slower than 1/5th of the single cycle clock due to stage register overhead.
How Can We Make It Even Faster?
Split the multiple instruction cycle design into smaller and smaller steps
- There is a point of diminishing returns where as much time is spent loading the state registers as doing the work
Start fetching and executing the next instruction before the current one has completed
- Pipelining – (all?) modern processors are pipelined for performance
- Superpipelining – many pipeline stages, very fast clock
Fetch (and execute) more than one instruction at a time (out-of-order superscalar and VLIW (EPIC)) – CSE 431
Fetch (and execute) instructions from more than one instruction stream (multithreading (hyperthreading)) – CSE 431
A Pipelined MIPS Processor
Start the next instruction before the current one has completed
- improves throughput – the total amount of work done in a given time
- instruction latency (execution time, delay time, response time – the time from the start of an instruction to its completion) is not reduced
[Pipeline diagram: lw occupies IFetch, Dec, Exec, Mem, WB in cycles 1-5; sw starts in cycle 2; an R-type starts in cycle 3.]
- the clock cycle (pipeline stage time) is limited by the slowest stage
- for some instructions, some stages are wasted cycles
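The timing trade-off among the three implementations can be sketched numerically. The stage latencies below are illustrative assumptions, not figures from the slides; the comparison assumes a stream of lw instructions and no hazards.

```python
# Hypothetical stage latencies in picoseconds (assumed for illustration).
stages = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}

n = 1000  # number of lw instructions executed

# Single cycle: the clock must fit an entire instruction (sum of stages).
single_cycle = sum(stages.values()) * n

# Multiple cycle: the clock fits the slowest stage; lw uses all 5 stages.
multi_cycle = max(stages.values()) * 5 * n

# Pipelined: same clock as multicycle, but once the pipe is full one
# instruction completes per cycle; add (stages - 1) cycles to fill it.
pipelined = max(stages.values()) * (n + len(stages) - 1)

print(single_cycle, multi_cycle, pipelined)  # pipelined is the smallest
```

With these numbers the pipelined design is nearly 4x faster than single cycle and 5x faster than multicycle, matching the intuition that speedup approaches the number of stages.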
Single Cycle, Multiple Cycle, vs. Pipeline
[Timing diagram comparing all three: the single cycle implementation finishes lw then sw in two long cycles (with waste); the multiple cycle implementation takes many short cycles in sequence; the pipeline implementation overlaps lw, sw, and an R-type, starting one instruction per cycle.]
The pipeline clock is the same as the multi-cycle clock.
MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath? State registers between each pipeline stage to isolate the stages.
[Datapath diagram: PC and instruction memory (IF:IFetch); register file and sign extend (ID:Dec); ALU, shift-left-2, and branch adder (EX:Execute); data memory (MEM:MemAccess); and write back (WB:WriteBack) – separated by the IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB pipeline state registers, all driven by the system clock.]
MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode and held in the state registers between pipeline stages.
[Datapath diagram as before, with a Control unit in the Decode stage whose outputs are carried forward through the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Pipelining the MIPS ISA
What makes it easy
- all instructions are the same length (32 bits), so we can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats, so we can begin reading the register file in the 2nd stage
- memory operations occur only in loads and stores, so we can use the execute stage to calculate memory addresses
- each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
What makes it hard
- structural hazards: what if we had only one memory?
- control hazards: what about branches?
- data hazards: what if an instruction’s input operands depend on the output of a previous instruction?
Graphically Representing MIPS Pipeline
[Diagram: one instruction drawn as five stage icons in sequence – IM (instruction memory), Reg (register read), ALU, DM (data memory), Reg (register write).]
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Diagram: five instructions (Inst 0 through Inst 4) each flowing through IM, Reg, ALU, DM, Reg, staggered one clock cycle apart; the first few cycles are the time to fill the pipeline.]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1.
Can Pipelining Get Us Into Trouble?
Yes: pipeline hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready; an instruction’s source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated (branch and jump instructions, exceptions)
Can always resolve hazards by waiting; pipeline control must detect the hazard and take action to resolve it
A Single Memory Would Be a Structural Hazard
[Diagram: lw followed by Inst 1 through Inst 4, each flowing through Mem, Reg, ALU, Mem, Reg. In one cycle, lw is reading data from memory while a later instruction is reading its instruction from the same memory.]
Can fix with separate instruction and data memories.
How About Register File Access?
[Diagram: add $1,… followed by Inst 1, Inst 2, and add $2,$1,… . The first add writes $1 on the clock edge that controls register writing; in the same cycle the second add reads $1, before the clock edge that controls loading of the pipeline state registers.]
Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half.
Register Usage Can Cause Data Hazards
[Diagram: add $1,… followed by sub $4,$1,$5; and $6,$1,$7; xor $4,$1,$5; or $8,$1,$9, each flowing through IM, Reg, ALU, DM, Reg. The sub and and read $1 before the add has written it.]
Dependencies backward in time cause hazards: a read-before-write data hazard.
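The hazard pattern on this slide can be detected mechanically by scanning the instruction sequence for sources that a nearby earlier instruction writes. This is a minimal sketch; the source operands of the add are assumed for illustration (the slide truncates them), and the two-instruction window assumes the half-cycle register write/read fix from the previous slide.

```python
# Instructions as (dest, src1, src2); the add's sources are assumed.
code = [
    ("$1", "$2", "$3"),   # add $1,...  (operands assumed)
    ("$4", "$1", "$5"),   # sub $4,$1,$5
    ("$6", "$1", "$7"),   # and $6,$1,$7
    ("$4", "$1", "$5"),   # xor $4,$1,$5
    ("$8", "$1", "$9"),   # or  $8,$1,$9
]

def raw_hazards(code, window=2):
    """Flag (writer, reader) pairs where a source register is written by
    one of the previous `window` instructions, i.e. the writer has not
    yet reached the write-back half-cycle when the reader reads."""
    hazards = []
    for i, (_, s1, s2) in enumerate(code):
        for j in range(max(0, i - window), i):
            if code[j][0] in (s1, s2):
                hazards.append((j, i))
    return hazards

print(raw_hazards(code))  # -> [(0, 1), (0, 2)]
```

Only the sub and the and hazard on $1; by the time the xor and or read, the add has already written the register file (reads in the second half-cycle, writes in the first), matching the slide's diagram.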
Loads Can Cause Data Hazards
[Diagram: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; xor $4,$1,$5; or $8,$1,$9. The sub reads $1 before the lw has fetched it from memory.]
Dependencies backward in time cause hazards: a load-use data hazard.
One Way to “Fix” a Data Hazard
[Diagram: add $1,… followed by two stall cycles, then sub $4,$1,$5 and and $6,$1,$7, which now read $1 after the add has written it.]
Can fix a data hazard by waiting: stall.
Another Way to “Fix” a Data Hazard
[Diagram: add $1,… followed by sub $4,$1,$5; and $6,$1,$7; xor $4,$1,$5; or $8,$1,$9, with the add’s ALU result forwarded directly to the ALU inputs of the sub and and.]
Fix data hazards by forwarding results as soon as they are available to where they are needed.
Forwarding with Load-use Data Hazards
[Diagram: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; xor $4,$1,$5; or $8,$1,$9, with the loaded value forwarded from the lw’s MEM stage; the sub must still be delayed by one cycle.]
Will still need one stall cycle even with forwarding.
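The residual stall can also be counted mechanically: with full forwarding, only an instruction that uses a register in the cycle immediately after the lw that loads it forces a bubble, because the data is not available until the end of MEM. A minimal sketch, using the slide's instruction sequence:

```python
# Instructions as (opcode, dest, sources).
code = [
    ("lw",  "$1", ("$2",)),       # lw  $1,4($2)
    ("sub", "$4", ("$1", "$5")),  # sub $4,$1,$5  <- load-use: one stall
    ("and", "$6", ("$1", "$7")),  # and $6,$1,$7  <- reachable by forwarding
]

def load_use_stalls(code):
    """Count stall cycles: one per instruction whose source register is
    loaded by the immediately preceding lw (forwarding covers the rest)."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

print(load_use_stalls(code))  # -> 1
```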
Control Hazards
When the flow of instruction addresses is not sequential (i.e., the next PC is not PC + 4):
- Conditional branches (beq, bne)
- Unconditional branches (j, jal, jr)
- Exceptions
Possible “solutions”:
- Stall (impacts performance)
- Move the branch decision point as early in the pipeline as possible, thereby reducing the number of stall cycles
- Delay the decision (requires compiler support)
- Predict and hope for the best!
Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards.
Jumps Incur One Stall
[Diagram: j target followed by the sequentially fetched instruction, which is flushed, then the instruction at the jump target.]
Jumps are not decoded until ID, so one flush is needed. To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop). Fix the jump hazard by waiting: flush.
Fortunately, jumps are very infrequent – only 3% of the SPECint instruction mix.
The maximum clock rate of a microprocessor has been its principal marketing feature and a first-order determinant of its eventual performance. … Some [chips] will function at faster clock rates than expected; others will be slower. … With a cutting-edge, flagship design, however, odds are that the chips will not be as fast as the design team intended.
The Pentium Chronicles, Colwell, pg. 118
Review: MIPS Pipeline Control & Datapath
All control signals can be determined during Decode and held in the state registers between pipeline stages.
[Datapath diagram repeated from earlier, with the Control unit and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Branches Cause Control Hazards
[Diagram: lw followed by beq, then Inst 3 and Inst 4. The branch outcome is not known until late in the beq’s execution, after Inst 3 and Inst 4 have already been fetched.]
Dependencies backward in time cause hazards.
One Way to “Fix” a Branch Control Hazard
[Diagram: beq followed by three flushed instructions, then the instruction at the branch target and Inst 3.]
Fix the branch hazard by waiting: flush. But this affects CPI.
Another Way to “Fix” a Branch Control Hazard
[Diagram: beq followed by one flushed instruction, then the instruction at the branch target and Inst 3.]
Fix the branch hazard by waiting: flush. Move the branch decision hardware back to as early in the pipeline as possible – i.e., during the decode cycle – so only one instruction needs to be flushed.
Yet Another Way to “Fix” a Control Hazard
[Diagram: 4 beq $1,$2,2 followed by 8 sub $4,$1,$5, which is flushed when the branch is taken, then 16 and $6,$1,$7 and 20 or $8,$1,$9 from the branch target.]
Predict branches are always not taken, and take corrective action when wrong (i.e., taken). To flush, set IF.Flush to zero the instruction field of the IF/ID pipeline register (turning it into a noop). The branch decision hardware has been moved to the decode cycle.
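The CPI cost of predict-not-taken with the decision in Decode (one flushed instruction per taken branch) can be sketched with the standard stall-CPI formula. The branch and taken frequencies below are assumptions for illustration, not figures from the slides.

```python
# CPI impact of "predict not taken" with a one-cycle flush penalty.
base_cpi = 1.0        # ideal pipelined CPI
branch_freq = 0.15    # fraction of instructions that are branches (assumed)
taken_freq = 0.5      # fraction of branches that are taken (assumed)
flush_penalty = 1     # cycles lost per taken branch (decision in ID)

cpi = base_cpi + branch_freq * taken_freq * flush_penalty
print(cpi)  # -> 1.075
```

Had the branch decision stayed in a later stage, flush_penalty would be 3 in this five-stage design, giving a CPI of 1.225 under the same assumptions, which is why moving the decision earlier matters.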
Two “Types” of Stalls
Noop instruction (or bubble) inserted between two instructions in the pipeline (e.g., for load-use hazards)
- Keep the instructions earlier in the pipeline (later in the code) from progressing down the pipeline for a cycle (“bounce” them in place with write control signals)
- Insert the noop instruction by zeroing the control bits in the pipeline register at the appropriate stage
- Let the instructions later in the pipeline (earlier in the code) progress normally down the pipeline
Flushes (or instruction squashing), where an instruction in the pipeline is replaced with a noop instruction (as done for instructions located sequentially after j and beq instructions)
- Zero the control bits for the instruction to be flushed
Many Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
- Make the clock twice as slow, or let multiply take two cycles (since it doesn’t use the DM stage)
[Diagram: a pipeline with a two-cycle MUL unit alongside the ALU stage.]
What if the data memory access is twice as slow as the instruction memory?
- Make the clock twice as slow, or let data memory access take two cycles (and keep the same clock rate)
[Diagram: a pipeline with two data memory stages, DM1 and DM2.]
Pipelining Summary
All modern day processors use pipelining
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Potential speedup: a really fast clock cycle and the ability to complete one instruction every clock cycle (a CPI of 1)
- Pipeline rate is limited by the slowest pipeline stage; unbalanced pipe stages make for inefficiencies
- The time to “fill” the pipeline and the time to “drain” it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards
- Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
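The fill/drain point from the summary can be made concrete. Assuming equal stage lengths and no hazards, an unpipelined machine takes n * stages cycles for n instructions, while the pipeline takes n + stages - 1 cycles, so:

```python
def speedup(n, stages=5):
    """Pipeline speedup for n instructions on a `stages`-deep pipeline,
    assuming equal stage lengths and no hazard stalls:
    unpipelined cycles (n * stages) over pipelined cycles (n + stages - 1)."""
    return (n * stages) / (n + stages - 1)

print(round(speedup(5), 2))     # short run: well below the ideal of 5
print(round(speedup(1000), 2))  # long run: approaches 5
```

For very short code runs the fill time dominates and the speedup falls well short of the stage count; only for long runs does it approach the ideal.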
Review: Major Components of a Computer
[Diagram: Processor (Control and Datapath), Memory, and Devices (Input, Output), with the memory side comprising Cache, Main Memory, and Secondary Memory (Disk).]
A Typical Memory Hierarchy
[Diagram: on-chip components – register file, instruction cache, data cache, ITLB, DTLB, control, and datapath – backed by a second level cache (SRAM), main memory (DRAM), and secondary memory (disk).]
Speed (ns): .1’s, 1’s, 10’s, 100’s, 1,000’s
Size (bytes): 100’s, K’s, 10K’s, M’s, T’s
Cost: highest at the fast end, lowest at the large end
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Memory Hierarchy Technologies
(Volatile) Random Access Memories (RAMs)
- “Random” is good: access time is the same for all locations
- DRAM: Dynamic Random Access Memory
  - High density (1-transistor cells), low power, cheap, slow
  - Dynamic: needs to be “refreshed” regularly (~ every 4 ms)
- SRAM: Static Random Access Memory
  - Low density (6-transistor cells), high power, expensive, fast
  - Static: content lasts “forever” (until the power is turned off)
- Size: DRAM/SRAM ratio of 4 to 8
- Cost/Cycle time: SRAM/DRAM ratio of 8 to 16
(Nonvolatile) Flash memories (EEPROMs)
- Reads are faster than erases/writes; a finite number of erase cycles
(Nonvolatile) “Not-so-random” access technology
- Access time varies from location to location and from time to time (e.g., disk, CDROM)
RAM Memory Uses and Performance Metrics
Caches use SRAM for speed; main memory is DRAM for density
- Addresses are divided into 2 halves (row and column)
  - RAS, or Row Access Strobe, triggering the row decoder
  - CAS, or Column Access Strobe, triggering the column selector
Memory performance metrics
- Latency: time to access one word
  - Access time: time between the request and when the word is read or written (read access and write access times can be different)
  - Cycle time: time between successive (read or write) requests
  - Usually cycle time > access time
- Bandwidth: how much data can be supplied per unit time
  - width of the data channel * the rate at which it can be used
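The bandwidth formula is a simple product; the channel width and transfer rate below are assumed numbers for illustration, not figures from the slides.

```python
# Bandwidth = width of the data channel * the rate at which it is used.
width_bytes = 8             # a 64-bit data channel (assumed)
transfers_per_sec = 200e6   # 200 million transfers/s (assumed)

bandwidth_bytes_per_sec = width_bytes * transfers_per_sec
print(bandwidth_bytes_per_sec / 1e6)  # -> 1600.0 (MB/s)
```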
Classical SRAM Organization (~Square)
[Diagram: a square RAM cell array; a row decoder driven by the row address asserts a word (row) select line, and the column selector & I/O circuits, driven by the column address, pick a 4-bit data word off the bit (data) lines. Each intersection represents a 6-T SRAM cell.]
One memory row holds a block of data, so the column address selects the requested word from that block.
Classical DRAM Organization (~Square Planes)
[Diagram: n square RAM cell array planes (where n is the number of bits in the word), each with a row decoder driven by the row address asserting a word (row) select line, and column selector & I/O circuits driven by the column address producing one data bit per plane off the bit (data) lines. Each intersection represents a 1-T DRAM cell.]
The column address selects the requested bit from the row in each plane, forming the data word.
Classical DRAM Operation
DRAM organization: N rows x N columns x M bits (planes)
- Read or write M bits at a time
- Each M-bit access requires a RAS / CAS cycle
[Timing diagram: RAS latches the row address, then CAS latches the column address; the M-bit output appears after the access time, and the next row/column address pair starts a new cycle. The cycle time spans the full RAS/CAS sequence.]
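Splitting the address into its row (RAS) and column (CAS) halves is a simple bit-field operation. A minimal sketch for a square array; the 10-bit field sizes are assumed for illustration.

```python
# Address split for an assumed 1024 x 1024 DRAM array.
ROW_BITS = COL_BITS = 10

def split_address(addr):
    """Return the (row, column) halves of a flat DRAM address:
    the high bits drive the row decoder (RAS), the low bits the
    column selector (CAS)."""
    row = addr >> COL_BITS
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

row, col = split_address(0b1100110011_0101010101)
print(row, col)  # -> 819 341
```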
Synchronous RAMs
Synchronous RAMs have the ability to transfer a burst of data from a series of sequential addresses that are contained in the same row
- Speed advantage: they don’t have to provide the complete (row and column) address for every word in the burst
- Page mode DRAMs: specify the starting (row) address and then the successive column addresses
- SDRAMs: specify the starting (row+column) address and a burst length (number of words in the row); data in the burst is transferred under the control of a clock signal
- DDR SDRAMs (Double Data Rate SDRAMs): transfer burst data on both the rising and falling edges of the clock (twice as much bandwidth)
(Fast) Page Mode DRAM Operation
[Diagram: the DRAM array (N rows x N cols, M bits) with an N x M SRAM row buffer; RAS latches the row address once, then successive CAS column addresses deliver the 1st through 4th M-bit accesses within one cycle time.]
- A row is read into an N x M SRAM buffer
- Only CAS is needed to access other M-bit blocks on that row (burst size)
- RAS remains asserted while CAS is changed
Synchronous DRAM (SDRAM) Operation
[Diagram: the DRAM array (N rows x N cols, M bit planes) with its N x M SRAM row register; RAS latches the row address, a single CAS supplies the starting column, and a clock steps out the 1st through 4th M-bit accesses of the burst within one cycle time.]
After a row is read into the SRAM register:
- Input CAS as the starting “burst” address along with a burst length
- A burst of data is transferred from a series of sequential addresses within that row; a clock controls the transfer of successive words in the burst
http://en.wikipedia.org/wiki/DDR_SDRAM
DRAM Memory Latency & Bandwidth Milestones

                 DRAM   Page DRAM  Page DRAM  Page DRAM  SDRAM  DDR SDRAM
Module width     16b    16b        32b        64b        64b    64b
Year             1980   1983       1986       1993       1997   2000
Mb/chip          0.06   0.25       1          16         64     256
Die size (mm2)   35     45         70         130        170    204
Pins/chip        16     16         18         20         54     66
BWidth (MB/s)    13     40         160        267        640    1600
Latency (nsec)   225    170        125        75         62     52

(Patterson, CACM Vol 47, #10, 2004)

In the time that the memory-to-processor bandwidth has doubled, the memory latency has improved by a factor of only 1.2 to 1.4.
To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks.
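The 1.2-1.4 claim can be checked directly against the table's endpoints: bandwidth went from 13 to 1600 MB/s and latency from 225 to 52 ns between 1980 and 2000.

```python
import math

# Endpoints from the milestone table above.
bw_doublings = math.log2(1600 / 13)   # ~6.9 bandwidth doublings
latency_gain = 225 / 52               # ~4.3x overall latency improvement

# Latency improvement per bandwidth doubling.
gain_per_doubling = latency_gain ** (1 / bw_doublings)
print(round(gain_per_doubling, 2))  # -> 1.23, inside the 1.2-1.4 range
```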