CS 152: Computer Architecture and Engineering
Lecture 10 -- Cache I
TRANSCRIPT

UC Regents Spring 2014 © UCB
2014-2-20, John Lazzaro (not a prof - "John" is always OK)
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Today: Caches and the Memory System

Memory Hierarchy: Technology motivation for caching.
Static Memory: Used in cache designs.

[Figure: the five classic components: Processor (Control, Datapath), Memory, Input, Output]

Short Break
Static Memory Circuits

Dynamic Memory: the circuit remembers for a fraction of a second.
Non-volatile Memory: the circuit remembers for many years, even if power is off.
Static Memory: the circuit remembers as long as the power is on.
Preliminaries
Inverters: Building block for SRAM

[Figure: CMOS inverter circuit and symbol: input Vin, output Vout, supply Vdd]
Inverter: Die Cross Section

[Figure: inverter die cross section: the nFET is built from n+ diffusions in the p- substrate, the pFET from p+ diffusions in an n-well; gates sit over thin oxide; Vin drives both gates, Vout taps both drains]
Recall: Our simple inverter model ...
Design Refinement

Informal System Requirement
  -> Initial Specification
  -> Intermediate Specification
  -> Final Architectural Description
  -> Intermediate Specification of Implementation
  -> Final Internal Specification
  -> Physical Implementation

(refinement: increasing level of detail)
Logic Components
° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, results come out after some propagation delay
° Flip-Flops: Storage elements
  • After a clock edge, the input is copied to the output
  • Otherwise, the flip-flop holds its value
  • Also: a "Latch" is a storage element that is level-triggered

[Figure: a 1-bit D flip-flop (D -> Q), an 8-bit registered bus (D[8] -> Q[8]), and a combinational logic block]
Elements of the design zoo
Basic Combinational Elements + DeMorgan Equivalence

Wire:            In | Out      Inverter:        In | Out
                  0 |  0                         0 |  1
                  1 |  1                         1 |  0
                 Out = In                       Out = In'

NAND Gate:       A B | Out     NOR Gate:        A B | Out
                 0 0 |  1                       0 0 |  1
                 0 1 |  1                       0 1 |  0
                 1 0 |  1                       1 0 |  0
                 1 1 |  0                       1 1 |  0

DeMorgan's Theorem:
  Out = (A + B)' = A' • B'
  Out = (A • B)' = A' + B'

DeMorgan equivalences (same truth tables, inverted-input gates):
  NAND ≡ OR with inverted inputs:     NOR ≡ AND with inverted inputs:
  A B | A' B' | Out                   A B | A' B' | Out
  0 0 |  1 1  |  1                    0 0 |  1 1  |  1
  0 1 |  1 0  |  1                    0 1 |  1 0  |  0
  1 0 |  0 1  |  1                    1 0 |  0 1  |  0
  1 1 |  0 0  |  0                    1 1 |  0 0  |  0
Delay Model: CMOS
Review: General C/L Cell Delay Model

° A combinational cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x Load
° Linear model composes

[Figure: a combinational logic cell with inputs A, B, ... and output X driving a load Cout; the delay from Va to Vout plotted against Cout is a line whose slope is the "delay per unit load" and whose intercept is the "internal delay", with Ccritical marked]
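A minimal Python sketch of this linear delay model; the cell numbers (internal delay, delay per unit load) are illustrative placeholders, not from any real library:

```python
def propagation_delay(internal_delay_ps: float,
                      delay_per_ff_ps: float,
                      load_ff: float) -> float:
    """T(input -> output) = fixed internal delay
                          + load-dependent delay x load."""
    return internal_delay_ps + delay_per_ff_ps * load_ff

# Example: a cell with 20 ps internal delay and 5 ps/fF load
# sensitivity, driving a 12 fF load on its output:
print(propagation_delay(20.0, 5.0, 12.0), "ps")  # -> 80.0 ps
```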
Basic Technology: CMOS

° CMOS: Complementary Metal Oxide Semiconductor (Vdd = 5V, GND = 0V)
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a "conductor"
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a "conductor"
Basic Components: CMOS Inverter

° Inverter operation (symbol: In -> Out; circuit: PMOS pullup from Vdd, NMOS pulldown to GND):
  • Vin = Vdd ("1"): the NMOS conducts and the PMOS is open, so the output discharges to GND ("0")
  • Vin = GND ("0"): the PMOS conducts and the NMOS is open, so the output charges to Vdd ("1")

pFET: a switch. "On" if its gate is grounded.
nFET: a switch. "On" if its gate is at Vdd.

This switch model correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well ...
When the 0/1 model is too simple ...

We wire the output of the inverter to drive its input. What happens?

Logic simulators based on our too-simple model predict this circuit will oscillate! This prediction is incorrect. In reality, Vin = Vout settles to a stable value, defined as Vth, where the nFET and pFET currents match.

Can we figure out Vth without solving tedious equations?
Graphical equation solving ...

Recall the Ids and Isd graphs from the power and energy lecture. Plot the nFET Ids curve and the pFET Isd curve against the shared voltage Vin = Vout: the intersection defines Vth. (Note: this ignores second-order effects.)
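A numeric sketch of this graphical solution, assuming simple square-law transistor models; the device parameters (kn, kp, thresholds, Vdd) are illustrative, not values from the lecture:

```python
import numpy as np

Vdd, Vtn, Vtp = 1.8, 0.4, 0.4   # supply and threshold voltages (V)
kn, kp = 200e-6, 100e-6          # transconductance factors (A/V^2)

def ids_nfet(v):
    # nFET saturation current with gate and drain tied to v
    return 0.5 * kn * max(v - Vtn, 0.0) ** 2

def isd_pfet(v):
    # pFET saturation current with gate and drain tied to v
    return 0.5 * kp * max(Vdd - v - Vtp, 0.0) ** 2

# Sweep Vin = Vout and find where the two currents intersect.
vs = np.linspace(0.0, Vdd, 10001)
diff = [ids_nfet(v) - isd_pfet(v) for v in vs]
vth = vs[int(np.argmin(np.abs(diff)))]
print(f"Vth ~= {vth:.3f} V")  # below Vdd/2 here, since kn > kp
```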
Recall: Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...

An "on" p-FET fills up the capacitor with charge.
An "on" n-FET empties the bucket.
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)
!"#$%&'(#)*(+,%-$*".(/0
1 2+.$0#$03
1 4546%,"#$3
“1”
“0”Time
Water level
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)
!"#$%&'(#)*(+,%-$*".(/0
1 2+.$0#$03
1 4546%,"#$3
“0”
“1”
TimeWater level
What happens when we break the tie wire?

Vin is left free to float. Small amounts of noise on Vin cause Ids > Isd or Isd > Ids ... and the output bucket randomly fills and empties. Result: Vout randomly flips between logic 0 and logic 1.
SRAM

Intel 2102, a 1kb, 1 MHz static RAM chip with 6000 nFET transistors in a 10 μm process. 1971 state of the art.
Recall the DRAM cell: 1 T + 1 C

[Figure: one-transistor, one-capacitor DRAM cell: the "word line" selects a "row", the "bit line" is the "column"; the cell capacitor connects through the access transistor to the bit line]
Idea: Store each bit with its complement

[Figure: a "row" word line gates access transistors onto paired lines y and y', driven to complementary values x (Gnd/Vdd) and x' (Vdd/Gnd)]

Why? We can use the redundant representation to compensate for noise and leakage.
Case #1: y = Gnd, y' = Vdd ...

[Figure: with y = Gnd and y' = Vdd, the cell's nFET sinks current (Ids) to hold y low while the pFET sources current (Isd) to hold y' high]
Case #2: y = Vdd, y' = Gnd ...

[Figure: with y = Vdd and y' = Gnd, the roles swap: the pFET sources current (Isd) to hold y high while the nFET sinks current (Ids) to hold y' low]
Combine both cases to complete the circuit

[Figure: "cross-coupled inverters": two inverters in a loop hold x and x'. The stable states are (Gnd, Vdd) and (Vdd, Gnd), separated by noise margins around the metastable point Vth.]
SRAM Challenge #1: It's so big!

The cell has both transistor types, so it needs Vdd AND Gnd; more contacts, more devices, and two bit lines. Its capacitors are usually the "parasitic" capacitance of wires and transistors.

SRAM area is 6-10X DRAM area, same generation ...
[Figure: Intel SRAM core cell (45 nm), showing the word lines and bit lines]
Challenge #2: Writing is a "fight"

When the word line goes high, the bit lines "fight" with the cell inverters to "flip the bit" -- and must win quickly! Solution: tune the W/L of the cell and driver transistors.

[Figure: the cell initially holds Vdd on one side and Gnd on the other; one bit line drives Gnd and the other drives Vdd to overpower the cell]
Challenge #3: Preserving state on read

When the word line goes high on a read, the cell inverters must drive the large bit-line capacitance quickly, to preserve the state held on the cell's small capacitances.

[Figure: each bit line is a big capacitor that the cell (holding Vdd and Gnd) must drive]
SRAM array: like DRAM, but non-destructive

Random Access Memory (RAM) Technology

° Why do computer designers need to know about RAM technology?
  • Processor performance is usually limited by memory bandwidth
  • As IC densities increase, lots of memory will fit on the processor chip
    - Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer
° What makes RAM different from a bunch of flip-flops?
  • Density: RAM is much denser
Static RAM Cell

6-Transistor SRAM cell: cross-coupled inverters hold the bit; access transistors, gated by the word line (row select), connect the cell to the bit and bit' lines. (In some designs, the pFET pullups are replaced with resistive pullups to save area.)

° Write:
  1. Drive the bit lines (bit = 1, bit' = 0)
  2. Select the row
° Read:
  1. Precharge bit and bit' to Vdd or Vdd/2 => make sure equal!
  2. Select the row
  3. The cell pulls one line low
  4. The sense amp on the column detects the difference between bit and bit'
Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16x4 SRAM array. Word lines Word 0 ... Word 15 run horizontally from the address decoder (inputs A0-A3); each of the 4 bit columns has a write driver & precharger (data inputs Din 0-3, controls WrEn and Precharge) at one end and a sense amp (outputs Dout 0-3) at the other.]

Q: Which is longer: the word line or the bit line?
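A minimal behavioral sketch of the 16-word x 4-bit organization above, in Python; the class and method names are illustrative, and the model ignores timing, precharge, and sense amps:

```python
class SRAM16x4:
    def __init__(self):
        self.array = [[0] * 4 for _ in range(16)]  # 16 words, 4 bits each

    def decode(self, a3, a2, a1, a0):
        # Address decoder: 4 address bits select 1 of 16 word lines.
        return (a3 << 3) | (a2 << 2) | (a1 << 1) | a0

    def write(self, addr_bits, din):
        # Write drivers overpower the selected row's cells.
        row = self.decode(*addr_bits)
        self.array[row] = list(din)

    def read(self, addr_bits):
        # Sense amps detect the bit-line differential on the selected row.
        row = self.decode(*addr_bits)
        return list(self.array[row])

sram = SRAM16x4()
sram.write((1, 0, 1, 0), [1, 0, 1, 1])  # word 10
print(sram.read((1, 0, 1, 0)))           # -> [1, 0, 1, 1]
```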
Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with an N-bit address bus A, an M-bit shared data bus D, write enable WE_L, and output enable OE_L]

° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
  • A new control signal, output enable (OE_L), is needed
  • WE_L asserted (Low), OE_L deasserted (High): D serves as the data input pin
  • WE_L deasserted (High), OE_L asserted (Low): D is the data output pin
  • Both WE_L and OE_L asserted: the result is unknown. Don't do that!!!
    (A behavioral sketch of this pin protocol follows below.)
° Although we could change the VHDL to do what we desire, we must do the best with what we've got (vs. what we need)

Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns, and add muxes to select a subset of bits onto the parallel data I/O lines. For large SRAMs: tile a small array, and connect the tiles with muxes and decoders.
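A minimal behavioral sketch of the active-low WE_L / OE_L pin protocol above, in Python; the function name is illustrative:

```python
def sram_data_pin(we_l: int, oe_l: int) -> str:
    """Behavior of the shared data pin D for active-low WE_L / OE_L."""
    if we_l == 0 and oe_l == 1:
        return "write: D serves as the data input pin"
    if we_l == 1 and oe_l == 0:
        return "read: D is the data output pin"
    if we_l == 0 and oe_l == 0:
        return "unknown result -- don't do that!"
    return "idle: D bus floats (high impedance)"

for we_l in (0, 1):
    for oe_l in (0, 1):
        print(f"WE_L={we_l} OE_L={oe_l}: {sram_data_pin(we_l, oe_l)}")
```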
SRAM vs DRAM: pros and cons

Big win for DRAM: at the same technology generation, DRAM has a 6-10X density advantage.

SRAM advantages:
- SRAM is much faster: transistors drive the bit lines on reads.
- SRAM is easy to design in a logic fabrication process (and premium logic processes have SRAM add-ons).
- SRAM has deterministic latency: its cells do not need to be refreshed.
From: Karl et al., "A 4.6 GHz 162 Mb SRAM Design in 22 nm Tri-Gate CMOS Technology with Integrated Read and Write Assist Circuitry" (ISSCC 2012 / IEEE JSSC).

[Figures: 22nm HDC and LVC tri-gate SRAM bitcells; assist circuit overview, array design, and floorplan; TVC-WA circuit for write VMIN enhancement; TVC-WA operation with simulated waveforms; read/write/retention VMIN tradeoffs with WLUD-RA and TVC-WA; LVC array shmoo and 162Mb VMIN distribution; 45-degree image of the 22 nm tri-gate LVC SRAM bitcell; high-density SRAM bitcell scaling at 2X per technology node]

At low voltage, achieving a low SRAM minimum operating voltage is desirable to avoid the integration, routing, and control overheads of multiple supply domains.

In the 22 nm tri-gate technology, fin quantization eliminates the fine-grained width tuning conventionally used to optimize read stability and write margin, and presents a challenge in designing minimum-area SRAM bitcells constrained by fin pitch. The 22 nm process technology includes both a high-density 0.092 μm² 6T SRAM bitcell (HDC) and a low-voltage 0.108 μm² 6T SRAM bitcell (LVC) to support tradeoffs in area, performance, and minimum operating voltage across a range of application requirements. A 45-degree image of an LVC tri-gate SRAM shows the thin silicon fins wrapped on three sides by a polysilicon gate; top-down bitcell images illustrate that tri-gate device sizing and minimum device dimensions are quantized by the dimensions of each uniform silicon fin. The HDC bitcell features a 1-fin pullup, passgate, and pulldown transistor to deliver the highest 6T SRAM density, while the LVC bitcell has a 2-fin pulldown transistor for an improved SRAM ratio (passgate to pulldown), which enhances read stability in low-voltage conditions. Bitcell optimization via [threshold-voltage] adjustment can be used to adjust the bitcell (pullup-to-pulldown) and (passgate-to-pulldown) ratios for adjustments to read and write margin, in lieu of geometric customization, but low-[VT] usage is constrained by bitcell leakage and high-[VT] usage is limited by performance degradation at low voltage.

In the 22 nm process technology, the individual bitcell device targets are co-optimized with the array design and integrated assist circuits to deliver maximum yield and process margin at a given performance target. Optical proximity correction and resolution enhancement technologies extend the capabilities of 193 nm immersion lithography to allow 54% scaling of the bitcell topologies from the 32 nm node. SRAM cell size density scaling is preserved at the 128 kb array level, and the array is capable of 2.0 GHz operation at 625 mV, a 175 mV reduction in the supply voltage required to reach 2 GHz from the prior technology node.

The 162 Mb SRAM array implemented on the 22 nm SRAM test chip is composed of a tileable 128 kb SRAM macro with integrated read and write assist circuitry. The array macro floorplan integrates 258 bitcells per local bit line (BL) and 136 bitcells per local word line (WL) to maintain high array efficiency (71.6%) and achieve 1.85X density scaling (7.8 Mb/mm²) over the 32 nm design [11] despite the addition of integrated assist circuits. The macro floorplan uses a folded bit-line layout with 8:2 column multiplexing on each side of the shared I/O column circuitry. Two redundant row elements and two redundant column elements are integrated into the macro to improve manufacturing yield and provide capability to repair.
RAM Compilers

On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.
Flip Flops Revisited
Recall: Static RAM cell (6 transistors)

[Figure: "cross-coupled inverters" hold x' and x; the stable states are Gnd/Vdd and Vdd/Gnd, with noise margins around Vth]
Recall: Positive edge-triggered flip-flop

A flip-flop "samples" right before the clock edge, and then "holds" the value.
Delay in Flip-flops (EECS150 Lec10-Timing):
• Setup time results from delay through the first latch.
• Clock-to-Q delay results from delay through the second latch.

[Figure: master-slave flip-flop built from two pass-gate latches clocked by clk and clk'. The first latch is the sampling circuit; the second holds the value. The D-to-Q waveform shows the setup time before the clock edge and the clock-to-Q delay after it.]

16 transistors: makes an SRAM cell look compact! What do we get for the 10 extra transistors? Clocked logic semantics.
Sensing: When the clock is low

A flip-flop "samples" right before the edge, and then "holds" the value. With clk = 0 and clk' = 1, the sampling (first) latch is transparent and will capture the new value on the posedge; the holding (second) latch outputs the last value captured.
Capture: When the clock goes high

With clk = 1 and clk' = 0, the sampling latch remembers the value just captured, and the flip-flop outputs the value just captured on Q.
Flip-flop delays: clk-to-Q, setup, and hold

CLK == 0: the first latch senses D, but Q outputs the old value.
CLK 0 -> 1: the flip-flop captures D, and passes the value to Q after the clk-to-Q delay.

D must be stable for the setup time before the clock edge and for the hold time after it.
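A minimal sketch of the timing checks these three numbers imply; all delay values are illustrative placeholders, not from any datasheet:

```python
clk_to_q = 0.10   # ns: delay through the second latch after the edge
setup    = 0.07   # ns: D must be stable this long before the edge
hold     = 0.03   # ns: D must be stable this long after the edge

logic_max = 0.80  # ns: slowest combinational path between flops
logic_min = 0.05  # ns: fastest combinational path between flops

# Setup check: one full cycle must fit clk-to-Q + logic + setup.
min_period = clk_to_q + logic_max + setup
print(f"min clock period = {min_period:.2f} ns "
      f"({1.0 / min_period:.2f} GHz max)")

# Hold check: the fastest path must not change D too soon after the edge.
assert clk_to_q + logic_min >= hold, "hold violation on fastest path"
```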
From flip-flops to latches ...

A flip-flop is a sampling latch followed by a holding latch. Latch-based design: break up the flip-flop circuit into two latch state elements, then add combinational logic between the latches (latch -> logic -> latch, all clocked by CLK).

Latches are good for making small memories: saves half the area over using D flip-flops.
Break
The Memory Hierarchy
60% of the area of this CPU is devoted to SRAM cache.
But the role of cache in computer design has varied widely over time.
1977: DRAM faster than microprocessors

The Apple ][ (1977), designed by Steve Wozniak and Steve Jobs.
CPU: 1000 ns. DRAM: 400 ns.
Since then: Technology scaling ...

A circuit in 250 nm technology (introduced in 2000) with a feature L nanometers long becomes 0.7 x L nm in 180 nm technology (introduced in 2003): each dimension is 30% smaller, so area is 50% smaller.

Logic circuits use the smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.
DRAM scaled for more bits, not more MHz

Assume Ccell = 1 fF. A bit line may have 2000 nFET drains; assume a bit-line C of 100 fF, or 100 x Ccell. The cell holds Q = Ccell x (Vdd - Vth). When we dump this charge onto the bit line, what voltage do we see?

  dV = [Ccell x (Vdd - Vth)] / [100 x Ccell]
     = (Vdd - Vth) / 100 ≈ tens of millivolts!

In practice, scale the array to get a 60 mV signal.
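A quick numeric check of the charge-sharing estimate, assuming illustrative values Vdd = 1.8 V and Vth = 0.4 V:

```python
c_cell = 1e-15            # 1 fF cell capacitance
c_bit  = 100 * c_cell     # bit line is ~100x the cell
vdd, vth = 1.8, 0.4       # illustrative supply and threshold

q  = c_cell * (vdd - vth)         # charge stored on the cell
dv = q / c_bit                    # shared onto the bit line
print(f"dV = {dv * 1e3:.1f} mV")  # -> 14.0 mV: tens of millivolts
```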
1980-2003, CPU speed outpaced DRAM ...

[Figure: performance (1/latency) vs. year, 1980-2005, log scale from 10 to 10000. CPU performance improves 60% per year (2X in 1.5 yrs); DRAM improves 9% per year (2X in 10 yrs). The gap grew 50% per year, until the 2005 "power wall".]

Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".
Caches: Variable-latency memory ports

[Figure: the CPU sends an address to a small, fast upper-level memory backed by a large, slow lower-level memory; blocks X and Y move between the levels as data flows to the processor on reads and from the processor on writes]

Data in the upper level is returned with lower latency. Data in the lower level is returned with higher latency.
Queues as a building block for memory systems

From "Globally Asynchronous, Locally Synchronous Design and Test", IEEE Design & Test of Computers:

... The basic GALS method focuses on point-to-point communication between blocks.

FIFO solutions. Another approach to interfacing locally synchronous blocks is using specially designed asynchronous FIFO buffers [8-10] and hiding the system synchronization problem within the FIFO buffers. Such a system can tolerate very large interconnect delays and is also robust with regard to metastability. Designers can use this method to interconnect asynchronous and synchronous systems, and also to construct synchronous-synchronous and asynchronous-asynchronous interfaces. A typical FIFO interface achieves an acceptable data throughput [8]; in addition to the data cells, the FIFO structure includes an empty/full detector and a special deadlock detector.

[Figure 2: typical FIFO-based GALS system]

The advantage of FIFO synchronizers is that they don't affect the locally synchronous module's operation. However, with very wide interconnect data buses, FIFO structures can be costly in silicon area. Also, they require specialized complex cells to generate the empty/full flags used for flow control. The introduced latency might be significant and unacceptable for high-speed applications.

As an alternative, Beigne and Vivet designed a synchronous-asynchronous FIFO based on the bisynchronous classical FIFO design using gray code, for the specific case of an asynchronous network-on-chip (NoC) interface [10]. Their aim was to maintain compatibility with existing design solutions and to use standard CAD tools. Thus, even with some performance degradation or a suboptimal architecture, designers can achieve the main goal of designing GALS systems in the standard design environment.

Boundary synchronization. A third solution is to perform data synchronization at the borders of the locally synchronous island, without affecting the inner operation of locally synchronous blocks and without relying on FIFO buffers. For this purpose, designers can use standard two-flop, one-flop, predictive, or adaptive synchronizers for mesochronous systems, or locally delayed latching [1,11]. This method can achieve very reliable data transfer between locally synchronous blocks. On the other hand, such solutions generally increase latency and reduce data throughput, resulting in limited applicability for high-speed systems. Table 1 summarizes the properties of GALS systems' synchronization methods.

Advantages and limitations of GALS solutions. The scientific community has shown great interest in GALS solutions and architectures in the past two decades. However, this interest hasn't culminated in many commercial applications, despite all reported advantages. Many proposed solutions require programmable ring oscillators: an inexpensive solution that allows full control of the local clock, but with significant drawbacks. Ring oscillators are impractical for industrial use, need careful calibration because they are very sensitive to process, voltage, and temperature variations, and consume additional power through continuous switching of the chained inverters. On the other hand, careful design of the delay line can reduce its power consumption to a level below that of a corresponding clock tree. ...

The point: avoid blocking by using a queue (a First-In, First-Out buffer, or FIFO) to communicate between two sub-systems.
Variable-latency port that doesn't stall on a miss

[Figure: the CPU connects to the cache through Queue 1 (requests) and Queue 2 (replies)]

The CPU makes a request by placing the following items in Queue 1:
  CMD: Read, write, etc ...
  TAG: 9-bit number identifying the request.
  MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
  MADDR: Memory address of the first byte.
  STORE-DATA: For stores, the data to store.

This cache is used in an ASPIRE CPU (Rocket).

When the request is ready, the cache places the following items in Queue 2:
  TAG: Identity of the completed command.
  LOAD-DATA: For loads, the requested data.

The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order. (A sketch of this protocol follows below.)
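A minimal Python sketch of this tagged request/reply protocol; the field names follow the slide, but the toy cache model and function names are illustrative, not Rocket's actual interface:

```python
from collections import deque

queue1, queue2 = deque(), deque()   # CPU -> cache, cache -> CPU
memory = {0x1000: 42, 0x1008: 7}    # toy backing store
pending = {}                        # CPU-side request info, indexed by TAG

def cpu_issue(tag, cmd, mtype, maddr, store_data=None):
    pending[tag] = (cmd, maddr)     # remember the request by TAG
    queue1.append(dict(CMD=cmd, TAG=tag, MTYPE=mtype,
                       MADDR=maddr, STORE_DATA=store_data))

def cache_step():
    # Serve one request. A real cache may reorder replies across
    # misses, which is exactly why every reply carries its TAG.
    if queue1:
        req = queue1.popleft()
        if req["CMD"] == "read":
            queue2.append(dict(TAG=req["TAG"],
                               LOAD_DATA=memory.get(req["MADDR"], 0)))
        else:
            memory[req["MADDR"]] = req["STORE_DATA"]

cpu_issue(tag=3, cmd="read", mtype=64, maddr=0x1000)
cache_step()
reply = queue2.popleft()
print(pending[reply["TAG"]], "->", reply["LOAD_DATA"])  # ('read', 4096) -> 42
```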
Cache replaces data, instruction memory

[Figure: 5-stage pipeline. IF (Fetch): PC register, +0x4 incrementer, Instr Mem (Addr -> Data), IR. ID (Decode): RegFile (rs1, rs2, ws, wd, rd1, rd2, WE), sign extender, A/B registers. EX (ALU): 32-bit ALU with op control, M register. MEM: Data Memory (Addr, Din, Dout, WE), MemToReg mux. WB: result R written back to the RegFile.]

Replace the Instruction Memory with an Instruction Cache, and the Data Memory with a Data Cache, backed by DRAM main memory.
Recall: Intel ARM XScale CPU (PocketPC)

From the IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001:

[Fig. 1: process SEM cross section]

The process [threshold voltage] was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [threshold-voltage versus gate-length] dependence, and source-to-body bias is used to electrically limit transistor [off-current] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high-performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...

[Fig. 2: microprocessor pipeline organization. Slide annotations: 32 KB Instruction Cache, 32 KB Data Cache; 180 nm process (introduced 2003).]
The ARM CPU's 32 KB instruction cache uses 3 million transistors. Typical miss rate: 1.5%. The DRAM interface uses 61 pins that toggle at 100 MHz.
2005 Memory Hierarchy: Apple iMac G5 (1.6 GHz, $1299.00)

                    Reg    L1 Inst   L1 Data   L2      DRAM    Disk
Size                1K     64K       32K       512K    256M    80G
Latency (cycles)    1      3         3         11      160     10M

Managed by the compiler (registers), by hardware (caches), and by the OS, hardware, and application (DRAM, disk).

Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
[Figure: PowerPC 970FX die photo (90 nm, 58 M transistors): registers (1K), L1 instruction cache (64K), L1 data cache (32K), and 512K L2 highlighted]
Latency: A closer look

Read latency: the time to return the first byte of a random access.

                    Reg    L1 Inst   L1 Data   L2      DRAM    Disk
Size                1K     64K       32K       512K    256M    80G
Latency (cycles)    1      3         3         11      160     1E+07
Latency (sec)       0.6n   1.9n      1.9n      6.9n    100n    12.5m
Hz                  1.6G   533M      533M      145M    10M     80

Architect's latency toolkit:

(1) Parallelism. Request data from N 1-bit-wide memories at the same time: this overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.

(2) Pipeline memory. If the memory has N cycles of latency, issue a request each cycle, and receive it N cycles later (a sketch follows below).
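A minimal sketch of toolkit item (2), a pipelined memory that accepts one request per cycle despite N cycles of latency; all names and numbers are illustrative:

```python
from collections import deque

class PipelinedMemory:
    def __init__(self, data, latency):
        self.data = data
        self.pipe = deque([None] * latency)  # one slot per latency cycle

    def cycle(self, addr=None):
        # Each cycle: optionally accept a new request, and retire the
        # request issued `latency` cycles ago.
        self.pipe.append(self.data.get(addr) if addr is not None else None)
        return self.pipe.popleft()

mem = PipelinedMemory({i: i * 10 for i in range(8)}, latency=3)
for t, addr in enumerate([0, 1, 2, None, None, None]):
    print(f"cycle {t}: issue {addr}, receive {mem.cycle(addr)}")
# Results for addresses 0, 1, 2 appear at cycles 3, 4, 5: full
# bandwidth despite the 3-cycle latency.
```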
Recall: Adding pipeline stages to memory

Before we pipelined, slow! (Only read behavior shown.) Can we add two pipeline stages?

[Figure: a 256-byte memory built from eight 32-byte (256-bit) registers. A7-A0 is the 8-bit read address: A7-A5 (3 bits) drive a DEMUX that asserts one register's output enable (OE -> tri-state Q outputs), and A4-A2 (3 bits) drive a MUX that selects 32 bits from the 256-bit register output. The data output D0-D31 is 32 bits, i.e. 4 bytes.]
Recall: Reading an entire row for later use

[Figure: a 1-of-8192 decoder takes the 13-bit row address and selects one of 8192 rows x 16384 columns (134,217,728 usable bits; the tester found good bits in a bigger array). The sense amps deliver 16384 bits, and the requested bits are selected and sent off the chip.]

What if we want all of the 16384 bits? In the row access time (55 ns) we can do 22 transfers at 400 MT/s. With a 16-bit chip bus, 22 x 16 = 352 bits << 16384. Now the row access time looks fast! Thus, the push to faster DRAM interfaces.
Recall: Interleaved access to 4 banks

Interleaving: design the right interface to the 4 memory banks on the chip, so that several row requests run in parallel. We can also do other commands on banks concurrently.

[Figure 43, "Multibank Activate Restriction", from the Micron 512Mb x4/x8/x16 DDR2 SDRAM datasheet: ACTIVATE commands to banks a, b, c, d, e interleaved with READs, subject to tRRD (MIN) between activates to different banks and tFAW (MIN) across any four activates. Note: DDR2-533 (-37E, x4 or x8), tCK = 3.75ns, BL = 4, AL = 3, CL = 4, tRRD (MIN) = 7.5ns, tFAW (MIN) = 37.5ns.]
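A minimal sketch of the activate-spacing rules in the figure, using the datasheet note's DDR2-533 numbers (tRRD = 7.5 ns, tFAW = 37.5 ns); the function is an illustrative model, not a controller implementation:

```python
tRRD, tFAW = 7.5, 37.5  # ns, from the datasheet note above

def legal_activate_times(n_banks):
    # Earliest legal ACTIVATE times (ns): consecutive ACTs to different
    # banks must be at least tRRD apart, and any four ACTs must span at
    # least tFAW before a fifth may issue (rolling four-activate window).
    times = []
    for _ in range(n_banks):
        t = 0.0 if not times else times[-1] + tRRD
        if len(times) >= 4:
            t = max(t, times[-4] + tFAW)
        times.append(t)
    return times

print(legal_activate_times(5))  # [0.0, 7.5, 15.0, 22.5, 37.5]
# The fifth ACTIVATE waits for tFAW, not just tRRD.
```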
Recall: Leveraging banks and row reads

From: "Memory Access Scheduling", Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens, Computer Systems Laboratory, Stanford University (ISCA 2000).

Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors, where it enables the processor to make the most efficient use of scarce memory bandwidth.

From the introduction: Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with the dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or to different words within a single row have low latency and can be pipelined. This makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM, much as a superscalar processor schedules arithmetic operations out of order; as with a superscalar processor, the semantics of sequential execution are preserved by reordering the results.

From Section 2, "Modern DRAM Architecture": Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation), the entire row is transferred into the bank's row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge), which prepares the bank for a subsequent row activation.

For example, the 128Mb NEC µPD45128163 [13], a typical SDRAM, includes four internal memory banks, each composed of 4096 rows and 512 columns. This SDRAM may be operated at 125MHz, with a precharge latency of 3 cycles (24ns) and a row access latency of 3 cycles (24ns). Pipelined column accesses that transfer 16 bits may issue at the rate of one per cycle (8ns), yielding a peak transfer rate of 250MB/s. However, it is difficult to achieve this rate on non-sequential access patterns for several reasons: a bank cannot be accessed during the precharge/activate latency, a single cycle of high impedance is required on the data pins when switching between read and write column accesses, and a single set of address lines is shared by all DRAM operations (bank precharge, row activation, and column access). The amount of bank parallelism that is exploited and the number of column accesses made per row access dictate the sustainable memory bandwidth out of such a DRAM.

Consider the sequence of eight memory references shown in Figure 1, each represented by the triple (bank, row, column), with a DRAM that requires 3 cycles to precharge a bank, 3 cycles to access a row, and 1 cycle to access a column. If the eight references are performed in order, each requires a precharge, a row access, and a column access, for a total of seven cycles per reference, or 56 cycles for all eight references. Rescheduled, they can be performed in 19 cycles.

[Figure 1: time to complete the references (0,0,0), (1,1,2), (1,0,1), (1,1,1), (1,0,0), (0,1,3), (0,0,1), (0,1,0), where P = bank precharge (3-cycle occupancy), A = row activation (3 cycles), C = column access (1 cycle). (A) Without access scheduling: 56 DRAM cycles. (B) With access scheduling: 19 DRAM cycles.]
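A minimal sketch of Figure 1's cycle arithmetic, using the paper's timing and the eight references listed in the figure; the row-grouping function is a simplified stand-in for the paper's scheduler:

```python
P, A, C = 3, 3, 1   # precharge, activate, column-access cycles
refs = [(0,0,0), (1,1,2), (1,0,1), (1,1,1),
        (1,0,0), (0,1,3), (0,0,1), (0,1,0)]  # (bank, row, column)

def in_order_cycles(refs):
    # Naive in-order controller: every reference pays P + A + C.
    return len(refs) * (P + A + C)

def row_grouped_cycles(refs):
    # Reordered: group references by (bank, row), so each opened row
    # pays P + A once, then one cycle per column access. This ignores
    # the savings from overlapping operations across banks, so it is
    # an upper bound on the scheduled cycle count.
    rows = {}
    for bank, row, col in refs:
        rows.setdefault((bank, row), []).append(col)
    return sum(P + A + len(cols) for cols in rows.values())

print(in_order_cycles(refs))     # 56 cycles, as in Figure 1(A)
print(row_grouped_cycles(refs))  # 32; overlapping the two banks'
                                 # P and A operations reaches 19 (Fig. 1B)
```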
From Section 3, "Memory Access Scheduling": Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. An operation is a command, such as a row activation or a column access, issued by the memory controller to the DRAM; a reference is a memory reference generated by the processor, such as a load or store to a memory location. A single reference generates one or more memory operations, depending on the schedule.

Given a set of pending memory references, a memory access scheduler may choose one or more row, column, or precharge operations each cycle, subject to resource constraints, to advance one or more of the pending references. The simplest, and most common, scheduling algorithm only considers the oldest pending reference, so that references are satisfied in the order that they arrive: if it is currently possible to make progress on that reference by performing some DRAM operation, the memory controller makes the appropriate access. While this does not require a complicated access scheduler in the memory controller, it is clearly inefficient.

If the DRAM is not ready for the operation required by the oldest pending reference, or if that operation would leave available resources idle, it makes sense to consider operations for other pending references. As memory references arrive, they are allocated storage space while they await service from the memory access scheduler; references are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm, such as the age of the reference and whether it targets the currently active row. In practice, the pending-reference storage could be shared by all the banks (with the addition of a bank address field) to allow dynamic allocation of that storage, at the cost of increased logic complexity in the scheduler.

[Figure 4: memory access scheduler architecture: per-bank pending-reference storage (V, L/S, Row, Col, Data, State), a precharge manager and a row arbiter per bank, a single column arbiter shared by all banks, and an address arbiter that grants the shared address resources and emits DRAM operations.]

Each bank has a precharge manager, which decides when its associated bank should be precharged, and a row arbiter, which decides which row, if any, should be activated when that bank is idle. A single column arbiter, shared by all the banks, grants the shared data-line resources to one column access out of all the pending references to all of the banks. Finally, the precharge managers, row arbiters, and column arbiter send their selected operations to a single address arbiter, which grants the shared address resources to one or more of those operations. The combination of policies used by these units, along with the address arbiter's policy, determines the memory access scheduling algorithm: in-order or priority policies, or policies that select precharge operations first, row operations first, or column operations first. A column-first scheduling policy reduces the latency of references to active rows, whereas a precharge-first or row-first policy increases the amount of bank parallelism.
Set scheduling algorithms in gates ...
On Tuesday: Caches, part two ...

Have a good weekend!