CS 152: Computer Architecture and Engineering
Lecture 10 -- Cache I
TRANSCRIPT

UC Regents Spring 2014 © UCB
2014-2-20, John Lazzaro (not a prof - "John" is always OK)
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Today: Caches and the Memory System

Memory Hierarchy: Technology motivation for caching.
Static Memory: Used in cache designs.

[Figure: the five classic components: Processor (Control, Datapath), Memory, Input, Output]

Short Break
Static Memory Circuits

Dynamic Memory: the circuit remembers for a fraction of a second.
Non-volatile Memory: the circuit remembers for many years, even if power is off.
Static Memory: the circuit remembers as long as the power is on.
Preliminaries
Inverters: Building block for SRAM

[Figure: CMOS inverter circuit and symbol: input Vin, output Vout, supply Vdd]
Inverter: Die Cross Section

[Figure: inverter die cross section: the nFET is built from n+ diffusions in the p- substrate, the pFET from p+ diffusions in an n-well; gates sit over thin oxide; Vin drives both gates, Vout taps both drains]
Recall: Our simple inverter model ...
Design Refinement

Informal System Requirement
  -> Initial Specification
  -> Intermediate Specification
  -> Final Architectural Description
  -> Intermediate Specification of Implementation
  -> Final Internal Specification
  -> Physical Implementation

(refinement: increasing level of detail)
Logic Components
° Wires: Carry signals from one point to another
  • Single bit (no size label) or multi-bit bus (size label)
° Combinational Logic: Like function evaluation
  • Data goes in, results come out after some propagation delay
° Flip-Flops: Storage elements
  • After a clock edge, the input is copied to the output
  • Otherwise, the flip-flop holds its value
  • Also: a "Latch" is a storage element that is level-triggered

[Figure: a 1-bit D flip-flop (D -> Q), an 8-bit registered bus (D[8] -> Q[8]), and a combinational logic block]
Elements of the design zoo
Basic Combinational Elements + DeMorgan Equivalence

Wire:            In | Out      Inverter:        In | Out
                  0 |  0                         0 |  1
                  1 |  1                         1 |  0
                 Out = In                       Out = In'

NAND Gate:       A B | Out     NOR Gate:        A B | Out
                 0 0 |  1                       0 0 |  1
                 0 1 |  1                       0 1 |  0
                 1 0 |  1                       1 0 |  0
                 1 1 |  0                       1 1 |  0

DeMorgan's Theorem:
  Out = (A + B)' = A' • B'
  Out = (A • B)' = A' + B'

DeMorgan equivalences (same truth tables, inverted-input gates):
  NAND ≡ OR with inverted inputs:     NOR ≡ AND with inverted inputs:
  A B | A' B' | Out                   A B | A' B' | Out
  0 0 |  1 1  |  1                    0 0 |  1 1  |  1
  0 1 |  1 0  |  1                    0 1 |  1 0  |  0
  1 0 |  0 1  |  1                    1 0 |  0 1  |  0
  1 1 |  0 0  |  0                    1 1 |  0 0  |  0
Delay Model: CMOS
Review: General C/L Cell Delay Model

° A combinational cell (symbol) is fully specified by:
  • functional (input -> output) behavior
    - truth table, logic equation, VHDL
  • load factor of each input
  • critical propagation delay from each input to each output for each transition
    - THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x Load
° Linear model composes

[Figure: a combinational logic cell with inputs A, B, ... and output X driving a load Cout; the delay from Va to Vout plotted against Cout is a line whose slope is the "delay per unit load" and whose intercept is the "internal delay", with Ccritical marked]
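A minimal Python sketch of this linear delay model; the cell numbers (internal delay, delay per unit load) are illustrative placeholders, not from any real library:

```python
def propagation_delay(internal_delay_ps: float,
                      delay_per_ff_ps: float,
                      load_ff: float) -> float:
    """T(input -> output) = fixed internal delay
                          + load-dependent delay x load."""
    return internal_delay_ps + delay_per_ff_ps * load_ff

# Example: a cell with 20 ps internal delay and 5 ps/fF load
# sensitivity, driving a 12 fF load on its output:
print(propagation_delay(20.0, 5.0, 12.0), "ps")  # -> 80.0 ps
```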
Basic Technology: CMOS

° CMOS: Complementary Metal Oxide Semiconductor (Vdd = 5V, GND = 0V)
  • NMOS (N-Type Metal Oxide Semiconductor) transistors
  • PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS transistor
  • Apply a HIGH (Vdd) to its gate: turns the transistor into a "conductor"
  • Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS transistor
  • Apply a HIGH (Vdd) to its gate: shuts off the conduction path
  • Apply a LOW (GND) to its gate: turns the transistor into a "conductor"
Basic Components: CMOS Inverter

° Inverter operation (symbol: In -> Out; circuit: PMOS pullup from Vdd, NMOS pulldown to GND):
  • Vin = Vdd ("1"): the NMOS conducts and the PMOS is open, so the output discharges to GND ("0")
  • Vin = GND ("0"): the PMOS conducts and the NMOS is open, so the output charges to Vdd ("1")

pFET: a switch. "On" if its gate is grounded.
nFET: a switch. "On" if its gate is at Vdd.

This switch model correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well ...
When the 0/1 model is too simple ...

We wire the output of the inverter to drive its input. What happens?

Logic simulators based on our too-simple model predict this circuit will oscillate! This prediction is incorrect. In reality, Vin = Vout settles to a stable value, defined as Vth, where the nFET and pFET currents match.

Can we figure out Vth without solving tedious equations?
Graphical equation solving ...

Recall the Ids and Isd graphs from the power and energy lecture. Plot the nFET Ids curve and the pFET Isd curve against the shared voltage Vin = Vout: the intersection defines Vth. (Note: this ignores second-order effects.)
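A numeric sketch of this graphical solution, assuming simple square-law transistor models; the device parameters (kn, kp, thresholds, Vdd) are illustrative, not values from the lecture:

```python
import numpy as np

Vdd, Vtn, Vtp = 1.8, 0.4, 0.4   # supply and threshold voltages (V)
kn, kp = 200e-6, 100e-6          # transconductance factors (A/V^2)

def ids_nfet(v):
    # nFET saturation current with gate and drain tied to v
    return 0.5 * kn * max(v - Vtn, 0.0) ** 2

def isd_pfet(v):
    # pFET saturation current with gate and drain tied to v
    return 0.5 * kp * max(Vdd - v - Vtp, 0.0) ** 2

# Sweep Vin = Vout and find where the two currents intersect.
vs = np.linspace(0.0, Vdd, 10001)
diff = [ids_nfet(v) - isd_pfet(v) for v in vs]
vth = vs[int(np.argmin(np.abs(diff)))]
print(f"Vth ~= {vth:.3f} V")  # below Vdd/2 here, since kn > kp
```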
Recall: Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...

An "on" p-FET fills up the capacitor with charge.
An "on" n-FET empties the bucket.
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)
!"#$%&'(#)*(+,%-$*".(/0
1 2+.$0#$03
1 4546%,"#$3
“1”
“0”Time
Water level
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-)
!"#$%&'(#)*(+,%-$*".(/0
1 2+.$0#$03
1 4546%,"#$3
“0”
“1”
TimeWater level
What happens when we break the tie wire?

Vin is left free to float. Small amounts of noise on Vin cause Ids > Isd or Isd > Ids ... and the output bucket randomly fills and empties. Result: Vout randomly flips between logic 0 and logic 1.
SRAM

Intel 2102, a 1kb, 1 MHz static RAM chip with 6000 nFET transistors in a 10 μm process. 1971 state of the art.
Recall the DRAM cell: 1 T + 1 C

[Figure: one-transistor, one-capacitor DRAM cell: the "word line" selects a "row", the "bit line" is the "column"; the cell capacitor connects through the access transistor to the bit line]
Idea: Store each bit with its complement

[Figure: a "row" word line gates access transistors onto paired lines y and y', driven to complementary values x (Gnd/Vdd) and x' (Vdd/Gnd)]

Why? We can use the redundant representation to compensate for noise and leakage.
Case #1: y = Gnd, y' = Vdd ...

[Figure: with y = Gnd and y' = Vdd, the cell's nFET sinks current (Ids) to hold y low while the pFET sources current (Isd) to hold y' high]
Case #2: y = Vdd, y' = Gnd ...

[Figure: with y = Vdd and y' = Gnd, the roles swap: the pFET sources current (Isd) to hold y high while the nFET sinks current (Ids) to hold y' low]
Combine both cases to complete the circuit

[Figure: "cross-coupled inverters": two inverters in a loop hold x and x'. The stable states are (Gnd, Vdd) and (Vdd, Gnd), separated by noise margins around the metastable point Vth.]
SRAM Challenge #1: It's so big!

The cell has both transistor types, so it needs Vdd AND Gnd; more contacts, more devices, and two bit lines. Its capacitors are usually the "parasitic" capacitance of wires and transistors.

SRAM area is 6-10X DRAM area, same generation ...
[Figure: Intel SRAM core cell (45 nm), showing the word lines and bit lines]
Challenge #2: Writing is a "fight"

When the word line goes high, the bit lines "fight" with the cell inverters to "flip the bit" -- and must win quickly! Solution: tune the W/L of the cell and driver transistors.

[Figure: the cell initially holds Vdd on one side and Gnd on the other; one bit line drives Gnd and the other drives Vdd to overpower the cell]
Challenge #3: Preserving state on read

When the word line goes high on a read, the cell inverters must drive the large bit-line capacitance quickly, to preserve the state held on the cell's small capacitances.

[Figure: each bit line is a big capacitor that the cell (holding Vdd and Gnd) must drive]
SRAM array: like DRAM, but non-destructive

Random Access Memory (RAM) Technology

° Why do computer designers need to know about RAM technology?
  • Processor performance is usually limited by memory bandwidth
  • As IC densities increase, lots of memory will fit on the processor chip
    - Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer
° What makes RAM different from a bunch of flip-flops?
  • Density: RAM is much denser
Static RAM Cell

6-Transistor SRAM cell: cross-coupled inverters hold the bit; access transistors, gated by the word line (row select), connect the cell to the bit and bit' lines. (In some designs, the pFET pullups are replaced with resistive pullups to save area.)

° Write:
  1. Drive the bit lines (bit = 1, bit' = 0)
  2. Select the row
° Read:
  1. Precharge bit and bit' to Vdd or Vdd/2 => make sure equal!
  2. Select the row
  3. The cell pulls one line low
  4. The sense amp on the column detects the difference between bit and bit'
Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16x4 SRAM array. Word lines Word 0 ... Word 15 run horizontally from the address decoder (inputs A0-A3); each of the 4 bit columns has a write driver & precharger (data inputs Din 0-3, controls WrEn and Precharge) at one end and a sense amp (outputs Dout 0-3) at the other.]

Q: Which is longer: the word line or the bit line?
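A minimal behavioral sketch of the 16-word x 4-bit organization above, in Python; the class and method names are illustrative, and the model ignores timing, precharge, and sense amps:

```python
class SRAM16x4:
    def __init__(self):
        self.array = [[0] * 4 for _ in range(16)]  # 16 words, 4 bits each

    def decode(self, a3, a2, a1, a0):
        # Address decoder: 4 address bits select 1 of 16 word lines.
        return (a3 << 3) | (a2 << 2) | (a1 << 1) | a0

    def write(self, addr_bits, din):
        # Write drivers overpower the selected row's cells.
        row = self.decode(*addr_bits)
        self.array[row] = list(din)

    def read(self, addr_bits):
        # Sense amps detect the bit-line differential on the selected row.
        row = self.decode(*addr_bits)
        return list(self.array[row])

sram = SRAM16x4()
sram.write((1, 0, 1, 0), [1, 0, 1, 1])  # word 10
print(sram.read((1, 0, 1, 0)))           # -> [1, 0, 1, 1]
```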
Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with an N-bit address bus A, an M-bit shared data bus D, write enable WE_L, and output enable OE_L]

° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
  • A new control signal, output enable (OE_L), is needed
  • WE_L asserted (Low), OE_L deasserted (High): D serves as the data input pin
  • WE_L deasserted (High), OE_L asserted (Low): D is the data output pin
  • Both WE_L and OE_L asserted: the result is unknown. Don't do that!!!
    (A behavioral sketch of this pin protocol follows below.)
° Although we could change the VHDL to do what we desire, we must do the best with what we've got (vs. what we need)

Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns, and add muxes to select a subset of bits onto the parallel data I/O lines. For large SRAMs: tile a small array, and connect the tiles with muxes and decoders.
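A minimal behavioral sketch of the active-low WE_L / OE_L pin protocol above, in Python; the function name is illustrative:

```python
def sram_data_pin(we_l: int, oe_l: int) -> str:
    """Behavior of the shared data pin D for active-low WE_L / OE_L."""
    if we_l == 0 and oe_l == 1:
        return "write: D serves as the data input pin"
    if we_l == 1 and oe_l == 0:
        return "read: D is the data output pin"
    if we_l == 0 and oe_l == 0:
        return "unknown result -- don't do that!"
    return "idle: D bus floats (high impedance)"

for we_l in (0, 1):
    for oe_l in (0, 1):
        print(f"WE_L={we_l} OE_L={oe_l}: {sram_data_pin(we_l, oe_l)}")
```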
SRAM vs DRAM: pros and cons

Big win for DRAM: at the same technology generation, DRAM has a 6-10X density advantage.

SRAM advantages:
- SRAM is much faster: transistors drive the bit lines on reads.
- SRAM is easy to design in a logic fabrication process (and premium logic processes have SRAM add-ons).
- SRAM has deterministic latency: its cells do not need to be refreshed.
From: Karl et al., "A 4.6 GHz 162 Mb SRAM Design in 22 nm Tri-Gate CMOS Technology with Integrated Read and Write Assist Circuitry" (ISSCC 2012 / IEEE JSSC).

[Figures: 22nm HDC and LVC tri-gate SRAM bitcells; assist circuit overview, array design, and floorplan; TVC-WA circuit for write VMIN enhancement; TVC-WA operation with simulated waveforms; read/write/retention VMIN tradeoffs with WLUD-RA and TVC-WA; LVC array shmoo and 162Mb VMIN distribution; 45-degree image of the 22 nm tri-gate LVC SRAM bitcell; high-density SRAM bitcell scaling at 2X per technology node]

At low voltage, achieving a low SRAM minimum operating voltage is desirable to avoid the integration, routing, and control overheads of multiple supply domains.

In the 22 nm tri-gate technology, fin quantization eliminates the fine-grained width tuning conventionally used to optimize read stability and write margin, and presents a challenge in designing minimum-area SRAM bitcells constrained by fin pitch. The 22 nm process technology includes both a high-density 0.092 μm² 6T SRAM bitcell (HDC) and a low-voltage 0.108 μm² 6T SRAM bitcell (LVC) to support tradeoffs in area, performance, and minimum operating voltage across a range of application requirements. A 45-degree image of an LVC tri-gate SRAM shows the thin silicon fins wrapped on three sides by a polysilicon gate; top-down bitcell images illustrate that tri-gate device sizing and minimum device dimensions are quantized by the dimensions of each uniform silicon fin. The HDC bitcell features a 1-fin pullup, passgate, and pulldown transistor to deliver the highest 6T SRAM density, while the LVC bitcell has a 2-fin pulldown transistor for an improved SRAM ratio (passgate to pulldown), which enhances read stability in low-voltage conditions. Bitcell optimization via [threshold-voltage] adjustment can be used to adjust the bitcell (pullup-to-pulldown) and (passgate-to-pulldown) ratios for adjustments to read and write margin, in lieu of geometric customization, but low-[VT] usage is constrained by bitcell leakage and high-[VT] usage is limited by performance degradation at low voltage.

In the 22 nm process technology, the individual bitcell device targets are co-optimized with the array design and integrated assist circuits to deliver maximum yield and process margin at a given performance target. Optical proximity correction and resolution enhancement technologies extend the capabilities of 193 nm immersion lithography to allow 54% scaling of the bitcell topologies from the 32 nm node. SRAM cell size density scaling is preserved at the 128 kb array level, and the array is capable of 2.0 GHz operation at 625 mV, a 175 mV reduction in the supply voltage required to reach 2 GHz from the prior technology node.

The 162 Mb SRAM array implemented on the 22 nm SRAM test chip is composed of a tileable 128 kb SRAM macro with integrated read and write assist circuitry. The array macro floorplan integrates 258 bitcells per local bit line (BL) and 136 bitcells per local word line (WL) to maintain high array efficiency (71.6%) and achieve 1.85X density scaling (7.8 Mb/mm²) over the 32 nm design [11] despite the addition of integrated assist circuits. The macro floorplan uses a folded bit-line layout with 8:2 column multiplexing on each side of the shared I/O column circuitry. Two redundant row elements and two redundant column elements are integrated into the macro to improve manufacturing yield and provide capability to repair.
RAM Compilers

On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers. Compile-time parameters set the number of bits, aspect ratio, ports, etc.
Flip Flops Revisited
Recall: Static RAM cell (6 transistors)

[Figure: "cross-coupled inverters" hold x' and x; the stable states are Gnd/Vdd and Vdd/Gnd, with noise margins around Vth]
Recall: Positive edge-triggered flip-flop

A flip-flop "samples" right before the clock edge, and then "holds" the value.
Delay in Flip-flops (EECS150 Lec10-Timing):
• Setup time results from delay through the first latch.
• Clock-to-Q delay results from delay through the second latch.

[Figure: master-slave flip-flop built from two pass-gate latches clocked by clk and clk'. The first latch is the sampling circuit; the second holds the value. The D-to-Q waveform shows the setup time before the clock edge and the clock-to-Q delay after it.]

16 transistors: makes an SRAM cell look compact! What do we get for the 10 extra transistors? Clocked logic semantics.
Sensing: When the clock is low

A flip-flop "samples" right before the edge, and then "holds" the value. With clk = 0 and clk' = 1, the sampling (first) latch is transparent and will capture the new value on the posedge; the holding (second) latch outputs the last value captured.
Capture: When the clock goes high

With clk = 1 and clk' = 0, the sampling latch remembers the value just captured, and the flip-flop outputs the value just captured on Q.
Flip-flop delays: clk-to-Q, setup, and hold

CLK == 0: the first latch senses D, but Q outputs the old value.
CLK 0 -> 1: the flip-flop captures D, and passes the value to Q after the clk-to-Q delay.

D must be stable for the setup time before the clock edge and for the hold time after it.
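A minimal sketch of the timing checks these three numbers imply; all delay values are illustrative placeholders, not from any datasheet:

```python
clk_to_q = 0.10   # ns: delay through the second latch after the edge
setup    = 0.07   # ns: D must be stable this long before the edge
hold     = 0.03   # ns: D must be stable this long after the edge

logic_max = 0.80  # ns: slowest combinational path between flops
logic_min = 0.05  # ns: fastest combinational path between flops

# Setup check: one full cycle must fit clk-to-Q + logic + setup.
min_period = clk_to_q + logic_max + setup
print(f"min clock period = {min_period:.2f} ns "
      f"({1.0 / min_period:.2f} GHz max)")

# Hold check: the fastest path must not change D too soon after the edge.
assert clk_to_q + logic_min >= hold, "hold violation on fastest path"
```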
From flip-flops to latches ...

A flip-flop is a sampling latch followed by a holding latch. Latch-based design: break up the flip-flop circuit into two latch state elements, then add combinational logic between the latches (latch -> logic -> latch, all clocked by CLK).

Latches are good for making small memories: saves half the area over using D flip-flops.
Break
The Memory Hierarchy
60% of the area of this CPU is devoted to SRAM cache.
But the role of cache in computer design has varied widely over time.
1977: DRAM faster than microprocessors

The Apple ][ (1977), designed by Steve Wozniak and Steve Jobs.
CPU: 1000 ns. DRAM: 400 ns.
Since then: Technology scaling ...

A circuit in 250 nm technology (introduced in 2000) with a feature L nanometers long becomes 0.7 x L nm in 180 nm technology (introduced in 2003): each dimension is 30% smaller, so area is 50% smaller.

Logic circuits use the smaller C's, lower Vdd, and higher kn and kp to speed up clock rates.
DRAM scaled for more bits, not more MHz

Assume Ccell = 1 fF. A bit line may have 2000 nFET drains; assume a bit-line C of 100 fF, or 100 x Ccell. The cell holds Q = Ccell x (Vdd - Vth). When we dump this charge onto the bit line, what voltage do we see?

  dV = [Ccell x (Vdd - Vth)] / [100 x Ccell]
     = (Vdd - Vth) / 100 ≈ tens of millivolts!

In practice, scale the array to get a 60 mV signal.
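A quick numeric check of the charge-sharing estimate, assuming illustrative values Vdd = 1.8 V and Vth = 0.4 V:

```python
c_cell = 1e-15            # 1 fF cell capacitance
c_bit  = 100 * c_cell     # bit line is ~100x the cell
vdd, vth = 1.8, 0.4       # illustrative supply and threshold

q  = c_cell * (vdd - vth)         # charge stored on the cell
dv = q / c_bit                    # shared onto the bit line
print(f"dV = {dv * 1e3:.1f} mV")  # -> 14.0 mV: tens of millivolts
```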
1980-2003, CPU speed outpaced DRAM ...

[Figure: performance (1/latency) vs. year, 1980-2005, log scale from 10 to 10000. CPU performance improves 60% per year (2X in 1.5 yrs); DRAM improves 9% per year (2X in 10 yrs). The gap grew 50% per year, until the 2005 "power wall".]

Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".
Caches: Variable-latency memory ports

[Figure: the CPU sends an address to a small, fast upper-level memory backed by a large, slow lower-level memory; blocks X and Y move between the levels as data flows to the processor on reads and from the processor on writes]

Data in the upper level is returned with lower latency. Data in the lower level is returned with higher latency.
Queues as a building block for memory systems

From "Globally Asynchronous, Locally Synchronous Design and Test", IEEE Design & Test of Computers:

... The basic GALS method focuses on point-to-point communication between blocks.

FIFO solutions. Another approach to interfacing locally synchronous blocks is using specially designed asynchronous FIFO buffers [8-10] and hiding the system synchronization problem within the FIFO buffers. Such a system can tolerate very large interconnect delays and is also robust with regard to metastability. Designers can use this method to interconnect asynchronous and synchronous systems, and also to construct synchronous-synchronous and asynchronous-asynchronous interfaces. A typical FIFO interface achieves an acceptable data throughput [8]; in addition to the data cells, the FIFO structure includes an empty/full detector and a special deadlock detector.

[Figure 2: typical FIFO-based GALS system]

The advantage of FIFO synchronizers is that they don't affect the locally synchronous module's operation. However, with very wide interconnect data buses, FIFO structures can be costly in silicon area. Also, they require specialized complex cells to generate the empty/full flags used for flow control. The introduced latency might be significant and unacceptable for high-speed applications.

As an alternative, Beigne and Vivet designed a synchronous-asynchronous FIFO based on the bisynchronous classical FIFO design using gray code, for the specific case of an asynchronous network-on-chip (NoC) interface [10]. Their aim was to maintain compatibility with existing design solutions and to use standard CAD tools. Thus, even with some performance degradation or a suboptimal architecture, designers can achieve the main goal of designing GALS systems in the standard design environment.

Boundary synchronization. A third solution is to perform data synchronization at the borders of the locally synchronous island, without affecting the inner operation of locally synchronous blocks and without relying on FIFO buffers. For this purpose, designers can use standard two-flop, one-flop, predictive, or adaptive synchronizers for mesochronous systems, or locally delayed latching [1,11]. This method can achieve very reliable data transfer between locally synchronous blocks. On the other hand, such solutions generally increase latency and reduce data throughput, resulting in limited applicability for high-speed systems. Table 1 summarizes the properties of GALS systems' synchronization methods.

Advantages and limitations of GALS solutions. The scientific community has shown great interest in GALS solutions and architectures in the past two decades. However, this interest hasn't culminated in many commercial applications, despite all reported advantages. Many proposed solutions require programmable ring oscillators: an inexpensive solution that allows full control of the local clock, but with significant drawbacks. Ring oscillators are impractical for industrial use, need careful calibration because they are very sensitive to process, voltage, and temperature variations, and consume additional power through continuous switching of the chained inverters. On the other hand, careful design of the delay line can reduce its power consumption to a level below that of a corresponding clock tree. ...

The point: avoid blocking by using a queue (a First-In, First-Out buffer, or FIFO) to communicate between two sub-systems.
Variable-latency port that doesn't stall on a miss

[Figure: the CPU connects to the cache through Queue 1 (requests) and Queue 2 (replies)]

The CPU makes a request by placing the following items in Queue 1:
  CMD: Read, write, etc ...
  TAG: 9-bit number identifying the request.
  MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
  MADDR: Memory address of the first byte.
  STORE-DATA: For stores, the data to store.

This cache is used in an ASPIRE CPU (Rocket).

When the request is ready, the cache places the following items in Queue 2:
  TAG: Identity of the completed command.
  LOAD-DATA: For loads, the requested data.

The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order. (A sketch of this protocol follows below.)
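A minimal Python sketch of this tagged request/reply protocol; the field names follow the slide, but the toy cache model and function names are illustrative, not Rocket's actual interface:

```python
from collections import deque

queue1, queue2 = deque(), deque()   # CPU -> cache, cache -> CPU
memory = {0x1000: 42, 0x1008: 7}    # toy backing store
pending = {}                        # CPU-side request info, indexed by TAG

def cpu_issue(tag, cmd, mtype, maddr, store_data=None):
    pending[tag] = (cmd, maddr)     # remember the request by TAG
    queue1.append(dict(CMD=cmd, TAG=tag, MTYPE=mtype,
                       MADDR=maddr, STORE_DATA=store_data))

def cache_step():
    # Serve one request. A real cache may reorder replies across
    # misses, which is exactly why every reply carries its TAG.
    if queue1:
        req = queue1.popleft()
        if req["CMD"] == "read":
            queue2.append(dict(TAG=req["TAG"],
                               LOAD_DATA=memory.get(req["MADDR"], 0)))
        else:
            memory[req["MADDR"]] = req["STORE_DATA"]

cpu_issue(tag=3, cmd="read", mtype=64, maddr=0x1000)
cache_step()
reply = queue2.popleft()
print(pending[reply["TAG"]], "->", reply["LOAD_DATA"])  # ('read', 4096) -> 42
```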
Cache replaces data, instruction memory

[Figure: 5-stage pipeline. IF (Fetch): PC register, +0x4 incrementer, Instr Mem (Addr -> Data), IR. ID (Decode): RegFile (rs1, rs2, ws, wd, rd1, rd2, WE), sign extender, A/B registers. EX (ALU): 32-bit ALU with op control, M register. MEM: Data Memory (Addr, Din, Dout, WE), MemToReg mux. WB: result R written back to the RegFile.]

Replace the Instruction Memory with an Instruction Cache, and the Data Memory with a Data Cache, backed by DRAM main memory.
Recall: Intel ARM XScale CPU (PocketPC)

From the IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001:

[Fig. 1: process SEM cross section]

The process [threshold voltage] was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the [threshold-voltage versus gate-length] dependence, and source-to-body bias is used to electrically limit transistor [off-current] in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high-performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re- ...

[Fig. 2: microprocessor pipeline organization. Slide annotations: 32 KB Instruction Cache, 32 KB Data Cache; 180 nm process (introduced 2003).]
The ARM CPU's 32 KB instruction cache uses 3 million transistors. Typical miss rate: 1.5%. The DRAM interface uses 61 pins that toggle at 100 MHz.
2005 Memory Hierarchy: Apple iMac G5 (1.6 GHz, $1299.00)

                    Reg    L1 Inst   L1 Data   L2      DRAM    Disk
Size                1K     64K       32K       512K    256M    80G
Latency (cycles)    1      3         3         11      160     10M

Managed by the compiler (registers), by hardware (caches), and by the OS, hardware, and application (DRAM, disk).

Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
[Figure: PowerPC 970FX die photo (90 nm, 58 M transistors): registers (1K), L1 instruction cache (64K), L1 data cache (32K), and 512K L2 highlighted]
Latency: A closer look

Read latency: the time to return the first byte of a random access.

                    Reg    L1 Inst   L1 Data   L2      DRAM    Disk
Size                1K     64K       32K       512K    256M    80G
Latency (cycles)    1      3         3         11      160     1E+07
Latency (sec)       0.6n   1.9n      1.9n      6.9n    100n    12.5m
Hz                  1.6G   533M      533M      145M    10M     80

Architect's latency toolkit:

(1) Parallelism. Request data from N 1-bit-wide memories at the same time: this overlaps the latency cost for all N bits and provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth.

(2) Pipeline memory. If the memory has N cycles of latency, issue a request each cycle, and receive it N cycles later (a sketch follows below).
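A minimal sketch of toolkit item (2), a pipelined memory that accepts one request per cycle despite N cycles of latency; all names and numbers are illustrative:

```python
from collections import deque

class PipelinedMemory:
    def __init__(self, data, latency):
        self.data = data
        self.pipe = deque([None] * latency)  # one slot per latency cycle

    def cycle(self, addr=None):
        # Each cycle: optionally accept a new request, and retire the
        # request issued `latency` cycles ago.
        self.pipe.append(self.data.get(addr) if addr is not None else None)
        return self.pipe.popleft()

mem = PipelinedMemory({i: i * 10 for i in range(8)}, latency=3)
for t, addr in enumerate([0, 1, 2, None, None, None]):
    print(f"cycle {t}: issue {addr}, receive {mem.cycle(addr)}")
# Results for addresses 0, 1, 2 appear at cycles 3, 4, 5: full
# bandwidth despite the 3-cycle latency.
```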
Recall: Adding pipeline stages to memory

Before we pipelined, slow! (Only read behavior shown.) Can we add two pipeline stages?

[Figure: a 256-byte memory built from eight 32-byte (256-bit) registers. A7-A0 is the 8-bit read address: A7-A5 (3 bits) drive a DEMUX that asserts one register's output enable (OE -> tri-state Q outputs), and A4-A2 (3 bits) drive a MUX that selects 32 bits from the 256-bit register output. The data output D0-D31 is 32 bits, i.e. 4 bytes.]
Recall: Reading an entire row for later use

[Figure: a 1-of-8192 decoder takes the 13-bit row address and selects one of 8192 rows x 16384 columns (134,217,728 usable bits; the tester found good bits in a bigger array). The sense amps deliver 16384 bits, and the requested bits are selected and sent off the chip.]

What if we want all of the 16384 bits? In the row access time (55 ns) we can do 22 transfers at 400 MT/s. With a 16-bit chip bus, 22 x 16 = 352 bits << 16384. Now the row access time looks fast! Thus, the push to faster DRAM interfaces.
Recall: Interleaved access to 4 banks

Interleaving: design the right interface to the 4 memory banks on the chip, so that several row requests run in parallel. We can also do other commands on banks concurrently.

[Figure 43, "Multibank Activate Restriction", from the Micron 512Mb x4/x8/x16 DDR2 SDRAM datasheet: ACTIVATE commands to banks a, b, c, d, e interleaved with READs, subject to tRRD (MIN) between activates to different banks and tFAW (MIN) across any four activates. Note: DDR2-533 (-37E, x4 or x8), tCK = 3.75ns, BL = 4, AL = 3, CL = 4, tRRD (MIN) = 7.5ns, tFAW (MIN) = 37.5ns.]
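A minimal sketch of the activate-spacing rules in the figure, using the datasheet note's DDR2-533 numbers (tRRD = 7.5 ns, tFAW = 37.5 ns); the function is an illustrative model, not a controller implementation:

```python
tRRD, tFAW = 7.5, 37.5  # ns, from the datasheet note above

def legal_activate_times(n_banks):
    # Earliest legal ACTIVATE times (ns): consecutive ACTs to different
    # banks must be at least tRRD apart, and any four ACTs must span at
    # least tFAW before a fifth may issue (rolling four-activate window).
    times = []
    for _ in range(n_banks):
        t = 0.0 if not times else times[-1] + tRRD
        if len(times) >= 4:
            t = max(t, times[-4] + tFAW)
        times.append(t)
    return times

print(legal_activate_times(5))  # [0.0, 7.5, 15.0, 22.5, 37.5]
# The fifth ACTIVATE waits for tFAW, not just tRRD.
```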
Recall: Leveraging banks and row reads

From: "Memory Access Scheduling", Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens, Computer Systems Laboratory, Stanford University (ISCA 2000).

Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors, where it enables the processor to make the most efficient use of scarce memory bandwidth.

From the introduction: Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with the dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or to different words within a single row have low latency and can be pipelined. This makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM, much as a superscalar processor schedules arithmetic operations out of order; as with a superscalar processor, the semantics of sequential execution are preserved by reordering the results.

From Section 2, "Modern DRAM Architecture": Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation), the entire row is transferred into the bank's row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge), which prepares the bank for a subsequent row activation.

For example, the 128Mb NEC µPD45128163 [13], a typical SDRAM, includes four internal memory banks, each composed of 4096 rows and 512 columns. This SDRAM may be operated at 125MHz, with a precharge latency of 3 cycles (24ns) and a row access latency of 3 cycles (24ns). Pipelined column accesses that transfer 16 bits may issue at the rate of one per cycle (8ns), yielding a peak transfer rate of 250MB/s. However, it is difficult to achieve this rate on non-sequential access patterns for several reasons: a bank cannot be accessed during the precharge/activate latency, a single cycle of high impedance is required on the data pins when switching between read and write column accesses, and a single set of address lines is shared by all DRAM operations (bank precharge, row activation, and column access). The amount of bank parallelism that is exploited and the number of column accesses made per row access dictate the sustainable memory bandwidth out of such a DRAM.

Consider the sequence of eight memory references shown in Figure 1, each represented by the triple (bank, row, column), with a DRAM that requires 3 cycles to precharge a bank, 3 cycles to access a row, and 1 cycle to access a column. If the eight references are performed in order, each requires a precharge, a row access, and a column access, for a total of seven cycles per reference, or 56 cycles for all eight references. Rescheduled, they can be performed in 19 cycles.

[Figure 1: time to complete the references (0,0,0), (1,1,2), (1,0,1), (1,1,1), (1,0,0), (0,1,3), (0,0,1), (0,1,0), where P = bank precharge (3-cycle occupancy), A = row activation (3 cycles), C = column access (1 cycle). (A) Without access scheduling: 56 DRAM cycles. (B) With access scheduling: 19 DRAM cycles.]
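A minimal sketch of Figure 1's cycle arithmetic, using the paper's timing and the eight references listed in the figure; the row-grouping function is a simplified stand-in for the paper's scheduler:

```python
P, A, C = 3, 3, 1   # precharge, activate, column-access cycles
refs = [(0,0,0), (1,1,2), (1,0,1), (1,1,1),
        (1,0,0), (0,1,3), (0,0,1), (0,1,0)]  # (bank, row, column)

def in_order_cycles(refs):
    # Naive in-order controller: every reference pays P + A + C.
    return len(refs) * (P + A + C)

def row_grouped_cycles(refs):
    # Reordered: group references by (bank, row), so each opened row
    # pays P + A once, then one cycle per column access. This ignores
    # the savings from overlapping operations across banks, so it is
    # an upper bound on the scheduled cycle count.
    rows = {}
    for bank, row, col in refs:
        rows.setdefault((bank, row), []).append(col)
    return sum(P + A + len(cols) for cols in rows.values())

print(in_order_cycles(refs))     # 56 cycles, as in Figure 1(A)
print(row_grouped_cycles(refs))  # 32; overlapping the two banks'
                                 # P and A operations reaches 19 (Fig. 1B)
```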
From Section 3, "Memory Access Scheduling": Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. An operation is a command, such as a row activation or a column access, issued by the memory controller to the DRAM; a reference is a memory reference generated by the processor, such as a load or store to a memory location. A single reference generates one or more memory operations, depending on the schedule.

Given a set of pending memory references, a memory access scheduler may choose one or more row, column, or precharge operations each cycle, subject to resource constraints, to advance one or more of the pending references. The simplest, and most common, scheduling algorithm only considers the oldest pending reference, so that references are satisfied in the order that they arrive: if it is currently possible to make progress on that reference by performing some DRAM operation, the memory controller makes the appropriate access. While this does not require a complicated access scheduler in the memory controller, it is clearly inefficient.

If the DRAM is not ready for the operation required by the oldest pending reference, or if that operation would leave available resources idle, it makes sense to consider operations for other pending references. As memory references arrive, they are allocated storage space while they await service from the memory access scheduler; references are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm, such as the age of the reference and whether it targets the currently active row. In practice, the pending-reference storage could be shared by all the banks (with the addition of a bank address field) to allow dynamic allocation of that storage, at the cost of increased logic complexity in the scheduler.

[Figure 4: memory access scheduler architecture: per-bank pending-reference storage (V, L/S, Row, Col, Data, State), a precharge manager and a row arbiter per bank, a single column arbiter shared by all banks, and an address arbiter that grants the shared address resources and emits DRAM operations.]

Each bank has a precharge manager, which decides when its associated bank should be precharged, and a row arbiter, which decides which row, if any, should be activated when that bank is idle. A single column arbiter, shared by all the banks, grants the shared data-line resources to one column access out of all the pending references to all of the banks. Finally, the precharge managers, row arbiters, and column arbiter send their selected operations to a single address arbiter, which grants the shared address resources to one or more of those operations. The combination of policies used by these units, along with the address arbiter's policy, determines the memory access scheduling algorithm: in-order or priority policies, or policies that select precharge operations first, row operations first, or column operations first. A column-first scheduling policy reduces the latency of references to active rows, whereas a precharge-first or row-first policy increases the amount of bank parallelism.
Set scheduling algorithms in gates ...
On Tuesday: Caches, part two ...

Have a good weekend!