


IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 2, MARCH 1978

A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory

ALAN JAY SMITH, MEMBER, IEEE

Abstract-Set associative page mapping algorithms have become widespread for the operation of cache memories for reasons of cost and efficiency. We show how to calculate analytically the effectiveness of standard bit-selection set associative page mapping or random mapping relative to fully associative (unconstrained mapping) paging. For two miss ratio models, Saltzer's linear model and a mixed geometric model, we are able to obtain simple, closed-form expressions for the relative LRU fault rates. We also experiment with two (infeasible to implement) dynamic mapping algorithms, in which pages are assigned to sets either in an LRU or FIFO manner at fault times, and find that they often yield significantly lower miss ratios than static algorithms such as bit selection. Trace driven simulations are used to generate experimental results and to verify the accuracy of our calculations. We suggest that as electronically accessed third-level memories composed of electron-beam tubes, magnetic bubbles, or charge-coupled devices become available, algorithms currently used only for cache paging will be applied to main memory, for the same reasons of efficiency, implementation ease, and cost.

Index Terms-Buffer memory, cache memory, linear paging model, LRU, memory mapping, memory technology, paging, placement algorithm, virtual memory.

I. INTRODUCTION

CACHE memories were proposed early in the 1960's [1] as high speed memory buffers used to hold the contents of recently accessed main memory locations. It was already known at that time that recently used information (instructions and data) is likely to be used again in the near future. The idea was that although the cache (buffer) memory would hold only a small fraction of the contents of main memory, a disproportionate fraction of all memory references would be satisfied by information contained within the buffer. That this does happen is attested to by the prevalence of machines (such as the IBM 360/85 [2], 360/195 [3], 370/158, 370/168 [4], [5], DEC PDP 10/L [6], PDP 11/70, etc.) using cache memories.

Manuscript received November 11, 1976; revised September 22, 1977. This work was partially supported by the National Science Foundation under Grant MCS75-06768. Computer time was supplied by the Energy Resources Development Administration under Contract E(043)515.

The author is with the Computer Science Division, Department of Electrical Engineering and Computer Sciences and the Electronics Research Laboratory, University of California, Berkeley, CA 94720.

Commonly, there is a ratio of access times between cache and main memory of from five to ten. Such a "large" ratio makes it important that the cache be carefully designed, so as to capture as great a fraction of all memory references as possible, yet such a "small" ratio requires that the optimizations be simple and impose no significant penalties in access time in return for decreased miss ratios. In the remainder of this section we explain two aspects of this optimization: the search and mapping algorithm and the replacement algorithm. We then briefly discuss cache architecture and examine its implementation. Also considered later in this section is the use of main memory as a high speed buffer for some form of "gap-filler" technology. Subsequent sections of this paper will present both mathematical and experimental analysis of some memory mapping algorithms.

A. Search and Mapping Algorithms

Processors normally generate the main memory address of desired information (instructions or data), and then the cache is searched in some manner to see if it can supply the contents of this main memory location. The search may be over all cache pages (or lines, as IBM and Amdahl call them), in which case it is called fully associative, or only over some subset of possible locations, as is done with set associative mapping, direct mapping [7], or in the IBM 360/85 sector buffer [2]. Let M be the size of the cache in lines (often 32 bytes/line).

Set associative organization for the cache implies that one kth of the lines in main memory will be mapped into each of the k sets in the cache. Direct mapping is a form of set associative organization in which M = k; i.e., the set size is one. Most commonly, the mapping algorithm employed is to map each kth line, where k is a power of 2, into the same set; that is, lines 1, 1 + k, 1 + 2k, etc., all map into set one, etc. If the bits of the address are numbered 0 to J from least to most significant, then 32-byte lines would be mapped into the set given by bits 5 to 5 + log2 k - 1 (k ≥ 2) of the address. We call this form of mapping bit selection. Another easily implemented algorithm would be random mapping, in which a set is selected by hashing the main memory address.

Both bit mapping and random mapping are static algorithms; they always map a given line to a given set. It is possible to define (although not easily implement) dynamic algorithms, by which a line is mapped into a set dynamically (e.g., at time

0098-5589/78/0300-0121$00.75 © 1978 IEEE




of reference); this mapping may change with time. We experiment later with two dynamic mapping algorithms: least recently used (LRU) and first-in, first-out (FIFO). At the time of a miss to the cache, LRU selects the set in the cache that has been least recently accessed to receive the line to be fetched. FIFO mapping maintains a rotating pointer which selects each of the sets in turn for the new line. One would expect that either of these dynamic algorithms would tend to distribute the active lines more evenly over the sets in the cache and thus produce a lower miss ratio than either of the static algorithms.

One other cache organization, called the sector buffer, has been used on the IBM 360/85 [2]. In this machine, the cache is divided into 16 sectors of 1024 bytes each. When a word is accessed for which the corresponding sector is not in the cache (the sector search is fully associative), a sector is made available (the LRU sector, see below) and a 64-byte block is transferred. When a word is referenced whose sector is in memory but whose block is not, the block is simply fetched. The performance of this algorithm is now known to be unsatisfactory.
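The static and dynamic mapping policies described above can be sketched in a few lines. This is our own illustration, not code from the paper; the class names and the 32-byte line size are assumptions taken from the surrounding text:

```python
# Sketch of the set-mapping policies discussed above (illustrative only).

LINE_BITS = 5  # log2 of a 32-byte line: address bits 0-4 are the byte offset


def bit_select(addr: int, k: int) -> int:
    """Bit selection: the set is given by bits 5 .. 5 + log2(k) - 1."""
    return (addr >> LINE_BITS) & (k - 1)


class FIFOSetMapper:
    """Dynamic FIFO mapping: a rotating pointer picks the set for each new line."""
    def __init__(self, k: int):
        self.k, self.ptr = k, 0

    def place(self) -> int:
        s = self.ptr
        self.ptr = (self.ptr + 1) % self.k
        return s


class LRUSetMapper:
    """Dynamic LRU mapping: the least recently accessed set receives the new line."""
    def __init__(self, k: int):
        self.order = list(range(k))  # front = least recently accessed set

    def touch(self, s: int) -> None:
        self.order.remove(s)
        self.order.append(s)

    def place(self) -> int:
        s = self.order[0]  # least recently accessed set gets the fetched line
        self.touch(s)
        return s
```

Note how bit selection sends lines 1, 1 + k, 1 + 2k, ... to the same set, which is exactly the collision behavior the dynamic mappers are intended to smooth out.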

B. Replacement Algorithms

The replacement algorithm is used to select a line for removal from the cache so that a referenced line can be loaded. A wide choice of replacement algorithms is possible, just as with main memory paging, for which replacement algorithms have been studied extensively. Because the buffer is generally overcommitted, that is, since it is generally impossible to keep the working set [8] of even one program in the cache at once, and also because of the constrained mapping mechanisms, only fixed space replacement algorithms such as LRU [9], FIFO [10], or RAND [10] are generally considered. LRU replacement, by which the least recently used page (line) is the one removed from memory, has been found to work well both in cache and main memories and is used in some IBM 370 [4] series machines. Unless otherwise indicated, we shall consider LRU replacement in this paper. We note that for small set sizes, LRU may be implemented efficiently in hardware (e.g., [11]).
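LRU replacement within a single set, as described above, can be sketched as follows; this is a minimal illustration of ours, not the hardware mechanism of [11]:

```python
# Minimal sketch of LRU replacement within one set of a set-associative cache.
from collections import OrderedDict


class LRUSet:
    """One set holding up to `assoc` lines; a miss evicts the least recently
    used line. Front of the OrderedDict = least recent, back = most recent."""

    def __init__(self, assoc: int):
        self.assoc = assoc
        self.lines = OrderedDict()  # tag -> None

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss (the line is loaded either way)."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)          # becomes most recently used
        else:
            if len(self.lines) >= self.assoc:
                self.lines.popitem(last=False)   # evict least recently used
            self.lines[tag] = None
        return hit
```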

C. Cache Architecture

Two large machines with set associative cache memories (using bit selection), implemented slightly differently from each other, are the IBM 370/168-3 [5] and the Amdahl 470V/6 [12]. Each also has virtual memory, adding another level of complexity to the design of the cache memory. We discuss each very briefly here to illustrate cache architecture; a more complete discussion is available in [13] or from the manufacturer's manuals.

The 470V/6 has 256 sets, with two 32-byte lines per set.

Associated and stored with each line is a tag, consisting of the high order 11 bits of its real memory address. The high order 13 bits of the 24-bit virtual address generated by the CPU are sent to the translation lookaside buffer (TLB), which returns the corresponding real address; if the TLB does not contain the real address, a lookup is performed using the segment and page tables. In the meantime, bits 5 to 10 of the virtual address are used to select (partially read out) the 8 lines (in four sets) that could be the desired line. A four way selection among the eligible sets is made on the basis of the real address (bits 11, 12 from the TLB) and then an associative selection is made among the two lines in the set. The desired four bytes are finally read out.

The 370/168-3 is similar, but uses 128 sets of 8 lines each and keeps the tag bits for each line separately from the data. The set is selected by bits 5 to 11 of the virtual address, and then the real address (from the TLB) is used to select (associatively) among the 8 lines in that set. The needed 8 bytes from the corresponding line are then gated out onto the buffer output bus.

A very important issue in each of these designs is speed: bit

selection is used to choose a set because it is fast and easy to implement. The set size chosen is a compromise between minimizing the miss ratio and minimizing the speed and cost. Small set sizes result in competition by simultaneously active lines for the small number of positions in a set; direct mapping is considered unacceptable in high speed machines for this reason. Large set sizes require more logic, both for search and replacement, and operate somewhat more slowly.

Random mapping, although not currently used for cache access (it is used for the TLB's in both the 370/168 and the 470V/6), can be implemented almost as easily as bit selection. The randomness may be achieved by exclusive-or'ing the line address either with itself after folding or with some other convenient quantity; the constraint on the randomizing function is that it be fast. Other static algorithms, providing that they can function quickly enough, may also be used. Dynamic mapping algorithms, however, require a complicated mapping table, thus negating the advantages of set associative organization. Our experiments with dynamic algorithms are therefore more for the purpose of comparison than to examine realistic alternatives.

D. Main Memory as a High Speed Buffer

We believe that within the next five years, certainly within the next decade, electronically accessed "secondary storage" (actually, third-level storage after cache and main) will become standard for large, high speed computing systems. Several "gap-filler" technologies suitable for such a function are already available commercially or are well advanced into commercial development. We include here magnetic bubbles [14], [15], charge-coupled devices [16]-[18], domain tip technology [19], and electron-beam accessed memory [20]-[22]. For example, the electron-beam memory under development by General Electric [20] promises transfer rates of 10 Mbit/s per tube, with a random access time of 30 μs and an initial cost of 0.02¢/bit, declining to 0.001¢/bit. Micro-bit Corporation [22] expects to have a memory with a 5-μs access time, 2-Mbit/s transfer rates, and a cost of about 0.04¢/bit.

A potentially highly effective use for these gap-filler technologies is as "main memory" where current main memory, semiconductor RAM, is used as a high speed buffer. Some currently existing computer systems permit virtual address spaces





considerably larger than the physical address space. Multics, running on the Honeywell 6180 (now 66/80), can address 2^36 words. IBM 370's can address only 2^24 bytes but could easily be modified to reach 2^32 bytes. Mapping this address space directly onto high speed electronic storage, with semiconductor RAM used as a hardware managed cache to this storage, would appear to produce significant savings in overhead relative to software implemented paging. We believe that many of the advantages that result from the use of set associative mapping schemes in cache memories will also favor their use in main memories. Therefore, we expect that our analysis of set associative mapping will also be applicable to main memory operation, and we will explicitly consider two models for main memory miss ratios.

E. Outline

In the next section of this paper, we analyze random mapping and use that analysis to examine the efficiency of static set associative page mapping. For two different models of the miss ratio curve, a mixture of geometric distributions and the hyperbolic curve resulting from Saltzer's linear model [23], [24], we are able to obtain simple closed-form expressions for the miss ratio for LRU replacement. In Section III, a number of program address traces are used for trace-driven simulation, and we show that our calculations give good estimates for the miss ratios for both random and bit-selection set associative mapping. We experiment with LRU and FIFO dynamic mapping and find that they often perform significantly better than either of the two static algorithms. Comparisons indicating the performance penalty to be expected from decreases in the degree of associativity are also presented.

II. ANALYSIS OF RANDOM AND BIT-SELECTION SET ASSOCIATIVE MAPPING

In this section we show how to calculate the set associative miss ratio for two paging algorithms from the fully associative miss ratio. For LRU replacement, our results are of a particularly simple form and are independent of the program behavior model, if any, assumed, including empirical "models."

The major assumption that we make in this section, upon which our results depend, is that the mapping between a page and its set in the memory is completely independent of its probability of reference. This assumption should hold exactly for random mapping; it holds fairly well (as we will see in Section III) for bit-selection mapping. In the case of LRU replacement, this will mean that the set into which a given page is mapped is independent of the position of that page within the LRU stack.

A. Least Recently Used Replacement

We define R = {r_i, i = 1, 2, ...} to be the reference string for some arbitrary program. Let D = {d_i, i = 1, 2, ...} be the corresponding LRU distance string for the full LRU stack (number of sets = 1) [9]. We do not assume any model of program behavior; we only assume that the distance string D exists, and that some average probability of a hit to level j in the complete LRU stack, which we denote q_j, also exists. It is not necessary that the value of q_j be stationary; it may change with time, may display trends, and successive values need not be independent. Given either the LRU stack model or the independent reference model [25] for program behavior, it is straightforward to determine the q_j.
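The LRU distance string D for a given reference string R can be computed directly; the sketch below is our own construction of the standard stack-distance procedure:

```python
def lru_distance_string(refs):
    """Return the LRU distance d_i for each reference r_i: the depth of r_i in
    the LRU stack (counting from 1), or None on a first reference (infinite
    distance). The referenced page is then moved to the top of the stack."""
    stack = []   # stack[0] = most recently used page
    dist = []
    for r in refs:
        if r in stack:
            d = stack.index(r) + 1
            stack.remove(r)
        else:
            d = None
        stack.insert(0, r)
        dist.append(d)
    return dist
```

Counting how often each depth j is hit, normalized by the string length, gives an empirical estimate of the q_j used below.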

It should be clear that if each set is managed as a separate memory with LRU replacement, the LRU order of the pages in each set will be a subsequence of the total LRU ordering. Let p_i(N) denote the probability of a hit to level i in one of the separate set stacks, where N is the number of sets; this value, p_i(N), may be calculated by summing over all of the hit probabilities in the full LRU stack, each multiplied by the probability that the page in that position in the full LRU stack is at level i in a set stack. We employ our assumption that each page (as a function of LRU stack depth) is equally likely to be found in any set to write by inspection

$$p_i(N) = \sum_{j=i}^{\infty} q_j \binom{j-1}{i-1} \left(\frac{1}{N}\right)^{i-1} \left(\frac{N-1}{N}\right)^{j-i}. \qquad (1)$$

For any given set of values {q_j}, the values for p_i(N) may be calculated directly from (1).
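Equation (1) is easy to evaluate numerically. A sketch, assuming the q_j are supplied as a truncated list (the truncation is our simplification):

```python
# Direct evaluation of (1): a page at depth j of the full LRU stack sits at
# depth i of its set stack with probability C(j-1, i-1)(1/N)^(i-1)((N-1)/N)^(j-i).
from math import comb


def p(i: int, N: int, q) -> float:
    """p_i(N) from (1); q[j-1] holds q_j for j = 1 .. len(q)."""
    return sum(q[j - 1] * comb(j - 1, i - 1)
               * (1 / N) ** (i - 1) * ((N - 1) / N) ** (j - i)
               for j in range(i, len(q) + 1))
```

A useful sanity check is that the set-stack hit probabilities redistribute, but conserve, the total hit probability: summing p_i(N) over i recovers the sum of the q_j.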

1) The Mixed Geometric Model: In order to examine further the effect of set size on the program miss ratio, where the miss ratio is the probability that a page that is referenced is not in primary (main or cache) memory, we shall assume that the distribution {q_j} is of one of two particular functional forms. First, we assume that

$$q_j = \sum_{i=1}^{T} a_i (1-b_i)\, b_i^{\,j-1}, \qquad 0 < b_i < 1,\quad 0 < a_i \le 1,\quad \sum_{i=1}^{T} a_i = 1,\quad j \ge 1. \qquad (2)$$

It has been generally observed that the miss ratio curve for a program is a steeply declining function which, when plotted on a semilogarithmic scale (linear x axis, logarithmic y axis), resembles a bent arm with the elbow at the lower left. Such curves appear, for example, in [26], and in Figs. 3-8 in the next section of this paper. It seems quite appropriate to model a function of this form as a mixture of geometric distributions such as (2).

If we substitute q_j as defined in (2) into the expression for p_i(N) in (1), we are able to obtain a relatively simple expression in a few steps as follows:

$$p_i(N) = \sum_{j=i}^{\infty} \left(\frac{1}{N}\right)^{i-1} \left(\frac{N-1}{N}\right)^{j-i} \frac{(j-1)!}{(i-1)!\,(j-i)!} \sum_{k=1}^{T} a_k (1-b_k)\, b_k^{\,j-1}$$

$$= \left(\frac{1}{N}\right)^{i-1} \sum_{k=1}^{T} \frac{a_k (1-b_k)}{b_k}\, b_k^{\,i}\, \frac{1}{(i-1)!} \sum_{j=i}^{\infty} \frac{(j-1)!}{(j-i)!} \left(\frac{N-1}{N}\, b_k\right)^{j-i}.$$

123

Page 4: A Comparative Study of Set Associative Memory Mapping

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 2, MARCH 1978

[Fig. 1. Comparison of memory organizations: miss ratio versus memory capacity (kilobytes).]

Let w = ((N-1)/N) b_k, so

$$\sum_{j=i}^{\infty} \frac{(j-1)!}{(j-i)!}\, w^{j-i} = \frac{d^{\,i-1}}{dw^{\,i-1}} \sum_{j=i}^{\infty} w^{j-1} = \frac{d^{\,i-1}}{dw^{\,i-1}} \left[\sum_{j=1}^{\infty} w^{j-1} - \sum_{j=1}^{i-1} w^{j-1}\right].$$

The derivative of the second term in brackets is zero, so this equals

$$\frac{d^{\,i-1}}{dw^{\,i-1}}\, \frac{1}{1-w} = (i-1)!\,(1-w)^{-i}.$$

Substituting,

$$p_i(N) = \sum_{k=1}^{T} a_k (1-b_k) \left(\frac{b_k}{N}\right)^{i-1} \left(1 - \frac{N-1}{N}\, b_k\right)^{-i}.$$

The miss ratio is defined as M_i(N) = Σ_{j=i+1}^∞ p_j(N), or

$$M_i(N) = \sum_{k=1}^{T} a_k \left(\frac{b_k}{N - (N-1)\, b_k}\right)^{i}. \qquad (3)$$

The full LRU stack miss ratio is

$$M_i(1) = \sum_{j=i+1}^{\infty} q_j = \sum_{k=1}^{T} a_k\, b_k^{\,i}. \qquad (4)$$
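The closed forms (3) and (4) can be evaluated directly; the sketch below uses the example parameters given in the text. Note that (3) reduces to (4) at N = 1, and that the equal-capacity comparison M_i(N) versus M_{Ni}(1) shows the set-associative penalty:

```python
# Closed-form miss ratios for the mixed geometric model, from (3) and (4).

def miss_set(i: int, N: int, a, b) -> float:
    """M_i(N) per (3): set associative, N sets of size i."""
    return sum(ak * (bk / (N - (N - 1) * bk)) ** i for ak, bk in zip(a, b))


def miss_full(i: int, a, b) -> float:
    """M_i(1) per (4): fully associative memory of size i."""
    return sum(ak * bk ** i for ak, bk in zip(a, b))


# Example parameters from the text (used for Fig. 1):
a = [0.998997, 0.001, 0.000003]
b = [0.6000, 0.91, 0.9835]
```

Plotting miss_set(i, N, a, b) / miss_full(N * i, a, b) against i reproduces the "peculiar" ratio-curve shape: different mixture terms dominate at different memory sizes, so the growth rate of the ratio changes as i increases.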

In Fig. 1 we have plotted the miss ratios for set associative and fully associative (labeled "LRU") paging for an example using the parameters:

a_1 = 0.998997, a_2 = 0.001, a_3 = 0.000003,
b_1 = 0.6000, b_2 = 0.91, b_3 = 0.9835,

which yields a plausible-appearing miss ratio curve (compare with Figs. 3-8). The rather peculiar shape of the ratio curve [(5) below] is explained by the results in (6) below.

Of particular interest in general is the ratio

$$\frac{M_i(N)}{M_{Ni}(1)} \qquad (5)$$

which is the factor by which the miss ratio increases when the memory is divided into N sets. For finite set sizes i, values for (5) may be computed numerically from (3) and (4). In the limit, for large i, we can obtain more general results. It is apparent by inspection that as i increases without limit, there will be one term in both (3) and (4) that dominates, and that in both cases it will be the same term, the one with the largest value of b. Let us designate this the tth term, with parameters a_t and b_t. Then, substituting the tth term from (3) and (4) into (5), and simplifying, we obtain

$$\frac{M_i(N)}{M_{Ni}(1)} = \left(\frac{1}{\left(N - (N-1)\, b_t\right) b_t^{\,N-1}}\right)^{i}. \qquad (6)$$

To determine the behavior of this expression, we compute the derivative of the term in the denominator, to obtain

$$\frac{d}{db_t}\left[\left(N - (N-1)\, b_t\right) b_t^{\,N-1}\right] = N (N-1)(1 - b_t)\, b_t^{\,N-2}$$

which, for 0 < b_t < 1, may be seen to be greater than zero. Since (6) is equal to 1 for b_t = 1, we can conclude that the denominator of (6) is less than 1 for 0 < b_t < 1, and thus that (6) is greater than 1 for i > 0, and increases without bound for increasing i.

This is a very counterintuitive result: that as memory size increases, set associative paging will become arbitrarily poor with respect to fully associative paging. Thus despite the fact that (2) would seem to be a good model for the distribution {q_j} over a finite region of memory sizes, we are led to believe that it cannot correctly represent the behavior of the system for very large memory sizes. For this reason, we will consider a somewhat different model, one which has had some experimental verification.

2) The Linear Paging Model: In 1974, Saltzer [23] published measurements taken on the Honeywell 6180 computer running the Multics operating system at the Massachusetts Institute of Technology. He found that over a very wide range, from 4 words of associative memory to 2048 pages of 1024 words each, the mean headway between faults (MHBF), that is, the number of instructions executed between faults, could be accurately characterized (for some a) as

MHBF = aZ

where Z is the number of words allocated to primary program storage. Greenberg [24] in later measurements found that the




linear approximation begins to fail at approximately 10^8 bits, and that after that point the MHBF curve contains quadratic and perhaps higher order terms. We shall limit our consideration to the linear model only since it spans the range of currently feasible memory sizes. Strictly speaking, the linear model indicates that the miss ratio function M_i(1) is equal to 1/(ai). For reasons of mathematical convenience, and with no loss of accuracy, since we are still well within measurement error, we shall approximate M_i(1) as (for some b and c)

$$M_i(1) = \begin{cases} b/(i-1) & i \ge 2 \\ b + c & i = 1 \end{cases} \qquad (7)$$

and

$$q_j = \begin{cases} b/\bigl((j-1)(j-2)\bigr) & j \ge 3 \\ c & j = 2 \\ 1 - b - c & j = 1. \end{cases} \qquad (8)$$

Then, for i ≥ 3,

[Fig. 2. Comparison of memory organizations: miss ratio versus memory capacity (bytes).]


$$p_i(N) = b \left(\frac{1}{N}\right)^{i-1} \sum_{j=i}^{\infty} \frac{(j-3)!}{(i-1)!\,(j-i)!} \left(\frac{N-1}{N}\right)^{j-i}$$

which, after a series of steps similar to those used in the previous model, reduces to

$$p_i(N) = \frac{b}{N (i-1)(i-2)}, \quad i \ge 3 \qquad (9)$$

and the miss ratio may be calculated to be

$$M_i(N) = \frac{b}{N (i-1)}, \quad i \ge 3. \qquad (10)$$

We must also compute

$$p_2(N) = \frac{c}{N} + \frac{b}{N} \sum_{i=1}^{\infty} \frac{w^i}{i} \quad \left(\text{where } w = \frac{N-1}{N}\right) \qquad (11)$$

$$= \frac{c}{N} + \frac{b}{N} \int \sum_{i=1}^{\infty} w^{i-1}\, dw = \frac{c}{N} - \frac{b}{N} \ln(1-w) + C_0 \qquad (12)$$

where C_0 is the constant of integration. Expanding (12) in a power series and comparing it to (11), we find that C_0 = 0, so

$$p_2(N) = \frac{c}{N} + \frac{b}{N} \ln N. \qquad (13)$$

In the same manner (but with a double integration), we find

$$p_1(N) = (1 - b - c) + c\,\frac{N-1}{N} + b - \frac{b}{N}\,(1 + \ln N) = 1 - \frac{c}{N} - \frac{b}{N} - \frac{b}{N} \ln N. \qquad (14)$$

Of great interest again is the limiting behavior of the ratio of the miss ratios between the set associative and the fully associative memories, which may be seen to be

$$\frac{M_i(N)}{M_{Ni}(1)} = \frac{i - 1/N}{i - 1}. \qquad (15)$$
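The chain (8) → (1) → (9), (13), (14) can be checked numerically. A sketch with illustrative parameter values of our own choosing (b, c, N, and the truncation depth J are assumptions):

```python
# Numeric check of the linear-model results: build q_j from (8), evaluate
# p_i(N) directly from (1), and compare with the closed forms (9), (13), (14).
from math import comb, log

b, c, N = 0.3, 0.05, 8   # illustrative parameter values
J = 5000                 # truncation depth for the infinite sums

# q[j-1] = q_j from (8)
q = [1 - b - c, c] + [b / ((j - 1) * (j - 2)) for j in range(3, J + 1)]


def p(i: int) -> float:
    """p_i(N) evaluated directly from (1) with the q_j of (8)."""
    return sum(q[j - 1] * comb(j - 1, i - 1)
               * (1 / N) ** (i - 1) * ((N - 1) / N) ** (j - i)
               for j in range(i, J + 1))
```

The direct sums agree with b/(N(i-1)(i-2)) for i = 3, with (13) for i = 2, and with (14) for i = 1, which is a useful cross-check on the algebra above.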

We see from (15) that as the memory size N·i increases without limit, the ratio of the fault rates approaches 1.0 and the penalty of set associative paging declines to zero. This is a considerably more satisfying as well as believable result than was observed in (6) for the previous model, both intuitively and in that it indicates that set associative memories will perform well.

A very important point which cannot be overemphasized is that two different and plausible models of program behavior, the mixed exponential model and the linear model, have yielded exactly opposite predictions. Research which bases its results on unverified mathematical models of program behavior should thus always be considered with suspicion, unless it can be shown that the results are very insensitive to the assumptions.

We have plotted the miss ratio for the linear model [(7)], the miss ratio for set associative mapping [(10)], and their ratio [(15)] in Fig. 2 to illustrate our results.

B. Non-LRU Replacement Algorithms

Least recently used replacement is relatively easy to analyze, as is clear from (1), since the ordered sequence of pages according to the LRU ordering in each set is a subsequence of the full LRU stack. This type of analysis is possible at all only for stack algorithms [9], which impose an ordering on the pages and usually permit this ordering to be partially carried over to the individual sets; nonstack algorithms (such as FIFO) are much more difficult to analyze since an essentially enumerative approach seems necessary. For other than LRU, however, the ordering only applies partially to the individual sets. In the special case of the independent reference model, it is possible to analyze A_0 replacement [27], which, in the




limit is approximated by least frequently used (LFU) replacement. The difficulty in this analysis, which will become apparent, is that in order to reference a page, it must be in main memory, and thus the set of pages in main memory may include a page that is not among the most frequently used. The number of such "nonfrequent" pages may be as large as the number of sets, and thus the analysis becomes considerably more complicated. This complication, in fact, is the reason why this analysis holds only for the independent reference model and not more generally.

The independent reference model assumes that each page is referenced independently of all others with a constant probability q_j for page j. Without loss of generality, we assume that the q_j are monotonically nonincreasing for increasing j. A_0, the optimal realizable algorithm for this model, maintains in a memory of size V page frames those V - 1 pages with the highest reference probability, and uses the remaining page frame for all other pages. It can be shown [25] that the fault rate for this algorithm is

$$M_V(1) = 1 - \left[\sum_{i=1}^{V-1} q_i + \sum_{i=V}^{\infty} q_i \left(\frac{q_i}{\sum_{k=V}^{\infty} q_k}\right)\right]. \qquad (16)$$
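Equation (16) can be sketched in a few lines; this is our own illustration, assuming the q_i are given sorted in nonincreasing order and truncated to a finite list:

```python
# Sketch of (16): the A0 fault rate under the independent reference model.
# The V-1 most probable pages are resident; the remaining frame holds the
# most recently referenced of the other pages.

def a0_fault_rate(q, V: int) -> float:
    """q: reference probabilities, nonincreasing, summing to 1 (finite list)."""
    tail = sum(q[V - 1:])          # total probability of the "other" pages
    hit = sum(q[:V - 1]) + sum(qi * (qi / tail) for qi in q[V - 1:])
    return 1 - hit
```

The second sum reflects that a reference to page i (i ≥ V) hits only if i was also the last "other" page referenced, which under the independent reference model occurs with probability q_i / Σ_{k≥V} q_k.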


In computing the fault rate for set associative mapping, it is necessary to compute the probability that a page is not in the top V - 1 locations in that set (for a memory of size V · N) and that it is also not the last page referenced in that set, among those not in the top V - 1 locations. To make our calculations intelligible, we define some terms. Let t_{i,j} be the probability that page i occupies position j in the set stack, where the set stack is ordered by probability of reference from high to low. Let T_{i,j} = Σ_{k=1}^{j} t_{i,k}, and let T̄_{i,j} = 1 - T_{i,j}. Let S_j = Σ_i T_{i,j} q_i. Then S_j is the probability of reference to a page in one of the top j set stack positions, given that the set size is j + 1 or greater. We may calculate these variables to be:

$$t_{i,j} = \binom{i-1}{j-1} \left(\frac{1}{N}\right)^{j-1} \left(\frac{N-1}{N}\right)^{i-j}, \quad i \ge j$$

$$T_{i,j} = \sum_{k=1}^{j} \binom{i-1}{k-1} \left(\frac{1}{N}\right)^{k-1} \left(\frac{N-1}{N}\right)^{i-k}$$

$$S_j = \sum_{i=1}^{\infty} q_i \sum_{k=1}^{j} \binom{i-1}{k-1} \left(\frac{1}{N}\right)^{k-1} \left(\frac{N-1}{N}\right)^{i-k}.$$
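These occupancy terms are straightforward to compute; a sketch of ours (page i is at set-stack position j exactly when j - 1 of the i - 1 more probable pages share its set):

```python
# Set-stack occupancy terms t_{i,j}, T_{i,j}, and S_j defined above.
from math import comb


def t(i: int, j: int, N: int) -> float:
    """Probability that page i occupies position j of its set stack."""
    if j > i:
        return 0.0
    return comb(i - 1, j - 1) * (1 / N) ** (j - 1) * ((N - 1) / N) ** (i - j)


def T(i: int, j: int, N: int) -> float:
    """Probability that page i is in one of the top j set stack positions."""
    return sum(t(i, k, N) for k in range(1, j + 1))


def S(j: int, N: int, q) -> float:
    """Probability of reference to a page in the top j set stack positions;
    q[i-1] holds q_i for a finite (truncated) list of pages."""
    return sum(qi * T(i, j, N) for i, qi in enumerate(q, start=1))
```

Since the position of page i is binomially distributed, the t(i, ·, N) sum to one over positions 1 .. i, which makes a convenient sanity check.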

We also define a variable t_{i,j,k} as

$$t_{i,j,k} = \begin{cases} \dbinom{i-1}{j-1} \left(\dfrac{1}{N}\right)^{j-1} \left(\dfrac{N-1}{N}\right)^{i-j} & i \ge j,\; i < k \\[2mm] \dbinom{i-2}{j-1} \left(\dfrac{1}{N}\right)^{j-1} \left(\dfrac{N-1}{N}\right)^{i-j-1} & i \ge j+1,\; i > k \\[2mm] 0 & \text{otherwise;} \end{cases}$$

t_{i,j,k} is the probability that page i is in set stack position j, given that page k is not in positions 1, ..., j - 1 of this set. We define S_{j,k} as

$$S_{j,k} = \sum_{\substack{i=1 \\ i \ne k}}^{\infty} q_i \sum_{l=1}^{j} t_{i,l,k}$$

which is the probability of reference to a page in set stack positions 1, ..., j, given that page k is not in any of these positions. Then the total expected rate of reference to pages in a set u, but not in the top i - 1 levels of u, given also that page j is in set u, but not in the top i - 1 levels of u, is

$$(1 - S_{i-1,j} - q_j)/N + q_j.$$

Therefore, the probability that, on referencing page j, not in the top i - 1 levels of a set, we find that it was the last referenced page in that set among those not in the top i - 1 levels is

$$\frac{q_j}{(1 - S_{i-1,j} - q_j)/N + q_j}$$

and

MV(N)= 1 - Sv-l +

00

EQi2 T,C V -1i=V I (17)(1 - SV-i,--qj)
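These definitions mechanize directly. The following sketch is our own illustration (all function names are assumptions, and the infinite sums are truncated at the n pages supplied); for N = 1 it collapses to the fully associative expression of the form of (16), which makes a useful consistency check.

```python
from math import comb

def t(i, j, N):
    # prob. that page i occupies set stack position j: exactly j-1 of
    # the i-1 more-probable pages map to the same set
    if j > i:
        return 0.0
    return comb(i - 1, j - 1) * (1 / N) ** (j - 1) * ((N - 1) / N) ** (i - j)

def t3(i, j, k, N):
    # prob. that page i is in position j, given page k is absent
    # from positions 1, ..., j-1 of this set
    if j <= i < k:
        return t(i, j, N)
    if i >= j + 1 and i > k:
        return comb(i - 2, j - 1) * (1 / N) ** (j - 1) * ((N - 1) / N) ** (i - j - 1)
    return 0.0

def miss_ratio(q, V, N):
    """Evaluate the LFU-style set associative fault rate of (17) for a
    finite population of n = len(q) pages, set size V, and N sets."""
    n = len(q)
    T = lambda i, j: sum(t(i, m, N) for m in range(1, j + 1))
    S = lambda j: sum(q[i - 1] * T(i, j) for i in range(1, n + 1))
    def S2(j, k):  # S_{j,k}
        return sum(q[i - 1] * sum(t3(i, l, k, N) for l in range(1, j + 1))
                   for i in range(1, n + 1) if i != k)
    hit = S(V - 1)
    for i in range(V, n + 1):
        qi = q[i - 1]
        den = qi + (1 - S2(V - 1, i) - qi) / N
        hit += (1 - T(i, V - 1)) * qi * qi / den
    return 1 - hit
```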

If we neglect the initial part of the reference string R, during which portion the probability that a reference to a page will cause it to change position in the LFU stack is nonnegligible, then these same expressions hold with only slight error where we define the q_j to be the hit probabilities for the LFU stack. As may be apparent, the mathematical contortions required to estimate the fault rate for other than LRU replacement are out of all proportion to the importance of our interest in such estimates. We therefore do not pursue this matter further.

III. DATA ANALYSIS-PREDICTIONS AND MEASUREMENTS

A. Random and Bit-Selection Mapping

As discussed in the beginning of Section II, our results are

based on the assumption that the placement of lines into sets is independent of their probability of reference. This condition is clearly satisfied for random mapping, but may or may not be satisfied for bit-selection mapping. In order to test whether any deviations from this assumption, which is difficult to test directly, are important, we ran trace-driven simulations using program address traces and measured the LRU (fully associative) fault rate, the set associative fault rate (for 64, 16, and 256 sets, and bit-selection mapping as indicated), and calculated the predicted set associative fault rate from (2). Used in this simulation were four program address traces for the IBM 360/91: APL, the execution of an APL program which produces plots at a terminal; WATFIV, the execution of the Watfiv compiler; WATEX, the execution of a combinatorial search program compiled using the Watfiv compiler; and FFT,
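As an illustration of the mechanism being simulated, bit-selection set associative lookup with LRU replacement within each set can be sketched as follows. This is our own sketch, not the paper's simulator; the function name and signature are assumptions.

```python
from collections import OrderedDict

def simulate(addresses, nsets, assoc, line_size=32):
    """Miss ratio of a bit-selection set associative cache with LRU
    replacement within each set."""
    sets = [OrderedDict() for _ in range(nsets)]
    misses = 0
    for a in addresses:
        line = a // line_size          # line address (drops the byte offset bits)
        s = sets[line % nsets]         # bit selection: low-order line-address bits
        if line in s:
            s.move_to_end(line)        # LRU update on a hit
        else:
            misses += 1
            if len(s) >= assoc:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True
    return misses / len(addresses)
```

With line_size = 32 and nsets = 64, `line % nsets` extracts address bits 5 through 10, the bit-selection field used in the paper's simulations.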


Fig. 3. Cache miss ratio vs. cache capacity (bytes): single LRU stack vs. set associative vs. estimated set associative, 32-byte blocks, APL trace.

Fig. 4. Cache miss ratio vs. cache capacity (bytes): single LRU stack vs. set associative (64 sets) vs. estimated set associative, 32-byte blocks, WATFIV trace.

the execution of the Fast Fourier Transform algorithm [28]. Various properties of these traces are discussed in [26]. All simulations were run for one million memory references and used 32-byte lines. 32-byte lines and 64 sets were chosen to match those of the IBM 370/168-1 [5].

The results from our simulations are presented in Figs. 3-6,

where we plot the miss ratio as a function of the memory allocated for all four program traces for the fully associative and set associative implementations and for the estimate for set associative paging. As is clear from Figs. 3-5, which display the results for the APL, WATFIV, and FFT traces, our estimates are excellent, and on the scale of the plots are hard to distinguish from the measured values. In Fig. 6, where the results for the WATEX trace are presented, we observe that the estimates are consistently and significantly high. For comparison, the same simulation was run and comparison made, but for a memory of 16 rather than 64 sets (Fig. 7). We believe, although we have not been able to verify it directly, that the poor estimates in Fig. 6 are due to the simultaneous use, to an unusual extent, of lines that are physically contiguous and thus fall into different sets. As is the custom, we have used bits 5 through 10 (as noted earlier) to select the set. The use of a smaller number of sets, 16, forces lines in simultaneous use to "wrap around" and thus fall into the same set. This

Fig. 5. Cache miss ratio vs. cache capacity (bytes): single LRU stack vs. set associative (64 sets) vs. estimated set associative, 32-byte blocks, FFT trace.

Fig. 6. Cache miss ratio vs. cache capacity (bytes), WATEX trace (64 sets).

Fig. 7. Cache miss ratio vs. cache capacity (bytes), WATEX trace with 16 sets.

provides a much closer match to the results to be expected from random placement.

We have also run simulations of random mapping and the

results appear in Table I. Each of the five right-hand columns gives the ratio of the given miss ratio to the fully associative LRU miss ratio for the given memory capacity, trace, and


TABLE I
Ratio to LRU Miss Ratio, by mapping algorithm

Trace / Capacity   LRU      FIFO     bit sel.  random   random (predicted)

WATFIV, 64 sets:
  2K    1.078*   1.227*   1.229   1.500   1.396
  4K    1.026*   1.049*   1.121   1.270   1.162
  6K     .991*    .975*    .999   1.029   1.004
  8K     .968*    .977*    .987   1.007    .982
  10K   1.049    1.009    1.183   1.291   1.249
  12K   1.308    1.295    1.814   2.096   2.013
  14K   1.220    1.185    2.177   2.416   2.369
  16K   1.121    1.029    2.306   2.376   2.500
  24K   1.129    1.099    1.252   1.180   1.401
  32K   1.157    1.005    1.108   1.129   1.204

WATEX, 64 sets:
  2K    1.949*   2.241*   2.039   3.206   2.955
  4K    1.043*   1.266*   1.360   2.064   2.562
  6K    1.170*   1.195*   1.401   2.401   2.638
  8K    1.138*   1.129*   1.886   3.102   3.120
  10K    .968     .923    1.168   1.800   1.858
  12K   1.016    1.000    1.000   1.333   1.472
  14K   1.042     .958     .962   1.181   1.348
  16K   1.161    1.049     .999   1.421   1.521

FFT, 64 sets:
  2K    3.699*   3.789*   4.825   6.408   7.420
  4K     .966*   2.377*   8.831   15.06   15.64
  6K     .955*    .999*   2.955   7.597   4.182
  8K     .955*    .974*   1.545   2.344   1.571
  10K   1.000    1.007    1.374   1.075   1.082
  12K   1.000     .993    1.286    .946    .952
  14K   1.000     .878    1.184    .837    .891
  16K   1.441    1.039    1.412   1.118   1.209
  24K   1.487     .992    1.004   1.024   1.028
  32K   1.539    1.000    1.000   1.019   ---

APL, 64 sets:
  2K    1.186*   1.662*   1.595   1.805   1.861
  4K    1.050*   1.102*   1.388   1.339   1.357
  6K    1.034*   1.044*   1.493   1.355   1.379
  8K    1.044*   1.022*   1.606   1.350   1.432
  10K   1.087    1.048    1.563   1.333   1.362
  12K   1.082    1.018    1.397   1.303   1.297
  14K   1.063    1.016    1.231   1.212   1.227
  16K   1.037     .983    1.110   1.152   1.161
  24K   1.105**  1.011**  1.046   1.072   1.087
  32K   1.052**   .966**   .977   1.003   1.018

* S = 8K; ** S = 32K; S = 16K otherwise.

other parameters. The rightmost two columns give the predicted [(2)] and measured random mapping miss ratio. Within the range of experimental error, these two columns should be identical, so the differences are some indication of the significance of differences between other columns. The first four parts of Table I present the results for each of the previously discussed traces; the remaining parts show the results of multiprogramming all four traces. In the case of multiprogramming, each of the traces was allowed to run on the "processor" for exactly Q time units (where a miss counted as 10 time units, which is the "access ratio"), and then the processor was switched to another trace (using round robin scheduling).

Comparing the contents of the columns for random and bit-selection mapping of Table I, we observe the following: 1) there is an inconsistent but noticeable bias in favor of bit-selection mapping over random mapping and 2) this bias is most pronounced and consistent when the number of elements in the set is small. This suggests that, since bit selection is easier to implement and faster in operation than random mapping, it should be used, as it has been, for actual machine designs.

On the whole, we believe (from these trace-driven simulations) that our estimates for the effectiveness of bit-selection set associative mapping are generally accurate, and that when they do err, they err in such a way as to underestimate the effectiveness of bit-selection set associative page mapping. If set associative page mapping is used to map pages from third-level electronic storage to primary memory, then the randomness of

TABLE I (cont'd)
Ratio to LRU Miss Ratio, by mapping algorithm

Trace(s) / Capacity   LRU      FIFO     bit sel.  random (predicted)

WATFIV-WATEX-FFT-APL, Q = 100, 64 sets:
  2K    1.046   1.204   1.138   1.276
  4K    1.127   1.247   1.252   1.318
  6K    1.180   1.277   1.293   1.412
  8K    1.104   1.153   1.205   1.285
  10K   1.031   1.045   1.103   1.128
  12K   1.000   1.006   1.047   1.038
  14K    .993   1.000   1.025   1.011
  16K    .996   1.000   1.043   1.034
  24K   1.283   1.063   1.302   1.293
  32K   1.224   1.033   1.220   1.219

WATFIV-WATEX-FFT-APL, Q = 10,000, 64 sets:
  2K    1.120   1.278   1.652   2.221
  4K    1.016   1.027   1.165   1.343
  6K     .988    .973   1.042   1.078
  8K     .997    .976   1.051   1.075
  10K   1.019    .996   1.081   1.109
  12K   1.017   1.004   1.062   1.083
  14K   1.000   1.004   1.008   1.008
  16K    .978   1.000    .944    .940
  24K   1.110   1.055   1.118   1.157
  32K   1.093   1.021   1.227   1.211

WATFIV-WATEX-FFT-APL, Q = 100,000, 64 sets:
  2K    1.212   1.464   1.605   2.234
  4K    1.006   1.039   1.131   1.389
  6K     .966    .946    .980   1.047
  8K     .982    .928   1.009   1.068
  10K   1.041    .986   1.150   1.252
  12K   1.148   1.111   1.385   1.523
  14K   1.106   1.053   1.369   1.444
  16K   1.035   1.004   1.262   1.254
  24K   1.009   1.009   1.003    .994
  32K    .991    .993    .966    .965

WATFIV-WATEX-FFT-APL, Q = 100,000, 256 sets:
  8K     .814   1.027   1.548   2.575
  16K   1.094   1.051   1.587   1.917
  24K   1.029   1.007   1.054   1.268
  32K    .996    .994    .962   1.003

* S = 8K; ** S = 32K; S = 16K otherwise.

page placement on this third-level storage should be sufficient to fulfill our randomness condition and permit the accurate application of our results.

B. LRU and FIFO Mapping

In Section I we described two dynamic mapping algorithms,

LRU and FIFO, and indicated that they might be able to perform better than the two static algorithms discussed. Trace-driven simulations were again used to make comparisons. In each case, when a miss occurred (to a memory of size S), the line fetched was assigned, for the duration of its stay in memory, to either the least recently accessed set or the appropriate set in FIFO order, where the FIFO list was updated only at fault times. Since LRU replacement was used for the lines at all times, it was possible to record miss ratios for all memory capacities at once [9], but the simulation for LRU and FIFO mapping was affected by the specific capacity S associated with the run. In each case in Table I, the value of S is indicated.
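The dynamic LRU set-selection rule just described can be sketched as follows; this is our own illustration (all names are assumptions), and the FIFO variant differs only in how the next set is chosen at fault times. Because the set chosen depends on the reference history, a table mapping each resident line to its set stands in for the fixed address-bit mapping.

```python
from collections import OrderedDict

def simulate_lru_mapping(addresses, nsets, assoc, line_size=32):
    """Miss ratio with dynamic LRU set selection (infeasible in hardware,
    as the paper notes): on a miss, the fetched line is assigned to the
    least recently accessed set; replacement within a set is LRU."""
    sets = [OrderedDict() for _ in range(nsets)]
    set_order = OrderedDict((i, True) for i in range(nsets))  # LRU order of sets
    where = {}                    # resident line -> index of its assigned set
    misses = 0
    for a in addresses:
        line = a // line_size
        if line in where:
            s = where[line]
            sets[s].move_to_end(line)          # LRU update within the set
        else:
            misses += 1
            s = next(iter(set_order))          # least recently accessed set
            if len(sets[s]) >= assoc:
                victim, _ = sets[s].popitem(last=False)
                del where[victim]              # evict LRU line of that set
            sets[s][line] = True
            where[line] = s
        set_order.move_to_end(s)               # set s was just accessed
    return misses / len(addresses)
```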

It can be seen from Table I that LRU and FIFO set selection perform about equally well, and in most cases almost as well as full LRU replacement. Both are a significant improvement over the results obtained from our two static algorithms, bit selection and random.

It is particularly interesting to compare the dynamic and static algorithms for the multiprogrammed simulations for different values of Q, the process switching interval. Fig. 8


Fig. 8. Miss ratio vs. cache size (bytes) for the multiprogrammed traces, for Q = 100, 1000, 5000, 10K, 20K, 30K, 50K, and 250K; 32 bytes/line, 64 sets, LRU replacement, access ratio 10.

gives the miss ratio as a function of Q and the memory size, and one may see that frequent switching significantly increases the miss ratio. The effect is reflected in the results of Table I, where for small values of Q there is little difference between the static algorithms and the dynamic ones. For large values of Q, dynamic algorithms show a significant improvement. The reason would appear to be that the effect of memory contention within a set, which makes set associative mapping generally poorer than full LRU, is swamped for small Q by the competition among the programs for cache space. For large values of Q, a program is able to entirely fill the buffer with its own information, at which point set contention becomes visible; for small Q, a program simply gets to evict another program's data from the cache when it runs.
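The round-robin scheduling rule used in the multiprogrammed runs (Q time units per program, a miss costing the access ratio of 10) can be sketched as follows. This driver is our own illustration; the function name and the `cache.access` interface (returning True on a hit) are assumptions, not the paper's code.

```python
def run_multiprogrammed(traces, Q, cache, access_ratio=10):
    """Interleave several address traces round robin.  Each program runs
    for Q time units: a hit advances the clock by 1, a miss by
    `access_ratio`.  Returns the overall miss ratio."""
    pending = [iter(t) for t in traces]
    misses = refs = 0
    while pending:
        still_running = []
        for t in pending:
            clock = 0
            alive = True
            while clock < Q:
                a = next(t, None)
                if a is None:          # this trace is exhausted
                    alive = False
                    break
                refs += 1
                if cache.access(a):
                    clock += 1         # hit: one time unit
                else:
                    misses += 1
                    clock += access_ratio  # miss: the "access ratio" cost
            if alive:
                still_running.append(t)
        pending = still_running
    return misses / refs if refs else 0.0
```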

As we also noted in Section I, despite the significant decrease in the miss ratio observed for dynamic algorithms over static mapping algorithms, it would appear to be very difficult to implement such dynamic mapping. It is interesting to observe, however, that dynamic mapping largely removes the disadvantages of set associative mapping with respect to the miss ratio.

IV. CONCLUSIONS

We have shown how to calculate the fault rate for static set

associative page mapping from the fault rate using fully associative mapping. For two models of the miss ratio curve, the mixed exponential and the inverse of the linear model, we were able to obtain simple closed-form expressions for the LRU miss ratio as a function of the memory and set size. In both cases, the limiting behavior of the miss ratio for set associative mapping as compared to fully associative mapping was examined, and it was found that these two models yielded sharply different results. The mixed exponential model suggested that set associative mapping was unboundedly worse than fully associative mapping, whereas the linear model indicated that set associative mapping rapidly approached that of unconstrained mapping. We believe that the latter model is more accurate and that only small increases in the miss ratio are likely with set associative page mapping.

In order to verify the accuracy of our predictions, trace-driven simulations were run using four different program traces and in three cases our formulas were found to be in excellent agreement with measured results. In the fourth case, bit-selection mapping was found to perform better than predicted. Good agreement between random and bit-selection mapping was also found for simulations of multiprogramming using four different traces. Experiments with dynamic mapping algorithms indicated that most of the penalty associated with static mapping over fully associative mapping was eliminated for dynamic mapping.

From both our measurements and calculations, we draw the conclusion that there is only a small miss ratio penalty for set associative bit-selection mapping. We believe that the implementation advantages for set associative mapping will result in its use when electronic third-level memories become fully integrated into computer designs.

REFERENCES

[1] L. Bloom, M. Cohen, and S. Porter, "Considerations in the design of a computer with high logic-to-memory speed ratio," in Proc. Gigacycle Computing Systems, Jan. 1962; AIEE Special Publ. S-136, pp. 53-63.
[2] J. S. Liptay, "Structural aspects of the System/360 Model 85, part II: The cache," IBM Syst. J., vol. 7, pp. 15-21, 1968.
[3] IBM Corporation, IBM System/360 and System/370 Model 195 Functional Characteristics (GA22-6943-2), IBM Syst. Development Division, Poughkeepsie, NY.
[4] IBM Corporation, IBM System/370 Model 168 Functional Characteristics (GA22-7010), 3rd ed., May 1974, IBM Syst. Products Division, Poughkeepsie, NY.
[5] IBM Corporation, System/370 Model 168 Theory of Operation/Diagrams Manual (vol. 4), Processor Storage Control Function (PSCF), May 1975, Poughkeepsie, NY.
[6] A. Kotok, private communication, Jan. 1976.
[7] C. J. Conti, "Concepts for buffer storage," IEEE Comput. Group News, pp. 9-13, Mar. 1969.
[8] P. J. Denning, "The working set model for program behavior," Commun. Ass. Comput. Mach., vol. 11, pp. 323-333, May 1968.
[9] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Syst. J., vol. 9, pp. 78-117, 1970.
[10] L. A. Belady, "A study of replacement algorithms for a virtual storage computer," IBM Syst. J., vol. 5, pp. 78-101, 1966.
[11] K. Maruyama, "mLRU page replacement algorithm in terms of the reference matrix," IBM Tech. Disclosure Bulletin, vol. 17, pp. 3101-3103, Mar. 1975.
[12] Amdahl Corporation, 470V/6 Machine Reference Manual, 1976.
[13] A. J. Smith, "Sequential program prefetching in memory hierarchies," Apr. 1977, submitted for publication.
[14] A. Bobeck, P. Bonyhard, and J. Gecsei, "Magnetic bubbles-An emerging new memory technology," Proc. IEEE, vol. 63, pp. 1176-1195, Aug. 1975.
[15] R. B. Clover, "Limitations of and expectations for magnetic bubble memories," in Proc. IEEE Comput. Soc. Conf., San Francisco, CA, Feb. 1977, pp. 96-98.
[16] G. Amelio, "Charge-coupled devices for memory applications," in Proc. NCC 1975, pp. 515-522.
[17] D. House, "CCD vs. RAM for bulk storage applications," in Proc. IEEE Comput. Soc. Conf., San Francisco, CA, Feb. 1976, pp. 58-61.
[18] Fairchild Semiconductor, "Economic impact of CCD memory," Mar. 1975; also, promotional literature on Fairchild CCD memory CDC 450/450A.
[19] R. J. Spain, J. I. Jauvits, and F. T. Ruben, "DOT memory systems," in Proc. NCC 1974, pp. 841-846.
[20] W. Hughes, C. Lemond, H. Parks, G. Ellis, G. Possin, and R. Wilson, "A semiconductor nonvolatile electron beam accessed mass memory," Proc. IEEE, vol. 63, pp. 1230-1240, Aug. 1975.
[21] J. Kelly, "The development of an experimental electron-beam-addressed memory module," Computer, pp. 32-42, Feb. 1975.
[22] Computer Decisions, July 1975, p. 16.
[23] J. H. Saltzer, "A simple linear model of demand paging performance," Commun. Ass. Comput. Mach., vol. 17, pp. 181-186, Apr. 1974.
[24] B. S. Greenberg, "An experimental analysis of program reference patterns in the Multics virtual memory," Project MAC Tech. Rep. MAC-TR-127, 1974.
[25] W. F. King III, "Analysis of demand paging algorithms," in Proc. IFIPS Conf., Aug. 1971, pp. TA-3-155-TA-3-159.
[26] A. J. Smith, "A modified working set paging algorithm," Stanford Comput. Sci. Dep. Rep. STAN-CS-74-451, Aug. 1974; IEEE Trans. Comput., vol. C-25, pp. 907-914, Sept. 1976.
[27] A. Aho, P. Denning, and J. Ullman, "Principles of optimal page replacement," J. Ass. Comput. Mach., vol. 18, pp. 80-93, Jan. 1971.
[28] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Computation, vol. 19, pp. 297-301, 1965.

Alan Jay Smith (S'73-M'74) was born in New Rochelle, NY. He received the B.S. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, and the M.S. and Ph.D. degrees in computer science from Stanford University, Stanford, CA, the latter in 1974.

He is currently an Assistant Professor in the Computer Science Division of the Department of Electrical Engineering and Computer Sciences and the Electronics Research Laboratory, University of California, Berkeley, a position he has held since 1974. His research interests include the analysis and modeling of computer systems and devices, operating systems, computer architecture, and data compression.

Dr. Smith is a member of the Association for Computing Machinery, the Society for Industrial and Applied Mathematics, Eta Kappa Nu, Tau Beta Pi, and Sigma Xi.

Performance of Storage Management in an

Implementation of SNOBOL4

G. DAVID RIPLEY, MEMBER, IEEE, RALPH E. GRISWOLD, AND DAVID R. HANSON, MEMBER, IEEE

Abstract-Results of measuring the performance of the storage management subsystem in an implementation of SNOBOL4 are described. By instrumenting the storage management system, data concerning the size, lifetime, and use of storage blocks were collected. These data, like those obtained from conventional time measurement techniques, were used to locate program inefficiencies. In addition, these measurements uncovered some deficiencies in the storage management system, and provided the basis upon which to judge the heuristics used in the garbage collector.

Index Terms-Program measurement, SNOBOL4, storage management.

Manuscript received May 11, 1977; revised August 17, 1977. This work was supported by the National Science Foundation under Grant MCS75-21757.
The authors are with the Department of Computer Science, The University of Arizona, Tucson, AZ 85721.

I. INTRODUCTION

CONVENTIONAL techniques of program measurement concentrate on execution time [1]. Storage, however, is an important and expensive resource. This is particularly true of high-level languages such as SNOBOL4, Lisp, and APL, which rely on dynamic storage management to support many language features. Measuring the performance of storage management in these kinds of languages may lead to better implementation techniques, and to a better understanding of the ways in which the dynamic features of such languages are actually used [2].

In SNOBOL4, as in languages such as Lisp and APL, creation of data objects is a run-time activity. Construction of strings, patterns, and arrays all require the allocation of storage during program execution. Other activities, such as pattern matching,

0098-5589/78/0300-0130$00.75 © 1978 IEEE