

Data Prefetch Mechanisms

STEVEN P. VANDERWIEL

IBM Server Group

AND

DAVID J. LILJA

University of Minnesota

The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. The following survey examines several alternative approaches, and discusses the design tradeoffs involved when implementing a data prefetch strategy.

Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—Cache memories; B.3 [Hardware]: Memory Structures

General Terms: Design, Performance

Additional Key Words and Phrases: Memory latency, prefetching

This work was supported in part by National Science Foundation grants MIP-9610379, CDA-9502979, CDA-9414015, the Minnesota Supercomputing Institute, and the University of Minnesota-IBM Shared University Research Project. Steve VanderWiel was partially supported by an IBM Graduate Research Fellowship during the preparation of this work.
Authors' addresses: S. P. VanderWiel, System Architecture, Performance & Design, IBM Server Group, 3605 Highway 52, North Rochester, MN 55901; email: [email protected]; D. J. Lilja, Dept. of Electrical & Computer Engineering, University of Minnesota, 200 Union St. SE, Minneapolis, MN 55455; email: [email protected].
Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and / or a fee.
© 2001 ACM 0360-0300/00/0600–0174 $5.00

ACM Computing Surveys, Vol. 32, No. 2, June 2000

CONTENTS

1. Introduction
2. Background
3. Software Data Prefetching
4. Hardware Data Prefetching
   4.1 Sequential Prefetching
   4.2 Prefetching with Arbitrary Strides
5. Integrating Hardware and Software Prefetching
6. Prefetching in Multiprocessors
7. Conclusions

1. INTRODUCTION

By any metric, microprocessor performance has increased at a dramatic rate over the past decade. This trend has been sustained by continued architectural innovations and advances in microprocessor fabrication technology. In contrast, main memory dynamic RAM (DRAM) performance has increased at a much more leisurely rate, as shown in Figure 1.

Chief among latency reducing techniques is the use of cache memory hierarchies [Smith 1982]. The static RAM (SRAM) memories used in caches have managed to keep pace with processor memory request rates but continue to be too expensive for a main store technology. Although the use of large cache hierarchies has proven to be effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns, it is still not uncommon for scientific and other data-intensive programs to spend more than half their run times stalled on memory requests [Mowry et al. 1992]. The large, dense matrix operations that form the basis of many such applications typically exhibit little data reuse and thus may defeat caching strategies.

The poor cache utilization of these applications is partially a result of the "on demand" memory fetch policy of most caches. This policy fetches data into the cache from main memory only after the processor has requested a word and found it absent from the cache. The situation is illustrated in Figure 2(a), where computation, including memory references satisfied within the cache hierarchy, is represented by the upper time line while main memory access time is represented by the lower time line. In this figure, the data blocks associated with memory references r1, r2, and r3 are not found in the cache hierarchy and must therefore be fetched from main memory. Assuming a simple, in-order execution unit, the processor will be stalled while it waits for the corresponding cache block to be fetched. Once the data returns from main memory, it is cached and forwarded to the processor, where computation may again proceed.

Note that this fetch policy will always result in a cache miss for the first access to a cache block, since only previously accessed data are stored in the cache. Such cache misses are known as cold start or compulsory misses. Also, if the referenced data is part of a large array operation, it is likely that the data will be replaced after its use to make room for new array elements being streamed into the cache. When the same data block is needed later, the processor must again bring it in from main memory, incurring the full main memory access latency. This is called a capacity miss.

Many of these cache misses can be avoided if we augment the demand fetch policy of the cache with a data prefetch operation. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. This prefetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache. Ideally, the prefetch will complete just in time for the processor to access the needed data in the cache without stalling the processor.

An increasingly common mechanism for initiating a data prefetch is an explicit fetch instruction issued by the processor. At a minimum, a fetch specifies the address of a data word to be brought into cache space. When the fetch instruction is executed, this address is simply passed on to the memory system without forcing the processor to wait for a response.

Figure 1. System and DRAM performance since 1988. System performance is measured by SPECfp92 and DRAM performance by row access times. All values are normalized to their 1988 equivalents (Source: Internet SPECtable, ftp://ftp.cs.toronto.edu/pub/jdd/spectable).

Figure 2. Execution diagram assuming (a) no prefetching, (b) perfect prefetching, and (c) degraded prefetching.


The cache responds to the fetch in a manner similar to an ordinary load instruction, with the exception that the referenced word is not forwarded to the processor after it has been cached. Figure 2(b) shows how prefetching can be used to improve the execution time of the demand fetch case given in Figure 2(a). Here, the latency of main memory accesses is hidden by overlapping computation with memory accesses, resulting in a reduction in overall run time. This figure represents the ideal case, when prefetched data arrives just as it is requested by the processor.

A less optimistic situation is depicted in Figure 2(c). In this figure, the prefetches for references r1 and r2 are issued too late to avoid processor stalls, although the data for r2 is fetched early enough to realize some benefit. Note that the data for r3 arrives early enough to hide all of the memory latency, but must be held in the processor cache for some period of time before it is used by the processor. During this time, the prefetched data are exposed to the cache replacement policy, and may be evicted from the cache before use. When this occurs, the prefetch is said to be useless because no performance benefit is derived from fetching the block early.

A prematurely prefetched block may also displace data in the cache that is currently in use by the processor, resulting in what is known as cache pollution [Casmira and Kaeli 1995]. Note that this effect should be distinguished from normal cache replacement misses. A prefetch that causes a miss in the cache that would not have occurred if prefetching was not in use is defined as cache pollution. If, however, a prefetched block displaces a cache block which is referenced after the prefetched block has been used, this is an ordinary replacement miss, since the resulting cache miss would have occurred with or without prefetching.

A more subtle side effect of prefetching occurs in the memory system. Note that in Figure 2(a) the three memory requests occur within the first 31 time units of program startup, whereas in Figure 2(b), these requests are compressed into a period of 19 time units. By removing processor stall cycles, prefetching effectively increases the frequency of memory requests issued by the processor. Memory systems must be designed to match this higher bandwidth to avoid becoming saturated and nullifying the benefits of prefetching [Burger 1997]. This can be particularly true for multiprocessors, where bus utilization is typically higher than in single processor systems.

It is also interesting to note that software prefetching can achieve a reduction in run time despite adding instructions into the execution stream. In Figure 3, the memory effects from Figure 2 are ignored and only the computational components of the run time are shown. Here, it can be seen that the three prefetch instructions actually increase the amount of work done by the processor.

Several hardware-based prefetching techniques have also been proposed that do not require the use of explicit fetch instructions. These techniques employ special hardware that monitors the processor in an attempt to infer prefetching opportunities.

Figure 3. Software prefetching overhead.


Although hardware prefetching incurs no instruction overhead, it often generates more unnecessary prefetches than software prefetching. Unnecessary prefetches are more common in hardware schemes because they speculate on future memory accesses without the benefit of compile-time information. If this speculation is incorrect, cache blocks that are not actually needed will be brought into the cache. Although unnecessary prefetches do not affect correct program behavior, they can result in cache pollution, and will consume memory bandwidth.

To be effective, data prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects in the memory system must also be taken into consideration when designing a system that employs a prefetch strategy. Despite these obstacles, data prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse; no single strategy that provides optimal performance has yet been proposed. In the following sections, alternative approaches to prefetching will be examined by comparing their relative strengths and weaknesses.

2. BACKGROUND

Prefetching, in some form, has existed since the mid-1960s. Early studies of cache design [Anacker and Wang 1967] recognized the benefits of fetching multiple words from main memory into the cache. In effect, such block memory transfers prefetch the words surrounding the current reference in hope of taking advantage of the spatial locality of memory references. Hardware prefetching of separate cache blocks was later implemented in the IBM 370/168 and Amdahl 470V [Smith 1978]. Software techniques are more recent. Smith first alluded to this idea in his survey of cache memories [Smith 1982], but at that time doubted its usefulness. Later, Porterfield [1989] proposed the idea of a "cache load instruction" at approximately the same time that Motorola introduced the "touch load" instruction in the 88110, and Intel proposed the use of "dummy read" operations in the i486 [Fu et al. 1989]. Today, nearly all microprocessor ISAs contain explicit instructions designed to bring data into the processor cache hierarchy.

It should be noted that prefetching is not restricted to fetching data from main memory into a processor cache. Rather, it is a generally applicable technique for moving memory objects up in the memory hierarchy before they are actually needed by the processor. Prefetching mechanisms for instructions and file systems are commonly used to prevent processor stalls; for example, see Young and Shekita [1993], or Patterson and Gibson [1994]. For brevity, only techniques that apply to data objects residing in memory will be considered here.

Nonblocking load instructions share many similarities with data prefetching. Like prefetches, these instructions are issued in advance of the data's actual use to take advantage of the parallelism between the processor and memory subsystem. Rather than loading data into the cache, however, the specified word is placed directly into a processor register. Nonblocking loads are an example of a binding prefetch, so named because the value of the prefetched variable is bound to a named location (a processor register, in this case) at the time the prefetch is issued. Although nonblocking loads are not discussed further here, other forms of binding prefetches are examined.

Data prefetching has received considerable attention in the literature as a potential means of boosting performance in multiprocessor systems. This interest stems from a desire to reduce the particularly high memory latencies often found in such systems.


Memory delays tend to be high in multiprocessors due to added contention for shared resources such as a shared bus and memory modules in a symmetric multiprocessor. Memory delays are even more pronounced in distributed-memory multiprocessors, where memory requests may need to be satisfied across an interconnection network. By masking some or all of these significant memory latencies, prefetching can be an effective means of speeding up multiprocessor applications.

Due to this emphasis on prefetching in multiprocessor systems, many of the prefetching mechanisms discussed below have been studied either mostly or exclusively in this context. Because several of these mechanisms may also be effective in single processor systems, multiprocessor prefetching is treated as a separate topic only when the prefetch mechanism is inherent in such systems.

3. SOFTWARE DATA PREFETCHING

Most contemporary microprocessors support some form of fetch instruction that can be used to implement prefetching [Bernstein et al. 1995; Santhanam et al. 1997; Yeager 1996]. The implementation of a fetch can be as simple as a load into a processor register that has been hardwired to zero. Slightly more sophisticated implementations provide hints to the memory system as to how the prefetched block will be used. Such information may be useful in multiprocessors where data can be prefetched in different sharing states, for example.

Although particular implementations will vary, all fetch instructions share some common characteristics. Fetches are nonblocking memory operations and therefore require a lockup-free cache [Kroft 1981] that allows prefetches to bypass other outstanding memory operations in the cache. Prefetches are typically implemented in such a way that fetch instructions cannot cause exceptions. Exceptions are suppressed for prefetches to ensure that they remain an optional optimization feature that does not affect program correctness or initiate large and potentially unnecessary overhead, such as page faults or other memory exceptions.
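As a concrete illustration (not taken from the survey itself), modern compilers expose this kind of nonblocking, non-faulting fetch directly to the programmer; GCC and Clang, for example, provide the __builtin_prefetch intrinsic, which compiles to the target's prefetch instruction or to nothing if the target has none. The loop below is only a sketch; the prefetch distance of 16 elements is an arbitrary illustrative value that would have to be tuned for a real machine.

    #include <stddef.h>

    #define PREFETCH_DISTANCE 16   /* arbitrary choice; must be tuned per machine */

    /* Sum an array while hinting the memory system to start loading data that
     * will be needed a few iterations from now. The second and third arguments
     * to __builtin_prefetch select read access and a temporal-locality hint. */
    double sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
            s += a[i];
        }
        return s;
    }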

The hardware required to implement software prefetching is modest compared to other prefetching strategies. Most of the complexity of this approach lies in the judicious placement of fetch instructions within the target application. The task of choosing where in the program to place a fetch instruction relative to the matching load or store instruction is known as prefetch scheduling.

In practice, it is not possible to precisely predict when to schedule a prefetch so that data arrives in the cache at the moment it will be requested by the processor, as was the case in Figure 2(b). The execution time between the prefetch and the matching memory reference may vary, as will memory latencies. These uncertainties are not predictable at compile time, and therefore require careful consideration when scheduling prefetch instructions in a program.

Fetch instructions may be added by the programmer or by the compiler during an optimization pass. Unlike many optimizations that occur too frequently in a program or are too tedious to implement by hand, prefetch scheduling can often be done effectively by the programmer. Studies indicate that adding just a few prefetch directives to a program can substantially improve performance [Mowry and Gupta 1991]. However, if programming effort is to be kept at a minimum, or if the program contains many prefetching opportunities, compiler support may be required.

Whether hand-coded or automated by a compiler, prefetching is most often used within loops responsible for large array calculations. Such loops provide excellent prefetching opportunities because they are common in scientific codes, exhibit poor cache utilization, and often have predictable array referencing patterns. By establishing these patterns at compile time, fetch instructions can be placed inside loop bodies so that data for a future loop iteration can be prefetched during the current iteration.

As an example of how loop-based prefetching may be used, consider the code segment shown in Figure 4(a). This loop calculates the inner product of two vectors, a and b, in a manner similar to the innermost loop of a matrix multiplication calculation. If we assume a four-word cache block, this code segment will cause a cache miss every fourth iteration. We can attempt to avoid these cache misses by adding the prefetch directives shown in Figure 4(b). Note that this figure is a source code representation of the assembly code that would be generated by the compiler.

This simple approach to prefetching suffers from several problems. First, we need not prefetch every iteration of this loop, since each fetch actually brings four words (one cache block) into the cache. Although the extra prefetch operations are not illegal, they are unnecessary and will degrade performance. Assuming a and b are cache-block-aligned, prefetching should be done only on every fourth iteration. One solution to this problem is to surround the fetch directives with an if condition that tests when i mod 4 = 0 is true. The overhead of such an explicit prefetch predicate, however, would likely offset the benefits of prefetching, and therefore should be avoided. A better solution is to unroll the loop by a factor of r, where r is equal to the number of words to be prefetched per cache block. As shown in Figure 4(c), unrolling a loop involves replicating the loop body r times and increasing the loop stride to r. Note that the fetch directives are not replicated and the index value used to calculate the prefetch address is changed from i+1 to i+r.
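For comparison, the explicit prefetch predicate that the text advises against would look roughly like the sketch below (written in the same fetch() notation as Figure 4); the test is evaluated on every iteration even though it succeeds only once per cache block, which is the source of the overhead.

    for (i = 0; i < N; i++) {
        if (i % 4 == 0) {          /* predicate tested every iteration */
            fetch(&a[i+4]);
            fetch(&b[i+4]);
        }
        ip = ip + a[i]*b[i];
    }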

The code segment given in Figure 4(c) removes most cache misses and unnecessary prefetches, but further improvements are possible. Note that cache misses will occur during the first iteration of the loop, since prefetches are never issued for the initial iteration. Unnecessary prefetches will occur in the last iteration of the unrolled loop, where the fetch commands attempt to access data past the loop index boundary. Both of the above problems can be remedied by using software pipelining techniques, as shown in Figure 4(d).

(a) for (i = 0; i < N; i++)
        ip = ip + a[i]*b[i];

(b) for (i = 0; i < N; i++) {
        fetch(&a[i+1]);
        fetch(&b[i+1]);
        ip = ip + a[i]*b[i];
    }

(c) for (i = 0; i < N; i+=4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        ip = ip + a[i]*b[i];
        ip = ip + a[i+1]*b[i+1];
        ip = ip + a[i+2]*b[i+2];
        ip = ip + a[i+3]*b[i+3];
    }

(d) fetch(&ip);
    fetch(&a[0]);
    fetch(&b[0]);
    for (i = 0; i < N-4; i+=4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        ip = ip + a[i]*b[i];
        ip = ip + a[i+1]*b[i+1];
        ip = ip + a[i+2]*b[i+2];
        ip = ip + a[i+3]*b[i+3];
    }
    for ( ; i < N; i++)
        ip = ip + a[i]*b[i];

Figure 4. Inner product calculation using (a) no prefetching, (b) simple prefetching, (c) prefetching with loop unrolling, and (d) software pipelining.


In this figure, we extracted select code segments out of the loop body and placed them on either side of the original loop. Fetch statements have been prepended to the main loop to prefetch data for the first iteration of the main loop, including ip. This segment of code is referred to as the loop prolog. An epilog is added to the end of the main loop to execute the final inner product computations without initiating any unnecessary prefetch instructions.

The code given in Figure 4 is said to cover all loop references because each reference is preceded by a matching prefetch. However, one final refinement may be necessary to make these prefetches effective. The examples in Figure 4 were written with the implicit assumption that prefetching one iteration ahead of the data's actual use is sufficient to hide the latency of main memory accesses. This may not be the case. Although early studies [Callahan et al. 1991] were based on this assumption, Klaiber and Levy [1991] recognized that this was not a sufficiently general solution. When loops contain small computational bodies, it may be necessary to initiate prefetches d iterations before the data is referenced, where the prefetch distance d is given by

    d = ⌈l/s⌉

where l is the average memory latency, measured in processor cycles, and s is the estimated cycle time of the shortest possible execution path through one loop iteration, including the prefetch overhead. By choosing the shortest execution path through one loop iteration and using the ceiling operator, this calculation is designed to err on the conservative side, and thus increase the likelihood that prefetched data will be cached before it is requested by the processor.

Returning to the main loop in Figure 4(d), let us assume an average miss latency of 100 processor cycles and a loop iteration time of 45 cycles, giving a prefetch distance of d = ⌈100/45⌉ = 3 iterations (12 array elements), so that the final inner product loop transformation is as shown in Figure 5.

The loop transformations outlined above are fairly mechanical and, with some refinements, can be applied recursively to nested loops. Sophisticated compiler algorithms based on this approach have been developed, with varying degrees of success, to automatically add fetch instructions during an optimization pass of a compiler [Mowry et al. 1992]. Bernstein et al. [1995] measured the run-times of 12 scientific benchmarks both with and without the use of prefetching on a PowerPC 601-based system.

fetch(&ip);
for (i = 0; i < 12; i += 4) {      /* prolog - prefetching only */
    fetch(&a[i]);
    fetch(&b[i]);
}
for (i = 0; i < N-12; i += 4) {    /* main loop - prefetching and computation */
    fetch(&a[i+12]);
    fetch(&b[i+12]);
    ip = ip + a[i]*b[i];
    ip = ip + a[i+1]*b[i+1];
    ip = ip + a[i+2]*b[i+2];
    ip = ip + a[i+3]*b[i+3];
}
for ( ; i < N; i++)                /* epilog - computation only */
    ip = ip + a[i]*b[i];

Figure 5. Final inner product loop transformation.


Prefetching typically improved run-times by less than 12%, although one benchmark ran 22% faster and three others actually ran slightly slower due to prefetch instruction overhead. Santhanam et al. [1997] found that six of the ten SPECfp95 benchmark programs ran between 26% and 98% faster on a PA8000-based system when prefetching was enabled. Three of the four remaining SPECfp95 programs showed less than a 7% improvement in run-time, and one program was slowed down by 12%.

Because a compiler must be able to reliably predict memory access patterns, prefetching is normally restricted to loops containing array accesses whose indices are linear functions of the loop indices. Such loops are relatively common in scientific codes, but far less so in general applications. Attempts at establishing similar software prefetching strategies for these applications are hampered by their irregular referencing patterns [Chen et al. 1991; Lipasti et al. 1995; Luk and Mowry 1996]. Given the complex control structures typical of general applications, there is often a limited window in which to reliably predict when a particular datum will be accessed. Moreover, once a cache block has been accessed, there is less of a chance that several successive cache blocks will also be requested when data structures such as graphs and linked lists are used. Finally, the comparatively high temporal locality of many general applications often results in high cache utilization, thereby diminishing the benefit of prefetching.

Even when restricted to well-conformed looping structures, the use of explicit fetch instructions exacts a performance penalty that must be considered when using software prefetching. Fetch instructions add processor overhead not only because they require extra execution cycles, but also because the fetch source addresses must be calculated and stored in the processor. Ideally, this prefetch address should be retained so that it need not be recalculated for the matching load or store instruction. By allocating and retaining register space for the prefetch addresses, however, the compiler will have less register space to allocate to other active variables. The addition of fetch instructions is therefore said to increase register pressure, which, in turn, may result in additional spill code to manage variables "spilled" out to memory due to insufficient register space. The problem is exacerbated when the prefetch distance is greater than one, since this implies either maintaining d address registers to hold multiple prefetch addresses or storing these addresses in memory if the required number of registers is not available.

Comparing the transformed loop in Figure 5 to the original loop, it can be seen that software prefetching also results in significant code expansion which, in turn, may degrade instruction cache performance. Finally, because software prefetching is done statically, it is unable to detect when a prefetched block has been prematurely evicted and needs to be refetched.

4. HARDWARE DATA PREFETCHING

Several hardware prefetching schemes have been proposed that add prefetching capabilities to a system without the need for programmer or compiler intervention. No changes to existing executables are necessary, so instruction overhead is completely eliminated. Hardware prefetching can also take advantage of run-time information to potentially make prefetching more effective.

4.1 Sequential Prefetching

Most (but not all) prefetching schemes are designed to fetch data from main memory into the processor cache in units of cache blocks. It should be noted, however, that multiple word cache blocks are themselves a form of data prefetching. By grouping consecutive memory words into single units, caches exploit the principle of spatial locality to implicitly prefetch data that is likely to be referenced in the near future.

The degree to which large cache blocks can be effective in prefetching data is limited by the ensuing cache pollution effects. That is, as the cache block size increases, so does the amount of potentially useful data displaced from the cache to make room for the new block. In shared-memory multiprocessors with private caches, large cache blocks may also cause false sharing [Lilja 1993], which occurs when two or more processors wish to access different words within the same cache block and at least one of the accesses is a store. Although the accesses are logically applied to separate words, the cache hardware is unable to make this distinction because it operates on whole cache blocks only. Hence the accesses are treated as operations applied to a single object, and cache coherence traffic is generated to ensure that the changes made to a block by a store operation are seen by all processors caching the block. In the case of false sharing, this traffic is unnecessary because only the processor executing the store references the word being written. Increasing the cache block size increases the likelihood of two processors sharing data from the same block, and hence false sharing is more likely to arise.
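To make the false sharing scenario concrete, the sketch below (not from the survey) shows two per-processor counters that would typically fall in one cache block, and a padded variant that forces them into separate blocks; the 64-byte block size is an assumption for illustration only.

    #include <stdio.h>

    #define BLOCK_SIZE 64           /* assumed cache block size */

    struct counters_shared {        /* both counters likely share one cache block; */
        long count_cpu0;            /* stores by different processors then cause   */
        long count_cpu1;            /* coherence traffic even though the words are */
    };                              /* logically independent                        */

    struct counters_padded {        /* padding places each counter in its own block */
        long count_cpu0;
        char pad[BLOCK_SIZE - sizeof(long)];
        long count_cpu1;
    };

    int main(void)
    {
        printf("shared layout: %zu bytes, padded layout: %zu bytes\n",
               sizeof(struct counters_shared), sizeof(struct counters_padded));
        return 0;
    }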

Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one block lookahead (OBL) approach, which initiates a prefetch for block b+1 when block b is accessed. This differs from simply doubling the block size in that the prefetched blocks are treated separately with regard to cache replacement and coherence policies. For example, a large block may contain one word that is frequently referenced and several other words that are not in use. Assuming an LRU replacement policy, the entire block will be retained, even though only a portion of the block's data is actually in use. If this large block is replaced with two smaller blocks, one of them could be evicted to make room for more active data. Similarly, the use of smaller cache blocks reduces the probability that false sharing will occur.

OBL implementations differ depending on what type of access to block b initiates the prefetch of b+1. Smith [1982] summarizes several of these approaches, of which the prefetch-on-miss and tagged prefetch algorithms will be discussed here. The prefetch-on-miss algorithm simply initiates a prefetch for block b+1 whenever an access for block b results in a cache miss. If b+1 is already cached, no memory access is initiated. The tagged prefetch algorithm associates a tag bit with every memory block. This bit is used to detect when a block is demand-fetched or a prefetched block is referenced for the first time. In either of these cases, the next sequential block is fetched.

Smith found that tagged prefetching reduces cache miss ratios in a unified (both instruction and data) cache by between 50% and 90% for a set of trace-driven simulations. Prefetch-on-miss was less than half as effective as tagged prefetching in reducing miss ratios. The reason prefetch-on-miss is less effective is illustrated in Figure 6, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the prefetch-on-miss algorithm is used, but this same access pattern results in only one cache miss when employing a tagged prefetch algorithm.
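The difference between the two OBL variants can be reproduced with a toy model. The sketch below is illustrative only: cached[] and tag[] stand in for real cache state, "fetching" simply marks a block as present, and an eight-block sequential stream is replayed against each policy. It reports four misses for prefetch-on-miss and one for tagged prefetch, consistent with the behavior described above.

    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 16

    /* Toy model of the two OBL policies over a sequential reference stream.
     * cached[b] records whether block b is resident; tag[b] is the tagged-
     * prefetch bit (0 = prefetched but not yet referenced, 1 = referenced). */
    static int cached[NBLOCKS], tag[NBLOCKS], misses;

    static void bring_in(int b, int prefetched)
    {
        if (b < NBLOCKS && !cached[b]) {
            cached[b] = 1;
            tag[b] = prefetched ? 0 : 1;
        }
    }

    static void access_prefetch_on_miss(int b)
    {
        if (!cached[b]) {              /* only a miss triggers a prefetch of b+1 */
            misses++;
            bring_in(b, 0);
            bring_in(b + 1, 1);
        }
    }

    static void access_tagged(int b)
    {
        if (!cached[b]) {              /* demand fetch: prefetch b+1 */
            misses++;
            bring_in(b, 0);
            bring_in(b + 1, 1);
        } else if (tag[b] == 0) {      /* first reference to a prefetched block */
            tag[b] = 1;
            bring_in(b + 1, 1);
        }
    }

    int main(void)
    {
        int b;
        misses = 0; memset(cached, 0, sizeof cached);
        for (b = 0; b < 8; b++) access_prefetch_on_miss(b);
        printf("prefetch-on-miss: %d misses\n", misses);   /* every other block */

        misses = 0; memset(cached, 0, sizeof cached); memset(tag, 0, sizeof tag);
        for (b = 0; b < 8; b++) access_tagged(b);
        printf("tagged prefetch:  %d misses\n", misses);   /* only the first block */
        return 0;
    }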

The HP PA7200 [Chan et al. 1996] serves as an example of a contemporary microprocessor that uses OBL prefetch hardware. The PA7200 implements a tagged prefetch scheme using either a directed or an undirected mode. In the undirected mode, the next sequential line is prefetched. In the directed mode, the prefetch direction (forward or backward) and distance can be determined by the pre/post-increment amount encoded in the load or store instructions. That is, when the contents of an address register are auto-incremented, the cache block associated with a new address is prefetched. Compared to a base case with no prefetching, the PA7200 achieved run-time improvements in the range of 0% to 80% for 10 SPECfp95 benchmark programs [VanderWiel et al. 1997]. Although performance was found to be application-dependent, all but two of the programs ran more than 20% faster when prefetching was enabled.

Note that one shortcoming of the OBL schemes is that the prefetch may not be initiated far enough in advance of the actual use to avoid a processor memory stall. A sequential access stream resulting from a tight loop, for example, may not allow sufficient lead time between the use of block b and the request for block b+1. To solve this problem, it is possible to increase the number of blocks prefetched after a demand fetch from one to K, where K is known as the degree of prefetching. Prefetching K > 1 subsequent blocks aids the memory system in staying ahead of rapid processor requests for sequential data blocks. As each prefetched block b is accessed for the first time, the cache is interrogated to check if blocks b+1, ..., b+K are present in the cache and, if not, the missing blocks are fetched from memory.

Figure 6. Three forms of sequential prefetching: (a) prefetch-on-miss, (b) tagged prefetch, and (c) sequential prefetching with K = 2.


Note that when K = 1, this scheme is identical to tagged OBL prefetching.

Although increasing the degree of prefetching reduces miss rates in sections of code that show a high degree of spatial locality, additional traffic and cache pollution are generated by sequential prefetching during program phases that show little spatial locality. Przybylski [1990] found that this overhead tends to make sequential prefetching infeasible for values of K larger than one.

Dahlgren et al. [1993] proposed an adaptive sequential prefetching policy that allows the value of K to vary during program execution in such a way that K is matched to the degree of spatial locality exhibited by the program at a particular point in time. To do this, a prefetch efficiency metric is periodically calculated by the cache as an indication of the current spatial locality characteristics of the program. Prefetch efficiency is defined to be the ratio of useful prefetches to total prefetches, where a useful prefetch occurs whenever a prefetched block results in a cache hit. The value of K is initialized to one, incremented whenever the prefetch efficiency exceeds a predetermined upper threshold, and decremented whenever the efficiency drops below a lower threshold, as shown in Figure 7. Note that if K is reduced to zero, prefetching is effectively disabled. At this point, the prefetch hardware begins to monitor how often a cache miss to block b occurs while block b-1 is cached, and restarts prefetching if the respective ratio of these two numbers exceeds the lower threshold of the prefetch efficiency.
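The adjustment rule itself is small enough to summarize in a few lines. The sketch below is illustrative only: the threshold values are assumptions rather than figures from Dahlgren et al., and a real implementation would maintain the useful/total prefetch counters in the cache controller and invoke the adjustment periodically.

    /* Illustrative sketch of adaptive degree adjustment. */
    #define UPPER_THRESHOLD 0.75   /* assumed value */
    #define LOWER_THRESHOLD 0.40   /* assumed value */

    static int K = 1;              /* current degree of prefetching */

    void adjust_degree(unsigned useful_prefetches, unsigned total_prefetches)
    {
        double efficiency = total_prefetches ?
                            (double)useful_prefetches / total_prefetches : 0.0;
        if (efficiency > UPPER_THRESHOLD)
            K++;                   /* strong spatial locality: look further ahead */
        else if (efficiency < LOWER_THRESHOLD && K > 0)
            K--;                   /* weak locality: back off; K == 0 disables prefetching */
    }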

Simulations of a shared memory multiprocessor found that adaptive prefetching could achieve appreciable reductions in cache miss ratios over tagged prefetching. However, simulated run-time comparisons show only slight differences between the two schemes. The lower miss ratio of adaptive sequential prefetching was partially nullified by the associated overhead of increased memory traffic and contention.

Figure 7. Sequential adaptive prefetching.

Jouppi [1990] proposed an approach where K prefetched blocks are brought into a FIFO stream buffer before being brought into the cache. As each buffer entry is referenced, it is brought into the cache while the remaining blocks are moved up in the queue and a new block is prefetched into the tail position. Note that since prefetched data are not placed directly into the cache, this scheme avoids any cache pollution. However, if a miss occurs in the cache and the desired block is also not found at the head of the stream buffer, the buffer is flushed. Therefore, prefetched blocks must be accessed in the order they are brought into the buffer for stream buffers to provide a performance benefit.
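A single stream buffer of this kind can be modeled in a few lines. The sketch below is illustrative only: DEPTH is an arbitrary buffer depth standing in for K, prefetch_block() is a stub, and main() replays ten sequential block misses, of which all but the first are satisfied by the buffer.

    #include <stdio.h>

    #define DEPTH 4                     /* arbitrary buffer depth (the degree K) */

    static unsigned long buf[DEPTH];    /* prefetched block addresses; buf[0] is the head */
    static int valid;                   /* number of valid entries */

    static void prefetch_block(unsigned long b) { (void)b; /* stand-in for a real prefetch */ }

    static void allocate_stream(unsigned long miss_block)
    {
        for (valid = 0; valid < DEPTH; valid++) {
            buf[valid] = miss_block + 1 + valid;
            prefetch_block(buf[valid]);
        }
    }

    /* Called on a primary cache miss for block b; returns 1 if the stream buffer
     * supplies the block, 0 if main memory must be accessed. */
    static int stream_buffer_lookup(unsigned long b)
    {
        if (valid && buf[0] == b) {     /* hit at the head: shift up, prefetch a new tail */
            for (int i = 0; i < DEPTH - 1; i++)
                buf[i] = buf[i + 1];
            buf[DEPTH - 1] = buf[DEPTH - 2] + 1;
            prefetch_block(buf[DEPTH - 1]);
            return 1;
        }
        allocate_stream(b);             /* miss anywhere else: flush and reallocate */
        return 0;
    }

    int main(void)
    {
        int hits = 0;
        for (unsigned long b = 100; b < 110; b++)
            hits += stream_buffer_lookup(b);
        printf("stream buffer satisfied %d of 10 sequential misses\n", hits);
        return 0;
    }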

Palacharla and Kessler [1994] studied stream buffers as a cost-effective alternative to large secondary caches. When a primary cache miss occurs, one of several stream buffers is allocated to service the new reference stream. Stream buffers are allocated in LRU order and a newly allocated buffer immediately fetches the next K blocks following the missed block into the buffer. Palacharla and Kessler found that eight stream buffers and K = 2 provided adequate performance in their simulation study. With these parameters, stream buffer hit rates (the percentage of primary cache misses that are satisfied by the stream buffers) typically fell between 50% and 90%.

However, memory bandwidth requirements were found to increase sharply as a result of the large number of unnecessary prefetches generated by the stream buffers. To help mitigate this effect, a small history buffer is used to record the most recent primary cache misses. When this history buffer indicates that misses have occurred for both block b and block b+1, a stream is allocated and blocks b+2, ..., b+K+1 are prefetched into the buffer. Using this more selective stream allocation policy, bandwidth requirements were reduced at the expense of some slightly reduced stream buffer hit rates. The stream buffers described by Palacharla and Kessler were found to provide an economical alternative to large secondary caches, and were eventually incorporated into the Cray T3E multiprocessor [Oberlin et al. 1996].

In general, sequential prefetching techniques require no changes to existing executables and can be implemented with relatively simple hardware. However, compared to software prefetching, sequential hardware prefetching performs poorly when nonsequential memory access patterns are encountered. Scalar references or array accesses with large strides can result in unnecessary prefetches because these types of access patterns do not exhibit the spatial locality upon which sequential prefetching is based. To enable prefetching of strided and other irregular data access patterns, several more elaborate hardware prefetching techniques have been proposed.

4.2 Prefetching with Arbitrary Strides

Several techniques have been proposed that employ special logic to monitor the processor's address referencing pattern to detect constant stride array references originating from looping structures [Baer and Chen 1991; Fu et al. 1992; Sklenar 1992]. This is accomplished by comparing successive addresses used by load or store instructions. Chen and Baer's [1995] technique is perhaps the most aggressive proposed thus far. To illustrate its design, assume a memory instruction, mi, references addresses a1, a2, and a3 during three successive loop iterations. Prefetching for mi will be initiated if

    (a2 - a1) = Δ ≠ 0

where Δ is now assumed to be the stride of a series of array accesses. The first prefetch address will then be A3 = a2 + Δ, where A3 is the predicted value of the observed address a3. Prefetching continues in this way until An ≠ an.

Note that this approach requires the previous address used by a memory instruction to be stored along with the last detected stride, if any. Recording the reference histories of every memory instruction in the program is clearly impossible. Instead, a separate cache called the reference prediction table (RPT) holds this information for only the most recently used memory instructions. The organization of the RPT is given in Figure 8. Table entries contain the address of the memory instruction, the previous address accessed by this instruction, a stride value for those entries that have established a stride, and a state field that records the entry's current state. The state diagram for RPT entries is given in Figure 9.

The RPT is indexed by the CPU’sprogram counter (PC). When memoryinstruction mi is executed for the firsttime, an entry for it is made in the RPTwith the state set to initial, signifyingthat no prefetching is yet initiated forthis instruction. If mi is executed againbefore its RPT entry has been evicted, astride value is calculated by subtractingthe previous address stored in the RPTfrom the current effective address. Toillustrate the functionality of the RPT,consider the matrix multiply code andassociated RPT entries given in Figure 10.

In this example, only the load instructions for arrays a, b, and c are considered, and it is assumed that the arrays begin at addresses 10000, 20000, and 30000, respectively. For simplicity, one-word cache blocks are also assumed. After the first iteration of the innermost loop, the state of the RPT is as given in Figure 10(b), where instruction addresses are represented by their pseudocode mnemonics. Since the RPT does not yet contain entries for these instructions, the stride fields are initialized to zero and each entry is placed in an initial state. All three references result in a cache miss.

After the second iteration, strides are computed as shown in Figure 10(c). The entries for the array references to b and c are placed in a transient state because the newly computed strides do not match the previous stride. This state indicates that an instruction's referencing pattern may be in transition, and a tentative prefetch is issued for the block at address (effective address + stride) if it is not already cached. The RPT entry for the reference to array a is placed in a steady state because the previous and current strides match. Since this entry's stride is zero, no prefetching will be issued for this instruction.

Figure 8. The organization of the reference prediction table. Each entry is indexed by the PC and contains an instruction tag, the previous effective address, a stride, and a state field, from which a prefetch address is computed.

During the third iteration, the entries for array references b and c move to the steady state when the tentative strides computed in the previous iteration are confirmed. The prefetches issued during the second iteration result in cache hits for the b and c references, provided that a prefetch distance of one is sufficient.

(a) float a[100][100], b[100][100], c[100][100];
    ...
    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
            for (k = 0; k < 100; k++)
                a[i][j] += b[i][k] * c[k][j];

(b) Tag          Previous Address   Stride   State
    ld b[i][k]   20,000             0        initial
    ld c[k][j]   30,000             0        initial
    ld a[i][j]   10,000             0        initial

(c) Tag          Previous Address   Stride   State
    ld b[i][k]   20,004             4        transient
    ld c[k][j]   30,400             400      transient
    ld a[i][j]   10,000             0        steady

(d) Tag          Previous Address   Stride   State
    ld b[i][k]   20,008             4        steady
    ld c[k][j]   30,800             400      steady
    ld a[i][j]   10,000             0        steady

Figure 10. The RPT during execution of matrix multiply.
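The per-entry bookkeeping just described can be expressed compactly in software. The sketch below is only an approximation of Chen and Baer's hardware: the table size, the direct-mapped indexing, the prefetch_block() stub, and the exact transition taken on a misprediction from the steady state are simplifying assumptions, and the prefetch distance is fixed at one iteration.

    #define RPT_ENTRIES 64                 /* illustrative table size */

    enum rpt_state { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

    struct rpt_entry {
        unsigned long  tag;                /* PC of the load/store instruction */
        unsigned long  prev_addr;          /* previous effective address       */
        long           stride;
        enum rpt_state state;
    };

    static struct rpt_entry rpt[RPT_ENTRIES];

    static void prefetch_block(unsigned long addr) { (void)addr; /* stub */ }

    /* Called for every executed load/store with its PC and effective address. */
    void rpt_access(unsigned long pc, unsigned long addr)
    {
        struct rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];

        if (e->tag != pc) {                /* first encounter: allocate in the initial state */
            e->tag = pc; e->prev_addr = addr; e->stride = 0; e->state = INITIAL;
            return;
        }

        long new_stride = (long)(addr - e->prev_addr);
        int  correct    = (new_stride == e->stride);

        switch (e->state) {                /* simplified transition rules */
        case INITIAL:       e->state = correct ? STEADY : TRANSIENT;        break;
        case TRANSIENT:     e->state = correct ? STEADY : NO_PREDICTION;    break;
        case STEADY:        e->state = correct ? STEADY : INITIAL;          break;
        case NO_PREDICTION: e->state = correct ? TRANSIENT : NO_PREDICTION; break;
        }
        if (!correct)
            e->stride = new_stride;        /* update the tentative stride */
        e->prev_addr = addr;

        /* Prefetch one iteration ahead: tentatively in the transient state,
         * and whenever the stride is nonzero in the steady state. */
        if (e->stride != 0 && (e->state == STEADY || e->state == TRANSIENT))
            prefetch_block(addr + e->stride);
    }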

From the above discussion, it can be seen that the RPT improves upon sequential policies by correctly handling strided array references. However, as described above, the RPT still limits the prefetch distance to one loop iteration. To remedy this shortcoming, a distance field may be added to the RPT, which specifies the prefetch distance explicitly. Prefetch addresses would then be calculated as

    effective address + (stride × distance)

The addition of the distance field requires some method of establishing its value for a given RPT entry. To calculate an appropriate value, Chen and Baer decouple the maintenance of the RPT from its use as a prefetch engine. The RPT entries are maintained under the direction of the PC, as described above, but prefetches are initiated separately by a pseudo program counter, called the lookahead program counter (LA-PC), which is allowed to precede the PC. The difference between the PC and the LA-PC is then the prefetch distance, d. Several implementation issues arise with the addition of the lookahead program counter; the interested reader is referred to Baer and Chen [1991].

Chen and Baer [1994] compared RPT prefetching to Mowry's software prefetching scheme [Mowry et al. 1992], and found that neither method showed consistently better performance on a simulated shared memory multiprocessor. Instead, it was found that performance depends on the individual program characteristics of the four benchmark programs upon which the study is based. Software prefetching was found to be more effective with certain irregular access patterns for which an indirect reference is used to calculate a prefetch address. The RPT may not be able to establish an access pattern for an instruction that uses an indirect address because the instruction may generate effective addresses that are not separated by a constant stride. Also, the RPT is less efficient at the beginning and end of a loop. Prefetches are issued by the RPT only after an access pattern has been established. This means that no prefetches will be issued for array data for at least the first two iterations. Chen and Baer also note that it may take several iterations for the RPT to achieve a prefetch distance that completely masks memory latency when the LA-PC was used.

Figure 9. State transition graph for reference prediction table entries. Transitions occur on correct stride predictions, incorrect stride predictions, and incorrect predictions with a stride update. The states are: initial (start state, no prefetching), transient (stride in transition, tentative prefetch), steady (constant stride, prefetch if stride ≠ 0), and no prediction (stride indeterminate, no prefetching).


Finally, the RPT will always prefetch past array bounds, because an incorrect prediction is necessary to stop subsequent prefetching. However, during loop steady state, the RPT is able to dynamically adjust its prefetch distance to achieve a better overlap with memory latency than the software scheme for some array access patterns. Also, software prefetching incurs instruction overhead resulting from prefetch address calculation, fetch instruction execution, and spill code.

Dahlgren and Stenstrom [1995] compared tagged and RPT prefetching in the context of a distributed shared memory multiprocessor. By examining the simulated run-time behavior of six benchmark programs, it was concluded that RPT prefetching showed limited performance benefits over tagged prefetching, which tends to perform as well or better for the most common memory access patterns. Dahlgren showed that most array strides are less than the block size, and therefore covered by the tagged prefetch policy. In addition, it was found that some scalar references show a limited amount of spatial locality that could be captured by the tagged prefetch policy but not by the RPT mechanism. If memory bandwidth is limited, however, it is conjectured that the more conservative RPT prefetching mechanism may be preferable, since it tends to produce fewer useless prefetches.

As with software prefetching, the majority of hardware prefetching mechanisms focus on very regular array referencing patterns. There are some notable exceptions, however. For example, Harrison and Mehrotra [1994] propose extensions to the RPT mechanism that allow for the prefetching of data objects connected via pointers. This approach adds fields to the RPT that enable the detection of indirect reference strides arising from structures such as linked lists and sparse matrices.

Another hardware mechanism designed to prefetch data structures connected via pointers was described by Roth et al. [1998]. Like the RPT and its derivatives, this scheme uses a hardware table to log the most recently executed load instructions. However, this table is used to detect dependencies between load instructions rather than establishing reference patterns for single instructions. More specifically, this table records the data values loaded by the most recently executed load instructions and detects instances of these values being used as base addresses for subsequent loads. Such address dependencies are common in linked list processing, i.e., where the next pointer of one list element is used as the base address for the fields of the subsequent list element, as shown in Figure 11. Once the hardware table establishes these dependencies, prefetches are triggered by the execution of those load instructions that produce base addresses. For example, once the address of p->next is known, p->next->field_1 and p->next->field_2 can be prefetched. A prefetch of p->next->next can be used to initiate further prefetching. However, it was found that such aggressive prefetching is generally not useful because the relatively long processing time required for each pointer-connected element is sufficient to hide the latency of the prefetches.
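A software analogue of the idea makes the dependence explicit. The sketch below is illustrative only; it issues the prefetch by hand with the GCC/Clang __builtin_prefetch intrinsic, whereas the hardware scheme of Roth et al. detects the p->next dependence automatically and needs no such hints.

    struct node {
        struct node *next;
        int          field_1;
        int          field_2;
    };

    long sum_fields(const struct node *p)
    {
        long sum = 0;
        while (p) {
            if (p->next)
                __builtin_prefetch(p->next);   /* start fetching the next element ... */
            sum += p->field_1 + p->field_2;    /* ... while the current one is processed */
            p = p->next;
        }
        return sum;
    }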

Alexander and Kedem [1996] proposed a prefetch mechanism based on the observation that a cache miss to a given address tends to be followed by a miss that belongs to a relatively predictable subset of addresses. To take advantage of this property, a hardware table is used to associate the current cache miss address with a set of likely successor cache miss addresses. That is, when a cache miss occurs, the address of the missed block is used to index the table to find the most likely block addresses for the next cache miss, based on previous observations. Prefetches are then issued for these blocks. Rather than prefetching data into the processor cache, however, data blocks are prefetched into SRAM buffers integrated onto the DRAM memory chips. When a cache miss has been correctly predicted by the prefetch mechanism, the corresponding data can be read directly from the SRAM buffers, thereby avoiding the relatively time-consuming DRAM access. Data are not brought into the processor cache hierarchy because this scheme prefetches large amounts of data in order to exploit the high on-chip bandwidth that exists between the integrated DRAM arrays and SRAM buffers. Alexander and Kedem found that prefetch units in the range of 8 to 64 cache blocks gave the best performance for a benchmark suite of scientific and engineering applications. Prefetching up to four such units on each cache miss yields an average prediction accuracy of 88%.

[Figure 11: diagram of a linked list referenced through a pointer p, in which each element contains a Next pointer and two data fields, Field_1 and Field_2; each element's Next pointer supplies the base address for the fields of the following element.]


Prefetching up to four such units on each cache miss yields an average prediction accuracy of 88%.

Using a similar approach, Joseph and Grunwald [1997] studied the use of a Markov predictor to drive a hardware data prefetch mechanism. This mechanism attempts to predict when a previous pattern of misses has begun to repeat itself by dynamically recording sequences of cache miss references in a hardware table similar to that used in Alexander and Kedem [1996]. When the current cache miss address is found in the table, prefetches for likely subsequent misses are issued to a prefetch request queue. To prevent cache pollution and wasted memory bandwidth, prefetch requests may be displaced from this queue by requests that belong to reference sequences with a higher probability of occurring in the near future. This probability is determined by the strength of the Markov chain associated with a given reference.
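
A minimal sketch of the common structure behind these miss-correlation predictors (the successor table of Alexander and Kedem and the Markov table described above) is given below. The table geometry, block size, replacement rule, and issue_prefetch stub are illustrative assumptions, and the priority queue used to displace weaker requests is omitted.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_ENTRIES 256      /* assumed number of table entries      */
#define SUCCESSORS    4        /* successor addresses kept per entry   */
#define BLOCK_SHIFT   6        /* assumed 64-byte cache blocks         */

struct miss_entry {
    uint64_t miss_addr;              /* block address of a past miss        */
    uint64_t succ_addr[SUCCESSORS];  /* next-miss addresses seen after it   */
    unsigned succ_count[SUCCESSORS]; /* frequency: the transition strength  */
    int      valid;
};

static struct miss_entry table[TABLE_ENTRIES];
static uint64_t last_miss;
static int      have_last;

/* Stand-in for handing a block address to the memory system. */
static void issue_prefetch(uint64_t block_addr)
{
    printf("prefetch block 0x%llx\n", (unsigned long long)block_addr);
}

static struct miss_entry *lookup(uint64_t block_addr)
{
    return &table[(block_addr >> BLOCK_SHIFT) % TABLE_ENTRIES];
}

/* Called on every cache miss with the missed block address. */
void on_cache_miss(uint64_t block_addr)
{
    /* 1. Record this miss as a successor of the previous miss,
          strengthening the corresponding transition. */
    if (have_last) {
        struct miss_entry *prev = lookup(last_miss);
        if (!prev->valid || prev->miss_addr != last_miss) {
            memset(prev, 0, sizeof(*prev));
            prev->miss_addr = last_miss;
            prev->valid = 1;
        }
        int slot = 0;
        for (int i = 0; i < SUCCESSORS; i++) {
            if (prev->succ_addr[i] == block_addr) { slot = i; break; }
            if (prev->succ_count[i] < prev->succ_count[slot]) slot = i;
        }
        if (prev->succ_addr[slot] != block_addr) {   /* replace weakest */
            prev->succ_addr[slot] = block_addr;
            prev->succ_count[slot] = 0;
        }
        prev->succ_count[slot]++;
    }

    /* 2. Use the current miss to predict and prefetch its likely successors. */
    struct miss_entry *cur = lookup(block_addr);
    if (cur->valid && cur->miss_addr == block_addr)
        for (int i = 0; i < SUCCESSORS; i++)
            if (cur->succ_count[i] > 0)
                issue_prefetch(cur->succ_addr[i]);

    last_miss = block_addr;
    have_last = 1;
}
```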

Rather than triggering data prefetches on memory instructions, some researchers [Chang et al. 1994; Lui and Kaeli 1996] proposed triggering on branch instructions stored in a branch target buffer (BTB) [Smith 1981]. When a branch instruction is executed, its entry in the BTB is inspected to predict not only the branch target address but also the data that will be consumed after the predicted branch is taken. One advantage of this approach is that several memory operations can be associated with a single branch target, thereby reducing the amount of tag space required to support prefetching. The entries in the BTB that control prefetching are similar to the contents of an RPT entry, with fields for the previous data address, a stride, and the state of the entry. These fields are updated in a fashion similar to the RPT if the branch prediction is correct and the corresponding memory instructions are executed. Studies of such branch-predicted prefetching indicate that no more than three prefetch entries per BTB entry are required to yield close to the maximum benefit of this approach. Simulations of a processor using branch-predicted prefetching with three prefetch entries per BTB entry found that miss rates improved by between 11.8% and 63.9% for six SPECint92 benchmarks.
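
The sketch below suggests how the per-BTB-entry prefetch state described above might be represented and updated. The three-entry limit follows the studies cited, but the field names, the state machine, and the update rule are illustrative assumptions modeled loosely on the RPT rather than details from the original proposals.

```c
#include <stdint.h>
#include <stdio.h>

#define PREFETCHES_PER_BTB 3   /* per the studies cited above */

enum pf_state { PF_INITIAL, PF_TRANSIENT, PF_STEADY };   /* RPT-like states */

struct btb_prefetch {
    uint64_t      prev_addr;   /* last address touched by the associated load */
    int64_t       stride;      /* difference between successive accesses      */
    enum pf_state state;
};

struct btb_entry {
    uint64_t            branch_pc;   /* tag for the branch instruction */
    uint64_t            target_pc;   /* predicted branch target        */
    struct btb_prefetch pf[PREFETCHES_PER_BTB];
};

/* Stand-in for handing an address to the memory system. */
static void issue_prefetch(uint64_t addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* When a branch hits in the BTB and is predicted taken, each associated
   data reference that has settled into a steady stride is prefetched one
   stride ahead of its last observed address. */
void on_btb_hit(struct btb_entry *e)
{
    for (int i = 0; i < PREFETCHES_PER_BTB; i++) {
        struct btb_prefetch *p = &e->pf[i];
        if (p->state == PF_STEADY)
            issue_prefetch(p->prev_addr + p->stride);
    }
}

/* When an associated memory instruction executes (and the branch was
   predicted correctly), the recorded stride is confirmed or corrected,
   much as an RPT entry would be updated. */
void update_btb_prefetch(struct btb_prefetch *p, uint64_t addr)
{
    int64_t observed = (int64_t)(addr - p->prev_addr);
    if (p->state != PF_INITIAL && observed == p->stride) {
        p->state = PF_STEADY;
    } else {
        p->stride = observed;
        p->state  = PF_TRANSIENT;
    }
    p->prev_addr = addr;
}
```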

5. INTEGRATING HARDWARE AND SOFTWARE PREFETCHING

Software prefetching relies exclusively on compile-time analysis to schedule fetch instructions within the user program. In contrast, the hardware techniques discussed thus far infer prefetching opportunities at run-time without any compiler or instruction set support. Noting that each of these approaches has its advantages, some researchers have proposed mechanisms that combine elements of both software and hardware prefetching.

Gornish and Veidenbaum [1994] describe a variation on tagged hardware prefetching in which the degree of prefetching (K) for a particular reference stream is calculated at compile time and passed on to the prefetch hardware. To implement this scheme, a prefetching degree (PD) field is associated with every cache entry. A special fetch instruction is provided that prefetches the specified block into the cache and then sets the tag bit and the value of the PD field of the cache entry holding the prefetched block. The first K blocks of a sequential reference stream are prefetched using this instruction. When a tagged block, b, is demand fetched, the value in its PD field, Kb, is added to the block address to calculate a prefetch address. The PD field of the newly prefetched block is then set to Kb and the tag bit is set. This insures that the appropriate value of K is propagated through the reference stream. Prefetching for nonsequential reference patterns is handled by ordinary fetch instructions.
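
A minimal sketch of this scheme's cache-side behavior appears below, assuming a 64-byte block and a small array-backed metadata table. The helper functions, table size, and the point at which the tag bit is cleared are illustrative assumptions rather than details of the original design.

```c
#include <stdint.h>

#define BLOCK_SIZE 64u          /* assumed cache block size in bytes */
#define META_SLOTS 1024u        /* assumed metadata table size       */

/* Per-block metadata: the tag bit of tagged prefetching plus the
   compiler-supplied prefetch degree (PD) field. */
struct block_meta {
    int      tag_bit;
    unsigned pd;
};

static struct block_meta meta[META_SLOTS];

static struct block_meta *meta_for(uint64_t block_addr)
{
    return &meta[(block_addr / BLOCK_SIZE) % META_SLOTS];
}

/* Stand-in for moving a block toward the cache and stamping its metadata. */
static void prefetch_block(uint64_t block_addr, unsigned pd)
{
    struct block_meta *m = meta_for(block_addr);
    m->tag_bit = 1;
    m->pd      = pd;
    /* The actual data movement is omitted in this sketch. */
}

/* Semantics of the special fetch instruction: the compiler issues it for
   the first K blocks of a sequential stream, passing the degree K it
   computed at compile time. */
void special_fetch(uint64_t block_addr, unsigned K)
{
    prefetch_block(block_addr, K);
}

/* Invoked when a block is demand fetched: if its tag bit is set, the
   stored degree K selects a block K blocks ahead to prefetch, and K is
   propagated to that block's PD field. (Clearing the tag bit on first
   use is an assumption of this sketch.) */
void on_demand_fetch(uint64_t block_addr)
{
    struct block_meta *m = meta_for(block_addr);
    if (m->tag_bit) {
        unsigned K = m->pd;
        prefetch_block(block_addr + (uint64_t)K * BLOCK_SIZE, K);
        m->tag_bit = 0;
    }
}
```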

Zhang and Torrellas [1995] suggest an integrated technique that enables prefetching for irregular data structures. This is accomplished by tagging memory locations in such a way that a reference to one element of a data object initiates a prefetch of either other elements within the referenced object or objects pointed to by the referenced object. Both array elements and data structures connected via pointers can therefore be prefetched. This approach relies on the compiler to initialize the tags in memory, but the actual prefetching is handled by hardware within the memory system.

The use of a programmable prefetch engine was proposed by Chen [1995] as an extension to the reference prediction table described in Section 4.2. Chen's prefetch engine differs from the RPT in that the tag, address, and stride information are supplied by the program rather than being dynamically established in hardware. Entries are inserted into the engine by the program before entering looping structures that can benefit from prefetching. Once programmed, the prefetch engine functions much like the RPT, with prefetches being initiated when the processor's program counter matches one of the tag fields in the prefetch engine.
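
The sketch below illustrates how such a software-programmed engine might be modeled. The entry layout, capacity, and programming call are illustrative assumptions, with __builtin_prefetch standing in for the engine's hardware-issued prefetch.

```c
#include <stdint.h>

#define ENGINE_ENTRIES 8   /* assumed capacity of the engine */

/* One engine entry: unlike an RPT entry, every field is written by software. */
struct engine_entry {
    uint64_t tag;      /* program counter of the load to watch     */
    uint64_t addr;     /* next address to prefetch                  */
    int64_t  stride;   /* constant stride supplied by the compiler  */
    int      valid;
};

static struct engine_entry engine[ENGINE_ENTRIES];

/* Software side: called before entering a loop that will benefit. */
void engine_program(int slot, uint64_t load_pc, uint64_t start_addr, int64_t stride)
{
    engine[slot].tag    = load_pc;
    engine[slot].addr   = start_addr;
    engine[slot].stride = stride;
    engine[slot].valid  = 1;
}

/* Hardware side: on each instruction, the engine compares the program
   counter against its tags; a match triggers a prefetch and the entry
   advances by its stride, as the RPT would after a confirmed access. */
void engine_observe_pc(uint64_t pc)
{
    for (int i = 0; i < ENGINE_ENTRIES; i++) {
        if (engine[i].valid && engine[i].tag == pc) {
            __builtin_prefetch((const void *)(uintptr_t)engine[i].addr, 0, 1);
            engine[i].addr += (uint64_t)engine[i].stride;
        }
    }
}
```

Before a loop that strides through an array a with element size s, the compiler would emit a call along the lines of engine_program(0, pc_of_the_load, (uint64_t)&a[0], s) ahead of the loop body; the slot number and helper names here are hypothetical.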

VanderWiel and Lilja [1999] propose a prefetch engine that is external to the processor. The engine is a simple processor that executes its own program to prefetch data for the CPU. Through a shared second-level cache, a producer-consumer relationship is established in which the engine prefetches new data blocks into the cache, but only after previously prefetched data is accessed by the processor. The processor also partially directs the actions of the prefetch engine by writing control information to memory-mapped registers within the prefetch engine's support logic.
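
The register interface is not specified in the description above, so the following sketch is purely hypothetical; it only illustrates the style of interaction in which the CPU points the engine at its prefetch program and then reports consumption of prefetched blocks so the engine stays just ahead of the processor.

```c
#include <stdint.h>

/* Entirely hypothetical register layout; the proposal's actual
   memory-mapped interface is not described in the text above. */
struct pe_regs {
    volatile uint64_t program_base;  /* address of the engine's prefetch program */
    volatile uint64_t start;         /* write nonzero to start the engine        */
    volatile uint64_t consumed;      /* CPU reports blocks it has consumed       */
};

#define PE ((struct pe_regs *)0x40000000UL)   /* assumed mapping address */

/* The CPU points the engine at its program and starts it. */
static inline void pe_start(uint64_t program_addr)
{
    PE->program_base = program_addr;
    PE->start = 1;
}

/* Reporting consumption lets the engine prefetch further blocks only
   after previously prefetched data have been used, preserving the
   producer-consumer relationship through the shared second-level cache. */
static inline void pe_mark_consumed(uint64_t block_addr)
{
    PE->consumed = block_addr;
}
```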

These integrated techniques are designed to take advantage of compile-time program information without introducing as much instruction overhead as pure software prefetching. Much of the speculation by pure hardware prefetching is also eliminated, resulting in fewer unnecessary prefetches. Although no commercial systems support this model of prefetching yet, the simulation studies used to evaluate the above techniques indicate that performance can be enhanced over pure software or hardware prefetch mechanisms.

6. PREFETCHING IN MULTIPROCESSORS

In addition to the prefetch mechanisms above, several multiprocessor-specific prefetching techniques have been proposed. Prefetching in these systems differs from uniprocessors for at least three reasons. First, multiprocessor applications are typically written using different programming paradigms than uniprocessors. These paradigms can provide additional array referencing information that enables more accurate prefetch mechanisms. Second, multiprocessor systems frequently contain additional memory hierarchies that provide different sources and destinations for prefetching. Finally, the performance implications of data prefetching can take on added significance in multiprocessors because these systems tend to have higher memory latencies and more sensitive memory interconnects.

Fu and Patel [1991] examined how data prefetching might improve the performance of vectorized multiprocessor applications. This study assumes vector operations are explicitly specified by the programmer and supported by the instruction set. Because the vectorized programs describe computations in terms of a series of vector and matrix operations, no compiler analysis or stride detection hardware is required to establish memory access patterns. Instead, the stride information encoded in vector references is made available to the processor caches and associated prefetch hardware.

Two prefetching policies were studied. The first is a variation upon the prefetch-on-miss policy, in which K consecutive blocks following a cache miss are fetched into the processor cache. This implementation of prefetch-on-miss differs from that presented earlier in that prefetches are issued only for scalars and vector references with a stride less than or equal to the cache block size. The second prefetch policy, referred to as vector prefetching here, is similar to the first policy, with the exception that prefetches for vector references with large strides are also issued. If the vector reference for block b misses in the cache, then blocks b, b + stride, b + (2 × stride), ..., b + (K × stride) are fetched.
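
The address generation of the two policies can be sketched as follows, assuming a 64-byte block, a degree of prefetching K = 4, and a stub issue_prefetch routine; all three are illustrative values rather than parameters taken from the Fu and Patel study.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE 64u   /* assumed cache block size in bytes */
#define K          4     /* assumed degree of prefetching     */

/* Stand-in for handing a block address to the memory system. */
static void issue_prefetch(uint64_t block_addr)
{
    printf("prefetch block 0x%llx\n", (unsigned long long)block_addr);
}

/* Policy 1: prefetch-on-miss restricted to scalars and small strides.
   The K consecutive blocks after the missed block are prefetched, but
   only when the stride encoded in the instruction does not exceed the
   block size; larger strides suppress prefetching entirely. */
void prefetch_on_miss(uint64_t miss_block, uint64_t stride_bytes)
{
    if (stride_bytes > BLOCK_SIZE)
        return;    /* avoid the useless prefetches of the original policy */
    for (unsigned k = 1; k <= K; k++)
        issue_prefetch(miss_block + (uint64_t)k * BLOCK_SIZE);
}

/* Policy 2: vector prefetching also covers large strides by following
   the stride of the vector reference: b, b+stride, ..., b+(K x stride),
   where b itself is supplied by the demand fetch. */
void vector_prefetch(uint64_t miss_block, uint64_t stride_bytes)
{
    for (unsigned k = 1; k <= K; k++)
        issue_prefetch(miss_block + (uint64_t)k * stride_bytes);
}
```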

Fu and Patel found both prefetch policies improve performance over the no-prefetch case on an Alliant FX/8 simulator. Speedups were more pronounced when smaller cache blocks were assumed, since small block sizes limit the amount of spatial locality a nonprefetching cache can capture, while prefetching caches can offset this disadvantage by simply prefetching more blocks. In contrast to other studies, Fu and Patel found both sequential prefetching policies were effective for values of K up to 32. This is in apparent conflict with earlier studies, which found sequential prefetching to degrade performance for K > 1. Much of this discrepancy may be explained by noting how vector instructions are exploited by the prefetching scheme used by Fu and Patel. In the case of prefetch-on-miss, prefetching is suppressed when a large stride is specified by the instruction. This avoids useless prefetches, which degraded the performance of the original policy. Although vector prefetching does issue prefetches for large stride referencing patterns, it is a more precise mechanism than other sequential schemes because it is able to take advantage of stride information provided by the program.

Comparing the two schemes, it was found that applications with large strides benefit the most from vector prefetching, as expected. For programs in which scalar and unit-stride references dominate, the prefetch-on-miss policy tends to perform slightly better. For these programs, the lower miss ratios resulting from the vector prefetching policy are offset by the corresponding increase in bus traffic.

Gornish et al. [1990] examined prefetching in a distributed memory multiprocessor where global and local memory are connected through a multistage interconnection network. Data are prefetched from global to local memory in large, asynchronous block transfers to achieve higher network bandwidth than would be possible with word-at-a-time transfers. Since large amounts of data are prefetched, the data are placed in local memory rather than the processor cache to avoid excessive cache pollution. Some form of software-controlled caching is assumed to be responsible for translating global array addresses to local addresses after the data have been placed in local memory.

As with software prefetching in single-processor systems, loop transformations are performed by the compiler to insert prefetch operations into the user code. However, rather than inserting fetch instructions for individual words within the loop body, entire blocks of memory are prefetched before the loop is entered. Figure 12 shows how this block prefetching may be used with a vector-matrix product calculation. In Figure 12(b), the iterations of the original loop (Figure 12(a)) are partitioned among the NPROC processors of the multiprocessor system so that each processor iterates over 1/NPROCth of a and c. Also note that the array c is prefetched a row at a time. Although it is possible to pull out the prefetch for c so that the entire array is fetched into local memory before entering the outermost loop, it is assumed here that c is very large and a prefetch of the entire array would occupy more local memory than is available.

The block fetches given in Figure 12(b) add processor overhead to the original computation in a manner similar to the software prefetching scheme described earlier. Although the block-oriented prefetch operations require size and stride information, significantly less overhead is incurred than with the word-oriented scheme, since fewer prefetch operations are needed. Assuming equal problem sizes and ignoring prefetches for a, the loop given in Figure 12 generates N + 1 block prefetches, as compared to the 1/2(N + N²) prefetches that would result from applying a word-oriented prefetching scheme.
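
For concreteness (an assumed instantiation, not a figure from the original study): with N = 1000 and a single processor, so that nrows = N, the block-oriented loop issues N + 1 = 1001 prefetch operations, whereas the word-oriented scheme would issue 1/2(N + N²) = 500,500.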

Although a single bulk data transfer is more efficient than dividing the transfer into several smaller messages, the former approach tends to increase network congestion when several such messages are being transferred at once. Combined with the increased request rate that prefetching induces, this network contention can lead to significantly higher average memory latencies. For a set of six numerical benchmark programs, Gornish notes that prefetching increases average memory latency by a factor between 5.3 and 12.7 over the no-prefetch case.

An implication of prefetching into the local memory rather than the cache is that the array a in Figure 12 cannot be prefetched. In general, this scheme requires that all data must be read-only between prefetch and use because no coherence mechanism is provided that allows writes by one processor to be seen by the other processors. Data transfers are also restricted by control dependencies within the loop bodies. If an array reference is predicated by a conditional statement, no prefetching is initiated for the array. This is done for two reasons. First, the conditional may only test true for a subset of the array references, and initiating a prefetch of the entire array would result in the unnecessary transfer of a potentially large amount of data. Second, the conditional may guard against referencing nonexistent data, and initiating a prefetch for such data could result in unpredictable behavior.

Honoring the above data and control dependencies limits the amount of data that can be prefetched. On average, 42% of loop memory references for the six benchmark programs used by Gornish could not be prefetched due to these constraints. Together with the increased average memory latencies, the suppression of these prefetches limited the speedup due to prefetching to less than 1.1 for five of the six benchmark programs.

Mowry and Gupta [1991] studied the effectiveness of software prefetching for the DASH DSM multiprocessor architecture. In this study, two alternative designs are considered.

(a)
for( i=0; i < N; i++){
    for( j=0; j < N; j++)
        a[i] = a[i] + b[j]*c[i][j];
}

(b)
nrows = N/NPROC;
fetch( b[0:N-1]);
for( i=0; i < nrows; i++){
    fetch( c[i][0:nrows-1]);
    for( j=0; j < N; j++)
        a[i] = a[i] + b[j]*c[i][j];
}

Figure 12. Block prefetching for a vector-matrix product calculation.


The first design places prefetched data in a remote access cache (RAC), which lies between the interconnection network and the processor cache hierarchy of each node in the system. The second design alternative simply prefetches data from remote memory directly into the primary processor cache. In both cases, the unit of transfer is a cache block.

The use of a separate prefetch cache such as the RAC was motivated by a desire to reduce contention for the primary data cache. By separating prefetched data from demand-fetched data, a prefetch cache avoids polluting the processor cache and provides more overall cache space. This approach also avoids processor stalls that can result from waiting for prefetched data to be placed in the cache. However, in the case of a remote access cache, only remote memory operations benefit from prefetching, since the RAC is placed on the system bus and access times are approximately equal to those of main memory.

Simulation runs of three scientific benchmarks found that prefetching directly into the primary cache offered the most benefit, with an average speedup of 1.94 compared to an average of 1.70 when the RAC was used. Despite significantly increasing cache contention and reducing overall cache space, prefetching into the primary cache resulted in higher cache hit rates, which proved to be the dominant performance factor. As with software prefetching in single-processor systems, the benefit of prefetching was application-specific. Two array-based programs achieved speedups over the non-prefetch case of 2.53 and 1.99, while the third, less regular, program showed a speedup of 1.30.

7. CONCLUSIONS

Prefetching schemes are diverse. To help categorize a particular approach, it is useful to answer three basic questions concerning the prefetching mechanism: (1) when are prefetches initiated, (2) where are prefetched data placed, and (3) what is prefetched?

When: Prefetches can be initiated either by an explicit fetch operation within a program, by logic that monitors the processor's referencing pattern to infer prefetching opportunities, or by a combination of these approaches. However they are initiated, prefetches must be issued in a timely manner. If a prefetch is issued too early, there is a chance that the prefetched data will displace other useful data from the higher levels of the memory hierarchy or be displaced itself before use. If the prefetch is issued too late, it may not arrive before the actual memory reference and thereby introduce processor stall cycles. Prefetching mechanisms also differ in their precision. Software prefetching issues fetches only for data that is likely to be used, while hardware schemes tend to prefetch data in a more speculative manner.

Where: The decision of where to place prefetched data in the memory hierarchy is a fundamental design decision. Clearly, data must be moved into a higher level of the memory hierarchy to provide a performance benefit. The majority of schemes place prefetched data in some type of cache memory. Other schemes place prefetched data in dedicated buffers to protect the data from premature cache evictions and prevent cache pollution. When prefetched data are placed into named locations, such as processor registers or memory, the prefetch is said to be binding, and additional constraints must be imposed on the use of the data. Finally, multiprocessor systems can introduce additional levels into the memory hierarchy, which must be taken into consideration.

What: Data can be prefetched in units of single words, cache blocks, contiguous blocks of memory, or program data objects. The amount of data fetched is also determined by the organization of the underlying cache and memory system. Cache blocks may be the most appropriate size for uniprocessors and SMPs, while larger memory blocks may be used to amortize the cost of initiating a data transfer across an interconnection network of a large, distributed memory multiprocessor.

These three questions are not independent of each other. For example, if the prefetch destination is a small processor cache, data must be prefetched in a way that minimizes the possibility of polluting the cache. This means that precise prefetches will need to be scheduled shortly before the actual use and the prefetch unit must be kept small. If the prefetch destination is large, the timing and size constraints can be relaxed.

Once a prefetch mechanism has been specified, it is natural to compare it to other schemes. Unfortunately, a comparative evaluation of the various proposed prefetching techniques is hindered by widely varying architectural assumptions and testing procedures. However, some general observations can be made.

The majority of prefetching schemes and studies concentrate on numerical, array-based applications. These programs tend to generate memory access patterns that, although comparatively predictable, do not yield high cache utilization, and thus benefit more from prefetching than general applications do. As a result, automatic techniques that are effective for general programs have received comparatively little attention.

Within the context of array-based referencing patterns, prefetch mechanisms provide varying degrees of coverage, depending on the flexibility of the mechanism. Because unit- or small-stride array referencing patterns are the most common and easily detected, all prefetching schemes capture this type of access pattern. Sequential prefetching techniques concentrate exclusively on such patterns. Although less frequent, large-stride array referencing patterns can result in very poor cache utilization. The design of the RPT is motivated by the desire to capture all constant-stride referencing patterns, including those with large strides. Array referencing patterns that do not have a constant stride or frequently change strides typically cannot be covered by pure hardware techniques. Software prefetching can provide coverage for any arbitrary referencing pattern if the pattern can be detected at compile-time. The array-based integrated techniques discussed thus far are designed to augment existing hardware techniques with software support, but their coverage is limited by the underlying hardware mechanisms.

Flexibility also tends to lend accuracy to a prefetch scheme. Software and integrated techniques avoid many of the unnecessary prefetches characteristic of pure hardware mechanisms, which are more constrained in the prefetch streams they can produce. These unnecessary prefetches can displace active data within the cache with data that is never used by the processor. In addition to causing cache pollution, unnecessary prefetches needlessly expend memory bandwidth, which may already be limited due to the added pressure prefetching naturally places on the memory system.

Prefetch mechanisms also introduce some degree of hardware overhead. All techniques rely on cache hardware that supports a prefetch operation. Most sequential schemes introduce additional cache logic, with the exception of stream buffers, which require hardware that is external to the cache. Some techniques require that logic be added to the processor. Pure software prefetching requires the inclusion of a fetch instruction in the processor's instruction set, while the RPT and its derivatives add a comparatively large amount of logic overhead into the processor.

Although software prefetching has minimal hardware requirements, this technique introduces a significant amount of instruction overhead into the user program. Integrated schemes attempt to strike a balance between pure hardware and software schemes by reducing instruction overhead while still offering better prefetch coverage than pure hardware techniques.

Finally, memory systems must be designed to match the added demands prefetching imposes. Despite a reduction in overall execution time, prefetch mechanisms tend to increase average memory latency by removing processor stall cycles. This effectively increases the memory reference request rate of the processor which, in turn, can introduce congestion within the memory system. This can be a problem, particularly in multiprocessor systems where buses and interconnect networks are shared by several processors.

Despite the many application and system constraints, data prefetching has demonstrated the ability to reduce overall program execution time both in simulation studies and in real systems. Efforts to improve and extend these known techniques to more diverse architectures and applications are an active and promising area of research. The need for new prefetching techniques is likely to continue to be motivated by increasing memory access penalties arising from both the widening gap between microprocessor and memory performance and the use of more complex memory hierarchies.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their many useful suggestions.

REFERENCES

ALEXANDER, T. AND KEDEM, G. 1996. Distributed prefetch-buffer/cache design for high performance memory systems. In Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture. IEEE Press, Piscataway, NJ, 254–263.

ANACKER, W. AND WANG, C. P. 1967. Performance evaluation of computing systems with memory hierarchies. IEEE Trans. Comput. 16, 6, 764–773.

BAER, J.-L. AND CHEN, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 Conference on Supercomputing (Albuquerque, NM, Nov. 18–22), J. L. Martin, Chair. ACM Press, New York, NY, 176–186.

BERNSTEIN, D., COHEN, D., AND FREUND, A. 1995. Compiler techniques for data prefetching on the PowerPC. In Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques (PACT '95, Limassol, Cyprus, June 27–29), L. Bic, P. Evripidou, W. Böhm, and J.-L. Gaudiot, Chairs. IFIP Working Group on Algol, Manchester, UK, 19–26.

BURGER, D., GOODMAN, J. R., AND KÄGI, A. 1997. Limited bandwidth to affect processor design. IEEE Micro 17, 6, 55–62.

CALLAHAN, D., KENNEDY, K., AND PORTERFIELD, A. 1991. Software prefetching. SIGARCH Comput. Arch. News 19, 2 (Apr.), 40–52.

CASMIRA, J. P. AND KAELI, D. R. 1995. Modeling cache pollution. In Proceedings of the Second IASTED Conference on Modeling and Simulation. 123–126.

CHAN, K. K. 1996. Design of the HP PA 7200 CPU. Hewlett-Packard J. 47, 1, 25–33.

CHANG, P. Y., KAELI, D., AND LIU, Y. 1994. Branch-directed data cache prefetching. In Proceedings of the Second Workshop on Shared-Memory Multiprocessing Systems.

CHEN, T.-F. 1995. An effective programmable prefetch engine for on-chip caches. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI, Nov. 29–Dec. 1), T. Mudge and K. Ebcioglu, Chairs. IEEE Computer Society Press, Los Alamitos, CA, 237–242.

CHEN, T.-F. AND BAER, J.-L. 1994. A performance study of software and hardware data prefetching schemes. In Proceedings of the 21st International Symposium on Computer Architecture (Chicago, IL, Apr.). 223–232.

CHEN, T.-F. AND BAER, J.-L. 1995. Effective hardware-based data prefetching for high-performance processors. IEEE Trans. Comput. 44, 5 (May), 609–623.

CHEN, W. Y., MAHLKE, S. A., CHANG, P. P., AND HWU, W.-M. W. 1991. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO 24, Albuquerque, NM, Nov. 18–20), Y. K. Malaiya, Chair. ACM Press, New York, NY, 69–73.

DAHLGREN, F. AND STENSTROM, P. 1995. Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors. In Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (Raleigh, NC, Jan.). IEEE Press, Piscataway, NJ, 68–77.

DAHLGREN, F., DUBOIS, M., AND STENSTROM, P. 1993. Fixed and adaptive sequential prefetching in shared-memory multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, IL). 56–63.

FU, B., SAINI, A., AND GELSINGER, P. P. 1989. Performance and microarchitecture of the i486 processor. In Proceedings of the IEEE International Conference on Computer Design (Cambridge, MA). 182–187.

FU, J. W. C. AND PATEL, J. H. 1991. Data prefetching in multiprocessor vector cache memories. In Proceedings of the 18th International Symposium on Computer Architecture (Toronto, Ont., Canada). 54–63.

FU, J. W. C., PATEL, J. H., AND JANSSENS, B. L. 1992. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25, Portland, OR, Dec. 1–4), W.-m. Hwu, Chair. IEEE Computer Society Press, Los Alamitos, CA, 102–110.

GORNISH, E. H. AND VEIDENBAUM, A. V. 1994. An integrated hardware/software scheme for shared-memory multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, IL). 281–284.

GORNISH, E. H., GRANSTON, E. D., AND VEIDENBAUM, A. V. 1990. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings of the 1990 ACM International Conference on Supercomputing (ICS '90, Amsterdam, The Netherlands, June 11–15), A. Sameh and H. van der Vorst, Chairs. ACM Press, New York, NY, 354–368.

HARRISON, L. AND MEHROTRA, S. 1994. A data prefetch mechanism for accelerating general computation. Tech. Rep. 1351, University of Illinois at Urbana-Champaign, Champaign, IL.

JOSEPH, D. AND GRUNWALD, D. 1997. Prefetching using Markov predictors. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA '97, Denver, CO, June 2–4), A. R. Pleszkun and T. Mudge, Chairs. ACM Press, New York, NY, 252–263.

JOUPPI, N. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture (ISCA '90, Seattle, WA, May). IEEE Press, Piscataway, NJ, 364–373.

KLAIBER, A. C. AND LEVY, H. M. 1991. An architecture for software-controlled data prefetching. SIGARCH Comput. Arch. News 19, 3 (May), 43–53.

KROFT, D. 1981. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th International Symposium on Computer Architecture (Minneapolis, MN, June). ACM Press, New York, NY, 81–85.

LILJA, D. J. 1993. Cache coherence in large-scale shared-memory multiprocessors: Issues and comparisons. ACM Comput. Surv. 25, 3 (Sept.), 303–338.

LIPASTI, M. H., SCHMIDT, W. J., KUNKEL, S. R., AND ROEDIGER, R. R. 1995. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arbor, MI, Nov. 29–Dec. 1), T. Mudge and K. Ebcioglu, Chairs. IEEE Computer Society Press, Los Alamitos, CA, 231–236.

LUI, Y. AND KAELI, D. R. 1996. Branch-directed and stride-based data cache prefetching. In Proceedings of the International Conference on Computer Design (ICCD '96, Austin, TX). IEEE Computer Society Press, Los Alamitos, CA, 225–230.

LUK, C.-K. AND MOWRY, T. C. 1996. Compiler-based prefetching for recursive data structures. ACM SIGOPS Oper. Syst. Rev. 30, 5, 222–233.

MOWRY, T. AND GUPTA, A. 1991. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. J. Parallel Distrib. Comput. 12, 2 (June), 87–106.

MOWRY, T. C., LAM, M. S., AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V, Boston, MA, Oct. 12–15), S. Eggers, Chair. ACM Press, New York, NY, 62–73.

OBERLIN, S., KESSLER, R., SCOTT, S., AND THORSON, G. 1996. Cray T3E architecture overview. Cray Supercomputers, Chippewa Falls, MN.

PALACHARLA, S. AND KESSLER, R. E. 1994. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st International Symposium on Computer Architecture (Chicago, IL, Apr.).

PATTERSON, R. H. AND GIBSON, G. A. 1994. Exposing I/O concurrency with informed prefetching. In Proceedings of the Third International Conference on Parallel and Distributed Information Systems (Austin, TX). 7–16.

PORTERFIELD, A. K. 1989. Software methods for improvement of cache performance on supercomputer applications. Ph.D. Dissertation, Rice University, Houston, TX.

PRZYBYLSKI, S. 1990. The performance impact of block sizes and fetch strategies. In Proceedings of the 17th International Symposium on Computer Architecture (Seattle, WA). 160–169.

ROTH, A., MOSHOVOS, A., AND SOHI, G. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII, San Jose, CA, Oct. 3–7), D. Bhandarkar and A. Agarwal, Chairs. ACM Press, New York, NY.

SANTHANAM, V., GORNISH, E. H., AND HSU, W.-C. 1997. Data prefetching on the HP PA-8000. In Proceedings of the 24th International Symposium on Computer Architecture (ISCA '97, Denver, CO, June 2–4), A. R. Pleszkun and T. Mudge, Chairs. ACM Press, New York, NY, 264–273.

SKLENAR, I. 1992. Prefetch unit for vector operations on scalar computers. In Proceedings of the 19th International Symposium on Computer Architecture (Gold Coast, Qld., Australia). 31–37.

SMITH, A. J. 1978. Sequential program prefetching in memory hierarchies. IEEE Computer 11, 12, 7–21.

SMITH, A. J. 1982. Cache memories. ACM Comput. Surv. 14, 3 (Sept.), 473–530.

SMITH, J. E. 1981. A study of branch prediction techniques. In Proceedings of the 8th International Symposium on Computer Architecture (Minneapolis, MN, June). ACM Press, New York, NY, 135–147.

VANDERWIEL, S. P. AND LILJA, D. J. 1999. A compiler-assisted data prefetch controller. In Proceedings of the International Conference on Computer Design (ICCD '99, Austin, TX).

VANDERWIEL, S. P., HSU, W. C., AND LILJA, D. J. 1997. When caches are not enough: Data prefetching techniques. IEEE Computer 30, 7, 23–27.

YEAGER, K. C. 1996. The MIPS R10000 superscalar microprocessor. IEEE Micro 16, 2 (Apr.), 28–40.

YOUNG, H. C. AND SHEKITA, E. J. 1993. An intelligent I-cache prefetch mechanism. In Proceedings of the International Conference on Computer Design (ICCD '93, Cambridge, MA). IEEE Computer Society Press, Los Alamitos, CA, 44–49.

ZHANG, Z. AND TORRELLAS, J. 1995. Speeding up irregular applications in shared-memory multiprocessors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95, Santa Margherita Ligure, Italy, June 22–24), D. A. Patterson, Chair. ACM Press, New York, NY.

Received: January 1998; revised: March 1999; accepted: April 1999
