
IT 13 048

Examensarbete 30 hp, Juli 2013

Stealing the shared cache for fun and profit

Moncef Mechri

Institutionen för informationsteknologi
Department of Information Technology



Abstract

Stealing the shared cache for fun and profit

Moncef Mechri

Cache pirating is a low-overhead method created by the Uppsala Architecture Research Team (UART) to analyze the effect of sharing a CPU cache among several cores. The cache pirate is a program that will actively and carefully steal a part of the shared cache by keeping its working set in it. The target application can then be benchmarked to see its dependency on the available shared cache capacity. The topic of this Master Thesis project is to implement a cache pirate and use it on Ericsson's systems.

Printed by: Reprocentralen ITC

Sponsor: Ericsson
IT 13 048
Examiner: Ivan Christoff
Subject reviewer: David Black-Schaffer
Supervisor: Erik Berg


Contents

Acronyms

1 Introduction

2 Background information
   2.1 A dive into modern processors
      2.1.1 Memory hierarchy
      2.1.2 Virtual memory
      2.1.3 CPU caches
      2.1.4 Benchmarking the memory hierarchy

3 The Cache Pirate
   3.1 Monitoring the Pirate
      3.1.1 The original approach
      3.1.2 Defeating prefetching
      3.1.3 Timing
   3.2 Stealing evenly from every set
   3.3 How to steal more cache?

4 Results
   4.1 Results from the micro-benchmark
   4.2 SPEC CPU2006 results

5 Conclusion, related and future work
   5.1 Related work

Acknowledgments

List of Figures

Bibliography


Acronyms

LLC Last Level Cache.

LRU Least Recently Used.

OS Operating System.

PIPT Physically-Indexed, Physically-Tagged.

PIVT Physically-Indexed, Virtually-Tagged.

RAM Random Access Memory.

THP Transparent Hugepage.

TLB Translation Lookaside Buffer.

TSC Time-Stamp Counter.

UART Uppsala Architecture Research Team.

VIPT Virtually-Indexed, Physically-Tagged.

VIVT Virtually-Indexed, Virtually-Tagged.


Chapter 1

Introduction

In the past, computer architects and processor manufacturers have used several approaches to increase the performance of their CPUs: Increases in frequency and transistor count, Out-of-Order execution, branch prediction, complex pipelines, speculative execution and other techniques have all been extensively used to improve the performance of our processors.

However, in the mid-2000s, a wall was hit. Architects started to run out of room for improvement using these traditional approaches [1]. In order to satisfy Moore's law, the industry has converged on another approach: Multicore processors. If we cannot noticeably improve the performance of one CPU core, why not have several cores in one CPU?

A multicore processor embeds several processing units in one physical package, thus allowing execution of different execution flows in parallel. Helped by the constant shrinkage in semiconductor fabrication processes, this has led to a transition to multicore processors, where several cores are embedded on a CPU. As of today this transition has been completed, with multicore processors being everywhere, from supercomputers to mainstream personal computers and even mobile phones [2].

This revolution has allowed processor makers to keep producing significantly faster chips. However, this does not come for free: It imposes a greater burden on the programmers in order to efficiently exploit this new computational power, and new problems arose (or were at least greatly amplified). Most of them should be familiar to the reader: False/true sharing, race conditions, costly communications, scalability issues, etc.

Although multicore processors have by definition duplicate execution units, they still tend to share some resources, like the Last Level Cache (LLC) and the memory bandwidth. The contention for these shared resources represents another class of bottlenecks and performance issues that are much less known.

In order to study them, the UART has developed two new methods: Cache pirating [3] and the Bandwidth Bandit [4].

The Cache Pirate is a program that actively and carefully steals a part of the shared cache. In order to know how sensitive a target application is to the amount of shared cache available, it is co-run with the Pirate. By stealing different amounts of shared cache, the performance of the target application can be expressed as a function of the available shared cache.

We will first start by describing how most multicore processors are organized and how caches work, as well as studying the Freescale P4080 processor. We will then introduce the micro-benchmark, a cache and memory benchmark developed during the Thesis. Later, the pirate will be thoroughly described. Finally, experiments and results will be shown.

This work is part of a partnership between UART and Ericsson.


Chapter 2

Background information

2.1 A dive into modern processors

2.1.1 Memory hierarchy

Memory systems are organized as a memory hierarchy. Figure 2.1 shows what a memory hierarchy typically looks like. CPU registers sit on top of this hierarchy. They are the fastest memory available, but they are also very limited in number and size.

CPU caches come next. They are caches that are used by the CPU to reduce accesses to the slower levels of the memory hierarchy (mainly, the main memory). Caches will be explored more thoroughly later.

The main memory, commonly called Random Access Memory (RAM), is the next level in the hierarchy. It is a type of memory that is off-chip and that is connected to the CPU by a bus (AMD HyperTransport, Intel QuickPath Interconnect, etc.). Throughout the years, RAM has become cheaper and bigger, and mainstream devices often have several gigabytes of RAM.

Lower levels of the hierarchy, like the hard drive, are several orders of magnitude slower than even the main memory and will not be explored here.


Figure 2.1: The memory hierarchy (wikipedia.org)

2.1.2 Virtual memory

Virtual memory is a memory management technique used by most systems nowadays which allows every process to see a flat and contiguous address space independent of the underlying physical memory.

The Operating System (OS) maintains in main memory per-process translation tables of virtual-to-physical addresses. The CPU has a special unit called the memory management unit (MMU) which, in cooperation with the OS, is used to perform the translation from virtual to physical addresses.

Since the OS' translation tables live in RAM, accessing them is a slow process. In order to speed up the translation, the CPU maintains a very fast translation cache called the Translation Lookaside Buffer (TLB) that caches recently-resolved addresses. The TLB can be seen as a kind of hash table where the virtual addresses are the keys and the physical addresses the results.

When a translation is required, the TLB is looked up. If a matching entry is found, then we have a TLB-hit and the entry is returned. If the entry is not found, then we have a TLB-miss. The in-memory translation table is then searched, and when a valid entry is found, the TLB is updated and the translation restarted.

Paged virtual memory

Most implementations of virtual memory use pages as the basic unit, the most common page size being 4KB. The memory is divided into chunks, and the OS translation tables are called Page Tables. In such a setup, the TLB does not hold per-byte virtual-to-physical translations. Instead, it keeps per-page translations, and once the right page is found, some lower bits of the virtual address are used as an offset in this page.

TLBs are unfortunately quite small: Most TLBs have a few dozen entries, 64 being a common number, which means that with 4KB pages, only 256KB of memory can be quickly addressed thanks to the TLB. Programs which require more memory will thus likely cause a lot of TLB-misses, forcing the processor to waste time traversing the page table (this process is called a page walk).

Large pages

To reduce this problem, processors can also support other page sizes. For example, modern x86 processors support 2MB, 4MB and 1GB pages. While this can waste quite a lot of memory (because even if only one byte is needed, a whole page is allocated), this can also significantly improve the performance by increasing the range of memory covered by the TLB.

There is no standard way to allocate large pages: We must rely on OS-specific functionality.

Linux provides two ways of allocating large pages:

• hugetlbfs: hugetlbfs is a mechanism that allows the system administrator to create a pool of large pages that will then be accessible through a filesystem mount. libhugetlbfs provides both an API and tools to make allocating large pages easier [5],

• Transparent Hugepage (THP): THP is a feature that attempts to automatically back memory areas with large pages when this could be useful for performance. THP also allows the developer to request a memory area to be backed by large pages (but with no guarantee of success) through the madvise() system call. THP is much easier to use than hugetlbfs [6]; a minimal allocation sketch is shown below.
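As an illustration, a THP-based allocation could look roughly like the following minimal sketch. It assumes a Linux system with THP enabled; the buffer size is arbitrary and, as noted above, MADV_HUGEPAGE is only a hint:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   /* 2MB, a common large page size */
#define BUF_SIZE       (8UL * 1024 * 1024)   /* arbitrary 8MB buffer */

int main(void)
{
    void *buf;

    /* Align the buffer on a 2MB boundary so THP can back it with large pages. */
    if (posix_memalign(&buf, HUGE_PAGE_SIZE, BUF_SIZE) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Advise the kernel to back this range with large pages.
       This is advisory only: there is no guarantee of success. */
    if (madvise(buf, BUF_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* ... use buf ... */
    free(buf);
    return 0;
}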


Although not originally designed for that, large page (sometimes called huge page) support is fundamental for the Cache Pirate. We will explain why later.

2.1.3 CPU caches

Throughout the years, processors have kept getting exponentially faster, but the growth in the main memory performance has not been quite as steady. To fill the gap and keep our processors busy, CPU architects have been using caches for a long time.

CPU caches are a kind of memory that is small and expensive but very fast (one or two orders of magnitude faster than the main memory). Nowadays, without caches, a CPU would spend most of its time waiting for data to arrive from RAM.

Before studying some common cache organizations, it is important to note that caches don't store individual bytes. Instead, they store a data block, called a cache line. The most common cache line size is 64B.

Direct-mapped caches

Direct-mapped caches are the simplest kind of cache. Figure 2.2 shows how a direct-mapped cache with the following characteristics is organized:

Size: 32KB; Cache line size: 64B

The cache is divided into sets. In a direct-mapped cache, each set can hold one cache line and a tag. This cache has 512 sets (32KB / 64B).

When the processor wants to read or write in main memory, the cache is searched first.

The memory address requested is used to locate the data in the cache. This process works as follows:

Bits 6-14 (9 bits) are used as an index to locate the right set. Once the set is located, the tag it contains is compared to bits 15-31 (17 bits) in the address.

If they match, it means that the data block requested is present in the cache. It can thus be immediately retrieved. This case is called a cache-hit. The 6 lower bits of the address are then used to select the desired byte within the cache block.

If they don't match, then the data needs to be fetched from main memory and put in the cache, evicting in the process the cache line that was cached in this set. This situation is a cache-miss.

Figure 2.2: A direct-mapped cache. Size: 32KB; Cache line size: 64B

While being simple to implement and understand, direct-mapped caches have a fundamental weakness: Each set only holds one data block at a time. In our example, it means that up to 2^17 data blocks can potentially be competing for the same set, leading to a lot of expensive evictions. This kind of cache miss is called a conflict miss and can be triggered even if the program's working set is small enough to fit entirely in the cache.
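To make the index/tag arithmetic concrete, here is a small illustrative helper (not part of the thesis code) that splits a 32-bit address for the cache above, with 64B lines and 512 sets:

#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6   /* 64B line  -> bits 0-5  select the byte within the line */
#define SET_BITS  9   /* 512 sets  -> bits 6-14 select the set                  */

static void split_address(uint32_t addr)
{
    uint32_t offset = addr & ((1u << LINE_BITS) - 1);
    uint32_t set    = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (LINE_BITS + SET_BITS);   /* the remaining 17 bits */

    printf("addr=0x%08x  tag=0x%05x  set=%u  offset=%u\n", addr, tag, set, offset);
}

int main(void)
{
    split_address(0x12345678);   /* example address */
    return 0;
}

A 4-way set-associative version of the same cache would simply use SET_BITS = 7 (128 sets), leaving 19 tag bits, as described in the next section.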

Set-associative caches

A practical way to mitigate conflict misses is to make each set hold several blocks. A cache is said to be N-way set-associative if each set can hold N cache lines (and their corresponding tags). These entries in the set are also called ways. Figure 2.3 shows how the cache from Figure 2.2 would look if it were made 4-way set-associative.

Since each set now holds 4 cache lines instead of 1, there are 128 sets (32KB / 64B / 4).

Upon request, bits 6-12 (7 bits) are used to index the cache (find the right set). Once located, the 4 tags in the set are compared to bits 13-31 (19 bits).


Figure 2.3: A 4-way set-associative cache. Size: 32KB; Cache line size: 64B


If one of them matches, it's a cache-hit, and the desired byte is extracted from the corresponding cache line using bits 0-5.

If no tag matches, it's a cache-miss, and the data needs to be fetched from main memory. This freshly-fetched block is then cached. To do so, it is necessary to evict one block from the set (more on that later).

Set-associativity is an effective way to considerably decrease the number of conflict misses. However, it does not come for free: Costly hardware (in terms of size, price, and power consumption) must be added to select the correct block in a set as fast as possible.

Fully-associative caches

Fully-associative caches are a special case of set-associative cache with only one set. Data can end up anywhere in this set, thus completely eliminating conflict misses.

Fully-associative caches are very expensive because in order to be efficient, they must be able to compare all the tags contained in the set in parallel, which is very costly.

For this reason, only very small caches, like L1 TLBs, are made fully associative.

Replacement policy

Each time a new block is brought to the cache, an old one needs to be evicted. But which one?

Several strategies [7] (or replacement policies) exist to choose the right cache line to evict, called the victim.

The most common strategy is called Least Recently Used (LRU). This policy assumes that data recently used will likely be reused soon (temporal locality) and therefore, as implied by its name, chooses the least recently used cache line as the victim.

To implement this strategy, each set maintains age bits for each of its cache lines. Each time a cache line is accessed, it becomes the youngest cache line in the set and the other cache lines age.

Keeping track of the exact age of each cache line can become quite expensive as the associativity grows. In practice, instead of a strict LRU policy, processors tend to use an approximation of LRU.
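As an illustration of the bookkeeping (not of how real hardware is wired), strict LRU for a single 4-way set could be modeled as follows; the ages always form a permutation of 0..3, with 0 meaning "most recently used":

#define NWAYS 4

static unsigned age[NWAYS] = {0, 1, 2, 3};   /* start as a valid permutation */

/* Called on every access to way 'way' in this set. */
static void touch(unsigned way)
{
    unsigned old = age[way];
    for (unsigned w = 0; w < NWAYS; ++w)
        if (age[w] < old)
            ++age[w];          /* lines younger than the touched one age by 1 */
    age[way] = 0;              /* the touched line becomes the youngest */
}

/* Called on a miss: returns the way holding the least recently used line. */
static unsigned victim(void)
{
    unsigned v = 0;
    for (unsigned w = 1; w < NWAYS; ++w)
        if (age[w] > age[v])
            v = w;
    return v;
}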


Hardware prefetching

For the sake of hiding memory latency even further, processor manufacturers use another technique: Prefetching. The idea behind prefetching is to try to guess which data will be used soon and prefetch it so that it is already cached when actually needed.

To do so, the hardware prefetcher, which is a part of the CPU, constantly looks for common memory access patterns. Here is a common pattern that every prefetcher should be able to detect:

int array[1000];

for (int i = 0; i < 1000; ++i) {
    array[i] = 10;
}

This code simply traverses an array linearly. When executing this code, the prefetcher will detect the linear memory accesses and will fetch ahead the data before they are needed.

A prefetcher should also be able to detect strides between each memory access. For example, the prefetcher would also successfully prefetch the data in the following loop:

int array[1000];

for (int i = 0; i < 1000; i += 2) {
    array[i] = 10;
}

Virtual or physical addresses?

This section discusses which address (virtual or physical) is used to address a cache and the consequences of this choice. This is important because the pirate needs to control where its data end up in the cache.

We said earlier that some bits from the address are used to index the cache and some other bits are used as the tag to identify cache lines with the same set index. But are we talking about the virtual or physical address?

The answer is: It depends. CPU manufacturers are free to choose which combination of virtual/physical address they want to use for the set indexes and tags.

Considering that they have to choose whether they use virtual or physical addresses for the indexes and tags, a cache can be:

12

Page 17: Stealing the shared cache for fun and profit - DiVA portaluu.diva-portal.org/smash/get/diva2:640040/FULLTEXT01.pdf · Stealing the shared cache for fun and profit ... that is connected

• Virtually-Indexed, Virtually-Tagged (VIVT): The virtual address is used for both the index and tag. This has the advantage of having a low latency, because since the physical address is not required at all, there is no need to request it from the MMU (which can take time, especially in the event of a TLB-miss). However, this does not come without problems since different virtual addresses can refer to the same physical location (aliasing), and identical virtual addresses can refer to different physical locations (homonyms).

• Physically-Indexed, Physically-Tagged (PIPT): The physical address is used for both the index and tag. This scheme avoids aliasing and homonym problems altogether but is also slow since nothing can be done until the physical address has been served by the MMU.

• Virtually-Indexed, Physically-Tagged (VIPT): The virtual address is used to index the cache and the physical address is used for the tag. This scheme has a lower latency than PIPT because indexing the cache and retrieving the physical address can be done in parallel since the virtual address already gives us the set index.

The last possible combination, Physically-Indexed, Virtually-Tagged (PIVT), is futile and won't be discussed.

Generally, smaller, lower-latency caches such as L1 caches use VIPT because of its lower latency, while larger caches such as LLCs use PIPT to avoid dealing with aliasing.

2.1.4 Benchmarking the memory hierarchy

In order to understand how exactly a memory hierarchy works, a micro-benchmark has been developed during this thesis work. The benchmark works as follows:

The benchmark's main data structure consists of an array of "struct elem" whose size is specified as a command line parameter.

The C structure “elem” is defined like this:

struct elem {
    struct elem *next;
    long int pad[NPAD];
};


The next pointer points to the next element to visit in the array (note that a pointer is 4 bytes large on 32-bit systems but 8 bytes large on 64-bit systems).

The pad element is used to pad the structure in order to control the size of an element. Varying the padding can greatly vary the benchmark's behavior and is done by choosing at compile time an appropriate value for NPAD. An interesting thing to do would be to adjust the padding so that the elem structure size is similar to the cache line size: Assuming a 64B cache line size, and considering that the long int type has a size of 4 bytes on 32-bit Linux and 8 bytes on 64-bit Linux, choosing NPAD=15 for a 32-bit system or NPAD=7 for a 64-bit system would make each element 64 bytes large.
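The 64-bit case can be checked at compile time with a small C11 sanity check; this is not part of the benchmark itself, just an illustration of the arithmetic above:

#define NPAD 7   /* 64-bit Linux: 8B pointer + 7 * 8B padding = 64B */

struct elem {
    struct elem *next;
    long int pad[NPAD];
};

_Static_assert(sizeof(struct elem) == 64,
               "struct elem should occupy exactly one 64B cache line");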

The elements are linked together using the next pointer. There are 2 ways to link the elements:

• Sequentially: The elements are linked sequentially and circularly, so the last element must point to the first one,

• Randomly: The elements are randomly linked but we must make sure that each element will be visited exactly once to avoid looping only on a subset of the array.

The benchmark then times how long it takes to traverse the array (using only the next pointer) a fixed number of times.

Finally, dividing the total time by the number of iterations and the number of elements in the array gives us the average access time per element. The array size and average access time are then stored in a results file.
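A minimal sketch of this measurement loop might look as follows. It assumes struct elem as defined above; nelems and niter are hypothetical parameters taken from the command line, and timing uses clock_gettime() as discussed later in this chapter:

#include <stddef.h>
#include <time.h>

/* Returns the average access time per element, in seconds. Each load depends
   on the previous one (pointer chasing), so the memory latency cannot be
   hidden by out-of-order execution or optimized away by the compiler. */
static double time_traversal(struct elem *head, size_t nelems, long niter)
{
    struct timespec start, end;
    volatile struct elem *p = head;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long it = 0; it < niter; ++it)
        for (size_t i = 0; i < nelems; ++i)
            p = p->next;                      /* dependent load */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double sec = (end.tv_sec - start.tv_sec)
               + (end.tv_nsec - start.tv_nsec) / 1e9;
    return sec / ((double)niter * (double)nelems);
}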

In its current state, each execution of the benchmark works only with one array size. To try different array sizes, it is recommended to use a shell "for" loop.

A script able to parse the results files and plot them using "gnuplot" is also provided.

The Freescale P4080

Figure 2.4: Freescale P4080 Block Diagram


The Freescale P4080 is an eight-core PowerPC processor. The development systems provided by Freescale use Linux. On the other hand, the production systems use a commercial realtime OS.

Each core runs at 1.5GHz and has a 32KB L1 data cache and a 32KB L1 instruction cache, as well as a 128KB unified L2 cache.

Those eight cores share a 2MB L3 cache (split into two chunks of 1MB) as well as two DDR3 memory controllers.

Figure 2.5 displays the micro-benchmark results when configured to occupy one full cache line per element.

Figure 2.5: micro-benchmark run on the Freescale P4080

We see that up to a working set of 32KB, the access time is very low. However, beyond this point, the benchmark is slowed down. This is because the working set does not fit anymore in the L1 cache.

A second performance drop occurs when the working set grows beyond 128KB, which matches the size of the L2 cache. Above this point, the working set becomes too big for the L2 cache and we start to hit in the L3 cache, which has significantly higher latencies.

Another performance penalty occurs when the working set size exceeds 2MB, which is the size of the L3 cache. Beyond this point, the working set does not fit in the caches anymore and every element needs to be fetched from the main memory.

The characteristics of the memory hierarchy, such as the size of each cache level, are thus clearly exhibited.

Many more details about the memory hierarchy can be found in Drepper 2007 [13].

Time measurement

In order to get valid results, it is crucial to be able to measure time accurately.

POSIX systems such as Linux which implement the POSIX Realtime Extensions provide the clock_gettime() function, which has a high enough resolution for our purpose [12].

Since the commercial OS used by Ericsson does not provide a high-resolution timing function, a timing function that reads the Time-Stamp Counter (TSC) has been written.

Finally, it is strongly advised to disable any frequency scaling feature when running the benchmark since it could affect the results.


Chapter 3

The Cache Pirate

As we showed previously, modern multicore processors tend to have one or several private caches per core as well as one big cache shared by all the cores. Sharing a cache between cores has several advantages. Some of them are:

• Each core can, at any time, use as much shared cache space as available. This is in contrast with private caches, where any unused space cannot be reused by other cores,

• Faster/simpler data sharing and cache-coherency operations by using the shared cache.

However, sharing a cache between cores also means that there can be contention for this precious resource, where several cores are competing for space in the cache, evicting each other's data.

As we saw in the previous section, the LLC is the last rampart before having to hit the RAM, which has been shown to have significantly longer access times. Therefore it is important to understand how sharing the LLC will affect a specific application.

In order to study this, the UART team has created an original approach: The Cache Pirate.

The Pirate is an application that runs on a core and actively steals a part of the shared cache to make it look smaller than it actually is. We then co-run a target application with the pirate and measure its performance with whatever metric we are interested in (Cycles Per Instruction, execution time, Frames Per Second, requests/second, and so on). If we repeat this operation for different cache sizes, we can express the performance of our application as a function of the shared cache size.


The pirate works by looping through a working set of a desired size in order to prevent it from being evicted from the cache. The size of this working set represents the amount of cache we are trying to steal.

The pirate must:

• steal the desired amount of cache (no more, no less)

• steal evenly from every set in the cache

• not use any other resource, such as the memory bandwidth

Property 1 is self-explanatory: We want the pirate to steal the amount of cache we have asked for.

Property 2 is important because we want the cache to look smaller by reducing its associativity. In order to do that, the pirate needs to steal the same number of ways in every set. Not doing so could affect applications that have hotspots in the cache.

Note that this is the approach processor manufacturers tend to use when they create a lower-end processor with a smaller cache than their higher-end processors (for example, the L3 cache on Intel Sandy Bridge processors is made of 2MB, 16-way set-associative chunks. To create a processor with a 3MB L3 cache, they would use 2 chunks of cache but with 4 ways disabled in each, reducing the size from 4MB to 3MB).

Property 3 is also important because the pirate must only use the shared cache. Other resources, like the memory bandwidth, should be left untouched or the results will be biased.

Ensuring that the pirate actually has these properties is the main challenge of this work.

3.1 Monitoring the Pirate

We have already said that in order to steal cache, the Pirate will loop through its working set to keep it in the shared cache. However, the Pirate will be contending with the rest of the system for space in this cache, which means that its data could get evicted if they become marked as least recently used.

This would lead to two problems:

• The Pirate would not be stealing the desired amount of cache (and thus violating the first property), and

• It would need to fetch the working set (or part of it) from the next level of the memory hierarchy, which is the RAM. This implies using the memory bandwidth, and since this resource is also shared with other cores, it also implies reducing the memory bandwidth available to the other cores, also violating property 3.

A key point is that this issue gets amplified when we try to steal more cache, because increasing the working set size implies reducing the access rate, which is the rate at which each element gets touched. If the access rate becomes too low, the data could become the least recently used and therefore get evicted from the cache.

We thus need to check whether the Pirate is able to keep its working set cached or not.

3.1.1 The original approach

In their original paper, the UART team created a pirate that loops sequentially through its working set.

To check that the Pirate is not using the memory bandwidth, a naive approach would be to check the LLC's miss ratio (the ratio of memory accesses that cause a cache miss). Missing in the Last-Level Cache implies fetching data from the RAM and thus this approach looks reasonable at first glance.

However, we saw that processors implement prefetching in order to hide the latency. In this case, a prefetcher could easily detect the sequential accesses the pirate is doing and prefetch data from main memory before they are needed. This would drastically reduce the miss ratio and thus hide the fact that the Pirate is using the memory bandwidth.

To get around that, the UART team uses another metric called the fetch ratio, which is the ratio of memory accesses that cause a fetch from main memory [8].

3.1.2 Defeating prefetching

However, the fetch ratio is a metric that is much less known than the miss ratio. Some systems might also not provide the performance events necessary for measuring it, or those events might only be collectible system-wide, which is the case on modern processors where the L3 cache and the memory controller are located in an "uncore" part that provides no easy way to collect data on a per-core basis [10].


To avoid the need to measure the fetch ratio, a different approach has been used in this work: The Pirate written during this thesis work iterates through its working set randomly.

This change is fundamental because, since the prefetchers are unable to recognize a typical access pattern, they are not able to guess which data will be needed soon and therefore will not prefetch any data ahead of time.

Now that the prefetchers will not get in the way anymore, the Pirate's miss ratio can be used for monitoring. It should stay as close to 0 as possible; otherwise it means that the Pirate is fetching data from the RAM.

Monitoring the miss ratio must be done with a profiler able to read the performance counters, such as Linux's perf tool [9].
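For instance, on Linux something along the following lines can be used to watch the Pirate's miss ratio while it runs (the generic event names shown here are illustrative; the exact events available depend on the CPU and kernel):

perf stat -e cache-references,cache-misses -p <pirate_pid> sleep 10

The miss ratio is then simply cache-misses divided by cache-references over the measurement interval.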

Unfortunately, compared to the original approach, defeating the prefetchers could hurt the Pirate's performance, since data can no longer be prefetched from the LLC, leading to a lower access rate.

3.1.3 Timing

There are systems where L3 performance events are also not easily collectible for each core.

To work around this issue, another way of monitoring the Pirate has been devised:

We first start the Pirate while the system is idling. The Pirate will then benchmark the time it takes to iterate over its working set. Once this is done, our target application can be started. The Pirate will keep timing itself and comparing the result to the reference time. If the difference between the reference time and the new time exceeds a threshold (the current default is 10%, but it can be changed via an optional command-line switch), it means that the Pirate's data got evicted from the cache and had to be read back from RAM, which in other words means that the Pirate was not able to keep its working set cached.

This is due to the fact that, as we saw in the previous chapter, the RAM has significantly higher latencies than caches, and thus missing in the last-level cache leads to a significantly higher running time for the Pirate, which we can use to detect when the Pirate starts to use the memory bandwidth.

As before, timing is done using the clock_gettime() function on Linux or by manually reading the CPU's TSC on the OS used in production.
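A condensed sketch of this monitoring loop is given below. The names are illustrative: now() stands for a small wrapper around clock_gettime() (or a TSC read) returning seconds, and steal_one_pass() performs one full random-order pass over the Pirate's working set:

#include <stdio.h>

#define THRESHOLD 0.10   /* default tolerance: 10% slowdown over the reference */

double now(void);            /* wraps clock_gettime() or a TSC read */
void steal_one_pass(void);   /* one random-order pass over the working set */

void pirate_loop(void)
{
    /* Reference pass, measured while the system is idle. */
    double t = now();
    steal_one_pass();
    double reference = now() - t;

    for (;;) {
        t = now();
        steal_one_pass();
        double elapsed = now() - t;

        /* A pass noticeably slower than the reference means part of the
           working set was evicted and had to be re-fetched from RAM. */
        if (elapsed > reference * (1.0 + THRESHOLD))
            fprintf(stderr, "pirate: cannot hold working set (%.1f%% slower)\n",
                    100.0 * (elapsed / reference - 1.0));
    }
}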

This method has several advantages:

• It is very easy to implement

• It is completely cross-platform for systems supporting the clock_gettime() function (and easily portable to others)

• It does not use any CPU-dependent feature (apart from the TSC if we read it manually) such as performance counters

• It can be used on systems that do not provide an easy way to get per-core LLC events

3.2 Stealing evenly from every set

We saw previously that caches are split into sets, which are themselves split into ways.

Some applications might have hotspots in the cache, which means that their working set is concentrated in a small part of the cache, and failing to steal evenly from every set could cause either the target application to be unaffected by the Pirate (if the Pirate steals nothing or too little from its hotspots) or to be too affected by the Pirate (if the Pirate steals too much from the target application's hotspots).

Ideally, what we would like is to steal the same number of ways in every set.

As it has been said previously, to determine where each datum is (or will end up) in the cache, a part of its address is used as an index.

From this, we can derive that a way to control in which set a datum will end up is to manually choose its address, or at least the part that will be used as an index. This can be easily done for virtually-indexed caches by using an aligned allocation routine such as posix_memalign() to craft a virtual address [11].

The Pirate's working set will be organized as follows: As with the micro-benchmark, the working set is a densely-packed linked list. Each element will occupy a full cache line. The working set is placed in memory such that the first element ends up in the first set of the cache, the second in the second set, and so on until we have stolen a way in every set, at which point the addresses of the following elements (if we want to steal more than one way) will wrap around.

Example: Suppose we have the same cache as in the previous section (Size: 32KB; Cache line size: 64B; Associativity: 4; Number of sets: 128; 32-bit addresses). We saw that bits 0-5 of an address are used to select a byte within a cache line and that bits 6-12 are used to index the cache. Thus, to use a way in the first set, the first element of the working set should have an address with bits 0-12 (13 bits) unset. To achieve this, the working set needs to be aligned on 8KB boundaries (2^13 bytes).

However, LLCs are physically-indexed, which means, as we said before, that the physical address is used to index the cache. Thus, if we want to control where data end up in the cache, we need to control the physical addresses. This becomes a problem because, since the memory is paged, there is no guarantee that the pages will be contiguous in memory, which prevents us from controlling the addresses of the elements in the working set, which in turn prevents us from placing elements in the cache as we wish.

A way to work around this issue is to use large memory pages in order to be sure that our working set will be allocated as a contiguous block in physical memory. Coupled with the use of an aligned allocation routine as explained above, this allows us to know where each datum will end up in the cache.
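Putting the two ideas together, the allocation could look roughly like the sketch below. This is illustrative only: NSETS, CACHE_LINE and ways_to_steal stand in for the real parameters of the target LLC, struct elem is one cache line large as before, and linking the elements in random order is left out:

#include <stdlib.h>
#include <sys/mman.h>

#define CACHE_LINE 64
#define NSETS      128                      /* sets in the example cache above */
#define WAY_SIZE   (NSETS * CACHE_LINE)     /* 8KB: one stolen way */
#define HUGE_PAGE  (2UL * 1024 * 1024)

/* Allocate a working set covering 'ways_to_steal' ways of every set.
   Element i (one per cache line) maps to set i % NSETS because the block
   is aligned and, thanks to the large page, physically contiguous, so the
   physical index bits match the virtual ones. */
static struct elem *alloc_working_set(size_t ways_to_steal)
{
    void *block;
    size_t size = ways_to_steal * WAY_SIZE;

    /* Aligning on a 2MB boundary satisfies the 8KB way alignment and gives
       THP a chance to back the whole block with large pages. */
    if (posix_memalign(&block, HUGE_PAGE, size) != 0)
        return NULL;
    madvise(block, size, MADV_HUGEPAGE);    /* advisory only, as noted earlier */

    /* The elements would then be linked in random order (section 3.1.2)
       before the stealing loop starts. */
    return (struct elem *)block;
}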

3.3 How to steal more cache?

We said before that when the Pirate's working set size increases, the access rate decreases, down to a point where the rest of the system fights back hard enough to start evicting the Pirate's data. This puts a limit on how much cache the Pirate can steal (which depends heavily on how hard the target application and the rest of the system fight back).

To go beyond this limit, we can run several Pirates. By splitting the working set between each instance, they will each have a smaller working set to iterate on, and thus a higher access rate.

However, when contending for space in the shared cache, the different instances will make no distinction between one another and the rest of the system, which means that they will also fight each other.

Furthermore, since each Pirate should run on one core, this further reduces the number of cores available to the target application.


Chapter 4

Results

4.1 Results from the micro-benchmark

The micro-benchmark developed during this thesis work was created to get a better understanding of how the memory hierarchy works, but also to validate the Pirate. The idea here is that the micro-benchmark can expose the size and latency of each part of the memory system.

Thus, if we steal a part of the LLC with the Pirate, and then co-run it with the micro-benchmark, it should show that the LLC indeed looks smaller.

This is what has been done and Figure 4.1 shows the result.


Figure 4.1: micro-benchmark co-run with Pirates stealing different cache sizes

The tests were run on the Freescale P4080 (8 cores, 32KB L1, 128KB L2, 2MB L3). The benchmark uses a random access pattern to defeat prefetching and large memory pages to hide TLB effects.

The red line shows the benchmark results with no Pirate running. Each level of the memory hierarchy is visible. In particular, we can notice that the benchmark gets slower when the working set size becomes larger than 2MB (2^21 bytes), meaning that we are hitting the main memory.

The green line shows us what happens when we run the same benchmark with a Pirate stealing 1MB of cache. We can see that the benchmark gets slowed down as soon as its working set becomes bigger than 1MB (2^20 bytes). This is expected (and desired!) and due to the fact that the Pirate is already stealing 1MB, thus reducing the cache capacity available for the benchmark to 1MB.

The blue line represents the benchmark results when a Pirate is trying to steal 1.5MB. We would expect the performance to get slower above 512KB (2^19 bytes), but actually the line looks very similar to the green line. This is because the Pirate is not able to steal that much cache.

Finally, the purple line shows what happens when we run two Pirates, stealing 0.75MB each (for a total of 1.5MB). Now the knee indeed appears when the working set grows above 512KB, which means that the benchmark only has access to 512KB of LLC space.

4.2 SPEC CPU2006 results

Several benchmarks from the SPEC CPU2006 benchmark suite were co-run with a Pirate stealing different cache sizes.

Table 4.1 shows the execution time for the 401.bzip2 benchmark when contending with a Pirate.

Amount of cache available    Execution time (seconds)
2MB                          33.5
1.75MB                       34.9
1.5MB                        35.9
1MB                          37.4
512KB                        40.2

Table 4.1: Results for 401.bzip2 co-run with the Pirate stealing different cache sizes

We can see that although the Pirate and the benchmark obviously do not share any data, the benchmark is slowed down when the Pirate is stealing cache. Up to 1.5MB out of 2MB were stolen from the LLC. The performance dropped there by 20%.


Amount of cache available    Execution time (seconds)
2MB                          58.8
1.75MB                       61.7
1.5MB                        64.1
1MB                          68.4
512KB                        74.4

Table 4.2: Results for 433.milc co-run with the Pirate stealing different cache sizes

Table 4.2 shows the results for 433.milc. Here, the performance drop went as far as 26.6%.

The important slowdown caused by reducing the shared cache capacity tells us that these applications rely heavily on the LLC. If the shared cache capacity is reduced, these applications start to fetch data from RAM instead, which has been shown to be significantly slower.

Nevertheless, other benchmarks, such as 470.lbm, were much less sensitive to reducing the shared cache capacity.


Chapter 5

Conclusion, related and future work

The goal of this work was to implement the Cache Pirating method on Ericsson's systems in order to study the effect on performance of sharing a cache between cores.

During this thesis, a Pirate has been written that can run on both Linux and the commercial OS used by Ericsson, and it has been successfully used to test several applications, both from the SPEC CPU2006 benchmark suite and from Ericsson's own programs.

Compared to the original work done by the UART team, some different choices were made. The Pirate written as part of this work uses a random access pattern in order to defeat prefetching, which makes monitoring easier. Monitoring itself is done by continuously timing the Pirate and looking for sudden slowdowns.


5.1 Related work

Multicore processors have become the norm and the need for performance keeps rising every day, so studying the effect of sharing resources in a multicore processor is of primary importance. In the same vein as the Cache Pirate, the UART team created the Bandwidth Bandit, which focuses on stealing only the memory bandwidth [4].

U. Drepper [13] provides comprehensive and detailed information about the memory hierarchy, performance analysis, operating systems and what a programmer can do to optimize his program with regard to the memory hierarchy.

Wulf & McKee [14] have shown that the increasing gap between the speed of CPUs and that of RAM will become a major problem, up to a point where the memory speed will become the major bottleneck.

Others have studied the effect of sharing a cache between cores. StatCC [15] is a statistical cache contention model, also created by the UART team, that takes as input a reuse distance distribution that must be collected beforehand and leverages StatStack [16], a probabilistic cache model, to estimate the cache miss ratio of co-scheduled applications and their CPIs.

The Mälardalen Real-Time Research Center (MRTC) has been working on feedback-based generation of hardware characteristics, where a model is derived from running a production system. This model is then used to generate a similar load on different parts of the hardware, such as the caches, without running the real production system [17].


Acknowledgments

I would like to thank David Black-Schaffer and Erik Berg for giving me the opportunity to work on this interesting project, as well as for their supervision during this work. I would also like to thank all the members of the Quantum team at Ericsson for their precious help. Finally, I would like to thank my friends and my family for their unconditional support and faith.


List of Figures

2.1 The memory hierarchy
2.2 A direct-mapped cache
2.3 A 4-way set-associative cache
2.4 Freescale P4080 Block Diagram
2.5 micro-benchmark on the P4080

4.1 micro-benchmark co-run with Pirates stealing different cache sizes


Bibliography

[1] Herb Sutter, The free lunch is over, 2005. Date accessed: 2013-07-24. http://www.gotw.ca/publications/concurrency-ddj.htm

[2] Herb Sutter, Welcome to the hardware jungle, 2011. Date accessed: 2013-07-24. http://herbsutter.com/welcome-to-the-jungle/

[3] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Cache Pirating: Measuring the curse of the shared cache, in 2011 International Conference on Parallel Processing (ICPP), pages 165-175

[4] D. Eklov, N. Nikoleris, D. Black-Schaffer, E. Hagersten, Bandwidth bandit: Understanding memory contention, in International Symposium on Performance Analysis of Systems and Software (ISPASS) 2012, pages 116-117

[5] LWN.net, Huge pages: Interfaces, Date accessed: 2013-07-24. https://lwn.net/Articles/375096/

[6] Linux Documentation, Transparent Hugepage support, Date accessed: 2013-07-24. http://lwn.net/Articles/423592/

[7] Wikipedia, Cache algorithms, Date accessed: 2013-07-24. http://en.wikipedia.org/wiki/Cache_algorithms

[8] Roguewave, Fetch ratio, Date accessed: 2013-07-24. http://www.roguewave.com/portals/0/products/threadspotter/docs/2011.2/manual_html_linux/manual_html/ch03s08.html

[9] Kernel.org, Linux's perf tool, Date accessed: 2013-07-24. https://perf.wiki.kernel.org/index.php/Main_Page

[10] Intel, Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processor, p. 13, section "Uncore performance monitoring (PMU)".

[11] Linux Manpages, posix_memalign() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/posix_memalign

[12] Linux Manpages, clock_gettime() manpage, Date accessed: 2013-07-24. http://linux.die.net/man/3/clock_gettime

[13] Ulrich Drepper, What every programmer should know about memory, Date accessed: 2013-07-24. www.akkadia.org/drepper/cpumemory.pdf

[14] Wm. A. Wulf, Sally A. McKee, Hitting the memory wall: implications of the obvious, in ACM SIGARCH Computer Architecture News, Volume 23, Issue 1, March 1995, pages 20-24

[15] D. Eklov, D. Black-Schaffer, E. Hagersten, StatCC: A statistical cache contention model, in PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 551-552

[16] D. Eklov, E. Hagersten, StatStack: Efficient modeling of LRU caches, in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS) 2010, pages 55-65

[17] M. Jagemar, S. Eldh, A. Ermedahl, B. Lisper, Towards Feedback-Based Generation of Hardware Characteristics, in 7th International Workshop on Feedback Computing 2012
