

The research is supported by the Fundamental Research Funds for the Central Universities (N100404007).

Shared-Cache Simulation for Multi-core System with LRU2-MRU Collaborative Cache Replacement Algorithm

Shan Ding, Shiya Lui, Yuanyuan Li

Northeastern University, College of Information Science and Engineering, Shenyang, China

e-mail: [email protected], [email protected], [email protected]

Abstract—The L2 shared cache is an important resource in a multi-core system, and its replacement algorithm is one of the key factors in determining whether the L2 shared cache is used efficiently. In this paper, we study shared-cache simulation for multi-core systems with the LRU2-MRU collaborative cache replacement algorithm. We provide a theoretical foundation for LRU2-MRU, prove its inclusion property, and measure the stack distance of the LRU2-MRU algorithm. In addition, the simulation results demonstrate that the MPKI (misses per thousand instructions) of LRU2-MRU is lower than that of other cache replacement algorithms, and that the shared-cache miss ratio can be reduced through replacement algorithm optimization.

Keywords—Shared-cache; Miss ratio; Stack distance; Replacement algorithm; MPKI; LRU2; MRU

I. INTRODUCTION

With the development of electronic technology, multi-core systems are becoming the mainstream, and the trend toward multi-core chip designs presents new challenges for designing the memory system. Conflicting accesses to the shared cache from the different threads or processes of a parallel application can degrade the performance of a multi-core system. A well-chosen cache replacement algorithm for the L2 shared cache can mitigate this problem efficiently. The L2 shared cache is an important resource in a multi-core system, and its replacement algorithm influences overall system performance [2]. The replacement algorithm of the L2 shared cache is one of the key factors in judging whether the L2 shared cache is used efficiently [3]. The miss ratio of the shared cache can be evaluated by one-pass simulation over a program trace [4].

At present, the traditional cache replacement algorithms are mainly LRU, LFU, FIFO, and RAND.

(1) LRU replacement algorithm: This is the most widely used replacement algorithm. The data in the cache are ordered by the time of last use; on a miss with a full cache, the block in the LRU position is evicted (a minimal simulator sketch follows this list). However, LRU considers only the recency of each block's last access and ignores how frequently blocks are accessed. When the cache capacity is smaller than the program's working set, the cache can thrash [5], which degrades performance. The disadvantage of LRU is that it cannot predict whether a block will be used frequently.

(2) LFU replacement algorithm: LFU assumes that the most frequently used data are the most likely to be used again, and it typically keeps blocks in a list ordered by reference count in ascending order. LFU has no aging mechanism, so blocks that were once popular but are no longer used can linger, causing so-called "cache pollution".

(3) FIFO replacement algorithm: Its bookkeeping overhead is very low, so FIFO runs fast, but it often performs poorly. The block that entered the cache first is evicted.

(4) RAND replacement algorithm: On a replacement, when the accessed block is not in the cache, the victim is selected at random; in other words, every block has the same probability of being evicted.
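To make the LRU policy described in item (1) concrete, here is a minimal single-set LRU simulator. This is our illustration, not code from the paper; the class name and the example trace are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal single-set LRU cache: a hit moves the block to the MRU
    end; a miss in a full set evicts the block at the LRU end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # ordered from LRU (front) to MRU (back)

    def access(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)    # refresh recency on a hit
            return True                      # hit
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block
        self.blocks[addr] = None
        return False                         # miss

# Example: repeated blocks hit once the set warms up.
cache = LRUCache(capacity=2)
print([cache.access(x) for x in "abab"])  # [False, False, True, True]
```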

Most current multi-core systems use LRU as the replacement algorithm of the L2 shared cache, but LRU has some weaknesses. To overcome the shortcomings of the LRU replacement algorithm, many solutions have been proposed. Pseudo-LRU [6] is used only with low cache associativity. Taking IPC as the optimization objective, an optimal partitioning method for the shared cache, based on the cache's influence on IPC, has been proposed [7]. Liao has proposed an improved LRU replacement policy based on LRU and the access probability [8]. Considering both the access frequency of a data block and its recent access information, Lin et al. have proposed the bubble replacement algorithm [9]. Gu et al. defined a bipartite cache model, called the LRU-MRU cache [4], to maximize data reuse.

In order to reduce conflicting shared-cache accesses from the different threads or processes of parallel applications, overcome the disadvantages of LRU, and improve the cache miss ratio, this paper proposes the LRU2-MRU collaborative cache replacement algorithm.

The rest of this paper is organized as follows. The next section presents the LRU2-MRU cache replacement algorithm. Section 3 describes the simulation experiments, performance analysis, and stack distance. Section 4 concludes the paper.



II. COLLABORATIVE CACHE REPLACEMENT ALGORITHM

A. LRU2 and MRU Algorithm Analysis

LRU cannot predict whether data will be used frequently. Furthermore, the LRU replacement algorithm considers only the access information of the most recently used data block, without caring about how frequently blocks are accessed. LRU2 is applied to solve these problems: it makes up for the deficiency of LRU by introducing the concept of two checks (a code sketch follows at the end of this subsection). LRU2 has the following advantages:

(1) It can better distinguish data sets with different reference levels.

(2) It can adapt to different reference patterns by self-adjustment.

(3) Its management overhead is low.

MRU replacement policy: the basic idea is that when an access misses, the most recently used data block is evicted. The temporal prediction of MRU is stronger than that of LRU.
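The paper gives no pseudocode for LRU2. Assuming it follows the classical LRU-2 rule (evict the block whose second-most-recent reference is oldest, treating once-referenced blocks as oldest of all), a minimal sketch might look like the following; the class and example are ours, not the authors'.

```python
class LRU2Cache:
    """Sketch of LRU-2 under the classical definition: evict the block
    whose second-most-recent access is oldest; blocks referenced only
    once are treated as oldest of all. We assume this is what the
    paper's 'two checks' refers to."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.history = {}  # addr -> access times, most recent last
        self.clock = 0

    def access(self, addr):
        self.clock += 1
        hit = addr in self.history
        if not hit and len(self.history) >= self.capacity:
            def penultimate(a):
                # Second-most-recent access time; -inf if accessed once.
                times = self.history[a]
                return times[-2] if len(times) >= 2 else float("-inf")
            victim = min(self.history, key=penultimate)
            del self.history[victim]
        self.history.setdefault(addr, []).append(self.clock)
        return hit

# 'a' is referenced twice, 'b' once; on the next miss, 'b' is evicted.
c = LRU2Cache(capacity=2)
[c.access(x) for x in "aab"]
c.access("c")
print(sorted(c.history))  # ['a', 'c']
```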

B. LRU2-MRU Collaborative Cache Replacement Algorithm

One of the key challenges in a cache replacement algorithm is choosing the most appropriate block to evict. For this reason, we propose the LRU2-MRU collaborative cache replacement algorithm. The LRU2-MRU cache depends on a hybrid priority: an LRU2 access is prioritized by the LRU2 order, and an MRU access is prioritized by the MRU order.

Figure 1. Collaborative replacement examples when M=0.

The LRU2-MRU collaborative cache replacement algorithm includes two access types: LRU2 accesses and MRU accesses. We set M as a tag on the accessed data: M=1 means the access hits in the cache, and M=0 means it misses. When M=0, there are two ways to replace a block in the cache: the first replaces the block at the top of the stack, and the second replaces the block at the bottom of the stack. We set N as a priority tag, initialized to zero. When N=0, the access is an MRU access, and the replaced block is selected from the bottom of the stack. When N=1, the LRU2 access is prioritized, and the replaced block is selected from the top of the stack. Figure 1 shows an example of a cache set using the collaborative replacement policy; we assume set-associative mapping with an associativity of eight.

Figure 2. Collaborative replacement examples when M=1.

When M=1, there are likewise two ways to move the hit data: to the top of the stack or to the bottom of the stack. When N=1, the LRU2 access is prioritized and the hit block moves to the top of the stack, while the other blocks shift down in turn. When N=0, the MRU access is prioritized and the hit block moves to the bottom of the stack, while the other blocks shift up in turn. Figure 2 shows how data are moved when M=1.
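The following is a minimal sketch of the collaborative stack described above, under one reading of the text: an LRU2 access places its block on top of the stack, an MRU access places its block on the bottom, and a miss in a full set evicts the bottom block, which holds the minimum signed priority under the sign convention of Table II. The paper's wording about which end is replaced on an LRU2 miss is ambiguous, so this is an assumption on our part, not the authors' exact procedure.

```python
class LRU2MRUStack:
    """Sketch of one LRU2-MRU cache set, read as a signed-priority stack:
    an LRU2 access gives its block priority +t (the new maximum, top of
    stack) and an MRU access gives it -t (the new minimum, bottom of
    stack). On a miss in a full set, the bottom block (minimum priority)
    is evicted. This is our reading, matching Table II's sign convention."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.stack = []  # index 0 = top of stack, index -1 = bottom

    def access(self, addr, lru2):
        m = 1 if addr in self.stack else 0   # M tag: 1 = hit, 0 = miss
        if m:
            self.stack.remove(addr)          # hit: re-position the block
        elif len(self.stack) >= self.capacity:
            self.stack.pop()                 # miss in a full set: evict bottom
        if lru2:
            self.stack.insert(0, addr)       # LRU2 access: move/place on top
        else:
            self.stack.append(addr)          # MRU access: move/place on bottom
        return m
```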

C. The Property of the LRU2-MRU Cache

The distance between the current access to a datum m and the previous access to the same datum m is the LRU2-MRU stack distance [10]. From the stack distance we can judge whether an access is a cache hit or a miss.

The principle of the bipartite cache is described in paper [4]: if the bottom datum of the stack was last accessed by LRU, then all data in the cache were last accessed by LRU.

The LRU2-MRU collaborative cache algorithm that we propose follows a similar principle to the bipartite cache.

LEMMA 1. If the bottom datum of the stack was last accessed by LRU2, then all data in the cache were last accessed by LRU2.

Lemma 1 has been proved in paper [4]. Next, we use Lemma 1 to prove that the collaborative cache obeys the inclusion principle for any sequence of LRU2 and MRU accesses.

THEOREM 1. Let T1 and T2 be two caches whose sizes are |T1| and |T2| respectively, with |T1| < |T2|, and let an access trace X be executed on both. At every access, the content of cache T1 is a subset of the content of cache T2.

Proof. We prove the theorem by induction on t. Let the access trace be X = (x1, x2, ..., xn), and let T1(xt) and T2(xt) be the sets of elements in caches T1 and T2 after access xt. T1(0) and T2(0) are the initial sets for T1 and T2, which are both empty.

We assume T1(xt) ⊆ T2(xt). If xt+1 is a hit in T2 (xt+1 ∈ T2(xt)), it is easy to see that the content of cache T1 remains a subset of the content of cache T2. If xt+1 is a miss in T2 (xt+1 ∉ T2(xt)), then xt+1 is also a miss in T1, since T1(xt) ⊆ T2(xt).

Let the evicted element be the datum last accessed at xp in T1 and at xq in T2. On a miss we obtain T1(xt+1) = T1(xt) - xp + xt+1 and T2(xt+1) = T2(xt) - xq + xt+1. The only way that T1(xt+1) ⊆ T2(xt+1) can fail is if T1 holds the datum of xq but does not evict it while T2 evicts it, so that it lies in T1(xt+1) but not in T2(xt+1). We now show that this cannot happen, so the content of cache T1 is always a subset of the content of cache T2.

(1) Assume xp exists. Regardless of whether xp is an LRU2 or an MRU access, the eviction in T1 happens at the LRU2 position: before access xt+1, the datum of xp is at the bottom of T1, and the datum of xq is at the bottom of T2. Note that a cache miss does not necessarily mean a cache eviction.

(2) For the principle of the LRU2-MRU collaborative cache to be violated, the datum of xq must lie in T1(xt) at a position above xp. By induction, the datum of xp must then lie in T2(xt) at a position above xq. So both T1 and T2 contain the data of xp and xq, but in opposite orders.

Next, we complete the argument. Each of xp and xq may be an LRU2 access or an MRU access, giving four cases:

(a) xp and xq are both MRU accesses. In T1, because xq is at a higher position than xp, we conclude q > p. For T2, the same reasoning requires p > q. Hence this case is impossible.

(b) xp and xq are both LRU2 accesses. Because xq is at a higher position than xp in T1, we have q > p. For T2, the same reasoning requires p > q, which makes this case impossible.

(c) xp is an LRU2 access and xq is an MRU access. Applying the principle to T1, xq would have to be an LRU2 access, because xq resides above the LRU2 access xp in T1. So this case is also impossible.

(d) xq is an LRU2 access and xp is an MRU access. Applying the principle to T2, xp would have to be an LRU2 access, because xp resides above the LRU2 access xq in T2, which makes this case impossible.

Hence the property of the LRU2-MRU collaborative cache holds for every access.
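Theorem 1 can also be sanity-checked empirically with the LRU2MRUStack sketch from Section II.B. This is an illustration under our reading of the policy, not a proof; the trace below is randomly generated by us.

```python
import random

def check_inclusion(trace, small_size, large_size):
    """Empirical check of Theorem 1: after every access, the contents of
    the smaller cache must be a subset of the larger cache's contents."""
    t1, t2 = LRU2MRUStack(small_size), LRU2MRUStack(large_size)
    for addr, lru2 in trace:
        t1.access(addr, lru2)
        t2.access(addr, lru2)
        assert set(t1.stack) <= set(t2.stack), (t1.stack, t2.stack)

random.seed(1)
trace = [(random.choice("abcdefghij"), random.random() < 0.5)
         for _ in range(20000)]
check_inclusion(trace, small_size=4, large_size=8)
print("T1 stayed a subset of T2 on every access")
```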

III. SIMULATION AND EXPERIMENT

A. Simulation Environment

In this paper we use the Simics [11] simulation environment, a full-system simulator. The Simics platform models not only the CPU but also the hardware devices, so it can run real firmware, completely unmodified kernels, driver code, embedded operating systems, general-purpose operating systems, server operating systems, and so on. The platform provides configurations for mainstream instruction-set systems, including processor and peripheral models, together with virtual disk images with pre-installed operating systems. With these configurations, the user can simulate the corresponding virtual host system.

We use the Freescale MPC8641 [12] system configuration, which supports standard PC devices including graphics devices, flash memory, PCIe-to-PCI bridges, and floppy and hard disks. The configuration simulates two PowerPC e600 cores, each with hyperthreading, running MPC8641-simple Linux 2.6.23 with 256 MB of shared memory. The default cache simulation settings are: 4-way set associativity, write-through write policy, virtual tag and index attributes, and the MESI cache coherency protocol between the two cores. Table I shows the basic parameters of the configuration.

To evaluate the LRU2-MRU collaborative cache, we select a producer-consumer parallel application and 10 test applications from the MiBench benchmark: basicmath, stringsearch, sha, FFT, CRC32, patricia, dijkstra, qsort, susan, and bitcount.

TABLE I. CONFIGURATION OF BASIC PARAMETERS OF THE MULTI-CORE SYSTEM

Component | Parameters
Processor | two PowerPC e600 cores
L1 data cache | private, 4-way, 32-byte cache size, 1-cycle latency
L1 inst. cache | private, 4-way, 32-byte cache size, 1-cycle latency
L2 cache | shared, 8-way, cache sizes 1 KB, 2 KB, 3 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 1 MB, 4 MB

B. The Results of Cache Miss Ratio

Figure 3 presents the analysis of cache memory accesses for the parallel application with the shared cache: it shows the read-miss ratio of the shared cache when the parallel application is executed under different shared-cache replacement algorithms and different shared-cache sizes.


We can confirm that the LRU2-MRU collaborative cache replacement algorithm improves performance over LRU-MRU, LRU, LRU2, RAND, and MRU by 3.17%, 5.54%, 5.11%, 6.98%, and 8.03%, respectively. Once the cache size reaches a certain value, the cache miss ratio no longer changes.

Figure 3. Read-miss ratio of the shared cache under different replacement algorithms and cache sizes.

Figure 4. L2 cache miss ratio.

Figure 4 shows the L2 cache miss ratio for the LRU, LRU-MRU, and LRU2-MRU replacement algorithms when the L2 shared-cache size is set to 32 KB, measured on the 10 test applications selected from the MiBench benchmark. From Figure 4 we can confirm that the L2 cache miss ratio of the LRU2-MRU collaborative replacement algorithm is lower than that of LRU-MRU except on the qsort and susan applications, and lower than that of LRU except on the qsort and sha applications.

C. An Influence on MPKI

One of the important factors in cache performance evaluation is MPKI (misses per thousand instructions). This paper measures the MPKI of the 10 MiBench test applications under the LRU, LRU-MRU, and LRU2-MRU cache replacement algorithms.
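For reference, MPKI is defined by the usual formula:

```latex
\mathrm{MPKI} = \frac{\text{cache misses}}{\text{instructions executed}} \times 1000
```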

From Figure 5 we can see that on the multi-core platform the new algorithm outperforms LRU-MRU, and LRU-MRU outperforms LRU. Among the 10 test applications, the LRU-MRU replacement policy outperforms the LRU2-MRU collaborative replacement algorithm only on dijkstra and bitcount; on the other applications, LRU2-MRU is better than LRU-MRU.

Across the 10 test applications, the MPKI of the LRU2-MRU collaborative algorithm is on average 4.54% lower than that of LRU and 1.87% lower than that of LRU-MRU.

Figure 5. MPKI of the 10 test applications under the LRU, LRU-MRU, and LRU2-MRU cache replacement algorithms.

D. The Results of LRU2-MRU Stack Distance

This paper measures the stack distance of the LRU2-MRU collaborative cache using the bi-sim algorithm [9]. The access trace X and the access types are listed in the second and third columns of Table II. For the access type, y means the access is an LRU2 access and n means it is an MRU access. From the fourth column of Table II we can see how the priority numbers differ: when an access is an MRU access, its priority number is negative, and when it is an LRU2 access, its priority number is positive. The last column is the measured stack distance.

In the last column of Table II, an empty entry denotes a miss, while a stack distance k means the access hits in any cache of size T > k. The stack distance therefore tells us whether an access is a hit or a miss.
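As a reference point, the defining property above (hit if and only if cache size T > k) can be checked naively by simulating every cache size at once, reusing the LRU2MRUStack sketch from Section II.B. This is not the one-pass bi-sim algorithm, only a slow equivalent under our sketched policy; its distances need not reproduce Table II exactly, since the placement details there are partly ambiguous.

```python
def naive_stack_distances(trace, max_size):
    """Naive reference for the LRU2-MRU stack distance: an access has
    distance k if it hits in every cache of size > k and misses in every
    cache of size <= k. Inclusion (Theorem 1) makes this well defined."""
    caches = {c: LRU2MRUStack(c) for c in range(1, max_size + 1)}
    distances = []
    for addr, lru2 in trace:
        hit_sizes = [c for c, cache in caches.items()
                     if cache.access(addr, lru2)]
        # distance = smallest hitting size minus 1; None marks a miss
        # at every simulated size (e.g. a cold miss).
        distances.append(min(hit_sizes) - 1 if hit_sizes else None)
    return distances

# First five accesses of Table II, encoded as (address, is_LRU2):
trace = [("f", True), ("h", True), ("i", False), ("c", False), ("i", True)]
print(naive_stack_distances(trace, max_size=8))
# -> [None, None, None, None, 3] under this sketch's eviction rule
```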

IV. CONCLUSION

Since the shared cache is an important resource in a multi-core system and its replacement algorithm influences system performance, we have studied shared-cache simulation for multi-core systems with the LRU2-MRU collaborative cache replacement algorithm. In this paper we proved the inclusion property and measured the stack distance of the LRU2-MRU collaborative cache, observing that the priority numbers differ with the access type. The simulation results demonstrate that a shared cache managed by the LRU2-MRU replacement algorithm reduces conflicting accesses, lowers the MPKI, and enables the system to achieve better performance and higher processing efficiency for parallel applications on a multi-core system.


REFERENCES

[1] X. Gu, T. Gao, and C. Ding, "On the theory and potential of LRU-MRU collaborative cache management," PACT'11 SRC, San Jose, California, USA, June 4-5, 2011, pp. 43-54.

[2] M. D. Hill, "Amdahl's law in the multicore era," IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), 2008, pp. 33-38.

[3] B. H. Zhou, J. Z. Qiao, and S. K. Lin, "Research on the dynamic allocation algorithm of shared-cache for multi-core processor," Journal of Northeastern University (Natural Science), Jan. 2011, pp. 44-47.

[4] X. Gu, T. Gao, C. Zhang, R. Archambault, and C. Ding, "P-OPT: Program-directed optimal cache management," in Proceedings of the Workshop on Languages and Compilers for Parallel Computing, LNCS 5335, 2008, pp. 217-231.

[5] K. Kedzierski, M. Moreto, F. J. Cazorla, and M. Valero, "Adapting cache partitioning algorithms to pseudo-LRU replacement policies," IEEE International Symposium on Parallel & Distributed Processing (IPDPS), April 2010.

[6] G. Suo and X. J. Yang, "Shared-cache partition of dual-core processor," Microelectronics & Computer, Jul. 2008, pp. 28-30.

[7] X. Liao, "Study and implementation of an improved cache concept based on LRU algorithm," Electronic Engineer, Jul. 2008, pp. 46-48.

[8] X. M. Lin, T. Gui, F. M. Qiao, and T. S. Hu, "Algorithm of bubble replacement in multicore shared-cache," Microelectronics & Computer, April 2011, pp. 28-30.

[9] W. Wong and J. L. Baer, "Modified LRU policies for improving second-level cache behavior," in Proceedings of HPCA-6, 2000, pp. 49-60.

[10] C. Ding and Y. Zhong, "Predicting whole-program locality with reuse distance analysis," in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2003, pp. 245-257.

[11] https://www.simics.net/

[12] http://www.freescale.com.cn/

[13] R. A. Sugumar and S. G. Abraham, "Efficient simulation of caches under optimal replacement with applications to miss characterization," in Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993.

TABLE II. THE STACK DISTANCE OF THE LRU2-MRU COLLABORATIVE CACHE

Trace no. | Access | LRU2? | Priority list | Stack distance
0 | f | y | 0 |
1 | h | y | 1 0 |
2 | i | n | -2 0 |
3 | c | n | -3 0 |
4 | i | y | 4 0 | 2
5 | b | y | 5 4 0 |
6 | e | n | -6 4 0 |
7 | b | y | 7 4 0 | 2
8 | b | y | 8 4 0 | 1
9 | d | n | -9 4 0 |
10 | g | y | 10 -9 4 0 |
11 | e | y | 11 10 4 0 | 4
12 | b | y | 12 11 10 4 0 | 4
13 | a | y | 13 12 11 10 4 0 |
14 | d | y | 14 13 12 11 10 4 0 | 5
15 | c | y | 15 14 13 12 11 10 0 | 6
16 | a | y | 16 15 14 13 11 10 0 | 3
17 | e | y | 17 16 15 13 11 10 0 | 5
18 | i | y | 18 17 16 13 11 10 0 | 7
19 | c | n | -19 18 17 16 13 11 10 0 | 4
20 | f | n | -20 -19 18 17 16 13 11 10 0 | 8
21 | a | y | 21 -20 -19 18 17 16 13 10 0 | 5
22 | b | y | 22 21 -20 -19 18 16 13 10 0 | 6
23 | c | n | -23 21 22 -19 18 16 13 10 0 | 4
24 | f | n | -24 -21 22 -19 -23 16 13 10 0 | 4
25 | c | y | 25 -21 22 -19 -23 16 13 10 0 | 2
26 | e | y | 26 25 -21 19 22 -23 13 10 0 | 7
27 | i | n | -27 25 -21 26 22 -23 13 10 0 | 7
28 | c | y | 28 -27 -21 26 22 -23 13 13 0 | 3
29 | f | y | 29 28 -21 26 22 -27 13 10 0 | 4

(An empty stack-distance entry denotes a miss.)
