
Page 1:

Dynamic Cache Clustering for Chip Multiprocessors

Mohammad Hammoud, Sangyeun Cho, and Rami Melhem

Dept. of Computer Science, University of Pittsburgh

Page 2:

Tiled CMP Architectures

Tiled CMP Architectures have recently been advocated as a scalable design.

They replicate identical building blocks (tiles) and connect them with a switched network-on-chip (NoC).

A tile typically incorporates a private L1 cache and an L2 cache bank.

Two traditional practices of CMP caches:

a. One bank to one core assignment (Private Scheme).

b. One bank to all cores assignment (Shared Scheme).

Page 3:


Private and Shared Schemes

Private Scheme:

• A core maps and locates a cache block, B, to and from its local L2 bank.

• Coherence maintenance is required at both the L1 and the L2 levels.

• Data is read very fast, but the cache miss rate can be high.

Shared Scheme:

• A core maps and locates a cache block, B, to and from a target tile, referred to as the static home tile (SHT) of B, which is selected using the home-select (HS) bits of B’s physical address.

• Coherence is required only at the L1 level.

• Cache miss rate is low but data reads are slow (NUCA design).

(Figure: B’s physical address, with the home-select (HS) bits highlighted.)

Page 4:


The Degree of Sharing

Sharing Degree (SD), or the number of cores that share a given pool of cache banks, could be set somewhere between the shared and the private designs.

1-1 assignment (Private Design), 2-2 assignment, 4-4 assignment, 8-8 assignment, 16-16 assignment (Shared Design).

Page 5:


Static Designs’ Principal Deficiency

The aforementioned static designs are subject to a principal deficiency:

In reality, computer applications exhibit different cache demands. A single application may go through different phases corresponding to distinct code regions invoked during its execution. Program phases can be characterized by different L2 cache miss rates and durations.

The static designs all partition the available cache capacity statically and do not tolerate the variability among working sets, or among the phases of a single working set.

Page 6:

Our work

Dynamically monitor the behaviors of the programs running on different CMP cores.

Adapt to each program’s cache demand by offering fine-grained bank-to-core assignments (a technique we refer to as cache clustering).

Introduce novel mapping and location strategies to manage dynamic cache designs in tiled CMPs.

(CD = Cluster Dimension)

Page 7:

Talk roadmap

The proposed dynamic cache clustering (DCC) scheme.

Performance metrics.

DCC algorithm.

DCC mapping strategy.

DCC location strategy.

Quantitative evaluation.

Concluding remarks.

Page 8:


The Proposed Scheme

We denote the L2 cache banks that can be assigned to a specific core, i, as i’s cache cluster.

We further denote the number of banks in the cache cluster of core i as the cache cluster dimension of core i (CD_i).

We propose a dynamic cache clustering (DCC) scheme where:

• Each core is initially started up with a specific cache cluster.

• After every time period T (a potential re-clustering point), the cache cluster of a core is dynamically contracted, expanded, or kept intact, depending on the cache demand experienced by that core.

Page 9:

Performance Metrics

The basic trade-offs of varying the dimension of a cache cluster are the average L2 access latency and the L2 miss rate.

Average L2 access latency (AAL) increases strictly with the cluster dimension.

L2 miss rate (MR) decreases as the cluster dimension increases.

Improving either AAL or MR alone does not necessarily translate into an improvement in overall system performance.

Improving either of the following metrics, however, typically translates into better system performance.

Average L1 Miss Time: AMT_L1 = AAL_L2 + MissRate_L2 × MissPenalty_L2

Average Memory Access Time: AMAT_L1 = (1 − MissRate_L1) × HitTime_L1 + MissRate_L1 × AMT_L1
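For illustration only (these numbers are hypothetical, loosely borrowed from the simulation parameters given later in the talk): with HitTime_L1 = 1 cycle, MissRate_L1 = 0.05, AAL_L2 = 12 cycles, MissRate_L2 = 0.2, and MissPenalty_L2 = 300 cycles, AMT_L1 = 12 + 0.2 × 300 = 72 cycles and AMAT_L1 = 0.95 × 1 + 0.05 × 72 = 4.55 cycles. If contracting a cluster lowered AAL_L2 to 6 cycles but raised MissRate_L2 to 0.3, then AMT_L1 = 96 cycles and AMAT_L1 = 5.75 cycles, i.e., the contraction would hurt overall performance despite the lower access latency.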

Page 10:


DCC Algorithm

The AMAT metric can be utilized to judiciously gauge the benefit of varying the cache cluster dimension of a certain core i.

At every potential re-clustering point:

• The AMAT_i experienced by a process P running on core i (AMAT_i,current) is evaluated and stored.

• The previously stored value (AMAT_i,previous) is then subtracted from AMAT_i,current.

Assume a contraction action has been taken previously:

• A positive difference indicates that AMAT_i has increased. Hence, we back off and expand P’s cluster.

• A negative difference indicates that AMAT_i has decreased. We hence contract P’s cluster one step further, predicting more benefit (see the sketch below).
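A minimal C sketch of this decision rule, under two stated assumptions: the rule for a previous expansion mirrors the contraction case above, and cluster dimensions step by a factor of two between 1 and 16. The names are illustrative, and the conditions for keeping a cluster intact (tied to the thresholds the talk later calls Tl and Tg) are omitted here:

```c
enum action { CONTRACT, EXPAND };

struct core_state {
    double amat_prev;      /* AMAT_i stored at the previous re-clustering point */
    enum action prev_act;  /* action taken at that point                        */
    int cd;                /* current cache cluster dimension (1, 2, 4, 8, 16)  */
};

static void at_reclustering_point(struct core_state *c, double amat_curr)
{
    /* delta > 0 means AMAT_i has increased since the last action. */
    double delta = amat_curr - c->amat_prev;
    enum action next;

    if (c->prev_act == CONTRACT)
        /* Contraction hurt: back off and expand; it helped: contract further. */
        next = (delta > 0) ? EXPAND : CONTRACT;
    else
        /* Assumed mirror rule when the previous action was an expansion. */
        next = (delta > 0) ? CONTRACT : EXPAND;

    if (next == CONTRACT && c->cd > 1)
        c->cd /= 2;                       /* contract the cluster one step */
    else if (next == EXPAND && c->cd < 16)
        c->cd *= 2;                       /* expand the cluster one step   */

    c->amat_prev = amat_curr;
    c->prev_act  = next;
}
```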

Page 11:


DCC Mapping Strategy

Varying the cache cluster dimension (CD) of each core over time requires a mapping function that maps cache blocks to cache clusters of any allowed dimension.

Assume that a core i requests a cache block B. If CD_i < 16 (in a 16-tile CMP, for instance), B is mapped to a dynamic home tile (DHT) different from the static home tile (SHT) of B.

The DHT of B depends on CD_i. With CD_i smaller than 16, only a subset of the bits from the HS field of B’s physical address needs to be utilized to determine B’s DHT (e.g., 3 bits from HS are used if CD_i = 8).

We developed the following generic function to determine the DHT of block B, where ID is the binary representation of core i, MB is a set of masking bits, and ~MB is the bitwise complement of MB:

DHT = (HS & MB) + (ID & ~MB)
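A minimal C sketch of this mapping function for a 16-tile (4-bit ID) CMP. The per-CD masking bits below are chosen to reproduce the worked example on the next slide; treat the mb_for_cd() table as an assumption about how MB is selected, not the paper’s definitive choice:

```c
#include <stdint.h>

/* Masking bits for each cluster dimension: log2(CD) bits of MB are set
 * (values assumed so that the worked example on the next slide holds). */
static uint8_t mb_for_cd(int cd)
{
    switch (cd) {
    case 1:  return 0x0;  /* 0000: DHT is the requesting core's own tile */
    case 2:  return 0x1;  /* 0001: 1 HS bit used                         */
    case 4:  return 0x5;  /* 0101: 2 HS bits used                        */
    case 8:  return 0x7;  /* 0111: 3 HS bits used                        */
    default: return 0xF;  /* 1111: CD = 16, DHT = SHT (shared scheme)    */
    }
}

/* DHT = (HS & MB) + (ID & ~MB), with all quantities 4 bits wide. */
static uint8_t dht(uint8_t hs, uint8_t id, int cd)
{
    uint8_t mb = mb_for_cd(cd);
    return (uint8_t)((hs & mb) + (id & (uint8_t)~mb)) & 0xF;
}
```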

Page 12:


DCC Mapping Strategy: A Working Example

Assume core 5 (ID = 0101) requests cache block B with HS = 1111.

CD = 8: DHT = (1111 & 0111) + (0101 & 1000) = 0111 (tile 7)

CD = 4: DHT = (1111 & 0101) + (0101 & 1010) = 0101 (tile 5)

CD = 2: DHT = (1111 & 0001) + (0101 & 1110) = 0101 (tile 5)

CD = 16: DHT = (1111 & 1111) + (0101 & 0000) = 1111 (tile 15, B’s SHT)

CD = 1: DHT = (1111 & 0000) + (0101 & 1111) = 0101 (tile 5, core 5’s own tile)
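As a quick sanity check, the sketch from the previous slide reproduces these five results (hypothetical snippet; it assumes the mb_for_cd() and dht() functions from that sketch are in scope):

```c
#include <stdio.h>

int main(void)
{
    int cds[] = {8, 4, 2, 16, 1};                 /* same order as the slide */
    for (int k = 0; k < 5; k++)
        printf("CD = %2d -> DHT = tile %2u\n", cds[k], dht(0xF, 0x5, cds[k]));
    /* Expected output: tiles 7, 5, 5, 15, 5 */
    return 0;
}
```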

Page 13:


DCC Location Strategy

The generic mapping function we defined can’t be used straightforwardly to locate cache blocks.

Assume cache block B with HS = 1111 is requested by core 0 (ID = 0000) while CD_0 = 8:

DHT = (1111 & 0111) + (0000 & 1000) = 0111, so B is cached at tile 7.

Assume the cache cluster of core 0 is then contracted to CD_0 = 4 and B is requested again:

DHT = (1111 & 0101) + (0000 & 1010) = 0101, i.e., tile 5: the lookup misses although B still resides at tile 7.

Page 14:


DCC Location Strategy

Solution 1: re-copy all blocks upon a re-clustering action. Very expensive.

Solution 2: after missing at B’s current DHT (tile 5), access B’s SHT (tile 15) to locate B at tile 7. Slow: inter-tile communication between tiles 0, 5, 15, 7, and lastly 0.

Solution 3: send the L2 request directly to B’s SHT instead of sending it first to B’s DHT and then possibly to B’s SHT. Slow: inter-tile communication between tiles 0, 15, 7, and lastly 0.

Page 15:


DCC Location Strategy

Solution 4: send simultaneous requests only to the tiles that are potential DHTs of B.

• The potential DHTs of B can be easily determined by varying MB (and its complement) in the DCC mapping function over the cluster dimensions 1, 2, 4, 8, and 16 (see the sketch below).

• Upper bound = log2(NumberOfTiles) + 1
• Lower bound = 1
• Average = 1 + (1/2) log2(n) (i.e., for 16 tiles, 3 messages per request)
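A minimal sketch of how the potential DHTs could be enumerated; it reuses the dht() sketch from the DCC mapping slide, and the function name and interface are illustrative:

```c
/* Collect the distinct DHTs a block (home-select bits `hs`) could have been
 * mapped to by core `id` under any cluster dimension 1..16; L2 requests are
 * then sent only to these tiles. */
static int candidate_dhts(uint8_t hs, uint8_t id, uint8_t out[5])
{
    const int cds[] = {1, 2, 4, 8, 16};
    int n = 0;
    for (int k = 0; k < 5; k++) {
        uint8_t t = dht(hs, id, cds[k]);
        int seen = 0;
        for (int j = 0; j < n; j++)
            if (out[j] == t)
                seen = 1;
        if (!seen)
            out[n++] = t;          /* keep distinct tiles only */
    }
    return n;  /* 1 when hs == id; at most log2(16) + 1 = 5 */
}
```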

Page 16:


Quantitative Evaluation: Methodology

System Parameters:

• We simulate a 16-way tiled CMP.
• Simulator: Simics 3.0.29 (Solaris OS).
• Cache line size: 64 bytes.
• L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle.
• L2 size/ways/latency: 512KB per bank / 16 ways / 12 cycles.
• Latency per hop: 3 cycles.
• Memory latency: 300 cycles.
• L1 and L2 replacement policy: LRU.

Benchmarks: SPECJBB, OCEAN, BARNES, LU, RADIX, FFT, MIX1 (16 copies of HMMER), MIX2 (16 copies of SPHINX), MIX3 (BARNES, LU, MILC, MCF, BZIP2, and HMMER; 2 threads/copies each).

Page 17:


Comparing With Static Schemes

We first study the average L1 miss time (AMT) across FS1, FS2, FS4, FS8, FS16, and DCC.


DCC outperforms FS16, FS8, FS4, FS2, and FS1 by averages of 6.5%, 8.6%, 10.1%, 10%, and 4.5%, respectively, and by as much as 21.3%.

Page 18:


Comparing With Static Schemes

We next study the L2 miss rate across FS1, FS2, FS4, FS8, FS16, and DCC.

No single static scheme provides the best miss rate for all the benchmarks. DCC always provides miss rates comparable to the best static alternative.

Page 19:


Comparing With Static Schemes

We then study execution time across FS1, FS2, FS4, FS8, FS16, and DCC.

The superiority of DCC in AMT translates to better overall performance. DCC always provides performance comparable to the best static alternative.

Page 20:


Sensitivity Study

We also study the sensitivity of DCC to different {T, Tl, Tg} values.

DCC is not very sensitive to the values of the parameters {T, Tl, Tg}. Overall, DCC performs slightly better with T = 100K than with T = 300K.

Page 21:


Comparing With Cooperative Caching

Finally, we compare DCC against the cooperative caching (CC) scheme. CC is based on FS1 (the private scheme).


DCC outperforms CC by an average of 1.59%. The basic problem with CC is that it spills blocks without knowing if spilling helps or hurts cache performance (a problem addressed recently in HPCA09).

Page 22:

Concluding Remarks

This paper proposes DCC, a distributed cache management scheme for large-scale chip multiprocessors.

Contrary to static designs, DCC adapts to working-set irregularities.

We propose generic mapping and location strategies that can be utilized for both static designs (with different sharing degrees) and dynamic designs in tiled CMPs.

The proposed DCC location strategy can be improved (with regard to reducing the number of messages per request) by maintaining a small history of a specific cluster’s expansions and contractions.

For instance, with an activity chain of 16-8-4, we can predict that a requested block cannot exist at a DHT corresponding to CD = 1 or 2, and is more likely to reside at the DHTs corresponding to CD = 4 and 8 than at the DHT corresponding to CD = 16.

Page 23:

Dynamic Cache Clustering for Chip Multiprocessors

M. Hammoud, S. Cho, and R. Melhem

Dept. of Computer Science, University of Pittsburgh

Thank you!