SOS: A Software-Oriented Distributed Shared Cache Management Approach for
Chip Multiprocessors
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
PACT 2009
Chip Multiprocessor Development
The end of performance scaling in uniprocessors has turned researchers to chip multiprocessor architectures
The number of cores is increasing at a fast pace
[Chart: core count vs. timeline, 1998-2010, charting Pentium 4, Power5, Pentium D, Core 2 Duo, Athlon X2, Power6, Phenom X3, Core i7, Phenom X4, and Opteron. Source: Wikipedia]
The CMP Cache
A CMP = N cores + one (coherent) cache system
How can one cache system sustain the growth of N cores?
[Diagram: a 4x4 grid of 16 tiles; each tile holds a core with private L1 I/D cache, an L2 cache slice, a directory, and a router]
Non-Uniform Cache Architecture (NUCA)
Shared cache scheme vs. private cache scheme
Hybrid Cache Schemes
• Victim Replication [Zhang and Asanovic ISCA '05]
• Adaptive Selective Replication [Beckmann et al. MICRO '06]
• CMP-NuRAPID [Chishti et al. ISCA '05]
• Cooperative Caching [Chang and Sohi ISCA '06]
• R-NUCA [Hardavellas et al. ISCA '09]
Problems with hardware-based schemes:
• Hardware complexity
• Limited scalability
The Challenge
CMPs provide scalability in the core count
A cache system with scalable performance is critical in CMPs
Available hardware-based schemes have failed to provide it
We propose a Software-Oriented Shared (SOS) cache management approach:
• Minimum hardware support
• Good scalability
Our Contributions
We studied access patterns in multithreaded workloads and found they can be utilized to improve locality
We proposed the SOS scheme, which offloads the work from hardware to software analysis
We evaluated our scheme and showed that it is a promising approach
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
Observation
L2 cache access distribution of Cholesky:
[Chart: cumulative percentage of accesses vs. sharer count (0-16). "Total Sharer#" counts accesses to blocks shared by a given number of threads or fewer during the whole execution; "Concurrent Sharer#" counts accesses to blocks shared by that many threads or fewer simultaneously]
Observation
L2 cache accesses are skewed at the two extremes:
[Chart: same cumulative distribution of accesses vs. sharer count; roughly 50% of accesses are highly shared accesses and roughly 30% are private data accesses]
Access Patterns
Static data vs. dynamic data:
• Static data: location and size are known prior to execution (e.g., global data)
• Dynamic data: location and size vary among executions, but patterns may persist (e.g., data allocated by malloc(), stack data)
• Dynamic data is more important than static data
Common access patterns for dynamic data are:
• Even partition
• Scattered
• Dominant owner
• Shared
Even Partition Pattern
A contiguous memory space is partitioned evenly among threads (a runnable sketch follows below):

Main thread:
  Array = malloc(sizeof(int) * NumProc * N);

Thread [ProcNo]:
  for (i = 0; i < N; i++)
    Array[ProcNo * N + i] = x;

[Diagram: the array split into four contiguous chunks owned by T0-T3]
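A minimal runnable sketch of this pattern using POSIX threads; NUMPROC, N, and the value written are illustrative stand-ins for the slide's NumProc, N, and x:

#include <pthread.h>
#include <stdlib.h>

#define NUMPROC 4
#define N       1024

static int *Array;

static void *worker(void *arg)
{
    long proc_no = (long)arg;
    /* Each thread writes only its own contiguous N-element slice. */
    for (long i = 0; i < N; i++)
        Array[proc_no * N + i] = (int)i;   /* "x" on the slide */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUMPROC];
    Array = malloc(sizeof(int) * NUMPROC * N);
    for (long p = 0; p < NUMPROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUMPROC; p++)
        pthread_join(tid[p], NULL);
    free(Array);
    return 0;
}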
Scattered Pattern
Memory spaces are not contiguous, but each is owned by one thread:

Main thread:
  ArrayPtr = malloc(sizeof(int *) * NumProc);
  for (i = 0; i < NumProc; i++)
    ArrayPtr[i] = malloc(sizeof(int) * Size[i]);

Thread [ProcNo]:
  for (i = 0; i < Size[ProcNo]; i++)
    ArrayPtr[ProcNo][i] = i;

[Diagram: per-thread arrays T0-T3 separated by gaps in the address space]
Other Patterns
Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others
Shared: data are widely shared
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
SOS Scheme
The SOS scheme consists of 3 components:
• L2 cache access profiling (one-time offline analysis)
• Page clustering & pattern recognition (one-time offline analysis)
• Page coloring and replication (run-time)
Page Clustering
We take a machine learning-based approach (a sketch of the clustering step follows below):
[Diagram: per-thread L2 cache access traces (T0-T3) for a dynamic area are summarized into per-page histograms, which K-means clustering groups around the initial centroids C0 (1,0,0,0), C1 (0,1,0,0), C2 (0,0,1,0), C3 (0,0,0,1), C4 (1,1,1,1)]
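A minimal sketch of the clustering step, assuming 4 threads, per-page access histograms normalized to [0,1], and the five fixed initial centroids from the slide; it shows only the assignment step of K-means, while the full algorithm alternates assignment with centroid updates until convergence:

#include <math.h>

#define NTHREADS  4
#define NCLUSTERS 5

/* Initial centroids: "owned by thread i" for C0-C3, "shared by all" for C4. */
static const double centroid[NCLUSTERS][NTHREADS] = {
    {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {1, 1, 1, 1}
};

/* Assign one page's normalized access histogram to the nearest centroid
 * by squared Euclidean distance. */
static int nearest_cluster(const double hist[NTHREADS])
{
    int best = 0;
    double best_d = INFINITY;
    for (int c = 0; c < NCLUSTERS; c++) {
        double d = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            double diff = hist[t] - centroid[c][t];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}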
Pattern Recognition
Assume a dynamic area consists of 8 pages:
[Scatter diagram: pages P0-P7 plotted against the initial K-means centroids C0 (1,0,0,0) through C4 (1,1,1,1); some pages are accessed mostly by thread 0, some mostly by thread 3, and some are highly shared]
Pattern Recognition
After clustering, the resulting page groups (P0 – P1, P2 – P3, P4 – P5, P6 – P7) are compared against the ideal partition to identify the pattern (a comparison sketch follows below)
[Diagram: the clustered pages from the previous slide, with each pair matched against the ideal even partition]
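A sketch of the comparison step, under the assumption (not spelled out on the slide) that a dynamic area is labeled "even partition" when each thread's cluster owns one contiguous, equal-sized run of pages:

/* label[p] is the cluster assigned to page p (cluster c = "owned by thread c").
 * Returns 1 if the labels match the ideal even partition, 0 otherwise. */
static int is_even_partition(const int label[], int npages, int nthreads)
{
    if (npages % nthreads != 0)
        return 0;                 /* uneven page counts not handled here */
    int per = npages / nthreads;
    for (int p = 0; p < npages; p++)
        if (label[p] != p / per)
            return 0;
    return 1;
}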
Hints Representation & Utilization
For dynamic data, a pattern type is associated with every dynamic allocation system call: [FileName, Line#, Pattern Type]
For static data, the page location is given explicitly: [Virtual Page Num, Tile ID]
SOS data management policy (a sketch of the hint records follows below):
• The pattern type is translated into an actual partition once the dynamic area's location and size are known by the OS
• A page's location is assigned on demand if the partition information (hint) is available
• Data without corresponding hints are treated as highly shared and distributed at block level
• Data replication is enabled for shared data
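The hint records above might be represented as below; the field types and names are assumptions, and only the [FileName, Line#, Pattern Type] and [Virtual Page Num, Tile ID] contents come from the slide:

typedef enum {               /* pattern types from the slides */
    PAT_EVEN_PARTITION, PAT_SCATTERED, PAT_DOMINANT_OWNER, PAT_SHARED
} pattern_t;

typedef struct {             /* hint for a dynamic allocation site */
    const char *file_name;   /* source file of the allocation call */
    int         line;        /* line number of the call            */
    pattern_t   pattern;     /* recognized access pattern          */
} dynamic_hint_t;

typedef struct {             /* hint for static data */
    unsigned long vpn;       /* virtual page number */
    int           tile_id;   /* home tile ID        */
} static_hint_t;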
Architectural Support
To allow flexible data placement in the L2 cache, we add two fields to each page table entry and TLB entry [Jin and Cho CMP-MSI '07, Cho and Jin MICRO '06]
The OS is responsible for providing TID and BIN:
• Main memory access is the same as before, using the translated physical page address
• The L2 cache addressing mode depends on the values of TID and BIN
[Diagram: a TLB entry with Virtual Page Number, Physical Page Number, P, TID, and BIN fields; the physical page number forms the physical address for main memory access, while TID and BIN locate the page in the L2 cache (see the sketch below)]
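A sketch of how the two fields might steer L2 slice selection; the slide does not define the TID/BIN semantics or bit widths, so this assumes TID names a home tile and BIN flags whether a page-grained hint exists, with unhinted data falling back to block-level interleaving per the SOS policy above:

#include <stdint.h>

#define NUM_TILES  16        /* 4x4 mesh, one L2 slice per tile */
#define BLOCK_BITS 6         /* 64-byte cache blocks (assumed)  */

typedef struct {             /* augmented TLB entry (sketch) */
    uint64_t vpn;            /* virtual page number             */
    uint64_t ppn;            /* physical page number            */
    uint8_t  present;        /* P bit                           */
    uint8_t  tid;            /* target tile ID from the OS      */
    uint8_t  bin;            /* assumed: 1 = hinted page,
                                0 = block-interleaved placement */
} tlb_entry_t;

/* Choose the L2 slice that homes this physical address. */
static int l2_home_tile(const tlb_entry_t *e, uint64_t paddr)
{
    if (e->bin)
        return e->tid;                                   /* whole page in one tile */
    return (int)((paddr >> BLOCK_BITS) % NUM_TILES);     /* block interleaving     */
}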
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
Experiment Setup
We use a Simics-based memory simulator, modeling a 16-tile CMP with a 4x4 2D mesh on-chip network
Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice
Programs from the SPLASH-2 and PARSEC suites are selected as benchmarks, with 3 different input sizes
The small input set is used to profile and generate hints, while the medium and large input sets are used to evaluate SOS performance
For brevity, we only present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs
Hint Accuracy
Accuracy is measured by the percentage of pages that are placed in the tile with the most accesses
[Bar chart: hint accuracy (0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, for small and medium inputs]
Breakdown of L2 Cache Accesses
[Stacked bar chart: per-program breakdown (0-100%) of L2 cache accesses into Static, Dominant Owner, Private Scatter, Ordered Scatter, Even Partition, and Shared]
Patterns vary among different programs
A large percentage of L2 accesses can be tackled by page placement
The shared data are evenly distributed and handled by replication
Remote Access Comparison
Hint-guided data placement significantly reduces the number of remote cache accesses
Our SOS scheme removes nearly 87% of remote accesses!
[Bar chart: remote accesses (normalized, 0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, under Shared, VR, Hints Only, and SOS]
Execution Time
Hint-guided data placement tracks private cache performance closely
SOS performs nearly 20% better than the shared cache scheme
[Bar chart: execution time (normalized, 0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, under Shared, Private, VR, Hints Only, and SOS]
Related Work
Lu et al. PACT '09
• Analyzes array accesses and performs data layout transformation to improve data affinity
Marathe and Mueller PPoPP '06
• Profiles a truncated program run before every execution
• Derives optimal page locations based on the sampled access trace
• Optimizes data locality for cc-NUMA
Hardavellas et al. ISCA '09
• Dynamically identifies private and shared pages
• Private mapping for private pages and fine-grained broadcast mapping of shared pages
• Focuses on server workloads
Conclusions
We propose a software-oriented approach for shared cache management: controlling data placement and replication
This is the first work on a software-managed distributed shared cache scheme for CMPs
We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity
We demonstrate through experiments that software-oriented shared cache management is a promising approach
• 19% performance improvement over the shared cache scheme
Thank you and Questions?
Future Work
Further study of more complex access patterns can expose more benefits of our software-oriented cache management scheme
Extend the current scheme to server workloads, which exhibit very different cache behaviors from scientific workloads
Hint Coverage
Hint coverage measures the percentage of L2 cache accesses to the pages guided by SOS
[Bar chart: hint coverage (0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, for small and medium inputs]