SOS: A Software-Oriented Distributed Shared Cache Management Approach for
Chip Multiprocessors
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
PACT 2009
Chip Multiprocessor Development
The end of performance scaling in uniprocessors has turned researchers to chip multiprocessor architectures
The number of cores is increasing at a fast pace
[Chart: core count vs. timeline, 1998-2010, charting Pentium 4, Power5, Pentium D, Core 2 Duo, Athlon X2, Power6, Phenom X3, Core i7, Phenom X4, and Opteron. Source: Wikipedia]
The CMP Cache
A CMP = N cores + one (coherent) cache system
How can one cache system sustain the growth of N cores?
[Diagram: a 4x4 grid of 16 tiles; each tile holds a core with private L1 I/D cache, an L2 cache slice, a directory, and a router]
Non-Uniform Cache Architecture (NUCA)
Shared cache scheme vs. private cache scheme
Hybrid Cache Schemes
• Victim Replication [Zhang and Asanovic ISCA '05]
• Adaptive Selective Replication [Beckmann et al. MICRO '06]
• CMP-NuRAPID [Chishti et al. ISCA '05]
• Cooperative Caching [Chang and Sohi ISCA '06]
• R-NUCA [Hardavellas et al. ISCA '09]
Problems with hardware-based schemes:
• Hardware complexity
• Limited scalability
The Challenge
CMPs provide scalability in the core count
A cache system with scalable performance is critical in CMPs
Available hardware-based schemes have failed to provide it
We propose a Software-Oriented Shared (SOS) cache management approach:
• Minimum hardware support
• Good scalability
Our Contributions
We studied access patterns in multithreaded workloads and found they can be utilized to improve locality
We proposed the SOS scheme, which offloads the work from hardware to software analysis
We evaluated our scheme and showed that it is a promising approach
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
Observation
L2 cache access distribution of Cholesky:
[Chart: cumulative percentage of accesses vs. sharer count (0-16). "Total Sharer#" counts accesses to blocks shared by a given number of threads or fewer during the whole execution; "Concurrent Sharer#" counts accesses to blocks shared by that many threads or fewer simultaneously]
Observation
L2 cache accesses are skewed at the two extremes:
[Chart: same cumulative distribution of accesses vs. sharer count; roughly 50% of accesses are highly shared accesses and roughly 30% are private data accesses]
Access Patterns
Static data vs. dynamic data:
• Static data: location and size are known prior to execution (e.g., global data)
• Dynamic data: location and size vary among executions, but patterns may persist (e.g., data allocated by malloc(), stack data)
• Dynamic data is more important than static data
Common access patterns for dynamic data are:
• Even partition
• Scattered
• Dominant owner
• Shared
Even Partition Pattern
A contiguous memory space is partitioned evenly among threads (a runnable sketch follows below):

Main thread:
  Array = malloc(sizeof(int) * NumProc * N);

Thread [ProcNo]:
  for (i = 0; i < N; i++)
    Array[ProcNo * N + i] = x;

[Diagram: the array split into four contiguous chunks owned by T0-T3]
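A minimal runnable sketch of this pattern using POSIX threads; NUMPROC, N, and the value written are illustrative stand-ins for the slide's NumProc, N, and x:

#include <pthread.h>
#include <stdlib.h>

#define NUMPROC 4
#define N       1024

static int *Array;

static void *worker(void *arg)
{
    long proc_no = (long)arg;
    /* Each thread writes only its own contiguous N-element slice. */
    for (long i = 0; i < N; i++)
        Array[proc_no * N + i] = (int)i;   /* "x" on the slide */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUMPROC];
    Array = malloc(sizeof(int) * NUMPROC * N);
    for (long p = 0; p < NUMPROC; p++)
        pthread_create(&tid[p], NULL, worker, (void *)p);
    for (int p = 0; p < NUMPROC; p++)
        pthread_join(tid[p], NULL);
    free(Array);
    return 0;
}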
Scattered Pattern
Memory spaces are not contiguous, but each is owned by one thread:

Main thread:
  ArrayPtr = malloc(sizeof(int *) * NumProc);
  for (i = 0; i < NumProc; i++)
    ArrayPtr[i] = malloc(sizeof(int) * Size[i]);

Thread [ProcNo]:
  for (i = 0; i < Size[ProcNo]; i++)
    ArrayPtr[ProcNo][i] = i;

[Diagram: per-thread arrays T0-T3 separated by gaps in the address space]
Other Patterns
Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others
Shared: data are widely shared
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
SOS Scheme
The SOS scheme consists of 3 components:
• L2 cache access profiling (one-time offline analysis)
• Page clustering & pattern recognition (one-time offline analysis)
• Page coloring and replication (run-time)
Page Clustering
We take a machine learning-based approach (a sketch of the clustering step follows below):
[Diagram: per-thread L2 cache access traces (T0-T3) for a dynamic area are summarized into per-page histograms, which K-means clustering groups around the initial centroids C0 (1,0,0,0), C1 (0,1,0,0), C2 (0,0,1,0), C3 (0,0,0,1), C4 (1,1,1,1)]
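A minimal sketch of the clustering step, assuming 4 threads, per-page access histograms normalized to [0,1], and the five fixed initial centroids from the slide; it shows only the assignment step of K-means, while the full algorithm alternates assignment with centroid updates until convergence:

#include <math.h>

#define NTHREADS  4
#define NCLUSTERS 5

/* Initial centroids: "owned by thread i" for C0-C3, "shared by all" for C4. */
static const double centroid[NCLUSTERS][NTHREADS] = {
    {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {1, 1, 1, 1}
};

/* Assign one page's normalized access histogram to the nearest centroid
 * by squared Euclidean distance. */
static int nearest_cluster(const double hist[NTHREADS])
{
    int best = 0;
    double best_d = INFINITY;
    for (int c = 0; c < NCLUSTERS; c++) {
        double d = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            double diff = hist[t] - centroid[c][t];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}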
Pattern Recognition
Assume a dynamic area consists of 8 pages:
[Scatter diagram: pages P0-P7 plotted against the initial K-means centroids C0 (1,0,0,0) through C4 (1,1,1,1); some pages are accessed mostly by thread 0, some mostly by thread 3, and some are highly shared]
Pattern Recognition
After clustering, the resulting page groups (P0 – P1, P2 – P3, P4 – P5, P6 – P7) are compared against the ideal partition to identify the pattern (a comparison sketch follows below)
[Diagram: the clustered pages from the previous slide, with each pair matched against the ideal even partition]
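A sketch of the comparison step, under the assumption (not spelled out on the slide) that a dynamic area is labeled "even partition" when each thread's cluster owns one contiguous, equal-sized run of pages:

/* label[p] is the cluster assigned to page p (cluster c = "owned by thread c").
 * Returns 1 if the labels match the ideal even partition, 0 otherwise. */
static int is_even_partition(const int label[], int npages, int nthreads)
{
    if (npages % nthreads != 0)
        return 0;                 /* uneven page counts not handled here */
    int per = npages / nthreads;
    for (int p = 0; p < npages; p++)
        if (label[p] != p / per)
            return 0;
    return 1;
}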
Hints Representation & Utilization
For dynamic data, a pattern type is associated with every dynamic allocation system call: [FileName, Line#, Pattern Type]
For static data, the page location is given explicitly: [Virtual Page Num, Tile ID]
SOS data management policy (a sketch of the hint records follows below):
• The pattern type is translated into an actual partition once the dynamic area's location and size are known by the OS
• A page's location is assigned on demand if the partition information (hint) is available
• Data without corresponding hints are treated as highly shared and distributed at block level
• Data replication is enabled for shared data
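The hint records above might be represented as below; the field types and names are assumptions, and only the [FileName, Line#, Pattern Type] and [Virtual Page Num, Tile ID] contents come from the slide:

typedef enum {               /* pattern types from the slides */
    PAT_EVEN_PARTITION, PAT_SCATTERED, PAT_DOMINANT_OWNER, PAT_SHARED
} pattern_t;

typedef struct {             /* hint for a dynamic allocation site */
    const char *file_name;   /* source file of the allocation call */
    int         line;        /* line number of the call            */
    pattern_t   pattern;     /* recognized access pattern          */
} dynamic_hint_t;

typedef struct {             /* hint for static data */
    unsigned long vpn;       /* virtual page number */
    int           tile_id;   /* home tile ID        */
} static_hint_t;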
Architectural Support
To allow flexible data placement in the L2 cache, we add two fields to each page table entry and TLB entry [Jin and Cho CMP-MSI '07, Cho and Jin MICRO '06]
The OS is responsible for providing TID and BIN:
• Main memory access is the same as before, using the translated physical page address
• The L2 cache addressing mode depends on the values of TID and BIN
[Diagram: a TLB entry with Virtual Page Number, Physical Page Number, P, TID, and BIN fields; the physical page number forms the physical address for main memory access, while TID and BIN locate the page in the L2 cache (see the sketch below)]
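A sketch of how the two fields might steer L2 slice selection; the slide does not define the TID/BIN semantics or bit widths, so this assumes TID names a home tile and BIN flags whether a page-grained hint exists, with unhinted data falling back to block-level interleaving per the SOS policy above:

#include <stdint.h>

#define NUM_TILES  16        /* 4x4 mesh, one L2 slice per tile */
#define BLOCK_BITS 6         /* 64-byte cache blocks (assumed)  */

typedef struct {             /* augmented TLB entry (sketch) */
    uint64_t vpn;            /* virtual page number             */
    uint64_t ppn;            /* physical page number            */
    uint8_t  present;        /* P bit                           */
    uint8_t  tid;            /* target tile ID from the OS      */
    uint8_t  bin;            /* assumed: 1 = hinted page,
                                0 = block-interleaved placement */
} tlb_entry_t;

/* Choose the L2 slice that homes this physical address. */
static int l2_home_tile(const tlb_entry_t *e, uint64_t paddr)
{
    if (e->bin)
        return e->tid;                                   /* whole page in one tile */
    return (int)((paddr >> BLOCK_BITS) % NUM_TILES);     /* block interleaving     */
}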
Outline: Motivation, Observations in access patterns, SOS scheme, Evaluation results, Conclusions
Experiment Setup
We use a Simics-based memory simulator, modeling a 16-tile CMP with a 4x4 2D mesh on-chip network
Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice
Programs from the SPLASH-2 and PARSEC suites are selected as benchmarks, with 3 different input sizes
The small input set is used to profile and generate hints, while the medium and large input sets are used to evaluate SOS performance
For brevity, we only present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs
Hint Accuracy
Accuracy is measured by the percentage of pages that are placed in the tile with the most accesses
[Bar chart: hint accuracy (0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, for small and medium inputs]
Breakdown of L2 Cache Accesses
[Stacked bar chart: per-program breakdown (0-100%) of L2 cache accesses into Static, Dominant Owner, Private Scatter, Ordered Scatter, Even Partition, and Shared]
Patterns vary among different programs
A large percentage of L2 accesses can be tackled by page placement
The shared data are evenly distributed and handled by replication
Remote Access Comparison
Hint-guided data placement significantly reduces the number of remote cache accesses
Our SOS scheme removes nearly 87% of remote accesses!
[Bar chart: remote accesses (normalized, 0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, under Shared, VR, Hints Only, and SOS]
Execution Time
Hint-guided data placement tracks private cache performance closely
SOS performs nearly 20% better than the shared cache scheme
[Bar chart: execution time (normalized, 0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, under Shared, Private, VR, Hints Only, and SOS]
Related Work
Lu et al. PACT '09
• Analyzes array accesses and performs data layout transformation to improve data affinity
Marathe and Mueller PPoPP '06
• Profiles a truncated program run before every execution
• Derives optimal page locations based on the sampled access trace
• Optimizes data locality for cc-NUMA
Hardavellas et al. ISCA '09
• Dynamically identifies private and shared pages
• Private mapping for private pages and fine-grained broadcast mapping of shared pages
• Focuses on server workloads
Conclusions
We propose a software-oriented approach for shared cache management: controlling data placement and replication
This is the first work on a software-managed distributed shared cache scheme for CMPs
We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity
We demonstrate through experiments that software-oriented shared cache management is a promising approach
• 19% performance improvement over the shared cache scheme
Thank you and Questions?
Future Work
Further study of more complex access patterns can expose more benefits of our software-oriented cache management scheme
Extend the current scheme to server workloads, which exhibit very different cache behaviors from scientific workloads
Hint Coverage
Hint coverage measures the percentage of L2 cache accesses to the pages guided by SOS
[Bar chart: hint coverage (0-100%) for barnes, lu, cholesky, swaptions, and the average of all apps, for small and medium inputs]