juan m. cebrián, alberto ros, ricardo fernández-pascual...
TRANSCRIPT
Introduction Proposal Methodology Results Discussion and Conclusions
EARLY EXPERIENCES WITH SEPARATE CACHES
FOR PRIVATE AND SHARED DATA
Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual and Manuel E. Acacio
{jcebrian, aros, rfernandez, meacacio}@ditec.um.es
Department of Computer Engineering, University of Murcia
Sep 3, 2015
Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 1 / 23
OUTLINE
1 INTRODUCTION
2 PROPOSAL
3 METHODOLOGY
4 RESULTS
5 DISCUSSION AND CONCLUSIONS
CURRENT TRENDS: MOTIVATION
Shared-memory architectures are predominant in multi-core microprocessors from all market segments
Correctness is ensured by means of coherence protocols and consistency models
But performance and scalability are limited by the amount and size of the messages used to maintain coherence
Blindly keeping coherence for all memory accesses translates into an unnecessary overhead for data that will remain coherent after the access
  Read-only shared data
  Private data
INTRODUCTION: IDEA
Why treat all accesses the same regardless of the nature of the data being accessed?
Idea: use of dedicated caches for private (+ shared read-only) and shared data
  Eliminates the need for a coherence directory, improving scalability of the multi-core architecture
  Reduces the pressure on the L1P and overall combined L1D misses
  Reduces the amount of duplicated data in L1 caches
However
  Increases latency when accessing the shared data (uses the network and may require high bandwidth)
  Requires a classification mechanism to detect the nature of the data
PROPOSAL: DEDICATED CACHE DESIGN
[Figure: tile diagram. Labels: Tile, Core (OS-Classification), L1P, L1S Bank, L2 Bank, Router, Private Access, Shared Access]
Access the classification mechanism
Check the local cache for private data or the router for shared data
Update the classification mechanism based on the nature of the data and move data between the private and shared caches
DEDICATED CACHE DESIGN
Private $ (L1P) independent for each core
Shared $ (L1S) logically shared but physically distributed
Coherence messages only required when nature changes
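The access path described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `home_bank` mapping (block-address interleaving across the 16 L1S banks) and the `Tile` and `access` names are assumptions introduced here, since the slides do not say how a shared block is mapped to a bank.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64   # block size listed in the simulation parameters
NUM_TILES = 16     # 16-core tiled CMP


def home_bank(addr: int) -> int:
    """Hypothetical mapping: interleave blocks across L1S banks by block address."""
    return (addr // BLOCK_BYTES) % NUM_TILES


@dataclass
class Tile:
    tile_id: int
    l1p: set = field(default_factory=set)       # blocks held in this core's private cache
    l1s_bank: set = field(default_factory=set)  # blocks held in this tile's slice of the shared cache


def access(tiles, core, addr, is_shared):
    """Route one access: private data stays local, shared data goes to its home L1S bank."""
    block = addr // BLOCK_BYTES
    if not is_shared:
        tiles[core].l1p.add(block)      # served by the local L1P, no network traversal
        return ("L1P", core)
    bank = home_bank(addr)              # served by a (possibly remote) L1S bank via the router
    tiles[bank].l1s_bank.add(block)
    return ("L1S", bank)


tiles = [Tile(i) for i in range(NUM_TILES)]
print(access(tiles, 3, 0x1000, is_shared=False))
print(access(tiles, 3, 0x1040, is_shared=True))
```

Because each shared block lives in exactly one L1S bank in this sketch, no second copy exists that would need to be kept coherent, which is the property the design relies on.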
PROPOSAL: CLASSIFICATION MECHANISM
CLASSIFICATION MECHANISM
Our proposal requires a classification mechanism that detects private and shared data
There are many mechanisms in the literature that work at different levels: compiler, OS, coherence protocol, etc.
We chose an OS-level mechanism (at page granularity) because of its simplicity
  When a core accesses a page, it is marked private in the page table
  A second access to a page (marked as private) by a different core will force an update of the page table state to shared
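The two page-table rules above can be sketched as a small state machine. This is an illustrative model only: the 4 KB page size and the `PageClassifier` name are assumptions not stated in the slides, and the sketch does not model moving data between the private and shared caches when a page changes state.

```python
PAGE_BYTES = 4096  # assumed page size; the slides do not state it


class PageClassifier:
    """First-touch page classification: private until a second core touches the page."""

    def __init__(self):
        # page number -> ("private", owner_core) or ("shared", None)
        self.state = {}

    def classify(self, core, addr):
        page = addr // PAGE_BYTES
        entry = self.state.get(page)
        if entry is None:
            # First access to the page: mark it private to the accessing core.
            self.state[page] = ("private", core)
            return "private"
        kind, owner = entry
        if kind == "private" and owner != core:
            # A different core touches a private page: promote it to shared.
            self.state[page] = ("shared", None)
            return "shared"
        return kind


clf = PageClassifier()
print(clf.classify(0, 0x0000))  # first touch by core 0
print(clf.classify(1, 0x0010))  # same page, different core
```

In this sketch a page never returns to the private state once it becomes shared; the slides do not describe such a transition.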
EVALUATION ENVIRONMENT
[Figure: simulated tile with CPU Core, L1P$, L1I$, L1S$, L2$ (Tags), L2$ (Data), Directory, and Router]
EVALUATION ENVIRONMENT
The simulated architecture corresponds to a single-chip multiprocessor (tiled-CMP) featuring 16 cores
Real system data accesses captured by PIN are used as input for the GEMS 2.1 simulation environment
The interconnection network uses Garnet (for tiled and custom layouts) and a “simple” idealized network model without contention for our idealized evaluation
We compare against a traditional MESI protocol
We evaluate our proposal using the applications from the SPLASH-2 benchmark suite with the recommended input sizes
EVALUATION ENVIRONMENT: SIMULATION PARAMETERS
Memory parameters
  Block size                                     64 bytes
  L1 cache (data & private & instr.)             32 KB, 4 ways
  L1 access latency (data & private & instr.)    4 cycles
  L1S (shared)                                   32 KB, 4 ways, per tile
  L1S access latency                             4 cycles + network
  L2 cache (shared)                              512 KB/tile, 16 ways
  L2 access latency                              12 cycles
  Cache organization                             Inclusive
  Directory information                          Included in L2
  Memory access time                             160 cycles
Network parameters
  Topology                                       Base: 2-D mesh (4×4); Other: Custom/Simple Network
  Routing method                                 X-Y deterministic
  Message size                                   5 flits (data), 1 flit (control)
  Link time                                      1 cycle
  Bandwidth                                      1 flit per cycle
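As a quick sanity check on these parameters, the sketch below derives the number of sets in each cache and the average hop count of X-Y routing on the 4×4 mesh. The row-major tile numbering and the uniformly random source/destination pairs are assumptions made for illustration, not properties reported in the slides.

```python
def cache_sets(size_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * block_bytes)


def xy_hops(src, dst, width=4):
    """Hops under X-Y deterministic routing on a width x width mesh (row-major tile ids assumed)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def mean_hops(width=4):
    """Average hop count over all (source, destination) pairs, assuming uniform traffic."""
    n = width * width
    total = sum(xy_hops(s, d, width) for s in range(n) for d in range(n))
    return total / (n * n)


l1_sets = cache_sets(32 * 1024, ways=4, block_bytes=64)    # L1 / L1S geometry
l2_sets = cache_sets(512 * 1024, ways=16, block_bytes=64)  # per-tile L2 bank geometry
print(l1_sets, l2_sets, mean_hops())
```

Under these assumptions a shared access crosses a few links on average in each direction, on top of the 4-cycle L1S lookup, which gives a feel for why the "4 cycles + network" entry matters.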
RESULTS: NORMALIZED RUNTIME (TO UNIFIED/MESI L1D)
[Figure: Runtime (Normalized), 0.0 to 10.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Series: MESI-Inclusive, Private-Shared]
TILED DESIGN, 2D MESH GARNET NETWORK. CLH LOCKS
Realistic layout design, but huge overall slowdown (3.7x)
Great variability in the results (some applications hide latency better)
RESULTS: CACHE MISSES
Now we must discover why we experienced this behavior while others did not, and how we could solve this issue
The extra combined capacity of the L1D (L1P+L1S) causes a significant reduction in the number of cache misses (4-way, 128-entry L1S)
But L1S accesses cost as much as misses on the private cache, so performance benefits are minimal
[Figure: L1D Misses (Normalized), 0.0 to 1.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Breakdown: Cold, Capacity, Conflict. Bar 1: MESI-Inclusive; bar 2: Private-Shared]
RESULTS: NETWORK TRAFFIC
[Figure: Network Traffic (Normalized), 0.0 to 60.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Series: MESI-Inclusive, Private-Shared]
Huge increase in traffic due to the shared cache (around 13× on average)
High variability in traffic across applications, but similar performance degradation
Performance penalty only critical for those applications that have shared accesses in their critical path of execution
RESULTS: OTHER APPROACHES: CENTRALIZED DESIGN
[Figure: custom layout showing L1S-Bank 0, L1S-Bank 1, L1S-Bank 2, and L1S-Bank 3]
Custom Layout. Place all L1S in the centre of the layout
RESULTS: OTHER APPROACHES: CENTRALIZED DESIGN II
[Figure, left panel: Runtime (Normalized), 0.0 to 12.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Series: MESI-Inclusive, Private-Shared]
[Figure, right panel: Network Traffic (Normalized), 0.0 to 60.0, same benchmarks. Series: MESI-Inclusive, Private-Shared]
Runtime increased (3.9x vs 3.4x)
Network traffic slightly decreased (10x vs 13x)
Fewer hops but more contention on L1S controllers
RESULTS: OTHER APPROACHES: IDEALIZED DESIGN
[Figure, first panel: Runtime (Normalized), 0.0 to 2.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Series: MESI-Inclusive, Private-Shared]
[Figure, second panel: Network Traffic (Normalized), 0 to 30, same benchmarks. Series: MESI-Inclusive, Private-Shared]
Idealized network (P2P, without contention)
Runtime increases by 25%, network traffic increases by 7x
Even in this optimistic design the L1S accesses outweigh the gains from removing the coherence protocol
DISCUSSION: PRIVATE AND SHARED DATA
[Figure: Mem. Access Classification, 0.0 to 1.0, per benchmark: Barnes, Cholesky, FFT, FMM, LU, LU-nc, Ocean, Ocean-nc, Radiosity, Radix, Raytrace, Volrend, Water-Nsq, Water-Sp, Average. Legend: Forward, Loads, Stores. Bar 1: MESI-Inclusive; bar 2: Private-Shared]
The 52% of shared accesses we observe does not match the literature (10%)
Build a model based on empirical data to estimate performance based on the amount of private and shared accesses
DISCUSSION: PERFORMANCE MODEL
[Figure: Normalized L1 Latency (1 to 8) versus Percentage of L1S Accesses. Series: Custom, Simple, Tiled]
Based on access and miss latencies to both private and shared caches
Even for a 10% ratio we still have a 60% performance degradation
But we are modeling a latency-oriented architecture
  NVIDIA architectures separate private and shared data (programmer)
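To give a feel for how such a model behaves, here is a deliberately simplified sketch that weights only hit latencies by the fraction of L1S accesses. It is not the model behind the figure: it omits the miss-latency terms the slide mentions, so it is not expected to reproduce the 60% figure. The function name is hypothetical, the 4-cycle L1 latency comes from the simulation parameters, and the 6.75-cycle and 23-cycle shared-access latencies are the ideal and realistic values quoted in the conclusions.

```python
L1_HIT_CYCLES = 4.0  # L1 access latency from the simulation parameters


def normalized_l1_latency(shared_fraction, shared_latency_cycles):
    """Average L1 access latency relative to an all-private baseline (hit latencies only)."""
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be between 0 and 1")
    average = ((1.0 - shared_fraction) * L1_HIT_CYCLES
               + shared_fraction * shared_latency_cycles)
    return average / L1_HIT_CYCLES


# Shared-access latencies quoted in the conclusions: ideal vs realistic network.
for label, latency in (("ideal", 6.75), ("realistic", 23.0)):
    # 10% is the shared ratio from the literature, 52% the ratio measured in this work.
    for fraction in (0.10, 0.52):
        value = normalized_l1_latency(fraction, latency)
        print(f"{label:9s} shared={fraction:.0%} normalized latency={value:.2f}")
```

Even this optimistic hit-only estimate grows linearly with the shared fraction, which is consistent with the slide's point that a low shared ratio alone does not remove the penalty.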
DISCUSSION: CONCLUSIONS
CONCLUSIONS
We analyze a dedicated cache design that takes into consideration the nature of the data being accessed. Our results show two main drawbacks that limit the usability of our implementation:
  Low accuracy of the classification mechanism (can be improved to 90%)
  Huge increase in the latency of shared accesses due to the interconnection network (from 6.75 cycles ideal to 23 cycles realistic latency)
We believe there is also room for improvement by using a specialized network, since our design only requires 8 bytes to be sent as control/request plus 16 bytes to be returned (control + data word).
With improved accuracy for access classification and reduced network latency, we believe our approach can become useful as the number of cores increases and coherence protocols face more scalability issues.
Additional research is needed to discover the number of cores required to make the dedicated design feasible in terms of performance.
Throughput-oriented cores (GPUs or accelerators like the Xeon Phi) can also benefit from our design, since additional memory latency is usually hidden in those systems by swapping execution threads.
Thank you