juan m. cebrián, alberto ros, ricardo fernández-pascual...

27
Introduction Proposal Methodology Results Discussion and Conclusions E ARLY E XPERIENCES WITH S EPARATE C ACHES FOR P RIVATE AND S HARED DATA Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual and Manuel E. Acacio {jcebrian, aros, rfernandez, meacacio}@ditec.um.es Depart. of Computer Engineering University of Murcia Sep 3, 2015 Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 1 / 23

Upload: others

Post on 21-Aug-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

EARLY EXPERIENCES WITH SEPARATE CACHES

FOR PRIVATE AND SHARED DATA

Juan M. Cebrián, Alberto Ros,Ricardo Fernández-Pascual and Manuel E. Acacio

{jcebrian, aros, rfernandez,meacacio}@ditec.um.es

Depart. of Computer EngineeringUniversity of Murcia

Sep 3, 2015

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 1 / 23

Page 2: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 2 / 23

Page 3: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 3 / 23

Page 4: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

CURRENT TRENDSMOTIVATION

Shared-memory architectures are predominant inmulti-core microprocessors from all market segments

Correctness is ensured by means of coherence protocolsand consistency modelsBut performance and scalability are limited by the amountand size of the messages used to maintain coherence

Blindly keeping coherence for all memory accessestranslates into an unnecessary overhead for data that willremain coherent after the access

Read-only shared dataPrivate data

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 4 / 23

Page 5: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

CURRENT TRENDSMOTIVATION

Shared-memory architectures are predominant inmulti-core microprocessors from all market segments

Correctness is ensured by means of coherence protocolsand consistency modelsBut performance and scalability are limited by the amountand size of the messages used to maintain coherence

Blindly keeping coherence for all memory accessestranslates into an unnecessary overhead for data that willremain coherent after the access

Read-only shared dataPrivate data

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 4 / 23

Page 6: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

INTRODUCTIONIDEA

Why treat all accesses the same regardless of the natureof the data being accessed?

Idea use of dedicated caches for private (+sharedread-only) and shared data

Eliminates the need for a coherence directory, improvingscalability of the multi-core architectureReduces the pressure on the L1P and overall combinedL1D missesReduces the amount of duplicated data in L1 caches

HoweverIncreases latency when accessing the shared data (usesthe network and may require high bandwidth)Requires a classification mechanism to detect the nature ofthe data

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 5 / 23

Page 7: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

INTRODUCTIONIDEA

Why treat all accesses the same regardless of the natureof the data being accessed?Idea use of dedicated caches for private (+sharedread-only) and shared data

Eliminates the need for a coherence directory, improvingscalability of the multi-core architectureReduces the pressure on the L1P and overall combinedL1D missesReduces the amount of duplicated data in L1 caches

HoweverIncreases latency when accessing the shared data (usesthe network and may require high bandwidth)Requires a classification mechanism to detect the nature ofthe data

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 5 / 23

Page 8: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

INTRODUCTIONIDEA

Why treat all accesses the same regardless of the natureof the data being accessed?Idea use of dedicated caches for private (+sharedread-only) and shared data

Eliminates the need for a coherence directory, improvingscalability of the multi-core architectureReduces the pressure on the L1P and overall combinedL1D missesReduces the amount of duplicated data in L1 caches

HoweverIncreases latency when accessing the shared data (usesthe network and may require high bandwidth)Requires a classification mechanism to detect the nature ofthe data

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 5 / 23

Page 9: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 6 / 23

Page 10: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

PROPOSALDEDICATED CACHE DESIGN

Tile

L1P L1SBank

Router

Private AccessShared Access

Core(OS-Classification)

L2 Bank

......................

Access the classification mechanismCheck the local cache for privatedata or the router for shared dataUpdate the classification mechanismbased on the nature of the data andmove data between the private andshared caches

DEDICATED CACHE DESIGN

Private $ (L1P) independent for each coreShared $ (L1S) logically shared but physically distributedCoherence messages only required when nature changes

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 7 / 23

Page 11: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

PROPOSALCLASSIFICATION MECHANISM

CLASSIFICATION MECHANISM

Our proposal requires a classification mechanism that detectsprivate and shared data

Many mechanisms in the literature that work at differentlevels: compiler, OS, coherence protocol, etc

We chose a OS-level (at a page granularity) mechanismbecause of its simplicity

When a core accesses a page it is marked private in thepage tableA second access to a page (marked as private) by adifferent core will force an update on the page table stateto shared

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 8 / 23

Page 12: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 9 / 23

Page 13: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

EVALUATION ENVIRONMENT

CPU Core

L1P$L1I$

L2$(Tags)

L2$ (Data)R

outerD

irectoryL1S$

EVALUATION ENVIRONMENT

The simulated architecture corresponds to a single chip multiprocessor(tiled-CMP) featuring 16 cores

Real system data accesses captured by PIN are used as input for the GEMS 2.1simulation environment

The interconnection network uses Garnet (for tiled and custom layouts) and a“simple” idealized network model without contention for our idealized evaluation

We compare against a traditional MESI protocol

We evaluate our proposal using the applications from the SPLASH-2 benchmarksuite with the recommended input sizes

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 10 / 23

Page 14: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

EVALUATION ENVIRONMENTSIMULATION PARAMETERS

Memory parametersBlock size 64 bytesL1 cache (data & private & instr.) 32 KB, 4 waysL1 access latency (data & private & instr.) 4 cycleL1S (shared) (32 KB, 4 ways) per tileL1S access latency 4 + NetworkL2 cache (shared) 512 KB/tile, 16 waysL2 access latency 12 cycleCache organization InclusiveDirectory information Included in L2Memory access time 160 cycles

Network parametersTopology Base: 2-D mesh (4×4)

Other: Custom/Simple NetworkRouting method X-Y deterministMessage size 5 flits (data), 1 flit (control)Link time 1 cycleBandwidth 1 flit per cycle

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 11 / 23

Page 15: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 12 / 23

Page 16: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSNORMALIZED RUNTIME (TO UNIFIED/MESI L1D)

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.01.02.03.04.05.06.07.08.09.0

10.0R

untim

e (N

orm

aliz

ed)

MESI-InclusivePrivate-Shared

TILED DESIGN, 2D MESH GARNET NETWORK. CLH LOCKS

Realistic layout design, but huge overall slowdown (3.7x)

Great variability in the results (some applications hide latency better)

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 13 / 23

Page 17: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSCACHE MISSES

Now we must discover why we experienced this behavior while others did not,and how we could solve this issue

The extra combined capacity of the L1D (L1P+L1S) causes a significantreduction on the number of cache misses (4-Way 128-entry L1S)

But L1S accesses cost as much as misses on the private cache, so performancebenefits are minimal

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 14 / 23

Page 18: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSCACHE MISSES

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.00.10.20.30.40.50.60.70.80.91.0

L1D

Mis

ses

(Nor

mal

ized

)

ColdCapacityConflict

1. MESI-Inclusive2. Private-Shared

Now we must discover why we experienced this behavior while others did not,and how we could solve this issue

The extra combined capacity of the L1D (L1P+L1S) causes a significantreduction on the number of cache misses (4-Way 128-entry L1S)

But L1S accesses cost as much as misses on the private cache, so performancebenefits are minimal

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 14 / 23

Page 19: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSNETWORK TRAFFIC

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.05.0

10.015.020.025.030.035.040.045.050.055.060.0

Net

wor

k T

raffi

c (N

orm

aliz

ed) MESI-Inclusive

Private-Shared

Huge increment in traffic due to the shared cache (around 13× on average)

High variability that but similar performance degradation

Performance penalty only critical for those applications that have sharedaccesses in their critical path of execution

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 15 / 23

Page 20: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSOTHER APPROACHES: CENTRALIZED DESIGN

L1S-Bank 0 L1S-Bank 1

L1S-Bank 2 L1S-Bank 3

Custom Layout. Place all L1S in the centre of the layout

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 16 / 23

Page 21: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSOTHER APPROACHES: CENTRALIZED DESIGN II

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.01.02.03.04.05.06.07.08.09.0

10.011.012.0

Run

time

(Nor

mal

ized

)

MESI-InclusivePrivate-Shared

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.05.0

10.015.020.025.030.035.040.045.050.055.060.0

Net

wor

k T

raffi

c (N

orm

aliz

ed) MESI-Inclusive

Private-Shared

Runtime increased (3.9x vs 3.4x)

Network traffic slightly decreased (10x vs 13x)

Less hops but more contention on L1S controllers

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 17 / 23

Page 22: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

RESULTSOTHER APPROACHES: IDEALIZED DESIGN

Barn

esCh

olesk

yFF

TFM

M LULU

-nc

Ocea

nOc

ean-

ncRa

diosit

yRa

dixRa

ytrac

eVo

lrend

Wat

er-N

sqW

ater

-Sp

Aver

age

0.00.20.40.60.81.01.21.41.61.82.0

Run

time

(Nor

mal

ized

)

MESI-InclusivePrivate-Shared

02.5

57.510

12.515

17.520

22.525

27.530

Net

wor

k Tr

affic

(nor

mal

ized

)

MESI-InclusivePrivate-Shared

Barn

esCh

olesk

yFF

TFM

M LULU

-nc

Ocea

nOc

ean-

ncRa

diosit

yRa

dixRa

ytrac

eVo

lrend

Wat

er-N

sqW

ater

-Sp

Aver

age

Idealized network (P2P, without contention)

Runtime increased by 25%, network traffic increases by 7x

Even in this optimistic design the L1S accesses outweigh the gains fromremoving the coherence protocol

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 18 / 23

Page 23: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

OUTLINE

1 INTRODUCTION

2 PROPOSAL

3 METHODOLOGY

4 RESULTS

5 DISCUSSION AND CONCLUSIONS

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 19 / 23

Page 24: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

DISCUSSIONPRIVATE AND SHARED DATA

Bar

nes

Cho

lesk

y F

FT F

MM

LU

LU

-nc

Oce

an O

cean

-nc

Rad

iosi

ty R

adix

Ray

trace

Vol

rend

Wat

er-N

sq W

ater

-Sp

Ave

rage

0.00.10.20.30.40.50.60.70.80.91.0

Mem

. Acc

ess

Cla

ssifi

catio

n

ForwardLoadsStores

1. MESI-Inclusive2. Private-Shared

52% shared accesses does not match the literature (10%)

Build a model based on empirical data to estimate performance based on theamount of private and shared accesses

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 20 / 23

Page 25: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

DISCUSSIONPERFORMANCE MODEL

1

2

3

4

5

6

7

8Custom Simple Tiled

Percentage of L1S Accesses

No

rma

lize

d L

1 L

ate

ncy

Based on access and miss latencies to both private and shared caches

Even for 10% ratio we still have a 60% performance degradation

But we are modeling a latency-oriented architectureNVIDIA architectures separate private and shared data (programmer)

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 21 / 23

Page 26: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Introduction Proposal Methodology Results Discussion and Conclusions

DISCUSSIONCONCLUSIONS

CONCLUSIONS

We analyze a dedicated cache design that takes into consideration the nature ofthe data being accessed. Our results show two main drawbacks that limit theusability of our implementation:

Low accuracy on the classification mechanism (can be improved to 90%)Huge increase of the latency of shared accesses (due to theinterconnection network. 6.75 ideal to 23 cycles realistic latency).

We believe there is also room for improvement by using a specialized networksince our design only requires 8 bytes to be sent as control/request plus 16 bytesto be returned (control + data word).

With improved accuracy for access clasisfication and reduced network latency,we believe our approach can become useful as the number of cores increasesand coherence protocols face more scalability issues.

Additional research is needed to discover the number of cores required to makethe dedicated design feasible in terms of performance.

Throughput oriented cores (GPUs or accelerators like the Xeon Phi) can alsobenefit from our design, since additional memory latency is usually hidden inthose systems by swapping execution threads.

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 22 / 23

Page 27: Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual ...press3.mcs.anl.gov/errorworkshop/files/2015/09/separate.pdf · Juan M. Cebrián, Alberto Ros, Ricardo Fernández-Pascual

Thank you

EARLY EXPERIENCES WITH SEPARATE CACHES

FOR PRIVATE AND SHARED DATA

Juan M. Cebrián, Alberto Ros,Ricardo Fernández-Pascual and Manuel E. Acacio

{jcebrian, aros, rfernandez,meacacio}@ditec.um.es

Depart. of Computer EngineeringUniversity of Murcia

Sep 3, 2015

Juan Manuel Cebrián González ERROR Workshop 2015 Sep 3, 2015 23 / 23