MOSAIC: The Case for a Scalable Coherence Protocol for Complex On-Chip Cache Hierarchies in Many-Core Systems
Lucía G. Menezo, Valentín Puente, José Ángel Gregorio
University of Cantabria (Spain)
Edinburgh - PACT 2013
Outline
Motivation
Directory Schemas
◦ In-cache
◦ Sparse
MOSAIC Coherence Protocol
◦ Examples
Evaluation Results
Conclusions
Motivation
Performance improvement: more processors per chip
Major challenge: the off-chip bandwidth wall
◦ Introduce more cache into the chip
◦ Result: complex on-chip cache hierarchies
Coherence protocol: fundamental role to play
Motivation
Which coherence protocol to use with a large number of cores?
◦ Broadcast-based protocols: high energy requirements
◦ Directory-based protocols: more storage needed for sharing information
MOSAIC: a new coherence protocol
◦ Directory without inclusiveness
◦ Token Coherence to guarantee correctness
Directory schemas: In-cache
Each block in the LLC includes the tag, the data and the sharers information
The LLC receives the requests, so it needs precise knowledge of the sharers
Inclusiveness is necessary: any block in the private levels must also be allocated in the LLC
Advantage: the coherence protocol is less complex
Disadvantage: every LLC block carries the storage overhead
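To put a number on the in-cache overhead, here is a minimal sketch. It assumes a full-map bit vector (one presence bit per core) attached to every LLC block; the slides do not state the sharer encoding, so that encoding and the function name are illustrative assumptions. The 16MB LLC, 64B blocks and 8 cores match Config 1 of the evaluation.

```python
def in_cache_directory_overhead(llc_bytes, block_bytes, num_cores):
    """Return (num_blocks, directory_bits) for an in-cache directory.

    Assumption (not from the slides): the sharers field is a full-map
    bit vector, i.e. one presence bit per core in every LLC block.
    """
    num_blocks = llc_bytes // block_bytes
    directory_bits = num_blocks * num_cores
    return num_blocks, directory_bits


# 16MB LLC, 64B blocks, 8 cores.
blocks, bits = in_cache_directory_overhead(16 * 2**20, 64, 8)
print(f"{blocks} LLC blocks, {bits // 8 // 1024} KB of sharer bits")
```

The point of the sketch is that the cost scales with the number of LLC blocks, whether or not a block is actually cached in any private level.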
Directory schemas: In-cache
[Figure, shown in two build steps: processors (P) with private caches, each line holding tag (@) and data, connect through the interconnection network to the LLC with in-cache directory, where every block holds tag (@), data and sharers. The sharers field of the LLC blocks is annotated "Overhead!!!".]
Directory schemas: Sparse
Directory entries are kept separate from the data
Entries are allocated on demand
Overhead is proportional to the aggregate size of the private levels (not the LLC)
Capacity and associativity have to be sufficient to keep the private-level cache tags
Directory schemas: Sparse
[Figure: processors (P) with private caches, each line holding tag (@) and data, connect through the interconnection network to the LLC (tag and data only) and to a separate sparse directory (tag and sharers).]
Directory schemas: Sparse
Duplicate-tag directory: holds all the tags of the private levels
Associativity = # cores * private caches associativity
# sets = # private caches sets
Example: 16 cores with 4-way 32KB L1 gives a 64-way directory
[Figure: directory drawn as a grid of tag entries, one wide row per set.]
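The duplicate-tag arithmetic on this slide can be checked with a short sketch. The function name is hypothetical, and the 64B block size is an assumption borrowed from the evaluation configuration, since this slide does not state it.

```python
def duplicate_tag_geometry(num_cores, cache_bytes, ways, block_bytes=64):
    """Geometry of a duplicate-tag directory that mirrors every private cache.

    block_bytes=64 is an assumption taken from the evaluation table.
    """
    sets = cache_bytes // (block_bytes * ways)   # same sets as one private cache
    associativity = num_cores * ways             # one way per private-cache way
    return {"sets": sets, "associativity": associativity}


# Slide example: 16 cores, each with a 4-way 32KB L1.
geometry = duplicate_tag_geometry(num_cores=16, cache_bytes=32 * 1024, ways=4)
print(geometry)
```

The associativity grows linearly with the core count, which is why the following slide trades associativity for more sets.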
Directory schemas: Sparse
Decrease associativity: now << # cores * private caches associativity
Increase number of sets
One tag may be in various private caches
More than 1 tag per entry: conflicts
Inclusiveness needed: invalidate private data (recall messages)
[Figure: the duplicate-tag grid redrawn with fewer ways and more sets; each tag entry now carries a sharers field.]
MOSAIC Protocol
Works with either an in-cache or a sparse directory
No inclusiveness
No invalidations of data in private caches
Reconstruction of sharing information on demand
Uses token counting to avoid extra traffic and guarantee correctness
Token Coherence protocol:
◦ Initially, each block has as many tokens as processors (# tokens == # procs)
◦ Read request: data and 1 token
◦ Write request: data and all tokens
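The three token rules above can be sketched as a small model. This is an illustration of token counting only, not the MOSAIC implementation: the class, the "home" holder standing for the LLC or memory, and the policy of taking a read token from the holder with the most tokens are all assumptions.

```python
class TokenBlock:
    """Token bookkeeping for a single memory block (illustrative model)."""

    def __init__(self, num_procs):
        self.total = num_procs
        # Initially the block has as many tokens as processors.
        # "home" is a hypothetical name for the LLC/memory holder.
        self.tokens = {"home": num_procs}

    def _move(self, src, dst, count):
        self.tokens[src] -= count
        self.tokens[dst] = self.tokens.get(dst, 0) + count
        # Tokens are never created or destroyed.
        assert sum(self.tokens.values()) == self.total

    def can_read(self, proc):
        return self.tokens.get(proc, 0) >= 1

    def can_write(self, proc):
        return self.tokens.get(proc, 0) == self.total

    def read(self, proc):
        """Read request: obtain data and 1 token."""
        if not self.can_read(proc):
            # Assumed policy: take the token from the richest holder.
            src = max(self.tokens, key=self.tokens.get)
            self._move(src, proc, 1)

    def write(self, proc):
        """Write request: obtain data and all tokens."""
        for holder in list(self.tokens):
            if holder != proc and self.tokens[holder] > 0:
                self._move(holder, proc, self.tokens[holder])


block = TokenBlock(num_procs=4)
block.read("P0")
block.read("P1")
block.write("P2")
print(block.tokens)
```

Because a writer must hold every token, no reader can still hold one, which is the property the protocol relies on for correctness.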
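MOSAIC Conceptual Approach
[Figure: private caches, the Last Level Cache (split into Dir_slice, holding sharers, and Data_slice) and the memory controller, all attached to the on-chip network. Each private-cache line holds State, Num. Tokens and Data. Example contents: P0 = (I, 0, N/A), P1 = (O, 2, DATA), P2 = (S, 1, DATA); one further line shows (I, 0, N/A). Numbered steps 1 to 5 mark the message flow; their labels were not recoverable.]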
MOSAIC Key Facts
When the data is not present in the LLC, a reconstruction request is broadcast
Private caches report the number of tokens they hold
Token counting avoids negative acknowledgements and timeouts
The reconstruction message piggybacks the type of request and the requestor
Key: the directory may replace entries silently, with no invalidations
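A minimal sketch of the reconstruction step, assuming each private cache answers the broadcast with its token count and an owner flag. The function, the dictionary layout and the rule that a cache with zero tokens contributes nothing are assumptions made for illustration; the token counts mirror the conceptual example (P1 owner with 2 tokens, P2 with 1).

```python
def reconstruct_entry(private_caches, total_tokens):
    """Rebuild a directory entry from the token counts held by private caches.

    private_caches maps a cache name to (tokens_held, is_owner).
    Assumption: a cache holding no tokens adds nothing to the entry.
    """
    sharers = []
    owner = None
    tokens_seen = 0
    for name, (tokens, is_owner) in private_caches.items():
        if tokens == 0:
            continue
        sharers.append(name)
        tokens_seen += tokens
        if is_owner:
            owner = name
    return {
        "sharers": sharers,
        "owner": owner,
        # Tokens not held by any private cache (for example at the LLC).
        "tokens_elsewhere": total_tokens - tokens_seen,
    }


caches = {"P0": (0, False), "P1": (2, True), "P2": (1, False), "P3": (0, False)}
entry = reconstruct_entry(caches, total_tokens=4)
print(entry)
```

Since the total number of tokens is fixed, the directory can tell from the counts alone how many tokens remain outside the private caches.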
MOSAIC Read Request
[Figure: message sequence diagram; only the labels are recoverable.]
Participants: P0, P1, P2, P3, Dir, LLC
States shown: Invalid, IS, S, O, C, A
Messages: Read; Reconstruction; Info 1 token; Info 2 tokens, Owner; Unblock (info 1 token); Data + token; Read; Forward GETS to Owner; Data + token; Unblock
Directory entry over time: Sharers [P2], Owner: ? then Sharers [P2, P1], Owner: P1 then Sharers [P2, P1, P0], Owner: P1 then Sharers [P2, P1, P0, P3], Owner: P1
Token counts shown: 3 tokens, 1 token
MOSAIC Write Request
[Figure: message sequence diagram; only the labels are recoverable.]
Participants: P0, P1, P2, P3, Dir, LLC
States shown: Invalid, IS, S, O, C, A, IM, M
Messages: Write; Reconstruction; Data + 3 tokens; 1 token; Unblock (info all tokens); Directory Eviction
Directory entry: Sharers [P0], Owner: P0
Token counts shown: 3 tokens, 1 token
Evaluation methodology
Parameter                            Config 1                            Config 2
Number of cores                      8 @3GHz                             16 @3GHz
IWin size / Issue width              128, 4-way                          128, 4-way
Block size                           64B                                 64B
Private L1 size / associativity      32KB I/D, 2-way                     32KB I/D, 2-way
Private L2 size / associativity      64KB, 4-way (exclusive with L1)     64KB, 4-way (exclusive with L1)
Shared L3 size / associativity       16MB, 16-way                        32MB, 16-way
NUCA mapping                         Static, interleaved across slices   Static, interleaved across slices
Memory capacity                      4GB                                 4GB
Max. outstanding mem. operations     16                                  16
Topology                             4×4 Mesh                            6×6 Mesh
[Figure: Config 1 floorplan with Cores 0 to 7 and LLC Slices 0 to 15, one router (R) per slice, in a 4×4 mesh.]
[Figure: Config 2 floorplan with Cores 0 to 15 placed around LLC Slices 0 to 31, connected by routers (R) in a 6×6 mesh.]
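The "static, interleaved across slices" NUCA mapping can be illustrated with a short sketch. The slides do not say which address bits select the slice; block-granularity interleaving on the low-order block-address bits is an assumption here, as is the function name.

```python
BLOCK_BYTES = 64  # block size from the evaluation table


def home_slice(address, num_slices):
    """Return the LLC slice that statically owns a physical address.

    Assumption: consecutive blocks map to consecutive slices
    (low-order block-address bits pick the slice).
    """
    block_number = address // BLOCK_BYTES
    return block_number % num_slices


# Config 1 has 16 LLC slices.
for addr in (0x0000, 0x0040, 0x0400, 0x1234):
    print(hex(addr), "-> slice", home_slice(addr, 16))
```

With a static mapping the requester can compute the home slice locally, so locating the directory entry for a block needs no lookup.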
Simulation stack and Workloads
GEMS: full-system evaluation
◦ SLICC: Specification Language for Implementing Cache Coherence
Multithreaded workloads
◦ 4 Wisconsin Commercial Workloads
◦ 3 NAS Parallel Benchmarks
Multiprogrammed workloads
◦ 3 SPEC 2006 (rate mode)
MOSAIC Performance: Reducing associativity
[Figure: normalized execution time (y-axis, 0.5 to 1.1) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 64w128KB, 32w128KB, 2w128KB, 1w128KB.]
128KB: 16K entries (8 bytes per entry)
Number of misses
[Figure: normalized number of misses (y-axis, 0 to 2), stacked as Misses L2, Misses L1I and Misses L1D, comparing BASE and MOSAIC for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus. The per-bar configuration labels are garbled in the extraction. Annotation: "x2".]
MOSAIC Performance: Reducing associativity and capacity
[Figure: normalized execution time (y-axis, 0.4 to 1.1) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 64w16KB, 32w16KB, 2w16KB, 1w16KB.]
128KB: 16K entries (8 bytes per entry)
16KB: 2K entries
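The entry counts quoted on these slides follow from the 8 bytes per entry. The sketch below also derives the number of sets, reading labels such as "64w128KB" as "64 ways, 128KB"; that reading and the function name are assumptions.

```python
def sparse_directory_geometry(size_bytes, ways, entry_bytes=8):
    """Return (entries, sets) for a sparse directory of the given size.

    entry_bytes=8 comes from the slide; "Nw" is assumed to mean N ways.
    """
    entries = size_bytes // entry_bytes
    sets = entries // ways
    return entries, sets


for label, size_kb, ways in [("64w128KB", 128, 64), ("1w128KB", 128, 1), ("2w16KB", 16, 2)]:
    entries, sets = sparse_directory_geometry(size_kb * 1024, ways)
    print(f"{label}: {entries} entries in {sets} sets")
```

Running it reproduces the 16K entries for 128KB and the 2K entries for 16KB stated on the slides.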
MOSAIC Latency
[Figure: latency in processor cycles (y-axis, 0 to 12), stacked as L3, Other L2, Other L1, Private L2 and Local L1, comparing BASE and MOSAIC for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus. The per-bar configuration labels are garbled in the extraction.]
16KB: 2K entries
MOSAIC Link Utilization
[Figure: average network link utilization (y-axis, 0 to 1.4) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 64w128KB, 64w64KB, 64w32KB, 64w8KB, 2w128KB, 2w64KB, 2w16KB.]
MOSAIC Link Utilization vs. Dir
[Figure: normalized network link utilization (y-axis, 0 to 1.4) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 2w128KB, 2w64KB, 2w16KB. Annotation: "40%!!".]
MOSAIC Scalability
[Figure: normalized link utilization (y-axis, 0 to 2) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 128w256KB, 128w128KB, 128w64KB, 128w32KB, 2w256KB, 2w128KB, 2w64KB, 2w32KB.]
16 cores configuration
Conclusions
Low complexity and great scalability
Very low storage overhead
No noticeable energy cost
An alternative for future many-core cache-coherent CMPs
MOSAIC Coherence Protocol = bandwidth scalability of a directory + elegance of Token Coherence
Thank you for your attention
Realistic Cache Configuration
L1: 4-way 32KB / L2: 8-way 256KB
[Figure: normalized execution time (y-axis, 0 to 1.2) for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP, Zeus and Gmean. Series: 16w512KB, 16w256KB, 16w128KB, 16w64KB, 16w32KB. Annotations: "x2 full dir" and "1/10 full dir".]
Same experiment with BASE: 20% impact in some cases
MOSAIC Energy
[Figure: normalized dynamic energy (y-axis, 0 to 1.8), stacked as Network, Sparse directory, L3, L2 and L1, comparing BASE and MOSAIC for Astar, Hmmer, Omnetpp, FT, IS, LU, Apache, Jbb, OLTP and Zeus. The per-bar configuration labels are garbled in the extraction.]