© 2008 IBM Corporation
IBM Research
Design and Implementation of the Blue Gene/P Snoop Filter
Valentina Salapura, Matthias Blumrich, Alan Gara
The 14th International Symposium on High-Performance Computer Architecture, HPCA ’08, February 18, 2008
System optimization for the multicore era
Scaling up the system performance
– Increasing the number of processors on chip
Keeping the communication cost between the processors low
– Data sharing for unified memory space and shared address space
– Coherence protocols implemented
Coherence comes at a cost
– Designing a coherent multiprocessor is one of the most challenging tasks
– System performance is reduced due to coherence overhead
Goals
– Reduce performance loss due to coherence traffic
– Reduce power consumption for high-power cache access
A new 4-way chip multiprocessor node design
4-way SMP based on PowerPC 450
– Next-generation IBM embedded microprocessor
PPC450 represents first IBM 4xx embedded processor with coherence support
– Write through policy
– Snoop on write requests to invalidate cache lines
Internal architecture closely related to PPC440
– Single copy of single ported L1 data cache tag array reduces area
– Snoop traffic shares tag port with data cache access
Design driven by technical and non-technical considerations
– Area and power limit ability to increase cache subsystem complexity
– NRE and time to market considerations limit
Coherence revisited
Coherence establishes consistent, shared view of memory despite the presence of local caches
– Accomplished by ensuring updates from one core are visible to readers on other cores
– Invalidate cache contents in remote caches when locally modified
Coherence is important for data sharing
– Locks and shared data resources must be synchronized
– Allow applications to efficiently share data
Coherence represents overhead for non-shared data
– But in properly tuned applications, most data is not shared
– Cost of data transfer would diminish application performance
– Many code transformations to increase locality of reference
– Key goal: avoid paying for synchronizing unshared data
Eliminate useless coherence actions
– Increases performance (remove unnecessary lookups, reduced cache interference)
– Reduces power (and energy)
Snoop filtering of unnecessary coherence requests
[Figure: Processor 0 to Processor 3, each with a Snoop Unit and an L2 Cache, plus the L3 Cache and DMA. Legend: coherence traffic, data transfer.]
Multi-component snoop filtering
Key to snoop filter effectiveness is high filtering rate
Correctness requirement is to never miss a necessary snoop request
Different applications and different data classes have distinct characteristics
– Streams of contiguous data
– Repeated accesses to the same data addresses
– Some regions are known to not be shared
Novel multi-component snoop filter – matches individual snoop filters to specific access patterns
Multi-component filtering: Matching filtering techniques to data access patterns
Stream registers
– Contiguous data areas
– Adaptive to cache arbitrarily sized contiguous regions with a single register
– Stream registers track strided and sequential streams
Snoop caches
– Cache of recently executed snoop requests
– Multiple requests to same line do not have to cause multiple snoop lookups
– Snoop caches track locality
Range filter
– Identify regions of known non-shared data
– Configured by software
Multi-component port filters
[Figure: incoming requests snoop 1 to snoop 4 enter Port Filter 1 to Port Filter 4. Blocks shown in a port filter: Snoop Cache, Range Filter (enable, bypass), and Stream reg. Lookup. Also shown: Stream Registers, and Queuing and Multiplexing leading to "snoop to L1 cache".]
Scalability requires multiple, simultaneous lookups for each cache
– Different sharing patterns
Port filters capture sharing with a specific remote cache
– Port filters can be replicated to service multiple incoming requests simultaneously ⇒ scalable
– No communication necessary between filters
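A minimal sketch of how one port filter might combine its components. It assumes (the slide does not say this) that a snoop is dropped as soon as any one component shows the line cannot be in L1, and is forwarded otherwise. The names `PortFilter`, `in_snoop_cache`, `in_nonshared_range`, and `may_be_in_l1`, and the address values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PortFilter:
    """Hypothetical model of one port filter; each field is a predicate on an address."""
    in_snoop_cache: Callable[[int], bool]      # True: line known to be absent from L1
    in_nonshared_range: Callable[[int], bool]  # True: software marked the region non-shared
    may_be_in_l1: Callable[[int], bool]        # stream register lookup; False: not in L1

    def forward(self, addr: int) -> bool:
        """Return True if the snoop must be passed on to the L1 cache."""
        if self.in_snoop_cache(addr):
            return False
        if self.in_nonshared_range(addr):
            return False
        return self.may_be_in_l1(addr)


# One filter per incoming snoop port (four ports, as in the figure).
ports = [
    PortFilter(
        in_snoop_cache=lambda a: False,                       # empty snoop cache
        in_nonshared_range=lambda a: 0x10000 <= a < 0x20000,  # assumed private region
        may_be_in_l1=lambda a: (a & 0xFFFFC) == 0x7F840,      # one base/mask pair
    )
    for _ in range(4)
]

results = [ports[0].forward(a) for a in (0x7F841, 0x1D330, 0x30000)]
print(results)  # [True, False, False]
```

Because each filter answers from its own predicates, replicating it per port needs no coordination in this sketch.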
Stream registers
Track a superset of the data that is in the cache
– Capture when a line might be in the cache
– Stream register state is only modified by L1 activity
Each stream register is intended to capture an address stream
– Multiple registers for multiple streams
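[Figure: stream register example between L1 and the path "to L2…". LOAD 0x7F840 gives base 7F840, mask FFFFF; after LOAD 0x7F843 the register holds base 7F840, mask FFFFC. Example snoops shown: 0x1D330, 0x7F840, 0x7F841.]
Mask bits corresponding to non-matching base and LOAD bits are set to 0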
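The base/mask update can be sketched as follows, using the addresses from the slide. The class name `StreamRegister`, its method names, and the 20-bit address width are assumptions made for illustration; this is not the hardware design.

```python
ADDR_BITS = 20  # assumed width, chosen to match the 5-hex-digit values on the slide
FULL_MASK = (1 << ADDR_BITS) - 1


class StreamRegister:
    """Hypothetical base/mask pair tracking a superset of loaded addresses."""

    def __init__(self):
        self.valid = False
        self.base = 0
        self.mask = FULL_MASK

    def record_load(self, addr):
        if not self.valid:
            # The first load defines the base; every bit is significant.
            self.valid, self.base, self.mask = True, addr, FULL_MASK
        else:
            # Clear each mask bit where the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & FULL_MASK

    def may_contain(self, addr):
        # Compare only the bit positions still set in the mask.
        return self.valid and ((addr ^ self.base) & self.mask) == 0


sr = StreamRegister()
sr.record_load(0x7F840)
sr.record_load(0x7F843)
print(f"{sr.base:05X} {sr.mask:05X}")  # 7F840 FFFFC
print(sr.may_contain(0x1D330))         # False: this snoop can be filtered
print(sr.may_contain(0x7F841))         # True: this snoop must be forwarded
```

Note that 0x7F841 was never loaded yet still matches: the register tracks a superset, so it can forward an unnecessary snoop but never drops a necessary one.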
Stream registers: cache wrapping
Stream registers become less discriminating as addresses are added
– Addresses cannot be subtracted from stream registers
– Solution: periodically reset the registers
– Safe to do only when all of the addresses they are tracking are no longer in L1
[Figure: an active set and a history set of stream registers, each entry a base and mask pair. Entries shown: base 7F840 with mask FFFFC, and base 1D337 with mask FFFFF.]
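One way the active and history sets could interact is sketched below. The slide only names the two sets, so the behavior here is an assumption: on a cache wrap (every L1 line replaced since the last reset) the active set is copied to the history set and then cleared, and a lookup consults both. The names `StreamRegisterSets`, `on_cache_wrap`, and the caller-chosen register index are hypothetical.

```python
FULL_MASK = 0xFFFFF  # assumed 20-bit addresses, as in the slide's examples


def matches(entry, addr):
    base, mask = entry
    return ((addr ^ base) & mask) == 0


class StreamRegisterSets:
    """Hypothetical active/history pair of stream register sets."""

    def __init__(self, num_regs):
        self.active = [None] * num_regs   # updated by L1 loads
        self.history = [None] * num_regs  # snapshot taken at the last cache wrap

    def record_load(self, reg, addr):
        # Register selection is outside this sketch; the caller picks `reg`.
        entry = self.active[reg]
        if entry is None:
            self.active[reg] = (addr, FULL_MASK)
        else:
            base, mask = entry
            self.active[reg] = (base, mask & ~(base ^ addr) & FULL_MASK)

    def on_cache_wrap(self):
        # Everything now in L1 was loaded since the last reset, so it is
        # covered by the active set; the old history can be discarded.
        self.history = self.active
        self.active = [None] * len(self.active)

    def may_contain(self, addr):
        entries = self.active + self.history
        return any(e is not None and matches(e, addr) for e in entries)


sets = StreamRegisterSets(num_regs=2)
sets.record_load(0, 0x7F840)
sets.record_load(0, 0x7F843)
sets.record_load(1, 0x1D337)
sets.on_cache_wrap()
snapshot = list(sets.history)
before = sets.may_contain(0x7F841)  # still covered by the history set
sets.on_cache_wrap()                # no loads in between: both sets now empty
after = sets.may_contain(0x7F841)
```

Detecting the wrap itself is not modeled; the sketch only shows why a reset at that point does not lose track of lines still in L1.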
Snoop cache
If a cache line was invalidated, no need to invalidate it again
– A hit in the snoop cache guarantees that the line is not in L1
– Store L1 snoop addresses in the snoop cache, and evict LOAD addresses
– Snoop cache state may be modified after each lookup
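[Figure: snoop cache example between L1 and the path "to L2…". Each entry holds a base and a vector; base 7F84 is shown with addresses 7F840 and 7F845 and a hit. Requests shown: snoop 0x7F840, LOAD 0x7F840.]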
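A minimal sketch of the snoop cache behavior described above. Several details are assumptions rather than facts from the slide: each entry covers 16 lines (so base 7F84 covers 7F840 to 7F84F), replacement is least-recently-used, and the names `SnoopCache`, `snoop`, and `load` are hypothetical.

```python
from collections import OrderedDict

VECTOR_BITS = 4  # assumed: 16 lines per entry


class SnoopCache:
    """Hypothetical snoop cache: a set bit means the line is known to be absent from L1."""

    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        self.entries = OrderedDict()  # base -> bit vector, oldest first

    @staticmethod
    def _split(line_addr):
        offset = line_addr & ((1 << VECTOR_BITS) - 1)
        return line_addr >> VECTOR_BITS, 1 << offset

    def snoop(self, line_addr):
        """Return True if the snoop must be forwarded to L1, False if filtered."""
        base, bit = self._split(line_addr)
        vector = self.entries.get(base, 0)
        if vector & bit:
            self.entries.move_to_end(base)
            return False  # hit: the line was already invalidated
        if base not in self.entries and len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)  # assumed LRU replacement
        self.entries[base] = vector | bit     # after this snoop the line is not in L1
        self.entries.move_to_end(base)
        return True

    def load(self, line_addr):
        """An L1 load may bring the line back, so forget that it was invalidated."""
        base, bit = self._split(line_addr)
        if base in self.entries:
            self.entries[base] &= ~bit
            if self.entries[base] == 0:
                del self.entries[base]


sc = SnoopCache()
trace = [sc.snoop(0x7F840), sc.snoop(0x7F845), sc.snoop(0x7F840)]
sc.load(0x7F840)
after_load = sc.snoop(0x7F840)
print(trace, after_load)  # [True, True, False] True
```

Dropping an entry is always safe in this model: it only causes a later snoop to be forwarded that could have been filtered.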
Evaluation methodology
Trace-based evaluation and hardware measurements
– Trace-based simulation during design to study design trade-offs
– Hardware measurements confirm accuracy of modeling and performance benefits
Multiprocessor traces
– Used Augmint to generate traces
– SPLASH-2 benchmarks and other applications
Custom simulator to study filter effectiveness
– Modeled the PowerPC 450 L1 data cache
– Memory accesses processed sequentially, with their effect applied immediately
Evaluated various snoop filters
– More details on sizing in: “Improving the Accuracy of Snoop Filtering Using Stream Registers”, MEDEA Workshop in conjunction with PACT 2007 Conference 2007
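To illustrate the "processed sequentially, effect applied immediately" style of trace evaluation, here is a toy model that counts how many snoops find nothing to invalidate. It is not the authors' simulator and does not use Augmint. It assumes an unbounded set per processor in place of a real L1, 32-byte lines, invalidate-on-write with no write allocation, and a hypothetical trace format of `(cpu, op, addr)` tuples.

```python
def useless_snoop_fraction(trace, num_cpus=4, line_bits=5):
    """Fraction of snoops that target a line absent from the snooped cache.

    trace: iterable of (cpu, op, addr) with op in {"LOAD", "STORE"} (assumed format).
    line_bits=5 assumes 32-byte cache lines.
    """
    caches = [set() for _ in range(num_cpus)]  # idealized, unbounded L1 per CPU
    total = useless = 0
    for cpu, op, addr in trace:
        line = addr >> line_bits
        if op == "LOAD":
            caches[cpu].add(line)
        elif op == "STORE":
            # Every write is snooped by all other processors.
            for other in range(num_cpus):
                if other == cpu:
                    continue
                total += 1
                if line in caches[other]:
                    caches[other].discard(line)  # necessary snoop: invalidate
                else:
                    useless += 1                 # a perfect filter would drop this
    return useless / total if total else 0.0


example = [(0, "LOAD", 0x100), (1, "STORE", 0x100), (1, "STORE", 0x104)]
fraction = useless_snoop_fraction(example, num_cpus=2)
print(fraction)  # 0.5
```

The result is an upper bound on what any filter could remove for that trace under these assumptions, which gives a reference point for a measured filter rate.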
Varying the number of stream registers
[Figure: % filtered (axis 0 to 120) versus number of stream registers (4, 8, 16, 32) for Ocean, FFT, LU, Barnes, Radix, Cholesky, FMM, and Raytrace.]
Stream register update policy
Affinity: Matching stream registers to memory accesses
– Minimal Hamming Distance – captures strided accesses through memory
– Most Matching Upper Bits (MMUB) – captures contiguous regions of accesses
Selecting the “Empty Affinity” threshold at which to start a new stream
– If too low, streams are not well captured and too many streams are started.
– If too high, all streams are merged and the filters are forced to pass a wide range of accesses.
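The two affinity measures and the threshold decision can be sketched as below. The selection rule is one reading of the slide, not a confirmed design: a load updates the closest register by Hamming distance unless that distance exceeds the threshold and an empty register is available, in which case it starts a new stream. The function names and the 20-bit width are assumptions.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two addresses differ."""
    return bin(a ^ b).count("1")


def matching_upper_bits(a, b, width=20):
    """Length of the common most-significant prefix (MMUB affinity)."""
    diff = (a ^ b) & ((1 << width) - 1)
    return width - diff.bit_length()


def select_register(bases, addr, threshold):
    """Pick the register index to update for a load (Hamming policy).

    bases: non-empty list of register bases, None marking an empty register.
    threshold: assumed "empty affinity" threshold, expressed as a distance.
    """
    used = [(hamming_distance(b, addr), i) for i, b in enumerate(bases) if b is not None]
    empty = [i for i, b in enumerate(bases) if b is None]
    if used:
        best_dist, best_idx = min(used)
        if best_dist <= threshold or not empty:
            return best_idx  # close enough, or no free register: merge
    return empty[0]          # start a new stream


bases = [0x7F840, None]
near = select_register(bases, 0x7F843, threshold=4)  # distance 2: merge into register 0
far = select_register(bases, 0x1D337, threshold=4)   # distance 12: start register 1
print(near, far, matching_upper_bits(0x7F840, 0x7F843))  # 0 1 18
```

With a threshold of 0 almost every load opens a new stream until the registers run out, and with a very large threshold everything merges into the first register, which matches the two failure modes listed above.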
[Figure: two panels, Ocean MMUB and Ocean Minimal Hamming, plotting % filtered (0 to 100) against the empty affinity threshold (1 to 16 for MMUB, 1 to 31 for Minimal Hamming) for 4, 8, 16, and 32 stream registers (SRs).]
Snoop cache size
[Figure: three panels (Ocean, Raytrace, LU) plotting filter rate % (0 to 100) against valid line vector length (1, 2, 4, 8, 16, 32, 64) and lines in snoop cache (4 lines, 16 lines).]
Different filtering rate depending on the benchmark
Larger caches advantageous for some applications, less important for others
Combined snoop filter
[Figure: simulation results. Filter rate (%) on an axis from 0 to 120 for FFT, Barnes, LU, Ocean, Raytrace, and Cholesky; bigger is better.]
snoop caches: 8 lines, 32 cache lines; 8 stream registers
System-level view
[Figure: chip block diagram. Four PPC450 cores, each with 32k I1/32k D1 caches, a Double FPU, an L2, and a snoop filter; multiplexing switches; 4MB eDRAM blocks usable as L3 Cache or On-Chip Memory, each with a Shared L3 Directory for eDRAM w/ECC; DDR-2 Controllers w/ ECC on a 13.6 GB/s DDR-2 DRAM bus; DMA; Shared SRAM; Hybrid PMU w/ SRAM 256x64b; Arb; and I/O blocks for JTAG Access, Ethernet 10 Gbit, Collective, Torus, and Global Barrier. Link and path labels shown: 6 3.4Gb/s bidirectional, 3 6.8Gb/s bidirectional, 4 global barriers or interrupts, 10 Gb/s, 512b data 72b ECC, and bus widths 256, 128, and 32.]
BlueGene/P ASIC
IBM Cu-08 90nm CMOS ASIC process technology
Die size 173 mm²
Clock frequency 850 MHz
Transistor count 208M
Power dissipation 16W
Hardware measurements: snoop filter efficiency
[Figure: normalized execution time (axis 0 to 1.2; smaller is better) for BT, CG, EP, FT, LU-HP, LU, MG, SP, UA, and Avg., comparing four configurations: disabled, stream reg only, snoop cache only, both enabled.]
Snoop filtering improves power and performance
[Figure: actual hardware measurements for the UMT2k application on an axis from 50% to 110% (smaller is better): time, power, E, E × t, and E × t², comparing snoop disabled with snoop filter.]
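The five quantities on this chart are related by simple products, sketched below under the assumption that E is average power times execution time. The function name `normalized_metrics` is hypothetical, and the input ratios in the usage line are placeholders, not the measured UMT2k values.

```python
def normalized_metrics(time_ratio, power_ratio):
    """Derive energy and energy-delay metrics from time and power ratios.

    Both inputs are (snoop filter enabled) / (snoop filter disabled).
    Assumes E = P * t with P the average power over the run.
    """
    energy = power_ratio * time_ratio
    return {
        "time": time_ratio,
        "power": power_ratio,
        "E": energy,
        "E*t": energy * time_ratio,
        "E*t^2": energy * time_ratio ** 2,
    }


# Placeholder inputs for illustration only.
metrics = normalized_metrics(time_ratio=0.9, power_ratio=1.0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because time enters E × t and E × t² more than once, a runtime reduction shows up more strongly in those two metrics than in time or energy alone.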
Conclusion
System level optimization is key to delivering on CMP promise
– Coherence traffic problematic in terms of performance and power
Multi-component filtering to capture data sharing and use patterns
– Capture both temporal locality and streamed memory accesses
– Capture region-based sharing
Port filtering
– Scalable, as port filters can be replicated to service multiple incoming requests simultaneously
– No communication necessary between filters
Stream registers
– Adaptive filter to capture streams corresponding to contiguously accessed memory regions
Snoop filtering highly effective solution for multicore systems
– Filters up to 99% of all coherence requests
– Efficient hardware implementation
– Significant performance improvement
– Power efficient
BlueGene on the Web
The Blue Gene/P project has been supported and partially funded by Argonne National Laboratory and the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Subcontract No. B554331.