Design and Implementation of the Blue Gene/P Snoop Filter

Valentina Salapura, Matthias Blumrich, Alan Gara
IBM Research

The 14th International Symposium on High-Performance Computer Architecture (HPCA ’08), February 18, 2008
© 2008 IBM Corporation




System optimization for the multicore era

Scaling up the system performance
– Increasing the number of processors on chip

Keeping the communication cost between the processors low
– Data sharing for unified memory space and shared address space
– Coherence protocols implemented

Coherence comes at a cost
– Designing a coherent multiprocessor is one of the most challenging tasks
– System performance is reduced due to coherence overhead

Goals
– Reduce performance loss due to coherence traffic
– Reduce power consumption for high-power cache access


A new 4-way chip multiprocessor node design

4-way SMP based on PowerPC 450
– Next-generation IBM embedded microprocessor

PPC450 represents first IBM 4xx embedded processor with coherence support
– Write through policy
– Snoop on write requests to invalidate cache lines

Internal architecture closely related to PPC440
– Single copy of single ported L1 data cache tag array reduces area
– Snoop traffic shares tag port with data cache access

Design driven by technical and non-technical considerations
– Area and power limit ability to increase cache subsystem complexity
– NRE and time to market considerations limit


Coherence revisited

Coherence establishes consistent, shared view of memory despite the presence of local caches
– Accomplished by ensuring updates from one core are visible to readers on other cores
– Invalidate cache contents in remote caches when locally modified

Coherence is important for data sharing
– Locks and shared data resources must be synchronized
– Allow applications to efficiently share data

Coherence represents overhead for non-shared data
– But in properly tuned applications, most data is not shared
– Cost of data transfer would diminish application performance
– Many code transformations to increase locality of reference
– Key goal: avoid paying for synchronizing unshared data

Eliminate useless coherence actions
– Increases performance (remove unnecessary lookups, reduced cache interference)
– Reduces power (and energy)


Snoop filtering of unnecessary coherence requests

[Figure: block diagram with Processor 0 to Processor 3, each with a snoop port, a Snoop Unit, and an L2 Cache, plus the L3 Cache and DMA. The legend distinguishes coherence traffic from data transfer.]


Multi-component snoop filtering

Key to snoop filter effectiveness is high filtering rate

Correctness requirement is to never miss a necessary snoop request

Different applications and different data classes have distinct characteristics

– Streams of contiguous data

– Repeated accesses to the same data addresses

– Some regions are known to not be shared

Novel multi-component snoop filter – matches individual snoop filters to specific access patterns


Multi-component filtering: Matching filtering techniques to data access patterns

Stream registers
– Contiguous data areas
– Adaptive to cache arbitrarily sized contiguous regions with a single register
– Stream registers track strided and sequential streams

Snoop caches
– Cache of recently executed snoop requests
– Multiple requests to same line do not have to cause multiple snoop lookups
– Snoop caches track locality

Range filter
– Identify regions of known non-shared data
– Configured by software
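
The range filter is the simplest of the three components and can be sketched as a software-configured address window: a snoop whose address falls inside a region declared non-shared can be dropped. The class name, the inclusive bounds, and the sample addresses below are illustrative assumptions, not the Blue Gene/P register layout.

```python
class RangeFilter:
    """Hypothetical model of a software-configured range filter."""

    def __init__(self, low, high):
        # Software declares [low, high] as known non-shared (inclusive bounds assumed).
        self.low = low
        self.high = high

    def can_filter(self, snoop_addr):
        # A snoop into the non-shared region needs no L1 lookup, so it may be dropped.
        return self.low <= snoop_addr <= self.high


rf = RangeFilter(0x10000, 0x1FFFF)
print(rf.can_filter(0x1D330))  # True: inside the non-shared range
print(rf.can_filter(0x7F840))  # False: outside, so the snoop must be passed on
```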


Multi-component port filters

[Figure: block diagram of the multi-component port filters. Labels: Port Filter 1 to Port Filter 4, fed by snoop 1 to snoop 4; inside a port filter a Snoop Cache, a Range Filter, and a stream register lookup; Stream Registers; enable and bypass controls; a queuing and multiplexing stage that issues the snoop to the L1 cache.]

Scalability requires multiple, simultaneous lookups for each cache
– Different sharing patterns

Port filters capture sharing with a specific remote cache
– Port filters can be replicated to service multiple incoming requests simultaneously ⇒ scalable
– No communication necessary between filters
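
The replication argument can be illustrated with a small model in which each port filter holds only its own state and decides independently whether an incoming snoop must be forwarded to L1. The class, the predicate interface, and the sample addresses are assumptions made for illustration; the slides do not specify the hardware interfaces.

```python
class PortFilter:
    """Hypothetical port filter: a list of independent 'can this snoop be dropped?' checks."""

    def __init__(self, checks):
        # Each check returns True only when it can prove the snoop is unnecessary.
        self.checks = checks

    def must_forward(self, addr):
        # Never miss a necessary snoop: forward unless some component
        # proves that the line cannot be in L1.
        return not any(check(addr) for check in self.checks)


def make_port_filter(non_shared_low, non_shared_high):
    # Per-port private state (a stand-in for a snoop cache); nothing is shared between ports.
    known_absent = set()
    in_range = lambda a: non_shared_low <= a <= non_shared_high
    recently_snooped = lambda a: a in known_absent
    return PortFilter([in_range, recently_snooped]), known_absent


# Four replicated filters, one per incoming snoop port as in the figure.
ports = [make_port_filter(0x10000, 0x1FFFF) for _ in range(4)]
port1, port1_absent = ports[0]
port1_absent.add(0x7F840)

print(port1.must_forward(0x7F840))        # False: port 1 knows the line is absent
print(ports[1][0].must_forward(0x7F840))  # True: port 2 has no such knowledge
print(ports[1][0].must_forward(0x1D330))  # False: inside the non-shared range
```

Because no filter reads another filter's state, adding a port only adds another independent instance.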


Stream registers

Tracks a superset of data that are in the cache
– Captures when a line might be in the cache
– Stream register state is only modified by L1 activity

Each stream register is intended to capture an address stream
– Multiple registers for multiple streams

[Figure: stream register example between L1 and the path to L2. After LOAD 0x7F840 the register holds base 7F840 and mask FFFFF; after LOAD 0x7F843 the mask becomes FFFFC. Example snoop addresses shown: 0x1D330, 0x7F840, 0x7F841.]

Mask bits corresponding to non-matching base and LOAD bits are set to 0
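
The base and mask update in the example can be written out as a short sketch. It assumes 20-bit addresses (matching the five hex digits in the example) and uses hypothetical class and method names; it models a single register only.

```python
MASK_ALL = 0xFFFFF  # assumed 20-bit addresses, as in the example above


class StreamRegister:
    """Hypothetical model of one stream register (base + mask)."""

    def __init__(self):
        self.valid = False
        self.base = 0
        self.mask = MASK_ALL

    def record_load(self, addr):
        if not self.valid:
            # First load of the stream: remember it exactly.
            self.base, self.mask, self.valid = addr, MASK_ALL, True
        else:
            # Clear every mask bit where the base and the new load address differ.
            self.mask &= ~(self.base ^ addr) & MASK_ALL

    def might_be_cached(self, snoop_addr):
        # A snoop matches if it agrees with the base on all bits still set in the mask.
        return self.valid and ((snoop_addr ^ self.base) & self.mask) == 0


sr = StreamRegister()
sr.record_load(0x7F840)
sr.record_load(0x7F843)
print(format(sr.base, "05X"), format(sr.mask, "05X"))  # 7F840 FFFFC
print(sr.might_be_cached(0x7F841))  # True: forwarded, although 0x7F841 was never loaded
print(sr.might_be_cached(0x1D330))  # False: the snoop can be filtered
```

The 0x7F841 case shows why the register tracks a superset: it may pass snoops for lines that are not cached, but it never rejects a line that was loaded.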


Stream registers: cache wrapping

Stream registers become less discriminating as addresses are added
– Addresses cannot be subtracted from stream registers
– Solution: periodically reset the registers
– Safe to do only when all of the addresses they are tracking are no longer in L1

[Figure: an active set and a history set of stream registers, each entry holding a base and a mask. Example entries: base 7F840 with mask FFFFC, and base 1D337 with mask FFFFF.]
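
One way the active and history sets could cooperate is sketched below. It assumes that a "cache wrap" event signals that every line present at the previous wrap has since been replaced, that the active registers are then copied to the history set and cleared, and that snoops are checked against both sets. These mechanics, the single register per set, and all names are assumptions for illustration; the slides only show the two sets.

```python
MASK_ALL = 0xFFFFF  # assumed 20-bit addresses


class WrappingStreamFilter:
    """Hypothetical active/history pair with one (base, mask) register per set."""

    def __init__(self):
        self.active = None   # tracks loads since the last wrap
        self.history = None  # tracks loads between the previous two wraps

    def record_load(self, addr):
        if self.active is None:
            self.active = (addr, MASK_ALL)
        else:
            base, mask = self.active
            self.active = (base, mask & ~(base ^ addr) & MASK_ALL)

    def on_cache_wrap(self):
        # Assumed trigger: all lines cached at the previous wrap are gone,
        # so the old history is stale and can be overwritten.
        self.history = self.active
        self.active = None

    def might_be_cached(self, addr):
        return any(reg is not None and ((addr ^ reg[0]) & reg[1]) == 0
                   for reg in (self.active, self.history))


f = WrappingStreamFilter()
f.record_load(0x7F840)
f.record_load(0x7F843)
f.on_cache_wrap()                   # (7F840, FFFFC) moves to the history set
f.record_load(0x1D337)              # active set starts fresh with (1D337, FFFFF)
print(f.might_be_cached(0x7F841))   # True: still covered by the history set
f.on_cache_wrap()                   # old history is dropped
print(f.might_be_cached(0x7F841))   # False: the register has regained its precision
```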


Snoop cache

If a cache line was invalidated, no need to invalidate it again
– A hit in the snoop cache guarantees that the line is not in L1
– Store L1 snoop addresses in the snoop cache, and evict LOAD addresses
– Snoop cache state may be modified after each lookup

[Figure: snoop cache example between L1 and the path to L2. Entries hold a base and a vector (base 7F84 shown); the sequence snoop 0x7F840, LOAD 0x7F840 is shown, and one lookup is marked as a hit.]
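
The store-on-snoop, evict-on-load behavior can be modeled in a few lines. The sketch assumes each entry pairs a base with a bit vector of 16 consecutive lines (so 0x7F840 maps to base 7F84, as in the figure), a fixed number of entries, and FIFO replacement; the replacement policy and all names are assumptions, not details taken from the slides.

```python
from collections import OrderedDict


class SnoopCache:
    """Hypothetical snoop cache: base -> bit vector of lines known to be absent from L1."""

    def __init__(self, num_entries=8, vector_bits=16):
        self.num_entries = num_entries
        self.vector_bits = vector_bits
        self.entries = OrderedDict()

    def _split(self, addr):
        return addr // self.vector_bits, addr % self.vector_bits

    def snoop(self, addr):
        """Return True if the snoop must be forwarded to L1, False if it is filtered."""
        base, bit = self._split(addr)
        vec = self.entries.get(base, 0)
        if (vec >> bit) & 1:
            return False  # hit: the line was already invalidated
        if base not in self.entries and len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)  # assumed FIFO replacement
        # The forwarded snoop invalidates the line in L1, so record it as absent.
        self.entries[base] = vec | (1 << bit)
        return True

    def load(self, addr):
        # L1 now holds the line again, so the "absent" guarantee must be withdrawn.
        base, bit = self._split(addr)
        if base in self.entries:
            self.entries[base] &= ~(1 << bit)
            if self.entries[base] == 0:
                del self.entries[base]


sc = SnoopCache()
print(sc.snoop(0x7F840))  # True: first snoop is forwarded and recorded
print(sc.snoop(0x7F840))  # False: repeated snoop hits and is filtered
sc.load(0x7F840)          # the load evicts the address from the snoop cache
print(sc.snoop(0x7F840))  # True: the line may be in L1 again
```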


Evaluation methodology

Trace-based evaluation and hardware measurements
– Trace-based simulation during design to study design tradeoffs
– Hardware measurements confirm accuracy of modeling and performance benefits

Multiprocessor traces
– Used Augmint to generate traces
– SPLASH-2 benchmarks and other applications

Custom simulator to study filter effectiveness
– Modeled the PowerPC 450 L1 data cache
– Memory accesses processed sequentially, with their effect applied immediately

Evaluated various snoop filters
– More details on sizing in: “Improving the Accuracy of Snoop Filtering Using Stream Registers”, MEDEA Workshop, held in conjunction with the PACT 2007 conference


Varying the number of stream registers

[Figure: % filtered (y-axis, 0 to 120) versus number of stream registers (4, 8, 16, 32) for Ocean, FFT, LU, Barnes, Radix, Cholesky, FMM, and Raytrace.]


Stream register update policy

Affinity: matching stream registers to memory accesses
– Minimal Hamming Distance: captures strided accesses through memory
– Most Matching Upper Bits (MMUB): captures contiguous regions of accesses

Selecting the “Empty Affinity” threshold at which to start a new stream
– If too low, streams are not well captured and too many streams are started.
– If too high, all streams are merged and the filters must pass a wide range of accesses.
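
The two affinity measures can be compared with a small sketch. The slides do not define how the threshold is applied, so the code makes a loud assumption: both measures are expressed as a cost (lower means a closer match), and an empty register is preferred whenever the best existing cost reaches the threshold. Under that reading a low threshold starts many streams and a high one merges them, which matches the bullets above. Function names and sample values are hypothetical, and masks are ignored for brevity.

```python
def hamming_cost(base, addr):
    # Number of bit positions in which the address differs from the register base.
    return bin(base ^ addr).count("1")


def mmub_cost(base, addr):
    # Number of low-order bits lying below the longest matching upper prefix.
    return (base ^ addr).bit_length()


def choose_register(bases, addr, cost, empty_threshold):
    """Return the index of the register to update, or None to start a new stream.

    Assumed rule: reuse the cheapest register unless its cost reaches the threshold.
    """
    costs = [cost(b, addr) for b in bases]
    if not costs or min(costs) >= empty_threshold:
        return None
    return costs.index(min(costs))


bases = [0x7F840, 0x1D330]
print(choose_register(bases, 0x7F843, mmub_cost, 4))     # 0: contiguous access, cost 2
print(choose_register(bases, 0x7F940, hamming_cost, 4))  # 0: strided access, only 1 bit differs
print(choose_register(bases, 0x7F940, mmub_cost, 4))     # None: cost 9, so a new stream starts
```

The last two lines show the difference named in the bullets: a strided address differs in few bits (low Hamming cost) even though the differing bit sits high in the address (high MMUB cost).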

[Figure: two panels for the Ocean benchmark, each plotting % filtered (0 to 100) against the empty affinity threshold for 4, 8, 16, and 32 stream registers (SRs). Ocean MMUB: thresholds 1 to 16. Ocean Minimal Hamming: thresholds 1 to 31.]


Snoop cache size

[Figure: three panels (Ocean, Raytrace, LU) plotting filter rate % (0 to 100) against valid line vector length (1, 2, 4, 8, 16, 32, 64) and the number of lines in the snoop cache (labels: 4 lines, 16 lines).]

Different filtering rate depending on the benchmark

Larger caches advantageous for some applications, less important for others


Combined snoop filter

[Figure: simulation results. Filter rate (%) of the combined snoop filter for FFT, Barnes, LU, Ocean, Raytrace, and Cholesky; bigger is better. Configuration: snoop caches: 8 lines, 32 cache lines; 8 stream registers.]


System-level view

[Figure: block diagram of the Blue Gene/P chip. Labels: four PPC450 cores, each with 32k I1/32k D1 caches and a Double FPU; an L2 and a snoop filter per core; multiplexing switches; 4MB eDRAM blocks usable as L3 cache or on-chip memory, each with a shared L3 directory for eDRAM with ECC; DDR-2 controllers with ECC on a 13.6 GB/s DDR-2 DRAM bus; DMA; shared SRAM; hybrid PMU with 256x64b SRAM; arbiter; JTAG access; 10 Gbit Ethernet; collective; torus; global barrier (4 global barriers or interrupts). Link labels: 6 × 3.4 Gb/s bidirectional and 3 × 6.8 Gb/s bidirectional. Data path labels: 32, 128, 256, and 512b data with 72b ECC.]


BlueGene/P ASIC

IBM Cu-08 90nm CMOS ASIC process technology
Die size 173 mm²
Clock frequency 850 MHz
Transistor count 208M
Power dissipation 16 W


Hardware measurements: snoop filter efficiency

[Figure: normalized execution time (0 to 1.2; smaller is better) for BT, CG, EP, FT, LU-HP, LU, MG, SP, UA, and the average, in four configurations: disabled, stream reg only, snoop cache only, both enabled.]


Snoop filtering improves power and performance

[Figure: actual hardware measurements for the UMT2k application, comparing snoop disabled with snoop filter on time, power, E, E × t, and E × t² (y-axis 50% to 110%; smaller is better).]


Conclusion

System level optimization is key to delivering on CMP promise
– Coherence traffic problematic in terms of performance and power

Multi-component filtering to capture data sharing and use patterns
– Capture both temporal locality and streamed memory accesses
– Capture region-based sharing

Port filtering
– Scalable, as port filters can be replicated to service multiple incoming requests simultaneously
– No communication necessary between filters

Stream registers
– Adaptive filter to capture streams corresponding to contiguously accessed memory regions

Snoop filtering highly effective solution for multicore systems
– Filters up to 99% of all coherence requests
– Efficient hardware implementation
– Significant performance improvement
– Power efficient


BlueGene on the Web

The Blue Gene/P project has been supported and partially funded by Argonne National Laboratory and the Lawrence Livermore National Laboratory on behalf of the United States Department of Energy under Subcontract No. B554331.