
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors

Chinnakrishnan S. Ballapuram

Ahmad Sharif

Hsien-Hsin S. Lee

Ballapuram, Sharif, and Lee

Concurrent Execution in CMP

[Figure: a single-threaded program has code and data plus one set of registers and a local stack (Thread 0). A multi-threaded program shares code and data among Thread 0, Thread 1, and Thread 2, each with its own registers and local stack. The threads sit above a shared last level cache.]

Self-Modifying Code (SMC) Snoop

[Figure: Core 0 to Core 3, each with an IL1 and a DL1 and an SMC snoop marked inside each core.]

Snoop for Core 0 DL1 Miss

[Figure: Core 0 to Core 3, each with an IL1, a DL1, and an SMC snoop, attached to the CMP core interconnect. Behind the interconnect are the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), other logic and buffers, and the external interconnect.]

External Snoop Request

[Figure: Core 0 to Core 3, each with an IL1, a DL1, and an SMC snoop, attached to the CMP core interconnect. Behind the interconnect are the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), other logic and buffers, and the external interconnect.]

Modified L2 Eviction, External Request, etc

[Figure: Core 0 to Core 3, each with an IL1, a DL1, and an SMC snoop, attached to the CMP core interconnect. Behind the interconnect are the L2 queue (FIFO), the L2 cache, the snoop queue (FIFO), other logic and buffers, and the external interconnect. Annotation: "As # of cores increases", with labels "Power" and "Performance".]

Number of Snoop Probes

• SMC Snoops to I-Cache > Snoops to D-Cache > Snoops to LSB.

[Figure: number of snoop probes in millions (axis 0 to 12) to_lsb, to_dcache, and to_icache for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded apps, with one series each for 2C, 4C, 2 x 4C, and 8C. One off-scale bar is labelled 16.4M.]

Snoop Probe and Snoop Rate

• The % increase in data snoops is larger than the % increase in instruction cache (SMC) snoops

[Figure: number of snoops in millions (left axis, 0 to 30) for to_lsb, to_dcache, to_icache, and total snoops, and % of snoop increase (right axis, 0% to 2400%) for data snoops, SMC snoops, and total snoops, versus processor configuration (2C, 4C, 2Px4C, 8C, 8C-MT, 2Px4C-MT). Annotations: "~22x increase" and "~12x increase".]

We propose two techniques to reduce the power consumed by snoop probes:

1. Selective Snoop Probe (SSP)
2. Essential Snoop Probe (ESP)

Selective Snoop Probe (SSP)
- SSP for SMC
- SSP for Non-Stack Accesses
- SSP for Stack Accesses

Selective Snoop Probe (SSP): SSP for SMC

Normal Operation: To Support SMC

[Figure: Core 0. Addresses dispatched from the RS or LSB go to the L1 D-cache and MSHR, and an SMC snoop probe is sent to the L1 I-cache.]

SSP (SMC) – No SMC Snoop if BF1 miss

[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into the counting Bloom filter BF1 (cntr), which is used to filter SMC/XMC snoops. The r1 and u1 paths connect BF1 to the SMC snoop probe path, the L1 I-cache, the L1 D-cache, and the MSHR.]

r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter

SSP (SMC) – SMC Snoop if BF1 Hit

[Figure: Core 0. All store addresses dispatched from the RS or LSB are hashed (HASH) into the counting Bloom filter BF1 (cntr). The r1 and u1 paths connect BF1 to the SMC snoop probe path, the L1 I-cache, the L1 D-cache, and the MSHR.]

r1 – read Bloom filter
u1 – update Bloom filter
cntr – counting Bloom filter
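
The BF1 slides can be read as a counting Bloom filter that summarizes which lines sit in the L1 I-cache, consulted by every store address before an SMC snoop probe is sent. The following is a minimal Python sketch under stated assumptions, not the authors' hardware design: the class name, the table size, the single modulo hash, and the idea that u1 fires on I-cache fills and evictions are all illustrative. It also assumes the usual Bloom filter semantics, where a miss proves absence (no probe) and a hit means "possibly present" (probe sent).

```python
class CountingBloomFilter:
    """Counting Bloom filter with one hash function (illustrative sizes)."""

    def __init__(self, num_counters=1024, line_bytes=64):
        self.counters = [0] * num_counters
        self.line_bytes = line_bytes

    def _index(self, addr):
        # Hypothetical hash: cache-line number modulo the table size.
        return (addr // self.line_bytes) % len(self.counters)

    def insert(self, addr):
        # u1 (assumed): a line was filled into the I-cache.
        self.counters[self._index(addr)] += 1

    def remove(self, addr):
        # u1 (assumed): a line was evicted or invalidated.
        idx = self._index(addr)
        if self.counters[idx] > 0:
            self.counters[idx] -= 1

    def may_contain(self, addr):
        # r1: zero means definitely absent, nonzero means possibly present.
        return self.counters[self._index(addr)] > 0


def needs_smc_snoop(bf1, store_addr):
    """Send the SMC snoop probe only when BF1 reports a possible hit."""
    return bf1.may_contain(store_addr)


bf1 = CountingBloomFilter()
bf1.insert(0x4000)                          # code line fetched into the I-cache
hit = needs_smc_snoop(bf1, 0x4010)          # store to the same 64-byte line
miss = needs_smc_snoop(bf1, 0x9000)         # store to a line never fetched
bf1.remove(0x4000)                          # the code line leaves the I-cache
after_evict = needs_smc_snoop(bf1, 0x4010)  # the same store is now filtered
```

Counters rather than single bits are what allow entries to be removed when a line leaves the cache. A false positive only costs one unnecessary probe, so correctness does not depend on the quality of the hash.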

Selective Snoop Probe (SSP): SSP for Stack Accesses

Normal Operation: Always Snoop for All Accesses

[Figure: Core 0. Addresses dispatched from the RS or LSB access the L1 D-cache and MSHR. A dL1 miss enters the L2 queue of the last level cache, and the snoop controller and snoop queue send snoop probes.]

SSP – Stack Accesses

[Figure: Core 0. All addresses dispatched from the RS or LSB carry an S-bit annotation, annotated by the front-end. They access the L1 D-cache and MSHR, and a dL1 miss enters the L2 queue of the last level cache. The snoop queue at the snoop controller shows one bit per entry (0, 1, 0, 0).]

Selective Snoop Probe (SSP): SSP for Non-Stack Accesses

SSP – Non-stack Accesses Update BF2

[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines labelled by MESI state) and MSHR, and are hashed (HASH) into the counting Bloom filter BF2 (cntr) over the u2 path. BF2 is used to filter snoops to the non-stack region. The L2 queue, last level cache, snoop controller, and snoop queue (entries 1, 0, 0, 0) are also shown.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter

SSP – Non-stack Accesses Read BF2

[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines labelled by MESI state) and MSHR, and update BF2 (HASH, cntr) over the u2 path. All addresses carry an S-bit annotation. On a dL1 miss, BF2 is read over the r2 path to filter snoops to the non-stack region. The L2 queue, last level cache, snoop controller, and snoop queue (entries 1, 0, 0, 0) are also shown.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter

SSP - Selectively Send Snoop Probes

[Figure: Core 0. All non-stack addresses dispatched from the RS or LSB access the L1 D-cache (lines labelled by MESI state) and MSHR, and update BF2 (HASH, cntr) over the u2 path. All addresses carry an S-bit annotation. On a dL1 miss, BF2 filters snoops to the non-stack region, and the snoop controller selectively sends snoops from the snoop queue (entries 1, 0, 0, 0). The L2 queue and last level cache are also shown.]

r2 – read Bloom filter
u2 – update Bloom filter
cntr – counting Bloom filter
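
Taken together, the stack and non-stack SSP slides suggest a two-step check before a snoop probe is sent for a dL1 miss. The sketch below is one hedged reading in Python, not the authors' logic: it assumes that an S-bit of 1 marks a stack access that needs no probe, and that each core has a BF2 summarizing its non-stack lines. The function name `cores_to_snoop` and the use of exact Python sets in place of hashed counters are hypothetical.

```python
def cores_to_snoop(miss_addr, s_bit, requester, bf2_per_core, line_bytes=64):
    """Return the cores that should receive a snoop probe for this dL1 miss."""
    if s_bit == 1:
        # Assumed meaning: stack data is thread-local, so no core is probed.
        return []
    line = miss_addr // line_bytes
    # Probe only cores whose filter reports the line as possibly present.
    return [core for core in sorted(bf2_per_core)
            if core != requester and line in bf2_per_core[core]]


# BF2 modelled as exact sets of cache-line numbers. A real Bloom filter may
# also report false positives, which waste a probe but remain correct.
bf2 = {0: {0x100}, 1: {0x100, 0x200}, 2: set(), 3: {0x200}}

stack_miss = cores_to_snoop(0x7FFF0000, s_bit=1, requester=0, bf2_per_core=bf2)
shared_miss = cores_to_snoop(0x200 * 64 + 8, s_bit=0, requester=0, bf2_per_core=bf2)
private_miss = cores_to_snoop(0x300 * 64, s_bit=0, requester=0, bf2_per_core=bf2)
```

In this model a stack miss probes nobody, a miss on a line held by cores 1 and 3 probes only those two, and a miss on a line no filter knows about probes nobody.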

Essential Snoop Probe (ESP)
- ESP for SMC
- ESP for all variables

Essential Snoop Probe (ESP): ESP for SMC

SMC – Normal Operation

[Figure: Core 0 with the L1 I-$, the L1 D-$, and other pipe stages. Every store dispatched from the RS or LSB snoops the I-cache.]

ESP: Essential Snoop Probe

[Figure: Core 0 with the L1 I-$, the L1 D-$, other pipe stages, and dispatch from the RS or LSB, shown with SMC-CR=1.]

• OS sets a control register bit (SMC-CR)
• SMC-CR=1: non-self-modifying code
• SMC-CR=0: self-modifying code
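
The effect of the control register bit can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the model assumes that with SMC-CR=0 every store snoops the I-cache (the normal operation described earlier) and that with SMC-CR=1 none does.

```python
def store_snoops_icache(smc_cr):
    """SMC-CR = 1 marks non-self-modifying code, so stores skip the I-cache snoop."""
    return smc_cr == 0


def count_icache_snoops(num_stores, smc_cr):
    # Normal operation: one SMC snoop per store. With ESP and SMC-CR = 1: none.
    return num_stores if store_snoops_icache(smc_cr) else 0


baseline = count_icache_snoops(1000, smc_cr=0)   # self-modifying code possible
with_esp = count_icache_snoops(1000, smc_cr=1)   # OS declared the code non-SMC
```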

Essential Snoop Probe (ESP): ESP for all variables

Normal Operation – Snoop for All Variables

[Figure: Core 0 with the L1 I-$, the L1 D-$, other pipe stages, and dispatch from the RS or LSB. A dL1 miss enters the L2 queue of the last level cache in the CMP interconnect domain, and the snoop controller and snoop queue send snoop probes.]

Essential Snoop Probe (ESP) – SMN bit 1

[Figure: Core 0 with the L1 I-$, the L1 D-$, other pipe stages, and dispatch from the RS or LSB. A dL1 miss with SMN bit annotation enters the L2 queue of the last level cache in the CMP interconnect domain. The snoop queue at the snoop controller shows one SMN bit per entry (1, 1, 0, 0).]

SMN bit – Snoop-Me-Not bit is 0/1

Essential Snoop Probe (ESP) – SMN bit 0

[Figure: Core 0 with the L1 I-$, the L1 D-$, other pipe stages, and dispatch from the RS or LSB. A dL1 miss with SMN bit annotation enters the L2 queue of the last level cache in the CMP interconnect domain. The snoop queue at the snoop controller shows one SMN bit per entry (0, 1, 0, 0), and the probes that are sent are labelled ESP.]

SMN bit – Snoop-Me-Not bit is 0/1
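
One way to picture the Snoop-Me-Not annotation is as a per-entry flag checked when the snoop queue is drained. The Python sketch below assumes that SMN=1 suppresses the probe and SMN=0 lets the essential probe through; this reading of the two slides, the tuple layout, and the function name are assumptions rather than details given in the deck.

```python
from collections import deque


def drain_snoop_queue(queue):
    """Pop entries in FIFO order and return the addresses that get probed."""
    probed = []
    while queue:
        addr, smn = queue.popleft()
        if smn == 0:
            # Assumed: SMN = 0 means the probe is essential and must be sent.
            probed.append(addr)
        # Assumed: SMN = 1 ("snoop me not") drops the probe entirely.
    return probed


# Entries are (miss address, SMN bit); the bit pattern mirrors the 1, 1, 0, 0 example.
snoop_queue = deque([(0x1000, 1), (0x2000, 1), (0x3000, 0), (0x4000, 0)])
probed = drain_snoop_queue(snoop_queue)
```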

Energy Savings in D-Cache Using SSP

• SSP saves 5% to 10% of data cache energy in the 2C configuration and 30% to 65% in the 8C configuration.

• Data cache energy savings grow with the number of cores on the die, because the number of snoops sent to all the cores grows as well.

[Figure: % of data cache energy savings per core (axis 0% to 70%) versus processor configuration (2C, 4C, 2Px4C, 8C) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application.]

Energy Savings in I-Cache Using SSP

• SSP saves 50% to 70% of instruction cache tag energy across all processor configurations.

[Figure: % of icache tag energy savings per core (axis 0% to 100%) versus processor configuration (2C, 4C, 2Px4C, 8C) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, and multi-threaded application.]

Performance Impact with SSP

• On average, performance improves by 1% to 2% across the benchmark categories and processor configurations.

[Figure: performance (axis 0% to 120%) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, harmonic mean across benchmarks, min performance observed, and max performance observed, with one series each for 2C, 4C, 2Px4C, and 8C.]

Energy Savings with ESP

• Between 5% and a maximum of 82% of data cache energy is spent on non-essential snoop probes, which the ESP technique can eliminate.

• Also, 85% of instruction cache tag energy is spent on snoops that ESP can eliminate.

[Figure: % of cache energy spent on non-essential snoops per core (axis 0% to 100%), shown for dcache and icache in each of SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, and harmonic mean across benchmarks, with one series each for 2C, 4C, 2Px4C, and 8C.]

Conclusion

• Semantics and program behavior are useful indicators

• They are exploited to reduce power due to snoops

• We proposed
  – Selective Snoop Probe (SSP)
  – Essential Snoop Probe (ESP)

• Energy reduction results
  – 5% to 65% in D-cache per core
  – 50% to 70% in I-cache per core

• 1% - 2% performance improvement

• Extensible to optimize integrated platforms with graphics processor

Georgia Tech, Electrical and Computer Engineering, MARS Labs
http://arch.ece.gatech.edu

Thank You!

BACKUP

Simulation Infrastructure

Execution Engine                   4-wide, Out-of-Order
Load buf / Store buf / RS / ROB    96 / 64 / 128 / 256 entries
L1 / L2 latency                    4 / 8 cycles
L1 I, L1 D cache size              32KB, 8 way, 64B
L2 Cache                           4MB, 16 way, 64B
L1 TLB entries                     128, 4 way
Memory                             2GB, DDR 2 timings
CACTI 4.2                          70nm power model

Benchmark class                    Example applications
Server                             specJBB, TPCC
SPEC FP 2006                       wrf, namd, lbm, soplex
SPEC INT 2006                      hmmer, gobmk, omnetpp, gcc
Games and multi-media              shooters, realtime strategy, raytracer
Multi-threaded applications        ray tracer, cinebench

Number of Modified Lines

• The chart shows the number of modified lines that need to be evicted to the last level cache.

[Figure: number of modified lines at completion (axis 0 to 220) for SPEC INT 2006, SPEC FP 2006, games/multi-media, server, multi-threaded application, and average across benchmarks, with one series each for 2C, 4C, 2Px4C, and 8C.]

Cache access vs. snoop access

• Cache access: read one sub-bank (8 bytes)
• Snoop access: need to read all sub-banks (all 64 bytes, the cache line size) to ship the data to other cores or to another processor in an MP system
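
The sizes on this slide imply that a snoop access touches eight times as many sub-banks as a normal cache access. A small Python sketch of that arithmetic follows; the access counts in the example are made up for illustration, and treating sub-bank reads as a proxy for energy is an assumption, not a claim from the slides.

```python
LINE_BYTES = 64       # cache line size, from the slide
SUB_BANK_BYTES = 8    # bytes read by a normal cache access, from the slide

# A snoop that ships the whole line must read every sub-bank of that line.
sub_banks_per_line = LINE_BYTES // SUB_BANK_BYTES


def sub_bank_reads(cache_accesses, snoop_accesses):
    """Total sub-bank reads: one per cache access, a full line per snoop access."""
    return cache_accesses + snoop_accesses * sub_banks_per_line


# Hypothetical mix: 100 normal accesses and 10 snoop accesses.
reads = sub_bank_reads(cache_accesses=100, snoop_accesses=10)
```

With these illustrative counts, the 10 snoop accesses account for 80 of the 180 sub-bank reads, which is why removing even a modest number of snoops can matter for cache energy.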

Hash functions

[Figure: a cache line (48-bit physical address) with its MESI state, tag + index bits, and data. The tag + index bits [6-32] are divided into fields A, B, and C, and the remaining address bits are unused. HASH 3 maps these fields to counters (cntr), with separate counters shown for lines in M/E state and lines in S state.]

If bit-10 is 0, HASH3 = A ^ B ^ C
If bit-10 is 1, HASH3 = (A ^ 0x22) ^ B ^ C
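
The two HASH3 rules can be written out directly. The slide does not give the field boundaries in a recoverable form, so the Python sketch below assumes that the 27 tag + index bits [6-32] split into three 9-bit fields, with A as bits 6-14, B as bits 15-23, and C as bits 24-32. Those boundaries and the name `hash3` are assumptions for illustration only.

```python
FIELD_BITS = 9                       # assumption: 27 bits split three ways
FIELD_MASK = (1 << FIELD_BITS) - 1


def hash3(phys_addr):
    """XOR-fold fields A, B, C; bit 10 of the address selects the 0x22 variant."""
    a = (phys_addr >> 6) & FIELD_MASK     # assumed field A: bits 6-14
    b = (phys_addr >> 15) & FIELD_MASK    # assumed field B: bits 15-23
    c = (phys_addr >> 24) & FIELD_MASK    # assumed field C: bits 24-32
    if (phys_addr >> 10) & 1:
        a ^= 0x22                         # rule for bit-10 = 1
    return a ^ b ^ c


h_plain = hash3(0x40)     # only bit 6 set, bit 10 clear: A = 1
h_flip = hash3(0x400)     # only bit 10 set: A = 0x10, then A ^ 0x22 = 0x32
```

Under these assumed field widths the result always fits in 9 bits, so it could index a table of 512 counters.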

Incoming Events to LLC

Incoming events to the last level cache:
• RFO
• Data Read
• Code fetch
• Shared L2 evict

Incoming Events to LLC and Sources of Snoop Triggers

Incoming events to the last level cache, with the unit of this core that triggers each event ("-" marks an empty cell):

RFO
• This core: iL1 -; dL1 event trigger

Data Read
• This core: iL1 -; dL1 event trigger

Code fetch
• This core: iL1 event trigger

Shared L2 evict

Snooped Units in the Triggered Core

Incoming events to the last level cache, with the action at each unit of this core ("-" marks an empty cell):

RFO
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -

Data Read
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -

Code fetch
• This core: iL1 event trigger; dL1 SMC snoop; LSB snoop store buffer only (updated writes); MSHR, WBB snoop (update writes)

Shared L2 evict
• This core: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop

Snoop Probes for Incoming Data Read

Incoming events to the last level cache, with the action at each unit ("-" marks an empty cell):

RFO
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
• Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB snoop load buffer only, to invalidate; MSHR, WBB snoop to invalidate pending requests
• Shared L2 queue: snoop to invalidate

Data Read
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
• Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB -; MSHR, WBB snoop
• Shared L2 queue: snoop

Code fetch
• This core: iL1 event trigger; dL1 SMC snoop; LSB snoop store buffer only (updated writes); MSHR, WBB snoop (update writes)
• Other 3 cores: iL1 -; dL1 XMC snoop; LSB snoop store buffer only (update writes); MSHR, WBB snoop
• Shared L2 queue: SMC snoop

Shared L2 evict
• This core: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
• Other 3 cores: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
• Shared L2 queue: snoop

Snoop Triggers and Snoop Units

Incoming events to the last level cache, with the action at each unit ("-" marks an empty cell):

RFO
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
• Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB snoop load buffer only, to invalidate; MSHR, WBB snoop to invalidate pending requests
• Shared L2 queue: snoop to invalidate

Data Read
• This core: iL1 -; dL1 event trigger; LSB -; MSHR, WBB -
• Other 3 cores: iL1 XMC snoop to invalidate line; dL1 snoop; LSB -; MSHR, WBB snoop
• Shared L2 queue: snoop

Code fetch
• This core: iL1 event trigger; dL1 SMC snoop; LSB snoop store buffer only (updated writes); MSHR, WBB snoop (update writes)
• Other 3 cores: iL1 -; dL1 XMC snoop; LSB snoop store buffer only (update writes); MSHR, WBB snoop
• Shared L2 queue: SMC snoop

Shared L2 evict
• This core: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
• Other 3 cores: iL1 -; dL1 snoop; LSB -; MSHR, WBB snoop
• Shared L2 queue: snoop

(row with no event label)
• This core: iL1 SMC snoop to iL1; dL1 on all store addr disp; LSB -; MSHR, WBB -
• Other 3 cores: iL1 SMC snoop to iL1; dL1 on all store addr disp; LSB -; MSHR, WBB -
• Shared L2 queue: -