Acoherent Shared Memory. Derek R. Hower, Ph.D. Defense, July 16, 2012


Page 1: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Acoherent Shared Memory

Derek R. Hower
Ph.D. Defense
July 16, 2012

Page 2: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

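Executive Summary

[Diagram: a coherent view and an acoherent view of shared memory for processors (P) and a GPU over L1 and L2 caches; in the acoherent view, processors use checkout (CO) and checkin (CI).]

Coherent view: simple abstraction?
- Complex implementation
- Hides caches (bad?!)
- High overhead

Acoherent view: simple abstraction
- Simple implementation
- Abstracts caches
- Low overhead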

Page 3: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Page 4: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Trends

We must change
- Energy Matters
  • Dark Silicon/Mobile/Datacenter
  • < 50% of processor powered by 2024 [1]
- Complexity Matters
  • Lower barrier to entry for accelerators
- Area Matters
  • New tech nodes are not cheaper [2]
  • Memory: may be difficult to turn off, e.g., S-NUCA

We can change
- Compatibility doesn’t matter
  • Vertical integration is the new black

[1] Esmaeilzadeh, et al. ISCA 2011
[2] ExtremeTech 2012

Page 5: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

The Problem With Coherence

- Wrong abstraction
  • Optimized for fine-grained, share-everything. Programs aren’t!
- Makes SW isolation hard
  • Hypothesis: SW will want control over data placement
- Impedes HW specialization
  • Does your multicore ASIC need a coherence controller? Coherent GPUs?
- Efficiency problems
  • Directories take space/broadcasts take energy
  • e.g., 14% of cache area is dedicated to the directory on a 4-core die [1]

[1] Stackhouse et al., ISSCC 2008

Page 6: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Rethinking Coherence: Goals

- Maintain programmer sanity
  • Keep shared memory
  • Minimal compatibility change
- Expose hardware capabilities
  • Let SW guide memory management -> semantics
- Simple hardware
  • Lower cost of entry for accelerators

Solution: Acoherent Shared Memory

Page 7: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Page 8: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

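ASM Model Basics

- Replace black box with simple hierarchy
  • Still flat, linear address space
  • SW gets private storage
- Manage with CVS-like checkout/checkin

[Diagram: two processors (P), each with private storage, connected to shared memory by checkout (CO) and checkin (CI).]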

Page 9: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Checkout/Checkin

- Checkout: Pull data into private storage
- Checkin: Publish local updates globally

[Diagram: two processors (P), each with checkout (CO) and checkin (CI) paths to shared memory.]

Granularity?

Checkout/Checkin are not synchronization primitives; they are closer to a FENCE.
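The checkout/checkin idea can be sketched in software. The following is a minimal model, assuming one fixed-size segment, a per-thread private copy, and per-byte dirty flags; the names `shared_segment`, `private_view`, `store`, and `load` are hypothetical and are not taken from the slides. It behaves like the strong form of the model: updates stay private until checkin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SEG_BYTES 64  /* hypothetical segment size for this sketch */

/* Globally visible copy of one segment. */
typedef struct {
    unsigned char data[SEG_BYTES];
} shared_segment;

/* One thread's private working copy plus a per-byte dirty flag. */
typedef struct {
    unsigned char data[SEG_BYTES];
    bool dirty[SEG_BYTES];
} private_view;

/* Checkout: pull the current global data into private storage,
 * discarding local updates that were never checked in. */
static void checkout(private_view *pv, const shared_segment *g)
{
    memcpy(pv->data, g->data, SEG_BYTES);
    memset(pv->dirty, 0, sizeof pv->dirty);
}

/* Ordinary stores and loads touch only the private copy. */
static void store(private_view *pv, size_t off, unsigned char v)
{
    assert(off < SEG_BYTES);
    pv->data[off] = v;
    pv->dirty[off] = true;
}

static unsigned char load(const private_view *pv, size_t off)
{
    assert(off < SEG_BYTES);
    return pv->data[off];
}

/* Checkin: publish locally written bytes to the global copy. */
static void checkin(private_view *pv, shared_segment *g)
{
    for (size_t i = 0; i < SEG_BYTES; i++) {
        if (pv->dirty[i]) {
            g->data[i] = pv->data[i];
            pv->dirty[i] = false;
        }
    }
}
```

In this model a second thread sees another thread's store only after the writer checks in and the reader checks out again, which is why the pair acts more like a fence than like a lock.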

Page 10: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Segments

[Diagram: address space divided into Stack, Code, BSS, Data, and Heap segments.]

- Compromise: Memory Segments
  – Linear partition of address space
  – CO/CI segments at a time
- Observation: Programs are already segmented
  • Can re-use layout
- Typical CO/CI granularity in existing C code

Page 11: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Segment Types

- Not all memory wants/needs acoherence
- Segment types give different “views”
- Communicate semantic information to HW

[Diagram: example assignment. Stack: Private (private). Code: Coherent RO (shared, read-only). BSS, Data, Heap: Acoherent (shared).]

Available Types
- Private
- Coherent-RW
- Coherent-RO
- Acoherent
- Device
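As a sketch of how segment types could be represented in software, the block below encodes the five types as an enum and looks up the type of an address in a linear partition. The address ranges, the `layout` table, and the helper names are assumptions made for illustration; the slides do not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The five segment types listed on the slide. */
typedef enum {
    SEG_PRIVATE,
    SEG_COHERENT_RW,
    SEG_COHERENT_RO,
    SEG_ACOHERENT,
    SEG_DEVICE
} seg_type;

typedef struct {
    uintptr_t base;    /* first address of the segment */
    uintptr_t limit;   /* one past the last address */
    seg_type type;
    const char *name;
} segment;

/* Hypothetical linear partition of a flat address space. */
static const segment layout[] = {
    {0x00000000u, 0x00100000u, SEG_COHERENT_RO, "code"},
    {0x00100000u, 0x00200000u, SEG_ACOHERENT,   "data+bss"},
    {0x00200000u, 0x00800000u, SEG_ACOHERENT,   "heap"},
    {0x00800000u, 0x00900000u, SEG_COHERENT_RW, "sync"},
    {0x00900000u, 0x00a00000u, SEG_PRIVATE,     "stack"},
};

/* Return the segment containing addr, or NULL if it is unmapped. */
static const segment *find_segment(uintptr_t addr)
{
    for (size_t i = 0; i < sizeof layout / sizeof layout[0]; i++)
        if (addr >= layout[i].base && addr < layout[i].limit)
            return &layout[i];
    return NULL;
}

/* In this sketch only acoherent segments need explicit CO/CI. */
static int needs_explicit_co_ci(seg_type t)
{
    return t == SEG_ACOHERENT;
}
```

Because the partition is linear, software can stay oblivious to segments for ordinary loads and stores and consult the table only when it issues a checkout or checkin.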

Page 12: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Managing Finite Resources

- Model so far is strong acoherence
  • Likely requires prohibitive HW resources
- Also weak acoherence and best-effort acoherence
  • Still useful to software/hardware
- Weak acoherence: Data visible early (before checkin)
  • Synchronized => not a problem
- Best-effort acoherence: Spontaneous checkouts at any time
  • + SW notification
  • All-or-nothing
  • Hybrid Runtimes => not a problem

Page 13: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Case Study: pthreads

    #include <pthread.h>
    #include <stdlib.h>

    pthread_barrier_t barrier;
    char* shared_data;

    void* worker(void* arg);

    int main(int argc, char* argv[]) {
      int i, j, k;
      pthread_t sib;
      shared_data = malloc(PROBLEM_SIZE);
      pthread_barrier_init(&barrier, NULL, 2);
      pthread_create(&sib, NULL, worker, (void*) 1);
      worker((void*) 0);
      pthread_join(sib, NULL);
      return 0;
    }

    void* worker(void* arg) {
      while (work remains) {
        <split work>
        <do work>
        pthread_barrier_wait(&barrier);   /* communication point */
      }
      return NULL;
    }

Task: Convert to ASM

Step 1: Assign Segments (automatic: runtime)
  • Global, Heap in acoherent segment (shared_data)
  • Stack in private segment (argc, argv, arg, i, j, k, sib)
  • Synch. in coherent-RW segment (barrier)
  • Text in coherent-RO segment

Step 2: Checkout/Checkin (automatic: library)
  • CI/CO Global, Heap at synchronization (the communication point)

The application works as is; the changes are confined to the library:

    int pthread_barrier_init(…) {
      …
      _barrier = coherent_malloc(sizeof(int));
      …
    }

    int pthread_barrier_wait(…) {
      …
      checkin(heap, data);
      <barrier>
      checkout(heap, data);
      …
    }

Page 14: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Memory Consistency Model

- Option 1: The Details (6 slides + really ugly equations)
- Option 2: The Highlights (2 slides)

Page 15: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Memory Consistency Model

- Defined in style of SPARC TSO/RMO
  • Memory Order: Total order of memory ops, restricted by consistency model
  • Processor Order: Local dependencies
  • Value of load: defined via memory + processor order

Page 16: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

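Weak Acoherence

1. Define Memory Order

Rules (a)-(c) are the same as TSO, etc. A CI-CO pair acts as a fence, and CO/CI are totally ordered. Here $<_p$ is processor order, $<_m$ is memory order, subscripts name the processor, and the superscript $S$ names the segment.

(a) Load -> Load to same address:
    $L_i^S[a] <_p L_i'^S[a] \Rightarrow L_i^S[a] <_m L_i'^S[a]$
(b) Load -> Store to same address:
    $L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$
(c) Store -> Store to same address:
    $S_i^S[a] <_p S_i'^S[a] \Rightarrow S_i^S[a] <_m S_i'^S[a]$
(d) Paired CI-CO act as distributed fence:
    $S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S <_m L_j^S$
(e) CI/CO -> CI/CO, where $CX$ is either CI or CO:
    $CX <_p CX' \Rightarrow CX <_m CX'$

2. Define legal value of loads

    $\mathrm{value}(L_i^S[a]) = \mathrm{value}\left(\max_{<_m}\{\, S^S[a] \mid S^S[a] <_m L_i^S[a] \ \text{or}\ S^S[a] <_p L_i^S[a] \,\}\right)$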

Page 17: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Strong Acoherence

1. Define Memory Order

(a) Load -> Load to same address:
    $L_i^S[a] <_p L_i'^S[a] \Rightarrow L_i^S[a] <_m L_i'^S[a]$
(b) Load -> Store to same address:
    $L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$
(c) Store -> Store to same address:
    $S_i^S[a] <_p S_i'^S[a] \Rightarrow S_i^S[a] <_m S_i'^S[a]$
(d) Paired CI-CO act as distributed fence:
    $S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S <_m L_j^S$
(e) CI/CO -> CI/CO, where $CX$ is either CI or CO:
    $CX <_p CX' \Rightarrow CX <_m CX'$
(f) Store not visible until CI (normally):
    $S_i^S <_p CI_i^S \Rightarrow CI_i^S <_m S_i^S$
(g) Stores can be clobbered (can “lose” data)
    [Equation: ordering of a store relative to the next checkout and checkin in processor order.]

2. Define legal value of loads

[Equation: the value of $L_i^S[a]$ is the value of the latest local store $S_i^S[a]$ that follows the most recent checkout in processor order or, if $S_i^S[a]$ does not exist, the value of the latest store to $a$ in memory order relative to that checkout.]

Page 18: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Other Segment Types

- Coherent
  • Like weak, but:
    – Loads implicitly paired with (atomic) CO
    – Stores implicitly paired with (atomic) CI
  • SC w.r.t. each other
- Private
  • Like weak

Page 19: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Analysis

- CO/CI not atomic
- Subtleties:

(a) Lazy checkout. Initially, A = 0

    Thread 0          Thread 1
    00: CHECKOUT      10: A = 1
    01: R0 = A        11: CHECKIN

    Strong: R0 = 0 or 1
    Weak:   R0 = 0 or 1

(b) Isolation. Initially, A = 0

    Thread 0          Thread 1
    02: CHECKOUT      12: A = 1
    03: R0 = A        13: CHECKIN
    04: R1 = A

    Strong: R0 = 0, R1 = 0
    Weak:   R0 = 0, R1 = 0 or 1

(c) Leaky stores. Initially, A = 0

    Thread 0          Thread 1
    05: CHECKOUT      14: A = 1
    06: R0 = A

    Strong: R0 = 0
    Weak:   R0 = 0 or 1

Page 20: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM = SC for DRF

ASM = SC for lossless and properly paired executions.

- Lossless: No clobbering checkouts, i.e.,
    $\forall S_i^S[a]:\ \text{if}\ \exists\, CO_i^S : S_i^S[a] <_p CO_i^S\ \text{then}\ \exists\, CI_i^S : S_i^S[a] <_p CI_i^S <_p CO_i^S$

- Properly Paired: All conflicting store -> load pairs are separated by CI/CO, i.e.,
    $\forall L_i^S[a], S_j^S[a] : \mathrm{value}(L_i^S[a]) = \mathrm{value}(S_j^S[a]),\ i \neq j \Rightarrow \exists\, CO_i^S, CI_j^S : S_j^S[a] <_p CI_j^S <_m CO_i^S <_p L_i^S[a]$

- Proof sketch:
  • LL+PP executions are defined by CO/CI order and program order only
  • CO/CI order and program order are the same in ASM and SC

Page 21: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

CO/CI Semantics

- CO/CI like fence
- Lazy checkouts
- Non-atomic, non-blocking checkins
  • Updates can interleave

Lazy checkout. Initially, A = 0

    Thread 0          Thread 1
    00: CHECKOUT      10: A = 1
    01: R0 = A        11: CHECKIN

    Finally: R0 = 0 or 1

Interleaved checkins. Initially, A = 0

    Thread 0          Thread 1
    00: A = 1         10: A = 10
    01: B = 2         11: B = 20
    02: CHECKIN       12: CHECKIN

    Finally, any combo of: A = 1 or 10, B = 2 or 20

Page 22: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Consistency Highlights

- Coherent accesses have implicit CO/CI
- CO/CI are totally ordered
  • Transitivity hides non-atomicity
- Sequentially consistent for data-race-free
  • Lossless & Properly Paired

Example (lock hand-off between two threads):

Thread 0
    ST critical
    CI critical_segment
    STsync lock:
        ST lock
        CI lock_segment

Thread 1
    LL lock:
        CO lock_segment
        LD lock
    SC lock:
        ST lock
        CI lock_segment
    CO critical_segment
    LD critical

Page 23: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions and Future Work

Page 24: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM-CMP Overview

- Based on MIPS
  • + special insns, e.g., checkout, checkin
- Uses segments, no paging
  • Maintains flat address space
- Coherence protocol -> Acoherence Engine
  • DMA for caches
  • Selectively move data

Page 25: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Baseline

[Diagram: core with L1I and L1D caches, an L2, a switch, and four memory controllers.]

Page 26: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

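Segment Types

[Diagram: cache organization per segment type for two processors (P).
 Acoherent: L1s with acoherence engines (AE) and CO/CI paths to a non-inclusive L2.
 Coherent-RW: processors access the L2.
 Private: L1s with an exclusive L2.]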

Page 27: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Acoherence Engine

Three main responsibilities:
- Checkout: Invalidate all segment data
- Checkin: Write back all dirty segment data
- Order: Detect CI-CO pairs

FSM like coherence, but few races, no directory.

Mechanisms: timestamp based ordering, lazy flash invalidate, write-set tracking, decoupled metastate cache.

Page 28: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Decoupled Metastate Cache

[Diagram: metastate cache shared by all L1 caches.]

- Decouple metastate from data
  • Quick access to aggregate state
  • Track V/D per-segment
- Checkout: XOR global/segment valid
- Checkin: Walk segment dirty state
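One plausible reading of the two operations above can be sketched with bitmasks. This assumes a 64-line L1, one valid and one dirty bitmask per segment, and that each cached line belongs to one segment at a time; the struct layout and function names are hypothetical and do not describe the actual ASM-CMP hardware.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 4   /* hypothetical segment count */

/* Metastate for a 64-line L1, kept apart from the data array.
 * Bit k of each mask refers to cache line k. */
typedef struct {
    uint64_t valid;                    /* global valid bits */
    uint64_t seg_valid[NUM_SEGMENTS];  /* valid lines per segment */
    uint64_t seg_dirty[NUM_SEGMENTS];  /* dirty lines per segment */
} metastate;

/* Record a fill (or a write) of line `line` on behalf of segment `seg`. */
static void touch_line(metastate *m, unsigned seg, unsigned line, int is_write)
{
    assert(seg < NUM_SEGMENTS && line < 64);
    uint64_t bit = UINT64_C(1) << line;
    m->valid |= bit;
    m->seg_valid[seg] |= bit;
    if (is_write)
        m->seg_dirty[seg] |= bit;
}

/* Checkout: flash-invalidate every line of the segment in one step.
 * seg_valid is a subset of valid, so XOR clears exactly those bits. */
static void checkout_segment(metastate *m, unsigned seg)
{
    assert(seg < NUM_SEGMENTS);
    m->valid ^= m->seg_valid[seg];
    m->seg_valid[seg] = 0;
    m->seg_dirty[seg] = 0;
}

/* Checkin: walk only the dirty lines of the segment, calling the
 * optional writeback hook for each; returns the number of lines walked. */
static unsigned checkin_segment(metastate *m, unsigned seg,
                                void (*writeback)(unsigned line))
{
    assert(seg < NUM_SEGMENTS);
    unsigned count = 0;
    uint64_t d = m->seg_dirty[seg];
    while (d) {
        unsigned line = 0;
        while (!((d >> line) & 1))
            line++;                 /* index of the lowest set bit */
        if (writeback)
            writeback(line);
        d &= d - 1;                 /* clear the lowest set bit */
        count++;
    }
    m->seg_dirty[seg] = 0;
    return count;
}
```

Keeping the masks apart from the data array is what makes both operations cheap in this sketch: checkout is a constant number of word operations, and checkin costs time proportional to the number of dirty lines rather than to the cache size.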

Page 29: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Order

Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn’t completed

Timestamp algorithm (per segment): two-phase CO/CI
1. Acquire timestamp, then invalidate/flush
2. Wait for previous CO/CI to complete

Implemented in firmware
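The two-phase idea can be illustrated with a ticket-style counter per segment. This is a single-threaded sketch under the assumption that timestamps are handed out in order and operations complete in timestamp order; the type and function names are hypothetical, and a real implementation would need atomic updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-segment ordering state. */
typedef struct {
    uint64_t next_ts;    /* next timestamp to hand out */
    uint64_t completed;  /* every operation with ts < completed is done */
} seg_order;

/* Phase 1: take a timestamp. The caller then starts its invalidate
 * (checkout) or flush (checkin) without waiting for earlier operations. */
static uint64_t co_ci_begin(seg_order *o)
{
    return o->next_ts++;
}

/* Phase 2 check: an operation may complete only once all earlier
 * CO/CI operations on this segment have completed. */
static bool co_ci_may_complete(const seg_order *o, uint64_t ts)
{
    return o->completed == ts;
}

/* Phase 2: mark the operation complete, releasing the next one. */
static void co_ci_complete(seg_order *o, uint64_t ts)
{
    assert(co_ci_may_complete(o, ts));
    o->completed = ts + 1;
}
```

Comparing timestamps answers the first requirement (whether a CI precedes a CO), and the completion check answers the second (a load after a CO waits until the earlier CI is done).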

Page 30: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Multiple Writer Support

- Keep per-byte dirty bitmask in L1s
  • Allows multiple writers with false sharing
  • 12.5% larger L1 cache
- Bitmask accompanies data to L2
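A per-byte dirty mask costs one bit per eight data bits, which is where the 12.5% figure comes from (1/8 = 12.5%). The sketch below shows how such a mask could let two writers update different bytes of the same line without clobbering each other; the 64-byte line size and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64  /* hypothetical line size: 64 data bytes, 64 mask bits */

/* An L1 line with its per-byte dirty mask. */
typedef struct {
    uint8_t data[LINE_BYTES];
    uint64_t dirty;    /* bit i set => byte i was written locally */
} l1_line;

/* A local store marks the written byte dirty. */
static void l1_store_byte(l1_line *l, size_t off, uint8_t v)
{
    assert(off < LINE_BYTES);
    l->data[off] = v;
    l->dirty |= UINT64_C(1) << off;
}

/* Checkin of one line: the mask travels with the data, and only
 * the dirty bytes overwrite the L2 copy. */
static void l2_merge(uint8_t l2[LINE_BYTES], const l1_line *l)
{
    for (size_t i = 0; i < LINE_BYTES; i++)
        if ((l->dirty >> i) & 1)
            l2[i] = l->data[i];
}
```

With whole-line writeback, the second writer's stale copy of byte 0 would overwrite the first writer's update; merging by mask keeps both.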

Page 31: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

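Simple?

[Diagram: in a directory protocol, two L1s exchange REQ and RESP messages with the Directory/L2, plus forwarded (FWD) messages, which are a source of races and complexity. The contrasting design shows only an L2.]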

Page 32: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Outline

- Motivation and Goals
- ASM Model
- ASM-1 Prototype
- Evaluation and Results
- Conclusions and Future Work

Page 33: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Methodology

- Simulation-based
  • Enhanced-User Mode
- Workloads:
  • Class-1: SPLASH
  • Class-2: Task-Q
- Three memory modules
  • ASM-CMP
  • CC from gem5-Ruby
    – MESI (Inclusive)
    – MOESI (Non-inclusive)

Page 34: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Performance

[Figure: runtime normalized to MOESI (y-axis 0 to 1.4) for moesi, mesi, and asm across barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, and gmean.]

- Comparable performance
- Callouts: Checkout too much; False Sharing/Migratory Sharing

Page 35: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Perfect Checkout

[Figure: runtime normalized to the ASM baseline (y-axis 0 to 1.2) for asm_base and asm_ideal across barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, and gmean.]

Page 36: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Energy

[Figure: energy normalized to MOESI (y-axis 0 to 1.2), broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb, across barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, and gmean.]

Less Energy (Same Performance)

Page 37: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Checkout Characteristics

[Figure: Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water). Histogram of % of checkouts (0% to 80%) by # blocks invalidated, in bins of 8 from 0-7 to 120-127.]

[Figure: % of checkout invalidations elided (y-axis 0 to 1.2) for barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_fixed, and mean.]

- Most checkout invalidations affect dead blocks
- Checkouts usually small; can be large (> 25% of L1)

Page 38: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Checkin Characteristics

[Figure: Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water). Histogram of % of checkins (0% to 40%) by # blocks invalidated, in bins of 8 from 0-7 to 120-127.]

- Checkin latency is hidden
- Checkins usually small; can be large (> 25% of L1)

Page 39: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Outline

- Motivation and Goals
- ASM Model
- ASM-CMP Prototype
- Evaluation and Results
- Conclusions/Other Work

Page 40: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Conclusions

- Going forward:
  • HW designs must find efficiency
  • SW will want to see caches/control placement
- ASM: viable alternative to coherent shared memory
  • Semantic cooperation between HW/SW
- ASM-CMP: build components w/o coherence engine
  • Make custom integration easier
- Practically:
  • Will the next x86 core use ASM? No
  • Will a heterogeneous accelerator? Maybe

Page 41: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Related Work

- ASM Model
- ASM-CMP
- Alternatives/Detractors

Page 42: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Related Work – ASM Model

- Relaxed consistency models
  • Release Consistency (ISCA 1990): Acquire/Release ≈ CO/CI
  • DRF-0 (ISCA 1990), DRF-1 (PDS 1993): SC for DRF
  • Weak ordering (ISCA 1998)
- Semantic Segmentation
  • Cohesion (ISCA 2011)
  • Entry consistency (CMU-TR 1991)

Page 43: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Related Work – ASM-CMP

- Rigel: IEEE Micro 2011
  • Differentiates coherent/incoherent
- Treadmarks: ISCA 1992
  • Twinning and diffing

Page 44: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Related Work – Alternatives

- Reduce directory overhead
  • Cuckoo directory (HPCA 2011)
  • Tagless directory (MICRO 2009, PACT 2011)
  • Waypoint (PACT 2010)
  • Region coherence (IEEE Micro 2006)
  • SW controlled coherence (…)
- Simplify coherence design
  • Denovo (PACT 2011)
- Coherence is here to stay
  • CACM 2012

Page 45: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Future Work

- ASM Model
- ASM Implementations
- ASM Software

Page 46: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Future Work – ASM Model

1. Use CO/CI for synchronization
   • Return timestamp with CO/CI
   • Blocking CO
2. Only guarantee transitivity across coherent accesses
   • Would eliminate need for timestamps
3. Hierarchical ASM
   • Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
   • Acoherent/coherent components in same system

Page 47: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Future Work: ASM Implementation

ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
   • e.g., HW copy-on-write support on eviction
   • Use ASM as foundation for TM/Determinism/etc.
3. Low overhead byte-diffing
   • False sharing is rare/pattern reuse is common
4. More segment control
   • Non-contiguous
   • Remap-able

Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
   • Private/shared

Page 48: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Future Work – ASM Software

1. Message passing on ASM
   • More efficient than coherence (think: migratory)
2. Software speculation
   • Use working memory for isolation
3. Programming language integration
   • CO/CI first-class operations
   • Work already exists: Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS

Page 49: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Previous Work

- Rerun: ISCA 2008 and CACM 2009
  • Race recorder for deterministic replay
  • vs. state of the art: SAME logging performance, > 10x state reduction
- Calvin: HPCA 2011
  • Coherence for deterministic execution, i.e., zero-log-size deterministic replay
  • Selective determinism to match program requirements
- Hobbes: WoDet 2011
  • Strong acoherence in SW runtime

Page 50: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Phew!

Page 51: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Backup Slides

Page 52: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

What I would do differently

- Focus on more specific target system
- Stop building new infrastructure!
  • Why did I?
    – gem5 wasn’t ready
    – Started more radical/not clear it would have helped
- Step back more often
  • Easy to get sucked in to details, which usually don’t matter
- Functional specification of consistency -> yuck!

Page 53: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Case Study: Cilk

- Work-stealing task queue
- Distributed design

Page 54: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM Segments Benefit

[Figure: energy normalized to MOESI using segments (y-axis 0 to 2), broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb, across barnes, fft, fmm, lu, mp3d, ocean, radix, water, cilksort, clu, heat2d, heat3d, matmul, uts_circ, uts_fixed, and gmean.]

[Figure: runtime normalized to MOESI using segments (y-axis 0 to 2) for moesi_tlb_0, moesi_tlb_32, and moesi_tlb_64 across the same workloads.]

Page 55: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

History

[Timeline: 1980 CPU Era ("Moore of the same"); 2000 Multicore Era ("Everything is general purpose"); 2010 Dark Era ("??").]

Page 56: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Navigating the Darkness

- Solution #1: Wait for CMOS replacement
  • Don’t hold your breath
- Solution #2: Rethink everything
  • Deep integration
  • HW Specialization/Heterogeneity
  • Efficiency
  • Take compatibility off its pedestal

Coherence?

Page 57: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Rethinking Coherence: Why Now?

- Dennard Scaling is over; Moore’s Law continues:
  • Need efficient components/reduced waste
- Heterogeneity/Specialization
  • Different memory access patterns
  • Multicore ASICs
- Important workloads don’t use it
- Compatibility not a show stopper
  • Mobile -> fast design cycles, controlled SW stacks
  • Datacenter -> economy of scale in single location

Missing opportunities

Page 58: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Case Study 2: Software Speculation

    begin_speculation() {
      <copy state>            -> checkout(…): use private storage
      <setup>
    }

    end_speculation() {
      if (success)
        <free copies>         -> checkin(…): commit updates
      else
        abort_speculation();
    }

    abort_speculation() {
      <revert to copy>        -> checkout(…): multiple checkouts “forget” updates
      <cleanup>
    }

Task: Convert to ASM

- Checkin: commit updates
- Multiple checkouts: “forget” updates
- SW can use memory in new ways
- Use private storage
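The mapping from speculation to checkout/checkin can be sketched as follows. This is a software model only, assuming a small region with a global copy and a private working copy; `spec_region` and the `spec_*` helpers are hypothetical names, and real ASM hardware would use the caches rather than explicit copies.

```c
#include <assert.h>
#include <string.h>

#define N 8  /* hypothetical region size in words */

typedef struct {
    int global[N];   /* checked-in, globally visible state */
    int working[N];  /* private working copy used while speculating */
} spec_region;

/* Checkout refreshes the working copy and forgets uncommitted updates. */
static void spec_checkout(spec_region *r)
{
    memcpy(r->working, r->global, sizeof r->working);
}

/* Checkin commits the working copy. */
static void spec_checkin(spec_region *r)
{
    memcpy(r->global, r->working, sizeof r->global);
}

/* Begin: no explicit state copy is needed, only a checkout. */
static void begin_speculation(spec_region *r)
{
    spec_checkout(r);
}

/* Abort: a second checkout discards the speculative updates. */
static void abort_speculation(spec_region *r)
{
    spec_checkout(r);
}

/* End: commit on success, otherwise roll back. */
static void end_speculation(spec_region *r, int success)
{
    if (success)
        spec_checkin(r);
    else
        abort_speculation(r);
}
```

The point of the sketch is that the redundant software copy from the original code disappears: the private storage already holds the speculative state, so commit and rollback reduce to one checkin or one checkout.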

Page 59: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

New Software Potential

- Evaluate ability to write speculation software
- Microbenchmark:
  • Fill array with speculative data, then commit
  • Vary size of array

[Figure: normalized runtime (y-axis 0 to 1.2) of ASM and MESI versus # of blocks in isolation region (16, 64, 256, 1K, 16K, 32K, 64K, 128K).]

Page 60: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Using Weak Acoherence

    global array;            (weak acoherent)

    func producer(…)
      checkout(array);
      array[0] = x;
      array[1] = y;
      checkin(array);
      signal(consumer);
    end func

    func consumer(…)
      waitfor(producer);
      checkout(array);
      …
    end func

- An early checkin(array) can make updates globally visible before the explicit checkin
- Synch hides early checkin
- Synchronized -> Early visibility OK

Page 61: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Using Best-Effort Acoherence

    begin_tx
      checkout(array)
      array[0] = x
      checkout(array)        <- spontaneous checkout: Exception!
      array[1] = y
      checkin(array)
    end_tx

SW handles resource limitations

Page 62: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Simulator Design

- Two Goals
  • Functionally evaluate ASM system (programming model, kernel management)
  • Performance comparison to CMP
- Enhanced User Mode simulator
  • Emulate non-timing critical components (e.g., disks)
  • Simulate the rest (e.g., virtual memory)

Page 63: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Qualitative Data

Is ASM a reasonable model? YES

- Almost no changes to application software
  • Unsynchronized flags
  • Stack sharing
- Functioning Kernel, same tricks
  • Heavier use of coherent segments

Page 64: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Three Questions

[Diagram: hardware layout (processors with caches, LLC, DRAM) beneath three software views: Coherent View, Private View, and Acoherent View (with CO/CI).]

1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?

Page 65: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM-CMP Segments

- Uses true memory segments
  • e.g., all pointers are long (segment + offset)
- BUT, address space still appears flat!
- Long Pointer Propagation
  • Segment pointers propagate through datapath
  • Add lp/sp + register sidecars
  • Languages/SW remain segment-oblivious
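A long pointer can be pictured as a pair whose segment part rides along unchanged through pointer arithmetic. The struct below is an illustrative sketch only; the field widths and the `long_ptr` and `lp_add` names are assumptions and do not describe the ASM-CMP register format.

```c
#include <assert.h>
#include <stdint.h>

/* A long pointer: the offset lives in the ordinary register, the
 * segment identifier in the register's "sidecar". */
typedef struct {
    uint32_t seg;     /* segment identifier */
    uint32_t offset;  /* offset within the segment */
} long_ptr;

/* Pointer arithmetic changes only the offset; the segment part is
 * copied from the source operand to the result. */
static long_ptr lp_add(long_ptr p, int32_t delta)
{
    p.offset += (uint32_t)delta;  /* unsigned wraparound handles negative deltas */
    return p;
}
```

Because ordinary arithmetic never touches the segment field, compiled code that increments or copies pointers stays segment-oblivious.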

Page 66: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM-CMP Segments

Segment pointers propagate with the datapath.

    memcpy(dst, src, len);

        lp   $t0, 0(dst)
        lp   $t1, 0(src)
        mov  $t2, $a2        ; cnt <- len
    loop:
        beqz $t2, exit
        lb   $t3, 0($t1)     ; ld src
        sb   $t3, 0($t0)     ; st dst
        addi $t0, $t0, 1     ; inc. dst
        addi $t1, $t1, 1     ; inc. src
        subi $t2, $t2, 1     ; dec. cnt
        b    loop
    exit:

[Diagram: the dst and src pointers in memory each hold an offset and a segment pointer. The register file holds offset plus segment for $t0 and $t1, len in $t2, and data in $t3. The ALU computes offset + 1 and passes the segment through unchanged.]

- Pointers are long
- Segment propagates src -> dst

Page 67: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

The Problem

[Diagram: hardware layout (processors with caches, LLC, DRAM) beneath a software view of processors (P) over Coherent Shared Memory, with a question mark between the two.]

Page 68: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

The Problem

[Diagram: hardware layout (processors with caches, LLC, DRAM) beneath a software view of processors (P) over Coherent Shared Memory.]

Hardware Policy – Software Can’t Change!

Page 69: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

All Data Are Created Equal?

    Location := 1;

[Diagram: processors with caches, LLC, and DRAM, with question marks marking where Location may reside.]

Assume: CMP MESI protocol, inclusive LLC

Page 70: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Missed Opportunities

    begin_tx
      cpLocation := Location;
      Location := 1;
    end_tx

[Diagram: processors with caches, LLC, and DRAM.]

Assume: CMP MESI protocol, inclusive LLC

SW Makes Redundant Copy

Page 71: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

All Data Are NOT Created Equal

    func foo()
      var Location;      (private)
      Location := 1;

[Diagram: processors with caches, LLC, and DRAM.]

Assume: CMP MESI protocol, inclusive LLC

Wasting Space

Page 72: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

ASM-1 Hardware

[Diagram: processor (P) with an acoherence engine (AE), a 32KB L1, a 256KB L2, and an 8MB L3, with per-line bitmasks.]

Page 73: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

73

Baseline

L3_0

L3_1

L3_2

L3_3

L3_4

L3_5

L3_6

L3_7

L2L2L2L2L2L2L2L2

L2L2L2L2L2L2L2L2

L1L1L1L1L1L1L1L1

L1L1L1L1L1L1L1L1

P0P1P2P3P4P5P6P7

P8P9

P10P11P12P13P14P15

In-order, single thread Ring interconnect

Page 74: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Storage Overhead

[Figure: storage overhead (0% to 100%) versus # cores (2 to 1024) for ASM-1, MESI-1 Level, MESI-2 Level, and MESI-3 Level.]

- More indirection -> longer latency
- No Indirection

Page 75: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

Rethinking Coherence: Why Now?

- Dennard Scaling is over; Moore’s Law continues
  • Need scalable, energy efficient components
- Accelerators are here
  • How should they see memory?
- Shared-little workloads in important markets

Page 76: Acoherent Shared Memory Derek R. Hower Ph.D. Defense July 16, 2012

All Data Are NOT Created Equal

    func CUDAKernel(…)
      …
      Location := 1;

[Diagram: processors with caches, a GPU, LLC, and DRAM.]

Assume: CMP MESI protocol, inclusive LLC

Not clear accelerators want/need coherence