Shared Memory Consistency Models : A broad survey Ganesh Gopalakrishnan* School of Computing, University of Utah, Salt Lake City, UT * Past work supported in part by SRC Contract 1031.001, NSF Award 0219805 and an equipment grant from Intel Corporation

Post on 20-Dec-2015

Shared Memory Consistency Models : A broad survey

Ganesh Gopalakrishnan*

School of Computing, University of Utah, Salt Lake City, UT

* Past work supported in part by SRC Contract 1031.001, NSF Award 0219805 and an equipment grant from Intel Corporation

Shared Memory: Hardware Realities

[Figure: memory performance compared with CPU performance]

Shared Memory: Software Realities

• Must define the formal semantics of shared-memory concurrent programming while allowing for all reasonable optimizations

• Defining the Shared Thread semantics for Java (Original Java book’s Chapter 17 has essentially been ripped out…)

• Defining the Shared Memory Model for new languages such as Unified Parallel C (UPC) for Scientific Programming

• At a deeper level: Must have a formal basis for Automatic Minimal Fence Insertion to make programs appear to execute in a sequentially consistent manner

Topics:

• Motivations for strong and weak memory models

- How it affects consistency protocol design

- How it affects programming

• Classical memory models

- Their “power”

• Fence insertion during compilation

- Run on weak architectures but appear to run SC

• Overview of some weak architectures

• Itanium in a nutshell

• SAT-based programs that check executions against memory model specs

- Demo of MP Execution Checker (MPEC) tool for Itanium

Topics:

• Theoretical aspects of memory model specification

- Specify using Traces or Specify using Transducers

• Why Trace-based Specification can allow one to talk about unrealizable machines

- Hence “undecidability of sequential consistency” is not a solved problem

• Why trace-based verification methods need to exert some care

- Otherwise can prove “conniving machines” to be SC !!

• A brief taxonomy of recent results in this area

- Mainly Alur et al., Qadeer, Bingham et al., and Sezgin

Sequential Consistency : The Most Basic Memory Consistency Model

• Requirements

1. Exists a common total order

2. Respects program order

3. Read sees the “latest” write

Example

Initially, x = y = 0. Finally, can r1 = r2 = 0?

Thread 1      Thread 2
x = 1;        y = 2;
r1 = y;       r2 = x;

Under Sequential Consistency: No

Under many weak models: Yes
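The "No" under Sequential Consistency can be checked by brute force. The following is a minimal sketch, assuming a toy encoding of the two threads as tuples (the names `T1`, `T2`, `interleavings`, and `run` are illustrative, not from the slides): it enumerates every interleaving that respects program order and collects the resulting (r1, r2) pairs.

```python
# Toy encoding (an assumption for illustration): ("w", location, value)
# is a write, ("r", location, register) is a read into a register.
T1 = [("w", "x", 1), ("r", "y", "r1")]
T2 = [("w", "y", 2), ("r", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one total order against a single memory; reads see the latest write."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "w":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]

sc_outcomes = {run(s) for s in interleavings(T1, T2)}
print(sorted(sc_outcomes))
print((0, 0) in sc_outcomes)
```

Each schedule plays the role of the common total order; `run` implements the "read sees the latest write" requirement.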

How to Think About Sequential Consistency

[Figure: processors P1, P2, …, Pn all connected to a single Memory]

Initially, x = y = 0. Finally, can r1 = r2 = 0?

Thread 1      Thread 2
x = 1;        y = 2;
r1 = y;       r2 = x;

No! Not under SC ! But possible under many weak memory models!

An example of such a weak memory model is Sparc TSO
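To see why a TSO-like machine admits r1 = r2 = 0, here is a hedged sketch of a toy store-buffer machine. It is not the Sparc TSO specification: it assumes one FIFO store buffer per thread, loads that first look in their own buffer, and buffer entries that drain to memory at arbitrary times. The encoding and the name `tso_outcomes` are illustrative.

```python
# Same toy encoding as a litmus test: ("w", location, value) / ("r", location, register).
T1 = [("w", "x", 1), ("r", "y", "r1")]
T2 = [("w", "y", 2), ("r", "x", "r2")]

def tso_outcomes(programs, variables=("x", "y")):
    """Explore every reachable state of the toy store-buffer machine."""
    n = len(programs)
    start = ((0,) * n, ((),) * n, tuple((v, 0) for v in sorted(variables)), ())
    seen, stack, outcomes = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, bufs, mem, regs = state
        if all(pc == len(p) for pc, p in zip(pcs, programs)):
            r = dict(regs)
            outcomes.add((r["r1"], r["r2"]))
        for t, prog in enumerate(programs):
            # Option 1: drain the oldest buffered store of thread t to memory.
            if bufs[t]:
                addr, val = bufs[t][0]
                new_mem = dict(mem)
                new_mem[addr] = val
                new_bufs = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                stack.append((pcs, new_bufs, tuple(sorted(new_mem.items())), regs))
            # Option 2: thread t executes its next instruction.
            if pcs[t] < len(prog):
                kind, addr, arg = prog[pcs[t]]
                new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
                if kind == "w":
                    # Stores go into the thread's own buffer, not to memory.
                    new_bufs = bufs[:t] + (bufs[t] + ((addr, arg),),) + bufs[t + 1:]
                    stack.append((new_pcs, new_bufs, mem, regs))
                else:
                    # Loads see the newest matching entry in the own buffer, else memory.
                    val = dict(mem)[addr]
                    for a, v in bufs[t]:
                        if a == addr:
                            val = v
                    new_regs = dict(regs)
                    new_regs[arg] = val
                    stack.append((new_pcs, bufs, mem, tuple(sorted(new_regs.items()))))
    return outcomes

tso = tso_outcomes([T1, T2])
print(sorted(tso))
```

In this model both stores can sit in their buffers while both loads read the initial values from memory, which is exactly the outcome SC forbids.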

Coherence == Per-location Sequential Consistency

[Figure: processors P1, P2, …, Pn all connected to a 1-address Memory]

Initially, x = y = 0. Finally, can r1 = r2 = 0?

Thread 1      Thread 2
x = 1;        y = 2;
r1 = y;       r2 = x;

Notice that the same execution (ending with r1 = r2 = 0) is Coherent !
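Reading Coherence as per-location SC suggests a direct check: project the execution onto each location and ask whether that projection alone is SC. The sketch below assumes a hypothetical encoding of an observed execution as (thread, index in thread, kind, location, value) tuples and searches permutations by brute force, which is only practical for tiny examples.

```python
from itertools import permutations

# The execution in question: both reads return the initial value 0.
EXECUTION = [
    (1, 0, "w", "x", 1), (1, 1, "r", "y", 0),
    (2, 0, "w", "y", 2), (2, 1, "r", "x", 0),
]

def respects_program_order(order):
    """Operations of the same thread must appear in increasing index order."""
    last = {}
    for thread, idx, *_ in order:
        if last.get(thread, -1) > idx:
            return False
        last[thread] = idx
    return True

def reads_see_latest_write(order, initial=0):
    """Every read must return the most recent write to its location."""
    mem = {}
    for _, _, kind, loc, val in order:
        if kind == "w":
            mem[loc] = val
        elif mem.get(loc, initial) != val:
            return False
    return True

def is_sc(ops):
    return any(respects_program_order(o) and reads_see_latest_write(o)
               for o in permutations(ops))

def is_coherent(ops):
    """SC checked separately on the operations of each location."""
    locations = {op[3] for op in ops}
    return all(is_sc([op for op in ops if op[3] == loc]) for loc in locations)

print(is_sc(EXECUTION), is_coherent(EXECUTION))
```

Per location, each read of 0 can simply be ordered before the write to that location, so the execution is coherent even though no single total order explains both reads.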

Memory Consistency Models

Defines the legal orderings of memory operations that can be perceived at the user level

• Processors intermittently throw colors onto memory cells and also intermittently look at their colors

[Figure: processors P1, P2, …, Pi, …, Pn above Memory Cell 1, Memory Cell 2, …, Memory Cell n]

Memory Consistency Models

Defines the legal orderings of memory operations that can be perceived at the user level

• Many have been developed:

– Sequential Consistency (SC)

– Coherence (per-location SC)

– Parallel Random Access Memory (PRAM)

– Causal Consistency

– Processor Consistency (PC)

– Release Consistency

– Location Consistency

– The Intel Itanium Memory Model

– Java Memory Model (JMM)

– and more!

Memory Consistency Model Specifications:

A VERY complex specification for a real architecture (e.g. Itanium, PowerPC, …)

Also of growing concern in Software (e.g. Java Memory Model, Unified Parallel C model, …)

Motivation for (weak) Memory Consistency models:

A Hardware Perspective :

• Cannot afford to do industrious updates across large MP systems

• Delayed and re-orderable updates allow considerable latitude in memory consistency protocol design, hence fewer bugs in protocols !!

[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols connecting directories (dir) and memories (mem)]

Price Paid for Delayed Updates : Bugs!

Algorithms such as Peterson’s Mutual Exclusion cease to work!

Thread 1                        Thread 2
------------                    -----------
Flags[1] = BUSY;                Flags[2] = BUSY;
Turn = 2;                       Turn = 1;
while (Flags[2] == BUSY &&      while (Flags[1] == BUSY &&
       Turn != 1) ;                    Turn != 2) ;
Critical section                Critical section
Flags[1] = FREE;                Flags[2] = FREE;

The reads in each thread's while condition CAN READ AN OLD VALUE!!

Scope of Tutorial:

• Survey of ‘Classical’ Work

• Survey of Current Activities (that this speaker is aware of)

• Verification Challenges

• Theoretical Questions

• Justification for topic selection:

- Complement talks on Shared Memory Consistency Protocols

- Intuitions more important than the detailzzz….

- Knowing who’s who in this area helps

- Excuse for me to stick my neck out and learn something new

Organization:

1. Overview (mainly of classical works)

2. Practical aspects of weak consistency models (more depth)

3. What’s not apparent at first glance (still more depth)

4. Conclusions and references

Part 1: Overview of Classical Work

Memory Serves to Plumb Data…

Uniprocessor:

Write ( address = 2 , data = 33) ; …….. Read ( address = 2 , returns data = 33) ;

Multiprocessor:

P1                   P2
----                 ----
Write (2, 33) ;  ||  Read (2, 33) ;

Multiprocessor:

P1                   P2
----                 ----
Write(2, 33);        Write(2, 77);
Read (2, 77);        Read(2, 33);

P1               P2               P3              P4
----             ----             ----            ----
Write (2, 33) ;  Write (2, 77) ;  Read(2, 33);    Read(2, 77);
                                  Read(2, 77);    Read(2, 33);

…but respecting Coherence!

…but Coherence is not sufficient:

From Shasha and Snir, Figure 1, P. 282 (ACM TOPLAS (10)2: 1988)

Processor 1              Processor 2
-------------            --------------
Test_and_set1(LOCK);     Test_and_set2(LOCK);
Read1(X);                Read2(X);
Write1(X);               Write2(X);
Reset1(LOCK);            Reset2(LOCK);

The following memory access sequence respects Coherence but breaks the critical section :

Test_and_set1(LOCK); Read1(X); Reset1(LOCK); Test_and_set2(LOCK); Read2(X); Write1(X); Write2(X); Reset2(LOCK);

• Consistent view ACROSS ADDRESS SPACE is needed

• Most intuitive such : Sequential Consistency !

Basic understanding of SC:

• Execute AS IF instructions in each thread were executed sequentially and atomically

- respecting the program order in each thread

- no constraints across sequential programs

Requires effort to achieve above effect AS WELL AS high performance :

[Figure: CPU 1 … CPU n connected to Memory and Bus Controller]

Write (2, 55) ; MISSES     Read (4, 11) ; HITS

Write (4, 66) ; MISSES     Read (2, 22) ; HITS

Which Read waits ?

Aggressive SC Implementations:

[Figure: CPU 1 … CPU n connected to Memory and Bus Controller]

Write (2, 55) ; MISSES     Read (4, 11) ; HITS

Write (4, 66) ; MISSES     Read (2, 22) ; HITS

From Adve, Pai, and Ranganathan (Proc IEEE, (87)3, March 1999, p.448)

“If the accessed location does not change its value until the Read could have been non-speculatively issued, then the speculation is successful. Otherwise, roll-back speculation until incorrect load.” (Similar schemes used in HP PA-8000, Intel Pentium Pro, MIPS R10K)

One way to implement this:

* If bus-snoop for Write(4,..) arrives before that for Write(2,..), the Read(4, 11) is invalidated – and it reissues…

Snoops are Write(4,66); Write(2,55);

Unexpected Interactions: SC and Write Update Protocols (from Grahn, Stenstrom, Dubois)

• An important aspect of Sequential Consistency is Write Atomicity

• Write-Invalidate protocols can easily guarantee Write Atomicity

• However, Write-Update protocols are often recommended (Read-latency)

• Ensuring Write-Atomicity in Write-Update Protocols is tricky

• WEAK MEMORY MODELS TO THE RESCUE ! Don’t care about Write Atomicity except at Acquire / Release points

[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols connecting directories (dir) and memories (mem)]

A Deeper Look at Coherence :

Checking Coherence of Executions is NP-complete :

Cantin’s proof: Reduction from SAT:

Example: Consider (u1 \/ u2) /\ (~u1 \/ u2)

Create the following concurrent processes:

h1 h2 h_u1 h_~u1 h_u2 h_~u2 h3
--- --- ----- ------- ----- ------- ---
W(d_u1) W(d_~u1) R(d_u1) R(d_~u1) R(d_u2) R(d_~u2) R(d_c1)

W(d_u2) W(d_~u2) R(d_~u1) R(d_u1) R(d_~u2) R(d_u2) R(d_c2)

W(d_c1) W(d_c2) W(d_c1) W(d_u1) W(d_c2) W(d_u2) W(d_~u1)

W(d_~u2) W(d_F)

Literal Gadget

Clause Gadget

Existence of a Coherent Schedule is tested

A Deeper Look at Coherence :

Memory models that relax coherence – and how “useful” they are:

• PRAM (pipelined RAM – Lipton and Sandberg) is of academic interest

One memory per processor

Program order is obeyed, but No Write-Atomicity

P1 P2 Pn

A Deeper Look at Coherence :

Memory models that relax coherence – and how “useful” they are:

• PRAM – of academic interest

• Location consistency

- Proposed by Gao and Sarkar

- They tout its advantages in terms of scalability

- They describe an LC protocol “machine”

- Analysis by Wallace et al. (PDPTA 2002: 1542-1550) :

* Shown that this LC machine is stronger than the LC definition

* Question whether LC programs indeed appear to execute with sequentially consistent outcomes assuming that they are “properly labeled”

* I have not seen many pubs on LC of late…

Classical Weak Memory Models:

• Processor Consistency is widely known

• Good discussions in Ahamad et al., “The Power of Processor Consistency”

• First understand PRAM :

- For each processor p, there is a legal serialization S_p of H_p+w such that if o1 and o2 are in H_p+w and o1 -po-> o2, then o1 -S_p-> o2

- For PC_g, we add the following condition: for any two processors p and q, and for any location x, S_p | (w,x) = S_q | (w,x)

“Processor Consistency according to Goodman (PC_g)”

is not the same as

“PC_d – processor consistency according to the DASH project”

Execution that’s PRAM and Coherent … but not PC_g:

P : w(x,0) w(y,0)

Q: r(y,0) w(x,1)

R: r(x,1) r(x,0)

* Coherent! Just look at each color separately

* Not PC_g :

Construct a history per processor with all of the processor’s actions and all of others’ writes in that history

PC_g requires the write-histories to agree per variable; but in our example,

History of Q = …w(x,0)… w(x,1)… while

History of R = …w(x,1)… w(x,0)…

The “power” of Processor Consistency:

• Can handle “Peterson” (Ahamad)

• Can’t handle “Bakery” (Ahamad)

• What else? (Kawash and Higham, “Bounds for mutual exclusion with only Processor Consistency”):

- Peterson is correct for PC_g (a multi-writer protocol)

- Bakery is incorrect for PC_g (a single-writer protocol)

- Kawash and Higham prove that for mutual exclusion under PC_g, one multi-writer and n single-writers are necessary

Observations:

• Weak shared memory consistency models allow consistency

protocols to be efficient

• Unfortunately programmers find weak models non-intuitive

• How can we have the best of both worlds:

- weak models to be supported by the hardware

- strong models to be presented by the software

This can be achieved through compilers that insert the minimal number of fence instructions to give the appearance of SC

Basics of Fence Insertion:

• Widely cited work is by Shasha and Snir

• Recent work by Lee, Midkiff, and Padua extends the above

• Let us go through some examples (initially all mem. locations are 0)

P1               P2
----             ----
write(x,1) ;     read(y, yd) ;
write(y,1);      read(x, xd) ;

Under SC, if yd = 1, then xd = 1

Basics of Fence Insertion:

P1               P2
----             ----
write(x,1) ;     read(y, yd) ;
write(y,1);      read(x, xd) ;

BUT if we allow instructions to re-order, then the guarantee “if yd = 1, then xd = 1” is lost !!

• But often we CAN re-order without noticing an SC violation

• When can we re-order ??

Basics of Fence Insertion:

• Widely cited work is by Shasha and Snir (our exs. from their paper)

• Recent work by Lee, Midkiff, and Padua extends the above

• Let us go through some examples (initially all mem. locations are 0)

P1               P2
----             ----
write(x,1) ;     read(y, yd) ;
   | a              | b
write(y,1);      read(x, xd) ;

Which program order edges in P = {a,b} must be respected in order to guarantee SC-compliant executions ?

• Preserving a alone : Insufficient, as it can return xd=0, yd=1

• Preserving b alone : Insufficient, as it can return xd=0, yd=1

• BOTH a and b need to be preserved – how to compute this in general?

• Terminology : {a,b} in this example forms the Delay Set, D
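The claims about edges a and b can be checked exhaustively on this tiny program. The sketch below assumes a deliberately crude model of "not preserving" an edge: the two operations it connects may be issued in either order, and everything else behaves like SC. The names `outcomes`, `preserve_a`, and `preserve_b` are illustrative.

```python
from itertools import permutations

P1 = [("w", "x", 1), ("w", "y", 1)]        # edge a orders these two writes
P2 = [("r", "y", "yd"), ("r", "x", "xd")]  # edge b orders these two reads

def interleavings(a, b):
    """Yield every merge of a and b that keeps the given order within each list."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "w":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["xd"], regs["yd"]

def outcomes(preserve_a, preserve_b):
    """All (xd, yd) results when an unpreserved edge lets its thread reorder."""
    orders1 = [P1] if preserve_a else [list(p) for p in permutations(P1)]
    orders2 = [P2] if preserve_b else [list(p) for p in permutations(P2)]
    return {run(s) for o1 in orders1 for o2 in orders2
            for s in interleavings(o1, o2)}

BAD = (0, 1)  # xd = 0 together with yd = 1, which SC forbids
for pa in (True, False):
    for pb in (True, False):
        print(pa, pb, BAD in outcomes(pa, pb))
```

Only the combination that preserves both a and b rules out the forbidden result, matching the delay set D = {a, b}.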

Analysis is based on Critical Cycles

• Locate all critical cycles in the concurrent program

• Equate Delay Set D to all the program-order edges in all critical cycles

• Locating Critical Cycles :

- Locate all Conflict Edges C

. Locate two accesses that are concurrent, access the same variable, and of which at least one is a write; these give the undirected Conflict Edges C

. A critical cycle is a cycle in P U C that has the following properties :

* Contains at most two operations from the same thread, and these are consecutive in it

* Contains 0, 2, or 3 accesses to each shared variable, and these are consecutive in it (further properties omitted…)

Finding Critical Cycles : Example 1

P1               P2
----             ----
write(x,1) ;     read(y, yd) ;
write(y,1);      read(x, xd) ;

[Figure: the program drawn twice, first with its Program Order Edges P and Conflict Edges C, then with the Critical Cycle highlighted]

Delay Set D = all the P edges in Critical Cycle = P in our case

Finding Critical Cycles : Example 2

P1               P2
----             ----
read(x, xd);     write(x,1);
read(y, yd);     write(y,1);

[Figure: the program drawn twice, first with its Conflict Edges, then with the Critical Cycle highlighted; program-order edges are labeled a, b, c, with the annotation “Basically a ‘while’ loop”]

Delay Set D = {b, c} whereas P = {a, b, c}

Finding Critical Cycles : Example 3

a1 : read A

b1 : read B

c1 : read C

d1 : read D

a2 : write B

b2 : write C

c2: write D

d2 : write A

D = { (a1,b1), (a1,c1), (a1,d1), (a2,d2), (b2,d2), (c2,d2) }

suffices to ensure SC !

I.e., a1 is an acquire-read and d2 is a release-write !!

Basic Approach to Fence Insertion:

• Goal : Discover the minimal set of fences to be inserted into a concurrent shared memory program

• Suppose D is the delay-set discovered by the previous analysis

• Suppose the underlying (weak) architecture supports orderings D_o

• Let D_m be the fences to be inserted to get the effect of D

• D_m = ( ( D U D_o )+ )tr - D_o, where “+” is the transitive closure and “tr” is the transitive reduction

[Figure: four operations a, b, c, d]

• Required Delay Set D = { (a,b), (b,c), (a,d) }

• D_o = { (c,d) }

• ( (D U D_o )+ )tr = { (a,b), (b,c), (c,d) }

• ( (D U D_o)+ )tr - D_o = { (a,b), (b,c) } - fences needed only here
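The formula for D_m can be evaluated mechanically. This is a small sketch, assuming orderings are represented as sets of pairs over an acyclic relation; the helper names `transitive_closure`, `transitive_reduction`, and `fences_needed` are illustrative.

```python
def transitive_closure(edges):
    """Smallest transitive relation containing `edges` (the '+' operator)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def transitive_reduction(closure):
    """Drop every edge implied by a two-step path (valid for an acyclic closure)."""
    nodes = {n for edge in closure for n in edge}
    return {(u, v) for (u, v) in closure
            if not any((u, w) in closure and (w, v) in closure for w in nodes)}

def fences_needed(delay_set, hardware_orderings):
    """D_m = ((D U D_o)+)tr - D_o."""
    combined = transitive_closure(set(delay_set) | set(hardware_orderings))
    return transitive_reduction(combined) - set(hardware_orderings)

D = {("a", "b"), ("b", "c"), ("a", "d")}
D_o = {("c", "d")}
print(sorted(fences_needed(D, D_o)))
```

On the example, the edge (a,d) disappears because it is already implied by the chain a, b, c, d, and (c,d) is subtracted because the hardware provides it.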

Basic Approach to Fence Insertion:

• Required Delay Set = { (a,b), (b,c), (a,d) }

• D_o = { (c,d) }

• ( (D U D_o )+ )tr = { (a,b), (b,c), (c,d) }

• ( (D U D_o)+ )tr - D_o = { (a,b), (b,c) } - fences needed only here

[Figure: operations a, b, c, d with a fence on (a,b), a fence on (b,c), and the hardware-provided ordering on (c,d)]

So, in a nutshell, the two fences together with the hardware-provided ordering implement the desired delay-set

Deriving Fences from Correctness Proofs

Lamport’s paper “How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor,” IEEE Trans. Computers 46(7), 1997, provides really good insight into deriving the required weak orderings through proofs

• Notations :

A → B : Every event in A precedes every event in B

A - -> B : Some event in A precedes some event in B

Implies

Implies

Deriving Places to insert a Synch Instruction:

Repeat forever

  noncritical section;

  L : x_i := true;

  For j := 1 until i-1

    Do if x_j then x_i := false ; while x_j do od; goto L fi od;

  For j := i+1 until N do while x_j do od od;

  critical section ;

  x_i := false

End Repeat

[Figure: three points in the loop above are marked “Synch”]

There is a proof in Lamport’s paper that with just these Synch instructions, mutual exclusion is guaranteed.

Part 2: A Detailed Look at a Practical Weak Memory Model : Itanium (I do mention three others briefly…)

Well, let’s look at the big picture first:

Sparc TSO, PSO, RMO

• Reads and Writes follow the

TSO, PSO, or RMO semantics

• Additional Fence instructions

and others (e.g. semaphores)

• I’m not up to speed on these…

Alpha

• Reads (only coherence)

• Writes (only coherence)

• Load-Locked

• Store-Conditional

• Membar

Well, let’s look at the big picture:

Power-4

• Reads and Writes (don’t know much)

• Sync (Synchronize)

• Lwsync (Lightweight Sync – new in Power4)

• E I E I O (Enforce In-Order Execution of I/O)

• Lwarx (Load word and reserve)

• Ldarx (Load doubleword and reserve)

• Stwcx (Store word conditional)

• Stdcx (Store Doubleword Conditional)

• Isync (Instruction synchronize)

Perhaps Old-McDonald knows more…

IA-32, IA-64, AMD, … ?

• Generally thought to be “Processor Consistency”

• Does it really help to formally specify (or even reveal the details) ?

• Intel thought so ……

The Itanium memory model is described next…

The Intel Itanium® Processor memory model

• Has these kinds of instructions :

“weak load” or “ordinary load” -- ld

“strong load” or “acquire-load” -- ld.acq

“weak store” or “ordinary store” -- st

“strong store” or “release store” -- st.rel

“memory fence” (NOT barrier!) -- mf

A few semaphore-types

Allows sub-word writes, I/O spaces…

We don’t model these

Itanium® memory model thru examples

st [x] = 2

Can freely slide in a sequential program…

Only rule is coherence

“Ordinary store”

ld reg1 = [x]

The same applies to an “ordinary load”

Itanium® memory model thru examples

st.rel [x] = 2

Things before it in sequential program order can’t happen after it

“Release store”

Things after it in sequential program order may happen before it !!

Itanium® memory model thru examples

ld.acq r3 = [y]

Things before it in sequential program order may happen after it

“Acquire load”

Things after it in sequential program order can’t happen before it !!
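The sliding rules of the last three slides can be condensed into one predicate. The following is a simplified sketch, not the Itanium specification: it assumes the two instructions touch different locations, ignores data dependences and semaphores, and treats `mf` as something nothing may cross. The function name `may_reorder` is illustrative.

```python
def may_reorder(first, second):
    """`first` precedes `second` in program order.

    Returns True if, under the simplified rules, `second` may take
    effect before `first`. Opcodes: "ld", "st", "ld.acq", "st.rel", "mf".
    """
    if first == "mf" or second == "mf":
        return False   # assumption: nothing slides across a memory fence
    if first == "ld.acq":
        return False   # things after an acquire load can't happen before it
    if second == "st.rel":
        return False   # things before a release store can't happen after it
    return True        # ordinary loads and stores slide freely

for pair in [("st", "ld"), ("st", "st.rel"), ("ld.acq", "st"),
             ("st.rel", "ld.acq"), ("ld", "mf")]:
    print(pair, may_reorder(*pair))
```

Note the asymmetry the slides emphasize: a release store followed by an acquire load to a different location may still be reordered, since neither one-way barrier blocks that direction.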

But with these rules alone, we can’t explain the following legal outcome in Itanium®

st.rel [y] = 1             st.rel [x] = 2
ld.acq r3 = [y] <1>        ld.acq r4 = [x] <2>
ld reg1 = [x] <0>          ld reg2 = [y] <0>

(st.rel to ld.acq : Data dep. ; ld.acq to ld : ld.acq rule)

Itanium specification DOES NOT try to explain outcomes in terms of “shuffles” of the original instructions!

This has turned out to be an unspoken convention in this area for other memory models also…

st [y] = 1

Local copy for P0

“remote” copy for P0

“remote” copy for P1

A store generates (n+1) progenies

ld.acq r3 = [y]

Other instructions generate only one

Itanium® rules explain execution outcomes in terms of “progenies” of stores and loads

P1: St a,1; Ld r1,a <1>; St b,r1 <1>;

P2: Ld.acq r2,b <1>; Ld r3,a <0>;

We wrote such a “breeding assembler”

{id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};

{id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};

{id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};

{id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};

{id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};

{id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};

{id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};

{id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};

{id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}

Tuple 1

Tuple 9

...
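The tuple listing above follows a simple pattern: each store yields one Local tuple plus one Remote tuple per processor, and every other instruction yields a single tuple. Here is a hedged sketch of such a generator. It is not the authors' assembler: the function name `breed` and the input encoding are hypothetical, only a subset of the fields is produced, and register fields are omitted.

```python
def breed(programs):
    """Expand per-processor instruction lists into tuples.

    Each instruction is (op, var, data); `data` is the value stored or,
    for loads, the value observed in the execution being checked.
    """
    n = len(programs)
    tuples = []
    for proc, prog in enumerate(programs):
        for pc, (op, var, data) in enumerate(prog):
            base = {"proc": proc, "pc": pc, "op": op, "var": var, "data": data}
            if op in ("St", "StRel"):
                wr_id = len(tuples)  # all progenies share the id of the local copy
                tuples.append(dict(base, id=wr_id, wrID=wr_id,
                                   wrType="Local", wrProc=proc))
                for target in range(n):
                    tuples.append(dict(base, id=len(tuples), wrID=wr_id,
                                       wrType="Remote", wrProc=target))
            else:
                tuples.append(dict(base, id=len(tuples), wrID=-1,
                                   wrType="DontCare", wrProc=-1))
    return tuples

# The two-processor example from the slide (processors numbered from 0).
PROGRAMS = [
    [("St", "a", 1), ("Ld", "a", 1), ("St", "b", 1)],
    [("LdAcq", "b", 1), ("Ld", "a", 0)],
]
for t in breed(PROGRAMS):
    print(t["id"], t["proc"], t["pc"], t["op"], t["wrID"], t["wrType"], t["wrProc"])
```

With two processors, each store contributes three tuples, so the five instructions expand to nine tuples with ids 0 through 8, as in the listing.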

Itanium® rules specify how to line up the tuples to explain the load-outcomes !!

P0                         P1
st [y] = 1                 st [x] = 2
ld.acq r3 = [y] <1>        ld.acq r4 = [x] <2>
ld reg1 = [x] <0>          ld reg2 = [y] <0>

Now, arrange the split copies…

st [y] = 1 “l”

st [y] = 1 “rp0”

st [y] = 1 “rp1”

st [x] = 2 “l”

st [x] = 2 “rp0”

st [x] = 2 “rp1”

Explanation…

[Figure: an arrangement of the six store copies and the four loads that explains the outcome; edges are labeled as Dependencies and Anti-dependencies]

Gist of our method: Illustration on SC and on Itanium

SC(exec) =
Exists order.
( requireStrictTotalOrder exec order
/\ requireProgramOrder exec order
/\ requireReadValue exec order )

The tuples to be ordered: find an arrangement under SC constraints

legalItanium(exec) =
Exists order.
( requireStrictTotalOrder exec order
/\ requireWriteOperationOrder exec order
/\ requireItProgramOrder exec order
/\ requireMemoryDataDependence exec order
/\ requireDataFlowDependence exec order
/\ requireCoherence exec order
/\ requireAtomicWBRelease exec order
/\ requireSequentialUC exec order
/\ requireNoUCBypass exec order
/\ requireReadValue exec order )

The tuples to be ordered: find an arrangement as per the above constraints

Our Itanium Formal Model (extracted from Intel Documents – written as a HOL Theory)

legal_itanium exec = (* a given execution *)
  ?order.
    requireStrictTotalOrder exec order /\
    requireWriteOperationOrder exec order /\
    requireProgramOrder exec order /\
    requireMemoryDataDependence exec order /\
    requireDataFlowDependence exec order /\
    requireCoherence exec order /\
    requireReadValue exec order /\
    requireAtomicWBRelease exec order /\
    requireSequentialUC exec order /\
    requireNoUCBypass exec order

See Charme’03, IPDPS’04, CAV’04

Various contributions by Yue Yang, Gopalakrishnan, Lindstrom, Slind, Sivaraj, Yu Yang

requireStrictTotalOrder exec order

requireWriteOperationOrder exec order

Local Write before Local Global Write

Local Write before Remote Global Writes
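As an illustration of what one such rule might look like operationally, here is a rough sketch of the write-operation-order constraint as a check on a candidate arrangement. It is an interpretation of the two lines above, not a translation of the HOL rule: it assumes tuples carry `id`, `wrID`, and `wrType` fields as in the earlier tuple listing, and that `order` is a list of tuple ids from earliest to latest.

```python
def require_write_operation_order(tuples, order):
    """True if every Local copy of a store precedes all Remote copies of that store."""
    position = {tid: i for i, tid in enumerate(order)}
    for local in tuples:
        if local["wrType"] != "Local":
            continue
        for remote in tuples:
            if (remote["wrType"] == "Remote"
                    and remote["wrID"] == local["wrID"]
                    and position[local["id"]] > position[remote["id"]]):
                return False
    return True

# The three progenies of a single store on a two-processor system.
STORE = [
    {"id": 0, "wrID": 0, "wrType": "Local", "wrProc": 0},
    {"id": 1, "wrID": 0, "wrType": "Remote", "wrProc": 0},
    {"id": 2, "wrID": 0, "wrType": "Remote", "wrProc": 1},
]
print(require_write_operation_order(STORE, [0, 1, 2]))
print(require_write_operation_order(STORE, [1, 0, 2]))
```

The rule leaves the relative order of the Remote copies unconstrained; other rules (such as the atomicity requirement for release stores) restrict that.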

requireProgramOrder exec order

Program Order is defined solely through

Acquires, Releases, and Fences

requireMemoryDataDependence exec order

Order two accesses (Read or Write) under these conditions :

IF program-ordered AND the same variable AND

Write is local and RAW (and Read of course is local)

OR Write is local and WAR

OR Both writes are local and WAW

OR Both writes are remote and WAW and Fall in same processor

requireDataFlowDependence exec order

Data Dependence Thru the Register-Space

requireCoherence exec order

Just Plain-Old Coherence

but for TWO WRITES falling in the WB or UC space

and for EITHER Two Local Writes OR two Remote Writes in the same processor


requireReadValue exec order

Reads return Most Recent Writes


requireAtomicWBRelease exec order

All Remote Events Stemming from the Same Release-Write Instruction appear to be an Atomic Set


requireSequentialUC exec order

In the UC Space, Program-Ordered

UC Read and Write Events, both of which are Local

are ordered as per program order

(the two operations in question could be RR, RW, WR, or WW)


requireNoUCBypass exec order

UC-space Operations Do Not Exhibit

Read Bypassing as in TSO


A MEMORY MODEL RULE IN HOL

requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType=Local) /\ (j.wrType=Local) /\ (i.proc=j.proc) \/
     (i.wrType=Remote) /\ (j.wrType=Remote) /\ (i.wrProc=j.wrProc)) ==>
      !p q. p IN exec /\ q IN exec ==>
        isWr p /\ isWr q /\ (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
        (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc) ==>
          order p q
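To make the coherence rule concrete, here is a minimal Python sketch that evaluates it on a finite execution. The `Event` record, its field names, `order_from_sequence`, and the `attr_of` callback are illustrative assumptions that mirror the HOL identifiers; this is not the authors' checker.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Event:
    id: int
    is_write: bool
    proc: int      # issuing processor (i.proc)
    var: str       # address (i.var)
    wr_id: int     # store instruction this event stems from (i.wrID)
    wr_type: str   # "Local" or "Remote" (i.wrType)
    wr_proc: int   # processor at which the write event is seen (i.wrProc)

def order_from_sequence(seq):
    """Turn a list of events into the strict total order it represents."""
    pos = {e.id: n for n, e in enumerate(seq)}
    return lambda a, b: pos[a.id] < pos[b.id]

def require_coherence(events, order, attr_of):
    """Direct transcription of the rule: for each trigger pair (i, j),
    the matching remote pair (p, q) must be ordered the same way."""
    for i, j in product(events, repeat=2):
        same_side = (
            (i.wr_type == "Local" and j.wr_type == "Local" and i.proc == j.proc)
            or (i.wr_type == "Remote" and j.wr_type == "Remote"
                and i.wr_proc == j.wr_proc))
        if not (i.is_write and j.is_write and i.var == j.var and order(i, j)
                and attr_of(i.var) in ("WB", "UC") and same_side):
            continue
        for p, q in product(events, repeat=2):
            if (p.is_write and q.is_write
                    and p.wr_id == i.wr_id and q.wr_id == j.wr_id
                    and p.wr_type == "Remote" and q.wr_type == "Remote"
                    and p.wr_proc == q.wr_proc and not order(p, q)):
                return False
    return True

# Two stores to x by processor 0, each with a local event and a remote event at processor 1.
L1 = Event(1, True, 0, "x", 1, "Local", 0)
L2 = Event(2, True, 0, "x", 2, "Local", 0)
R1 = Event(3, True, 0, "x", 1, "Remote", 1)
R2 = Event(4, True, 0, "x", 2, "Remote", 1)
EVENTS = [L1, L2, R1, R2]
```

With these four events, an order that keeps the remote events in the same sequence as the local ones passes, while one that swaps them fails.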


How do we know that the actual silicon matches the shared memory model ?

• Pray

• Run tests and manually check results

• ? What else ?

! X . X in exec ? Y . Y in exec …. ? ! /\ … \/ ….

?

One use we have put our Spec to: Post-Si Verification of MP Systems…


FORMALLY VERIFY “interesting” EXECUTIONS

P1’s exec

st8 [12ca20] = 7f869af546f2f14c
ld8 r25 = [45180] <87b5e547172644a8>
ld2 r26 = [2c2a2c] <44a8>
ld2 r27 = [45aa2a] <c58e>
…

P2’s exec

st8 [45180] = 87b5e547172644a8
ld8 r25 = [45180] <87b5e547172644a8>
st2 [2c2a2c] = 44a8
st2 [45aa2a] = c58e
…


TWO APPROACHES:
- explicitly QB
- implicitly QB

[Flow diagram. Labels: SPEC OF MEMORY MODEL IN HOL; “BOOLIFY”; QBF; CONVERT TO EXECUTION CHECKER PROGRAM; PROGRAM; Given Execution; SAT PROBLEM]

(Prototyped this; but definitely need to re-code this…)


The alternative is to produce a manual proof:

P:  st [x] = 1
    mf
    ld r1 = [y] <0>

Q:  st.rel [y] = 1

R:  ld.acq r2 = [y] <1>
    ld r3 = [x] <0>

Proof ingredients: Atomicity of st.rel; Load of initial value is before store of every other value

Even this simple “Litmus Test” has a 1-page detailed proof


The MPEC Tool Flow: (RECENT WORK)

Itanium Ordering rules in HOL
  -> Mechanical Program Derivation (to be automated)
  -> Checker Program (input: MP execution to be verified)
  -> Satisfiability Problem with Clauses carrying annotations
  -> Sat Solver
       Sat:   Explanation in the form of one possible interleaving
       Unsat: Unsat Core Extraction using Zcore
              • Find Offending Clauses
              • Trace their annotations
              • Determine “ordering cycle”

MP execution to be verified:

P:  st [x] = 1
    mf
    ld r1 = [y] <0>

Q:  st.rel [y] = 1

R:  ld.acq r2 = [y] <1>
    ld r3 = [x] <0>


Largest example tried to date (courtesy S. Zeisset, Intel)

Proc 1

st8 [12ca20] = 7f869af546f2f14c
ld r25 = [45180] <87b5e547172644a8>

… 58 more instructions…

st2 [7c2a00] = 4bca

Proc 2

ld4 r24 = [733a74] <415e304>
st4.rel [175984] = 96ab4e1f

… 67 more instructions…

ld8 r87 = [56460] <b5c113d7ce4783b1>

• Initially the tool gave a trivial violation

• Diagnosed to be forgotten memory initialization

• Added method to incorporate memory initialization in our tool

• Our tool found the exact same cycle as pointed out by author of test

Cycle found thru our tool:

st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)


Statistics Pertaining to Case Study

• 140 total instructions

• All runs were on a 1.733 GHz 1GB Redhat Linux V9 Athlon

• 1 minute to generate Sat instance

• 9M clauses ( O(n^3) in terms of instructions )

• 117,823 variables ( not a problem )

• ~1 minute to run Sat (unsat here) – 0.2 sec to do “real work”

• Zcore runs fast – gave 23 clauses in one iteration


Overview of MPEC:

• Example of how a HOL rule was turned into a SAT generator

• How the SAT part was done

Throwing an efficient “transitivity blanket” over a problem to cover it with whatever transitivity it begs for !!

• What more to expect

• Related work


Gist of constraints :

• Some arrangements are statically known :

• Others are conditional : Implies and

• Some must form an atomic set : Everybody else Strictly before or Strictly after.

• Many are unordered :

• Find a strict total order satisfying all the above !


Gist of constraint ENCODING :

• Use Boolean precedence matrix

• Capture “i before j” by m_ij

[Figure: N x N precedence matrix with rows i and columns j indexed 1..N, where a 1 entry marks “i before j”; a conditional constraint is drawn as “… Implies … and …”]

Statically known : Unit clauses

Atomic set : Boolean formula (See how SAT-generator is derived)

Strict total order : Spew out irreflexivity and totality axioms, then throw a “transitivity blanket” on top of all tuples
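As a rough illustration of this encoding, the Python sketch below emits the irreflexivity, totality, and transitivity clauses over the variables m_ij in DIMACS-style integer form, and counts satisfying assignments by brute force. The variable numbering and the function names are assumptions made for the example; the real tool hands its clauses to a SAT solver rather than enumerating assignments.

```python
from itertools import product

def var(i, j, n):
    """1-based variable number standing for m_ij, with i and j in range(n)."""
    return i * n + j + 1

def strict_total_order_clauses(n):
    clauses = []
    for i in range(n):
        clauses.append([-var(i, i, n)])                    # irreflexivity: ~m_ii
    for i in range(n):
        for j in range(i + 1, n):
            clauses.append([var(i, j, n), var(j, i, n)])   # totality: m_ij \/ m_ji
    for i, j, k in product(range(n), repeat=3):
        if i != j and j != k:                              # k == i yields asymmetry
            clauses.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])
    return clauses

def count_models(n, clauses):
    """Brute-force model count; only sensible for tiny n."""
    count = 0
    for bits in product([False, True], repeat=n * n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses):
            count += 1
    return count

def with_known_order(n, clauses, i, j):
    """A statically known ordering 'i before j' becomes one unit clause."""
    return clauses + [[var(i, j, n)]]
```

For three tuples, the satisfying assignments correspond to the orderings of three items, and one unit clause fixing “0 before 1” leaves half of them.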


Other Approaches Tried:

* Small Domain method (n log n encoding)

- Generates fantastically hard SAT problems!

- Chokes many SAT solvers – Zchaff-II can handle it well

* Incremental SAT (see CAV’04)

* QBF version : initial prototype needs lots of work

- can serve to provide good QBF benchmarks…..


Approaches to “transitivity blanket”

Naïve : For all tuples i, j, and k, generate

m_ij /\ m_jk ==> m_ik

Too many clauses (1B for a 1000-tuple program)

Better: Obtain transitive-closure of known orderings and then prune irrelevant parts of the blanket

E.g., if ~m_ij is known, don’t generate

m_ij /\ … ==> … as well as … /\ m_ij ==> …
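The pruning idea can be sketched in Python as follows. The sketch assumes that “~m_ij is known” means j is already known (after transitive closure) to precede i; the helper names are illustrative, and the actual tool may prune further.

```python
from itertools import product

def transitive_closure(n, known):
    """Warshall-style closure of the known 'a before b' pairs."""
    reach = [[False] * n for _ in range(n)]
    for a, b in known:
        reach[a][b] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

def transitivity_triples(n, known=()):
    """Triples (i, j, k) for which the clause m_ij /\\ m_jk ==> m_ik is still
    worth generating: neither antecedent is already known to be false."""
    reach = transitive_closure(n, known)

    def known_false(i, j):
        return i == j or reach[j][i]

    return [(i, j, k) for i, j, k in product(range(n), repeat=3)
            if not known_false(i, j) and not known_false(j, k)]

NAIVE_CLAUSES_FOR_1000_TUPLES = 1000 ** 3   # the "1B" quoted above
```

With four tuples and nothing known, 36 non-trivial triples remain; once the chain 0, 1, 2, 3 is known, only the triples with i < j < k survive.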


Obtaining SAT-generator from HOL

Initial Spec

atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

Applying Contrapositive

atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))

After Reducing quantifier Scopes

atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in exec). (i.wrID = k.wrID) ==>
        forall (j in exec). ~(j.wrID = i.wrID) ==>
          ~(order(i,j) /\ order(j,k))
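One way to gain confidence that the rewriting preserves meaning is to compare the first and last forms on every ordering of a small execution. The Python sketch below does that; the `Ev` record (which folds `attr_of i.var` into an `attr` field) and the sample events are assumptions for illustration only.

```python
from collections import namedtuple
from itertools import permutations

Ev = namedtuple("Ev", "name op wr_type attr wr_id")

def guard(i):
    return i.op == "StRel" and i.wr_type == "Remote" and i.attr == "WB"

def initial_spec(execn, order):
    return all(
        not (guard(i) and i.wr_id == k.wr_id and order(i, j) and order(j, k))
        or j.wr_id == i.wr_id
        for i in execn for j in execn for k in execn)

def transformed_spec(execn, order):
    return all(
        not guard(i) or all(
            i.wr_id != k.wr_id or all(
                j.wr_id == i.wr_id or not (order(i, j) and order(j, k))
                for j in execn)
            for k in execn)
        for i in execn)

def compare_on_all_orders(execn):
    """Return (specs always agree?, set of truth values observed)."""
    agree, seen = True, set()
    for perm in permutations(execn):
        pos = {e.name: n for n, e in enumerate(perm)}
        order = lambda a, b: pos[a.name] < pos[b.name]
        first, last = initial_spec(execn, order), transformed_spec(execn, order)
        agree = agree and (first == last)
        seen.add(first)
    return agree, seen

# Two remote events of one release store, plus an unrelated load event.
EXAMPLE = [Ev("a", "StRel", "Remote", "WB", 1),
           Ev("b", "StRel", "Remote", "WB", 1),
           Ev("c", "Ld", "Local", "WB", 0)]
```

On this example the two forms agree on all six orderings, and the rule is violated exactly when the load event falls between the two remote events.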


…Obtaining SAT-generator from HOL

Transformed Spec

atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
      forall (k in exec). (i.wrID = k.wrID) ==>
        forall (j in exec). ~(j.wrID = i.wrID) ==>
          ~(order(i,j) /\ order(j,k))

Functional Program that generates the constraints (will be automated)

atomicWBRelease(exec) = forall(i,exec,wb(i))

wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote))
        then true
        else forall(k,exec,wb1(i,k))

wb1(i,k) = if ~(i.wrID=k.wrID)
           then true
           else forall(j,exec,wb2(i,k,j))

wb2(i,k,j) = if (j.wrID=i.wrID)
             then true
             else ~(order(i,j) & order(j,k))

forall(i,S, e(i)) = for all i in S : e(i)
  (* foldr( map (fn i -> e(i)) (S) (&), true) *)
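The functional program can be mirrored in Python, with `order(i,j)` left symbolic so that each surviving case becomes a clause instead of a truth value. This is a sketch under assumptions: the `Ev` record and the clause representation (a pair of ordered name pairs, read as ~m_ij \/ ~m_jk) are invented for the example.

```python
from collections import namedtuple

Ev = namedtuple("Ev", "name op wr_type attr wr_id")

def atomic_wb_release_clauses(execn):
    """Each result ((i, j), (j, k)) stands for the clause ~m_ij \\/ ~m_jk."""
    clauses = []
    for i in execn:
        if not (i.attr == "WB" and i.op == "StRel" and i.wr_type == "Remote"):
            continue                      # wb(i) is trivially true
        for k in execn:
            if i.wr_id != k.wr_id:
                continue                  # wb1(i,k) is trivially true
            for j in execn:
                if j.wr_id == i.wr_id:
                    continue              # wb2(i,k,j) is trivially true
                clauses.append(((i.name, j.name), (j.name, k.name)))
    return clauses

EXAMPLE = [Ev("a", "StRel", "Remote", "WB", 1),
           Ev("b", "StRel", "Remote", "WB", 1),
           Ev("c", "Ld", "Local", "WB", 0)]
```

For the three sample events, only the load `c` can play the role of `j`, so four clauses come out, including the one that forbids `c` from sitting between `a` and `b`.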


Clause annotations for the unsat core for example

op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
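Annotations in this shape are easy to post-process. The Python sketch below parses one annotation per line, drops the `-1` placeholders and the `NoRule` entries, and groups the rest by rule. It assumes only the textual format shown above and is not the authors' “Explain” tool.

```python
import re
from collections import defaultdict

ANNOTATION = re.compile(
    r"op1 = (-?\d+); op2 = (-?\d+); op3 = (-?\d+); op4 = (-?\d+); rule = (\w+)")

def parse_annotations(text):
    """Return a list of (rule, ops) with the -1 placeholders removed."""
    parsed = []
    for line in text.splitlines():
        m = ANNOTATION.search(line)
        if not m or m.group(5) == "NoRule":
            continue
        ops = tuple(int(g) for g in m.groups()[:4] if int(g) != -1)
        parsed.append((m.group(5), ops))
    return parsed

def group_by_rule(parsed):
    groups = defaultdict(list)
    for rule, ops in parsed:
        groups[rule].append(ops)
    return dict(groups)

SAMPLE = """\
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
"""
```

Grouping by rule is a first step toward following the annotations from clause to clause when reconstructing an ordering cycle.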


Building an Error-trail for UNSAT (infeasible executions) :

[Figure: the litmus test with its ops numbered 1 to 12. Legend: one marker denotes an op; the numbers denote op numbers. Store has both local and remote exec.]

st [x] = 1              ops 1 2 3 4
mf                      op 5
ld r1 = [y] <0>         op 6
st.rel [y] = 1          ops 7 8 9 10
ld.acq r2 = [y] <1>     op 11
ld r3 = [x] <0>         op 12


Building an Error-trail…

[Figure: same numbered litmus test as on the previous slide, now with this clause annotation:]

op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder


Building an Error-trail …

[Figure: same numbered litmus test, now with this clause annotation:]

op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder


Building an Error-trail …

[Figure: same numbered litmus test, now with these clause annotations:]

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue


Building an Error-trail …

[Figure: same numbered litmus test, now with these clause annotations:]

op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease


Building an Error-trail …

[Figure: same numbered litmus test, now with these clause annotations:]

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue


Building an Error-trail …

[Figure: same numbered litmus test, now with this clause annotation:]

op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder


Building an Error-trail …

[Figure: same numbered litmus test, now with these clause annotations:]

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue


MPEC (MP Execution Checker) Tool Demo

[Flow diagram. Labels, in extraction order:]

- HOL Rules For Itanium In a HOL Theory File
- Zcore CORE Extractor
- “Explain” Error Explainer And DOT file Generator
- GhostView
- An MPECcable
- Ocaml Program
- Gentuple Assembler
- SAT Converter
- Zchaff-II or other
- Ganesh sitting down and coding
- Printout of Cycle Revealing Error
- SAT Result: SAT (Gives Interleaving) or UNSAT


Other Tools Developed in UV Group

• Yue (Jason) Yang’s Dissertation webpage (www.cs.utah.edu/~yyang/); Yang is now at MSR

• Itanium Litmus-test Checker in Constraint Prolog

• NemosFinder – Easily Parameterizable Litmus-Checker Suite in Constraint Prolog

• UMM Tool – Easily Parameterizable Murphi Operational Model for writing Operational Specs of Memory Models

• DefectFinder – Demo Prototype of Memory-model Aware Race Analyzer


Part 3: What’s not apparent at first glance


Topics:

* Formal verification approaches to memory consistency compliance

* How to model the interface of the shared memory?

- Execution based

- IO mappings based

* What is wrong if an Execution based approach is chosen ?

- Finite-state realizability

* A transducer-based model of shared memory

- Highlights of results

* Whither undecidability ?


Formal Verification Approaches:

• Several paper-and-pencil proofs

• Arons (PVS-based)

• McMillan (CTL model-checking based)

• Nalumasu et al. (Test Automata based)

• Qadeer (1. Finding a serializer. 2. Automated for simple write order)

• Bingham et al. (Window observer based)

[Diagram: Spec of Shared Memory Consistency Model <-> Agreement <-> Imp of Shared Memory Consistency Model (a protocol)]


Other Formal Approaches:

• Park, Dill, Nowatzyk

• Pong and Dubois (several papers)

• Collier’s work

• Ghughal’s adaptation of above for weak memory models

• Chatterjee (CAV’02)

• Yu, Tuttle, Lamport

• Shen, Arvind

• Ahamad, Neiger

• (Check webpage of MPV’00 www.cs.utah.edu/mpv )

• Steinke and Nutt

• Gibbons, Gharachorloo

• Adve, Pugh

• … (a survey will take too long)


Modeling the Interface of Shared Memory:

• Trace Based

- Most existing works

- Spec and Imp are compared over events Read(proc, addr, data), Write(proc, addr, data), …

• IO Mappings Based

- The original Lazy-caching paper (casual use)

- Kawash and Higham (defines Specs this way; Implementations not addressed)

- Sezgin et al. (defines Specs and Imps + Correspondence)

- Spec and Imp are compared over outputs Read_o(proc, addr, data), Write_o(proc, addr, data), … and inputs Read_i(proc, addr), Write_i(proc, addr, data), …


What’s “wrong” with trace-based approaches?

• Permits making statements about uninteresting or unrealizable machines

• Muddies exact import of the famous “undecidability result” (Alur et al.)


Example 1: Finiteness cannot be adequately described thru regular sets of executions alone…

Consider the set of executions w(1,a,2) r(1,a,1)* r(2,a,2)* w(2,a,1) -- defines the TEMPORAL order of events

All these are considered SC because we can build a LOGICAL order w(1,a,2) r(2,a,2)* w(2,a,1) r(1,a,1)*

But how can the above TEMPORAL order be generated by a FSM ?

P1            P2
---           ---
w(a,2) ;      r(a,2) ;
r(a,1) ;      r(a,2) ;
r(a,1) ;      …
…             r(a,2) ;
r(a,1) ;      w(a,1) ;
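The sense in which the logical order justifies the temporal one can be checked mechanically. The Python sketch below uses the usual two conditions: the logical order keeps each processor's events in program order, and it is serial (every read returns the latest write to its address). The tuple layout `(kind, proc, addr, data)` and the initial value 0 are assumptions for the example.

```python
def is_serial(trace, initial=0):
    """Every read returns the most recent write to its address."""
    mem = {}
    for kind, proc, addr, data in trace:
        if kind == "w":
            mem[addr] = data
        elif mem.get(addr, initial) != data:
            return False
    return True

def per_processor(trace):
    """Project a trace onto each processor, preserving order."""
    proj = {}
    for ev in trace:
        proj.setdefault(ev[1], []).append(ev)
    return proj

def is_logical_order_for(temporal, logical):
    return per_processor(temporal) == per_processor(logical) and is_serial(logical)

def example(n):
    """One unravelling of the starred executions with n repetitions."""
    temporal = ([("w", 1, "a", 2)] + [("r", 1, "a", 1)] * n
                + [("r", 2, "a", 2)] * n + [("w", 2, "a", 1)])
    logical = ([("w", 1, "a", 2)] + [("r", 2, "a", 2)] * n
               + [("w", 2, "a", 1)] + [("r", 1, "a", 1)] * n)
    return temporal, logical
```

For any repetition count, the temporal order on its own is not serial (processor 1 reads the value 1 before anything has written it), yet the rearranged logical order passes both checks, which is why a purely trace-based definition accepts it.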


Example 1: … continued (take specific unravelling of *)

Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order:  w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N

[Figure, in three steps: a FSM Implementation Of Seq Consistency With N Internal States, shown with the program fed so far and the output generated so far]

Step 1. Program fed so far: w(1,a,2) ;
        Output generated so far: w(1,a,2) ;

Step 2. Program fed so far: w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ;
        Output generated so far: w(1,a,2) ;

Step 3. Program fed so far: w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ; NO w(2,a,1)
        Output generated so far: w(1,a,2) ; r(1,a,1) ;

FAIL ! O/P w/o Input !!


Example 1: … continued (take specific unravelling of *)

Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order:  w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N

[Figure, in three steps: a FSM Implementation Of Seq Consistency With N Internal States, now with inputs (wi, ri) and outputs (wo) distinguished]

Step 1. Program fed so far: wi(1,a,2) ;
        Output generated so far: wo(1,a,2) ;

Step 2. Program fed so far: wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ;
        Output generated so far: wo(1,a,2) ;

Step 3. Program fed so far: wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1)
        Output generated so far: wo(1,a,2) ;

FAIL ! Too many inputs w/o output


Example 1: … continued (take specific unravelling of *)

Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order:  w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N

[Figure: a FSM Implementation Of Seq Consistency With N Internal States. Program fed so far: wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1). Output generated so far: wo(1,a,2) ;. FAIL ! Too many inputs w/o output. The state graph is Labeled by wi(1,a,2) ; and { ri(1,a)^K, ri(2,a)^L } ; with the latter forming a loop.]

We can “pump” this loop, thus making it possible to generate the SAME execution for arbitrarily long programs !!


Restrictions in contemporary work that enable SC verification:

• Bingham, Condon, Hu :

- Require Prefix Closure (“no outputs w/o input”) e.g. the trace of length 1 : r(1,a,1)

- Rule out Prophetic Inheritance, i.e. Temporal Orders of the form … w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)


Restrictions in contemporary work that enable SC verification:

• Qadeer :

- Requires Simple Write Ordering

The order of the writes to the same address in the temporal order and the logical order must be the same

- (But they provide an automated model-checking based verification method for this class of SC protocols…)

Temporal Order: w(1,a,1); w(2,a,2); r(3,a,2); r(4,a,1)

Required Logical Order: w(2,a,2); r(3,a,2); w(1,a,1); r(4,a,1)

< diagram of Lazy Caching here >
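The simple write ordering condition is a direct comparison of per-address write sequences. The Python sketch below states it for traces given as `(kind, proc, addr, data)` tuples, a layout assumed for the example, and applies it to the pair of orders shown above.

```python
def write_order(trace):
    """Per-address sequence of (proc, data) writes, in trace order."""
    writes = {}
    for kind, proc, addr, data in trace:
        if kind == "w":
            writes.setdefault(addr, []).append((proc, data))
    return writes

def has_simple_write_order(temporal, logical):
    return write_order(temporal) == write_order(logical)

TEMPORAL = [("w", 1, "a", 1), ("w", 2, "a", 2), ("r", 3, "a", 2), ("r", 4, "a", 1)]
LOGICAL = [("w", 2, "a", 2), ("r", 3, "a", 2), ("w", 1, "a", 1), ("r", 4, "a", 1)]
```

Here the temporal order writes 1 and then 2 to `a`, while the logical order shown writes 2 and then 1, so this pair of orders does not satisfy the condition.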


Taxonomy of formal “SC modeling” approaches:

• Alur et al. :

- Not Necessarily Prefix Closed (NNPC) regular traces model the SC language

- Checking containment of the (regular) language of the Implementation is undecidable

• Bingham, Condon, and Hu :

- DSC trace set (Decisive Sequential Consistency)

• Sezgin’s work :

- Models memory systems using regular transducers

- Defines EXACTLY what finite-state realizable SC systems are

- SC verification is language containment

- Provides a semi-decision procedure for SC verification in this setting


Example 2 (Sezgin) : The dangers of trace-based modeling

Imagine a memory system implementation that does this:

• Accept reads and writes

• If the first |P| * |A| instructions are writes, and further these contain exactly one write by each processor to each address

THEN go into malevolent mode (disconnect the shared memory)
ELSE go into benevolent mode (behave like serial memory)

[Figure: processors P1, P2, …, Pn; memories M1, M2, …, Mn reached through Malevolent Mode Connections; a Single Serial Memory Unit M reached through Benevolent Mode Connections]


Example 2 (Sezgin) …

Example : P = {1,2,3} and A = {a} and D = {0,1,2}

w(1,a,2); r(3,a,2); w(2,a,1); r(1,a,1); …
Benevolent Mode from now on, since the second instrn is a read…

w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2); w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …
Malevolent Mode from now on, as we have p*a writes

LOGICAL ORDER:

w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1); r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …
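The two-mode memory system is small enough to simulate. The Python sketch below is one possible reading of the description: while the mode is undecided, writes go to both the shared memory and the writer's private memory, and the mode is fixed as soon as the prefix condition is settled. The class name, the method names, and the initial value 0 are assumptions made for the example.

```python
class TwoModeMemory:
    def __init__(self, procs, addrs, initial=0):
        self.procs, self.addrs = list(procs), list(addrs)
        self.shared = {a: initial for a in self.addrs}
        self.local = {p: {a: initial for a in self.addrs} for p in self.procs}
        self.mode = "undecided"
        self.seen = set()   # (proc, addr) pairs written during the prefix

    def write(self, proc, addr, data):
        if self.mode == "undecided":
            if (proc, addr) in self.seen:
                self.mode = "benevolent"        # a repeated write breaks the pattern
            else:
                self.seen.add((proc, addr))
                if len(self.seen) == len(self.procs) * len(self.addrs):
                    self.mode = "malevolent"    # exactly one write per (proc, addr)
        if self.mode != "malevolent":
            self.shared[addr] = data
        if self.mode != "benevolent":
            self.local[proc][addr] = data

    def read(self, proc, addr):
        if self.mode == "undecided":
            self.mode = "benevolent"            # a read inside the prefix
        if self.mode == "benevolent":
            return self.shared[addr]
        return self.local[proc][addr]

def run(program, procs=(1, 2, 3), addrs=("a",)):
    """program: list of ('w', proc, addr, data) or ('r', proc, addr).
    Returns the final mode and the list of values returned by the reads."""
    mem, reads = TwoModeMemory(procs, addrs), []
    for op in program:
        if op[0] == "w":
            mem.write(op[1], op[2], op[3])
        else:
            reads.append(mem.read(op[1], op[2]))
    return mem.mode, reads
```

Feeding it the two programs from this slide reproduces the read values shown: the first run settles into benevolent mode, and the second into malevolent mode, where each processor only ever sees its own writes.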


Whoa? Any Logical Order will do?!

TEMPORAL ORDER:

w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2); w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …

LOGICAL ORDER:

w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1); r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …

• A Logical Order had better be not fiction… it should be a possible schedule in a “could have happened” sense

• Viewed from that angle, the above logical order is nonsense because it allows certain actions to be postponed unboundedly

• Sezgin’s formal definition of Implementations builds in boundedness

• BCH address an instance of this in their “past-time SC” idea

• Sezgin’s SC machines give logical order out as Commit Order …


Status of SC “undecidability”:

• Alur et al. : UNDECIDABLE under NNPC

- NNPC is unrealistic

• Qadeer : Decidable under simple write order

- Simple Write Order rules out some protocols

• Bingham, Condon, and Hu : Decidable under simple write order; also in DSC_k

- These don’t capture exactly those that are FS realizable

• Sezgin’s work : Decidability open

- Captures exactly the class of FS realizable protocols in a detailed manner (“Input” or programs explicitly modeled)


Concluding Remarks:

• Importance of topic unlikely to diminish

• Platform compliance is a big deal

• High-performance OS kernel writers need to know

• Think of proving a distributed Garbage Collector running on a Weak Memory Model (would be a great PhD topic)

• I’ve omitted too many important names I can’t even remember

• Partial list: Adve, Gharachorloo, Pugh, Arvind, Collier, …


Acknowledgements (sorry for omissions):

• Past students / postdoc : Nalumasu, Ghughal, Mokkedem, Hosabettu, Jones, Sivaraj, Yang, Yang, Kuramkote

• Faculty colleagues : Lindstrom, Slind, Carter

• Funding agencies : NSF, SRC

• Industrial Liaisons : Corella, Chou, German, Vaid, Neiger, Zeisset, Park

• Other favorable influences : Mathews, Tuttle, Yu, Joshi, Dill, Pong, Nowatzyk, Lamport, Hu, Condon, Higham, Kawash, Jackson

• Who am I forgetting?