Shared Memory Consistency Models: A Broad Survey
Ganesh Gopalakrishnan*
School of Computing, University of Utah, Salt Lake City, UT
* Past work supported in part by SRC Contract 1031.001, NSF Award 0219805 and an equipment grant from Intel Corporation
2
Shared Memory: Hardware Realities
Memory performance
CPU performance
3
Shared Memory: Software Realities
• Must define the formal semantics of shared-memory concurrent programming while allowing for all reasonable optimizations
• Defining the shared thread semantics for Java (the original Java book’s Chapter 17 has essentially been ripped out…)
• Defining the Shared Memory Model for new languages such as Unified Parallel C (UPC) for Scientific Programming
• At a deeper level: Must have a formal basis for Automatic Minimal Fence Insertion to make programs appear to execute in a sequentially consistent manner
4
Topics:
• Motivations for strong and weak memory models
- How they affect consistency protocol design
- How they affect programming
• Classical memory models
- Their “power”
• Fence insertion during compilation
- Run on weak architectures but appear to run SC
• Overview of some weak architectures
• Itanium in a nutshell
• SAT-based programs that check executions against memory model specs
- Demo of the MP Execution Checker (MPEC) tool for Itanium
5
Topics:
• Theoretical aspects of memory model specification
- Specify using Traces or Specify using Transducers
• Why Trace-based Specification can allow one to talk about unrealizable machines
- Hence “undecidability of sequential consistency” is not a solved problem
• Why trace-based verification methods need to exert some care
- Otherwise can prove “conniving machines” to be SC !!
• A brief taxonomy of recent results in this area
- Mainly Alur et al., Qadeer, Bingham et al., and Sezgin
6
Sequential Consistency : The Most Basic Memory Consistency Model
• Requirements:
1. There exists a common total order
2. It respects program order
3. Each read sees the “latest” write
Under Sequential Consistency: No
Under many weak models: Yes
Example
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
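The “No” under SC can be checked by brute force: enumerate every interleaving that respects each thread’s program order, executing each read and write atomically against a single shared memory. A minimal Python sketch (the operation encoding is my own, not from the slides):

```python
def interleavings(a, b):
    """All merges of op-lists a and b that preserve each list's internal order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(seq):
    """Execute one interleaving against an atomic shared memory."""
    mem, regs = {'x': 0, 'y': 0}, {}
    for op in seq:
        if op[0] == 'W':            # ('W', var, value)
            mem[op[1]] = op[2]
        else:                       # ('R', var, register)
            regs[op[2]] = mem[op[1]]
    return regs['r1'], regs['r2']

t1 = [('W', 'x', 1), ('R', 'y', 'r1')]   # Thread 1: x = 1; r1 = y
t2 = [('W', 'y', 2), ('R', 'x', 'r2')]   # Thread 2: y = 2; r2 = x
outcomes = {run(s) for s in interleavings(t1, t2)}
# Under SC, (r1, r2) = (0, 0) never appears in `outcomes`.
```

Only three outcomes are reachable, and (0, 0) is not among them, which is exactly the SC claim on this slide.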
7
How to Think About Sequential Consistency
P1 P2 Pn
Memory
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
No! Not under SC ! But possible under many weak memory models!
An example of such a weak memory model is Sparc TSO
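The TSO effect can be mimicked with a tiny store-buffer model: each core’s writes go into a private FIFO that drains to memory later, and a core’s reads see its own buffered writes but nobody else’s. The following is my own toy illustration of that mechanism, not a full TSO semantics; one schedule is driven by hand:

```python
from collections import deque

class TSOCore:
    def __init__(self, mem):
        self.mem = mem
        self.buf = deque()               # FIFO store buffer: (addr, value)

    def write(self, addr, val):
        self.buf.append((addr, val))     # buffered; not yet visible to others

    def read(self, addr):
        for a, v in reversed(self.buf):  # forward from own newest buffered store
            if a == addr:
                return v
        return self.mem[addr]            # otherwise read shared memory

    def flush(self):
        while self.buf:
            a, v = self.buf.popleft()
            self.mem[a] = v

mem = {'x': 0, 'y': 0}
p1, p2 = TSOCore(mem), TSOCore(mem)
p1.write('x', 1)    # sits in P1's buffer
p2.write('y', 2)    # sits in P2's buffer
r1 = p1.read('y')   # buffer has no y -> reads memory: 0
r2 = p2.read('x')   # buffer has no x -> reads memory: 0
p1.flush(); p2.flush()
# r1 == r2 == 0: forbidden under SC, but allowed here because both
# stores were still buffered when the reads executed.
```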
8
Coherence == Per-location Sequential Consistency
P1 P2 Pn
1-address Memory
Notice that the same execution is Coherent !
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
9
Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
• Processors intermittently throw colors onto memory cells and also intermittently look at their colors
P1 P2 Pn
Memory Cell 1
Memory Cell 2
Memory Cell n
…
Pi
10
Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
• Many have been developed: – Sequential Consistency (SC)
– Coherence (per-location SC)
– Parallel Random Access Memory (PRAM)
– Causal Consistency
– Processor Consistency (PC)
– Release Consistency
– Location Consistency
– The Intel Itanium Memory Model
– Java Memory Model (JMM)
– and more!
11
Memory Consistency Model Specifications:
A VERY complex specification for a real architecture (e.g. Itanium, PowerPC, …)
Also of growing concern in Software (e.g. Java Memory Model, Unified Parallel C model, …)
12
Motivation for (weak) Memory Consistency models:
A Hardware Perspective :
• Cannot afford to do instantaneous updates across large MP systems
• Delayed and re-orderable updates allow considerable latitude in memory consistency protocol design, and hence fewer bugs in protocols !!
…
dir dir
Chip-level protocols
Inter-cluster protocols
Intra-cluster protocols
mem mem
13
Price Paid for Delayed Updates : Bugs!
Algorithms such as Peterson’s Mutual Exclusion cease to work!
Thread 1                       Thread 2
------------                   -----------
Flags[1] = BUSY;               Flags[2] = BUSY;
Turn = 2;                      Turn = 1;
While (Flags[2] == BUSY &&     While (Flags[1] == BUSY &&
       Turn != 1) ;                   Turn != 2) ;
Critical section               Critical section
Flags[1] = FREE;               Flags[2] = FREE;
CAN READ OLD VALUE!!
CAN READ OLD VALUE!!
14
Scope of Tutorial:
• Survey of ‘Classical’ Work
• Survey of Current Activities (that this speaker is aware of)
• Verification Challenges
• Theoretical Questions
• Justification for topic selection:
- Complement talks on Shared Memory Consistency Protocols
- Intuitions more important than the detailzzz….
- Knowing who’s who in this area helps
- Excuse for me to stick my neck out and learn something new
15
Organization:
1. Overview (mainly of classical works)
2. Practical aspects of weak consistency models (more depth)
3. What’s not apparent at first glance (still more depth)
4. Conclusions and references
16
Part 1: Overview of Classical Work
17
Memory Serves to Plumb Data…
Uniprocessor:
Write ( address = 2 , data = 33) ; …….. Read ( address = 2 , returns data = 33) ;
Multiprocessor:
P1                 P2
----               ----
Write (2, 33) ; || Read (2, 33) ;

Multiprocessor:
P1                 P2
----               ----
Write(2, 33);      Write(2, 77);
Read (2, 77);      Read(2, 33);

P1                P2                P3              P4
----              ----              ----            ----
Write (2, 33) ;   Write (2, 77) ;   Read(2, 33);    Read(2, 77);
                                    Read(2, 77);    Read(2, 33);
…but respecting Coherence!
18
…but Coherence is not sufficient:
From Shasha and Snir, Figure 1, P. 282 (ACM TOPLAS (10)2: 1988)
Processor 1            Processor 2
-------------          --------------
Test_and_set1(LOCK); Test_and_set2(LOCK);
Read1(X); Read2(X);
Write1(X); Write2(X);
Reset1(LOCK); Reset2(LOCK);
The following memory access sequence respects Coherence but breaks the critical section :
Test_and_set1(LOCK); Read1(X); Reset1(LOCK);
Test_and_set2(LOCK); Read2(X); Write1(X); Write2(X); Reset2(LOCK);
• Consistent view ACROSS ADDRESS SPACE is needed
• Most intuitive such : Sequential Consistency !
19
Basic understanding of SC:
• Execute AS IF instructions in each thread were executed sequentially and atomically
- respecting the program order in each thread
- no constraints across sequential programs
Requires effort to achieve above effect AS WELL AS high performance :
CPU 1
Memory and Bus Controller
CPU n …
Write (2, 55) ;  MISSES
Read (4, 11) ;   HITS
Write (4, 66) ;  MISSES
Read (2, 22) ;   HITS
Which Read waits ?
20
CPU 1
Memory and Bus Controller
CPU n …
Write (2, 55) ;  MISSES
Read (4, 11) ;   HITS
Write (4, 66) ;  MISSES
Read (2, 22) ;   HITS
Aggressive SC Implementations:
From Adve, Pai, and Ranganathan (Proc IEEE, (87)3, March 1999, p.448)
“If the accessed location does not change its value until the Read could have been non-speculatively issued, then the speculation is successful. Otherwise, roll-back speculation until incorrect load.” (Similar schemes used in HP PA-8000, Intel Pentium Pro, MIPS R10K)
One way to implement this: * If bus-snoop for Write(4,..) arrives before that for Write(2,..), the Read(4, 11) is invalidated – and it reissues…
Snoops are: Write(4,66); Write(2,55);
21
Unexpected Interactions: SC and Write-Update Protocols (from Grahn, Stenstrom, Dubois)
• An important aspect of Sequential Consistency is Write Atomicity
• Write-Invalidate protocols can easily guarantee Write Atomicity
• However, Write-Update protocols are often recommended (for lower read latency)
• Ensuring Write-Atomicity in Write-Update Protocols is tricky
• WEAK MEMORY MODELS TO THE RESCUE ! Don’t care about Write Atomicity except at Acquire / Release points
…
dir dir
Chip-level protocols
Inter-cluster protocols
Intra-cluster protocols
mem mem
22
A Deeper Look at Coherence :
Checking Coherence of Executions is NP-complete:
Cantin’s proof: Reduction from SAT:
Example: Consider (u1 \/ u2) /\ (~u1 \/ u2)
Create the following concurrent processes:
h1          h2          h_u1        h_~u1       h_u2        h_~u2       h3
---         ---         -----       -------     -----       -------     ---
W(d_u1)     W(d_~u1)    R(d_u1)     R(d_~u1)    R(d_u2)     R(d_~u2)    R(d_c1)
W(d_u2)     W(d_~u2)    R(d_~u1)    R(d_u1)     R(d_~u2)    R(d_u2)     R(d_c2)
W(d_c1)     W(d_c2)     W(d_c1)     W(d_u1)     W(d_c2)     W(d_u2)     W(d_~u1)
                                                                        W(d_~u2)
                                                                        W(d_F)
Literal Gadget
Clause Gadget
Existence of aCoherent Scheduleis tested
23
A Deeper Look at Coherence :
Memory models that relax coherence – and how “useful” they are:
• PRAM (pipelined RAM – Lipton and Sandberg) is of academic interest
One memory per processor
Program order is obeyed, butNo Write-Atomicity
P1 P2 Pn
…
…
…
24
A Deeper Look at Coherence :Memory models that relax coherence – and how “useful” they are:
• PRAM – of academic interest
• Location consistency
- Proposed by Gao and Sarkar
- They tout its advantages in terms of scalability
- They describe an LC protocol “machine”
- Analysis by Wallace et al. (PDPTA 2002: 1542-1550) :
* Shown that this LC machine is stronger than the LC definition
* Question whether LC programs indeed appear to execute with sequentially consistent outcomes assuming that they are “properly labeled”
* I have not seen many pubs on LC of late…
25
Classical Weak Memory Models:
• Processor Consistency is widely known
• Good discussions in Ahamad et al.,
“The Power of Processor Consistency”
• First understand PRAM :
- For each processor p, there is a legal serialization S_p of
H_p+w such that if o1 and o2 are in H_p+w and o1 –po-> o2,
then o1 –S_p-> o2
- For PC_g, we add the following condition:
for any two processors p and q, and for any location x,
S_p | (w,x) = S_q | (w,x)
“Processor Consistency according to Goodman (PC_g)”
is not the same as
“PC_d – processor consistency according to the DASH project”
26
Execution that’s PRAM and Coherent … but not PC_g:
P : w(x,0) w(y,0)
Q: r(y,0) w(x,1)
R: r(x,1) r(x,0)
* Coherent! Just look at each color separately
* Not PC_g :
Construct a history per processor with all of the processor’s actions and all of others’ writes in that history
PC_g requires the write-histories to agree per variable; but in our example,
History of Q = …w(x,0)… w(x,1)… while
History of R = …w(x,1)… w(x,0)…
27
The “power” of Processor Consistency:
• Can handle “Peterson” (Ahamad)
• Can’t handle “Bakery” (Ahamad)
• What else? (Kawash and Higham, “Bounds for mutual
exclusion with only Processor Consistency”):
- Peterson is correct for PC-G (a multi-writer protocol)
- Bakery is incorrect for PC-G (a single-writer protocol)
- Kawash and Higham prove that for mutual exclusion under
PC-G, one multi-writer and n single-writers are necessary
28
Observations:
• Weak shared memory consistency models allow consistency
protocols to be efficient
• Unfortunately programmers find weak models non-intuitive
• How can we have the best of both worlds:
- weak models to be supported by the hardware
- strong models to be presented by the software
This can be achieved through compilers that insert the minimal number of fence instructions to give the appearance of SC
29
Basics of Fence Insertion:
• Widely cited work is by Shasha and Snir
• Recent work by Lee, Midkiff, and Padua extends the above
• Let us go through some examples (initially all mem. locations are 0)
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
Under SC,
If yd = 1, then xd = 1
30
Basics of Fence Insertion:
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
BUT if we allow instructions to re-order, then the guarantee
If yd = 1, then xd = 1
is lost !!
• But often we CAN re-order without noticing an SC violation
• When can we re-order ??
31
Basics of Fence Insertion:
• Widely cited work is by Shasha and Snir (our exs. from their paper)
• Recent work by Lee, Midkiff, and Padua extends the above
• Let us go through some examples (initially all mem. locations are 0)

P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
Which program order edges in P = {a,b} must be respectedin order to guarantee SC-compliant executions ?
• Preserving a alone : Insufficient, as it can return xd=0, yd=1
• Preserving b alone : Insufficient, as it can return xd=0, yd=1
• BOTH a and b need to be preserved – how to compute this in general?
• Terminology : {a,b} in this example forms the Delay Set, D
a b
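The delay set can be computed mechanically: put the program-order edges P and the (bidirectional) conflict edges C into one graph, and keep each P edge (u, v) for which v can reach u again, i.e. the edge lies on a cycle. A small sketch of this idea (node names and the plain reachability test are mine; the full Shasha–Snir criterion additionally imposes the cycle-shape conditions listed on the next slide):

```python
from collections import deque

def delay_set(P, C):
    """P: directed program-order edges; C: undirected conflict edges.
    Returns the P edges lying on some cycle of P union C."""
    adj = {}
    for u, v in P:
        adj.setdefault(u, set()).add(v)
    for u, v in C:                       # conflict edges go both ways
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def reaches(src, dst):               # BFS reachability
        seen, work = {src}, deque([src])
        while work:
            n = work.popleft()
            if n == dst:
                return True
            for m in adj.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    work.append(m)
        return False

    return {(u, v) for (u, v) in P if reaches(v, u)}

# Example 1 from the slides:
P = {('Wx', 'Wy'),    # edge a on P1: write(x,1) ; write(y,1)
     ('Ry', 'Rx')}    # edge b on P2: read(y,yd) ; read(x,xd)
C = {('Wx', 'Rx'), ('Wy', 'Ry')}
# delay_set(P, C) == P here, matching "D = P in our case"
```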
32
Analysis is based on Critical Cycles
• Locate all critical cycles in the concurrent program
• Equate Delay Set D to all the program-order edges in all
critical cycles
• Locating Critical Cycles :
- Locate all Conflict Edges C
. Locate two accesses that are concurrent and one of them is
a write; these give the undirected Conflict Edges C
. A critical cycle is a cycle in P U C that has the following
properties :
* Contains at most two operations from the same thread
that are consecutive in it
* Contains 0, 2, or 3 accesses to each shared variable
that are consecutive in it (further properties omitted…)
33
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;

Conflict Edges C
Program Order Edges P

P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;

Critical Cycle

Delay Set D = all the P edges in the Critical Cycle = P in our case
Finding Critical Cycles : Example 1
34
Finding Critical Cycles : Example 2
P1              P2
----            ----
read(x, xd);    write(x,1);
read(y, yd);    write(y,1);

Basically a “while” loop

P1              P2
----            ----
read(x, xd);    write(x,1);
read(y, yd);    write(y,1);

Conflict Edges
Critical Cycle

Delay Set D = {b, c} whereas P = {a, b, c}

a b c
35
Finding Critical Cycles : Example 3
a1 : read A
b1 : read B
c1 : read C
d1 : read D
a2 : write B
b2 : write C
c2: write D
d2 : write A
D = { (a1,b1), (a1,c1), (a1,d1), (a2,d2), (b2,d2), (c2,d2) }
suffices to ensure SC !
I.e., a1 is an acquire-read and d2 is a release-write !!
36
Basic Approach to Fence Insertion:
• Goal : Discover the minimal set of fences to be inserted into
a concurrent shared memory program
• Suppose D is the delay-set discovered by the previous analysis
• Suppose the underlying (weak) architecture supports orderings
D_o
• Let D_m be the fences to be inserted to get the effect of D
• D_m = ( ( D U D_o )+ )tr - D_o
where “+” is the transitive closure and “tr” is the transitive reduction
a
b
c d
• Required Delay Set = { (a,b), (b,c), (a,d) }
• D_o = (c,d)
• ( (D U D_o )+ )tr = {(a,b), (b,c), (c,d)}
• ( (D U D_o)+ )tr – D_o = {(a,b), (b,c)} - fences needed only here
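The set algebra above is directly executable. Here is a sketch of the same computation, with the closure “+” and reduction “tr” written out (the reduction step below is valid for the transitive closure of an acyclic order, which is the case in this example):

```python
def closure(E):
    """Transitive closure of edge set E (repeated pairwise joins)."""
    E = set(E)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(E):
            for (c, d) in list(E):
                if b == c and (a, d) not in E:
                    E.add((a, d))
                    changed = True
    return E

def reduction(E):
    """Transitive reduction of a transitively closed, acyclic edge set:
    drop every edge implied by a 2-step path."""
    nodes = {x for e in E for x in e}
    return {(a, b) for (a, b) in E
            if not any((a, m) in E and (m, b) in E for m in nodes - {a, b})}

D   = {('a', 'b'), ('b', 'c'), ('a', 'd')}   # required delay set
D_o = {('c', 'd')}                           # orderings the hardware already gives
D_m = reduction(closure(D | D_o)) - D_o      # where fences are actually needed
# D_m == {('a', 'b'), ('b', 'c')}, as on the slide
```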
37
Basic Approach to Fence Insertion:
a
b
c d
• Required Delay Set = { (a,b), (b,c), (a,d) }
• D_o = (c,d)
• ( (D U D_o )+ )tr = {(a,b), (b,c), (c,d)}
• ( (D U D_o)+ )tr – D_o = {(a,b), (b,c)} - fences needed only here
a
b
c d
fence
fence
Hardware-providedordering
So, in a nutshell, ….
implements the desired delay-set
38
Deriving Fences from Correctness Proofs
Lamport’s paper “How to make a Correct Multiprocess Program Execute Correctly on a Multiprocessor,” IEEE Trans Computer 46(7) – 1997
provides really good insight on deriving required weak orderings through proofs
• Notations :
A ==> B : Every event in A precedes every event in B
A --> B : Some event in A precedes some event in B
39
Deriving Places to insert a Synch Instruction:
Repeat forever
  noncritical section ;
  L : x_i := true ;
  for j := 1 until i-1
    do if x_j then x_i := false ; while x_j do od ; goto L fi od ;
  for j := i+1 until N do while x_j do od od ;
  critical section ;
  x_i := false
End Repeat
Synch
Synch
Synch
There is a proof in Lamport’s paper that with just these Synch instructions, mutual exclusion is guaranteed.
40
Part 2: A Detailed Look at a Practical Weak Memory Model : Itanium (I do mention three others briefly…)
41
Well, let’s look at the big picture first:
Sparc TSO, PSO, RMO
• Reads and Writes follow the
TSO, PSO, or RMO semantics
• Additional Fence instructions
and others (e.g. semaphores)
• I’m not up to speed on these…
Alpha
• Reads (only coherence)
• Writes (only coherence)
• Load-Locked
• Store-Conditional
• Membar
42
Well, let’s look at the big picture:
Power-4
• Reads and Writes (don’t know much)
• Sync (Synchronize)
• Lwsync (Lightweight Sync – new in Power4)
• E I E I O (Enforce In-Order Execution of I/O)
• Lwarx (Load word and reserve)
• Ldarx (Load doubleword and reserve)
• Stwcx (Store word conditional)
• Stdcx (Store Doubleword Conditional)
• Isync (Instruction synchronize)
Perhaps Old-McDonald knows more…
43
IA-32, IA-64, AMD, … ?
• Generally thought to be “Processor Consistency”
• Does it really help formally specify (or even reveal the details) ?
• Intel thought so ……
The Itanium memory model is described next…
44
The Intel Itanium® Processor memory model
• Has these kinds of instructions : “weak load” or “ordinary load” -- ld
“strong load” or “acquire-load” -- ld.acq
“weak store” or “ordinary store” -- st
“strong store” or “release store” -- st.rel
“memory fence” (NOT barrier!) -- mf
A few semaphore-types
Allows sub-word writes, I/O spaces…
We don’t model these
45
Itanium® memory model thru examples
st [x] = 2
…
…
Can freely slide in a sequential program…
Only rule is coherence
“Ordinary store”
ld reg1 = [x]
The same applies to an “ordinary load”
…
…
46
Itanium® memory model thru examples
st.rel [x] = 2
…
Things before it in sequential program order can’t happen after it
“Release store”
Things after it in sequential program order may happen before it !!
47
Itanium® memory model thru examples
ld.acq r3 = [y]
…
Things before it in sequential program order may happen after it
“Acquire load”
Things after it in sequential program order can’t happen before it !!
48
st.rel [y] = 1
ld reg1 = [x] <0> ld reg2 = [y] <0>
st.rel [x] = 2
ld.acq r3 = [y] <1> ld.acq r4 = [x] <2>
Data dep.
ld.acq rule
Itanium specification DOES NOT try to explain outcomes in terms of “shuffles” of the original instructions!
But with these rules alone, we can’t explain the following legal outcome in Itanium®
49
This has turned out to be an unspoken convention in this area for other memory models also…
st [y] = 1
Local copy for P0
“remote” copy for P0
“remote” copy for P1
A store generates (n+1) progenies
ld.acq r3 = [y]
Other instructions generate only one
Itanium® rules explain execution outcomes in terms of “progenies” of stores and loads
50
P1: St a,1; Ld r1,a <1>; St b,r1 <1>;
P2: Ld.acq r2,b <1>; Ld r3,a <0>;
We wrote such a “breeding assembler”
{id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};
{id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};
{id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};
{id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};
{id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};
{id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};
{id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};
{id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};
{id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}
Tuple 1
Tuple 9
...
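The progeny-splitting step of such a “breeding assembler” can be sketched as follows. The field names follow the tuple listing above; register tracking and value computation are omitted, and the function itself is my illustration, not the actual tool:

```python
def breed(instrs, nprocs):
    """Expand a straight-line program into Itanium-style tuples:
    each St yields one Local copy plus one Remote copy per processor
    ((n+1) progenies); Ld / LdAcq yield a single tuple."""
    tuples, next_id = [], 0
    for ins in instrs:            # ins: {'proc', 'pc', 'op', 'var', 'data'}
        if ins['op'] == 'St':
            wr_id = next_id       # wrID = id of the Local progeny
            tuples.append(dict(ins, id=next_id, wrID=wr_id,
                               wrType='Local', wrProc=ins['proc']))
            next_id += 1
            for p in range(nprocs):
                tuples.append(dict(ins, id=next_id, wrID=wr_id,
                                   wrType='Remote', wrProc=p))
                next_id += 1
        else:
            tuples.append(dict(ins, id=next_id, wrID=-1,
                               wrType='DontCare', wrProc=-1))
            next_id += 1
    return tuples

# The 5-instruction program from the slide above, 2 processors:
prog = [{'proc': 0, 'pc': 0, 'op': 'St',    'var': 0, 'data': 1},
        {'proc': 0, 'pc': 1, 'op': 'Ld',    'var': 0, 'data': 1},
        {'proc': 0, 'pc': 2, 'op': 'St',    'var': 1, 'data': 1},
        {'proc': 1, 'pc': 0, 'op': 'LdAcq', 'var': 1, 'data': 1},
        {'proc': 1, 'pc': 1, 'op': 'Ld',    'var': 0, 'data': 0}]
tuples = breed(prog, nprocs=2)    # two stores split 3 ways -> 9 tuples total
```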
51
Itanium® rules specify how to line up the tuples to explain the load-outcomes !!
st [y] = 1
ld reg1 = [x] <0> ld reg2 = [y] <0>
P0 P1
st [x] = 2
Now, arrange the split copies…
st [y] = 1 “l”
st [y] = 1 “rp0”
st [y] = 1 “rp1”
st [x] = 2 “l”
st [x] = 2 “rp0”
st [x] = 2 “rp1”
st [y] = 1 “l”
st [y] = 1 “rp0”
st [y] = 1 “rp1”
st [x] = 2 “l”
st [x] = 2 “rp0”
st [x] = 2 “rp1”
ld reg1 = [x] <0>
ld reg2 = [y] <0>
Explanation…
ld.acq r3 = [y] <1> ld.acq r4 = [x] <2>
ld.acq r3 = [y] <1>
ld.acq r4 = [x] <2>
Dependencies
Anti-dependencies
52
legalItanium(exec) =
  Exists order. (  requireStrictTotalOrder exec order
                /\ requireWriteOperationOrder exec order
                /\ requireItProgramOrder exec order
                /\ requireMemoryDataDependence exec order
                /\ requireDataFlowDependence exec order
                /\ requireCoherence exec order
                /\ requireAtomicWBRelease exec order
                /\ requireSequentialUC exec order
                /\ requireNoUCBypass exec order
                /\ requireReadValue exec order )
SC(exec) =
  Exists order. (  requireStrictTotalOrder exec order
                /\ requireProgramOrder exec order
                /\ requireReadValue exec order )
Gist of our method: Illustration on SC and on Itanium
The tuples to be ordered
Find an arrangement under SC constraints
The tuples to be ordered
Find arrangement as per above constraints
53
legal_itanium exec =   (* a given execution *)
  ?order. requireStrictTotalOrder exec order
       /\ requireWriteOperationOrder exec order
       /\ requireProgramOrder exec order
       /\ requireMemoryDataDependence exec order
       /\ requireDataFlowDependence exec order
       /\ requireCoherence exec order
       /\ requireReadValue exec order
       /\ requireAtomicWBRelease exec order
       /\ requireSequentialUC exec order
       /\ requireNoUCBypass exec order
Our Itanium Formal Model (extracted from Intel Documents – written as a HOL Theory)
See Charme’03, IPDPS’04, CAV’04Various contributions by Yue Yang, Gopalakrishnan, Lindstrom, Slind, Sivaraj, Yu Yang
54
requireStrictTotalOrder exec order
55
requireWriteOperationOrder exec order
Local Write before Local Global Write
Local Write before Remote Global Writes
56
requireProgramOrder exec order
Program Order is defined solely through
Acquires, Releases, and Fences
57
requireMemoryDataDependence exec order
Order two accesses (Read or Write) under these conditions :
IF program-ordered AND the same variable AND
Write is local and RAW (and Read of course is local)
OR Write is local and WAR
OR Both writes are local and WAW
OR Both writes are remote and WAW and Fall in same processor
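One executable reading of these four cases, over tuples shaped like the listing on the earlier slide (this is my paraphrase of the rule, not the HOL text; `pc` order within one `proc` stands in for program order):

```python
def memory_data_dependence(i, j):
    """True when tuple i must be ordered before tuple j: program-ordered
    accesses to the same variable falling into one of the four cases
    (local RAW, local WAR, local WAW, remote WAW on the same processor)."""
    is_rd  = lambda t: t['op'] in ('Ld', 'LdAcq')
    is_wr  = lambda t: t['op'] in ('St', 'StRel')
    local  = lambda t: t['wrType'] == 'Local'
    remote = lambda t: t['wrType'] == 'Remote'
    if not (i['proc'] == j['proc'] and i['pc'] < j['pc']
            and i['var'] == j['var']):
        return False            # not program-ordered, or different variable
    return ((is_wr(i) and local(i) and is_rd(j))                   # RAW, write local
            or (is_rd(i) and is_wr(j) and local(j))                # WAR, write local
            or (is_wr(i) and is_wr(j) and local(i) and local(j))   # WAW, both local
            or (is_wr(i) and is_wr(j) and remote(i) and remote(j)
                and i['wrProc'] == j['wrProc']))                   # WAW, remote, same proc
```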
58
requireDataFlowDependence exec order
Data Dependence Thru the Register-Space
59
requireCoherence exec order
Just Plain-Old Coherence
but for TWO WRITES falling in the WB or UC space
and for EITHER Two Local Writes OR two Remote Writes in the same processor
60
requireReadValue exec order
Reads return Most Recent Writes
61
requireAtomicWBRelease exec order
All Remote Events Stemming from the Same Release-Write Instruction appear to be an Atomic Set
62
requireSequentialUC exec order
In the UC Space, Program-Ordered
UC Read and Write Events, both of which are Local
are ordered as per program order
(the two operations in question could be RR, RW, WR, or WW)
63
requireNoUCBypass exec order
UC-space Operations Do Not Exhibit
Read Bypassing as in TSO
64
requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType = Local) /\ (j.wrType = Local) /\ (i.proc = j.proc) \/
     (i.wrType = Remote) /\ (j.wrType = Remote) /\ (i.wrProc = j.wrProc)) ==>
    !p q. p IN exec /\ q IN exec ==>
      isWr p /\ isWr q /\ (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
      (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc) ==>
      order p q
A MEMORY MODEL RULE IN HOL
65
How do we know that the actual silicon matches the shared memory model ?
• Pray
• Run tests and manually check results
• ? What else ?
! X . X in exec ? Y . Y in exec …. ? ! /\ … \/ ….
?
One use we have put our Spec to: Post-Si Verification of MP Systems…
66
st8 [12ca20] = 7f869af546f2f14c
ld8 r25 = [45180] <87b5e547172644a8>
ld2 r26 = [2c2a2c] <44a8>
ld2 r27 = [45aa2a] <c58e>
…
FORMALLY VERIFY “interesting” EXECUTIONS
st8 [45180] = 87b5e547172644a8
ld8 r25 = [45180] <87b5e547172644a8>
st2 [2c2a2c] = 44a8
st2 [45aa2a] = c58e
…
P1’s exec
P2’s exec
…
67
TWO APPROACHES: - explicitly QB - implicitly QB
“BOOLIFY”
CONVERT TO
EXECUTION CHECKER PROGRAM
SPEC OF MEMORY MODEL IN HOL
Given Execution
QBF
PROGRAM
Given Execution
SAT PROBLEM
(Prototyped this; but definitely need to re-code this…)
68
The alternative is to produce a manual proof:
P
st [x] = 1
mf
ld r1 = [y] <0>
R
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
Q
st.rel [y] = 1
Atomicity of st.rel
Load of initial value is before store of every other value
Even this simple “Litmus Test” has a 1-page detailed proof
69
The MPEC Tool Flow:
Itanium Ordering rules in HOL
Mechanical Program Derivation (to be automated)
Checker Program
Satisfiability Problem with Clauses carrying annotations
Sat Solver
Sat / Unsat
Explanation in the form of one possible interleaving
Unsat Core Extraction using Zcore
P
st [x] = 1
mf
ld r1 = [y] <0>
R
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
Q
st.rel [y] = 1
• Find Offending Clauses
• Trace their annotations
• Determine “ordering cycle”
MP execution
to be verified
RECENT WORK
70
Largest example tried to date (courtesy S. Zeisset, Intel)
Proc 1
st8 [12ca20] = 7f869af546f2f14c
ld r25 = [45180] <87b5e547172644a8>
… 58 more instructions…
st2 [7c2a00] = 4bca
Proc 2
ld4 r24 = [733a74] <415e304>
st4.rel [175984] = 96ab4e1f
… 67 more instructions…
ld8 r87 = [56460] <b5c113d7ce4783b1>
• Initially the tool gave a trivial violation
• Diagnosed to be forgotten memory initialization
• Added method to incorporate memory initialization in our tool
• Our tool found the exact same cycle as pointed out by author of test
Cycle found thru our tool:
st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)
71
Statistics Pertaining to Case Study
• 140 total instructions
• All runs were on a 1.733 GHz 1GB Redhat Linux V9 Athlon
• 1 minute to generate the Sat instance
• 9M clauses ( O(n^3) in the number of instructions )
• 117,823 variables ( not a problem )
• ~1 minute to run Sat (unsat here) – 0.2 sec to do “real work”
• Zcore runs fast – gave 23 clauses in one iteration
72
Overview of MPEC:
• Example of how a HOL rule was turned into a SAT generator
• How the SAT part was done
Throwing an efficient “transitivity blanket” over a
problem to cover it with whatever transitivity it begs for !!
• What more to expect
• Related work
73
Gist of constraints :
• Some arrangements are statically known :
• Others are conditional : Implies and
• Some must form an atomic set : Everybody else strictly before or strictly after.
• Many are unordered :
• Find a strict total order satisfying all the above !
74
Gist of constraint ENCODING :
• Use Boolean precedence matrix • Capture “i before j” by m_ij
Unit clauses
Boolean formula
See how SAT-generator is derived
Spew out irreflexivity and totality axioms. Then throw a “transitivity blanket” on top of all tuples
Strict total order :
Atomic set :
Statically known :
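Concretely, the irreflexivity/totality axioms plus the transitivity blanket over an N×N precedence matrix can be emitted as DIMACS-style CNF clauses. A minimal generator (the variable numbering scheme is my own choice):

```python
def total_order_clauses(n):
    """CNF for a strict total order over n tuples, using Boolean
    precedence variables m_ij ('i before j'); var(i,j) = i*n + j + 1."""
    var = lambda i, j: i * n + j + 1
    clauses = []
    for i in range(n):
        for j in range(n):
            if i == j:
                clauses.append([-var(i, i)])              # irreflexivity
            elif i < j:
                clauses.append([var(i, j), var(j, i)])    # totality
                clauses.append([-var(i, j), -var(j, i)])  # antisymmetry
    for i in range(n):                # the "transitivity blanket":
        for j in range(n):            # m_ij /\ m_jk ==> m_ik
            for k in range(n):
                if len({i, j, k}) == 3:
                    clauses.append([-var(i, j), -var(j, k), var(i, k)])
    return clauses

clauses = total_order_clauses(3)
# n tuples give O(n^3) transitivity clauses, which is why the slides
# prune the blanket for large executions.
```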
75
* Small Domain method (n log n encoding)
- Generates fantastically hard SAT problems!
- Chokes many SAT solvers – Zchaff-II can handle it well
* Incremental SAT (see CAV’04)
* QBF version : initial prototype needs lots of work - can serve to provide good QBF benchmarks…..
Other Approaches Tried:
76
Approaches to “transitivity blanket”
Naïve : For all tuples i, j, and k, generate
m_ij /\ m_jk ==> m_ik
Too many clauses (1B for a 1000-tuple program)
Better: Obtain transitive-closure of known orderings and then prune irrelevant parts of the blanket
E.g., if ~m_ij is known, don’t generate
m_ij /\ … ==> …   as well as   … /\ m_ij ==> …
77
Obtaining SAT-generator from HOL
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))

atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
  forall (k in exec). (i.wrID = k.wrID) ==>
  forall (j in exec). ~(j.wrID = i.wrID) ==>
    ~(order(i,j) /\ order(j,k))
Initial Spec
Applying Contrapositive
After Reducing quantifier Scopes
78
atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
  forall (k in exec). (i.wrID = k.wrID) ==>
  forall (j in exec). ~(j.wrID = i.wrID) ==>
    ~(order(i,j) /\ order(j,k))

atomicWBRelease(exec) = forall(i, exec, wb(i))

wb(i) = if ~((attr_of i.var = WB) & (i.op = StRel) & (i.wrType = Remote)) then true
        else forall(k, exec, wb1(i,k))

wb1(i,k) = if ~(i.wrID = k.wrID) then true else forall(j, exec, wb2(i,k,j))

wb2(i,k,j) = if (j.wrID = i.wrID) then true else ~(order(i,j) & order(j,k))

forall(i, S, e(i)) = for all i in S : e(i)   (* foldr( map (fn i -> e(i)) (S) (&), true) *)
Transformed Spec
Functional Program that generates the constraints (will be automated)
…Obtaining SAT-generator from HOL
79
Clause annotations for the unsat core for example
op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
80
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
denotes an op
Denotes op numbers. Store has both local and remote exec
Building an Error-trail for UNSAT (infeasible executions) :
81
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail…
82
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail …
83
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
84
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
Building an Error-trail …
85
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
86
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail …
87
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
88
HOL Rules for Itanium in a HOL Theory File
Zcore CORE Extractor
“Explain” Error Explainer and DOT file Generator
GhostView
An MPECcable Ocaml Program
MPEC (MP Execution Checker) Tool Demo
Gentuple Assembler / SAT Converter
Zchaff-II or other
Ganesh sitting down and coding
Printout of Cycle-Revealing Error
SAT Result
SAT (Gives Interleaving)
UNSAT
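The rule applications in the error-trail slides above contribute ordering edges between numbered ops; an execution is infeasible (UNSAT) exactly when the accumulated edges close a cycle. A sketch of that final check (the edge set below is illustrative, not actual MPEC output):

```python
def find_cycle(edges):
    """Detect a directed cycle with a colored DFS."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    state = {}                      # absent = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        state[u] = 1
        for v in graph.get(u, []):
            if state.get(v) == 1:   # back edge: cycle found
                return True
            if state.get(v) is None and dfs(v):
                return True
        state[u] = 2
        return False
    return any(state.get(u) is None and dfs(u) for u in list(graph))

# hypothetical edges in the style of the trail: 4 -> 5 -> 6 -> 10 -> 11 -> 12 -> 4
edges = {(4, 5), (5, 6), (6, 10), (10, 11), (11, 12), (12, 4)}
print(find_cycle(edges))   # → True: the required orderings are contradictory
```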
89
Other Tools Developed in UV Group
• Yue (Jason) Yang’s Dissertation webpage
• Itanium Litmus-test Checker in Constraint Prolog
• NemosFinder – Easily Parameterizable Litmus-Checker Suite in Constraint Prolog
• UMM Tool – Easily Parameterizable Murphi Operational Model for writing Operational Specs of Memory Models
• DefectFinder – Demo Prototype of Memory-model Aware Race Analyzer
• Yang is now at MSR (www.cs.utah.edu/~yyang/) -- [email protected]
90
Part 3: What’s not apparent at first glance
91
Topics:
* Formal verification approaches to memory consistency compliance
* How to model the interface of the shared memory?
- Execution based
- IO mappings based
* What is wrong if an Execution based approach is chosen ?
- Finite-state realizability
* A transducer-based model of shared memory
- Highlights of results
* Whither undecidability ?
92
Formal Verification Approaches:
• Several paper-and-pencil proofs
• Arons (pvs-based)
• McMillan (CTL model-checking based)
• Nalumasu et.al. (Test Automata based)
• Qadeer (1. Finding a serializer. 2. Automated for simple write order)
• Bingham et.al. (Window observer based)
Spec ofShared Memory
Consistency Model
Imp ofShared Memory
Consistency Model(a protocol)
Agreement
93
Other Formal Approaches:
• Park, Dill, Nowatzyk
• Pong and Dubois (several papers)
• Collier’s work
• Ghughal’s adaptation of above for weak memory models
• Chatterjee (CAV’02)
• Yu, Tuttle, Lamport
• Shen, Arvind
• Ahamad, Neiger
• (Check webpage of MPV’00 www.cs.utah.edu/mpv )
• Steinke and Nutt
• Gibbons, Gharachorloo
• Adve, Pugh
• … (a survey will take too long)
94
Modeling the Interface of Shared Memory:
• Trace Based
- Most existing works
• IO Mappings Based
- The original Lazy-caching paper (casual use)
- Kawash and Higham (defines Specs this way;
Implementations not addressed)
- Sezgin et.al. – (defines Specs and Imps + Correspondence)
Spec Imp
Read(proc, addr, data),Write(proc,addr,data), …
Spec Imp
Read_o(proc, addr, data), Write_o(proc,addr,data), …
Read_i(proc, addr), Write_i(proc,addr,data), …
95
What’s “wrong” with trace-based approaches?
• Permits making statements about uninteresting or unrealizable machines
• Muddies the exact import of the famous “undecidability result” (Alur et al.)
96
Example 1: Finiteness cannot be adequately described thru regular sets of executions alone…
Consider the set of executions w(1,a,2) r(1,a,1)* r(2,a,2)* w(2,a,1) -- defines the TEMPORAL order of events
All these are considered SC because we can build a LOGICAL order w(1,a,2) r(2,a,2)* w(2,a,1) r(1,a,1)*
But how can the above TEMPORAL order be generated by a FSM ?
P1          P2
---         ---
w(a,2) ;    r(a,2) ;
r(a,1) ;    r(a,2) ;
…           …
r(a,1) ;    w(a,1) ;
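The slide's claim can be checked mechanically for any fixed unravelling of the stars. A sketch (tuple encoding and helper names are ours): a LOGICAL order is an SC witness if it preserves each processor's program order and every read sees the latest write.

```python
# Ops are (proc, kind, addr, val); memory is initially 0 everywhere.
def is_sc_witness(logical, temporal):
    # 1. program order: per-processor subsequences must coincide
    for p in {op[0] for op in temporal}:
        if ([op for op in logical if op[0] == p]
                != [op for op in temporal if op[0] == p]):
            return False
    # 2. read values: every read sees the latest write in the logical order
    mem = {}
    for _, kind, addr, val in logical:
        if kind == "w":
            mem[addr] = val
        elif mem.get(addr, 0) != val:
            return False
    return True

n = 3   # any fixed unravelling of the stars
temporal = ([("1", "w", "a", 2)] + [("1", "r", "a", 1)] * n
            + [("2", "r", "a", 2)] * n + [("2", "w", "a", 1)])
logical = ([("1", "w", "a", 2)] + [("2", "r", "a", 2)] * n
           + [("2", "w", "a", 1)] + [("1", "r", "a", 1)] * n)
print(is_sc_witness(logical, temporal))    # → True
print(is_sc_witness(temporal, temporal))   # → False: the temporal order itself is no witness
```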
97
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ;
w(1,a,2) ;
Program fed so far …
Output generated so far …
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ; w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ;
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ; r(1,a,1) ;
w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ; NO w(2,a,1)
FAIL ! O/P w/o Input !!
98
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ;
wi(1,a,2) ;
Program fed so far …
Output generated so far …
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ;
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1)
FAIL ! Too many inputs w/o output
99
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
Labeled by
wi(1,a,2) ;
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1)
FAIL ! Too many inputs w/o output
wi(1,a,2) ;
{ ri(1,a)^K, ri(2,a)^L } ;
We can “pump” this loop, thus making it possible to generate the SAME execution for arbitrarily long programs !!
100
Restrictions in contemporary work that enable SC verification:
i.e. Temporal Orders of the form …
w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
• Bingham, Condon, Hu :
- Require Prefix Closure (“no outputs w/o input”); e.g., consider the length-1 prefix r(1,a,1)
- Rule out Prophetic Inheritance
101
Restrictions in contemporary work that enable SC verification:
• Qadeer :
- Requires Simple Write Ordering
The order of the writes to the same address in the temporal order and the logical order must be the same
- (But they provide an automated model-checking based verification method for this class of SC protocols…)
Temporal Order: w(1,a,1); w(2,a,2); r(3,a,2); r(4,a,1)
Required Logical Order: w(2,a,2); r(3,a,2); w(1,a,1); r(4,a,1)
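Simple write ordering is easy to state as code. A sketch (encoding ours) applied to the example above, which needs its two writes to a swapped in the logical order and therefore falls outside Qadeer's class:

```python
# For each address, the temporal and logical orders must list its writes
# in the same order. Ops are (kind, proc, addr, val).
def simple_write_order(temporal, logical):
    addrs = {op[2] for op in temporal if op[0] == "w"}
    return all(
        [op for op in temporal if op[0] == "w" and op[2] == a]
        == [op for op in logical if op[0] == "w" and op[2] == a]
        for a in addrs)

temporal = [("w", "1", "a", 1), ("w", "2", "a", 2),
            ("r", "3", "a", 2), ("r", "4", "a", 1)]
logical = [("w", "2", "a", 2), ("r", "3", "a", 2),
           ("w", "1", "a", 1), ("r", "4", "a", 1)]
print(simple_write_order(temporal, logical))   # → False: the writes to a swap
```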
< diagram of Lazy Caching here >
102
Taxonomy of formal “SC modeling” approaches:
• Alur et.al. :
- Not Necessarily Prefix Closed (NNPC) regular traces model the SC language
- Checking containment of the (regular) language of the Implementation is undecidable
• Bingham, Condon, and Hu :
- DSC trace set (Decisive Sequential Consistency)
• Sezgin’s work :
- Models memory systems using regular transducers
- Defines EXACTLY what finite-state realizable SC systems are
- SC verification is language containment
- Provides a semi-decision procedure for SC verification in this setting
103
Example 2 (Sezgin) : The dangers of trace-based modeling
Imagine a memory system implementation that does this:
• Accept reads and writes
• If the first |P| * |A| instructions are writes, and further these contain exactly one write by each processor to each address,
THEN go into malevolent mode (disconnect the shared memory)
ELSE go into benevolent mode (behave like serial memory)
M1 M2 Mn… Single Serial Memory Unit M
P1 P2 Pn
Malevolent ModeConnections
Benevolent ModeConnections
104
Example 2 (Sezgin) …
Example : P = {1,2,3} and A={a} and D = {0,1,2}
w(1,a,2); r(3,a,2); w(2,a,1); r(1,a,1); … Benevolent Mode from now on, since the second instruction is a read…
w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2);
w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …
Malevolent Mode from now on, as we have |P| * |A| writes
M1 M2 … Mn
Single Serial Memory Unit M
P1 P2 Pn
w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1);
r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …
LOGICAL ORDER:
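The trigger predicate is simple to state. A sketch (names ours) checking both runs above:

```python
from itertools import product

# Malevolent iff the first |P|*|A| instructions are all writes, with
# exactly one write by each processor to each address.
# Ops are (kind, proc, addr, val).
def goes_malevolent(trace, procs, addrs):
    k = len(procs) * len(addrs)
    prefix = trace[:k]
    if len(prefix) < k or any(op[0] != "w" for op in prefix):
        return False
    return sorted((p, a) for _, p, a, _ in prefix) == sorted(product(procs, addrs))

P, A = ["1", "2", "3"], ["a"]
benevolent = [("w", "1", "a", 2), ("r", "3", "a", 2), ("w", "2", "a", 1)]
malevolent = [("w", "1", "a", 1), ("w", "3", "a", 2), ("w", "2", "a", 0)]
print(goes_malevolent(benevolent, P, A))   # → False: second op is a read
print(goes_malevolent(malevolent, P, A))   # → True: one write per (proc, addr)
```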
105
Whoa? Any Logical Order will do?!
w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2);
w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …
w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1);
r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …
LOGICAL ORDER:
TEMPORAL ORDER:
• A Logical Order had better not be fiction… it should be a possible schedule in a “could have happened” sense
• Viewed from that angle, the above logical order is nonsense because it allows certain actions to be postponed unboundedly
• Sezgin’s formal definition of Implementations builds in boundedness
• BCH address an instance of this in their “past-time SC” idea
• Sezgin’s SC machines give logical order out as Commit Order …
106
Status of SC “undecidability”:
• Alur et al. : UNDECIDABLE under NNPC (but NNPC is unrealistic)
• Qadeer : Decidable under simple write order (but Simple Write Order rules out some protocols)
• Bingham, Condon, and Hu : Decidable under simple write order; also in DSC_k (but these don't capture exactly those that are FS realizable)
• Sezgin’s work : Decidability open (captures exactly the class of FS-realizable protocols in a detailed manner; “Input”, i.e. programs, explicitly modeled)
107
Concluding Remarks:
• Importance of topic unlikely to diminish
• Platform compliance is a big deal
• High-performance OS kernel writers need to know
• Think of proving a distributed Garbage Collector running on a Weak Memory Model (would be a great PhD topic)
• I’ve omitted too many important names I can’t even remember
• Partial list: Adve, Gharachorloo, Pugh, Arvind, Collier, …
108
Acknowledgements (sorry for omissions):
• Past students / postdoc : Nalumasu, Ghughal, Mokkedem, Hosabettu, Jones, Sivaraj, Yang, Yang, Kuramkote
• Faculty colleagues : Lindstrom, Slind, Carter
• Funding agencies : NSF, SRC
• Industrial Liaisons : Corella, Chou, German, Vaid, Neiger, Zeisset, Park
• Other favorable influences : Mathews, Tuttle, Yu, Joshi, Dill, Pong, Nowatzyk, Lamport, Hu, Condon, Higham, Kawash, Jackson
• Who am I forgetting?