Shared Memory Consistency Models: A Broad Survey
Ganesh Gopalakrishnan*
School of Computing, University of Utah, Salt Lake City, UT
* Past work supported in part by SRC Contract 1031.001, NSF Award 0219805 and an equipment grant from Intel Corporation
2
Shared Memory: Hardware Realities
Memory performance
CPU performance
3
Shared Memory: Software Realities
• Must define the formal semantics of shared-memory concurrent programming while allowing for all reasonable optimizations
• Defining the shared thread semantics for Java (the original Java book’s Chapter 17 has essentially been ripped out…)
• Defining the Shared Memory Model for new languages such as Unified Parallel C (UPC) for Scientific Programming
• At a deeper level: Must have a formal basis for Automatic Minimal Fence Insertion to make programs appear to execute in a sequentially consistent manner
4
Topics:
• Motivations for strong and weak memory models
- How they affect consistency protocol design
- How they affect programming
• Classical memory models
- Their “power”
• Fence insertion during compilation
- Run on weak architectures but appear to run SC
• Overview of some weak architectures
• Itanium in a nutshell
• SAT-based programs that check executions against memory model specs
- Demo of the MP Execution Checker (MPEC) tool for Itanium
5
Topics:
• Theoretical aspects of memory model specification
- Specify using Traces or Specify using Transducers
• Why Trace-based Specification can allow one to talk about unrealizable machines
- Hence “undecidability of sequential consistency” is not a solved problem
• Why trace-based verification methods need to exert some care
- Otherwise can prove “conniving machines” to be SC !!
• A brief taxonomy of recent results in this area
- Mainly Alur et al., Qadeer, Bingham et al., and Sezgin
6
Sequential Consistency : The Most Basic Memory Consistency Model
• Requirements:
1. There exists a common total order
2. It respects program order
3. Each read sees the “latest” write
Under Sequential Consistency: No
Under many weak models: Yes
Example
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
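The “No” under SC can be checked by brute force: enumerate every interleaving that respects each thread’s program order, executing each read and write atomically against a single shared memory. A minimal Python sketch (the operation encoding is my own, not from the slides):

```python
def interleavings(a, b):
    """All merges of op-lists a and b that preserve each list's internal order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(seq):
    """Execute one interleaving against an atomic shared memory."""
    mem, regs = {'x': 0, 'y': 0}, {}
    for op in seq:
        if op[0] == 'W':            # ('W', var, value)
            mem[op[1]] = op[2]
        else:                       # ('R', var, register)
            regs[op[2]] = mem[op[1]]
    return regs['r1'], regs['r2']

t1 = [('W', 'x', 1), ('R', 'y', 'r1')]   # Thread 1: x = 1; r1 = y
t2 = [('W', 'y', 2), ('R', 'x', 'r2')]   # Thread 2: y = 2; r2 = x
outcomes = {run(s) for s in interleavings(t1, t2)}
# Under SC, (r1, r2) = (0, 0) never appears in `outcomes`.
```

Only three outcomes are reachable, and (0, 0) is not among them, which is exactly the SC claim on this slide.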
7
How to Think About Sequential Consistency
P1 P2 Pn
Memory
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
No! Not under SC ! But possible under many weak memory models!
An example of such a weak memory model is Sparc TSO
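The TSO effect can be mimicked with a tiny store-buffer model: each core’s writes go into a private FIFO that drains to memory later, and a core’s reads see its own buffered writes but nobody else’s. The following is my own toy illustration of that mechanism, not a full TSO semantics; one schedule is driven by hand:

```python
from collections import deque

class TSOCore:
    def __init__(self, mem):
        self.mem = mem
        self.buf = deque()               # FIFO store buffer: (addr, value)

    def write(self, addr, val):
        self.buf.append((addr, val))     # buffered; not yet visible to others

    def read(self, addr):
        for a, v in reversed(self.buf):  # forward from own newest buffered store
            if a == addr:
                return v
        return self.mem[addr]            # otherwise read shared memory

    def flush(self):
        while self.buf:
            a, v = self.buf.popleft()
            self.mem[a] = v

mem = {'x': 0, 'y': 0}
p1, p2 = TSOCore(mem), TSOCore(mem)
p1.write('x', 1)    # sits in P1's buffer
p2.write('y', 2)    # sits in P2's buffer
r1 = p1.read('y')   # buffer has no y -> reads memory: 0
r2 = p2.read('x')   # buffer has no x -> reads memory: 0
p1.flush(); p2.flush()
# r1 == r2 == 0: forbidden under SC, but allowed here because both
# stores were still buffered when the reads executed.
```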
8
Coherence == Per-location Sequential Consistency
P1 P2 Pn
1-address Memory
Notice that the same execution is Coherent !
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1;          y = 2;
r1 = y;         r2 = x;
9
Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
• Processors intermittently throw colors onto memory cells and also intermittently look at their colors
P1 P2 Pn
Memory Cell 1
Memory Cell 2
Memory Cell n
…
Pi
10
Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
• Many have been developed: – Sequential Consistency (SC)
– Coherence (per-location SC)
– Parallel Random Access Memory (PRAM)
– Causal Consistency
– Processor Consistency (PC)
– Release Consistency
– Location Consistency
– The Intel Itanium Memory Model
– Java Memory Model (JMM)
– and more!
11
Memory Consistency Model Specifications:
A VERY complex specification for a real architecture (e.g. Itanium, PowerPC, …)
Also of growing concern in Software (e.g. Java Memory Model, Unified Parallel C model, …)
12
Motivation for (weak) Memory Consistency models:
A Hardware Perspective :
• Cannot afford to do instantaneous updates across large MP systems
• Delayed and re-orderable updates allow considerable latitude in memory consistency protocol design, and hence fewer bugs in protocols !!
…
dir dir
Chip-level protocols
Inter-cluster protocols
Intra-cluster protocols
mem mem
13
Price Paid for Delayed Updates : Bugs!
Algorithms such as Peterson’s Mutual Exclusion cease to work!
Thread 1                       Thread 2
------------                   -----------
Flags[1] = BUSY;               Flags[2] = BUSY;
Turn = 2;                      Turn = 1;
While (Flags[2] == BUSY &&     While (Flags[1] == BUSY &&
       Turn != 1) ;                   Turn != 2) ;
Critical section               Critical section
Flags[1] = FREE;               Flags[2] = FREE;
CAN READ OLD VALUE!!
CAN READ OLD VALUE!!
14
Scope of Tutorial:
• Survey of ‘Classical’ Work
• Survey of Current Activities (that this speaker is aware of)
• Verification Challenges
• Theoretical Questions
• Justification for topic selection:
- Complement talks on Shared Memory Consistency Protocols
- Intuitions more important than the detailzzz….
- Knowing who’s who in this area helps
- Excuse for me to stick my neck out and learn something new
15
Organization:
1. Overview (mainly of classical works)
2. Practical aspects of weak consistency models (more depth)
3. What’s not apparent at first glance (still more depth)
4. Conclusions and references
16
Part 1: Overview of Classical Work
17
Memory Serves to Plumb Data…
Uniprocessor:
Write ( address = 2 , data = 33) ; …….. Read ( address = 2 , returns data = 33) ;
Multiprocessor:
P1                 P2
----               ----
Write (2, 33) ; || Read (2, 33) ;

Multiprocessor:
P1                 P2
----               ----
Write(2, 33);      Write(2, 77);
Read (2, 77);      Read(2, 33);

P1                P2                P3              P4
----              ----              ----            ----
Write (2, 33) ;   Write (2, 77) ;   Read(2, 33);    Read(2, 77);
                                    Read(2, 77);    Read(2, 33);
…but respecting Coherence!
18
…but Coherence is not sufficient:
From Shasha and Snir, Figure 1, P. 282 (ACM TOPLAS (10)2: 1988)
Processor 1            Processor 2
-------------          --------------
Test_and_set1(LOCK); Test_and_set2(LOCK);
Read1(X); Read2(X);
Write1(X); Write2(X);
Reset1(LOCK); Reset2(LOCK);
The following memory access sequence respects Coherence but breaks the critical section :
Test_and_set1(LOCK); Read1(X); Reset1(LOCK);
Test_and_set2(LOCK); Read2(X); Write1(X); Write2(X); Reset2(LOCK);
• Consistent view ACROSS ADDRESS SPACE is needed
• Most intuitive such : Sequential Consistency !
19
Basic understanding of SC:
• Execute AS IF instructions in each thread were executed sequentially and atomically
- respecting the program order in each thread
- no constraints across sequential programs
Requires effort to achieve above effect AS WELL AS high performance :
CPU 1
Memory and Bus Controller
CPU n …
Write (2, 55) ;  MISSES
Read (4, 11) ;   HITS
Write (4, 66) ;  MISSES
Read (2, 22) ;   HITS
Which Read waits ?
20
CPU 1
Memory and Bus Controller
CPU n …
Write (2, 55) ;  MISSES
Read (4, 11) ;   HITS
Write (4, 66) ;  MISSES
Read (2, 22) ;   HITS
Aggressive SC Implementations:
From Adve, Pai, and Ranganathan (Proc IEEE, (87)3, March 1999, p.448)
“If the accessed location does not change its value until the Read could have been non-speculatively issued, then the speculation is successful. Otherwise, roll-back speculation until incorrect load.” (Similar schemes used in HP PA-8000, Intel Pentium Pro, MIPS R10K)
One way to implement this: * If bus-snoop for Write(4,..) arrives before that for Write(2,..), the Read(4, 11) is invalidated – and it reissues…
Snoops are: Write(4,66); Write(2,55);
21
Unexpected Interactions: SC and Write-Update Protocols (from Grahn, Stenstrom, Dubois)
• An important aspect of Sequential Consistency is Write Atomicity
• Write-Invalidate protocols can easily guarantee Write Atomicity
• However, Write-Update protocols are often recommended (for lower read latency)
• Ensuring Write-Atomicity in Write-Update Protocols is tricky
• WEAK MEMORY MODELS TO THE RESCUE ! Don’t care about Write Atomicity except at Acquire / Release points
…
dir dir
Chip-level protocols
Inter-cluster protocols
Intra-cluster protocols
mem mem
22
A Deeper Look at Coherence :
Checking Coherence of Executions is NP-complete:
Cantin’s proof: Reduction from SAT:
Example: Consider (u1 \/ u2) /\ (~u1 \/ u2)
Create the following concurrent processes:
h1          h2          h_u1        h_~u1       h_u2        h_~u2       h3
---         ---         -----       -------     -----       -------     ---
W(d_u1)     W(d_~u1)    R(d_u1)     R(d_~u1)    R(d_u2)     R(d_~u2)    R(d_c1)
W(d_u2)     W(d_~u2)    R(d_~u1)    R(d_u1)     R(d_~u2)    R(d_u2)     R(d_c2)
W(d_c1)     W(d_c2)     W(d_c1)     W(d_u1)     W(d_c2)     W(d_u2)     W(d_~u1)
                                                                        W(d_~u2)
                                                                        W(d_F)
Literal Gadget
Clause Gadget
Existence of aCoherent Scheduleis tested
23
A Deeper Look at Coherence :
Memory models that relax coherence – and how “useful” they are:
• PRAM (pipelined RAM – Lipton and Sandberg) is of academic interest
One memory per processor
Program order is obeyed, butNo Write-Atomicity
P1 P2 Pn
…
…
…
24
A Deeper Look at Coherence :Memory models that relax coherence – and how “useful” they are:
• PRAM – of academic interest
• Location consistency
- Proposed by Gao and Sarkar
- They tout its advantages in terms of scalability
- They describe an LC protocol “machine”
- Analysis by Wallace et al. (PDPTA 2002: 1542-1550) :
* Shown that this LC machine is stronger than the LC definition
* Question whether LC programs indeed appear to execute with sequentially consistent outcomes assuming that they are “properly labeled”
* I have not seen many pubs on LC of late…
25
Classical Weak Memory Models:
• Processor Consistency is widely known
• Good discussions in Ahamad et al.,
“The Power of Processor Consistency”
• First understand PRAM :
- For each processor p, there is a legal serialization S_p of
H_p+w such that if o1 and o2 are in H_p+w and o1 –po-> o2,
then o1 –S_p-> o2
- For PC_g, we add the following condition:
for any two processors p and q, and for any location x,
S_p | (w,x) = S_q | (w,x)
“Processor Consistency according to Goodman (PC_g)”
is not the same as
“PC_d – processor consistency according to the DASH project”
26
Execution that’s PRAM and Coherent … but not PC_g:
P : w(x,0) w(y,0)
Q: r(y,0) w(x,1)
R: r(x,1) r(x,0)
* Coherent! Just look at each color separately
* Not PC_g :
Construct a history per processor with all of the processor’s actions and all of others’ writes in that history
PC_g requires the write-histories to agree per variable; but in our example,
History of Q = …w(x,0)… w(x,1)… while
History of R = …w(x,1)… w(x,0)…
27
The “power” of Processor Consistency:
• Can handle “Peterson” (Ahamad)
• Can’t handle “Bakery” (Ahamad)
• What else? (Kawash and Higham, “Bounds for mutual
exclusion with only Processor Consistency”):
- Peterson is correct for PC-G (a multi-writer protocol)
- Bakery is incorrect for PC-G (a single-writer protocol)
- Kawash and Higham prove that for mutual exclusion under
PC-G, one multi-writer and n single-writers are necessary
28
Observations:
• Weak shared memory consistency models allow consistency
protocols to be efficient
• Unfortunately programmers find weak models non-intuitive
• How can we have the best of both worlds:
- weak models to be supported by the hardware
- strong models to be presented by the software
This can be achieved through compilers that insert the minimal number of fence instructions to give the appearance of SC
29
Basics of Fence Insertion:
• Widely cited work is by Shasha and Snir
• Recent work by Lee, Midkiff, and Padua extends the above
• Let us go through some examples (initially all mem. locations are 0)
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
Under SC,
If yd = 1, then xd = 1
30
Basics of Fence Insertion:
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
BUT if we allow instructions to re-order, then the guarantee
If yd = 1, then xd = 1
is lost !!
• But often we CAN re-order without noticing an SC violation
• When can we re-order ??
31
Basics of Fence Insertion:
• Widely cited work is by Shasha and Snir (our exs. from their paper)
• Recent work by Lee, Midkiff, and Padua extends the above
• Let us go through some examples (initially all mem. locations are 0)

P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;
Which program order edges in P = {a,b} must be respectedin order to guarantee SC-compliant executions ?
• Preserving a alone : Insufficient, as it can return xd=0, yd=1
• Preserving b alone : Insufficient, as it can return xd=0, yd=1
• BOTH a and b need to be preserved – how to compute this in general?
• Terminology : {a,b} in this example forms the Delay Set, D
a b
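The delay set can be computed mechanically: put the program-order edges P and the (bidirectional) conflict edges C into one graph, and keep each P edge (u, v) for which v can reach u again, i.e. the edge lies on a cycle. A small sketch of this idea (node names and the plain reachability test are mine; the full Shasha–Snir criterion additionally imposes the cycle-shape conditions listed on the next slide):

```python
from collections import deque

def delay_set(P, C):
    """P: directed program-order edges; C: undirected conflict edges.
    Returns the P edges lying on some cycle of P union C."""
    adj = {}
    for u, v in P:
        adj.setdefault(u, set()).add(v)
    for u, v in C:                       # conflict edges go both ways
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def reaches(src, dst):               # BFS reachability
        seen, work = {src}, deque([src])
        while work:
            n = work.popleft()
            if n == dst:
                return True
            for m in adj.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    work.append(m)
        return False

    return {(u, v) for (u, v) in P if reaches(v, u)}

# Example 1 from the slides:
P = {('Wx', 'Wy'),    # edge a on P1: write(x,1) ; write(y,1)
     ('Ry', 'Rx')}    # edge b on P2: read(y,yd) ; read(x,xd)
C = {('Wx', 'Rx'), ('Wy', 'Ry')}
# delay_set(P, C) == P here, matching "D = P in our case"
```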
32
Analysis is based on Critical Cycles
• Locate all critical cycles in the concurrent program
• Equate Delay Set D to all the program-order edges in all
critical cycles
• Locating Critical Cycles :
- Locate all Conflict Edges C
. Locate two accesses that are concurrent and one of them is
a write; these give the undirected Conflict Edges C
. A critical cycle is a cycle in P U C that has the following
properties :
* Contains at most two operations from the same thread
that are consecutive in it
* Contains 0, 2, or 3 accesses to each shared variable
that are consecutive in it (further properties omitted…)
33
P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;

Conflict Edges C
Program Order Edges P

P1              P2
----            ----
write(x,1) ;    read(y, yd) ;
write(y,1) ;    read(x, xd) ;

Critical Cycle

Delay Set D = all the P edges in the Critical Cycle = P in our case
Finding Critical Cycles : Example 1
34
Finding Critical Cycles : Example 2
P1              P2
----            ----
read(x, xd);    write(x,1);
read(y, yd);    write(y,1);

Basically a “while” loop

P1              P2
----            ----
read(x, xd);    write(x,1);
read(y, yd);    write(y,1);

Conflict Edges
Critical Cycle

Delay Set D = {b, c} whereas P = {a, b, c}

a b c
35
Finding Critical Cycles : Example 3
a1 : read A
b1 : read B
c1 : read C
d1 : read D
a2 : write B
b2 : write C
c2: write D
d2 : write A
D = { (a1,b1), (a1,c1), (a1,d1), (a2,d2), (b2,d2), (c2,d2) }
suffices to ensure SC !
I.e., a1 is an acquire-read and d2 is a release-write !!
36
Basic Approach to Fence Insertion:
• Goal : Discover the minimal set of fences to be inserted into
a concurrent shared memory program
• Suppose D is the delay-set discovered by the previous analysis
• Suppose the underlying (weak) architecture supports orderings
D_o
• Let D_m be the fences to be inserted to get the effect of D
• D_m = ( ( D U D_o )+ )tr - D_o
where “+” is the transitive closure and “tr” is the transitive reduction
a
b
c d
• Required Delay Set = { (a,b), (b,c), (a,d) }
• D_o = (c,d)
• ( (D U D_o )+ )tr = {(a,b), (b,c), (c,d)}
• ( (D U D_o)+ )tr – D_o = {(a,b), (b,c)} - fences needed only here
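The set algebra above is directly executable. Here is a sketch of the same computation, with the closure “+” and reduction “tr” written out (the reduction step below is valid for the transitive closure of an acyclic order, which is the case in this example):

```python
def closure(E):
    """Transitive closure of edge set E (repeated pairwise joins)."""
    E = set(E)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(E):
            for (c, d) in list(E):
                if b == c and (a, d) not in E:
                    E.add((a, d))
                    changed = True
    return E

def reduction(E):
    """Transitive reduction of a transitively closed, acyclic edge set:
    drop every edge implied by a 2-step path."""
    nodes = {x for e in E for x in e}
    return {(a, b) for (a, b) in E
            if not any((a, m) in E and (m, b) in E for m in nodes - {a, b})}

D   = {('a', 'b'), ('b', 'c'), ('a', 'd')}   # required delay set
D_o = {('c', 'd')}                           # orderings the hardware already gives
D_m = reduction(closure(D | D_o)) - D_o      # where fences are actually needed
# D_m == {('a', 'b'), ('b', 'c')}, as on the slide
```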
37
Basic Approach to Fence Insertion:
a
b
c d
• Required Delay Set = { (a,b), (b,c), (a,d) }
• D_o = (c,d)
• ( (D U D_o )+ )tr = {(a,b), (b,c), (c,d)}
• ( (D U D_o)+ )tr – D_o = {(a,b), (b,c)} - fences needed only here
a
b
c d
fence
fence
Hardware-providedordering
So, in a nutshell, ….
implements the desired delay-set
38
Deriving Fences from Correctness Proofs
Lamport’s paper “How to make a Correct Multiprocess Program Execute Correctly on a Multiprocessor,” IEEE Trans Computer 46(7) – 1997
provides really good insight on deriving required weak orderings through proofs
• Notations :
A ==> B : Every event in A precedes every event in B
A --> B : Some event in A precedes some event in B
39
Deriving Places to insert a Synch Instruction:
Repeat forever
  noncritical section ;
  L : x_i := true ;
  for j := 1 until i-1
    do if x_j then x_i := false ; while x_j do od ; goto L fi od ;
  for j := i+1 until N do while x_j do od od ;
  critical section ;
  x_i := false
End Repeat
Synch
Synch
Synch
There is a proof in Lamport’s paper that with just these Synch instructions, mutual exclusion is guaranteed.
40
Part 2: A Detailed Look at a Practical Weak Memory Model : Itanium (I do mention three others briefly…)
41
Well, let’s look at the big picture first:
Sparc TSO, PSO, RMO
• Reads and Writes follow the
TSO, PSO, or RMO semantics
• Additional Fence instructions
and others (e.g. semaphores)
• I’m not up to speed on these…
Alpha
• Reads (only coherence)
• Writes (only coherence)
• Load-Locked
• Store-Conditional
• Membar
42
Well, let’s look at the big picture:
Power-4
• Reads and Writes (don’t know much)
• Sync (Synchronize)
• Lwsync (Lightweight Sync – new in Power4)
• E I E I O (Enforce In-Order Execution of I/O)
• Lwarx (Load word and reserve)
• Ldarx (Load doubleword and reserve)
• Stwcx (Store word conditional)
• Stdcx (Store Doubleword Conditional)
• Isync (Instruction synchronize)
Perhaps Old-McDonald knows more…
43
IA-32, IA-64, AMD, … ?
• Generally thought to be “Processor Consistency”
• Does it really help formally specify (or even reveal the details) ?
• Intel thought so ……
The Itanium memory model is described next…
44
The Intel Itanium® Processor memory model
• Has these kinds of instructions : “weak load” or “ordinary load” -- ld
“strong load” or “acquire-load” -- ld.acq
“weak store” or “ordinary store” -- st
“strong store” or “release store” -- st.rel
“memory fence” (NOT barrier!) -- mf
A few semaphore-types
Allows sub-word writes, I/O spaces…
We don’t model these
45
Itanium® memory model thru examples
st [x] = 2
…
…
Can freely slide in a sequential program…
Only rule is coherence
“Ordinary store”
ld reg1 = [x]
The same applies to an “ordinary load”
…
…
46
Itanium® memory model thru examples
st.rel [x] = 2
…
Things before it in sequential program order can’t happen after it
“Release store”
Things after it in sequential program order may happen before it !!
47
Itanium® memory model thru examples
ld.acq r3 = [y]
…
Things before it in sequential program order may happen after it
“Acquire load”
Things after it in sequential program order can’t happen before it !!
48
st.rel [y] = 1
ld reg1 = [x] <0> ld reg2 = [y] <0>
st.rel [x] = 2
ld.acq r3 = [y] <1> ld.acq r4 = [x] <2>
Data dep.
ld.acq rule
Itanium specification DOES NOT try to explain outcomes in terms of “shuffles” of the original instructions!
But with these rules alone, we can’t explain the following legal outcome in Itanium®
49
This has turned out to be an unspoken convention in this area for other memory models also…
st [y] = 1
Local copy for P0
“remote” copy for P0
“remote” copy for P1
A store generates (n+1) progenies
ld.acq r3 = [y]
Other instructions generate only one
Itanium® rules explain execution outcomes in terms of “progenies” of stores and loads
50
P1: St a,1; Ld r1,a <1>; St b,r1 <1>;
P2: Ld.acq r2,b <1>; Ld r3,a <0>;
We wrote such a “breeding assembler”
{id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};
{id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};
{id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};
{id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};
{id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};
{id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};
{id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};
{id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};
{id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}
Tuple 1
Tuple 9
...
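The progeny-splitting step of such a “breeding assembler” can be sketched as follows. The field names follow the tuple listing above; register tracking and value computation are omitted, and the function itself is my illustration, not the actual tool:

```python
def breed(instrs, nprocs):
    """Expand a straight-line program into Itanium-style tuples:
    each St yields one Local copy plus one Remote copy per processor
    ((n+1) progenies); Ld / LdAcq yield a single tuple."""
    tuples, next_id = [], 0
    for ins in instrs:            # ins: {'proc', 'pc', 'op', 'var', 'data'}
        if ins['op'] == 'St':
            wr_id = next_id       # wrID = id of the Local progeny
            tuples.append(dict(ins, id=next_id, wrID=wr_id,
                               wrType='Local', wrProc=ins['proc']))
            next_id += 1
            for p in range(nprocs):
                tuples.append(dict(ins, id=next_id, wrID=wr_id,
                                   wrType='Remote', wrProc=p))
                next_id += 1
        else:
            tuples.append(dict(ins, id=next_id, wrID=-1,
                               wrType='DontCare', wrProc=-1))
            next_id += 1
    return tuples

# The 5-instruction program from the slide above, 2 processors:
prog = [{'proc': 0, 'pc': 0, 'op': 'St',    'var': 0, 'data': 1},
        {'proc': 0, 'pc': 1, 'op': 'Ld',    'var': 0, 'data': 1},
        {'proc': 0, 'pc': 2, 'op': 'St',    'var': 1, 'data': 1},
        {'proc': 1, 'pc': 0, 'op': 'LdAcq', 'var': 1, 'data': 1},
        {'proc': 1, 'pc': 1, 'op': 'Ld',    'var': 0, 'data': 0}]
tuples = breed(prog, nprocs=2)    # two stores split 3 ways -> 9 tuples total
```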
51
Itanium® rules specify how to line up the tuples to explain the load-outcomes !!
st [y] = 1
ld reg1 = [x] <0> ld reg2 = [y] <0>
P0 P1
st [x] = 2
Now, arrange the split copies…
st [y] = 1 “l”
st [y] = 1 “rp0”
st [y] = 1 “rp1”
st [x] = 2 “l”
st [x] = 2 “rp0”
st [x] = 2 “rp1”
st [y] = 1 “l”
st [y] = 1 “rp0”
st [y] = 1 “rp1”
st [x] = 2 “l”
st [x] = 2 “rp0”
st [x] = 2 “rp1”
ld reg1 = [x] <0>
ld reg2 = [y] <0>
Explanation…
ld.acq r3 = [y] <1> ld.acq r4 = [x] <2>
ld.acq r3 = [y] <1>
ld.acq r4 = [x] <2>
Dependencies
Anti-dependencies
52
legalItanium(exec) =
  Exists order. (  requireStrictTotalOrder exec order
                /\ requireWriteOperationOrder exec order
                /\ requireItProgramOrder exec order
                /\ requireMemoryDataDependence exec order
                /\ requireDataFlowDependence exec order
                /\ requireCoherence exec order
                /\ requireAtomicWBRelease exec order
                /\ requireSequentialUC exec order
                /\ requireNoUCBypass exec order
                /\ requireReadValue exec order )
SC(exec) =
  Exists order. (  requireStrictTotalOrder exec order
                /\ requireProgramOrder exec order
                /\ requireReadValue exec order )
Gist of our method: Illustration on SC and on Itanium
The tuples to be ordered
Find an arrangement under SC constraints
The tuples to be ordered
Find arrangement as per above constraints
53
legal_itanium exec =   (* a given execution *)
  ?order. requireStrictTotalOrder exec order
       /\ requireWriteOperationOrder exec order
       /\ requireProgramOrder exec order
       /\ requireMemoryDataDependence exec order
       /\ requireDataFlowDependence exec order
       /\ requireCoherence exec order
       /\ requireReadValue exec order
       /\ requireAtomicWBRelease exec order
       /\ requireSequentialUC exec order
       /\ requireNoUCBypass exec order
Our Itanium Formal Model (extracted from Intel Documents – written as a HOL Theory)
See Charme’03, IPDPS’04, CAV’04Various contributions by Yue Yang, Gopalakrishnan, Lindstrom, Slind, Sivaraj, Yu Yang
54
requireStrictTotalOrder exec order
55
requireWriteOperationOrder exec order
Local Write before Local Global Write
Local Write before Remote Global Writes
56
requireProgramOrder exec order
Program Order is defined solely through
Acquires, Releases, and Fences
57
requireMemoryDataDependence exec order
Order two accesses (Read or Write) under these conditions :
IF program-ordered AND the same variable AND
Write is local and RAW (and Read of course is local)
OR Write is local and WAR
OR Both writes are local and WAW
OR Both writes are remote and WAW and Fall in same processor
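One executable reading of these four cases, over tuples shaped like the listing on the earlier slide (this is my paraphrase of the rule, not the HOL text; `pc` order within one `proc` stands in for program order):

```python
def memory_data_dependence(i, j):
    """True when tuple i must be ordered before tuple j: program-ordered
    accesses to the same variable falling into one of the four cases
    (local RAW, local WAR, local WAW, remote WAW on the same processor)."""
    is_rd  = lambda t: t['op'] in ('Ld', 'LdAcq')
    is_wr  = lambda t: t['op'] in ('St', 'StRel')
    local  = lambda t: t['wrType'] == 'Local'
    remote = lambda t: t['wrType'] == 'Remote'
    if not (i['proc'] == j['proc'] and i['pc'] < j['pc']
            and i['var'] == j['var']):
        return False            # not program-ordered, or different variable
    return ((is_wr(i) and local(i) and is_rd(j))                   # RAW, write local
            or (is_rd(i) and is_wr(j) and local(j))                # WAR, write local
            or (is_wr(i) and is_wr(j) and local(i) and local(j))   # WAW, both local
            or (is_wr(i) and is_wr(j) and remote(i) and remote(j)
                and i['wrProc'] == j['wrProc']))                   # WAW, remote, same proc
```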
58
requireDataFlowDependence exec order
Data Dependence Thru the Register-Space
59
requireCoherence exec order
Just Plain-Old Coherence
but for TWO WRITES falling in the WB or UC space
and for EITHER Two Local Writes OR two Remote Writes in the same processor
60
requireReadValue exec order
Reads return Most Recent Writes
61
requireAtomicWBRelease exec order
All Remote Events Stemming from the Same Release-Write Instruction appear to be an Atomic Set
62
requireSequentialUC exec order
In the UC Space, Program-Ordered
UC Read and Write Events, both of which are Local
are ordered as per program order
(the two operations in question could be RR, RW, WR, or WW)
63
requireNoUCBypass exec order
UC-space Operations Do Not Exhibit
Read Bypassing as in TSO
64
requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType = Local) /\ (j.wrType = Local) /\ (i.proc = j.proc) \/
     (i.wrType = Remote) /\ (j.wrType = Remote) /\ (i.wrProc = j.wrProc)) ==>
    !p q. p IN exec /\ q IN exec ==>
      isWr p /\ isWr q /\ (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
      (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc) ==>
      order p q
A MEMORY MODEL RULE IN HOL
65
How do we know that the actual silicon matches the shared memory model ?
• Pray
• Run tests and manually check results
• ? What else ?
! X . X in exec ? Y . Y in exec …. ? ! /\ … \/ ….
?
One use we have put our Spec to: Post-Si Verification of MP Systems…
66
st8 [12ca20] = 7f869af546f2f14c
ld8 r25 = [45180] <87b5e547172644a8>
ld2 r26 = [2c2a2c] <44a8>
ld2 r27 = [45aa2a] <c58e>
…
FORMALLY VERIFY “interesting” EXECUTIONS
st8 [45180] = 87b5e547172644a8
ld8 r25 = [45180] <87b5e547172644a8>
st2 [2c2a2c] = 44a8
st2 [45aa2a] = c58e
…
P1’s exec
P2’s exec
…
67
TWO APPROACHES: - explicitly QB - implicitly QB
“BOOLIFY”
CONVERT TO
EXECUTION CHECKER PROGRAM
SPEC OF MEMORY MODEL IN HOL
Given Execution
QBF
PROGRAM
Given Execution
SAT PROBLEM
(Prototyped this; but definitely need to re-code this…)
68
The alternative is to produce a manual proof:
P
st [x] = 1
mf
ld r1 = [y] <0>
R
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
Q
st.rel [y] = 1
Atomicity of st.rel
Load of initial value is before store of every other value
Even this simple “Litmus Test” has a 1-page detailed proof
69
The MPEC Tool Flow:
Itanium Ordering rules in HOL
Mechanical Program Derivation (to be automated)
Checker Program
Satisfiability Problem with Clauses carrying annotations
Sat Solver
Sat / Unsat
Explanation in the form of one possible interleaving
Unsat Core Extraction using Zcore
P
st [x] = 1
mf
ld r1 = [y] <0>
R
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
Q
st.rel [y] = 1
• Find Offending Clauses
• Trace their annotations
• Determine “ordering cycle”
MP execution
to be verified
RECENT WORK
70
Largest example tried to date (courtesy S. Zeisset, Intel)
Proc 1
st8 [12ca20] = 7f869af546f2f14c
ld r25 = [45180] <87b5e547172644a8>
… 58 more instructions…
st2 [7c2a00] = 4bca
Proc 2
ld4 r24 = [733a74] <415e304>
st4.rel [175984] = 96ab4e1f
… 67 more instructions…
ld8 r87 = [56460] <b5c113d7ce4783b1>
• Initially the tool gave a trivial violation
• Diagnosed to be forgotten memory initialization
• Added method to incorporate memory initialization in our tool
• Our tool found the exact same cycle as pointed out by author of test
Cycle found thru our tool:
st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)
71
Statistics Pertaining to Case Study
• 140 total instructions
• All runs were on a 1.733 GHz 1GB Redhat Linux V9 Athlon
• 1 minute to generate the Sat instance
• 9M clauses ( O(n^3) in the number of instructions )
• 117,823 variables ( not a problem )
• ~1 minute to run Sat (unsat here) – 0.2 sec to do “real work”
• Zcore runs fast – gave 23 clauses in one iteration
72
Overview of MPEC:
• Example of how a HOL rule was turned into a SAT generator
• How the SAT part was done
Throwing an efficient “transitivity blanket” over a
problem to cover it with whatever transitivity it begs for !!
• What more to expect
• Related work
73
Gist of constraints :
• Some arrangements are statically known :
• Others are conditional : Implies and
• Some must form an atomic set : Everybody else strictly before or strictly after.
• Many are unordered :
• Find a strict total order satisfying all the above !
74
Gist of constraint ENCODING :
• Use Boolean precedence matrix • Capture “i before j” by m_ij
Unit clauses
Boolean formula
See how SAT-generator is derived
Spew out irreflexivity and totality axioms. Then throw a “transitivity blanket” on top of all tuples
Strict total order :
Atomic set :
Statically known :
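Concretely, the irreflexivity/totality axioms plus the transitivity blanket over an N×N precedence matrix can be emitted as DIMACS-style CNF clauses. A minimal generator (the variable numbering scheme is my own choice):

```python
def total_order_clauses(n):
    """CNF for a strict total order over n tuples, using Boolean
    precedence variables m_ij ('i before j'); var(i,j) = i*n + j + 1."""
    var = lambda i, j: i * n + j + 1
    clauses = []
    for i in range(n):
        for j in range(n):
            if i == j:
                clauses.append([-var(i, i)])              # irreflexivity
            elif i < j:
                clauses.append([var(i, j), var(j, i)])    # totality
                clauses.append([-var(i, j), -var(j, i)])  # antisymmetry
    for i in range(n):                # the "transitivity blanket":
        for j in range(n):            # m_ij /\ m_jk ==> m_ik
            for k in range(n):
                if len({i, j, k}) == 3:
                    clauses.append([-var(i, j), -var(j, k), var(i, k)])
    return clauses

clauses = total_order_clauses(3)
# n tuples give O(n^3) transitivity clauses, which is why the slides
# prune the blanket for large executions.
```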
75
* Small Domain method (n log n encoding)
- Generates fantastically hard SAT problems!
- Chokes many SAT solvers – Zchaff-II can handle it well
* Incremental SAT (see CAV’04)
* QBF version : initial prototype needs lots of work - can serve to provide good QBF benchmarks…..
Other Approaches Tried:
76
Approaches to “transitivity blanket”
Naïve : For all tuples i, j, and k, generate
m_ij /\ m_jk ==> m_ik
Too many clauses (1B for a 1000-tuple program)
Better: Obtain transitive-closure of known orderings and then prune irrelevant parts of the blanket
E.g., if ~m_ij is known, don’t generate
m_ij /\ … ==> …   as well as   … /\ m_ij ==> …
77
Obtaining SAT-generator from HOL
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\
    (i.wrID = k.wrID) /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))

atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
  forall (k in exec). (i.wrID = k.wrID) ==>
  forall (j in exec). ~(j.wrID = i.wrID) ==>
    ~(order(i,j) /\ order(j,k))
Initial Spec
Applying Contrapositive
After Reducing quantifier Scopes
78
atomicWBRelease(exec,order) =
  forall (i in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) ==>
  forall (k in exec). (i.wrID = k.wrID) ==>
  forall (j in exec). ~(j.wrID = i.wrID) ==>
    ~(order(i,j) /\ order(j,k))

atomicWBRelease(exec) = forall(i, exec, wb(i))

wb(i) = if ~((attr_of i.var = WB) & (i.op = StRel) & (i.wrType = Remote)) then true
        else forall(k, exec, wb1(i,k))

wb1(i,k) = if ~(i.wrID = k.wrID) then true else forall(j, exec, wb2(i,k,j))

wb2(i,k,j) = if (j.wrID = i.wrID) then true else ~(order(i,j) & order(j,k))

forall(i, S, e(i)) = for all i in S : e(i)   (* foldr( map (fn i -> e(i)) (S) (&), true) *)
Transformed Spec
Functional Program that generates the constraints (will be automated)
…Obtaining SAT-generator from HOL
79
Clause annotations for the unsat core for example
op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
80
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
denotes an op
Denotes op numbers. Store has both local and remote exec
Building an Error-trail for UNSAT (infeasible executions) :
81
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail…
82
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail …
83
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
84
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
Building an Error-trail …
85
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
86
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
Building an Error-trail …
87
1 2 3 4
5
6
7 8 9 10
12
11
st [x] = 1
mf
ld r1 = [y] <0>
st.rel [y] = 1
ld.acq r2 = [y] <1>
ld r3 = [x] <0>
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
Building an Error-trail …
88
HOL Rules for Itanium in a HOL Theory File
Zcore CORE Extractor
“Explain” Error Explainer and DOT file Generator
GhostView
An MPECcable Ocaml Program
MPEC (MP Execution Checker) Tool Demo
Gentuple Assembler / SAT Converter
Zchaff-II or other
Ganesh sitting down and coding
Printout of Cycle-Revealing Error
SAT Result
SAT (Gives Interleaving)
UNSAT
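The rule applications in the error-trail slides above contribute ordering edges between numbered ops; an execution is infeasible (UNSAT) exactly when the accumulated edges close a cycle. A sketch of that final check (the edge set below is illustrative, not actual MPEC output):

```python
def find_cycle(edges):
    """Detect a directed cycle with a colored DFS."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    state = {}                      # absent = unvisited, 1 = on stack, 2 = done
    def dfs(u):
        state[u] = 1
        for v in graph.get(u, []):
            if state.get(v) == 1:   # back edge: cycle found
                return True
            if state.get(v) is None and dfs(v):
                return True
        state[u] = 2
        return False
    return any(state.get(u) is None and dfs(u) for u in list(graph))

# hypothetical edges in the style of the trail: 4 -> 5 -> 6 -> 10 -> 11 -> 12 -> 4
edges = {(4, 5), (5, 6), (6, 10), (10, 11), (11, 12), (12, 4)}
print(find_cycle(edges))   # → True: the required orderings are contradictory
```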
89
Other Tools Developed in UV Group
• Yue (Jason) Yang’s Dissertation webpage
• Itanium Litmus-test Checker in Constraint Prolog
• NemosFinder – Easily Parameterizable Litmus-Checker Suite in Constraint Prolog
• UMM Tool – Easily Parameterizable Murphi Operational Model for writing Operational Specs of Memory Models
• DefectFinder – Demo Prototype of Memory-model Aware Race Analyzer
• Yang is now at MSR (www.cs.utah.edu/~yyang/) -- [email protected]
90
Part 3: What’s not apparent at first glance
91
Topics:
* Formal verification approaches to memory consistency compliance
* How to model the interface of the shared memory?
- Execution based
- IO mappings based
* What is wrong if an Execution based approach is chosen ?
- Finite-state realizability
* A transducer-based model of shared memory
- Highlights of results
* Whither undecidability ?
92
Formal Verification Approaches:
• Several paper-and-pencil proofs
• Arons (pvs-based)
• McMillan (CTL model-checking based)
• Nalumasu et.al. (Test Automata based)
• Qadeer (1. Finding a serializer. 2. Automated for simple write order)
• Bingham et.al. (Window observer based)
Spec ofShared Memory
Consistency Model
Imp ofShared Memory
Consistency Model(a protocol)
Agreement
93
Other Formal Approaches:
• Park, Dill, Nowatzyk
• Pong and Dubois (several papers)
• Collier’s work
• Ghughal’s adaptation of above for weak memory models
• Chatterjee (CAV’02)
• Yu, Tuttle, Lamport
• Shen, Arvind
• Ahamad, Neiger
• (Check webpage of MPV’00 www.cs.utah.edu/mpv )
• Steinke and Nutt
• Gibbons, Gharachorloo
• Adve, Pugh
• … (a survey will take too long)
94
Modeling the Interface of Shared Memory:
• Trace Based
- Most existing works
• IO Mappings Based
- The original Lazy-caching paper (casual use)
- Kawash and Higham (defines Specs this way;
Implementations not addressed)
- Sezgin et.al. – (defines Specs and Imps + Correspondence)
Spec Imp
Read(proc, addr, data),Write(proc,addr,data), …
Spec Imp
Read_o(proc, addr, data), Write_o(proc,addr,data), …
Read_i(proc, addr), Write_i(proc,addr,data), …
95
What’s “wrong” with trace-based approaches?
• Permits making statements about uninteresting or unrealizable machines
• Muddies the exact import of the famous “undecidability result” (Alur et al.)
96
Example 1: Finiteness cannot be adequately described thru regular sets of executions alone…
Consider the set of executions w(1,a,2) r(1,a,1)* r(2,a,2)* w(2,a,1) -- defines the TEMPORAL order of events
All these are considered SC because we can build a LOGICAL order w(1,a,2) r(2,a,2)* w(2,a,1) r(1,a,1)*
But how can the above TEMPORAL order be generated by a FSM ?
P1          P2
---         ---
w(a,2) ;    r(a,2) ;
r(a,1) ;    r(a,2) ;
…           …
r(a,1) ;    w(a,1) ;
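The slide's claim can be checked mechanically for any fixed unravelling of the stars. A sketch (tuple encoding and helper names are ours): a LOGICAL order is an SC witness if it preserves each processor's program order and every read sees the latest write.

```python
# Ops are (proc, kind, addr, val); memory is initially 0 everywhere.
def is_sc_witness(logical, temporal):
    # 1. program order: per-processor subsequences must coincide
    for p in {op[0] for op in temporal}:
        if ([op for op in logical if op[0] == p]
                != [op for op in temporal if op[0] == p]):
            return False
    # 2. read values: every read sees the latest write in the logical order
    mem = {}
    for _, kind, addr, val in logical:
        if kind == "w":
            mem[addr] = val
        elif mem.get(addr, 0) != val:
            return False
    return True

n = 3   # any fixed unravelling of the stars
temporal = ([("1", "w", "a", 2)] + [("1", "r", "a", 1)] * n
            + [("2", "r", "a", 2)] * n + [("2", "w", "a", 1)])
logical = ([("1", "w", "a", 2)] + [("2", "r", "a", 2)] * n
           + [("2", "w", "a", 1)] + [("1", "r", "a", 1)] * n)
print(is_sc_witness(logical, temporal))    # → True
print(is_sc_witness(temporal, temporal))   # → False: the temporal order itself is no witness
```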
97
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ;
w(1,a,2) ;
Program fed so far …
Output generated so far …
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ; w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ;
A FSM Implementation of Seq Consistency with N Internal States
w(1,a,2) ; r(1,a,1) ;
w(1,a,2) ; { r(1,a)^K, r(2,a)^L } ; NO w(2,a,1)
FAIL ! O/P w/o Input !!
98
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ;
wi(1,a,2) ;
Program fed so far …
Output generated so far …
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ;
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1)
FAIL ! Too many inputs w/o output
99
Example 1: … continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^2N w(2,a,1) r(1,a,1)^2N
Labeled by
wi(1,a,2) ;
A FSM Implementation of Seq Consistency with N Internal States
wo(1,a,2) ; wi(1,a,2) ; { ri(1,a)^K, ri(2,a)^L } ; wi(2,a,1)
FAIL ! Too many inputs w/o output
wi(1,a,2) ;
{ ri(1,a)^K, ri(2,a)^L } ;
We can “pump” this loop, thus making it possible to generate the SAME execution for arbitrarily long programs !!
100
Restrictions in contemporary work that enable SC verification:
i.e. Temporal Orders of the form …
w(1,a,2) r(1,a,1)^2N r(2,a,2)^2N w(2,a,1)
• Bingham, Condon, Hu :
- Require Prefix Closure (“no outputs w/o input”); e.g., consider the length-1 prefix r(1,a,1)
- Rule out Prophetic Inheritance
101
Restrictions in contemporary work that enable SC verification:
• Qadeer :
- Requires Simple Write Ordering
The order of the writes to the same address in the temporal order and the logical order must be the same
- (But they provide an automated model-checking based verification method for this class of SC protocols…)
Temporal Order: w(1,a,1); w(2,a,2); r(3,a,2); r(4,a,1)
Required Logical Order: w(2,a,2); r(3,a,2); w(1,a,1); r(4,a,1)
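Simple write ordering is easy to state as code. A sketch (encoding ours) applied to the example above, which needs its two writes to a swapped in the logical order and therefore falls outside Qadeer's class:

```python
# For each address, the temporal and logical orders must list its writes
# in the same order. Ops are (kind, proc, addr, val).
def simple_write_order(temporal, logical):
    addrs = {op[2] for op in temporal if op[0] == "w"}
    return all(
        [op for op in temporal if op[0] == "w" and op[2] == a]
        == [op for op in logical if op[0] == "w" and op[2] == a]
        for a in addrs)

temporal = [("w", "1", "a", 1), ("w", "2", "a", 2),
            ("r", "3", "a", 2), ("r", "4", "a", 1)]
logical = [("w", "2", "a", 2), ("r", "3", "a", 2),
           ("w", "1", "a", 1), ("r", "4", "a", 1)]
print(simple_write_order(temporal, logical))   # → False: the writes to a swap
```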
< diagram of Lazy Caching here >
102
Taxonomy of formal “SC modeling” approaches:
• Alur et.al. :
- Not Necessarily Prefix Closed (NNPC) regular traces model the SC language
- Checking containment of the (regular) language of the Implementation is undecidable
• Bingham, Condon, and Hu :
- DSC trace set (Decisive Sequential Consistency)
• Sezgin’s work :
- Models memory systems using regular transducers
- Defines EXACTLY what finite-state realizable SC systems are
- SC verification is language containment
- Provides a semi-decision procedure for SC verification in this setting
103
Example 2 (Sezgin) : The dangers of trace-based modeling
Imagine a memory system implementation that does this:
• Accept reads and writes
• If the first |P| * |A| instructions are writes, and further these contain exactly one write by each processor to each address,
THEN go into malevolent mode (disconnect the shared memory)
ELSE go into benevolent mode (behave like serial memory)
M1 M2 Mn… Single Serial Memory Unit M
P1 P2 Pn
Malevolent ModeConnections
Benevolent ModeConnections
104
Example 2 (Sezgin) …
Example : P = {1,2,3} and A={a} and D = {0,1,2}
w(1,a,2); r(3,a,2); w(2,a,1); r(1,a,1); … Benevolent Mode from now on, since the second instruction is a read…
w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2);
w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …
Malevolent Mode from now on, as we have |P| * |A| writes
M1 M2 … Mn
Single Serial Memory Unit M
P1 P2 Pn
w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1);
r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …
LOGICAL ORDER:
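The trigger predicate is simple to state. A sketch (names ours) checking both runs above:

```python
from itertools import product

# Malevolent iff the first |P|*|A| instructions are all writes, with
# exactly one write by each processor to each address.
# Ops are (kind, proc, addr, val).
def goes_malevolent(trace, procs, addrs):
    k = len(procs) * len(addrs)
    prefix = trace[:k]
    if len(prefix) < k or any(op[0] != "w" for op in prefix):
        return False
    return sorted((p, a) for _, p, a, _ in prefix) == sorted(product(procs, addrs))

P, A = ["1", "2", "3"], ["a"]
benevolent = [("w", "1", "a", 2), ("r", "3", "a", 2), ("w", "2", "a", 1)]
malevolent = [("w", "1", "a", 1), ("w", "3", "a", 2), ("w", "2", "a", 0)]
print(goes_malevolent(benevolent, P, A))   # → False: second op is a read
print(goes_malevolent(malevolent, P, A))   # → True: one write per (proc, addr)
```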
105
Whoa? Any Logical Order will do?!
w(1,a,1); w(3,a,2); w(2,a,0); r(1,a,1); r(2,a,0); r(3,a,2); w(1,a,2);
w(2,a,1); r(1,a,2); r(2,a,1); r(3,a,2); …
w(1,a,1); r(1,a,1); w(1,a,2); r(1,a,2); w(2,a,0); r(2,a,0); w(2,a,1);
r(2,a,1); w(3,a,2); r(3,a,2); r(3,a,2); …
LOGICAL ORDER:
TEMPORAL ORDER:
• A Logical Order had better not be fiction… it should be a possible schedule in a “could have happened” sense
• Viewed from that angle, the above logical order is nonsense because it allows certain actions to be postponed unboundedly
• Sezgin’s formal definition of Implementations builds in boundedness
• BCH address an instance of this in their “past-time SC” idea
• Sezgin’s SC machines give logical order out as Commit Order …
106
Status of SC “undecidability”:
• Alur et al. : UNDECIDABLE under NNPC (but NNPC is unrealistic)
• Qadeer : Decidable under simple write order (but Simple Write Order rules out some protocols)
• Bingham, Condon, and Hu : Decidable under simple write order; also in DSC_k (but these don't capture exactly those that are FS realizable)
• Sezgin’s work : Decidability open (captures exactly the class of FS-realizable protocols in a detailed manner; “Input”, i.e. programs, explicitly modeled)
107
Concluding Remarks:
• Importance of topic unlikely to diminish
• Platform compliance is a big deal
• High-performance OS kernel writers need to know
• Think of proving a distributed Garbage Collector running on a Weak Memory Model (would be a great PhD topic)
• I’ve omitted too many important names I can’t even remember
• Partial list: Adve, Gharachorloo, Pugh, Arvind, Collier, …
108
Acknowledgements (sorry for omissions):
• Past students / postdoc : Nalumasu, Ghughal, Mokkedem, Hosabettu, Jones, Sivaraj, Yang, Yang, Kuramkote
• Faculty colleagues : Lindstrom, Slind, Carter
• Funding agencies : NSF, SRC
• Industrial Liaisons : Corella, Chou, German, Vaid, Neiger, Zeisset, Park
• Other favorable influences : Mathews, Tuttle, Yu, Joshi, Dill, Pong, Nowatzyk, Lamport, Hu, Condon, Higham, Kawash, Jackson
• Who am I forgetting?