
Robust Concurrent Computing

Rachid Guerraoui‡   Petr Kuznetsov⋆

‡ School of Computer and Communication Sciences, EPFL
⋆ Technische Universität Berlin, Deutsche Telekom Laboratories

June 23, 2012


Contents

1 Introduction
  1.1 A broad picture: the concurrency revolution
  1.2 The topic: robust synchronization objects
  1.3 Content of the book
    1.3.1 Shared objects as synchronization abstractions
    1.3.2 Atomicity
    1.3.3 Wait-freedom
    1.3.4 Object implementation
    1.3.5 Reducibility
  1.4 Content and organization
  1.5 Bibliographical notes

I Correctness: safety and liveness

2 Atomicity: A Correctness Property for Shared Objects
  2.1 Introduction
  2.2 Model
    2.2.1 Processes and operations
    2.2.2 Objects
    2.2.3 Histories
    2.2.4 Sequential history
  2.3 Atomicity
    2.3.1 Legal history
    2.3.2 The case of complete histories
    2.3.3 The case of incomplete histories
  2.4 Safety and Liveness
  2.5 Locality
    2.5.1 Local properties
    2.5.2 Atomicity is a local property
  2.6 Atomicity is nonblocking
  2.7 Alternatives to atomicity
    2.7.1 Sequential consistency
    2.7.2 Serializability
  2.8 Summary
  2.9 Bibliographic notes

3 Wait-freedom: A Progress Property for Shared Object Implementations
  3.1 Introduction
  3.2 Implementation
    3.2.1 High Level Object and Low Level Object
    3.2.2 Zooming into histories
  3.3 Progress properties
    3.3.1 Solo, partial and global termination
    3.3.2 Bounded termination
  3.4 Atomicity and wait-freedom
    3.4.1 Operation termination and atomicity
    3.4.2 A simple example
    3.4.3 A less simple example
    3.4.4 On the power of low level objects
    3.4.5 Non-determinism
  3.5 Summary

II Read-Write Memory

4 Safe, regular and atomic registers
  4.1 Introduction
  4.2 The many faces of registers
  4.3 Safe, regular and atomic registers
    4.3.1 Safe registers
    4.3.2 Regular registers
    4.3.3 Atomic registers
    4.3.4 Regularity and atomicity: a reading function
    4.3.5 From very weak to very strong registers
  4.4 Two simple bounded transformations
    4.4.1 Safe/regular registers: from single reader to multiple readers
    4.4.2 Binary multi-reader registers: from safe to regular
  4.5 From binary to b-valued registers
    4.5.1 From safe bits to safe b-valued registers
    4.5.2 From regular bits to regular b-valued registers
    4.5.3 From atomic bits to atomic b-valued registers
  4.6 Three (unbounded) atomic register implementations
    4.6.1 1W1R registers: From unbounded regular to atomic
    4.6.2 Atomic registers: from unbounded 1W1R to 1WMR
    4.6.3 Atomic registers: from unbounded 1WMR to MWMR
  4.7 Concluding remark
  4.8 Bibliographic notes

5 From safe bits to atomic bits: an optimal construction
  5.1 Introduction
  5.2 A Lower Bound Theorem
    5.2.1 Digests and Sequences of Writes
    5.2.2 The Impossibility Result and the Lower Bound
  5.3 From three safe bits to an atomic bit
    5.3.1 Base architecture of the construction
    5.3.2 Handshaking mechanism and the write operation
    5.3.3 An incremental construction of the read operation
    5.3.4 Proof of the construction
    5.3.5 Cost of the algorithms
  5.4 Bibliographic notes

6 Collect and Snapshot objects
  6.1 Store-collect object
    6.1.1 Definition
    6.1.2 A trivial implementation
    6.1.3 A store-collect object has no sequential specification
  6.2 Snapshot object
    6.2.1 Non-blocking snapshot
    6.2.2 Wait-free snapshot
    6.2.3 The snapshot object construction is bounded wait-free
    6.2.4 The snapshot object construction is atomic
    6.2.5 Bounded snapshot object

7 Immediate Snapshot and Iterated Immediate Snapshot
  7.1 Immediate snapshot object
    7.1.1 Immediate snapshot and participating set problem
    7.1.2 A one-shot immediate snapshot construction
    7.1.3 A participating set algorithm
  7.2 A connection between (one-shot) renaming and snapshot
    7.2.1 A weakened version of the immediate snapshot problem
    7.2.2 The adapted algorithm
  7.3 Iterated immediate snapshot

III General Memory

8 Consensus and universal construction
  8.1 What cannot be read-write implemented
    8.1.1 The case of one dequeuer
    8.1.2 Two or more dequeuers
  8.2 Universal objects and consensus
  8.3 A wait-free universal construction
    8.3.1 Deterministic objects
    8.3.2 Bounded wait-free universal construction
    8.3.3 Non-deterministic objects
  8.4 Bibliographic notes

9 Consensus number and the consensus hierarchy
  9.1 Consensus number
  9.2 Preliminary definitions
    9.2.1 Schedule, configuration and valence
    9.2.2 Bivalent initial configuration
  9.3 The weak wait-free power of atomic registers
    9.3.1 The consensus number of atomic registers is 1
    9.3.2 The wait-free limit of atomic registers
    9.3.3 Another limit of atomic registers
  9.4 Objects whose consensus number is 2
    9.4.1 Consensus from test&set objects
    9.4.2 Consensus from queue objects
    9.4.3 Consensus from swap objects
    9.4.4 Other objects for consensus in a system of two processes
    9.4.5 Power and limit of the previous objects
  9.5 Objects whose consensus number is +∞
    9.5.1 Consensus from compare&swap objects
    9.5.2 Consensus from mem-to-mem-swap objects
    9.5.3 Consensus from augmented queue objects
    9.5.4 Impossibility result
  9.6 Hierarchy of atomic objects
    9.6.1 From consensus numbers to a hierarchy
    9.6.2 Robustness of the hierarchy

IV Memory with Oracles

Chapter 1

Introduction

1.1 A broad picture: the concurrency revolution

The field of concurrent computing has gained huge importance since major chip manufacturers switched their focus from increasing the speed of individual processors to increasing the number of processors on a chip. The good old days when nothing needed to be done to boost the performance of programs, besides changing the underlying processors, are over. To exploit multi-core architectures, programs have to be devised in a parallel manner. A single-threaded application can for instance exploit at most 1/100 of the potential throughput of a 100-core chip, and such a chip might be available before this book is edited.

In short, chip manufacturers are calling for a new software revolution: the concurrency revolution. This might look surprising at first glance, for concurrency is almost as old as computer science. In fact, the revolution is about more than concurrency alone: it is about concurrency for everyone. In other words, concurrency is going out of the small box of specialist programmers and is conquering the masses.

The challenge underlying this revolution is to come up with abstractions that average programmers can easily use for general-purpose concurrent programming. Such abstractions need to be composed with others, possibly more sophisticated, that advanced programmers could use to fully leverage their expertise and the available underlying hardware. The aim of this book is to study how to build these abstractions, and we will focus on those that are considered the most difficult: synchronization elements.

1.2 The topic: robust synchronization objects

In concurrent computing, a problem is solved through several processes that execute a set of tasks. Except in embarrassingly parallel programs, i.e., problems that can easily and regularly be decomposed into independent parts, the tasks usually need to synchronize their activities by accessing shared elements. These elements typically serialize the threads and reduce parallelism. According to Amdahl’s law, the cost of accessing these elements significantly impacts the overall performance of concurrent computations. Devising, implementing and making good use of such synchronization elements usually leads to intricate schemes that are very fragile and sometimes error-prone.

Every multicore architecture provides such elements in hardware. Usually, these elements are very low-level and their good usage is not trivial. Also, the synchronization elements that are provided in hardware differ from architecture to architecture, making it hard to port concurrent programs. Sometimes, even if the elements look the same, their exact semantics are different, and some subtle details can have important consequences on the performance or the correctness of the concurrent program. Clearly, coming up with a high-level and uniform library of synchronization abstractions that could be used across multicore architectures is crucial to the success of the multicore revolution. Such a library can only be implemented in software, for it is simply not realistic to require multicore constructors to agree on the same high-level library to offer to their programmers.

This book shows how to design such a library in a principled manner. We assume a small set of low-level synchronization primitives provided in hardware and show how to build algorithms that could be used to implement higher-level synchronization elements, in turn to be used by different kinds of programmers, depending on their skills and the target concurrent system, and smoothly composed within the same application. We view synchronization elements, both those available in hardware and those constructed above them, as shared objects: instances of abstract data types, accessible through some interface exporting a set of operations. This interface is itself defined by a specification that captures the precise semantics of the operations and the way these have to be used.

This book studies algorithms that implement such shared objects in a robust manner. Roughly speaking, “robustness” means two things:

• No process p ever prevents any other process q from making progress when q executes an operation on a shared object X. This means that, provided it remains alive and kicking, q terminates its operation on X despite the speed or the failure of any other process p. Process p could be very fast and might be permanently accessing shared object X, or could have been swapped out by the operating system while accessing X. None of these situations should prevent q from executing its operation. This aspect of robustness is called wait-freedom. As we will explain later in this chapter, this property transforms the difficult problem of reasoning about a failure-prone concurrent system, where processes can be arbitrarily delayed and swapped out (or paged out), into the simpler problem of reasoning about a failure-free concurrent system where every process progresses at its own pace and runs to completion.

• Despite concurrency, the operations issued on each object appear as if they are executed sequentially. In fact, each operation op on an object X appears to take effect at some indivisible instant between the invocation and the reply times of op. This robustness property is called atomicity. In short, this property transforms the difficult problem of reasoning about a concurrent system into the simpler problem of reasoning about a sequential one where the processes access each object one after the other.

This book focuses mainly on wait-free implementations of atomic objects. Basically, given certain shared objects of base types, say provided in hardware, we study how and whether it is at all possible to wait-free implement (i.e., in software) an atomic object of a more powerful type. In fact, strictly speaking, when we talk about implementing an object, we actually mean implementing its type. As we shall see, ensuring either atomicity or wait-freedom alone is trivial. The challenge is to ensure both.

The material of the book is presented in an incremental manner. We first define atomicity and wait-freedom; then we show how to implement simple shared objects from even simpler ones, and progressively how to use the resulting objects to build even more powerful objects.


1.3 Content of the book

1.3.1 Shared objects as synchronization abstractions

As we pointed out, this manuscript describes how to build synchronization abstractions. The quest for such abstractions can be viewed as a continuation of one of the most important challenges in computing. A file, a stack, a record, a list, a queue and a set are well-known examples of abstractions that have proved to be valuable in traditional sequential and centralized computing. Their definitions and effective implementations have enabled programming to become a high-level activity and made it possible to reason about algorithms without specific mention of hardware primitives.

In modern computing, an abstraction is usually captured by an object representing a server program that offers a set of operations to its users. These operations and their specification define the behavior of the object, also called the type of the object. The way an abstraction (object) is implemented is usually hidden from its users, who can only rely on its operations and their specification to design and produce upper-layer software, i.e., software using that object. Such a modular approach is key to implementing provably correct software that can be reused by subsequent programmers.

The abstractions we will consider are shared objects, i.e., objects that can be accessed by concurrent processes. That is, the operations exported by the shared object can be invoked by concurrent processes. Each individual process accesses the shared object in a sequential manner. Roughly speaking, sequentiality means here that, after it has invoked an operation on an object, a process waits to receive a reply indicating that the operation has terminated, and only then is allowed to invoke another operation on the same or a different object. The fact that a process p is executing an operation on a shared object X does not however preclude another process q from invoking an operation on the same object X.

1.3.2 Atomicity

Atomicity is the first main property we will require from the shared objects we seek to study. It is also called linearizability, and it basically means that each object operation appears to execute at some indivisible point in time, also called the linearization point, between the invocation and reply time events of the operation. Atomicity provides the illusion that the operations issued by the processes on the shared objects are executed one after the other. To program with atomic objects, the developer simply needs the sequential specification of each object, also called its sequential type or simply its type, which specifies how the object behaves when accessed sequentially by the processes.

Most interesting synchronization problems are best described as atomic objects. Examples of popular synchronization problems are the reader-writer and the producer-consumer problems. In the reader-writer problem, the processes need to read or write a shared data structure such that the value read by a process at a given point in time t is the last value written before t. Solving this problem boils down to implementing an atomic object exporting read() and write() operations. Such an object type is usually called an atomic read-write variable or a register. It abstracts the very notions of shared file and disk storage.

In the producer-consumer problem, the processes are usually split into two camps: the producers, which create items, and the consumers, which use the items. It is typical to require that the first item produced is the first to be consumed. Solving the producer-consumer problem boils down to implementing an atomic object type, called a FIFO queue (or simply a queue), that exports two operations: enqueue() (invoked by a producer) and dequeue() (invoked by a consumer).
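For concreteness, these two types can be written down as interfaces. The Go sketch below is ours, not the book's (the book stays at the algorithmic level and defines types only through their specifications); the names and signatures are illustrative:

    package objects

    // Register is a read-write object: Read returns the last value written.
    type Register interface {
    	Read() int
    	Write(v int)
    }

    // FIFOQueue exports enqueue (producer side) and dequeue (consumer side).
    type FIFOQueue interface {
    	Enqueue(v int)
    	Dequeue() (v int, ok bool) // ok is false when the queue is empty
    }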


1.3.3 Wait-freedom

This is the second property we will typically require from the object implementations we will study. It can be viewed as a way to enforce a radical alternative to locking. Indeed, traditional synchronization algorithms rely on mutual exclusion (typically based on some locking primitives): critical shared objects (or critical sections of code within shared objects) are accessed by processes one at a time. No process can enter a critical section if some other process is in that critical section. We also say that a process has acquired a lock on that object (resp., critical section). This technique is safe in the sense that it ensures atomicity and protects the program from inconsistencies due to concurrent accesses to shared variables.

However, coarse-grained mutual exclusion does not scale, and fine-grained mutual exclusion can easily lead to violations of atomicity. Indeed, atomicity is automatically ensured only if all related variables are protected by the same critical section. This significantly limits the parallelism, and thus the performance, of the program, unless the program is devised with minimal interference among processes. Such a design is, however, hard to expect from common programmers and precludes most legacy programs.

Maybe more importantly, mutual exclusion hampers progress since a process delayed in a critical section prevents all other processes from entering that critical section. Delays could be significant, especially when caused by crashes, preemptions and memory paging. For instance, a process paged out might be delayed for millions of instructions, and this would mean delaying many other processes if these want to enter the critical section held by the delayed process. With modern architectures, we might be talking about one process delaying hundreds of processors, making them completely idle and useless.

Lock-free implementations of atomic objects provide an alternative to mutual exclusion-based implementations. In particular wait-freedom, a strong form of lock-freedom, precludes any form of blocking. In short, wait-freedom stipulates that, unless it stops executing (say it crashes), any process that invokes an object operation eventually obtains a reply. That is, the process calling the operation on the object (to be implemented) should obtain a response for the operation in a finite number of its own steps, independently of concurrent steps from other processes. The notion of step means here a local instruction of the process, say updating a local variable, or an operation invocation on a base object used in the implementation. Sometimes, we will assume that the object to be implemented should tolerate a certain number of base object failures. That is, we will seek to implement objects that are resilient in the sense that they eventually return from process invocations, even if the underlying base objects fail and do not return, or return useless replies.
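To make the contrast concrete, here is a minimal Go sketch of a shared counter in both styles; it is illustrative only, and the claim that the atomic version is wait-free assumes hardware fetch-and-add (on other architectures it may only be lock-free):

    package counters

    import (
    	"sync"
    	"sync/atomic"
    )

    type LockBasedCounter struct {
    	mu    sync.Mutex
    	value int64
    }

    func (c *LockBasedCounter) Increment() int64 {
    	c.mu.Lock() // mutual exclusion: a delayed lock holder blocks everyone
    	defer c.mu.Unlock()
    	c.value++
    	return c.value
    }

    type WaitFreeCounter struct {
    	value int64
    }

    func (c *WaitFreeCounter) Increment() int64 {
    	return atomic.AddInt64(&c.value, 1) // no lock is ever held
    }

If the holder of the mutex is preempted or paged out, every later Increment() blocks until it is rescheduled; the atomic version has no such vulnerable window.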

1.3.4 Object implementation

In short, this book studies how to wait-free implement high-level atomic objects out of more primitive base objects, the notions of high and primitive being of course relative, as we will see. The notion of implementation has to be considered here in the algorithmic sense (there will not be any C or Java code in this book).

An object to be implemented is typically called high-level, in comparison with the objects used in the implementation, considered to be at a lower level. It is common to talk about emulations of the high-level object using the low-level ones. Unless explicitly stated otherwise, we will by default mean wait-free implementation when we write implementation, and atomic object when we write object.

It is often assumed that the underlying system model provides some form of registers as base objects. These provide the abstraction of read-write storage elements. Message-passing systems can also, under certain conditions, emulate such registers. Sometimes the base registers that are supported are atomic, but sometimes not. As we will see in this book, there are algorithms that implement atomic registers out of non-atomic base registers that might be provided in hardware.


Some multiprocessor machines also provide objects that are more powerful than registers, like test&set objects or compare&swap objects. Intuitively, these are more powerful in the sense that the writer process does not systematically overwrite the state of the object, but specifies the conditions under which this can be done. Roughly speaking, this enables more powerful synchronization schemes than with a simple register object. We will capture the notion of “more powerful” more precisely later in the book.
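The following Go sketch illustrates this conditional-write style with a hypothetical bounded counter built on compare&swap; the retry loop makes it lock-free rather than wait-free:

    package casdemo

    import "sync/atomic"

    type BoundedCounter struct {
    	state int64
    	bound int64
    }

    // TryIncrement increments only while below the bound — a condition a
    // plain register write (which overwrites unconditionally) cannot express.
    func (c *BoundedCounter) TryIncrement() bool {
    	for {
    		current := atomic.LoadInt64(&c.state)
    		if current >= c.bound {
    			return false // the pre-condition no longer holds
    		}
    		if atomic.CompareAndSwapInt64(&c.state, current, current+1) {
    			return true // our conditional write took effect
    		}
    		// another process changed the state in between; re-read and retry
    	}
    }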

Not surprisingly, a lot of work has been devoted to figuring out whether certain objects can wait-free implement other objects. As we have seen, focusing on wait-free implementations clearly excludes mutual exclusion based approaches, with all their drawbacks. From the application perspective, there is a clear gain, because relying on wait-free implementations makes the application less vulnerable to failures and deadlocks. However, the desire for wait-freedom makes the design of atomic object implementations subtle and difficult. This is particularly so when we assume that processes have no a priori information about the interleaving of their steps: this is the model we will assume by default in this book, so as to seek general algorithms.

1.3.5 Reducibility

In its abstract form, the question we address in this book, namely implementing high-level objects using lower-level objects, can be stated as a general reducibility question in the parlance of the classical theory of computing. Given two object types X1 and X2, can we implement X2 using any number of instances of X1 (we simply say using X1)? In other words, is there an algorithm that implements X2 using X1? The specificity of concurrent computing here lies in the very fact that under the term “implementing” lie the notions of atomicity and wait-freedom. These notions encapsulate the smooth handling of concurrency and failures.

When the answer to the reducibility question is negative, and it will be for some values of X1 and X2, it is also interesting to ask what is needed (under some minimality metric) to add to the base objects (X1) in order to implement the desired high-level object (X2). For instance, if the base objects provided by a given multiprocessor machine are not enough to implement a particular object, knowing that extending the base objects with another specific object (or many such objects) is sufficient might give some useful information to the designers of the new version of the multiprocessor machine in question. We will see examples of these situations.

1.4 Content and organization

The book is organized in an incremental way, going from implementing simple objects out of even simpler ones to implementing more powerful objects. After precisely defining the notions of atomicity and wait-freedom, we go through the following steps.

1. We first study how to implement atomic register objects out of non-atomic base registers. Roughly speaking, assuming as base objects registers that provide weaker guarantees than atomicity, we show how to wait-free implement atomic registers from these weak registers. Furthermore, we also show how to implement registers that can contain an arbitrarily large range of values, and be read and written by any process in the system, from single-bit registers (i.e., registers that contain only 0 or 1) that can be accessed by only one writer process p and only one reader process q.

2. We then discuss how to use registers to implement seemingly more sophisticated objects than registers, like counters and snapshot objects. We contrast this with the inherent limitation of registers in implementing more powerful objects like queues. This limitation is highlighted through the seminal consensus impossibility result.

3. We then discuss the importance of consensus as an object type, by explaining its universality. In particular, we describe a simple algorithm that uses registers and consensus objects to implement any other object. Then, we turn to the question of how to implement a consensus object from other objects. In particular, we describe an algorithm to implement a consensus object in a system of two processes, using registers and either a test&set object or a queue object, as well as an algorithm that implements a consensus object using a compare&swap object in a system of arbitrary size. The difference between these implementations is highlighted to introduce the notion of consensus number.

4. We then study a complementary way of implementing consensus: using registers and some additional assumptions about the way processes access these registers. More precisely, we make use of an oracle that reveals information about the operational status of the processes accessing the shared registers. We discuss how even an oracle that is unreliable most of the time can help devise a consensus algorithm, and hence implement any other object. We also discuss the implementation of such an oracle assuming that the computing environment satisfies additional assumptions about the scheduling of the processes. This may be viewed as a slight weakening of the wait-freedom requirement, which demands progress no matter how processes interleave their steps.

5. We then consider the question of implementing objects out of base objects that can fail. This issue can be of practical relevance in a distributed multi-core architecture where it is reasonable to assume that certain base objects might fail. It also abstracts the problem of implementing a highly available storage abstraction in a storage area network where basic units (files or disks) can fail. Not surprisingly, the general way to achieve resilience is replication, but the underlying approach depends on the failure model. We distinguish two canonical failure models. First, we consider a failure model where a base object that fails keeps on returning a specific value ⊥ whenever it is invoked. This model is called the responsive failure model. Then we look at another failure model where a base object that fails stops replying. This model is called the non-responsive failure model. As we will see, algorithms that tolerate the first form of failures are usually sequential algorithms, whereas those that tolerate the second form are usually parallel ones.

6. Finally, we revisit some of the implementations given in the book by giving up the assumption that processes have unique identities. We study here anonymous implementations. We give anonymous implementations of a weak counter object and a snapshot object based on registers.

1.5 Bibliographical notes

The fundamental notion of an abstract object type has been developed in various textbooks on the theory or practice of programming. Early works on the genesis of abstract data types were described in [5, 14, 18, 19]. In the context of concurrent computing, some of the earliest work was reported in [10, 17]. More information on the history of concurrent programming can be found in the book [4].

The notion of register (as considered in this book) and its formalization are due to Lamport [13]. A more hardware-oriented presentation was given in [16]. The notion of atomicity was generalized to any object type by Herlihy and Wing [9] under the name linearizability. The concept of a snapshot object was introduced in [3]. A theory of wait-free atomic objects was developed in [11].


The mutual exclusion problem was introduced by Dijkstra [6]. The problem constitutes a basic chapter in nearly all textbooks devoted to operating systems. There is also an entire monograph solely devoted to the mutual exclusion problem [22]. Various synchronization algorithms are also detailed in [23].

The notion of wait-free computation originated in the work of Lamport [12], and was then explored further by Peterson [21]. It was later generalized and formalized by Herlihy [8].

The consensus problem was introduced in [20]. Its impossibility in asynchronous message-passing systems prone to process crash failures was proved by Fischer, Lynch and Paterson in [7]. Its impossibility in shared memory systems was proved in [15]. The universality of the consensus problem and the notion of consensus number were investigated in [8].


Part I

Correctness: safety and liveness


Chapter 2

Atomicity: A Correctness Property for Shared Objects

2.1 Introduction

Before diving into how to implement shared sequential objects, we first address in this chapter the following questions:

• What is a sequential object?

• What does it mean for a shared-object implementation to be correct? In particular, how to evaluate correctness even when one or more processes stop their execution in the middle of an operation?

To get a flavor of the questions we address, let us consider an unbounded FIFO (first in, first out) queue. This is an object of the type queue defined by the following two operations:

• Enq(v): Add the value v at the end of the queue.

• Deq(): Return the first value of the queue and remove it from the queue; if the queue is empty, return the default value ⊥.

Enq(a)   Enq(c)   Enq(b)   Deq(a)   Deq(c)

Figure 2.1: A sequential execution on a queue

Figure 2.1 describes a sequential execution of a system made up of a single process using the queue. The time-line, going from left to right, describes the progress of the process when it enqueues first the value a, then the value c, and finally the value b. According to the expected semantics of a queue, and as depicted by the figure, the first invocation of Deq() returns the value a, the second returns the value c, etc.
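As a point of reference, this sequential behavior can be rendered in code. A minimal Go sketch (our own encoding, with a boolean standing in for ⊥; it captures only the sequential specification and is not safe under concurrent access):

    package seqqueue

    type Queue struct {
    	items []int
    }

    // Enq adds v at the end of the queue.
    func (q *Queue) Enq(v int) {
    	q.items = append(q.items, v)
    }

    // Deq removes and returns the first value; ok is false (⊥) when empty.
    func (q *Queue) Deq() (v int, ok bool) {
    	if len(q.items) == 0 {
    		return 0, false
    	}
    	v = q.items[0]
    	q.items = q.items[1:]
    	return v, true
    }

Calling Enq(a), Enq(c), Enq(b) and then Deq() twice on this object returns a and then c, matching the figure.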

Figure 2.2 depicts an execution of a system made up of two processes sharing the same queue. Now, process p1 enqueues a and then b, whereas process p2 concurrently enqueues c. In the figure, the execution of Enq(c) by p2 overlaps both Enq(a) and Enq(b) by p1. Such an execution raises the following questions:

• What values are dequeued by p1 and p2?


p1:  Enq(a)   Enq(b)   Deq(a|b|c)?
p2:       Enq(c)        Deq(a|b|c)?

Figure 2.2: A concurrent execution on a queue

• What values can be returned by a process if the other process has failed while executing an operation?

• What happens if p1 and p2 share several queues instead of a single one? Etc.

Addressing these and related questions first requires defining our model of computation more precisely.

2.2 Model

2.2.1 Processes and operations

The system we consider consists of a finite set of n processes, denoted p1, . . . , pn. The processes execute some common distributed computation and, while doing so, cooperate by accessing shared objects.

Processes synchronize their activities by executing operations exported by shared objects. An execution by a process of an operation on an object X is denoted X.op(arg)(res), where arg and res denote respectively the input and output parameters of the invocation. The output corresponds to the response to the invocation. Sometimes we simply write X.op when the input and output parameters are not important. The execution of an operation op() on an object X by a process pi is modeled by two events, namely, the event denoted inv[X.op(arg) by pi] that occurs when pi invokes the operation (invocation event), and the event denoted resp[X.op(res) by pi] that occurs when the operation terminates (response event). (When there is no ambiguity, we talk about operations where we should be talking about operation executions.) We say that these events are generated by the process pi and associated with the object X. Given an operation X.op(arg)(res), the event resp[X.op(res) by pi] is called the response event matching the invocation event inv[X.op(arg) by pi].

An execution of a distributed system induces a sequence of interactions between the processes of the system and the shared objects. Every such interaction corresponds to a computation step and is represented by an event: the visible part of a step, i.e., the invocation or the reply of an operation. A sequence of events is called a history, and this is precisely how we model executions. We will detail this later in this chapter.

As we pointed out in the introduction of the book, we generally assume that processes are sequential: a process executes (at most) one operation of an object at a time. That is, the algorithm of a sequential process stipulates that, after an operation is invoked on an object and until a matching response is received, the process does not invoke any other operation. The fact that processes are individually sequential does not preclude them from concurrently invoking operations on the same shared object. Sometimes, we will focus on sequential executions (modeled by sequential histories), which precisely preclude such concurrency; that is, only one process at a time invokes an operation on an object in a sequential execution.


2.2.2 Objects

An object has a name and a type. A type is defined by (1) the set of possible values for (the states of) objects of that type, including the initial state; (2) a finite set of operations through which the objects of that type can be manipulated; and (3) a specification describing, for each operation, the condition under which that operation can be invoked, and the effect produced after it has been executed. Figure 2.3 presents a structural view of a set of n processes sharing m objects.

Figure 2.3: Structural view of a system (n processes p1, p2, . . . , pn invoking operations on m shared objects O1, O2, . . . , Om)

Each operation is associated with two predicates: a pre-assertion and a post-assertion. Assuming the pre-assertion is satisfied before executing the operation, the post-assertion describes the new value of the object and the result of the operation returned to the calling process. We say that an object operation is total if its associated pre-assertion is always satisfied, i.e., the operation is defined for every state of the object. Otherwise, the operation is partial. An object type is total if it has only total operations.

We also say that an object operation is deterministic if, given any state of the object that satisfies the pre-assertion and any input parameters, the output parameters and the final state of the object are uniquely defined. An object type is deterministic if it has only deterministic operations; otherwise it is non-deterministic.

Sequential specification  Every object type determines a sequential specification: the set of sequences of invocations immediately followed by a matching response (i.e., sequences of operations) that, starting from the initial state of the type, respect the type specification. These sequences are called sequential histories of the type.

Example 1: a read/write object (register)  To illustrate the notion of sequential specification, we consider here three example object types. The first type (called the type register) is a simple read/write abstraction that models objects such as a shared memory word, a shared file or a shared disk.

It has two operations:

• The operation read() has no input parameter. It returns a value of the object.

• The operation write(v) has an input parameter, v, the new value of the object. The result of that operation is a value ok, indicating to the calling process that the operation has terminated.


The sequential specification of the object is defined by all the sequences of read and write operations in which each read operation returns the value of the last preceding write operation (i.e., the last value written). Clearly, the read and write operations are always defined: they are total operations.

The problem of implementing a concurrent read/write object is a classical synchronization problem known under the name reader/writer problem.
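This specification is simple enough to state as an executable check. A sketch in Go, using our own encoding of read and write operations:

    package regspec

    type RWOp struct {
    	IsWrite bool
    	Value   int // the value written, or the value a read returned
    }

    // Legal reports whether a sequence of operations belongs to the
    // register's sequential specification: every read must return the value
    // of the last preceding write, or the initial value if there is none.
    func Legal(seq []RWOp, initial int) bool {
    	current := initial
    	for _, op := range seq {
    		if op.IsWrite {
    			current = op.Value // a write replaces the register's value
    		} else if op.Value != current {
    			return false // a read returned a stale or invented value
    		}
    	}
    	return true
    }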

Example 2: a FIFO queue with total operations  The second example is the unbounded (FIFO) queue described in Section 2.1. Such an object has the following sequential specification: every dequeue returns the first element enqueued and not yet dequeued. If there is no such element (i.e., the queue is empty), a specific default value ⊥ is returned. This definition never prevents an enqueue or a dequeue operation from being executed: both enqueue and dequeue operations are total.

Example 3: a FIFO queue with a partial operation  Let us now consider the previous queue definition modified as follows: a dequeue operation can be executed only when the queue is not empty. The sequential specification of this object is then a restriction of the previous specification: all situations where a dequeue operation returns ⊥ have to be precluded. The enqueue operation can always be executed, so it remains a total operation. On the other hand, the pre-assertion of the dequeue operation states that it can only be executed when the queue is not empty; consequently, that operation is a partial operation.

The two FIFO queue examples (2 and 3) are two variants of the classical producer/consumer synchronization problem.

Example 4: an object with no sequential specification  Not all object types have a sequential specification. To illustrate this, let us consider a rendezvous object that can be accessed by two processes p1 and p2. Such an object provides the processes with a single operation meeting() with the following semantics: after it has been invoked by a process, the operation terminates only when the other process has also invoked the operation. In other words, the key property of this object is that no process can terminate an operation without a concurrent invocation. It is easy to see that such a rendezvous object has no sequential specification: the behavior of the object cannot be described simply by stating what happens when the operation invocations by p1 and p2 are totally ordered. (A rendezvous object is a typical example of an object that has no sequential specification. In this book, we are mainly interested in objects that have a sequential specification.)
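For intuition, here is one plausible Go realization of such a rendezvous object using channels; the two methods are the two sides of meeting(), one per process, and the encoding is ours. The blocking behavior is precisely why no sequential specification exists: in a totally ordered execution, the first meeting() alone could never terminate.

    package rendezvous

    type Rendezvous struct {
    	here  chan struct{}
    	there chan struct{}
    }

    func New() *Rendezvous {
    	return &Rendezvous{here: make(chan struct{}), there: make(chan struct{})}
    }

    // MeetingA is invoked by p1; it returns only once p2 has also arrived.
    func (r *Rendezvous) MeetingA() {
    	r.here <- struct{}{} // signal arrival (blocks until p2 receives) ...
    	<-r.there            // ... then wait for p2's own signal
    }

    // MeetingB is invoked by p2; it returns only once p1 has also arrived.
    func (r *Rendezvous) MeetingB() {
    	<-r.here
    	r.there <- struct{}{}
    }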

2.2.3 Histories

An execution of a set of processes accessing a set of shared objects is captured through the notion of a history.

Representing an execution as a history of events  Processes interact with shared objects via invocation and response events. We assume that simultaneous (invocation or response) events do not affect each other. This is generally the case, in particular for events generated by sequential processes accessing objects with a sequential specification. Therefore, without loss of generality, we can arbitrarily order simultaneous events.

This makes it possible to model the interaction between processes and objects as an ordered sequence of events H, called a history (sometimes also called a trace). The total order relation on the set of events induced by H is denoted <H. A history abstracts the real-time order in which the events actually occur.


Recall that an event includes the name of an object, the name of a process, the name of an operation and input or output parameters. We assume that each event in H is uniquely identified.¹ The objects and processes associated with events of H are said to be involved in H.

A local history of pi, denoted H|pi, is the projection of H onto process pi: the subsequence of H consisting of the events generated by pi.

Equivalent histories  Two histories H and H′ are said to be equivalent if they have the same local histories, i.e., for each pi, H|pi = H′|pi. That is, equivalent histories cannot be distinguished by any process.
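These definitions translate directly into code. A minimal Go sketch, with our own encoding of events mirroring the notation inv[X.op(arg) by pi]:

    package histories

    type Kind int

    const (
    	Inv Kind = iota
    	Resp
    )

    type Event struct {
    	Kind    Kind
    	Process string // e.g., "p1"
    	Object  string // e.g., "X"
    	Op      string // e.g., "Enq(a)"
    }

    // History is a totally ordered sequence of events; slice order is <H.
    type History []Event

    // Local returns H|pi, the subsequence of events generated by pi.
    func (h History) Local(pi string) History {
    	var out History
    	for _, e := range h {
    		if e.Process == pi {
    			out = append(out, e)
    		}
    	}
    	return out
    }

    // Equivalent reports whether h1 and h2 have the same local history for
    // every process in procs, i.e., no process can distinguish them.
    func Equivalent(h1, h2 History, procs []string) bool {
    	for _, p := range procs {
    		l1, l2 := h1.Local(p), h2.Local(p)
    		if len(l1) != len(l2) {
    			return false
    		}
    		for i := range l1 {
    			if l1[i] != l2[i] {
    				return false
    			}
    		}
    	}
    	return true
    }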

Well-formed histories  As we are interested only in histories generated by sequential processes, we restrict our attention to histories H such that, for each process pi, H|pi (the local history generated by pi) is sequential: it starts with an invocation, followed by a response (the matching response, associated with the same object), followed by another invocation, etc. We say in this case that H is well-formed.

Complete vs incomplete histories  An operation is said to be complete in a history if the history includes both the event corresponding to the invocation of the operation and its response. Otherwise we say that the operation is pending. A history without pending operations is said to be complete. A history with pending operations is said to be incomplete. Note that, being sequential, a process can have at most one pending operation in a given history.

Partial order on operations  A history H induces an irreflexive partial order on its operations as follows. Let op = X.op1() by pi and op′ = Y.op2() by pj be two operations. Informally, operation op precedes operation op′ if op terminates before op′ starts, where “terminates” and “starts” refer to the time-line abstracted by the <H total order relation. More formally:

(op →H op′)  def=  (resp[op] <H inv[op′]).

Two operations op and op′ are said to overlap (we also say they are concurrent) in a history H if neither resp[op] <H inv[op′] nor resp[op′] <H inv[op]. Notice that two overlapping operations are such that ¬(op →H op′) and ¬(op′ →H op). As sequential histories have no overlapping operations, it follows that →H is a total order when H is a sequential history.
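Both relations are straightforward to express over event positions in <H. A small Go sketch (the encoding is ours):

    package order

    type Op struct {
    	Inv  int // position of inv[op] in the total order <H
    	Resp int // position of resp[op] in the total order <H
    }

    // Precedes reports op ->H opPrime: resp[op] <H inv[opPrime].
    func Precedes(op, opPrime Op) bool {
    	return op.Resp < opPrime.Inv
    }

    // Overlap reports that the two operations are concurrent:
    // neither one precedes the other.
    func Overlap(op, opPrime Op) bool {
    	return !Precedes(op, opPrime) && !Precedes(opPrime, op)
    }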

Illustrating histories  Figure 2.4 depicts a well-formed history H. The history comprises ten events e1 . . . e10 (e4, e6, e7 and e9 are explicitly detailed). As all the events in H are on the same object, its name is omitted. The enqueue operation issued by p2 overlaps both enqueue operations issued by p1: Enq(c) by p2 is concurrent with both Enq(a) and Enq(b) issued by p1. Moreover, the history H has no pending operations and is consequently complete.

To illustrate the notions of incomplete and complete histories, let us again consider Figure 2.4. The sequence e1 . . . e9 is an incomplete history in which the dequeue operation issued by p1 is pending. The sequence e1 . . . e6 e7 e8 e10 is another incomplete history, in which the dequeue operation issued by p2 is pending. Finally, the history e1 . . . e8 has two pending operations. We are now ready to define what we mean by a sequential history.

¹For example, we can choose the identifier of an (invocation or response) event x of a process pi in H as (pi, k), where k is the number of events preceding x in H|pi.


p1:  Enq(a)   Enq(b)   Deq(a|b|c)?
p2:       Enq(c)        Deq(a|b|c)?

History H: e1 e2 e3 e4 e5 e6 e7 e8 e9 e10, with, for example:
e4 = inv[Enq(b) by p1],  e6 = resp[Enq(ok) by p1],
e7 = inv[Deq() by p2],  e9 = resp[Deq(?) by p2]

Figure 2.4: Example of a history

2.2.4 Sequential history

Definition  A history is sequential if its first event is an invocation, and then (1) each invocation event, except possibly the last, is immediately followed by the matching response event, and (2) each response event, except possibly the last, is immediately followed by an invocation event. The phrase “except possibly the last” associated with an invocation event is due to the fact that a history can be incomplete. A complete sequential history always ends with a response event. A history that is not sequential is concurrent.

A sequential history models a sequential multiprocess computation (there are no overlapping operations in such a computation), while a concurrent history models a concurrent multiprocess computation (there are at least two overlapping operations in such a computation). Given that a sequential history S has no overlapping operations, the associated partial order →S defined on its operations is actually a total order.

Strictly speaking, the sequential specification of an object is a set of sequential histories involving solely that object. Basically, the sequential specification represents all possible sequential accesses to the object.

Example  Considering Figure 2.4, H is a complete concurrent history. On the other hand, the complete history

H1 = e1 e3 e4 e6 e2 e5 e7 e9 e8 e10

is sequential: it has no overlapping operations. We can thus highlight its sequential nature by separating its operations using square brackets as follows:

H1 = [e1 e3] [e4 e6] [e2 e5] [e7 e9] [e8 e10].

The following histories H2 and H3

H2 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] [e7 e9],

H3 = [e1 e3] [e4 e6] [e8 e10] [e2 e5] [e7 e9]

are also sequential. Let us also notice that the histories H, H1, H2, H3 are equivalent. Let H4 be the history defined as follows:

H4 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] [e7.

H4 is an incomplete sequential history. All these histories have the same local history for process p1: H|p1 = H1|p1 = H2|p1 = H3|p1 = H4|p1 = [e1 e3] [e4 e6] [e8 e10]; and, as far as p2 is concerned, H4|p2 is a prefix of H|p2 = H1|p2 = H2|p2 = H3|p2 = [e2 e5] [e7 e9].

So far, we have defined the notion of a history as an abstract way to depict the interaction between a set of processes and a set of shared objects. In short, a history is a total order on the set of invocation and response events generated by the processes on the objects. We are now ready to define what we mean by a correct shared-object implementation, based on the notion of an atomic (or linearizable) history.

2.3 Atomicity

This section introduces the correctness condition called atomicity (or linearizability). The aim of atomicity is to transform the difficult problem of reasoning about a concurrent execution into the simpler problem of reasoning about a sequential one.

Intuitively, atomicity states that a history is correct if its invocation and response events could have been obtained, in the same order, by a single sequential process. In an atomic (also called linearizable) history, each operation has to appear as if it has been executed alone and instantaneously at some point between its invocation event and its response event. This section defines the atomicity concept formally and presents its main properties.

2.3.1 Legal history

As we pointed out earlier, shared objects that are usually considered in programming typically have a sequential specification defining their semantics. Not surprisingly, a definition of what is a “correct” history has to refer in one way or another to sequential specifications. The notion of legal history captures this idea.

Given a sequential history S, let S|X (S at X) denote the subsequence of S made up of all the events involving object X. We say that a sequential history S is legal if, for each object X, the sequence S|X belongs to the sequential specification of X. In a sense, a history is legal if it could have been generated by processes sequentially accessing objects.
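As an illustration, here is a small sketch in the same hypothetical Java modeling: it replays a complete sequential history involving a single queue against the queue's sequential specification (an ArrayDeque plays the specification here); the Op record and the method names are ours, for illustration only.

import java.util.ArrayDeque;
import java.util.List;

// Sketch: decide whether a complete sequential history of one FIFO queue is
// legal, by replaying it against the sequential specification. Each Op is a
// completed operation at the granularity used in this chapter, e.g. Enq(a)
// or Deq(a).
final class LegalityCheck {
    record Op(String kind, String value) {}   // kind is "Enq" or "Deq"

    static boolean isLegal(List<Op> sequentialHistory) {
        ArrayDeque<String> spec = new ArrayDeque<>();
        for (Op op : sequentialHistory) {
            if (op.kind().equals("Enq")) {
                spec.addLast(op.value());
            } else {
                // "Deq": the response recorded in the history must be
                // exactly what the sequential specification returns here
                if (!op.value().equals(spec.pollFirst())) return false;
            }
        }
        return true;
    }
}

For instance, isLegal(List.of(new Op("Enq","a"), new Op("Enq","b"), new Op("Deq","a"))) returns true, while swapping the order of the two enqueues would make the dequeue of a illegal.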

2.3.2 The case of complete histories

We first define in this section atomicity for complete histories H, i.e., histories without pending operations: each invocation event of H has a matching response event in H. The section that follows will extend this definition to incomplete histories.

Definition A complete history H is atomic (or linearizable) if there is a “witness” history S such that:

1. H and S are equivalent,

2. S is sequential and legal, and



3. →H ⊆ →S.

The definition above states that for a history H to be linearizable, there must exist a permutation S of H (witness history) which satisfies the following requirements. First, S has to be indistinguishable from H to any process [item 1]. Second, S has to be sequential (interleave the process histories at the granularity of complete operations) and legal (respect the sequential specification of each object) [item 2]. Notice that, as S is sequential, →S is a total order. Finally, S has also to respect the real-time occurrence order of the operations as defined by →H [item 3]. S represents a history that could have been obtained by executing all the operations, one after the other, while respecting the occurrence order of non-overlapping operations. Such a sequential history S is called a linearization of H.

When proving that an algorithm implements an atomic object, we need to prove that all histories generated by the algorithm are linearizable, i.e., to identify, for each of them, a linearization of its operations that respects the “real-time” occurrence order of the operations and that is consistent with the sequential specification of the object.

It is important to notice that the notion of atomicity inherently includes a form of nondeterminism: a history H may allow for several linearizations.

Linearization: an example Let us consider the history H described in Figure 2.4 where the dequeue operation invoked by p1 returns the value b while the dequeue operation invoked by p2 returns the value a. This means that we have e9 = resp[Deq(a) by p2] and e10 = resp[Deq(b) by p1]. To show that this history is linearizable, we have to exhibit a linearization satisfying the three requirements of atomicity. The reader can check that history H1 = [e1 e3] [e4 e6] [e2 e5] [e7 e9] [e8 e10] defined in Section 2.2.4 is such a witness. At the granularity level defined by the operations, witness history H1 can be represented as follows:

[Enq(a) by p1][Enq(b) by p1][Enq(c) by p2][Deq(a) by p2][Deq(b) by p1].

This formulation highlights the intuition that underlies the definition of the atomicity concept.
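The definition also suggests a (very inefficient, but conceptually simple) way to test linearizability mechanically: enumerate the total orders that respect →H and check each for legality. The following brute-force Java sketch does this for a single queue, again under our own hypothetical modeling, where each operation records the positions of its invocation and response events in H:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Brute-force linearizability check for one queue (a sketch, exponential in
// the number of operations). An Op carries the indices of its invocation and
// response events in H, so that o1 precedes o2 in real time iff
// o1.resp < o2.inv. We search for a legal sequential order of all operations
// that respects this real-time precedence.
final class LinearizabilityCheck {
    record Op(String kind, String value, int inv, int resp) {}

    static boolean isLinearizable(List<Op> history) {
        return search(new ArrayList<>(history), new ArrayDeque<>());
    }

    // pending: operations not yet placed; spec: queue state reached so far
    private static boolean search(List<Op> pending, ArrayDeque<String> spec) {
        if (pending.isEmpty()) return true;          // a full linearization was found
        for (Op candidate : pending) {
            // candidate must be minimal w.r.t. real time among pending ops
            boolean minimal = true;
            for (Op o : pending)
                if (o != candidate && o.resp() < candidate.inv()) { minimal = false; break; }
            if (!minimal) continue;
            ArrayDeque<String> copy = new ArrayDeque<>(spec);
            if (!apply(copy, candidate)) continue;   // illegal at this point, backtrack
            List<Op> rest = new ArrayList<>(pending);
            rest.remove(candidate);
            if (search(rest, copy)) return true;
        }
        return false;
    }

    private static boolean apply(ArrayDeque<String> spec, Op op) {
        if (op.kind().equals("Enq")) { spec.addLast(op.value()); return true; }
        return op.value().equals(spec.pollFirst()); // "Deq"
    }
}

Running this sketch on the ten-event history H just discussed (numbering events e1 . . . e10 by their positions in H) finds a witness such as H1.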

Linearization points The very existence of a linearization of an atomic history H means that each operation of H could have been executed at an indivisible instant between its invocation and response time events (while providing the same result as H). It is thus possible to associate a linearization point with each operation of an atomic history. This is a point of the time-line at which the corresponding operation could have been “instantaneously” executed according to its legal linearization.

To respect the real-time occurrence order, the linearization point associated with an operation has always to appear within the interval defined by the invocation event and the response event associated with that operation.

Example Figure 2.5 depicts the linearization point of each operation. A triangle is associated with each operation, such that the vertex at the bottom of a triangle (bold dot) represents the associated linearization point. A triangle shows how atomicity allows shrinking an operation (the execution of which takes some duration) into a single point of the time-line.

In that sense, atomicity reduces the difficult problem of reasoning about a concurrent system to the simpler problem of reasoning about a sequential system where the operations issued by the processes are instantaneously executed.

As a second example, let us consider the complete history depicted in Figure 2.5 where the response events e9 and e10 are such that e9 = resp[Deq(b) by p2] and e10 = resp[Deq(a) by p1].

Figure 2.5: Linearization points (history H with events e1 . . . e10; the bold dot of each triangle marks the corresponding operation's linearization point)

It is easy to see that this history is linearizable: the sequential history H2 described in Section 2.2.4 is one possible linearization. Similarly, the history where e9 = resp[Deq(c) by p2] and e10 = resp[Deq(a) by p1] is also linearizable. It has the following sequential witness history:

[Enq(c) by p2][Enq(a) by p1][Enq(b) by p1][Deq(c) by p2][Deq(a) by p1].

On the other hand, the history in which the two dequeue operations would return the same value is not linearizable: it does not have any witness history that respects the sequential specification of the queue.

2.3.3 The case of incomplete histories

We show here how to extend the definition of atomicity to incomplete (also called partial) histories. As we explained, these are histories with at least one process whose last operation is pending: the invocation event of this operation appears in the history while the corresponding response event does not. The history H4 described in Section 2.2.4 is such a partial history. Extending atomicity to partial histories is important as it allows us to cope with process crashes.

Definition A partial history H is linearizable if H can be completed, i.e., modified in such a way that every invocation of a pending operation is either removed or completed with a response event, so that the resulting (complete) history H′ is linearizable.

Basically, we reduce the problem of determining whether an incomplete history H is linearizable to the problem of determining whether a complete history H′, extracted from H, is linearizable. H′ is obtained by adding response events to certain pending operations of H, as if these operations had indeed been completed, but also by removing invocation events from some of the pending operations of H. We require however that all complete operations of H are preserved in H′. It is important to notice that, given a history H, we can extract several histories H′ that satisfy the required conditions.

Example Consider Figure 2.6 where we depict two processes accessing a shared register. Process p1 first writes the value 0. The same process later issues a write for the value 1, but p1 crashes during this second write (this is indicated by a cross on its time-line). Process p2 executes two consecutive read operations. The first read operation lies between the two write operations of p1 and returns the value 0; a different value would clearly violate atomicity. The situation is less obvious with the second read, and it is not entirely clear what value v has to be returned by the second read operation in order for the history to be linearizable.

Figure 2.6: Two ways of completing a history (p1 performs Write(0) and then Write(1), crashing during the latter; p2 performs Read(0) and then Read(v); the dotted completions lead to the witness histories H0 and H1)

As we now explain, both values 0 and 1 can be returned by that read operation while preserving atomicity. The second write operation is pending in the incomplete history H modeling this execution. This history H is made up of 7 events (the name of the object and process names are omitted as there is no ambiguity), namely:

inv[write(0)] resp[write(0)] inv[read(0)] resp[read(0)] inv[read(v)] inv[write(1)] resp[read(v)].

We explain now why both 0 and 1 can be returned by the second read:

• Let us first assume that the returned value v is 0. We can associate with history H a legal sequential witness history H0 which includes only complete operations and respects the partial order defined by H on these operations (see Figure 2.6). To obtain H0, we construct history H′ by removing from H the event inv[write(1)]: we obtain a complete history, i.e., one without pending operations.

History H with v = 0 is consequently linearizable. The associated witness history H0 models the situation where p1 is considered as having crashed before invoking the second write operation: everything appears as if this write had never been issued.

• Let us now assume that the returned value v is 1. Similarly to the previous case, we can associate with history H a legal sequential witness history H1 that respects the partial order on the operations. We actually derive H1 by first constructing H′, which we obtain by adding to H the response event resp[write(1)]. (In Figure 2.6, the part added to H in order to obtain H′, from which H1 is constructed, is indicated by dotted lines.)

The history where v = 1 is consequently linearizable. The associated witness history H1 represents the situation where the second write is taken into account despite the crash of the process that issued that write operation.



2.4 Safety and Liveness

It is convenient to reason about the correctness of a concurrent object implementation by splitting the correctness properties into safety and liveness. Intuitively, safety properties ensure that nothing “bad” is ever going to happen, and liveness properties guarantee that something “good” eventually happens.

Formally, we can represent an execution of a concurrent object implementation as a sequence of system states s0 s1 . . ., where each state after s0 results from applying one atomic action (e.g., an atomic access to a base object) to the preceding state. A property is then a set of (finite or infinite) executions. Now a property P is a safety property if:

• P is prefix-closed: if σ ∈ P, then for every prefix σ′ of σ, σ′ ∈ P.

• P is limit-closed: for every infinite sequence σ0, σ1, . . . of executions, where each σi is a prefix of σi+1 and each σi ∈ P, the limit σ = lim i→∞ σi is in P.

To ensure that a safety property P holds for a given implementation, it is thus enough to show that every finite execution is in P: an execution is in P if and only if each of its prefixes is in P. It can be shown that atomicity is a safety property and, thus, for the rest of this book, we can mainly focus on finite executions when reasoning about it.

Now P is a liveness property if any finite σ has an extension in P. To show that a liveness property P holds, we should thus show that all infinite executions are in P.

Every property can be represented as an intersection of a safety property and a liveness property. In this book, we focus on atomicity (linearizability) and wait-freedom, two fundamental properties of concurrent object implementations, where atomicity is a safety property and wait-freedom is a liveness property.

2.5 Locality

This section presents an inherent property of atomicity that makes it particularly attractive. (Another important property of atomicity, namely non-blockingness, is discussed in Section 2.6.)

2.5.1 Local properties

Let P be any property defined on a set of objects. The property P is said to be local if the set of objects as a whole satisfies P whenever each object taken alone satisfies P.

Locality is an important concept that promotes modularity. Consider some local property P. To prove that an entire set of objects satisfies P, we only have to ensure that each object, independently from the others, satisfies P. As a consequence, property P can be implemented on a per-object basis. At one extreme, it is even possible to design an implementation where each object has its own algorithm implementing P. At the other extreme, all the objects (whatever their types) might use the same algorithm to implement P (each object using its own instance of the algorithm).

2.5.2 Atomicity is a local property

Intuitively, the fact that atomicity is local comes from the facts that (1) it considers that each operation is on a single object, and (2) it involves the real-time occurrence order on non-concurrent operations, whatever the objects and the processes concerned by these operations. We will rely on these two aspects in the proof of the following theorem.



Theorem 1 A history H is atomic (linearizable) if and only if, for each object X involved in H, H|X is atomic (linearizable).

Proof The “⇒” direction (only if) is an immediate consequence of the definition of atomicity: if H is linearizable then, for each object X involved in H, H|X is linearizable. So, the rest of the proof is restricted to the “⇐” direction. We also restrict the rest of the proof to the case where H is complete, i.e., H has no pending operation. This is without loss of generality, given that the definition of atomicity for an incomplete history is derived from the definition of atomicity for a complete history.

Given an object X, let SX be a linearization of H|X. It follows from the definition of atomicity that SX defines a total order on the operations involving X. Let →X denote this total order. We construct an order relation → defined on the whole set of operations in H as the union (⋃X →X) ∪ →H, i.e.:

1. For each object X: →X ⊆ →,

2. →H ⊆ →.

Basically, “→” totally orders all operations on the same object X, according to →X (item 1), while preserving →H, i.e., the real-time occurrence order on the operations (item 2).

Claim. → is acyclic.

The claim implies that the transitive closure of → defines a partial order on the set of all the operations of H. Since any partial order can be extended to a total order, we can construct a sequential history S including all events of H and respecting →. By construction, we have → ⊆ →S, where →S is the total order on the operations defined from S. We have the three following conditions: (1) H and S are equivalent; (2) S is sequential (by construction) and legal (due to item 1 above); and (3) →H ⊆ →S (due to item 2 above and the fact that → ⊆ →S). It follows that H is linearizable.

Proof of the claim. We show (by contradiction) that → is acyclic. Assume first that → induces a cycle involving the operations on a single object X. As →X is a total order, in particular transitive, there must be two operations opi and opj on X such that opi →X opj and opj →H opi. But opi →X opj implies inv[opi] <H resp[opj], because H|X is linearizable. Given that opj →H opi implies resp[opj] <H inv[opi], we obtain a contradiction, as <H is a total order on the whole set of events.

It follows that any cycle must involve at least two objects. To obtain a contradiction we show that, in that case, a cycle in → implies a cycle in →H (which is acyclic). Let us examine the way the cycle could be obtained. If two consecutive edges of the cycle are due to just some →X or just →H, then the cycle can be shortened, as any of these relations is transitive. Moreover, opi →X opj →Y opk is not possible for X ≠ Y, as each operation is on only one object (opi →X opj →Y opk would imply that opj is on both X and Y). So let us consider any sequence of edges of the cycle such that: op1 →H op2 →X op3 →H op4. We have:

- op1 →H op2 ⇒ resp[op1] <H inv[op2] (definition of →H),
- op2 →X op3 ⇒ inv[op2] <H resp[op3] (as H|X is linearizable),
- op3 →H op4 ⇒ resp[op3] <H inv[op4] (definition of →H).

Combining these statements, we obtain resp[op1] <H inv[op4], from which we can conclude that op1 →H op4. It follows that any cycle in → can be reduced to a cycle in →H: a contradiction, as →H is an irreflexive partial order. End of the proof of the claim. □ (Theorem 1)

Considering an execution of a set of processes that concurrently access a set of objects, atomicity allows reasoning as if the operations issued by the processes on the objects were executed one after the other. The previous theorem is fundamental: it states that when one has to reason about sequential processes that access concurrent atomic objects, one can reason on a per-object basis, without losing the atomicity property of the whole computation.

2.6 Atomicity is nonblocking

Atomicity is a nonblocking property: a pending invocation of a total operation is never required to wait for another operation to complete.

Theorem 2 Let H be a finite linearizable history, and let inv[op] be a pending invocation of a total operation in H. Then there exists a response r = resp[op] such that H · r is linearizable.

Proof Let H be a finite linearizable history and L be a linearization of H. Let H̄ be a completion of H such that L is equivalent to H̄. Recall that L is legal and respects the real-time order of H.

Let i = inv[op] be a pending invocation in H, where op is total. Now two cases are possible:

• L contains op. Let r be the matching response of op in L; then L is also a linearization of H · r. Indeed, consider the history H′ obtained from H̄ by reordering the responses that were appended to H so that r comes first: H′ is a completion of H · r that is equivalent to L.

• L does not contain op. Consider L′ = L · i · r, where r is a legal response matching the invocation i applied at the end of L. Since op is total, such a response exists. L′ is a linearization of H · r. Indeed, the history H′ obtained from H̄ by appending i and r immediately after its last event is a completion of H · r that is equivalent to L′.

In both cases, we constructed a linearizable history H · r in which inv[op] is complete. □ (Theorem 2)

2.7 Alternatives to atomicity

This section discusses alternatives to atomicity, namely, sequential consistency and serializability.

2.7.1 Sequential consistency

Overview Atomicity stipulates that the witness sequential history S for a given history H should respect the partial order relation →H on the operations in H (also called the real-time order). Any two operations op and op′ such that op →H op′ should appear in that order in the witness history S, irrespective of the processes invoking them and the objects on which they are performed.

A relaxation of atomicity, called sequential consistency, only requires that the real-time order is preserved if the operations are invoked by the same process, i.e., S is only supposed to respect the process-order.



Definition The definition of the sequential consistency correctness condition reuses the notions of history, sequential history, and complete history of Section 2.2. To simplify the presentation and without loss of generality, we only consider complete histories (with no pending operations).

A history H is sequentially consistent if there is a “witness” history S such that:

1. H and S are equivalent,

2. S is sequential and legal. (Notice that, as S is equivalent to H, it necessarily respects the process-order of H.)

To illustrate sequential consistency, let us consider Figure 2.7. There are two processes p1 and p2 that share a queue Q. At the operation level, the local history of p1 comprises a single operation, Q.Enq(a), while the local history of p2 comprises two operations, first Q.Enq(b) and then Q.Deq(b). The reader can easily verify that this history is not atomic: as all the operations are totally ordered according to real time, the Q.Deq() operation issued by p2 should return the value a, whose enqueuing terminated before the enqueuing of b started. However, the history is sequentially consistent: the sequential history (described at the operation level)

S = [Q.Enq(b) by p2][Q.Enq(a) by p1][Q.Deq(b) by p2]

is legal and respects the process-order relation.

Both consistency criteria, atomicity and sequential consistency, require a witness sequential history, but sequential consistency has no requirement related to the occurrence order of operations issued by different processes (and captured by the real-time order). It can be seen as based only on a logical time (the one defined by the witness history).

Figure 2.7: A sequentially consistent history (p1 performs Q.Enq(a); p2 performs Q.Enq(b) and then Q.Deq(b))

Atomicity vs sequential consistency Clearly, any linearizable history is also sequentially consistent. As shown by the example of Figure 2.7, however, the contrary is not true. It is then natural to ask whether sequential consistency is not good enough to reason about the correctness of concurrent implementations.

A drawback of sequential consistency is that it is not a local property. To illustrate this, consider the counter-example described in Figure 2.8. History H involves two processes accessing two concurrent queues Q and Q′. It is easy to see that, when we consider each object in isolation, we obtain histories H|Q and H|Q′ that are sequentially consistent. Unfortunately, there is no legal witness total order S that involves the six operations: if p1 dequeues b′ from Q′, Q′.Enq(a′) has to be ordered after Q′.Enq(b′) in a witness sequential history. But this means that (to respect process-order) Q.Enq(a) by p1 is necessarily ordered before Q.Enq(b) by p2: consequently, Q.Deq() by p2 should return a for S to be legal. A similar reasoning can be done starting from the operation Q.Deq(b) by p2. It follows that there can be no legal witness total order: even though H|Q and H|Q′ are sequentially consistent, the whole history H is not.



Figure 2.8: Sequential consistency is not a local property (p1 performs Q.Enq(a), Q′.Enq(b′), Q′.Deq(b′); p2 performs Q′.Enq(a′), Q.Enq(b), Q.Deq(b))

2.7.2 Serializability

Overview Both atomicity and sequential consistency guarantee that operations appear to execute instantaneously at some point of the time line. The difference is that atomicity requires that, for each operation, this instant lies between the occurrence times of the invocation and response events associated with the operation, which is not the case for sequential consistency.

Sometimes, it is important to ensure that groups of operations appear to execute as if they had been executed without interference with any other group of operations. The concept of transaction is then the appropriate abstraction that allows grouping operations.

A transaction is a sequence of operations that might complete successfully (commit) or abort. In short, the execution of a set of concurrent transactions is correct if committed transactions appear to execute at some indivisible point in time and aborted transactions do not appear to have been executed at all. This correctness criterion is called serializability (sometimes it is also called atomicity). The point (again) is to reduce the difficult problem of reasoning about concurrent transactions to the easier problem of reasoning about transactions that are executed one after the other. For instance, if some invariant predicate on the set of shared objects is preserved by every individual committed transaction, then it is preserved by any serializable execution of transactions.

Definition To define serializability, the notion of history needs to be revisited. Events are now associated with objects and transactions; in short, processes are replaced by transactions. For each transaction, in addition to the invocation and response events, two new events come into the picture: commit and abort events. These are associated with transactions. At most one such event is associated with every transaction in a history. A transaction without such an event is called pending; otherwise the transaction is said to be complete (committed or aborted). Adding a commit (resp., abort) event after all other events of a pending transaction is called committing (resp., aborting) the transaction. A sequential history is a sequence of committed transactions. We say that a history is complete if all its transactions are complete.

Let H be a complete history. H is serializable if there is a “witness” history S such that:

1. For each transaction T, S|T = H|T, and

2. S is sequential and legal.

Let H be a history that is not complete. H is serializable if we can derive from H a complete serializable history H′ by completing or removing pending transactions of H.



Atomicity vs serializability Again, correctness is defined with respect to the equivalence to a witness sequential history. No real-time ordering is required. In this sense, serializability can be viewed as the extension of sequential consistency to transactions grouping several operations. Like sequential consistency, serializability is not a local property either: replacing the processes of Figure 2.8 with transactions gives a counter-example that proves this.

2.8 Summary

We introduced in this chapter the basic elements that are needed to reason about executions of a distributed system made up of concurrent processes interacting through shared objects. More specifically, we introduced the elements that are needed to define the atomicity concept.

The fundamental element is that of a history: a sequence of events depicting the interaction between processes and objects. An event represents the invocation of an operation on an object or the return of a matching response. A history is atomic if, despite concurrency, it appears as if the processes accessed the objects of the history in a sequential manner. In this sense, the correctness of a concurrent computation is judged with respect to a sequential behavior, itself determined by the sequential specifications of the objects.

2.9 Bibliographic notes

The notion of atomic read/write objects (registers), as studied here, was investigated and formalized by Lamport [13] and Misra [16]. The generalization of the atomicity consistency condition to objects of any sequential type was developed by Herlihy and Wing under the name linearizability [9].

The notion of sequential consistency was introduced by Lamport [29]. The relation between atomicity and sequential consistency was investigated in [25] and [31], where it was shown that, from a protocol design point of view, sequential consistency can be seen as a lazy linearizability. Examples of protocols implementing sequential consistency can be found in [24, 25, 32].

The concept of transactions is part of every textbook on database systems. Books entirely devoted to transactions include [26, 27, 28]. The theory of serializability is the main topic of [27, 30].

The notions of safety and liveness were introduced by Lamport [40] and refined by Alpern and Schneider [33]. Lynch reformulated the notions for finite histories and proved that linearizability (atomicity) is a safety property [?].



Chapter 3

Wait-freedom: A Progress Property for Shared Object Implementations

3.1 Introduction

The previous chapter focused on the atomicity property of shared objects. This property requires operations to appear as if executed one after the other. Basically, atomicity stipulates that certain behaviors should be precluded, namely those that do not hide concurrent operation executions on the same object.

Atomicity is a safety property. It states what should not happen in an execution involving processes and shared objects (namely, an execution that is not linearizable must never happen). Clearly, this could be achieved by a trivial algorithm that would never return from any operation of the object to be implemented, i.e., one that would never return any result: an empty history would be a trivial linearization of every execution of this algorithm.

However, one would also require the implemented shared object to perform its operations when it is asked to do so by an application process making use of that object. In other words, one would also require that the algorithm implementing the object satisfies some progress property. Not surprisingly, this also depends on the process invoking the operation and in particular on how this process is scheduled by the operating system.

For instance, if a process invokes an operation and immediately crashes or is paged out by the operating system, then it makes little sense to require that the process obtains a reply matching its invocation. In fact, one might require that the shared object satisfies some progress property, provided that the process invoking an operation on the shared object is scheduled by the underlying operating system to execute enough steps of the algorithm implementing that operation. Performing such steps reflects the ability of the process to invoke primitives on the objects used in the implementation in order to eventually obtain a reply to the high-level operation being implemented.

One might, for example, require that if a process invokes an operation and keeps executing steps of the algorithm implementing the operation, then the operation eventually terminates and the process obtains a reply. Sometimes one might even require that, after invoking an operation, the process should obtain a response to the operation within b steps of the process.

To express such requirements, we need to carefully define the notion of object implementation and zoom into the way processes execute the algorithm implementing the object, in particular how their steps are scheduled by the operating system. We will in particular introduce the notion of implementation history: this is a lower level notion than that describing the interaction between the processes and the object being implemented (previous chapter). Accordingly, the history describing the interaction with the implemented object is called a high level history, whereas the implementation history is called a low level history. This will be used to introduce progress properties of shared object implementations, the strongest of these being wait-freedom.

We will focus on a single object implementation. As discussed in Chapter 2, when implementing atomic objects, it is enough to consider each object separately from the other objects. This is because the atomicity consistency criterion is a local property (i.e., the composition of objects that individually satisfy atomicity provides a system that, as a whole, satisfies the atomicity consistency criterion).

3.2 Implementation

3.2.1 High Level Object and Low Level Object

To distinguish the shared object to be implemented from the underlying objects used in the implementation, we typically talk about a high level object and underlying low level objects. (The latter are sometimes also called base objects.) Similarly, to disambiguate, we will talk about primitives instead of operations as far as low level objects are concerned. That is, a process invokes operations on a high level object and the implementation of these operations requires the process to invoke primitives of the underlying low level objects. When a process invokes such a primitive, we say that the process performs a step.

The very notions of high level and low level are relative and depend on the actual implementations. An object type might be considered high level in a given implementation and low level in another one. The object to be implemented is the high level one and the objects used in the implementation are the low level ones. In general, the intuition is that the low level objects might typically capture basic synchronization abstractions provided in hardware, whereas the high level ones are those we want to emulate in software (what we call implement). Such emulations are strongly motivated by the desire to facilitate the programming of concurrent applications, i.e., to provide the programmer with powerful synchronization abstractions encapsulated by high-level objects. Another motivation is to reuse programs initially devised with the high level object in mind in a system that does not provide such an object in hardware. Indeed, multiprocessor machines do not all provide the same basic synchronization abstractions. For instance, some modern machines provide compare&swap as a base object in hardware. Others do not and might provide instead test&set or simply some form of registers. Providing an implementation of compare&swap using test&set would, for instance, make it possible to directly reuse, within an old machine, an application written for a modern machine.

Sometimes the low level objects are assumed to be atomic, and sometimes not. As shown later in the book, it is sometimes useful to first implement intermediate objects that are not atomic, and then implement the desired high level atomic objects on top of them.

3.2.2 Zooming into histories

When reasoning about the atomicity of an object to be implemented, i.e., a high-level object, the executions of the processes accessing the object are represented with histories. As defined in the previous chapter, a history is a sequence of events, each representing an invocation or a reply on the high-level object in question.

History of an implementation Reasoning about progress properties requires us to zoom into the invocations and replies of the lower level objects on top of which the high level object is built. Hence, lower level histories are needed that depict events at the interface between the processes and the low level objects used in the implementation, i.e., the primitive events. Without such a zooming, it is not possible, for instance, to distinguish a process that crashes right after invoking a high level object operation and stops invoking low-level objects, from one that keeps executing the algorithm implementing that operation and invoking primitives of low level objects. We might want to require that the latter obtains a matching reply and exempt the former from having to obtain a reply. So, when we talk about the history of an implementation, we implicitly assume such a low level history, which is a refinement of the higher level history involving only the invocations and replies of the high level object to be implemented.

Consider the example of a fetch-and-increment high-level-object implementation described in Section 3.4.2. As low-level objects, the implementation uses an infinite array T[1, . . . ,∞] of TAS (test-and-set) objects and a snapshot-memory object my_inc. Here a high level history is a sequence built from the invocation and reply events associated with the operation fetch-and-increment(), while a low level history (or implementation history) is a sequence built from the primitives read(), update(), snapshot() and test-and-set().

The two faces of a process To better understand the very notion of a low level history, it is important to distinguish the two roles of a process. On the one hand, a process has the role of a client that sequentially invokes operations on the high level object and receives replies. On the other hand, the process also has the role of a server implementing the operations. While doing so, the process invokes primitives on lower level objects in order to obtain a reply to the high-level invocation.

It might sometimes be convenient to think of the two roles of a process as executed by different entities. As a client, a process invokes object operations but does not control the way the low level primitives implementing these operations are executed. It does not even know how an object operation is implemented. Differently, as a server, a process executes the algorithm (made up of invocations of low level object primitives) associated with the high level object operation. Such an algorithm is typically described by an automaton (possibly with an unbounded number of states). The execution of a low level object primitive is called a step, and it typically represents an atomic unit of computation.

Scheduling and asynchrony The interleaving of steps in an implementation is specified by a scheduler (itself part of an operating system). This is outside of the control of processes and, in our context, it is convenient to think of a scheduler as an adversary. This is because, when devising a distributed algorithm, one has to cope with the worst-case strategies of a scheduler that could defeat the algorithm.

Strictly speaking, a process is said to be correct in a low-level history when it executes an infinite number of steps, i.e., when the scheduler schedules an infinite number of steps of that process. This “infinity” notion models the fact that the process executes as many steps as needed by the implementation. Otherwise, the process is said to be faulty. Sometimes it is convenient to see a faulty process as a process that crashes and prematurely quits the computation. In the context of this book, we assume that processes might indeed crash and permanently stop performing steps, but do not deviate from the algorithm assigned to them. In other words, they are not malicious (we also say they are not Byzantine).

Unless explicitly stated otherwise, the system is assumed to be asynchronous, which means that the relative speeds of the processes are unbounded: for any Φ there is an execution in which a process takes Φ steps while another not crashed process takes only one step. Basically, the progress of an asynchronous system is controlled by a very weak scheduler, i.e., a scheduler whose only restriction lies in the fact that it cannot prevent forever a correct (never crashing) process from executing steps.



3.3 Progress properties

As pointed out above, a trivial way to ensure atomicity would be to do nothing, i.e., not return any reply to any operation invocation. This would preclude any history that violates linearizability by simply precluding any history with a reply.

Besides this (clearly meaningless) approach, a popular way to ensure atomicity is to use critical sections (say, using locks), preventing concurrent accesses to the same object. In the simplest case, every operation on a shared object is executed as a critical section. When a process invokes an operation on an object, it first requests the corresponding lock, and the algorithm of the operation is executed by the process only when the lock is acquired. If the lock is not available, the process waits until the lock is released. After a process obtains the reply to an operation, it releases the corresponding lock.

As we discussed in the introduction of this book, such an implementation of a shared object has an inherent drawback: the crash of the process holding the lock on an object prevents any other process from completing its operation on that object. In practice, this might correspond to the situation where the process holding the lock is preempted for a long period of time, and all processes contending on the same object are blocked. When processes are asynchronous (i.e., the scheduler can arbitrarily preempt processes), which is the default assumption we consider, there is no way for a process to know whether another process has crashed (or was preempted for a long while) or is only very slow.

This book focuses on shared object implementations with progress properties that preclude the use of critical sections and locks. Informally, we say an implementation is lock-based if it allows for a situation in which a process running in isolation from some point on is never able to complete its operation. Taking the negation of this property, we state that an implementation does not employ locks if, starting after any finite execution, every process can complete its operation in a finite number of its solo steps. Intuitively, this property, called obstruction-freedom (or solo termination), must be satisfied by any implementation in which the crash of some processes does not prevent other processes from making progress. Several such progress properties, including obstruction-freedom, are presented below.

3.3.1 Solo, partial and global termination

• Obstruction-freedom (also called solo termination). An implementation of a concurrent object is obstruction-free if any of its operations is guaranteed to terminate if it is eventually executed without concurrency (assuming that the invoking process does not crash¹).

An operation is “eventually executed without concurrency” if there is a time after which the only process to take steps is the process that invoked that operation. Note that this does not prevent other processes from having started and not yet finished operations on the same object (this is, for example, the case of a process that crashed in the middle of an operation on the object).

Note that obstruction-freedom allows executions in which several processes invoking operations on the same object forever contend on the internal representation of the object without terminating.

As we observed earlier, obstruction-freedom precludes the use of locks.

• Non-blockingness. This is a partial termination property that is strictly stronger than obstruction-freedom. It states the following: despite asynchrony and process crashes, if several processes execute operations on the same object and do not crash, at least one of them terminates its operation.

So, non-blockingness means deadlock-freedom despite asynchrony and crashes.

¹ Let us recall that “a process does not crash” means that “it executes an infinite number of steps”.



• Wait-freedom. This is a global termination property that states the following: despite asynchrony and process crashes, any process that executes an operation on the object (and does not crash) terminates its operation [8]. Wait-freedom is strictly stronger than non-blockingness.

So, wait-freedom means starvation-freedom despite asynchrony and crashes.

3.3.2 Bounded termination

Wait-freedom, the strongest among the liveness properties considered above, does not stipulate a bound on the number of steps that a process needs to execute before obtaining a matching reply when it invokes a high level object operation. Typically, this number can depend on the behavior of the other processes. For example, it can be small if no other process performs any step, and increase when all processes perform steps (or the opposite), while always remaining finite, regardless of the number and timing of crashes.

• An implementation satisfies the bounded wait-free property if there is a bound B such that, in any low level history, every process p that invokes an operation receives a matching reply within B of its own steps. (The B steps of p are not required to be consecutive.)

In other words, there is no prefix of a low level history in which a process invokes an operation and executes B steps without obtaining a matching reply.

Showing that an implementation is bounded wait-free consists in exhibiting an upper bound on the number of steps needed to return from any operation. That upper bound is usually defined by a function f() on the number of processes (e.g., O(n²)). One can similarly define notions like bounded solo or bounded partial termination.

3.4 Atomicity and wait-freedom

Just as it is meaningless to ensure atomicity alone, without any progress guarantee, it is also meaningless to ensure any progress guarantee alone. Meaningful implementations are those that ensure both; ideal ones are those that ensure atomicity and wait-freedom.

Before diving into such implementations in the next chapters, it is important to ask whether every atomic object has an implementation that ensures wait-freedom and atomicity. In fact, it is easy to see that this would not be the case for objects with partial operations (previous chapter). By definition, the progress of such operations may depend on concurrent invocations of other operations. That is, if an object's specification requires that a process does not return from an operation unless some other process completes some other operation first, then it would be impossible to come up with even a solo-terminating implementation of this object, regardless of how powerful the underlying base objects are. As we discuss below, however, an implementation that ensures wait-freedom and atomicity is always possible for objects with total operations.

3.4.1 Operation termination and atomicity

Besides being a local property, which we discussed in the previous chapter, atomicity is also non-blocking, meaning that a pending invocation of a total operation is never required to wait for another operation to complete in order to preserve atomicity. This property has a fundamental consequence: per se, atomicity never forces a pending total operation to block. In other words, atomicity, per se, cannot prevent wait-freedom. Blocking (or even deadlock and starvation) can occur as an artifact of a particular implementation of atomicity, but is not inherent to atomicity. The following theorem captures this idea by stating that any (linearizable) history with a pending operation invocation can be extended with a reply to that operation.

Theorem 3 Let inv[op(arg)] be the invocation event of a total operation that is pending in a linearizable history H. There exists a matching reply event resp[op(res)] such that the history H′ = H · resp[op(res)] is linearizable.

Proof Let S be a linearization of the partial history H. By definition of a linearization, S has a matching reply to every invocation. Assume first that S includes a reply event resp[op(res)] matching the invocation event inv[op(arg)]. In this case, the theorem trivially follows, as S is then also a linearization of H′.

If S does not include a matching reply event, then S does not include inv[op(arg)] either. Because the operation op() is total, there is a reply event resp[op(res)] matching the invocation event inv[op(arg)] in every state of the shared object. Let S′ be the sequential history S with the invocation event inv[op(arg)] and a matching reply event resp[op(res)] added, in that order, at the end of S. S′ is trivially legal. It follows that S′ is a linearization of H′. □ (Theorem 3)

3.4.2 A simple example

To give an example of a wait-free linearizable implementation, consider the algorithm in Figure 3.1. The algorithm implements a fetch-and-increment (FAI) object using an infinite array of TAS objects T[1, . . . ,∞] and a snapshot memory my_inc. Below we give a brief description of these objects.

An FAI object stores an integer value and exports one operation fetch-and-increment() that atomically increments the value of the object and returns the previous value.

A TAS object exports one atomic operation test-and-set() that returns 0 or 1 and guarantees that the first invocation of test-and-set() on the object returns 1 and all subsequent invocations return 0. Intuitively, a TAS object allows a single process to distinguish itself from the rest of the system.

Finally, a snapshot memory can be seen as an array of n registers, one for each process, such that each process pi can atomically write a value v to its dedicated register with an operation update(i, v) and atomically read the content of the whole array using an operation snapshot().²

The algorithm in Figure 3.1 is very simple. To increment the value of the object, a process first increments its dedicated register in the snapshot memory my_inc. Then it takes a snapshot of the memory and evaluates entry as the sum of all its elements. Then, starting from T[entry] and going down towards T[1], the process invokes test-and-set() until the first TAS object returns 1. The index of this object minus 1 is then returned by fetch-and-increment().

Intuitively, if a process pi evaluates its local variable entry to ℓ, it means that at most ℓ processes have previously incremented their positions and, thus, at least one TAS object in the array T[1, . . . , ℓ] is “reserved” for pi (pi is one of these ℓ processes). Every process that increments its position in my_inc later will obtain a strictly higher value of entry. Thus, eventually, every operation obtains 1 from one of the TAS objects and returns. Moreover, since a TAS object returns 1 to exactly one process, every returned value is unique. The reader is invited to check that this guarantees that every history of this implementation is linearizable.

Notice that the number of steps performed by a fetch-and-increment() operation is finite but in general unbounded (the implementation is not bounded wait-free). This is because an unbounded number of increments can be performed by other processes in the time lag between the moment a process increments its position in my_inc and the moment it takes a snapshot of my_inc. It is however not difficult to modify the algorithm so that every operation performs O(n²) steps.

² In Chapter 6, we show that snapshot memory can be wait-free implemented using only read-write registers.



Shared:
  T[1, . . . ,∞]: n-process TAS objects
  my_inc[1, . . . , n]: snapshot memory, initialized to 0

Local: entry, c (initially 0), S

operation fetch-and-increment():
  c ← c + 1;
  my_inc.update(i, c);
  S ← my_inc.snapshot();
  entry ← sum(S);
  while T[entry].test-and-set() = 0 do
    entry ← entry − 1
  end do;
  return(entry − 1)

Figure 3.1: Fetch-and-increment implementation: code for process pi
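For readers who prefer executable code, here is a Java sketch of Figure 3.1 (our translation, not the book's): AtomicBoolean.compareAndSet(false, true) returns true to exactly one caller, matching the convention that the first test-and-set() returns 1; the snapshot memory is simulated with coarse synchronized methods (a shortcut, since Chapter 6 shows snapshots can be implemented wait-free from registers); and a finite bound MAX stands in for the infinite array T[1..∞].

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the fetch-and-increment algorithm of Figure 3.1. Simplifications
// (ours): MAX bounds the total number of increments so a plain array can
// stand in for T[1..∞]; update()/snapshot() are made atomic with a lock,
// which is stronger than the wait-free snapshot the algorithm assumes.
final class FetchAndIncrement {
    private static final int MAX = 1 << 16;               // assumed bound on increments
    private final AtomicBoolean[] tas = new AtomicBoolean[MAX + 1];
    private final int[] myInc;                            // snapshot memory, one slot per process
    private final int[] local;                            // the local counters c

    FetchAndIncrement(int n) {                            // n = number of processes
        for (int j = 0; j <= MAX; j++) tas[j] = new AtomicBoolean(false);
        myInc = new int[n];
        local = new int[n];
    }

    private synchronized void update(int i, int c) { myInc[i] = c; }
    private synchronized int[] snapshot() { return myInc.clone(); }

    int fetchAndIncrement(int i) {                        // code for process i
        local[i]++;                                       // c <- c + 1
        update(i, local[i]);                              // my_inc.update(i, c)
        int entry = 0;
        for (int v : snapshot()) entry += v;              // entry <- sum(snapshot())
        while (!tas[entry].compareAndSet(false, true))    // walk down until a TAS is won
            entry--;
        return entry - 1;
    }
}

Note how the code mirrors the termination argument above: entry is at least 1 after the process's own increment, and the downward walk stops at the first TAS object the process wins.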

3.4.3 A less simple example

Proving that a given implementation satisfies atomicity and wait-freedom can sometimes be extremely tricky. To illustrate this, consider the algorithm in Figure 3.2 that intends to implement an unbounded FIFO queue. The sequential specification of this object was given in Section 2.1 of Chapter 2.

The algorithm is quite simple. The system we consider here is made up of producers (clients) and consumers (servers) that cooperate through an unbounded FIFO queue. A producer process repeats forever the following two statements: it first prepares a new item v, and then invokes the operation Enq(v) to deposit v in the queue. Since we assume that the queue is unbounded, the operation Enq(v) is total.

Similarly, a consumer process repeats forever the following two statements: it first withdraws an item from the queue by invoking the operation Deq(), and then consumes that item. If the queue is empty, then the default value ⊥ is returned to the invoking process. (This default value cannot be deposited by a producer process.) Since we do not preclude the possibility of returning ⊥, the Deq() operation is also total. We assume that no processing by the consumer is associated with the ⊥ value.

The algorithm implementing the shared queue relies on an array Q[0..∞) used to store the items of the queue. Each entry of the array is initialized to ⊥.

To enqueue an item, the producer first locates the index of the next empty slot in the array Q, reserves it, and then stores the item in that slot. To dequeue a value, the consumer first determines the last entry of the array Q that has been reserved by a producer. Then, it scans the array Q in ascending order until it finds an item different from the default value ⊥. If it finds one, it returns it. Otherwise, the default value is returned.

The algorithm is given in Figure 3.2. The return() statement terminates the operation (it corresponds to the reply event associated with that operation). Lowercase letters are used for identifiers of local variables, and uppercase letters for shared variables. The implementation uses the following shared variables: NEXT (initialized to 0) and the array Q, used to contain the values that have been produced and not yet consumed. The variable NEXT is a pointer to the next slot of the array Q that can be used to deposit a new value. (This implementation could be optimized by reclaiming the slots from which items have been dequeued. But this is not the point here.)

The variable NEXT is provided with two primitives denoted read() and fetch&add(). The invocation NEXT.fetch&add(x) returns the value of NEXT before the invocation and adds x to NEXT. Similarly, each entry Q[i] of the array is provided with two primitives denoted write() and swap(). The invocation Q[i].swap(v) writes v in Q[i] and returns the value of Q[i] before the invocation.

The executions of the read(), write(), fetch&add() and swap() primitives on the shared base objects (NEXT and each variable Q[i]) are assumed to be atomic. The primitives read() and write() are implicit in the code of Figure 3.2 (they are hidden in the assignment statements denoted “←”).

The algorithm does not use locks, so no process can block other processes forever by crashing. Furthermore, each value deposited in the array by a producer will be withdrawn by a swap() operation issued by a consumer (assuming that at least one consumer is correct).

operation Enq(v):
  in ← NEXT.fetch&add(1);
  Q[in] ← v;
  return()

operation Deq():
  last ← NEXT − 1;
  for i from 0 until last do
    aux ← Q[i].swap(⊥);
    if (aux ≠ ⊥) then return(aux) end if
  end do;
  return(⊥)

Figure 3.2: Enqueue and dequeue implementations
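The following Java sketch (our translation, not the book's) renders Figure 3.2: an AtomicInteger plays NEXT (getAndIncrement() is its fetch&add(1) primitive and get() its read()), an AtomicReferenceArray plays Q (getAndSet() is its swap() primitive and set() its write()), null plays ⊥, and a fixed CAPACITY stands in for the unbounded array.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the queue of Figure 3.2. CAPACITY is an assumed bound replacing
// the unbounded array Q[0..∞); null stands for the default value ⊥.
final class SimpleQueue<V> {
    private static final int CAPACITY = 1 << 16;
    private final AtomicInteger next = new AtomicInteger(0);      // NEXT
    private final AtomicReferenceArray<V> q = new AtomicReferenceArray<>(CAPACITY);

    void enq(V v) {
        int in = next.getAndIncrement();   // in <- NEXT.fetch&add(1)
        q.set(in, v);                      // Q[in] <- v
    }

    V deq() {
        int last = next.get() - 1;         // last <- NEXT - 1
        for (int i = 0; i <= last; i++) {
            V aux = q.getAndSet(i, null);  // aux <- Q[i].swap(⊥)
            if (aux != null) return aux;
        }
        return null;                       // ⊥
    }
}

The problematic scenario discussed next can be reproduced by interleaving these two methods exactly as in steps 1 to 5 below.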

Therefore, the implementation is wait-free: every process completes each of its operations in a finite number of its own steps: the number of steps performed by Enq() is two, and the number of steps performed by Deq() is proportional to the queue size as evaluated in the first line of its pseudocode.

But is the implementation linearizable? Superficially, yes: it seems that if every dequeue operation returns a non-⊥ value, we can order the operations based on the time when the corresponding operation on Q[] (a write performed by Enq() or a successful swap performed by Deq()) takes place.

However, if a dequeue operation may return ⊥, it is not always possible to find the right place for it in a legal linearization. For example, consider the following scenario:

1. Process p1 performs Enq(x). As a result, the value of NEXT is 1, and Q[0] stores x.

2. Process p2 starts executing Deq() and reads 1 in NEXT.

3. Process p1 performs Enq(y). The value of NEXT is now 2, Q[0] stores x, and Q[1] stores y.

4. Process p3 performs Deq(), reads 2 in NEXT, finds x in Q[0] and returns x. The value of Q[0] is now ⊥.

5. Finally, p2 reads ⊥ in Q[0] and completes Deq() by returning ⊥.

In this execution we have the following partial order on the operations: p1.Enq(x) → p1.Enq(y) → p3.Deq(x), and p1.Enq(x) → p2.Deq(⊥). Thus, there are only three possible ways to linearize p2.Deq(⊥): right after p1.Enq(x), right after p1.Enq(y), or right after p3.Deq(x). In all three possible linearizations, the queue is non-empty at the point where p2.Deq(⊥) would take effect, and thus ⊥ cannot be returned.

How can we fix this problem? One solution is to sacrifice linearizability and not consider operations returning ⊥ in a linearization.



Another solution is to sacrifice wait-freedom and, instead of returning ⊥ in the last line of Deq(), repeat the same procedure (evaluating NEXT and going through the first NEXT elements of Q[]) over and over until a non-⊥ value is found in Q[]. As long as a producer keeps adding items to the queue, every Deq() operation is guaranteed to eventually return. A sketch of this variant follows.
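As a sketch of this second solution, the Deq() of the SimpleQueue sketch above can be replaced by the following method (still our hypothetical rendering); it spins instead of returning null (⊥):

// Blocking variant of Deq() for the SimpleQueue sketch: re-scan the reserved
// prefix of Q forever instead of returning null. Wait-freedom is lost, but
// the problematic ⊥ response disappears.
V deqBlocking() {
    while (true) {
        int last = next.get() - 1;
        for (int i = 0; i <= last; i++) {
            V aux = q.getAndSet(i, null);
            if (aux != null) return aux;
        }
        // the queue looked empty: retry (a real implementation would
        // back off rather than spin at full speed)
    }
}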

3.4.4 On the power of low-level objects

The previous example shows that a FIFO queue, shared by an arbitrary number of processes, can be wait-free implemented from two types of base atomic objects, namely, an atomic object NEXT whose type is defined by the pair of primitives fetch&add() and read(), and an array Q of atomic objects whose type is defined by the pair of primitives write() and swap().

This means that these base types are “powerful enough” to wait-free implement a FIFO queue shared by any number of processes. The investigation of the power of base object types to wait-free implement any shared object constitutes the topic addressed in the third part of the book.

3.4.5 Non-determinism

Before concluding this chapter, it is worthwhile to highlight some sources of non-determinism in a concurrent system that need to be considered when devising a shared object implementation.

1. The scheduler of a concurrent system can orchestrate the steps of the processes in all kinds of ways, and this is a source of non-determinism that any wait-free implementation has to cope with.

2. When seeking a linearization of a concurrent history, we can also choose among several possible sequential histories. First, there might indeed be several ways of completing the original history, especially when non-deterministic objects are involved. Second, there might be several ways of ordering concurrent operations in the equivalent linearization.

3.5 Summary

We defined in this chapter three progress properties: solo-termination, partial-termination and wait-freedom. A wait-free implementation of an atomic object is inherently robust in the following sense.

• It is inherently starvation-free. (This is an immediate consequence of the definition of the wait-free property.)

• It is (n − 1)-resilient. This expresses the fact that it naturally copes with up to (n − 1) process crashes.

Bibliographic notes

The notion of wait-freedom originated in the work of Lamport [12]. An associated theory was developed by Herlihy [8].

The notion of solo-termination was presented implicitly in [35]. It has been introduced as a progress property in [38] under the name obstruction-free synchronization. That notion has been formalized in [34]. More developments on obstruction-freedom can be found in [36]. The minimal knowledge on process failures needed to transform any solo-terminating implementation into a wait-free one was investigated in [37]. Other liveness properties (also called progress conditions) are presented in [39].


The algorithms in Figures 3.1 and 3.2 were proposed by Afek et al. [1]. A blocking variant of the algorithm in Figure 3.2 in which ⊥ is never returned was given and proved correct by Herlihy and Wing [9].


Part II

Read-Write Memory


Chapter 4

Safe, regular and atomic registers

4.1 Introduction

This part of the book is devoted to the construction of the simplest linearizable objects usually considered, namely shared storage objects that provide their users with two basic operations: read and write. These objects are usually called registers, and linearizable registers are called atomic registers. In particular, we focus on how to wait-free implement such atomic registers using “weaker” registers. Again, the picture to have in mind is one where the weak registers are provided in hardware and the stronger registers are emulated in software to facilitate the job of the application's developer.

This chapter defines different sorts of registers; the differences depend on three dimensions: (a) the capacity of a register, (b) the access pattern to a register, and (c) the behavior of a register in the face of concurrency. The capacity of a register conveys the range of values it can store; we will in particular distinguish registers that can store a binary value from those that can store any number (possibly an infinite number) of values. The access pattern of a register conveys the number of processes that can read (resp., write) the register. Finally, we will distinguish registers that do not provide any guarantee when accessed concurrently, at one extreme, from those that ensure linearizability, at the other extreme (i.e., atomic registers).

The weakest kind of shared register is one that can only store one bit of information, can be read by a single process p, can be written by a single process q, and does not ensure any guarantee on the value read by p when p and q access it concurrently. We will show how, using multiple such registers, we can construct an atomic register that can store an arbitrary number of values and be read and written by any number of processes. This construction will be presented incrementally, going through intermediate kinds of registers that are interesting in their own right.

An algorithm used to implement a register of a given kind from one of another kind is sometimes called a transformation or a reduction, the first, high-level register being “reduced” to the second register, used as a base object in the implementation. We also say that the high-level register is emulated by the second one.

4.2 The many faces of registers

The capacity of a register According to the operations on a register issued by the processes, read operations on the register can return different values at different times. So, the first dimension that characterizes a register is related to its capacity, i.e., how much information it can contain.

The simplest kind of register is the binary register: it can only store a single bit, 0 or 1. We talk about a shared bit, or simply a bit.


More generally, a multi-valued register may store two or more distinct values. A multi-valued register can be bounded or unbounded. A bounded register is one whose value range contains exactly b distinct values (e.g., the values from 0 to b − 1), where b is typically a constant known by the processes. Otherwise the register is said to be unbounded.

A register that can contain b distinct values is said to be b-valued. Its binary representation requires B = ⌈log2 b⌉ bits. Its unary representation is more expensive, as it requires b bits (the value v being then represented with a bit equal to 1 followed by v − 1 bits equal to 0).

Access patterns This dimension concerns the sets of processes that can read from or write into the register. A register is called single-writer, denoted 1W (resp., single-reader, denoted 1R), if only one specific process, known in advance, and called the writer (resp., the reader), can invoke the write (resp., read) operation on the register. A register that can be written (resp., read) by any process is called a multi-writer (resp., multi-reader) register. Such a register is denoted MW (resp., MR).

For instance, a binary 1WMR register is a register that (a) can contain only 0 or 1, (b) can be read by all the processes, but (c) can be written by a single process.

The concurrent behavior of a register When accessed sequentially, the behavior of a register is simple to define: a read invocation returns the last value written. When accessed concurrently, the semantics is more involved, and several variants have been considered. We overview these variants in the following.

4.3 Safe, regular and atomic registers

We consider three kinds of registers that vary according to their behavior in the presence of concurrent accesses. The differences lie in the value returned by a read operation invoked on the register concurrently with a write operation. When there is no concurrency, the behavior is the same in all cases. For the one-writer case, all the registers defined below preserve the following invariant:

• A read that is not concurrent with a write (i.e., their executions do not overlap) returns the last written value.

4.3.1 Safe registers

A safe register is the weakest traditionally considered in distributed computing. It has a single writer and, since we assume that every process is sequential, allows for no concurrent writes. A safe register only guarantees that:

• A read that is concurrent with one or several writes returns any element of the value range of the register.

It is important to see that, in the presence of concurrency, the value returned by a read operation can possibly be a value that has never been written. The only constraint is that the value needs to be in the range of the register. To illustrate this, consider a safe register that can contain only b = 3 values, e.g., 1, 2 and 3. The register is bounded. Assuming that the current value is 1, consider a write of value 2 that is concurrent with a read operation. That read operation can return 1, 2 or even 3. It cannot return 5, as that value is not in the range of the safe register.

An interesting particular case is the binary 1WMR (one-writer multi-reader) safe register. It can contain only 0 and 1, and can be seen as modeling a flickering bit. Whatever its previous value, the value of the register can flicker during a write operation; only when the write finishes does the register stabilize to its final value (the value just written), and it keeps that value until the next write.

[Figure 4.1 shows a history H of invocation/response events: the writer pw issues w(1), w(0) and then w(0), while the reader pr issues r(1), r(a), r(b), r(0) and r(c); the reads returning a, b and c each overlap a write.]

Figure 4.1: History of a register

Value returned |  a  |  b  |  c
---------------+-----+-----+-----
Safe           | 1/0 | 1/0 | 1/0
Regular        | 1/0 | 1/0 |  0
Atomic         |  1  | 1/0 |  0
Atomic         |  0  |  0  |  0

Table 4.1: Safe, regular and atomic registers

An example of the behavior of a binary safe register (i.e., a safe bit) is depicted in Figure 4.1 and Table 4.1. We consider there a 1W1R safe register: only one reader is involved. The writer process is denoted pw, whereas the reader process is denoted pr (w(v) stands for a write operation that writes the value v; similarly, r(v) stands for a read operation that obtains the value v). As the first and the fourth read operations do not overlap a write, they return the last written value, namely 1 for the first read and 0 for the fourth one. The values returned by the other read operations are denoted a, b and c. All these read operations overlap a write and can consequently return any of the values that the register can contain (this is denoted 1/0 in Table 4.1, as the register is binary). So, the last read can return 1 even if the previous value was 0 and the concurrent operation writes the very same value 0. This gives 8 possible correct executions, assuming indeed a binary safe register.

[Figure 4.2 shows the history of Figure 4.1 at the operation level: the writer issues w(1), w(0) and w(0); the reader issues r(1), r(a), r(b), r(0) and r(c).]

Figure 4.2: History of a safe register

Figure 4.2 depicts the corresponding history at the operation level (i.e., the partial order on the operations, denoted →H). The transitive dependencies are not indicated. The unordered operations (e.g., the second w(0) operation issued by pw and r(c) issued by pr) are concurrent.


4.3.2 Regular registers

A regular register is also defined for the case of a single writer. It is a safe register that satisfies the following additional property:

• A read that is concurrent with one or several writes returns the value written by a concurrent write or the value written by the last preceding write.

To illustrate the regular register notion, let us again consider Figure 4.1. The values that can possibly be returned by a regular register are described in Table 4.1. The second read operation can return either the previous value or the value of the concurrent write, namely 0 or 1. It is the same for the third read operation. In contrast, as the last write does not change the value of the register, the last read can return only the value 0. This means that 4 possible correct executions can be determined for Figure 4.1.

It is important to see that a read that overlaps several write operations can return any value among the values written by these writes, as well as the value of the register before these writes. This is depicted in Figure 4.3, where the value a returned by the second read can be any of 1, 2, 3, 4 or 5.

[Figure 4.3 shows a history in which pw issues w(1) and then w(2), w(3), w(4), w(5), while pr issues r(1) and then r(a), the read r(a) overlapping the last four writes.]

Figure 4.3: History of a regular register

4.3.3 Atomic registers

An atomic register is a MWMR register whose execution histories are linearizable. This means that it is possible to totally order all its read and write operations in such a way that this total order S respects their real-time occurrence order and each read returns the value written by the last write operation that precedes it in S (legality property).

Again, Figure 4.1 illustrates the atomicity notion for the specific case of a register. The second read r(a) is concurrent with the w(0) operation. Given that the previous value of the register is 1, the returned value a can be either 1 or 0. If it returns 1 (the value written by the last preceding write), then the third read can return either 1 or 0. In contrast, if the second read returns 0 (the value written by the concurrent write), only value 0 can be returned by the third read, as the second read indicates that the value 1 is now overwritten by the “new” value 0. Finally, the last read can only return the value 0. It is easy to see that there are 3 possible executions when the registers are binary and atomic. As previously, the possible values returned by the three read operations concurrent with a write operation are summarized in Table 4.1.

4.3.4 Regularity and atomicity: a reading function

One important difference between regularity and atomicity is that a regular register allows for new/old inversion: in case two read operations are concurrent with a write, the first read may return the concurrently written value while the second read may still return the value written by a preceding write. Such a history is not allowed by an atomic register, since the second read must succeed the first one in any linearization, and thus must return the same or a “newer” value.

For example, the history depicted in Figure 4.1 and Table 4.1 is correct with a = 0 and b = 1 with respect to regularity, but incorrect with respect to atomicity. In that history, considering the two consecutive read operations r(a) and r(b), the first, namely r(a), obtains the “new” value (a = 0) while the second, namely r(b), obtains the “old” value (b = 1).

Formally, we capture the difference between (one-writer) regular and atomic registers using the notion of a reading function. A reading function is associated with a given history and maps every read operation r(x) that returns a value x to some write operation w(x) in that history. Without loss of generality, we assume that every history starts with a sequential operation w(x0) that writes the initial value x0.

We say that a reading function π associated with a history H is regular if (here r and w with indices denote read and write operations in H):

A0: ∀ r: ¬(r →H π(r)). (No read returns a value not yet written.)

A1: ∀ r, w in H: (w →H r) ⇒ (π(r) = w ∨ w →H π(r)). (No read obtains an overwritten value.)

We say that a reading function is atomic if it is regular and satisfies the following additional property:

A2: ∀ r1, r2: (r1 →H r2) ⇒ (π(r1) = π(r2) ∨ π(r1) →H π(r2)). (No new/old inversion.)
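To make these definitions concrete, here is a small Java sketch (not from the book) that decides whether a single-writer history of complete operations admits a regular reading function. Operations carry invocation/response timestamps, with op1 →H op2 meaning op1.resp < op2.inv; for each read, the sketch simply searches for a write that wrote the value read and satisfies A0 and A1.

import java.util.List;

public class RegularityChecker {
    // A complete operation: invocation time, response time, and the value
    // written (for a write) or returned (for a read).
    record Op(long inv, long resp, int value) {}

    // The writes must be given in their single-writer order, starting with
    // the initial write w(x0).
    static boolean admitsRegularReadingFunction(List<Op> writes, List<Op> reads) {
        for (Op r : reads) {
            boolean found = false;
            for (int k = 0; k < writes.size() && !found; k++) {
                Op w = writes.get(k);
                boolean a0 = !(r.resp() < w.inv());              // ¬(r →H w)
                boolean a1 = !(k + 1 < writes.size()
                        && writes.get(k + 1).resp() < r.inv());  // no overwriting write precedes r
                found = a0 && a1 && w.value() == r.value();      // w is a legal π(r)
            }
            if (!found) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // w(1) then w(0); a read overlapping w(0) may return 1 or 0, but not 2.
        List<Op> writes = List.of(new Op(0, 1, 1), new Op(4, 8, 0));
        System.out.println(admitsRegularReadingFunction(writes, List.of(new Op(5, 6, 1)))); // true
        System.out.println(admitsRegularReadingFunction(writes, List.of(new Op(5, 6, 2)))); // false
    }
}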

We now show that determining a regular reading function is exactly what we need to show that a history can be produced by a regular register.

Theorem 4 Let H be an execution history of a 1WMR register. H can be a history of a regular register if and only if it allows for a regular reading function π.

Proof Suppose that H is a history of a regular register. We define π as follows: for any complete read operation r(x) in H, we define π(r) as the last write operation w(x) in H such that ¬(r(x) →H w(x)). It is easy to see that π is a regular reading function.

Now suppose that H allows for a regular reading function. Let r(x) be a complete read operation in H. Then there exists a write w(x) in H that either precedes or is concurrent with r(x) in H (A0) and is not succeeded by a write that precedes r(x) in H (A1). Thus, r(x) returns either the last written or a concurrently written value. □Theorem 4

Theorem 5 Let H be an execution history of a 1WMR register. H is linearizable if and only if it allows for an atomic reading function π.

Proof Given a linearizable history H, it is straightforward to construct an atomic reading function: take any S, a linearization of H, and define π(r) as the last write that precedes r in S. By construction, π(r) satisfies properties A0, A1 and A2.

Now suppose that H allows for an atomic reading function π. We use π to construct S, a linearization of H, as follows.

We first construct S as the sequence of all writes that took place in H, in their order of appearance. Since we have only one writer, the writes are totally ordered. (In case the last write is incomplete, we add to S its complete version.) Then we put every complete read operation r immediately after π(r), making sure that:

if π(r1) = π(r2) and r1 →H r2, then r1 →S r2.


Clearly, S is legal: the reading function guarantees that π(r) writes the value read by r, and thus every read in S returns the last written value.

To show that →H ⊆ →S, we consider the following four possible cases (w1 and w2 (resp., r1 and r2) denote here write (resp., read) operations):

• w1 →H w2. By the very construction of S (which follows the order on the write operations performed by the writer), we have w1 →S w2.

• r1 →H r2. By A2, we have π(r1) = π(r2) or π(r1) →H π(r2).

If π(r1) = π(r2), as r1 precedes r2 (case assumption), due to the way S is constructed, r1 is ordered before r2 in S, and we consequently have r1 →S r2.

If π(r1) →H π(r2), as (1) r1 and r2 are placed just after π(r1) and π(r2), respectively, and (2) π(r1) →S π(r2) (see the first item), the construction of S ensures r1 →S r2.

• r1 →H w2. By A0, either π(r1) is concurrent with r1 or π(r1) →H r1. Since r1 →H w2 and all writes are totally ordered, we have π(r1) →H w2. By construction of S, since π(r1) is the last write preceding r1 in S, r1 →S w2.

• w1 →H r2. By A1 we have π(r2) = w1 or w1 →H π(r2).

Case π(r2) = w1. As r2 is placed just after π(r2) in S, we have π(r2) = w1 →S r2.

Case w1 →H π(r2). As (1) w1 →H π(r2) ⇒ w1 →S π(r2) (first item), and (2) π(r2) →S r2 (r2 is ordered just after π(r2) in S), we obtain (by transitivity of →S) w1 →S r2.

Finally, since S contains all complete operations of H and preserves →H, H is indistinguishable from S for every reader, modulo the last incomplete read operation (if any).

Thus, S is a legal sequential history that is equivalent to a completion of H and preserves →H. □Theorem 5

We can now say that a history of a regular register suffers from new/old inversion if it allows for no atomic reading function. Theorems 4 and 5 imply that an atomic register can be seen as a regular register that never suffers from new/old inversion.

It follows from the fact that atomicity (linearizability) is a local property that a set of 1WMR regular registers behaves atomically if each of them, independently from the others, is written by a single process and satisfies the “no new/old inversion” property.

4.3.5 From very weak to very strong registers

To summarize, there are different kinds of registers, and the differences depend on several dimensions. It is appealing to ask whether registers of strong kinds can be constructed in software (emulated) using registers of weak kinds. As pointed out in the introduction of this chapter, and this might look surprising, it is indeed possible to emulate a multi-valued MWMR atomic register using binary 1W1R safe registers. The next sections are devoted to proving that.

In general, what we call a (register) transformation is an algorithm that builds a register R with certain properties, called a high-level register, from other registers featuring different properties. These registers, used in the implementation, are called low-level or base registers. Of course, for a transformation to be interesting, the base registers it uses have to provide weaker properties than the high-level register we want to construct. Typically:


• The base registers are safe (resp., regular) while the high-level register is regular (resp., atomic).

• The base registers are 1W1R (resp., 1WMR) while the constructed register is 1WMR (resp., MWMR).

• The base registers are binary whereas the high-level register is multi-valued.

The transformations also vary according to the number and size of the base registers considered. Basically:

• The number of base registers needed to build the high-level register might or might not depend on the total number of processes in the system, i.e., readers and writers.

• The amount of information used to build the high-level register might be bounded or not. Sometimes, the transformation algorithm uses sequence numbers that can grow arbitrarily and is then inherently unbounded. In general, and except for a few constructions, bounded transformations are much more difficult to design and prove correct than unbounded ones. From a complexity point of view, bounded ones are better.

We proceed as follows.

1. We illustrate the notion of a transformation algorithm by first presenting two simple (bounded) algorithms. The first constructs a 1WMR safe register out of a number of 1W1R safe registers. The second builds a binary 1WMR regular register out of a binary 1WMR safe register. The combination of these algorithms already shows that we can implement a binary 1WMR regular register using a number of binary 1W1R safe registers.

2. We then show how to transform a binary 1WMR register that provides certain semantics (safe, regular or atomic) into a multi-valued 1WMR register that preserves the same semantics. The three transformations we present here are all bounded. The combination of the second of these with those above shows that we can implement a multi-valued 1WMR regular register using a number of binary 1W1R safe registers.

3. We finally show how to transform a 1W1R regular register into a MWMR atomic register. We go through three intermediate (unbounded) transformations: from a 1W1R regular register into a 1W1R atomic register, then to a 1WMR atomic register, and finally to a MWMR register. These, with the combination pointed out above, show that we can construct a multi-valued MWMR atomic register using binary 1W1R safe registers.

4.4 Two simple bounded transformations

This section describes two very simple bounded transformations. We focus on safe and regular registers. (Recall that these kinds of registers are defined for systems with a single writer for each register.) The first transformation extends a single-reader register, be it safe or regular, to multiple readers. The second transformation transforms a shared safe bit into a regular one.


4.4.1 Safe/regular registers: from single reader to multiple readers

We present here an algorithm that implements a 1WMR safe (resp., regular) register using 1W1R safe (resp., regular) registers. In short, the transformation allows for multiple readers instead of a single reader. Not surprisingly, the idea is to emulate the multi-reader register using several single-reader registers.

The transformation, described in Figure 4.4, is very simple. The constructed register R is built from n 1W1R base registers, denoted REG[1 : n], one per reader process. (We consider a system of n processes, all of which are potential readers.) A reader pi reads the base register REG[i] it is associated with, while the single writer writes all the base registers (in any order).

It is important to see that this transformation is bounded: it uses no additional control information beyond the actual value stored, and each base register can be of the same size (measured in number of bits) as the multi-reader register we want to build.

Interestingly, with the same algorithm, if the base 1W1R registers are regular, then the resulting 1WMR register is regular.

operation R.write(v):
    for all j in {1, . . . , n} do REG[j] ← v end do;
    return()

operation R.read() issued by pi:
    return(REG[i])

Figure 4.4: From 1W1R safe/regular to 1WMR safe/regular (bounded transformation)
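The following Java sketch mirrors Figure 4.4 (it is not the book's code; indices are 0-based here). The class BaseRegister is a hypothetical placeholder for a base 1W1R safe or regular register: a plain, unsynchronized field deliberately offers no atomicity guarantee. Later sketches in this chapter reuse this placeholder.

class BaseRegister<T> {
    private T value;
    T read() { return value; }
    void write(T v) { value = v; }
}

class MultiReaderRegister<T> {
    private final BaseRegister<T>[] reg;  // REG[1 : n], one base register per reader

    @SuppressWarnings("unchecked")
    MultiReaderRegister(int n, T initial) {
        reg = (BaseRegister<T>[]) new BaseRegister[n];
        for (int j = 0; j < n; j++) {
            reg[j] = new BaseRegister<>();
            reg[j].write(initial);
        }
    }

    // The single writer writes all base registers, in any order.
    void write(T v) {
        for (BaseRegister<T> r : reg) r.write(v);
    }

    // Reader pi only ever reads its own base register reg[i].
    T read(int i) {
        return reg[i].read();
    }
}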

Theorem 6 Given one base safe (resp., regular) 1W1R register per reader, the algorithm described in Figure 4.4 implements a 1WMR safe (resp., regular) register.

Proof Assume first that the base registers are safe 1W1R registers. It follows directly from the algorithm that a read of R (i.e., R.read()) that is not concurrent with a R.write() operation obtains the last value deposited in the register R. The obtained register R is consequently safe while being 1WMR.

Let us now consider the case where the base registers are regular. We will argue that the high-level register R constructed by the algorithm is a 1WMR regular one. The fact that R allows for multiple readers is by construction. Because a regular register is safe, and by the argument above (for the case where the base registers are safe), we only need to show that a read operation R.read() that is concurrent with one or more write operations R.write(v), R.write(v′), etc., returns one of the values v, v′, ... written by these concurrent write operations, or the value of R before these write operations.

Let pi be any process that reads some value from R. When pi reads the base register REG[i] while executing operation R.read(), pi obtains (a) the value of a concurrent write on this base register REG[i] (if any) or (b) the last value written on REG[i] before such concurrent write operations. This is because the base register REG[i] is itself regular. In case (a), the value v obtained is from a R.write(v) that is concurrent with the R.read() of pi. In case (b), the value v obtained can either be (b.1) from a R.write(v) that is concurrent with the R.read() of pi, or (b.2) the last value written by a R.write() before the R.read() of pi. It follows that the constructed register R is regular. □Theorem 6

It is important to see that the algorithm of Figure 4.4 does not implement a 1WMR atomic register even when every base register REG[i] is a 1W1R atomic register. Roughly speaking, this is because the transformation can cause a new/old inversion problem, even if the base registers preclude these.


[Figure 4.5: to execute R.write(2) (between inv[R.write(2)] and resp[R.write(2)]), the writer pw performs REG[1] ← 2 and then REG[2] ← 2. Concurrently, p1 returns REG[1] (obtaining 2) and, later, p2 returns REG[2] (obtaining 1).]

Figure 4.5: A counter-example

To show this, let us consider the counter-example described in Figure 4.5. The example involves one writer pw and two readers p1 and p2. Assume the register R implemented by the algorithm initially contains the value 1 (which means that we initially have REG[1] = REG[2] = 1). To write value 2 in R, the writer first executes REG[1] ← 2 and then REG[2] ← 2. The duration of these two write operations on the base registers can be arbitrarily long. (Remember that processes are asynchronous and there is no bound on their execution speed.) Concurrently, p1 reads REG[1] and returns 2, while later (as indicated on the figure) p2 reads REG[2] and returns 1. The linearization order on the two base atomic registers is depicted on the figure (bold dots). It is easy to see that, from the point of view of the constructed register R, there is a new/old inversion: p1 reads first and obtains the new value, while p2 reads after p1 and obtains the old value. The constructed register is consequently not atomic.

4.4.2 Binary multi-reader registers: from safe to regular

The aim here is to build a regular binary register from a safe binary register, i.e., to construct a regular bit out of a safe one. As we shall see, the algorithm is very simple, precisely because the register to be implemented, R, can only contain one out of two values (0 or 1).

The difference between a safe and a regular register is only visible in the face of concurrency. That is, the value to be returned in the regular case has to be a value concurrently written or the last value written, whereas no such restriction exists for safe semantics. The fact that we only consider a shared bit means however that the issue to be addressed is restricted: only one out of two values can be returned anyway. To illustrate the issue, assume that the regular register is directly implemented using a safe base register: every read (resp., write) on the high-level register is directly translated into a read (resp., write) on the base (safe) register. Assume this register has value 0 and there is a write operation that writes the very same value 0. As the base register is only safe, it is possible that a concurrent read operation obtains value 1, which might have never been written.

The way to fix this problem is to preclude the actual writing in the base register when the value to be written in the high-level register is the same as the value previously written. If the value to be written is different from the previous value, then it is okay to write the value in the base register: a concurrent read can obtain the other value (remember that only two values are possible) and this is fine with the regularity semantics. With this strategy, a read operation concurrent with one or more write operations obtains the value before these write operations or the value written by one of these operations.


The transformation algorithm is described in Figure 4.6. Besides a safe register REG shared between the reader and the writer, the algorithm requires that the writer use a local register prev_val that contains the previous value written in the base safe register REG. Before writing a value v in the high-level regular register, the writer checks whether this value v is different from the value in prev_val and, only in that case, v is written in REG.

operation R.write(v):
    if (prev_val ≠ v) then REG ← v; prev_val ← v end if;
    return()

operation R.read() issued by pi:
    return(REG)

Figure 4.6: From a binary safe to a binary regular register (bounded transformation)
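A Java rendering of Figure 4.6, reusing the hypothetical BaseRegister placeholder introduced after Figure 4.4; the writer-local field prevVal plays the role of prev_val:

class RegularBit {
    private final BaseRegister<Boolean> reg;  // the base 1WMR safe bit
    private boolean prevVal;                  // prev_val: writer-local copy of the last value written

    RegularBit(boolean initial) {
        reg = new BaseRegister<>();
        reg.write(initial);
        prevVal = initial;
    }

    // Write to the base bit only when the value actually changes.
    void write(boolean v) {
        if (prevVal != v) {
            reg.write(v);
            prevVal = v;
        }
    }

    boolean read() {
        return reg.read();
    }
}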

Theorem 7 Given a 1WMR binary safe register, the algorithm described in Figure 4.6 implements a 1WMR binary regular register.

Proof The proof is an immediate consequence of the following facts. (1) As the underlying base register is safe, a read that is not concurrent with a write obtains the last written value. (2) As the underlying base register always alternates between 0 and 1, a read concurrent with one or more write operations obtains the value of the base register before these write operations or one of the values written by such a write operation.

□Theorem 7

As we pointed out in the overview description above, the transformation heavily exploits the fact that the constructed register R can only contain one out of two possible values (0 or 1). It does not work for multi-valued registers. The transformation does not implement an atomic register either, as it does not prevent a new/old inversion. Notice also that if the safe base binary register is 1W1R, then the algorithm implements a 1W1R regular binary register.

4.5 From binary to b-valued registers

This section presents three transformations from binary registers to multi-valued registers. These are called b-valued registers in the sense that their value range contains b distinct values; we assume that b > 2. Our transformations have three characteristics.

1. Although we assume that the base binary registers (bits) are 1WMR registers and we transform these into 1WMR b-valued registers, our algorithms can also be used to transform 1W1R bits into 1W1R b-valued registers; i.e., if the base bits allow only a single reader, then the same algorithm implements a 1W1R b-valued register.

2. Our transformations preserve the semantics of the base registers in the following sense: if the base bits have semantics X (safe, regular or atomic), then the resulting high-level (b-valued) registers also have semantics X (safe, regular or atomic).

3. The transformations are bounded. There is a bound on the number of base registers used, as well as on the amount of memory needed within each register.


4.5.1 From safe bits to safe b-valued registers

Overview The first algorithm we present here uses a number of safe bits in order to implement a b-valued register R. We assume that b is an integer power of 2, i.e., b = 2^B where B is an integer. It follows that (with a possible pre-encoding if the b distinct values are not the consecutive values from 0 to b − 1) the binary representation of the b-valued register R we want to build consists of exactly B bits. In a sense, any combination of B bits defines a value that belongs to the range of R (notice that this would not be true if b was not an integer power of 2).

The algorithm relies on this encoding for the values to be written in R. It uses an array REG[1 : B] of 1WMR safe bit registers to store the current value of R. Given µi = REG[i], the binary representation of the current value of R is µ1 . . . µB. The corresponding transformation algorithm is given in Figure 4.7.

operation R.write(v):
    let µ1 . . . µB be the binary representation of v;
    for all j in {1, . . . , B} do REG[j] ← µj end do;
    return()

operation R.read() issued by pi:
    for all j in {1, . . . , B} do µj ← REG[j] end do;
    let v be the value whose binary representation is µ1 . . . µB;
    return(v)

Figure 4.7: Safe register: from bits to b-valued register
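The following Java sketch mirrors Figure 4.7 (again reusing the hypothetical BaseRegister placeholder; indices are 0-based, and values are taken in {0, ..., 2^B − 1}):

class SafeBValuedRegister {
    private final BaseRegister<Boolean>[] reg;  // REG[1 : B]: one safe bit per binary digit
    private final int bits;                     // B

    @SuppressWarnings("unchecked")
    SafeBValuedRegister(int bits) {
        this.bits = bits;
        reg = (BaseRegister<Boolean>[]) new BaseRegister[bits];
        for (int j = 0; j < bits; j++) {
            reg[j] = new BaseRegister<>();
            reg[j].write(false);
        }
    }

    // Write the binary representation of v, one safe bit at a time.
    void write(int v) {
        for (int j = 0; j < bits; j++) {
            reg[j].write(((v >> j) & 1) == 1);
        }
    }

    // Read all B bits and reassemble the value.
    int read() {
        int v = 0;
        for (int j = 0; j < bits; j++) {
            if (reg[j].read()) v |= (1 << j);
        }
        return v;
    }
}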

Space complexity As B = log2(b), the memory cost of the algorithm is logarithmic with respect to the size of the value range of the constructed register R. This follows from the binary encoding of the values of the high-level register R.

Theorem 8 Given b = 2^B and B 1WMR safe bits, the algorithm described in Figure 4.7 implements a 1WMR b-valued safe register.

Proof A read of R that does not overlap a write of R obtains the binary representation of the last value written into R, and is consequently safe to return. A read of R that overlaps a write of R can obtain any of the b possible values whose binary encoding uses B bits. As every possible combination of the B base bit registers represents one of the b values that R can potentially contain (this is because b = 2^B), it follows that a read concurrent with a write operation returns a value that belongs to the range of R. Consequently, R is a b-valued safe register. □Theorem 8

It is interesting to notice that this algorithm does not implement a regular register R even when the base bits are regular. A read that overlaps a write operation that changes the value of R from 0 . . . 0 to 1 . . . 1 (in binary representation) can return any value, i.e., even one that was never written. The reader (the human, not the process) can check that requiring a specific order according to which the array REG[1 : B] is accessed does not overcome this issue.

4.5.2 From regular bits to regular b-valued registers

Overview A way to build a 1WMR regular b-valued register R from regular bits is to employ unary encoding. Considering an array REG[1 : b] of 1WMR regular bits, the value v ∈ [1..b] is represented by 0s in bits 1 through v − 1 and 1 in the vth bit.


The algorithm is described in Figure 4.8. The key idea is to write into the array REG[1 : b] in one direction, and to read it in the opposite direction. To write v, the writer first sets REG[v] to 1, and then “cleans” the array REG. The cleaning consists in setting the bits REG[v − 1] down to REG[1] to 0. To read, a reader traverses the array REG[1 : b] starting from its first entry (REG[1]) and stops as soon as it discovers an entry j such that REG[j] = 1. The reader then returns j as the result of the read operation. It is important to see that a read operation starts by reading the “cleaned” part of the array, whereas the writing is performed in the opposite direction, from v − 1 down to 1.

It is also important to notice that, even when no write operation is in progress, it is possible that several entries of the array are equal to 1. The value represented by the array is then the value v such that REG[v] = 1 and REG[j] = 0 for all entries 1 ≤ j < v. Those entries are then the only meaningful entries. The other entries can be seen as partial evidence of past values of the constructed register.

The algorithm assumes that the register R has an initial value, say v. The array REG[1 : b] is accordingly initialized, i.e., REG[j] = 0 for 1 ≤ j < v, REG[v] = 1, and REG[j] = 0 or 1 for v < j ≤ b.

operation R.write(v):
    REG[v] ← 1;
    for j from v − 1 step −1 until 1 do REG[j] ← 0 end do;
    return()

operation R.read() issued by pi:
    j ← 1;
    while (REG[j] = 0) do j ← j + 1 end do;
    return(j)

Figure 4.8: Regular register: from bits to b-valued register
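Here is a Java sketch of Figure 4.8, once more reusing the BaseRegister placeholder (0-based indices, so a value v in {0, ..., b − 1} is encoded by reg[v] = 1 with all lower entries equal to 0):

class RegularBValuedRegister {
    final BaseRegister<Boolean>[] reg;  // REG[1 : b], unary encoding

    @SuppressWarnings("unchecked")
    RegularBValuedRegister(int b, int initial) {
        reg = (BaseRegister<Boolean>[]) new BaseRegister[b];
        for (int j = 0; j < b; j++) {
            reg[j] = new BaseRegister<>();
            reg[j].write(j == initial);
        }
    }

    // Write: set reg[v] first, then clean the lower entries downwards.
    void write(int v) {
        reg[v].write(true);
        for (int j = v - 1; j >= 0; j--) {
            reg[j].write(false);
        }
    }

    // Read: ascending scan, return the first entry equal to 1.
    int read() {
        int j = 0;
        while (!reg[j].read()) j++;
        return j;
    }
}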

Two observations are in order:

1. In the writer's algorithm, once set to 1, the “last” base register REG[b] keeps that value forever. In a sense, setting this register to 1 makes it useless: the writer never writes in it again, and when it has to read it, a reader might by default consider its value to be 1.

2. The reader's algorithm does not write base registers. This means that the algorithm handles any number of readers. Of course, the base registers have to be 1WMR if there are several readers (as each reader reads the base registers), and can be 1W1R when a single reader is involved.

Space complexity The memory cost of the transformation algorithm is b base bits, i.e., it is linear with respect to the size of the value range of the constructed register R. This is a consequence of the unary encoding of these values. (Let B be the number of bits required to obtain a binary representation of a value of R; as B = log2(b), the cost of the construction is exponential with respect to this number of bits.)

Lemma 1 Consider the algorithm of Figure 4.8. Any R.read() or R.write() operation terminates. Moreover, the value v returned by a read belongs to the set {1, . . . , b}.

Proof A R.write() operation trivially terminates (as, by definition, the for loop always terminates). For the termination of the R.read() operation, let us first make the following two observations:



• At least one entry of the array REG is initially equal to 1. Moreover, it follows from the write algorithm that each time the writer changes the value of a base register REG[x] from 1 to 0, it has previously set to 1 another entry REG[y] such that x < y ≤ b.

Consequently, if the writer updates REG[x] from 1 to 0 while concurrently the reader reads REG[x] and obtains the new value 0, we can conclude that a higher entry of the array has the value 1.

• If, while its previous value is 1, the reader reads REG[x] and concurrently the writer updates REG[x] to the same value 1, the reader obtains value 1, as the base register is regular. (If the base register was only safe, the reader could obtain value 0.)

It follows from these observations that a sequential scanning of the array REG (starting at REG[1]) necessarily encounters an entry REG[v] whose reading returns 1. As the running index of the while loop starts at 1 and is increased by 1 each time the loop body is executed, it follows that the loop always terminates, and the value j it returns is such that 1 ≤ j ≤ b. □Lemma 1

Remark The previous lemma relies heavily on the fact that the high-level register R can contain only b distinct values. The lemma would no longer be true if the value range of R was unbounded. A R.read() operation could then never terminate in case the writer continuously writes increasing values. To illustrate that, consider the following scenario. Let R.write(x) be the last write operation terminated before the operation R.read(), and assume there is no concurrent write operation R.write(y) such that y < x. It is possible that, when it reads REG[x], the reader finds REG[x] = 0 because another R.write(y) operation (with y > x) updated REG[x] from 1 to 0. Now, when it reads REG[y], the reader finds REG[y] = 0 because another R.write(z) operation (with z > y) updated REG[y] from 1 to 0, and so on. The read may then never terminate.

Theorem 9 Given b 1WMR regular bits, the algorithm described in Figure 4.8 implements a 1WMR b-valued regular register.

Proof Let us first consider a read operation that is not concurrent with any write, and let v be the last written value. It follows from the write algorithm that, when R.write(v) terminates, the first entry of the array that equals 1 is REG[v] (i.e., REG[x] = 0 for 1 ≤ x ≤ v − 1). Because a read traverses the array starting from REG[1], then REG[2], etc., it necessarily reads until REG[v] and accordingly returns the value v.

[Figure 4.9 shows a read R.read() that overlaps the write operations R.write(v1), R.write(v2), ..., R.write(vm), the write R.write(v0) having terminated before the read started.]

Figure 4.9: A read with concurrent writes

Let us now consider a read operation R.read() that is concurrent with one or more write operations R.write(v1), ..., R.write(vm) (as depicted in Figure 4.9). Moreover, let v0 be the value written by the last write operation that terminated before the operation R.read() starts (or the initial value if there is no such write operation). As a read operation always terminates (Lemma 1), the number of write operations concurrent with the R.read() operation is finite. We have to show that the value v returned by R.read() is one of the values v0, v1, ..., vm. We proceed with a case analysis.

1. v < v0. No value that is both smaller than v0 and different from vx (1 ≤ x ≤ m) can be output. This is because (1) R.write(v0) has set to 0 all entries from v0 − 1 down to the first one, and only a write of a value vx can set REG[vx] to 1; and (2) as the base registers are regular, if REG[v′] is updated by a R.write(vx) operation from 0 to the same value 0, the reader cannot concurrently read REG[v′] = 1. It follows from this observation that, if R.read() returns a value v smaller than v0, then v has necessarily been written by a concurrent write operation, and consequently R.read() satisfies the regularity property.

2. v = v0. In this case, R.read() trivially satisfies the regularity property. Notice that it is possible that the corresponding write operation is some R.write(vx) such that vx = v0.

3. v > v0. From v > v0, we can conclude that the read operation obtained 0 when it read REG[v0]. As REG[v0] was set to 1 by R.write(v0), this means that there is a R.write(v′) operation, issued after R.write(v0) and concurrent with R.read(), such that v′ > v0; that operation has executed REG[v′] ← 1 and has then set to 0 at least all the registers from REG[v′ − 1] down to REG[v0]. We consider three cases.

(a) v0 < v < v′. In this case, as REG[v] has been set to 0 by R.write(v′), we can conclude that there is a R.write(v), issued after R.write(v′) and concurrent with R.read(), that updated REG[v] from 0 to 1. The value returned by R.read() is thus a value written by a concurrent write operation, and the regularity property is satisfied by R.read().

(b) v0 < v = v′. The regularity property is then trivially satisfied by R.read(), as R.write(v′) and R.read() are concurrent.

(c) v0 < v′ < v. In this case, R.read() missed the value 1 in REG[v′]. This can only be due to a R.write(v′′) operation, issued after R.write(v′) and concurrent with R.read(), such that v′′ > v′; that operation has executed REG[v′′] ← 1 and has then set to 0 at least all the registers from REG[v′′ − 1] down to REG[v′].

We are now in the same situation as the one described at the beginning of item 3, where v0 and R.write(v′) are replaced by v′ and R.write(v′′). As (a) the number of values between v0 and b is finite and (b) the read operation R.read() terminates, it follows that this case analysis eventually ends in case 3a or 3b, which completes the proof of the theorem.

□Theorem 9

A counter-example for atomicity Figure 4.10 shows that, even if all base registers are atomic, the algorithm we just presented (Figure 4.8) does not implement an atomic b-valued register.

Let us assume that b = 5 and the initial value of the register R is 3, which means that we initially have REG[1] = REG[2] = 0, REG[3] = 1 and REG[4] = REG[5] = 0. The writer issues first R.write(1) and then R.write(2). There are concurrently two read operations, as indicated on the figure. The first read operation returns value 2 while the second one returns value 1: there is a new/old inversion. The last line of the figure depicts a linearization order S of the read and write operations on the base binary registers. (As we can see, each base object taken alone is linearizable. This follows from the fact that linearizability is a local property; see the first chapter.)

[Figure 4.10: the writer executes R.write(1) (REG[1] ← 1) and then R.write(2) (REG[2] ← 1 followed by REG[1] ← 0). A first concurrent R.read() reads REG[1] = 0 and then REG[2] = 1, returning 2; a second R.read() then reads REG[1] = 1, returning 1. The last line depicts a linearization order of the operations on the base registers.]

Figure 4.10: A counter-example for atomicity

4.5.3 From atomic bits to atomic b-valued registers

As just seen, the algorithm of Figure 4.8 does not work if the goal is to build a b-valued atomic register from atomic bits. Interestingly, a relatively simple modification of its read algorithm makes that possible by preventing the new/old inversion phenomenon.

Overview The idea consists in decomposing a R.read() operation into two phases. The first phase is the same as in the read algorithm of Figure 4.8: the base registers are read in ascending order, until an entry equal to 1 is found; let j be that entry. The second phase traverses the array in the reverse direction (from j to 1) and determines the smallest entry that contains value 1: this is then returned. So, the returned value is determined by a double scan of a “meaningful” part of the REG array.

The new algorithm is given in Figure 4.11. To understand the underlying idea, let us consider the first R.read() operation depicted in Figure 4.10. After it finds REG[2] = 1, the reader changes its scanning direction. The reader then finds REG[1] = 1 and consequently returns value 1. In the figure, the second read obtains 1 in REG[1] and consequently returns 1. This shows that, in the presence of concurrency, this construction does not strive to eagerly return a value. Instead, the value v returned by a read operation has to be “validated” by an appropriate procedure, namely, all the “preceding” base registers REG[v − 1] down to REG[1] have to be found equal to 0 when rereading them.

Theorem 10 Given b 1WMR atomic bits, the algorithm described in Figure 4.11 implements a 1WMR atomic b-valued register.

Proof The proof consists of two parts: (1) we first show that the implemented register is regular, and then (2) we show that it does not allow for new/old inversions. Applying Theorem 5 then proves that the constructed register is a 1WMR atomic register.


operation R.write(v):
    REG[v] ← 1;
    for j from v − 1 step −1 until 1 do REG[j] ← 0 end do;
    return()

operation R.read() issued by pi:
        j_up ← 1;
(1)     while (REG[j_up] = 0) do j_up ← j_up + 1 end do;
(2)     j ← j_up;
(3)     for j_down from j_up − 1 step −1 until 1 do
(4)         if (REG[j_down] = 1) then j ← j_down end if
        end do;
        return(j)

Figure 4.11: Atomic register: from bits to b-valued register
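In the Java sketch given after Figure 4.8, the double scan amounts to replacing the read() method as follows (the base bits now being assumed atomic; the comments refer to the numbered lines of Figure 4.11):

    // Two-phase read of Figure 4.11 (0-based indices, as in the earlier sketch).
    int read() {
        int jUp = 0;
        while (!reg[jUp].read()) jUp++;                   // (1) ascending scan, as in Figure 4.8
        int j = jUp;                                      // (2)
        for (int jDown = jUp - 1; jDown >= 0; jDown--) {  // (3) descending scan
            if (reg[jDown].read()) j = jDown;             // (4) keep the smallest entry read as 1
        }
        return j;
    }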


Let us first show that the implemented register is regular. Let R.read() be a read operation and j the value it returns. We consider two cases:

• j = j_up (j is determined at line 2). The value returned is then the same as the one returned by the algorithm described in Figure 4.8. It follows from Theorem 9 that the value read is then either the value of the last preceding write or the new value of an overlapping write.

• j < j_up (j is determined at line 4; let us observe that, due to the construction, the case j > j_up cannot happen). In that case, the read found REG[j] = 0 during the ascending loop (line 1), and REG[j] = 1 during the descending loop (line 4). Due to the atomicity of the base register REG[j], this means that a write operation has written REG[j] = 1 between these two readings of that base atomic register. It follows that the value j returned has been written by a concurrent write operation.

[Figure 4.12: pw executes R.write(v) and then R.write(v′); pr executes r1 = R.read() and then r2 = R.read(), both reads being concurrent with R.write(v′).]

Figure 4.12: There is no new/old inversion

To show that there is no new/old inversion, let us consider Figure 4.12. There are two write operations, and two read operations r1 and r2 that are concurrent with the second write operation. (Whether the read operations are issued by the same process or by different processes is unimportant for the proof.) As the constructed register R is regular, both read operations can return v or v′. If the first read operation r1 returns v, the second read can return either v or v′ without entailing a new/old inversion. So, let us consider the case where r1 returns v′. We show that the second read r2 returns v′′, where v′′ is v′ or a value written by a more recent write concurrent with this read. If v′′ = v′, then there is no new/old inversion. So, let us consider v′′ ≠ v′. As r1 returns v′, r1 has sequentially read REG[v′] = 1 and then REG[v′ − 1] = 0 down to REG[1] = 0 (lines 2-4). Moreover, r2 starts after r1 has terminated (r1 →H r2 in the associated history H).

1. v′′ < v′. In that case, a write operation has written REG[v′′] = 1 after r1 has read REG[v′′] = 0 (at line 4) and before r2 reads REG[v′′] = 1 (at line 2 or 4), with 1 ≤ v′′ < v′. It follows that this write operation is after R.write(v′) (there is a single sequential writer, and r1 returns v′). Consequently, r2 obtains a value newer than v′, hence newer than v: there is no new/old inversion.

2. v′′ > v′. In that case, r2 has read 1 from REG[v′′] and then 0 from REG[v′] (line 4). As r1 terminates (reading REG[v′] = 1 and returning v′) before r2 starts, and write operations are sequential, it follows that there is a write operation, issued after R.write(v′), that has updated REG[v′] from 1 to 0.

(a) If that operation is R.write(v′′), we conclude that the value v′′ read by r2 is newer than v′, and there is no new/old inversion.

(b) If that operation is not R.write(v′′), it follows that there is another operation R.write(v′′′), such that v′′′ > v′, that has updated REG[v′] from 1 to 0; that update has been issued after R.write(v′) (which set REG[v′] to 1) and before r2 reads REG[v′] = 0.

Moreover, R.write(v′′′) is before R.write(v′′) (otherwise, the update of REG[v′] from 1 to 0 would have been done by R.write(v′′)).

It follows that R.write(v′′′) is after R.write(v′) and before R.write(v′′), from which we conclude that v′′ is newer than v′, proving that there is no new/old inversion.

□Theorem 10

4.6 Three (unbounded) atomic register implementations

So far, none of our algorithms implements an atomic register out of a non-atomic register. (Moreover, the one that implements atomic registers implements a b-valued register out of a binary atomic one.) In the following, we present algorithms that implement unbounded atomic registers. Such registers can contain any number of distinct values.

We present three algorithms. All use the notion of a sequence number. In short, this notion represents a concept of logical time. A sequence number is associated with each write operation, and these numbers induce a total order on the write operations; this total order is then exploited to ensure atomicity. These numbers are written in the base registers, which also means that such registers are unbounded, because the space of sequence numbers is the set of natural numbers.

More specifically, in the algorithms presented in this section, a base register is made up of several fields, namely:

• A data part intended to contain the value v of the constructed high-level register R.

• A control part containing a sequence number and possibly a process identity. The sequence number values grow proportionally with the number of write operations, and are consequently unbounded.

Two observations are in order before diving into the details of these algorithms.


1. As we pointed out, the high-level register we construct can be unbounded and might contain, at different points in time, an arbitrary number of distinct values. The fact that the high-level register can be unbounded does not mean it has to be: this depends on the application that uses this high-level register and on the operations that access it.

2. The base registers are unbounded, for they contain arbitrarily large sequence numbers. In fact, there are techniques to recycle sequence numbers and bound these constructions. These techniques are, however, quite involved and we do not present them here.

4.6.1 1W1R registers: From unbounded regular to atomic

We show in the following how to implement a 1W1R atomic register using a 1W1R regular register. The use of sequence numbers makes such a construction easy and, in particular, helps prevent the new/old inversion phenomenon. Preventing this, while preserving regularity, means, by Theorem 5, that the constructed register is atomic.

The algorithm is described in Figure 4.13. Exactly one base regular register REG is used in the implementation of the high-level register R. The local variable sn at the writer is used to hold sequence numbers. It is incremented for every new write in R. The scope of the local variable aux used by the reader spans a read operation; it is made up of two fields: a sequence number (aux.sn) and a value (aux.val).

Each time it writes a value v in the high-level register R, the writer writes the pair [sn, v] in the base regular register REG. The reader manages two local variables: last_sn stores the greatest sequence number it has ever read in REG, and last_val stores the corresponding value. When it wants to read R, the reader first reads REG, and then compares last_sn with the sequence number it has just read in REG. The value with the highest sequence number is the one returned by the reader, and this prevents new/old inversions.

operation R.write(v):
    sn ← sn + 1;
    REG ← [sn, v];
    return()

operation R.read():
    aux ← REG;
    if (aux.sn > last_sn) then last_sn ← aux.sn; last_val ← aux.val end if;
    return(last_val)

Figure 4.13: From regular to atomic: unbounded construction
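The following Java sketch mirrors Figure 4.13 (reusing the hypothetical BaseRegister placeholder, here standing for the single 1W1R regular base register): the base register holds a [sn, v] pair, and the reader keeps the pair with the highest sequence number it has seen so far.

class AtomicFromRegular<T> {
    // The [sn, v] pair written in the base register REG.
    private record Pair<U>(long sn, U val) {}

    private final BaseRegister<Pair<T>> reg = new BaseRegister<>();  // REG: 1W1R regular
    private long sn = 0;        // writer-local sequence number
    private long lastSn = 0;    // reader-local: greatest sequence number read so far
    private T lastVal;          // reader-local: the corresponding value

    AtomicFromRegular(T initial) {
        reg.write(new Pair<>(0L, initial));
        lastVal = initial;
    }

    void write(T v) {
        sn = sn + 1;
        reg.write(new Pair<>(sn, v));
    }

    // Return the value with the highest sequence number seen so far;
    // this is what rules out new/old inversions.
    T read() {
        Pair<T> aux = reg.read();
        if (aux.sn() > lastSn) {
            lastSn = aux.sn();
            lastVal = aux.val();
        }
        return lastVal;
    }
}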

Theorem 11 Given an unbounded 1W1R regular register, the algorithm described in Figure 4.13 constructs a 1W1R atomic register.

Proof We associate with each read operation r of the high-level register R the sequence number (denoted sn(r)) of the value returned by r: this is possible as the base register is regular, and consequently a read always returns a value that has been written together with its sequence number, that value being the last written value or a concurrently written value (if any). Considering an arbitrary history H of register R, we show that H is atomic by building an equivalent sequential history S that is legal and respects the partial order on the operations defined by →H.


S is built from the sequence numbers associated with the operations. First, we order all the write operations according to their sequence numbers. Then, we order each read operation just after the write operation that has the same sequence number. If two read operations have the same sequence number, we order first the one whose invocation event is first. (Remember that we consider a 1W1R register.)

The history S is trivially sequential, as all the operations are placed one after the other. Moreover, S is equivalent to H, as it is made up of the same operations. S is trivially legal, as each read follows the corresponding write operation. We now show that S respects →H.

• For any two write operations w1 and w2, we have either w1 →H w2 or w2 →H w1. This is because there is a single writer and it is sequential: as the variable sn is increased by 1 between two consecutive write operations, no two write operations have the same sequence number, and these numbers agree on the occurrence order of the write operations. As the total order on the write operations in S is determined by their sequence numbers, it consequently follows their total order in H.

• Let op1 be a write or a read operation, and op2 be a read operation, such that op1 →H op2. It follows from the algorithm that sn(op1) ≤ sn(op2) (where sn(op) is the sequence number of the operation op). The ordering rule guarantees that op1 is ordered before op2 in S.

• Let op1 be a read operation and op2 a write operation such that op1 →H op2. Similarly to the previous item, we then have sn(op1) < sn(op2), and consequently op1 is ordered before op2 in S (which concludes the proof).

□ (Theorem 11)

One might think of a naive extension of the previous algorithm to construct a 1WMR atomic register from base 1W1R regular registers. Indeed, we could, at first glance, consider an algorithm associating one 1W1R regular register per reader, and have the writer write into all of them, each reader reading its dedicated register. Unfortunately, a fast reader might see a new concurrently written value, whereas a reader that comes later sees the old value. This is because the second reader does not know about the sequence number and the value returned by the first reader; the latter stores them only locally. In fact, this can happen even if the base 1W1R registers are atomic. The construction of a 1WMR atomic register from base 1W1R atomic registers is addressed in the next section.

4.6.2 Atomic registers: from unbounded 1W1R to 1WMR

We presented in Section 4.4.1 an algorithm that builds a 1WMR safe/regular register from similar 1W1R base registers. We also pointed out that the corresponding construction does not build a 1WMR atomic register even when the base registers are 1W1R atomic (see the counter-example presented in Figure 4.5).

This section describes such an algorithm: assuming 1W1R atomic registers, it shows how to go from single-reader registers to a multi-reader register. This algorithm uses sequence numbers, and requires unbounded base registers.

Overview As there are now several possible readers, namely n, we make use of several (n) base 1W1R atomic registers: one per reader. The writer writes in all of them. It writes the value as well as a sequence number. The algorithm is depicted in Figure 4.14.

We prevent new/old inversions (Figure 4.5) by having the readers “help” each other. The helping is achieved using an array HELP[1 : n, 1 : n] of 1W1R atomic registers. Each register contains a pair (sequence number, value) created and written by the writer in the base registers. More specifically, HELP[i, j] is a 1W1R atomic register written only by pi and read only by pj. It is used as follows to ensure the atomicity of the high-level 1WMR register R that is constructed by the algorithm.

• Help the others. Just before returning the value v it has determined (we discuss how this is achieved in the second bullet below), reader pi helps every other process (reader) pj by indicating to pj the last value pi has read (namely v) and its sequence number sn. This is achieved by having pi update HELP[i, j] with the pair [sn, v]. This, in turn, prevents pj from returning in the future a value older than v, i.e., a value whose sequence number would be smaller than sn.

• Helped by the others. To determine the value returned by a read operation, a reader pi first computes the greatest sequence number that it has ever seen in a base register. This computation involves all the 1W1R atomic registers that pi can read, i.e., REG[i] and HELP[j, i] for any j. Reader pi then returns the value associated with the greatest sequence number it has computed.

The corresponding algorithm is described in Figure 4.14. Variable aux is a local array used by a reader; its jth entry is used to contain the (sequence number, value) pair that pj has written in HELP[j, i] in order to help pi; aux[j].sn and aux[j].val denote the corresponding sequence number and the associated value, respectively. Similarly, reg is a local variable used by a reader pi to contain the last (sequence number, value) pair that pi has read from REG[i] (reg.sn and reg.val denote the corresponding fields).

Register HELP[i, i] is used only by pi, which can consequently keep its value in a local variable. This means that the 1W1R atomic register HELP[i, i] can be used to contain the 1W1R atomic register REG[i]. It follows that the protocol requires exactly n² base 1W1R atomic registers.

operation R.write(v):
  sn ← sn + 1;
  for all j in {1, . . . , n} do REG[j] ← [sn, v] end do;
  return()

operation R.read() issued by pi:
  reg ← REG[i];
  for all j in {1, . . . , n} do aux[j] ← HELP[j, i] end do;
  let sn_max be max(reg.sn, aux[1].sn, . . . , aux[n].sn);
  let val be reg.val or aux[k].val such that the associated sequence number is sn_max;
  for all j in {1, . . . , n} do HELP[i, j] ← [sn_max, val] end do;
  return(val)

Figure 4.14: Atomic register: from one reader to multiple readers (unbounded construction)
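
The following Python sketch transcribes Figure 4.14 (the names are ours). The base registers REG[i] and HELP[j][i] are modeled as plain cells of one object; tuples (sn, value) are compared by their sequence number, which is exactly how the algorithm selects the freshest pair.

class HelpingAtomic1WMR:
    def __init__(self, n, initial):
        self.n = n
        self.REG = [(0, initial)] * n     # REG[i]: written by the writer, read by reader i
        self.HELP = [[(0, initial)] * n   # HELP[i][j]: written by reader i,
                     for _ in range(n)]   #             read by reader j
        self.sn = 0                       # writer-local sequence number

    def write(self, v):
        self.sn += 1
        for j in range(self.n):
            self.REG[j] = (self.sn, v)

    def read(self, i):                    # issued by reader i
        pairs = [self.REG[i]] + [self.HELP[j][i] for j in range(self.n)]
        sn_max, val = max(pairs)          # pair with the greatest sequence number
        for j in range(self.n):          # help the other readers
            self.HELP[i][j] = (sn_max, val)
        return val

# sequential sanity check
M = HelpingAtomic1WMR(3, 0)
M.write(5)
assert M.read(1) == 5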

Theorem 12 Given n² unbounded 1W1R atomic registers, the algorithm described in Figure 4.14 implements a 1WMR atomic register.

Proof As for Theorem 11, the proof consists in showing that the sequence numbers determine a linearization of any history H.

Considering a history H of the constructed register R, we first build an equivalent sequential history S by ordering all the write operations according to their sequence numbers, and then inserting the read operations as in the proof of Theorem 11. This history is trivially legal, as each read operation is ordered just after the write operation that wrote the value that is read. A reasoning similar to the one used in Theorem 11, but based on the sequence numbers provided by the arrays REG[1 : n] and HELP[1 : n, 1 : n], shows that S respects →H. □ (Theorem 12)


4.6.3 Atomic registers: from unbounded 1WMR to MWMR

This section shows how to use sequence numbers to build a MWMR atomic register from n 1WMR atomic registers (where n is the number of writers). The algorithm is simpler than the previous one. An array REG[1 : n] of n 1WMR atomic registers is used in such a way that pi is the only process that can write in REG[i], while any process can read it. Each register REG[i] stores a (sequence number, value) pair. Variables X.sn and X.val are again used to denote the sequence number field and the value field of the register X, respectively. Each REG[i] is initialized to the same pair, namely [0, v0], where v0 is the initial value of R.

The problem we solve here consists in allowing the writers to totally order their write operations. To that end, a write operation first computes the highest sequence number that has been used, and takes the next number as the sequence number of its own write. Unfortunately, this does not prevent two distinct concurrent write operations from associating the same sequence number with their respective values. A simple way to cope with this problem consists in associating a timestamp with each value, where a timestamp is a pair made up of a sequence number plus the identity of the process that issues the corresponding write operation.

The timestamping mechanism can be used to define a total order on all the timestamps as follows. Let ts1 = [sn1, i] and ts2 = [sn2, j] be any two timestamps. We have:

ts1 < ts2  ≜  (sn1 < sn2) ∨ (sn1 = sn2 ∧ i < j).

The corresponding construction is described in Figure 4.15. The meaning of the additional local variables that are used is, we believe, clear from the context.

operation R.write(v) issued by pi:
  for all j in {1, . . . , n} do reg[j] ← REG[j] end do;
  let sn_max be max(reg[1].sn, . . . , reg[n].sn) + 1;
  REG[i] ← [sn_max, v];
  return()

operation R.read() issued by pi:
  for all j in {1, . . . , n} do reg[j] ← REG[j] end do;
  let k be the process identity such that [sn, k] is the greatest timestamp
    among the n timestamps [reg[1].sn, 1], . . . , [reg[n].sn, n];
  return(reg[k].val)

Figure 4.15: Atomic register: from one writer to multiple writers (unbounded construction)
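
Here is a matching Python sketch of Figure 4.15 (the class name is ours). Each entry of REG holds a (sn, id, value) triple, so that comparing the (sn, id) prefix of two entries implements exactly the timestamp order defined above.

class TimestampAtomicMWMR:
    def __init__(self, n, initial):
        self.n = n
        self.REG = [(0, 0, initial)] * n   # REG[i]: written only by process pi

    def write(self, i, v):                 # issued by process pi
        sn_max = max(sn for (sn, _, _) in self.REG) + 1
        self.REG[i] = (sn_max, i, v)       # timestamp [sn_max, i] plus the value

    def read(self):
        sn, k, val = max(self.REG)         # greatest timestamp (sn first, then id) wins
        return val

# two writers, sequentially: the later write carries the greater timestamp
W = TimestampAtomicMWMR(2, 0)
W.write(0, "a")
W.write(1, "b")
assert W.read() == "b"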

Theorem 13 Given n unbounded 1WMR atomic registers, the algorithm described in Figure 4.15 implements a MWMR atomic register.

Proof Again, we show that the timestamps define a linearization of any history H. Considering a history H of the constructed register R, we first build an equivalent sequential history S by ordering all the write operations according to their timestamps, and then inserting the read operations as in Theorem 11. This history is trivially legal, as each read operation is ordered just after the write operation that wrote the read value. Finally, a reasoning similar to the one used in Theorem 11, but based on timestamps, shows that S respects →H. □ (Theorem 13)


4.7 Concluding remark

We have shown in Section 4.6 how to build a MWMR atomic register from unbounded 1W1R regular registers. All these constructions use sequence numbers. The only transformation from safe to regular that has been presented concerns the case of binary registers (Section 4.4.2). At least three questions are natural to ask:

• How to implement a 1W1R atomic bit from a bounded number of 1W1R safe bits? This question is of independent interest and is addressed in the next chapter.

• How to implement a 1W1R atomic register from a bounded number of 1W1R safe bits? This question is also of independent interest and is addressed later in the book.

• How to implement a MWMR atomic register from bounded 1W1R atomic registers? This question is not addressed in this book.

4.8 Bibliographic notes

The notions of safe, regular and atomic registers have been introduced by Lamport [13]. Theorem 11, and the algorithms described in Figures 4.4, 4.6, 4.7 and 4.8, are due to Lamport [13]. The algorithm described in Figure 4.11 is due to Vidyasankar [44]. The algorithms described in Figures 4.14 and 4.15 are due to Vitanyi and Awerbuch [48].

The wait-free construction of stronger registers from weaker registers has always been an active research area. The interested reader can consult the following (non-exhaustive!) list, where numerous algorithms are presented and analyzed [49, 50, 51, 52, 53, 41, 42, 43, 45, 46, 47].


Chapter 5

From safe bits to atomic bits: an optimal construction

5.1 Introduction

In the previous chapter, we introduced the notions of safe, regular and atomic (linearizable) read/write objects (also called registers). In the case of a 1W1R (one writer, one reader) register, assuming that there is no concurrency between the reader and the writer, the notions of safety, regularity and atomicity are equivalent. This is no longer true in the presence of concurrency. Several bounded constructions have been described for concurrent executions. Each construction implements a stronger register from a collection of weaker base registers. We have seen the following constructions:

• From a safe bit to a regular bit. This construction improves on the quality of the base object with respect to concurrency. Contrary to the base safe bit, a read operation on the constructed regular bit never returns an arbitrary value in the presence of concurrent write operations.

• From a bounded number of safe (resp., regular or atomic) bits to a safe (resp., regular or atomic) b-valued register. These constructions improve on the quality of each base object as measured by the number of values it can store. They show that “small” base objects can be composed to provide “bigger” objects that have the same behavior in the presence of concurrency.

To complete the global picture, one bounded construction that improves on the quality in the presence of concurrency is still missing, namely, a construction of an atomic bit from regular bits. This construction is fundamental, as an atomic bit is the simplest nontrivial object that can be defined in terms of sequential executions. Even if an execution on an atomic bit contains concurrent accesses, the execution still appears as its sequential linearization.

In this chapter, we first show that to construct a 1W1R atomic bit, we need at least three regular bits, two written by the writer and one written by the reader. Then we present an optimal three-bit construction of an atomic bit.

5.2 A Lower Bound Theorem

In Section 4.6.1 of Chapter 4, we presented the construction of a 1W1R atomic register from an unbounded regular register. The base regular register had to be unbounded because the construction was using sequence numbers, and the value of the base register was a pair made up of the data value of the register and the corresponding sequence number. The use of sequence numbers makes sure that new/old inversions of read operations never happen.

A fundamental question is the following: Can we build a 1W1R atomic register from a finite number of regular registers that can store only finitely many values, and can be written only by the writer (of the atomic register)?

This section first shows that such a construction is impossible, i.e., the reader must also be able to write. In other words, such a construction must involve two-way communication between the reader and the writer. Moreover, even if we only want to implement one atomic bit, the writer must be able to write in two regular base bits.

5.2.1 Digests and Sequences of Writes

Let A be any finite sequence of values in a given set. A digest of A is a shorter sequence B that “mimics” A: A and B have the same first and last elements; an element appears at most once in B; and two consecutive elements of B are also consecutive in A.

As an example, let A = v1, v2, v1, v3, v4, v2, v4, v5. The sequence B = v1, v3, v4, v5 is a digest of A. (There can be multiple digests of a given sequence.)

Every finite sequence has a digest:

Lemma 2 Let A = a1, a2, . . . , an be a finite sequence of values. For any such sequence there exists a sequence B = b1, . . . , bm of values such that:

• b1 = a1 ∧ bm = an,

• (bi = bj) ⇒ (i = j),

• ∀j : 1 ≤ j < m : ∃i : 1 ≤ i < n : bj = ai ∧ bj+1 = ai+1.

Proof The proof is a trivial induction on n. If n = 1, we have B = a1. If n > 1, let B = b1, . . . , bm be a digest of A = a1, a2, . . . , an. A digest of a1, a2, . . . , an, an+1 can be constructed as follows:
- If ∀j ∈ {1, . . . , m} : bj ≠ an+1, then b1, . . . , bm, an+1 is a digest of a1, a2, . . . , an, an+1.
- If ∃j ∈ {1, . . . , m} : bj = an+1, there is a single j such that bj = an+1 (this is because any value appears at most once in B = b1, . . . , bm). It is easy to check that b1, . . . , bj is a digest of a1, . . . , an, an+1.

□ (Lemma 2)
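
The inductive construction in this proof is easy to run; the small Python sketch below (ours, not the book's) computes a digest this way and reproduces the example given above.

def digest(seq):
    # returns a digest of the non-empty sequence seq, following the
    # inductive construction of Lemma 2
    b = [seq[0]]
    for a in seq[1:]:
        if a in b:
            b = b[:b.index(a) + 1]   # cut back to the unique occurrence of a
        else:
            b.append(a)              # extend the digest with the new element
    return b

# the example above: A = v1, v2, v1, v3, v4, v2, v4, v5
assert digest(["v1", "v2", "v1", "v3", "v4", "v2", "v4", "v5"]) == \
       ["v1", "v3", "v4", "v5"]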

Consider now an implementation of a bounded atomic 1W1R register R from a collection of bounded base 1W1R regular registers. Clearly, any execution of a write operation w that changes the value of the implemented register must consist of a sequence of writes on base registers. Such a sequence of writes triggers a sequence of state changes of the base registers, from the state before w to the state after w.

Assuming that R is initialized to 0, let us consider an execution E where the writer indefinitely alternates R.write(1) and R.write(0). Let wi, i ≥ 1, denote the i-th R.write(v) operation. This means that v = 1 when i is odd and v = 0 when i is even. Each prefix of E, denoted by E′, unambiguously determines the resulting state of each base object X, i.e., the value that the reader would obtain if it read X right after E′, assuming no concurrent writes. Indeed, since the resulting execution is sequential, there exists exactly one reading function, and we can reason about the state of each object at any point in the execution.

Each write operation w2i+1 = R.write(1), i = 0, 1, . . ., contains a sequence of writes on the base registers. Let ω1, . . . , ωx be the sequence of base writes generated by w2i+1. Let Ai be the corresponding sequence of base-register states defined as follows: its first element a0 is the state of the base registers before ω1, its second element a1 is the state of the base registers just after ω1 and before ω2, etc.; its last element ax is the state of the base registers after ωx.

Let Bi be a digest derived from Ai (by Lemma 2, such a digest sequence exists).

Lemma 3 There exists a digest B = b0, . . . , by (y ≥ 1) that appears infinitely often in B1, B2, . . ..

Proof First we observe that every digest Bi (i = 1, 2, . . .) must consist of at least two elements. Indeed, if Bi were a singleton b0, then a read operation on R applied just before wi and a read operation on R applied just after wi would observe the same state b0 of the base registers. Therefore, the reader could not decide when exactly the read operation was applied and would have to return the same value, a contradiction with the assumption that wi changes the value of R.

Since the base registers are bounded, there are finitely many different states of the base registers that can be written by the writer. Since a digest is a sequence of states of the registers written by the writer in which every state appears at most once, we conclude that there can only be finitely many digests. Thus, in the infinite sequence of digests B1, B2, . . ., some digest B (of two or more elements) must appear infinitely often. □ (Lemma 3)

Note that there is no constraint on the number of internal states of the writer. Since there may be no bound on the number of steps taken within a write operation, all the sequences Ai can be different, and the writer may never perform the same sequence of base-register operations twice. But the evolution of the base-register states in the course of Ai can be reduced to its digest Bi.

5.2.2 The Impossibility Result and the Lower Bound

Theorem 14 It is not possible to build a 1W1R atomic bit from a finite number of regular registers that can take a finite number of values and are written only by the writer.

Proof By contradiction, assume that it is possible to build a 1W1R atomic bit R from a finite set S of regular registers, each with a finite value domain, in which the reader does not update base registers.

An operation r = R.read() performed by the reader is implemented as a sequence of read operations on base registers. Without loss of generality, assume that r reads all base registers. Consider again the execution E in which the writer performs write operations w1, w2, . . ., alternating R.write(1) and R.write(0).

Since the reader does not update base registers, we can insert the complete execution of r between every two steps in E without affecting the steps of the writer. Since the base registers are regular, the value read in a base register X by the reader performing r after a prefix of E is unambiguously defined by the latest value written to X before the beginning of r. Let λ(r) denote the state of all base registers observed by r.

By Lemma 3, there exists a digest B = b0, . . . , by (y ≥ 1) that appears infinitely often in B1, B2, . . ., where Bi is the digest associated with w2i+1. Since each state in {b0, . . . , by} appears in E infinitely often, we can construct an execution E′ by inserting in E a sequence of read operations r0, . . . , ry such that for each j = 0, . . . , y, λ(rj) = by−j. In other words, in E′, the reader observes the states of base registers evolving downwards from by to b0.

By induction, we show that in E′, each rj (j = 0, . . . , y) must return 1. Initially, since λ(r0) = by and by is the state of the base registers right after some R.write(1) is complete, r0 must return 1. Inductively, suppose that rj (for some j, 0 ≤ j ≤ y − 1) returns 1 in E′.

Consider read operations rj and rj+1 (j = 0, . . . , y − 1). Recall that λ(rj) = by−j and λ(rj+1) = by−j−1. Since digest B appears in B1, B2, . . . infinitely often, E′ contains infinitely many base-register writes by which the writer changes the state of the base registers from by−j−1 to by−j. Let X be the base register changed by these writes.

Figure 5.1: Two read operations rj and rj+1 concurrent with R.write(1)

Since X is regular, we can construct an execution E′′, indistinguishable to the reader from E′, where rj and rj+1 are concurrent with a base-register write performed within R.write(1) in which the writer changes the state of the base registers from by−j−1 to by−j (Figure 5.1).

By the induction hypothesis, rj returns 1 in E′ and, thus, in E′′. Since the implemented register R is atomic and rj returns the concurrently written value 1 in E′′, rj+1 must also return 1 in E′′. But the reader cannot distinguish E′ and E′′ and, thus, rj+1 returns 1 also in E′.

Inductively, ry must return 1 in E′. But λ(ry) = b0, where b0 is the state of the base registers right after some R.write(0) is complete. Thus, ry must return 0, a contradiction. □ (Theorem 14)

Therefore, to implement a 1W1R atomic register from bounded regular registers, we must establish two-way communication between the writer and the reader. Intuitively, the reader must inform the writer that it is aware of the latest written value, which requires at least one base bit that can be written by the reader and read by the writer. But the writer must be able to react to the information read from this bit. In other words:

Theorem 15 In any implementation of a 1W1R atomic bit from regular bits, the writer must be able to write to at least 2 regular bits.

Proof Suppose, by contradiction, that there exists an implementation of a 1W1R atomic bit R in which the writer can write to exactly one base bit X.

Note that every write operation on R that changes the value of R and does not overlap with any read operation must change the state of X. Without loss of generality, assume that the first write operation w1 = R.write(1) performed by the writer in the absence of the reader changes the value of X from 0 to 1 (the corresponding digest is 0, 1).

Consider an extension of this execution in which the reader performs r1 = R.read() right after the end of w1. Clearly, r1 must return 1. Now add w2 = R.write(0) right after the end of r1. Since the state of X at the beginning of w2 is 1, the only digest generated by w2 is 1, 0.

Now add r2 = R.read() right after the end of w2, and let E be the resulting execution. Now r2 must return 0 in E. But since X is regular, E is indistinguishable to the reader from an execution in which r1 and r2 take place within the interval of w1, and thus both must return 1, a contradiction. □ (Theorem 15)

As we have seen in the previous chapter, there is a trivial bounded algorithm that constructs a regular bit from a safe bit. This algorithm only requires one additional local variable at the writer. The combination of this algorithm with Theorem 15 implies:


Corollary 1 The construction of a 1W1R atomic bit from safe bits requires at least 3 1W1R safe bits, two written by the writer and one written by the reader.

As the construction presented in the next section uses exactly 3 1W1R regular bits to build an atomic bit, it is optimal in the number of base safe bits.

5.3 From three safe bits to an atomic bit

Now we present an optimal construction of a high-level 1W1R atomic bit R from three base 1W1R safe bits. The high-level bit R is assumed to be initialized to 0. It is also assumed that each R.write(v) operation invoked by the writer changes the value of R. This is done without loss of generality, as the writer of R can locally keep a copy v′ of the last written value, and apply the next R.write(v) operation only when it modifies the current value of R.

The construction of R is presented in an incremental way.

5.3.1 Base architecture of the construction

The three base registers are initialized to 0. Then, as we will see, the read and write algorithms defining the construction are such that any write applied to a base register X changes its value. So, its successive values are 0, then 1, then 0, etc. Consequently, to simplify the presentation, a write operation on a base register X is denoted “change X”. As any two consecutive write operations on a base bit X write different values, it follows that X behaves as a regular register.

The 3 base safe bits used in the construction of the high-level atomic register R are the following:

• REG: the safe bit that, intuitively, contains the value of the atomic bit that is constructed. It is written by the writer and read by the reader.

• WR: the safe bit written by the writer to pass control information to the reader.

• RR: the safe bit written by the reader to pass control information to the writer.

5.3.2 Handshaking mechanism and the write operation

As we saw in the previous section, the reader should inform the writer when it reads a new value v in the implemented register. Otherwise, the uninformed writer may subsequently repeat the same digest of state transitions executing R.write(v), so that the reader would be subject to new/old inversion. Therefore, whenever the writer is informed that a previously written value has been read by the reader, it should change the execution so that critical digests are not repeated.

The basic idea of the construction is to use the control bits WR and RR to implement the handshaking mechanism. Intuitively, the writer informs the reader about a new value by changing the value of WR so that WR ≠ RR. Respectively, the reader informs the writer that the new value has been read by changing the value of RR so that WR = RR. With these conventions, we obtain the following handshaking protocol between the writer and the reader:

• After the writer has changed the value of the base register REG, if it observes WR = RR, it changes the value of WR.

As we can see, setting the predicate WR = RR to false is the way used by the writer to signal that a new value has been written in REG. The resulting algorithm is described in Figure 5.2.


operation R.write(v):                     % Change the value of R %
i    change REG;
ii   if WR = RR then change WR end if;    % Strive to establish WR ≠ RR %
     return()

Figure 5.2: The R.write(v) operation

• Before reading REG, the reader changes the value of RR if it observes that WR ≠ RR. This signaling is used by the writer to update WR when it discovers that the previous value has been read.

As we are going to see in the rest of this chapter, the exchange of signals through WR and RR is also used by the reader to check if the value it has found in REG can be returned.

5.3.3 An incremental construction of the read operation

The reader’s algorithm is much more involved than the writer’s algorithm. To make it easier to understand, this section presents the reader’s code in an incremental way, from simpler versions to more involved ones. In each stage of the construction, we exhibit scenarios in which a simpler version fails, which motivates a change of the protocol.

The construction: step 1 We start with the simplest construction, in which the reader establishes RR = WR and returns the value found in REG.

3   if WR ≠ RR then change RR end if;    % Strive to establish WR = RR %
4   val ← REG;
5   return(val)

We can immediately see that this version does not really use the control information: the value returned by the read operation does not depend on the states of RR and WR. Consequently, this version is subject to new/old inversions: suppose that while the writer changes the value of REG from 0 to 1 (line ii in Figure 5.2), the reader performs two read operations. The first read returns 1 (the “new” value of R) and the second read returns 0 (the “old” value), i.e., we obtain a new/old inversion.

The construction: step 2 An obvious way to prevent the new/old inversion described in the previous step is to allow the reader to return the current value of REG only if it observes that the writer has updated WR to make WR ≠ RR since the previous read operation.

1   if WR = RR then return(val) end if;
3′  change RR;                           % Strive to establish WR = RR %
4   val ← REG;
5   return(val)

Here we assume that the local variable val initially contains the initial value of R (0). Checking whether WR ≠ RR before changing RR in line 3′ looks unnecessary, since the reader does not touch the shared memory between reading WR in line 1 and line 3′, so we dropped the check for the moment.


Unfortunately, we still have a problem with this construction. When a read is executed concurrently with a write, it may happen that the read returns a concurrently written value but a subsequent read finds RR ≠ WR and returns an old value found in REG.

Indeed, consider the following scenario (Figure 5.3):

1. w1 = R.write(1) completes.

2. r1 reads WR, finds WR ≠ RR and changes RR.

3. w2 = R.write(0) begins, changes REG to 0, reads RR, finds WR = RR, changes WR, restoring the predicate WR ≠ RR, and completes.

4. w3 = R.write(1) begins and starts changing REG from 0 to 1.

5. r1 concurrently reads REG and returns the new value 1.

6. r2 = R.read() begins, finds RR ≠ WR, reads REG and returns the old value 0 (which is perfectly possible since the write operation on REG performed by w3 is not yet finished).

In other words, we obtain a new/old inversion for read operations r1 and r2.


Figure 5.3: Counter example to step 2 of the construction: new/old inversion for r1 and r2

The construction: step 3 The problem with the scenario above is that a read operation is too quick to return the new value of REG, without noticing that the writer has meanwhile changed WR. A subsequent read operation may observe RR = WR and thus return the value read in REG (line 4), which may, in case of a slow concurrent write, still be the old value.

One solution to circumvent this is to evaluate REG before changing RR. If the predicate RR = WR does not hold after RR was changed (line 3′) and REG was read again (line 4), then the reader returns the older (conservative) value of REG.

1   if WR = RR then return(val) end if;
2   aux ← REG;                           % Conservative value %
3′  change RR;                           % Strive to establish WR = RR %
4   val ← REG;
5   if WR = RR then return(val) end if;
7   return(aux)


Unfortunately, there is still a problem here. The variable val evaluated in line 4 may be too conservative to be returned by a subsequent read operation that finds RR = WR in line 1.

Again, suppose that w1 = R.write(1) is followed by a concurrent execution of r1 = R.read() and w2 = R.write(0), as follows (Figure 5.4):

1. w1 = R.write(1) completes.

2. w2 = R.write(0) begins and starts changing REG from 1 to 0.

3. r1 finds WR ≠ RR, reads 0 from REG and stores it in aux (line 2), changes RR, reads 1 from REG and stores it in val (the write operation on REG performed by w2 is still going on).

4. w2 completes its write on REG, finds RR = WR and starts changing WR.

5. r1 finds WR ≠ RR (line 5), concludes that there is a concurrent write operation, and returns the “conservative” value 0 (read in line 2).

6. r2 = R.read() begins, finds RR = WR (the write operation on WR performed by w2 is still going on), and returns 1, previously evaluated in line 4 of r1.

That is, r1 returned the new (concurrently written) value 0 while r2 returned the old value 1.


Figure 5.4: Counter example to step 3 of the construction: new/old inversion for r1 and r2

The construction: step 4 Intuitively, the problem with the algorithm above is that r1 did not realize that the “conservative” value evaluated in line 2 is in fact the concurrently written value, while the “new” value evaluated in line 4 is outdated. To fix this, before the reader decides to be conservative and return in line 7, we add one more read of REG to update the local variable val. This way, a subsequent read would also return the new value.

operation R.read():
1   if WR = RR then return(val) end if;
2   aux ← REG;
3′  change RR;
4   val ← REG;
5   if WR = RR then return(val) end if;
6   val ← REG;
7   return(aux)


But still there is a problem here. Changing RR in line 3′ without previously checking if WR = RR may create an illusion to the reader that it has established the predicate RR = WR, while in fact the predicate was invalidated by a concurrent write.

Consider the following execution (Figure 5.5):

1. w1 = R.write(1) begins, changes REG to 1, finds RR = WR, and starts changing WR to 1.

2. r1 = R.read() begins, observes RR ≠ WR in line 1, reads 1 from REG in line 2, changes RR to 1, reads 1 from REG in line 4, and returns 1 in line 5.

3. r2 = R.read() begins and finds WR = RR (the write on WR performed by w1 is still going on).

4. w1 finishes changing WR to 1 and completes.

5. w2 = R.write(0) begins and starts changing REG to 0.

6. r2 reads 0 in REG (line 2), (unconditionally) changes RR back to 0, finds WR ≠ RR, reads 1 in REG in line 6 (the write on REG performed by w2 is still going on), and returns the conservative value 0 in line 7.

7. r3 = R.read() begins, observes RR ≠ WR in line 1, reads 1 from REG, changes RR to 1, reads 1 from REG again (recall that the write on REG performed by w2 is still going on), and returns the old value 1 in line 5.

Again, we have a new/old inversion: r2 returns the value concurrently written by w2 while r3 returns the old value.


Figure 5.5: Counter example to step 4 of the construction: new/old inversion for r2 and r3

The construction: last step The complete read algorithm is presented in Figure 5.6. As we saw in this chapter, safe base registers allow for a multitude of possible execution scenarios, so an intuitively correct implementation could be flawed because of an overlooked case. To be convinced that our construction is indeed correct, we provide a rigorous proof below.

5.3.4 Proof of the construction

Theorem 16 Let H be an execution history of the 1W1R register R constructed by the algorithm in Figures 5.2 and 5.6. Then H is linearizable.


operation R.read():
1   if WR = RR then return(val) end if;
2   aux ← REG;
3   if WR ≠ RR then change RR end if;
4   val ← REG;
5   if WR = RR then return(val) end if;
6   val ← REG;
7   return(aux)

Figure 5.6: The R.read() operation
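
For readers who prefer running code, here is a single-object Python transcription of Figures 5.2 and 5.6 (ours, not the book's). It does not model concurrency or the weak semantics of safe bits (each bit is just a Python attribute), so it only illustrates the control flow; “change X” is modeled by flipping the bit, and the writer and reader keep WR and RR cached locally, as the cost analysis below assumes.

class ThreeBitAtomicBit:
    def __init__(self):
        self.REG = self.WR = self.RR = 0   # the three base bits
        self.val = 0                       # reader-local value of R
        self.aux = 0                       # reader-local conservative value

    def write(self, v):
        # assumes v differs from the previously written value,
        # as enforced by the upper layer (see Section 5.3)
        self.REG ^= 1                      # i:  change REG
        if self.WR == self.RR:             # ii: strive to establish WR != RR
            self.WR ^= 1

    def read(self):
        if self.WR == self.RR:             # line 1
            return self.val
        self.aux = self.REG                # line 2: conservative value
        if self.WR != self.RR:             # line 3: strive to establish WR = RR
            self.RR ^= 1
        self.val = self.REG                # line 4
        if self.WR == self.RR:             # line 5
            return self.val
        self.val = self.REG                # line 6
        return self.aux                    # line 7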

Proof Let H be an execution history. By Theorem 5, to show that H is linearizable (atomic), it is sufficient to show that there exists a reading function π satisfying the assertions A0, A1 and A2.

In order to distinguish the operations R.read() and R.write(v), denoted by r and w, from the read and write operations on the base registers (e.g., “change RR”, “aux ← REG”, etc.), the latter ones are called actions. The corresponding execution containing, additionally, the action invocation and response events is denoted L. Let →L denote the corresponding partial relation on the actions.

Moreover, r being a read operation and loc the local variable (aux or val) whose value is returned by r (in line 1, 5 or 7), ρr denotes the last read action “loc ← REG” executed before r returns:

• If r returns in line 7, ρr is the read action “aux ← REG” executed in line 2 of r,

• If r returns in line 5, ρr is the read action “val ← REG” executed in line 4 of r, and finally

• If r returns in line 1, ρr is the read action “val ← REG” executed in line 4 or 6 of some previous read operation.

Let φ be any regular reading function on REG. Thus, for each read action ρr we can define the corresponding write action φ(ρr) that writes the value returned by r. The write operation that contains φ(ρr) determines π(r). If there is no such write operation, i.e., ρr returns the initial value of REG, we assume that π(r) is the (imaginary) initial write operation that writes the initial value and precedes all actions in H.

Proof of A0. Let r be a complete read operation in H. By the definition of π, the invocation of the write action φ(ρr) occurs before the response of ρr and, thus, before the response of r in L, i.e., inv[φ(ρr)] <L resp[r]. Thus, inv[π(r)] <L inv[φ(ρr)] <L resp[r] and ¬(resp[r] <L inv[π(r)]).

By contradiction, suppose that A0 is violated, i.e., r →H π(r). Then resp[r] <L inv[π(r)], a contradiction.

Proof of A1. Since there is only one writer, all writes are totally ordered, and w →H π(r) is equivalent to ¬(π(r) →H w).

By contradiction, suppose that there is a write operation w such that π(r) →H w →H r. If there are several such write operations, let w be the last one before r, i.e., ∄ w′: w →H w′ →H r.

We first claim that, in such a context, ρr cannot be a read action of the read operation r (i.e., ρr ∉ r).

Proof of the claim. Recall that φ(ρr) ∈ π(r) (by definition). Let ω be the “change REG” action of the operation w (ω ∈ w). By the case assumption, we obtain φ(ρr) →L ω. Since φ is a regular reading function, there is no complete write action on REG between φ(ρr) and ρr; as φ(ρr) →L ω, it follows that ¬(ω →L ρr). Therefore, inv[ρr] <L resp[ω]. As ω ∈ w and w →H r, we have inv[ρr] <L resp[w] <L inv[r]. As ρr started before r, and both are executed by the same process, we have ρr ∉ r. End of the proof of the claim.


Since ρr ∉ r, by the algorithm in Figure 5.6, the read operation r returns a value in line 1, which means that it has previously seen WR = RR. On the other hand, after the writer has executed ω within w, it read RR in order to set WR different from RR if they were seen equal. As w →H r and ∄ w′: w →H w′ →H r (assumption), it follows that RR has been modified by a read operation in line 3 before the read operation r starts but after or concurrently with the read action on RR performed by w. Let r′ be that read operation; as there is a single process executing R.read(), we have r′ →H r.

Now we claim that ρr ∉ r′.

Proof of the claim: Let r′′ be the read operation that contains ρr. We show that r′′ ≠ r′. We observe that (Figure 5.7):

- If r′′ updates RR, it does it in line 3, i.e., before executing ρr (in line 4 or 6);

- inv[ρr] <L resp[ω] (since φ is a regular reading function and φ(ρr) precedes ω; this is indicated by a dotted arrow in Figure 5.7);

- w reads RR after having executed ω (code of the write operation).

It follows from these observations that if r′′ writes into RR, then it completes the write before w starts reading RR. But r′ writes to RR after or concurrently with w reading RR. Therefore, r′′ ≠ r′ and, thus, ρr ∉ r′. End of the proof of the claim.

But since the reader modifies RR within r′, it also executes line 4 of r′ (val ← REG) before executing r (this follows from the code of the read operation). But, as ρr ∉ r′, this read action on REG within r′ contradicts the definition of ρr (according to which ρr is the last action “val ← REG” executed before r starts), which completes the proof of the assertion A1.


Figure 5.7: ρr belongs neither to r nor to r′

Proof of A2. By contradiction, suppose that there exist r1 and r2, two complete read operations in H, such that r1 →H r2 and π(r2) →H π(r1). Without loss of generality, we assume that if r1 returns at line 1, then ρr1 is the read action in line 6 of the immediately preceding read operation. Since π(r2) ≠ π(r1), we have ρr1 ≠ ρr2. Thus, either ρr1 →L ρr2 or ρr2 →L ρr1.

• ρr2 →L ρr1. As ρr1 precedes or belongs to r1, and r1 →H r2, we have resp[ρr1] <L inv[r2]. Combined with the case assumption, this implies ρr2 →L ρr1 →L r2, which contradicts the fact that ρr2 is the last “loc ← REG” action executed before r2 started, where loc is val or aux. So, the case ρr2 →L ρr1 is not possible.



Figure 5.8: A new/old inversion on the regular register REG

• ρr1 →L ρr2. By definition, φ(ρr1) ∈ π(r1) and φ(ρr2) ∈ π(r2). As π(r2) →H π(r1), we have φ(ρr2) →L φ(ρr1).

Thus, we have φ(ρr2) →L φ(ρr1) and ρr1 →L ρr2 (Figure 5.8), which implies a new/old inversion on the base regular register REG. But since φ is a regular reading function on REG, we have ¬(ρr1 →L φ(ρr1)) and ¬(φ(ρr1) →L ρr2). Thus, both ρr1 and ρr2 have to overlap φ(ρr1) (Figure 5.8): inv[φ(ρr1)] <L resp[ρr1] and inv[ρr2] <L resp[φ(ρr1)]. As φ(ρr1) is a base action that updates REG, and as REG and WR are both updated by the writer, the “value” of the base register WR does not change while the writer is updating REG or, more formally:

Property P: all read actions on WR performed between resp[ρr1] and inv[ρr2] return the same value.

We consider three cases, according to the line at which r1 returns.

– r1 returns in line 7. Then ρr1 is “aux ← REG” in line 2 of r1. We have the following:
- Since ρr1 →L ρr2 and r1 returns in line 7, ρr2 can only be the read in line 6 of r1 or a later read action.
- After having performed ρr1, r1 reads WR and, if WR ≠ RR, it sets RR = WR in line 3. But r1 returns in line 7, after having seen RR different from WR in line 5 (otherwise, it would have returned in line 5). Thus, r1 reads different values of WR after ρr1 (line 2 of r1) and before ρr2 (line 6 of r1 or later). This contradicts property P above.

– r1 returns in line 5. Then ρr1 is “val ← REG” in line 4 of r1, and r1 sees RR = WR in line 5. Since ρr1 →L ρr2, r2 does not return in line 1. Indeed, if r2 returned in line 1, property P would imply that the last read on REG preceding line 1 of r2 is line 4 of r1, i.e., ρr1 = ρr2. Thus, r2 sees RR ≠ WR in line 1, before performing ρr2 in line 2 or line 4 of r2. But r1 has seen WR = RR in line 5, after having performed ρr1 in line 4, a contradiction with property P.

– r1 returns in line 1. In that case, ρr1 is line 4 or line 6 of the read operation that precedes r1. Again, since ρr1 →L ρr2, r2 does not return in line 1, from which we conclude that, before performing ρr2, r2 sees RR ≠ WR in line 1. On the other hand, r1 sees RR = WR in line 1 after having performed ρr1, which contradicts property P and concludes the proof.

Thus, π is an atomic reading function. □ (Theorem 16)

5.3.5 Cost of the algorithms

The cost of the R.read() and R.write(v) operations is measured by the maximal and minimal numbers of accesses to the base registers. Recall that the writer (resp., reader) does not read WR (resp., RR), as it keeps a local copy of that register.

• R.write(v): maximal cost: 3; minimal cost: 2.

• R.read(): maximal cost: 7; minimal cost: 1.

The minimal cost is realized when the same type of operation (i.e., read or write) is repeatedly executed while the operation of the other type is not invoked.

Let us remark that we have assumed that if R.write(v) and R.write(v′) are two consecutive write operations, we have v ≠ v′. This means that if the upper layer issues two consecutive write operations with v = v′, the cost of the second one is 0, as it is skipped, and consequently there are no accesses to base registers.

5.4 Bibliographic notes

Tromp 1989

Lamport 86 (1W2R, but very inefficient)


Chapter 6

Collect and Snapshot objects

In many applications, processes cooperate in the following way. Periodically, a process posts a value based on its current state in the shared memory so that the other processes can read and use it. The last value stored by a process overwrites the value it has previously stored (if any). To compute the new value, a process accesses the shared memory to get the “latest” values stored by the other processes. This kind of cooperation is typically used in asynchronous systems: processes compute, post and read values at their own pace, and we assume no bounds on the relative processing speeds. Thus, we expect the cooperation to be wait-free.

In a concurrent system, however, it may be hard to say which value is the latest one. We discuss multiple ways of defining such abstractions, starting from the weakest, the store-collect object, and moving to the stronger snapshot and immediate snapshot objects.

6.1 Store-collect object

A store-collect object exports the operation store(), used to post values, and the operation collect(), used to get the values that have been posted so far. The set of collected values defines what is called a view: a view V is an n-vector, with one value per process. A store(v) is invoked by process pi to replace the value in position i of the view with v. The view returned by a collect() operation contains ⊥ at position i if no value posted by pi has been found.

6.1.1 Definition

Let H be a history of events on a store-collect object: inv[store()], resp[store()], inv[collect()], resp[collect()] issued by the processes. Recall that <H denotes the total order on the events in H and →H denotes the real-time order on the operations in H. As usual, we assume that H is well-formed: no process invokes a new operation on the store-collect object before its previous operation returns. Thus, for any two operations invoked by a given process, we can say which one precedes the other, i.e., all operations of a given process are related by →H. (See Chapter 2.)

Let V1 and V2 be two views returned by collect operations in a history H. We say that V1 ≤ V2 if, for every process pi, V1[i] = V2[i] or store(V1[i]) precedes store(V2[i]) (notice that both store() operations are issued by pi). Respectively, V1 < V2 if V1 ≤ V2 and V1 ≠ V2.

Now a store-collect object satisfies, for each of its histories H, the following properties:


• Validity. Let col be a collect operation issued by pj that returns a view V such that V[i] = v. Then there is a store(v) operation st by pi that started before col terminates, i.e., ¬(col →H st).

In other words, a collect() operation cannot output values that have not yet been stored.

• Up-to-dateness. Let st = store(v) be issued by pi and col = collect() be issued by pj. If st →H col, then col returns a view V such that V[i] is v or a value posted by pi after v.

The views returned by the collect operations are thus “up to date” in the sense that, as soon as a value has been posted, it cannot be ignored by future collect operations. Note that Validity and Up-to-dateness imply that each position i in the view returned by col contains the value written by the last store by pi that precedes col, or the value written by a store operation issued by pi that is concurrent to col.

• Partial Consistency. Let col1 and col2 be two collect() operations that obtain views V1 and V2, respectively. If col1 →H col2, then V1 ≤ V2.

This property requires that non-concurrent collect operations are mutually consistent, namely, a collect operation cannot obtain values older than the values obtained by a preceding collect operation. Note that there are no constraints on the views returned by concurrent collect operations.

6.1.2 A trivial implementation

A store-collect object has a trivial implementation from 1WMR atomic registers. This implementation is based on an array REG[1..n] with one entry per process. Only pi can write REG[i], while all the processes can read it. For collecting values, a process reads the entries of the array REG[1..n] in any order. Each entry is initialized to a default value ⊥ (⊥ cannot be posted by a process). The construction is described in Figure 6.1.

when store(v) is invoked by pi:
  REG[i] := v;
  return()

when collect() is invoked:
  for all j ∈ {1, . . . , n} do
    V[j] := REG[j]
  end for;
  return(V)

Figure 6.1: Simple implementation of a store-collect object
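
A direct Python transcription of Figure 6.1 follows (the class name is ours); None plays the role of ⊥, and each list cell stands for one 1WMR atomic register REG[i].

class StoreCollect:
    def __init__(self, n):
        self.REG = [None] * n          # REG[i] is written only by process pi

    def store(self, i, v):             # invoked by process pi
        self.REG[i] = v

    def collect(self):
        # read the entries in any order; here, left to right
        return [self.REG[j] for j in range(len(self.REG))]

SC = StoreCollect(3)
SC.store(1, "a")
assert SC.collect() == [None, "a", None]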

6.1.3 A store-collect object has no sequential specification

An abstraction A has a sequential specification S if its behavior can be expressed through a set of sequential histories of S. Intuitively, A behaves like an atomic implementation of S. Formally:

• Every implementation of A is an atomic implementation of S, and

• Every atomic implementation of S is an implementation of A.

Note that the second property implies that every sequential history of S should be a history of A. If an abstraction A has a sequential specification, we say that A is an atomic object.


Lemma 4 Store-collect is not an atomic object.

Proof Suppose, by contradiction, that the store-collect abstraction has a sequential specification S.

Consider the execution history in Figure 6.2. Here the collect() operation issued by p1 is concurrent with two store operations issued by p2 and p3. The history could have been exported by an execution of the algorithm in Figure 6.1, where p1, within its collect() operation, reads REG[2] before any write on REG[2] performed by p2 and REG[3] after the write on REG[3] performed by p3.

By our assumption, the history should be atomic with respect to S. We recall that any linearization of H should respect the real-time order on operations and, thus, should put [store(v) by p2] before [store(v′) by p3]. We establish a contradiction by showing that there is no way to find a place for the collect() operation in any linearization of H.

Suppose that S allows placing the collect() operation before [store(v′) by p3]. Then S contains a sequential history that violates the Validity property of store-collect (the collect operation returns a value which is not written yet)!

Now suppose that S allows placing the collect() operation after [store(v′) by p3]. This results in a history that violates the Up-to-dateness property of store-collect (the collect operation returns an overwritten value)!

□ (Lemma 4)

(Execution depicted: p1 invokes collect(), which returns [⊥, ⊥, v′], concurrently with store(v) by p2 followed by store(v′) by p3.)

Figure 6.2: A store-collect object has no sequential specification

6.2 Snapshot object

We have seen that a store-collect object provides the processes with two operations: store(v), which allows a process to define v as its last posted data, and collect(), which allows a process to obtain the last data posted by each process (if any). We have also seen that a store-collect object cannot be captured by a sequential specification. One of the reasons is that the partial consistency property allows concurrent collect operations to be ordered arbitrarily.

In this chapter, we introduce an “atomic restriction” of store-collect: the snapshot object. A snapshot object is accessed with operations update() and snapshot(). The snapshot() operation returns a vector of n values (one per process). The value in position i of the vector is affected by the last preceding or concurrent update(v) operation executed by process pi.

A snapshot object satisfies the properties of store-collect (Section 6.1.1), where store and collect are replaced with update and snapshot, plus the following properties:

• Snapshot order. For any two views S1 and S2 obtained by snapshot operations, either S1 ≤ S2 or S2 ≤ S1.

This property implies that all views obtained by snapshot operations can be totally ordered respecting the ≤ relation.


• Update order. For any two updates update(v) and update(v′) performed by pi and pj, respectively, and any view S obtained by a snapshot operation, if the response of update(v) precedes the invocation of update(v′) and S contains v′ in position j, then S contains v or a later value in position i.

In other words, non-concurrent updates cannot be observed by snapshot operations in the opposite order.

The sequential specification of the type snapshot defines the set of allowed sequential histories of update and snapshot operations. In every such sequential history, each position i of the vector returned by every snapshot operation contains the argument of the last preceding update operation of pi (if any, or the initial value otherwise). Intuitively, a snapshot operation thus allows one to view update and snapshot operations as taking place instantaneously. We show that this type indeed captures the behavior of a snapshot object.
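
As a point of reference, the sequential snapshot type is trivial to write down; the Python sketch below (ours, not the book's) is exactly the behavior that every linearization must match.

class SequentialSnapshotType:
    def __init__(self, n, initial=None):
        self.view = [initial] * n      # position i holds pi's last update

    def update(self, i, v):            # invoked by process pi
        self.view[i] = v

    def snapshot(self):
        return list(self.view)         # the whole vector, instantaneously

T = SequentialSnapshotType(3)
T.update(0, "x")
assert T.snapshot() == ["x", None, None]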

Lemma 5 A snapshot abstraction is atomic.

Proof We show below that any snapshot implementation is atomic with respect to thesnapshot type, andany atomic implementation of thesnapshot type is a snapshot implementation. Consider a finite historyH of an implementation that satisfies the properties of collect (Section 6.1.1), plus the Snapshot Order andUpdate Order properties.

We construct a linearizationL of H as follows. First we order all complete snapshot operationsin H,based on the≤ relation, so that the real-time order is respected. This is possible by the Total Order and thePartial Consistency properties. Indeed, if a snapshot operation returnsS1 and precedes a snapshot operationthat returnsS2, thenS1 ≤ S2.

Let update(v) = U be an operation performed by pi. U is then inserted in L just before the first snapshot operation that returns v or a later value in position i, or at the end of the sequence if there is no such snapshot. After having done this for every update, we obtain a sequence [U0], S1, [U1], S2, [U2], . . . , Sk, [Uk], where each [Uj] is a (possibly empty) sequence of update operations U such that snapshot Sj contains values older than that written by U and Sj+1 contains the value written by U or a later value. Now we rearrange the elements of each [Uj] so that the real-time order is respected. This is possible since the real-time order is acyclic.

Now we observe that the resulting linearization L does not violate the real-time order. Indeed, suppose that an operation op precedes op′ in H. Four cases are possible:

• Both op and op′ are update operations. Let op and op′ belong to [Um] and [Uℓ], respectively. If ℓ = k (op′ belongs to the last subsequence of updates in L), then, by construction, op precedes op′ in L.

Now suppose that ℓ < k and consider Sℓ+1. By the construction of L, Sℓ+1 contains the value written by op′ or a later value at the corresponding position. By the Update Order property, Sℓ+1 then also contains the value written by op or a later value at the corresponding position; since Sm+1 is the first snapshot with this property, m ≤ ℓ. Thus, op precedes op′ in L.

• Both op and op′ are snapshot operations that return views S and S′, respectively. If op′ is incomplete, then it does not appear in L. If op′ is complete, then, by the Partial Consistency property (Section 6.1.1), S ≤ S′. Thus, by construction, op precedes op′ in L.

• op is an update and op′ is a snapshot. By the Up-to-dateness property (Section 6.1.1), op′ returns the value written by op or a later value, and, by the construction of L and the Snapshot Order property, op precedes op′ in L.


• op is a snapshot and op′ is an update. By the Validity property (Section 6.1.1), the value written by op′ does not appear in the result of op. By the construction of L and the Snapshot Order property, op precedes op′ in L.

Thus, any snapshot object is indeed an atomic implementation of the snapshot type.

Now consider a history H of an atomic implementation of the snapshot type. Let L be the linearization of H, i.e., L respects the real-time order in H, L is legal with respect to the snapshot type, and L is equivalent to a completion of H. Recall that, in particular, L contains every complete operation in H.

• Suppose that a snapshot operation S returns a value v ≠ ⊥ at position i in H. Since L is legal, v is the value written by the last update U of pi that precedes S in L. Since L respects the real-time order, S cannot precede U in H, and, thus, Validity is ensured in H.

• Suppose an update U precedes a snapshot S in H. Since L respects the real-time order of H, U precedes S also in L. Since L is legal, S returns the value written by U or a later value at the corresponding position and, thus, Up-to-dateness is ensured in H.

• Suppose a snapshot S1 precedes a snapshot S2 in H. Since L respects the real-time order of H, S1 precedes S2 also in L. Since L is legal, S1 ≤ S2 and, thus, Partial Consistency is ensured in H.

• All complete snapshot operations appear in L and, since L is legal, are related by ≤: Snapshot Order is ensured in H.

• Suppose an update U1 precedes an update U2 and a snapshot S returns the value written by U2. Thus, U1 precedes U2 also in L. Since L is legal, U2 precedes S in L. Thus, U1 precedes S in L and, since L is legal, S returns the value written by U1 or a later value: Update Order is ensured in H.

Thus, any atomic implementation of the snapshot type is indeed a snapshot object. □ Lemma 5

By default, when we refer to a snapshot object, we mean a wait-free atomic implementation of the snapshot type. But we start with a simple non-blocking implementation of atomic snapshot that only guarantees that at least one correct process completes each of its operations. The construction assumes that the underlying base registers can store values of arbitrary size, i.e., we may associate ever-growing sequence numbers with every stored value. Then we turn the construction into a wait-free one. Finally, we present a wait-free snapshot implementation that uses bounded memory.

6.2.1 Non-blocking snapshot

In this section, we assume we can use base registers of unbounded size. Our n-process snapshot implementation uses an array of atomic registers REG[]. Each value that can be stored in a register REG[i] is associated with a sequence number that is incremented each time a new value is stored. So each REG[i] consists of two fields, denoted REG[i].sn and REG[i].val. The implementation of update() is presented in Figure 6.3. Here sni is a local variable that pi uses to generate sequence numbers.

To maintain consistency across the results of snapshot operations, each snapshot operation is implemented using the "double scan" technique: the process keeps reading the registers REG[1, . . . , n] until two consecutive collects return identical results. The result of the last scan is then returned by the snapshot operation.

The scan() function asynchronously reads the last (sequence number, data) pairs posted by each process:

function REG.scan(): for j ∈ {1, . . . , n} do r[j] := REG[j] end do; return (r).


operation update(v) invoked by pi:
  sni := sni + 1        { local sequence number generator }
  REG[i] := [v, sni]    { store the pair }

Figure 6.3: Update operation

operation snapshot():
1  aa := REG.scan();
2  repeat forever
3    bb := REG.scan();
4    if (∀j : aa[j] = bb[j]) then return (aa.val) end if;  { return the vector of read values }
5    aa := bb
6  end repeat

Figure 6.4: Snapshot operation
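A minimal Java sketch may make the construction concrete (the class and its names are ours, not part of the algorithm's presentation; AtomicReferenceArray plays the role of the array REG of atomic registers):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the non-blocking snapshot of Figures 6.3 and 6.4 (names ours).
public class NonBlockingSnapshot<V> {
    // REG[i]: a (val, sn) pair, written only by process i.
    private record Entry<U>(U val, long sn) {}

    private final AtomicReferenceArray<Entry<V>> reg;

    public NonBlockingSnapshot(int n) {
        reg = new AtomicReferenceArray<>(n);
        for (int i = 0; i < n; i++) reg.set(i, new Entry<>(null, 0));
    }

    // update() by process i: a single base-object write; since only pi
    // writes REG[i], reading the old sequence number back is safe.
    public void update(int i, V v) {
        reg.set(i, new Entry<>(v, reg.get(i).sn() + 1));
    }

    @SuppressWarnings("unchecked")
    private Entry<V>[] scan() {
        Entry<V>[] r = new Entry[reg.length()];
        for (int j = 0; j < r.length; j++) r[j] = reg.get(j);
        return r;
    }

    // Double scan: return only when two consecutive scans are identical.
    public List<V> snapshot() {
        Entry<V>[] aa = scan();
        while (true) {
            Entry<V>[] bb = scan();
            boolean identical = true;
            for (int j = 0; j < aa.length; j++)
                if (aa[j].sn() != bb[j].sn()) { identical = false; break; }
            if (identical) {
                List<V> out = new ArrayList<>();
                for (Entry<V> e : aa) out.add(e.val());
                return out;
            }
            aa = bb; // retry with the most recent scan
        }
    }
}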

Theorem 17 The algorithm in Figures 6.3 and 6.4 is a non-blocking atomic snapshot implementation.

Proof To prove that the implementation is non-blocking, we observe first that the update operation contains only one base-object step. Consider an infinite execution of the algorithm, and suppose that a snapshot operation performed by a correct process pi never terminates. By the algorithm, pi thus executes infinitely many scans of REG. The only reason not to return in line 4 is to find out that one of the positions in REG has changed since the last scan. Thus, for every two consecutive scan operations S1 and S2 executed by pi, another process pj executes an update operation U such that the write to REG[j] in U takes place between the read of REG[j] in S1 and the read of REG[j] in S2.

Since there are only finitely many processes, at least one process performs infinitely many update operations concurrently with the snapshot operation of pi. Thus, in every infinite execution of the algorithm, at least one correct process completes each of its operations. So the implementation is indeed non-blocking.

Now we prove atomicity. Let E be any finite execution of the algorithm and H be the corresponding history. Consider any complete snapshot() operation in E. Let C1 and C2 be its last two scans. By the algorithm, C1 and C2 return the same result. Now we choose the linearization point of the snapshot operation to be any point in E between the response of C1 and the invocation of C2 (see the example in Figure 6.5). Otherwise, if a snapshot operation does not return in E, we do not include it in the linearization (or, equivalently, we remove the operation from our complete extension of the corresponding history H).

Consider now an update(v) operation executed by a process pi in E. We linearize the operation at the point when it performs a write on REG[i] in E (if it does not, we remove it from the completion of H).

Let L be the resulting linearization of H, i.e., the sequential history where operations appear in the order of their linearization points in E. By construction, L is equivalent to a completion of H. Also, since each operation is linearized within its interval in E, L respects the real-time order of H. We show that L is legal, i.e., at every position i, every snapshot operation in L returns the value written by the latest preceding update of pi.

Let S be a snapshot operation in L, and let C1 and C2 be the two last scans of S. For each pi, let Ui be the last update operation of pi preceding S in L. If there is no such update operation in L, we artificially add update(⊥) by pi at the beginning of L and call it Ui. Recall that Ui is linearized at the write on REG[i] and S is linearized between the response of C1 and the invocation of C2. Since, by the algorithm, C1 and C2 read the same value in REG[i], no write on REG[i] takes place between the read of REG[i] performed within C1 and the read of REG[i] performed within C2. Thus, since the write operation performed within Ui is the last write on REG[i] to precede the linearization point of S in E, we derive that it is also the last write on REG[i] to precede the read of REG[i] performed within C1.

Therefore, for each pi, the value of pi returned by C1 and, thus, by S is the value written by Ui. Hence, L is legal, and the algorithm in Figures 6.3 and 6.4 is a non-blocking atomic snapshot implementation. □ Theorem 17

Figure 6.5: Linearization point of a snapshot() operation (the first and second scan() read equal sequence numbers in REG[1..4]; the snapshot is linearized at some point between them)

6.2.2 Wait-free snapshot

In the non-blocking snapshot implementation in Figures 6.3 and 6.4, update operations may starve a snapshot operation out by "selfishly" updating REG. This implementation can be turned into a wait-free one using helping: an update operation helps concurrent snapshot operations to terminate. An update operation may itself take a snapshot and store the result together with the new value in REG. Of course, for this helping mechanism to work, we need to make sure that the intertwined snapshot and update operations do not prevent each other from terminating. Below we present an exact argument for this elegant idea.

Figure 6.6: Each update() operation includes a snapshot() operation (pj executes two updates U1 and U2 concurrently with pi's snapshot S; U2, with its embedded snapshot Shelp, lies entirely within the interval of S)

First, we make the following two observations about the non-blocking snapshot implementation:

• If two consecutive scans performed within a snapshot operation are not identical (and, thus, the snapshot operation cannot return), then at least one process has concurrently performed an update operation.


• If a snapshot operation S issued by a process pi witnesses that the value of REG[j] has changed twice, i.e., pj concurrently executed two update operations U1 and U2, then the second of these updates was entirely performed within the interval of S (see Figure 6.6). This is because the update by pj of the base atomic register REG[j] is the last step executed in an update() operation.

As the second update is entirely executed during S, we may use it to "help" S:

• Within U2, pj takes a snapshot itself (using the algorithm in Figure 6.4) and writes the result help to REG[j].

• Within S, pi uses the result read in REG[j] as the response of S. This is going to be a valid result, since the execution of U2 (and, thus, of the snapshot performed by U2) takes place entirely within the interval of S, so S can simply "borrow" the snapshot result help from U2.

Note that for this kind of helping to work, S must witness at least two concurrent updates of the same process. For example, even though the write on REG[j] performed within U1 takes place within the interval of S, the snapshot written by U1 together with its value may have taken place way before the invocation of S. Thus, adopting the result of U1's snapshot as the result of S may violate linearizability, since it may miss updates executed after the snapshot taken by U1 but before the invocation of S. This is why, before adopting the snapshot taken by pj, pi should wait until it observes the second change in REG[j].

The resulting algorithms for update() and snapshot() are described in Figure 6.7. The atomic register REG[i] now consists of three fields: REG[i].val and REG[i].sn as before, plus the new field REG[i].help_array that contains the result of the snapshot taken by pi in the course of its latest update operation.

The new local variable could_helpi is used by process pi when it executes snapshot(). Initially ∅, could_helpi contains the set of processes that have terminated update operations concurrently with the snapshot operation currently executed by pi (lines 11-15). When pi observes that a process pj ∈ could_helpi has again updated its value in REG, i.e., pi finds out that aai[j].sn ≠ bbi[j].sn, pi returns the help array it read in REG[j] (bbi[j].help_array) as the result of its snapshot operation.

6.2.3 The snapshot object construction is bounded wait-free

Theorem 18 Each update() or snapshot() operation returns after at most O(n²) operations on base registers.

Proof Let us first observe that an update() by a correct process always terminates as long as the snapshot() operation it invokes always returns. So, the proof consists in showing that any snapshot() issued by a correct process pi terminates.

Suppose, by contradiction, that a snapshot operation executed by pi has not returned after having executed the while loop (lines 5-19) n times. Thus, each time it has executed the loop, pi has found out that for some new j ∉ could_helpi, aai[j].sn ≠ bbi[j].sn (line 11), i.e., pj has executed a new update() operation since the last scan() of pi. After this, j is added to the set could_helpi in line 14.

Note that i ∉ could_helpi (pi does not change the value of REG[i] while executing snapshot()). Thus, after n − 1 iterations, could_helpi contains all other n − 1 processes {1, . . . , i − 1, i + 1, . . . , n}. Therefore, when pi executes the while loop for the nth time, for any pj such that aai[j].sn ≠ bbi[j].sn (line 11), it finds j ∈ could_helpi in line 12. By the algorithm, pi returns in line 13, after having executed n iterations of lines 5-19, a contradiction.


operation update(v) invoked by pi:
(1)  help_arrayi := snapshot();
(2)  sni := sni + 1;
(3)  REG[i] := (v, sni, help_arrayi)

operation snapshot():
(4)  could_helpi := ∅;
(5)  aai := scan();
(6)  while true do
(7)    bbi := scan();
(8)    if (∀j ∈ {1, . . . , n} : aai[j].sn = bbi[j].sn)
(9)    then return (aai.val)
(10)   else for each j ∈ {1, . . . , n} do
(11)     if (aai[j].sn ≠ bbi[j].sn) then
(12)       if (j ∈ could_helpi)
(13)       then return (bbi[j].help_array)
(14)       else could_helpi := could_helpi ∪ {j}
(15)     end if end if
(16)   end for
(17)   end if;
(18)   aai := bbi
(19) end while

Figure 6.7: Atomic snapshot object construction
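Extending the earlier Java sketch with the help_array field and the could_help set gives a direct, and again merely illustrative, transcription of Figure 6.7 (all names are ours):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of the wait-free snapshot of Figure 6.7 (names ours).
public class WaitFreeSnapshot<V> {
    // REG[i]: (val, sn, helpArray), written only by process i.
    private record Entry<U>(U val, long sn, List<U> helpArray) {}

    private final AtomicReferenceArray<Entry<V>> reg;

    public WaitFreeSnapshot(int n) {
        reg = new AtomicReferenceArray<>(n);
        for (int i = 0; i < n; i++) reg.set(i, new Entry<>(null, 0, null));
    }

    // update() first takes a snapshot and publishes it together with the
    // new value: this is the helping mechanism.
    public void update(int i, V v) {
        List<V> help = snapshot();
        reg.set(i, new Entry<>(v, reg.get(i).sn() + 1, help));
    }

    @SuppressWarnings("unchecked")
    private Entry<V>[] scan() {
        Entry<V>[] r = new Entry[reg.length()];
        for (int j = 0; j < r.length; j++) r[j] = reg.get(j);
        return r;
    }

    public List<V> snapshot() {
        Set<Integer> couldHelp = new HashSet<>(); // processes seen to move once
        Entry<V>[] aa = scan();
        while (true) {
            Entry<V>[] bb = scan();
            boolean identical = true;
            for (int j = 0; j < bb.length; j++) {
                if (aa[j].sn() != bb[j].sn()) {
                    identical = false;
                    // Second observed change by pj: its latest update ran
                    // entirely within our interval, so borrow its snapshot.
                    if (couldHelp.contains(j)) return bb[j].helpArray();
                    couldHelp.add(j);
                }
            }
            if (identical) { // successful double scan
                List<V> out = new ArrayList<>();
                for (Entry<V> e : aa) out.add(e.val());
                return out;
            }
            aa = bb;
        }
    }
}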

Thus, every snapshot operation returns after having executed at most n while loops (lines 5-19). Since every loop involves exactly n base-object reads (the scan operation on registers REG[1], . . . , REG[n]), every snapshot terminates in O(n²) base-object steps. The same holds for an update operation, since it additionally executes only one base-object write. □ Theorem 18

6.2.4 The snapshot object construction is atomic

Theorem 19 The object built by the algorithm described in Figure 6.7 is atomic with respect to the snapshot type.

Proof Let E be an execution of the algorithm and H be the corresponding history of E. To prove that the algorithm is indeed an atomic snapshot implementation, we construct a linearization of H, i.e., a total order L on the operations in H such that: (1) L is equivalent to a completion of H, (2) L respects the real-time order of H, and (3) L is legal, i.e., each snapshot() operation S in L returns, for each process pj, the value written by the last update() operation of pj that precedes S in L.

The desired linearization L is built as follows. The linearization point of a complete update() operation in E is the write to the corresponding 1WMR register (line 3). Incomplete update operations are not included in L. The linearization point of a snapshot() operation S issued by a process pi depends on the line at which it returns.

(i) The linearization point of an S operation that terminates in line 9 (successful double scan()) is at any time between the end of the first scan() and the beginning of the second scan() (see the proof of Theorem 17 and Figure 6.5).

(ii) The linearization point of an S operation that terminates in line 13 (i.e., pi terminates with the help of another process pj) is defined inductively as follows (see Figure 6.8). The arrows show the direction in which snapshot results are adopted by one operation from another.

Figure 6.8: Linearization point of a snapshot() operation (case ii): pi adopts the help_array of an update() by pj1, which may itself have adopted it from another update, until some pjk whose help_array comes from a successful double scan

Since S returns in line 13, the array (say help_array) returned by pi has been provided by an update() operation executed by some process pj1. As we observed earlier, this update() has been entirely executed within the interval of S. Indeed, help_array is the result of the second update operation of pj1 that is observed by pi to be concurrent with S. Thus, this update started after the invocation of S, and its last event (the write to REG[j1] in line 3) occurred before the response of S.

Recursively, help_array has been obtained by pj1 from a successful double scan, or from another process pj2. As there are at most n concurrent processes, it follows by induction that there is a process pjk that has executed a snapshot() operation within the interval of S and has obtained help_array from a successful double scan.

The linearization point of the snapshot() operation issued by pi is thus defined as the linearization point of the snapshot() operation of pjk whose double scan determined help_array.

This association of linearization points to the operations in H results in a linearization L that puts the operations in the order in which their linearization points appear in E. L trivially satisfies properties (1) and (2) stated at the beginning of the proof. Reusing the proof of Theorem 17, we observe that, for every pj, every snapshot operation S (be it a standalone snapshot or a part of an update) returns the value written to REG[j] by the last update of pj to precede the linearization point of S in E. Thus, L also satisfies (3), and the algorithm in Figure 6.7 is an atomic implementation of snapshot. □ Theorem 19

6.2.5 Bounded snapshot object

Dolev-Shavit’s bounded timestamps

Bibliographic notes

Afek et al. JACM

Aguilera 04


Attiya-Fouren

Borowsky-Gafni 93

Masuzawa 94, MWMR and O(n)

Exercises

One-shot snapshot from renaming (attiya)



Chapter 7

Immediate Snapshot and Iterated Immediate Snapshot

7.1 Immediate snapshot object

7.1.1 Immediate snapshot and participating set problem

One-shot immediate snapshot object A one-shot immediate snapshot object is a snapshot object where update() and snapshot() are fused into a single operation denoted update_snapshot(), and such that each process invokes that operation at most once. When a process pi invokes update_snapshot(v), it deposits v as its last value and obtains a set Vi made up of (process identity, value) pairs. From an external observer's point of view, everything has to appear as if the operation was executed instantaneously. (An analogous operation has been seen in the context of store-collect objects, where the store_collect() operation fuses store() and collect().)

More formally, a one-shot immediate snapshot object is defined by the following properties. Let Vi denote the set returned by pi when it invokes update_snapshot(vi).

• Liveness. An invocation of update_snapshot(v) by a correct process terminates.

• Self-inclusion. (i, vi) ∈ Vi.

• Set inclusion. ∀i, j : Vi ⊆ Vj or Vj ⊆ Vi.

• Immediacy. ∀i, j : if (j, vj) ∈ Vi then Vj ⊆ Vi.

The first three properties are satisfied by a snapshot object where update_snapshot() is implemented by an update(vi) invocation followed by a snapshot() invocation. The last property, immediacy, is not satisfied by this implementation, which shows the fundamental difference between snapshot and immediate snapshot. This is illustrated in Figures 7.1 and 7.2.

Figure 7.1 shows three processes p1, p2 and p3. Each process executes an update() followed by a snapshot(). The process identity appears as a subscript in the operation invoked. Moreover, the value written by pi is i. According to the specification of the snapshot object, snapshot1() returns [1, 2, ⊥] (where ⊥ is the value initially placed in the register associated with each process), while snapshot2() and snapshot3() return [1, 2, 3]. This means that it is possible to associate with this execution the following sequence of operations S:

update1(1) update2(2) snapshot1() update3(3) snapshot2() snapshot3(),


Figure 7.1: update() and snapshot() operations (each pi executes updatei(i) followed by snapshoti(); snapshot1() returns [1, 2, ⊥], while snapshot2() and snapshot3() return [1, 2, 3])

thereby showing the atomicity of this execution.

Figure 7.2: update_snapshot() operations (each pi executes a single update_snapshoti(i); update_snapshot1(1) terminates before update_snapshot3(3) starts)

Figure 7.2 shows the same processes, where the operations updatei() and snapshoti() issued by pi are replaced by a single update_snapshoti() (that starts at the same time as updatei() starts and terminates at the same time as snapshoti() terminates). As update_snapshot1(1) terminates before update_snapshot3(3) starts, it is not possible for the latter not to return the value 1 written by p1. Let Vi be the set of pairs returned by pi. The following results are possible:

- V1 = {(1,1)}, V2 = {(1,1), (2,2)}, and V3 = {(1,1), (2,2), (3,3)};
- V1 = V2 = {(1,1), (2,2)}, and V3 = {(1,1), (2,2), (3,3)};
- V2 = {(2,2)}, V1 = {(1,1), (2,2)}, and V3 = {(1,1), (2,2), (3,3)};
- V1 = {(1,1)}, and V2 = V3 = {(1,1), (2,2), (3,3)};
- V1 = {(1,1)}, V3 = {(1,1), (3,3)}, and V2 = {(1,1), (2,2), (3,3)}.

When V1, V2 and V3 are all different, everything appears as if the update_snapshot() operations had been executed sequentially (and consistently with their real-time occurrence order). When two of them are equal, e.g., V1 = V2 = {(1,1), (2,2)} (second case), everything appears as if update_snapshot1(1) and update_snapshot2(2) had been executed at the very same time, both before the update_snapshot3() operation. This possibility of simultaneity is the very essence of the "immediate" snapshot abstraction. It also shows that an immediate snapshot object is not an atomic object.
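To make the distinction operational, the following small Java checker (names ours; views reduced to sets of process identities) tests the three properties on a family of views. Applied to the views of Figure 7.1, where V1 = {1, 2} and V2 = V3 = {1, 2, 3}, it reports that self-inclusion and set inclusion hold but immediacy fails: 2 ∈ V1 yet V2 ⊄ V1. This is exactly the kind of run that an immediate snapshot object excludes.

import java.util.Map;
import java.util.Set;

// Checks the one-shot immediate snapshot properties on views,
// represented as sets of process ids (illustrative sketch).
public class ImmediacyCheck {
    static boolean selfInclusion(Map<Integer, Set<Integer>> v) {
        return v.entrySet().stream().allMatch(e -> e.getValue().contains(e.getKey()));
    }

    static boolean setInclusion(Map<Integer, Set<Integer>> v) {
        for (Set<Integer> a : v.values())
            for (Set<Integer> b : v.values())
                if (!a.containsAll(b) && !b.containsAll(a)) return false;
        return true;
    }

    static boolean immediacy(Map<Integer, Set<Integer>> v) {
        for (Set<Integer> vi : v.values())
            for (int j : vi)
                if (v.containsKey(j) && !vi.containsAll(v.get(j))) return false;
        return true;
    }

    public static void main(String[] args) {
        // The update()-then-snapshot() run of Figure 7.1.
        Map<Integer, Set<Integer>> views =
            Map.of(1, Set.of(1, 2), 2, Set.of(1, 2, 3), 3, Set.of(1, 2, 3));
        System.out.println(selfInclusion(views)); // true
        System.out.println(setInclusion(views));  // true
        System.out.println(immediacy(views));     // false: 2 ∈ V1 but V2 ⊄ V1
    }
}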

Theorem 20 A one-shot immediate snapshot object satisfies the following property: if (i, −) ∈ Vj and (j, −) ∈ Vi, then Vi = Vj.

Proof If (j, −) ∈ Vi (theorem assumption), we have Vj ⊆ Vi, due to the immediacy property. Similarly, (i, −) ∈ Vj implies that Vi ⊆ Vj. It trivially follows that Vi = Vj when (j, −) ∈ Vi and (i, −) ∈ Vj. □ Theorem 20

This theorem states that, while its operations appear as if they were executed instantaneously, an immediate snapshot object is not an atomic object. This is because it is not always possible to totally order all its operations. The immediacy property states that, from a logical time point of view, it is possible that operations occur simultaneously (they then return the same result), making it impossible to consider that one occurred before the other. Differently from atomic snapshot objects, the specification of an immediate snapshot object allows for concurrent operations. It requires that these operations return the very same result. Stated another way, this means that an immediate snapshot object has no sequential specification.

The participating set problem The participating set problem is a particular instance of the one-shot immediate snapshot problem. It considers the case where the value vi deposited by a process pi is its own identity. The corresponding operation is consequently denoted participate(). The properties of a set Vi returned by participate() are then:

• Self-inclusion. i ∈ Vi.

• Set inclusion. ∀i, j : Vi ⊆ Vj or Vj ⊆ Vi.

• Immediacy. ∀i, j : if j ∈ Vi then Vj ⊆ Vi.

7.1.2 A one-shot immediate snapshot construction

This section describes a very simple one-shot immediate snapshot algorithm based on an algorithm solving the participating set problem. (The section that follows provides a solution to that problem.)

The algorithm is described in Figure 7.3. It uses an array REG[1 : n] of 1WMR atomic registers, and a participating set object denoted PART. REG[i] is the register where pi deposits its value. Its initial value is ⊥.

operation update_snapshot(v) invoked by pi:
(1) REG[i] ← v;
(2) present ← PART.participate();
(3) result ← ∅;
(4) for each j ∈ present do result ← result ∪ {(j, REG[j])} end do;
(5) return (result)

Figure 7.3: One-shot immediate snapshot construction from a participating set object

Theorem 21 The algorithm described in Figure 7.3 is a bounded wait-free implementation of a one-shot immediate snapshot object.

Proof Let us first observe that the algorithm is bounded wait-free as soon as the algorithm implementing the underlying participating set object PART is bounded wait-free. We will see in Theorem 22 that there is a bounded wait-free implementation of PART.

As a process pi that invokes update_snapshot(v) first updates its register REG[i] and then invokes PART.participate(), it follows that a participating process has always deposited a value. The rest of the proof follows directly from the specification of the object PART. The set of process identities returned to pi is the set from which it builds its result. As this set satisfies the self-inclusion, set inclusion and immediacy properties associated with the object PART, the set of pairs computed satisfies the corresponding properties of the one-shot immediate snapshot specification. □ Theorem 21


7.1.3 A participating set algorithm

Underlying data structure A participating set algorithm is described in Figure 7.4. This algorithm uses an array of 1WMR atomic registers LEVEL[1 : n], where LEVEL[i] can be written only by pi. A process pi also uses a local array leveli[1 : n] to keep the last values it has (asynchronously) read from LEVEL[1 : n]. A register LEVEL[i] takes at most n + 1 distinct values (from n + 1 down to 1), which means that it requires b = ⌈log2(n + 1)⌉ bits. It is initialized to n + 1.

operation participate() invoked by pi:
    % initially: ∀j : LEVEL[j] = n + 1 %
(1) repeat LEVEL[i] ← LEVEL[i] − 1;
(2)   for each j ∈ {1, . . . , n} do leveli[j] ← LEVEL[j] end do;
(3)   seti ← {x | leveli[x] ≤ leveli[i]}
(4) until (|seti| ≥ leveli[i]);
(5) return (seti)

Figure 7.4: A participating set algorithm
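A Java transcription of the stairway algorithm might look as follows (a sketch; the class and method names are ours, and indices are 1-based to match the text):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of the participating set algorithm of Figure 7.4 (names ours).
public class ParticipatingSet {
    private final int n;
    private final AtomicIntegerArray level; // LEVEL[j], written only by pj

    public ParticipatingSet(int n) {
        this.n = n;
        level = new AtomicIntegerArray(n + 1); // index 0 unused
        for (int j = 1; j <= n; j++) level.set(j, n + 1); // top of the stairway
    }

    // Invoked at most once by process i (1 <= i <= n).
    public Set<Integer> participate(int i) {
        while (true) {
            level.set(i, level.get(i) - 1); // descend one stair
            int[] snap = new int[n + 1];    // asynchronous read of all levels
            for (int j = 1; j <= n; j++) snap[j] = level.get(j);
            Set<Integer> set = new HashSet<>();
            for (int x = 1; x <= n; x++)
                if (snap[x] <= snap[i]) set.add(x); // processes at or below my stair
            if (set.size() >= snap[i]) return set;  // enough of them: stop here
        }
    }
}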

Underlying principles of the algorithm Let us consider the image of a stairway made up of n + 1 stairs. Initially all the processes stand on the highest stair (i.e., the stair whose number is n + 1). (This is represented in the algorithm by the initial values of the LEVEL array: for any process pj, we have LEVEL[j] = n + 1.)

The algorithm is based on the following idea. When a process pi invokes participate(), it descends the stairway, going from step LEVEL[i] to step LEVEL[i] − 1 (line 1), until it reaches a step k such that there are k processes (including itself) stopped on steps 1 to k. It then returns the identities of these k processes.

To catch the underlying intuition and understand how this idea works, let us consider two extreme cases in which k processes invoke the participate() operation.

• Sequential case. In this case, the k processes invoke the operation sequentially, i.e., the next invocation starts only after the previous one has returned. It is easy to see that the first process pi1 that invokes the participate() operation descends from step n + 1 to step 1, and stops at this step. Then, process pi2 starts and descends from step n + 1 to step 2, etc., and the last process pik stops at step k.

Moreover, the set returned by pi1 is {i1}, the set returned by pi2 is {i1, i2}, etc., the set returned by pik being {i1, i2, . . . , ik}. These sets trivially satisfy the inclusion property.

• Synchronous case. In this case, the k processes proceed synchronously. They all, simultaneously, descend from step n + 1 to step n, then from step n to step n − 1, etc., and they all stop at step k, as there are then k processes on steps 1 to k (they all are on the same kth step).

It follows that all the processes return the very same set of participating processes, namely, the set including all of them: {i1, i2, . . . , ik}.

Other cases, where the processes proceed asynchronously and some of them crash, can easily be designed. The main question now is: how to make this idea operational? This is done by three statements (Figure 7.4). Let us consider a process pi:


• First, when it is standing on a given step LEVEL[i], pi reads the steps at which the other processes are (line 2). The aim of this asynchronous reading is to allow pi to compute an approximate global state of the stairway. Let us notice that, as a process pj can only go downstairs, the step on which pj currently stands is equal to or smaller than leveli[j]. It follows that, despite the fact that the global state obtained by pi is approximate, seti can be safely used by pi.

• Then (line 3), pi uses the approximate global state it has obtained to compute the set seti of the processes that are standing on a step between 1 and LEVEL[i], the step where pi currently is.

• Finally (line 4), if seti contains k = LEVEL[i] or more processes, pi returns it as its result for the participating set problem. Otherwise, it descends to the next stair, LEVEL[i] − 1 (line 1).

Proof of the algorithm Two preliminary lemmas are proved before the main theorem.

Lemma 6 Let seti = {x | leveli[x] ≤ LEVEL[i]} (as computed at line 3). For any process pi, the predicate |seti| ≤ LEVEL[i] is always satisfied at line 4.

Proof Let us first observe that leveli[i] and LEVEL[i] are always equal at lines 3 and 4. Moreover, any LEVEL[j] register can only decrease, and for any pair (i, j) we have LEVEL[j] ≤ leveli[j].

The proof is by contradiction. Let us assume that there is at least one process pi such that |seti| = |{x | leveli[x] ≤ LEVEL[i]}| > LEVEL[i]. Let k be the current value of LEVEL[i] when this occurs. |seti| > k and LEVEL[i] = k mean that at least k + 1 processes have descended at least to stair k. Moreover, as any process pj descends one stair at a time (it proceeds from stair LEVEL[j] to stair LEVEL[j] − 1 without skipping stairs), at least k + 1 processes have proceeded from stair k + 1 to stair k.

Among the (at least) k + 1 processes that are on stairs ≤ k, let pℓ be the last process that updated its LEVEL[ℓ] register to k + 1 (due to the atomicity of the base registers, there is such a last process). When pℓ was on stair k + 1 (we then had LEVEL[ℓ] = k + 1), it obtained at line 3 a set setℓ such that |setℓ| = |{x | levelℓ[x] ≤ LEVEL[ℓ]}| ≥ k + 1 (this is because at least k + 1 processes have proceeded to stair k + 1 and, as pℓ is the last of them, it has read a value ≤ k + 1 from its own LEVEL[ℓ] register and from the registers of those processes). As |setℓ| ≥ k + 1, pℓ stopped descending the stairway at line 4, at stair k + 1. It then returned, contradicting the initial assumption stating that it progresses to stair k. □ Lemma 6

Lemma 7 If pi halts at stair k, then |seti| = k. Moreover, seti is composed of the processes that are at a stair k′ ≤ k.

Proof Due to Lemma 6, we always have |seti| ≤ LEVEL[i] when pi executes line 4. If it stops, we also have |seti| ≥ LEVEL[i] (test of line 4). It follows that |seti| = LEVEL[i]. Finally, if k is pi's current stair, we have LEVEL[i] = k (definition of LEVEL[i] and line 1). Hence, |seti| = k.

The fact that seti is composed of the identities of the processes that are at a stair ≤ k follows from the very definition of seti (namely, seti = {x | leveli[x] ≤ LEVEL[i]}), the fact that, for any x, LEVEL[x] ≤ leveli[x], and the fact that a process never climbs the stairway (it either halts on a stair, line 4, or descends to the next one, line 1). □ Lemma 7

Theorem 22 The algorithm described in Figure 7.4 is a bounded wait-free implementation of a participating set object.


Proof Let us observe that (1) LEVEL[i] is monotonically decreasing, and (2) at any time, seti is such that |seti| ≥ 1 (because it contains at least the identity i). It follows that the repeat loop always terminates (in the worst case when LEVEL[i] = 1). It follows that the algorithm is wait-free. Moreover, pi executes the repeat loop at most n times, and each computation inside the loop includes n reads of atomic base registers. It follows that O(n²) is an upper bound on the number of read/write operations on base registers involved in a participate() operation. The algorithm is consequently bounded wait-free.

The self-inclusion property is a direct consequence of the way seti is computed (line 3): trivially, the set {x | leveli[x] ≤ leveli[i]} always contains i.

For the set inclusion property, let us consider two processes pi and pj that stop at stairs ki and kj, respectively. Without loss of generality, let ki ≤ kj. Due to Lemma 7, there are exactly ki processes on stairs 1 to ki, and kj processes on stairs 1 to kj. As no process backtracks on the stairway (a process descends or stops), the set of kj processes returned by pj includes the set of ki processes returned by pi.

It follows from lines 3 and 4 that, if a process pj stops at a stair kj and then i ∈ setj, then pi stopped at a stair ki ≤ kj. It follows from Lemma 7 that the set setj returned by pj includes the set seti returned by pi, which proves the immediacy property. □ Theorem 22

7.2 A connection between (one-shot) renaming and snapshot

7.2.1 A weakened version of the immediate snapshot problem

Let us consider a weakened version of the (one-shot) immediate snapshot problem, without the immediacy property. This means that, when a process pi invokes update_snapshot(vi), it obtains a set Vi, and the sets returned satisfy the following properties:

• Self-inclusion. (i, vi) ∈ Vi.

• Set inclusion. ∀i, j : Vi ⊆ Vj or Vj ⊆ Vi.

This section shows that a one-shot snapshot algorithm can be obtained from a simple modification of a renaming algorithm.

7.2.2 The adapted algorithm

We consider here the renaming algorithm, based on reflector base objects, that has been described in Chapter 7. The idea for adapting it to solve the previous specification comes from the following observation: in addition to routing processes, reflectors can be used to help processes collect their final view Vi.

Instead of being Boolean atomic registers, the base atomic objects VISITED[0..1] now contain sets of pairs (i, vi), i.e., they are views. They are initialized to ∅. (The meaning of VISITED[y] = ∅ is the same as the meaning of ¬VISITED[y] in the base implementation of a reflector object.)

The operation reflect() is modified accordingly to take into account the computation of views. In addition to an entrance number (0 or 1), it takes a view V as an additional parameter. Let us recall that (1) a process that enters a reflector on the entrance labeled y leaves it on an exit with the same label (upy or downy), and (2) the network is designed in such a way that, for each reflector, each of its entrances is used by at most one process.

The modified reflect(V, y) is as follows (Figure 7.5). The input parameter V is the current estimate of the final view of the invoking process. Initially, V = {(i, vi)}. The aim of a reflect() invocation is to enrich V so that it converges to a final value that satisfies self-inclusion and set inclusion. When a process enters the reflector on entrance y, it writes its current local view in VISITED[y], and then reads the other register VISITED[1 − y]. If it is empty, its local view V does not change, and the process exits on downy. Otherwise the process adds the view in VISITED[1 − y] to V and exits on upy.

function reflect(V, y):
(1) VISITED[y] ← V;
(2) if (VISITED[1 − y] = ∅) then return (V, downy)
(3) else return (V ∪ VISITED[1 − y], upy) end if

Figure 7.5: Adapting the reflector base object
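The modified reflector can be sketched in Java as follows (names ours). Since each entrance is used by at most one process, plain atomic registers suffice for VISITED[0..1]; returning UP or DOWN is enough, as the caller knows on which side y it entered:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the view-carrying reflector of Figure 7.5 (names ours).
public class Reflector<P> {
    public enum Exit { UP, DOWN }
    public record Result<U>(Set<U> view, Exit exit) {}

    // VISITED[y]: view deposited by the (unique) process entering on side y.
    private final AtomicReference<Set<P>> visited0 = new AtomicReference<>(null);
    private final AtomicReference<Set<P>> visited1 = new AtomicReference<>(null);

    public Result<P> reflect(Set<P> v, int y) {
        AtomicReference<Set<P>> mine  = (y == 0) ? visited0 : visited1;
        AtomicReference<Set<P>> other = (y == 0) ? visited1 : visited0;
        mine.set(v);               // publish my current view estimate
        Set<P> seen = other.get(); // look at the other entrance
        if (seen == null) return new Result<>(v, Exit.DOWN);
        Set<P> merged = new HashSet<>(v);
        merged.addAll(seen);       // enrich my view with the other one
        return new Result<>(merged, Exit.UP);
    }
}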

The algorithm that directs the progress of a process in the network of reflectors is exactly the same as in the renaming algorithm. The only difference is in the returned value. Instead of a row number, the view obtained after the process has visited its last reflector (that reflector belongs to the last column) is returned as its final view to the invoking process. The corresponding update_snapshot() operation is described in Figure 7.6. Let us recall that the reflector object whose coordinates are (r, c) is denoted R[r, c]. The algorithm can be trivially modified to solve both one-shot renaming and one-shot snapshot.

operation update_snapshot(vi):
(1)  Vi ← {(i, vi)}; ci ← idi; ri ← idi;
(2)  while (ci = idi) do
(3)    (Vi, exit) ← R[ri, ci].reflect(Vi, 1);
(4)    if (exit = up1) then ci ← ci + 1
(5)    else ri ← ri − 1;
(6)      if ri < −ci then ci ← ci + 1 end if
(7)    end if
(8)  end while;
(9)  while (ci < N) do
(10)   (Vi, exit) ← R[ri, ci].reflect(Vi, 0);
(11)   ci ← ci + 1;
(12)   if (exit = up0) then ri ← ri + 1
(13)   else ri ← ri − 1
(14)   end if
(15) end while;
(16) return (Vi)    % 0 ≤ ri + N ≤ 2(n − 1) %

Figure 7.6: From renaming to snapshot

Let us recall that the new name of a process in the original renaming algorithm is the number of the row where it stands when it reaches the last column. It follows from that algorithm and the modified reflect() operation that, if two processes pi and pj are such that the new name of pi is smaller than the new name of pj, then Vi ⊆ Vj. The self-inclusion property follows directly from the reflect() operation, as the set it returns always includes its input parameter set, and the set Vi, whose final value is returned to pi, is initialized to {(i, vi)}.

7.3 Iterated immediate snapshot

Iterated memory and nonblocking simulation.


Bibliographic notes

Afek et al. JACM

Aguilera 04

Attiya-Fouren

Borowsky-Gafni 93

Masuzawa 94, MWMR and O(n)

Gafni and Rajsbaum, OPODIS 2010

Exercises

One-shot snapshot from renaming (attiya)


Part III

General Memory


Chapter 8

Consensus and universal construction

In the first part of this book, we considered multiple powerful abstractions that can be wait-free implemented using read-write registers. A natural question arises: can any object type be implemented this way? We show in this chapter that the answer is no: for example, a queue cannot be wait-free implemented even when shared by only two processes. More generally, we address the following fundamental question:

Given object types T and T′, is there a wait-free implementation of an object of type T from objects of type T′?

Recall that an object operation can be either total or partial (Section 2.2.2). A pending partial operation may not always be able to complete. Indeed, there are executions in which the partial operation cannot be linearized, and, thus, it must be forced to wait until the value of the object allows it to proceed. In contrast (Section 2.6), a pending total operation can always be completed by a process, regardless of the behavior of the other processes. Thus, only total operations can be wait-free implemented. In this chapter we assume total object types.

8.1 What cannot be read-write implemented

To warm up, let us consider a queue object type that exports two operations, enqueue() and dequeue(). In a sequential execution, enqueue(v) adds v to the end of the queue, and dequeue() returns the first element of the queue and removes it from the queue. If the queue is empty, the default value ⊥ is returned.

8.1.1 The case of one dequeuer

Let us assume that only one process is allowed to invoke dequeue() on the concurrent implementation of the queue. Such a restricted queue allows for a simple read-write wait-free implementation.

Each enqueuer pi maintains a register Ri which stores the sequence of values enqueued by pi so far, each value equipped with a "timestamp." Each time pi enqueues a value, it scans all the registers to find the highest timestamp t used so far and updates Ri with the new value equipped with timestamp t + 1. The dequeuer simply reads all registers Ri and returns the value with the lowest timestamp that was not previously returned (ties broken arbitrarily, e.g., by picking the value enqueued by the enqueuer with the lowest id).
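A Java sketch of this single-dequeuer queue could look as follows (all names are ours; each register Ri holds the whole sequence of pi's enqueued values, and consumed is the dequeuer's private bookkeeping):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the single-dequeuer queue described above (names ours).
public class SingleDequeuerQueue<V> {
    private record Item<U>(U val, long ts) {}

    private final AtomicReference<List<Item<V>>>[] r; // Ri, written only by enqueuer i
    private final int[] consumed; // dequeuer-local: prefix of Ri already returned

    @SuppressWarnings("unchecked")
    public SingleDequeuerQueue(int n) {
        r = new AtomicReference[n];
        for (int i = 0; i < n; i++) r[i] = new AtomicReference<>(List.of());
        consumed = new int[n];
    }

    // Called by enqueuer i only.
    public void enqueue(int i, V v) {
        long t = 0; // highest timestamp observed so far
        for (AtomicReference<List<Item<V>>> reg : r)
            for (Item<V> it : reg.get()) t = Math.max(t, it.ts());
        List<Item<V>> updated = new ArrayList<>(r[i].get());
        updated.add(new Item<>(v, t + 1));
        r[i].set(List.copyOf(updated)); // single writer, so a plain write suffices
    }

    // Called by the unique dequeuer only; null plays the role of ⊥.
    public V dequeue() {
        int best = -1;
        Item<V> bestItem = null;
        for (int i = 0; i < r.length; i++) {
            List<Item<V>> seq = r[i].get();
            if (consumed[i] < seq.size()) {
                Item<V> cand = seq.get(consumed[i]); // oldest unreturned value of pi
                if (bestItem == null || cand.ts() < bestItem.ts()) {
                    best = i; bestItem = cand; // ties broken by lowest enqueuer id
                }
            }
        }
        if (best < 0) return null; // empty queue
        consumed[best]++;
        return bestItem.val();
    }
}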


operation propose(v):
  if (x = ⊥) then x := v end if;
  return (x)

Figure 8.1: Consensus specification: sequential execution of propose(v)

Intuitively, the implementation is correct since we only need to break ties for values that were enqueued concurrently, and those can be linearized either way. We encourage the reader to find a formal correctness argument.

[[PK exercise: prove that it is correct?]]

8.1.2 Two or more dequeuers

What about a general queue, shared by two or more processes, where every process is allowed to enqueue or dequeue elements? We show below that such a queue cannot be wait-free implemented from atomic registers.

Schedules, configurations and values go here

Lemma 8 Every queue implementation has a bivalent configuration.

Lemma 9 Every queue implementation has a critical configuration.

Theorem 23 There is no wait-free two-process queue implementation from atomic registers.

Proof. □ Theorem 23

Since any n-process wait-free implementation (n ≥ 2) implies a 2-process wait-free implementation, we have:

Corollary 2 For any n ≥ 2, there is no wait-free n-process queue implementation from atomic registers.

8.2 Universal objects and consensus

An object type T is universal if an object of any (total) type T′ can be wait-free implemented from objects of type T, together with atomic registers. An algorithm providing such an implementation is called a universal construction.

In this chapter, we introduce consensus as an example of a universal object type. We present two consensus-based universal constructions. The first is wait-free; the second one is bounded wait-free. (Recall that an implementation is bounded wait-free if there is a bound on the number of base-object steps an operation must perform before terminating.)

The consensus object type exports an operation propose() that takes one input parameter v in a value set V (|V| ≥ 2) and returns a value in V. Let ⊥ denote a default value that cannot be proposed by a process (⊥ ∉ V). Then V ∪ {⊥} is the set of states a consensus object can take, ⊥ is its initial state, and its sequential specification is defined in Figure 8.1. A consensus object can thus be seen as a "write-once" register that keeps forever the value proposed by the first propose() operation; any subsequent propose() operation returns the first written value.
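In Java, such a write-once register is directly provided by compareAndSet, so a consensus object can be sketched as follows (names ours). Note the hedge: compareAndSet is a primitive strictly stronger than a read-write register; as this chapter explains, registers alone cannot provide this behavior wait-free for two or more processes.

import java.util.concurrent.atomic.AtomicReference;

// Sketch of a wait-free consensus object (names ours); it relies on
// compareAndSet, which is stronger than a read-write register.
public class Consensus<V> {
    private final AtomicReference<V> x = new AtomicReference<>(null); // null is ⊥

    public V propose(V v) {
        x.compareAndSet(null, v); // write v only if no value was written yet
        return x.get();           // all processes return the first written value
    }
}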


Given a linearizable implementation of the consensus object type, we say that a process proposes v if it invokes propose(v) (we then say that it is a participant in consensus). If the invocation of propose(v) returns a value v′, we say that the invoking process decides v′, or that v′ is decided by the consensus object. We observe now that any execution of a wait-free linearizable implementation of the consensus object type satisfies three properties:

• Agreement: no two processes decide different values.

• Validity: every decided value was previously proposed.

Indeed, otherwise, there would be no way to linearize the execution with respect to the sequential specification in Figure 8.1, which only allows deciding the first proposed value.

• Termination: every correct process eventually decides.

This property is implied by wait-freedom: every process taking sufficiently many steps of the consensus implementation must decide.

8.3 A wait-free universal construction

In this section, we show that if, in a system of n processes, we can wait-free implement consensus, then we can implement any total object type.

Recall that a total object type can be represented as a tuple (Q, q0, O, R, δ), where Q is a set of states, q0 ∈ Q is an initial state, O is a set of operations, R is a set of responses, and δ is a relation on O × Q × R × Q, total on O × Q: (o, q, r, q′) ∈ δ if, when operation o is applied in state q, the object can return r and change its state to q′. Note that for non-deterministic object types, there can be multiple such pairs (r, q′) for given o and q.

The goal of our universal construction is, given an object type τ = (Q, q0, O, R, δ), to provide a wait-free linearizable implementation of τ using read-write registers and atomic consensus objects.

8.3.1 Deterministic objects

For deterministic object types, δ can be seen as a function O × Q → R × Q that associates each state and operation with a unique response and a unique resulting state. The state of a deterministic object is thus determined by the sequence of operations applied to the initial state of the object. The universal construction for an object of a deterministic type is presented in Figure 8.2.
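As an illustration, the queue of Section 8.1, seen as a deterministic total type, can be written as a transition function δ : O × Q → R × Q. The following is a sketch with names of our choosing:

import java.util.ArrayList;
import java.util.List;

// The queue type as a deterministic transition function (sketch).
// States are immutable sequences; null plays the role of ⊥.
final class QueueType {
    record State(List<Integer> items) {}
    record Step(Integer response, State next) {} // one (r, q') pair of δ

    static State initial() { return new State(List.of()); }

    static Step enqueue(State q, int v) {
        List<Integer> items = new ArrayList<>(q.items());
        items.add(v);
        return new Step(null, new State(List.copyOf(items)));
    }

    static Step dequeue(State q) {
        if (q.items().isEmpty()) return new Step(null, q); // empty queue: ⊥
        List<Integer> items = new ArrayList<>(q.items());
        Integer head = items.remove(0);
        return new Step(head, new State(List.copyOf(items)));
    }
}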

Correctness.

Lemma 10 At all times, for all processes pi and pj, linearized_i and linearized_j are related by containment.

Proof We observe that linearized_i is constructed by appending the batches of requests decided by consensus objects C1, C2, . . ., in that order. The agreement property of consensus (applied to each of these consensus objects) implies that, for all i and j, either linearized_i is a prefix of linearized_j or vice versa. □ Lemma 10

Lemma 11 Every operation returns in a finite number of its steps.


Shared objects:
  R, store-collect object, initially ⊥
  C1, C2, . . ., consensus objects

Local variables, for each process pi:
  integer seqi, initially 0               { the number of executed requests of pi }
  integer ki, initially 0                 { the number of batches of executed requests }
  sequence linearized_i, initially empty  { the sequence of executed requests }

Code for operation op executed by pi:
7   seqi := seqi + 1
8   R.store(op, i, seqi)                  { publish the request }
9   repeat
10    V := R.collect()                    { collect all current requests }
11    requests := V − {linearized_i}      { choose not yet linearized requests }
12    ki := ki + 1
13    decided := C[ki].propose(requests)
14    linearized_i := linearized_i · decided   { append decided requests }
15  until (op, i, seqi) ∈ linearized_i
16  return the result of (op, i, seqi) in linearized_i using δ and q0

Figure 8.2: Universal construction for deterministic objects

Proof Suppose, by contradiction, that a process pi invokes an operation op and executes infinitely many steps without returning. By the algorithm, pi forever blocks in the repeat-until clause in lines 9-15. Thus, pi proposes batches of requests containing its request (op, i, seqi) to an infinite sequence of consensus instances C1, . . ., but the decided batches never contain (op, i, seqi). By validity of consensus, there exists a process pj ≠ pi that accesses infinitely many consensus objects. By the algorithm, before proposing a batch to a consensus object, pj first collects the batches currently stored by other processes in the store-collect object R. Since pi stores its request in R and never updates it after that, eventually every such process pj must collect pi's request and propose it to the next consensus object. Thus, every value returned by the consensus objects from some point on must contain pi's request, a contradiction. □ Lemma 11

Theorem 24 For each type τ = (Q, q0, O, R, δ), the algorithm in Figure 8.2 describes a wait-free linearizable implementation of τ using consensus objects and atomic registers.

Proof Let H be the history of an execution of the algorithm in Figure 8.2. By Lemma 10, the local variables linearized_i are prefixes of some sequence of requests linearized. Let L be the legal sequential history where operations are ordered by linearized and responses are computed using q0 and δ. We construct H′, a completion of H, by adding responses to the incomplete operations in H that are present in L. By construction, L agrees with the local history of H′ for each process.

Now we show that L respects the real-time order of H. Consider any two operations op and op′ such that op →H op′ and suppose, by contradiction, that op′ →L op. Let (op, i, si) and (op′, j, sj) be the corresponding requests issued by the processes invoking op and op′, respectively. Thus, in linearized, (op′, j, sj) appears before (op, i, si), i.e., before op terminates, it witnesses (op′, j, sj) being decided by one of the consensus objects C1, C2, . . . before (op, i, si). But, by our assumption, op →H op′ and, thus, (op′, j, sj) was stored in the store-collect object R only after op returned. But the validity property of consensus does not allow deciding a value that has not yet been proposed, a contradiction. Thus, op →L op′, and we conclude that H is linearizable. □ Theorem 24


Shared objects:
  R, store-collect object, initially ⊥        { published requests }
  C1, C2, . . ., consensus objects
  S, store-collect object, initially (1, ε)   { the current consensus object and the last committed sequence of requests }

Local variables, for each process pi:
  integer seqi, initially 0             { the number of executed requests of pi }
  integer ki, initially 0               { the number of batches of executed requests }
  sequence linearized_i, initially ε    { the sequence of executed requests }

Code for operation op executed by pi:
17  seqi := seqi + 1
18  R.store(op, i, seqi)                       { publish the request }
19  (ki, linearized_i) := max(S.collect())     { get the current consensus object and the most recent state }
20  repeat
21    V := R.collect()                         { collect all current requests }
22    requests := V − {linearized_i}           { choose not yet linearized requests }
23    ki := ki + 1
24    decided := C[ki].propose(requests)
25    linearized_i := linearized_i · decided   { append decided requests }
26  until (op, i, seqi) ∈ linearized_i
27  S.store((ki + 1, linearized_i))            { publish the current consensus object and state }
28  return the result of (op, i, seqi) in linearized_i using δ and q0

Figure 8.3: Bounded wait-free universal construction for deterministic objects


8.3.2 Bounded wait-free universal construction

The implementation described in Figure 8.2 is wait-free but not bounded wait-free. A process may take arbitrarily many steps in the repeat-until clause in lines 9-15 to "catch up" with the current consensus object.

It is straightforward to turn this implementation into a bounded wait-free one. Before returning an operation's response (line 16), a process posts in the shared memory the sequence of requests it has witnessed committed, together with the id of the last consensus object it has accessed. On invoking an operation, a process reads the memory to get the "most recent" state of the implemented object and the "current" consensus id. Note that multiple processes concurrently invoking different operations might get the same estimate of the "current state" of the implementation. In this case only one of them may "win" the current consensus instance and execute its request. But we argue that the requests of the "losing" processes must then be committed by the next consensus object, which implies that every operation returns in a bounded number of its own steps. The bound here depends on the implementation of the underlying store-collect objects.

The resulting implementation is presented in Figure 8.3. To prove the following theorem, we assume that it takes O(n) read-write steps to implement the store-collect objects R and S (Chapter ??).


Theorem 25 For each type τ = (Q, q0, O, R, δ), the algorithm in Figure 8.3 describes a wait-free linearizable implementation of τ using consensus objects and atomic registers, where every operation returns in O(n) base-object steps.

Proof The proof of linearizability is similar to the one in the proof of Theorem 24.

To prove bounded wait-freedom, consider a request (op, i, ℓ) issued by a process pi. By the algorithm, pi first publishes its request and obtains the current state of the implemented object (line 19), denoted k and s, respectively. Then pi proposes all requests it observes proposed but not yet committed to consensus object Ck. If (op, i, ℓ) is committed by Ck, then pi returns after taking O(n) read-write steps (we assume that both collect operations involve O(n) read-write steps).

Suppose now that (op, i, ℓ) is not committed by Ck. Thus, another process pj has previously proposed to Ck a set of requests that did not include (op, i, ℓ). Thus, pj collected requests in line 21 before pi published (op, i, ℓ) in line 18.

[[PK: to complete]] □ Theorem 25

8.3.3 Non-deterministic objects

The universal construction in Figure 8.2 assumes that the object type is deterministic: for each state and each operation, there exists exactly one resulting state and response pair. Thus, given a sequence of requests, there is exactly one corresponding sequence of responses and state transitions.

A "dumb" way to use our universal construction is to consider any deterministic restriction of the given object type. But this may not be desirable if we expect the shared object to behave probabilistically (e.g., in randomized algorithms). A "fair" non-deterministic universal construction can be derived from the algorithm in Figure 8.3 as follows. Instead of proposing only a sequence of requests in line 24, process pi (non-deterministically) picks a sequence of possible responses and state transitions, assuming that the sequence of operations in requests is applied to the last state in linearized_i.

[[PK: to complete]]

8.4 Bibliographic notes

Herlihy 1991

Attiya-Welch 1998

State machine replication: Lamport, Schneider

Chandra-Toueg total order broadcast from consensus

Guerraoui-Raynal 2004: tech report on FT atomic objects


Chapter 9

Consensus number and the consensus hierarchy

In the previous chapter, we introduced the notion of a universal object type. Using read-write registers and objects of a universal type, one can wait-free implement an object of any total type. One example of a universal type is consensus. Therefore, the following question is fundamental:

Which object types allow for a wait-free implementation of consensus?

For example, can atomic registers implement consensus on their own? If not, what about queues combined with registers? In this chapter, we address this question by introducing the notion of the consensus number of an object type T: the largest number of processes for which T is universal. The consensus number is fundamental in capturing the relative power of object types, and we show how to evaluate the consensus power of various object types.

9.1 Consensus number

The consensus number of an object type T, denoted by CN(T), is the largest number n such that it is possible to wait-free implement a consensus object from atomic registers and objects of type T in a system of n processes. If there is no such largest n, i.e., consensus can be implemented in a system with an arbitrary number of processes, the consensus number of T is said to be infinite.

Note that the existence of a wait-free implementation in a system of n processes implies a wait-free implementation in a system of any n′ < n processes. Thus, the notion of consensus number is well-defined. By the definition, if CN(T) < CN(T′), then there is no wait-free implementation of an object of type T′ from objects of type T and registers in a system of CN(T) + 1 or more processes.

What if atomic registers were strong enough to wait-free implement consensus for any number of processes, i.e., CN(register) = ∞? Then all object types would have the same consensus number, and the very notion of consensus number would be useless. We show in this chapter that this is not the case. Moreover, we show that for each n, there exists an object type T such that CN(T) = n, i.e., the consensus hierarchy is populated at each level n.


9.2 Preliminary definitions

In this section, we introduce some machinery that facilitates computing the consensus numbers of various object types. This includes the notions of a schedule, a configuration, and valence.

9.2.1 Schedule, configuration and valence

This section defines new notions (schedule, configuration, and valence) that are central to proving the impossibility of wait-free implementing a consensus object from some "base" object types. Before giving these definitions, it also recalls a few notions and results, introduced in the first chapter, that are useful to better understand the results presented in this chapter.

Reminder Let us consider an execution made up of sequential processes that invoke operations on atomic objects of types T1, . . . , Tx. These objects are called "base objects" (equivalently, the types T1, . . . , Tx are called "base types"). We have seen in the first chapter (Theorem 1) that, as each base object is atomic, that execution can be modeled, at the operation level, by an atomic history S on the operations issued by the processes. This means that S is a sequential history that (1) includes all the operations issued by the processes (except possibly the last operation of a process if that process crashes), (2) is legal, and (3) respects the real-time occurrence order on the operations. As we have seen, such a history S is also called a linearization.

Schedules and configurations A schedule is a sequence of operations issued by processes. Sometimes an operation is represented in a schedule only by the name of the process that issues it.

A configuration C is a global state of the system execution at a given point in time. It includes the value of each base object plus the local state of each process. The configuration p(C) denotes the configuration obtained from C by applying an operation issued by the process p. More generally, given a schedule S and a configuration C, S(C) denotes the configuration obtained by applying to C the sequence of operations defining S.

Valence The valence notion is a fundamental concept for proving consensus impossibility results. Let us consider a consensus object such that only the values 0 and 1 can be proposed by the processes. Such an object is called a binary consensus object. Let us assume that there is an algorithm A implementing such a consensus object from base type objects. Let C be a configuration attained during an execution of the algorithm A.

The configuration C is v-valent if, no matter which schedule the algorithm applies from C, it always leads to v being the decided value; v is the valence of that configuration. If v = 0 (resp., v = 1), C is said to be 0-valent (resp., 1-valent). A 0-valent or 1-valent configuration is said to be monovalent. A configuration that is not monovalent is said to be bivalent.

While a monovalent configuration means that the decided value is already determined (whether the processes are aware of it or not), the decided value is not yet determined in a bivalent configuration.

9.2.2 Bivalent initial configuration

The next theorem shows that, for any wait-free consensus algorithm A, there is at least one bivalent initial configuration, i.e., a configuration in which the decided value is not predetermined: any one of several proposed values can still be decided (for each of these values v, there is a schedule generated by the algorithm A that, starting from that configuration, decides v).

This means that, while the decided value is determined from the inputs alone when the initial configuration is univalent, this is not true of every initial configuration, as there is at least one bivalent one. The value decided by a wait-free consensus algorithm cannot always be deterministically determined from the inputs: it can also depend on the execution of the algorithm A itself.

Theorem 26 Let us assume that there is an algorithm A that wait-free implements a consensus object in a system of n processes. Then there is a bivalent initial configuration.

Proof Let C0 be the initial configuration in which all the processes propose 0 to the consensus object, and Ci, 1 ≤ i ≤ n, the initial configuration in which the processes from p1 to pi propose the value 1, while all the other processes propose 0. Thus, all the processes propose 1 in Cn. These configurations constitute a sequence in which any two adjacent configurations Ci−1 and Ci, 1 ≤ i ≤ n, differ only in the value proposed by the process pi: it proposes the value 0 in Ci−1 and the value 1 in Ci. Moreover, it follows from the validity property of the consensus algorithm A that C0 is 0-valent, while Cn is 1-valent.

Let us assume that all these initial configurations are univalent. It follows that, in the previous sequence, there is (at least) one pair of consecutive configurations, say Ci−1 and Ci, such that Ci−1 is 0-valent and Ci is 1-valent. We now derive a contradiction.

Assuming that no process crashes, let us consider an execution history H of the algorithm A that starts from the configuration Ci−1, in which the process pi executes no operation for an arbitrarily long period (the end of that period is defined below). As the algorithm is wait-free, all the processes decide after a finite number of their own operations. The sequence of operations that starts at the very beginning of the history and ends when all the processes except pi have decided (pi has not yet executed an operation) defines the schedule S. (See the upper part of Figure 9.1; within the vector of values proposed by the processes, the value proposed by pi is shown inside a box.) Then, after S terminates, pi starts executing and eventually decides. As Ci−1 is 0-valent, S(Ci−1) is also 0-valent.

[Figure: upper part: from Ci−1, with proposal vector [1, . . . , 1, 0, 0, . . . , 0] (pi's proposal boxed), the schedule S contains no operation by pi, and S(Ci−1) is 0-valent; lower part: from Ci, with proposal vector [1, . . . , 1, 1, 0, . . . , 0], the same schedule S yields S(Ci), which is therefore also 0-valent, although Ci is 1-valent.]

Figure 9.1: There is a bivalent initial configuration

Let us observe (lower part of Figure 9.1) that the same schedule S can be produced by the algorithm A from the configuration Ci. This is because (1) the configurations Ci−1 and Ci differ only in the value proposed by pi, and (2) as pi executes no operation in S, that schedule cannot depend on the value proposed by pi. It follows that, as S(Ci−1) is 0-valent, the configuration S(Ci) is also 0-valent. But, on the other hand, Ci is 1-valent, so S(Ci) is 1-valent: a contradiction. □ Theorem 26

Crash vs. asynchrony The previous proof is based on (1) the assumption that the consensus algorithm A is wait-free (intuitively, the progress of a process does not depend on the "speed" of the other processes), and (2) asynchrony (a process progresses at its "own speed"). This allows the proof to play with process speeds and to consider a schedule (part of an execution history) in which a process pi executes no operation. We could instead have considered that pi has initially crashed (i.e., pi crashes before executing any operation). During the schedule S, the wait-free consensus algorithm A (whose existence is a theorem assumption) has no way to know which case the system is really in (has pi initially crashed, or is it only very slow?). This shows that, for some problems, asynchrony and process crashes are two facets of the same "uncertainty" that wait-free algorithms have to cope with.

9.3 The weak wait-free power of atomic registers

We have seen in the second part of this book that atomic registers allow wait-free implementing atomic counters and atomic snapshot objects. As atomic registers are very basic objects, an important question, from a computability point of view, is then: can atomic registers wait-free implement objects such as a queue or a stack shared by concurrent processes? This section shows that the answer to this question is "no".

More precisely, this section shows that MWMR atomic registers are not powerful enough to wait-free implement a consensus object in a system of two processes. This means that the consensus number of the type "atomic register" is 1, i.e., atomic registers allow wait-free implementing consensus only in a system made up of a single process! Stated another way, atomic registers have the "poorest" power when one is interested in wait-free implementations of atomic objects in systems of asynchronous processes prone to process crashes.

9.3.1 The consensus number of atomic registers is 1

To show that there is no algorithm that wait-free implements a consensus object in a system of two processes p and q, the proof assumes such an algorithm and derives a contradiction. The concept central to that proof is the notion of valence introduced previously.

Theorem 27 There is no algorithm A that wait-free implements a consensus object from atomic registers in a system of two processes (i.e., the consensus number of atomic registers is 1).

Proof Let us assume (by contradiction) that there is an atomic-register-based algorithm A that wait-free implements a consensus object in a system of two processes. By Theorem 26, there is an initial bivalent configuration. The proof of the theorem consists in showing that, starting from a bivalent configuration C, there is always an arbitrarily long schedule S produced by A that leads from C to another bivalent configuration S(C). It follows that A has a run in which no process ever decides, which proves the theorem.

Given a configuration D, recall that p(D) is the configuration obtained by applying the next operation of the process p (as defined by the algorithm A) to the configuration D. Recall also that the operations p or q can issue are reads or writes of base atomic registers.

Let us assume that, starting the algorithm from the bivalent configuration C, there is a maximal schedule S such that D = S(C) is bivalent. "Maximal" means that both the configuration p(D) and the configuration q(D) are monovalent, and that they have different valences (otherwise, D would not be bivalent). Without loss of generality, let us consider that p(D) is 0-valent, while q(D) is 1-valent.

The operation that leads from D to p(D) is a read or a write by p of a base register R1. Similarly, the operation that leads from D to q(D) is a read or a write by q of a base register R2. The proof consists in a case analysis.


1. R1 and R2 are distinct registers (Figure 9.2). In that case, whatever the (read or write) operations OP1() and OP2() issued by p and q on the base registers R1 and R2, as the processes access different registers, the configurations p(q(D)) and q(p(D)) are the same configuration, i.e., p(q(D)) ≡ q(p(D)).

[Figure: from the bivalent configuration D, R1.OP1() by p leads to the 0-valent configuration p(D), and R2.OP2() by q leads to the 1-valent configuration q(D); applying the other process's operation to each leads to the common configuration q(p(D)) ≡ p(q(D)).]

Figure 9.2: Operations issued on distinct registers

As q(D) is 1-valent, it follows that p(q(D)) is also 1-valent. Similarly, as p(D) is 0-valent, it follows that q(p(D)) is also 0-valent. A contradiction, as the configuration p(q(D)) ≡ q(p(D)) cannot be both 0-valent and 1-valent.

2. R1 and R2 are the same register R.

• Both p and q read R. As a read operation on an atomic register does not modify its value, this case is the same as the previous one, where p and q access distinct registers.

• p reads R, while q writes R (Figure 9.3). (The case where q reads R while p writes R is similar.) Let Readp be the read operation issued by p on R, and Writeq the write operation issued by q on R. As Readp(D) is 0-valent, so is Writeq(Readp(D)). Moreover, Writeq(D) is 1-valent.

The configurations D and Readp(D) differ only in the local state of p (it has read R in Readp(D), while it has not in D). These two configurations cannot be distinguished by q. Let us consider the following two executions:

– After the configuration D has been attained by the algorithm A, p stops executing for an arbitrarily long period, during which only q executes operations. As the algorithm A is by assumption wait-free, there is a finite sequence of operations issued by q at the end of which q decides. Let S′ be the schedule made up of these operations. As Writeq(D) is 1-valent, q decides 1. (Thereafter, p wakes up and executes operations as specified by the algorithm A. Alternatively, p could crash after the configuration D has been attained.)

– Similarly, after the configuration Readp(D) has been attained by the algorithm A, p stops executing for an arbitrarily long period. The same schedule S′ (defined in the previous item) can be issued by q after the configuration Readp(D). This is because, as p issues no operation, q cannot distinguish D from Readp(D). It follows that q decides at the end of that schedule, and, as Writeq(Readp(D)) is 0-valent, q decides 0.


[Figure: from D, Writeq(D) is 1-valent and Readp(D) is 0-valent, with Writeq(Readp(D)) 0-valent; from Writeq(D) and from Writeq(Readp(D)), the same schedule S′ (only operations by q) leads q to decide.]

Figure 9.3: Read and write issued on the same register

But, while executing the schedule S′, q cannot know which configuration (D or Readp(D)) it started from (these configurations differ only in a read of R by p). As the schedule S′ is deterministic (it is composed only of read and write operations issued by q on base atomic registers), q must decide the same value whatever the configuration at the beginning of S′. A contradiction, as it decides 1 in the first case and 0 in the second case.

• Both p and q write the same register R. Let Writep and Writeq be the write operations issued by p and q on R, respectively. By assumption, the configurations Writep(D) and Writeq(D) are 0-valent and 1-valent, respectively. The configurations Writeq(Writep(D)) and Writeq(D) cannot be distinguished by q: the write of R by p in the configuration D that produces the configuration Writep(D) is overwritten by q when it produces the configuration Writeq(Writep(D)). The reasoning is then the same as in the previous item. It follows that, if q executes alone from D until it decides, it decides 1 after executing a schedule S′′. The same schedule issued from the configuration Writep(D) leads it to decide 0. But, as q cannot distinguish D from Writep(D), and S′′ is deterministic, it has to decide the same value in both executions, a contradiction as it decides 1 in the first case and 0 in the second. (Note that Figure 9.3 is still valid: we only have to replace Readp(D) and S′ by Writep(D) and S′′, respectively.)

□ Theorem 27

9.3.2 The wait-free limit of atomic registers

Theorem 28 It is impossible to wait-free implement any object with consensus number greater than 1 from atomic registers.

Proof The proof is an immediate consequence of Theorem 27 (the consensus number of atomic registers is 1) and Theorem ?? (if CN(X) < CN(Y), then X does not allow wait-free implementing Y in a system of more than CN(X) processes). □ Theorem 28


9.3.3 Another limit of atomic registers

Naming the anonymous

9.4 Objects whose consensus number is 2

As atomic registers are too weak to wait-free implement a consensus object for two processes, the question posed at the beginning of the chapter becomes: are there objects that allow wait-free implementing a consensus object for two or more processes? This section first considers three base objects (test&set objects, queues, and swap objects) and shows that they can wait-free implement consensus in a system of two processes denoted p0 and p1 (considering the process indices 0 and 1 makes the presentation simpler). It then shows that they cannot wait-free implement consensus in a system of three or more processes.

9.4.1 Consensus from test&set objects

Test&set objects A test&set object is an atomic object that provides the processes with a single operation (called test&set(), hence the name of the object). Such an object can be seen as maintaining an internal state variable x that can contain the value 0 or 1. It is initialized to 0 and can be accessed with the test&set() operation. Assuming only one operation at a time is executed, its sequential specification is defined as follows:

operation test&set():
    prev_val ← x;
    if (prev_val = 0) then x ← 1 end if;
    return (prev_val).

From test&set objects to consensus The algorithm described in Figure 9.4 constructs a consensus object for two processes from a test&set object TS. It uses two additional 1W1R atomic registers REG[0] and REG[1] (a process pi can always keep a local copy of the atomic register it writes, so we do not count it as one of its readers). The construction is made up of two parts:

• When the process pi invokes propose(v) on the consensus object, it deposits the value it proposes into REG[i] (line 1). In this part, pi makes public the value it proposes.

• Then pi executes a control part to learn which value has to be decided. To that end, it uses the test&set object (line 2). If it obtains the initial value of the test&set object (0), it decides the value it has proposed (line 3); otherwise it decides the value proposed by the other process p1−i (line 4).

In the following, we call the winner the process that is the first to execute line 2. More precisely, as the test&set object is atomic, the winner is the process whose TS.test&set() operation is the first to appear in the linearization order associated with the object TS. The proof shows that the value decided by the consensus object is the value deposited by the winner pj in its register REG[j].

Theorem 29 The algorithm described in Figure 9.4 is a wait-free construction of a consensus object from a test&set object, in a system of two processes.


operation propose(v) issued by pi:
(1) REG[i] ← v;
(2) aux ← TS.test&set();
(3) case (aux = 0) then return (REG[i])
(4)      (aux = 1) then return (REG[1 − i])
(5) end case

Figure 9.4: From test&set to consensus
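To make the construction concrete, here is a minimal executable sketch in Java; it assumes that AtomicBoolean.getAndSet(true) plays the role of the test&set() operation (it atomically sets the bit and returns its previous value), and the class and field names are ours, not the book's.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReferenceArray;

class TestAndSetConsensus<V> {
    // TS: the internal bit x, 0 represented by false, 1 by true
    private final AtomicBoolean ts = new AtomicBoolean(false);
    // REG[0..1]: one register per process
    private final AtomicReferenceArray<V> reg = new AtomicReferenceArray<>(2);

    // propose(v) issued by process pi, with i in {0, 1}
    V propose(int i, V v) {
        reg.set(i, v);                     // line 1: publish the proposed value
        boolean prev = ts.getAndSet(true); // line 2: TS.test&set()
        return prev ? reg.get(1 - i)       // line 4: loser decides the winner's value
                    : reg.get(i);          // line 3: winner decides its own value
    }
}

If one thread calls propose(0, v0) and another calls propose(1, v1), both return the value published by whichever thread's getAndSet() is linearized first.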

Proof The algorithm is clearly wait-free. Let pj be the winner. Observe that it deposits the value v it proposes in REG[j] before invoking TS.test&set() (this follows from the fact that, as both REG[j] and TS are atomic, an execution that involves both of them is also atomic, and consequently the linearization order with which we reason respects the process order). When the winner pj executes line 2, the test&set object TS changes its value from 0 to 1, and then, as every other invocation finds TS = 1, the test&set object keeps the value 1 forever. As pj is the only process that obtains the value 0 from the object TS, it decides the value v it has just deposited in REG[j] (line 3). Moreover, as the other process obtains the value 1 from TS, that process does not decide the value it proposes but the other proposed value, namely the value deposited in REG[j] by the winner pj (line 4). It follows that a single value is decided, and that value has been proposed by a process. Consequently, the algorithm described in Figure 9.4 is a wait-free implementation of a consensus object in a system of two processes. □ Theorem 29

9.4.2 Consensus from queue objects

Queue objects These objects have already been used in several chapters. A queue is defined by two total operations with a sequential specification. The enqueue operation adds an item at the end of the queue. The dequeue operation removes the item at the head of the queue and returns it to the calling process; if the queue is empty, the default value ⊥ is returned.

From queue objects to consensus A wait-free algorithm that constructs a consensus object from a queue, in a system of two processes, is described in Figure 9.5. This algorithm is based on the same principles as the previous one, and its code is nearly the same. The only difference is in line 2, where a queue Q is used instead of a test&set object. The queue is initialized to the sequence of items <w, ℓ>. The process that dequeues w (the value at the head of the queue) is the winner. The process that dequeues ℓ is the loser. The value decided by the consensus object is the value proposed by the winner.

operation propose(v) issued by pi:
(1) REG[i] ← v;
(2) aux ← Q.dequeue();
(3) case (aux = w) then return (REG[i])
(4)      (aux = ℓ) then return (REG[1 − i])
(5) end case

Figure 9.5: From queue to consensus
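Under the same assumptions as the previous sketch, the queue-based construction can be rendered with a linearizable Java queue whose poll() plays the role of dequeue(); the marker strings standing for the items w and ℓ are our own choice.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReferenceArray;

class QueueConsensus<V> {
    private static final String W = "w", L = "l"; // winner and loser markers
    private final ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
    private final AtomicReferenceArray<V> reg = new AtomicReferenceArray<>(2);

    QueueConsensus() { q.add(W); q.add(L); } // initialize Q to the sequence <w, l>

    V propose(int i, V v) {
        reg.set(i, v);               // line 1: publish the proposed value
        String aux = q.poll();       // line 2: Q.dequeue()
        return W.equals(aux) ? reg.get(i)      // line 3: dequeued w, the winner
                             : reg.get(1 - i); // line 4: dequeued l, the loser
    }
}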

Theorem 30 The algorithm described in Figure 9.5 is a wait-free construction of a consensus object from a queue object, in a system of two processes.


Proof The proof is the same as the proof of Theorem 29. The only difference is the way the winner process is selected. Here, the winner is the process that dequeues the value w that is initially at the head of the queue. As suggested by the text of the algorithm, the proof is otherwise verbatim the same. □ Theorem 30

9.4.3 Consensus from swap objects

Swap objects A swap object R is an atomic read/write register that has an additional operation denoted swap(). That operation has an input parameter: the name of a local variable (local_var) of the process that invokes it. It atomically exchanges the content of the register R with the content of that local variable. The swap operation can be described by the following statements:

operation swap(local_var):
    aux ← R;
    R ← local_var;
    local_var ← aux.

From swap objects to consensus An algorithm that wait-free implements a consensus object from a swap object in a system of two processes is described in Figure 9.6. That algorithm uses a swap object R, initialized to ⊥. Its design principles are the same as in the previous algorithms. The winner is the process that succeeds in depositing its index in R while obtaining the value ⊥ from R. The proof of the algorithm is the same as the proof of the previous algorithms.

operation propose(v) issued by pi:
(1) REG[i] ← v;
(2) aux ← i;
(3) R.swap(aux);
(4) case (aux = ⊥) then return (REG[i])
(5)      (aux ≠ ⊥) then return (REG[1 − i])
(6) end case

Figure 9.6: From swap to consensus
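A corresponding Java sketch: we assume AtomicReference.getAndSet() as the swap primitive, since it atomically exchanges the register's content with its argument and returns the old content, which matches swap() up to the calling convention; ⊥ is represented by null.

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

class SwapConsensus<V> {
    private final AtomicReference<Integer> r = new AtomicReference<>(null); // R = ⊥
    private final AtomicReferenceArray<V> reg = new AtomicReferenceArray<>(2);

    V propose(int i, V v) {
        reg.set(i, v);                 // line 1: publish the proposed value
        Integer aux = r.getAndSet(i);  // lines 2-3: swap own index into R
        return (aux == null) ? reg.get(i)      // line 4: obtained ⊥, the winner
                             : reg.get(1 - i); // line 5: R was taken, the loser
    }
}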

9.4.4 Other objects for consensus in a system of two processes

It is possible to build a wait-free implementation of a consensus object, in a system of two processes, from other objects such as a stack, a set, a list, or a priority queue. When they do not provide total operations, the usual definition of these objects has to be extended so that all the operations become total. As an example, the pop() operation has to be defined for the case where the stack is empty. This can easily be done by specifying that pop() returns a default value ⊥ when the stack is empty.

Other objects, such as fetch&add objects, also allow wait-free implementing a consensus object in a system of two processes. Such an object is an atomic object that can be seen as encapsulating an integer state variable x, accessed with the atomic operation fetch&add(). That operation has an input parameter, an integer denoted incr. Its behavior can be defined as follows:

operation fetch&add(incr):
    prev_val ← x;
    x ← x + incr;
    return (prev_val).
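The text leaves the corresponding two-process consensus construction implicit; a natural sketch follows the same winner pattern as Figures 9.4-9.6, assuming AtomicInteger.getAndAdd() as the fetch&add primitive, so that the first process to increment observes the initial value 0 and wins.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

class FetchAddConsensus<V> {
    private final AtomicInteger x = new AtomicInteger(0); // the fetch&add object
    private final AtomicReferenceArray<V> reg = new AtomicReferenceArray<>(2);

    V propose(int i, V v) {
        reg.set(i, v);             // publish the proposed value
        int prev = x.getAndAdd(1); // fetch&add(1): returns the previous value
        return (prev == 0) ? reg.get(i)      // saw the initial value: winner
                           : reg.get(1 - i); // loser decides the winner's value
    }
}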

9.4.5 Power and limit of the previous objects

As we have shown, all the objects described previously allow wait-free implementing a consensus object in a system of two processes. Do they allow implementing a consensus object in a system of three or more processes? Perhaps surprisingly, the answer to that question is "no". This section gives the proof that queue objects have consensus number 2. The corresponding proofs for the other objects presented in this section are similar.

Theorem 31 Atomic wait-free queues have consensus number 2.

Proof The proof has the same structure as the proof of Theorem 27. Considering binary consensus, it assumes that there is an algorithm A, based on queues and atomic registers, that wait-free implements a consensus object in a system of three processes (denoted p, q and r). As in Theorem 27, we show that, starting from an initial bivalent configuration C (by Theorem 26, such a configuration does exist), there is an arbitrarily long schedule S produced by A that leads from C to another bivalent configuration S(C). This shows that A has a run in which no process ever decides, which proves the theorem by contradiction.

Starting the algorithm A in a bivalent configuration C, let S be a maximal schedule produced by A such that the configuration D = S(C) is bivalent. As we have seen in the proof of Theorem 27, "maximal" means that the configurations p(D), q(D) and r(D) are all monovalent. Moreover, as D is bivalent, two of these configurations do not have the same valence. Without loss of generality, say that p(D) is 0-valent and q(D) is 1-valent; r(D) is either 0-valent or 1-valent (the important point here is that r(D) is not bivalent).

Let OPp be the operation issued by p that leads from D to p(D), OPq the operation issued by q that leads from D to q(D), and OPr the operation issued by r that leads from D to r(D). Each of OPp, OPq and OPr is a read of an atomic register, a write of an atomic register, an enqueue on an atomic queue, or a dequeue on an atomic queue.

Let us consider p and q (the processes that produce configurations with different valences), and let us assume that, from D, r does not execute operations for an arbitrarily long period.

• If both OPp and OPq are operations on atomic registers, the proof of Theorem 27 still applies.

• If one of OPp and OPq is an operation on an atomic register, while the other is an operation on an atomic queue, the reasoning used in item 1 of the proof of Theorem 27 applies. This reasoning, based on the argument depicted in Figure 9.2, allows concluding that p(q(D)) ≡ q(p(D)), while one is 0-valent and the other is 1-valent.

It follows that the only case that remains to be investigated is when both OPp and OPq are operations on the same atomic queue Q. We proceed by a case analysis. There are three cases.

1. Both OPp and OPq are Q.dequeue(). As p(D) and q(D) are 0-valent and 1-valent, respectively, the configuration OPq(OPp(D)) is 0-valent, while OPp(OPq(D)) is 1-valent. But these configurations cannot be distinguished by the process r: in both configurations, r has the same local state and each base object (atomic register or atomic queue) has the same value. So, starting from either of these configurations, let us consider a schedule S′ in which only r issues operations (as defined by the algorithm A) until it decides (the fact that neither p nor q executes an operation in this schedule is made possible by asynchrony). We have then that (1) S′(OPq(OPp(D))) is 0-valent, and (2) S′(OPp(OPq(D))) is 1-valent. But, as OPq(OPp(D)) and OPp(OPq(D)) cannot be distinguished by r, it has to decide the same value in both S′(OPq(OPp(D))) and S′(OPp(OPq(D))). A contradiction.

2. OPp is Q.dequeue(), while OPq is Q.enqueue(a). (The case where OPp is Q.enqueue(a) while OPq is Q.dequeue() is the same.) There are two subcases, according to the current state of the wait-free atomic queue Q.

• Q is not empty. In that case, the configurations OPq(OPp(D)) and OPp(OPq(D)) are identical: in both, each object has the same state, and each process is in the same local state. A contradiction, because OPq(OPp(D)) is 0-valent while OPp(OPq(D)) is 1-valent.

• Q is empty. In that case, r cannot distinguish the configurations OPp(D) and OPp(OPq(D)). The same reasoning as in item 1 above shows a contradiction (the same schedule S′, starting from either of these configurations and involving only operations by r, would have to decide both 0 and 1).

[Figure: from the bivalent configuration D, Q.enqueue(a) by p leads to the 0-valent configuration p(D), and Q.enqueue(b) by q leads to the 1-valent configuration q(D).]

Figure 9.7: enqueue() operations by p and q

3. OPp is Q.enqueue(a) and OPq is Q.enqueue(b). (This case is described in Figure 9.7.) Let k be the number of items in the queue Q in the configuration D. This means that p(D) contains k + 1 items, and q(p(D)) (or p(q(D))) contains k + 2 items (see Figure 9.8).

[Figure: the queue Q in configuration q(p(D)) holds, from the dequeue side to the enqueue side, k ≥ 0 items followed by a and then b.]

Figure 9.8: State of the queue object Q in configuration q(p(D))

As the algorithm A is wait-free and p(D) is 0-valent, there is a schedule Sp, starting from the configuration q(p(D)) and involving only operations¹ issued by p, that ends with p deciding 0. We claim that the schedule Sp contains an operation (by p) that dequeues the (k + 1)-th element of Q. Assume by contradiction that p issues at most k dequeue operations in Sp (and so never dequeues the value a it has enqueued). In that case, if we apply the schedule Sp to the configuration p(q(D)), we obtain the configuration Sp(p(q(D))) in which p also decides 0: as it dequeues at most k items from Q, p cannot distinguish p(q(D)) from q(p(D)) (these two configurations differ only in the state of Q, whose two last items are a followed by b in q(p(D)) and b followed by a in p(q(D))), and, as we have just seen, p decides 0 in Sp(q(p(D))). But this contradicts the fact that, as q(D) is 1-valent, p should decide 1, which proves the claim.

¹ These operations by p are on the atomic registers and the atomic queues.


It follows from this discussion that Sp contains at least k + 1 dequeue operations on Q (issued by p). Let S′p be the longest prefix of Sp that does not contain the (k + 1)-th dequeue operation on Q by p.

As the algorithm A is wait-free and p(D) is 0-valent, there is a schedule Sq, starting from the configuration S′p(q(p(D))) and involving only operations by q, that ends with q deciding 0. Similarly to the above discussion, we claim that the schedule Sq contains an operation (by q) that dequeues an item from Q. To show it, assume by contradiction that q never dequeues from Q in Sq. In that case, if we apply the schedule Sq to the configuration S′p(p(q(D))), q decides 0 (as the only difference between S′p(p(q(D))) and S′p(q(p(D))) is in the state of Q), which contradicts the fact that q(D) is 1-valent.

It follows from this discussion that Sq contains at least one operation that dequeues from Q. Let S′q be the longest prefix of Sq that does not contain that dequeue operation. We now define two schedules that start from D, lead to different decisions, and cannot be distinguished by the third process r (Figure 9.9).

[Figure: the two schedules described below, leading from the bivalent configuration D to D0 (via the 0-valent p(D)) and to D1 (via the 1-valent q(D)).]

Figure 9.9: From the configuration D to D0 or D1

• The first schedule is defined by the following sequence of operations:
- p executes Q.enqueue(a) (producing p(D)),
- q executes Q.enqueue(b) (producing q(p(D))),
- p executes the operations in S′p, then executes Q.dequeue() and obtains a,
- q executes the operations in S′q, then executes Q.dequeue() and obtains b.
That schedule leads from the configuration D to the configuration denoted D0. As D0 is reachable from p(D), it is 0-valent.

• The second schedule is defined by the following sequence of operations:
- q executes Q.enqueue(b) (producing q(D)),
- p executes Q.enqueue(a) (producing p(q(D))),
- p executes the operations in S′p, then executes Q.dequeue() and obtains b,
- q executes the operations in S′q, then executes Q.dequeue() and obtains a.
That schedule leads from the configuration D to the configuration denoted D1. As D1 is reachable from q(D), it is 1-valent.

Let us now consider the third process r, which has not executed any operation since the configuration D.

All the objects have the same state in the configurations D0 and D1. Moreover, r has the same local state in both configurations. It follows that D0 and D1 cannot be distinguished by r. Consequently, as the algorithm A is wait-free and D0 is 0-valent, there is a schedule Sr, starting from the configuration D0 and involving only operations issued by r, in which r decides 0. As r cannot distinguish D0 from D1, the very same schedule can be applied from D1, at the end of which r decides 0. A contradiction, as D1 is 1-valent.

□ Theorem 31

The following corollary is an immediate consequence of Theorems 28 and 31.

Corollary 3 Atomic objects such as queues, stacks, sets, lists, priority queues, test&set objects, swap objects, and fetch&add objects cannot be wait-free implemented from atomic registers.

9.5 Objects whose consensus number is +∞

This section shows that some atomic objects have an infinite consensus number. They can wait-free implement a consensus object whatever the number n of processes, and consequently can be used to wait-free implement any object defined by a sequential specification on total operations in a system made up of an arbitrary number of processes. Three such objects are presented here: compare&swap objects, memory-to-memory swap objects, and augmented queues.

9.5.1 Consensus from compare&swap objects

A compare&swap object CS is an atomic object that can be accessed by a single operation denoted compare&swap(). That operation, which returns a value, has two input parameters (two values called old and new). Such an object can be seen as maintaining an internal state variable x. The effect of a compare&swap() operation can be described by the following specification:

operation compare&swap(old, new):
    prev ← x;
    if (x = old) then x ← new end if;
    return (prev).

From compare&swap objects to consensus The algorithm described in Figure 9.10 is a wait-free construction of a consensus object from a compare&swap object in a system of n processes, for any value of n. The base compare&swap object CS is initialized to ⊥, a default value that cannot be proposed by the processes to the consensus object. When a process proposes a value v to the consensus object, it first invokes CS.compare&swap(⊥, v) (line 1). If it obtains ⊥, it decides the value it proposes (line 2). Otherwise, it decides the value returned by the compare&swap object (line 3).


operation propose(v) issued by pi:
(1) aux ← CS.compare&swap(⊥, v);
(2) case aux = ⊥ then return (v)
(3)      aux ≠ ⊥ then return (aux)
(4) end case

Figure 9.10: From compare&swap to consensus
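A Java sketch of this construction, valid for any number of processes, assuming Java 9's AtomicReference.compareAndExchange(expected, new): it atomically performs the conditional update and returns the value it read, which matches the prev returned by compare&swap(). Here ⊥ is represented by null, so proposed values must be non-null.

import java.util.concurrent.atomic.AtomicReference;

class CasConsensus<V> {
    private final AtomicReference<V> cs = new AtomicReference<>(null); // CS = ⊥

    // propose(v) issued by any process; v must be non-null
    V propose(V v) {
        V prev = cs.compareAndExchange(null, v); // line 1: CS.compare&swap(⊥, v)
        return (prev == null) ? v                // line 2: obtained ⊥, the winner
                              : prev;            // line 3: decide the first value written
    }
}

Any number of threads may call propose() concurrently; all of them return the value whose compareAndExchange() was linearized first.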

Theorem 32 Compare&swap objects have an infinite consensus number.

Proof The algorithm described in Figure 9.10 is clearly wait-free. As the base compare&swap object CS is atomic, there is a first process that executes CS.compare&swap() (as previously, "first" is defined according to the linearization order of all the CS.compare&swap() invocations). Let that process be the winner. According to the specification of the compare&swap() operation, the winner has deposited v (the value it proposes to the consensus object) in CS. As the input parameter old of every invocation of the compare&swap() operation is ⊥, all subsequent compare&swap() invocations return the first value deposited in CS, namely the value v deposited by the winner. It follows that all the processes that propose a value and do not crash decide the value of the winner. The algorithm is trivially independent of the number of processes that invoke CS.compare&swap(). It follows that the algorithm wait-free implements a consensus object for any number of processes. □ Theorem 32

9.5.2 Consensus from mem-to-mem-swap objects

Mem-to-mem-swap objects A mem-to-mem-swap object is an atomic register that provides the processes with three operations: the classical read and write operations, plus a binary mem-to-mem-swap() operation that acts on two registers R1 and R2. mem-to-mem-swap(R1, R2) atomically exchanges the content of R1 and the content of R2. (This operation should not be confused with the swap operation described in Section 9.4.3: the latter involves a single base atomic register and a local variable, while the former involves two base atomic registers.)

From mem-to-mem-swap objects to consensus The algorithm described in Figure 9.11 is a wait-free construction of a consensus object from base atomic registers and mem-to-mem-swap objects, for any number n of processes.

A base 1WMR atomic register REG[i] is associated with each process pi. That register is used by pi to make public the value it proposes (line 1). A process pj can read it at line 4 if it has to decide the value proposed by pi.

There are n + 1 mem-to-mem-swap objects. The array A[1 : n] is initialized to [0, . . . , 0], while the object R is initialized to 1. The object A[i] is written only by pi, and this write is performed by a mem-to-mem-swap operation: pi exchanges the content of A[i] with the content of R (line 2). Differently from A[i], the mem-to-mem-swap object R can be written by any process. As described in lines 2-4, these objects are used to determine the decided value. After it has exchanged A[i] and R, a process looks for the first entry j of the array A such that A[j] ≠ 0, and decides the value deposited in REG[j] by the corresponding process pj.

Before proving the next theorem, and to better understand how the algorithm works, let us observe that


operation propose(v) issued by pi:
(1) REG[i] ← v;
(2) mem-to-mem-swap(A[i], R);
(3) for j from 1 to n do
(4)     if (A[j] = 1) then return (REG[j]) end if
(5) end for

Figure 9.11: From mem-to-mem-swap to consensus

the following relation is invariant:

R + A[1] + · · · + A[n] = 1.

Indeed, as initially R = 1 and A[i] = 0 for each i, this relation is initially satisfied. Then, as the operation mem-to-mem-swap(A[i], R) issued at line 2 is executed atomically, the relation remains true forever.
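Java offers no mem-to-mem-swap primitive, so the sketch below simulates the atomicity of the swap with a single lock; it illustrates the algorithm's logic and its invariant, but the lock-based simulation is of course not itself a wait-free construction. It assumes, as in the book's algorithm, that each process invokes propose() at most once.

import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

class MemToMemSwapConsensus<V> {
    private final Object lock = new Object(); // simulates the atomicity of the swap
    private final AtomicIntegerArray a;       // A[1..n], initialized to all 0
    private int r = 1;                        // R, initialized to 1
    private final AtomicReferenceArray<V> reg;
    private final int n;

    MemToMemSwapConsensus(int n) {
        this.n = n;
        this.a = new AtomicIntegerArray(n);
        this.reg = new AtomicReferenceArray<>(n);
    }

    // propose(v) issued by process pi, with i in 0..n-1, at most once per process
    V propose(int i, V v) {
        reg.set(i, v);            // line 1: publish the proposed value
        synchronized (lock) {     // line 2: mem-to-mem-swap(A[i], R)
            int tmp = a.get(i);
            a.set(i, r);
            r = tmp;
        }
        for (int j = 0; j < n; j++)  // lines 3-5: the unique winner has A[j] = 1
            if (a.get(j) == 1) return reg.get(j);
        throw new IllegalStateException("invariant violated: some A[j] must be 1");
    }
}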

Lemma 12 The mem-to-mem-swap object type allows wait-free implementing consensus in a system of n processes, for any n.

Proof The algorithm is trivially wait-free (the loop is bounded). As before, let the winner be the process pi that sets A[i] to 1 when it executes line 2. As each A[j] is written at most once, we conclude from the previous invariant that there is a single winner. Moreover, due to the atomicity of the mem-to-mem-swap objects, the winner is the first process that executes line 2. As, before becoming the winner, the winner process pi has deposited in REG[i] the value v it proposes to the consensus object, we have REG[i] = v and A[i] = 1 before the other processes terminate the execution of line 2. It follows that all the processes that decide return the value proposed by the single winner process. □ Lemma 12

An object type is universal in a system of n processes if it allows wait-free constructing a consensus object for n processes. The following corollary is an immediate consequence of the previous lemma.

Corollary 4 Mem-to-mem-swap objects are universal in a system of n processes.

Theorem 33 Mem-to-mem-swap objects have an infinite consensus number.

Proof The proof follows from the fact that, whatever n, it is always possible to construct a consensus object in a system of n processes from mem-to-mem-swap objects. □ Theorem 33

9.5.3 Consensus from augmented queue objects

This type of object is very close to the queue studied previously. Interestingly, augmented queues have an infinite consensus number. This shows that enriching an object with an additional operation can infinitely increase its power when one is interested in the wait-free implementation of consensus objects.

An augmented queue is a queue with an additional operation denoted peek(), which returns the first item of the queue without removing it. In some sense, that operation allows reading a part of the queue without modifying it.

The algorithm in Figure 9.12 provides a wait-free construction of a consensus object from an augmented queue. The construction is quite simple. The augmented queue Q is initially empty. A process first


operation propose(v) issued by pi:
    Q.enqueue(v);
    return (Q.peek())

Figure 9.12: From an augmented queue to consensus

enqueues the value v it proposes to the consensus object. Then it invokes the peek() operation to obtain the first value that has been enqueued. It is easy to see that the construction works for any number of processes, and we have the following theorem:

Theorem 34 Augmented queue objects have an infinite consensus number.
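A Java sketch, assuming ConcurrentLinkedQueue's peek() as the augmented queue's peek(): it returns the head of the queue without removing it, and we assume its add() and peek() are linearizable, as in the Michael-Scott queue on which this class is based.

import java.util.concurrent.ConcurrentLinkedQueue;

class AugmentedQueueConsensus<V> {
    private final ConcurrentLinkedQueue<V> q = new ConcurrentLinkedQueue<>();

    V propose(V v) {
        q.add(v);        // Q.enqueue(v): publish the proposal in arrival order
        return q.peek(); // Q.peek(): decide the first value ever enqueued
    }
}

The construction is independent of the number of participating threads: every call returns the value whose enqueue was linearized first.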

9.5.4 Impossibility result

Corollary 5 There is no wait-free implementation of an object of type compare&swap, mem-to-mem-swap, or augmented queue from base objects of types stack, queue, set, priority queue, swap, fetch&add, or test&set.

Proof The proof follows directly from the combination of Theorem 31 (the cited base objects have consensus number 2), Theorems 32, 33 and 34 (compare&swap, mem-to-mem-swap, and augmented queue objects have an infinite consensus number), and Theorem ?? (impossibility of wait-free implementing Y from X when CN(X) < CN(Y)). □ Corollary 5

9.6 Hierarchy of atomic objects

9.6.1 From consensus numbers to a hierarchy

Consensus numbers establish a hierarchy on the power of object types to wait-free implement a consensus object, i.e., to wait-free implement any object defined by a sequential specification on total operations. More generally:

• Consensus numbers allow ranking the power of the classical synchronization primitives (provided by shared-memory parallel machines) in the presence of process crashes: compare&swap is stronger than test&set, which is in turn stronger than atomic read/write operations. Interestingly, they also show that classical objects encountered in sequential computing, such as stacks and queues, are as powerful as the test&set or fetch&add synchronization primitives when one is interested in providing upper-layer application processes with wait-free objects.

• Fault masking can be impossible to achieve when the designer is not provided with powerful enough atomic synchronization operations. As an example, a first-in/first-out queue that has to tolerate the crash of a single process cannot be built from atomic registers. This follows from the fact that the consensus number of a queue is 2, while the consensus number of atomic registers is 1.

9.6.2 Robustness of the hierarchy

Let us recall the definition of consensus number stated at the beginning of this chapter: the consensus number associated with an object type T is the largest number n such that it is possible to wait-free implement, in a system of n processes, a consensus object from atomic registers and objects of type T.


The previous object hierarchy is robust in the following sense: any set of object types with consensus numbers equal to or smaller than k cannot wait-free implement an object whose consensus number is greater than k, i.e., an object at a higher level of the hierarchy. The hierarchy would no longer be robust if the definition of the consensus number notion prevented the use of base atomic registers.

Bibliographic notes

Herlihy 1991

FLP 85; Loui-Abu Amara, Anderson-Gouda, etc. (see H91)

Attiya-Welch 98, Lynch 96

Chandra-Jayanti-Toueg JACM 98


Part IV

Memory with Oracles


Bibliography

[1] Afek Y., Weisberger E. and Weisman H., A Completeness Theorem for a Class of Synchronization Objects (Extended Abstract). Proc. 12th ACM Symposium on Principles of Distributed Computing (PODC'93), pp. 159-170, 1993.

[2] Amdahl G., Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS Conference Proceedings (30):483-485, 1967.

[3] Afek Y., Attiya H., Dolev D., Gafni E., Merritt M. and Shavit N., Atomic snapshots of shared memory. Journal of the ACM, 40(4):873-890, 1993.

[4] Brinch Hansen P. (Editor), The Origin of Concurrent Programming. Springer Verlag, 534 pages, 2002.

[5] Dahl O.-J., Dijkstra E.W. and Hoare C.A.R., Structured Programming. Academic Press, 220 pages, 1972.

[6] Dijkstra E.W., Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, 1965.

[7] Fischer M.J., Lynch N.A. and Paterson M.S., Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374-382, 1985.

[8] Herlihy M.P., Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, 1991.

[9] Herlihy M.P. and Wing J.M., Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463-492, 1990.

[10] Hoare C.A.R., Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, 1974.

[11] Jayanti P., Chandra T. and Toueg S., Fault-tolerant wait-free shared objects. Journal of the ACM, 45(3):451-500, 1998.

[12] Lamport L., Concurrent reading and writing. Communications of the ACM, 20(11):806-811, 1977.

[13] Lamport L., On interprocess communication, Part I: basic formalism, Part II: algorithms. Distributed Computing, 1(2):77-101, 1986.

[14] Liskov B. and Zilles S., Specification Techniques for Data Abstraction. IEEE Transactions on Software Engineering, SE-1:7-19, 1975.

[15] Loui M.C. and Abu-Amara H.H., Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research, 163-183, 1987.

[16] Misra J., Axioms for memory access in asynchronous hardware systems. ACM Transactions on Programming Languages and Systems, 8(1):142-153, 1986.


[17] Owicki S. and Gries D., Verifying Properties of Parallel Programs: An Axiomatic Approach. Communications of the ACM, 19(5):279-285, 1976.

[18] Parnas D.L., A Technique for Software Module Specification with Examples. Communications of the ACM, 15(5):330-336, 1972.

[19] Parnas D.L., On the Criteria to be Used in Decomposing Systems into Modules. Communications of the ACM, 15(12):1053-1058, 1972.

[20] Pease M., Shostak R. and Lamport L., Reaching agreement in the presence of faults. Journal of the ACM, 27(2):228-234, 1980.

[21] Peterson G.L., Concurrent reading while writing. ACM Transactions on Programming Languages and Systems, 5(1):46-55, 1983.

[22] Raynal M., Algorithms for mutual exclusion. The MIT Press, ISBN 0-262-18119-3, 107 pages, 1986.

[23] Taubenfeld G., Synchronization algorithms and concurrent programming. Pearson Prentice-Hall, ISBN 0-131-97259-6, 423 pages, 2006.

[24] Afek Y., Brown G. and Merritt M., Lazy Caching. ACM Transactions on Programming Languages and Systems, 15(1):182-205, 1993.

[25] Attiya H. and Welch J.L., Sequential Consistency versus Linearizability. ACM Transactions on Computer Systems, 12(2):91-122, 1994.

[26] Bernstein P.A., Hadzilacos V. and Goodman N., Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[27] Fekete A., Lynch N., Merritt M. and Weihl W., Atomic Transactions. Morgan Kaufmann Publishing, 1994.

[28] Gray J. and Reuter A., Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishing, 1070 pages, 1992.

[29] Lamport L., How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, C-28(9):690-691, 1979.

[30] Papadimitriou C., The Theory of Database Concurrency Control. Computer Science Press, 1988.

[31] Raynal M., Sequential Consistency as Lazy Linearizability. Proc. 14th ACM Symposium on Parallel Algorithms and Architectures (SPAA'02), pp. 151-152, Winnipeg, 2002.

[32] Raynal M., Token-Based Sequential Consistency. Int'l Journal of Computer Systems Science and Engineering, 17(6):359-366, 2002.

[33] Alpern B. and Schneider F.B., Defining liveness. Information Processing Letters, 21(4):181-185, 1985.

[34] Attiya H., Guerraoui R. and Kouznetsov P., Computing with reads and writes in the absence of step contention. Proc. 19th Int'l Symposium on Distributed Computing (DISC'05), Springer-Verlag LNCS #3724, pp. 122-136, 2005.


[35] Fich F., Herlihy M. and Shavit N., On the space complexity of randomized synchronization. Journal of the ACM, 45(5):843-862, 1998.

[36] Fich F., Luchangco V., Moir M. and Shavit N., Obstruction-free algorithms can be practically wait-free. Proc. 19th Int'l Symposium on Distributed Computing (DISC'05), Springer-Verlag LNCS #3724, pp. 78-92, 2005.

[37] Guerraoui R., Kapalka M. and Kouznetsov P., The weakest failure detector to boost obstruction-freedom. Proc. 20th Int'l Symposium on Distributed Computing (DISC'06), Springer-Verlag LNCS #4167, pp. 399-412, 2006.

[38] Herlihy M., Luchangco V. and Moir M., Obstruction-free synchronization: double-ended queues as an example. Proc. 23rd IEEE Int'l Conference on Distributed Computing Systems (ICDCS'03), IEEE Computer Press, pp. 522-529, 2003.

[39] Herlihy M. and Shavit N., The Art of Multiprocessor Programming. Morgan Kaufmann Pubs, Elsevier, 508 pages, 2008.

[40] Lamport L., Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering, SE-3(2):125-143, 1977.

[41] Jayanti P., Burns J. and Peterson G., Almost optimal single reader single writer atomic register. Journal of Parallel and Distributed Computing, 60:150-168, 2000.

[42] Li M., Tromp J. and Vitanyi P., How to share concurrent wait-free variables. Journal of the ACM, 43(4):723-746, 1996.

[43] Singh A.K., Anderson J.H. and Gouda M., The elusive atomic register. Journal of the ACM, 41(2):331-334, 1994.

[44] Vidyasankar K., Converting Lamport's Regular Register to Atomic Register. Information Processing Letters, 28(6):287-290, 1988.

[45] Vidyasankar K., An elegant 1-writer multireader multivalued atomic register. Information Processing Letters, 30(5):221-223, 1989.

[46] Vidyasankar K., A very simple construction of 1-writer multireader multivalued atomic variables. Information Processing Letters, 37:323-326, 1991.

[47] Vitanyi P., Simple wait-free multireader registers. Proc. 16th Int'l Symposium on Distributed Computing (DISC'02), Springer-Verlag LNCS #2508, pp. 118-132, 2002.

[48] Vitanyi P. and Awerbuch B., Atomic shared register access by asynchronous hardware. Proc. 27th IEEE Symposium on Foundations of Computer Science (FOCS'86), IEEE Computer Press, pp. 223-243, 1986.

[49] Bloom B., Constructing two-writer atomic registers. IEEE Transactions on Computers, 37:1506-1514, 1988.

[50] Burns J. and Peterson G., Constructing multireader atomic values from non-atomic values. Proc. 7th ACM Symposium on Principles of Distributed Computing (PODC'87), ACM Press, pp. 222-231, 1987.


[51] Chaudhuri S., Kosa M.J. and Welch J., One-write algorithms for multivalued regular and atomic registers. Acta Informatica, 37:161-192, 2000.

[52] Chaudhuri S. and Welch J., Bounds on the cost of multivalued register implementations. SIAM Journal on Computing, 23(2):335-354, 1994.

[53] Haldar S. and Vidyasankar K., Constructing 1-writer multireader multivalued atomic variables from regular variables. Journal of the ACM, 42(1):186-203, 1995.
