
  • Making reliable distributed systems in the presence of software errors

    Final version (with corrections), last update 20 November 2003

    Joe Armstrong

    A Dissertation submitted to the Royal Institute of Technology in partial fulfilment of the requirements for the degree of Doctor of Technology

    The Royal Institute of Technology

    Stockholm, Sweden

    December 2003

    Department of Microelectronics and Information Technology

    TRITA–IMIT–LECS AVH 03:09 ISSN 1651–4076 ISRN KTH/IMIT/LECS/AVH-03/09–SE

    and

    SICS Dissertation Series 34 ISSN 1101–1335 ISRN SICS–D–34–SE

    © Joe Armstrong, 2003

    Printed by Universitetsservice US-AB 2003

    To Helen, Thomas and Claire

  • Abstract

    The work described in this thesis is the result of a research program started in 1981 to find better ways of programming Telecom applications. These applications are large programs which despite careful testing will probably contain many errors when the program is put into service. We assume that such programs do contain errors, and investigate methods for building reliable systems despite such errors.

    The research has resulted in the development of a new programming language (called Erlang), together with a design methodology, and set of libraries for building robust systems (called OTP). At the time of writing the technology described here is used in a number of major Ericsson, and Nortel products. A number of small companies have also been formed which exploit the technology.

    The central problem addressed by this thesis is the problem of constructing reliable systems from programs which may themselves contain errors. Constructing such systems imposes a number of requirements on any programming language that is to be used for the construction. I discuss these language requirements, and show how they are satisfied by Erlang.

    Problems can be solved in a programming language, or in the standard libraries which accompany the language. I argue how certain of the requirements necessary to build a fault-tolerant system are solved in the language, and others are solved in the standard libraries. Together these form a basis for building fault-tolerant software systems.

    No theory is complete without proof that the ideas work in practice. To demonstrate that these ideas work in practice I present a number of case studies of large commercially successful products which use this technology. At the time of writing the largest of these projects is a major Ericsson product, having over a million lines of Erlang code. This product (the AXD301) is thought to be one of the most reliable products ever made by Ericsson.

    Finally, I ask if the goal of finding better ways to program Telecom applications was fulfilled—I also point to areas where I think the system could be improved.

  • Contents

    Abstract

    1 Introduction
      1.1 Background
        Ericsson background
        Chronology
      1.2 Thesis outline
        Chapter by chapter summary

    2 The Architectural Model
      2.1 Definition of an architecture
      2.2 Problem domain
      2.3 Philosophy
      2.4 Concurrency oriented programming
        2.4.1 Programming by observing the real world
        2.4.2 Characteristics of a COPL
        2.4.3 Process isolation
        2.4.4 Names of processes
        2.4.5 Message passing
        2.4.6 Protocols
        2.4.7 COP and programmer teams
      2.5 System requirements
      2.6 Language requirements
      2.7 Library requirements
      2.8 Application libraries
      2.9 Construction guidelines
      2.10 Related work

    3 Erlang
      3.1 Overview
      3.2 Example
      3.3 Sequential Erlang
        3.3.1 Data structures
        3.3.2 Variables
        3.3.3 Terms and patterns
        3.3.4 Guards
        3.3.5 Extended pattern matching
        3.3.6 Functions
        3.3.7 Function bodies
        3.3.8 Tail recursion
        3.3.9 Special forms
        3.3.10 case
        3.3.11 if
        3.3.12 Higher order functions
        3.3.13 List comprehensions
        3.3.14 Binaries
        3.3.15 The bit syntax
        3.3.16 Records
        3.3.17 epp
        3.3.18 Macros
        3.3.19 Include files
      3.4 Concurrent programming
        3.4.1 register
      3.5 Error handling
        3.5.1 Exceptions
        3.5.2 catch
        3.5.3 exit
        3.5.4 throw
        3.5.5 Corrected and uncorrected errors
        3.5.6 Process links and monitors
      3.6 Distributed programming
      3.7 Ports
      3.8 Dynamic code change
      3.9 A type notation
      3.10 Discussion

    4 Programming Techniques
      4.1 Abstracting out concurrency
        4.1.1 A fault-tolerant client-server
      4.2 Maintaining the Erlang view of the world
      4.3 Error handling philosophy
        4.3.1 Let some other process fix the error
        4.3.2 Workers and supervisors
      4.4 Let it crash
      4.5 Intentional programming
      4.6 Discussion

    5 Programming Fault-tolerant Systems
      5.1 Programming fault-tolerance
      5.2 Supervision hierarchies
        5.2.1 Diagrammatic representation
        5.2.2 Linear supervision
        5.2.3 And/or supervision hierarchies
      5.3 What is an error?
        5.3.1 Well-behaved functions

    6 Building an Application
      6.1 Behaviours
        6.1.1 How behaviours are written
      6.2 Generic server principles
        6.2.1 The generic server API
        6.2.2 Generic server example
      6.3 Event manager principles
        6.3.1 The event manager API
        6.3.2 Event manager example
      6.4 Finite state machine principles
        6.4.1 Finite state machine API
        6.4.2 Finite state machine example
      6.5 Supervisor principles
        6.5.1 Supervisor API
        6.5.2 Supervisor example
      6.6 Application principles
        6.6.1 Applications API
        6.6.2 Application example
      6.7 Systems and releases
      6.8 Discussion

    7 OTP
      7.1 Libraries

    8 Case Studies
      8.1 Methodology