
Haskell’09
Proceedings of the 2009 ACM SIGPLAN Haskell Symposium

September 3, 2009

Edinburgh, Scotland

Sponsored by: ACM SIGPLAN

Co-located with: ICFP’09



The Association for Computing Machinery
2 Penn Plaza, Suite 701
New York, New York 10121-0701

Copyright © 2009 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or <[email protected]>.

For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Notice to Past Authors of ACM-Published Articles

ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that has been previously published by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform [email protected], stating the title of the work, the author(s), and where and when published.

ISBN: 978-1-60558-508-6

Additional copies may be ordered prepaid from:

ACM Order Department
PO Box 30777
New York, NY 10087-0777, USA
Phone: 1-800-342-6626 (US and Canada), +1-212-626-0500 (Global)
Fax: +1-212-944-1318
E-mail: [email protected]
Hours of Operation: 8:30 am – 4:30 pm ET

ACM Order Number 565097 Printed in the USA


Foreword

It is my great pleasure to welcome you to the 2nd ACM Haskell Symposium. This meeting follows the first occurrence of the Haskell Symposium last year and 11 previous instances of the Haskell Workshop. The name change reflects both the steady increase in the influence of the Haskell Workshop on the wider community and the increasing number of high-quality submissions.

The Call for Papers attracted 31 submissions from Asia, Europe, and North and South America, of which 12 were accepted. During the review period, each paper was evaluated by at least three Program Committee members, and many papers received an additional external review. Based on these reviews, the submissions were judged on their impact, clarity, and relevance to the Haskell community, and the program was chosen during a five-day electronic PC meeting. Because of the constraints of a one-day workshop, many papers with valuable contributions could not be accepted. To accommodate more papers, the PC chose to allocate 25-minute presentation slots to 11 papers and a 15-minute slot to one paper, a short experience report. The program also includes a tool demonstration and a discussion on the future of Haskell.

Foremost, I would like to thank the authors of all submitted papers for their hard work. The Program Committee also deserves strong thanks for their efforts in selecting from the many excellent submissions, despite a tight review period. My gratitude goes to the external reviewers, for responding on short notice. Special thanks go to Andy Gill, chair of the 2008 Haskell Symposium, and the rest of the Steering Committee. The Conference Management System EasyChair was invaluable; my thanks to its lead developer Andrei Voronkov. Finally, my thanks go to Christopher Stone and Michael Sperber, the ICFP Workshop Co-Chairs, Graham Hutton, the ICFP General Chair, Lisa Tolles from Sheridan Printing, and ACM SIGPLAN for their support and sponsorship.

Stephanie Weirich Haskell’09 Program Chair University of Pennsylvania


Table of Contents

Haskell 2009 Symposium Organization ..........................................................................................vi

Session 1. Session Chair: Janis Voigtländer (TU Dresden)
• Types Are Calling Conventions .................................................................................................................1
  Maximilian C. Bolingbroke (University of Cambridge), Simon L. Peyton Jones (Microsoft Research)
• Losing Functions without Gaining Data – another look at defunctionalisation ....................................13
  Neil Mitchell, Colin Runciman (University of York, UK)

Session 2. Session Chair: Jeremy Gibbons (University of Oxford)
• Push-Pull Functional Reactive Programming ..........................................................................................25
  Conal M. Elliott (LambdaPix)
• Unembedding Domain-Specific Languages .............................................................................................37
  Robert Atkey, Sam Lindley, Jeremy Yallop (The University of Edinburgh)
• Lazy Functional Incremental Parsing .......................................................................................................49
  Jean-Philippe Bernardy (Chalmers University of Technology & University of Gothenburg)
• Roll Your Own Test Bed for Embedded Real-Time Protocols: A Haskell Experience ........................61
  Lee Pike (Galois, Inc.), Geoffrey Brown (Indiana University), Alwyn Goodloe (National Institute of Aerospace)

Session 3. Session Chair: Mark P. Jones (Portland State University)
• A Compositional Theory for STM Haskell ..............................................................................................69
  Johannes Borgström, Karthikeyan Bhargavan, Andrew D. Gordon (Microsoft Research)
• Parallel Performance Tuning for Haskell .................................................................................................81
  Don Jones Jr. (University of Kentucky), Simon Marlow, Satnam Singh (Microsoft Research)
• The Architecture of the Utrecht Haskell Compiler ..................................................................................93
  Atze Dijkstra, Jeroen Fokker, S. Doaitse Swierstra (Universiteit Utrecht)

Session 4. Session Chair: Simon Marlow (Microsoft Research)
• Alloy: Fast Generic Transformations for Haskell ..................................................................................105
  Neil C. C. Brown, Adam T. Sampson (University of Kent)
• Type-Safe Observable Sharing in Haskell ..............................................................................................117
  Andy Gill (The University of Kansas)
• Finding the Needle: Stack Traces for GHC ............................................................................................129
  Tristan O. R. Allwood (Imperial College), Simon Peyton Jones (Microsoft Research), Susan Eisenbach (Imperial College)

Author Index ..............................................................................................................................................141


Haskell 2009 Symposium Organization

Program Chair: Stephanie Weirich (University of Pennsylvania, USA)

Steering Committee Chair: Andres Löh (University of Bonn, Germany)

Steering Committee: Gabriele Keller (University of New South Wales, Australia), Andy Gill (University of Kansas, USA), Doaitse Swierstra (Utrecht University, The Netherlands), Colin Runciman (University of York, UK), John Hughes (Chalmers and Quviq, Sweden)

Program Committee: Jeremy Gibbons (Oxford University, UK), Bastiaan Heeren (Open Universiteit Nederland, The Netherlands), John Hughes (Chalmers and Quviq, Sweden), Mark Jones (Portland State University, USA), Simon Marlow (Microsoft Research, UK), Ulf Norell (Chalmers, Sweden), Chris Okasaki (United States Military Academy, USA), Ross Paterson (City University London, UK), Alexey Rodriguez Yakushev (Vector Fabrics, The Netherlands), Don Stewart (Galois, USA), Janis Voigtländer (TU Dresden, Germany)

Additional reviewers: Niklas Broberg, Magnus Carlsson, Jacome Cunha, Iavor Diatchki, Marko van Eekelen, Nate Foster, Alex Gerdes, Stefan Holdermans, Wolfgang Jeltsch, Jerzy Karczmarczuk, John Launchbury, Gavin Lowe, Henrik Nilsson, Bruno Oliveira, Lee Pike, Riccardo Pucella, Claudio Russo, Peter Sewell, Doaitse Swierstra, Aaron Tomb, Jesse Tov, Dimitrios Vytiniotis, Adam Wick, Baltasar Trancon y Widemann, Peter Wong

Sponsor: ACM SIGPLAN


Types Are Calling Conventions

Maximilian C. Bolingbroke, University of Cambridge

[email protected]

Simon L. Peyton Jones, Microsoft Research

[email protected]

Abstract

It is common for compilers to derive the calling convention of a function from its type. Doing so is simple and modular but misses many optimisation opportunities, particularly in lazy, higher-order functional languages with extensive use of currying. We restore the lost opportunities by defining Strict Core, a new intermediate language whose type system makes the missing distinctions: laziness is explicit, and functions take multiple arguments and return multiple results.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory – Semantics; D.3.2 [Programming Languages]: Language Classifications – Applicative (functional) languages; D.3.4 [Programming Languages]: Processors – Optimization

General Terms Languages, Performance

1. Introduction

In the implementation of a lazy functional programming language, imagine that you are given the following function:

f :: Int → Bool → (Int ,Bool)

How would you go about actually executing an application of f to two arguments? There are many factors to consider:

• How many arguments are given to the function at once? One at a time, as currying would suggest? As many as are available at the application site? Some other answer?
• How does the function receive its arguments? In registers? On the stack? Bundled up on the heap somewhere?
• Since this is a lazy language, the arguments should be evaluated lazily. How is this achieved? If f is strict in its first argument, can we do something a bit more efficient by adjusting f and its callers?
• How are the results returned to the caller? As a pointer to a heap-allocated pair? Or in some other way?

The answers to these questions (and others) are collectively called the calling convention of the function f. The calling convention of a function is typically determined by the function’s type signature. This suffices for a largely-first-order language like C, but it imposes unacceptable performance penalties for a language like Haskell, because of the pervasive use of higher-order functions, currying, polymorphism, and laziness. Fast function calls are particularly important in a functional programming language, so compilers for these languages – such as the Glasgow Haskell Compiler (GHC) – typically use a mixture of ad hoc strategies to make function calls efficient.

In this paper we take a more systematic approach. We outline a new intermediate language for a compiler for a purely functional programming language, that is designed to encode the most important aspects of a function’s calling convention directly in the type system of a concise lambda calculus with a simple operational semantics.

• We present Strict Core, a typed intermediate language whose types are rich enough to describe all the calling conventions that our experience with GHC has convinced us are valuable (Section 3). For example, Strict Core supports uncurried functions symmetrically, with both multiple arguments and multiple results.
• We show how to translate a lazy functional language like Haskell into Strict Core (Section 4). The source language, which we call FH, contains all the features that we are interested in compiling well – laziness, parametric polymorphism, higher-order functions and so on.
• We show that the properties captured by the intermediate language expose a wealth of opportunities for program optimization by discussing four of them – definition-site and use-site arity raising (Section 6.1 and Section 6.2), thunk elimination (Section 5.5) and deep unboxing (Section 5.6). These optimisations were awkward or simply inaccessible in GHC’s earlier Core intermediate language.

Although our initial context is that of lazy functional programming languages, Strict Core is a call-by-value language and should also be suitable for use in compiling a strict, pure, language such as Timber [1], or a hybrid language which makes use of both evaluation strategies.

No single part of our design is new, and we discuss related work in Section 7. However, the pieces fit together very nicely. For example: the symmetry between arguments and results (Section 3.1); the use of n-ary functions to get thunks “for free”, including so-called “multi-thunks” (Section 3.4); and the natural expression of algorithms and data structures with mixed strict/lazy behaviour (Section 3.5).

2. The challenge we address

In GHC today, type information alone is not enough to get a definitive specification of a function’s calling convention. The next few sections discuss some examples of what we lose by working with the imprecise, conservative calling convention implied by the type system as it stands.

2.1 Strict arguments

Consider the following function:

f :: Bool → Int
f x = case x of True → . . . ; False → . . .

This function is certainly strict in its argument x. GHC uses this information to generate more efficient code for calls to f, using call-by-value to avoid allocating a thunk for the argument. However, when generating the code for the definition of f, can we really assume that the argument has already been evaluated, and hence omit instructions that check for evaluated-ness? Well, no. For example, consider the call

map f [fibonacci 10, 1234]

Since map is used with both strict and lazy functions, map will not use call-by-value when calling f. So in GHC today, f is conservative, and always tests its argument for evaluated-ness even though in most calls the answer is ‘yes’.

An obvious alternative would be to treat first-order calls (where the call site can “see” the definition of f, and you can statically see that your use-site has at least as many arguments as the definition site demands) specially, and generate a wrapper for higher-order calls that does the argument evaluation. That would work, but it is fragile. For example, the wrapper approach to a map call might do something like this:

map (λx . case x of y → f y) [ . . .]

Here, the case expression evaluates x before passing it to f, to satisfy f’s invariant that its argument is always evaluated¹. But, alas, one of GHC’s optimising transformations is to rewrite case x of y → e to e[x/y], if e is strict in x. This transformation would break f’s invariant, resulting in utterly wrong behaviour or even a segmentation fault – for example, if it led to erroneously treating part of an unevaluated value as a pointer. GHC has a strongly-typed intermediate language that is supposed to be immune to segmentation faults, so this fragility is unacceptable. That is why GHC always makes a conservative assumption about evaluated-ness.

The generation of spurious evaluated-ness checks represents an obvious lost opportunity for the so-called “dictionary” arguments that arise from desugaring the type-class constraints in Haskell. These are constructed by the compiler so as to be non-bottoming, and hence may always be passed by value regardless of how a function uses them. Can we avoid generating evaluated-ness checks for these, without the use of any ad-hocery?

2.2 Multiple arguments

Consider these two functions:

f x y = x + y
g x = let z = factorial 10 in λy → x + y + z

They have the same type (Int → Int → Int), but we evaluate applications of them quite differently – g can only deal with being applied to one argument, after which it returns a function closure, whereas f can and should be applied to two arguments if possible. GHC currently discovers this arity difference between the two functions statically (for first-order calls) or dynamically (for higher-order calls). However, the former requires an apparently-modest but insidiously-pervasive propagation of ad-hoc arity information; and the latter imposes a performance penalty [2].

¹ In Haskell, a case expression with a variable pattern is lazy, but in GHC’s current compiler intermediate language it is strict, and that is the semantics we assume here.

Shorthand                                   Expansion
xⁿ   ≜ 〈x1, . . . , xn〉                     (n ≥ 0)
x    ≜ 〈x1, . . . , xn〉                     (n ≥ 0)
x    ≜ 〈x〉                                  Singleton
x, y ≜ 〈x1, . . . , xn, y1, . . . , ym〉     Concatenation

Figure 1: Notation for sequences

For the higher-order case, consider the well-known list-combining combinator zipWith, which we might write like this:

zipWith = λf :: (a → b → c). λxs :: List a. λys :: List b.
  case xs of
    Nil → Nil
    (Cons x xs′) →
      case ys of
        Nil → Nil
        (Cons y ys′) → Cons (f x y) (zipWith f xs′ ys′)

The functional argument f is always applied to two arguments, and it seems a shame that we cannot somehow communicate that information to the functions that are actually given to zipWith so that they might be compiled with a less pessimistic calling convention.

2.3 Optionally-strict source languages

Leaving the issue of compilation aside, Haskell’s source-level type system is not expressive enough to encode an important class of invariants about how far an expression has been evaluated. For example, you might like to write a function that produces a list of certainly-evaluated Ints, which we might write as [!Int]. We do not attempt to solve the issues of how to expose this functionality to the user in this paper, but we make a first step along this road by describing an intermediate language which is able to express such types.
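For comparison, the closest a Haskell programmer can get to such a guarantee today is to bake strictness into a dedicated data type. The following is ordinary GHC Haskell of our own, not part of the paper’s proposal: a strictness annotation on the constructor field ensures that every element stored in the list has been evaluated, approximating the [!Int] type above.

-- A list whose elements are forced when each cell is built; the '!' makes
-- SCons evaluate its first argument before the cell is allocated.
data StrictList a = SNil | SCons !a (StrictList a)

fromList :: [Int] -> StrictList Int
fromList = foldr SCons SNil   -- each element is evaluated as its SCons cell is constructed

Note that only the elements are guaranteed evaluated; the spine itself is still built lazily, which is exactly the kind of distinction the source type system cannot currently express.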

2.4 Multiple results

In a purely functional language like Haskell, there is no direct analogue of a reference parameter, such as you would have in an imperative language like C++. This means that if a function wishes to return multiple results it has to encapsulate them in a data structure of some kind, such as a tuple:

splitList :: [Int] → (Int, [Int])
splitList xs = case xs of (y : ys) → (y, ys)

Unfortunately, creating a tuple means that you need to allocate a blob of memory on the heap – and this can be a real performance drag, especially when functions returning multiple results occur in tight loops.

How can we compile functions which – like this one – return multiple results, efficiently?

3. Strict Core

We are now in a position to discuss the details of our proposed compiler intermediate language, which we call Strict CoreANF².

Strict CoreANF makes extensive use of sequences of variables, types, values, and terms, so we pause to establish our notation for sequences. We use angle brackets 〈x1, x2, . . . , xn〉 to denote a possibly-empty sequence of n elements. We often abbreviate such a sequence as xⁿ or, where n is unimportant, as x. When no ambiguity arises we abbreviate the singleton sequence 〈x〉 to just x. All this notation is summarised in Figure 1.

² ANF stands for A-normal form, which will be explained further in Section 3.6.

We also adopt the “variable convention” (that all names are unique) throughout this paper, and assume that whenever the environment is extended, the name added must not already occur in the environment – α-conversion can be used as usual to get around this restriction where necessary.

3.1 Syntax of Strict CoreANF

Strict CoreANF is a higher-order, explicitly-typed, purely-functional, call-by-value language. In spirit it is similar to System F, but it is slightly more elaborate so that its types can express a richer variety of calling conventions. The key difference from an ordinary typed lambda calculus is this:

A function may take multiple arguments simultaneously, and (symmetrically) return multiple results.

The syntax of types τ, shown in Figure 2, embodies this idea: a function type takes the form b → τ, where b is a sequence of binders (describing the arguments of the function), and τ is a sequence of types (describing its results). Here are four example function types:

f1 : Int → Int
   : 〈_:Int〉 → 〈Int〉

f2 : 〈α:?, α〉 → α
   : 〈α:?, _:α〉 → 〈α〉

f3 : 〈α:?, Int, α〉 → 〈α, Int〉
   : 〈α:?, _:Int, _:α〉 → 〈α, Int〉

f4 : α:? → Int → α → 〈α, Int〉
   : 〈α:?〉 → 〈〈_:Int〉 → 〈〈_:α〉 → 〈α, Int〉〉〉

In each case, the first line uses simple syntactic abbreviations, which are expanded in the subsequent line. The first, f1, takes one argument and returns one result³. The second, f2, shows a polymorphic function: Strict Core uses the notation of dependent products, in which a single construct (here b → τ) subsumes both ∀ and function arrow. However Strict Core is not dependently typed, so that types cannot depend on terms: for example, in the type 〈x:Int〉 → 〈τ〉, the result type τ cannot mention x. For this reason, we always write value binders in types as underscores “_”, and usually omit them altogether, writing 〈Int〉 → 〈τ〉 instead.

The next example, f3, illustrates a polymorphic function that takes a type argument and two value arguments, and returns two results. Finally, f4 gives a curried version of the same function. Admittedly, this uncurried notation is more complicated than the unary notation of conventional System F, in which all functions are curried. The extra complexity is crucial because, as we will see in Section 3.3, it allows us to express directly that a function takes several arguments simultaneously, and returns multiple results.

The syntax of terms (also shown in Figure 2) is driven by the same imperatives. For example, Strict CoreANF has n-ary application a g; and a function may return multiple results a. A possibly-recursive collection of heap values may be allocated with valrec, where a heap value is just a lambda or constructor application. Finally, evaluation is performed by let; since the term on the right-hand side may return multiple values, the let may bind multiple values. Here, for example, is a possible definition of f3 above:

f3 = λ〈α:?, x:Int, y:α〉. 〈y, x〉

In support of the multi-value idea, terms are segregated into three syntactically distinct classes: atoms a, heap values v, and multi-value terms e. An atom a is a trivial term – a literal, a variable reference, or (in an argument position) a type. A heap value v is a heap-allocated constructor application or lambda term. Neither atoms nor heap values require evaluation. The third class of terms is much more interesting: a multi-value term (e) is a term that either diverges, or evaluates to several (zero, one, or more) values simultaneously.

³ Recall Figure 1, which abbreviates a singleton sequence 〈Int〉 to Int.

Variables            x, y, z
Type Variables       α, β

Kinds
  κ ::= ?                            Kind of constructed types
      | κ → κ                        Kind of type constructors

Binders
  b ::= x:τ                          Value binding
      | α:κ                          Type binding

Types
  τ, υ, σ ::= T                      Type constructors
      | α                            Type variable references
      | b → τ                        Function types
      | τ υ                          Type application

Atoms
  a ::= x                            Term variable references
      | ℓ                            Literals

Atoms In Arguments
  g ::= a                            Value arguments
      | τ                            Type arguments

Multi-value Terms
  e ::= a                            Return multiple values
      | let x:τ = e in e             Evaluation
      | valrec x:τ = v in e          Allocation
      | a g                          Application
      | case a of p → e              Branch on values

Heap Allocated Values
  v ::= λb. e                        Closures
      | C τ, a                       Constructed data

Patterns
  p ::= _                            Default case
      | ℓ                            Matches exact literal value
      | C x:τ                        Matches data constructor

Data Types
  d ::= data T α:κ = c | . . . | c   Data declarations
  c ::= C τ                          Data constructors

Programs             d, e

Typing Environments
  Γ ::= ε                            Empty environment
      | Γ, x:τ                       Value binding
      | Γ, α:κ                       Type binding
      | Γ, C : b → 〈T α〉             Data constructor binding
      | Γ, T : κ                     Type constructor binding

Syntactic sugar
  Shorthand                          Expansion
  Value binders    τ ≜ _:τ
  Thunk types      {τ1, . . . , τn} ≜ 〈〉 → 〈τ1, . . . , τn〉
  Thunk terms      {e} ≜ λ〈〉. e

Figure 2: Syntax of Strict CoreANF


Γ ⊢κ τ : κ

  T:κ ∈ Γ
  ──────────────── (TYCONDATA)
  Γ ⊢κ T : κ

  B(T) = κ
  ──────────────── (TYCONPRIM)
  Γ ⊢κ T : κ

  α:κ ∈ Γ
  ──────────────── (TYVAR)
  Γ ⊢κ α : κ

  Γ ⊢ b : Γ′    ∀i. Γ′ ⊢κ τi : ?
  ─────────────────────────────── (TYFUN)
  Γ ⊢κ b → τ : ?

  Γ ⊢κ τ : κ1 → κ2    Γ ⊢κ υ : κ1
  ──────────────────────────────── (TYCONAPP)
  Γ ⊢κ τ υ : κ2

Figure 3: Kinding rules for Strict CoreANF


3.2 Static semantics of Strict CoreANF

The static semantics of Strict CoreANF is given in Figure 3, Figure 4 and Figure 5. Despite its ineluctable volume, it should present few surprises. The term judgement Γ ⊢ e : τ types a multi-valued term e, giving it a multi-type τ. There are similar judgements for atoms a, and values v, except that they possess types (not multi-types). An important invariant of Strict CoreANF is this: variables and values have types τ, not multi-types. In particular, the environment Γ maps each variable to a type τ (not a multi-type).

The only other unusual feature is the tiresome auxiliary judgement Γ ⊢app b → τ @ g : υ, shown in Figure 5, which computes the result type υ that results from applying a function of type b → τ to arguments g.

The last two pieces of notation used in the type rules are for introducing primitives and are as follows:

L   Maps literals to their built-in types
B   Maps built-in type constructors to their kinds – the domain must contain at least all of the type constructors returned by L

3.3 Operational semantics of Strict CoreANF

Strict CoreANF is designed to have a direct operational interpretation, which is manifested in its small-step operational semantics, given in Figure 7. Each small step moves from one configuration to another. A configuration is given by 〈H; e; Σ〉, where H represents the heap, e is the term under evaluation, and Σ represents the stack – the syntax of stacks and heaps is given in Figure 6.

We denote the fact that a heap H contains a mapping from x to a heap value v by H[x ↦ v]. This stands in contrast to a pattern such as H, x ↦ v, where we intend that H does not include the mapping for x.

The syntax of Strict Core is carefully designed so that there is a 1–1 correspondence between syntactic forms and operational rules:

• Rule EVAL begins evaluation of a multi-valued term e1, pushing onto the stack the frame let x:τ = • in e2. Although it is a pure language, Strict CoreANF uses call-by-value and hence evaluates e1 before e2. If you want to delay evaluation of e1, use a thunk (Section 3.4).
• Dually, rule RET returns a multiple value to the let frame, binding the x to the (atomic) returned values a. In this latter rule, the simultaneous substitution models the idea that e1 returns multiple values in registers to its caller. The static semantics (Section 3.2) guarantees that the number of returned values exactly matches the number of binders.

Γ ⊢a a : τ

  x:τ ∈ Γ
  ──────────── (VAR)
  Γ ⊢a x : τ

  L(ℓ) = τ
  ──────────── (LIT)
  Γ ⊢a ℓ : τ

Γ ⊢ e : τ

  ∀i. Γ ⊢a ai : τi
  ───────────────── (MULTI)
  Γ ⊢ a : τ

  Γ ⊢ e1 : τ    Γ, x:τ ⊢ e2 : σ
  ────────────────────────────── (LET)
  Γ ⊢ let x:τ = e1 in e2 : σ

  ∀j. Γ, x:τ ⊢v vj : τj    Γ, x:τ ⊢ e2 : σ
  ───────────────────────────────────────── (VALREC)
  Γ ⊢ valrec x:τ = v in e2 : σ

  Γ ⊢a a : b → τ    Γ ⊢app b → τ @ g : υ
  ──────────────────────────────────────── (APP)
  Γ ⊢ a g : υ

  Γ ⊢a a : τscrut    ∀i. Γ ⊢alt pi → ei : τscrut ⇒ τ
  ─────────────────────────────────────────────────── (CASE)
  Γ ⊢ case a of p → e : τ

Γ ⊢v v : τ

  Γ ⊢ b : Γ′    Γ′ ⊢ e : τ
  ────────────────────────── (LAM)
  Γ ⊢v λb. e : b → τ

  C : b → 〈T α〉 ∈ Γ    Γ ⊢app b → 〈T α〉 @ τ, a : 〈υ〉
  ──────────────────────────────────────────────────── (DATA)
  Γ ⊢v C τ, a : υ

Γ ⊢alt p → e : τscrut ⇒ τ

  Γ ⊢ e : τ
  ─────────────────────────────── (DEFALT)
  Γ ⊢alt _ → e : τscrut ⇒ τ

  L(ℓ) = τscrut    Γ ⊢ e : τ
  ─────────────────────────────── (LITALT)
  Γ ⊢alt ℓ → e : τscrut ⇒ τ

  Γ, x:τ ⊢v C σ, x : 〈T σ〉    Γ, x:τ ⊢ e : τ
  ─────────────────────────────────────────── (CONALT)
  Γ ⊢alt C x:τ → e : T σ ⇒ τ

Γ ⊢ d : Γ

  Γ0 = Γ, T : κ1 → . . . → κm → ?    ∀i. Γi−1 ⊢ ci : T α:κm in Γi
  ──────────────────────────────────────────────────────────────── (DATADECL)
  Γ ⊢ data T α:κm = c1 | . . . | cn : Γn

Γ ⊢ c : T α:κ in Γ

  ∀i. Γ ⊢κ τi : ?
  ──────────────────────────────────────────────── (DATACON)
  Γ ⊢ C τ : T α:κ in (Γ, C : α:κ, τ → 〈T α〉)

⊢ d, e : τ

  Γ0 = ε    ∀i. Γi−1 ⊢ di : Γi    Γn ⊢ e : τ
  ──────────────────────────────────────────── (PROGRAM)
  ⊢ dⁿ, e : τ

Figure 4: Typing rules for Strict CoreANF


EVAL      ⟨H; let x:τ = e1 in e2; Σ⟩                                ⟶  ⟨H; e1; let x:τ = • in e2 . Σ⟩
RET       ⟨H; a; let x:τ = • in e2 . Σ⟩                             ⟶  ⟨H; e2[a/x]; Σ⟩
ALLOC     ⟨H; valrec x:τ = v in e; Σ⟩                               ⟶  ⟨H, y ↦ v[y/x]; e[y/x]; Σ⟩       (y ∉ dom(H))
BETA      ⟨H[x ↦ λbⁿ. e]; x aⁿ; Σ⟩                                  ⟶  ⟨H; e[a/bⁿ]; Σ⟩                  (n > 0)
ENTER     ⟨H, x ↦ λ〈〉. e; x 〈〉; Σ⟩                                  ⟶  ⟨H, x ↦ ●; e; update x . Σ⟩
UPDATE    ⟨H; a; update x . Σ⟩                                       ⟶  ⟨H[x ↦ IND a]; a; Σ⟩
IND       ⟨H[x ↦ IND a]; x 〈〉; Σ⟩                                   ⟶  ⟨H; a; Σ⟩
CASE-LIT  ⟨H; case ℓ of . . . , ℓ → e, . . .; Σ⟩                     ⟶  ⟨H; e; Σ⟩
CASE-CON  ⟨H[x ↦ C τ, aⁿ]; case x of . . . , C bⁿ → e, . . .; Σ⟩     ⟶  ⟨H; e[a/bⁿ]; Σ⟩
CASE-DEF  ⟨H; case a of . . . , _ → e, . . .; Σ⟩                     ⟶  ⟨H; e; Σ⟩                        (if no other match)

Figure 7: Operational semantics of Strict CoreANF

Γ ⊢ b : Γ

  ──────────────── (BNDRSEMPTY)
  Γ ⊢ 〈〉 : Γ

  Γ, α:κ ⊢ b : Γ′
  ──────────────── (BNDRSTY)
  Γ ⊢ α:κ, b : Γ′

  Γ ⊢κ τ : ?    Γ, x:τ ⊢ b : Γ′
  ─────────────────────────────── (BNDRSVAL)
  Γ ⊢ x:τ, b : Γ′

Γ ⊢app b → τ @ g : υ

  ──────────────────────────── (APPEMPTY)
  Γ ⊢app 〈〉 → τ @ 〈〉 : τ

  Γ ⊢a a : σ    Γ ⊢app b → τ @ g : υ
  ───────────────────────────────────── (APPVAL)
  Γ ⊢app (_:σ, b) → τ @ a, g : υ

  Γ ⊢κ σ : κ    Γ ⊢app (b → τ)[σ/α] @ g : υ
  ─────────────────────────────────────────── (APPTY)
  Γ ⊢app (α:κ, b) → τ @ σ, g : υ

Figure 5: Typing rules dealing with multiple abstraction and application

Heap values   h ::= λb. e                 Abstraction
                  | C τ, a                Constructor
                  | IND a                 Indirection
                  | ●                     Black hole

Heaps         H ::= ε | H, x ↦ h

Stacks        Σ ::= ε
                  | update x . Σ
                  | let x:τ = • in e . Σ

Figure 6: Syntax for operational semantics of Strict CoreANF

• Rule ALLOC performs heap allocation, by allocating one or more heap values, each of which may point to the others. We model the heap address of each value by a fresh variable y that is not already used in the heap, and freshen both the v and e to reflect this renaming.
• Rule BETA performs β-reduction, by simultaneously substituting for all the binders in one step. This simultaneous substitution models the idea of calling a function passing several arguments in registers. The static semantics guarantees that the number of arguments at the call site exactly matches what the function is expecting.

Rules CASE-LIT, CASE-CON, and CASE-DEF deal with pattern matching (see Section 3.5); while ENTER, UPDATE, and IND deal with thunks (Section 3.4).

3.4 Thunks

Because Strict CoreANF is a call-by-value language, if we need to delay evaluation of an expression we must explicitly thunk it in the program text, and correspondingly force it when we want to actually access the value.

If we only cared about call-by-name, we could model a thunk as a nullary function (a function binding 0 arguments) with type 〈〉 → Int. Then we could thunk a term e by wrapping it in a nullary lambda λ〈〉. e, and force a thunk by applying it to 〈〉. This call-by-name approach would unacceptably lose sharing, but we can readily turn it into call-by-need by treating nullary functions (henceforth called thunks) specially in the operational semantics (Figure 7), which is what we do:

• In rule ENTER, an application of a thunk to 〈〉 pushes onto the stack a thunk update frame mentioning the thunk name. It also overwrites the thunk in the heap with a black hole (●), to express the fact that entering a thunk twice with no intervening update is always an error [3]. We call all this entering, or forcing, a thunk.
• When the machine evaluates to a result (a vector of atoms a), UPDATE overwrites the black hole with an indirection IND a, pops the update frame, and continues as if it had never been there.
• Finally, the IND rule ensures that, should the original thunk be entered again, the value saved in the indirection is returned directly (remember – the indirection overwrote the pointer to the thunk definition that was in the heap), so that the body of the thunk is evaluated at most once.

We use thunking to describe the process of wrapping a term e in a nullary function λ〈〉. e. Because thunking is so common, we use syntactic sugar for the thunking operation on both types and expressions – if something is enclosed in {braces} then it is a thunk. See Figure 2 for details.

An unusual feature is that Strict CoreANF supports multi-valued thunks, with a type such as 〈〉 → 〈Int, Bool〉, or (using our syntactic sugar) {Int, Bool}. Multi-thunks arose naturally from treating thunks as a special kind of function, but this additional expressiveness turns out to allow us to do at least one new optimisation: deep unboxing (Section 5.6).
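For intuition, the ENTER/UPDATE/IND machinery behaves like the following executable model in plain Haskell (a sketch of our own using IORef, not how the paper or GHC implements thunks): a thunk is a mutable cell that is overwritten with its value the first time it is forced, so the suspended computation runs at most once.

import Data.IORef

-- A thunk is either a suspended computation or its cached value.
data ThunkState a = Unevaluated (IO a) | Evaluated a

newtype Thunk a = Thunk (IORef (ThunkState a))

mkThunk :: IO a -> IO (Thunk a)
mkThunk act = Thunk <$> newIORef (Unevaluated act)

-- Forcing corresponds to ENTER; writing the result back corresponds to UPDATE.
force :: Thunk a -> IO a
force (Thunk ref) = do
  st <- readIORef ref
  case st of
    Evaluated v     -> return v             -- IND: return the cached value directly
    Unevaluated act -> do
      v <- act
      writeIORef ref (Evaluated v)          -- overwrite with an indirection to the value
      return v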


Arguably, we should not conflate the notions of functions and thunks, especially since we have special cases in our operational semantics for nullary functions. However, the similarity of thunks and nullary functions does mean that some parts of the compiler can be cleaner if we adopt this conflation. For example, if the compiler detects that all of the arguments to a function of type 〈Int, Bool〉 → Int are absent (not used in the body) then the function can be safely transformed to one of type 〈〉 → Int, but not one of type Int – as that would imply that the body is always evaluated immediately. Because we conflate thunks and nullary functions, this restriction just falls out naturally as part of the normal code for discarding absent arguments rather than being a special case (as it is in GHC today).

One potential side effect of this is that, for example, we may detect that the unit argument is absent in a function of type 〈()〉 → Int and turn it into one of type 〈〉 → Int. This might increase memory usage, as the resulting function has its result memoized! Although this is a bit surprising, it is at least not a property peculiar to our intermediate language – this is actually the behaviour of GHC today, and the same issue crops up in other places too – such as when “floating” lets out of lambdas [4].

3.5 Data types

We treat Int and Char as built-in types, with a suitable family of (call-by-value) operations. A value of type Char is an evaluated character, not a thunk (i.e. like ML, not like Haskell), and similarly Int. To allow a polymorphic function to manipulate values of these built-in types, they must be boxed (i.e. represented by a heap pointer like every other value). A real implementation, however, might have additional unboxed (not heap allocated) types, Char#, Int#, which do not support polymorphism [5], but we ignore these issues here.

All other data types are built by declaring a new algebraic data type, using a declaration d, each of which has a number of constructors (c). For example, we represent the (lazy) list data type with a top-level definition like so:

data List a :∗ = Nil | Cons 〈{a}, {List a}〉

Applications of data constructors cause heap allocation, and hence (as we noted in Section 3.3), values drawn from these types can only be allocated by a valrec expression.

The operational semantics of case expressions are given in rules CASE-LIT, CASE-CON, and CASE-DEF, which are quite conventional (Figure 7). Notice that, unlike Haskell, case does not perform evaluation – that is done by let in EVAL. The only subtlety (present in all such calculi) is in rule CASE-CON: the constructor C must be applied to both its type and value arguments, whereas a pattern match for C binds only its value arguments. For the sake of simplicity we restrict ourselves to vanilla Haskell 98 data types, but there is no difficulty with extending Strict Core to include existentials, GADTs, and equality constraints [6].

3.6 A-normal form and syntactic sugar

The language as presented is in so-called A-normal form (ANF), where intermediate results must all be bound to a name before they can be used in any other context. This leads to a very clear operational semantics, but there are at least two good reasons to avoid the use of ANF in practice:

• In the implementation of a compiler, avoiding the use of ANF allows a syntactic encoding of the fact that an expression occurs exactly once in a program. For example, consider the following program:

  (λ〈α:∗, x:α〉. x) 〈Int, 1〉

  The compiler may manifestly see, using purely local information, that it can perform β-reduction on this term, without the worry that it might increase code size. The same is not true in a compiler using ANF, because the ability to do β-reduction without code bloat depends on your application site being the sole user of the function – a distinctly non-local property!
• Non-ANFed terms are often much more concise, and tend to be more understandable to the human reader.

In the remainder of the paper we will adopt a non-ANFed variant of Strict CoreANF which we simply call Strict Core, by making use of the following simple extension to the grammar and type rules:

a ::= . . . | e | v

  Γ ⊢ e : 〈τ〉
  ───────────── (SING)
  Γ ⊢a e : τ

  Γ ⊢v v : τ
  ───────────── (VAL)
  Γ ⊢a v : τ

The semantics of the new form of atom are given by a standard ANFing transformation into Strict CoreANF. Note that there are actually several different choices of ANF transformation, corresponding to a choice about whether to evaluate arguments or functions first, and whether arguments are evaluated right-to-left or vice-versa. The specific choice made is not relevant to the semantics of a pure language like Strict Core.

3.7 Types are calling conventions

Consider again the example with which we began this paper. Here are several different Strict Core types that express different calling conventions:

f1 : Int → Bool → (Int, Bool)
f2 : 〈Int, Bool〉 → (Int, Bool)
f3 : (Int, Bool) → 〈Int, Bool〉
f4 : 〈{Int}, Bool〉 → (Int, Bool)

Here f1 is a curried function, taking its arguments one at a time; f2 takes two arguments at once, but returns a heap-allocated pair; f3 takes a heap-allocated pair and returns two results (presumably in registers); while f4 takes two arguments at once, but the first is a thunk. In this way, Strict CoreANF directly expresses the answers to the questions posed in the Introduction.

By expressing all of these operational properties explicitly in our intermediate language we expose them to the wrath of the optimiser. Section 5 will show how we can use this new information about calling convention to cleanly solve the problems considered in the introduction.

3.8 Type erasure

Although we do not explore it further in this paper, Strict CoreANF has a simple type-erased counterpart, where type binders in λs, type arguments and heap values have been dropped. A natural consequence of this erasure is that functions such as 〈a : ∗〉 → 〈Int〉 will be converted into thunks (like 〈〉 → 〈Int〉), so their results will be shared.

4. Translating laziness

We have defined a useful-looking target language, but we have not yet shown how we can produce terms in it from those of a more traditional lazy language. In this section, we present a simple source language that captures the essential features of Haskell, and show how we can translate it into Strict Core.

Figure 8 presents a simple, lazy, explicitly-typed source language, a kind of featherweight Haskell, or FH. It is designed to be a suitable target language for the desugaring of programs written in Haskell, and is deliberately similar to GHC’s current intermediate language (which we call Core). Due to space constraints, we omit the type rules and dynamic semantics for this language – suffice to say that they are perfectly standard for a typed lambda calculus like System Fω [7].


Types
  τ, υ, σ ::= T                      Type constructors
      | α                            Type variables
      | τ → τ                        Function types
      | ∀α:κ. τ                      Quantification
      | τ τ                          Type application

Expressions
  e ::= ℓ                            Unlifted literals
      | C                            Built-in data constructors
      | x                            Variables
      | e e                          Value application
      | e τ                          Type application
      | λx:τ. e                      Functions binding values
      | Λα:κ. e                      Functions binding types
      | let x:τ = e in e             Recursive name binding
      | case e of p → e              Evaluation and branching

Patterns
  p ::= _                            Default case / ignores eval. result
      | ℓ                            Matches exact literal value
      | C x:τ                        Matches data constructor

Data Types
  d ::= data T α:κ = c | . . . | c   Data declarations
  c ::= C τ                          Data constructors

Programs             d, e

Figure 8: The FH language

[[τ : κ]] : κ

  [[T]]           = T
  [[α]]           = α
  [[τ1 → τ2]]     = {[[τ1]]} → [[τ2]]
  [[∀α:κ. τ]]     = α:κ → [[τ]]
  [[τ1 τ2]]       = [[τ1]] [[τ2]]

Figure 9: Translation from FH to Strict Core types

[[e : τ]] : 〈[[τ]]〉

  [[ℓ]]                   = ℓ
  [[C]]                   = Cwrap
  [[x]]                   = x 〈〉
  [[e τ]]                 = [[e]] [[τ]]
  [[Λα:κ. e]]             = λα:κ. [[e]]
  [[e1 e2]]               = [[e1]] {[[e2]]}
  [[λx:τ. e]]             = λx:{[[τ]]}. [[e]]
  [[let x:τ = e in eb]]   = valrec x:{[[τ]]} = {[[e]]} in [[eb]]
  [[case es of p → e]]    = case [[es]] of [[p]] → [[e]]

[[p]]

  [[ℓ]]          = ℓ
  [[C x:τ]]      = C x:{[[τ]]}
  [[_]]          = _

Figure 10: Translation from FH to Strict Core expressions

D[[d]]

  D[[data T α:κ = C1 τ1 | . . . | Cn τn]]
    = data T α:κ = C1 {[[τ]]}1 | . . . | Cn {[[τ]]}n

W[[d]]

  W[[data T α:κ^r = C1 τ1^m1 | . . . | Cn τn^mn]]
    =  Ck^wrap = λα1:κ1. . . . λαr:κr.
                 λx1:{[[τ1,k]]}. . . . λxmk:{[[τmk,k]]}.
                 Ck 〈α^r, x^mk〉
       (one such binding for each constructor Ck, 1 ≤ k ≤ n)

[[d, e]]

  [[d, e]] = D[[d]], valrec W[[d]] in [[e]]

Figure 11: Translation from FH to Strict Core programs


4.1 Type translation

The translation from FH to Strict Core types is given by Figure 9. The principal interesting feature of the translation is the way it deals with function types. First, the translation makes no use of n-ary types at all: both ∀ and function types translate to 1-ary functions returning a 1-ary result.

Second, function arguments are thunked, reflecting the call-by-need semantics of application in FH, but result types are left unthunked. This means that after being fully applied, functions eagerly evaluate to get their result. If a use-site of that function wants to delay the evaluation of the application it must explicitly create a thunk.
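Figure 9 is small enough to transcribe directly. The following Haskell sketch is our own encoding (the AST types and constructor names are invented for illustration); the only interesting clauses are the ones for function and forall types, which thunk argument types but leave result types bare.

data Kind = Star | KArr Kind Kind

data FHType
  = FHCon String                 -- T
  | FHVar String                 -- alpha
  | FHFun FHType FHType          -- tau -> tau
  | FHForall String Kind FHType  -- forall alpha:kappa. tau
  | FHApp FHType FHType          -- tau tau

data SCBinder = ValBind SCType | TyBind String Kind   -- _:tau or alpha:kappa

data SCType
  = SCCon String
  | SCVar String
  | SCFun [SCBinder] [SCType]    -- n-ary function type  b -> tau
  | SCApp SCType SCType

-- A thunk type {tau} is sugar for the nullary function type <> -> <tau>.
thunkTy :: SCType -> SCType
thunkTy t = SCFun [] [t]

-- The translation of Figure 9: thunk argument types, leave result types bare.
translateTy :: FHType -> SCType
translateTy (FHCon t)        = SCCon t
translateTy (FHVar a)        = SCVar a
translateTy (FHFun t1 t2)    = SCFun [ValBind (thunkTy (translateTy t1))] [translateTy t2]
translateTy (FHForall a k t) = SCFun [TyBind a k] [translateTy t]
translateTy (FHApp t1 t2)    = SCApp (translateTy t1) (translateTy t2)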

4.2 Term translation

The translation from FH terms to those in Strict Core becomes almost inevitable given our choice for the type translation, and is given by Figure 10. It satisfies the invariant:

x:τ ⊢FH e : υ  =⇒  x:{[[τ]]} ⊢ [[e]] : 〈[[υ]]〉

The translation makes extensive use of our syntactic sugar and ability to write non-ANFed terms, because the translation to Strict CoreANF is highly verbose. For example, the translation for applications into Strict CoreANF would look like this:

[[e1 e2]] = let 〈f〉 = [[e1]] in
            valrec x = λ〈〉. [[e2]] in f 〈x〉

The job of the term translation is to add explicit thunks to the Strict Core output wherever we had implicit laziness in the FH input program. To this end, we add thunks around the result of the translation in “lazy” positions – namely, arguments to applications and in the right hand side of let bindings. Dually, when we need to access a variable, it must have been the case that the binding site for the variable caused it to be thunked, and hence we need to explicitly force variable accesses by applying them to 〈〉.

Bearing all this in mind, here is the translation for a simple application of a polymorphic identity function to 1:

[[(Λα :?. λx :α. x ) Int 1]] = (λα :?. λx :{α}. x 〈〉) Int {1}


4.3 Data type translation

In any translation from FH to Strict Core we must account for (a) the translation of data type declarations themselves, (b) the translation of constructor applications, and (c) the translation of pattern matching. We begin with (a), using the following FH data type declaration for lists:

data List α :∗ = Nil | Cons α (List α)

The translation D, shown in Figure 11, yields this Strict Core declaration:

data List α:∗ = Nil | Cons 〈{α}, {List α}〉

The arguments are thunked, as you would expect, but the constructor is given an uncurried type of (value) arity 2. So the types of the data constructor Cons before and after translation are:

FH           Cons : ∀α. α → List α → List α
Strict Core  Cons : 〈α:?, {α}, {List α}〉 → 〈List α〉

We give Strict Core data constructors an uncurried type to reflect their status as expressing the built-in notions of allocation and pattern matching (Figure 7). However, since the type of the Strict Core Cons is not simply the translation of the type of the FH Cons, we define a top-level wrapper function Conswrap which does have the right type:

Conswrap = λα:∗. λx:{α}. λxs:{List α}. Cons 〈α, x, xs〉

Now, as Figure 10 shows, we translate a call of a data constructor C to a call of Cwrap. (As an optimisation, we refrain from thunking the definition of the wrapper and forcing its uses, which accounts for the different treatment of C and x in Figure 10.) We expect that the wrappers will be inlined into the program by an optimisation pass, exposing the more efficient calling convention at the original data constructor use site.

The final part of the story is the translation of pattern matching. This is also given in Figure 10 and is fairly straightforward once you remember that the types of the bound variables must be thunked to reflect the change to the type of the data constructor functions.⁴

⁴ It is straightforward (albeit fiddly) to extend this scheme with support for strict fields in data types, which is necessary for full Haskell 98 support.

Finally, the translation for programs, also given in Figure 11, ties everything together by using both the data types and expression translations.

4.4 The seq function

A nice feature of Strict CoreANF is that it is possible to give a straightforward definition of the primitive seq function of Haskell:

seq : {α:∗ → β:∗ → {α} → {β} → β}
    = {λα:∗. λβ:∗. λx:{α}. λy:{β}. let _:α = x 〈〉 in y 〈〉}

5. Putting Strict Core to work

In this section we concentrate on how the features of Strict Core can be of aid to an optimising compiler that uses it as an intermediate language. These optimisations all exploit the additional operational information available from the types-as-calling-conventions correspondence in order to improve the efficiency of generated code.

5.1 Routine optimisations

Strict Core has a number of equational laws that have applications to program optimisation. We present a few of them in Figure 12.

The examples we present in this section will usually already have had these equational laws applied to them, if the rewrite represents an improvement in their efficiency or readability. For an example of how they can improve programs, notice that in the translation we give from FH, variable access in a lazy context (such as the argument of an application) results in a redundant thunking and forcing operation. We can remove that by applying the η law:

[[f y ]] = [[f ]] 〈λ 〈〉 . [[y ]]〉 = f 〈〉 〈λ 〈〉 . y 〈〉〉 = f 〈〉 〈y〉

5.2 Expressing the calling convention for strict arguments

Let’s go back to the first example of a strict function, from Section 2.1:

f :: Bool → Int
f x = case x of True → . . . ; False → . . .

We claimed that we could not, while generating the code for f, assume that the x argument was already evaluated, because that is a fragile property that would be tricky to guarantee for all call-sites. In Strict Core, the evaluated/non-evaluated distinction is apparent in the type system, so the property becomes robust. Specifically, we can apply the standard worker/wrapper transformation [8, 9] to f as follows:

fwork : Bool → Int
fwork = λx:Bool. case x of True 〈〉 → . . . ; False 〈〉 → . . .

f : {Bool} → Int
f = λx:{Bool}. fwork 〈x 〈〉〉

Here the worker fwork takes a definitely-evaluated argument of type Bool, while the wrapper f takes a lazy argument and forces it before calling fwork. By inlining the f wrapper selectively, we will often be able to avoid the forcing operation altogether, by cancelling it with explicit thunk creation. Because every lifted (i.e. lazy) type in Strict Core has an unlifted (i.e. strict) equivalent, we are able to express all of the strictness information resulting from strictness analysis by a program transformation in this style. This is unlike the situation in GHC today, where we can only do this for product types; in particular, strict arguments with sum types such as Bool have their strictness information applied in a much more ad-hoc manner.

We suggested in Section 2 that this notion could be used to improve the desugaring of dictionary arguments. At this point, the approach should be clear: during desugaring of Haskell into Strict Core, dictionary arguments should not be wrapped in explicit thunks, ever. This entirely avoids the overhead of evaluatedness checking for such arguments.
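At the source level the same worker/wrapper shape can be sketched with a bang pattern, although ordinary Haskell types cannot record the evaluatedness guarantee the way Strict Core types do. The following is purely illustrative GHC Haskell of our own, not the paper’s transformation:

{-# LANGUAGE BangPatterns #-}

-- Worker: written as if its argument has already been evaluated, so it can
-- scrutinise x directly.
fWork :: Bool -> Int
fWork True  = 1
fWork False = 0

-- Wrapper: forces the possibly-lazy argument, then calls the worker.
-- Inlining the wrapper at call sites lets the force cancel with thunk creation there.
f :: Bool -> Int
f !x = fWork x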

5.3 Exploiting the multiple-result calling convention

Our function types have first-class support for multiple arguments and results, so we can express the optimisation enabled by a constructed product result (CPR) analysis [10] directly. For example, translating splitList from Section 2.4 into Strict Core yields the following program:

splitList = {λxs:{List Int}. case xs 〈〉 of
  Cons 〈y:{Int}, ys:{List Int}〉 → (,) 〈Int, List Int, y, ys〉}

Here we assume that we have translated the FH pair type in the standard way to the following Strict Core definition:

data (,) α:∗ β:∗ = (,) 〈{α}, {β}〉

After a worker/wrapper transformation informed by CPR analysis we obtain a version of the function that uses multiple results, like so:

splitListwork = λxs:{List Int}. case xs 〈〉 of
  Cons 〈y:{Int}, ys:{List Int}〉 → 〈y, ys〉

splitList = {λxs:{List Int}.
  let 〈y:{Int}, ys:{List Int}〉 = splitListwork xs
  in (,) 〈Int, List Int, y, ys〉}


β                      valrec x:τ = λbⁿ. e in x aⁿ                              = e[a/bⁿ]
η                      valrec x:τ = λbⁿ. y bⁿ in e                               = let 〈x:τ〉 = 〈y〉 in e
let                    let x:τ = a in e                                          = e[a/x]
let-float              let x:τ1 = (let y:σ2 = e1 in e2) in e3                    = let y:σ2 = e1 in let x:τ1 = e2 in e3
valrec-float           let x:τ = (valrec y:σ = e in e2) in e3                    = valrec y:σ = e in let x:τ = e2 in e3
valrec-join            valrec x:τ = e in valrec y:σ = e in e                     = valrec x:τ = e, y:σ = e in e
case-constructor-elim  valrec x:τ = C τ, aⁿ in case x of . . . C bⁿ → e . . .    = valrec x:τ = C τ, aⁿ in e[a/bⁿ]
case-literal-elim      case ℓ of . . . ℓ → e . . .                               = e

Figure 12: Sample equational laws for Strict CoreANF

Once again, inlining the wrapper splitList at its call sites can often avoid the heap allocation of the pair ((,)).

Notice that the worker is a multi-valued function that returns two results. GHC as it stands today has a notion of an “unboxed tuple” type that supports multiple return values, but this extension has never fitted neatly into the type system of the intermediate language. Strict Core gives a much more principled treatment of the same concept.
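For comparison, here is roughly what the multiple-result worker looks like when written against GHC’s existing UnboxedTuples extension (an illustration of our own, not code from the paper): the worker returns both results without allocating a pair, and the wrapper reboxes them for ordinary callers.

{-# LANGUAGE UnboxedTuples #-}

splitListWork :: [Int] -> (# Int, [Int] #)
splitListWork (y:ys) = (# y, ys #)
splitListWork []     = (# 0, [] #)   -- arbitrary: the paper's version only handles non-empty lists

splitList :: [Int] -> (Int, [Int])
splitList xs = case splitListWork xs of
  (# y, ys #) -> (y, ys)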

5.4 Redundant evaluation

Consider this program:

data Colour = R | G | B

f x = case x of
  R → . . .
  _ → . . . (case x of G → . . . ; B → . . .) . . .

In the innermost case expression, we can be certain that x has already been evaluated – and we might like to use this information to generate better code for that inner case split, by omitting evaluatedness checks. However, notice that it translates into Strict Core like so:

f = {λx. case x 〈〉 of
  R 〈〉 → . . .
  _ → . . . (case x 〈〉 of G 〈〉 → . . .
                          B 〈〉 → . . .) . . .}

It is clear that to avoid redundant evaluation of x we can simply apply common-subexpression elimination (CSE) to the program:

f = {λx. let x′ = x 〈〉 in
  case x′ of
    R 〈〉 → . . .
    _ → . . . (case x′ of G 〈〉 → . . .
                          B 〈〉 → . . .) . . .}

This stands in contrast to GHC today, where an ad-hoc mechanism tries to discover opportunities for exactly this optimisation.

5.5 Thunk elimination

There are some situations where delaying evaluation by inserting a thunk just does not seem worth the effort. For example, consider this FH source program:

let xs :List Int = Cons Int y ys

The translation of this program into Strict Core will introduce a wholly unnecessary thunk around xs, thus

valrec xs :{List Int} = {Cons 〈Int , y , ys〉}

It is obviously stupid to build a thunk for something that is already a value, so we would prefer to see

valrec xs :List Int = Cons 〈Int , y , ys〉

but now references to xs in the body of the valrec will be badly-typed! As usual, we can solve the impedance mismatch by adding an auxiliary definition:

valrec xs′ : List Int = Cons 〈Int, y, ys〉 in
valrec xs : {List Int} = {xs′}

Indeed, if you think of what this transformation would look like in Strict CoreANF, it amounts to floating a valrec (for xs′) out of a thunk, a transformation that is widely useful [4]. Now, several optimisations suggest themselves:

• We can inline xs freely at sites where it is forced, thus (xs 〈〉), which then simplifies to just xs′.
• Operationally, the thunk λ〈〉. xs′ behaves just like IND xs′, except that the former requires an update (Figure 7). So it would be natural for the code generator to allocate an IND directly for a nullary lambda that returns immediately.
• GHC’s existing runtime representation goes even further: since every heap object needs a header word to guide the garbage collector, it costs nothing to allow an evaluated Int to be enterable. In effect, a heap object of type Int can also be used to represent a value of type {Int}, an idea we call auto-lifting. That in turn means that the binding for xs generates literally no code at all – we simply use xs′ where xs is mentioned.

One complication is that thunks cannot be auto-lifted. Consider this program:

valrec f : {Int} = {⊥} in valrec g : {{Int}} = {f} in g 〈〉

Clearly, the program should terminate. However, if we adopt auto-lifting for thunks then at runtime g and f will alias and hence we will cause the evaluation of ⊥! So we must restrict auto-lifting to thunks of non-polymorphic, non-thunk types. (Another alternative would be to restrict the kind system so that thunks of thunks and instantiation of type variables with thunk types is disallowed, which might be an acceptable tradeoff.)

5.6 Deep unboxing

Another interesting possibility for optimisation in Strict Core is the exploitation of “deep” strictness information by using n-ary thunks to remove some heap allocated values (a process known as unboxing). What we mean by this is best understood by example:

valrec f : {({Int}, {Int})} → Int
  = λ〈pt : {({Int}, {Int})}〉.
      valrec c : Bool = . . . in
      case c of True 〈〉  → 1
                False 〈〉 → case pt 〈〉 of (x, y) → (+) 〈x 〈〉, y 〈〉〉

Typical strictness analyses will not be able to say definitively that f is strict in pt (even if c is manifestly False!). However, some strictness analysers might be able to tell us that if pt is ever evaluated then both of its components certainly are. Taking advantage of this information in a language without explicit thunks would be fiddly at best, but in our intermediate language we can use the worker/wrapper transformation to potentially remove some thunks by adjusting the definition of f like so:

valrec fwork : {Int, Int} → Int =
  λpt′ : {Int, Int}.
    valrec c : Bool = . . . in
    case c of True 〈〉  → 1
              False 〈〉 → let 〈x′:Int, y′:Int〉 = pt′ 〈〉
                         in (+) 〈x′, y′〉,

f : {({Int}, {Int})} → Int =
  λ〈pt : {({Int}, {Int})}〉.
    valrec pt′ : {Int, Int} =
      {case pt 〈〉 of (x, y) → 〈x 〈〉, y 〈〉〉}
    in fwork pt′

Once again, inlining the new wrapper function at the use sites has the potential to cancel with pair and thunk allocation by the callers, avoiding heap allocation and indirection.

Note that the ability to express this translation actually depended on the ability of our new intermediate language to express multi-thunks (Section 3.4) – i.e. thunks that, when forced, evaluate to multiple results, without necessarily allocating anything on the heap.

6. Arity raising

Finally, we move on to two optimisations that are designed to improve function arity – one that improves the arity of a function by examining how the function is defined, and one that realises an improvement by considering how it is used. These optimisations are critical to ameliorating the argument-at-a-time worst case for applications that occurs in the output of the naive translation from FH. GHC does some of these arity-related optimisations in an ad-hoc way already; the contribution here is to make them more systematic and robust.

6.1 Definition-site arity raising

Consider the following Strict Core binding:

valrec f :Int → Int → Int = λx :Int . λy :Int . e in f 1 2

This code is a perfect target for one of the optimisations that Strict Core lets us express cleanly: definition-site arity raising. Observe that currently callers of f are forced to apply it to its arguments one at a time. Why couldn't we change the function so that it takes both of its arguments at the same time?

We can realise the arity improvement for f by using, once again, a worker/wrapper transformation. The wrapper, to which we give the original function name f, simply does the arity adaptation before calling into a worker. The worker, which we call fwork, is then responsible for the rest of the calculation of the function⁵:

valrec fwork :〈Int , Int〉 → Int = λ〈x :Int , y :Int〉. e
       f :Int → Int → Int = λx :Int . λy :Int . fwork 〈x , y〉
in f 1 2

At this point, no improvement has yet occurred – indeed, we will have made the program worse by adding a layer of indirection via the wrapper! However, once the wrapper is vigorously inlined at the call sites by the compiler, it will often be the case that the wrapper will cancel with work done at the call site, leading to a considerable efficiency improvement:

⁵ Since e may mention f, the two definitions may be mutually recursive.

valrec fwork :〈Int , Int〉 → Int = λ〈x :Int , y :Int〉. e
in fwork 〈1, 2〉

This is doubly true in the case of recursive functions, because by performing the worker/wrapper split and then inlining the wrapper into the recursive call position, we remove the need to heap-allocate a number of intermediate function closures representing partial applications in a loop.

Although this transformation can be a big win, we have to be a bit careful about where we apply it. The ability to apply arguments one at a time to a curried function really makes a difference to efficiency sometimes, because call-by-need (as opposed to call-by-name) semantics allows work to be shared between several invocations of the same partial application. To see how this works, consider this Strict Core program fragment:

valrec g :Int → Int → Int
       = (λx :Int . let s = fibonacci x in λy :Int . . . .) in
let h :Int → Int = g 5 in h 10 + h 20

Because we share the partial application of g (by naming it h), we will only compute the application fibonacci 5 once. However, if we were to “improve” the arity of g by turning it into a function of type 〈Int , Int〉 → Int , then it would simply be impossible to express the desired sharing! Loss of sharing can easily outweigh the benefits of a more efficient calling convention.
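The same effect can be observed directly in Haskell. The hedged sketch below (names g, gUncurried, shared and unshared are ours, and fibonacci stands in for any expensive function) shows the sharing that a naive arity-raise to an uncurried function would destroy:

fibonacci :: Int -> Integer
fibonacci n = fibs !! n
  where fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

-- Curried: the work between the two lambdas is captured by the partial application.
g :: Int -> Int -> Integer
g x = let s = fibonacci x in \y -> s + fromIntegral y

shared :: Integer
shared = let h = g 5 in h 10 + h 20           -- fibonacci 5 is computed once

-- Uncurried: every call recomputes fibonacci 5; the sharing is gone.
gUncurried :: (Int, Int) -> Integer
gUncurried (x, y) = fibonacci x + fromIntegral y

unshared :: Integer
unshared = gUncurried (5, 10) + gUncurried (5, 20)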

Identifying some common cases where no significant sharing would be lost by increasing the arity is not hard, however. In particular, unlike g, it is safe to increase the arity of f to 2, because f does no work (except allocate function closures) when applied to fewer than 2 arguments. Another interesting case where we might consider raising the arity is where the potentially-shared work done by a partial application is, in some sense, cheap – for example, if the sharable expressions between the λs just consist of a bounded number of primitive operations. We do not attempt to present a suitable arity analysis in this paper; our point is only that Strict Core gives a sufficiently expressive medium to express its results.

6.2 Use-site arity raising

This is, however, not the end of the story as far as arity raising is concerned. If we can see all the call-sites for a function, and none of the call sites share partial applications of fewer than n arguments, then it is perfectly safe to increase the arity of that function to n, regardless of whether or not the function does work that is worth sharing if you apply fewer than n arguments. For example, consider function g from the previous sub-section, and suppose the body of its valrec was . . . (g p q) . . . (g r s) . . .; that is, every call to g has two arguments. Then no sharing is lost by performing arity raising on its definition, but considerable efficiency is gained.

This transformation not only applies to valrec-bound functions, but also to uses of higher-order functional arguments. After translation of the zipWith function from Section 2.2 into Strict Core, followed by discovery of its strictness and definition-site arity properties, the worker portion of the function that remains might look like the following:

valrec zipWith :〈a :∗, b :∗, c :∗, {{a} → {b} → c}, List a, List b〉 → List c
       = λ〈a :∗, b :∗, c :∗, f :{{a} → {b} → c}, xs :List a, ys :List b〉.
         case xs of
           Nil 〈〉 → Nil c
           Cons 〈x :{a}, xs′ :{List a}〉 →
             case ys of
               Nil 〈〉 → Nil c
               Cons 〈y :{b}, ys′ :{List b}〉 →
                 Cons 〈c, f 〈〉 x y , zipWith 〈a, b, c, f , xs′ 〈〉, ys′ 〈〉〉〉.


Notice that f is only ever applied in the body to three arguments at a time – 〈〉, x , and y (or rather 〈x〉 and 〈y〉). Based on this observation, we could re-factor zipWith so that it applied its function argument to all these arguments (namely 〈x , y〉) at once. The resulting wrapper would look like this (omitting a few types for clarity):

valrec zipWith :〈a :∗, b :∗, c :∗, {{a} → {b} → c}, List a, List b〉 → List c
       = λ〈a :∗, b :∗, c :∗, f , xs, ys〉.
         valrec f ′ :〈{a}, {b}〉 → c = λ〈x , y〉. f 〈〉 x y
         in zipWithwork 〈a, b, c, f ′, xs, ys〉

To see how this can lead to code improvement, consider a call zipWith 〈Int , Int , Int , g , xs, ys〉, where g is the function from Section 6.1. Then, after inlining the wrapper of zipWith, we can see locally that g is applied to all its arguments and can therefore be arity-raised. Now, the wrapper of g will cancel with the definition of f ′, leaving the call we really want:

zipWithwork 〈Int , Int , Int , gwork, xs, ys〉

6.3 Reflections on arity-raising

Although the use-site analysis might, at first blush, seem to be more powerful than the definition-site one, it is actually the case that the two arity raising transformations are each able to improve the arities of some functions where the other cannot. In particular, for a compiler that works module-by-module like GHC, the use-site analysis will never be able to improve the arity of a top-level function, as some of the call sites are unknown statically.

The key benefits of the new intermediate language with regard to the arity raising transformations are as follows:

• Arity in the intermediate language is more stable. It is almost impossible for a compiler transformation to accidentally reduce the arity of a function without causing a type error, whereas accidental reduction of arity is a possibility we must actively concern ourselves with avoiding in the GHC of today.

• Expressing arity in the type system allows optimisations to be applied to the arity of higher-order arguments, as we saw in Section 6.2.

• By expressing arity statically in the type information, it is possible that we could replace GHC's current dynamic arity discovery [2] with purely static arity dispatch. This requires that arity raising transformations like these two can remove enough of the argument-at-a-time worst cases such that we obtain satisfactory performance with no run-time tests at all.

• If purely static arity discovery turns out to be too pessimistic in practice (a particular danger for higher order arguments), it would still be straightforward to adapt the dynamic discovery process for this new core language, but we can avoid using it except in those cases where it could give a better result than static dispatch. Essentially, if we appear to be applying at least two groups of arguments to a function, then at that point we should generate code to dynamically check for a better arity before applying the first group.

7. Related work

Benton et al.'s Monadic Intermediate Language (MIL) [11] is similar to our proposed intermediate language. The MIL included both n-ary lambdas and multiple returns from a function, but lacked a treatment of thunks due to aiming to compile a strict language. MIL also included a sophisticated type system that annotated the return type of functions with potential computational effects, including divergence. This information could be used to ensure the soundness of arity-changing transformations – i.e. uncurrying is only sound if a partial application has no computational effects.

Both MIL and the Bigloo Scheme compiler [12] (which could express n-ary functions) included versions of what we have called arity definition-site analysis. However, the MIL paper does not seem to consider the work-duplication issues involved in the arity raising transformation, and the Bigloo analysis was fairly simple-minded – it only coalesced manifestly adjacent lambdas, without allowing (for example) potentially shareable work to be duplicated as long as it was cheap. We think that both of these issues deserve a more thorough investigation. A simple arity definition-site analysis is used by SML/NJ [13], though the introduction of n-ary arguments is done by a separate argument-flattening pass later on in the compiler rather than being made immediately manifest.

In MIL, function application used purely static arity information. Bigloo used a hybrid static/dynamic arity dispatch scheme, but unfortunately does not appear to report on the cost (or otherwise) of operating purely using static arity information.

The intermediate language discussed here is in some ways an extension of the L2 language [14], which also explored the possibility of an optimising compiler suitable for both strict and lazy languages. We share with L2 an explicit representation of thunking and forcing operations, but take this further by additionally representing the operational notions of unboxing (through multiple function results) and arity. The L2 language shares with the MIL the fact that it makes an attempt to support impure strict languages, which we do not – though impure operations could potentially be desugared into our intermediate language using a state-token or continuation-passing style to serialize execution.

GRIN [15] is another language that used an explicit representation of thunks and boxing properties. Furthermore, GRIN uses a first-order program representation where the structure of closures is explicit – in particular, this means that unboxing of closures is expressible.

The FLEET language [16] takes yet another tack. Thunked and unthunked values have the same type, but can be distinguished by the compiler by inspecting flow labelling information attached to every type – if the flow information includes no label from a thunk creation site, then the value must be in WHNF. A variant of the language, CFleet, has n-ary abstraction but does not support n-ary result types.

The IL language [17] represents thunks explicitly by way of continuations with a logical interpretation, and is to our knowledge the first discussion of auto-lifting in the literature. Their logic-based approach could perhaps be extended to accommodate a treatment of arity and multiple-value expressions if “boxed” and “unboxed” uses of the ∧ tuple type former were distinguished.

Hannan and Hicks have previously introduced the arity use-site optimization under the name “higher-order uncurrying” [18] as a type-directed analysis on a source language. They also separately introduced an optimisation called “higher-order arity raising” [19] which attempts to unpack tuple arguments where possible – this is a generalisation of the existing worker/wrapper transformations GHC currently does for strict product parameters. However, their analyses only consider a strict language, and in the case of uncurrying do not try to distinguish between cheap and expensive computation in the manner we propose above. Leroy et al. [20] demonstrated a verified version of the framework which operates by coercion insertion, which is similar to our worker/wrapper approach.

8. Conclusions and further work

In this paper we have described what we believe to be an interesting point in the design space of compiler intermediate languages. By making information about a function's calling convention totally explicit in the intermediate language type system, we expose it to the optimiser – in particular we allow optimisation of decisions about function arity. A novel concept – n-ary thunks – arose naturally from the process of making calling conventions explicit, and this in turn allows at least one novel and previously-inexpressible optimisation (deep unboxing) to be expressed.

The lazy λ-calculus FH we present is similar to System FC, GHC's current intermediate language. For a long time, a lazy language was, to us at least, the obvious intermediate language for a lazy source language such as Haskell – so it was rather surprising to discover that an appropriately-chosen strict calculus seems to be in many ways better suited to the task!

However, it still remains to implement the language in GHC and gain practical experience with it. In particular, we would like to obtain some quantitative evidence as to whether purely static arity dispatch leads to improved runtimes compared to a dynamic consideration of the arity of a function such as GHC implements at the moment. A related issue is pinning down the exact details of how a hybrid dynamic/static dispatch scheme would work, and how to implement it without causing code bloat from the extra checks. We anticipate that we can reuse existing technology from our experience with the STG machine [21] to do this.

Although we have presented, by way of examples, a number of compiler optimisations that are enabled or put on a firmer footing by the use of the new intermediate language, we have not provided any details about how a compiler would algorithmically decide when and how to apply them. In particular, we plan to write a paper fully elucidating the details of the two arity optimisations (Section 6.2 and Section 6.1) in a lazy language and reporting on our practical experience of their effectiveness.

There are a number of interesting extensions to the intermediate language that would allow us to express even more optimisations. We are particularly interested in the possibility of using some features of the ΠΣ language [22] to allow us to express even more optimisations in a typed manner. In particular, adding unboxed Σ types would address an asymmetry between function argument and result types in Strict Core – binders may not currently appear to the right of a function arrow. They would also allow us to express unboxed existential data types (including function closures, should we wish) and GADTs. Another ΠΣ feature – types that can depend on “tags” – would allow us to express unboxed sum types, but the implications of this feature for the garbage collector are not clear.

We would like to expose the ability to use “strict” types to the compiler user, so Haskell programs can, for example, manipulate lists of strict integers ([ !Int ]). Clean [23] has long supported strictness annotations at the top level of type declarations (which have a straightforward transformation into Strict Core), but allowing strictness annotations to appear in arbitrary positions in types appears to require ad-hoc polymorphism, and it is not obvious how to go about exposing the extra generality in the source language in a systematic way.
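For context, a hedged sketch of what such a type would buy: in Haskell today a “list of strict Ints” can only be approximated with a dedicated, monomorphic data type whose element field is strict, which is exactly the kind of ad-hoc duplication a generic [ !Int ] would avoid. The names below are ours.

data StrictIntList = SNil | SCons !Int StrictIntList

-- Elements are forced when the cons cell is built, so this sum runs in constant space
-- without explicit seq or bang patterns at the use sites.
sumStrict :: StrictIntList -> Int
sumStrict SNil         = 0
sumStrict (SCons x xs) = x + sumStrict xs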

Acknowledgments

This work was partly supported by a PhD studentship generously provided by Microsoft Research. We would like to thank Paul Blain Levy for the thought-provoking talks and discussions he gave while visiting the University of Cambridge which inspired this work. Thanks are also due to Duncan Coutts, Simon Marlow, Douglas McClean, Alan Mycroft, Dominic Orchard, Josef Svenningsson and the anonymous reviewers for their helpful comments and suggestions.

References

[1] A. P. Black, M. Carlsson, M. P. Jones, D. Kieburtz, and J. Nordlander. Timber: a programming language for real-time embedded systems. Technical Report CSE-02-002, Oregon Health & Science University, 2002.

[2] S. Marlow and S. Peyton Jones. How to make a fast curry: push/enter vs eval/apply. In International Conference on Functional Programming, pages 4–15, September 2004.

[3] J. Launchbury. A natural semantics for lazy evaluation. In Principles of Programming Languages, pages 144–154. ACM, January 1993.

[4] S. Peyton Jones, W. D. Partain, and A. Santos. Let-floating: moving bindings to give faster programs. In International Conference on Functional Programming, 1996.

[5] S. Peyton Jones and J. Launchbury. Unboxed values as first class citizens in a non-strict functional language. In Functional Programming Languages and Computer Architecture, pages 636–666. Springer, 1991.

[6] M. Sulzmann, M. Chakravarty, S. Peyton Jones, and K. Donnelly. System F with type equality coercions. In ACM SIGPLAN International Workshop on Types in Language Design and Implementation (TLDI'07). ACM, 2007.

[7] J. Girard. The system F of variable types, fifteen years later. Theoretical Computer Science, 45(2):159–192, 1986.

[8] S. Peyton Jones and A. Santos. A transformation-based optimiser for Haskell. Science of Computer Programming, 32(1–3):3–47, September 1998.

[9] A. Gill and G. Hutton. The worker/wrapper transformation. Journal of Functional Programming, 19(2):227–251, March 2009.

[10] C. Baker-Finch, K. Glynn, and S. Peyton Jones. Constructed product result analysis for Haskell. Journal of Functional Programming, 14(2):211–245, 2004.

[11] N. Benton, A. Kennedy, and G. Russell. Compiling Standard ML to Java bytecodes. In International Conference on Functional Programming, pages 129–140, New York, NY, USA, 1998. ACM.

[12] M. Serrano and P. Weis. Bigloo: A portable and optimizing compiler for strict functional languages. In International Symposium on Static Analysis, pages 366–381, London, UK, 1995. Springer-Verlag.

[13] A. Appel. Compiling with Continuations. Cambridge University Press, 1992.

[14] S. Peyton Jones, M. Shields, J. Launchbury, and A. Tolmach. Bridging the gulf: a common intermediate language for ML and Haskell. In Principles of Programming Languages, pages 49–61, New York, NY, USA, 1998. ACM.

[15] U. Boquist. Code Optimisation Techniques for Lazy Functional Languages. PhD thesis, Chalmers University of Technology, April 1999.

[16] K. Faxen. Flow Inference, Code Generation, and Garbage Collection for Lazy Functional Languages. PhD thesis, KTH Royal Institute of Technology, June 1997.

[17] B. Rudiak-Gould, A. Mycroft, and S. Peyton Jones. Haskell is not not ML. In European Symposium on Programming, 2006.

[18] J. Hannan and P. Hicks. Higher-order uncurrying. Higher Order Symbolic Computation, 13(3):179–216, 2000.

[19] J. Hannan and P. Hicks. Higher-order arity raising. In International Conference on Functional Programming, pages 27–38, New York, NY, USA, 1998. ACM.

[20] Z. Dargaye and X. Leroy. A verified framework for higher-order uncurrying optimizations. March 2009.

[21] S. Peyton Jones. Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. Journal of Functional Programming, 2:127–202, April 1992.

[22] T. Altenkirch and N. Oury. PiSigma: A core language for dependently typed programming. 2008.

[23] T. Brus, M. van Eekelen, M. van Leer, and M. Plasmeijer. Clean — a language for functional graph rewriting. Functional Programming Languages and Computer Architecture, pages 364–384, 1987.


Losing Functions without Gaining Data – another look at defunctionalisation

Neil Mitchell ∗

University of York, [email protected]

Colin Runciman
University of York, [email protected]

Abstract

We describe a transformation which takes a higher-order program, and produces an equivalent first-order program. Unlike Reynolds-style defunctionalisation, it does not introduce any new data types, and the results are more amenable to subsequent analysis operations. We can use our method to improve the results of existing analysis operations, including strictness analysis, pattern-match safety and termination checking. Our transformation is implemented, and works on a Core language to which Haskell programs can be reduced. Our method cannot always succeed in removing all functional values, but in practice is remarkably successful.

Categories and Subject Descriptors D.3 [Software]: Programming Languages

General Terms Languages

Keywords Haskell, defunctionalisation, firstification

1. Introduction

Higher-order functions are widely used in functional programming languages. Having functions as first-class values leads to more concise code, but it often complicates analysis methods, such as those for checking pattern-match safety (Mitchell and Runciman 2008) or termination (Sereni 2007).

Example 1

Consider this definition of incList:

incList :: [Int] → [Int]
incList = map (+1)

map :: (α → β) → [α] → [β]
map f [ ]      = [ ]
map f (x : xs) = f x : map f xs

The definition of incList has higher-order features. The expression (+1) is passed as a functional argument to map. The incList definition contains a partial application of map. The use of first-class functions has led to short code, but we could equally have written:

∗ This work was done while the first author was supported by an EPSRC PhD studentship.

incList :: [Int] → [Int]
incList [ ]      = [ ]
incList (x : xs) = x + 1 : incList xs

Although this first-order variant of incList is longer (excluding the library function map), it is also more amenable to certain types of analysis. The method presented in this paper transforms the higher-order definition into the first-order one automatically. ¤

Our defunctionalisation method processes the whole program to remove functional values, without changing the semantics of the program. This idea is not new. As far back as 1972 Reynolds gave a solution, now known as Reynolds-style defunctionalisation (Reynolds 1972). Unfortunately, this method effectively introduces a mini-interpreter, which causes problems for analysis tools. Our method produces a program closer to what a human might have written, if denied the use of functional values.
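For comparison, here is a hedged sketch of what Reynolds-style defunctionalisation (not the method of this paper) does to the incList example: each lambda in the source becomes a constructor of a new data type, and a mini-interpreter apply dispatches on that tag. The names Func, apply and mapF are ours.

data Func = PlusOne                 -- one constructor per lambda in the source program

apply :: Func -> Int -> Int         -- the introduced mini-interpreter
apply PlusOne x = x + 1

mapF :: Func -> [Int] -> [Int]      -- map with its functional argument encoded as a tag
mapF _ []       = []
mapF f (x : xs) = apply f x : mapF f xs

incList :: [Int] -> [Int]
incList = mapF PlusOne

It is the indirection through apply, and the new data type Func, that complicates subsequent first-order analyses.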

There are two significant limitations to our method:

1. The transformation can reduce sharing, causing the resulting program to be less efficient. Therefore our defunctionalisation method is not appropriate as a stage in compilation. But it works well when used as a preliminary stage in program analysis, effectively making first-order analyses applicable to higher-order programs: examples include analyses for safe pattern-matching and for termination.

2. The transformation is not complete. In some programs there may be residual higher-order expressions. However, the possible occurrences of such residual expressions can be characterised, and mild restrictions guarantee first-order results. In practice, our method is very often completely successful: for example, defunctionalisation is complete for over 90% of the nofib benchmark programs.

Our method has been implemented in Haskell (Peyton Jones 2003), and operates over the Core language from the York Haskell Compiler (Golubovsky et al. 2007). We have used our transformation within the Catch analysis tool (Mitchell and Runciman 2008), which checks for potential pattern-match errors in Haskell. Catch is a first-order analysis, and without a defunctionalisation method we wouldn't be able to apply Catch to real programs.

1.1 Contributions

Our paper makes the following contributions:

• We define a defunctionalisation method which, unlike some previous work, does not introduce new data types (§3, §4). Our method makes use of standard transformation steps, but with precise restrictions on their applicability.


expr := λv → x              lambda abstraction
      | f xs                 function application
      | c xs                 constructor application
      | x xs                 general application
      | v                    variable
      | let v = x in y       non-recursive let expression
      | case x of alts       case expression

alt  := c vs → x            case alternative

arityExpr [[λv → x]] = 1 + arityExpr x
arityExpr _          = 0

We let v range over locally defined variables, x and y over expressions, f over top-level function names and c over constructors.

Figure 1. Core Language.

• We show where higher-order elements may remain in a resultant program, and show that given certain restrictions we guarantee a first-order result (§6).

• We identify restrictions which guarantee termination, but are not overly limiting (§7).

• We have implemented our method, and present measured results for much of the nofib benchmark suite (§8). Our method can deal with the complexities of a language like Haskell, including type classes, programs using continuation-passing style and monads.

• We show how to apply our results to existing analysis tools, using GHC's strictness analysis and Agda's termination checker as examples (§9).

2. Core Language

Our Core language is both pure and lazy. The expression type is given in Figure 1. A program is a mapping of function names to expressions, with a root function named main. The arity of a function is the result of applying arityExpr to its associated expression. We initially assume there are no primitive functions in our language, but explain how to extend our method to deal with them in §4.5. We allow full Haskell 98 data types, assuming a finite number of different constructors, each with a fixed arity.

The variable, case, application and lambda expressions are much as they would be in any Core language. We restrict ourselves to non-recursive let expressions. (Any recursive let expressions can be removed, with a possible increase in runtime complexity, using the methods described in (Mitchell 2008).) The constructor expression consists of a constructor and a list of expressions, exactly matching the arity of the constructor. (Any partially applied constructor can be represented using a lambda expression.) A function application consists of a function name and a possibly empty list of argument expressions. If a function is given fewer arguments than its arity we refer to it as partially-applied; if the arguments match the arity it is fully-applied; and if there are more arguments than the arity it is over-applied. We use the meta functions arity f and body f to denote the arity and body of function f. We use the function rhs to extract the expression on the right of a case alternative. We define the syntactic sugar f v = x to be equivalent to f = λv → x.
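As a hedged illustration of the grammar in Figure 1, the Core expression type might be rendered in Haskell along the following lines; the constructor and type names are ours, not those of the implementation.

type Var = String
type Fun = String
type Con = String

data Expr
  = Lam Var Expr          -- λv → x
  | FunApp Fun [Expr]     -- f xs  (top-level function application)
  | ConApp Con [Expr]     -- c xs  (saturated constructor application)
  | GenApp Expr [Expr]    -- x xs  (general application)
  | Variable Var          -- v
  | Let Var Expr Expr     -- let v = x in y  (non-recursive)
  | Case Expr [Alt]       -- case x of alts

data Alt = Alt Con [Var] Expr   -- c vs → x

arityExpr :: Expr -> Int
arityExpr (Lam _ x) = 1 + arityExpr x
arityExpr _         = 0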

We assume that all Core programs are type correct. In particular we assume that when a program is evaluated a constructor application will never be the first argument of a general application, and a lambda expression will never be the subject of a case expression.

All our transformations are semantics preserving, so maintain these two invariants.

Definition: A program is higher-order if it contains expressions which create or use functional values. An expression creates a functional value if it is a partially-applied function or a lambda expression which does not contribute to the arity of a function definition. An expression uses a functional value if it is an over-applied function or a general application. ¤

Example 1 (revisited)

The original definition of incList is higher-order because it creates functional values with the partial applications of both map and (+). The original definition of map is higher-order because it uses functional values within a general application. In the defunctionalised version, the program is first-order. ¤

3. Our First-Order Reduction Method

Our method works by applying a set of rules non-deterministically until no further rules apply. The rules are grouped into three categories:

Simplification: Many local simplification rules are used, most of which may be found in any optimising compiler (Peyton Jones and Santos 1994).

Inlining: Inlining is a standard technique in optimising compilers (Peyton Jones and Marlow 2002), and has been studied in depth. Inlining involves replacing an application of a function with the body of the function.

Specialisation: Specialisation is another standard technique, used to remove type classes (Jones 1994) and more recently to specialise functions to a given constructor (Peyton Jones 2007). Specialisation involves generating a new function specialised with information about the function's arguments.

Each transformation has the possibility of removing some functional values, but the key contribution of this paper is how they can be used together – including which restrictions are necessary.

We proceed by first giving a brief flavour of how these transformations may be used in isolation to remove functional values. We then discuss the transformations in detail in §4.

3.1 Simplification

The simplification rules have two purposes: to remove some simple functional values, and to ensure a normal form so other rules can apply. The simplification rules are simple, and many are found in optimising compilers. All the rules are given in §4.1.

Example 2

one = (λx → x) 1

The simplification rule (lam-app) transforms this function to:

one = let x = 1 in x ¤

Other rules do not eliminate lambda expressions, but put them into a form that other rules can remove.

Example 3

even = let one = 1
       in λx → not (odd x)

The simplification rule (let-lam) lifts the lambda outside of the let expression.

even = λx → let one = 1
            in not (odd x)


In general this transformation may cause duplicate computation to be performed, an issue we return to in §4.1.2. ¤

3.2 Inlining

We use inlining to remove functions which return data constructors containing functional values. A frequent source of data constructors containing functional values is the dictionary implementation of type classes (Wadler and Blott 1989).

Example 4

main = case eqInt of
         (a, b) → a 1 2

eqInt = (primEqInt, primNeqInt)

Both components of the eqInt pair, primEqInt and primNeqInt, are functional values. We can start to remove these functional values by inlining eqInt:

main = case (primEqInt, primNeqInt) of
         (a, b) → a 1 2

The simplification rules can now make the program first-order, using the rule (case-con) from §4.1.

main = primEqInt 1 2 ¤

3.3 Specialisation

We use specialisation to remove lambda expressions that are arguments of function applications. Specialisation creates alternative function definitions where some information is known about the arguments. In effect, some arguments are passed at transformation time.

Example 5

notList xs = map not xs

Here the map function takes the functional value not as its first argument. We can create a variant of map specialised to this argument:

map_not x = case x of
              [ ]    → [ ]
              y : ys → not y : map_not ys

notList xs = map_not xs

The recursive call in map is replaced by a recursive call to the specialised variant. We have now eliminated all functional values. ¤

3.4 Goals

We define a number of goals: some are essential, and others are desirable. If essential goals make desirable goals unachievable in full, we still aim to do the best we can. Essential goals are either necessary to combine our transformation with an analysis, or significantly simplify any subsequent analysis.

Essential

Preserve the result computed by the program. By making use of established transformations, total correctness is relatively easy to show.

Ensure the transformation terminates. The issue of termination is much harder. Both inlining and specialisation could be applied in ways that diverge. In §7 we develop a set of criteria to ensure termination.

Recover the original program. Our transformation is designed to be performed before analysis. It is important that the results of the analysis can be presented in terms of the original program. We need a method for transforming expressions in the resultant program into equivalent expressions in the original program.

Introduce no data types. Reynolds' method introduces a new data type that serves as a representation of functions, then embeds an interpreter for this data type into the program. We aim to eliminate the higher-order aspects of a program without introducing any new data types. By not introducing any data types we avoid introducing an interpreter, which can be a bottleneck for subsequent analysis. By composing our transformation out of existing transformations, none of which introduces data types, we can easily ensure that our transformation does not introduce data types.

Desirable

Remove all functional values. We aim to remove as many functional values as possible. In §6 we make precise where functional values may appear in the resultant programs. If a totally first-order program is required, Reynolds' method can always be applied after our transformation. Applying our method first will cause Reynolds' method to introduce fewer additional data types and generate a smaller interpreter.

Preserve the space/sharing behaviour of the program. In the expression let y = f x in y + y, according to the rules of lazy evaluation, f x will be evaluated at most once. It is possible to inline the let binding to give f x + f x, but this expression evaluates f x twice. This transformation is valid in Haskell due to referential transparency, and will preserve both semantics and termination, but may increase the amount of work performed. In an impure or strict language, such as ML (Milner et al. 1997), this transformation may change the semantics of the program.

Our goals are primarily for analysis of the resultant code, not to compile and execute the result. Because we are not interested in performance, we permit the loss of sharing in computations if to do so will remove functional values. However, we will avoid the loss of sharing where possible, so the program remains closer to the original.

Minimize the size of the program. A smaller program is likely to be faster for any subsequent analysis. Previous work has speculated that there may be a substantial increase in code-size after defunctionalisation (Chin and Darlington 1996).

Make the transformation fast. The implementation must be sufficiently fast to permit proper evaluation. Ideally, when combined with a subsequent analysis phase, the defunctionalisation should not take an excessive proportion of the runtime.

4. Method in Detail

This section gives a set of rules, all of which are applied non-deterministically, until no further rules apply. Many programs require a combination of rules to be applied; for example, the initial incList example requires simplification and specialisation rules.

We have implemented our steps in a monadic framework to deal with issues such as obtaining unique free variables and tracking termination constraints. But to simplify the presentation here, we ignore these issues – they are mostly tedious engineering concerns, and do not affect the underlying algorithm.
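To give a flavour of the bookkeeping involved, here is a hedged sketch of a fresh-name supply written with a State monad (assuming the mtl package); the real framework additionally threads the termination constraints of §7, and the names Fresh and freshVar are ours.

import Control.Monad.State

type Fresh a = State Int a

-- Draw the next unused variable name from a counter.
freshVar :: Fresh String
freshVar = do
  n <- get
  put (n + 1)
  return ("v" ++ show n)

-- For example, evalState (replicateM 3 freshVar) 1 yields ["v1","v2","v3"].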

4.1 Simplification

The simplification rules aim to move lambda expressions upwards, and introduce lambdas for partially applied functions. The rules include the standard simplification rules given in Figure 2, which are found in most optimising compilers, such as GHC (Peyton Jones and Santos 1994).


(x xs) ys ⇒ x xs ys                                            (app-app)

(f xs) ys ⇒ f xs ys                                            (fun-app)

(λv → x) y ⇒ let v = y in x                                    (lam-app)

(let v = x in y) z ⇒ let v = x in y z                          (let-app)

(case x of {p1 → y1; . . .; pn → yn}) z
  ⇒ case x of {p1 → y1 z; . . .; pn → yn z}                    (case-app)

case c xs of {. . .; c vs → y; . . .} ⇒ let vs = xs in y       (case-con)

case (let v = x in y) of alts ⇒ let v = x in (case y of alts)  (case-let)

case (case x of {. . .; c vs → y; . . .}) of alts
  ⇒ case x of {. . .; c vs → case y of alts; . . .}            (case-case)

case x of {. . .; c vs → λv → y; . . .}
  ⇒ λz → case x of {. . . z; c vs → (λv → y) z; . . . z}       (case-lam)

f xs ⇒ λv → f xs v   where arity f > length xs                 (eta)

Figure 2. Standard Core simplification rules.

let v = (λw → x) in y ⇒ y [λw → x / v]                         (bind-lam)

let v = x in y ⇒ y [x / v]
  where x is a boxed lambda (see §4.2)                          (bind-box)

let v = x in λw → y ⇒ λw → let v = x in y                      (let-lam)

Figure 3. Lambda Simplification rules.

The (app-app) and (fun-app) rules are a consequence of our application expressions taking a list of arguments. We also make use of additional rules which deal specifically with lambda expressions, given in Figure 3. All of the simplification rules are correct individually. The rules are applied to any subexpression, as long as any rule matches. We believe that the combination of rules from Figures 2 and 3 is confluent.

4.1.1 Lambda Introduction

The (eta) rule inserts lambdas in preference to partial applications, using η-expansion. For each partially applied function, a lambda expression is inserted to ensure that the function is given at least as many arguments as its associated arity.

Example 6

(◦) f g x = f (g x)

even = (◦) not odd

Here the function applications of (◦), not and odd are all partially applied. Three lambda expressions can be inserted using the (eta) rule:

even = λx → (◦) (λy → not y) (λz → odd z) x

Now all three function applications are fully-applied. The (eta) rule replaces partial application with lambda expressions, making functional values more explicit, which permits other transformations. ¤

In Haskell, unrestricted η-expansion is not correct as the seq primitive allows ⊥ to be distinguished from λv → ⊥. However, our (eta) rule only transforms applications of partially-applied functions, which must evaluate to lambda abstractions. Therefore our (eta) rule is similar to replacing λv → x with λw → (λv → x) w – a transformation that is correct even allowing for seq.
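A small illustration (not from the paper) of why seq makes unrestricted η-expansion unsound; the names bot and etaBot are ours.

bot :: Int -> Int
bot = undefined          -- ⊥ at a function type

etaBot :: Int -> Int
etaBot = \v -> bot v     -- its η-expansion: a lambda, already in WHNF

-- seq bot () raises the undefined exception, whereas seq etaBot ()
-- evaluates to (), so the two are distinguishable in Haskell.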

4.1.2 Lambda Movement

The (bind-lam) rule inlines a lambda bound in a let expression. The (let-lam) rule can be responsible for a reduction in sharing:

Example 7

f x = let i = expensive x
      in λj → i + j

main xs = map (f 1) xs

Here (expensive 1) is computed at most once. Every application of the functional argument within map performs a single (+) operation. After applying the (let-lam) rule we obtain:

f x = λj → let i = expensive x
           in i + j

Now (expensive 1) is recomputed for every element in xs. We include this rule in our transformation, focusing on functional value removal at the expense of sharing. ¤

4.2 Inlining

We use inlining of top-level functions as the first stage in the removal of functional values stored within a constructor – for example Just (λx → x). To eliminate a functional value stored inside a constructor we eliminate the containing constructor by making it the subject of a case expression and using the (case-con) rule. We move the constructor towards the case expression using inlining.


isBox [[c xs]]            = any isLambda xs ∨ any isBox xs
isBox [[let v = x in y]]  = isBox y
isBox [[case x of alts]]  = any (isBox ◦ rhs) alts
isBox [[f xs]]            = isBox (fromLambda (body f))
isBox _                   = False

fromLambda [[λv → x]] = fromLambda x
fromLambda x          = x

isLambda [[λv → x]] = True
isLambda _          = False

The isBox function as presented may not terminate. Any non-terminating evaluation can be easily detected (by remembering which function bodies have been examined) and is defined to be False.

Figure 4. The isBox function, to test if an expression is a boxed lambda.

Definition: An expression e is a boxed lambda iff isBox e ≡ True, where isBox is defined as in Figure 4. A boxed lambda evaluates to a functional value inside a constructor. ¤

Example 8

Recalling that [e] is shorthand for (:) e [ ], where (:) is the cons constructor, the following expressions are boxed lambdas:

[λx → x]
Just [λx → x]
let y = 1 in [λx → x]
[Nothing, Just (λx → x)]

The following are not boxed lambdas:

λx → [x]
[id (λx → x)]
id [λx → x]
let v = [λx → x] in v

The final three expressions all evaluate to a boxed lambda, but are not themselves boxed lambdas. ¤
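A hedged, self-contained Haskell sketch of isBox follows, including the loop guard mentioned in Figure 4: function bodies already examined are remembered in a set, and revisiting one is treated as False. The Expr type here is a simplification (general applications are omitted) and all names are ours, not the implementation's.

import qualified Data.Map as Map
import qualified Data.Set as Set

type Var = String
type Fun = String
type Con = String

data Expr
  = Lam Var Expr
  | FunApp Fun [Expr]
  | ConApp Con [Expr]
  | Variable Var
  | Let Var Expr Expr
  | Case Expr [(Con, [Var], Expr)]

type Program = Map.Map Fun Expr

isBox :: Program -> Expr -> Bool
isBox prog = go Set.empty
  where
    go seen e = case e of
      ConApp _ xs -> any isLambda xs || any (go seen) xs
      Let _ _ y   -> go seen y
      Case _ alts -> any (\(_, _, rhs) -> go seen rhs) alts
      FunApp f _
        | f `Set.member` seen -> False   -- body already examined: treat the loop as False
        | otherwise ->
            case Map.lookup f prog of
              Just b  -> go (Set.insert f seen) (fromLambda b)
              Nothing -> False           -- unknown body, e.g. a primitive
      _           -> False

    fromLambda (Lam _ x) = fromLambda x
    fromLambda x         = x

    isLambda (Lam _ _) = True
    isLambda _         = False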

If a boxed lambda is bound in a let expression, we substitute the let binding, using the (bind-box) rule from Figure 3. We only inline a function if two conditions both hold: (1) the body of the function definition is a boxed lambda; (2) the function application occurs as the subject of a case expression.

Definition: The inlining transformation is specified by:

case (f xs) of alts
  ⇒ case (y xs) of alts
  where y = body f

if isBox (f xs) evaluates to True ¤

As with the simplification rules, there may be some loss of sharing if the definition being inlined has arity 0 – a constant applicative form (CAF). A Haskell implementation computes these expressions at most once, and reuses their value as necessary. If they are inlined, this sharing will be lost.

4.3 Specialisation

For each application of a top-level function in which at least one argument has a lambda subexpression, a specialised variant is created, and used where applicable. The process follows the same pattern as constructor specialisation (Peyton Jones 2007), but applies where function arguments are lambda expressions, rather than known constructors. Examples of common functions whose applications can usually be made first-order by specialisation include map, filter, foldr and foldl.

The specialisation transformation makes use of templates. A template is an expression where some subexpressions are omitted, denoted by the • symbol. The process of specialisation proceeds as follows:

1. Find all function applications which need specialising, and generate templates (see §4.3.1).

2. Abstract templates, replacing some subexpressions with • (see §4.3.2).

3. For each template, generate a function definition specialised to that template (see §4.3.3).

4. For each expression matching a template, replace it with the generated function (see §4.3.4).

Example 9

main xs = map (λx → x) xs

map f xs = case xs of
             [ ]    → [ ]
             y : ys → f y : map f ys

Specialisation first finds the application of map in main, and generates the template map (λx → x) xs. Next it abstracts the template to map (λx → x) •. It then generates a unique name for the template (we choose map_id), and generates an appropriate function body. Next all calls matching the template are replaced with calls to map_id, including the call to map within the freshly generated map_id.

main xs = map_id xs

map_id v1 = let xs = v1
            in case xs of
                 [ ]    → [ ]
                 y : ys → y : map_id ys

The resulting code is first-order. ¤

4.3.1 Generating Templates

The idea is to generate templates for all function applications which pass functional values. Given an expression e, a template is generated if: (1) e is a function application; and (2) at least one of the subexpressions of e is either a lambda or a boxed lambda (see §4.2). In all cases, the template generated is simply e.

Example 10

The following expressions generate templates:

id (λx → x)
map f [λx → x]
id (Just (λx → x + 1))
f (λv → v) True ¤

4.3.2 Abstracting Templates

We perform abstraction to reduce the number of different templates required, by replacing non-functional expressions with •. For each subexpression e in a template, it can be replaced with • if the following two conditions hold:


1. e is not, and does not contain, any expressions which are either lambda expressions or boxed lambdas, e.g. we cannot substitute • for (λx → x) or (let y = λx → x in y).

2. None of the free variables in e are bound in the template, e.g. we cannot replace the expression f v with • in (let v = 1 in f v), as the variable v is bound within the template.

Example 11

Template                         Abstract Template
id (λx → x)                      id (λx → x)
id (Just (λx → x))               id (Just (λx → x))
id (λx → x : xs)                 id (λx → x : •)
id (λx → let y = 12 in 4)        id (λx → •)
id (λx → let y = 12 in x)        id (λx → let y = • in x)

In all these examples, the id function has an argument which has a lambda expression as a subexpression. In the last three cases, there are subexpressions which do not depend on variables bound by the lambda – these have been removed and replaced with •. ¤

4.3.3 Generating Functions

To generate a function from a template, we first pick a unique name for the new function. We replace each • in the template with a unique fresh variable, then inline the outer function symbol. The body of the new function is the modified template, contained within lambda abstractions introducing each fresh variable used. If a previous specialisation has already generated a function for this template, we reuse the previous function.

Example 9 (revisited)

Consider the template map (λx → x) •. Let v1 be the fresh variable for the single • placeholder, and map_id be the function name:

map_id = λv1 → map (λx → x) v1

We inline the outer function symbol (map):

map_id = λv1 → (λf → λxs → case xs of
                              [ ]    → [ ]
                              y : ys → f y : map f ys)
               (λx → x) v1

After the simplification rules from Figure 3, we obtain:

map_id = λv1 → let xs = v1
               in case xs of
                    [ ]    → [ ]
                    y : ys → y : map (λx → x) ys ¤

4.3.4 Using Templates

An expression e, matching an existing template t, can be replaced by a call to the function generated from t. All subexpressions in e which match up with • in t are passed as arguments.

Example 9 (continued)

map_id = λv1 → let xs = v1
               in case xs of
                    [ ]    → [ ]
                    y : ys → y : map_id ys

We now have a first-order definition. ¤

4.4 Confluence

The transformations we have presented are not confluent. Consider the expression id ((λx → x) 1). We can either apply specialisation, or the (lam-app) rule. The first will involve the creation of an additional function definition, while the second will not.

We conjecture that the rules in each of the separate categories are confluent. In order to ensure a deterministic application of the rules we always favour rules first from the simplification stage, then the inlining stage, and finally the specialisation stage. By choosing the above order, we reduce the generation of auxiliary top-level functions, which should lead to a simpler result.
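One way to picture this staged ordering is as a driver that only falls through to the next stage when the current one offers no rule; the hedged sketch below is ours, assuming each stage function returns Nothing when none of its rules applies.

firstOrder :: (p -> Maybe p) -> (p -> Maybe p) -> (p -> Maybe p) -> p -> p
firstOrder simplify inline specialise = go
  where
    go prog =
      case simplify prog of            -- simplification rules take priority
        Just prog' -> go prog'
        Nothing ->
          case inline prog of          -- then inlining
            Just prog' -> go prog'
            Nothing ->
              case specialise prog of  -- and only then specialisation
                Just prog' -> go prog'
                Nothing    -> prog     -- no rule applies: we are done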

4.5 Primitive Functions

Primitive functions do not have an associated body, and therefore cannot be examined or inlined. We make two simple changes to support primitives.

1. We define that a primitive application is not a boxed lambda, and has an arity derived from its type.

2. We restrict specialisation so that if the function to be specialised is a primitive, no template is generated. This restriction is necessary because specialisation requires inlining the function, which is not possible for a primitive.

These restrictions mean that some programs using primitive functions cannot be made first-order.

Example 12

main = seq (λx → x) 42

Here a functional value is passed as the first argument to the primitive seq. As we are not able to peer inside the primitive, and must preserve its interface, we cannot remove this functional value. For most primitives, such as arithmetic operations, the types ensure that no functional values are passed as arguments. However, the seq primitive is of type α → β → β, allowing any type to be passed as either of the arguments, including functional values.

Some primitives not only permit functional values, but actually require them. For example, the primCatch function within the Yhc standard libraries implements the Haskell exception handling function catch. The type of primCatch is α → (IOError → α) → α, taking an exception handler as one of the arguments. ¤

4.6 Recovering Input Expressions

Specialisation is the only rule which introduces new function names. In order to translate an expression in the output program to an equivalent expression in the input program, it is sufficient to replace all generated function names with their associated template, supplying all the necessary variables.

5. Examples

We now give two examples. Our method can convert the first example to a first-order equivalent, but not the second.

Example 13 (Inlining Boxed Lambdas)

An earlier version of our defunctionaliser inlined boxed lambdas everywhere they occurred. Inlining boxed lambdas means the isBox function does not have to examine the body of applied functions, and is therefore simpler. However, it was unable to cope with programs like this one:

main = map (λx → x 1) gen
gen  = (λx → x) : gen

The gen function is both a boxed lambda and recursive. If we inlined gen initially the method would not be able to remove all lambda expressions. By first specialising map with respect to gen, and waiting until gen is the subject of a case, we are able to remove the functional values. This operation is effectively deforestation (Wadler 1988), which also only performs inlining within the subject of a case. ¤

Example 14 (Functional Lists)

Sometimes lambda expressions are used to build up lists which can have elements concatenated onto the end. Using Hughes lists (Hughes 1986), we can define:

nil       = id
snoc x xs = λys → xs (x : ys)
list xs   = xs [ ]

This list representation provides nil as the empty list, but instead of providing a (:) or “cons” operation, it provides snoc, which adds a single element on to the end of the list. The function list is provided to create a standard list. We are unable to defunctionalise such a construction, as it stores unbounded information within closures. We have seen such constructions in both the lines function of the HsColour program, and the sort function of Yhc. However, there is an alternative implementation of these functions:

nil  = [ ]
snoc = (:)
list = reverse

We have benchmarked these operations in a variety of settings and the list-based version appears to use approximately 75% of the memory, and 65% of the time, required by the function-based solution. ¤
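For readers who want to try both styles, here is a hedged, runnable rendering of Example 14 in ordinary Haskell; the F/L suffixes are ours, and the types are more polymorphic than the monomorphic setting of the example.

-- Function-based (Hughes-style) snoc lists:
nilF :: [a] -> [a]
nilF = id

snocF :: a -> ([a] -> [a]) -> ([a] -> [a])
snocF x xs = \ys -> xs (x : ys)

listF :: ([a] -> [a]) -> [a]
listF xs = xs []

-- First-order alternative: build the list in reverse and reverse it at the end.
nilL :: [a]
nilL = []

snocL :: a -> [a] -> [a]
snocL = (:)

listL :: [a] -> [a]
listL = reverse

-- Both components of demo evaluate to [1,2,3].
demo :: ([Int], [Int])
demo = ( listF (snocF 3 (snocF 2 (snocF 1 nilF)))
       , listL (snocL 3 (snocL 2 (snocL 1 nilL))) )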

6. Restricted Completeness

Our method would be complete if it made all programs first-order. In this section we give three conditions which, if met, ensure a program can be made first-order. In doing so, we hope to show that no obvious rule is missing.

6.1 Proposition

After transformation, there will be no partial applications, and all lambda expressions will either contribute to the arity of a function definition or be unreachable (never be evaluated at runtime), provided:

1. The termination criteria do not curtail defunctionalisation (see §7).

2. No primitive function receives a functional argument, nor returns a functional result.

3. The main function has a type that ensures it neither receives a functional argument, nor returns a functional result.

We prove this proposition with a series of lemmas about the resultant program.

6.2 Lemmas

We define the root of a function to be its body after applying the fromLambda function from Figure 4. We define a higher-order lambda (HO lambda) to be a lambda expression that does not contribute to the arity of a function definition.

Lemma: No partial applications

The (eta) rule removes partial application, and at the end of the transformation, no further rules apply – therefore there can be no partial applications in the resultant program. ¤

Lemma: The first argument of a general application must be a variable

The rules (app-app), (fun-app), (lam-app), (let-app) and (case-app) mean the first argument to a general application must be a variable or a constructor application. All constructor applications are fully applied, and therefore cannot return a functional value, so type safety ensures they cannot be the first argument of an application. Therefore, the first argument of an application is a variable. ¤

Lemma: A HO lambda may only occur in the following places: inside a HO lambda; as an argument to an application or a constructor

A lambda cannot be the subject of a case expression as it would not be well typed. A lambda cannot be an argument to a function as it would be removed by specialisation. All other possible lambda positions are removed by the rules (lam-app), (case-lam), (bind-lam) and (let-lam). ¤

Lemma: A boxed lambda may only occur in the following places: the root of a function; inside a HO lambda or boxed lambda; as an argument to an application

Using the definition of isBox from Figure 4 to ignore expressions which are themselves boxed lambdas, the only possible locations of a boxed lambda not mentioned in the lemma are the binding of a let, the subject of a case, and as an argument to a function application. We remove the binding of a let with (bind-box) and the argument to a function application with specialisation.

To remove a boxed lambda from the subject of a case we observe that a boxed lambda must be a constructor application, a let expression, a case expression or a function application. The first three are removed with the rules (case-con), (case-let) and (case-case); the final one is removed by inlining. ¤

Lemma: A boxed lambda must have a type that permits a functional value

An expression must have a type that permits a functional value if any execution, choosing any alternative in a case expression, evaluates to a functional value. The base case of a boxed lambda is a constructor application to a lambda, which is a functional value. For let and case, the type of the expression is the type of the contained boxed lambda. The remaining case is if ((λvs → b) xs) evaluates to a functional value. As b must be a boxed lambda, i.e. a constructor wrapping a lambda, any application and abstraction operations alone cannot remove the constructor, so cannot remove the functional value. ¤

Lemma: A function whose root is a boxed lambda must be called from inside a HO lambda or as the argument of an application

An application of a function whose root is a boxed lambda is itself a boxed lambda. Therefore the restrictions on where a boxed lambda can reside apply to applications of these functions. ¤

Lemma: All HO lambdas are unreachable

The main function cannot be a boxed lambda, as that would be a functional value, and is disallowed by the restrictions on main. There remain only four possible locations for HO lambdas or boxed lambdas:

1. As an argument to an application (v •).

2. As the body of a HO lambda (λv → •).

3. Contained within a boxed lambda.

4. As the root of a function definition, whose applications are boxed lambdas.

None of these constructs binds a functional value to a variable, therefore in the first case v cannot be bound to a functional value. If v is not a functional value, then type checking means that v must evaluate to ⊥, and • will never be evaluated. In the remaining three cases, the lambda or boxed lambda must ultimately be contained within an application whose variable evaluates to ⊥ – and therefore will not be evaluated. ¤


Lemma: There are no partial applications and all lambda expressions either contribute to the arity of a function definition or are unreachable

By combining the lemmas that there are no partial applications and that all HO lambdas are unreachable. ¤

It is instructive to note that during the proof every rule has been used, and that the removal of any single rule would invalidate the proof. While this does not prove that each step is necessary, it does provide a motivation for each rule.

6.3 Residual Higher-Order Programs

The following programs all remain higher-order after applying our method, although none will actually create higher-order values at runtime.

Example 15

main = bottom (λx → x)

We use the expression bottom to indicate a computation that evaluates to ⊥ – either a call to error or a non-terminating computation. The function main will evaluate to ⊥, without ever evaluating the contained lambda expression. ¤

Example 16

nothing = Nothing
main    = case nothing of
            Nothing → 1
            Just f  → f (λx → x)

In this example the lambda expression is never reached because the Just branch of the case expression is never taken. ¤

6.4 Transformation to First-Order

As a result of our proposition, provided the three restrictions are met, we can replace all lambda expressions in the resultant program which don't contribute to the arity of a function with ⊥, to give an equivalent program. In addition, any uses of functional values are guaranteed to actually be operating on ⊥, as no functional values could have been created. Another way of viewing the proposition is that after transformation the program will be first-order at runtime, even if there are expressions that create or use functional values in the source program. Therefore, the following rewrites are valid:

(λv → x) ⇒ ⊥                      if not contributing to the arity of a function
x xs     ⇒ x
f xs     ⇒ f (take (arity f) xs)

After applying the (eta) rule and performing these rewrites, all programs are guaranteed to be first-order.

7. Proof of Termination

Our algorithm, as it stands, may not terminate. In order to ensure termination, it is necessary to bound both the inlining and specialisation rules. In this section we develop a mechanism to ensure termination, by first looking at how non-termination may arise.

7.1 Termination of SimplificationIn order to check the termination of the simplifier we have usedthe AProVE system (Giesl et al. 2006) to model our rules as aterm rewriting system, and check its termination. An encoding ofa simplified version of the rules from Figures 2 and 3 is given inFigure 5. We have encoded rules by considering what type of ex-pression is transformed by a rule. For example, the rule replacing(λv → x) y with let v = y in x is expressed as a rewrite replacing

[x,y,z]app(lam(x),y) → let(y,x)app(case(x,y),z)→ case(x,app(y,z))app(let(x,y),z) → let(x,app(y,z))case(let(x,y),z) → let(x,case(y,z))case(con(x),y) → let(x,y)case(x,lam(y)) → lam(case(x,app(lam(y),var)))let(lam(x),y) → lam(let(x,y))

Figure 5. Encoding of termination simplification.

app(lam(x),y) with let(y,x). The names of binding variables within expressions have been ignored. To simplify the encoding, we have only considered applications with one argument. The rewrite rules are applied non-deterministically at any suitable location, so faithfully model the behaviour of our original rules.

The encoding of the (bind-box) and (bind-lam) rules is excluded. Given these rules, there are non-terminating sequences. For example:

   (λx → x x) (λx → x x)
⇒ -- (lam-app) rule
   let x = λx → x x in x x
⇒ -- (bind-lam) rule
   (λx → x x) (λx → x x)

Such expressions are a problem for GHC, and can cause the compiler to loop if encoded as data structures (Peyton Jones and Marlow 2002). Other transformation systems (Chin and Darlington 1996) make use of type annotations to ensure these reductions terminate. To guarantee termination, we apply (bind-lam) or (bind-box) at most n times in any definition body. If the body is altered by either inlining or specialisation, we reset the count. Currently we set n to 1000, but have never seen the count exceed 50 on a real program – it is not a problem that arises in practice.

7.2 Termination of Inlining

A standard technique to ensure termination of inlining is to refuse to inline recursive functions (Peyton Jones and Marlow 2002). For our purposes, this non-recursive restriction is too cautious as it would leave residual lambda expressions in programs such as Example 13. We first present a program which causes our method to fail to terminate, then our means of ensuring termination.

Example 17

data B α = B α
f = case f of
      B _ → B (λx → x)

The f inside the case is a candidate for inlining:

   case f of B _ → B (λx → x)
⇒ -- inlining rule
   case (case f of B _ → B (λx → x)) of B _ → B (λx → x)
⇒ -- (case-case) rule
   case f of B _ → case B (λx → x) of B _ → B (λx → x)
⇒ -- (case-con) rule
   case f of B _ → B (λx → x)

So this expression would cause non-termination. ¤

To avoid such problems, we permit inlining a function f, at all use sites within the definition of a function g, but only once per pair (f, g). In the previous example we would inline f within its own body, but only once. Any future attempts to inline f within this function would be disallowed, although f could still be inlined within other function bodies. This restriction is sufficient to ensure


termination of inlining. Given n functions, there can be at most n² inlining steps, each for possibly many application sites.
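A minimal bookkeeping sketch of this restriction (ours, with hypothetical names; the paper does not prescribe a data structure) records the (f, g) pairs already used:

import qualified Data.Set as Set

type Name = String

-- Permit inlining f inside the body of g at most once, by remembering
-- which (f, g) pairs have already been used.
mayInline :: Name -> Name -> Set.Set (Name, Name) -> (Bool, Set.Set (Name, Name))
mayInline f g seen
  | (f, g) `Set.member` seen = (False, seen)
  | otherwise                = (True, Set.insert (f, g) seen)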

7.3 Termination of Specialisation

The specialisation method, left unrestricted, may also not terminate.

Example 18

data Wrap α = Wrap (Wrap α) | Value α

f x = f (Wrap x)
main = f (Value head)

In the first iteration, the specialiser generates a version of f specialised for the argument Value head. In the second iteration it would specialise for Wrap (Value head), then in the third with Wrap (Wrap (Value head)). Specialisation would generate an infinite number of specialisations of f. ¤

To ensure we only specialise a finite number of times we use a homeomorphic embedding (Kruskal 1960). The relation x ⊴ y indicates the expression x is an embedding of y. We can define ⊴ using the following rewrite rule:

emb = {f(x₁, . . . , xₙ) → xᵢ | 1 ≤ i ≤ n}

Now x ⊴ y can be defined as x ←∗emb y (Baader and Nipkow 1998). The rule emb takes an expression, and replaces it with one of its immediate subexpressions. If repeated non-deterministic application of this rule to any subexpression transforms y to x, then x ⊴ y. The intuition is that by removing some parts of y we obtain x, or that x is somehow "contained" within y.
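For intuition only, the following sketch (ours, on a simple standalone term type rather than the paper's Core expressions) computes the test using the usual recursive "dive or couple" reading of the definition above:

-- Simple first-order terms: a symbol applied to arguments.
data Term = Term String [Term] deriving (Eq, Show)

-- x `embeds` y corresponds to x ⊴ y: either x embeds into an immediate
-- subterm of y (dive), or the root symbols and arities match and the
-- arguments embed pairwise (couple).
embeds :: Term -> Term -> Bool
embeds x@(Term f xs) (Term g ys) =
     any (embeds x) ys
  || (f == g && length xs == length ys && and (zipWith embeds xs ys))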

Example 19

a ⊴ a                          b(a) ⋬ a
a ⊴ b(a)                       a ⋬ b(c)
c(a) ⊴ c(b(a))                 d(a, a) ⋬ d(b(a), c)
d(a, a) ⊴ d(b(a), c(c(a)))     b(a, a) ⋬ b(a, a, a)

¤

The homeomorphic embedding ⊴ is a well-quasi order, as shown by Kruskal's tree theorem (Kruskal 1960). This property means that for every infinite sequence of expressions e₁, e₂ . . . over a finite alphabet, there exist indices i < j such that eᵢ ⊴ eⱼ. This result is sometimes used in program optimisation to ensure an algorithm over expressions performs a bounded number of iterations, by stopping at iteration n once ∃i • 1 ≤ i < n ∧ eᵢ ⊴ eₙ – for example by Jonsson and Nordlander (2009).

For each function definition, we associate a set of expressions S. After generating a template t, we only specialise with that template if ∀s ∈ S • s ⋬ t. After specialising an expression e with template t, we add t to the set S associated with the function definition containing e. When we generate a new function from a template, we copy the S associated with the function at the root of the template.
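Using the embeds sketch above, the check guarding specialisation is then simply (our phrasing, not the paper's code):

-- Specialise with template t only if no previously recorded template
-- embeds into it.
allowSpecialise :: [Term] -> Term -> Bool
allowSpecialise s t = not (any (`embeds` t) s)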

One of the conditions for termination of homeomorphic embedding is that there must be a finite alphabet. To ensure this condition, we consider all variables to be equivalent. However, this is not sufficient. During the process of specialisation we generate new function names, and these names are new symbols in our alphabet. To keep the alphabet finite we only use function names from the original input program, relying on the equivalence of each template to an expression in the original program (§4.6). We perform the homeomorphic embedding test only after transforming all templates into their original equivalent expression.

Example 18 (revisited)

Using homeomorphic embedding, we again generate the specialised variant of f (Value head). Next we generate the template f (Wrap (Value head)). However, f (Value head) ⊴ f (Wrap (Value head)), so the new template is not used. ¤

Forbidding homeomorphic embeddings in specialisation still allows full defunctionalisation in most simple examples, but there are examples where it terminates prematurely.

Example 20

main y = f (λx → x) y
f x y = fst (x, f x y) y

Here we first generate a specialised variant of f (λx → x) y. If we call the specialised variant f′, we have:

f′ y = fst (λx → x, f′ y) y

Note that the recursive call to f has also been specialised. We now attempt to generate a specialised variant of fst, using the template fst (λx → x, f′ y) y. Unfortunately, this template is an embedding of the template we used for f′, so we do not specialise and the program remains higher-order. But if we did permit a further specialisation, we would obtain the first-order equivalent:

f′ y = fst′ y y
fst′ y1 y2 = y2 ¤

This example may look slightly obscure, but similar situations occur frequently with the standard implementation of type classes as dictionaries. Often, classes have default methods, which call other methods in the same class. These recursive class calls often pass dictionaries, embedding the original caller even though no recursion actually happens.

To alleviate this problem, instead of storing one set S, we store a sequence of sets, S₁ . . . Sₙ – where n is a small positive number, constant for the duration of the program. Instead of adding to the set S, we now add to the lowest set Sᵢ where adding the element will not violate the invariant. Each of the sets Sᵢ is still finite, and there are a finite number (n) of them, so termination is guaranteed.

By default our defunctionalisation program uses 8 sets. In the results table given in §8, we have included the minimum possible value of n to remove all expressions creating functional values from each program.
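One plausible rendering of this bookkeeping (our sketch; reading the invariant as "no stored template embeds into a newly added one", which the text leaves implicit) is:

-- Try to record template t in the lowest of the n sets whose invariant it
-- does not violate; Nothing means the bound is exhausted, so t is rejected
-- and no specialisation takes place.
addTemplate :: Term -> [[Term]] -> Maybe [[Term]]
addTemplate _ []     = Nothing
addTemplate t (s:ss)
  | allowSpecialise s t = Just ((t : s) : ss)
  | otherwise           = fmap (s :) (addTemplate t ss)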

7.4 Termination as a Whole

Given an initial program, inlining and specialisation rules will only apply a finite number of times. The simplification rules are terminating on their own, so when combined, all the rules will terminate.

8. Results

8.1 Benchmark Tests

We have tested our method with programs drawn from the nofib benchmark suite (Partain et al. 2008), and the results are given in Table 1. Looking at the input Core programs, we see many sources of functional values.

• Type classes are implemented as tuples of functions.
• The monadic bind operation is higher-order.
• The IO data type is implemented as a function.
• The Haskell Show type class uses continuation-passing style extensively.
• List comprehensions in Yhc are desugared to continuation-passing style. There are other translations which require fewer functional value manipulations (Coutts et al. 2007).

We have tested all 14 programs from the imaginary section of the nofib suite, 35 of the 47 spectral programs, and 17 of the 30


Table 1. Results of defunctionalisation on the nofib suite. Name is the name of the program; Bound is the numeric bound used for termination (see §7.3); HO Create is the number of under-applied functions and lambda expressions not contributing to the arity of a top-level function, first in the input program and then in the output program; HO Use is the number of over-applied functions and application expressions; Time is the execution time of our method in seconds; Size is the change in the program size measured by the number of lines of Core.

Name         Bound   HO Create    HO Use     Time   Size

Programs curtailed by a termination bound:
cacheprof       8     611   44    686   40    1.8     2%
grep            8     129    9    108   22    0.8    40%
lift            8     187  123    175  125    1.2    -6%
prolog          8     308  301    203  137    1.1    -5%

All other programs:
ansi            4     239    0    187    2    0.5   -29%
bernouilli      4     240    0    190    2    0.3   -32%
bspt            4     262    0    264    1    0.7   -22%
. . . plus 56 additional programs . . .
sphere          4     343    0    366    2    0.7   -45%
symalg          5     402    0    453   64    1.0   -32%
x2n1            4     345    0    385    2    0.8   -57%

Summary of all 62 other programs:
Minimum         2      60    0     46    0    0.1   -78%
Maximum        14     580    1    581  100    1.2    27%
Average         5     260    0    232    5    0.5   -30%

real programs. The remaining 25 programs do not compile using the Yhc compiler, mainly due to missing or incomplete libraries. After applying our defunctionalisation method, 4 programs are curtailed by the termination bound and 2 pass functional values to primitives. The remaining 60 programs can be transformed to first-order as described in §6.4. We first discuss the resultant programs which remain higher-order, then those which contain higher-order expressions but can be rewritten as first-order, then make some observations about each of the columns in the table.

8.2 Higher-Order Programs

All four programs curtailed by the termination bound are listed in Table 1. The lift program uses pretty-printing combinators, while the other three programs use parser combinators. In all programs, the combinators are used to build up a functional value representing the action to perform, storing an unbounded amount of information inside the functional value, which therefore cannot be removed.

The remaining two higher-order programs are integer and maillist, both of which pass functional values to primitive functions. The maillist program calls the catch function (see §4.5). The integer program passes functional values to the seq primitive, using the following function:

seqlist [ ] = return ()
seqlist (x : xs) = x `seq` seqlist xs

This function is invoked with the IO monad, so the return () expression is a functional value. It is impossible to remove this functional value without having access to the implementation of the seq primitive.

8.3 First-Order Programs

Of the 66 programs tested, 60 can be made first-order using the rewrites given in §6.4. When looking at the resultant programs, 3 contain lambda expressions, and all but 5 contain expressions which could use functional values.

The pretty, constraints and mkhprog programs pass functional values to expressions that evaluate to ⊥. The case in pretty comes from the fragment:

type Pretty = Int → Bool → PrettyRep

ppBesides :: [Pretty] → Pretty
ppBesides = foldr1 ppBeside

Here ppBesides xs evaluates to ⊥ if xs ≡ [ ]. The ⊥ value will be of type Pretty, and can be given further arguments, which include functional values. In reality, the code ensures that the input list is never [ ], so the program will never fail with this error.

The vast majority of programs which have residual uses of functional values result from over-applying the error function, because Yhc generates such an expression when it desugars a pattern-match within a do expression.

8.4 Termination Bound

The termination bound required varies from 2 to 14 for the sample programs (see Bound in Table 1). If we exclude the integer program, which is complicated by the primitive operations on functional values, the highest bound is 8. Most programs have a termination bound of 4. There is no apparent relation between the size of a program and the termination bound.

8.5 Creation and Uses of Functional Values

We use Yhc-generated programs as input. Yhc performs desugaring of the Haskell source code, introducing dictionaries of functions to implement type classes, and performing lambda lifting (Johnsson 1985). As a result the input programs have no lambda expressions, only partial application. Conversely, the (eta) rule from Figure 2 ensures resultant programs have no partial application, only lambda expressions. Most programs in our test suite start with hundreds of partial applications, but only 9 resultant programs contain lambda expressions (see HO Create in Table 1).

For the purposes of testing defunctionalisation, we have worked on unmodified Yhc libraries, including all the low-level detail. For example, readFile in Yhc is implemented in terms of file handles and pointer operations. Most analysis operations work on an abstracted view of the program, which reduces the number and complexity of functional values.

8.6 Execution Time

The timing results were all measured on a 1.2GHz laptop, running GHC 6.8.2 (The GHC Team 2007). The longest execution time was just over one second, with the average time being half a second (see Time in Table 1). The programs requiring most time made use of floating point numbers, suggesting that library code requires most effort to defunctionalise. If abstractions were given for library methods, the execution time would drop substantially.

In order to gain acceptable speed, we perform a number of optimisations over the method presented in §4. (1) We transform functions in an order determined by a topological sort with respect to the call-graph. (2) We delay the transformation of dictionary components, as these will often be eliminated. (3) We track the arity and boxed lambda status of each function.

8.7 Program Size

We measure the size of a program by counting the number of lines of Core code, after a simple dead-code analysis to remove entirely unused function definitions. On average the size of the resultant program is smaller by 30% (see Size in Table 1). The decrease in program size is mainly due to the elimination of dictionaries


holding references to unnecessary code. An optimising compiler will perform dictionary specialisation, and therefore is likely to also reduce program size. We do not claim that defunctionalisation reduces code size, merely hope to alleviate concerns raised by previous papers that it might cause an explosion in code size (Chin and Darlington 1996).

9. Higher-Order Analysis

In this section we show that our method can be used to improve the results of existing analysis operations. Our method is already used by the Catch tool (Mitchell and Runciman 2008), allowing a first-order pattern-match analysis to check higher-order programs. We now give examples of applying our method to strictness and termination analysis.

Example 21

GHC's demand analysis (The GHC Team 2007) is responsible for determining which arguments to a function are strict.

main :: Int → Int → Int
main x y = apply 10 (+x) y

apply :: Int → (α → α) → α → α
apply 0 f x = x
apply n f x = apply (n − 1) f (f x)

GHC's demand analysis reports that the main function is lazy in both arguments. By generating a first-order variant of main and then applying the demand analysis, we find that the argument y is strict. This strictness information can then be applied back to the original program. ¤
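Purely for illustration (our sketch, with hypothetical names mainFO and applyFO, not the output of the authors' tool), a first-order variant of the kind specialisation could produce passes the captured x explicitly, making the strictness in y visible to a first-order demand analysis:

mainFO :: Int -> Int -> Int
mainFO x y = applyFO x 10 y

applyFO :: Int -> Int -> Int -> Int
applyFO x 0 v = v
applyFO x n v = applyFO x (n - 1) (v + x)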

Example 22

The Agda compiler (Norell 2008) checks that each function is terminating, using an analysis taken from the Foetus termination checker (Abel 1998).

cons : (ℕ → List ℕ) → ℕ → List ℕ
cons f x = x :: f x

downFrom : ℕ → List ℕ
downFrom = cons f
  where f : ℕ → List ℕ
        f zero = [ ]
        f (suc x) = downFrom x

Agda's termination analysis reports that downFrom may not terminate. By generating a first-order variant and applying the termination analysis, we find that downFrom is terminating. ¤
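Again for illustration only (our sketch, transcribed to Haskell rather than Agda), specialising cons to its single functional argument yields a first-order variant whose recursion is plainly structural:

data Nat = Zero | Suc Nat

downFromFO :: Nat -> [Nat]
downFromFO x = x : f x
  where f Zero    = []
        f (Suc y) = downFromFO y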

No doubt there are other ways in which the above analysis methods could be improved, by extending and reworking the analysis machinery itself. But a big advantage of adding a preliminary defunctionalisation stage is that it is modular: the analysis is treated as a black box. A combination with Reynolds-style defunctionalisation does not improve either analysis.

10. Related Work

10.1 Reynolds-style defunctionalisation

Reynolds-style defunctionalisation (Reynolds 1972) is the seminal method for generating a first-order equivalent of a higher-order program.

Example 23

map f [ ] = [ ]
map f (x : xs) = f x : map f xs

Reynolds' method works by creating a data type to represent all values that f may take anywhere in the whole program. For instance, it might be:

data Function = Head | Tail

apply Head x = head x
apply Tail x = tail x

map f [ ] = [ ]
map f (x : xs) = apply f x : map f xs

Now all calls to map head are replaced by map Head. ¤

Reynolds' method works on all programs. Defunctionalised code is still type safe, but type checking would require a dependently typed language. Others have proposed variants of Reynolds' method that are type safe in the simply typed lambda calculus (Bell et al. 1997), and within a polymorphic type system (Pottier and Gauthier 2004).

The method is complete, removing all higher-order functions, and preserves space and time behaviour. The disadvantage is that the transformation essentially embeds a mini-interpreter for the original program into the new program. The control flow is complicated by the extra level of indirection and the apply interpreter can be a bottleneck for analysis. Various analysis methods have been proposed to reduce the size of the apply function, by statically determining a safe subset of the possible functional values at a call site (Cejtin et al. 2000; Boquist and Johnsson 1996).

Reynolds' method has been used as a tool in program calculation (Danvy and Nielsen 2001; Hutton and Wright 2006), often as a mechanism for removing introduced continuations. Another use of Reynolds' method is for optimisation (Meacham 2008), allowing flow control information to be recovered without the complexity of higher-order transformation.

10.2 Removing Functional Values

The closest work to ours is by Chin and Darlington (1996), which itself is similar to that of Nelan (1991). They define a defunctionalisation method which removes some functional values without introducing data types. Their work shares some of the simplification rules, and includes a form of function specialisation. Despite these commonalities, there are big differences between their method and ours.

• Their method makes use of the types of expressions, information that must be maintained and extended to work with additional type systems.

• Their method has no inlining step, or any notion of boxed lambdas. Functional values within constructors are ignored. The authors suggest the use of deforestation (Wadler 1988) to help remove them, but deforestation transforms the program more than necessary, and still fails to eliminate many functional values.

• Their specialisation step only applies to outermost lambda expressions, not lambdas within constructors.

• To ensure termination of the specialisation step, they never specialise a recursive function unless it has all functional arguments passed identically in all recursive calls. This restriction is satisfied by higher-order functions such as map, but fails in many other cases.

In addition, functional programs now use monads, IO continuations and type classes as a matter of course. Such features were still experimental when Chin and Darlington developed their method and it did not handle them. Our work can be seen as a successor to theirs, indeed we achieve most of the aims set out in their future


work section. We have tried their examples, and can confirm that all of them are successfully handled by our system. Some of their observations and extensions apply equally to our work: for example, they suggest possible methods of removing accumulating functions such as in Example 14.

10.3 Partial Evaluation and Supercompilation

The specialisation and inlining steps are taken from existing program optimisers, as is the termination strategy of homeomorphic embedding. A lot of program optimisers include some form of specialisation and so remove some higher-order functions, such as partial evaluation (Jones et al. 1993) and supercompilation (Turchin 1986). We have certainly benefited from ideas in both these areas in developing our method.

11. Conclusions and Future Work

Higher-order functions are very useful, but may pose difficulties for certain types of analysis. Using the method we have described, it is possible to remove most functional values from most programs. A user can still write higher-order programs, but an analysis tool can work on equivalent first-order programs. Our method has already found practical use within the Catch tool, allowing a first-order pattern-match analysis to be applied to real Haskell programs. It would be interesting to investigate the relative accuracy of higher-order analysis methods with and without defunctionalisation.

Our method works on whole programs, requiring sources for all function definitions. This requirement both increases transformation time, and precludes the use of closed-source libraries. We may be able to relax this requirement, precomputing first-order variants of libraries, or permitting some components of the program to be ignored.

The use of a numeric termination bound in the homeomorphic embedding is regrettable, but practically motivated. We need further research to determine if such a numeric bound is necessary, or if other measures could be used.

Many analysis methods, in fields such as strictness analysis and termination analysis, start out first-order and are gradually extended to work in a higher-order language. Defunctionalisation offers an alternative approach: instead of extending the analysis method, we transform the functional values away, enabling more analysis methods to work on a greater range of programs.

References

Andreas Abel. foetus – Termination Checker for Simple Functional Programs. Programming Lab Report, July 1998.

Franz Baader and Tobias Nipkow. Term Rewriting and All That. Cambridge University Press, 1998.

Jeffrey M. Bell, Francoise Bellegarde, and James Hook. Type-driven defunctionalization. In Proc. ICFP '97, pages 25–37. ACM, 1997.

Urban Boquist and Thomas Johnsson. The GRIN project: A highly optimising back end for lazy functional languages. In Proc. IFL '96, volume 1268 of LNCS, pages 58–84. Springer-Verlag, 1996.

Henry Cejtin, Suresh Jagannathan, and Stephen Weeks. Flow-directed closure conversion for typed languages. In Proc. ESOP '00, volume 1782 of LNCS, pages 56–71. Springer-Verlag, 2000.

Wei-Ngan Chin and John Darlington. A higher-order removal method. Lisp Symb. Comput., 9(4):287–322, 1996.

Duncan Coutts, Roman Leshchinskiy, and Don Stewart. Stream fusion: From lists to streams to nothing at all. In Proc. ICFP '07, pages 315–326. ACM Press, October 2007.

Olivier Danvy and Lasse R. Nielsen. Defunctionalization at work. In Proc. PPDP '01, pages 162–174. ACM, 2001.

J. Giesl, P. Schneider-Kamp, and R. Thiemann. AProVE 1.2: Automatic termination proofs in the dependency pair framework. In Proceedings of the 3rd International Joint Conference on Automated Reasoning (IJCAR '06), volume 4130 of LNCS, pages 281–286. Springer-Verlag, 2006.

Dimitry Golubovsky, Neil Mitchell, and Matthew Naylor. Yhc.Core – from Haskell to Core. The Monad.Reader, 1(7):45–61, April 2007.

John Hughes. A novel representation of lists and its application to the function "reverse". Inf. Process. Lett., 22(3):141–144, 1986.

Graham Hutton and Joel Wright. Calculating an Exceptional Machine. In Trends in Functional Programming, volume 5. Intellect, February 2006.

Thomas Johnsson. Lambda lifting: transforming programs to recursive equations. In Proc. FPCA '85, pages 190–203. Springer-Verlag New York, Inc., 1985.

Mark P. Jones. Dictionary-free Overloading by Partial Evaluation. In Proc. PEPM '94, pages 107–117. ACM Press, June 1994.

Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall International, 1993.

Peter Jonsson and Johan Nordlander. Positive supercompilation for a higher order call-by-value language. In Proc. POPL '09, pages 277–288. ACM, 2009.

J. B. Kruskal. Well-quasi-ordering, the tree theorem, and Vazsonyi's conjecture. Transactions of the American Mathematical Society, 95(2):210–255, 1960.

John Meacham. jhc: John's Haskell compiler. http://repetae.net/john/computer/jhc/, 2008.

Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML – Revised. The MIT Press, May 1997.

Neil Mitchell. Transformation and Analysis of Functional Programs. PhD thesis, University of York, 2008.

Neil Mitchell and Colin Runciman. Not all patterns, but enough – an automatic verifier for partial but sufficient pattern matching. In Proc. Haskell '08, 2008.

George Nelan. Firstification. PhD thesis, Arizona State University, December 1991.

Ulf Norell. Dependently typed programming in Agda. In Lecture notes on Advanced Functional Programming, 2008.

Will Partain et al. The nofib Benchmark Suite of Haskell Programs. http://darcs.haskell.org/nofib/, 2008.

Simon Peyton Jones. Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, 2003.

Simon Peyton Jones. Call-pattern specialisation for Haskell programs. In Proc. ICFP '07, pages 327–337. ACM Press, October 2007.

Simon Peyton Jones and Simon Marlow. Secrets of the Glasgow Haskell Compiler inliner. JFP, 12:393–434, July 2002.

Simon Peyton Jones and Andres Santos. Compilation by transformation in the Glasgow Haskell Compiler. In Functional Programming Workshops in Computing, pages 184–204. Springer-Verlag, 1994.

Francois Pottier and Nadji Gauthier. Polymorphic typed defunctionalization. In Proc. POPL '04, pages 89–98. ACM Press, 2004.

John C. Reynolds. Definitional interpreters for higher-order programming languages. In Proc. ACM '72, pages 717–740. ACM Press, 1972.

Damien Sereni. Termination analysis and call graph construction for higher-order functional programs. In Proc. ICFP '07, pages 71–84. ACM, 2007.

The GHC Team. The GHC compiler, version 6.8.2. http://www.haskell.org/ghc/, December 2007.

Valentin F. Turchin. The concept of a supercompiler. ACM Trans. Program. Lang. Syst., 8(3):292–325, 1986.

Philip Wadler. Deforestation: Transforming programs to eliminate trees. In Proc. ESOP '88, volume 300 of LNCS, pages 344–358. Springer-Verlag, 1988.

Philip Wadler and Stephen Blott. How to make ad-hoc polymorphism less ad hoc. In Proc. POPL '89, pages 60–76. ACM Press, 1989.


Push-Pull Functional Reactive Programming

Conal Elliott
LambdaPix

[email protected]

Abstract

Functional reactive programming (FRP) has simple and powerful semantics, but has resisted efficient implementation. In particular, most past implementations have used demand-driven sampling, which accommodates FRP's continuous time semantics and fits well with the nature of functional programming. Consequently, values are wastefully recomputed even when inputs don't change, and reaction latency can be as high as the sampling period.

This paper presents a way to implement FRP that combines data- and demand-driven evaluation, in which values are recomputed only when necessary, and reactions are nearly instantaneous. The implementation is rooted in a new simple formulation of FRP and its semantics and so is easy to understand and reason about.

On the road to a new implementation, we'll meet some old friends (monoids, functors, applicative functors, monads, morphisms, and improving values) and make some new friends (functional future values, reactive normal form, and concurrent "unambiguous choice").

Categories and Subject Descriptors D.1.1 [Software]: Programming Techniques—Applicative (Functional) Programming

General Terms Design, Theory

Keywords Functional reactive programming, semantics, concurrency, data-driven, demand-driven

1. Introduction

Functional reactive programming (FRP) supports elegant programming of dynamic and reactive systems by providing first-class, composable abstractions for behaviors (time-varying values) and events (streams of timed values) (Elliott 1996; Elliott and Hudak 1997; Nilsson et al. 2002).¹ Behaviors can change continuously (not just frequently), with discretization introduced automatically during rendering. The choice of continuous time makes programs simpler and more composable than the customary (for computer programming) choice of discrete time, just as is the case with continuous space for modeled imagery. For instance, vector and 3D graphics representations are inherently scalable (resolution-independent), as compared to bitmaps (which are spatially discrete). Similarly, temporally or spatially infinite representations are

1 See http://haskell.org/haskellwiki/FRP for more references.


more composable than their finite counterparts, because they can be scaled arbitrarily in time or space, before being clipped to a finite time/space window.

While FRP has simple, pure, and composable semantics, its efficient implementation has not been so simple. In particular, past implementations have used demand-driven (pull) sampling of reactive behaviors, in contrast to the data-driven (push) evaluation typically used for reactive systems, such as GUIs. There are at least two strong reasons for choosing pull over push for FRP:

• Behaviors may change continuously, so the usual tactic of idling until the next input change (and then computing consequences) doesn't apply.

• Pull-based evaluation fits well with the common functional programming style of recursive traversal with parameters (time, in this case). Push-based evaluation appears at first to be an inherently imperative technique.

Although some values change continuously, others change only at discrete moments (say in response to a button click or an object collision), while still others have periods of continuous change alternating with constancy. In all but the purely continuous case, pull-based implementations waste considerable resources, recomputing values even when they don't change. In those situations, push-based implementations can operate much more efficiently, focusing computation on updating values that actually change.

Another serious problem with the pull approach is that it imposes significant latency. The delay between the occurrence of an event and the visible result of its reaction can be as much as the polling period (and is on average half that period). In contrast, since push-based implementations are driven by event occurrences, reactions are visible nearly instantaneously.

Is it possible to combine the benefits of push-based evaluation—efficiency and minimal latency—with those of pull-based evaluation—simplicity of functional implementation and applicability to temporal continuity? This paper demonstrates that it is indeed possible to get the best of both worlds, combining data- and demand-driven evaluation in a simple and natural way, with values being recomputed only, and immediately, when their discrete or continuous inputs change. The implementation is rooted in a new simple formulation of FRP and its semantics and so is relatively easy to understand and reason about.

This paper describes the following contributions:

• A new notion of reactive values, which is a purely discrete simplification of FRP's reactive behaviors (no continuous change). Reactive values have simple and precise denotational semantics (given below) and an efficient, data-driven implementation.

• Decomposing the notion of reactive behaviors into independent discrete and continuous components, namely reactive values and (non-reactive) time functions. Recomposing these two notions and their implementations results in FRP's reactive behaviors, but now with an implementation that combines push-based


and pull-based evaluation. Reactive values have a lazy, purely data representation, and so are cached automatically. This composite representation captures a new reactive normal form for FRP.

• Modernizing the FRP interface, by restructuring much of its functionality and semantic definitions around standard type classes, as monoids, functors, applicative functors, and monads. This restructuring makes the interface more familiar, reduces the new interfaces to learn, and provides new expressive power. In most cases, the semantics are defined simply by choosing the semantic functions to be type class morphisms (Elliott 2009).

• A notion of composable future values, which embody pure values that (in many cases) cannot yet be known, and is at the heart of this new formulation of reactivity. Nearly all the functionality of future values is provided via standard type classes, with semantics defined as class morphisms.

• Use of Warren Burton's "improving values" as a richly structured (non-flat) type for time. Events, reactive values, reactive behaviors, and future values can all be parameterized with respect to time, which can be any ordered type. Using improving values (over an arbitrary ordered type) for time, the semantics of future values becomes a practical implementation.

• A new technique for semantically determinate concurrency via an "unambiguous choice" operator, and use of this technique to provide a new implementation of improving values.

2. Functional reactive programming

FRP revolves around two composable abstractions: events and behaviors (Elliott and Hudak 1997). Because FRP is a functional paradigm, events and behaviors describe things that exist, rather than actions that have happened or are to happen (i.e., what is, not what does). Semantically, a (reactive) behavior is just a function of time, while an event (sometimes called an "event source") is a list of time/value pairs ("occurrences").

type Ba = T → a

type Ea = [(T̂, a)] -- for non-decreasing times

Historically in FRP, T = ℝ. As we'll see, however, the semantics of behaviors assumes only that T is totally ordered. The type T̂ of occurrence times is T extended with −∞ and ∞.

Originally, FRP had a notion of events as a single value with time, which led to a somewhat awkward programming style with explicit temporal loops (tail recursions). The sequence-of-pairs formulation above, described in, e.g., (Elliott 1998a; Peterson et al. 1999) and assumed throughout this paper, hides discrete time iteration, just as behaviors hide continuous "iteration", resulting in simpler, more declarative specifications.

The semantic domains Ba and Ea correspond to the behavior and event data types, via semantic functions:

at :: Behavior a → Ba

occs :: Event a → Ea

This section focuses on the semantic models underlying FRP, which are intended for ease of understanding and formal reasoning. The insights gained are used in later sections to derive new correct and efficient representations.

FRP's Behavior and Event types came with a collection of combinators, many of which are instances of standard type classes. To dress FRP in modern attire, this paper uses standard classes and methods wherever possible in place of names from "Classic FRP".

2.1 Behaviors

Perhaps the simplest behavior is time, corresponding to the identity function.

time :: Behavior Time
at time = id

2.1.1 Functor

Functions can be "lifted" to apply to behaviors. Classic FRP (CFRP) had a family of lifting combinators:

liftn :: (a1 → ... → an → b)
      → (Behavior a1 → ... → Behavior an → Behavior b)

Lifting is pointwise and synchronous: the value of liftn f b1 ... bn at time t is the result of applying f to the values of the bi at (exactly) t.²

at (liftn f b1 ... bn) = λt → f (b1 ‘at ‘ t) ... (bn ‘at ‘ t)

The Functor instance for behaviors captures unary lifting, with fmap replacing FRP's lift1.

fmap :: (a → b)→ Behavior a → Behavior b

The semantic domain, functions, also form a functor:

instance Functor ((→) t) where
  fmap f g = f ◦ g

The meaning of fmap on behaviors mimics fmap on the meaning of behaviors, following the principle of denotational design using type class morphisms (Elliott 2009) and captured in the following "semantic instance":³

instancesem Functor Behavior where
  at (fmap f b) = fmap f (at b)
                = f ◦ at b

In other words, at is a natural transformation, or "functor morphism" (for consistency with related terminology), from Behavior to B (Mac Lane 1998).

The semantic instances in this paper ("instancesem ...") specify the semantics, not implementation, of type class instances.

2.1.2 Applicative functor

Applicative functors (AFs) are a recently explored notion (McBride and Paterson 2008). The AF interface has two methods, pure and (<∗>), which correspond to the monadic operations return and ap. Applicative functors are more structured (less populated) than functors and less structured (more populated) than monads.

infixl 4 <∗>
class Functor f ⇒ Applicative f where
  pure  :: a → f a
  (<∗>) :: f (a → b) → f a → f b

These two combinators suffice to define liftA2, liftA3, etc.

infixl 4 <$>
(<$>) :: Functor f ⇒ (a → b) → f a → f b
f <$> a = fmap f a

liftA2 :: Applicative f ⇒ (a → b → c)→ f a → f b → f c

liftA2 f a b = f <$> a <∗> b

² Haskellism: The at function here is being used in both prefix form (on the left) and infix form (on the right).
³ Haskellism: Function application has higher (stronger) precedence than infix operators, so, e.g., f ◦ at b ≡ f ◦ (at b).


liftA3 :: Applicative f ⇒ (a → b → c → d)→ f a → f b → f c → f d

liftA3 f a b c = liftA2 f a b <∗> c
...

The left-associative (<$>) is just a synonym for fmap—a stylistic preference—while liftA2, liftA3, etc. are generalizations of the monadic combinators liftM2, liftM3, etc.

CFRP's lift0 corresponds to pure, while lift2, lift3, etc. correspond to liftA2, liftA3, etc., so the Applicative instance replaces all of the liftn.⁴

Functions, and hence B, form an applicative functor, where pure and (<∗>) correspond to the classic K and S combinators:

instance Applicative ((→) t) where
  pure = const
  f <∗> g = λt → (f t) (g t)

The Applicative instance for functions leads to the semantics of the Behavior instance of Applicative. As with Functor above, the semantic function distributes over the class methods, i.e., at is an applicative functor morphism:

instancesem Applicative Behavior where
  at (pure a) = pure a
              = const a
  at (bf <∗> bx) = at bf <∗> at bx
                 = λt → (bf ‘at‘ t) (bx ‘at‘ t)

So, given a function-valued behavior bf and an argument-valued behavior bx, to sample bf <∗> bx at time t, sample bf and bx at t and apply one result to the other.

This (<∗>) operator is the heart of FRP's concurrency model, which is semantically determinate, synchronous, and continuous.

2.1.3 Monad

Although Behavior is a semantic Monad as well, the implementation developed in Section 5 does not implement Monad.

2.2 Events

Like behaviors, much of the event functionality can be packaged via standard type classes.

2.2.1 Monoid

Classic FRP had a never-occurring event and an operator to merge two events. Together, these combinators form a monoid, so ∅ and (⊕) (Haskell's mempty and mappend) replace the CFRP names neverE and (.|.).

The event monoid differs from the list monoid in that (⊕) must preserve temporal monotonicity.

instancesem Monoid (Event a) where
  occs ∅ = [ ]
  occs (e ⊕ e′) = occs e ‘merge‘ occs e′

Temporal merging ensures a time-ordered result and has a left-bias in the case of simultaneity:

merge :: Ea → Ea → Ea

[ ] ‘merge‘ vs = vs
us ‘merge‘ [ ] = us
((ta, a) : ps) ‘merge‘ ((tb, b) : qs)
  | ta ≤ tb   = (ta, a) : (ps ‘merge‘ ((tb, b) : qs))
  | otherwise = (tb, b) : (((ta, a) : ps) ‘merge‘ qs)

Note that occurrence lists may be infinitely long.

⁴ The formulation of the liftn in terms of operators corresponding to pure and (<∗>) was noted in (Elliott 1998a, Section 2.1).

2.2.2 Functor

Mapping a function over an event affects just the occurrence values, leaving the times unchanged.

instancesem Functor Event where

occs (fmap f e) = map (λ(ta, a)→ (ta, f a)) (occs e)

2.2.3 Monad

Previous FRP definitions and implementations did not have a monad instance for events. Such an instance, however, is very useful for dynamically-generated events. For example, consider playing Asteroids and tracking collisions. Each collision can break an asteroid into more of them (or none), each of which has to be tracked for more collisions. Another example is a chat room having an enter event whose occurrences contain new events like speak (for the newly entered user).

A unit event has one occurrence, which is always available:

occs (return a) = [(−∞, a)]

The join operation collapses an event-valued event ee:

joinE :: Event (Event a)→ Event a

Each occurrence of ee delivers a new event, all of which get merged together into a single event.

occs (joinE ee) =
  (foldr merge [ ] ◦ map delayOccs ◦ occs) ee

delayOccs :: (T̂, Event a) → Ea

delayOccs (te, e) = [(te ‘max ‘ ta, a) | (ta, a)← occs e ]

Here, delayOccs ensures that inner events cannot occur before they are generated.

This definition of occs hides a subtle problem. If ee has infinitely many non-empty occurrences, then the foldr, if taken as an implementation, would have to compare the first occurrences of infinitely many events to see which is the earliest. However, none of the occurrences in delayOccs (te, e) can occur before time te, and the delayOccs applications are given monotonically non-decreasing times. So, only a finite prefix of the events generated from ee need be compared at a time.

2.2.4 Applicative functor

Any monad can be made into an applicative functor, by defining pure = return and (<∗>) = ap. However, this Applicative instance is unlikely to be very useful for Event. Consider function- and argument-valued events ef and ex. The event ef <∗> ex would be equivalent to ef ‘ap‘ ex and hence to

ef >>= λf → ex >>= λx → return (f x )

or more simply

ef >>= λf → fmap f ex

The resulting event contains occurrences for every pair of occurrences of ef and ex, i.e., (tf ‘max‘ tx, f x) for each (tf, f) ∈ occs ef and (tx, x) ∈ occs ex. If there are m occurrences of ef and n occurrences of ex, then there will be m × n occurrences of ef <∗> ex. Since the maximum of two values is one value or the other, there are at most m + n distinct values of tf ‘max‘ tx. Hence the m × n occurrences must all occur in at most m + n temporally distinct clusters. Alternatively, one could give a relative time semantics by using (+) in place of max.

2.3 Combining behaviors and events

FRP's basic tool for introducing reactivity combines a behavior and an event.


switcher :: Behavior a → Event (Behavior a)→ Behavior a

The behavior b0 ‘switcher‘ e acts like b0 initially. Each occurrence of the behavior-valued event e provides a new phase of behavior to switch to. Because the phases themselves (such as b0) may be reactive, each transition may cause the switcher behavior to lose interest in some events and start reacting to others.

The semantics of b0 ‘switcher‘ e chooses and samples either b0 or the last behavior from e before a given sample time t:

(b0 ‘switcher ‘ e) ‘at ‘ t = last (b0 : before (occs e) t) ‘at ‘ t

before :: Ea → T → [a ]

before os t = [a | (ta, a)← os, ta < t ]

As a simple and common specialization, stepper produces piecewise-constant behaviors (step functions, semantically):

stepper :: a → Event a → Behavior a
a0 ‘stepper‘ e = pure a0 ‘switcher‘ (pure <$> e)

Hence

at (a0 ‘stepper ‘ e) = λt → last (a0 : before (occs e) t)

There is a subtle point in the semantics of switcher. Consider b0 ‘stepper‘ (e ⊕ e′). If each of e and e′ has one or more occurrences at the same time, then the ones from e′ will get reacted to last, and so will appear in the switcher behavior.
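As a small usage sketch (ours; the event keyPress and the initial value are hypothetical), stepper gives the piecewise-constant behavior holding the value of the most recent occurrence:

-- The most recently pressed key, or ' ' before any press.
lastKey :: Event Char -> Behavior Char
lastKey keyPress = ' ' `stepper` keyPress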

3. From semantics to implementation

Now we have a simple and precise semantics for FRP. Refining it into an efficient implementation requires addressing the following obstacles.

• Event merging compares the two occurrence times in order to choose the earlier one: ta ≤ tb. If time is a flat domain (e.g., Double), this comparison could not take place until both ta and tb are known. Since occurrence times are not generally known until they actually arrive, this comparison would hold up event reaction until the later of the two occurrences, at which time the earlier one would be responded to. For timely response, the comparison must complete when the earlier occurrence happens.⁵ Section 4 isolates this problem in an abstraction called "future values", clarifying exactly what properties are required for a type of future times. Section 9 presents a more sophisticated representation of time that satisfies these properties and solves the comparison problem. This representation adds an expense of its own, which is removed in Sections 10 and 11.

• For each sample time t, the semantics of switcher involves searching through an event for the last occurrence before t. This search becomes costlier as t increases, wasting time as well as space. While the semantics allow random time sampling, in practice, behaviors are sampled with monotonically increasing times. Section 8 introduces and exploits monotonic time for efficient sampling.

• The semantics of behaviors as functions leads to an obvious, but inefficient, demand-driven evaluation strategy, as in past FRP implementations. Section 5 introduces a reactive normal form for behaviors that reveals the reactive structure as a sequence of simple non-reactive phases. Wherever phases are constant (a common case), sampling happens only once per phase, driven by occurrences of relevant events, as shown in Section 8.

5 Mike Sperber noted this issue and addressed it as well (Sperber 2001).

4. Future values

A FRP event occurrence is a "future value", or simply "future", i.e., a value and an associated time. To simplify the semantics and implementation of events, and to provide an abstraction that may have uses outside of FRP, let's now focus on futures. Semantically,

type Fa = (T̂, a)

force :: Future a → Fa

Like events and behaviors, much of the interface for future values is packaged as instances of standard type classes. Moreover, as with behaviors, the semantics of these instances are defined as type class morphisms. The process of exploring these morphisms reveals requirements for the algebraic structure of T̂.

4.1 Functor

The semantic domain for futures, partially applied pairing, is a functor:

instance Functor ((,) t) where
  fmap h (t, a) = (t, h a)

The semantic function, force , is a functor morphism:

instancesem Functor Future where
  force (fmap h u) = fmap h (force u)
                   = (t, h a) where (t, a) = force u

Thus, mapping a function over a future gives a future with the same time but a transformed value.

4.2 Applicative functor

For applicative functors, the semantic instance (pairing) requires an additional constraint:

instance Monoid t ⇒ Applicative ((,) t) where
  pure a = (∅, a)
  (t, f) <∗> (t′, x) = (t ⊕ t′, f x)

When t is a future time, what meanings do we want for ∅ and (⊕)? Two future values can be combined only when both are known, so (⊕) = max. Since ∅ is an identity for (⊕), it follows that ∅ = minBound, and so T̂ must have a least element.

The Applicative semantics for futures follow from these considerations, choosing force to be an applicative functor morphism:

instancesem Applicative Future where
  force (pure a) = pure a
                 = (∅, a)
                 = (minBound, a)
  force (uf <∗> ux) = force uf <∗> force ux
                    = (tf, f) <∗> (tx, x)
                    = (tf ⊕ tx, f x)
                    = (tf ‘max‘ tx, f x)
    where
      (tf, f) = force uf
      (tx, x) = force ux

Now, of course these definitions of (⊕) and ∅ do not hold for arbitrary t, even for ordered types, so the pairing instance of Applicative provides helpful clues about the algebraic structure of future times.

Alternatively, for a relative-time semantics, use the Sum monoid in place of the Max monoid.

4.3 Monad

Given the Monoid constraint on t, the type constructor ((,) t) is equivalent to the more familiar writer monad.


instance Monoid t ⇒ Monad ((,) t) where
  return a = (∅, a)
  (ta, a) >>= h = (ta ⊕ tb, b)
    where (tb, b) = h a

Taking force to be a monad morphism (Wadler 1990),

instancesem Monad Future where
  force (return a) = return a
                   = (minBound, a)
  force (u >>= k) = force u >>= force ◦ k
                  = (ta ‘max‘ tb, b)
    where (ta, a) = force u
          (tb, b) = force (k a)

Similarly, join collapses a future future into a future.

joinF :: Future (Future a) → Future a
force (joinF uu) = join (fmap force (force uu))
                 = (tu ‘max‘ ta, a)
  where (tu, u) = force uu
        (ta, a) = force u

So, the value of the join is the value of the inner future, and the time matches the later of the outer and inner futures. (Alternatively, the sum of the future times, in relative-time semantics.)

4.4 Monoid

A useful (⊕) for futures simply chooses the earlier one. Then, as an identity for (⊕), ∅ must be the future that never arrives. (So T̂ must have an upper bound.)

instancesem Monoid (Future a) where
  force ∅ = (maxBound, ⊥)
  force (ua ⊕ ub) = if ta ≤ tb then ua else ub
    where
      (ta, _) = force ua
      (tb, _) = force ub

(This definition does not correspond to the standard monoid instance on pairs, so force is not a monoid morphism.)

Note that this Monoid instance (for future values) uses maxBound and min, while the Monoid instance on future times uses minBound and max.

4.5 Implementing futures

The semantics of futures can also be used as an implementation, if the type of future times, FTime (with meaning T̂), satisfies the properties encountered above:

• Ordered and bounded with lower and upper bounds of −∞ and ∞ (i.e., before and after all sample times), respectively.

• A monoid, in which ∅ = −∞ and (⊕) = max.

• To be useful, the representation must reveal partial information about times (specifically lower bounds), so that time comparisons can complete even when one of the two times is not yet fully known.

Assuming these three properties for FTime, the implementation of futures is easy, with most of the functionality derived (using a GHC language extension) from the pairing instances above.

newtype Future a = Fut (FTime, a)
  deriving (Functor, Applicative, Monad)

A Monoid instance also follows directly from the semantics in Section 4.4:

instance Monoid (Future a) where
  ∅ = Fut (maxBound, ⊥)
  -- problematic:
  ua@(Fut (ta, _)) ⊕ ub@(Fut (tb, _)) =
    if ta ≤ tb then ua else ub

This definition of (⊕) has a subtle, but important, problem. Consider computing the earliest of three futures, (ua ⊕ ub) ⊕ uc, and suppose that uc is earliest, so that tc < ta ‘min‘ tb. No matter what the representation of FTime is, the definition of (⊕) above cannot produce any information about the time of ua ⊕ ub until ta ≤ tb is determined. That test will usually be unanswerable until the earlier of those times arrives, i.e., until ta ‘min‘ tb, which (as we've supposed) is after tc.

To solve this problem, change the definition of (⊕) on futures to immediately yield a time as the (lazily evaluated) min of the two future times. Because min yields an FTime instead of a boolean, it can produce partial information about its answer from partial information about its inputs.

-- working definition:
Fut (ta, a) ⊕ Fut (tb, b) =
  Fut (ta ‘min‘ tb, if ta ≤ tb then a else b)

This new definition requires two comparison-like operations instead of one. It can be further improved by adding a single operation on future times that efficiently combines min and (≤).
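The combined operation meant here might look as follows (our sketch; the name minLE and its use in (⊕) are assumptions, not the paper's API). Both components can then be produced from the same partial information about the two times:

-- The minimum of two times, together with whether the first is the earlier
-- (or equal) one.
minLE :: Ord t => t -> t -> (t, Bool)
minLE ta tb = (ta `min` tb, ta <= tb)

-- (⊕) rewritten to make a single combined call:
-- Fut (ta, a) ⊕ Fut (tb, b) = Fut (t, if aFirst then a else b)
--   where (t, aFirst) = minLE ta tb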

4.6 Future times

Each of the three required properties of FTime (listed in Section 4.5) can be layered onto an existing type:

type FTime = Max (AddBounds (Improving Time))

The Max wrapper adds the required monoid instance while inheriting Ord and Bounded.

newtype Max a = Max a deriving (Eq ,Ord ,Bounded)

instance (Ord a, Bounded a) ⇒ Monoid (Max a) where
  ∅ = Max minBound
  Max a ⊕ Max b = Max (a ‘max‘ b)

The AddBounds wrapper adds new least and greatest elements, preserving the existing ordering.

data AddBounds a =
  MinBound | NoBound a | MaxBound deriving Eq

instance Bounded (AddBounds a) where
  minBound = MinBound
  maxBound = MaxBound

For an unfortunate technical reason, AddBounds does not derive Ord. The semantics of Haskell's deriving clause does not guarantee that min is defined in terms of min on the component types. If min is instead defined via (≤) (as currently in GHC), then partial information in the type parameter a cannot get passed through min. For this reason, AddBounds has an explicit Ord instance, given in part in Figure 1.

The final wrapper, Improving, is described in Section 9. It adds partial information to times and has min and (≤) that work with partially known values.

5. Reactive normal form

FRP's behavior and event combinators are very flexible. For instance, in b0 ‘switcher‘ e, the phases (b0, ...) themselves may be reactive, either as made by switcher, or by fmap or (<∗>) applied to reactive behaviors. This flexibility is no trouble at all for


instance Ord a ⇒ Ord (AddBounds a) where
  MinBound  ‘min‘ _         = MinBound
  _         ‘min‘ MinBound  = MinBound
  NoBound a ‘min‘ NoBound b = NoBound (a ‘min‘ b)
  u         ‘min‘ MaxBound  = u
  MaxBound  ‘min‘ v         = v
  -- similarly for (≤) and max

Figure 1. Ord instance for the AddBounds type

the function-based semantics in Section 2, but how can we find our way to an efficient, data-driven implementation?

Observed over time, a reactive behavior consists of a sequence of non-reactive phases, punctuated by events. Suppose behaviors can be viewed or represented in a form that reveals this phase structure explicitly. Then monotonic behavior sampling could be implemented efficiently by stepping forward through this sequence, sampling each phase until the next one begins. For constant phases (a common case), sampling would then be driven entirely by relevant event occurrences.

Definition: A behavior-valued expression is in reactive normal form (RNF) if it has the form b ‘switcher‘ e, where the lead behavior b is non-reactive, i.e., has no embedded switcher (or combinators defined via switcher), and the behaviors in e are also in RNF.

For instance, b can be built up from pure, time, fmap, and (<∗>). To convert arbitrary behavior expressions into RNF, one can provide equational rewrite rules that move switchers out of switcher heads, out of fmap, (<∗>), etc., and prove the correctness of these equations from the semantics in Section 2. For example,

fmap f (b ‘switcher‘ e) ≡ fmap f b ‘switcher‘ fmap f e

The rest of this paper follows a somewhat different path, inspired by this rewriting idea, defining an RNF-based representation.

5.1 Decoupling discrete and continuous change

FRP makes a fundamental, type-level distinction between events and behaviors, i.e., between discrete and continuous. Well, not quite. Although (reactive) behaviors are defined over continuous time, they are not necessarily continuous. For instance, a behavior that counts key-presses changes only discretely. Let's further tease apart the discrete and continuous aspects of behaviors into two separate types. Call the purely discrete part a "reactive value" and the continuous part a "time function". FRP's notion of reactive behavior decomposes neatly into these two simpler notions.

Recall from Section 1 that continuous time is one of the reasons for choosing pull-based evaluation, despite the typical inefficiency relative to push-based. As we will see, reactive values can be evaluated in push style, leaving pull for time functions. Recomposing reactive values and time functions yields an RNF representation for reactive behaviors that reveals their phase structure. The two separate evaluation strategies combine to produce an efficient and simple hybrid strategy.

5.2 Reactive values

A reactive value is like a reactive behavior but is restricted to changing discretely. Its meaning is a step function, which is fully defined by its initial value and discrete changes, with each change defined by a time and a value. Together, these changes correspond exactly to a FRP event, suggesting a simple representation:

data Reactive a = a ‘Stepper‘ Event a

The meaning of a reactive value is given via translation into a reactive behavior, using stepper:

rat :: Reactive a → Ba

rat (a0 ‘Stepper‘ e) = at (a0 ‘stepper‘ e)
                     = λt → last (a0 : before (occs e) t)

where before is as defined in Section 2.3.

With the exception of time, all behavior operations in Section 2 (as well as others not mentioned there) produce discretely-changing behaviors when given discretely-changing behaviors. Therefore, all of these operations (excluding time) have direct counterparts for reactive values. In addition, reactive values form a monad.

stepperR  :: a → Event a → Reactive a
switcherR :: Reactive a → Event (Reactive a)
          → Reactive a

instance Functor     Reactive
instance Applicative Reactive
instance Monad       Reactive

The semantic function, rat, is a morphism on Functor, Applicative, and Monad:

instance_sem Functor Reactive where
  rat (fmap f b) = fmap f (rat b)
                 = f ◦ rat b

instance_sem Applicative Reactive where
  rat (pure a) = pure a
               = const a

  rat (rf <∗> rx) = rat rf <∗> rat rx
                  = λt → (rf ‘rat‘ t) (rx ‘rat‘ t)

instance_sem Monad Reactive where
  rat (return a) = return a
                 = const a

  rat (r >>= k) = rat r >>= rat ◦ k
                = λt → (rat ◦ k) (rat r t) t
                = λt → rat (k (rat r t)) t

The join operation may be a bit easier to follow than (>>=).

rat (joinR rr) = join (fmap rat (rat rr))
               = join (rat ◦ rat rr)
               = λt → rat (rat rr t) t

Sampling joinR rr at time t then amounts to sampling rr at t to get a reactive value r, which is itself sampled at t.

5.3 Time functions

Between event occurrences, a reactive behavior follows a non-reactive function of time. Such a time function is most directly and simply represented literally as a function. However, functions are opaque at run-time, preventing optimizations. Constant functions are particularly helpful to recognize, in order to perform dynamic constant propagation, as in (Elliott 1998a; Nilsson 2005). A simple data type suffices for recognizing constants.

data Fun t a = K a | Fun (t → a)

The semantics is given by a function that applies a Fun to an argument. All other functionality can be neatly packaged, again, in instances of standard type classes, as shown in Figure 2. There is a similar instance for Arrow as well. The semantic function, apply, is a morphism with respect to each of these classes.

Other optimizations could be enabled in a similar way. For instance, generalize the K constructor to polynomials (adding a Num constraint for t). Such a representation could support precise and efficient differentiation and integration and prediction of


data Fun t a = K a | Fun (t → a)

apply :: Fun t a → (t → a)  -- semantic function
apply (K a)   = const a
apply (Fun f) = f

instance Functor (Fun t) where
  fmap f (K a)   = K (f a)
  fmap f (Fun g) = Fun (f ◦ g)

instance Applicative (Fun t) where
  pure = K
  K f <∗> K x = K (f x)
  cf  <∗> cx  = Fun (apply cf <∗> apply cx)

instance Monad (Fun t) where
  return = pure
  K a   >>= h = h a
  Fun f >>= h = Fun (f >>= apply ◦ h)

Figure 2. Constant-optimized functions

some synthetic events based on root-finding (e.g., some object collisions). The opacity of the function arguments used with fmap and arr would, however, limit analysis.
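To make the idea concrete, here is a small sketch of what such a polynomial-generalized representation might look like. This is our illustration, not the paper's code: FunP and applyP are hypothetical names, and for simplicity the time and value types are identified so that a single Num constraint suffices.

data FunP a = Poly [a]          -- c0 + c1*t + c2*t^2 + ...
            | FunP (a -> a)

applyP :: Num a => FunP a -> (a -> a)
applyP (Poly cs) = \t -> foldr (\c acc -> c + t * acc) 0 cs   -- Horner evaluation
applyP (FunP f)  = f

Differentiation and integration of the Poly case would then be simple list manipulations, while the FunP case remains opaque, as noted above.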

5.4 Composing

Reactive values capture the purely discrete aspect of reactive behaviors, while time functions capture the purely continuous. Combining them yields a representation for reactive behaviors.

type Behavior = Reactive ◦ Fun Time

Type composition can be defined as follows:

newtype (h ◦ g) a = O (h (g a))

Functors compose into functors, and applicative functors into applicative functors (McBride and Paterson 2008).

instance (Functor h, Functor g)
       ⇒ Functor (h ◦ g) where
  fmap f (O hga) = O (fmap (fmap f) hga)

instance (Applicative h, Applicative g)
       ⇒ Applicative (h ◦ g) where
  pure a = O (pure (pure a))
  O hgf <∗> O hgx = O (liftA2 (<∗>) hgf hgx)

The semantics of behaviors combines the semantics of its two components.

at :: Behavior a → Ba

at (O rf) = join (fmap apply (rat rf))
          = λt → apply (rat rf t) t

More explicitly,

O (f ‘Stepper‘ e) ‘at‘ t = last (f : before (occs e) t) t

This last form is almost identical to the semantics of switcher in Section 2.3.

This representation of behaviors encodes reactive normal form, but how expressive is it? Are all of the Behavior combinators covered, or do some stray outside of RNF?

The time combinator is non-reactive, i.e., purely a function of time:

time = O (pure (Fun id))

The Functor and Applicative instances are provided automatically from the instances for type composition (above), given the instances for Reactive and Fun (specified in Section 5 and to be defined in Section 7). Straightforward but tedious calculations show that time and the Functor and Applicative instances have the semantics specified in Section 2.

I doubt that there is a Monad instance. While the semantic domain B is a monad, I think its join surpasses the meanings that can be represented as reactive time functions. For purely discrete applications, however, reactive behaviors can be replaced by reactive values, including the Monad functionality.

6. Another angle on events

The model of events we've been working with so far is time-ordered lists of future values, where a future value is a time/value pair: [(t0, a0), (t1, a1), ... ]. If such an occurrence list is nonempty, another view on it is as a time t0, together with a reactive value having initial value a0 and event with occurrences [(t1, a1), ... ]. If the occurrence list is empty, then we could consider it to have initial time ∞ (maxBound), and reactive value of ⊥. Since a future value is a time and value, it follows that an event (empty or nonempty) has the same content as a future reactive value. This insight leads to a new representation of functional events:

-- for non-decreasing times
newtype Event a = Ev (Future (Reactive a))

With this representation, the semantic function on events peels off one time and value at a time.

occs :: Event a → Ea

occs (Ev (Fut (∞ , _)))              = []
occs (Ev (Fut (ta, a ‘Stepper‘ e′))) = (ta, a) : occs e′

Why use this representation of events instead of directly mimicking the semantic model E? The future-reactive representation will be convenient in defining Applicative and Monad instances below. It also avoids a subtle problem similar to the issue of comparing future times using (≤), discussed in Section 4.5. The definition of merge in Section 2.2.1 determines that an event has no more occurrences by testing the list for emptiness. Consider filtering out some occurrences of an event e. Because the emptiness test yields a boolean value, it cannot yield partial information, and will have to block until the prefiltered occurrences are known and tested. These issues are also noted in Sperber (2001).

7. Implementing operations on reactive values and events

The representations of reactive values and events are now tightly interrelated:

data    Reactive a = a ‘Stepper‘ Event a
newtype Event    a = Ev (Future (Reactive a))

These definitions, together with Section 5, make a convenient basis for implementing FRP.

7.1 Reactive values

7.1.1 Functor

As usual, fmap f applies a function f to a reactive value pointwise, which is equivalent to applying f to the initial value and to each occurrence value.

instance Functor Reactive where
  fmap f (a ‘Stepper‘ e) = f a ‘Stepper‘ fmap f e


7.1.2 Applicative

The Functor definition was straightforward, because the Stepper structure is easily preserved. Applicative is more challenging.

instance Applicative Reactive where ...

First the easy part. A pure value becomes reactive by using it as the initial value and ∅ as the (never-occurring) change event:

pure a = a ‘Stepper‘ ∅

Consider next applying a reactive function to a reactive argument:

rf@(f ‘Stepper‘ Ev uf) <∗> rx@(x ‘Stepper‘ Ev ux) =
    f x ‘Stepper‘ Ev u
  where u = ...

The initial value is f x, and the change event occurs each time either the function or the argument changes. If the function changes first, then (at that future time) apply a new reactive function to an old reactive argument:

fmap (λrf′ → rf′ <∗> rx) uf

Similarly, if the argument changes first, apply an old reactive function and a new reactive argument:

fmap (λrx′ → rf <∗> rx′) ux

Combining these two futures as alternatives:6

u = fmap (λrf′ → rf′ <∗> rx) uf ⊕
    fmap (λrx′ → rf <∗> rx′) ux

More succinctly,

u = ((<∗> rx) <$> uf) ⊕ ((rf <∗>) <$> ux)

A wonderful thing about this (<∗>) definition for Reactive is that it automatically reuses the previous value of the function or argument when the argument or function changes. This caching property is especially handy in nested applications of (<∗>), which can arise either explicitly or through liftA2, liftA3, etc. Consider u = liftA2 f r s or, equivalently, u ≡ (f <$> r) <∗> s, where r and s are reactive values, with initial values r0 and s0, respectively. The initial value u0 of u is f r0 s0. If r changes from r0 to r1, then the new value of f <$> r will be f r1, which then gets applied to s0, i.e., u1 ≡ f r1 s0. If instead s changes from s0 to s1, then u1 ≡ f r0 s1. In this latter case, the old value f r0 of f <$> r is passed on without having to be recomputed. The savings are significant for functions that do some work based on partial applications.

7.1.3 Monad

The Monad instance is perhaps most easily understood via its join:

joinR :: Reactive (Reactive a) → Reactive a

The definition of joinR is similar to (<∗>) above:

joinR ((a ‘Stepper‘ Ev ur) ‘Stepper‘ Ev urr) =
    a ‘Stepper‘ Ev u
  where u = ...

Either the inner future (ur) or the outer future (urr) will arrive first. If the inner arrives first, switch and continue waiting for the outer:

(‘switcher‘ Ev urr) <$> ur

The (<$>) here is over futures. If instead the outer future arrives first, abandon the inner and get new reactive values from the outer:

6 Recall from Section 4.1 that fmap f u arrives exactly when the future u arrives, so the (⊕)'s choice in this case depends only on the relative timing of uf and ux.

join <$> urr

Choose whichever comes first:

u = ((‘switcher‘ Ev urr) <$> ur) ⊕ (join <$> urr)

Then plug this join into a standard Monad instance:

instance Monad Reactive where
  return  = pure
  r >>= h = joinR (fmap h r)

7.1.4 Reactivity

In Section 2.3, stepper (on behaviors) is defined via switcher. For reactive values, stepperR corresponds to the Stepper constructor:

stepperR :: a → Event a → Reactive a
stepperR = Stepper

The more general switching form can be expressed in terms of stepperR and monadic join:

switcherR :: Reactive a → Event (Reactive a)
          → Reactive a

r ‘switcherR‘ er = joinR (r ‘stepperR‘ er)

7.2 Events

7.2.1 Functor

The Event functor is also easily defined. Since an event is a future reactive value, combine fmap on Future with fmap on Reactive.

instance Functor Event where
  fmap f (Ev u) = Ev (fmap (fmap f) u)

7.2.2 Monad

Assuming a suitable join for events, the Monad instance is simple:

instance Monad Event where
  return a = Ev (return (return a))
  r >>= h  = joinE (fmap h r)

This definition of return makes a regular value into an event by making a constant reactive value (return) and wrapping it up as an always-available future value (return).

The join operation collapses an event-valued event ee into an event. Each occurrence of ee delivers a new event, all of which get adjusted to ensure temporal monotonicity and merged together into a single event. The event ee can have infinitely many occurrences, each of which (being an event) can also have an infinite number of occurrences. Thus joinE has the tricky task of merging (a representation of) a sorted infinite stream of sorted infinite streams into a single sorted infinite stream. Since an event is represented as a Future, the join makes essential use of the Future monad7:

joinE :: Event (Event a) → Event a
joinE (Ev u) = Ev (u >>= eFuture ◦ g)
  where
    g (e ‘Stepper‘ ee) = e ⊕ joinE ee
    eFuture (Ev u)     = u

7.2.3 Monoid

The Monoid instance relies on operations on futures:

instance Ord t ⇒ Monoid (Event a) where
  ∅ = Ev ∅
  Ev u ⊕ Ev v = Ev (u ‘mergeu‘ v)

7 This definition is inspired by one from Jules Bean.


The never-occurring event happens in the never-arriving future. To merge two future reactive values u and v, there are again two possibilities. If u arrives first (or simultaneously), with value a0 and next future u′, then a0 will be the initial value and u′ ‘mergeu‘ v will be the next future. If v arrives first, with value b0 and next future v′, then b0 will be the initial value and u ‘mergeu‘ v′ will be the next future.

mergeu :: Future (Reactive a) → Future (Reactive a)
       → Future (Reactive a)

u ‘mergeu‘ v = (inFutR (‘mergeu‘ v) <$> u) ⊕
               (inFutR (u ‘mergeu‘) <$> v)
  where
    inFutR f (r ‘Stepper‘ Ev u′) = r ‘Stepper‘ Ev (f u′)

8. Monotonic sampling

The semantics of a behavior is a function of time. That function can be applied to time values in any order. Recall in the semantics of switcher (Section 2.3) that sampling at a time t involves searching through an event for the last occurrence before t. The more occurrences take place before t, the costlier the search. Lazy evaluation can delay computing occurrences before they're used, but once computed, these occurrences would remain in the events, wasting space to hold and time to search.

In practice, behaviors are rendered forward in time, and so are sampled with monotonically increasing times. Making this usage pattern explicit allows for much more efficient sampling.

First, let's consider reactive values and events. Assume we have a consumer for generated values:

type Sink a = a → IO ()

For instance, a sink may render a number to a GUI widget or an image to a display window. The functions sinkR and sinkE consume values as generated by events and reactive values:

sinkR :: Sink a → Reactive a → IO b
sinkE :: Sink a → Event a → IO b

The implementation is an extremely simple back-and-forth, with sinkR rendering initial values and sinkE waiting until the next event occurrence.

sinkR snk (a ‘Stepper‘ e) = snk a >> sinkE snk e

sinkE snk (Ev (Fut (tr, r))) = waitFor tr >> sinkR snk r

Except in the case of a predictable event (such as a timer), waitFor tr blocks simply in evaluating the time tr of a future event occurrence. Then when evaluation of tr unblocks, the real time is (very slightly past) tr, so the actual waitFor need not do any additional waiting.
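As a concrete illustration, here is a minimal sketch of waitFor under stated assumptions (the paper does not give its code): times are taken to be Doubles in seconds, and getTime is a clock we define here from the system time. Forcing the target time is where any real blocking happens; whatever gap remains is slept away.

import Control.Concurrent (threadDelay)
import Control.Monad (when)
import Data.Time.Clock.POSIX (getPOSIXTime)

getTime :: IO Double                      -- assumed clock: seconds since the epoch
getTime = realToFrac <$> getPOSIXTime

waitFor :: Double -> IO ()
waitFor target = do
  now <- getTime
  let remaining = target - now
  when (remaining > 0) $
    threadDelay (round (remaining * 1e6)) -- threadDelay counts microseconds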

A behavior contains a reactive value whose values are time functions, so it can be rendered using sinkR if we can come up with an appropriate sink for time functions.

sinkB :: Sink a → Behavior a → IO b
sinkB snk (O rf) = do snkF ← newTFunSink snk
                      sinkR snkF rf

The procedure newTFunSink makes a sink that consumes successive time functions. For each consumed constant function K a, the value a is rendered just once (with snk). When a non-constant function Fun f is consumed, a thread is started that repeatedly samples f at the current time and renders:

forkIO (forever (f <$> getTime >>= snk))

In either case, the constructed sink begins by killing the current rendering thread, if any. Many variations are possible, such as using a GUI toolkit's idle event instead of a thread, which has the benefit of working with thread-unsafe libraries.
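Filling in the description above, the following is a hedged sketch of newTFunSink; the paper omits its code, so the IORef-based thread bookkeeping is our assumption, as is the getTime clock (with Time taken to be Double) from the previous sketch.

import Control.Concurrent (forkIO, killThread)
import Control.Monad (forever)
import Data.IORef (newIORef, readIORef, writeIORef)

newTFunSink :: Sink a -> IO (Sink (Fun Time a))
newTFunSink snk = do
  ref <- newIORef Nothing                               -- current rendering thread, if any
  let stop = readIORef ref >>= maybe (return ()) killThread
  return $ \tf -> do
    stop                                                -- kill the previous renderer
    case tf of
      K a   -> writeIORef ref Nothing >> snk a          -- constant: render just once
      Fun f -> do tid <- forkIO (forever (f <$> getTime >>= snk))
                  writeIORef ref (Just tid)             -- sample and render repeatedly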

9. Improving values

The effectiveness of future values, as defined in Section 4, depends on a type wrapper Improving, which adds partial information in the form of lower bounds. This information allows a time comparison ta ≤ tb to succeed when the earlier of ta and tb arrives instead of the later. It also allows ta ‘min‘ tb to start producing lower bound information before either of ta and tb is known precisely.

Fortunately, exactly this notion was invented, in a more general setting, by Warren Burton. “Improving values” (Burton 1989, 1991) provide a high-level abstraction for parallel functional programming with determinate semantics.

An improving value (IV) can be represented as a list of lower bounds, ending in the exact value. An IV representing a simple value (the exactly function used in Section 4.6) is a singleton list (no lower bounds). See (Burton 1991, Figure 3) for details.
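For illustration, here is a minimal sketch of that list-based representation. It is ours, not Burton's code, and the names ImprovingL, exactlyL, exactL and maxL are hypothetical; Burton (1991, Figure 3) gives the real thing.

data ImprovingL a = Exact a | Bound a (ImprovingL a)

exactlyL :: a -> ImprovingL a            -- a singleton list: no lower bounds
exactlyL = Exact

exactL :: ImprovingL a -> a              -- wait through the bounds for the exact value
exactL (Exact a)   = a
exactL (Bound _ i) = exactL i

-- max can emit a lower bound as soon as it has one from each argument,
-- before either exact value is known.
maxL :: Ord a => ImprovingL a -> ImprovingL a -> ImprovingL a
maxL (Exact a)   (Exact b)   = Exact (a `max` b)
maxL (Exact a)   (Bound b j) = Bound (a `max` b) (maxL (Exact a) j)
maxL (Bound a i) (Exact b)   = Bound (a `max` b) (maxL i (Exact b))
maxL (Bound a i) (Bound b j) = Bound (a `max` b) (maxL i j)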

Of course the real value of the abstraction comes from the presence of lower bounds. Sometimes those bounds come from max, but for future times, the bounds will come to be known over time. One possible implementation of future times would involve Concurrent Haskell channels (Peyton Jones et al. 1996).

getChanContents :: Chan a → IO [a]

The idea is to make a channel, invoke getChanContents, and wrap the result as an IV. Later, lower bounds and (finally) an exact value are written into the channel. When a thread attempts to look beyond the most recent lower bound, it blocks. For this reason, this simple implementation of improving values must be supplied with a steady stream of lower bounds, which in the setting of FRP correspond to event non-occurrences.
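A hedged sketch of that scheme, reusing the list-based ImprovingL type from the sketch above (the names here are ours, not the paper's): writers push lower bounds (Left) and finally the exact value (Right), and the lazy list obtained from getChanContents becomes the improving value.

import Control.Concurrent.Chan (newChan, writeChan, getChanContents)

newImprovingIO :: IO (IO (ImprovingL a), Either a a -> IO ())
newImprovingIO = do
  ch <- newChan
  let toIV (Left b  : rest) = Bound b (toIV rest)      -- another lower bound
      toIV (Right x : _   ) = Exact x                  -- the exact value
      toIV []               = error "channel ended without an exact value"
      readIV = toIV <$> getChanContents ch
  return (readIV, writeChan ch)

A reader that looks past the most recently written bound blocks inside getChanContents, which is exactly the behaviour described above.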

Generating and manipulating numerous lower bounds is a significant performance drawback in the purely functional implementation of IVs. A more efficient implementation, developed next, thus benefits FRP and other uses of IVs.

10. Improving on improving values

In exploring how to improve over the functional implementation of improving values, let's look at how future times are used.

• Sampling a reactive value requires comparing a sample time t with a future time tr′.

• Choosing the earlier of two future values ((⊕) from Section 4) uses min and (≤) on future times.

Imagine that we can efficiently compare an improving value with an arbitrary known (exact) value:8

compareI :: Ord a ⇒ Improving a → a → Ordering

How might we use compareI to compare two future times, e.g., testing ta ≤ tb? We could either extract the exact time from ta and compare it with tb, or extract the exact time from tb and compare it with ta. These two methods produce the same information but usually not at the same time, so let's choose the one that can answer most promptly. If indeed ta ≤ tb, then the first method will likely succeed more promptly, and otherwise the second method. The dilemma in choosing is that we have to know the answer before we can choose the best method for extracting that answer.

Like many dilemmas, this one results from either/or thinking. A third alternative is to try both methods in parallel and just use

8 The Haskell Ordering type contains LT, EQ, and GT to represent less-than, equal-to, and greater-than.


whichever result arrives first. Assume for now the existence of an “unambiguous choice” operator, unamb, that will try two methods to solve a problem and return whichever one succeeds first. The two methods are required to agree when they both succeed, for semantic determinacy. Then

ta ≤ tb = ((ta ‘compareI‘ exact tb) ≢ GT) ‘unamb‘
          ((tb ‘compareI‘ exact ta) ≢ LT)

Next consider ta ‘min‘ tb. The exact value can be extracted from the exact values of ta and tb, or from (≤) on IVs:

exact (ta ‘min‘ tb) = exact ta ‘min‘ exact tb
                    = exact (if ta ≤ tb then ta else tb)

How can we compute (ta ‘min‘ tb) ‘compareI‘ t for an arbitrary exact value t? The answer is ta ‘compareI‘ t if ta ≤ tb, and tb ‘compareI‘ t otherwise. However, this method, by itself, misses an important opportunity. Suppose both of these tests can yield answers before it's possible to know whether ta ≤ tb. If the answers agree, then we can use that answer immediately, without waiting to learn whether ta ≤ tb.

With these considerations, a new representation for IVs suggests itself. Since the only two operations we need on IVs are exact and compareI, use those two operations as the IV representation. Figure 3 shows the details, with unamb and asAgree defined in Section 11. Combining (≤) and min into minLE allows for a simple optimization of future (⊕) from Section 4.5.

11. Unambiguous choice

The representation of improving values in Section 10 relies on an “unambiguous choice” operator with determinate semantics and an underlying concurrent implementation.

-- precondition: compatible arguments
unamb :: a → a → a

In order to preserve simple, determinate semantics, unamb may only be applied to arguments that agree where defined.

compatible a b = (a ≡ ⊥ ∨ b ≡ ⊥ ∨ a ≡ b)

unamb yields the more-defined of the two arguments.

∀a b. compatible a b ⇒ unamb a b = a ⊔ b

Operationally, unamb forks two threads and evaluates one argument in each. When one thread finishes, its computed value is returned.

Figure 4 shows one way to implement unamb, in terms of an ambiguous choice operator, amb. The latter, having indeterminate (ambiguous) semantics, is in the IO type, using race to run two concurrent threads. For inter-thread communication, the race function uses a Concurrent Haskell MVar (Peyton Jones et al. 1996) to hold the computed value. Each thread tries to execute an action and write the resulting value into the shared MVar. The takeMVar operation blocks until one of the threads succeeds, after which both threads are killed (one perhaps redundantly).9 This unamb implementation fails to address an important efficiency concern. When one thread succeeds, there is no need to continue running its competitor. Moreover, the competitor may have spawned many other threads (due to nested unamb), all of which are contributing toward work that is no longer relevant.

The assuming function makes a conditional strategy for computing a value. If the assumption is false, the conditional strategy yields ⊥ via hang, which blocks a thread indefinitely, while

9 My thanks to Spencer Janssen for help with this implementation.

-- An improving value. Invariant:
--   compareI iv ⊒ compare (exact iv)
data Improving a =
  Imp { exact :: a, compareI :: a → Ordering }

exactly :: Ord a ⇒ a → Improving a
exactly a = Imp a (compare a)

instance Eq a ⇒ Eq (Improving a) where
  Imp a _ ≡ Imp b _ = a ≡ b

instance Ord a ⇒ Ord (Improving a) where
  s ≤ t     = snd (s ‘minLE‘ t)
  s ‘min‘ t = fst (s ‘minLE‘ t)
  s ‘max‘ t = fst (s ‘maxLE‘ t)

-- Efficient combination of min and (≤)
minLE :: Ord a ⇒ Improving a → Improving a
      → (Improving a, Bool)
Imp u uComp ‘minLE‘ Imp v vComp =
    (Imp uMinV wComp, uLeqV)
  where
    uMinV = if uLeqV then u else v

    -- u ≤ v: Try u ‘compare‘ v and v ‘compare‘ u.
    uLeqV = (uComp v ≢ GT) ‘unamb‘ (vComp u ≢ LT)

    minComp = if uLeqV then uComp else vComp

    -- (u ‘min‘ v) ‘compare‘ t: Try comparing according to
    -- whether u ≤ v, or use either answer if they agree.
    wComp t = minComp t ‘unamb‘
              (uComp t ‘asAgree‘ vComp t)

-- Efficient combination of max and (≥)
maxLE :: Ord a ⇒ Improving a → Improving a
      → (Improving a, Bool)
-- ... similarly ...

Figure 3. Improved improving values

consuming negligible resources and generating no error. One use of assuming is to define asAgree, which was used in Figure 3.

12. Additional functionality

All of the usual FRP functionality can be supported, including the following.

Integration  Numeric integration requires incremental sampling for efficiency, replacing the apply interface from Section 5.3 by applyK from Section 8. The residual time function returned by applyK remembers the previous sample time and value, so the next sampling can do a (usually) small number of integration steps. (For accuracy, it is often desirable to take more integration steps than samples.) Integration of reactive behaviors can work simply by integrating each non-reactive phase (a time function) and accumulating the result, thanks to the interval-additivity property of definite integration (∫_a^c f ≡ ∫_a^b f + ∫_b^c f).

Accumulation  Integration is continuous accumulation on behaviors. The combinators accumE and accumR discretely accumulate the results of event occurrences.

accumR :: a → Event (a → a) → Reactive a
accumE :: a → Event (a → a) → Event a


-- Unambiguous choice on compatible arguments.
unamb :: a → a → a
a ‘unamb‘ b = unsafePerformIO (a ‘amb‘ b)

-- Ambiguous choice, no precondition.
amb :: a → a → IO a
a ‘amb‘ b = evaluate a ‘race‘ evaluate b

-- Race two actions in separate threads.
race :: IO a → IO a → IO a
a ‘race‘ b =
  do v  ← newEmptyMVar
     ta ← forkIO (a >>= putMVar v)
     tb ← forkIO (b >>= putMVar v)
     x  ← takeMVar v
     killThread ta
     killThread tb
     return x

-- Yield a value if a condition is true.
assuming :: Bool → a → a
assuming c a = if c then a else bottom

-- The value of agreeing values (or bottom)
asAgree :: Eq a ⇒ a → a → a
a ‘asAgree‘ b = assuming (a ≡ b) a

-- Never yield an answer. Identity for unamb.
bottom :: a
bottom = unsafePerformIO hangIO

-- Block forever, cheaply
hangIO :: IO a
hangIO = do forever (threadDelay maxBound)
            return ⊥

Figure 4. Reference (inefficient) unamb implementation

Each occurrence of the event argument yields a function to be applied to the accumulated value.

a ‘accumR‘ e       = a ‘stepperR‘ (a ‘accumE‘ e)
a ‘accumE‘ (Ev ur) = Ev (h <$> ur)
  where
    h (f ‘Stepper‘ e′) = f a ‘accumR‘ e′

Filtering  It's often useful to filter event occurrences, keeping some occurrences and dropping others. The Event monad instance allows a new, simple and very general definition that includes event filtering as a special case. One general filtering tool consumes Maybe values, dropping each Nothing and unwrapping each Just.10

joinMaybes :: MonadPlus m ⇒ m (Maybe a) → m a
joinMaybes = (>>= maybe mzero return)

The MonadPlus instance for Event uses mzero = ∅ and mplus = (⊕). The more common FRP event filter has the following simple generalization:

filterMP :: MonadPlus m ⇒ (a → Bool) → m a → m a
filterMP p m = joinMaybes (liftM f m)
  where
    f a | p a       = Just a
        | otherwise = Nothing
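For instance (our example, relying on the Event MonadPlus instance just described), keeping only the digit occurrences of a character-valued event:

import Data.Char (isDigit)

digits :: Event Char -> Event Char
digits = filterMP isDigit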

10 My thanks to Cale Gibbard for this succinct formulation.

13. Related work

The most closely related FRP implementation is the one underlying the Lula system for design and control of lighting, by Mike Sperber (2001). Like the work described above, Lula-FRP eliminated the overhead of creating and processing the large numbers of event non-occurrences that have been present, in various guises, in almost all other FRP implementations. Mike noted that the pull-based event interface that motivates these non-occurrences also imposes a reaction latency bounded by the polling frequency, which detracts noticeably from the user experience. To eliminate non-occurrences and the resulting overhead and latency, he examined and addressed subtle issues of events and thread blocking, corresponding to those discussed in Section 4.5. Mike's solution, like the one described in Section 10 above, involved a multi-threaded implementation. However, it did not guarantee semantic determinism, in case of simultaneous or nearly-simultaneous event occurrences. The implementation of event operations was rather complex, especially for event merging. The supporting abstractions used above (future values, improving values, and unambiguous choice) seem to be helpful in taming that complexity. Lula-FRP's behaviors still used a pure pull interface, so the latency solution was limited to direct use of events rather than reactive behaviors. The reactive value abstraction used above allows behavior reactions at much lower latency than the sampling period. Unlike most published FRP implementations, Lula-FRP was implemented in a strict language (Scheme). For that reason, it explicitly managed details of laziness left implicit in Haskell-based implementations.

“Event-Driven FRP” (E-FRP) (Wan et al. 2002) also has similar goals. It focused on event-driven systems, i.e., ones in which limited work is done in reaction to an event, while most FRP implementations repeatedly re-evaluate the whole system, whether or not there are relevant changes. Like RT-FRP (Wan et al. 2001), expressiveness is restricted in order to make guarantees about resource-bounded execution. The original FRP model of continuous time is replaced by a discrete model. Another restriction compared with the semantics of the original FRP (preserved in this paper) is that events are not allowed to occur simultaneously.

Peterson et al. (2000) explored opportunities for parallelism in implementing a variation of FRP. While the underlying semantic model was not spelled out, it seems that semantic determinacy was not preserved, in contrast to the semantically determinate concurrency used in this paper (Section 11).

Nilsson (2005) presented another approach to FRP optimization. The key idea was to recognize and efficiently handle several FRP combinator patterns. In some cases, the standard Haskell type system was inadequate to capture and exploit these patterns, but generalized algebraic data types (GADTs) were sufficient. These optimizations proved worthwhile, though they did introduce significant overhead in run-time (pattern matching) and code complexity. In contrast, the approach described in the present paper uses very simple representations and unadventurous, Hindley-Milner types. Another considerable difference is that (Nilsson 2005) uses an arrow-based formulation of FRP, as in Fruit (Courtney and Elliott 2001) and Yampa (Nilsson et al. 2002). The nature of the Arrow interface is problematic for the goal of minimal re-evaluation. Input events and behaviors get combined into a single input, which then changes whenever any component changes. Moreover, because the implementation style was demand-driven, event latency was still tied to sampling rate.

FranTk is a GUI library containing FRP concepts but mixing in some imperative semantics (Sage 2000). Its implementation was based on an experimental data-driven FRP implementation (Elliott 1998b), which was itself inspired by Pidgets++ (Scholz and Bokowski 1996). Pidgets++ used functional values interactively recomputed in a data-driven manner via one-way constraints. None


of these three systems supported continuous time, nor implemented a pure FRP semantics.

At first blush, one might think that an imperative implementation could accomplish what we set out to do in this paper. For instance, there could be imperative call-backs associated with methods that side-effect some sort of dependency graph. As far as I know, no such implementation has achieved (nor probably could achieve) FRP's (determinate) merge semantics for ordered receipt of simultaneous occurrences (which happens easily with compositional events) or even nearly-simultaneous occurrences. Imperative implementations are quite distant from semantics, hence hard to verify or trust. In contrast, the functional implementation in this paper evolves from the semantics.

In some formulations of FRP, simultaneous occurrences are eliminated or merged (Nilsson et al. 2002; Wan and Hudak 2000; Wan et al. 2001), while this paper retains such occurrences as distinct. In some cases, the elimination or merging was motivated by a desire to reduce behaviors and events to a single notion. This desire is particularly compelling in the arrow-based FRP formulations, which replace behaviors (or “signals”) and events with a higher level abstraction of “signal transformers”. Although simultaneity is very unlikely for (distinct) purely physical events, it can easily happen with FRP's compositional events.

14. Future work

• Much more testing, measurement, and tuning is needed in order to pragmatically and quantitatively evaluate the implementation techniques described in this paper, especially the new implementation of improving values described in Section 10. How well do the techniques work in a complex application?

• Can these ideas be transplanted to arrow-based formulations of FRP? How can changes from separately-changing inputs be kept from triggering unnecessary computation, when the arrow formulations seem to require combining all inputs into a single varying value?

• Explore other uses of the unambiguous choice operator defined in Section 11, and study its performance, including the kinds of parallel search algorithms for which improving values were invented (Burton 1989, 1991).

• Experiment with relaxing the assumption of temporal monotonicity exploited in Section 8. For instance, a zipper representation for bidirectional sampling could allow efficient access to nearby past event occurrences as well as future ones. Such a representation may be efficient in time though leaky in space.

• Type class morphisms are used to define the semantics of every key type in this paper except for events. Can this exception be eliminated?

• Since reactive values are purely data, they cache “for free”. In contrast, time functions (Section 5.3) have a partly function representation. Is there an efficiently caching representation?

15. Acknowledgments

I'm grateful to Mike Sperber for the conversation that inspired the work described in this paper, as well as his help understanding Lula-FRP. My thanks also to the many reviewers and readers of previous drafts for their helpful comments.

References

F. Warren Burton. Indeterminate behavior with determinate semantics in parallel programs. In International Conference on Functional Programming Languages and Computer Architecture, pages 340–346. ACM, 1989.

F. Warren Burton. Encapsulating nondeterminacy in an abstract data type with deterministic semantics. Journal of Functional Programming, 1(1):3–20, January 1991.

Antony Courtney and Conal Elliott. Genuinely functional user interfaces. In Haskell Workshop, September 2001.

Conal Elliott. A brief introduction to ActiveVRML. Technical Report MSR-TR-96-05, Microsoft Research, 1996. URL http://conal.net/papers/ActiveVRML/.

Conal Elliott. Functional implementations of continuous modeled animation. In Proceedings of PLILP/ALP, 1998a.

Conal Elliott. An imperative implementation of functional reactive animation. Unpublished draft, 1998b. URL http://conal.net/papers/new-fran-draft.pdf.

Conal Elliott. Denotational design with type class morphisms. Technical Report 2009-01, LambdaPix, March 2009. URL http://conal.net/papers/type-class-morphisms.

Conal Elliott and Paul Hudak. Functional reactive animation. In International Conference on Functional Programming, 1997.

Saunders Mac Lane. Categories for the Working Mathematician. Graduate Texts in Mathematics. Springer, September 1998.

Conor McBride and Ross Paterson. Applicative programming with effects. Journal of Functional Programming, 18(1):1–13, 2008.

Henrik Nilsson. Dynamic optimization for functional reactive programming using generalized algebraic data types. In International Conference on Functional Programming, pages 54–65. ACM Press, 2005.

Henrik Nilsson, Antony Courtney, and John Peterson. Functional reactive programming, continued. In Haskell Workshop, pages 51–64. ACM Press, October 2002.

John Peterson, Paul Hudak, and Conal Elliott. Lambda in motion: Controlling robots with Haskell. In Practical Aspects of Declarative Languages, 1999.

John Peterson, Valery Trifonov, and Andrei Serjantov. Parallel functional reactive programming. Lecture Notes in Computer Science, 1753, 2000.

Simon Peyton Jones, Andrew Gordon, and Sigbjorn Finne. Concurrent Haskell. In Symposium on Principles of Programming Languages, January 1996.

Meurig Sage. FranTk – a declarative GUI language for Haskell. In International Conference on Functional Programming, pages 106–118. ACM Press, September 2000.

Enno Scholz and Boris Bokowski. PIDGETS++ – a C++ framework unifying PostScript pictures, GUI objects, and lazy one-way constraints. In Conference on the Technology of Object-Oriented Languages and Systems. Prentice-Hall, 1996.

Michael Sperber. Computer-Assisted Lighting Design and Control. PhD thesis, University of Tübingen, June 2001.

Philip Wadler. Comprehending monads. In Conference on LISP and Functional Programming, pages 61–78. ACM, 1990.

Zhanyong Wan and Paul Hudak. Functional reactive programming from first principles. In Conference on Programming Language Design and Implementation, 2000.

Zhanyong Wan, Walid Taha, and Paul Hudak. Real-time FRP. In International Conference on Functional Programming, 2001.

Zhanyong Wan, Walid Taha, and Paul Hudak. Event-driven FRP. In Practical Aspects of Declarative Languages, January 2002.


Unembedding Domain-Specific Languages

Robert Atkey Sam Lindley Jeremy Yallop

LFCS, School of Informatics, The University of Edinburgh

{bob.atkey,sam.lindley,jeremy.yallop}@ed.ac.uk

Abstract

Higher-order abstract syntax provides a convenient way of embedding domain-specific languages, but is awkward to analyse and manipulate directly.

We explore the boundaries of higher-order abstract syntax. Our key tool is the unembedding of embedded terms as de Bruijn terms, enabling intensional analysis. As part of our solution we present techniques for separating the definition of an embedded program from its interpretation, giving modular extensions of the embedded language, and different ways to encode the types of the embedded language.

Categories and Subject Descriptors  D.1.1 [Programming Techniques]: Applicative (functional) programming

General Terms  Languages, Theory

Keywords  domain-specific languages, higher-order abstract syntax, type classes, unembedding

1. Introduction

Embedding a domain-specific language (DSL) within a host language involves writing a set of combinators in the host language that define the syntax and semantics of the embedded language. Haskell plays host to a wide range of embedded DSLs, including languages for database queries [Leijen and Meijer 1999], financial contracts [Peyton Jones et al. 2000], parsing [Leijen and Meijer 2001], web programming [Thiemann 2002], production of diagrams [Kuhlmann 2001] and spreadsheets [Augustsson et al. 2008].

An embedded language has two principal advantages over a stand-alone implementation. First, using the syntax and semantics of the host language to define those of the embedded language reduces the burden on both the implementor (who does not need to write a parser and interpreter from scratch) and the user (who does not need to learn an entirely new language and toolchain). Second, integration of the embedded language — with the host language, and with other DSLs — becomes almost trivial. It is easy to see why one might wish to use, say, languages for web programming and database queries within a single program; if both are implemented as embeddings into Haskell then integration is as straightforward as combining any other two libraries.

Perhaps the most familiar example of an embedded DSL is the monadic language for imperative programming that is part of the


Haskell standard library. A notable feature of the monadic language is the separation between the definition of the symbols of the language, which are introduced as the methods of the Monad type class, and the interpretation of those symbols, given as instances of the class. This approach enables a range of interpretations to be associated with a single language — a contrast to the embedded languages enumerated earlier, which generally each admit a single interpretation.

If the embedded language supports binding, a number of difficulties may arise. The interface to the embedded language must ensure that there are no mismatches between bindings and uses of variables (such as attempts to use unbound or incorrectly-typed variables); issues such as substitution and alpha-equivalence introduce further subtleties. Higher-order abstract syntax [Pfenning and Elliott 1988] (HOAS) provides an elegant solution to these difficulties. HOAS uses the binding constructs of the host language to provide binding in the embedded language, resulting in embedded language binders that are easy both to use and to interpret.

However, while HOAS provides a convenient interface to an embedded language, it is a less convenient representation for encoding certain analyses. In particular, it is difficult to perform intensional analyses such as closure conversion or the shrinking reductions optimisation outlined in Section 2.4, as the representation is constructed from functions, which cannot be directly manipulated.

It is clear that higher-order abstract syntax and inductive term representations each have distinct advantages for embedded languages. Elsewhere, the first author provides a proof that the higher-order abstract syntax representation of terms is isomorphic to an inductive representation [Atkey 2009a]. Here we apply Atkey's result, showing how to convert between the two representations, and so reap the benefits of both.

We summarise the contents and contributions of this paper as follows:

• We start in Section 2 with an embedding of the untyped λ-calculus, using the parametric polymorphic representation of higher-order abstract syntax terms. This representation was advocated by Washburn and Weirich [2008], but dates back to at least Coquand and Huet [1985]. We show how to convert this representation to a concrete de Bruijn one, using the mapping defined in Atkey [2009a]. This allows more straightforward expression of intensional analyses, such as the shrinking reductions optimisation.

We then examine the proof of the isomorphism between the HOAS and de Bruijn representations in more detail to produce an almost fully well-typed conversion between the Haskell HOAS type and a GADT representing well-formed de Bruijn terms. Interestingly, well-typing of this conversion relies on the parametricity of Haskell's polymorphism, and so even complex extensions to Haskell's type system, such as dependent types, would not be able to successfully type this translation. Our first


main contribution is the explanation and translation of the proof into Haskell.

• Our representation of embedded languages as type classes is put to use in Section 3, where we show how to modularly construct embedded language definitions. For example, we can independently define language components such as the λ-calculus, booleans and arithmetic. Our second main contribution is to show how to extend an embedded language with flexible pattern matching and how to translate back-and-forth to well-formed de Bruijn terms.

• Having explored the case for untyped languages we turn to typed languages in Section 4. We carefully examine the issue of how embedded language types are represented, and work to ensure that type variables used in the representation of embedded language terms do not leak into the embedded language itself. Thus we prevent exotically typed terms as well as exotic terms in our HOAS representation. As far as we are aware, this distinction has not been noted before by other authors using typed HOAS, e.g. [Carette et al. 2009]. Our third main contribution is the extension of the well-typed conversion from HOAS to de Bruijn to the typed case, identifying where we had to circumvent the Haskell typechecker. Another contribution is the identification and explanation of exotically typed terms in Church encodings, a subject we feel deserves further study.

• Our final contributions are two larger examples in Section 5: unembedding of mobile code from a convenient higher-order abstract syntax representation, and an embedding of the Nested Relational Calculus via higher-order abstract syntax.

• Section 6 surveys related work.

The source file for this paper is a literate Haskell program. The extracted code and further examples are available at the following URL: http://homepages.inf.ed.ac.uk/ratkey/unembedding/.

2. Unembedding untyped languages

We first explore the case for untyped embedded languages. Even without types at the embedded language level, an embedding of this form is not straightforward, due to the presence of variable binding and α-equivalence in the embedded language. We start by showing how to handle the prototypical language with binding.

2.1 Representing the λ-calculus

Traditionally, the λ-calculus is presented with three term formers: variables, λ-abstractions and applications. Since we are using the host language to represent embedded language variables, we reduce the term formers to two, and place them in a type class:

class UntypedLambda exp where
  lam :: (exp → exp) → exp
  app :: exp → exp → exp

To represent closed terms, we abstract over the type variable exp, where exp is an instance of UntypedLambda:

type Hoas = ∀exp. UntypedLambda exp ⇒ exp

Encoding a given untyped λ-calculus term in this representation becomes a matter of taking the term you first thought of, inserting lams and apps into the correct places, and using Haskell's own binding and variables for binding and variables in the embedded language. For example, to represent the λ-calculus term λx.λy.x y, we use:

example1 :: Hoas
example1 = lam (λx → lam (λy → x ‘app‘ y))

Our host language, Haskell, becomes a macro language for our embedded language. As an example, this function creates Church numerals for any given integer:

numeral :: Integer → Hoas
numeral n = lam (λs → (lam (λz → body s z n)))
  where body s z 0 = z
        body s z n = s ‘app‘ (body s z (n-1))

Following the work of Pfenning and Elliott [1988], the use of host language binding to represent embedded language binding has also been attempted by the use of algebraic datatypes. For example, Fegaras and Sheard [1996] start from the following datatype:

data Term = Lam (Term → Term)
          | App Term Term

One can use this datatype to write down representations of terms, but Fegaras and Sheard are forced to extend this in order to define folds over the abstract syntax trees:

data Term a = Lam (Term a → Term a)
            | App (Term a) (Term a)
            | Var a

The additional constructor and type argument are used in the implementation of the fold function to pass accumulated values through. It is not intended that the Var constructor be used in user programs.

The problem with this representation is that it permits so-called exotic terms, members of the type that are not representatives of λ-calculus terms. For example:

Lam (λx → case x of
            Lam _   → x
            App _ _ → Lam (λx → x))

The body of the λ-abstraction in this “term” is either x or λx.x, depending on whether the passed in term is itself a λ-abstraction or an application. Fegaras and Sheard mitigate this problem by defining an ad-hoc type system that distinguishes between datatypes that may be analysed by cases and those that may be folded over as HOAS. The type system ensures that the Var constructor is never used by the programmer.

The advantage of the HOAS representation that we use, which was originally proposed by Coquand and Huet [1985], is that exotic terms are prohibited [Atkey 2009a] (with the proviso that infinite terms are allowed when we embed inside Haskell). In our opinion, it is better to define types that tightly represent the data we wish to compute with, and not to rely on the discipline of fallible programmers or ad-hoc extensions to the type system.

2.2 Folding over Syntax

Our representation of closed λ-terms amounts to a Church encoding of the syntax of the calculus, similar to the Church encodings of inductive datatypes such as the natural numbers. Unfolding the type Hoas, we can read it as the System F type:

Cλ = ∀α.((α → α) → α) → (α → α → α) → α

Compare this to the Church encoding of natural numbers:

Cnat = ∀α.α → (α → α) → α

For Cnat, we represent natural numbers by their fold operators. A value of type Cnat, given some type α and two constructors, one of type α and one of type α → α (which we can think of as zero and successor), must construct a value of type α. Since the type α is unknown when the value of type Cnat is constructed, we can only use these two constructors to produce a value of type α. It is this property that ensures that we only represent natural numbers.
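For comparison, here is a small Haskell rendering of these Church-encoded naturals. This is our illustration (not the paper's code); the RankNTypes extension is needed for the quantified type alias.

{-# LANGUAGE RankNTypes #-}
type CNat = forall a. a -> (a -> a) -> a

zeroC :: CNat
zeroC = \z _ -> z

succC :: CNat -> CNat
succC n = \z s -> s (n z s)

toInt :: CNat -> Integer
toInt n = n 0 (+1)            -- e.g. toInt (succC (succC zeroC)) == 2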

Likewise, for the Cλ type, we have an abstract type α, and two constructors, one for λ-abstraction and one for application. The construction for λ-abstraction is special in that there is a negative


occurrence of α in its arguments. This does not fit into the classical theory of polymorphic Church encodings, but is crucial to the HOAS representation of binding. We sketch how parametricity is used below, in Section 2.6.

As for the Church encoded natural numbers, we can treat the type Cλ as a fold operator over terms represented using HOAS. We can use this to compute over terms, as demonstrated by Washburn and Weirich [2008]. Returning to Haskell, folds over terms are expressed by giving instances of the UntypedLambda type class. For example, to compute the size of a term:

newtype Size = Size { size :: Integer }

instance UntypedLambda Size where
  lam f     = Size $ 1 + size (f (Size 1))
  x ‘app‘ y = Size $ 1 + size x + size y

getSize :: Hoas → Integer
getSize term = size term

The case for app is straightforward; the size of an application is one plus the sizes of its subterms. For a λ-abstraction, we first add one for the λ itself, then we compute the size of the body. As we represent bodies by host-language λ-abstractions we must apply them to something to get an answer. In this case the body f will have type Size → Size, so we pass in what we think the size of a variable will be, and we will get back the size of the whole subterm.
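As a small worked example, example1 represents λx.λy.x y, so getSize example1 counts two lams, one app, and one for each of the two variable occurrences:

sizeExample :: Integer
sizeExample = getSize example1   -- 5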

A more exotic instance of a fold over the syntax of a λ-term is the denotational semantics of a term, i.e. an evaluator. We first define a “domain” for the semantics of the call-by-name λ-calculus:

data Value = VFunc (Value → Value)

Now the definitions for lam and app are straightforward:

instance UntypedLambda Value where
  lam f             = VFunc f
  (VFunc f) ‘app‘ y = f y

eval :: Hoas → Value
eval term = term

2.3 Unembedding the λ-calculus

Writing computations over the syntax of our embedded language is all well and good, but there are many functions that we may wish to express that are awkward, inefficient, or maybe impossible to express as folds. However, the HOAS representation is certainly convenient for embedding embedded language terms inside Haskell, so we seek a conversion from HOAS to a form that is amenable to intensional analysis.

A popular choice for representing languages with binding is de Bruijn indices, where each bound variable is represented as a pointer to the binder that binds it [de Bruijn 1972]. We can represent de Bruijn terms by the following type:

data DBTerm = Var Int
            | Lam DBTerm
            | App DBTerm DBTerm
            deriving (Show, Eq)

To convert from Hoas to DBTerm, we abstract over the number of binders that surround the term we are currently constructing.

newtype DB = DB { unDB :: Int → DBTerm }

The intention is that unDB x n will return a de Bruijn term, closed in a context of depth n. To define a fold over the HOAS representation, we give an instance of UntypedLambda for DB:

instance UntypedLambda DB where
  lam f   = DB $ λi → let v = λj → Var (j-(i+1))
                      in  Lam (unDB (f (DB v)) (i+1))
  app x y = DB $ λi → App (unDB x i) (unDB y i)

toTerm :: Hoas → DBTerm
toTerm v = unDB v 0

Converting a HOAS application to a de Bruijn application is straightforward; we simply pass through the current depth of the context to the subterms. Converting a λ-abstraction is more complicated. Clearly, we must use the Lam constructor to generate a de Bruijn λ-abstraction, and, since we are going under a binder, we must up the depth of the context by one. As with the size example above, we must also pass in a representation of the bound variable to the host-language λ-abstraction representing the body of the embedded language λ-abstraction. This representation will be instantiated at some depth j, which will always be greater than i. We then compute the difference between the depth of the variable and the depth of the binder as j−(i+1), which is the correct de Bruijn index for the bound variable.
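For example, unembedding example1 (that is, λx.λy.x y) produces the de Bruijn term in which each index counts the binders between a variable occurrence and its binder:

dbExample :: DBTerm
dbExample = toTerm example1      -- Lam (Lam (App (Var 1) (Var 0)))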

We can represent an open HOAS term as a function from an environment, represented as a list of HOAS terms, to a HOAS term.

type Hoas' = ∀exp. UntypedLambda exp ⇒ [exp] → exp

It is worth pointing out that this encoding is technically incorrect as such functions can inspect the length of the list and so need not represent real terms. We could rectify the problem by making environments total, that is, restricting them to be infinite lists (where cofinitely many entries map variables to themselves). Rather than worrying about this issue now we resolve it later when we consider well-formed de Bruijn terms in Section 2.6.

Now we can convert an open HOAS term to a de Bruijn term by first supplying it with a total environment mapping every variable to itself, interpreting everything in the DB instance of UntypedLambda as we do for closed terms.

toTerm' :: Hoas' → DBTerm
toTerm' v = unDB w 0
  where w     = v (env 0)
        env j = DB (λi → Var (i+j)) : env (j+1)

Conversions from HOAS to de Bruijn representations have already been presented by other workers; see, for example, some slides of Olivier Danvy1. In his formulation, the HOAS terms are represented by the algebraic datatype we saw in Section 2.1. Hence exotic terms are permitted by the type, and it seems unlikely that his conversion to de Bruijn could be extended to a well-typed one in the way that we do below in Section 2.6.

2.4 Intensional analysis

The big advantage of converting HOAS terms to de Bruijn terms is that this allows us to perform intensional analyses. As a simple example of an analysis that is difficult to perform directly on HOAS terms we consider shrinking reductions [Appel and Jim 1997]. Shrinking reductions arise as the restriction of β-reduction (i.e. inlining) to cases where the bound variable is used zero (dead-code elimination) or one (linear inlining) times. As well as reducing function call overhead, shrinking reductions expose opportunities for further optimisations such as common sub-expression elimination and more aggressive inlining.

The difficulty with implementing shrinking reductions is that dead-code elimination at one redex can expose further shrinking reductions at a completely different position in the term, so attempts at writing a straightforward compositional algorithm fail. We give

1 http://www.brics.dk/~danvy/Slides/mfps98-up2.ps. Thanks to an anonymous reviewer for this link.


a naive algorithm that re-traverses the whole reduct whenever a redex is reduced. The only interesting case in the shrink function is that of a β-redex where the number of uses is less than or equal to one. This uses the standard de Bruijn machinery to perform the substitution [Pierce 2002]. More efficient imperative algorithms exist [Appel and Jim 1997, Benton et al. 2004, Kennedy 2007]. The key point is that these algorithms are intensional. It seems unlikely that shrinking reductions can be expressed easily as a fold.

usesOf n (Var m)   = if n==m then 1 else 0
usesOf n (Lam t)   = usesOf (n+1) t
usesOf n (App s t) = usesOf n s + usesOf n t

lift m p (Var n) | n < p     = Var n
                 | otherwise = Var (n+m)
lift m p (Lam body) = Lam (lift m (p+1) body)
lift m p (App s t)  = App (lift m p s) (lift m p t)

subst m t (Var n) | n==m      = t
                  | n > m     = Var (n-1)
                  | otherwise = Var n
subst m t (Lam s)    = Lam (subst (m+1) (lift 1 0 t) s)
subst m t (App s s') = App (subst m t s) (subst m t s')

shrink (Var n)   = Var n
shrink (Lam t)   = Lam (shrink t)
shrink (App s t) =
  case s' of
    Lam u | usesOf 0 u ≤ 1 → shrink (subst 0 t' u)
    _                      → App s' t'
  where s' = shrink s
        t' = shrink t
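A small check of the shrinker (our example, not from the paper): applying the identity to the identity is a β-redex whose bound variable is used exactly once, so it is inlined away.

shrinkExample :: DBTerm
shrinkExample = shrink (App (Lam (Var 0)) (Lam (Var 0)))   -- Lam (Var 0)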

2.5 Embedding againBefore we explain why the unembedding process works, we notethat going from closed de Bruijn terms back to the HOAS represen-tation is straightforward.

fromTerm’ :: DBTerm → Hoas’
fromTerm’ (Var i) env = env !! i
fromTerm’ (Lam t) env = lam (λx → fromTerm’ t (x:env))
fromTerm’ (App x y) env =
  fromTerm’ x env ‘app‘ fromTerm’ y env

fromTerm :: DBTerm → Hoas
fromTerm term = fromTerm’ term []

We maintain an environment storing all the representations of bound variables that have been acquired down each branch of the term. When we go under a binder, we extend the environment by the newly abstracted variable. This definition is unfortunately partial (due to the indexing function (!!)) since we have not yet guaranteed that the input will be a closed de Bruijn term. In the next sub-section we resolve this problem.

2.6 Well-formed de Bruijn terms

We can guarantee that we only deal with closed de Bruijn terms by using the well-known encoding of de Bruijn terms into GADTs [Sheard et al. 2005]. In this representation, we explicitly record the depth of the context in a type parameter. We first define two vacuous type constructors to represent natural numbers at the type level.

data Zero
data Succ a

To represent variables we make use of the Fin GADT, where the type Fin n represents the type of natural numbers less than n. The Zero and Succ type constructors are used as phantom types.

data Fin :: ∗ → ∗ where
  FinZ :: Fin (Succ a)
  FinS :: Fin a → Fin (Succ a)

The type of well-formed de Bruijn terms for a given context is captured by the following GADT. The type WFTerm Zero will then represent all closed de Bruijn terms.

data WFTerm :: ∗ → ∗ where
  WFVar :: Fin a → WFTerm a
  WFLam :: WFTerm (Succ a) → WFTerm a
  WFApp :: WFTerm a → WFTerm a → WFTerm a

Writing down terms in this representation is tedious due to the use of FinS (FinS FinZ) etc. to represent variables. The HOAS approach has a definite advantage over de Bruijn terms in this respect.
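To illustrate the difference, here is a small sketch (the names constWF and constHoas are ours): the term λx. λy. x written directly as a closed well-formed de Bruijn term, next to its HOAS form.

constWF :: WFTerm Zero
constWF = WFLam (WFLam (WFVar (FinS FinZ)))

constHoas :: Hoas
constHoas = lam (λx → lam (λy → x))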

The toTerm function we defined above always generates closed terms, and we now have a datatype that can be used to represent closed terms. It is possible to give a version of toTerm that has the correct type, but we will have to work around the Haskell type system for it to work. To see why, we sketch the key part of the proof of adequacy of the Church encoding of λ-calculus syntax—the type Cλ—given by the first author [Atkey 2009a].

As alluded to above, the correctness of the Church encoding method relies on the parametric polymorphism provided by the ∀α quantifier. Given a value of type α, the only action we can perform with this value is to use it as a variable; we cannot analyse values of type α, for if we could, then our function would not be parametric in the choice of α. The standard way to make such arguments rigorous is to use Reynolds’ formalisation of parametricity [Reynolds 1974], which states that for any choices τ1 and τ2 for α, and any binary relation between τ1 and τ2, this relation is preserved by the implementation of the body of the type abstraction.

To prove that the toTerm function always produces well-formed de Bruijn terms, we apply Reynolds’ technique with two minor modifications: we restrict to unary relations and we index our relations by natural numbers. The indexing must satisfy the constraint that if Ri(x) and j ≥ i, then Rj(x). This means that we require Kripke relations over the usual ordering on the natural numbers.

In the toTerm function, we instantiate the type α with the type Int → DBTerm. The Kripke relation we require on this type is Ri(t) ⇔ ∀j ≥ i. j ⊢ (t j), where j ⊢ t means that the de Bruijn term t is well-formed in contexts of depth j. If we know R0(t), then t 0 will be a closed de Bruijn term. Following usual proofs by parametricity, we prove this property for toTerm by showing that our implementations of lam and app preserve R. For app this is straightforward. For lam, it boils down to showing that for a context of depth i the de Bruijn representation of variables we pass in always gives a well-formed variable in some context of depth j, where j ≥ i + 1, and in particular j > 0. The machinery of Kripke relations ensures that the context depths always increase as we proceed under binders in the term (see [Atkey 2009a] for more details).

We give a more strongly typed conversion from HOAS to de Bruijn, using the insight from this proof. First we simulate part of the refinement of the type Int → DBTerm by the relation R, using a GADT to reflect type-level natural numbers down to the term level:

data Nat :: ∗ → ∗ where
  NatZ :: Nat Zero
  NatS :: Nat a → Nat (Succ a)

newtype WFDB = WFDB { unWFDB :: ∀j. Nat j → WFTerm j }

We do not include the part of the refinement that states that j is greater than some i (although this is possible with GADTs) because the additional type variable this would entail does not appear in



the definition of the class UntypedLambda. The advantage of the HOAS representation over the well-formed de Bruijn representation is that we do not have to explicitly keep track of contexts; the Kripke indexing of our refining relation keeps track of the context for us in the proof.

The little piece of arithmetic j − (i+1) in the toTerm function above must now be represented in a way that demonstrates to the type checker that we have correctly accounted for the indices. The functions natToFin and weaken handle conversion from naturals to inhabitants of the Fin type and injection of members of Fin types into larger ones. The shift function does the actual arithmetic.

natToFin :: Nat a → Fin (Succ a)
natToFin NatZ = FinZ
natToFin (NatS n) = FinS (natToFin n)

weaken :: Fin a → Fin (Succ a)
weaken FinZ = FinZ
weaken (FinS n) = FinS (weaken n)

shift :: Nat j → Nat i → Fin j
shift NatZ _ = ⊥
shift (NatS x) NatZ = natToFin x
shift (NatS x) (NatS y) = weaken $ shift x y

By the argument above, the case when the first argument of shift is NatZ will never occur when we invoke it from within the fold over the HOAS representation, so it is safe to return ⊥ (i.e. undefined). In any case, there is no non-⊥ inhabitant of the type Fin Zero to give here.

The actual code to carry out the conversion is exactly the same as before, except with the arithmetic replaced by the more strongly-typed versions.

instance UntypedLambda WFDB where
  lam f = WFDB $
    λi → let v = λj → WFVar (shift j i)
         in WFLam (unWFDB (f (WFDB v)) (NatS i))
  x ‘app‘ y = WFDB $
    λi → WFApp (unWFDB x i) (unWFDB y i)

toWFTerm :: Hoas → WFTerm Zero
toWFTerm v = unWFDB v NatZ

The point where Haskell’s type system does not provide us with enough information is in the call to shift, where we know from the parametricity proof that j ≥ i + 1 and hence j > 0. Moving to a more powerful type system with better support for reasoning about arithmetic, such as Coq [The Coq development team 2009] or Agda [The Agda2 development team 2009], would not help us here. One could easily write a version of the shift function that takes a proof that j ≥ i + 1 as an argument, but we have no way of obtaining a proof of this property without appeal to the parametricity of the HOAS representation. We see two options here for a completely well-typed solution: we could alter the HOAS interface to include information about the current depth of binders in terms, but this would abrogate the advantage of HOAS, which is that contexts are handled by the meta-language; or, we could incorporate parametricity principles into the type system, as has been done previously in Plotkin-Abadi Logic [Plotkin and Abadi 1993] and System R [Abadi et al. 1993]. The second option is complicated by our requirement here for Kripke relations, and by our use of parametricity to prove well-typedness rather than only equalities between terms.

In order to handle open terms we introduce a type of environments, WFEnv, which takes two type arguments: the type of values and the size of the environment.

data WFEnv :: ∗ → ∗ → ∗ where
  WFEmpty :: WFEnv exp Zero
  WFExtend :: WFEnv exp n → exp → WFEnv exp (Succ n)

lookWF :: WFEnv exp n → Fin n → exp
lookWF (WFExtend _ v) FinZ = v
lookWF (WFExtend env _) (FinS n) = lookWF env n

Open well-formed HOAS terms with n free variables are defined as functions from well-formed term environments of size n to terms.

type WFHoas’ n =
  ∀exp. UntypedLambda exp ⇒ WFEnv exp n → exp

Now we can define the translation from well-formed open higher-order abstract syntax terms to well-formed open de Bruijn terms. Whereas toTerm’ had to build an infinite environment mapping free variables to themselves, because the number of free variables did not appear in the type, we now build a finite environment whose length is equal to the number of free variables. We also need to supply the length at the term level using the natural number GADT.

toWFTerm’ :: Nat n → WFHoas’ n → WFTerm n
toWFTerm’ n v = unWFDB (v (makeEnv n)) n
  where
    makeEnv :: Nat n → WFEnv WFDB n
    makeEnv NatZ = WFEmpty
    makeEnv (NatS i) =
      WFExtend (makeEnv i)
               (WFDB (λj → WFVar (shift j i)))

Conversion back from WFTerm to Hoas is straightforward.

toWFHoas’ :: WFTerm n → WFHoas’ n
toWFHoas’ (WFVar n) = λenv → lookWF env n
toWFHoas’ (WFLam t) =
  λenv → lam (λx → toWFHoas’ t (WFExtend env x))
toWFHoas’ (WFApp f p) =
  λenv → toWFHoas’ f env ‘app‘ toWFHoas’ p env

toWFHoas :: WFTerm Zero → Hoas
toWFHoas t = toWFHoas’ t WFEmpty

The functions toWFTerm and toWFHoas are in fact mutually inverse, and hence the two representations are isomorphic. See Atkey [2009a] for the proof.
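As a sketch of what the isomorphism gives us in practice (the name roundTrip is ours): unembedding the re-embedding of a closed well-formed de Bruijn term yields the term we started with.

roundTrip :: WFTerm Zero → WFTerm Zero
roundTrip t = toWFTerm (toWFHoas t)
-- by the isomorphism, roundTrip t = t for every closed term t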

3. Language extensions

Having established the main techniques for moving between inductive and higher-order encodings of embedded languages, we now consider a number of extensions.

3.1 More term constructors

We begin by adding boolean terms. As before, we create a type class containing the term formers of our language: constants for true and false, and a construct for conditional branching.

class Booleans exp where
  true :: exp
  false :: exp
  cond :: exp → exp → exp → exp

We do not need to combine this explicitly with UntypedLambda: terms formed from true, false, cond, lam and app may be mingled freely. For example, we can define a function not as follows:

not = lam (λx → cond x false true)

This receives the following type:



not :: (Booleans exp, UntypedLambda exp) ⇒ exp

However, for convenience we may wish to give a name to the embedded language that includes both functions and booleans, and we can do so by defining a new class that is a subclass of UntypedLambda and Booleans.

class (Booleans exp, UntypedLambda exp) ⇒
      BooleanLambda exp

We can now give our definition of not the following more concise type:

not :: BooleanLambda exp ⇒ exp

In Section 2 we defined a number of functions on untyped λ expressions. We can extend these straightforwardly to our augmented language by defining instances of Booleans. For example, we can extend the size function by defining the following instance:

instance Booleans Size where
  true = Size $ 1
  false = Size $ 1
  cond c t e = Size $ size c + size t + size e

In order to extend the functions for evaluation and conversion to de Bruijn terms we must modify the datatypes used as the domains of those functions. For evaluation we must add constructors for true and false to the Value type.

data Value = VFunc (Value → Value) | VTrue | VFalse

Then we can extend the evaluation function to booleans by writing an instance of Booleans at type Value.

instance Booleans Value where
  true = VTrue
  false = VFalse
  cond VTrue t _ = t
  cond VFalse _ e = e

Note that the definitions for both cond and app are now partial, since the embedded language is untyped: there is nothing to prevent programs which attempt to apply a boolean, or use a function as the first argument to cond. In Section 4 we investigate the embedding of typed languages, with total interpreters.

For conversion to well-formed de Bruijn terms we must modify the WFTerm datatype to add constructors for true, false and cond.

data WFTerm :: ∗ → ∗ where
  WFVar :: Fin a → WFTerm a
  WFLam :: WFTerm (Succ a) → WFTerm a
  WFApp :: WFTerm a → WFTerm a → WFTerm a
  WFTrue :: WFTerm a
  WFFalse :: WFTerm a
  WFCond :: WFTerm a → WFTerm a → WFTerm a
            → WFTerm a

Extending the conversion function to booleans is then a simple matter of writing an instance of Booleans at the type WFDB.

instance Booleans WFDB where
  true = WFDB (λi → WFTrue)
  false = WFDB (λi → WFFalse)
  cond c t e = WFDB (λi → WFCond (unWFDB c i)
                                 (unWFDB t i)
                                 (unWFDB e i))

Term formers for integers, pairs, sums, and so on, can be added straightforwardly in the same fashion.

Adding integers is of additional interest in that it allows integration with the standard Num type class. We can extend the Value datatype with an additional constructor for integers, and then use the arithmetic operations of the Num class within terms of the embedded language. For example, the following term defines a binary addition function in the embedded language:

lam (λx → lam (λy → x + y))
  :: (UntypedLambda exp, Num exp) ⇒ exp

We can, of course, extend evaluation to such terms by defining instances of Num at the Value type; the other functions, such as conversion to the de Bruijn representation, can be extended similarly.
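For concreteness, a minimal sketch of such an instance follows, assuming Value has been extended with a constructor VInt Int as described above. As with cond and app, the definitions are partial because the embedded language is untyped.

instance Num Value where
  VInt m + VInt n = VInt (m + n)
  VInt m * VInt n = VInt (m * n)
  VInt m - VInt n = VInt (m - n)
  abs (VInt m)    = VInt (abs m)
  signum (VInt m) = VInt (signum m)
  fromInteger n   = VInt (fromInteger n)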

3.2 Conflating levels

The embedded languages we have looked at so far have all maintained a strict separation between the host and embedded levels. A simple example where we mix the levels, which was also used in Atkey [2009a], is a language of arithmetic expressions with a “let” construct and with host language functions contained within terms.

class ArithExpr exp where
  let_ :: exp → (exp → exp) → exp
  integer :: Int → exp
  binop :: (Int → Int → Int) → exp → exp → exp

type AExpr = ∀exp. ArithExpr exp ⇒ exp

An example term in this representation is:

example8 :: AExpr
example8 = let_ (integer 8) $ λx →
           let_ (integer 9) $ λy →
           binop (+) x y

Using the techniques described in Section 2.6, it is easy to see how we can translate this representation to a type of well-formed de Bruijn terms.

The point of this example is to show how function types can be used in two different ways in the HOAS representation. In the let operation, functions are used to represent embedded language binding. In the binop operation we use the function type computationally as a host language function. Licata et al. [2008] define a new logical system based on a proof theoretic analysis of focussing to mix the computational and representation function spaces. Using parametric polymorphism, we get the same functionality for free.
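To make the mixing of levels concrete, here is a sketch (ours; the names AEval and evalAExpr are not from the paper) of an evaluator for ArithExpr in the same newtype-instance style used throughout: host-language binding in let_ and the host-language function in binop are simply applied.

newtype AEval = AEval { unAEval :: Int }

instance ArithExpr AEval where
  let_ e f    = f e
  integer n   = AEval n
  binop f x y = AEval (f (unAEval x) (unAEval y))

evalAExpr :: AExpr → Int
evalAExpr e = unAEval e
-- evalAExpr example8 evaluates to 17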

3.3 Pattern matching

To this point, we have only considered languages where variables are bound individually. Realistic programming languages feature pattern matching that allows binding of multiple variables at once. It is possible to simulate this by the use of functions as cases in pattern matches, but this gets untidy due to the additional lam constructors required. Also, we may not want to have λ-abstraction in our embedded language. To see how to include pattern matching, we start by considering a language extension with sums and pairs.

We define a type class for introduction forms for pairs and sums:

class PairsAndSums exp where
  pair :: exp → exp → exp
  inl :: exp → exp
  inr :: exp → exp

A simple language extension that allows pattern matching on pairs and sums can be captured with the following type class:

class BasicPatternMatch exp where
  pair_match :: exp → ((exp,exp) → exp) → exp
  sum_match :: exp → (exp → exp) → (exp → exp)
               → exp

These operations are certainly complete for matching against pairs and sums, but we do not have the flexibility in matching patterns



that exists in our host language. To get this flexibility we must abstract over patterns. We represent patterns as containers of kind ∗ → ∗:

data Id a = V a
data Pair f1 f2 a = f1 a × f2 a
data Inl f a = Inl (f a)
data Inr f a = Inr (f a)

The HOAS representation of a pattern matching case will take a function of type f exp → exp, where we require that f is a container constructed from the above constructors. For example, to match against the left-hand component of a sum, which contains a pair, we would use a function like:

(λ(Inl (V x × V y)) → pair x y)
  :: (Inl (Pair Id Id) exp → exp)

Note that when f is Pair, this will give the same type as the pair_match combinator above.

We must be able to restrict to containers generated by the above constructors. We do so by employing the following GADT:

data Pattern :: (∗ → ∗) → ∗ → ∗ where
  PVar :: Pattern Id (Succ Zero)
  PPair :: Nat x → Pattern f1 x → Pattern f2 y →
           Pattern (Pair f1 f2) (x :+: y)
  PInl :: Pattern f x → Pattern (Inl f) x
  PInr :: Pattern f x → Pattern (Inr f) x

The second argument in this GADT records the number of variables in the pattern. This numeric argument will be used to account for the extra context used by the pattern in the de Bruijn representation. The spare-looking Nat x argument in PPair is used as a witness for constructing proofs of type equalities in the conversion between HOAS and de Bruijn. We define type-level addition by the following type family:

type family n :+: m :: ∗
type instance Zero :+: n = n
type instance (Succ n) :+: m = Succ (n :+: m)

A HOAS pattern matching case consists of a pattern representation and a function to represent the variables bound in the pattern:

data Case exp = ∀f n. Case (Pattern f n) (f exp → exp)

A type class defines our pattern matching language extension:

class PatternMatch exp where
  match :: exp → [Case exp] → exp

This representation is hampered by the need to explicitly describe each pattern before use:

matcher0 x = match x
  [ Case (PPair (NatS NatZ) PVar PVar) $
      λ(V x × V y) → pair x y
  , Case (PInl PVar) $ λ(Inl (V x)) → x ]

We get the compiler to do the work for us by using an existential type and a type class:

data IPat f = ∀n. IPat (Nat n) (Pattern f n)

class ImplicitPattern f where
  patRep :: IPat f

We define instances for each f that interests us. The additional Nat n argument in IPat is used to fill in the Nat x argument in the PPair constructor. We can now define a combinator that allows convenient expression of pattern matching cases:

clause :: ∀f exp.
          ImplicitPattern f ⇒ (f exp → exp) → Case exp
clause body = case patRep of
  IPat _ pattern → Case pattern body

This combinator gives a slicker syntax for pattern matching:

matcher x = match x
  [ clause $ λ(V x × V y) → pair x y
  , clause $ λ(Inl (V x)) → x ]
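For concreteness, the omitted ImplicitPattern instances might look as follows (a sketch of our own, using ScopedTypeVariables for the annotation on the inner patRep); the instance for Pair additionally combines the two sub-witnesses with a term-level addition on Nat mirroring the :+: family, and the instance for Inr is symmetric to the one for Inl.

instance ImplicitPattern Id where
  patRep = IPat (NatS NatZ) PVar

instance ImplicitPattern f ⇒ ImplicitPattern (Inl f) where
  patRep = case patRep :: IPat f of
             IPat n p → IPat n (PInl p)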

We can unembed this HOAS representation to guaranteed well-formed de Bruijn terms by a process similar to the one we used above. The de Bruijn representation of pattern match cases consists of a pair of a pattern and a term. In this representation we must explicitly keep track of the context, something that the HOAS representation handles for us.

data WFCase a =
  ∀f b. WFCase (Pattern f b) (WFTerm (a :+: b))

data WFTerm :: ∗ → ∗ where
  WFVar :: Fin a → WFTerm a
  WFMatch :: WFTerm a → [WFCase a] → WFTerm a
  WFPair :: WFTerm a → WFTerm a → WFTerm a
  WFInl :: WFTerm a → WFTerm a
  WFInr :: WFTerm a → WFTerm a
  WFLam :: WFTerm (Succ a) → WFTerm a
  WFApp :: WFTerm a → WFTerm a → WFTerm a

As above, we translate from HOAS to de Bruijn representation by defining a fold over the HOAS term. The case for match is:

instance PatternMatch WFDB where
  match e cases = WFDB $
      λi → WFMatch (unWFDB e i) (map (doCase i) cases)
    where
      doCase :: ∀i. Nat i → Case WFDB → WFCase i
      doCase i (Case pattern f) =
        let (x, j) = mkPat pattern i
        in WFCase pattern (unWFDB (f x) j)

The helper function used here is mkPat, which has type

mkPat :: Pattern f n → Nat i → (f WFDB, Nat (i :+: n))

This function takes a pattern representation and the current size of the context, and returns the appropriate container full of variable representations together with the new size of the context. We omit the implementation of this function for want of space. The core of the implementation relies on an idiomatic traversal [McBride and Paterson 2008] of the shape of the pattern, generating the correct variable representations as we go and incrementing the size of the context. To keep track of the size of the context in the types, we use a parameterised applicative functor [Cooper et al. 2008], the idiomatic analogue of a parameterised monad [Atkey 2009b]. The term-level representations of natural numbers used in patterns are used to construct witnesses for the proofs of associativity and commutativity of plus, which are required to type this function.

Conversion back again from de Bruijn to HOAS relies on a helper function of the following type:

mkEnv :: ∀i exp f j.
         Nat i → WFEnv exp i → Pattern f j →
         f exp → WFEnv exp (i :+: j)

This function takes the current size of the context (which can always be deduced from the environment argument), a conversion environment and a pattern representation, and returns a function that maps pattern instances to extended environments. By composing mkEnv with the main conversion function from de Bruijn terms, we obtain a conversion function for the de Bruijn representation of pattern matching cases.

4. Unembedding typed languages

We now turn to the representation and unembedding of typed languages, at least when the types of our embedded language are a subset of the types of Haskell. This is mostly an exercise in decorating



the constructions of the previous sections with type information, but there is a subtlety involved in representing the types of the embedded language, which we relate in our first subsection.

4.1 Simply-typed λ-calculus, naively

Given the representation of the untyped λ-calculus above, an obvious way to represent a typed language in the manner we have used above is by the following type class, where we decorate all the occurrences of exp with type variables. This is the representation of typed embedded languages used by Carette et al. [2009].

class TypedLambda0 exp where
  tlam0 :: (exp a → exp b) → exp (a → b)
  tapp0 :: exp (a → b) → exp a → exp b

Closed simply-typed terms would now be represented by the type:

type THoas0 a = ∀exp. TypedLambda0 exp ⇒ exp a

and we can apparently go ahead and represent terms in the simply-typed λ-calculus:

example3 :: THoas0 (Bool → (Bool → Bool) → Bool)
example3 = tlam0 (λx → tlam0 (λy → y ‘tapp0‘ x))

However, there is a hidden problem lurking in this representation. The type machinery that we use to ensure that bound variables are represented correctly may leak into the types that are used in the represented term. We can see this more clearly by writing out the type TypedLambda0 explicitly as an Fω type, where the polymorphism is completely explicit:

λτ. ∀α : ∗ → ∗. (∀σ1 σ2. (α σ1 → α σ2) → α (σ1 → σ2)) →
                (∀σ1 σ2. α (σ1 → σ2) → α σ1 → α σ2) →
                α τ

Now consider a typical term which starts with Λα. λtlam. λtapp. ... and goes on to apply tlam and tapp to construct a representation of a simply-typed λ-calculus term. The problem arises because we have a type constructor α available for use in constructing the represented term. We can instantiate the types σ1 and σ2 in the two constructors using α. This will lead to representations of simply-typed λ-calculus terms that contain subterms whose types depend on the result type of the specific fold operation that we perform over terms. Hence, while this representation does not allow “exotic terms”, it does allow exotically typed terms.

An example of an exotically typed term in this representation is the following:

exotic :: ∀exp. TypedLambda0 exp ⇒ exp (Bool → Bool)
exotic = tlam0 (λx → tlam0 (λy → y))
         ‘tapp0‘ (tlam0 (λ(z :: exp (exp Int)) → z))

This “represents” the simply typed term:

(λx : exp(Int) → exp(Int). λy : Bool. y) (λz : exp(Int). z)

When we write a fold over the representation exotic, we will instantiate the type exp with the type we are using for accumulation. Thus the term exotic will technically represent different simply-typed terms for different folds.

This confusion between host and embedded language types manifests itself in the failure of the proof of an isomorphism between this Church encoding of typed HOAS and the de Bruijn representation. After the conversion of exotic to de Bruijn, we will have a representation of the simply typed term:

(λx : TDB(Int) → TDB(Int). λy : Bool. y) (λz : TDB(Int). z)

where the placeholder exp has been replaced by the type constructor TDB used in the conversion to de Bruijn. Converting this term back to typed HOAS preserves this constructor, giving a term that differs in its types from the original term.

An interesting question to ask is: exactly what is being represented by the type THoas0, if it is not just the simply-typed terms? We currently have no answer to this. Maybe we are representing terms with the term syntax of the simply-typed λ-calculus, but the types of Haskell. On the other hand, the fact that the quantified constructor exp used in the representation will change according to the type of the fold that we perform over represented terms is troubling.

Note that, due to the fact that the type variable a, which represents the type of the whole term, appears outside the scope of exp in the type THoas0, we can never get terms that are exotically typed at the top level; only subterms with types that do not contribute to the top-level type may be exotically typed, as in the exotic example above.

Aside from the theoretical problem, there is a point about which type system our embedded language should be able to have. If we are going to unembed an embedded language effectively, then we should be able to get our hands on representations of object-level types. Moreover, many intensional analyses that we may wish to perform are type-directed, so explicit knowledge of the embedded language types involved is required. To do this we cannot straightforwardly piggy-back off Haskell’s type system (though we are forced to rely on it to represent object-level types, by the stratification between types and terms in Haskell’s type theory). To fix this problem, we define explicit representations for embedded language types in the next subsection.

4.2 The closed kind of simple types

We define a GADT Rep for representing simple types and hence precluding exotic types. This connects a term-level representation of simple types with a type-level representation of types (in which the underlying types are Haskell types). Explicitly writing type representations everywhere would be tedious, so we follow Cheney and Hinze [2002] and define the type class Representable of simple types. This allows the compiler to infer and propagate many type representations for us.

data Rep :: ∗ → ∗ where
  Bool :: Rep Bool
  (:→) :: (Representable a, Representable b) ⇒
          Rep a → Rep b → Rep (a→b)

class Representable a where rep :: Rep a

instance Representable Bool where rep = Bool

instance (Representable a, Representable b) ⇒
         Representable (a→b) where
  rep = rep :→ rep

Note that the leaves of a Rep must be Bool constructors, and so it is only possible to build representations of simple types. The restriction to simple types is made more explicit with the Representable type class. In effect Representable is the closed kind of simple types.

A key function that we can define against values of type Rep is the conditional cast operator, which has type:

cast :: Rep a → Rep b → Maybe (∀f. f a → f b)

We omit the implementation of this function to save space. The basic implementation idea is given by Weirich [2004].
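For concreteness, here is a minimal sketch of how such a cast might be implemented, in the spirit of Weirich’s technique: structurally compare the two representations and, on success, build a coercion that is the identity at run time. The names Coerce, FArg, FRes and cast’ are ours, not the paper’s; the paper’s cast is recovered by unwrapping Coerce, which we introduce only to avoid an impredicative use of Maybe.

newtype Coerce a b = Coerce { unCoerce :: ∀f. f a → f b }

-- helpers for pushing a coercion under the argument and result
-- positions of a function type
newtype FArg f b a = FArg { unFArg :: f (a → b) }
newtype FRes f a b = FRes { unFRes :: f (a → b) }

cast’ :: Rep a → Rep b → Maybe (Coerce a b)
cast’ Bool Bool = Just (Coerce id)
cast’ (a1 :→ b1) (a2 :→ b2) = do
  ca ← cast’ a1 a2
  cb ← cast’ b1 b2
  return (Coerce (λx →
    unFRes (unCoerce cb (FRes (unFArg (unCoerce ca (FArg x)))))))
cast’ _ _ = Nothing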

4.3 Simply-typed λ-calculus, wisely

The type class for simply-typed lambda terms is just like the naive one we gave above, except that the constructors are now augmented with type representations.

class TypedLambda exp where
  tlam :: (Representable a, Representable b) ⇒
          (exp a → exp b) → exp (a → b)
  tapp :: (Representable a, Representable b) ⇒
          exp (a → b) → exp a → exp b

type THoas a = ∀exp. TypedLambda exp ⇒ exp a

Although the Representable type class restricts THoas terms to simple types, we can still assign a THoas term a polymorphic type.

example4 :: (Representable a, Representable b) ⇒
            THoas ((a → b) → a → b)
example4 = tlam (λx → tlam (λy → x ‘tapp‘ y))

Of course, this polymorphism is only at the meta level; we are in fact defining a family of typing derivations of simply-typed terms. We can instantiate example4 many times with different simple types for a and b. However, if we wish to unembed it (using the function toTTerm that we define below) then we must pick a specific type by supplying an explicit type annotation.

example5 =
  toTTerm (example4 :: THoas ((Bool→Bool)→Bool→Bool))

Sometimes the compiler will not be able to infer the types that we need in terms. This happens when a subterm contains a type that does not contribute to the top-level type of the term. These are also the situations in which exotically typed terms arise. For example, the declaration

example6 :: (Representable a) ⇒ THoas (a → a)
example6 = tlam (λx → tlam (λy → y))
           ‘tapp‘ tlam (λz → z)

causes GHC to complain that there is an ambiguous type variable arising from the third use of tlam. We must fix the type of z to some concrete simple type in order for this to be a proper representation. It is possible to do this by using type ascriptions at the Haskell level, but it is simpler to do so by defining a combinator that takes an explicit type representation as an argument:

tlam’ :: (Representable a, Representable b,
          TypedLambda exp) ⇒
         Rep a → (exp a → exp b) → exp (a → b)
tlam’ _ = tlam

The term can now be accepted by the Haskell type checker by fixing the embedded language type of z:

example7 :: (Representable a) ⇒ THoas (a → a)
example7 = tlam (λx → tlam (λy → y))
           ‘tapp‘ (tlam’ Bool (λz → z))

Defining an evaluator for these terms is now straightforward. We can simply interpret each embedded language type by its host language counterpart:

newtype TEval a = TEval { unTEval :: a }

The instance of TypedLambda for TEval is straightforward:

instance TypedLambda TEval where
  tlam f = TEval (unTEval ◦ f ◦ TEval)
  TEval f ‘tapp‘ TEval a = TEval (f a)

teval :: THoas a → a
teval t = unTEval t

We note that the HOAS representation is usually very convenient for defining evaluators. In particular, this representation frees us from keeping track of environments. Also, note that exotically typed terms do not prevent us from writing an evaluator. If evaluation is all one wants to do with embedded terms, then restricting terms to a subset of types is not required.
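A small usage sketch (ours): instantiating example4 at concrete simple types and evaluating it yields the corresponding host-language function, which we can then apply directly.

testEval :: Bool
testEval = teval (example4 :: THoas ((Bool → Bool) → Bool → Bool)) id True
-- testEval evaluates to True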

4.4 Translating to de Bruijn and back

Where we used the natural numbers GADT to record the depth of a context in the representation of well-formed de Bruijn terms, we now need to include the list of types of the variables in that context. At the type level, we use the unit type to represent the empty context, and pair types to represent a context extended by an additional type. At the term level, we maintain a list of (implicit) type representations:

data Ctx :: ∗ → ∗ where
  CtxZ :: Ctx ()
  CtxS :: Representable a ⇒ Ctx ctx → Ctx (a, ctx)

The simply-typed analogue of the Fin GADT is the GADT Index. At the type level this encodes a pair of a type list and the type of a distinguished element in that list; at the term level it encodes the index of that element.

data Index :: ∗ → ∗ → ∗ where
  IndexZ :: Index (a, ctx) a
  IndexS :: Index ctx a → Index (b, ctx) a

The type constructor TTerm for simply-typed de Bruijn terms takes two parameters: the first is a type list encoding the types of the free variables, and the second is the type of the term itself.

data TTerm :: ∗ → ∗ → ∗ where
  TVar :: Representable a ⇒ Index ctx a → TTerm ctx a
  TLam :: (Representable a, Representable b) ⇒
          TTerm (a, ctx) b → TTerm ctx (a → b)
  TApp :: (Representable a, Representable b) ⇒
          TTerm ctx (a→b) → TTerm ctx a → TTerm ctx b

The translation to de Bruijn terms is similar to that for well-formed untyped terms. We again give the basic fold over the HOAS term representation as an instance of the TypedLambda class:

newtype TDB a =
  TDB { unTDB :: ∀ctx. Ctx ctx → TTerm ctx a }

instance TypedLambda TDB where
  tlam (f :: TDB a → TDB b) =
    TDB $ λi → let v = λj → TVar (tshift j (CtxS i))
               in TLam (unTDB (f (TDB v)) (CtxS i))
  (TDB x) ‘tapp‘ (TDB y) = TDB $ λi → TApp (x i) (y i)

The key difference is in the replacement of the shift function that computes the de Bruijn index for the bound variable by the type-aware version tshift. To explain the tshift function, we re-examine the proof that this fold always produces well-formed de Bruijn terms. In the untyped case, the proof relies on Kripke relations indexed by natural numbers, where the natural number records the depth of the context. Now that we also have types to worry about, we use relations indexed by lists of embedded language types, ordered by the standard prefix ordering; we define R^Γ_σ(t) ⇔ ∀Γ′ ≥ Γ. Γ′ ⊢ (t Γ′) : σ, where Γ ⊢ t : σ is the typing judgement of the simply-typed λ-calculus.

In the case for tlam, we again have two contexts i and j, where i is the context surrounding the λ-abstraction, and j is the context surrounding the bound variable occurrence. By a parametricity argument, and the way in which we have defined our Kripke relation, we know that (a, i) will always be a prefix of j, and so we obtain a well-formed de Bruijn index by computing the difference between the depths of the contexts. We implement this by the following functions:

len :: Ctx n → Int
len CtxZ = 0
len (CtxS ctx) = 1 + len ctx

tshift’ :: Int → Ctx j → Ctx (a, i) → Index j a
tshift’ _ CtxZ _ = ⊥
tshift’ 0 (CtxS _) (CtxS _) =
  fromJust (cast rep rep) IndexZ
tshift’ n (CtxS c1) c2 =
  IndexS (tshift’ (n-1) c1 c2)

tshift :: Ctx j → Ctx (a, i) → Index j a
tshift c1 c2 = tshift’ (len c1 - len c2) c1 c2

As with the untyped case, we have had to feed the Haskell type checker with bottoms to represent cases that can never occur. Firstly, the case when j is shorter than (a,i) can never happen, as with the untyped version. Secondly, we use a well-typed cast to show that the type a does occur in j at the point we think it should. Given that we know the cast will succeed, it would likely be more efficient to simply replace the cast with a call to unsafeCoerce. We chose not to here because we wanted to see how far we could push the type system.

Were we to use the representation given by the type THoas0, which allows exotically typed terms, it would still be possible to write a conversion to de Bruijn representation, but it would be necessary to replace the use of cast in tshift’ with unsafeCoerce, since we do not have any type representations to check. Also, the de Bruijn representation would not be able to contain any Representable typeclass constraints, meaning that we could not write intensional analyses that depend on the types of embedded-language terms.

In order to be able to define the type of open simply-typed HOAS terms we need to define a GADT for environments.

data TEnv :: (∗ → ∗) → ∗ → ∗ where
  TEmpty :: TEnv exp ()
  TExtend :: TEnv exp ctx → exp a → TEnv exp (a, ctx)

lookT :: TEnv exp ctx → Index ctx a → exp a
lookT (TExtend _ v) IndexZ = v
lookT (TExtend env _) (IndexS n) = lookT env n

Now we can define a type for open simply-typed HOAS terms.

type THoas’ ctx a = ∀(exp :: ∗ → ∗).
  TypedLambda exp ⇒ TEnv exp ctx → exp a

The translations between HOAS and de Bruijn representations and vice-versa fall out naturally.

toTHoas’ :: TTerm ctx a → THoas’ ctx a
toTHoas’ (TVar n) = λenv → lookT env n
toTHoas’ (TLam t) =
  λenv → tlam (λx → toTHoas’ t (TExtend env x))
toTHoas’ (TApp f p) =
  λenv → toTHoas’ f env ‘tapp‘ toTHoas’ p env

toTHoas :: TTerm () a → THoas a
toTHoas t = toTHoas’ t TEmpty

toTTerm’ :: Ctx ctx → THoas’ ctx a → TTerm ctx a
toTTerm’ ctx v = unTDB w ctx
  where w = v (makeEnv ctx)

makeEnv :: Ctx ctx → TEnv TDB ctx
makeEnv CtxZ = TEmpty
makeEnv (CtxS j) =
  TExtend (makeEnv j)
          (TDB (λi → TVar (tshift i (CtxS j))))

toTTerm :: THoas a → TTerm () a
toTTerm v = unTDB v CtxZ

5. Examples

We give two examples where unembedding plays an essential role.

5.1 Mobile code

Our first example involves sending programs of an embedded language over a network to be executed at some remote location. In order to make the programs a little more useful than pure lambda terms we extend the embedding of typed λ-calculus given in Section 4.3 to include constructors and destructors for booleans. We define the TypedBooleans class independently of TypedLambda, and define a new class, Mobile, for the language formed by combining the two.

class TypedBooleans exp where
  ttrue :: exp Bool
  tfalse :: exp Bool
  tcond :: Representable a ⇒
           exp Bool → exp a → exp a → exp a

class (TypedBooleans exp, TypedLambda exp) ⇒ Mobile exp

Next, we define concrete representations for types and terms, together with automatically-derived parsers and printers.

data URep = UBool | URep →ᵤ URep
  deriving (Show, Read)

data MTerm = MVar Int
           | MLam URep MTerm | MApp MTerm MTerm
           | MTrue | MFalse | MCond MTerm MTerm MTerm
  deriving (Show, Read)

Section 2 showed how to unembed untyped HOAS terms to untyped de Bruijn terms; obtaining untyped de Bruijn terms from typed terms is broadly similar. The type MDB is analogous to DB (Section 2.3), but the phantom parameter discards type information.

newtype MDB a = MDB { unMDB :: Int → MTerm }

Defining instances of Mobile and its superclasses for MDB gives a translation to MTerm; composing this translation with show gives us a marshalling function for Mobile. (In an actual program it would, of course, be preferable to use a more efficient marshalling scheme.) We omit the details of the translation, which follow the pattern seen in Section 2.3.
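For illustration, here is a sketch of how those instances might look, following the depth-passing scheme of Section 2.3; the helper eraseRep and the name UArr for the arrow constructor of URep are assumptions of this sketch (and ScopedTypeVariables is needed for the annotation on f), not definitions from the paper.

eraseRep :: Rep a → URep
eraseRep Bool     = UBool
eraseRep (a :→ b) = UArr (eraseRep a) (eraseRep b)

instance TypedLambda MDB where
  tlam (f :: MDB a → MDB b) = MDB $ λi →
    MLam (eraseRep (rep :: Rep a))
         (unMDB (f (MDB (λj → MVar (j - (i + 1))))) (i + 1))
  x ‘tapp‘ y = MDB $ λi → MApp (unMDB x i) (unMDB y i)

instance TypedBooleans MDB where
  ttrue  = MDB (λi → MTrue)
  tfalse = MDB (λi → MFalse)
  tcond c t e = MDB (λi → MCond (unMDB c i) (unMDB t i) (unMDB e i))

instance Mobile MDB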

marshal :: (∀exp. Mobile exp ⇒ exp a) → String
marshal t = show (unMDB t 0)

Erasing types during marshalling is comparatively straightforward; reconstructing types is more involved. We begin with a definition, Typed, that pairs a term with a representation of its type, hiding the type variable that carries the type information.

data Typed :: (∗ → ∗) → ∗ where
  (:::) :: Representable a ⇒ exp a → Rep a → Typed exp

We use Typed to write a function that re-embeds MTerm values as typed HOAS terms. The function toHoas takes an untyped term and an environment of typed terms for the free variables; it returns a typed term. Since type checking may fail — the term may refer to variables not present in the environment, or may be untypeable — the function is partial, as indicated by the Maybe in the return type.

toHoas :: (TypedLambda exp, TypedBooleans exp) ⇒
          MTerm → [Typed exp] → Maybe (Typed exp)

We omit the implementation, but the general techniques for reconstructing typed terms from untyped representations are well-known: see, for example, work by Baars and Swierstra [2002]. Composing toHoas with the parser for MTerm gives an unmarshalling function for closed terms.

unmarshal :: String →
             (∀exp. Mobile exp ⇒ Maybe (Typed exp))
unmarshal s = toHoas (read s) []



Combined with an evaluator for terms as defined in Section 4.3, marshal and unmarshal allow us to construct HOAS terms, send them over a network, and evaluate them on another host.
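An end-to-end sketch (ours): assuming an empty Mobile instance for the evaluator type TEval and a TypedBooleans instance analogous to the Booleans instance of Section 3.1, a Bool-typed term received as a String can be type-checked by unmarshal and then run with the evaluator.

instance TypedBooleans TEval where
  ttrue  = TEval True
  tfalse = TEval False
  tcond (TEval b) t e = if b then t else e

instance Mobile TEval

remoteRun :: String → Maybe Bool
remoteRun s = case unmarshal s :: Maybe (Typed TEval) of
  Just (t ::: Bool) → Just (unTEval t)
  _                 → Nothing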

5.2 Nested relational calculus

Our second example is based on the Nested Relational Calculus (NRC) [Tannen et al. 1992]. NRC is a query language based on comprehensions, with terms for functions, pairs, unit, booleans and sets. As the name suggests, NRC permits nested queries, unlike SQL, which restricts the type of queries to a collection of records of base type. However, there are translations from suitably-typed NRC terms to flat queries [Cooper 2009, Grust et al. 2009]. The specification of these translations involves intensional analysis; it is therefore easier to define them on a concrete representation of terms than as a mapping from higher-order abstract syntax.

Once again we can reuse the embeddings presented in earlier sections. We combine the TypedLambda and TypedBooleans languages of Sections 4.3 and 5.1 with embeddings of term formers for pairs, units and sets; these are straightforward, so we give only the case for sets as an example. There are four term formers, for empty and singleton sets, set union, and comprehension; this last uses Haskell’s binding to bind the variable, in standard HOAS style.

class TypedSets exp where
  empty  :: Representable a ⇒ exp (Set a)
  single :: Representable a ⇒ exp a → exp (Set a)
  union  :: Representable a ⇒
            exp (Set a) → exp (Set a) → exp (Set a)
  for    :: (Representable a, Representable b) ⇒
            exp (Set a) → (exp a → exp (Set b)) → exp (Set b)

class (TypedLambda exp, TypedBooleans exp,
       TypedUnit exp, TypedPairs exp,
       TypedSets exp) ⇒ NRC exp

We must also extend the Rep datatype and Representable class to include the new types.

data Rep :: ∗ → ∗ where
  ...
  Set :: Representable a ⇒ Rep a → Rep (Set a)

instance Representable a ⇒ Representable (Set a) where
  rep = Set rep

Using the techniques presented in earlier sections, we can unembed terms of NRC to obtain a concrete representation on which translations to a flat calculus can be defined. The term formers of the language ensure that embedded terms are correctly typed; we can also assign a type to the translation function that restricts its input to queries that can be translated to a flat query language such as SQL. Given these guarantees, we are free to dispense with types in the concrete representation used internally, making it easier to write the translation of interest.

The combination of a carefully-typed external interface and an untyped core is used in a number of embedded languages; for example, by Leijen and Meijer [1999] for SQL queries and by Lindley [2008] for statically-typed XHTML contexts. Our presentation here has the additional property that the external language (based on HOAS) is more convenient for the user than the internal language (de Bruijn terms), while the internal language is more convenient for analysis.

6. Related work

The idea of encoding syntax with binding using the host language’s binding constructs goes back to Church [1940]. As far as we are

aware, Coquand and Huet [1985] were the first to remark that the syntax of untyped lambda-calculus can be encoded using the universally quantified type:

∀α.((α → α) → α) → (α → α → α) → α

Pfenning and Elliott [1988] proposed higher-order abstract syntax as a general means for encoding name binding using the meta language. Washburn and Weirich [2008] also present essentially this type and show how functions can be defined over the syntax by means of folds.

Programming with explicit folds is awkward. Carette et al. [2009] give a comprehensive account of how to achieve the same effect using Haskell type classes or ML modules. Our work is in the same vein. Where Carette et al. concentrate on implementing different compositional interpretations of HOAS, our main focus is on unembedding to a first-order syntax in order to allow intensional analyses. Hofer et al. [2008] apply Carette et al.’s techniques in the context of Scala. As they remark, many standard optimisations one wants to perform in a compiler are difficult to define compositionally. Our unembedding provides a solution to this problem. Hofer et al. also discuss composing languages in a similar way to us. Their setting is somewhat complicated by the object-oriented features of Scala.

Meijer and Hutton [1995] and Fegaras and Sheard [1996] show how to define folds or catamorphisms for data types with embedded functions. As we discussed in Section 2.1, the data type that Fegaras and Sheard use to represent terms does not use parametricity to disallow exotic terms, and so does not allow an unembedding function to be defined. Fegaras and Sheard also use HOAS to represent cyclic data structures and graphs, essentially by encoding them using explicit sharing via a let construct and recursion using a fix construct. Ghani et al. [2006] represent cyclic data structures using a de Bruijn representation in nested datatypes. Our unembedding process gives a translation from Fegaras and Sheard’s HOAS representation to Ghani et al.’s de Bruijn representation.

Pientka [2008] introduces a sophisticated type system that provides direct support for recursion over HOAS datatypes. In contrast, our approach supports recursion over HOAS datatypes within the standard Haskell type system. There is a similarity between our representation of open simply-typed terms using HOAS and hers, but we must leave a detailed comparison to future work.

Elliott et al. [2003] give an in-depth account of how to compile domain-specific embedded languages, but they do not treat HOAS.

Rhiger [2003] details an interpretation of simply-typed HOAS as an inductive datatype. His work differs from ours in that he only considers a single interpretation and he relies on a single global abstract type to disallow exotic terms and to ensure that the target terms are well-typed.

In their work on implementing type-preserving compilers in Haskell, Guillemette and Monnier [2007, 2008] mention conversion of HOAS to a de Bruijn representation. Their implementation sounds similar to ours, but they do not spell out the details. They do not mention the need to restrict the type representations in the embedded language. Their work does provide a good example of an intensional analysis—closure conversion—that would be difficult to express as a fold over the HOAS representation.

Pfenning and Lee [1991] examine the question of embedding a polymorphic language within Fω, with a view to defining a well-typed evaluator function. They use a nearly-HOAS representation with parametricity, where the λ-abstraction case is represented by a constructor with type ∀αβ.(α → exp β) → exp (α → β). Hence they do not disallow exotic terms. They are slightly more ambitious in that they attempt to embed a polymorphic language, something that we have not considered here. Guillemette and Monnier [2008] embed a polymorphic language using HOAS, but they resort to



using de Bruijn indices to represent type variables, which makes the embedding less usable.

Oliveira et al. [2006] investigate modularity in the context of generic programming. Our use of type classes to give modular extensions of embedded DSLs is essentially the same as their encoding of extensible generic functions.

Our unembedding translations are reminiscent of normalisation by evaluation (NBE) [Berger et al. 1998]. The idea of NBE is to obtain normal forms by first interpreting terms in some model and then defining a reify function mapping values in the model back to normal forms. The key is to choose a model that includes enough syntactic hooks in order to be able to define the reify function. In fact our unembeddings can be seen as degenerate cases of NBE. HOAS is a model of α-conversion and the reify function is given by the DB instance of the UntypedLambda type class.

Acknowledgements Atkey is supported by grant EP/G006032/1 from EPSRC. We would like to thank the anonymous reviewers for helpful comments, and Bruno Oliveira for pointing us to related work.

References

Martín Abadi, Luca Cardelli, and Pierre-Louis Curien. Formal parametric polymorphism. In POPL, pages 157–170, 1993.
Andrew W. Appel and Trevor Jim. Shrinking lambda expressions in linear time. Journal of Functional Programming, 7(5):515–540, 1997.
Robert Atkey. Syntax for free: Representing syntax with binding using parametricity. In Typed Lambda Calculi and Applications (TLCA), volume 5608 of Lecture Notes in Computer Science, pages 35–49. Springer, 2009a.
Robert Atkey. Parameterised notions of computation. Journal of Functional Programming, 19(3 & 4):355–376, 2009b.
Lennart Augustsson, Howard Mansell, and Ganesh Sittampalam. Paradise: a two-stage DSL embedded in Haskell. In ICFP, pages 225–228, 2008.
Arthur I. Baars and S. Doaitse Swierstra. Typing dynamic typing. In ICFP ’02, pages 157–166, New York, NY, USA, 2002. ACM.
Nick Benton, Andrew Kennedy, Sam Lindley, and Claudio V. Russo. Shrinking reductions in SML.NET. In IFL, pages 142–159, 2004.
Ulrich Berger, Matthias Eberl, and Helmut Schwichtenberg. Normalisation by evaluation. In Prospects for Hardware Foundations, 1998.
Jacques Carette, Oleg Kiselyov, and Chung-chieh Shan. Finally tagless, partially evaluated. Journal of Functional Programming, 2009. To appear.
James Cheney and Ralf Hinze. A lightweight implementation of generics and dynamics. In Haskell ’02, New York, NY, USA, 2002. ACM.
Alonzo Church. A formulation of the simple theory of types. Journal of Symbolic Logic, 5:56–68, 1940.
Ezra Cooper. The script-writer’s dream: How to write great SQL in your own language, and be sure it will succeed. In DBPL, 2009. To appear.
Ezra Cooper, Sam Lindley, Philip Wadler, and Jeremy Yallop. The essence of form abstraction. In APLAS, December 2008.
Thierry Coquand and Gérard P. Huet. Constructions: A higher order proof system for mechanizing mathematics. In European Conference on Computer Algebra (1), pages 151–184, 1985.
Nicolaas Govert de Bruijn. Lambda calculus notation with nameless dummies: A tool for automatic formula manipulation, with application to the Church-Rosser theorem. Indagationes Mathematicae, 1972.
Conal Elliott, Sigbjorn Finne, and Oege de Moor. Compiling embedded languages. Journal of Functional Programming, 13(3):455–481, 2003.
Leonidas Fegaras and Tim Sheard. Revisiting catamorphisms over datatypes with embedded functions (or, programs from outer space). In POPL, pages 284–294, 1996.
N. Ghani, M. Hamana, T. Uustalu, and V. Vene. Representing cyclic structures as nested datatypes. In H. Nilsson, editor, Proc. of 7th Symp. on Trends in Functional Programming, TFP 2006 (Nottingham, Apr. 2006), 2006.
Torsten Grust, Manuel Mayr, Jan Rittinger, and Tom Schreiber. Ferry: Database-supported program execution. In SIGMOD 2009, Providence, Rhode Island, June 2009. To appear.
Louis-Julien Guillemette and Stefan Monnier. A type-preserving closure conversion in Haskell. In Haskell, pages 83–92, 2007.
Louis-Julien Guillemette and Stefan Monnier. A type-preserving compiler in Haskell. In ICFP, pages 75–86, 2008.
Christian Hofer, Klaus Ostermann, Tillmann Rendel, and Adriaan Moors. Polymorphic embedding of DSLs. In GPCE, pages 137–148, 2008.
Andrew Kennedy. Compiling with continuations, continued. In ICFP, 2007.
Marco Kuhlmann. Functional metapost for LaTeX, 2001.
Daan Leijen and Erik Meijer. Parsec: Direct style monadic parser combinators for the real world. Technical Report UU-CS-2001-27, Department of Computer Science, Universiteit Utrecht, 2001.
Daan Leijen and Erik Meijer. Domain specific embedded compilers. In DSL ’99, pages 109–122, Austin, Texas, October 1999.
Daniel R. Licata, Noam Zeilberger, and Robert Harper. Focusing on binding and computation. In LICS, pages 241–252, 2008.
Sam Lindley. Many holes in Hindley-Milner. In ML ’08, 2008.
The Coq development team. The Coq proof assistant reference manual. LogiCal Project, 2009. URL http://coq.inria.fr. Version 8.2.
Conor McBride and Ross Paterson. Applicative programming with effects. Journal of Functional Programming, 18(1), 2008.
Erik Meijer and Graham Hutton. Bananas in space: Extending fold and unfold to exponential types. In FPCA, pages 324–333, 1995.
Bruno Oliveira, Ralf Hinze, and Andres Löh. Extensible and modular generics for the masses. In Trends in Functional Programming, pages 199–216, 2006.
Simon Peyton Jones, Jean-Marc Eber, and Julian Seward. Composing contracts: an adventure in financial engineering (functional pearl). In ICFP ’00, pages 280–292, New York, NY, USA, 2000. ACM.
Frank Pfenning and Conal Elliott. Higher-order abstract syntax. In PLDI, pages 199–208, 1988.
Frank Pfenning and Peter Lee. Metacircularity in the polymorphic lambda-calculus. Theor. Comput. Sci., 89(1):137–159, 1991.
Brigitte Pientka. A type-theoretic foundation for programming with higher-order abstract syntax and first-class substitutions. In POPL, pages 371–382, 2008.
Benjamin C. Pierce. Types and Programming Languages. MIT Press, 2002.
Gordon D. Plotkin and Martín Abadi. A logic for parametric polymorphism. In Marc Bezem and Jan Friso Groote, editors, TLCA, volume 664 of Lecture Notes in Computer Science, pages 361–375. Springer, 1993. ISBN 3-540-56517-5.
John C. Reynolds. Towards a theory of type structure. In Programming Symposium, Proceedings Colloque sur la Programmation, pages 408–423, London, UK, 1974. Springer-Verlag.
Morten Rhiger. A foundation for embedded languages. ACM Trans. Program. Lang. Syst., 25(3):291–315, 2003.
Tim Sheard, James Hook, and Nathan Linger. GADTs + extensible kind system = dependent programming. Technical report, Portland State University, 2005.
Val Tannen, Peter Buneman, and Limsoon Wong. Naturally embedded query languages. In ICDT ’92, pages 140–154. Springer-Verlag, 1992.
The Agda2 development team. The Agda2 website. http://wiki.portal.chalmers.se/agda/, 2009.
Peter Thiemann. WASH/CGI: Server-side web scripting with sessions and typed, compositional forms. In PADL, pages 192–208, 2002.
Geoffrey Washburn and Stephanie Weirich. Boxes go bananas: Encoding higher-order abstract syntax with parametric polymorphism. Journal of Functional Programming, 18(1):87–140, 2008.
Stephanie Weirich. Type-safe cast. Journal of Functional Programming, 14(6):681–695, 2004.



Lazy Functional Incremental Parsing

Jean-Philippe Bernardy
Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg

[email protected]

Abstract

Structured documents are commonly edited using a free-form editor. Even though every string is an acceptable input, it makes sense to maintain a structured representation of the edited document. The structured representation has a number of uses: structural navigation (and optional structural editing), structure highlighting, etc. The construction of the structure must be done incrementally to be efficient: the time to process an edit operation should be proportional to the size of the change, and (ideally) independent of the total size of the document.

We show that combining lazy evaluation and caching of intermediate (partial) results enables incremental parsing. We build a complete incremental parsing library for interactive systems with support for error-correction.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors; D.2.3 [Coding Tools and Techniques]: Program editors; D.1.1 [Programming Techniques]: Applicative (Functional) Programming; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages

General Terms Algorithms, Languages, Design, Performance, Theory

Keywords Lazy evaluation, Incremental Computing, Parsing, Dynamic Programming, Polish representation, Editor, Haskell

1. Introduction

Yi (Bernardy, 2008; Stewart and Chakravarty, 2005) is a text editor written in Haskell. It provides features such as syntax highlighting and indentation hints for a number of programming languages (figure 1). All syntax-dependent functions rely on the abstract syntax tree (AST) of the source code being available at all times. The feedback given by the editor is always consistent with the text: the AST is kept up to date after each modification. But, to maintain acceptable performance, the editor must not parse the whole file at each keystroke: we have to implement a form of incremental parsing.

Another feature of Yi is that it is configurable in Haskell. Therefore, we prefer to use the Haskell language for every aspect of the application, so that the user can configure it. In particular, syntax is described using a combinator library.


Figure 1. Screenshot. The user has opened a very big Haskell file. Yi gives feedback on matching parenthesis by changing the background color. Even though the file is longer than 2000 lines, real-time feedback can be given as the user types, because parsing is performed incrementally.

Our main goals can be formulated as constraints on the parsing library:

• it must be programmable through a combinator interface;
• it must cope with all inputs provided by the user, and thus provide error correction;
• it must be efficient enough for interactive usage: parsing must be done incrementally.

To implement this last point, one could choose a stateful approach and update the parse tree as the user modifies the input structure. Instead, in this paper we explore the possibility to use a more “functional” approach: minimize the amount of state that has to be updated, and rely as much as possible on laziness to implement incrementality.

1.1 Approach

In this section we sketch how lazy evaluation can help achieve incremental parsing.

An online parser exhibits lazy behavior: it does not proceed further than necessary to return the nodes of the AST that are demanded. Assuming that, in addition to using an online parser to produce the AST, it is traversed in pre-order to display the decorated text


Figure 2. Viewing the beginning of a file. The big triangle represents the syntax tree. The line at the bottom represents the file. The zigzag part indicates the part that is parsed. The viewing window is depicted as a rectangle.

presented to the user, the situation right after opening a file is depicted in figure 2. The window is positioned at the beginning of the file. To display the decorated output, the program has to traverse the first few nodes of the syntax tree (in pre-order). This traversal in turn forces parsing the corresponding part of the input, but, thanks to lazy evaluation, no further (or maybe a few tokens ahead, depending on the amount of look-ahead required). If the user modifies the input at this point, it invalidates the AST, but discarding it and re-parsing is not too costly: only a screenful of parsing needs to be re-done.

As the user scrolls down in the file, more and more of the AST is demanded, and the parsing proceeds in lockstep (figure 3). At this stage, a user modification is more serious: re-parsing naively from the beginning can be too costly for a big file. Fortunately we can again exploit the linear behavior of parsing algorithms to our advantage. Indeed, if the editor stores the parser state for the input point where the user made the modification, we can resume parsing from that point. Furthermore, if it stores partial results for every point of the input, we can ensure that we will never parse more than a screenful at a time. Thereby, we achieve incremental parsing, in the sense that the amount of parsing work needed after each user interaction depends only on the size of the change or the length of the move.

1.2 Contributions

Our contributions can be summarized as follows.

• We describe a novel, purely functional approach to incremental parsing, which makes essential use of lazy evaluation;

• We complete our treatment of incremental parsing with error correction. This is essential, since online parsers need to be total: they cannot fail on any input;

• We have implemented such a system in a parser-combinator library and made use of it to provide syntax-dependent feedback in a production-quality editor.

Figure 3. Viewing the middle of a file. Parsing proceeds in linear fashion: although only a small amount of the parse tree may be demanded, it will depend not only on the portion of the input that corresponds to it, but also on everything that precedes.

1.3 Interface and Outlook

Our goal is to provide a combinator library with a standard interface, similar to that presented by Swierstra (2000). Such an interface can be captured in a generalized algebraic data type (GADT, Xi et al. (2003)) as follows. These combinators are traditionally given as functions instead of constructors, but since we make extensive use of GADTs for modeling purposes at various levels, we prefer to use this presentation style everywhere for consistency. (Sometimes mere ADTs would suffice, but we prefer to spell out the types of the combinators explicitly, using the GADT syntax.)

data Parser s a where
  Pure :: a → Parser s a
  (:∗:) :: Parser s (b → a) → Parser s b → Parser s a
  Symb :: Parser s a → (s → Parser s a) → Parser s a
  Disj :: Parser s a → Parser s a → Parser s a
  Fail :: Parser s a

This interface supports production of results (Pure), sequencing (:∗:), reading of input symbols (Symb), and disjunction (Disj, Fail). The type parameter s stands for the type of input symbols, while a is the type of values produced by the parser.

Most of this paper is devoted to uncovering an appropriate representation for our parsing process type, and the implementation of the functions manipulating it. The core of this representation is introduced in section 3, where we merely handle the Pure and (:∗:) constructors. Dependence on input and the constructor Symb are treated in section 4. Disjunction and error correction will be implemented as a refinement of these concepts in section 5.

Parsing combinator libraries usually propose a mere run function that executes the parser on a given input: run :: Parser s a → [s] → Either Error a. Incremental systems require finer control over the execution of the parser. Therefore, we have to split the run function into pieces and reify the parser state in values of type Process.


We also need a few functions to create and manipulate the parsing processes:

• mkProcess :: Parser s a → Process s a: given a parser description, create the corresponding initial parsing process.

• feed :: [s] → Process s a → Process s a: feed the parsing process a number of symbols.

• feedEof :: Process s a → Process s a: feed the parsing process the end of the input.

• precompute :: Process s a → Process s a: transform a parsing process by pre-computing all the intermediate parsing results available.

• finish :: Process s a → a: compute the final result of the parsing, in an online way, assuming that the end of input has been fed into the process.
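As an illustration of how these functions fit together, the following definition (an illustrative sketch, not part of the paper's code) parses a complete input in one go, assuming some parser p is in scope:

parseAll :: Parser s a -> [s] -> a
-- Create a process, feed it all the symbols, signal the end of input,
-- and extract the result online.
parseAll p input = finish (feedEof (feed input (mkProcess p)))

The incremental main loop of section 2 uses the same functions, but interleaves feed and precompute with user edits instead of consuming the input in one sweep.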

Section 2 details our approach to incrementality by sketching the main loop of an editor using the above interface. The implementation for these functions can be given as soon as we introduce dependence on input in section 4. Sections 3 through 5 describe how our parsing machinery is built, step by step. In section 6 we discuss the problem of incremental parsing of the repetition construct. We discuss and compare our approach to alternatives in sections 7 through 10 and conclude in section 11.

2. Main loop

In this section we write an editor using the interface described in section 1.3. This editor lacks most features one would expect from a real application, and is therefore just a toy. It is however a self-contained implementation which tackles the issues related to incremental parsing.

The main loop alternates between displaying the contents of the file being edited and updating its internal state in response to user input. Notice that we make our code polymorphic over the type of the AST we process, merely requiring it to be Show-able.

loop :: Show ast ⇒ State ast → IO ()
loop s = display s >> update s >>= loop

The State structure stores the “current state” of our toy editor.

data State ast = State {
  lt, rt :: String,
  ls :: [Process Char ast]
}

The fields lt and rt contain the text respectively to the left and to the right of the edit point. The field ls is our main interest: it contains the parsing processes corresponding to each symbol to the left of the edit point. The left-bound lists, lt and ls, contain data in reversed order, so that the information next to the cursor corresponds to the head of the lists. Note that there is always one more element in ls than in lt, because we also have a parser state for the empty input.

We do not display the input document as typed by the user, but an enriched version, to highlight syntactic constructs. Therefore, we have to parse the input and then serialize the result. First, we feed the remainder of the input to the current state and then run the online parser. The display is then trimmed to show only a window around the edit point. Trimming takes a time proportional to the position in the file, but for the time being we assume that displaying is much faster than parsing and therefore the running time of the former can be neglected.

display :: (Show ast) ⇒ State ast → IO ()
display s@State { ls = pst : _ } = do
  putStrLn ""
  putStrLn $ trimToWindow
           $ show
           $ finish
           $ feedEof
           $ feed (rt s)
           $ pst
  where trimToWindow = take windowSize ◦ drop windowBegin
        windowSize = 10 -- arbitrary size
        windowBegin = length (lt s) − windowSize

There are three types of user input to take care of: movement, deletion and insertion of text. The main difficulty here is to keep the list of intermediate states synchronized with the text. For example, every time a character is typed, a new parser state is computed and stored. The other editing operations proceed in a similar fashion.

update :: State ast → IO (State ast)
update s@State { ls = pst : psts } = do
  c ← getChar
  return $ case c of
    -- cursor movements
    ’<’ → case lt s of -- left
      [ ]      → s
      (x : xs) → s { lt = xs, rt = x : rt s, ls = psts }
    ’>’ → case rt s of -- right
      [ ]      → s
      (x : xs) → s { lt = x : lt s, rt = xs
                   , ls = addState x }
    -- deletions
    ’,’ → case lt s of -- backspace
      [ ]      → s
      (x : xs) → s { lt = xs, ls = psts }
    ’.’ → case rt s of -- delete
      [ ]      → s
      (x : xs) → s { rt = xs }
    -- insertion of text
    c → s { lt = c : lt s, ls = addState c }
  where addState c = precompute (feed [c ] pst) : ls s

Besides disabling buffering of the input for real-time response, the top-level program has to instantiate the main loop with an initial state, and pick a specific parser to use: parseTopLevel.

main = do hSetBuffering stdin NoBuffering
          loop State { lt = "",
                       rt = "",
                       ls = [mkProcess parseTopLevel ] }

As we have seen before, the top-level parser can return any type. In sections 4 and 5 we give examples of parsers for S-expressions, which can be used as instances of parseTopLevel. We illustrate using S-expressions because they have a recursive structure which can serve as prototype for many constructs found in programming languages, while being simple enough to be treated completely within this paper.

data SExpr = S [SExpr ] | Atom Char
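For example (an illustration of the representation, not code from the paper), the string "(ab)" would be parsed into the following value:

exampleSExpr :: SExpr
exampleSExpr = S [Atom 'a', Atom 'b']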


The code presented in this section forms the skeleton of any program using our library. A number of issues are glossed over, though. Notably, we would like to avoid re-parsing when moving in the file if no modification is made. Also, the displayed output is computed from its start, and then trimmed; instead we would like to directly print the portion corresponding to the current window. This is tricky to fix: the attempt described in section 6 does not tackle the general case.

3. Producing results

Hughes and Swierstra (2003) show that the sequencing operator must be applicative (McBride and Paterson (2007)) to allow for online production of results. This result is the cornerstone of our approach to incremental parsing, so we review it in this section, justifying the use of the combinators Pure and (:∗:), which form the applicative sub-language. We also introduce the Polish representation for applicative expressions: it is the essence of our parsing semantics. This section culminates in the definition of the pipeline from applicative language to results by going through Polish expressions. Our final parser (section 5) is an extension of this machinery with all the features mentioned in the introduction.

A requirement for online production of the result is that nodes are available before their children are computed. In terms of datatypes, this means that constructors must be available before their arguments are computed. This can only be done if the parser can observe (pattern match on) the structure of the result. Hence, we make function applications explicit in the expression describing the results.

For example, the Haskell expression S [Atom ’a’ ], which stands for S ((:) (Atom ’a’) [ ]) if we remove syntactic sugar, can be represented in applicative form by using @ for applications.

S@((:)@(Atom@’a’)@[ ])

The following data type captures a pure applicative language embedding Haskell values. It is indexed by the type of values it represents.

data Applic a where
  (:∗:) :: Applic (b → a) → Applic b → Applic a
  Pure :: a → Applic a

infixl 4 :∗:

The application annotations can then be written using Haskell syntax as follows:

Pure S :∗: (Pure (:) :∗: (Pure Atom :∗: Pure ’a’) :∗: Pure [ ])

We can also write a function for evaluation:

evalA :: Applic a → a
evalA (f :∗: x) = (evalA f) (evalA x)
evalA (Pure a) = a
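Evaluating the annotated example (a small illustration, not part of the paper's code) recovers the original value:

exampleA :: SExpr
exampleA = evalA (Pure S :∗: (Pure (:) :∗: (Pure Atom :∗: Pure 'a') :∗: Pure [ ]))
-- exampleA ≡ S [Atom 'a']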

If the arguments to the Pure constructor are constructors, then we know that demanding a given part of the result forces only the corresponding part of the applicative expression. Because our parsers process the input in a linear fashion, they require a linear structure for the output as well. (This is revisited in section 5.) As Hughes and Swierstra (2003) do, we convert the applicative expressions to their Polish representation to obtain such a linear structure.

The key idea of the Polish representation is to put the application in a prefix position rather than an infix one. Our example

expression (in applicative form S@((:)@(Atom@’a’)@[ ])) becomes @S (@(@(:) (@Atom ’a’)) [ ]).

Since @ is always followed by exactly two arguments, grouping information can be inferred from the applications, and the parentheses can be dropped. The final Polish expression is therefore

@S@@(:)@Atom ’a’ [ ]

The Haskell datatype can also be linearized in the same way. Using App for @, Push to wrap values and Done to finish the expression, we obtain the following representation.

App $ Push S $ App $ App $ Push (:) $
App $ Push Atom $ Push ’a’ $ Push [ ] $ Done

data Polish where
  Push :: a → Polish → Polish
  App :: Polish → Polish
  Done :: Polish

Unfortunately, the above datatype does not allow us to evaluate expressions in a typeful manner. The key insight is that Polish expressions are in fact more general than applicative expressions: they represent a stack of values instead of a single one.

As hinted by the constructor names we chose, we can reinterpret Polish expressions as follows. Push produces a stack with one more value than its second argument, App transforms the stack produced by its argument by applying the function on the top to the argument in the second position and pushing back the result. Done produces the empty stack.

The expression Push (:) $ App $ Push Atom $ Push ’a’ $ Push [ ] $ Done is an example producing a non-trivial stack. It produces the stack (:), (Atom ’a’), [ ], which can be expressed purely in Haskell as (:) :< Atom ’a’ :< [ ] :< Nil, using the following representation for heterogeneous stacks.

data top :< rest = (:<) { top :: top, rest :: rest }
data Nil = Nil
infixr 4 :<

We are now able to properly type Polish expressions, by indexing the datatype with the type of the stack produced.

data Polish r where
  Push :: a → Polish r → Polish (a :< r)
  App :: Polish ((b → a) :< b :< r) → Polish (a :< r)
  Done :: Polish Nil

We can also write a translation from the pure applicative language to Polish expressions.

toPolish :: Applic a → Polish (a :< Nil)
toPolish expr = toP expr Done
  where toP :: Applic a → (Polish r → Polish (a :< r))
        toP (f :∗: x) = App ◦ toP f ◦ toP x
        toP (Pure x) = Push x

And the value of an expression can be evaluated as follows:

evalR :: Polish r → r
evalR (Push a r) = a :< evalR r
evalR (App s) = apply (evalR s)
  where apply ∼(f :< ∼(a :< r)) = f a :< r
evalR (Done) = Nil

We have the equality evalR (toPolish x) ≡ evalA x :< Nil. Additionally, we note that this evaluation procedure still possesses the “online” property: prefixes of the Polish expression are demanded only if the corresponding parts of the result are demanded. This preserves the incremental properties of lazy evaluation that we


required in the introduction. Furthermore, the equality above holds even when ⊥ appears as argument to the Pure constructor. In fact, the conversion from applicative to Polish expressions can be understood as a reification of the working stack of the evalA function with call-by-name semantics.

4. Adding input

While the study of the pure applicative language is interesting in its own right (we come back to it in section 4.1), it is not enough to represent parsers: it lacks dependency on the input.

We introduce an extra type argument (the type of symbols, s), as well as a new constructor: Symb. It expresses that the rest of the expression depends on the next symbol of the input (if any): its first argument is the parser to be used if the end of input has been reached, while its second argument is used when there is at least one symbol available, and it can depend on it.

data Parser s a where
  Pure :: a → Parser s a
  (:∗:) :: Parser s (b → a) → Parser s b → Parser s a
  Symb :: Parser s a → (s → Parser s a) → Parser s a

Using just this, as an example, we can write a simple parser for S-expressions.

parseList :: Parser Char [SExpr ]
parseList = Symb
  (Pure [ ])
  (λc → case c of
    ’)’ → Pure [ ]
    ’ ’ → parseList -- ignore spaces
    ’(’ → Pure (λh t → S h : t) :∗: parseList
                                 :∗: parseList
    c → Pure ((Atom c):) :∗: parseList)

We adapt the Polish expressions with the construct corresponding to Symb, and amend the translation. Intermediate results are represented by a Polish expression with a Susp element. The part before the Susp element corresponds to the constant part that is fixed by the input already parsed. The arguments of Susp contain the continuations of the parsing algorithm: the first one if the end of input is reached, the second one when there is a symbol to consume.

data Polish s r where
  Push :: a → Polish s r → Polish s (a :< r)
  App :: Polish s ((b → a) :< b :< r) → Polish s (a :< r)
  Done :: Polish s Nil
  Susp :: Polish s r → (s → Polish s r) → Polish s r

toP :: Parser s a → (Polish s r → Polish s (a :< r))
toP (Symb nil cons) = λk → Susp (toP nil k) (λs → toP (cons s) k)
toP (f :∗: x) = App ◦ toP f ◦ toP x
toP (Pure x) = Push x

Although we broke the linearity of the type, it does no harm since the parsing algorithm will not proceed further than the available input anyway, and therefore will stop at the first Susp. Suspensions in a Polish expression can be resolved by feeding input into it. When facing a suspension, we pattern match on the input, and choose the corresponding branch in the result.

The feed function below performs this duty for a number of symbols, and stops when it has no more symbols to feed. The dual function, feedEof, removes all suspensions by consistently choosing the end-of-input alternative.

feed :: [s ] → Polish s r → Polish s r
feed [ ] p = p
feed (s : ss) (Susp nil cons) = feed ss (cons s)
feed ss (Push x p) = Push x (feed ss p)
feed ss (App p) = App (feed ss p)
feed ss Done = Done

feedEof :: Polish s r → Polish s r
feedEof (Susp nil cons) = feedEof nil
feedEof (Push x p) = Push x (feedEof p)
feedEof (App p) = App (feedEof p)
feedEof Done = Done

For example, evalR $ feedEof $ feed "(a)" $ toPolish parseList yields back our example expression: S [Atom ’a’ ]. We recall from section 2 that feeding symbols one at a time yields all intermediate parsing results.

allPartialParses = scanl (λp c → feed [c ] p)

If the (n + 1)th element of the input is changed, one can reuse the nth element of the partial results list and feed it the new input's tail (from that position).

This suffers from a major issue: partial results remain in their “Polish expression form”, and reusing offers little benefit, because no part of the result value is shared between the partial results: the function evalR has to perform the full computation for each of them. Fortunately, it is possible to partially evaluate prefixes of Polish expressions. The following function performs this task by traversing a Polish expression and applying functions along the way.

evalL :: Polish s a → Polish s a
evalL (Push x r) = Push x (evalL r)
evalL (App f) = case evalL f of
  (Push g (Push b r)) → Push (g b) r
  r → App r
evalL x = x

partialParses = scanl (λp c → evalL ◦ feed [c ] $ p)
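Putting these pieces together, reuse after an edit amounts to picking the right stored process and feeding it the new suffix of the input. The helper below is a minimal illustrative sketch; its name and packaging are not from the paper:

-- Resume parsing after an edit at position n: take the process that had
-- consumed the unchanged prefix and feed it the new remainder of the input.
resumeAfterEdit :: Int -> [s] -> [Polish s r] -> Polish s r
resumeAfterEdit n newSuffix partials = feed newSuffix (partials !! n)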

This still suffers from a major drawback: as long as a function application is not saturated, the Polish expression will start with a long prefix of partial applications, which has to be traversed again in forthcoming partial results.

For example, after applying the S-expression parser to the string abcdefg, evalL is unable to perform any simplification of the list prefix:

evalL $ feed "abcdefg" (toPolish parseList)
  ≡ App $ Push (Atom ’a’:) $
    App $ Push (Atom ’b’:) $
    App $ Push (Atom ’c’:) $
    App $ ...

This prefix will persist until the end of the input is reached. A possible remedy is to avoid writing expressions that lead to this sort of intermediate result, and we will see in section 6 how to do this in the particularly important case of lists. This however works only up to some point: indeed, there must always be an unsaturated application (otherwise the result would be independent of the input). In general, after parsing a prefix of size n, it is reasonable to expect a partial application of at least depth O(log n), otherwise the parser is discarding information.

4.1 Zipping into Polish

In this section we develop an efficient strategy to pre-compute intermediate results. As seen in the above section, we want to avoid


the cost of traversing the structure up to the suspension at each step. This suggests using a zipper structure (Huet, 1997) with the focus at the suspension point.

data Zip s out where
  Zip :: RPolish stack out → Polish s stack → Zip s out

data RPolish inp out where
  RPush :: a → RPolish (a :< r) out → RPolish r out
  RApp :: RPolish (b :< r) out → RPolish ((a → b) :< a :< r) out
  RStop :: RPolish r r

Since the data is linear, this zipper is very similar to the zipper for lists. The part that is already visited (“on the left”) is reversed. Note that it contains only values and applications, since we never go past a suspension.

The interesting features of this zipper are its type and its meaning. We note that, while we obtained the data type for the left part by mechanically inverting the type for Polish expressions, it can be assigned a meaning independently: it corresponds to reverse Polish expressions.

In contrast to forward Polish expressions, which directly produce an output stack, reverse expressions can be understood as automata which transform one stack into another. This is captured in the type indices inp and out, which stand respectively for the input and the output stack.

Running this automaton requires some care: matching on the input stack must be done lazily. Otherwise, the evaluation procedure will force the spine of the input, effectively forcing the whole input file to be parsed.

evalRP :: RPolish inp out → inp → out
evalRP RStop acc = acc
evalRP (RPush v r) acc = evalRP r (v :< acc)
evalRP (RApp r) ∼(f :< ∼(a :< acc)) = evalRP r (f a :< acc)

In our zipper type, the Polish expression yet to visit (“on the right”) has to correspond to the reverse Polish automaton (“on the left”): the output of the latter has to match the input of the former. Capturing all these properties in the types (through GADTs) allows us to write a properly typed traversal of Polish expressions. The right function moves the focus by one step to the right.

right :: Zip s out → Zip s out
right (Zip l (Push a r)) = Zip (RPush a l) r
right (Zip l (App r)) = Zip (RApp l) r
right (Zip l s) = Zip l s

As the input is traversed, in the implementation of precompute, we also simplify the prefix that we went past, evaluating every application, effectively ensuring that each RApp is preceded by at most one RPush.

simplify :: RPolish s out → RPolish s out
simplify (RPush a (RPush f (RApp r))) = simplify (RPush (f a) r)
simplify x = x
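The body of precompute itself is not spelled out here; under the assumption that a Process is essentially such a zipper, it could be realized along the following lines (an illustrative sketch, not the library's actual code):

-- Move the focus right past everything already determined by the consumed
-- input, simplifying the visited prefix as we go, and stop at the first
-- suspension (or at Done).
precomputeZ :: Zip s out -> Zip s out
precomputeZ (Zip l r) = case r of
  Push a r' -> precomputeZ (Zip (simplify (RPush a l)) r')
  App r'    -> precomputeZ (Zip (RApp l) r')
  _         -> Zip l r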

We see that simplifying a complete reverse Polish expression requires O(n) steps, where n is the length of the expression. This means that the amortized complexity of parsing one token (i.e. computing a partial result based on the previous partial result) is O(1), if the size of the result expression is proportional to the size of the input. We discuss the worst case complexity in section 6.

In summary, it is essential for our purposes to have two evaluation procedures for our parsing results. The first one, presented in section 3, provides the online property, and corresponds to a call-by-name CPS transformation of the direct evaluation of applicative expressions. It underlies the finish function in our interface. The second one, presented in this section, enables incremental evaluation of intermediate results, and corresponds to a call-by-value transformation of the same direct evaluation function. It underlies the precompute function.
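The two evaluators meet in the zipper: the reverse automaton on the left consumes the stack that the online evaluator produces, lazily, from the part on the right. Assuming an evalR extended to the input-aware Polish type (as in section 5), this combination can be sketched as follows (an illustrative sketch, not the library's actual code):

-- Run the reverse-Polish automaton on the stack produced lazily by the
-- online evaluation of the remaining expression; finish would then take
-- the top of the resulting stack.
finishZ :: Zip s out -> out
finishZ (Zip l r) = evalRP l (evalR r)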

5. Adding Choice

We kept the details of actual parsing out of the discussion so far. This is for good reason: the machinery for incremental computation and reuse of partial results is independent from such details. Indeed, given any procedure to compute structured values from a linear input of symbols, one can use the procedure described above to transform it into an incremental algorithm.

However, parsing the input string with the interface presented so far is highly unsatisfactory. To support convenient parsing, we can introduce a disjunction operator, exactly as Hughes and Swierstra (2003) do: the addition of the Susp operator does not undermine their treatment of disjunction in any way.

5.1 Error correction

Disjunction is not very useful unless coupled with failure (otherwise any branch would be as good as another). Still, the (unrestricted) usage of failure is problematic for our application: the online property requires at least one branch to yield a successful outcome. Indeed, since the evalR function must return a result (we want a total function!), the parser must conjure up a suitable result for any input.

If the grammar is sufficiently permissive, no error correction in the parsing library itself is necessary. An example is the simple S-expression parser of section 4, which performs error correction in an ad-hoc way. However, most interesting grammars produce a highly structured result, and are correspondingly restrictive on the input they accept. Augmenting the parser with error correction is therefore desirable.

Our approach is to add some rules to accept erroneous inputs. These will be marked as less desirable by enclosing them with Yuck combinators, introduced as another constructor in the Parser type. The parsing algorithm can then maximize the desirability of the set of rules used for parsing a given fragment of input.

data Parser s a where
  Pure :: a → Parser s a
  (:∗:) :: Parser s (b → a) → Parser s b → Parser s a
  Symb :: Parser s a → (s → Parser s a) → Parser s a
  Disj :: Parser s a → Parser s a → Parser s a
  Yuck :: Parser s a → Parser s a

5.2 Example

In this section we rewrite our parser for S-expressions from section 4 using disjunction and error correction. The goal is to illustrate how these new constructs can help in writing more modular parser descriptions.

First, we can define repetition and sequence in the traditional way:

many, some :: Parser s a → Parser s [a ]
many v = some v ‘Disj‘ Pure [ ]
some v = Pure (:) :∗: v :∗: many v



Figure 4. A parsing process and associated progress information. The process has been fed a whole input, so it is free of Susp constructors. It is also stripped of result information (Push, App) for conciseness, since it is irrelevant to the computation of progress information. Each constructor is represented by a circle, and their arguments are indicated by arrows. The progress information associated with the process is written below the node that starts the process. To decide which path to take at the disjunction (Best), only the gray nodes will be forced, if the desirability difference is 1 for look-ahead 1.

Checking for the end of file can be done as follows. Notice that if the end of file is not encountered, we keep parsing the input, but complain while doing so.

eof = Symb (Pure ()) (λ_ → Yuck eof)

Checking for a specific symbol can be done in a similar way: we accept anything, but dislike (Yuck!) anything unexpected.

pleaseSymbol :: Eq s ⇒ s → Parser s (Maybe s)
pleaseSymbol s = Symb
  (Yuck $ Pure Nothing)
  (λs′ → if s ≡ s′ then Pure (Just s′)
                   else Yuck $ Pure (Just s′))

All of the above can be combined to write the parser for S-expressions. Note that we need to amend the result type to accommodate erroneous inputs.

data SExpr
  = S [SExpr ] (Maybe Char)
  | Atom Char
  | Missing
  | Deleted Char

parseExpr = Symb
  (Yuck $ Pure Missing)
  (λc → case c of
    ’(’ → Pure S :∗: many parseExpr :∗: pleaseSymbol ’)’
    ’)’ → Yuck $ Pure $ Deleted ’)’
    c → Pure $ Atom c)

parseTopLevel = Pure const :∗: parseExpr :∗: eof

We see that the constructs introduced in this section (Disj, Yuck) make it possible to write general-purpose derived combinators, such as many, in a traditional style.

5.3 The algorithm

Having defined our definitive interface for parsers, we can describe the parsing algorithm itself.

As before, we linearize the applications (:∗:) by transforming the Parser into a Polish-like representation. In addition to the Dislike and Best constructors corresponding to Yuck and Disj, Shift records where symbols have been processed, once Susp is removed.

data Polish s a where
  Push :: a → Polish s r → Polish s (a :< r)
  App :: Polish s ((b → a) :< b :< r) → Polish s (a :< r)
  Done :: Polish s Nil
  Shift :: Polish s a → Polish s a
  Susp :: Polish s a → (s → Polish s a) → Polish s a
  Best :: Polish s a → Polish s a → Polish s a
  Dislike :: Polish s a → Polish s a

toP :: Parser s a → (Polish s r → Polish s (a :< r))
toP (Pure x) = Push x
toP (f :∗: x) = App ◦ toP f ◦ toP x
toP (Symb a f) = λfut → Susp (toP a fut) (λs → toP (f s) fut)
toP (Disj a b) = λfut → Best (toP a fut) (toP b fut)
toP (Yuck p) = Dislike ◦ toP p

The remaining challenge is to amend our evaluation functions to deal with disjunction points (Best). Such a point offers two a priori equivalent alternatives. Which one should be chosen?

Since we want online behavior, we cannot afford to look further than a few symbols ahead to decide which parse might be the best. (Performance is another motivation: the number of potential paths grows exponentially with the amount of look-ahead.) We use the widespread technique (Bird and de Moor, 1997, chapter 8) to thin out the search after some constant, small amount of look-ahead.

Hughes and Swierstra's algorithm searches for the best path by direct manipulation of the Polish representation, but this direct approach forces a transformation between two normal forms: one where the progress nodes (Shift, Dislike) are at the head and one where the result nodes (Pure, :∗:) are at the head. Therefore, we choose to use an intermediate datatype which represents the progress information only. This clear separation of concerns also makes it possible to compile the progress information into a convenient form: our Progress data structure directly records how many Dislikes are encountered


after parsing so many symbols. It is similar to a list where the nth element tells how much we dislike taking this path after shifting n symbols following it, assuming we take the best choice at each disjunction.

data Progress = S | D Int | Int :# Progress

The difference from a simple list is that progress information may end with success (D) or suspension (S), depending on whether the process reaches Done or Susp. Figure 4 shows a Polish structure and the associated progress for each of its parts. The progress function below extracts the information from the Polish structure.
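For instance (an illustrative value, not one taken from figure 4), a path that shifts one symbol without complaint, shifts a second symbol past a Dislike, and then terminates would carry the progress information below. We parenthesize explicitly, since no fixity is declared for :# here:

exampleProgress :: Progress
exampleProgress = 0 :# (1 :# D 1)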

progress :: Polish s r → Progress
progress (Push _ p) = progress p
progress (App p) = progress p
progress (Shift p) = 0 :# progress p
progress (Done) = D 0
progress (Dislike p) = mapSucc (progress p)
progress (Susp _ _) = S
progress (Best p q) = snd $ better 0 (progress p) (progress q)

mapSucc S = S
mapSucc (D x) = D (succ x)
mapSucc (x :# xs) = succ x :# mapSucc xs

To deal with the last case (Best), we need to find out which of two profiles is better. Using our thinning heuristic, given two Progress values corresponding to two terminated Polish processes, it is possible to determine which one is best by demanding only a prefix of each. The following function handles this task. It returns the better of the two progress values, together with an indicator of which one is to be chosen. The constructors LT and GT respectively indicate that the second or third argument is the best, while EQ indicates that a suspension is reached. The first argument (lk) keeps track of how much look-ahead has been processed. This value is a parameter to our thinning heuristic, dislikeThreshold, which indicates when a process can be discarded.

better _ S _ = (EQ, S)
better _ _ S = (EQ, S)
better _ (D x) (D y) =
  if x ≤ y then (LT, D x) else (GT, D y)
better lk xs@(D x) (y :# ys) =
  if x ≡ 0 ∨ y − x > dislikeThreshold lk
  then (LT, xs)
  else min x y +> better (lk + 1) xs ys
better lk (y :# ys) xs@(D x) =
  if x ≡ 0 ∨ y − x > dislikeThreshold lk
  then (GT, xs)
  else min x y +> better (lk + 1) ys xs
better lk (x :# xs) (y :# ys)
  | x ≡ 0 ∧ y ≡ 0 = rec
  | y − x > threshold = (LT, x :# xs)
  | x − y > threshold = (GT, y :# ys)
  | otherwise = rec
  where threshold = dislikeThreshold lk
        rec = min x y +> better (lk + 1) xs ys

x +> ∼(ordering, xs) = (ordering, x :# xs)
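The definition of dislikeThreshold is left open above; a plausible choice (the exact constants are our assumption, picked to match the discussion in section 5.5) tolerates a small difference in dislikes within the look-ahead window and then forces a decision:

-- Thinning heuristic: within the first few symbols of look-ahead a small
-- difference in dislikes is tolerated; beyond that, any difference decides.
dislikeThreshold :: Int -> Int
dislikeThreshold lk
  | lk < lookAheadWindow = 1
  | otherwise            = 0
  where lookAheadWindow = 5 -- arbitrary constant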

Calling the better function directly is very inefficient though, because its result is needed every time a given disjunction is encountered. If the result of a disjunction depends on the result of a further disjunction, the result of the further disjunction will be needlessly discarded. Therefore, we cache the result of better in the Polish representation, using the well-known technique of tupling. For simplicity, we cache the information only at disjunction nodes, where we also remember which path is best to take.

We finally see why the Polish representation is important: the progress information cannot be associated to a Parser, because it may depend on whatever parser follows it. This is not an issue in the Polish representation, because applications (:∗:) are unfolded.

We now have all the elements to write our final data structures and algorithms. The following code shows the final construction procedure. In the Polish datatype, only the Best constructor is amended.

data Polish s a where
  ...
  Best :: Ordering → Progress → Polish s a → Polish s a → Polish s a

toP :: Parser s a → (Polish s r → Polish s (a :< r))
toP (Symb a f) = λfut → Susp (toP a fut) (λs → toP (f s) fut)
toP (f :∗: x) = App ◦ toP f ◦ toP x
toP (Pure x) = Push x
toP (Disj a b) = λfut → mkBest (toP a fut) (toP b fut)
toP (Yuck p) = Dislike ◦ toP p

mkBest :: Polish s a → Polish s a → Polish s a
mkBest p q =
  let (choice, pr) = better 0 (progress p) (progress q)
  in Best choice pr p q

The evaluation functions can be easily adapted to support disjunction by querying the result of better, cached in the Best constructor. We write the online evaluation only: partial result computation is modified similarly.

evalR :: Polish s r → r
evalR Done = Nil
evalR (Push a r) = a :< evalR r
evalR (App s) = apply (evalR s)
  where apply ∼(f :< ∼(a :< r)) = f a :< r
evalR (Shift v) = evalR v
evalR (Dislike v) = evalR v
evalR (Susp _ _) = error "input pending"
evalR (Best choice _ p q) = case choice of
  LT → evalR p
  GT → evalR q
  EQ → error "Suspension reached"

Note that this version of evalR expects a process without any pending suspension (the end of file must have been reached). In this version we also disallow ambiguity; see section 5.5 for a discussion.

5.4 Summary

We have given a convenient interface for constructing error-correcting parsers, and functions to evaluate them. This is performed in steps: first we linearize applications into Polish (as in section 4), then we linearize disjunctions (progress and better) into Progress. The final result is computed by traversing the Polish expressions, using Progress to choose the better alternative in disjunctions.

Our technique can also be re-formulated as lazy dynamic programming, in the style of Allison (1992). We first define a full tree of possibilities (Polish expressions with disjunction), then we compute progress information that we tie to it, for each node; finally, finding the best path is a matter of looking only at a subset of the information we constructed, using any suitable heuristic. The cut-off heuristic makes sure that only a part of the exponentially growing


data structure is demanded. Thanks to lazy evaluation, only that small part will be actually constructed.

5.5 Thinning out results and ambiguous grammars

A sound basis for thinning out less desirable paths is to discard those which are less preferable by some amount. In order to pick one path after a constant amount of look-ahead l, we must set this difference to 0 when comparing the lth element of the progress information, so that the parser can pick a particular path, and return results. Unfortunately, applying this rule strictly is dangerous if the grammar requires a large look-ahead, and in particular if it is ambiguous. In that case, the algorithm can possibly commit to a prefix which will lead to errors while processing the rest of the input, while another prefix would match the rest of the input and yield no error. In the present version of the library we avoid the problem by keeping all valid prefixes. The user of the parsing library has to be aware of this issue when designing grammars: it can affect the performance of the algorithm to a great extent, by triggering an exponential explosion of possible paths.

6. Eliminating linear behavior

As we noted in section 4, the result of some computations cannot be pre-computed in intermediate parser states, because constructors are only partially applied. This is indeed a common case: if the constructed output is a list, then the spine of the list can only be constructed once we get hold of the very tail of it. For example, our parser for S-expressions would produce such lists for flat expressions, because the applications of (:) can be computed only when the end of the input is reached.

evalL $ feed "(abcdefg" (toPolish parseList)
  ≡ App $ Push (Atom ’a’:) $
    App $ Push (Atom ’b’:) $
    App $ Push (Atom ’c’:) $
    App $ ...

Section 4.1 explained how to optimize the creation of intermediate results, by skipping this prefix. Unfortunately this does not improve the asymptotic performance of computing the final result. The partial result corresponding to the end of input contains the long chain of partial applications (in reverse Polish representation), and to produce the final result the whole prefix has to be traversed. Therefore, in the worst case, the construction of the result has a cost proportional to the length of the input.

While the above example might seem trivial, the same result applies to all repetition constructs, which are common in language descriptions. For example, a very long Haskell file is typically constituted of a very long list of declarations, for which a proportional cost must be paid every time the result is constructed.

The culprit for linear complexity is the linear shape of the list. Fortunately, nothing forces us to use such a structure: it can always be replaced by a tree structure, which can then be traversed in pre-order to discover the elements in the same order as in the corresponding list. Wagner and Graham (1998, section 7) recognize this issue and propose to replace left- or right-recursive rules in the parsing with a special repetition construct. The parsing algorithm treats this construct specially and re-balances the tree as needed. We choose a different approach: only the result type is changed, not the parsing library. We can do so for two reasons:

• Combinators can be parametrized by arbitrary values;

• Since we do not update a tree, but produce a fresh version every time, we need not worry about re-balancing issues.


Figure 5. A tree storing the elements 1 . . . 14. Additional elements would be attached to the right child of node 7: there would be no impact on the tree constructed so far.


Let us summarize the requirements we put on the data structure:

• It must provide the same laziness properties as a list: accessing an element in the structure should not force the input to be parsed further than it would have been if we had used a list.

• The nth element in pre-order should not be further away than O(log n) elements from the root of the structure. In other words, if such a structure contains a suspension in place of an element at position n, there will be no more than O(log n) partial applications on the stack of the corresponding partial result. This in turn means that the resuming cost for that partial result will be in O(log n).

The second requirement suggests a tree-like structure, and the first requirement implies that whether the structure is empty or not can be determined by entering only the root constructor. It turns out that a simple binary tree can fulfill these requirements.

data Tree a = Node a (Tree a) (Tree a)
            | Leaf
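To make the correspondence with lists concrete, a pre-order traversal (a small illustrative sketch, not part of the paper's code) recovers the elements in the order the replaced list would have produced them:

-- Pre-order traversal: the node itself, then its left sub-tree, then the
-- right sub-tree, which holds the rest of the sequence.
toList :: Tree a -> [a]
toList Leaf         = []
toList (Node x l r) = x : toList l ++ toList r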

The only choice that remains is the size of the sub-trees. The specific choice we make is not important as long as we make sure that each element is reachable in O(log n) steps. A simple choice is a series of complete trees of increasing depth. The kth tree will have depth k and contain 2^k − 1 nodes. For simplicity, all these sub-trees are chained using the same data type: they are attached as the left child of the spine of a right-leaning linear tree. Such a structure is depicted in figure 5.

We note that a complete tree of total depth 2d can therefore store at least ∑_{k=1}^{d} (2^k − 1) elements, fulfilling the second requirement.

This structure is very similar to the binary random-access lists presented by Okasaki (1999, section 6.2.1), but differs in purpose. The only construction primitive presented by Okasaki is the appending of an element. This is of no use to us, because that function has to analyze the structure it is appending to, and is therefore strict. We want to avoid this, and thus must construct the structure in one go. Indeed, the construction procedure is the only novel idea we introduce:

toTree d [ ] = Leaf
toTree d (x : xs) = Node x l (toTree (d + 1) xs′)
  where (l, xs′) = toFullTree d xs


toFullTree 0 xs = (Leaf, xs)
toFullTree d [ ] = (Leaf, [ ])
toFullTree d (x : xs) = (Node x l r, xs′′)
  where (l, xs′) = toFullTree (d − 1) xs
        (r, xs′′) = toFullTree (d − 1) xs′
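As a usage sketch (assuming the construction starts at depth 1, which is consistent with the caption of figure 5), the tree of figure 5 would be built as follows; a 15th element would then land in the right child of node 7, leaving the part of the tree built so far untouched:

figure5Tree :: Tree Int
figure5Tree = toTree 1 [1 .. 14]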

In other words, we must use a special construction function to guarantee the online production of results: we want the argument of Pure to be a simple value (not an abstraction), as explained in section 3. In fact, we will have to construct the list directly in the parser. The following function implements such a parser, where repeated elements are mere symbols.

parseTree d = Symb
  (Pure Leaf)
  (λs → Pure (Node s) :∗:
        parseFullTree d :∗:
        parseTree (d + 1))

parseFullTree 0 = Pure Leaf
parseFullTree d = Symb
  (Pure Leaf)
  (λs → Pure (Node s) :∗:
        parseFullTree (d − 1) :∗:
        parseTree (d − 1))

The function can be adapted for arbitrary non-terminals. One has to take care to avoid interference between the construction of the shape and error recovery. For example, the positions of non-terminals can be forced in the tree, so as to be in the node corresponding to the position of their first symbol. In that case the structure has to accommodate nodes that do not contain any information.

6.1 Quick access

Another benefit of using the tree structure as above is that finding the part of the tree of symbols corresponding to the edit window also takes logarithmic time. Indeed, the size of each sub-tree depends only on its position relative to the root. Therefore, one can access an element by its index without pattern matching on any node which is not on the direct path to it. This allows efficient indexed access without losing any property of laziness. Again, the technique can be adapted for arbitrary non-terminals. However, it will only work if each node in the tree is "small" enough. Finding the first node of interest might force an extra node, and in turn force parsing the corresponding part of the file.
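To make the logarithmic-access argument concrete, here is a sketch of indexed access (not the paper's code), under the assumption that the spine was built by toTree starting at depth 1:

-- The d-th node on the spine carries a complete left sub-tree of depth d,
-- holding 2^d - 1 elements, so each step decides where to descend without
-- touching any node off the access path.
index :: Tree a -> Int -> a
index = go 1
  where
    go _ Leaf _ = error "index out of range"
    go d (Node x l r) i
      | i == 0        = x
      | i <= leftSize = full d l (i - 1)
      | otherwise     = go (d + 1) r (i - 1 - leftSize)
      where leftSize = 2 ^ d - 1
    -- pre-order indexing inside a complete sub-tree of depth d
    full _ Leaf _ = error "index out of range"
    full d (Node x l r) i
      | i == 0        = x
      | i <= halfSize = full (d - 1) l (i - 1)
      | otherwise     = full (d - 1) r (i - 1 - halfSize)
      where halfSize = 2 ^ (d - 1) - 1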

7. Related work

The literature on parsing, incremental or not, is so abundant that a comprehensive survey would deserve its own treatment. Here we will compare our approach to some of the closest alternatives.

7.1 Development environments

The idea of incremental analysis of programs is not new. Wilcox et al. (1976) already implemented such a system. Their program works very similarly to ours: parsing states to the left of the cursor are saved so that changes to the program would not force a complete re-parse. A big difference is that it does not rely on built-in lazy evaluation. If they had produced an AST, its online production would have had to be managed entirely by hand. The system also did not provide error correction nor analysis to the right of the cursor.

Ghezzi and Mandrioli (1979) improved the concept by reusing parsing results to the right of the cursor: after parsing every symbol they check if the new state of the LR automaton matches that of the previous run. If it does, they know that they can reuse the results from that point on.

This improvement offers some advantages over Wilcox et al. (1976) which still apply when compared to our solution.

1. In our system, if the user jumps back and forth between the beginning and the end of the file, every forward jump will force re-parsing the whole file. Note that we can mitigate this drawback by caching the (lazily constructed) whole parse tree: a full re-parse is required only when the user makes a change while viewing the beginning of the file.

2. Another advantage is that the AST is fully constructed at all times. In our case only the part to the left of the window is available. This means that the functions that traverse the AST should do so in pre-order. If this is not the case, the online property becomes useless. For example, if one wishes to apply a sorting algorithm before displaying an output, this will force the whole input to be parsed before displaying the first element of the input. In particular, the arguments to the Pure constructor must not perform such operations on their arguments. Ideally, they should be simple constructors. This leaves much risk for the user of the library to destroy its incremental properties.

While our approach is much more modest, it can be considered better in some respects.

1. One benefit of not analyzing the part of the input to the right of the cursor is that there is no start-up cost: only a screenful of text needs to be parsed to start displaying it.

2. Another important point is that a small change in the input might completely invalidate the result from the previous parsing run. A simple example is the opening of a comment: while editing a Haskell source file, typing {- implies that the rest of the file becomes a comment up to the next matching -}. It is therefore questionable whether reusing right-bound parts of the parse tree offers any reasonable benefit in practice: it seems to be optimizing for a special case. This is not very suitable in an interactive system where users expect consistent response times.

3. Finally, our approach is better suited to a combinator implementation. Indeed, comparing parser states is very tricky to accomplish in the context of a combinator library: since parsing states normally contain lambda abstractions, it is not clear how they can be compared to one another.

Wagner and Graham (1998) improved on the state-matching technique. They contributed the first incremental parser that took into account the inefficiency of linear repetition. We compared our approach to theirs in section 6.

Despite extensive research dating as far back as 30 years ago, these solutions have barely caught on in the mainstream. Editors typically work using regular expressions for syntax highlighting at the lexical level (Emacs, Vim, Textmate, . . . ). It is possible that the implementation cost of earlier solutions outweighed their benefits. We hope that the simplicity of our approach will permit more widespread application.

7.2 Incremental computation

An alternative to our approach would be to build the library as a plain parser on top of a generic incremental computation system. The main drawback is that there currently exists no such off-the-shelf system for Haskell. The closest matching solution is provided


by Carlsson (2002), and relies heavily on explicit threading of computation through monads and explicit references for storage of inputs and intermediate results. This imposes an imperative description of the incremental algorithm, which does not match our goals. Furthermore, in the case of parsing, the inputs would be the individual symbols. This means that not only their contents will change from one run to another, but their number will as well. One then might want to rely on laziness, as we do, to avoid depending unnecessarily on the tail of the input, but then we hit the problem that the algorithm must be described imperatively. Therefore, we think that such an approach would be awkward, if at all applicable.

7.3 Parser combinators

Our approach is firmly anchored in the tradition of parser combinator libraries (Hutton and Meijer, 1998), and particularly close to the Polish parsers of Hughes and Swierstra (2003), which were recently refined by Swierstra (2009).

The introduction of the Susp operator is directly inspired by the parallel parsing processes of Claessen (2004), which feature a very similar construct to access the first symbol of the input and make it accessible to the rest of the computation. This paper presents our implementation as a version of Polish parsers extended with an evaluation procedure “by-value”, but we could equally have started with parallel parsing processes and extended them with “by-name” evaluation. The combination of both evaluation techniques is unique to our library.

Our error correction mechanism bears many similarities with that presented by Swierstra and Alcocer (1999): they also associate some variant of progress information to parsers and rely on thinning and laziness to explore the tree of all possible parses. An important difference is that we embed the error reports in the tree instead of returning them as a separate tree. This is important, because we need to highlight errors in a lazy way. If the errors were reported separately, merely checking if an error is present could force parsing the whole file.

Wallace (2008) presents another, simpler approach to online parsing, based on the notion of commitment. His library features two sequencing combinators: the classic monadic bind, and a special application with commitment. The former supports backtracking in the classic way, but the latter decouples errors occurring on its left-hand side from errors occurring on its right-hand side: if there are two possible ways to parse the left-hand side, the parser chooses the first match. This scheme therefore relies on user annotations at determined points in the production of the result to prune the search tree, while we prune after the same amount of look-ahead in all branches. This difference explains why we need to linearize the applications, while it can be avoided in Wallace's design. Additionally, we take advantage of the linear shape of the parsing process to feed it with partial inputs, so we cannot spare the linearization phase. A commitment combinator would be a useful addition to our library though: pruning the search tree at specific points can speed up the parsing and improve error reporting.

8. Discussion

Due to our choice to commit to a purely functional, lazy approach, our incremental parsing library occupies a unique point in the design space. It is also the first time that incremental and online parsing are both available in a combinator library.

What are the advantages of using the laziness properties of the online parser? Our system could be modified to avoid relying on laziness at all. In section 4.1 we propose to apply the reverse Polish

automaton (on the left) to the stack produced, lazily, by the Polish expression (on the right). Instead of that stack, we could feed the automaton with a stack of dummy values, or ⊥s. Everything would work as before, except that we would get exceptions when trying to access unevaluated parts of the tree. If we knew in advance how much of the AST is consumed, we could make the system work as such.

One could take the stance that this guesswork (knowing where to stop the parsing) is practically possible only for mostly linear syntaxes, where production of output is highly coupled with the consumption of input. Since laziness essentially liberates us from any such guesswork, the parser can be fully decoupled from the functions using the syntax tree.

The above reflection offers another explanation why most mainstream syntax highlighters are based on regular expressions or other lexical analysis mechanisms: they lack a mechanism to decouple processing of input from production of output.

The flip side to our approach is that the efficiency of the system crucially depends on the lazy behavior of consumers of the AST. One has to take great care in writing them.

9. Future work

Our treatment of repetition is still lacking: we would like to retrieve any node by its position in the input while keeping all properties of laziness intact. While this might be very difficult to do in the general case, we expect that our zipper structure can be used to guide the retrieval of the element at the current point of focus, so that it can be done efficiently.

Although it is trivial to add a failure combinator to the library presented here, we refrained from doing so because it can lead to failing parsers. Of course, one can use our Yuck combinator in place of failure, but one has to take into account that the parser continues running after the Yuck occurrence. In particular, many Yucks following each other can lead to some performance loss, as the “very disliked” branch would require more analysis to be discarded than an immediate failure. Indeed, if one takes this idea to the extreme and tries to use the fix-point (fix Yuck) to represent failure, it will lead to non-termination. This is due to our use of strict integers in the progress information. We have chosen this representation to emphasize the dynamic programming aspect of our solution, but in general it might be more efficient to represent progress by a mere interleaving of Shift and Dislike constructors.

Our library suffers from the usual drawbacks of parser combinator approaches. In particular, it is impossible to write left-recursive parsers, because they cause a non-terminating loop in the parsing algorithm. We could proceed as Baars et al. (2009) and transform the grammar to remove left-recursion. It is interesting to note however that we could represent traditional left-recursive parsers as long as they either consume or produce data, provided the progress information is indexed by the number of Pushes in addition to Shifts.

Finally, we might want to re-use the right-hand side of previous parses. This could be done by keeping the parsing results for all possible prefixes. Proceeding in this fashion would avoid the chaotic situation where a small modification might invalidate all the parsing work that follows it, since we take into account all possible prefixes ahead of time.

10. Results

We carried out development of a parser combinator library for incremental parsing with support for error correction. We argued


that, using suitable data structures for the output, the complexity of parsing (without error correction) is O(log m + n) where m is the number of tokens in the state we resume from and n is the number of tokens to parse. Parsing an increment of constant size has an amortized complexity of O(1). These complexity results ignore the time to search for the nodes corresponding to the display window.

The parsing library presented in this paper is used in the Yi editor to help match parentheses and lay out Haskell functions; environment delimiters as well as parenthetical symbols were likewise matched in the LaTeX source. This paper and the accompanying source code have been edited in Yi.

11. Conclusion

We have shown that the combination of a few simple techniques achieves the goal of incremental parsing.

1. In a lazy setting, the combination of online production of results and saving of intermediate results provides incrementality;

2. The efficient computation of intermediate results requires some care: a zipper-like structure is necessary to improve performance.

3. Online parsers can be extended with an error correction scheme for modularity.

4. Provided that they are carefully constructed to preserve laziness, tree structures can replace lists in functional programs. Doing so can improve the complexity class of algorithms.

While these techniques work together here, we believe that they are valuable independently of each other. In particular, our error correction scheme can be replaced by another one without invalidating the approach.

Acknowledgments

We thank Koen Claessen for persuading us to write this paper, and for his unfading support throughout the writing process. This paper was greatly improved by his comments on early and late drafts. Discussions with Krasimir Angelov helped sort out the notions of incremental parsing. Patrik Jansson, Wouter Swierstra, Gustav Munkby, Marcin Zalewski, Michał Pałka and the anonymous reviewers of ICFP gave helpful comments on the presentation of the paper. Finally, special thanks go to the reviewers of the Haskell Symposium for their extremely helpful comments.

References

L. Allison. Lazy Dynamic-Programming can be eager. Information Processing Letters, 43(4):207–212, 1992.

A. Baars, D. Swierstra, and M. Viera. Typed transformations of typed abstract syntax. In TLDI '09: Fourth ACM SIGPLAN Workshop on Types in Language Design and Implementation, New York, NY, USA, 2009.

J. Bernardy. Yi: an editor in Haskell for Haskell. In Proceedings of the First ACM SIGPLAN Symposium on Haskell, pages 61–62, Victoria, BC, Canada, 2008. ACM.

R. Bird and O. de Moor. Algebra of Programming. Prentice-Hall, Inc., 1997.

M. Carlsson. Monads for incremental computing. In Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming, pages 26–35, Pittsburgh, PA, USA, 2002. ACM.

K. Claessen. Parallel parsing processes. Journal of Functional Programming, 14(6):741–757, 2004.

C. Ghezzi and D. Mandrioli. Incremental parsing. ACM Trans. Program. Lang. Syst., 1(1):58–70, 1979.

G. Huet. The zipper. J. Funct. Program., 7(5):549–554, 1997.

R. J. M. Hughes and S. D. Swierstra. Polish parsers, step by step. In Proceedings of the Eighth ACM SIGPLAN International Conference on Functional Programming, pages 239–248, Uppsala, Sweden, 2003. ACM.

G. Hutton and E. Meijer. Monadic parsing in Haskell. Journal of Functional Programming, 8(4):437–444, 1998.

C. McBride and R. Paterson. Applicative programming with effects. Journal of Functional Programming, 18(1):1–13, 2007.

C. Okasaki. Purely Functional Data Structures. Cambridge University Press, July 1999.

D. Stewart and M. Chakravarty. Dynamic applications from the ground up. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, pages 27–38. ACM Press, 2005.

S. D. Swierstra. Combinator parsers: From toys to tools. Electronic Notes in Theoretical Computer Science, 41(1), 2000.

S. D. Swierstra. Combinator parsing: A short tutorial. In Language Engineering and Rigorous Software Development, volume 5520 of LNCS, pages 252–300, Piriapolis, 2009. Springer.

S. D. Swierstra and P. R. A. Alcocer. Fast, error correcting parser combinators: A short tutorial. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics, pages 112–131. Springer-Verlag, 1999.

T. A. Wagner and S. L. Graham. Efficient and flexible incremental parsing. ACM Transactions on Programming Languages and Systems, 20(5):980–1013, 1998.

M. Wallace. Partial Parsing: Combining Choice with Commitment, volume 5083/2008 of LNCS, pages 93–110. Springer Berlin / Heidelberg, 2008.

T. R. Wilcox, A. M. Davis, and M. H. Tindall. The design and implementation of a table driven, interactive diagnostic programming system. Commun. ACM, 19(11):609–616, 1976.

H. Xi, C. Chen, and G. Chen. Guarded recursive datatype constructors. SIGPLAN Not., 38(1):224–235, 2003.

Appendix: The complete code

The complete code of the library described in this paper can be found at: http://github.com/jyp/topics/tree/master/FunctionalIncrementalParsing/Code.lhs. The Yi source code is constantly evolving, but at the time of this writing it uses a version of the parsing library which is very close to the descriptions given in the paper. It can be found at: http://code.haskell.org/yi/Parser/Incremental.hs


Roll Your Own Test Bed for Embedded Real-Time Protocols: A Haskell Experience

Lee Pike
Galois, Inc.

[email protected]

Geoffrey Brown
Indiana University

[email protected]

Alwyn Goodloe
National Institute of Aerospace
[email protected]

Abstract

We present by example a new application domain for functional languages: emulators for embedded real-time protocols. As a case study, we implement a simple emulator for the Biphase Mark Protocol, a physical-layer network protocol, in Haskell. The surprising result is that a pure functional language with no built-in notion of time is extremely well-suited for constructing such emulators. Furthermore, we use Haskell's property-checker QuickCheck to automatically generate real-time parameters for simulation. We also describe a novel use of QuickCheck as a "probability calculator" for reliability analysis.

Categories and Subject Descriptors B.8.1 [Hardware]: Performance and Reliability

General Terms Languages, Reliability, Verification

Keywords Physical-layer protocols, Testing, Emulation, Functional Programming

1. Introduction

We present by example a new application domain for functional languages: building efficient emulators for real-time systems. Real-time systems are difficult to design and validate due to the complex interleavings possible between executing real-time components. Emulators assist in exploring and validating a design before committing to an implementation. Our goal in this report is to convince the reader by example1 that

1. one can easily roll one's own test bed for embedded real-time systems using standard functional languages, with no built-in notion of real-time;

2. testing infrastructure common to functional languages, such as QuickCheck (Claessen and Hughes 2000), can be exploited to generate real-time parameters for simulation—we generate approximately 100,000 real-time parameters and execution traces per minute on a commodity laptop;

1 The source code associated with this paper is presented in the Appendix and is also available for download at http://www.cs.indiana.edu/∼lepike/pub pages/qc-biphase.html. The code is released under a BSD3 license. The emulator is about 175 lines of code, and the QuickCheck infrastructure is about 100 lines.


3. and QuickCheck can be used for a novel purpose—to do statistical reliability analysis.

In our report, we assume that the reader is familiar with Haskell syntax. That said, our approach uses basic concepts shared by modern functional languages and does not intrinsically rely on laziness (or strictness) or special monads, for example.

In the remainder of this introduction, we motivate the problem domain and describe related work before going on to describe the emulator framework.

Problem Space: Physical Layer Networking The physical layer resides at the lowest level of the network stack and defines the mechanism for transmitting raw bits over the network. At the physical layer, bits are encoded as voltage signals. A bit stream is transmitted by modulating the electrical signal on an interconnect (e.g., coaxial cable). It is not as simple as translating a 1 to high voltage and a 0 to low voltage, because the receiver needs to be able to detect when there are consecutive ones or zeros and know when the sender has changed the signal. The inherent complexity at this layer results from (1) the sender and receiver not sharing a hardware clock (so they are asynchronous) and (2) the continuity of the physical world. Thus, the digital abstraction cannot be assumed to hold at this level. Furthermore, we must model the jitter and drift of hardware clocks and the time an electrical signal takes to settle before it stabilizes to a high or low value. If the receiver samples the interconnect at the wrong time, the signal may be misinterpreted by the receiver. The goal is to design a protocol and define timing constraints to ensure the receiver samples the interconnect at the right intervals to reliably decode the bit stream sent by the transmitter.

Many physical-layer protocols exist, but we shall focus on the Biphase Mark Protocol (BMP), which is used to transmit data in digital audio systems and magnetic card readers (e.g., for credit cards). The emulator is modularized: emulating another protocol requires changing just a few small functions (about 30 lines of code).

Background and Related Work Physical layer protocols have been a canonical challenge problem in the formal methods community. Recent work uses decision procedures (more precisely, satisfiability modulo theories) and model-checking to verify their correctness (Brown and Pike 2006); these results compare favorably to previous efforts using mechanical theorem-proving, which required thousands of manual proof steps (Moore 1994; Vaandrager and de Groot 2004). Indeed, the emulator described here is essentially refined from its high-level specification in a model checker (Brown and Pike 2006). Given the success of these formal verification techniques—which prove correctness—what interest is there in simulation?

There are at least a few responses. To begin with, it is not always the case that the constraints can be expressed in a decidable theory. In particular, timing constraints that contain non-linear inequalities cannot be decided (in this case, it so happens that our expression of the BMP constraints is linear). Furthermore, decision procedures and model-checkers are complex and may contain bugs, or the model itself may contain bugs. Both cases may lead to vacuous proofs, but because the "execution" of a model-checker's model is symbolic, it can be difficult to sanity-check the correctness of the model or tool. An emulator, however, is executed on concrete data. Another motivation is that even if there are no bugs in a formal model, a proof of correctness is only as good as the connection between the model used in the proof and its fidelity to the implementation. The components of a Haskell emulator can be, in principle, refined into digital hardware (Sheeran 2005), and the QuickCheck-generated data can be used not only to drive the emulator, but as test-vectors for the implemented hardware. Finally, as we discuss in Section 5, QuickCheck can be used as a "probability calculator" for reliability analysis of digital systems, something that cannot be done easily with current formal verification tools.

The work described here is part of a larger framework being developed by the two authors Pike and Goodloe for the purpose of building emulators for real-time safety-critical distributed systems under a NASA contract. On top of the emulator described here, we have built infrastructure to simulate a serial broadcast bus with multiple receivers and cyclic redundancy checks over the data by the receivers. Functional languages make constructing the additional emulator machinery easy; for example, a serial bus emulator is constructed by doing little more than mapping the emulator described here over a list of receivers.

2. Biphase Mark Protocol (BMP)

[Figure 1. BMP Encoding of a Bit Stream: the bit stream (1 1 0 1 0 0), the transmitter's clock with its period marked, and the resulting BMP-encoded signal, one encoded bit per clock period.]

We begin by describing the protocol. The simple portion of the protocol is the encoding of a bit stream by the transmitter. Consider Figure 1, where the top stream is the bit stream to be transmitted and the middle stream is the transmitter's clock. In BMP, every encoded data bit is guaranteed to begin with a transition marking a clock event; that is, the transmitter begins an encoded bit by modulating the signal on the interconnect. The value of the encoded bit is determined by the presence (to encode a 1) or absence (to encode a 0) of a transition in the middle of the encoded bit. Thus, a 0 is encoded as either two sequential low or high signals (e.g., 00 or 11), while a 1 is encoded as either a transition from high to low or low to high (e.g., 01 or 10).
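To make the encoding rule concrete, here is a minimal sketch (ours, not the emulator code from the appendix) of a pure BMP encoder that maps a list of bits to a list of signal levels, two per bit:

-- Illustrative encoder (not part of the paper's emulator). Each bit yields
-- two half-bit signal levels: the first always toggles the line (the clock
-- event), and the second toggles again only to encode a 1.
bmpEncode :: [Bool] -> [Bool]
bmpEncode = go True                 -- assume the line initially idles high
  where
    go _    []     = []
    go prev (b:bs) =
      let first  = not prev                         -- mandatory transition
          second = if b then not first else first   -- mid-bit transition iff b is 1
      in  first : second : go second bs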

The central design issue for the receiver is to extract a clock signal from the combined signal reliably. The receiver has two modes: a scanning mode, in which it attempts to detect a clock event marking the first half of an encoded bit, and a sampling mode, in which it assumes that sufficient synchrony has been established to simply sample the signal at some point while the second half of the bit is being transmitted.

In each of these modes, real-time constraints must be met to ensure correct operation. To see why, consider Figure 2, which represents a hypothetical plot over time of the strength of a signal sent by a transmitter. The period is the nominal interval between clock signal transitions, as shown in Figure 1.

[Figure 2. Signal Strength Over Time: signal strength plotted against time; each period is divided into a settling portion and a stable portion, and a value sampled during the settling portion may read as 1, 0, or indeterminate.]

For some portion of the period, the signal is stable. During the stable interval, the signal is guaranteed to be sufficiently high or low (in the figure, it is high) so that if the receiver samples the signal then, it is guaranteed to be sampled correctly. During the remainder of the period, however, the signal is settling, so the receiver nondeterministically interprets the signal as high, low, or indeterminate.

The real-time constraints on when the receiver scans and samples, described in the following section, are the key to the protocol's correctness.

3. Real-Time Parameters and Constraints

We approximate dense real-time using double-precision floating point numbers in Haskell:

type Time = Double

Real-time parameters associated with the transmitter and receiver are captured in a data type. Simulation runs are executed over instances of this data type. (We affix a 't' or 'r' to the parameter names to remind ourselves whether they're associated with the transmitter, tx, or receiver, rx.)

data Params = Params
  { tPeriod  :: Time -- ^ Tx's nominal clock period.
  , tSettle  :: Time -- ^ Maximum settling time.
  , rScanMin :: Time -- ^ Rx's min scan duration.
  , rScanMax :: Time -- ^ Rx's max scan duration.
  , rSampMin :: Time -- ^ Rx's min sampling duration.
  , rSampMax :: Time -- ^ Rx's max sampling duration.
  } deriving (Show, Eq)

The field tPeriod contains the nominal period of the transmitter. The field tSettle contains the maximum settling duration for the signal—we use the maximum possible settling interval so that the model is as pessimistic as possible, since the value of the signal is indeterminate while settling. (We do not need to keep track of tStable since we can compute it by tPeriod - tSettle.) We then have fields containing the minimum and maximum real-time values that bound the intervals of time that pass between successive scanning or sampling by the receiver. The difference between the minimum and maximum values captures the error introduced by clock drift and jitter. Indeed, these bounds are used to capture the cumulative error in both the transmitter's and receiver's clock. By ascribing the cumulative error to the receiver in the model, we can assume the transmitter's clock is error-free and always updates at its nominal period—otherwise, we would have fields recording minimum and maximum tPeriod intervals—so it is a modeling convenience.

We can now define a relation containing a conjunction of constraints over the parameters that (we hope!) ensure correct operation. These timing constraints are at the heart of what makes demonstrating the correctness of physical layer protocols difficult.


 1 correctParams :: Params → Bool
 2 correctParams p =
 3      0 < tPeriod p
 4   && 0 ≤ tSettle p
 5   && tSettle p < tPeriod p
 6   && 0 < rScanMin p
 7   && rScanMin p ≤ rScanMax p
 8   && rScanMax p < tStable
 9   && tPeriod p + tSettle p < rSampMin p
10   && rSampMin p ≤ rSampMax p
11   && rSampMax p < tPeriod p + tStable - rScanMax p
12   where tStable = tPeriod p - tSettle p

Some of the constraints are simply "sanity constraints" to ensure time is positive (e.g., the constraints on lines 3, 4, and 6) or that a minimum bound is no greater than a corresponding maximum bound (e.g., the constraints on lines 7 and 10). The other constraints are more interesting and derive from a designer's domain knowledge regarding the protocol. For example, the constraint on line 9 ensures that even if rx detects the first half of an encoded bit too early (i.e., just after it starts modulating at the beginning of the settling interval), it waits until the end of the settling interval plus the entire period (containing the stable interval of the first half of the bit and the settling interval of the second half of the bit) before sampling. This ensures rx does not sample before the stable interval of the period containing the second half of the bit.
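For concreteness, the following hypothetical instance (ours, purely for illustration) satisfies all of the constraints above: with tPeriod = 100 and tSettle = 5 we have tStable = 95, so rScanMax = 20 < 95, rSampMin = 110 > tPeriod + tSettle = 105, and rSampMax = 170 < tPeriod + tStable - rScanMax = 175.

-- A hypothetical parameter instance for illustration only;
-- correctParams exampleParams evaluates to True.
exampleParams :: Params
exampleParams = Params
  { tPeriod  = 100
  , tSettle  = 5
  , rScanMin = 1
  , rScanMax = 20
  , rSampMin = 110
  , rSampMax = 170
  }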

These constraints are complex, and we want to simulate the protocol's execution to ensure that they are correct and, if they are, that our implementation satisfies them.

4. The Emulator

So far, we have described the protocol and the real-time constraints we posit it must satisfy. To simulate it, we need an executable model. We begin by describing a model of real-time for the emulator, then the emulator itself.

4.1 Model of Time

Our model of time borrows from the discrete-event simulation model (Dutertre and Sorea 2004; Schriber and Brunner 1999). In this model, each independent real-time component, C, in a system possesses a timeout variable that ranges over Time. That timeout variable denotes the point in time at which C will make a state transition. The value of C's timeout variable is always in the future or the present; when it is at the present, C exercises a state transition, and its timeout variable is updated (possibly nondeterministically) to some point strictly in the future.

In our case, the transmitter and receiver each possess a timeout variable, which we denote as tclk and rclk, respectively. Intuitively, these values "leap frog" each other. The least-valued timeout is considered to be at the present, and so that component executes. Of course, one timeout might be significantly less than the other and will make successive transitions before the other component possesses the least-valued timeout.
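The scheduling rule can be sketched as a single pure step function (ours, for illustration; the emulator's actual transition function appears in Section 4.2): whichever component holds the least-valued timeout fires, and its timeout is pushed strictly into the future.

-- Illustrative leap-frog step over a pair of timeouts (tclk, rclk).
-- The concrete increments (100 and 70) are arbitrary example values.
step :: (Time, Time) -> (String, (Time, Time))
step (tclk, rclk)
  | tclk <= rclk = ("tx fires", (tclk + 100, rclk))
  | otherwise    = ("rx fires", (tclk, rclk + 70))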

The primary advantage of this model of time is that it is simple: we do not need a special semantics to model real-time execution.

4.2 Emulator Architecture

In Figure 3, we show an abstract representation of the system as it is modeled. We describe the components below.

The Transmitter The transmitter is comprised of three Haskell functions (and some small helper functions): an environment tenv, an encoder tenc, and the transmitter's clock, tclock. Of these, only the encoder is protocol-specific; the remainder are generic infrastructure.

[Figure 3. Emulator Architecture: the transmitter tx consists of tenv, tenc, and tclock (with timeout tclk) and drives tsignal; the receiver rx consists of rdec and rclock (with timeout rclk).]

The environment tenv simply returns a new random bit to send. Regarding the timeout function tclock, recall from Section 3 that in our model, we attribute errors to the receiver. Thus, the transmitter's timeout is updated deterministically: each application of tclock updates tx's timeout by exactly tPeriod p. This leaves only the transmitter's encoder tenc. This function is the protocol-specific portion of the transmitter's definition. The function has three possible branches. If the transmitter is not in the middle of sending an encoded bit, it may nondeterministically (using the System.Random library) idle the signal (i.e., not modulate the signal), or it may send the first half of an encoded bit. Otherwise, it encodes the second half of a bit.

The Receiver Architecturally, the receiver is simpler than the transmitter since it only contains a clock and a decoder. However, both of their definitions are more complex: rx's clock is more complex because we capture the effects of drift, jitter, and so forth here, so the timeout updates nondeterministically (again using the System.Random library); rx's decoder is more complex because here we model whether rx captures the signal depending on the relationship between tx's and rx's timeouts.

The receiver's timeout function updates the timeout nondeterministically depending on which of two modes rx is in. If rx is expecting the first half of an encoded bit (so in its scanning mode), it updates the timeout rclk to some random value within the inclusive range [rclk + rScanMin p, rclk + rScanMax p], where p is an instance of Params defined in Section 3. If rx is in the sampling mode, it similarly updates its timeout to some random value within [rclk + rSampMin p, rclk + rSampMax p].

As mentioned, the decoder rdec is where we model the effects of incorrectly sampling the signal. The decoder follows the BMP protocol to decode an incoming signal if stable is true, and fails to detect the signal properly otherwise. The function stable takes rx's and tx's state (implemented as data types) and returns a boolean:

stable :: Params → Rx → Tx → Bool
stable p rx tx =
     not (changing tx)
  || tclk tx - rclk rx < tPeriod p - tSettle p

Recall that tclk and rclk are the timeouts. The value of changing tx is a boolean that is part of tx's state—it is true if tx is modulating the signal in the next period. Thus, the function stable is true if either the signal is not going to modulate (so that even if it is sampled during the settling interval, it is sampled correctly), or the receiver's timeout falls within the stable interval—recall Figure 2. If stable is false, we return the opposite value of the signal being sent by the transmitter. This ensures our emulator is overly pessimistic and captures potentially metastable events even if they may not result in a faulty signal capture in reality.

Wiring the Transmitter and Receiver Together The function transition causes either tx or rx to execute a state-update. The function takes a set of real-time parameters and the receiver's and transmitter's states, and returns new states (within the IO monad).


transition :: Params → Rx → Tx → IO (Rx, Tx)
transition p rx tx
  | tclk tx ≤ rclk rx
    = do tx' ← txUpdate p tx
         return (rx {synch = False}, tx')
  | otherwise
    = do rx' ← rxUpdate p rx tx
         return (rx', tx)

The txUpdate function updates tx's state by applying the functions tenv, tenc, and tclock. Likewise for rxUpdate, except rxUpdate takes tx's state too, as based on the relationship between tx's timeout and its own, it may sample the signal correctly or not. Whether tx or rx is updated depends on which timeout is least—if they are equal, we arbitrarily choose to update tx's state.

Executing this function takes one "step" of the discrete-event emulator. We initialize the state of the transmitter and receiver, and then iteratively call the transition function for some user-specified number of rounds.
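A hypothetical driver (ours, for illustration) makes this iteration explicit; it assumes the initialization functions initTx and initRx from the appendix and the curried transition function shown above.

-- Illustrative driver: build an initial (Rx, Tx) state and apply the
-- transition function for a fixed number of steps.
runRounds :: Params -> Int -> IO (Rx, Tx)
runRounds p n = do
  rx0 <- initRx p
  tx0 <- initTx p
  let loop 0 rx tx = return (rx, tx)
      loop k rx tx = do (rx', tx') <- transition p rx tx
                        loop (k - 1) rx' tx'
  loop n rx0 tx0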

5. QuickCheck: Automatically Generating Timing Parameters

QuickCheck is a popular tool for automatically testing programs. Because our emulator itself generates random values (e.g., timeout updates for rx), the emulator executes within the IO monad; therefore, we use a monadic extension of QuickCheck (Claessen and Hughes 2002).

Test-Case Generation Our first task is to generate parameters that satisfy the correctParams function defined in Section 3. The naïve approach is to generate random instances of the Params data type and throw away those instances that do not satisfy correctParams. Unfortunately, this approach generates almost no satisfying instances because so few random parameters satisfy the constraints.

Therefore, we define a custom generator. However, we have the following problem: the set of inequalities in correctParams is circular and not definitional. The conjuncts of correctParams cannot be placed in a linear order such that each constraint introduces no more than one new parameter. Thus, we cannot sequentially generate parameters that satisfy them.

Our solution is to define a generator that over-approximates the inequalities in correctParams. For example, we can replace any occurrence of the parameter tSettle p on the right-hand side of ≤ with the parameter tPeriod p, since the latter is guaranteed to be larger than the former. By over-approximating, we can rewrite the inequalities so that each constraint introduces just one new parameter. This over-approximation is "close enough" so that a large number of generated instances satisfy correctParams—we can then prune out the few instances that do not satisfy correctParams.

Validation The following is the fundamental correctness property we wish to validate: whenever the receiver has captured (what it believes to be) the second half of an encoded bit, the bit it decodes is the one that tx encoded. (Again, Rx and Tx are the data types containing the receiver's and transmitter's respective state.)

bitsEq :: Rx → Tx → Bool
bitsEq rx tx = tbit tx == rbit rx

In the property, tbit tx is the bit that tx is encoding, and rbit rx is the bit rx has decoded.

QuickChecking this property over millions of simulation runs suggests (but of course does not prove) that our parameters are indeed correct. And it is fast. On a commodity laptop (MacBook Pro, 2.5 GHz Intel Core 2 Duo with 4 GB of memory), our emulator automatically generates approximately 100,000 simulations of the protocol in a minute.2

As with emulators in other programming languages, the efficacy of our test-bed for discovering timing errors is contingent upon the number and duration of test runs, the coverage achieved by the generated test data, and the significance of the timing violation.

QuickCheck as a Probability Calculator In standard practice, QuickCheck is used to validate a property and to return a counterexample otherwise. This usage model makes sense when verifying that programs operate correctly over discrete data such as lists, trees, and integers. In real-time systems, however, we identify a novel usage of QuickCheck as a probability calculator.

For (a slightly contrived) example, suppose that for some legacy hardware configuration, we know that the settling interval is no more than 5% of the period, and the receiver's bounds on scanning and sampling ensure it consistently captures the data. Later, suppose the receiver is to be used in a new configuration in which the settling interval may be up to 15% of the period. The receiver's bounds on scanning and sampling cannot be changed, since they are determined by its legacy clock. Now we ask: what percentage of bits will the receiver incorrectly decode?

To answer this question, we generate a fixed number of tests and determine what percentage of them fail. To facilitate this use of QuickCheck, we slightly extend its API.3 For the example described, generating 100,000 tests results in a failure rate (i.e., the property bitsEq above fails) of approximately 0.2%. Depending on the performance of error-checking codes and other constraints, this bit-error rate may be satisfactory.
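The idea can be approximated without extending QuickCheck at all; the sketch below (ours, not the paper's API extension, and assuming a QuickCheck version that exports generate) runs the emulator on generated out-of-spec parameters and counts how often bitsEq fails, reusing badGenParams and incorrectParams from Appendix B.

-- Illustrative failure-rate estimator.
failureRate :: Int -> IO Double
failureRate n = go n 0
  where
    go 0 fails = return (fromIntegral fails / fromIntegral n)
    go k fails = do
      p        <- generate (badGenParams `suchThat` incorrectParams)
      (rx, tx) <- startExec False p 1
      go (k - 1) (if bitsEq (rx, tx) then fails else fails + 1)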

Another use of QuickCheck as a "probability calculator" is to compute the probability of cyclic redundancy checks capturing bit-transmission errors under different fault scenarios (Driscoll et al. 2003; Paulitsch et al. 2005). In general, this appears to be a powerful application of QuickCheck for testing stochastic systems.

Using QuickCheck as a probability calculator depends on QuickCheck generating a sufficiently large number of appropriately-distributed tests. We have not verified the extent to which this hypothesis holds in various domains.

6. Conclusion

In this report, we demonstrate via example that functional languages—particularly Haskell—and their associated tools (i.e., QuickCheck) are unexpectedly well-suited to build real-time emulators. We have applied QuickCheck in two new ways—to generate real-time parameters and as a probability calculator for reliability analysis. We hope this report motivates others to explore the use of functional programming for building emulation test-beds for real-time systems.

Acknowledgments

This work is supported by NASA Contract NNL08AD13T from the Aviation Safety Program Office. We thank the following individuals for their advice and guidance on this work: Ben Di Vito of the NASA Langley Research Center; Levent Erkok, Dylan McNamee, Iavor Diatchki, Don Stewart, and John Launchbury of Galois, Inc.; Rebekah Leslie of Portland State University; and Andy Gill of the University of Kansas.

2 These performance results use a single core and suppress output to standard out. While there are no special performance optimizations made to the code, we use the System.Random.Mersenne Haskell library for fast random-number generation.
3 A corresponding patch is available at http://www.cs.indiana.edu/∼lepike/pub pages/qc-biphase.html.

References

Geoffrey M. Brown and Lee Pike. Easy parameterized verification of biphase mark and 8N1 protocols. In TACAS, volume 3920 of Lecture Notes in Computer Science, pages 58–72. Springer, 2006. Available at http://www.cs.indiana.edu/∼lepike/pub pages/bmp.html.

Koen Claessen and John Hughes. QuickCheck: A lightweight tool for random testing of Haskell programs. In ACM SIGPLAN Notices, pages 268–279. ACM Press, 2000.

Koen Claessen and John Hughes. Testing monadic code with QuickCheck. In Proc. ACM SIGPLAN Workshop on Haskell, pages 65–77, 2002.

Kevin Driscoll, Brendan Hall, Hakan Sivencrona, and Phil Zumsteg. Byzantine fault tolerance, from theory to reality. In Computer Safety, Reliability, and Security, LNCS, pages 235–248. SAFECOMP, Springer-Verlag, September 2003.

Bruno Dutertre and Maria Sorea. Modeling and verification of a fault-tolerant real-time startup protocol using calendar automata. In Formal Techniques in Real-Time and Fault-Tolerant Systems, volume 3253 of LNCS. Springer-Verlag, 2004.

J Strother Moore. A formal model of asynchronous communication and its use in mechanically verifying a biphase mark protocol. Formal Aspects of Computing, 6(1):60–91, 1994. URL citeseer.ist.psu.edu/moore92formal.html.

Michael Paulitsch, Jennifer Morris, Brendan Hall, Kevin Driscoll, Elizabeth Latronico, and Philip Koopman. Coverage and the use of cyclic redundancy codes in ultra-dependable systems. In International Conference on Dependable Systems and Networks (DSN 2005), pages 346–355, 2005.

Thomas J. Schriber and Daniel T. Brunner. Inside discrete-event simulation software: how it works and why it matters. In Winter Simulation Conference, pages 72–80, 1999.

M. Sheeran. Hardware design and functional programming: a perfect match. Journal of Universal Computer Science, 11(7):1135–1158, 2005.

F. W. Vaandrager and A. L. de Groot. Analysis of a Biphase Mark Protocol with Uppaal and PVS. Technical Report NIII-R0455, Nijmegen Institute for Computing and Information Science, 2004.

A. Biphase.hs

module Biphase where

-- A faster random-number generator
import System.Random.Mersenne

---------- DATATYPES ---------------------------------------
type Time = Double

-- | Realtime input parameters.
data Params = Params
  { tPeriod  :: Time -- ^ Tx's clock period.
  , tSettle  :: Time -- ^ Nominal signal settling time.
  , rScanMin :: Time -- ^ Rx's min scan duration.
  , rScanMax :: Time -- ^ Rx's max scan duration.
  , rSampMin :: Time -- ^ Rx's min sampling duration.
  , rSampMax :: Time -- ^ Rx's max sampling duration.
  } deriving (Show, Eq)

data TState = SendFirst  -- ^ Sending the 1st datum;
            | SendSecond -- ^ Sending the 2nd.
  deriving (Show, Eq)

data Tx = Tx
  { tstate   :: TState -- ^ Tx's state.
  , tsignal  :: Bool   -- ^ Signal being sent.
  , tbit     :: Bool   -- ^ Encoded bit to be sent.
  , changing :: Bool   -- ^ T: modulating the signal; F o/w.
  , tclk     :: Time   -- ^ Tx's timeout.
  } deriving (Show, Eq)

data RState = RcvFirst  -- ^ Expecting the 1st datum;
            | RcvSecond -- ^ Expecting the 2nd.
  deriving (Show, Eq)

data Rx = Rx
  { rstate  :: RState -- ^ Rx's state.
  , rsignal :: Bool   -- ^ Current datum being received.
  , rbit    :: Bool   -- ^ Decoded bit.
  , rclk    :: Time   -- ^ Rx's timeout.
  , synch   :: Bool   -- ^ Rx just transitioned from
                      --   RcvSecond to RcvFirst
                      --   (capturing a bit).
  } deriving (Show, Eq)

------------------------------------------------------------

-- Helper for Mersenne randoms
randomRng :: (Time, Time) → IO Time
randomRng (low, high) = do r ← randomIO
                           return $ low + (r ∗ (high - low))

---------- INITIAL STATE/CLOCKS ----------------------------
initTx :: Params → IO Tx
initTx p = do t   ← randomRng (0, tPeriod p - tSettle p)
              bit ← randomIO
              return Tx { tstate   = SendFirst
                        , tsignal  = True
                        , tbit     = bit
                        , changing = False
                        , tclk     = t}

initRclock :: Params → IO Time
initRclock p = do r ← randomRng (0, rScanMax p)
                  -- we want a random in [a, b)
                  if r == rScanMax p
                    then initRclock p
                    else return r

initRx :: Params → IO Rx
initRx p = do r   ← initRclock p
              bit ← randomIO
              return Rx { rstate  = RcvFirst
                        , rsignal = True
                        , rbit    = bit
                        , rclk    = r
                        , synch   = False}

------------------------------------------------------------

---------- Tx UPDATE ----------------------------------------
-- |
tenv :: Tx → IO Tx
tenv tx = case tstate tx of
            SendFirst  → do ran ← randomIO
                            return tx {tbit = ran}
            SendSecond → return tx

-- | The transmitter's encoder. Protocol-specific.
tenc :: Tx → IO Tx
tenc tx =
  case tstate tx of
    SendFirst →
      do idle ← randomIO
         if idle -- Idling
           then return tx {changing = False}
           -- 1st half of a new bit.
           else return
                  tx { tsignal  = ttoggle
                     , tstate   = SendSecond
                     , changing = True}
    SendSecond → return tx { tsignal  = toggle
                           , tstate   = SendFirst
                           , changing = changed toggle}
  where toggle      = if tbit tx then ttoggle else tsignal tx
        ttoggle     = not $ tsignal tx
        changed cur = cur /= tsignal tx

tclock :: Params → Tx → Tx
tclock p tx = tx {tclk = tPeriod p + tclk tx}

txUpdate :: Params → Tx → IO Tx
txUpdate p tx = do
  tx'  ← tenv tx
  tx'' ← tenc tx'
  return $ tclock p tx''

------------------------------------------------------------

---------- Rx UPDATE ----------------------------------------
-- | Correct update of rclk---helper
rclock :: Params → Rx → IO Time
rclock p rx =
  let r = rclk rx
  in case rstate rx of
       RcvFirst  →
         randomRng (r + rScanMin p, r + rScanMax p)
       RcvSecond →
         randomRng (r + rSampMin p, r + rSampMax p)

stable :: Params → Rx → Tx → Bool
stable p rx tx =
     not (changing tx)
  || tclk tx - rclk rx < tPeriod p - tSettle p

-- | The receiver's decoder. Protocol-specific.
rdec :: Params → Rx → Tx → Rx
rdec p rx tx =
  -- Are we in a "stable" part of the signal?
  let badSignal = not $ tsignal tx
      v = if stable p rx tx
            then tsignal tx else badSignal
  in case rstate rx of
       RcvSecond → rx { rsignal = v
                      , rbit    = rsignal rx /= v
                      , rstate  = RcvFirst}
       RcvFirst  → rx { rsignal = v
                      , rstate  = signal}
         where signal = if v == rsignal rx
                          then RcvFirst
                          else RcvSecond

rxUpdate :: Params → Rx → Tx → IO Rx
rxUpdate p rx tx = do
  let rx' = rdec p rx tx
      rchange = case (rstate rx, rstate rx') of
                  (RcvSecond, RcvFirst) → True
                  _                     → False
  r ← rclock p rx'
  return rx' { rclk  = r
             , synch = rchange}
------------------------------------------------------------

-- | Full state transition.
transition :: Params → (Rx, Tx) → IO (Rx, Tx)
transition p (rx, tx)
  | tclk tx ≤ rclk rx = do
      tx' ← txUpdate p tx
      return (rx {synch = False}, tx')
  | otherwise = do
      rx' ← rxUpdate p rx tx
      return (rx', tx)

putLnState :: Integer → (Rx, Tx) → IO ()
putLnState i (rx, tx) = do
  putStrLn $ "States: " ++ (show $ tstate tx) ++ " "
                        ++ (show $ rstate rx)
  putStrLn $ "Clocks: " ++ (show $ tclk tx) ++ " "
                        ++ (show $ rclk rx)
  putStrLn $ "Bits: " ++ (show $ tbit tx) ++ " "
                      ++ (show $ rbit rx)
                      ++ " Signal: " ++ (show $ tsignal tx)
                      ++ " " ++ (show $ rsignal rx)
  putStrLn $ "i: " ++ (show i) ++ " Synch: "
                   ++ (show $ synch rx) ++ "\n"

-- | Defines a "good" stop state: tx has sent the 2nd
-- signal bit and rx has sampled it.
stopState :: Rx → Bool
stopState rx = synch rx

execToStopState :: Bool → Params → Integer → (Rx, Tx) →
                   IO (Rx, Tx)
execToStopState output p i s = do
  if output then putLnState i s else return ()
  if stopState (fst s)
    then return s
    else execToStopState output p i =<< transition p s

-- | Execution of the protocol.
exec :: Bool → Params → Integer → (Rx, Tx) → IO (Rx, Tx)
exec output p i s = do
  s' ← execToStopState output p i s
  if i < 1 then return s'
           else exec output p (i-1) s'

-- | Begin a finite trace of length i from the initial
-- state. Either send one determined signal bit or a
-- series of nondeterministic signals.
startExec :: Bool → Params → Integer → IO (Rx, Tx)
startExec output p i = exec output p i =<< initState p

-- | The initial state.
initState :: Params → IO (Rx, Tx)
initState p = do
  rx ← initRx p
  tx ← initTx p
  return (rx, tx)

B. BiphaseQC.hs

module Main where

import Biphase
import Test.QuickCheck
import Test.QuickCheck.Monadic
import Test.QuickCheck.Gen


-- | Number of rounds to execute
iter :: Integer
iter = 1

-- | Property should always hold for good parameters.
prop_correct :: Bool → Property
prop_correct output =
  assertFinal output forallValidParams bitsEq

-- | Testing should fail on this property for some
-- percentage of tests.
prop_incorrect :: Bool → Property
prop_incorrect output =
  assertFinal output forallInvalidParams bitsEq

-- | Did the receiver get the bits sent by the sender upon
-- synchronizing?
bitsEq :: (Rx, Tx) → Bool
bitsEq (rx, tx) = (tbit tx) == (rbit rx)

-- | Note: monadicIO (from QuickCheck) uses unsafePerformIO.
assertFinal :: Bool → ParamGen → ((Rx, Tx) → Bool) → Property
assertFinal output genParams pred =
  monadicIO $ genParams $ λp →
    assert ◦ pred =<< run (startExec output p iter)

----------- SIMPLE MAIN FUNCTION (modify as needed) -------
main = do
  putStrLn ""
  putStrLn $ "Enter the number of bits to encode"
               ++ " (an integer between 1 and 100 million): "
  s ← getLine
  putStrLn $ "Show output for each test? (True or False)"
  output ← getLine
  let i = read s
    in if i < 1 || i > 10^8
         then main
         else quickCheckQuotientWith stdArgs {maxSuccess = i}
                (prop_correct $ read output)

-----------------------------------------------------

type ParamGen =
  (Params → PropertyM IO ()) → PropertyM IO ()

-- | Generating correct params is "too hard" to do
-- procedurally, so we get close and then use a
-- predicate to make sure we're only testing correct ones.
forallValidParams :: ParamGen
forallValidParams =
  forAllM (genParams ‘suchThat‘ correctParams)

-- | Generate ∗almost∗ correct realtime parameters --- it's an
-- overapproximation. We need to test them to ensure
-- correctness.
genParams :: Gen Params
genParams = do
  -- arbitrary-sized clock period
  tperiod  ← choose (0, 100)
  -- The remaining generated values are over-approximations.
  tsettle  ← choose (0, tperiod)
  rscanmin ← choose (0, tperiod - tsettle)
  rscanmax ← choose (rscanmin, tperiod)
  rsampmin ← choose ( tperiod + tsettle
                    , 2 ∗ tperiod - tsettle - rscanmax)
  rsampmax ← choose ( rsampmin
                    , 2 ∗ tperiod - tsettle - rscanmax)
  return $ Params tperiod tsettle rscanmin
                  rscanmax rsampmin rsampmax

-- | Constraints are satisfied. Reproduced for genParams.
correctParams :: Params → Bool
correctParams p =
     0 < tPeriod p                                    -- tPeriod
  && 0 ≤ tSettle p                                    -- tSettle
  && tSettle p < tPeriod p                            -- tSettle
  && 0 < rScanMin p                                   -- rScanMin
  && rScanMin p ≤ rScanMax p                          -- rScanMax
  && rScanMax p < tStable                             -- rScanMax
  && tPeriod p + tSettle p < rSampMin p               -- rSampMin
  && rSampMin p ≤ rSampMax p                          -- rSampMax
  && rSampMax p < tPeriod p + tStable - rScanMax p    -- rSampMax
  where tStable = tPeriod p - tSettle p

--- GENERATING FAILING TESTS ---

forallInvalidParams :: ParamGen
forallInvalidParams =
  forAllM (badGenParams ‘suchThat‘ incorrectParams)

-- Example in the paper.
badGenParams :: Gen Params
badGenParams =
  let tperiod       = 100
      tsettleNewMax = 15
      tsettle       = 5
  in do
    -- arbitrary-sized clock period
    -- tperiod ← choose (0, 100)
    -- The remaining generated values are over-approximations.
    tsettle' ← choose (0, tsettleNewMax)
    rscanmin ← choose (0, tperiod - tsettle)
    rscanmax ← choose (rscanmin, tperiod)
    rsampmin ← choose ( tperiod + tsettle
                      , 2 ∗ tperiod - rscanmax - tsettle)
    rsampmax ← choose ( rsampmin
                      , 2 ∗ tperiod - rscanmax - tsettle)
    return $ Params tperiod tsettle' rscanmin
                    rscanmax rsampmin rsampmax

-- Constraints are satisfied. Reproduced for genParams.
incorrectParams :: Params → Bool
incorrectParams p =
  let tSettle' = 5
  in   0 < tPeriod p                                  -- tPeriod
    && 0 ≤ tSettle'                                   -- tSettle
    && tSettle p < tPeriod p                          -- tSettle p
    && 0 < rScanMin p                                 -- rScanMin
    && rScanMin p ≤ rScanMax p                        -- rScanMax
    && rScanMax p < tPeriod p - tSettle'              -- rScanMax
    && tPeriod p + tSettle'
         < rSampMin p                                 -- rSampMin
    && rSampMin p ≤ rSampMax p                        -- rSampMax
    && rSampMax p <                                   -- rSampMax
         2 ∗ tPeriod p - rScanMax p - tSettle'


A Compositional Theory for STM Haskell

Johannes Borgstrom    Karthikeyan Bhargavan    Andrew D. Gordon
Microsoft Research, Cambridge, UK
{joborg,karthb,adg}@microsoft.com

Abstract

We address the problem of reasoning about Haskell programs that use Software Transactional Memory (STM). As a motivating example, we consider Haskell code for a concurrent non-deterministic tree rewriting algorithm implementing the operational semantics of the ambient calculus. The core of our theory is a uniform model, in the spirit of process calculi, of the run-time state of multi-threaded STM Haskell programs. The model was designed to simplify both local and compositional reasoning about STM programs. A single reduction relation captures both pure functional computations and also effectful computations in the STM and I/O monads. We state and prove liveness, soundness, completeness, safety, and termination properties relating source processes and their Haskell implementation. Our proof exploits various ideas from concurrency theory, such as the bisimulation technique, but in the setting of a widely used programming language rather than an abstract process calculus. Additionally, we develop an equational theory for reasoning about STM Haskell programs, and establish for the first time equations conjectured by the designers of STM Haskell. We conclude that using a pure functional language extended with STM facilitates reasoning about concurrent implementation code.

Categories and Subject Descriptors D.2.4 [Software/Program Verification]: Correctness Proofs; D.3.1 [Formal Definitions and Theory]: Syntax and Semantics; D.3.3 [Language Constructs and Features]: Concurrent Programming Structures—Software Transactional Memory.

General Terms Theory, verification

Keywords Transactional memory, compositional reasoning, ambient calculus

1. Introduction

Software Transactional Memory (STM), introduced by Shavit and Touitou [31], is a promising programming abstraction for shared-variable concurrency. Shared variables may only be accessed within transactions. Ingenious software techniques allow transactions to run in parallel while their semantics is as if they run in series. There is a good deal of promising research on efficient implementations, in the context of various languages [9, 10, 27]. Moreover, several formal techniques have been applied to verifying the underlying algorithms [30] and their implementations [22, 8, 20, 1].


In this paper, we explore the prospects for reasoning about software written using the STM abstraction. Transactional semantics undoubtedly simplifies the reasoning task compared to, say, lock-based concurrency [18, 23], but is no panacea.

We pursue the idea that theories of concurrency developed in abstract process calculi can fruitfully be recast in the concrete setting of transactional programming languages. We consider programs written in STM Haskell [10, 11], an embedding of transactional concurrency within the pure functional language Haskell.

As a concrete programming task, we investigate programming, specifying, and reasoning about an STM Haskell implementation of ambients. The ambient calculus [5] is a formalism for expressing and reasoning about mobile computation, more recently applied to biological systems [26]. An ambient process has a hierarchical structure that mutates over time, as ambients within the structure move inside and outside other ambients. Hence, an implementation of the ambient calculus amounts to a concurrent tree rewriting algorithm. The first concurrent implementation, by Cardelli in Java [4], was lock-based; here, we give a lock-free implementation of a programming API for the ambient calculus in STM Haskell.

As a basis for reasoning about this code, we present a core calculus for STM Haskell: a concurrent non-strict lambda calculus with transactional variables and atomic blocks. The syntax of lambda calculus expressions provides a compositional and uniform formalism for both source programs and their run-time states.

The original presentation of STM Haskell separated the transactional heap, its latest checkpoint, and the currently running code. In contrast, our syntax uniformly represents all of these as expressions, hence facilitating compositional reasoning by allowing multiple threads with associated pieces of the heap to be composed using a parallel operator and rearranged using structural congruence.

We develop a formal semantics and type system for the calculus. Our semantics is based on a reduction relation → on expressions, which range over both effect-free functional computations and effectful concurrent computations. It facilitates local reasoning by having transactions run against a part of the heap, making it easy to exhibit the smallest heap allowing a given transaction to make progress. On the other hand, we can show a precise correspondence with the original (but syntactically more complex) operational semantics of STM Haskell.

We import into STM Haskell behavioural equivalences and proof techniques that originate in process calculi. Notably, in our proofs, the notion of bisimulation up to a relation [19, 28] permits significant reductions in the size of the state space that needs to be considered when reasoning equationally about a program. This state space reduction comes beyond the already significant reduction afforded by the atomicity and serializability guarantees of the STM abstraction.

Using these proof techniques, our first result, Theorem 1, directly relates the operational semantics of ambient processes with expressions representing the run-time states of their Haskell implementation. More specifically, Theorem 1 establishes liveness, soundness, completeness, safety, and termination properties. Building on the first theorem, our second result, Theorem 2, is a concise statement of full correctness of our implementation in terms of a bisimulation between the source process and its Haskell translation. The use of transactions makes the Haskell code rather simpler than Cardelli's original Java program; still, it is non-trivial but feasible to establish the intended correspondences between the implementation code and the formal specification of ambients.

Finally, we adopt the standard definition of Morris-style contextual equivalence of expressions, and develop a sound equational theory. We show that the monad laws of the STM monad hold, and also establish some equations proposed by the designers of STM Haskell [10].

Our main contributions are the following:

• We use notions of behavioural equivalence and proof techniques from process calculi to specify and prove STM Haskell code for a complex concurrent tree rewriting algorithm (an implementation of the ambient calculus).

• We develop uniform syntax and reduction semantics, in the style of process calculi, for STM Haskell. The uniformity of the syntax facilitates compositional reasoning over multiple threads with associated pieces of the heap.

• We prove soundness of an equational theory for STM Haskell, including monad laws and properties of operators for transactional control flow.

Outline of the Paper Section 2 introduces the formal syntax of core STM Haskell, and reviews the programming model. (Since our motivating example does not use exceptions, we omit them from the core calculus. The extended version of the paper shows how to extend our core calculus with exceptions.) Section 3 describes our code and data structures for programming ambients in Haskell. Section 4 completes our formalization of STM Haskell; we define the operational semantics and type system, and compare with the original semantics. Section 5 recalls the semantics of the ambient calculus, and states and proves the expected correctness properties; the proof depends on a detailed correspondence between ambient processes and expressions representing the corresponding states of our implementations. Section 6 develops an equational theory based on our operational semantics. Section 7 describes related work, and Section 8 concludes and discusses potential future work.

An extended version of this paper, with additional explanations and proofs, and code listings, is available [3].

2. A Core Calculus for STM Haskell

The GHC implementation of STM [10] uses a monad to encapsulate all accesses to shared transactional variables (TVars). To execute an STM transaction we use the function atomically, which has type STM t → IO t. If the execution of a transaction returns a result, the run-time system guarantees that it executed independently of all other transactions in the system. An STM expression can also retry, which may cause the transaction to be rolled back and run at some later time.

The original definition of STM Haskell made use of an implicit functional core language. For verification purposes, and in order to make this paper self-contained, we formalize STM Haskell as a concurrent non-strict lambda calculus, with memory cells (TVars) and atomic blocks.

2.1 Syntax

Our syntax treats source programs, heaps made up of transactional variables (TVars), and concurrent threads uniformly as expressions in the calculus, similarly to Concurrent Haskell [24].

We assume denumerable distinct sets X of (lambda calculus) variables and N of (TVar) addresses; we let x, y, z range over X, and a, b range over N. We let f range over ADT constructors.

Expressions of Core STM Haskell:

M, N ∈ M ::=                         expression
    x            (x ∈ X)             variable (value)
    a            (a ∈ N)             address (value)
    λx.M                             lambda abstraction (value)
    f M                              construction (value)
    M N                              application
    case M of f x → N                case expression
    Y M                              fixpoint
    equal M N                        address equality
    readTVar M                       STM read variable
    writeTVar M N                    STM write variable
    returnSTM M                      STM return
    retry                            STM retry transaction
    M >>=STM N                       STM bind
    orElse M N                       STM prioritized choice
    or M N                           STM erratic choice
    atomically M                     IO execute transaction
    returnIO M                       IO return
    M >>=IO N                        IO bind
    a ↦ M                            transactional variable (TVar)
    (νa)M                            restriction (a bound in M)
    M | N                            parallel composition
    emp                              empty heap

Our syntax is close to STM Haskell, but with minor differences. The STM Haskell functions newTVar and fork are not primitive expressions in our calculus, but are derived forms, explained below. In actual source programs, the monadic bind and return operators do not have the subscripts IO and STM; instead, the monads are inferred by the typechecker. The syntax a ↦ M, (νa)M, M | N, emp, and addresses a exist as expressions only to represent run-time state, and cannot be written directly in source programs. Actual Haskell includes various abbreviations such as recursive definitions, pattern matching, and do-notation for monads, which can be reduced in standard ways to our core syntax.

We introduce some notational conventions. We write M as shorthand for a possibly empty sequence M1 · · · Mn (and similarly for x, t, etc.) and f x → N for a non-empty sequence f1 x1 → N1 | · · · | fm xm → Nm, where the fi as well as the xij are assumed to be pairwise different (and similarly for f t). We write the empty sequence as ◦ and denote concatenation of sequences using a comma. The length of a sequence x is written |x|. If φ is a phrase of syntax (such as an expression), we let fv(φ) and fn(φ) be the sets of variables and names occurring free in φ. We write φ{M/x} for the outcome of the capture-avoiding substitution of M for each free occurrence of x in φ.

We let g M range over the builtin applications, which are the expressions listed above starting with Y M and ending with M >>=IO N. (This notation is particularly useful in the typing rule (T BUILTIN) in Section 4.2.) If g M is a builtin application, we say g is a builtin function; Y and >>=IO are examples. To simplify our reduction semantics, builtin functions are not themselves expressions, and only occur fully applied to their arguments. As usual, we can use lambda abstraction to represent unapplied or partially applied builtin functions as expressions.

2.2 Informal Semantics

Our uniform syntax for expressions covers single-threaded functional computations, as well as heaps, imperative transactions, and concurrent computations. We describe each of these in turn, together with their semantics.

Functional Computations The core of our expression language is a call-by-name lambda calculus with algebraic data types. The simplest expressions are values, which return at once. As well as variables x, there are three kinds of value: names a representing the address of a mutable variable in the heap, lambda abstractions λx.M, and constructions f M representing data tagged with the algebraic type constructor f. For example, True, False, Nil, and Cons M1 M2 are constructions. Since we describe a non-strict language, constructor arguments M may be general expressions and not just values.

The other expressions of the lambda calculus core are as follows. An application expression M N evaluates to M′{N/x}, if evaluation of M returns a function λx.M′. A case expression case M of f x → N attempts to match one of the clauses f x → N against M. If fj xj → Nj is the first of these clauses such that the value of M is fj M for some M, the whole expression behaves as Nj{M/xj}; if there is no such clause, the whole expression is stuck. A builtin application Y M evaluates to M (Y M). (We use Y to explain recursive definitions in the standard way.) A builtin application equal M N evaluates both M and N, and returns True if they evaluate to the same address, and False otherwise.

Heaps Next, we describe heap-expressions, which consist of a possibly empty composition of transactional variables, or TVars for short. An expression a ↦ M denotes a heap consisting of a single TVar, with address a and current content M. The expression emp represents the empty heap, and the composition M | N represents the concatenation of heaps M and N.

More generally, parallel composition M | N represents M and N running in parallel, where M and N may be a mixture of heaps, STM-expressions, and IO-expressions. Unusually, the | operator is not fully commutative; the result of M | N, if any, is the result of N. To evaluate a restriction (νa)M one creates a fresh address a and then evaluates M; as in process calculi, a restriction may also be read declaratively as meaning that the address a is known only within its scope M.

Imperative Transactions An STM-expression typically has the form H | M, where H is a heap-expression and M is a running (single-threaded) transaction. Transactions are composed from reads (readTVar a) and writes (writeTVar a M) to TVars. In source programs, we create TVars by the following abbreviation:

newTVar M := (νa)(a ↦ M | returnSTM a)   where a ∉ fn(M)

A transaction may returnSTM a result M (to commit), or retry (to roll back any updates and restart). There are two different kinds of choice. Prioritized choice M orElse N behaves as M unless M does a retry (in which case it behaves as N). Erratic choice or M N behaves as M or as N nondeterministically.
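The same operators exist in GHC's actual STM library; the small sketch below (ours, written against Control.Concurrent.STM rather than the core calculus) illustrates retry and orElse: take a value from the first non-empty slot, or block until one becomes available.

import Control.Concurrent.STM

-- Illustrative use of retry and orElse in GHC's STM library.
takeEither :: TVar (Maybe a) -> TVar (Maybe a) -> STM a
takeEither v1 v2 = takeSlot v1 `orElse` takeSlot v2
  where
    takeSlot v = do
      m <- readTVar v
      case m of
        Nothing -> retry                           -- roll back; try the other branch
        Just x  -> writeTVar v Nothing >> return x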

Concurrency Finally, an IO-expression typically has the form H | M1 | . . . | Mn, where H is a heap-expression and M1, . . . , Mn are threads running in parallel. A thread Mi may run a transaction to have side-effects on the shared heap H. Each thread may eventually terminate by returning a result. In our source programs, we create parallel threads using the abbreviation:

fork M := (M | returnIO ())

A thread atomically M behaves as if the transaction M executesin isolation against the heap, in one big atomic step. There are twopossibilities to consider. If M returns a result M′, the updates tothe heap are committed, and the whole expression returns M′. IfM retries, any updates to the heap are rolled back, and the wholeexpression snaps back to atomically M, ready to try again.

Both IO and STM computations can be sequenced using thebind operator M�=IO N, which behaves as N M′ if M returns M′,and otherwise as M.

For an example, consider a function swap that swaps the values of two TVars:

swap := λxa.λxb. readTVar xa >>=STM λy. readTVar xb >>=STM λz. writeTVar xa z >>=STM λw. writeTVar xb y

Here, swap takes two TVar arguments xa and xb and calls the builtin function readTVar to read their values. As discussed, the (infix) function >>=STM binds the return value of the expression to its left to the function on its right. Hence, y and z get the values of xa and xb, respectively; swap then calls the function writeTVar to write xa and xb with z and y, respectively, thus completing the swap. To call the function with two TVars a, b, we can write swap a b. However, two parallel executions of swap could yield an inconsistent state. Instead, we call the function and have it execute atomically, by writing atomically (swap a b).
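For comparison, here is the same function written in concrete STM Haskell syntax (a sketch assuming Control.Concurrent.STM, not code from this paper):

import Control.Concurrent.STM

-- Swap the contents of two TVars inside a transaction.
swap :: TVar a -> TVar a -> STM ()
swap xa xb = do
  y <- readTVar xa
  z <- readTVar xb
  writeTVar xa z
  writeTVar xb y

-- Run the whole swap as one atomic step from the IO monad.
swapIO :: TVar a -> TVar a -> IO ()
swapIO a b = atomically (swap a b)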

3. Application: An API for Managing Ambients

As a verification challenge, we consider a sizeable program written in STM Haskell. Our target application is an API that implements the mobility primitives of the ambient calculus [5]. Programs that use this API spawn multiple threads that concurrently modify a shared data structure. We consider such programs to be exemplary applications of STM Haskell.

Our implementation is inspired by Cardelli's earlier implementation in Java using locks [4], which, although carefully specified, was never proved correct. A more recent programming comparison between a lock-based and an STM-based implementation of ambients finds the "STM implementation to be much easier to reason about and much faster to implement" [32]. We seek to evaluate whether an STM implementation is also easy to verify, by developing the first correctness proof for an implementation of the ambient calculus.

A Tree of TVars.  The underlying data structure is a tree of named ambients, implemented as follows. The figure on the right depicts an example tree with a node named a that has two children named b and c.

type Ambient = TVar AmbData
data AmbData =
  AD (Name, Maybe Handle, [Handle], [(Name, Ambient)])
type Name   = TVar String
type Handle = TVar Ambient

An Ambient (depicted as a named rectangle in the figure) is a TVar containing an AmbData, which consists of four items: a Name, a Handle pointing to its parent node (if it has one), a list of all incoming handles pointing to the current node, and an association list of child nodes, mapping node names to Ambients. A Name is an identifier (depicted as an italicized variable), implemented as a transactional variable (TVar) containing a string; the same Name may appear on multiple nodes. A Handle (depicted as a circle) is a pointer to an Ambient; in the figure, such pointers are depicted as dashed arrows. The additional level of indirection using Handles is used when merging a child node with its parent: all handles pointing to the child can then simply be pointed over to its parent.
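Since the figure is not reproduced here, the following comment-only sketch (our own illustration, with hypothetical names aName, bName, cName, hA, hB, hC, bAmb, cAmb) shows roughly what the AmbData values for the example tree would look like:

-- Root node a, with no parent and two children b and c:
--   a's Ambient contains  AD (aName, Nothing, [hA], [(bName, bAmb), (cName, cAmb)])
-- Each child records a handle pointing back to its parent a:
--   b's Ambient contains  AD (bName, Just hA, [hB], [])
--   c's Ambient contains  AD (cName, Just hA, [hC], [])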

We illustrate this data structure through a few simple functions that are used in our subsequent development. These functions can only be called from within STM atomic blocks.


readAmb :: Handle -> STM AmbData
readAmb h = do { a  <- readTVar h;
                 ad <- readTVar a;
                 return ad }

The function readAmb takes a Handle h and returns the AmbData it points to (through two levels of indirection). Since readAmb reads transactional variables using readTVar, the result is an STM action.

writeAmb :: Handle -> AmbData -> STM ()
writeAmb h ad = do { a <- readTVar h;
                     writeTVar a ad }

The function writeAmb both reads and writes transactional variables; it writes an AmbData to the location pointed to by a handle.

parentOf :: Handle -> STM Handle
parentOf h = do { AD (_,p,_,_) <- readAmb h;
                  case p of
                    Nothing -> retry;
                    Just ph -> return ph }

The function parentOf takes a Handle h pointing to a node and returns a handle to the parent of the node; if there is no parent (that is, h points to the root node), then it calls the STM retry function to roll back the transaction and restart.
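These helpers compose in the obvious way inside a single transaction. For instance (a sketch of our own, not part of the paper's API), reading the parent's record of the node a handle points to:

-- Follow the parent pointer and read the parent's AmbData,
-- retrying if the node is the root.
parentData :: Handle -> STM AmbData
parentData h = do { ph <- parentOf h; readAmb ph }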

Mobile Agents.  An agent represents an IO thread that is initialized with a handle to an ambient. We say that the agent executes within that ambient.

data AGENT a = Agent (Handle -> IO a)
type Agent   = AGENT ()

nil  :: Agent
root :: Agent -> IO ()
new  :: String -> AGENT Name
amb  :: Name -> Agent -> Agent

The simplest agent is nil, which denotes an inactive agent that has finished its computation. The function root creates a fresh (unnamed) root node and attaches an agent A to this node. The agent A can then create subtrees by calling new to generate fresh node names, and calling amb to create new child nodes.

Using these functions, we can now create our example tree and attach agents to each node:

ex = root $ do {
  a <- new "a";
  b <- new "b";
  c <- new "c";
  amb a (do amb b (into c);
            amb c nil)}

Here, the agent attached to b uses a new function into:

into :: Name -> Agent

The figure treats the more general case when the agent performs other actions e after calling into. However, for simplicity, it does not depict nil processes or back pointers to handles.

When the agent into c is executed, it has the effect of moving the (subtree rooted at) node b to become a child of node c. The resulting tree is depicted on the right. If there is no neighbouring tree labelled c, the operation blocks until one becomes available.

As usual, when concurrent threads modify the tree in this way, there is a risk of the tree ending up in an inconsistent state. Our implementation of into in STM Haskell below uses the STM construct atomically to avoid inconsistency:

into c = Agent $ \bHandle -> atomically $ do {
  bAmbient <- readTVar bHandle;
  AD (b,bp,bh,bc) <- readAmb bHandle;
  aHandle <- parentOf bHandle;
  AD (a,ap,ah,ac) <- readAmb aHandle;
  let bSiblings = delete (b,bAmbient) ac in do
    cAmbient <- lookup' c bSiblings;
    AD (_,cp,ch,cc) <- readTVar cAmbient;
    let cHandle = head ch in do
      writeAmb cHandle (AD (c, cp, ch, (b,bAmbient):cc));
      writeAmb aHandle (AD (a, ap, ah, bSiblings));
      writeAmb bHandle (AD (b, Just cHandle, bh, bc))}

The function into takes as argument the name c of the target ambient and creates an agent parameterized by a handle bHandle to the source ambient named b. The function proceeds in two phases. First, it reads the values of three ambient nodes: b, b's parent a, and some sibling of b named c. Then, it writes updated values to all three nodes.

The function begins by reading the ambient at b, bAmbient, and calls readAmb to read its contents. It calls parentOf to find the parent ambient at a and reads its contents, including the list of its children ac. It computes the siblings of b by deleting (b,bAmbient) from the association list ac. It then finds a target ambient cAmbient by calling lookup', which non-deterministically chooses some sibling ambient with name c. (This non-determinism is motivated by the desired correctness properties of Section 5.) It reads the contents of cAmbient, including its children cc and one of its handles, cHandle.

Finally, the function updates the ambient at c by adding a child (b,bAmbient) to cc, it updates the ambient at a by deleting the child b from ac, and it updates the ambient at b by changing its parent to c.

Note that into is a local operation that only modifies three nodes of the tree; agents manipulating other parts of the tree can be scheduled to run in parallel without causing any conflicts.
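As a sketch of this locality (our own example, reusing the API of this section), two moves in disjoint subtrees can be attached to different nodes; the transactions behind the two into agents touch different TVars, so they can commit concurrently without conflicting:

exDisjoint = root $ do {
  a1 <- new "a1"; b1 <- new "b1"; c1 <- new "c1";
  a2 <- new "a2"; b2 <- new "b2"; c2 <- new "c2";
  amb a1 (do { amb b1 (into c1);   -- a move confined to a1's subtree
               amb c1 nil });
  amb a2 (do { amb b2 (into c2);   -- an independent move in a2's subtree
               amb c2 nil })}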

The Full API.  The full Ambient API consists of several other functions:

out  :: Name -> Agent
open :: Name -> Agent
fork :: Agent -> Agent

The agent out c is the inverse of into c; it moves an ambient out of its parent (if the parent is named c). The agent open c deletes a child node named c and swings all handles of c to point to its parent. This has the effect of causing all of c's children to become children of the parent; all agents running on c are similarly affected. The figure below depicts its effect on an example graph.

The agent fork A forks off a new thread running the agent A within the same ambient.
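Putting the pieces together, the packet example used later in Section 5 (a[p[out a.into b]] | b[]) can be written against this API as follows (a sketch of our own; the AGENT monad's >> sequences capabilities, matching the encoding given in Section 5.3):

packet = root $ do {
  a <- new "a"; b <- new "b"; p <- new "p";
  amb a (amb p (do { out a;       -- leave a ...
                     into b }));  -- ... then enter b
  amb b nil}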


Programmatically, agents form a Reader monad, where the value read is a handle to the location in which the agent is running.¹ The >>= and >> operators have their standard definition: (Reader f) >>= g running at a location with handle h evaluates f h to some v, evaluates g v to some Reader g′, and then evaluates g′ h. Similarly, (Reader f) >> (Reader g) reading handle h evaluates f h, discards the result, and then evaluates g h.

instance Monad AGENT where
  return a = Agent $ \s -> return a
  a >>= g  = Agent $ \s -> case a of
               Agent f -> f s >>= \v -> case (g v) of
                                          Agent ff -> (ff s)
  a >> b   = Agent $ \s -> case a of
               Agent f -> f s >>= \v -> case b of
                                          Agent ff -> (ff s)
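As the footnote observes, this is just the ReaderT transformer specialized to a Handle environment over IO; a sketch of the correspondence (assuming the standard transformers/mtl ReaderT, not part of the paper's code):

import Control.Monad.Reader (ReaderT(..))

-- AGENT a is isomorphic to ReaderT Handle IO a:
-- wrapping corresponds to the Agent constructor,
-- and runReaderT corresponds to unwrapping it.
type AGENT' a = ReaderT Handle IO a

runAgent' :: AGENT' a -> Handle -> IO a
runAgent' = runReaderT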

When verifying the ambient API, we are interested in establishing full functional correctness, not only the preservation of certain invariants of the location tree. To do this, we need to give a formal account of the semantics of our core calculus for STM Haskell.

4. The Core Calculus, Concluded

This section concludes the definition of our core calculus, begun in Section 2. We define the operational semantics and type system, and make a comparison with the original semantics. In the next section, we apply the calculus to specifying and verifying the Haskell code from Section 3.

4.1 Operational Semantics

We define a reduction relation, M → M′, which specifies the run-time behaviour of STM programs. A single reduction relation captures pure functional computation, imperative transactions, and concurrency. We rely on some auxiliary notions to define reduction. First, we define three kinds of evaluation contexts.

Contexts: Pure (Rβ), parallel (R|), and transactional (R↦)

Rβ ::= [·] | Rβ M | case Rβ of f x → N | equal Rβ M
     | equal a Rβ | readTVar Rβ | writeTVar Rβ M

R| ::= [·] | (νa)R| | (R| | M) | (M | R|)
     | (R| >>=IO M) | (R| >>=STM M)

R↦ ::= [·] | (νa)R↦ | (a ↦ M | R↦)

The second auxiliary notion is structural equivalence, M ≡ M′. The purpose of this relation is to re-arrange the structure of an expression (for example, by pulling restrictions to the top, or by moving TVars beside reads or writes) so as to enable reduction steps. Structural equivalence is the least equivalence relation closed under the following rules. Let bn(R|) be the names bound by the context R|, and let n(R|) = bn(R|) ∪ fn(R|).

Structural Equivalence: M ≡ N

M ≡ emp | M                                              (STRUCT EMP)
M | R|[N] ≡ R|[M | N]   if bn(R|) ∩ fn(M) = ∅            (STRUCT FLOAT)
R|[(νa)M] ≡ (νa)R|[M]   if a ∉ n(R|)                     (STRUCT RES CTX)
M ≡ N  ⇒  R|[M] ≡ R|[N]                                  (STRUCT CTX)

Let reduction, M → M′, be the least relation closed under the rules in groups (R1), (R2), and (R3) displayed below. The first group consists of standard rules for functional and concurrent computation.

¹ The Haskell programmer familiar with monad transformers will notice that it is effectively a ReaderT Handle IO a.

(R1) Reductions without Side-Effects: M → M′

(λx.M) N → M{N/x}                                        (BETA)
case f_j(M) of f x → N  →  N_j{M/x_j}                    (CASE)
Y M → M (Y M)                                            (FIX)
equal a a → True                                         (EQUAL TRUE)
equal a b → False   if a ≠ b                             (EQUAL FALSE)
(returnIO M >>=IO N) → N M                               (IO BIND RETURN)

(PURE CTX)   if M → M′ then Rβ[M] → Rβ[M′]
(RED CTX)    if M → M′ then R|[M] → R|[M′]
(STRUCT)     if M ≡ N, N → N′, and N′ ≡ M′, then M → M′

The second group of reduction rules concerns the core behaviour of STM-expressions. A heap-expression H is a parallel composition of transactional variables

Πi (ai ↦ Mi) := a1 ↦ M1 | · · · | an ↦ Mn | emp

where the ai are pair-wise distinct. We write →∗ for the transitive closure of →.

(R2) Core Reductions for STM Transactions: M → M′

(STM READ TVAR)    (a ↦ M) | readTVar a  →  (a ↦ M) | returnSTM M
(STM WRITE TVAR)   (a ↦ M) | writeTVar a M′  →  (a ↦ M′) | returnSTM ()

(returnSTM M >>=STM N) → N M                             (STM BIND RETURN)
(retry >>=STM N) → retry                                 (STM BIND RETRY)

(ATOMIC RETURN)    if H | M →∗ R↦[returnSTM N]
                   then H | atomically M → R↦[returnIO N]

(STM READ TVAR) and (STM WRITE TVAR) allow transactional variables to be read and written within a transaction.

(STM BIND RETURN) allows return values to propagate through the STM bind operator, much as through the IO bind operator, while (STM BIND RETRY) allows retry to propagate directly through the bind operator, much like an exception.

The rule (ATOMIC RETURN) turns a successful many-step transaction of an STM-expression H | M into a single-step computation of the IO-expression H | atomically M. If the transaction yields retry then (ATOMIC RETURN) is not applicable, so there is no transition in this case. In the STM Haskell implementation, a transaction that retries is aborted by the run-time system and queued for later execution.
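As a small worked instance of (ATOMIC RETURN) (our own example in the calculus notation, with steps taken up to structural equivalence): writing back the value just read commits trivially, since the transaction body reduces to a returnSTM under the heap.

a ↦ 0 | readTVar a >>=STM λx. writeTVar a x
  → a ↦ 0 | returnSTM 0 >>=STM λx. writeTVar a x     (STM READ TVAR)
  → a ↦ 0 | writeTVar a 0                            (STM BIND RETURN, BETA)
  → a ↦ 0 | returnSTM ()                             (STM WRITE TVAR)

and therefore  a ↦ 0 | atomically (readTVar a >>=STM λx. writeTVar a x) → a ↦ 0 | returnIO ().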

The final group of rules concerns choices within transactions.

(R3) Reductions for OrElse and Or: M → M′

(STM ORELSE RETURN)   if H | N1 →∗ R↦[returnSTM N1′]
                      then H | (N1 orElse N2) → R↦[returnSTM N1′]

(STM ORELSE RETRY)    if H | N1 →∗ R↦[retry]
                      then H | (N1 orElse N2) → H | N2

M or N → M                                               (STM OR LEFT)
M or N → N                                               (STM OR RIGHT)

Rules (STM ORELSE RETURN) and (STM ORELSE RETRY) formalize the idea that N1 orElse N2 behaves as N1 if N1 terminates with returnSTM N1′. If N1 terminates with retry then its effects are discarded, and we instead run N2 on the original heap H.


Rules (STM OR LEFT) and (STM OR RIGHT) define M or N as making a nondeterministic choice within a transaction. Such choices may be derived at the level of the IO monad, but this operator introduces nondeterminism into transactions (which otherwise are deterministic). Nondeterminism is used in our programming example only to ensure completeness with respect to its specification; without nondeterminism we would still have soundness.

4.2 Type System

We complete our formalization of STM Haskell by defining a simple type system that prevents ill-formed expressions, such as the inappropriate mixing of pure, STM and I/O expressions.

The type system only permits the reading and writing of transactional variables inside transactions, which a fortiori enforces static separation [1] and permits us to reason about transactions as if they occur in a single step.

Let the domain, dom(M), of an expression M be the set of (free) addresses of the transactional variables at top level in the expression. We have dom(a ↦ M) = {a}, dom(M >>=IO N) = dom(M), dom(M >>=STM N) = dom(M), dom(M | N) = dom(M) ∪ dom(N) and dom((νa)M) = dom(M) \ {a}. Otherwise, dom(M) = ∅. In particular, expressions that are not in a top-level evaluation context should have no free transactional variables, so the type system enforces that their domain is empty.
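For instance, dom((νa)(a ↦ M | b ↦ N | returnIO ())) = {b}: the restriction removes a, the cell b ↦ N contributes b, and the thread returnIO () contributes nothing.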

Here is the syntax of types. For the sake of simplicity, we formalize only a monomorphic type system. We make the standard assumption that uses of Hindley-Milner style polymorphism may be represented by monomorphising via code duplication.

Types:
u ::= t | T                                              type
t ::= t → t | X | TVar t | IO ∅ t | STM ∅ t              expression type
T ::= IO a t | STM a t | heap a | proc a                 configuration type

An expression type t describes the eventual value of a pure functional computation. Expression types are either function types (t → t), algebraic datatypes (X), TVar reference types (TVar t), IO computation types (IO ∅ t) or STM transaction types (STM ∅ t). We usually write IO t for IO ∅ t, and STM t for STM ∅ t.

A configuration type T describes the structure, heap and potential return value (if any) of imperative and concurrent expressions. Heap-expressions with domain a have type heap a. Both running transactions and STM-expressions with domain a have type STM a t for some t. Both threads and IO-expressions with domain a have type IO a t for some t.

Finally, the type proc a consists of concurrent expressions with domain a that are executed in the background for their effects, but whose results will be discarded. Given T, we write dom(T) for its domain.

We assume that all polymorphic algebraic datatypes X and their constructors f have been monomorphized by instantiating each of their occurrences. For instance, the type Maybe a is instantiated at the unit type () as data Maybe() = Nothing() | Just() (). We assume a set of predefined algebraic types (), Error, List_t, Bool, and Maybe_t, with constructors (), Nil_t, Cons_t, True, False, Nothing_t, and Just_t.

The return type of an expression is the type of its rightmost thread. The typing rule for parallel composition guarantees that an expression consists of some transactional variables together with either several IO threads or a single rightmost STM thread (currently running a transaction). Moreover, it ensures that there is at most one transactional variable at each location a. It uses the partial non-commutative operation T ⊗ T′, defined as follows, where a ⊎ b is a ∪ b if a and b are disjoint.

heap a ⊗ heap b   := heap (a ⊎ b)
proc a ⊗ heap b   := proc (a ⊎ b)
heap a ⊗ STM b t  := STM (a ⊎ b) t
IO a t ⊗ heap b   := proc (a ⊎ b)
T ⊗ proc a        := proc (dom(T) ⊎ a)    if T ≠ STM b t′
T ⊗ IO a t        := IO (dom(T) ⊎ a) t    if T ≠ STM b t′

In particular, note that STM a t ⊗ STM b t′ is undefined, and hence the type system does not allow two transactions to run at once.

Lemma 1. (T1 ⊗ T2) ⊗ T3 = T1 ⊗ (T2 ⊗ T3) = T2 ⊗ (T1 ⊗ T3) = (T2 ⊗ T1) ⊗ T3.

A typing environment E is a finite mapping from variables and names to types. Each individual map is written as a :: TVar t or x :: t. We write x :: t for the environment x1 :: t1, ..., xn :: tn where n is the length of x and t. We write E, E′ for the union of E and E′ when E and E′ have disjoint domains. The full typing rules are given in Figure 1.

The rule (T BUILTIN) appeals to a relation g :: u1 → ··· → un → u′, defined as follows, which gives a type for each application of a builtin function g. In the following, all types t, t′ and domains a are universally quantified, and u → u′ stands for u′ when |u| = 0, and otherwise for u1 → ··· → un → u′.

Types for Builtin Functions: g :: u → u′

Y          :: (t → t) → t
equal      :: TVar t′ → TVar t′ → Bool
readTVar   :: TVar t → STM t
writeTVar  :: TVar t → t → STM ()
returnSTM  :: t → STM t
retry      :: STM t
>>=STM     :: STM a t′ → (t′ → STM t) → STM a t
orElse     :: STM t → STM t → STM t
or         :: STM t → STM t → STM t
atomically :: STM t → IO t
returnIO   :: t → IO t
>>=IO      :: IO a t′ → (t′ → IO t) → IO a t

For example, the function swap has type TVar t → TVar t → STM () for each t. Hence, the expression a ↦ M | b ↦ N | swap a b is well-typed, by (T PAR), (T CELL), and (T APP). But the expression a ↦ M | b ↦ N | swap a b | swap a b is not well-typed, since it has two STM threads and STM t ⊗ STM t′ is undefined. As a second example, the expression λx.(x | x) (a ↦ ()) is not well-typed since the transactional variable a ↦ () has type heap a; heap a is not an expression type, so we cannot derive any valid function type t → t′ for the lambda-expression. Indeed, this expression would yield a ↦ () | a ↦ (), which has two transactional variables with the same location. Such ill-formed expressions are untypable, due to the disjointness conditions of ⊗ (see (T PAR)). Similarly, the expression λx.(x | x) (a ↦ () | returnIO ()) is not well-typed since x must have an expression type, which always has empty domain. However, λx.(x | x) has type IO t → IO t for each t, by (T PAR) and (T LAMBDA). Thus, the expression λx.(x | x) ((νa)(a ↦ () | returnIO ())) is well-typed.

For example, for a well-typed application of swap, we have the expected result

a ↦ M | b ↦ N | swap a b  →∗  a ↦ N | b ↦ M | returnSTM ()

but an ill-typed application may have an undesirable outcome:

a ↦ M | b ↦ N | swap a b | swap a b  →∗  a ↦ N | b ↦ N | returnSTM () | returnSTM ()


(T VAR)       E, x :: t ⊢ x :: t

(T ADDR)      E, a :: TVar t ⊢ a :: TVar t

(T EMP)       E ⊢ emp :: heap ∅

(T LAMBDA)    if E, x :: t ⊢ M :: t′ then E ⊢ λx.M :: (t → t′)

(T APP)       if E ⊢ M :: t → t′ and E ⊢ N :: t then E ⊢ M N :: t′

(T BUILTIN)   (where g :: u → u′)
              if E ⊢ M1 :: u1, ..., E ⊢ Mn :: un then E ⊢ g M1 · · · Mn :: u′

(T ADT)       (where data X = f1 t1 | · · · | fm tm and |ti| = |M|)
              if E ⊢ M1 :: ti,1, ..., E ⊢ Mn :: ti,n (with ti = ti,1 ... ti,n),
              then E ⊢ fi M :: X

(T CASE)      (where data X = f1 t1 | · · · | fm tm)
              if E ⊢ M :: X and E, x1 :: t1 ⊢ N1 :: t′, ..., E, xm :: tm ⊢ Nm :: t′,
              then E ⊢ case M of f x → N :: t′

(T CELL)      if E, a :: TVar t ⊢ N :: t then E, a :: TVar t ⊢ a ↦ N :: heap a

(T PAR)       if E ⊢ M :: TM and E ⊢ N :: TN then E ⊢ M | N :: TM ⊗ TN

(T RES)       if E, b :: TVar t ⊢ M :: heap b ⊗ T then E ⊢ (νb)M :: T

Figure 1. Type system

Lemma 2 (Subject Reduction). If E ⊢ M :: u and M → M′ then E ⊢ M′ :: u.

From this point, we only consider well-typed processes (that is, such that there is a typing environment under which they have a type). This is motivated by Lemma 2. Moreover, due to the structural definition of the type system, every subexpression of a well-typed process is well-typed. In order to reason compositionally about multi-step reductions, we develop some simple conditions for when two reductions are independent. We use these conditions in our correctness proofs, where we often consider only transactions and reason up to β-equivalence. We begin by dividing reductions into pure →β and impure →STM. (This distinction is different from the one in [10], where the transition relation is stratified and there is only one kind of top-level transition.)

Definition 3. We write M →β N if M → N can be derived using only the rules in group (R1). We write →STM for (→ \ →β) and ↠ for →β∗ →STM (the composition of →β∗ and →STM). We let =β be the smallest equivalence relation containing →β and ≡.

Using Lemma 2, we can show that the pure reductions of a single thread are deterministic, and that they commute with reductions in other threads. β-reduction thus enjoys the diamond property.

Lemma 4. If M → M1 and M →β M2 with M1 ≢ M2 then M1 →β M′ and M2 → M′ for some M′.

4.3 Comparison with the Original Semantics

The original STM Haskell semantics [10] is based on three different transition relations: I/O transitions, administrative transitions, and STM transitions. These are defined on structures built from expressions, heaps, and multiple threads. In contrast, our semantics of STM Haskell is in the style of a process calculus (like the semantics of Concurrent Haskell [24], for example) and consists of a single reduction relation defined on expressions, whose syntax subsumes heaps and concurrent threads.

The difference in styles, though, is essentially syntactic. We can show that our reduction relation is equivalent to the original semantics. In the extended version of this paper we show a straightforward translation between our syntax and the original run-time syntax, which yields a strong operational correspondence.

Having finished the development of our theory, we suspect it would be quite possible to recast it directly on top of the original semantics.

Still, we contend that our use of a uniform syntax of expressions is better suited to the development of theories for reasoning about STM Haskell programs. One reason is that it allows us to define contextual equivalence (in Section 6) in the standard way, and to import ideas from process calculus, such as bisimulation, directly. Another reason is that our STM reduction rules (in groups (R2) and (R3)) operate on the adjacent piece H of the heap, as opposed to the full heap; this facilitates reasoning about the part of the heap that is actually used by a transaction. Moreover, we can easily represent parts of the run-time state, such as a thread together with a small piece of the heap. The syntax also allows multiple threads with local state to be composed using the parallel operator.

On the other hand, although our expression syntax is uniform, we need to introduce configuration types, as well as conventional types, to rule out certain ill-formed expressions. This is certainly a cost we must pay for the uniform syntax, but we have not found it so onerous; we need a type system anyway, and the additional rules are not hard to work with.

5. Verifying the Ambient API

We are now in a position to specify the expected behaviour of the Haskell code for the ambient API in Section 3, and to verify it. We do so by showing that the API is a fully abstract implementation of the ambient calculus, a small calculus of tree-manipulating processes. Theorem 1, below, shows soundness and completeness of the API, while Theorem 2 shows that ambient processes and their Haskell implementations are in fact bisimilar.

Although the high-level statement of correctness is fairly intuitive, the definitions of correspondence between the run-time states of our Haskell code and the ambient calculus are rather detailed and technical. The proofs themselves, in the long version of this paper, are also rather complicated. Still, the theorems and their proofs show the viability of our theory for reasoning about STM Haskell code. To the best of our knowledge, ours is the first theory for equational reasoning about concurrent Haskell programs (as opposed to, say, the correctness of implementations).

5.1 An (Imperative) Ambient Calculus

Our Haskell API is intended to implement the primitives of an imperative variant of the ambient calculus [5], which we call iAmb and define as follows. Readers familiar with the ambient calculus will notice that every syntactic form of the original calculus also exists as an imperative operation in iAmb.


Syntax of the Ambient Calculus:

π ::=                                 simple capability
    into a                            enter a
  | out a                             leave a
  | open a                            open a
  | amb a C                           create ambient a[C]
  | fork C                            fork thread C
  | new(a) C                          a fresh in C

C ::= π | nil | C.C                   capabilities

P ::=                                 process
    0                                 inactivity
  | a[P]                              ambient
  | C.P                               prefixed thread
  | (νa)P                             restriction
  | P | P                             parallel

R ::= [·] | a[R] | (νa)R | R | P | P | R        reduction context

We often omit the 0 in C.0 and a[0]. Free and bound names of capabilities and processes are defined as expected. The scope of the bound name a extends to P in (νa)P and to C in new(a) C.

The reduction semantics of the ambient calculus is defined as follows. Structural equivalence ≡ is the least congruence on processes, with respect to the reduction contexts R, that satisfies the commutative monoid laws for | with 0 as unit and the rules below.

Structural Equivalence for Ambient Processes: P ≡ Q

nil.P ≡ P                                                (A EPS)
(C1.C2).P ≡ C1.(C2.P)                                    (A ASSOC)
R[(νa)P] ≡ (νa)R[P]   if a ∉ n(R)                        (A RES)

Reduction → of processes is the least relation satisfying the following rules.

Reduction for Ambient Processes: P → Q

b[into a.P | Q] | a[R] → a[b[P | Q] | R]                 (A IN)
a[b[out a.P | Q] | R] → b[P | Q] | a[R]                  (A OUT)
open a.P | a[Q] → P | Q                                  (A OPEN)
(new(a) C).P → (νa)C.P   if a ∉ fn(P)                    (A NEW)
amb a C.P → a[C.0] | P                                   (A AMB)
fork C.P → C.0 | P                                       (A FORK)
P → P′  ⇒  R[P] → R[P′]                                  (A R CTX)
P ≡→≡ P′  ⇒  P → P′                                      (A STRUCT)

The first three rules specify how the tree structure can be modified. If into a is executed inside a location b that has a sibling a, then b is moved inside a. Conversely, if out a is executed inside a location b that is a child of a, then b is moved outside a. Finally, open a opens a single child named a of the ambient it is running in.

As a simple example, we take the ambient tree a[p[out a.into b]] | b[], where the ambient p represents a packet that intends to move from a to b: a[p[out a.into b]] | b[] → a[] | p[into b] | b[] → a[] | b[p[]]. We define the delay operator τ as τ.P := aτ[] | open aτ.P for some distinguished name aτ.

In this setting, processes such as C.a[P] are ill-formed, since they have no direct correspondent in the API. We instead use C.amb a P. Formally, we treat only the following subcalculus: processes that result from the execution of a closed process C.0.

Normal form for a subcalculus of iAmb
PN ::= a[PN] | (νa)PN | (PN | PN) | C.0 | 0

We write PN for the set of all PN. As an example, (out a.into b).0 ∈ PN, but out a.(into b.0) ∉ PN. Note that PN is not closed under structural equivalence, although it is closed (modulo structural equivalence) under reduction. We write →N for → restricted to PN × PN. In the remainder of the paper, we only consider processes P ∈ PN. Continuing the running example:

amb a (amb p (out a.into b)).amb b nil.0
  →N a[amb p (out a.into b).0] | amb b nil.0
  →N a[p[out a.into b.0]] | amb b nil.0
  →N a[p[out a.into b.0]] | b[]

5.2 Statement of Correctness

Cardelli [4] defined a notion of correctness for implementations of the ambient calculus, which we quote here:

The problem. We want to find a (nondeterministic) implementation of the reduction relation →∗, such that each Pi in an ambient is executed by a concurrent thread (and so on recursively in the subambients mj[...]). Desirable properties of the implementation are:
• Liveness: If P → Q then the implementation must reduce P.
• Soundness: If the implementation reduces P to Q, then we must have P →∗ Q.
• Completeness: If P →∗ Q, then the implementation must be able (however unlikely) to reduce P to some Q′ ≡ Q.

Additional Properties.  In addition to the three properties proposed by Cardelli, we formalize the following two, and establish all five as Theorem 1.

• Safety: If the implementation reduces P to M then M can reduce further to some Q.

• Termination: If the implementation of P has an infinite reduction, then P also does.

Compared to [4], we additionally treat the open capability (and, in an extended version of this paper, communication of both names and capabilities).

The proof of Theorem 1 proceeds as follows. We begin by giving a simple correspondence between ambient capabilities and their Haskell implementation. In Definition 5, we define how an ambient process is implemented as a Haskell expression, including heap and running capabilities. Definition 6 bridges the gap between this intensional specification and the expressions that arise when executing the expressions; the main difference is due to the lack of garbage collection in our semantics. Then, Lemma 7 guarantees that the correspondence does not confuse unrelated ambient processes.

With the static correspondence in place, we can then show how it is preserved by execution. Lemma 8 details how the execution of the implementation of a prefix corresponds to its semantics in the ambient calculus. Finally, in the proof of Theorem 1 we close the result of Lemma 8 under contexts, yielding a strong operational correspondence.

5.3 Correspondence between Haskell Code and Ambients

The encoding [[C]] into Haskell of imperative ambient capabilities is homomorphic, except for two cases:

[[new(a) C]] := (new []) >>= λa → [[C]]
[[C′.C]]     := [[C′]] >> [[C]]

Continuing the running example, we have:

[[amb a (amb p (out a.into b)).amb b nil]]
  = amb a (amb p (out a >> into b)) >> amb b nil

We can then give a compositional definition of what it means for the run-time state of a Haskell program to correspond to (the structure of) a given iAmb process. This definition encapsulates both the heap shape invariant preserved by the functions of the API, and how a given ambient calculus process is represented in the heap. The definition has two levels. At the inner level (Definition 5), we inductively match the structure of an ambient process against a structured decomposition of a process term. At the outer level (Definition 6), we perform sanity checks, open restrictions, discard unused heap items and identify the root ambient.

Definition 5. We identify association lists with the corresponding binary relations, which must be injective. We identify other lists with multisets. We then say that (Dn,Dp,Dh,Dc) ∈ (Dn,Dp,Dh,D′c) ⊕ (Dn,Dp,Dh,D″c) if Dc ∈ D′c ∪ D″c. We write D for an AD (Dn,Dp,Dh,Dc). An agent C at location h is [[C.0]]h := case [[C]] of Agent x → x h.

Informally, we write (a ↦ D, Hh, H, M) ∈ M(P) if a ↦ D is the current ambient, Hh its handles, H the data and handles of all its subambients, and M the running capabilities in P. M(P) is inductively defined as follows:

(Completed agent)
(a ↦ (Dn,Dp,Dh,[]), Πh∈Dh h ↦ a, emp, returnIO ()) ∈ M(P) if P ≡ 0.

(Agent running in the current ambient)
(a ↦ (Dn,Dp,Dh,[]), Πh∈Dh h ↦ a, emp, [[C]]h) ∈ M(P) if P ≡ C.0 and h ∈ Dh.

(Child of the current ambient)
(a ↦ (Dn,Dp,Dh,[(b,c)]), Hh, H, M) ∈ M(P) if P ≡ b[Q] and H ≡ c ↦ D′ | Πh∈D′h h ↦ c | H′, where (c ↦ D′, Πh∈D′h h ↦ c, H′, M) ∈ M(Q), D′n = b and D′p = Some h′ with h′ ∈ Dh.

(Parallel decomposition)
(a ↦ D, Hh, H, M) ∈ M(P) if P ≡ Q1 | Q2, H ≡ H1 | H2, M ≡ M1 | M2, D ∈ D1 ⊕ D2 with (a ↦ D1, Hh, H1, M1) ∈ M(Q1) and (a ↦ D2, Hh, H2, M2) ∈ M(Q2).

We can then define what it means for M to be a run-time state corresponding to an ambient process P0.

Definition 6. M ∈ ℳ(P0) iff

1. There are P, e such that P0 ≡ (νe)P and P is not of the form R[(νa)Q] (the top-level restrictions of P0 are e);

2. fn(P0) ⊆ dom(M) and E ⊢ M :: IO a () for E := {ai :: TVar [Char] | ai ∈ dom(M)} (M has the free names of P0 in its domain, and is well-typed);

3. M ≡ (νabce)(a ↦ [] | b ↦ (a, None, Dh, Dc) | H0 | H1 | H2 | H3 | M′) (we can split M into the root ambient, some heaps and some running code);

4. H0 = Πi di ↦ Ni with d ∩ fn(Dh | Dc | H1 | H2 | H3 | M′) = ∅; moreover, if Ni = D′ then D′p ≠ None (H0 is unreachable garbage not containing a root ambient);

5. H1 = Πn∈fn(P) n ↦ sn with ∅ ⊢ sn :: String (H1 is the free names of P, and is well-typed);

6. H2 = Πh∈Dh h ↦ b (H2 is the handles of the root ambient);

7. There are no R|, a, M″ such that H3 | M′ ≡ R|[(νa)M″] (there are no further restricted heap cells at the top level); and

8. (a ↦ D, H2, H3, M′) ∈ M(P).

Both M and ℳ characterize PN modulo structural equivalence.

Lemma 7. If P ≡ Q then ℳ(P) = ℳ(Q) and M(P) = M(Q). Conversely, if M(P) ∩ M(Q) ≠ ∅ or ℳ(P) ∩ ℳ(Q) ≠ ∅ then P ≡ Q.

5.4 Operational Semantics of the Implementation

The transactions of the implementations of prefixes exactly correspond to the axioms of the ambient calculus operational semantics, lifted to Haskell using the M function. We show the case of the into prefix.

Lemma 8. If C.0 ≡ into a.P and (d ↦ D, H2, H3, M) ∈ M(a[Q] | b[C.0 | R1] | R2), M = R|[[[C.0]]h3], {(a,d2),(b,d3)} ⊆ Dc with d2 ≠ d3, H3 ≡ d2 ↦ D2 | h3 ↦ d3 | d3 ↦ D3 | H′3 with D3p = Just h, and H2 ≡ h ↦ d | H′2, then
d ↦ D | H2 | H3 | M  ↠=β  d ↦ D′ | H2 | d2 ↦ D2′ | h3 ↦ d3 | d3 ↦ D3′ | H′3 | R|[[[C′.0]]h3]
where C′.0 ≡ P and (d ↦ D′, H2, d2 ↦ D2′ | h3 ↦ d3 | d3 ↦ D3′ | H′3, R|[[[C′.0]]h3]) ∈ M(a[Q | C′.0 | R1] | R2).

5.5 Main Results About the Haskell Code

Our first correctness result establishes direct correspondences between ambient processes and the states of the Haskell implementation; the different properties in this theorem generalize the properties sought by Cardelli [4]. Recall the definition of ↠ := →β∗ →STM, intuitively "performing a transaction".

Theorem 1.

• Liveness, Completeness: If P →N Q and M ∈ ℳ(P) then M ↠ =β ∈ ℳ(Q).

• Safety, Soundness: If M ∈ ℳ(P) and M ↠ M′ then P →N Q with M′ =β ∈ ℳ(Q).

• Termination: If M ∈ ℳ(P) and M has an infinite reduction then P has an infinite reduction.

Proof sketch.

1. Assume that M ↠ M′ and that M ∈ ℳ(P) where P ≡ (νe)P0 such that P0 does not have any top-level restrictions. By assumption, M ≡ (νabce)(a ↦ "" | b ↦ (a, None, Dh, Dc) | H0 | H1 | H2 | H3 | N) such that H1 | H2 | H3 | N ↠ H′1 | H′2 | H′3 | N′ and A := (b ↦ (a, None, Dh, Dc), H2, H3, N) ∈ M(P0). By induction on the derivation of A ∈ M(P0), N = Πi Ni is a parallel composition of several Ni = [[Ci]]hi. Then there is j such that H1 | H2 | H3 | [[Cj]]hj ↠ H′1 | H′2 | H′3 | N′j with N′ =β N′j | Πi≠j Ni.

As shown in Lemma 8 for the into prefix, and in the extended version for the other prefixes, we then have H1 | H2 | H3 ≡ HR | d ↦ D | Hh | HS such that P0 ≡ R[R2[C′j.Q′]], (d ↦ D, Hh, HS, [[Cj]]hj) ∈ M(R2[Cj]) and H′1 | H′2 | H′3 ≡ HR | d ↦ D′ | H′h | H′S such that (d ↦ D′, H′h, H′S, N′j) ∈ M(R′2[Q′]), where Cj.0 ≡ C′j.Q′ and R2[C′j.Q′] → R′2[Q′] is an axiom. By induction on the derivation of A ∈ M(P0), M′ =β (νabce)(a ↦ "" | b ↦ (a, None, Dh, Dc) | H0 | H′1 | H′2 | H′3 | N′j | Πi≠j Ni). Hence M′ =β ∈ ℳ(R[R′2[Q′]]) follows by Lemma 7.

2. Assume that P → P′. Let e be the top-level restrictions of P. If the reduction occurred inside an ambient, then there are a, Q, R and contexts R1, R2 where P ≡ (νe)R1[a[R2[π.Q] | R]], R2[π.Q] → R′2[Q] is an instance of an axiom, and P′ ≡ (νe)R1[a[R′2[Q] | R]].

By assumption M ∈ ℳ(P), so M ≡ R|[d ↦ D | Hh | H | N] such that (d ↦ D, Hh, H, N) ∈ M(a[R2[π.Q] | R]). Thus, H ≡ c ↦ D′ | H1 | H2 | Πh∈D′h h ↦ c and N ≡ N1 | N2 with D′n = b, D′p = Some h′, h′ ∈ Dh and D′ ∈ D′1 ⊕ D′2 with A := (c ↦ D′1, Πhi∈D′h hi ↦ c, H1, N1) ∈ M(R2[π.Q]) and (c ↦ D′2, Πh∈D′h h ↦ c, H2, N2) ∈ M(R).

By induction on the derivation of A ∈ M(R2[π.Q]), we have N1 ≡ [[C′]]hi | N′1 with C′.0 ≡ π.Q. We treat the case where π is not new(a) C. As shown in Lemma 8 for the into prefix, and in the extended version for the other prefixes, c ↦ D′1 | Πhi∈D′h hi ↦ c | H1 | [[C′]]hi ↠ c ↦ D″1 | H′h | H′1 | [[CQ]]hi with CQ.0 ≡ Q and (c ↦ D″1, H′h, H′1, [[CQ]]hi) ∈ M(R′2[CQ.0]).

If the reduction occurs at top level, we have P ≡ (νe)(Q | R), and M ≡ R|[d ↦ D | Hh | H | N] such that (d ↦ D, Hh, H, N) ∈ M(Q | R). The rest of the proof proceeds analogously.

3. This follows from the completeness above and the fact that ℳ(P) is →β-convergent (modulo ≡).

The proof of this theorem uses Lemma 8 to prove that an agent can progress whenever the corresponding ambient process does and to get the shape of the result of the transition. The proof also uses the compositionality of the calculus, specifically in order to separate an agent (running as part of an expression in the IO monad) and the heap it needs to progress.

Next, we define a notion of bisimulation between ambient processes and STM Haskell expressions.

Definition 9. R ⊆ M × PN is a bisimulation iff for all (M,P) ∈ R:
• If M ↠ M′ then P →N P′ with (M′,P′) ∈ R; and
• If P →N P′ then M ↠ M′ with (M′,P′) ∈ R.

The expression M is bisimilar to the process P if there is some bisimulation R with M R P.

Theorem 2. HC | root [[C]] is bisimilar to τ.C.0, where HC := Πai∈fn(C) ai ↦ "".

Bisimulation between the expressions of our calculus and processes of the ambient calculus allows a succinct statement of the theorem. The proof relies on the soundness of bisimulation up to =β. We could probably replicate this definition using the original semantics of STM Haskell, but it would require many cases; our reformulated semantics allows a simple and direct definition.

6. Equational Reasoning

One of the attractions of functional programming is that we can hope for two expressions to be equivalent, in the sense that they can be substituted for each other in any context. In this section, we develop a proof technique for a Morris-style contextual equivalence. In particular, we prove a number of equations asserted in [10].

6.1 Contextual Equivalence

We begin by defining a notion of a typed relation, stating that two terms are related at a given type under a typing environment.

Definition 10 (Typed Relation). R ⊆ E × M × M × T is a typed relation if whenever (E, M1, M2, u) ∈ R we have E ⊢ M1 :: u and E ⊢ M2 :: u. We write E ⊢ M1 R M2 :: u for (E, M1, M2, u) ∈ R.

An expression M has terminated, written M ↓, if its rightmost thread returns. Termination is our only top-level observation.

Termination: M ↓

(TERM RETURN)   returnIO M ↓
(TERM RES)      if M ↓ then (νa)M ↓
(TERM PAR)      if M ↓ then N | M ↓

An expression M terminates, written M ⇓, if M →∗ N such that N ↓.

Definition 11. Contextual equivalence, written ≃, is the typed relation such that E ⊢ M1 ≃ M2 :: u if and only if for all contexts C such that ◦ ⊢ C[M1] :: IO a () and ◦ ⊢ C[M2] :: IO a () we have C[M1] ⇓ if and only if C[M2] ⇓.

6.2 STM Expressions as Heap Relations

Because of the isolation between different transactions provided by the run-time system, STM expressions are completely defined by their effect on the transactional heap. For simplicity (cf. [16, 17, 29]), we work with a pure heap, where the types of elements in the heap do not mention the STM or IO monads.

Definition 12. A type t is pure if it is either t1 → t2 where t1 and t2 are pure, if it is TVar t′ where t′ is pure, or if it is X such that data X = f1 t1 | · · · | fm tm where all the ti are pure. An environment E is a pure store environment if E is of the form ∪i bi :: TVar ti where all ti are pure.

A derivation E ⊢ M :: u is pure, written E ⊢p M :: u, if E is a pure store environment and t is pure in all occurrences of TVar t in the derivation. We then say that M uses only pure heap.

Two STM threads that only use pure heap are equivalent if they modify the heap in the same way and return the same result.

Definition 13. Heap transformer equivalence, written =HT, is defined by E ⊢ M =HT N :: u if and only if u = STM t, E ⊢p M :: u, E ⊢p N :: u, and for all STM contexts R↦, R′↦, and heaps H such that E ⊢ H :: heap a we have H | M →∗ R↦[returnSTM M′] iff H | N →∗ R′↦[returnSTM M′]; and H | M →∗ R↦[retry] iff H | N →∗ R′↦[retry].

Theorem 3. The relation =HT is sound, that is, =HT ⊆ ≃.

Proof. We let =C_HT be the smallest typed congruence containing =HT. We prove that =C_HT ⊆ ≃. The proof has three parts:

1. If E ⊢p M :: t and E ⊢ H :: heap a then reductions of H | M only depend on the pure cells in H.

2. Let ≅C_HT be the smallest typed congruence such that E ⊢ M =C_HT N :: t with t pure and M, N closed implies E ⊢ M ≅C_HT N :: t. If E ⊢p M :: t, and G and H are pure heaps related by ≅C_HT, then derivatives of G | M and H | M are related by ≅C_HT.

3. We can then derive that =C_HT is a barbed bisimulation, so it is contained in ≃. The interesting case is as follows:

Assume that E ⊢ M =HT N :: STM t, E ⊢ H =C_HT G :: heap c and H | M →∗ R↦[B]. To prove that G | N →∗ R′↦[B′] such that E ⊢ R↦[B] =C_HT R′↦[B′] :: STM c t, we first use 1. and 2. to prove that G | M →∗ R″↦[B″] such that E ⊢ R↦[B] =C_HT R″↦[B″] :: STM c t. Then G | N →∗ R′↦[B′] such that E ⊢ R″↦[B″] =C_HT R′↦[B′] :: STM c t by the definition of =HT. By transitivity, E ⊢ R↦[B] =C_HT R′↦[B′] :: STM c t.

We write M ↔ N if for all pure store environments E and types t such that E ⊢p M :: STM t and E ⊢p N :: STM t we have E ⊢ M =HT N :: STM t. We can now use Theorem 3 to prove classic equations between expressions.

6.3 Proving the Monad Laws

To be a proper monad, the returnSTM and >>=STM functions must work together according to three laws:

Lemma 14.

1. ((returnSTM M) >>=STM N) ↔ N M
2. (M >>=STM λx. returnSTM x) ↔ M
3. ((M >>=STM f) >>=STM g) ↔ (M >>=STM (λx. f x >>=STM g))

Proof.

1. The only transition of H | (returnSTM M) >>=STM N is H | (returnSTM M) >>=STM N → ≡ H | N M.

2. Take M′ ∈ {retry, returnSTM M″}. We then have H | M →∗ R↦[M′] iff

   H | M >>=STM λx. returnSTM x  →∗  R↦[M′] >>=STM λx. returnSTM x  ≡  R↦[M′ >>=STM λx. returnSTM x].

   We proceed by case analysis on M′.
   • M′ = retry iff, using (STM BIND RETRY), R↦[M′ >>=STM λx. returnSTM x] → R↦[retry].
   • M′ = returnSTM M″ iff R↦[M′ >>=STM λx. returnSTM x] → → R↦[returnSTM M″], using (STM BIND RETURN) and (BETA).

3. as 2.

6.4 Proving Other Equations

We prove classical single-threaded imperative equivalences, such as the commutativity of accesses to independent memory cells.

Lemma 15.
• (readTVar a >>=STM λx. writeTVar a x) ↔ returnSTM ()
• (writeTVar a M >>STM writeTVar b N) ↔ (writeTVar b N >>STM writeTVar a M)   if a ≠ b
• (readTVar a >>=STM λx. writeTVar b M >>STM returnSTM x) ↔ (writeTVar b M >>STM readTVar a)   if a ≠ b
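In concrete STM Haskell do-notation, the first equation reads as follows (a sketch of our own; the equivalence is the informal "behaves the same in every transaction" reading):

import Control.Concurrent.STM

-- Reading a TVar and writing the same value back ...
readWriteBack :: TVar a -> STM ()
readWriteBack a = do { x <- readTVar a; writeTVar a x }

-- ... is observationally equivalent to doing nothing:
doNothing :: TVar a -> STM ()
doNothing _ = return ()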

We also prove absorption and associativity laws for orElse, as proposed in [10], and associativity and commutativity laws for or.

Lemma 16.

1. orElse retry M ↔ M
2. orElse M retry ↔ M
3. orElse M1 (orElse M2 M3) ↔ orElse (orElse M1 M2) M3
4. or M N ↔ or N M
5. or M1 (or M2 M3) ↔ or (or M1 M2) M3

7. Related Work

Prior semantics for languages with STM, such as STM Haskell [10], were developed with an aim to specify and compare transaction models [33] and their implementations [30, 14, 1], or to study the interaction of transactions with other language features [20]. Hu and Hutton [13] show correctness for a compiler for a small transaction language, inspired by STM Haskell. In contrast, our semantics is designed to enable equational reasoning about source programs. In this respect, the development closest to ours is an equational theory for a process algebra with STM [2]; this work is not about actual code, and includes no substantial example.

Proof techniques for STM programs have focused on checking invariants of shared transactional state, not on equational reasoning. An extension of STM Haskell with run-time invariant checking [12] defines a semantics and implementation but does not attempt program verification. A program logic [21] allows invariants to be specified as pre- and post-conditions within a dependent type system and proofs are by typechecking; unlike STM Haskell, this system has no explicit transaction abort or retry.

Our main case study is a verification of a centralized shared-memory implementation of ambients. There are several distributed implementations of ambients described in the literature [6, 25, 7]. These have also been verified using techniques from process calculus, but the algorithms are based on message-passing rather than transactional memory. We recently learnt of an independent, but unverified, implementation of ambients within STM Haskell [32]. We intend to investigate whether our verification techniques also apply to this code.

8. Conclusions

It has been prominently argued that functional programming in pure languages like Haskell facilitates equational reasoning [15] and that transactional memory enables compositional reasoning about concurrent programs [11]. Here we realize this promise in the context of STM Haskell and show how to verify equational properties of a sizeable STM program.

As future work, we want to extend our proof techniques to statically check invariants, and to investigate connections between our model of heaps and concurrency, spatial logics for process calculi, and separation logics for imperative programming languages. A possible further case study to exercise our theory would be to verify an STM implementation of the join calculus.

Acknowledgements  Discussions with Cedric Fournet, Tim Harris, Simon Peyton Jones, and Claudio Russo were useful.

A. Source code

This appendix contains the remainder of the source code for the ambient API of Section 3.

Ambient Functions

nil = Agent $ \s -> return ()

new arg = Agent $ \s ->
  atomically $ newTVar arg

root agent = do
  rHandle <- (atomically $
                do rName <- newTVar "root";
                   newAmb Nothing rName);
  case agent of Agent f -> f rHandle

amb a agent = Agent $ \bHandle -> do {
  aHandle <- atomically $ do {
    aHandle <- newAmb (Just bHandle) a;
    aAmbient <- readTVar aHandle;
    AD (n,p,h,c) <- readAmb bHandle;
    writeAmb bHandle (AD (n,p,h,(a,aAmbient):c));
    return aHandle};
  forkIO $ case agent of Agent f -> f aHandle;
  return ()}

out c = Agent $ \bHandle -> atomically $ do {
  bAmbient <- readTVar bHandle;
  AD (bn,bp,bh,bc) <- readAmb bHandle;
  cHandle <- parentOf bHandle;
  AD (cn,cp,ch,cc) <- readAmb cHandle;
  aHandle <- if (cn == c)
               then parentOf cHandle
               else retry;
  AD (an,ap,ah,ac) <- readAmb aHandle;
  writeAmb aHandle (AD (an,ap,ah,(bn,bAmbient):ac));
  writeAmb cHandle (AD (cn,cp,ch,delete (bn,bAmbient) cc));
  writeAmb bHandle (AD (bn,Just aHandle,bh,bc))}

open c = Agent $ \aHandle -> atomically $ do {
  aAmbient <- readTVar aHandle;
  AD (an,ap,ah,ac) <- readAmb aHandle;
  cAmbient <- lookup' c ac;
  AD (cn,cp,ch,cc) <- readTVar cAmbient;
  rePoint aAmbient ch;
  writeAmb aHandle
    (AD (an, ap, ah++ch, (delete (cn,cAmbient) ac)++cc))}


fork agent = Agent $ \s -> do {
    atomically $ return ()
  ; forkIO $ case agent of
               Agent f -> f s
  ; return ()}

Helper Functions

newAmb :: (Maybe Handle) -> Name -> STM Handle
newAmb p n = do {
  me  <- newTVar (AD (n, p, [], []));
  pMe <- newTVar me;
  writeTVar me (AD (n, p, [pMe], []));
  return pMe}

rePoint :: Ambient -> [Handle] -> STM ()
rePoint a []     = return ()
rePoint a (x:xs) = do writeTVar x a;
                      rePoint a xs

Non-deterministic Lookup

choose :: [a] -> STM a
choose []     = retry
choose (x:[]) = return x
choose (x:xs) = or (return x) (choose xs)

assoc :: Name -> [(Name,Ambient)] -> [Ambient]
assoc f [] = []
assoc f ((a,x):xs) =
  if f==a then (x:assoc f xs)
          else assoc f xs

lookup' x l = choose (assoc x l)

References

[1] Abadi, M., Birrell, A., Harris, T., and Isard, M. Semantics of transactional memory and automatic mutual exclusion. In Proc. POPL'08 (2008), pp. 63–74.

[2] Acciai, L., Boreale, M., and Dal-Zilio, S. A concurrent calculus with atomic transactions. In Proc. ESOP'07 (2007), R. De Nicola, Ed., vol. 4421 of LNCS, Springer, pp. 48–63.

[3] Borgstrom, J., Bhargavan, K., and Gordon, A. D. A compositional theory for STM Haskell. Tech. Rep. MSR-TR-2009-66, Microsoft Research, 2009.

[4] Cardelli, L. Mobile ambient synchronization. Technical Note 1997-013, Digital Equipment Corporation, Systems Research Center, 1997.

[5] Cardelli, L., and Gordon, A. D. Mobile ambients. Theoretical Computer Science 240 (2000), 177–213.

[6] Fournet, C., Levy, J.-J., and Schmitt, A. An asynchronous, distributed implementation of mobile ambients. In Proc. TCS'00 (2000), Springer, pp. 348–364.

[7] Giannini, P., Sangiorgi, D., and Valente, A. Safe ambients: Abstract machine and distributed implementation. Science of Computer Programming 59, 3 (2006), 209–249.

[8] Guerraoui, R., Henzinger, T. A., and Singh, V. Completeness and nondeterminism in model checking transactional memories. In Proc. CONCUR'08 (2008), F. van Breugel and M. Chechik, Eds., vol. 5201 of LNCS, Springer, pp. 21–35.

[9] Harris, T., and Fraser, K. Language support for lightweight transactions. In Proc. OOPSLA'03 (2003), pp. 388–402.

[10] Harris, T., Marlow, S., Peyton Jones, S., and Herlihy, M. Composable memory transactions. In Proc. PPOPP'05 (2005), K. Pingali, K. A. Yelick, and A. S. Grimshaw, Eds., ACM, pp. 48–60.

[11] Harris, T., Marlow, S., Peyton Jones, S., and Herlihy, M. Composable memory transactions. Communications of the ACM 51, 8 (2008), 91–100.

[12] Harris, T., and Peyton Jones, S. Transactional memory with data invariants. In Proc. TRANSACT'06 (2006).

[13] Hu, L., and Hutton, G. Towards a verified implementation of software transactional memory. In The Symposium on Trends in Functional Programming (2008). To appear.

[14] Huch, F., and Kupke, F. A high-level implementation of composable memory transactions in Concurrent Haskell. In Proc. Implementation and Application of Functional Languages (2005), vol. 4015 of LNCS, Springer, pp. 124–141.

[15] Hughes, J. Why functional programming matters. Computer Journal 32, 2 (Apr. 1989), 98–107.

[16] Jeffrey, A., and Rathke, J. A theory of bisimulation for a fragment of concurrent ML with local names. Theoretical Computer Science 323, 1–3 (2004), 1–48.

[17] Koutavas, V., and Wand, M. Small bisimulations for reasoning about higher-order imperative programs. In Proc. POPL'06 (2006), ACM, pp. 141–152.

[18] Lee, E. The problem with threads. Computer (2006), 33–42.

[19] Milner, R. Communication and Concurrency. Prentice Hall, 1989.

[20] Moore, K. F., and Grossman, D. High-level small-step operational semantics for transactions. In Proc. POPL'08 (2008), pp. 51–62.

[21] Nanevski, A., Govereau, P., and Morrisett, G. Type-theoretic semantics for transactional concurrency. Tech. Rep. TR-08-07, Harvard University, July 2007.

[22] O'Leary, J., Saha, B., and Tuttle, M. R. Model checking transactional memory with Spin. In Proc. PODC'08 (2008), R. A. Bazzi and B. Patt-Shamir, Eds., ACM, p. 424.

[23] Ousterhout, J. Why threads are a bad idea (for most purposes). Presentation given at the 1996 Usenix Annual Technical Conference, January 1996.

[24] Peyton Jones, S., Gordon, A., and Finne, S. Concurrent Haskell. In Proc. POPL'96 (1996), pp. 295–308.

[25] Phillips, A., Yoshida, N., and Eisenbach, S. A distributed abstract machine for boxed ambient calculi. In Proc. ESOP'04 (2004), D. A. Schmidt, Ed., vol. 2986 of LNCS, Springer, pp. 155–170.

[26] Regev, A., Panina, E. M., Silverman, W., Cardelli, L., and Shapiro, E. BioAmbients: An abstraction for biological compartments. Theoretical Computer Science 325, 1 (2004), 141–167.

[27] Ringenburg, M. F., and Grossman, D. AtomCaml: first-class atomicity via rollback. In Proc. ICFP'05 (2005), ACM, pp. 92–104.

[28] Sangiorgi, D. On the bisimulation proof method. Mathematical Structures in Computer Science 8 (1998), 447–479.

[29] Sangiorgi, D., Kobayashi, N., and Sumii, E. Environmental bisimulations for higher-order languages. In Proc. LICS'07 (2007), IEEE Computer Society, pp. 293–302.

[30] Scott, M. L. Sequential specification of transactional memory semantics. In Proc. TRANSACT'06 (2006).

[31] Shavit, N., and Touitou, D. Software transactional memory. Distributed Computing 10, 2 (1997), 99–116.

[32] Sunshine-Hill, B., and Zarko, L. STM versus locks, ambiently. CIS 552 Final Project, University of Pennsylvania, May 2008.

[33] Vitek, J., Jagannathan, S., Welc, A., and Hosking, A. L. A semantic framework for designer transactions. In Proc. ESOP'04 (2004), D. A. Schmidt, Ed., vol. 2986 of LNCS, Springer, pp. 249–263.


Parallel Performance Tuning for Haskell

Don Jones Jr.
University of Kentucky
[email protected]

Simon Marlow
Microsoft Research
[email protected]

Satnam Singh
Microsoft Research
[email protected]

Abstract

Parallel Haskell programming has entered the mainstream with support now included in GHC for multiple parallel programming models, along with multicore execution support in the runtime. However, tuning programs for parallelism is still something of a black art. Without much in the way of feedback provided by the runtime system, it is a matter of trial and error combined with experience to achieve good parallel speedups.

This paper describes an early prototype of a parallel profiling system for multicore programming with GHC. The system comprises three parts: fast event tracing in the runtime, a Haskell library for reading the resulting trace files, and a number of tools built on this library for presenting the information to the programmer. We focus on one tool in particular, a graphical timeline browser called ThreadScope.

The paper illustrates the use of ThreadScope through a number of case studies, and describes some useful methodologies for parallelizing Haskell programs.

Categories and Subject Descriptors  D.1.1 [Applicative (Functional) Programming]; D.1.3 [Concurrent Programming]

General Terms Performance and Measurement

Keywords Parallel functional programming, performance tuning

1. Introduction

Life has never been better for the Parallel Haskell programmer: GHC supports multicore execution out of the box, including multiple parallel programming models: Strategies (Trinder et al. 1998), Concurrent Haskell (Peyton Jones et al. 1996) with STM (Harris et al. 2005), and Data Parallel Haskell (Peyton Jones et al. 2008). Performance of the runtime system has received attention recently, with significant improvements in parallel performance available in the forthcoming GHC release (Marlow et al. 2009). Many of the runtime bottlenecks that hampered parallel performance in earlier GHC versions are much reduced, with the result that it should now be easier to achieve parallel speedups.

However, optimizing the runtime only addresses half of the problem; the other half is how to tune a given Haskell program to run effectively in parallel. The programmer still has control over task granularity, data dependencies, speculation, and to some extent evaluation order. Getting these wrong can be disastrous for parallel performance. For example, the granularity should neither be too fine nor too coarse. Too coarse and the runtime will not be able to effectively load-balance to keep all CPUs constantly busy; too fine and the costs of creating and scheduling the tiny tasks outweigh the benefits of executing them in parallel.

Current methods for tuning parallel Haskell programs rely largely on trial and error, experience, and an eye for understanding the limited statistics produced at the end of a program's run by the runtime system. What we need are effective ways to measure and collect information about the runtime behaviour of parallel Haskell programs, and tools to communicate this information to the programmer in a way that they can understand and use to solve performance problems with their programs.

In this paper we describe a new profiling system developed for the purposes of understanding the parallel execution of Haskell programs. In particular, our system includes a tool called ThreadScope that allows the programmer to interactively browse the parallel execution profile.

This paper contributes the following:

• We describe the design of our parallel profiling system, and the ThreadScope tool for understanding parallel execution. Our trace file format is fully extensible, and profiling tools built using our framework are both backwards- and forward-compatible with different versions of GHC.
• Through several case studies, we explore how to use ThreadScope for identifying parallel performance problems, and describe a selection of methodologies for parallelising Haskell code.

Earlier methodologies for parallelising Haskell code exist (Trinder et al. 1998), but there are two crucial differences in the multicore GHC setting. Firstly, the trade-offs are likely to be different, since we are working with a shared-memory heap, and communication is therefore cheap¹. Secondly, it has recently been discovered that Strategies interact badly with garbage collection (Marlow et al. 2009), so in this paper we avoid the use of the original Strategies library, relying instead on our own simple hand-rolled parallel combinators.

Our work is at an early stage. The ThreadScope tool displays only one particular view of the execution of Parallel Haskell programs (albeit a very useful one). There are a wealth of possibilities, both for improving ThreadScope itself and for building new tools. We cover some of the possibilities in Section 6.

¹ Though not entirely free, since memory cache hierarchies mean data still has to be shuffled between processors even if that shuffling is not explicitly programmed.

2. Profiling Motivation

Haskell provides a mechanism to allow the user to control the granularity of parallelism by indicating what computations may be usefully carried out in parallel. This is done by using functions from the Control.Parallel module. The interface for Control.Parallel is shown below:

par  :: a → b → b
pseq :: a → b → b

The function par indicates to the GHC run-time system that it may be beneficial to evaluate the first argument in parallel with the second argument. The par function returns as its result the value of the second argument. One can always eliminate par from a program by using the following identity without altering the semantics of the program:

par a b = b

A thread is not necessarily created to compute the value of the expression a. Instead, the GHC run-time system creates a spark which has the potential to be executed on a different thread from the parent thread. A sparked computation expresses the possibility of performing some speculative evaluation. Since a thread is not necessarily created to compute the value of a, this approach has some similarities with the notion of a lazy future (Mohr et al. 1991).

We call such programs semi-explicitly parallel because the programmer has provided a hint about the appropriate level of granularity for parallel operations and the system implicitly creates threads to implement the concurrency. The user does not need to explicitly create any threads or write any code for inter-thread communication or synchronization.

To illustrate the use of par we present a program that performs two compute-intensive functions in parallel. The first compute-intensive function we use is the notorious Fibonacci function:

fib :: Int → Int
fib 0 = 0
fib 1 = 1
fib n = fib (n−1) + fib (n−2)

The second compute-intensive function we use is the sumEuler function taken from (Trinder et al. 2002):

mkList :: Int → [Int]
mkList n = [1..n−1]

relprime :: Int → Int → Bool
relprime x y = gcd x y == 1

euler :: Int → Int
euler n = length (filter (relprime n) (mkList n))

sumEuler :: Int → Int
sumEuler = sum . (map euler) . mkList

The function that we wish to parallelize adds the results of calling fib and sumEuler:

sumFibEuler :: Int → Int → Int
sumFibEuler a b = fib a + sumEuler b

As a first attempt we can try to use par to speculatively spark off the computation of fib while the parent thread works on sumEuler:

−− A wrong way to parallelize f + e
parSumFibEuler :: Int → Int → Int
parSumFibEuler a b
  = f ‘par‘ (f + e)
  where
    f = fib a
    e = sumEuler b

To create two workloads that take roughly the same amount of time to execute we performed some experiments which show that fib 38 takes roughly the same time to execute as sumEuler 5300.
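The paper does not show how these examples are driven; a minimal driver along the following lines (our own sketch, not the authors’ code) can be used to reproduce the measurements, compiled with ghc --make -threaded and run with the +RTS -N2 -s -RTS flags used later in the paper:

−− Our own minimal driver for the example above (a sketch, not from the paper).
main :: IO ()
main = print (parSumFibEuler 38 5300)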

The execution trace for this program as displayed by ThreadScope is shown in Figure 1. This figure shows the execution trace of two Haskell Execution Contexts (HECs), where each HEC corresponds to a processor core. The x-axis is time. The purple portion of each line shows at what time intervals a thread is running and the orange (lighter coloured) bar shows when garbage collection is occurring. Garbage collections are always “stop the world”, in that all Haskell threads must stop during GC, but a GC may be performed either sequentially on one HEC or in parallel on multiple HECs; in Figure 1 we are using parallel GC.

We can examine the statistics produced by the runtime system (using the flags +RTS -s -RTS) to help understand what went wrong:

SPARKS: 1 (0 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    9.39s  (  9.61s elapsed)
GC    time    0.37s  (  0.24s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    9.77s  (  9.85s elapsed)

The log shows that although a single spark was created, no sparks were “converted”, i.e. executed. In this case the performance bug is because the main thread immediately starts to work on the evaluation of fib 38 itself, which causes this spark to fizzle. A fizzled spark is one that is found to be under evaluation or already evaluated, so there is no profit in evaluating it in parallel. The log also shows that the total amount of computation work done is 9.39 seconds (the MUT time); the time spent performing garbage collection was 0.37 seconds (the GC time); and the total amount of work done amounts to 9.77 seconds with 9.85 seconds of wall clock time. A profitably parallel program will have a wall clock time (elapsed time) which is less than the total time².

One might be tempted to fix this problem by swapping the arguments to the + operator in the hope that the main thread will work on sumEuler while the sparked thread works on fib:

−− Maybe a lucky parallelization
parSumFibEuler :: Int → Int → Int
parSumFibEuler a b
  = f ‘par‘ (e + f)
  where
    f = fib a
    e = sumEuler b

This results in the execution trace shown in Figure 2, which shows a sparked thread being taken up by a spare worker thread.

The execution log for this program shows that a spark was used productively and the elapsed time has dropped from 9.85s to 5.33s:

SPARKS: 1 (1 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    9.47s  (  4.91s elapsed)
GC    time    0.69s  (  0.42s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   10.16s  (  5.33s elapsed)

While this trick works, it only works by accident. There is no fixed evaluation order for the arguments to +, and GHC might decide to use a different evaluation order tomorrow. To make the parallelism more robust, we need to be explicit about the evaluation

² Although to measure actual parallel speedup, the wall-clock time for the parallel execution should be compared to the wall-clock time for the sequential execution.


Figure 1. No parallelization of f ‘par‘ (f + e)

Figure 2. A lucky parallelization of f ‘par‘ (e + f)

order we intend. The way to do this is to use pseq³ in combination with par, the idea being to ensure that the main thread works on sumEuler while the sparked thread works on fib:

−− A correct parallelization that does not depend on
−− the evaluation order of +
parSumFibEuler :: Int → Int → Int
parSumFibEuler a b
  = f ‘par‘ (e ‘pseq‘ (f + e))
  where
    f = fib a
    e = sumEuler b

This version does not make any assumptions about the evaluation order of +, but relies only on the evaluation order of pseq, which is guaranteed to be stable.

This example, as well as our wider experience of attempting to write semi-explicit parallel programs, shows that it is often very difficult to understand if and when opportunities for parallelism expressed through par are effectively taken up, and to understand how operations like garbage collection influence the performance of the program. Until recently only high-level summary information about the overall execution of a parallel Haskell program was available. In this paper we describe recent improvements to the Haskell run-time which allow a much more detailed profile to be generated, which can then be used to help debug performance problems.

3. Case Studies

3.1 Batcher’s Bitonic Parallel Sorter

Batcher’s bitonic merger and sorter is a parallel sorting algorithm which has a good implementation in hardware. We have produced an implementation of this algorithm in Haskell originally for circuit generation for FPGAs. However, this executable model also

³ Previous work has used seq for sequential evaluation ordering, but there is a subtle difference between Haskell’s seq and the operator we need for sequencing here. The details are described in Marlow et al. (2009).

represents an interesting software implicit parallelization exercise because the entire parallel structure of the algorithm is expressed in terms of just one combinator called par2:

par2 :: (a → b) → (c → d) → (a, c) → (b, d)
par2 circuit1 circuit2 (input1, input2)
  = (output1, output2)
  where
    output1 = circuit1 input1
    output2 = circuit2 input2

This combinator captures the idea of two circuits which are independent and execute in parallel. This combinator is used to define other combinators which express different ways of performing parallel divide and conquer operations:

two :: ([a] → [b]) → [a] → [b]
two r = halve >→ par2 r r >→ unhalve

ilv :: ([a] → [b]) → [a] → [b]
ilv r = unriffle >→ two r >→ riffle

The halve combinator breaks a list into two sub-lists of even length and the unhalve operation performs the inverse. The riffle combinator permutes its inputs by breaking a list into two halves and then interleaving the resulting lists. unriffle performs the inverse permutation. A sketch of these plumbing combinators is given below.
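The paper leaves these combinators undefined; the following is our own sketch, consistent with the descriptions above, in which the >→ operator is assumed to be forward function composition:

(>→) :: (a → b) → (b → c) → a → c
f >→ g = g . f

halve :: [a] → ([a], [a])
halve xs = splitAt (length xs ‘div‘ 2) xs

unhalve :: ([a], [a]) → [a]
unhalve (xs, ys) = xs ++ ys

riffle :: [a] → [a]
riffle = halve >→ \(xs, ys) → concat (zipWith (\x y → [x, y]) xs ys)

unriffle :: [a] → [a]
unriffle xs = [x | (x, i) ← zip xs [0 ..], even i]
           ++ [x | (x, i) ← zip xs [0 ..], odd i]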

These combinators are in turn used to define a butterfly parallel processing network which describes a merger:

butterfly circuit [x,y] = circuit [x,y]
butterfly circuit input
  = (ilv (butterfly circuit) >→ evens circuit) input

The evens combinator breaks an input list into adjacent groups of two elements and applies the circuit argument to each group. A column of pair-wise processing elements is used to combine the results of two sub-merges:

evens :: ([a] → [b]) → [a] → [b]
evens f = chop 2 >→ map f >→ concat


The chop 2 combinator breaks a list into sub-lists of length 2. This parallel Batcher’s bitonic merger plus the evens function can be used to build a parallel Batcher’s bitonic sorter:

sortB cmp [x, y] = cmp [x, y]
sortB cmp input
  = (two (sortB cmp) >→ sndList reverse >→ butterfly cmp) input

The sndList combinator breaks a list into two halves, applies its argument circuit to the top half and the identity function to the bottom half, and then concatenates the sub-results into a single list.

A straightforward way to perform a semi-explicit parallelization of the par2 combinator is to use par to spark off the evaluation of one of the sub-circuits:

par2 :: (a → b) → (c → d) → (a, c) → (b, d)
par2 circuit1 circuit2 (input1, input2)
  = output1 ‘par‘ (output2 ‘pseq‘ (output1, output2))
  where
    output1 = circuit1 input1
    output2 = circuit2 input2

This relatively simple change results in a definite performance gain due to parallelism. Here is the log output produced by running a test-bench program with just one Haskell execution context:

.\bsortpar.exe +RTS -N1 -l -qg0 -qb -sbsortpar-N1.log
SPARKS: 106496 (0 converted, 106496 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    5.32s  (  5.37s elapsed)
GC    time    0.72s  (  0.74s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    6.04s  (  6.12s elapsed)

Although many sparks are created none are taken up because there is only one worker thread. The execution trace for this invocation is shown in Figure 3.

Running with two threads shows a very good performance improvement:

.\bsortpar.exe +RTS -N2 -l -qg0 -qb -sbsortpar-N2.log
SPARKS: 106859 (49 converted, 106537 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    4.73s  (  3.03s elapsed)
GC    time    1.64s  (  0.72s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    6.36s  (  3.75s elapsed)

This example produces very many sparks, most of which fizzle, but enough sparks are turned into productive work: 6.36 seconds’ worth of work done in 3.75 seconds of wall-clock time. The execution trace for this invocation is shown in Figure 4. There is an obvious sequential block of execution between 2.1 seconds and 2.9 seconds, and this is due to a sequential component of the algorithm which combines the results of parallel sub-computations, i.e. the evens function. We can use the parallel strategies library to change the sequential application in the definition of evens to a parallel map operation:

evens :: ([a] → [b]) → [a] → [b]
evens f = chop 2 >→ parMap rwhnf f >→ concat

This results in many more sparks being converted:

.\bsortpar2.exe +RTS -N2 -l -qg0 -qb -sbsortpar2-N2.log
SPARKS: 852737 (91128 converted, 10175 pruned)

INIT  time    0.00s  (  0.04s elapsed)
MUT   time    4.95s  (  3.86s elapsed)
GC    time    1.29s  (  0.65s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    6.24s  (  4.55s elapsed)

3.2 Soda

Soda is a program for solving word-search problems: given a rectangular grid of letters, find occurrences of a word from a supplied list, where a word can appear horizontally, vertically, or diagonally, in either direction (giving a total of eight possible orientations).

The program has a long history as a Parallel Haskell benchmark (Runciman and Wakeling 1993). The version we start with here is a recent incarnation, using a random initial grid with a tunable size. The words do not in fact appear in the grid; the program just fruitlessly searches the entire grid for a predefined list of words. One advantage of this formulation for benchmark purposes is that the program’s performance does not depend on the search order; however, a disadvantage is that the parallel structure is unrealistically regular.

The parallelism is expressed using parListWHNF to avoid the space leak issues with the standard strategy implementation of parList (Marlow et al. 2009). The parListWHNF function is straightforwardly defined thus:

parListWHNF :: [a] -> ()
parListWHNF []     = ()
parListWHNF (x:xs) = x ‘par‘ parListWHNF xs

To establish the baseline performance, we run the program using GHC’s +RTS -s flags; below is an excerpt of the output:

SPARKS: 12 (12 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    7.27s  (  7.28s elapsed)
GC    time    0.61s  (  0.72s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    7.88s  (  8.00s elapsed)

We can see that there are only 12 sparks generated by this program: in fact the program creates one spark per word in the search list, of which there are 12. This rather coarse granularity will certainly limit the ability of the runtime to effectively load-balance as we increase the number of cores, but that won’t be an issue with a small number of cores.

Initially we try with 4 cores, and with GHC’s parallel GC enabled:

SPARKS: 12 (11 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    8.15s  (  2.21s elapsed)
GC    time    4.50s  (  1.17s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   12.65s  (  3.38s elapsed)

Not bad: 8.00/3.38 is a speedup of around 2.4 on 4 cores. But since this program has a highly parallel structure, we might hope to do better.

Figure 5 shows the ThreadScope profile for this version of soda. We can see that while an overall view of the runtime shows a reasonable parallelization, if we zoom into the initial part of the run (Figure 6) we can see that HEC 0 is running continuously, but threads on the other HECs are running very briefly and then immediately getting blocked (zooming in further would show the individual events).

Going back to the program, we can see that the grid of letters is generated lazily by a function mk grid. What is happening here is that the main thread creates sparks before the grid has been


Figure 3. A sequential execution of bsort

Figure 4. A parallel execution of bsort

Figure 5. Soda ThreadScope profile

Figure 6. Soda ThreadScope profile (zoomed initial portion)


evaluated, and then proceeds to evaluate the grid. As each spark runs, it blocks almost immediately waiting for the main thread to complete evaluation of the grid.

This type of blocking is often not disastrous, since a thread will become unblocked soon after the thunk on which it is blocking is evaluated (see the discussion of “blackholes” in Marlow et al. (2009)). There is nevertheless a short delay between the thread becoming runnable again and the runtime noticing this and moving the thread to the run queue. Sometimes this delay can be hidden if the program has other sparks it can run in the meantime, but that is not the case here. There are also costs associated with blocking the thread and waking it up again, which we would like to avoid if possible.

One way to avoid this is to evaluate the whole grid before creating any sparks. This is achieved by adding a call to rnf:

−− force the grid to be evaluated:
evaluate (rnf grid)
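For completeness, and as an assumption about where these functions lived in the libraries of the time, the call above would be supported by imports roughly as follows (rnf being a method of the Strategies library’s NFData class):

import Control.Exception (evaluate)
import Control.Parallel.Strategies (rnf)   −− requires an NFData instance for the grid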

The effect on the profile is fairly dramatic (Figure 7). We can see that the parallel execution doesn’t begin until around 500ms into the execution: creating the grid is taking quite a while. The program also runs slightly faster in parallel now (a 6% improvement, or a parallel speedup of 2.5 compared to 2.4):

SPARKS: 12 (11 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    7.62s  (  2.31s elapsed)
GC    time    3.35s  (  0.86s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   10.97s  (  3.18s elapsed)

which we attribute to less blocking and unblocking of threads. We can also see that this program now has a significant sequential section - around 15% of the execution time - which limits the maximum speedup we can achieve with 4 cores to 2.7 (by Amdahl’s law, 1/(0.15 + 0.85/4) ≈ 2.7), and we are already very close to that at 2.5.

To improve parallelism further with this example we would have to parallelize the creation of the initial grid; this probably isn’t hard, but it would be venturing beyond the realms of realism somewhat to optimize the creation of the input data for a synthetic benchmark, so we conclude the case study here. It has been instructional to see how thread blocking appears in the ThreadScope profile, and how to avoid it by pre-evaluating data that is needed on multiple CPUs.

Here are a couple more factors that may be affecting the speedup we see in this example:

• The static grid data is created on one CPU and has to be fetched into the caches of the other CPUs. We hope in the future to be able to show the rate of cache misses (and similar characteristics) on each CPU alongside the other information in the ThreadScope profile, which would highlight issues such as this.
• The granularity is too large: we can see that the HECs finish unevenly, losing a little parallelism at the end of the run.

3.3 minimax

Minimax is another historical Parallel Haskell program. It is based on an implementation of alpha-beta searching for the game tic-tac-toe, from Hughes’ influential paper “Why Functional Programming Matters” (Hughes 1989). For the purposes of this paper we have generalized the program to use a game board of arbitrary size: the original program used a fixed 3x3 grid, which is too quickly solved to be a useful parallelism benchmark nowadays. However 4x4 still represents a sufficient challenge without optimizing the program further.

For the examples that follow, the benchmark is to evaluate the game tree 6 moves ahead, on a 4x4 grid in which the first 4 moves have already been randomly played. This requires evaluating a maximum of roughly 500,000,000 positions, although parts of the game tree will be pruned, as we shall describe shortly.

We will explore a few different parallelizations of this program using ThreadScope. The function for calculating the best line in the game is alternate:

alternate depth player f g board
  = move : alternate depth opponent g f board’
  where
    move@(board’, _) = best f possibles scores
    scores    = map (bestMove depth opponent g f) possibles
    possibles = newPositions player board
    opponent  = opposite player

This function calculates the sequence of moves in the game that give the best outcome (as calculated by the alpha-beta search) for each player. At each stage, we generate the list of possible moves (newPositions), evaluate each move by alpha-beta search on the game tree (bestMove), and pick the best one (best).

Let’s run the program sequentially first to establish the baseline runtime:

14,484,898,888 bytes allocated in the heap

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    8.44s  (  8.49s elapsed)
GC    time    3.49s  (  3.51s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   11.94s  ( 12.00s elapsed)

One obvious way to parallelize this problem is to evaluate each of the possible moves in parallel. This is easy to achieve with a parListWHNF strategy:

scores = map (bestMove depth opponent g f) possibles
           ‘using‘ parListWHNF

where using is defined to apply its first argument to its second argument and then return the result evaluated to weak-head normal form.

x ‘using‘ s = s x ‘seq‘ x

And indeed this does yield a reasonable speedup:

14,485,148,912 bytes allocated in the heap

SPARKS: 12 (11 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    9.19s  (  2.76s elapsed)
GC    time    7.01s  (  1.75s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   16.20s  (  4.52s elapsed)

A speedup of 2.7 on 4 processors is a good start! However, looking at the ThreadScope profile (Figure 8), we can see that there is a jagged edge on the right: our granularity is too large, and we don’t have enough work to keep all the processors busy until the end. What’s more, as we can see from the runtime statistics, there were only 12 sparks, corresponding to the 12 possible moves in the 4x4 grid after 4 moves have already been played. In order to scale to more CPUs we will need to find more parallelism.

The game tree evaluation is defined as follows:

bestMove :: Int → Piece → Player → Player → Board
            → Evaluation
bestMove depth p f g


Figure 7. Soda ThreadScope profile (evaluating the input grid eagerly)

Figure 8. Minimax ThreadScope profile

  = mise f g
  . cropTree
  . mapTree static
  . prune depth
  . searchTree p

Where searchTree lazily generates a search tree starting from the current position, with player p to play next. The function prune prunes the search tree to the given depth, and mapTree static applies a static evaluation function to each node in the tree. The function cropTree prunes branches below a node in which the game has been won by either player. Finally, mise performs the alpha-beta search, where f and g are the min and max functions over evaluations for the current player p.

We must be careful with parallelization here, because the algorithm is relying heavily on lazy evaluation to avoid evaluating parts of the game tree. Certainly we don’t want to evaluate beyond the prune depth, and we also don’t want to evaluate beyond a node in which one player has already won (cropTree prunes further moves after a win). The alpha-beta search will prune even more of the tree, since there is no point exploring any further down a branch if it has already been established that there is a winning move. So unless we are careful, some of the parallelism we add here may be wasted speculation.

The right place to parallelize is in the alpha-beta search itself. Here is the sequential code:

mise :: Player → Player → Tree Evaluation → Evaluation
mise f g (Branch a []) = a
mise f g (Branch _ l)  = foldr f (g OWin XWin) (map (mise g f) l)

The first equation looks for a leaf, and returns the evaluation of the board at that point. A leaf is either a completed game (either drawn or a winning position for one player), or the result of pruning the search tree. The second equation is the interesting one: foldr f picks the best option for the current player from the list of evaluations at the next level. The next level evaluations are given by map (mise g f) l, which picks the best options for the other player (which is why the f and g are reversed).

The map here is a good opportunity for parallelism. Adding a parListWHNF strategy should be enough:

mise f g (Branch _ l) = foldr f (g OWin XWin)
                          (map (mise g f) l ‘using‘ parListWHNF)

However, this will try to parallelize every level of the search, leading to some sparks with very fine granularity. Also it may introduce too much speculation: elements in each list after a win do not need to be evaluated. Indeed, if we try this we get:

22,697,543,448 bytes allocated in the heap

SPARKS: 4483767 (639031 converted, 3457369 pruned)

INIT  time    0.00s  (  0.01s elapsed)
MUT   time   16.19s  (  4.13s elapsed)
GC    time   27.21s  (  6.82s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   43.41s  ( 10.95s elapsed)

We ran a lot of sparks (600k), but we didn’t achieve much speedup over the sequential version. One clue that we are actually speculating useless work is the amount of allocation. In the


sequential run the runtime reported 14GB allocated, but this parallel version allocated 22GB⁴.

In order to eliminate some of the smaller sparks, we can parallelize the alpha-beta search to a fixed depth. This is done by introducing a new variant of mise, parMise, that applies the parListWHNF strategy up to a certain depth, and then calls the sequential mise beyond that.
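The definition of parMise is not given in the paper; a minimal sketch consistent with this description (our own guess at the code, reusing using and parListWHNF from above) might be:

parMise :: Int → Player → Player → Tree Evaluation → Evaluation
parMise 0 f g t             = mise f g t
parMise n f g (Branch a []) = a
parMise n f g (Branch _ l)  = foldr f (g OWin XWin)
                                (map (parMise (n−1) g f) l ‘using‘ parListWHNF)

Just using a depth of one gives quite good results: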

SPARKS: 132 (120 converted, 12 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    8.82s  (  2.59s elapsed)
GC    time    6.65s  (  1.70s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   15.46s  (  4.30s elapsed)

Though as we can see from the ThreadScope profile (Figure 9), there are some gaps. Increasing the threshold to two works nicely:

SPARKS: 1452 (405 converted, 1046 pruned)

INIT  time    0.00s  (  0.03s elapsed)
MUT   time    8.86s  (  2.31s elapsed)
GC    time    6.32s  (  1.57s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   15.19s  (  3.91s elapsed)

We have now achieved a speedup of 3.1 on 4 cores against the sequential code, and as we can see from the final ThreadScope profile (Figure 10) all our cores are kept busy.

We found that increasing the threshold to 3 starts to cause speculation of unnecessary work. In 4x4 tic-tac-toe most positions are a draw, so it turns out that there is little speculation in the upper levels of the alpha-beta search, but as we get deeper in the tree, we find positions that are a certain win for one player or another, which leads to speculative work if we evaluate all the moves in parallel.

Ideally GHC would have better support for speculation: right now, speculative sparks are not garbage collected when they are found to be unreachable. We do plan to improve this in the future, but unfortunately changing the GC policy for sparks is incompatible with the current formulation of Strategies (Marlow et al. 2009).

3.4 Thread Ring

The thread ring benchmark originates in the Computer Language Benchmarks Game⁵ (formerly known as the Great Computer Language Shootout). It is a simple concurrency benchmark, in which a large number of threads are created in a ring topology, and then messages are passed around the ring. We include it here as an example of profiling a Concurrent Haskell program using ThreadScope, in contrast to the other case studies which have investigated programs that use semi-explicit parallelism.

The code for our version of the benchmark is given in Figure 11. This version uses a linear string of threads rather than a ring, where a number of messages are pumped in to the first thread in the string, and then collected at the other end.

Our aim is to try to make this program speed up in parallel. We expect there to be parallelism available: multiple messages are being pumped through the thread string, so we ought to be able to pump messages through distinct parts of the string in parallel.

First, the sequential performance. This is for 500 messages and 2000 threads:

⁴ CPU time is not a good measure of speculative work, because in the parallel runtime threads can sometimes be spinning while waiting for work, particularly in the GC.
⁵ http://shootout.alioth.debian.org/

import Control.Concurrent
import Control.Monad
import System
import GHC.Conc (forkOnIO)

thread :: MVar Int → MVar Int → IO ()
thread inp out = do
  x ← takeMVar inp
  putMVar out $! x+1
  thread inp out

spawn cur n = do
  next ← newEmptyMVar
  forkIO $ thread cur next
  return next

main = do
  n ← getArgs >>= readIO . head
  s ← newEmptyMVar
  e ← foldM spawn s [1..2000]
  f ← newEmptyMVar
  forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum
  replicateM n (putMVar s 0)
  takeMVar f

Figure 11. ThreadRing code

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    0.18s  (  0.19s elapsed)
GC    time    0.01s  (  0.01s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    0.19s  (  0.21s elapsed)

Next, running the program on two cores:

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    0.65s  (  0.36s elapsed)
GC    time    0.02s  (  0.01s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    0.67s  (  0.38s elapsed)

Things are significantly slower when we add a core. Let’s examine the ThreadScope profile to see why - at first glance, the program seems to be using both cores, but as we zoom in we can see that there are lots of gaps (Figure 12).

In this program we want to avoid communication between the two separate cores, because that will be expensive. We want as much communication as possible to happen between threads on the same core, where it is cheap. In order to do this, we have to give the scheduler some help. We know the structure of the communication in this program: messages are passed along the string in sequence, so we can place threads optimally to take advantage of that. GHC provides a way to place a thread onto a particular core (or HEC), using the forkOnIO operation. The placement scheme we use is to divide the string into linear segments, one segment per core (in our case two).
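The paper does not show the modified spawn code; one way to express this placement scheme (our own sketch, with hypothetical names spawnOn, nCores and nThreads) is to replace spawn with a version that pins thread i onto a segment using forkOnIO:

spawnOn :: Int → Int → MVar Int → Int → IO (MVar Int)
spawnOn nCores nThreads cur i = do
  next ← newEmptyMVar
  forkOnIO ((i − 1) * nCores ‘div‘ nThreads) $ thread cur next   −− pin to a segment
  return next

and, in main, replacing foldM spawn s [1..2000] with, for example, foldM (spawnOn 2 2000) s [1..2000].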

This strategy gets us back to the same performance as the sequential version:

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    0.23s  (  0.19s elapsed)
GC    time    0.02s  (  0.02s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    0.26s  (  0.21s elapsed)

Why don’t we actually see any speedup? Figure 13 shows the ThreadScope profile. The program has now been almost linearized; there is a small amount of overlap, but most of the execution is sequential, first on one core and then the other.


Figure 9. Minimax ThreadScope profile (with parMise 1)

Figure 10. Minimax ThreadScope profile (with parMise 2)

Figure 12. ThreadRing profile (no explicit placement; zoomed in)

Figure 13. ThreadRing profile (with explicit placement)


Figure 14. ThreadRing profile (explicit placement and more messages)

Investigating the profile in more detail shows that this is a scheduling phenomenon. The runtime has moved all the messages through the first string before it propagates any into the second string, and this can happen because the total number of messages we are using for the benchmark is less than the number of threads. If we increase the number of messages, then we do actually see more parallelism. Figure 14 shows the execution profile for 2000 messages and 2000 threads, and we can see there is significantly more overlap.

4. Profiling Infrastructure

Our profiling framework comprises three parts:

• Support in GHC’s runtime for tracing events to a log file at run-time. The tracing is designed to be as lightweight as possible, so as not to have any significant impact on the behaviour of the program being measured.
• A Haskell library ghc-events that can read the trace file generated by the runtime and build a Haskell data structure representing the trace.
• Multiple tools make use of the ghc-events library to read and analyze trace files.

Having a single trace-file format and a library that parses it means that it is easy to write a new tool that works with GHC trace files: just import the ghc-events package and write code that uses the Haskell data structures directly. We have already built several such tools ourselves, some of which are merely proof-of-concept experiments, but the ghc-events library makes it almost trivial to create new tools (a sketch of the first of these is given after the list):

• A simple program that just prints out the (sorted) contents of the trace file as text. Useful for checking that a trace file can be parsed, and for examining the exact sequence of events.
• The ThreadScope graphical viewer.
• A tool that parses a trace file and generates a PDF format timeline view, similar to the ThreadScope view.
• A tool that generates input in the format expected by the GtkWave circuit waveform viewer. This was used as an early prototype for ThreadScope, since the timeline view that we want to display has a lot in common with the waveform diagrams that gtkwave displays and browses.

4.1 Fast runtime tracing

The runtime system generates trace files that log certain events and the time at which they occurred. The events are typically those related to thread activity; for example, “HEC 0 started to run thread 3”, or “thread 5 blocked on an MVar”. The kinds of events we can log are limited only by the extra overhead incurred by the act of logging them. Minimizing the overhead of event logging is something we care about: the goal is to profile the actual runtime behaviour of the program, so it is important that, as far as possible, we avoid disturbing the behaviour that we are trying to profile.

In the GHC runtime, a pre-allocated event buffer is used by each HEC to store generated events. By doing so, we avoid any dynamic memory allocation overhead, and require no locks since the buffers are HEC-local. However, this does require us to flush the buffer to the filesystem once it becomes full; since the buffer is a fixed size, we pay a near-constant penalty for each flush and a deterministic delay on the GHC runtime.

The HEC-local buffers are flushed independently, which means that events in the log file appear out-of-order and have to be sorted. Sorting of the events is easily performed by the profiling tool after reading in the log file using the ghc-events library.

To measure the speed at which the GHC runtime can log events, we used a C program (no Haskell code, just using the GHC runtime system as a library) that simply generates 2,000,000 events, alternating between “thread start” and “thread stop” events. Our program generates a 34MB trace file and runs in 0.31 seconds elapsed time:

INIT  time    0.00s  (  0.02s elapsed)
MUT   time    0.22s  (  0.29s elapsed)
GC    time    0.00s  (  0.00s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    0.22s  (  0.31s elapsed)

which gives a rough figure of 150ns for each event on average. Looking at the ThreadScope view of this program (Figure 15) we can clearly see where the buffer flushes are happening, and that each one is about 5ms long.

An alternative approach is to use memory-mapped files, and write our events directly into memory, leaving the actual file writing to the OS. This would allow writing to be performed asynchronously, which would hopefully reduce the impact of the buffer flush. According to strace on Linux, the above test program is spending 0.7s writing buffers, so making this asynchronous would save us about 30ns per event on average. However, on a 32-bit machine where we can’t afford to reserve a large amount of address space for the whole log file, we would still have to occasionally flush and remap new portions of the file. This alternative approach is something we plan to explore in the future.

To see how much impact event logging has on real execution times, we took a parallel version of the canonical Fibonacci function, parfib, and compared the time elapsed with and without event logging enabled for 50 executions of parfib on an Intel(R) Core(TM)2 Duo CPU T5250 1.50GHz, using both cores. The program generates about 2,000,000 events during the run, and generates a 40MB log file.

Figure 15. Synthetic event benchmark

parfib eventlog

./Main 40 10 +RTS -N2 -l -RTS
Avg Time Elapsed    Standard Deviation
20.582757s          0.789547s

parfib without eventlog
./Main 40 10 +RTS -N2 -RTS
Avg Time Elapsed    Standard Deviation
17.447493s          1.352686s

Considering the significant number of events generated in the traces and the very detailed profiling information made available by these traces, the overhead does not have an immense impact at approximately a 10-25% increase in elapsed time. In the case of parfib, the event representing the creation of a new spark is dominant, comprising at least 80% of the events generated. In fact, it is debatable whether we should be logging the creation of a spark, since the cost of logging this event is likely to be larger than the cost of creating the spark itself - a spark creation is simply a write into a circular buffer.

For parallel quicksort, far fewer sparks are created and most of the computation is spent in garbage collection; thus, we can achieve an almost unnoticeable overhead from event tracing. The parallel quicksort example involved sorting a list of 100,000 randomly generated integers and was measured in the same manner as parfib, comparing runs with and without event logging, except that in this test we perform 100 executions on an Intel(R) Core(TM) 2 Quad CPU 3.0GHz.

parquicksort eventlog
./Main +RTS -N4 -l -RTS
Avg Time Elapsed    Standard Deviation
14.201385s          2.954869s

parquicksort without eventlog
./Main +RTS -N4 -RTS
Avg Time Elapsed    Standard Deviation
15.187529s          3.385293s

Since parallel quicksort spent the majority of the computation doing useful work, particularly garbage collection of the created lists, a trace file of only approximately 5MB and near 300,000 events was created and the overhead of event tracing is not noticeable.

The crux of event tracing is that even when a poorly performing program utilizes event tracing, the overhead should still not be devastating to the program’s performance, and, best of all, on a program with high utilization event tracing should barely affect the performance.

4.2 An extensible file format

We believe it is essential that the trace file format is both backwards and forwards compatible, and architecture independent. In particular, this means that:

• If you build a newer version of a tool, it will still work with the trace files you already have, and trace files generated by programs compiled with older versions of GHC.
• If you upgrade your GHC and recompile your programs, the trace files will still work with any profiling tools you already have.
• Trace files do not have a shelf life. You can keep your trace files around, safe in the knowledge that they will work with future versions of profiling tools. Trace files can be archived, and shared between machines.

Nevertheless, we don’t expect the form of trace files to remain completely static. In the future we will certainly want to add new events, and add more information to existing events. We therefore need an extensible file format. Informally, our trace files are structured as follows (a sketch of these structures in Haskell follows the list):

• A list of event types. An event-type is a variable-length structure that describes one kind of event. The event-type structure contains:

  – A unique number for this event type

  – A field describing the length in bytes of an instance of the event, or zero for a variable-length event.

  – A variable-length string (preceded by its length) describing this event (for example “thread created”)

  – A variable-length field (preceded by its length) for future expansion. We might in the future want to add more fields to the event-type structure, and this field allows for that.

• A list of events. Each event begins with an event number that corresponds to one of the event types defined earlier, and the length of the event structure is given by the event type (or it has variable length). The event also contains:

  – A nanosecond-resolution timestamp.

  – For a variable-length event, the length of the event.

  – Information specific to this event, for example which CPU it occurred on. If the parser knows about this event, then it can parse the rest of the event’s information, otherwise it can skip over this field because its length is known.
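As a concrete but hypothetical illustration of this layout, the two structures might be rendered in Haskell along the following lines; these are our own types, not the actual definitions used by the ghc-events library:

import Data.Word (Word16, Word64)
import qualified Data.ByteString as B

data EventType = EventType
  { etNum   :: Word16         −− unique number identifying this kind of event
  , etSize  :: Maybe Word16   −− fixed size in bytes; Nothing for variable-length
  , etDesc  :: String         −− description, e.g. "thread created"
  , etExtra :: B.ByteString   −− length-prefixed field reserved for future expansion
  }

data Event = Event
  { evTime :: Word64          −− nanosecond-resolution timestamp
  , evType :: Word16          −− refers to an event type by its unique number
  , evData :: B.ByteString    −− event-specific payload, skipped if unknown
  }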

The unique numbers that identify events are shared knowledge between GHC and the ghc-events library. When creating a new event, a new unique identifier is chosen; identifiers can never be re-used.

Even when parsing a trace file that contains new events, the parser can still give a timestamp and a description of the unknown events. The parser might encounter an event-type that it knows about, but the event-type might contain new unknown fields. The parser can recognize this situation and skip over the extra fields, because it knows the length of the event from the event-type structure.


Therefore when a tool encounters a new log file it can continue to provide consistent functionality.

Of course, there are scenarios in which it isn’t possible to provide this ideal graceful degradation. For example, we might construct a tool that profiles a particular aspect of the behaviour of the runtime, and in the future the runtime might be redesigned to behave in a completely different way, with a new set of events. The old events simply won’t be generated any more, and the old tool won’t be able to display anything useful with the new trace files. Still, we expect that our extensible trace file format will allow us to smooth over the majority of forwards- and backwards-compatibility issues that will arise between versions of the tools and GHC runtime. Moreover, extensibility costs almost nothing, since the extra fields are all in the event-types header, which has a fixed size for a given version of GHC.

5. Related Work

GranSim (Loidl 1998) is an event-driven simulator for the parallel execution of Glasgow Parallel Haskell (GPH) programs which allows the parallel behaviour of Haskell programs to be analyzed by instantiating any number of virtual processors which are emulated by a single thread on the host machine. GranSim has an associated set of visualization tools which show overall activity, per-processor activity, and per-thread activity. There is also a separate tool for analyzing the granularity of the generated threads. The GUM system (Trinder et al. 1996) is a portable parallel implementation of Haskell with good profiling support for distributed implementations.

Recent work on the Eden Trace Viewer (Berthold and Loogen 2007) illustrates how higher level trace information can help with performance tuning. We hope to adopt many of the lessons learned in future versions of ThreadScope.

6. Conclusions and Further Work

We have shown how thread-based profile information can be effectively used to help understand and fix parallel performance bugs in both Parallel Haskell and Concurrent Haskell programs, and we expect these profiling tools to also be of benefit to developers using Data Parallel Haskell in the future.

The ability to profile parallel Haskell programs plays an important part in the development of such programs because the analysis process motivates the need to develop specialized strategies to help control evaluation order, extent and granularity, as we demonstrated in the minimax example.

Here are some of the future directions we would like to take this work:

• Improve the user interface and navigation of ThreadScope. For example, it would be nice to filter the display to show just a subset of the threads, in order to focus on the behaviour of a particular thread or group of threads.
• It would also be useful to understand how threads interact with each other via MVars, e.g. to make it easier to see which threads are blocked on read and write accesses to MVars.
• The programmer should be able to generate events programmatically, in order to mark positions in the timeline so that different parts of the program’s execution can easily be identified and separated in ThreadScope.
• It would be straightforward to produce graphs similar to those from the GpH and GranSim programming tools (Trinder et al. 2002; Loidl 1998), either by writing a Haskell program to translate the GHC trace files into the appropriate input for these tools, or by rewriting the tools themselves in Haskell.
• Combine the timeline profile with information from the OS and CPU. For example, for IO-bound concurrent programs we might like to see IO or network activity displayed on the timeline. Information from CPU performance counters could also be superimposed or displayed alongside the thread timelines, providing insight into cache behaviour, for example.
• Have the runtime system generate more tracing information, so that ThreadScope can display information about such things as memory usage, run queue sizes, spark pool sizes, and foreign call activity.

Acknowledgments

The authors would like to acknowledge the work of the developers of previous Haskell concurrent and parallel profiling systems which have provided much inspiration for our own work. Specifically, work on GpH, GranSim and Eden was particularly useful.

We wish to thank Microsoft Research for funding Donnie Jones’ visit to Cambridge in 2008 during which he developed an early prototype of event tracing in GHC.

References

Jost Berthold and Rita Loogen. Visualizing parallel functional program runs: Case studies with the Eden Trace Viewer. In Parallel Computing: Architectures, Algorithms and Applications. Proceedings of the International Conference ParCo 2007, Julich, Germany, 2007.

Tim Harris, Simon Marlow, Simon Peyton-Jones, and Maurice Herlihy. Composable memory transactions. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 48–60, New York, NY, USA, 2005. ACM. ISBN 1-59593-080-9. doi: http://doi.acm.org/10.1145/1065944.1065952.

John Hughes. Why functional programming matters. The Computer Journal, 32(2):98–107, April 1989.

H-W. Loidl. Granularity in Large-Scale Parallel Functional Programming. PhD thesis, Department of Computing Science, University of Glasgow, March 1998.

Simon Marlow, Simon Peyton Jones, and Satnam Singh. Runtime support for multicore Haskell. In ICFP’09: The 14th ACM SIGPLAN International Conference on Functional Programming, Edinburgh, Scotland, 2009.

E. Mohr, D. A. Kranz, and R. H. Halstead. Lazy task creation – a technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3), July 1991.

S. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In Proc. of POPL’96, pages 295–308. ACM Press, 1996.

Simon Peyton Jones, Roman Leshchinskiy, Gabriele Keller, and Manuel M. T. Chakravarty. Harnessing the multicores: Nested data parallelism in Haskell. In IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2008), 2008.

Colin Runciman and David Wakeling. Profiling parallel functional computations (without parallel machines). In Glasgow Workshop on Functional Programming, pages 236–251. Springer, 1993.

P.W. Trinder, K. Hammond, J.S. Mattson, A.S. Partridge, and S.L. Peyton Jones. GUM: a portable parallel implementation of Haskell. In ACM Conference on Programming Languages Design and Implementation (PLDI’96), Philadelphia, May 1996.

P.W. Trinder, K. Hammond, H.-W. Loidl, and Simon Peyton Jones. Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1):23–60, January 1998. URL http://research.microsoft.com/Users/simonpj/Papers/strategies.ps.gz.

P.W. Trinder, H.-W. Loidl, and R. F. Pointon. Parallel and Distributed Haskells. Journal of Functional Programming, 12(5):469–510, July 2002.


The Architecture of the Utrecht Haskell Compiler

Atze Dijkstra    Jeroen Fokker    S. Doaitse Swierstra
Department of Information and Computing Sciences

Universiteit Utrecht
P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

{atze,jeroen,doaitse}@cs.uu.nl

Abstract

In this paper we describe the architecture of the Utrecht Haskell Compiler (UHC). UHC is a new Haskell compiler, that supports most (but not all) Haskell 98 features, plus some experimental extensions. It targets multiple backends, including a bytecode interpreter backend and a whole-program analysis backend, both via C. The implementation is rigorously organized as stepwise transformations through some explicit intermediate languages. The tree walks of all transformations are expressed as an algebra, with the aid of an Attribute Grammar based preprocessor. The compiler is just one materialization of a framework that supports experimentation with language variants, thanks to an aspect-oriented internal organization.

Categories and Subject Descriptors D.3.4 [Programming languages]: Compilers; Preprocessors; F.3.2 [Logics and meanings of programs]: Program analysis

General Terms Languages, Design

Keywords Haskell, compiler architecture, attribute grammar, aspect orientation

1. Introduction

On the occasion of the Haskell Hackathon on April 18th, 2009, we announced the first release of a new Haskell compiler: the Utrecht Haskell Compiler, or UHC for short. Until Haskell Prime [16] is available as a standard, UHC strives to be a full Haskell 98 [30] compiler (although currently it lacks a few features). The reason that we announce the compiler even though it is not yet fully finished, is that we feel that UHC is mature enough to use for play and experimentation.

One can ask why there is a need for (yet) another Haskell compiler, where the Glasgow Haskell Compiler (GHC) is already available as a widely used, fully featured, production quality Haskell compiler [26, 15, 28, 31]. In fact, we are using GHC ourselves for the implementation of UHC. Also, various alternatives exist, like Hugs (that in its incarnation of Gofer was the epoch maker for Haskell), and the Haskell compilers from York (NHC/YHC).


Still, we think UHC has something to add to existing compilers, not so much as a production compiler (yet), but more because of its systematically designed and extensible architecture. It is intended to be a platform for those who wish to experiment with adding new language or type system features. In a broader sense, UHC is a framework from which one can construct a series of increasingly complex compilers for languages reaching from simple lambda calculus to (almost-)Haskell 98. The UHC compiler in the strict sense is just the culmination point of the series. We have been referring to the framework as ‘EHC’ (E for essential, extensible, educational, experimental. . . ) in the past [10], but for ease we now call both the framework and its main compiler ‘UHC’. Internally we use a stepwise and aspect-wise approach, realized by the use of attribute grammars (AG) and other tools.

In its current state, UHC supports most of Haskell 98 (including polymorphic typing, type classes, input/output, and the base library), but a few features are still lacking (like defaulting, and some members of the awkward squad [29]). On the other hand, there are some extensions, notably to the type system. The deviations from the standard are not caused by obstinacy or a desire to change the standard, but rather by an arbitrary prioritization of the feature wish list.

The main structure of the compiler is shown in Figure 1. Haskell source text is translated to an executable program by stepwise transformation. Some transformations translate the program to a lower level language; many others are transformations within one language, establishing an invariant or performing an optimization. All transformations, both within a language and between languages, are expressed as an algebra giving a semantics to the language. The algebras are described with the aid of an attribute grammar, which makes it possible to write multi-pass tree-traversals without even knowing the exact number of passes. Although the compiler driver is set up to pass data structures between transformations, for all intermediate languages we have a concrete syntax with a parser and a pretty printer. This facilitates debugging the compiler, by inspecting code between transformations.

Here is a short characterization of the intermediate languages. In Section 3 we give a more detailed description.

• Haskell (HS): a general-purpose, higher-order, polymorphically typed, lazy functional language.

• Essential Haskell (EH): a higher-order, polymorphically typed, lazy functional language close to lambda-calculus, without syntactic sugar.

• Core: an untyped, lazy functional language close to lambda-calculus (at the time of this writing we are working on moving to a typed intermediate language, a combination of Henk [32], GHC core, and recent work on calling conventions [6]).


• Grin: ‘Graph reduction intermediate notation’, the instruction set of a virtual machine of a small functional language with strict semantics, with features that enable implementation of laziness [7].

• Silly: ‘Simple imperative little language’, an abstraction of features found in every imperative language (if-statements, assignments, explicit memory allocation) augmented with primitives for manipulating a stack, easily translatable to e.g. C (not all features of C are provided, only those that are needed for our purpose).

• BC: a bytecode language for a low-level machine intended to interpret Grin which is not whole-program analyzed nor transformed. We do not discuss this language in this paper.

The compiler targets different backends, based on a choice of the user. In all cases, the compiler starts compiling on a per module basis, desugaring the Haskell source text to Essential Haskell, type checking it and translating it to Core. Then there is a choice from three modes of operation:

• In whole-program analysis mode, the Core modules of the program and required libraries are assembled together and processed further as a whole. At the Grin level, elaborate inter-module optimization takes place. Ultimately, all functions are translated to low level C, which can be compiled by a standard compiler. As alternative backends, we are experimenting with other target languages, among which are the Common Intermediate Language (CIL) from the Common language infrastructure used by .NET [19], and the Low-Level Virtual Machine (LLVM) compiler infrastructure [25].

• In bytecode interpreter mode, the Core modules are translated to Grin separately. Each Grin module is translated into instructions for a custom bytecode machine. The bytecode is emitted in the form of C arrays, which are interpreted by a handwritten bytecode interpreter in C.

• In Java mode, the Core modules are translated to bytecode for the Java virtual machine (JVM). Each function is translated to a separate class with an eval function, and each closure is represented by an object combining a function with its parameters. Together with a driver function in Java which steers the interpretation, these can be stored in a Java archive (jar) and be interpreted by a standard Java interpreter.

The bytecode interpreter mode is intended for use during program development: it compiles fast, but because of the interpretation overhead the generated code is not very fast. The whole-program analysis mode is intended for the final program: it takes more time to compile, but generates code that is more efficient.

In Section 2 we describe the tools that play an important role in UHC: the Attribute Grammar preprocessor, a language for expressing type rules, and the variant and aspect manager. In Section 3 we describe the intermediate languages in the UHC pipeline in more detail, illustrated with a running example. In Section 4 the transformations are characterized in more detail. Finally, in Section 5 we draw conclusions about the methodology used, and mention related and future work.

2. Techniques and Tools

2.1 Tree-oriented programming

Using higher order functions on lists, like map, filter and foldr, is a good way to abstract from common patterns in functional programs.

Figure 1. Intermediate languages and transformations in the UHC pipeline, in each of the three operation modes: whole-program analysis (left), bytecode interpreter (middle), and Java (right).

The idea that underlies the definition of foldr, i.e. to capture the pattern of an inductive definition by having a function parameter for each constructor of the data structure, can also be used for other data types, and even for multiple mutually recursive data types. A function that can be expressed in this way was called a catamorphism by Bird, and the collective extra parameters to foldr-like functions an algebra [3, 2]. Thus, ((+), 0) is an algebra for lists, and ((++), [ ]) is another. In fact, every algebra defines a semantics of the data structure. When applying foldr-like functions to the algebra consisting of the original constructor functions, such as ((:), [ ]) for lists, we have the identity function. Such an algebra is said to define the “initial” semantics. Outside circles of functional programmers and category theorists, an algebra is simply known as a “tree walk specification”.

In compiler construction, algebras are very useful in defining a semantics of a syntactic structure or, put bluntly, to define tree walks over the parse tree. That this is not widely done is due to the following problems (a small hand-written sketch after the list illustrates the first few):

1. Unlike lists, for which foldr is standard, in a compiler we deal with custom data structures for abstract syntax of a language, which each need a custom fold function. Moreover, whenever we change the abstract syntax, we need to change the fold function and every algebra.

2. Generated code can be described as a semantics of the language, but often we need more than one alternative semantics: listings, messages, and internal structures (symbol tables etc.). This can be done by having the semantic functions in algebras return tuples, but this makes the program hard to maintain.

3. Data structures for abstract syntax tend to have many alternatives, so algebras end up being clumsy tuples containing dozens of functions.

4. In practice, information not only flows bottom-up in the parse tree, but also top-down. E.g., symbol tables with global definitions need to be distributed to the leaves of the parse tree to be able to evaluate them. This can be done by using higher-order domains for the algebras, but the resulting code becomes even harder to understand.

5. A major portion of the algebra is involved with moving information around. The essence of a semantics usually forms only a small part of the algebra and is obscured by lots of boilerplate.
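To make the first three problems concrete, the following hand-written sketch (hypothetical code, not part of UHC) shows a custom fold and an algebra for a tiny expression type; computing just a value and a listing already forces clumsily tupled results:

-- A tiny expression type with a hand-written fold (problem 1: every
-- abstract syntax type needs its own fold).
data Expr = Const Int | Add Expr Expr

-- An algebra: one function per constructor (problem 3: for realistic
-- abstract syntax this tuple grows to dozens of components).
type ExprAlg r = (Int -> r, r -> r -> r)

foldExpr :: ExprAlg r -> Expr -> r
foldExpr alg@(con, add) e = case e of
  Const n   -> con n
  Add e1 e2 -> add (foldExpr alg e1) (foldExpr alg e2)

-- Two semantics at once (problem 2): value and listing, tupled by hand.
evalAndList :: ExprAlg (Int, String)
evalAndList =
  ( \n -> (n, show n)
  , \(v1, s1) (v2, s2) -> (v1 + v2, "(" ++ s1 ++ "+" ++ s2 ++ ")") )

main :: IO ()
main = print (foldExpr evalAndList (Add (Const 1) (Const 2)))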

Some seek the solution to these problems in the use of monads: the reader monad to pass information down into the tree, the writer monad to move information upwards, and the state monad and its derivatives to accumulate information during the tree walk [20]. Despite the attractiveness of staying inside Haskell, we think this approach is doomed to fail when the algebras to be described get more and more complicated.

To save the nice idea of using an algebra for defining a semantics, we use a preprocessor [34] for Haskell that overcomes the above-mentioned problems. It is not a separate language; we can still use Haskell for writing auxiliary functions, and use all abstraction techniques and libraries available. The preprocessor just allows a few additional constructs, which can be translated into a custom fold function and algebras, or an equivalent more efficient implementation. (If one really wants to avoid a preprocessor, Viera, Swierstra and Swierstra recently described a technique to encode an attribute grammar directly in Haskell while keeping the advantages described below [35].)

We describe the main features of the preprocessor here, and explain why they overcome the five problems mentioned above. The abstract syntax of the language is defined in a data declaration, which is like a Haskell data declaration with named fields, however without the braces and commas. Constructor function names need not be unique between types. As an example, consider a fragment of a typical imperative language:

data Stat
  = Assign  dest  :: String  src   :: Expr
  | While   cond  :: Expr    body  :: Stat
  | Group   elems :: [Stat]

data Expr
  = Const  num  :: Int
  | Var    name :: String
  | Add    left :: Expr    right :: Expr
  | Call   name :: String  args  :: [Expr]

The preprocessor generates corresponding Haskell data declarations (adding braces and commas, and making the constructors unique by prepending the type name, like Expr_Const), and generates a custom fold function. This overcomes problem 1 (except for the part that algebras change when syntax is changed, which will be solved below).

For any desired value we wish to compute over a tree, we can declare a “synthesized attribute”. Possibly more than one data type can have the same attribute. For example, we can declare that both statements and expressions need to synthesize bytecode as well as listings, and that expressions can be evaluated to integer values:

attr Expr Stat  syn bytecode :: [Instr]  syn listing :: String
attr Expr       syn value :: Int

The preprocessor generates semantic functions that return tuples of synthesized attributes, but we can simply refer to attributes by name. This overcomes problem 2. Moreover, if at a later stage we add extra attributes, we do not have to refactor a lot of code.

The value of each attribute needs to be defined for every constructor of every data type which has the attribute. Such definitions are known as “semantic rules”, and start with keyword sem.

sem Expr
  | Const  lhs.value = @num
  | Add    lhs.value = @left.value + @right.value

This states that the synthesized (left hand side) value attribute of a Const expression is just the contents of the num field, and that of an Add-expression can be computed by adding the value attributes of its subtrees. The @-symbol in this context should be read as “attribute”, not to be confused with Haskell “as-patterns”. At the left of the =-symbol, the attribute to be defined is mentioned; at the right, the defining Haskell expression is given. Each definition (or group of definitions) is labeled with a constructor (Const and Add in the example), which in turn is labeled with the datatype (Expr in the example). Vertical bars separate the constructors (and should not be confused with ‘guarded’ equations). The preprocessor collects and orders all definitions in a single algebra, replacing attribute references by suitable selections from the results of the tree walk on the children. This overcomes problem 3.

To be able to pass information downward during a tree walk, we can define “inherited” attributes (the terminology goes back to Knuth [22]). As an example, it can serve to pass down an environment, i.e. a lookup table that associates variables to values, which is needed to evaluate expressions:

type Env = [(String, Int)]

attr Expr inh env :: Env

sem Expr
  | Var  lhs.value = fromJust $ lookup @lhs.env @name

The preprocessor translates inherited attributes into extra parameters for the semantic functions in the algebra. This overcomes problem 4.

In many situations, sem rules only specify that attributes a tree node inherits should be passed unchanged to its children, as in a Reader monad. To scrap the boilerplate expressing this, the preprocessor has a convention that, unless stated otherwise, attributes with the same name are automatically copied. A similar automated copying is done for synthesized attributes passed up the tree, as in a Writer monad. When more than one child offers a synthesized attribute with the required name, we can specify to use an operator to combine several candidates:

attr Expr Stat syn listing use (++) [ ]

which specifies that by default, the synthesized attribute listing is the concatenation of the listings of all children that produce a sub-listing, or the empty list if no child produces one. This overcomes problem 5, and the last bit of problem 1.
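Conceptually, the semantic functions generated from such attribute declarations take the inherited attributes as arguments and return the synthesized attributes as a tuple. A hand-written approximation (hypothetical names, not the code the preprocessor actually emits) for a fragment of the Expr type:

import Data.Maybe (fromJust)

-- Hand-written counterpart of the generated semantics: inherited
-- attributes become extra arguments, synthesized ones a result tuple.
type Env  = [(String, Int)]
data Expr = Const Int | Var String | Add Expr Expr

semExpr :: Expr -> Env -> (Int, String)   -- (value, listing)
semExpr (Const n) _   = (n, show n)
semExpr (Var x)   env = (fromJust (lookup x env), x)
semExpr (Add l r) env =
  let (vl, sl) = semExpr l env   -- copy rule: env is passed on unchanged
      (vr, sr) = semExpr r env
  in (vl + vr, sl ++ "+" ++ sr)

main :: IO ()
main = print (semExpr (Add (Var "x") (Const 2)) [("x", 40)])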

2.2 Rule-oriented programming

Using the attribute-grammar (AG) based preprocessor we can describe the part of a compiler related to tree walks concisely and efficiently. However, this does not give us any means of looking at such an implementation in a more formal setting. We use the domain specific language Ruler for describing the AG part related to the type system.

Although the use of Ruler is currently in flux because we are working on a newer version and therefore are only partially using Ruler for type system descriptions, we demonstrate some of its capabilities because it is our intent to tackle the difficulties involved with type system implementations by generating as much as possible automatically from higher level descriptions.

The idea of Ruler is to generate from a single source both a LaTeX rendering for human use in technical writing:


v fresh
Γ; Ck; v → σk  ⊢e  e1 : σa → σ  ⤳  Cf
Γ; Cf; σa      ⊢e  e2 : _       ⤳  Ca
─────────────────────────────────────  (E.APP^HM)
Γ; Ck; σk      ⊢e  e1 e2 : Ca σ ⤳  Ca

and its corresponding AG implementation:

sem Expr
  | App  (func.gUniq, loc.uniq1)
                     = mkNewLevUID @lhs.gUniq
         func.knTy   = [mkTyVar @uniq1] `mkArrow` @lhs.knTy
         (loc.ty_a, loc.ty)
                     = tyArrowArgRes @func.ty
         arg.knTy    = @ty_a
         loc.ty      = @arg.tyVarMp ⊕ @ty

In this paper we neither further discuss the meaning or intention of the above fragments [9] nor explain Ruler [12] in depth. However, to sketch the underlying ideas we show the Ruler source code required for the above output; we need to define the scheme (or type) of a judgment and populate these with actual rules.

A scheme defines a LaTeX output template (judgeuse tex) with holes to be filled in by rules and a parsing template (judgespec).

scheme expr =
  holes [ node e : Expr, inh valGam : ValGam, inh knTy : Ty
        , thread tyVarMp : C, syn ty : Ty ]
  judgeuse tex  valGam; tyVarMp.inh; knTy ⊢ .."e" e : ty ⤳ tyVarMp.syn
  judgespec     valGam; tyVarMp.inh; knTy ⊢ e : ty ⤳ tyVarMp.syn

The rule for application is then specified by specifying premise judgments (judge above the dash) and a conclusion (below the dash) using the parsing template defined for scheme expr.

rule e.app =
  judge tvarvFresh
  judge expr = tyVarMp.inh; tyVarMp; (v → knTy)
               ⊢ eFun : (ty.a → ty) ⤳ tyVarMp.fun
  judge expr = tyVarMp.fun; valGam; ty.a
               ⊢ eArg : ty.a ⤳ tyVarMp.arg
  -
  judge expr = tyVarMp.inh; valGam; knTy
               ⊢ (eFun eArg) : (tyVarMp.arg ty) ⤳ tyVarMp.arg

For this example no further annotations are required to automatically produce AG code, except for the freshness of a type variable. The judgment tvarvFresh encapsulates this by providing the means to insert some handwritten AG code.

In summary, the basic idea of Ruler is to provide a description resembling the original type rule as much as possible, and then helping the system with annotations to allow the generation of an implementation and a LaTeX rendering.

2.3 Aspect-oriented programming

UHC’s source code is organized into small fragments, each belonging to a particular variant and aspect. A variant represents a step in a sequence of languages, where each step adds some language features, starting with simply typed lambda calculus and ending with UHC. Each step builds on top of the previous one. Independent of its variant, each step adds features in terms of aspects. For example, the type system and code generation are defined as different aspects. UHC’s build system allows for selectively building a compiler for a variant and a set of aspects.

Source code fragments assigned to a variant and aspects are stored in chunked text files. A tool called Shuffle then generates the actual source code when parameterized with the desired variant and aspects. Shuffle is language neutral, so all varieties of implementation languages can be stored in chunked format. For example, the following chunk defines a Haskell wrapper for variant 2 for the construction of a type variable:

%%[(2 hmtyinfer || hmtyast).mkTyVar
mkTyVar :: TyVarId -> Ty
mkTyVar tv = Ty_Var tv
%%]

The notation %%[(2 hmtyinfer || hmtyast).mkTyVar begins a chunk for variant 2 with name mkTyVar for aspect hmtyinfer (Hindley-Milner type inference) or hmtyast (Hindley-Milner type abstract syntax), ended by %%]. Processing by Shuffle then gives:

mkTyVar :: TyVarId → Ty
mkTyVar tv = Ty_Var tv

The subsequent variant 3 requires a more elaborate encoding of a type variable (we do not discuss this further). The wrapper must be redefined, which we achieve by explicitly overriding 2.mkTyVar by a chunk for 3.mkTyVar:

%%[(3 hmtyinfer || hmtyast).mkTyVar -2.mkTyVar
mkTyVar :: TyVarId -> Ty
mkTyVar tv = Ty_Var tv TyVarCateg_Plain
%%]

Although the type signature could be factored out, we refrain from doing so for small definitions.

Chunked sources are organized on a per file basis. Each chunked file of UHC source code is processed by Shuffle to yield a corresponding file for further processing, depending on the language used. For chunked Haskell a single module is generated; chunked AG may be combined with other AG files by the AG compiler.

The AG compiler itself also supports a notion of aspects, different from Shuffle’s idea of aspects in that it allows definitions for attributes and abstract syntax to be defined independent of file and position in a file. Attribute definitions and attribute equations thus can be grouped according to the programmer’s sense of what should be together; the AG compiler combines all these definitions and generates corresponding Haskell code.

Finally, chunked files may be combined by Shuffle by means of explicit reference to the name of a chunk. This also gives a form of literate programming [23], where text is generated by explicitly combining smaller text chunks. For example, the above code for 2.mkTyVar and 3.mkTyVar is extracted from the chunked source code of UHC and combined with the text for this explanation by Shuffle.

3. Languages

The compiler translates a Haskell program to executable code by applying many small transformations. In the process, the program is represented using five different data structures, or languages. Some transformations map one of these languages to the next, some are transformations within one language. Together, the five languages span a spectrum from a full-featured, lazy functional language to a limited, low level imperative language.


3.1 The Haskell Language

The Haskell language (HS) closely follows Haskell’s concrete syntax. A combinator-based, error-correcting parser parses the source text and generates an HS parse tree. It consists of numerous datatypes, some of which have many constructors. A Module consists of a name, exports, and declarations. Declarations can be varied: function bindings, pattern bindings, type signatures, data types, newtypes, type synonyms, class, instance... Function bindings involve a right hand side which is either an expression or a list of guarded expressions. An expression, in turn, has no fewer than 29 alternatives. All in all, the description of the context-free grammar consists of about 1000 lines of code.

We maintain sufficient information in the abstract syntax tree to reconstruct the original input, including layout and superfluous parentheses, with only the comments removed.

When processing HS we deal with the following tasks:

• Name resolution: Checking for properly introduced names and renaming all identifiers to the equivalent fully qualified names.

• Operator fixity and precedence: Expressions are parsed without taking into account the fixity and precedence of operators. Expressions are rewritten to remedy this.

• Name dependency: Definitions are reordered into different let bindings such that all identifier uses come after their definition. Mutually recursive definitions are put into one letrec binding.

• Definition gathering: Multiple definitions for the same identifier are merged into one.

• Desugaring: List comprehensions, do-notation, etc. are desugared.
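As an illustration of the desugaring task (one possible translation, written out by hand; not necessarily the exact code UHC produces), a list comprehension can be rewritten in terms of concatMap:

-- Source-level comprehension ...
doubled :: [Int] -> [Int]
doubled xs = [x * 2 | x <- xs, even x]

-- ... and a desugared equivalent using concatMap and if-then-else.
doubled' :: [Int] -> [Int]
doubled' xs = concatMap (\x -> if even x then [x * 2] else []) xs

main :: IO ()
main = print (doubled [1 .. 6] == doubled' [1 .. 6])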

In the remainder of this section on languages we use the following running example program to show how the various intermediate languages are used:

module M where

len :: [a] → Int
len [ ]      = 0
len (x : xs) = 1 + len xs

main = putStr (show (len (replicate 4 ’x’)))

3.2 The Essential Haskell Language

HS processing generates Essential Haskell (EH). The EH equivalent of the running example is shown below. Some details have been omitted and replaced by dots.

let  M.len :: [a] → Int
     M.len = λx1 → case x1 of
                     UHC.Prelude.[]       → UHC.Prelude.fromInteger 0
                     (UHC.Prelude.: x xs) → ...
in   let  M.main = UHC.Prelude.putStr ...
in   let  main :: UHC.Prelude.IO ...
          main = UHC.Prelude.ehcRunMain M.main
in   main

In contrast to the HS language, the EH language brings back the language to its essence, removing as much syntactic sugar as possible. An EH module consists of a single expression only, which is the body of the main function, with local let-bindings for the other top-level values.

Processing EH deals with the following tasks:

• Type system: Type analysis is done; types are erased when Core is generated. Type analysis can be done unhindered by syntactic sugar; error messages refer to the original source location but cannot reconstruct the original textual context anymore.

• Evaluation: Enforcing evaluation is made explicit by means of a let! Core construct.

• Recursion: Recursion is made explicit by means of a letrec Core construct.

• Type classes: All evidence for type class predicates is transformed to explicit dictionary parameters (see the sketch after this list).

• Patterns: Patterns are transformed to their more basic equivalent, inspecting one constructor at a time, etc.
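To illustrate the dictionary translation (a schematic Haskell-level sketch with hypothetical names, not actual UHC output), an overloaded function ends up receiving its class evidence as an ordinary data value:

-- A hypothetical dictionary type for a one-method Eq-like class.
data EqD a = EqD { eqD :: a -> a -> Bool }

-- Overloaded source function:   member :: Eq a => a -> [a] -> Bool
-- After the translation, the class predicate is an explicit parameter.
member :: EqD a -> a -> [a] -> Bool
member d x = any (eqD d x)

eqDInt :: EqD Int
eqDInt = EqD (==)

main :: IO ()
main = print (member eqDInt (3 :: Int) [1, 2, 3])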

3.3 The Core Language

The Core language is basically the same as lambda-calculus. The Core equivalent of the running example program is:

module M =
  letrec
    { M.len = λM.x1_1 →
        let! { _2 = M.x1_1 } in
        case _2 of
          { C:  {..., ...} → ...
          ; C[] { }        →
              let { _3 = (UHC.Prelude.packedStringToInteger)
                           (#String "0") } in
              let { _4 = (UHC.Prelude.fromInteger)
                           (UHC.Prelude._d1_Num_:_DICT)
                           (_3) } in
              _4
          }
    }
  in ...

A Core module, apart from its name, consists of nothing more than an expression, which can be thought of as the body of main:

data CModule
  = Mod  nm :: Name  expr :: CExpr

An expression resembles an expression in lambda calculus. We have constants, variables, and lambda abstractions and applications of one argument:

data CExpr
  = Int     int  :: Int
  | Char    char :: Char
  | String  str  :: String
  | Var     name :: Name
  | Tup     tag  :: Tag
  | Lam     arg  :: Name   body :: CExpr
  | App     func :: CExpr  arg  :: CExpr

Alternative Tup encodes a constructor, to be used with App to construct actual data alternatives or tuples. The Tag of a Tup encodes the Int tag, arity, and other information.
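For instance (an illustrative sketch using simplified stand-ins for the real types, not the actual UHC definitions), a saturated constructor application such as Just 3 is built by applying a Tup node to its argument:

-- Simplified stand-ins, just to show the shape of the encoding.
type Name  = String
data Tag   = Tag { tagInt :: Int, tagArity :: Int, tagName :: Name } deriving Show
data CExpr = Int Int | Var Name | Tup Tag | App CExpr CExpr deriving Show

-- 'Just 3' in this simplified Core: a constructor node applied to one argument.
just3 :: CExpr
just3 = App (Tup (Tag 1 1 "Just")) (Int 3)

main :: IO ()
main = print just3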


Furthermore, there is case distinction and local binding:

  | Case  expr  :: CExpr  alts  :: [CAlt]   dflt :: CExpr
  | Let   categ :: Categ  binds :: [CBind]  body :: CExpr

The categ of a Let describes whether the binding is recursive, strict, or plain. These two constructs use the auxiliary notions of alternative and binding:

data CAlt
  = Alt  pat :: CPat  expr :: CExpr

data CBind
  = Bind  name :: Name  expr :: CExpr
  | FFI   name :: Name  imp  :: String  ty :: Ty

A pattern introduces bindings, either directly or as a field of a constructor:

data CPat
  = Var       name :: Name
  | Con       name :: Name  tag :: Tag  binds :: [CPatBind]
  | BoolExpr  name :: Name  cexpr :: CExpr

data CPatBind
  = Bind  offset :: Int  pat :: CPat

The actual Core language is more complex because of:

• Experiments with extensible records; we omit this part as extensible records are currently not supported in UHC.

• Core generation is partly non syntax directed because context reduction determines which dictionaries are to be used for class predicates. The syntax directed part of Core generation therefore leaves holes, later to be filled in with the results of context reduction; this is a mechanism similar to type variables representing yet unknown types.

• An annotation mechanism is used to propagate information about dictionary values. This mechanism is somewhat ad hoc and we expect it to be changed when more analyses are done in earlier stages of the compiler.

3.4 The Grin Language

The Grin equivalent of the running example program is:

module M
  { M.len M.x1_1 =
      { eval M.x1_1; λ_2 →
        case _2 of
          { C/:  → {...}
          ; C/[] →
              { store (C/UHC.Prelude.PackedString "0");                  λ_6 →
                store (F/UHC.Prelude.packedStringToInteger _6);          λ_3 →
                store (P/0/UHC.Prelude.fromInteger UHC.Prelude._d1_Num); λ_5 →
                store (A/apply _5 _3);                                   λ_4 →
                eval _4 }
          }
      }
  }

A Grin module consists of its name, global variables with their initializations, and bindings of function names with parameters to their bodies.

data GrModule
  = Mod  nm :: Name  globals :: [GrGlobal]  binds :: [GrBind]

data GrGlobal
  = Glob  nm :: Name  val :: GrVal

data GrBind
  = Bind  nm :: Name  args :: [Name]  body :: GrExpr

Values manipulated in the Grin language are varied: we have nodes (think: heap records) consisting of a tag and a list of fields, stand-alone tags, literal ints and strings, pointers to nodes, and ‘empty’. Some of these are directly representable in the language (nodes, tags, literal ints and strings):

data GrVal
  = LitInt  int :: Int
  | LitStr  str :: String
  | Tag     tag :: GrTag
  | Node    tag :: GrTag  flds :: [GrVal]

Pointers to nodes are also values, but they have no direct denotation. On the other hand, variables ranging over values are not values themselves, but for syntactic convenience we do add the notion of a ‘variable’ to the GrVal data type:

| Var name :: Name

The tag of a node describes its role. It can be a constructor of a datatype (Con), a function of which the call is deferred because of lazy evaluation (Fun), a function that is partially applied but still needs more arguments (PApp), or a deferred application of an unknown function (appearing as the first field of the node) to a list of arguments (App).

data GrTag
  = Con   name :: Name
  | Fun   name :: Name
  | PApp  needs :: Int  name :: Name
  | App   applyfn :: Name

The four tag types are represented as C, F, P and A in the example program above.

The body of a function denotes the calculation of a value, which is represented in a program by an ‘expression’. Expressions can be combined in a monadic style. Thus we have Unit for describing a computation immediately returning a value, and Seq for binding a computation to a variable (or rather a lambda pattern), to be used subsequently in another computation:

data GrExpr
  = Unit  val :: GrVal
  | Seq   expr :: GrExpr  pat :: GrPatLam  body :: GrExpr

There are some primitive computations (that is, constants in the monad): one for storing a node value (returning a pointer value), and two for fetching a node previously stored, and for fetching one field thereof:

  | Store       val :: GrVal
  | FetchNode   name :: Name
  | FetchField  name :: Name  offset :: Int

Other primitive computations call Grin and foreign functions, respectively. The name mentioned is that of a known function (i.e., there are no function variables) and the argument list should fully saturate it:

  | Call  name :: Name    args :: [GrVal]
  | FFI   name :: String  args :: [GrVal]

Two special primitive computations are provided for evaluating a node that may contain a Fun tag, and for applying a node that must contain a PApp tag (a partially applied function) to further arguments:

  | Eval  name :: Name
  | App   name :: Name  args :: [GrVal]


Next, there is a computation for selecting a matching alternative, given the name of the variable containing a node pointer:

| Case val :: GrVal alts :: [GrAlt ]

Finally, we need a primitive computation to express the need of ‘updating’ a variable after it is evaluated. Boquist proposed an Update expression for the purpose which has a side effect only and an ‘empty’ result value [7]. We observed that the need for updates is always next to either a FetchNode or a Unit, and found it more practical and more efficient to introduce two update primitives:

  | FetchUpdate  src :: Name   dst :: Name
  | UpdateUnit   name :: Name  val :: GrVal

Auxiliary data structures are the one describing a single alternative in a Case expression:

data GrAlt
  = Alt  pat :: GrPatAlt  expr :: GrExpr

and the ones for two kinds of patterns, occurring in a Seq expression and in an Alt alternative, respectively. A simplified version of these is the following, but in reality we have more pattern forms.

data GrPatLam
  = Var  name :: Name

data GrPatAlt
  = Node  tag :: GrTag  args :: [Name]

4. Transformations

An UHC architecture principle is that the program is transformed in many small steps, each performing an isolated task. Even when multiple steps could have been combined, we prefer the simplicity of doing one task at a time. The Attribute Grammar preprocessor makes the definition of a tree walk easy, and the runtime overhead for the additional passes is modest.

Currently we have 12 transformations on the Core language, 24 on the Grin language, and 4 on the Silly language. Some of them are applied more than once, so the total number of transformations a program undergoes is even larger. In this section we give a short description of all transformations. Of course, this is just a snapshot of the current situation: the very fact that the steps are isolated and identified enables us to move them around while developing the compiler. Yet, the description of the transformations gives an idea of the granularity of the steps, and as a whole gives an overview of techniques employed.

4.1 Core Transformations

Three major gaps have to be bridged in the transformation from Core to Grin. Firstly, where Core has a lazy semantics, in Grin the deferring of function calls and their later evaluation is explicitly encoded. Secondly, in Core we can have local function definitions, whereas in Grin all function definitions are at top level. Grin does have a mechanism for local, explicitly sequenced variable bindings. Thirdly, whereas Core functions always have one argument, in Grin functions can have multiple parameters, but they take them all at the same time. Therefore a mechanism for partial parametrization is necessary. The end result is lambda lifted Core, that is, the floating of lambda-expressions to the top level and the passing of non-global variables explicitly as parameters.

Core has one construct, let!, for enforcing evaluation to WHNF independent of other Core language constructs. This makes the implementation of seq easier but burdens Core transformations with the need not to cross an ‘evaluation boundary’ when moving code around.

The Core transformations listed below also perform some trivial cleanup and optimizations, because we avoid burdening the Core generation from EH with such aspects.

1. EtaReduction Performs restricted η-reduction, that is, replaces expressions like λx y → f x y with f, with the restriction that arity is not changed. Such expressions are introduced by coercions which (after context reduction) turn out not to coerce anything at all.

2. RenameUnique Renames variables such that all variables are globally unique.

3. LetUnrec Replaces mutually recursive bindings

   letrec {v1 = ..; v2 = ..} in ..

   which actually are not mutually recursive by plain bindings

   let v1 = .. in let v2 = .. in ..

   Such bindings are introduced because some bindings are potentially mutually recursive, in particular groups of dictionaries.

4. InlineLetAlias Inlines let bindings for variables and constants.

5. ElimTrivApp Eliminates application of the id function.

6. ConstProp Performs addition of int constants at compile time.

7. ANormal Complex expressions like

f (g a) (h b)

are broken up into a sequence of bindings and simpler expressions

let v1 = g a in let v2 = h b in f v1 v2

which only have variable references as their subexpressions.

8. LamGlobalAsArg Passes global variables of let-bound lambda-expressions as explicit parameters, as a preparation for lambda lifting.

9. CAFGlobalAsArg Similar for let-bound constant applicative forms (CAFs).

10. FloatToGlobal Performs ‘lambda lifting’: moves bindings of lambda-expressions and CAFs to the global level (a small source-level sketch follows this list).

11. LiftDictFields Makes sure that all dictionary fields exist as a top-level binding.

12. FindNullaries Finds nullary (parameterless) functions f and inserts another definition f′ = f, where f′ is annotated in such a way that it will end up as an updateable global variable.
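As a source-level sketch of what steps 8 and 10 amount to (hypothetical before/after code, written in Haskell rather than Core):

-- Before: a let-bound lambda that captures the variable 'n' from its context.
addAll :: Int -> [Int] -> [Int]
addAll n xs = let add1 = \x -> x + n
              in  map add1 xs

-- After LamGlobalAsArg and FloatToGlobal: 'n' is passed explicitly and the
-- lambda has been floated to the top level.
add1' :: Int -> Int -> Int
add1' n x = x + n

addAll' :: Int -> [Int] -> [Int]
addAll' n xs = map (add1' n) xs

main :: IO ()
main = print (addAll 10 [1, 2, 3] == addAll' 10 [1, 2, 3])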

After the transformations, translation to Grin is performed, where the following issues are addressed:

• for Let-expressions: global expressions are collected and made into Grin function bindings; local non-recursive expressions are sequenced by Grin Seq-expressions; for local recursive let-bindings a Sequence is created which starts out to bind a new variable to a ‘black hole’ node, then processes the body, and finally generates a FetchUpdate-expression for the introduced variable.

• for Case-expressions: an explicit Eval-expression for the scrutinee is generated, in Sequence with a Grin Case-expression.

• for App-expressions: it is determined what it is that is applied:

if it is a constructor, then a node with Con tag is returned;

if it is a lambda of known arity which has exactly the right number of arguments, then either a Call-expression is generated (in strict contexts) or a node with Fun tag is stored with a Store-expression (in lazy contexts);

if it is a lambda of known arity that is undersaturated (has not enough arguments), then a node with PApp tag is returned (in strict contexts) or Stored (in lazy contexts);

if it is a lambda of known arity that is oversaturated (has too many arguments), then (in strict contexts) first a Call-expression to the function is generated that applies the function to some of the arguments, and the result is bound to a variable that is subSequently Applied to the remaining arguments; or (in non-strict contexts) a node with Fun tag is Stored, and bound to a variable that is used in another node which has an App tag.

if it is a variable that represents a function of unknown arity, then (in strict contexts) the variable is explicitly Evaluated, and its result used in an App expression to the arguments; or (in non-strict contexts) as a last resort, both function variable and arguments are stored in a node with App tag.

• for global bindings: lambda abstractions are ‘peeled off’ the body, to become the arguments of a Grin function binding.

• for foreign function bindings: functions with IO result type are treated specially.

We have now reached the point in the compilation pipeline where we perform our whole-program analysis. The Core module of the program under compilation is merged with the Core modules of all used libraries. The resulting big Core module is then translated to Grin.

4.2 Grin Transformations

In the Grin world, we take the opportunity to perform many optimizing transformations. Other transformations are designed to move from graph manipulation concepts (complete nodes that can be ‘fetched’, ‘evaluated’ and pattern matched for) to a lower level where single word values are moved and inspected in the imperative target language.

We first list all transformations in the order they are performed, and then discuss some issues that are tackled with the combined effort of multiple transformations.

1. DropUnreachableBindings Drops all functions not reachable from main, either through direct calls, or through nodes that store a deferred or partially applied function. The transformation performs a provisional numbering of all functions, and creates a graph of dependencies. A standard graph reachability algorithm determines which functions are reachable from main; the others are dropped. This transformation is done first, because it drastically reduces program size: all unused functions from included libraries are removed.

2. MergeInstance Introduces an explicit dictionary for each instance declaration, by merging the default definitions of functions taken from class declarations. This is possible because we have the whole program available now (see discussion below).

3. MemberSelect Looks for the selection of a function from a dictionary and its subsequent application to parameters. Replaces that by a direct call.

4. DropUnreachableBindings (again) Drops the now obsolete implicit constructions of dictionaries.

5. Cleanup Replaces some node tags by equivalent ones: PApp 0, a partial application needing 0 more parameters, is changed into Fun, a simple deferred function; deferred applications of constructor functions are changed to immediate application of the constructor function.

6. SimpleNullary Optimises nullary functions that immediately return a value or call another function by inlining them in nodes that encode their deferred application.

7. ConstInt Replaces deferred applications of integer2int to constant integers by a constant int. This situation occurs for every numeric literal in an Int context in the source program, because of the way literals are overloaded in Haskell.

8. BuildAppBindings Introduces bindings for apply functions with as many parameters as are needed in the program.

9. GlobalConstants Introduces global variables for each constant found in the program, instead of allocating the constants locally.

10. Inline Inlines functions that are used only once at their call site.

11. SingleCase Replaces case expressions that have a single alternative by the body of that alternative.

12. EvalStored Does not do Eval on pointers that bind the result of a previous Store. Instead, does a Call if the stored node is a deferred call (with a Fun tag), or a Unit of the stored node for other nodes.

13. ApplyUnited Does not perform Apply on variables that bind the result of a previous Unit of a node with a PApp tag. Instead, does a Call of the function if it is now saturated, or builds a new PApp node if it is undersaturated.

14. SpecConst Specializes functions that are called with a constant argument. The transformation is useful for creating a specialized ‘increment’ function instead of plus 1, but its main merit lies in making specialized versions of overloaded functions, that is, functions that take a dictionary argument. If the dictionary is a constant, specialization exposes new opportunities for the MemberSelect transformation, which is why SpecConst is iterated in conjunction with EvalStored, ApplyUnited and MemberSelect (a source-level sketch follows this list).

15. DropUnreachableBindings Drops unspecialized functions that may have become obsolete.

16. NumberIdents Attaches a unique number to each variable and function name.

17. HeapPointsTo Does a ‘heap points to analysis’ (HPT), which is an abstract interpretation of the program in order to determine the possible tags of the nodes that each variable can refer to.

18. InlineEA Replaces all occurrences of Eval and App by equivalent constructs. Each Eval x is replaced by FetchNode x, followed by a Case distinction on all possible tag values of the node referred to by x, which was revealed by the HPT analysis. If the number of cases is prohibitively large, we resort to a Call to a generic evaluate function, that is generated for the purpose and that distinguishes all possible node tags. Each App f x construct, that is used to apply an unknown function f to argument x, is replaced by a Case distinction on all possible PApp tag values of the node referred to by f.

19. ImpossibleCase Removes alternatives from Case constructs that, according to the HPT analysis, can never occur.

20. LateInline Inlines functions that are used only once at their call site. New opportunities for this transformation are present because the InlineEA transformation introduces new Call constructs.

21. SingleCase (again) Replaces case expressions that have a single alternative by the body of that alternative. New opportunities for this transformation are present because the InlineEA transformation introduces new Case constructs.


22. DropUnusedExpr Removes bindings to variables if the variable is never used, but only when the expression has no side effect. Therefore, an analysis is done to determine which expressions may have side effects. Update and FFI expressions are assumed to have side effects, and Case and Seq expressions if one of their children has them. The tricky one is Call, which has a side effect if its body does. This is a circular definition of ‘has a side effect’ if the function is recursive. Thus we take a 2-pass approach: a ‘coarse’ approximation that assumes that every Call has a side effect, and a ‘fine’ approximation that takes into account the coarse approximation for the body. Variables that are never used but which are retained because of the possible side effects of their bodies are replaced by wildcards.

23. MergeCase Merges two adjacent Case constructs into a single one in some situations.

24. LowerGrin Translates to a lower level version of Grin, in which variables never represent a node. Instead, variables are introduced for the separate fields, of which the number became known through HPT analysis. Also, after this transformation Case constructs scrutinize tags rather than full nodes.

25. CopyPropagation Shortcuts repeated copying of variables.

26. SplitFetch Translates to an even lower level version of Grin, in which the node referred to by a pointer is not fetched as a whole, but field by field. That is, the FetchNode expression is replaced by a series of FetchField expressions. The first of these fetches the tag; the others are specialized in the alternatives of the Case expression that always follows a FetchNode expression, such that no more fields are fetched than required by the tag of each alternative.

27. DropUnusedExpr (again) Removes variable bindings introduced by LowerGrin if they happen not to be used.

28. CopyPropagation Again shortcuts repeated copying of variables.
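As a source-level sketch of the effect of SpecConst (step 14; hypothetical Haskell code, whereas the real transformation works on Grin):

-- Before: a call that always receives the constant argument 1.
plus :: Int -> Int -> Int
plus x y = x + y

countUp :: [Int] -> [Int]
countUp = map (plus 1)

-- After SpecConst: a specialized copy with the constant baked in.
plus_1 :: Int -> Int
plus_1 y = 1 + y

countUp' :: [Int] -> [Int]
countUp' = map plus_1

main :: IO ()
main = print (countUp [1, 2, 3] == countUp' [1, 2, 3])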

Simplification The Grin language has constructs for manipulating heap nodes, including ones that encode deferred function calls, that are explicitly triggered by an Eval expression. As part of the simplification, this high level construct should be decomposed into smaller steps. Two strategies can be used:

• tagged: nodes are tagged by small numbers, evaluation is performed by calling a special evaluate function that scrutinizes the tag, and for each possible Fun tag calls the corresponding function and updates the thunk;

• tagless: nodes are tagged by pointers to code that does the call and update operations, thus evaluation is tantamount to just jumping to the code pointed to by the tag.

The tagged approach has overhead in calling evaluate, but the tagless approach has the disadvantage that the indirect jump involved may stall the lookahead buffer of pipelined processors. Boquist proposed to inline the evaluate function at every occurrence of Eval, where for every instance the Case expression involved only contains those cases which can actually occur. It is this approach that we take in UHC.

This way, the high level concept of Eval is replaced by the lower level concepts of FetchNode, Case, Call and Update. In turn, each FetchNode expression is replaced by a series of FetchField expressions in a later transformation, and the Case that scrutinizes a node is replaced by one that scrutinizes the tag only.
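A toy Haskell model of the difference between a generic evaluate and an inlined, HPT-pruned case (hypothetical types and tags; the real transformation operates on Grin nodes):

-- Possible node shapes, i.e. tags, of a thunk in this toy model.
data Node = ConInt Int | FunLen [Int] | FunInc Int
  deriving Show

-- The 'tagged' strategy: one generic evaluate scrutinizing every tag.
evaluate :: Node -> Node
evaluate n = case n of
  ConInt i  -> ConInt i
  FunLen xs -> ConInt (length xs)
  FunInc i  -> ConInt (i + 1)

-- InlineEA-style inlining: if the HPT analysis proves a particular pointer
-- can only carry the ConInt or FunLen tags, the inlined case omits the rest.
evalLenOrVal :: Node -> Node
evalLenOrVal n = case n of
  ConInt i  -> ConInt i
  FunLen xs -> ConInt (length xs)
  _         -> error "impossible according to the HPT analysis"

main :: IO ()
main = print (evaluate (FunInc 41), evalLenOrVal (FunLen [1, 2, 3]))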

Abstract interpretation The desire to inline a specialized version of evaluate at every Eval instance brings the need for an analysis that, for each pointer variable, determines the possible tags of the node. An abstract interpretation of the program, known as ‘heap points to (HPT) analysis’, tries to approximate this knowledge. As preparation, the program is scanned to collect constraints on variables. Some constraints immediately provide the information needed (e.g., the variable that binds the result of a Store expression is obviously a pointer to a node with the tag of the node that was stored), but other constraints are indirect (e.g., the variable that binds the result of a Call expression will have the same value as the called function returns). The analysis is essentially a whole-program analysis, as actual parameters of functions impose constraints on the parameters.

The constraint set is solved in a fixpoint iteration, which processes the indirect constraints based on information gathered thus far. In order to have fast access to the mapping that records the abstract value for each variable, we uniquely number all variables, and use mutable arrays to store the mapping.

The processing of the constraint that expresses that x binds the result of Eval p deserves special attention. If p is already known to point to nodes with a Con tag (i.e., values) then this is also a possible value for x. If p is known to point to nodes with a Fun f tag (i.e., deferred functions), then the possible results for f are also possible values for x. And if p is known to point to nodes with an App apply tag (i.e., generic applications of unknown functions by apply), then the possible results for apply are also possible values for x. For a more detailed description of the algorithm, we refer to another paper [14].
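As a toy model of such a constraint-solving fixpoint (much simplified and with hypothetical constraint forms; UHC numbers all variables and uses mutable arrays rather than maps):

import qualified Data.Map as M
import qualified Data.Set as S

-- Abstract value of a variable: the set of node tags it may refer to.
type AbsVal = S.Set String
type AbsEnv = M.Map String AbsVal

-- Direct constraints ("x may hold tag t") and indirect ones
-- ("everything y may hold, x may hold too"), e.g. arising from calls.
data Constraint
  = IsTag   String String   -- variable, tag
  | FlowsTo String String   -- source variable, destination variable

step :: [Constraint] -> AbsEnv -> AbsEnv
step cs env = foldr apply env cs
  where
    apply (IsTag x t)   = M.insertWith S.union x (S.singleton t)
    apply (FlowsTo y x) = M.insertWith S.union x (M.findWithDefault S.empty y env)

-- Iterate until nothing changes anymore: the abstract values only grow,
-- and there are finitely many tags, so a fixpoint is reached.
solve :: [Constraint] -> AbsEnv
solve cs = go M.empty
  where
    go env = let env' = step cs env
             in  if env' == env then env else go env'

main :: IO ()
main = print (solve [IsTag "p" "Fun/len", FlowsTo "p" "x", FlowsTo "x" "r"])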

HPT performance The HPT analysis must at least find all possible tags for each pointer, but it is sound if it reports a superset of these. The design of the HPT analysis is a tradeoff between time (the number of iterations it takes to find the fixed point) and accuracy. A trivial solution is to report (in 1 step) that every pointer may point to every tag; a perfect solution would solve the halting problem and thus would take infinite time in some situations.

We found that the number of iterations our implementation takes depends on two factors: the depth of the call graph (usually bounded by a dozen or so in practice), and the length of static data structures in the program. The latter surprised us, but is understandable if one considers the program

main = putStrLn (show (last [id , id , id , id , succ ] 1))

where it takes 5 iterations to find out that 1 is a possible parameter of succ.

As for accuracy, our HPT algorithm works well for first-order functions. In the presence of many higher-order functions, the results suffer from ‘pollution’: the use of a higher-order function in one context also influences its result in another context. We counter this undesired behavior in several ways:

• instead of using a generic apply function, the BuildAppBindings transformation makes a fresh copy for each use by an App tag. This prevents mutual pollution of apply results, and also increases the probability that the apply function can be inlined later;

• we specialize overloaded functions for every dictionary they are used with, to avoid the App needed on the unknown function taken from the dictionary;

• we fall back on explicitly calling evaluate (instead of inlining it) in situations where the number of possible tags is unreasonably large.

Instance declarations The basic idea of implementing instances is simple: an instance is a tuple (known as a ‘dictionary’) containing all member functions, which is passed as an additional parameter to overloaded functions. Things are complicated, however, by the presence of default implementations in classes: the dictionary for an instance declaration is a merge of the default implementations and the implementations in the instance declaration. Worse, the class declaration may reside in another module than the instance declaration, and still be mutually dependent with it. Think of the Eq class, having mutually circular definitions of eq and ne, leaving it to the instance declaration to implement either one of them (or both).

A clever scheme was designed by Faxen to generate the dictionary from a generator function that is parameterized by the dictionary containing the default implementations, while the default dictionary is generated from a generator function parameterized by the instance dictionary [13]. Lazy evaluation and black holes make this all work, and we employ this scheme in UHC too. It would be a waste, however, now that we are in a whole-program analysis situation, not to try to do as much work as possible at compile time.

Firstly, we have to merge the default and instance dictionaries. In the Grin world, we have to deal with what the Core2Grin transformation makes of the Faxen scheme. That is (a Haskell-level sketch of the generator scheme follows the list below):

• A 1-ary generator function gfd that, given a default dictionary, will generate the dictionary;

• A 0-ary function fd that binds a variable to a black hole, calls gfd, and returns the result;

• A global variable d which is bound to a node with tag Fun fd .
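To make the generator scheme itself concrete, a hypothetical Haskell-level sketch (an Eq-like class whose two members each have a default in terms of the other; not UHC’s actual encoding):

-- A dictionary type for a hypothetical Eq-like class.
data EqD a = EqD { eq :: a -> a -> Bool, ne :: a -> a -> Bool }

-- Generator for the default dictionary, parameterized by the instance
-- dictionary: each default is defined in terms of the other member.
defaultEqGen :: EqD a -> EqD a
defaultEqGen inst = EqD { eq = \x y -> not (ne inst x y)
                        , ne = \x y -> not (eq inst x y) }

-- Generator for an 'Eq Int' instance that only defines eq; ne is taken
-- from the default dictionary handed to it.
eqIntGen :: EqD Int -> EqD Int
eqIntGen defs = defs { eq = (==) }

-- Lazily tying the knot between the two generators yields the dictionary.
dEqInt :: EqD Int
dEqInt = eqIntGen (defaultEqGen dEqInt)

main :: IO ()
main = print (eq dEqInt 3 3, ne dEqInt 3 4)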

We want to change this in a situation where d is bound directly to the dictionary node. This involves reverse engineering the definitions of d, fd and gfd to find the actual member function names buried deep in the definition of gfd. Although possible, this is very fragile as it depends on the details of the Core2Grin translation. Instead, we take a different approach: the definition of fd is annotated with the names of the member functions at the time when they are still explicitly available, that is during the EH2Core translation. Similarly, class definitions are annotated with the names of the default functions. Now the Grin.MergeInstance transformation can easily collect the required dictionary fields, provided that the Core.LiftDictFields transformation ensures they are available as top-level functions. The fd and gfd functions are obsolete afterwards, and can be discarded by a later reachability analysis.

Secondly, we hunt the program for dictionaries d (as constructed above) and selection functions sk (easily recognizable as a function that pattern-matches its parameter to a dictionary structure and returns its kth field xk). In such situations Call sk d can be replaced by Eval xk. A deferred member selection, involving a node with tag Fun sk and field d, is dealt with similarly: both are done by the MemberSelect transformation.

Thirdly, as xk is a dictionary field, it is a known node n. If n has a Fun f tag, then Eval xk can be replaced by Call f, and otherwise it can be replaced by Unit n. This is done by the EvalStored transformation. The new Unit that is exposed by this transformation can be combined with the App expression that idiomatically follows the member selection, which is what ApplyUnited does.

All of this only works when members are selected from a constant dictionary. Overloaded functions however operate on dictionaries that are passed as parameter, and member selection from a variable dictionary is not caught by MemberSelect. The constant dictionary appears where the overloaded function is called, and can be brought to the position where it is needed by specializing functions when they are called with constant arguments. This is done in the SpecConst transformation. That transformation is not only useful in the chain of transformations that together remove the dictionaries, but also for the removal of other constant arguments, giving e.g. a 1-ary successor function as a specialization of plus 1. (If constant specialization is also done for string constants, we get many specializations of putStrLn.)

The whole pack of transformations is applied repeatedly, as applying them exposes new opportunities for sub-dictionaries. Four iterations suffice to deal with the common cases (involving Eq, Ord, Integral, Read etc.) from the prelude.

The only situation where dictionaries cannot be eliminated completely is where an infinite family of dictionaries is necessary, such as arises from the Eq a ⇒ Eq [a] instance declaration in the prelude. We then automatically fall back to the Faxen scheme.

4.3 Silly Transformations

1. InlineExpr Avoids copying variables to other variables, if in all uses the original one could be used just as well (i.e., it is not modified in between).

2. ElimUnused Eliminates assignments to variables that are never used.

3. EmbedVars Silly has a notion of function arguments and local variables. After this transformation, these kinds of variables are not used anymore, but replaced by explicit stack offsets. So, this transformation does the mapping of variables to stack positions (and, if available, registers). In a tail call, the parameters of the function that is called overwrite the parameters and local variables of the function that does the call. The assignments are scheduled in such a way that no values are overridden that are still needed in assignments to follow.

4. GroupAllocs Combines separate, adjacent calls to malloc into one, so that the heap overflow check needs to be done only once for all the memory that is allocated in a particular function.

5. Conclusion

5.1 Code size

UHC is the standard materialization of a more general code base (the UHC framework, formerly known as EHC), from which increasingly powerful ‘variants’ of the compiler can be drawn, where independent experimental ‘aspects’ can be switched on or off. The whole source code base consists of fairly exactly 100,000 lines of code. Just over half of it is Attribute Grammar code, which of course has lots of embedded Haskell code in it. One third of the code base is plain Haskell (mostly for utility functions, the compiler driver, and the type inferencer), and one sixth is C (for the runtime system and a garbage collector).

In Figure 2 the breakdown of code size over various subsystems in the pipeline is shown. All numbers are in kilo-lines-of-code, but because of the total of 100,000 lines they can also be interpreted as percentages. Column ‘UHC only’ shows the size of the code that is selected by Shuffle for the standard compiler, i.e. the most powerful variant without experimental aspects. On average, 60% of the total code base is used in UHC. The rest is either code for low variants which is overwritten in higher variants, code for experimental aspects that are switched off in UHC, chunk header overhead, or comments that were placed outside chunks.

The fraction of code used for UHC is relatively low in the type inferencer (as there are many experimental aspects here), in the experimental backends like Java, Cil and LLVM (as most of them are switched off), and in the garbage collector (as it is not yet used: UHC by default uses the Boehm garbage collector [5, 4]).


subsystem        | All variants and aspects      | UHC only
                 |  AG     HS     C     total    | total   fract.
-----------------+-------------------------------+----------------
utility/general  |  1.7   18.3          20.0     | 14.0    70%
Haskell          |  6.7    3.3           9.9     |  6.9    70%
EH               | 11.2    0.6          11.8     |  6.7    57%
EH typing        |  8.0    7.5          15.5     |  7.0    45%
Core             |  7.1    1.0           8.0     |  4.7    58%
ByteCode         |  2.1                  2.1     |  1.7    82%
Grin             | 11.3    1.6          12.9     |  8.5    66%
Silly            |  2.8                  2.8     |  2.6    93%
exp. backends    |  2.5    0.4           2.9     |  0.8    26%
runtime system   |                8.1    8.1     |  6.2    77%
garb. collector  |                6.0    6.0     |  0.7    11%
total            | 53.4   32.5   14.1  100.0     | 59.8    60%

Figure 2. Code size (in 1000 lines of code) of source files containing Attribute Grammar code (AG), Haskell code (HS) and C code (C), for various subsystems. Column ‘all variants’ is the total code base for all variants and aspects, column ‘UHC’ is the selection of the standard compiler, where ‘fract.’ shows the fraction of the full code base that is selected for UHC.

5.2 Methodological observations

Aspect-oriented organization UHC and its framework use an aspect-wise organization in which as much as possible is described by higher level domain specific languages from which we generate lower level implementations. UHC as a framework offers a set of compilers, thus allowing picking and choosing a starting point for play and experimentation. This makes UHC a good starting point for research, but debugging is also facilitated by it. A problem can more easily be pinpointed to originate in a particular step of the whole sequence of language increments; the framework then allows debugging the compiler in this limited context, with less interaction from other features.

The stepwise organization, where language features are built on top of each other, offers a degree of isolation. Much better would be to describe language features completely independently. However, this is hard to accomplish because language features often interact and require redefinition of parts of their independent implementation when combined. To do this for arbitrary combinations would be more complicated than to do it for a sequence of increments.

Testing can also be kept relatively simple this way. As long as an increment in features does not remove previous features or only changes the generated test output, tests for a previous step can still be reused and extended with new tests. In UHC this only fails when the presence of a Prelude is assumed; the testing framework is aware of this.

The aspect-wise organization impacts all source code: AG code, Haskell code, C code, the build system, etc. Implementing aspects as part of the used languages would be a major undertaking, as all languages then should be aware of aspects, and in a similar way. In UHC we have chosen to factor out aspect management and deal with it by preprocessing.

UHC as an experimentation platform An obvious tension exists between UHC as a “full Haskell compiler” and a “nimble compiler for experimentation”. Many seemingly innocent paragraphs of the Haskell language report have a major impact on the implementation, making the implementation disproportionately complex. Although this cannot be avoided, it can be isolated to a certain degree, which is what we hope to have accomplished using an aspect-wise approach.

Although the chosen layering of language features and implementation techniques restricts the extent to which one can deviate from it for experimentation, one can always select a minimal starting point in the sequence of compilers and build on top of that. When we add new functionality, we usually start by making it work in an early variant, and then gradually make it work for subsequent variants.

AG Design Patterns We tend to use various AG idioms frequently. For example, information is often gathered over a tree via a synthesized attribute, and subsequently passed back as an inherited attribute. This leads to a “cyclic program” when lazy code is generated from the AG description, or a 2-pass tree traversal when strict code is generated (after checking for absence of cycles).

Some idiomatic use is directly supported by the AG system. For example, transformations are expressed as attribute grammars with a single, specially designated, attribute declaration for a copy of the tree being walked over. The only thing that remains to be specified is where the transformed tree differs from the original.

The AG notation allows us to avoid writing much boilerplate code, similar to other tree traversal approaches [37, 36, 24]. The use of attributes sometimes also resembles reader, writer, and state monads. In practice, the real strength of the AG system lies in combining separately defined tree traversals into one. For example, the EH type analysis repeatedly builds environments for kinds, types, datatypes, etc. Combined with the above idiomatic use this easily leads to many passes over the EH tree; something we’d rather not write by hand using monads (and monad transformers) or other mechanisms more suitable for single-pass tree traversals!

However, not all idiomatic use is supported by AG. For example, the need to pattern match on subtrees arises when case analysis on abstract syntax trees must be done. Currently this must be programmed by hand, and we would like to have automated support for it (as in Stratego [37, 36]).

The use of intermediate languages UHC uses various intermediate languages and transformations on them. The benefit of this approach is that various compiling tasks can be done where they best fit an intermediate language and can be expressed as small, easy to understand transformations, independently from other tasks. Drawbacks are that some tasks have more than one appropriate place in the pipeline and sometimes require information thrown away in earlier stages (e.g. the absence of types in Core).

The use of domain specific languages (DSL) We use various special purpose languages for subproblems: AG for tree traversals, Shuffle for incremental, aspect-wise, and better explainable development, Ruler for type systems. Although this means a steeper learning curve for those new to the implementation, in practice the DSLs we used and their supporting tools effectively solve an identifiable design problem.

5.3 Related work

Clearly other Haskell compilers exist, most notably GHC [26], which is hard if not impossible to match in its reliability and feature richness: UHC itself uses GHC as its main development tool.

Recently, JHC [27] and LHC [18] (derived from JHC) also took the whole-program analysis approach proposed by Boquist [8, 7] as their starting point. LHC in its most recent incarnation is available as a backend to GHC, and thus is not a standalone Haskell compiler.

Available alongside GHC for longer already are Hugs [21], which was influential on Haskell as a language, NHC98 [38], and YHC [33], derived from NHC98, all mature Haskell 98 compilers with extensions. Helium [17] (also from Utrecht) does not implement full Haskell 98 but focuses on good error reporting, thereby being suitable for learning Haskell. We also mention HBC [1] (not maintained anymore) for completeness.


The distinguishing feature of UHC is its internal organization. UHC, in particular its internal aspect-wise organized framework, is designed to be (relatively) easy to use as a platform for research and education. In Utrecht, students regularly experiment with the UHC framework. The use of AG and other tools also makes UHC different from other Haskell compilers, most of which are written in Haskell or lower level languages.

5.4 Future work

We have recently made a first public release of UHC [11]. In the near future we intend to add support for better installation, in particular the use of Cabal, and to add missing language features and libraries. On a longer time scale we will continue working on whole-program analysis, the optimizations allowed by it, add classical analyses (e.g. strictness), and improve the runtime system (switching to our own garbage collector). As we recently included the standard libraries, we will be able to run benchmark suites to compare the performance (code size, compilation time, run time) of each operation mode (bytecode interpreter, whole-program analysis) with each other and with other compilers. We welcome those who want to contribute in these or other areas of interest.

References

[1] L. Augustsson. The HBC compiler. http://www.cs.chalmers.se/~augustss/hbc/hbc.html, 1998.

[2] R. Bird and O. de Moor. The Algebra of Programming. Prentice Hall, 1996.

[3] R. S. Bird. Using Circular Programs to Eliminate Multiple Traversals of Data. Acta Informatica, 21:239–250, 1984.

[4] H. Boehm. A garbage collector for C and C++. http://www.hpl.hp.com/personal/Hans_Boehm/gc/, 2006.

[5] H. Boehm and M. Weiser. Garbage Collection in an Uncooperative Environment. Software Practice and Experience, pages 807–820, Sep 1988.

[6] M. Bolingbroke and S. Peyton Jones. Types are calling conventions (submitted to Haskell Symposium 2009). 2009.

[7] U. Boquist. Code Optimisation Techniques for Lazy Functional Languages. PhD thesis, Chalmers University of Technology, 1999.

[8] U. Boquist and T. Johnsson. The GRIN Project: A Highly Optimising Back End For Lazy Functional Languages. In Selected papers from the 8th International Workshop on Implementation of Functional Languages, 1996.

[9] A. Dijkstra. Stepping through Haskell. PhD thesis, Utrecht University, Department of Information and Computing Sciences, 2005.

[10] A. Dijkstra, J. Fokker, and S. D. Swierstra. The Structure of the Essential Haskell Compiler, or Coping with Compiler Complexity. In Implementation of Functional Languages, 2007.

[11] A. Dijkstra, J. Fokker, and S. D. Swierstra. UHC Utrecht Haskell Compiler. http://www.cs.uu.nl/wiki/UHC, 2009.

[12] A. Dijkstra and S. D. Swierstra. Ruler: Programming Type Rules. In Functional and Logic Programming: 8th International Symposium, FLOPS 2006, Fuji-Susono, Japan, April 24-26, 2006, number 3945 in LNCS, pages 30–46. Springer-Verlag, 2006.

[13] K.-F. Faxen. A Static Semantics for Haskell. Journal of Functional Programming, 12(4):295, 2002.

[14] J. Fokker and S. D. Swierstra. Abstract interpretation of functional programs using an attribute grammar system. In A. Johnstone and J. Vinju, editors, Language Descriptions, Tools and Applications (LDTA08), 2008.

[15] GHC Team. The New GHC/Hugs Runtime System. http://citeseer.ist.psu.edu/marlow98new.html, 1998.

[16] Haskell' Committee. Haskell Prime. http://hackage.haskell.org/trac/haskell-prime/, 2009.

[17] B. Heeren, A. v. IJzendoorn, and J. Hage. Helium, for learning Haskell. http://www.cs.uu.nl/helium/, 2005.

[18] D. Himmelstrup, S. Bronson, and A. Seipp. LHC Haskell Compiler. http://lhc.seize.it/, 2009.

[19] ISO. Common language infrastructure (ISO/IEC standard 23271). ECMA, 2006.

[20] M. P. Jones. Typing Haskell in Haskell. In Haskell Workshop, 1999.

[21] M. P. Jones. Hugs 98. http://www.haskell.org/hugs/, 2003.

[22] D. Knuth. Semantics of context-free languages. Mathematical Systems Theory, 2(2):127–145, 1968.

[23] D. Knuth. Literate Programming. Journal of the ACM, (42):97–111, 1984.

[24] R. Lammel and S. Peyton Jones. Scrap your boilerplate: a practical design pattern for generic programming. In Types In Languages Design And Implementation, pages 26–37, 2003.

[25] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar 2004.

[26] S. Marlow and S. Peyton Jones. The Glasgow Haskell Compiler. http://www.haskell.org/ghc/, 2004.

[27] J. Meacham. Jhc Haskell Compiler. http://repetae.net/computer/jhc/, 2009.

[28] S. Peyton Jones. Compiling Haskell by program transformation: a report from the trenches. In European Symposium On Programming, pages 18–44, 1996.

[29] S. Peyton Jones. Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell. In Engineering theories of software construction, Marktoberdorf Summer School, 2002.

[30] S. Peyton Jones. Haskell 98, Language and Libraries, The Revised Report. Cambridge Univ. Press, 2003.

[31] S. Peyton Jones and S. Marlow. Secrets of the Glasgow Haskell Compiler inliner. Journal of Functional Programming, pages 393–434, 2002.

[32] S. Peyton Jones and E. Meijer. Henk: A Typed Intermediate Language. In Workshop on Types in Compilation, 1997.

[33] T. Shackell, N. Mitchell, A. Wilkinson, et al. YHC York Haskell Compiler. http://haskell.org/haskellwiki/Yhc, 2009.

[34] S. D. Swierstra, P. Azero Alocer, and J. Saraiva. Designing and Implementing Combinator Languages. In 3rd Advanced Functional Programming, number 1608 in LNCS, pages 150–206. Springer-Verlag, 1999.

[35] M. Viera, S. D. Swierstra, and W. S. Swierstra. Attribute grammars fly first class: How to do aspect oriented programming in Haskell. In International Conference on Functional Programming (ICFP '09), New York, NY, USA, 2009. ACM Press.

[36] E. Visser. Stratego: A language for program transformation based on rewriting strategies. System description of Stratego 0.5. In A. Middeldorp, editor, Rewriting Techniques and Applications (RTA'01), number 2051 in LNCS, pages 357–361. Springer-Verlag, 2001.

[37] E. Visser. Stratego Home Page. http://www.program-transformation.org/Stratego/WebHome, 2005.

[38] York Functional Programming Group. NHC98 Haskell Compiler. http://haskell.org/nhc98/, 2007.


Alloy: Fast Generic Transformations for Haskell

Neil C. C. Brown and Adam T. Sampson
Computing Laboratory, University of Kent, UK, CT2 7NF

[email protected], [email protected]

Abstract
Data-type generic programming can be used to traverse and manipulate specific parts of large heterogeneously-typed tree structures, without the need for tedious boilerplate. Generic programming is often approached from a theoretical perspective, where the emphasis lies on the power of the representation rather than on efficiency. We describe use cases for a generic system derived from our work on a nanopass compiler, where efficiency is a real concern, and detail a new generics approach (Alloy) that we have developed in Haskell to allow our compiler passes to traverse the abstract syntax tree quickly. We benchmark our approach against several other Haskell generics approaches and statistically analyse the results, finding that Alloy is fastest on heterogeneously-typed trees.

Categories and Subject Descriptors D.1.1 [Applicative (Functional) Programming]

General Terms Languages, Performance

Keywords Generic Programming, Haskell, Alloy

1. Introduction
Data-type generic programming concerns functions that depend on the structure of data-types, such as pretty-printing. A very common use is the automatic application of a function that operates on sub-elements of a larger type. This avoids the need for large amounts of systematic boilerplate code to traverse all the types not of interest to apply functions to the types that are of interest.

Generic programming research has become popular over the last ten years, particularly in the functional programming language Haskell (for a review, see Rodriguez et al. 2008). The approaches mainly differ by theoretical approach or the use of different language features to achieve generic programming (including several language extensions for generic programming).

Our interest in generic programming is pragmatic. We use generic programming in a compiler to eliminate boilerplate, and we require a straightforward API backed by a very fast generics approach (see section 2 for more detail of our requirements). We began by using a pre-existing generics system, but found that it was not fast enough for our needs.

We thus developed our own generics library for Haskell, Alloy, that blends together features of several existing generics approaches into an efficient whole. Our contributions are as follows:


• We describe the basic algorithm, implementation and API of Alloy, a library for generic traversals and transformations built using Haskell type-classes (section 3). We later describe a further improvement to our approach (section 7).
• We explain several real use cases of data-type generic programming in our compiler, and examine how to implement them efficiently (section 4).
• We benchmark and statistically analyse the results of Alloy and existing generics approaches (sections 5, 6 and 6.5). The results show that Alloy is faster than existing approaches for traversing heterogeneously-typed trees (we conclude in section 8).

2. Motivation
We develop Tock, a compiler for imperative parallel languages such as occam-π (Welch and Barnes 2005), in Haskell. Tock is currently over 20,000 non-blank lines of Haskell code. Tock is a nanopass compiler (Sarkar et al. 2004), meaning that its design consists of many (currently around 40) small passes that operate on the Abstract Syntax Tree (AST) of the program, each performing one simple operation, for example: making names unique, or checking that variables declared constant are not modified.

A pass that makes names unique must traverse the entire AST, operating on all names. A constant folding pass must traverse the entire AST, operating on all expressions. To avoid writing boilerplate for each traversal, we use generic programming. To ensure fast compilation of occam-π code, the 40 traversals of the tree must be as fast as possible.

Our passes typically operate on one or two types, but the most complex passes (such as the type-checker) operate on up to nine types in one traversal, with complicated rules for when the traversal must descend further into the tree, and when it must not. Our AST currently consists of around 40 different algebraic data types, with around 170 constructors between them. If all the basic sub-types (lists, pairs, primitive types, etc) are also included, we have around 110 different types.

We began by using the Scrap Your Boilerplate (SYB) library (Lammel and Peyton Jones 2003), but we found it was too slow for our purposes, leading us to first augment SYB, and then replace it altogether with Alloy.

We require the following generics facilities:

• Monadic transformations. Most transformation functions must run in our compiler monad, so that they have access to the compiler's state and can report errors. As we will see later, while we require the full power of monads for the compiler, our generics approach only requires the more general applicative functors (McBride and Paterson 2008).
• Multiple target types. Several passes – particularly those that walk the tree updating some internal state – need to operate upon multiple target types at once.
• Explicit descent. Some passes must be able to decide whether – and when – to descend into a subtree. A convenient way to do this is to provide a function like gmap or descend. (An alternative used by Strafunski (Lammel and Visser 2002) is to define tree traversal strategies separately from the transformation functions, but in Tock this would mean duplicating decision logic in many cases, since traversal strategies are often pass-specific.)
• High-level common operations. Most passes do not need explicit descent; we need helper functions like everywhere to apply simple depth-first transformations and checks to the tree.
• No need to define instances by hand. Tock's AST representation is complex, and sometimes extended or refactored. Writing type class instances by hand would require a lot of effort (and be prone to mistakes); we must be able to generate them automatically, such as with an external tool.
• Decent performance. Walking the entire tree for every pass is unacceptably inefficient; each traversal should examine as few nodes as possible.
• Library-level. We want it to be easy to distribute and build Tock. Therefore any generics approach that we use must be in the form of a library that uses existing Glasgow Haskell Compiler (GHC) features, so that it can be built with a standard distribution of GHC by our end-users. Ideally, we would depend only on extensions to the Haskell language that are likely to end up in the next Haskell standard, Haskell Prime.

In section 4 we will detail several use cases that show examples of where we need these different features of generic programming. There are several features of generic programming in the literature that we do not require. We refer to them, where possible, by the names given in Rodriguez et al. (2008):

• Multiple arguments: This is required by operations such as generic zipping, or generic equality. In Tock we always operate on a part of the AST and do not need this.
• Constructor names: This is required by operations such as gshow. While Alloy could easily be extended to support this, we do not require this functionality in Tock.
• Type-altering transformations: We need transformations of the form a -> a (and a -> m a), but we do not need type-altering transformations of the form a -> b.
• Extensibility: Several authors (Hinze 2004; Oliveira et al. 2007; Lammel and Peyton Jones 2005) have identified the problem that once generic functions have been defined as a list of specific cases (also known as tying the recursive knot), a new case cannot easily be added. This is not a problem in Tock, where we never need to extend pass functions with additional specific cases outside of the definition of the pass.

3. Alloy
Alloy, our generics library, is centred on applying type-preserving transformation operations to all of the largest instances of those types in a heterogeneously-typed tree. The largest instances are all those not contained within any other instances of the type-set of interest (see figure 1 for an illustration). The transformations can then descend further if required.

We do this by taking a set of transformation operations (opset for short) and comparing the type that the operation acts on with a current suspect type (think of the type being investigated for matches; hence a suspect). If there is a match, the transformation is applied. If there is no match, the operations are applied to the children (immediate sub-elements) of the suspect type and so on until the largest types have all been transformed in such a way.

Figure 1. An illustration of the largest types in a tree. The shape of a node indicates its type. The shaded shapes are the largest instances when the types of interest are triangles and pentagons.

Our basic algorithm is to have a queued opset ready to be compared to the suspect type, and a descent opset ready to be applied to the suspect's children if no exact match is found. We repeatedly take one operation from the queued opset, and compare it to the suspect type. There can be three possible results of this comparison:

1. the suspect type matches the operation type,

2. the suspect type can contain the operation type, or

3. the suspect type cannot contain the operation type.

In case 1, the operation is applied and the result returned. No further work is done by the current call. In case 2, the operation is retained, by moving it onto the descent opset. In case 3, the operation is discarded.

As an example, consider the following type:

data Foo = FooInt Int Int | FooFloat Float

We wish to apply transformations to everything of type Float, Int and String that might be contained in the suspect type Foo.

Figure 2 demonstrates our opset being compared against the suspect type Foo. The operations on Float and Int are retained (because Foo can contain those types), whereas the operation on type String is discarded.

Alloy is similar to several other approaches, such as Uniplate (Mitchell and Runciman 2007), SYB (Lammel and Peyton Jones 2003) and Smash (Kiselyov 2006). The two key features of Alloy, intended to increase its efficiency, are that:

1. All our decisions about types are made statically via the Haskelltype-checker, rather than dynamically at run-time. Smash andUniplate take the same approach, in contrast to SYB’s use ofdynamic typing.

2. Unlike Smash or SYB, we discard operations that can no longerbe applied anywhere inside the suspect type. Uniplate, whichonly supports one target type, stops the traversal when this tar-get type cannot possibly be found anywhere inside the suspecttype. We extend this optimisation to multiple types. Not only dowe stop when no operations can be further applied, but we alsodynamically discard each operation individually when it cannotbe applied anywhere inside the suspect type. This is a primarycontribution of Alloy.


Figure 2. An example of processing an opset with respect to a suspect type. The types of the transformations in the queued opset are progressively compared to the suspect type. If, like String, they cannot be contained in the suspect type, they are discarded. If they can be contained, like Float and Int, they are retained by being moved to the descent opset.

3.1 The Type-Class
Haskell's type-classes are a form of ad-hoc polymorphism that allow functions to be specialised differently for different types. Like Smash and Uniplate, we use Haskell's type-classes to implement Alloy; the library is centred around a type-class of the same name:

class Alloy opsQueued opsDescent suspect where
  transform :: opsQueued -> opsDescent -> suspect -> suspect

The type-class has three parameters. The first is the queued opset, the second is the descent opset and the third is the suspect type, all of which were described in the previous section. Our opsets are implemented in a cons fashion (with terminator BaseOp):

data BaseOp = BaseOp
data t :- ops = (t -> t) :- ops
infixr 7 :-

This allows the value of the opsets to directly mirror the type; a sample opset that works on String, Float and Int is:

ops :: String :- Float :- Int :- BaseOp
ops = processString :- processFloat :- processInt :- BaseOp

Most of our use of Alloy is via two simple helper functions. The descend function¹ is used to apply the transformations to a value's children, which is done by using the transform function with an empty queued opset and a full descent opset – which will result in an application of the descent opset to all the children of the value. In contrast, our apply helper function begins with a full queued opset and an empty descent opset, and will attempt to apply the operations directly to the target, before descending if none can be applied:

descend :: Alloy BaseOp ops t => ops -> t -> t
descend ops = transform BaseOp ops

apply :: Alloy ops BaseOp t => ops -> t -> t
apply ops = transform ops BaseOp

We can thus write a compiler pass (that has no automatic descent) as follows:

alterNames :: AST -> AST
alterNames = apply ops
  where
    ops = doName :- BaseOp

    doName :: Name -> Name
    doName = ...

3.2 Instances
As an example for instances we will consider again the type from the previous section:

data Foo = FooInt Int Int | FooFloat Float

To aid understanding, we will also provide a Haskell-like pseudo-code for the instances, of the form:

alloyInst :: [Op] -> [Op] -> a -> a
alloyInst queued descent x = ...

3.2.1 Base Case
We require a base case instance, for when there are no operations left in either opset – none to try to apply to the suspect type, and none to apply to its children. In this case we are no longer interested in this element or anything beneath it, and the identity operation is used on the data:

¹ The descend function has the same behaviour as the compos operator defined by Bringert and Ranta (2008).


instance Alloy BaseOp BaseOp Foo where
  transform _ _ x = x

This is equivalent in our pseudo-code to:

alloyInst [] [] x = x

3.2.2 Matching Case
We require a case where the type of the operation matches the current type:

instance Alloy (Foo :- opsQueued) opsDescent Foo where
  transform (f :- _) _ x = f x

Here, we have found a type of interest and the appropriate operation to apply. Therefore we simply apply the operation, ignoring the remaining queued and descent opsets (any required further descent will be done by the f function). This is analogous to:

alloyInst (f:_) _ x | typeOfOp f == typeOf x = f x

The matching of the Foo type in our instance declaration is here converted into a guard that uses notional type-getting functions.

3.2.3 Descent Case
We require an instance dealing with the case where there are no operations remaining in the queued opset to try to apply to the suspect type, but there are operations remaining in the descent opset to apply to all the sub-elements:

instance (Alloy (t :- ops) BaseOp Int,
          Alloy (t :- ops) BaseOp Float) =>
         Alloy BaseOp (t :- ops) Foo where
  transform _ opsD (FooInt m n)
    = FooInt (transform opsD BaseOp m) (transform opsD BaseOp n)
  transform _ opsD (FooFloat f)
    = FooFloat (transform opsD BaseOp f)

The type t can be anything here; expressing the opset as a t :- ops indicates to the type system that it is distinct from BaseOp, to prevent the instances overlapping (unlike Haskell's normal in-order pattern-matching, with type-classes every instance must be uniquely determinable from the head). One can think of the constructor BaseOp as being the type-level equivalent of the empty list pattern, [], whereas the pattern (t :- ops) is akin to the cons pattern (x:xs). This is reflected in the two cases added to our pseudo-code:

alloyInst [] opsD@(_:_) (FooInt m n)
  = FooInt (alloyInst opsD [] m) (alloyInst opsD [] n)
alloyInst [] opsD@(_:_) (FooFloat f)
  = FooFloat (alloyInst opsD [] f)

The instance body has a case for each constructor of the algebraic data type, and processes each sub-element with a further traversal, where the descent opset is moved to be processed anew on the sub-element type as the queued opset (and the descent opset is emptied).

The head of the instance declaration lists the type-class requirements for these new traversals. In this case, the two types Int and Float need to be processed with an empty descent opset and a full queued opset.

3.2.4 Sliding Cases
The descent cases had generic opsets – that is, they did not examine what types were in the opsets. The remaining instances must all consider whether the type of the operation at the head of the opset matches, can be contained, or cannot be contained by the suspect type. We perform this check at compile-time, by generating different instances for each combination of suspect type and head of the opset. A couple of the relevant instances for Foo are:

instance Alloy opsQueued (Int :- opsDescent) Foo =>
         Alloy (Int :- opsQueued) opsDescent Foo where
  transform (f :- opsQ) opsD x = transform opsQ (f :- opsD) x

instance Alloy opsQueued (Float :- opsDescent) Foo =>
         Alloy (Float :- opsQueued) opsDescent Foo where
  transform (f :- opsQ) opsD x = transform opsQ (f :- opsD) x

These instances are processing operations on Float and Int – two types that can be contained in Foo. The instance moves the operations from the queued opset to the descent opset, and continues processing the remainder of the queued opset.

Contrast this with the instance for String:

instance Alloy opsQueued opsDescent Foo =>
         Alloy (String :- opsQueued) opsDescent Foo where
  transform (f :- opsQ) opsD x = transform opsQ opsD x

Here, the operation is discarded (String cannot be contained by Foo), and then we continue to process the remainder of the queued opset. As well as not being applied to Foo, the operation will not be checked against any of Foo's children, because it is not added to the descent opset. If Foo were a large data-type with many possible sub-elements, this would save a lot of time.

These instances are reflected in the final case in our pseudo-code, now presented alongside the rest of the code:

alloyInst [] [] x = x
alloyInst (f:_) _ x | typeOfOp f == typeOf x = f x
alloyInst [] opsD@(_:_) (FooInt m n)
  = FooInt (alloyInst opsD [] m) (alloyInst opsD [] n)
alloyInst [] opsD@(_:_) (FooFloat f)
  = FooFloat (alloyInst opsD [] f)
alloyInst (f:fs) opsD x
  | typeOfOp f `canBeContainedIn` typeOf x
      = alloyInst fs (f : opsD) x
  | otherwise = alloyInst fs opsD x

Recall that type-class instances must have a unique match – unlike Haskell functions, they are not matched in-order. Hence our pseudo-code has the same property; none of the pattern matches (plus guards) overlap; this is the reason for the explicit pattern for opsD in the FooInt and FooFloat cases.

We could generate our instances using an approach like Smash, where the information on type relations could be abstracted out into one type-class, and the descent instances put into another, with only four or so instances of Alloy to traverse the opset and build on these type-classes. Some preliminary testing indicated that this alternative approach ended up being slower at run-time – but it would be easy to change to this model.

3.2.5 Polymorphic Types
In our compiler application, we have only one polymorphic type, Structured (as well as uses of Maybe and lists). Typically, we want to apply different operations to the instantiations of these types, e.g. process Structured Process differently than Structured Expression and [Char] differently than [Formal].

Alloy thus does not currently provide any special support for polymorphic types (e.g. processing all Maybe a, for all a). Maybe Int and Maybe Float are treated as two entirely separate types, just as Int and Float are.
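
As an illustration (using our own hypothetical operation names), an opset that targets both instantiations simply lists them as two separate entries:

maybeOps :: Maybe Int :- Maybe Float :- BaseOp
maybeOps = doMaybeInt :- doMaybeFloat :- BaseOp
  where
    -- Each instantiation gets its own operation, exactly as two
    -- unrelated types would.
    doMaybeInt :: Maybe Int -> Maybe Int
    doMaybeInt = fmap (+ 1)

    doMaybeFloat :: Maybe Float -> Maybe Float
    doMaybeFloat = fmap (* 2)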

3.3 Monadic Alloy
As mentioned earlier, in our compiler nearly all of our passes operate inside a monad. To support monadic transformations, all we strictly need is support for applicative functors – every monad can be made an applicative functor (McBride and Paterson 2008). We must define a new type-class to support this:


class AlloyA opsQ opsD t where
  transformA :: Applicative f => opsQ f -> opsD f -> t -> f t

In order for it to be apparent to the type system that the applicative functor that transformA operates in is the same applicative functor that the opsets use, we parameterise the opsets with the functor. To support this we define our new opsets as follows:

data (t :-* ops) f = (t -> f t) :-* ops f
infixr 7 :-*
data BaseOpA f = BaseOpA

The use of this opset becomes apparent in an example:

fixNames :: AlloyA (Name :-* BaseOpA) BaseOpA a => a -> PassM a
fixNames = applyA (doName :-* BaseOpA)
  where
    doName :: Name -> PassM Name
    doName = ...

The opset Name :-* BaseOpA is ready to be parameterised by an applicative functor, and the functor being used is not mentioned in the class constraint. The design of the :-* type is such that we guarantee that all operations in the opset are using the same functor, which a plain HList (Kiselyov et al. 2004) could not.

The instances for AlloyA are nearly identical to those given for Alloy in the previous sections. The operations are of type (for example) Int -> f Int rather than Int -> Int, and two cases are slightly different – the base case and descent case:

-- Base case:
instance AlloyA BaseOpA BaseOpA Foo where
  transformA _ _ = pure

-- Descent case:
instance (AlloyA (t :-* ops) BaseOpA Int,
          AlloyA (t :-* ops) BaseOpA Float)
      => AlloyA BaseOpA (t :-* ops) Foo where
  transformA _ opsD (FooInt m n)
    = pure FooInt <*> transformA opsD BaseOpA m
                  <*> transformA opsD BaseOpA n
  transformA _ opsD (FooFloat f)
    = pure FooFloat <*> transformA opsD BaseOpA f

The instances for Alloy and AlloyA are so similar that we do not have to generate the instances for both Alloy and AlloyA. We can generate instances for AlloyA (the more general case), and define Alloy in terms of AlloyA by converting each of the operations in the opsets (using some trivial type-level programming) into operations in the Identity monad². However, this is not as fast (at run-time) as generating specific instances for Alloy. Defining the pure version in terms of the more general applicative functor version, together with the definition of the descent case, is very similar to the ComposOp module (Bringert and Ranta 2008).

3.4 Common Operations
The Alloy type-class we have shown is used to apply transformations to the largest values belonging to types of interest³ in a tree. Often we actually want to apply a transformation to all types of interest in a tree, which we can do by first wrapping each of the transformation functions as follows:

makeBottomUp, makeTopDown :: Alloy BaseOp opsDescent t =>
  opsDescent -> (t -> t) -> t -> t
makeBottomUp ops f = f . descend ops
makeTopDown ops f = descend ops . f

² We do this in Tock, for the very few passes that are pure functions.
³ Recall that the largest types of interest are those not contained by any other types of interest – see figure 1.

The difference between these two functions is whether the function is applied before or after the descent, which results in the transformation either being bottom-up or top-down. We provide top-down transformations for illustration; Mitchell and Runciman (2007) rightly caution against the use of such transformations, because it is more likely that errors will be introduced with top-down transformations.

These functions can then be used in convenience functions (applyBottomUp is our equivalent of SYB's everywhere) to apply functions to one or more different types in a large tree:

applyBottomUp :: (Alloy (s :- BaseOp) BaseOp t,
                  Alloy BaseOp (s :- BaseOp) s) =>
  (s -> s) -> t -> t
applyBottomUp f = apply ops
  where
    ops = makeBottomUp ops f :- BaseOp

applyBottomUp2 :: (Alloy (sA :- sB :- BaseOp) BaseOp t,
                   Alloy BaseOp (sA :- sB :- BaseOp) sA,
                   Alloy BaseOp (sA :- sB :- BaseOp) sB) =>
  (sA -> sA) -> (sB -> sB) -> t -> t
applyBottomUp2 fA fB = apply ops
  where
    ops = makeBottomUp ops fA :- makeBottomUp ops fB :- BaseOp

Note that the opset is used in its own definition, because the wrappers for the functions need to know what operations to apply when recursing. Our type-class constraints indicate what calls to transform need to be made, for example for applyBottomUp2:

• One call will be on the top-level type t with the full set of queued operations (and an empty descent opset).
• A call will be made on the sA type to apply the operations to all of its children. To force this descent into the sA type (rather than applying the sA transformation again), we pass an empty queued opset, but a full descent opset. This will cause all the operations to be applied to sA's children. If sA does not contain sB, for example, the opset will be pruned on the next step, because then none of sA's children can contain sB.
• The same call will be made on the sB type.

Should the user require any further functions (e.g. applyBottomUp with four types), it is possible to create them from the more basic functions as we have done here. It is important to note that applyBottomUp2 f g is not guaranteed to be the same as the composition applyBottomUp f . applyBottomUp g (nor will it be the same as applyBottomUp g . applyBottomUp f) unless the types that f and g operate on are entirely disjoint. Consider:

g :: Maybe Int -> Maybe Int
g = const $ Just 3

f :: Int -> Int
f = succ

x :: Maybe Int
x = Nothing

(applyBottomUp f . applyBottomUp g $ x) == Just 4
applyBottomUp2 f g x == Just 3
applyBottomUp2 g f x == Just 3

The composition will apply the second function to children of the result of the first – something that applyBottomUp2 will not do.

Unlike Uniplate, we do not provide a great variety of helper functions. As well as the simple descend and apply functions explained in section 3.1, and applyBottomUp and applyBottomUp2 (and applicative versions of each using AlloyA), the only other function we need for Tock is a query function akin to SYB's listify:


findAll :: (AlloyA (s :-* BaseOpA) BaseOpA t,
            AlloyA BaseOpA (s :-* BaseOpA) s) =>
  (s -> Bool) -> t -> [s]
findAll qf x = execState (applyBottomUpA examine x) []
  where
    examine y = do when (qf y) $ modify (y:)
                   return y

3.5 Instance Generation
Instance generation is regular and systematic. Naturally, we do not wish users of Alloy to write instances by hand. While there are tools, such as Derive (Mitchell and O'Rear 2009) and DrIFT (Winstanley 1997), for generating Haskell instances (as well as Template Haskell (Sheard and Peyton Jones 2002)), we opted to build our own simple instance generator using SYB.

The advantage of using SYB is that no external tools or libraries are required. SYB requires language extensions in GHC, and SYB is supplied with GHC. We can use its traversals to discover the necessary information (the relations between types in terms of can-contain) to generate Alloy instances for any type that derives the Data type-class in the standard way.
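
To illustrate the kind of information the generator gathers (the helper below is our own sketch, not part of Alloy or Tock), SYB can report the types of a value's immediate sub-elements; chasing these relations transitively yields the can-contain table from which the instances are emitted:

import Data.Data (Data, gmapQ)
import Data.Typeable (TypeRep, typeOf)

-- The types of the immediate sub-elements of a value (sketch only).
directChildTypes :: Data a => a -> [TypeRep]
directChildTypes = gmapQ typeOf

For example, assuming Foo derives Data, directChildTypes (FooInt 1 2) reports the representation of Int twice.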

4. Use Cases
In this section, we present and discuss some of the uses we make of generic operations. Our approach to designing our passes is guided by the knowledge (backed up by the results in tables 2 and 3 on page 12) that the traversal of large trees such as ours is a large time cost which dwarfs the cost of the operation at particular nodes. We present several use cases in the subsequent sections, discussing a simple way to implement them, and possible efficient refactorings. We accompany each example with some code that makes correct use of Alloy, but that uses a simplified version of our AST.

We characterise our traversals via two orthogonal distinctions: bottom-up (descent before transformation) versus top-down, and depth-first (each child is processed entirely before its sibling) versus breadth-first.

4.1 Correcting Names
The occam naming rules, originally designed over twenty years ago, permit dots in names but not underscores. In order to compile to C, we simply turn each dot in a name into an underscore. This is easily accomplished with our helper functions:

dotToUnderscore :: AST -> AST
dotToUnderscore = applyBottomUp doName
  where
    doName (Name n) = Name [if c == '.' then '_' else c | c <- n]

The majority of the passes in Tock are implemented using applyBottomUpA or applyBottomUpA2, but the other examples we give here are the more interesting cases that require a different approach.

4.2 Parallel Usage Check
The languages we compile have parallel constructs that execute several code branches in parallel. We have a pass that checks that each parallel construct obeys the CREW rule: Concurrent-Read, Exclusive Write. We must check that each variable is either:

• not written-to, or
• only written-to in one part of the parallel construct while not used at all (for reading or writing) in any others.

The algorithm is simple, once we have used our traversals to collect sets of written-to and read-from names for each part of the parallel construct, for which we can use a generic operation.

A straightforward implementation would be to use a generic traversal to descend to each parallel construct – then, further generic queries could be used to find all written-to names (by looking for all elements that could be involved in writing to a name, such as assignments and procedure calls) and all read-from names (which can be done by just finding all other names), followed by checking our CREW rule, and descending to find further nested parallel constructs. This would be an implementation of an O(N²) pass, however, with each instance of a name processed once for each parallel construct it is contained within.

We refactor our pass as follows. We perform a traversal of the tree with explicit descent and a monad with a record of used names. When we encounter a name, we add it to this record. At each parallel construct, we explicitly descend separately into each branch with a fresh blank record of names, and when these traversals finish, we use these different name records for our CREW check. Afterwards, we combine all these name records into the state. In this way, we can perform one descent of the entire tree to deal with all the nested parallel constructs. The code is:

-- Issues an error when the CREW rule is broken
checkSets :: [Set.Set String] -> PassM ()

checkCREW :: AST -> PassM AST
checkCREW x = liftM fst $ runWriterT (applyA ops x) Set.empty
  where
    ops = doProc :-* doName :-* BaseOpA

    doProc (Par ps)
      = do ns <- mapM (liftM snd . listen . applyA ops) ps
           checkSets ns
           tell $ mconcat ns
           return $ Par ps
    doProc other = descendA ops other

    doName (Name n) = do tell $ Set.singleton n
                         return $ Name n

Note that we have two cases for doProc; one to handle the constructor we are interested in, and a default case to descend into the children of all the other constructors. If we used applyA here we would get an infinite loop (applyA would apply doProc again from the opset), so we must use descendA.

Several other passes in Tock make use of this pattern: manipulating/checking parts of the AST in a manner dependent on their sub-nodes. For example: removing unused variables, pulling up free names in procedures to become parameters, or pulling up sub-expressions into temporary variables. The latter two are actually rearrangements of the tree, pulling up sub-trees to a higher level, which we do by recording the pulled-up trees in a monad on a descent, then inserting them higher up.

4.3 Adding Channel Directions
Our source languages feature communication channels. Channels can be used with direction specifiers that indicate the writing or reading end of a channel, but we allow these specifiers to be omitted where they can be inferred. For example, if a channel is used in an output statement, we can infer that the writing end of the channel is required. The compiler must therefore infer the direction specifiers on all uses of the channel (in our AST, a variable can be a directed channel variable, much as we can have subscripted array variables). Even though our interest is in modifying the variable, our traversal must descend to the level of output/input statements, and then further descend into the variable to see if a direction specifier is necessary (or, indeed, if an invalid specifier has been given).
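
A minimal sketch of the shape of such a pass follows; the constructors (Output, Input, Direction, Variable) and the addDirection helper are hypothetical simplifications, not Tock's real definitions:

inferDirections :: AST -> PassM AST
inferDirections = applyA ops
  where
    ops = doStatement :-* BaseOpA

    -- Statements that fix a direction: mark the channel variable,
    -- leaving the rest of the statement untouched.
    doStatement (Output chan e)
      = do chan' <- addDirection DirOutput chan
           return $ Output chan' e
    doStatement (Input chan v)
      = do chan' <- addDirection DirInput chan
           return $ Input chan' v
    doStatement other = descendA ops other

    -- Adds the inferred specifier, or reports an error if a conflicting
    -- explicit specifier is already present (details omitted).
    addDirection :: Direction -> Variable -> PassM Variable
    addDirection = ...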

This pass illustrates two concepts that feature in other passes:


1. The concept of descent in a context is used more extensively by our type-checker, where the processing of an inner node is dependent on the value of a parent node.

2. Output and input statements are just two of many values a statement can take. We wish to process these constructors specifically, and descend further into other statements without processing. It is typical that we only want to process one or two constructors of a particular data type, and either descend, or ignore the rest.

4.4 Directly Contains
We are often concerned with processing every element of a particular type, but we need to take care when processing types that can be recursive. Consider the case of lists: strings, for example. A Haskell list is a directly recursive data-type. We may want to write a function to append a prime symbol to all strings. If we blindly apply a function like SYB's everywhere or Alloy's applyBottomUp with the operation (++ "'"), we will get multiple primes added to a string; the string "foo" technically contains four strings ("foo", "oo", "o" and ""), and a generic traversal appending to strings, applied everywhere, will prime each of them before joining it back to the rest of the string, resulting in "foo''''". This is a simple example that a programmer should see to avoid. However, the issue becomes more complicated with more complicated data-types.
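
Rendered as code, the pitfall just described looks something like this (a hypothetical pass, not one of Tock's):

primeStrings :: AST -> AST
primeStrings = applyBottomUp addPrime
  where
    -- String is a recursive list type, so a bottom-up traversal also
    -- applies this operation to every tail of every string it meets.
    addPrime :: String -> String
    addPrime s = s ++ "'"

Applied to a node containing "foo", this produces "foo''''": the contained strings "", "o", "oo" and "foo" are each primed in turn.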

We have a pass to pull up (hoist) array literals from expressions into variables. In occam-π, array literals are delimited by square brackets, much as list literals are in Haskell. We may have some occam-π code such as:

a := doubleEach ([xs, [0,1], doubleEach ([2,3]), ys])

We need to pull up any array literals that are not directly nested inside other array literals, yielding the new code:

temp IS doubleEach ([2,3]):
temp2 IS [xs, [0,1], temp, ys]:
a := doubleEach (temp2)

Note that we do not need to pull up the [0,1] literal directly nested inside one literal (our multidimensional arrays compile to a flattened single-dimension array) – so we do not want to blindly pull up all array literals. We still want to pull up the array from the inner function call, though. To deal with these sorts of issues, we usually find the largest expression, then write some code to explicitly descend (ignoring directly nested array literals) and revert back to generic descent when we encounter other nodes (such as the inner function call to doubleEach). We also have to deal with pulling up the temporaries to the nearest appropriate place:

makeUniqueTemp :: PassM Name

applyItems :: [(Name, Expression)] -> Struct -> Struct

pullUpArrayLiterals :: Struct -> PassM Struct
pullUpArrayLiterals x = evalWriterT (doStruct x) []
  where
    ops = doExpr :-* doStruct :-* BaseOpA

    doExpr (ArrayLit es) = do es' <- mapM doArrayLit es
                              t <- makeUniqueTemp
                              tell [(t, ArrayLit es')]
                              return $ ExprVariable t
    doExpr e = descendA ops e

    doArrayLit (ArrayLit es) = liftM ArrayLit (mapM doArrayLit es)
    doArrayLit e = descendA ops e

    doStruct s = do (s', items) <- listen $ descendA ops s
                    return $ applyItems items s'

4.5 Making Names Unique
Making names unique, or 'uniquifying' names, is the process of renaming declared names to be unique in the program, and resolving all uses of that name to match the new declared name. Thus, we want to find declarations, and alter their name, followed by recursing down the tree to resolve all uses of that name, doing so in a top-down manner (name shadowing is allowed!).

If names are declared in two different AST element types (as used to be the case in Tock), we could not resolve the names correctly by resolving names for each AST declaration type separately – a name declared in one fashion may shadow a name declared in another fashion. So we would require one pass that operates on two types, and could not use two passes that each operate on one type.

One way to implement name resolution non-monadically is to search for name declarations in a bottom-up fashion, then process the scope of the declaration to uniquify all uses of the given name. This would resolve all names to their closest declaration, but the run-time would be O(N²). We can avoid the use of a reader monad for the name stack but we choose to retain a state monad for assigning a unique suffix to the name, and the error monad for issuing errors:

addUniqueSuffix :: String -> PassM String

uniquifyNames :: AST -> PassM AST
uniquifyNames = applyA (ops [])
  where
    ops nameStack
      = doDecl nameStack :-* doName nameStack :-* BaseOpA

    doName nameStack (Name n)
      = case lookup n nameStack of
          Nothing -> throwError $ "Name " ++ n ++ " not found"
          Just resolved -> return $ Name resolved

    doDecl nameStack (Decl n body)
      = do unique <- addUniqueSuffix n
           liftM (Decl unique) $
             applyA (ops $ (n, unique) : nameStack) body
    doDecl nameStack other = descendA (ops nameStack) other

We omit the irrelevant details of the addUniqueSuffix function. When processing names, we look for the most recent entry on the stack and use that as the new name. We need not descend further, because there are no elements of interest inside Name.

For declarations, we make a unique version of the name, and then descend into the body of the declaration with the adjusted name stack. This example demonstrates an interesting mix of pure programming (the name stack) and effectful programming (to get the unique identifier for the names).

4.6 Summary
We have described several ways in which we make use of monads in our passes. Allowing transformations to be monadic/idiomatic is the most flexible way to augment and implement much of the dependence involved in our passes (i.e. where one part of the transformation depends on the results of another part).

The cost involved in descending the tree guides much of the design of our passes, so that we traverse the tree as few times as possible. However, for clarity of design, we stop short of combining several passes into one (although we have considered attempting to do so automatically).

5. Related Work
Rodriguez et al. (2008) provide a comprehensive review of generic programming libraries, and Hinze et al. (2006a) provide a slightly older review of generic programming approaches (including non-library approaches). In this section we summarise the features and approaches of various generic libraries.

Scrap Your Boilerplate (Lammel and Peyton Jones 2003) is a system based on dynamic type examination. Lists of operations can be constructed, and the correct operation to apply is chosen by dynamically comparing the type of the suspect data item to the type of the operation. This is slow, but SYB is well-maintained and is effectively built-in to GHC, making it easily available.

SYB with class (Lammel and Peyton Jones 2005) reworks SYB to allow extensible generic functions – something not all other approaches (including Alloy) can achieve. The Spine work (Hinze et al. 2006b) also built on SYB, transforming SYB into a more type-theoretic approach that removed the dynamic polymorphism.

Smash (Kiselyov 2006) has an HList (Kiselyov et al. 2004) of operations that provided inspiration for our opsets. Smash uses static type-level techniques to compare the type of the operation against the suspect type. Alloy includes the extra optimisation for discarding operations by using more information about the types involved, but does not support as many forms of transformation and traversal as Smash.

Uniplate (Mitchell and Runciman 2007) and its related library Biplate use type-class instances to descend into types, looking for the largest instances, which also influenced Alloy's design. The main restriction of Uniplate and Biplate is that they can operate on only one target type – Alloy lifts this restriction to allow operation on multiple types by using type-level opsets.

Compos (Bringert and Ranta 2008) has an explicit descent mechanism similar to our descend function, and also allows use with applicative functors, much as our AlloyA class does. However, Compos adopts a GADT approach and again lacks our optimisation for discarding operations.

Many generics libraries focus on transforming Haskell data-types into a simpler (and more theoretical) representation, with special cases for primitive types (Int and similar) and a sum-of-products view for all other types. Both RepLib (Weirich 2006) and EMGM (Hinze 2004; Oliveira et al. 2007) take this approach. This builds a layer of abstraction that simplifies traversals. The performance of EMGM reported later in this paper demonstrates that this layer does not necessarily come at a performance cost.

While this paper has focused on libraries for generic programming, there are also several other approaches to generic programming in Haskell. Template Haskell (Sheard and Peyton Jones 2002) allows code to be run by the compiler, using compiler-level information on types and other aspects of the program to generate further code before compilation continues. EMGM uses Template Haskell to generate instances, but Template Haskell could equally be used to generate traversal code. There are also language extensions such as PolyP (Jansson and Jeuring 1997) and Generic Haskell (Clarke and Loh 2003), as well as external tools such as DrIFT (Winstanley 1997) and Derive (Mitchell and Runciman 2007), but as stated earlier we required a library-level approach.

6. Benchmarks
Our primary motivation for creating Alloy was to increase the speed of our generic traversals, and in this section we describe some benchmarks and analyse the results. We have used only transformations in our benchmarks: this is the majority of use in Tock, and frequently used elsewhere too (Rodriguez et al. 2008).

We first tried using the GPBench generic programming benchmarks (http://www.haskell.org/haskellwiki/GPBench) but found that those did not sufficiently distinguish the different approaches. Instead, we used the following benchmarks, taking data from Tock to provide larger-scale benchmarks:

• BTree: Common binary tree data structure, with a transformation that alters the value at each leaf node. There are only two types: the binary tree, and the leaf type. Data instances are perfectly symmetric, with a given depth (a sketch of this shape follows the list).
• OmniName: The real AST structure from Tock, using parsed existing compiler tests as input data values, transforming every Name item in the AST.
• FPName: The same AST structure, but only transforming Names that are function calls or procedure calls (requires examining three types).
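
The BTree benchmark has roughly the following shape (our own sketch, assuming the Alloy instances have been generated; the benchmark's actual names and definitions may differ):

data BTree = Node BTree BTree | Leaf LeafVal
data LeafVal = LeafVal Int

-- The benchmarked transformation: alter the value at each leaf.
incLeaves :: BTree -> BTree
incLeaves = applyBottomUp (\(LeafVal n) -> LeafVal (n + 1))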

Our expectations with respect to Alloy were that its worst relative performance would be on the BTree example (homogeneous data-type, everything can contain the target type), with mid-level performance on OmniName, and best relative performance on FPName, since this involves traversing less of the tree than OmniName (an optimisation that will not be spotted by the other approaches).

Although Alloy was written to be faster in Tock, there is no Tock-specific code in Alloy that confers it an advantage in the latter two benchmarks. We believe that using a real example of a complex tree structure will give the best idea of real performance of the techniques.

We chose to benchmark Alloy against SYB (Lammel and Peyton Jones 2003), Smash (Kiselyov 2006) and EMGM (Hinze 2004; Oliveira et al. 2007). Uniplate (Mitchell and Runciman 2007) would not support the FPName benchmark, and thus we did not test it. Uniplate, EMGM and Smash were indicated by Rodriguez et al. (2008) to be the fastest generics approaches, and we include SYB due to its popularity and it being the approach that we have replaced with Alloy in Tock.

6.1 Modifications
After some initial explorations with simple implementations of the benchmarks for each approach, it was clear that Alloy was an order of magnitude faster than the other approaches. We knew from experience why this was. Our AST data structure is filled with Strings – variable names, function names, source code filenames, and so on. Naïvely applying generic approaches such as EMGM, Smash and SYB leads them to descend into every character of every String, attempting to apply transformations. Alloy naturally avoids this due to its optimisation to discard operations that cannot be applied – when processing a String, all operations not targeted at String or Char will have been discarded, which in practical terms means Strings are never descended into. We felt it realistic to add special cases to EMGM, Smash and SYB that skip over Strings, as we did with SYB in earlier versions of Tock. Thus, EMGM and Smash both have standard versions (without this special case) and modified versions (with this special case). The performance of SYB (both in terms of speed and memory) without this special case was such that we were unable to complete the benchmarks, so SYB only features with this special case.

6.2 Parameters
Rodriguez et al. (2008) found that the performance of generics benchmarks was sensitive to differing compiler versions and optimisation levels. We aimed to compensate for this difference by testing two compiler versions (GHC 6.8.2 and 6.10.1), each with three optimisation levels (O0, O1 and O2), and used statistical analysis to try to abstract from these differences to get an overall measure of how performance differed by generics approach.

Factor Levels         OmniName  FPName  BTree
EMGM / Alloy          1.57      1.32    0.72
Smash / Alloy         1.91      2.17    1.06
Smash / EMGM          1.21      1.64    1.46
Opt 1 / Opt 0         0.61      0.56    0.52
Opt 2 / Opt 0         0.61      0.56    0.51
Opt 2 / Opt 1         0.99      1.00    0.99
GHC 6.10 / GHC 6.8    0.92      1.07    1.08

Table 1. The relative difference in the factors according to separate fitted generalised linear models for each benchmark. For example, the linear model indicates that Smash took 1.91 times as long as Alloy to complete the OmniName benchmark, discounting the effect of other factors.

For the OmniName and FPName benchmarks, we used ten different ASTs taken from an occam compiler test suite that predates Tock. We first timed (wall-clock time) five instances (per combination of approach/compiler/optimisation/AST) of applying the operation a fixed number of times (50) and forcing evaluation of the output, with the following approaches (the difference between modified and standard is explained in section 6.1):

1. Alloy

2. Smash (modified implementation)

3. EMGM (modified implementation)

4. SYB (modified implementation)

5. Smash (standard implementation)

6. EMGM (standard implementation)

It was apparent after running these benchmarks that approaches 4–6 were considerably slower than 1–3 (see tables 2 and 3 on page 12), and thus we only used approaches 1–3 in a subsequent experiment where we measured 30 instances of each combination (this showed no difference in means from our original run, but had a smaller standard error, allowing us to be more confident in our results).

For the BTree benchmark, we used a symmetric tree of height 14, and timed 50 instances (per combination of approach/compiler/optimisation) of applying 100 increments on all leaves in the tree. The results are shown in table 4 on page 12.

6.3 Analysis
There was no guarantee of a relationship between a technique's performance on one benchmark and its performance on another. Therefore we analysed each benchmark separately. For the AST benchmarks (OmniName and FPName), our dependent variable was the time measurement, and our four independent variables were: generics approach, compiler version, optimisation level and AST input. The last factor was included because the ASTs varied greatly in size, and thus we could not meaningfully directly compare results across the different ASTs, such as averaging the time taken across the ASTs. Note that in this section, for OmniName and FPName we only discuss the analysis of the three fastest approaches (Alloy, EMGM modified and Smash modified, which were run 30 times).

We performed an initial analysis using a four-way (3 × 2 × 3 × 10) analysis of variance (ANOVA). This revealed that all factors had significant main effects, and all the interactions⁴ of factors were also significant (significance at the 1% level, all p-values < .001). This is because our benchmarks are deterministic, and thus the variance is primarily measurement error, differences in cache state and similar.

We primarily wish to compare the performance of the three remaining generics approaches to each other. To do this, we fitted a generalised linear model to our data, which assigns to each level of each factor a weight that describes how many seconds that factor level adds/subtracts from a baseline. The absolute values are of little interest; instead we look at the relative difference between the levels. The values of the differences can be seen in table 1 (without the factors for task input, which are not of interest).

⁴ An interaction is a difference in the effect of a given factor as a function of the level of another.

Table 1 is thus the most concise summary of all our results. This tells us that Alloy is around twice as fast as Smash on our AST-based benchmarks, and offers similar performance in the binary tree example. EMGM is around a third to a half slower in the AST benchmarks, but a quarter faster in the binary tree example. These figures include the String optimisations for EMGM and Smash (in the AST examples).

6.4 Compiler Versions and Optimisation Levels
Our analysis of variance revealed that there is a statistically significant interaction between approach, compiler and optimisation level. Figure 4a (page 12) illustrates this for one task input to the OmniName benchmark. It can be seen that GHC 6.10 is not always better than GHC 6.8 – EMGM in particular is much slower in the newer compiler, and Alloy too. Optimisation level 1 usually offers a big improvement over no optimisation (level 0), but optimisation level 2 is only sometimes better than level 1 – this is noted in the GHC 6.10 user manual: "At the moment, -O2 is unlikely to produce better code than -O1." This is backed up for the BTree example in figure 4b (page 12) and by the figures in table 1.

It can be seen in table 3 (page 12) that in GHC 6.8 at optimisation level 1, EMGM and Alloy have the same performance (a t-test confirms there is no significant difference between the two at the 5% level), but this is not true in GHC 6.10, where Alloy has become faster, and EMGM slower. This illustrates some of the effect compiler and optimisation can have – but, broadly, the compiler and optimisation levels did not affect the ranking of the approaches.

6.5 Discussion

Our benchmarks showed that for the AST-based benchmarks, Alloy was faster than the other approaches – although close enough to EMGM that the difference may not matter for many users. The benchmarks confirmed that SYB was an order of magnitude slower than the other approaches on the AST-based benchmarks (see tables 2 and 3). For homogeneous types, Alloy is not the fastest (and probably not the most suitable, either), while SYB has less of a performance gap (see table 4).

Unexpectedly, the gap between EMGM and Alloy was narrower on FPName (see table 1), where we had expected Alloy to be the clear winner. The FPName benchmark took around the same amount of time in Alloy as the OmniName benchmark – it took the same amount of time to process the names at specific places in the AST as it did to process all the names. It is possible that the cost of having three types in our opset counterbalanced the savings of being able to discard those operations. One of the target AST types, Expression, occurs very frequently throughout the tree, so this may have also contributed to the lack of savings for Alloy.

7. Opening the Closed World

There are several issues with the design of Alloy as described thus far, stemming from one cause: for each type, Alloy requires instances for all the types contained and not contained within it. Consider the type-relation table shown in figure 3a for the following types:

data Foo = Foo Int Int
data Bar = Bar1 Foo | Bar2 Baz
data Baz = Baz1 Foo | Baz2 Bar


Figure 3. Type-relation squares for (a) four types, (b) with a fifth type added, requiring nine new instances, and (c) the same square with overlapping instances. A ‘c’ indicates the row-type contains the column-type, an ‘n’ indicates the row-type does not contain the column-type, and an ‘=’ indicates type equality.

Each row in the figure corresponds to a type-suspect. If the type-suspect contains the type at the top of a column, a ‘c’ is present. An ‘n’ indicates the type-suspect cannot contain the column type, and the equals signs on the leading diagonal show where the types match. Each entry in the type-relation square will have a corresponding instance (the row-type being the type-suspect, and the column-type being the operation type of the head of the opset).

With N types, this means there will be N² instances. This large number of instances is the first problem. It can also be seen that adding further types at a later date (perhaps generated by someone using the original types in a library) requires several new instances. To add a new type, one must supply not only the instances for the types contained within the new type, but also the instances for all the existing types (that the new type is not contained within). Consider the extra type:

data Quux = Quux Foo

The new type-relation square can be seen in figure 3b, and the user must add all the shaded entries: both the bottom row regarding types contained in Quux, but also the right-hand column with all the instances stating that Quux is not contained by any of the existing types. If Quux is added later in a separate module, it is inherent that Quux cannot be contained in any of the original types, so this new column will always be filled with ‘n’ instances (ignoring the equals instance). Having to add all these instances is a second problem.

Both of these problems can be solved by the use of the overlapping instances Haskell language extension. Overlapping instances allows us to still have specific instances for all the cases where the current type matches, or is contained within the type of the latest operation, but then allows us to provide a single generic instance for the not-contained-within case:

instance Alloy opsDescent opsQueued =>
         Alloy opsDescent (t :- opsQueued) where
  transform opsDescent (_ :- opsQ) x = transform opsDescent opsQ x

This instance means that adding a new type only requires adding (at most) half the previous number of instances, and in general the number of instances is greatly reduced. Consider the new type-relation square in figure 3c, where all the ‘n’ decisions have been replaced with empty squares (they are covered by our overlapping instance). The number of instances is almost halved. In Tock, with this overlapping instance we generate around 5200 instances – without it, we generate around 13000.

It can be seen that adding Quux requires only instances for the types contained within Quux, and no more. This transforms our approach from a closed-world system into more of an open-world approach where new types can easily be added later. It also reduces compilation times (see section 8.1 for more discussion of this issue).

8. Conclusions

We presented a new generics approach, Alloy, whose development and design was motivated by our use of generics in a compiler. Alloy is a powerful and fast library that can be used in any application where transformations and traversals of large complex data structures are required.

Our benchmarks confirmed that the generics approaches closest in performance to our own (which had to be optimised slightly for the application) are 30–100% slower than Alloy on large heterogeneous data types, and showed that SYB is around 3000% slower. Alloy's advantage is eliminated on more homogeneous data types. We showed that our results persist across compiler version and optimisation level, suggesting that generics comparisons are perhaps not as sensitive to these factors as previously thought.

We described use cases for generics in our compiler, and examined how they need to be rewritten to ensure that the number of passes required is kept to a minimum. On large tree structures, the time taken by the traversals outweighs the processing at each node. In our compiler, Tock, switching from (an augmented and optimised) SYB to Alloy approximately halved our entire compile time for occam-π code (which includes parsing and code generation), giving a strong indication of where much of the time is spent.

8.1 Limitations and Future Work

Most generics approaches require an instance to be generated per type. In Alloy, the number of instances is proportional to the square of the number of types (see section 7 for more details). Our improved performance at run-time comes at a cost at compile-time. GHC takes in the order of five to ten minutes to compile our thousands of generated instances. These instances are only re-compiled, however, when our AST type changes – which happens very infrequently compared to how often we compile the compiler. For projects that have complex types that change often, this is a drawback to using Alloy and reveals a potential limitation for the scalability of our particular approach, even with the overlapping instances improvement detailed in section 7. In future we would like to investigate ways to alleviate this problem.

Some of the passes in Tock operate on only one constructor of a type; it descends into all the rest to continue the traversal. Several generics systems have support for special cases for particular constructors, but Alloy does not. We could perhaps alter the opsets so that an operation could either be a transformation on a whole type, or a particular case for a constructor. It is unclear whether this would bring benefits, either in speed or in terms of code clarity.

Our AST contains one polymorphic type, Structured, which supports name declarations surrounding an inner type. Structured is used with seven different types as a parameter at various places in our AST. Some functions, such as those modifying name declarations, operate on all the different variants of Structured, while others are only interested in manipulating one specific Structured instance. Currently, our only support for manipulating all variants of Structured is to instantiate the operation for all variants and put all of them in an opset. In future we would like to investigate neater and more efficient solutions to this problem.

8.1.1 API

Our API was originally based on SYB (:-* is akin to extM). When we began developing Alloy, Tock was already using SYB-based traversals, so keeping the API similar to SYB was advantageous in order to ease the transition. Our API can be contrasted with Uniplate's API (Mitchell and Runciman 2007), which is of the form:

class Uniplate a where
  uniplate :: a -> ([a], [a] -> a)

The uniplate function takes a data item, and gives back a list of all the largest sub-elements of that type, along with a function that can take a corresponding list (same length, same order) of values, and reassemble them back into the original item.
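As a concrete illustration (using a hypothetical Expr type, not an example taken from either paper), an instance and its behaviour look like this:

data Expr = Val Int | Add Expr Expr

instance Uniplate Expr where
  uniplate (Val i)   = ([],     \ _        -> Val i)
  uniplate (Add x y) = ([x, y], \ [x', y'] -> Add x' y')

-- uniplate (Add (Val 1) (Val 2))  gives  ([Val 1, Val 2], rebuild),
-- and rebuild [Val 3, Val 4]      gives  Add (Val 3) (Val 4).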

The immediate problem with Alloy compared to Uniplate is that multiple types are involved. Still, if we use type-level programming to transform an opset into a corresponding type-level list of types, we could add a front-end class such as:

class ConvertOpsToTypes ops ts => Alloy' t ops where
  transform :: t -> ops -> (ts, ts -> t)

The instances would need a little alteration so that when an operation is dropped from the opsets, an empty list is put at the correct point in the return type.

8.2 Further Details

The alloy library is already available on Hackage, the Haskell package repository (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/alloy). We hope to be able to release our benchmarks, ideally as a contribution to the GPBench (http://www.haskell.org/haskellwiki/GPBench) generic programming benchmarks.

8.3 Haskell Extensions

The core idea of Alloy requires a few extensions to the Haskell language (available in the commonly-used GHC compiler). The first is multi-parameter type classes, and the others are undecidable instances, which allows our type-class recursion (with a corresponding increase in GHC's context reduction stack), as well as flexible contexts and flexible instances for the same purpose, and infix type constructors for our opsets. Multi-parameter type classes and infix type constructors have been accepted for the next Haskell language standard (currently titled Haskell Prime), and the other extensions remain under consideration.

This set of extensions is increased by the use of overlapping instances, although they are not essential for our library. Instance generation takes advantage of GHC's support for automatically deriving the Data type class, but instances could instead be generated by other external tools.

All of these language extensions are pre-existing and have been supported by GHC for many major versions.

References

Björn Bringert and Aarne Ranta. A pattern for almost compositional functions. Journal of Functional Programming, 18(5–6):567–598, 2008.

Dave Clarke and Andres Löh. Generic Haskell, specifically. In Proceedings of the IFIP TC2/WG2.1 Working Conference on Generic Programming, pages 21–47, Deventer, The Netherlands, 2003. Kluwer, B.V.

Ralf Hinze. Generics for the masses. In ICFP 2004, pages 236–243. ACM Press, 2004.

Ralf Hinze, Johan Jeuring, and Andres Löh. Comparing approaches to generic programming in Haskell. In Spring School on Datatype-Generic Programming, 2006a.

Ralf Hinze, Andres Löh, and Bruno C. d. S. Oliveira. “Scrap Your Boilerplate” Reloaded. In Proceedings of the Eighth International Symposium on Functional and Logic Programming (FLOPS 2006), 2006b.

Patrik Jansson and Johan Jeuring. PolyP – a polytypic programming language extension. In POPL '97: The 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 470–482. ACM Press, 1997.

Oleg Kiselyov. Smash your boilerplate without class and typeable. http://article.gmane.org/gmane.comp.lang.haskell.general/14086, 2006.

Oleg Kiselyov, Ralf Lämmel, and Keean Schupke. Strongly typed heterogeneous collections. In Haskell '04: Proceedings of the ACM SIGPLAN workshop on Haskell, pages 96–107, 2004.

Ralf Lämmel and Simon Peyton Jones. Scrap your boilerplate: a practical design pattern for generic programming. In TLDI 2003, pages 26–37, 2003.

Ralf Lämmel and Simon Peyton Jones. Scrap your boilerplate with class: extensible generic functions. In ICFP 2005, pages 204–215. ACM Press, September 2005.

Ralf Lämmel and Joost Visser. Typed Combinators for Generic Traversal. In Proc. Practical Aspects of Declarative Programming (PADL 2002), volume 2257 of LNCS, pages 137–154. Springer-Verlag, January 2002.

Conor McBride and Ross Paterson. Applicative programming with effects. Journal of Functional Programming, 18(1):1–13, 2008.

Neil Mitchell and Stefan O'Rear. Derive home page, May 2009. URL http://community.haskell.org/~ndm/derive/.

Neil Mitchell and Colin Runciman. Uniform boilerplate and list processing. In Haskell '07: Proceedings of the ACM SIGPLAN workshop on Haskell, pages 49–60, New York, NY, USA, 2007. ACM.

Bruno C. d. S. Oliveira, Ralf Hinze, and Andres Löh. Extensible and modular generics for the masses. In Henrik Nilsson, editor, Trends in Functional Programming (TFP 2006), April 2007.

Alexey Rodriguez, Johan Jeuring, Patrik Jansson, Alex Gerdes, Oleg Kiselyov, and Bruno C. d. S. Oliveira. Comparing libraries for generic programming in Haskell. In Haskell '08: Proceedings of the first ACM SIGPLAN symposium on Haskell, pages 111–122, New York, NY, USA, 2008. ACM.

Dipanwita Sarkar, Oscar Waddell, and R. Kent Dybvig. A nanopass infrastructure for compiler education. In ICFP 2004, pages 201–212. ACM Press, 2004.

Tim Sheard and Simon Peyton Jones. Template metaprogramming for Haskell. In Manuel M. T. Chakravarty, editor, ACM SIGPLAN Haskell Workshop 02, pages 1–16. ACM Press, October 2002.

Stephanie Weirich. RepLib: a library for derivable type classes. In Haskell '06: Proceedings of the 2006 ACM SIGPLAN workshop on Haskell, pages 1–12, New York, NY, USA, 2006. ACM.

Peter H. Welch and Fred R. M. Barnes. Communicating Mobile Processes: introducing occam-pi. In 25 Years of CSP, volume 3525 of Lecture Notes in Computer Science, pages 175–210. Springer Verlag, April 2005.

Noel Winstanley. Reflections on instance derivation. In 1997 Glasgow Workshop on Functional Programming. BCS Workshops in Computer Science, September 1997.


[Bar charts omitted: time (normalised) against approach (EMGM Mod/Std, Smash Mod/Std, Alloy, SYB), with bars for optimisation levels 0–2.]

Figure 4. Effect of compiler and optimisation for each approach in (a) the OmniName benchmark and (b) the BTree benchmark. Each approach has two sets of three bars; the left-hand set is GHC 6.8, the right-hand set is GHC 6.10. Each set contains a bar per optimisation level. Each approach has its times (lower is better) normalised to GHC 6.10, Opt. Level 1, so numbers can only be compared within each approach. There is little difference between optimisation levels 1 and 2 for any approach, but both show an improvement over optimisation level 0. Speed differs little by compiler version, except that EMGM was much faster under GHC 6.8 at optimisation levels 1 and 2 in OmniName, and in the BTree benchmark Smash and Alloy were slightly faster (at optimisation levels 1 and 2) in GHC 6.8.

Compiler   Optimisation   EMGM Mod.       EMGM Std.        Smash Mod.      Smash Std.       Alloy           SYB Mod.
GHC 6.8    Opt0           3.448 (0.067)   20.669 (0.364)   4.963 (0.056)   34.394 (0.675)   1.536 (0.013)   49.309 (0.275)
           Opt1           1.259 (0.007)   15.832 (0.096)   1.703 (0.015)    6.323 (0.010)   0.730 (0.005)   16.559 (0.233)
           Opt2           1.266 (0.007)   16.278 (0.136)   1.690 (0.017)    6.334 (0.011)   0.627 (0.005)   19.180 (0.061)
GHC 6.10   Opt0           3.526 (0.047)   19.894 (0.143)   5.128 (0.045)   32.101 (0.420)   1.542 (0.012)   53.937 (0.122)
           Opt1           2.096 (0.020)   17.183 (0.165)   1.432 (0.029)    6.760 (0.016)   0.864 (0.015)   17.633 (0.140)
           Opt2           2.085 (0.022)   14.930 (0.087)   1.833 (0.032)    6.754 (0.021)   0.848 (0.011)   18.756 (0.074)

Table 2. An illustrative table of results for one of our test inputs for the OmniName benchmark. Means are wall-clock times (measured in seconds) for 50 traversals, followed in brackets by standard deviations.

Compiler   Optimisation   EMGM Mod.       EMGM Std.        Smash Mod.      Smash Std.       Alloy           SYB Mod.
GHC 6.8    Opt0           3.123 (0.058)   19.189 (0.344)   5.948 (0.074)   39.748 (0.965)   2.066 (0.009)   105.791 (0.548)
           Opt1           0.983 (0.018)   13.118 (0.352)   1.692 (0.049)    6.541 (0.102)   1.013 (0.057)    22.826 (0.055)
           Opt2           1.106 (0.028)   14.169 (0.453)   1.598 (0.056)    6.620 (0.131)   0.598 (0.013)    21.986 (0.170)
GHC 6.10   Opt0           3.219 (0.039)   20.596 (0.152)   5.926 (0.042)   34.415 (0.610)   2.068 (0.013)   109.272 (0.486)
           Opt1           1.560 (0.017)   14.891 (0.152)   1.600 (0.018)    7.056 (0.082)   0.859 (0.006)    17.636 (0.051)
           Opt2           1.432 (0.013)   13.377 (0.092)   1.813 (0.010)    6.896 (0.077)   0.845 (0.003)    19.007 (0.026)

Table 3. An illustrative table of results for one of our test inputs for the FPName benchmark. Means are wall-clock times (measured in seconds) for 50 traversals, followed in brackets by standard deviations.

Compiler   Optimisation   EMGM            Smash           Alloy           SYB
GHC 6.8    Opt0           1.488 (0.025)   2.152 (0.027)   2.112 (0.025)   9.214 (0.074)
           Opt1           0.793 (0.015)   0.868 (0.012)   0.916 (0.022)   3.603 (0.038)
           Opt2           0.796 (0.017)   0.905 (0.017)   0.854 (0.022)   3.668 (0.049)
GHC 6.10   Opt0           1.543 (0.019)   2.245 (0.017)   1.999 (0.016)   9.798 (0.056)
           Opt1           0.810 (0.009)   1.058 (0.010)   1.021 (0.018)   3.484 (0.029)
           Opt2           0.813 (0.010)   1.054 (0.012)   1.019 (0.025)   3.481 (0.031)

Table 4. The results for the BTree benchmark for all four generics approaches. Means are wall-clock times (measured in seconds) for 100 traversals, followed in brackets by standard deviations.


Type-Safe Observable Sharing in Haskell

Andy Gill
Information Technology and Telecommunication Center
Department of Electrical Engineering and Computer Science
The University of Kansas
2335 Irving Hill Road
Lawrence, KS 66045
[email protected]

Abstract

Haskell is a great language for writing and supporting embedded Domain Specific Languages (DSLs). Some form of observable sharing is often a critical capability for allowing so-called deep DSLs to be compiled and processed. In this paper, we describe and explore uses of an IO function for reification which allows direct observation of sharing.

Categories and Subject Descriptors D.3.3 [Programming Languages]: Language Constructs and Features—Data types and structures

General Terms Design, Languages

Keywords Observable Sharing, DSL Compilation

1. Introduction

Haskell is a great host language for writing Domain Specific Languages (DSLs). There is a large body of literature and community know-how on embedding languages inside functional languages, including shallow embedded DSLs, which act directly on a principal type or types, and deep embedded DSLs, which construct an abstract syntax tree that is later evaluated. Both of these methodologies offer advantages over directly parsing and compiling (or interpreting) a small language. There is, however, a capability gap between a deep DSL and a compiled DSL, including observable sharing of syntax trees. This sharing can notate the sharing of computed results, as well as notating loops in computations. Observing this sharing can be critical to the successful compilation of our DSLs, but breaks a central tenet of pure functional programming: referential transparency.

In this paper, we introduce a new, retrospectively obvious way of adding observable sharing to Haskell, and illustrate its use on a number of small case studies. The addition makes nominal impact on an abstract language syntax tree; the tree itself remains a purely functional value, and the shape of this tree guides the structure of a graph representation in a direct and principled way. The solution makes good use of constructor classes and type families to provide a type-safe graph detection mechanism.

Any direct solution to observable sharing, by definition, will break referential transparency. We restrict our sharing using the class type system to specific types, and argue that we provide a reasonable compromise to this deficiency. Furthermore, because we observe sharing on regular Haskell structures, we can write, reason about, and invoke pure functions with the same abstract syntaxes sans observable sharing.

2. Observable Sharing and Domain Specific Languages

At the University of Kansas, we are using Haskell to explore the description of hardware and system-level concerns in a way that is suitable for processing and extracting properties. As an example, consider a simple description of a bit-level parity checker.

This circuit takes a stream of (clocked) bits, and does a parity count of all the bits, using a bit register. Given some Haskell functions as our primitives, we can describe this circuit in a similar fashion to Lava (Bjesse et al. 1998), Hawk (Matthews et al. 1998), and Hydra (O'Donnell 2002). For example, the primitives may take the form

-- DSL primitives
xor   :: Bit -> Bit -> Bit
delay :: Bit -> Bit

where xor is a function which takes two arguments of the abstract type Bit, performing a bit-wise xor operation, and delay takes a single Bit argument, and outputs the bit value on the previous clock cycle (via a register or latch). Jointly these primitives provide an interface to a µLava.


These abstract primitives allow for a concise specification of our circuits using the following Haskell.

-- Parity specification
parity :: Bit -> Bit
parity input = output
  where
    output = xor (delay output) input

We can describe our primitives using a shallow DSL, where Bit is a stream of boolean values, and xor and delay act directly on values of type Bit to generate a new value, also of type Bit.

-- Shallow embedding
newtype Bit = Bit [Bool]

xor :: Bit -> Bit -> Bit
xor (Bit xs) (Bit ys) = Bit $ zipWith (/=) xs ys

delay :: Bit -> Bit
delay (Bit xs) = Bit $ False : xs

run :: (Bit -> Bit) -> [Bool] -> [Bool]
run f bs = rs
  where
    (Bit rs) = f (Bit bs)

Hawk used a similar shallow embedding to provide semantics for its primitives, which could be simulated, but the meaning of a specific circuit could not be directly extracted. In order to construct a DSL that allows extraction, we can give our primitives an alternative deep embedding. In a deep embedding, primitives are simply Haskell data constructors, and a circuit description becomes a Haskell syntax tree.

-- New, deep embedding
data Bit = Xor Bit Bit
         | Delay Bit
         | Input [Bool]
         | Var String
         deriving Show

xor   = Xor
delay = Delay

run :: (Bit -> Bit) -> [Bool] -> [Bool]
run f bs = interp (f (Input bs))

interp :: Bit -> [Bool]
interp (Xor b1 b2) = zipWith (/=) (interp b1) (interp b2)
interp (Delay b)   = False : interp b
interp (Input bs)  = bs
interp (Var v)     = error $ "Var not supported"

The run function has the same behavior as the run in the shallow DSL, but has a different implementation. An interpreter function acts as a supporting literal interpreter of the Bit data structure.

> run parity (cycle [True])
[True,False,True,False,True,...

The advantage of a deep embedding over a shallow embedding is that a deep embedding can be extracted directly for processing and analysis by other functions and tools, simply by reading the data type which encodes the DSL. Our circuit is a function, Bit -> Bit, so we provided the argument (Var "x"), where "x" is unique to this circuit, giving us a Bit, with the Var being a placeholder for the argument.

Unfortunately, if we consider the structure of parity, it contains a loop, introduced via the output binding being used as an argument to delay when defining output.

> parity (Var "x")
Xor (Delay (Xor (Delay (Xor (Delay (Xor (...

This looping structure can be used for interpretation, but not for further analysis, pretty printing, or general processing. The challenge here, and the subject of this paper, is how to allow trees extracted from Haskell-hosted deep DSLs to have observable back-edges, or more generally, observable sharing. This is a well-understood problem, with a number of standard solutions.

• Cycles can be outlawed in the DSL, and instead be encoded inside explicit looping constructors, which include, implicitly, the back edge. These combinators take and return functions that operate over circuits. This was the approach taken by Sharp (2002). Unfortunately, using these combinators is cumbersome in practice, forcing a specific style of DSL idiom for all loops. This is the direct analog of programming recursion in Haskell using fix.

• Explicit Labels can be used to allow later recovery of a graph structure, as proposed by O'Donnell (1992). This means passing an explicit name supply for unique names, or relying on the user to supply them; neither are ideal and both obfuscate the essence of the code expressed by the DSL.

• Monads, or other categorical structures, can be used to generate unique labels implicitly, or capture a graph structure as a net-list directly. This is the solution used in the early Lava implementations (Bjesse et al. 1998), and continued in Xilinx Lava (Singh and James-Roxby 2001). It is also the solution used by Baars and Swierstra (2004), where they use applicative functors rather than monads. Using categorical structures directly impacts the type of a circuit, and our parity function would now be required to have the type

parity :: Bit -> M Bit

Tying the knot of the back edges can no longer be performed using the Haskell where clause; instead the non-standard recursive-do mechanism (Erkok and Launchbury 2002) is used.

• References can be provided as a non-conservative extension (Claessen and Sands 1999). This is the approach taken by Chalmers Lava, where a new type Ref is added, and pointer equality over Ref is possible. This non-conservative extension is not to everyone's taste, but does neatly solve the problem of observable sharing. Chalmers Lava's principal structure contains a Ref at every node.

In this paper, we advocate another approach to the problem of observable sharing, namely an IO function that can observe sharing directly. Specifically, this paper makes the following contributions.

• We present an alternative method of observable sharing, using stable names and the IO monad. Surprisingly, it turns out that our graph reification function can be written as a reusable component in a small number of lines of Haskell. Furthermore, our solution to observable sharing may be more palatable to the community than the Ref type, given we accept IO functions routinely.


• We make use of type functions (Chakravarty et al. 2005), a recent addition to the Haskell programmers' portfolio of tricks, and therefore act as a witness to the usefulness of this new extension.

• We illustrate our observable sharing library using a small number of examples including digital circuits and state diagrams.

• We extend our single type solution to handle Haskell trees containing different types of nodes. This extension critically depends on the design decision to use type families to denote that differently typed nodes map to a shared type of graph node.

• We illustrate this extension being used to capture deep DSLs containing functions, as well as data structures, considerably extending the capturing potential of our reify function.

Our solution is built on the StableName extension in GHC (Peyton Jones et al. 1999), which allows for a specific type of pointer equality. The correctness and predictability of our solution depends on the properties of the StableName implementation, a point we return to in section 12.

3. Representing Sharing in Haskell

Our solution to the observable sharing problem addresses the problem head on. We give specific types the ability to have their sharing observable, via a reify function which translates a tree-like data structure into a graph-like data structure, in a type-safe manner. We use the class type system and type functions to allow Haskell programmers to provide the necessary hooks for specific data structures, typically abstract syntax trees that actually capture abstract syntax graphs.

There are two fundamental issues with giving a type and implementation to such a reify function. First, how do we allow a graph to share a typed representation with a tree? Second, observable sharing introduces referential opaqueness, destroying referential transparency: a key tenet of functional programming. How do we contain – and reason about – referential opaqueness in Haskell? In this section, we introduce our reify function, and honestly admit opaqueness by making the reify function an IO function.

Graphs in Haskell can be represented using a number of idioms, but we use a simple associated list of pairs containing Uniques as node names, and node values.

type Unique = Int

data BitGraph = BitGraph [(Unique, BitNode Unique)]
                         Unique

data BitNode s = GraphXor s s
               | GraphDelay s
               | GraphInput [Bool]
               | GraphVar String

We parameterize BitNode over the Unique graph “edges”, to facilitate future generic processors for our nodes.

Considering the parity example, we might represent the sharing using the following expression.

graph = BitGraph [ (1, GraphXor 2 3)
                 , (2, GraphDelay 1)
                 , (3, GraphVar "x")
                 ]
                 1

This format is a simple and direct net-list representation. If we can generate this graph, then using smarter structures like Data.Map downstream in a compilation process is straightforward. Given a Functor instance for BitNode, we can generically change the types of our node labels.
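A minimal sketch of such a Functor instance (not spelled out in the paper), together with a hypothetical relabelNodes helper:

instance Functor BitNode where
  fmap f (GraphXor a b)  = GraphXor (f a) (f b)
  fmap f (GraphDelay a)  = GraphDelay (f a)
  fmap _ (GraphInput bs) = GraphInput bs
  fmap _ (GraphVar nm)   = GraphVar nm

-- Relabel every node in a captured graph, e.g. from Int keys to richer labels.
relabelNodes :: (Unique -> lbl) -> BitGraph -> [(lbl, BitNode lbl)]
relabelNodes f (BitGraph nodes _root) = [ (f u, fmap f n) | (u, n) <- nodes ]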

We can now introduce the type of a graph reification function.

reifyBitGraph :: Bit -> IO BitGraph

With this function, and provided we honor any preconditions of its use, embedding our µLava in a way that can have sharing extracted is trivial. Of course, the IO monad is needed. Typically, this reify replaces either a parser (which would use IO), or will call another IO function later in a pipeline, for example to write out VHDL from the BitGraph or display the graph graphically. Though the use of IO is not present in all usage models, having IO does not appear to be a handicap to this function.

4. Generalizing the Reification Function

We can now generalize reifyBitGraph into our generic graph reification function, called reifyGraph. There are three things reifyGraph needs to be able to do

• First, have a target type for the graph representation to use as a result.

• Second, be able to look inside the Haskell value under consideration, and traverse its structure.

• Third, be able to build a graph from this traversal.

We saw all three of these capabilities in our reifyBitGraph example. We can incorporate these ideas, and present our generalized graph reification function, reifyGraph.

reifyGraph :: (MuRef t)
           => t -> IO (Graph (DeRef t))

The type for reifyGraph says, given the ability to look deep inside a structure, provided by the type class MuRef, and the ability to derive the shared, inner data type, provided by the type function DeRef, we can take a tree of a type that has a MuRef instance, and build a graph.

The Graph data structure is the generalization of BitGraph, with nodes of the higher-kinded type e, and a single root.

type Unique = Int

data Graph e = Graph [(Unique, e Unique)]
                     Unique

Type functions and associated types (Chakravarty et al. 2005) are a recent addition to Haskell. reifyGraph uses a type function to determine the type of the nodes inside the graph. Associated types allow the introduction of data and type declarations inside a class declaration; a very useful addition indeed. This is done by literally providing type functions which look like standard Haskell type constructors, but instead use the existing class-based overloading system to help resolve the function. In our example, we have the type class MuRef, and the type function DeRef, giving the following (incomplete) class declaration.

class MuRef a where
  type DeRef a :: * -> *
  ...


This class declaration creates a type function DeRef which acts like a type synonym inside the class; it does not introduce any constructors or abstraction. The * -> * annotation gives the kind of DeRef, meaning it takes two type arguments, the relevant instance of MuRef, and another, as yet unseen, argument. DeRef can be assigned to any type of the correct kind, inside each instance.

In our example above, we want trees of type Bit to be represented as a graph of BitNode, so we provide the MuRef instance.

instance MuRef Bit where
  type DeRef Bit = BitNode
  ...

BitNode is indeed of kind * -> *, so the type of our reifyGraph function specializes, in the case of Bit, to

reifyGraph :: Bit -> IO (Graph (DeRef Bit))

then, because of the type function DeRef, to

reifyGraph :: Bit -> IO (Graph BitNode)

The use of the type function DeRef to find the BitNode data type is critical to tying the input tree type to the node representation type, though functional dependencies (Jones and Diatchki 2008) could also be used here.
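For comparison, a functional-dependency formulation might look like the following sketch (a hypothetical MuRefFD class, not part of the library):

{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

class MuRefFD a node | a -> node where
  mapDeRefFD :: Applicative f => (a -> f u) -> a -> f (node u)

instance MuRefFD Bit BitNode where
  mapDeRefFD f (Xor a b)  = GraphXor <$> f a <*> f b
  mapDeRefFD f (Delay b)  = GraphDelay <$> f b
  mapDeRefFD f (Input bs) = pure (GraphInput bs)
  mapDeRefFD f (Var nm)   = pure (GraphVar nm)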

The MuRef class has the following definition.

class MuRef a where
  type DeRef a :: * -> *
  mapDeRef :: (Applicative f)
           => (a -> f u)
           -> a
           -> f (DeRef a u)

mapDeRef allows us, in a generic way, to reach into something that has an instance of the MuRef class and recurse over the relevant children. The first argument is a function that is applied to the children; the second is the node under consideration. mapDeRef returns a single node, the type of which is determined by the DeRef type function, for recording in a graph structure. The result value contains unique indices, of type u, which were generated by the invocation of the first argument. mapDeRef uses an applicative functor (McBride and Paterson 2006) to provide the threading of the effect of unique name generation.
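Purely as an illustration (the library threads this effect in IO), the State applicative from the mtl package can stand in for the name supply, using the MuRef Bit instance given just below; numberChildren is a hypothetical helper:

import Control.Monad.State

-- A fresh-label supply: return the current label, then bump the counter.
freshU :: State Unique Unique
freshU = do
  u <- get
  put (u + 1)
  return u

-- Label the immediate children of one Bit node, from left to right.
numberChildren :: Bit -> (BitNode Unique, Unique)
numberChildren b = runState (mapDeRef (\_ -> freshU) b) 0

-- numberChildren (Xor (Var "a") (Var "b"))  gives  (GraphXor 0 1, 2)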

To complete our example, we make Bit an instance of the MuRef class, and provide the DeRef and mapDeRef definitions.

instance MuRef Bit where
  type DeRef Bit = BitNode
  mapDeRef f (Xor a b)  = GraphXor <$> f a <*> f b
  mapDeRef f (Delay b)  = GraphDelay <$> f b
  mapDeRef f (Input bs) = pure $ GraphInput bs
  mapDeRef f (Var nm)   = pure $ GraphVar nm

This is a complete definition of the necessary generics to provide reifyGraph with the ability to perform type-safe observable sharing on the type Bit. The form of mapDeRef is regular, and could be automatically derived, perhaps using Template Haskell (Sheard and Peyton Jones 2002). With this instance in place, we can use our general reifyGraph function to extract our graph.

> reifyGraph $ parity (Var "x")
Graph [ (1, GraphXor 2 3)
      , (2, GraphDelay 1)
      , (3, GraphVar "x")
      ] 1

The reifyGraph function is surprisingly general, easy to enable via the single instance declaration, and useful in practice. We now look at a number of use cases and extensions to reifyGraph, before turning to its implementation.

5. Example: Finite State Machines

As a simple example, take the problem of describing a state machine directly in Haskell. This is easy but tedious because we need to enumerate or label the states. Consider this state machine, a 5-7 convolutional encoder for a Viterbi decoder.

[State-transition diagram omitted: four states 00, 01, 10 and 11, with edges labelled input/output 0/00, 1/11, 0/11, 1/00, 0/01, 1/10, 0/10 and 1/01.]

One possible encoding is a step function, which takes the input and the current state, and returns the output and a new state. Assuming that we use Boolean to represent 0 and 1, in the input and output, we can write the following Haskell.

data State  = ZeroZero | ZeroOne | OneZero | OneOne
type Input  = Bool
type Output = (Bool,Bool)

step :: Input -> State -> (Output,State)
step False ZeroZero = ((False,False),ZeroZero)
step True  ZeroZero = ((True ,True ),ZeroOne)
step False ZeroOne  = ((True ,True ),OneOne)
step True  ZeroOne  = ((False,False),OneZero)
step False OneZero  = ((False,True ),ZeroZero)
step True  OneZero  = ((True ,False),ZeroOne)
step False OneOne   = ((True ,False),OneZero)
step True  OneOne   = ((False,True ),OneOne)

An arguably more declarative encoding is to use the binding as the state's unique identifier.

data State i o = State [(i,(o,State i o))]

step :: (Eq i) => i -> State i o -> (o, State i o)
step i (State ts) = (output, st)
  where Just (output, st) = lookup i ts

state00 = State [ (False, ((False,False), state01))
                , (True,  ((True ,True ), state00)) ]
state01 = State [ (False, ((True ,True ), state11))
                , (True,  ((False,False), state10)) ]
state10 = State [ (False, ((False,True ), state00))
                , (True,  ((True ,False), state01)) ]
state11 = State [ (False, ((True ,False), state10))
                , (True,  ((False,True ), state11)) ]


Simulating this binding-based state machine is possible in pure Haskell.

run :: (Eq i) => State i o -> [i] -> [o]
run st (i:is) = o : run st' is
  where (o, st') = step i st
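For example (a sketched GHCi session, not shown in the paper), driving state00 with an alternating input stream gives:

> take 4 (run state00 (cycle [True,False]))
[(True,True),(False,False),(False,False),(False,True)]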

Extracting the sharing, for example to allow display in the graph-viewing tool dot (Ellson et al. 2003), is not possible in a purely functional setting. Extracting the sharing using our reifyGraph allows the deeper embedding to be gathered, and other tools can manipulate and optimize this graph.

data StateNode i o s = StateNode [ (i, (o, s)) ]
  deriving Show

instance MuRef (State i o) where
  type DeRef (State i o) = StateNode i o
  mapDeRef f (State st) = StateNode <$> traverse tState st
    where
      tState (b, (o, s)) = (\ s' -> (b, (o, s'))) <$> f s

Here, traverse (from the Traversable class) is a traversal over the list type. Now we extract our graph.

> reifyGraph state00
Graph [ (1, StateNode [(False,((False,False),2))
                      ,(True, ((True ,True ),1))])
      , (2, StateNode [(False,((True ,True ),3))
                      ,(True, ((False,False),4))])
      , (3, StateNode [(False,((True ,False),4))
                      ,(True, ((False,True ),3))])
      , (4, StateNode [(False,((False,True ),1))
                      ,(True, ((True ,False),2))])
      ] 1

6. Example: Kansas Lava

At the University of Kansas, we are developing a custom version of Lava, for teaching and as a research platform. The intention is to allow for higher-level abstractions, as supported by the Hawk DSL, but also to allow circuit synthesis, as supported by Lava. Capturing our Lava DSL in a general manner was the original motivation behind revisiting the design decision of using references for observable sharing in Chalmers Lava (Claessen 2001). In this section, we outline our design of the front end of Kansas Lava, and how it uses reifyGraph.

The principal type in Kansas Lava is Signal, which is a phantom type (Leijen and Meijer 1999) abstraction around Wire, the internal type of a circuit.

newtype Signal a = Signal Wire

newtype Wire = Wire (Entity Wire)

Entity is a node in our circuit graph, which can represent gate-level circuits, as well as more complex blocks.

data Entity s
  = Entity Name [s]   -- an entity
  | Pad Name          -- an input pad
  | Lit Integer       -- a constant

and2 :: (Signal a, Signal a) -> Signal a
and2 (Signal w1, Signal w2)
  = Signal $ Wire $ Entity (name "and2") [w1,w2]

...

In both Kansas Lava and Chalmers Lava, phantom types are used to allow construction of semi-sensible circuits. For example, a mux will take a Signal Bool as its input, but switch between polymorphic signals.

mux :: Signal Bool
    -> (Signal a, Signal a)
    -> Signal a
mux (Signal s) (Signal w1, Signal w2)
  = Signal
  $ Wire
  $ Entity (name "mux") [s,w1,w2]

Even though we construct trees of type Signal, we want to observe graphs of type Wire, because every Signal is a constructor wrapper around a tree of Wire. We share the same node data type between our Haskell tree underneath Signal, and inside our reified graph. So Entity is parametrized over its inputs, which are Wires for our circuit specification tree, and are Unique labels in our graph. This allows some reuse of traversals, and we use instances of the Traversable, Functor and Foldable classes to help here.

Our MuRef instance therefore has the form:

instance MuRef Wire where
  type DeRef Wire = Entity
  mapDeRef f (Wire s) = traverse f s

We also define instances for the classes Traversable, Foldable and Functor, which are of general usefulness for performing other transformations, specifically:

instance Traversable Entity where
  traverse f (Entity v ss) = Entity v <$> traverse f ss
  traverse _ (Pad v)       = pure $ Pad v
  traverse _ (Lit i)       = pure $ Lit i

instance Foldable Entity where
  foldMap f (Entity v ss) = foldMap f ss
  foldMap _ (Pad v)       = mempty
  foldMap _ (Lit i)       = mempty

instance Functor Entity where
  fmap f (Entity v ss) = Entity v (fmap f ss)
  fmap _ (Pad v)       = Pad v
  fmap _ (Lit i)       = Lit i

Now, with our Kansas Lava hardware specification graph captured inside our Graph representation via reifyGraph, we can perform simple translations, and pretty print to VHDL and other targets.
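For instance, a first netlist-style dump might look like the following sketch (a hypothetical printNetlist, assuming a Show instance for Name; the real back ends are more involved):

printNetlist :: Graph Entity -> IO ()
printNetlist (Graph nodes root) = do
  putStrLn ("-- root is wire " ++ show root)
  mapM_ pr nodes
  where
    pr (u, Entity nm ins) =
      putStrLn ("w" ++ show u ++ " <- " ++ show nm ++ " "
                ++ unwords [ "w" ++ show i | i <- ins ])
    pr (u, Pad nm) = putStrLn ("w" ++ show u ++ " <- pad " ++ show nm)
    pr (u, Lit i)  = putStrLn ("w" ++ show u ++ " <- lit " ++ show i)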


7. Comparing reifyGraph and Ref types

Chalmers Lava uses Ref types, which admit pointer equality. The interface to Ref types has the following form.

data Ref a = ...
instance Eq (Ref a)
ref   :: a -> Ref a
deref :: Ref a -> a

An abstract type Ref can be used to box polymorphic values, via the (unsafe) function ref, and Ref admits equality without looking at the value inside the box. Ref works by generating a new, unique label for each call to ref. So a possible implementation is

data Ref a = Ref a Unique

instance Eq (Ref a) where
  (Ref _ u1) == (Ref _ u2) = u1 == u2

ref a = unsafePerformIO $ do
  u <- newUnique
  return $ Ref a u

deref (Ref a _) = a

with the usual caveats associated with the use of unsafePerformIO.

To illustrate a use case, consider a transliteration of Chalmers Lava to use the same names as Kansas Lava. We can use a Ref type at each node, by changing the type of Wire, and reflecting this change into our DSL functions.

-- Transliteration of Chalmers Lava
newtype Signal s = Signal Wire

newtype Wire = Wire (Ref (Entity Wire))

data Entity s
  = Entity Name [s]
  | ...

and2 :: Signal a -> Signal a -> Signal a
and2 (Signal w1) (Signal w2)
  = Signal
  $ Wire
  $ ref
  $ Entity (name "and2") [w1,w2]

The differences between this definition and the Kansas Lava definition are

• The type Wire includes an extra Ref indirection;
• The DSL primitives include an extra ref.

Wire in Chalmers Lava admits observable sharing directly, while Kansas Lava only admits observable sharing using reifyGraph. The structure in Kansas Lava can be consumed by an alternative, purely functional simulation function, without the possibility of accidentally observing sharing. Furthermore, reifyGraph can operate over an arbitrary type, and does not need to be wired into the datatype. This leaves open a new possibility: observing sharing on regular Haskell structures like lists, rose trees, and other structures. This is the subject of the next section.

8. Lists, and Other Structures

In the Haskell community, recursive types are sometimes tied using a Mu type (Jones 1995). For example, consider a list specified in this fashion.

newtype Mu a = In (a (Mu a))

data List a b = Cons a b | Nil

type MyList a = Mu (List a)

Now, we can write a list using Cons, Nil, and In for recursion. The list [1,2,3] would be represented using the following expression.

In (Cons 1 (In (Cons 2 (In (Cons 3 (In Nil))))))

The generality of the recursion, captured by Mu, allows a general instance of MuRef for Mu. Indeed, this is why MuRef is called MuRef.

instance (Traversable a) => MuRef (Mu a) where
  type DeRef (Mu a) = a
  mapDeRef = traverse

This generality is possible because we are sharing the representation between structures. Mu is used to express a tree-like structure, where Graph, given the same type argument, will express a directed graph. In order to use MuRef, we need Traversable, and therefore need to provide the instances for Functor, Foldable, and Traversable.

instance Functor (List a) where
  fmap f Nil        = Nil
  fmap f (Cons a b) = Cons a (f b)

instance Foldable (List a) where
  foldMap f Nil        = mempty
  foldMap f (Cons a b) = f b

instance Traversable (List a) where
  traverse f (Cons a b) = Cons a <$> f b
  traverse f Nil        = pure Nil

Now a list, written using Mu, can have its sharing observed.

> let xs = In (Cons 99 (In (Cons 100 xs)))
> reifyGraph xs
Graph [ (1, Cons 99 2)
      , (2, Cons 100 1)
      ] 1

The type List is used both for expressing trees and graphs. We can reuse List and the instances of List to observe sharing in regular Haskell lists.

instance MuRef [a] where
  type DeRef [a] = List a
  mapDeRef f (x:xs) = Cons x <$> f xs
  mapDeRef f []     = pure Nil

That is, regular Haskell lists are represented as a graph, using List, and Mu List lists are also represented as a graph, using List. Now we can capture spine-level sharing in our list.


> let xs = 99 : 100 : xs
> reifyGraph xs
Graph [ (1, Cons 99 2)
      , (2, Cons 100 1)
      ] 1

There is no way to observe built-in Haskell data structures using Ref, which is an advantage of our reify-based observable sharing.

A list spine, being one-dimensional, means that sharing will always be represented via back-edges. A tree can have both loops and acyclic sharing. One question we can ask is: can we capture the second level of sharing in a list? That is, is it possible to observe the difference between

let x = X 1 in [x,x] and [X 1,X 1]

using reifyGraph? Alas, no, because the type of the element of a list is distinct from the type of the list itself. In the next section, we extend reifyGraph to handle nodes of different types inside the same reified graph.

9. Observable Sharing at Different Types

The nodes of the graph inside the runtime system of Haskell programs have many different types. In order to successfully extract deeper into our DSL, we want to handle nodes of different types. GHC Haskell already provides the Dynamic type, which is a common type for use with collections of values of different types. The operations are

data Dynamic = ...
toDyn       :: Typeable a => a -> Dynamic
fromDynamic :: Typeable a => Dynamic -> Maybe a

Dynamic is a monomorphic Haskell object, stored with its type. fromDynamic succeeds when the Dynamic was constructed and extracted at the same type. Attempts to use fromDynamic at an incorrect type always return Nothing. The class Typeable is derivable automatically, as well as being provided for all built-in types. So we have

> fromDynamic (toDyn "Hello") :: Maybe String
Just "Hello"
> fromDynamic (toDyn (1,2)) :: Maybe String
Nothing

In this way Dynamic provides a type-safe cast.

In our extended version of reifyGraph, we require all nodes that need to be compared for observational equality to be a member of the class Typeable, including the root of the Haskell structure we are observing. This gives the type of the extended reifyGraph.

reifyGraph :: (MuRef s, Typeable s)
           => s -> IO (Graph (DeRef s))

The trick to reifying nodes of different types into one graph is to have a common type for the graph representation. That is, if we have a type A and a type B, then we can share a graph that is captured to Graph C, provided that DeRef A and DeRef B both map to C. We can express this using the new ~ notation for type equivalence.

Specifically, the type

example :: (DeRef a ~ DeRef [a]) => [a]

expresses that a and [a] both share the same graph node type.

In order to observe sharing on nodes of types that are Typeable, and share a graph representation type, we refine the type of mapDeRef. The refined MuRef class has the following definition.

class MuRef a where
  type DeRef a :: * -> *

  mapDeRef :: (Applicative f)
           => (forall b . ( MuRef b
                          , Typeable b
                          , DeRef a ~ DeRef b ) => b -> f u)
           -> a
           -> f (DeRef a u)

mapDeRef has a rank-2 polymorphic functional argument for processing sub-nodes, when walking over a node of type a. This functional argument requires that

• The sub-node be a member of the class MuRef;
• The sub-node be Typeable, so that we can use Dynamic internally;
• Finally, the graph representation of the a node and the graph representation of the b node are the same type.

We can use this version of MuRef to capture sharing at different types. For example, consider the structure

let xs = [1..3]
    ys = 0 : xs
in cycle [xs,ys,tail ys]

There are three types inside this structure: [[Int]], [Int], and Int. This means we need two instances, one for lists with element types that can be reified, and one for Int, and a common data type to represent the graph nodes.

data Node u = Cons u u
            | Nil
            | Int Int

instance ( Typeable a, MuRef a
         , DeRef [a] ~ DeRef a ) => MuRef [a] where
  type DeRef [a] = Node

  mapDeRef f (x:xs) = Cons <$> f x <*> f xs
  mapDeRef f []     = pure Nil

instance MuRef Int where
  type DeRef Int = Node

  mapDeRef f n = pure $ Int n

The Node type is our reified graph node structure, with three possible constructors: Cons and Nil for lists (of type [Int] or type [[Int]]), and Int, which represents an Int.


[Graph figure omitted.]

Figure 1. Sharing within structures of different types

Reifying the example above now succeeds, giving

> reifyGraph (let xs = [1..3]
>                 ys = 0 : xs
>             in cycle [xs,ys,tail ys])
Graph [ (1,Cons 2 9)
      , (9,Cons 10 12)
      , (12,Cons 2 1)
      , (10,Cons 11 2)
      , (11,Int 0)
      , (2,Cons 3 4)
      , (4,Cons 5 6)
      , (6,Cons 7 8)
      , (8,Nil)
      , (7,Int 3)
      , (5,Int 2)
      , (3,Int 1)
      ] 1

Figure 1 renders this graph, showing we have successfully captured the sharing at multiple levels.

10. Observing Functions

Given we can observe structures with distinct node types, can we use the same machinery to observe functions? It turns out we can!

A traditional way of observing functions is to apply a function to a dummy argument, and observe where this dummy argument occurs inside the result expression. At first, it seems that an exception can be used for this, but there is a critical shortcoming. It is impossible to distinguish between the use of a dummy argument in a sound way and examining the argument. For example

\ x -> (1,[1..x])

gives the same result as

\ x -> (1,x)

when x is bound to an exception-raising thunk.

We can instead use the type class system, again, to help us.

class NewVar a where
  mkVar :: Dynamic -> a

Now, we can write a function that takes a function and returns the function's argument and result as a tuple.

capture :: (Typeable a, Typeable b, NewVar a)
        => (a -> b) -> (a, b)
capture f = (a, f a)
  where a = mkVar (toDyn f)

We use the Dynamic as a unique label (that does not admit equality) being passed to mkVar. To illustrate this class being used, consider a small DSL for arithmetic, modeled on the ideas for capturing arithmetic expressions used in Elliott et al. (2003).

data Exp = ExpVar Dynamic
         | ExpLit Int
         | ExpAdd Exp Exp
         | ...
  deriving (Typeable, ...)

instance NewVar Exp where
  mkVar = ExpVar

instance Num Exp where
  (+) = ExpAdd
  ...
  fromInteger n = ExpLit (fromInteger n)

With these definitions, we can capture our function

> capture (\ x -> x + 1 :: Exp)
(ExpVar ..., ExpAdd (ExpVar ...) (ExpLit 1))

The idea of passing in an explicit ExpVar constructor is an old one, and the data structure used in Elliott et al. (2003) also included an ExpVar, but required a threading of a unique String at the point a function was being examined. With observable sharing, we can observe the sharing that is present inside the capture function, and reify our function without needing these unique names.

capture gives a simple mechanism for looking at functions, but not functions inside data structures we are observing for sharing. We want to add the capture mechanism to our multi-type reification, using a Lambda constructor in the graph node data type.

instance ( MuRef a, Typeable a, NewVar a
         , MuRef b, Typeable b
         , DeRef a ~ DeRef (a -> b)
         , DeRef b ~ DeRef (a -> b) )
      => MuRef (a -> b) where
  type DeRef (a -> b) = Node

  mapDeRef f fn = let v = mkVar $ toDyn fn
                  in  Lambda <$> f v <*> f (fn v)

This is quite a mouthful! For functions of type a -> b, we need a to admit MuRef (have observable sharing), Typeable (because we are working in the multi-type observation version), and NewVar (because we want to observe the function). We need b to admit MuRef and Typeable. We also need a, b and a -> b to all share a common graph data type. When observing a graph with a function, we are actually observing the sharing created by the let v = ... inside the mapDeRef definition.


[Graph figure omitted.]

Figure 2. Sharing within structures and functions

We need to add our MuRef instance for Exp, so we can observe structures of the type Exp.

data Node u = ... | Lambda u u | Var | Add u u

instance MuRef Exp where
  type DeRef Exp = Node

  mapDeRef f (ExpVar _)   = pure Var
  mapDeRef f (ExpLit i)   = pure $ Int i
  mapDeRef f (ExpAdd x y) = Add <$> f x <*> f y

Finally, we can observe functions in the wild!

> reifyGraph (let t = [ \ x -> x :: Exp
>                     , \ x -> x + 1
>                     , \ x -> head t 9 ]
>             in t)
Graph [ (1,Cons 2 4)
      , (4,Cons 5 9)
      , (9,Cons 10 13)
      , (13,Nil)
      , (10,Lambda 11 12)
      , (12,Int 9)
      , (11,Var)
      , (5,Lambda 6 7)
      , (7,Add 6 8)
      , (8,Int 1)
      , (6,Var)
      , (2,Lambda 3 3)
      , (3,Var)
      ] 1

Figure 2 shows the connected graph that this reification produced. The left-hand edge exiting Lambda is the argument, and the right-hand edge is the expression.

In Elliott et al. (2003), an expression DSL like our example here was used to synthesize and manipulate infinite, continuous images. The DSL generated C code, allowing real-time manipulation of image parameters. In Elliott (2004), a similar expression DSL was used to generate shader assembly rendering code plus C# GUI code. A crucial piece of technology needed to make both these implementations viable was a common sub-expression eliminator, to recover lost sharing. We recover the important common sub-expressions for the small cost of observing sharing from within an IO function.

11. Implementation of reifyGraph

In this section, we present our implementation of reifyGraph. The implementation is short, and we include it in the appendix.

We provide two implementations of reifyGraph in the hackage library data-reify. The first implementation of reifyGraph is a depth-first walk over a tree at a single type, to discover structure, storing this in a list. A second implementation also performs a depth-first walk, but can observe sharing of a predetermined set of types, provided they map to a common node type in the final graph.

One surprise is that we can implement our flexible observable sharing functions in just a few lines of GHC Haskell. We use the StableName abstraction, as introduced in Peyton Jones et al. (1999), to provide our basic (typed) pointer equality, and the remainder of our implementation is straightforward Haskell programming.

Stable names are supplied in the library System.Mem.StableName, to allow pointer equality, provided the objects have been declared comparable inside an IO operation. The interface is small.

data StableName a
makeStableName :: a -> IO (StableName a)
hashStableName :: StableName a -> Int
instance Eq (StableName a)

If you are inside the IO monad, you can make a StableName from any object, and the type StableName admits Eq without looking at the original object. StableNames can be thought of as pointers, and the Eq instance as pointer equality on these pointers. Finally, hashStableName facilitates a lookup table containing StableNames, and is stable over garbage collection.
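As a small illustration (not from the paper, and subject to the caveats discussed in section 12), a session in IO might behave as follows:

import System.Mem.StableName

demo :: IO ()
demo = do
  let xs = [1, 2, 3] :: [Int]
  s1 <- makeStableName xs
  s2 <- makeStableName xs
  s3 <- makeStableName ([1, 2, 3] :: [Int])
  print (s1 == s2)           -- True in practice: both name the same heap object
  print (s1 == s3)           -- typically False: equal values, distinct objects
  print (hashStableName s1)  -- an Int suitable as a hash-table key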

We use stable names to keep a list of already-visited nodes. Our graph capture is the classical depth-first search over the graph, and does not recurse over nodes that we have already visited. reifyGraph is implemented as follows.

• We initialize two tables, one that maps StableNames (at the same type) to Uniques, and a list that maps Uniques to edges in our final node type. In the first table, we use the hashStableName facility of StableNames to improve the lookup time.
• We then call a recursive graph-walking function findNodes with the two tables stored inside MVars.
• We then return the second table, and the Unique of the root node.

Inside findNodes, for a specific node, we

• Perform seq on this node, to make sure this node is evaluated.• If we have seen this node before, we immediately return theUnique that is associated with this node.

• We then allocate a new Unique, and store it in our first MVartable, using the StableName of this node as the key.

• We use mapDeRef to recurse over the children of this node.• This returns a new node of type “DeRef s Unique”, where s is

the type we are recursing over, and DeRef is our type function.• We store the pair of the allocated unique and the value returned

by mapDeRef in a list. This list will become our graph.• We then return the Unique associated with this node.

It should be noted that the act of extracting the graph performs likea deep seq, being hyperstrict on the structure under consideration.


The Dynamic version of reifyGraph is similar to the standard reifyGraph. The first table contains Dynamics, not StableNames, and when considering a node for equality, fromDynamic is called at the current node type. If the node is of the same type as the object inside the Dynamic, then the StableName equality is used to determine pointer equality. If the node is of a different type (fromDynamic returns Nothing), then the pointer equality fails by definition.
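As a hedged illustration (our own sketch, not the paper's code), the comparison just described can be pictured as a small helper that unwraps a Dynamic back to a StableName at the current node's type; the stored entries themselves would have been made with toDyn.

import Data.Dynamic (Dynamic, fromDynamic)
import Data.Typeable (Typeable)
import System.Mem.StableName (StableName)

-- True only when the Dynamic holds a StableName at this node's type
-- and that stable name equals the one for the current node.
sameNode :: Typeable a => StableName a -> Dynamic -> Bool
sameNode st dyn =
  case fromDynamic dyn of
    Just st' -> st == st'   -- same type: fall back to StableName equality
    Nothing  -> False       -- different type: pointer equality fails by definition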

One shortcoming with the Dynamic implementation is the obscure error messages. If an instance is missing, this terse message is generated.

Top level:
  Couldn't match expected type ‘Node’
         against inferred type ‘DeRef t’

This is stating that the common type of the final Graph was expected, and for some structure was not found, but it does not state which one was not found. It would be nice if we could somehow parameterize the error messages or augment them with a secondary message.

12. Reflections on Observable Sharing

In this section, we consider both the correctness and the consequences of observable sharing. The correctness of reifyGraph depends on the correctness of StableNames. Furthermore, observing the heap, even from within an IO function, has consequences for the validity of equational reasoning and the laws that can be assumed.

In the System.Mem.StableName library, stable names are defined as providing "a way of performing fast [. . . ], not-quite-exact comparison between objects." Specifically, the only requirement on stable names is that if two stable names are equal, then "[both] were created by calls to makeStableName on the same object." This is a property that could be trivially satisfied by simply defining equality over stable names as False!

The intent of stable names is to implement the behavior of pointer equality on heap representations, while allowing the heap to use efficient encodings. In reality, the interface does detect sharing, with the advertised caveat that an object before and after evaluation may not generate stable names that are equal. In our implementation, we use the seq function to force evaluation of each graph node under observation, just before generating stable names, and this has been found to reliably detect the sharing we expect. It is unsettling, however, that we do not (yet) have a semantics of when we can and cannot depend on stable names to observe sharing.

An alternative to using stable names would be to directly examine the heap representations. Vacuum (Morrow) is a Haskell library for extracting heap representations, which gives a literal view of the heap world, and has been successfully used to both capture and visualize sharing inside Haskell structures. Vacuum has the ability to generate dot graphs for observation and does not require that a graph be evaluated before being observed.

Vacuum and reifyGraph have complementary roles. Vacuum allows the user to see a snapshot of the real-time heap without necessarily changing it, while reifyGraph provides a higher-level interface, by forcing evaluation on a specific structure, and then observing sharing on the same structure. Furthermore, reifyGraph does not require the user to understand low-level representations to observe sharing. It would certainly be possible to build reifyGraph on top of Vacuum.

Assuming a reliable observation of sharing inside reifyGraph, what are the consequences for the Haskell programmer? Claessen and Sands (1999) argue that little is lost in the presence of observable sharing in a call-by-name lazy functional language, and also observe that all Haskell implementations use a call-by-need evaluation strategy, even though the Haskell report (Peyton Jones 2003) does not require this. In Haskell, let-β, a variant of β-reduction, holds.

    let {x = M} in N  =  N[M/x]    (x ∉ M)    (1)

Over structural values, this equality is used with caution inside Haskell compilers, in either direction. To duplicate the construction of a structure is to duplicate work, and can change the time complexity of a program. To common up construction (using (1) from right to left) is also problematic, because this can be detrimental to the space complexity of a program.
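A small illustrative sketch of ours (not from the paper) showing both directions of this trade-off, assuming nothing beyond the Prelude:

-- The let-bound list is built once but retained while both traversals run.
sharedSum :: Int
sharedSum = let xs = [1 .. 1000000] in sum xs + length xs

-- Applying rule (1) left to right duplicates the construction: the list is
-- built twice, but each copy can be garbage-collected as it is consumed.
inlinedSum :: Int
inlinedSum = sum [1 .. 1000000] + length [1 .. 1000000]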

It is easy in Haskell to lose sharing, even without using (1). Consider one of the map laws.

    map id M = M    (2)

Any sharing that the spine of M has is lost in map id M. Interestingly, this loss of sharing in map is not mandated, and a version of map using memoization could preserve the sharing. This is never done, because we cannot depend on, or observe, sharing.

One place where GHC introduces unexpected sharing is when generating overloaded literals. In Kansas Lava, the term 9 + 9 unexpectedly shares the same node for the value 9.

> reifyGraph (9 + 9)
Graph [ (1,Entity + [2,2])
      , (2,Entity fromInteger [3])
      , (3,Lit 9) ] 1

Literal values are like enumerated constructors, and any user of reifyGraph must allow for the possibility of such literals being shared.

What does all this mean? We can have unexpected sharing of constants, as well as lose sharing by applying what we considered to be equality-holding transformations.

The basic guidelines for using reifyData are

• Observe only structures built syntactically. Combinators in our DSLs are lazy in their (observed) arguments, and we do not deconstruct the observed structure before reifyData.

• Assume constants and enumerated constructors may be shared, even if syntactically they are not the same expression.

There is a final guideline when using observable sharing, which is to allow a DSL to have some type of (perhaps informal) let-β rule. In the same manner as rule (1) in Haskell should only change how fast some things run and not the final outcome, interpreters using observable sharing should endeavor to use sharing to influence performance, not outcome. For example, in Lava, undetected acyclic sharing in a graph would result in extra circuitry and the same results being computed at a much greater cost. Even for undetected loops in well-formed Lava circuits, it is possible to generate circuits that work for a preset finite number of cycles.

If this guideline is followed literally, applying (1) and other equational reasoning techniques to DSLs that use observable sharing is now a familiar task for a functional programmer, because applying equational reasoning changes performance, not the final result. A sensible let-β rule might not be possible for all DSLs, but it provides a useful rule of thumb to influence the design.


13. Performance Measurements

We performed some basic performance measurements on our reifyGraph function. We ran a small number of tests observing the sharing in a binary tree, both with and without sharing, using both the original and the Dynamic reifyGraph. Each extra level of the tree doubles the number of nodes.

Tree     Original                 Dynamic
Depth    Sharing    No Sharing    Sharing    No Sharing
16       0.100s     0.154s        0.147s     0.207s
17       0.237s     0.416s        0.343s     0.519s
18       0.718s     1.704s        0.909s     2.259s
19       2.471s     7.196s        2.845s     8.244s
20       11.140s    25.707s       13.377s    32.443s

While reifyGraph is not linear, we can handle 2^20 (around a million) nodes in a few seconds.
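For reference, a hedged sketch (assumed by us, not taken from the paper) of how such benchmark trees could be built: the shared variant reuses one subtree at every level, so the heap graph has one node per level, while the unshared variant rebuilds both children.

data Tree = Leaf | Branch Tree Tree

-- A depth-n tree whose two children are the same heap object at every level.
sharedTree :: Int -> Tree
sharedTree 0 = Leaf
sharedTree n = let t = sharedTree (n - 1) in Branch t t

-- A depth-n tree with no sharing: both children are constructed separately.
unsharedTree :: Int -> Tree
unsharedTree 0 = Leaf
unsharedTree n = Branch (unsharedTree (n - 1)) (unsharedTree (n - 1))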

14. Conclusions and Further Work

We have introduced an IO-based solution to observable sharing that uses type functions to provide type-safe observable sharing. The use of IO is not a hindrance in practice, because the occasions when we want to observe sharing are typically the same occasions as when we want to export a net-list-like structure to other tools.

Our hope is that the simplicity of the interface and the familiarity with the ramifications of using an IO function will lead to reifyGraph being used for observable sharing in deep DSLs.

We need a semantics for reifyGraph. This of course will involve giving at least a partial semantics to IO, for the way it is being used. One possibility is to model the StableName equality as a non-deterministic choice, where IO provides a True/False oracle. This would mean that reifyGraph would actually return an infinite tree of possible graphs, one for each possible permutation of answers to the pointer equality. Another approach we are considering is to extend Natural Semantics (Launchbury 1993) for a core functional language with a reify primitive, and compare it with the semantics for Ref-based observable sharing (Claessen and Sands 1999).

Acknowledgments

I would like to thank all the members of CDSL at ITTC for the creative research environment, many interesting discussions, and detailed feedback. I would also like to thank Conal Elliott, Kevin Matlage, Don Stewart, and the anonymous reviewers for their many useful comments and suggestions.

References

Arthur I. Baars and S. Doaitse Swierstra. Type-safe, self inspecting code. In Proceedings of the ACM SIGPLAN Workshop on Haskell, pages 69–79. ACM Press, 2004. ISBN 1-58113-850-4.

Per Bjesse, Koen Claessen, Mary Sheeran, and Satnam Singh. Lava: Hardware design in Haskell. In International Conference on Functional Programming, pages 174–184, 1998.

Manuel M. T. Chakravarty, Gabriele Keller, and Simon Peyton Jones. Associated type synonyms. In ICFP '05: Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming, pages 241–253, New York, NY, USA, 2005. ACM. ISBN 1-59593-064-7.

Koen Claessen. Embedded Languages for Describing and Verifying Hardware. PhD thesis, Dept. of Computer Science and Engineering, Chalmers University of Technology, April 2001.

Koen Claessen and David Sands. Observable sharing for functional circuit description. In P. S. Thiagarajan and Roland H. C. Yap, editors, Advances in Computing Science – ASIAN'99, volume 1742 of Lecture Notes in Computer Science, pages 62–73. Springer, 1999. ISBN 3-540-66856-X.

Conal Elliott. Programming graphics processors functionally. In Proceedings of the 2004 Haskell Workshop. ACM Press, 2004.

Conal Elliott, Sigbjørn Finne, and Oege de Moor. Compiling embedded languages. Journal of Functional Programming, 13(2), 2003.

J. Ellson, E. R. Gansner, E. Koutsofios, S. C. North, and G. Woodhull. Graphviz and Dynagraph – static and dynamic graph drawing tools. In M. Junger and P. Mutzel, editors, Graph Drawing Software, pages 127–148. Springer-Verlag, 2003.

Levent Erkok and John Launchbury. A recursive do for Haskell. In Haskell Workshop '02, Pittsburgh, Pennsylvania, USA, pages 29–37. ACM Press, October 2002.

Mark P. Jones. Functional programming with overloading and higher-order polymorphism. In Advanced Functional Programming, First International Spring School on Advanced Functional Programming Techniques – Tutorial Text, pages 97–136, London, UK, 1995. Springer-Verlag. ISBN 3-540-59451-5.

Mark P. Jones and Iavor S. Diatchki. Language and program design for functional dependencies. In Haskell '08: Proceedings of the First ACM SIGPLAN Symposium on Haskell, pages 87–98, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-064-7. doi: http://doi.acm.org/10.1145/1411286.1411298.

John Launchbury. A natural semantics for lazy evaluation. In POPL, pages 144–154, 1993.

Daan Leijen and Erik Meijer. Domain specific embedded compilers. In 2nd USENIX Conference on Domain-Specific Languages (DSL '99), pages 109–122, Austin, Texas, October 1999.

John Matthews, Byron Cook, and John Launchbury. Microprocessor specification in Hawk. In ICCL '98: International Conference on Computer Languages, pages 90–101, 1998.

Conor McBride and Ross Paterson. Applicative programming with effects. Journal of Functional Programming, 16(6), 2006.

Matt Morrow. Vacuum. hackage.haskell.org/package/vacuum.

John O'Donnell. Overview of Hydra: a concurrent language for synchronous digital circuit design. In Parallel and Distributed Processing Symposium, pages 234–242, 2002.

John O'Donnell. Generating netlists from executable circuit specifications in a pure functional language. In Functional Programming, Glasgow 1992, Workshops in Computing, pages 178–194. Springer-Verlag, 1992.

Simon Peyton Jones, editor. Haskell 98 Language and Libraries – The Revised Report. Cambridge University Press, Cambridge, England, 2003.

Simon Peyton Jones, Simon Marlow, and Conal Elliott. Stretching the storage manager: weak pointers and stable names in Haskell. In Proceedings of the 11th International Workshop on the Implementation of Functional Languages, LNCS, The Netherlands, September 1999. Springer-Verlag.

Richard Sharp. Functional design using behavioural and structural components. In FMCAD '02: Proceedings of the 4th International Conference on Formal Methods in Computer-Aided Design, pages 324–341, London, UK, 2002. Springer-Verlag. ISBN 3-540-00116-6.

Tim Sheard and Simon Peyton Jones. Template metaprogramming for Haskell. In Manuel M. T. Chakravarty, editor, ACM SIGPLAN Haskell Workshop 02, pages 1–16. ACM Press, October 2002.

Satnam Singh and Phil James-Roxby. Lava and JBits: From HDL to bitstream in seconds. In FCCM '01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 91–100, Washington, DC, USA, 2001. IEEE Computer Society. ISBN 0-7695-2667-5.


A. Implementation

{-# LANGUAGE FlexibleContexts, UndecidableInstances #-}
module Data.Reify.Graph ( Graph(..), Unique ) where

import Data.Unique

type Unique = Int

data Graph e = Graph [(Unique, e Unique)] Unique

{-# LANGUAGE UndecidableInstances, TypeFamilies #-}
module Data.Reify
  ( MuRef(..), module Data.Reify.Graph, reifyGraph ) where

import Control.Concurrent.MVar
import Control.Monad
import System.Mem.StableName
import Data.IntMap as M
import Control.Applicative
import Data.Reify.Graph

class MuRef a where
  type DeRef a :: * -> *
  mapDeRef :: (Applicative m)
           => (a -> m u) -> a -> m (DeRef a u)

reifyGraph :: (MuRef s) => s -> IO (Graph (DeRef s))
reifyGraph m = do rt1   <- newMVar M.empty
                  rt2   <- newMVar []
                  uVar  <- newMVar 0
                  root  <- findNodes rt1 rt2 uVar m
                  pairs <- readMVar rt2
                  return (Graph pairs root)

findNodes :: (MuRef s)
          => MVar (IntMap [(StableName s,Int)])
          -> MVar [(Int,DeRef s Int)]
          -> MVar Int
          -> s
          -> IO Int
findNodes rt1 rt2 uVar j | j `seq` True = do
    st  <- makeStableName j
    tab <- takeMVar rt1
    case mylookup st tab of
      Just var -> do putMVar rt1 tab
                     return $ var
      Nothing  -> do var <- newUnique uVar
                     putMVar rt1 $ M.insertWith (++)
                                       (hashStableName st)
                                       [(st,var)]
                                       tab
                     res <- mapDeRef
                              (findNodes rt1 rt2 uVar)
                              j
                     tab' <- takeMVar rt2
                     putMVar rt2 $ (var,res) : tab'
                     return var
  where
    mylookup h tab =
      case M.lookup (hashStableName h) tab of
        Just tab2 -> Prelude.lookup h tab2
        Nothing   -> Nothing

newUnique :: MVar Int -> IO Int
newUnique var = do
  v <- takeMVar var
  let v' = succ v
  putMVar var v'
  return v'

{-# LANGUAGE UndecidableInstances, TypeFamilies,
             RankNTypes, ExistentialQuantification,
             DeriveDataTypeable, RelaxedPolyRec,
             FlexibleContexts #-}
module Data.Dynamic.Reify
  ( MuRef(..), module Data.Reify.Graph, reifyGraph ) where

class MuRef a where
  type DeRef a :: * -> *
  mapDeRef :: (Applicative f) =>
       (forall b . (MuRef b, Typeable b,
                    DeRef a ~ DeRef b)
        => b -> f u)
    -> a
    -> f (DeRef a u)

reifyGraph :: (MuRef s, Typeable s)
           => s -> IO (Graph (DeRef s))
reifyGraph m = do rt1   <- newMVar M.empty
                  rt2   <- newMVar []
                  uVar  <- newMVar 0
                  root  <- findNodes rt1 rt2 uVar m
                  pairs <- readMVar rt2
                  return (Graph pairs root)

findNodes :: (MuRef s, Typeable s)
          => MVar (IntMap [(Dynamic,Int)])
          -> MVar [(Int,DeRef s Int)]
          -> MVar Int
          -> s
          -> IO Int
findNodes rt1 rt2 uVar j | j `seq` True = do
    st  <- makeStableName j
    tab <- takeMVar rt1
    case mylookup st tab of
      Just var -> do putMVar rt1 tab
                     return $ var
      Nothing  -> do var <- newUnique uVar
                     putMVar rt1 $ M.insertWith (++)
                                       (hashStableName st)
                                       [(toDyn st,var)]
                                       tab
                     res <- mapDeRef
                              (findNodes rt1 rt2 uVar)
                              j
                     tab' <- takeMVar rt2
                     putMVar rt2 $ (var,res) : tab'
                     return var

mylookup :: (Typeable a)
         => StableName a
         -> IntMap [(Dynamic,Int)]
         -> Maybe Int
mylookup h tab =
  case M.lookup (hashStableName h) tab of
    Just tab2 -> Prelude.lookup (Just h)
                   [ (fromDynamic c,u) | (c,u) <- tab2 ]
    Nothing   -> Nothing

newUnique :: MVar Int -> IO Int
newUnique var = do
  v <- takeMVar var
  let v' = succ v
  putMVar var v'
  return v'


Finding the Needle
Stack Traces for GHC

Tristan O.R. Allwood
Imperial College
[email protected]

Simon Peyton Jones
Microsoft Research
[email protected]

Susan Eisenbach
Imperial College
[email protected]

Abstract

Even Haskell programs can occasionally go wrong. Programs calling head on an empty list, and incomplete patterns in function definitions, can cause program crashes, reporting little more than the precise location where error was ultimately called. Being told that one application of the head function in your program went wrong, without knowing which use of head went wrong, can be infuriating.

We present our work on adding the ability to get stack traces out of GHC, for example that our crashing head was used during the evaluation of foo, which was called during the evaluation of bar, during the evaluation of main. We provide a transformation that converts GHC Core programs into ones that pass a stack around, and a stack library that ensures bounded heap usage despite the highly recursive nature of Haskell. We call our extension to GHC StackTrace.

Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications—Haskell

General Terms Algorithms, Languages

Keywords Stack Trace, Debugging

1. Motivation

Well-typed Haskell programs cannot seg-fault, but they can still fail, by calling error. For example, head is defined thus:

head :: [a] -> a
head (x:xs) = x
head []     = error "Prelude.head: empty list"

If the programmer calls head, and (presumably unexpectedly) the argument is [ ], the program will fail in the following cryptic fashion:

> ghc -o main Main.hs
> ./main.exe
main.exe: Prelude.head: empty list

At this point, a programmer new to Haskell will ask "Which of the zillions of calls to head in my program passed the empty list?". The message passed to the error function in head tells the programmer the local reason for the failure, but usually provides insufficient context to pinpoint the error.


If the programmer is familiar with debugging an imperative language, they would expect to look at a stack trace: a listing that shows main called foo, foo called bar, and bar called head, which then called error and ended the program. This information is readily available, since the run-time stack, which the debugger can unravel, gives the context of the offending call. However, in a lazy language, the function that evaluates (head [ ]) is often not the function that built that thunk, so the run-time stack is of little or no use in debugging.

In the work described here, we describe a modest modification to GHC that provides functionality similar to that offered by call-by-value languages:

• We describe a simple transformation that makes the program carry around an explicit extra parameter, the debug stack, representing (an approximation of) the stack of a call-by-value evaluator. Section 3. The debug stack is reified as a value, and made available to functions like error, to aid the user in debugging their program.

• Crucially, the transformation inter-operates smoothly with pre-compiled libraries; indeed, only the parts of the program under scrutiny need be recompiled. Section 3.2.

• We give an efficient implementation of the stack data structure, which ensures that the debugging stacks take only space linear in the program size, regardless of the recursion depth, run-time, or heap residency of the program. Section 4. For Haskell programs to pass around stacks, we had to design an appropriate stack data structure with associated library functions; these are outlined in Section 4.2.

• We built a prototype implementation, called StackTrace, in the context of a full-scale implementation of Haskell, the Glasgow Haskell Compiler. We sketch the implementation and measure the performance overhead of our transformation in Section 6.

• Our prototype implementation raised some interesting issues, which we discuss in Section 5.

Although they are very simple in both design and implementation, debug stack traces have an extremely good power-to-weight ratio. Since "Prelude.head: empty list" has so little information, even a modest amount of supporting context multiplies the programmer's knowledge by a huge factor! Sometimes, though, that still may not be enough, and we conclude by comparing our technique with the current state of the art (Section 7).


2. The programmer's-eye view

We begin with a simple example of our implemented system. The following program should print out the second Fibonacci number.

 1 module Main where
 2
 3 import Error
 4
 5 main :: IO ()
 6 main = print $ fib 2
 7
 8 fib :: Int → Int
 9 fib 1 = 1
10 fib n
11   | n > 1 = fib (n − 1) + fib (n − 2)
12 fib n = error′ $ "Fib with negative number: "
13                  ++ show n

However, our programmer has made a small mistake:

> ghc --make -o Fib Fib.hs
> ./Fib
Fib: Fib with negative number: 0

Of course, 0 is not a negative number, and our programmer has just missed out a base case. But the first thing the programmer wants to know when faced with such an error is: what was the call site of the offending call to fib? Our new tool makes this easy to answer, by simply adding the -fexplicit-call-stack-all flag:

> ghc --make -fexplicit-call-stack-all -o Fib Fib
> ./Fib
Fib: Fib with negative number: 0
in error', Error.hs:7,14
in fib, Fib.hs:12,9
in fib, Fib.hs:11,27
in main, Fib.hs:6,16
in main, Fib.hs:6,1

This shows that the call to error′ was made in function fib, on line 12 and column 9; that is what "in fib, Fib.hs:12,9" means, where the line numbers are given in the code sample above. In turn, the offending call to fib was made in fib on line 11, column 27; the fib (n − 2) call. In effect, we are provided with a stack trace of the offending call.

2.1 Stack elision

Once the program has been recompiled with call stack information applied, we can use GHCi to experiment with other calls to fib:

Prelude Main> fib 20
*** Exception: Fib with negative number: 0
in error', Error.hs:7,14
in fib, Fib.hs:12,9
in fib, Fib.hs:11,27
in fib, Fib.hs:11,13
...

Here, the "..."s mean some of the stack has been elided, because we have recursively called the same function from the same call site. In this case the interactive request for fib 20 will have forced the call to fib (n − 1) on line 11, column 13, which will then call itself another 19 times before then calculating fib 1 + fib 0. The fib 0 (from line 11 column 27) then fails as before.

If we were instead to keep the full stack trace, a program that looped would consume ever-increasing memory for the ever-growing stack.

Here is another example of this behaviour (at the bottom of our Fib.hs file):

15 firstLetters = loopOver ["hi", "world", "", "!" ]
16
17 loopOver [ ] = [ ]
18 loopOver (x : xs) = head′ x : (loopOver xs)

Here we have a small recursive loop that turns a list of lists into a list by taking the head element of each of the sublists. Running this through GHCi we can see that some recursion happened before the program took the head element of an empty list.

*Main> firstLetters
"hw*** Exception: head: empty list
in error', exs/Error.hs:7,14
in head', exs/Error.hs:14,12
in loopOver, Fib.hs:18,19
in loopOver, Fib.hs:18,30
...
in firstLetters, Fib.hs:15,16

Of course, the more idiomatic way of writing this would be to use a map combinator.

21 firstLetters2 = map′ head′ ["hi", "world", "", "!" ]

*Main> firstLetters2
"hw*** Exception: head: empty list
in error', exs/Error.hs:7,14
in head', exs/Error.hs:14,12
in firstLetters2, Fib.hs:21,22

Now the stack trace may appear at first to be surprising, as there is no mention of the map′ function in it.¹ This is due to map′ taking head′ as a higher-order argument, and at present we do not propagate stacks into higher-order arguments (a point we will return to in Section 5.1). However, the stack trace obtained does accurately convey that it is some application of the head′ function referenced in the source of firstLetters2 that caused the error.

2.2 Selective debugging

A critical design goal is that a program can be debugged without recompiling the entire program. Although it is theoretically unimportant, this goal is absolutely vital in practice for several reasons:

• Libraries may be available only in binary form.

• The program may simply be tiresomely voluminous, so that whole-program recompilation is painful (e.g. libraries, again).

• The overheads of generating and passing around a stack trace for the entire program may be substantial and unnecessary for all but a small critical path.

These have proved serious obstacles for tools based on whole-program transformation, including the cost-centres of GHC's own profiler (Section 7).

We therefore provide support for selective debugging on a function-by-function basis. A typical mode of use is this:

• Function buggy in module Bug crashes (by calling error).

• The programmer asks GHC to generate call-site information for buggy by adding a pragma (a bit like an INLINE pragma) thus:

¹ Several of the example functions used have primes (′) suffixed on. Because of a currently unresolved bootstrapping issue, it is challenging to recompile all the standard libraries with our transform turned on, so we have just rewritten a few standard prelude functions and rebuilt them (with the exception of error′, which is discussed later).


{-# ANN buggy Debug #-}

• The system is recompiled, passing -fexplicit-call-stack to GHC. Modules that call buggy need to be recompiled (to pass their call-site information), but that is all. (Except that if optimisation is on (the -O flag), more recompilation may happen because of cross-module inlining.)

• The programmer re-runs the program.

• Now buggy still crashes, but the trace tells that it crashed in module Help, function bugCall.

• That might already be enough information; but if not, the programmer asks GHC to debug bugCall in module Help, and recompiles. Again, depending on the level of optimisation, only a modest amount of recompilation takes place.

• The process repeats until the bug is nailed.

There is a shorthand for adding a Debug pragma to every function in a module, namely passing the -fexplicit-call-stack-all flag while compiling the module (which can reside in an OPTIONS_GHC pragma on a module-by-module basis).
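For example, a minimal illustration of ours of that shorthand, reusing the module name Bug from the scenario above:

{-# OPTIONS_GHC -fexplicit-call-stack-all #-}
-- Requires the StackTrace-enabled GHC described in this paper; every
-- top-level function in this module then behaves as if it had a Debug pragma.
module Bug where

buggy :: [Int] -> Int
buggy = head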

2.3 Reifying the stack trace

We have seen that error′ prints out the stack trace. But in GHC, error′ is just a library function, not a primitive, so one might ask how error′ gets hold of the stack trace to print. StackTrace adds a new primitive throwStack thus:

throwStack :: ∀ e a. Exception e ⇒ (Stack → e) → a

The implementation of throwStack gets hold of the current stack trace, reifies it as a Stack value, and passes it to throwStack's argument, which transforms it into an exception. Finally, throwStack throws this exception. The Stack type is provided by our tool's support library, and is an instance of Show.

Given throwStack, we can define error′ as follows:

error′ :: [Char] → a
error′ m = throwStack (λs → ErrorCall (m ++ show s))

It is also possible to reify the stack trace elsewhere, as we discuss in the case study that follows.
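In the same style, the head′ seen in the traces of Section 2.1 could plausibly be written with throwStack; this is a hedged sketch of ours, not the paper's definition:

head' :: [a] -> a
head' (x : _) = x
head' []      = throwStack (\s -> ErrorCall ("head: empty list" ++ show s))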

2.4 Debugging for real

GHC is itself a very large Haskell program. As luck would have it, in implementing the later stages of StackTrace we encountered a bug in GHC, which looked like this at runtime:

ghc.exe: panic! (the 'impossible' happened)
  (GHC 6.11 for i386-unknown-mingw32): idInfo

Fortunately the project was far enough advanced that we could apply it to GHC itself. The error was being thrown from this function:

varIdInfo :: Var → IdInfo
varIdInfo (GlobalId {idInfo = info}) = info
varIdInfo (LocalId {idInfo = info}) = info
varIdInfo other_var = pprPanic "idInfo"
                               (ppr other_var)

Rewriting it slightly to use our throwStack primitive, and recompiling with the transform, allowed us to gain some extra context:

{-# ANN varIdInfo Debug #-}
varIdInfo :: Var → IdInfo
varIdInfo (GlobalId {idInfo = info}) = info
varIdInfo (LocalId {idInfo = info}) = info
varIdInfo other_var
  = throwStack (λs →
      pprPanic ("idInfo\n" ++ show s)
               (ppr other_var) :: SomeException)

ghc.exe: panic! (the 'impossible' happened)
  (GHC 6.11 for i386-unknown-mingw32): idInfo
in varIdInfo, basicTypes/Var.lhs:238,30
in idInfo, basicTypes/Id.lhs:168,10

We then chased through the functions, sprinkling on further Debug annotations, until we gained a full stack trace that we used to nail the bug.

ghc.exe: panic! (the 'impossible' happened)
  (GHC 6.11 for i386-unknown-mingw32): idInfo
in varIdInfo, basicTypes/Var.lhs:238,30
in idInfo, basicTypes/Id.lhs:168,10
in idInlinePragma, basicTypes/Id.lhs:633,37
in preInlineUnconditionally,
   simplCore/SimplUtils.lhs:619,12
in simplNonRecE, simplCore/Simplify.lhs:964,5
in simplLam, simplCore/Simplify.lhs:925,13
in simplExprF', simplCore/Simplify.lhs:754,5
in simplExprF, simplCore/Simplify.lhs:741,5
in completeCall, simplCore/Simplify.lhs:1120,24
in simplVar, simplCore/Simplify.lhs:1032,29
in simplExprF', simplCore/Simplify.lhs:746,39
...
in simplExprF', simplCore/Simplify.lhs:750,39
...
in simplLazyBind, simplCore/Simplify.lhs:339,33
in simplRecOrTopPair,
   simplCore/Simplify.lhs:295,5
in simplTopBinds, simplCore/Simplify.lhs:237,35
in simplifyPgmIO, simplCore/SimplCore.lhs:629,5
in simplifyPgm, simplCore/SimplCore.lhs:562,22
in doCorePass, simplCore/SimplCore.lhs:156,40

This story seems almost too good to be true, but we assure the reader that it happened exactly as described: the original failure was neither contrived nor anticipated, and the authors had no idea where the bug was until the trace revealed it. Simple tools can work very well even on very large programs.

3. Overview of the implementation

StackTrace is a simple Core-to-Core compiler pass that transforms the program in GHC's intermediate language (Core, [9]) to pass an additional argument describing the call site of the current function. This extra argument is called the call stack. StackTrace comes with a supporting library, to be described shortly.

The basic transformation is extremely simple. Suppose we have a user-defined function recip, with a Debug pragma (Section 3.1), and a call to it elsewhere in the same module:

{-# ANN recip Debug #-}
recip :: Int → Int
recip x = if x ≡ 0 then error "Urk foo"
                   else 1 / x

bargle x = ....(recip x) ....

The transformation (elaborated in Section 3.2) produces the following code:

recip :: Int → Int
recip x = recip_deb emptyStack x

{-# ANN recip (Debugged 'recip_deb) #-}
recip_deb :: Stack → Int → Int


recip_deb stk x = if x ≡ 0 then error stk′ "Urk foo"
                           else 1 / x
  where
    stk′ = push "in recip:14,23" stk

bargle x = ....(recip_deb stk x) ....
  where
    stk = push "in bargle:19:22" emptyStack

Notice several things here:

• The transformed program still has a function recip with its original type, so that the source-language type-checking is not disturbed. Also, any dependent modules can be compiled without enabling the transform and still work normally.

• In the transformed program, recip simply calls the debugging version recip_deb, passing an empty stack trace. The name "recip_deb" is arbitrary; in our real implementation it is more like recip $ 351, to ensure it cannot clash with programmer-defined functions.

• The transformation adds a new annotation Debugged, which associates the original function recip with its (arbitrarily-named) debugging version recip_deb. We discuss this annotation further in Section 3.2.

• The debugging version, recip_deb, contains all the original code of recip, but takes an extra stack-trace parameter, and passes on an augmented stack trace to the call to error.

• recip_deb does not pass a stack trace to (≡) or (/). Why not? Because it cannot "see" a debugging version of these functions; we describe how it identifies such functions in Section 3.1.

• Even though bargle is not marked for debugging, the call to recip in bargle is transformed to call recip_deb with a singleton stack. In this way, a single Debug annotation may cause many call sites to be adjusted. That is the whole point!

3.1 Debug pragmas

As discussed earlier (Section 2.2), our tool supports selective tracing, using pragmas to specify which functions should be traced.

For these pragmas we use a recent, separate, GHC feature, called annotations [10]. The annotations feature allows a user to associate a top-level function or module name with a Haskell value, using an ANN pragma, thus:

f x = ...

{-# ANN f True #-}
data Target = GPU | CPU deriving (Data, Typeable)
{-# ANN f GPU #-}

The first pragma adds the association (f, True), while the second adds (f, GPU). The associated value is any Haskell value that implements both Data and Typeable. (In fact, the "value" is implicitly a Template Haskell splice, which is run at compile time to give the value.) These annotations are persisted into GHC interface files, and can be read off later by users of the GHC API, the GHC Core pipeline itself, and eventually GHC plugins.

StackTrace provides a datatype Debug (exported by the tool's support library GHC.ExplicitCallStack.Annotation) for annotating user functions with:

data Debug = Debug deriving (Data, Typeable)

This is then used with the ANN (annotate) pragma to mark functions for debugging:

import GHC.ExplicitCallStack.Annotation (Debug (..))
...

[[f = e]]  =  {-# ANN f (Debugged 'f_deb) #-}
              f = f_deb emptyStack
              f_deb s = [[e]]s                     if f has a Debug pragma

           =  f = [[e]]emptyStack                  otherwise

[[throwStack]]s        = λf → throw (f s)
[[x_l]]s               = x_deb (push l s)          if x has a (Debugged 'x_deb) annotation
                       = x                         otherwise
[[e1 e2]]s             = [[e1]]s [[e2]]s
[[λx → e]]s            = λx → [[e]]s
[[case e1 of p → e2]]s = case [[e1]]s of p → [[e2]]s
[[let x = e1 in e2]]s  = let x = [[e1]]s in [[e2]]s

Figure 1. The stack-trace transformation

{-# ANN foo Debug #-}
foo = ...

Note the import of GHC.ExplicitCallStack.Annotation: the data constructor Debug must be in scope before it can be mentioned, even in an annotation.

3.2 The transformation

When the user compiles their code with a command-line flag, -fexplicit-call-stack, we run an extra compiler pass that transforms the program as sketched above. This section gives the details of the transformation.

The GHC compiler pipeline parses Haskell into a large data structure that is then typechecked. This typechecked source is then de-sugared into the simpler, typed intermediate language Core. The Core program is then optimised before being passed to the back-end compiler for turning into an executable or byte-code.

Although we have presented the StackTrace transform above in terms of the surface Haskell syntax, we implement it as a Core-to-Core transformation, because Core is a much, much smaller language than Haskell. However, the transformation is run early, just after the Haskell program has been desugared into Core, but before it has been optimised. At this stage the Core program still bears a close resemblance to the original Haskell, with some exceptions as noted later in Section 5.4. For example, top-level Haskell functions become top-level bindings, pattern matching is expanded out to case statements, etc. Some information does get lost; for example, it is difficult to know whether a Core let-bound variable has come from a Haskell let or where statement or is a compiler-created variable (e.g. for working with type-class dictionaries). This can cause difficulties when trying to accurately talk about Haskell-level function scopes and source locations from within Core.

The transformation itself is presented in Figure 1. The transformation is applied to each top-level definition f = e. If it has a Debug annotation then the transformation generates:

• A new function f_deb with argument s (of type Stack), whose right-hand side is [[e]]s.

• An impedance-matching definition for the original f, which calls f_deb passing the empty stack, emptyStack (defined by the support library).

• A new annotation is generated for f, that associates it with the value (Debugged 'f_deb), where Debugged is a data constructor declared in the support library as follows:


data Debugged = Debugged TH.Name

Its argument is a Template Haskell name, in this case the name of f's debugging variant. (Such quoted names are written in Template Haskell with a preceding single quote.)

If f does not have a Debug annotation (Section 3.1), then much less happens: the right-hand side e is simply transformed with [[e]]emptyStack, where emptyStack is the empty stack trace, reflecting the fact that a non-debugged function has no stack-trace context.

The term transformer [[e]]s, also defined in Figure 1, simply walks over the term e, seeking occurrences of functions that have debug variants. How are such functions identified? With the exception of the special primitive throwStack, discussed shortly, they are the ones that have a Debugged annotation, which gives the name of the debugging variant to be substituted. Remember that imported functions, as well as functions defined in this module, may have a Debugged annotation. The new Debugged annotation attached to f by the transformation is automatically preserved in the module's interface file, and will thereby be seen by f's callers in other modules.

The stack passed to x_deb is (push l s). Here, l is the source location (source file, line and column number etc.) of this occurrence of x, written informally as a subscript in Figure 1. The other parameter s is the stack trace of the context. The function push is exported by the support library, and pushes a location onto the current stack trace. The implementation of stack traces is described in Section 4.

There is a small phase-ordering question here. Since the top-level functions of a module may be mutually recursive, we must add all their Debugged annotations before processing their right-hand sides, so that their mutual calls are transformed correctly.

The transform has been designed to preserve the existing API of a module. The original function name f in the binding f = e is still available at the original type. As the definition of f now uses the debugged version with an empty initial stack, libraries compiled without the transform can still depend on it with no changes, and gain limited stack-trace benefits for free.

The transform is fully compatible with non-transformed libraries: a call to a library function is left unchanged by the transformation unless the library exposes a Debugged annotation for that function.

3.3 Implementing throwStack

The primitive throwStack is implemented in our library very simply, as follows:

throwStack :: ∀ e a. Exception e ⇒ (Stack → e) → a
throwStack f = throw (f emptyStack)

This provides a safe default for when it is used without StackTrace being enabled. The transformation then treats references to throwStack as a special case, although you can imagine a debugged version of throwStack would take the following shape:

{-# ANN throwStack (Debugged 'throwStack_deb) #-}
throwStack_deb :: ∀ e a. Exception e
               ⇒ Stack → (Stack → e) → a
throwStack_deb s f = throw (f s)

Any call elsewhere to throwStack will be transformed to a call to (throwStack_deb s) where s is the stack trace at that call site. Then throwStack_deb simply passes the stack to f, and throws the result. Simple.

The reader may wonder why we did not give throwStack the simpler and more general type (Stack → a) → a. Since throwStack is a normal Haskell function, if it had the more general signature, it could lead to a subtle break of referential transparency.

module Stack where
emptyStack :: Stack
push       :: Stack → StackElement → Stack
throwStack :: ∀ e a. Exception e ⇒ (Stack → e) → a

Figure 2. The signature of the Stack library in StackTrace.

Consider the following program (assuming the more liberal throwStack):

...
{-# ANN main Debug #-}
main = print (bar ≡ bar)

{-# ANN bar Debug #-}
bar :: String
bar = throwStack show

When run normally, the program would print out True as expected. However, if -fexplicit-call-stack is enabled during compilation, it would instead print out False.

The two different contexts of the calls to bar in main are now visible. Since a debugging library should not affect the control flow in pure Haskell code, we decided to require that throwStack diverges. An expert Haskell programmer can of course resort to the unsafe∗ black arts should they really desire the more liberal function.
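To make the rejected alternative concrete, here is a hedged sketch of ours (the names reifyStack and reifyStack_deb are not the paper's) of what the more liberal primitive and its debugged form would look like:

reifyStack :: (Stack -> a) -> a
reifyStack f = f emptyStack          -- safe default without the transform

reifyStack_deb :: Stack -> (Stack -> a) -> a
reifyStack_deb s f = f s             -- hands the call-site stack straight to f

-- With this variant, bar = reifyStack show would yield a different string at
-- each call site once the transform is on, so bar == bar could become False.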

4. Call Stacks

A key component of StackTrace is the data structure that actually represents stack traces. It is implemented by our support library, and has the signature given in Figure 2. This section discusses our implementation of stack traces. A key design goal was this:

• The maximum size of the stack is statically bounded, so that the debugging infrastructure adds only a constant space overhead to the program.

To maintain a precise stack trace would take unbounded space, of course, because of recursion, so instead we abbreviate the stack with "..." elisions, in order to bound its size. Section 2 showed some examples of this elision. But just what should be elided? We established the following constraints:

• The top of the stack accurately reflects the last calls made up to an identifiable point. This is important for debugging, so the user can know exactly what they do and don't know about what happened.

• Any function that would be involved in a full stack trace is represented at least once in this stack trace.

4.1 Eliding locations in the Stack

Our stack design has the following behaviour when pushing a source location l (file name, line and column numbers) onto a stack (a small executable sketch of this rule follows the list):

• Place l at the top of the stack.

• Filter the rest of the stack to replace the previous occurrence of l (if it exists) with a sentinel value "...".

• If the inserted "..." is directly above or below another "...", they are collapsed into a single "...".
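The following is a minimal sketch of ours, not the library code, modelling a stack as a plain list in which Nothing stands for the "..." sentinel; it reproduces exactly the behaviour tabulated in Figure 3.

type Loc = String
type SimpleStack = [Maybe Loc]    -- Just l is a location, Nothing is "..."

pushLoc :: Loc -> SimpleStack -> SimpleStack
pushLoc l s = Just l : collapse (map elide s)
  where
    -- the earlier occurrence of l (if any) becomes a sentinel
    elide (Just l') | l' == l = Nothing
    elide e                   = e
    -- adjacent sentinels collapse into one
    collapse (Nothing : Nothing : rest) = collapse (Nothing : rest)
    collapse (e : rest)                 = e : collapse rest
    collapse []                         = []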

Some examples of this behaviour are in Figure 3, which depicts a stack trace as a list of labels and elisions, such as a,...,b,-. The young end of the stack is at the left of such a list, with "-" representing the base of the stack. In examples (1), (2) and (5) the element being pushed is not already in the stack and is placed on top as would be expected.


     Push    onto stack          gives result
(1)  a       -                   a,-
(2)  b       a,-                 b,a,-
(3)  a       b,a,-               a,b,...,-
(4)  b       a,b,...,-           b,a,...,-
(5)  c       b,a,...,-           c,b,a,...,-
(6)  c       c,b,a,...,-         c,...,b,a,...,-
(7)  b       c,...,b,a,...,-     b,c,...,a,...,-
(8)  a       b,c,...,a,...,-     a,b,c,...,-

Figure 3. Pushing elements onto our Stack (young end to the left)

In example (3) the element a is already present, so its original occurrence is replaced with "..." while it is placed on top. In (4) the same happens with element b, although the new "..." would be adjacent to the one placed in (3), so they collapse together. In (8) we see an extreme example where three "..."s would end up adjacent and are all collapsed together.

An alternative way of imagining the results of this algorithm is this: given a real stack trace, you can convert it to our stack trace by sweeping down the stack from the top. Whenever you see a source location you have seen before, replace it with a sentinel value "...". If multiple sentinel values appear consecutively, collapse them together. To see this in practice, imagine reading the Push column in Figure 3 from bottom to top (which represents the real stack trace), replacing any duplicate elements with "...". Doing this on any line will yield that line's result.

Given that all stacks must start out as empty, and the only mutation operator is to push a source location (i.e. you can never push a "..."), we get several nice properties:

• Any source location referring to a usage of a top-level function occurs at most once in the call stack.

• A "..." is never adjacent to another "...".

• The number of elements in the call stack is bounded by twice the number of possible source locations that refer to usages of top-level functions (this follows from the previous two properties). It is of course likely to be much, much less than this, since not all program locations can call into each other.

• A "..." represents an unknown number of entries/calls in the stack trace. However, the "..." can only elide functions that are mentioned above the "...".

• The top of the stack accurately reflects what happened, down to the first "...".

4.2 Stack Implementation

The run-time stack trace is implemented as an ordinary Haskell library. The data structure representing the stack takes advantage of the sentinel value ("...") only ever occurring between two stack elements, and maintains this invariant implicitly.

data Stack = Empty         { stackDetails :: !StackDetails }
           | Then          { stackDetails :: !StackDetails
                           , stackElement :: !StackElement
                           , restOfStack  :: !Stack }
           | RecursionThen { stackDetails :: !StackDetails
                           , stackElement :: !StackElement
                           , restOfStack  :: !Stack }

The Empty constructor represents the empty stack, and Then is the way of placing a StackElement upon an existing stack. The RecursionThen constructor is used to represent a sentinel value between its StackElement and the stack below it.


Figure 4. Complete transition diagram for our stack abstraction with two source locations. The empty stack is denoted by "-". Edges represent pushing the named source location onto the stack.

The StackElements represent source locations. The StackDetails contain some bookkeeping information for each stack. When discussing stacks in constructor form, we will elide the StackDetails, meaning we can talk about stacks like a `Then` b `RecursionThen` Empty (which is a,b,...,-).
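Assuming the constructors above and a Show instance for StackElement (our assumption, not stated in the paper), a small helper of ours can render a stack in the a,b,...,- notation used throughout this section:

renderStack :: Stack -> String
renderStack (Empty _)                = "-"
renderStack (Then _ e rest)          = show e ++ "," ++ renderStack rest
renderStack (RecursionThen _ e rest) = show e ++ ",...," ++ renderStack rest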

In Figure 4 we consider building a stack where we only have two possible items to put into it, called a and b (these are actually locations in our source code, but the fact is unimportant for this example). The figure shows how the push function relates stacks via the source locations pushed onto them. The empty stack, "-", at the bottom left of the picture is the root of all the possible stack configurations.

For example, if a is the first source location reached, then the stack becomes a,- (following the a arrow from - to the right). From this position, if we reach source location b (following the b arrow to the right), then b is pushed onto the top of the stack as would be expected (giving b,a,-). If that source location b is recursively re-entered (following b again), then the first time the stack would transition to b,...,a,-, however any further pushes of the source location b would cause the stack to remain the same.

As the diagram shows, there are many possible configurations, and at runtime many of the shorter stacks appear in different contexts (for example main,- will be a suffix of all stacks).

4.3 Stack sharing and memoization

There are resource-related questions for stack traces:

• Every call (push l s) must search s for occurrences of l. We would like to not do so repeatedly, giving push an amortised constant-time cost. We achieve this by memoising calls to push.

• Although elision means that each individual stack trace has bounded depth, there may be an unbounded number of them.


We would like to share their storage, so that the size of all stack traces together is bounded, independent of program runtime or data size. We can achieve this by hash-consing: that is, ensuring that for any particular stack trace there is at most one stack in the heap that represents it. Since the tail of a stack is also a stack, this implicitly means we share all suffixes of stacks.

We can memoise push by attaching a memo table to each stack trace. The memo table for a stack trace s maps source locations l to the result of (push l s). As a partial analogy, you could imagine that the arrows in Figure 4 represent the associations in the memo tables for each stack. The StackDetails data structure is where this memo table lives, which takes the following shape:

data StackDetails = StackDetails
  { stackUnique :: !Unique
  , stackTable  :: !(MVar (HashTable StackElement Stack))
  }

The stackTable is used to memoize the push calls. When (push l s) is called, the stack s checks its stackTable to see if the new stack has already been calculated (looking it up and returning it if so); otherwise the appropriate new stack is built and the stackTable is updated. Since we are using hashtables, the Stacks also need to be comparable for equality, and we use stackUnique to provide a quick equality check.
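A simplified sketch of ours of this memoisation step, using a Data.Map held in an MVar rather than the library's HashTable, and with buildStack standing in for the real work done by the smart constructors described next:

import           Control.Concurrent.MVar (MVar, modifyMVar)
import qualified Data.Map as Map

memoPush :: StackElement -> Stack -> IO Stack
memoPush l s = modifyMVar (memoTable s) $ \memo ->
  case Map.lookup l memo of
    Just s' -> return (memo, s')               -- reuse the cached result
    Nothing -> do s' <- buildStack l s         -- assumed helper: does the real push
                  return (Map.insert l s' memo, s')

-- memoTable :: Stack -> MVar (Map.Map StackElement Stack) is assumed here,
-- playing the role of the stackTable field above; it needs Ord StackElement.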

4.4 The implementation of push

The use of memo tables alone, however, does not guarantee that all stacks in the heap are unique. The problem is that it could be possible to reach the same stack in multiple different ways. For example, the stack a,b,...,- could be reached by push a ◦ push b ◦ push a $ emptyStack or by push a ◦ push b ◦ push b $ emptyStack. In order to ensure each stack is only created once, we make our push function generate new stacks using a canonical set of pushes upon a known memoized stack. The idea is to build all stacks incrementally using two "smart constructors" that can only alter the top of the stack, only ever operate on stacks that have already been memoized correctly, and do not feature the program location about to be pushed. If these preconditions are met, they guarantee that all stacks are only ever created once and share all tails correctly.

• pushl is the smart constructor for Then. It takes a program location l and a known memoized stack s (not containing l), and checks s's memo table for l. If the check succeeds, it returns the memoized result. Otherwise it uses Then to build a new stack trace, adds it to s's memo table, and returns it.

• pushr is the smart constructor for RecursionThen. To guarantee all stacks are correctly shared, this constructor ensures that (for example) when generating the stack a,...,rest given a known memoized stack rest: a,rest is memoized, and the memo table for a,rest knows that when a is pushed upon it the result is a,...,rest. It achieves this by (using this example of pushr a rest):

  First using pushl to build or look up the stack a,rest.

  It then does a memo table check in a,rest for pushing a. If the check succeeds, it just returns the result. If it fails, it picks apart the top of the stack and swaps the Then for a RecursionThen, and then adds the mapping for pushing a onto a,rest to a,...,rest, before returning a,...,rest.

With these smart constructors in hand, the implementation of (push l s) is easy:

push b (a,b,c,-)

     Action                       Queue       Stack
(1)  split stack at b             b,a,...,    c,-
(2)  pushr a
(3)    pushl a                    b,          a,c,-
(4)    replace a, with a,...,     b,          a,...,c,-
(5)  pushl b                                  b,a,...,c,-

Figure 5. Example use of the smart constructors

1. Look up l in s's memo table. If the check succeeds, return the pre-computed result.

2. Search s for an occurrence of l. If none is found, just tail-call (pushl l s) to push l onto s.

3. Starting from the suffix just below the occurrence of l (which cannot contain l), rebuild the stack using pushl and pushr, omitting l. Finally use pushl to push l onto the re-built stack.

We illustrate the third step with an example in Figure 5. In this example we are pushing b onto the stack a,b,c,-. In (1), push splits the stack into a queue of things to be pushed, and the known memoized stack being built up. Notice that b has been placed at the front of the queue, and its original location replaced with a "...".

In (2) we take the last item of the queue (a,..., which is really a value representing a `RecursionThen`), and since we need to create a RecursionThen, use pushr to place that on the top of the new stack. pushr first uses pushl to put a on the top of the stack in (3), and then replaces the Then constructor on the top of the new stack with RecursionThen in (4). In (5) we take the next item off the queue, and since that needs to be separated using Then, we use pushl to place it on the top of the stack.

Once the queue is empty, push then updates the memo table of the original (pre-queue) stack to point to the final stack when (in this example) b is pushed.

4.5 Run-time example

We now demonstrate how our algorithm for pushing elements onto the stack, using memo tables, results in a bounded heap footprint. We use the following program:

a = b
b = a

main = a

Imagine that main is not being debugged, so our stack traces will only refer to the source locations b and a.

Initially there is a global constant empty stack available, with an empty memo table attached (Figure 6 - 1). Pushing program location a onto this stack first checks the memo table, but as it is empty we need to compute the new stack, and update the memo table. As the stack does not contain a already, pushl can simply create a new Then stack element (with its own empty memo table) and update Empty's memo table to point to it (2).

Pushing b onto this new stack follows similarly, giving a heap as in (3). Now we come to pushing a on top of b,a,-. Again the memo table is empty, so we need to compute a new stack. However the existing stack already contains an a, so push splits the stack at a, giving a known memoized stack -, and a queue of a,b,....

So in this example, the first item off the queue is b,..., which means push will delegate to pushr. This then delegates to pushl to first push b onto Empty, giving the heap layout in (4). Then, since we want a RecursionThen between Empty and b, pushr will replace the top Then with a RecursionThen, giving the situation in (5). Notice in this step we have initialized the new memo table with a self-reference loop, because any further pushes of b will return to the same stack.


[Figure 6 panels: 1. Empty Stack (-) with empty memo table. 2. Pushing a onto the stack. 3. Pushing b onto the stack. 4. Pushing b onto Empty. 5. Pushing b again to create a RecursionThen. 6. Stack structure after re-entering b for the first time. 7. The final Stack structure.]

Figure 6. Stack Pushing Example


The only item left in the queue is a, which is pushed using pushl. Finally push updates the b,a,- memo table to point to the resulting a,b,...,- stack (6).

The next iteration of the loop then pushes another b, transitioning the stack from a,b,...,- to b,a,...,- with associated updates to form the heap in (7). (7) also includes the final arc that the subsequent pushing of a creates.

5. Future WorkFor the most part, StackTrace as described so far works well; wellenough, for example, for it to be helpful in debugging GHC itself(Section 2.4). However there are some thorny open issues thatneed to be investigated to make it complete. How to deal withtype classes is one problem, as these have non-trivial, cross-moduleinteractions that a rewriting transform must take into account. Ourstack trace transform also has potential negative effects on constantfunctions and the translation of mutually recursive functions withpolymorphic / type-class arguments.

5.1 Stack traces for Higher Order Functions

There are times when it could be useful to have a more flexible call stack than the one currently implemented. Higher order functions are a good motivator of this. For example, consider the map function:

map :: (a → b) → [a ] → [b ]
map f [ ]      = [ ]
map f (x : xs) = (f x ) : map f xs

and a use site:

1 foo = map (error ′ "...") [1, 2, 3]

The call stack will be:

error "..."
in foo, Blah.hs:1,12
in <foo’s calling context>

Even if we add an annotation to explicitly say we want to debug map, there will be no reference to map in the call stack. The reason for this is that map's argument f is never told (and has no way to know) that it is being applied inside map.

A natural solution to this problem would be to let the user somehow indicate that the first argument to map should also accept a stack, giving a debugged version and new stack trace like so:

map deb :: Stack → (Stack → a → b) → [a ] → [b ]
map deb s f [ ]      = [ ]
map deb s f (x : xs) = f (push loc1 s) x :
                       map deb (push loc2 s) (λs ′ → f (push loc3 s ′)) xs

foo = λstack → map deb (push loc4 stack)
                       (λstk → error ′ (push loc5 stk) "...")
                       [1, 2, 3]

error "..."
in foo at loc5
in map at loc1
in foo at loc4
in <foo’s calling method>

Now f also takes a stack indicating where it is used, and in the recursive case of map deb, the fact that it is called inside map at loc1 is presented to it.

The complications with implementing this scheme would be establishing which function arguments (or in fact any locally declared variable) could be useful to debug, and then keeping track of these so that we know to propagate the stack. The difficulty comes from realising that f is a local variable, whereas previously all debugged variants of things were top-level declarations that could easily be referred to in GHC.

5.2 Constant Applicative Form Expressions

Another problem area is the treatment of expressions in Constant Applicative Form (CAFs). Due to GHC's evaluation strategy, these will be evaluated once and their end result stored, as opposed to recomputing their value each time they are demanded. For example:

e = expensive ‘seq‘ f

main = print e >> print e

Here expensive will only be computed once; the second reference to e in main will just get the result of whatever f evaluated to.

However, by adding the stack argument, and threading it through into expensive, we can dramatically change the runtime of the program:

e deb stack = expensive ‘seq‘ (f deb (push loc stack))

main = print (e deb (push loc1 emptyStack)) >>
       print (e deb (push loc2 emptyStack))

Now, since e deb accepts an argument (which is different in both cases), and GHC is unaware of our invariant that stacks do not change user-visible control flow, both invocations of e deb will require the recomputation of expensive, each with a different stack variable passed in.

This is a very hard problem to solve in general, although we mitigate it by allowing the user to explicitly state which parts of the program should be rewritten, which allows stack traces to remain performant even in the presence of expensive CAF expressions.
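
As a concrete illustration of that mitigation, the following sketch reuses the ANN Debug annotation style seen later in this section; the Debug type, expensive and g are illustrative stand-ins rather than the paper's definitions. Only f is marked for rewriting, so the un-annotated CAF e keeps its sharing:

{-# LANGUAGE DeriveDataTypeable #-}
module SelectiveDebug where

import Data.Data (Data, Typeable)

-- Stand-in for the annotation type provided by the StackTrace machinery.
data Debug = Debug deriving (Data, Typeable)

{-# ANN f Debug #-}          -- only f is rewritten to pass a stack
f :: Int -> Int
f x = g (x + 1)

g :: Int -> Int
g = (* 2)

expensive :: Int
expensive = sum [1 .. 1000000]

e :: Int                     -- un-annotated CAF: still evaluated only once
e = expensive `seq` f 0

main :: IO ()
main = print e >> print e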

5.3 Type Class Design Space

We want the StackTrace pass to degrade gracefully if future modules compiled without StackTrace are compiled against StackTrace-altered modules. This means any changes StackTrace makes to a module have to preserve the existing interface of the module. For simple functions, record selector functions and even mutually recursive functions, no definition can cross a module boundary and so making a change in an API-compatible way is straightforward. However type classes can be instantiated in different modules from where they are declared, and used in a different set of modules again. It could be possible, for instance, for a use site of a type-class instance to see declared instances that have come from modules both compiled with and without StackTrace enabled.

Consider the following two modules:

module MClassC where

class C a where
  c :: a → Bool

module MUseC where

import MClassC

useC :: C a ⇒ a → Bool
useC = ¬ ◦ c

Here we have a module declaring a type class C with a simple function c, and a module that just uses class C in a generic way.

If we Debug-annotate useC, and propagate the stack into the c in its definition, the debugged version of useC would be:

useC deb stack = ¬ ◦ (c deb (push loc stack))



The question is now, where does the c deb name come from? Is it generated by rewriting the type-class C as follows?

module MClassC where

class C a where
  c :: a → Bool

  c deb :: Stack → a → Bool
  c deb = const c

Now the original class declaration is expanded with a new function, and we give it a default implementation to ensure later clients compiled without StackTrace have a sensible implementation of it.

Instance declarations for class C that are compiled with the transform turned on could then generate a c deb function to give a stack-propagating version of their c instance; others would get the API-safe, but stackless, default implementation.

However there are downsides to this approach. Firstly, GHC's internal representation of a type-class is currently fixed very early on in the compiler pipeline, and altering that fixed definition would invalidate some invariants in later stages of the compiler.

The second problem is that it requires the class declaration itself to be available to be annotated by the user. If the class declaration is buried deep in a library without a Debug annotation attached, then any user code that has control flow through a user instance declaration would have its stack essentially reset.

An alternative approach would be to create a new typeclass that contains the debugged definitions of functions and to change the rewritten functions to require the presence of the new typeclass (if it exists) instead of the original. So for our example, we would generate instead:

class (C a) ⇒ C Deb a where
  c deb :: Stack → a → Bool

useC deb :: (C Deb a) ⇒ Stack → a → Bool
useC deb stack = ¬ ◦ (c deb stack)

However, we currently have some open questions for this design. If we allow the user to declare that the c function should have a debugged version available, but not need to annotate the class declaration in its declaring module, then we have to ensure that any potential users of the debugged version can see the declaration of the debugged version. For this example, it may require an extra import in MUseC to pull in the new declaration. It also requires that any instance declarations can see the debugged version of the typeclass so they can make instances of it.

There are some other, more serious, issues however. For example, imagine a class with two functions, and imagine that separately we create two debugged versions of the class, each debugging a different function. Now we can have a function that can witness both of these debugged versions: do we create debugged versions of it for all possibilities of debug information available?

module Urg where

class Urg a where
  u1 :: a → Bool
  u2 :: a → Bool

module Urg1 where

import Urg

{-# ANN u1 Debug #-} -- Which generates:

class (Urg a) ⇒ Urg Deb 1 a where
  u1 deb :: Stack → a → Bool

module Urg2 where

import Urg

{-# ANN u2 Debug #-} -- Which generates:

class (Urg a) ⇒ Urg Deb 2 a where
  u2 deb :: Stack → a → Bool

module UseUrgs where

import Urg1
import Urg2
import Urg

{-# ANN d Debug #-}
d :: Urg a ⇒ a → Bool
d x = u1 x ∧ u2 x

Our Urg module exports a typeclass with two member functions. Then in separate modules, we request that the member functions be debugged. Finally in module UseUrgs we ask to debug the function d. The question is now, do we expand out all the possibilities for the debugged version of d, such as:

d deb 1 :: Urg Deb 1 a ⇒ Stack → a → Bool
d deb 1 stack x = u1 deb (push loc stack) x ∧ u2 x

d deb 2 :: Urg Deb 2 a ⇒ Stack → a → Bool
d deb 2 stack x = u1 x ∧ u2 deb (push loc stack) x

d deb 1 2 :: (Urg Deb 1 a, Urg Deb 2 a) ⇒ Stack → a → Bool
d deb 1 2 stack x = u1 deb (push loc stack) x ∧ u2 deb (push loc stack) x

5.4 Mutually recursive functions with type parameters / type class dictionaries

One of the few cases in which GHC Core does not intuitively resemble the original Haskell source is in the treatment of mutually recursive functions with type parameters / type class dictionaries.

By default, the following set of bindings:

f 0 = error ′ "Argh!"
f x = g (x − 1)

g x = f x

desugars into (roughly) the following Core language:

fg tuple = Λa.λd num : Num a →
  let {
    d eq = getEqDict d num

    f lcl = λx : a → case (((≡) a d eq) 0 x ) of
              True  → error ′ "Argh"
              False → g lcl (((−) a d num) x 1)

    g lcl = λx : a → f lcl x
  } in
  (f lcl , g lcl)

f = Λa.λd num : Num a →
  case (fg tuple a d num) of
    (f lcl , g lcl) → f lcl

g = Λa.λd num : Num a →
  case (fg tuple a d num) of
    (f lcl , g lcl) → g lcl

The actual definitions of f and g end up living in f lcl and g lcl inside the let in fg tuple. Hoisting them into this let means that the functions do not need to apply their counterparts to the type variable a and dictionary d num (the arguments to fg tuple) on the recursive call, as they are just in scope. This has obvious benefits in terms of keeping the code size down (it could blow up exponentially otherwise), but also (because the calculation of the Eq dictionary d eq, needed for finding the definition of (≡), becomes cached) maintains the full laziness property that GHC supports. A fuller explanation for this can be found in [4].

However, when we add the stack transform, this occurs:



fg tuple = λstack .Λa.λd num : Num a →
  let {
    d eq = getEqDict d num

    f lcl = λx : a → case (((≡) a d eq) 0 x ) of
              True  → error ′ (push pos stack) "Argh"
              False → g lcl (((−) a d num) x 1)

    g lcl = λx : a → f lcl x
  } in
  (f lcl , g lcl)

f = λstack .Λa.λd num : Num a →
  case (fg tuple (push pos stack) a d num) of
    (f lcl , g lcl) → f lcl

g = λstack .Λa.λd num : Num a →
  case (fg tuple (push pos stack) a d num) of
    (f lcl , g lcl) → g lcl

The stack is modified in f and g when entering fg tuple, and again in f lcl before calling error ′ (the latter causing the non-Haskell-source variable fg tuple to appear in the stack trace). However the stack does not get modified when the recursion between f lcl and g lcl occurs. This means invocations of, say, f 100 and f 0 will produce the same output stacks, despite the fact that a lot of recursion will have happened in the former case.

In theory it could be easy to detect the code structure above and special-case-modify it to pass the call stack as desired. Unfortunately by the time we get to the desugared Core, the link between the tuple fg tuple and the top-level selectors being used to encode mutually recursive functions is gone. There is no way to know that the let-bound f lcl and g lcl are really the implementations of top-level functions.

To get around this, we have added an optional flag to the desugarer to do a more naive translation. However this can result in large code blow-up and duplication, and removes the full laziness property. We present some preliminary results from using this transform in the following section.

An alternative approach would be to add an annotation to the Core as it is being generated to describe the mutually recursive bind. However how this annotation would be persisted in the presence of Core-rewriting optimisations is an open question.

6. Evaluation

Although this work is prototypical and experimental in nature, we have used the nofib [7] benchmark suite to gain an insight into the possible compile and runtime costs of StackTrace on non-erroneous programs. The full logs of the nofib results are available from [1].

We ran the test-suite three times: once using a clean GHC head snapshot, and twice using our patched version of the GHC head, once using only our simple desugaring rule for mutually recursive functions (-fds-simple, see Section 5.4) and once rewriting all sources to pass stacks through (-fexplicit-call-stack-all).

As none of the nofib programs crashes or uses our throwStack function anywhere, we are not going to see call stacks at runtime; however it is useful to see the performance impact of this work when enabled on full programs.

Our prototype implementation was able to compile and run all programs with -fds-simple enabled, and 75 of the 91 programs could be tested under -fexplicit-call-stack-all.

Comparing the original GHC to our modified version with -fds-simple turned on, we see that there is an average of over 11% cost in terms of runtime and memory allocations for just using the simple desugaring strategy (though the maximum increase in time was over thirteen in the multiplier program).

Figure 7. Graph of average runtimes for the erroneous Fibonacci function with and without StackTrace enabled. Axes: Fib(n) against elapsed time (seconds); series: Avg. ECS and Avg. NoECS.

Compile times (excluding those programs that failed to compile) were on average 2.5% slower, although one standard deviation ranged from -18.5% to 28.5%.

Comparing the original GHC to our modified version with -fexplicit-call-stack-all turned on, we see that there is an average of over five times the cost in terms of runtime and memory allocations. Compile times were on average 71% slower, with one standard deviation ranging from 14.0% to 157.4%.

The experiments with the nofib benchmark suite indicate that some work is still necessary in ironing out the bugs in the prototype. There are many different parts in the entirety of the GHC pipeline, and some of the nofib programs have teased out otherwise undiscovered interactions between the pipeline and the changes necessary to enable the stack transform. However, for the vast majority of programs, it is possible to apply our stack passing transform to the entire program, and still run it with a modest, but perfectly acceptable, performance hit.

As a smaller benchmark, we have taken the erroneous fib program from the example in Section 2, and compared its runtime with and without the explicit call stack transform enabled. Our benchmark calls fib with the indicated n, forcing the resulting exception (if there is one). This is done 10,000,000 times in a loop. For each n, we performed this experiment 5 times. The average results are presented graphically in Figure 7.
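
A minimal sketch of such a driver is shown below. The erroneous fib of Section 2 is stood in for here by a deliberately broken base case, and the helper names are illustrative rather than taken from our benchmark harness.

import Control.Exception (SomeException, evaluate, try)
import Control.Monad     (replicateM_)

-- Stand-in for the erroneous fib of Section 2: the base case crashes.
fib :: Int -> Integer
fib 0 = error "fib: base case reached"
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Force fib n once, forcing the exception (and its message) if the call crashes.
runOnce :: Int -> IO ()
runOnce n = do
  r <- try (evaluate (fib n)) :: IO (Either SomeException Integer)
  case r of
    Left  e -> evaluate (length (show e)) >> return ()
    Right _ -> return ()

main :: IO ()
main = replicateM_ 10000000 (runOnce 10)   -- one run, at a single value of n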

Calling fib where n is 1 doesn't call error ′, and indicates there is less than a 20% cost in just adding the call stack information to the program. When n is 10 or greater, the resulting stack from the error is always the same, and calculating it increases the runtime by approximately 180%.

What the results also show is that the overhead is mostly in printing the stack (which most normal use-cases would do only once), as opposed to any calculation that occurs with each push onto the stack, as there is no consistent increase in runtime as the size of the fib argument increases from 10 to 100 to 1000 etc.

There is an increase in performance when n is 0 or 2 compared to when n is 10 or greater with the transform enabled. When n is 0 or 2, the resulting stack is smaller and simpler (it features no recursion) than in the other cases; again this is indicative that the formatting of the stack is much more expensive than the actual creation of the stack.

7. Related Work

There are already several ways of debugging existing Haskell programs. GHC currently ships with an interactive mode that features several debugging facilities [5, 3]. Along with the standard options for setting breakpoints, and inspecting current local variables when execution is paused, it also features a :trace mode, which allows the build-up of a dynamic execution stack.



Currently this is limited to the last 50 execution steps. It is also only available for code that is run in interpreted mode.

The difference in approach, keeping an accurate but bounded stack versus our abstracted stack, has advantages and disadvantages. For cases where the program control flow does not exceed 50 execution steps deep, the accurate stack is certainly more helpful. However a tight loop of more than 50 iterations would remove any of the preceding context, and would not provide any more information beyond the loop having run for over 50 iterations. Our abstracted stack on the other hand would indicate that the loop was re-entered at least once, and would keep the (abstracted) context above the loop. It is possible that some form of hybrid approach that keeps the full stack up to some limit and then starts abstracting away recursion could provide the best of both worlds, which we leave open to future work.

Another existing tool is the Haskell Tracer, HAT [11]. This provides the ability to trace Haskell 98 (plus most common extensions) programs and extract a Redex Trail (a full record of all the reductions that happened in the program). From this Redex Trail, they provide several different views of the trace that can aid in debugging a program. One of these is a call stack (provided through the tool hat-stack). As the authors note, this call stack (and ours) is not the real lazy evaluation stack, but

“gives the virtual stack showing how an eager evaluation model would have arrived at the same result.”

Although building a full Redex Trail could be quite expensive for a large application, HAT is designed to stream this out to disk and thus not cripple performance on large programs. Also of note is the difference in when the tracing code is applied; HAT works by first pre-processing the program, whereas we have integrated directly with GHC. While this in theory gives us the advantage of being able to reasonably easily track new GHC extensions to Haskell (because we are buffered from them by using Core, unlike HAT which has to then upgrade its parser, internal model and other features), we do not yet have a good story for tracing (for example) type-classes, which HAT can currently do perfectly.

It is also possible to re-use the GHC profiling tools in order to get stack traces out of GHC. When profiling, GHC associates the runtime costs (memory / cpu use) to cost centers [8], and it builds up an abstracted stack of these at runtime as different functions are evaluated. The abstraction scheme used is to prune the stack back to the previous entry for a cost center when one is recursively re-entered. When a program crashes, it is possible to acquire the current cost-center stack, and thus get an indication of what the root causes of the crash could be. Although the abstraction scheme employed is somewhat lossy, in practice this is probably not an issue; the success or failure of using the cost center stacks for stack traces depends on the accuracy and resolution of the cost centers themselves. By default GHC creates a single cost center for an entire function definition, and so tracing through individual cases can be tricky. However the user is free to declare a new cost center anywhere by annotating an expression with an SCC pragma.
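
For example, a finer-grained cost center can be introduced around just a failing branch (this snippet is illustrative and not taken from the paper):

-- classify is an illustrative function; only the failing branch gets its own
-- cost center, so a crash there is attributed to "classify_zero" in the
-- cost-center stack rather than to classify as a whole.
classify :: Int -> String
classify n
  | n > 0     = "positive"
  | n < 0     = "negative"
  | otherwise = {-# SCC "classify_zero" #-} error "zero is not classified"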

Another related tool that has an integrated component in GHC is HPC [2] (Haskell Program Coverage). This transforms a Haskell program into one that uses tick boxes to record when expressions are evaluated at runtime, and then allows visualisation of this data in terms of marked-up source code to see which expressions were or were not executed. Unlike our approach of rewriting GHC Core, they perform their transform earlier in the pipeline, just before the Haskell AST is desugared into Core. This means they have a data structure that much more closely resembles the original source program to work with. As a possible alternative target in the pipeline for a fuller implementation, HPC demonstrates that the stage just before Core is a reasonable target.

JHC [6] features an annotation, SRCLOC_ANNOTATE, that instructs the compiler to make any use sites of a function call an alternate version that receives the call-site location. Although this requires more work from the user (they also have to implement the version of the function that is passed call-site information), it is a simple and flexible tool.

8. Conclusions

We have presented StackTrace, our prototype for adding the ability to get stack traces out of crashing GHC-Haskell programs. We have given an intuitive overview of how Haskell programs are rewritten to pass an explicit stack around, and then given details on the actual transformation used on the GHC Core language. Accompanying the stack passing transform is a stack data structure and associated API that models the current call stack, while ensuring bounded heap usage by abstracting away recursively entered functions. We have discussed some current limitations and areas for future work, and presented some initial results from using our work on the nofib benchmark suite.

Acknowledgments

This work was undertaken while Tristan Allwood was on an internship at Microsoft Research Cambridge. We would like to thank Thomas Schilling, Max Bolingbroke and Simon Marlow for long and interesting discussions and guidance during this work. We also wish to thank the anonymous reviewers for their detailed comments. Tristan is supported by EPSRC doctoral funding.

References

[1] T. Allwood, S. Peyton Jones, and S. Eisenbach. Explicit call stack paper resources. http://code.haskell.org/explicitCallStackPaper/.

[2] A. Gill and C. Runciman. Haskell program coverage. In G. Keller, editor, Haskell, pages 1–12. ACM, 2007.

[3] GHC User's Guide. The GHCi debugger. http://www.haskell.org/ghc/docs/latest/html/users_guide/ghci-debugger.html.

[4] S. Peyton Jones and P. Wadler. A static semantics for Haskell. Draft paper, Glasgow, 1991.

[5] S. Marlow, J. Iborra, B. Pope, and A. Gill. A lightweight interactive debugger for Haskell. In G. Keller, editor, Haskell, pages 13–24. ACM, 2007.

[6] J. Meacham. JHC. http://repetae.net/computer/jhc/jhc.shtml.

[7] W. Partain. The nofib benchmark suite of Haskell programs. In J. Launchbury and P. M. Sansom, editors, Functional Programming, Workshops in Computing, pages 195–202. Springer, 1992.

[8] P. Sansom and S. Peyton Jones. Formally based profiling for higher-order functional languages. ACM Transactions on Programming Languages and Systems, 19(1), 1997.

[9] M. Sulzmann, M. M. T. Chakravarty, S. L. Peyton Jones, and K. Donnelly. System F with type equality coercions. In F. Pottier and G. C. Necula, editors, TLDI, pages 53–66. ACM, 2007.

[10] GHC Trac. Annotations. http://hackage.haskell.org/trac/ghc/wiki/Annotations.

[11] M. Wallace, O. Chitil, T. Brehm, and C. Runciman. Multiple-view tracing for Haskell: a new Hat. In R. Hinze, editor, Preliminary Proceedings of the 2001 ACM SIGPLAN Haskell Workshop, pages 151–170, Firenze, Italy, Sept. 2001. Universiteit Utrecht UU-CS-2001-23. Final proceedings to appear in ENTCS 59(2).




Author Index

Allwood, Tristan O. R. .................... 129
Atkey, Robert ............................. 37
Bernardy, Jean-Philippe ................... 49
Bhargavan, Karthikeyan .................... 69
Bolingbroke, Maximilian C. ................ 1
Borgström, Johannes ....................... 69
Brown, Geoffrey ........................... 61
Brown, Neil C. C. ......................... 105
Dijkstra, Atze ............................ 93
Eisenbach, Susan .......................... 129
Elliott, Conal M. ......................... 25
Fokker, Jeroen ............................ 93
Gill, Andy ................................ 117
Goodloe, Alwyn ............................ 61
Gordon, Andrew D. ......................... 69
Jones Jr., Don ............................ 81
Lindley, Sam .............................. 37
Marlow, Simon ............................. 81
Mitchell, Neil ............................ 13
Peyton Jones, Simon L. .................... 1, 129
Pike, Lee ................................. 61
Runciman, Colin ........................... 13
Sampson, Adam T. .......................... 105
Singh, Satnam ............................. 81
Swierstra, S. Doaitse ..................... 93
Yallop, Jeremy ............................ 37