
Using Threads in Interactive Systems: A Case Study

Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch and Mark Weiser

Computer Science Laboratory, Xerox PARC

3333 Coyote Hill Road, Palo Alto, California 94304

Abstract. We describe the results of examining two large research and commercial systems for the ways that they use threads. We used three methods: analysis of macroscopic thread statistics, analysis of the microsecond spacing between thread events, and reading the implementation code. We identify ten different paradigms of thread usage: defer work, general pumps, slack processes, sleepers, one-shots, deadlock avoidance, task rejuvenation, serializers, encapsulated fork and exploiting parallelism. While some, like defer work, are well known, others have not been previously described. Most of the paradigms cause few problems for programmers and help keep the resulting system implementation understandable. The slack process paradigm is both particularly effective in improving system performance and particularly difficult to make work well. We observe that thread priorities are difficult to use and may interfere in unanticipated ways with other thread primitives and paradigms. Finally, we glean from the practices in this code several possible future research topics in the area of thread abstractions.

1. Introduction

Threads sharing an address space are becoming more widely available in popular operating systems such as Solaris 2.x, OS/2 2.x, Windows NT and SysVR4 [Powell91][Custer93]. Xerox PARC's Cedar

We thank all the hundreds of Cedar and GVX programmers who made this work possible. Alan Demers created the PCR thread package and participated in many discussions regarding thread primitives and scheduling. John Corwin and Chris Uhler assisted us in building and using instrumented versions of GVX. Portions of this work were supported by Xerox, and by ARPA under contract DABT63-91-C-0027.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

SIGOPS '93/12/93/N.C., USA

© 1993 ACM 0-89791-632-8/93/0012...$1.50

programming environment and Xerox's STAR, ViewPoint, GlobalView for X Windows and DocuPrint products have used lightweight threads for over 10 years [Smith82][Swinehart86]. This paper reports on an inspection and analysis of the code developed at Xerox in an attempt to detect common paradigms and common mistakes of programming with threads. Our analysis is both static, from reading many lines of ancient and modern modules, and dynamic, using a number of tools we built for detailed thread inspection. We distinguish two systems that developed independently, although most of our remarks apply to both. One, originally a research system but now underlying a number of products as well, we call Cedar. The other, a product system originally derived from the same base as Cedar but relatively unconnected to it for over ten years, we call GVX.

We believe that the systems we examined are the largest and longest-used thread-based interactive systems in everyday use in the world. They are both in use and under continual development. They contain approximately 2.5 million lines of code, in 10,000 modules written by hundreds of people, declaring more than 1000 monitors and monitored record types and over 300 condition variables. They mostly run as very large (working sets of tens of megabytes) shared-memory, multi-application systems. They have been ported to a variety of processors and multiprocessors, including the .25 to 5 MIPS Xerox D-machines in the 1980's and 5 to 100 MIPS processors of the early 1990's. Our examination of them reveals the kinds of thread facilities and practices that the programmers of these systems found useful and also the kinds of thread programming mistakes that even this experienced community still makes.

Our analysis is subject to two unavoidable biases. First, the programmers of these systems may not be representative—for instance, one community consisted primarily of PhD researchers. Second, any group of people develops habits of work—idioms—that may be unrelated to best practice, but are simply "how we do it here". Although our report draws on code written by two relatively independent communities, comparing and contrasting the two is beyond the scope of this paper.


We examined the systems running on the Portable Common Runtime on top of Unix [Weiser89]. PCR provides the usual primitives: threads sharing a single address space, monitor locks, condition variables, fork and join. Section 2 describes the primitives in more detail.

Although these systems do run on multiprocessors, this paper emphasizes the role of threads in program structuring rather than how they are used to exploit multiprocessors [Owicki89].

This paper is not, for the most part, a statistical analysis of thread behavior. We focus mostly on the microscopic view of individual paradigms and critical path performance. For example, the time between when a key is pressed and the corresponding glyph is echoed to a window is very important to the usability of these systems. Developing an understanding of how individual threads are interacting in accomplishing this task helps to understand the observed performance.

Information for this paper came from three sources. An initial analysis of the dynamic behavior of our systems in terms of macroscopic thread statistics is presented in Section 3. A more detailed analysis of dynamic behavior was obtained from microsecond-resolution information gathered about thread events and scheduling events in the Unix kernel. The kinds of thread events we examined included forks, yields, (thread) scheduler switches, monitor lock entries, and condition variable waits. Finally, to develop a database of static uses of threads we used grep to locate all uses of thread primitives and then read the surrounding code. This reading led us to further searching for uses of modules that provide specialized access to threads (for example, the Cedar package PeriodicalFork). The understanding gained from these three information sources eventually led to the classifications of thread paradigms described in Section 4.

Sections 5 and 6 present some engineering lessons—both for implementors using threads and for implementors of thread systems—from this study. Our conclusions and some suggestions for future work are given in Section 7.

2. Thread model

Lampson and Redell describe the Mesa language's thread model and provide rationale for many of the design choices [Lampson80]. Here, we summarize the salient features as used in our systems.

The Mesa thread model supports multiple, lightweight, pre-emptively scheduled threads that share an address space. The FORK operation creates a new thread to carry out the FORK's procedure-invocation argument. FORK returns a thread value. The JOIN operation on a thread value returns the value returned by the corresponding FORK's procedure invocation. A thread may be JOINed at most once. If a thread will not be JOINed it should be DETACHed, which tells the thread implementation that it can recover the resources of the thread when it terminates.
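
These primitives map closely onto most modern thread libraries. Below is a minimal sketch of the FORK/JOIN pattern in Python, as an illustrative analog rather than Mesa itself: Python's Thread.join does not return a value, so the sketch captures the procedure's result explicitly, and the helper names fork and join_value are our own invention.

```python
import threading

def fork(proc, *args):
    """Analog of FORK: run proc(*args) in a new thread, capturing its result."""
    result = {}
    def run():
        result["value"] = proc(*args)
    t = threading.Thread(target=run)
    t.start()
    return t, result

def join_value(handle):
    """Analog of JOIN: wait for the thread and return the procedure's value."""
    t, result = handle
    t.join()
    return result["value"]

handle = fork(lambda x: x * x, 7)
print(join_value(handle))  # 49
```

The Mesa DETACH operation corresponds roughly to starting a thread and simply never joining it (daemon threads in Python play a similar resource-recovery role).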

The language provides monitors and condition variables for synchronizing thread activities. A monitor is a set of procedures, or module, that share a mutual exclusion lock, or mutex. The mutex protects any data managed by the module by ensuring that only a single thread is executing within the module at any instant. Other threads wanting to enter the monitor are enqueued on the mutex. The Mesa compiler automatically inserts locking code into monitored procedures. A variant on this scheme, associating locks with data structures instead of with modules, is occasionally used in order to obtain finer grain locking.
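
In languages without compiler-inserted monitor locking, the same discipline can be imitated by hand: one mutex per module, taken on entry to every procedure. A hedged Python sketch (the BoundedCounter module is invented for illustration):

```python
import threading

class BoundedCounter:
    """A 'monitored module': every entry procedure takes the module's one
    mutex, so at most one thread executes inside the monitor at any instant."""
    def __init__(self, limit):
        self._lock = threading.Lock()   # the monitor's mutex
        self._limit = limit
        self._count = 0

    def increment(self):
        with self._lock:                # the locking Mesa's compiler would insert
            if self._count < self._limit:
                self._count += 1
            return self._count

    def value(self):
        with self._lock:
            return self._count
```

The finer-grain variant mentioned above corresponds to putting the lock inside each data structure instance, as here, rather than one lock shared by all instances of the module.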

Condition variables (CVs) give more explicit control of thread scheduling. Each CV represents a state of the module's data structures (a condition) and a queue of threads waiting for that condition to become true. A thread uses the WAIT operation on a CV if it has to wait until the condition holds. WAIT operations may time out depending on the timeout interval associated with the CV. A thread uses NOTIFY or BROADCAST to signal waiting threads that the condition has been achieved. The compiler enforces the rule that CV operations are only invoked with the monitor lock held. The WAIT operation atomically releases the monitor lock and adds its calling thread to the CV's wait queue. NOTIFY causes a single thread that is waiting on the CV's wait queue to become runnable—exactly one waiter wakens behavior. (Note that some thread packages define their analog of NOTIFY to have at least one waiter wakens behavior [Birrell91].) BROADCAST causes all threads that are waiting on the CV to become runnable. In either case, threads must compete for the monitor's mutex before reentering the monitor.

Unlike the monitors originally described by Hoare [Hoare74], the Mesa thread model does not guarantee that the condition associated with a CV is satisfied when a WAIT completes. If BROADCAST is used, for example, a different thread might acquire the monitor lock first and change the state of the program. Therefore a thread is responsible for rechecking the condition after each WAIT. Thus, the prototypical use of WAIT is inside a WHILE loop that checks the condition, not inside an IF statement that would only check the condition once. Programs that obey the "WAIT only in a loop" convention are insensitive to whether NOTIFY has at least one waiter wakens behavior or exactly one waiter wakens behavior as described above. Indeed, under this convention BROADCAST can be substituted for NOTIFY without affecting program correctness, so NOTIFY is just a performance hint.
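
The "WAIT only in a loop" convention carries over directly to modern condition-variable APIs. A Python sketch of a one-slot buffer (the class and method names are ours): because take rechecks its condition in a while loop, the program is correct whether the producer uses notify (Mesa's NOTIFY) or notify_all (BROADCAST).

```python
import threading

class Slot:
    """A one-slot buffer illustrating the Mesa wait-in-a-loop convention."""
    def __init__(self):
        self._lock = threading.Lock()
        self._nonempty = threading.Condition(self._lock)  # the CV

    # The CV is associated with the lock, as Mesa ties CVs to a monitor.
        self._item = None

    def put(self, item):
        with self._lock:
            self._item = item
            self._nonempty.notify()   # notify_all (BROADCAST) would also be correct

    def take(self):
        with self._lock:
            while self._item is None:   # WAIT only in a loop, never a bare if
                self._nonempty.wait()
            item, self._item = self._item, None
            return item
```

If take used `if` instead of `while`, a spurious or stolen wakeup could let it read an empty slot; the loop makes the wakeup a hint, exactly as the paper describes.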

Threads have priorities that affect the scheduler. The scheduler runs the highest priority runnable thread, and if there are several runnable threads at the highest priority then round-robin is used among them. If a system event causes a higher priority thread to become runnable, the scheduler will preempt the currently running thread, even if it holds monitor locks. There are 7 priorities in all, with the default being the middle priority (4). Typically, lower priority is used for long running, background work, while higher priority is used for threads associated with devices or aspects of the user interface, keeping the system responsive for interactive work. A thread's initial priority is set when it is created. It can change its own priority.

The timeslice interval and the CV timeout granularity in the current implementation are each 50 milliseconds. The scheduler runs at least that often, but also runs each time a thread blocks on a mutex, waits on a CV, or calls the YIELD primitive. The only purpose of the YIELD primitive is to cause the scheduler to run (but see discussion later of YieldButNotToMe). The scheduler takes less than 50 microseconds to switch between threads on a Sparcstation-2.

3. Dynamic thread behavior

One of the original motivations for our work was a desire to understand the dynamic behavior of user-level threads. For this purpose, we constructed an instrumented version of PCR that measured the number of threads in the system, thread lifetimes, the run length distribution of threads and the rate at which monitor locks and condition variables are used. Analysis of the data obtained from this instrumented system led to the realization that there were a number of consistent patterns of thread usage, which led to the static analysis on which this paper is focused. To give the reader some background and context for this static analysis, we present a summary of our dynamic data below. This data is based on a set of benchmarks intended to be typical of user activity, including compilation, formatting a document into a page description language (like PostScript), previewing pages described by a page description language and user interface tasks (keyboarding, mousing and scrolling windows). All data was taken on a Sparcstation-2 running SunOS 4.1.3.

Looking at the dynamic thread behavior, we observed several different classes of threads. There were eternal threads that repeatedly waited on a condition variable and then ran briefly before waiting again. There were worker threads that were forked to perform some activity, such as formatting a document. Finally, there were short-lived transient threads that were forked by some long-lived thread, would run for a relatively short while and then exit.

A Cedar or GVX world uses a moderate number of threads. Consider Cedar first: an idle Cedar system has about 35 eternal threads running in it and forks a transient thread once a second on average. Keyboard activity can cause up to 5 thread forks per second, although most other user-interface activity causes much smaller increases in thread forking rates. While one of our benchmark applications (document formatting) employed large numbers of transient threads (forking 3.6 threads/sec.), the other two compute-intensive applications we examined caused thread-forking activity to decrease by more than a factor of 3. In all our benchmarks, the maximum number of threads concurrently existing in the system never exceeded 41, although users employ two to three times this many in everyday work. Transient threads are by far the most numerous, resulting in an average lifetime for non-eternal threads that is well under 1 second.

An idle GVX world exhibits noticeably different behavior than just described. An idle system contains 22 eternal threads and forks no additional threads. In fact, no additional threads are forked for any user interface activity, be it keyboard, mouse, or windowing activity.

Table 1: Forking and thread-switching rates

                        Forks/sec   Thread switches/sec
Cedar
  Idle Cedar               0.9            132
  Keyboard input           5.0            269
  Mouse movement           1.0            191
  Window scrolling         0.7            172
  Document formatting      3.6            171
  Document previewing      1.6            222
  Make program             0.3            170
  Compile                  0.3            135
GVX
  Idle GVX                 0               33
  Keyboard input           0               60
  Mouse movement           0               34
  Window scrolling         0               43

The rate at which a Cedar system switches among running threads varies from 130/sec. for an idle system to around 270/sec. for a system experiencing heavy keyboard/mouse input activity. Thread execution intervals (the lengths of time between thread switches) exhibit a peak at about 3 milliseconds, with about 75% of all execution intervals being between 0 and 5 milliseconds in length. This is due to the very short execution intervals of most eternal and transient threads. A second peak is around 45 milliseconds, which is related to the PCR time-slice period, which is 50 milliseconds. Transient and eternal thread activity


steals the first part of a timeslice with the remainder going to worker threads.

While most execution intervals are short, longer execution intervals account for most of the total execution time in our systems. Between 20% and 50% of the total execution time during any period is accumulated by threads running for periods of 45 to 50 milliseconds. We also examined the total execution time contribution as a function of thread priority. Only two patterns were evident: of the 7 available priority levels one wasn't used at all, and user interface activity tended to use higher priorities for its threads than did user-initiated tasks such as compiling.

GVX switches among threads at a decidedly lower rate: an idle system switches only 33 times per second, while heavy keyboard activity will drive the rate up to 60/sec. The same bimodal distribution of execution intervals is exhibited as in Cedar: between 50% and 70% of all execution intervals are between 0 and 5 milliseconds in length with a second peak around 45 milliseconds. Between 30% and 80% of the total execution time during any period is accumulated by threads running for periods of 45 to 50 milliseconds.

GVX's use of thread priorities was noticeably different than Cedar's. While Cedar's core of long-lived threads are relatively evenly distributed over the four "standard" priority values of 1 to 4, GVX sets almost all of its threads to priority level 3, using the lower two priority levels only for a few background helper tasks. Two of the five low-priority threads in fact never ran during our experiments. As with Cedar, one of the 7 priority levels is never used. However, while Cedar uses level 7 for interrupt handling and doesn't use level 5, GVX does the opposite. In both systems, priority level 6 gets used by the system daemon that does proportional scheduling. Cedar also uses level 6 for its garbage collection daemon.

One interesting behavior that our Cedar thread data exhibited was a variety of different forking patterns. An idle Cedar system forks a transient thread about once every 2 seconds. Each forked thread, in turn, forks another transient thread. Keyboard activity causes a transient thread to be forked by the command-shell thread for every keystroke. On the other hand, simply moving the mouse around causes no threads to be forked. Even clicking a mouse button (e.g. to scroll a window) causes no additional forking activity. (However, both keyboard activity and mouse motion cause significant increases in activity by eternal threads.) Scrolling a text window 10 times causes 3 transient threads to be forked, one of which is the child of one of the other transients.

Document formatting causes a great number of transient threads to be forked by the main formatting worker thread, whereas compilation and document previewing cause a moderate number of transient forks. While the compiler's and previewer's transient threads simply run to completion, each of the document formatter's transient threads forks one or more additional transient threads itself. However, third generation forked threads do not occur. In fact, none of our benchmarks exhibited forking generations greater than 2. That is, every transient thread was either the child or grandchild of some worker or long-lived thread.

Checking whether a program needs recompiling (the Make program) does not cause any threads to be forked (the command-shell thread gets used as the main worker thread), except for garbage collection and finalization of collected data objects. Each of these activities causes a moderate number of first-generation transient threads to be forked.

Table 2: Wait-CV and monitor entry rates

                        Waits/sec   % timeouts   ML-enters/sec
Cedar
  Idle Cedar               121         82%            414
  Keyboard input           185         48%           2557
  Mouse movement           163         58%           1025
  Window scrolling         115         69%           2032
  Doc formatting           130         72%           2739
  Doc previewing           157         56%           1335
  Make program             158         61%           2218
  Compile                  119         82%           1365
GVX
  Idle GVX                  32         99%            366
  Keyboard input            38         42%           1436
  Mouse movement            33         96%            410
  Window scrolling          25         61%            691

The rate at which locking and condition variable primitives are used is another measure of thread activity. Table 2 shows the rates for each benchmark. The rate of waiting on CVs in Cedar ranged from 115/second to 185/second, with 50% to 80% of these waits timing out rather than receiving a wakeup notification. Monitors are entered much more frequently, reflecting their use to protect data structures (especially in reusable library packages). Entry rates varied from 400/second for an idle system to 2500/second for a system experiencing heavy keyboard activity to 2700/second for document formatting. Contention was low, however, occurring on 0.01% to 0.1% of all entries to monitors.

For GVX, the rate of waiting on CVs ranged from 32/second to 38/second, with 42% to 99% of these waits timing out rather than receiving a wakeup notification. Monitors are entered at rates between 366/sec and 1436/sec. Interestingly, contention for monitor locks was sometimes significantly higher in


GVX than in Cedar, occurring 0.4% of the time when scrolling a window and 0.2% of the time when heavy keyboard traffic was present.

Table 3: Number of different CVs and monitor locks used

                        # CVs   # MLs
Cedar
  Idle Cedar               22     554
  Keyboard input           32     918
  Mouse movement           26     734
  Window scrolling         30     797
  Document formatting      46    1060
  Document previewing      32     938
  Make program             24    1296
  Compile                  36    2900
GVX
  Idle GVX                  5      48
  Keyboard input            7     204
  Mouse movement            5      52
  Window scrolling          6     209

Typically, most of the monitor/condition variable traffic is observed in about 10 to 15 different threads, with the worker thread of a benchmark activity dominating the numbers. The other active threads exhibit approximately equal traffic. The number of different monitors entered during the benchmarks varies from 500 to 3000 as shown in Table 3. In contrast, only about 20 to 50 different condition variables are waited for in the course of the benchmarks.

4. Thread paradigms

Birrell provides a good introduction to some of the basic paradigms for thread use [Birrell91]. Here we go further into more advanced paradigms, their necessity and frequency in practice and the problems of performance and correctness entailed by these advanced usages.

Birrell suggests that forking a new thread is useful in several situations: to exploit concurrency on a multiprocessor (including waiting for I/O, where one processor is the I/O device); to satisfy a human user by making progress on several tasks at once; to provide network service to multiple clients simultaneously; and to defer work to a less busy time [Birrell91, p. 109].

We examined about 650 different code fragments that create threads. We gradually developed a collection of categories that we could use to explain how thread uses were similar to one another and how they differed. Our final list of categories is:

– defer work (same as Birrell's)

– pumps (components of Birrell's pipelines. He describes them for exploiting multiprocessing, but we saw them mostly used for structuring.)

– slack processes, a kind of specialized pump (new)

– sleepers and one-shots (common uses probably omitted by Birrell because synchronization problems in them are rare)

– deadlock avoiders (new)

– task rejuvenation (new)

– serializers, another kind of specialized pump (new)

– concurrency exploiters (same as Birrell's)

– encapsulated forks, which are forks in packages that capture certain paradigms and whose uses are themselves counted in the other categories.

These static categories complement the thread lifetime characterization in Section 3. The eternal threads tend to be sleepers, pumps and serializers with nothing to do. Worker threads are often work deferrers and transient threads are often deadlock avoiders. But the reader is cautioned that the static paradigm can't be predicted from the dynamic lifetime.

4.1 Defer work

Deferring work is the single most common use of forking in these systems. A procedure can often reduce the latency seen by its clients by forking a thread to do work not required for the procedure's return value. Sometimes work can be deferred to times when the system is under less load [Birrell91]. Cedar practice has been to introduce work deferrers freely as the opportunity to use them is noticed. Many commands fork an activity whose results will be reported in a separate window: control in the originating thread returns immediately to the user, an example of latency reduction for the human client. Some examples of work deferrers are:

– forking to print a document

– forking to send a mail message

– forking to create a new window

– forking to update the contents of a window

Some threads are themselves so critical to system responsiveness that they fork to defer almost any work at all beyond noticing what work needs to be done. These critical threads play the role of interrupt handlers. Forking the real work allows it to be done in a lower priority thread and frees the critical thread to respond to the next event. The keyboard-and-mouse watching process, called the Notifier, is such a critical, high priority thread in both Cedar and GVX.
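
A minimal Python sketch of the defer-work pattern just described (the event strings and handler are our invention, not Cedar's Notifier): the critical watcher thread only notices each event and forks the real work, so it can return to the next event immediately.

```python
import threading

def handle_keystroke(char, echoed):
    # the deferred, potentially slow work (e.g. repainting a window)
    echoed.append(char)

def notifier(events, echoed):
    """Critical thread's loop: notice the event, fork the work, move on."""
    workers = []
    for char in events:
        w = threading.Thread(target=handle_keystroke, args=(char, echoed))
        w.start()             # defer work: back to watching events immediately
        workers.append(w)
    for w in workers:         # gathered here only so the sketch terminates cleanly
        w.join()

echoed = []
notifier(list("hi"), echoed)
```

In the real systems the forked work would also run at a lower priority than the watcher, a dimension this sketch does not model.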

4.2 Pumps

Pumps are components of pipelines. They pick up input from one place, possibly transform it in some way and produce it as output someplace else.¹ Bounded buffers and external devices are two common sources and sinks. The former occur in


several implementations in our systems for connecting threads together, while the latter are accessed with system calls (read, write) and shared memory (raw screen IO and memory shared with an external X server).

Though Birrell suggests creating pipelines to exploit parallelism on a multiprocessor, we find them most commonly used in our systems as a programming convenience, another reflection of the uniprocessor heritage of the systems. This is also their primary use in the well-known Unix shell pipelines. For example, in our systems all user input is filtered through a pipeline thread that preprocesses events and puts them into another queue, rather than have each reader thread preprocess on demand. Neither approach necessarily provides a more efficient system, but the pipeline is conceptually simpler: tokens just appear in a queue. The programmer needs to understand less about the pieces being connected.
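
A pump can be sketched as a thread sitting between two queues, transforming items as they pass through. The event strings and the str.upper "preprocessing" below are our stand-ins for the real input-event pipeline:

```python
import queue, threading

def pump(source, transform, sink):
    """A pump: pick up input from one place, transform it, put it someplace else."""
    while True:
        item = source.get()
        if item is None:          # sentinel: end of stream, pass it downstream
            sink.put(None)
            return
        sink.put(transform(item))

# A preprocessing stage between raw input events and their readers.
raw, cooked = queue.Queue(), queue.Queue()
threading.Thread(target=pump, args=(raw, str.upper, cooked)).start()
for event in ["key a", "key b", None]:
    raw.put(event)
```

To the reader thread, preprocessed tokens "just appear" in the cooked queue; it needs no knowledge of the stage that produced them.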

One interesting kind of pump is the slack process. A slack process explicitly adds latency to a pipeline in the hope of reducing the total amount of work done, either by merging input or replacing earlier data with later data before placing it on its output. Slack processes are useful when the downstream consumer of the data incurs high per-transaction costs. The buffer thread discussed in Section 5.2 is an example of a slack process.
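
A slack process can be sketched as a pump that deliberately waits a short interval after input arrives, replacing earlier data with later data before forwarding anything downstream. The replacement policy, names and timings below are our illustration, not the paper's buffer thread:

```python
import threading, time

class SlackProcess:
    """Adds latency to a pipeline, keeping only the newest pending value, so a
    consumer with high per-transaction costs sees far fewer transactions."""
    def __init__(self, consumer, delay=0.05):
        self._lock = threading.Lock()
        self._cv = threading.Condition(self._lock)
        self._pending = None
        self._consumer = consumer
        self._delay = delay
        threading.Thread(target=self._pump, daemon=True).start()

    def put(self, value):
        with self._lock:
            self._pending = value        # later data replaces earlier data
            self._cv.notify()

    def _pump(self):
        while True:
            with self._lock:
                while self._pending is None:
                    self._cv.wait()
            time.sleep(self._delay)      # the deliberate slack
            with self._lock:
                value, self._pending = self._pending, None
            self._consumer(value)
```

During the sleep, any number of put calls collapse into one delivery, trading a little latency for much less downstream work.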

4.3 Sleepers and one-shots

Sleepers are processes that repeatedly wait for a triggering event and then execute. Often the triggering event is a timeout. Examples include: call this procedure in K seconds; blink the cursor in M milliseconds; check for network connection timeout every T seconds. Other common events are external input and service callbacks from other activities. For instance, our systems use callbacks from the garbage collector to finalize objects and callbacks from the filesystem when files change state. These callbacks are removed from time-critical paths in the garbage collector and filesystem by putting an event in a work queue serviced by a sleeper thread. The client's code is then called from the sleeper.

Sleepers frequently do very little work before sleeping again. For instance, various cache managers in our systems simply throw away aged values in a cache and then go back to sleep.

Another kind of sleeper is the garbage collector’s background thread, which cleans pages dirtied by other threads. If it gets too far behind in its work it could cause virtual memory thrashing by cleaning pages no longer resident in physical memory. Its rate of awakening must depend on the amount of page dirtying, which depends on the workload of all the other threads in the system.

¹We use the term pump, rather than the more common filter, because data transformation, a la filters, is just one of the things that pumps can do. In general, pumps control both the data transformation and the timing of the transfer. We also like the connotation of an active entity conveyed by pump, as opposed to the passivity of filter.

OneShots are sleeper processes that sleep for a while, run and then go away. This paradigm is used repeatedly in Cedar, for example, to implement guarded buttons of several kinds. (A guarded button must be pressed twice, in close, but not too close, succession. They usually appear with their label struck through on the screen.) After a one-shot is forked it sleeps for an arming period that must pass before a second click is acceptable. Then it changes the button appearance from the struck-through label to “Button” and sleeps a second time. During this period a second click invokes a procedure associated with the button, but if the timeout expires without a second click, the one-shot just repaints the guarded button.
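A hedged sketch of such a one-shot (illustrative class, labels, and timings; the monitor locking that real code would need is elided for brevity):

```python
import threading, time

class GuardedButton:
    """Sketch of the one-shot behind a guarded button. The first press
    forks the one-shot; presses during the arming period are ignored;
    a press inside the confirmation window fires the action."""
    def __init__(self, action, arm_delay=0.05, window=0.5):
        self.action = action
        self.arm_delay = arm_delay
        self.window = window
        self.label = "~Button~"            # guarded appearance
        self._confirm = None               # an Event while armed
        self._running = False

    def press(self):
        if self._confirm is not None:
            self._confirm.set()            # second press, inside the window
        elif not self._running:
            self._running = True           # first press: fork the one-shot
            threading.Thread(target=self._one_shot).start()
        # presses during the arming period fall through and are ignored

    def _one_shot(self):
        time.sleep(self.arm_delay)         # first sleep: the arming period
        self._confirm = threading.Event()
        self.label = "Button"              # armed appearance
        confirmed = self._confirm.wait(self.window)   # second sleep
        self._confirm = None
        if confirmed:
            self.action()                  # invoke the button's procedure
        else:
            self.label = "~Button~"        # timed out: repaint the guard
        self._running = False              # the one-shot goes away
```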

4.4 Deadlock avoiders

Cedar often uses FORK to avoid violating lock order constraints. The window manager makes heavy use of this paradigm. For example, after adjusting the boundary between two windows the contents of the windows must be repainted. The boundary-moving thread forks new threads to do the repainting because it already holds some, but not all, of the locks needed for the repainting. Acquiring these locks would require unwinding the adjusting process far enough to release locks that would violate locking order, then reacquiring all the necessary locks in the right order. It is far simpler to fork the painting threads, unwind the adjuster completely and let the painters acquire the locks that they need in separate threads.
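The shape of the paradigm can be sketched as follows (illustrative Python with hypothetical window names; real Cedar code involves many more locks and a real locking order):

```python
import threading

window_locks = {"left": threading.Lock(), "right": threading.Lock()}
repainted = []

def repaint(name):
    # A forked painter acquires, from scratch, only the lock it needs.
    with window_locks[name]:
        repainted.append(name)

def adjust_boundary():
    """Hedged sketch of the deadlock-avoider: the adjuster works under
    one window lock; rather than acquire the other lock out of order,
    it unwinds completely and forks the painters."""
    with window_locks["left"]:
        pass  # ...move the boundary under the lock we already hold...
    # fully unwound: no locks held when the painters are forked
    painters = [threading.Thread(target=repaint, args=(n,))
                for n in ("left", "right")]
    for p in painters:
        p.start()
    for p in painters:
        p.join()
```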

Another case of deadlock avoidance is forking the callbacks from a service module to a client module. Forking permits the service thread to proceed, eventually releasing locks it holds that will be needed by the client. The fork also insulates the service from things that may go wrong in the client callback. For instance, Cedar permits clients to register callback procedures with the garbage collector that are called to finalize (clean up) data structures. The finalization service thread forks each callback.

4.5 Task rejuvenation

Sometimes threads get into bad states, such as arise from uncaught exceptions or stack overflow, from which recovery is impossible within the thread itself. In many cases, however, cleanup and recovery is possible if a new “task rejuvenation” thread is forked. For uncaught errors, an exception handler may simply fork a new copy of the service. For stack overflow, a new thread is forked to report the stack overflow. Using threads for task rejuvenation can be tricky and is a bit counter-intuitive. (This thread is in trouble. OK, let’s make two of them!) However, it is a paradigm that adds significantly to the robustness of our systems and its use is growing. A recent addition is a task-rejuvenating FORK that was added to the input event dispatcher in Cedar. The dispatcher makes unforked callbacks to client procedures because (a) this code is on the critical path for user-visible performance and (b) most callbacks are very short (e.g. enqueue an event) and so a fork overhead would be significant. But not forking makes the dispatcher vulnerable to uncaught runtime errors that occur in the callbacks. Using task rejuvenation, the new copy of the dispatcher keeps running.
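In outline, a rejuvenating dispatcher looks like this (a hedged Python sketch, not Cedar’s code; the unforked-callback loop and the forked replacement are the essential points):

```python
import queue, threading, time

def dispatcher(events, handled):
    """Hedged sketch of a task-rejuvenating dispatcher: callbacks run
    unforked for speed, and an uncaught error forks a fresh copy of
    the dispatcher before this one unwinds."""
    try:
        while True:
            callback = events.get()
            if callback is None:       # sentinel: orderly shutdown
                return
            callback()                 # unforked: fast, but may raise
            handled.append(callback)
    except Exception:
        # rejuvenation: this thread is in trouble, so make another one
        threading.Thread(target=dispatcher, args=(events, handled)).start()
```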

Task rejuvenation is a controversial paradigm. Its ability to mask underlying design problems suggests that it be used with caution.

4.6 Serializers

A serializer is a queue and a thread that processes the work on the queue. The queue acts as a point of serialization in the system. The primary example is in the window system, where input events can arrive from a number of different sources. They are handled by a single thread in order to preserve their ordering. This same paradigm is present in most other window systems and in many cases it is the only paradigm. In the Macintosh, Microsoft Windows, and X programming models, for example, each application runs in a serializer thread that pulls events from a queue associated with the application’s window.
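A minimal sketch of the serializer paradigm (illustrative Python; the class and its sentinel-based shutdown are assumptions of the sketch, not the Mesa interface):

```python
import queue, threading

class Serializer:
    """Events from many threads funnel through one queue and are
    processed by one thread, preserving arrival order."""
    def __init__(self):
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def enqueue(self, proc):
        self.q.put(proc)               # any thread may enqueue

    def _run(self):
        while True:
            proc = self.q.get()
            if proc is None:           # sentinel: shut down
                return
            proc()                     # called in the order received

    def shutdown(self):
        self.q.put(None)
        self.thread.join()
```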

4.7 Concurrency exploiters

Concurrency exploiters are threads created specifically to make use of multiple processors. They tend to be very problem-specific in their details. Since our systems have only relatively recently begun to run on multiprocessors, we were not surprised to find very few concurrency exploiters in them.

4.8 Encapsulated forks

One way that our systems promote use of common thread paradigms is by providing modules that encapsulate the paradigms. This section describes three such packages used frequently in our systems.

DelayedFork and PeriodicalFork

DelayedFork expresses the paradigm of a one-shot. It calls a procedure at some time in the future. Although one-shots are common in our system, DelayedFork is only used in our window systems. (Its limited use might be because it appeared only recently.)

PeriodicalFork is simply a DelayedFork that repeats over and over again at fixed intervals. It encapsulates the sleeper paradigm where the wakeups are prompted solely by the passage of time.
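The two encapsulations can be approximated as follows (a hedged Python sketch; `delayed_fork` and `periodical_fork` are illustrative names, not the Cedar interfaces):

```python
import threading, time

def delayed_fork(proc, delay):
    """One-shot encapsulation: call proc once, delay seconds from now,
    in its own thread."""
    t = threading.Timer(delay, proc)
    t.start()
    return t

def periodical_fork(proc, interval, stop):
    """Timeout-driven sleeper encapsulation: repeat proc at fixed
    intervals until the stop event is set."""
    def run():
        while not stop.wait(interval):   # sleeps; wakes on timeout or stop
            proc()
    t = threading.Thread(target=run)
    t.start()
    return t
```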

MBQueue

The module MBQueue (the name means Menu/Button Queue) encapsulates the serializer paradigm in our systems. MBQueue creates a queue as a serialization context and a thread to process it. Mouse clicks and key strokes cause procedures to be enqueued for the context; the thread then calls the procedures in the order received.

We consider a queue together with a thread processing it an important building block of user interfaces. This is borne out by the fact that our system actually contains several minor variations of MBQueue. Why is there not a more general package? It seems that each instance adds serialization to a specific interface already familiar to the programmer. Furthermore, the serialization is often on a critical interactive performance path. Keeping a familiar interface to the programmer and reducing latency cause new variations to be preferred over a single generic implementation.

Miscellaneous

Many modules that do callbacks offer a fork boolean parameter in their interface, indicating whether the called-back procedure is to be called directly or in a forked thread. The default is almost always TRUE, meaning the callback will be forked. Unforked callbacks are usually intended for experts, because they make future execution of the calling thread within the module dependent on successful completion of the client callback.
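The convention might be sketched as follows (a hypothetical `call_back` helper for illustration, not an actual interface from our systems):

```python
import threading

def call_back(proc, fork=True):
    """Sketch of the fork-parameter convention: callbacks are forked by
    default; fork=False runs proc in the caller's thread, making the
    caller's further progress depend on proc completing successfully."""
    if fork:
        t = threading.Thread(target=proc)
        t.start()
        return t              # caller may join, or just continue
    proc()                    # expert mode: direct, unforked call
    return None
```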

4.9 Paradigm summary

Table 4 summarizes the absolute and relative frequencies of the paradigms in our systems. Note that in keeping with our emphasis on using threads as program structuring devices this is a static count. (Threads may be counted in more than one category because they change their behavior.) Some threads seem not to fit easily into any category. These are captured in the “Unknown or other” entries.

Our categories seem to apply well to other systems. For instance, Lampson and Redell describe the applications Pilot, Violet and Gateway (which, although Mesa-based, share no code with the systems we examined), but do not identify general types of threads [Lampson80]. Yet from the published description one can deduce the following:

Pilot: almost all sleepers.

Violet: sleepers, one-shots and work deferral.

Gateway: sleepers and pumps.


Table 4. Static Counts

                          Cedar          GVX
  Defer work            108   31%     77   33%
  Pumps
    General pumps        48   14%     33   14%
    Slack processes       7    2%      2    1%
  Sleepers               67   19%     15    6%
  Oneshots               25    7%     11    5%
  Deadlock avoiders      35   10%      6    3%
  Task rejuvenators      11    3%      0    0%
  Serializers             5    1%      7    3%
  Encapsulated forks     14    4%      5    2%
  Concurrency exploiters  3    1%      0    0%
  Unknown or other²      25    7%     78   33%

  TOTAL                 348  100%    234  100%

5. Issues in thread use

Given a modern platform providing threads, system builders have the option of using thread primitives to accomplish tasks that they would have used other techniques to accomplish on other platforms. The designer must balance the modest cost of creating a thread against the benefits in structural simplification and concurrency that would accrue from its introduction. In addition to the cost of creating it, a thread incurs a modest ongoing cost for the virtual memory occupied by its stack. Our PCR thread implementation allocates virtual memory for the maximum possible stack size of each thread. If there is very little state associated with a thread this may be a very inefficient use of memory [Draves91].

5.1 Easy thread uses

Our programmers have become very adept at using the sleeper, oneshot, pump (in paths without critical timing constraints) and work deferrer paradigms. For the most part these require little interaction between threads beyond taking care to provide mutual exclusion (monitors) for shared data. Pump threads interact with threads on either side of them in pipelines, but the interactions generally follow well-known producer-consumer patterns. Using FORK to create sleeper threads has fallen into disfavor with the advent of the PCR thread implementation: 100 kilobytes for each of hundreds of sleepers’ stacks is just too expensive. The PeriodicalProcess module, for timeout-driven sleepers, and other sleeper encapsulations often can accomplish the same thing using closures to maintain the little bit of state necessary between activations.

²The large number of unknown threads in GVX is due to our relative unfamiliarity with this code, rather than reflecting any significant difference in paradigm use.

Deadlock avoiders also are usually very simple, but the overall locking schemes in which they are involved are often very, very complicated (and far beyond the scope of this paper).

5.2 Hard thread uses

Some of the thread paradigms seem much more difficult. First, there is little guidance in the literature and in our experience for using concurrency exploiters in interactive systems. The arrival of relatively low-cost workstations and operating systems supporting multiprocessing offers new incentives to understand this paradigm better.

Second, designing slack processes and other pumps in paths with timing constraints continues to challenge both the slack process implementors and the PCR implementors. Bad performance in the pipelines that provide feedback for typing (character echoing) and mouse motion is immediately apparent to the user, yet improving the performance is often very difficult.

One instance of a poorly performing slack process in a user feedback pipeline involves sending requests to the X server. Good X window system performance requires batching communication with the server and merging overlapping requests. In one of our systems, the batching is performed using the slack process paradigm embodied in a high priority thread. The buffer thread accumulates paint requests, merges overlapping requests and sends them only occasionally to the X server. In the usual producer-consumer style, an imaging thread puts paint requests on a queue for the buffer thread and issues a NOTIFY to wake it up. In order to gather multiple paint requests the buffer thread must not act on the single paint request in its queue when it first wakes up, so it YIELDs to cede the processor to allow more paint requests to be produced by the imaging thread; or so we hope. A problem occurs when the buffer thread is a higher priority thread than the image threads that feed it: the scheduler always chooses the buffer thread to run, not the image thread. Consequently the buffer thread sends the paint request on to the X server and no merging occurs. The result is a high rate of thread and process switching and much more work done by the X server than should be necessary.

Fixing the problem by lowering the priority of the buffer thread is clearly wrong, since that could cause starvation of screen painting. We believe that the architecture of having a producer thread notify a consumer (buffer) thread periodically of work to do is a good one, so we did not consider changing the basic paradigm by which these two threads interact. Rewriting the buffer thread to simply sleep waiting for a timeout before sending its events doesn’t work either, for reasons discussed in Section 6.3.

We fixed the immediate problem by creating a new yield primitive, called YieldButNotToMe, which gives the processor to the highest priority ready thread other than its caller, if such a thread exists. Most of the time the image thread is the thread favored with the extra cycles and there is a big improvement in the system’s perceived performance. Fewer switches are made to the X server, the buffer thread becomes more effective at doing merging, there is less time spent in thread and process switching, and the image thread gets much more processor resource over the same time interval. The result is that the user experiences about a three-fold performance improvement. Unfortunately, the semantic compromise entailed in YieldButNotToMe and the uncertainty that the additional cycles will go to the correct thread suggest that the final solution to these problems still eludes us.

Finally, even beyond the difficulties encountered with priorities in managing a pipeline, we found priorities to be problematic in general. Birrell describes a stable priority inversion in which a high priority thread waits on a lock held by a low priority thread that is prevented from running by a middle-priority cpu hog [Birrell91, pp. 99-100]. Like Birrell, we chose not to incur the implementation overhead of providing priority inheritance from blocked threads to threads holding locks. (Notice that the problem occurs for abstract resources such as the condition associated with a CV as well as for real resources such as locks: the thread implementation has little hope of automatically adjusting thread priority in such situations.) The problem is not hypothetical: we experienced enough real problems with priority inversions that we found it necessary to put the following two workarounds into our systems. First, for one particular kind of lock in the system, PCR does donate cycles from a blocked thread to the thread that is blocking it. This is done only for the per-monitor metalock that locks each monitor’s queue of waiting threads. It is not done for monitors themselves, where we don’t know how to implement it efficiently. Second, PCR utilizes a high-priority sleeper thread (which we call the SystemDaemon) that regularly wakes up and donates, using a directed yield, a small timeslice to another thread chosen at random. In this way we ensure that all ready threads get some cpu resource, regardless of their priorities.

5.3 Common mistakes

Our dynamic and static inspections of old code revealed occasional correctness and performance problems caused by improper thread usage. Two questionable practices stood out.

First, we saw many instances of WAIT code that did not recheck the predicate associated with the condition variable. Recall that proper use of WAIT when using Mesa monitors is

WHILE NOT (condition) DO WAIT cv END

not the

IF NOT (condition) THEN WAIT cv

which would be appropriate with Hoare’s originalmonitors.

The IF-based approach will work in Mesa with sufficient constraints on the number and behavior of the threads using the monitor, but its use cannot be recommended. The practice has been a continuing source of bugs as programs are modified and the correctness conditions become untrue.
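The WHILE-based discipline carries over directly to modern condition variables, which likewise permit wakeups while the predicate is still false; a minimal sketch in Python (the producer/consumer names are illustrative):

```python
import threading, time

items = []
cv = threading.Condition()

def consumer(out):
    with cv:
        # Mesa-style monitors: a wait may return while the predicate is
        # still false, so the condition is re-tested in a loop, not an IF.
        while not items:
            cv.wait()
        out.append(items.pop())

def producer(x):
    with cv:
        items.append(x)
        cv.notify()    # the awakened consumer rechecks the predicate
```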

Second, there were cases where timeouts had been introduced to compensate for missing NOTIFYs (bugs), instead of fixing the underlying problem. The problem with this is that the system can become timeout driven: it apparently works correctly, but slowly. Debugging the poor performance is often harder than figuring out why a system has stopped due to a missing NOTIFY. Of course, legitimate timeouts can mask an omitted NOTIFY as well.

5.4 When a fork fails

Sometimes an attempt to fork may fail for lack of resources. As with many other resource allocation failures, it’s difficult to plan a response. Earlier versions of the systems would raise an error when a FORK failed: the standard programming practice was to catch the error and try to recover, but good recovery schemes seem never to have been worked out. The resulting code seems overly complex to no good end: the machinery for catching the error is always set up even though once an error is caught nobody really knows what to do about it. Memory allocation failures present similar problems.

Our more recent implementations simply wait in the fork implementation for more resources to become available, but the behaviors seen by the user, such as long delays in response or even complete unresponsiveness, go unexplained.

A number of techniques may be adopted to avoid running out of resources; unfortunately, the better we succeed at that, the less experience we gain with techniques for appropriately recovering from the situation.

5.5 On robustness in a changing environment

The archeology of thread-related bugs taught us two other things worth mentioning as well. First, we found many instances of timeouts and pauses with ridiculous values. These values presumably were chosen with some particular now-obsolete processor speed or network architecture in mind. For user interface-related timeouts, values based on wall-clock time are appropriate, but timeouts related to processor speeds, or more insidiously, to expected network server response times, are more difficult to specify simply for all time. This may be an area of future research. For instance, dynamically tuning application timeout values based on end-to-end system performance may be a workable solution.

Second, we saw several places where the correctness of threaded code depended on strong memory ordering, an assumption no longer true in some modern multiprocessors with weakly ordered memory [Frailong93][Sites93]. The monitor implementation for weak ordering can use memory barrier instructions to ensure that all monitor-protected data access is consistent, but other uses that would be correct with strong ordering will not work. As a simple example, imagine a thread that once a minute constructs a record of time-date values and stores a pointer to that record into a global variable. Under the assumptions of strong ordering and atomic write of the pointer value, this is safe. Under weak ordering, readers of the global variable can follow a pointer to a record that has not yet had its fields filled in. As another example, Birrell offers a performance hint for calling an initialization routine exactly once [Birrell91, p. 97]. Under weak ordering, a thread can both believe that the initializer has already been called and not yet be able to see the initialized data.

5.6 Some contrasting experiences: multithreading and X windows

The usual, single-threaded client interface to an X server is through Xlib, a library that translates an X client’s procedural interface into the message-oriented interface of the X server. Xlib does the required translation and manages the I/O connection to the server, both for reading events and writing commands. We studied two approaches to using X windows from a multi-threaded client. One approach uses Xlib, modified only to make it thread-safe [Schmidtmann93]. The other approach uses Xl, an X client library designed from scratch with multi-threading in mind [Jacobi92].

A major difference between the two approaches concerns management of the I/O connection to the server. Xl introduced a new serializing thread that was associated with the I/O connection. The job of this thread was solely to read from the I/O connection and dispatch events to waiting threads. In contrast, the modified Xlib allowed any client thread to do the read, with a monitor lock on the library providing serialization. There were two problems with this: priority inversion and honoring the clients’ timeout parameter on the GetEvent routine. When a client thread blocks on the read call it holds the library mutex. A priority inversion could occur if the thread were preempted. Furthermore, it is not possible for other threads to timeout on their attempt to obtain the library mutex. Therefore, each read had to be done with a short timeout, after which the mutex was released, allowing other threads to continue. In contrast, with the introduction of a reading thread, the client timeout is handled perfectly by the condition variable timeout mechanism and priority inversion can only occur during the short time period when a low-priority thread checks to see if there are events on the input queue.

A second benefit of introducing this thread concerns interaction between output requests and input events. The X specification requires that the output queue be flushed whenever a read is done on the input stream. This ensures that any commands that might trigger a response are delivered to the server before the client waits on the reply. The modified Xlib retained this behavior, but the short timeout on the read operations (to handle the problem described above) caused an excessive number of output flushes, defeating the throughput gains of batching requests. With the introduction of a reading thread, however, there is no need to couple the input and output together. The reading thread can block indefinitely, and other mechanisms such as an explicit flush by clients or a periodic timeout by a maintenance thread ensure that output gets flushed in a timely manner.

Another difference in the two approaches is the introduction of an extra thread for the batching of graphics requests. Both systems do batching at a higher level to eliminate unnecessary X requests. Xl uses the slack process mentioned previously. It makes the connection to the server asynchronous in order to improve throughput, especially when performing many graphics operations. The modified Xlib uses external knowledge of when the painting is finished to trigger a flush of the batched requests. This limits asynchronous graphics operations and leads to a few superfluous flushes.

In summary, this example illustrates the benefit of introducing an additional thread to help manage concurrency and interactions with external I/O events.

6. Issues in thread implementation

6.1 Spurious lock conflicts

A spurious lock conflict occurs between a thread notifying a CV and the thread that it awakens. Birrell describes its occurrence on a multiprocessor: the scheduler starts to run the notified thread on another processor while the notifying thread, still running on its processor, holds the associated monitor lock. The notifyee runs for a few microseconds and then blocks waiting for the monitor lock. The cost of this behavior is that of useless trips through the scheduler made by the notifyee’s processor [Birrell91].

We observed this phenomenon even on a uniprocessor, where it occurs when the waiting thread has higher priority than the notifying thread. Since the Mesa language does not allow condition variable notifies outside of monitor locks, Birrell’s technique of moving the NOTIFY out of the locked region was not applicable. In our systems the fix (defer processor rescheduling, but not the notification itself, until after monitor exit) was made in the runtime implementation. The changed implementation of NOTIFY prevents the problem both in the case of interpriority notifications and on multiprocessors.

6.2 Priorities

PCR approximates a strict priority scheduler, by which we mean that if a process of a given priority is currently scheduled to run on a processor, no process with higher priority is ready to run. As we have seen in Section 5.2, strict priority is not a desirable model on which to run our client code: a model that provides some cpu resource to all runnable threads has proven necessary to overcome stable priority inversions. Similarly, the YieldButNotToMe and SystemDaemon hacks, also described above, violate strict priority semantics yet have proven useful in making our system perform well. The SystemDaemon hack pushes the thread model a bit in the direction of proportional fair-share scheduling (threads at each priority progress at a rate proportional to a function of the current distribution of threads among priorities), a model intuitively better suited to controlling long-term average behavior than to controlling moment-by-moment processor allocation to meet near-real-time requirements.

We do not regard this as a satisfactory state of affairs. These implementation hacks mean that the thread model is incompletely specified with respect to priorities, adversely affecting our ability to reason about existing code and to provide guidance for engineering new code. Priority inversions and techniques for avoiding them are the subjects of considerable research in the realtime computing context [Sha90][Pilling91]. We believe that someone should investigate these techniques for interactive systems and report on the result.

6.3 The effect of the time-slice quantum

Only after several months of study of the individual thread switching events of the X server slack process did we realize the importance of the time-slice quantum. The end of a timeslice ends the effect of a YieldButNotToMe or a directed yield. What we did not realize for a long time is that it is the 50 millisecond quantum that is clocking the sending of the X requests from the buffer thread. That is, the only reason this performs well is that the quantum is 50 milliseconds. For instance, if the quantum were 1 second, then X events would be buffered for one second before being sent and the user would observe very bursty screen painting. If the quantum were 1 millisecond, then the YieldButNotToMe would yield only very briefly and we would be back to the start of our problems again.

Above we mentioned that it does not work to rewrite the buffer thread to sleep for a timed interval instead of doing a yield. The reason is that the smallest sleep interval is the remainder of the scheduler quantum. Our 50 millisecond quantum is a little bit too long for snappy keyboard echoing and line drawing, both instances where immediate response is more important than the throughput improvement achieved by buffering. However, if the scheduler quantum were 20 milliseconds, using a timeout instead of a yield in the buffer thread would work fine.

We conclude that the choice of scheduler quantum is not to be taken lightly in the design of interactive thread systems, since it can severely affect the performance of different, correct multiprogramming algorithms.

7. Conclusions

We have analyzed two interactive computing systems that make heavy use of light-weight threads and have been in daily use for many years by many people. The model of threads both these systems use is one based on preemptable threads that use monitors and condition variables to control thread interactions. Both systems run on top of the Portable Common Runtime, which provides preemptable user-level threads based on a mostly priority-based scheduling model.

Our analysis has focused on how people use threads for program structuring rather than for achieving multiprocessor performance. As such, we have focused more on a static analysis of program code, coupled with an analysis of thread micro-behavior, than on macroscopic thread runtime statistics.

These systems exemplify some paradigms that may be useful to the thread programmer who is ready for more advanced thread uses, such as slack processes, serializers, deadlock avoiders and task rejuvenators.

These systems also show that even very experienced communities may struggle at times to use threads well. Some thread paradigms using monitor mechanisms are easy for programmers to use; others, such as slack processes and priorities, challenge both application programmers and thread system implementors.

One of our major conclusions is a suggestion for future work in this area: there is still much to be learned from a careful analysis of large systems. The more unusual paradigms described in this paper, such as task rejuvenation and deadlock avoidance, arose from a relatively small community over a small number of years. There are likely other innovative uses of threads waiting to be discovered.

Another area of future work is to explore the work from the real-time scheduling community in the context of large, interactive systems. Both strict priority scheduling and fair-share priority scheduling seem to complicate rather than ease the programming of highly reactive systems.

Finally, reading code and microscopic analysis taught us new things about systems we had created and used over a ten year period. Even after a year of looking at the same 100 millisecond event histories we are seeing new things in them. To understand systems it is not enough to describe how things should be; one also needs to know how they are.

Bibliography

[Accetta86] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, M. Young. “Mach: A New Kernel Foundation for UNIX Development.” Proceedings of the Summer 1986 USENIX Conference, July 1986.

[Bier92] E. Bier. “EmbeddedButtons: Supporting Buttons in Documents.” ACM Transactions on Information Systems, 10(4), October 1992, pp. 381–407.

[Birrell91] A. Birrell. “An Introduction to Programming with Threads.” In Systems Programming with Modula–3, G. Nelson, editor. Prentice Hall, 1991, pp. 88–118.

[Custer93] H. Custer. Inside Windows NT. Microsoft Press, 1993.

[Draves91] R. Draves, B. Bershad, R. Rashid, R. Dean. “Using Continuations to Implement Thread Management and Communication in Operating Systems.” Proceedings of the 13th ACM Symposium on Operating Systems Principles, in Operating Systems Review, 25(5), October 1991.

[Frailong93] J. Frailong, M. Cekleov, P. Sindhu, J. Gastinel, M. Splain, J. Price, A. Singhal. “The Next–Generation SPARC Multiprocessing System Architecture.” Proceedings of COMPCON93.

[Halstead90] R. Halstead, Jr., D. Kranz. “A Replay Mechanism for Mostly Functional Parallel Programs.” DEC Cambridge Research Lab Technical Report 90/6, November 13, 1990.

[Jacobi92] C. Jacobi. “Migrating Widgets.” Proceedings of the 6th Annual X Technical Conference, in The X Resource, Issue 1, January 1992, p. 157.

[Lampson80] B. Lampson, D. Redell. “Experience with Processes and Monitors in Mesa.” Communications of the ACM, 23(2), February 1980.

[Owicki89] S. Owicki. “Experience with the Firefly Multiprocessor Workstation.” Research Report 51, Digital Equipment Corp. Systems Research Center, September 1989.

[Pier88] K. Pier, E. Bier, M. Stone. “An Introduction to Gargoyle: an Interactive Illustration Tool.” In J.C. van Vliet (editor), Document Manipulation and Typography, Proceedings of the Int’l Conference on Electronic Publishing, Document Manipulation and Typography, Nice (France), April 20–22, 1988. Cambridge University Press, 1988, pp. 223–238.

[Pilling91] M. Pilling. “Dangers of Priority as a Structuring Principle for Real–Time Languages.” Australian Computer Science Communications, 13(1), February 1991, pp. 18–1 - 18–10.

[Powell91] M. Powell, S. Kleiman, S. Barton, D. Shah, D. Stein, M. Weeks. “SunOS Multi–thread Architecture.” Proceedings of the Winter 1991 USENIX Conference, January 1991, pp. 65–80.

[Scheifler92] R. Scheifler, J. Gettys. X Window System: The Complete Reference to Xlib, X Protocol, ICCCM, XLFD. Third edition. Digital Press, 1992.

[Schmidtmann93] C. Schmidtmann, M. Tao, S. Watt. “Design and Implementation of a Multi–Threaded Xlib.” Proceedings of the Winter 1993 USENIX Conference, January 1993, pp. 193–204.

[Sha90] L. Sha, J. Goodenough. “Real–Time Scheduling Theory and Ada.” IEEE Computer, 23(4), April 1990, pp. 53–62.

[Sites93] R. Sites. “Alpha AXP Architecture.” Communications of the ACM, 36(2), February 1993, pp. 33–44.

[Smith82] D. Smith, C. Irby, R. Kimball, B. Verplank, E. Harslem. “Designing the STAR User Interface.” BYTE Magazine, 7(4), April 1982, pp. 242–282.

[Swinehart86] D. Swinehart, P. Zellweger, R. Beach, R. Hagmann. “A Structural View of the Cedar Programming Environment.” ACM Transactions on Programming Languages and Systems, 8(4), October 1986.

[Weiser89] M. Weiser, A. Demers, C. Hauser. “The Portable Common Runtime Approach to Interoperability.” Proceedings of the 12th ACM Symposium on Operating Systems Principles, in Operating Systems Review, 23(5), December 1989.
