
High Performance Staged Event-Driven Middleware

Alan Cassar and Kevin Vella

Department of Computer Science, University of Malta
{acas034|kevin.vella}@um.edu.mt

Abstract. In this paper, we investigate the design of highly efficient and scalable staged event-driven middleware for shared memory multiprocessors. Various scheduler designs are considered and evaluated, including shared run queue and multiple run queue arrangements. Techniques to maximise cache locality while improving load balancing are studied. Moreover, we consider a variety of access control mechanisms applied to shared data structures such as the run queue, including coarse grained locking, fine grained locking and non-blocking algorithms. User-level memory management techniques are applied to enhance memory allocation performance, particularly in situations where non-blocking algorithms are used. The paper concludes with a comparative analysis of the various configurations of our middleware, in an effort to identify their performance characteristics under a variety of conditions.

1 Introduction

In this paper, we investigate the construction of highly efficient and scalable staged event-driven middleware for shared memory multiprocessors. The staged event-driven architecture (SEDA) that we focus on is a design introduced in [Wel02] for developing massively concurrent services which behave well under heavy loading. Our multiprocessor implementations of SEDA middleware are designed to be highly efficient, through the use of a variety of event queue structures and a range of access control mechanisms including non-blocking algorithms.

The paper begins with a brief overview of SEDA and possible event queue arrangements. We proceed by investigating a selection of our shared queue and multiple queue designs. Following this, we consider the memory management techniques used in our design, and our handling of blocking system calls. We conclude with a comparative analysis of the performance of the various configurations of our middleware.

2 Background

The following section gives a brief introduction to some of the techniques applied to our staged event-driven middleware, including scheduling techniques and performance enhancement techniques.


2.1 Staged Event-Driven Architecture

The staged event-driven architecture (SEDA) is a design introduced in [Wel02] for developing massively concurrent services which behave well under heavy loading. A SEDA application consists of a network of event-driven stages connected by event queues. Dynamic resource controllers are used to ensure the efficient behaviour of stages irrespective of loading variations.
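
As an illustration of this structure, the sketch below shows how a stage might be represented in C: an inbound event queue protected by a lock, plus an event handler that consumes events and may enqueue new ones onto other stages. The type and function names (stage_t, event_t, stage_enqueue) are hypothetical and not taken from the paper.

```c
/* Illustrative sketch of a SEDA-style stage: an event handler plus an
 * inbound event queue; stages forward events to each other's queues.
 * All names (event_t, stage_t, stage_enqueue) are hypothetical. */
#include <pthread.h>
#include <stddef.h>

typedef struct event {
    struct event *next;
    void         *payload;          /* application-specific data        */
} event_t;

typedef struct stage {
    const char     *name;
    event_t        *head, *tail;    /* inbound event queue              */
    pthread_mutex_t lock;           /* coarse-grained queue protection  */
    /* handler consumes one event, possibly emitting events to others  */
    void (*handler)(struct stage *self, event_t *ev);
} stage_t;

/* Append an event to a stage's inbound queue. */
static void stage_enqueue(stage_t *s, event_t *ev)
{
    ev->next = NULL;
    pthread_mutex_lock(&s->lock);
    if (s->tail) s->tail->next = ev; else s->head = ev;
    s->tail = ev;
    pthread_mutex_unlock(&s->lock);
}
```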

2.2 Shared Run Queue

A technique widely used in scheduling algorithms is based on a shared run queue. Cordina [Cor00] chose this technique to implement an SMP version of the MESH user level thread scheduler, which the author called SMP MESH. Debattista [Deb01] experimented with this technique in his user level thread scheduler implementations. In [Vel98], a similar technique is used for an SMP implementation of KRoC [WW96]. In a shared run queue environment, a single queue shared between all processors contains all the schedulable entities, and each processor must compete to acquire one or more items from the queue. Since there is one queue shared between all processors, the greater the number of processors, the higher the contention on the queue. This queue can easily become the bottleneck of the system, especially when the number of processors is large.

2.3 Per-Processor Run Queue

In a per-processor run queue environment, each processor has its own run queue where the schedulable entities are placed. The biggest advantage of this scheduler is that it does not matter how many processors are accessing run queues: since each processor has its own, no synchronisation techniques are required to obtain exclusive access to a queue, and hence there is no need to implement non-trivial lock-free, non-blocking or wait-free algorithms.

Anderson et al. [ALL89] experimented with a per-processor run queue thread scheduler, although the authors kept a global pool for load balancing purposes. Their experimentation showed that their per-processor scheduler performed better than a shared run queue scheduler, mainly due to the absence of contention on a shared run queue and to the locality achieved by scheduling the same threads on the same processor [ALL89].

2.4 Locality and Load Balancing

Locality is the notion of keeping processes as close to their data as possible [Deb01]. One of the best ways to achieve this is to exploit the processors' caches. Schedulers which use the per-processor run queue technique tend to favour locality [Deb01]. On the other hand, shared run queue schedulers usually do not emphasise locality, and explicit techniques such as batching [Vel98] or cohort scheduling [LP01] have to be applied.


Both load balancing and locality are key factors in performance; however, they tend to be mutually exclusive most of the time. A per-processor run queue provides automatic locality management, but when migration policies are applied to balance the load, locality is lost. A shared run queue offers automatic load balancing, but special techniques have to be applied to maximise locality.

2.5 Batching

Vella [Vel98] argues that much of the performance enhancement work studied and evaluated for the KRoC scheduler focused on reducing the explicit cost of context switching, while little attention was paid to implicit costs such as the effect of cache memory, which can easily boost software performance when exploited correctly.

A technique called batching, which tries to maximise cache locality in fine grained multithreaded systems, is introduced in [Vel98]. In batching, the run queue does not hold individual schedulable entities such as threads or processes. Instead, it holds queues of threads, referred to as batches of threads. When a processor requests an entity to execute from the run queue, the processor is given a batch of threads, where each batch has a dispatch time much longer than the dispatch time of a single process or thread.

It is argued that if the batch size is relatively small with respect to the batch's dispatch time, each thread inside the batch will be scheduled several times on the same processor (within the dispatch time of the batch), helping each thread to find most of its relevant data in the cache. If a batch contains a relatively small number of threads, these will not manage to push the data of other threads out of the cache, improving performance drastically. A sketch of this arrangement is given below.
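
The following sketch illustrates the idea under stated assumptions: the run queue holds batches (small queues of threads), and a processor cycles through one batch several times before returning to the run queue. thread_t, batch_t, dequeue_batch and run_thread_slice are hypothetical names, not taken from [Vel98].

```c
/* Sketch of batching: the run queue holds batches (small queues of
 * threads) rather than individual threads; a processor repeatedly
 * schedules the threads of one batch before fetching another batch. */
typedef struct thread  { struct thread *next; /* thread state ... */ } thread_t;
typedef struct batch   { struct batch  *next; thread_t *threads;   } batch_t;

extern batch_t *dequeue_batch(void);            /* from the shared run queue */
extern int      run_thread_slice(thread_t *t);  /* returns 0 when t finishes */

static void run_batch(batch_t *b, int dispatch_rounds)
{
    /* Cycle through the (small) batch several times so each thread is
     * likely to find its working set still in this processor's cache. */
    for (int round = 0; round < dispatch_rounds; round++)
        for (thread_t *t = b->threads; t != NULL; t = t->next)
            run_thread_slice(t);
}
```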

3 Shared Run Queue Middleware

The Shared Run Queue middleware employs only one run queue for the set of available processors, each of which demands work from the same shared queue. Once the middleware is launched, one thread per processor is created and bound to a separate CPU. Figure 1 depicts the design of this scheduler.

Each processor is responsible for dequeuing an available stage from the shared run queue using some synchronisation technique. If the inbound queue of the dequeued stage contains executable events, the processor removes a set of events, re-appends the stage to the run queue, and executes the events as a batch in order to maximise both data and instruction locality. Stages with empty inbound queues are removed from the run queue and added back only when their event queue is populated by at least one event.

Once the run queue runs short of work, each processor blocks after incrementing sleepCounter. This value is consulted every time a new stage is added to the run queue, and if it is greater than zero, a signal is sent to wake up the sleeping processors so that they resume execution. A sketch of this execution cycle follows.
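
The sketch below is a minimal version of this worker loop, assuming a single coarse-grained lock around the run queue and a condition variable behind the sleepCounter mechanism. Apart from sleepCounter, all identifiers are hypothetical, and the real middleware also offers test&test&set and non-blocking run queue variants.

```c
/* Minimal sketch of a shared run queue worker: dequeue a stage, take a
 * batch of its events, re-enqueue the stage if it still has work, then
 * execute the batch outside the lock. */
#include <pthread.h>

#define BATCH_SIZE 16                 /* hypothetical; configurable in reality */

typedef struct stage stage_t;
typedef struct event event_t;

extern pthread_mutex_t rq_lock;
extern pthread_cond_t  rq_nonempty;
extern int             sleepCounter;

extern stage_t *runqueue_dequeue(void);               /* NULL if empty         */
extern void     runqueue_enqueue(stage_t *s);
extern int      stage_take_events(stage_t *s,         /* dequeue up to max     */
                                  event_t **out, int max);
extern void     stage_run_event(stage_t *s, event_t *ev);

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&rq_lock);
        stage_t *s;
        while ((s = runqueue_dequeue()) == NULL) {
            sleepCounter++;                   /* no work: record and go to sleep */
            pthread_cond_wait(&rq_nonempty, &rq_lock);
            sleepCounter--;
        }
        event_t *batch[BATCH_SIZE];
        int n = stage_take_events(s, batch, BATCH_SIZE);
        if (n > 0)
            runqueue_enqueue(s);              /* stage stays schedulable         */
        pthread_mutex_unlock(&rq_lock);

        for (int i = 0; i < n; i++)           /* run the batch for locality      */
            stage_run_event(s, batch[i]);
    }
    return NULL;
}
```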


Fig. 1: Shared Run Queue Middleware

The shared run queue scheduler is configurable in several ways: the run queue and the event queues can be configured to use either a simple queue or a dummy head queue, and the locking mechanism can be selected from the operating system-provided locks and our implementation of the test&test&set algorithm. Moreover, a non-blocking shared run queue implementation is also available.
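
For reference, a test&test&set spinlock of the kind mentioned above can be sketched as follows, here written with GCC/Clang atomic builtins (an assumption; the paper does not show its own implementation).

```c
/* Test&test&set spinlock sketch: spin on a plain read while the lock is
 * held, and only attempt the expensive atomic exchange when it looks free. */
typedef struct { volatile int locked; } ttas_lock_t;

static void ttas_lock(ttas_lock_t *l)
{
    for (;;) {
        while (l->locked)
            ;                                   /* cheap local spinning      */
        if (__sync_lock_test_and_set(&l->locked, 1) == 0)
            return;                             /* we acquired the lock      */
    }
}

static void ttas_unlock(ttas_lock_t *l)
{
    __sync_lock_release(&l->locked);            /* store 0 with release sem. */
}
```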

3.1 Event-Driven Scheduler

In the event-driven scheduler, instead of scheduling stages, events are directly placed on a shared run queue; each processor dequeues these events and executes them, as shown in Figure 2. This behaviour can be simulated by setting the batch size of the previously discussed scheduler to one; however, that scheduler requires three queue operations to dequeue one event (dequeue a stage, dequeue an event and enqueue the stage back on the run queue), where this scheduler requires only one. The aim of this scheduler implementation is to determine whether stage batching is actually a better solution.
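
A sketch of this variant, with hypothetical identifiers, is given below: each event carries a pointer to its stage, so a single dequeue from the shared run queue suffices per event.

```c
/* Event-driven variant: events are scheduled directly, one queue
 * operation per event.  Identifiers are hypothetical. */
typedef struct stage stage_t;
typedef struct event { struct event *next; stage_t *stage; void *payload; } event_t;

extern event_t *global_event_dequeue(void);              /* blocks when empty */
extern void     stage_run_event(stage_t *s, event_t *ev);

static void *event_worker(void *arg)
{
    (void)arg;
    for (;;) {
        event_t *ev = global_event_dequeue();   /* one dequeue per event     */
        stage_run_event(ev->stage, ev);
    }
    return NULL;
}
```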

4 Per-Processor Run Queue Middleware

Every processor has exclusive access to an associated run queue in this scheduler implementation. Moreover, each processor has an associated migration queue used for migration of stages between the different processors in the system. Figure 3 shows the architecture of the Multiple Run Queue Staged Event-Driven Middleware.


Fig. 2: Event-Driven Shared Run Queue Middleware

Each processor starts off an execution cycle by moving any stage found on the migration queue to the run queue. Once this operation is complete, a stage is obtained from the associated run queue with no use of synchronisation, and a set of events is executed as a batch. When no work is available on either the run queue or the migration queue, the processor blocks waiting for new work to reach the migration queue, at which point it will be signalled by the thread which provided the work.

Several different per-processor run queue schedulers are implemented, each with its own characteristics. The following list gives a brief description of each scheduler; a sketch of the execution cycle appears after the list.

Basic Model This is the most primitive per-processor run queue scheduler implementation. When a stage changes state from non-schedulable to schedulable, it will be placed on the run queue of the processor which triggered the operation. Load balancing is an issue with this scheduler implementation.

Maximising Cache Locality This scheduler binds a stage to one processor, where it executes for its entire life. Cache locality is improved because each time an event is executed, it has a greater probability of finding the necessary data in the cache, pulled in by the execution of similar previous events. This scheduler, however, tends to suffer from poor load balancing if the system is not designed well.

Sender Initiated Migration The processor with excess workload triggers the sender initiated migration routine in order to try to move some of its extra workload to other, less loaded processors. This routine is driven by a simple counter and a threshold configurable by the user.

Receiver Initiated Migration This is triggered by the processor which is short of work, in other words, when both its run queue and its migration queue are empty.

Hybrid A scheduler which applies both sender initiated migration and receiver initiated migration to keep the load balanced.
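
The sketch below condenses the per-processor execution cycle and a simple sender-initiated migration check into one routine. The helper names and the fixed threshold are hypothetical; in the middleware the threshold is user-configurable.

```c
/* Per-processor execution cycle sketch: drain the migration queue,
 * optionally shed work (sender-initiated migration), then run a batch
 * of events from the next runnable stage or block waiting for work. */
#define MIGRATION_THRESHOLD 32          /* hypothetical fixed threshold        */

typedef struct stage stage_t;

extern stage_t *local_dequeue(void);                 /* this CPU's run queue   */
extern void     local_enqueue(stage_t *s);
extern stage_t *migration_drain(void);               /* pop one migrated stage */
extern void     migrate_to_least_loaded(stage_t *s); /* sender-initiated move  */
extern int      local_length(void);
extern void     run_stage_batch(stage_t *s);
extern void     wait_for_migrated_work(void);        /* block until signalled  */

static void per_cpu_cycle(void)
{
    stage_t *s;

    /* 1. Move any stages other processors pushed onto our migration queue. */
    while ((s = migration_drain()) != NULL)
        local_enqueue(s);

    /* 2. Sender-initiated migration: shed work once we exceed a threshold.  */
    if (local_length() > MIGRATION_THRESHOLD && (s = local_dequeue()) != NULL)
        migrate_to_least_loaded(s);

    /* 3. Execute a batch of events from the next runnable stage, or block.  */
    if ((s = local_dequeue()) != NULL)
        run_stage_batch(s);
    else
        wait_for_migrated_work();
}
```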


The run queues, event queues and migration queues can be configured as simple queues or dummy head queues. The locking mechanism used for the event queues is also configurable: pthread locks, test&test&set, or non-blocking.

Fig. 3: Per Processor Run Queue Middleware

5 Memory Management

The memory management used in our staged event-driven middleware is very similar to the one presented by Valois [Val96] for the implementation of his non-blocking data structures. However, a race condition in the same author's memory management scheme was identified by Michael and Scott in [MS95], where a solution is also given; this solution was carefully applied to our memory management module.

Each block of memory to be reused has an associated reference count integer, referred to as refCount. This value has a dual purpose: it either stores the number of references pointing to the block in question multiplied by two, or holds the value one to denote that the memory block is deleted and ready for reuse. Memory blocks which are safe to free without causing the ABA problem [Val96] are never actually released using the free system call. Instead, a stack (called the free stack) keeps track of all free blocks in the system, which can be reused by the application whenever required.

If a pointer p references a particular block, a function that needs to read or modify that location does not simply copy the value of p into a temporary pointer and perform the required operations. Instead, it must safely read the block by first incrementing its reference count, and only then reading or modifying its contents. When all operations on the block are complete, the process cannot simply continue its execution without first releasing the block by decrementing its reference count. Furthermore, if the reference count reaches zero after being decremented, the block must be freed and returned to the free stack to be reused by other operations.

One can easily see how the malloc and free system calls are avoided with this solution, but how is the ABA problem solved? If processor n holds a reference to block p, processor n is guaranteed that any operation can be carried out safely on block p, even a compare&swap. This guarantee follows from the definition of our memory management: when processor n obtains a reference to block p, the reference count of p is incremented by two. Since blocks are only freed when their reference count reaches zero, the safe read executed by processor n on block p prevents the reference count from reaching zero, so the block is never freed. If the block is not reused, the ABA problem cannot possibly arise. This is the same solution used by Valois [Val96] to avoid the ABA problem in his non-blocking data structures, as modified by Michael and Scott [MS95] to eliminate the race condition identified by the same authors.
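
The following is a condensed sketch of this safe-read/release protocol written with C11 atomics, including a decrement-and-test-and-set step in the spirit of the Michael and Scott correction. The paper does not show its code; node_t, free_stack_push and the other names are illustrative.

```c
/* Sketch of reference-counted memory reuse: refCount holds twice the
 * number of live references, and the value 1 marks a block that has
 * been pushed onto the user-level free stack. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    _Atomic long           refCount;  /* 2 * (live references); 1 = freed     */
    _Atomic(struct node *) next;
    void                  *data;
} node_t;

extern void free_stack_push(node_t *n);  /* user-level free stack, no free()  */

/* Subtract 2 and, if the count would reach zero, set it to 1 (freed) in
 * the same atomic step, so no new reference can sneak in between.        */
static int decrement_test_and_set(_Atomic long *rc)
{
    long oldv, newv;
    do {
        oldv = atomic_load(rc);
        newv = (oldv - 2 == 0) ? 1 : oldv - 2;
    } while (!atomic_compare_exchange_weak(rc, &oldv, newv));
    return oldv - 2 == 0;                 /* true when this caller reclaims it */
}

/* Drop a reference; recycle the block when no references remain. */
static void release(node_t *q)
{
    if (decrement_test_and_set(&q->refCount))
        free_stack_push(q);
}

/* Safely obtain a counted reference to the block *p points at. */
static node_t *safe_read(_Atomic(node_t *) *p)
{
    for (;;) {
        node_t *q = atomic_load(p);
        if (q == NULL)
            return NULL;
        atomic_fetch_add(&q->refCount, 2);   /* claim a reference ...          */
        if (q == atomic_load(p))             /* ... then confirm p still       */
            return q;                        /*     points at the same block   */
        release(q);                          /* lost a race with a writer      */
    }
}
```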

6 Blocking Operations

In an event-driven system, event handlers are executed until completion. Since only one kernel thread per processor is used by the middleware, blocking one of these threads stalls a processor until the blocking call returns. Most I/O operations, including file I/O and socket I/O, present such a problem. In this section we present the approaches taken to solve this issue and discuss the methods applied to integrate both file I/O and socket I/O into our middleware.

6.1 File I/O

Since we are aiming to achieve asynchronous I/O operations in order not to block our scheduler threads, why not use the technology provided by the operating system itself? Linux kernels 2.5 and later provide support for asynchronous non-blocking I/O operations, known as AIO.


Welsh [Wel02] wrapped each I/O call in a SEDA stage, and in our staged event-driven middleware we also provide the main file I/O operations as stages. When the user requires a file read operation, a file read stage has to be created, where only one such stage is required for any number of operations. The operation starts by sending a special event data structure to the file read stage, which includes data such as the file descriptor, the stage to be notified when the operation is completed, and other relevant information. The file read stage immediately starts an AIO read operation which, upon completion, executes a call-back function. The job of the call-back function is to notify the appropriate stage that the operation is complete by handing over the data read from the file (or any errors which occurred during the operation). The file write operation is handled in almost the same way as the file read operation.
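
The sketch below shows one way such a file read stage could issue an asynchronous read and receive a completion call-back, here using the POSIX aio_read interface with SIGEV_THREAD notification. This is an assumption about the mechanism rather than the paper's actual code; read_request_t and stage_enqueue_event are hypothetical names.

```c
/* File read stage sketch: submit an asynchronous read and, on completion,
 * forward the result to the waiting stage as an ordinary event. */
#include <aio.h>
#include <stdlib.h>
#include <sys/types.h>

typedef struct stage stage_t;
extern void stage_enqueue_event(stage_t *s, void *data, ssize_t len, int err);

typedef struct read_request {
    struct aiocb  cb;        /* kernel AIO control block              */
    stage_t      *notify;    /* stage to receive the completion event */
    char         *buf;
} read_request_t;

/* Runs in a library-managed thread when the AIO operation completes. */
static void read_done(union sigval sv)
{
    read_request_t *req = sv.sival_ptr;
    int     err = aio_error(&req->cb);
    ssize_t n   = aio_return(&req->cb);
    stage_enqueue_event(req->notify, req->buf, n, err);  /* buf passes on */
    free(req);
}

static int submit_read(int fd, size_t len, off_t off, stage_t *notify)
{
    read_request_t *req = calloc(1, sizeof *req);
    if (req == NULL)
        return -1;
    req->buf    = malloc(len);
    req->notify = notify;

    req->cb.aio_fildes = fd;
    req->cb.aio_buf    = req->buf;
    req->cb.aio_nbytes = len;
    req->cb.aio_offset = off;
    req->cb.aio_sigevent.sigev_notify          = SIGEV_THREAD;
    req->cb.aio_sigevent.sigev_notify_function = read_done;
    req->cb.aio_sigevent.sigev_value.sival_ptr = req;

    return aio_read(&req->cb);  /* returns at once; read_done fires later */
}
```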

6.2 Socket I/O

At the time the staged event-driven middleware was being implemented, socket AIO was not a standard feature of the Linux kernel. Due to this, it was decided that socket I/O had to be implemented using a different approach from the one used for file I/O.

The model used resembles the solution adopted by the Flash [PDZ99] web server. For a blocking operation, Flash uses a helper process which blocks until the I/O event is completed and notifies the original process when the operation completes. Instead of using a separate process, we opted for a thread pool from which a thread is acquired and handed the blocking operation to be performed. Furthermore, Flash [PDZ99] uses IPC channels for communication between the main polling loop and the helper processes, which entails a system call each time an event is completed. In our middleware implementation, when a blocking operation terminates, the stage to be notified is sent an ordinary event, just as any stage would send an event to any other stage in the system in order to communicate, and everything is done in user space.
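
A sketch of the thread-pool helper described above follows: a pool thread performs the blocking socket call and then reports completion by sending an ordinary event to the interested stage, entirely in user space. The pool interface and all identifiers are hypothetical.

```c
/* Socket I/O pool worker sketch: take a queued blocking operation,
 * perform it, and notify the waiting stage with a normal event. */
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct stage stage_t;
extern void stage_enqueue_event(stage_t *s, void *data, ssize_t len, int err);

typedef struct sock_op {
    int      fd;
    void    *buf;
    size_t   len;
    stage_t *notify;        /* stage waiting for the result */
} sock_op_t;

extern sock_op_t *pool_next_op(void);   /* blocks until an operation is queued */

/* Body of each thread in the socket I/O pool. */
static void *sock_pool_worker(void *arg)
{
    (void)arg;
    for (;;) {
        sock_op_t *op = pool_next_op();
        ssize_t n = read(op->fd, op->buf, op->len);   /* may block safely here */
        /* Completion is reported as an ordinary inter-stage event, so no
         * extra system call or IPC channel is needed on the fast path.    */
        stage_enqueue_event(op->notify, op->buf, n, n < 0 ? errno : 0);
    }
    return NULL;
}
```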

7 Performance Testing

The following section presents some tests which were performed in order to compare the different schedulers implemented throughout this project. All tests were performed on a dual Intel Xeon processor machine supporting Hyper-Threading technology, with each CPU running at 3.80GHz with 2MB of level 2 cache per processor. This Intel processor family supports the EM64T (Intel Extended Memory 64 Technology) architecture and was running an x86-64 operating system, Red Hat Linux with kernel version 2.6.9-42.ELsmp. All memory allocation was done before the start of the tests to avoid system calls.

7.1 Queue Comparison

The following test was performed on a synthetic benchmark in which a total of 502 stages had to be scheduled.

Each event handler reads 8 bytes out of a 32-byte cache-aligned data structure and writes 4 bytes into the same structure. The scheduler was configured as an event-driven scheduler, and the results can be seen in Figure 4.

Our test&test&set implementation performed better than the operating system-provided locking mechanisms in both the coarse grained locking queue and the finer grained locking queue. Moreover, the dummy head queue performed better than the coarse grained locking queue, due to the separation of the enqueue operation from the dequeue operation. The non-blocking queue was outperformed by the test&test&set implementation; however, we believe that with a larger number of processors available, the non-blocking queue would have performed better than any other implementation available for the event-driven scheduler.

Fig. 4: Event-Driven Scheduler — Different Event Queues

7.2 Batching

The event-driven scheduler never exploits batching, therefore the time taken for the test to complete remains constant regardless of the batching value. It is, however, more efficient than the staged shared run queue scheduler with the batching size set to one, due to the fewer queue operations performed to execute an event.

With a low batching value, the per-processor run queue scheduler outperforms the shared run queue scheduler by almost 6 seconds, and there are two main reasons for this behaviour: the first is the contention on the shared run queue with a low batching value, and the second is cache locality. Binding stages to the same processor helps to decrease the total number of cache misses incurred during the execution of the application. As the batching value is increased, both schedulers' performance improves.


Fig. 5: Scheduler Comparison — Batching

8 Conclusion

We have presented a framework which is highly configurable and general purpose, usable with any staged event-driven application. We have also shown from the results obtained that our middleware implementation is scalable and efficient. File I/O is integrated within our middleware with the help of the operating system-provided AIO, while socket I/O is available through the use of thread pools.

We have also conducted a series of performance tests to identify which scheduler performs best, but we conclude by noting that each application has different requirements and characteristics, and will perform differently on different scheduler implementations.

References

[ALL89] T. E. Anderson, E. D. Lazowska, and H. M. Levy. The performance implications of thread management alternatives for shared-memory multiprocessors. In SIGMETRICS '89: Proceedings of the 1989 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 49–60, 1989.

[Cor00] Joe Cordina. Fast multi-threading on shared memory multiprocessors. Technical report, Department of Computer Science and Artificial Intelligence, University of Malta, May 2000.

[Deb01] Kurt Debattista. High performance thread scheduling on shared memory multiprocessors. Master's thesis, University of Malta, February 2001.

[LP01] James R. Larus and Michael Parkes. Using cohort scheduling to enhance server performance (extended abstract). In LCTES '01: Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, pages 182–187, 2001.


[MS95] Maged M. Michael and Michael L. Scott. Correction of a memory management method for lock-free data structures. Technical Report TR599, University of Rochester, 1995.

[PDZ99] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. Flash: An efficient and portable web server. In Proceedings of the USENIX 1999 Annual Technical Conference, 1999.

[Val96] John David Valois. Lock-Free Data Structures. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, USA, 1996.

[Vel98] Kevin Vella. Seamless Parallel Computing on Heterogeneous Networks of Multiprocessor Workstations. PhD thesis, University of Kent at Canterbury, 1998.

[Wel02] Matthew David Welsh. An Architecture for Highly Concurrent, Well-Conditioned Internet Services. PhD thesis, University of California at Berkeley, 2002.

[WW96] P.H. Welch and D.C. Wood. The Kent Retargetable occam Compiler. In Brian O'Neill, editor, Parallel Processing Developments – Proceedings of WoTUG 19, pages 143–166, Nottingham Trent University, UK, March 1996. World occam and Transputer User Group, IOS Press, Netherlands. ISBN 90-5199-261-0.