
Second International Workshop on Operating Systems, Programming Environments and Management Tools for

High-Performance Computing on Clusters (COSET-2)

June 19, 2005 Cambridge, Massachusetts (USA)

Held in conjunction with ACM International Conference on Supercomputing

(ICS05)

WORKSHOP PROCEEDINGS


Second International Workshop on Operating Systems, Programming Environments and Management Tools for

High-Performance Computing on Clusters

(COSET-2)

Clusters are not only the most widely used general-purpose high-performance computing platform for scientific computing; according to recent results on the top500.org site, they have also become the dominant platform for high-performance computing today. While the cluster architecture is attractive with respect to price/performance, there still exists great potential for efficiency improvements at the software level. System software requires improvements to better exploit the cluster hardware resources. Programming environments need to be developed with both cluster efficiency and programmer productivity in mind. Administrative processes need refinement, both for efficiency and for effectiveness, when dealing with numerous cluster nodes. The goal of this one-day workshop is to bring together a diverse community of researchers and developers from industry and academia to facilitate the exchange of ideas, to discuss the difficulties and successes in this area, and to present recent innovative results in the development of cluster-based operating systems and programming environments as well as management tools for the administration of high-performance computing clusters.


COSET-2 WORKSHOP CO-CHAIRS

Stephen L. Scott
Oak Ridge National Laboratory
P. O. Box 2008, Bldg. 5600, MS-6016, Oak Ridge, TN 37831-6016
email: [email protected]
http://www.csm.ornl.gov/~sscott/
voice: 865-574-3144, fax: 865-576-5491

Christine A. Morin
IRISA/INRIA
Campus universitaire de Beaulieu, 35042 Rennes cedex, France
email: [email protected]
http://www.irisa.fr/paris
voice: +33 2 99 84 72 90, fax: +33 2 99 84 71 71

PROGRAM COMMITTEE

Ramamurthy Badrinath, HP, India
Amnon Barak, Hebrew University, Israël
Jean-Yves Berthou, EDF R&D, France
Brett Bode, Ames Lab, USA
Ron Brightwell, SNL, USA
Toni Cortès, UPC, Spain
Narayan Desai, ANL, USA
Christian Engelmann, ORNL, USA
Graham Fagg, University of Tennessee, USA
Paul Farrell, Kent State University, USA
Andrzej Goscinski, Deakin University, Australia
Liviu Iftode, Rutgers University, USA
Chokchai Leangsuksun, Louisiana Tech University, USA
Laurent Lefèvre, INRIA, France
Renaud Lottiaux, INRIA, France
John Mugler, ORNL, USA
Raymond Namyst, Université de Bordeaux 1, France
Thomas Naughton, ORNL, USA
Rolf Riesen, SNL, USA
Michael Schoettner, University of Ulm, Germany
Assaf Schuster, Technion, Israël
Gil Utard, Université de Picardie, France
Geoffroy Vallée, INRIA, France


COSET-2 Program

9:00 Opening remarks
9:15 Invited talk introduction
9:30 Invited Talk: Cluster of clusters with OSCAR

Eric Focht, NEC Europe. The OSCAR clustering infrastructure is designed for homogeneous clusters of similar nodes. The talk shows one possible approach for extending OSCAR to enable it to easily deal with heterogeneous setups, for example a centrally managed cluster of subclusters with nodes of different architectures. The setup, installation and administration methods for such a cluster are explained. The additional tools are also described: a subcluster class which uses the C3 toolkit and its scalable features, administration commands which use "pull" methods and journals, and simple Ganglia-based (sub)cluster-membership functions. The extensions are integrated in HPCL, an OSCAR-based clustering stack provided by NEC HPC Europe.

10:30 Coffee break
11:00 Session 1: Data Management

A New Approach to Cost-Aware Caching in Heterogeneous Storage Systems Liton Chakraborty, Ajit Singh (University of Waterloo)

Design and Implementation of HTSFS: High-throughput & Scalable File System

Wenguo Wei (Guangdong Polytechnic Normal University), Shoubin Dong, Lin Zhang (Guangdong Key Laboratory of Computer Network, South China University of Technology)

12:00 Lunch (on own)


COSET-2 Program

1:30 Session 2: High availability

Scaling Out OpenSSI Bruce Walker (HP), Jaideep Dharap (HP & UCLA)

Asymmetric Active-Active High Availability for High-end Computing

Chokchai (Box) Leangsuksun (Louisiana Tech University), Venkata Kiriti (Kit) Munganuru (Louisiana Tech University), Tong Liu (Dell Inc.), Stephen L. Scott (ORNL), Christian Engelmann (ORNL, The University of Reading)

High Availability for Ultra-Scale High-End Scientific Computing

Christian Engelmann (ORNL, The University of Reading), Stephen L. Scott (ORNL)

3:00 Coffee Break
3:30 Session 3: Performance Optimization

A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines Samuel Thibault (LABRI)

Remote-Write Communication Protocol for Clusters and Grids

Ouissem Ben Fredj, Eric Renault (GET/INT)

4:30 Session 4: Tools

Efficient Parallel Shell Georges-André Silber (Centre de recherche en informatique, Ecole des Mines de Paris)

5:00 - Closing remarks


A New Approach to Cost-Aware Caching in Heterogeneous Storage Systems

Liton Chakraborty, Ajit Singh
Dept. of Electrical and Computer Engineering
University of Waterloo, ON, Canada, N2L 3G1
{litonc@swen, asingh@etude}.uwaterloo.ca

ABSTRACT
Modern single and multi-processor computer systems incorporate, either directly or through a LAN, a number of storage devices with diverse performance characteristics. These storage devices have to deal with workloads of unpredictable burstiness. A storage-aware caching scheme, one that partitions the cache and maintains one partition per disk, is necessary in this environment. Moreover, maintaining a proper size for these partitions is crucial. Adjusting the partition sizes after each epoch (a certain time interval) assumes that the workload in the subsequent epoch will show characteristics similar to those observed in the current epoch. However, in an environment with highly bursty and time-varying workloads, such an approach is optimistic. In this paper, we present a caching scheme that continuously adjusts the partition sizes, forgoing any periodic activity. Experimental results show the effectiveness of our approach. We show that our scheme achieves better performance when compared with an existing storage-aware caching scheme.

Keywords
Heterogeneous storage, caching, disk array

1. INTRODUCTION
Modern computer systems, particularly large multi-processor servers, interact with a broad and diverse set of storage devices. Access to local disks, remote file servers such as NFS and AFS, archival storage on tapes, read-only compact disks, and network-attached disks is a common phenomenon nowadays. Disk arrays, in which disks of different age and performance parameters may be incorporated, are commonly used with the aim of reducing disk latencies. Thus, the behavior and properties of the storage devices are diverse.

Though this set of devices is disparate, one similarity is inherent among all of them: the time to access them is high, especially compared to the CPU cache and memory latency. Due to the cost of fetching blocks from the storage media, caching blocks in main memory often reduces the execution time of individual applications and increases overall system performance, often by an order of magnitude.

However, while storage technology has changed dramatically over the past few decades, the caching policy has undergone relatively minor changes, with most operating systems employing LRU or LRU-like algorithms to decide which block to replace. The problem with these algorithms is that they are cost oblivious: all blocks are treated as if they were fetched from identically performing devices, and a block can be re-fetched with the same replacement cost as any other block. Unfortunately, this assumption is increasingly problematic, as there are many device types, each with a rich set of performance characteristics. As a simple example, consider a block fetched from a local disk compared to one fetched from a remote, highly contended file server. In this case, the operating system should most likely prefer the block from the local disk for replacement, since the file server block is far more expensive to re-fetch.

Within such heterogeneous systems, caching algorithms should be aware of the non-uniform replacement costs of the various blocks in the cache. As the slowest device in such a heterogeneous environment roughly determines the throughput of the system, storage-aware caching attempts to balance work across devices by adjusting the stream of disk requests [2]. Hence, in such an environment, the caching scheme considers both workload and device characteristics to filter requests.

The storage-aware caching scheme in [2], herein referred to as Forney's algorithm, attempts to balance work across devices; it partitions the cache, assigns one partition to each device, and determines the partition sizes that lead to balanced work. In this report, we address the issue of maintaining the cache partitions in this algorithm, and propose a simple and efficient solution.

2. MOTIVATION AND THE PROBLEM
In Forney's algorithm, the behavior of the caching system is observed during an epoch, and at the end of each epoch the cache is repartitioned based on that observation. During the epoch, especially toward its end, the performance may degenerate severely: one partition may experience a large amount of delay while another partition may hold cache blocks that are no longer pivotal. So, the length of the epoch, or window size (W), has a significant role in the efficacy of the algorithm. The performance could be rendered smooth by making the epoch length very small. But making the window size small has two adverse effects: first, a smaller W-value might not provide sufficient feedback and smoothing to support repartitioning decisions; second, a small W-value implies frequent repartitioning, which requires a significant amount of processing overhead and results in a less efficient operating system. Though the former problem can be eliminated by accumulating sampled information over a number of past epochs, the latter still exists. Furthermore, a repartitioning may not produce any significant change in the size of the partitions, rendering such repartitioning activities worthless. Consider, for example, one extreme case where the workload of all devices remains steady for a long time; no adjustment of the partitions is necessary during that interval, and hence all repartitioning activities in that interval are redundant. In addition, the other two parameters of this algorithm, termed the threshold (T) and the base correction amount (I), need to be set empirically with care. The threshold value is used in determining the partitions that might be termed page (or block) consumers, whereas the base correction amount indicates the number of pages (or blocks) a page (or block) consumer should consume.

Based on these observations, we realize that the caching scheme can be made more efficient by repartitioning the cache in a continuous fashion, without using the concept of an epoch, and only when it is necessary.

3. ALGORITHM OVERVIEW
This section provides an overview of the algorithmic issues we explore. First, we outline the existing cost-aware algorithms. Then we describe the notion of aggregate partitioning, and discuss the algorithms based on this notion. At the end of this section, we provide a taxonomy of aggregate partitioning.

3.1 Existing Cost-Aware Algorithms
The theoretical computer science community has studied cost-aware algorithms as k-server problems [3]. Cost-aware caching falls within a restricted class of k-server problems, namely weighted caching. LANDLORD [7], which is closely related to the web caching algorithm in [1], is a significant algorithm in the literature.

LANDLORD combines replacement cost, cache object size, and age, and hence accounts for cost and variable cache object size in the cache. It has two versions: FIFO and LRU. The LRU version works as follows. It associates a value (L) with each object. When an object is brought into the cache, LANDLORD sets L to H, which is the retrieval cost of the object divided by the size of the object (i.e., the per-unit replacement cost). In the case of an eviction, LANDLORD finds an object with the lowest L value, removes it, and ages all other objects. This aging is done by decreasing the L value of all remaining objects by the L value of the evicted object. Upon reference, the L value of an object is reset to H.

One important theoretical property of LANDLORD is that, when all cache objects are of equal size, it is k-competitive, where k is the cache size in blocks. So, in a cache of fixed object size, the performance of LANDLORD is bounded by k times the performance of the optimal off-line algorithm over all possible request sequences [3].
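As a concrete illustration of the LRU version of LANDLORD described above, the following Python sketch implements the reset-on-reference, evict-lowest-L and uniform-aging steps. The class and method names are our own illustration, not code from [7].

    # Illustrative sketch of the LANDLORD (LRU variant) replacement policy
    # described above; credit L is set to the per-unit retrieval cost H.

    class Landlord:
        def __init__(self, capacity):
            self.capacity = capacity          # total cache capacity (size units)
            self.used = 0
            self.objects = {}                 # key -> dict(size, H, L)

        def access(self, key, size, retrieval_cost):
            if key in self.objects:
                obj = self.objects[key]
                obj["L"] = obj["H"]           # on reference, reset credit L to H
                return "hit"
            H = retrieval_cost / size         # per-unit replacement cost
            while self.used + size > self.capacity and self.objects:
                self._evict()
            self.objects[key] = {"size": size, "H": H, "L": H}
            self.used += size
            return "miss"

        def _evict(self):
            # Evict the object with the lowest credit L, then age all other
            # objects by decreasing their L by the evicted object's L.
            victim = min(self.objects, key=lambda k: self.objects[k]["L"])
            delta = self.objects[victim]["L"]
            self.used -= self.objects[victim]["size"]
            del self.objects[victim]
            for obj in self.objects.values():
                obj["L"] -= delta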

3.2 Algorithms Based on Aggregate Partitioning

In a cost-oblivious caching approach, an incoming page (or block) replaces an existing page that may be anywhere in the cache. This can be termed a place-anywhere approach. In a place-anywhere algorithm, costs are recorded at page-level granularity, and a page can occupy any logical location in the cache. In contrast, an aggregate partitioning algorithm divides the cache into logical partitions and assigns a partition to each device. The algorithm maintains performance or cost information at the granularity of partitions. As cost information is maintained per partition, the amount of meta-data is reduced and cost information can be updated without scanning the whole cache. Moreover, aggregate partitioning integrates well with existing software, as cost-oblivious policies can be employed for replacing individual pages within a partition.

Forney's algorithm is the first cost-aware algorithm that utilizes the notion of aggregate partitioning. It considers both static (due to the diverse physical characteristics of the storage media) and dynamic (due to variation of the workload on disks, and of network traffic) performance heterogeneity. In this approach, the cache is divided into logical partitions, where blocks within a partition come from the same device and thus share the same replacement cost. The size of each partition is varied dynamically to balance work across devices. Here, work is defined as the cumulative delay for each device. The main challenge of this algorithm is to determine the relative sizes of the partitions dynamically. The dynamic repartitioning algorithm works in two phases: in the first phase, the cumulative delay for each device is determined; in the second phase, the cache partitions are adjusted. These two phases repeat cyclically.

The cumulative delay for each partition (or device) is measured over the last W successful device requests (distributed over all the devices), where W is the window size. Knowing the mean delay over all partitions and the per-device cumulative wait time, the relative wait time for each device is determined.

During repartitioning, page consumers and page suppliers are identified based on the relative wait times of the partitions. Page consumers are partitions that have a relative wait time above a threshold T; page suppliers are partitions having below-average relative wait times. Moreover, the algorithm classifies each partition into one of four states: cool, warming, cooling, warm. Of these, the first corresponds to a page supplier and the rest correspond to page consumers. A page consumer increases its partition size by I pages, where I is the base correction amount. If a partition remains a page consumer during subsequent epochs, the increase in partition size grows exponentially. On the other hand, the number of pages a page supplier must yield is given as:


(IRWTj / Σ_{i ∈ suppliers} IRWTi) × (number of consumed pages)

where IRWTj = 1 − relative wait time of partition j.
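To make the repartitioning step concrete, the following sketch evaluates the formula above: each page supplier yields pages in proportion to its inverse relative wait time (IRWT). The function name and dictionary-based interface are illustrative assumptions, not code from [2].

    # Illustrative computation of how many pages each supplier yields under
    # the repartitioning rule described above.

    def pages_yielded(relative_wait, suppliers, consumed_pages):
        """relative_wait: dict partition -> relative wait time;
        suppliers: partitions with below-average relative wait time;
        consumed_pages: total pages granted to the page consumers."""
        irwt = {j: 1.0 - relative_wait[j] for j in suppliers}   # IRWT_j
        total = sum(irwt.values())
        return {j: irwt[j] / total * consumed_pages for j in suppliers}

    # Example: two suppliers with relative wait times 0.2 and 0.6 split the
    # 100 pages consumed; the less-waiting supplier gives up more pages.
    print(pages_yielded({"d0": 0.2, "d1": 0.6}, ["d0", "d1"], 100))
    # {'d0': 66.66..., 'd1': 33.33...}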

4. OUR SOLUTION
Our approach, like Forney's algorithm, attempts to balance work across devices, and employs an aggregate partitioning scheme. We associate with each partition i a parameter Di that records the cumulative delay of that partition. Whenever a miss occurs in partition i, this parameter Di is incremented by the retrieval time of the incoming block. Upon a miss, depending on the D-value of the related partition, one of two alternatives happens: the partition either receives a block from another partition or evicts a block of its own, to make room for the incoming disk block. To aid in making this decision, partitions are classified into three categories or states: consumer, supplier, and neutral. Upon a miss, a consumer receives a block from one of the suppliers, whereas a supplier or a neutral partition replaces one of its own blocks. Efficient maintenance of the state of the partitions is the cardinal issue of our algorithm. This can be done either by using a graph-based approach, which provides a set of suppliers for each individual partition, or by implicitly maintaining consumers and suppliers in a generalized form. We discuss these techniques in the subsequent parts of this section. At the end of this section, we discuss the strategy employed by a consumer when selecting a victim partition from the set of suppliers.

4.1 A Simple Approach
To determine whether a partition can receive a cache block, or whether a partition should yield a cache block, we can maintain a Directed Acyclic Graph (DAG), where each node corresponds to a partition. In this DAG, there is an edge i → j if Di > Dj. The edge i → j indicates that partition j is a supplier for partition i. So, the set of suppliers for a particular partition can be determined from the outgoing edges of the corresponding node; and whenever a miss occurs within that partition, new blocks can be acquired from any of these suppliers.

But this approach has a drawback: it requires frequent and unnecessary block switches, that is, transfers of blocks from one partition to another. The blocks that are in transition may not be well utilized. Moreover, though an individual block switch requires an insignificant amount of processing, a huge number of block switches over a short interval may add up to a significant processing overhead.

4.2 Refined Approach
To eliminate frequent and unnecessary block switching, we introduce the notion of a threshold (δ). This parameter should be chosen carefully, based on the workload and the behavior of the devices. Now, in the DAG, an edge i → j is added if Di ≥ Dj + δ. But the problem of redundant block switching is still inherent in this approach. A partition receiving blocks from another partition may have to yield blocks to some other partitions. This occurs when a partition is a consumer with respect to some partitions, but a supplier for some other partitions. This scenario is shown in Figure 1. Here, whenever a miss occurs within j, it receives a block from either k or l; and whenever there is a miss in i, it receives a block from any of the partitions j, k, and l. So, there may be unnecessary block transfers, as j receives, and may deliver, blocks at the same time.

Figure 1: An instance of an unnecessary block switch: j consumes blocks from k and l, but at the same time may supply blocks to i.

To eliminate the possibility of unnecessary block switching, we impose the following constraint:

Constraint a: Do not allow a partition to be a candidate to be both a supplier and a consumer at the same time.

Here, we observe that variation in D-values can be controlled by restricting the length of all paths in the DAG. In one extreme instance of the DAG, there are no edges at all: all the D-values are within δ of one another, and hence can be termed homogeneous. It is expected that, given a suitable value of δ, the existence of paths of length two or more is highly improbable.

In the algorithms based on a DAG, maintenance of the DAG needs substantial processing, proportional to the number of cache misses; whenever a miss occurs, and hence a particular D-value changes, we have to scan the DAG to determine which edges should be deleted and which edges should be inserted. Moreover, to find the suppliers for a particular partition i, we have to traverse the DAG starting from node i and select the leaf nodes (those having no outgoing edge), which are the suppliers for that partition. As these activities are performed on each cache miss, the processing overhead is not insignificant. In addition, some memory space needs to be allotted to store the DAG.

To eliminate these problems, we attempt to represent suppliers and consumers implicitly without using a DAG. We outline the technique in the following subsection.

4.3 Implicitly Maintaining Suppliers and Consumers

In a DAG-based approach, we can get a different set of suppliers relative to each partition by simply traversing the DAG. Instead of deriving the set of suppliers for each partition, we endeavour to implicitly maintain the lists of suppliers and consumers in a generalized way. In this case, a consumer can receive a block, if needed, from a supplier properly selected from the list of suppliers.

Figure 2: The lists S and C are separated by a vertical distance h. The vertical scale represents the cumulative delay; the nodes are spread horizontally only for simplicity of representation.

In this approach, as shown in Figure 2, we maintain two lists: one for suppliers, denoted S, and the other for consumers, denoted C. Four variables, listed in Table 1, keep track of the maximum and minimum D-values of both lists. The minimum gap or distance between the supplier and consumer lists is denoted h. Lists C and S are subject to the following constraint:

Constraint b: The minimum gap between lists C and S must be greater than or equal to δ.

A partition i can be placed in list C if and only if the following condition holds:

DCmax − Di < δ  and  Di − DSmax ≥ δ     (1)

Here, the first term ensures that constraint a is not violated, and the second term ensures that constraint b is maintained.

In a similar way, a partition i can be placed in list S if and only if the following condition holds:

Di − DSmax < δ  and  DCmin − Di ≥ δ     (2)

So, a partition can decide its state (i.e., supplier, consumer, or neutral) in constant time, without scanning through all the partitions. During the operation of the system, as the D-values of the partitions change, the supplier and consumer lists can be maintained by adjusting the minimum and maximum values of these lists.
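A minimal sketch of the constant-time state test implied by conditions (1) and (2) is shown below, assuming the extreme D-values of the two lists (Table 1) are tracked as plain variables; the function name is ours.

    # Illustrative constant-time classification of a partition's state using
    # conditions (1) and (2); DC_max/DC_min and DS_max are the tracked
    # extremes of the consumer and supplier lists.

    def classify(D_i, DC_max, DC_min, DS_max, delta):
        if DC_max - D_i < delta and D_i - DS_max >= delta:
            return "consumer"          # condition (1)
        if D_i - DS_max < delta and DC_min - D_i >= delta:
            return "supplier"          # condition (2)
        return "neutral"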

4.4 Selecting Victim Partition
When a cache miss occurs in a partition in the consumer state, it selects a victim partition from the list of suppliers. In selecting the victim partition, a strategy similar to an inverse lottery, as previously proposed for resource allocation [5], can be used. The idea is that each supplier is given a number of tickets in inverse proportion to its cumulative delay. When a replacement is needed, a lottery is held by selecting a random ticket; the partition holding that ticket becomes the victim partition. This victim partition then yields its least valuable page. The purpose of this mechanism is to penalize more heavily the suppliers with less cumulative delay.

DCmax  Node with maximum D-value in the consumer list
DCmin  Node with minimum D-value in the consumer list
DSmax  Node with maximum D-value in the supplier list
DSmin  Node with minimum D-value in the supplier list
h      Minimum gap between C and S

Table 1: Parameters associated with the supplier and consumer lists.

Age (years)  Bandwidth (MB/s)  Seek time (ms)  Rotation (ms)
0            20.0              5.30            3.00
1            14.3              5.89            3.33
2            10.2              6.54            3.69
3            7.29              7.27            4.11
4            5.21              8.08            4.56
5            3.72              8.98            5.07
6            2.66              9.97            5.63
7            1.90              11.1            6.26
8            1.36              12.3            6.96
9            0.97              13.7            7.73
10           0.69              15.2            8.59

Table 2: Aging a base disk device (IBM 9LZX).
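As an illustration of the inverse-lottery victim selection described in Section 4.4, the sketch below gives each supplier a ticket weight inversely proportional to its cumulative delay and draws a random ticket; it is a simplified assumed interface, not the implementation used in our simulator.

    import random

    # Illustrative inverse-lottery victim selection: suppliers with lower
    # cumulative delay D receive more tickets and are penalized more often.

    def select_victim(cumulative_delay, suppliers):
        """cumulative_delay: dict partition -> D value; suppliers: candidates."""
        weights = [1.0 / max(cumulative_delay[p], 1e-9) for p in suppliers]
        return random.choices(suppliers, weights=weights, k=1)[0]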

5. EVALUATION ENVIRONMENT
This section describes our methodology for evaluating the performance of storage-aware caching. We describe our simulator and the storage environment assumed in the simulator. In Section 6, we present the results obtained using this simulator.

To measure the performance of storage-aware caching, we have implemented a trace-driven simulator. This simulator assumes a storage environment where a number of disks (of varying ages) are accessed by a single client. The client has a local cache that is partitioned across the disks. Each partition is maintained using the LRU replacement strategy. The focus of our investigation is maintaining the proper partition sizes dynamically. The client issues the workload for the disks.

The client workload that drives the simulator is captured using a trace file. The trace file specifies the data blocks accessed at various time points. We derive the synthetic disk traces using the PQRS algorithm proposed in [6]. This algorithm has been shown to generate traces that capture the spatio-temporal burstiness and correlation of real traces [4]. We use two traces (trace 1 and trace 2) in evaluating the performance of the caching schemes. These traces have different spatial and temporal locality, with trace 2 having higher spatial and temporal locality of reference than trace 1. The numbers of disk blocks for trace 1 and trace 2 are 120,000 and 100,000, respectively, whereas the numbers of requests for trace 1 and trace 2 are 350,000 and 300,000, respectively. We apply each of the traces to the set of disks. The requests are distributed among the disks non-uniformly using an exponential distribution, so that a disk of higher age receives an exponentially lower workload. This exponential distribution is necessary to avoid choking of the system by disks of higher age. All existing approaches based on aggregate partitioning fare poorly in the face of a uniform workload across the disks, irrespective of their ages.

We model the disk access time using only the disk bandwidth, average seek time, and average rotational latency. Hence, our disk model considers the worst-case scenario. Device heterogeneity is achieved by device aging. As in [2], we consider a base device (IBM 9LZX) and age its performance over a range of years. A collection of disks from this set is used as the disk system in the simulator. The characteristics of the disks of various ages are shown in Table 2.
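As a worked example of this worst-case access model, the sketch below computes the per-block access time from the aged IBM 9LZX parameters in Table 2 as average seek time plus average rotational latency plus block transfer time; the 4 KB block size is our assumption.

    # Worst-case per-block access time for aged disks (parameters from Table 2):
    # access_time = average seek + average rotational latency + block transfer time

    DISKS = {  # age: (bandwidth MB/s, seek ms, rotation ms)
        0: (20.0, 5.30, 3.00),
        5: (3.72, 8.98, 5.07),
        10: (0.69, 15.2, 8.59),
    }

    def access_time_ms(age, block_bytes=4096):        # block size is assumed
        bw_mb_s, seek_ms, rot_ms = DISKS[age]
        transfer_ms = block_bytes / (bw_mb_s * 1e6) * 1e3
        return seek_ms + rot_ms + transfer_ms

    # A 10-year-old disk is roughly 3.5x slower per block than a new one:
    print(access_time_ms(0), access_time_ms(10))      # ~8.5 ms vs ~29.7 ms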

6. EXPERIMENTAL RESULTS
In this section, we present a series of experimental results demonstrating the effectiveness of the proposed caching scheme. We measure the throughput obtained at the client side and use it as the performance metric. This throughput is measured by observing the delay in retrieving the disk blocks. In measuring the delay, we consider only the delay on cache misses; the delay experienced on a cache hit is comparatively negligible. We measure this throughput while varying the age of the slow disk and the cache size. We use both trace 1 and trace 2 (described in Section 5) in this measurement. As the disk system, we consider a set of four disks. As mentioned in the previous section, we distribute the requests among the disks using an exponential distribution depending on the age of the disks. To compare our results, we use an existing storage-aware caching scheme (Forney's scheme). The δ-value is set to 1000 ms, which we observed to capture the changes in the stable behavior of a disk.

It should be noted that, contrary to [2], we do not model the disk request size and do not use request locality to calculate the seek time and rotational latency. In our simulation the request size is equal to the block size. Hence, we take a pessimistic approach and assume that one disk miss results in a delay equal to the disk access time. In [2], by contrast, the delay is calculated at the granularity of the request size by exploiting the locality of the requests, and the request size is far greater than a block size. So, these two sets of simulation results might not be directly comparable.

Figure 3 and Figure 4 show the effect of varying the age of a disk. As shown in the figures, throughput decreases as the slow disk's age increases. However, the new scheme attains higher throughput than Forney's scheme as the disk age becomes higher. Here, the throughput with trace 2 is higher than that with trace 1 because of its higher spatial and temporal locality, as mentioned in the previous section.

Figure 3: Throughput with varying ages of the slow disk (trace 1)

Figure 4: Throughput with varying ages of the slow disk (trace 2)

The effect of varying the cache size on the caching schemes is shown in Figure 5 and Figure 6. Here, the throughput for each of the traces increases with cache size. However, this increase in throughput is not always linear (Figure 6). This happens because these caching schemes assume that a disk with a higher wait time can lower its wait time by proportionately increasing its number of cache blocks. But this is not always valid, as a disk with a higher wait time can consume blocks without decreasing the miss rate (or wait time) proportionately. In the extreme case, a disk with a higher wait time can receive more and more blocks, thus choking another disk with a lower wait time. This phenomenon leads to degradation in overall throughput. However, as shown in the figures, the new scheme achieves higher throughput compared with the other scheme.

Figure 5: Throughput with varying cache sizes (trace 1)

Figure 6: Throughput with varying cache sizes (trace 2)

7. CONCLUSION
In this paper, we identified an inherent problem with caching algorithms for heterogeneous storage systems that adjust cache partition sizes on a periodic basis. We proposed and analyzed solution strategies that adjust partition sizes in a continuous fashion, and outlined a strategy for maintaining partition states implicitly. A strategy for caching disk blocks is implemented by the operating system and thus affects the performance of the computer system at a very fundamental level, so even a small improvement here assumes large significance. Our new approach achieves up to a 15% increase in throughput compared with the previous approach. Using only one variable per partition, this approach is able to balance the cumulative delay of the partitions. Our solution is computationally simple and continuously adjusts the partitions to balance their cumulative delay. The only parameter, δ, can be chosen based on the workload and the performance of the devices.

In our work we assume that a disk block is brought into the cache only when a miss occurs, i.e., we do not consider prefetching. As future work we would like to investigate the issue of prefetching in the heterogeneous storage environment. Moreover, we note that when disks of varying ages receive nearly equal workloads (e.g., striped disks), a disk of higher age may consume blocks without increasing cache hits proportionately. This happens when a disk enters the saturation region, where additional cache blocks cannot yield a significant increase in cache hits. This problem is inherent in all caching methods based on aggregate partitioning; hence a utility-based scheme is necessary in this scenario. Also, in this paper, we do not analyze the sensitivity of the throughput to varying δ-values. We plan to investigate these issues in detail in the future.

8. REFERENCES
[1] P. Cao and S. Irani. Cost-aware WWW proxy caching algorithms. Proceedings of the USENIX Symposium on Internet Technologies and Systems, pages 193–206, December 1997.
[2] B. C. Forney, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Storage-aware caching: Revisiting caching for heterogeneous storage systems. Proceedings of the 2002 USENIX Conference on File and Storage Technologies, pages 61–74, January 2002.
[3] M. Manasse, L. McGeoch, and D. Sleator. Competitive algorithms for on-line problems. Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing, pages 322–333, May 1988.
[4] C. Ruemmler and J. Wilkes. UNIX disk access patterns. Proceedings of the Winter '93 USENIX Conference, pages 405–420, 1993.
[5] C. A. Waldspurger and W. E. Weihl. Lottery scheduling: Flexible proportional-share resource management. Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation, November 1994.
[6] M. Wang, A. Ailamaki, and C. Faloutsos. Capturing the spatio-temporal behavior of real traffic data. Proceedings of the Symposium on Computer Performance Modeling, Measurement and Evaluation, September 2002.
[7] N. E. Young. On-line file caching. Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1999.


Design and Implementation of HTSFS: High-throughput & Scalable File System1

Wenguo Wei (2,3), Shoubin Dong (2), Ling Zhang (2)

(2) Guangdong Key Laboratory of Computer Network, South China University of Technology, Guangzhou, 510641, P.R. China
[email protected]
(3) Department of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510665, P.R. China

Abstract. We propose an improved Lustre-like distributed file system: the High-throughput & Scalable File System (HTSFS). It has several new features that Lustre lacks, such as an auto-adaptive file striping policy that depends on file size, access pattern and storage server status; the use of multiple network adapters simultaneously to speed up network access; prefetching of read requests and write-behind of write requests to give HTSFS higher throughput; and mirroring of the striped data between two groups of server nodes to achieve load balancing and fault tolerance. We have implemented a prototype system and obtained some experimental results for the MuniSocket technique.

1 Introduction

Computational scientists use large parallel computers to simulate events that occur in the real world. These large-scale applications are necessary in order to better understand scientific phenomena or to predict behavior. Computing resources are often a limiting factor in the accuracy and scope of these simulations. The limiting resources include not only CPU and memory, but also the I/O subsystem, as many such applications generate or process enormous volumes of data. In order for a simulation to continue to run quickly, the I/O system must be capable of storing many hundreds of megabytes of data per second, and many disks must be used in concert. The software that organizes these disks into a coherent file system is called a "distributed/parallel file system".

A scalable global secure file system is very important to high performance computing (HPC). Scalability spans many dimensions, including data and metadata performance, large numbers of clients, inter-site file access, management and security. All these dimensions are of importance to high performance computing centers (HPCC) and global enterprises alike.

1 This research was supported by Guangdong Key Laboratory of Computer Network under China Education and Research GRID (ChinaGRID) project CG2003-CG005.


Parallel file systems are designed specifically to provide very high I/O rates when accessed by many processes at once. These processes are distributed across many different computers, or nodes, that make up the parallel computer. To achieve high performance, a parallel file system typically stripes files across nodes, similar to a RAID system. In this analogy, the nodes are data servers rather than disks. Similar to how a RAID volume aggregates multiple channels to a collection of local disks to increase performance, a parallel file system aggregates network connections to a collection of networked disks. Striping data across nodes is a straightforward way of gaining parallelism across multiple serial I/O systems. Unlike multiple nodes sharing a single RAID volume, a parallel file system can also use multiple network links at the same time, eliminating that particular bottleneck. Similarly, since files are striped in this manner and parallel programs tend to work on specific regions of a shared file, the network and disk loads can be spread across the storage nodes.

This paper presents the design and implementation of HTSFS: a High-throughput & Scalable File System. It is organized as follows. In Section 2, we introduce related projects on parallel/distributed file systems. We give the architecture and design overview of HTSFS in Section 3. In Section 4, we present some experimental results. Finally, we conclude the paper in Section 5.

2 Related Works

2.1 Lustre

Lustre [1] is an Inter-Galactic Cluster File System; it runs on many of the Linux supercomputers in the top 10 of Top500.org. Lustre provides a novel modular storage framework including a variety of storage management capabilities, networking, locking, and mass storage targets, all aiming to support scalable cluster file systems for small to very large clusters.

In Lustre clusters there are three major types of systems: the Clients, the Object Storage Targets (OST), and the Meta-Data Server (MDS) systems. Each of these systems internally has a very modular layout. Many modules, such as locking, the request processing and the message passing layers, are shared between all systems and form part of the framework. Others are unique, such as the Lustre Lite client module on the client systems.

But Lustre is not very mature; for example, it has no auto-adaptive file striping, does not use multiple network adapters simultaneously, and has only simple prefetching and buffer management.


2.2 PVFS2

PVFS2 [2] shows that it is possible to build a parallel file system that implicitly maintains consistency by carefully structuring the metadata and name space and by defining semantics of data access that can be achieved without locking. This design leads to file system behavior that some traditional applications do not expect. These relaxed semantics are not new in the field of parallel I/O; PVFS2 more closely implements the semantics dictated by MPI-IO.

PVFS2 also has native support for flexible noncontiguous data access patterns. In addition to performance, stability and scalability are important design goals. To help achieve this, PVFS2 is designed around a stateless architecture. PVFS2 includes a modular storage and networking system. A modular storage system allows multiple storage back-end implementations to be easily plugged into PVFS2; this modularity makes it easy for people researching I/O to experiment with different storage techniques. Similarly, a modular networking system allows operation over multiple network interconnects and makes it easy to add support for additional network types.

These design choices enable PVFS2 to perform well in a parallel environment, but not so well if treated like a local file system. Without client-side caching of metadata, stat operations typically take a long time, as the information is retrieved over the network. This can make programs like "ls" take longer to complete than might be expected. Overall, PVFS2 is optimized only for parallel I/O.

3. Architecture and Design Overview

3.1 Architecture of HTSFS

HTSFS is currently based on Lustre and draws on PVFS2 for reference; it also makes use of "a middleware-level parallel transfer technique" [3], which Lustre does not implement. Furthermore, Lustre has only implemented a file stripe policy configurable per directory or per file, and cannot automatically adjust the stripe policy according to the access pattern and OST status; its performance is also poor when small files are accessed. Third, Lustre is a RAID 0 style file system; the consistency of metadata and file data in Lustre is maintained only by the journaling ext3 file system, and problems such as complete disk failure require restoration from backup.

As shown in Figure 1, HTSFS has a framework similar to Lustre's. In HTSFS there are also three major types of systems: the Clients, the Object Storage Targets (OST), and the Meta-Data Server (MDS) systems, but the components rendered in green in the figure are new add-ons or have new features that Lustre lacks. HTSFS focuses on scalability for use in large computer clusters, but can equally well serve smaller commercial environments through minor variations in the implementation and deployment of the modules that make up the system. Its multiple network interface bonding, read-request prefetching and write-behind techniques give it higher throughput.

As a whole, each OST and MDS is pairwise, implemented by the "Mutual mirroring" component; i.e., any OST/MDS is made up of two mirroring sub-components, and each sub-component implements RAID 0 as Lustre does. The "Network target", "Object server" and "Direct driver" components retain their functions unchanged compared with Lustre.

At the client, all components are redesigned. "Client FS" implements auto-adaptive file striping according to file size, access pattern and OST status; "Page cache" prefetches read requests, writes behind write requests and manages the buffer more efficiently; "Network communication" provides more network bandwidth by bonding multiple network adapters. A detailed discussion is given in Section 3.2.

Fig. 1. Architecture of HTSFS

3.2 Design Overview of HTSFS

MuniSocket: utilize TCP's existing functions, such as reliability and flow and congestion control, to implement a more efficient TCP-based MuniSocket that uses multiple physical network connections on a cluster in a way that is transparent to the application.

In MuniSocket, user messages are partitioned into uniformly sized fragments which are transferred in parallel, through multiple network interfaces, via one or more physical networks, to the destination, again through multiple network interfaces. At the destination, the fragments are reassembled into the original message (see Figure 2). The main difference between MuniSocket and the standard socket is that MuniSocket processes and transfers user messages in parallel, fully utilizing the existing multiple network interconnects, while the standard socket processes and transfers user messages sequentially through a single network interface. MuniSocket also handles the multiple interfaces at the application level, thus providing more effective services and better control over the system. Furthermore, the processing is done at low granularity by processing the application messages, thus reducing the overhead incurred by the protocol. Moreover, the technique provides fault tolerance and dynamic load balancing to achieve the maximum possible bandwidth [3].

Fig. 2. Data transmission of MuniSocket
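The sketch below illustrates the fragment-and-reassemble idea in simplified form: a message is split into fixed-size fragments, the fragments are scheduled round-robin over the available interfaces, and the receiver reassembles them by sequence number. The functions and the fragment tuple layout are assumptions for illustration, not the MuniSocket implementation of [3].

    # Simplified illustration of MuniSocket-style message fragmentation:
    # split a message into uniform fragments, schedule them round-robin over
    # several network interfaces, and reassemble by sequence number.

    def fragment(message: bytes, frag_size: int, n_interfaces: int):
        frags = [message[i:i + frag_size] for i in range(0, len(message), frag_size)]
        # (interface index, sequence number, payload) triples
        return [(seq % n_interfaces, seq, payload) for seq, payload in enumerate(frags)]

    def reassemble(received):
        # fragments may arrive out of order from different interfaces
        return b"".join(p for _, _, p in sorted(received, key=lambda t: t[1]))

    msg = b"x" * 10000
    scheduled = fragment(msg, frag_size=1500, n_interfaces=2)
    assert reassemble(reversed(scheduled)) == msg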

Auto-adaptive file striping: stripe files depending on request sizes, request concurrency inside an application and across multiple applications, file access patterns, and OST status; i.e., the number of stripes, the size of each stripe, and the servers chosen are all auto-configurable. In particular, when a file is small enough, the read and write performance will be improved compared with Lustre.

"Client FS" obtains OST status from the MDS, such as OST busyness and remaining disk space, and chooses some OSTs to store parts of a user file. Furthermore, when HTSFS is idle, it will redistribute those files to achieve more I/O throughput; none of these actions require user intervention.
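A minimal sketch of what such an auto-adaptive striping decision could look like is given below: small files are placed unstriped on a lightly loaded OST, while larger files are striped across several of the least-busy OSTs. The thresholds, input format and scoring are illustrative assumptions, not the actual HTSFS policy.

    # Illustrative auto-adaptive striping decision based on file size and OST
    # status (busyness, free space); thresholds and scoring are assumed.

    def choose_striping(file_size, osts, stripe_size=1 << 20):
        """osts: list of dicts {'id', 'busy' in [0,1], 'free_bytes'};
        assumes at least one OST has enough free space (conservative filter)."""
        candidates = [o for o in osts if o["free_bytes"] > file_size]
        candidates.sort(key=lambda o: o["busy"])       # prefer idle OSTs
        if file_size <= stripe_size:                   # small file: no striping
            return {"stripe_count": 1, "stripe_size": file_size,
                    "osts": [candidates[0]["id"]]}
        count = min(len(candidates), -(-file_size // stripe_size), 8)
        return {"stripe_count": count, "stripe_size": stripe_size,
                "osts": [o["id"] for o in candidates[:count]]}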

To explore long-term trends, we are constructing both analytic models and fuzzy-logic rule bases that capture the relationship between access patterns and thousands of storage devices.

Prefetching read requests: combine Autoregressive Integrated Moving Average (ARIMA) time series and Markov-model spatial predictions to adaptively determine when, what, and how many data blocks to prefetch [4].

Our goal is to improve application-level input performance by using forecasts to guide the prefetching of I/O requests before they arrive, hiding disk latency and reducing stall time. We address the problem of predicting future requests from two perspectives: space and time.

In the space domain, we reused Oly's Markov modeling system [5] to provide quantitative predictions for future file blocks. These state-based probabilistic models were created online from applications' read accesses monitored at run time. Predictions were generated on a greedy basis, selecting the most likely transition from each state to determine the next state.
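The sketch below shows the greedy spatial prediction in simplified form: a first-order Markov model is built online from the observed block-access sequence, and the most likely successor of the current block is suggested as the next prefetch target. The data structures are our own illustration, not Oly's system [5].

    from collections import defaultdict

    # Illustrative first-order Markov predictor for block accesses: count
    # observed transitions online and greedily predict the most likely
    # successor of the current block as the next prefetch target.

    class MarkovPrefetcher:
        def __init__(self):
            self.counts = defaultdict(lambda: defaultdict(int))
            self.prev = None

        def observe(self, block):
            if self.prev is not None:
                self.counts[self.prev][block] += 1
            self.prev = block

        def predict(self):
            successors = self.counts.get(self.prev)
            if not successors:
                return None
            return max(successors, key=successors.get)   # greedy: most likely next block

    p = MarkovPrefetcher()
    for b in [1, 2, 3, 1, 2, 4, 1, 2, 3]:
        p.observe(b)
    print(p.predict())   # after ..., 2, 3 the model suggests prefetching block 1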

In the time domain, we present a software framework, Automodeler, to provide automatic modeling and forecasting of input/output request interarrival times. In Automodeler, ARIMA models of interarrival times are automatically identified and built during application execution. Model parameters are recursively estimated in real time for every new request arrival, adapting to changes that are intrinsic or external to the running application. Online forecasts are subsequently generated based on the updated parameters. Automodeler has the ability to automatically identify, track, estimate, and predict stationary, non-stationary, and seasonal behavior in read interarrival times during application execution.

Writing behind and reordering buffer: build on the Log-structured File System (LFS) [6] to reorder the write buffer.

LFS is an advanced research file system which provides good write performance, under the assumption that read performance is determined by the file system buffer cache (we actually use the prefetching of read requests described above to improve read performance). LFS tries to improve I/O performance by combining small write requests into large logs. Although LFS can significantly improve I/O performance for small-write dominated workloads, it suffers from a major drawback, namely garbage collection (or cleaning) overhead, because LFS does not distinguish active data (namely short-lived data) from inactive data in the write buffer. Data are simply grouped into a segment buffer according to their arrival order. When the buffer is full, LFS writes the buffer to a disk segment. Within the segment, however, some data are active and will be quickly overwritten (and therefore invalidated), while others are inactive and will remain on disk for a relatively long period. The result is that the garbage collector has to compact the segment to eliminate the holes in order to reclaim disk space. Previous studies have shown that garbage collection overhead can considerably reduce LFS performance under heavy workloads.

Several schemes have been proposed to speed up the garbage collection process. These algorithms focus on improving the efficiency of garbage collection after data have been written to disk.

We propose a novel method that tries to reduce the I/O overhead of garbage collection by reorganizing data in two or more segment buffers before they are written to disk [7]. When write data arrive, the system sorts them into different buffers according to their activity: active data are grouped into one buffer, while less-active data are grouped into the other buffer. When the buffers are full, the two buffers are written into two disk segments using two large disk writes. Because data are sorted into active and inactive segments before reaching the disk, garbage collection overhead is drastically reduced. Since active data are grouped together, most data blocks in an active disk segment will be quickly invalidated. On the other hand, few data blocks in an inactive segment will be invalidated, resulting in few holes. The outcome is that segments are either mostly full or mostly empty. The garbage collector can select many nearly empty segments to clean and compact their data into a small number of segments, which implies less data to be migrated. The old segments are then freed, resulting in a large number of available empty segments for future use. Furthermore, there is no need to waste time cleaning the nearly full segments.
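A minimal sketch of the two-buffer reordering idea described above: incoming writes are sorted into an active or an inactive segment buffer by a per-block activity estimate, and each buffer is flushed to its own disk segment when full. The activity heuristic (overwrite count) and the segment size are assumptions for illustration.

    # Illustrative write-behind reordering for an LFS-style log: sort writes
    # into active (short-lived) and inactive buffers so that, after flushing,
    # segments are either mostly hot (soon invalidated) or mostly cold.

    SEGMENT_BLOCKS = 4   # assumed segment size, in blocks

    class ReorderingWriteBuffer:
        def __init__(self, flush):
            self.flush = flush                  # callback: (label, [blocks]) -> None
            self.overwrites = {}                # block id -> overwrite count so far
            self.buffers = {"active": [], "inactive": []}

        def write(self, block_id, data):
            self.overwrites[block_id] = self.overwrites.get(block_id, 0) + 1
            label = "active" if self.overwrites[block_id] > 1 else "inactive"
            self.buffers[label].append((block_id, data))
            if len(self.buffers[label]) >= SEGMENT_BLOCKS:
                self.flush(label, self.buffers[label])   # one large segment write
                self.buffers[label] = []

    buf = ReorderingWriteBuffer(lambda label, seg: print(label, [b for b, _ in seg]))
    for b in [1, 2, 1, 3, 1, 2, 4, 1, 2, 5]:
        buf.write(b, b"...")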

RAID 0 + RAID 1: mirror the striped data between two groups of server nodes.

The consistency of metadata and file data in Lustre is maintained only by the journaling ext3 file system, and problems such as complete disk failure require restoration from backup. HTSFS achieves load balancing and fault tolerance through RAID 1 + RAID 0, and improves read performance more than it sacrifices peak write performance [8]. To implement this, I/O requests are scheduled to the less loaded server in each mirroring pair. How to define the meaning of "load" in the face of multiple resources such as CPU, memory, disk and network is also part of our research.

HTSFS is a RAID 0 + RAID 1 style parallel file system that mirrors the striped data between two groups of server nodes, one primary group and one backup group, as shown in Figure 3. There is one metadata server in each group. To keep synchronization simple, clients' requests go to the primary metadata server first. If the primary metadata server fails, all metadata requests are redirected to the backup one. All following requests go directly to the backup metadata server until the primary one recovers and rejoins the system. For write requests, the data are first written to the primary group and then duplicated to the backup group.
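The sketch below illustrates the request-routing rules just described: metadata requests go to the primary metadata server and fail over to the backup, while write requests go to the primary group and are then duplicated to the backup group. The class, its methods and the server interfaces are illustrative assumptions, not the HTSFS client code.

    # Illustrative client-side routing for the RAID 0 + RAID 1 layout described
    # above: primary MDS with failover to the backup, and writes mirrored from
    # the primary server group to the backup group.

    class MirroredClient:
        def __init__(self, primary_mds, backup_mds, primary_group, backup_group):
            self.mds = [primary_mds, backup_mds]        # assumed objects with .handle()
            self.groups = (primary_group, backup_group) # lists of servers with .write()
            self.use_backup_mds = False

        def metadata_request(self, req):
            try:
                return self.mds[1 if self.use_backup_mds else 0].handle(req)
            except ConnectionError:
                self.use_backup_mds = True              # redirect until primary rejoins
                return self.mds[1].handle(req)

        def write(self, stripe_id, data):
            primary, backup = self.groups
            primary[stripe_id % len(primary)].write(stripe_id, data)   # write primary first
            backup[stripe_id % len(backup)].write(stripe_id, data)     # then duplicate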

Fig. 3. The Structure of the RAID 0 + RAID 1 Style Parallel File System

4. Experiment Results: MuniSocket vs. Single Standard Socket

Fig. 4. Transit time comparison between MuniSocket (Gigabit + Fast Ethernet bonding) and a single Gigabit Ethernet socket (receive time vs. sequence number)

We use DBS, a TCP benchmark tool [9], to compare network performance between MuniSocket and a single standard socket. The two tested nodes are IBM eServer xSeries 335 servers; each has two Xeon 2.4 GHz CPUs and 1 GB of memory, and the nodes are connected to each other with one gigabit and one fast Ethernet network adapter. The transit time comparison between gigabit + fast Ethernet bonding and gigabit Ethernet is shown in Figure 4.

From Figure 4 we see that each group (each curve) contains 200 results, and the receive time with bonding is about 20%-35% lower than without bonding. We expect that bonding multiple network adapters of the same speed will be more efficient than bonding adapters of different speeds.

Tests of other aspects are under way and will be presented in a forthcoming paper.

5. Summary

HTSFS's initiative and aim, among others, is to advance the state of the art in parallel and distributed file system technology applied to high performance computing centers and global enterprises, making large-scale I/O-intensive applications run more smoothly. In this paper, we design and implement a Lustre-like file system that takes advantage of several new techniques to improve the I/O performance and scalability of cluster file systems. Our work shows that these new features can improve some aspects of Lustre, including throughput and access time.

References

1. Lustre, http://www.lustre.org/index.html
2. Parallel Virtual File System, http://www.pvfs.org/pvfs2/
3. N. Mohamed, J. Al-Jaroodi, H. Jiang, and D. Swanson, "A Middleware-Level Parallel Transfer Technique over Multiple Network Interfaces", in Proceedings of the Cluster World Conference and Expo, San Jose, California, June 2003.
4. Nancy Ngoc Tran, Daniel A. Reed, Automatic ARIMA time series modeling and forecasting for adaptive input/output prefetching, 2002.
5. Oly, J. P. Markov Model Prediction of I/O Requests for Scientific Applications. Master's Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign (Spring 2000).
6. Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. In Proceedings of Winter 1993 USENIX, pages 307–326, San Diego, CA, January 1993.
7. Jun Wang. High performance I/O architectures and file systems for internet servers. Doctoral Thesis, January 2002.
8. Y. Zhu, H. Jiang, X. Qin, D. Feng and D. Swanson, "Design, Implementation, and Performance Evaluation of a Cost-Effective Fault-Tolerant Parallel Virtual File System," in Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os, in conjunction with the 12th International Conference on Parallel Architectures and Compilation Techniques, New Orleans, LA, Sept. 27 - Oct. 1, 2003.
9. DBS: A TCP Benchmark Tool, http://www.ai3.net/products/dbs/


Scaling Out OpenSSI¹

Bruce J. Walker, Jaideep Dharap²

Hewlett-Packard Corporation

ABSTRACT

OpenSSI [1] is a very general and complete Single System Image (SSI) cluster technology, based on 20 years of research and development. During its development, however, there was no scalable cluster filesystem technology, so OpenSSI has been limited to environments of up to 100-200 nodes, even though many aspects of OpenSSI were designed to scale much higher. By leveraging the Lustre cluster filesystem, OpenSSI can now consider clusters of 1000 nodes and above. In this paper we first review the OpenSSI goal and technology. We then expose some of the scaling limitations in the current implementation and finally detail the ongoing work to remove those limitations.

1. Introduction

Cluster solutions have been developed, somewhat independently, for six different cluster environments – high performance technical computing (HPTC), load-leveling, web-serving, storage, database and high availability (HA). Within each area, a cluster solution addresses one or more of the problems of availability, scalability, usability and manageability. In fact, each environment could benefit from a solution that addressed all four problems, which means it should be possible to architect a single cluster software solution that can be used in all six areas. The solution in general is a highly scalable Single System Image (SSI) cluster foundation with modular subsetting and substitution of components for special requirements in particular environments. OpenSSI is a good foundation since it is quite modular and addresses manageability, availability, usability and some aspects of scalability.

Scaling can be considered in two ways – incremental scalability at relatively small numbers of nodes and absolute scalability (large numbers). Linear scaling is the ultimate form of incremental scalability as you add nodes. For workloads other than strictly parallel programs, the low overhead of OpenSSI and the utilization of a kernel instance on each node allow OpenSSI to scale productively up to 100 or 200 nodes. For that range of nodes, the per-node overhead is quite bearable. For much larger numbers of nodes, the per-node overhead isn't acceptable. Specific areas of concern include installation, booting, cluster formation, node monitoring and the cluster filesystem.

There are several solution components proposed to allow OpenSSI to comfortably scale to 1000 nodes or more. The key architectural feature is the introduction of "compute nodes" and the utilization of different algorithms for compute nodes. Service nodes (a rename of what all nodes are in current OpenSSI) handle logins, editing, compiling, general computing, batching, monitoring and external networking. They also provide load balancing and high availability. Compute nodes are available for batch job allocation and don't need to participate in load balancing and high availability.

¹ This research is partially sponsored under DOE contract DE-AC05-00OR22725.
² Also University of California, Los Angeles.


Compute nodes boot more efficiently, join the cluster in a more scalable manner and utilize a more scalable form of node monitoring. Little or no sacrifice of the current functionality offered by SSI is necessary to allow the existence of many compute nodes. The scaling issues and ongoing enhancements are detailed after an overview of current OpenSSI.

2. OpenSSI Overview

This section reviews the basic architecture of OpenSSI, the SSI capabilities, the features beyond just SSI (like load balancing and high availability), and the configuration and management of an OpenSSI cluster. From that overview we can detail the areas where scale is an issue and justify the ongoing implementation.

From outside the OpenSSI cluster one sees a single, scalable virtual machine that will load balance incoming requests across a set of physical machines. The virtual machine (i.e. the SSI cluster) appears highly available – service continues even in the face of multiple failures. From inside the cluster, users, administrators and processes see all the resources of all the physical machines as if they were local to the machine they are running on (with the exception of physical memory). There is a single namespace for filesystems (including the root filesystem), processes, all interprocess communication (IPC) objects and devices, so any program can run on any physical machine, can co-operate with any other program, and can even transparently move from physical machine to physical machine during execution.

The strategy for making this possible is to run an enhanced Linux kernel on each physical machine (node) in the cluster and to have these enhanced kernels work together so that, at the system call level, all the resources of all the nodes are presented as if local. A single shared root filesystem is also key. After looking at the value this model provides, a brief review of the technology is presented.

A cluster with a strong sense of SSI (like OpenSSI) adds value in several dimensions – manageability, scalability and usability. OpenSSI goes beyond pure SSI to also provide high availability. Figure 1 below shows how OpenSSI strives to match SMPs in manageability and usability while exceeding them in availability and scalability. Pure HA clusters tend to excel only in availability.

[Figure 1 compares an SMP, a typical HA cluster, and an OpenSSI Linux cluster along the four dimensions of availability, manageability, usability and scalability.]

Figure 1: OpenSSI clusters simultaneously excelling in all 4 dimensions


OpenSSI provides scalability in many different ways. First, you can add more nodes to the cluster online (incremental growth). New nodes add processors and memory and optionally add networking and disk/filesystem resources. All these resources will be transparently and automatically utilized as soon as the node is added. Sharing of resources is part of scalability. So is load balancing, for which there are two forms in OpenSSI – connection load balancing of incoming TCP/UDP traffic and process load balancing. Scalability with respect to SMP/NUMA architectures also comes from running a Linux kernel on each node.

Scalability isn't as valuable if the environment isn't easy to use. Usability is another key value of OpenSSI. Any program can be run on any node at any time, unlike most cluster environments and more like SMP environments. All resources can be seen and managed from any node, be they processes, filesystems, IPC objects or networking. A key usability aspect is that there are very few new programs to learn for the cluster. Standard commands just work (ps, top, mount, df, etc.). Automatic process load balancing allows users to automatically leverage the resources of other nodes without any special coding or configuring.

High availability is not intrinsically an SSI characteristic but certainly is a key rationale for clustering. OpenSSI is architected with no single or even multiple points of failure if sufficient redundant hardware is configured. Additionally, OpenSSI has many HA features which are automatic or easily configured. HA-CFS automatically fails over filesystems that are on shared devices or are cross-node replicated. HA-LVS automatically fails over the cluster IP address if the node servicing it fails. RC-based services are either run on all nodes or automatically fail over when the node they are running on fails. Channel bonding is an available option, as is checkpoint/restart of applications. Applications can also be monitored and restarted if they fail.

One of the biggest values of OpenSSI is manageability, and it starts with installation. Since there is logically only one copy of the root filesystem, the Linux distribution is only installed once and updates to it need only be done once. OpenSSI is installed by running a command on an installed distribution. Other nodes are added to the cluster by running an ssi-addnode command. Nodes can be added online at any time. Nodes can also be gracefully taken down, as can the entire cluster. Having a single root filesystem means many management functions are just the same as in the single-machine case. Updating the kernel automatically ensures all copies are updated. Management can be done from any node since all resources of all nodes are visible from all nodes.

The technology behind OpenSSI can be roughly broken down into six categories – installation/booting, cluster filesystems (including the root filesystem), process management (including process load balancing), inter-process communication, networking (including connection load balancing) and application availability. Below we briefly review what OpenSSI provides in each of these areas and how the capability is provided.

As mentioned above, installation is pretty straightforward. The base Linux distribution is installed, with OpenSSI as a layered addition. The OpenSSI install asks some questions, including a cluster name, whether failover is to be configured, which network interface is the cluster interconnect interface and a node number.
After a reboot, you have a one-node cluster. To add other nodes, you run the add-node option of the openssi-config-node command, which will prompt for a node number, IP address, nodename, and whether this is a root failover node.

With respect to cluster filesystem technology, OpenSSI uses the built-in HA-CFS by default, but GFS [3] or Lustre [4] can be used instead or in addition. HA-CFS is a transparent stacking cluster filesystem. Each time a standard disk-based filesystem like ext3 is mounted, it is transparently and instantaneously visible and accessible in a coherent fashion from all nodes in the cluster, no matter when or where it is mounted. If the filesystem is on a shared disk or is software replicated (DRBD [6]) between nodes, HA-CFS will automatically and transparently fail over the filesystem if the mounting node fails.

OpenSSI presents a clusterwide process model (Vproc). Each process is given a clusterwide unique process id which it carries with it if it moves.
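One simple way to picture clusterwide-unique process ids is to reserve a fixed pid range per node, so the id also encodes the node of origin (an illustrative sketch only; the actual Vproc scheme may differ):

# Illustrative sketch (not the Vproc implementation): per-node pid ranges
# give clusterwide-unique ids that also hint at the process's origin node.
PIDS_PER_NODE = 1 << 20          # hypothetical range size per node

def make_cluster_pid(node_num, local_pid):
    """Combine a node number and a node-local pid into a clusterwide pid."""
    assert 0 < local_pid < PIDS_PER_NODE
    return node_num * PIDS_PER_NODE + local_pid

def origin_node(cluster_pid):
    """Recover the node on which the pid was originally allocated."""
    return cluster_pid // PIDS_PER_NODE

# Example: a process with local pid 4711 created on node 3.
pid = make_cluster_pid(3, 4711)
assert origin_node(pid) == 3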


Processes can move from node to node; they can rexec() to other nodes and rfork() to other nodes. All forms of process relationships, including debugger/debuggee, can transparently be distributed. /proc/<pid> shows all the processes on all the nodes, with the full semantics of all the subtrees under /proc/<pid>. Included as an optional part of OpenSSI is automatic process load balancing, which can move processes at exec() time or at any point in their execution. Control is provided over which nodes participate in the load balancing and which programs participate.

OpenSSI makes all forms of standard base Linux Interprocess Communication (IPC) clusterwide. This includes shared memory, semaphores, message queues, pipes, fifos, unix domain sockets and internet domain sockets. For those that are nameable, the names are clusterwide unique and any process on any node can open/attach to any object it has permission to access, no matter what node it is being served on. IPC objects, when newly created, are created on the node where the creating process is currently executing, in order to maximize performance. Others using the object may be remote. There is an ongoing task to allow objects to move like processes do.

OpenSSI, through an adaptation of LVS (the Linux Virtual Server project [7]), provides a single name and address for the cluster (CVIP), and does so in a highly available manner. In addition, incoming TCP or UDP "connections" can be load balanced to servers throughout the cluster for performance and availability. From within the cluster, services can be transparently accessed through the CVIP.

Application availability in OpenSSI can be accomplished in one of two ways. If the application is an rc-based service, that service will, by default, only run on the initial node, with automatic failover to the takeover root node. Via an extension to the redhat-config-services command, one can designate that a service is to run on any or all nodes and can specify where to have it restart on failure. If the application is not an rc-based service, it can easily be registered with the spawndaemon/keepalive subsystem, which will monitor it and restart it if it fails or if the node it is running on fails.

3. Scaling Issues and Solutions

The architecture and implementation of OpenSSI currently scales to a couple hundred nodes. This architecture could very likely be incrementally enhanced to scale to perhaps a thousand nodes. But the goal of this research initiative is to scale to several thousands of nodes. OpenSSI currently offers various features such as high availability, a single root filesystem, as well as network connection balancing and process load balancing. As one of our goals, we wanted to maintain these features in a solution that would scale to thousands of nodes. Thus, we chose a design that offers the current feature set as is on a set of nodes while allowing the addition of a large number of nodes which act as compute nodes. We describe the overall architecture and then outline some of the research challenges involved. Figure 2 below shows the general architecture, with service nodes, compute nodes and an external Lustre server.


Figure 2. Architecture Overview

Service nodes offer all the current features of OpenSSI. In a cluster of a thousand nodes, we expect there to be at most a couple of hundred service nodes. These nodes are responsible for managing cluster membership (CLMS), resource management and job scheduling, and network connection load balancing (LVS). Service nodes boot from a local disk and have a single inter-process communication (IPC) namespace. They also support a single process space (Vproc) and provide process load leveling. These nodes provide high availability (HA), especially for the root filesystem. They also run resource management and job scheduling software and perform network connection load balancing. Figure 3 below depicts the capabilities of the service nodes.

Figure 3 – Service Node Capabilities

[Figure content omitted: the architecture diagram shows users and incoming requests arriving at a set of service nodes, a large pool of app/compute nodes, and external Lustre servers (metadata servers and object storage servers backed by scalable HA storage) connected by a high-speed interconnect. The service node diagram lists CLMS, Vproc, the cluster filesystem (CFS) and remote file block, IPC, devices, process load leveling, LVS, DLM, Lustre client, ICS, MPI, HA resource management and job scheduling, application monitoring and restart, install and sysadmin, and boot and init.]


In a cluster of large scale, we expect that there will be a large number of compute nodes that are used essentially for processing jobs. Compute nodes will most likely boot via the network (easing their management) but can also be booted off a local disk if required. These nodes boot a smaller kernel and run only a restricted set of services. While they participate in a single process space (Vproc), they do not perform any dynamic load balancing. These nodes do not participate in HA and have a much more relaxed membership (CLMS lite), thus resulting in a scalable membership algorithm. Figure 4 below depicts the capabilities of the compute nodes.

Figure 4 – Compute Node Capabilities [the figure lists the compute node software stack: Lustre client, ICS, MPI, boot, CLMS lite, Vproc, and remote file block]

Given the service node/compute node architecture described above, we will discuss further work in the following areas:

• Booting / Installation: Installing and booting a system that scales to the required limits will involve the simultaneous installation and booting of a large number of nodes; included is scalable cluster formation.

• Membership: Node monitoring algorithms need to be further enhanced to monitor a large number of nodes in order to detect node failures and maintain cluster membership in a timely fashion without imposing too much overhead in terms of network traffic or processing. Designing membership APIs that are able to express information for thousands of nodes is challenging as well.

• Shared Root Filesystem: The single root filesystem needs to be enhanced to handle a large number of nodes and a large number of operations.

• Load-leveling: Combine the job-launching functions of a cluster-wide resource manager with intra-node process creation and migration capabilities in order to distribute load efficiently.


• Interprocess communication: Determine if a single IPC namespace scales to the required limits.

• Networking: Scalability of the cluster internode communication subsystem as well as connection load balancing.

3.1 Scalability of installation and booting mechanisms

Via the ongoing RASCAL (RAdically Simple Cluster Architecture for Linux) project [11], OpenSSI will provide the capability to simplify the setup of a large-scale cluster environment. At installation time, OpenSSI/RASCAL will prompt the installer to pick one of a few default cluster configurations. It then automatically sets up the cluster. Some customization is available, primarily after the basic cluster is up and functional. The configuration areas that are covered are: cluster name and cluster IP address, individual node names and numbers, availability configuration, and network configuration of the cluster interconnect. The pre-defined cluster configurations are not hard-coded. Rather, they are specified in a powerful and flexible XML file format that users can customize. More information about these configurations can be found in RASCAL [11].

There are several challenges to be considered when scaling the booting mechanism to the required limits. Given that we expect the compute nodes to be diskless, scalable network booting is important. Network booting in OpenSSI has four phases – DHCP, kernel/ramdisk download, cluster join and init/rc processing. In current OpenSSI, the download is individual TFTP. Multicast TFTP [12] will be used to scale this aspect. Cluster formation in OpenSSI is made up of a series of individual node joins. Each join is a two-phase commit with all existing members to ensure a consistent cluster membership transition history on all nodes. For compute nodes this will be simplified – nodes will join in groups and compute nodes will not be involved in membership interactions. In OpenSSI there is a single init process which asynchronously runs "rc" processes on each node. Given that only a few services will run on each compute node, this model might scale acceptably. If not, more parallelism in launching the "rc" processing will be added, either on the init node or via the service nodes.

3.2 Node Monitoring

The node monitoring algorithm currently involves each node sending an "I'm alive" message (with load information) to the CLMS master, and that node sending an "I'm alive" message (with aggregate load information) to each node in the cluster. For the compute nodes the load information will be skipped and the "I'm alive" from the init node will be multicast. The frequency is currently configurable, and for compute nodes the frequency will be reduced to lower the overhead. In addition to enhancing node monitoring, additional membership APIs will be added to concisely describe the list of compute nodes in the cluster.
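A rough illustration of this monitoring scheme, with made-up intervals and callback functions (this is a sketch, not OpenSSI code):

# Illustrative sketch: "I'm alive" heartbeats with the two policies described
# above -- service nodes report load every interval, compute nodes report less
# often and omit load information.
import time

SERVICE_INTERVAL = 2.0      # hypothetical seconds between service-node heartbeats
COMPUTE_INTERVAL = 10.0     # compute nodes report less frequently
DEAD_AFTER = 3              # missed intervals before a node is declared down

def heartbeat_loop(node_id, is_compute, read_load, send_to_master):
    interval = COMPUTE_INTERVAL if is_compute else SERVICE_INTERVAL
    while True:
        msg = {"node": node_id, "time": time.time()}
        if not is_compute:
            msg["load"] = read_load()     # load info only from service nodes
        send_to_master(msg)
        time.sleep(interval)

def check_members(last_seen, now, is_compute):
    """Master-side check: return the nodes whose heartbeats have stopped."""
    down = []
    for node, seen in last_seen.items():
        interval = COMPUTE_INTERVAL if is_compute(node) else SERVICE_INTERVAL
        if now - seen > DEAD_AFTER * interval:
            down.append(node)
    return down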


3.3 Shared Root Filesystem

The concept of a single root filesystem in moderately scaled systems has existed for many years and is currently supported by OpenSSI. However, scaling a single root to the required limits is more challenging. We propose to use the Lustre file system as the root filesystem. Lustre is a cluster file system that uses object-based disks for storage and metadata servers for storing file system metadata. This design provides a more efficient division of labor between computing and storage resources. Replicated, failover Metadata Servers (MDSs) maintain a transactional record of high-level file and file system changes. Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. Lustre supports strong file and metadata locking semantics to maintain total coherency of the file systems even in the presence of concurrent access. Currently, OpenSSI supports Lustre for non-root filesystems. A prototype of Lustre as the root filesystem was built for OpenSSI but only tested on a very small cluster. The Lustre project has committed to supporting Lustre as a root filesystem, so we will leverage their efforts.

3.4 Load-distribution

OpenSSI has a mechanism to load balance processes at exec() time and during execution (process migration). This mechanism requires each node to know a fairly current value of the load of each other node. We do not plan to support this form of load balancing on the compute nodes. Instead, they will be scheduled using standard resource managers (e.g. SLURM, PBS).

3.5 Interprocess communication

Currently, OpenSSI has a single global IPC namespace across all nodes in the cluster. It will have to be determined whether this scales to the required limits. One option, if it does not, is to disable the single IPC namespace on compute nodes.

3.6 Networking

The existing cluster internode communication subsystem (ICS) in OpenSSI needs to be enhanced to scale up to the required limits. Currently, ICS establishes 10 connections between each pair of nodes in the cluster. First, we intend to cut down the number of connections required between nodes to 1 or 2. In addition, instead of establishing connections between all possible node pairs, ICS will support on-demand connections such that these connections are established only when needed and unused connections are cleaned up periodically. Since non-MPI communication between different compute nodes is rare, this ensures that a large number of these connections are never established and need not be maintained. This also ensures that a service node establishes connections only to those compute nodes that it requires. OpenSSI, through an adaptation of LVS (the Linux Virtual Server project), provides a single name and address for the cluster (CVIP), and does so in a highly available manner. In addition, incoming TCP or UDP "connections" can be load balanced to nodes throughout the cluster for performance and availability. In order to scale up, LVS capability will be limited to the service nodes.

4. Testing and Validation

Testing a thousand-node cluster is a challenging enterprise, especially since we expect to test various different features, from booting/installation to actual operation. We use a three-phase approach to testing the various features. The initial testing is a preliminary test that stresses data structures and algorithms. During this phase, we simply boot a smaller cluster, only with some of the nodes assigned node numbers closer to the maximum number of nodes for our required scaling limits. So, for example, we could assign a few nodes node numbers in the 900s.

As part of the second phase, we propose to use the Xen [10] virtual machine technology to test some of the features. Xen is a virtual machine monitor which allows multiple Linux instances to share conventional hardware in a safe and resource-managed fashion. The Xen design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. We are currently in the process of integrating Xen with OpenSSI and propose to use Xen on a small cluster (say 100 nodes) to test a much larger cluster (say 1000 nodes). This approach has several advantages as well as a few disadvantages. The biggest advantage is, of course, that we do not actually need a large-scale cluster of nodes to test the scaling features during initial development. A few features such as installation, booting, cluster formation and normal operations can be tested on such a cluster of virtual machines easily, and we could reasonably expect the correctness of these results to extend to a cluster of actual physical machines. However, the constraints placed on network traffic are likely to be lower in the cluster of virtual machines than in a cluster of actual physical machines connected by a network, so


performance measurements won't be very meaningful. Also, the fact that each virtual machine will be running with fewer processing resources means that the stress on the filesystem is likely less than it will be in an actual cluster. The third phase includes testing on a cluster of actual physical nodes. During this phase, we will test various features such as simultaneous booting of a large number of nodes, filesystem performance, job allocation and scheduling, failure and failover testing, etc.

5. Related Work

Kerrighed [8] is a research project that aims at providing a single system image operating system through a set of distributed kernel services in charge of the global management of cluster resources, similar to OpenSSI. Given that Kerrighed is focused on the HPC environment, it is likely that a high degree of scalability was designed in. openMosix [9] is another project that provides SSI features. openMosix does not provide an overall membership model (at any given time any node only knows about some of the other participating members), so in one sense it can scale quite high and in another sense it is inappropriate for parallel computing environments where scaling is the issue.

6. Conclusions

OpenSSI is a cluster technology that can be the foundation for almost all types of clusters. To add very large HPC parallel programming clusters to the OpenSSI environment, we have proposed leveraging the Lustre cluster filesystem technology and have introduced a new class of nodes to the OpenSSI cluster – compute nodes. Compute nodes would still enjoy the advantages of SSI but would not participate in high availability or OpenSSI-based load balancing enhancements. They would also have a limited participation in membership algorithms and cluster formation. As in most HPC clusters, these nodes would be under the control of a job scheduler. OpenSSI would be leveraged for management, monitoring and possibly to help with job launch. With these enhancements, OpenSSI may well scale to thousands of nodes.

Bibliography

[1] http://openssi.org/index.shtml
[2] Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. The LOCUS distributed operating system. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles, pages 49-70. ACM Press, 1983.
[3] The Global File System, http://www.redhat.com/software/rha/gfs/
[4] P. Schwan. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium, 2003.
[5] P. J. Braam. The Lustre storage architecture. Technical report, 2002. Available at http://www.lustre.org/documentation.html
[6] DRBD, http://www.drbd.org/
[7] Linux Virtual Server project, http://www.linuxvirtualserver.org/
[8] Kerrighed, http://www.kerrighed.org
[9] openMosix, http://openmosix.sourceforge.net/
[10] Boris Dragovic, Keir Fraser, Steve Hand, Tim Harris, Alex Ho, Ian Pratt, Andrew Warfield, Paul Barham, and Rolf Neugebauer. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2003.
[11] RASCAL: RAdically Scalable Cluster Architecture for Linux, using OpenSSI. Submitted to Supercomputing 05, April 2005.
[12] Multicast TFTP HOWTO, http://wiki.etherboot.org/pmwiki.php/Main/MulticastHowto


Asymmetric Active-Active High Availability for High-end Computing ∗ †

C. Leangsuksun, V. K. Munganuru
Computer Science Department
Louisiana Tech University
P.O. Box 10348, Ruston, LA 71272, USA
Phone: +1 318 257-3291, Fax: +1 318 257-4922

T. Liu
Dell Inc.

S. L. Scott, C. Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory

ABSTRACT

Linux clusters have become very popular for scientific computing at research institutions world-wide, because they can be easily deployed at a fairly low cost. However, the most pressing issues of today's cluster solutions are availability and serviceability. The conventional Beowulf cluster architecture has a single head node connected to a group of compute nodes. This head node is a typical single point of failure and control, which severely limits availability and serviceability by effectively cutting off healthy compute nodes from the outside world upon overload or failure. In this paper, we describe a paradigm that addresses this issue using asymmetric active-active high availability. Our framework comprises n + 1 head nodes, where n head nodes are active in the sense that they provide services to simultaneously incoming user requests. One standby server monitors all active servers and performs a fail-over in case of a detected outage. We present a prototype implementation based on a 2 + 1 solution and discuss initial results.

Keywords
Scientific computing, clusters, high availability, asymmetric active-active, hot-standby, fail-over

∗ This work was supported by the U.S. Department of Energy under Contract No. DE-FG02-05ER25659.
† This research is sponsored by the Mathematical, Information, and Computational Sciences Division; Office of Advanced Scientific Computing Research; U.S. Department of Energy. The work was performed at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

1. INTRODUCTION

With their competitive price/performance ratio, COTS computing solutions have become a serious challenge to traditional MPP-style supercomputers. Today, Beowulf-type clusters are being used to drive the race for scientific discovery at research institutions and universities around the world. Clusters are popular because they can be easily deployed at a fairly low cost. Furthermore, cluster management systems (CMSs), like OSCAR [6, 18] and ROCKS [21], allow uncomplicated system installation and management without the need to individually configure each component separately. HPC cluster computing is undoubtedly an eminent stepping-stone for future ultra-scale high-end computing systems.

However, the most pressing issues of today's cluster architectures are availability and serviceability. The single head node architecture of Beowulf-type systems itself is an origin of these predicaments. Because of the single head node setup, clusters are vulnerable, as the head node represents a single point of failure affecting the availability of its services. Furthermore, the head node also represents a single point of control. This severely limits access to healthy compute nodes in case of a head node failure. In fact, the entire system is inaccessible as long as the head node is down. Moreover, a single head node also impacts system throughput performance as it becomes a bottleneck. Overloading the head node can become a serious issue for high-throughput oriented systems.

To a large extent, the single point of failure issue is addressed by active/hot-standby turnkey tools like HA-OSCAR [4, 5, 10], which minimize unscheduled downtimes due to head node outages. However, sustaining throughput at large scale remains an issue due to the fact that active/hot-standby solutions still run only one active head node. In this paper, we describe a paradigm that addresses this issue using asymmetric active-active high availability. Unlike typical Beowulf architectures (with or without hot-standby node(s)), our framework comprises n + 1 servers; n head nodes are active in the sense that they provide services to simultaneously incoming user requests. One hot-standby server monitors all the active servers and performs a fail-over in case of a detected outage.

Since our research also focuses on the batch job scheduler that typically runs on the head node, our proposed high availability architecture effectively transforms a single-scheduler system into a cluster running multiple schedulers in parallel without maintaining global knowledge. These schedulers run their jobs on the same compute nodes or on individual partitions, and fail over to a hot-standby server in case of a single server outage.

Sharing compute nodes among multiple schedulers without coordination is a very uncommon practice in cluster computing, but it has its value for high throughput computing.

Furthermore, our solution leads the path towards the more versatile paradigm of symmetric active-active high availability for high-end scientific computing using the virtual synchrony model for head node services [3]. In this model, the same service is provided by all head nodes using group communication at the back-end for coordination. Head nodes may be added, removed or fail at any time, while no processing power is wasted by keeping an idle backup server and no service interruption occurs during recoveries. Symmetric active-active high availability is an ongoing research effort, and our asymmetric solution will help to understand the concept of running the same service on multiple nodes and the necessary coordination.

However, in contrast to the symmetric active-active high availability paradigm with its consistent symmetric replication of global knowledge among all participating head nodes, our asymmetric paradigm maintains only backups on one standby head node for all active head nodes.

This paper is organized as follows: First, we provide a review of related past and ongoing research activities. Second, we describe our asymmetric active-active high availability solution in more detail and show how the system handles multiple jobs simultaneously while enforcing high availability. Third, we present some initial results from our prototype implementation. Finally, we conclude with a short summary of the presented research and a brief description of future work.

2. RELATED WORK

Related past and ongoing research activities include cluster management systems as well as active/hot-standby cluster high availability solutions.

Cluster management systems allow uncomplicated system installation and management, thus improving availability and serviceability by reducing scheduled downtimes for system management. Examples are OSCAR and Rocks.

The Open Source Cluster Application Resources (OSCAR [6, 18]) toolkit is a turnkey option for building and maintaining a high performance computing cluster. OSCAR is a fully integrated software bundle, which includes all components that are needed to build, maintain, and manage a medium-sized Beowulf cluster. OSCAR was developed by the Open Cluster Group, a collaboration of major research institutions and technology companies led by Oak Ridge National Laboratory (ORNL), the National Center for Supercomputing Applications (NCSA), IBM, Indiana University, Intel, and Louisiana Tech University. OSCAR has significantly reduced the complexity of building and managing a Beowulf cluster by using a user-friendly graphical installation wizard as the front-end and by providing the necessary management tools at the back-end.

Similar to OSCAR, NPACI Rocks [21] is a complete "cluster on a CD" solution for x86 and IA64 Red Hat Linux COTS clusters. Building a Rocks cluster does not require any experience in clustering, yet a cluster architect will find a flexible and programmatic way to redesign the entire software stack just below the surface (appropriately hidden from the majority of users). The NPACI Rocks toolkit was designed by the National Partnership for Advanced Computational Infrastructure (NPACI). NPACI facilitates collaboration between universities and research institutions to build cutting-edge computational environments for future scientific research. The organization is led by the University of California, San Diego (UCSD), and the San Diego Supercomputer Center (SDSC).

Numerous ongoing high availability computing projects, such as LifeKeeper [11], Kimberlite [8], Linux FailSafe [12] and Mission Critical Linux [16], focus their research on solutions for clusters. However, they do not reflect the Beowulf cluster architecture model and fail to provide availability and serviceability support for scientific computing, such as a highly available job scheduler. Most solutions provide highly available business services, such as data storage and databases. They use a "cluster of servers" to provide high availability locally and enterprise-grade wide-area disaster recovery solutions with geographically distributed server cluster farms.

HA-OSCAR tries to bridge the gap between scientific cluster computing and traditional high availability computing. High Availability Open Source Cluster Application Resources (HA-OSCAR [4, 5, 10]) is production-quality clustering software that aims toward non-stop services for Linux HPC environments. In contrast to the previously discussed HA applications, HA-OSCAR strategies combine both the high availability and performance aspects, making its methodology and infrastructure the first field-grade HA Beowulf cluster solution that provides high availability, critical failure prediction and analysis capability. The project's main objectives focus on Reliability, Availability and Serviceability (RAS) for the HPC environment. In addition, the HA-OSCAR approach provides a flexible and extensible interface for customizable fault management, policy-based failover operation, and alert management.

An active/hot-standby high availability variant of Rocks has been proposed [14] and is currently under development. Similar to HA-OSCAR, HA-Rocks is sensitive to the level of failure and aims to provide mechanisms for graceful recovery to a standby master node.

Active/hot-standby solutions for essential services in scientific high-end computing include resource management systems, such as the Portable Batch System Professional (PBSPro [19]) and the Simple Linux Utility for Resource Management (SLURM [22]). While the commercial PBSPro service can be found in the Cray RAS and Management System (CRMS [20]) of the Cray XT3 [24] computer system, the open source SLURM is freely available for AIX, Linux and even Blue Gene [1, 7] platforms.

The asymmetric active-active architecture presented in this paper is an extension of the HA-OSCAR framework developed at Louisiana Tech University.

3. ASYMMETRIC ACTIVE-ACTIVE ARCHITECTURE

The conventional Beowulf cluster architecture (see Figure 1) has a single head node connected to a group of compute nodes. The fundamental building block of the Beowulf architecture is the head node, usually referred to as the primary server, which serves user requests and distributes submitted computational jobs to the compute nodes, aided by job launching, scheduling and queuing software components [2]. Compute nodes, usually referred to as clients, are simply dedicated to running these computational jobs.
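A minimal sketch of this single-head-node dispatch model (hypothetical class and method names, not from the paper):

# Illustrative sketch: the head node accepts submitted jobs, queues them, and
# hands each one to an idle compute node.
from collections import deque

class HeadNode:
    def __init__(self, compute_nodes):
        self.queue = deque()
        self.idle = set(compute_nodes)

    def submit(self, job):
        self.queue.append(job)
        self.dispatch()

    def dispatch(self):
        # Every job in the system flows through this one node, which is why a
        # head node outage cuts users off from healthy compute nodes.
        while self.queue and self.idle:
            node = self.idle.pop()
            job = self.queue.popleft()
            node.run(job)

    def job_finished(self, node):
        self.idle.add(node)
        self.dispatch()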

Figure 1: Conventional Beowulf Architecture

The overall availability of a cluster computing system depends solely on the health of its primary node. Furthermore, this head node may also become a bottleneck in large-scale high-throughput use-case scenarios. In high availability terms, the single head node of a Beowulf-type cluster computing system is a typical single point of failure and control, which severely limits availability and serviceability by effectively cutting off healthy compute nodes from the outside world upon overload or failure.

Running two or more primary servers simultaneously plus an additional monitoring hot-standby server for fail-over purposes, i.e. asymmetric active-active high availability, is a promising solution, which can be deployed to improve system throughput and to reduce system downtimes.

We also note that preemptive measures for application fault tolerance, such as checkpointing, can introduce significant overhead even during normal system operation. Such overhead should be counted as downtime as well, since compute nodes are not efficiently utilized.

We implemented a prototype of a 2 + 1 asymmetric active-active high availability solution [13] that consists of three different layers (see Figure 2). The top layer has two identical active head nodes and one redundant hot-standby node, which simultaneously monitors both active nodes. The middle layer is equipped with two network switches to provide redundant connectivity between head nodes and compute nodes. A set of compute nodes installed at the bottom layer is dedicated to running computational jobs.

In this configuration, each active head node is required to have at least two network interface cards (NICs). One NIC is used for public network access to allow users to schedule jobs. The other NIC is connected to the respective redundant private local network providing communication between head and compute nodes. The hot-standby server uses three NICs to connect to the outside world and to both redundant networks. Compute nodes need to have two NICs for both redundant networks.

We initially implemented our prototype using a 2 + 1 asymmetric active-active HA-OSCAR solution that consists of different job managers (see Figure 3), the Open Portable Batch System (OpenPBS [17]) and the Sun Grid Engine (SGE [23]), independently running on multiple identical head nodes at the same time. Additionally, one identical head node is configured as a hot-standby server, ready to take over when one of the two active head node servers fails.

Figure 3: Normal Operation of the 2 + 1 Asymmetric Active-Active HA-OSCAR Solution

Under normal system operating conditions, head node A runs OpenPBS and head node B runs SGE, both simultaneously serving user requests in tandem. Both active head nodes effectively employ the same compute nodes using redundant interconnects. Each active head node creates a different home environment for each of its resource managers, which prevents conflicts during job submission.

Upon failure (see Figure 4) of one active head node, the hot-standby head node will assume the IP address and host name of the failed head node. Additionally, the same set of services will transfer control to the standby node with respect to the same job management tool activated on the failed node, masking users from this failure.
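The basic mechanics of such an address takeover might look as follows (a sketch, not the HA-OSCAR failover scripts; the interface name, prefix length and the use of iputils' arping are assumptions):

# Illustrative sketch: take over a failed head node's IP address on the
# standby node and announce it with gratuitous ARP.
import subprocess

def take_over_address(ip, iface="eth0", prefix=24):
    # Add the failed node's address as an alias on the standby's interface.
    subprocess.run(["ip", "addr", "add", f"{ip}/{prefix}", "dev", iface], check=True)
    # Unsolicited ARP so switches and peers update their caches quickly.
    subprocess.run(["arping", "-U", "-c", "3", "-I", iface, ip], check=False)

def release_address(ip, iface="eth0", prefix=24):
    # Give the address back once the original head node is repaired.
    subprocess.run(["ip", "addr", "del", f"{ip}/{prefix}", "dev", iface], check=False)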

Figure 4: Fail-Over Procedure of the 2 + 1 Asymmetric Active-Active HA-OSCAR Solution

The asymmetric active-active HA-OSCAR prototype is capable of masking one failure at a time. As long as one head node is down, a second head node failure will result in a degraded operating mode. Even under this rare condition of two simultaneous head node failures, our high availability solution provides the same capability as a regular Beowulf-type cluster without high availability features.

To ensure that the system operates correctly without unfruitful failovers, the system administrator must define a failover policy in the PBRM (Policy Based Recovery Management [9]) module, which allows selecting a critical head node (A or B). The critical head node has a higher priority and will be handled first by the hot-standby head node in case the DGAE (Data Gathering and Analysis Engine) detects any failures.

In the rare double head node outage event, there will not be a service failover from the lower priority server to the hot-standby head node. This policy ensures that critical services will not be disrupted by failures on the high priority head node. For example, if OpenPBS job management is the most critical service, we suggest setting the server running OpenPBS as the higher priority head node.
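The described policy can be summarized in a few lines (an illustrative sketch, not the PBRM module itself):

# Illustrative sketch of the failover policy: the hot-standby covers at most
# one failed head node at a time, and the critical head node always wins.

def choose_failover(failed_nodes, critical_node, standby_busy_for):
    """Return the head node the standby should take over for, or None."""
    if not failed_nodes:
        return None
    if standby_busy_for is not None:
        # Standby is already covering one outage; a second failure means
        # degraded operation rather than a second failover.
        return None
    if critical_node in failed_nodes:
        return critical_node
    return sorted(failed_nodes)[0]

# Example: head node A (running OpenPBS) is critical; both A and B are down.
assert choose_failover({"A", "B"}, critical_node="A", standby_busy_for=None) == "A"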

4. PRELIMINARY RESULTS

In our experimental setup, head node A runs OpenPBS and Maui [15] and is assigned as the critical head node. The services on head node A have failover priority to the hot-standby head node. In case of a failure of head node A, the hot-standby head node takes over as head node A'. Once head node A is repaired, the hot-standby head node will disable OpenPBS and Maui to allow those services to fail back to the original head node A. If head node B fails while head node A is in normal operation, the hot-standby head node will simply fail over SGE until head node B is recovered and back in service again.

With our lab-grade prototype setup, we experienced the same, if not better, availability behavior compared to an active/hot-standby HA-OSCAR system, if we count the degraded operating mode of our prototype with two outages at the same time as downtime. Earlier theoretical assumptions and practical results (see Figure 5) using reliability analysis and tests of an active/hot-standby HA-OSCAR system could be validated for the asymmetric active-active solution. We obtained a steady-state system availability of 99.993%, which is a significant improvement compared to the 99.65% of a similar Beowulf Linux cluster with a single head node.

If we do not count the degraded operating mode with two outages at the same time as downtime, the availability of our prototype is even better than that of the standard HA-OSCAR solution. We are currently in the process of validating our results using reliability analysis. Furthermore, we also experienced an improved throughput capacity for scheduling jobs.
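To put these percentages in perspective, a back-of-the-envelope conversion to expected downtime per year (the arithmetic follows directly from the quoted availability figures):

# Expected annual downtime implied by the reported steady-state availability.
HOURS_PER_YEAR = 365 * 24

for name, availability in [("asymmetric active-active", 0.99993),
                           ("single head node", 0.9965)]:
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{name}: {downtime_h:.1f} hours/year (~{downtime_h * 60:.0f} minutes)")
# -> roughly 0.6 hours (about 37 minutes) vs. roughly 30.7 hours per year.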

One of our initial concerns, and one of the major challenges we encountered during prototype implementation and testing, was that most if not all services in a high-end computing environment are not active-active aware, i.e. schedulers such as OpenPBS and SGE do not support multiple instances on different nodes. Automatic replication of changes in job queues is not very well supported. The experience gained during implementation will be applied to future work on symmetric active-active high availability.

5. CONCLUSIONS

Our lab-grade prototype of an asymmetric active-active HA-OSCAR variant showed some promising results and showed that the presented architecture is a significant enhancement to the standard Beowulf-type cluster architecture in satisfying requirements of high availability and serviceability. We currently support only 2 + 1 asymmetric active-active high availability. However, ongoing work is investigating the extension of the implementation to n + 1 active-active architectures.

Future work will be more focused on symmetric active-active high availability for high-end scientific computing using the virtual synchrony model for head node services [3]. In this architecture, services on multiple head nodes maintain common global knowledge among participating processes. If one head node fails, the surviving ones continue to provide services without interruption. Head nodes may be added or removed at any time for maintenance or repair. As long as one head node is alive, the system is accessible and manageable. While the virtual synchrony model is well understood, its application to individual services on head (and service) nodes in scientific high-end computing environments still remains an open research question.

6. REFERENCES

[1] ASCI Blue Gene/L Computing Platform at Lawrence Livermore National Laboratory, Livermore, CA, USA. http://www.llnl.gov/asci/platforms/bluegenel.
[2] C. Bookman. Linux Clustering: Building and Maintaining Linux Clusters. New Riders Publishing, Boston, 2003.
[3] C. Engelmann and S. L. Scott. High availability for ultra-scale high-end scientific computing. Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2), 2005.
[4] HA-OSCAR at Louisiana Tech University, Ruston, LA, USA. At http://xcr.cenit.latech.edu/ha-oscar.
[5] I. Haddad, C. Leangsuksun, and S. Scott. HA-OSCAR: Towards highly available Linux clusters. Linux World Magazine, March 2004.
[6] J. Hsieh, T. Leng, and Y. C. Fang. OSCAR: A turnkey solution for cluster computing. Dell Power Solutions, pages 138-140, 2001.
[7] IBM Blue Gene/L Computing Platform at IBM Research. http://www.research.ibm.com/bluegene.
[8] Kimberlite at Mission Critical Linux. At http://oss.missioncriticallinux.com/projects/kimberlite.
[9] C. Leangsuksun, T. Liu, S. L. Scott, T. Rao, and Richard Libby. A failure predictive and policy-based high availability strategy for Linux high performance computing clusters. Proceedings of the 5th LCI International Conference on Linux Clusters, 2004.
[10] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. L. Scott. Availability prediction and modeling of high availability OSCAR cluster. Proceedings of IEEE Cluster Computing (Cluster), pages 380-386, 2003.
[11] LifeKeeper at SteelEye Technology, Inc., Palo Alto, CA, USA. At http://www.steeleye.com.
[12] Linux FailSafe at Silicon Graphics, Inc., Mountain View, CA, USA. At http://oss.sgi.com/projects/failsafe.
[13] T. Liu. High availability and performance Linux cluster. Master's Thesis, Louisiana Tech University, Ruston, LA, USA, 2004.
[14] T. Liu, S. Iqbal, Y. C. Fang, O. Celebioglu, V. Masheyakhi, and R. Rooholamin. HA-Rocks: A cost-effective high-availability system for Rocks-based Linux HPC clusters, April 2005.
[15] Maui at Cluster Resources, Inc., Spanish Fork, UT, USA. At http://www.clusterresources.com/products/maui.
[16] Mission Critical Linux. At http://oss.missioncriticallinux.com.
[17] OpenPBS at Altair Engineering, Troy, MI, USA. At http://www.openpbs.org.
[18] OSCAR at Sourceforge.net. At http://oscar.sourceforge.net.
[19] PBSPro at Altair Engineering, Inc., Troy, MI, USA. http://www.altair.com/software/pbspro.htm.
[20] PBSPro for the Cray XT3 at Altair Engineering, Inc., Troy, MI, USA. http://www.altair.com/pdf/PBSPro Cray.pdf.
[21] ROCKS at the National Partnership for Advanced Computational Infrastructure (NPACI), University of California, San Diego, CA, USA. At http://rocks.npaci.edu/Rocks.
[22] SLURM at Lawrence Livermore National Laboratory, Livermore, CA, USA. www.llnl.gov/linux/slurm.
[23] Sun Grid Engine Project at Sun Microsystems, Inc., Santa Clara, CA, USA. At http://gridengine.sunsource.net.
[24] XT3 Computing Platform at Cray Inc., Seattle, WA, USA. http://www.cray.com/products/xt3.


Figure 2: Asymmetric Active-Active High Availability Architecture

Figure 5: Total Availability Improvement Analysis (Planned and Unplanned Downtime): Comparison of HA-OSCAR with a Traditional Beowulf-type Linux HPC Cluster


High Availability for Ultra-Scale High-End Scientific Computing ∗

Christian Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Oak Ridge, TN 37831-6164, USA
Phone: +1 865 574 3132, Fax: +1 865 576 5491
and
Department of Computer Science
The University of Reading
P.O. Box 217, Reading, RG6 6AH, UK

Stephen L. Scott
Computer Science and Mathematics Division
Oak Ridge National Laboratory
P.O. Box 2008, Oak Ridge, TN 37831-6367, USA
Phone: +1 865 574 3144, Fax: +1 865 576 5491

ABSTRACT

Ultra-scale architectures for scientific high-end computing with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, suffer from availability deficiencies, which impact the efficiency of running computational jobs by forcing frequent checkpointing of applications. Most systems are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services, such as the job scheduler or MPI, or even of the entire machine. In this paper, we present a flexible, pluggable and component-based high availability framework that expands today's effort in high availability computing from keeping a single server alive to include all machines cooperating in a high-end scientific computing environment, while allowing adaptation to system properties and application needs.

Keywords
Scientific computing, high availability, virtual synchrony, distributed control, group communication

1. MOTIVATION

A major concern in efficiently exploiting ultra-scale architectures for scientific high-end computing (HEC) with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L [2, 4, 27], the Cray X1 [39] or the Cray XT3 [41], is the potential inability to identify problems and take preemptive action before a failure impacts a running job. In

∗ Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

fact, in systems of this scale, predictions estimate the mean time to interrupt (MTTI) in terms of hours. Timely and current propagation and interpretation of important events will improve system management through tuning, scheduling, resource usage, and self-adaptation for unexpected events, such as node failures.

Current solutions for fault-tolerance in HEC focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services, such as the job scheduler or MPI, or even of the entire machine. High availability (HA) computing strives to avoid the problems of unexpected failures through preemptive measures. Today's effort in HA computing is mostly directed toward keeping a single server alive. This effort needs to be expanded to include all machines cooperating in a HEC environment.

There are various techniques [37] to implement high availability for computing services. They include active/hot-standby and active/active. Active/hot-standby high availability follows the fail-over model. Process state is saved regularly to some shared stable storage. Upon failure, a new process is restarted or an idle process takes over with the most recent or even current state. This implies a short interruption of service for the time of the fail-over and may involve a rollback to an old backup.
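As a minimal illustration of the fail-over model described above (our own sketch; the checkpoint path and the state layout are invented for the example), the active process periodically checkpoints its state to shared storage and the standby restores the most recent checkpoint when it takes over:

/* Sketch of active/hot-standby fail-over via checkpoints on shared storage.
 * The path and the state structure are illustrative only. */
#include <stdio.h>

struct service_state { long last_request_id; double progress; };

/* Active side: checkpoint state to shared stable storage (e.g. an NFS path). */
static int checkpoint(const struct service_state *s, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Standby side: on fail-over, restore the most recent checkpointed state. */
static int restore(struct service_state *s, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

int main(void)
{
    struct service_state s = { 42, 0.75 };
    checkpoint(&s, "/shared/service.ckpt");     /* done regularly by the active  */

    struct service_state recovered;
    if (restore(&recovered, "/shared/service.ckpt") == 0)  /* done by the standby */
        printf("resuming from request %ld\n", recovered.last_request_id);
    return 0;
}

Any work performed after the last checkpoint is lost, which is exactly the rollback to an old backup mentioned above.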

Asymmetric active/active high availability further improves the reliability, availability and serviceability (RAS) properties of a system. In this model, two or more processes offer the same service without coordination, while an idle process is ready to take over in case of a failure. This technique allows continuous availability with improved performance. However, it has limited use cases due to the missing coordination between participating processes.

Symmetric active/active high availability offers a continuous service provided by two or more processes that run the same service and maintain a common global process state using distributed control [18] or extended virtual synchrony [30]. The symmetric active/active model is superior in many areas including throughput, availability, and responsiveness, but is significantly more complex.

While there are major architectural differences between individual HEC systems, like vector machines vs. massively parallel processing systems vs. Beowulf clusters, they all have similar high availability deficiencies, i.e. single points of failure (SPoF) and single points of control (SPoC).

A failure at a SPoF impacts the complete system and usually requires a full or partial system restart. A failure at a SPoC additionally renders the system useless until the failure is fixed. A recovery from a failure at a SPoC always involves repair or replacement of the failed component, i.e. human intervention. Compute nodes are typical single points of failure. Head and service nodes are typical single points of failure and control.

The overall goal of our research is to expand today's effort in HA for HEC, so that systems that have the ability to hot-swap hardware components can be kept alive by an OS runtime environment that understands the concept of dynamic system configuration. With the aim of addressing the future challenges of high availability in ultra-scale HEC, our project intends to develop a proof-of-concept implementation of an active/active high availability system software framework that removes typical single points of failure and single points of control.

For example, cluster head nodes may be added or removed by the system administrator at any time. Services, such as scheduler and load balancer, automatically adjust to system configuration changes. Head nodes may be removed automatically upon failure or even preemptively when a system health monitor registers unusual readings for hard drive or CPU temperatures. Multiple highly available head nodes will also ensure access to compute nodes. As long as one head node survives, computational jobs can continue to run without interruption.

2. RELATED WORK
Related past and ongoing research can be separated into two paths: solutions for fault-tolerance and for high availability. Conceptually, fault tolerant computing enables recovery from failures with an accepted loss of recent computation activity and an accepted interruption of service. In contrast, high availability computing provides an instant failure recovery with little or no loss and with little or no interruption. The grade of availability is defined by the overall downtime, i.e. the interruption (outage) time plus the time it takes to catch up to the state when the failure occurred. Additional overhead during normal system operation, such as for checkpointing and message logging, also exists and needs to be counted as downtime. There are no fixed boundaries between both research paths.
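As a rough back-of-the-envelope formalization of this definition (our own notation, not a formula from the cited literature), the availability A achieved over an observation period T can be written as

    A = 1 - (t_outage + t_catchup + t_overhead) / T

where t_outage is the service interruption time, t_catchup the time needed to reach the state at which the failure occurred, and t_overhead the normal-operation overhead (e.g. for checkpointing and message logging) that is counted as downtime.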

Research in the area of fault tolerant scientific computing includes fault tolerant message passing layers, such as PVM [23, 36] and FT-MPI [19, 20, 21], message logging systems and checkpoint/restart facilities, like MPICH-V [31] and BLCR [29]. Advanced technologies in this area deal with diskless checkpointing [10, 15, 35] and fault tolerant scientific algorithms [7, 17].

High availability computing has its roots in military, financial, medical and business computing services, where real-time computing resources, such as air traffic control and stock exchange, or data bases, like patient or employee records, need to be protected from catastrophic failures. Research in this area includes process state and data base replication with various real-time capabilities.

Active/hot-standby high availability is the de facto standard for business and telecommunication services, such as Web-servers. Solutions for Beowulf-type systems are also available for high-end scientific computing. Examples are HA-OSCAR [25, 26] and Carrier Grade Linux [9].

Furthermore, recent solutions for ultra-scale architectures include the Cray RAS and Management System (CRMS [41]) of the Cray XT3. The CRMS integrates hardware and software components to provide system monitoring, fault identification, and recovery. By using the PBSPro [33, 34] job management system, the CRMS is capable of providing a seamless failover procedure without interrupting the currently running job. Redundancy is built in for critical components and single points of failure are minimized. For example, the system could lose an I/O PE without losing the job that was using it.

Asymmetric active/active high availability is being used for stateless services, such as read/search-only database server farms. Ongoing research in this area for scientific high-end computing focuses on a multiple head node solution for high throughput Linux clusters.

Research in symmetric active/active high availability concentrates on distributed control and extended virtual synchrony algorithms based on process group communication [1, 11, 12]. Many group communication algorithms and several software frameworks, such as Ensemble [6], Transis [13], Xpand [40], Coyote [5] and Spread [38] have been developed, but most target a specific network technology/protocol (e.g. UDP) and/or group communication algorithm (e.g. Totem [3]). Only very few allow easy modification of the group communication algorithm itself via micro-protocol layers or micro-protocol state-machines.

3. TECHNICAL APPROACH
In order to provide active/active as well as active/hot-standby high availability for ultra-scale high-end scientific computing, we are in the process of developing a flexible, modular, pluggable high availability component framework that allows adaptation to system properties, like network technology and system scale, and application needs, such as programming model and consistency requirements.

Our high availability framework (Figure 1) consists of four major layers: 1-to-1 and 1-to-n communication drivers, a group (n-to-n) communication system, virtual synchrony interfaces and applications.

Figure 1: High Availability Framework

At the lowest layer, communication drivers provide singlecast and multicast messaging capability. They may also provide messaging related failure detection. The group communication layer offers group membership management, external failure detectors, reliable multicast mechanisms and atomic multicast algorithms. The virtual synchrony layer builds a bridge between the group communication system and applications using easy-to-use interfaces common to application programmers.
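As a rough illustration of how such pluggable layers can be wired together, the sketch below represents each layer as a table of function pointers that a plug-in fills in. The names and signatures are our own and are only meant to convey the structure, not the framework's actual interfaces.

/* Illustrative layer interfaces for a pluggable HA framework (names are ours). */
#include <stddef.h>

/* Lowest layer: 1-to-1 and 1-to-n communication driver (TCP, UDP, Myrinet, ...). */
struct comm_driver {
    int (*send)(int peer, const void *buf, size_t len);
    int (*multicast)(const int *peers, int npeers, const void *buf, size_t len);
};

/* Group (n-to-n) communication layer: membership and ordered delivery. */
struct group_comm {
    int  (*join)(const char *group);
    int  (*atomic_multicast)(const void *buf, size_t len);   /* total order       */
    void (*member_failed)(int member);                       /* failure detector  */
    const struct comm_driver *driver;                         /* plugged-in driver */
};

/* Virtual synchrony layer: application-friendly replication interfaces built
 * on top of the group communication system. */
struct virtual_synchrony {
    int (*replicate_state)(const void *update, size_t len);  /* state-machine style */
    int (*replicated_rpc)(int func_id, const void *args, size_t len);
    const struct group_comm *group;
};

A plug-in implementing a different group communication algorithm would only need to provide another struct group_comm instance, leaving the layers above and below untouched.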

Our high availability framework itself is component-based, i.e. individual modules within each layer may be replaced with other modules providing different properties for the same service. Our framework allows exchanging software modules using plug-in technology previously developed in the Harness project [14, 16, 24].

In the following sections, we describe each of the four high availability framework layers in more detail.

3.1 Communication Drivers
Today's high-end computing systems come with a variety of different network technologies, such as Myrinet, Elan4, Infiniband and Ethernet. Our high availability framework is capable of supporting vendor supplied network technologies as well as established standards, such as TCP and UDP, using communication drivers, thus enabling efficient communication between participating processes running in virtual synchrony within an application.

The concept of using communication drivers to adapt specific APIs of different network technologies to a unified communication API in order to make them interchangeable and interoperable is not new. For example, Open MPI [22, 32] uses a component-based framework and encapsulates communication drivers using interchangeable and interoperable components.

We are currently investigating whether our high availability framework is able to profit from Open MPI communication driver technology by using the Open MPI framework. This also provides an opportunity for Open MPI to benefit from our high availability framework using active/active high availability for essential Open MPI services.

Furthermore, we are also going to consider heterogeneity aspects, such as byte ordering and high-level protocols. For the moment, communication drivers offer an interface that deals with raw data packets only. The use of high-level protocols is managed in the group communication layer. Future work in this area will also reuse recent research in adaptive, heterogeneous and reconfigurable communication frameworks, such as RMIX [28].

3.2 Group Communication Layer
The group communication layer contains all essential protocols and services to run virtual synchronous processes for active/active high availability. It also offers necessary services for active/hot-standby high availability with multiple standby processes using coherent replication. The group communication layer provides group membership management, external failure detectors, reliable multicast mechanisms and atomic multicast algorithms.

Many (60+) group communication algorithms/systems can be found in the literature. Our pluggable component-based high availability framework provides an experimental platform for comparing existing solutions and for developing new ones. Implementations with various replication and control strategies using a common API allow adaptation to system properties and application needs. The modular architecture also enables other researchers to contribute their solutions based on the common API.

We will also incorporate previous research in adaptive group communication frameworks using micro-protocol layers and micro-protocol state-machines.

3.3 Virtual Synchrony Layer
The supported APIs at the virtual synchrony layer are based on application properties. Deterministic and fully symmetrically replicated applications may use replication interfaces for memory, files, state-machines and databases. Nondeterministic or asymmetrically replicated applications may use replication interfaces for distributed control and replicated remote procedure calls.


These application properties are entirely based on the group communication system's point of view and its limited knowledge about the application.

For example, a batch job scheduler that runs on multiple head nodes in a Beowulf-type cluster maintains a global application state among the participating processes. Each scheduler process has the same application state and receives state changes in the same (total) order. A job scheduling request sent to one of these processes results in a state change that schedules the job on all processes. Multiple requests are ordered and their state changes are executed in the same order. The job scheduler's behavior is obviously deterministic from the group communication system point of view and the state-machine interface can be used.

However, only one participating scheduler process is actively starting jobs in our example, while the others only accept job scheduling requests from users. Conceptually, the job start part of the entire job management system is organized in a fail-over chain, while the job scheduling part is provided in a symmetric fashion by all head nodes.

A job is started using a replicated RPC that goes out to all participating processes, but executes the job start only on one node. The other processes wait for the replicated RPC return in order to decide whether the job start was successful or not. The job start and its result are based on locality as the application is not operating entirely symmetrically.

In case of failure, the surviving job scheduler processes continue to provide their service and a new leader process may be elected to start future jobs. However, a failure of the leader process during a job launch may require clean-up procedures on compute nodes depending on the distributed process model (e.g. bproc [8]) used. Also, currently running jobs need to be associated with the new leader process.

The job scheduler's mode of operation may be optimized by load balancing job starts among the participating processes. This may significantly increase performance if the job scheduler is capable of starting individual computational processes directly.

As we can see from this simple batch job scheduler example, the active/active high availability model is inherently complex. An API that allows an application to be viewed as a state-machine or that provides a replicated RPC facility significantly improves usability.
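To make the state-machine view concrete, the following self-contained sketch mimics the job scheduler example: every replica applies the same (totally ordered) requests through the same deterministic transition function and therefore ends up with the same queue. Delivery is simulated here by a plain loop; in the framework it would come from the group communication layer, and all names are illustrative only.

/* State-machine replication in miniature: identical requests applied in
 * identical order yield identical replica state.  Names are illustrative. */
#include <stdio.h>
#include <string.h>

#define MAX_JOBS 16

struct sched_state {                 /* replicated application state        */
    int  njobs;
    char queue[MAX_JOBS][32];
};

/* Deterministic state transition: same input order => same final state. */
static void apply_submit(struct sched_state *s, const char *job)
{
    if (s->njobs < MAX_JOBS)
        strncpy(s->queue[s->njobs++], job, sizeof s->queue[0] - 1);
}

int main(void)
{
    struct sched_state replica[3] = { {0}, {0}, {0} };       /* three head nodes */
    const char *requests[] = { "jobA", "jobB", "jobC" };     /* total order      */

    /* The group communication layer would deliver each request to every
     * replica in this order; here we simply iterate. */
    for (int r = 0; r < 3; r++)
        for (int i = 0; i < 3; i++)
            apply_submit(&replica[r], requests[i]);

    printf("replica 0 head of queue: %s\n", replica[0].queue[0]);
    return 0;
}

The fail-over chain for the actual job start would sit on top of this: all replicas agree on the queue, while only the current leader executes the launch.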

3.4 Applications
There are many, very different, applications for our high availability framework. Typical single points of failure and control involve head and service nodes. We previously described the active/active job scheduler example. Other applications include: essential fault-tolerant MPI services (e.g. name server), parallel file system services (e.g. metadata server) and services collecting data from compute nodes (e.g. system monitoring).

Another application area is more deeply involved with the OS kernel itself. For example, single system image (SSI) solutions run one virtual OS on many processors in SMP mode. Memory page replication is needed for a highly available SSI. A similar challenge is posed by globally addressable memory, as in the Cray X1 system. Both applications target head and service nodes as well as compute nodes.

Applications on compute nodes include: super-scalable diskless checkpointing, localized MPI recovery and coherent caching of checkpoints on multiple service nodes. High availability on compute nodes will become more important in the future due to the growing size of high-end computing systems. An example is the recently deployed IBM Blue Gene/L with its 130,000 processors on compute nodes and several thousand processors on service nodes. The mean time to failure is going to outrun the mean time to recover without high availability capabilities.

Our high availability framework will be implemented using a set of shared and static libraries. Depending on the application area it may be used within the application by direct linking, via a daemon process by network access, or via an OS interface, such as /sys or /dev.

4. PROTOTYPE IMPLEMENTATION
We are currently in the process of implementing an initial prototype using the lightweight Harness kernel [16] as a flexible and pluggable backbone (Figure 2) for the described software components.

Figure 2: Pluggable Lightweight Harness Kernel

Conceptually, the Harness software architecture consists of two major parts: a runtime environment (kernel) and a set of plug-in software modules. The multi-threaded user-space kernel daemon manages the set of dynamically loadable plug-ins. While the kernel provides only basic functions, plug-ins may provide a wide variety of services.

In fact, our previous research in Harness already targeted a group communication system to manage a symmetrically distributed virtual machine environment (DVM) using distributed control as a form of RPC-based virtual synchrony. The Harness distributed control plug-in provides virtual synchrony services to the Harness DVM plug-in, which maintains a symmetrically replicated global state database for high availability. The accomplishments and limitations of Harness and other group communication middleware projects were the basis for the flexible, pluggable and component-based high availability framework.

Furthermore, we are also currently investigating different group communication protocols for large-scale applications, such as SSI on compute nodes. Since the virtual synchrony model is based on an all-to-all broadcast problem, scalability is an issue that needs to be addressed. The performance impact for small-scale applications, such as a scheduler running on multiple head nodes, is negligible, but large-scale and/or distributed process groups need to deal with latency and bandwidth limitations.

5. CONCLUSIONS
We presented a flexible, pluggable and component-based high availability framework that expands today's effort in high availability computing of keeping a single server alive to include all machines cooperating in a high-end scientific computing environment. The framework mainly targets symmetric active/active high availability, but also supports other high availability techniques.

Our high availability framework is a proof-of-concept implementation that aims to remove typical single points of failure and single points of control from high-end computing systems, such as single head and service nodes, while adapting to system properties and application needs. It uses pluggable communication drivers to allow seamless adaptation to different vendor supplied network technologies as well as established standards. Its pluggable component-based group communication layer provides an experimental platform for comparing existing group communication solutions and for developing new ones. Furthermore, its virtual synchrony layer provides adaptation to different application properties, such as asymmetric behavior.

We target applications that usually provide services on single head and service nodes, such as schedulers and essential message passing layer services. We also target compute node applications, such as the message passing layer itself and super-scalable high availability technologies for 100,000 processors and beyond.

6. REFERENCES
[1] Special issue on group communications systems. Communications of the ACM, 39(4), 1996.
[2] N. R. Adiga et al. An overview of the Blue Gene/L supercomputer. Proceedings of SC, also IBM research report RC22570 (W0209-033), 2002.
[3] D. Agarwal. Totem: A reliable ordered delivery protocol for interconnected local-area networks. PhD thesis, University of California, Santa Barbara, 1994.
[4] ASCI Blue Gene/L Computing Platform at Lawrence Livermore National Laboratory, Livermore, CA, USA. http://www.llnl.gov/asci/platforms/bluegenel.
[5] N. T. Bhatti, M. A. Hiltunen, R. D. Schlichting, and W. Chiu. Coyote: A system for constructing fine-grain configurable communication services. ACM Transactions on Computer Systems, 16(4):321–366, 1998.
[6] K. Birman, B. Constable, M. Hayden, J. Hickey, C. Kreitz, R. van Renesse, O. Rodeh, and W. Vogels. The Horus and Ensemble projects: Accomplishments and limitations. Proceedings of DISCEX, 1:149–161, 2000.
[7] G. Bosilca, Z. Chen, J. Dongarra, and J. Langou. Recovery patterns for iterative methods in a parallel unstable environment. Submitted to SIAM Journal on Scientific Computing, 2005.
[8] BProc: Beowulf Distributed Process Space at Sourceforge.net. http://bproc.sourceforge.net.
[9] Carrier Grade Linux Project at Open Source Development Labs (OSDL), Beaverton, OR, USA. http://www.osdl.org/lab activities/carrier grade linux.
[10] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra. Building fault survivable MPI programs with FT-MPI using diskless checkpointing. Submitted to PPoPP, 2005.
[11] G. V. Chockler, I. Keidar, and R. Vitenberg. Group communication specifications: A comprehensive study. ACM Computing Surveys, 33(4):1–43, 2001.
[12] X. Defago, A. Schiper, and P. Urban. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys, 36(4):372–421, 2004.
[13] D. Dolev and D. Malki. The Transis approach to high availability cluster communication. Communications of the ACM, 39(4):64–70, 1996.
[14] W. R. Elwasif, D. E. Bernholdt, J. A. Kohl, and G. A. Geist. An architecture for a multi-threaded Harness kernel. Lecture Notes in Computer Science: Proceedings of PVM/MPI User's Group Meeting, 2131:126–134, 2001.
[15] C. Engelmann and G. A. Geist. A diskless checkpointing algorithm for super-scale architectures applied to the fast Fourier transform. Proceedings of CLADE, pages 47–52, 2003.
[16] C. Engelmann and G. A. Geist. A lightweight kernel for the Harness metacomputing framework. Proceedings of HCW, 2005.
[17] C. Engelmann and G. A. Geist. Super-scalable algorithms for computing on 100,000 processors. Proceedings of ICCS, 2005.
[18] C. Engelmann, S. L. Scott, and G. A. Geist. High availability through distributed control. Proceedings of HAPCW, 2004.
[19] G. E. Fagg, A. Bukovsky, and J. J. Dongarra. Harness and fault tolerant MPI. Parallel Computing, 27(11):1479–1495, 2001.
[20] G. E. Fagg, A. Bukovsky, S. Vadhiyar, and J. J. Dongarra. Fault-tolerant MPI for the Harness metacomputing system. Lecture Notes in Computer Science: Proceedings of ICCS 2001, 2073:355–366, 2001.


[21] FT-MPI Project at University of Tennessee, Knoxville, TN, USA. http://icl.cs.utk.edu/ftmpi.
[22] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. Proceedings of 11th European PVM/MPI Users' Group Meeting, 2004.
[23] G. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam. PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, USA, 1994.
[24] G. A. Geist, J. A. Kohl, S. L. Scott, and P. M. Papadopoulos. HARNESS: Adaptable virtual machine environment for heterogeneous clusters. Parallel Processing Letters, 9(2):253–273, 1999.
[25] HA-OSCAR at Louisiana Tech University, Ruston, LA, USA. http://xcr.cenit.latech.edu/ha-oscar.
[26] I. Haddad, C. Leangsuksun, and S. Scott. HA-OSCAR: Towards highly available Linux clusters. Linux World Magazine, March 2004.
[27] IBM Blue Gene/L Computing Platform at IBM Research. http://www.research.ibm.com/bluegene.
[28] D. Kurzyniec, T. Wrzosek, V. Sunderam, and A. Slominski. RMIX: A multiprotocol RMI framework for Java. Proceedings of IPDPS, pages 140–145, 2003.
[29] Lawrence Berkeley National Laboratory, Berkeley, CA, USA. BLCR Project at http://ftg.lbl.gov/checkpoint.
[30] L. Moser, Y. Amir, P. Melliar-Smith, and D. Agarwal. Extended virtual synchrony. Proceedings of DCS, pages 56–65, 1994.
[31] MPICH-V Project at University of Paris South, France. http://www.lri.fr/~gk/mpich-v.
[32] Open MPI Project. http://www.open-mpi.org.
[33] PBSPro Job Management System at Altair Engineering, Inc., Troy, MI, USA. http://www.altair.com/software/pbspro.htm.
[34] PBSPro Job Management System for the Cray XT3 at Altair Engineering, Inc., Troy, MI, USA. http://www.altair.com/pdf/PBSPro Cray.pdf.
[35] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972–986, 1998.
[36] PVM Project at Oak Ridge National Laboratory, Oak Ridge, TN, USA. http://www.csm.ornl.gov/pvm.
[37] R. I. Resnick. A modern taxonomy of high availability. 1996.
[38] The Spread Toolkit and Secure Spread Project at Johns Hopkins University, Baltimore, MD, USA. http://www.cnds.jhu.edu/research/group/secure spread/.
[39] X1 Computing Platform at Cray Inc., Seattle, WA, USA. http://www.cray.com/products/x1.
[40] Xpand Project at Hebrew University of Jerusalem, Israel. http://www.cs.huji.ac.il/labs/danss/xpand.
[41] XT3 Computing Platform at Cray Inc., Seattle, WA, USA. http://www.cray.com/products/xt3.


A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines

Samuel Thibault
LaBRI
Domaine Universitaire, 351 cours de la liberation
33405 Talence cedex, France

[email protected]

ABSTRACT
With the current trend of multiprocessor machines towards more and more hierarchical architectures, exploiting the full computational power requires careful distribution of execution threads and data so as to limit expensive remote memory accesses. Existing multi-threaded libraries provide only limited facilities to let applications express distribution indications, so that programmers end up explicitly distributing tasks according to the underlying architecture, which is difficult and not portable. In this article, we present: (1) a model for dynamically expressing the structure of the computation; (2) a scheduler interpreting this model so as to make judicious hierarchical distribution decisions; (3) an implementation within the Marcel user-level thread library. We experimented with our proposal on a scientific application running on a ccNUMA Bull NovaScale with 16 Intel Itanium II processors; results show a 30% gain compared to a classical scheduler, and are similar to what a handmade scheduler achieves in a non-portable way.

1. INTRODUCTION
“Disable HyperThreading!” That is unfortunately the most common pragmatic answer to performance losses noticed on HyperThreading-capable processors such as the Intel Xeon. This is of particular concern since hierarchy depth has increased over the past few years, making current computer architectures more and more complex (Sun WildFire [10], Sgi Origin [13], Bull NovaScale [28] for instance).

Those machines look like Russian dolls: nested technologies allow them to execute several threads at the same time on the same core of one processor (SMT: Simultaneous Multi-Threading), to share cache memory between several cores (multicore chips), and finally to interconnect several multiprocessor boards (SMP) thanks to crossbar networks. The resulting machine is a NUMA (Non-Uniform Memory Architecture) computer, on which the memory access delay depends on the relative positions of processors and memory banks (this is called the “NUMA factor”).

The recent integration of SMT and multicore technologies makes the structure of NUMA machines even more complex, yet operating systems still have not exploited previous NUMA machines efficiently. Hennessy and Patterson underlined that fact [11] about systems proposed for SGI Origin and Sun Wildfire: “There is a long history of software lagging behind on massively parallel processors, possibly because the software problems are much harder.” The introduction of new hardware technologies emphasizes the need for software development. Our goal is to provide a portable solution to enhance the efficiency of high-performance multi-threaded applications on modern computers.

Obtaining optimal performance on such machines is a significant challenge. Indeed, without any information on tasks' affinity, it is difficult to make good decisions about how to group tasks working on a common data set on NUMA nodes. Detecting such affinity is hard, unless the application itself somehow expresses it.

To relieve programmers from the burden of redesigning the whole task scheduling mechanism for each target machine, we propose to establish a communication between the execution environment and the application so as to automatically get an optimized schedule. The application describes the organization of its tasks by grouping those that work on the same data (memory affinity) for instance. The system scheduler can then exploit this information by adapting the task distribution to the hierarchical levels of the machine.

Of course, a universal scheduler that would get good results by using only such a small amount of information remains to be written. In the meantime, we provide facilities for applications to query the system about the topology of the underlying architecture and “drive” the scheduler. As a result, the programmer can easily try and evaluate different gathering strategies. More than a mere scheduling model, we propose a scheduling experimentation platform.

In this article, we first present the main existing approaches that exploit hierarchical machines, then we propose two new models describing application tasks and hierarchical levels of the machine, as well as a scheduler that takes advantage of them. Some implementation details and evaluation results are given before concluding.

2. EXPLOITING HIERARCHICAL MACHINES
Nowadays, multiprocessor machines like NUMAs with multi-threaded multicores are increasingly difficult to exploit. Several approaches have been considered.


2.1 Predetermined distribution and scheduling
For very regular problems, it is possible to determine a task schedule and a data distribution that are suited to the target machine and its hierarchical levels. The application just needs to get the system to apply that schedule and that distribution, and excellent (if not optimal) performance can be obtained. The PaStiX [12] large sparse linear systems solver is a good example of this approach. It first launches a simulation of the computation based on models of BLAS operators and communications on the target architecture. Then it can compute a static schedule of block-computations and communications.

So as to enforce these scheduling strategies, many systems (Aix, Linux, Solaris, Windows, ...) allow process threads to be bound to processor sets, and memory allocations to be bound to memory nodes. Provided that the machine is dedicated to the application, the thread scheduling can be fully controlled by binding exactly one thread to each processor. To perform task switching, mere explicit context switches may be used: threads are only used as execution flow holders.
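On Linux, for example, binding the calling thread to a single processor can be done with the standard CPU affinity interface; a minimal sketch (error handling omitted) could look as follows:

/* Bind the calling thread to one processor using the Linux affinity API. */
#define _GNU_SOURCE
#include <sched.h>

int bind_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

A predetermined schedule would call such a binding once per thread, one thread per processor, and then rely on explicit context switches for task switching.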

2.2 Opportunist distribution and scheduling
Greedy algorithms (called Self-Scheduling (SS) [27]) are dynamic, flexible and portable solutions for loop parallelization. Whatever the target machine, a Self-Scheduling algorithm takes care of both thread scheduling and data distribution. Operating system schedulers are based on these algorithms.

They basically use a single list of ready tasks from which the scheduler just picks up the next thread to be scheduled. Hence the workload is automatically distributed between processors. For each task, the last processor on which it was scheduled is recorded, so as to try to reschedule it on the same processor as much as possible to avoid cache misses. These techniques are used in the Linux 2.4 and Windows 2000 [25] operating systems. However, a unique thread list for the whole machine is a bottleneck, particularly when the machine has many processors.

To avoid such contention, Guided Self-Scheduling (GSS) [22] and Trapezoid Self-Scheduling (TSS) [30] algorithms make each processor take a whole part of the total work when they are idle, raising the risk of imbalances. AFfinity Scheduling (AFS) [15] and Locality-based Dynamic Scheduling (LDS) [14] algorithms use a per-processor task list. Whenever idle, a processor will steal work from the least loaded list, for instance. These latter algorithms are used by current operating systems (Linux 2.6 [1], FreeBSD 5.0 [24], Cellular Irix [33]). They also add a few rebalance policies: new processes are charged to the least loaded processor, for instance.

However, contention appears quickly with an increased number of processors, particularly on NUMA machines. Wang et al. propose a Clustered AFfinity Scheduling (CAFS) [31] algorithm which groups p processors in groups of √p. Whenever idle, rather than looking around the whole machine, processors steal work from the least loaded processor of their group, hence getting better localization of list accesses. Moreover, by aligning groups to NUMA nodes, data distribution is also localized. Finally, the Hierarchical AFfinity Scheduling (HAFS) (Wang et al. [32]) algorithm lets any idle group steal work from the most loaded group. This latter approach is being considered for latest NUMA-aware developments of operating systems such as Linux 2.6 and FreeBSD.
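The following sketch illustrates the general per-processor-list scheme described in this section (our simplification, not any particular kernel's code): a processor first pops from its own list and, when idle, steals from another list; real schedulers choose the victim according to load, as explained above.

/* Illustrative per-processor ready lists with a simple stealing fallback. */
#include <stddef.h>

struct task { struct task *next; };

struct cpu_queue { struct task *head; int load; };

static struct task *pop(struct cpu_queue *q)
{
    struct task *t = q->head;
    if (t) { q->head = t->next; q->load--; }
    return t;
}

/* Pop locally if possible, otherwise steal from some other non-empty queue
 * (a real scheduler would pick the victim queue based on its load). */
static struct task *next_task(struct cpu_queue *queues, int ncpus, int self)
{
    struct task *t = pop(&queues[self]);
    for (int i = 0; t == NULL && i < ncpus; i++)
        if (i != self)
            t = pop(&queues[i]);
    return t;
}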

2.3 Negotiated distribution and scheduling
There are intermediate solutions between predetermined and opportunist scheduling. Some language extensions such as OpenMP [16], HPF (High Performance Fortran) [26] or UPC (Unified Parallel C) [5] let one achieve parallel programming by simply annotating the source code. For instance, a for loop may be annotated to be automatically parallelized. An HPF matrix may be annotated to be automatically split into rather independent domains that will be processed in parallel.

The distribution and scheduling decisions then belong to the compiler. To do this, it adds code to query the execution environment (the number of processors for instance) and compiles the program in a way generic enough to adapt to the different parallel architectures. In particular, it will have to handle threads for parallelized loops or distributed computing, and even handle data exchange between processors (in the case of distributed matrices of HPF). To date, expressiveness is limited mostly to “Fork-Join” parallelism, which means, for instance, that the programmer cannot express imbalanced parallelism.

Programmers may also directly write applications that are able to adapt themselves to the target machine at runtime. Modern operating systems provide full information about the architecture of the machine (user-level libraries are available: lgroup [29] for Solaris or numa [1] for Linux). The application can then not only get the number of processors, but also get the NUMA nodes hierarchy, their respective number of processors and their memory sizes. Those systems also let the application choose the memory allocation policy (specific memory node, first touch or round robin) and bind threads to CPU sets. Thus, the application controls threads and memory distribution, but it is then in charge of balancing threads between processors.
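For instance, on Linux the numa library mentioned above lets a program check for NUMA support and query the node hierarchy; a minimal sketch (link with -lnuma) could be:

/* Query basic NUMA topology with libnuma on Linux. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("no NUMA support on this system\n");
        return 1;
    }
    printf("highest NUMA node: %d\n", numa_max_node());
    return 0;
}

The application would then use this information to decide by itself how many threads to create and where to bind them, which is precisely the rebalancing burden the negotiated approach leaves to the programmer.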

2.4 Discussion
We chose to classify existing approaches into three categories. The predetermined category gives excellent performance. But it is portable only if the problem is regular, i.e., its solving time depends on the data structure and not on the data itself. The opportunist approach scales well, but does not take task affinities into account, and thus, on average, does not get excellent performance. The negotiated approach lets the application adapt itself to the underlying machine, but requires rewriting of some parts of the scheduler in order to be flexible.

Our proposal is a mix between the negotiated and opportunist approaches. We will give programmers the means to dynamically describe how their applications behave, and use this information to guide a generic opportunist scheduler.


3. PROPOSAL: AN APPLICATION-GUIDED SCHEDULER
Our proposal is based on a collaboration between the application and its execution environment.

3.1 Bubbles modeling the application structure
The application is asked to model the general layout of its threads in terms of nested sets called bubbles1.

Figure 1 shows such a model: the application groups threads into pairs, along with a communication thread (priorities will be discussed later). The concept of bubbles can be understood as a coset with respect to a specific affinity relation, and bubble nesting expresses refinement of a relation by another one. Indeed, several affinity relations can be considered, for instance:

Data sharing It is a good idea to group threads that work on the same data so as to benefit from cache effects, or at least to avoid spreading the data throughout the NUMA nodes thereby incurring the NUMA factor penalty.

Collective operations It can be beneficial to optimize the scheduling of threads which are to perform collective operations such as a synchronization barrier, which ensures that all involved threads have reached the barrier before they can continue executing.

SMT Many attempts were made to address thread scheduling on Simultaneous Multi-Threading (SMT) processors, mostly by detecting affinities between threads at runtime [4, 17]. Indeed, in some cases, pairs of threads may be able to efficiently exploit the SMT technology: they can run in parallel on the logical processors of the same physical processor without interfering. If the programmer knows that some pairs of threads can work in such symbiosis, he can express this relation.

Other relations may be possible to express parallelism, sequentiality, preemption, etc. Yet, blindly expressing these relations may also be detrimental: Bulpin and Pratt show performance loss [3] on SMT processors due to frequent cache misses for instance; Antonopoulos et al. also show performance loss [2] when not taking the SMP bus bandwidth limit into account. But the programmer may try and test different refinements of the relations and thus experimentally reveal how the threads of an application should be related.

In order to cope with the emerging multiprocessor networks of the 1980's, Ousterhout [21] proposed to group data and threads by affinity into gangs. These gangs hold a fixed number of threads which are to be launched at the same time on the same machine of the network: this is called Gang Scheduling. However, processors may be left idle because a single machine can only run one gang at a time, even if it is “small”. Feitelson et al. [8] propose a hierarchical control of the processors so as to execute several gangs on the same machine. Our approach is actually a generalization of this approach.

1In a way relatively similar to some communication libraries such as MPI, which ask the application to specify communicators: groups of machines which will communicate.

3.2 Task lists modeling the computing power structure
According to Dandamudi and Cheng [6], a hierarchy of task lists generally brings better performance than simple per-processor lists. This is why two-level list schedulers have been developed [9, 20]. Moreover, it makes task binding to processor sets easier. In a manner similar to Nikolopoulos et al.'s Nano-Threads list hierarchy [19], we have taken up and generalized this point of view.

Indeed, we model hierarchical machines by a hierarchy of task lists. Each component of each level of the hierarchy of the machine has one and only one task list. Figure 2 shows a hierarchical machine and its model. The whole machine, each NUMA node, each core, each physical SMT processor and each logical SMT processor has a task list.

For a given task, the list on which it is inserted expresses the scheduling area: if the task is on a list associated with a physical chip, it will be allowed to be run by any processor on this chip; if it is placed on the global list, it will be allowed to be run by any processor of the machine.

3.3 Putting both models together: a bubble scheduler
Once the application has created bubbles, threads and bubbles are just “tasks” that the execution environment distributes on the machine.

3.3.1 Bubble evolution
As Figure 3 shows, the goal of a bubble is to hold tasks and bring them to the level where their scheduling will be most efficient. For this, the bubble goes down through lists to the wanted hierarchical level. It then “bursts”, i.e. held threads and bubbles are released and can be executed (or go deeper). The list of held tasks is recorded, for a potential later regeneration (see Section 3.3.3). The main issue is how to specify the right bursting level of a bubble.

In the long run, once we get good heuristics for a bubble scheduler, specifying such a parameter will no longer be necessary. For now, the goal is to provide an experimental platform for developing schedulers, and hence allow this parameter to be tuned by the scheduler developers. They can favor task affinity with the risk of making the load balance difficult (by setting deep bursting levels) or on the contrary favor processor use (by setting high bursting levels).

3.3.2 Priorities
We choose to let the application attach integer priorities to tasks. When a processor looks for a task to be scheduled, it searches through the lists that “cover” this processor, from the most local one (i.e. on low levels) to the most global one, looking for a task with highest priority. It will then schedule that task, even if less prioritized tasks remain on more local lists.
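A simplified sketch of that lookup (our own illustration, not Marcel's actual code): starting from the processor's own list and walking up to the machine-wide list, keep the most prioritized task seen so far, with ties assumed to go to the more local list.

/* Illustrative hierarchical runqueue lookup; all names are ours, not Marcel's. */
#include <stddef.h>

struct task { int priority; struct task *next; };

/* One task list per hierarchical level component (logical CPU, chip, NUMA
 * node, machine); 'parent' points to the next, more global, level. */
struct runqueue {
    struct task     *tasks;
    struct runqueue *parent;
};

/* Search the lists covering a processor from most local to most global and
 * return the task with the highest priority (ties keep the more local task). */
static struct task *pick_task(struct runqueue *local)
{
    struct task *best = NULL;
    for (struct runqueue *rq = local; rq != NULL; rq = rq->parent)
        for (struct task *t = rq->tasks; t != NULL; t = t->next)
            if (best == NULL || t->priority > best->priority)
                best = t;
    return best;
}

As in the text, a higher-priority task on a global list wins over lower-priority tasks sitting on the processor's own list.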


Figure 1: Bubble example, with priorities: thread pairs that have a higher priority than the bubbles holding them, and a highly prioritized thread.

Figure 2: A high-depth hierarchical machine and its model. (a) A NUMA of HyperThreaded multicores. (b) Model with task lists, with one list for the machine and one per NUMA node, core, physical processor and logical processor.

Figure 1 shows an example using priorities. In this example, bubbles holding computing threads are less prioritized than the threads. Consequently, a bubble will burst only if every thread of the previously burst bubbles has terminated, or if there are not enough of them to occupy all the processors. This results in some Gang scheduling which automatically occupies all the processors.

3.3.3 Bubble regeneration
Bubbles are automatically distributed by the scheduler over the different levels of task lists of the machine, hence distributing threads on the whole machine while taking affinity into account. However, it is possible that a whole thread group has far less work than others and terminates before them, leaving idle the whole part of the machine that was running it.

To correct such imbalance, some bubbles may be regenerated and moved up. Idle processors would then move some of them down on their side and have them re-burst there, getting a new distribution suited to the new workload while keeping affinity intact.

To prevent such imbalances, bubbles may periodically be regenerated2: each bubble has its own time slice after which its threads are preempted and the bubble regenerated.

In the case of Figure 1, the preemption mechanism is extended to Gang Scheduling: whenever a bubble is regenerated (because its time slice expired), it is put back at the end of the task list while another bubble is burst to occupy the resulting idle processors.

3.4 Discussion
Bubbles give programmers the opportunity to express the structure of their application and to guide the scheduling of their threads in a simple, portable and structured way. Since the roles of processors and other hierarchical levels are not predetermined, the scheduler still has some degrees of freedom and can hence use an opportunist strategy to distribute tasks over the whole machine. By taking into account any irregularity in the application, this scheduler significantly enhances the underlying machine exploitation. Such preventive rebalancing techniques may still have side effects and lead to pathological situations (ping-ponging between tasks, useless bubble migration just before termination, etc.).

2In a way similar to Unix system thread preemption.


Figure 3: Bubble evolution. (a) The outermost bubble starts on the general list. (b) It bursts, releasing a thread (which can immediately be scheduled on any processor) and two sub-bubbles which can go down through the hierarchy. (c) Going down achieved. (d) Both sub-bubbles burst, releasing two threads each. (e) Threads are distributed appropriately and can start in parallel.

marcel_t thread1, thread2;
marcel_bubble_t bubble;

marcel_bubble_init(&bubble);
marcel_create_dontsched(&thread1, NULL, fun1, para1);
marcel_create_dontsched(&thread2, NULL, fun2, para2);
marcel_bubble_inserttask(&bubble, thread1);
marcel_wake_up_bubble(&bubble);
marcel_bubble_inserttask(&bubble, thread2);

Figure 4: Bubble creation example: threads are created without being started, then they are inserted in the same bubble.


4. IMPLEMENTATION DETAILS
Marcel [18, 7] is a two-level thread library: in a way similar to manual scheduling (see section 2.1), it binds one kernel-level thread on each processor and then performs fast user-level context switches between user-level threads, hence getting complete control on threads scheduling3 in userland without any further help from the kernel. Our proposal was implemented within Marcel's user threads scheduler.

Figure 4 shows an example of using the interface to build and launch a bubble containing two Marcel threads.

The Marcel scheduler already had per-processor thread lists, so that integrating bubbles within the library did not need a thorough rewriting of the data structures. The scheduler code was modified to implement list hierarchy, bubble evolution and to take priorities (described in Section 3.3.2) into account.

3We suppose that no other application is running, and neglect system daemons wake-ups.

So as to avoid contention, there is no global scheduling: processors just call the scheduler code themselves whenever they preempt (or terminate) a thread. The scheduler finds some thread that is ready to be executed by the processor. We added bubble management there: while looking for threads to execute, the scheduler code now also tries to “pull down” bubbles from high list levels and make them burst on a more local level. Getting an efficient implementation is complex, as explained below.

Given a processor, two passes are done to look for the task (thread or bubble) with maximum priority among all the tasks of the lists “covering” that processor. The first pass quickly finds the list containing the task with the highest priority, without the need of a lock. That list and the list holding the currently running task are locked4. A second pass is then used to check that the selected list still has a task of this priority, in case some other processor took it in the meantime. If the selected task is a thread, it is scheduled; otherwise it is a bubble that the processor deals with appropriately (going down / bursting). The implementation time-complexity is linear with respect to the number of hierarchical levels of the machine.

Regenerating a bubble is also a difficult operation. Replacing threads in a given bubble requires removing all of them from the task lists, except threads being executed. Those threads go back in the bubble by themselves when the processors executing them call the scheduler. Eventually, the last thread closes the bubble and moves it up to the list where it was initially released by the bubble holding it.

4By convention, locking lists is done by locking high-level lists first, and for a given level, according to the level elements' identifiers.


                         Yield                    Switch
                   ns    cycles    %       ns     cycles    %
Marcel (original)  186    495     69        84     223     31
Marcel bubbles     250    665     63       148     395     37
NPTL (Linux 2.6)   672   1790     31      1488    3930     69

Table 1: Cost of the modified Marcel scheduler for searching lists, compared to other schedulers. Yield: list search only, Switch: synchronization and context switch.


5. PERFORMANCE EVALUATION
Our algorithm has some cost, but increases performance thanks to the resulting localization.

5.1 Bubble scheduler cost
We measured the performance impact of our implementation on the Marcel library running on a 2.66 GHz Pentium IV Xeon. Searching through lists has a reasonable cost, and our scheduler execution times are good compared to the Linux thread libraries LinuxThread (2.4 kernel) and NPTL (2.6 kernel), see Table 1.

Creation and destruction of a bubble holding a thread does not cost much more than creation and destruction of a simple thread: the cost increases from 3.3µs to 3.7µs.

Test-case examples of recursive creation of threads, such as divide-and-conquer Fibonacci, show that the cost of systematically adding bubbles that express the natural recursion of thread creations is quickly balanced by the localization that they bring: Figure 5 shows that performance is affected when only a few threads are created, while on a HyperThreaded Bi-Pentium IV Xeon, the performance gain stabilizes at around 30 to 40% with 16 threads; on a NUMA 4 × 4 Itanium II, the gain is 40% with 32 threads and gets up to 80% with 512 threads.

5.2 A real application
Marc Perache [23] used our scheduler in a comparison of the efficiency of various scheduling strategies for heat conduction and advection simulations. Results may be seen in Table 2. The target machine is a ccNUMA Bull NovaScale with 16 Itanium II processors and 64 GB of memory, distributed among 4 NUMA nodes. For a given processor, accessing the memory of its own node is about 3 times faster than accessing the memory of another node. The applications perform cycles of fully parallel computing followed by a global hierarchical communication barrier.

In the simple version, the mesh is split into as many stripes as the number of processors, and an opportunist schedule is used. The bound version binds them to processors in a non-portable way. This gets far better performance: each thread remains on the same node, along with its data. Our proposal lets the application query Marcel about the number of NUMA nodes and processors and then automatically build bubbles according to the hierarchy of the machine (hence 4 bubbles of 4 threads in this example). It gets performance very similar to those of the bound version.
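For example, building one bubble per NUMA node with the interface shown in Figure 4 might look like the following sketch. The node and processor counts (node_count, procs_per_node, MAX_NODES, MAX_PROCS) are assumed to have been obtained from Marcel beforehand, and work() and make_arg() are placeholder application functions, so this is only an illustration of the intended usage, not code from the paper.

/* Sketch: one bubble per NUMA node, each holding one thread per processor of
 * that node.  node_count/procs_per_node and the helpers are placeholders. */
marcel_bubble_t bubbles[MAX_NODES];
marcel_t threads[MAX_NODES][MAX_PROCS];

for (int n = 0; n < node_count; n++) {
    marcel_bubble_init(&bubbles[n]);
    for (int p = 0; p < procs_per_node; p++) {
        marcel_create_dontsched(&threads[n][p], NULL, work, make_arg(n, p));
        marcel_bubble_inserttask(&bubbles[n], threads[n][p]);
    }
    marcel_wake_up_bubble(&bubbles[n]);
}

The scheduler can then bring each bubble down to one NUMA node, reproducing the bound distribution without any machine-specific code in the application.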

Figure 5: Performance gain brought by adding bubbles to the fibonacci test-case. (a) 2 HyperThreaded Pentium IV Xeon. (b) 4 × 4 Itanium II.

                 Conduction               Advection
                 Time (s)   Speedup       Time (s)   Speedup
Sequential       250.2      -             16.13      -
Simple            23.65     10.58          1.77       9.11
Bound             15.82     15.82          1.30      12.40
Bubbles           15.84     15.80          1.30      12.40

Table 2: Conduction performance depending on the approach.

As can be seen, the use of bubbles attained performance close to that which may be achieved with a “handmade” thread distribution, but in a portable way.

These applications are a simple example in which the workload is balanced between stripes. The use of bubbles simply allowed it to automatically fit the architecture of the machine. However, in the future these applications will be modified to benefit from Adaptive Mesh Refinement (AMR) which increases computing precision on interesting areas. This will entail large workload imbalances in the mesh both at runtime and according to the computation results. It will hence be interesting to compare both development time and execution time of handmade-, opportunist-, and bubble-scheduled versions.

6. CONCLUSION
Multiprocessor machines are getting increasingly hierarchical. This makes task scheduling extremely complex. Moreover, the challenge is to get a scheduler that will perform “good” task scheduling on any multiprocessor machine with an arbitrary hierarchy, only guided by portable scheduling hints.

In this paper, we presented a new mechanism making significant progress in that direction: the bubble model lets applications express affinity relations of varying degrees between tasks in a portable way. The scheduler can then use these hints to distribute threads.

Ideally, the scheduler would need no other information to perform this. But practically speaking, writing such a scheduler is difficult and will need many experiments to be tuned. In the meantime, the programmer can use stricter guiding hints (indicating bubble bursting levels, for instance) so as to experiment with several strategies.

Performance observations on several test-cases are promising, far better than what opportunist schedulers can achieve, and close to what predetermined schedulers get. These observations were obtained on several architectures (Intel PC SMP, Itanium II NUMA).

This work opens numerous future prospects. In the short term, our proposal will be included within test-cases of real applications of CEA that run on highly hierarchical machines, hence stressing the bubble mechanism power. It will then be useful to develop analysis tools based on tracing the scheduler at runtime, so as to check and refine scheduling strategies. It will also be useful to let the programmer set other attributes than just priorities, and thus influence the scheduler: “strength” of the bubble (which expresses the amount of affinity that the bubble represents), preemptibility, some notion of amount of work, ...

In the longer term, the goal is to provide a means of expression powerful and portable enough for the application to obtain an automatic schedule that gets close to the “optimal” whatever the underlying architecture. It could also be useful to provide more powerful memory allocation functions, specifying which scope of tasks (a bubble for instance) will use the allocated area.

7. REFERENCES
[1] Linux Scalability Effort. http://lse.sourceforge.net/.
[2] C. Antonopoulos, D. Nikolopoulos, and T. Papatheodorou. Scheduling algorithms with bus bandwidth considerations for SMPs. In 2003 International Conference on Parallel Processing, pages 547–554. IEEE, Oct. 2003.
[3] J. R. Bulpin and I. A. Pratt. Multiprogramming performance of the Pentium 4 with hyper-threading. In Third Annual Workshop on Duplicating, Deconstructing and Debunking (WDDD2004) (at ISCA'04), pages 53–62, June 2004.
[4] J. R. Bulpin and I. A. Pratt. Hyper-threading aware process scheduling heuristics. In 2005 USENIX Annual Technical Conference, pages 103–106, 2005.
[5] W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Technical Report CCS-TR-99-157, George Mason University, May 1999. http://upc.gwu.edu/.
[6] S. Dandamudi and S. Cheng. Performance impact of run queue organization and synchronization on large-scale NUMA multiprocessor systems. Systems Architecture, 43:491–511, 1997.
[7] V. Danjean. Contribution a l'elaboration d'ordonnanceurs de processus legers performants et portables pour architectures multiprocesseurs. PhD thesis, Ecole Normale Superieure de Lyon, Dec. 2004.
[8] D. G. Feitelson and L. Rudolph. Evaluation of design choices for gang scheduling using distributed hierarchical control. Parallel and Distributed Computing, 35:18–34, 1996.
[9] A. Fukuda, R. Fukiji, and H. Kai. Two-level processor scheduling for multiprogrammed NUMA multiprocessors. In Computer Software and Applications Conferences, pages 343–351. IEEE.
[10] E. Hagersten and M. Koster. WildFire: A scalable path for SMPs. In The Fifth International Symposium on High Performance Computer Architecture, Jan. 1999. Sun Microsystems, Inc.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman, third edition, 2003.
[12] P. Henon, P. Ramet, and J. Roman. PaStiX: A parallel sparse direct solver based on a static scheduling for mixed 1D/2D block distributions. In Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, pages 519–527. Springer-Verlag, Jan. 2000.
[13] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In 24th International Symposium on Computer Architecture, pages 241–251, June 1997. Silicon Graphics, Inc.
[14] H. Li, S. Tandri, M. Stumm, and K. C. Sevcik. Locality and loop scheduling on NUMA multiprocessors. In International Conference on Parallel Processing, volume II, pages 140–127, Aug. 1993.
[15] E. Markatos and T. Leblanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. Parallel and Distributed Systems, 5(4):379–400, Apr. 1994.
[16] T. Mattson and R. Eigenmann. OpenMP: An API for writing portable SMP application software. In SuperComputing 99 Conference, Nov. 1999.
[17] R. L. McGregor, C. D. Antonopoulos, and D. S. Nikolopoulos. Scheduling algorithms for effective thread pairing on hybrid multiprocessors. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers, page 28.1, Washington, DC, USA, 2005. IEEE Computer Society.

Page 49: Second International Workshop on Operating …...Second International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on

[18] R. Namyst. PM2: un environnement pour une

conception portable et une execution efficace des

applications paralleles irregulieres. PhD thesis, Univ.de Lille 1, Jan. 1997.

[19] D. S. Nikolopoulos, E. D. Polychronopoulos, and T. S.Papatheodorou. Efficient runtime thread managementfor the nano-threads programming model. In Second

IEEE IPPS/SPDP Workshop on Runtime Systems for

Parallel Programming, volume 1388, pages 183–194,Orlando, FL, Apr. 1998.

[20] H. Oguma and Y. Nakayama. A scheduling mechanismfor lock-free operation of a lightweight process libraryfor smp computers. Conference on Parallel and

Distributed Systems, pages 235–242, July 2001.

[21] J. K. Ousterhout. Scheduling techniques forconcurrent systems. In Third International Conference

on Distributed Computing Systems, pages 22–30, Oct.1982.

[22] C. Polychronopoulos and D. Kuck. Guidedself-scheduling: A practical scheduling scheme forparallel supercomputers. Transactions on Computers,36(12):1425–1439, Dec. 1987.

[23] M. Perache. Nouveaux mecanismes au sein desordonnanceurs de threads pour une implantationefficace des operations collectives sur machinesmultiprocesseurs. In Rencontres francophones en

Parallelisme, Architecture, Systeme et Composant

(RenPar 16), Mar. 2005.

[24] J. Roberson. ULE: A modern scheduler for FreeBSD.Technical report, The FreeBSD Project,[email protected], 2003.

[25] M. Russinovich. Inside the windows NT scheduler,Part 2. Windows IT Pro, 303, July 1997.

[26] R. Schreiber. An introduction to HPF. In The Data

Parallel Programming Model: Foundations, HPF

Realization, and Scientific Applications, pages 27–44.Springer-Verlag, 1996.

[27] P. Tang and P.-C. Yew. Processor self-scheduling formultiple nested parallel loops. In Proceedings 1986

International Conference on Parallel Processing, pages528–535, Aug. 1986.

[28] Bull. Bull NovaScale servers.http://www.bull.com/novascale/.

[29] Sun microsystems. Solaris Memory Placement

Optimization (MPO).http://iforce.sun.com/protected/solaris10/

adoptionkit/tech/mpo/mpo_man.html.

[30] T. Tzen and L. Ni. Trapezoid self-scheduling: Apractical scheduling scheme for parallel compilers.Parallel and Distributed Systems, 4(1):87–98, Jan.1993.

[31] Y.-M. Wang, H.-H. Wang, and R.-C. Chang.Clustered affinity scheduling on large-scale NUMAmultiprocessors. Systems Software, 39:61–70, 1997.

[32] Y.-M. Wang, H.-H. Wang, and R.-C. Chang.Hierarchical loop scheduling for clustered NUMAmachines. Systems and Software, 55:33–44, 2000.

[33] S. Whitney, J. McCalpin, N. Bitar, J. L. Richardson,and L. Stevens. The SGI origin software environmentand application performance. In COMPCON 97,pages 165–170, San Jose, California, 1997. IEEE.

Page 50: Second International Workshop on Operating …...Second International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on

Remote-Write Communication Protocol for Clusters and Grids

Ouissem Ben Fredj
GET / INT
9, rue Charles Fourier, 91011 Évry, France
[email protected]

Éric Renault
GET / INT
9, rue Charles Fourier, 91011 Évry, France
[email protected]

ABSTRACT
Remote-write is a one-sided message-passing communication protocol adapted to cluster and grid computing. Its architecture is designed to make use of new features provided by recent Network Interface Cards, such as remote DMA and programmable cards. This protocol offers a flexible programming model as well as a low overhead. The first part of this article is a survey of different programming models and their use. The second part presents a comparison of remote-write with other communication protocols according to the steps involved in the transfer's critical path.

1. INTRODUCTION
In recent years, clusters of workstations have become a viable alternative to expensive supercomputers for high-performance computing. With the introduction of high-speed networking hardware in LANs, the performance bottleneck in clusters was shifted from networking hardware to communication protocols and other machine resources.

Remote-write (RW) aims at proposing a communication protocol using new high-speed network features without complicating developer tasks. The resulting protocol uses a one-sided scheme as a programming model. This allows data to be moved from the sender to the receiver without any intervention of the receiver process.

RW is a protocol designed to exploit hardware and software resources to perform communications efficiently with a simple programming model. Its main features are:

• No synchronization needed between the sender and the receiver to perform a communication;

• Simple user-level interface;

• Possible removal of the operating system from the critical communication path (for user-level implementations);

• User-level memory management;

• Contiguous memory transfer to increase throughput.

The first three sections of the article are an introduction to the remote-write programming model. Section 2 is an overview of the one-sided scheme, section 3 describes the scheme and its use, and section 4 compares the remote-write programming model with other existing models. The second part provides insights into the design issues for the RW protocol with a comparison with other protocols. We concentrate on issues that determine the performance and the semantics of a communication system: memory management (section 5), host memory - NI data transfer (section 6), send and receive queue management (section 7), data transfer (section 8), and communication control (section 9).

2. OVERVIEW
In the literature, there are many communication protocols. Remote-write adopts the one-sided communication protocol for many reasons discussed in sections 3 and 4. A one-sided protocol is based on a single primitive, either a send or a receive primitive. A send-based protocol allows the sender to move its data directly into the receiver's memory without intervention of the receiver process.

The need for a one-sided communication protocol has been recognized for many years. Some of these issues were initially addressed by the POrtable Run-Time Systems (PORTS) consortium [1]. One of the missions of PORTS was to define a standard API for one-sided communications. During the development, several different approaches were taken towards the one-sided API. The first is the thread-to-thread communication paradigm, which is supported by CHANT [2]. The second is the remote service request (RSR) communication approach, supported by libraries such as NEXUS and DMCS. The third approach is a hybrid communication scheme (combining the two prior paradigms) supported by the TULIP [3] project. These paradigms are widely used. For example, Nexus supports the grid computing software infrastructure GLOBUS. MOL [4] extends DMCS with an object migration mechanism and a global namespace for the system. DMCS/MOL is used both in the Parallel Runtime Environment for Multi-computer Applications (PREMA) [5] and in the Dynamic Load Balancing Library (DLBL) [6].

In 1997, MPI-2 [7] (a new MPI [8] standard) included some basic one-sided functionality. Since then,


many studies have integrated one-sided communications to optimize MPI [9]. In 1999, a new communication library called the Aggregate Remote Memory Copy Interface (ARMCI) [10] was released. ARMCI is a high-level library designed for distributed array libraries and compiler run-time systems. IBM maintains a low-level API named LAPI [11] implementing the one-sided protocol and running on IBM SP systems only. Similarly, Cray SHMEM [12] provides direct send routines.

At the network layer, many manufacturers have built RDMA features that ease the implementation of one-sided paradigms. For example, the HSL [13] network uses the PCI-Direct Deposit Component (PCI-DDC) [14] to offer a message-passing multiprocessor architecture based on a one-sided protocol. InfiniBand [15] proposes native one-sided communications. Myrinet [16, 17] and QNIX [18] do not provide native one-sided communications, but these features may be added (for example in GM [19] with Myrinet, since Myrinet NICs are programmable).

The arrival of this kind of network has pushed common message-passing libraries to support RDMA (GM, VIA [20], ...). Note that most of these libraries have been extended with one-sided communications to exploit RDMA features, but they do not use these functionalities as the basis for their programming model.

Some machine hardware has provided a form of one-sided communication. Thinking Machines computers [21] (the CM-1 in 1980, the CM-2 in 1988, and the CM-5 in 1992) are representative examples. The CM-5 uses two primitives, called PUT and GET, to allow thousands of simple processors to communicate on a TeraFLOPS-class machine.

3. REMOTE-WRITE PROGRAMMING MODEL
RW uses the one-sided scheme as a programming model. This means that the completion of a send (resp. receive) operation does not require the receiver (resp. sender) process to take a complementary action. RW uses RDMA to copy data to (from) the remote user space directly. Figure ?? describes the different steps occurring in a basic communication between two processes. Suppose that the receiver process has allocated a buffer to hold incoming data and that the sender has allocated a send buffer. Prior to the data transfer, the receiver must have sent its buffer address to the sender. Once the sender owns the destination address, it initiates a direct-deposit data send. This task does not interfere with the receiver process. On the receiver side, the process keeps doing computation tasks, testing whether new messages have arrived, or blocking until an incoming-message event arises.
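As an illustration, the C sketch below shows how such an exchange might look from the application's point of view. The rw_* functions are hypothetical names invented for this example (they are not part of an existing library); the sketch only mirrors the steps described above: the receiver allocates and publishes a buffer address, the sender deposits data into it, and the receiver polls its event queue while it keeps computing.

/* Hypothetical remote-write primitives, declared only so the sketch compiles. */
#include <stddef.h>
#include <stdint.h>

void *rw_alloc(size_t size);                                        /* allocate a registered receive buffer  */
void  rw_send_small(int dest, const void *p, size_t n);             /* small message, payload in descriptor  */
void  rw_write(int dest, uint64_t raddr, const void *p, size_t n);  /* direct deposit into remote memory     */
int   rw_poll(void);                                                /* non-blocking check of the event queue */

void receiver_side(int sender)
{
    void *buf = rw_alloc(4096);
    uint64_t addr = (uint64_t)(uintptr_t)buf;
    rw_send_small(sender, &addr, sizeof addr);   /* publish the destination address once */
    while (!rw_poll()) {
        /* keep doing computation; the deposit completes without the receiver's help */
    }
}

void sender_side(int receiver, uint64_t raddr, const void *data, size_t n)
{
    rw_write(receiver, raddr, data, n);          /* no action required on the receiver side */
}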

There are several classes of applications that are easier to write with one-sided communication:

• message passing algorithms;

• remote paging;

• adaptive codes where communication requirements are not known in advance;

• codes that perform random access to distributed data; in this case the process owning the data does not know which data will be accessed;

• asynchronous parallel algorithms;

• symmetric machine programming;

• data storage algorithms...

The RW programming model is simple, flexible and can be used as a high-level interface, or as a middleware layer between a high-level library such as MPI and the network level. A recent study showed that all MPI-2 routines can be implemented on top of a RW interface easily and efficiently [22]. Thus, any message-passing algorithm may be implemented using this programming model.

4. SYNCHRONIZATION
One way to compare communication libraries is to classify them according to the sender-receiver synchronization required to perform data exchanges. There are three synchronization modes: full synchronization mode, rendez-vous mode, and asynchronous mode.

With the full synchronization mode, the sender has to ensure that the receiver is ready to receive incoming data; thus, flow control is required. FM [23] and FM/MC [24] implement flow control using a host-level credit scheme. Before a host sends a packet, it checks for credits regarding the receiver; a credit represents a packet buffer in the receiver's memory. Credits can be handed out in advance by pre-allocating buffers for specific senders, but if a sender runs out of credits it must block until the receiver sends new credits. LFC [25], specifically designed for Myrinet clusters, implements two levels of point-to-point synchronization: NI level and host level. At host level, when the NI (Network Interface) control program receives a network packet, it tries to fetch a receive descriptor. When the receive queue is empty, the control program defers the transfer until the queue is refilled by the host library. At NI level, the protocol guarantees that no NI sends a packet to another NI before the receiving NI is able to store the packet. To achieve this, each NI assigns a number of its receive buffers to each NI in the system. An NI can transmit a packet to another one if there is at least one credit for the receiver. Each credit matches a free receive buffer for the sender. Once the sender has consumed all its credits for the receiver, it must wait until the receiver frees some of its receive buffers for this sender and returns new credits. Credits are returned by means of explicit acknowledgements or by piggybacking them on application-level return traffic. This mechanism is set up for every pair of communicating nodes, so it is very expensive in NI memory resources and synchronization time. Indeed, for applications using a lot of small messages, NI buffers could overflow quickly and the synchronization time may exceed the latency.
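For comparison, a toy version of such a host-level credit scheme could look like the C fragment below. It is only a sketch of the idea described above (in the spirit of FM/LFC); packet_send() and the credits[] array are illustrative assumptions, not an existing API.

#include <stdbool.h>

#define MAX_NODES 64
void packet_send(int dest, const void *pkt, int len);   /* hypothetical low-level send */

static int credits[MAX_NODES];   /* one credit == one free receive buffer at the peer */

static bool try_send(int dest, const void *pkt, int len)
{
    if (credits[dest] == 0)
        return false;            /* out of credits: block or retry until some return */
    credits[dest]--;
    packet_send(dest, pkt, len);
    return true;
}

/* Called when credits come back, either explicitly acknowledged or
 * piggybacked on application-level return traffic. */
static void credits_returned(int src, int n)
{
    credits[src] += n;
}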

The rendez-vous mode shifts the duty of flow control to the application. For example, BIP [26], VIA [20], BDM [27] and GM require that a receive request be posted before the message enters the destination node. Otherwise, the message is dropped and/or NACKed. VMMC [28] uses a transfer redirection that consists in preallocating a default, redirectable receive buffer whenever a sender does not know the final receive buffer address.


Later, when the receiver posts the receive buffer, the data are copied from the default buffer to the receive buffer. To use such a library, a middleware layer must be added between the user application and the library to ensure flow control. Similarly to VMMC, if sender data arrive before the receiver creates the corresponding Receiver Context, a QNIX program moves incoming data into a buffer in the NIC memory.

The asynchronous mode breaks all synchronization constraints between sender and receiver. The completion of the send operation does not require the intervention of the receiver process to take a complementary action. This mode allows overlapping of computation and communication, zero-copy without synchronization, deadlock avoidance, and an efficient use of the network (since messages do not block on switches waiting for the receive operation). As a consequence, the asynchronous mode provides high throughput and low latency, in addition to flexibility (since the synchronized modes can be implemented using the asynchronous mode).

AM [29] and PM2 (Parallel Multithreaded Machine) [30] use the latter mode to perform RPC-like communications. Each AM message contains the address of a user-level handler which is executed on message arrival with the message body as an argument. Unlike RPC, the role of the handler is to get the message out of the network and integrate it into the receiver process space. The problem with this scheme is that, for each message, a process handler is created (as with PM2) or an interrupt is generated (as with the Genoa Active Message MAchine (GAMMA) [31]), which is expensive in both time and resources.

Some libraries (like VIA, AMII and DP [32]) require a startup connection to be established before any communication. Such a connection consists in creating a channel that allows communication between a sender and a receiver. This step is most often used to exchange the capabilities (reliability level, quality of service, ...) and restrictions (maximum transfer size, ...) of the processes. This can be useful for dynamic and heterogeneous topologies.

As discussed previously, each synchronization mode has advantages and drawbacks. RW implements the asynchronous mode for both its simplicity and its efficiency. The rest of the article analyzes each step of the communication's critical path in order to implement the asynchronous mode efficiently.

5. MEMORY MANAGEMENT
Memory allocation precedes any data transfer. It consists in reserving memory areas to store the data to send or to receive. The way allocated areas are managed influences performance.

Most communication libraries use DMA to transfer data. The main constraint of DMA operations is that physical addresses are required rather than virtual ones. Therefore, transfer routines must provide the physical addresses of all pages of a message to the DMA engine. This is a tricky task because a contiguous message in the virtual address space is not necessarily contiguous in physical memory. A virtual-to-physical translation table built at allocation time can be used. Later, at the send (resp. receive) step, the translation table is used to gather (resp. scatter) data from (resp. to) memory to (resp. from) the network interface.
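The following sketch shows what such a per-buffer translation table might look like in C. It is illustrative only: phys_addr_of() stands for whatever kernel or driver service actually performs the lookup, and it is stubbed here so the fragment is self-contained.

#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

struct xlat_table {
    size_t    npages;
    uint64_t *phys;      /* physical address of each page of the buffer */
};

static uint64_t phys_addr_of(const void *vaddr)   /* placeholder for the real lookup */
{
    (void)vaddr;
    return 0;
}

static struct xlat_table *build_xlat(const void *buf, size_t len)
{
    struct xlat_table *t = malloc(sizeof *t);
    t->npages = (len + PAGE_SIZE - 1) / PAGE_SIZE;
    t->phys = malloc(t->npages * sizeof *t->phys);
    for (size_t i = 0; i < t->npages; i++)
        t->phys[i] = phys_addr_of((const char *)buf + i * PAGE_SIZE);
    return t;    /* later used to gather/scatter pages to/from the NI */
}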

GM adds some optimizations to the use of the translation table: it stores the table in the user's memory to be able to translate the whole memory, and it creates a small cache table in the NIC memory. The cache table contains the virtual-to-physical translations of the most used pages. To avoid page swapping, allocated buffers have to be locked.

Another solution, used by the network layer of MPC-OS [33], consists in splitting the message to send into several smaller messages whose size is less than the size of a page.

Yet another solution consists in managing physical addresses directly, without operating system intervention. The idea is to allocate a physically contiguous buffer and to map it into a virtually contiguous address space. Thus, just one translation is needed. Its most important advantage is the avoidance of scatter/gather operations at transfer time. In FreeBSD, a kernel function allows the allocation of physically contiguous buffers. In Linux, there are two methods to allocate physically contiguous memory. The first one is to change the kernel policy by changing the Linux source code. The second one consists in allocating memory at boot time. A driver maps the whole physically contiguous memory into a virtually contiguous area. Then, a function is used to search for a contiguous memory area that fits the requested size in the set of free spaces. Note that this function can be executed in user space without any call to the operating system.
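A minimal user-space allocator over such a region could be as simple as the fragment below. It assumes a region that is physically contiguous and already mapped (here faked by a static array), and it is simplified to a bump pointer rather than a real free-list search; it only illustrates why a single translation then covers every buffer.

#include <stddef.h>

#define REGION_SIZE (16UL << 20)           /* 16 MB, arbitrary for the example */
static char   region[REGION_SIZE];         /* stands in for the mapped contiguous region */
static size_t next_free;

static void *contig_alloc(size_t size)
{
    size = (size + 63) & ~(size_t)63;      /* keep buffers cache-line aligned */
    if (next_free + size > REGION_SIZE)
        return NULL;
    void *p = region + next_free;
    next_free += size;
    return p;    /* virtual-to-physical translation is a single constant offset */
}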

Memory allocation is not a step of the communication's critical path, but the policy used to manage memory has an important impact on data transfers. Our goal is to reduce the time spent in virtual-to-physical translation by using physically contiguous memory allocations.

6. HOST MEMORY - NI DATA TRANSFER
With the RW protocol, the NI must communicate with the host memory in three cases. The first case is when the user process informs the NI of a new send: the user process sets up a send descriptor to be used by the NI to send the message. The second and third cases are when sending and when receiving messages. For traditional message-passing systems, the user process must provide a receive descriptor to the NI. There are three methods to communicate between the host memory and the NI:

• PIO: The host processor writes (resp. reads) data to (resp. from) the I/O bus. However, only one or two words can be transferred at a time, resulting in a lot of bus transactions. Throughput is different for writes and reads, mainly because writing across a bus is usually a little faster than reading;

• Write combining: It enhances write PIO performance by enabling a write buffer for uncached writes, so that affected data transfers can occur at cache-line size instead of word size. Note that write combining is a hardware feature initially introduced on the Intel Pentium Pro and now available on recent AMD processors;


• DMA: DMA engines can transfer entire packets in large bursts and proceed in parallel with host computation. Because DMA engines work asynchronously, host memory that is the source or destination of a DMA transfer must not be swapped out by the operating system. Some communication systems pin buffers before starting a DMA transfer. Moreover, DMA engines must know the physical addresses of the memory pages they access.

Choosing the suitable type of data transfer depends on the host CPU, the DMA engine, the transfer direction, and the packet size. A solution consists in classifying messages into three types: small messages, medium messages, and large messages. PIO suits small messages, write combining (when supported) suits medium messages, and DMA suits large messages. The definition of a medium message (and thus the definition of both small and large messages) changes according to the CPU, the DMA engine, and the transfer direction. Since DMA-related operations (initialization, transfer, virtual-to-physical translation) can be done by the NI or by the user process, a set of performance tests is the best way to define medium messages.
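The dispatch itself is trivial once the thresholds are known; the C sketch below only fixes the structure, with placeholder threshold values that would in practice come out of the performance tests mentioned above.

enum xfer_method { XFER_PIO, XFER_WC, XFER_DMA };

#define SMALL_MAX   256UL    /* bytes, illustrative placeholder */
#define MEDIUM_MAX 4096UL    /* bytes, illustrative placeholder */

static enum xfer_method pick_transfer(unsigned long len, int wc_supported)
{
    if (len <= SMALL_MAX)
        return XFER_PIO;                    /* PIO suits small messages           */
    if (len <= MEDIUM_MAX && wc_supported)
        return XFER_WC;                     /* write combining for medium ones    */
    return XFER_DMA;                        /* DMA amortizes its setup on large   */
}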

RW should take the strengths of each method into account in order to communicate efficiently with the network interface.

7. SEND AND RECEIVE QUEUE MANAGEMENT
RW uses two queues to ensure data transfer: a send queue and a receive event queue. Unlike synchronous libraries, RW does not need a receive queue to specify receive buffers. However, RW needs a receive event queue, which contains the list of incoming messages.

Queues allow asynchronous operations. In fact, to send a message, the user process just appends a descriptor to the send queue. Once the operation is finished, the sender continues with the next send or with its computing task. The receive event queue is used to probe or poll for new receive events.

The send queue contains a set of send descriptors provided by user processes and read by the NI at send time. A send descriptor specifies the outgoing buffer address, its size, the receiver node, the receiver buffer address, and the transfer type. Additional attributes can be specified to customize the send (security level, reliability level). The NI uses send descriptors to initiate sends. Three steps are required to initiate a send.

The first one is the initialization of the send descriptor. This step is part of the transfer's critical path if the send is a point-to-point communication. For collective sends (multicast, broadcast, ...), this step can be done once for multiple send requests.

The second step consists in appending the send descriptor to the send queue. This step depends on the send queue management. In fact, according to the NI type and the memory size, the send queue can be stored either in the NI memory or in the host memory. The first case (used by FM and GM) avoids the NI having to poll a queue in host memory. The second case (used by the VIA specification and implemented by Berkeley VIA [34]) allows a larger queue. MyVIA [35], an implementation of VIA over Myrinet, uses two queues (a small ring in the NI memory and a big ring in host memory). If the small ring is not full, the send descriptor is written there directly. Otherwise, it is written in the big ring. If the number of unused descriptors in the small ring reaches a lower limit and there are unprocessed descriptors in the big ring, the NI requests a driver agent to move big-ring descriptors to the NI.

The third step of the critical path is the polling performed by the NI on the send queue. This step depends on the previous one. A comparison between MyVIA and Berkeley VIA showed that storing the send queue in the NI memory ensures a more efficient management of the send queue, especially for small messages. In fact, Berkeley VIA requires two transactions between the NI and the host memory (the first one to inform the NI about the send and the second one to read the descriptor from host memory) whereas MyVIA needs only one transaction. Conversely, since user processes poll it, the receive event queue should be stored in host memory.

Since a send descriptor is only several bytes long, PIO or write-combining techniques should be used to update the NI send queue or to inform the NI about a new send.
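To make the descriptor and queue handling concrete, here is a possible layout in C. Field names, field sizes and the ring length are assumptions made for the example, not the layout of an existing implementation; the append routine shows the single host-side write that PIO or write combining would cover.

#include <stdint.h>

/* Illustrative layout, not taken from an existing implementation. */
struct send_desc {
    uint64_t local_addr;     /* outgoing buffer address            */
    uint64_t remote_addr;    /* receiver buffer address            */
    uint32_t length;
    uint16_t dest_node;
    uint8_t  xfer_type;      /* PIO, write combining or DMA        */
    uint8_t  flags;          /* e.g. security or reliability level */
};

#define SENDQ_LEN 128

struct send_queue {
    struct send_desc ring[SENDQ_LEN];
    volatile uint32_t head;  /* advanced by the host */
    volatile uint32_t tail;  /* advanced by the NI   */
};

static int sendq_append(struct send_queue *q, const struct send_desc *d)
{
    uint32_t next = (q->head + 1) % SENDQ_LEN;
    if (next == q->tail)
        return -1;           /* queue full: caller must retry */
    q->ring[q->head] = *d;   /* a few bytes only: PIO or write combining is enough */
    q->head = next;
    return 0;
}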

8. DATA TRANSFER
As introduced in section 3, RW is based on a send function only. The receive operation consists in checking the receive event queue for receive completions. Receive operations are detailed in the next section.

To avoid bottlenecks and use available resources efficiently, a data transfer should take into account the message size, the host architecture (processor speed, PCI bus speed, DMA characteristics), the NIC properties (transfer mode, memory size), and the network characteristics (routing policy, route dispersion, ...).

Many studies have tried to measure network traffic to determine the message sizes used in practice. However, they mainly focused either on a set of applications [36], a set of protocols [37], a set of networks [38] or a specific environment (a single combination of network, protocol, machines and applications) [39]. All these studies show that small messages are predominant (about 80% are less than 200 bytes). Moreover, RW requires an extra use of small messages to send receive buffer addresses. Thus, it is interesting for RW to distinguish between small and large messages. As discussed earlier, the maximum size of small messages should be determined using performance evaluation.

For the transfer of small messages, neither a send buffer address nor a receive buffer address is required. Therefore, it is possible to store the content of a small message directly in the send descriptor. To send such a message, as shown in Fig. 1, seven operations are performed: (1) the sender sets up the send descriptor (including the data); (2) the sender informs the NI about the send descriptor; (3) the NI copies the necessary data from host memory; (4) the NI sends the message to the network; (5) the remote NI receives the message and appends it to the receive event queue; finally, (6) the receiver process reads the data from the receive descriptor and (7) informs the NI that the receive completed successfully.
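A possible descriptor layout for this case is sketched below: the payload travels inside the descriptor itself, so no buffer address has to be exchanged beforehand. The structure and the 128-byte limit are assumptions made for the example.

#include <stdint.h>
#include <string.h>

#define SMALL_PAYLOAD 128    /* illustrative upper bound for a small message */

struct small_send_desc {
    uint16_t dest_node;
    uint16_t length;                   /* at most SMALL_PAYLOAD              */
    uint8_t  payload[SMALL_PAYLOAD];   /* data copied inline, step (1) above */
};

static void make_small_desc(struct small_send_desc *d, int dest,
                            const void *data, uint16_t len)
{
    if (len > SMALL_PAYLOAD)
        len = SMALL_PAYLOAD;           /* larger messages use the normal path */
    d->dest_node = (uint16_t)dest;
    d->length = len;
    memcpy(d->payload, data, len);     /* steps (2) to (7) are then driven by the NI */
}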


Figure 1: Short send with Remote-write.

For the transfer of large messages, the sender has to specify both the send buffer address and the remote buffer address, and ten steps are involved, as shown in Fig. 2: (1) the sender prepares the send buffer; (2) the sender sets up the send descriptor and writes it to the NI memory using PIO; (3) the sender informs the NI about the send descriptor; (4) the NI reads the send descriptor; (5) the NI copies data from host memory according to the send descriptor; (6) the NI sends the message to the network; (7) the remote NI receives the message and writes the incoming data to their final destination; (8) the remote NI appends a receive descriptor to the receive event queue; (9) the receiver process reads the receive descriptor and (10) informs the NI that the receive completed successfully.

Network policy can also affect data transfers. Adaptive routing, which allows multiple routes for each destination, may cause buffer overwriting due to the unordered arrival of messages. This problem does not exist with synchronous transfers.

Data transfer is the most important step of the communication, so care should be taken when writing its routines.

9. COMMUNICATION CONTROL
The main focus of this section is how to retrieve messages from the network device. With RW, the NI informs the user process about the completion of the receive. When working on parallel machines with user-level access to the network, a user may have to choose between an interrupt-driven communication system and a polling-based system.

The interrupt-driven approach lets the network device signal the arrival of a message by raising an interrupt. This is a familiar approach, in which the kernel is involved in dispatching the interrupt to the application in user space. The alternative model for message handling is polling. The network device does not actively interrupt the CPU, but merely sets some status bits to indicate the arrival of a message. The application is required to poll the device status regularly; when a message is detected, the polling function returns the receive descriptor describing the message.

Quantifying the difference in cost between using interrupts and polling is difficult because of the large number of parameters involved: hardware (cache sizes, register windows, network adapters), operating system (interrupt handling), run-time support (thread packages, communication interfaces), and application (polling policy, message arrival rate, communication patterns).

First, executing a single poll is typically much cheaper than taking an interrupt, because a poll executes entirely in user space without any context switching. Recent operating systems decrease the interrupt cost by saving minimal process state, but interrupts remain expensive. Second, comparing the cost of a single poll to the cost of a single interrupt does not provide a sufficient basis for statements about application performance. Each time a poll fails, the user program wastes a few cycles. Thus, coarse-grain parallel computing favors interrupts, while fine-grain parallelism favors polling.

For applications containing unprotected critical sections, interrupts lead to nondeterministic bugs, while polling leads to safe runs. Moreover, for asynchronous communication, polling can lead to substantial overhead if the frequency of arrivals is low enough that the vast majority of polls fail to find a message. With interrupts, overhead only occurs when there are arrivals.

Most high-speed communication libraries (AM, FM, PM [40], GM) use polling and leave interrupts to signal exceptions like queue overflow or an invalid receive buffer address. FM/MC, LFC, and PANDA [41] use a system that automatically integrates polling and interrupts through a thread scheduler. Since there is no incompatibility between RW and either polling or interrupts, users can use either one depending on the application context. Note that interrupts may be imposed by some libraries to guarantee forward progress for system communications.
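A common compromise between the two models is to poll for a bounded number of iterations and then fall back to blocking on an interrupt; the C sketch below shows the shape of such a policy. rw_poll() and rw_wait_interrupt() are hypothetical names standing for a library's non-blocking test and blocking wait.

#include <stdbool.h>

bool rw_poll(void);            /* hypothetical: true when a receive event is queued   */
void rw_wait_interrupt(void);  /* hypothetical: block until the NI raises an interrupt */

static void wait_for_message(int max_polls)
{
    for (int i = 0; i < max_polls; i++)
        if (rw_poll())
            return;            /* fine-grain traffic: a cheap poll succeeded */
    rw_wait_interrupt();       /* low arrival rate: stop burning cycles      */
}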


Figure 2: Normal send with Remote-write.

10. CONCLUSION
This paper presented several design issues for high-speed communication protocols. First, we classified protocols into three synchronization modes: full synchronization mode, rendez-vous mode and asynchronous mode. Second, we detailed the different steps involved in the transfer's critical path. Regarding the different implementation choices, we draw the following conclusions:

• one-sided communications have been in use for a long time;

• the use of either PIO, write combining or DMA for data transfer may be critical for efficient implementations;

• remote-write is suited to a large variety of applications;

• remote-write allows zero-copy transfer without synchronization;

• remote-write can take good advantage of physically contiguous memory;

• it is necessary to have messages which do not require addresses to be exchanged;

• no receive queue is required; however, the use of a receive event queue is preferred;

• there are no strong requirements regarding the use of polling or interrupts when receiving messages; however, the choice of one over the other is application dependent and performance results may differ significantly.

11. REFERENCES
[1] The PORTS Consortium. The PORTS0 Interface. Technical Report ANL/MCS-TM-203, Mathematics and Computer Science Division, Argonne National Laboratory, February 1995.
[2] M. Haines, P. Mehrotra, and D. Cronk. Chant: Lightweight threads in a distributed memory environment, 1995.
[3] P. Beckman and D. Gannon. Tulip: Parallel run-time support system for pC++.
[4] N. Chrisochoides, K. Barker, D. Nave, and C. Hawblitzel. Mobile object layer: A runtime substrate for parallel adaptive and irregular computations. Advances in Engineering Software, 31(8-9):621-637, August 2000.
[5] Kevin James Barker. Runtime support for load balancing of parallel adaptive and irregular applications. PhD thesis, 2004. Adviser: Nikos Chrisochoides.
[6] M. Balasubramaniam, K. Barker, I. Banicescu, N. Chrisochoides, J. P. Pabico, and R. L. Cario. A novel dynamic load balancing library for cluster computing. In Proceedings of the IEEE International Symposium on Parallel and Distributed Computing (ISPDC/HeteroPar 2004), pages 346-353, Cork, Ireland, July 2004. IEEE Computer Society.
[7] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Technical report, University of Tennessee, Knoxville, 1996.
[8] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994.
[9] J. Dobbelaere and N. Chrisochoides. One-sided communication over MPI-1.


[10] J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586:533-??, 1999.
[11] IBM Corporation. LAPI Programming Guide. IBM Document Number SA22-7936-00, IBM Reliable Scalable Cluster Technology for AIX 5L, First Edition, Poughkeepsie, NY, September 2004.
[12] R. Barriuso and Allan Knies. SHMEM User's Guide. Cray Research Inc., 1994.
[13] C. Whitby-Strevens et al. IEEE Draft Std P1355 — Standard for Heterogeneous Interconnect — Low Cost Low Latency Scalable Serial Interconnect for Parallel System Construction, 1993.
[14] A. Greiner, J. L. Desbarbieux, J. J. Lecler, F. Potter, F. Wajsburt, S. Penain, and C. Spasevski. PCI-DDC Specifications. UPMC / LIP6, Paris, France, December 1996. Revision 1.3.
[15] InfiniBand Trade Association. The InfiniBand Architecture, Specification Volume 1 & 2, June 2001. Release 1.0.a.
[16] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet: A gigabit-per-second local area network. IEEE Micro, 15(1):29-36, 1995.
[17] ANSI/VITA 26-1998. Myrinet-on-VME protocol specification, 1998.
[18] Amelia De Vivo. A Light-Weight Communication System for a High Performance System Area Network. PhD thesis, Università di Salerno, Italy, November 2001.
[19] GM: A message-passing system for Myrinet networks, 2.0.12, 1995.
[20] Virtual Interface Architecture Specification, Version 1.0, published by Compaq, Intel, and Microsoft, December 1997.
[21] Thinking Machines Corporation, Cambridge, Massachusetts. Connection Machine CM-5 Technical Summary, November 1992.
[22] Olivier Gluck. Optimisations de la bibliothèque de communication MPI pour machines parallèles de type grappe de PCs sur une primitive d'écriture distante. PhD thesis, Université Paris VI, July 2002.
[23] Scott Pakin, Mario Lauria, and Andrew Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet, pages ??-??, 1995.
[24] Efficient reliable multicast on Myrinet. In ICPP '96: Proceedings of the 1996 International Conference on Parallel Processing, Volume 3, page 156. IEEE Computer Society, 1996.
[25] R. Bhoedjang, T. Ruhl, and H. Bal. LFC: A communication substrate for Myrinet, 1998.
[26] Loic Prylli and Bernard Tourancheau. BIP messages user manual.
[27] G. Henley, N. Doss, T. McMahon, and A. Skjellum. BDM: A multiprotocol Myrinet control program and host application programmer interface, 1997.
[28] Matthias A. Blumrich, Kai Li, Richard Alpert, Cezary Dubnicki, Edward W. Felten, and Jonathan Sandberg. Virtual memory mapped network interface for the SHRIMP multicomputer. In ISCA '98: 25 Years of the International Symposia on Computer Architecture (Selected Papers), pages 473-484. ACM Press, 1998.
[29] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: A mechanism for integrated communication and computation. In 19th International Symposium on Computer Architecture, pages 256-266, Gold Coast, Australia, 1992.
[30] Raymond Namyst and Jean-Francois Mehaut. PM2: Parallel Multithreaded Machine. A computing environment for distributed architectures. In Parallel Computing: State-of-the-Art and Perspectives, Proceedings of the Intl. Conference ParCo '95, Ghent, Belgium, 19-22 September 1995, volume 11 of Advances in Parallel Computing, pages 279-285. Elsevier, February 1996.
[31] G. Chiola and G. Ciaccio. GAMMA: A low cost network of workstations based on Active Messages, 1997.
[32] Cho-Li Wang, Anthony T. C. Tam, Benny W. L. Cheung, Wenzhang Zhu, and David C. M. Lee. Directed Point: A communication subsystem for commodity supercomputing with Gigabit Ethernet. Future Generation Computer Systems, 18(3):401-420, 2002.
[33] A. Fenyo. Conception et réalisation d'un noyau de communication bâti sur la primitive d'écriture distante, pour machines parallèles de type « grappe de PCs ». PhD thesis, UPMC / LIP6, Paris, France, July 2001.
[34] Philip Buonadonna, Andrew Geweke, and David Culler. An implementation and analysis of the Virtual Interface Architecture. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, pages 1-15. IEEE Computer Society, 1998.
[35] Yu Chen, Xiaoge Wang, Zhenqiang Jiao, Jun Xie, Zhihui Du, and Sanli Li. MyVIA: A design and implementation of the high performance Virtual Interface Architecture. In CLUSTER '02: Proceedings of the IEEE International Conference on Cluster Computing, page 160. IEEE Computer Society, 2002.
[36] N. Basil and C. Williamson. Network traffic measurement of the X Window System.
[37] Jin Cao, William S. Cleveland, Dong Lin, and Don X. Sun. On the nonstationarity of Internet traffic. In SIGMETRICS '01: Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 102-112. ACM Press, 2001.


[38] Carey Williamson. Internet traffic measurement. IEEE Internet Computing, 5(6):70-74, 2001.
[39] R. Gusell. A measurement study of diskless workstation traffic on an Ethernet. IEEE Transactions on Communications, 38:1557-1568, September 1990.
[40] H. Tezuka, A. Hori, and Y. Ishikawa. PM: A high-performance communication library for multi-user parallel environments, 1996.
[41] Raoul Bhoedjang, Tim Ruhl, Rutger Hofman, Koen Langendoen, Henri Bal, and Frans Kaashoek. Panda: A portable platform to support parallel programming languages, pages 213-226, 1993.


Efficient Parallel Shell

Georges-Andre Silber
Centre de Recherche en Informatique, Ecole des mines de Paris
35, rue Saint-Honore, 77305 Fontainebleau cedex, France
E-Mail: [email protected]

ABSTRACT
We propose a slightly modified Unix shell where the user can explicitly or implicitly execute commands on different computers of a cluster. The shell redirects the output and input of those different commands to and from the right hosts, for instance when the user uses the redirections | (pipe), > (write to a file) and < (read from a file). We use a simple model of remote execution implemented as a portable Remote Execution Daemon (RED) that has to be executed on each node of the cluster and that can be seen as a highly simplified operating system. We show with a simulation that those very simple shell constructs coupled with a RED could lead to significant performance results on clusters.

Keywords: Unix shell, remote execution, parallel execution, pipelined execution.

1. INTRODUCTION
A Unix shell offers several ways to express parallelism. For instance, A|B|C describes a pipelined execution of processes A, B and C that collaborate in parallel to produce a result; the operator & launches a process without waiting for its result, giving the possibility to launch other processes in parallel.

On computers with a processor executing a single thread of instructions, processes are not executed in parallel. They are only executed in parallel if the computer has several processors or at least several execution units, and if the operating system is capable of distributing the processes among the execution units. Grouping computers with the same architecture into a cluster is a widely used way of running parallel programs. But it does not directly lead to a parallel execution of processes, because those processes must be executed remotely on the computers of the cluster. Programmers usually have to write specific programs using TCP/IP sockets or message-passing libraries like MPI to obtain a parallel execution of communicating processes.

Transparent remote execution of processes on the computers of a cluster can be obtained with a single system image operating system [1, 5]. It gives the user the illusion that the cluster is a single computer with several execution units. The underlying operating system is responsible for process allocation on computers, load balancing, input/output redirections, file system management, etc.

We propose a different approach where a modified Unix shell allows the user to explicitly or implicitly launch commands on different computers of a cluster. The shell redirects the output and input of those different commands to and from the right hosts, for instance when the user uses the redirections | (pipe), > (write to a file) and < (read from a file). An important aspect of our work is that the user has the choice to name the host where he wants to run a task or to let the shell decide where to run it. Another aspect is that the user does not have to transfer the file he wants to execute to the right hosts, the shell taking care of that for him.

Our approach requires a modification of the shell but no modification of the operating system. It uses a simple model of remote execution materialized as a Remote Execution Daemon (RED) [7] that has to be executed on each node of the cluster. This daemon implements a simple service for remote program execution and file storage and can be seen as a highly simplified operating system. Our shell process is in fact a client of many REDs running on distant hosts.

RED and the modified shell are highly portable across POSIX systems. We show with a simulation that those very simple constructs coupled with a simple RED could lead to significant performance results on clusters.

The organization of this paper is as follows. First, we present our extensions to the Unix shell. Second, we present performance results for a simple pipeline in two cases: a CPU-bound and an IO-bound application. We show that in every case our approach leads to speedups. Third, we give a sketch of the implementation of our extensions in GNU BASH using a C implementation of a RED based on XML-RPC [10, 4]. Finally, before we conclude, we discuss some related works.

2. PARALLEL SHELL
Our reference shell is GNU BASH and we based our extensions on the documentation that can be found in [2].


We describe our parallel shell in an informal way, without giving many implementation details. We are going to see in section 4 that our extensions can be easily implemented.

Our first restriction is that all computers share the same architecture. The second restriction is that the executable file and the required dynamic libraries of a command that is executed on a remote host must be present on the local host. They are transferred on demand by the shell to the remote host. With this last restriction, the computers of the cluster do not need to share a file system.

2.1 Local shell and remote hosts
When a shell is started, we consider that it has a list of available remote hosts that have the same architecture and that accept commands (typically, the computers of the cluster). We consider in a first approach that this list is read from a file when the shell starts and that it cannot change during shell execution.

The shell sends a message to each host of this list, getting a token for each host. Those tokens represent keys to private virtual file systems that are created on each host for the client shell when it asks for a token. Those file systems can be empty and there is no guarantee of persistence for the files stored on the remote host. The private virtual file systems on the remote hosts can be implemented in memory or on actual disks. The client shell is the only one that can access its private file system.

2.2 Simple commands
A simple shell command is a sequence of words separated by blanks, terminated by one of the shell's control operators (a newline or one of the following: '||', '&&', '&', ';', ';;', '|', '(', or ')'). The first word generally specifies a command to be executed, with the rest of the words being that command's arguments.

In our parallel shell, the first word, the command to be executed, can be prefixed by a sequence of characters of the form host:, where host is the computer where the command has to be executed. For instance, the command

node0:command a b c

runs the command program on the host node0 with the arguments a, b, and c. The remote command does not see the local files: all files used by command must be on node0. The shell is responsible for transferring the command executable and its associated dynamic libraries to node0 if they are not already there. The command can be given as the path where the command is found locally. For instance, the user can write:

node0:/usr/bin/command a b c

and the local command /usr/bin/command is transferred to node0, creating the directories on the fly if needed.

In a natural way, the standard input stream is taken from the terminal that launched the command and the standard output and error streams are redirected to the terminal. A command executed this way does not receive any PID (Process IDentifier), because no process is created on the local host. It only receives a job number in the BASH sense and can then be manipulated with BASH built-in commands.

2.3 Redirections
We add a special case for file names that are used in redirections. The user can prefix file names with host:, where host is the remote computer where the file must be written or read. For instance, the command

node0:command < node1:file1 > node2:file2

uses the remote file file1 as the input stream of the remote command command, which writes its result to the remote file file2. To copy the resulting file locally, it is sufficient to write

cat > localfile < node2:file2

2.4 Pipeline
The way we handle pipelines is one of the main aspects of our work. A pipeline is a sequence of simple commands separated by '|'. The format for a pipeline is

command1 [| command2 ...]

where the output of each command in the pipeline is connected via a pipe to the input of the next command. That is, each command reads the previous command's output. We modify the pipe semantics in the following way: if at least one command of the pipe is a remote execution, the pipe becomes a direct network connection. We are going to see in section 3 that this aspect is crucial in terms of performance.

2.5 Predefined parameter to get a remote hostname

We add a new special parameter to the shell, the parameter $:, which contains the name of a remote host accepting commands. Two consecutive uses of $: do not necessarily give the same result. By default, we consider a round-robin policy where we just pick the next remote host on the static list. Other policies can be implemented, for instance to take load-balancing aspects into account.

With this special parameter, it is possible to put the name of a distant host in a variable like

MYHOST=$:

and to use it in the following way

$MYHOST:command


The user can also directly write

$::command

leaving the placement decision to the shell.

We also add a variable called REDHOSTS that contains the list of all available hosts. In the following, we give an example where this special variable is used to run a single command on all available hosts.

2.6 Examples
With our parallel shell, it is possible to write commands with explicit placement like:

producer | node1:task1 | node2:task2 > node3:dfile

cat < node3:dfile > lfile

where node1, node2 and node3 are remote computers. The data produced by the process task1 running on node1 is sent directly to the standard input of the process task2 running on node2, without returning to the host running the process producer.

The special parameter $: is convenient to express an implicit placement like:

DESTHOST=$:

producer | $::task1 | $::task2 > $DESTHOST:dfile

cat < $DESTHOST:dfile > lfile

where the shell decides where to run the different tasks.

It is also possible to run a single command on multiple hosts at the same time:

for node in $REDHOSTS

do

$node:task &

done

wait

Note the '&' character that is necessary to run all the processes in parallel. The wait command is a BASH command that waits for the completion of all processes running in the background [2].

3. PERFORMANCE RESULTS
The results presented here come from a simulation we ran before we began the actual implementation of our extensions in GNU BASH and the development of RED with XML-RPC. We wanted to validate our ideas by experiments that are now motivating our developments. We show that our approach gives significant performance results when executing communicating processes in parallel in the form of a pipeline. We provide in [6] a file containing all the C source codes and shell scripts we used during our experiments.

We ran experiments using the following pipeline, where x is the number of packets of 1024 bytes that are produced by the program producer and consumed by the program consumer. Each packet contains random data.

producer x | task w | task w | task w | consumer x

The standard output of producer is transmitted to the standard input of a program task, which executes the following steps: 1) read a packet of 1024 bytes, 2) iterate w times over this packet, executing two arithmetic operations on each byte, and 3) write the packet to the standard output. The program consumer just reads x packets, one at a time, and does not write anything on standard output.
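As an indication of how simple these programs are, a minimal version of the task filter could look like the C program below. This is only a sketch written from the description above (packet size, w passes, two arithmetic operations per byte); the actual sources used for the experiments are the ones provided in [6].

/* Sketch of the task filter described in the text; real sources are in [6]. */
#include <stdio.h>
#include <stdlib.h>

#define PKT 1024

int main(int argc, char **argv)
{
    long w = (argc > 1) ? atol(argv[1]) : 1;   /* the "w" parameter of the pipeline */
    unsigned char buf[PKT];

    while (fread(buf, 1, PKT, stdin) == PKT) {             /* 1) read a packet         */
        for (long k = 0; k < w; k++)                       /* 2) iterate w times       */
            for (int i = 0; i < PKT; i++)
                buf[i] = (unsigned char)(buf[i] * 3 + 1);  /*    two operations per byte */
        fwrite(buf, 1, PKT, stdout);                       /* 3) write the packet back */
    }
    return 0;
}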

Figure 1: Sum of user and system time in seconds (y axis) for an increasing number of 1024-byte data packets exchanged (x axis). IO-bound version of the pipeline (2048 iterations per packet).

Sequential experiments were done using a Pentium 4 computer running Debian GNU/Linux. Its processor is clocked at 2.4 GHz and it has 2 GB of RAM (this computer is called stockholm). The +-line called Sequential in Figure 1 gives the running time (user and system time) in seconds of the previous pipeline for increasing numbers of packets x. The parameter w is set to 2 (1024 × 2 = 2048 iterations per packet). We call this version the IO-bound version because most of the running time is taken by input/output operations. Figure 2 represents the same experiment with the parameter w set to 1024 (1024 × 1024 = 1048576 iterations per packet). We call this version the CPU-bound version because most of the running time is taken by computing operations.

Our next experiments consisted in the evaluation of two distributed execution schemes: one that uses the rsh command, and another that uses a parallel execution scheme expressed with our shell extensions. We used three more computers for these experiments: a computer with two Opteron 244 processors at 1.8 GHz and 8 GB of shared RAM (surville), a Pentium 4 computer at 2.4 GHz with 2 GB of RAM (saigon), and a Pentium 4 computer at 3 GHz with 2 GB of RAM (nantes). The four computers are connected through a 100 Mbit/s Ethernet switch that is the main switch of our laboratory. We are not exactly in a cluster, but the differences between the execution schemes we present should remain in a true cluster environment. All four computers run Debian GNU/Linux.


Figure 2: Sum of user and system time in seconds (y axis) for an increasing number of 1024-byte data packets exchanged (x axis). CPU-bound version of the pipeline (1048576 iterations per packet). (Curves: Sequential, Remote, Parallel.)

The first distributed scheme of execution, which we call the remote execution scheme, is as follows:

producer x | (rsh surville task w) | (rsh saigon task w) | (rsh nantes task w) | consumer x

where producer and consumer are executed on the host stockholm and the three task processes are remotely executed on the hosts surville, saigon, and nantes. Three bi-directional network connections are established between the host stockholm and the hosts surville, saigon, and nantes. The order of data movements is depicted in Figure 3 a).

We can see in Figure 3 that the remote scheme of execution turns the host stockholm into a potential bottleneck for data transfers. This is verified in Figure 1, where the ×-line representing the execution times is far above the sequential execution for the IO-bound case. On the contrary, this scheme of execution leads to a linear speedup, as depicted in Figure 2, for the CPU-bound case.
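A rough, back-of-the-envelope count (our own reading of the schemes, assuming each hop carries the full data volume) illustrates the bottleneck: in the remote scheme, stockholm sends x packets to and receives x packets from each of the three remote hosts, so its network interface handles about 6x packets of 1024 bytes, whereas in the parallel scheme described next it only sends x packets to surville and receives x packets from nantes, i.e. about 2x packets, a threefold reduction of the traffic at stockholm.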

The second distributed scheme of execution, which we call the parallel execution scheme, is as follows:

producer x | surville:task w | saigon:task w | nantes:task w | consumer x

where an explicit direct network connection is made between stockholm and surville, surville and saigon, saigon and nantes, and nantes and stockholm. The order of data movements is depicted in Figure 3 b). We can see that this parallel scheme of execution has no bottleneck. This is verified in Figure 1, where the ∗-line representing the execution times becomes better than the sequential execution for the IO-bound case when the number of packets exchanged exceeds 10^5.

Figure 3: Communication schemes for the a) remote and the b) parallel executions of the pipelined command. (In a), all data passes through stockholm; in b), data flows from stockholm to surville, saigon, nantes, and back to stockholm.)


This scheme is equivalent to the remote execution scheme for the CPU-bound case depicted in Figure 2.

These encouraging results confirmed that our very simple extensions can lead to significant performance improvements with a very simple syntax. We therefore decided to implement those extensions, as discussed below.

4. IMPLEMENTATION IN GNU BASH USING RED-XML

Our developments are twofold. First, we are implementing in C a simple daemon for remote execution and storage, following the RED interface described in [7]. Second, we are modifying the source code of GNU BASH version 3.0 to add our extensions. The use of those extensions will be an option when BASH is launched.

Our implementation of RED sits on top of a regular Linux system and consists of an HTTP server waiting for messages in XML-RPC format. Each kind of message represents a call to one method of the RED interface. Basically, this interface allows files to be transferred and commands to be executed remotely, with flexible redirection of input/output to sockets. The main point is that this interface permits the implementation of each extension of our parallel shell in the form of one or several calls to methods of one or several RED servers. In fact, the design of the RED interface has been driven by the needs that appeared during the design of our parallel shell.
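To make the message format concrete, here is a purely illustrative sketch of the kind of XML-RPC call an extended shell might send to a RED server. The method name red.exec, the port 8080, and the /RPC2 path are our own placeholders and are not taken from the RED specification [7]; only the enclosing XML-RPC envelope follows the standard format.

# Hypothetical XML-RPC request to a RED server on surville; all names
# (red.exec, 8080, /RPC2) are placeholders, not the actual RED API.
curl -s -H 'Content-Type: text/xml' --data \
'<?xml version="1.0"?>
<methodCall>
  <methodName>red.exec</methodName>
  <params>
    <param><value><string>task 2</string></value></param>
  </params>
</methodCall>' http://surville:8080/RPC2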

5. RELATED WORK

The most recent work that shares some ideas with ours is [9]. Compared to our work, they do not address the problem of pipelines, and they do not give a simple and coherent shell syntax to place tasks explicitly on remote computers. The work in [8] is more an effort to build a Single System Image than a parallel shell. It addresses neither explicit placement nor pipelines. The work in [3] focuses on Grid environments, and its main purpose is to provide a way to build virtual grids composed of hundreds of heterogeneous machines. It provides neither implicit placement of processes nor any mechanism to automatically transfer executable files on demand to the remote hosts. Finally, none of the works cited above gives any real or simulated performance results for the execution time of a pipelined application.

6. CONCLUSION AND FUTURE WORK

We introduced a new syntax to express explicit or implicit simple parallelism on clusters with Unix shell constructs. We gave significant performance results that motivate our approach. We are now implementing the RED interface using XML-RPC and modifying GNU BASH with our parallel extensions.

This simple system is not as powerful and generic as a message-passing interface like MPI or a Single System Image operating system, but it is far simpler to use and to implement, and it requires no modification of existing programs or operating systems. Furthermore, people familiar with the shell should be able to use these new constructs in a very natural way.

We plan to extend our framework towards heterogeneous networks of workstations. An easy but not very elegant way to achieve this is to store several versions of the commands locally and to transfer the proper version at the right time.
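As a rough sketch of that idea (ours, not the paper's), one could keep one binary per target architecture and select the matching copy before the shell ships it to the remote host. The per-architecture file names and the use of uname -m below are purely hypothetical.

# Hypothetical sketch: pick the binary matching the remote architecture;
# task.x86_64, task.i686, etc. are assumed to have been compiled in advance.
arch=$(rsh surville uname -m)   # query the remote machine's architecture
cp "task.$arch" task            # install the matching version locally
surville:task 2                 # the shell can now transfer this version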

7. REFERENCES

[1] Barak, A., and La'adan, O. The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems 13, 4 (Mar. 1998), 361–372.

[2] Free Software Foundation (FSF). GNU BASH. Web. http://www.gnu.org/software/bash/bash.html.

[3] Kaneda, K., Taura, K., and Yonezawa, A. Virtual Private Grid: A Command Shell for Utilizing Hundreds of Machines Efficiently. In Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid (CCGrid) (May 2002).

[4] Laurent, S. S., Dumbill, E., and Johnston, J. Programming Web Services with XML-RPC. O'Reilly, 2001.

[5] Morin, C., Gallard, P., Lottiaux, R., and Vallée, G. Towards an Efficient Single System Image Cluster Operating System. Future Generation Computer Systems 20, 2 (2004).

[6] Silber, G.-A. Experiments for the COSET-2 Workshop. Web. http://www.cri.ensmp.fr/people/silber/metacc/red/coset02.tar.gz.

[7] Silber, G.-A. Remote Execution Daemon (RED): A Simple Service for Remote Execution and Storage. Tech. Rep. E-267, CRI/ENSMP, Apr. 2005. http://www.cri.ensmp.fr/classement/doc/E-267.pdf.

[8] Tan, C., Tan, C., and Wong, W. Shell over a Cluster (SHOC): Towards Achieving Single System Image via the Shell. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2002) (Sep. 2002), pp. 28–36.

[9] Truong, M., and Harwood, A. Distributed Shell over Peer-to-Peer Networks. In Proceedings of the 2003 International Conference on Parallel and Distributed Processing Techniques and Applications (Las Vegas, Nevada, 2003), pp. 269–275.

[10] Wikipedia, the Free Encyclopedia. XML-RPC. Web. http://en.wikipedia.org/wiki/XML-RPC.