pbsweb: a web-b ased interface to the portable batch systempaullu/papers/pbsweb.pdcs2000.pdf · 1.1...



PBSWEB: A WEB-BASED INTERFACE TO THE PORTABLE BATCH SYSTEM

GEORGE MA, PAUL LU

Department of Computing Science, University of Alberta

Edmonton, Alberta, T6G 2E8 Canada

{george|paullu}@cs.ualberta.ca

Abstract

The resource managers (e.g., batch queue schedulers) used at many parallel and distributed computing centers can be complicated systems for the average user. A large number of command-line options, environment variables, and site-specific configuration parameters can be overwhelming. Therefore, we have developed a simple Web-based interface, called PBSWeb, to the Portable Batch System (PBS), which is our local resource manager system.

We describe the design and implementation of PBSWeb. By using a Web browser and server software infrastructure, PBSWeb supports both local and remote users, maintains a simple history database of past job parameters, and hides much of the complexity of the underlying scheduler. The architecture and implementation techniques used in PBSWeb can be applied to other resource managers.

Keywords: job management, batch scheduler, Portable Batch System (PBS), Web-based GUI, cluster and parallel computing

1 Introduction

Parallel and distributed computing offers the promise of high performance through the aggregation of individual computers into clusters and metacomputers (e.g., wide-area networks of computers). Fast nation-wide networks also make it possible for users to remotely access high-performance computing (HPC) resource providers (RP). However, that potential cannot be fulfilled without the proper software infrastructure to access the RPs in a convenient and efficient manner.

One of the most basic and common tasks for a user of an RP is submitting a job to the local resource manager. Although systems such as the Portable Batch System (PBS) [9] and Load Sharing Facility (LSF) [8] automate CPU and resource scheduling, the many command-line options, scriptable parameters, and different tools present a steep learning curve for the typical user.

Although tools and systems to support application development (rightly) receive a lot of research attention, the problem of application execution and management is often overlooked. In a typical workgroup, a small number of developers may write the code (or install downloaded code), but everyone must execute the applications, whether it is for testing or production runs. In other words, running applications is the common case. Also, the same application is often run with many different parameters, and possibly by different members of the same workgroup. Customizing, sharing, and revision control of the job control scripts can quickly become unwieldy.

To simplify the common tasks of submitting jobs to a resource manager, re-running jobs, and monitoring jobs in queues, we have developed a Web-based interface to PBS called PBSWeb. The system provides a job history database so that previous job scripts can be re-used as-is or with small changes. We feel that the combination of functionality in PBSWeb fulfills a need in the user community.

More importantly, the architecture of PBSWeb is designed to evolve and support the emerging model of wide-area metacomputing with remote users, transparent job placement across many different RPs, and end-to-end workflow management. Each of these enhancements involves the solution of outstanding research problems. PBSWeb will be the development and evaluation framework for our research in these areas.


Figure 1. PBSWeb: Architecture

[Diagram labels: User's Web Browser (User 1, User 2); Network; PBSWeb Server holding, for each user, Applications A, B, and C, each with Code and History; Network; Resource Provider 1's Scheduler (e.g., PBS), Resource Provider 2's Scheduler (e.g., PBS), Resource Provider 3's Scheduler.]

1.1 Portable Batch System

The Portable Batch System (PBS) is one of a number of different batch queue schedulers for processor resources. Resource managers and schedulers are software systems that allocate resources to different user requests (i.e., jobs) while attempting to maximize resource utilization and minimize interference between different jobs. Individual users submit jobs to PBS, the scheduler enqueues the jobs of the various users, and jobs are executed as the requested resources become available and according to the job's queue priority.

In part, PBS is a popular system because it is powerful, customizable, and it can be downloaded free of charge. For example, the Multimedia Advanced Computational Infrastructure (MACI) project [7] uses PBS to manage two SGI Origin 2000 multiprocessors (with a total of 112 processors) at the University of Alberta, and a cluster of Alpha-based workstations (with 130 processors) at the University of Calgary.

Although PBS has many technical merits, it can be a difficult system to learn and use. The system consists of a number of different programs (e.g., qsub, qstat, qdel, qhold, qalter), each with many different command-line options and configuration parameters. Furthermore, jobs (or runs) are submitted to the system in the form of a job control script containing per-job parameters. Each run requires its own job control script, which leads to the problem of how the various scripts can be easily modified and shared between different users with revision control.

xpbs and xpbsmon are two graphical user interfaces (GUI) distributed with PBS. Although these GUI tools are easier to use than command-line programs, they do not address the problem of managing job control scripts.

1.2 Overview

We begin with a description of the PBSWeb architecture and how it is designed to support remote access, to provide user authentication, and to maintain job control script histories. A detailed description of the PBSWeb interface and functionality is followed by an overview of how PBSWeb is implemented. Since PBSWeb is part of a larger C3.ca project to improve the sharing of high-performance computing resources [4], we then discuss the longer-term goals of the project and put it in context with related work from other groups.

2 PBSWeb

We have built a prototype front-end interface to PBS. Beta testing began in August 2000, with a wider deployment among MACI users to follow. By layering on top of PBS, we hope to make resource management and resource sharing more convenient in the common case.

PBSWeb is intended to simplify the task of submitting jobs to RPs that are controlled by a scheduler. It is assumed that the application source code itself is already developed and now the user would like to run jobs. If the user is only running existing application code, then he does not need to know anything about interactive shells or how to develop code. PBSWeb makes it easy for remote users to access an RP without having to share filesystems and without having to log onto the system just to submit a job. PBSWeb also maintains a per-user and per-application history (and repository) of source code files, jobs submitted to PBS, and past preferences for various user-selectable PBS parameters (e.g., time limits, email addresses) (Figure 1).

Figure 2. PBSWeb: Main Page

To begin, an authorized user connects to the relevant Web server. The server does not have to be on the same system as the one where the jobs will eventually run. One PBSWeb server can be the front-end to several different RPs, or each RP can run its own instance of the PBSWeb server. After supplying a valid user name and password, the main page is presented (Figure 2). Since the interface is based on Web pages, there is no need to understand site-specific operating system configurations.

To run a new application, the user must first upload the source code through PBSWeb. To submit a job using an existing application, the user selects the previously uploaded source code from a menu and submits the job from a Web page. PBSWeb simplifies the building (i.e., compiling) of the executable code on the target machine and the selection of PBS parameters either based on the user's history (comparable to command histories in Unix shells such as tcsh) or reasonable system defaults.

Four basic functions are currently supported by PBSWeb:

1. Upload source code in a tar file.

2. Compile source code already uploaded to PBSWeb.

3. Submit a job to PBS (with PBSWeb assisting in writing the job script).

4. Check on jobs in the PBS queue(s).

We expect that functions (3) and (4) will be the most frequently used.

Figure 3. PBSWeb: Upload Tar File and Make

To upload a tar file, a standard Web browser file selector is used to specify the file (not shown). By default, once the tar file is sent from the browser to the PBSWeb server, it is untarred and a make command is issued. The output is reported back to the user (Figure 3). At this point, the user can proceed with submitting a job, upload another tar file, or return to the main page.

PBS (and similar systems) allow the user to control the job using a job script model. Therefore, the user may enter the commands for the script in the Execution Commands text field (Figure 4). For each application (i.e., tar file) that has been uploaded, PBSWeb remembers the recently executed commands so that the user does not always have to reproduce and debug complicated sequences of operations. Instead, the user may select the commands using the Command History pop-up menu. If the application has already been executed using PBS, then the previous job scripts can also be selected using a pop-up menu (not shown).

Whether composing a new job script or modifying a previous job script, the user is free to select among any of the valid parameters to PBS. For parameters with only a few valid options, radio buttons are used, as shown for the job queue options of dque, Sequential, and Parallel (Figure 4). Therefore, the user does not have to memorize all of the valid options for, say, the job queue.

More complicated parameters that cannot be selected from a menu, such as the name of the job and the email address to notify upon job completion, are entered using text fields. Whenever possible, the text fields are initially filled in with reasonable default values or values provided by the user in the past.

Parameters with many possible options, such as the Number of processors to use for the job, are selectable using a pop-up menu. The goal is to reduce the need to memorize what parameter values are valid and what values are good choices for defaults at a given RP.
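The choice of initial form values described above can be pictured as a layered lookup in which values from the user's history take precedence and site defaults fill in the rest. The following Python sketch is illustrative only: the parameter names (queue, ncpus, walltime, email) and the default values are assumptions made for the example, not PBSWeb's actual fields, and PBSWeb itself is written in Perl.

```python
from collections import ChainMap

# Hypothetical site-wide defaults for one resource provider.
SITE_DEFAULTS = {"queue": "dque", "ncpus": 1, "walltime": "01:00:00", "email": ""}


def initial_form_values(user_history, site_defaults=SITE_DEFAULTS):
    """Return the values used to pre-fill the job submission form.

    Entries from the user's most recent job take precedence; anything the
    user has never set falls back to the site default.
    """
    return dict(ChainMap(user_history, site_defaults))


if __name__ == "__main__":
    history = {"ncpus": 8, "email": "user@example.org"}
    print(initial_form_values(history))
```

A layered lookup of this kind keeps the site defaults in one place, so an administrator could change them without touching any user's stored history.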

When the job script parameters have been finalized, the user clicks the Submit Job button and PBSWeb calls the proper command (i.e., qsub) with the proper command-line arguments and job script (Figure 5).

Again, excluding situations where compiles fail or jobs terminate under abnormal conditions, it is possible for an authorized user to upload, configure, and submit jobs to PBS without ever logging onto the RP's system. Also, the user can check the status of jobs in the PBS queues using the PBSWeb interface (Figure 6).

Furthermore, with a properly designed makefile (which is, admittedly, non-trivial), it is possible for a user to utilize the same PBSWeb interface and tar file to run jobs on, say, the MACI SGI Origin 2000s in Edmonton or the MACI Alpha Cluster in Calgary. In the future, the PBSWeb interface will be able to monitor the machine loads in Edmonton and Calgary and automatically (with the user's permission) run a message-passing job on either the SGI or cluster, depending on whichever will give the fastest turnaround time.

3 Implementation of PBSWeb

A Web-based approach was taken in designing the PBSWeb interface because Web browsers are pervasive and are a consistent and platform-independent way of interfacing with PBS. HTML documents, an Apache server, and Common Gateway Interface (CGI) scripts written in Perl are the basic elements of the implementation of PBSWeb.

Again, there are four main operations in PBSWeb and each is handled by a different set of CGI scripts:

1. Upload files

2. Compile programs

3. Generate scripts and submit jobs

4. View the queue status of a host

Figure 4. PBSWeb: PBS Job Submission and Script

First, the file upload operation is intended, typically, to upload tar files of program source code. Uploaded files are placed in a subdirectory (pbswebdir) of the user's home directory, which is reserved for PBSWeb's use. The tar file is then extracted to a separate subdirectory of pbswebdir. Therefore, the user still requires an account and home directory at the RP to access secondary storage.
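The extraction step can be sketched as follows. This is a minimal Python illustration, not the Perl CGI code used by PBSWeb; the function name extract_upload and the path check are assumptions added for the example (the paper does not say how untrusted archive paths are handled).

```python
import os
import tarfile


def extract_upload(tar_path, pbswebdir, app_name):
    """Extract an uploaded tar file into pbswebdir/app_name and return that path."""
    target = os.path.realpath(os.path.join(pbswebdir, app_name))
    os.makedirs(target, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            # Refuse entries whose path would land outside the application directory.
            dest = os.path.realpath(os.path.join(target, member.name))
            if os.path.commonpath([target, dest]) != target:
                raise ValueError("unsafe path in archive: " + member.name)
        archive.extractall(target)
    return target
```

Giving each application its own subdirectory, as the text describes, also gives the per-application history a natural place to live.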

Second, program compilation is handled with a simple make command. Thus, successful program compilation is dependent upon the user supplying a suitable makefile. If the user does not choose to compile the source code directly after file upload and extraction, compilation can be performed at a later time by choosing the compile programs operation.
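Running the build and reporting its output back to the user might look like the sketch below. It is written in Python for illustration and assumes that make is on the server's PATH; the function name build and the injectable command argument are hypothetical.

```python
import subprocess


def build(app_dir, command=("make",)):
    """Run the build command in app_dir.

    Returns (succeeded, output), where output is stdout and stderr combined
    so that compiler errors can be shown to the user in one block.
    """
    result = subprocess.run(
        list(command),
        cwd=app_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode == 0, result.stdout
```

Because the outcome depends entirely on the user's makefile, the sketch only relays the exit status and output and makes no attempt to interpret the errors.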

Figure 5. PBSWeb: Job Submitted

Third, the script generation operation is implemented using an HTML-based form. Here the user can specify various job options, such as the number of processors to use and the queue to which the job is to be submitted, simply and directly, without having to remember the complicated syntax of a PBS script. Also, the form gives sensible default values for the most commonly used PBS job options. After the PBSWeb form has been suitably filled out, the data is sent to a CGI script, which generates a valid PBS script. PBSWeb then submits the generated script on the user's behalf to the specified PBS queue, or the default queue if a queue is not specified.
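To make the script generation step concrete, the Python sketch below turns a handful of form values into the text of a PBS job script. It illustrates the idea and is not the CGI script used by PBSWeb; the function name and its parameters are hypothetical. The #PBS directives shown (-N, -q, -l, -M, -m) are standard qsub options, but which resource names (here ncpus and walltime) a site accepts depends on its configuration.

```python
def render_job_script(name, commands, queue=None, ncpus=None, walltime=None, email=None):
    """Return the text of a PBS job script built from form values."""
    lines = ["#!/bin/sh", "#PBS -N " + name]
    if queue:
        # With no -q directive, the job goes to the server's default queue.
        lines.append("#PBS -q " + queue)
    if ncpus:
        lines.append("#PBS -l ncpus=%d" % ncpus)
    if walltime:
        lines.append("#PBS -l walltime=" + walltime)
    if email:
        lines.append("#PBS -M " + email)
        lines.append("#PBS -m e")  # send mail when the job ends
    lines.append("")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_job_script("sim", ["cd $PBS_O_WORKDIR", "./sim input.dat"],
                            queue="Parallel", ncpus=4, walltime="02:00:00"))
```

Leaving out the -q line when no queue is chosen matches the behaviour described above, where the job falls through to the default queue.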

PBSWeb does more than simply provide a template form to generate PBS job scripts; the system also keeps a history of the job option parameters and executed commands. Thus, each PBSWeb user has a custom form which contains a history of previous jobs. The user can recall previously-generated script files and use them as a base for the next job submission. Of course, if a more experienced PBS user wishes, he may upload a custom script file and use it to submit a job or edit the uploaded script and then submit it.

The multi-user functionality of PBSWeb is accomplished by using the Secure Shell (ssh). When a user first creates a PBSWeb account, the user must copy PBSWeb's secure shell public identity key into the user's personal authorized_keys file. This allows PBSWeb to assume the identity of the PBSWeb user. When a new account is being created, PBSWeb does a secure shell login into the user's system account and creates the pbswebdir subdirectory for PBSWeb to work in. All of the user's uploaded files are stored in this directory, along with history files and information about the directory structure.
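The remote operations that PBSWeb performs over ssh can be sketched as argument lists for an ssh client. The Python example below is a hedged illustration: it assumes an OpenSSH-style client (the -i flag and the BatchMode option are OpenSSH features), and the helper names, host name, and script name are hypothetical. It only builds the commands and does not execute them.

```python
import shlex


def ssh_command(user, host, remote_command, identity_file=None):
    """Build the argv list for running remote_command as user on host."""
    # BatchMode makes ssh fail instead of prompting for a password,
    # which suits a CGI script that has no terminal.
    argv = ["ssh", "-o", "BatchMode=yes"]
    if identity_file:
        argv += ["-i", identity_file]
    argv += [user + "@" + host, remote_command]
    return argv


def make_workdir_command(user, host, workdir="pbswebdir"):
    """Command that creates the PBSWeb working directory in the user's home."""
    return ssh_command(user, host, "mkdir -p " + shlex.quote(workdir))


def qsub_command(user, host, app_dir, script_name):
    """Command that submits a stored job script from the application's directory."""
    remote = "cd %s && qsub %s" % (shlex.quote(app_dir), shlex.quote(script_name))
    return ssh_command(user, host, remote)
```

Quoting the remote paths matters because the remote command string is interpreted by the user's login shell on the RP.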

Figure 6. PBSWeb: Queue Status

Each program is given its own subdirectory under the PBSWeb working directory. All of the script files generated by PBSWeb are stored in these program subdirectories. Thus, each one of the user's programs has its own unique history. When PBSWeb generates a custom job script form, it opens a secure shell session to the user's account, goes to the directory of the program that is to be executed, looks for the most recently used script, and loads the parameters of that script into the form as starting values. Also, the user can choose to load any of the previous scripts. When submitting a job script, PBSWeb opens a secure shell session to the user's system account and automatically issues a qsub command with the script file generated by the form script.
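Finding the most recently used script and recovering its parameters could be done along the following lines. This Python sketch works on a local directory for simplicity, whereas PBSWeb does the lookup through a secure shell session. The .pbs file suffix and the function names are assumptions for the example, since the paper does not describe how script files are named.

```python
import glob
import os


def latest_script(app_dir, pattern="*.pbs"):
    """Return the path of the most recently modified script, or None if there is none."""
    scripts = glob.glob(os.path.join(app_dir, pattern))
    return max(scripts, key=os.path.getmtime) if scripts else None


def read_directives(script_path):
    """Collect the option strings from the '#PBS' lines of a job script."""
    directives = []
    with open(script_path) as handle:
        for line in handle:
            if line.startswith("#PBS "):
                directives.append(line[len("#PBS "):].strip())
    return directives
```

The option strings returned by read_directives (for instance "-q dque") are what a form generator would map back onto radio buttons, menus, and text fields.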

To decrease the execution time of the CGI scripts, some temporary files are created in PBSWeb's home directory; fewer secure shell sessions to the user's account have to be opened and many of the intermediate steps are done within PBSWeb's own directory. For example, when a user uploads a file or runs a PBS script generation CGI, the files are first saved in PBSWeb's directory and then transferred to the user's pbswebdir by using secure copy (scp). After the files are copied, PBSWeb deletes the related files in its own directory.
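The stage, copy, and delete sequence can be sketched as below. The Python code is illustrative only: the function name, the temporary directory prefix, and the injectable copy argument are hypothetical, and the default transfer simply shells out to scp. Unlike the description above, the sketch also removes the staged file when the transfer fails, which is a design choice made for the example.

```python
import os
import subprocess
import tempfile


def scp_copy(src, dst):
    """Default transfer: copy src to an scp-style destination such as user@host:dir/."""
    subprocess.run(["scp", "-q", src, dst], check=True)


def stage_and_transfer(data, filename, destination, copy=scp_copy):
    """Write data to a local staging file, copy it to destination, then clean up.

    copy is injectable so that the sketch can be exercised without a remote host.
    """
    staging_dir = tempfile.mkdtemp(prefix="pbsweb-")
    local_path = os.path.join(staging_dir, filename)
    try:
        with open(local_path, "wb") as handle:
            handle.write(data)
        copy(local_path, destination)
    finally:
        # Remove the staged copy whether or not the transfer succeeded.
        if os.path.exists(local_path):
            os.remove(local_path)
        os.rmdir(staging_dir)
```

Staging locally means a single scp invocation per file, instead of a longer interactive session on the user's account.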

Although the secure shell provides a reasonable and safe mechanism to allow the PBSWeb server to compile, submit, and monitor jobs on behalf of a user, we are still investigating other options. In principle, the ability of PBSWeb to run any command as another user is too powerful. Ideally, the system would combine the authentication functionality of the secure shell with the per-executable permission control of, say, sudo.

4 Future Work

A number of desirable features are currently missing from PBSWeb. After beta testing in the MACI environment, it is our goal to make PBSWeb into an open-source development project.

On a metacomputing level, there is a need for complete end-to-end workflow management so that multiple jobs can be co-ordinated to solve the overall problem. For example, a computation may involve pre-processing input data, a simulation, and post-processing of the results as three separate jobs. Each job is independently submitted to a batch scheduler, possibly at different RPs, but the jobs are linked by their inter-dependence.

In theory, the different HPC resources across a country can be harnessed for a single computational task. This concept of metacomputing [1] has attracted a lot of attention. Unfortunately, in practice, the lack of an appropriate infrastructure makes it difficult to transparently perform one phase of a multi-phase computation on, say, a cluster and another phase on a shared-memory computer. Thus, the painful reality is that sharing resources generally means that a user is allowed to log onto the different machines, manually transfer any needed data files between machines, manually start up the jobs, check for job completion and the integrity of the output, and repeat for each platform. The many manual and error-prone (both human and technological) steps required to do this usually mean that a user always stays on one platform and “makes do.” This defeats the purpose of sharing resources and can lead to some overloaded and other under-utilized platforms.

A long-term research goal is the design, implementation, and evaluation of a workflow manager for parallel computations. Consider the scenario where a computation is organized as a pipeline or a sequence of individual jobs. Ideally, if the pre-processing of input data is near-embarrassingly parallel, the cluster would be the best platform for the computation. Then, the communication-intensive phase of the computation can be performed on the parallel computer. From the user's point of view, a job is submitted to a workflow manager, which would then automatically initiate the first phase of the computation, transfer the intermediate output to the second hardware platform, and then initiate the second phase of the computation. The user is only interested in receiving the final output or an error report.

The reliability of the end-to-end workflow (i.e., start of entire computation to end of computation) is an important issue. Data validation between phases increases the reliability of the results by detecting corrupted data before it affects subsequent phases. For example, in our experience, although TCP/IP guarantees correct network transfers, lost connections and full file systems can easily (and silently) truncate files and corrupt the workflow. Simple validation checks at the application level are needed. Abnormal process termination must be detected and subsequent jobs must be suspended until the situation is corrected. Intermediate data files must be logged and archived so that jobs may be re-started from the middle of the pipeline (i.e., automatic checkpoint and restart).
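One simple application-level check of the kind called for above is to record a file's size and a cryptographic digest where it is produced and to compare them after the transfer; a truncated file then fails the check even though the transfer itself reported no error. The Python sketch below illustrates this. The use of SHA-256 and the function names are choices made for the example, not something the paper prescribes.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) for the file at path."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large intermediate files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def is_intact(path, expected_size, expected_digest):
    """True if the file exists and matches the size and digest recorded by the sender."""
    if not os.path.isfile(path):
        return False
    return file_digest(path) == (expected_size, expected_digest)
```

A workflow manager could run such a check before starting the next phase and suspend the remaining jobs when it fails.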

At the highest level of abstraction there should be a graphical Web-based interface that allows a user to communicate with the workflow manager. The user configures the workflow, specifies the input data, the computational constraints, and the phases using the user interface. At the lowest levels, the workflow manager co-ordinates the site-specific local schedulers and file systems at each computing site. The workflow manager receives a computational task in a manner similar to a batch scheduler. But, the manager is aware of the different phases of the computational task and it is aware of the distributed nature of the computing sites. The manager is configured to check for the local availability and integrity of the specified input data, it runs optional user-specified scripts to verify the pre- and post-condition sanity of the data, it submits the appropriate job to the local scheduler, it checks the integrity of the output, and then co-ordinates with the local scheduler at the next computing site to perform the next phase of the computation.
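The control loop of such a workflow manager can be sketched in a few lines. The Python example below is a toy model under stated assumptions: each phase is an ordinary function standing in for a job submitted to a site's local scheduler, the pre- and post-condition checks are optional callables, and the names Phase and run_workflow are hypothetical. The workflow manager itself is proposed future work and no implementation is described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    run: Callable[[object], object]  # stands in for submitting a job to a local scheduler
    precondition: Optional[Callable[[object], bool]] = None
    postcondition: Optional[Callable[[object], bool]] = None


def run_workflow(phases: List[Phase], data):
    """Run the phases in order and stop at the first failed check.

    Returns (names of completed phases, last validated data, error or None),
    so that a failed workflow can later be resumed from the middle.
    """
    done = []
    for phase in phases:
        if phase.precondition and not phase.precondition(data):
            return done, data, "precondition failed before " + phase.name
        result = phase.run(data)
        if phase.postcondition and not phase.postcondition(result):
            return done, data, "postcondition failed after " + phase.name
        data = result
        done.append(phase.name)
    return done, data, None
```

Returning the last validated data together with the list of completed phases is what would make a restart from the middle of the pipeline possible in this model.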

5 Related Work

There are a number of metacomputing research projects and systems, including Legion [6], Globus [5], Albatross [2], and Nimrod/G [3]. Baker and Fox survey some of the projects and issues [1].

Our project targets a higher level of abstraction than most of the current projects. For example, we are not addressing the issue of how to use heterogeneous and distributed resources within a single parallel job. These run-time system issues are low-level problems. Instead, we focus on the issues of transparent data sharing and co-ordination between the jobs and sites. We feel that a high-level approach focuses the scope of the technical challenges and more directly addresses end-to-end issues of reliability and transparency. It is easier to adapt to missing data, network outages, and overloaded computers if a workflow of jobs spans multiple sites than if a monolithic parallel job spans multiple sites. Notably, Nimrod/G [3] shares these high-level design goals and supports a computational economy approach to resource sharing.

We also feel that a high-level approach is better able to exploit existing technologies and infrastructure and produce a usable system. Although some components of a workflow infrastructure already exist in the form of distributed file systems (e.g., AFS) and batch schedulers (e.g., PBS, LSF), they are neither well integrated nor transparent. Setting up a distributed file system across many sites leads to a variety of problems with configuration and administration, thus it may not always be possible. Political and human factors in co-ordinating multiple sites must also be addressed with flexible and customizable policies in the workflow manager. However, if some sites share an AFS file system, that can be exploited by having the file system perform the data transfers under the control of the workflow manager. Similarly, batch schedulers have made great progress in scheduling individual jobs, but they are not designed to handle computational tasks requiring a sequence of jobs on different computing platforms and at different sites. There are research issues with respect to efficient global scheduling of tasks across a number of distributed computing sites, which would be an extension of existing work in optimized batch scheduling.

6 Concluding Remarks

In practice, effective high-performance computing and metacomputing require a proper software infrastructure to support the workflow management of parallel computation. Currently, too many manual and error-prone steps are required to use HPC computing resources, whether within one RP or across multiple RPs.

Something as simple as submitting a job to a batch scheduler requires the user to write a job control script, define several environment variables, and call the right program with the right command-line parameters. Resource schedulers, such as the Portable Batch System, are powerful and necessary systems, but we feel that the learning curve is too steep. We also feel that the system should automatically manage job control scripts so that it is easy to modify previous control scripts for new runs.

Towards these goals, we have developed an interface to PBS called PBSWeb. By exploiting Web-based technologies, PBSWeb makes it particularly easy to support both local and remote users using different platforms. PBSWeb also supports per-user and per-application job control histories and code repositories to make it easier to manage large sequences of production runs. In the future, the basic architecture and model of PBSWeb will be extended to support the automatic selection of computing resources and the co-ordination of computations that span multiple computing centers.

Acknowledgements

Thank you to José Nelson Amaral, Jonathan Schaeffer, and Duane Szafron for their valuable comments on this paper.

Thank you to the Natural Sciences and Engineering Research Council of Canada (NSERC), the Multimedia Advanced Computational Infrastructure (MACI) project, C3 (through a Pioneer Project), the University of Alberta, and the Canada Foundation for Innovation (CFI) for their financial support of this research.

References

[1] M. Baker and G. Fox. Metacomputing: Harnessing informal supercomputers. In Rajkumar Buyya, editor, High Performance Cluster Computing: Architectures and Systems, Volume 1, pages 154–185. Prentice Hall PTR, Upper Saddle River, New Jersey, 1999.

[2] H.E. Bal, A. Plaat, T. Kielmann, J. Maassen, R. van Nieuwpoort, and R. Veldema. Parallel computing on wide-area clusters: the Albatross project. In Extreme Linux Workshop, pages 20–24, Monterey, CA, June 1999.

[3] R. Buyya, D. Abramson, and J. Giddy. Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In Proceedings of the 4th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2000), Beijing, China, 2000.

[4] C3.ca. http://www.c3.ca/.

[5] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128, Summer 1997.

[6] A.S. Grimshaw and W.A. Wulf. The Legion vision of a worldwide virtual computer. Communications of the ACM, 40(1):39–45, January 1997.

[7] MACI. Multimedia advanced computational infrastructure. http://www.maci.ca/.

[8] Platform Computing, Inc. Load sharing facility. http://www.platform.com/.

[9] Veridian Systems. Portable batch system. http://www.openpbs.org/.