

Heterogeneous Grid Design and Implementation

Thesis Presentation
By Jeffrey Wells

State University of New York Institute of Technology
May 7, 2008
CSC 599

Outline

• Purpose
• Overview
• Intro to Globus Toolkit and Condor
• Interoperability Experiments
• Results
• Conclusion

Purpose

This thesis investigates the extent to which two open source approaches to Grid computing achieve interoperability. The Globus Alliance's Globus Toolkit and the University of Wisconsin-Madison's Condor scheduler were used in this thesis as an example of interoperability.

Overview

• What is a Grid?
• Condor Scheduler
• Globus Toolkit
• BITS Regional Grid
• SUNYIT Local Grid Network
• Grid Security

What is a Grid?

According to the definition given by Ian Foster of the University of Chicago, a Grid is a system that:
• coordinates resources that are not subject to centralized control
• uses standardized, open, general-purpose protocols and interfaces
• delivers non-trivial qualities of service

Examples of Grids:
• TeraGrid: 20 teraflops of computing power and 1 petabyte of storage
• Access Grid: used for scheduling and conducting meetings
• eDiaMoND: used for medical research in England

Condor Scheduler

• Condor High Throughput Computing (HTC): ties idle resources together to harness their unused capacity in a distributed fashion.
• Condor was developed by the University of Wisconsin-Madison.
• Other distributed schedulers:
  PBS (Portable Batch System)
  LSF (Load Sharing Facility)
  CSF (Community Scheduler Framework)
  SETI (Search for Extraterrestrial Intelligence)

Globus Toolkit

The Globus Toolkit is an open source software toolkit used for building Grid systems and applications. It is constantly being developed by the Globus Alliance at the University of Chicago and many others all over the world.

Another type of Grid toolkit: Virtual Data Toolkit (VDT)

BITS Regional Grid

[Figure: BITS Regional Grid. Labels: bitsgw, qw.cs.sunyit.edu, Corning Community College, SUNY Geneseo, SUNYIT]

SUNY IT Local Grid Network

Host             Software
192.168.14.20    Globus 405
192.168.14.30    Globus 405
192.168.14.40    Condor 605
192.168.14.50    Globus 405
192.168.14.60    Globus 405, Condor 605
192.168.14.70    Condor 605

Also shown in the diagram: bitsgw

Grid Security

Grid Security Infrastructure (GSI) uses public key cryptography as the backbone for its functionality.

The reasons behind GSI are:
• to provide secure communication between the resources of a Grid;
• to avoid reliance on a centrally managed security system;
• to allow "single sign-on" for users of the Grid, including delegation of credentials for jobs that require more than one resource and/or site.

[Figure: SUNY Geneseo. Labels: Debian Linux Cluster, Condor Execute/Submit, Globus Services]

Services used, tested and evaluated:
• GridFTP, RFT (Reliable File Transfer)
• Delegation, authentication, authorization
• Credential management
• Grid Security Infrastructure (GSI)
• Various Condor submits

Condor Central Manager (Scheduler)

[Figure: Central Manager with three Submit/Execute machines and two Globus machines]

• The Condor Central Manager (Scheduler) submits jobs either to a Condor Submit/Execute machine or to a Globus machine.
• Each machine "advertises" its resources to the Central Manager via a ClassAd.
• The Central Manager matches resources with the requirements of the submitted job.
• The Central Manager sends the executable to the remote resource that matches the requirements.
• Once the job is completed, the Execute machine reports back to the Central Manager.
• The Central Manager reports the final results.
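The advertise-and-match cycle described above can be illustrated with a toy model. This is a minimal sketch in Python, not Condor's ClassAd language or its negotiator: the `MachineAd`, `JobAd`, and `CentralManager` classes, the attribute names (`arch`, `memory_mb`), and the first-fit matching policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MachineAd:
    """Hypothetical stand-in for the ClassAd a machine advertises."""
    name: str
    arch: str
    memory_mb: int
    busy: bool = False


@dataclass
class JobAd:
    """Hypothetical stand-in for a submitted job's requirements."""
    executable: str
    arch: str
    min_memory_mb: int


class CentralManager:
    """Toy central manager: collects ads and matches jobs first-fit."""

    def __init__(self) -> None:
        self.machines: List[MachineAd] = []

    def advertise(self, ad: MachineAd) -> None:
        # Each machine advertises its resources to the central manager.
        self.machines.append(ad)

    def match(self, job: JobAd) -> Optional[MachineAd]:
        # Pick the first idle machine that satisfies the job's requirements.
        for machine in self.machines:
            if (not machine.busy
                    and machine.arch == job.arch
                    and machine.memory_mb >= job.min_memory_mb):
                machine.busy = True  # The job is dispatched to this machine.
                return machine
        return None  # No match; a real scheduler would keep the job queued.

    def report_done(self, machine: MachineAd) -> None:
        # The execute machine reports back and becomes idle again.
        machine.busy = False
```

With two advertised machines of 512 MB and 2048 MB, a job that asks for 1024 MB would be matched to the larger one, and a second identical job would stay unmatched until `report_done` is called.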

[Figure arrow labels: "ClassAd/Results" and "Job Request"]

Various Jobs Implemented

Condor Jobs
• Vanilla
• Standard
• Java
• Parallel
• Grid (Globus)

Globus Jobs
• Forwarded a job to a Condor machine with a scheduler.
• Forwarded a job from a Condor scheduler to a Globus machine (Globus job).
• Forwarded jobs to other Globus machines.

Interoperability Experiments

• Globus, Condor and Condor-G
• Condor-G Interface
• Job Examples
• Condor to Globus Job Submit
• Globus to Condor Job Submit
• Test Scripts
• Swift Workflow
• Some More Test Scripts

Globus, Condor and Condor-G

[Figure labels: Linux Cluster, Condor Workstation Pool, Globus Services, Condor Scheduler]

Condor-G manages jobs through the resource manager of the Globus Toolkit.

Results of the job passed to the Globus Toolkit are returned via the Condor-G interface.

• condor_startd advertises the resource and executes the job.
• condor_starter spawns the remote job.
• condor_shadow runs on the submit machine and acts as the resource manager for the job.

• condor_master is responsible for keeping all the rest of the Condor daemons running.
• condor_schedd manages the job queue and submits jobs to remote resources.
• condor_negotiator is responsible for the matchmaking.

Condor-G Interface

[Figure: Condor-G interface. Labels: Linux Cluster, Globus Services, Condor Workstation Pool]

Condor-G uses the Globus resource manager to start a job on the remote machine. It also manages the job running on the remote resource.

Condor-G waits for the job to be completed and then returns the results.

Job Examples

Condor Job and Globus Script

    # ========================
    #  Condor to Globus
    #  test.submit
    # ========================
    universe = grid
    executable = myscript.sh
    arguments = TestJob 10
    JobManager_type = Condor
    grid_type = gt4
    globusscheduler = https://stengal.cs.sunyit.edu:8443/wsrf/services/ManagedJobFactoryService/
    log = test.log
    output = test.output
    error = test.error
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    Queue

myscript.sh:

    #! /bin/sh
    echo "I'm process id $$ on" `hostname`
    echo "This is sent to standard error" 1>&2
    date
    echo "Running as binary $0" "$@"
    echo "My name (argument 1) is $1"
    echo "My sleep duration (argument 2) is $2"
    sleep $2
    echo "Sleep of $2 seconds finished. Exiting"
    echo "RESULT: 0 SUCCESS"
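For repeated tests it can help to generate submit descriptions like test.submit from a script. The following is a minimal sketch in Python: the helper name `render_submit` and the `TEST_SUBMIT` dictionary are hypothetical, the keys and values are simply copied from the example above, and nothing here checks the settings against any particular Condor version.

```python
from typing import Dict


def render_submit(settings: Dict[str, str]) -> str:
    """Render key/value settings as a Condor submit description.

    Hypothetical helper: it only formats text and does not validate
    the settings.
    """
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("queue")  # The queue statement follows the settings it uses.
    return "\n".join(lines) + "\n"


# Settings copied from the test.submit example.
TEST_SUBMIT = {
    "universe": "grid",
    "executable": "myscript.sh",
    "arguments": "TestJob 10",
    "JobManager_type": "Condor",
    "grid_type": "gt4",
    "globusscheduler": "https://stengal.cs.sunyit.edu:8443/wsrf/services/ManagedJobFactoryService/",
    "log": "test.log",
    "output": "test.output",
    "error": "test.error",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}

if __name__ == "__main__":
    print(render_submit(TEST_SUBMIT), end="")
```

Keeping the settings in a dictionary makes it easy to vary one value, such as the arguments, between test runs while leaving the rest of the description unchanged.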

Condor Job and MPI Program

    ###########################
    # Submit description file
    # for /bin/hostname
    # (Parallel)
    ###########################
    universe = parallel
    executable = /bin/hostname
    machine_count = 2
    log = parallellogfile
    output = outfileMPI.$(NODE)
    error = errfileMPI.$(NODE)
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue

MPI Program

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char* argv[] )
    {
        int rank, size;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        printf( "I am %d of %d\n", rank, size );
        MPI_Finalize();
        return 0;
    }

Condor to Globus Job Submit

[Figure labels: Condor-G, Condor (Scheduler), GASS Server, Gatekeeper, Job Manager, Globus Toolkit, Job]

1.) Central Manager submits grid job
2.) Job passes through Condor-G to Globus gatekeeper
3.) Verify security via gatekeeper
4.) Forward job to job manager
5.) Process and return result to Central Manager
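The five steps above can be traced with a small simulation. This is a toy model under stated assumptions, not the Globus or Condor-G implementation: the function names, the `AUTHORIZED_SUBJECTS` set, and the example subject strings are invented for illustration, and the security check is reduced to a set lookup in place of real GSI verification.

```python
from typing import Callable, List

# Hypothetical set of credential subjects the gatekeeper accepts.
AUTHORIZED_SUBJECTS = {"/O=Example/CN=Test User"}


def gatekeeper(subject: str) -> bool:
    """Step 3: stand-in for the gatekeeper's security verification."""
    return subject in AUTHORIZED_SUBJECTS


def job_manager(job: Callable[[], str]) -> str:
    """Steps 4 and 5: run the job and hand back its result."""
    return job()


def submit_via_condor_g(subject: str, job: Callable[[], str],
                        trace: List[str]) -> str:
    """Steps 1 and 2: the scheduler passes a grid job through Condor-G."""
    trace.append("submitted")
    trace.append("passed to gatekeeper")
    if not gatekeeper(subject):
        trace.append("rejected")
        raise PermissionError(f"subject not authorized: {subject}")
    trace.append("verified")
    trace.append("forwarded to job manager")
    result = job_manager(job)
    trace.append("result returned")
    return result
```

The `trace` list records each stage in order, which makes it easy to see where an unauthorized submission stops.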

Globus to Condor Job Submission

[Figure: Globus to Condor job submission. Component labels: GRAM Client, GASS Server, GRAM Gatekeeper, GRAM Job Manager, Batch System (Condor), GASS Client, Local Machine, Remote Machine, Grid-Proxy. Arrow labels: GRAM Job Request, Creation, Job Request, Data, Callback]

Sample Test Scripts

Perl scripts were created to test most of the functionality of the BITS Regional Grid.

Job submit from Globus to Condor

    print " \n------> Submitting a Job to Condor on Stengel <---------\n";
    system "globusrun-ws -submit -Ft Condor -S -c /bin/date";

Job submit from Condor to Globus

    print "-----> Submitting a Condor Globus Job <--------\n";
    system "condor_submit /home/wells/testjobs/condorjobs/globussubmits/submitGFork";
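The same two checks could also be driven from Python. The sketch below only builds the two command lines shown in the Perl snippets and, by default, prints them without executing anything; the function names and the `dry_run` switch are hypothetical, and actually running the commands assumes `globusrun-ws` and `condor_submit` are installed and on the PATH.

```python
import subprocess
from typing import List

# Submit file path taken from the Perl test script.
SUBMIT_FILE = "/home/wells/testjobs/condorjobs/globussubmits/submitGFork"


def globus_to_condor_cmd() -> List[str]:
    """Command line from the Perl script: Globus job sent to Condor."""
    return ["globusrun-ws", "-submit", "-Ft", "Condor", "-S", "-c", "/bin/date"]


def condor_to_globus_cmd(submit_file: str = SUBMIT_FILE) -> List[str]:
    """Command line from the Perl script: Condor job sent to Globus."""
    return ["condor_submit", submit_file]


def run_test(label: str, cmd: List[str], dry_run: bool = True) -> int:
    """Print a banner, then run the command unless dry_run is set."""
    print(f"------> {label} <---------")
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    run_test("Submitting a Job to Condor on Stengel", globus_to_condor_cmd())
    run_test("Submitting a Condor Globus Job", condor_to_globus_cmd())
```

Passing the command as a list avoids shell quoting issues, and the exit code returned by `run_test` can be collected to summarize which submissions succeeded.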

Swift Workflow

Swift is a data-oriented coarse-grained scripting language that supports dataset typing and mapping, dataset iteration, conditional branching, and sub-workflow composition

The Swift programs, also known as workflows, are written in a language called SwiftScript

Swift handles the execution of these programs on remote sites

Sample Test Scripts cont.

Swift job submit to SUNY_IT3 (Geneseo)

    print "\n<-------- Swift Job Sent to SUNY_IT3 ------------>\n";
    system "swift sites.file /home/wells/testjobs/swiftjobs/sites3.xml /home/wells/testjobs/swiftjobs/first.swift";

Results

• Condor.pm is malformed for job submissions from Globus to Condor. The settings should_transfer_files = YES and when_to_transfer_output = ON_EXIT must be added to the script.
• -S is used in the Globus Toolkit 4.0.5, versus -s in 4.0.4.
• Mpiexe.py and mpdlib.py were modified so that WS-GRAM was able to send a distributed job to MPICH2. Thanks to Dr. Ralph Butler of Middle Tennessee State University.
• Another application layer can easily be added to the Globus Toolkit.
• Applications are changing and maturing faster than the documentation.
• Mail groups and lists are not always helpful, nor do they always respond to questions.
• Documentation on the MPI-2 and Globus Toolkit connection is scarce and outdated.
• Documentation on the Condor and Globus interface is outdated. Resolved by installing Condor and then Globus with the Condor scheduler.
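The file-transfer result above amounts to making sure two settings appear in the submit description handed to Condor. As a rough illustration in Python (the actual fix was made in the Perl module Condor.pm, whose code is not shown in these slides), a hypothetical helper `ensure_transfer_settings` might patch a submit description as follows. It assumes the two settings belong before the first queue statement.

```python
# The two settings named in the results.
REQUIRED = {
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
}


def ensure_transfer_settings(submit_text: str) -> str:
    """Insert missing file-transfer settings before the first queue line."""
    lines = submit_text.splitlines()
    present = set()
    queue_index = len(lines)  # Default: append at the end if no queue line.
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # Skip blank lines and comments.
        if "=" in stripped:
            present.add(stripped.split("=", 1)[0].strip().lower())
        elif stripped.split()[0].lower() == "queue" and queue_index == len(lines):
            queue_index = index  # Remember only the first queue statement.
    missing = [f"{key} = {value}" for key, value in REQUIRED.items()
               if key not in present]
    patched = lines[:queue_index] + missing + lines[queue_index:]
    return "\n".join(patched) + "\n"
```

The helper leaves a description unchanged when both settings are already present, so it can be applied repeatedly without duplicating lines.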

Conclusion

1. It is necessary to modify the Condor.pm script in order to allow the Globus Toolkit to submit jobs to the Condor Scheduler.

2. It is necessary to correct Mpiexe.py and mpdlib.py in order for the Globus Toolkit to submit a distributed job to MPICH2.

3. Investigation found that -S is now used to submit a job to Condor under 4.0.5, versus -s under 4.0.4.

4. Another application layer can be easily added to the Globus Toolkit without affecting the interoperability with the Condor Scheduler.

5. Documentation is scarce on the MPI-2 and Globus Toolkit connection and is also outdated.

6. Applications are changing and maturing faster than the documentation.

References

Globus Toolkit Version 4 Grid Security Infrastructure: A Standards Perspective. The Globus Security Team, Version 4 updated September 12, 2005. Retrieved on September 26, 2007 from http://www.globus.org/toolkit/docs/development/4.1.2/security/GT4-GSI-Overview.pdf/

Tanenbaum, A. (2003). Computer Networks, Fourth Edition. New Jersey: Prentice Hall PTR.

Condor Users Manual Version 6.8 (2007) Retrieved September 24, 2007 from http://www.cs.wisc.edu/condor/manual/v6.8/

Globus Toolkit Administration Manual (2007) Retrieved September 24, 2007 from http://www.globus.org/toolkit/AdministrationManual.pdf

Swift Users Guide (Change Revision 1700). Retrieved on February 16, 2008 from http://www.ci.uchicago.edu/swift/guides/userguide.pdf

Swift – Home (2007), retrieved on February 16, 2008 from http://www.ci.uchicago.edu/swift/

Yong Zhao, Mihael Hategan, Ben Clifford, Ian Foster, Gregor von Laszewski, Ioan Raicu, Tiberiu Stef-Praun, Mike Wilde. Swift: Fast, Reliable, Loosely Coupled Parallel Computation (2007). Retrieved on March 2, 2008 from http://www.ci.uchicago.edu/swift/papers/Swift-SWF07.pdf

References (cont.)

Mausolf, J. (2005) Grid In Action: Implementation SOA and Web Services In Grid. (2005, August 09). Retrieved September 24, 2007, from http://www.ibm.com/developmentworks/Grid/library/gr-gt4graph/

Foster, I. (2002) What is a Grid? A Three Point Checklist. Argonne National Laboratory & University of Chicago. Retrieved September 2, 2007 from http://www.globus.org

Overview of the Grid Security Infrastructure, Globus Alliance Globus Toolkit. Retrieved May 6, 2008 from http://www.globus.org/security/overview.html

Noel, C (2007). What is a Grid? CETIC’s Tentative Definition. Retrieved on September 6, 2007 from http://www.cetic.be/article432.html