
  • Introduction to Grid Computing

  • Computing "Clusters" are today's supercomputers

    A typical cluster includes:
    Cluster management "frontend"
    Tape backup robots
    I/O servers, typically RAID fileservers
    Disk arrays
    Lots of worker nodes
    A few head nodes, gatekeepers and other service nodes

  • Cluster Architecture

    Head node(s) provide:

    Login access (ssh)

    Cluster scheduler (PBS, Condor, SGE)

    Web service (http)

    Remote file access (scp, FTP etc.)

    Compute nodes (Node 0 ... Node N): 10 to 10,000 PCs with local disks
    Shared cluster filesystem storage (applications and data)
    The cluster user sends job execution requests, and receives status, over Internet protocols.

  • Scaling up Science: Citation Network Analysis in Sociology (work of James Evans, University of Chicago, Department of Sociology)

  • Scaling up the analysis

    Query and analysis of 25+ million citations
    Work started on desktop workstations
    Queries grew to month-long durations
    With data distributed across the U of Chicago TeraPort cluster, 50 (faster) CPUs gave a 100X speedup
    Many more methods and hypotheses can be tested!
    Higher throughput and capacity enable deeper analysis and broader community access.

  • Grids consist of distributed clusters

    Grid client: application & user interface, grid client middleware, and resource, workflow & data catalogs
    Grid Site 1 (Fermilab): grid service middleware, grid storage
    Grid Site 2 (Sao Paulo): grid service middleware, grid storage
    Grid Site N (U Wisconsin): grid service middleware, grid storage
    Client and sites communicate via grid protocols.

  • HPC vs. HTC

    HTC (high-throughput computing): using many computing resources over long periods of time to accomplish a computational task.
    HPC (high-performance computing) tasks are characterized as needing large amounts of computing power for short periods of time, whereas HTC tasks also require large amounts of computing, but for much longer times (months and years, rather than hours and days).

    Fine-grained calculations are better suited to HPC: big, monolithic supercomputers, or very tightly coupled clusters with lots of identical processors and an extremely fast, reliable network between them.
    Embarrassingly parallel (coarse-grained) calculations are ideal for HTC: more loosely coupled networks of computers, where delays in getting results from one processor will not affect the work of the others.
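    The HTC style of embarrassingly parallel work can be sketched in a few lines of shell. Each "task" below is a stand-in for real work (the file names are illustrative); because the tasks are independent, a slow task never blocks the others:

    ```shell
    #!/bin/sh
    # Embarrassingly parallel sketch: four independent tasks run concurrently.
    # Each task writes its own result file, so no coordination is needed
    # until the final gather step.
    for i in 1 2 3 4; do
        (
            # Stand-in for real work (e.g. analyzing one chunk of a dataset)
            echo "task $i done" > "result_$i.txt"
        ) &                  # "&" launches the task in the background
    done
    wait                     # barrier: block until every task has finished
    cat result_1.txt result_2.txt result_3.txt result_4.txt
    ```

    An HTC grid applies the same pattern at much larger scale: the independent tasks land on whatever CPUs happen to be free, possibly at different sites.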

  • Grids represent a different approach: building an HTC resource by joining clusters together in a grid

    Example: the Open Science Grid (OSG)

    Current compute resources:
    70 Open Science Grid sites, but all very different
    Connected via Internet2, NLR, ... at 622 Mbps to 10 Gbps
    Compute Elements & Storage Elements
    Most are Linux clusters
    Most are shared: campus grids, local non-grid users
    More than 20,000 CPUs
    A lot of opportunistic usage
    Total computing capacity difficult to estimate; same with storage

  • Initial grid driver: high energy physics (image courtesy Harvey Newman, Caltech)

  • Mining seismic data for hazard analysis (Southern California Earthquake Center)

    Inputs to the seismic hazard model include: seismicity, paleoseismology, local site effects, geologic structure, faults, stress transfer, crustal motion, crustal deformation, seismic velocity structure, and rupture dynamics.

  • Grids can process vast datasets

    Many HEP and astronomy experiments consist of:
    Large datasets as inputs (find datasets)
    Transformations which work on the input datasets (process)
    Output datasets (store and publish)

    The emphasis is on the sharing of these large datasets.
    Workflows of independent programs can be parallelized.
    Example: the Montage workflow, ~1200 jobs in 7 levels (NVO, NASA, ISI/Pegasus; Deelman et al.); a mosaic of M42 created on TeraGrid.
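    A level-structured workflow like Montage can be mimicked in shell: within a level the jobs are independent and run in parallel, and `wait` is the barrier before the next level starts (the stage names and files here are hypothetical, not Montage's actual tools):

    ```shell
    #!/bin/sh
    # Level 1: "project" each input image independently, in parallel.
    for f in a b c; do
        echo "projected $f" > "proj_$f.txt" &
    done
    wait    # barrier: level 2 depends on every level-1 output

    # Level 2: combine the independent outputs into one mosaic.
    cat proj_a.txt proj_b.txt proj_c.txt > mosaic.txt
    ```

    Workflow systems such as Pegasus generalize this idea to thousands of jobs with an explicit dependency graph.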

  • Grids provide global resources to enable e-Science

  • Virtual Organizations

    Groups of organizations that use the grid to share resources for specific purposes
    Support a single community
    Deploy compatible technology and agree on working policies
    Security policies are difficult
    Deploy different network-accessible services:
    Grid information
    Grid resource brokering
    Grid monitoring
    Grid accounting

  • Ian Foster's Grid Checklist

    A grid is a system that:

    Coordinates resources that are not subject to centralized control
    Uses standard, open, general-purpose protocols and interfaces
    Delivers non-trivial qualities of service

  • The Grid Middleware Stack (and course modules)

    Grid application (often includes a portal)
    Workflow system (explicit or ad hoc)
    Job management | Data management | Grid information services
    Grid Security Infrastructure
    Core Globus services
    Standard network protocols and web services

  • Job and resource management

  • How do we access the grid?

    Command line, with tools that you'll use
    Specialised applications
    Example: write a program to process images that sends data to run on the grid as an inbuilt feature
    Web portals, e.g. I2U2, SIDGrid

  • Grid middleware glues the grid together. A short, intuitive definition:

    the software that glues together different clusters into a grid

    taking into consideration the socio-political side of things (such as common policies on who can use what, how much, and what for)

  • Grid middleware components

    Job management
    Storage management
    Information services
    Security

  • Globus and Condor play key roles

    The Globus Toolkit provides the base middleware:
    Client tools which you can use from a command line
    APIs (scripting languages, C, C++, Java, ...) to build your own tools, or use directly from applications
    Web service interfaces
    Higher-level tools built from these basic components, e.g. Reliable File Transfer (RFT)

    Condor provides both client & server scheduling.
    In grids, Condor provides an agent to queue, schedule and manage work submission.

  • Globus Toolkit

    Developed at UChicago/ANL and USC/ISI (the Globus Alliance): globus.org
    Open source
    Adopted by different scientific communities and industries
    Conceived as an open set of architectures, services and software libraries that support grids and grid applications
    Provides services in major areas of distributed systems:
    Core services
    Data management
    Security

  • Globus core services

    A toolkit that can serve as a foundation to build a grid:
    Authorization
    Message-level security
    System-level services (e.g., monitoring)

    The associated data management services provide file services:
    GridFTP
    RFT (Reliable File Transfer)
    RLS (Replica Location Service)

  • Local Resource Managers (LRMs)

    Compute resources have a local resource manager (LRM) that controls:
    Who is allowed to run jobs
    How jobs run on a specific resource
    The order and location of jobs

    Example policy: each cluster node can run one job; if there are more jobs, they must wait in a queue.

    Examples: PBS, LSF, Condor

  • Local Resource Manager: a batch scheduler for running jobs on a computing cluster

    Popular LRMs include:
    PBS (Portable Batch System)
    LSF (Load Sharing Facility)
    SGE (Sun Grid Engine)
    Condor: originally for cycle scavenging, Condor has evolved into a comprehensive system for managing computing

    LRMs execute on the cluster's head node
    The simplest LRM allows you to fork jobs quickly
    Runs on the head node (gatekeeper) for fast utility functions
    No queuing (but this is emerging, to throttle heavy loads)
    In GRAM, each LRM is handled with a job manager
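    A minimal batch job for one of these LRMs might look like the following PBS script (a sketch: the directive names are standard PBS, but the job name and resource values are illustrative). The `#PBS` lines are comments to the shell but directives to the scheduler:

    ```shell
    #!/bin/sh
    #PBS -N hello-cluster
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=00:05:00
    # PBS sets PBS_O_WORKDIR to the directory qsub was run from;
    # default to "." so the script also runs outside the scheduler.
    cd "${PBS_O_WORKDIR:-.}"
    /bin/hostname    # the actual work: report which node ran the job
    ```

    The user would hand this to the LRM with `qsub job.pbs` and watch it with `qstat`; the scheduler, not the user, decides when and on which node it runs.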

  • GRAM: Globus Resource Allocation Manager

    GRAM provides a standardised interface to submit jobs to LRMs:
    Clients submit a job request to GRAM
    GRAM translates it into something a(ny) LRM can understand
    The same job request can be used for many different kinds of LRM

  • Job management on a grid

    GRAM fronts each site's LRM: Site A runs Condor, Site B runs PBS, Site C runs LSF, and Site D uses fork; together they form "the grid".

  • GRAM's abilities

    Given a job specification, GRAM:

    Creates an environment for the job
    Stages files to and from the environment
    Submits the job to a local resource manager
    Monitors the job
    Sends notifications of job state changes
    Streams the job's stdout/stderr during execution

  • GRAM components

    Submitting machine (e.g. the user's workstation), running globus-job-run
    Gatekeeper, which starts a jobmanager for each job
    LRM (e.g. Condor, PBS, LSF)
    Worker nodes / CPUs

  • Remote resource access with Globus

    In Organization A, the user runs "globusrun myjob"; the request travels over the Globus GRAM protocol to Organization B, where the Globus jobmanager fork()s the job.

  • Submitting a job with GRAM

    The globus-job-run command:

    $ globus-job-run workshop1.ci.uchicago.edu /bin/hostname

    This runs '/bin/hostname' on the resource workshop1.ci.uchicago.edu.

    We don't care what LRM is used on workshop1: this command works with any LRM.

  • The client can describe the job with GRAM's Resource Specification Language (RSL)

    Example:

    &(executable = a.out)
     (directory = /home/nobody)
     (arguments = arg1 "arg 2")

    Submit with:

    globusrun -f spec.rsl -r workshop2.ci.uchicago.edu
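    A slightly fuller description might look like this (a sketch: `count`, `stdout` and `stderr` are standard RSL attributes beyond those shown in the slide, and the values are illustrative):

    ```text
    & (executable = "a.out")
      (directory  = "/home/nobody")
      (arguments  = "arg1" "arg 2")
      (count      = 1)            /* number of processes to start */
      (stdout     = "a.out.log")  /* where to write the job's output */
      (stderr     = "a.out.err")
    ```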

  • Use other programs to generate RSL

    RSL job descriptions can become very complicated
    We can use other programs to generate RSL for us
    Example: Condor-G (next section)
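    For example, a Condor-G submit description like the following (a sketch: the gatekeeper contact string is illustrative) lets Condor-G generate the RSL and talk to the remote GRAM gatekeeper on our behalf:

    ```text
    # Hypothetical Condor-G submit file: "jobmanager-pbs" means GRAM will
    # hand the job to the remote site's PBS LRM for us.
    universe      = grid
    grid_resource = gt2 workshop1.ci.uchicago.edu/jobmanager-pbs
    executable    = a.out
    output        = a.out.log
    error         = a.out.err
    queue
    ```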

  • Condor

    Condor is a specialized workload management system for compute-intensive jobs (aka a batch system)
    It is a software system that creates an HTC environment
    Created at UW-Madison

    Condor:

    Detects machine availability
    Harnesses available resources
    Uses remote system calls to send R/W operations over the network
    Provides powerful resource management by matching resource owners with consumers (acting as a broker)

  • Condor manages your cluster

    Given a set of computers:
    Dedicated
    Opportunistic (desktop computers)

    And given a set of jobs (can be a very large set),
    Condor will run the jobs on the computers:
    With fault tolerance (restarting failed jobs)
    With priorities
    Optionally in a specific order
    With more features than we can mention here

  • Condor features

    Checkpointing & migration
    Remote system calls
    Able to transfer data files and executables across machines
    Job ordering
    Job requirements and preferences can be specified via ClassAd expressions
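    As a sketch of how requirements and preferences look in a submit description (the attribute names are standard Condor; the executable name and machine-property values are illustrative):

    ```text
    universe     = vanilla
    executable   = analyze
    # "requirements" must be satisfied for a match; "rank" expresses a
    # preference among the machines that do match.
    requirements = (OpSys == "LINUX") && (Memory >= 2048)
    rank         = KFlops
    transfer_input_files    = input.dat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue
    ```

    Condor's matchmaker compares these job ClassAds against the machine ClassAds advertised by each node, pairing resource owners with consumers.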