Grid Computing
Philippe Charpentier, PH Department, CERN, Geneva
SUSSP65, August 2009


Outline

Disclaimer: these lectures are not meant to teach you how to compute on the Grid! I hope they will give you a flavour of what Grid Computing is about; we shall see how far we can go today.

Today:
- Why Grid Computing, and what is it?
- Data Management system

Tomorrow:
- Workload Management system
- Using the Grid (production, user analysis)
- What is being achieved? Outlook

The challenge of LHC Computing

- Large amounts of data for each experiment
  - Typically 200-300 MB/s of real data (1 MB events)
  - 5x10^6 s of data taking per year gives about 1.5 PB/year
- Large processing needs
  - Reconstruction: 10-20 s/event, requiring 2,000-4,000 processors permanently
  - For reprocessing (in 2 months): 4 times more
  - Simulation: up to 15 min/event! Requires a few 10,000 processors
  - Analysis: fast, but many users and many events! Similar to reconstruction in resource needs
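A rough cross-check of these figures, taking 300 MB/s for the data volume and an event rate of 200 events/s (200 MB/s at 1 MB/event) for the reconstruction load; this is only back-of-the-envelope arithmetic based on the numbers above:

\[ 300~\mathrm{MB/s} \times 5\times10^{6}~\mathrm{s/year} = 1.5\times10^{9}~\mathrm{MB} = 1.5~\mathrm{PB/year} \]
\[ \frac{200~\mathrm{MB/s}}{1~\mathrm{MB/event}} \times (10\text{--}20)~\mathrm{s/event} \approx 2000\text{--}4000~\text{processors kept busy} \]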

An ideal world

- A huge Computing Centre
  - Several 100,000 processors
  - Tens of PB of fast storage (disks)
  - High-performance network connections, both from the experimental site and from the collaborating institutes

... and the associated risks

- If it fails, all activities are stopped

An alternative: Distributed Computing

- Several large computing centres, distributed over the world
- Multiple replicas of datasets
  - Avoids a single point of failure
- High-performance network connections for data replication
- Transparent job submission
  - Hides the complexity of the system from the users

The Computing Grid

The MONARC model

Hierarchical model of sites (dating from 1999):
- Tier0: where the real data are produced (CERN)
- Tier1s: large Computing Centres, at the level of a country or a region
- Tier2s: smaller Computing Centres, in large institutes or cities
- Tier3s: computer clusters, at the level of a physics department
- Tier4s: the physicist's desktop/laptop

- The data flows along the hierarchy
- Hierarchy of processing steps: reconstruction, global analysis, group analysis, final analysis

The LHC Computing Grid

- Uses the same terminology as MONARC, but the Tiers reflect the level of service
- It is an implementation of the Grid
  - Tier0: CERN
  - Tier1s: large capacity of mass storage (tapes), (multi-)national level, high bandwidth to the Tier0
  - Tier2s: no mass storage, lower quality of service, regional level
  - Tier3s and Tier4s: not considered as part of the Grid (regular computers or clusters)
- Not necessarily a hierarchical dependency
  - Tier1s usually support Tier2s, but there is not necessarily a strict link for data flow


The Computing Models

Definition of functionality:
- Real data collection and recording
- Real data replication for security
- Real data initial reconstruction (quasi real-time)
- Real data re-reconstruction (after months or years)
- Monte-Carlo simulation: no input data required, highly CPU demanding
- Analysis
  - Large-dataset analysis
  - Final (a.k.a. end-user) analysis

The mapping of these functions to Tiers may differ: it is the experiment's choice.

Computing Models and the Grid

(diagrams: Tier0, Tier1s and Tier2s seen through the Grid, followed by a table showing how ALICE, ATLAS, CMS and LHCb map central data recording, data mass storage, first event reconstruction, calibration, event re-reconstruction, AOD storage, simulation and analysis onto Tier0, Tier1 and Tier2)

The Grid abstraction

- A large, worldwide Computing Centre
- Top requirements:
  - Storage and data transfer (Data Management)
  - Job submission (Workload Management)
  - Global infrastructure: Computing Centres and user support, including problem reporting and solving
  - High-bandwidth networks
- The Grid functionality is decomposed in terms of services

Possible breakdown of Grid services

(diagram: the ARDA model)

Grid security (in a nutshell!)

- It is important to be able to identify and authorise users, and possibly to enable/disable certain actions
- Uses X509 certificates
  - The "Grid passport", delivered by a certification authority
- For using the Grid, short-lived proxies are created
  - Same information as the certificate, but only valid for the duration of the action
  - Possibility to add a group and a role to a proxy, using the VOMS extensions
  - Allows the same person to wear different hats (e.g. normal user or production manager)
  - (a command sketch follows the next slide)
- Your certificate is your passport: you should sign whenever you use it, and never give it away!
  - There is less danger if a proxy is stolen, since it is short-lived

Organisation of the Grid

- Infrastructure: Computing Centres and networks
  - In Europe: EGEE (Enabling Grids for E-SciencE)
  - In the European Nordic countries: NorduGrid
  - In the US: OSG (Open Science Grid)
  - Each infrastructure may deploy its own software (hereafter called middleware)
- Activity Grid initiatives
  - For LHC Computing: WLCG (Worldwide LHC Computing Grid), which federates the infrastructures for a community
- Virtual Organisations (VOs)
  - Each set of users within a community
  - E.g. in WLCG: the LHC experiments, but there are many more, in HEP and in e-Science
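To make the proxy and VO-membership mechanisms concrete, here is a minimal command sketch using the standard VOMS client tools; it assumes a personal certificate already registered in a VO, and the lhcb VO and production role are used purely as illustrations:

voms-proxy-init --voms lhcb                        # create a short-lived proxy carrying lhcb VO membership
voms-proxy-init --voms lhcb:/lhcb/Role=production  # the same, but wearing the "production manager" hat
voms-proxy-info --all                              # inspect identity, VOMS attributes and remaining lifetime
voms-proxy-destroy                                 # discard the proxy when finished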

Grid infrastructures partners

Data Management

Data Management services

- Storage Element
  - Disk servers
  - Mass storage (tape robots) for Tier0/1
- Catalogues
  - File catalogue: data location (SE) and file name (URL)
  - Metadata catalogue: a more experiment-specific way for users to get information on datasets
- File transfer services
- Data access libraries

Storage Elements (1)

- Sites have freedom of technology; there are 4 basic storage technologies in WLCG:
  - Castor: disk and mass storage integrated; used at CERN, RAL, CNAF and ASGC (Taiwan)
  - dCache: disk caching system, can use multiple mass storage systems; used at most other Tier1s and some Tier2s (no MSS)
  - DPM (Disk Pool Manager): can use any file system back-end; used at most Tier2s
  - StoRM: disk storage, can be coupled to e.g. Castor for MSS; used at the CNAF Tier1 and some Italian Tier2s

Storage Elements (2)

- With 4 types of SE, an abstraction layer is needed so that user code sees a single interface: the Storage Resource Manager (SRM), which sits in front of Castor, dCache and DPM

SRM functionality (1)

- Provides a unified interface to SEs
- Enables partitioning of storage into spaces, classified according to Service Classes
  - A service class defines accessibility and latency; there are mainly 3 classes:
    - T1D0: mass storage, but no permanent disk copy
    - T1D1: mass storage, with a permanent disk copy
    - T0D1: no mass storage, but a permanent disk copy
  - Can be extended (e.g. T2, D2)
- Spaces define a partitioning of the disk space
- Access Control Lists for actions on files per space, e.g. disallow normal users to stage files from tape

SRM functionality (2)

- Standard format of Storage URL (SURL): srm://<host>:<port>/<path>; the <path> may be VO-dependent and is in principle independent of the service class
- Returns a transport URL (tURL) for different access protocols
  - Data transfer: gsiftp
  - POSIX-like access: file, rfio, dcap, gsidcap, xrootd
  - The tURL is used by applications for opening or transferring files
- Allows directory and file management: srm_mkdir, srm_ls, srm_rm

File access and operations

- Applications use tURLs, obtained from the SURL through SRM
- Standard file access layer:
  - gfal: POSIX-like library, SRM-enabled
  - xrootd: alternative to SRM+gfal for file access and transfer, but can also be used as an access protocol only
- High-level tool: lcg_utils, based on gfal (functionality shown in the examples below)

File catalogs (1)

- Files may have more than one copy (replica), so it is useful to represent files uniquely, independent of their location
- Logical File Name (LFN)
  - Structured like a path, but location-independent and human-readable
  - E.g. /grid/lhcb/MC/2009/DST/1234/1234_543_1.dst
- Global Unique Identifier (GUID)
  - 128-bit binary identifier, not human-readable
  - Used for file cross-references (fixed size)
  - Attributed at file creation, unique by construction

File catalogs (2)

- The file catalogue associates the GUID, the LFN and the list of replicas (SURLs)
- Queries are mainly by GUID or LFN
- Contains file metadata (as a file system would): size, date of creation
- Returns the list of replicas
- In the LHC experiments, 3 types are used:
  - LCG File Catalog (LFC), used by ATLAS and LHCb: database-based, hierarchical, generic
  - AliEn: developed and used by ALICE
  - Trivial File Catalog, used by CMS: just construction rules, and a list of sites per file (DB)

High level DM tools

Integrating the file catalogue, SRM and transfers, e.g. lcg_utils:
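As a complement to the session excerpts from the slides shown below, here is a hedged sketch of the kind of lcg-cr (copy-and-register) command that could have produced such a catalogue entry in the first place; the VO, destination SE and local file path are illustrative:

lcg-cr --vo lhcb -d srm-lhcb.cern.ch \
       -l lfn:/grid/lhcb/user/p/phicharp/TestDVAssociators.root \
       file:/tmp/TestDVAssociators.root
# copies the local file to the SE and registers the GUID, LFN and SURL in the LFC;
# the GUID of the new entry is printed on success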

SUSSP65, August 2009Grid Computing, PhC25[lxplus227] ~ > lfc-ls -l /grid/lhcb/user/p/phicharp/TestDVAssociators.root -rwxr-sr-x 1 19503 2703 430 Feb 16 16:36 /grid/lhcb/user/p/phicharp/TestDVAssociators.root

[lxplus227] ~ > lcg-lr lfn:/grid/lhcb/user/p/phicharp/TestDVAssociators.root
srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root

[lxplus227] ~ > lcg-gt srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root rfio
rfio://castorlhcb.cern.ch:9002/?svcClass=lhcbuser&castorVersion=2&path=/castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root

[lxplus227] ~ > lcg-gt srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root gsiftp
gsiftp://lxfsrj0501.cern.ch:2811//castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root

[lxplus227] ~ > lcg-cp srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/user/p/phicharp/TestDVAssociators.root file:test.root

From LFN to reading a file

[lxplus311] ~ $ lcg-lr lfn:/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst
srm://ccsrm.in2p3.fr/pnfs/in2p3.fr/data/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst
srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst
srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst
[lxplus311] ~ $ lcg-gt srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst root
castor://castorlhcb.cern.ch:9002/?svcClass=lhcbdata&castorVersion=2&path=/castor/cern.ch/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst
[lxplus311] ~ $ root -b
  *******************************************
  *                                         *
  *        W E L C O M E  to  R O O T       *
  *                                         *
  *   Version   5.22/00c    27 June 2009    *
  *                                         *
  *  You are welcome to visit our Web site  *
  *          http://root.cern.ch            *
  *                                         *
  *******************************************

ROOT 5.22/00c (branches/v5-22-00-patches@29026, Jun 29 2009, 14:34:00 on linuxx8664gcc)

CINT/ROOT C/C++ Interpreter version 5.16.29, Jan 08, 2008
Type ? for help. Commands must be C++ statements.
Enclose multiple statements between { }.
root [0] TFile::Open("castor://castorlhcb.cern.ch:9002/?svcClass=lhcbdata&castorVersion=2&path=/castor/cern.ch/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst")

Transferring large datasets: FTS

- Needs scheduling of the network bandwidth, using dedicated physical paths, e.g. the LHC Optical Private Network (LHCOPN)
- Transfers are handled by a service (FTS)
  - An FTS service handles a channel (from site A to site B)
  - Manages shares between VOs
  - Optimises bandwidth
  - Takes care of transfer failures and retries
- Yet to come: a file placement service
  - "Move this dataset to this SE" (i.e. 100% of it)
  - Implemented within the VO frameworks

Other Data Management services

- Metadata catalogue
  - Contains VO-specific metadata for files/datasets
  - Allows users to identify datasets
  - Examples: type of data (e.g. AOD), date of data taking, type of simulated data, type of processing (e.g. "reconstruction April 2010")
  - Too specific to be generic

VO frameworks

- Need to elaborate on top of the Grid middleware: each LHC experiment has its own Data Management System
  - File catalogue, e.g. LFC, AliEn-FC
  - Dataset definitions: sets of files always grouped together
  - Metadata catalogues: the user queries them and gets back datasets of LFNs
  - Access and transfer utilities, using gridftp, lcg_cp or FTS

End of lecture 1

Workload Management System

The Great Picture:
- Prepare your software and distribute it on the Grid
  - The Worker Node (WN) that will run the job must have this capability
- Submit jobs, monitor them and retrieve results
- If the job needs data, it must be accessible
- Don't care where it ran, provided it does its job!
  - ... but do care where it ran if it failed

The components

- Local batch system
  - Under the site's responsibility
  - Must ensure proper efficiency of CPU resources
  - Allows shares to be applied between communities
  - Provides monitoring and accounting of resources
- Central job submission service
  - To which users send their tasks (a.k.a. jobs)
- Job description
  - Which resources are needed (operating system, input data, CPU resources, etc.)
  - Standard Job Description Language (JDL), rather cryptic (an example is sketched below)
  - Additional data can be added in a sandbox: input and output sandboxes (limited in size)
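As an illustration of the JDL just mentioned, a minimal, made-up job description; the attribute names follow the standard gLite JDL, but the executable, file names and CPU-time requirement are invented for the example:

# the sandboxes carry small auxiliary files, not the data
cat > job.jdl <<'EOF'
Executable    = "myanalysis.sh";
Arguments     = "dataset.txt";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = { "myanalysis.sh", "dataset.txt" };
OutputSandbox = { "std.out", "std.err", "histos.root" };
Requirements  = other.GlueCEPolicyMaxCPUTime > 720;
EOF

Submission of such a JDL through the gLite WMS is sketched a few slides further down.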

Local Batch System

- Several flavours on the market: LSF, PBS, BQS, Condor, etc.
- Not practical for the Grid, so define an abstraction: the Computing Element (CE)
  - Knows the details of the local implementation
  - Provides a standard interface for job submission, job monitoring and information on the CE
- The CE may depend on the infrastructure, hence the need for interoperability

Central job submission

Ideally, a "Big Brother" that permanently knows the state of the whole system, and decides from this knowledge which site is the most cost-effective to target, knowing the job's requirements. (Diagram: jobs flow from the Resource Broker to the CEs, while information flows from the CEs to a central information system used by the Resource Broker.)

Resource Brokering

Caveats with the ideal picture:
- One entity can hardly support the load (a few 100,000 jobs simultaneously)
  - Several instances are needed, each working independently, which changes the great picture
- Information is not instantaneously available
  - Typical refresh rate of a few minutes; the picture has time to change quite a lot
- How to evaluate the cost of a choice?
  - Estimated return time: how long shall I wait to get the job done?

The real life of WM (1)

- Several instances of the RB per community, each typically handling up to 10,000 jobs
- Dynamic information system (BDII)
  - Regularly updated by the CEs, typically every 2-3 minutes
  - Used by the RBs to make a decision based on the state of the universe
- Complex RB-CE interaction
  - What if no resources are in fact available, e.g. due to out-of-date information?
  - The job is refused, sent back to the RB, and retried somewhere else

The real life of WM (2)

- Which OS, which software is deployed?
  - Software tags are published and used in job matching
  - CEs must be as homogeneous as possible
- Specifying the max CPU time limit requires defining units and applying a CPU rating; this is still in flux
- User identity
  - A user's entitlement to use resources should be checked; users can be banned in case of misbehaviour
  - Users are mapped to standard UNIX (uid, gid); these are transient, but most sites make a one-to-one assignment (pool accounts, like _001, _002)
- Applying priorities between users or VOs
  - Can be done using fair shares, but needs site intervention for changes

An alternative to the central broker: pilot jobs

- Small jobs that do not carry the actual payload to be executed
- A pilot knows where to ask for work to be done (central task queues); if nothing is found, it can just quit gracefully
- Pilots are all identical for a given set of queues and framework
- Implements a "pull" paradigm, as opposed to the "push" paradigm of the RB
- Pioneered by ALICE and LHCb (2003), now used by ATLAS and partly by CMS
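A hedged sketch of how the job.jdl from the earlier example might be pushed through the gLite WMS from a user interface machine; a valid VOMS proxy is assumed, and the options follow the standard gLite WMS user interface:

glite-wms-job-submit -a -o myjobs.txt job.jdl   # -a: automatic proxy delegation; the job ID is appended to myjobs.txt
glite-wms-job-status -i myjobs.txt              # follow the job through Submitted/Waiting/Ready/Scheduled/Running/Done
glite-wms-job-output -i myjobs.txt              # retrieve the output sandbox once the job is Done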

The DIRAC example

- Late job binding: a job is fetched only when everything is OK
  - The software and data are present, and the site capabilities match
- Priorities are applied centrally, with no site intervention
  - User and production jobs are mixed
- Federating Grids: the same pilots can be submitted to multiple Grids
- Similar systems: AliEn (ALICE), PANDA (ATLAS), glideins (CMS)
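From the user's point of view, the pull model looks much like the push one: the job is described and submitted, and the pilots do the rest. A hedged sketch using the DIRAC command-line tools; the command names are those of later DIRAC releases and the job ID is made up, so the 2009 interface may well have differed:

dirac-wms-job-submit job.jdl        # the payload is stored in the central task queue and a DIRAC job ID is returned
dirac-wms-job-status 1234           # 1234 stands for the returned job ID
dirac-wms-job-get-output 1234       # retrieve the output sandbox once a pilot has run the payload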

The Grid Experience (1)

- The basic infrastructure is there, in several flavours (EGEE, OSG, NorduGrid)
- The basic middleware is there
  - Storage-ware (from HEP): Castor, dCache, DPM, StoRM, SRM
  - DM middleware: gfal, lcg_utils, FTS
  - WM middleware: gLite-WMS, LCG-CE (obsolescent, soon to be replaced by CREAM from gLite)
- Good coordination structures
  - EGEE, OSG, and WLCG for the LHC experiments (and HEP)
  - Complex relationships and dependencies

The Grid Experience (2)

- Large VOs must elaborate complex frameworks
  - Integrating the Data Management middleware: file catalogue, SE, metadata catalogue, FTS
  - Applying specific policies and addressing complex use cases: dataset publication, replication
  - Developing alternative WM solutions: pilot-job frameworks, still using the WM middleware for pilot submission
- Production and user analysis frameworks are based on these Grid frameworks

Production systems

- Organised data processing, decided at the VO level
- Very CPU-intensive, repetitive tasks with different starting conditions, for example:
  - Real data reconstruction
  - Large-sample Monte-Carlo simulation
  - Group analysis (data reduction)
- Workflow (a toy submission sketch is given after the next slide):
  - Define the tasks to be performed
  - Create tasks
  - Generate jobs, submit and monitor them
  - Collect and publish the results
- Each VO has its own dedicated and tailored production system (possibly several systems)

Example of production system

From the LHCb/DIRAC system:
- Users define their needs
- Validation
  - Physics Planning Group: also sets priorities
  - Technical validation (application managers)
- Production is launched
- Progress is monitored
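A toy sketch of what "repetitive tasks with different starting conditions" means in practice: the same simulation job submitted many times, varying only the random seed. Everything here (script name, JDL fields, event count) is invented for illustration, and a real production system adds bookkeeping, retries and data placement on top:

# submit 100 simulation jobs that differ only by their random seed
for seed in $(seq 1 100); do
  cat > sim_${seed}.jdl <<EOF
Executable    = "run_simulation.sh";
Arguments     = "--seed ${seed} --events 500";
StdOutput     = "std.out";
StdError      = "std.err";
InputSandbox  = { "run_simulation.sh" };
OutputSandbox = { "std.out", "std.err" };
EOF
  glite-wms-job-submit -a -o production_jobs.txt sim_${seed}.jdl
done
glite-wms-job-status -i production_jobs.txt   # monitor the whole batch at once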

Production request example

Specificity of production

- Well-organised job submission, agreed by the collaboration
- Data placement follows the Computing Model
  - Jobs upload their results, and the datasets are then distributed to other sites
  - Transfer requests (associated with the production), dataset subscriptions (from sites), user requests
- Also takes care of dataset archiving and deletion
  - Reduce the number of replicas of old datasets; when unused, delete them (see the sketch after the next slide)

User Analysis support

- Much less organised!...
- Jobs vary in resource requirements
  - SW development: rapid cycle, short jobs
  - Testing: longer, but still on a restricted dataset
  - Running on the full dataset
- Output also varies
  - Histograms, printed output, Ntuple: of order kB to a few MB
  - Event data output (reduced dataset or elaborated data structure): of order GB
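A hedged sketch of the replica management behind "distribute the datasets, then reduce the number of replicas and delete": lcg-rep adds a replica at another SE and lcg-del removes one. The LFN is the one from the earlier transcript; the destination SE host is illustrative:

lcg-rep -d srm.grid.sara.nl lfn:/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst   # add a replica at another Tier1
lcg-lr lfn:/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst                        # list all replicas
lcg-del -s srm.grid.sara.nl lfn:/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst   # later, drop that replica again
lcg-del -a lfn:/grid/lhcb/MC/MC09/DST/00005071/0000/00005071_00000004_1.dst                    # or remove the file everywhere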

User Analysis

- For large datasets, users benefit from using the Grid
  - More potential resources
  - Automatic output data storage and registration
- For development and testing, running locally is still the norm
- End analysis (from Ntuples and histograms, making the final plots) happens outside the Grid
  - Desktop/laptop, or small clusters (Tier3)

Example: ganga

- Developed jointly by ATLAS and LHCb
- Allows users to configure their jobs, then run them interactively, locally or on the Grid with only minor modifications
- Job repository: keeps the history of jobs
- Job monitoring
- Hides the complexity of the Grid frameworks
  - Drives gLite-WMS directly, Panda (ATLAS), DIRAC (LHCb)



Job monitoring

Does this all work?

- Many challenges have been run for years now, in order to improve continuously: data transfers, heavy load with jobs, etc.
- Now the system is in production
  - Performance still needs testing; it must be robust when the first data come
  - Sites and experiments must learn; it will still take time to fully understand all operational issues
  - What happens when things go wrong?
- Many metrics are used to assess performance: global transfer rates, individual sites' performance

Data transfers

(plot: transfer rates of about 1 GB/s)

Jobs running on the Grid (ATLAS)


Visiting the Grid

- A GoogleEarth view of Grid activity: a GridPP development (Imperial College, London)

- http://gridportal-ws01.hep.ph.ic.ac.uk/dynamic_information/rtm_beta.kmz


Storage tape robot

CPU farm (CERN)

Outlook rather than conclusions

- Grid Computing is a must for the LHC experiments
  - The largest Computing Centre in the world!
  - Data-centric (PB of data per year), unlike cloud computing (Amazon, Google, ...)
- Grid Computing requires huge efforts
  - From the sites, to be part of an ensemble
  - From the user communities (the experiments)
- Flexibility, constant changes in services
  - Nothing can be changed now: concentrate on stability, stability, stability, stability, ...

However...

Parallelisation

- All machines are now multi-core and soon many-core: already 16 cores, soon 128 in a single machine
- How to best use these architectures?
  - Adapt applications to use multiple cores; this saves a lot of memory when running the same application
  - Change the way resources are allocated
- In any case this is mandatory for desktop usage, so why not for batch / Grid processing?

Virtualisation

- There are always problems with customisation of the environment on WNs: operating system, middleware, libraries, user software
- Create virtual machines to be submitted to the Grid
  - Needs trust between users and resource providers (sites)
  - This is already the trend for cloud computing
- Workshop on parallelisation and virtualisation:
  http://indico.cern.ch/conferenceDisplay.py?confId=56353