Data reprocessing for DZero on the SAM-Grid

Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division

Mar 15, 2005 Gabriele Garzoglio

Overview

The DZero experiment at Fermilab
Data reprocessing
    Motivation
    The computing challenges
    Current deployment
The SAM-Grid system
    Condor-G and Global Job Management
    Local Job Management
    Getting more resources submitting to LCG

Fermilab and DZero

DZero

Data size for the D0 Experiment

Detector Data
    1,000,000 channels
    Event size 250 KB
    Event rate ~50 Hz
    0.75 PB of data since 1998
    Past year overall 0.5 PB
    Expect overall 10–20 PB
This means
    Move 10s of TB / day
    Process PBs / year
    25%–50% remote computing
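As a rough cross-check of the detector figures above, the raw data rate follows from event size times event rate. The sketch below assumes continuous data taking and decimal units (1 KB = 1000 bytes), neither of which is stated on the slide, so it gives an upper bound for raw detector data only. The "10s of TB / day" figure presumably also covers derived data and transfers to remote sites, which the slide does not break down.

```python
# Back-of-the-envelope raw data rate from the slide's detector figures.
# Assumptions (not from the slide): decimal units and continuous running.
EVENT_SIZE_BYTES = 250_000      # 250 KB per event
EVENT_RATE_HZ = 50              # ~50 events per second
SECONDS_PER_DAY = 24 * 3600

raw_bytes_per_second = EVENT_SIZE_BYTES * EVENT_RATE_HZ
raw_tb_per_day = raw_bytes_per_second * SECONDS_PER_DAY / 1e12
raw_tb_per_year = raw_tb_per_day * 365

print(f"raw rate:   {raw_bytes_per_second / 1e6:.1f} MB/s")
print(f"raw volume: {raw_tb_per_day:.2f} TB/day, {raw_tb_per_year:.0f} TB/year")
```

Under these assumptions the raw stream is about 12.5 MB/s, or roughly 1 TB per day.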

Motivation for the Reprocessing

Processing: changing the data format from something close to the detector to something close to the physics.
As the understanding of the detector improves, the processing algorithms change.
Sometimes it is worth “reprocessing” all the data to get “better” analysis results.
Our understanding of the DZero calorimeter calibration is now based on reality rather than design/plans: we want to reprocess.

The computing task

Events: 1 billion
Input: 250 TB (250 kB/event)
Output: 70 TB (70 kB/event)
Time: 50 s/event, i.e. 20,000 months of CPU time
Ideally: 3,400 CPUs (1 GHz PIII) for 6 months (~2 days/file)
Remote processing: 100%

A stack of CDs as high as the Eiffel Tower
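The totals on this slide can be reproduced from the per-event numbers. The sketch below is only a cross-check: it assumes decimal units and 30-day months, which the slide does not specify. The slide's rounder figures (20,000 months, 3,400 CPUs) come out slightly higher than this estimate; whether that reflects rounding or a safety margin is not stated.

```python
# Cross-check of the reprocessing totals from the per-event figures.
# Assumptions (not from the slide): decimal units, 30-day months.
EVENTS = 1_000_000_000
INPUT_KB_PER_EVENT = 250
OUTPUT_KB_PER_EVENT = 70
SECONDS_PER_EVENT = 50          # on a 1 GHz PIII-equivalent CPU
SECONDS_PER_MONTH = 30 * 24 * 3600
CAMPAIGN_MONTHS = 6

input_tb = EVENTS * INPUT_KB_PER_EVENT / 1e9     # kB -> TB
output_tb = EVENTS * OUTPUT_KB_PER_EVENT / 1e9
cpu_months = EVENTS * SECONDS_PER_EVENT / SECONDS_PER_MONTH
cpus_needed = cpu_months / CAMPAIGN_MONTHS

print(f"input {input_tb:.0f} TB, output {output_tb:.0f} TB")
print(f"{cpu_months:,.0f} CPU-months, {cpus_needed:,.0f} CPUs for {CAMPAIGN_MONTHS} months")
```

With these assumptions the estimate is about 19,300 CPU-months, or about 3,200 CPUs kept busy for six months.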

Data processing model

Input dataset (n files; n ~ 100: files produced in 1 day)
Site 1, Site 2, … Site m
Job 1, Job 2, … Job n (n batch processes per site)
Out 1, Out 2, … Out n (stored locally at the site)
Merging (at any site)
Permanent Storage
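The flow in this model can be sketched in a few lines. This is an illustration only, not SAM-Grid code: the function names, the round-robin placement policy, the `out-` naming, and the merge group size are all hypothetical.

```python
from collections import defaultdict

def assign_datasets(datasets, sites):
    """Send each dataset to one site, round-robin (hypothetical policy)."""
    names = sorted(datasets)
    return {name: sites[i % len(sites)] for i, name in enumerate(names)}

def make_jobs(datasets, placement):
    """One batch job per input file, run at the site holding its dataset."""
    return [(placement[name], f) for name in sorted(datasets) for f in datasets[name]]

def run_jobs(jobs):
    """Pretend to run each job; the output stays at the site that produced it."""
    outputs = defaultdict(list)
    for site, input_file in jobs:
        outputs[site].append(f"out-{input_file}")
    return dict(outputs)

def merge(outputs_by_site, files_per_merge):
    """Group small outputs into merged files destined for permanent storage."""
    all_outputs = sorted(o for outs in outputs_by_site.values() for o in outs)
    return [all_outputs[i:i + files_per_merge]
            for i in range(0, len(all_outputs), files_per_merge)]

# Tiny example: two datasets, two sites, merge outputs in pairs.
datasets = {"ds1": ["f1", "f2", "f3"], "ds2": ["f4", "f5"]}
placement = assign_datasets(datasets, ["site-a", "site-b"])
outputs = run_jobs(make_jobs(datasets, placement))
merged = merge(outputs, files_per_merge=2)
print(placement)
print(merged)
```

Keeping the unmerged outputs at the producing site means only the merge step has to move data toward permanent storage.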

Challenges: Overall scale

A dozen computing clusters in the US and EU
    common meta-computing framework: SAM-Grid
    administrative independence
Need to submit 1,700 batch jobs / day to meet the deadline (without counting failures)
Each site needs to be filled up at all times: locally scale up to 1000 batch nodes
Time to completion of the unit of bookkeeping (~100 files): if too long (days), things are bound to fail
Handle 250+ TB of data
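The 1,700 jobs/day figure appears consistent with the earlier numbers: 3,400 CPUs, each busy for about 2 days per file, free up about 1,700 slots per day. The sketch below makes that arithmetic explicit and shows how failures raise the submission rate; the 90% success rate is a made-up value for illustration, not a measurement.

```python
import math

CPUS = 3400            # from "The computing task"
DAYS_PER_JOB = 2       # ~2 days per file, one file per batch job

# Slots that become free, and must be refilled, each day.
jobs_per_day = CPUS / DAYS_PER_JOB

def jobs_to_submit(success_rate):
    """Daily submissions needed if only a fraction of jobs succeed."""
    return math.ceil(jobs_per_day / success_rate)

print(jobs_to_submit(1.0))   # no failures
print(jobs_to_submit(0.9))   # hypothetical 90% success rate
```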

Challenges: Error Handling / Recovery

Design for random failures
    unrecoverable application errors, network outages, file delivery failures, batch system crashes and hangups, worker-node crashes, filesystem corruption...
Book-keeping of succeeded jobs/files: needed to assure completion without duplicated events
Book-keeping of failed jobs/files: needed for recovery AND to trace problems, in order to fix bugs and to assure efficiency
Simple error recovery to foster smooth operations
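The two kinds of book-keeping can be pictured with a small in-memory model. This is a hypothetical sketch, not the SAM book-keeping service: the class name, methods, and failure reasons are invented for illustration.

```python
from collections import Counter

class Bookkeeping:
    """Tracks which requested files succeeded and why others failed."""

    def __init__(self, requested_files):
        self.requested = set(requested_files)
        self.succeeded = set()
        self.failures = []          # history of (file, reason) pairs

    def record_success(self, file_name):
        if file_name not in self.requested:
            raise KeyError(f"{file_name} was never requested")
        if file_name in self.succeeded:
            # Processing a file twice would duplicate its events.
            raise ValueError(f"{file_name} already processed")
        self.succeeded.add(file_name)

    def record_failure(self, file_name, reason):
        self.failures.append((file_name, reason))

    def recovery_files(self):
        """Files still to be processed: the input of a recovery job."""
        return sorted(self.requested - self.succeeded)

    def failure_summary(self):
        """Failure counts per reason, useful for tracing recurring problems."""
        return Counter(reason for _, reason in self.failures)

    def is_complete(self):
        return self.succeeded == self.requested

book = Bookkeeping(["f1", "f2", "f3"])
book.record_success("f1")
book.record_failure("f2", "file delivery failure")
book.record_failure("f3", "worker-node crash")
print(book.recovery_files())
print(book.failure_summary())
```

Recovery then amounts to resubmitting `recovery_files()` until `is_complete()` returns true.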

Available Resources

SITE          #CPU (1 GHz-eq.)    STATUS
FNAL Farm     1000 CPUs           used for data-taking
Westgrid       600 CPUs           ready
Lyon           400 CPUs           ready
SAR (UTA)      230 CPUs           ready
Wisconsin       30 CPUs           ready
GridKa         500 CPUs           ready
Prague         200 CPUs           ready
CMS/OSG        100 CPUs           under test
UK             750 CPUs           4 sites being deployed
-------------------------------------------------------------
              2800 CPUs (1 GHz PIII equiv.)

The SAM-Grid

SAM-Grid is an integrated job, data, and information management system
Grid-level job management is based on Condor-G and Globus
Data handling and book-keeping are based on SAM (Sequential Access via Metadata): transparent data transport, processing history, and book-keeping
…a lot of work to achieve scalability at the execution cluster

SAM-Grid Diagram

[Diagram: flow of jobs, data, and meta-data between SAM-Grid components. Components shown: User Interfaces, Submission, Global Job Queue, Grid Client, Match Making, Resource Selector, Info Collector, Info Gatherer; Global DH Services (SAM Naming Server, SAM Log Server, Resource Optimizer, SAM DB Server, RC MetaData Catalog, Bookkeeping Service); Data Handling (SAM Stager(s), SAM Station + other services, MSS, Cache); Site, Grid Gateway, Grid/Fabric Interface, JIM Advertise, Local Job Handling, Cluster, Worker Nodes, AAA, Dist. FS; Info Manager, XML DB server (site configuration, global/local JID map, ...), Info Providers, MDS; Web Server, Grid Monitoring, User Tools.]

Job Management Diagram

[Diagram: path of a JOB through grid-level job management. Components shown: User Interfaces, Submission Service, Resource Selection, Match Making Service (with an external algorithm, "ext. algo"), Information Collector, Generic Service, and Execution Sites #1 … #n, each with a Computing Element, Grid Sensors, and a Grid/Fabric Interface.]

Fabric Level Job Management

[Diagram: a JOB inside an Execution Site. Components shown: Grid/Fabric Interface, SAM Station, Sandbox Facility, XML Monitoring Database, Batch System Adapter, Batch System, Worker Node.]

Job enters the Site
Local Sandbox created for job (user input, configuration, SAM client, GridFTP client, user credentials)
Local services notified of job
Batch Job submission details requested
Job submitted
Job starts on Worker node
Push of monitoring info starts
Job fetches Sandbox
Job gets dependent products and input data
Framework passes control to application
Grid monitors status of job
User requests status of job
Job stores output from application
Stdout, stderr, logs handed to Grid
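The steps above form a linear lifecycle, which can be modeled as an ordered set of stages. The sketch below is hypothetical: the stage names are paraphrases of the slide, and the monitoring and status-query steps are left out because they run alongside the job rather than advancing it.

```python
from enum import IntEnum

class Stage(IntEnum):
    """Stages a job passes through at an execution site, in order."""
    ENTERED_SITE = 1
    SANDBOX_CREATED = 2
    LOCAL_SERVICES_NOTIFIED = 3
    SUBMITTED_TO_BATCH = 4
    STARTED_ON_WORKER = 5
    SANDBOX_FETCHED = 6
    INPUTS_DELIVERED = 7
    APPLICATION_RUNNING = 8
    OUTPUT_STORED = 9
    LOGS_RETURNED = 10

def advance(stage):
    """Return the next stage; a finished job cannot advance further."""
    if stage is Stage.LOGS_RETURNED:
        raise ValueError("job already finished")
    return Stage(stage + 1)

def run_to_completion(start=Stage.ENTERED_SITE):
    """Walk a job through every remaining stage and return its history."""
    history = [start]
    while history[-1] is not Stage.LOGS_RETURNED:
        history.append(advance(history[-1]))
    return history

for stage in run_to_completion():
    print(stage.name)
```

Recording the last stage reached is one way to tell a file-delivery failure from an application error when a job dies.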

How do we get more resources?

We are working on forwarding jobs to the LCG Grid
A “forwarding-node” is the advertised gateway to LCG
LCG becomes yet another batch system… well, not quite a batch system
Need to get rid of the assumptions about the locality of the network

[Diagram: SAM-Grid connected to LCG through the forwarding node (Fwd-node), with a VO-srv.]
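The idea that LCG becomes "yet another batch system" can be illustrated with an adapter interface. This is a hypothetical sketch: the class and method names are invented, submission is simulated in memory, and nothing here reflects the actual SAM-Grid or LCG interfaces.

```python
from abc import ABC, abstractmethod

class BatchAdapter(ABC):
    """Uniform interface the site uses to hand jobs to a batch system."""

    @abstractmethod
    def submit(self, job_name: str) -> str:
        """Submit a job and return an identifier for it."""

class LocalBatchAdapter(BatchAdapter):
    """Stands in for a cluster's local batch system."""

    def __init__(self):
        self._count = 0

    def submit(self, job_name: str) -> str:
        self._count += 1
        return f"local-{self._count}"

class LcgForwardingAdapter(BatchAdapter):
    """Stands in for a forwarding node that passes jobs on to LCG."""

    def __init__(self):
        self.forwarded = []

    def submit(self, job_name: str) -> str:
        # A real forwarding node could not assume the worker node shares
        # a local network with the site's data services.
        self.forwarded.append(job_name)
        return f"lcg-{len(self.forwarded)}"

def submit_all(adapter: BatchAdapter, job_names):
    """The caller does not need to know which kind of adapter it has."""
    return [adapter.submit(name) for name in job_names]

print(submit_all(LocalBatchAdapter(), ["reco-1", "reco-2"]))
print(submit_all(LcgForwardingAdapter(), ["reco-3"]))
```

The "not quite a batch system" caveat is where this analogy breaks down: the interface can be the same, but the locality assumptions behind it cannot.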

Conclusions

DZero needs to reprocess 250 TB of data in the next 6-8 months
It will produce 70 TB of output, processing data at a dozen computing centers on ~3000 CPUs
The SAM-Grid system will provide the meta-computing infrastructure to handle data, jobs, and information.

More info at…

http://www-d0.fnal.gov/computing/reprocessing/

http://www-d0.fnal.gov/computing/grid/

http://samgrid.fnal.gov:8080/

http://d0db.fnal.gov/sam/