
Slide 1 (R. Pordes, Atlas Software Mtg, May 6th 2002)

A snapshot of PPDG Status and Plans

Ruth Pordes
A PPDG and iVDGL Coordinator
Computing Division, Fermilab

PPDG “Vertical Integration” of Grid middleware components into HENP experiments’ ongoing work

Laboratory for “experimental computer science”

Common and Reused Components

Slide 2

A snapshot of PPDG

Summary

Status

Slow ramp-up of new activity: CS-11, the Analysis Tools cross-cut activity

Other near term plans

(With thanks to many people who gave the information.)

Slide 3

The running experiments are very cautious about introducing anything new –

Experiment          Location     # physicists   Time scale
BaBar               SLAC         800            1999 - 2010
STAR                BNL / RHIC   450            2000 - 2010
JLab/CLAS           JLab         200            2000 - 2010
JLab/QCD (theory)   JLab         30             2003?
D0-Run2             FNAL         800            2001 - 2010
ATLAS               CERN         2000           2007 - 2016
CMS                 CERN         2000           2007 - 2016

A Summary and a Reminder:

PPDG experiments include those taking data now as well as the LHC experiments

Slide 4

[Architecture diagram: the PPDG layered view, with the CS project activities placed at each layer.]

CS-10: Experiment Production Grids

Experiment Data Processing Applications / User Analysis Programs
(supported by Monitors, Reporters, Diagnostics and by System Managers, Controllers)

Application Grid Infrastructure: CS-1 Job Definition Language and Interface; CS-11 Analysis Tools; CS-9 Virtual Organization framework; Data Delivery and Access framework (CS-12); Experiment Catalogs (CS-12); Error and Diagnosis framework (CS-13); Data Definition and Management (CS-12); CS-2 Workload Management

Grid Middleware: CS-5 Reliable File Transfer; CS-2 Job Scheduling; CS-6 Data Replication Services; CS-3 Monitoring Framework; CS-9 Authentication and Authorization

Fabric: CS-4 Storage nodes; CS-3 Monitoring Information Providers; Databases/Objects; Compute nodes; Networks; CS-9 Security

Slide 5

PPDG does not live in isolation on any front:

Slide 6

SciDAC - encouragement to collaborate

SciDAC projects and their connections to PPDG:

• Earth System Grid II
• Collaboratory for Multi-Scale Chemical Science
• National Fusion Collaboratory
• DOE Science Grid – collaborative development of CA and RA policies and continued work on the PMA, etc.
• Pervasive Collaborative Computing Environment and Reliable and Secure Group Communication
• A High-Performance Data Grid Toolkit – many Globus (ANL and ISI) developments are being used by PPDG.
• CoG Middleware – application developers are starting to show interest in higher-level language interfaces to Globus.
• Scientific Annotation Middleware
• Storage Resource Management for Data Grid Applications – co-collaborators with PPDG; development of SRM interfaces is done in collaboration.
• Middleware to Support Group-to-Group Collaboration
• Distributed Security Architectures
• Security and Policy for Group Collaboration – SiteAAA is working closely with Globus CAS.
• Scientific Data Management – an application scientist from STAR is working with the collaboration; expect more interaction on Analysis Tools.
• Middleware Technology to Support Science Portals – in prototype use by Grappa?
• Optimizing Performance and Enhancing Functionality of Distributed Applications Using Logistical Networking – exploring ways to work together, but to date no mutually beneficial task has been found.
• Bandwidth Estimation: Measurement Methodologies and Applications – endorsed PPDG collaboration with SLAC.
• Advanced Computing for 21st Century Accelerator Science and Technology
• A National Computational Infrastructure for Lattice Gauge Theory – JLab collaborators on PPDG are delivering prototype Grid applications for their users.
• Shedding New Light on Exploding Stars: Terascale Simulations of Neutrino-Driven Supernovae and their Nucleosynthesis
• SciDAC Center for Supernova Research

Slide 7

SciDAC CoG Kits

Impact and Connections

IMPACT: Allow application developers to make use of Grid services from higher-level frameworks such as Java and Python. Easier development of advanced Grid services. Easier and more rapid application development. Encourage code reuse and avoid duplication of effort amongst the collaboratory projects. Encourage the reuse of Web Services as part of the Grids.

CONNECTIONS: We are working closely with, or as part of, the Globus research project, and we work with a variety of major funded applications through SciDAC, NSF, and EU grants, e.g. DOE Science Grid, Earth Systems Grid, Supernova Factory, NASA IPG.

The Novel Ideas:
• Develop a common set of reusable components for accessing Grid services.
• Focus on supporting the rapid development of Science Portals, Problem Solving Environments, and science applications that access Grid resources.
• Develop and deploy a set of “Web Services” that access underlying Grid services.
• Integrate the Grid Security Infrastructure (GSI) into the “Web Services” model. Provide access to higher-level Grid services that are language independent and described via commodity Web technologies such as WSDL.

Principal Investigators: Gregor von Laszewski, ANL; Keith Jackson, LBL. 09/07/2001

MICS/SciDAC program. MICS Program Manager: Mary Ann Scott

[Diagram: the Globus Toolkit accessed through the Java CoG Toolkit and the Python CoG Toolkit, together with commodity Java and Python tools and services, supporting higher-level portals and applications: a High Energy Physics portal, a Biology PSE, a Chemistry Python IDE, an Earth Science Java IDE, a Java Distributed Programming Framework, and a Java CoG Globus Service.]

Milestones/Dates/Status: The main goal of this project is to create Software Development Kits in both Java and Python that allow easy access to Grid services.

Provide access to basic Grid services:
• GRAM, MDS, Security, GridFTP (year 1)
• Replica Catalog, co-scheduling (years 1-2)
Composable components:
• Develop guidelines for component development (year 1)
• Design and implement component hierarchies (years 1-2)
• Develop a component repository (years 2-3)
Web Services:
• Integrate GSI (year 1)
• Develop an initial set of useful web services (years 1-2)

Composable CoG Components
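Since this slide describes the CoG Kits only at the level of goals, a minimal sketch may help show the flavour of "higher-level access to Grid services". The sketch below is not the actual Java or Python CoG Kit API; it simply wraps two standard Globus 2.x command-line clients (grid-proxy-init and globus-job-run), and the contact string and executable are hypothetical examples.

```python
# Illustrative sketch only: a tiny "CoG-style" convenience layer over the
# standard Globus 2.x command-line clients. This is NOT the actual Java or
# Python CoG Kit API; the contact string and executable are made up.
import subprocess

class GridSession:
    def __init__(self, contact):
        # GRAM contact string, e.g. "host.example.edu/jobmanager"
        self.contact = contact

    def authenticate(self):
        # Create a short-lived GSI proxy credential from the user's certificate.
        subprocess.run(["grid-proxy-init"], check=True)

    def run(self, executable, *args):
        # Submit a simple job through GRAM and return its standard output.
        cmd = ["globus-job-run", self.contact, executable, *args]
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    session = GridSession("gridnode.example.edu/jobmanager")
    session.authenticate()
    print(session.run("/bin/hostname"))
```

The actual kits expose GRAM, MDS, GSI and GridFTP as native Java or Python classes rather than shelling out; the point here is only the calling pattern an application developer would see.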

Slide 8

the funding agencies hope for

Slide 9

PPDG Status: Experiment End-to-End Grids + Common Services to date:

ATLAS – MAGDA - GSI, MDS, (GDMP)

BaBar – Babar data handling system - SRB in prototype

CMS – IMPALA/MOP – GSI, Condor-G, GRAM, DAGMan, MDS

D0 – SAM – GridFTP, (GSI), (MDS), (Condor-G)

JLAB – JASMINE, Replica Catalog Portal

STAR – STACS – HRM, GridFTP

SiteAAA – working to ensure CAS can be used by all experiments. Currently PPDG using EDG VO mechanisms.

All experiments expect to demo at SC2002, which makes a good milestone! These demos are a valuable part of PPDG: they enable the results of the work to be not only demonstrated but also introduced into the experiments' actual running systems.
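Nearly every system in the list above rests on the same two common services: GSI authentication and GridFTP data movement. As a minimal illustrative sketch (assuming a standard Globus 2.x client installation; the host name and file paths are hypothetical, not taken from any experiment):

```python
# Illustrative sketch only: authenticate with GSI and move one file with
# GridFTP, using the standard Globus command-line clients. The host name and
# paths are hypothetical; real productions wrap this in their own data movers.
import subprocess

def make_proxy():
    # grid-proxy-init creates a GSI proxy credential from the user certificate.
    subprocess.run(["grid-proxy-init"], check=True)

def gridftp_copy(src_url, dst_url):
    # globus-url-copy transfers data between gsiftp:// and file:// URLs.
    subprocess.run(["globus-url-copy", src_url, dst_url], check=True)

if __name__ == "__main__":
    make_proxy()
    gridftp_copy("gsiftp://se.example.edu/data/run123/raw_0001.dat",
                 "file:///scratch/raw_0001.dat")
```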

Slide 10

[Diagram: the IMPALA-MOP production chain driven by Condor-G and DAGMan – declare and connect to the CERN RefDB, stage in the DAR and the cmkin/cmsim wrapper scripts, run the wrapper script, GDMP publishes and transfers the data files, and an error filter updates the RefDB.]

Step 1: submit/install the DAR file to remote sites
Step 2: submit all CMKIN jobs
Step 3: submit all CMSIM jobs

Assigned 200K events to test MOP; finished the CMKIN part; started the CMSIM part.
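The slide names Condor-G and DAGMan but does not show how the three steps are chained. Below is a minimal sketch of what such a chain could look like as a generated DAGMan input file; the submit-description file names and the site list are hypothetical placeholders, not the actual IMPALA/MOP scripts.

```python
# Illustrative sketch only: generate a DAGMan input file chaining the three
# MOP steps (stage-in of the DAR, then CMKIN, then CMSIM) at each site. The
# .sub submit-description files and the site list are hypothetical.
sites = ["site_a", "site_b"]

def write_dag(path="mop.dag"):
    with open(path, "w") as dag:
        for site in sites:
            dag.write(f"JOB stagein_{site} stagein_{site}.sub\n")
            dag.write(f"JOB cmkin_{site} cmkin_{site}.sub\n")
            dag.write(f"JOB cmsim_{site} cmsim_{site}.sub\n")
            # DAGMan dependencies: stage-in -> CMKIN -> CMSIM at each site.
            dag.write(f"PARENT stagein_{site} CHILD cmkin_{site}\n")
            dag.write(f"PARENT cmkin_{site} CHILD cmsim_{site}\n")

if __name__ == "__main__":
    write_dag()
```

The generated file would then be handed to condor_submit_dag, which releases each node to Condor-G only after its parent nodes have completed.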

Slide 11

Lessons learned to date:
• 5-site production grid
• Need:
— “grid-wide” debugging
  • the ability to log into a remote site and talk to the system manager over the phone proved vital...
  • remote logins and telephone calls are not a scalable solution!
— site configuration monitoring
  • how are Globus, Condor, etc. configured?
  • what does the GDMP export/import catalog say?
  • Florida and Fermilab currently post the information on the web
  • should it be monitored by standard monitoring tools? (a sketch of the idea follows below)
— programmers to write very robust code!
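Purely as an illustration of the "monitored by standard monitoring tools" question above (none of this is from the source; which commands to query and where to publish the result are assumptions), a small information-provider script could gather local configuration and write it where a web server or monitoring tool can pick it up:

```python
# Illustrative sketch only: collect a few pieces of site configuration and
# write them to a file for publication. condor_version is a real Condor
# command; globus-version may or may not be present on a given site, and the
# GDMP catalog is left out because its local location varies by installation.
import socket
import subprocess
import time

def capture(cmd):
    try:
        return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        return f"{cmd[0]}: not installed"

def main(out="site_config.txt"):
    report = [
        f"host: {socket.getfqdn()}",
        f"time: {time.asctime()}",
        "condor: " + capture(["condor_version"]),
        "globus: " + capture(["globus-version"]),
    ]
    with open(out, "w") as f:
        f.write("\n".join(report) + "\n")

if __name__ == "__main__":
    main()
```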

Slide 12

PPDG Status: Computer Science groups' software extensions and integration:

Globus – GSI, GridFTP, Replica Catalog to support GDMP, CAS modifications

Condor – ClassAd call-outs, matchmaking at the Condor-G level

SRB – extension of MCAT, interface to GridFTP, support for multiple catalogs

SRM – Extensions to HRM/DRM for control and error returns.

Soon ready to test: the Reliable File Transfer layer, the new Replica Location Service, the Glue Schema (plus hardening as the software is used under new conditions and stress).
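To make the ClassAd matchmaking idea concrete, here is a minimal Python sketch of the concept; it is not Condor's ClassAd language or API, and all attribute names are made up for illustration:

```python
# Illustrative sketch of ClassAd-style matchmaking: a job "ad" states its
# requirements and rank, resource "ads" advertise attributes, and a matchmaker
# pairs them. This is NOT Condor's ClassAd language; the attributes are made up.
job_ad = {
    "owner": "uscms_prod",
    "requirements": lambda m: m["arch"] == "x86" and m["free_disk_gb"] >= 10,
    "rank": lambda m: m["mips"],  # prefer the fastest machine that matches
}

machine_ads = [
    {"name": "node1.example.edu", "arch": "x86",   "free_disk_gb": 40,  "mips": 800},
    {"name": "node2.example.edu", "arch": "alpha", "free_disk_gb": 100, "mips": 900},
    {"name": "node3.example.edu", "arch": "x86",   "free_disk_gb": 5,   "mips": 1200},
]

def match(job, machines):
    # Keep only resources satisfying the job's requirements, then pick the
    # highest-ranked one.
    candidates = [m for m in machines if job["requirements"](m)]
    return max(candidates, key=job["rank"]) if candidates else None

if __name__ == "__main__":
    best = match(job_ad, machine_ads)
    print("matched:", best["name"] if best else "no suitable resource")
```

In real Condor-G matchmaking the requirements and rank are declarative ClassAd expressions evaluated by the matchmaker rather than Python functions; the sketch only shows the matching step itself.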

Slide 13

End of PPDG First Year – Internal Reviews

Projects reviewed: GDMP, D0 Job Management, JLab Replication, MAGDA, STAR-DDM, CMS-MOP, BaBar Database Replication

Questions asked:
• What are the deliverables of your project activity, how has the project met the deliverables to date, and what effort has been contributing to the project?
• What is the deployment plan for your project activity and what is the state of that deployment?
• Has the project benefited from being part of the PPDG work, and if so how?
• Has the project been hindered by being part of the PPDG work, and if so how?
• What collaborations does your project activity rely on and/or contribute to? Have these been a benefit or a hindrance?
• What is your assessment of the potential for adapting the software from this project to other experiments?
• What do you see as the future needs, deliverables and effort needed for the project activity?

Reviewees answer the questions. Reviewers write a short report. The actual reviews are by phone and quite “informal”.

The goal is input to next year's planning... 2 reviews done, 5 more this week.

Slide 14

CS-11 Analysis Tools

“interface and integrate interactive data analysis tools with the grid and to identify common components and services.”

First:
— identify appropriate individuals to participate in this area, within and from outside of PPDG – several identified from each experiment
— assemble a list of references to white papers, publications, tools and related activities – available at http://www.ppdg.net/pa/ppdg-pa/idat/related-info.html
— produce a white-paper-style requirements document as an initial view of a coherent approach to this topic – draft circulated by June
— develop a roadmap for the future of this activity – at/after the face-to-face meeting

Slide 15

Generic data flow in HENP?

[Diagram: the standard data-reduction chain – “skims”, “microDST production”, etc. – with the filtering at each step chosen to make the output a convenient size. Typical scales attached to the stages: $100M, 10 yr, 100 people; 10 yr, 20 people; 1 yr, 50 people, 5x/yr; 1 mo, 1 person, 100x/yr.]

What's going on in this box? Is this picture anywhere close to reality?

Many groups are grappling with the requirements now.

Slide 16

Analysis of large datasets over the Grid

• Dataset does not fit on disk: need access software coupled with the processing; distributed management implementing global experiment and local site policies.

• Demand significantly exceeds available resources: queues are always full; when and how to move the job and/or the data; global optimization of (or at least not totally random) total system throughput without too many local constraints (e.g. single points of failure).

• Data and job definition – in physicist terminology. For D0-SAM, a web and command-line interface is used to specify Datasets and Dataset Snapshots, saved in an RDBMS for tracking and reuse. Many “dimensions” or attributes can be combined to define a dataset; definitions can be iterative and extended; new versions are defined at a specific date (see the sketch after this list).

• Distributed processing and control: schedule, control and monitor access to shared resources – CPU, disk, network. E.g. all D0-SAM job executions pass through a SAM wrapper and are tracked in the database for monitoring and analysis.

• Faults of all kinds occur: preemption, exceptions, resource unavailability, crashes; checkpointing and restart; workflow management to complete failed tasks; error reporting and diagnosis.

• Chaotic and large spikes in load: analysis needs vary widely and are difficult to predict – especially at a sniff of a new discovery.

• Estimation, prediction, planning, partial results – GriPhyN research areas.
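As a purely illustrative sketch of the attribute-based, versioned dataset definition mentioned above (the attribute names, the tiny in-memory catalog, and the versioning scheme are hypothetical, not SAM's actual interface):

```python
# Illustrative sketch of attribute-based dataset definition with reusable,
# dated definitions, in the spirit of the D0-SAM description above.
# The attribute names and the in-memory "catalog" are hypothetical.
from datetime import date

file_catalog = [
    {"file": "raw_0001.dat", "trigger": "dimuon", "run": 101, "version": "p13"},
    {"file": "raw_0002.dat", "trigger": "dimuon", "run": 102, "version": "p14"},
    {"file": "raw_0003.dat", "trigger": "jet",    "run": 102, "version": "p14"},
]

class DatasetDefinition:
    """A named combination of attribute constraints, pinned to a date."""
    def __init__(self, name, **constraints):
        self.name = name
        self.constraints = constraints
        self.defined_on = date.today()

    def extend(self, **more):
        # Definitions can be refined iteratively into a new version.
        return DatasetDefinition(self.name, **{**self.constraints, **more})

    def snapshot(self, catalog):
        # A "snapshot" is the concrete file list the definition selects today.
        return [f["file"] for f in catalog
                if all(f.get(k) == v for k, v in self.constraints.items())]

if __name__ == "__main__":
    dimuon = DatasetDefinition("dimuon_all", trigger="dimuon")
    dimuon_p14 = dimuon.extend(version="p14")
    print(dimuon.snapshot(file_catalog))      # ['raw_0001.dat', 'raw_0002.dat']
    print(dimuon_p14.snapshot(file_catalog))  # ['raw_0002.dat']
```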

Slide 17

Use Cases -

• D0 has use cases and SAM support for some aspects – e.g. submit and execute an analysis job at a site temporarily isolated from the rest of the D0 Grid / the FNAL site. If part of the dataset is not available locally, the system retries until the network is restored, or fails and reports the amount of data unavailable for delivery and processing. This is critical for sites with unstable network connectivity, and important for all other sites during mission-critical analysis. Any output files are catalogued and stored at least locally. (A retry sketch follows this list.)

• CMS has use cases in documents from Koen.

• Atlas use cases will be discussed later in this workshop

• We also expect to benefit from the RTAG that will look at experiment use cases.
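A minimal sketch of the retry-and-report behaviour in the D0 use case above; the deliver() stub, the retry interval, and the failure report are illustrative assumptions, not SAM code:

```python
# Illustrative sketch: keep retrying delivery of the locally missing part of a
# dataset while the site is cut off, then either proceed or report exactly how
# much data could not be delivered. deliver() stands in for the real data mover.
import time

def deliver(files):
    """Pretend data mover: return the subset of files actually delivered."""
    return []  # while the network is down, nothing arrives

def run_job_with_retries(missing_files, max_attempts=5, wait_seconds=1):
    remaining = set(missing_files)
    for attempt in range(1, max_attempts + 1):
        remaining -= set(deliver(sorted(remaining)))
        if not remaining:
            print("all data local, starting analysis job")
            return True
        print(f"attempt {attempt}: {len(remaining)} files still unavailable")
        time.sleep(wait_seconds)
    # Fail, but report the amount of data that could not be delivered.
    print(f"giving up: {len(remaining)} files undeliverable: {sorted(remaining)}")
    return False

if __name__ == "__main__":
    run_job_with_retries(["raw_0001.dat", "raw_0002.dat"], max_attempts=3)
```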

Slide 18

e.g. Analysis needs depend on the life-stage of the experiment:

The 5 seasons of SAM (DRAFT):

1. Design + Commissioning – Monte Carlo, and test raw data/processing.

2. Early data processing: this is the most chaotic of all data handling periods. The data being taken is of inconsistent quality and the off-line processing is extremely immature. Some subsets of this data are reconstructed many times. Data selection strategies are put into place and many need to be modified or re-executed. Much of the emphasis is on the RAW data. Integrated luminosity is low.

3. Mid-term: around the middle of the running period, the reconstruction algorithm begins to stabilize, with new versions needed only every month or two. Much of the early data is reprocessed to provide complete and consistent data sets for physics analysis. The accelerator luminosity reaches new highs. Individual events are selected from raw data and cached at about the 10% level.

4. Late-term steady state: by the last third to quarter of the run, the inertia against changing the reconstruction program becomes very large; as data accumulates quickly, enthusiasm for change fades. The experiment enters a steady state, and the chaos is at a low. Only partial processing of data, or fixing, is attempted, due to the long lead-times caused by I/O and processing overheads. Raw events continue to be cached at the 10% level. Record luminosities are recorded.

5. Post-run: some processing is done after the data-taking period ends. No new data is added to the input repository. The caches are built and access to the raw data diminishes rapidly.

Slide 19

STACS

http://sdm.lbl.gov/projectindividual.php?ProjectID=STACS

Slide 20

References supplied by PPDG participants to date

• Proposal to NSF for CMS Analysis: an Interactive Grid-Enabled Environment (CAIGEE) – Julian Bunn, Caltech
• Grid Analysis Environment work at Caltech, April 2002 – Julian Bunn, Caltech
• Views of CMS Event Data – Koen Holtman, Caltech
• ATLAS Athena & Grid – Craig Tull, LBNL
• CMS Distributed Analysis Workshop, April 2001 – Koen Holtman, Caltech
• PPDG-8, Comparison of data grid tools' capabilities – Reagan Moore, SDSC
• Interactivity in a Batched Grid Environment – David Liu, UCB
• Deliverables document from CrossGrid WP4

Portals, UI examples, etc.:
• GENIUS: Grid Enabled web eNvironment for site Independent User job Submission – Roberto Barbera, INFN
• SciDAC CoG Kit (Commodity Grid Kit)
• ATLAS Grid Access Portal for Physics Applications – XCAT, a Common Component Architecture implementation

Slide 21

Tools, etc.

• Java Analysis Studio (JAS) – Tony Johnson, SLAC
• Distributed computing with JAS (prototype) – Tony Johnson, SLAC
• Abstract Interfaces for Data Analysis (AIDA) – Tony Johnson, SLAC
• BlueOx: Distributed Analysis with Java – Jeremiah Mans, Princeton
• Parallel ROOT Facility (PROOF) – Fons Rademakers, CERN
• Integration of ROOT and SAM – Gabriele Garzoglio, FNAL
• Clarens Remote Analysis – Conrad Steenberg, Caltech
• IMW: Interactive Master-Worker Style Parallel Data Analysis Tool on the Grid – Miron Livny, Wisconsin
• SC2001 demo of Bandwidth-Greedy Grid-enabled Object Collection Analysis for Particle Physics – Koen Holtman, Caltech

Slide 22

CS-11: Short-term Status

• The requirements document is now being outlined – Joseph Perl, Doug Olson – based on the posted contributions.

• A workshop is being planned to bring people together at LBL in mid-June (18?19?). We won't know more specifics until after the meeting. Clearly the experiments are starting to think about remote analysis (D0), analysis for Grid simulation production (CMS), and ATLAS/ALICE.

• Many experiments (will) use ROOT (& Carrot? PROOF?). In conjunction with a Run 2 visit to Fermilab, Rene will have discussions with PPDG and CS groups in the last week of May.

• Need to identify the narrow band in which PPDG can be a contributor rather than just adding to the meeting load: keep to our mission of using and extending existing tools “for real” over the short/medium term (but encourage, and do not derail, needed longer-term development work!).

Slide 23

Other Near Term Plans for PPDG

• Job Management and Scheduling Workshop – common components proposed to date are GRAM, ClassAds, GSI, DAGMan:
— review the model of Grid job and data distribution and scheduling
— review experiment technical requirements
— understand whether cross-cut activities are appropriate

• VO Policies and Procedures. Work with SiteAAA, CAS, DOE Science Grid and the experiments to put in place a US VO process and support – expecting the security people to call a phone meeting here.

• Extend contributions to and use of Glue and VDT.

• Continue and extend collaboration as part of US Physics Grid Projects and international grid projects serving HENP experiments.

• Write Year 2 Plan.

• Look towards SC2002 demos and experiment data challenges as practical milestones.