Production Coordination Area VO Meeting, Feb 11, 2009, Dan Fraser – Production Coordinator


Page 1: Production Coordination Area VO Meeting Feb 11, 2009 Dan Fraser – Production Coordinator

Production Coordination

Area VO Meeting
Feb 11, 2009

Dan Fraser – Production Coordinator

Page 2:

OSG Health Monitoring

Overview
http://t2.unl.edu/gratia/

Weekly Calls
https://twiki.grid.iu.edu/bin/view/Production/WebHome

Data movement
http://t2.unl.edu/gratia/xml/facility_transfer_volume

Job/Error ratios
https://twiki.grid.iu.edu/bin/view/MeasurementsAndMetrics/ProductionGraphs

Page 3:

A New Visual for DOE

DOE Display
http://display.grid.iu.edu/

Created by Brian Bockelman & Soichi Hayashi

Page 4:

Some Production Examples…

Effort from the entire team

CERN BDII stale entries problem fixed

GGUS-GOC Ticket automation

LIGO Production at an all-time high (Rob E.)

Currently running at Rank #1 on OSG

Looking into added stress to some NFS systems

SBGRID now at ~3000 parallel jobs

Zero length CRL problem resolved

…

Page 5:

A View from the Production Coordinator

What are the biggest problems in OSG?

Production is currently stable

Need to keep it that way! (limit changes)

Supporting VOs is difficult

How to get opportunistic storage

Current method is to talk to each site…

New strategies being explored (Tanya, Brian, Dan)

Lowering the barrier for scaling across sites

Site differences often require site-by-site investigation

OSG operated Pilot capability could be a big win!!

Page 6:

Job count (2 weeks)

Page 7:
Page 8:

http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar?vo=atlas&title=Daily%20Atlas%20Wasted%20Hours%20By%20User%20and%20Site

Page 9:
Page 10:

http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar?vo=cms&title=Daily%20CMS%20Wasted%20Hours%20By%20User%20and%20Site
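
The two graph URLs on pages 8 and 10 differ only in their `vo` and `title` query parameters. As a minimal sketch, the pattern can be reproduced in Python. The function name `wasted_hours_url` and the `label` argument are hypothetical, and whether the Gratia endpoint accepts VO names other than `atlas` and `cms` is an assumption the slides do not confirm.

```python
from urllib.parse import quote, urlencode

# Endpoint taken verbatim from the slide URLs.
BASE = "http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar"


def wasted_hours_url(vo: str, label: str) -> str:
    """Build a 'wasted hours' graph URL for one VO (hypothetical helper).

    vo    -- value of the `vo` query parameter, e.g. "atlas" or "cms"
    label -- VO name as it should appear in the graph title
    """
    title = f"Daily {label} Wasted Hours By User and Site"
    # quote_via=quote encodes spaces as %20, matching the slide URLs
    # (the default, quote_plus, would produce '+').
    query = urlencode({"vo": vo, "title": title}, quote_via=quote)
    return f"{BASE}?{query}"


print(wasted_hours_url("atlas", "Atlas"))
print(wasted_hours_url("cms", "CMS"))
```

Called with `("atlas", "Atlas")` and `("cms", "CMS")`, the helper rebuilds the two URLs shown on the slides.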

Page 11:

Solving Production Problems

Solving problems is a TEAM sport

The weekly production call has key people from all the teams that are needed to solve problems

CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics

Problems accurately prioritized and channeled to the correct avenue

Sometimes solved on the call.

Forewarning to prepare for upcoming issues.

Page 12:

Example Problems

Handling of job pre-emption (LIGO / D0)

VO Package Validation probe needed

GIP “truth in advertising”

LIGO switch to GT2 and also Condor-G job submission

Condor scaling limits in GridMon (Atlas)

Globus LSF gatekeeper bug (D0/CMS)

Security Drill successes (for T1)

Gratia probe introduction & ITB testing

Page 13:

Example Issues cont.

STEP09 monitoring (partially successful)

IceCube management of opportunistic storage

Gratia file transfer data catch up

Transition from VORS to myOSG

New location for RSV probes and ability to update from the “production” cache

Also, ensure that config_OSG does not update the probes automatically

Root Cause Analysis of CMS BDII outage

Page 14:

Example Issues cont.

Plan to localize data transfer information and upload summary transfer packets.

Globus memory leak was causing frequent reboots at BNL.

Site name mapping problem to enable different names internal to OSG.

OIM display difference (http vs https)

Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Page 15:

Example Issues cont.

Condor problem with directory creation in a multiple gateway scenario. (Nebraska)

Gratia collector problem with handling records that accumulate faster than they can be processed.

LIGO/Pegasus transition to use BDII data instead of central probe data.
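
The Gratia collector item above describes a backlog: records arrive faster than they can be processed. The arithmetic behind that can be sketched in Python. This is an illustration only, not the collector's actual logic. The function `backlog_over_time` and all the numbers are hypothetical.

```python
def backlog_over_time(arrivals, capacity):
    """Return the number of unprocessed records after each interval.

    arrivals -- records arriving in each interval (hypothetical numbers)
    capacity -- records processed per interval (hypothetical number)
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        # Work left over is what was queued plus what arrived,
        # minus what could be processed; it never goes below zero.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Sustained arrivals above capacity: the backlog grows every interval.
print(backlog_over_time([100, 100, 100, 100], capacity=80))
# A single burst followed by lighter load: the backlog drains.
print(backlog_over_time([100, 50, 50, 50], capacity=80))
```

The first call shows the case named on the slide, where the queue never drains. The second shows that a short burst is harmless as long as the average arrival rate stays below capacity.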