production coordination area vo meeting feb 11, 2009 dan fraser – production coordinator
TRANSCRIPT
Production Coordination
Area VO Meeting
Feb 11, 2009
Dan Fraser – Production Coordinator
OSG Health Monitoring
  Overview: http://t2.unl.edu/gratia/
  Weekly Calls: https://twiki.grid.iu.edu/bin/view/Production/WebHome
  Data movement: http://t2.unl.edu/gratia/xml/facility_transfer_volume
  Job/Error ratios: https://twiki.grid.iu.edu/bin/view/MeasurementsAndMetrics/ProductionGraphs
A New Visual for DOE
  DOE Display: http://display.grid.iu.edu/
  Created by Brian Bockelman & Soichi Hayashi
Some Production Examples…
  Effort from the entire team
  CERN BDII stale entries problem fixed
  GGUS-GOC Ticket automation
  LIGO Production at an all-time high (Rob E.)
    Currently running at Rank #1 on OSG
    Looking into added stress to some NFS systems
  SBGRID now at ~3000 parallel jobs
  Zero length CRL problem resolved
  …
A View from the Production Coordinator
  What are the biggest problems in OSG?
  Production is currently stable
    Need to keep it that way! (limit changes)
  Supporting VOs is difficult
    How to get opportunistic storage
      Current method is to talk to each site…
      New strategies being explored (Tanya, Brian, Dan)
    Lowering the barrier for scaling across sites
      Site differences often require site-by-site investigation
      OSG operated Pilot capability could be a big win!!
Job count (2 weeks)
http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar?vo=atlas&title=Daily%20Atlas%20Wasted%20Hours%20By%20User%20and%20Site
http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar?vo=cms&title=Daily%20CMS%20Wasted%20Hours%20By%20User%20and%20Site
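The two graph links above differ only in their `vo` and `title` query parameters. As a minimal sketch, assuming the same endpoint pattern applies to other VOs (the slides only show `atlas` and `cms`), the URLs could be assembled like this. The function name `wasted_hours_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urlencode, quote

# Endpoint taken from the two links on the slide.
GRATIA_BASE = "http://t2.unl.edu/gratia/xml/failed_dn_site_hours_bar"


def wasted_hours_url(vo: str, title: str) -> str:
    """Build a graph URL for one VO (hypothetical helper).

    quote_via=quote encodes spaces as %20, matching the links on the slide.
    """
    query = urlencode({"vo": vo, "title": title}, quote_via=quote)
    return f"{GRATIA_BASE}?{query}"


if __name__ == "__main__":
    print(wasted_hours_url("atlas", "Daily Atlas Wasted Hours By User and Site"))
    print(wasted_hours_url("cms", "Daily CMS Wasted Hours By User and Site"))
```

Whether the service accepts `vo` values other than the two shown is an assumption, not something the slides confirm.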
Solving Production Problems
  Solving problems is a TEAM sport
  The weekly production call has key people from all the teams that are needed to solve problems
    CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics
  Problems accurately prioritized and channeled to the correct avenue
    Sometimes solved on the call.
  Forewarning to prepare for upcoming issues.
Example Problems
  Handling of job pre-emption (LIGO / D0)
  VO Package Validation probe needed
  GIP “truth in advertising”
  LIGO switch to GT2 and also Condor-G job submission
  Condor scaling limits in GridMon (Atlas)
  Globus LSF gatekeeper bug (D0/CMS)
  Security Drill successes (for T1)
  Gratia probe introduction & ITB testing
Example Issues cont.
  STEP09 monitoring (partially successful)
  IceCube management of opportunistic storage
  Gratia file transfer data catch up
  Transition from VORS to myOSG
  New location for RSV probes and ability to update from the “production” cache
    Also, ensure that config_OSG does not update the probes automatically
  Root Cause Analysis of CMS BDII outage
Example Issues cont.
  Plan to localize data transfer information and upload summary transfer packets.
  Globus memory leak was causing frequent reboots at BNL.
  Site name mapping problem to enable different names internal to OSG.
  OIM display difference (http vs https)
  Site admin meeting & materials prep to help sites upgrade to OSG 1.2.
Example Issues cont.
  Condor problem with directory creation in a multiple gateway scenario. (Nebraska)
  Gratia collector problem with handling records that accumulate faster than they can be processed.
  LIGO/Pegasus transition to use BDII data instead of central probe data.