AMOD Report December 3-9, 2012
Torre Wenaus
December 11, 2012
Activities
• Datataking until 6th, the likely end of 2012 pp physics running
• Bulk reprocessing mostly done
• ~1.3M production jobs (group, MC, validation, reprocessing)
• ~2.5M analysis jobs
• ~610 analysis users
Production & Analysis
Sustained activity, production and analysis
Fluctuating analysis workload: ~6k jobs minimum (!) to ~34k maximum
Data transfer - source
Data transfer - destination
Data transfer - activity
T0 export tailing off at end of week with end of pp datataking
Reprocessing (yellow) tailing off
Tier 0, Central Services
• Tue pm: T0 LSF: slow LSF job dispatching, ALARM ticket at 23:46. Promptly answered: a reconfiguration run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Experts are looking at why the reconfiguration took so long. Ticket closed. GGUS:89202
• Sat am: CERN-PROD: ALARM: ATLAS web server down Sat am, response in 10min, resolution in ~30. Due to power outage. Closed. GGUS:89334
• Weekend: CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS (next slide). GGUS:89328
• During week, a few cases (besides alarm ticket) of T0 bsub time spiking to ~6-8 sec for <~1hr
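Spikes like these can be caught with a lightweight submission-latency probe. A minimal sketch, with the probed command parameterized (the 5-second threshold and the use of Python are illustrative, not the actual T0 monitoring):

```python
import subprocess
import sys
import time

def submission_latency(cmd):
    """Time a single trivial submission command and return the latency in
    seconds. In production the command would be e.g. a no-op bsub; here it
    is parameterized so any command can stand in."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.monotonic() - start

# Example probe: warn if dispatch takes longer than a few seconds.
latency = submission_latency([sys.executable, "-c", "pass"])
if latency > 5.0:  # ~6-8 s spikes were observed; threshold is illustrative
    print(f"WARNING: slow job submission ({latency:.1f} s)")
```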
EOSATLAS availability lapses
ADC
• Tue pm: Security ticket to ATLAS VOSupport: ATLAS creating world-writable directories. In the PanDA pilot, one directory-creation case (the job recovery directory) was missed when setting access to 770. Fixed in pre-production code. GGUS:89182
• Tue: A problem recurred in which a corrupt dCache library (libdcap.so) was disseminated by software installation, causing ANALY jobs to fail at all sites using dCache. Fixed promptly, with a new check added to prevent recurrence.
• Tue: MuonCalibration-17.2.7.4.1 not found at ANALY_MPPMU calibration site, resolved by AleDG/AleDS/Alden. Confusion over celist source (it is AGIS)
• Bulk ESD lifetime changed from 4 weeks to 3 weeks (Ueda)
• Case of duplicate GUIDs, analysis ongoing
• Thu: MUON_CALIBDISK close to full at INFN-NAPOLI, deletion run, freed sufficient space
• Weekend: SARA DATADISK filled up (next slide)
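The 770 fix for the world-writable directory issue amounts to creating directories with group access but no world bits. A minimal sketch (not the actual pilot code; note that `os.makedirs`' mode argument is filtered by the process umask, hence the explicit chmod):

```python
import os
import stat

def make_job_dir(path):
    """Create a work/recovery directory with mode 0770 (no world access).
    os.makedirs applies the process umask to its mode argument, so the
    permissions are set explicitly afterwards to guarantee 770."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)  # rwxrwx--- = 0o770
    return stat.S_IMODE(os.stat(path).st_mode)
```

Returning the resulting mode makes it easy to assert in tests that no world bits leaked through.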
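The "new check to prevent recurrence" for the corrupt libdcap.so is not spelled out here; one generic way to guard against a corrupt library being picked up is a checksum comparison against a reference value. A sketch, with hypothetical function names:

```python
import hashlib

def file_sha256(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def library_ok(path, expected_sha256):
    """True if the installed library matches its reference checksum."""
    return file_sha256(path) == expected_sha256
```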
T1 DATADISK space full
• At SARA on the weekend, a week-long, near-monotonic decline in available space finally ran out
• Sat pm: Taken out of T0 export at 10TB free
• DDM auto blacklisting didn’t kick in – when was it supposed to? 1TB? Very low…
• Mon am: Manually blacklisted
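The blacklisting logic in question is a simple free-space threshold. A sketch of the check (the thresholds are taken from the discussion above and are illustrative, not DDM's actual configuration):

```python
TB = 2**40  # one terabyte in bytes

def should_blacklist(free_bytes, threshold_bytes=10 * TB):
    """Exclude a space token from T0 export when free space drops below
    the threshold. 1 TB (the suspected DDM default) leaves almost no
    headroom; ~10 TB is closer to where SARA was taken out manually."""
    return free_bytes < threshold_bytes
```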
Tier 1 Centers
• Mon am: IN2P3: the dCache patch for the long-proxies problem (GGUS:88984), thought to have fixed the regular SRM hangups, did not actually fix the issue. The hangups recurred Tuesday; the site then put in a cron job to detect the need for, and perform, an SRM restart. No need to restart the server since. Investigations ongoing. GGUS:89111
• Mon am: RAL: Failures in input file staging, high FTS error rate. Restarted the stager and rebalanced the database which solved it. Closed. GGUS:89141
• Tue pm: Taiwan-LCG2: many job failures due to insufficient space on local disk. Site increased maximum job workdir size in schedconfig. Ticket closed but problem recurred Thu am, new ticket. Site reduced job slots on small-disk WNs. Ticket on hold for observation. GGUS:89200, 89253
• Wed am: FZK TAPE T0 export resumed after resolution of last week ticket. Some timeout failures since but not persistent. Closed. GGUS:88877
Tier 1 Centers (2)
• Thu pm: SARA: T0 export failures, quick site response and resolution, "we were overloaded with requests from jobs from another cluster. This has been blocked now..." which solved the problem. Closed. GGUS:89289
• Sat am, through weekend: FZK-LCG2: persistent <8% job failure rate due to timeouts saving files to the local SE, logged on the reopened Dec 2 ticket. Mon am update: site canceled some long-standing inactive transfers on the ATLAS write buffer pools. GGUS:89110
• Sat am: Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. Closed. GGUS:89332
• Sat pm: PIC: failing source transfers. Cured with SRM restart. Site is checking what caused the SRM failures. GGUS:89338
Other
• GGUS experts unable to reproduce issue of last week, that clicking ‘back’ twice after creating a ticket creates another one (observed in Firefox)
• Coming:
– PIC capacity at ~65% Dec 10-21 to save electricity
– Several downtimes this week (Dec 10+)
• Sites: please make clear in GOC downtime notices the scope/impact of the downtime
• With regular space issues as well as occasional hardware and other issues, exclusion from T0 export is pretty common; it would be nice to have monitoring of inclusion/exclusion status and a simplified, safer inclusion/exclusion procedure
• Noticed shifters paying attention to a site they shouldn’t need to (UTD-HEP)… how to prevent?
– https://savannah.cern.ch/support/?133697
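The wished-for in/exclusion monitoring above could be as simple as a timestamped state log per site. A hypothetical sketch (the class and its structure are invented for illustration, not an existing ADC tool):

```python
from datetime import datetime, timezone

class T0ExportStatus:
    """Track which sites are currently included in T0 export and keep a
    timestamped log of every change, so shifters can see the state at a
    glance and excluded sites are not silently forgotten."""

    def __init__(self, sites):
        self.included = set(sites)
        self.log = []

    def exclude(self, site, reason):
        self.included.discard(site)
        self.log.append((datetime.now(timezone.utc), site, "excluded", reason))

    def include(self, site, reason):
        self.included.add(site)
        self.log.append((datetime.now(timezone.utc), site, "included", reason))
```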
Thanks
• Thanks to all shifters and helpful experts!