AMOD Report December 3-9, 2012
Torre Wenaus
December 11, 2012
Activities
• Datataking until 6th, the likely end of 2012 pp physics running
• Bulk reprocessing mostly done
• ~1.3M production jobs (group, MC, validation, reprocessing)
• ~2.5M analysis jobs
• ~610 analysis users
Production & Analysis
Sustained activity, production and analysis
Fluctuating analysis workload: ~6k jobs minimum (!) to ~34k maximum
Data transfer - source
Data transfer - destination
Data transfer - activity
T0 export tailing off at end of week with end of pp datataking
Reprocessing (yellow) tailing off
Tier 0, Central Services
• Tue pm: T0 LSF: slow LSF job dispatching, ALARM ticket at 23:46. Promptly answered: a reconfiguration run at 23:00 to fix an issue was slow and reduced responsiveness to job submission. Queues refilled by 00:06. Experts are looking at why the reconfiguration took so long. Ticket closed. GGUS:89202
• Sat am: CERN-PROD: ALARM: ATLAS web server down Sat am, response in 10min, resolution in ~30. Due to power outage. Closed. GGUS:89334
• Weekend: CERN-PROD: EOS source errors and several periods of EOSATLAS instability in SLS (next slide). GGUS:89328
• During week, a few cases (besides alarm ticket) of T0 bsub time spiking to ~6-8 sec for <~1hr
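Spikes like these can be caught with a lightweight submission-latency probe. A minimal sketch, with the probed command parameterized (the 5-second threshold and the use of Python are illustrative, not the actual T0 monitoring):

```python
import subprocess
import sys
import time

def submission_latency(cmd):
    """Time a single trivial submission command and return the latency in
    seconds. In production the command would be e.g. a no-op bsub; here it
    is parameterized so any command can stand in."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.monotonic() - start

# Example probe: warn if dispatch takes longer than a few seconds.
latency = submission_latency([sys.executable, "-c", "pass"])
if latency > 5.0:  # ~6-8 s spikes were observed; threshold is illustrative
    print(f"WARNING: slow job submission ({latency:.1f} s)")
```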
EOSATLAS availability lapses
ADC
• Tue pm: Security ticket to ATLAS VOSupport: ATLAS creating world-writable directories. In the PanDA pilot, one directory-creation case (the job recovery directory) was missed when setting access to 770. Fixed in pre-production code. GGUS:89182
• Tue: A problem recurred in which a corrupt dCache library (libdcap.so) was disseminated by software installation, causing ANALY jobs to fail at all sites using dCache. Fixed promptly, with a new check added to prevent recurrence.
• Tue: MuonCalibration-17.2.7.4.1 not found at ANALY_MPPMU calibration site, resolved by AleDG/AleDS/Alden. Confusion over celist source (it is AGIS)
• Bulk ESD lifetime changed from 4 weeks to 3 weeks (Ueda)
• Case of duplicate GUIDs, analysis ongoing
• Thu: MUON_CALIBDISK close to full at INFN-NAPOLI, deletion run, freed sufficient space
• Weekend: SARA DATADISK filled up (next slide)
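The 770 fix for the world-writable directory issue amounts to creating directories with group access but no world bits. A minimal sketch (not the actual pilot code; note that `os.makedirs`' mode argument is filtered by the process umask, hence the explicit chmod):

```python
import os
import stat

def make_job_dir(path):
    """Create a work/recovery directory with mode 0770 (no world access).
    os.makedirs applies the process umask to its mode argument, so the
    permissions are set explicitly afterwards to guarantee 770."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, stat.S_IRWXU | stat.S_IRWXG)  # rwxrwx--- = 0o770
    return stat.S_IMODE(os.stat(path).st_mode)
```

Returning the resulting mode makes it easy to assert in tests that no world bits leaked through.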
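The "new check to prevent recurrence" for the corrupt libdcap.so is not spelled out here; one generic way to guard against a corrupt library being picked up is a checksum comparison against a reference value. A sketch, with hypothetical function names:

```python
import hashlib

def file_sha256(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

def library_ok(path, expected_sha256):
    """True if the installed library matches its reference checksum."""
    return file_sha256(path) == expected_sha256
```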
T1 DATADISK space full
• At SARA on the weekend, a week-long, near-monotonic decline in available space finally ran out
• Sat pm: Taken out of T0 export at 10TB free
• DDM auto blacklisting didn’t kick in – when was it supposed to? 1TB? Very low…
• Mon am: Manually blacklisted
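The blacklisting logic in question is a simple free-space threshold. A sketch of the check (the thresholds are taken from the discussion above and are illustrative, not DDM's actual configuration):

```python
TB = 2**40  # one terabyte in bytes

def should_blacklist(free_bytes, threshold_bytes=10 * TB):
    """Exclude a space token from T0 export when free space drops below
    the threshold. 1 TB (the suspected DDM default) leaves almost no
    headroom; ~10 TB is closer to where SARA was taken out manually."""
    return free_bytes < threshold_bytes
```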
Tier 1 Centers
• Mon am: IN2P3: the dCache patch for the long-proxies problem (GGUS:88984), thought to have fixed the regular SRM hangups, did not actually fix the issue. The hangups recurred Tuesday; the site then put in a cron job to detect the need for, and perform, an SRM restart. No need to restart the server since. Investigations ongoing. GGUS:89111
• Mon am: RAL: Failures in input file staging, high FTS error rate. Restarted the stager and rebalanced the database which solved it. Closed. GGUS:89141
• Tue pm: Taiwan-LCG2: many job failures due to insufficient space on local disk. Site increased maximum job workdir size in schedconfig. Ticket closed but problem recurred Thu am, new ticket. Site reduced job slots on small-disk WNs. Ticket on hold for observation. GGUS:89200, 89253
• Wed am: FZK TAPE T0 export resumed after resolution of last week ticket. Some timeout failures since but not persistent. Closed. GGUS:88877
Tier 1 Centers (2)
• Thu pm: SARA: T0 export failures, quick site response and resolution, "we were overloaded with requests from jobs from another cluster. This has been blocked now..." which solved the problem. Closed. GGUS:89289
• Sat am, through weekend: FZK-LCG2: persistent <8% job failure rate due to timeouts saving files to the local SE, logged on the reopened Dec 2 ticket. Mon am update: site canceled some long-standing inactive transfers on the ATLAS write buffer pools. GGUS:89110
• Sat am: Taiwan-LCG2: Missing file needed for production. Affected by disk maintenance, recovered by site. Closed. GGUS:89332
• Sat pm: PIC: failing source transfers. Cured with SRM restart. Site is checking what caused the SRM failures. GGUS:89338
Other
• GGUS experts unable to reproduce issue of last week, that clicking ‘back’ twice after creating a ticket creates another one (observed in Firefox)
• Coming:
– PIC capacity at ~65% Dec 10-21 to save electricity
– Several downtimes this week (Dec 10+)
• Sites: please make clear in GOC downtime notices the scope/impact of the downtime
• With regular space issues as well as occasional hardware and other issues, exclusion from T0 export is pretty common; it would be nice to have monitoring of inclusion/exclusion status and a simplified, safer inclusion/exclusion procedure
• Noticed shifters paying attention to a site they shouldn’t need to (UTD-HEP)… how to prevent?
– https://savannah.cern.ch/support/?133697
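The wished-for in/exclusion monitoring above could be as simple as a timestamped state log per site. A hypothetical sketch (the class and its structure are invented for illustration, not an existing ADC tool):

```python
from datetime import datetime, timezone

class T0ExportStatus:
    """Track which sites are currently included in T0 export and keep a
    timestamped log of every change, so shifters can see the state at a
    glance and excluded sites are not silently forgotten."""

    def __init__(self, sites):
        self.included = set(sites)
        self.log = []

    def exclude(self, site, reason):
        self.included.discard(site)
        self.log.append((datetime.now(timezone.utc), site, "excluded", reason))

    def include(self, site, reason):
        self.included.add(site)
        self.log.append((datetime.now(timezone.utc), site, "included", reason))
```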
Thanks
• Thanks to all shifters and helpful experts!