Issues in Milan


Upload: samson-griffin

Post on 29-Dec-2015


TRANSCRIPT

Page 1: Issues in Milan

• Two main problems (details in the next slides):
  – Site excluded from analysis due to corrupted installation of some releases (mainly 16.6.7)
    • This was the real showstopper
    • Several time-consuming attempts to clean up and reinstall
    • Reinstallation apparently successful, but the release was corrupted again after an hour or so
  – StoRM silently stopped processing requests
    • The underlying GPFS file system halted in an apparent deadlock, but the storage areas were still correctly mounted -> no alarm was triggered
• Unfortunate timing: the two problems occurred at the same time, during the summer holidays (reduced manpower)
  – Other, not directly related problems (air conditioning of the computing room, server h/w failures) required attention, further reducing the available manpower

Page 2: Release installation issue (solved)

• In Milan, the WNs are split between two rooms, each on a different subnet; a single NFS server provides the s/w area to both rooms through two different network adapters

• The different NFS network names confused the s/w installation system, generating a race condition between installation jobs on the different WN subsets

• Fully understood and solved (by including all WNs in a common subnet) only after three weeks
  – It was not a really difficult problem, but efforts were focused on the other, storage-related issue
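The slides do not say how the installation system serializes its jobs, so the following is only a toy illustration of this class of race, not the actual mechanism. It assumes, hypothetically, that each job takes a lock whose key is built from the NFS server name as seen from its own WN. The names `nfs-room-a`, `nfs-room-b`, `nfs-main` and `/sw` are invented for the example.

```python
class LockTable:
    """Toy in-memory lock table: at most one holder per key."""

    def __init__(self):
        self._holders = {}

    def try_acquire(self, key, job):
        # Refuse the lock if another job already holds this key.
        if key in self._holders:
            return False
        self._holders[key] = job
        return True


def key_by_visible_name(server_name, export_path):
    # Hypothetical flawed scheme: the key depends on the name the WN sees.
    return f"{server_name}:{export_path}"


def key_by_canonical_name(server_name, export_path, aliases):
    # Map every known alias to one canonical name before building the key.
    canonical = aliases.get(server_name, server_name)
    return f"{canonical}:{export_path}"


# Hypothetical aliases: one physical server, one name per network adapter.
ALIASES = {"nfs-room-a": "nfs-main", "nfs-room-b": "nfs-main"}

# Flawed scheme: two different keys, so both jobs get "exclusive" access.
flawed = LockTable()
job_a_ok = flawed.try_acquire(key_by_visible_name("nfs-room-a", "/sw"), "job-a")
job_b_ok = flawed.try_acquire(key_by_visible_name("nfs-room-b", "/sw"), "job-b")

# Canonical scheme: same key, so the second job is refused.
fixed = LockTable()
job_a_fixed = fixed.try_acquire(
    key_by_canonical_name("nfs-room-a", "/sw", ALIASES), "job-a")
job_b_fixed = fixed.try_acquire(
    key_by_canonical_name("nfs-room-b", "/sw", ALIASES), "job-b")
```

Putting all WNs in a common subnet, as was done in Milan, removes the second name altogether, which has the same effect as the canonical key in this sketch.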

Page 3: GPFS issue

• GPFS randomly enters a deadlock state

  – A GPFS thread starts waiting for an unknown condition to occur on a remote node
  – Waiter threads start to pile up on one of the Network Shared Disk (NSD) servers, waiting for the first one to complete
  – The reason for the hung thread is still not known. Possible candidates:
    • Failure of the underlying storage hardware
    • Network issues
    • GPFS bug
    • …
  – No clear sign of any of these, though
  – Very similar problem observed at the Tier1
    • They are still investigating too

  – Ticket opened with IBM support
    • We were asked to gather some debugging data, but since then the problem has occurred only twice, during non-working hours, and the system was automatically restarted

• No solution found yet, only a workaround to detect the deadlock and restart the services (GPFS and StoRM)
  – This eased the consequences of the problem, avoiding further exclusion from DDM
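The slides do not describe how the workaround detects the deadlock. One generic way to catch a file system that is mounted but no longer answering is to probe it from a child process with a timeout, since a mount check alone would still report success. The sketch below assumes that approach; `filesystem_responds`, `check_and_recover` and the `restart` callback are hypothetical names, and the actual restart of GPFS and StoRM is deliberately left to the caller.

```python
import subprocess
import sys

# One-line probe executed in a child interpreter: list the directory.
PROBE = "import os, sys; os.listdir(sys.argv[1])"


def filesystem_responds(path, timeout_s=10.0):
    """Return True if listing `path` completes within `timeout_s` seconds."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the listing succeeded.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. We do not wait for the child afterwards, because
        # a process blocked in file system I/O may not die promptly.
        proc.kill()
        return False


def check_and_recover(path, restart, timeout_s=10.0):
    """Probe `path`; call the caller-supplied `restart` if the probe fails."""
    if filesystem_responds(path, timeout_s):
        return True
    restart()
    return False
```

Running the probe in a separate process keeps the monitoring script itself from blocking on the hung file system. A check like this would typically be scheduled periodically, with `restart` wired to whatever procedure the site uses to bring the services back.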