SRM CCRC-08 and Beyond
Shaun de Witt, CASTOR Face-to-Face
Introduction
- Problems in 1.3-X, and what we are doing about them
- Positives
- Setups
- Recommendations
- Future Developments
- Release Procedures
Problems - Database: Deadlocks
- Observed at CERN and ASGC (CNAF too?); not at RAL, though we are not sure why
- Two types (loosely): daemon/daemon deadlocks and server/daemon deadlocks
- Also: startup problems, too many connections, ORA-00600 errors
Daemon/Daemon Deadlocks
- Found 'accidentally' at CERN
- Caused by multiple back-end daemons talking to the same database
- Leads to database deadlocks in the garbage collector (GC)
- In 2.7 the GC has moved into the database as a stored procedure; this could be ported to 1.3, but is not planned
Server/Daemon Deadlocks
- Caused by use of the CASTOR fillObj() API
- When filling subrequests, multiple calls can lead to two threads blocking one another: the daemon and the server both need to check the status of, and possibly modify, the subrequest
- Proposed solution is to take a lock on the request; this would stop the deadlocks, but could lead to lengthy locks
Problems - Database: Startup Problems
- Seen often at CNAF, infrequently at RAL
- TNS 'no listener' error; need to check logs at startup
- No solution at the moment; restarting cures the problem
- Could add monitoring to watch for this error
Problems - Database: Too Many Connections
- Seen at CERN; partly down to configuration, with many SRMs talking to the same database instance
- Two solutions:
  - More database hardware, so fewer SRMs share an instance; but expensive
  - Fewer threads on the server and daemon; but this may cause TCP timeout errors under load (server) or cause put/get requests to be processed too slowly (daemon)
- More on configuration later
Problems - Database: ORA-00600 (Internal Error)
- Seen at RAL and CERN
- Oracle internal error; will render the SRM useless
- Fix available from Oracle; RAL has not seen the error since applying it
- Gordon Brown at RAL can provide details
Problems - Network
- Intermittent CGSI errors
- Terminal CGSI errors
- SRM 'lock-ups'
Problems - Network: Intermittent CGSI-gSOAP Errors
- CGSI-gSOAP errors reported in the logs and to the client
- Seen 2-10 times per hour at RAL, affecting under 0.1% of requests
- Correlated in time between front-ends: both get an error at about the same time
- Cause unclear; no solution at the moment
Problems - Network: Terminal CGSI-gSOAP Errors
- All threads end up returning CGSI-gSOAP errors; can affect only one of the front-ends
- Cause unknown; does not seem correlated with load or request type
- No solution at the moment; the ASGC site report suggests a possible correlation with database deadlocks
- Need monitoring to detect this in the log file; restarting the affected front-end normally clears the problem
- A new version of the CGSI plug-in is available, but not yet tested
Problems - Network: SRM Becomes Unresponsive
- Debugging indicates all threads are stuck in recv(); cause unknown
- May have been the cause of the ATLAS 'blackouts' during the first CCRC
- New releases include recv() and send() timeouts, which should stop this; two new configurable parameters in srm2.conf
Problems - Other
- Interactions with CASTOR: behaviour when CASTOR is slow
- Needless RFIO calls loading job slots
- Bulk removal requests
- Use of the MSG field in DLF
Problems - Other: Behaviour When CASTOR Becomes Slow
- See the error "Too many threads busy with CASTOR"; this can block new requests coming in, but is a useful diagnostic of CASTOR problems
- Solution is to decrease STAGERTIMEOUT in srm2.conf: the default of 900 seconds is too long, as most clients give up after 180 seconds
- No hard and fast rule for the value; somewhere between 60 and 180 seconds is the best guess
Pin time
- Implementation 'miscommunication': too heavy a weighting was applied
- Fixed in 1.3-27; also reduce the pin lifetime in srm2.conf
Problems - Other: Needless RFIO Calls
- Identified by CERN; takes up job slots on CASTOR, timing out after 60 seconds
- Happens on all GETs without a space token; introduced when support for multiple default spaces was added
- Fix already in CVS for release 2.7 (duplicates the code path used when a space token is provided); could be backported to 1.3
Problems - Other: Bulk Removal Requests
- Sometimes produce CGSI-gSOAP errors for large numbers of files (>50), but the deletion does work; a problem on send()?
- May be load related: on one day 4/6 tests with 100 files produced this error; the next day 0/6 tests with 1000 files did
- Some discussion about dropping stager_rm and just doing nsrm; may help speed up processing, but would leave more work for the CASTOR cleaning daemon
Problems - Other
- Lots of MSG fields left blank: a problem for monitoring; addressed in 2.7, but will not be backported
- Occasional crashes traced to use of strtok (not strtok_r); fixed in 1.3-27
Positives - Request Rate
- At RAL, on one CMS front-end with 50 threads: 21K requests/hr (distribution of request types not known)
Processing speed
- Again using CMS at RAL, with the daemon running 10/5 threads: PUT requests processed in 1-5 seconds, and the same for GET requests without a tape recall
Positives
- Front-end quite stable; at RAL few interventions were required
Setups
- Different sites have different hardware setups; hope you can fill in the gaps!
RAL Setup
- 3-node RAC serving SRM-ATLAS, SRM-CMS, SRM-LHCb and SRM-ALICE
CERN Setup (single machine)
- srm-cms, srm-alice, srm-dteam and srm-ops share one database (shared-db)
- srm-atlas has its own database (atlas-db), as does srm-lhcb (lhcb-db)
CNAF Setup (single machine)
- srm-cms on its own database (cms-db)
- srm-shared on a shared database (shared-db)
ASGC Setup
- A single srm front-end with a 3-node RAC hosting srm-db, castor-db and dlf-db
Useful Configuration Parameters
Based on your setup, you will need to tune some or all of the following parameters:
- SERVERTHREADS, CASTORTHREADS, REQTHREADS, POLLTHREADS, COPYTHREADS: the more SRMs sharing a single database instance, the fewer threads should be assigned to each SRM; need to balance request and processing rates between daemon and server
- SOAPBACKLOG, SOAPRECVTIMEOUT, SOAPSENDTIMEOUT: the SOAP request backlog, and the timeouts applied to recv() and send(); best 'guesstimates' for these are 100, 60 and 60
- TIMEOUT: stager timeout in castor.conf; best 'guesstimate' is 60-180 seconds
- PINTIME: keep low
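Pulling those numbers together, a sketch of what a tuned srm2.conf might look like. The key names and the 100/60/60 guesstimates come from this talk; the file syntax, the units, and the thread counts (loosely following the RAL CMS figures earlier: 50 server threads, 10/5 daemon threads) are assumptions to be adapted per site:

```
# srm2.conf -- illustrative values only; key names from this talk,
# exact syntax, units and thread counts are site-dependent assumptions
SERVERTHREADS    50    # fewer if several SRMs share one DB instance
CASTORTHREADS    10
REQTHREADS       5
POLLTHREADS      5
COPYTHREADS      5
SOAPBACKLOG      100   # SOAP request backlog ('guesstimate')
SOAPRECVTIMEOUT  60    # seconds before a blocked recv() gives up
SOAPSENDTIMEOUT  60    # seconds before a blocked send() gives up
STAGERTIMEOUT    120   # 60-180s; default 900s is longer than clients wait
PINTIME          300   # keep low (seconds assumed)

# castor.conf (separate file): stager TIMEOUT, also 60-180 seconds
```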
Future Developments
- Move to SL4
- Move to CASTOR 2.1.7 clients
- New MoU
Move to SLC4
- URGENT: no support for SLC3, and support effort for SL3 is dwindling
- Have built and tested one version in the 1.3 series
- All new development (2.7-X) is on SL4; no new development in the 1.3 series
Move to 2.1.7 Clients
- URGENT: addresses a security vulnerability regarding proxy certificates
- Much better error messages; fewer 'unknown error' messages
- 2.1.3 clients are no longer supported or developed
- Since this requires a schema change, releases in this series will be 2.7-X
New MoU
Major new features:
- srmPurgeFromSpace: used to remove disk copies from a space; the initial implementation will only remove files that are also on tape
- VOMS-based security: will be implemented in CASTOR, but may need changes to the SRM/CASTOR interface
Future Development Summary
- New features will go into 2.7-X or later releases; 2.7-X releases run only on SLC4
- Is a port of 1.3-X to SLC4 required, especially given the security hole in 1.3?
- Will require 2.1.7 clients installed on the SRM nodes
- Timescale? End of June. A tall order!
Release Procedures
- Following problems just after CCRC: the SRM seemed to pass all tests, but the daemon failed immediately in production (CERN and RAL)
- Brought about by a 'simple' change which only affected recalls when no space token was passed
- Clear need for additional tests before release; the public s2 suite is not enough
Pre-Release Procedures
- (Re)developing a shell test tool to be delivered with the SRM:
  - Includes basic tests of all SRM functions
  - Will include testing of tape recalls where possible (i.e. not when only using a Disk1Tape0 system)
  - New tests are added as we find missing cases
  - Requires the tester to have a certificate (i.e. it cannot be run as root)
- Looking at running the FULL s2 test suite, which includes tests of a number of invalid requests; not normally run since it is VERY time-consuming
Pre-Release Procedures
- As now, s2 tests will be run over one week to try to ensure stability
- The remaining problem is stress testing: no dedicated stress tests exist, yet stress is the most likely way to catch database problems
- Could develop simple stress tests, but would they be realistic enough?