SRM CCRC-08 and Beyond Shaun de Witt CASTOR Face-to-Face


Post on 20-Jan-2016


Page 1: SRM CCRC-08 and Beyond

SRM CCRC-08 and Beyond

Shaun de Witt
CASTOR Face-to-Face

Page 2: SRM CCRC-08 and Beyond

Introduction

  Problems in 1.3-X
    And what we are doing about them
  Positives
  Setups
  Recommendations
  Future Developments
  Release Procedures

Page 3: SRM CCRC-08 and Beyond

Problems - Database
  Deadlocks
    Observed at CERN and ASGC (CNAF too?)
    Not at RAL – not sure why???
    Two types (loosely)
      Daemon/Daemon deadlocks
      Server/Daemon deadlocks
  Startup problems
  Too many connections
  ORA-00600 errors

Page 4: SRM CCRC-08 and Beyond

Daemon/Daemon deadlocks

  Found ‘accidentally’ at CERN
  Caused by multiple back-ends running and talking to the same database
  Leads to database deadlocks in GC
  In 2.7 GC has moved into the database as a procedure.
    Could be ported to 1.3, but not planned

Page 5: SRM CCRC-08 and Beyond

Server/Daemon deadlocks

  Caused by using the CASTOR fillObj() API
    When filling subrequests, multiple calls can lead to two threads blocking one another.
  Daemon and Server both need to check status and possibly modify subrequest info.
  Solution proposed is to take a lock on the request
    This would stop deadlocks
    But could lead to lengthy locks
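The proposed fix can be illustrated with a minimal Python sketch. It is not the SRM or CASTOR code: the `Request` class, its `lock` attribute and the `update_subrequests` helper are hypothetical names. The idea shown is that server and daemon threads both take one lock on the parent request before touching any of its subrequests, so they can no longer block one another part-way through; the cost is that the request stays locked for the whole update.

```python
import threading

class Request:
    """Hypothetical stand-in for an SRM request and its subrequests."""
    def __init__(self, n_subrequests):
        self.lock = threading.Lock()              # one lock per request
        self.status = ["PENDING"] * n_subrequests

def update_subrequests(request, order, new_status):
    # Take the request-level lock before touching any subrequest, so the
    # order in which subrequests are visited no longer matters.
    with request.lock:
        for i in order:
            request.status[i] = new_status

req = Request(4)
# "Server" and "daemon" visit the subrequests in opposite orders.
server = threading.Thread(target=update_subrequests, args=(req, [0, 1, 2, 3], "FAILED"))
daemon = threading.Thread(target=update_subrequests, args=(req, [3, 2, 1, 0], "DONE"))
server.start()
daemon.start()
server.join()
daemon.join()
```

Both threads finish, and all subrequests end up with the status written by whichever thread took the lock last.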

Page 6: SRM CCRC-08 and Beyond

Problems - Database
Start up problems
  Seen often at CNAF, infrequently at RAL.
  TNS – ‘no listener’ error.
    Need to check logs at startup.
  No solution at the moment
    Restarting cures the problem.
    Could add monitoring to watch for this error.
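The monitoring suggested above could start as a simple log scan. The sketch below assumes the error reaches the SRM log as Oracle's standard ORA-12541 "TNS:no listener" message; the log line layout and the sample lines are invented for illustration and would need adapting to the real log format.

```python
import re

# ORA-12541 is Oracle's "TNS:no listener" error code. The log line layout
# used below is an assumption, not the real SRM log format.
NO_LISTENER = re.compile(r"ORA-12541|TNS:\s*no listener", re.IGNORECASE)

def find_no_listener(lines):
    """Return (line_number, text) pairs for lines reporting a missing listener."""
    return [(n, line.rstrip()) for n, line in enumerate(lines, start=1)
            if NO_LISTENER.search(line)]

# Made-up sample log, for illustration only.
sample_log = [
    "2008-06-01 10:00:01 srm starting up",
    "2008-06-01 10:00:02 ORA-12541: TNS:no listener",
    "2008-06-01 10:00:03 srm ready",
]
hits = find_no_listener(sample_log)
```

A wrapper run shortly after startup could alert, or trigger the restart that currently cures the problem, whenever `hits` is not empty.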

Page 7: SRM CCRC-08 and Beyond

Problems - Database
Too many connections
  Seen at CERN
  Partly down to configuration
    Many SRMs talking to the same database instance.
  Two solutions
    More database hardware
      Fewer SRMs on the same instance
      But expensive
    Reduce threads on server and daemon
      May cause TCP timeout errors under load (server) or cause put/get requests to be processed too slowly (daemon)
  More on configuration later
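A rough way to see why both solutions help is to count connections. The sketch below assumes, hypothetically, that every server and daemon thread holds one database connection; the thread counts are illustrative and not taken from any site.

```python
def estimated_connections(srm_instances):
    """Sum server and daemon threads over all SRMs sharing one database.

    Assumes (hypothetically) one database connection per thread.
    """
    return sum(s["server_threads"] + s["daemon_threads"] for s in srm_instances)

# Illustrative numbers only: three SRMs on one database instance.
shared = [{"server_threads": 50, "daemon_threads": 15}] * 3
total = estimated_connections(shared)            # 3 * (50 + 15) = 195

# Halving the server threads on each SRM.
fewer_threads = [{"server_threads": 25, "daemon_threads": 15}] * 3
reduced = estimated_connections(fewer_threads)   # 3 * (25 + 15) = 120
```

Moving an SRM to its own database removes its whole term from the sum, while reducing threads shrinks every term at the price of lower throughput.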

Page 8: SRM CCRC-08 and Beyond

Problems - Database

ORA-00600 (Internal Error) problems
  Seen at RAL and CERN
  Oracle internal error
  Will render SRM useless
  Fix available from Oracle
    RAL has not seen it since applying the fix
    Gordon Brown at RAL can provide details

Page 9: SRM CCRC-08 and Beyond

Problems - Network

  Intermittent CGSI errors
  Terminal CGSI errors
  SRM ‘lock-ups’

Page 10: SRM CCRC-08 and Beyond

Problems - Network
Intermittent CGSI-gSOAP errors
  CGSI-gSOAP errors reported in logs and to client
  Seen 2-10 times per hour (at RAL)
  Correlation in time between front-ends
    Both will get an error at about the same time
  Cause is unclear
  No solution at the moment
    Seems to happen in < 0.1% of requests at RAL

Page 11: SRM CCRC-08 and Beyond

Problems - Network
Terminal CGSI-gSOAP errors
  All threads end up returning CGSI-gSOAP errors
  Can affect only 1 of the front ends
  Cause unknown
    Does not seem correlated with load or request type
  No solution at the moment
    ASGC site report indicated it may be correlated with database deadlocks(?)
    Need monitoring to detect this in the log file
    Restart of the affected front end normally clears the problem.
  New version of CGSI plug-in available, but not yet tested
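One way to tell the terminal case from the intermittent one in a log monitor is to look at a sliding window of recent requests: occasional errors are expected, but a window made up entirely of errors suggests the front end needs a restart. This is a sketch only; the class name, the window size and the assumption that a failed request leaves the text "CGSI-gSOAP" in its log line are all hypothetical.

```python
from collections import deque

class TerminalErrorDetector:
    """Flag a front end whose last `window` requests all hit CGSI-gSOAP errors."""
    def __init__(self, window=50):
        self.recent = deque(maxlen=window)

    def record(self, log_line):
        # Assumption: a failed request leaves "CGSI-gSOAP" in its log line.
        self.recent.append("CGSI-gSOAP" in log_line)
        return self.is_terminal()

    def is_terminal(self):
        # Only report once the window is full and contains nothing but errors.
        return len(self.recent) == self.recent.maxlen and all(self.recent)

detector = TerminalErrorDetector(window=3)
states = [detector.record(line) for line in [
    "request 1 ok",
    "request 2 CGSI-gSOAP error",
    "request 3 CGSI-gSOAP error",
    "request 4 CGSI-gSOAP error",
]]
```

With a window of 3, the detector only fires on the fourth line, once three consecutive requests have failed.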

Page 12: SRM CCRC-08 and Beyond

Problems - Network
SRM becomes unresponsive
  Debugging indicates all threads stuck in recv()
  Cause unknown
    May have been the cause of ATLAS ‘blackouts’ during first CCRC
  New releases include recv() and send() timeouts
    Should stop this
    Two new configurable parameters in srm2.conf
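The effect of a receive timeout can be shown with Python's standard socket API. This is not the gSOAP code used by the SRM, and the 0.2 second value is purely illustrative; it only shows how a timeout hands control back to a thread that would otherwise wait in recv() indefinitely.

```python
import socket

# A connected pair of sockets stands in for client and server.
server_side, client_side = socket.socketpair()

# Without a timeout, recv() on a silent peer would block indefinitely.
server_side.settimeout(0.2)   # seconds; illustrative value only

try:
    server_side.recv(1024)    # the peer never sends anything
    timed_out = False
except socket.timeout:
    timed_out = True          # the thread gets control back and can clean up
finally:
    server_side.close()
    client_side.close()
```

The same timeout applies to send() calls on that socket, so a peer that stops reading cannot hold the thread either.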

Page 13: SRM CCRC-08 and Beyond

Problems - Other

  Interactions with CASTOR
    Behaviour when CASTOR is slow
    Needless RFIO calls loading job slots
  Bulk Removal requests
  Use of MSG field in DLF

Page 14: SRM CCRC-08 and Beyond

Problems - Other
Behaviour when CASTOR becomes slow
  See error “Too many threads busy with CASTOR”
    Can block new requests coming in
    But useful diagnostic of CASTOR problems
  Solution is to decrease STAGERTIMEOUT in srm2.conf
    Default 900 secs too long
      Most clients give up after 180 secs
    No ‘hard and fast’ rule about what it should be
      Somewhere between 60 and 180 is best guess.
Pin time
  Implementation ‘miscommunication’ – too heavy a weight applied
    Fixed in 1.3-27
  Also reduce Pin Lifetime in srm2.conf
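A back-of-the-envelope calculation shows why 900 seconds is too long. The sketch below borrows the 50 threads and 21K requests/hr quoted for the RAL CMS front end elsewhere in this talk, and assumes the worst case in which every request needs CASTOR and CASTOR has stopped answering; both function names are hypothetical.

```python
def seconds_until_saturated(threads, requests_per_hour):
    """Time for incoming requests to occupy every thread if none completes."""
    return threads / (requests_per_hour / 3600.0)

def max_blocked_seconds(stager_timeout, client_timeout=180):
    """Time a thread can stay blocked after its client has already given up."""
    return max(0, stager_timeout - client_timeout)

fill_time = seconds_until_saturated(50, 21000)   # about 8.6 seconds
wasted_default = max_blocked_seconds(900)        # 720 seconds
wasted_tuned = max_blocked_seconds(120)          # 0 seconds
```

Under these assumptions the thread pool fills within seconds of CASTOR stalling, and with the default timeout each thread can then sit for up to 12 minutes on a request whose client has already gone away.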

Page 15: SRM CCRC-08 and Beyond

Problems - Other
Needless RFIO calls
  Identified by CERN
  Takes up job slots on CASTOR
    Timeout after 60 seconds
  On all GETs without a space token
  Introduced when support for multiple default spaces was introduced
  Fix already in CVS
    For release 2.7
    Duplicates code when space token provided
    Could be backported to 1.3

Page 16: SRM CCRC-08 and Beyond

Problems - Other
Bulk removal requests
  Sometimes produce CGSI-gSOAP errors for large numbers of files (>50)
    But deletion does work – problem on send()?
  May be load related
    On one day 4/6 tests with 100 files produced this error
    The next day 0/6 tests with 1000 files produced this error
  Some discussion about removing stager_rm and just doing nsrm
    May help speed up processing
    But would leave more work for the CASTOR cleaning daemon

Page 17: SRM CCRC-08 and Beyond

Problems - Other

  Lots of MSG fields left blank
    Problem for monitoring
    Addressed in 2.7
    Will not be back ported.
  Occasional crashes
    Traced to use of strtok (not the re-entrant strtok_r)
    Fixed in 1.3-27

Page 18: SRM CCRC-08 and Beyond

Positives
Request rate
  At RAL on 1 CMS front end with 50 threads: 21K requests/hr
    Distribution of type of request not known.
Processing speed
  Again using CMS at RAL
  Daemon running 10/5 threads
  Put requests in 1-5 seconds
    Same for GET requests w/o tape recall

Page 19: SRM CCRC-08 and Beyond

Positives

  Front end quite stable
    At RAL few interventions required

Page 20: SRM CCRC-08 and Beyond

SETUPS

  Different sites have different hardware setups
  Hope you can fill the gaps…!

Page 21: SRM CCRC-08 and Beyond

RAL Setup

[Diagram]
  SRM endpoints: SRM-ATLAS, SRM-CMS, SRM-LHCb, SRM-ALICE
  Database: 3 node RAC

Page 22: SRM CCRC-08 and Beyond

CERN Setup

[Diagram]
  SRM endpoints: srm-cms, srm-alice, srm-dteam, srm-ops, srm-atlas, srm-lhcb
  Databases: shared-db, atlas-db, lhcb-db
  Label: Single Machine

Page 23: SRM CCRC-08 and Beyond

CNAF Setup

[Diagram]
  SRM endpoints: srm-cms, srm-shared
  Databases: cms-db, shared-db
  Label: Single Machine

Page 24: SRM CCRC-08 and Beyond

ASGC Setup

[Diagram]
  SRM endpoint: srm
  Databases: srm-db, castor-db, dlf-db
  Label: 3 node RAC

Page 25: SRM CCRC-08 and Beyond

Useful Configuration Parameters
Based on your setup, you will need to tune some or all of the following parameters:
  SERVERTHREADS, CASTORTHREADS, REQTHREADS, POLLTHREADS, COPYTHREADS
    The more SRM instances on a single database instance, the fewer threads should be assigned to each SRM
    Need to balance request and processing rates on daemon and server
  SOAPBACKLOG, SOAPRECVTIMEOUT, SOAPSENDTIMEOUT
    Number of SOAP requests, and timeouts related to recv() and send()
    Best ‘guesstimates’ for these are 100, 60, 60
  TIMEOUT
    Stager timeout in castor.conf
    Best ‘guesstimate’ 60-180 seconds
  PINTIME
    Keep low
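The guesstimates above can be collected into a small checker that compares a site's settings with them. This is a sketch: the dictionary layout and the `review` function are hypothetical, it does not parse srm2.conf or castor.conf, and the suggested values are the rough starting points from this slide rather than hard limits.

```python
# Values taken from the slide; the layout and function names are hypothetical.
SUGGESTED = {"SOAPBACKLOG": 100, "SOAPRECVTIMEOUT": 60, "SOAPSENDTIMEOUT": 60}
TIMEOUT_RANGE = (60, 180)   # stager timeout, in seconds

def review(settings):
    """Return notes on settings that differ from the guesstimates."""
    notes = []
    for name, suggested in SUGGESTED.items():
        if name in settings and settings[name] != suggested:
            notes.append(f"{name}={settings[name]} (talk suggests {suggested})")
    timeout = settings.get("TIMEOUT")
    low, high = TIMEOUT_RANGE
    if timeout is not None and not low <= timeout <= high:
        notes.append(f"TIMEOUT={timeout} (talk suggests {low}-{high} seconds)")
    return notes

# Illustrative settings, not taken from any site.
notes = review({"SOAPBACKLOG": 100, "SOAPRECVTIMEOUT": 60,
                "SOAPSENDTIMEOUT": 300, "TIMEOUT": 900})
```

A note is only a prompt to look again; a site may have good reasons to deviate from the guesstimates.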

Page 26: SRM CCRC-08 and Beyond

Future Developments

  Move to SL4
  Move to castor clients 2.1.7
  New MoU

Page 27: SRM CCRC-08 and Beyond

Move to SLC4

  URGENT
    No support for SLC3
    Support effort for SL3 dwindling
  Have built and tested one version
    In 1.3 series
  All new developments (2.7-X) on SL4
    No new development in 1.3 series

Page 28: SRM CCRC-08 and Beyond

Move to 2.1.7 clients
  URGENT
    Addresses security vulnerability with regard to proxy certificates
  Much better error messaging
    Fewer ‘unknown error’ messages
  2.1.3 clients no longer supported or developed
  Since this requires a schema change, releases in this series will be 2.7-X

Page 29: SRM CCRC-08 and Beyond

New MoU

  Major new features:
    srmPurgeFromSpace
      Used to remove disk copies from a space
      Initial implementation will only remove files currently also on tape
    VOMS based security
      This will be implemented in CASTOR but may need changes to SRM/CASTOR interface.

Page 30: SRM CCRC-08 and Beyond

Future Development Summary
  New features will be put into 2.7-X or later releases.
  2.7-X releases only on SLC4
    Is port of 1.3-X to SLC4 required?
      Esp. given security hole in 1.3
  Will require 2.1.7 clients installed on SRM nodes
  Timescale?
    End June.
    Tall order!

Page 31: SRM CCRC-08 and Beyond

Release Procedures
  Following problems just after CCRC
    SRM seemed to pass all tests
    But daemon failed immediately in production (CERN and RAL)
  Brought about by a ‘simple’ change which only affected recalls when no space token was passed.
  Clear need for additional tests before release
    Public s2 not enough

Page 32: SRM CCRC-08 and Beyond

Pre-Release Procedures
  (Re)developing shell test tool which will be delivered with the SRM.
    To include basic tests of all SRM functions
    Will include testing of tape recalls if possible (i.e. not if only using a Disk1Tape0 system)
    New tests added when we find missing cases.
    Will require tester to have a certificate (i.e. cannot be run as root)
  Looking at running FULL s2 test suite
    This includes tests of a number of invalid requests
    Not normally run since VERY time consuming
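The selection logic described for the test tool can be sketched as follows. The real tool is a shell script; this Python version only illustrates the two rules stated above (no tape recall tests on a Disk1Tape0-only system, and no running as root). The test names and the `select_tests` function are hypothetical.

```python
import os

BASIC_TESTS = ["ping", "put", "get", "ls", "rm"]   # hypothetical test names
TAPE_TESTS = ["tape_recall"]                        # hypothetical test name

def select_tests(has_tape_backend, euid=None):
    """Choose which tests to run at this site."""
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        # The tester needs a certificate, so running as root is refused.
        raise PermissionError("run the tests as a user with a certificate, not as root")
    tests = list(BASIC_TESTS)
    if has_tape_backend:          # skip recalls on a Disk1Tape0-only system
        tests += TAPE_TESTS
    return tests

disk_only = select_tests(has_tape_backend=False, euid=1000)
with_tape = select_tests(has_tape_backend=True, euid=1000)
```

Keeping the test names in plain lists makes it easy to add a new entry each time a missing case is found.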

Page 33: SRM CCRC-08 and Beyond

Pre-Release Procedures

  As now, s2 tests will be run over 1 week to try and ensure stability
  Problem still is stress testing
    No dedicated stress tests exist
      But this is most likely to catch database problems.
    Could develop simple ones
      But would they be realistic enough?
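A ‘simple’ stress test of the kind mentioned could look like the sketch below: a pool of concurrent clients fires requests and the outcomes are tallied. Everything here is hypothetical; `fake_request` merely sleeps and stands in for a real SRM call, so the sketch says nothing about how realistic such a load would be.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    """Stand-in for one SRM call; a real test would contact the endpoint."""
    time.sleep(0.01)
    return "ok"

def stress(n_requests, n_clients, request=fake_request):
    """Fire n_requests from n_clients concurrent workers and count outcomes."""
    counts = {}
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        for outcome in pool.map(request, range(n_requests)):
            counts[outcome] = counts.get(outcome, 0) + 1
    return counts

result = stress(n_requests=200, n_clients=20)
```

Replacing `fake_request` with a mix of put, get and remove calls in realistic proportions would be the hard part, which is the concern raised above.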