Tape Monitoring. Vladimír Bahyl, IT DSS TAB. Storage Analytics Seminar, February 2011. Data & Storage Services, CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Page 1: Tape Monitoring

Data & Storage Services
CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it

Tape Monitoring

Vladimír Bahyl, IT DSS TAB

Storage Analytics Seminar, February 2011

Page 2: Tape Monitoring

Overview

• From low level
  – Tape drives; libraries
• Via middle layer
  – LEMON
  – Tape Log DB
• To high level
  – Tape Log GUI
  – SLS
• TSMOD
• What is missing?
• Conclusion

Page 3: Tape Monitoring

Low level – towards the vendors

• Oracle Service Delivery Platform (SDP)
  – Automatically opens tickets with Oracle
  – We also receive notifications
  – Requires a “hole” in the firewall, but quite useful
• IBM TS3000 console
  – Central point collecting all information from 4 (out of 5) libraries
  – Call home via Internet (not modem)
  – Engineers come on site to fix issues

Page 4: Tape Monitoring

Low level – CERN usage

• SNMP
  – Using it (traps) whenever available
  – Need MIB files with SNMPTT actuators:
    – IBM libraries send traps on errors
    – ACSLS sends activity traps
• ACSLS
  – Event log messages on multiple lines concatenated into one
  – Forwarded via syslog to central store
  – Useful for tracking issues with library components (PTP)

EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 ibm3584Trap CRITICAL
FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
SDESC
Trap for library TapeAlert 004.
EDESC
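
The ACSLS step of joining a multi-line event into a single record before it is handed to syslog could look roughly like the sketch below. The event log layout is an assumption (a new event is taken to start on any line beginning with a `YYYY-MM-DD HH:MM:SS` timestamp); the function name `join_events` and the sample text are made up for illustration and are not taken from ACSLS.

```python
import re

# Assumption: each event starts with a "YYYY-MM-DD HH:MM:SS" timestamp;
# the real ACSLS event log layout may differ.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")


def join_events(lines):
    """Yield one single-line record per (possibly multi-line) event."""
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if EVENT_START.match(line) and current:
            # A new event begins: emit the one collected so far.
            yield " ".join(current)
            current = []
        current.append(line)
    if current:
        yield " ".join(current)


# Made-up sample input, for illustration only.
sample = """\
2011-02-01 10:15:02 ACSLS event
  PTP failure reported
  between LSM 0 and LSM 1
2011-02-01 10:16:40 ACSLS event
  drive cleaned
""".splitlines()

records = list(join_events(sample))
```

Each element of `records` is then a single line that fits the one-message-per-line model of syslog.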

Page 5: Tape Monitoring

Middle layer – LEMON

• Actuators constantly check local log files
• 4 situations covered:
  1. Tape drive not operational
  2. Request stuck for at least 3600 seconds
  3. Cartridge is write protected
  4. Bad MIR (Media Information Record)
• Ticket is created = email is sent
  – All relevant information is provided within the ticket to speed up the resolution
• Workflow is followed to find a solution

Dear SUN Tape Drive maintainer team,

this is to report that a tape drive T10B661D@tpsrv963 has became non-operational.
Tape T05653 has been disabled.

PROBABLE ERRORS

01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963 not operational

IDENTIFICATION

Drive Name: T10B661D
Location: acs0,6,1,13
Serial Nr:
Volume ID: T05653
Library: SL8600_1
Model: T10000
Producer: STK
Density: 1000GC
Free Space: 0
Nb Files: 390
Status: FULL|DISABLED
Pool Name: compass7_2
Tape Server: tpsrv963
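
A log-scanning check for the four situations listed on this slide might be sketched as follows. Only the `TP033 ... not operational` pattern is taken from the sample ticket; the other three patterns, the function name `classify_line` and the situation labels are hypothetical placeholders, and this is not the LEMON actuator code.

```python
import re
from typing import Optional

# Only the TP033 pattern comes from the sample ticket; the other
# three regular expressions are hypothetical stand-ins.
SITUATIONS = [
    ("drive_not_operational", re.compile(r"TP033 - drive (\S+) not operational")),
    ("request_stuck", re.compile(r"request (\S+) stuck for (\d+) seconds")),
    ("write_protected", re.compile(r"cartridge (\S+) is write protected")),
    ("bad_mir", re.compile(r"bad MIR on (\S+)")),
]

STUCK_THRESHOLD_S = 3600  # from the slide: stuck for at least 3600 seconds


def classify_line(line: str) -> Optional[tuple]:
    """Return (situation, details) if the line should raise a ticket."""
    for name, pattern in SITUATIONS:
        match = pattern.search(line)
        if not match:
            continue
        if name == "request_stuck" and int(match.group(2)) < STUCK_THRESHOLD_S:
            return None  # stuck, but not yet long enough to alarm
        return name, match.groups()
    return None


line = "01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963 not operational"
result = classify_line(line)
```

The captured groups (drive, request or cartridge identifier) are the kind of detail a ticket would need in order to carry all relevant information.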

Page 6: Tape Monitoring

Middle layer – Tape Log DB

• CASTOR log messages from all tape servers are processed and forwarded to a central database
• Allows correlation of independent errors (not a complete list):
  – X input/output errors with Y tapes on 1 drive
  – X write errors on Y tapes on 1 drive
  – X positioning errors on Y tapes on 1 drive
  – X bad MIRs for 1 tape on Y drives
  – X write/read errors on 1 tape on Y drives
  – X positioning errors on 1 tape on Y drives
  – Too many errors on a library
• Archive for 120 days of all logs, split by VID and tape server
  – Q: What happened to this tape?
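
The "X errors on Y tapes on 1 drive" style of correlation can be illustrated with a small grouping sketch. The record layout `(error_type, drive, tape)`, the thresholds and the function name `suspicious_drives` are assumptions for illustration; the actual Tape Log DB queries are not shown in the slides.

```python
from collections import defaultdict


def suspicious_drives(errors, min_errors=3, min_tapes=2):
    """Flag drives with >= min_errors errors spread over >= min_tapes tapes.

    Many errors on many different tapes but on a single drive point at
    the drive rather than at the media.
    """
    count = defaultdict(int)
    tapes = defaultdict(set)
    for error_type, drive, tape in errors:
        key = (error_type, drive)
        count[key] += 1
        tapes[key].add(tape)
    return {
        key: (count[key], len(tapes[key]))
        for key in count
        if count[key] >= min_errors and len(tapes[key]) >= min_tapes
    }


# Hypothetical error records: (error_type, drive, tape VID)
errors = [
    ("write", "D1", "T00001"),
    ("write", "D1", "T00002"),
    ("write", "D1", "T00003"),
    ("write", "D2", "T00004"),
    ("position", "D1", "T00001"),
]
flagged = suspicious_drives(errors)
```

Swapping the roles of drive and tape in the grouping key gives the complementary "1 tape on Y drives" checks, which point at the media instead.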

Page 7: Tape Monitoring

Tape Log – the data

• Origin: rtcpd & taped log messages
  – All tape servers sending data in parallel
• Content: various file state information
• Volume:
  – Depends on the activity of the tape infrastructure
  – Past 7 days: ~30 GB of text files (raw data)
• Frequency:
  – Depends on the activity of the tape infrastructure
  – Easily > 1000 lines / second
• Format: plain text
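
As a rough cross-check of the volume figure, ~30 GB over 7 days corresponds to an average of about 50 KB per second. The calculation assumes decimal units (1 GB = 10^9 bytes), which the slide does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60

raw_bytes = 30 * 10**9          # ~30 GB of raw text in the past 7 days
period_s = 7 * SECONDS_PER_DAY  # 604800 seconds

avg_bytes_per_s = raw_bytes / period_s
print(f"average rate: {avg_bytes_per_s / 1000:.1f} KB/s")
# prints "average rate: 49.6 KB/s"
```

The ~150 KB/second quoted for data transport is about three times this average, which would be consistent with it describing busy periods rather than a weekly mean, although the slides do not say so.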

Page 8: Tape Monitoring

Tape Log – data transport

• Protocol: (r)syslog log messages
• Volume: ~150 KB/second
• Accepted delays: YES/NO
  – YES: If the tape log server cannot upload processed data into the database, it will try later, as it has a local text log file
  – NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed
• Losses acceptable: YES (to some small extent)
  – The system is only used for statistics or slow reactive monitoring
  – A serious problem will reoccur elsewhere
  – We use TCP in order not to lose messages
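
The "accepted delays: YES" case relies on the local text log acting as a buffer. A minimal sketch of that retry idea follows; the class name `SpoolingUploader` and the `upload` callable are hypothetical, and an in-memory queue and list stand in for the local log file and the database.

```python
from collections import deque


class SpoolingUploader:
    """Keep processed records locally until the database accepts them."""

    def __init__(self, upload):
        self.upload = upload      # callable that raises on failure
        self.pending = deque()    # stands in for the local text log

    def submit(self, record):
        self.pending.append(record)
        self.flush()

    def flush(self):
        """Upload in order; stop at the first failure and retry later."""
        while self.pending:
            try:
                self.upload(self.pending[0])
            except ConnectionError:
                return False      # database unreachable, keep the backlog
            self.pending.popleft()
        return True


# Demonstration with a fake database that is down for the first record.
stored = []
db_up = False


def fake_upload(record):
    if not db_up:
        raise ConnectionError("database unavailable")
    stored.append(record)


uploader = SpoolingUploader(fake_upload)
uploader.submit("record 1")   # upload fails, record stays in the backlog
db_up = True
uploader.submit("record 2")   # backlog is flushed, oldest record first
```

A record is only removed from the backlog after a successful upload, so a database outage delays data but does not lose it. This does not cover the "NO" case: messages that never reach the tape log server are never spooled in the first place.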

Page 9: Tape Monitoring

Tape Log – data storage

• Medium: Oracle database
• Data structure: 3 main tables
  – Accounting
  – Errors
  – Tape history
• Amount of data in store:
  – 2 GB
  – 15-20 million records (2 years' worth of data)
• Aging: no, data kept forever
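
To make the three-table layout concrete, here is a toy version using SQLite as a stand-in for Oracle. All table and column names, and the sample rows, are assumptions; the slides name the tables but do not show their schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Oracle here
conn.executescript("""
CREATE TABLE accounting   (ts TEXT, vid TEXT, drive TEXT, bytes INTEGER);
CREATE TABLE errors       (ts TEXT, vid TEXT, drive TEXT, kind TEXT);
CREATE TABLE tape_history (ts TEXT, vid TEXT, event TEXT);
""")

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO errors VALUES (?, ?, ?, ?)",
    [
        ("2011-01-28 15:33:05", "T05653", "T10B661D", "io"),
        ("2011-01-29 09:10:00", "T05700", "T10B661D", "io"),
        ("2011-01-29 11:45:00", "T05653", "T10B662A", "write"),
    ],
)

# Errors per drive: the kind of aggregate a trend view could plot.
per_drive = conn.execute(
    "SELECT drive, COUNT(*) FROM errors GROUP BY drive ORDER BY drive"
).fetchall()
```

Keeping the volume ID and drive in every table is what would let a single query answer questions that span accounting, errors and history for one tape.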

Page 10: Tape Monitoring

Tape Log – data processing

• No additional post-processing once data is stored in the database
• Data mining and visualization done online
  – Can take up to a minute

Page 11: Tape Monitoring

High level – Tape Log GUI

• Oracle APEX on top of data in DB
• Trends
  – Accounting
  – Errors
  – Media issues
• Graphs
  – Performance
  – Problems
• http://castortapeweb

Page 12: Tape Monitoring

High level – Tape Log GUI

Page 13: Tape Monitoring

Tape Log – pros and cons

• Pros
  – Used by DG in his talk!
  – Using standard transfer protocol
  – Only uses in-house supported tools
  – Developed quickly; requires little/no support
• Cons
  – Charting limitations
    • Can live with that; see point 1 – not worth supporting something special
  – Does not really scale
    • OK if only looking at last year’s data

Page 14: Tape Monitoring

High level – SLS

• Service view for users
• Live availability information as well as capacity/usage trends
  – Partially reuses Tape Log DB data
• Information organized per VO
  – Text and graphs
• Per day/week/month

Page 15: Tape Monitoring

High level – SLS

Page 16: Tape Monitoring

TSMOD

• Tape Service Manager on Duty
  – Weekly changing role to
    • Resolve issues
    • Talk to vendors
    • Supervise interventions
• Acts on twice-daily summary e-mail which monitors:
  – Drives stuck in (dis-)mounting
  – Drives not in production without any reason
  – Requests running or queued for too long
  – Queue size too large
  – Supply tape pools running low
  – Too many disabled tapes since the last run
• Goal: have one common place to watch
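
A summary of this kind could be assembled from a snapshot of the service state, roughly as sketched below. The `state` and `limits` dictionaries, their keys, the thresholds and the function name `build_summary` are all hypothetical; the slides list what is monitored but not how, and only some of the checks are shown.

```python
def build_summary(state, limits):
    """Return the lines of a TSMOD-style summary for one run."""
    lines = []
    if state["drives_stuck"]:
        lines.append(
            "Drives stuck in (dis-)mounting: " + ", ".join(state["drives_stuck"])
        )
    if state["queue_size"] > limits["max_queue_size"]:
        lines.append(
            f"Queue size {state['queue_size']} exceeds {limits['max_queue_size']}"
        )
    for pool, free in sorted(state["supply_pools"].items()):
        if free < limits["min_supply_tapes"]:
            lines.append(f"Supply pool {pool} running low: {free} tapes left")
    # Compare against the previous run, as the e-mail reports changes since then.
    newly_disabled = state["disabled_now"] - state["disabled_last_run"]
    if newly_disabled > limits["max_new_disabled"]:
        lines.append(f"{newly_disabled} tapes disabled since the last run")
    return lines or ["Nothing to report"]


# Hypothetical snapshot and thresholds.
state = {
    "drives_stuck": ["D7"],
    "queue_size": 120,
    "supply_pools": {"pool_a": 3, "pool_b": 40},
    "disabled_now": 12,
    "disabled_last_run": 10,
}
limits = {"max_queue_size": 100, "min_supply_tapes": 5, "max_new_disabled": 5}

summary = build_summary(state, limits)
```

Collecting every check into one list of lines matches the stated goal of having a single place for the manager on duty to watch.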

Page 17: Tape Monitoring

What is missing?

• We often need the full chain
  – When was the tape last successfully read?
  – On which drive?
  – What was the firmware of that drive?
• Users hidden within upper layers
  – We do not know which exact user is reading/writing right now
  – The only information we have is the experiment name, and that is deduced from the stager hostname
• Detailed investigations often require the request ID

Page 18: Tape Monitoring

Conclusion

• CERN has extensive tape monitoring covering all layers
• The monitoring is fully integrated with the rest of the infrastructure
• It is flexible enough to support new hardware (e.g. higher capacity media)
• The system is being improved as new requirements arise