

Xrootd Monitoring for the CMS Experiment

Abstract: During spring and summer 2011 CMS deployed Xrootd front-end servers on all US T1 and T2 sites. This allows for remote access to all experiment data and is used for user analysis, visualization, running of jobs at T2s and T3s when data is not available at local sites, and as a fail-over mechanism for data access in CMSSW jobs.

Monitoring of Xrootd infrastructure is implemented on three levels:
1. Service and data availability checks → Nagios@UNL
2. Xrootd summary monitoring → Custom analyzer → MonALISA
3. Xrootd detailed monitoring → GLED → Web, Gratia, ROOT Trees, …

L.A.T. Bauerdick (1), K. Bloom (3), B.P. Bockelman (3), D.C. Bradley (4), S. Dasu (4), I. Sfiligoi (2), A. Tadel (2), M. Tadel (2), F. Wuerthwein (2), A. Yagil (2)

(1) FNAL, (2) UC San Diego, (3) University of Nebraska-Lincoln, (4) University of Wisconsin-Madison

[Figure labels: UCSD, Caltech, UNL, Wisconsin, FNAL, Purdue, UFL, MIT]

#begin
unique_id=xrd-1335314898281000
file_lfn=/store/data/Run2011B/…/XXXX.root
file_size=1441178626
start_time=1335314780
end_time=1335314898
read_bytes=840772088
read_operations=196
read_min=300
read_max=25090476
read_average=4289653.510204
read_sigma=8040665.338339
# single-read operation statistics removed
read_vector_bytes=836030346
read_vector_operations=64
read_vector_min=132030
read_vector_max=25090476
read_vector_average=13062974.156250
read_vector_sigma=9146182.440509
read_vector_count_min=3
read_vector_count_max=512
read_vector_count_average=327.468750
read_vector_count_sigma=179.357096
read_bytes_at_close=840772088
# write operation statistics removed
user_dn=XXXX
user_vo=
user_role=
user_fqan=
client_domain=hep.wisc.edu
client_host=g22n10
server_username=cmsuser127
app_info=
server_domain=t2.ucsd.edu
server_host=uaf-7
#end
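As an illustration of how such a per-file report could be consumed, here is a minimal Python sketch. It assumes the report arrives with one key=value pair per line between `#begin` and `#end`; the function names `parse_report` and `derived_quantities` are hypothetical and not part of GLED or Gratia.

```python
def parse_report(text):
    """Parse one '#begin' ... '#end' block of key=value lines into a dict."""
    record = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "#begin":
            inside = True
        elif line == "#end":
            break
        elif inside and line and not line.startswith("#") and "=" in line:
            # Split on the first '=' only; empty values (e.g. user_vo=) are kept.
            key, _, value = line.partition("=")
            record[key] = value
    return record


def derived_quantities(record):
    """Compute a few illustrative quantities from a parsed record."""
    size = int(record["file_size"])
    read = int(record["read_bytes"])
    duration = int(record["end_time"]) - int(record["start_time"])
    return {
        "duration_s": duration,
        "fraction_read": read / size,
        "mean_rate_bytes_per_s": read / duration if duration > 0 else None,
        "mean_read_size": read / int(record["read_operations"]),
    }


# A subset of the fields from the sample report shown above.
SAMPLE = """\
#begin
file_size=1441178626
start_time=1335314780
end_time=1335314898
read_bytes=840772088
read_operations=196
read_average=4289653.510204
# single-read operation statistics removed
user_vo=
#end
"""

record = parse_report(SAMPLE)
stats = derived_quantities(record)
print(stats)
```

For the sample record, `mean_read_size` (read_bytes divided by read_operations) reproduces the reported `read_average`, which is a quick consistency check on the fields.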

References:
• AAA & FAX, at this CHEP
• GLED http://gled.org/
• MonALISA http://monalisa.caltech.edu/
• ROOT http://root.cern.ch/
• Xrootd http://xrootd.org/

1. Service & Data Availability

Nagios probes track the following core operations:
• Check redirection from meta-manager@UNL → sites
• Check authentication with CERN & OSG certificates
• Check that files can actually be read (get first 1kB)
• Mail alarms sent in case of problems

Checking of individual Xrootd servers:
• Some sites also use Nagios@UNL (historically)
  - The plan is to delegate this to sites (RSV probes exist)
• Summary monitoring also reveals a lot about server state
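The "get first 1kB" check can be sketched as a Nagios-style probe. This is only a sketch under stated assumptions: `open_remote` is a hypothetical callable standing in for an Xrootd client, the URL is a placeholder, and treating a short read as critical is a choice made here, not something the poster specifies. The exit codes 0 to 3 are the standard Nagios plugin return values.

```python
import io

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_first_kb(open_remote, url, nbytes=1024):
    """Try to read the first `nbytes` of `url`; return (exit_code, message).

    `open_remote` is a hypothetical callable returning a binary file-like
    object; a real probe would wrap an Xrootd client here.
    """
    try:
        with open_remote(url) as handle:
            data = handle.read(nbytes)
    except Exception as exc:  # any failure to open or read raises an alarm
        return CRITICAL, f"CRITICAL: cannot read {url}: {exc}"
    if len(data) < nbytes:
        # Assumption: test files are larger than nbytes, so a short read is an error.
        return CRITICAL, f"CRITICAL: short read from {url} ({len(data)} bytes)"
    return OK, f"OK: read {len(data)} bytes from {url}"


# Stand-ins for a remote file, used only to exercise the function.
def fake_open_ok(url):
    return io.BytesIO(b"\0" * 4096)


def fake_open_fail(url):
    raise IOError("redirection failed")


TEST_URL = "root://redirector.example//store/test.root"  # placeholder URL
print(check_first_kb(fake_open_ok, TEST_URL))
print(check_first_kb(fake_open_fail, TEST_URL))
```

A real plugin would print the message and call `sys.exit` with the returned code so that Nagios can trigger the mail alarm.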

2. Xrootd Summary Monitoring

All redirectors and servers send their summary monitoring UDP packets to a collector at UCSD where data is pre-processed and stored into MonALISA repository.

Examples of collected data:
• Number of connected clients
• Rates of new connections, authentications, and various errors
• Incoming and outgoing network traffic caused by Xrootd
• Server’s usage of system resources

Processing with ML plugins:
• Calculating per-site quantities, e.g. total traffic for each site
• Detecting error conditions and sending notification emails

Presentation options:
• Standard ML graphs – for individual sites / host, totals
• Dashboard
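The per-site calculation mentioned above can be illustrated with a short Python sketch. It is not the MonALISA plugin code: it assumes each host reports cumulative byte counters, that a site can be identified by the domain part of the host name, and that a counter going backwards means the server restarted. Host names other than `uaf-7.t2.ucsd.edu` are made up.

```python
from collections import defaultdict


def site_of(host):
    """Map a fully qualified host name to its site (here: the domain part)."""
    _, _, domain = host.partition(".")
    return domain or host


def per_site_rates(previous, current, interval_s):
    """Turn two snapshots of cumulative per-host byte counters into per-site rates.

    `previous` and `current` map host -> (bytes_in, bytes_out).
    Returns site -> (in_rate, out_rate) in bytes per second.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for host, (cur_in, cur_out) in current.items():
        if host not in previous:
            continue  # no baseline yet for this host
        prev_in, prev_out = previous[host]
        # A counter that went backwards suggests a server restart; skip it.
        if cur_in < prev_in or cur_out < prev_out:
            continue
        site = site_of(host)
        totals[site][0] += (cur_in - prev_in) / interval_s
        totals[site][1] += (cur_out - prev_out) / interval_s
    return {site: tuple(rates) for site, rates in totals.items()}


previous = {
    "uaf-7.t2.ucsd.edu": (0, 1000),
    "uaf-8.t2.ucsd.edu": (0, 500),      # hypothetical host
    "xrd1.example.edu": (100, 100),     # hypothetical host
}
current = {
    "uaf-7.t2.ucsd.edu": (600, 7000),
    "uaf-8.t2.ucsd.edu": (0, 6500),
    "xrd1.example.edu": (50, 400),      # counter went backwards: skipped
}
rates = per_site_rates(previous, current, interval_s=60)
print(rates)
```

Error detection could follow the same pattern, comparing a per-site error rate against a threshold before a notification email is sent.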

[Figure: data-flow diagram. Labels: Summary UDP packets, xrd-rep-snatcher.pl, MonALISA; Detailed UDP packets, UDP → TCP multiplexer, GLED, TTree writer; Development, testing]
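A UDP to TCP multiplexer, as named in the diagram, has to re-create datagram boundaries on the TCP side, because TCP delivers a byte stream. The following Python sketch shows one way to do this. The 4-byte length prefix and the class name `Multiplexer` are assumptions made for illustration; the framing used by the actual service at UCSD is not described here.

```python
import io
import struct


class Multiplexer:
    """Fan out each received datagram to every subscriber stream."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, stream):
        self.subscribers.append(stream)

    def fan_out(self, datagram):
        # Assumed framing: 4-byte big-endian length, then the datagram bytes.
        frame = struct.pack("!I", len(datagram)) + datagram
        for stream in list(self.subscribers):
            try:
                stream.write(frame)
            except (OSError, ValueError):
                self.subscribers.remove(stream)  # drop broken subscribers


def read_frames(stream):
    """Yield the original datagrams back from a framed byte stream."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!I", header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # truncated frame at end of stream
        yield payload


# In-memory streams stand in for TCP connections in this example.
mux = Multiplexer()
client_a, client_b = io.BytesIO(), io.BytesIO()
mux.subscribe(client_a)
mux.subscribe(client_b)
for packet in (b"one", b"two!"):
    mux.fan_out(packet)
client_a.seek(0)
print(list(read_frames(client_a)))
```

In a networked version the datagrams would come from `socket.recvfrom` on a UDP socket and each subscriber would be a file object obtained from an accepted TCP connection.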

3. Xrootd Detailed Monitoring

As with summary data, detailed monitoring UDP packets are also sent to UCSD. The streams are merged and made available via a UDP to TCP converter / multiplexer.

Contents of detailed monitoring streams:
• User authentication records, including their DN and VOMS info
• File-open records, including LFN by which the file was requested
• All read and write requests (offset, length, and timestamp)
• Vector-read requests (# of elements, total length, timestamp)
  - Optionally, servers can send offset & length info for each element
• Redirection records

Default processing with GLED

Complete in-memory representation of all servers, sessions and open files is required as packets are highly encoded.
• Embedded http server shows currently ongoing user sessions
• When a file is closed a detailed report is generated
  - Sent to OSG Gratia and written into ROOT trees for further analysis
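The need for in-memory state can be sketched as follows: read records only make sense relative to a previously seen file-open record, and the report is produced when the file is closed. This Python sketch is not GLED code. The class and method names are hypothetical, the report keys reuse field names from the sample report, and the population standard deviation is an assumption, since the estimator behind `read_sigma` is not stated.

```python
import math


class OpenFile:
    """Running read statistics for one open file."""

    def __init__(self, lfn):
        self.lfn = lfn
        self.count = 0
        self.total = 0
        self.sum_sq = 0
        self.min = None
        self.max = None

    def add_read(self, length):
        self.count += 1
        self.total += length
        self.sum_sq += length * length
        self.min = length if self.min is None else min(self.min, length)
        self.max = length if self.max is None else max(self.max, length)

    def report(self):
        if self.count == 0:
            average = sigma = 0.0
        else:
            average = self.total / self.count
            # Population variance with an exact integer numerator (assumed estimator).
            numerator = self.count * self.sum_sq - self.total ** 2
            sigma = math.sqrt(numerator / self.count ** 2)
        return {
            "file_lfn": self.lfn,
            "read_bytes": self.total,
            "read_operations": self.count,
            "read_min": self.min,
            "read_max": self.max,
            "read_average": average,
            "read_sigma": sigma,
        }


class Tracker:
    """Keeps every open file in memory, keyed by (server, file_id)."""

    def __init__(self):
        self.open_files = {}

    def on_open(self, server, file_id, lfn):
        self.open_files[(server, file_id)] = OpenFile(lfn)

    def on_read(self, server, file_id, offset, length):
        entry = self.open_files.get((server, file_id))
        if entry is not None:  # the open record may have been lost (UDP)
            entry.add_read(length)

    def on_close(self, server, file_id):
        entry = self.open_files.pop((server, file_id), None)
        return entry.report() if entry is not None else None


tracker = Tracker()
tracker.on_open("uaf-7.t2.ucsd.edu", 1, "/store/example.root")  # placeholder LFN
tracker.on_read("uaf-7.t2.ucsd.edu", 1, offset=0, length=100)
tracker.on_read("uaf-7.t2.ucsd.edu", 1, offset=100, length=300)
report = tracker.on_close("uaf-7.t2.ucsd.edu", 1)
print(report)
```

Keeping only running sums per file means the memory cost does not grow with the number of read requests, which matters when every read of every open file is reported.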