agents and daemons automating data quality monitoring operations

1
Agents Agents And Daemons And Daemons Automating Data Quality Monitoring Operations Automating Data Quality Monitoring Operations Since 2009 when the LHC came back to active service, the Data Quality Monitoring (DQM) team was faced with the need to homogenize and automate operations across all the different environments within which DQM is used for data certification. The main goal of automation is to reduce operator intervention at the minimum possible level, especially in the area of DQM files management, where long-term archival presented the greatest challenges. Manually operated procedures cannot cope with the constant increase in luminosity, datasets and time of operation of the CMS detector. Therefore a solid and reliable set of agents has been designed since the beginning to manage all DQM-data related work-flows. This allows to fully exploit all available resources in every condition, maximizing the performance and reducing the latency in making data available for validation and certification. The agents can be easily fine-tuned to adapt to current and future hardware constraints and they proved to be flexible enough to include unforeseen features, like an ad-hoc quota management and a real time sound alarm system. Finally an agent is a script that performs a tasks and then exits, usually invoked from a cron job, and a daemon is a script that never ends, it loops forever sleeping for small intervals as to not hug the cpu. Job Root File Job Root File Job Root File Job Root File Merged Root File DQM Archive SMPS 1 – 7 DQM Applications DQM Event Processor Root File DQM Collect or 5 Production Servers DQM Archive HLTMONDQM HLTResult s A/DQM Data Flow: 16 Storage Managers Permanen t Storage on tape DQM ZIP Archive Online: Offline: CMS Data. DQM Histograms DQM Root Files DQM Zip Files Online Chain producerFileClean er.py /home/dqmprolocal/done /home/dqmprolocal/output DQM Applications machines /dqmdata/dqm/upload aliveChek.sh fileCollecto r.py /dqmdata/dqm/repository /dqmdata/dqm/agents/ import-local /dqmdata/dqm/agents/import-offsite /dqmdata/dqm/agents/ import-test visDQMReceiveDaem oon visDQMImportDaemo n /home/ dqmlocal/ix DQM local production GUI visDQMImportDaemo n /home/ dqmlocal/ix visDQMImportDaemo n /home/ dqmlocal/ix DQM offsite production GUI DQM test production GUI visDQMDeleteDaemo n visDQMDeleteDaemo n visDQMDeleteDaemo n Offline Chain DQM Offline/RelVal server /data/srv/state/dqmgui/ flavor/upload /data/srv/state/dqmgui/ flavor/data visDQMReceiveDaem on /data/srv/state/dqmgui/flavor/ agents/register /data/srv/state/dqmgui/flavor/ agents/zip visDQMImportDaemo n /dqmdata/ offline/ix /data/srv/state/dqmgui/flavor/ agents/vcontrol visDQMVersionCont rol Online-Offline Chain DQM Offline server DQM offsite production GUI visDQMOnlineSync. py /data/srv/state/dqmgui/$flavor/ data/OnlineData createOnlineInfo /data/srv/state/dqmgui/flavor/ agents/zip Introduction: Zip Chain /data/srv/state/dqmgui/flavor/ data/zipped visDQMZipDaemon /data/srv/state/dqmgui/flavor/ agents/stageout visDQMzipCastorSt ager /data/srv/state/dqmgui/flavor/ agents/verify /data/srv/state/dqmgui/flavor/ agents/clean daily- offline visDQMzipCastorVe rifier /data/srv/state/dqmgui/flavor/ agents/freezer visDQMZipFreezeDa emon Processing Chains: In order to ensure that the generated DQM data is properly stored in tape and registered in the required GUIs, the daemons and agents are arranged in chains of processing. This allows to solve communication between the each agent and daemon, it also easily allows for branching to support multiple GUI indexes, which is a useful feature in case of recovery, and parallel operation since everything is file directory based, multiple instances of the same chain can be running at the same time on the same machine, without interfering with each other. https://cmsweb.cern.ch/dqm/online/data/browse/Original/ 00019xxxx/0001940xx/ visDQMRootFileQuotaCon trol /data/srv/state/dqmgui/flavor/ agents/qcontrol /data/srv/state/dqmgui/flavor/ agents/zip DQM Offline/RelVal server .root file .dqminfo file Index operation file Directory operation Agent Daemon Local HDD Filer Reverse proxy from cmsweb.cern.ch to dqm-prod-offsite 1. Root files processed in the last 2 Years Online: 325463 Offline: 258977 Relval: 9593 RelvalData: 64316 MonteCarlo: 6519 2. Zip files processed in the last 2 Years Online: 1776 Offline: 22180 Relval: 518 RelvalData: 4914 MonteCarlo: 5300 3. Amount of data transferred to castor: 52TB 4. Number of critical (unrecoverable errors): 0 5. Number of interventions per year: 2 6. Number of flavors running the A&D: 5 A&D In numbers: A&D Description: A/D Name Dsecription A visDQMAfsSync Syncs partially the offline repository with a folder on afs, runned by acron D visDQMCreateInfoDaemon Keeps taps on a directory and creates dqminfo files for root files that do not have them. And puts them in the zip agent drop box for archival D visDQMDeleteDaemon Deletes runs/datasets based on queue files from the GUI index D visDQMImportDaemon Adds root files to the index D visDQMIndexMergeDaemon Not shown in the chains, is a helper daemon, It is useful to merge partial indexes with the main indexes coordinating with the import daemon D visDQMOnlineSyncDaemon Downloads root files from the online GUI into the offline root repository. D visDQMReceiveDaemon It picks up files from the upload directory, checks the naming conventions, and creates dqminfo files. D visDQMRootFileQuotaCon trol Keeps root repository usage of disk space in check. D visDQMVerControlDaemon Removes older version of root files, only keeping the latest available. D visDQMZipCastorStager Stages zip archives in castor for permanent storage D visDQMZipCastorVerifie r Verifies that the archives have been copied correctly. D visDQMZipDaemon Creates zip archives with the root files. D visDQMZipFreezeDaemon Prevents new info being added to an archive, and marks it as ready for transfer to castror A aliveCheck(2).sh Ensures that the producer machine A&D are running D fileCollector2.py Collects files from producer machines in emulates an upload to the online receive daemon dropbox. D producerFileCleaner.py It keeps the disk usage of a local repository of the producer machines in check A daily-offline Among other things it removes the zip archives from the zipped repository that have been successfully transferred to castor. AFS – DQM CAF visDQMAfsSync /afs/cern.ch/cms/caf/DQM/ data/ Luis I. Lopera, on behalf of the DQM Group Los Andes University Bogotá, Colombia

Upload: denise-calhoun

Post on 04-Jan-2016

36 views

Category:

Documents


4 download

DESCRIPTION

Job Root File. Job Root File. Job Root File. Job Root File. DQM ZIP Archive. CMS Data. DQM Histograms DQM Root Files DQM Zip Files. Agent. Daemon. Local HDD. Filer. Introduction:. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Agents And Daemons Automating Data Quality Monitoring Operations

AgentsAgents And DaemonsAnd DaemonsAutomating Data Quality Monitoring OperationsAutomating Data Quality Monitoring Operations

AgentsAgents And DaemonsAnd DaemonsAutomating Data Quality Monitoring OperationsAutomating Data Quality Monitoring Operations

Since 2009 when the LHC came back to active service, the Data Quality Monitoring (DQM) team was faced with the need to homogenize and automate operations across all the different environments within which DQM is used for data certification.

The main goal of automation is to reduce operator intervention at the minimum possible level, especially in the area of DQM files management, where long-term archival presented the greatest challenges. Manually operated procedures cannot cope with the constant increase in luminosity, datasets and time of operation of the CMS detector. Therefore a solid and reliable set of agents has been designed since the beginning to manage all DQM-data related work-flows. This allows to fully exploit all available resources in every condition, maximizing the performance and reducing the latency in making data available for validation and certification. The agents can be easily fine-tuned to adapt to current and future hardware constraints and they proved to be flexible enough to include unforeseen features, like an ad-hoc quota management and a real time sound alarm system.

Finally an agent is a script that performs a tasks and then exits, usually invoked from a cron job, and a daemon is a script that never ends, it loops forever sleeping for small intervals as to not hug the cpu.

Job RootFile

Job RootFile

Job RootFile

Job RootFile

MergedRootFile

DQMArchive

DQMArchive

SMPS

1 – 7 DQM ApplicationsDQM Event Processor

RootFile

DQMCollector

DQMCollector

5 Production Servers

DQMArchive

DQMArchive

HLTMONDQMHLTResultsA/DQM

Data Flow:Data Flow:

16 S

tora

ge

Man

ager

s

Permanent Storage on

tape

Permanent Storage on

tapeDQM ZIPArchive

DQM ZIPArchive

Online:

Offline:

• CMS Data.

• DQM Histograms

• DQM Root Files

• DQM Zip Files

Online ChainOnline Chain

producerFileCleaner.pyproducerFileCleaner.py /home/dqmprolocal/done

/home/dqmprolocal/output

DQM Applications machines

/dqmdata/dqm/upload

aliveChek.shaliveChek.sh

fileCollector.pyfileCollector.py

/dqmdata/dqm/repository

/dqmdata/dqm/agents/import-local

/dqmdata/dqm/agents/import-offsite

/dqmdata/dqm/agents/import-test

visDQMReceiveDaemoonvisDQMReceiveDaemoon

visDQMImportDaemonvisDQMImportDaemon

/home/dqmlocal/ix

DQM local production GUI

visDQMImportDaemonvisDQMImportDaemon/home/dqmlocal/ix

visDQMImportDaemonvisDQMImportDaemon/home/dqmlocal/ix

DQM offsite production GUI

DQM test production GUI

visDQMDeleteDaemonvisDQMDeleteDaemon

visDQMDeleteDaemonvisDQMDeleteDaemon

visDQMDeleteDaemonvisDQMDeleteDaemon

Offline ChainOffline Chain

DQM Offline/RelVal server

/data/srv/state/dqmgui/flavor/upload

/data/srv/state/dqmgui/flavor/datavisDQMReceiveDaemonvisDQMReceiveDaemon

/data/srv/state/dqmgui/flavor/agents/register

/data/srv/state/dqmgui/flavor/agents/zip

visDQMImportDaemonvisDQMImportDaemon/dqmdata/offline/ix /data/srv/state/dqmgui/flavor/agents/vcontrol

visDQMVersionControlvisDQMVersionControl

Online-Offline ChainOnline-Offline Chain

DQM Offline server DQM offsite production GUI

visDQMOnlineSync.pyvisDQMOnlineSync.py /data/srv/state/dqmgui/$flavor/data/OnlineData

createOnlineInfocreateOnlineInfo

/data/srv/state/dqmgui/flavor/agents/zip

Introduction:Introduction:

Zip ChainZip Chain

/data/srv/state/dqmgui/flavor/data/zippedvisDQMZipDaemonvisDQMZipDaemon

/data/srv/state/dqmgui/flavor/agents/stageout visDQMzipCastorStagervisDQMzipCastorStager /data/srv/state/dqmgui/flavor/agents/verify

/data/srv/state/dqmgui/flavor/agents/cleandaily-offlinedaily-offline visDQMzipCastorVerifiervisDQMzipCastorVerifier

/data/srv/state/dqmgui/flavor/agents/freezer visDQMZipFreezeDaemonvisDQMZipFreezeDaemon

Processing Chains:Processing Chains:

In order to ensure that the generated DQM data is properly stored in tape and registered in the required GUIs, the daemons and agents are arranged in chains of processing. This allows to solve communication between the each agent and daemon, it also easily allows for branching to support multiple GUI indexes, which is a useful feature in case of recovery, and parallel operation since everything is file directory based, multiple instances of the same chain can be running at the same time on the same machine, without interfering with each other.

https://cmsweb.cern.ch/dqm/online/data/browse/Original/00019xxxx/0001940xx/

visDQMRootFileQuotaControlvisDQMRootFileQuotaControl

/data/srv/state/dqmgui/flavor/agents/qcontrol

/data/srv/state/dqmgui/flavor/agents/zip

DQM Offline/RelVal server

.root file

.dqminfo file

Index operation file

Directory operation

AgentAgent

DaemonDaemon

Local HDD

Filer

Reverse proxy from cmsweb.cern.ch to dqm-prod-offsite

1. Root files processed in the last 2 Years• Online: 325463• Offline: 258977• Relval: 9593• RelvalData: 64316• MonteCarlo: 6519

2. Zip files processed in the last 2 Years• Online: 1776• Offline: 22180• Relval: 518• RelvalData: 4914• MonteCarlo: 5300

3. Amount of data transferred to castor: 52TB4. Number of critical (unrecoverable errors): 05. Number of interventions per year: 26. Number of flavors running the A&D: 5

A&D In numbers:A&D In numbers:A&D Description:A&D Description:A/D Name Dsecription

A visDQMAfsSyncSyncs partially the offline repository with a folder on afs, runned by acron

D visDQMCreateInfoDaemonKeeps taps on a directory and creates dqminfo files for root files that do not have them. And puts them in the zip agent drop box for archival

D visDQMDeleteDaemon Deletes runs/datasets based on queue files from the GUI index

D visDQMImportDaemon Adds root files to the index

D visDQMIndexMergeDaemonNot shown in the chains, is a helper daemon, It is useful to merge partial indexes with the main indexes coordinating with the import daemon

D visDQMOnlineSyncDaemonDownloads root files from the online GUI into the offline root repository.

D visDQMReceiveDaemonIt picks up files from the upload directory, checks the naming conventions, and creates dqminfo files.

D visDQMRootFileQuotaControl Keeps root repository usage of disk space in check.

D visDQMVerControlDaemon Removes older version of root files, only keeping the latest available.

D visDQMZipCastorStager Stages zip archives in castor for permanent storage

D visDQMZipCastorVerifier Verifies that the archives have been copied correctly.

D visDQMZipDaemon Creates zip archives with the root files.

D visDQMZipFreezeDaemonPrevents new info being added to an archive, and marks it as ready for transfer to castror

A aliveCheck(2).sh Ensures that the producer machine A&D are running

D fileCollector2.pyCollects files from producer machines in emulates an upload to the online receive daemon dropbox.

D producerFileCleaner.pyIt keeps the disk usage of a local repository of the producer machines in check

A daily-offline Among other things it removes the zip archives from the zipped repository that have been successfully transferred to castor.

AFS – DQM CAF

AFS – DQM CAF

visDQMAfsSyncvisDQMAfsSync /afs/cern.ch/cms/caf/DQM/data/

Luis I. Lopera, on behalf of the DQM GroupLos Andes UniversityBogotá, Colombia