
CERN-THESIS-2016-130 05/08/2016

HappyFace as a monitoring tool for the ATLAS experiment

Dissertation

for the attainment of the doctoral degree "Doctor rerum naturalium"

of the Georg-August-Universität Göttingen

within the doctoral programme ProPhys of the Georg-August University School of Science (GAUSS)

submitted by

Haykuhi Musheghyan

from Yerevan, Armenia

Göttingen, 2016


Thesis Committee

Prof. Dr. Arnulf Quadt, II. Physikalisches Institut, Georg-August-Universität Göttingen

Prof. Dr. Ramin Yahyapour, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)

Members of the Examination Board:

First referee: Prof. Dr. Arnulf Quadt, II. Physikalisches Institut, Georg-August-Universität Göttingen

Second referee: Prof. Dr. Ramin Yahyapour, Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)

Further members of the Examination Board:

Prof. Dr. Jens Grabowski, Institute of Computer Science, Georg-August-Universität Göttingen

Prof. Dr. Wolfram Kollatschny, Institut für Astrophysik, Georg-August-Universität Göttingen

Prof. Dr. Stan Lai, II. Physikalisches Institut, Georg-August-Universität Göttingen

Jun.-Prof. Dr. Steffen Schumann, II. Physikalisches Institut, Georg-August-Universität Göttingen

Date of the oral examination: 05.08.2016

Reference: II.Physik-UniGö-Diss-2016/04


A dream doesn’t become reality through magic; it takes sweat, determination and hard work.

Colin Powell


Abstract

The importance of monitoring in HEP grid computing systems is growing due to a significant increase in their complexity. Computer scientists and administrators have been studying and building effective ways to gather information on, and clarify the status of, each local grid infrastructure. The HappyFace project addresses this workflow: it aggregates, processes and stores the information and status of different HEP monitoring resources in a common HappyFace database and displays them through a single interface.

However, this model of HappyFace relied on monitoring resources that are themselves under constant development within the HEP experiments. Consequently, HappyFace needed direct access methods to the grid application and grid service layers of the different HEP grid systems. To cope with this issue, we use a reliable HEP software repository, the CernVM File System. We propose a new implementation and architecture of HappyFace, the so-called grid-enabled HappyFace. It allows the basic framework to connect directly to the grid user applications and the grid collective services, without involving the monitoring resources of the HEP grid systems.

This approach gives HappyFace several advantages: portability, providing an independent and generic monitoring system across the HEP grid systems; functionality, allowing users to run various diagnostic tools in the individual HEP grid systems and grid sites; and flexibility, making HappyFace useful and open for various distributed grid computing environments. Different grid-enabled modules have been implemented, for example to show all current datasets on a given storage disk and to check the performance of grid transfers among the grid sites. The new HappyFace system has been successfully integrated and now displays the information and status of both the monitoring resources and the direct access to the grid user applications and the grid collective services.


Contents

Abstract iv

1 Introduction 1
  1.1 Introduction 1
  1.2 LHC 3
    1.2.1 CERN 4
  1.3 The Standard Model of Particle Physics 5
  1.4 Unification 6
  1.5 The Higgs Mechanism 7
  1.6 The Top Quark 8
  1.7 Beyond the Standard Model 9
  1.8 ATLAS 10
  1.9 Worldwide LHC Computing Grid (WLCG) 12
    1.9.1 Tier Structure 12
    1.9.2 Data Flow 14
    1.9.3 Analysis of High Energy Physics Data 15
  1.10 Phase-2 Upgrade 16

2 Grid Computing 18
  2.1 The Concept of Grid Computing 18
    2.1.1 Grid Authentication/Authorisation 19
      2.1.1.1 Authentication 20
      2.1.1.2 Public Key Infrastructure (PKI) 20
      2.1.1.3 X.509 End User Certificates 22
      2.1.1.4 X.509 Proxy Certificates 25
      2.1.1.5 Grid Security Infrastructure (GSI) 26
      2.1.1.6 Authorisation 26
      2.1.1.7 Confidentiality 27
      2.1.1.8 Grid Middleware and its Services 27
    2.1.2 Information Provider 28
    2.1.3 Grid User Interface 29
    2.1.4 Computing Element 30
      2.1.4.1 Portable Batch System (PBS) 30
    2.1.5 Storage Element 33
      2.1.5.1 Data Access (rfio, dCap) Protocols 34
      2.1.5.2 Data Transfer (GridFTP) Protocol 34
      2.1.5.3 Storage Resource Management (SRM) 35
    2.1.6 Workload Management System (WMS) 37
      2.1.6.1 Task Queue 38
      2.1.6.2 Matchmaker 38
      2.1.6.3 Information Supermarket 39
      2.1.6.4 Information Updater 39


      2.1.6.5 Job Handler 39
      2.1.6.6 Job Description Language (JDL) 39

3 Grid Middleware 41
  3.1 Introduction 41
  3.2 gLite Middleware 42
    3.2.1 gLite-VOMS 44
    3.2.2 gLite-BDII 46
    3.2.3 gLite-UI 47
  3.3 CREAM Computing Element 48
  3.4 The dCache Storage Element 49
  3.5 Description of GoeGrid 51

4 ATLAS Computing 56
  4.1 ATLAS Computing Model 56
  4.2 ATLAS Grid Information System (AGIS) 56
  4.3 Distributed Job Management System 57
  4.4 Distributed Data Management System 61
  4.5 Site Status Board (SSB) 64
  4.6 ATLAS Offline Software and Computing 65

5 Monitoring 68
  5.1 The Concept of Monitoring 68
  5.2 Meta-Monitoring 69
  5.3 Monitoring and Meta-Monitoring Tools 69

6 The HappyFace Project 71
  6.1 Description of HappyFace 71
    6.1.1 Introduction 71
    6.1.2 Basic Workflow 72
    6.1.3 Web Interface 73
    6.1.4 Rating System 74
    6.1.5 Database Structure 75
    6.1.6 Module Development 76
  6.2 HappyFace for the ATLAS Experiment 79
  6.3 Grid-Enabled HappyFace 80
    6.3.1 The Concept and Design 80
      6.3.1.1 Object-Oriented Programming (OOP) 81
      6.3.1.2 Design Patterns 83
    6.3.2 Implementation 86
    6.3.3 Results 97
      6.3.3.1 GridFTP Module 97
      6.3.3.2 GridSRM Module 99
      6.3.3.3 ATLASDDM Module 104

7 Conclusions 105
  7.1 Conclusion 105

Appendix A Agile Software Development 107


Appendix B Installing HappyFace and Building an Instance 108
  B.1 Required Packages 108
  B.2 Setting up the CVMFS Environment 109
  B.3 Installation 111
  B.4 Running an Instance 113
  B.5 Files 113
    B.5.1 happyface.cfg 114
    B.5.2 grid_ftp_module.cfg 116
    B.5.3 grid_srm_module.cfg 117
    B.5.4 atlas_ddm_module.cfg 118

Acknowledgements 127


List of Figures

1.1 The Large Hadron Collider (LHC). 3
1.2 Experiments running at the LHC. 4
1.3 An overview of the properties of the six quark particles in the SM [16]. 5
1.4 An overview of the properties of the six lepton particles in the SM [16]. 6
1.5 An overview of the different particles and their interactions in the SM [16]. 6
1.6 The graphical representation of the Higgs potential V(φ) = 1/2 µ²φ² + 1/4 λφ⁴ for the real field φ: (a) the case µ² > 0; (b) the case µ² < 0. 8
1.7 Main properties of the top quark. 8
1.8 Leading-order Feynman diagrams for tt production: gg fusion (a) and qq annihilation (b). 9
1.9 Feynman diagrams for single top production in the t-channel (a), s-channel (b) and Wt-channel (c and d). The q and q' are up- and down-type quarks. 10
1.10 ATLAS detector. 11
1.11 ATLAS detector and its subsystems. 11
1.12 CERN grid computing centers. 12
1.13 CERN tier structure. 13
1.14 The data types used by the ATLAS experiment. Taken from [23]. 15

2.1 Different VOs in the grid. Taken from [30]. 19
2.2 Different VO members can share the same grid resource. 20
2.3 Schematic view of the example: user A sends a signed document to user B. 21
2.4 Hierarchy structure for PKI. 22
2.5 X.509 end user certificate format. Taken from [42]. 23
2.6 X.509 certificate revocation list (CRL) format. Taken from [42]. 24
2.7 X.509 proxy certificate extension [45]. 25
2.8 Table of objects required for the information service. Taken from [48]. 28
2.9 Workflow of a CE. 31
2.10 Schematic view of PBS. 31
2.11 Back-fill algorithm. 33
2.12 Different batch systems with their CLIs. 33
2.13 Storage resource. 34
2.14 Basic GridFTP transfers. 35
2.15 GridFTP third-party transfer. 35
2.16 An example of copying a file to the grid storage. 36
2.17 JDL file example. 37
2.18 Structure of the WMS. Taken from [67]. 38
2.19 JDL file example. 39

3.1 Grid layered architecture. 41
3.2 gLite services [71]. 42
3.3 gLite components. 44


3.4 Overview of certificates in the gLite security mechanism. 45
3.5 An example of an executed voms-proxy-init command. 45
3.6 An example of an executed voms-proxy-info command. 46
3.7 Functions of the information service in gLite. 46
3.8 Hierarchical tree structure of MDS. 47
3.9 Hierarchical tree structure of the MDS and the BDII. 48
3.10 CREAM-WMS integration. 49
3.11 User transfers data to the cluster. 50
3.12 GoeGrid structure. 52
3.13 Nagios GoeGrid web interface. 54

4.1 GoeGrid services displayed in the AGIS web interface [101]. 58
4.2 GoeGrid PanDA queues displayed in the AGIS web interface [102]. 59
4.3 Schematic view of the PanDA system. Taken from [103]. 60
4.4 PanDA monitor GoeGrid example [105]. 60
4.5 Mapping between user credentials and Rucio accounts. Taken from [107]. 61
4.6 Aggregation hierarchy example. Taken from [106]. 62
4.7 DDM Dashboard interface [110]. 63
4.8 DDM Dashboard interface [111]. 64
4.9 The workflow of the ATLAS SSB. Taken from [113]. 65
4.10 An example of the shifter view in the SSB web interface [115]. 66

6.1 The main differences between HappyFace version 2 and version 3. 71
6.2 The workflow of HappyFace. Taken from [122]. 73
6.3 HappyFace web interface. 74
6.4 HappyFace rating system. 74
6.5 HappyFace database schema. 75
6.6 HappyFace module_instances table content. 75
6.7 The category configuration file for the Test category. 76
6.8 The module configuration file for the Test module. 77
6.9 The module Python script for the Test module. 77
6.10 The module HTML template for the Test module. 78
6.11 The web interface for the Test module. 79
6.12 The list of modules developed by the Göttingen group members. 80
6.13 HappyFace initial model. Taken from [122]. 80
6.14 HappyFace modified model. Taken from [122]. 81
6.15 An example of source code that shows class inheritance and polymorphism. 82
6.16 The output of the Python script. 83
6.17 A class's sections. 84
6.18 UML relations notation. 84
6.19 The UML diagram of the Bear class with two sub-classes (PolarBear and BrownBear). 84
6.20 The UML diagram of the Bear class with three sub-classes (PolarBear, BrownBear, TeddyBear). 85
6.21 The final UML diagram of the Bear class with its sub-classes (PolarBear, BrownBear, TeddyBear). 85
6.22 The grid-enabled HappyFace workflow. Taken from [122]. 87
6.23 The UML diagram of the envreader.py Python script classes. 88
6.24 The HappyFace configuration file structure. 89


6.25 The GridEnv class conf{} attribute. 89
6.26 The CvmfsEnv class conf{} attribute. 90
6.27 The GridCertificate class UML diagram. 91
6.28 The UML diagram of the GridSubprocess class. 93
6.29 The UML diagram of the GridFTP, the GridSRM and the ATLASDDM classes. 95
6.30 Running the (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,10p) command in the Linux terminal. 96
6.31 Running the (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,50p) command from the HappyFace system. The values of the str(start) and str(end) variables are taken from the ATLASDDM module configuration file. 96
6.32 Running the (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") command in the Linux terminal. 96
6.33 Running the same (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") command from the HappyFace system. 97
6.34 The GridFTP module web output. 98
6.35 The GridFTP module web output. 99
6.36 GridFTP module web output. 100
6.37 GridSRM module web output. 101
6.38 The GridSRM module web output. 102
6.39 The GridSRM module web output. 103
6.40 The ATLASDDM module web output. 104

B.1 default.local file. 109
B.2 install_cvmfs.sh Linux shell script. 110
B.3 Available ATLAS environments, after executing the above-mentioned commands. 111


Chapter 1

Introduction

1.1 Introduction

The Worldwide LHC Computing Grid (WLCG) project provides the computing resources for processing and analysing the data gathered by the LHC experiments. The WLCG is based on the concept of the grid and has a very complex infrastructure. This complexity requires the use of various monitoring and meta-monitoring tools, which is a critical and non-trivial task. To deal with this task, several monitoring tools have been designed and implemented over the years within the computing models of the different experiments. This thesis is related to one of these experiments, ATLAS. In ATLAS, besides the main tools described in chapter 4, there are a number of auxiliary tools. Together, these tools help to address the challenging task of ensuring the scalability and reliability of the distributed services at more than 170 sites.

One of the auxiliary tools in ATLAS is HappyFace (see Chapter 6). Initially, HappyFace was designed as a meta-monitoring tool that aggregates, processes and stores information from various monitoring tools in a single web interface. It has a modular structure, where each module is responsible for extracting, processing and storing data from certain WLCG monitoring sources. Being a meta-monitoring tool, HappyFace was strongly dependent on the WLCG monitoring sources.

During the LHC Run 1 phase, the ATLAS computing model was stable, and the minor changes applied to the model did not really affect the operation of the HappyFace modules. During this phase, HappyFace made substantial progress in terms of source code optimisation and improvement, module development and adoption by other grid sites.

However, the data growth during the LHC Run 2 phase brought major and fundamental changes to the ATLAS computing model. All ATLAS tools were tuned: some were completely redesigned, and some had functional components added to or removed from their initial models. These changes in the ATLAS computing model could not pass HappyFace by without affecting it, and as a result its stability suffered considerably. The dependency on monitoring sources became a clear weak point of HappyFace. Moreover, new features appeared that also had to be implemented in HappyFace.

The problems raised during the LHC Run 2 phase were:

• Make HappyFace a stable system again

• Reduce manual interactions of site administrators

These problems required a solution. Thus, the goal of this thesis, and my contribution, is to provide a general solution for these problems in HappyFace.


1. How to make HappyFace stable again?

Making HappyFace a stable system again was a challenging task, taking the current situation of the ATLAS computing model into account. The dependency on the WLCG monitoring sources was the weak point of HappyFace. Therefore, by solving the problem of this dependency, HappyFace becomes stable again.

To solve the problem, I decided to modify the initial model of HappyFace by designing and implementing a new extension for it, the so-called grid-enabled HappyFace. Through this newly designed extension, HappyFace gained direct access to the grid, which solves the problem (see Section 6.3).

2. How to reduce the manual interactions of site administrators?

Using the existing ATLAS software alone is not enough to check the site performance; manual interactions are required. Therefore, three different grid-enabled modules (GridFTP, GridSRM, ATLASDDM) were designed and developed (see Section 6.3.3). These modules remove the need for manual interactions when performing file transfers among sites and when listing all current datasets on a given storage disk or storage pool.

The new extension opens a new stage of development for HappyFace. Within this extension, a number of further modules can easily be developed and implemented for different purposes in HappyFace.

An overview of the structure of the thesis chapters is provided below.

The purpose of chapter 1 is to provide basic information about CERN, the LHC experiments and their physics, and to explain the need for a computing infrastructure as complex as the WLCG. Without the WLCG computing infrastructure and, in particular, without the ATLAS computing model, storing, distributing and analysing the experimental data would be impossible. The WLCG infrastructure is based on the concept of grid computing; therefore, chapters 2 and 3 describe the concept of grid computing and grid middleware. In particular, chapter 2 includes all necessary information about the security of grid systems and describes storage systems, computing elements and workload management systems.

Since ATLAS uses the gLite middleware, chapter 3 describes the main gLite components. This chapter also contains detailed information about GoeGrid. Chapter 4 describes the ATLAS computing model by providing information about the main ATLAS software; its purpose is to show the complexity of the ATLAS computing model and to present how the actual tools work. Chapter 5 describes the importance of using monitoring and meta-monitoring tools.

All the chapters mentioned above are needed to understand the existing situation and to introduce the actual research topic, which is covered in chapter 6. Chapter 6 provides a detailed description of HappyFace and its new extension, including the design and implementation of the extension as well as the module development within it. The conclusions and appendices are covered by chapter 7 and appendices A and B. Appendix B gives a step-by-step description of the installation of the grid environment and the HappyFace instance.


1.2 LHC

The Large Hadron Collider (LHC) [1] is the world's largest, most powerful and most complex particle accelerator. The LHC at CERN promises a major step forward in understanding the fundamental nature of matter. The LHC consists of a 27-kilometre ring of superconducting magnets with a number of accelerating structures that boost the energy of the particles along the way (Figure 1.1).

Figure 1.1: The Large Hadron Collider (LHC).

It started up on the 10th of September 2008, involving over 10,000 scientists and engineers from over 100 countries, as well as hundreds of participating universities and laboratories.

The first accelerator run operated between March 2010 and February 2013, initially with 3.5 teraelectronvolts (TeV) per beam (a total centre of mass energy of 7 TeV). In 2012, the energy was increased to 4 TeV per beam (a total centre of mass energy of 8 TeV). In 2015, the LHC was restarted for Run 2 [2] with proton-proton collisions at a centre of mass energy of 13 TeV. The energy unit used in particle physics is the electronvolt (eV); collision energies are typically quoted in teraelectronvolts (TeV), see Equation 1.1.

1\,\text{TeV} = 10^{12}\,\text{eV} \qquad (1.1)

This new stage allows the LHC experiments to study and explore nature and the laws of physics at higher energies.

In February 2013, the LHC's first run officially ended, and the machine was shut down for planned upgrades. "Test" collisions were restarted in the upgraded collider in April 2015, reaching 6.5 TeV per beam in May 2015 (13 TeV in total, the current world record for particle collisions). Its second research run began on schedule on 3 June 2015.

The LHC's aim is to allow physicists to test the predictions of different theories of particle and high-energy physics, in order to prove or disprove the existence of the theorised Higgs boson [3] and of the large family of new particles predicted by supersymmetric theories [4].

The Standard Model (SM) [5], [6], [7] (see also Section 1.3) of particle physics is a theory developed in the mid-1970s that describes the fundamental particles and their interactions. It has predicted a wide variety of phenomena and explains the majority of experimental results. However, the SM is not complete. There are still many open questions, and the LHC will help to answer them (see Section 1.7).


Seven different experiments have been designed and constructed at the LHC for specific kinds of research. There are four main experiments: ATLAS, CMS, LHCb and ALICE. The biggest of these are ATLAS and CMS, whose purpose is to investigate the largest possible range of physical phenomena within and beyond the SM; however, they rely on different technical solutions.

ALICE and LHCb are specialised in specific physical phenomena. ALICE focuses on studying heavy-ion collisions, in which the energy densities are extremely high and a quark-gluon plasma [11] can be produced. In the formation of a quark-gluon plasma, the energy is so high that the protons and neutrons melt and the quarks lose their tight binding to the gluons.

LHCb is specialised in studying "beauty quarks" or "b quarks" and their properties. This study is meant to explain the differences between matter and antimatter.

The detectors of these four experiments were built underground (about 100 m below the surface) in huge caverns on the LHC ring (Figure 1.2).

Figure 1.2: Experiments running at the LHC.

The LHC Computing Grid is an important part of the LHC and is the biggest computing grid in the world. It was designed at CERN (see Section 1.2.1) to manage the significant amount of data coming from the different experiments. The Worldwide LHC Computing Grid (WLCG) [13] is a global collaboration of computing centres around the world that are connected to each other via the grid infrastructure. More details about the WLCG are given in Section 1.9.

1.2.1 CERN

The European Organisation for Nuclear Research, known as CERN (the acronym derives from the name "Conseil Européen pour la Recherche Nucléaire"), is a European research organisation that operates the largest particle physics laboratory in the world. It was established in 1954 and is located in the northwest suburbs of Geneva on the Franco-Swiss border.


Particle accelerators and detectors, which are needed for high-energy physics research, are built at CERN. Accelerators bring particle beams to high energies before the beams collide with each other or with stationary targets, while detectors observe and record the results of these collisions. CERN has made several important achievements in particle physics through its different experiments. These achievements are collected in a book entitled "Prestigious Discoveries at CERN" [14].

1.3 The Standard Model of Particle Physics

The Standard Model (SM) [5], [6], [7] is the theory that describes the fundamental particles and the interactions between them. The SM covers the electromagnetic, weak and strong interactions and the classification of all known subatomic particles. It represents our current understanding of nature and explains the phenomena of particle physics. The electromagnetic interaction is described by Quantum Electrodynamics (QED) [8] and, together with the weak interaction, forms the unified electroweak theory, while the strong interaction is described by Quantum Chromodynamics (QCD) [9].

The SM includes two types of particles: fermions and bosons. Fermions have half-integer intrinsic angular momentum (spin), while bosons are particles with integer spin. Each particle has a corresponding antiparticle with the same properties, such as mass, but with opposite values of the additive quantum numbers. The fermions consist of quarks and leptons and are classified according to their interactions.

There are six types of quarks, up (u), down (d), charm (c), strange (s), top (t) and bottom (b), and six types of leptons, electron (e), electron neutrino (νe), muon (µ), muon neutrino (νµ), tau (τ) and tau neutrino (ντ).

Within each classification, the particles are grouped into pairs; each pair forms a generation whose particles show similar physical behaviour.

The six quarks are paired in three generations as shown in Figure 1.3.

Figure 1.3: An overview of the properties of six quark particles in the SM [16].

The first generation contains light and stable particles, while the subsequent generations contain unstable particles. Each quark can carry one of three possible colour charges: red, green or blue.

Correspondingly, the six leptons are paired in three generations as shown in Figure 1.4. The leptons, unlike the quarks, do not have any colour charge.

There are four main interactions: gravity, the electromagnetic interaction, the weakinteraction and the strong interaction.


Figure 1.4: An overview of the properties of six lepton particles in the SM [16].

• Gravitational interaction: the natural interaction by which all objects with mass attract each other, for example stars, planets, galaxies and even subatomic particles. For particles, this interaction is much weaker than the electromagnetic, strong and weak interactions and is therefore not considered within the SM.

• Electromagnetic interaction: occurs between electrically charged particles. Photons act as the electromagnetic force mediator.

• Strong interaction: many times stronger than the electromagnetic interaction and much stronger than the weak interaction. It binds quarks together into protons, neutrons and other hadrons. A hadron is a composite particle made of quarks held together by the strong force. Gluons act as the strong force mediator.

• Weak interaction: responsible for the radioactive decay of unstable nuclei and for the interactions of neutrinos and other leptons with matter. The W and Z bosons act as the weak force mediators. The W bosons have a positive or negative electric charge (W+: +1 and W−: −1) and are each other's antiparticles, while the Z boson is electrically neutral and is its own antiparticle. All three particles have spin 1.

All the interactions described above are shown in Figure 1.5.

Figure 1.5: An overview of the different particles and their interactions in the SM [16].

1.4 Unification

A primary goal of particle physics is to bring together, or unify, the fundamental particles and their interactions.


In the 1860s, Maxwell showed that electricity and magnetism together constitute one and the same phenomenon, called electromagnetism.

In the 1970s, Glashow, Salam and Weinberg managed to unify the weak and electromagnetic interactions into one, the electroweak force. They were later awarded the Nobel Prize in Physics in 1979 for their "contribution to the theory of the unified weak and electromagnetic interaction between elementary particles, including, inter alia, the prediction of the weak neutral current" [17].

1.5 The Higgs Mechanism

In the SM, the Higgs mechanism [18] plays an important role, as it explains how the gauge bosons and fermions acquire their masses.

In 1964, the Higgs mechanism was proposed by three independent groups of scientists (Robert Brout and François Englert; Peter Higgs; and Gerald Guralnik, C. Richard Hagen and Tom Kibble), who put forward different but related approaches to how mass could arise in local gauge theories.

In 2010, all six physicists were awarded the J. J. Sakurai Prize for Theoretical Particle Physics. On 4 July 2012, it was announced that a new particle had been observed at CERN in the mass region around 126 GeV. The observed particle corresponds to the Higgs boson previously predicted by the SM. The 2013 Nobel Prize in Physics was awarded to Peter Higgs and François Englert.

Without this mechanism, all fundamental bosons would be massless, whereas experimental results clearly show that the masses of the W and Z bosons are about 80 and 91 GeV, respectively.

In the SM, the Higgs mechanism refers to the generation of the masses of the W± and Z bosons through electroweak symmetry breaking. It also generates the masses of all fundamental fermions.

A way of describing the mechanism is to add a quantum field (the Higgs field) to the SM. Below a critical temperature, this field acquires a non-zero vacuum expectation value, which spontaneously breaks the electroweak symmetry and gives the gauge bosons their masses.

The Higgs potential can be introduced through the following Lagrangian, equation 1.2:

\mathcal{L} = \tfrac{1}{2}\,\partial_\mu\phi\,\partial^\mu\phi - V(\phi), \qquad V(\phi) = \tfrac{1}{2}\mu^2\phi^2 + \tfrac{1}{4}\lambda\phi^4 \qquad (1.2)

The graphical representation of the Higgs potential is shown in Figure 1.6. The shape of the potential depends on the sign of µ², since λ is a scalar that is always positive (λ > 0). If µ² > 0, the potential has a single minimum at φ₀ = 0 (Figure 1.6 a). If µ² < 0, V(φ) has degenerate minima where ∂V/∂φ = µ²φ + λφ³ = 0, namely at |φ₀|² = −µ²/λ ≡ v², where v is the vacuum expectation value (vev) of the scalar field φ (Figure 1.6 b).
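To spell out the minimisation of the potential in equation 1.2 (a standard textbook step, restated here only for completeness, with λ > 0):

\frac{\partial V}{\partial\phi} = \mu^2\phi + \lambda\phi^3 = \phi\left(\mu^2 + \lambda\phi^2\right) = 0
\quad\Longrightarrow\quad
\phi_0 = 0 \;\;\text{or}\;\; \phi_0^2 = -\frac{\mu^2}{\lambda} \equiv v^2 .

For µ² > 0 only φ₀ = 0 is real, giving the single minimum of Figure 1.6 a); for µ² < 0 the degenerate minima sit at φ₀ = ±v with v = \sqrt{-\mu^2/\lambda}, the vacuum expectation value used above.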

Like all the other fundamental fields, the Higgs field has an associated particle, called the Higgs boson. In the SM, the Higgs boson is a particle with zero spin and no electric charge, and it decays quickly into other particles.


Figure 1.6: The graphical representation of the Higgs potential V(φ) = 1/2 µ²φ² + 1/4 λφ⁴ for the real field φ: (a) the case µ² > 0; (b) the case µ² < 0.

1.6 The Top Quark

The top quark (t) is the heaviest of all observed fundamental particles. It is a fundamental fermion with spin 1/2 and an electric charge of +2/3. The mass of the top quark is 173.34 ± 0.27 (stat.) ± 0.71 (syst.) GeV [19]. Due to its large mass, it is the quickest particle to decay: the lifetime of the top quark is only about 0.5 × 10⁻²⁴ s, so it decays before it can hadronise. Figure 1.7 gives a general overview of the top quark properties.

Figure 1.7: Main properties of top quark.

The top quark has an antiparticle, called the antitop quark (t).


The top quark interacts through the strong interaction. However, its decay is only possible via the weak interaction. It can decay in three possible ways (a rough estimate of the corresponding fractions is given after the list):

• W boson + bottom (b) quark: the most frequent case (≈ 99.8%),

• W boson + strange (s) quark: the rare case (≈ 0.18%),

• W boson + down (d) quark: the most rare case (≈ 0.02%).
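These fractions reflect the hierarchy of the CKM matrix elements, since at leading order the partial width for t → Wq scales with |V_tq|². As a rough cross-check (the rounded CKM magnitudes below are quoted from general knowledge and are not taken from this thesis):

\mathcal{B}(t \to Wq) \;\simeq\; \frac{|V_{tq}|^2}{|V_{td}|^2 + |V_{ts}|^2 + |V_{tb}|^2},
\qquad |V_{tb}| \approx 0.999,\;\; |V_{ts}| \approx 0.04,\;\; |V_{td}| \approx 0.009,

which gives branching fractions of roughly 99.8%, 0.2% and 0.01%, consistent with the numbers listed above.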

The top quark can be produced either in pairs (pp → tt) or singly. The tt pairs are produced much more often than single top quarks.

QCD [9] describes tt production. When two protons collide, the interaction takes place between their constituent quarks and gluons, and the tt pairs are produced in these parton collisions.

At the LHC, tt production is dominated by gluon-gluon (gg) fusion (≈ 80% of cases at √s = 7 TeV), while at the Tevatron it is dominated by quark-antiquark (qq) annihilation (≈ 90% of cases at √s = 2 TeV). Figure 1.8 displays the leading-order (LO) Feynman diagrams of gg fusion and qq annihilation.

Figure 1.8: Leading-order Feynman diagrams for tt production: gg fusion (a) and qq annihilation (b).

Single top quarks are produced through the weak interaction, in three possible ways, called channels: the t-channel, the Wt-channel and the s-channel. The t-channel is the most frequently observed channel at the LHC.

These channels are displayed in Figure 1.9. Because of its large mass, the top quark may be a key to understanding the origin of the particle masses generated by the mechanism of electroweak symmetry breaking.

1.7 Beyond the Standard Model

In particle physics, the SM is the most successful theory, but it is not a perfect one. In spite of all the explanations given by the SM, a number of phenomena and observational facts remain unexplained. The most interesting ones are the following:


Figure 1.9: Feynman diagrams for single top production in the t-channel (a), s-channel (b) and Wt-channel (c and d). The q and q' are up- and down-type quarks.

• Gravity: the SM does not explain gravity. Even the most successful theories of gravity are considered incompatible with the SM.

• Dark matter and dark energy: the invisible components of the universe are called dark matter and dark energy. According to cosmological observations, the SM explains only about 5% of the present energy content of the universe; the rest is attributed to dark matter and dark energy.

• Neutrino masses: according to the SM, neutrinos are massless. However, neutrino oscillation experiments [10] have shown the exact opposite.

Explaining these phenomena may modify our understanding of the world and the SM itself.

1.8 ATLAS

The ATLAS (A Toroidal LHC ApparatuS) [20] experiment is designed to take advantage of the unprecedented energy available at the LHC and to observe phenomena involving highly massive particles that were not observable with earlier lower-energy accelerators. The ATLAS detector (Figure 1.10) is 46 metres long, 25 metres in diameter and weighs about 7,000 tonnes; it contains some 3,000 km of cable. The experiment is a collaboration involving roughly 3,000 physicists from over 175 institutions in 38 countries.

The ATLAS detector has been designed to cover a wide range of physics questions. It allows accurate measurements within the SM as well as the discovery of new physics phenomena that may occur at higher energies. The search for the Higgs boson is an example used to benchmark the performance of the various ATLAS detector subsystems.


Figure 1.10: ATLAS detector.

The ATLAS detector subsystems are located around the interaction point and are structured like the layers of an onion.

Beams of particles collide with each other, producing a large number of subatomic particles that spread out in all directions of the detector. The different sub-detectors are designed to identify the types of particles as well as to record their momentum, path and energy (Figure 1.11). The only particle that does not leave any trace in the detector is the neutrino.

Measuring the energies and properties of the different particles is the main task of the ATLAS detector and its entire infrastructure.

Figure 1.11: ATLAS detector and its subsystems.

The detectors generate unimaginably large amounts of data. To cope with this, ATLAS uses an advanced "trigger" system [22] that filters the events according to their importance before they are recorded. Even so, the amount of data that needs to be stored is huge.


1.9 Worldwide LHC Computing Grid (WLCG)

The Worldwide LHC Computing Grid (WLCG) [13] project is a global collaboration of more than 170 computing centers in 41 countries, connected by grid infrastructures. The WLCG was built on the grid concept proposed by Ian Foster and Carl Kesselman in 1999 [24].

The goal of the WLCG is to provide global computing resources to distribute, store and analyse the approximately 15 petabytes [25] of data generated by the LHC experiments at CERN every year (Figure 1.12).

Figure 1.12: CERN grid computing centers.

Scientists from all over the world, working in the four LHC experiments (ATLAS, CMS, ALICE and LHCb), are actively accessing and analysing the data. The computing system designed to handle the data has to be very flexible. The WLCG provides access to computing resources that include compute clusters, storage systems, data sources and the necessary software tools.

Scientists write a script, a so-called "job", submit it to the WLCG and wait until it executes and returns the result. The jobs can be very different, for example file transfers or complex calculations. The computing grid establishes the identity of the user, checks the credentials and, if the user is a member of a collaboration, allows the job to run. Users do not have to worry about the location of the computing resources: they can tap into the grid's computing power and access storage on demand.
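As an illustration of this workflow, the sketch below writes a minimal job description and submits it through the gLite workload management system (JDL and the WMS are introduced in Sections 2.1.6 and 3.2; the file name, executable and command options here are illustrative assumptions, and a valid VOMS proxy is assumed to exist):

import subprocess
from pathlib import Path

# Minimal job description in the Job Description Language (JDL).
# The executable and sandbox file names are placeholders.
jdl_text = '''\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
'''
Path("hello.jdl").write_text(jdl_text)

# Submit the job via the gLite WMS command-line client ("-a" requests
# automatic proxy delegation); this assumes the gLite UI tools are installed.
subprocess.run(["glite-wms-job-submit", "-a", "hello.jdl"], check=True)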

1.9.1 Tier Structure

Dealing with tens of petabytes of data is not an easy task; it requires careful planning and organisation. The computing centers differ in their resources and geographical locations. The tiered structure [26] groups these computing centres according to their location in order to serve a community of more than 8,000 scientists. The computing centers play a very important role: they store all the necessary data (raw and analysed) from all over the world.

The LHC computing infrastructure has a four-layer tier structure (Figure 1.13).


Figure 1.13: CERN tier structure.

The Tier-0 centers are the main computing centers, located at CERN in Geneva, Switzerland, and at the Wigner Research Centre for Physics in Budapest, Hungary. The Tier-0 center in Budapest is designed as an extension of the CERN Tier-0 center. These two computing centers are connected with each other by 100 Gbit/s data links.

Here, the raw or original data coming directly from the different experiments are stored. All raw data are required to be kept in permanent storage. Initial processing of the data is performed on site to provide rapid feedback to the detector operation. Afterwards, the data are sent to the other computing centers for further analysis. The role of these computing centers is to provide analysis capacity for the scientists and users. Some resources are usually intended for local users, while others, intended for simulations, can be provided to the experiments.

There are 13 Tier-1 computing centers in total. These are very large national computing centers. They receive data directly from CERN and provide additional permanent storage. They also provide computing resources for data reprocessing. Owing to their special role, they provide reliable services to the community, such as databases and catalogs.

There are 160 Tier-2 centers in total. Their role is to provide storage capacity and computing resources for specific analysis tasks. Typically, they are associated with large disk storage, providing temporary storage for the data required for analysis.

The Tier-3 centers are the smallest computing centers, located at different universities and laboratories. Their main role is to provide local clusters or individual PCs.


1.9.2 Data Flow

The main data source for the computing model is the Event Filter (EF) [27]. The input and output data of the EF require different network connection speeds. For example, the input data require approximately a 10x10 Gbps network connection with very high reliability, while the output data require approximately a 320 MB/s (3 Gbps) connection to the first-pass processing facility. For remote sites the network requirement is higher, approximately 10 Gbps to the remote site.

The streaming of data at the EF should be kept conservative, and for this reason the computing model proposes the use of a single stream, which contains all physics events passing from the Event Filter to the Tier-0. There are also other supporting streams, for example the calibration, express and pathological streams.

The calibration stream contains calibration trigger events and is used to produce calibrations of sufficient quality to allow a first-pass processing with minimum latency.

The express stream contains approximately 5% of the full data. This stream is used to speed up the data reconstruction and is designed to provide early access to the raw data and the calibration streams.

The pathological stream contains pathological events, for example events that fail in the EF. Typically, these events pass the standard Tier-0 processing, but in case of failure they get special treatment from the development team.

After the raw data arrive at the input-disk buffer of the Tier-0 site, they pass through several steps:

• the data are copied to the CERN Advanced STORage manager (CASTOR) tape storage [28] at CERN,

• the data are copied to the permanent mass storage of one of the Tier-1 sites,

• the corresponding calibration stream events are used for the calibration and alignment of the raw data,

• once an appropriate calibration is available, the first-pass reconstruction runs on the primary event stream, which contains all physics trigger information; the result is stored on CASTOR tape storage,

• the data are copied to each Tier-1 site,

• the copied data are replicated to the Tier-2 sites.

The raw data transfer to the Tier-1 sites is a very important aspect, as these sites are the main sources for later data reprocessing and reconstruction. They can also be used as additional capacity in case there is a backlog of first-pass processing at the Tier-0.


1.9.3 Analysis of High Energy Physics Data

In HEP experiments, the analysis of data requires a complicated chain of data processing and data reduction. The huge amounts of data recorded by the experiments have to be reconstructed and processed before they become available to scientists from all over the world. The data need to pass through several stages.

The first stage is the so-called raw data, generated by the detectors of the experiment. Raw data consist of the raw measurements of the detectors: time measurements, channel numbers, charge depositions and other signals. Integrated over a year of data taking, this is expected to amount to a petabyte. The data have to be stored safely on permanent storage.

The second stage is called reconstruction. During this stage, the data are processed and the raw numbers are converted into physical parameters. Pattern recognition algorithms then translate the parameters of the observed particles into their momenta and energies. The output of this processing step is called reconstructed data.

The reconstructed data then have to be distributed to the scientists for further detector studies and specific analyses.

To simplify the analysis procedure, only the most valuable quantities are stored in separate streams. These data are called AOD (Analysis Object Data). The event size is significantly reduced; the amount of AOD per year is expected to be approximately 100 TB.

During the analysis of the data, scientists perform further data reductions and define their own ad hoc data formats. In ATLAS these datasets are known as DPD (Derived Physics Data), in CMS as PAT skims (Physics Analysis Toolkit).

The data types used by the ATLAS experiment are shown in Figure 1.14.

Figure 1.14: The data types used by the ATLAS experiment. Taken from [23].

Monte Carlo (MC) events get special treatment. In the process of the analysis, it is necessary to carefully study the sensitivity and coverage of the detectors. The basis of


such studies is a thorough and detailed simulation of events.

1.10 Phase-2 Upgrade

The LHC is the largest and most powerful particle accelerator in the world (see Section 1.2). In the period between 2013 and 2015, the LHC was in a technical stop. This two-year period was enough to prepare the machine for running at almost double the energy of LHC Run 1. The machine has been restarted and is running with proton-proton collisions at a center of mass energy of 13 TeV (the Run 2 phase).

The preparation for LHC Run 2 [2] involved not only improvements to the machine itself, but also a long and complex preparation of all the necessary hardware and software, including increasing the storage capacity, changing existing software versions, maintenance, monitoring, etc. During the shutdown, the necessary systems (hardware and software) were verified, and some were renovated and upgraded.

The main improvements for different LHC experiments are described below.

ALICE

In the ALICE experiment, which studies the quark-gluon plasma, 19 sub-detectors have been improved. Among these was the electromagnetic calorimeter, used for measuring the energy of electrons, positrons and photons produced in the collisions. The range of detection was extended.

Several modules were added to other sub-detectors. Even cables of tens of kilometres were replaced as part of a complete renovation of the electrical infrastructure.

The trigger and data-acquisition systems were improved, and the data registration rate has increased by almost a factor of two. The storage capacity has increased by almost a factor of two as well.

ATLAS

The ATLAS detector was improved by adding a fourth layer of pixels to its pixel tracker. General improvements were also made to the muon detectors and calorimeters. Here the basic infrastructure was tuned (including the electrical power supply and the cooling systems).

The trigger and data-acquisition systems were improved as well. ATLAS is now able to log approximately twice as much data as in LHC Run 1. In addition, the simulation, reconstruction and data-analysis software, which is mainly used by physicists to run their own analyses, was also upgraded and renewed.

CMS

For the CMS experiment, the major work was done on its tracker. It can now operate at a lower temperature. It was also equipped with a new leak-tightness system and a renovated cooling system.

Other improvements were also applied. The beam tube, where the collisions occur, was replaced by a tube with a smaller diameter. A new sub-detector, the pixel luminosity telescope, was mounted on both sides of the detector, increasing the experiment's ability to measure luminosity. New muon chambers were installed and the hadron calorimeter was fitted with upgraded photodetectors. Finally, the trigger and data-acquisition systems were improved, and the corresponding software and computing system were significantly changed in order to reduce the time needed to produce the analysis datasets.

LHCb

HeRSChel is the forward shower counter project for LHCb. The LHCb experiment, which investigates beauty particles, added the HeRSChel detector along the beam line for identifying rare processes in which particles are observed inside the detector but not along the beam line itself. The beam pipe was replaced with a new one.

Chapter 2

Grid Computing

2.1 The Concept of Grid Computing

Grid computing [29] was developed in the early 1990s to make computing power easy to access. A grid is a collection of computers located in various places that cooperate to solve complex tasks. The grid can also be seen as a distributed system.

In comparison to other systems, the grid is focused on the amount of work performed over a period of time, while the others are mostly focused on greater performance in terms of floating-point operations per second (FLOPS). Floating-point operations are usually performed by certain algorithms or computer programs. FLOPS is a measure of computer performance and is important for scientific calculations.

The grid is an evolutionary technology that uses existing IT infrastructure to provide high-throughput computing.

One of the most important concepts of a grid system is "virtualisation". This refers to the direct integration of geographically distributed systems. This is very important as users do not need to know where the computer is physically located. This also means that there is a single access point for users to enter the grid system and to submit their requests. Users just need to submit the request from the grid user interface (UI) (see Section 2.1.3) and then it is up to the grid system to allocate the available computing resources for the dedicated requests.

The grid infrastructure introduces the concept of virtual organisations (VOs) [31]. A VO is a collection of individuals or institutions that share common grid computing resources (Figure 2.1).

Based on the concept of virtual organisations, we consider three conditions that provide the basis for the understanding of the grid system. The first condition is virtualisation, which was explained above.

The second condition is heterogeneity. A VO is a multi-institutional entity. These organisations can have different resources in terms of hardware, network and operating systems. So it should be clearly stated that a VO is a collection of heterogeneous resources.

The third condition is flexibility. Different organisations can leave or join a VO according to their requirements and convenience. So a VO is a dynamic entity.

These three conditions explain why grids have specific requirements in contrast to other distributed systems. Grids enable collaboration between multiple VOs in terms of resource sharing. This collaboration is not only focused on the exchange of files but also provides direct access to the computing resources. Each of these VOs can have different policies and administrative control. All VOs are part of a big grid system. Resources that are shared between VOs may be data, special hardware, and analysed and distributed data in the grid. The members of the grid can be part of multiple VOs at the same time (Figure 2.2).

Figure 2.1: Different VOs in the grid. Taken from [30].

Grids can be used to define a security policy for the members with the ability to prioritise resources for different users.

Ian Foster proposes a three-point checklist [32] to define a grid. According to it, a grid should provide resource coordination without centralised control, it should be based on open standards, and it should provide a nontrivial quality of service. A grid can be used for computational purposes (computational grid), for storage of data on a large scale (data grid), or a combination of both.

2.1.1 Grid Authentication/Authorisation

As discussed above, a VO is a group of geographically distributed individuals or institutions that share their common resources with each other. Resource sharing is controlled by a set of rules or policies that define the conditions and scale of that sharing. The flexibility of VOs brings the problem of security. Typically, the well-known terms for computer security are authentication, authorisation and confidentiality.

The grid infrastructure provides security technologies using the VO as a bridge between the grid users and the grid resources. One part of the Globus Toolkit [33] is the Grid Security Infrastructure (GSI) (see Section 2.1.1.5), which implements the grid security functionality.

The Globus Toolkit is a primary technology for the grid and allows users to securely share the same computing resources and other necessary tools. The toolkit includes security services as well as software libraries and services for resource management and monitoring.


Figure 2.2: Different VO members can share the same grid resource.

2.1.1.1 Authentication

Authentication establishes the identity of an object in the network. The object can be a process, a resource or a user. For example, if you need to access another computer or a website that requires a particular identity, you need to identify yourself. This process is called authentication.

The simplest type of authentication for websites or computers is username and password authentication [34]. Typically, a website authentication procedure is the following: the website has a database of usernames, passwords and user roles configured on the web or application server. With this, the website can check authenticated users and grant access. However, username/password based authentication is not very secure if the user sends the data as plain text. If someone intercepts the transmission, the user name and password can be easily read.

A more complicated and more secure type of authentication is Kerberos [35], which is used for client/server applications and is based on symmetric key cryptography.

The technology that plays the key role in the authentication of grid systems is the Public Key Infrastructure (PKI) [36]. PKI describes the security system used to identify objects through the use of X.509 certificates [37]. X.509 certificates are issued by highly trusted organisations known as Certification Authorities (CAs) [38]. Different VOs have agreed on the use of different CAs. Hundreds of users can use the same grid system at the same time using their own user credentials. More detailed information is provided in the following sections.

2.1.1.2 Public Key Infrastructure (PKI)

The Public Key Infrastructure (PKI) [36], [39] is built to provide secure communication in public networks or the internet. It uses a public/private key pair structure. For each user the message encryption and decryption should satisfy the equalities shown in Equations 2.1 and 2.2.

D(E(Msg)) = Msg, (2.1)

E(D(Msg)) = Msg, (2.2)

where Msg stands for any message or text, E is the encryption key and D is the decryption key.

When the user A wants to send a signed document to the user B, the following steps need to be followed:

• Transform the message into a signed message (MsgSigned) by using his/her decryption key, such that MsgSigned = DA(Msg).

• Send the user B a signed ciphertext. User A uses the encryption key of the user B, which can be found in the public file, to generate a signed ciphertext (CSigned = EB(MsgSigned) = EB(DA(Msg))).

The actions of the user B are the following:

• The user B reduces the signed ciphertext (CSigned) to a signed message (MsgSigned) with his/her decryption key, such that DB(CSigned) = DB(EB(MsgSigned)) = MsgSigned.

• The user B decrypts the signed message by using the encryption key of the user A, which can be found in the public file, such that EA(MsgSigned) = Msg.

The schematic view of the above mentioned example is shown in Figure 2.3.

Figure 2.3: The schematic view of the example. The user A sends a signed docu-ment to the user B.
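The workflow above can be illustrated with a short Python sketch based on the cryptography package. Modern libraries separate signing from encryption, so the sketch uses an RSA signature in place of the textbook "encrypt with the decryption key" step; the key pairs and the message are purely illustrative assumptions.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Illustrative key pairs for user A (signer) and user B (recipient).
key_a = rsa.generate_private_key(public_exponent=65537, key_size=2048)
key_b = rsa.generate_private_key(public_exponent=65537, key_size=2048)

msg = b"Grid job description from user A"

# User A: sign with A's private key, then encrypt for B with B's public key.
signature = key_a.sign(msg,
                       padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                                   salt_length=padding.PSS.MAX_LENGTH),
                       hashes.SHA256())
ciphertext = key_b.public_key().encrypt(
    msg, padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                      algorithm=hashes.SHA256(), label=None))

# User B: decrypt with B's private key, then verify A's signature with A's public key.
received = key_b.decrypt(
    ciphertext, padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                             algorithm=hashes.SHA256(), label=None))
key_a.public_key().verify(signature, received,
                          padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                                      salt_length=padding.PSS.MAX_LENGTH),
                          hashes.SHA256())
print(received == msg)  # True; verify() raises an exception for a forged signature

The verify call failing corresponds to detecting a modified or forged document, which is exactly the guarantee the signed-message exchange is meant to provide.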

The CA issues a digital certificate to users (individuals/organisations). This certificate uniquely identifies a user. The digital certificate follows the structure specified by the X.509 system. The name of the user in the X.509 system is unique and it is related to its public key by a CA. The private key of the certificate is kept by the owner of the certificate and the public key is available for public usage. The data signed by the private key can be decrypted only using the public key and vice versa.


Figure 2.4: Hierarchy structure for PKI.

The hierarchical structure of the Public Key Infrastructure is shown in Figure 2.4. At the lowest level of the tree are the end users who are issued with the digital certificate.

In the middle level of the tree are the CAs which are authorised to issue certificates on a regional level. Regional CAs can be small or large organisations, institutions and countries.

At the top level of the tree are the biggest CAs, the so-called Root Certificate Authorities (Root CAs) [40], which are able to provide certificates to smaller CAs. Root CAs are trusted by everyone and are specially meant for that purpose. A Root CA is identified by its root certificate. The root certificate is a self-signed or unsigned public key certificate. The private key of the root certificate is used to sign all other certificates. All certificates below the root certificate directly inherit the trustworthiness of the root certificate.

A self-signed certificate is a certificate of identity signed by the same person whose identity is certified. From a technical point of view, the self-signed certificate is signed by a user's own private key. Users have to wait until the certificate is signed or approved by trusted CAs.

The CAs in the PKI are also responsible for publishing the Certificate Revocation List (CRL). This list contains the serial numbers of invalid or expired certificates of the users.

2.1.1.3 X.509 End User Certificates

The Globus Toolkit middleware is a dominant middleware for grid systems. The Grid Security Infrastructure (GSI) (see Section 2.1.1.5) is the part of the Globus Toolkit that provides the fundamental security services for grid systems. The GSI provides libraries and tools for authentication and message protection that use standard X.509 public key certificates, the public key infrastructure (PKI), the SSL/TLS protocol [41] and X.509 proxy certificates.

X.509 certificates [37] are used to identify users in the grid. Each user is identified by a unique X.509 certificate and needs to use the same certificate for any grid activity. Each certificate contains the public key of a user and is signed with the private key of a trusted CA. X.509 defines alternative authentication protocols based on the use of public-key certificates. The user certificates are assumed to be created by some trusted CA and placed in the directory by the CA or by the user. The directory itself is not responsible for the creation of public keys or for the certification function; it only provides an easily accessible location for users to obtain certificates.

User certificates generated by a CA have two main characteristics:

• Any user who knows the public key of the CA can verify the certified public key of the user.

• Besides the CA, no one can modify the certificate without being detected.

The format of the X.509 certificate is described below (Figure 2.5).

Figure 2.5: X.509 end user certificate format. Taken from [42].

• Version: specifies the version of the X.509 certificate.

• Certificate Serial Number: the CA assigns a unique serial number to the certificate. The CA's name and the serial number uniquely identify the certificate.

• Algorithm: identifies the algorithm that was used for the certificate. For example, RSA can be used for the generation of the public/private key pair and MD5 for the hashing algorithm.

• Issuer Name: mentions the name of the CA that signed the certificate.

• Validity: consists of the two fields "not valid before" and "not valid after", used for specifying the validity period of the certificate.

• Subject Name: contains the name of the owner of the certificate.


• Subject Public Key Info: consists of two fields, "public key algorithm" and "public key". The "public key algorithm" field specifies the algorithm that was used to generate the public/private key pair. The "public key" field contains the public key of the certificate holder.

• Issuer Unique Identifier: an optional field used for specifying the unique ID of the issuer of the certificate (CA).

• Subject Unique Identifier: an optional field used for specifying a unique ID of the owner of the certificate.

• Extensions: an optional field used for specifying extensions of the basic fields.

• Signature: consists of the hash of the entire certificate, signed with the private key of the CA. In order to verify the certificate, a user can calculate the hash using the algorithm specified in the certificate and then decrypt the signature by using the CA's public key. If the two hash values are the same, then the certificate is original and can be trusted.
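Most of these fields can be inspected programmatically. The following Python sketch, based on the cryptography package, prints the main fields of a certificate; the file name usercert.pem is only an assumption for a typical grid user certificate.

from cryptography import x509

# Load a PEM-encoded end user certificate (path is an illustrative assumption).
with open("usercert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

print("Version:      ", cert.version)
print("Serial number:", cert.serial_number)
print("Issuer:       ", cert.issuer.rfc4514_string())
print("Subject:      ", cert.subject.rfc4514_string())
print("Not before:   ", cert.not_valid_before)
print("Not after:    ", cert.not_valid_after)
print("Signature alg:", cert.signature_hash_algorithm.name)
print("Public key:   ", cert.public_key())
for ext in cert.extensions:          # optional X.509 v3 extensions
    print("Extension:    ", ext.oid.dotted_string, "critical:", ext.critical)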

X.509 certificates have a validity period (lifetime). Typically the certificate lifetime is one year and a new certificate is issued in advance before the expiration of the old one. In some cases, a certificate may need to be revoked earlier than the expiration date. For this purpose, the CA maintains a list (the certificate revocation list, CRL) of all revoked but not yet expired certificates issued by that CA, including certificates issued to other CAs. This list is located in the corresponding directory, signed by the issuer, and includes the issuer's name, the list creation date, the next CRL creation date and the revoked certificate entries (serial number and revocation date) (Figure 2.6).

Figure 2.6: X.509 certificate revocation list (CRL) format. Taken from [42].

X.509 certificates are the basis of GSI authentication mechanisms.


2.1.1.4 X.509 Proxy Certificates

Proxy certificates [43] are widely used in security systems. Proxy certificates are based on X.509 public key certificates. The main idea of proxy certificates is to allow single sign-on and delegation of rights for the users.

• Single sign-on: the user should be authenticated only once within a domain by using their own X.509 certificate. The creation of the proxy certificate itself is based on the X.509 certificate. Within the domain a proxy certificate can be used multiple times.

• Delegation of rights: a user grants part or all of his/her privileges to another object, in order to perform actions that might be subject to an access control mechanism.

A proxy certificate has a limited validity and its private key can be stored in a simple file system without being encrypted. A proxy certificate has a new public/private key pair. The new certificate contains the owner details and has a specific field stating that it is a proxy certificate. A proxy certificate is signed by the owner of the certificate instead of a CA.

If a proxy certificate is used, the authenticating party receives the owner's X.509 certificate and the proxy certificate. The validity of the proxy certificate is checked with the owner's public key, i.e. it is checked whether the proxy certificate is signed by the same owner or not. The owner's X.509 certificate carries a signature of the CA, which may be used to validate the owner's certificate. In this way a chain of trust is established between CA and owner and between owner and proxy certificate.

A proxy certificate is derived as an extension of an X.509 certificate, using the proxy certificate information extension defined in RFC 3820 [44]. A proxy certificate has additional fields which are defined in RFC 3820. The extension indicates that the certificate is a proxy and contains the information about any usage restrictions that can be placed by the owner. The structure of the extension is shown in Figure 2.7.

Figure 2.7: X.509 proxy certificate extension [45].


• Path Length Constraint: mentions the maximum depth of the proxy certificate path that is allowed to be signed by this certificate.

• Proxy Policy: specifies a policy of authorisation within this certificate.

• Policy Language: mentions the format of the policy. The language field has two important values: id-ppl-inheritAll, which indicates an unrestricted proxy, and id-ppl-independent, which indicates a proxy with no rights granted by the owner. In the latter case the only rights are those that are explicitly granted to it.

• Policy (Optional): mentions other policies.

2.1.1.5 Grid Security Infrastructure (GSI)

Grid Security Infrastructure (GSI) [24] is meant to provide the necessary functionality for the implementation of security in grid systems. GSI is part of the Globus Toolkit. It has been developed to deal with the specific security needs of grid systems, such as authentication and authorisation.

• Single sign-on: a user should be able to log in to the grid system and use the grid resources without requiring any further authentication.

• Delegation of privileges: different users can have different privileges in grid systems. For example, grid administrators have more privileges than a regular user. Some specific information, such as configuration files or software management, requires special user privileges.

• Inter-domain security support: the grid may contain different security domains geographically located in different places. The grid security must be able to provide support between these domains.

• Secure communication: the grid security should provide secure message exchange between grid users and services.

• Uniform credentials: the grid security should have a uniform certificate which represents the credentials of users in the grid. As a uniform certificate GSI uses the X.509 certificate format.

To support these requirements, GSI provides different functions such as authentication, authorisation, delegation and message protection. Each of these tasks is handled by an open standard.

2.1.1.6 Authorisation

Authorisation is the second step (the first step was described in Section 2.1.1) of trust establishment between two objects in the grid. Authorisation sets up the privileges for the objects in order to access a dedicated resource. The authorisation can only be done after successful authentication.


In grids, the resource owners have the ability to grant or deny access to a grid system, based on the identity or the VO membership of a user. One of the authorisation techniques used in grid systems is the Globus Toolkit gridmap file [33]. It contains a list of global names of grid users and the corresponding local account names to which they are mapped. Keeping the gridmap file up to date requires a certain effort from the local system administrator. The Community Authorisation Service (CAS) [46] makes a clear separation between site policies and VO policies. CAS provides a mechanism for a VO to manage multiple independent policies. Each VO runs a CAS server, which acts as a trusted third party between the users and the resources of the VO.
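A gridmap file is essentially a list of lines that map a certificate subject (distinguished name) to a local account. The following minimal Python sketch parses such a file and looks up the local account for a given DN; the file path and the example DN are illustrative assumptions, and real deployments may use additional syntax (for instance pool accounts) that is not handled here.

import shlex

def load_gridmap(path="/etc/grid-security/grid-mapfile"):
    """Parse a gridmap file into a {DN: local account} dictionary (simplified)."""
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # shlex keeps the quoted DN together, e.g.
            # "/C=DE/O=GermanGrid/OU=uni-goettingen/CN=Jane Doe" atlasusr
            parts = shlex.split(line)
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
    return mapping

accounts = load_gridmap()
dn = "/C=DE/O=GermanGrid/OU=uni-goettingen/CN=Jane Doe"   # hypothetical DN
print(accounts.get(dn, "no mapping found"))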

The Virtual Organisation Membership Service (VOMS) [47] provides authorisation information management for different VOs. The VOMS provides a database, which contains user roles and capabilities, and a set of tools to access and use the database. The VOMS database typically runs on a VOMS server. These tools can create new credentials for new users. These credentials contain the basic authentication information of the standard grid proxy certificate. Additionally, they also contain the user role information.

2.1.1.7 Confidentiality

Confidentiality means hiding the most valuable or sensitive information from people who should not have the right to access it. Generally, grid systems take care of valuable or sensitive data, such as experimental, analysed or financial data, etc. This brings up the issue of confidential data protection, which is typically achieved by cryptography. The provision of data access is discussed within organisations and is managed by the grid administrators. Grid systems presuppose the usage of remote resources. A remote resource can be either an autonomous machine or a cluster, in accordance with the decision of the grid load balancer. Grid systems have specific security requirements, which are different from internet security.

2.1.1.8 Grid Middleware and its Services

In order to use grid resources effectively, some middleware is needed. Grid middleware includes APIs, protocols and software that allow the creation and usage of the grid system. There are several important services running in the grid middleware, such as job execution, security, information and file transfer services [67], [68]. The grid is a dynamic environment where the status of the services can change at any time. Each service is meant for a specific task.

• Job execution service: executes the user-submitted jobs on a remote resource (site). The security of the resource is guaranteed by the corresponding security services. During the job execution, the status of the job is shown to the job submitter.

• Grid security service: provides the authentication and authorisation mechanism for users to access the grid resources in a secure way.


• Information service: provides the information about grid resources and services, such as the amount of free space on a particular computer/node, available services, running or stopped services, etc.

• File transfer service: provides safe and efficient data transfer from its original location to the remote node, where the task should be executed. For this, the user only needs to run a simple copy command from the Linux terminal.

2.1.2 Information Provider

Information services [67] are major and critical services in the grid middleware. They collect and distribute user and resource data in the grid. Information services provide detailed information about grid services that is needed for various tasks. They maintain non-distributed and distributed information about users, software, services and hardware that participate in a computational grid, and make it available upon request. The grid information services also provide data binding, discovery, lookup and protection. In short, a grid information service (GIS) provides information about grid objects. Possible grid objects required by the information service are listed in Figure 2.8.

Figure 2.8: Table of objects required for information service. Taken from [48]

In a data repository the structures of objects and the relationships between objects are described by the data model. There are four main data models: hierarchical, network, relational and object-oriented. Typical examples for the hierarchical data model are LDAP [50] and XML.


In order to provide fresh information, periodical updates of the data repositories are needed.

Information services support one of the following categories of information updates.

• Read-only repositories: there is no standard way to update the information. Updates are usually done via other, non-standard interfaces, such as updates of configuration files.

• Read-mostly repositories: allow updates and are optimised for information reading. This can be visible in relatively slow consistency-updating protocols, or time-consuming restructuring in the face of multiple additions or repositions in the data structure. For example, LDAP falls into this category.

• Read-write repositories: allow updates with regular read/write operations. A simple relational database management system is a read/write repository. Under this category fall all database systems (Oracle, MSSQL/MySQL, etc.).

Updates for this kind of information in a GIS can only be done by an authorised person. The owner is able to perform changes to the object when the resource changes (an old cluster replaced by a new one, file server updates, etc.). The frequency of updates depends on hardware and software upgrades.

Information retrieval is done via a query interface. A query interface provides an easy way for users to obtain the necessary information. Relational and object-oriented databases support a standardised and powerful access method like the Structured Query Language (SQL) [49]. SQL queries can be complex and can contain sub-queries. Hierarchical data models such as LDAP use simplified access protocols. The queries are restricted to simple expressions which are written in a procedural language. Because of the hierarchical structure of LDAP and XML, a user must have a deep knowledge of the tree structure to write queries. The security in a GIS is guaranteed by X.509 certificates.
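As an illustration, a hierarchical LDAP-based information service can be queried with the standard ldapsearch client. The sketch below wraps such a query in Python; the host name is an assumption, while port 2170, the search base o=grid and the GLUE attribute names follow the usual conventions of gLite information systems (see Section 3.2.2) and may differ in a concrete deployment.

import subprocess

# Query a (hypothetical) top-level information index for computing elements.
cmd = [
    "ldapsearch", "-x", "-LLL",                    # simple auth, plain LDIF output
    "-H", "ldap://bdii.example.org:2170",          # assumed host; 2170 is the usual BDII port
    "-b", "o=grid",                                # conventional search base
    "(objectClass=GlueCE)",                        # GLUE schema class for computing elements
    "GlueCEUniqueID", "GlueCEStateWaitingJobs",    # attributes to return
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
for line in result.stdout.splitlines():
    if line.startswith("GlueCE"):
        print(line)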

2.1.3 Grid User Interface

The user interface (UI) [67] provides an interface for users to access grid services. The UI itself provides a command-line interface (CLI) to the middleware services of the grid infrastructure. A CLI is a means of interaction between the user or client and a computer program. It is a terminal window where a user manually types lines of text (command lines).

CLIs are mostly used by advanced computer users, as they usually provide a way to control the program or the operating system.

There are a number of different CLIs, such as

• Data Catalog command-line clients and APIs,

• Data Transfer command-line clients and APIs,

• gLite I/O client [51] and APIs,

• R-GMA client [52] and APIs,


• VOMS CLI tools,

• Workload Management System clients and APIs,

• Logging and Book-keeping clients and APIs.

Each CLI has its own specific collection of tools or commands through which a user can perform different activities in the grid, for example submit a job to the grid, delete the job, copy a file from one place to another, etc.

2.1.4 Computing Element

The Computing Element (CE) [67] is a cluster of grid resources, which is used to run jobs.

Each CE consists of three main elements: the grid gate (GG), the local resource management system (LRMS) and the worker nodes (WNs).

• The grid gate (GG) is the interface between the grid and the cluster of worker nodes. Common implementations include the LCG grid element and the gLite grid element.

• The local resource management system (LRMS), which is also called "batch system", manages the distribution of local resources (WNs) in the cluster. Common implementations include Maui/TORQUE [53] and Condor [54].

• The worker nodes (WNs) are the computing resources. WNs are typically desktop computers and in the grid system they are used for running the jobs.

Users submit their jobs to the grid. The GG accepts user jobs and then dispatches them to the LRMS. The LRMS takes care of the further user job execution by distributing the jobs to the WNs. A schematic view is shown in Figure 2.9.

2.1.4.1 Portable Batch System (PBS)

Generally, the number of jobs submitted by different users is larger than the number of existing WNs. In order to solve this problem, job queuing systems are needed. Each batch system has a queuing system. The batch queuing system schedules jobs for further execution. Typically, the job selection mechanism follows the first-in, first-out (FIFO) model.

Job scheduling has two phases. The first phase is the job selection from the submitted jobs. The second phase is the allocation of memory and CPU resources according to the selected job requirements.

Different batch systems have different job schedulers. One example of a batch system is the PBS (Portable Batch System) [55]. The PBS consists of four main daemon processes:

• Job Server: manages jobs and queues. The server also transfers jobs to the associated execution server.


Figure 2.9: Workflow of a CE.

• Resource Monitor: collects the resource availability information about the host where it actually runs.

• Job Execution Server: also known as the Machine Oriented Miniserver (MOM), it takes care of the job execution. It also establishes limits on resource usage, monitors the job's usage, and notifies the server when the jobs are complete. Another activity of MOM is to react to the task manager requests, which means communicating with the running task and gathering the necessary information. In case of system errors, log files are available in the mom_logs directory.

• Job Scheduler: receives information about the jobs which are ready to run or are already running on the system. It also gets information about resource availability and usage.

A schematic view is shown in Figure 2.10.

Figure 2.10: Schematic view of PBS.


User jobs arrive at the Job Server, while the Resource Monitor holds the necessary information about the available resources. After the information has been exchanged between the Scheduler and the Resource Monitor and between the Scheduler and the Job Server, the jobs in the queue go to the MOM, where they start their execution.
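From the user's point of view, this machinery is driven by a few CLI commands. The snippet below submits a job script and queries its status from Python via the standard qsub and qstat commands; the queue name and the script path are illustrative assumptions, and the exact options can differ between PBS variants.

import subprocess

# Submit a job script to a (hypothetical) queue named "short"; qsub prints the job ID.
job_id = subprocess.run(["qsub", "-q", "short", "my_analysis_job.sh"],
                        capture_output=True, text=True, check=True).stdout.strip()
print("submitted job:", job_id)

# Ask the batch system for the current status of all queued and running jobs.
print(subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout)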

The core functionality of the batch system is the following:

• Jobs in the Batch System are assigned/grouped in queues.

• The user (job submitter) should define in advance to which queue the job will be assigned; if not, the job is assigned to the default queue.

• Queues have three main groups of attributes

– Limits for resources: maximum job run time, memory or CPU core usage, etc.

– Priorities: according to the scheduling strategy

– Security options: who is allowed to submit a job to the queue and from where.

• Queues, and hence jobs, follow a particular scheduling strategy defined for the batch system at the moment of configuration.

There are several common and important scheduling strategies which might be applied to batch systems. These are the following.

• First-in, first-out (FIFO) scheduling: jobs are simply scheduled in the order in which they are submitted.

• Preemption: jobs get suspended in order to free resources for time-critical jobs. The state of a suspended job is stored in order to resume it later ("checkpointing").

• Node-sets: parallel jobs should run on hardware with the same performance. This avoids that sub-jobs have to wait for the slowest sub-job.

• Fair share scheduling: job priorities are calculated based on the resource usage, the submitter's group membership, etc.

• Advanced reservations: resources can be reserved in advance for a particular user or job.

• Back-fill: can be combined with any of the scheduling algorithms to allow smaller jobs to run while waiting for enough resources to become available for larger jobs. Back-fill of smaller jobs helps maximise the overall resource utilisation. The gap D can be filled with short jobs that have a lower priority than job C (Figure 2.11).
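The combination of FIFO ordering and back-fill can be illustrated with a small, self-contained Python sketch. It is not how a production scheduler such as Maui is implemented; the job names, core counts and walltimes are invented, and it assumes that no job requests more cores than the cluster provides.

from collections import namedtuple

Job = namedtuple("Job", "name cores walltime")   # walltime in arbitrary time units

def backfill_schedule(queue, total_cores):
    """FIFO scheduling with a simple back-fill pass (illustrative sketch only)."""
    schedule = []                  # (job name, start time)
    running = []                   # (end time, cores) of currently running jobs
    pending = list(queue)
    now, free = 0, total_cores
    while pending:
        # start jobs in FIFO order as long as they fit into the free cores
        while pending and pending[0].cores <= free:
            job = pending.pop(0)
            schedule.append((job.name, now))
            running.append((now + job.walltime, job.cores))
            free -= job.cores
        if not pending:
            break
        # the head job is blocked: find the earliest time it could start
        head, freed, start_head = pending[0], free, now
        for end, cores in sorted(running):
            freed += cores
            if freed >= head.cores:
                start_head = end
                break
        # back-fill: later, smaller jobs may run now if they finish before start_head
        for job in list(pending[1:]):
            if job.cores <= free and now + job.walltime <= start_head:
                pending.remove(job)
                schedule.append((job.name, now))
                running.append((now + job.walltime, job.cores))
                free -= job.cores
        # advance the clock to the next job completion and release its cores
        running.sort()
        end, cores = running.pop(0)
        now, free = end, free + cores
    return schedule

# Four invented jobs on an 8-core cluster: D is back-filled while B and C wait.
jobs = [Job("A", 4, 10), Job("B", 6, 5), Job("C", 8, 20), Job("D", 2, 3)]
print(backfill_schedule(jobs, 8))     # [('A', 0), ('D', 0), ('B', 10), ('C', 15)]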

There are different types of batch systems, such as PBS [55], TORQUE [56], Platform LSF [57] and IBM LoadLeveler [58]. Different batch systems use different CLIs, as shown in Figure 2.12.


Figure 2.11: Back-fill algorithm.

Figure 2.12: Different batch systems with their CLIs.

2.1.5 Storage Element

Storage is a widely used term for the devices holding data, which are connected to the worker nodes (WNs) via input/output operations. It is the place where data are physically stored. In the grid system, files are replicated and stored on several storage nodes located in various geographical locations, in order to provide rapid multi-access capabilities. The storage also provides an authorisation policy that defines rules about who is allowed to modify or delete the data. Different storage resources offer different levels of Quality of Service (QoS) [59].

The storage hardware is controlled by software. The storage resource is the combination of the storage hardware and the supporting software.

A storage element (SE) [67] has an interface for accessing grid storage through different grid protocols, such as the Storage Resource Management protocol (SRM) [60], the Globus GridFTP protocol [65] and others. The SE also allows grid users to store and manage files/data within the space assigned to them.

In order to make the storage available for usage, different services are needed. There are three main interfaces used for the storage resource, shown in Figure 2.13.

• Data Access: rfio [62], dCap [67] protocols

• Data Transfer: GridFTP [65] protocol

• Storage Management interface: SRM [60] protocol


Figure 2.13: Storage resource.

2.1.5.1 Data Access (rfio, dCap) Protocols

rfio [62] and dCap [67] are protocols that provide access to hierarchical mass storage systems like Castor and dCache. The rfio protocol was designed to transfer files in a trusted environment with a minimum of overhead. rfio does not use secure authentication, while other protocols such as GridFTP and SRM use a secure authentication mechanism. rfio provides a full POSIX access mechanism to the remote files via adapted open, read, write and close commands. rfio has problems with many parallel requests.

The dCap protocol supports the regular file access functionality. It offers POSIX open, read, write, stat and close operations as well as standard file system namespace operations. dCap supports security mechanisms where security protocols are implemented.

2.1.5.2 Data Transfer (GridFTP) Protocol

GridFTP [65] is a secure and reliable data transfer protocol. GridFTP is based on FTP (File Transfer Protocol) and fulfils the requirements of grid systems.

As an extension of FTP, GridFTP has the following main features:

• Authentication: GridFTP integrates with the Grid Security Infrastructure that provides authentication and encryption to file transfers with user-specified levels of confidentiality and data integrity.

• Parallel streaming: GridFTP supports the usage of parallel streaming and multi-node transfers.

• Basic transfers: GridFTP provides data transfers between two worker nodes. An example is shown in Figure 2.14. Site A runs a GridFTP server and holds data that will be transferred to Site B, which runs a GridFTP client.

• Third-party transfers: GridFTP performs transfers between two worker nodes, mediated by a third host (Figure 2.15). This allows a user or application at one site to initiate, monitor and control a data transfer operation between two other sites.


Figure 2.14: Basic GridFTP transfers.

Figure 2.15: GridFTP third-party transfer.

• Striped/Partial transfers: GridFTP supports striped and partial data transfers between hosts.

In GridFTP there are two different communication channels, the data channel and the control channel.

The data channel is a high-bandwidth communication link over which the data are transferred from one place to another.

The control channel is a low-bandwidth TCP link over which commands and responses flow; it is encrypted and integrity-protected by default. The direction of the arrow depends on where the GridFTP client and the GridFTP server are located.

The Globus Toolkit provides the most common implementation of the GridFTP protocol. It provides server and client CLI tools (globus-gridftp-server, globus-url-copy) [63] and a set of libraries for client development.
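A basic transfer with globus-url-copy can be scripted, for instance from Python. In the sketch below the host names and file paths are purely illustrative, and the -p option (number of parallel data streams) is an assumption about a commonly used flag; a valid proxy certificate (see Section 2.1.1.4) is required for the transfer to be authorised.

import subprocess

source = "gsiftp://se1.example.org/data/atlas/run0001.root"   # hypothetical source URL
destination = "file:///tmp/run0001.root"                      # local destination

# Copy one file over GridFTP, using (assumed) 4 parallel data streams.
subprocess.run(["globus-url-copy", "-p", "4", source, destination], check=True)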

Another implementation of the GridFTP protocol was developed by NCSA and provides the client CLI UberFTP [64].

GridFTP is defined by the Global Grid Forum Recommendations GFD.20 [65] and GFD.47 [66].

2.1.5.3 Storage Resource Management (SRM)

The Storage Resource Manager (SRM) [60] manages an associated storage system by using different techniques such as cache management, dynamic space allocation and reservation for data transfers, staging of data from slow to fast storage in a hierarchical storage system and scheduling of storage system requests. An SRM typically manages two types of shared resources, files and space. Its main function is to allocate space for a user dynamically, check the permission of a user to use that particular space, assign the space to the user for a given time period according to its policy and release the space when the user requests its release or when the time has expired. An SRM also supports file replication between SEs, which can be considered as third-party transfers. This is different from GridFTP third-party transfers, as the SRM uses credential delegation between the client and the local SE.

The SRM protocol version 2.2 [61] functionality can be grouped into the following five categories: space management functions, data transfer functions, directory functions, permission functions and discovery functions.

Space management functions allow the user to reserve, release and manage spaces for some defined time period. The main space management functions are srmReserveSpace, srmUpdateSpace, srmReleaseSpace, srmChangeSpaceForFiles and srmExtendFileLifeTimeInSpace.

Data transfer functions allow the user to manage, retrieve and copy data from/to the storage. The main data transfer functions are srmPrepareToGet, srmBringOnline, srmPrepareToPut, srmPutDone, srmCopy, srmReleaseFiles, srmAbortRequest/srmAbortFiles and srmExtendFileLifeTime.

Directory functions allow users to manage the SRM namespace for creating, moving and deleting directories and files, as well as to put files with different storage requirements in a single directory. Changing the space characteristics of a file does not change its position in the directory but only its assignment to the corresponding space.

Permission functions allow users to assign read and write privileges on a specific file to other users. Such functions also allow client applications to specify ACLs, wherever supported by the underlying storage service.

Discovery functions allow applications to query the characteristics of the storage system behind the SRM interface and of the SRM implementation itself.

A practical example is shown in Figure 2.16.

Figure 2.16: An example of copying a file to the grid storage.


A client or user requests a space reservation on a grid storage. After getting the confirmation, the client/user copies the file to the storage. As a last step, a notification is sent to the SRM that the client/user has completed the file transfer for the allocated space.

2.1.6 Workload Management System (WMS)

The grid system provides a large variety of complex services such as the data management services, the security services, the information services, etc.

The job scheduling in grid systems is a very well-known research topic. The Workload Management System (WMS) [67] is an architecture for distributed scheduling and resource management in a grid. The usage of the WMS is very important for grid systems.

The WMS consists of several main components.

• Job preparation: includes specific information about the job, such as job characteristics or attributes and job requirements. The job description uses the Job Description Language (JDL) [67]. JDL defines the set of job attributes. Later, the WMS considers this information for the scheduling decisions.

• Job submission: is the actual JDL file submission process. There are several steps that each user should follow in order to submit a job.

– The user needs to log in to the UI.

– The user types the grid-proxy-init command with the corresponding certificate password in order to get a valid Globus proxy certificate.

– The user should write his/her own JDL file. An example is shown in Figure 2.17.

Figure 2.17: JDL file example.

Then the JDL file is submitted to the grid, for example with glite-wms-job-submit -o jId.txt job.jdl. The -o jId.txt flag is used to store the job identifier in a file called jId.txt for later reference. A minimal sketch of this workflow is given after this list.

• Job monitoring and control: provides information about the job. It allows users to modify the job, such as editing the content of the JDL file, deleting the file or just checking the status of the job.
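The following Python sketch wraps the submission workflow described above (proxy creation, submission, status query) around the gLite command-line tools. The VO name and the JDL file name are illustrative assumptions, and the -a option for automatic proxy delegation is one common choice; in an interactive session the proxy would normally be created by typing voms-proxy-init directly on the UI.

import subprocess

def run(cmd):
    """Run a grid CLI command on the UI and return its standard output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# 1) Create a short-lived proxy certificate for the ATLAS VO (this prompts for the
#    X.509 certificate password on the terminal).
subprocess.run(["voms-proxy-init", "--voms", "atlas"], check=True)

# 2) Submit the job described in job.jdl; the job identifier is stored in jId.txt.
print(run(["glite-wms-job-submit", "-a", "-o", "jId.txt", "job.jdl"]))

# 3) Query the status of the submitted job using the stored identifier.
print(run(["glite-wms-job-status", "-i", "jId.txt"]))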

The WMS manages job submission and cancellation from the user interface via the network. It is an intermediate service that schedules jobs in the most efficient way.


Figure 2.18: Structure of WMS. Taken from [67].

The WMS workflow is shown in Figure 2.18. The workload manager (WM) is the main component of the WMS. It includes the task queue, the information supermarket, the information updater, the matchmaker and the job handler (described below).

This component mainly takes care of the optimal job allocation and of the wrapping of the jobs before submitting them to the computing element. The process of job allocation and wrapping is done by several grid services, such as the data management, the bookkeeping and the information services.

2.1.6.1 Task Queue

One of the WM elements is the task queue. It is responsible for handling the user job requests and also takes care of the user authentication via the digital proxy certificate. Once a job is submitted or cancelled, it gets interpreted according to the JDL attributes and is then either passed or not passed to the task queue. After the job reaches the task queue, it stays in the queue for later processing by the matchmaker. If there is no suitable resource available for scheduling, the job returns to the task queue and waits for processing.

2.1.6.2 Matchmaker

The next element of the WM is the matchmaker. This element is responsible for two main things: the job processing and the matching of jobs with the available resources. The job processing includes user access policies, requested resources and job privileges. The available resource information is provided by the information supermarket, which provides up-to-date information about the grid resources. Putting these two things together, the matchmaking process takes care of the optimal job scheduling in accordance with the available resources in the grid. After this process, the job passes to the next stage, which is the actual job submission to the grid system.


2.1.6.3 Information Supermarket

The next element of the WM is the information supermarket. This element stores information about the grid resources. The information gets updated constantly via the information updater.

2.1.6.4 Information Updater

The next element of the WM is the information updater. This element is responsible for the up-to-date information stored in the information supermarket. The up-to-date information is provided by the top-level BDII server (see Section 3.2.2) of the virtual organisation.

2.1.6.5 Job Handler

The last element of the WM is the job handler. The job handler is responsible for the job packaging, the job submission/removal and the job monitoring. Jobs first get packed and are then executed on the particular resource. Submitted jobs are actively monitored by the log monitor.

2.1.6.6 Job Description Language (JDL)

The Job Description Language (JDL) [67] is a high-level language used for describing grid jobs.

A JDL file example is shown in Figure 2.19.

Figure 2.19: JDL file example.

The relevant JDL attributes are the following.

• JobType: normal, sequential, parallel.

• RetryCount/ShallowRetryCount: used for the job auto-resubmission and the maximum number of times that the job can be resubmitted to the system.

• Executable (mandatory): mentions the filename of the executable file.

• StdInput, StdOutput, StdError (optional): standard input/output/error of the job.

• InputSandbox (optional): lists the local files (e.g. the executable and its input) that are shipped with the job for the further job execution.

• OutputSandbox (optional): contains the output files of the executed job, typically the std.out and std.err files.
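A minimal JDL file corresponding to these attributes can be generated from Python as in the sketch below. The executable and sandbox entries are illustrative assumptions for a trivial test job; a real analysis job would list its own scripts and input files.

# Write a minimal, hypothetical JDL file for a trivial test job.
jdl = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
RetryCount    = 3;
"""

with open("job.jdl", "w") as f:
    f.write(jdl)

# The file can then be submitted as described in Section 2.1.6, e.g.
#   glite-wms-job-submit -a -o jId.txt job.jdl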


Chapter 3

Grid Middleware

3.1 Introduction

The grid system includes storage resources, network resources and computational resources such as servers, supercomputers and personal computers, which are located in different geographical locations. In order to use the grid resources effectively, some grid middleware is needed, which provides access to the grid resources and facilities.

The grid middleware is a shared, extensible and flexible environment. It is a set of APIs, software and different protocols that allow the creation of the grid system and its usage.

Because of the complexity of the grid architecture [30] and the diversity of its functions, the interaction between the middleware and other software is also required.

The grid architecture has a layered structure, shown in Figure 3.1.

Figure 3.1: Grid layered architecture.

The bottom layer of the grid architecture is the grid fabric, which includes hardware, network and all other necessary resources.

At the top level of the grid architecture is the application layer, which includes different user applications and portals.

In the middle layer, between the bottom and top layers, is the grid middleware itself, which enables the connection between the hardware and the user applications. In other words, it provides software, protocols and APIs for the grid systems.

There are different grid middleware systems, such as the Globus Toolkit [33], gLite [69], UNICORE [70], etc. At CERN, for the LHC experiments we are using the gLite middleware (see Section 3.2).


3.2 gLite Middleware

The gLite middleware [69], [71] was developed by the Enabling Grids for E-sciencE (EGEE) [72] project, a big collaboration between scientists and engineers from all over the world, in order to provide all necessary tools for resource sharing. It is a very reliable and highly accessible middleware.

The main components of gLite are the following (Figure 3.2):

• Access methods

• Security services

• Information and monitoring services

• Workload Management services

• Data Management services

Figure 3.2: gLite services [71].

Access Services are used for providing users access to the grid services according to their rights. All gLite services can be accessed via a CLI or an application programming interface (API).

Security Services include authentication, authorisation and auditing services that are used to identify different objects (users, systems and services) and to allow or deny the access to grid services and resources. Security services also provide data confidentiality and a site proxy, which is used for controlling the access via the network.

Information and Monitoring Services are meant to extract and publish information for monitoring purposes. Published data includes the time and date information, as well as the identity of the publisher and the location from where it was published. The information is not allowed to be modified as data inconsistency may occur. There are also different ways of cleaning out the old data. As a last point, there are well-defined authorisation schemes to guarantee that users can only read or write data according to their privileges.

Workload Management Services are related to the job management/execution in the computing element (CE), the job provenance, the accounting and the package manager services. The accounting is a distinctive service as it takes into account not only the computing, but also the network and the storage resources.

Different workflows for the job management exist, for example the push and pull models.

• In the push model, the WMS decides where to execute a job, taking into account the availability, the location and the characteristics of the worker nodes (WNs) in the CEs.

• In the pull model, the WMS collects jobs in a task queue and the CEs then request jobs from the WMS, taking into account their own availability, location and the characteristics of their worker nodes (WNs).

Data Management Services include three main components: the storage elements (SEs), the file catalogues and the file transfer service. The SE (see Section 2.1.5) is a gLite component that stores data for later retrieval by other gLite applications or by grid users. Here, files are replicated and stored in different locations in order to provide quick multi-access capabilities. Data consistency is also supported. Data are stored in files more often than in relational databases [73].

In a distributed environment, there are many file replicas that are stored in different geographical locations. The middleware provides the functionality of replica management. Before performing the actual file transfer operations in the SEs, files have to be located and identified. The identification of files on the storage elements is done via different identifiers [69], such as the Grid Unique Identifier (GUID), the Logical File Name (LFN), the Storage URL (SURL) and the Transport URL (TURL). The LFN is the key that users use to refer to the actual locations of the files. The replicas are identified by SURLs. SEs extract the necessary data by contacting the SURL, which is unique for each replica. The access to the files is controlled by Access Control Lists (ACLs).
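For illustration, typical identifiers of a single replicated file might look as follows; the host names, VO paths and the GUID value are invented, and the exact SURL layout depends on the storage implementation behind the SE.

# Hypothetical identifiers for one and the same file in the grid.
identifiers = {
    "LFN":  "lfn:/grid/atlas/user/jdoe/data/run0001.root",        # user-visible logical name
    "GUID": "guid:38ed3f60-c402-11d7-a6b0-f53ee5a37e1d",           # globally unique, immutable ID
    "SURL": "srm://se1.example.org/dpm/example.org/home/atlas/run0001.root",  # one replica on an SE
    "TURL": "gsiftp://se1.example.org/data/atlas/run0001.root",    # transport URL for a given protocol
}
for kind, value in identifiers.items():
    print(f"{kind:5s} {value}")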

The Data Management Services provide all necessary interfaces to the users for data placement in a distributed environment, where the transmission of the data is scheduled in the same way as the jobs in the WMS.

The main gLite middleware components are the following

• UI - User Interface,

• CE - Computing Element,

• SE - Storage Element,

• WN - Worker Node,

• WMS - Workload Management System,

• VOMS - Virtual Organisation Membership Service,

• LB - Logging and Bookkeeping,

• MDS - Monitoring and Discovery Service,


• LFC - Logical File Catalog,

• BDII - Berkeley Database Information Index.

The details about the above listed gLite components are covered in the remainder of this chapter. Figure 3.3 gives an overview of the gLite middleware.

Figure 3.3: gLite components.

3.2.1 gLite-VOMS

gLite remote services usually act on behalf of the user, which is called delegation. This means that the user has to authorise the remote service before he can perform any activity in the grid. The authorisation process proceeds by using the X.509 user certificate. The delegated certificate is called the proxy. Before doing any activities in the grid, users have to log in to the grid. This is done by creating a temporary proxy certificate using the original X.509 user certificate. The proxy certificate has a limited lifetime; for security reasons the proxy lifetime is only 12 hours.

In gLite, the user authentication is done by using the X.509 certificate and the authorisation is done by the VOMS attribute (see Section 2.1.1.6). The authorisation is based on Access Control Lists (ACL), which guarantee that data can only be accessed by their owners. Through the tools provided by VOMS, users can generate their own proxy credentials based on the VOMS database content and their own X.509 proxy credentials. The gLite security mechanism is shown in Figure 3.4.

The creation of proxy certificates is very simple. It requires the execution of the voms-proxy-init --voms <VO> command [69], where VO is the name of the virtual organisation. In our case we use ATLAS as the VO; an example is shown in Figure 3.5. The X.509 certificate password is required for the creation of the proxy certificate.

In this example the proxy is located in the /tmp directory.


Figure 3.4: Overview of certificates in gLite security mechanism.

Figure 3.5: An example of an executed voms-proxy-init command.


In order to obtain more information about the created certificate, the voms-proxy-info command [69] should be executed (Figure 3.6).

Figure 3.6: An example of an executed voms-proxy-info command.

Here we can see the certificate type (proxy), the issuer name, the validity, the location of the proxy certificate and the key usage.
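The two commands can also be scripted, for example from Python. The following is a minimal sketch, assuming a gLite UI with the VOMS client tools installed and a valid X.509 user certificate in the usual location; it is not part of gLite or HappyFace itself.

import subprocess

# Minimal sketch: create a short-lived VOMS proxy for the ATLAS VO and print
# its properties (type, issuer, validity, location), as shown in Figure 3.6.
# Assumes the voms-proxy-* client tools are installed; the user is prompted
# for the certificate password on the terminal.
def create_and_inspect_proxy(vo="atlas"):
    subprocess.check_call(["voms-proxy-init", "--voms", vo])
    print(subprocess.check_output(["voms-proxy-info", "--all"]).decode())

if __name__ == "__main__":
    create_and_inspect_proxy()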

3.2.2 gLite-BDII

The information services [67] provide information about the status of grid resources. The role of the information services in the grid is very important (see Section 2.1.2).

The functionality of the information service in gLite is shown in Figure 3.7.

Figure 3.7: Functions of information service in gLite.

There are two information service mechanisms in gLite:

• Relational grid monitoring architecture (R-GMA): used for accounting, monitoring and publishing of user level information. R-GMA uses the servlet technology for the intercommunication in gLite. By using communication protocols, the data transmission between the components is done in an efficient way. This technology also allows an easy and quick adaptation to other web services.


• Globus monitoring and discovery service (MDS): gLite version 3.1 inherits the main concepts of the Globus MDS. The MDS is used for the resource discovery and the status publication. It has a tree structure for gathering information (Figure 3.8).

Figure 3.8: Hierarchical tree structure of MDS.

At the bottom level are the individual resources (one for each CE or SE), whose information is provided by the information providers. This information is then collected and published by the grid resource information server (GRIS). Afterwards, the information is passed to the next level, which is the Grid Index Information Service (GIIS). For each site, there is only one GIIS that collects the information from the GRISes located one level below. This service also caches the information according to its validity time.

The Berkeley Database Information Index (BDII) [74] includes two or more standard LDAP databases [50]. The BDII sits on top of the MDS structure (Figure 3.9).

Each BDII monitors a certain number of GIISes. The BDII contains all the information about the available resources at the sites. Finally, at the highest level of all the BDIIs is the VO.

At this level there is an overall BDII that merges the different BDIIs of the same VO together. It is a server where all valuable information concerning the VO is stored.
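Since a BDII is essentially an LDAP server, its content can be queried with any LDAP client. The following sketch uses the python-ldap package; the host name is fictitious, and the port 2170 and base DN o=grid are the values commonly used for a BDII, so they may need to be adapted to a concrete deployment.

import ldap  # python-ldap

# Query a (fictitious) site BDII for its storage elements using the GLUE 1.3
# schema, in which GlueSE objects describe storage elements.
def list_storage_elements(bdii_host="bdii.example.org"):
    conn = ldap.initialize("ldap://%s:2170" % bdii_host)
    results = conn.search_s("o=grid", ldap.SCOPE_SUBTREE,
                            "(objectClass=GlueSE)",
                            ["GlueSEUniqueID", "GlueSEImplementationName"])
    for dn, attributes in results:
        print(dn, attributes)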

3.2.3 gLite-UI

The gLite user interface (UI) [69] is an interface for users to access grid services. It is a collection of different CLI clients and APIs, such as the data catalog, the data transfer, the I/O, the R-GMA, the VOMS, the WMS and the Logging and Bookkeeping. Through the CLI tools, users can perform several grid operations like:

• submit and cancel jobs in the grid,

• check the status of jobs in the grid,

• retrieve the output of finished jobs from the grid,


Figure 3.9: Hierarchical tree structure of the MDS and the BDII.

• copy, delete and replicate files from the grid,

• get information about the various resources available in the grid.

The available APIs also allow the development of grid-enabled applications.

3.3 CREAM Computing Element

A CE has a complex structure. It represents an interface to a large computing resource managed by a Local Resource Management System (LRMS), such as PBS and LSF. Additionally, CEs also provide grid user authentication and authorisation, accounting, fault tolerance and improved performance and reliability on top of these batch systems.

The Computing Resource Execution and Management (CREAM) [75] is an efficient job management system used for grid environments. It is the CE implementation in the gLite middleware. The main goal of CREAM is to provide a robust, simple and efficient service for grid job operations. CREAM is written in the Java programming language and it uses Tomcat as a server. It includes an interface that is based on Web Services and is very flexible and compatible with different clients that are written in different programming languages. CREAM is widely used in different areas such as bio-medicine, physics, astrophysics, high energy physics, finance, etc.

The main functionality of CREAM is the job submission. CREAM supports several activities, such as

• the execution of batch and parallel jobs (MPI) [76],

• the transfer of the Input Sandbox (ISB), a set of files needed for the job execution, to the executing node, from the UI or WMS node or from a grid storage server,


• job management operations, like querying the job status, job cancellation, job posting and job purging. Users are also allowed to suspend or restart submitted jobs.

Users submit their jobs, which are described using JDL as the script language, to the CREAM CEs. The JDL used for CREAM and the WMS is the same (see Section 2.1.6.6). This means that the gLite WMS and CREAM can be integrated: CREAM can work as a grid component of the gLite WMS, and user jobs that are submitted to the gLite WMS can be forwarded to CREAM-based CEs for further execution. The CREAM-WMS integration works together with a separate module named Interface to CREAM Environment (ICE) (Figure 3.10).

Figure 3.10: CREAM-WMS integration.

ICE receives job requests from the WMS and then uses the corresponding CREAM methods to perform the request. ICE is also responsible for monitoring the job status and for taking the appropriate decision if the job status changes.

3.4 The dCache Storage Element

One of the most used storage management systems is the so-called dCache system [77]. dCache is a storage management system developed at DESY in collaboration with a group at Fermilab since 2003.

dCache is a highly sophisticated storage management system used within the WLCG. It is a very effective system that is able to deal with several hundreds of terabytes of data. dCache stores and provides huge amounts of data that are distributed over different geographical locations. It provides a single virtual file system with different standard access methods and a disk pool management system with or without a tape backend. dCache also has an automatic disk load balancing system.


The dCache system is written in Java and it uses standard open source technologies. The main functionality and features of the dCache system are the following:

• Hierarchical Storage Management

• Manages the enormous amounts of data (no hard limit, tested in the petabyte range)

• Modular design and portability, highly configurable and customisable system

• Uses filesystems, such as ext4, XFS or btrfs, to store the actual data on the storage nodes

• Authentication is based on the usage of X.509 certificates, username+password and Kerberos

• Supports access protocols, such as DCAP/gsiDCAP (native protocols), NFS, GridFTP, HTTP/WebDAV, SRM

• Scriptable administration interface with terminal-based and graphical front-ends

• XML information provides detailed live information about the storage system

• Web-interface shows live summaries of the most important information

• Powerful cost calculation system that allows the control of data flow (from and to pools, between pools and also between pools and tape)

• Load balancing and performance tuning by "hot replication" (via cost calculation and replicas created by pool-to-pool transfers)

• Garbage collection of replicas, depending on their flags, age, etc.

• Space management and support for space tokens

The user reads/writes data to the cluster as shown in Figure 3.11.

Figure 3.11: User transfers data to the cluster.

First, the user with a valid X.509 certificate connects to a door with the desired protocol. Then dCache determines the pool from which the data should be read or to which the data should be written. Depending on the protocol and the configuration, the transfer of the actual data can happen either directly between the client and the pool or indirectly via the door.

3.5 Description of GoeGrid

GoeGrid [78] is a grid computing resource center located in Göttingen. GoeGrid is mostly used by university scientists who are doing research in different departments such as high energy physics, computer science, theoretical physics, astrophysics and biomedicine. Besides being a grid center, GoeGrid is also a Tier-2 for the ATLAS experiment within the WLCG (see Chapter 4). The GoeGrid Tier-2 center is located at the Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG).

The GWDG hosts the GoeGrid Tier-2 and serves as the main computer center for the University of Göttingen and the local Max Planck Institutes. Besides that, it also maintains the hardware, the network infrastructure and the water-cooling of the racks.

Maintaining the performance and availability of the GoeGrid Tier-2 center is a complex task and requires the necessary hardware and the corresponding software administration tools. For the different departments, the requirements in terms of hardware and software setup differ. Therefore, every department makes its own contribution to the center. The benefit is an efficient use of computing and human resources.

The ATLAS detector has approximately 100 million readout channels to detect collision events. This leads to a data rate of 1 TB/s. The GoeGrid cluster provides 634 CPUs and 3496 cores.

All communities that are using GoeGrid agreed on the usage of a common software setup, i.e. on the requirements of the various applications. The structure of GoeGrid is shown in Figure 3.12.

CE and the batch system
In GoeGrid, the batch system used is TORQUE [53] and the job scheduler is MAUI [53]. TORQUE is an open-source resource manager used to control batch jobs and distributed computing resources. It is an advanced product in terms of scalability, reliability and functionality, and it is currently used by thousands of leading academic, government and commercial sites throughout the world. TORQUE is one variant of PBS (see Section 2.1.4.1) and has similar functionality.

MAUI is a job scheduler mainly used for clusters and supercomputers. It supports multiple scheduling policies, dynamic priorities, reservations and fair-share capabilities. In GoeGrid, taking into account that the resources are shared between several scientific groups, the scheduling algorithm used is the fair-share scheduling algorithm [79]. The fair-share algorithm ensures the control of the shared resources and allows site administrators to set system utilisation targets for users, groups, accounts, etc.
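The following sketch illustrates the basic fair-share idea in a strongly simplified form; it is not MAUI's actual algorithm, only an illustration of how a group whose recent usage exceeds its configured target share gets a lower job priority, and vice versa.

# Strongly simplified illustration of fair-share scheduling (not MAUI's
# actual implementation): the priority of a group's jobs is shifted by the
# deviation of its recent usage from its configured target share.
def fairshare_priority(base_priority, target_share, recent_usage_share, weight=100.0):
    deviation = target_share - recent_usage_share  # both given as fractions
    return base_priority + weight * deviation

# A group entitled to 40% of the cluster that recently used 55% is ranked
# below a group that stayed under its 20% target.
print(fairshare_priority(1000, 0.40, 0.55))  # 985.0
print(fairshare_priority(1000, 0.20, 0.05))  # 1015.0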

GoeGrid uses CREAM (see Section 3.3) as a computing element manager. For security reasons, GoeGrid has two different CE instances which are running CREAM. The CREAM component takes care of the communication with the batch system (PBS/TORQUE). It accepts two types of jobs: regular batch jobs and parallel jobs. For regular jobs there are no special software requirements, while for parallel jobs the MPI library is needed as it allows inter-process communication.


Figure 3.12: GoeGrid structure.

Storage system
The GoeGrid storage system is based on dCache (see Section 3.4). It allows the cluster to have a single storage space with a general file system. The storage space is divided into several space tokens such as GOEGRID_DATADISK, GOEGRID_DET-INDET, GOEGRID_LOCALGROUPDISK, GOEGRID_PHYS-EXOTICS, GOEGRID_PHYS-HIGGS and GOEGRID_SCRATCHDISK. Each of these space tokens has its own meaning and is meant for a specific usage. This is an official ATLAS Distributed Computing (ADC) requirement. The access to the space tokens is guaranteed by the SRM protocol.

In addition, the SAN [80] mass storage system of the GWDG provides approximately 30 TB of storage space. The SAN is maintained by the GWDG.

Information provider
The calculation and management of the available computing resources is very important as it leads to an effective resource management. For each ATLAS site, the computing resource information needs to be available within the collaboration. In GoeGrid, this task (providing the CPU accounting information) is done by the APEL (Accounting Processor for Event Logs) [81] service.

APEL is the fundamental tool for accounting and publishing the CPU usage information in the grid infrastructure. APEL parses the log information of the batch system (in the GoeGrid case PBS/TORQUE) to produce CPU job accounting records for the grid systems. More precisely, it collects the complete information about the usage of computing resources by user jobs. APEL then publishes the accounting data into an accounting record repository at a Grid Operations Center (GOC) (in the GoeGrid case, the center is located at the Rutherford Appleton Laboratory (RAL) [82]).

APEL consists of three main components: the APEL parser, the APEL core and the APEL publisher. The APEL parser gets the needed information (the status of submitted and finished jobs) from the batch system via the computing element. The APEL core stores the aggregated information in the APEL database. The APEL publisher gets the accounting data from the APEL database and publishes them in the GOC.
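As an illustration of the parsing step only, the sketch below extracts key/value pairs from a TORQUE-style accounting record. The record layout shown (date;record type;job id;key=value pairs) is an approximation and the sample line is invented; the real APEL parser is considerably more complete.

# Illustrative sketch of the parsing step (not the real APEL code).
# A TORQUE accounting record is roughly "date;record_type;job_id;key=value ...";
# the exact fields depend on the TORQUE version and the site configuration.
def parse_torque_record(line):
    date, record_type, job_id, attributes = line.strip().split(";", 3)
    record = {"date": date, "type": record_type, "job_id": job_id}
    for item in attributes.split():
        if "=" in item:
            key, value = item.split("=", 1)
            record[key] = value
    return record

sample = ("04/11/2016 14:06:25;E;12345.goegrid-ce;user=atlasprd group=atlas "
          "resources_used.cput=01:58:20 resources_used.walltime=02:03:11")
print(parse_torque_record(sample)["resources_used.walltime"])  # 02:03:11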

The BDII service (see Section 3.2.2) provides detailed information about the site and its available resources. The GoeGrid BDII provides information about the available CPU cores and the running grid services required by the ADC management.

Monitoring
Monitoring of such a big cluster needs to be done by an efficient multi-level monitoring system. This means the usage of different tools for monitoring the status of hardware and software. For GoeGrid we use tools such as Nagios [83], HappyFace (see Chapter 6), Ganglia [86], etc.

Nagios [83] is a professional open-source monitoring tool for the entire IT infrastructure. Nagios provides a notifying system for the servers, WNs, switches and other possible hardware. The Nagios GoeGrid web interface is shown in Figure 3.13.

In this example, the Nagios monitoring tool displays the status of the apel, bdii and pbs hosts that are running in GoeGrid. It shows the current load of the hosts, the status of different partitions, processes, etc., as well as the last check time, the duration of the test, the number of attempts and the status information.

If the test of some particular service fails, then the whole line is marked in red and the status changes to CRITICAL. More detailed information about the hosts or services can be seen by following the structure of the site.

Management Systems
To provide available and reliable services for the grid site, the first part to focus on is the management and the control of the existing hardware. Performing software installations and upgrades of worker nodes manually is a time consuming task, especially if the number of worker nodes is more than 300. In order to cope with this kind of problem, an automatic installation tool is needed through which it is possible to perform all necessary actions at once.

The GoeGrid Rocks Tools [84] provide a customised installation mechanism with additional software packages at install time. Typically, the kickstart file [85] contains the OS and other necessary packages, for example Acrobat Reader, the Vidyo connection, etc. Besides this, it provides the functionality of an automatic DNS server configuration. For example, if there is a new worker node, the details about it should be entered in the Rocks database. After that, the new DNS entry will be generated without requiring any changes in the DNS configuration file. Once the new worker node is successfully installed, it will be attached to the monitoring for future system administration and maintenance. In the case of GoeGrid, the newly installed worker nodes are attached to the Ganglia [86] monitoring tool.

Through the Rocks Tools, the management and the control of the existing hardware become easily adaptable and efficient.


Figure 3.13: Nagios GoeGrid web interface.


Another management tool which is used in GoeGrid is CFengine [87]. CFengine is widely used for large-scale computing facilities. It is a language-based system administration tool where the system maintenance is automated and the worker node configurations are defined in a central file.

GoeGrid uses CFengine in order to automate a large number of routine administrative tasks, which would otherwise require manpower.

Middleware
For the HEP experiments, the requirements are set by the WLCG. The main requirements are the choice of the grid middleware, which is gLite, and of the operating system for the WNs, which is Scientific Linux CERN 6 (SLC6) [88]. SLC6 is compatible with all community-specific requirements, the applications and the middleware.


Chapter 4

ATLAS Computing

4.1 ATLAS Computing Model

The ATLAS computing model [89] is mainly designed to allow all ATLAS members to have quick access to all reconstructed data for running different analyses. Additionally, it provides access to the corresponding raw data for managing calibration, monitoring and other activities.

This model uses the Grid Computing concept (see Section 2.1) and is based on the grid system itself. Therefore, a number of different software packages have been developed and designed by CERN to fulfil the ATLAS requirements. The developed software packages serve different purposes such as monitoring, job submission, data transfer and others. This chapter describes the main software that is used in ATLAS, such as AGIS (see Section 4.2), PanDA (see Section 4.3), DDM (see Section 4.4), SSB (see Section 4.5) and others.

4.2 ATLAS Grid Information System (AGIS)

Each year the ATLAS experiment produces petabytes of data that need to be distributed all over the world according to the ATLAS computing model. It provides quick access to all the necessary data for all ATLAS Collaboration members, such as reconstructed data and raw data for calibration and alignment activities, according to their roles or rights in the community.

The ATLAS Computing Infrastructure proposed the creation of a central information system to store various parameters and configurations that are required by many other ATLAS software components. A centralised information system solves the problem of inconsistencies between the data stored in various ATLAS services and the data stored in various external databases.

The ATLAS Grid Information System (AGIS) [90] is the central information system that integrates different configuration parameters for operating the ATLAS Distributed Computing (ADC) [91] system and its services. This includes the collection of different parameters from independent sources, such as the gLite BDII (see Section 3.2.2), the Grid Operations Centre Database (GOCDB) [92] and the Open Science Grid Information services (MyOSG, OSG Information Management System - OIM) [93]. Currently, AGIS is the main source of information for the ADC applications, the ATLAS Distributed Data Management (DDM) system (see Section 4.4) and the ATLAS Production and Distributed Analysis (PanDA) workload management system (see Section 4.3).


The architecture of AGIS is based on the client-server computing model. It is written in Python and uses the Django [94] framework. As a database backend, it uses the Oracle database. AGIS caches information from different sources. Data synchronisation in AGIS is done by different agents that periodically check the sources via standard interfaces and, in case of changes, update the content of the AGIS database. The AGIS system provides different APIs, WebUIs and CLIs for data retrieval and management.

AGIS stores the complete information about the ATLAS topology such as clouds, regional centers, pledges, site specifics (geography, time zone) and other information. It also stores service and resource information of the different ATLAS Tier sites, such as the description of the Local File Catalog (LFC) [95], the File Transfer Service [71], [96], the Computing Element (CE) (see Section 2.1.4), the Storage Resource Manager [97], Squid [98], Frontier [99], the perfSONAR [100] services and others.

The web interface provides users access to the AGIS information through a web browser. Users can visualise and modify the stored information depending on their roles and privileges. To update the information, users need to be granted a corresponding write permission. AGIS uses X.509 user certificates for authentication and authorisation.

AGIS shows the information for all the ATLAS sites. Figure 4.1 shows an example of the GoeGrid services displayed in the AGIS web interface. The table shows all running service types in GoeGrid with the corresponding endpoints and statuses. Figure 4.2 shows an example of the GoeGrid PanDA queues displayed in the AGIS web interface. The table shows the PanDA analy, production and mcore queues with the corresponding statuses for the GoeGrid site. The AGIS system is currently in production and represents the main source of information for all the ADC applications and services.

4.3 Distributed Job Management System

The Production and Distributed Analysis (PanDA) [103] system was first developed by ATLAS collaboration members in 2005. Initially it was developed for US-based ATLAS production and analysis jobs and later, in 2007, it was adapted to the needs of the whole ATLAS collaboration.

PanDA is a workload management system for processing distributed analysis and production jobs. PanDA plays a key role in the ATLAS distributed computing infrastructure. It is the only system that processes all ATLAS Monte-Carlo simulation and data reprocessing jobs. Each day PanDA processes approximately a million jobs. Users submit their jobs to the PanDA system via a Python/HTTP client interface. This interface is used for analysis and production jobs.

The main component of the PanDA system is the PanDA server. The PanDA server takes care of the task queues and centrally manages the job execution. Submitted user jobs go to the task queue, where a brokerage module dispatches them according to their input data, type, priority, available WN resources and other parameters. The jobs are then picked up by pilots [104].


Figure 4.1: GoeGrid services displayed in AGIS web interface [101].


Figure 4.2: GoeGrid PanDA queues displayed in AGIS web interface [102].

Pilots retrieve jobs from the task queue and run them on the available WNs. Pilots use the resources efficiently: if there are no available jobs, they exit immediately, and the submission rate is regulated according to the load. Pilots execute jobs, detect stuck processes, recover failed jobs and report the job status to the PanDA server. The schematic view of the PanDA system is shown in Figure 4.3.

The PanDA monitor is the web interface to the PanDA system. It provides summaries of the processing per cloud/site and logfiles, and shows the status of tasks/jobs. The table (Figure 4.4) displays job information for the GoeGrid site.

The most important parameters of the table are marked in red. The computingsite parameter shows the number of running jobs in GoeGrid, which is 2568 in this example. The jeditaskid parameter lists the tasks (job sets) and the number of jobs per task. In this example there are 71 running tasks and 2568 running jobs in total. The jobstatus parameter shows the status of the jobs. In this example the number of failed jobs is 1424. By clicking on the number it is possible to see more detailed information about the failed jobs. Detailed information is written in logfiles. The processingtype parameter lists the task types running on the system.

The BigPanDA project was established in September 2012 as an extension of the PanDA project. The BigPanDA project is more generalised in terms of usage by other experiments. It extended its availability to High Performance Computing (HPC) platforms and has intelligent network awareness.


Figure 4.3: Schematic view of the PanDA system. Taken from [103].

Figure 4.4: PanDA monitor GoeGrid example [105].


4.4 Distributed Data Management System

Each year, the ATLAS experiment produces petabytes of data that need to be distributed and stored in the Worldwide LHC Computing Grid for future processing and analysis. The management of these data is done by the ATLAS Distributed Data Management system (DDM), called Rucio [106]. Rucio is meant to manage these large volumes of data.

The DDM project was established in spring 2005 to develop the system called Don Quijote 2 (DQ2). The main idea was to create a system that is scalable, robust, flexible and fulfils the needs of the ATLAS computing model. This means the creation of a complete data flow of the experiment which includes everything: starting from the archiving of raw data up to the individual physics analysis at home institutes. DQ2 evolved rapidly to help ATLAS Computing operations manage large volumes of data across the many grid sites at which ATLAS runs, and to help ATLAS physicists get access to these data. However, DQ2 reached its limits in terms of scalability, required a large number of support staff to operate and made it hard to keep up with new technologies. DQ2 was widely used between 2008 and 2014. The replacement of DQ2 is the new project called Rucio. Rucio is more flexible and solves almost all open problems of DQ2. Rucio is meant to cope with the requirements of LHC Run 2 in terms of scalability and the expected huge amount of data. In January 2015, DQ2 was officially decommissioned and Rucio started to operate as the new DDM system.

Rucio accounts
The entry point to the Rucio system is a Rucio account. A Rucio account represents individual users, groups of users or any service which needs to perform activities in the ATLAS collaboration. Rucio users identify themselves by an X.509 certificate. User credentials can map to one or more accounts (N:M mapping). Rucio checks the permissions of the credentials to allow the use of a Rucio account. Figure 4.5 displays an example of the mapping between user credentials and Rucio accounts.

Figure 4.5: Mapping between user credentials and Rucio accounts. Taken from[107].


Rucio data identifiers, scope and metadata attributes
ATLAS data are physically stored in files. The data itself consists of persistent C++ objects associated with physics events. In Rucio, the smallest operational units of data are files. Files can be grouped into datasets (sets of multiple files). Datasets, in turn, can be grouped into containers (sets of datasets). Files, datasets and containers (Figure 4.6) follow an identical naming scheme which consists of a name and a scope. The combination of both is called the data identifier (DI).

Figure 4.6: Aggregation hierarchy example. Taken from [106].

Rucio account holders have read access to all scopes and write access to their own scope. Privileged accounts have write access to multiple scopes. Files, datasets and containers are uniquely identified in the system. Even if data are removed from the system, the DI cannot be reused.
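The following sketch illustrates the scope:name naming scheme and the file/dataset/container aggregation hierarchy; it is a plain data-structure illustration with invented names and does not use the actual Rucio client.

# Illustration of scope:name data identifiers and of the aggregation
# hierarchy (files -> dataset -> container); all names below are invented.
container = {
    "did": "data16_13TeV:physics_Main.AOD.grp01",            # scope:name
    "datasets": [
        {
            "did": "data16_13TeV:physics_Main.AOD.grp01_sub001",
            "files": [
                "data16_13TeV:AOD.01234._000001.pool.root.1",
                "data16_13TeV:AOD.01234._000002.pool.root.1",
            ],
        },
    ],
}

def split_did(did):
    # A data identifier is the combination of a scope and a name.
    scope, name = did.split(":", 1)
    return scope, name

print(split_did(container["did"]))  # ('data16_13TeV', 'physics_Main.AOD.grp01')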

Rucio Storage Element
A Rucio Storage Element (RSE) is a repository for physical files. It is the smallest unit of storage space addressable within Rucio. An RSE has a unique identifier and a set of meta attributes such as the supported protocols, the storage type, physical space properties, the geographical zone, etc.

RSEs can be grouped in many logical ways, for example the DE RSEs or the Tier-1 RSEs. It is possible to refer to a group of RSEs using metadata attributes or by simple calculations on RSEs.

In Rucio, caching is a mutable service. The cache is usually managed by an external process or application, such as the PanDA system.

Rucio architecture
The Rucio system is based on a distributed architecture. It provides a command line, Python clients and APIs for users. After successfully passing through the authentication and authorisation component, the user gets a short-lifetime token. Once the user is authenticated, this token is used for the next interactions with the system. The RSE component takes care of the interactions with the various grid middleware tools and storage systems. The Rucio daemons are among the main components that drive the whole system; for example, the Conveyor daemon takes care of the file transfers and the Reaper deals with the file replica deletion. In terms of database usage, Rucio supports several relational database management systems (RDBMS) like Oracle, PostgreSQL or MySQL. The Rucio Probes component takes care of performing various tests. It checks the interactions between Rucio and other ATLAS software such as the ATLAS Metadata Interface (AMI) [109] and AGIS. The Rucio Analytics component is responsible for accounting the data and making up reports.

ATLAS Distributed Data Management (DDM) dashboard
The DDM dashboard [110] is a web interface that is used to maintain the high reliability of the DDM (Rucio) system. It is a widely used interface. Through this interface it is possible to follow the transfers within the ATLAS clouds/sites (Figure 4.7). In this example, transfers from the DE cloud to the DE cloud are at 100%. The number of successful transfers is 15327 and there are only 16 failed transfers. By navigating deeper, detailed information about the transferred datasets, such as time, name, etc., and the reason for failed transfers can be seen.

Figure 4.7: DDM Dashboard interface [110].

There are also statistics plots (Figure 4.8) that display the successful and failed transfers for a given time period.

The plots represent the transfer efficiency, the volume of data and the succeeded and failed transfers for the last 10 min interval for the different clouds. By navigating deeper, the same plots can be seen for the sites within the clouds.

The DDM Dashboard is used constantly by people who are taking computing shifts [112] to investigate failures and to notify the sites in case of failures.


Figure 4.8: DDM Dashboard interface [111].

4.5 Site Status Board (SSB)

Taking into account the large amount of data and the complexity of the whole infrastructure, it is a challenging task to provide a reliable and stable system. Effective monitoring and real-time action-taking is done constantly by ATLAS Distributed Computing (ADC) shifters and experts. Besides manual monitoring and action-taking, an automated tool is needed to solve the problem of delays and to reduce the manual interactions. Based on monitoring data, the performance of different services can be measured and problems can be detected easily. This is very important, as existing problems can then be solved in time. Such a tool was designed and developed by ADC operations and is called the ATLAS Site Status Board (SSB) [113].

The main objective of the SSB is to collect, aggregate and display the monitoring information about the ADC critical activities on a single web interface. Besides this, the SSB includes the ATLAS Site Topology [114] and an automatic alert system. The SSB workflow is shown in Figure 4.9.

The collection and aggregation of information in the SSB is done by different scripts called Collectors. Collectors run periodically and collect data from various external monitoring resources. The collected information is stored in predefined data structures called metrics. Metrics typically contain quantitative or qualitative information about different services. They contain only the current status information about the ATLAS site activities such as data analysis, data transfers and data processing. Old information is stored in a database to follow the status changes of the ATLAS site activities. The SSB thus stores the historical information of the ATLAS site activities.

The SSB has a number of different web views. The SSB default view displays a huge table with all necessary information about the ATLAS sites. The SSB shifter view displays filtered information about all the ATLAS sites (Figure 4.10).


Figure 4.9: The workflow of the ATLAS SSB. Taken from [113].

By navigating deeper, it is possible to see detailed information from the source of the published values or statuses.

The SSB tool is a step towards achieving the main goal of ADC, to reduce the need for manual interaction in the ATLAS Grid Computing activities.

4.6 ATLAS Offline Software and Computing

The ATLAS offline software is meant to process the ATLAS trigger [116] and data acquisition [117] events, as well as to deliver the processed results to physicists within the ATLAS Collaboration. It also provides various tools for physicists to analyse the processed information and to obtain the physics results. A physics requirement is the ability to reproduce the data of a physics process based on the detector data. The data processing requires corresponding simulation tools.

The Athena framework [118] is a common ATLAS-LHCb project and is used by other experiments as well. The Athena framework is designed to handle a wide range of data-processing applications. The component-based architecture provides the flexibility of the Athena framework. It allows the usage of shared components and the development of experiment-specific components. The Athena framework is used for simulation, reconstruction and analysis.

Another commonly used software package is ROOT [119]. ROOT is an open-source software written in C++. It was developed by CERN. ROOT is mainly designed for particle physics data analysis, where scientists produce and analyse enormous amounts of complex data. It includes several libraries specific to this field. ROOT is widely used by several experiments in high-energy physics. It can also be used for astronomy and data mining.


Figure 4.10: An example of the shifter view in the SSB web interface [115].


The complexity of ATLAS requires that the software be highly robust and flexible in order to fulfil the needs of the experiment throughout its operational lifetime. There can indeed be changes in the physics goals and even in the detector hardware. The software must also be able to cope with such changes. The design and implementation of this kind of software is not an easy task. It should be possible to run it in a distributed environment and it should fulfil the technical and the physics performance requirements.


Chapter 5

Monitoring

5.1 The Concept of Monitoring

One of the most promising technologies for studying social and natural processes is monitoring. Nowadays, it is widely used in various spheres of human activity, such as economic, geographic, financial and information monitoring and others. Monitoring technology [120] has been widely used since the early 1970s and immediately gained recognition. The monitoring technology became universal by allowing scientists to obtain the results of their research independently of the scale (global, regional, local, even personal). Depending on the area of research, the scientific literature distinguishes many types of monitoring.

Grid environments are typically distributed and consist of various complicated components. Monitoring of the various components is required for different purposes. This can include status checking, performance tuning, troubleshooting and others. For example, a job has been submitted to the system. The particular job execution is supposed to be complete after approximately 15 min, but for some unknown reasons 3 hours have passed and the job is still not complete. The reasons for this can be different, for example problems with the job script, problems with the worker node where it is running, network problems, a missing library, etc. Monitoring is required to determine the cause of issues within the system. Monitoring provides information for keeping track of the current state of work and helps track down problems. The following are the main use cases of monitoring:

• Troubleshooting and detection of failures: the reason for a failed job in the grid environment is difficult to determine. Detailed monitoring is needed both for online fault detection and for offline analysis.

• Performance analysis and tuning: grid applications usually face performance problems, such as unexpectedly low throughput or high latency. The determination of the performance problem requires a detailed examination of the related components.

• Guiding scheduling decisions: in the grid, the computational resources constantly change and new measurements can be used for optimal or effective resource allocation and scheduling decisions. In this case, the resource monitoring is critical.

In the ATLAS experiment, monitoring systems are rather complicated. It is impossible to have only one monitoring tool for such huge collaborations using such a large-scale infrastructure. For this reason, many monitoring systems have been designed and implemented. Each monitoring system uses its own architecture for presenting data values and is designed for monitoring a certain part of the experiment. The site administrators have to check different web sources and even operate grid commands manually in order to get the full information about the site status. However, the majority of monitoring systems can investigate the sites and their troubles in the HEP experiments within the WLCG. This complicated procedure takes us to a new concept, called meta-monitoring.

5.2 Meta-Monitoring

Monitoring of the monitoring system is called meta-monitoring [121]. A meta-monitoring system detects situations where the monitoring system itself has a problem. The monitoring system needs to be more available and reliable than the services being monitored. Every monitoring system should have some kind of a meta-monitoring system. Even the smallest system needs a simple check to make sure that it is still running. By using a meta-monitoring system, it is possible to be aware of the completeness of a monitoring system. Meta-monitoring is the answer to the question: is there something that is missing? It is impossible to take into account every particular detail. Some aspects will always be overlooked. But with meta-monitoring it is possible to limit the unknown to a bare minimum. For larger systems, the necessity of a meta-monitoring system is more noticeable.

In ATLAS, for example, the PanDA Monitor and the DDM Dashboard have very complicated web interfaces, as they include all the information about the jobs and file transfers for all the WLCG sites. The extraction of the necessary information from the PanDA Monitor and the DDM Dashboard takes time, especially for a person who does not do it regularly. In order to have an overview of the status of a particular grid site, the navigation through numerous websites or a manual check through the Linux terminal is required. Of course, this causes some inconveniences, but on the other hand the multi-monitoring is often necessary to determine the main problems that are only visible by correlating smaller ones. This leads to the conclusion that the WLCG grid site status is spread across numerous WLCG websites, and the usage of numerous monitoring tools leads to the need of having a meta-monitoring tool. The development of a meta-monitoring tool with a single web interface, which displays the most important information for the grid site, is mandatory. For the WLCG, an example of such a meta-monitoring tool is HappyFace.

5.3 Monitoring and Meta-Monitoring Tools

Undoubtedly, the role of monitoring and meta-monitoring tools is very important for the correct functioning of the LHC. Monitoring and meta-monitoring are a necessary condition for sustainable development, as on the basis of their results it is possible to get information about the current state and even predict future developments.

The preferable components [122] that each meta-monitoring or monitoring system should have are the following:

• Interface: should provide an interface for accessing the information, for example a simple and generic API for querying information from other programs.


• Single access point: the final output should have a single interface, which shows all requested information from the existing sources.

• Fast accessibility: provide fast access to the requested information from the existing sources.

• Modular structure: the modular structure provides easy/visible output.

• Up-to-date monitoring information: the tool should provide fresh or up-to-date information.

• History functionality: should be able to show old data.

• Comfortable usage: the tool should be easy to use.

• Extensible: it should be possible to extend the existing system and to monitor new resources or data. It should also be possible to provide easy mechanisms for implementing such extensions.

• Security: it should be possible to provide both a secure connection to the monitoring system by using secure protocols and security at the level of user privileges.

In the grid, the resources can change frequently. New resources and services can be added and removed. Moreover, grid resource and service properties can change as well. An example is when a storage element is upgraded to a larger storage capacity: different access rates and protocols can then be supported. These kinds of dynamic changes complicate both the resource discovery, in terms of finding the appropriate resources to fulfil a task, and the resource monitoring, in terms of observing resources or services to keep track of status changes and to fix problems.

The Globus Toolkit solves this problem with the MDS [123], which is a collection of components implemented to monitor and discover resources and services.

The HappyFace project is my research topic and will be discussed in detail in the next chapter.


Chapter 6

The HappyFace Project

6.1 Description of HappyFace

6.1.1 Introduction

The HappyFace project [124] is a joint collaboration between the Georg-August-University Göttingen and the Karlsruhe Institute of Technology (KIT). The first version of HappyFace (HappyFace 1.0) was designed by KIT in 2008. In 2009, HappyFace 2.0 was released, which relied on a more sophisticated and generic approach. The current version of HappyFace is HappyFace version 3, which was released in 2013 and fully replaces HappyFace version 2. HappyFace version 2 and version 3 are very different from each other in terms of source code structure and development. The main differences between HappyFace version 2 and version 3 are shown in Figure 6.1.

Figure 6.1: The main differences between HappyFace version 2 and version 3.

The development of new modules takes place at the Georg-August-University Göttingen, KIT and the Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen University. Besides KIT and the University of Göttingen, other institutions such as RWTH Aachen University, the DESY research centre and the High Energy Accelerator Research Organisation (KEK) are using HappyFace for their purposes.


Initially, HappyFace was designed as a meta-monitoring tool that aggregates, processes and stores information from various monitoring tools into a single web interface. Nowadays, HappyFace is not only a meta-monitoring tool, but also a monitoring tool that aggregates information from the grid system directly.

HappyFace has a modular structure where each module is responsible for extracting, processing and storing data from defined WLCG monitoring sources.

The modular structure gives several advantages:

• It supports an agile software development style [125]. This is very important as many developers can work on different modules simultaneously.

• Edited modules can be easily deployed and, in case of failures, the previous version can be restored simply.

• New modules can easily be plugged in and, in case of failures, will not harm the work of the whole project, but only the corresponding problematic module.

• It makes the project more flexible, as WLCG sources and their corresponding modules might change.

The HappyFace web interface displays all information that is stored in the HappyFace database using an HTML Mako template [126]. In terms of the database, HappyFace uses a SQLite database [127]. The usage of that database enables the history functionality of the HappyFace system, and stored information can be accessed at any time. The core system is called HappyFaceCore. It includes the individual modules, the web output and the configuration files. HappyFaceCore also contains certain functions that are necessary to download the content from a certain web interface, to run the software and to enable the Python plot generator [128]. HappyFace is flexible, easily configurable and can be adapted to site-specific requirements. In the following sections, the HappyFace workflow, the web interface, the category/module configuration and the database are described in detail.

6.1.2 Basic Workflow

The workflow of HappyFace is shown in Figure 6.2. For each category, there can be an unlimited number of modules. All necessary details about these categories/modules are specified in dedicated configuration files. There exists exactly one configuration file per category/module.

For each module there are three main files, namely a Python script, an HTML file and a config file, which need to be created beforehand in order to have a running module.

The acquire.py Python script downloads data, parses them and stores them in the database. The acquire.py Python script is executed periodically every 15 minutes, e.g. via a cron job [129] under Linux/Unix operating systems.

The render.py Python script retrieves data from the database and renders them into the HTML template file.

Each module has an implemented function that accesses the previously stored data in the HappyFace database. The render script executes this function for each module and provides the extracted data to the module HTML template file.


Figure 6.2: The workflow of HappyFace. Taken from [122].

Each module can have only one table and an unlimited number of subtables. However, if there are changes in the database structure, the whole database schema needs to be updated. This is done by using the tools.py Python script.

The full cycle of HappyFace is the following:

• initializing modules with configuration parameters,

• downloading necessary sources (mentioned in the configuration files),

• aggregating and extracting the necessary data from the downloaded sources,

• storing time-stamped data in the HappyFace database (HappyFace.db),

• displaying stored data on the HappyFace web interface.
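The following is a schematic sketch of this cycle in the form of a single module; the class and method names are illustrative and do not reproduce the actual HappyFace 3 module API.

# Schematic sketch of the HappyFace cycle described above; class and method
# names are illustrative and not the real HappyFace 3 API.
import sqlite3
import time
import urllib.request

class ExampleModule:
    def __init__(self, config):
        # 1. initialise the module with configuration parameters
        self.source_url = config["source_url"]
        self.db_path = config.get("database", "HappyFace.db")

    def acquire(self):
        # 2. download the source mentioned in the configuration file
        raw = urllib.request.urlopen(self.source_url).read().decode()
        # 3. aggregate/extract the necessary data (here: a trivial line count)
        n_lines = len(raw.splitlines())
        # 4. store time-stamped data in the database
        with sqlite3.connect(self.db_path) as db:
            db.execute("CREATE TABLE IF NOT EXISTS mod_example "
                       "(run_time REAL, n_lines INTEGER, status REAL)")
            db.execute("INSERT INTO mod_example VALUES (?, ?, ?)",
                       (time.time(), n_lines, 1.0))

    def render(self):
        # 5. retrieve the latest stored data for display on the web interface
        with sqlite3.connect(self.db_path) as db:
            return db.execute("SELECT * FROM mod_example "
                              "ORDER BY run_time DESC LIMIT 1").fetchone()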

6.1.3 Web Interface

The HappyFace web interface is divided into several navigation bars (Figure 6.3), which are the history, category and module navigation bars, and the module content.

By changing the date/time in the history navigation bar, the stored information from the HappyFace database is retrieved and displayed in the module content. 15 minutes is the default value of the time interval in the history navigation as well as for running the acquire.py Python script.

Access to the categories and their modules can be open, permod, or restricted. It depends on the parameters specified in the category configuration file. The default value is open, which allows full access for the user to any category/module. If permod is specified in the category configuration file, then the category will be open but some of the modules will be restricted. If restricted is specified, only authorised users can have access to the entire category/module. The category navigation bar has a simple rating system (see Section 6.1.4) through which it reacts to any module status change and displays the corresponding icon on the web interface. This kind of modular structure is very intuitive, as with one look it is possible to get an overview of the ongoing issues on the site.


Figure 6.3: HappyFace web interface.

6.1.4 Rating System

HappyFace has a very simple rating system. Each module has a rating value (Figure 6.4).

Figure 6.4: HappyFace rating system.

Depending on the rating value, the corresponding icon (coloured arrow) is displayed on the web interface. If the rating value is between 0 and 0.33, the status is critical and the arrow will turn red and will point down. If the rating value is between 0.33 and 0.66, then the status is warning and the arrow will be yellow and horizontal. If the rating value is between 0.66 and 1, then the status is ok and the arrow will be green and will point up. If, for some reason, the module execution fails, then the value is set to -1 and if the data retrieval from the database fails then it is set to -2. In case of both negative values at the same time, the icon will contain a small exclamation mark. Usually in every category there are several modules. In this case, the overall category rating value depends on the rating values of the existing modules in that particular category.

The category rating value is calculated by one of the following algorithms: unrated, worst and average. The rating algorithm is defined in the category configuration file. If the algorithm unrated is set in the configuration file, then the module statuses will not be considered. If the algorithm worst is set in the configuration file, then the minimum value over all module statuses will be taken. Finally, if the algorithm average is set in the configuration file, then the average value over all module statuses will be taken.
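
The three category rating algorithms can be sketched as follows; again, this is an illustration rather than the actual implementation, and the function name is hypothetical.

def category_rating(module_ratings, algorithm):
    # Combine the module ratings into a category rating (unrated, worst or average).
    if algorithm == "unrated" or not module_ratings:
        return None  # the module statuses are not considered
    if algorithm == "worst":
        return min(module_ratings)  # minimum value over all module statuses
    if algorithm == "average":
        return sum(module_ratings) / len(module_ratings)
    raise ValueError("unknown rating algorithm: %s" % algorithm)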

6.1.5 Database Structure

The HappyFace database has three main tables: module_instances, hf_runs and mod_className (Figure 6.5).

Figure 6.5: HappyFace database schema.

The table module_instances stores information about the name of the module and the instances connected with it, as shown in Figure 6.5. Each module can have several instances (Figure 6.6).

Figure 6.6: HappyFace module_instances table content.

The table hf_runs stores information about the status of each execution of the acquire.py Python script, i.e. an id, a time stamp and a boolean flag that indicates a successful or failed execution.

The table mod_className stores general information about the module, such as its status, an error message, a description, etc. The name of the table is generated automatically by prefixing mod_ to the module class name written with underscores, i.e. if the module class name is JobsStatistics, then the table name will be mod_job_statistics.

Each module can have multiple subtables. The columns of the subtables are defined manually in the Python script of each module. Each time the acquire.py Python script is executed, one or more rows are inserted into the existing module tables and subtables and one row is inserted into the hf_runs table.

6.1.6 Module Development

The module development in HappyFace is straightforward, provided that the structure of the source code is known.

Each HappyFace module consists of four main files: two configuration files (one for the category and one for the module), a Python script and an HTML template. In order to plug the module into the framework mentioned above, this structure needs to be followed.

The module Test is a working module and is presented here as an example in order to illustrate the workflow of HappyFace at the source code level. It is successfully plugged into the HappyFace framework and runs without any errors.

The content of the category configuration file (test.cfg) for the Test category is shown in Figure 6.7. The category configuration file is located at the /var/lib/HappyFace/config/categories-enabled path.

Figure 6.7: The category configuration file for the Test category.

• name – contains the name of the category that will appear in the web interface.

• description – contains a description of what the category will be used for.

• modules – contains the module(s) associated with the category.

The content of the module configuration file (test.cfg) for the Test module is shown in Figure 6.8. The module configuration file is located at the /var/lib/HappyFace/config/modules-enabled path.

• module – contains the name of the Python script that is associated with that particular module.

• name – contains the name of the module that will appear in the web interface.

• description – contains a description of what the module(s) will be used for.


Figure 6.8: The module configuration file for the Test module.

Figure 6.9: The module Python script for the Test module.


The module Python script (Test.py) content for the Test module is shown in Figure 6.9. The module Python script is located at the /var/lib/HappyFace/modules path.

All module classes are inherited from the main hf.module.ModuleBase class, which determines the basic structure of all modules. The config_keys are variables that will be read by the ConfigParser in the script.

Each module has one default table (table_columns). If there are no other columns set in table_columns, then it will have the default table structure (Figure 6.5).

As mentioned before, each table can have an unlimited number of subtables. Subtables are meant for storing additional information, which is not stored in the default module table. In this example, the subtable_columns table has two columns: test and status.

The Python script contains four main defined functions. To create new modules, the implementation of these functions is mandatory (a minimal module skeleton is sketched after the following list).

• prepareAcquisition() - used for commissioning the monitoring data downloads,

• extractData() - used for the extraction of the necessary information from the previously downloaded files,

• fillSubtables() - used for inserting the extracted data into the HappyFace module subtable(s),

• getTemplateData() - used for creating variables, which can be accessed from the HTML mako template.
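
A minimal sketch of such a module class is given below. It follows the structure described above (the hf.module.ModuleBase base class, config_keys, a default table, one subtable with the columns test and status, and the four mandatory functions), but it is a simplified illustration and not the verbatim Test.py of Figure 6.9; in particular, the exact declaration syntax of table_columns and subtable_columns is assumed here.

import hf
from sqlalchemy import Column, TEXT

class Test(hf.module.ModuleBase):
    # variables read by the ConfigParser from the module configuration file
    config_keys = {'source_url': ('URL of the monitoring source', '')}
    # default module table without additional columns
    table_columns = [], []
    # one subtable with the two columns used in this example
    subtable_columns = {'details': ([Column('test', TEXT), Column('status', TEXT)], [])}

    def prepareAcquisition(self):
        # commission the download of the monitoring data
        self.details_db_value_list = []

    def extractData(self):
        # extract the necessary information from the previously downloaded files
        self.details_db_value_list.append({'test': 'example', 'status': 'ok'})
        return {'status': 1.0}

    def fillSubtables(self, parent_id):
        # insert the extracted data into the module subtable
        self.subtables['details'].insert().execute(
            [dict(parent_id=parent_id, **row) for row in self.details_db_value_list])

    def getTemplateData(self):
        # create the variables that can be accessed from the HTML mako template
        data = hf.module.ModuleBase.getTemplateData(self)
        data['details'] = self.details_db_value_list
        return data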

The module HTML template file (Test.html) content for the Test module is shown in Figure 6.10. The module HTML template file is located at the /var/lib/HappyFace/modules path.

Figure 6.10: The module HTML template for the Test module.


By running the acquire.py Python script, the module is executed and the output is displayed on the web interface. The web interface for the Test module is shown in Figure 6.11.

Figure 6.11: The web interface for the Test module.

As we can see, the output of the Test module is a simple table that displays the information stored in the HappyFace database.

6.2 HappyFace for the ATLAS Experiment

HappyFace makes a clear separation between HappyFace core modules and individual modules. Each module retrieves its specific information from external source(s), processes it and displays the relevant information on the web interface.

The Tier-2 GoeGrid center, which is located at GWDG in Göttingen (see Section 3.5), is one of the ATLAS Tier-2 centers. Initially, the HappyFace project was designed and developed mainly at the Karlsruhe Institute of Technology (KIT). KIT is involved in the LHC CMS experiment, which means that the HappyFace modules were designed and developed mainly for the CMS experiment. Any other site, for example GoeGrid, which would like to use HappyFace, needs to install and adapt a HappyFace instance according to its own site requirements. Since some of the modules are of no use for the ATLAS experiment, they need to be removed or fully rewritten. In some particular cases there is a need to redevelop the whole module from the beginning.

So far, a number of different non-ATLAS and ATLAS modules have been designed and developed by the Göttingen group members. The list of developed modules is given in Figure 6.12.

All developed modules are generic and are easily adaptable to any ATLAS grid site in the WLCG.

More detailed information about the above-mentioned modules can be found in the following works [130], [131], [132].


Figure 6.12: The list of modules developed by the Göttingen group members.

6.3 Grid-Enabled HappyFace

6.3.1 The Concept and Design

Summarising Section 6.1, HappyFace aggregates, processes and stores the information and the status of various HEP monitoring resources into the common database of HappyFace. It displays the information and the status through a single interface. However, this model of HappyFace relied on the monitoring resources, which are always under development in the HEP experiments (Figure 6.13). This made the system unstable.

Figure 6.13: HappyFace initial model. Taken from [122].

Accordingly, HappyFace needs to have direct access methods to the grid applications and grid service layers. To cope with this issue, we use a reliable HEP software repository, the CernVM File System (CVMFS) [133], [134], supplying grid environments and available HEP tools.

We propose a new implementation and architecture of HappyFace, the so-called grid-enabled HappyFace. It allows its basic framework to connect directly to the grid user applications and the grid collective services, without involving the monitoring resources in the HEP grid systems (Figure 6.14).

This approach gives HappyFace several advantages:


Figure 6.14: HappyFace modified model. Taken from [122].

• Portability: providing an independent and generic monitoring system among the HEP grid systems.

• Functionality: allowing users to perform various diagnostics in the individual HEP grid systems and grid sites.

• Flexibility: making HappyFace beneficial and open for the various distributed grid computing environments.

These factors need to be taken into consideration when implementing new extensions in HappyFace to accept the new monitoring resources described above.

The Göttingen group members designed the grid-enabled HappyFace as an extension of the HappyFace system. In order to have access to the grid system, there are certain requirements:

• Proper X.509 certificate

• Integration with grid environments and available HEP tools

• New source code implementation as an extension

The grid-enabled extension is based on the concept of object-oriented programming (OOP) (see Section 6.3.1.1) and uses design patterns (see Section 6.3.1.2) for structuring the classes and objects.

An implementation of the new extension is described in detail in Section 6.3.2.

6.3.1.1 Object-Oriented Programming (OOP)

Object-oriented programming (OOP) [135] refers to a software programming paradigm that is based on the concept of objects. It gives the possibility to manually create a data structure that can contain a collection of different types of variables (attributes) and functions (methods). This kind of data structure is called a class. Objects are instances of classes, which typically determine their types.

OOP is supported by almost all main programming languages (Python, C++, Delphi, Java, C, Perl, Ruby, PHP, etc.).

OOP is based on three main concepts:


• Encapsulation: it binds together data attributes and methods. In addition, the internal representation of an object is generally hidden from the external object definitions. Typically, only an object's own methods can directly inspect or manipulate its data attributes.

For example, in the Java language the class attributes and methods can have access modifiers such as private, public and protected. The access modifiers restrict the access to the defined attributes or methods.

– private: only the current class has access to the attributes and methods.

– protected: only the current class and its sub-classes have access to the attributes and methods.

– public: any class can refer to the attributes and call the methods.

In Python, the private methods have names that start with an underscore.

• Inheritance: a feature that represents the relationship between different classes. This allows classes to be arranged in a hierarchical way. For example, class A can inherit from class B. This means that all the attributes and methods existing in class B will appear in class A. In addition, class A can also have its own attributes and methods. Sub-classes are also allowed to override the methods defined in the superclass (parent class).

• Polymorphism: the ability to handle objects differently according to their data type or class. An example is shown in Figure 6.15.

Figure 6.15: An example of the source code that shows the class inheritance andpolymorphism.
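
Figure 6.15 itself is not reproduced here; a minimal sketch of such a hierarchy, assuming animal classes that each override a voices() method (the class names are chosen only for illustration), could look like this:

class Animal:
    def __init__(self, name):
        self.name = name
        self._id = id(self)  # leading underscore: treated as private by convention

    def voices(self):
        # overridden by the sub-classes
        raise NotImplementedError

class Dog(Animal):
    def voices(self):
        return "%s says woof" % self.name

class Cat(Animal):
    def voices(self):
        return "%s says meow" % self.name

# the same call behaves differently depending on the class of the object
for animal in (Dog("Rex"), Cat("Tom")):
    print(animal.voices())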


Depending on the class of the object, the corresponding voices() method will be called. The output is shown in Figure 6.16.

Figure 6.16: The output of the Python script.

The main benefit of polymorphism is that it simplifies the programming interface. It supports the overriding of methods, which stands for using the same method name multiple times. This means that instead of inventing a new name for each new method, the same name can easily be reused.

6.3.1.2 Design Patterns

Designing an optimal or effective OOP structure at the first attempt is a very difficult and almost impossible task. To define class interfaces, the relationships between them and the necessary methods in a way that they will not have to be modified or redesigned later is impossible. It is even more difficult to make them general and reusable. The design of a class should be generic enough to solve future problems or fulfil future requirements, but at the same time specific enough to solve a particular problem.

Programmers typically design classes in some way, then reuse them and thus modify them multiple times. Before the design of the classes can be considered final, it must be tested. Modifying the code can be a tedious task.

Design patterns [136] in computer science are a formal way to document solutions to common problems at the beginning of the design of an application or system. Design patterns offer general, reusable solutions that can be used multiple times in different situations. They help to minimise or isolate the endless iterations or modifications of class attributes and methods. Design patterns offer a concept to make a clear separation between common and varying class characteristics. Parts of the classes are generalised so that they can be reused at any time, while some are kept specific so that they can be used occasionally. This way of separation is very flexible and minimises the changes in the whole source code. The goal of design patterns is to reduce changes in the source code.

Design patterns are usually documented by using Unified Modeling Language (UML) tools [137]. UML is a common modeling language used in the field of software development for describing or visualising the structure or design of a system.

In this thesis, the UML diagrams are designed using the Umbrello UML Modeller tool [138]. It is an open-source tool available for Microsoft Windows and Unix-like platforms.

A class diagram is a static type of UML structure diagram that describes class attributes, methods and the relationships between them.


In a UML diagram, classes are displayed as boxes that include three sections: the class name, the class attributes and the class methods, see Figure 6.17.

Figure 6.17: The sections of a class.

The relationships between classes describe the logical connections between them. UML supports the following relationships, see Figure 6.18.

Figure 6.18: UML relations notation.

Class inheritance can be seen as a design pattern. For example, one needs to design a Bear simulation game. An example is shown in the Bear class. Initially, the Bear class is designed as shown in Figure 6.19. There is one superclass named Bear and two sub-classes called PolarBear and BrownBear.

Figure 6.19: The UML diagram of the Bear class with two sub-classes (PolarBear and BrownBear).

The initial model is correct unless there is an update and another type of bear appears, for example TeddyBear (Figure 6.20).


Figure 6.20: The UML diagram of the Bear class with three sub-classes (PolarBear, BrownBear, TeddyBear).

According to the designed model, the TeddyBear sub-class is able to eat and to sleep, which is not correct. The newly introduced sub-class (TeddyBear) does not fit the designed class model and forces modifications of the initial model itself.

After the modifications, the UML diagram looks like the following (Figure 6.21). There is one base class (Bear) from which all other sub-classes (PolarBear, BrownBear, TeddyBear) inherit, and two interfaces (Sleeping, Eating). These two interfaces are implemented by the sub-classes on demand.

Figure 6.21: The final UML diagram of the Bear class with its sub-classes (PolarBear, BrownBear, TeddyBear).

Modifying the initial model by taking into account that not all bears are able to eat and to sleep is logically correct.
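
In Python, the final design of Figure 6.21 can be sketched with abstract mixin classes playing the role of the two interfaces; the method names sleep() and eat() are assumptions made for this illustration.

from abc import ABC, abstractmethod

class Bear:
    # base class from which all bears inherit
    def __init__(self, name):
        self.name = name

class Sleeping(ABC):
    @abstractmethod
    def sleep(self): ...

class Eating(ABC):
    @abstractmethod
    def eat(self): ...

class PolarBear(Bear, Sleeping, Eating):
    def sleep(self):
        return "%s sleeps" % self.name
    def eat(self):
        return "%s eats" % self.name

class BrownBear(Bear, Sleeping, Eating):
    def sleep(self):
        return "%s sleeps" % self.name
    def eat(self):
        return "%s eats" % self.name

class TeddyBear(Bear):
    # a TeddyBear neither eats nor sleeps, so it implements no interface
    pass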

There are more than 20 different design patterns, which are classified into three types.

• Creational: describes the best way of creating an object. An example of such a pattern is a singleton class, of which only a single instance can be created. This type of pattern is used when no more than one class instance may exist (a minimal singleton sketch is given after this list).

• Structural: describes the relationship between objects and classes that satisfy a particular system requirement in such a way that future changes in the system will not require changes in the already existing class relations. The Bear class, which was described above, is an example of the structural type of design pattern.


• Behavioral: describes the interactions between objects. The idea is to encapsulate the various parts in such a way that changes in one part of the source code will not affect the rest.
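
As an example of a creational pattern, a minimal singleton sketch in Python (one common idiom among several) is shown below.

class Singleton(object):
    # creational pattern: at most one instance of this class can exist
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(Singleton, cls).__new__(cls)
        return cls._instance

# both names refer to the same, single instance
a = Singleton()
b = Singleton()
assert a is b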

As is the case with all concepts, design patterns have advantages and disadvantages.

Advantages of using design patterns:

• offer standard solutions to common programming problems

• highly flexible and can be used for practically any type of application

• help the design and implementation of the structure to reach a certain goal

• increase the quality of the code

• make connections between program components

• language independent and can be applied to any language that supports OOP

Disadvantages of using design patterns:

• do not lead to direct code reuse

• are complex in nature

• require programming experience and discussion

• can decrease understandability by increasing the amount of source code

• consume more memory because of their generalised format

Despite their complexity, design patterns are well worth using. They allow writing high-level source code, which is an extremely important factor in software programming.

Grid-enabled HappyFace is designed based on the concept of OOP, using design patterns.

6.3.2 Implementation

With the new grid-enabled extension, the workflow of HappyFace looks as displayed in Figure 6.22.

The idea of the new workflow is to provide access to the grid from the HappyFace system itself. This is done by setting up the necessary environment, using the proper X.509 certificate, the HappyFace configuration file and the grid subprocess class (an extension of the Python subprocess class) from the HappyFace system.

Access methods are used for

• retrieving needed information from the grid,


Figure 6.22: The grid-enabled HappyFace workflow. Taken from [122].

• storing them into the HappyFace database,

• displaying them on the HappyFace web interface.

Access methods are similar to HappyFace modules/categories, but with the grid part included in them.

The grid-enabled HappyFace extension consists of two main packages: GridEngine and GridToolkit.

GridEngine consists of three Python scripts: envreader.py, gridcertificate.py and gridsubprocess.py. All these classes are connected with each other.

The structure of classes is the following:

• The envreader.py Python script consists of the BaseEnv, GridEnv, CvmfsEnv, BaseEnvReader, GridEnvReader and CvmfsEnvReader classes and four functions, which are described below. The envreader.py script file is used for setting up the grid and the cvmfs environments by reading the necessary variables from the HappyFace configuration file.

The UML diagram of the envreader.py Python script classes is displayed in Figure 6.23.

– The getuid() function is used for getting the user id information, which will be used later by the GridEnv class for setting up the grid environment.

– The load_happyface_config_reader() function checks if the HappyFace configuration file is available. This function is later used by the BaseEnvReader class.

– The check_if_happyface_config_key_exist() function is used for checking if the HappyFace configuration file structure is followed. The function is later used by the BaseEnvReader class. There is a defined HappyFace configuration file structure, which is displayed in Figure 6.24.


Figure 6.23: The UML diagram of the envreader.py Python script classes.


Figure 6.24: The HappyFace configuration file structure.

– The read_happyface_config() function is used for reading the HappyFace configuration file variable types (int, float, boolean, string, etc.). The function is later used by the BaseEnvReader class.

– BaseEnv is a base class and is used for keeping class attributes and methods that can be shared later by the inherited GridEnv and CvmfsEnv class attributes and methods. This is done in order to avoid accidental name conflicts, which may cause hard-to-find bugs in the program.

– The GridEnv and CvmfsEnv classes are inherited from the BaseEnv class and are used for storing the class attributes of the grid and the cvmfs environments, correspondingly, in case of a problem with the HappyFace configuration file. For example, the GridEnv class overrides the BaseEnv class conf{} attribute and looks like the following (Figure 6.25):

Figure 6.25: The GridEnv class conf{} attribute.

The CvmfsEnv class overrides the BaseEnv class conf{} attribute and looks like the following (Figure 6.26). Depending on the class instance, the corresponding conf{} attribute will be called.

– BaseEnvReader is a base class and is used for setting up the grid or the cvmfs environments by reading and changing the default variables of the HappyFace configuration file.

– The GridEnvReader and the CvmfsEnvReader classes are inherited from the BaseEnvReader class and are used for setting up the grid and the cvmfs environments, correspondingly.


Figure 6.26: The CvmfsEnv conf{} class attribute.

• The gridcertificate.py Python script consists of the GridCertificate class. The GridCertificate class checks whether the X.509 user certificate is valid or not. The ProxyCertificateHandler class is inherited from the GridCertificate class and generates a new proxy certificate in case the current one is not valid. The ProxyCertificateHandler class has a strong connection to the GridCertificate class and cannot exist without it.

The UML diagram is displayed in Figure 6.27.

– The getUserKey(), getUserCert() and getUserProxy() methods get information about the path/location of the X.509 user certificate, key and proxy certificate by using the GridEnvReader class instance.

– The checkIfUserCertExists(), checkIfUserKeyExists() and checkIfGridProxyExists() methods check the existence of the X.509 user certificate, key and proxy certificate at the above-mentioned path/location.

– The checkIfUserCertHasCorrectOwnership(), checkIfUserKeyHasCorrectOwnership() and checkIfUserProxyHasCorrectOwnership() methods check the ownership of the X.509 user certificate, key and proxy certificate files.

– The checkIfNoPassphraseUserKeyIs() method removes the password from the X.509 user key by using the openssl command.

– The checkIfValidUserKeyExists() method checks if the X.509 user key is valid by using the openssl command.

– The checkIfGridProxyIsStillValid() method checks if the X.509 user proxy is still valid by using the voms-proxy-info -e command.

– The checkIfGridProxyHasAcTimeleft() method checks the lifetime of the X.509 proxy certificate by using the voms-proxy-info -actimeleft command.


Figure 6.27: The GridCertificate class UML diagram.


– The getSubjectDN() method gets the name of the owner of the X.509 certificate by using the openssl command.

• The gridsubprocess.py Python script consists of the GridCalledProcessError, GridTimeoutExpired, GridPopen and GridSubprocessBaseHandler classes and of two functions, which are described below. The gridsubprocess.py Python script provides the functionality to execute grid commands from the HappyFace system. It also exports the X.509 user certificate and the user key and enables the cvmfs environment when executing a grid process. The gridsubprocess.py is an extended module of the Python subprocess module with two added parameters (grid and cvmfs).

The UML diagram is displayed in Figure 6.28.

– The GridPopen class inherits from the Python subprocess Popen class and is used for opening a subprocess. The GridPopen class has two more attributes, the gridSetupLoader and the cvmfsSetupLoader, for setting up the grid and the cvmfs environments.

– The grid_call() function is the same as the Python subprocess call() function, with the difference that in the function body the GridPopen class is called instead of the Python subprocess Popen class. The grid_call() function waits for the command to complete or time out, then returns the return code.

– The check_grid_call() function is the same as the Python subprocess check_call() function, with the difference that in the function body the grid_call() function is called instead of the Python subprocess call() function. The check_grid_call() function waits for the command to complete; if the exit code is zero it returns, otherwise it raises the GridCalledProcessError exception (a schematic usage example of these wrappers is given after the description of the access method classes below).

– The GridTimeoutExpired class inherits from the Python SubprocessError class. This exception is raised when the timeout expires while waiting for the child process.

– The GridCalledProcessError class inherits from the Python Exception class. This exception is raised when a process run by check_grid_call() returns a non-zero exit status.

– The GridSubprocessBaseHandler class is mainly used for executing grid commands. It has two methods: execute() and showGridProcess(). The execute() method is used for executing grid commands, for example "uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de 'put /var/tmp/A_random.txt /pnfs/gwdg.de/data/atlas/atlaslocalgroupdisk/test/'". In this example the command copies a file from the local host to the remote host. The showGridProcess() method shows the currently running grid processes. The GridSubprocessBaseHandler class is connected with the other classes and functions as shown in Figure 6.28. Access methods are inherited from the GridSubprocessBaseHandler class in order to be able to execute the necessary grid commands.

Figure 6.28: The UML diagram of the GridSubprocess class.

• Different access methods are implemented, such as the GridFTP, GridSRM and ATLASDDM classes, which are inherited from the GridSubprocessBaseHandler class and also have access to the grid system.

The UML diagram is displayed in Figure 6.29.

– The Transfers class inherits from the GridSubprocessBaseHandler class and provides set() and get() methods for source and destination hosts, ports, transfer types and site names.

– The GenerateFile class randomly generates files in the local file system. The generated files are located in the /tmp directory. This class also has a method to copy the generated files to the remote grid site.

– The SpaceTokens class provides the set() and get() methods for the commonly used space tokens. There are four of them: the scratchdisk, the localgroupdisk, the proddisk and the datadisk.

– The DdmDatasetsController class inherits from the GridSubprocessBaseHandler class and has several methods, such as whoami(), listDatasets() and getMetaData(). These methods provide the Rucio (DDM) (see Section 4.4) command-line functionality from the HappyFace system. For example, in the Linux terminal the command (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,10p) looks like the following, see Figure 6.30. The above-mentioned command lists the first 10 datasets of the GoeGrid localgroupdisk space token.

In HappyFace, the same command (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,50p) is written in the class method and looks like the following, see Figure 6.31. The number of datasets is taken from the HappyFace ATLASDDM module configuration file.

– The GridFtpCopyHandler class inherits from the GridSubprocessBaseHandler and the Transfers classes. The GridFtpCopyHandler class methods provide the UberFTP (see Section 6.3.3.1) [139] command-line functionality from the HappyFace system. For example, in the Linux terminal the command (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") looks like the following, see Figure 6.32. The above-mentioned command creates the directory named test in the GoeGrid scratchdisk space token.

In HappyFace, the same command (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") is written in the class method and looks like the following, see Figure 6.33.


Figure 6.29: The UML diagram of the GridFTP, the GridSRM and the ATLASDDM classes.


Figure 6.30: Running the (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,10p) command in the Linux terminal.

Figure 6.31: Running the (rucio list-datasets-rse GOEGRID_LOCALGROUPDISK | sed -n 1,50p) command from the HappyFace system. The values of the str(start) and str(end) variables are taken from the ATLASDDM module configuration file.

Figure 6.32: Running the (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") command in the Linux terminal.


Figure 6.33: Running the same (uberftp -retry 2 -keepalive 10 se-goegrid.gwdg.de "mkdir /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test") command from the HappyFace system.

– The GridSRMCopyHandler class inherits from the GridSubprocessBaseHandler and the Transfers classes. The GridSRMCopyHandler class methods provide the SRM (see Section 6.3.3.2) [140] command-line functionality from the HappyFace system. The GridSRMCopyHandler class has a similar functionality to the GridFtpCopyHandler class, but with a different command-line.
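
To illustrate how these pieces fit together, the sketch below shows how an access method might issue a grid command through the subprocess wrapper described above. The names GridSubprocessBaseHandler, check_grid_call and GridCalledProcessError are taken from the text, but the import path, the exact signatures and the use of the grid and cvmfs parameters are assumptions, so this is a schematic usage example rather than HappyFace source code.

# schematic usage sketch; the module path and call signatures are assumed
from gridsubprocess import (GridSubprocessBaseHandler, check_grid_call,
                            GridCalledProcessError)

class ProxyCheck(GridSubprocessBaseHandler):
    # a minimal access method that verifies the proxy with a grid command
    def proxy_is_valid(self):
        try:
            # grid/cvmfs are the two added parameters mentioned above
            check_grid_call(["voms-proxy-info", "-e"], grid=True, cvmfs=True)
            return True
        except GridCalledProcessError:
            return False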

The new grid-enabled HappyFace extension was successfully integrated and provides direct access to the grid infrastructure.

Three main grid-enabled modules, the so-called GridFTP, GridSRM and ATLASDDM modules, were implemented.

The GridFTP and the GridSRM modules are designed to show the efficiency of a particular grid site and to check the performance of grid transfers among the site space tokens and other grid sites. They check the status of transfers between site space tokens by using two different command-lines (UberFTP and SRM). The GridFTP and the GridSRM modules have a transfer efficiency plot that shows the number of successful and failed transfers for the site space tokens for a given time period. These modules allow system administrators to know whether the transfers between site space tokens succeeded or failed and, in case of failure, point out the reason of the failure.

The ATLASDDM module shows all current datasets that exist in a given storage disk or storage pool. This allows system administrators to see all the different datasets that are stored in the storage systems.

6.3.3 Results

6.3.3.1 GridFTP Module

UberFTP [64], [139] is a GridFTP-enabled client that supports both interactive use and FTP commands on the uberftp command-line to transfer files between two computers. UberFTP is intended for use with computers that have a GridFTP server installed. It supports GSI authentication, parallel data channels and striping.


The plot in the web interface shows the transfer efficiency of the GoeGrid site space tokens for the given time period by using UberFTP commands.

In GoeGrid, there are four main space tokens that need to be monitored. They are: the scratchdisk, the localgroupdisk, the proddisk and the datadisk. The space token transfer efficiency is defined by the number of successful and failed transfers, which are stored in the HappyFace database. In the plot, the green bar shows the number of successful transfers and the red bar shows the number of failed transfers for the particular space token.

For example, in this particular case, there are 57 successful and 30 failed transfers for the GoeGrid localgroupdisk space token (Figure 6.34).

Figure 6.34: The GridFTP module web output.

The combobox on the right side of the plot shows the existing failures in the selected space token for the given time period. By choosing a space token (for example, localgroupdisk), more detailed information about the failures of that space token is listed in the table (Figure 6.35). In this case, the existing 30 failures are due to the following error messages:

• error with the remote service during the transfer,

• problem with a file, which does not exist,

• remote server is disconnected.


The extraction of the space token failure information is done by using an SQLite query with a syntax similar to "SELECT DISTINCT . . . FROM . . . WHERE ...;". The SQLite DISTINCT clause removes duplicated rows from the result set.
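
Such a query can be illustrated with a short Python snippet. The table and column names below are hypothetical placeholders, since the actual HappyFace subtable schema is not reproduced here; only the SELECT DISTINCT pattern matches the description above.

import sqlite3

conn = sqlite3.connect("HappyFace.db")
# hypothetical table/column names, used only to illustrate SELECT DISTINCT
rows = conn.execute(
    "SELECT DISTINCT error_message FROM sub_test_details WHERE space_token = ?;",
    ("localgroupdisk",)).fetchall()
conn.close()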

Figure 6.35: The GridFTP module web output.

The transfers themselves are shown in three different tables on the HappyFace web interface (Figure 6.36). The first table shows transfers from GoeGrid to GoeGrid by using UberFTP commands. The second table shows transfers from GoeGrid to Wuppertal (another Tier-2 site) by using UberFTP commands. The third table shows transfers from Wuppertal to GoeGrid by using UberFTP commands.

The green color indicates successful transfers and the red color with a tooltip icon indicates the failures. By moving the cursor onto the icon, a tooltip window appears with the particular error message.

The program randomly generates files and copies them from one space token to another. After the transfers have been checked, it removes all created files and folders.

6.3.3.2 GridSRM Module

The Storage Resource Manager (SRM) [97], [140] is the grid middleware component that provides dynamic space allocation and file management on shared storage resources in the grid. Different storage system implementations (dCache, StoRM) are based on the same SRM specification, see Section 2.1.5.3.


Figure 6.36: GridFTP module web output.


The plot in the web interface shows the transfer efficiency of the GoeGrid site space tokens for a given time period by using SRM commands. In the case of GoeGrid, there are four main space tokens that need to be monitored. They are: scratchdisk, localgroupdisk, proddisk and datadisk. The space token transfer efficiency is defined by the number of successful and failed transfers that are stored in the HappyFace database.

As described in Section 6.3.3.1, also here the plot (Figure 6.37) shows the number of successful and failed transfers for the particular space token.

For example, in this particular case, there are 18 successful and 18 failed transfers for the GoeGrid scratchdisk space token.

Figure 6.37: GridSRM module web output.

The combobox on the right side of the plot shows the existing failures in the selected space token for the given time period. By choosing a space token (for example, scratchdisk), more detailed information about the failures of that space token is listed in the table (Figure 6.38). In this case, the existing 18 failures are due to the following error messages:

• failure to create the directory,

• command execution timeout,

• problem with a file or directory, which does not exist.

The extraction of the space token failure information is done by using the SQLite query described in Section 6.3.3.1.


Figure 6.38: The GridSRM module web output.


As described in Section 6.3.3.1, also here the transfers themselves are shown in three different tables on the HappyFace web interface (Figure 6.39). The first table shows transfers from GoeGrid to GoeGrid by using SRM commands. The second table shows transfers from GoeGrid to Wuppertal (another site) by using SRM commands. The third table shows transfers from Wuppertal to GoeGrid by using SRM commands.

Figure 6.39: The GridSRM module web output.

The tooltip window indicates the particular error message, as shown in Figure 6.39.

After performing the transfers, all files copied by the program are removed from the space tokens.

The main difference between the GridFTP and GridSRM modules is the command-line used. The GridFTP module uses the UberFTP command-line, while the GridSRM module uses the SRM command-line.


6.3.3.3 ATLASDDM Module

The ATLAS DDM module shows all current datasets that exist on a given storage disk or storage pool. The ATLAS DDM module displays the name of the dataset as well as its size, the owner, the date of creation and the date of the last update. On the web interface, the table is ordered by descending dataset size and has a limited number of rows (defined in the config file). System administrators can then use the ATLAS DDM module information to manage the storage area by optimising the use of disk space based on any of the displayed factors.

On the web interface of the module, a pie chart displays a random sampling of datasets (Figure 6.40). This gives the system administrator an idea of how much space a particular user consumes.

Figure 6.40: The ATLASDDM module web output.

At present, the HappyFace system aggregates, processes and stores information from both the monitoring resources and the direct access to the grid user applications and grid collective services in the WLCG computing systems.


Chapter 7

Conclusions

7.1 Conclusion

This work summarises the recent activity of the HappyFace system and its goal. The HappyFace system provides a summary of all important site specific information on a single web interface, with the possibility to store all previous site statuses (history functionality). It has a modular structure that makes HappyFace easy to understand. This modularity also allows new modules to be provided easily and to be shared among different sites.

During the years of work on my PhD, HappyFace made significant progress. A number of various modules, ATLAS and non-ATLAS modules, were designed and developed by the Göttingen group members.

The expected data growth during the LHC Run 2 phase brought certain adjustments to the existing WLCG computing infrastructure. The main software tools have been improved and developed in order to handle large amounts of data. All main software components of the ATLAS computing model that were described in Chapter 4 have been improved and are still under development. Although the development of the ATLAS computing infrastructure has top priority, it also affects the work of meta-monitoring tools, as they are directly dependent on the infrastructure. HappyFace as a meta-monitoring tool should always be attentive to any kind of change in the ATLAS computing model, as it directly influences the work of the modules in HappyFace. Therefore, HappyFace modules always need to be adjusted and in particular cases even to be fully rewritten.

The instability of the ATLAS computing model brought the Göttingen group to the conclusion to modify the initial model of HappyFace in such a way that HappyFace is able to access the grid system directly. This is a very crucial aspect, as it solves the problem of being dependent on the monitoring resources that are always under development. Besides this, it provides the possibility to design and develop a wide range of various modules, for example the grid-enabled modules (GridFTP, GridSRM), which are designed to show the efficiency of a particular grid site and to check the performance of grid transfers among the site space tokens and other grid sites. The ATLAS DDM module shows all current datasets that exist in a given storage disk or storage pool. These modules are the first step to take HappyFace to a new phase of development.

At present, the HappyFace system aggregates, processes and stores information from both the monitoring resources and the direct access to the grid user applications and grid collective services in the WLCG computing systems. In fact, HappyFace can be adapted to any site requirements within the WLCG.

Currently, several German grid sites use the HappyFace system for site monitoring:

• Georg-August-University of Göttingen


• Karlsruhe Institute of Technology

• RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen)

The installation and setup of HappyFace has been simplified. In the past, it was only possible to do it manually, which was time consuming and required a certain effort. The provided RPM packages simplified and sped up the installation and setup process of HappyFace. Within several minutes, the whole HappyFace system can be installed.

Besides the progress that HappyFace has made within the last several years, the grid-enabled extension opens new features for future development. A number of different modules can still be designed and implemented.


Appendix A

Agile Software Development

Agile software development [125] is a software development style which is based on a set of defined principles. It involves adjustable planning, progressive development, early release, and continuous improvement. Agile software development focuses on keeping code simple, testing frequently and delivering functional bits of the application as soon as they are ready.

Agile software development is based on the following main principles:

• Satisfy customers through early and continuous releases of the developed software

• Perform changes at any time, even in the final stage of the software development

• Make a release of the software quite often, for example every 2 weeks

• Constant communication between customers and software developers. Face-to-face communication is the best form of communication

• At any stage of the software development, working software is considered to be progress

• Constant development is able to maintain a constant pace

• Continuous attention to the source code development style

• Adaptive way of working

From the above-mentioned principles, the advantages of agile software development are clear. The progress of the software can be seen quite often, and in case of bugs in the latest release, the previous version can be restored at any time. This gives safety in terms of saving the current status, which can then be adjusted or developed further in accordance with requirements that are dynamic in most cases. Consequently, if there are any changes needed in the final release of the software, they can be easily applied.

In HappyFace, we practice agile software development principles. As HappyFace has a modular structure, different modules can be modified or designed independently. After a certain period of time, all developed modules are merged together and the current release is ready.


Appendix B

Installing HappyFace and Building an Instance

B.1 Required Packages

All required packages for the installation of HappyFaceCore and the ATLAS modules are listed below:

• python >= 2.6

• httpd >= 2.0

• python-cherrypy >= 3.0

• python-sqlalchemy >= 0.5

• python-migrate

• python-mako

• python-matplotlib

• python-beautifulsoup

• python-sqlite

• python-psycopg2

• python-lxml

• numpy

• mod_wsgi

• sqlite

In addition to the above-mentioned packages, the grid-enabled HappyFace extension has its own requirements, which are listed below:

• openssl

• cvmfs


• eclipse-platform

• eclipse-pde

• yum-protectbase

• fetch-crl

• ca-policy-egi-core

• umd-release >= 3.0.0

• rpmforge-release

• kdesdk

HappyFace uses a Linux distribution, in particular, Scientific Linux CERN 6 (SLC6).

B.2 Setting up the CVMFS Environment

The CernVM File System [134], mostly known as CVMFS, is a file system that provides a reliable, scalable and efficient software distribution service.

It provides a complete and portable environment for running and developing LHC data analysis on the grid and on a laptop/desktop computer. It works independently of the operating system (Linux, Windows, MacOS). CVMFS is similar to a POSIX read-only file system in user space. Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.

CVMFS is widely used for HEP experiments. It allows physicists working on the LHC experiments at CERN to use their computers effectively.

In many cases, CVMFS replaces package managers and repositories on cluster file systems. For the LHC experiments, CVMFS hosts several millions of files and directories that are distributed to thousands of client computers.

In ATLAS, in order to use the ATLAS environments, CVMFS is required. The installation and setup of CVMFS is very simple. There are two files: default.local and the install_cvmfs.sh Linux shell script. The default.local file (Figure B.1) contains all necessary paths, and the install_cvmfs.sh Linux shell script (Figure B.2) simply installs the CVMFS environment. These two files should be located in the same folder.

Figure B.1: default.local file.

If CVMFS is successfully installed, the ATLAS user environment will be available. In order to check it, the following commands need to be executed from the Linux terminal.


#!/bin/sh

#----------------------------------
# Check CVMFS
#----------------------------------
rpm -q cvmfs && exit 0

#----------------------------------
# make yum repo
#----------------------------------
repobase="http://cvmrepo.web.cern.ch/cvmrepo/yum/cvmfs/EL/6/x86_64"
repofile="cvmfs-release-2-4.el6.noarch.rpm"
HOME=/root

if ! [ -e $HOME/$repofile ]; then
    wget $repobase/$repofile -O $HOME/$repofile
    yum -y localinstall $HOME/$repofile
    #----------------------------------
    # install cvmfs
    #----------------------------------
    yum clean all
    yum -y install cvmfs cvmfs-auto-setup cvmfs-init-scripts
fi

#----------------------------------
# Configuration
#----------------------------------
cp -v default.local /etc/cvmfs/default.local

#----------------------------------
# Start
#----------------------------------
/etc/init.d/autofs restart

#----------------------------------
# Check
#----------------------------------
cvmfs_config chksetup
cvmfs_config probe

# details
cvmfs_config stat -v
cvmfs_talk cache list

# ls
ls /cvmfs/atlas.cern.ch

Figure B.2: install_cvmfs.sh Linux shell script.


$ export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase

$ source $ATLAS_LOCAL_ROOT_BASE/user/atlasLocalSetup.sh

After a successful installation, all ATLAS environments will be listed, as shown in Figure B.3.

Figure B.3: The available ATLAS environments after executing the above-mentioned commands.

B.3 Installation

The source code of HappyFace is stored in the GitHub repository located at the URL https://github.com/HappyFaceGoettingen/.

GitHub [141] is a web-based Git [141] repository. Git is a source code management system used for software development. While the Git repository is a command-line tool, GitHub has a graphical interface and offers the same functionality as the Git repository.

The HappyFace GitHub repository consists of several repositories, listed below:

• Red-Comet: stores the source code of the grid-enabled HappyFace extension.

• HappyFaceATLASModules: stores the source code of the HappyFace ATLAS modules.


• HappyFaceExtra: stores the source code of the HappyFace non-ATLAS modules, such as the CREAMCE, Ganglia and PBS modules.

• HappyFaceWebService: stores the source code of the HappyFace web services, such as the WSDL and RESTful web services.

• HappyFaceSmartPhoneApp: stores the source code of the HappyFace smartphone application.

• HappyFaceCore: stores the source code of the HappyFace core modules.

• HappyFace-Integration-Server: stores script files through which it is possible to create RPM packages [142] of all the above-mentioned repositories. It contains specification files, source files and all other necessary files for creating RPM packages.

Before installing the HappyFace software, RPMforge needs to be installed. RPMforge [143] is a repository that provides more than 5000 software packages in the rpm format for the Red Hat Enterprise Linux (RHEL) and Community ENTerprise Operating System (CentOS) Linux distributions.

## RHEL/CentOS 6 64 Bit OS ##

$ wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm

$ rpm -Uvh rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm

After a successful installation of the rpmforge-release package, the HappyFace software can be installed.

The installation of HappyFace using RPM packages is simple.

$ yum install HappyFaceCore-3.0.0-2.x86_64.rpm - the command installs the HappyFace core

$ yum install HappyFace-ATLAS-3.0.0-20160318.x86_64.rpm - the command installs the HappyFace ATLAS modules

$ yum install HappyFace-Red-Comet-3.0.0-20150611.x86_64.rpm - the command installs the HappyFace grid-enabled extension.

After all necessary RPM packages have been successfully installed, the X.509 user certificate and key need to be generated or copied to /var/lib/HappyFace/cert.

The steps for generating the X.509 user certificate and key are the following.

1. Generate the user certificate
$ openssl pkcs12 -clcerts -nokeys -in usercert.p12 -out usercert.pem

2. Create a private certificate with passphrase


$ openssl pkcs12 -nocerts -in usercert.p12 -out userkey.pem

3. Create a private certificate without passphrase
$ openssl rsa -in userkey.pem -out userkey.nopass.pem

4. Change permissions
$ chmod 400 userkey.nopass.pem

$ chmod 644 usercert.pem

5. Copy the grid user certificate
$ cp -v userkey.nopass.pem /var/lib/HappyFace/cert

$ cp -v usercert.pem /var/lib/HappyFace/cert

B.4 Running an Instance

After a successful installation of the HappyFace instance on the machine, the Hap-pyFace folder will be available at the /var/lib/HappyFace path. To run an instance is verysimple, only one file, the acquire.py needs to be executed and after completion of theoutput will be displayed on the web interface.

$ python acquire.py

If the database structure needs to be updated, the following command has to be executed:

$ python tools.py dbupdate
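In production, acquire.py is usually executed periodically rather than by hand, for example via a cron job [129]. The entry below is a minimal sketch that runs it every 15 minutes, matching the reload_interval in happyface.cfg; the log file path is only an assumption.

$ crontab -e - opens the crontab of the user running HappyFace

*/15 * * * * cd /var/lib/HappyFace && python acquire.py >> /var/log/happyface_acquire.log 2>&1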

HappyFace is a generic tool and easy to configure. To make HappyFace work at any computing site, modifications of the HappyFace configuration file happyface.cfg (B.5.1) and of the module configuration files are needed.

For the grid-enabled HappyFace modules (GridFTP, GridSRM and ATLAS DDM), the configuration files (grid_ftp_module.cfg, grid_srm_module.cfg and atlas_ddm_module.cfg) are displayed in B.5.2, B.5.3 and B.5.4, respectively.
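The GridFTP module automates test transfers between the endpoints configured in B.5.2. The same kind of transfer can be triggered manually to verify a site entry. The commands below are a minimal sketch using globus-url-copy [63]; they assume that the Globus client tools are installed, that a valid VOMS proxy exists and that the GridFTP door of the storage element listens on its default port.

$ dd if=/dev/urandom of=/tmp/hf_testfile bs=1M count=1 - creates a small local test file

$ globus-url-copy file:///tmp/hf_testfile gsiftp://se-goegrid.gwdg.de/pnfs/gwdg.de/data/atlas/atlasscratchdisk/test_haykuhi/hf_testfile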

B.5 Files


B.5.1 happyface.cfg

[paths]
happyface_url = /

static_dir = static
archive_dir = %(static_dir)s/archive
tmp_dir = %(static_dir)s/tmp

hf_template_dir = templates
module_template_dir = modules/templates
template_cache_dir = mako_cache
template_icons_url = %(static_url)s/themes/armin_box_arrows

local_happyface_cfg_dir = config
category_cfg_dir = config/categories-enabled
module_cfg_dir = config/modules-enabled

acquire_logging_cfg = defaultconfig/acquire.log
render_logging_cfg = defaultconfig/render.log

# NOTE: Changing these URLs might have limited effect
static_url = %(happyface_url)sstatic
archive_url = %(static_url)s/archive

[happyface]
# colon separated list of categories, if empty
# all are used in a random order. The name here
# corresponds to the section name.
categories =
stale_data_threshold_minutes = 60

# automatic reload interval in minutes
reload_interval = 15

[auth]
# A file containing authorized DNs to access the site.
# One DN per line
dn_file =

# If the given DN is not found in the file above, if any, the following
# script is called with DN as first argument.
# The script must return 1 if user has access, 0 otherwise.
auth_script =

[template]
# relative to static URL
logo_img = /images/default_logo.jpg
documentation_url = https://ekptrac.physik.uni-karlsruhe.de/trac/HappyFace
web_title = HappyFace Project

[database]
url = sqlite:///HappyFace.db

[downloadService]
timeout = 300
global_options =

[plotgenerator]
enabled = False
backend = agg

[global]
server.socket_host: "0.0.0.0"
server.socket_port: 8080

tools.encode.on: True
tools.encode.encoding: "utf-8"
tools.decode.on: True
tools.trailing_slash.on: True

[grid]
enabled = True
# False = grid-proxy-init, True = voms-proxy-init
voms.enabled = True
vo = atlas
proxy.renew.enabled = True
proxy.lifetime.threshold.hours = 1
proxy.valid.hours = 100
# with voms-proxy-init
proxy.voms.hours = 100
x509.cert.dir = /etc/grid-security/certificates
#x509.cert.dir = /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/etc/grid-security-emi/certificates
x509.user.cert = /var/lib/HappyFace/cert/usercert.pem
x509.user.key = /var/lib/HappyFace/cert/userkey.pem
x509.user.proxy = /var/lib/HappyFace/cert/x509up_happyface

[cvmfs]
enabled = True
rucio.account = hmushegh
agis = False
atlantis = False
rucio.client = False
emi = True
fax = False
ganga = False
gcc = False
pacman = False
panda.client = False
pyami = False
pod = False
root = False
dq2wrappers = False
sft = False
xrootd = False


B.5.2 grid_ftp_module.cfg

[grid_ftp]
module = GridFtpCopyViewer
name = Grid FTP Viewer
type = rated
weight = 1.0

timeout = 240
plot_start_date = 2015-06-05

site1_host = se-goegrid.gwdg.de
site1_name = GoeGrid-GoeGrid
site1_port = 8443
site1_space_token_scratchdisk = GOEGRID_SCRATCHDISK
site1_space_token_localgroupdisk = GOEGRID_LOCALGROUPDISK
site1_space_token_proddisk = GOEGRID_PRODDISK
site1_space_token_datadisk = GOEGRID_DATADISK
site1_scratchdisk_path = /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test_haykuhi/
site1_localgroupdisk_path = /pnfs/gwdg.de/data/atlas/atlaslocalgroupdisk/test_haykuhi/
site1_proddisk_path = /pnfs/gwdg.de/data/atlas/atlasproddisk/
site1_datadisk_path = /pnfs/gwdg.de/data/atlas/atlasdatadisk/

site2_host = grid-se.physik.uni-wuppertal.de
site2_name1 = GoeGrid-Wuppertal
site2_port = 8443
site2_name2 = Wuppertal-GoeGrid
site2_space_token_scratchdisk = WUPPERTAL_SCRATCHDISK
site2_space_token_localgroupdisk = WUPPERTAL_LOCALGROUPDISK
site2_space_token_proddisk = WUPPERTAL_PRODDISK
site2_space_token_datadisk = WUPPERTAL_DATADISK
site2_scratchdisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasscratchdisk/test_haykuhi/
site2_localgroupdisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlaslocalgroupdisk/test_haykuhi/
site2_proddisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasproddisk/
site2_datadisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasdatadisk/

site3_host = atlas.bu.edu
site3_name1 = GoeGrid-BU_ATLAS_Tier2
site3_port = 8443
site3_name2 = BU_ATLAS_Tier2-GoeGrid
site3_space_token_scratchdisk = BU_ATLAS_Tier2_SCRATCHDISK
site3_space_token_localgroupdisk = BU_ATLAS_Tier2_LOCALGROUPDISK
site3_space_token_proddisk = BU_ATLAS_Tier2_PRODDISK
site3_space_token_datadisk = BU_ATLAS_Tier2_DATADISK
site3_scratchdisk_path = /gpfs1/atlasscratchdisk/test_haykuhi/
site3_localgroupdisk_path = /gpfs1/atlaslocalgroupdisk/test_haykuhi/
site3_proddisk_path = /gpfs1/atlasproddisk/
site3_datadisk_path = /gpfs1/atlasdatadisk/


B.5.3 grid_srm_module.cfg

[grid_srm]
module = GridSRMCopyViewer
name = Grid SRM Copy Viewer
type = rated
weight = 1.0

timeout = 240
plot_start_date = 2015-06-10

site1_host = se-goegrid.gwdg.de
site1_name = GoeGrid-GoeGrid
site1_port = 8443
site1_space_token_scratchdisk = GOEGRID_SCRATCHDISK
site1_space_token_localgroupdisk = GOEGRID_LOCALGROUPDISK
site1_space_token_proddisk = GOEGRID_PRODDISK
site1_space_token_datadisk = GOEGRID_DATADISK
site1_scratchdisk_path = /pnfs/gwdg.de/data/atlas/atlasscratchdisk/test_haykuhi/
site1_localgroupdisk_path = /pnfs/gwdg.de/data/atlas/atlaslocalgroupdisk/test_haykuhi/
site1_proddisk_path = /pnfs/gwdg.de/data/atlas/atlasproddisk/
site1_datadisk_path = /pnfs/gwdg.de/data/atlas/atlasdatadisk/

site2_host = grid-se.physik.uni-wuppertal.de
site2_name1 = GoeGrid-Wuppertal
site2_port = 8443
site2_name2 = Wuppertal-GoeGrid
site2_space_token_scratchdisk = WUPPERTAL_SCRATCHDISK
site2_space_token_localgroupdisk = WUPPERTAL_LOCALGROUPDISK
site2_space_token_proddisk = WUPPERTAL_PRODDISK
site2_space_token_datadisk = WUPPERTAL_DATADISK
site2_scratchdisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasscratchdisk/test_haykuhi/
site2_localgroupdisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlaslocalgroupdisk/test_haykuhi/
site2_proddisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasproddisk/
site2_datadisk_path = /pnfs/physik.uni-wuppertal.de/data/atlas/atlasdatadisk/

site3_host = atlas.bu.edu
site3_name1 = GoeGrid-BU_ATLAS_Tier2
site3_port = 8443
site3_name2 = BU_ATLAS_Tier2-GoeGrid
site3_space_token_scratchdisk = BU_ATLAS_Tier2_SCRATCHDISK
site3_space_token_localgroupdisk = BU_ATLAS_Tier2_LOCALGROUPDISK
site3_space_token_proddisk = BU_ATLAS_Tier2_PRODDISK
site3_space_token_datadisk = BU_ATLAS_Tier2_DATADISK
site3_scratchdisk_path = /gpfs1/atlasscratchdisk/test_haykuhi/
site3_localgroupdisk_path = /gpfs1/atlaslocalgroupdisk/test_haykuhi/
site3_proddisk_path = /gpfs1/atlasproddisk/
site3_datadisk_path = /gpfs1/atlasdatadisk/


B.5.4 atlas_ddm_module.cfg

[ddm_datasets_viewer]
module = DdmDatasetsViewer
name = Atlas DDM Datasets Viewer
description =
instruction =
type = rated
weight = 1.0

# Parameters
datasets_range_start = 1
datasets_range_end = 50
dataset_size = 50

space_token = GOEGRID_LOCALGROUPDISK
limit_table_size = 50


Bibliography

[1] C. J. Rhodes, Large Hadron Collider (LHC), Sc. Prog. 96.1 (2013): 95-109.

[2] LHC RUN 2.

[3] P. Higgs, Broken symmetries, massless particles and gauge fields, Phys. Lett. 12 (1964) 132–133.

[4] S. Dimopoulos and H. Georgi, Softly broken supersymmetry and SU(5), Nucl. Phys. B 193 (1981) 150–162.

[5] S. L. Glashow, Partial-symmetries of weak interaction, Nucl. Phys. 22 (1961) 579–588.

[6] J. Goldstone, A. Salam and S. Weinberg, Broken Symmetries, Phys. Rev. 127 (1962) 965-970.

[7] A. Salam, et al., Electromagnetic and weak interactions, Phys. Lett. 13 (1964) 168–171.

[8] T. Kinoshita, Quantum electrodynamics, Vol. 7, World Scientific (1990).

[9] W. Marciano and H. Pagels, Quantum chromodynamics, Phys. Rep. 36.3 (1978): 137-276.

[10] V. Flaminio and B. Saitta, Neutrino oscillation experiments, La Rivista del Nuovo Cimento (1978-1999) 10.8 (1987): 1-126.

[11] F. M. Liu and S. X. Liu, Quark-gluon plasma formation time and direct photons from heavy ion collisions, Phys. Rev. C 89.3 (2014): 034906.

[12] L. Evans and P. Bryant, LHC machine, J. of Inst. 3.08 (2008): S08001.

[13] J. Shiers, The worldwide LHC computing grid (worldwide LCG), Comput. Phys. Commun. 177.1 (2007): 219-223.

[14] R. Cashmore, et al., Prestigious Discoveries at CERN: 1973 Neutral Currents, 1983 W & Z Bosons, Spr. Sc. Business Media (2004).

[15] The ATLAS Collaboration, CDF Collaboration, CMS Collaboration, D0 Collaboration, First combination of Tevatron and LHC measurements of the top-quark mass, arXiv:1403.4427.

[16] K. A. Olive and Particle Data Group, Review of particle physics, Chinese Phys. C 38.9 (2014): 090001.

[17] H. Georgi and S. L. Glashow, Unity of all elementary-particle forces, Phys. Rev. Lett. 32.8 (1974): 438.


[18] F. Englert and R. Brout, Broken Symmetry and the Mass of Gauge Vector Mesons, Phys. Rev. Lett. 13 (1964) 321–323.

[19] K. A. Olive, et al., Review of Particle Physics, 2014-2015, Chin. Phys. C 38 (2014) 090001, arXiv:1753419.

[20] ATLAS Collaboration, The ATLAS Experiment at the CERN Large Hadron Collider, J. of Inst. Vol. 3 (2008).

[21] Courtesy of the ATLAS experiment.

[22] ATLAS Collaboration, Technical Design Report for the ATLAS Forward Proton Detector, Rep. Nr.: CERN-LHCC-2015-009 (2015) 81-85.

[23] M. Borodin, et al., Unified System for Processing Real and Simulated Data in the ATLAS Experiment (2015), arXiv:1508.07174.

[24] I. Foster and C. Kesselman, eds., The Grid 2: Blueprint for a new computing infrastructure, Els. Ser. (2003).

[25] I. G. Bird, LHC computing (WLCG): Past, present, and future, Grid and Cloud Comp. Conc. and Pract. App. 192 (2016): 1.

[26] R. W. L. Jones and D. Barberis, The evolution of the ATLAS computing model, J. of Phys.: Conf. Ser. Vol. 219, No. 7, IOP Publ. (2010).

[27] G. Ambrosini, et al., The ATLAS DAQ and Event Filter prototype "-1" project, Comp. Phys. Commun. 110.1 (1998): 95-102.

[28] S. Murray, et al., Tape write-efficiency improvements in CASTOR, J. of Phys.: Conf. Ser. 396 (2012) 042042, arXiv:1485674.

[29] F. Berman, et al., Grid Computing: Making the Global Infrastructure a Reality, Wiley (2003), 1st edition, ISBN: 978-0-470-85319-1.

[30] I. Foster, et al., The anatomy of the grid: Enabling scalable virtual organizations, Intern. j. of high perf. comp. app. 15.3 (2001): 200-222.

[31] M. A. Murphy, et al., Virtual Organization Clusters, 17th Eu. Int. Conf. on Paral., Distr. and Net.-based Process., IEEE (2009), pp. 401-408.

[32] I. Foster, et al., What is the grid? A three point checklist, GRIDtoday, Vol. 1, No. 6 (2002).

[33] I. Foster, Globus toolkit version 4: Software for service-oriented systems, IFIP Int. Conf. on Net. and Par. Comp., Springer, Berlin Heidelberg (2005).

[34] B. Ross, et al., Stronger Password Authentication Using Browser Extensions, Proc. of the 14th USENIX Secur. Symp. (2005).

[35] S. T. F. Al-Janabi and M. A. Rasheed, Public-key cryptography enabled Kerberos authentication, Developments in E-systems Engineering (DeSE), IEEE (2011).

[36] R. Housley and T. Polk, Planning for PKI: best practices guide for deploying public key infrastructure, John Wiley & Sons, Inc. (2001).


[37] S. Santesson, et al., X.509 Internet Public Key Infrastructure Online Certificate Status Protocol - OCSP, No. RFC 6960 (2013).

[38] A. Kahate, Cryptography and network security, Tata McGraw-Hill Edu. (2013).

[39] R. L. Rivest, et al., Cryptographic communications system and method, U.S. Pat. No. 4,405,829 (1983).

[40] K. R. Vogel, et al., Updating trusted root certificates on a client computer, U.S. Pat. No. 6,816,900 (2004).

[41] E. Rescorla, SSL and TLS: designing and building secure systems, Vol. 1, Reading: Addison-Wesley (2001).

[42] W. Stallings, Cryptography and network security: principles and practices, Pearson Edu. India, Section 14.2 (2006).

[43] V. Welch, et al., X.509 proxy certificates for dynamic delegation, 3rd ann. PKI R&D workshop, Vol. 14 (2004).

[44] S. Tuecke, et al., Internet X.509 Public Key Infrastructure (PKI) Proxy Certificate Profile, No. RFC 3820 (2004).

[45] J. Gilbert and R. Perry, A Java API for X.509 Proxy Certificates, HP Laboratories, HPL-2008-77 (2008).

[46] L. Pearlman, et al., The Community Authorization Service: Status and Future, CHEP03, La Jolla, California (2003).

[47] R. Alfieri, et al., VOMS, an Authorization System for Virtual Organizations, Grid comp., Springer Berlin Heidelberg (2004).

[48] B. Plale, et al., Key concepts and services of a grid information service, Proc. of the 15th Int. Conf. on Paral. and Distr. Comp. Sys. (2002).

[49] J. S. Bowman, et al., The practical SQL handbook: using structured query language, Addison-Wesley Longman Publ. Co. Inc. (1996).

[50] T. Howes and M. Smith, LDAP: programming directory-enabled applications with lightweight directory access protocol, Que Publ., 1st ed. (1997), ISBN-10: 1578700000.

[51] V. Hernández, Challenges and opportunities of healthgrids, Proc. of HealthGrid, Vol. 120, IOS Press (2006).

[52] A. Cooke, et al., R-GMA: An information integration system for grid monitoring, OTM Confed. Int. Conf., Springer Berlin Heidelberg (2003).

[53] S. Zhang, et al., Dynamic VM Provisioning for TORQUE in a Cloud Environment, J. of Phys.: Conf. Ser. Vol. 513, No. 3, IOP Publ. (2014).

[54] D. Thain, et al., Condor and the Grid, Grid Comp.: Making the Global Infrast. a Reality (2003): 299-335.

[55] G. Ma and P. Lu, PBSWeb: A Web-based Interface to the Portable Batch System, 12th IASTED Int. Conf. on Paral. and Distr. Comp. and Sys. (PDCS), Las Vegas, Nevada, USA (2000).


[56] G. Staples, TORQUE resource manager, Proc. of the 2006 ACM/IEEE Conf. on Supercomp. (2006).

[57] I. Lumb and C. Smith, Scheduling Attributes and Platform LSF, Grid resource management, Springer US (2004) 171-182.

[58] K. Subramanian, et al., Workload Management with LoadLeveler, IBM Redbooks (2001), ISBN: 978-0738422091.

[59] B. Stockebrand, Quality of Service (QoS), IPv6 in Pract.: A Unixer’s Guide to the Next Gener. Internet (2007): 327-331.

[60] A. Shoshani, et al., Storage resource managers, Grid resource management, Springer US, 321-340 (2004).

[61] A. Sim and A. Shoshani, The Storage Resource Manager Interface Specification Version 2.2 (2009), SRMv2.2.

[62] M. Cannataro, et al., Evaluating and enhancing the use of the GridFTP protocol for efficient data transfer on the grid, Rec. Adv. in Paral. Virt. Mach. and Msg. Passing Interface, Springer Berlin Heidelberg (2003).

[63] globus-toolkit-command.

[64] UberFTP.

[65] W. Allcock, et al., GridFTP: Protocol extensions to FTP for the Grid, Global Grid Forum GFD-RP (2003) 20: 1-21, GFD.20.

[66] B. Allcock, et al., GridFTP v2 protocol description, GridFTP Working Group, Tech. Rep. (2005), GFD.47.

[67] F. Magoules, et al., Introduction to Grid Computing, CRC Press (2009), ISBN: 9781420074062.

[68] D. Talia, et al., Grid Middleware and Services, Springer Singapore Pte. Limited (2009).

[69] S. Burke, et al., GLITE 3.2 USER GUIDE, Document identifier: CERN-LCG-GDEIS-722398 (2012), gLite-3-UserGuide.

[70] M. Romberg, The UNICORE grid infrastructure, Scientific Programming 10.2: 149-157 (2002).

[71] E. Laure, Programming the Grid with gLite, Comp. methods in sc. and tech. 12.1 (2006): 33-45.

[72] L. Wang, et al., Grid Computing: Infrastructure, Service, and Applications, CRC Press (2009), ISBN: 9781420067668.

[73] R. Ramakrishnan, et al., Database management systems, Osborne/McGraw-Hill (2000).

[74] L. V. Shamardin, LCG/gLite BDII performance measurements, PoS (2007): 014.

[75] P. Andreetto, et al., CREAM: a simple, grid-accessible, job management system for local computational resources, CHEP 2006, Mumbai, India (2006).


[76] C. Wang, et al., Hybrid checkpointing for MPI jobs in HPC environments, Paral. and Distr. Sys. (ICPADS), IEEE 16th Int. Conf. (2010).

[77] P. Fuhrmann and V. Gülzow, dCache, storage system for the future, EU Conf. on Para. Process., Springer Berlin Heidelberg (2006).

[78] J. Meyer, et al., ATLAS Tier-2 at the Compute Resource Center GoeGrid in Göttingen, J. of Phys.: Conf. Ser. Vol. 331, No. 7, IOP Publ. (2011).

[79] D. Jackson, et al., Core algorithms of the Maui scheduler, Workshop on Job Sched. Strat. for Paral. Process., Springer Berlin Heidelberg (2001).

[80] GWDG SAN storage system, GWDG_SAN_system.

[81] P. Rey, et al., The Accounting Infrastructure in EGEE, Proc. of the 1st Ib. Grid Infrast. Conf., Santiago de Compostela, Spain (2007).

[82] Rutherford Appleton Laboratory (RAL), RAL.

[83] W. Barth, Nagios: System and network monitoring, No Starch Press Series (2008), ISBN: 1593271794, 9781593271794.

[84] P. M. Papadopoulos, et al., NPACI Rocks: Tools and techniques for easily deploying manageable linux clusters, Concur. and Comput.: Pract. and Exp. 15.7-8 (2003): 707-725.

[85] M. P. DeHaan, Software provisioning in multiple network configuration environment, U.S. Patent No. 9,021,470 (2015).

[86] M. L. Massie, et al., The ganglia distributed monitoring system: design, implementation, and experience, Paral. Comput. 30.7 (2004): 817-840.

[87] M. Burgess, Recent developments in cfengine, The Hague (2001).

[88] Scientific Linux CERN 6 (SLC6).

[89] R. W. L. Jones and D. Barberis, The evolution of the ATLAS computing model, J. of Phys.: Conf. Ser. Vol. 219, No. 7, IOP Publ. (2010).

[90] A. Anisenkov, et al., AGIS: the ATLAS grid information system, J. of Phys.: Conf. Ser. Vol. 513, No. 3, IOP Publ. (2014).

[91] J. Schovancová, ATLAS Distributed Computing, No. ATL-SOFT-SLIDE-2011-551, ATL-COM-SOFT-2011-022 (2011).

[92] G. Mathieu, et al., GOCDB, a topology repository for a worldwide grid infrastructure, J. of Phys.: Conf. Ser. Vol. 219, No. 6, IOP Publ. (2010).

[93] R. Quick, et al., RSV: OSG Fabric Monitoring and Interoperation with WLCG Monitoring Systems, Manuscript in Proc. of 17th Int. Conf. on Comp. in HEP (2009).

[94] J. Forcier, et al., Python web development with Django, Addison-Wesley Prof. (2008).

[95] J. P. Baud, et al., LCG Data Management: From EDG to EGEE, UK eSc. All Hands Meeting Proc., Nottingham, UK (2005).


[96] R. K. Madduri, et al., Reliable file transfer in grid environments, Local Computer Networks, Proceedings, LCN 27th Ann. IEEE Conf. (2002).

[97] T. Perelmutov, et al., Storage Resource Manager, Proc. of CHEP’04 (2004).

[98] D. Dykstra and L. Lueking, Greatly improved cache update times for conditions data with Frontier/Squid, J. of Phys.: Conf. Ser. Vol. 219, No. 7, IOP Publ. (2010).

[99] D. Dykstra and J. Blomer, Security in the CernVM File System and the Frontier Distributed Database Caching System, J. of Phys.: Conf. Ser. Vol. 513, No. 4, IOP Publ. (2014).

[100] B. Tierney, et al., perfSONAR: Instantiating a global network measurement framework, SOSP Wksp. Real Overlays and Distrib. Sys. (2009).

[101] AGIS web interface for running services.

[102] AGIS web interface for PanDA queues.

[103] T. Maeno, et al., Overview of ATLAS PanDA workload management, J. of Phys.: Conf. Ser. Vol. 331, No. 7, IOP Publ. (2011).

[104] P. Nilsson, et al., The ATLAS PanDA Pilot in Operation, J. of Phys.: Conf. Ser. Vol. 331, No. 6, IOP Publ. (2011).

[105] PanDA web interface for GoeGrid, PandaGoeGrid.

[106] V. Garonne, et al., Rucio - The next generation of large scale distributed system for ATLAS Data Management, J. of Phys.: Conf. Ser. Vol. 513, No. 4, IOP Publ. (2014).

[107] Rucio account.

[108] Rucio architecture.

[109] S. Albrand, et al., The ATLAS metadata interface, J. of Phys.: Conf. Ser. Vol. 219, No. 4, IOP Publ. (2010).

[110] DDM Dashboard.

[111] DDM Dashboard Transfer Plots.

[112] K. De, et al., ATLAS Distributed Computing Operations Shift Team Experience, J. of Phys.: Conf. Ser. Vol. 331, No. 7 (2011).

[113] J. Andreeva, et al., Automating ATLAS Computing Operations using the Site Status Board, J. of Phys.: Conf. Ser. Vol. 396, No. 3, IOP Publ. (2012).

[114] S. Campana, et al., Evolving ATLAS computing for today’s networks, J. of Phys.: Conf. Ser. Vol. 396, No. 3, IOP Publ. (2012).

[115] Site Status Board web interface (shifter view).

[116] ATLAS Collaboration, et al., ATLAS high-level trigger, data acquisition and controls technical design report, ATLAS Technical Design Reports (2003).

[117] J. Zhang, ATLAS data acquisition, Real Time Conf., 16th IEEE-NPSS, IEEE (2009).


[118] T. Cornelissen, et al., Concepts, design and implementation of the ATLAS New Tracking (NEWT), No. ATL-COM-SOFT-2007-002 (2007).

[119] I. Antcheva, et al., ROOT - A C++ framework for petabyte data storage, statistical analysis and visualization, Comp. Phys. Commun. 180.12 (2009): 2499-2512.

[120] J. S. Wholey, Monitoring performance in the public sector: Future directions from international experience, Eds. J. Mayne and E. Zapico-Goñi, Transaction Publ. (2007).

[121] S. Ligus, Effective Monitoring and Alerting, O’Reilly Media Print (2002), ISBN: 978-1-4493-3352-2.

[122] H. Musheghyan, et al., HappyFace as a generic monitoring tool for HEP experiments, J. of Phys.: Conf. Ser. Vol. 664, No. 6, IOP Publ. (2015).

[123] X. Zhang and J. M. Schopf, Performance analysis of the globus toolkit monitoring and discovery service, MDS2, Perf., Comput. and Commun. Int. Conf., IEEE (2004).

[124] V. Mauch, et al., The HappyFace project, J. of Phys.: Conf. Ser. Vol. 331, No. 8, IOP Publ. (2011).

[125] J. Highsmith and A. Cockburn, Agile software development: The business of innovation, Computer 34.9 (2001): 120-127.

[126] Python Mako Templates.

[127] M. Owens and G. Allen, SQLite, Apress LP (2010).

[128] Python Matplotlib.

[129] Linux cron job.

[130] L. O. Gerlach, Update of HappyFace modules for Atlas, BSc thesis Nr.: II.Physik-UniGö-BSc-2015/07 (2015).

[131] E. Buschmann, Module developments on the HappyFace project for the ATLAS Grid Computing, BSc thesis Nr.: II.Physik-UniGö-BSc-2014/02 (2014).

[132] C. G. Wehrberger, HappyFace Meta-Monitoring for ATLAS in the Worldwide LHC Computing Grid, MSc thesis Nr.: II.Physik-UniGö-MSc-2013/07 (2013).

[133] J. Blomer, et al., Status and future perspectives of CernVM-FS, J. of Phys.: Conf. Ser. Vol. 396, No. 5, IOP Publ. (2012).

[134] J. Blomer, et al., Distributing LHC application software and conditions databases using the CernVM file system, J. of Phys.: Conf. Ser. Vol. 331, No. 4, IOP Publ. (2011).

[135] M. Blaha and J. Rumbaugh, Object-oriented modeling and design with UML, Prentice Hall (2005).

[136] R. Verma and C. Giridhar, Design Patterns in Python, Publ. by Testing Persp. (2011).

[137] J. Rumbaugh, et al., Unified Modeling Language Reference Manual, Pearson Higher Edu. (2004), ISBN: 0321245628.


[138] Umbrello modelling tool.

[139] UberFTP CLI.

[140] SRM CLI.

[141] S. Chacon and B. Straub, Pro Git, Apress (2014), Pro Git.

[142] How to create an RPM package.

[143] R. Petersen, Red Hat Enterprise Linux 6: Desktop and Administration, Surfing Turtle Press (2011).


Acknowledgements

I would like to express my deep thankfulness and respect to Prof. Dr. Arnulf Quadt for giving me the unique opportunity to come to Germany and to start my PhD at the II. Institute of Physics of the University of Goettingen. I am very thankful to him for his guidance and support during the whole time of my PhD, and also for all the chances to participate in workshops, conferences, meetings and summer schools, through which I got to know many other people and improved my professional qualification.

I would also like to express my gratitude to Prof. Dr. Ramin Yahyapour, who agreed to be the second referee for my PhD thesis.

I would especially like to say many thanks to Gen Kawamura, who helped me throughout the whole time of my PhD. Many thanks also to our other group members Erekle Magradze, Gerhard Rzehorz, Jordi Nadal and Joshua Wyatt Smith. Without their help, support and comments, my work would not have come to its final phase and I would probably not be writing this acknowledgement now.

The HappyFace project is a joint collaboration between the Karlsruhe Institute of Technology (KIT) and the Georg-August-University of Goettingen. Many thanks to Eric Buschmann, Lino Gerlach, Fabian Kukuck, Max Robinson and Christian Wehrberger at Goettingen University and to Stefan Letzelter, Marcus Schmitt, Fred Stober, Guenter Quast and Gregor Vollmer at KIT, who all contributed to the development of the HappyFace project.

Many thanks to Bernadette Tyson, Lucie Hamdi, Gabriela Herbold and Heidi Afshar for all their help with paperwork and their kindness towards me.

I want to thank Goettingen International (Vitali Altholz, Janja Kaucic, Bettina Irmer), who provided me with financial support within the ALRAKIS Action 2 programme.

I would like to say THANKS to all people working at the II. Institute of Physics for their help and support.

And finally, I would like to say GREAT THANKS to Jan Schuh, to my beloved parents (Julieta Khalatyan, Rubik Musheghyan), to my relatives (Norayr Khalatyan, Samvel Khalatyan, Vladimir Khalatyan, Natalia Khalatyan, Natalia Suslova) and to all my friends for their love, support and encouragement, and for crossing their fingers for all my achievements and successes.
