
Faculdade de Engenharia da Universidade do Porto
Licenciatura em Engenharia Informática e Computação

Grid Tools for the ATLAS Experiment

CERN
European Organization for Nuclear Research

Report of the LEIC Curricular Internship 2002/2003

Miguel Sérgio de Oliveira Branco

Supervisor at FEUP: Prof. Jaime Villate
Supervisor at CERN: Prof. António Amorim

October 2003


Abstract

The ATLAS Data Challenges, part of the ATLAS LHC experiment at CERN, are scheduled to be running by spring 2004, producing simulated data in high volumes. ATLAS is an international, geographically dispersed collaboration increasingly focused on the use of Grid technologies as its core computational infrastructure paradigm. Several Grid efforts are emerging within ATLAS as part of other international efforts, which have so far produced relatively incompatible middleware.

Our work at CERN was to design an ATLAS-wide data management system, to be part of the new automatic production system for the ATLAS Data Challenges. The initial proposal was a peer-to-peer system to replace the current data management approach. This proposal was later reviewed and, after several iterations, the current proposal is a grid-based central service capable of inter-operating with the various grid middleware in use by the ATLAS Data Challenges by spring 2004.

The proposed solution attempts to integrate several grid efforts with minimal interference in the participating sites' and collaborators' policies. It does not attempt to provide a revolutionary solution, nor could it, given the need of the ATLAS Data Challenges for a real, working production system on a short timescale.


Acknowledgments

The first thank you goes to my supervisor, Prof. Jaime Villate, for his support of my research interests and for introducing me to Grid Computing and to the research projects at CERN. After a short period studying grid technologies, I had the opportunity to work at CERN as part of these projects, due also to my supervisor Prof. António Amorim, to whom I also owe a thank you for his views, proposals and experience of working at CERN.

During my work at CERN, I wish to thank my supervisor Luc Goossens for his demanding and acute analysis of my proposals, and Markus Elsing for helping me keep the focus on developing work relevant to the ATLAS collaboration.

I must also acknowledge my faculty and work colleague Pedro Andrade for his significant collaboration during the senior training.

I would like to thank my parents and my brother; without your encouragement and inspiration, it would have been much more difficult!

In conclusion, I recognize that this research would not have been possible without the assistance of the Technical Student program, and wish to express my gratitude to CERN for the financial support.


Contents

List of Tables

List of Figures

1 Introduction
   1.1 Work Institutions
   1.2 ATLAS Experiment
      1.2.1 ATLAS Data Challenges
   1.3 Data Management Overview
      1.3.1 Expected solution
   1.4 Work Plan

2 Technology Review
   2.1 Overview
      2.1.1 Grid Computing
         Data Management in Grid Computing
      2.1.2 Peer to peer
         Data Management in Peer to peer
   2.2 Existing technologies
      2.2.1 Grid-based systems
         Globus Toolkit
         European DataGrid
         NorduGrid
         US Grid
      2.2.2 Peer to peer-based systems
         Chord
         Kademlia
         JXTA
         Boinc
      2.2.3 ATLAS tools
         Magda
         POOL

3 Data Management Tools for ATLAS Experiment
   3.1 The “Data Management Problem”
      3.1.1 Peer to peer-based system
         Adaptive Grid
      3.1.2 Grid-based system
         European DataGrid
         Storage Resource Broker
         NorduGrid
         LCG
         DMS
   3.2 Technological choices

4 Future Directions

5 Conclusion

Bibliography


List of Tables

3.1 dms-put
3.2 dms-replica
3.3 dms-get
3.4 dms-get-best
3.5 dms-cp
3.6 dms-mv
3.7 dms-ls
3.8 dms-rm
3.9 dms-mkdir
3.10 dms-rmdir
3.11 dms-meta-insert
3.12 dms-meta-query
3.13 dms-meta-update
3.14 dms-meta-remove
3.15 dms-metaattr-create
3.16 dms-metaattr-remove


List of Figures

1.1 ATLAS Detector
1.2 ATLAS Experiment Tiers

3.1 Adaptive grid step-by-step
3.2 EDG WP2-centric view
3.3 Storage Resource Broker
3.4 NorduGrid-based DMS
3.5 POOL file catalog-based DMS
3.6 Schema of the initial version of the DMS
3.7 Relational schema of the DMS database
3.8 Component organization for the first version of the DMS
3.9 DMS Web User Interface
3.10 Schema of ATLAS DC-2 Automatic Production System
3.11 Schema of the revised version of the DMS
3.12 Use Case for executing a job from the Automatic Production System


Chapter 1

Introduction

1.1 Work Institutions

The work was split into two periods, the first at the Departamento de Física of the Faculdade de Engenharia da Universidade do Porto and later at CERN, the European Organization for Nuclear Research.

CERN was created in 1951 as a provisional body called the "Conseil Européen pour la Recherche Nucléaire". This was a council, a body of people. In 1953 the Council decided to build a central laboratory near Geneva. At that time, pure physics research concentrated on understanding the inside of the atom, hence the word "nuclear". Very soon, the work at the laboratory went beyond the study of the atomic nucleus, on into higher and higher energy densities. Scientists at CERN work by looking for effects between the forces of nature that become noticeable only at very high energies. Therefore, from early on, it has been a High-Energy Physics institute, or a "HEP" institute. Because this activity is mainly concerned with the study of interactions between particles, CERN is commonly referred to as the "European Laboratory for Particle Physics" ("Laboratoire européen pour la physique des particules"), and it is this latter title that really describes the current work of the laboratory. CERN has its headquarters in Geneva. At present, its Member States are Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, the Netherlands, Norway, Poland, Portugal, the Slovak Republic, Spain, Sweden, Switzerland and the United Kingdom. India, Israel, Japan, the Russian Federation, the United States of America, Turkey, the European Commission and UNESCO have observer status.

1.2 ATLAS Experiment

The ATLAS Experiment[1] for the Large Hadron Collider is under construction at the CERN Laboratory in Switzerland. Its goal is to explore the fundamental nature of matter and the basic forces that shape our universe.

The requirements for the ATLAS experiment are vast and complex. An event is what occurs when two particles collide or a single particle decays. Over 30 million of these events will occur per second when the experiment is running, that is, when the detector is in what is called the online state. One interesting event will be buried within 10^12 events. Online software will greatly reduce the amount of data kept for offline processing, to about 100 events per second.


Figure 1.1: ATLAS Detector

Each event typically measures between 1 and 2 megabytes. So, given the average time the experiment is online, we can expect approximately 100 MB/s, or 10 TB/day. This will average approximately 1 PB/year. In fact, ATLAS expects the data volume to average approximately 1.8 PB/year once replicated data and auxiliary analysis parameters (log files, bookkeeping, detector parameters, ...) are included.
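These figures can be reproduced with a short back-of-the-envelope calculation. In the sketch below, the event size and the duty cycle (the fraction of the year the detector is actually taking data) are assumed values, picked only to reproduce the orders of magnitude quoted above; they are not official ATLAS numbers.

```python
# Back-of-the-envelope check of the data rates quoted in the text.
# EVENT_SIZE_MB and DUTY_CYCLE are assumed, illustrative values.
EVENT_RATE_HZ = 100        # events/s kept for offline processing
EVENT_SIZE_MB = 1.0        # each event measures roughly 1-2 MB
DUTY_CYCLE = 0.33          # assumed fraction of the year spent online

rate_mb_s = EVENT_RATE_HZ * EVENT_SIZE_MB              # ~100 MB/s
per_day_tb = rate_mb_s * 86_400 / 1_000_000            # ~8.6 TB/day
per_year_pb = per_day_tb * 365 * DUTY_CYCLE / 1_000    # ~1 PB/year

print(f"{rate_mb_s:.0f} MB/s, {per_day_tb:.1f} TB/day, {per_year_pb:.2f} PB/year")
```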

The processing power needed to analyze all data is expected to be around 1,000,000 SpecInt95, with a data throughput of 1 Terabit/s, and the expected number of simultaneous users of this data is approximately 1000. The experiment is expected to last at least 10 years. It is geographically widely distributed (four times larger than the previous CERN/LEP experiment, also related to High Energy Physics). It already includes over 2000 scientists and is expected to include far more, given the relatively transparent access policy to the experiment's data.

CERN is the top layer of the entire data infrastructure. Figure 1.2 shows the Experiment tiers.

Figure 1.2: ATLAS Experiment Tiers

While the detector is online, all data is processed at what is called Tier 0. Online software will choose parts of the data for future analysis, the interesting events. Those parts of the data are stored at CERN in Tier 1. The work done by the trainee at CERN was collaborating in the design of the Tier 1 infrastructure, in an effort called Data Challenges.

1.2.1 ATLAS Data Challenges

ATLAS has committed itself to a set of Data Challenges to validate its software and Computing Model[2].

About three years ago ATLAS Computing planned a first series of Data Challenges in order to validate its Computing Model, its software, its data model, and to ensure the correctness of the technical choices to be made. Since then, and particularly in the context of the CERN Review of LHC Computing, the scope and goals of these Data Challenges have evolved.

It is understood that they should be of increasing complexity and will use as much as possible the Grid middleware being developed in the context of several Grid projects. It is expected that the next edition of the Data Challenges, DC-2, scheduled for April 2004, will include this grid integration for the first time.

Therefore, it became necessary to analyze, test and plan the integration of all existing tools. ATLAS has a great diversity of tools, and even several alternative tools for the same tasks. Understanding what these tools are, how they can be grid-ified, whether they should be grid-ified or whether they should simply be dropped (since grid services may already cover their functionality) was a significant part of the analysis during the period at CERN.

During the LHC preparation phase, all experiments have large needs for simulated data, to design and optimize the detectors. Therefore, "Monte Carlo" data is produced in the following steps: particles emerging from the collisions are generated using programs typically based on physics theories and phenomenology (called generators); particles of the generated final state are transported through the simulation of the detector according to the known physics laws governing the passage of particles through matter; the resulting interactions with the sensitive elements of the detector are converted into information similar to the digital output from the real detector; the events are then reconstructed; and the Monte Carlo generated information is saved for comparison with the reconstructed information.
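The chain just described can be thought of as a simple pipeline. The sketch below only illustrates that flow; the function names and event contents are invented placeholders, not the actual ATLAS generation, simulation, digitization and reconstruction software.

```python
# Illustrative sketch of the Monte Carlo production chain described above.
# The functions are placeholders, not the actual ATLAS software.
def generate(physics_process):
    """Run an event generator (based on theory/phenomenology)."""
    return {"truth": f"generated {physics_process} event"}

def simulate(event):
    """Transport the generated particles through the detector simulation."""
    event["hits"] = "energy deposits in sensitive detector elements"
    return event

def digitize(event):
    """Convert simulated hits into data resembling real detector output."""
    event["raw"] = "digitized detector response"
    return event

def reconstruct(event):
    """Reconstruct the event from the (simulated) raw data."""
    event["reco"] = "reconstructed tracks and clusters"
    return event

# The Monte Carlo truth is kept alongside the reconstruction for comparison.
event = reconstruct(digitize(simulate(generate("Higgs candidate"))))
print(sorted(event.keys()))   # ['hits', 'raw', 'reco', 'truth']
```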

A major feature of the first edition of the Data Challenges (DC-1) was the preparation and deployment of the software required for the production of large event samples for the High Level Trigger and Physics communities, and the production of those large data samples as a worldwide distributed activity. Still, ATLAS DC-1 was in some aspects a largely manual task. The option of running all of DC-1 at CERN was not viable, given that the necessary resources were not all available. Therefore, organizing a worldwide distributed activity was necessary. Much was learned from DC-1 and it became clear that a more efficient distribution of the whole activity was necessary, by using Grid Computing.

With respect to the use of grid technologies, these promise several advantages for a multinational, geographically distributed project: they allow for a uniform computing infrastructure across the project, and they simplify the management and coordination of resources while potentially decentralizing tasks such as software development and analysis; lastly, the Grid is an affordable way to increase the available computing power.

ATLAS DC-2, which is currently being prepared, has decided to include an automatic production system for the first time. In addition, it has been decided that DC-2 will include significant integration with Grid technologies to prepare a more efficient distribution of all worldwide activities.

This production system is being designed to meet the following set of basic requirements:


• it shall be maximally automatic;

• production shall be run and stored on several flavors of both Grid and legacy (non-Grid) resources;

• the production system will be maximally robust.

The high level design will include the following components:

• a single logical production database (which may be replicated or distributed);

• production request tools to allow physicists to enter production requests;

• a supervisor (process) which does all management of the system (there may be more than one per computing flavor);

• a data management system, so that all ATLAS data is recorded and managed by a single logical system.

The architecture for the various components of the automatic production system is still being discussed. Some aspects require the design of new components or the adaptation of existing components to grid-aware environments. One such aspect is Data Management. Collaborating in the architecture for an ATLAS-wide Data Management system was the main goal of the senior training.
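To make the division of responsibilities concrete, the sketch below models the components listed above in a few lines of Python. All class names, method names and file names are hypothetical; the actual interfaces of the DC-2 production system were still under discussion at the time.

```python
# Hypothetical sketch of the DC-2 automatic production system components.
# Names and interfaces are illustrative only.
class ProductionDatabase:
    """Single logical database of production requests and job records."""
    def __init__(self):
        self.jobs = []
    def next_pending_job(self):
        return self.jobs.pop(0) if self.jobs else None

class DataManagementSystem:
    """Single logical system recording where every ATLAS file is stored."""
    def register(self, logical_name, location):
        print(f"registered {logical_name} at {location}")

class Supervisor:
    """One supervisor per computing flavor (LCG, NorduGrid, US Grid, legacy)."""
    def __init__(self, flavor, proddb, dms):
        self.flavor, self.proddb, self.dms = flavor, proddb, dms
    def run_once(self):
        job = self.proddb.next_pending_job()
        if job is not None:
            # Submit to the grid/legacy resource and record the output files.
            self.dms.register(job["output"], self.flavor)

proddb, dms = ProductionDatabase(), DataManagementSystem()
proddb.jobs.append({"output": "dc2.simul.0001.root"})
Supervisor("LCG", proddb, dms).run_once()
```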

1.3 Data Management Overview

Dealing with an experiment that will produce approximately 1.8 PB of data per year, which has to be discovered, analyzed and possibly transferred between geographically distributed locations, requires a very robust data management system.

The problem of data management, among others, led the LHC Computing Grid Project (LCG) to provide the computing infrastructure for the simulation, processing and analysis of LHC data for all four of the LHC collaborations (of which ATLAS is the largest). The LCG scope spans all the computing aspects involved in the LHC. The main emphasis of the LCG project is the deployment of Grid technologies for LHC computing.

In the timescale of DC-2 (April 2004), LCG will have deployed LCG-2, the second version of the European DataGrid Project middleware adapted for Particle Physics. LCG-2 will be largely based on LCG-1. LCG-1 was scheduled for release in early June 2003, but was delayed and released only in September 2003, which led to some changes in the initial work plan of the senior training, which was to build Grid tools for LCG-1, and more specifically in the context of ATLAS.

LCG is not the only grid project being used in ATLAS. Since ATLAS is a multinational effort, several other Grid projects have been contributing their middleware, as is the case of NorduGrid and US Grid. All these grid projects, given their short lives, have produced rather incompatible middleware. Therefore, it is partly the job of the experiments to try to integrate the available grid middleware.

The problem of integrating Grid middleware is particularly acute in the case of ATLAS. Since DC-2 proposes to use grid technologies by April 2004, these grids must be able to talk to each other; otherwise, for example, data stored in US Grid cannot be accessed from LCG-1/2 or NorduGrid. Also, by DC-2 ATLAS will not use only grid technologies. Several legacy resources, such as HPSS or Castor tape systems, will still be used and play a very important role.

Building a global ATLAS-wide Data Management System that uses Grid technologies, is able to interact with multiple grid flavors and is also able to use legacy resources without significant loss of functionality is a major requirement. It is not a trivial problem to address, given a multiplicity of factors:

• delays in the delivery of Grid middleware, which condition the DC-2 timescale (the case of LCG-1);

• the current, premature state of Grid middleware, given the early phase of grid computing techniques;

• the lack of standards in grid computing, which leads to incompatibility between grids;

• the DC-2 requirement of using several grid flavors and legacy resources;

• inexperience and uncertainty in the application of grid computing.

1.3.1 Expected solution

DC-2 expects to have a unique ATLAS-wide data management system, so that data can be accessed, transferred and analyzed independently of the location or "hosting environment", whether it is hosted on a grid-aware site (regardless of the grid flavor being used) or on a legacy system (typically tape-based systems).

This system must be fully functional by April 2004 and is to be integrated with the automatic production system for DC-2.

Another issue that must be taken into account is that it is infeasible for all users to access a single instance of all data. One solution is data replication. Identical replicas of data are generated and stored at various globally distributed sites. Replication can reduce data access latency and increase the performance and robustness of distributed applications. The existence of multiple instances of data, however, introduces additional issues. Replicas must be kept consistent, they must be locatable, and their lifetime must be managed. These and other issues necessitate a high-level system for replica management in Data Grids. A coherent data management system covering all these issues must be provided for DC-2.

1.4 Work Plan

The work plan was conditioned by several factors previously referred to. The major factor was the delay in the release of LCG-1. Therefore, only tests could be done on pre-releases of LCG-1 until September 2003. No formal work could be done since the features of LCG had not yet been frozen or made stable.

Therefore, the work plan was divided into four major phases:

1. Initial contact with Grid Computing technologies, during work at the Faculdade de Engenharia da Universidade do Porto


2. Initial contact with CERN, the European Organization for Nuclear Research, with the ATLAS experiment and the Data Challenges. Analysis of the status of grid computing middleware and its applicability in the context of DC-2

3. Architectural design for a new data management system, to be produced from scratch, with the purpose of replacing all data management systems currently used by ATLAS

4. Architectural design and prototype implementation of an ATLAS-wide Data Management System, which must be functioning and in production by ATLAS DC-2

The first two phases were part of an initial learning phase about grid computing and how it might help the ATLAS experiment. Several alternatives to grid computing techniques were also studied.

The third phase, given the lack of grid computing technologies ready for production at the time, and given some problems identified in them, was the architectural design of a new data management system. This was later proposed to some members of the ATLAS Data Challenges team.

The final phase consisted of the architectural design and initial prototype implementations of possible data management technologies to fit the requirements and the timescale of DC-2.


Chapter 2

Technology Review

The recent trend in grid computing has led to the emergence of many different data management alternatives. The data management problem is not new, and grid computing certainly has not been the first to propose possible solutions to it. Even in the recent past, there were proposals to develop global data management systems based on peer-to-peer (P2P) technologies. P2P was perhaps the "next big thing" just before the recent emergence of grid computing. Previously, other solutions had also been presented, but given Moore's law these can no longer be considered. A 1 TeraByte hard drive will certainly be common when the ATLAS detector is running in 2007. These factors must be considered when designing the system. Tape storage systems, as is the case of Castor at CERN, or HPSS at other ATLAS sites like Brookhaven National Laboratory (BNL) in the United States, will still be used for the foreseeable future. Still, developing new middleware based only on these systems and their principles is less and less interesting.

As knowledge about the structure of the ATLAS Data Challenges was gained, it became clear that the current status of grid computing tools was insufficient for ATLAS. Grid is definitely very recent and, although Grid systems have been deployed and used in many different sciences, the requirements of Particle Physics far exceed any previous attempts.

Although grid computing will most likely prove to be the medium-term solution for ATLAS, it is uncertain whether it will be the short-term solution, or even the long-term one. During part of the senior training, other alternative systems were studied. The most obvious alternative was P2P-based systems.

Next, an overview of Grid and P2P systems is presented, followed by a section with emphasis on the data management technologies currently available in both.

2.1 Overview

Two approaches to distributed computing and data management have emerged in the past few years. Both claim to be able to address the problem of organizing large-scale computational societies. These technologies are peer-to-peer and Grid Computing.

Both have seen rapid growth and adoption, widespread deployment and successful application. P2P has had more success in this respect, given that there are widespread end-user applications that take advantage of its concepts. P2P and Grid have similar final objectives, but different communities and, currently, different designs.

Convergence of Grid computing and P2P systems has recently been discussed. The creation of a P2P working group as part of the Global Grid Forum (GGF)[3] is proof of this. The GGF intends to be the equivalent of the World Wide Web Consortium for Grid computing.

2.1.1 Grid Computing

The state of the art of grid computing is best shown by analyzing the leading development consortium, the Globus Alliance[4]. The Globus Alliance is currently the leading research and development center for the fundamental technologies behind "The Grid"[5]. Ian Foster and Carl Kesselman, the names that launched the concept of the Grid in late 1998, have been presenting versions of an extensible open Grid architecture.

Grids appeared as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing. The focus of Grid Computing is not building a cluster system; it is rather the integration of existing clusters at a higher level. It is not about connecting several machines together but about connecting several sites together, considering that each site has different and even heterogeneous resources.

The "Grid problem" is providing flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources[6]. It introduces the concept of Virtual Organizations, where an organization is no longer a static collection of individuals and resources, but can span several existing organizations. Therefore, problems such as unique authentication, authorization, resource access and resource discovery emerge in a different and more problematic manner.

Grid computing should be distinguished from other major technological trends, such as Internet, enterprise, distributed and peer-to-peer computing. This does not mean that some of these trends cannot converge with the Grid, thus providing for and filling the gaps of current Grids.

The core problem underlying the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing being addressed is not primarily file exchange, but rather direct access to computers, software, data and other resources required for a range of collaborative problem-solving initiatives. This sharing must be carefully controlled, so that resource providers define clearly and carefully what is shared, who is allowed to share, and any conditions that must be met for the sharing to occur.

Current distributed technologies are unable to address these concerns and requirements as efficiently as Grid computing. There is a need for highly flexible sharing relationships, ranging from client-server to peer-to-peer; for sophisticated and precise levels of control over how shared resources are used, including fine-grained and multi-stakeholder access control, delegation, and application of local and global policies; for sharing of varied resources, from programs to files, data in computers, sensors, even networks; and for diverse usage modes, ranging from single-user to multi-user environments, with performance-sensitive and cost-sensitive quality of service, scheduling, co-allocation and accounting.

Internet technologies address communication and information exchange among computers but do not provide any integrated approach to the coordinated use of resources at multiple sites for computation purposes. Business-to-business exchanges focus only on information sharing, typically via centralized servers. Virtual enterprise technologies offer the same information sharing, although here sharing can typically be extended to include applications and physical devices. Enterprise distributed computing technologies such as CORBA or the Java Enterprise set of technologies provide resource sharing in the context of a single organization. The Open Group's Distributed Computing Environment (DCE) is an industry-standard, vendor-neutral set of distributed computing technologies in a mature state, providing only a middle system. Still, most Virtual Organizations (VOs) would find it very inflexible. Storage Service Providers (SSPs) and Application Service Providers (ASPs) allow outsourcing of storage and computing requirements to other parties, but in constrained ways (i.e. using VPN connections).

The emerging concept of truly "Distributed Computing" seeks to harness the usefulness of idle computers on an international scale, but to date only supports highly centralized access to those resources. The conclusion is that current technologies do not accommodate the broad range of requirements that Grid computing is designed to address, with high flexibility in the range of resource types, or the control in sharing relationships needed to establish useful VOs.

Over the past five years, the grid community has produced protocols, services and tools that attempt to address the "Grid problem". The produced technologies range from security solutions, management of credentials and policies when computations span multiple institutions, and resource management protocols and services that support secure remote access to computing and data resources and the co-allocation of multiple resources; to information query protocols and services that can provide configuration and status information about resources, organizations and services; and data management services which locate and transport datasets between different storage systems and applications.

The main focus of the current trend in Grid Computing is the Virtual Organization. It is believed that VOs have the potential to change the way computers are seen and computing problems are solved. VOs enable disparate groups of organizations and/or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal.

The establishment, management and exploitation of dynamic, cross-organizational VO sharing relationships require new technologies. In defining a grid architecture, it is established that effective VO operation requires that it be possible to establish sharing relationships among any potential participants. Interoperability is thus the central issue; in a network, interoperability means common protocols. Hence, the Grid is first and foremost a protocol architecture, with protocols defining the basic mechanisms that VO users utilize to negotiate resources and to establish, manage and exploit those sharing relationships.

Defining a standards-based open architecture has been the most significant issue in current grid computing. These standards must define an architecture that provides extensibility, interoperability, portability and code sharing. This technology and architecture constitute the grid middleware.

Interoperability ensures that relationships can be initiated among arbitrary parties, accommodating new participants dynamically across different platforms, languages and programming environments. Protocols are critical to interoperability, so that there is a definition specifying how distributed system elements interact with one another in order to achieve a specified behavior, and the structure of the information exchanged during that interaction. Services are crucial so that an implemented behavior is well defined. Defining standard services for accessing computation, data, resource discovery, co-scheduling, data replication and many others allows the enhancement of the services provided by VO participants, and abstracts away resource-specific details.

Naturally, the existence of Application Programming Interfaces (APIs) and Software Development Kits (SDKs) is required so that users can operate grid systems. Application robustness, correctness, development and maintenance costs are all important concerns. APIs and SDKs are adjuncts to protocol definitions. A single implementation everywhere is not possible in a grid environment; a single standard has to be deployed.

The Globus Toolkit[7] is the most widely used grid middleware. Several grid efforts use the Globus Toolkit as the underlying middleware. Grid systems are now being used to handle large-scale, highly-distributed computing. The state of the art in grid systems shows that the technology works, although it still requires heavy understanding, configuration and adaptation. Therefore, the most important work in Grid now is mostly the definition of standards. The technical differences between what each Grid toolkit provides are secondary; first and foremost come the standards on which all grid systems must be based. Currently, with the advent of Globus Toolkit 3.0, the first standard has been published by the Global Grid Forum. This standard is OGSI[8].

Data Management in Grid Computing

Data management is one of the issues addressed by the "Grid problem".

Grid Computing systems already provide several somewhat similar data management systems. Still, standards are lacking, which poses a major problem for any world-wide data management system.

It is possible to use the data management system provided by the Globus Toolkit, either in version 2.2 or 3.0. The European DataGrid project also provides such systems, as is shown in the following section. Still, since standards are lacking, these implementations, although similar, are incompatible.

The current state of grid computing already provides transparent access to storage resources across multiple VOs, with authentication, authorization, and very primitive forms of data brokering and data sharing. Still, it does not provide transparent access between different Grid implementations.

2.1.2 Peer to peer

Peer to peer can be defined as a class of applications that take advantage of resources (storage, cycles, content, human presence) available at the edges of the Internet[9].

Since accessing these decentralized resources means operating in an environment of unstable connectivity and unpredictable resources, P2P nodes must operate outside some of the basic Internet operating protocols and have significant or total autonomy from central servers.

Peer to peer is an example of more general sharing modalities, beyond client-server, and of computational structures somewhat similar to the characterization of VOs.

There is no single organization or group of organizations acting as the driving force in peer to peer technologies. It is more a collection of vertical solutions that end users can already utilize, each providing the concepts that its implementor agrees to. Standardization is not an issue for the P2P community, since standards do not exist and are not being created.

Therefore, describing the state of the art of P2P and its main focus depends on which peer to peer open project we are referring to. Gnutella[10], Napster[11], Freenet[12] and KaZaa[13] are peer to peer solutions known to the general public. There are several underlying protocols being developed by several companies but, as usual, these differ greatly.

The first generation of peer to peer architectures used a central server through which all communication, publishing of information and resource discovery went. Naturally, these were not very reliable systems, because when the root server was down the peer to peer network was unavailable.


The second generation of peer to peer architectures used a completely decentralized approach, where communication was only established between nodes at the edge of the network. All resource discovery was done by broadcasting to the entire network. Information was not really published in the network, but kept in each end node. This was the case of Gnutella. Naturally, there were several approaches to improving these networks, such as caching the most common information requests on some nodes. These networks will be analyzed in greater detail in the following section.

The current, third generation of peer to peer architectures uses an approach somewhat intermediate between the previous two. There is no central server, but it is not a completely decentralized approach either. The third generation uses the concept of super nodes, which communicate with other super nodes and are each responsible for a small part of the global peer to peer network. These super nodes are very dynamic, so their failure does not significantly affect the network. This is the case of KaZaa.

All modern peer to peer implementations are based on the third generation. There have been some evolutions of second generation systems, in an attempt to emulate the behavior of third generation systems.
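As a toy illustration of the super-node idea (with all names invented and the dynamics of a real network such as KaZaa ignored), each super node indexes the content of the peers attached to it, and a lookup queries the super nodes instead of flooding every peer:

```python
# Toy illustration of a third-generation (super-node) peer-to-peer lookup.
# Real systems are far more dynamic; this only sketches the idea.
class SuperNode:
    def __init__(self):
        self.index = {}                      # filename -> set of peer ids
    def publish(self, peer, filename):
        self.index.setdefault(filename, set()).add(peer)
    def lookup(self, filename):
        return self.index.get(filename, set())

super_nodes = [SuperNode(), SuperNode()]     # each covers part of the network
super_nodes[0].publish("peer-A", "dataset.root")
super_nodes[1].publish("peer-B", "dataset.root")

# A query is answered by asking the super nodes, not by broadcasting to peers.
found = set().union(*(sn.lookup("dataset.root") for sn in super_nodes))
print(found)                                 # both peer ids are located
```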

Data Management in Peer to peer

Peer to peer systems are specifically meant to solve the problem of Data Management: sharing information on a global scale. Current P2P networks manage huge amounts of data in a very fault-tolerant manner.

Currently, they do not address the diversity of issues that grid computing is supposed to manage. Still, the vision that motivates both trends - a worldwide virtual computer within which access to resources and services can be negotiated as and when needed - will come to pass only if successful application of technology actually occurs[14].

Peer to peer developers are becoming increasingly ambitious in their applications and services, and the goal of providing data access on a world-wide scale is being pursued there far more than in any other technology.

A peer to peer network such as KaZaa already provides access to several petabytes of data to anyone who connects to it. Still, these networks have unique characteristics, such as a lack of global knowledge and of persistency, that condition their use. Nevertheless, the ease of use and level of fault tolerance of these systems far surpass those of grid systems.

2.2 Existing technologies

This section analyzes the current status of grid and peer to peer based systems that might help solve the problem of Data Management in the context of DC-2. Some ATLAS tools currently in use which partially address this problem will also be discussed.

2.2.1 Grid-based systems

Globus Toolkit

The Globus Toolkit is one of the leading middleware providers for grid computing. Still, the middleware currently provided is somewhat low-level and not capable of addressing, by itself, the requirements of the LHC experiments[15].

Globus Toolkit 2.2 and 2.4 were the major releases in use during the training period. Globus 3.0 (GT3) was released on 30 June 2003 and has not been extensively studied.

The Globus Project provides software tools that make it easier to build computational grids and grid-based applications. These tools are collectively called the Globus Toolkit. The composition of the Globus Toolkit is usually described by the Globus Alliance as three "pillars". Each pillar represents a primary component of the Globus Toolkit and makes use of a common foundation of security. These pillars are Resource Management, Information Services and Data Management.

GRAM implements a resource management protocol, MDS implements an information services protocol, and GridFTP implements a data transfer protocol. They all use the GSI security protocol at the connection layer. Focus will be given mainly to the GridFTP protocol and the remaining data management components, since these are the ones relevant to the problem.

These technologies are designed to be modular but complementary. Each component can be used separately or all together. Additionally, client software can be obtained and used independently of the server software.

The Globus Toolkit is currently the de facto underlying grid middleware toolkit used by most grid efforts. In particular, some of its components, such as GSI for security, are used almost as full standards for security in grid environments. The European DataGrid project uses Globus at its base and either replaces or adds on to some of its components. The NorduGrid project and US Grid projects also rely on Globus middleware.

Data Management in Globus is provided by combining several components. One is security, provided by GSI. Another is a grid-aware transfer protocol, GridFTP. Finally, there are the Replica Catalog and Replica Management components for managing replicas, and GASS to allow access to data located on any remote system using a unique URL.

The Globus Toolkit uses the Grid Security Infrastructure (GSI) for enabling secure authentication and communication over an open network. GSI provides a number of useful services for Grids, including mutual authentication and single sign-on. The primary motivations behind the GSI are:

• The need for secure communication (authenticated and perhaps confidential) between elements of a computational Grid.

• The need to support security across organizational boundaries, thus prohibiting a centrally managed security system.

• The need to support "single sign-on" for users of the Grid, including delegation of credentials for computations that involve multiple resources and/or sites.

GSI is based on public key encryption, X.509 certificates, and the Secure Sockets Layer (SSL) communication protocol. Extensions to these standards have been added for single sign-on and delegation. The Globus Toolkit's implementation of the GSI adheres to the Generic Security Service API (GSS-API), which is a standard API for security systems promoted by the Internet Engineering Task Force (IETF).

The other basic Data Management protocol provided by Globus is GridFTP. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks. The GridFTP protocol is based on FTP, the Internet file transfer protocol. A set of protocol features and extensions already defined in IETF RFCs were adopted, as were some additional features to meet requirements from current data grid projects. GridFTP provides the following protocol features:

• GSI security on control and data channels

• Multiple data channels for parallel transfers

• Partial file transfers

• Third-party (direct server-to-server) transfers

• Authenticated data channels

• Reusable data channels

• Command pipelining

A draft specification for the GridFTP protocol has been proposed through the Global Grid Forum for standardization, and there is also the intent to submit a draft standard to the IETF for review and approval.
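In practice, a GridFTP transfer is typically driven through the globus-url-copy client shipped with the toolkit. The sketch below simply wraps that client from Python; the host names and paths are placeholders, and a valid GSI proxy certificate (created beforehand, for example with grid-proxy-init) is assumed.

```python
# Minimal sketch: driving a GridFTP transfer through the globus-url-copy
# client from Python. Host names and paths are placeholders; a valid GSI
# proxy certificate is assumed to exist already (e.g. via grid-proxy-init).
import subprocess

def gridftp_copy(source_url, destination_url):
    """Copy a single file between GridFTP (gsiftp://) or local (file://) URLs."""
    subprocess.run(["globus-url-copy", source_url, destination_url], check=True)

if __name__ == "__main__":
    gridftp_copy(
        "gsiftp://se.example.org/storage/dc2/dataset.root",   # placeholder
        "file:///tmp/dataset.root",
    )
```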

Managing replicas in Globus is done using two components: the Replica Catalog and Replica Management. Data replicas allow the existence of multiple copies of data stored in different systems, to improve access across geographically-distributed Grids. These replication technologies currently include a Replica Catalog¹ that is accessible via a client API and a command-line tool, and a Replica Management tool that combines the Replica Catalog with GridFTP to manage data replication.

Recent developments included a new SQL-based Local Replica Catalog tool and revised replica management tools, done in partial collaboration with CERN and included in the new version 3 of the Globus Toolkit.

Replica management is an important issue for a number of scientific applications. A typical use case is a data set that contains one petabyte of experimental results for a particle physics application. While the complete data set may exist in one or possibly several physical locations, it is likely that few universities, research laboratories or individual researchers will have sufficient storage to hold a complete copy. Instead, they will store copies of the most relevant portions of the data set on local storage for faster access. Replica Management is the process of keeping track of where portions of the data set can be found.

Globus Replica Management integrates the Globus Replica Catalog (for keeping track of replicated files) and GridFTP (for moving data) and provides replica management capabilities for data grids.

The Globus replica management implementation involves a software API, an associated library, and a command-line tool providing the same functionality.

The software API and associated library provide client functions that allow files to be registered with the replica management service, published to replica locations, and moved among multiple locations. The library uses the Globus Replica Catalog and GridFTP technologies to accomplish this.

¹ Implemented as an LDAP directory in version 2.x of Globus, used at the moment in most grid efforts.


The Globus Replica Catalog supports replica management by providing mappings between logical names for files and one or more copies of the files on physical storage systems.
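At its core the catalog is exactly that mapping. Below is a minimal in-memory sketch of the idea, with all logical names and storage URLs invented for illustration; a real catalog is kept in an LDAP directory or a database, as described next.

```python
# Minimal sketch of the logical-to-physical mapping kept by a replica catalog.
# All entries are invented; a real catalog lives in LDAP or a database.
replica_catalog = {
    "lfn:dc2.simul.0001.root": [
        "gsiftp://castor.cern.ch/atlas/dc2/simul.0001.root",
        "gsiftp://se.bnl.gov/atlas/dc2/simul.0001.root",
    ],
}

def list_replicas(lfn):
    """Return every known physical copy of a logical file."""
    return replica_catalog.get(lfn, [])

def register_replica(lfn, physical_url):
    """Record one more physical copy of a logical file."""
    replica_catalog.setdefault(lfn, []).append(physical_url)

register_replica("lfn:dc2.simul.0001.root",
                 "gsiftp://grid.uninett.no/atlas/dc2/simul.0001.root")
print(len(list_replicas("lfn:dc2.simul.0001.root")))   # 3 replicas known
```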

The implementation of the Globus Replica Catalog also involves a software API, an associated library, and a command-line tool providing the same functionality.

As before, the software API and associated library provide client functions that allow manipulation of data in a replica catalog. In this implementation, the library runs against a standard LDAP directory server. There are numerous commercial, open source and free software LDAP servers that can be used with this library.

Despite its current LDAP-based implementation, the API was constructed to be independent of LDAP, so that future implementations could use other access protocols or storage mechanisms (e.g., an SQL database). This port to an SQL-based database was done in GT3.

One final component of Data Management in the Globus Toolkit is GASS (Globus Access to Secondary Storage). This module allows applications to access data stored in any remote file system by specifying a URL. This URL may be in the form of an HTTP URL (if the file is accessible via an HTTP server) or an x-gass URL (in other cases). Support for GridFTP URLs has naturally been included as well.

European DataGrid

The European DataGrid[16] (EDG) project was a European Union funded project to develop a Data Grid. The objective was to build the next generation computing infrastructure providing intensive computation and analysis of shared large-scale databases, from hundreds of Terabytes to Petabytes, across widely distributed scientific communities. EDG paid special attention to the High Energy Physics community, since part of the project was being developed at CERN.

EDG was not supposed to produce a grid ready for production use by the LHC experiments. Instead, it is the job of LCG[17] (the LHC Computing Grid project) at CERN to adapt the results of the EDG project for High Energy Physics. Therefore LCG-1, the first production-ready release of LCG, is mainly EDG 2.0 middleware (some components were left out), with some corrections and adjustments, besides new components specific to High Energy Physics.

Regarding the Data Management components, these are for the moment the same between EDG 2.0 and LCG-1. As with all European Union funded projects, EDG is split into several work packages. Work Package 2[18] (WP2), responsible for Data Management, was developed precisely at CERN. There is also Work Package 5[19] (WP5), which developed interfaces to Mass Storage Systems, such as CERN's Castor.

The goal of WP2 is to specify, develop, integrate and test tools and middleware infrastructure to coherently manage and share petabyte-scale information volumes in high-throughput, production-quality grid environments. The work package developed a general purpose information sharing solution, which is supposed to provide strong automation, ease of use, scalability, uniformity, transparency and heterogeneity.

The main requirements were to allow secure access to massive amounts of data in a universal global name space, to move and replicate data at high speed from one geographical site to another, and to manage synchronization of remote replicas. Other less explored areas were the support for wide-area data caching and distribution according to dynamic usage patterns.


Generic interfacing to heterogeneous mass storage management systems was also provided, to allow efficient integration of distributed resources.

EDG 2.0 replaces the old version EDG 1.4. EDG 2.0 is the focus of the analysis since it was the middleware toolkit chosen by the LCG for production use by all LHC experiments, including ATLAS.

With regard to Data Management, EDG includes the following components:

• WP2 - EDG 2.0 Replication

– Replica Manager

– RLS: Replica Location Service

– RMC: Replica Metadata Catalog

– ROS: Replica Optimization Service

• WP2 - EDG 2.0 Security

– EDG Java Security

– Trust Manager

– Authorization Manager

• WP5 - Mass Storage Management

The Replica Manager (RM), also known as the Replica Management Service (RMS) or Reptor[20], is a logical single entry point for the user to the replica management system. It encapsulates the underlying systems and services and provides a uniform interface to the user. Users of the RMS may be application programmers that require access to certain files, as well as high level Grid tools such as scheduling agents, which use the RMS to acquire information for their resource optimization task and for file transfer execution. Although the RMS provides transparent access to underlying systems and services, other tools may well require access to them, either directly or via a set of well defined interfaces. The RMS in this case acts as a proxy that delegates requests to the appropriate service.

The access mechanisms to the RMS must deal with the trade-off between ubiquitous access and high performance. SOAP was used as the interface since it allows widespread publication of a generic interface to this service.

The RMS also provides work-flow management functionality for VOs. It is able to schedule requests originating from within one VO according to some predefined priorities, and should also be able to schedule independent activities within requests, such that the available resources are exploited optimally. Logically, the RMS provides a single point of entry. For performance and fault tolerance reasons, however, the RMS may be configured as a distributed service. In this case an RMS server being unavailable or overloaded could be replaced by an alternative RMS server. Of course, a distributed RMS service introduces consistency issues, which are still the subject of ongoing discussions.

The RMS has full control over all the files that have been created or registered through it. In other words, the RMS assumes a particular role on the Grid, allowing it to manage (copy, delete, modify) files that are under its control. Once a file is under RMS control, no one else is allowed to delete or modify the file. If the file is withdrawn from RMS control (e.g. by unregistering it), the original access control properties are restored. This also requires a close interaction with the underlying Storage Element, which itself can create and delete files.
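To give an impression of how a client might see this single entry point, the sketch below defines a hypothetical interface with operations in the spirit of the Replica Manager (copy-and-register, replicate, list, delete). The method names and behaviour are assumptions made for illustration; they are not the actual EDG API, and the in-memory dictionary stands in for the real catalogs.

```python
# Hypothetical client-side view of a Replica Management Service.
# Method names only mimic the spirit of the EDG Replica Manager; they are
# not the actual EDG API. A dictionary stands in for the catalogs.
class ReplicaManagerClient:
    def __init__(self):
        self.catalog = {}                         # lfn -> list of storage URLs

    def copy_and_register(self, local_path, storage_url, lfn):
        """Upload a new file to a Storage Element and register it."""
        self.catalog[lfn] = [storage_url]
        return storage_url

    def replicate(self, lfn, destination_url):
        """Create an additional replica of an already registered file."""
        self.catalog[lfn].append(destination_url)

    def list_replicas(self, lfn):
        return list(self.catalog.get(lfn, []))

    def delete(self, lfn, storage_url):
        """Remove one replica; only the RMS is allowed to do this."""
        self.catalog[lfn].remove(storage_url)

rm = ReplicaManagerClient()
rm.copy_and_register("/tmp/dataset.root",
                     "srm://castor.cern.ch/atlas/dataset.root",   # placeholder
                     "lfn:dataset.root")
rm.replicate("lfn:dataset.root", "srm://se.nbi.dk/atlas/dataset.root")
print(rm.list_replicas("lfn:dataset.root"))
```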

One of the two major components of RMS is the Replica Location Service (RLS). RLS isprovided in two flavors: EDG-RLS and Globus-RLS. The RLS was initially developed as a jointproject between EDG and Globus. Later, the projects split. Together with the Replica LocationIndex (RLI), the Local Replica Catalog (LRC) forms the EDG Replica Location Service. TheLRC maintains independent and consistent information about replicas at single site. There maybe one LRC that maintains information about replicas in one or more Storage Resource Man-agements2 (SRM) at a site, or at larger sites, one LRC may be deployed per Storage ResourceManagement. The RLIs hold collective information about replicas distributed across manyLRCs. Currently only the LRC component was deployed and used in LCG-1. It is expected thatRLI is included in next versions of LCG3

The EDG-RLS implementation is a pure-Java, web-services based implementation. Currently only one of the Replica Catalog components - the Local Replica Catalog (LRC) - exists. The EDG-RLS can be viewed as another protocol to access the data in the LRC, through a set of secure SOAP-RPC calls instead of a client API.

Both the Globus-RLS and the EDG-RLS can be run on the same LRC database, and both channels can be used to access the data in the tables. Globus-RLS implements the access through secure Globus I/O between the client and the server and through MyODBC on the server side to the underlying MySQL database, while EDG-RLS implements the client-server access through SOAP over HTTPS and JDBC on the server side.

The other major component of the RMS is the EDG Replica Metadata Catalog (RMC), which provides access to metadata on Logical Files and Collections. It is used by the Replica Manager to store access control lists, transactional information and other metadata.

The EDG Replica Optimization Service (ROS) is the component of the EDG middleware that performs data access optimization. Its main function is to evaluate the cost of accessing a set of files from a grid Computing Element (CE). The ROS is meant to be called by the EDG Resource Broker (developed by another EDG work package) in order to obtain information used for making optimized job scheduling decisions.
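
The kind of decision the ROS supports can be pictured as a simple cost minimization. The sketch below is not the EDG implementation; it just assumes that per-Storage-Element access cost estimates (as a ROS-like service would report for a given CE) are available:

    def best_replica(replicas, cost_to_ce):
        """Pick the replica with the lowest estimated access cost from a given CE.

        replicas   -- list of SFNs for the same logical file
        cost_to_ce -- dict mapping each SE host to an estimated cost
                      (e.g. expected transfer time), as a ROS-like service would report
        """
        def se_host(sfn):
            # crude parsing of "sfn://host/path", for illustration only
            return sfn.split("//", 1)[1].split("/", 1)[0]

        return min(replicas, key=lambda sfn: cost_to_ce[se_host(sfn)])

    replicas = ["sfn://castor.cern.ch/atlas/f1", "sfn://hpss.bnl.gov/atlas/f1"]
    print(best_replica(replicas, {"castor.cern.ch": 2.5, "hpss.bnl.gov": 9.0}))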

Although the ROS would be a very beneficial component for the experiments and for ATLAS, it is not part of LCG-1. In the DC-2 automatic production system, one of the desired features of an ATLAS-wide Data Management System is automatic data transfer using optimization methods based on some declarative strategy. The ROS would partially solve this problem. Unfortunately, since it is not part of LCG-1 (for stability reasons), the experiments have to develop this component themselves.

Other relevant components from EDG are the Security components of WP2, which provide secure access to Java-based web services and come in two major parts: the Trust Manager and the Authorization Manager. These implement the Java and web services functionality necessary to use Globus GSI security mechanisms. The other relevant component is produced by WP5, whose main objective is to provide a grid interface to the mass storage systems used by the EDG partners and thus to facilitate the use and exchange of the large amounts of data required by these projects. For the ATLAS DC-2 production system this would provide the necessary compatibility with legacy resources such as CERN's Castor.

² SRM manages the use of a storage resource on a grid. The definition of a storage resource is flexible: it could be a single disk cache (referred to as DRM), a tape archiving system (referred to as TRM), or both (referred to as HRM, for Hierarchical storage system). Furthermore, an SRM at a site can manage multiple resources. SRMs do not perform file transfers themselves, but can invoke middleware components that do, such as GridFTP.

³ Possibly the Globus RLI can be used as the replica information index for the EDG LRC, instead of the EDG RLI.

NorduGrid

The NorduGrid[21] project designed a Grid architecture with the primary goal of meeting the requirements of the production tasks of the LHC experiments. While it is meant to be a rather generic Grid system, it puts emphasis on batch processing suitable for problems encountered in High Energy Physics. The NorduGrid architecture implementation uses the Globus Toolkit 2.2 as the foundation for various components developed by the project. While introducing new services, NorduGrid does not modify the Globus tools, such that the two can co-exist. The NorduGrid topology is decentralized, avoiding a single point of failure. The NorduGrid architecture is thus light-weight, non-invasive and dynamic, while attempting to be robust and scalable enough to meet the challenges of High Energy Physics use.

Regarding Data Management, the NorduGrid components can be split into two major parts: the Storage Element (SE) and the Replica Catalog (RC).

The Storage Element (SE) is a concept not fully developed by NorduGrid at this stage. So far, SEs are implemented as plain GridFTP servers. The software used for this is a GridFTP server, either the one provided as part of the Globus Toolkit or the one delivered as part of the NorduGrid Grid Manager. The latter is preferred, since it allows access control based on the user's Grid certificate instead of the local identities to which users are mapped. This added functionality is very useful for the HEP experiments to keep control of access to data.

The Replica Catalog (RC) is used for registering and locating data sources. NorduGrid makes use of the RC developed by the Globus project, with minor changes to improve functionality. RC records are entered and used primarily by the NorduGrid Grid Manager and its components, and can be used by the User Interface for resource brokering. It is based on an OpenLDAP server with the default LDAP DatabaseManager back-end. There are no significant changes to the original Globus RC object schema. Apparent OpenLDAP problems with transferring relatively large amounts of data over an authenticated/encrypted connection were fixed partially by applying appropriate patches, and partially by automatic restart of the server. Together with fault-tolerant behavior on the client side, this made the system usable. To manage the information in the RC server, the Globus Toolkit API and libraries were used. The only significant change was the addition of securely authenticated connections based on the Globus GSI mechanism.

US Grid

The United States has several particle physics grid projects. The most recent, Grid3[22], is the result of a partial merge of the iVDGL⁴, PPDG⁵, GriPhyN⁶, SDSS⁷, LIGO⁸, BTeV⁹, US CMS¹⁰ and US ATLAS¹¹ projects. The purpose of Grid3 is to build a grid environment to:

• Provide the next phase of the iVDGL Laboratory

• Provide the infrastructure and services needed for LHC production and analysis applications running at scale in a common grid environment

• Provide a platform for computer science technology demonstrators

• Provide a common grid environment for LIGO and SDSS applications.

The goal of the project is to develop, integrate, deploy and apply a functional grid across the LHC institutions, extending later to non-LHC institutions and other international sites, working closely together with other existing efforts. The major Grid3 project milestone is in the middle of November 2003, by which a Grid3 infrastructure is expected to be available for applications and scheduled iVDGL work[23].

The requirements for Grid3 are still being reviewed and revised, but consist of the following:

• Experiments must be able to effectively inter-operate and run their applications on non-dedicated resources

• Applications must be able to install themselves dynamically

• A grid architecture consisting of facilities (e.g. execution and storage sites), services and applications.

• Middleware based on VDT 1.1.9 (Globus-based middleware produced by iVDGL), with components from other providers as appropriate, such as components from LCG.

• An information service for resource publication and discovery based on Globus MDS.

⁴ The iVDGL is a global Data Grid that will serve forefront experiments in physics and astronomy. Its computing, storage and networking resources in the U.S., Europe, Asia and South America provide a unique laboratory that will test and validate Grid technologies at international and global scales. Sites in Europe and the U.S. will be linked by a multi-gigabit per second transatlantic link funded by the European DataTAG project.

⁵ The Particle Physics Data Grid collaboration was formed in 1999 because its members were keenly aware of the need for Data Grid services to enable the worldwide distributed computing model of current and future high-energy and nuclear physics experiments. Initially funded from the NGI initiative and later from the DOE MICS and HENP programs, it has provided an opportunity for early development of the Data Grid architecture as well as for evaluating some prototype Grid middleware.

⁶ The GriPhyN Project is developing Grid technologies for scientific and engineering projects that must collect and analyze distributed, petabyte-scale datasets. GriPhyN research will enable the development of Peta-scale Virtual Data Grids (PVDGs) through its Virtual Data Toolkit.

⁷ Sloan Digital Sky Survey.

⁸ Laser Interferometer Gravitational Wave Observatory.

⁹ The BTeV Experiment at FermiLab is designed to challenge the Standard Model explanation of CP violation, mixing and rare decays of beauty and charm quark states.

¹⁰ The United States representation of the Compact Muon Solenoid Experiment at CERN, one of the LHC experiments.

¹¹ The United States representation of the ATLAS Experiment.


• A simple monitoring service based on Ganglia, supporting hierarchical, grid-level collections with collectors at one or more of the operation centers. Ganglia version 2.5.3 or greater will be used. Ganglia provides a widely used real-time monitoring and execution environment, developed by the University of California, Berkeley.

• Other monitoring services such as MonaLisa or a work-flow monitoring system may also be deployed. MonaLisa provides a distributed monitoring service designed to meet the needs of physics collaborations for monitoring global Grid systems, and is implemented using JINI/JAVA and WSDL/SOAP technologies.

• Consistent method or set of recommendations for packaging, installing, configuring and creating run-time environments for applications. Several alternatives considered are:

– Pacman caches and instructions for pre-installation of application libraries. Pacman is a package manager allowing users to transparently fetch, install and manage software packages.

– Use of the Condor grid shell job-wrapping mechanism (application libraries are installed as part of the job)

– Precise instructions for the application environments, so that the job will already have all required application libraries installed

– Use of the Chimera Gridlauncher

– Use of the WorldGrid project mechanism

• One or more VO management mechanisms for authentication and authorization. The alternatives considered are:

– The VOMS server method developed by the VOX project

– The WorldGrid project method developed by Pacman

– LDAP VO servers, one for each VO, containing the DNs of the expected application users.

– An acceptable use policy signed by all participants

• Support for replica location services as required by the application groups, i.e. a Local Replica Catalog indexed to an experiment's Replica Location Service.

• A user-support model and other service requirements, such as maintaining a trouble-reporting system, a liaison function, etc.

• A common reporting and event logging tool or framework with a GUI.

• Data replication and data movement services with support for SRM, GridFTP and dCache as required

• Grid3 will be LCG compliant to every extent possible, by attempting to develop a consistent and compatible set of grid services.


Grid3 is a recent project and, while most of the components it relies on already exist, it is not yet a usable, deployed Grid environment. Nevertheless, being part of US ATLAS requires that its components be studied and analyzed, since these might prove to be valid alternatives in the near future.

2.2.2 Peer to peer-based systems

Peer to peer technologies were also considered as the basis for a global data management system. Several of these technologies were analyzed and some were even briefly tested.

Regarding peer to peer, the focus was on studying existing technologies and evaluating whether they could address the problem of locating data on a specific network node and whether they could help to implement replication mechanisms.

Unlike grid systems, which provide a rather more complete and ready-to-use system, in the case of P2P individual technologies dealing with issues such as finding data in distributed nodes had to be analyzed. The following technologies are not all alternatives to one another; they are more like components that could be deployed in a complete P2P approach to data management.

Chord

Chord[24] is a distributed lookup protocol that addresses the problem of locating the node that stores a particular data item. Chord provides only one basic operation: given a key, it maps the key onto a peer node. Data location can then be implemented on top of Chord by associating a unique key with each data item and storing the key/data pair at the node to which the key maps.

Chord adapts efficiently as nodes join or leave the network and is able to answer queries for the location of a given data item even if the system is continuously changing. Chord has proved to be scalable, with the communication cost and the state maintained by each node scaling logarithmically with the number of Chord nodes.

Chord uses a variant of consistent hashing to assign keys to Chord nodes. Consistent hashing tends to balance the load across nodes, since on average each node receives roughly the same number of keys, and there is little movement of keys between nodes when nodes join or leave the system.

Each Chord node needs routing information about only a few other nodes. Since the routing table is itself distributed across the network, a node resolves the hash function just by communicating with a few other nodes. In steady state, in an N-node system each node maintains information about only O(log N) other nodes and is able to resolve lookups via O(log N) messages to other nodes. Chord can maintain its routing information as nodes join and leave the system; with high probability each such event results in no more than O(log² N) messages.
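
As an illustration of the key-to-node mapping Chord builds on, the following sketch implements only the consistent-hashing step - each key is owned by its successor on the identifier ring - and omits the distributed O(log N) finger-table routing; the node names and keys are invented:

    import hashlib
    from bisect import bisect_left

    M = 2 ** 160  # size of the identifier space (SHA-1)

    def chord_id(name):
        """Map a node name or data key onto the identifier ring."""
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

    nodes = ["node-a.example.org", "node-b.example.org", "node-c.example.org"]
    ring = sorted((chord_id(n), n) for n in nodes)

    def successor(key):
        """Return the first node whose ID is >= hash(key), wrapping around the ring."""
        ids = [node_id for node_id, _ in ring]
        i = bisect_left(ids, chord_id(key)) % len(ring)
        return ring[i][1]

    print(successor("higgs001"))  # the node responsible for storing this key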

The Chord protocol solves, in a decentralized manner, the problem common to many distributed peer to peer applications of determining which node stores a data item. Still, by itself it does not provide global knowledge - which is unfeasible in a highly unstable system without centralized resources - nor does it provide caching and maintenance of replicas. Nevertheless, these features could be developed on top of Chord.


Kademlia

Kademlia[25] is also a peer to peer distributed hash table (DHT) protocol, but offers a number of desirable features not simultaneously offered by other protocols.

It minimizes the number of configuration messages nodes must send to learn about each other; configuration information spreads automatically as a side effect of key lookups. Nodes have enough knowledge and flexibility to route queries through low-latency paths. It uses parallel, asynchronous queries to avoid timeout delays from failed nodes. The algorithm used for nodes to record each other's existence is able to resist some basic denial-of-service attacks. Several important properties of Kademlia can be formally proved using only weak assumptions about uptime distributions.

Kademlia uses the same basic approach as other DHTs. Keys are opaque 160-bit quantities (usually SHA-1 hashes). Each node has a node ID in the same 160-bit space. Each (key, value) pair is stored on the nodes whose IDs are "closest" to the key, given some notion of closeness. Kademlia uses the XOR metric as its closeness function. XOR is symmetric, allowing Kademlia participants to receive lookup queries from precisely the same distribution of nodes contained in their routing tables.
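
The XOR closeness function is simple enough to sketch directly. In the toy code below node IDs are small integers for readability (real Kademlia IDs are 160 bits), and the routine just selects the k known nodes closest to a key:

    def xor_distance(a, b):
        """Kademlia distance between two IDs: bitwise XOR interpreted as an integer."""
        return a ^ b

    def k_closest(known_nodes, key, k=3):
        """Return the k known node IDs closest to `key` under the XOR metric."""
        return sorted(known_nodes, key=lambda node_id: xor_distance(node_id, key))[:k]

    known = [0b0001, 0b0100, 0b0101, 0b1100, 0b1111]
    print(k_closest(known, key=0b0111, k=2))   # -> [5, 4], i.e. 0b0101 and 0b0100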

Compared to systems such as Chord, which lack a symmetric metric, Kademlia is able to learn useful routing information from the queries it receives. Asymmetry also leads to rigid routing tables: each entry in a Chord node's finger table must store the precise node preceding some interval in the ID space. Kademlia uses a single routing algorithm throughout, whereas other systems use one algorithm to get near the target ID and a different algorithm for the last few hops.

Kademlia is perhaps one of the most suitable peer to peer systems, combining consistency, performance and minimum-latency routing over a symmetric, unidirectional topology. It also exploits the observed fact that node failure probability is inversely related to uptime.

Given the problem of locating where data is stored, Kademlia seemed the best choice for implementation. It is already used in peer to peer networks such as Overnet, with good results.

JXTA

Project JXTA[26] is more complete than the previous protocols. In fact, JXTA is not a single protocol but a complete set of them. Kademlia and Chord focus on the (crucial) problem of locating data in a peer to peer network; JXTA addresses a broader set of problems, providing a complete peer to peer network.

The Project JXTA protocols aim to establish a virtual network overlay on top of the Internet, allowing peers to interact directly and self-organize independently of their network connectivity and domain topology. JXTA enables application developers, not just network administrators, to design the network topology that best matches their application requirements. Multiple virtual networks may be created and dynamically mapped onto one physical network.

In JXTA, peers can exchange messages with any other peers independently of their network location (even behind firewalls or NATs, or on non-IP networks). Messages are transparently routed, using different transport protocols, to reach the receiving peers. Peers can therefore communicate without needing to understand or manage complex and changing physical network topologies, which allows mobile peers to move transparently from one location to another. JXTA also provides a standardized manner for peers to discover each other, self-organize into peer groups, discover peer resources and communicate with each other.


JXTA includes five different virtual network abstractions: a logical peer addressing model that spans the entire JXTA network; peer groups that let peers dynamically self-organize into protected virtual domains; advertisements to publish peer resources; a resolver to perform all binding operations in the distributed system; and pipes as virtual communication channels that allow the exchange of information.

The current JXTA implementation[27] (version 2.0) introduces a stronger differentiation between the way JXTA super-peers - used for relay and rendezvous - behave and interact with edge peers. Rendezvous peers can be connected within a peer group. Edge peers use a loosely-coupled distributed hash index to locate advertisements on the rendezvous peer view for efficient query lookups. The implementation also provides resource management - through threads and queues - and resource usage limits to fairly allocate resources between platform services. It supports TCP/IP relay for efficient traversal of NATs.

Project JXTA provides a very complete infrastructure covering most aspects of building peer to peer systems. Nevertheless, this architecture has proved to be overkill in many applications, given its complexity. JXTA has to be seriously considered when developing any peer to peer system. Still, if the only necessity is locating data - typically the main requirement of a peer to peer system - all other mechanisms, such as peer rendezvous for starting up the network, can be built using far simpler approaches.

BOINC

The Berkeley Open Infrastructure for Network Computing[28] (BOINC) is a software platform for distributed computing using volunteer computer resources, like SETI@home¹².

Many different projects can use BOINC. Projects are independent; each one operates its own servers and databases. However, projects can share resources in the following sense: participants install a core client program which in turn downloads and executes project-specific application programs. Participants control which projects they participate in, and how their resources are divided among these projects. When a project is down or has no work, the resources of its participants are divided among the other projects in which the participants are registered.

BOINC provides features that simplify the creation and operation of distributed computing projects. These are:

• It includes an application framework, so that existing applications in common languages (C, C++, Fortran) can run as BOINC applications with little or no modification. An application can consist of several files (e.g. multiple programs and a coordinating script). New versions of applications can be deployed with no participant involvement.

• Regarding security, BOINC includes protection against several types of attacks. For example, it uses digital signatures based on public-key encryption to protect against the distribution of viruses.

• It also supports multiple servers and fault tolerance, so that projects can have separate scheduling and data servers, with multiple servers of each type. Clients automatically try alternate servers; if all servers are down, clients back off exponentially to avoid flooding the servers when they come back up.

¹² SETI@home is a scientific experiment that uses Internet-connected computers in the Search for Extraterrestrial Intelligence (SETI). Volunteers participate by running a free program that downloads and analyzes radio telescope data.

• System monitoring tools are provided via a web-based system for displaying time-varying measurements (CPU load, network traffic, database table sizes). This simplifies the task of diagnosing performance problems.

• It was designed to support large data. BOINC supports applications that produce or consume large amounts of data, or that use large amounts of memory. Data distribution and collection can be spread across many servers, and participant hosts transfer large data unobtrusively. Users can specify limits on disk usage and network bandwidth. Work is dispatched only to hosts able to handle it.

• Multiple participant platforms are available. The BOINC core client is available for most common platforms (Mac OS X, Windows, Linux and other Unix systems), and the client can use multiple CPUs. Web-based participant interfaces are also provided for account creation, preference editing, and participant status display. A participant's preferences are automatically propagated to all their hosts, making it easy to manage large numbers of hosts. There is also configurable host work caching: the core client downloads enough work to keep its host busy for a user-specifiable amount of time. This can be used to decrease the frequency of connections or to allow the host to keep working during project downtime.

2.2.3 ATLAS tools

Some aspects of the data management problem were already analyzed in previous editions of the ATLAS Data Challenges. During DC-1, the problem of moving data between heterogeneous storage resources was solved with the Magda tool. LCG also provides a technology - POOL - with a different applicability, that must also be seriously considered since it will probably represent a medium-term solution to data management in ATLAS.

Magda

Magda[29] is a distributed data manager prototype for grid-resident data. It is partly based on tools developed for an earlier distributed analysis project, NOVA.

It was used in ATLAS during DC-1 for distributed data management prototyping and design. It has been in stable operation as a file catalog for CERN- and BNL¹³-resident ATLAS data and as a file replication tool between several ATLAS sites.

It was developed to fulfill the principal ATLAS '01-'02 deliverable for the Particle Physics Data Grid (PPDG) project: a production distributed data management system deployed to users and serving BNL, CERN, international ATLAS and many US ATLAS grid testbed sites.

Magda makes use of a MySQL database, Perl, Java and C++. The core of the system is a MySQL database, but the bulk of the system is a surrounding infrastructure for setting up and managing distributed sites with their associated data locations, data store locations within those sites, and the hosts on which data-gathering servers and user applications run; for gathering data from the various sorts of data stores; for interfacing with users via a web interface presenting the catalog and allowing queries and system modifications; and for replicating and serving files to production and end-user applications.

¹³ Brookhaven National Laboratory, part of the ATLAS collaboration.

It is composed of the following set of principal entities:

• prime: File catalog. Catalogs all instances of all files in the system.

• logical: Logical filename catalog. Metadata about logical files (associated keys, whether a master exists, etc.) which is not specific to particular physical instances.

• site: A computing facility that has a number of data storage locations associated with it; e.g. the CERN stage service is a site, as is the CERN Castor service, and the AFS disk space at the US ATLAS Tier 1 is another site.

• location: Data locations (e.g. a directory or directory tree, a staging pool, a directory in a mass store). A data location is associated with a particular site. A given location is designated as either a 'prime' or 'replica' location. Each file in a 'prime' location is the primary instance of that file; files in 'replica' locations are regarded as secondary replicas. Thus, the design requires that primary files 'live together' in primary-only locations, and similarly for replicas. This is not unduly constraining (rather it forces clean organization) because the granularity of a location is a directory or directory tree (or stage pool). Note that locations may overlap: a high-level directory tree may be defined as one location, and a subdirectory may be defined as another location with distinct characteristics (i.e. ownership, data type, location type, etc.)

• host: Computers on which parts of the system run, or from which users make use of the system. A host is a logical name for a computer or set of computers all of which can 'see' the sites/locations associated with the host. Each host has a list of sites which are accessible to it, and hence a list of associated locations (the locations belonging to accessible sites). A host is not (necessarily) a physical machine name but rather a logical name, with the local host identification made at runtime by checking an environment variable. When a system component like the file spider discovers what host it is running on, it thereby knows 'where it is' and 'what it can see', which determines what locations can be scanned or are accessible.

• collection: Collections of logical files.

• collectionContent: Logical file lists for collections.

• task: Catalog of replication tasks.

• generic sig: Generic 'data signature'. The signature associated with a data set records sufficient information about the processing history and characteristics of the data to be able to (in principle) regenerate it. Used e.g. in establishing equivalency between replicated data sets.

Content is hierarchically organized into: Virtual Organization, Group, Activity, Team and finally Personal.


Logical names are unconstrained by the system. They can be assigned arbitrarily by the user and are not used to carry attribute information. Uniqueness is required for the combination of logical name plus instance number (plus, in some cases, version name). The instance number of a file is zero for the master copy and, for replicas, it is the location ID. Therefore a given logical file can have only one instance per location.

Different file versions corresponding to a given logical name are supported. In this case, to uniquely identify a file, the version name must also be given.
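
A small sketch of the uniqueness rule just described, using an in-memory dictionary in place of Magda's MySQL tables (the field names and values are illustrative, not Magda's actual schema):

    catalog = {}  # (logical_name, instance, version) -> physical location

    def register(logical_name, location_id, path, version="", master=False):
        """Register a file instance; instance 0 is the master, replicas use the location ID."""
        instance = 0 if master else location_id
        key = (logical_name, instance, version)
        if key in catalog:
            raise ValueError("a given logical file can have only one instance per location")
        catalog[key] = path

    register("higgs001.root", location_id=1, path="/castor/cern.ch/higgs001.root", master=True)
    register("higgs001.root", location_id=7, path="/usatlas/bnl.gov/higgs001.root")  # replica, instance = 7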

POOL

The POOL project[30] is the common persistence framework for the LHC experiments, used to store petabytes of experiment data and metadata in a distributed and grid-enabled way. POOL is a hybrid event store consisting of a data streaming layer and a relational layer.

The very large volume of data produced by the experiments - some hundred petabytes over their lifetime - requires that traditional approaches based on explicit file handling by the end user be reviewed. Furthermore, the long lifetime of the LHC project requires an increasing focus on maintainability and change management for the experiment computing models and core software such as data handling.

POOL is a set of service APIs, often exposed via abstract component interfaces, that isolates experiment framework user code from the details of a particular implementation technology. As a result, POOL user code does not depend on implementation APIs or header files, nor do POOL applications depend directly on the implementation libraries.

Even though POOL implements object streaming via ROOT I/O and uses MySQL as a back-end database, this implementation is loaded at run time using the SEAL[31] plug-in infrastructure. Therefore, it is guaranteed that changes to back-end systems are largely contained inside POOL and do not spread to client applications. This open architecture is an important factor.

POOL follows a hybrid technology approach, combining two main technologies into a single consistent API and storage system. The first is object streaming, to deal with the persistence of complex C++ objects such as event data; this data is usually written once and read many times, so concurrent access can be handled as read-only. The second is Relational Database services, providing distributed, transactionally consistent, concurrent access to data which can be updated. Using this hybrid approach, users are able to choose the most suitable storage implementation for different data types and use cases.

Navigational access to individual data objects is possible regardless of how the store is distributed. References between objects are transparently resolved: an object a user refers to can be brought into the user's application memory automatically by POOL as required. References can connect objects in the same file, span multiple files, and even cross technology boundaries (grid and non-grid environments). Physical details such as file name, host name and storage flavor are not exposed to the user code.
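
The navigational idea can be illustrated, outside POOL itself (which is C++), with a toy reference object that resolves itself through a store only when dereferenced; the token format and store below are invented stand-ins for POOL's FileID/container machinery:

    class Ref:
        """A persistent reference: holds a token and loads the object only on access."""
        def __init__(self, store, token):
            self._store, self._token, self._cached = store, token, None

        def get(self):
            if self._cached is None:                      # transparent, lazy resolution
                file_id, container, entry = self._token
                self._cached = self._store.read(file_id, container, entry)
            return self._cached

    class ToyStore:
        """Stand-in for the storage service: maps (file_id, container, entry) to objects."""
        def __init__(self, data):
            self._data = data
        def read(self, file_id, container, entry):
            return self._data[(file_id, container, entry)]

    store = ToyStore({("guid-42", "Events", 0): {"pt": 41.7}})
    track_ref = Ref(store, ("guid-42", "Events", 0))
    print(track_ref.get()["pt"])   # the object is fetched from the store only here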

POOL is organized into three major packages: the Storage, File Catalog and Collection component services.

The storage hierarchy exposed by POOL consists of several layers, each dealing with more granular objects. These layers are the POOL Context, FileCatalog, Database, Container and Object. Each POOL database (a new entry in the FileCatalog) has a well defined major storage technology; this can be, for example, ROOT I/O or an RDBMS.


The File Catalog[32] keeps track of all existing POOL databases and resolves file references into physical file names, so that lower-level components like the storage service can access their contents. The File Catalog is capable of interacting with the EDG Replica Location Service, making POOL grid-aware; it can also use MySQL or XML databases. Files have a unique, immutable identifier assigned at creation time. This FileID is based on Globally Unique Identifiers (GUIDs), so that global uniqueness of file names is guaranteed regardless of the conditions under which the ID was assigned.

The storage technology information from the File Catalog is used to dispatch any read or write operation to a particular storage manager. The storage manager is responsible for translating (streaming) any transient user object into a persistent storage representation suitable for reconstructing an object in the same state later. The task of mapping individual object data members and the concrete type of the object relies on the LCG Object Dictionary component developed by the SEAL project[33]. For each persistent class this dictionary provides detailed information about the internal data layout, which is used by the storage service to configure the particular back-end technology (e.g. ROOT I/O) to perform I/O operations. An RDBMS-based store is also available; the POOL interface hides this from the user.

POOL also supports collections, so that users can maintain large-scale objects (such as event collections). POOL collections can be extended to include metadata support (simple attribute-value pairs only), allowing user queries that select only the collection elements which fulfill a query expression. Collections can be defined explicitly (by adding each contained object) or implicitly, by referring to all objects in a given list of databases or containers. Since all collections adhere to a common component interface, an end user can easily switch from using ROOT trees to database implementations (for better distributed access and server-side query evaluation).

LCG POOL provides transparent cross-file and cross-technology object navigation via C++ smart pointers. At the same time, it is integrated with the EDG-RLS catalog. POOL will eventually become a major technology for data management within ATLAS. For DC-2 it is still unclear whether POOL will have a major role; it currently seems that an all-POOL approach is premature at this moment, since it would require "converting" all existing data to POOL and producing all new data sets within POOL. Nevertheless, any system for data management on an ATLAS scale must be able to talk to POOL and integrate tightly with it and its underlying concepts.


Chapter 3

Data Management Tools for the ATLAS Experiment

3.1 The “Data Management Problem”

The first phase was identifying precisely, in terms of use cases and requirements, what the "Data Management Problem" of the ATLAS Experiment was.

The following requirements were therefore identified for an ATLAS Data Management System (DMS):

• For the moment, given the lack of experience with grid technologies and the early phase of preparation for DC-2, there is only a very high-level idea of what the DMS is supposed to do. It is therefore likely that some requirements may change until early 2004.

• The DMS must manage all ATLAS data files: originals and replicas

• One of its components will be a replica catalog. The DMS must be able to answer the question "where are the physical copies of this logical file?" (a minimal interface sketch covering this and the other operations in this list is given after the list)

• Additionally, it should be able to get a copy of a logical file without registering it as a replica, with an integrated validation process to analyze log files and, if possible, recover from errors automatically

• Move the original or replicas of a file to another location, independently of the technology boundaries between locations

• Rename a logical file and all of its physical instances, since during DC-2 a file is registered with a temporary name and is only renamed to its final name after validation has been done

• The DMS should also execute bulk data transfers, i.e. replicate a complete dataset to a certain storage facility

• Eventually the DMS should become continuously active, trying to optimize data distribution. At the moment it is unclear what parameters will be used to define an optimal distribution


• The mechanism that performs file movement should be logically separate from the other components in the DMS. The system can then gradually evolve from one where a human defines what transfers to do, to one where a different component automatically defines those transfers based on some declarative strategy definition and the current situation of the various storage facilities

• Integration with grid and legacy storage facilities must be provided. GridFTP could be used as an umbrella for all systems since it can already act as a front-end to legacy storage facilities

• It is unclear under which security accounts the DMS should run, or whether it should run in multiple accounts - one per user. Write access everywhere is also being considered. It seems enough for DC-2 to have the DMS run in a limited number of accounts, such as production system accounts (which should be on the order of 3 or 4 user accounts)

• Support for collections of logical files is highly desirable. Strong metadata support could be provided as an alternative, although the issue with metadata is not so much how it is stored but how it is accessed, since a system without collection support would always require a search over all logical files, which can be a very expensive operation

• The ability to link logical collections and logical files, similarly to the Unix file system, would be desirable. This should allow the creation of either soft or hard links. How these links would behave if some logical files/collections were deleted is not certain. Nevertheless, the consistency of the system is crucial

• According to some tests performed in DC-1 using Magda, nominal requirements for ATLAS are 10 M logical files with 10 physical instances per logical file, for 100 M total catalog entries.

• Not only technological choices influence the chosen system. ATLAS is a world-wide collaboration and consensus must be achieved: rebuilding systems from scratch and asking for world-wide deployment is very difficult. Existing and proven components should somehow be used. Simple installation methods and independent systems - that do not affect existing components - should also be provided

• Some level of integration with, or awareness of, POOL is desirable. POOL is likely to become the preferred method of data management but, given the changes it would imply in the entire data management structure - physical files would no longer be specified, just objects - it is extremely unlikely that it will be considered as the single data management system in the timescale of DC-2
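
To make the requirements above concrete, the following sketch outlines the minimal operations a DMS front end would have to expose. It is purely illustrative; none of these class or method names are the eventual DMS API:

    class DataManagementSystem:
        """Skeleton of the operations required of the ATLAS DMS (illustrative only)."""

        def __init__(self):
            self.catalog = {}          # logical file name -> list of physical replicas

        def locate(self, lfn):
            """'Where are the physical copies of this logical file?'"""
            return self.catalog.get(lfn, [])

        def register_replica(self, lfn, physical_name):
            self.catalog.setdefault(lfn, []).append(physical_name)

        def rename(self, old_lfn, new_lfn):
            """Rename a logical file (e.g. temporary -> final name after validation)."""
            self.catalog[new_lfn] = self.catalog.pop(old_lfn)

        def replicate_dataset(self, lfns, destination):
            """Bulk transfer: replicate a whole dataset to one storage facility."""
            for lfn in lfns:
                for source in self.locate(lfn):
                    self._transfer(source, destination)   # e.g. via GridFTP
                    self.register_replica(lfn, f"{destination}/{lfn}")
                    break   # one source replica is enough

        def _transfer(self, source, destination):
            pass  # placeholder: the movement mechanism is deliberately a separate component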

Several alternative solutions to the data management problem have been presented. They can be split into two main groups: a peer to peer approach and a grid computing based approach. Some of these proposed systems were further developed and prototypes were built, as described in the next chapter. Nevertheless, there was no single obvious solution to this problem, so the following sections give a description of all the systems that were considered.


3.1.1 Peer to peer-based system

The problem of data management and distributed computing can be thought of as having two main levels of granularity.

Resources can be shared with coarse granularity, meaning that all sharing and data management is done at the cluster level. How these clusters work internally is hidden from the outside. Users think in terms of sites, such as CERN or BNL, and never in terms of individual machines. Each site has only a few Storage Elements. This is the case of the ATLAS Tier distribution. Some sites also have special characteristics such as Castor or HPSS tape backup systems. These systems offer theoretically unlimited disk space and provide a strong level of fault tolerance, at the expense of performance.

A different approach, and the one considered in this section, is thinking of resource sharing at a much smaller level of granularity. Data resources are no longer a few Storage Elements; every individual machine plays an equally important role in the system. Sites can still be distinguished in the sense of the individual machines they contain, but this site separation is no longer relevant or structural to the system. Each PC can act as a storage, computing or monitoring element.

Peer to peer systems fit nicely into this approach. Creating a peer to peer system for a coarse-granularity environment would not be interesting, in the sense that there would exist only a very small number of such peers: no advantage of a robust peer to peer environment would emerge with such a small number of nodes. Actually, establishing a peer to peer network with very few nodes could easily render the network unusable. The fault tolerance of peer to peer systems comes mostly from the fact that there are at least thousands of peers working on the network at the same time.

With finer granularity, the experience of existing peer to peer systems shows a high degree of fault tolerance: more available resources mean higher redundancy. It also introduces a different set of problems, given the fact that there is no central knowledge base. Peer to peer systems tend to be totally decentralized by design. Grid computing systems also claim that decentralization, but in practice they tend to be centralized around some key services and hosts (usually the information services).

One of the major requirements of the data management system is to be able to handle significant amounts of data in a distributed environment with a high degree of reliability. Peer to peer fault tolerance easily complements this. Another requirement would be interfacing to legacy systems. Here, a peer to peer approach would be problematic, so the most natural solution would be to gradually move data out of those legacy systems. Otherwise, there would be a few key nodes (tape servers) with huge amounts of data and lots of simultaneous requests. The major goal of a peer to peer system is to distribute data in a more "democratic" way and avoid having central data repositories.

Within a peer to peer system, movement of data would be avoided. Jobs would ideally be sent to the computer where the necessary data is already stored. This means that jobs themselves must also have a finer level of granularity. A job definition could not be written in terms of a multi-terabyte job, since this would require that huge amount of data to be available on the worker nodes. In grid environments this is typically the case, since every worker node has the storage element contents mounted (via NFS, for instance). Here, in peer to peer, a job definition has to be written in terms of a multi-megabyte data input. Job splitting techniques are crucial for the success of these systems.


Adaptive Grid

During part of the training, a peer to peer architecture was planned. This resulted in a work proposal that was later rejected by the ATLAS Data Challenges team, as it would not be feasible on the timescale of DC-2. Also, it represented a major change in some of the concepts currently in use by ATLAS.

The proposed architecture in Figure 3.1 was a peer to peer system. In terms of components it can be split into: worker node, aggregator node, root node and user interface.

Figure 3.1: Adaptive grid step-by-step (three panels showing the Aggregator Nodes, Worker Nodes and End User Machine at the 1st, 2nd and 3rd steps)

• Worker: The worker node is responsible for performing a service. A service might be a computation in terms of job execution, or providing data in the case of data management.

• Aggregator: The aggregator node is responsible for issuing service requests to the worker nodes. It is also responsible for aggregating the results and presenting them back to whomever requested them. When a request is made - whether it is a computation or finding who has a given file - this node collects all the results from the worker nodes it knows of, aggregates them all and presents them back. There are some nodes, typically user interface nodes, that make requests and consume answers from the aggregators; they then receive compiled (joint) information with the answer.

• Root: The root node is a starting point for establishing the peer to peer network. It acts as the initial rendezvous point. All new nodes, regardless of whether they are workers, aggregators, user interfaces, or even have several of those roles, contact the root node(s) (there can be a few of them) to learn the contact locations of other peer nodes.

• User Interface: The end user interacts with the system via the user interface, which hides the inner mechanisms of the peer to peer network. All interaction between the user interface and the aggregators for performing the initial requests is therefore hidden from the user. The user only chooses higher-level options, such as which group he would like to perform a given action on. From there, the user interface contacts the aggregators, and the aggregated information is presented back to the user.

In a typical file transfer use case, the system would act as follows (a small sketch of the aggregator fan-out in steps 4 to 7 is given after the list):

1. A user wants to search for given data, whose logical name is 'higgs001'. The user starts the User Interface process on his local machine and issues a logical name search request

2. The User Interface process contacts one of the root nodes, whose address was built in during the user interface installation. The root node passes back to the user interface a list of aggregator nodes that the user interface can use

3. The User Interface, according to each aggregator's characteristics (the services they provide and groups they serve) and end-user options (e.g. which groups to use in the query), issues the query for the logical file to several aggregators

4. Each aggregator will receive the request and forward it to all worker nodes with the data management service, since searching for a logical file is a data management service

5. Each worker node will search for the logical file and, if it contains it (or part of it), return that information to the aggregator node the request originated from

6. The aggregator will then aggregate all answers received from all worker nodes

7. That joint information is passed back to the user interface

8. The user interface process, according to the end user's choice, acts on the received information. It may start a parallel data transfer of the logical file. This transfer can happen concurrently from all worker nodes which contain the logical file - that is, a part of the logical file is copied from each worker node
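
Steps 4 to 7 amount to a scatter/gather over the worker nodes. A minimal sketch of that aggregation step, with plain function calls standing in for the SOAP requests of the proposal and invented node names:

    # Each worker exposes a data-management query; here workers are plain dicts.
    workers = {
        "worker1.example.org": {"higgs001": ["part-00"]},
        "worker2.example.org": {"higgs001": ["part-01"], "zmumu007": ["part-00"]},
    }

    def query_worker(worker, lfn):
        """Stand-in for the SOAP call a real aggregator would make to one worker."""
        return workers[worker].get(lfn, [])

    def aggregate(lfn):
        """Fan the request out to all known workers and join the answers (steps 4-6)."""
        answers = {}
        for worker in workers:
            parts = query_worker(worker, lfn)
            if parts:
                answers[worker] = parts
        return answers     # step 7: this joint result goes back to the user interface

    print(aggregate("higgs001"))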

All communication within the system is performed using SOAP web services. The choice of SOAP came from the high interoperability it provides, the wide range of toolkits available, the trainee's familiarity with SOAP technology and, mostly, because in a peer to peer network it is fairly easy to forward SOAP requests through NATs, firewalls, etc.

The use of web services also allows easy development and integration of new services. The description above covered a search and transfer of a logical file, but many different services can be implemented using the concepts of User Interface, Aggregators and Workers; web services ease that integration.

The performance penalty resulting from the use of web services is negligible in a wide-area, distributed network system; the interoperability gains are more important.


One important aspect of this peer to peer architecture is the tight integration with metadata. All information - the root node's list of known nodes, the services each node has (e.g. data management), the list of aggregators the user interface contacts, and data publication in the case of data management (i.e. which logical and physical files each node has) - is published using XML/RDF. XML/RDF was chosen because it is an official standard for this type of information definition and storage. Using XML/RDF allows future developments to be implemented more easily; one example is network spiders or agents capable of semantically understanding RDF.

There is one RDF file to represent a network node in terms of its role (aggregator, root, user interface) and of the services it has installed (data management, job processing, ...). There are also additional RDF files describing information specific to the services a node has - e.g. a worker node data management RDF file exists to describe metadata for the data management service on a peer node with the worker role.

This architecture covers only the basic foundation of a peer to peer network. Future developments of this system would provide additional and more interesting features, some of which could actively exploit the peer to peer basis. Some aspects that could be developed are:

• Caching of frequent requests and frequently aggregated answers

• Semantic web to perform tasks such as automatic data replication (to leverage data between nodes) by reading the published RDF files from each node

• Moving RDF files to a database back-end for faster access

• Web portal as user interface

• Use of proven Grid technologies such as GridFTP for multiple-stream data transfer and GSI for security

• Interface to legacy systems such as Castor and HPSS

• Platform-neutral implementation

The presented peer to peer architecture was not implemented by the trainee. It proposed a rather revolutionary system that could not realistically be implemented within the time frame of ATLAS DC-2. Therefore, a different alternative had to be devised and developed; this is presented in the following section.

3.1.2 Grid-based system

ATLAS DC-2 needs to have a production system ready on a very short timescale. LCG-1, which is responsible for deploying the initial infrastructure for these data challenges, was delayed by three months¹. Given this delay of LCG, alternatives were considered.

¹ LCG-1 provides an initial implementation on which the experiments can start building their software for the Data Challenges. Nevertheless, it is the second version of LCG, LCG-2, which will be used by the experiments for real production. LCG-2 should not be very different from LCG-1 (it is not absolutely clear yet what the differences might be), but should only include some additional components.


The LHC Computing Grid is supposed to provide the entire infrastructure on which the experiments' software is built and run. This is not exactly the case since, for instance, the US Grid tends to use its own resources and not LCG. Also, NorduGrid has its own resources and middleware software, which does not rely on LCG.

The Global Grid Forum (GGF) is an organization equivalent to the W3C; its mission is to publish standards (recommendations). The timescale on which the GGF works is incompatible with that of ATLAS DC-2 - and perhaps even with that of the ATLAS experiment for the next couple of years. Otherwise, if GGF recommendations were adopted by all the grid middleware partners of ATLAS, it would be much easier to integrate the different components and implementations.

The experiments will always need to adjust the middleware that is given to them. LCG focuses on some aspects, such as interconnecting different Grid technologies - a problem common to all LHC experiments - or a generic data management mechanism, but always on an LHC timescale, which is broader and more flexible than the experiments' requirements.

Since ATLAS DC-2 must be running and in production by April 2004, several alternatives were considered, some of which were prototyped. These are analyzed in detail in the next section, in some cases including implementation details (when implementations were done). The alternatives considered were:

1. EDG: Using the existing European DataGrid Replica Manager as the basis for the world-wide ATLAS data management system

2. SRB: Storage Resource Broker. An existing data management system developed by the San Diego Supercomputing Center

3. DMS: Another alternative was to develop from scratch a Data Management System specific to ATLAS

4. NorduGrid: Use the existing NorduGrid replica management service as the basis for the world-wide ATLAS data management system

5. LCG-1: Use data management services provided by LCG-1

While most of the previous alternatives would help solve the problem, none was a fully complete solution except for SRB (an already existing, complete and functional system) and the DMS (which would be developed by the ATLAS DC team to fit its requirements).

European DataGrid

The EDG Replica Manager[34] is made up of several components that would provide a fairly complete data management system for ATLAS. These are the Replica Location Service and the Replica Metadata Catalog[35]. Some important concepts in the EDG Replica Manager middleware are:

• UUID Universally Unique IDentifier: A UUID is a 128-bit number, and is either guaranteed to be different from all other UUIDs generated until 3400 A.D. or extremely likely to be different (depending on the mechanism chosen to generate it)


• SFN Storage File Name: An SFN is a locator for a physical file, where the scheme-specific part is understood by a Storage Resource Manager (SRM). It is a URL where the scheme is 'sfn' and the host is a valid SRM host.

• GUID Grid Unique IDentifier: A UUID generated by the Replica Management System for an SFN. It is created at SFN registration time. A GUID is immutable.

• LFN Logical File Name: A Logical File Name is a user-defined alias for a GUID. Unlike GUIDs, aliases are mutable, but they should still be globally unique. Since the Replica Management System has no control over the creation of LFNs, this global uniqueness is only weakly enforced.

• LRC Local Replica Catalog: The LRC stores GUID to SFN mappings, along with SFN attributes, for either a single site or a single Storage Resource Manager at a single site. Only GUID to SFN mappings for those SFNs physically located at the same site are stored in the LRC. An LRC is indexed by an RLI.

• RLI Replica Location Index: An RLI indexes all the LRCs that subscribe to it. The RLI stores GUID to LRC mappings, thus maintaining a catalog of all LRCs that store at least one physical replica of a given GUID. This provides the link between different LRCs, allowing for distributed indexing and querying of all LRCs.

• RLS Replica Location Service: The RLS is a system that maintains information about the physical locations of logical identifiers of data and provides access to this information. It consists of two internal components, a Local Replica Catalog (LRC) and a Replica Location Index (RLI).

• RMC Replication Metadata Catalog: The EDG RMC is a system that maintains LFNaliases to GUID mappings, as well as attributes on GUIDs and LFNs.

Together with the Replica Location Index[36] (RLI), the Local Replica Catalog[37] (LRC) forms the EDG Replica Location Service (RLS). The LRC maintains independent and consistent information about replicas at a single site. There may be one LRC that maintains information about replicas in one or more SRMs at a site or, at larger sites, one LRC may be deployed per SRM. The RLIs hold collective information about replicas distributed across many LRCs.

For the EDG testbed 2.0 release, which was adopted by LCG-1 (LCG-2 will also adopt EDG, perhaps using some components from version 2.0 and others from the newly released 2.1), only the LRC component has been deployed. The RLIs will probably be deployed in the near future. LFN to SFN mappings are many-to-many mappings, and the GUIDs provide the unique identification of all data files. The EDG LRC maintains information about the physical replicas themselves, storing mappings between SFNs and GUIDs, as well as any attributes or metadata on the physical files themselves. Applications and users are free to define LFN aliases to GUIDs. These aliases are stored in the EDG RMC along with, e.g., information about Logical Collection Names (LCNs) as well as GUID metadata such as file size, owner and ACLs, which are required for successful data file replication. All these aliases are user-defined.


Figure 3.2: EDG WP2-centric view

In the case of ATLAS, the EDG Local Replica Catalog and Metadata Catalog would be deployed worldwide: there would be only one LRC and one RMC, which all worker nodes would contact to locate replicas. This means EDG client software would have to be deployed on all those machines (including US Grid and NorduGrid nodes).

In terms of functionality, WP2 is fairly complete. It maps storage file names to physical file names (using the LRC) and to logical file names (using the RMC). One lacking feature is support for collections. In EDG WP2, collections are not provided, but since metadata attributes can be extended, that support could theoretically be built on top. (This is only partially true: the purpose of collections is not only to divide and organize information in semantic terms, but also to speed up querying. With collections, each query requires some initial steps to "navigate" through the collections, but afterwards only the contents of a particular collection have to be searched; without collection support, all contents have to be queried each time, which can be a very expensive operation if the amount of content is high.) There are C++ and Java client APIs for accessing and manipulating the EDG WP2 middleware. It also includes a Replica Optimization Service[38] (ROS), which was not adopted by LCG-1. The current version of the ROS includes only very primitive functionality: a single "get network cost" function.

The core client-side calls provided by the EDG Replica Manager corresponding to the DMS requirements are:

• copyAndRegisterFile Put a (local) file into Grid Storage and register it in the Catalog

• registerFile Register a file that is already on a Grid-aware store. It returns the GUID with which the file was registered. Optionally an LFN may be given as well.

• replicateFile Replicate a file to another SE.

• copyFile Copy a file to a non-grid destination.


• addAlias Add a new alias to GUID mapping

• removeAlias Remove an alias LFN from a known GUID

• getBestFile Make a file available on local storage (or on the store specified by a command line option).

Other functionality is also provided, such as deleteFile, registerGUID, unregisterFile, listReplicas, listBestFile, getAccessCost and list, as well as Replica Metadata Catalog functions to add attribute definitions and attribute values.
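To illustrate how these operations chain together in a typical production step, the sketch below puts a newly produced file onto a Storage Element, gives it an LFN alias, replicates it, and later fetches the best replica by logical name. It is only an illustration: the edg_rm_* functions are hypothetical stand-ins (implemented here as stubs that print what they would do), not the real EDG C++/Java client API, and the Storage Element host names are made up.

#include <stdio.h>

/* Hypothetical stand-ins for the EDG Replica Manager operations listed
 * above; the real client API is C++/Java and its signatures differ. */
static const char *edg_rm_copyAndRegisterFile(const char *local, const char *se)
{
    printf("copyAndRegisterFile: %s -> %s\n", local, se);
    return "38A5C2F0-0000-0000-0000-000000000000";   /* GUID returned by the catalog */
}
static void edg_rm_addAlias(const char *guid, const char *lfn)
{
    printf("addAlias: %s -> %s\n", lfn, guid);
}
static void edg_rm_replicateFile(const char *guid, const char *dest_se)
{
    printf("replicateFile: %s -> %s\n", guid, dest_se);
}
static void edg_rm_getBestFile(const char *lfn, const char *local_dest)
{
    printf("getBestFile: %s -> %s\n", lfn, local_dest);
}

int main(void)
{
    /* 1. put a locally produced file onto a Storage Element and register it */
    const char *guid = edg_rm_copyAndRegisterFile("event1.dat", "lxshare0211.cern.ch");
    /* 2. attach a user-defined logical file name to the new GUID */
    edg_rm_addAlias(guid, "dc2/event1.dat");
    /* 3. create a second replica on another Storage Element */
    edg_rm_replicateFile(guid, "se1.nbi.dk");
    /* 4. later, any job can fetch the "best" replica by logical name */
    edg_rm_getBestFile("dc2/event1.dat", "/tmp/event1.dat");
    return 0;
}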

Although EDG WP2 would result in a fairly complete data management system, there are some issues with it that have to be considered:

• It has never been used in a production environment, so its stability has not really been proven

• EDG Data Management is a component of the entire EDG grid project, so it is not trivial to detach this component from the whole system. It can be done, and was done, but there is a significant overhead on the client installation: the trainee performed this installation, which required 34 RPMs and several steps of manual configuration of a local information system. EDG needs a heavy set of client libraries and a complex installation. It is not simple to install the EDG replica manager client libraries on all worker nodes of all grid and non-grid machines. Also, grid machines may already include other grid flavors which rely on different versions of the Globus Toolkit; these versions can conflict, and maintenance of the machines in particular becomes a nightmare

• Support for collections is not built in, so it would have to be built on top. Still, this would result in adding collection support under some metadata schema, which is not an ideal solution, since every single meta-query must go through the entire list of available files

• The EDG Replica Optimization Service, which is of interest to ATLAS in the near future, is still a very simple component, considering parameters such as network latency. Optimization choices for replica movement strategies within ATLAS are much more complex than those

There are several advantages to using EDG as the ATLAS DMS. For one, it would already be almost completely built: only a metadata schema providing support for ATLAS-specific metadata, and perhaps support for logical collections, would have to be designed. Strong security support (Globus GSI via EDG Java Security) is also provided. All the basic replica management functionality is covered by the EDG Replica Manager.

Therefore, although using EDG would respond to most of our requirements, it is not, in practical terms, a viable choice. Several RPMs have to be installed for the EDG data management services, since these rely on other EDG components. Deploying such a system world-wide would not be very well accepted by other ATLAS partners. It might also conflict with other existing middleware on those machines.


Storage Resource Broker

The Storage Resource Broker[39] (SRB) is a client-server based middleware developed by the San Diego Supercomputing Center as a way to provide uniform access to different types of storage devices. SRB provides a uniform API to connect to heterogeneous resources, which may be distributed and whose data may be replicated.

Figure 3.3: Storage Resource Broker

SRB presents itself as an alternative that allows users to manage data storage and replication across a wide range of physical storage system types. SRB is one of the few currently available tools that has been proven to work reliably, and it includes a comprehensive set of features. It consists of two major components: the core SRB, which interfaces to the storage devices, and the MCAT (Metadata Catalog), which holds all the metadata elements[40]. The design is modular and provides a web service based interface, so several different platforms and even authentication methods are available.

The MCAT database is a metadata repository. It includes both internal system data required by the SRB core and application data regarding the data sets being managed by SRB. SRB makes a clear distinction between its own internal metadata and the user's. At least one SRB server must be installed on the node hosting the MCAT database.

MCAT provides the following operations:

• stores metadata about data sets, users, resources and proxy methods

• maintains replica information for data and containers

• provides support for collection abstraction

• provides a global user name space and authentication


• maintains audit trail on data and abstract collections

• maintains metadata for methods and resources

• provides resource transparency

The core SRB server is a middleware component accepting requests from clients and providing the necessary data sets. It queries the MCAT database to gather information on datasets and supplies this information back to the SRB client. SRB servers can work in federated mode, which means one request to an SRB server can be forwarded to another SRB server that is capable of answering it. There are several SRB clients, such as command line applications and even graphical user interfaces.

The client-side command line tools corresponding to the most important DMS requirements are:

• Scd Change the working SRB collection

• Smkdir Make new SRB collection

• Scp Copies an object to a new directory within SRB

• Smv Move collections or objects within SRB space

• Sput Import one or more local files and/or directories into SRB space

• Sget Export one or more objects from SRB into the local file system

• Sreplica Replicate an SRB object

• Sphymove Physically move a file from one SRB storage system to another

To take advantage of SRB and fulfill the requirements of ATLAS to a greater extent, a set of simple command line tools was developed. These tools, each represented by a single binary executable, expose several of SRB's core functionalities. SRB already includes a set of command line tools in the default installation, but these are rather simple and had to be extended.

SRB does not by itself support logical file names: it uses the concept of physical file names and physical collections. Still, as was said, MCAT allows metadata to be associated with files. Therefore, extending SRB to include logical file name support was just a matter of adding metadata attributes and combining it all in a single executable ready to be used by all DC-2 nodes.

This was the only major change made to SRB by ATLAS during the training. The rest of the work was a matter of installing several SRB servers and one MCAT database and making them work together. Castor support for SRB was provided by other groups at CERN already using SRB and was also tested.

The metadata schema defined for storing logical file names was extremely simple. Following SRB's MCAT representation, the following was defined for each registered physical file:

• Row 0 of a physical file metadata

– UDSMD0 = ’logical name’

– UDSMD1 = ’<LFN>’


Given the previous metadata attributes, it becomes possible to store a logical file name associated with an SRB file. Since SRB natively supports replicas, it is guaranteed that the metadata is the same for all replicas, so any replica can be accessed by searching the metadata attributes.

The tools that were developed are atlas-get, atlas-put and atlas-ren, which correspond to the most basic DMS requirements. They naturally support logical file names, as required by ATLAS. The tools work as follows:

• atlas-get This tool receives as a command line argument the logical file name to retrieve. It connects to SRB using the srbConnect API function and searches the metadata attributes using srbGetDataDirInfo, specifying UDSMD0 as 'logical name' and UDSMD1 as the logical file name to search for. A structure is returned and the data is copied to local disk using the API functions srbObjOpen and srbObjRead. Finally, srbObjClose is called and the file is stored on disk using standard C I/O calls. (A sketch of this sequence follows the list.)

• atlas-put This tool starts by connecting to SRB, using srbConnect. It then creates a new SRB object using srbObjCreate. This object is filled using srbObjWrite and closed with srbObjClose. Naturally, the contents of the file to put into SRB are read from local disk using C I/O calls (a current limitation of this tool). Afterward, it is just a matter of setting the SRB metadata attributes. To do that, srbModifyDataset is called twice: the first time to write UDSMD0 = 'logical name' and the second time to store the actual logical file name in UDSMD1.

• atlas-ren This tool simply alters the logical name metadata attribute related to a specific physical file. For that, after connecting to SRB using srbConnect, it uses the function srbModifyDataset to alter the UDSMD1 attribute for the physical file. Since metadata is associated with a physical file and all its replicas, it is only necessary to make this change once.
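The structure of atlas-get can be sketched as follows. The SRB call names (srbConnect, srbGetDataDirInfo, srbObjOpen, srbObjRead, srbObjClose) are the ones mentioned above, but their prototypes here are simplified local stand-ins so that the sketch is self-contained; the real SRB client API takes more arguments (connection structures, MCAT zone, authentication, flags) and would be linked from the SRB client library. The "/home/atlas.demo" collection path used by the stub is made up.

#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the SRB client calls named above; the real
 * prototypes differ and come from the SRB client library. */
typedef struct { int connected; } SrbConn;
static SrbConn *srbConnect(void) { static SrbConn c = {1}; return &c; }
static int srbGetDataDirInfo(SrbConn *c, const char *attr, const char *value,
                             char *objPath, size_t n)
{
    /* query MCAT for the object whose UDSMD0/UDSMD1 metadata match */
    (void)c; (void)attr;
    snprintf(objPath, n, "/home/atlas.demo/%s", value);
    return 0;
}
static int  srbObjOpen (SrbConn *c, const char *path) { (void)c; printf("open %s\n", path); return 3; }
static long srbObjRead (SrbConn *c, int fd, char *buf, long len) { (void)c; (void)fd; memset(buf, 0, (size_t)len); return 0; }
static void srbObjClose(SrbConn *c, int fd) { (void)c; (void)fd; }

/* atlas-get <LFN> <local file>: resolve an LFN through MCAT metadata and
 * copy the corresponding SRB object to local disk. */
int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: atlas-get <LFN> <localfile>\n"); return 1; }

    SrbConn *conn = srbConnect();
    char objPath[256];
    /* UDSMD0 = 'logical name', UDSMD1 = the LFN being looked up */
    if (srbGetDataDirInfo(conn, "logical name", argv[1], objPath, sizeof objPath))
        return 1;

    int fd = srbObjOpen(conn, objPath);
    FILE *out = fopen(argv[2], "wb");
    if (!out) { srbObjClose(conn, fd); return 1; }

    char buf[4096];
    long n;
    while ((n = srbObjRead(conn, fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, out);      /* standard C I/O on local disk */
    fclose(out);
    srbObjClose(conn, fd);
    return 0;
}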

SRB is a fully functional system which would provide the complete functionality needed by ATLAS. There are two major issues with it: it is unclear whether the SRB license will remain free as it currently is, and SRB was built as an independent application, so it is not truly aware of grid environments. It provides GSI security and supports GridFTP file transfers. Still, a grid resource broker does not talk to SRB to locate files and broker jobs in a "smart" way. (A properly integrated job broker and data management system should communicate with one another, perhaps via the information system as in EDG, to decide where exactly a job should go. Only with this sort of communication can a resource broker steer a job to where the data is, and avoid placing it somewhere else, which might require transferring huge amounts of input data before the job can run.) Therefore, considering that ATLAS DC-2 wants to be grid-aware, SRB is not a really good choice. In the context of DC-2, SRB would provide a solution, but a non-grid solution. SRB is instead being considered as a backup solution in case grid-based systems fail to perform as expected.

NorduGrid

NorduGrid is based on Globus Toolkit 2.x. It uses a patched version of the Globus RLS[41] from version 2.0. Its replica manager is therefore a hierarchical system with an LDAP database as the back-end. The patches applied by NorduGrid improve performance and correct some errors and limitations of the Globus implementation of the RLS. Services provided by a replica catalog include:

• Registering a list of files as a logical collection

• Registering the physical location of a complete or partial replica of a logical collection

• Registering information about a particular logical file in a logical collection

• Modifying the contents of registered entries in the catalog

• Responding to queries of the catalog, such as:

– Find all physical locations for a set of logical files in a logical collection

– List all the descriptive attributes associated with a registered logical collection, location or logical file.

The purpose of the replica catalog is to provide mappings between logical names for files or collections and one or more copies of the objects on physical storage systems. The catalog registers three types of entries: logical collections, locations and logical files. Users may associate descriptive <attribute:value> pairs with each entry registered in the catalog.

A logical collection is a user-defined group of files. Aggregating files should reduce both the number of entries in the catalog and the number of catalog manipulation operations required to manage replicas. This is an important feature in the context of ATLAS, since a great number of files is expected to be registered; manipulating one single table with all the files (hundreds of thousands, perhaps millions) would become problematic in terms of performance.

Logical collection entries contain no attributes that provide the information required to map from logical collection and file names to physical locations. Logical collection entries simply provide a mechanism for grouping logical files together. The replica catalog places no restrictions on the list of files specified as a logical collection. The catalog does not even require that any of the logical file names registered as part of a collection exist on a physical storage system.

The logical collection entries in the replica catalog do not contain any attributes that describe the contents of the logical collection or its constituent files. Such descriptive information, or metadata, is assumed to exist in a separate catalog. These attributes are deliberately excluded from the replica catalog, since its sole purpose is to provide mappings between logical names for files and collections and physical storage locations.

Location entries in the replica catalog contain all the information required for mapping a logical collection to a particular physical instance of that collection. The location entry may register information about the physical storage system, such as the host name, port and protocol. In addition, it contains all the information needed to construct a URL that can be used to access particular files in the collection on the corresponding storage system. Each location entry represents a complete or partial copy of a logical collection on a storage system. One location entry corresponds to exactly one physical storage system location. The location entry explicitly lists all files from the logical collection that are stored on the specified physical storage system.

Each logical collection entry may have an arbitrary number of associated location entries, each of which contains a (possibly overlapping) subset of the files in the collection. Using multiple location entries, users can easily register logical collections that span multiple physical storage systems.


The replica catalog includes optional entries that describe individual logical files. Logical files are entities with globally unique names that may have one or more physical instances. The catalog may optionally contain one logical file entry for each logical file in a collection.

Like logical collection entries, logical file entries contain no attributes that describe the contents of the logical files (metadata) or the physical location of the files on storage systems.

Using NorduGrid as the DMS would mean using a version of the Globus Replica Manager which ATLAS has found to be rather stable. The architecture is presented in Figure 3.4. The most interesting component of NorduGrid, the Grid Manager, would not be used, since it is meant to structure a complete grid system and all that is required for ATLAS is the data management component.

Figure 3.4: NorduGrid-based DMS (the Globus Replica Catalog, backed by OpenLDAP, serving worker nodes from LCG-1, US Grid and NorduGrid)

NorduGrid is considered a viable option as the ATLAS Data Management System. It provides all the major functionality in a proven solution. Still, ATLAS would be very demanding on the system. The current NorduGrid replica manager handles a few tens of thousands of files; as the ATLAS DMS it would have to handle on the order of hundreds of thousands of files, possibly more. It is unclear whether an LDAP database would handle this amount of data. LDAP databases provide very fast read access, with some latency in writing. The ATLAS DMS would mostly be a read system, but some complex writing operations need to be performed. Also, it might be interesting for ATLAS to have complex hierarchical organizations with symbolic-like links between collections, rather than a purely hierarchical system (assuming collection support exists at all; otherwise none of this would be required). LDAP does not appear at first to cope very well with this (but the same can be said of relational databases as the back-end).

LCG

• POOL

As described in the previous chapter, the POOL project is the common persistence framework for the LHC experiments, to store petabytes of experiment data and metadata in a distributed and grid-enabled way.



POOL is designed to replace the way all data is accessed by the experiments. Although this will perhaps be the chosen method around 2007, when the detector is online, it is extremely unlikely that full POOL integration occurs in the timescale of DC-2. POOL would require significant changes in how all jobs and data are processed, since details such as the physical file name would become invisible to the users. The current and foreseeable production systems rely directly on such details; hiding all this information from the user, although desirable, is a rather unattainable objective for now.

Nevertheless, some level of POOL integration can be designed and implemented now, perhaps not using all of its components but a significant subset. One such case is the POOL File Catalog[32]. This catalog implements an abstract interface for accessing file catalogs such as the EDG Replica Location Service. It is designed to be able to cross technological borders such as different grid flavors. For now, since only an EDG RLS implementation exists in terms of grid systems, there are no "grid technological borders" that POOL is able to cross.

Therefore, POOL's File Catalog with Metadata support and its Storage Service could be used for now. The Data Cache and Collections will most likely be integrated later with the rest of the ATLAS production. In the timescale of DC-2 it is desirable that POOL's file catalog be able to keep track of all files. Still, it is rather early, and it is unclear exactly how this can technically be done. Figure 3.5 shows in principle how such integration would be done.

Figure 3.5: POOL file catalog-based DMS (the POOL API and File Catalog in front of the EDG Replica Location Service, accessed from US Grid, LCG-1 and NorduGrid)

A prototype was developed using POOL. It took advantage of the POOL File Catalog with its integrated metadata capabilities. The purpose was to test to what extent POOL could be used as a replica catalog. For testing purposes, CERN's Castor tape system was used. A small application capable of reading Castor files and directories (using Castor's Remote File I/O library[42]) was developed. This application created a simple POOL XML-based File Catalog and registered the contents of a Castor directory (and its sub-directories) in POOL. The interesting part about POOL is that a replica catalog, or part of it, can be exported and re-published in another format. Therefore, it is trivial to import an XML-based file catalog into an EDG RLS system, as long as the metadata attributes used by the two are compatible. A POOL XML File Catalog for a Castor directory looks like:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!-- Edited By POOL -->
<!DOCTYPE POOLFILECATALOG SYSTEM "InMemory">
<POOLFILECATALOG>
  <META name="description" type="string"/>
  <File ID="5CF862F4-6802-D811-9D18-00D0B7B8621F">
    <physical>
      <pfn file_status="Pre-registered" filetype="" job_status="1"
           name="/castor/cern.ch/user/m/mbranco/event1.dat"/>
    </physical>
    <logical></logical>
  </File>
  <File ID="EAA2A7F4-6802-D811-9D18-00D0B7B8621F">
    <physical>
      <pfn file_status="Pre-registered" filetype="" job_status="1"
           name="/castor/cern.ch/user/m/mbranco/dir1/readme.txt"/>
    </physical>
    <logical></logical>
  </File>
  <File ID="180CC0F4-6802-D811-9D18-00D0B7B8621F">
    <physical>
      <pfn file_status="Pre-registered" filetype="" job_status="1"
           name="/castor/cern.ch/user/m/mbranco/dir1/subdir1/tmp.dat"/>
    </physical>
    <logical>
      <lfn name="a_logical_file_name"/>
      <lfn name="another_logical_file_name"/>
    </logical>
    <metadata att_name="description" att_value="A sample description"/>
  </File>
  <File ID="10FDD6F4-6802-D811-9D18-00D0B7B8621F">
    <physical>
      <pfn file_status="Pre-registered" filetype="" job_status="1"
           name="/castor/cern.ch/user/m/mbranco/dir2/event_testing.dat"/>
    </physical>
    <logical></logical>
  </File>
</POOLFILECATALOG>

The File ID corresponds to the Global Unique Identifier. The files are in status pre-registered, and only the file tmp.dat has logical file names associated with it, as well as metadata. These simple tests show the potential of POOL. This catalog could be imported into another, possibly larger catalog, with a totally different back-end technology. POOL aims to cross technology borders by providing common interfaces. An excerpt of a larger catalog in XML format, such as the sample above, could also be sent with a job definition, to provide the necessary physical file locations to that job. This would avoid a worker node in charge of executing a job having to contact a replica location service to find out where the input files are: the excerpt of the XML catalog would already provide the job with all the necessary information. This can be very useful for production systems, even considering that a worker node might not have an outbound IP connection; in that case, although any process running there can query the local replica catalog, it could not query another grid's replica catalog. If all these catalogs were previously registered in POOL, the job would then be able to know where each input file is by consulting the excerpt of the POOL XML File Catalog.
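A minimal sketch of the directory-scanning part of this prototype is shown below. For portability of the example it walks a directory tree with the POSIX opendir/readdir/stat calls and prints <File> entries in the layout shown above; the actual prototype used the equivalent Castor RFIO calls from the Remote File I/O library mentioned earlier and registered the entries through the POOL File Catalog rather than printing XML directly, and the sequential File ID printed here is a placeholder for a proper GUID.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Walk a directory tree and print one <File> entry per regular file,
 * in the layout of the POOL XML File Catalog shown above. */
static void scan(const char *path)
{
    DIR *dir = opendir(path);            /* the prototype used rfio_opendir */
    if (!dir)
        return;

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {   /* rfio_readdir in the prototype */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        char full[4096];
        snprintf(full, sizeof full, "%s/%s", path, entry->d_name);

        struct stat st;
        if (stat(full, &st) != 0)        /* rfio_stat in the prototype */
            continue;

        if (S_ISDIR(st.st_mode)) {
            scan(full);                  /* recurse into sub-directories */
        } else {
            static int seq = 0;          /* fake ID; POOL generated real GUIDs */
            printf("  <File ID=\"%08X-0000-0000-0000-000000000000\">\n", ++seq);
            printf("    <physical>\n");
            printf("      <pfn file_status=\"Pre-registered\" filetype=\"\" job_status=\"1\"\n");
            printf("           name=\"%s\"/>\n", full);
            printf("    </physical>\n");
            printf("    <logical></logical>\n");
            printf("  </File>\n");
        }
    }
    closedir(dir);                       /* rfio_closedir in the prototype */
}

int main(int argc, char **argv)
{
    printf("<POOLFILECATALOG>\n");
    scan(argc > 1 ? argv[1] : ".");
    printf("</POOLFILECATALOG>\n");
    return 0;
}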

• EDG RLS and Globus RLS integration

LCG is planning integration between the EDG Replica Location Service and the Globus Replica Location Service. This would mean that LCG-1/2 and US Grid would then be able to talk to one another and locate each other's replicas. Regarding NorduGrid, since it is based on an OpenLDAP server as a back-end, it is relatively easy to integrate it with the previous two catalogs: theoretically, any LDAP client tool would be able to use the replica location data from NorduGrid.

This would mean the existence of three different Replica Location Services, which should either contain the same information or each contain a part of it. Whether it is practically possible to synchronize the three systems is unclear; experience with the reliability of current grid technologies tends to show that it probably is not. A single central logical system is therefore desirable, although in that case the single point of failure may become a problem. Also, the LCG integration of the Globus and EDG RLSs is scheduled to occur only in mid-2004, which is not compatible with the DC-2 timescale. Therefore, while this effort should be taken into consideration, it is not a viable solution for DC-2 world-wide data management.
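To make the "any LDAP client tool" remark above concrete, the sketch below queries a replica catalog with the OpenLDAP C client API. The server URI, base DN and filter are hypothetical (the actual Globus/NorduGrid replica catalog schema and directory layout are not reproduced here); only the client-side mechanics are meant to be illustrative.

#include <ldap.h>
#include <stdio.h>

int main(void)
{
    LDAP *ld;
    /* hypothetical catalog host; the real NorduGrid catalog URI differs */
    if (ldap_initialize(&ld, "ldap://replica-catalog.example.org:389") != LDAP_SUCCESS)
        return 1;

    int version = LDAP_VERSION3;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    /* hypothetical base DN and filter for one logical collection */
    const char *base   = "lc=dc2.simul,rc=NorduGrid,dc=nordugrid,dc=org";
    const char *filter = "(objectClass=*)";

    LDAPMessage *res;
    int rc = ldap_search_ext_s(ld, base, LDAP_SCOPE_SUBTREE, filter,
                               NULL, 0, NULL, NULL, NULL, 0, &res);
    if (rc == LDAP_SUCCESS) {
        for (LDAPMessage *e = ldap_first_entry(ld, res); e != NULL;
             e = ldap_next_entry(ld, e)) {
            char *dn = ldap_get_dn(ld, e);   /* one entry per collection, location or file */
            printf("%s\n", dn);
            ldap_memfree(dn);
        }
        ldap_msgfree(res);
    }
    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}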

DMS

An obvious alternative for providing an ATLAS Data Management System is to build one from scratch. This possibility was considered, and part of the training consisted of building a working prototype of this system. Later, this prototype proved insufficient and a slightly different system was designed.

In terms of architecture, the DMS would be deployed as (at least logically) a central server. In practical terms, some of its components might later be physically split for load balancing. At the client side, only lightweight client applications would be needed. These would be required to support the communication protocol to the central DMS. In the first prototype of the DMS, these tools also had to support the file transfer protocol, since they would be responsible for transferring files to and from different locations (third-party transfer would also be available). Later, in a revised DMS architecture, this requirement was removed and only query and registration mechanisms to the DMS are necessary.


• First Proposal

The initial DMS central system consisted of a server with a relational database as the back-end. This database would logically be unique, although it might, at a later phase, be distributed physically. There would be a server daemon receiving requests from all clients. These requests would either be queries to the DMS database or actions such as registering files.

Another server daemon would run apart from the core DMS daemon and would perform automatic data transfers. This would be done by analyzing the status of the system (querying the DMS) and, according to a specified set of data transfer tasks (or an expected final state), the transfers would automatically be performed. The DMS was designed to support this mechanism, but the mechanism itself was not designed or developed during the senior training.

The DMS would be capable of talking to the various grid flavors, such as NorduGrid, LCG-1 and US Grid, by having in its central core a common interface and a remote plug-in system capable of interacting with each grid and non-grid flavor. It would be able to perform third-party transfers between all those systems. The DMS would become the only reliable source of replica information: given a replica location or state provided by a grid's local replica catalog or information in the DMS, the DMS is always the reliable source. This nonetheless means that some interactions with the several grids' local replica catalogs would have to be performed to keep the system somewhat consistent; otherwise, a grid's local replica catalog could eventually contain information totally disparate from the DMS, whenever an action executed via the DMS is not "translated" into the corresponding action on the grid's local replica catalog. Without such translation, files registered within and using the DMS would not be visible from the grid's local replica catalog where they are being hosted. This is not crucial for the system to work, as long as all input files are there when a job is executing (regardless of who put them there). It is nonetheless very useful to have files registered in a grid's local replica catalog, since the grid can then take advantage of the resource brokering capabilities provided by its middleware.

To maintain consistency between the DMS and the various replica location services, the DMS would provide an abstract interface to all of them, with different implementations specific to each system. This is done via a plug-in method: a grid plug-in manager (aware of the three different grid flavors via a common interface layer) communicates with a specific plug-in for each flavor to perform functions such as registering files in the local replica catalogs. This plug-in method is remote in the sense that the implementation of the common interface resides on a computer belonging to the foreign grid flavor. Also, part of the client-side tools would know how to interact with their "native" middleware.

A prototype of the DMS was developed and is presented next. The prototype includes all the core functionality expected from an ATLAS data management system, but several components are still missing. It covers the basic operations and was only lightly tested during part of the senior training period. Regardless of the fact that building our own DMS may very well result in a system which is not actually used, it was a worthy exercise and a good way of understanding the problems and limitations of existing systems.

The "Supervisor, Executor or Job" are the expected users of the DMS; they are part of the Automatic Production System for DC-2. They were not considered in detail for the DMS, since it seemed that the DMS should be independent of the rest of the mechanisms for automatic data production.

Figure 3.6: Schema of the initial version of the DMS (the central DMS core with its database, SOAP server, automatic data transfer component and grid plug-in manager; web-service plug-ins and grid-specific DMS client tools for LCG, NorduGrid and US Grid; the common DMS client tools used by the Supervisor, Executor or Job; and the LRC of each grid flavor)

The DMS consisted of a central server with a back-end database. The DMS would contain a replica catalog accessible from anywhere. It would try to keep any local grid catalogs in sync with its contents by calling a remote web service in each grid whenever synchronization is necessary. Another component within the DMS would be responsible for performing automatic data transfers to try to "level" the amount of data, given some previously specified strategy. It would provide support for collections and for metadata associated with logical names. Since it would act as a replica catalog, it would also be able to handle logical file names, physical file names and different access protocols for each physical file. Logging and bookkeeping functionality would be provided internally.

Using a SOAP server would be an easy approach for providing the services that the DMS intends to offer. It would also guarantee that DMS users could easily deploy clients without constraints on which programming languages or environments to use, since SOAP is cross-platform, with several alternative toolkits available.
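As an illustration of this approach, a gSOAP service is normally defined in a plain C header describing the remote calls; soapcpp2 then generates the (de)serialization code, and a CGI binary can simply hand stdin/stdout over to soap_serve(). The operation name and arguments below are hypothetical, chosen only to mirror the kind of lookup the DMS would expose, and the exact names of the generated files may differ slightly from one gSOAP version to another.

/* dms.h - gSOAP service definition (input to: soapcpp2 dms.h).
 * The operation is hypothetical: it mirrors a DMS lookup of the replicas
 * of a logical file. */
//gsoap dms service name: dms
//gsoap dms schema namespace: urn:atlas-dms
struct dms__ReplicaList
{
    char **__ptr;    /* array of TURL strings */
    int    __size;   /* number of URLs        */
};
int dms__listReplicas(char *logicalPath, struct dms__ReplicaList *result);

/* dms_cgi.c - CGI implementation hosted under Apache */
#include "soapH.h"          /* generated by soapcpp2 from dms.h */
#include "dms.nsmap"

int main(void)
{
    struct soap soap;
    soap_init(&soap);
    soap_serve(&soap);       /* as a CGI binary: request on stdin, reply on stdout */
    soap_destroy(&soap);
    soap_end(&soap);
    soap_done(&soap);
    return 0;
}

int dms__listReplicas(struct soap *soap, char *logicalPath,
                      struct dms__ReplicaList *result)
{
    /* in the real service this would query the DMS back-end database */
    (void)logicalPath;
    result->__ptr = (char **)soap_malloc(soap, sizeof(char *));
    result->__ptr[0] = "gsiftp://lxshare0211.cern.ch/tmp/temp.dat";
    result->__size = 1;
    return SOAP_OK;
}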

While developing the system, the web server was Apache, although there are other, better-performing commercial solutions specific to hosting web services (CERN's IT division has experience with these, so they would probably be responsible for helping ATLAS provide a stable solution). The SOAP toolkit used was C/C++ gSOAP, using CGI for publishing the services in Apache. An obvious alternative is Java (the Jakarta Tomcat web server and the Axis toolkit), but this would require every client machine to have Java installed. The back-end database used in the prototype was MySQL. MySQL has several advantages, such as easy development and deployment, but it lacks facilities for distributing the system. Oracle, for instance, can be installed on a cluster system with load balancing, which makes the system far more robust (Oracle is also officially supported by CERN's IT division, which means that if it were used as the back-end database, ATLAS would not need to worry about maintaining it, since it could delegate this task to IT). Since the DMS is meant to be the single world-wide catalog for ATLAS, robustness and scalability are important, and it is not clear whether MySQL would be able to handle it. Nevertheless, although the prototype was built on MySQL, it was designed to be independent of the RDBMS, so other systems can be used later.

Figure 3.7: Relational schema of the DMS database

The relational database schema is shown in Figure 3.7. It implements the core database back-end according to the requirements set by ATLAS. Some important characteristics were set for the system:

– There is a single top collection created with the database (a default ROOT collection named '/')

– Logical sub-collection names must be unique within a collection

– Logical file names must be unique within a collection

– Logical collection or logical file names cannot contain the characters / or % (wildcards are supported in user operations using the character %, and collections using the character /)

– No support for user groups

– Very primitive support for user ACLs (only read/write for logical files and collections)

– User ACLs apply only to logical files and not to physical replicas

– Metadata applies only to logical files; there is no metadata associated with logical collections or with replicas

– All replicas (master copy or not) have well-defined properties, such as MasterCopy and CreationDate. These properties are static (defined in the database schema) and not dynamic, as the metadata for logical files is

– All replicas have at least one, but may have more than one, access protocol defined. A single replica may therefore have, for instance, access by GridFTP (a physical path GridFTP URL) and another access by HTTP (a different physical path HTTP URL)

– File integrity is guaranteed by an MD5 checksum stored in the database and associated with the logical file, guaranteeing that all replicas have exactly the same MD5 checksum

– History bookkeeping is very primitive: actions, which are just textual strings, can be defined per date and per logical name or logical collection

From the user point of view, the DMS is a set of command line tools. Most of the functionality only accesses the DMS DB, via web services, e.g. to create a logical collection. Some of the functionality necessarily has to access the grid middleware. This is possible given the plug-in approach previously mentioned, and it is the main reason why the DMS is a central service and not only a client application, which would be the case if it only had to access a database. Therefore, it is possible to distinguish the following situations for the ATLAS DMS client tools:

– A client tool having to manipulate the DMS DB remotely. This access could be, for instance, to create a new logical collection, which only requires using the DMS DB. The client tool calls the DMS via web services. This action falls into the category of using the DMS DB only.

– A client tool having to manipulate a foreign grid middleware. A client tool running on LCG-1 wishes to put a file in NorduGrid. It contacts the DMS to register the file in NorduGrid, then moves the file to the NorduGrid machine using GridFTP. Meanwhile, the DMS, via its NorduGrid plug-in (just a web service deployed on a NorduGrid machine, implementing a common interface known by the DMS), has contacted NorduGrid's LRC to register the file there, keeping both the DMS Replica Catalog and the NorduGrid catalog consistent. This use case falls into the category of using both the DMS DB and the grid plug-ins provided by the DMS.

– A client tool having to manipulate its local grid middleware. A client tool running on LCG-1 requested from the DMS the location of the replicas of a given file. It calls a client tool which only interacts remotely with the DMS DB (case 1) and retrieves a list of URLs. Afterward, the file is copied from the remote location to the local grid and registered in the local catalog by calling a functionality provided by the local grid middleware (in the case of LCG-1, this would be edg-copyAndRegister with the remote URL to get the data from as an argument).

In Figure 3.8 the dependencies between the major components are shown. The DMS client tools access both the DMS service and, directly, the grid middleware where they are being executed. These two distinct types of DMS client tools will be analyzed next. The Web User Interface, which is a user-friendly way of navigating through the DMS database, also uses the DMS web services. This was done to prevent any direct access to the DMS database: all accesses must go through the DMS web services; the database is not accessible directly. The DMS web services then call the DMS core, which is a library (written in C) containing all the "business logic" of the system. Since the DMS was designed to be independent of the underlying database system, a persistence infrastructure was built. Currently, it consists of a generic interface to access the DMS DB with a single implementation in MySQL. In the case where the remote grid middlewares have to be contacted, the DMS core calls the plug-in manager. This in turn contacts a remote web service hosted on a given grid's storage element. The plug-in manager in the DMS service and the remote grid storage element service both know the common interface to the method calls. Finally, the remote web service contains a library which is able to contact the local grid middleware and perform any requested changes.

Figure 3.8: Component organization for the first version of the DMS
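The persistence infrastructure just described can be pictured as a small table of function pointers: the DMS core programs against a generic interface, and a MySQL module fills it in. The sketch below is illustrative only; the actual structure and function names of the prototype are not reproduced here, the "MySQL" implementation is reduced to printing stubs (the real one used the MySQL C API against the schema of Figure 3.7), and the host name and credentials are made up.

#include <stdio.h>

/* Generic persistence interface used by the DMS core: one implementation
 * per RDBMS can be plugged in (the prototype provided only MySQL). */
typedef struct dms_db_backend {
    int  (*connect)(const char *host, const char *user, const char *password);
    int  (*create_collection)(const char *logical_parent, const char *name);
    int  (*find_replicas)(const char *lfn, char urls[][256], int max);
    void (*disconnect)(void);
} dms_db_backend;

/* Stubbed "MySQL" implementation; the real one used mysql_real_connect,
 * mysql_query, etc. */
static int my_connect(const char *h, const char *u, const char *p)
{ (void)u; (void)p; printf("connect to %s\n", h); return 0; }
static int my_create_collection(const char *parent, const char *name)
{ printf("create collection %s under %s\n", name, parent); return 0; }
static int my_find_replicas(const char *lfn, char urls[][256], int max)
{ (void)lfn; if (max > 0) snprintf(urls[0], 256, "gsiftp://lxshare0211.cern.ch/tmp/temp.dat"); return 1; }
static void my_disconnect(void) { }

static dms_db_backend mysql_backend = {
    my_connect, my_create_collection, my_find_replicas, my_disconnect
};

int main(void)
{
    dms_db_backend *db = &mysql_backend;   /* chosen at build/configuration time */
    db->connect("dmsdb.example.cern.ch", "dms", "secret");
    db->create_collection("/", "temporary files");

    char urls[8][256];
    int n = db->find_replicas("temporary files/sample data", urls, 8);
    for (int i = 0; i < n; i++)
        printf("replica: %s\n", urls[i]);
    db->disconnect();
    return 0;
}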

The prototype was developed in C, since it seems to be the only common language that all grid middlewares and legacy systems support for their APIs. The SOAP server is a CGI binary built using gSOAP C/C++ and hosted (currently) in Apache, although, since it is a CGI binary, it can just as easily be deployed in other web servers.

The set of commands listed next maps all of the ATLAS use cases. They are split into two sections: first the commands which only access the DMS DB, and then a second set of commands which depend on the grid middleware and whose implementation therefore depends on the plug-in module.


– DMS DB tools

The following set of commands interacts only with the DMS database. They are to be used together with the grid-specific tools, to keep the DMS consistent with the actual state of the intervening Storage Elements. These commands do not perform grid operations; they only query or modify the DMS back-end database.

The execution environment for the DMS client tools is provided by reading a file ~/.dmsconfig to get the default username/password used for communication with the DMS. Currently the only supported transport protocol is GridFTP; therefore, when referring to a URL where a file is to be accessed or stored, we are at the moment referring to a GridFTP URL, of the form gsiftp://<hostname>/<path>.
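The key names inside ~/.dmsconfig are not specified here, so the following two-line example is purely hypothetical and shows only the kind of content the tools read from it:

# ~/.dmsconfig (hypothetical layout)
username = mbranco
password = changeme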

Figure 3.9: DMS Web User Interface

A simple web user interface was developed to browse through the catalog. This tool interacts only with the DMS back-end database to query the catalogs. It is shown in Figure 3.9.

– DMS grid-specific client tools

The following set of commands interacts not only with the DMS but also with the grids' Storage Elements where the files are stored. These ATLAS tools have to be aware of their execution environment, because their behavior differs depending on whether they are being used from LCG-1, NorduGrid or US Grid. Also, some of them require actions to be executed on their behalf by the DMS host, to update foreign Storage Elements, since they may not have a direct connection to them and should not need all the libraries necessary to connect to all grid flavors.


Command: dms-register
Description: Adds a new logical file to DMS as a master copy. Also computes the MD5 fingerprint of the physical file and stores it in DMS, associated with the logical file.
Arguments:
  locationCodeName - codename for the location used as Storage Element
  sourceFileName - URL specifying the location of the source file in a Storage Element
  destinationLogicalPath - logical path name specifying the location where the file is to be stored in DMS
Options:
  -na or --newAccess sourceFileName - URL specifying a new source file location in a Storage Element; useful for specifying more than one access protocol per Storage Element
Example: dms-register CERN gridftp://lxshare0211.cern.ch/tmp/temp.dat "temporary files/sample data"

Table 3.1: dms-register

Command: dms-replica
Description: Adds a new logical file to DMS as a replica of an existing logical file. Also checks whether the file is equivalent (MD5 checksum) to the master copy before allowing it to be registered as a replica.
Arguments:
  sourceFileName - URL specifying the location of the source file in a Storage Element
  destinationLogicalPath - logical path name specifying the location where the file is to be stored in DMS
Options: none
Example: dms-replica gridftp://lxshare0213.cern.ch/tmp/temp.dat "temporary files/sample data"

Table 3.2: dms-replica

Command: dms-get
Description: Retrieves the URL or list of URLs of a logical file or a collection of logical files. The list of URLs retrieved is ordered by access protocol.
Arguments:
  logicalPath - logical path specifying the location of logical files in DMS; wildcards may be used
Options:
  -p or --protocol accessProtocol - only return URLs corresponding to this access protocol
Example: dms-get "temporary files/%"

Table 3.3: dms-get


Command: dms-get-best
Description: Equivalent to dms-get, but returns only one possible URL per logical file, by determining the network speed between the current node and the Storage Element and choosing the optimized access for the file characteristics.
Arguments:
  logicalPath - logical path specifying the location of logical files in DMS; wildcards may be used
Options:
  -p or --protocol accessProtocol - only return URLs corresponding to this access protocol
Example: dms-get-best "temporary files/%"

Table 3.4: dms-get-best

Command: dms-cp
Description: Copies logical files within the DMS catalog.
Arguments:
  sourceLogicalName - logical name specifying the source location of logical files in DMS; wildcards may be used
  destinationLogicalPath - logical path name specifying the destination location where the file(s) is(are) to be stored in DMS
Options: none
Example: dms-cp "temporary files/sample data" "final location/my data"

Table 3.5: dms-cp

Command: dms-mv
Description: Moves (or renames) logical files within the DMS catalog.
Arguments:
  sourceLogicalName - logical name specifying the source location of logical files in DMS; wildcards may be used
  destinationLogicalPath - logical path name specifying the destination location where the file(s) is(are) to be stored in DMS
Options: none
Example: dms-mv "temporary files/sample data" "final location/my data"

Table 3.6: dms-mv

Command: dms-ls
Description: Lists the contents of the DMS catalog.
Arguments:
  logicalPath - logical path name specifying the location to list
Options:
  -l or --list - display additional information (metadata, ACLs)
  -p - return the physical location (URL)
  -r - process sub-collections recursively
Example: dms-ls -l "temporary files/"

Table 3.7: dms-ls


Command: dms-rm
Description: Removes logical files from the DMS catalog. Wildcards are supported.
Arguments:
  logicalPath - logical path name specifying the location(s) to remove
Options: none
Example: dms-rm "temporary files/%"

Table 3.8: dms-rm

Command: dms-mkdir
Description: Creates a new logical collection in the DMS catalog.
Arguments:
  logicalParentPath - logical path name where the new collection will be stored
Options: none
Example: dms-mkdir "temporary files/new location/"

Table 3.9: dms-mkdir

Command: dms-rmdir
Description: Removes a logical collection from the DMS catalog.
Arguments:
  logicalPath - logical path name specifying the location to remove
Options: none
Example: dms-rmdir "temporary files/new location/"

Table 3.10: dms-rmdir

Command: dms-meta-insert
Description: Inserts a new metadata value in an attribute belonging to a logical collection or logical file.
Arguments:
  logicalPath - logical path name specifying the location to edit (can be either a collection or a logical file)
  metadataAttribute - metadata attribute to insert the value into
  value - value to insert
Options: none
Example: dms-meta-insert temporary files/ "Description" "This collection just contains temporary files and can be deleted any time"
Returns: unique identifier for the newly inserted metadata value, e.g. ("Description", 1001)

Table 3.11: dms-meta-insert


Command: dms-meta-query
Description: Queries metadata from a file or collection.
Arguments:
  logicalPath - logical path name specifying the location to query (can be either a collection or a logical file; wildcards are supported)
Options:
  -e or --everything - list all attributes and their respective values
  -a or --attribute attributeName - list all values for the given attribute name (SQL-type queries accepted)
  -v or --value value - list all occurrences of the given value (SQL-type queries accepted)
Example: dms-meta-query "my data file" -v "SELECT * WHERE Description LIKE '%Test%'"
Returns: matching metadata values and the unique identifiers for those values

Table 3.12: dms-meta-query

Command: dms-meta-update
Description: Updates metadata of a logical file or of a logical collection.
Arguments:
  logicalPath - logical path name specifying the location to edit (can be either a collection or a logical file)
  uniqueMetadataID - unique identifier referring to the metadata value to update
  newValue - new metadata value
Options: none
Example: dms-meta-update "my data file" 1001 "Some new value"

Table 3.13: dms-meta-update

Command: dms-meta-remove
Description: Removes a metadata value associated with a logical file or with a logical collection.
Arguments:
  logicalPath - logical path name specifying the location to edit (can be either a collection or a logical file)
  uniqueMetadataID - unique identifier referring to the metadata value to remove
Options: none
Example: dms-meta-remove "my data file" 1001

Table 3.14: dms-meta-remove

Command: dms-metaattr-create
Description: Inserts a new metadata attribute for use in DMS.
Arguments:
  name - metadata attribute name
  description - description of the attribute
Options: none
Example: dms-metaattr-create "FileSize" "Stores size of physical file in bytes"

Table 3.15: dms-metaattr-create


Command: dms-metaattr-remove
Description: Removes a metadata attribute from use in DMS.
Arguments:
  name - metadata attribute name
Options: none
Example: dms-metaattr-remove "FileSize"

Table 3.16: dms-metaattr-remove

This service is more complex to develop, especially considering the limitations on how it is possible to interact with remote LRCs and their structure. For instance, can there be several logical files with the same name as long as they are in different collections, and how is this translated to LRCs which do not support collections? How is it possible to distinguish logical files with the same name in different collections? It becomes necessary to manipulate the LRCs' metadata and alter the registered logical file name.

These tools were not developed; only a simple prototype of the atlas-get presented here was implemented. Still, they were defined architecturally and are presented next. Two important concepts should be understood first: native and foreign grid. In the case of a command atlas-get being issued from a job running on a machine that is part of LCG-1, and that command requesting a file from NorduGrid, the native grid is LCG-1 and the foreign grid is NorduGrid.

∗ atlas-put Puts a file to a Storage Element. If the storage element is local (i.e. NFS-mounted) the tool should just copy the file there and call the grid's API function to register it in the local replica catalog. If the storage element is remote but within the native grid, the corresponding grid API function should be called to copy and register the file. If the grid is foreign (and therefore remote), the file is copied using GridFTP and the tool then asks the DMS service to register it in that grid's replica catalog.

∗ atlas-get Gets a copy of a file from a Storage Element to a local computer or any destination that the local grid middleware is capable of putting the file into. This call uses native grid middleware functionality, which usually provides a "get" function. The file to retrieve is usually queried by logical file name first, using the DMS DB client tools, and then one URL from the returned list of physical URLs is used to actually transfer the file. Still, this tool expects physical URLs, so any queries using the DMS DB must first be done by the user.

∗ atlas-get-best Gets the "best" replica of a file from a Storage Element. This tool works similarly to atlas-get, with the difference that an analysis is made of all the possible replicas of the file and the fastest one is chosen for transfer. Therefore, this function first calls the DMS DB client tool dms-get to obtain all existing replicas of a logical file. A network cost function is used to determine which of the copies will arrive faster (a toy version of such a cost function is sketched after this list).

∗ atlas-replicate Allows the actual replication of logical files between storage elements. This is a matter of transferring a chosen physical copy of the logical file from one Storage Element to another. If the destination is a foreign grid, the DMS is called to register the file in the remote storage element grid's catalog. If the destination is the native grid, a native grid function is used to do the registration. The DMS DB client tools are also called to keep the DMS DB in sync with the transfers.

∗ atlas-phymv Moves a file between a native storage element and a foreign one. The DMS service must be contacted to (un)register the file in the foreign storage element. Regarding the local storage element, this tool does the opposite (un)registration itself, since it is capable of contacting the native grid middleware.

∗ atlas-phyrm Removes physical instances of files from a Storage Element. This actually deletes a physical instance of a file. The DMS database is updated accordingly, as is the grid's replica catalog. If the storage element is local, this deletion is done by a call to the native grid middleware; otherwise, the DMS service is called to perform a remote deletion in a foreign grid.

∗ atlas-upload This tool works similarly to atlas-put. The difference is that no Storage Element is given: the tool consults the DMS, which provides a suitable location for placing the file, and the location where the file is put is then returned to the user. It works as a data broker which can make decisions on where to put files (usually the storage element with the "fastest" network connection and enough disk space).
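The "network cost function" used by atlas-get-best (and by dms-get-best above) can be as simple as an estimated transfer time per candidate replica. A toy version is sketched below; the bandwidth and latency figures would in reality come from a monitoring or information system, and here they are simply hard-coded for illustration, as are the replica URLs.

#include <stdio.h>

/* One candidate replica of a logical file, with crude network estimates
 * for the link between the worker node and the replica's Storage Element. */
typedef struct {
    const char *url;
    double bandwidth_mb_s;   /* estimated throughput to this SE          */
    double latency_s;        /* estimated setup cost (connection, tape?) */
} replica;

/* Estimated time to fetch 'size_mb' megabytes from a given replica. */
static double transfer_cost(const replica *r, double size_mb)
{
    return r->latency_s + size_mb / r->bandwidth_mb_s;
}

/* Return the index of the cheapest replica, i.e. what atlas-get-best would pick. */
static int best_replica(const replica *rs, int n, double size_mb)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (transfer_cost(&rs[i], size_mb) < transfer_cost(&rs[best], size_mb))
            best = i;
    return best;
}

int main(void)
{
    replica candidates[] = {
        { "gsiftp://lxshare0211.cern.ch/tmp/temp.dat", 10.0, 0.5 },
        { "gsiftp://se1.nbi.dk/atlas/temp.dat",         2.5, 0.2 },
    };
    int pick = best_replica(candidates, 2, 500.0 /* MB */);
    printf("best replica: %s\n", candidates[pick].url);
    return 0;
}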

As shown, these tools are very complex. The number of possible situations to deal with is quite large, and making sure that the DMS database and each grid replica catalog remain consistent is very difficult. Also, experience has shown that these wide area network systems fail quite often, so a strong logging and error recovery mechanism would have to be implemented. It was at this stage, while designing some of these tools, that the system architecture started to appear extremely complex.

• Revised Proposal

The initial proposal for the DMS was an independent system: it had not been designed to take into account the other components of the Automatic Production System for DC-2 (Figure 3.10). These components are a Supervisor and an Executor process. A Supervisor is a high level process, independent of the grid middleware. It collects jobs to execute from a production database and sends them to an Executor process for submission. The Executor only has to submit jobs but has one important characteristic: there is one different Executor per grid flavor. The Data Management System for DC-2 can rely on these two levels, Supervisor and Executor, to make sure the input files are in place when a job starts running and that the output is registered correctly. Therefore, a revised DMS system was designed.

There is one other factor that wasn’t initially known that contributed to the redesign ofthe DMS: the worker nodes might not have outbound IP-connection9. It would be possi-ble to go around this limitation by designing a gateway system. Nevertheless, it startedto seem clear that the following assumption should be made: a job running on a givenworker node should only contact it’s native grid middleware. Input files might still haveto be transferred from somewhere else, and output files still have to be stored in any of

9Therefore, a job running in a cluster machine cannot call a process which in turn is able to contact the external network to query a centralized DMS.


the available grid flavors. But this could be achieved to some extent by having a different component of the automatic production system put the input files in place first, and by registering output files first in the native grid middleware without having to perform remote registration in another grid's catalog. If output files later have to be moved around between Storage Elements (and possibly grid flavors), another component of the automatic production system could take care of that.

Figure 3.10: Schema of ATLAS DC-2 Automatic Production System

It became clear that having yet another Replica Catalog - the DMS back-end database - would be an added burden. Each grid flavor already includes one. Some non-grid systems such as Castor do not have such a catalog, but one can easily be developed10. Also, with a unique central world-wide catalog it was unclear whether it would scale and handle the required load. There is one added problem. Having a central catalog would tend to make users forget about the grids' local replica catalogs. While these would not really be crucial, it is important for each grid resource broker to rely on its local replica catalog (the only one it knows of). Otherwise, a resource broker might not make very "smart" decisions when distributing jobs, since it does not really know where the data is. For a resource broker, all storage elements would appear empty and would require the data to be transferred anyway, so it would not really matter who gets to do the job. In reality, those storage elements could contain data, put there by the DMS, but the grid middleware would not know about it. Of course the initial intention was to update the grid local catalog whenever an action is performed on the DMS. Also, whenever an action is performed directly on the grid local catalog, the intent was to have that information propagated to the DMS. Experience shows that such a system would most likely end up inconsistent most of the time. Therefore, the idea of building another catalog - which would always be helpful for storing e.g. metadata, since the database schema would be defined by ATLAS - was viewed as far from ideal. Relying on each grid's local replica catalog seemed safer.

Using each grid's local replica catalog instead of building a single global catalog also has an advantage. If grid systems fail to inter-operate, in security or file transfer protocols -

10It is expected that LCG-2 will eventually cope with this situation since the use of the Castor tape system is a common requirement of all the LHC experiments.


which is a serious possibility11 - it is easier to disconnect all systems and operate three separate grid systems plus all the legacy systems. In that case, each grid catalog would already have "pointers" to some valid data stored there.

So, the second possible DMS consists of a central server without a database as the back-end. Actually, this is not entirely true, since there is a database at the back-end for legacy systems such as CERN's Castor, so that these can support logical file names and metadata. Still, this database is stored in "grid-aware" technology anyway - in the LCG RLS catalog, perhaps via the POOL File Catalog and Metadata components. In this system, whenever a request is made, each grid flavor is asked for the given file(s). The request is translated by the plug-in system into web service calls for each grid system and the final answer is the aggregation of all results. The DMS then acts as a stateless query and registration service. Given that it is stateless, it could almost run as a client application, but it seemed more scalable12 to maintain it as a central service, with very lightweight client tools to access it. Also, maintaining the DMS as a central service allows more functionality to be built into it without affecting the existing client tools. Therefore, the DMS stays a central service, but a completely stateless one, capable of the following tasks: renaming LFNs in all grid flavors; lookup, i.e. for a Logical File Name (LFN) finding all Transport URLs (TURLs) in all grid flavors; and registering LFNs in all grid flavors. It was said previously that the new DMS would make the assumption that "a job running on a given worker node should only contact its native grid middleware", which means that the DMS central service should never need to register LFNs in all grid flavors. Nevertheless, this might be necessary later, not for the job, but for the remaining components of the DMS, such as the automatic data movement system across grids which will be developed later.
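
The following sketch illustrates this stateless query style: the DMS core asks every grid-flavor plug-in for the TURLs of a logical file and returns the aggregated answer. The plug-in interface and the in-memory plug-ins are hypothetical stand-ins; in the real system each plug-in is a remote web service reached through SOAP.

    class GridPlugin:
        # Common interface each grid-flavor plug-in is assumed to implement.
        def lookup(self, lfn):
            # Return the list of TURLs known to this grid for the given LFN.
            raise NotImplementedError

        def register(self, lfn, turl):
            raise NotImplementedError

    class DMSCore:
        # Stateless core: no database, just aggregation over the registered plug-ins.
        def __init__(self, plugins):
            self.plugins = plugins

        def lookup(self, lfn):
            turls = []
            for plugin in self.plugins:
                # A failing grid should not hide the answers of the others.
                try:
                    turls.extend(plugin.lookup(lfn))
                except Exception:
                    pass
            return turls

    class InMemoryPlugin(GridPlugin):
        # Stand-in for a real LCG/NorduGrid/USGrid plug-in, for illustration only.
        def __init__(self, catalog):
            self.catalog = catalog           # {lfn: [turl, ...]}

        def lookup(self, lfn):
            return self.catalog.get(lfn, [])

    dms = DMSCore([InMemoryPlugin({'lfn:evt.root': ['gsiftp://se1.cern.ch/evt.root']}),
                   InMemoryPlugin({'lfn:evt.root': ['gsiftp://se2.nordugrid.org/evt.root']})])
    print(dms.lookup('lfn:evt.root'))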

[Figure 3.11 (diagram): the Supervisor process and the Executor/job processes access the DMS through a SOAP client/server; the DMS core comprises a grid plug-in manager, an automatic data transfer component and the POOL File Catalog and Metadata backed by an XML file, EDG RLS or MySQL, and its plug-ins connect to LCG, NorduGrid and USGrid with their Local Replica Catalogs and Replica Location Index.]

Figure 3.11: Schema of the revised version of the DMS

11In the recent past, Globus Alliance has released, as part of their Globus Toolkit, versions of the GSI security and GridFTP protocols that were backward incompatible.

12In the future, centrally managed ACLs could be maintained across grid implementations, or a central caching service could be designed. All this flexibility is only possible if the DMS is a central service and not a completely stateless client application.


Although Figure 3.11 seems similar to the schema for the initial version of the DMS, there are some important differences. First, there is a distinction between what the Supervisor accesses and what the Executor and jobs can access. Only the Executor and the job have direct access to the grid middleware. The Supervisor accesses only the DMS core, which in turn can access each grid middleware via the DMS plug-ins.

One characteristic that remains the same is the plug-ins for each grid middleware. These are still remote web services, available in each grid flavor, that are accessed from the DMS. A common interface is provided, with a different implementation for each grid flavor. Typically, each implementation of the DMS common interface should be hosted on a User Interface machine of LCG, NorduGrid or the US Grid. That machine is included in the maintenance cycle of the rest of the grid and is therefore updated accordingly. If the APIs used by the DMS implementation to access the grid middleware change, it is only a matter of changing the code that accesses that particular grid middleware locally.

Meanwhile, while collaborating with foreign ATLAS partners, several issues were settled. It was therefore possible to make the following set of assumptions:

– There is a GridFTP implementation that works between all grid flavors (the reference implementation is EDG GridFTP).

– Legacy systems all have GridFTP "umbrellas". This means all access is via GridFTP, even to e.g. CERN's Castor tape system.

– Distributing load and tasks among worker nodes is the ideal situation, to prevent e.g. an executor from doing many more tasks than the other components. This is useful because there will be very few executors or supervisors in comparison to jobs running on worker nodes. It also allows several fall-back situations, so that if one component fails (e.g. a running job cannot register a file), a component at another "level" can still do it. Distributing load is particularly important for validation. All output coming from the automatic production system must be validated, and it is very useful to have the possibility of worker nodes validating the output.

– Worker nodes may store input files on their local disk. In that case, it is the job's responsibility to copy or move the file from a Storage Element to that local disk. It should also clean up files so as not to fill the Worker Node with unnecessary data. The data management system only takes Storage Elements into consideration. Since Worker Nodes usually have the contents of some Storage Elements mounted via NFS, this should not be necessary in principle.

It has been stated in one of the previous assumptions that all systems support GridFTP. Therefore, to access CERN's Castor tape system, a GridFTP-enabled machine would have to be deployed as a front-end to Castor. This is currently the case13. The problem is that GridFTP is just a grid-enabled file transfer protocol. What is necessary for ATLAS is much more than that. It becomes necessary to build a metadata catalog for Castor files, both to keep metadata attributes and, quite simply, to keep Logical File Names for Castor

13There are already machines at CERN configured as a GridFTP front-end to Castor. Nevertheless, this is not an ideal situation, as these do not work very reliably: GridFTP is basically just a grid version of the FTP protocol and does not include support for, e.g., staging files in and out, as required by a tape-based system.


files. To accomplish this, a simple prototype using the POOL File Catalog and Metadata components was developed. It allows the registration of Castor files, and of metadata for those files, in either an XML file catalog, a MySQL database or an EDG Replica Location Service. The back-end on which POOL registers the information is not really relevant, since it is possible to re-publish information between all those back-ends. Using this POOL component appears to be extremely useful, and the EDG RLS back-end seems to be the natural choice given that LCG already uses an EDG RLS at CERN, so ATLAS might as well register its files there.
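
A toy version of what such a catalog holds is sketched below: for each Castor file it keeps an LFN-to-PFN mapping plus a few metadata attributes. This in-memory class is only an illustrative stand-in; the actual prototype uses the POOL File Catalog and Metadata components, with an XML file, MySQL or the EDG RLS as the back-end, and the example path and attribute names are made up.

    class CastorCatalog:
        # Minimal stand-in for the Castor file catalog built with POOL components.
        def __init__(self):
            self.entries = {}        # lfn -> {'pfn': ..., 'metadata': {...}}

        def register(self, lfn, pfn, **metadata):
            # e.g. pfn = '/castor/cern.ch/atlas/dc2/evt.0001.root' (hypothetical path)
            self.entries[lfn] = {'pfn': pfn, 'metadata': metadata}

        def lookup(self, lfn):
            return self.entries[lfn]['pfn']

        def find_by_attribute(self, key, value):
            return [lfn for lfn, e in self.entries.items()
                    if e['metadata'].get(key) == value]

    catalog = CastorCatalog()
    catalog.register('lfn:dc2.evgen.0001', '/castor/cern.ch/atlas/dc2/evt.0001.root',
                     dataset='dc2.evgen', events=1000)
    print(catalog.lookup('lfn:dc2.evgen.0001'))
    print(catalog.find_by_attribute('dataset', 'dc2.evgen'))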

It is then possible to distinguish three major situations when dealing with the DMS: reading input data, storing output data and cleaning up unused data. Input data is the data necessary for a job to execute. Output data is the data newly produced after a job has executed. Both input data and validated output data have to be accessible world-wide. Output data marked as invalid by the validation mechanism should be discarded. The storage elements have limited disk space, so unused input data (usually unused replicas) must be deleted once it is no longer necessary, as should invalid output data.

What follows is a brief analysis of each of these three situations:

– Input Data

Figure 3.12: Use Case for executing a job from the Automatic Production System

The use case in Figure 3.12 shows the process of executing a job using the DMS within the context of the Automatic Production System. The main interaction with the DMS is querying it to find a mapping from LFNs to TURLs. That mapping is then forwarded, so that either the executor or the job can get the files. Given the assumption that GridFTP makes it possible to copy a file from anywhere (provided an outbound IP connection is available, which should be the case for all DMS users except perhaps the worker node where the job is running), it is trivial, given a TURL, to get a file when necessary. A job execution, focusing mostly on input data, would then consist of the following:

∗ The supervisor gets from the Production DB a job to execute. It then analyzes the requirements of this job in terms of necessary files. It asks the DMS for a list of TURLs (perhaps more than one TURL per LFN). It orders those TURLs (grouping them per LFN) in some way that might improve future


resource brokering, e.g. by putting the closest SE as the first alternative in the list of alternatives for as many logical files as possible (a small sketch of this ordering step is given at the end of this list). Later, this information, along with the remaining job parameters, is dispatched to the executor. There is one important piece of information that the supervisor also passes to the executor: whether the executor should attempt to pre-stage the files or not. The supervisor, although analyzing one job at a time, can be aware that several jobs will be using the same files. Preparing for this situation is an advantage. Therefore, it might send the executor an indication that it should attempt to pre-stage the files, since there is a high probability that other jobs using some of the same files will end up in the same executor. Finally, the supervisor also reads from the Production DB (or some other place) the preferred location for the output files. That information is also passed to the executor.

∗ At this point a grid middleware is selected for executing this job. On the corresponding executor, there are two major alternatives: one is pre-staging the files, the other is not pre-staging them.

· For pre-staging: if the job request by the supervisor indicated that pre-staging should be done, the executor will do an edg-copyAndRegister14 to a close SE. Then, the executor will create a job definition file, indicating that the required files are already in a close SE. It is therefore very likely that the grid resource broker will steer the job to a CE15 which has all or most of the required files in a mounted SE. One other important piece of information that goes from the executor to the job is the preferred destination location for the output files. This information goes in the form of an input sandbox in the job definition file. The worker node will then execute the job and attempt to place the output files in one of the "preferred" destination locations. If it cannot place the files in any of those, it chooses another location. After the job has finished, an output sandbox is generated which includes two important components: the result of the job validation (done by the worker node itself) and the actual location of the output files. The executor receives this information and passes it back to the supervisor, perhaps in a different form considering the different communication protocols.

· In the case of no file pre-staging there is one assumption: the selected CE will be able to GridFTP all the files necessary for the job to execute. Otherwise, the job would simply fail because it cannot access all input files. Ideally, all file transfers should be done by Worker Nodes, but there are two reasons for not always doing so. The first is that Worker Nodes may not have an outbound IP connection, so they cannot access the files and GridFTP them in. The other reason is that if a given file is used by many jobs, there may be concurrent accesses to it, which can result in several problems (several simultaneous

14This command is provided as an example and will only be executed in the case of LCG. Similar commands will be executed for each of the other grids' middleware.

15A CE is a Computing Element, the grid component in charge of job resource brokering. It is the CE that decides which of the Worker Nodes associated with it will execute a job.


copies with identical replica catalog entries, etc.). Afterward, when all the required files are accessible, the worker node acts as in the pre-staging case: it executes the job, tries to place the output files in a preferred location and sends back the output sandbox with the actual output file locations and the job validation result.
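
As mentioned in the supervisor step above, the candidate TURLs are grouped per LFN and ordered so that the closest SE comes first. A minimal sketch of that ordering follows; reducing "closeness" to a table of site distances is purely an assumption made for illustration.

    def order_turls(turls_per_lfn, distance_to_site):
        # Group TURLs per LFN and sort each group by the distance of its SE.
        # turls_per_lfn:    {'lfn:evt.root': ['gsiftp://se1.cern.ch/...', ...], ...}
        # distance_to_site: {'se1.cern.ch': 0, 'se2.nordugrid.org': 5, ...}
        def site_of(turl):
            # Crude hostname extraction: 'gsiftp://se1.cern.ch/path' -> 'se1.cern.ch'
            return turl.split('//', 1)[1].split('/', 1)[0]

        ordered = {}
        for lfn, turls in turls_per_lfn.items():
            ordered[lfn] = sorted(turls,
                                  key=lambda t: distance_to_site.get(site_of(t), 999))
        return ordered

    print(order_turls(
        {'lfn:evt.root': ['gsiftp://far.se.org/evt.root', 'gsiftp://close.se.ch/evt.root']},
        {'close.se.ch': 1, 'far.se.org': 10}))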

– Output Data

For the moment, output data should only be placed in the grid where the job ran. In some cases the output files may have to be transferred from the Worker Node to the local Storage Element (if they were not created there in the first place, since most grid middlewares mount the Storage Element's directories via NFS). Nevertheless, this is handled using the grid middleware, since files must be registered in the local grid catalog16. Only then can the files be transferred between Storage Elements.

There is still the open issue of who performs the output validation. It would be useful to have the worker node that executed the job do part of the validation, as part of the job definition file. Then, the supervisor would only have to finish it and, if validation succeeded, the output file could be registered as final in the local grid catalog; if not, the output would be rejected and deleted. If the intent is to have other worker nodes - that is, different jobs - from either the same grid flavor or another grid flavor do the validation, then the temporary unvalidated output file must be registered in the local grid catalog, so that other worker nodes can locate it by querying the DMS. This would require some interaction with the remaining automatic production system components, since those would decide who gets to do the job validation. It is not yet clear who should be in charge of the validation.

Currently, transferring files between Storage Elements is a rather limited operation. LCG, for instance, supports third-party transfer. Nevertheless, this is not really third party, since the grid host which asked the other two hosts to transfer files between themselves keeps a monitoring channel open to each of them. Therefore, it is in some way monitoring the transfer between the two. This is so because there is no security delegation provided by the middleware, so it is not possible to pass credentials to the Storage Element and ask it to transfer a file on a user's behalf. The solution was therefore that the host that issued the transfer request - and that has the user credentials to do so - stays "active" during the transfer, monitoring it. Naturally, a worker node without an outbound IP connection cannot perform a transfer of this kind, since it will not be able to open the channel to the foreign host. It seems the executor will have to be running on a publicly reachable machine, monitoring such transfers.

Since it is rather limiting to have a job always register the output data in the local grid catalog, it might be interesting to interface this system with the automatic data transfer mechanism. The automatic data transfer mechanism could be contacted whenever a job's preferred output destination is outside the native grid. In that case, a new automatic data transfer task would be created to transfer that validated

16In the case of LCG-1 this would mean that an edg-copyAndRegister would be made for each validated output file.


output as soon as possible from the native grid middleware where it was produced to a foreign grid storage element. It would therefore be possible to define a job whose preferred destination is a remote location. As a first step, the output would always be stored in the local grid (not a very expensive operation - just a necessary intermediate step). Then, as soon as possible, the output would be moved to the initially proposed destination.
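
The decision just described - always register locally first and only queue a cross-grid transfer when the preferred destination lies on a foreign grid - might look roughly like the sketch below. The task queue, the flavor lookup table and the catalog-registration stub are assumptions introduced for illustration.

    transfer_queue = []   # tasks the automatic data transfer mechanism would pick up later

    def register_in_native_catalog(lfn, storage_element):
        # Stand-in for the native grid middleware call that registers the output file.
        print("registered %s at %s" % (lfn, storage_element))

    def store_output(lfn, local_se, preferred_se, native_flavor, flavor_of_se):
        # Step 1: always register in the native grid first (cheap, necessary step).
        register_in_native_catalog(lfn, local_se)
        # Step 2: if the preferred destination is on a foreign grid, queue a transfer task.
        if flavor_of_se[preferred_se] != native_flavor:
            transfer_queue.append({'lfn': lfn, 'from': local_se, 'to': preferred_se})

    store_output('lfn:out.0001.root', 'se1.cern.ch', 'se9.usgrid.org', 'LCG',
                 {'se1.cern.ch': 'LCG', 'se9.usgrid.org': 'USGrid'})
    print(transfer_queue)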

– Cleaning up unused data

Another important issue that was realized later is that the DMS would have to handle the task of cleaning up unused files. Contrary to what was initially assumed, the grid middleware does not handle this problem. It would be easy to fall into a situation where jobs always fail to execute because there is not enough disk space in the Storage Elements to copy new files/replicas. A deletion mechanism is being considered and will probably consist of maintaining a "last used date". When the last used date of a file is more than two weeks old, it should be safe to delete the file. The process doing a file transfer (meaning it will use the file) should update this field. A daemon will run on the Storage Elements, typically once or twice a day (depending on site policies), querying a metadata catalog and deleting old files (a minimal sketch is given below). The metadata catalog containing this information should probably be provided by the native grid middleware.
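
The sketch referred to above is given here. It assumes the "last used date" is available as a timestamp from some metadata catalog reachable by the daemon; the catalog dictionary and the deletion callback are hypothetical placeholders for the real middleware calls.

    import time

    TWO_WEEKS = 14 * 24 * 3600   # threshold in seconds

    def cleanup_pass(metadata_catalog, delete_physical_file, now=None):
        # One run of the cleanup daemon (invoked e.g. once or twice a day from cron).
        # metadata_catalog: {pfn: last_used_timestamp}, a stand-in for the real catalog.
        now = time.time() if now is None else now
        for pfn, last_used in list(metadata_catalog.items()):
            if now - last_used > TWO_WEEKS:
                delete_physical_file(pfn)        # remove the physical instance
                del metadata_catalog[pfn]        # and drop its catalog entry

    # Example run with a fake catalog: only the file unused for 20 days is removed.
    catalog = {'/storage/old.root': time.time() - 20 * 24 * 3600,
               '/storage/new.root': time.time()}
    cleanup_pass(catalog, lambda pfn: print("deleting", pfn))
    print(sorted(catalog))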

The revised version of the DMS is still a work in progress. All implementation work so far has consisted either of adapting components developed for the initial DMS, such as the SOAP server, or of trying to make the different grid middlewares inter-operate, something that was momentarily set aside while developing the initial version of the DMS. At that time, the focus was on developing the DMS DB and the client tools to operate it. Now, the crucial point is inter-operation between grids, since there is no significant central database. The key issue is developing a mechanism to poll all the grid middlewares. The complication is not so much the polling itself but designing a mechanism for it with minimal requirements and minimal interference with current site policies.

Some functionality is lost in this version of the DMS. The metadata mechanism is far simpler, relying only on the grids' metadata catalogs or on the POOL Metadata mechanism. Collection support is dropped, since some of the grid middlewares do not support it directly. Still, metadata support can partly work around this limitation. Ideally collections would be natively supported, but unfortunately that is not the case.

One of the problems with building an ATLAS-specific data management system is security. In both the initial and the revised proposals of the DMS, the level of security integration still has to be considered in the near future. One possibility is using GSI, which would mean that the DMS would be truly "grid-aware" in security terms. The problem is exactly how to use GSI. Every grid middleware in use by ATLAS relies on a GSI implementation from the Globus Toolkit. Still, in the past there have been interoperability problems between different GSI versions of the Globus Toolkit. Also, it is still rather unclear what Grid3 will use for security, and any security considerations for the DMS will have to be dealt with when that becomes clear. The other obvious possibility is having the DMS rely on a different security mechanism. In that case, a query request to the


DMS would be made through some secure channel17. Still, for the query request to be performed on a given grid middleware, a valid Globus proxy certificate would have to be in place on that machine. The solution in this case would be to have another security level, at the plug-in level, with a native grid certificate. This would mean that one security technology is used to reach the DMS and perform a query there, and a different security technology is used within the DMS plug-in to query the local grid catalog. This would have the advantage of keeping the systems separate and would be easier to develop, deploy and maintain. The disadvantages are that a query to a grid catalog would always be performed on behalf of the same user18 (there would be no delegation), and that the level of grid integration regarding security would be quite low - if the DMS is to evolve into a more stable system, grid security will eventually have to be tightly integrated. For the moment it is unclear which of the two possibilities will be in place by spring 2004.
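
For the second option, the cron job mentioned in the footnote could be as simple as the sketch below, which checks the remaining proxy lifetime and renews it when it runs low. It assumes the standard Globus grid-proxy-info and grid-proxy-init commands are installed and that credentials are set up so the proxy can be created without an interactive prompt; the threshold and the exact invocation are illustrative assumptions rather than a tested recipe.

    import subprocess

    MIN_SECONDS_LEFT = 4 * 3600   # renew when less than four hours remain

    def proxy_seconds_left():
        # Ask grid-proxy-info how long the current proxy is still valid (in seconds).
        try:
            result = subprocess.run(["grid-proxy-info", "-timeleft"],
                                    capture_output=True, text=True, check=True)
            return int(result.stdout.strip())
        except Exception:
            return 0   # no proxy, or the command failed: treat it as expired

    def renew_if_needed():
        if proxy_seconds_left() < MIN_SECONDS_LEFT:
            # Assumes the host is configured so this runs without a passphrase prompt.
            subprocess.run(["grid-proxy-init"], check=True)

    if __name__ == "__main__":   # intended to be run from cron, e.g. every hour
        renew_if_needed()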

3.2 Technological choices

ATLAS DC-2 will be the first ATLAS Data Challenge to include significant grid technologies. This has led to rethinking several issues so that grid environments are properly integrated. Support for legacy systems, such as the Castor and HPSS mass storage systems, is still crucial. Also, the timescale for DC-2 is April 2004. By that date, there must be a fully working and stable data management system. DC-2 cannot rely on prototypes, since it is a production project.

Developing a peer-to-peer based data management system from scratch would require significant work. Peer-to-peer systems have characteristics such as a lack of global knowledge and of consistency. These are acceptable in the context of a typical user sharing MP3 files. For an HEP experiment, the status of data must be known exactly: whether data can be found or, if not, what its status is (e.g. it was deleted or is temporarily unavailable). The lack of knowledge shown by peer-to-peer systems is not acceptable. Also, peer-to-peer networks are designed for very active and fast-changing environments. Their fault tolerance comes from the large community providing redundant data. For ATLAS this is not the case. The systems where data is stored are very static and usually offer a reasonable degree of fault tolerance. They are not very dynamic either - given their magnitude, storage elements do not constantly enter and leave the system. Also, it is not possible to rely on fault tolerance through multiple replicas as in a peer-to-peer network. Files to be replicated are usually very large and the replication strategies are far more complex. Moreover, replication is not done only on the basis of file size and other file or storage element conditions. It is also done on the basis of the future work to be performed on the files. Files that will be analyzed on a given cluster are therefore best placed near that cluster in the first place. In a peer-to-peer system, compatibility with Castor, HPSS and other legacy systems would have to be wrapped somehow. This would add complexity to an already complex and patched system.

All of the above could be implemented in a peer-to-peer based system. Such a system would

17One possibility would be using the same X.509 certificates in use by Globus, but with our own security mechanism.

18E.g. an LCG-1 machine with a DMS plug-in installed would have to have a grid proxy certificate. To generate that proxy certificate, a simple cron job would be enough, making sure that the web service installed there can always find a proxy certificate with which to issue query requests to the grid middleware.


be an extremely interesting research project and might provide valuable contributions to both the peer-to-peer and grid computing communities. Still, it is not the job of DC-2 to do this research. DC-2 needs working systems on a short timescale.

Therefore, a peer-to-peer alternative for the data management system had to be dropped. Such a system architecture was actually sketched during part of the training period, but the ATLAS DC-2 team decided it was incompatible with DC-2's current needs.

Focus was then given to analyzing existing components and how they could provide (partial) solutions to our data management needs. The core of this analysis was presented in this chapter.

Choosing a grid-based system seemed the most obvious alternative. The Replica Location Service is the basic building block of any data management system. There are several existing RLSs available: Globus has one, as do EDG (and the subsequent LCG project) and NorduGrid.

The NorduGrid RLS has as a back-end an LDAP-based server, which is a hierarchical database. It is unclear whether such a system is capable of handling several hundred thousand (possibly more) files. An RDBMS-based system seems a more stable (and better known) choice. For ATLAS, since CERN's IT division provides support for Oracle with very high uptime, choosing a system which can use an Oracle database as a back-end would probably be an advantage. The LCG RLS will probably be adapted to use Oracle at some point in the future. Currently, all these systems (Globus, EDG, LCG) use MySQL or PostgreSQL. Still, a hierarchical database such as the one used by NorduGrid's RLS (and naturally by version 2.2.x of Globus, on which the NorduGrid middleware is based) could provide very fast read access for hierarchically structured data. All ATLAS data is currently structured in very well defined hierarchies. Since NorduGrid supports collections and these hierarchies by design, this appears to be an advantage. Nevertheless, it should be possible for ATLAS to drop the hierarchical collection support and implement metadata attributes instead, since it appears that every ATLAS collection represents a metadata attribute with a semantic meaning.

Regarding grid-based systems, the Globus RLS is extremely similar to the EDG RLS. Their architecture is basically the same. Only some security integration issues and design decisions concerning GUID support differ between the two. There would be no advantage in using a component developed outside CERN (Globus RLS) when the equivalent component was developed in-house (EDG RLS). Also, the EDG RLS is the RLS chosen by LCG, so it is expected that LCG will provide support for it. It seems a more interesting choice to use such a component given that it is developed and supported at CERN.

Magda provides some of the basic facilities of a data management system. Still, Magda is not really grid-aware. It supports GridFTP, but that is only relevant for file transfers between grid sites. Therefore, a Resource Broker cannot talk to a Magda back-end server to make "smarter" choices about which replica to use to execute a given job. Although Magda provides many interesting features for transferring files, it lacks grid integration.

Building our own system would perhaps result in the best DC-2-specific system possible, since it would be engineered to fit the exact requirements of the data challenges. Nevertheless, the ATLAS experiment is a multinational collaborative effort. If a new data management system were built from scratch, given that it would have to interact with all existing grid flavors, a huge amount of work would have to be accomplished in a short period of time.

The other possible alternative appears to be the Storage Resource Broker. SRB is a well-known and rather stable system. As was said, it is not grid-aware and was designed as a


complete end-to-end solution. This is not entirely desirable for ATLAS, since DC-2 plans to use grid middleware. SRB remains a backup solution. It remains to be seen whether it is possible to wrap grid middleware within SRB, so that a grid resource broker or monitoring service could use SRB's resources. Also, a problem with SRB is that it currently requires a single centralized MCAT database. Although this is a disadvantage, the same is true of most grid middlewares19 at the moment, since the Replica Location Indexes are not yet deployed for some of them, so only Local Replica Catalogs exist.

LCG POOL will most likely be the default data management layer for all the LHC experiments. POOL is much more than a pure data management system, and that is perhaps the reason it is not being fully considered for DC-2. A complete POOL integration would require huge changes in all DC-2 processes. Also, POOL is a very recent system whose stability has not yet been proved. Using a subset of POOL, such as its file catalog, is an interesting and useful option, but it should not be a very high priority. If at a later stage, still within the DC-2 timescale, all files are published into POOL's file catalog, that is a useful but not crucial project. Also, using just POOL's file catalog is not all that important, since it translates into using the LCG RLS with POOL's metadata support.

There are several efforts within the LHC Computing Grid project to handle integration between the several grid flavors. For ATLAS DC-2, these projects remain unusable given the longer timescale on which they are being developed. It remains to be seen whether these projects will perform as expected; if so, they could hopefully be integrated within ATLAS shortly after their release.

One of the bases of software engineering is reusing existing components. Therefore, although the ideal system would be one designed entirely by ourselves to fit our own requirements - this corresponds to the first proposal of our DMS -, this choice had to be realistically dropped. An international collaboration can hardly make such changes and implement a uniform system. Several components exist that contribute to solving our problem. The system that made better use of the existing components was the revised version of the DMS. It takes advantage of each grid's middleware, while also adding support for legacy systems. It was considered that by implementing the revised DMS, ATLAS could not only have the benefits of designing a somewhat proprietary system, where requirements can be met more easily, but at the same time, for the crucial component that is the Replica Catalog, take advantage of the several existing options. Each main ATLAS collaborator uses a different Replica Catalog and this way can keep using it.

Therefore, the decision was made to use the revised DMS as the ATLAS DC-2 Automatic Production Data Management system.

19Currently, only Globus has a distributed Replica Location Index working, although apparently this has not been extensively tested.


Chapter 4

Future Directions

The peer-to-peer system Adaptive Grid is still very much a work in progress. It cannot for now be directly applied to the ATLAS Data Challenges, but it could certainly become extremely useful within the physics community. There is much to be done, but the focus should mainly be on a job execution mechanism capable of job partitioning. A software installation mechanism is also crucial. The jobs always require significant middleware to be available, and it is important that a peer-to-peer system addresses this distribution of software in a secure and controllable way. Some mechanisms developed for ATLAS, such as Pacman, would help with that distribution of software. Even GNU/Linux Debian includes a state-of-the-art software packaging mechanism. It should then be possible to execute jobs or, ideally, parts of jobs, using job partitioning, in a highly distributed manner. For job partitioning there are, it seems, two major approaches: one is directly analyzing the code, trying to discover known patterns which are distributable; the other is building programs ready for distribution using a suitable programming language. Python seems ideal for this second approach given its support for code modularity.

The Adaptive Grid effort could converge with some existing "at home" projects, such as SETI@Home, since it intends to provide a far more complete middleware. Integrating some grid technologies, such as GSI or GridFTP, as optional components would greatly increase the potential market for such a system. Peer-to-peer definitely has a significant set of interesting and useful characteristics. It would also allow the development of smart storage elements, crucial for ATLAS, which are distributed storage elements capable of talking to each other and making data management and data brokering decisions by themselves. The convergence of web services, grid computing, peer-to-peer and, given the strong metadata capabilities, the semantic web could provide Adaptive Grid with a unique ground from which applications could be built. There is currently no toolkit capable of providing that kind of middleware. A small, simple, dynamic grid-enabled middleware internally functioning as a peer-to-peer mechanism would allow users to build such disparate systems as smart storage elements, distributed analysis programs and even semantic agents crawling the network. This would map onto the HEP requirement of having a simple and easy work environment available for the physicists, capable of handling smaller loads in more dynamic environments. Currently, there is no leading initiative at CERN answering the problem of the physicists' work environment after 2007, when the detector comes online1.

1There are some preliminary plans and implementations of systems such as GANGA, DIAL and, more recently, the ARDA RTAG initiative.


Some of the basic concepts in Adaptive Grid are also the basis of the Open Grid Services Architecture for Distributed Systems Integration [43]. The notion of providing dynamic services and service discovery - the core of Adaptive Grid - maps onto the concepts defined by OGSA. It would be interesting to see Adaptive Grid merged into an "OGSA-aware" system.

There are at CERN a multitude of concurrent grid efforts, and it is our belief that Adaptive Grid could take advantage of some of them. Perhaps the most complete for the moment is AliEn [44, 45] - a "grid" middleware specific to HEP, created as part of the ALICE LHC experiment2 - which includes job partitioning and splitting mechanisms (limited to special kinds of jobs: those using ROOT physics input files) that Adaptive Grid could take advantage of.

OGSI-fying Adaptive Grid and working closely with the AliEn project (which is also scheduled to be OGSI-fied in the next few months) would be a worthwhile research project with a guaranteed short development timescale, since both projects have one basic concept in common: keeping the system as simple and lightweight as possible.

Given the complexity of the data management problem for DC-2, future directions are constantly being analyzed. The initially proposed DMS changed design after just one month of development. Before that, the choice kept rotating among the several possible systems to use. It is expected that the DMS will undergo important changes before it is deployed and ready to use.

There are some technologies which were not evaluated since they have not yet produced usable software. One is the LCG integration effort with the US Grid. Another is SRM. SRM is a Storage Resource Manager developed at Brookhaven National Laboratory in the United States, in collaboration with members of the EDG project at CERN. Together, they have produced the specification, currently at version 2.1, of a Storage Resource Manager. This includes many of the features required by ATLAS, making it a "smarter" and active storage manager, and not just a GridFTP server as is currently the case3. Future developments for ATLAS will probably take SRM into consideration. Another crucial system is POOL. ATLAS will probably become POOL-enabled within the next year - parts of the ATLAS software already produce POOL-enabled output. Introducing some POOL components for DC-2 will help future integration.

There will always be room for innovation and new technologies, simply because all data management and grid initiatives have a larger set of users than just the High Energy Physics community. The follow-up to the EDG project, called EGEE (Enabling Grids for E-science and industry in Europe), will most likely focus on the requirements of the biomedical community as much as on those of HEP, as required by the European Union funding. Therefore, whatever grid middleware is available from those projects will always have to be adapted and adjusted for HEP. LCG seems capable of addressing the changes required for generic HEP, but each experiment still has its own unique requirements. The ATLAS Data Challenges are the suitable place to adapt the available middleware to ATLAS requirements, since the whole purpose of the Data Challenges is to test the existing infrastructure and make sure it works as required.

2It is commonly said that, of all the grid middlewares available, AliEn is arguably the one that is less of a Grid but more of a working system, proving to be functional and conforming more thoroughly to the HEP requirements (HEPCAL document).

3LCG-2 will implement SRM specification version 1.0, which is very primitive and does not even support true third-party transfers between Storage Elements.


Chapter 5

Conclusion

We started with an introduction to distributed computing, particularly focusing on Grid technologies by overviewing the current state of the technology. The focus then moved to the Data Management problem, particularly in the context of the ATLAS Data Challenges. When the first thoughts of having an ATLAS-wide Data Management System emerged, there was only a very faint idea of how it should behave. This led us to design a peer-to-peer based system which, while a worthwhile and promising project, is far from achievable within the ATLAS DC-2 timescale.

An ongoing evaluation of existing technologies was the core part of the senior training. As this report shows, there are several valid alternatives for the data management problem. No single technology provides a full solution, which is why DC-2 decided to build its own system, relying on existing components.

ATLAS DC-2 requires a grid-enabled, world-wide Data Management System. Unfortunately, grid middleware is still immature in terms of compatibility. There are several standardization initiatives, such as the Global Grid Forum or DataTAG, and even LCG is attempting to directly achieve compatibility in some key components. This does not exclude the possibility that, in the near future, one of the grid middlewares used by ATLAS will migrate some of its components to a different and possibly incompatible version, thereby jeopardizing the "ATLAS-wide" capabilities of DC-2.

Currently, it seems that the DMS will succeed to some extent, possibly with multiple reconfigurations and some expected downtime. It is expected that it will be adapted, and even redesigned, as new, unforeseen situations are found.

Data Management is a crucial point in any High Energy Physics experiment. A quotation from a researcher of the "friendly rival" CMS experiment, seen in the PhD thesis of CERN researcher Kurt Stockinger, states that "One of the crucial points is the fast access of data. Who accesses it faster, makes physics discoveries first." The DMS aims to provide such access to data, so that discoveries are made first.

Understanding the ATLAS experiment, the Data Challenges and the requirements for DC-2 was the most challenging part of the senior training. The personal experience of working at CERN proved extremely valuable. There is still plenty of room for progress. Making a useful system capable of handling the data load when the accelerator goes online, and providing it to a geographically dispersed community, is a huge effort. Working as part of an international collaboration and still making a valuable contribution to the experiment proved to be very challenging. Not only is the community geographically dispersed, but so are some of its efforts.


Personally, this senior training was extremely rewarding, helping me gain experience in dealing with several simultaneous efforts and in making technical decisions influenced by a multiplicity of technical and non-technical factors. It is our understanding that the current DMS proposal satisfies these requirements as thoroughly as possible.


Bibliography

[1] ATLAS, “The ATLAS Experiment.” http://atlasexperiment.org.

[2] The ATLAS DC1 Task Force, “ATLAS Data Challenge 1,” 2003.

[3] “Global Grid Forum.” http://gridforum.org.

[4] “Globus Alliance.” http://www.globus.org.

[5] I. Foster and C. Kesselman, The Grid: blueprint for a new computing infrastructure. Morgan Kaufmann Publishers Inc., 1998.

[6] I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid,” http://www.globus.org/research/papers/anatomy.pdf, 2001.

[7] “Globus Toolkit.” http://www-unix.globus.org/toolkit/.

[8] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, and P. Vanderbilt, “Grid service specification,” tech. rep., Global Grid Forum, October 2002.

[9] C. Shirky, “What is P2P... and What Isn’t,” http://www.openp2p.com/pub/a/p2p/2000/11/24/shirky1-whatisp2p.html.

[10] “Gnutella.” http://gnutella.wego.com/.

[11] “Napster.” http://www.napster.com/.

[12] “The Freenet Project.” http://freenet.sourceforge.net/.

[13] “Kazaa.” http://www.kazaa.com/.

[14] I. Foster and A. Iamnitchi, “On death, taxes, and the convergence of peer-to-peer and grid computing,” 2nd International Workshop on Peer-to-Peer Systems (IPTPS’03), February 2003.

[15] J. Beringer, N. Brook, P. Buncic, F. Carminati, R. Cavanaugh, P. Cerello, F. Donno, D. Foster, C. Grandi, F. Harris, L. Perini, A. Pfeiffer, R. Pordes, D. Quarrie, A. Sciaba, O. Smirnova, J. Templon, A. Tsaregorodtsev, and C. Tull, “Common Use Cases for a HEP Common Application Layer for Analysis - HEPCAL II,” tech. rep., CERN, European Organization for Nuclear Research, October 2003.


[16] “The European DataGrid project.” http://eu-datagrid.web.cern.ch/eu-datagrid/.

[17] “LHC Computing Grid Project.” http://lcg.web.cern.ch/LCG/.

[18] “EDG WP2 Data Management.” http://edg-wp2.web.cern.ch/edg-wp2/.

[19] “EDG WP5 Storage Element.” http://web01.esc.rl.ac.uk/projects/DataGrid/wp5/.

[20] L. Guy, P. Kunszt, E. Laure, H. Stockinger, and K. Stockinger, “Replica management in data grids,” July 2002.

[21] “NorduGrid.” http://www.nordugrid.org.

[22] “The Grid 2003 Project.” http://www.ivdgl.org/grid2003/index.php.

[23] “Grid 3 or Grid 2003.” http://www.ivdgl.org/grid2003/documents/document_server/uploaded_documents/doc--647--Grid3_v19.pdf.

[24] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in Proceedings of the 2001 conference on applications, technologies, architectures, and protocols for computer communications, pp. 149–160, ACM Press, 2001.

[25] P. Maymounkov and D. Mazieres, “Kademlia: A Peer-to-peer Information System Based on the XOR Metric,” March 2002.

[26] “The Project JXTA.” http://www.jxta.org.

[27] B. Traversat, A. Arora, M. Abdelaziz, M. Duigou, C. Haywood, J.-C. Hugly, E. Pouyoul, and B. Yeager, “Project JXTA 2.0 Super-Peer Virtual Network.”

[28] “Berkeley Open Infrastructure for Network Computing.” http://boinc.berkeley.edu/.

[29] “Magda - Manager for Grid-based Data.” http://www.atlasgrid.bnl.gov/magda/info.

[30] D. Duellmann, “The LCG POOL Project - General Overview and Project Structure,” 2003.

[31] J. Generowicz, P. Mato, L. Moneta, S. Roiser, M. M. LBNL, and L. Tuura, “SEAL: Common Core Libraries and Services for LHC Applications,” in CHEP 2003, La Jolla, CA (USA), March 2003.

[32] C. Cioffi, S. Echmann, D. Malon, A. Vaniachine, M. Girone, H. Schmuecker, J. Wojcieszuk, J. Hrivnac, and Z. Xie, “POOL File Catalog, Collection and Metadata Components,” in CHEP 2003, La Jolla, CA (USA), March 2003.

[33] “The SEAL Project.” http://seal.web.cern.ch/seal/.


[34] WP2, “User Guide for EDG Replica Manager,” October 2003. http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-replica-manager-userguide.pdf.

[35] WP2, “User Guide for Replica Metadata Service,” October 2003. http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-rmc-userguide.pdf.

[36] WP2, “User Guide for EDG RLS Replica Location Index,” October 2003. http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-rli-userguide.pdf.

[37] WP2, “User Guide for Local Replica Catalog,” October 2003. http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-lrc-devguide.pdf.

[38] WP2, “User Guide for EDG Replica Optimization Service,” October 2003. http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-ros-userguide.pdf.

[39] “SDSC Storage Resource Broker.” http://www.npaci.edu/DICE/SRB/.

[40] M. Wan, A. Rajasekar, R. Moore, and P. Andrew, “A Simple Mass Storage System for the SRB Data Grid,” in 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems & Technologies (MSST2003), San Diego, California, April 2003.

[41] “Getting started with Globus Replica Catalog.” http://www.globus.org/datagrid/deliverables/replicaGettingStarted.pdf.

[42] “CASTOR Developer’s Guide.” http://castor.web.cern.ch/castor/DOCUMENTATION/DEVGUIDE/.

[43] I. Foster, C. Kesselman, J. Nick, and S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration,” 2002.

[44] “AliEn.” http://alien.cern.ch/.

[45] P. Buncic, A. Peters, and P. Saiz, “The AliEn system, status and perspectives,” 2003.
