The ALICE Distributed Computing, Federico Carminati, ALICE workshop, Sibiu, Romania, 20/08/2008


Page 1: The ALICE Distributed Computing Federico Carminati ALICE workshop, Sibiu, Romania, 20/08/2008

The ALICE Distributed Computing

Federico Carminati
ALICE workshop, Sibiu, Romania, 20/08/2008

Page 2

Perspectives…

This workshop: exactly 30 days before the official start of LHC (2.6 million seconds)

10 September 2008

Is the Grid ready to register, process and provide the ALICE physicists with a platform to efficiently analyze the first data?

Page 3

The ALICE Grid is born

Working prototype in 2002
The Vision From the Very Beginning

Single interface to distributed computing for all ALICE physicists

File catalogue, job submission and control, application software management, end user analysis

And this is…

AliEn – Alice Environment

Page 4

Toddler years 2003-2004

First MC productions…
Full vertical Grid – interfaced down to local batch system level and any type of local storage the site provides (capability retained and refined up to today)
Few hundred CPUs at 13 sites
Very ambitious goals – validation of the entire ALICE Computing model

Page 5

Toddler years (2)

First report of yearly ALICE data challenge: Sep. 2004

CPU work: 285 MSI-2K hours (one 2.8 GHz PC working for 35 years)
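
The comparison on this slide can be checked with a short calculation. The sketch below assumes that "MSI-2K hours" means millions of SpecInt2000 hours and that the PC runs around the clock for 365-day years; the function name is illustrative and not part of any ALICE tool. Under those assumptions it derives the per-PC rating that the slide's equivalence implies.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap days ignored (assumption)


def implied_pc_rating_si2k(total_msi2k_hours: float, years: float) -> float:
    """Return the SI2K rating one PC would need to deliver the given
    CPU work (in millions of SI2K hours) when running non-stop for `years`."""
    total_si2k_hours = total_msi2k_hours * 1e6
    return total_si2k_hours / (years * HOURS_PER_YEAR)


# Figures quoted on the slide: 285 MSI-2K hours, 35 years.
rating = implied_pc_rating_si2k(285, 35)
print(f"Implied rating of the 2.8 GHz PC: about {rating:.0f} SI2K")
```

The result is a little under 1000 SI2K, so the slide's "35 years" figure corresponds to treating a 2.8 GHz PC as roughly a 1 kSI2K machine.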

Page 6

CR4 22.03.2004 C.W. Fabjan

At that time ALICE was…

Services

Page 7

Terrible twos (and a bit beyond)…2005-2006

Increased sophistication of the central AliEn services
Full integration with the LCG services
AliEn is moving toward ‘Top level Grid management’
Beginning of the story with storage (still ongoing)
Increase the number of participating sites and CPUs used
And more data challenges…this time with a solid backup of the Computing TDR
Published in June 2005

Page 8

AliEn services in 2005

HP ProLiant DL580 AliEn central services
HP ProLiant DL360 AliEn to CASTOR (MSS) interface
HP ProLiant DL380 gShell API service
Disk SE log files storage 1TB SATA Disk server

Page 9

PDC results at end of 2005

Data Challenge IV with LCG SC3: distributed MC production works routinely
Monitoring: 25 sites operational, up to 1800 concurrent jobs, all using LCG resources, LCG services and ALICE services, thanks to our experience with AliEn

Running jobs (8 November)

Farm       Min   Avg   Max
Sum       1160  1651  1771
CCIN2P3    134   210   231
CERN-L     268   286   304
CNAF       255   362   394
FZK          0   531   600
Houston      0     3    14
Münster      2    58    81
Prague      43    61    71
Sejong       2     2     2
Torino      33    41    43

Page 10

First records – Dec. 2005

2450 jobs

Page 11

At that time ALICE was…

Page 12

Beginning of the endless DC

April 2006 – decision to run in a quasi-permanent mode
Integration of new sites
Workload and storage
Improving the robustness of central AliEn services
Operational experience
Test of AliRoot, new MC productions

Page 13

gLite enters the Grid world

PPS Schedule for gLite 3.0 (timeline, February to June)

Tuesday 28/2/06: gLite 3.0β exits certification and enters PPS.
Wednesday 15/3/06: gLite 3.0β available to users in the PPS.
Friday 28/4/06: gLite 3.0 exits PPS and enters production.
Thursday 1/6/06: SC4 starts!!

Timeline phases: Certification, PPS, Deployment of gLite 3.0β in PPS, Deploy in production, PRODUCTION!!
Patches for bugs continually passed to PPS.
Timeline marker: YOU ARE HERE

Page 14

April 2006 - xrootd is adopted

xrootd with MSS backend is operational at CERN
Similar setup is being brought online at Lyon
For other sites – experts are preparing the software and instructions for site experts on how to install xrootd on the local storage servers
Discussion of interoperation of various storage solutions (CASTOR2, DPM, dCache and xrootd) ongoing
SRM standard is still being discussed
ALICE will use xrootd based storage and is actively pursuing its inclusion in the standard LCG package

Page 15

History of PDC’06

Continuous running since April 2006
Test jobs, allowing us to debug all site services and test the stability
From July – production and reconstruction of p+p MC events

Page 16

Resources statistics PDC’06

Resources contribution (normalized Si2K units): 50% from T1s, 50% from T2s

6 T1s, 30 T2s

Page 17

At that time ALICE was…

Page 18

Kindergarten 2007-today

Toward fully redundant central services
Integration of gLite transfer and storage solutions
Automatic site and central services management tools ‘auto-pilot mode’
Automatic production tools
Integration of more sites, deployment of storage
MC and RAW data production

Page 19

Number of jobs evolution

Max. attained 10325

Page 20

Sites contribution

50% resources contribution from T2s!

Page 21

The ALICE Grid in numbers

73 participating sites
1 T0 (CERN/Switzerland)
6 T1s (France, Germany, Italy, The Netherlands, Nordic DataGrid Facility, UK)
66 T2s spread over 4 continents
As of today the ALICE share is some 7000 (out of ~30000 total Grid) CPUs and 1.5 PB of distributed storage
In ½ year ~15K CPUs, x2 storage

Page 22

The ALICE Grid Map

Europe
Asia
North America
Africa

Here is the live picture

Page 23

Control of the ALICE Grid

Fully redundant
DB (MySQL) master-slave structure and backup
All central services run multiple instances
Build servers for i686, x86_64, ia64, MacOS 32- and 64-bit

Page 24

ALICE Grid task list today

Registration of data at MSS T0 and on the GRID
Replication T0->T1
Quasi-online reconstruction
Pass 1 at T0
Pass 2 at T1s
MC production and user analysis

Page 25

RAW data from cosmic tests

230K RAW data files

Page 26

Replication of RAW

60MB/sec rate (p+p data taking scenario)
Using gLite transfer tools (FTS), operated through AliEn FTD
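
To give a feel for the 60 MB/sec replication rate, the sketch below converts a sustained transfer rate into a daily data volume. It assumes decimal units (1 TB = 10^6 MB) and an uninterrupted 24-hour transfer, neither of which is stated on the slide; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """Data volume in TB moved in one day at a sustained rate in MB/s,
    using decimal units (1 TB = 1e6 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1e6


# The p+p data taking scenario quoted on the slide.
print(f"{daily_volume_tb(60):.2f} TB per day at 60 MB/s")
```

Under these assumptions the T0->T1 replication moves a little over 5 TB per day.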

Page 27

Data and MC production - emphasis

Fast MC production for first physics with various LHC startup scenarios
And analysis for first publication
Already 2 cycles made
Fast analysis of detector calibration data
Essentially immediately after data taking (same day/night)
Crucial for feedback to detector experts

Page 28

Production of RAW

Major activity in the past 6 months, very successful despite rapidly changing conditions
Both in the code and detector operation
In total some 120 TB of RAW passed through the reconstruction

Page 29

MC production since 2006

Some 330 million events with various physics content

Page 30

Storage evolution

Subject of intense development and deployment
For ALICE – xrootd as single access protocol
4 de-facto storage solutions
Pure MSS – CASTOR2
Hybrid disk+MSS – dCache
Disk – DPM, xrootd
Critical for analysis is the disk-based storage
Good overall deployment progress and stability of all storage types
Secure access through the ALICE security envelope – every operation is authorized through a unique set of encrypted keys

Page 31

Storage deployment

38 storage endpoints over 21 sites, ~1.5 PB

Page 32

Current operation principle

Central AliEn services
Site VO-box (five instances in the diagram)
WMS (gLite/ARC/OSG/Local)
SM (dCache/DPM/CASTOR/xrootd)
Monitoring, Package management

The VO-box system (very controversial in the beginning)
• Has been extensively tested
• Allows for site services scaling
• Is a simple isolation layer for the VO in case of troubles

Page 33

Operation – central/site support

Central services support (2 FTEs equivalent)
There are no experts who do support exclusively – there are 6 highly-qualified experts doing development/support
Site services support - handled by ‘regional experts’ (one per country) in collaboration with local cluster administrators
Extremely important part of the system
In normal operation ~0.2 FTEs/region
Regular weekly discussions and active all-activities mailing lists

Page 34

User analysis activities

Generally successful
User job priorities are well mastered in the AliEn system
Simple priority scheduling seems to work well, will be expanded soon to ‘pay for what you use’ principles
Storage remains a weak point
Only the lack of it – available amount does not allow full (as per computing model) replication of data to be analysed
As of today, 180 registered, 110 active users on the Grid
Not counting the MC/RAW production and CAF

Page 35

User analysis statistics, one year

~1 million user jobs completed, 2.7K/day

Chaotic analysis only!
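
The two figures on this slide are consistent with each other, as a quick check shows. The sketch below assumes the one-year window is 365 days and that the total is exactly one million jobs, whereas the slide only gives an approximate total; the function name is illustrative.

```python
def jobs_per_day(total_jobs: int, days: int = 365) -> float:
    """Average number of completed jobs per day over the given period."""
    return total_jobs / days


# Approximate total quoted on the slide: one million user jobs in a year.
average = jobs_per_day(1_000_000)
print(f"About {average / 1000:.1f}K user jobs per day")
```

This gives roughly 2.7K jobs per day, matching the quoted daily rate.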

Page 36

Today, ALICE is ready…

Page 37

Getting ready for the first school day

In 6 short years the ALICE Grid has been transformed from a ‘proof-of-concept’ into a fully operational computational platform for ALICE offline work
The ALICE computing model has been validated thoroughly through a series of data challenges with increasing complexity and scope
The ALICE Grid today uses CPU and storage resources at some 73 computing centres, fully integrated into the local fabric and services
The MC and RAW data production is a routine exercise, moving rapidly into an automated mode of operation
User analysis is a routine exercise too…
Services support on all levels is well understood

The ALICE Grid is ready for its first day in school, coming, as usual, at the beginning of September