benigno gobbo 1 incontro con nando 8 february 2005 more than four years of compute farm benigno...

19
Benigno Gobbo Benigno Gobbo 1 Incontro con Nando Incontro con Nando 8 February 2005 8 February 2005 More than Four Years of Compute More than Four Years of Compute Farm Farm Benigno Gobbo Benigno Gobbo [email protected] [email protected] Info: Info: http://www.ts.infn.it/acid http://www.ts.infn.it/acid [email protected] [email protected]

Upload: howard-pilcher

Post on 29-Mar-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Benigno Gobbo Benigno Gobbo 11Incontro con NandoIncontro con Nando8 February 20058 February 2005

More than Four Years of More than Four Years of Compute FarmCompute Farm

Benigno GobboBenigno [email protected]@cern.ch

Info:Info:

http://www.ts.infn.it/acidhttp://www.ts.infn.it/acid

[email protected]@ts.infn.it

Page 2: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 22

RequirementsRequirements

COMPASS: High statistics - Medium event complexityCOMPASS: High statistics - Medium event complexity~ 10~ 101010 events/year events/year~ 10 “good” tracks/event~ 10 “good” tracks/event

More than 200 tracking planes in non uniform magnetic fieldMore than 200 tracking planes in non uniform magnetic fieldParticle Identification: RICH, calorimeters, …Particle Identification: RICH, calorimeters, …

Non trivial event reconstructionNon trivial event reconstructionProduction time: ~ 300 sProduction time: ~ 300 s..SpecCINT2000 (Si2k)SpecCINT2000 (Si2k)

DATA STORAGE, PRODUCTION and ANALYSIS modelDATA STORAGE, PRODUCTION and ANALYSIS modelRaw data stored at CERN (~300 TB/year)Raw data stored at CERN (~300 TB/year)Production at CERN (E.g. for 2005 we have 200000 Si2k/QuarterProduction at CERN (E.g. for 2005 we have 200000 Si2k/QuarterMonte Carlo Production and Data Analysis at Home-LabsMonte Carlo Production and Data Analysis at Home-Labs

►►Need of Compute Farms at Home LaboratoriesNeed of Compute Farms at Home LaboratoriesAlso due to usual CERN request of computing redistribution:Also due to usual CERN request of computing redistribution:

33% at CERN, 67% outside33% at CERN, 67% outside

Page 3: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 33

A different Computing ModelA different Computing Model

1998. Definition of a Computing Model for the post–LEP era1998. Definition of a Computing Model for the post–LEP eraJanuary 1998. A Task Force was established at CERN January 1998. A Task Force was established at CERN (1)

To achieve: agreement with time scale and requirements of To achieve: agreement with time scale and requirements of experiments, flexibility of environment, constraints from used experiments, flexibility of environment, constraints from used commercial software, realistic assessment of costs, …commercial software, realistic assessment of costs, …

April 1998. Conclusions (Recommendations): Hybrid ArchitectureApril 1998. Conclusions (Recommendations): Hybrid Architectureusing PCs for computation (preferred: Windows NT, “tolerated”: using PCs for computation (preferred: Windows NT, “tolerated”: Linux)Linux)using at present RISC systems for I/O (legacy Unix) using at present RISC systems for I/O (legacy Unix)

1999. Evolution of the model1999. Evolution of the modelSensitive Linux improvements: now stable and better performing Sensitive Linux improvements: now stable and better performing

than Win NTthan Win NTDevelopment of “low price + good enough quality” IDE disk based Development of “low price + good enough quality” IDE disk based

PC serversPC servers

COMPASS Definitive choice:COMPASS Definitive choice:PCs for both server and computation machinesPCs for both server and computation machines(RedHat) Linux OS(RedHat) Linux OS

Page 4: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 44

The HistoryThe History

Sep. 2000. Approved (and above all “sponsored”!) by CSN Sep. 2000. Approved (and above all “sponsored”!) by CSN II

Financed in two yearsFinanced in two years200M ITL in 2000200M ITL in 2000124k € in 2001 124k € in 2001

Oct. 2000. Definition of a schema for the farm “initial Oct. 2000. Definition of a schema for the farm “initial setup”setup”

The farm has to be as much as possible compatible with the The farm has to be as much as possible compatible with the CERN oneCERN one

But not CERN-dependentBut not CERN-dependent

The “initial setup” must guarantee a “production environment”The “initial setup” must guarantee a “production environment”Enough disk space (for data storage and MC production)Enough disk space (for data storage and MC production)Enough CPU power (i.e. PC clients)Enough CPU power (i.e. PC clients)

It must be scalable to the final configuration without (major) It must be scalable to the final configuration without (major) modificationsmodifications

It must fit with approved financingIt must fit with approved financing

Page 5: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 55

History: first stepsHistory: first stepsNov. 2000. “Initial setup” decided, orders submittedNov. 2000. “Initial setup” decided, orders submitted1 PC Server with large EIDE disk space (with 14 x 75 GB EIDE 1 PC Server with large EIDE disk space (with 14 x 75 GB EIDE

disks)disks)RAID1 (mirroring) configured, it allowed RAID1 (mirroring) configured, it allowed 0.5 TB0.5 TB of (cheap) disk of (cheap) disk storagestorageThe machine was assembled by ELONEX following a CERN R&DThe machine was assembled by ELONEX following a CERN R&D

1 Sun Server with external SCSI disks ( 8 x 73 GB)1 Sun Server with external SCSI disks ( 8 x 73 GB)Configured RAID5, gave a 0.47 TB of more reliable disk storageConfigured RAID5, gave a 0.47 TB of more reliable disk storageDifferent OS (Solaris) and architecture (SPARC): allows better test Different OS (Solaris) and architecture (SPARC): allows better test and debugging of softwareand debugging of software

1 PC Supervision Server1 PC Supervision ServerNothing special: just a white-box PC with better components. Used Nothing special: just a white-box PC with better components. Used as a supervisor or master in monitoring or client-server softwareas a supervisor or master in monitoring or client-server software

12 PC Clients12 PC ClientsValue white-box PC, to stay into available budgetValue white-box PC, to stay into available budget

All machines are dual processor to improve performances/costsAll machines are dual processor to improve performances/costsWell… Sun was bought as single processor (it was so expansive…) Well… Sun was bought as single processor (it was so expansive…) and upgraded subsequentlyand upgraded subsequently

Network switch (36 100BaseT + 3 1000BaseSX ports)Network switch (36 100BaseT + 3 1000BaseSX ports)KVM switches, rack, shelves, monitor, keyboard, etc.KVM switches, rack, shelves, monitor, keyboard, etc.UPS and cooling system provided by “Sezione di Trieste” UPS and cooling system provided by “Sezione di Trieste” (thanks (thanks

to A. Mansutti & S. Rizzarelli)to A. Mansutti & S. Rizzarelli)

Page 6: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 66

History. Feb. 2001: “First setup” in History. Feb. 2001: “First setup” in productionproduction

First First LinuxLinux Compute Farm locally installed and Compute Farm locally installed and completely managed by INFN personnelcompletely managed by INFN personnel

Page 7: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 77

History: the final setupHistory: the final setup

Sep. 2001. Start Farm upgrade to Final Setup Sep. 2001. Start Farm upgrade to Final Setup 1 more EIDE PC Server (with 20 x 80 GB EIDE disks)1 more EIDE PC Server (with 20 x 80 GB EIDE disks)

Configured RAID1: Configured RAID1: 0.75 GB0.75 GB

Upgrade of previous EIDE Server with 6 additional 80 GB disksUpgrade of previous EIDE Server with 6 additional 80 GB disksNow it provides Now it provides 0.72 TB0.72 TB (RAID1) (RAID1)

Upgrade of the Sun to dual processorUpgrade of the Sun to dual processorSTK Tape Library: 20 slots (can be upgraded to 40) , 2 IBM STK Tape Library: 20 slots (can be upgraded to 40) , 2 IBM

Ultrium drives (can have 4 drives)Ultrium drives (can have 4 drives)It can store up to 4 TB of data. Drives transfer rate up to 30 MB/sIt can store up to 4 TB of data. Drives transfer rate up to 30 MB/s

1 Dell PC Tape Server, with 6 x 73 GB SCSI disks configured RAID 1 Dell PC Tape Server, with 6 x 73 GB SCSI disks configured RAID 0 (striping)0 (striping)

To be used with Tape Lib forming HSM systemTo be used with Tape Lib forming HSM system

19 PC clients19 PC clientswhite-box machines, dual 1 GHz P III white-box machines, dual 1 GHz P III

12 ports 1000BaseSX switch12 ports 1000BaseSX switchKVM switches, etc.KVM switches, etc.

Page 8: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 88

History: the 2002 “Final Setup”History: the 2002 “Final Setup”

11 Old clients:11 Old clients:MSI 694D ProMSI 694D Pro

Dual PIII 800 MHzDual PIII 800 MHz2 x 20 GB ATA Disk2 x 20 GB ATA Disk

512 MB RAM512 MB RAM

11 Old clients:11 Old clients:MSI 694D ProMSI 694D Pro

Dual PIII 800 MHzDual PIII 800 MHz2 x 20 GB ATA Disk2 x 20 GB ATA Disk

512 MB RAM512 MB RAM

19 New clients:19 New clients:Abit VP6Abit VP6

Dual PIII 1000 MHzDual PIII 1000 MHz2 x 40 GB ATA Disk2 x 40 GB ATA Disk

512 MB RAM512 MB RAM

19 New clients:19 New clients:Abit VP6Abit VP6

Dual PIII 1000 MHzDual PIII 1000 MHz2 x 40 GB ATA Disk2 x 40 GB ATA Disk

512 MB RAM512 MB RAM

3com 49003com 49003com 39003com 3900

Kvm switchKvm switch

Server SGE, DHCP, BServer SGE, DHCP, BB, …B, …Asus CUR-DLSAsus CUR-DLSDual PIII 800 MHzDual PIII 800 MHz2 x 36 GB SCSI Disk2 x 36 GB SCSI Disk512 MB RAM512 MB RAMGA620 G gigabitGA620 G gigabit

EIDE disk serverEIDE disk serverIntel L440 GX+Intel L440 GX+Dual PIII 700 MHzDual PIII 700 MHz2 x 15 GB ATA disk2 x 15 GB ATA disk14 x 75 GB ATA disk14 x 75 GB ATA disk6 x 80 GB ATA disk6 x 80 GB ATA diskGA620 G gigabit GA620 G gigabit

EIDE disk serverEIDE disk serverIntel STL2Intel STL2Dual PIII 866 MHzDual PIII 866 MHz2 x 20 GB ATA disk2 x 20 GB ATA disk20 x 80 GB ATA disk20 x 80 GB ATA diskGA620 G gigabit GA620 G gigabit

Tape LibraryTape LibrarySTK L40 20 slotSTK L40 20 slot2 x IBM Ultrium2 x IBM Ultrium

Tape/disk serverTape/disk serverDell PowerEdge 4400Dell PowerEdge 4400Dual Xeon 1 GHzDual Xeon 1 GHz2 x 36 GB SCSI RAID12 x 36 GB SCSI RAID16 x 73 GB SCSI RAID06 x 73 GB SCSI RAID0

SCSI disk serverSCSI disk serverSun Blade 1000Sun Blade 1000Dual SparcIII 750 MHzDual SparcIII 750 MHz18 GB SCSI FC disk18 GB SCSI FC disk8 x 73 GB SCSI RAID58 x 73 GB SCSI RAID5

CRD-5440CRD-5440

Page 9: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 99

History: up to now and in the near futureHistory: up to now and in the near future

2002 - 2004. Upgrades2002 - 2004. UpgradesAdditional EIDE PC Server with 20 x 200 Additional EIDE PC Server with 20 x 200

GB disksGB disksPowerful machine (Dual Xeon). 4 RAID5 Powerful machine (Dual Xeon). 4 RAID5 partitions allowing 3 TB of disk spacepartitions allowing 3 TB of disk space

PC server for Oracle/DB with 12 x 200 GB PC server for Oracle/DB with 12 x 200 GB disksdisks

To contain event databaseTo contain event database

HP PC Server with 6 x 142 GB SCSI disksHP PC Server with 6 x 142 GB SCSI disksSTK Tape Library upgrade from 20 to 40 STK Tape Library upgrade from 20 to 40

slotsslotsNow allows to store up to 8 TB of dataNow allows to store up to 8 TB of data

Ultrium2 Tape Drive for STK Tape LibraryUltrium2 Tape Drive for STK Tape LibraryUp to 400 GB/cartridge, up to 70 MB/s Up to 400 GB/cartridge, up to 70 MB/s transfer ratetransfer rate

8 PC Clients8 PC Clients Rack mount Dual Rack mount Dual Opteron processor Opteron processor machinesmachines

2005. Financed2005. FinancedUpgrade of disk spaceUpgrade of disk space(SATA disks rack)(SATA disks rack)

EIDE Disk ServerEIDE Disk ServerIntel SE7500CW2Intel SE7500CW2Dual Xeon 2 GHzDual Xeon 2 GHz1 GB RAM1 GB RAM2 x 40 GB + 20 x 200 GB 2 x 40 GB + 20 x 200 GB

ATAATANetgear GA 621Netgear GA 621

Oracle ServerOracle ServerSuperMicro X5DP8-G2SuperMicro X5DP8-G2Dual Xeon 2.4 GHzDual Xeon 2.4 GHz2 GB RAM2 GB RAM2 x 20 GB + 12 x 200 GB ATA2 x 20 GB + 12 x 200 GB ATA3com 3C996-SX3com 3C996-SX

HP Proliant ML530G2HP Proliant ML530G2Dual Xeon 2.8 GHzDual Xeon 2.8 GHz2 GB RAM2 GB RAM2 x 36 + 6 x 146.8 SCSI2 x 36 + 6 x 146.8 SCSIGigabitGigabit

Newisys 2100Newisys 2100Dual Opteron 250Dual Opteron 2502 GB RAM2 GB RAM2x36 GB SCSI2x36 GB SCSI

Page 10: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1010

ACID Farm w.r.t. CERN farm: ACID Farm w.r.t. CERN farm: HardwareHardware

The choicesThe choicesClients. No alternatives due to cost difference: use PCs. Clients. No alternatives due to cost difference: use PCs.

But…But…At CERN there are short hardware upgrade periods At CERN there are short hardware upgrade periods use use “old”, good quality (e.g. Intel chipsets), well Linux tested “old”, good quality (e.g. Intel chipsets), well Linux tested (certified) hardware(certified) hardware

Here hardware lifetime is longer Here hardware lifetime is longer use “recent” hardware use “recent” hardware (as it becomes “dated” really fastly), middle quality (e.g. VIA (as it becomes “dated” really fastly), middle quality (e.g. VIA chipset, for cost reasons), may be not yet completely Linux chipset, for cost reasons), may be not yet completely Linux certifiedcertifiedWhat we learnedWhat we learned. “Whitebox” PC are quite fragile. In . “Whitebox” PC are quite fragile. In particular EIDE disks are particular EIDE disks are veryvery fragile, and the worst piece to fragile, and the worst piece to be replace due to need of data recovery. High quality disks be replace due to need of data recovery. High quality disks are preferable (if possible).are preferable (if possible).

EIDE disk server shows a great performance/cost ratioEIDE disk server shows a great performance/cost ratioNot completely tested at beginning, but looked nice and the Not completely tested at beginning, but looked nice and the difference in cost with SCSI based servers (a factor three) difference in cost with SCSI based servers (a factor three) looked too attractivelooked too attractiveWhat we learnedWhat we learned. See above comment on disks.. See above comment on disks.

The SunThe SunAlso at CERN the is a SUNDEV cluster made available for Also at CERN the is a SUNDEV cluster made available for code quality checking. In addition, there are some services code quality checking. In addition, there are some services still run on Suns for stability or commercial software still run on Suns for stability or commercial software requirement reasons requirement reasons

Page 11: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1111

ACID Farm w.r.t. CERN Farm: ACID Farm w.r.t. CERN Farm: SoftwareSoftware

Requirements and solutionsRequirements and solutionsCompatible as much as possibleCompatible as much as possible

Programs should run without recompilation Programs should run without recompilation Use same kernel, C- Use same kernel, C-library, and compilerslibrary, and compilersUsers should find similar environment Users should find similar environment Use same Linux distributionUse same Linux distributionUse CERN patches if they helpUse CERN patches if they help

Independent as much as possibleIndependent as much as possibleDo not use too-CERN-specific tools like SUE (hard to port, not so Do not use too-CERN-specific tools like SUE (hard to port, not so useful)useful)Use official distributions (RedHat) and not CERN “adapted” onesUse official distributions (RedHat) and not CERN “adapted” onesDo not use CERN patches if they do not helpDo not use CERN patches if they do not help

Chose something else if nothing available or simply if there is Chose something else if nothing available or simply if there is something better around:something better around:

CERN batch solution too expensive (LSF), nothing interesting at INFN CERN batch solution too expensive (LSF), nothing interesting at INFN level level use use SGESGE: free, good, supported: free, good, supportedMonitoring: Monitoring: BigBrotherBigBrother is free and well done (just little complicated is free and well done (just little complicated to install)to install)Software documenting too: found Software documenting too: found doxygendoxygen, it is so good that it was , it is so good that it was subsequently adopted by CERN and now available in many Linux subsequently adopted by CERN and now available in many Linux distributionsdistributions

Page 12: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1212

ACID w.r.t. CERN Farm: Commercial ACID w.r.t. CERN Farm: Commercial SoftwareSoftware

We try to avoid it, if possible (it costs and it is source of We try to avoid it, if possible (it costs and it is source of troubles)troubles)

CERN attempt to go for “commercial-only software” dramatically CERN attempt to go for “commercial-only software” dramatically failed!failed!

In general: too difficult to interface to HEP environmentIn general: too difficult to interface to HEP environmentIn general: it never completely fits with HEP requirementsIn general: it never completely fits with HEP requirementsIn general: not able to follow the fast Linux and GNU software In general: not able to follow the fast Linux and GNU software evolution (e.g. compiler: we are forced to use quite outdated and evolution (e.g. compiler: we are forced to use quite outdated and now unsupported gcc compilers. Objectivity/DB needed gcc 2.95.2, now unsupported gcc compilers. Objectivity/DB needed gcc 2.95.2, ORACLE needs gcc 2.95.3 or 2.96 and only recently gcc 3.2; current ORACLE needs gcc 2.95.3 or 2.96 and only recently gcc 3.2; current gcc version is 3.4)gcc version is 3.4)Expansive or whit unsatisfactory support (and, in any case, no source Expansive or whit unsatisfactory support (and, in any case, no source code available: so no way to fix problems by ourselves)code available: so no way to fix problems by ourselves)

So, the current idea is to use commercial software only where So, the current idea is to use commercial software only where there are not alternativesthere are not alternatives

Basically only DBMS (Basically only DBMS (Objectivity/DB 6 Objectivity/DB 6 before,before, ORACLE 9i ORACLE 9i after): too after): too difficult to develop an HEP specific DBMS. Well, free DBMS are difficult to develop an HEP specific DBMS. Well, free DBMS are available too (e.g. MySQL), but it is too dangerous to follow a available too (e.g. MySQL), but it is too dangerous to follow a solution different with the CERN one on this subject… solution different with the CERN one on this subject…

Page 13: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1313

ACID w.r.t. CERN Farm: HEP LinuxACID w.r.t. CERN Farm: HEP Linux

Due to RedHat change of philosophy during 2003Due to RedHat change of philosophy during 2003Free distribution Free distribution “Fedora Project”“Fedora Project”

Free distribution with a release period of 4-6 month (too fast for HEP Free distribution with a release period of 4-6 month (too fast for HEP needs) and just 3 months support/patching of previous release (too needs) and just 3 months support/patching of previous release (too short for HEP needs)short for HEP needs)

Commercial distribution Commercial distribution “Enterprise”“Enterprise”Commercial distribution with 5 years support of previous release but too Commercial distribution with 5 years support of previous release but too expensive!expensive!

HEP ReactionsHEP ReactionsMandate to the 3 HEP big labs to negotiate with RedHat, but at the Mandate to the 3 HEP big labs to negotiate with RedHat, but at the

end…end…FNALFNAL

Rebuild RHEL from source (legal if done without violating RedHat Rebuild RHEL from source (legal if done without violating RedHat copyrights!) Scientific Linux 3.x. Other HEP labs joined FNAL in copyrights!) Scientific Linux 3.x. Other HEP labs joined FNAL in developing and supporting SLdeveloping and supporting SL

CERNCERNSLC3 (a local flavour of FNAL’s SL). We certified it on Nov. 1SLC3 (a local flavour of FNAL’s SL). We certified it on Nov. 1stst, 2004; and , 2004; and now it is the official Linux distributed at CERN. But some (~200) RHEL3-now it is the official Linux distributed at CERN. But some (~200) RHEL3-WS were bought too. WS were bought too.

SLACSLACRHEL (got “via” DOE) is the main distribution. BRHEL (got “via” DOE) is the main distribution. BAABBARAR certified SL too certified SL too (SLC was certified to run binaries but not (yet?) to build the codes). (SLC was certified to run binaries but not (yet?) to build the codes).

Page 14: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1414

ACID w.r.t. CERN Farm: software, what is ACID w.r.t. CERN Farm: software, what is now changednow changed

Keep CERN compatibility, as it’s easier and less expensive…Keep CERN compatibility, as it’s easier and less expensive…GoodGood

CERN port will be less specific (no more SUE, etc.)CERN port will be less specific (no more SUE, etc.)No more “alternative gcc” compilersNo more “alternative gcc” compilersBut with additional “wanted” packages (CASTOR, patched kernels, CERN But with additional “wanted” packages (CASTOR, patched kernels, CERN TeX styles, PINE, …) not available from RedHat distributions. TeX styles, PINE, …) not available from RedHat distributions. ACID now uses CERN SLC distribution with an adapted installation setup ACID now uses CERN SLC distribution with an adapted installation setup instead of use RedHat or SL distribution plus add-ons. instead of use RedHat or SL distribution plus add-ons.

Something Bad?Something Bad?The port will be supported for 1-2 years. And after?The port will be supported for 1-2 years. And after?The RHEL option still present. That could mean extra costs for software The RHEL option still present. That could mean extra costs for software (now we use RHEL (AS2.1) just on the ORACLE server machine). In that (now we use RHEL (AS2.1) just on the ORACLE server machine). In that case an I.N.F.N. wide license solution would be a better solution. We case an I.N.F.N. wide license solution would be a better solution. We have just to wait and see… have just to wait and see…

Page 15: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1515

Farm management: man power costs Farm management: man power costs (SW)(SW)

Distribution UpgradeDistribution UpgradeIt is a major task as a local certification is needed tooIt is a major task as a local certification is needed too

All applications need to be testedAll applications need to be testedAll nodes need to be re-installed from scratchAll nodes need to be re-installed from scratchIn general it requires more than a month preparation timeIn general it requires more than a month preparation timeNot too frequent: one every few years (~2)Not too frequent: one every few years (~2)

Software InstallationSoftware InstallationComplexity and test-debug period depend on packageComplexity and test-debug period depend on package

Could be a strong work (e.g. CASTOR/HSM porting: many months of Could be a strong work (e.g. CASTOR/HSM porting: many months of work)work)Time-to-time, upgrades/updates are neededTime-to-time, upgrades/updates are needed

PatchingPatchingIn general simple but quite frequent (security patches)In general simple but quite frequent (security patches)

In the past it needed a lot of time (e.g. as we used a locally patched In the past it needed a lot of time (e.g. as we used a locally patched kernel, we need a complete kernel recompilation after every official kernel, we need a complete kernel recompilation after every official patch). Now thing are easier thanks to tools like APT or YUM. patch). Now thing are easier thanks to tools like APT or YUM. Risk of troubles after a patch is not frequent but not negligible: in Risk of troubles after a patch is not frequent but not negligible: in particular after Kernel updatesparticular after Kernel updates

Page 16: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1616

Farm management: man power costs Farm management: man power costs (HW)(HW)

New hardwareNew hardwarePurchasePurchase

Product choice, offers requests, “CONSIP”, …Very time consuming and Product choice, offers requests, “CONSIP”, …Very time consuming and generally boringgenerally boring

Installation and/or integrationInstallation and/or integrationIn general non complex, but in some cases needs timeIn general non complex, but in some cases needs time

Maintenance Maintenance (1)Many parts of the farm are no more covered by warranty nor Many parts of the farm are no more covered by warranty nor

under outsourced maintenanceunder outsourced maintenanceBroken parts (disks, boards, …) need to be replaced by hand. That Broken parts (disks, boards, …) need to be replaced by hand. That takes a lot of timetakes a lot of timeAn Example:An Example:MicroStar 694D ProMicroStar 694D Pro mainboards mount bad quality electrolytic capacitors (from mainboards mount bad quality electrolytic capacitors (from TAYEHTAYEH). Over 11 boards, on 7 there were failures due to that capacitors ). Over 11 boards, on 7 there were failures due to that capacitors leakage. Intervention requires a complete PC dismount, board removal, leakage. Intervention requires a complete PC dismount, board removal, capacitor replacement and re-mount. On two boards capacitor failure damaged capacitor replacement and re-mount. On two boards capacitor failure damaged following electronics: in those cases mainboard replacement where necessary.following electronics: in those cases mainboard replacement where necessary.

Power loss (HW failures were many times due to overheating).Power loss (HW failures were many times due to overheating).Quite (better: too) frequent in AREA. No cooling for long periods with Quite (better: too) frequent in AREA. No cooling for long periods with consequent machines overheating (In addition T02 room was consequent machines overheating (In addition T02 room was definitively too small compared to the hardware installed inside, now definitively too small compared to the hardware installed inside, now size doubled, but machinery too). size doubled, but machinery too).

Page 17: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1717

The good and the badThe good and the badAs said: the first Linux Compute Farm installed and managed in an As said: the first Linux Compute Farm installed and managed in an INFN LabINFN LabFirst COMPASS home-lab farm in productionFirst COMPASS home-lab farm in productionOne of the first CASTOR/HSM installation outside CERNOne of the first CASTOR/HSM installation outside CERN

and probably the first one (outside CERN) in production and probably the first one (outside CERN) in production

First “in production” ORACLE database replica of part of First “in production” ORACLE database replica of part of (COMPASS) events outside CERN (COMPASS) events outside CERN Heavily used by COMPASS-Trieste groupHeavily used by COMPASS-Trieste group

Data analysis, Monte Carlo production, RICH software development and Data analysis, Monte Carlo production, RICH software development and analysis, …analysis, …

““Borrowed” for other Trieste groups works (LEP, …) Borrowed” for other Trieste groups works (LEP, …)

It is an “in production” apparatusIt is an “in production” apparatusInterventions have to be immediate, quick (& Interventions have to be immediate, quick (& NOTNOT dirt) dirt)It requires a continuous monitoring: i.e. someone always has to be present It requires a continuous monitoring: i.e. someone always has to be present “nearby T02”“nearby T02”It always “evolve” (software updates, hardware upgrades) and that requires It always “evolve” (software updates, hardware upgrades) and that requires manpowermanpowerIt is fragile: the probability of failures is highIt is fragile: the probability of failures is highParts of software need to be updated and checked very frequently (even Parts of software need to be updated and checked very frequently (even every day or so)every day or so)It is difficult to have a day without need of interventions somewhere inside It is difficult to have a day without need of interventions somewhere inside the farmthe farm

Page 18: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1818

What nextWhat next

A new project: the A new project: the “Farm di Sezione”“Farm di Sezione”To (try to) merge all local farms in a kind of To (try to) merge all local farms in a kind of unique entityunique entity. .

It is again something relatively new inside INFN sitesIt is again something relatively new inside INFN sites

It involves It involves Gruppo CalcoloGruppo Calcolo and several experiments people and several experiments people from existing farms (ALICE and COMPASS) and new from existing farms (ALICE and COMPASS) and new onesones

Room expanded: T02 Room expanded: T02 T01+T02T01+T02Cooling was poweredCooling was poweredDiscussion started: to find common requirements and Discussion started: to find common requirements and

evaluate incompatibilitiesevaluate incompatibilitiesHardware is available and is being installedHardware is available and is being installedR&D will start soon, hopefully during this month R&D will start soon, hopefully during this month

(compatibility tests between current farms (compatibility tests between current farms environments, etc.)environments, etc.)

Consequences on the ACIDs: too early to say anything, Consequences on the ACIDs: too early to say anything, anyhow the ACID team is collaborating in the anyhow the ACID team is collaborating in the “Farm di “Farm di Sezione”Sezione” setup… setup…

Page 19: Benigno Gobbo 1 Incontro con Nando 8 February 2005 More than Four Years of Compute Farm Benigno Gobbo Benigno.gobbo@cern.ch Info:

Incontro con NandoIncontro con Nando8 February 20058 February 2005 Benigno Gobbo Benigno Gobbo 1919

Acknowledges and ConclusionsAcknowledges and ConclusionsThanks toThanks toR. BirsaR. Birsa

Sun ManagementSun ManagementHelp in software installation and Help in software installation and debuggingdebugging (e.g. CASTOR would (e.g. CASTOR would never be installed without his accurate work on it) never be installed without his accurate work on it)

V. DuicV. DuicData (DB) import, job parallelization tools Data (DB) import, job parallelization tools

All people fromAll people from Gruppo Calcolo Gruppo CalcoloOffer requestsOffer requestsConsultancyConsultancy

To concludeTo concludeThis farm shows that at INFN-Trieste there is a not negligible IT This farm shows that at INFN-Trieste there is a not negligible IT knowledge (compared to other INFN sites)knowledge (compared to other INFN sites)Computing is becoming more and more relevant in HEP Computing is becoming more and more relevant in HEP experiments. It will probably be dominant (in good and bad) at experiments. It will probably be dominant (in good and bad) at LHCLHCUnfortunately (Unfortunately (this is an opinion of minethis is an opinion of mine) INFN looks ) INFN looks NOTNOT so so pioneering on that field as in all others aspects HEPpioneering on that field as in all others aspects HEP