
Southgrid Technical Meeting

Pete Gronbech: 26th August 2005, Oxford

Present

• Pete Gronbech – Oxford
• Ian Stokes-Rees – Oxford
• Chris Brew – RAL PPD
• Santanu Das – Cambridge
• Yves Coppens – Birmingham

Agenda

• Chat
• 10:30 Coffee
• Pete + Others
• 1pm Lunch
• Interactive Workshop!!
• 3:15pm Coffee
• Finish

Southgrid Member Institutions

• Oxford
• RAL PPD
• Cambridge
• Birmingham
• Bristol
• HP-Bristol
• Warwick

Stability, Throughput and Involvement

• The last quarter has been a good, stable period for SouthGrid
• Addition of Bristol PP
• 4 out of 5 sites already upgraded to 2_6_0
• Large involvement in the Biomed DC

Monitoring

• http://www.gridpp.ac.uk/ganglia/
• http://map.gridpp.ac.uk/
• http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi (configure view UKI)
• http://www.physics.ox.ac.uk/users/gronbech/gridmon.htm
• Dave Kant's helpful doc in the minutes of a tbsupport meeting links to http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridpp_view.php
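The Ganglia pages above are built from the XML state that each site's gmond daemon publishes. As a rough illustration (not part of the talk), the sketch below polls gmond's default XML port and counts the hosts that are still reporting; the head-node hostname is hypothetical and the port and heartbeat threshold are assumptions to adjust for a real site.

```python
# Minimal sketch: poll a Ganglia gmond daemon for its cluster XML dump and
# count reporting hosts. Hostname is hypothetical; 8649 is gmond's default port.
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "grid-head.example.ac.uk"  # hypothetical head node
GMOND_PORT = 8649                       # gmond default XML port

def fetch_gmond_xml(host, port, timeout=10):
    """Read the full XML state dump that gmond sends on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def summarise(xml_bytes):
    root = ET.fromstring(xml_bytes)
    for cluster in root.iter("CLUSTER"):
        hosts = cluster.findall("HOST")
        # TN = seconds since the host last reported; < 600 s counts as "up" here
        up = [h for h in hosts if int(h.get("TN", "0")) < 600]
        print(f"{cluster.get('NAME')}: {len(up)}/{len(hosts)} hosts reporting")

if __name__ == "__main__":
    summarise(fetch_gmond_xml(GMOND_HOST, GMOND_PORT))
```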

Ganglia mods for Oxford August 2005

Status at RAL PPD

• SL3 cluster on LCG 2.6.0
• CPUs: 11 × 2.4 GHz, 33 × 2.8 GHz
  – 100% Dedicated to LCG
• 0.7 TB Storage
  – 100% Dedicated to LCG
• Configured 6.4 TB of IDE RAID disks for use by dCache
• 5 systems to be used for the pre-production testbed

RAL 2

• dCache installation
• Pre-Production?
• Upgrade to 2_6_0 report
  – With early yaim the R-GMA mon node did not work for upgrades (only fresh installs). There were problems with the Tomcat connector order (the openssl connector before the insecure connector; with no certificate present this had to be fixed). The latest release of yaim and CERT is OK.
  – yum name changes caused some problems; the perl api rpm needs deleting.
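As a rough sketch of the kind of cleanup mentioned above: find the installed RPM that clashes after the yum name change and print the removal command. The package pattern is a placeholder, since the slide does not name the exact perl API rpm; check rpm -qa output before removing anything.

```python
# Hedged sketch: locate a stale RPM by name fragment and print (not run) the
# removal command so an admin can review it. The pattern is hypothetical.
import subprocess

PATTERN = "perl-api"  # placeholder; substitute the real package name

def find_packages(pattern):
    """Return installed package names matching the pattern (via rpm -qa)."""
    out = subprocess.run(["rpm", "-qa"], capture_output=True, text=True, check=True)
    return [p for p in out.stdout.splitlines() if pattern in p]

if __name__ == "__main__":
    for pkg in find_packages(PATTERN):
        print(f"rpm -e {pkg}")
```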

Status at Cambridge

• Currently LCG 2.6.0 on SL3

• CPUs: 42 × 2.8 GHz (of the extra nodes, only 2 out of 10 are any good)
  – 100% Dedicated to LCG
• 2 TB Storage (have 3 TB but only 2 TB available)
  – 100% Dedicated to LCG
• Condor batch system
• Lack of Condor support from the LCG teams

Cambridge 2

• CamGrid – LCG interaction
  – All nodes would need the LCG WN software in order to make them available everywhere
  – The GridPP CE has the central manager; nodes have two IPs, one Cambridge private and one LCG public
  – condor user1 and condor user2
  – ATLAS jobs not working: the software is installed but not verified, due to the above problems
• Condor issues
• Monitoring / Accounting
  – Ganglia installed, nearly ready; need to inform A. McNab
• Upgrade (2_6_0) report
  – Same rpm problems
  – R-GMA fixes from Yves
  – Tomcat
  – Overall quite easy compared with previous releases
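The dual-homed worker-node setup above (one private Cambridge address, one public LCG address) can be spot-checked with something like the sketch below; both hostnames are hypothetical placeholders, not the real Cambridge names.

```python
# Minimal sketch: confirm that a dual-homed worker node resolves on both its
# private (CamGrid) and public (LCG) names. Hostnames are hypothetical.
import socket

NODE_NAMES = {
    "private": "wn01.camgrid.private",       # hypothetical private name
    "public":  "wn01.gridpp.example.ac.uk",  # hypothetical public name
}

for label, name in NODE_NAMES.items():
    try:
        addr = socket.gethostbyname(name)
        print(f"{label:8s} {name} -> {addr}")
    except socket.gaierror as err:
        print(f"{label:8s} {name} -> lookup failed ({err})")
```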

Status at Bristol

• Status
  – Yves and Pete installed SL304 and LCG-2_4_0 and the site went live on July 5th 2005. Yves upgraded to 2_6_0 in the last week of July as part of pre-release testing.
• Existing resources
  – 80-CPU BaBar farm moved to Birmingham
  – GridPP nodes plus local cluster nodes used to bring the site on line; the local cluster still needs to be integrated
• New resources
  – Funding now confirmed for a large University investment in hardware
  – Includes CPU, high-quality disk and scratch disk resources
• Humans
  – New system manager post (RG) should be in place
  – New SouthGrid support / development post (GridPP / HP) being filled
  – HP have moved the ia64 machines on to cancer research due to lack of use by LCG

Status at Birmingham

• Currently SL3 with LCG-2_6_0

• CPUs: 24 × 2.0 GHz Xeon (+48 local nodes which could in principle be used, but…)
  – 100% LCG
• 1.8 TB classic SE
  – 100% LCG
• BaBar farm moving to SL3 and the Bristol nodes integrated, but not yet on LCG

Birmingham 2

• BaBar cluster expansion
• LCG-2_6_0 early testing in July
• Involvement in the Pre-Production Grid
• Installation of DPM? (a hedged migration sketch follows this list)
  – How to migrate the data
  – or just close the old SE
• Integration of local users vs grid users
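One hedged way to drain a classic SE into a new DPM is to copy each file over GridFTP with globus-url-copy, as sketched below. Host names, paths and the file list are placeholders, and re-registering the replicas in the file catalogue is a separate step not shown here.

```python
# Hedged sketch: copy files from an old classic SE to a new DPM over GridFTP.
# Endpoints and the file list are hypothetical placeholders.
import subprocess

OLD_SE = "gsiftp://old-se.example.ac.uk/storage"                  # hypothetical classic SE
NEW_SE = "gsiftp://new-dpm.example.ac.uk/dpm/example.ac.uk/home"  # hypothetical DPM path

FILES = [
    "atlas/data/file001.root",  # placeholder file list
    "atlas/data/file002.root",
]

for relpath in FILES:
    src = f"{OLD_SE}/{relpath}"
    dst = f"{NEW_SE}/{relpath}"
    print(f"copying {src} -> {dst}")
    subprocess.run(["globus-url-copy", src, dst], check=True)
```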

Status at Oxford

• Currently LCG 2.4.0 on SL304
• All 74 CPUs running since ~June 20th
• CPUs: 80 × 2.8 GHz
  – 100% LCG
• 1.5 TB Storage – a second 1.5 TB will be brought on line as DPM or dCache
  – 100% LCG
• Some further air conditioning problems, now resolved for Room 650; the second rack is in the overheating basement
• Heavy use by Biomed during their DC
• Plan to give local users access

Oxford 2

• Need to upgrade to 2_6_0 next week
• Early testing of 2_6_0 in July on tbce01
• Integration with the PP cluster to give local access to grid queues

Security

• Best practices link: https://www.gridpp.ac.uk/deployment/security/index.html
• Wiki entry: http://goc.grid.sinica.edu.tw/gocwiki/AdministrationFaq
• iptables? – Birmingham to share their setup on the SouthGrid web pages. Completed. (A hedged sketch of the sort of rules involved follows below.)
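For illustration only, the sketch below emits iptables rules that allow a few standard LCG service ports and drop other inbound traffic. The port list is an assumption (gatekeeper 2119, MDS 2135, GridFTP control 2811, plus a site-chosen GLOBUS_TCP_PORT_RANGE), not Birmingham's actual setup; a real site should follow the shared configuration.

```python
# Hedged sketch: generate (and print, rather than apply) a minimal inbound
# iptables policy for an LCG node. Ports and the data range are illustrative.
SERVICE_PORTS = [2119, 2135, 2811]   # gatekeeper, MDS/GRIS, GridFTP control
DATA_PORT_RANGE = "20000:25000"      # example GLOBUS_TCP_PORT_RANGE

rules = [
    "iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    "iptables -A INPUT -i lo -j ACCEPT",
    "iptables -A INPUT -p tcp --dport 22 -j ACCEPT",  # ssh for admins
]
rules += [f"iptables -A INPUT -p tcp --dport {p} -j ACCEPT" for p in SERVICE_PORTS]
rules.append(f"iptables -A INPUT -p tcp --dport {DATA_PORT_RANGE} -j ACCEPT")
rules.append("iptables -A INPUT -j DROP")

# Print so an admin can review before applying anything.
print("\n".join(rules))
```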

Action Plan for Bristol

• Plan to visit on June 9th to install an installation server (a sketch of the PXE piece follows this list)
  – DHCP server
  – NFS copies of SL (local mirror)
  – PXE boot setup etc.
• Second visit to reinstall the head nodes with SL304 and LCG-2_4_0, plus some worker nodes
• BaBar cluster to go to Birmingham
  – Fergus, Chris and Yves to liaise
• Completed
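As a rough sketch of the PXE piece of such an installation server, the snippet below writes a pxelinux.cfg/default entry that network-boots nodes into a Scientific Linux kickstart install. The tftp root, NFS mirror host and kickstart path are hypothetical placeholders for whatever the site mirror actually uses.

```python
# Hedged sketch: write a PXELINUX default config pointing nodes at an SL304
# kickstart install served over NFS. All paths and hostnames are placeholders.
from pathlib import Path

TFTP_ROOT = Path("/tftpboot")                 # assumed tftp root
NFS_SERVER = "install.example.ac.uk"          # hypothetical NFS mirror host
KICKSTART = f"nfs:{NFS_SERVER}:/export/sl304/ks.cfg"

PXE_CONFIG = f"""DEFAULT sl304
PROMPT 0
LABEL sl304
  KERNEL sl304/vmlinuz
  APPEND initrd=sl304/initrd.img ks={KICKSTART}
"""

cfg_dir = TFTP_ROOT / "pxelinux.cfg"
cfg_dir.mkdir(parents=True, exist_ok=True)
(cfg_dir / "default").write_text(PXE_CONFIG)
print(f"wrote {cfg_dir / 'default'}")
```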

Action plan for SouthGRID

• Ensure all sites are up to date for GridPP14 (Oxford)
• SRM installations
• SC4 preparations
• LHC DC awareness

Grid site wiki

• http://www.gridsite.org/wiki/main_page
• http://www.physics.gla.ac.uk/gridpp/datamanagement

LCG Deployment Schedule

[Timeline chart: SC2 → SC3 → SC4 → LHC Service Operation across 2005–2008, with cosmics, first beams, first physics and the full physics run marked; each service challenge runs through preparation, setup and service phases.]

• Apr05 – SC2 complete
• Jun05 – Technical Design Report
• Jul05 – SC3 throughput test
• Sep05 – SC3 service phase
• Dec05 – Tier-1 network operational
• Apr06 – SC4 throughput test
• May06 – SC4 service phase starts
• Sep06 – Initial LHC Service in stable operation
• Apr07 – LHC Service commissioned

Overall Schedule (Raw-ish)

[Chart: proposed experiment data-challenge slots for ALICE, ATLAS, CMS and LHCb across September–December; the exact weeks per experiment were shown graphically in the original slide.]

Service Challenge 4 – SC4

• SC4 starts April 2006
• SC4 ends with the deployment of the FULL PRODUCTION SERVICE
• Deadline for component (production) delivery: end January 2006
• Adds further complexity over SC3
  – Additional components and services
  – Analysis use cases
  – SRM 2.1 features required by the LHC experiments
  – All Tier-2s (and Tier-1s…) at full service level
  – Anything that dropped off the list for SC3…
  – Services oriented at analysis and end-users
  – What are the implications for the sites?
• Analysis farms (see the rough tally after this slide):
  – Batch-like analysis at some sites (no major impact on sites)
  – Large-scale parallel interactive analysis farms at major sites
  – (100 PCs + 10 TB storage) × N
• User community:
  – No longer a small (<5) team of production users
  – 20-30 work groups of 15-25 people
  – Large numbers (100s – 1000s) of users worldwide
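For a rough sense of scale only: N is not fixed in the slide, so the tally below assumes a purely illustrative N of 10 participating major sites.

```python
# Illustrative tally of the "(100 PCs + 10 TB storage) x N" figure above.
N_SITES = 10        # hypothetical number of analysis-farm sites
PCS_PER_SITE = 100
TB_PER_SITE = 10

print(f"{N_SITES * PCS_PER_SITE} PCs and {N_SITES * TB_PER_SITE} TB in total")
# -> 1000 PCs and 100 TB in total
```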

SC4 Timeline

• September 2005: first SC4 workshop(?) – 3rd week September proposed

• January 31st 2006: basic components delivered and in place

• February / March: integration testing

• February: SC4 planning workshop at CHEP (w/e before)

• March 31st 2006: integration testing successfully completed

• April 2006: throughput tests

• May 1st 2006: Service Phase starts (note compressed schedule!)

• September 1st 2006: Initial LHC Service in stable operation

• Summer 2007: first LHC event data