TRANSCRIPT
The German HEP Community Grid
[email protected], for the German HEP Community Grid
27-March-2007, ISGC2007, Taipei
Agenda:
D-Grid in context
HEP Community Grid
HEP-CG Work Packages
Summary
~10,000 scientists from 1,000 institutes in more than 100 countries investigate basic problems of particle physics with the help of huge accelerators.
D-Grid in context: e-Science in Germany
[Timeline 2000-2010: EDG → EGEE → EGEE 2 → EGEE 3?; LCG R&D → WLCG ramp-up; GridKa/GGUS; LHC pp run Mar.-Sep., HI run Oct.; "Today" marker at March 2007. D-Grid Initiative: DGI → DGI 2, Community Grids (incl. HEP CG), commercial uptake of services.]
www.d-grid.de
e-Science = Grid Computing & Knowledge management & e-Learning
Community Grids: AstroGrid, MediGrid, C3Grid, HEP CG, InGrid, TextGrid
(see talk of Anette Weisbecker, in Life Sciences I)
Generic platform and generic Grid services: D-Grid Integration Project
[Map: D-Grid core sites: PC², RRZN, TUD, RZG, LRZ, RWTH, FZJ, FZK, FHG/ITWM, Uni-KA]
D-Grid WPs: Middleware & Tools, Infrastructure, Network & Security, Management & Sustainability
Middleware: Globus 4.x, gLite (LCG), UNICORE, GAT and GridSphere
Data management: SRM/dCache, OGSA-DAI, metadata schemas
VO management: VOMS and Shibboleth
(see talk of Thomas Fieseler, in Operation I)
(see talk of Michael Rambadt, in Middleware II)
LHC groups in Germany
Alice: Darmstadt, Frankfurt, Heidelberg, Münster
ATLAS: Berlin, Bonn, Dortmund, Dresden, Freiburg, Gießen, Heidelberg, Mainz, Mannheim, München, Siegen, Wuppertal
CMS: Aachen, Hamburg, Karlsruhe
LHCb: Heidelberg, Dortmund
German HEP institutes on the WLCG monitoring map.
WLCG: Karlsruhe (GridKa & Uni), DESY, GSI, München, Aachen, Wuppertal, Münster, Dortmund, Freiburg
HEP CG partners:
Project partners: Uni Dortmund, TU Dresden, LMU München, Uni Siegen, Uni Wuppertal, DESY (Hamburg & Zeuthen), GSI
via subcontract: Uni Freiburg, Konrad-Zuse-Zentrum Berlin
unfunded: Uni Mainz, HU Berlin, MPI f. Physik München, LRZ München, Uni Karlsruhe, MPI Heidelberg, RZ Garching, John von Neumann Institut für Computing, FZ Karlsruhe
Focus on tools to improve data analysis for HEP and astroparticle physics. Focus on gaps, do not reinvent the wheel.
Data management: advanced scalable data management; job and data co-scheduling; extendable metadata catalogues for lattice QCD and astroparticle physics
Job monitoring and automated user support: information services; improved job failure treatment; incremental results of distributed analysis
End-user data analysis tools: physics- and user-oriented job scheduling, workflows, automatic job scheduling
All development is based on LCG / EGEE software and will be kept compatible!
HEP CG WP1: Data Management
Coordination: P. Fuhrmann, DESY
Developing and supporting a scalable Storage Element based on Grid standards
(DESY, Uni Dortmund, Uni Freiburg; unfunded: FZK)
Combined job and data scheduling, accounting and monitoring of the data used
(Uni Dortmund)
Development of grid-based, extendable metadata catalogues with semantic, world-wide access
(DESY, ZIB; unfunded: Humboldt Uni Berlin, NIC)
Scalable Storage Element: dCache
The dCache project is funded by DESY, Fermilab, the Open Science Grid and in part by the Nordic Data Grid Facility.
HEP CG contributes:
Professional product management: code versioning, packaging, user support and test suites.
Smallest installation: only one host, ~10 TB, zero maintenance.
Largest installations: thousands of pools, PB of disk storage, hundreds of file transfers per second, not more than 2 FTEs.
dCache: The principle
[Diagram: dCache architecture. Protocol engines provide streaming data access via (gsi)FTP and http(g) and POSIX I/O via xRootd and dCap; storage control via SRM and EIS (information protocol); the dCache controller manages the disk storage, with an HSM adapter to backend tape storage.]
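To make the POSIX-style access paths above concrete, here is a minimal ROOT sketch that opens files through dCache's dCap and xrootd doors; the host names, ports and file paths are made-up examples, not sites from this talk.

// Minimal sketch: reading dCache-hosted files from ROOT through its
// POSIX-style doors. Host names, ports and paths are hypothetical.
#include "TFile.h"
#include "TH1.h"

void read_from_dcache() {
   // dCap door (ROOT dispatches dcap:// URLs to its dCache plugin)
   TFile *f1 = TFile::Open("dcap://dcache.example.de:22125/pnfs/example.de/data/run1.root");
   // xrootd door
   TFile *f2 = TFile::Open("root://dcache.example.de:1094//pnfs/example.de/data/run2.root");
   if (f1 && !f1->IsZombie()) {
      TH1 *h = dynamic_cast<TH1*>(f1->Get("hpt"));  // read an example histogram
      if (h) h->Print();
   }
   delete f1;
   delete f2;
}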
CPU and data co-scheduling: online vs. nearline files, information about the time needed to get a file online.
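A rough sketch of that co-scheduling decision follows; the two helper calls (queryLocality, estimateStagingSeconds) are placeholders standing in for the real SRM queries against the Storage Element, not an existing API.

// Sketch of the co-scheduling idea: before dispatching a job, ask the storage
// element whether an input file is already on disk (online) or only on tape
// (nearline), and feed an estimate of the staging time into the scheduling
// decision. The helpers below are placeholders, not a real SRM client API.
#include <string>

enum FileLocality { ONLINE, NEARLINE, UNAVAILABLE };

// Placeholder: a real implementation would query the SRM interface of dCache.
FileLocality queryLocality(const std::string & /*surl*/) { return NEARLINE; }

// Placeholder: e.g. derived from tape queue length and file size.
double estimateStagingSeconds(const std::string & /*surl*/) { return 600.0; }

struct SchedulingHint {
   bool   runNow;        // true: dispatch the job immediately
   double waitSeconds;   // expected delay until the input file is usable
};

SchedulingHint coSchedule(const std::string &surl) {
   switch (queryLocality(surl)) {
      case ONLINE:   return {true, 0.0};                            // data already on disk
      case NEARLINE: return {false, estimateStagingSeconds(surl)};  // stage from tape first
      default:       return {false, -1.0};                          // file cannot be used
   }
}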
HEP CG WP2: Job Monitoring + User Support Tools
Coordination: P. Mättig, Uni Wuppertal
Development of a job information system (TU Dresden)
Development of an expert system to classify job failures, automatic treatment of the most common errors
(Uni Wuppertal; unfunded: FZK)
R&D on interactive job steering and access to temporary, incomplete analysis job results
(Uni Siegen)
User-specific job and resource usage monitoring
[Diagram: job monitoring architecture. Each worker node runs job monitoring sensors and stepwise job execution monitoring alongside the user application (physics); monitoring data is published through a monitoring box (R-GMA); an analysis web service provides the interface to the monitoring systems (e.g. as R-GMA consumer); a portal server (GridSphere with a monitoring portlet) serves the results; the user gets interactive visualisations (overviews, details, timelines, histograms) in a browser applet.]
Integration into GridSphere
Focus on the many-jobs scenario.
Ease of use.
User should not need to know more than necessary, which should be almost nothing.
From general to detailed views on jobs.
Information like status, resource usage by jobs, output, time lines etc.
Interactivity: zoom in the display, clicking shows detailed information.
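As an illustration only, the views described above could be driven by a per-job record along the following lines; the field names are assumptions for this sketch, not the portlet's actual schema.

// Illustrative sketch of a per-job monitoring record behind such views.
// All field names are hypothetical.
#include <string>
#include <utility>
#include <vector>

struct JobRecord {
   std::string jobId;        // grid job identifier
   std::string status;       // e.g. scheduled, running, done (ok), done (failed)
   double      cpuSeconds;   // resource usage so far
   double      memoryMB;
   std::vector<std::string> outputTail;                    // last lines of job output
   std::vector<std::pair<std::string, double>> timeline;   // (status, timestamp) pairs
};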
Development of an expert system to classify job failures, automatic treatment of the most common errors.
[Diagram: LCG job states: submitted → waiting → ready → scheduled → running → done (ok) / done (failed) → cleared; cancelled, aborted. "What is going on here?"]
Motivation
Thousands of jobs/day in the LHC Computing Grid (LCG)
The job status at run time is hidden from the user
Manual error tracking is difficult and can take long
Current monitoring is more resource- than user-oriented (GridICE, …)
Therefore:
Monitoring on script level → JEM
Automation necessary → expert system
JEM: Job Execution Monitor
gLite/LCG worker node: pre-execution test
Supervision of commands (Bash, Python)
Status reports via R-GMA
Visualisation via GridSphere
Expert system for error classification
Integration into the ATLAS software environment
Integration in GGUS
post D-Grid I: automatic error correction, ... ?
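JEM itself is built around Bash and Python script supervision; the following C++ sketch only illustrates the rule-based classification idea behind the expert system, with made-up error patterns and categories.

// Toy model of rule-based job-failure classification (the expert-system idea).
// The rules and categories below are invented examples.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

struct Rule {
   std::regex  pattern;    // matched against the job's stderr/log
   std::string category;   // classified failure type
};

std::string classifyFailure(const std::string &log, const std::vector<Rule> &rules) {
   for (const auto &r : rules)
      if (std::regex_search(log, r.pattern)) return r.category;
   return "unknown - escalate to user support (GGUS)";
}

int main() {
   std::vector<Rule> rules = {
      {std::regex("No space left on device"), "full scratch disk on worker node"},
      {std::regex("Connection timed out"),    "storage element unreachable"},
      {std::regex("command not found"),       "broken software environment"},
   };
   std::cout << classifyFailure("sh: athena: command not found", rules) << "\n";
}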
HEP CG WP3: Distributed Interactive Data Analysis
Coordination: P. Malzacher, GSI (LMU, GSI; unfunded: LRZ, MPI M, RZ Garching, Uni Karlsruhe, MPI Heidelberg)
Optimize application-specific job scheduling; analyze and test the required software environment
Job management and Bookkeeping of distributed analysis
Distribution of analysis, sum-up of results
Interactive Analysis: Creation of a dedicated analysis cluster
Dynamic partitioning of Grid analysis clusters
Start with Gap Analysis
LMU:
Investigating Job-Scheduler requirements for distributed and interactive analysis
GANGA (ATLAS/LHCb) project shows good features for this task
Used for MC production, reconstruction and analysis on LCG
GSI:
Analysis based on PROOF
Investigating different versions of PROOF clusters
Connect ROOT and gLite: TGlite
class TGrid : public TObject {
public:
   …
   virtual TGridResult *Query ( … );
   static TGrid *Connect ( const char *grid,
                           const char *uid = 0, const char *pw = 0, … );
   ClassDef(TGrid, 0)
};
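A typical use of this abstract interface from a ROOT macro might look like the following sketch; the connection string and query arguments are illustrative, and "alien://" simply stands in for whichever back end (e.g. the gLite one developed here) is plugged in behind TGrid.

// Sketch: using the abstract TGrid interface from a ROOT macro.
// Connection string, catalogue path and pattern are illustrative only.
{
   TGrid *grid = TGrid::Connect("alien://");   // or a gLite back end
   if (grid) {
      // Query the file catalogue for matching data files.
      TGridResult *res = grid->Query("/grid/data/2007", "*.root");
      if (res) res->Print();
   }
}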
GANGA, job split approach
[Diagram: the user's query goes to a catalog; the data file list is split into jobs, each running myAna.C on its subset of files; the jobs are submitted through a manager to the queues; outputs go to storage and are merged for the final analysis. "Static" use of resources: jobs frozen, 1 job per worker node; splitting at the beginning, merging at the end; limited monitoring (only at the end of each single job).]
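The split step itself is simple; a minimal sketch of the idea (GANGA does this in Python, the C++ here is only for illustration and the function name is hypothetical):

// Toy sketch of the job-split idea: divide the data file list into N sub-jobs
// up front; the partial outputs are merged again at the end.
#include <string>
#include <vector>

std::vector<std::vector<std::string>>
splitFiles(const std::vector<std::string> &files, std::size_t nJobs) {
   std::vector<std::vector<std::string>> jobs(nJobs);
   for (std::size_t i = 0; i < files.size(); ++i)
      jobs[i % nJobs].push_back(files[i]);   // round-robin assignment of files
   return jobs;
}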
The PROOF approach
[Diagram: a PROOF query (data file list, myAna.C) is sent to the MASTER; a scheduler distributes files from catalog and storage across the farm; feedbacks come back in real time and the final outputs are merged. The farm is perceived as an extension of the local PC, with the same macro and syntax as in a local session; more dynamic use of resources, real-time feedback, automated splitting and merging.]
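As an illustration of "same macro, same syntax as in a local session", a PROOF-based ROOT sketch might look as follows; the master host, tree name and dataset locations are hypothetical.

// Sketch: running the same analysis on a PROOF cluster instead of locally.
// Host name, tree name and file locations are hypothetical examples.
{
   // Connect to the PROOF master.
   TProof *proof = TProof::Open("proofmaster.example.de");
   if (proof) {
      TChain chain("analysisTree");   // example tree name
      chain.Add("root://dcache.example.de//pnfs/example.de/data/run*.root");
      chain.SetProof();               // route Process() through the PROOF cluster
      chain.Process("myAna.C+");      // same selector macro as in a local session
   }
}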
Summary:
• Rather late compared to other national Grid initiatives, a German e-science program is well under way. It is built on top of 3 different middleware flavors: UNICORE, Globus 4 and gLite.
• The HEP-CG production environment is based on LCG / EGEE software.
• The HEP-CG focuses on gaps in three work packages: data management, automated user support and interactive analysis.
Challenges for HEP:
• Very heterogeneous disciplines and stakeholders.
• LCG/EGEE is not the basis for many of the other partners.
More information:
• I showed only a few highlights; for more info see:
http://www.d-grid.de