Proposal: LHC_07 R&D for ATLAS Grid computing
DESCRIPTION
Proposal: LHC_07 "R&D for ATLAS Grid computing". Tetsuro Mashimo, International Center for Elementary Particle Physics (ICEPP), The University of Tokyo, on behalf of the LHC_07 project team. 2013 Joint Workshop of FKPPL and FJPPL(TYL), June 4, 2013 @ Yonsei University, Seoul.
TRANSCRIPT
Proposal: LHC_07 R&D for ATLAS Grid computing
Tetsuro Mashimo, International Center for Elementary Particle Physics (ICEPP), The University of Tokyo, on behalf of the LHC_07 project team
2013 Joint Workshop of FKPPL and FJPPL(TYL), June 4, 2013 @ Yonsei University, Seoul
LHC_07 "R&D for ATLAS Grid computing"
• Cooperation between French and Japanese teams in R&D on ATLAS distributed computing, in order to face the important challenges of the coming years (in preparation for the LHC 14 TeV runs from 2015 on)
• Important challenges of the coming years: new computing model, hardware, software, and networking issues
• Partners: the International Center for Elementary Particle Physics (ICEPP), The University of Tokyo (the WLCG Japanese Tier-2 center) and French Tier-2 centers
LHC_07: members
French group    Lab.    Japanese group    Lab.
E. Lançon*      Irfu    T. Mashimo*       ICEPP
L. Poggioli     IN2P3   I. Ueda           ICEPP
P.-E. Macchi    IN2P3   T. Nakamura       ICEPP
M. Jouvin       IN2P3   N. Matsui         ICEPP
S. Jézéquel     IN2P3   H. Sakamoto       ICEPP
E. Fede         IN2P3   T. Kawamoto       ICEPP
J.-P. Meyer     Irfu
(* leader)
LHC_07 "R&D for ATLAS Grid computing"
Successor of the project LHC_02 "ATLAS computing" (2006-2012)
• The LHC_02 project started as a collaboration between the computer center of IN2P3 in Lyon, France (a Tier-1 center) and the ICEPP Tier-2 center (associated with the Lyon Tier-1 in the ATLAS "cloud" computing model)
• LHC_02 carried out various R&D studies, in particular on how to exploit efficiently the available bandwidth of the long-distance international network connection
Network between Lyon and Tokyo
[Diagram: the Lyon-Tokyo path runs via New York over SINET, GEANT, and RENATER; other endpoints shown: BNL (USA, Long Island), TRIUMF (Canada, Vancouver), ASGC (Taiwan). Link: 10 Gb/s, RTT = 300 ms.]
Exploiting the bandwidth is not a trivial matter: packet loss at various places, directional asymmetry in transfer performance, performance changes over time, etc.
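To see why such a long fat pipe is hard to fill: with a 300 ms round-trip time, a single TCP stream needs a window of roughly the bandwidth-delay product to saturate a 10 Gb/s link. A minimal back-of-the-envelope check in Python (link figures from the diagram above; the ~4 MB default window is an illustrative assumption):

```python
# Bandwidth-delay product for the Lyon-Tokyo link (10 Gb/s, RTT = 300 ms).
# A single TCP stream must keep this many bytes in flight to fill the pipe.
bandwidth_bps = 10e9   # 10 Gb/s
rtt_s = 0.300          # 300 ms round-trip time

bdp_bytes = bandwidth_bps * rtt_s / 8
print(f"BDP = {bdp_bytes / 1e6:.0f} MB")  # -> BDP = 375 MB

# With an (assumed) default TCP window of ~4 MB, one stream reaches only:
window_bytes = 4e6
throughput_bps = window_bytes * 8 / rtt_s
print(f"single-stream ceiling = {throughput_bps / 1e6:.0f} Mb/s")  # ~107 Mb/s
```

This is why the transfers use window tuning and many parallel streams, and why even small packet-loss rates are so damaging at this RTT.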
It is difficult to track down the source of network problems:
• Sites: security policy
• Network providers: support area
• Projects, VOs: computing model
[Plots: iperf throughput (Mbps) and GridFTP transfer rates (MB/sec), LYON→TOKYO and TOKYO→LYON, measured May 28, 2011, for large, medium, and small files.]
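Measurements like these can be reproduced with standard tools. A hedged sketch of a two-direction probe in Python (the hostname is hypothetical; assumes iperf3 is installed and an iperf3 server is running on the remote host):

```python
import json
import subprocess

REMOTE = "iperf.example-lyon.fr"   # hypothetical far-end iperf3 server

def iperf3_throughput(reverse: bool) -> float:
    """Run a 30 s iperf3 test; reverse=True measures remote -> local."""
    cmd = ["iperf3", "-c", REMOTE, "-t", "30", "-J"]
    if reverse:
        cmd.append("-R")           # -R: server sends, i.e. remote -> local
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e6

out_mbps = iperf3_throughput(reverse=False)  # local -> remote
in_mbps = iperf3_throughput(reverse=True)    # remote -> local
print(f"out: {out_mbps:.0f} Mb/s, in: {in_mbps:.0f} Mb/s")
if min(out_mbps, in_mbps) < 0.5 * max(out_mbps, in_mbps):
    print("large directional asymmetry - suspect packet loss on the slow path")
```

Comparing memory-to-memory iperf results against GridFTP disk-to-disk rates, as in the plots, helps separate network problems from storage bottlenecks.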
Case with CC-IN2P3
[Plots: LYON→TOKYO and TOKYO→LYON transfer rates measured on Nov. 17, 2011 and Jan. 21, 2012; the packet loss was cleared.]
The cause was a misconfiguration related to QoS handling of LBE (less-than-best-effort) packets; the situation was improved by a LAN reconfiguration at CC-IN2P3.
LHC_07 "R&D for ATLAS Grid computing"
• The LHC_02 project was successful
• Continue the collaboration to face the important challenges of the coming years: new ATLAS computing model, hardware, software, and networking issues
ATLAS Computing Model - Tiers
Implementation of the ATLAS computing model: tiers and clouds
• Hierarchical tier organization based on the MONARC network topology
• Sites are grouped into clouds for organizational reasons
• Possible communications:
  – Optical Private Network: T0-T1, T1-T1
  – National networks: intra-cloud T1-T2
• Restricted communications (general public network):
  – Inter-cloud T1-T2
  – Inter-cloud T2-T2
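A toy illustration of these rules, as a sketch only (this is not ATLAS code; the site-to-cloud assignments below are hypothetical examples):

```python
# Toy model of the MONARC-style rules above: T0-T1 and T1-T1 over the
# optical private network, T1-T2 only within the same cloud; inter-cloud
# T1-T2 and T2-T2 are restricted.
SITES = {
    "CERN": ("T0", None),
    "CC-IN2P3": ("T1", "FR"),
    "TOKYO": ("T2", "FR"),     # ICEPP attached to the FR cloud
    "GRIF": ("T2", "FR"),
    "BNL": ("T1", "US"),
    "MWT2": ("T2", "US"),
}

def transfer_allowed(src: str, dst: str) -> bool:
    (src_tier, src_cloud), (dst_tier, dst_cloud) = SITES[src], SITES[dst]
    tiers = {src_tier, dst_tier}
    if tiers <= {"T0", "T1"}:
        return True                    # OPN: T0-T1 and T1-T1
    if "T1" in tiers and "T2" in tiers:
        return src_cloud == dst_cloud  # intra-cloud T1-T2 only
    return False                       # inter-cloud T2 traffic restricted

print(transfer_allowed("CC-IN2P3", "TOKYO"))  # True  (same cloud)
print(transfer_allowed("BNL", "TOKYO"))       # False (inter-cloud)
```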
Detector Data Distribution
– RAW and reconstructed data are generated at CERN and dispatched to T1s.
– Reconstructed data are further replicated downstream to T2s of the SAME cloud.
[Diagram: Tier-0 fans out to Tier-1s, and each Tier-1 to the Tier-2s of its cloud; O(2-4 GB) files (with exceptions).]
Data distribution after Reprocessing and Monte Carlo Reconstruction
– RAW data are re-processed at T1s to produce a new version of the derived data
– Derived data are replicated to T2s of the same cloud
– Derived data are replicated to a few other T1s (or CERN)
  • and, from there, to other T2s of the same cloud
[Diagram: Tier-0, Tier-1s and their Tier-2s, showing the replication flow; O(2-4 GB) files (with exceptions).]
Monte Carlo production
– Simulation (and some reconstruction) runs at T2s
– Input data hosted at T1s are transferred to (and cached at) T2s
– Output data are copied and stored back at T1s
– For reconstruction, derived data are
  • replicated to a few other T1s (or CERN)
  • and, from there, to other T2s of the same cloud
[Diagram: INPUT flows from Tier-1s to Tier-2s; OUTPUT flows back from Tier-2s to Tier-1s.]
Analysis
• The paradigm is "jobs go to data", i.e.
  – Jobs are brokered to sites where data have been pre-placed
  – Jobs access data only from the local storage of the site where they run
  – Jobs store their output in the storage of the site where they run
• No WAN involved.
(by Simone Campana, ATLAS TIM, Tokyo, May 2013)
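A minimal sketch of this "jobs go to data" brokering (my own illustration, not the PanDA brokering code; dataset names, sites, and queue depths are hypothetical):

```python
# "Jobs go to data": a job may only be brokered to a site that already
# hosts a replica of its input dataset; it reads and writes locally.
REPLICA_CATALOG = {
    "data12_8TeV.AOD.r1234": {"TOKYO", "CC-IN2P3"},
    "mc12_8TeV.AOD.s5678": {"GRIF"},
}

def broker(dataset: str, site_queues: dict) -> str:
    """Pick the least-loaded site that holds the input dataset."""
    candidates = REPLICA_CATALOG.get(dataset, set()) & site_queues.keys()
    if not candidates:
        raise LookupError(f"no site holds {dataset}; replicate it first")
    return min(candidates, key=site_queues.__getitem__)

queues = {"TOKYO": 120, "CC-IN2P3": 450, "GRIF": 30}
print(broker("data12_8TeV.AOD.r1234", queues))  # -> TOKYO (least loaded)
```

The LookupError branch is exactly where the two issues on the next slides arise: the data are not where you want to run.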
Issues - I
• You need data at some T2 (normally "your" T2)
• The inputs are at some other T2 in a different cloud
• Examples:
  – Outputs of analysis jobs
  – Replication of particular samples on demand
[Diagram: according to the model, the data should be routed T2 → T1 → T1 → T2, via the Tier-1s of the two clouds.]
(by Simone Campana, ATLAS TIM, Tokyo, May 2013)
Issues - II
• You need to process data available only at a given T1
• All sites of that cloud are very busy
• You assign jobs to some T2 of a different cloud
[Diagram: according to the model, the INPUT and OUTPUT between the Tier-1 and a Tier-2 of a different cloud should be routed via the Tier-1s.]
(by Simone Campana, ATLAS TIM, Tokyo, May 2013)
Evolution of the ATLAS computing model
• ATLAS decided to relax the MONARC model
  – Allow T1-T2 and T2-T2 traffic between different clouds (thanks to the growth of network bandwidth)
• Any site can exchange data with any site if the system believes it is convenient
• So far ATLAS asked (large) T2s:
  – To be well connected to their T1
  – To be well connected to the T2s of their cloud
• Now ATLAS is asking large T2s:
  – To be well connected to all T1s
  – To foresee non-negligible traffic from/to other (large) T2s
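Continuing the toy model from the earlier sketch, the relaxation amounts to whitelisting well-connected large T2s for inter-cloud traffic (again my own illustration; the whitelist is hypothetical):

```python
# Relaxed model: a "well-connected" large T2 may exchange data with any
# T1 and with other well-connected T2s, regardless of cloud.
WELL_CONNECTED_T2 = {"TOKYO", "MWT2"}   # hypothetical whitelist

def transfer_allowed_relaxed(src: str, dst: str) -> bool:
    if transfer_allowed(src, dst):       # everything MONARC allowed
        return True                      # is still allowed
    ok = {s for s, (tier, _) in SITES.items()
          if tier in ("T0", "T1") or s in WELL_CONNECTED_T2}
    return src in ok and dst in ok       # plus inter-cloud T1/large-T2 links

print(transfer_allowed_relaxed("BNL", "TOKYO"))   # True: inter-cloud T1-T2
print(transfer_allowed_relaxed("TOKYO", "MWT2"))  # True: inter-cloud T2-T2
```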
Evolution of the model
[Diagrams: multi-cloud Monte Carlo production and analysis output, with Tier-2s exchanging data directly with Tier-1s and Tier-2s of other clouds.]
(by Simone Campana, ATLAS TIM, Tokyo, May 2013)
LHC_07: R&D for ATLAS Grid computing
• Networking therefore remains a very important issue
• Other topics addressed by the collaboration:
  – Use of virtual machines for operating WLCG services
  – Improvement of the reliability of the storage middleware
  – Performance of data access from analysis jobs through various protocols
  – Investigation of federated Xrootd storage
  – Optimization and monitoring of data transfers between remote sites
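For the federated Xrootd item, the point is that a job can fall back to reading a file through a federation redirector when the local replica is missing. A hedged sketch (both hostnames and the file path are hypothetical; assumes the XRootD client tools, i.e. xrdcp, are installed):

```python
import subprocess

# Hypothetical endpoints: local storage element and federation redirector.
LOCAL = "root://se.icepp.example.jp//atlas/data12/AOD.pool.root"
FEDERATION = "root://redirector.example.org//atlas/data12/AOD.pool.root"

def fetch(dest: str = "/tmp/AOD.pool.root") -> str:
    """Try the local replica first, then fall back to the federation."""
    for url in (LOCAL, FEDERATION):
        if subprocess.run(["xrdcp", "-f", url, dest]).returncode == 0:
            return url               # report where the file came from
    raise RuntimeError("file unreachable via local SE and federation")

print("read from", fetch())
```

The same fallback makes WAN data access performance, one protocol topic above, directly relevant to analysis job efficiency.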
WAN for TOKYO
[Diagram: WAN connectivity for TOKYO. Pacific routes to LA and WIX (10 Gbps each, plus an additional new 10 Gbps line since the end of March 2013) toward BNL, TRIUMF, and ASGC; Atlantic routes to Amsterdam and Geneva (40, 40, 20, and 10 Gbps segments, including a dedicated line via OSAKA) toward NDGF, RAL, CC-IN2P3, CERN, CNAF, PIC, SARA, and NIKHEF.]
LHCONE: a new dedicated (virtual) network for Tier-2 centers, etc. The "perfSONAR" tool has been put in place for network monitoring.
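As a flavor of what perfSONAR monitoring enables, measurement archives can be queried over HTTP. This sketch assumes an esmond-style REST endpoint; the URL, query parameters, and JSON field names here are illustrative assumptions, not a documented contract:

```python
import json
import urllib.request

# Hypothetical perfSONAR measurement-archive host and query.
URL = ("http://ps-archive.example.jp/esmond/perfsonar/archive/"
       "?event-type=throughput"
       "&source=tokyo-ps.example.jp&destination=lyon-ps.example.fr")

with urllib.request.urlopen(URL, timeout=30) as resp:
    series = json.load(resp)

# Each entry describes one measurement series between a host pair.
for entry in series:
    print(entry.get("source"), "->", entry.get("destination"))
```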
Budget plan in the year 2013
French side (Euro):
Item      Unit cost  Number  Subtotal  Supported by
Travel    1,000      3       3,000     IN2P3
Per-diem  230        15      3,450     IN2P3
Travel    1,000      1       1,000     Irfu
Per-diem  230        5       1,150     Irfu
Total                        8,600

Japanese side (k Yen):
Item      Unit cost  Number  Subtotal  Supported by
Travel    160        3       480       ICEPP
Per-diem  22.7       12      272       ICEPP
Total                        752
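A quick arithmetic check of the totals (plain Python; all figures are from the table above):

```python
# Line totals are unit cost x count, as in the budget table.
french_eur = 1000 * 3 + 230 * 15 + 1000 * 1 + 230 * 5
japanese_kyen = 160 * 3 + round(22.7 * 12)

print(french_eur)     # 8600 EUR
print(japanese_kyen)  # 752 k JPY (22.7 * 12 = 272.4, listed as 272)
```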
Cost of the project
• The project uses the existing computing facilities at the Tier-1 and Tier-2 centers in France and Japan, and the existing network infrastructure provided by the NRENs, GEANT, etc. No hardware costs are therefore required for this project.
• Communication between the members relies mainly on e-mail and TV conferences, but a face-to-face meeting (a small workshop) is usually necessary once per year, hence the costs for travel and per-diem.