
Belle Computing / Data Handling

HEP GRID Workshop @ CHEP, KNU, 11/9/2002

Youngjoon Kwon (Yonsei Univ.) & Jysoo Lee (KISTI)

• What is Belle and why we need large-scale computing?
• Current Belle computing system & data handling
• Planning for the super-B era
• A case study


What is Belle?

• KEKB: asymmetric-energy collider, e+ (3.5 GeV) on e- (8 GeV)
• Design luminosity = 10^34 /cm^2/s
• E(cm) = 10.58 GeV, on the Υ(4S) resonance
• Belle detector optimized for studying matter-antimatter asymmetry in the Universe


The Belle Experiment
• To study matter-antimatter asymmetry in B meson decays
• Accumulated 100 million B-Bbar pairs since turn-on in 1999
• Published 44 journal papers and over 200 conference contributions


Belle's need for large-scale computing

• To achieve half of Belle's physics goals, we need ~10^8 events
• Time required for "real data" analysis:
  – 40 days / 100 M events / 1 GHz
  – Need 10 GHz per analysis to finish one data loop within 1 week
  – Belle produces ~20 papers/year, and a typical paper takes ~2 years to finish
    => ~40 analyses being done simultaneously
  – Hence, we need ~400 GHz to sustain the current "real data" analysis activity alone
• But we also need a Monte-Carlo sample (x4 the data size):
  – 10 sec/evt/GHz => ~130 years/GHz
  – Hence, we need ~200 GHz to provide the MC sample within a year
• Need almost 1 THz to sustain physics analysis activities
• We need additional CPUs for raw data processing, etc.
  (a back-of-envelope check of these numbers is sketched below)
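Taken at face value, the rates quoted above reproduce the ~400 GHz and ~200 GHz figures. The following is only an illustrative back-of-envelope sketch; the event count, per-event rates, and number of concurrent analyses are taken from the slide, everything else is for illustration.

```python
# Back-of-envelope check of the CPU numbers above (figures from the slide).

N_EVENTS = 1e8                  # events for ~1/2 of the physics goals
DAYS_PER_100M_EVT_PER_GHZ = 40  # real-data analysis speed
N_ANALYSES = 40                 # ~20 papers/year x ~2 years/paper
MC_SEC_PER_EVT_PER_GHZ = 10     # MC generation cost
MC_SCALE = 4                    # MC sample is ~x4 the data

# One data loop in <= 1 week:
ghz_per_loop = DAYS_PER_100M_EVT_PER_GHZ * (N_EVENTS / 1e8) / 7
print(f"one 7-day data loop : ~{ghz_per_loop:.0f} GHz "
      "(the slide budgets 10 GHz per analysis)")

# All analyses running at once, at the budgeted 10 GHz each:
print(f"{N_ANALYSES} concurrent analyses : ~{N_ANALYSES * 10} GHz")

# Producing the MC sample within one year:
mc_ghz_years = MC_SCALE * N_EVENTS * MC_SEC_PER_EVT_PER_GHZ / (365 * 86400)
print(f"MC sample : ~{mc_ghz_years:.0f} GHz-years, i.e. ~200 GHz for under a year")
```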


Central Belle computing system


CPUs
• Belle's reference platform: Sparcs running Solaris 2.7
  – 9 workgroup servers (500 MHz, 4 CPU)
  – 38 compute servers (500 MHz, 4 CPU)
    • LSF batch system / 40 tape drives (2 each on 20 servers)
    • Fast access to disk servers
  – 20 user workstations with DAT, DLT, AIT drives
• Additional Intel CPUs
  – Compute servers (@ KEK, Linux RH 6.2/7.2)
    • 4-CPU (Pentium Xeon 500-700 MHz) servers: ~96 units
    • 2-CPU (Pentium III 0.8-1.26 GHz) servers: ~167 units
  – User terminals (@ KEK, to log onto the group servers)
    • 106 PCs (~50 Win2000 + X-window software, ~60 Linux)
  – User analysis PCs (@ KEK, unmanaged)
  – Compute/file servers at universities
    • A few to a few hundred at each institution
    • Used in generic MC production as well as physics analyses at each institution
    • Tau analysis center @ Nagoya U., for example


Disk servers @ KEK
• 8 TB NFS file servers
• 120 TB HSM (4.5 TB staging disk)
  – DST skims
  – User data files
• 500 TB tape library (direct access)
  – 40 tape drives on 20 Sparc servers
  – DTF2: 200 GB/tape, 24 MB/s I/O speed
  – Raw and DST files
  – Generic MC files are stored and read by users (batch jobs)
• ~12 TB local data disks on PCs
  – Not used efficiently at this point
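For scale, a rough sketch of what the tape hardware quoted above can deliver, assuming the drives actually sustain the quoted 24 MB/s:

```python
# Rough capability of the DTF2 tape library described above (figures from the slide).
TAPE_CAPACITY_GB = 200   # per DTF2 tape
DRIVE_MB_PER_S = 24      # quoted I/O speed per drive
N_DRIVES = 40

hours_per_tape = TAPE_CAPACITY_GB * 1000 / DRIVE_MB_PER_S / 3600
aggregate_mb_per_s = N_DRIVES * DRIVE_MB_PER_S
tb_per_day_all_drives = aggregate_mb_per_s * 86400 / 1e6
print(f"one full tape, one drive : ~{hours_per_tape:.1f} h")
print(f"all 40 drives streaming  : ~{aggregate_mb_per_s} MB/s "
      f"(~{tb_per_day_all_drives:.0f} TB/day)")
```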


Data storage requirements
• Raw data: 1 GB/pb^-1 (100 TB / 100 fb^-1)
• DST: 1.5 GB/pb^-1 per copy (150 TB / 100 fb^-1)
• Skims for calibration: 1.5 GB/pb^-1
• MDST: 50 GB/fb^-1 (5 TB / 100 fb^-1)
• Other physics skims: 30 GB/fb^-1 (3 TB / 100 fb^-1)
• Generic MC (MDST): ~20 TB/year
• Total: ~450 TB/year
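These per-luminosity figures can be summed as a sanity check of the quoted total. A small sketch, assuming roughly 100 fb^-1 of data per year and converting the calibration-skim line with the same 1.5 GB/pb^-1 rate (both assumptions):

```python
# Storage per 100 fb^-1 of data, from the per-pb^-1 / per-fb^-1 figures above.
# Assumes ~100 fb^-1 is accumulated per year.
tb_per_100_fb = {
    "raw data"          : 100,   # 1 GB/pb^-1
    "DST (one copy)"    : 150,   # 1.5 GB/pb^-1
    "calibration skims" : 150,   # 1.5 GB/pb^-1
    "MDST"              : 5,     # 50 GB/fb^-1
    "physics skims"     : 3,     # 30 GB/fb^-1
    "generic MC (MDST)" : 20,    # ~20 TB/year
}
total = sum(tb_per_100_fb.values())
print(f"total: ~{total} TB per 100 fb^-1 (~{total} TB/year), "
      "roughly consistent with the quoted ~450 TB/year")
```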


CPU requirements – DST production
• Goal: 3 months to reprocess all data
  – Often we have to wait for constants
  – Often we have to restart due to bad constants
• 300 GHz (PIII) for 1 fb^-1/day
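A rough sketch of what the 1 fb^-1/day rate implies for the 3-month goal; the ~100 fb^-1 total data set is an assumption based on the earlier slides, not a figure from this one:

```python
# DST reprocessing: 300 GHz of PIII handles ~1 fb^-1/day (from the slide).
TOTAL_FB = 100               # assumed total data set
FB_PER_DAY_AT_300_GHZ = 1.0
GOAL_DAYS = 90               # "3 months to reprocess all data"

days_at_300_ghz = TOTAL_FB / FB_PER_DAY_AT_300_GHZ
ghz_for_goal = 300 * TOTAL_FB / (FB_PER_DAY_AT_300_GHZ * GOAL_DAYS)
print(f"at 300 GHz            : ~{days_at_300_ghz:.0f} days of continuous running")
print(f"to finish in {GOAL_DAYS} days : ~{ghz_for_goal:.0f} GHz, "
      "plus margin for waiting on / redoing bad constants")
```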


CPU requirements – MC production
• For every real data set, need to generate at least x3 as many MC events
• 240 GB/fb^-1 of data in the compressed format
  – No intermediate info (DC hits, ECL showers) is saved
  – With every new release of the s/w library, we need to produce a new generic MC sample
• 400 GHz (PIII) for 1 fb^-1/day
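A quick look at the MC volume this implies, assuming the 240 GB/fb^-1 figure scales with the real-data luminosity and ~100 fb^-1 per year (both assumptions):

```python
# Compressed generic-MC volume implied by the 240 GB/fb^-1 figure above,
# assuming ~100 fb^-1 of real data per year.
GB_PER_FB = 240
FB_PER_YEAR = 100
print(f"~{GB_PER_FB * FB_PER_YEAR / 1000:.0f} TB/year of compressed MC, "
      "comparable to the ~20 TB/year generic-MC storage figure quoted earlier")
```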


Data transfer to remote users
• A firewall & login servers make data transfer miserable (100 Mbps max.)
• DAT tapes are used for massive data transfer
  – Compressed hadron skim files
  – MC events generated by outside institutions
• Dedicated GbE networks to a few institutions are now being added
  – Total 10 Gbit/s to/from KEK being added
• Slow network to most other collaborators
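To see why the 100 Mbps path is "miserable", a small sketch comparing it with the links being added; the 3 TB payload is the physics-skim volume per 100 fb^-1 quoted earlier, and saturating each link is an assumption:

```python
# Time to ship a data set to a remote institute at different link speeds.
PAYLOAD_TB = 3.0   # physics skims per 100 fb^-1, from the storage slide

def transfer_days(tb, mbps):
    """Days to move `tb` terabytes over a link of `mbps` megabits per second."""
    seconds = tb * 8e6 / mbps            # 1 TB ~ 8e6 Mbit
    return seconds / 86400

for label, mbps in [("100 Mbps (through firewall)", 100),
                    ("1 Gbps dedicated GbE", 1000),
                    ("10 Gbit to/from KEK", 10000)]:
    print(f"{label:28s}: ~{transfer_days(PAYLOAD_TB, mbps):5.2f} days")
```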


Compute problem?
• Obviously, the existing computing resources are already stretched to over-capacity.
• The data set is doubling every year with no end in sight.
• Management of data and CPU is already a major burden.
• By far the most cost-effective solution is large clusters of commodity PCs running Linux.
• How to manage these? GRID!


Prototype GRID-style analysis
• Need to run a multi-parameter fitting program for the CP-violation measurement => a multi-CPU CP-fitter (a toy sketch of the idea follows below)
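The slide does not show the fitter itself; the following is only a minimal toy sketch (not the Belle code) of the "multi-CPU fitter" idea: per-event likelihood terms are split across worker processes and summed, here for an illustrative time-dependent asymmetry PDF with a single CP parameter S. All names and values are illustrative.

```python
# Toy sketch of a "multi-CPU" likelihood fit: events are split across worker
# processes, each returns a partial negative log-likelihood, and the sums are
# combined.  The PDF and all parameters here are illustrative only.
import numpy as np
from multiprocessing import Pool

TAU = 1.53   # B0 lifetime [ps] (approximate)
DM = 0.507   # mixing frequency [ps^-1] (approximate)

def partial_nll(args):
    """Partial -log L for one chunk of (dt, q) events at a given value of S."""
    dt, q, s_val = args
    pdf = (np.exp(-np.abs(dt) / TAU) / (2.0 * TAU)
           * (1.0 + q * s_val * np.sin(DM * dt)))
    return -np.sum(np.log(pdf))

def nll(s_val, dt, q, pool, n_chunks=8):
    """Total -log L, evaluated by farming event chunks out to the pool."""
    jobs = [(d, t, s_val) for d, t in zip(np.array_split(dt, n_chunks),
                                          np.array_split(q, n_chunks))]
    return sum(pool.map(partial_nll, jobs))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200_000
    dt = rng.exponential(TAU, n) * rng.choice([-1.0, 1.0], n)   # toy decay times
    q = rng.choice([-1.0, 1.0], n)                              # toy flavour tags
    with Pool(4) as pool:
        # Coarse scan over S; a real analysis would use a proper minimizer (MINUIT).
        s_grid = np.linspace(-0.95, 0.95, 39)
        scan = [nll(s, dt, q, pool) for s in s_grid]
        print(f"best-fit S ~ {s_grid[int(np.argmin(scan))]:+.2f}")
```

The same split-and-sum pattern is what makes the fit a natural candidate for running over distributed (GRID) CPUs rather than a single machine.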


Planning for the Super-B era
• A x15 increase in luminosity is planned c. 2006
• Data accumulation: ~2 PB/year
• Including MC, we need 10 PB of storage to start super-B
• To re-process 2 years' accumulation (2 ab^-1) of data in 3 months, we need x30 CPU power
  – CPU @ KEK alone is not enough
  – A cluster of Local Data Centers (LDCs), connected by GRID, is planned!
• One unit of an LDC (see the sketch below):
  – 300 GHz + 60 TB + 3 MB/s to KEK
  – Cost: $0.3M + $0.2M + $(Network)
• Can we afford one?
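A small sketch of what one such LDC unit amounts to, taking the slide's figures at face value; treating the 3 MB/s link as a sustained average is an assumption for illustration:

```python
# What one Local Data Center (LDC) unit provides, per the figures above.
DISK_TB = 60
LINK_MB_PER_S = 3   # to KEK, assumed sustainable on average

tb_per_year_over_link = LINK_MB_PER_S * 86400 * 365 / 1e6
days_to_fill_disk = DISK_TB * 1e6 / LINK_MB_PER_S / 86400
print(f"link to KEK moves ~{tb_per_year_over_link:.0f} TB/year")
print(f"filling the {DISK_TB} TB disk over the link alone: ~{days_to_fill_disk:.0f} days")
# i.e. the link cannot carry PB-scale raw data; bulk transfers would still rely
# on media shipments or faster links, as on the data-transfer slide above.
```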


Belle-GRID – a case study
• Two Australian collaborators in Belle (U. Melbourne & U. Sydney) are working on a GRID prototype for Belle physics analyses


Belle-GRID – a case study
• Blueprint for Belle-GRID in Australia

Belle-GRID – a case study
• Belle analysis using a Grid environment
  – Useful locally » adopted by Belle » wider community
  – Construction of a Grid node at Melbourne
    • Certificate Authority to approve security
    • Globus toolkit
    • GRIS (Grid Resource Information Service) - LDAP with Grid security
    • Globus Gateway - connected to local queue (GNU Queue; PBS?)
    • GSIFTP - data resource providing access to local storage
    • Replica Catalog - LDAP for virtual data directory
  – Replicate this in Sydney
  – Initial test of Belle code with grid node & queue
  – Data access via the grid (Physical File Names as stored in the Replica Catalog)
  – Modification of Belle code to access the data on the grid
  – Test of Belle code with grid node & queue & grid data access
  – Connect 2 grid nodes (Melbourne EPP and Sydney EPP)
  – Test of Belle code running over separated grid clusters
  – Implement or build a Resource Broker


Belle-GRID – a case study
• Belle analysis test case:
  – Analysis of charmless B meson decays to two vector mesons, used to determine two angles of the CKM unitarity triangle
• Belle analysis code over Grid resources (10 files; 2 GB total)
  – Data files processed serially: 95 min
  – Data files processed over Globus: 35 min
• Data access (2 secure protocols, GASS/GSIFTP; 100 Mbit network)
  – NFS access for comparison: 8.5 MB/s
  – GASS access: 4.8 MB/s
  – GSIFTP access: 9.1 MB/s
• Belle analysis using Grid data access
  – NFS access for comparison: 0.34 MB/s
  – GSIFTP data streaming: 0.36 MB/s
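From the timings above one can extract the speedup and the effective processing rates; a small sketch using only the numbers on this slide:

```python
# Speedup and effective rates from the Australian test-case numbers above.
TOTAL_GB = 2.0
SERIAL_MIN = 95
GLOBUS_MIN = 35

speedup = SERIAL_MIN / GLOBUS_MIN
rate_serial = TOTAL_GB * 1024 / (SERIAL_MIN * 60)   # MB/s, serial processing
rate_globus = TOTAL_GB * 1024 / (GLOBUS_MIN * 60)   # MB/s, over Globus
print(f"speedup over Globus : x{speedup:.1f}")
print(f"effective rate      : {rate_serial:.2f} MB/s serial "
      f"vs {rate_globus:.2f} MB/s over Globus")
# The ~0.36 MB/s effective serial rate matches the per-job streaming figures above,
# suggesting a single Belle job is CPU-limited rather than network-limited.
```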


Summary
• Belle's computing resources are stretched to over-capacity.
• Moreover, we are planning a x15 increase in luminosity (the so-called "super KEKB") within a few years.
• Perhaps Local Data Centers connected by GRID are the only viable option.
• Two Australian groups are working on a Belle-GRID analysis prototype. So far it has been working as planned.