TRANSCRIPT
Big Science and Big DataDirk Duellmann, CERN
Apache Big Data Europe 28 Sep 2015, Budapest, Hungary
CERN IT Department CH-1211 Genève 23
Switzerland www.cern.ch/it
The ATLAS experiment
7,000 tons, 150 million sensors generating data 40 million times per second
i.e. a petabyte/s
Data Collection and Archiving at CERN
Data flow to permanent storage: 4-6 GB/sec
• ALICE: 4 GB/sec
• ATLAS: 1-2 GB/sec
• CMS: 1-2 GB/sec
• LHCb: 200-400 MB/sec
The Worldwide LHC Computing Grid
An international collaboration to distribute and analyse LHC data. It integrates computer centres worldwide that provide computing and storage resources into a single infrastructure accessible to all LHC physicists.
• Tier-0 (CERN): data recording, reconstruction and distribution
• Tier-1: permanent storage, re-processing, analysis
• Tier-2: simulation, end-user analysis
> 2 million jobs/day, ~350,000 cores, 500 PB of storage
Nearly 170 sites in 40 countries, 10-100 Gb/s links
LHC – Big Data…
A few PB of raw data becomes ~100 PB:
• Duplicate raw data
• Simulated data
• Derived data products
• Versions as software improves
• Replicas to allow access by more physicists
How do we store/retrieve LHC data? A short history…
• 1st try: all data in a commercial object database (1995)
  – good match for the complex data model and OO language integration
  – but the market predicted by many analysts did not materialise!
• 2nd try: all data in a relational DB with object-relational mapping (1999)
  – PB-scale deployment was far from being proven
  – users code in C++, and rejected data model definition in SQL
• Hybrid between RDBMS and structured files (from 2001 to today)
  – relational DBs for transactional management of metadata (only TB-scale)
    • file/dataset metadata, conditions, calibration, provenance, workflow
    • via DB abstraction (plugins: Oracle, MySQL, SQLite, Frontier/SQUID)
  – open source persistency framework (ROOT)
    • uses C++ "introspection" to store/retrieve networks of C++ objects
    • column store for efficient sparse reading
Processing a TTree
[Diagram: a TTree holds branches, each split into leaves; a TSelector drives the event loop, reading only the needed parts of each event.]
• Begin(): create histograms, define the output list
• Process(): loop over events 1…n, apply the preselection, read only the needed branches
• Terminate(): finalize the analysis (fitting, …)
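The TSelector event loop above can be sketched in plain Python. This is a toy column store standing in for ROOT's TTree, not the ROOT API: the class and branch names are illustrative, but the pattern (Begin/Process/Terminate, and reading only the branches the analysis needs) matches the slide.

```python
# Toy column store: each "branch" is a list of per-event values.
# This mimics how a TTree lets an analysis read only the branches
# it needs, instead of deserialising whole events.
tree = {
    "pt":     [12.0, 45.5, 3.2, 60.1],   # one value per event
    "eta":    [0.1, -1.2, 2.4, 0.3],
    "charge": [1, -1, 1, -1],            # never touched below
}

class ToySelector:
    """Begin/Process/Terminate event loop, in the style of ROOT's TSelector."""
    def __init__(self, branches):
        self.branches = branches          # only these columns are read
        self.selected = []                # the "output list"

    def begin(self):
        self.selected.clear()             # create output containers

    def process(self, tree, i):
        # Sparse read: touch only the requested branches for event i.
        event = {b: tree[b][i] for b in self.branches}
        if event["pt"] > 10.0:            # preselection
            self.selected.append(event)

    def terminate(self):
        # Finalize the analysis (here just a count; real code fits histograms).
        return len(self.selected)

sel = ToySelector(["pt", "eta"])
sel.begin()
for i in range(len(tree["pt"])):          # loop over events
    sel.process(tree, i)
n = sel.terminate()
```

The point of the column layout is visible in `process()`: the `charge` branch is never deserialised because the selector did not request it.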
CERN Disk Storage Overview
              AFS     CASTOR        EOS     Ceph        NFS     CERNBox
Raw Capacity  3 PB    20 PB         140 PB  4 PB        200 TB  1.1 PB
Data Stored   390 TB  86 PB (tape)  27 PB   170 TB      36 TB   35 TB
Files Stored  2.7 B   300 M         284 M   77 M (obj)  120 M   14 M
AFS is CERN's Linux home directory service
CASTOR & EOS are mainly used for the physics use case (data analysis and DAQ)
Ceph is our storage backend for images and volumes in OpenStack
NFS is mainly used by engineering applications
CERNBox is our file synchronisation service based on OwnCloud+EOS
CHEP 2015, Okinawa
Tape at CERN
[Chart: archive write 27 PB; archive read 15 PB and 23 PB]
• Data volume: 100 PB physics archive, 7 PB backup (TSM)
• Tape libraries: 3+2 x IBM TS3500, 4 x Oracle SL8500
• Tape drives: 100 physics archive, 50 backup
• Capacity: 70k slots, 30k tapes
A look into the Future
• LHC upgrades will further increase luminosity
• Computing resource needs will be higher
• Data generated will increase drastically
• Next accelerators
  – Future Circular Collider (80-100 km)
Archive: Large scale media migration
[Timeline: LHC Run 1 data repacked onto new media, with the LHC Run 2 start as the deadline]
• Part 1: Oracle T10000D
• Part 2: IBM TS1150
Smart vs Simple Archive: HSM Issues
• CASTOR had been designed as a Hierarchical Storage Management (HSM) system
  • disk-only and multi-pool support were added later, painfully
  • required rates for namespace access and file-open exceeded earlier estimates
• Around the LHC start, conceptual issues with the HSM model also became visible
  • "a file" is not a meaningful granule for managing data exchange: experiments use datasets
  • dataset parts needed to be "pinned" on disk by users to avoid cache thrashing
  • users had to "trick" the HSM into doing the right thing :-(
EOS Project: Goals & Choices
• Server, media, and file system failures need to be transparently absorbed
  – key functionality: file-level replication and rebalancing
  – data stays available after a failure, with no human intervention
• Fine-grained redundancy within one hardware setup
  – choose & change the redundancy level for specific data
    • either file replica count or erasure encoding
• Support bulk deployment operations
  – e.g. replace hundreds of servers at end of warranty
• In-memory namespace (sparse hash per directory)
  – file stat calls 1-2 orders of magnitude faster
  – write-ahead logging for durability
• Later in addition: transparent multi-site clustering
  – e.g. between Geneva and Budapest
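The redundancy choice above (file replica count vs erasure encoding) is fundamentally a raw-capacity trade-off. A back-of-the-envelope sketch, with illustrative parameters that are not EOS defaults:

```python
def raw_bytes_needed(user_bytes, scheme):
    """Raw capacity needed to store user_bytes under a redundancy scheme.

    scheme is either ("replica", n)     -> n full copies, or
                     ("erasure", k, m)  -> k data + m parity stripes.
    """
    kind = scheme[0]
    if kind == "replica":
        n = scheme[1]
        return user_bytes * n              # n-fold overhead
    if kind == "erasure":
        k, m = scheme[1], scheme[2]
        return user_bytes * (k + m) / k    # (k+m)/k overhead
    raise ValueError(kind)

one_pb = 10**15
# Two full replicas: 2x raw capacity, survives the loss of one copy.
rep = raw_bytes_needed(one_pb, ("replica", 2))
# 10+2 erasure coding: 1.2x raw capacity, survives two lost stripes.
ec = raw_bytes_needed(one_pb, ("erasure", 10, 2))
```

This is why a system that can "choose & change redundancy level for specific data" matters: hot analysis data can afford replica overhead for read throughput, while colder data is cheaper under erasure coding.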
EOS Raw Capacity Evolution
Why do we develop our own open source storage software?
• A large science community is trained to be effective with a set of products
  • the efficiency of this community is our main asset, not just the raw utilisation of CPUs and disks
  • integration and specific support do matter
  • community sharing via tools and formats even more so
• Long-term projects
  • a change of "vendor/technology" is not only likely but expected
  • we carry old but valuable data through time (bit preservation)
  • "loss of data ownership" after the first active project period
Does Kryder’s law still hold?
[Chart: HDD areal density CAGR. Source: "HDD Opportunities & Challenges, Now to 2020", Dave Anderson, Seagate]
Object Disk
• Each disk talks an object storage protocol over TCP
  – replication/failover with other disks in a networked disk cluster
  – open access library for app development
• Why now?
  – shingled media come with constrained (object) semantics: e.g. no updates
• Early stage, with several open questions
  – port price for the disk network vs price gain from reduced server/power cost?
  – standardisation of protocol/semantics to allow app development at low risk of vendor lock-in?
Can we optimise our systems further?
• Infrastructure analytics
  • apply statistical analysis to the complete system: storage, CPU, network, user applications
  • measure/predict the quantitative impact of changes on the real job population
• Easy!
  • looks like physics analysis, with infrastructure metrics instead of physics data
  • … really?
Non-trivial…
• Technically
  • needs consolidated service- and application-side metrics
  • usually: log data written for human consumption, without data design
• Conceptually
  • some established metrics turn out to be less useful for analysing today's hardware than expected
    • CPU efficiency = t_cpu / t_wall? storage efficiency = GB/s?
  • correlation does not imply causal relation
• Sociologically
  • better to observe the "rule of local discovery"
  • the people who quantitatively understand the infrastructure are busy running services. Always…
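The metric question above can be made concrete. The classic batch metric t_cpu / t_wall flagged I/O stalls well for single-threaded jobs, but once jobs use several cores the same ratio stops being interpretable on its own. A toy illustration (numbers invented for the example):

```python
def cpu_efficiency(t_cpu, t_wall):
    """Classic batch metric: CPU seconds consumed per wall-clock second."""
    return t_cpu / t_wall

# Single-threaded job stalled on storage: a low ratio cleanly flags I/O wait.
io_bound = cpu_efficiency(t_cpu=400.0, t_wall=1000.0)    # 0.4

# A 4-thread job with the same per-core stall pattern: the ratio exceeds 1,
# and without knowing the thread count it no longer distinguishes
# "fully busy" from "mostly waiting".
multi = cpu_efficiency(t_cpu=1600.0, t_wall=1000.0)      # 1.6
```

The same caveat applies to GB/s as a storage metric: aggregate throughput says nothing about whether the job population is latency-bound or bandwidth-bound.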
Data Collection and Analysis Repository
[Diagram: sources (eos, ai, lsf) produce monitoring JSON files; a periodic extract & cleaning step loads them into a Hadoop cluster (HDFS, MR nodes); users extract small, binary subsets. Per-set record schema (e.g. Set: EOS): readbytes: number, filename: string, opentime: time]
Ramping up: ~100 nodes, ~100 TB of raw logs
In production: Flume, HDFS, MR, Pig, Spark, Sqoop, {Impala}
Current work items:
• Service: availability (e.g. isolation and rolling upgrades)
• Analytics: workbook support for popular analysis tools: R/Python/ROOT
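The "periodic extract & cleaning" step above can be sketched in plain Python. The field names (`readbytes`, `filename`, `opentime`) follow the record schema on the slide; the data, function names, and cleaning policy are illustrative, not the production Flume/Hadoop pipeline:

```python
import json
from collections import defaultdict

# Monitoring arrives as JSON lines written for human consumption;
# cleaning enforces a typed schema before analysis.
raw_lines = [
    '{"readbytes": 1048576, "filename": "/eos/run1/a.root", "opentime": 1}',
    '{"readbytes": "bad",   "filename": "/eos/run1/a.root", "opentime": 2}',
    '{"readbytes": 2097152, "filename": "/eos/run1/b.root", "opentime": 3}',
]

def clean(lines):
    """Yield records matching the EOS set schema; drop malformed ones."""
    for line in lines:
        rec = json.loads(line)
        try:
            rec["readbytes"] = int(rec["readbytes"])   # enforce the number type
        except (ValueError, TypeError):
            continue                                   # drop the bad record
        yield rec

# Toy aggregation in map-reduce style: total bytes read per file,
# the kind of popularity metric a small binary extract would feed.
bytes_per_file = defaultdict(int)
for rec in clean(raw_lines):
    bytes_per_file[rec["filename"]] += rec["readbytes"]
```

At scale the same shape runs as an MR or Spark job over HDFS; the essential step is the schema enforcement, since log data written without data design cannot be aggregated reliably otherwise.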
Summary
• CERN has a long tradition in deploying large scale storage systems used by a distributed science community world-wide
• During the first LHC run period we have passed the 100 PB mark at CERN and more importantly have contributed to the rapid confirmation of the Higgs boson and many other LHC results
• For LHC Run 2 we have significantly upgraded & optimised the infrastructure in close collaboration between service providers and users
• Adding more quantitative infrastructure analytics to prepare for High-Luminosity-LHC
• CERN is already very active as a user and provider in the open source world, and the overlap with other Big Data communities is increasing.
Thank you!