The Large Scale Data Management and Analysis Project (LSDMA), Dr. Andreas Heiss, SCC, KIT

Page 1: The Large Scale Data Management and Analysis Project (LSDMA)

Dr. Andreas Heiss, SCC, KIT

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Steinbuch Centre for Computing (SCC)

www.kit.edu

Page 2: Overview

September 12, 2013, Dr. Andreas Heiss

- Introducing KIT and SCC
- Big Data
- Infrastructures at KIT: GridKa and the Large Scale Data Facility (LSDF)
- Large Scale Data Management and Analysis (LSDMA)
- Summary and Outlook

Page 3: Introducing KIT

KIT is both
- a state university with research and teaching, and
- a research center of the Helmholtz Association with program-oriented provident research

Objectives:
- research
- teaching
- innovation

Numbers:
- 24,000 students
- 9,400 employees
- 3,200 PhD researchers
- 370 professors
- 790 million EUR annual budget in 2012

Page 4: Introducing Steinbuch Centre for Computing

- Provisioning and development of IT services for KIT and beyond
- R&D
  - High Performance Computing
  - Grids and Clouds
  - Big Data
- ~200 employees in total
  - 50% scientists
  - 50% technicians, administrative personnel and student assistants
- Named after Karl Steinbuch, professor at Karlsruhe University, creator of the term “Informatik” (German term for computer science)

Page 5: Big Data

[Figure: Google Trends comparison of the search terms "Cloud computing", "Big Data" and "Grid Computing"; time axis marked 2010 and 2013]

Page 7: Big Data 2000 years ago

“In those days Caesar Augustus issued a decree that a census should be taken of the entire Roman world.” (Luke 2:1)

- clearly defined purpose for collecting data: tax lists of all taxpayers
- data collection
  - distributed
  - analog
  - time-consuming
- distributed storage of data
- tedious data aggregation

Page 8: Big Data today

One buzzword, various challenges!

Industry
- Data mining
- Business intelligence
- Get additional information from (often) already existing data
- Data aggregation
- Typically O(10) or O(100) TBs
- New field to make money!
  - Products
  - Services
  - Market shared between some ‘big players’ and many start-ups / spin-offs!

Science
- Handling huge amounts of data
- Petabytes
- Distributed data sources and/or storage
- (Global) data management
- High throughput
- Data preservation

Page 9: Definition of Data Science

[Figure: Venn diagram by Drew Conway (IA Ventures)]

Page 11: Big Data in science: LHC at CERN

Goals
- search for the origin of mass
- understanding the early state of the universe

LHC
- went live in 2008
- four detectors
- main discovery until now: a Higgs boson

Trigger chain and data rates
- Collisions: 40 MHz (1,000 TB/sec equivalent)
- After Level 1 (hardware): 100 kHz (100 GB/sec digitized)
- After Level 2 (online farm): 5 kHz (5 GB/sec)
- After Level 3 (online farm): 300 Hz (250 MB/sec)
- Goal for 2015: 500 Hz at Level 3

2012: 25 PB of data taken

Worldwide LHC community: O(1000) physicists distributed worldwide
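The trigger figures quoted on this slide imply a large reduction at every stage. The following is a minimal Python sketch of that arithmetic. It assumes that the quoted rates pair with the trigger levels in decreasing order (40 MHz before Level 1, 300 Hz after Level 3) and that units are decimal (1 TB = 10^12 bytes); the stage labels and function names are illustrative and do not come from the slides.

```python
# (stage label, event rate in Hz, data rate in bytes/s), as quoted on the slide.
# Assumption: the rates follow the trigger chain in decreasing order.
STAGES = [
    ("collisions", 40e6, 1000e12),
    ("after Level 1", 100e3, 100e9),
    ("after Level 2", 5e3, 5e9),
    ("after Level 3", 300.0, 250e6),
]


def reduction_factors(stages):
    """Return the event-rate reduction factor between consecutive stages."""
    return [prev[1] / cur[1] for prev, cur in zip(stages, stages[1:])]


def bytes_per_event(stages):
    """Return the implied average event size in bytes at each stage."""
    return [data_rate / event_rate for _, event_rate, data_rate in stages]


if __name__ == "__main__":
    for (label, _, _), size in zip(STAGES, bytes_per_event(STAGES)):
        print(f"{label}: {size / 1e6:.2f} MB per event")
    print("reduction factors:", reduction_factors(STAGES))
    print("overall reduction:", STAGES[0][1] / STAGES[-1][1])
```

Under these assumptions, Level 1 reduces the event rate by a factor of 400, and the whole chain reduces it by roughly 1.3 x 10^5.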

Page 12: Worldwide LHC Computing Grid – Hierarchical Tier Structure

Hierarchy of services, response times and availability:

- 1 Tier-0 center at CERN
  - copy of all raw data (tape)
  - first pass reconstruction
- 11 Tier-1 centers worldwide
  - 2 to 3 distributed copies of raw data
  - large-scale data reprocessing
  - storage of simulated data from Tier-2 centers
  - tape storage
- ~150 Tier-2 centers worldwide
  - user analysis
  - simulations

[Figure: tier topology drawn as a hierarchy and as a mesh (hierarchical model relaxed). Courtesy of Ian Bird, CERN]
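The tier model on this slide can be written down as a small data structure. The sketch below is only an illustration: the `Tier` type, its field names and the `tiers_with_role` helper are hypothetical, while the center counts and role descriptions are copied from the slide text (the Tier-2 count is approximate there).

```python
from typing import NamedTuple, Tuple


class Tier(NamedTuple):
    """One level of the tier model; the field names are illustrative."""
    name: str
    centers: int  # number of centers (the Tier-2 figure is approximate)
    roles: Tuple[str, ...]


WLCG_TIERS = (
    Tier("Tier-0", 1, ("copy of all raw data (tape)", "first pass reconstruction")),
    Tier("Tier-1", 11, (
        "2 to 3 distributed copies of raw data",
        "large-scale data reprocessing",
        "storage of simulated data from Tier-2 centers",
        "tape storage",
    )),
    Tier("Tier-2", 150, ("user analysis", "simulations")),
)


def tiers_with_role(tiers, keyword):
    """Return the names of tiers whose role descriptions mention the keyword."""
    keyword = keyword.lower()
    return [t.name for t in tiers if any(keyword in role.lower() for role in t.roles)]


if __name__ == "__main__":
    print(tiers_with_role(WLCG_TIERS, "tape"))
    print(tiers_with_role(WLCG_TIERS, "user analysis"))
```

Querying for "tape" returns the Tier-0 and Tier-1 levels, which matches the slide: tape storage is a duty of the upper tiers, while user analysis happens at Tier-2.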

Page 13: Big Data in science: DNA sequencing

[Figure: axis labels MB, GB]

Page 14: Big Data in science: synchrotron light sources

[Figure: ANKA @ KIT. Source: Wikipedia]

Page 15: Big Data in science: synchrotron light sources

Dectris Pilatus 6M
- 2463 x 2527 pixels
- 7 MB images
- 25 frames/s
- 175 MB/s
- Several TB/day

Consequences
- Data doesn’t fit on a USB drive any more
- Users are usually not affiliated with the synchrotron lab
- Users from physics, biology, chemistry, material sciences, …
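The detector figures on this slide follow from simple arithmetic: image size times frame rate gives the sustained data rate, and the rate times the acquisition time gives the daily volume. The sketch below reproduces this in Python. The function names and the `duty_cycle` parameter (the fraction of the day the detector actually records) are assumptions for illustration, and units are decimal (1 TB = 10^6 MB).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(image_size_mb, frames_per_s):
    """Data rate in MB/s for a detector writing fixed-size images."""
    return image_size_mb * frames_per_s


def daily_volume_tb(rate_mb_per_s, duty_cycle=1.0):
    """Daily volume in TB for a given fraction of the day spent acquiring."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return rate_mb_per_s * SECONDS_PER_DAY * duty_cycle / 1e6


if __name__ == "__main__":
    rate = sustained_rate_mb_per_s(7, 25)  # slide values: 7 MB images, 25 frames/s
    print(f"rate: {rate} MB/s")
    print(f"continuous: {daily_volume_tb(rate):.2f} TB/day")
    # Hypothetical duty cycle of 25%, not a figure from the slides.
    print(f"25% duty cycle: {daily_volume_tb(rate, duty_cycle=0.25):.2f} TB/day")
```

Continuous acquisition would produce about 15 TB per day, so the "several TB/day" quoted on the slide is consistent with a detector that records for only part of the day.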

Page 16: Big Data in science: high throughput imaging

Imaging machines / microscopes
- 1 to 100 frames/s => up to 800 MByte/s => O(10) TBytes/day

Reconstruction of zebrafish early embryonic development

Page 17: Big Data in science

- Many research areas where data growth is very fast
  - Biology, chemistry, earth sciences, …
- Data sets have become too big to take home
- Data rates require dedicated IT infrastructures to record and store
- Data analysis requires farms and clusters; single PCs are not sufficient
- Collaborations require distributed infrastructures and networks
- Data management becomes a challenge
- Fewer people with IT experience and IT interest than, e.g., in physics

Page 18: Definition of Data Science

[Figure: Venn diagram by Drew Conway (IA Ventures), with added labels: Physicist; Biologist, chemist, …]

Page 19: KIT infrastructures: GridKa

- German WLCG Tier-1 Center
- Supports all LHC experiments + Belle II + several small communities and older experiments
- >10,000 cores
- Disk space: 12 PB, tape space: 17 PB
- 6x10 Gbit/s network connectivity
- ~15% of LHC data permanently stored at GridKa
- Services: file transfer, workload management, file catalog, …
- Global Grid User Support (GGUS): service development and operation of the trouble ticket system for the world-wide LHC Grid
- Annual international GridKa School
  - 2013: ~140 participants from 19 countries
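To put the storage and network figures for GridKa in relation to each other, the sketch below estimates how long a given data volume would take to cross a network link at its nominal speed. It is a rough illustration only: it assumes decimal units, ideal throughput with no protocol overhead, and that the six 10 Gbit/s links can be used in parallel. None of these assumptions come from the slides, and `transfer_time_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_time_days(volume_pb, link_gbit_per_s):
    """Days needed to move volume_pb petabytes at link_gbit_per_s gigabits per second."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    volume_bits = volume_pb * 1e15 * 8
    seconds = volume_bits / (link_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Slide values: 12 PB of disk space, 6 x 10 Gbit/s connectivity.
    print(f"{transfer_time_days(12, 6 * 10):.1f} days")
```

Even under these ideal conditions, moving the full 12 PB of disk content would take more than two weeks, which illustrates why data at this scale stays at dedicated facilities.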

Page 20: GridKa Experiences

- evolving demands and usage patterns
- no common workflows
- hardware is commodity, software is not
- hierarchical storage with tape is challenging
- data access and I/O is the central issue
- Different users / user communities have different data access methods and access patterns!
- on-site experiment representation highly useful

Page 21: KIT infrastructure: Large Scale Data Facility

Main goals
- provision of storage for multiple research groups at KIT and U-Heidelberg
- support of research groups in data analysis

Resources and access
- 6 PB of on-line storage
- 6 PB of archival storage
- 100 GbE connection between LSDF@KIT and U-Heidelberg
- analysis cluster of 58*8 cores
- variety of storage protocols
- jointly funded by Helmholtz Association and state of Baden-Württemberg

Page 22: LSDF experiences

- high demand for storage, analysis and archival
- research groups vary in
  - research topics (from genetic sequencing to geophysics)
  - size
  - IT expertise
  - need for services and protocols
- Important needs common to many user groups
  - sharing data with other groups
  - data security and preservation
  - ‘consulting’
- many small groups depend on LSDF

Page 23: The Large Scale Data Management and Analysis (LSDMA) project: facts and figures

- Helmholtz portfolio extension
- initial project duration: 2012-2016
- partners:
- project coordinator: Achim Streit (KIT)
- sustainability: inclusion of activities into respective Helmholtz program-oriented funding in 2015
- next annual international symposium: September 24th at KIT

Page 24: Scientific Data Life Cycle

Page 25: LSDMA: Dual Approach

Data Life Cycle Labs (DLCLs)
- Joint R&D with scientific user communities
  - optimization of the data life cycle
  - community-specific data analysis tools and services

Data Services Integration Team (DSIT)
- Generic methods R&D
  - data analysis tools and services common to several DLCLs
  - interface between federated data infrastructures and DLCLs/communities

Page 26: Selected LSDMA activities (I)

DLCL Energy (KIT, U-Ulm)
- analyzing stereoscopic satellite images for estimating the efficiency of solar energy with Hadoop
- privacy policies for personal energy data

DLCL Key Technologies (KIT, U-Heidelberg, U-Dresden)
- optimization of tomographical reconstruction using data-intensive computing
- visualization for high throughput microscopy

DLCL Health (FZJ)
- workflow support for data-intensive parameter studies
- efficient metadata administration and indexing

Page 27: Selected LSDMA activities (II)

DLCL Earth&Environment (KIT, DKRZ)
- MongoDB for data and metadata of meteorological satellite data
- Data replication within the European EUDAT project using iRODS

DLCL Structure of Matter (DESY, GSI, HTW)
- Development of a portal for PETRA-III data
- Determining the computing requirements for FAIR data analysis

DSIT (all partners)
- Federated identity management
- Archive
- Federated storage (e.g. dCache)
- …

Page 28: LSDMA Challenges

Communities differ in
- previous knowledge
- level of specification of the data life cycle
- tools and services used

Needs driven by
- increasing amount of data
- cooperation between groups
- policies
  - open access/data
  - long-term preservation

Within communities
- focus on data analysis
- high fluctuation of computing experts
- running tools and services

Lessons learned
- interoperable AAI crucial
- data privacy very challenging, both legally and technically
- communities need evolution, not revolution
- needs can be very specific

Page 29: Summary and Outlook

- data facilities and R&D very important for KIT
- extensive experience at GridKa and LSDF
- wide variety of user communities
- often very specific needs
- interoperable AAI and privacy are crucial topics
- Today, data is important to basically all research topics
- more projects on state, national and international levels to come
- LSDMA: research on generic data methods, workflows and services, plus community-specific support and R&D