KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
Steinbuch Centre for Computing (SCC)
www.kit.edu
The Large Scale Data Management and Analysis Project (LSDMA)
Dr. Andreas Heiss, SCC, KIT
2 September 12, 2013 Dr. Andreas Heiss
Overview
- Introducing KIT and SCC
- Big Data
- Infrastructures at KIT: GridKa and the Large Scale Data Facility (LSDF)
- Large Scale Data Management and Analysis (LSDMA)
- Summary and Outlook
Introducing KIT
KIT is both
- a state university with research and teaching, and
- a research center of the Helmholtz Association with program-oriented provident research
Objectives:
- research
- teaching
- innovation
Numbers:
- 24,000 students
- 9,400 employees
- 3,200 PhD researchers
- 370 professors
- 790 million EUR annual budget in 2012
Introducing Steinbuch Centre for Computing
- Provisioning and development of IT services for KIT and beyond
- R&D: High Performance Computing, Grids and Clouds, Big Data
- ~200 employees in total: 50% scientists; 50% technicians, administrative personnel and student assistants
- Named after Karl Steinbuch, professor at Karlsruhe University, creator of the term “Informatik” (German term for computer science)
Big Data
[Figure: Google Trends comparison of the search terms “Cloud computing”, “Big Data” and “Grid Computing”, 2010 to 2013]
Big Data 2000 years ago
“In those days Caesar Augustus issued a decree that a census should be taken of the entire Roman world.” (Luke 2,1)
- clearly defined purpose for collecting data: tax lists of all taxpayers
- data collection: distributed, analog, time-consuming
- distributed storage of data
- tedious data aggregation
Big Data today
One buzzword ... various challenges!
Industry:
- Data mining
- Business intelligence
- Get additional information from (often) already existing data
- Data aggregation
- Typically O(10) or O(100) TB
- New field to make money!
- Products
- Services
- Market shared between some ‘big players’ and many start-ups / spin-offs!
Science:
- Handling huge amounts of data
- Petabytes
- Distributed data sources and/or storage
- (Global) data management
- High throughput
- Data preservation
Definition of Data Science
Venn diagram by Drew Conway (IA Ventures)
Big Data in science: LHC at CERN
Goals:
- search for the origin of mass
- understanding the early state of the universe
LHC:
- went live in 2008
- four detectors
- main discovery until now: a Higgs boson
Trigger chain (event rate and data rate):
- Collisions: 40 MHz (1,000 TB/sec equivalent)
- Level 1 (hardware): 100 kHz (100 GB/sec digitized)
- Level 2 (online farm): 5 kHz (5 GB/sec)
- Level 3 (online farm): 300 Hz (250 MB/sec)
- Goal for 2015: 500 Hz at Level 3
2012: 25 PB of data taken
World-wide LHC community
O(1000) physicists distributed worldwide
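The trigger figures above imply a reduction factor at each stage. A rough sketch of that arithmetic follows, assuming the quoted rates are the outputs of successive stages in decreasing order and that units are decimal (1 TB = 10^12 bytes). The stage labels and function names are illustrative and do not come from the slides.

```python
# Event rates (Hz) and data rates (bytes/s) as quoted on the slide.
# Assumption: each entry is the output of the named stage, in this order.
STAGES = [
    ("collisions", 40e6, 1000e12),
    ("level 1", 100e3, 100e9),
    ("level 2", 5e3, 5e9),
    ("level 3", 300.0, 250e6),
]


def reduction_factors(stages):
    """Return (stage name, event-rate reduction relative to the previous stage)."""
    return [
        (name, prev_rate / rate)
        for (_, prev_rate, _), (name, rate, _) in zip(stages, stages[1:])
    ]


def bytes_per_event(stage):
    """Average data volume per event implied by a stage's two rates."""
    _, rate_hz, bytes_per_s = stage
    return bytes_per_s / rate_hz


if __name__ == "__main__":
    for name, factor in reduction_factors(STAGES):
        print(f"{name}: event rate reduced by a factor of {factor:.1f}")
    overall = STAGES[0][1] / STAGES[-1][1]
    print(f"overall reduction: {overall:.0f}")
    size_mb = bytes_per_event(STAGES[-1]) / 1e6
    print(f"average size of a recorded event: {size_mb:.2f} MB")
```

Under these assumptions the three trigger levels reduce the event rate by factors of 400, 20 and about 16.7, and a recorded event averages a little under 1 MB.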
Worldwide LHC Computing Grid – Hierarchical Tier Structure
Hierarchy of services, response times and availability:
- 1 Tier-0 center at CERN: copy of all raw data (tape); first-pass reconstruction
- 11 Tier-1 centers worldwide: 2 to 3 distributed copies of raw data; large-scale data reprocessing; storage of simulated data from Tier-2 centers; tape storage
- ~150 Tier-2 centers worldwide: user analysis; simulations
[Figure: hierarchy vs. mesh; hierarchical model relaxed. Courtesy of Ian Bird, CERN]
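The tier structure can be written down as a small lookup table. The sketch below is only one way to encode the slide: the `Tier` class and the `tiers_with_role` helper are hypothetical names, and the Tier-2 count is approximate ("~150" on the slide).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    centers: int   # approximate for Tier-2 ("~150" on the slide)
    roles: tuple   # role descriptions taken from the slide


WLCG_TIERS = (
    Tier(0, 1, ("copy of all raw data (tape)", "first-pass reconstruction")),
    Tier(1, 11, (
        "distributed copies of raw data",
        "large-scale data reprocessing",
        "storage of simulated data from Tier-2 centers",
        "tape storage",
    )),
    Tier(2, 150, ("user analysis", "simulations")),
)


def tiers_with_role(keyword, tiers=WLCG_TIERS):
    """Return the tier levels whose role descriptions mention the keyword."""
    return [t.level for t in tiers if any(keyword in role for role in t.roles)]


if __name__ == "__main__":
    print("raw data held at tiers:", tiers_with_role("raw data"))
    print("tape used at tiers:", tiers_with_role("tape"))
```

Note that this encodes only the strict hierarchy; the relaxed, mesh-like model mentioned on the slide is not represented.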
Big Data in science: DNA sequencing
[Figure: only the axis labels MB and GB are preserved]
Big Data in science: synchrotron light sources
Source: Wikipedia
ANKA @ KIT
Big Data in science: synchrotron light sources
Dectris Pilatus 6M:
- 2463 x 2527 pixels
- 7 MB images
- 25 frames/s
- 175 MB/s
- Several TB/day
- Data no longer fits on a USB drive
- Users are usually not affiliated with the synchrotron lab
- Users from physics, biology, chemistry, material sciences, …
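The detector figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch assuming decimal units (1 TB = 10^6 MB); the `duty_cycle` parameter is a hypothetical knob for the fraction of the day spent acquiring data, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pixels_per_frame(width, height):
    """Number of pixels in one detector image."""
    return width * height


def rate_mb_per_s(image_mb, frames_per_s):
    """Sustained data rate for a given image size and frame rate."""
    return image_mb * frames_per_s


def volume_tb_per_day(mb_per_s, duty_cycle=1.0):
    """Daily data volume in TB; duty_cycle is an assumed acquisition fraction."""
    return mb_per_s * SECONDS_PER_DAY * duty_cycle / 1e6


if __name__ == "__main__":
    print("pixels per frame:", pixels_per_frame(2463, 2527))
    rate = rate_mb_per_s(7, 25)
    print("data rate (MB/s):", rate)
    print("TB/day if acquiring continuously:", volume_tb_per_day(rate))
```

The product of 7 MB and 25 frames/s reproduces the quoted 175 MB/s. Continuous acquisition would give roughly 15 TB/day, so "several TB/day" is consistent with the detector running only part of the time.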
Big Data in science: high throughput imaging
Imaging machines / microscopes:
- 1 to 100 frames/s => up to 800 MByte/s => O(10) TBytes/day
Reconstruction of zebrafish early embryonic development
Big Data in science
Data growth is very fast in many research areas: biology, chemistry, earth sciences, …
- Data sets have become too big to take home
- Data rates require dedicated IT infrastructures to record and store
- Data analysis requires farms and clusters; single PCs are not sufficient
- Collaborations require distributed infrastructures and networks
- Data management becomes a challenge
- Fewer IT-experienced and IT-interested people than, e.g., in physics
Definition of Data Science
[Figure: Venn diagram by Drew Conway (IA Ventures), annotated with “Physicist” and “Biologist, chemist, …”]
KIT infrastructures: GridKa
- German WLCG Tier-1 center
- Supports all LHC experiments + Belle II + several small communities and older experiments
- >10,000 cores
- Disk space: 12 PB, tape space: 17 PB
- 6 x 10 Gbit/s network connectivity
- ~15% of LHC data permanently stored at GridKa
- Services: file transfer, workload management, file catalog, …
- Global Grid User Support (GGUS): service development and operation of the trouble ticket system for the world-wide LHC Grid
- Annual international GridKa School; 2013: ~140 participants from 19 countries
GridKa Experiences
- evolving demands and usage patterns
- no common workflows
- hardware is a commodity, software is not
- hierarchical storage with tape is challenging
- data access and I/O is the central issue
- different users / user communities have different data access methods and access patterns!
- on-site experiment representation is highly useful
KIT infrastructure: Large Scale Data Facility
Main goals:
- provision of storage for multiple research groups at KIT and U-Heidelberg
- support of research groups in data analysis
Resources and access:
- 6 PB of on-line storage
- 6 PB of archival storage
- 100 GbE connection between LSDF@KIT and U-Heidelberg
- analysis cluster of 58*8 cores
- variety of storage protocols
- jointly funded by Helmholtz Association and state of Baden-Württemberg
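To put the storage and network figures in relation, one can estimate how long it would take to move the full on-line storage volume over the 100 GbE link. This is a back-of-the-envelope sketch with decimal units; the `efficiency` parameter is a hypothetical fraction of nominal bandwidth actually achieved, and nothing in the slides says such a bulk transfer is ever performed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_time_days(volume_pb, link_gbit_per_s, efficiency=1.0):
    """Days needed to move volume_pb petabytes over a link.

    Decimal units (1 PB = 1e15 bytes, 1 Gbit = 1e9 bits).
    efficiency is an assumed fraction of the nominal bandwidth.
    """
    bits = volume_pb * 1e15 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = transfer_time_days(6, 100)
    print(f"6 PB over 100 Gbit/s at full rate: {days:.1f} days")
```

At the nominal rate, 6 PB would occupy the link for about 5.6 days, which illustrates why data at this scale is analyzed close to where it is stored.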
LSDF experiences
- high demand for storage, analysis and archival
- research groups vary in: research topics (from genetic sequencing to geophysics); size; IT expertise; need for services and protocols
- important needs common to many user groups: sharing data with other groups; data security and preservation; ‘consulting’
- many small groups depend on LSDF
The Large Scale Data Management and Analysis (LSDMA) project: facts and figures
- Helmholtz portfolio extension
- initial project duration: 2012-2016
- partners: [partner logos not preserved]
- project coordinator: Achim Streit (KIT)
- sustainability: inclusion of activities into respective Helmholtz program-oriented funding in 2015
- next annual international symposium: September 24th at KIT
Scientific Data Life Cycle
LSDMA: Dual Approach
Data Life Cycle Labs (DLCLs): joint R&D with scientific user communities
- optimization of the data life cycle
- community-specific data analysis tools and services
Data Services Integration Team (DSIT): generic methods R&D
- data analysis tools and services common to several DLCLs
- interface between federated data infrastructures and DLCLs/communities
Selected LSDMA activities (I)
DLCL Energy (KIT, U-Ulm):
- analyzing stereoscopic satellite images with Hadoop to estimate the efficiency of solar energy
- privacy policies for personal energy data
DLCL Key Technologies (KIT, U-Heidelberg, U-Dresden):
- optimization of tomographic reconstruction using data-intensive computing
- visualization for high-throughput microscopy
DLCL Health (FZJ):
- workflow support for data-intensive parameter studies
- efficient metadata administration and indexing
Selected LSDMA activities (II)
DLCL Earth & Environment (KIT, DKRZ):
- MongoDB for data and metadata of meteorological satellite data
- data replication within the European EUDAT project using iRODS
DLCL Structure of Matter (DESY, GSI, HTW):
- development of a portal for PETRA-III data
- determining the computing requirements for FAIR data analysis
DSIT (all partners):
- federated identity management
- archive
- federated storage (e.g. dCache)
- …
LSDMA Challenges
Communities differ in:
- previous knowledge
- level of specification of the data life cycle
- tools and services used
Needs driven by:
- increasing amount of data
- cooperation between groups
- policies: open access/data; long-term preservation
Within communities:
- focus on data analysis
- high fluctuation of computing experts
- running tools and services
Lessons learned:
- interoperable AAI is crucial
- data privacy is very challenging, both legally and technically
- communities need evolution, not revolution
- needs can be very specific
Summary and Outlook
- data facilities and R&D are very important for KIT
- extensive experience at GridKa and LSDF
- wide variety of user communities
- often very specific needs
- interoperable AAI and privacy are crucial topics
- today, data is important to basically all research topics
- more projects on state, national and international levels to come
- LSDMA: research on generic data methods, workflows and services, plus community-specific support and R&D