braintalk cuso nm
DESCRIPTION
Thomas Heinis is a post-doctoral researcher in the database group at EPFL. His research focuses on scalable data management algorithms for large-scale scientific applications. Thomas is a part of the "Human Brain Project" and currently works with neuroscientists to develop the data management infrastructure necessary for scaling up brain simulations. Prior to joining EPFL, Thomas completed his Ph.D. in the Systems Group at ETH Zurich, where he pursued research in workflow execution systems as well as data provenance.
TRANSCRIPT
![Page 1: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/1.jpg)
Analyzing and Querying Big Scientific Data
Thomas Heinis
![Page 2: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/2.jpg)
Data-Driven Scientific Discovery
[Image labels: Human Brain Project, SDSS, LHC ATLAS]
Scientists Are Overwhelmed with Big Data
Large Hadron Collider: 12 Petabytes / experiment
Sloan Digital Sky Survey: 4 Petabytes / year
Human Brain Project: ~100 Gigabytes / sec
![Page 3: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/3.jpg)
Scientific Data Growth
[Figure: cumulative size of datasets in petabytes (y-axis, 0 to 10) by year (x-axis, 2004 to 2014) for Astronomy [NRAO], Physics [LHC], Simulation [ICESS], and Gene Sequencing [EBI].]
Scientific Data Grows Exponentially!
![Page 4: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/4.jpg)
Data in the Simulation Sciences
DURATION: Increasing simulation duration
COVERAGE: Increasing model size by order of magnitude
RESOLUTION: Increasing level of detail
Dimensions are Multiplicative!
![Page 5: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/5.jpg)
What is the Human Brain Project?
A 10-year European initiative to understand the human brain, enabling advances in neuroscience, medicine and future computing.
A consortium of 250+ Scientists, 135 Research Groups, from over 80 institutions, and more than 20 countries in Europe and beyond.
![Page 6: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/6.jpg)
Human Brain Project - Vision
Future Medicine: Symptom-based to biology-based classification; Unique signatures of diseases; Early diagnosis
Future Neuroscience: Multi-level view of brain; Causal chain of events from genes to cognition
Future Computing: Supercomputing as scientific method; Human-like intelligence
![Page 7: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/7.jpg)
Brain Simulation – Wet Lab
Neuron structure & electrophysiological properties:
![Page 8: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/8.jpg)
Simulating the Brain
![Page 9: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/9.jpg)
Simulation Science Data Challenges
[Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data, Spatial Analysis]
Static 3D Exploration
Interactive 3D Exploration
Dynamic 3D Exploration
Need Scalable Spatial Access Methods
![Page 11: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/11.jpg)
Static Exploration
Neural Tissue Model
Single Neuron 3D Model
3D Spatial Range Query
Efficient Spatial Index is Crucial
![Page 12: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/12.jpg)
State-of-the-Art Spatial Indexes
R-Tree: Hierarchy of Minimum Bounding Rectangles (MBR)
R-Tree Variants: Hilbert packed R-Tree, STR R-Tree, PR-Tree
[Figure labels: Overlap, Range Query]
Structural Overlap Degrades Performance
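To illustrate why overlap hurts, here is a minimal Python sketch (not taken from the talk) of a range query over a two-level hierarchy of bounding rectangles. The `Node` class, the 2D boxes, and the `nodes_read` counter are illustrative assumptions. The query lies where two sibling MBRs overlap, so both subtrees are read although only one holds a result.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax); 2D for brevity


def intersects(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Box
    children: list["Node"] = field(default_factory=list)  # empty for leaf entries


def range_query(node: Node, query: Box, stats: dict) -> list[Box]:
    """Return leaf boxes intersecting `query`; count every node that is read."""
    stats["nodes_read"] += 1
    if not node.children:
        return [node.mbr]
    results = []
    for child in node.children:
        if intersects(child.mbr, query):
            results += range_query(child, query, stats)
    return results


# Two inner nodes whose MBRs overlap in the region x in [3, 5], y in [0, 5].
a = Node((0, 0, 5, 5), [Node((0, 0, 1, 1)), Node((4, 4, 5, 5))])
b = Node((3, 0, 7, 7), [Node((3, 0, 4, 1)), Node((6, 6, 7, 7))])
root = Node((0, 0, 7, 7), [a, b])

stats = {"nodes_read": 0}
hits = range_query(root, (4.2, 4.2, 4.8, 4.8), stats)
print(hits, stats)
```

In this toy tree the second inner node is read without contributing any result; that extra read is the overhead caused by overlap.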
![Page 13: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/13.jpg)
Scalability Challenge
Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries for each experiment.
[Figure: query time in seconds (y-axis, 0 to 300) versus dataset density in millions of elements per unit volume (x-axis, 50 to 450) for Hilbert R-Tree, STR R-Tree, and PR-Tree.]
Spatial Density Increases with Dataset Size
State of the Art Does Not Scale with Density
![Page 14: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/14.jpg)
FLAT: A Two Phase Spatial Index
Use Connectivity To Avoid Overlap
Key Idea: Two phases, each independent of overlap:
1) SEEDING: Find any one object
2) CRAWLING: Traverse neighborhood
Requires Reachability
![Page 15: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/15.jpg)
FLAT: Reachability Problem
REQUIREMENTS:
Connectivity: For accessing neighboring objects in data.
Convex Dataset Geometry: Never crawl outside the query bound.
Not every dataset satisfies this requirement!
[Figure labels: No Connectivity; No path inside query; Earthquake simulations datasets: No Problem!]
![Page 16: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/16.jpg)
FLAT: Reachability
Index Building:
1) Partitioning: Group spatially close elements
2) Linking: Connect neighboring partitions
Add Connectivity → Enable Recursive Crawling
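The two index-building steps can be sketched in Python as follows. The slides do not say how elements are grouped, so this sketch assumes a uniform grid: every non-empty grid cell becomes a partition, and partitions in adjacent cells are linked. The function names and the `cell` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_partitions(points, cell):
    """Partitioning: group spatially close 3D points by uniform grid cell."""
    parts = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell) for c in p)
        parts[key].append(p)
    return dict(parts)


def link_partitions(parts):
    """Linking: connect each partition to the non-empty partitions in adjacent cells."""
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    links = {}
    for key in parts:
        neighbors = []
        for off in offsets:
            nkey = tuple(k + d for k, d in zip(key, off))
            if nkey in parts:
                neighbors.append(nkey)
        links[key] = neighbors
    return links


points = [(0.5, 0.5, 0.5), (0.6, 0.4, 0.5), (1.5, 0.5, 0.5), (5.5, 5.5, 5.5)]
parts = build_partitions(points, cell=1.0)
links = link_partitions(parts)
print(links)
```

The isolated point at (5.5, 5.5, 5.5) ends up in a partition without neighbors, which shows why linking alone does not guarantee reachability when the data has gaps.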
![Page 17: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/17.jpg)
FLAT: Seeding Phase
R-Tree for seeding, but will it scale with density?
Range Query: Find ALL elements inside query
Seed Query: Find ANY ONE element inside query
Where child nodes overlap, the seed query picks one child arbitrarily.
Seeding phase avoids overlap overhead in R-Tree
Seeding is fast: page reads ≈ height of tree.
[Figure labels: Seed R-Tree, Seed Query, Overlap]
![Page 18: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/18.jpg)
FLAT: Crawling Phase
The neighbor links are used for recursive graph traversal, starting from the seed page.
Linear complexity in terms of graph edges
[Figure labels: Seed Partition, Range Query]
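A minimal Python sketch of the two phases, under stated assumptions: partitions are stored in a dictionary with hypothetical `mbr`, `elements`, and `neighbors` fields, and a linear scan stands in for the seed R-Tree descent. It is an illustration of seeding followed by crawling, not the authors' implementation.

```python
from collections import deque


def intersects(a, b):
    """Boxes are (min_corner, max_corner) pairs of equal dimension."""
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i] for i in range(len(a[0])))


def contains(box, p):
    return all(box[0][i] <= p[i] <= box[1][i] for i in range(len(p)))


def seed(partitions, query):
    """Seeding: return the id of ANY ONE partition that intersects the query.
    Assumption: a linear scan replaces the seed R-Tree used in the talk."""
    for pid, part in partitions.items():
        if intersects(part["mbr"], query):
            return pid
    return None


def crawl(partitions, query, start):
    """Crawling: breadth-first traversal over neighbor links, never leaving the query."""
    seen, queue, results = {start}, deque([start]), []
    while queue:
        part = partitions[queue.popleft()]
        results += [p for p in part["elements"] if contains(query, p)]
        for n in part["neighbors"]:
            if n not in seen and intersects(partitions[n]["mbr"], query):
                seen.add(n)
                queue.append(n)
    return results


def range_query(partitions, query):
    start = seed(partitions, query)
    return [] if start is None else crawl(partitions, query, start)


# Four partitions in a row, each linked to its left and right neighbor.
partitions = {
    0: {"mbr": ((0, 0), (1, 1)), "elements": [(0.5, 0.5)], "neighbors": [1]},
    1: {"mbr": ((1, 0), (2, 1)), "elements": [(1.5, 0.5)], "neighbors": [0, 2]},
    2: {"mbr": ((2, 0), (3, 1)), "elements": [(2.5, 0.5)], "neighbors": [1, 3]},
    3: {"mbr": ((3, 0), (4, 1)), "elements": [(3.5, 0.5)], "neighbors": [2]},
}
found = range_query(partitions, ((1.2, 0), (2.8, 1)))
print(found)
```

Each partition is visited at most once and each link is examined a bounded number of times, which matches the slide's remark that crawling is linear in the number of graph edges.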
![Page 19: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/19.jpg)
FLAT: Performance Evaluation
Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries for each experiment.
[Figure: query time in seconds (y-axis, 0 to 300) versus dataset density in millions of elements per unit volume (x-axis, 50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT; annotated 7.8x.]
Spatial Density Increases with Dataset Size
Decouples Execution Time from Density
![Page 20: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/20.jpg)
FLAT: Scalability
Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk.
Range Queries: 500 uniform random queries for each experiment.
[Figure: time per result object in ms (y-axis, 0.5 to 5) versus dataset density in millions of elements per unit volume (x-axis, 50 to 450) for Hilbert R-Tree, STR R-Tree, PR-Tree, and FLAT.]
Seeding cost amortizes with increase in result cardinality
Trend is “FLAT”, Scales With Density
![Page 21: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/21.jpg)
FLAT: iPad Implementation
http://www.youtube.com/watch?v=zaUEARq-IY0
![Page 22: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/22.jpg)
Simulation Science Data Challenges
[Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data]
Static 3D Exploration
Interactive 3D Exploration
Dynamic 3D Exploration
![Page 23: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/23.jpg)
Interactive Exploration
Spatial Range Query Sequences
[Figure labels: Guiding Path; Neural Network; Bronchial Tree of the Lung; Arterial Tree of the Heart]
Guided Analysis Ubiquitous in Scientific Applications
![Page 24: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/24.jpg)
Interactive Query Execution
Guiding paths are not known in advance
Interactive execution of query sequence
[Timeline figure labels: DISK (Retrieve Query Results), CPU (Process Results), Time, 1st Query, 2nd Query, 3rd Query, Prefetching Opportunity, Path decided after processing results, Prediction, Prefetch Data]
Predict next query location in the sequence
Prefetch data of next query into prefetch cache
Predictive Prefetching Hides Data Retrieval Cost
![Page 25: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/25.jpg)
Predictive Prefetching
Existing techniques: Extrapolate past query locations
Exponential Weighted Moving Average (EWMA), Straight Line, Hilbert Prefetching
[Figure: cache hit rate in % (y-axis, 0 to 50) versus volume of query in µm³ (x-axis, 10k to 220k), from small volume queries to large volume queries. Neuroscience data set, 25 queries in sequence.]
Not Efficient With Arbitrary Query Volume!
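For reference, location extrapolation can be sketched in a few lines of Python. The slides only name the techniques, so the details are assumptions: query locations are represented by their center points, the straight-line predictor repeats the last displacement, and the EWMA predictor smooths the per-step displacements with a hypothetical weight `alpha` before adding the result to the last center.

```python
def straight_line(centers):
    """Repeat the last displacement: next = last + (last - previous)."""
    prev, last = centers[-2], centers[-1]
    return tuple(2 * last[i] - prev[i] for i in range(3))


def ewma(centers, alpha=0.5):
    """Smooth the displacements between consecutive centers, then extrapolate."""
    steps = [tuple(b[i] - a[i] for i in range(3)) for a, b in zip(centers, centers[1:])]
    avg = steps[0]
    for s in steps[1:]:
        avg = tuple(alpha * s[i] + (1 - alpha) * avg[i] for i in range(3))
    last = centers[-1]
    return tuple(last[i] + avg[i] for i in range(3))


centers = [(0, 0, 0), (1, 0, 0), (2, 1, 0)]
print(straight_line(centers), ewma(centers))
```

Both predictors look only at where past queries were, not at what they returned, which is the limitation the slide points out.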
![Page 26: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/26.jpg)
SCOUT: Content Aware Prefetching
Key Insight: Use previous query content!
Approach:
1. Inspect query results
2. Identify guiding path
3. Predict next query using guiding path
Need to Identify Guiding Path
![Page 27: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/27.jpg)
SCOUT: How paths are defined
Query results = many primitive spatial objects.
Idea: Graph Framework
G(V,E) such that vertices = spatial objects, edges between nearby objects.
Independence from data representation
Exact graph: N² comparisons!
Grid hash based construction: Approximate Graph Representation
[Figure label: Range Query]
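A grid-hash construction of the approximate graph could look like the Python sketch below. It assumes each spatial object is reduced to a single 3D point and that two objects are linked when they fall into the same or adjacent grid cells; the function name and the `cell` parameter are hypothetical. Only objects in neighboring cells are compared, which avoids the N² pairwise comparisons of the exact graph.

```python
from collections import defaultdict
from itertools import product


def build_graph(objects, cell):
    """Return undirected edges (i, j), i < j, between objects in the same or adjacent cells."""
    grid = defaultdict(list)
    for idx, p in enumerate(objects):
        grid[tuple(int(c // cell) for c in p)].append(idx)

    edges = set()
    for key, members in grid.items():
        for off in product((-1, 0, 1), repeat=3):
            nkey = tuple(k + d for k, d in zip(key, off))
            for i in members:
                for j in grid.get(nkey, ()):  # .get does not create missing cells
                    if i < j:
                        edges.add((i, j))
    return edges


objects = [(0.1, 0.1, 0.1), (0.9, 0.2, 0.1), (1.1, 0.1, 0.1), (5.0, 5.0, 5.0)]
edges = build_graph(objects, cell=1.0)
print(sorted(edges))
```

The graph is approximate because two linked objects can be up to two cell widths apart along an axis; a smaller `cell` tightens the approximation at the cost of more cells.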
![Page 28: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/28.jpg)
SCOUT: Guiding Path Identification
Iterative Candidate Pruning
Key Insight: Guiding path goes through all queries!
[Figure: queries n, n+1, n+2, n+3 along the guiding path; candidate set of paths; predicted query.]
Longer Sequence → Better Prediction
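The pruning idea can be sketched in Python under a strong simplifying assumption: every query result is summarized as the set of identifiers of the paths that cross it (the identifiers are hypothetical; the slides do not say how candidates are represented). Since the guiding path goes through all queries, each new query removes every candidate it does not contain.

```python
def prune_candidates(query_results):
    """query_results: one set of path ids per query, in sequence order.
    Returns the candidate set after each query."""
    candidates = set(query_results[0])
    history = [set(candidates)]
    for ids in query_results[1:]:
        candidates &= ids  # keep only paths that also cross this query
        history.append(set(candidates))
    return history


sequence = [
    {"a", "b", "c", "d"},  # query n
    {"b", "c", "d", "e"},  # query n+1
    {"c", "d"},            # query n+2
    {"c", "x"},            # query n+3
]
history = prune_candidates(sequence)
print([len(h) for h in history])
```

The candidate set can only shrink as the sequence grows, which is one way to read the slide's remark that a longer sequence gives a better prediction.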
![Page 29: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/29.jpg)
SCOUT: Where to Prefetch
Prefetch duration not known in advance.
Query dimension not known in advance.
Idea: Incremental Prefetching
Repeatedly prefetch growing regions by extrapolating guiding path
Independence from query size
Policy = safest region first
[Figure: guiding path leaving the nth query in the sequence at its exit; prefetch regions p1, p2, ..., pn.]
![Page 30: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/30.jpg)
SCOUT: Prediction Accuracy
Dataset: 100K neurons, 450 Million 3D cylinders, 27 GB on disk
Cache Hit Rate = (Amount of data retrieved from cache / Total amount of data retrieved) x 100
[Figure: cache hit rate in % (y-axis, 0 to 100) for EWMA, Straight Line, Hilbert, and SCOUT on Sequence 1 and Sequence 2. Query volumes: 80K µm³ and 20K µm³; sequence length: 32 for both. Annotations: Visualization, Speedup 2x, Speedup 14.7x.]
72% - 91% Prediction Accuracy
SCOUT speeds up sequences up to 14.7x
![Page 31: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/31.jpg)
SCOUT: Scalability
Increase in Data set Size
[Figure: SCOUT cache hit rate in % (y-axis, 0 to 100) versus data set size in number of spatial objects (x-axis, 50M to 450M).]
SCOUT scales with increase in data set size
SCOUT Overhead
[Timeline figure labels: CPU, DISK, Retrieve Query Results, Processing Results, Prediction, Prefetching, 2nd Query, 3rd Query, Time]
[Figure: time in seconds (y-axis, 0 to 200) for Prediction and Retrieve Query Results versus data set size in number of spatial objects (x-axis, 50M to 450M). Annotations: Selectivity increases; 15-16%.]
![Page 32: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/32.jpg)
Simulation Science Data Challenges
[Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data]
Static 3D Exploration
Interactive 3D Exploration
Dynamic 3D Exploration
![Page 33: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/33.jpg)
Dynamic Exploration
Mesh: Collection of 3D Connected Polyhedra
Mesh → Enable High Precision 3D Models
[Figure labels: Polyhedra, Connected Polyhedra, Volumetric Mesh Model, 3D Vertices, Shared Faces]
Challenge: Monitoring Memory Resident Spatial Mesh Models
![Page 34: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/34.jpg)
Monitoring Mesh Simulations
Problem: Efficiently Execute Range Queries
[Timeline figure labels: Time step 1, Time step 2, Time step 3; Simulation Time step (Updates) alternating with Monitor (Queries); time]
![Page 35: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/35.jpg)
Data Challenge
Highly Dynamic: Unpredictable Mesh Movement; Updates Affect Entire Dataset
Mesh Detail: Mesh Detail Increases With Dataset Size
[Figure labels: Timestep 1, Timestep 2; Now, Future]
Need: Solution That Scales
![Page 36: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/36.jpg)
State of the Art
Moving Object Indexes: TPR-Tree, STRIPES
Mesh Movement is Inherently Unpredictable
Static Spatial Indexes: R-Tree, LUR-Tree, QU-Trade
Linear Scan
[Figure labels: Coarse Grained, Fine Grained]
Neither Scales with Size nor Detail!
![Page 37: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/37.jpg)
Performance Evaluation
SETUP:
Neural Mesh Dataset: 1.32 billion tetrahedra (33 GB)
15 queries per time step, 60 simulation time steps
[Timeline figure labels: Simulation Time step (Massive Updates), Monitor (Few Queries), time]
[Figure: total query response time in seconds (y-axis, 0 to 8000) for LinearScan, OCTREE, LUR-Tree, and QU-Trade; panel title "Statistical Analysis Microb..."; annotations: 99.5%, 80%, 72%, Maintenance.]
Linear Scan Outperforms Indexed Approaches
Not Enough Queries to Invest on Index Maintenance
![Page 38: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/38.jpg)
Can We Do Better?
Best of Both Worlds
Reduce Search Space → Index Approach
No Maintenance → Linear Scan
OCTOPUS: Idea
Mesh Connectivity → Query Execution
Not Rely on External Data Structure: → Directly use in-memory Mesh Data
Mesh Graph Traversal: → Retrieve Results in Spatial Proximity
[Figure labels: Mesh Graph, Vertices, Edges]
Key Insight: Use Mesh Connectivity to Retrieve Query Results!
![Page 39: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/39.jpg)
OCTOPUS
Update Oblivious Query Execution
[Figure: range query over the mesh at Time step 1, Time step 2, Time step 3]
What About Non-Convex Meshes?
![Page 40: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/40.jpg)
OCTOPUS: Non-Convex Meshes
No Reachability! → Surface Scan
Using Mesh Surface Guarantees Accuracy
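A rough Python sketch of surface scan followed by mesh graph traversal is shown below. It is an illustration under stated assumptions, not the published algorithm: the mesh is given as hypothetical `positions` and `adjacency` dictionaries plus a list of `surface` vertex ids, and the query is an axis-aligned box. No separate index is kept, so moving the vertices between time steps requires no maintenance.

```python
from collections import deque


def inside(box, p):
    lo, hi = box
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def octopus_query(positions, adjacency, surface, box):
    """Return the ids of vertices inside `box` that are reachable from the surface."""
    # Surface scan: test only the surface vertices against the query.
    seeds = [v for v in surface if inside(box, positions[v])]
    # Graph traversal: follow mesh edges without leaving the query.
    seen, queue = set(seeds), deque(seeds)
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            if n not in seen and inside(box, positions[n]):
                seen.add(n)
                queue.append(n)
    return seen


# A chain of five vertices along the x-axis; the two end points form the surface.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
surface = [0, 4]
box = ((-0.5, -1.0, -1.0), (2.5, 1.0, 1.0))

step1 = {v: (float(v), 0.0, 0.0) for v in adjacency}
step2 = {v: (float(v) + 1.0, 0.0, 0.0) for v in adjacency}  # mesh moved by +1 in x

result1 = octopus_query(step1, adjacency, surface, box)
result2 = octopus_query(step2, adjacency, surface, box)
print(result1, result2)
```

This sketch returns nothing for a query that contains no surface vertex; the slides do not describe how that case is handled, so it is left out here.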
![Page 41: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/41.jpg)
OCTOPUS: Mesh Deformation
Deformation: Zero Cost of surface maintenance
[Figure labels: Time step 1, Time step 2, Time step 3; Graph changes]
Scales With Massive Updates
![Page 42: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/42.jpg)
OCTOPUS: Mesh Detail
Surface Points: Quadratic Increase
Non-Surface Points: Cubic Increase
Scales with Mesh Resolution
Scalability: Surface grows slower than volume (and therefore dataset size)!
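The surface-versus-volume argument can be checked with a small Python calculation. As a stand-in for a real mesh, it assumes a cubic lattice of n × n × n points, where a point is on the surface if it lies on the outermost layer; the function name is hypothetical.

```python
def grid_point_counts(n):
    """Return (surface_points, interior_points) of an n x n x n point lattice."""
    total = n ** 3
    interior = max(n - 2, 0) ** 3
    return total - interior, interior


for n in (10, 100):
    surface, interior = grid_point_counts(n)
    share = 100 * surface / (surface + interior)
    print(f"n={n}: surface={surface}, interior={interior}, surface share={share:.2f}%")
```

Increasing the resolution tenfold in each dimension multiplies the total point count by a thousand, while the surface share drops from roughly half to a few percent, so a scan restricted to the surface touches a shrinking fraction of the data.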
![Page 43: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/43.jpg)
OCTOPUS: Performance
[Figure: total query execution time in seconds (y-axis, 0 to 9000) for OCTOPUS, LinearScan, OCTREE, LUR-Tree, and QU-Trade on the Statistical Analysis Microbenchmark and the Visualization Microbenchmark; annotated speedups 8X and 7.3X.]
7.3-8X Speedup
![Page 44: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/44.jpg)
OCTOPUS: Scalability
SETUP: Queries: 15 uniform random queries per time step, 60 time steps
[Figure: total query execution time in seconds (y-axis, 0 to 1400) for LinearScan and OCTOPUS versus mesh detail in billions of tetrahedra (x-axis: 0.13, 0.17, 0.26, 0.52, 1.32); annotated 8X and 10X.]
Scales with Mesh Detail
OCTOPUS Breakdown
[Figure: total query execution time in seconds (y-axis, 0 to 140) split into Graph Traversal and Surface Scan versus mesh detail in billions of tetrahedra (x-axis: 0.13, 0.17, 0.26, 0.52, 1.32); annotated 64% and 41%.]
![Page 45: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/45.jpg)
Algorithm Overview
[Diagram labels: Observational Data, Spatial Modeling, Simulation, Post Simulation Data, Spatial Analysis, Model Validation]
OCTOPUS: ICDE’14
FLAT: ICDE’12
SCOUT: VLDB’12
TOUCH: SIGMOD’13
GIPSY: SSDBM’13
![Page 46: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/46.jpg)
Impact
Human Brain Project:
Part of the toolset used every day
February 2013: first 10 million neuron model built
Still 4 orders of magnitude smaller than human brain
General Applicability:
Material Sciences
Astronomy
Geographical Information Systems
[Figure: model size in GB (y-axis, 0 to 30) versus simulation size in number of neurons (x-axis: 1K, 10K, 100K, 10M), with points labeled 2006, 2008, 2010, and 2013 (2.5 TB).]
![Page 47: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/47.jpg)
Future Challenges
Enable Scientific Breakthroughs via Scalable Data Analysis!
Address Scientific Data Trends:
→ Progressively Complex Datasets
→ Increasingly Complex Scientific Queries
→ Modern Hardware
Approximate Queries on Big Data:
→ Use Mechanism of Learning & Forgetting to manage Data Synopses
![Page 48: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/48.jpg)
HBP Data Management Challenges
Data Privacy/Anonymization
Scalable Querying of Petascale Data
Cloud Analytics
Quick & efficient access to raw data
Distributed Workflow Execution
Provenance/Reproducibility
Data Personalization
![Page 49: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/49.jpg)
Conclusions
Enabling data exploration is key to scientific discovery.
Prior spatial access methods do not scale with data growth.
Use Spatial Connectivity to achieve scalability.
→ Explicitly Added (FLAT & TOUCH)
→ Implicitly Present in the Dataset (OCTOPUS & SCOUT)
Many exciting big data management challenges remain!
![Page 50: Braintalk cuso nm](https://reader035.vdocuments.net/reader035/viewer/2022062513/554e8accb4c90526358b4a2b/html5/thumbnails/50.jpg)
Thank You!
Collaborators: Farhan Tauheed, Anastasia Ailamaki, Felix Schürmann, Henry Markram, Sadegh Nobari, Panagiotis Karras, Laurynas Biveinis, Mirjana Pavlovic