Large Scale Analytical Data Management
Database Research
Data Mgmt Systems Research
• SIGMOD, TODS, PVLDB, ICDE, VLDBJ
  – major industry connections (billion $/year)
Expanding Topic Set & Societal Impact
  – Data Stream Processing
  – Data Mining
  – Information Extraction, Text Retrieval
  – RDF and Graph data management
  – MapReduce + Cloud
  – Data Privacy
DB Research Highlights (1/4)
Data Storage and Query – efficiency/scalability
• Computer architecture vs DBMS architecture
  – Columnar storage
  – Fast Compression Methods
  – Differential Storage Techniques (Positional Delta Trees)
  – Vectorized Execution
  – Robust Query Execution (“micro adaptivity”)
  – Just-In-Time (JIT) Compilation
  – Cooperative Scans – sharing scarce I/O bandwidth
• http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
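As a rough illustration of how columnar storage and vectorized execution fit together, the sketch below stores each attribute in its own array and evaluates a filter-plus-aggregate query one vector of values at a time. Everything in it is an assumption made for illustration: the table layout, the column names, the vector size of 1024, and the helper functions are hypothetical and are not taken from MonetDB or Vectorwise. Pure Python loops stand in for the tight compiled primitives a real engine would use.

```python
from array import array

VECTOR_SIZE = 1024  # assumed batch size, small enough to stay cache-resident


def select_less_than(col, start, end, bound):
    """Selection primitive: positions in [start, end) where col[i] < bound."""
    return [i for i in range(start, end) if col[i] < bound]


def sum_product(col_a, col_b, positions):
    """Aggregation primitive: sum of col_a[i] * col_b[i] over selected positions."""
    return sum(col_a[i] * col_b[i] for i in positions)


def vectorized_query(table, bound):
    """SELECT sum(price * discount) WHERE quantity < bound, one vector at a time."""
    quantity = table["quantity"]
    price = table["price"]
    discount = table["discount"]
    total = 0.0
    for start in range(0, len(quantity), VECTOR_SIZE):
        end = min(start + VECTOR_SIZE, len(quantity))
        # Only the columns the query needs are touched (columnar storage),
        # and each primitive processes a whole vector per call.
        selected = select_less_than(quantity, start, end, bound)
        total += sum_product(price, discount, selected)
    return total


# Hypothetical four-row table, stored column by column.
table = {
    "quantity": array("i", [10, 30, 5, 50]),
    "price": array("d", [100.0, 200.0, 300.0, 400.0]),
    "discount": array("d", [0.5, 0.5, 0.5, 0.5]),
}
print(vectorized_query(table, 24))  # 200.0
```

The per-vector loop is the point of the design: interpretation overhead is paid once per vector rather than once per row, while intermediate results stay small.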
DB Research Highlights (2/4)
Commodity Cluster Computing - Cloud
• Various MonetDB Cluster Projects
  – Shared-nothing data storage, query optimization
• Hadoop VectorWise (VU MSc projects)
  – cluster scalability & failover
  – Tightly integrated Hadoop/YARN/HDFS
• CWI SciLens cluster
  – Amdahl number > 1: large I/O resources
  – Other uses: web crawl analysis, 500 billion triple BI BSBM benchmark
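The Amdahl number mentioned above is commonly defined, following Amdahl's balanced-system rule of thumb, as bits of sequential I/O per second divided by instructions per second; a value near or above 1 indicates a machine with I/O capacity to match its CPU. The sketch below assumes that definition. The node figures are hypothetical and are not measurements of the SciLens cluster.

```python
def amdahl_number(io_bytes_per_sec: float, instructions_per_sec: float) -> float:
    """Bits of sequential I/O per second, per instruction per second."""
    return (io_bytes_per_sec * 8) / instructions_per_sec


# Hypothetical node, numbers chosen only for illustration.
node_io = 1.0e9    # 1 GB/s of sequential I/O
node_cpu = 4.0e9   # 4 billion instructions per second
print(amdahl_number(node_io, node_cpu))  # 2.0
```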
DB Research Highlights (3/4)
Adaptive Indexing
• DBA expertise extremely scarce
• Science workloads hard to predict & variable
Database Cracking: “every query is advice on how to store the data”
continuous self-steering data reorganization
+ Approximate Query Execution on Samples
+ Recycling – exploit overlap in workloads
+ Fingerprint Indexing – exploit local correlations
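The idea behind database cracking can be sketched in a few lines: each range query partitions only the piece of the column that contains its bounds, and the bounds are remembered as pivots, so later queries touch ever smaller pieces. This is a minimal sketch under simplifying assumptions (a single in-memory column, list-based partitioning, no updates). The class and method names are hypothetical and do not reflect the MonetDB implementation.

```python
import bisect


class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)  # cracker column: a copy, reorganized by queries
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.pivots) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger  # reorganize only this piece
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in storage order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
print(sorted(col.range_query(5, 12)))  # [6, 7, 8, 9, 11]
print(col.pivots)                      # [5, 12]
```

After the first query the column is split into three pieces (below 5, from 5 up to 12, and 12 or above), so a later query with bounds inside one piece only reorganizes that piece.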
DB Research Highlights (4/4)
Support for non-tabular data
• Text (retrieval)
• Scientific
  – Data vaults: directly query FITS, GeoTIFF, BEM, MSEED, ..
  – SciQL: Arrays as 1st class database objects
  – MonetDB.R: using columns as arrays (and vice versa)
• Semantic Data – RDF
  – “automatically discovering schemas in LOD data”
    • Bridge gap between RDF and relational
• Graph Data Management
  – Benchmark development
Application Areas
– Business Intelligence
  • Marketing/Sales, Fraud Detection, Churn (spin-offs)
  • Social network analysis (LDBC)
– Security
  • Digital Forensics (NFI - XIRAF)
  • ...
– Science
  • Astronomy (LOFAR transient search)
  • Meteorology (Earthquake Analysis - KNMI)
– Linked Data
  • Open government (LOD2)
Areas of Activity
[Diagram. Labels: Data; Understand and decide; Analyze and model; Store and process.
Areas shown: Reasoning, Knowledge representation, Multimedia Retrieval, Modeling and simulation, Machine Learning, Information Retrieval, Decision Theory, Business Analytics, Visual Analytics, Distributed Processing, Large Scale Databases, Software Eng., System / Network Eng.]
Data Science Education
enormous demand for (“big”) data scientists
• Possibilities/limitations of wide array of techniques
  – Information extraction, cleaning
  – Ranking, retrieval
  – Data Mining, and its applications
  – DB principles (Q-opt, query processing algorithms, storage techniques)
• Understand key performance factors
  – Latency vs bandwidth
  – Networks, computer architecture
  – Algorithm optimization techniques
• Practical skills
  – Modern Software engineering methods
  – Rapid prototyping languages
  – Solving problems using Hadoop clusters
proposal: “Extreme Data Management” MSc course
Opportunities: CWI
• Database Architecture Group
  – research, application, data science experience
  – MonetDB, Vectorwise technologies
  – SciLens: data-intensive large compute cluster
• CWI motivators
  – Dual Appointments
  – Data Science MSc education
    • Attracting top students into MSc projects / PhD
  – DSRC co-positioning in future research funding
Conclusion
• Database research present in Amsterdam
  – research, application, valorisation
• Data Science Education!
  – Proposal: Extreme Data Management course
• ..DSRC and the CWI..