Data Processing over Very Large Relational Databases
DESCRIPTION
Final presentation of my dissertation thesis, focused on orientation in, analysis of, and finding information in large or unknown relational databases, and on data visualization.
TRANSCRIPT
![Page 1: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/1.jpg)
Data Processing over Very Large Databases
Ing. Ľuboš Takáč
Supervisor: doc. Ing. Michal Zábovský, PhD.
Faculty of Management Science and Informatics
University of Žilina
![Page 2: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/2.jpg)
Large Databases
• VLDB (very large databases)
• Relational Databases with hundreds of tables and millions of rows
![Page 3: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/3.jpg)
The Problem
• How can we understand a relational database (RDB) model well enough to find information in it?
• Orientation in a large RDB is hard because of the complexity of the RDB model.
• Modification and further development of the RDB.
![Page 4: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/4.jpg)
Existing approaches
• Database metrics
• Database visualization
• Database to ontology mapping and examination of ontology
![Page 5: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/5.jpg)
Database Metrics
• A database metric is a function that assigns a numeric value to an object from the database.
• Examples of table metrics
– DRT(T) – depth of relational tree
– TS(T) – table size
– RD(T) – referential degree
– …
• Rankings – combining several metrics with different weights.
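The idea of a ranking as a weighted combination of table metrics can be sketched as follows. This is a minimal illustration, not the method from the thesis: the toy schema, the normalization by the maximum value, and the weights are all hypothetical, TS(T) is taken to be the row count, and RD(T) is assumed to count the foreign keys declared in T.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema (not from the presentation):
# row count per table, and foreign keys as (referencing table, referenced table).
ROW_COUNTS: Dict[str, int] = {
    "customers": 200, "orders": 1000, "order_items": 5000, "products": 80,
}
FOREIGN_KEYS: List[Tuple[str, str]] = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]


def table_size(table: str) -> int:
    """TS(T): here taken to be the number of rows in the table."""
    return ROW_COUNTS[table]


def referential_degree(table: str) -> int:
    """RD(T): assumed here to be the number of foreign keys declared in T."""
    return sum(1 for child, _parent in FOREIGN_KEYS if child == table)


def normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale metric values to [0, 1] by dividing by the maximum."""
    top = max(values.values()) or 1
    return {table: value / top for table, value in values.items()}


def ranking(weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of normalized metrics, highest score first."""
    metrics = {
        "TS": normalize({t: table_size(t) for t in ROW_COUNTS}),
        "RD": normalize({t: referential_degree(t) for t in ROW_COUNTS}),
    }
    scores = {
        table: sum(w * metrics[name][table] for name, w in weights.items())
        for table in ROW_COUNTS
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for table, score in ranking({"TS": 0.5, "RD": 0.5}):
        print(f"{table}: {score:.3f}")
```

Changing the weights shifts the emphasis, for example towards large tables or towards heavily connected ones.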
![Page 6: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/6.jpg)
RDB Visualization
• Database schema visualization.
• A standard ER diagram is insufficient for a large RDB model.
![Page 7: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/7.jpg)
![Page 8: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/8.jpg)
![Page 9: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/9.jpg)
SchemaBall
• Visualization of large or complex RDB schemas.
• Using RDB metrics and rankings.
• We implemented and enhanced this solution.
![Page 10: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/10.jpg)
SchemaBall
![Page 11: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/11.jpg)
Visualization of RDB schema graph
• Vertex- and edge-weighted graph based on RDB metrics.
• Using Gephi for visualization
– automatically generated layout
– interactive visualization (selection and examination of nodes and edges)
– using graph algorithms
![Page 12: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/12.jpg)
![Page 13: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/13.jpg)
![Page 14: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/14.jpg)
Analysis of the RDB Graph
• Three approaches
– graph of the RDB model (vertex – table, edge – foreign key relation)
– alternative (vertex – table, edge – foreign key relation for each tuple)
– graph of tuples (vertex – tuple, edge – foreign key relation between tuples)
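The difference between the three graph constructions can be illustrated on a few rows of data. This is a minimal sketch with hypothetical tables and keys; it reads the second approach as "one edge between two tables for every referencing tuple", which is an interpretation of the slide rather than a confirmed definition.

```python
from collections import Counter

# Hypothetical tuple-level references: (referencing tuple, referenced tuple),
# where a tuple is identified as (table name, primary key).
REFERENCES = [
    (("orders", 10), ("customers", 1)),
    (("orders", 11), ("customers", 1)),
    (("orders", 12), ("customers", 2)),
]


def degrees(edges):
    """Count how many edge endpoints touch each vertex."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts


# Approach 1: vertices are tables, one edge per foreign key relation.
# Assumption: each table pair here corresponds to a single foreign key.
schema_edges = {(src[0], dst[0]) for src, dst in REFERENCES}

# Approach 2: vertices are tables, one edge per referencing tuple.
table_multi_edges = [(src[0], dst[0]) for src, dst in REFERENCES]

# Approach 3: vertices are tuples, edges are references between tuples.
tuple_edges = REFERENCES

schema_degrees = degrees(schema_edges)
multi_degrees = degrees(table_multi_edges)
tuple_degrees = degrees(tuple_edges)

if __name__ == "__main__":
    print("approach 1:", dict(schema_degrees))
    print("approach 2:", dict(multi_degrees))
    print("approach 3:", dict(tuple_degrees))
```

The same data yields very different vertex degrees under each approach, which is why the degree distributions on the following slides are examined separately.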
![Page 15: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/15.jpg)
Analysis of the RDB Graph – first approach
[Figure: distribution function of vertex degree. x-axis: vertex degree (1 to 29); y-axis: probability (0.00 to 0.16)]
![Page 16: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/16.jpg)
Analysis of the RDB Graph – second approach
[Figure: distribution function of vertex degree. x-axis: vertex degree (0 to 1.35E+07); y-axis: probability (0 to 1)]
![Page 17: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/17.jpg)
Analysis of the RDB Graph – third approach
[Figure: distribution function of vertex degree. x-axis: vertex degree; y-axis: count]
![Page 18: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/18.jpg)
Analysis of the RDB Graph – Scale-free networks
• A connected graph whose vertex degrees follow a Yule-Simon distribution.
• $P(k) \sim k^{-\gamma}$, where $\gamma$ is usually between 2 and 3
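Scale-free networks are commonly characterised by a power-law tail in the degree distribution, P(k) ~ k^(-γ). The sketch below computes an empirical degree distribution and estimates γ with a widely used approximate maximum-likelihood formula for discrete data, γ ≈ 1 + n / Σ ln(k_i / (k_min − 0.5)). The degree sample is hypothetical and far too small for a reliable estimate; the presentation does not say how its exponent was obtained, so this is one possible approach rather than the author's.

```python
import math
from collections import Counter
from typing import Dict, List


def degree_distribution(degrees: List[int]) -> Dict[int, float]:
    """Empirical probability of each vertex degree."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in sorted(counts.items())}


def estimate_exponent(degrees: List[int], k_min: int = 1) -> float:
    """Approximate maximum-likelihood estimate of the power-law exponent.

    Only degrees >= k_min are used; k_min is a hypothetical cut-off
    that the caller has to choose.
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(math.log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence of a small schema graph.
SAMPLE_DEGREES = [1, 1, 1, 1, 2, 2, 3, 5]

if __name__ == "__main__":
    print(degree_distribution(SAMPLE_DEGREES))
    print(f"estimated exponent: {estimate_exponent(SAMPLE_DEGREES):.2f}")
```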
![Page 19: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/19.jpg)
Visualization of RDB schema network
![Page 20: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/20.jpg)
Analysis of the RDB Graph – Conclusion
• The RDB model is scale-free.
• To understand an RDB, you must first understand its centers (there are only a few of them).
• The metric NR(T) (number of references) is very useful, as validated by the analysis of the RDB graph.
• We created 2 new metrics based on the three approaches mentioned above.
![Page 21: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/21.jpg)
A Method for Analyzing Large RDB
• Find the components of the schema graph (tables = vertices, FKs = edges).
• Examine each component in order, largest first
– If the component is a single isolated table, it is very probably an archive; check this or look for another purpose.
– Otherwise, visualize it via an ER diagram, SchemaBall, or a graph using table metrics.
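The first two steps of the method can be sketched with a plain breadth-first search over the schema graph. The table names and foreign keys below are hypothetical; the sketch only finds the components, orders them largest first, and flags single-table components as archive candidates that still need to be checked by hand.

```python
from collections import defaultdict, deque


def connected_components(tables, foreign_keys):
    """Return the components of the schema graph, largest first.

    tables: iterable of table names (vertices)
    foreign_keys: iterable of (referencing table, referenced table) pairs,
                  treated as undirected edges
    """
    neighbours = defaultdict(set)
    for child, parent in foreign_keys:
        neighbours[child].add(parent)
        neighbours[parent].add(child)

    seen = set()
    components = []
    for start in tables:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            table = queue.popleft()
            component.append(table)
            for other in neighbours[table]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Hypothetical schema with two isolated tables.
TABLES = ["customers", "orders", "order_items", "products", "orders_A", "log_A"]
FOREIGN_KEYS = [
    ("orders", "customers"),
    ("order_items", "orders"),
    ("order_items", "products"),
]

COMPONENTS = connected_components(TABLES, FOREIGN_KEYS)
# Single-table components are only candidates; verify each one manually.
ARCHIVE_CANDIDATES = [c[0] for c in COMPONENTS if len(c) == 1]

if __name__ == "__main__":
    for component in COMPONENTS:
        print(len(component), component)
    print("archive candidates:", ARCHIVE_CANDIDATES)
```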
![Page 22: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/22.jpg)
Practical Example
• Unknown complex RDB
– 332 tables
– 2339 attributes
– 192 foreign keys
– Size: 2.4 GB
![Page 23: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/23.jpg)
All tables
![Page 24: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/24.jpg)
Archive Tables
• Every isolated table is an archive table, following the “_A” naming convention.
![Page 25: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/25.jpg)
Component A
![Page 26: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/26.jpg)
Component B
![Page 27: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/27.jpg)
![Page 28: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/28.jpg)
RDBAnalyzer
• supports all RDB systems that support JDBC; easily scalable; online connection
• features
– large online RDB schema visualization
– finding the components of the graph
– schema graph creation, visualization and export (Gephi)
– transformation of an RDB to a tuple graph
– metric charts, parallel coordinates visualization
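Schema graph creation and export can be sketched as reading foreign keys from the database catalogue and writing an edge list. RDBAnalyzer itself works over JDBC according to the slide; the sketch below is not its code and instead uses Python's built-in sqlite3 module with an in-memory database as a stand-in. The demo tables are hypothetical, and the CSV uses the `Source,Target` header that Gephi's spreadsheet import recognises for edge tables.

```python
import csv
import io
import sqlite3


def schema_graph(conn):
    """Return (tables, edges) where edges are (referencing, referenced) pairs."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    edges = []
    for table in tables:
        # SQLite returns one row per foreign key column; index 2 holds the
        # referenced table. Composite keys would therefore appear repeatedly.
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            edges.append((table, fk[2]))
    return tables, edges


def to_gephi_csv(edges):
    """Serialize edges as a CSV edge list with Source and Target columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()


def demo_connection():
    """Build a small hypothetical schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id)
        );
        CREATE TABLE orders_A (id INTEGER PRIMARY KEY);
    """)
    return conn


if __name__ == "__main__":
    tables, edges = schema_graph(demo_connection())
    print(tables)
    print(to_gephi_csv(edges))
```

For other database systems the catalogue query would differ; only the shape of the result (tables as vertices, foreign keys as edges) carries over.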
![Page 29: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/29.jpg)
RDBAnalyzer
![Page 30: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/30.jpg)
RDB to Ontology Mapping
– better understanding of, and searching for, information without knowledge of the RDB model; data mining from the RDB
– can be used by web search engines to search in RDBs
– lets people who do not understand RDB technology (laymen) get information from an RDB
– a method for merging multiple databases (ontology merging)
– interactive searching for information (Protégé)
![Page 31: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/31.jpg)
RDB Schema NORTHWIND (ER-Diagram)
![Page 32: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/32.jpg)
OntoGraf (Protégé)
![Page 33: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/33.jpg)
![Page 34: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/34.jpg)
How to Find Information in Ontologies
• using a query language (SPARQL)
• interactively (e.g. Protégé)
– using OntoGraf combined with text searching
– exploring entities and individuals
![Page 35: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/35.jpg)
![Page 36: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/36.jpg)
Disadvantages & Problems of RDBs Mapped to Ontologies
• It is difficult to keep the data up to date (static & dynamic ontology creation).
• Aggregate queries are very slow.
• Existing tools cannot cope with large RDBs (or large ontologies).
![Page 37: Data Processing over very Large Relational Databases](https://reader035.vdocuments.net/reader035/viewer/2022062707/5585064bd8b42ad71b8b525e/html5/thumbnails/37.jpg)
Conclusion & Scientific Contribution
• Design and creation of a method for orientation in, understanding of, and finding information in large or unknown relational databases (RDBAnalyzer supports these principles).
• Detection of RDB graph characteristics (scale-free network) and use of this knowledge to create 2 new metrics and validate 1 existing metric.
• Design and creation of a method for finding information in ontologies generated from an RDB.