masters thesis defense presentation
![Page 1: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/1.jpg)
TEXT MINING APPLIED TO SQL QUERIES: A CASE STUDY FOR SDSS SKYSERVER
Vitor Hirota Makiyama
Advised by Dr. Rafael D. C. dos Santos
Master in Applied Computing
National Institute for Space Research
![Page 2: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/2.jpg)
Outline
1. Introduction
• SDSS & SkyServer
• SkyServer as a data mining tool
• Log Analysis
• Thesis motivation
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
![Page 3: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/3.jpg)
SDSS & SkyServer
• The Sloan Digital Sky Survey
• 15 years of operation, 4th iteration
• 3D map of over 1/3 of the sky, and
• 5+ million spectra
• SkyServer
• The Internet portal to SDSS, providing data access tools to the catalog
![Page 4: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/4.jpg)
SkyServer as a data mining tool

Find all galaxies without saturated pixels within 1' of a given point

DECLARE @saturated BIGINT;
SET @saturated = dbo.Fphotoflags('saturated');
SELECT G.objid, GN.distance
INTO ##results
FROM galaxy AS G
JOIN Fgetnearbyobjeq(185, -0.5, 1) AS GN ON G.objid = GN.objid
WHERE ( G.flags & @saturated ) = 0
ORDER BY distance

Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7

DECLARE @qso INT;
SET @qso = dbo.Fspecclass('QSO');
DECLARE @hiZ_qso INT;
SET @hiZ_qso = dbo.Fspecclass('HIZ-QSO');
SELECT s.specobjid, Max(l.sigma * 300000.0 / l.wave) AS veldisp, Avg(s.z) AS z
INTO ##results
FROM specobj s, specline l
WHERE s.specobjid = l.specobjid
AND ( ( s.specclass = @qso ) OR ( s.specclass = @hiZ_qso ) )
AND l.sigma * 300000.0 / l.wave > 2000.0
AND s.zconf > 0.9
GROUP BY s.specobjid
![Page 5: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/5.jpg)
SkyServer as a data mining tool
![Page 6: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/6.jpg)
SkyServer as a data mining tool
![Page 7: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/7.jpg)
Log Analysis
![Page 8: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/8.jpg)
Log Analysis
2006 2014
![Page 9: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/9.jpg)
Motivation
Apply text mining techniques over the SQL logs to define a methodology to parse, clean and tokenize statements into an intermediate numerical representation for data mining.
![Page 10: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/10.jpg)
Outline
1. Introduction
2. Theory review
• Text Mining
• Information Retrieval
• Clustering
3. Methodology
4. Experimental results
5. Conclusion
![Page 11: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/11.jpg)
Text Mining
Knowledge Discovery in Databases:
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Fayyad et al. (1996)
![Page 12: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/12.jpg)
Text Mining
Text Mining:
The discovery by computer of new, previously unknown information by automatically extracting information from different written sources.
Fayyad et al. (1996)
![Page 13: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/13.jpg)
Text Mining
Miner, Gary. Practical text mining and statistical analysis for non-structured text data applications. Academic Press (2012)
![Page 14: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/14.jpg)
Zipf’s Law
Singh et al. (2006)
![Page 15: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/15.jpg)
Term Weighting
• To balance term significance within a document collection, accounting for terms that are too common or too rare.
• TF*IDF assigns the largest weight to terms that arise with high frequency in individual documents, but are at the same time relatively rare in the collection as a whole.
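The TF*IDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the weighting actually used in the thesis: the raw term count as TF, the natural-log IDF `log(N / df)`, the function name `tf_idf`, and the three toy "documents" of clause-tagged tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (raw tf, idf = log(N / df))."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical tokenized queries, used only to exercise the function.
docs = [
    ["select_objid", "from_photoobj", "where_r", "where_r"],
    ["select_objid", "from_specobj", "where_class"],
    ["select_objid", "from_specobj", "where_z"],
]
weights = tf_idf(docs)
```

A term that appears in every document (`select_objid` here) gets weight zero, while a term that is frequent in one document and absent elsewhere (`where_r`) gets the largest weight, which is the balancing effect the slide describes.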
![Page 16: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/16.jpg)
Clustering
The exploratory procedure that organizes a collection of patterns into natural groupings based on a given association measure.
![Page 17: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/17.jpg)
Association Measures
Strehl et al. (2000)
![Page 18: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/18.jpg)
K-Means
1. Choose k cluster centers.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships.
4. If the convergence criterion is not met, go to step 2.
Manning et al. (2009)
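The four K-Means steps can be sketched in plain Python. This is a minimal illustration, not the implementation used in the thesis: the Euclidean distance, the caller-supplied initial centers, the "centers stopped moving" convergence test, and the names `k_means` and `euclidean` are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, centers, max_iter=100):
    """K-means on tuples of floats; `centers` holds the initial guesses (step 1)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its current members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        # Step 4: stop once the centers no longer change, else repeat from step 2.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = k_means(points, centers=[points[0], points[2]])
```

The result depends on the initial centers, so in practice the procedure is usually repeated from several starting points.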
![Page 19: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/19.jpg)
![Page 20: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/20.jpg)
Fuzzy C-Means
• Fuzzy extension to traditional K-Means, where every pattern belongs to every cluster with varying degrees of membership.
• Cluster validity metrics:
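The idea of graded membership can be made concrete with the standard Fuzzy C-Means update rules. The sketch below is an illustration under stated assumptions, not the thesis code: it uses Euclidean distance and a fuzzifier `m = 2`, and the function names `fcm_memberships` and `fcm_centers` are hypothetical. It does not implement any cluster validity metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fcm_memberships(points, centers, m=2.0):
    """Membership update: u[i][j] is the degree to which point i belongs to cluster j."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [euclidean(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on one or more centers belongs only to those.
            zeros = dists.count(0.0)
            row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
        else:
            row = [
                1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                for d_j in dists
            ]
        memberships.append(row)
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Center update: mean of the points weighted by membership raised to m."""
    centers = []
    for j in range(len(memberships[0])):
        w = [row[j] ** m for row in memberships]
        total = sum(w)
        centers.append(
            tuple(sum(wi * x for wi, x in zip(w, coords)) / total
                  for coords in zip(*points))
        )
    return centers


# One-dimensional toy data: a point at 1.0 and a point midway between the centers.
u = fcm_memberships([(1.0,), (2.0,)], centers=[(0.0,), (4.0,)])
```

Each row of `u` sums to 1, and the point midway between the two centers is shared equally, unlike in K-Means where it would be assigned to exactly one cluster. Alternating the two functions until the memberships stabilize gives the full algorithm.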
![Page 21: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/21.jpg)
Self-Organizing Maps
• Artificial neural network (ANN) that performs unsupervised, competitive learning.
Yin (2008)
![Page 22: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/22.jpg)
Self-Organizing Maps
• Maps high-dimensional data into a regular low-dimensional grid
• Reduces the original data dimension while preserving relationships of the data
• Particularly interesting for visualization with the U-Matrix
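The competitive learning behind a Self-Organizing Map can be sketched as an online training loop. This is a simplified illustration, not the configuration used in the thesis: the Gaussian neighborhood, the linear decay of learning rate and radius, the parameter defaults, and the names `train_som` and `best_matching_unit` are assumptions, and the U-Matrix is not computed here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(weights, x):
    """Index of the grid node whose weight vector is closest to x (the winner)."""
    return min(range(len(weights)), key=lambda i: euclidean(weights[i], x))


def train_som(weights, grid, data, epochs=20, lr0=0.5, sigma0=1.0):
    """Online SOM training; weights[i] belongs to the node at grid position grid[i]."""
    weights = [list(w) for w in weights]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            # Learning rate and neighborhood radius shrink linearly over time.
            frac = 1.0 - step / total_steps
            lr = lr0 * frac
            sigma = max(sigma0 * frac, 1e-3)
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Neighborhood is measured on the grid, not in data space.
                grid_dist = euclidean(grid[i], grid[bmu])
                h = math.exp(-grid_dist ** 2 / (2 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
            step += 1
    return weights


# Two nodes on a 1-D grid and a single training sample, for one update step.
trained = train_som([[0.0], [1.0]], grid=[(0,), (1,)], data=[[1.0]], epochs=1)
```

The winning node and its grid neighbors are all pulled toward the sample, which is what lets nearby grid cells end up representing similar inputs.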
![Page 23: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/23.jpg)
Outline
1. Introduction
2. Theory review
3. Methodology
• SQL queries to feature vectors
• Data mining
4. Experimental results
5. Conclusion remarks
![Page 24: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/24.jpg)
Methodology
Fayyad et al. (1996)
![Page 25: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/25.jpg)
SQL queries to feature vectors
Preprocessing
• Parsing
• Cleaning / Tokenization
• Normalization
Transformation
![Page 26: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/26.jpg)
SQL queries to feature vectors
SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z, platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n, specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid AND s.plateid = platex.plateid AND class = 'star' AND p.r >= 14 AND p.r <= 22.5 AND p.g >= 15 AND p.g <= 23 AND platex.plate = 2803

select objid ra dec u g r i z plate fiberid elodiefeh
from photoobj fgetnearbyobjeq specobj platex
where objid objid logic objid bestobjid logic plateid plateid logic class logic r logic r logic g logic g logic plate
![Page 27: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/27.jpg)
SQL queries to feature vectors
select_objid 1
select_ra 1
select_dec 1
select_u 1
select_g 1
select_r 1
select_i 1
select_z 1
select_plate 1
select_fiberid 1
select_elodiefeh 1
from_photoobj 1
from_fgetnearbyobjeq 1
from_specobj 1
from_platex 1
where_objid 3
where_logic 8
where_bestobjid 1
where_plateid 2
where_class 1
where_r 2
where_g 2
where_plate 1
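The step from the normalized statement to the clause-tagged term counts can be sketched as follows. This is a minimal illustration that starts from the already normalized token string shown on the slide; it does not parse raw SQL, and the function name `clause_term_counts` and the assumption that only `select`, `from` and `where` act as clause markers are hypothetical choices for the example.

```python
from collections import Counter

# Assumption: only these three keywords open a new clause in the normalized form.
CLAUSES = {"select", "from", "where"}


def clause_term_counts(normalized_query):
    """Prefix each token with the clause it appears in and count the occurrences."""
    counts = Counter()
    clause = None
    for token in normalized_query.split():
        if token in CLAUSES:
            clause = token  # a clause keyword switches context; it is not counted
        elif clause is not None:
            counts[f"{clause}_{token}"] += 1
    return counts


normalized = (
    "select objid ra dec u g r i z plate fiberid elodiefeh "
    "from photoobj fgetnearbyobjeq specobj platex "
    "where objid objid logic objid bestobjid logic plateid plateid logic "
    "class logic r logic r logic g logic g logic plate"
)
counts = clause_term_counts(normalized)
```

Tagging tokens with their clause keeps `select_objid` and `where_objid` as separate features, so a column that is only returned is distinguished from one that is used as a filter.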
![Page 28: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/28.jpg)
SQL queries to feature vectors
Preprocessing
• Parsing
• Cleaning / Tokenization
• Normalization
Transformation
• Term Weighting
• Scaling
Term Document matrix
![Page 29: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/29.jpg)
Data mining
Preprocessing
• Parsing
• Cleaning / Tokenization
• Normalization
Transformation
• Term Weighting
• Scaling
Term Document matrix
Data Mining
• Fuzzy C-Means
• Self-Organizing Maps
![Page 30: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/30.jpg)
Outline
1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
![Page 31: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/31.jpg)
Number of clusters with FCM
![Page 32: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/32.jpg)
Number of clusters with FCM
![Page 33: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/33.jpg)
Visualization with SOM
![Page 34: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/34.jpg)
Visualization with SOM

22: Finding spectra by classification (object type)
select top 100 specobjid
from specobj
where class = 'star' and zwarning = 0

43: QSOs by spectroscopy
select top 100 specobjid, z
from specobj
where class = 'qso' and zwarning = 0

Cosine distance:
Term-Frequency: 0.0205
SOM U-Matrix: 0.0

39: Classifications from Galaxy Zoo
select objid, nvote, p_el as elliptical, p_cw as spiralclock, p_acw as spiralanticlock, p_edge as edgeon, p_dk as dontknow, p_mg as merger
from zoonospec
where objid = 1237656495650570395

39B: Classifications from Galaxy Zoo
select top 100 g.objid, zns.nvote, zns.p_el as elliptical, zns.p_cw as spiralclock, zns.p_acw as spiralanticlock, zns.p_edge as edgeon, zns.p_dk as dontknow, zns.p_mg as merger
from galaxy as g join zoonospec as zns on g.objid = zns.objid
where g.clean=1 and zns.nvote >= 10 and zns.p_cw > 0.8

Cosine distance:
Term-Frequency: 0.1610
SOM U-Matrix: 0.0
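For reference, the cosine distance between two query vectors can be computed as in the sketch below. It is a generic illustration: the function name `cosine_distance` and the toy vectors are assumptions, and it is not expected to reproduce the values reported above, which depend on the term weighting and scaling applied in the thesis.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical term-count vectors for two queries that differ by one selected column.
q_a = {"select_specobjid": 1.0, "from_specobj": 1.0, "where_class": 1.0}
q_b = {"select_specobjid": 1.0, "select_z": 1.0, "from_specobj": 1.0, "where_class": 1.0}
distance = cosine_distance(q_a, q_b)
```

Because the measure depends only on the angle between vectors, two queries that share most of their terms end up close together even when one of them is longer.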
![Page 35: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/35.jpg)
Outline
1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
![Page 36: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/36.jpg)
Conclusions
• A methodology for proper parsing, cleaning and tokenization of SQL statements into feature vectors was defined, which can be used for KDD.
• Preprocessing and transformation can be tuned according to the data mining goal.
• Foreseen applications include:
• Detailed SQL and database usage statistics
• Query recommendation systems
• Running time prediction
![Page 37: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/37.jpg)
Publications
• Clustering SQL queries to analyse database usage, IASC Satellite for the ISI WSC Conference, 2015
• Text Mining Applied to SQL Queries: A Case Study for the SDSS SkyServer, 2nd International Symposium on Information Management and Big Data, 2015
![Page 38: Masters Thesis Defense Presentation](https://reader035.vdocuments.net/reader035/viewer/2022062503/58eef09c1a28abd9188b45fb/html5/thumbnails/38.jpg)
Thank you!