masters thesis defense presentation

TEXT MINING APPLIED TO SQL QUERIES: A CASE STUDY FOR SDSS SKYSERVER Vitor Hirota Makiyama Advised by Dr. Rafael D. C. dos Santos Master in Applied Computing National Institute for Space Research

Upload: vitor-hirota-makiyama

Post on 13-Apr-2017


TRANSCRIPT

Page 1: Masters Thesis Defense Presentation

TEXT MINING APPLIED TO SQL QUERIES: A CASE STUDY FOR SDSS SKYSERVER

Vitor Hirota Makiyama
Advised by Dr. Rafael D. C. dos Santos

Master in Applied Computing
National Institute for Space Research

Page 2: Masters Thesis Defense Presentation

Outline

1. Introduction
   • SDSS & SkyServer
   • SkyServer as a data mining tool
   • Log Analysis
   • Thesis motivation
2. Theory review
3. Methodology
4. Experimental results
5. Concluding remarks

Page 3: Masters Thesis Defense Presentation

SDSS & SkyServer

• The Sloan Digital Sky Survey
  • 15 years of operation, now in its 4th iteration
  • 3D map of over 1/3 of the sky
  • 5+ million spectra
• SkyServer
  • The Internet portal to SDSS, providing data access tools to the catalog

Page 4: Masters Thesis Defense Presentation

SkyServer as a data mining tool

Find all galaxies without saturated pixels within 1' of a given point

DECLARE @saturated BIGINT;
SET @saturated = dbo.Fphotoflags('saturated');

SELECT G.objid, GN.distance
INTO   ##results
FROM   galaxy AS G
       JOIN Fgetnearbyobjeq(185, -0.5, 1) AS GN ON G.objid = GN.objid
WHERE  ( G.flags & @saturated ) = 0
ORDER  BY distance

Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7

DECLARE @qso INT;
SET @qso = dbo.Fspecclass('QSO');
DECLARE @hiZ_qso INT;
SET @hiZ_qso = dbo.Fspecclass('HIZ-QSO');

SELECT s.specobjid,
       Max(l.sigma * 300000.0 / l.wave) AS veldisp,
       Avg(s.z) AS z
INTO   ##results
FROM   specobj s, specline l
WHERE  s.specobjid = l.specobjid
       AND ( ( s.specclass = @qso )
              OR ( s.specclass = @hiZ_qso ) )
       AND l.sigma * 300000.0 / l.wave > 2000.0
       AND s.zconf > 0.9
GROUP  BY s.specobjid

Page 5: Masters Thesis Defense Presentation

SkyServer as a data mining tool


Page 6: Masters Thesis Defense Presentation

SkyServer as a data mining tool


Page 7: Masters Thesis Defense Presentation

Log Analysis


Page 8: Masters Thesis Defense Presentation

Log Analysis

2006 2014


Page 9: Masters Thesis Defense Presentation

Motivation

Apply text mining techniques over the SQL logs to define a methodology to parse, clean and tokenize statements into an intermediate numerical representation for data mining.


Page 10: Masters Thesis Defense Presentation

Outline

1. Introduction
2. Theory review
   • Text Mining
   • Information Retrieval
   • Clustering
3. Methodology
4. Experimental results
5. Conclusion

Page 11: Masters Thesis Defense Presentation

Text Mining

Knowledge Discovery in Databases:

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Fayyad et al. (1996)

Page 12: Masters Thesis Defense Presentation

Text Mining

Text Mining:

The discovery by computer of new, previously unknown information by automatically extracting information from different written sources.

Fayyad et al. (1996)

Page 13: Masters Thesis Defense Presentation

Text Mining


Miner, Gary. Practical text mining and statistical analysis for non-structured text data applications. Academic Press (2012)

Page 14: Masters Thesis Defense Presentation

Zipf’s Law


Singh et al. (2006)

Page 15: Masters Thesis Defense Presentation

Term Weighting

• To balance term significance within a document collection, accounting for terms that are too common or too rare.
• TF*IDF assigns the largest weight to terms that arise with high frequency in individual documents but are, at the same time, relatively rare in the collection as a whole.
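The weighting idea above can be illustrated with a small sketch. It assumes the common raw-count variant, weight = tf × log(N / df); the slides do not say which TF*IDF variant was used, and the token lists below are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by tf * log(N / df); docs is a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weighted


# Hypothetical tokenized queries (not taken from the SkyServer logs).
docs = [
    ["select_objid", "from_photoobj", "where_logic", "where_logic"],
    ["select_objid", "from_photoobj", "where_r"],
    ["select_objid", "from_specobj"],
]
weights = tf_idf(docs)
```

In this toy collection, `select_objid` appears in every query and so gets weight zero, while `where_logic`, frequent in one query and absent elsewhere, gets the largest weight in the first query.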

Page 16: Masters Thesis Defense Presentation

Clustering

The exploratory procedure that organizes a collection of patterns into natural groupings based on a given association measure.

Page 17: Masters Thesis Defense Presentation

Association Measures

Strehl et al. (2000)

Page 18: Masters Thesis Defense Presentation

K-Means

1. Choose k cluster centers.
2. Assign each pattern to the closest cluster center.
3. Recompute each cluster center using the current cluster memberships.
4. If the convergence criterion is not met, go to step 2.

Manning et al. (2009)
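The four steps can be sketched in a few lines of plain Python. This is only an illustration under assumptions the slide does not state: Euclidean distance, initial centers sampled from the data, and "assignments stopped changing" as the convergence criterion. The sample points are made up.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: sample k points as initial centers
    labels = None
    for _ in range(max_iter):
        # step 2: assign each pattern to the closest center
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # step 4: stop once the assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # step 3: recompute each center as the mean of its current members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels


# Two well-separated toy groups (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```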



Page 20: Masters Thesis Defense Presentation

Fuzzy C-Means

• Fuzzy extension to traditional K-Means, where every pattern belongs to every cluster with varying degrees of membership.
• Cluster validity metrics:
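The "varying degrees of membership" in the first bullet can be made concrete with a sketch of the standard Fuzzy C-Means update rules. It assumes Euclidean distance and fuzzifier m = 2; the starting centers and points are hypothetical, and the cluster validity metrics from the slide are not reproduced here.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """u[i][j] is the degree to which point i belongs to cluster j; rows sum to 1."""
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:  # point sits exactly on a center: full membership there
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            u.append([h / sum(hits) for h in hits])
            continue
        inv = [x ** (-2.0 / (m - 1.0)) for x in d]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def fcm_centers(points, u, m=2.0):
    """Each center is the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        centers.append(
            tuple(sum(wi * p[a] for wi, p in zip(w, points)) / sum(w) for a in range(dim))
        )
    return centers


# Toy data and hypothetical starting centers.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(10):  # alternate the two updates a fixed number of times
    u = fcm_memberships(points, centers)
    centers = fcm_centers(points, u)
```

Unlike K-Means, no point is assigned to a single cluster: each row of `u` spreads a total membership of 1 across all clusters.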

Page 21: Masters Thesis Defense Presentation

Self-Organizing Maps

• ANN (artificial neural network) that performs unsupervised, competitive learning.

Yin (2008)

Page 22: Masters Thesis Defense Presentation

Self-Organizing Maps

• Maps high-dimensional data into a regular low-dimensional grid
• Reduces the original data dimension while preserving relationships of the data
• Particularly interesting for visualization with the U-Matrix
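A minimal sketch of how such a map can be trained is given below. It assumes a rectangular grid, a Gaussian neighborhood, and linearly decaying learning rate and radius; these are common choices, not necessarily the ones used in the thesis, and the U-Matrix itself is not computed. All parameter names and the sample data are illustrative.

```python
import math
import random


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(weights, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: euclid(weights[node], x))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # one weight vector per grid node, randomly initialized in [0, 1)
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays over time
        sigma = sigma0 * (1.0 - frac) + 0.01    # neighborhood radius shrinks
        for x in rng.sample(data, len(data)):   # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for a in range(dim):
                    # pull the node toward x, more strongly near the winner
                    w[a] += lr * h * (x[a] - w[a])
    return weights


# Hypothetical 3-dimensional samples mapped onto a 3x3 grid.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.0, 1.0, 1.0), (0.9, 1.0, 0.9)]
weights = train_som(data, rows=3, cols=3)
```

After training, each sample is represented by the grid position returned by `best_matching_unit`, which is what makes the two-dimensional visualization possible.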

Page 23: Masters Thesis Defense Presentation

Outline

1. Introduction
2. Theory review
3. Methodology
   • SQL queries to feature vectors
   • Data mining
4. Experimental results
5. Concluding remarks

Page 24: Masters Thesis Defense Presentation

Methodology


Fayyad et al. (1996)

Page 25: Masters Thesis Defense Presentation

SQL queries to feature vectors

Preprocessing
• Parsing
• Cleaning / Tokenization
• Normalization

Transformation

Page 26: Masters Thesis Defense Presentation

SQL queries to feature vectors

SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z, platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n, specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid AND s.plateid = platex.plateid AND class = 'star' AND p.r >= 14 AND p.r <= 22.5 AND p.g >= 15 AND p.g <= 23 AND platex.plate = 2803

select objid ra dec u g r i z plate fiberid elodiefeh
from photoobj fgetnearbyobjeq specobj platex
where objid objid logic objid bestobjid logic plateid plateid logic class logic r logic r logic g logic g logic plate
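The step from the raw statement to the token stream can be approximated with a few regular expressions. This is a rough sketch that happens to reproduce the example above, not the parser used in the thesis: the function name `normalize` and its rules are assumptions, and it ignores constructs such as `TOP`, `JOIN ... ON`, subqueries, or `GROUP BY`.

```python
import re


def normalize(sql):
    """Reduce a simple SELECT/FROM/WHERE statement to a stream of name tokens."""
    q = sql.lower()
    q = re.sub(r"'[^']*'", " ", q)            # drop string literals
    q = re.sub(r"\([^()]*\)", " ", q)         # drop function-call arguments
    q = re.sub(r"\b\d+(?:\.\d+)?\b", " ", q)  # drop numeric constants
    q = re.sub(r"\b\w+\.", "", q)             # drop table/schema qualifiers
    q = re.sub(r"[<>=!]+", " ", q)            # drop comparison operators

    # split into clauses while keeping the clause keywords
    parts = re.split(r"\b(select|from|where)\b", q)
    tokens = []
    for keyword, body in zip(parts[1::2], parts[2::2]):
        tokens.append(keyword)
        if keyword == "where":
            # every logical connective becomes the generic token "logic"
            tokens.extend(re.sub(r"\b(and|or)\b", "logic", body).split())
        else:
            # comma-separated items: keep the name, drop any trailing alias
            for item in body.split(","):
                words = item.split()
                if words:
                    tokens.append(words[0])
    return " ".join(tokens)


sql = (
    "SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z, platex.plate, "
    "s.fiberid, s.elodiefeh "
    "FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n, specobj s, platex "
    "WHERE p.objid = n.objid AND p.objid = s.bestobjid "
    "AND s.plateid = platex.plateid AND class = 'star' "
    "AND p.r >= 14 AND p.r <= 22.5 AND p.g >= 15 AND p.g <= 23 "
    "AND platex.plate = 2803"
)
normalized = normalize(sql)
```

A production pipeline would more likely rely on a real SQL parser; the regular expressions only serve to show which information is kept (names and clause structure) and which is discarded (constants, aliases, operators).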

Page 27: Masters Thesis Defense Presentation

SQL queries to feature vectors


select_objid 1
select_ra 1
select_dec 1
select_u 1
select_g 1
select_r 1
select_i 1
select_z 1
select_plate 1
select_fiberid 1
select_elodiefeh 1
from_photoobj 1
from_fgetnearbyobjeq 1
from_specobj 1
from_platex 1
where_objid 3
where_logic 8
where_bestobjid 1
where_plateid 2
where_class 1
where_r 2
where_g 2
where_plate 1
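The counts above follow from prefixing each token with the clause it appears in and counting occurrences. A short sketch, assuming only the three clause keywords seen in this example (the helper name `to_features` is illustrative):

```python
from collections import Counter

CLAUSES = {"select", "from", "where"}  # clause keywords present in the example


def to_features(normalized_query):
    """Count clause-prefixed tokens, e.g. 'where_logic', in a normalized query."""
    features = Counter()
    clause = None
    for token in normalized_query.split():
        if token in CLAUSES:
            clause = token            # switch the current clause prefix
        elif clause is not None:
            features[f"{clause}_{token}"] += 1
    return features


normalized = (
    "select objid ra dec u g r i z plate fiberid elodiefeh "
    "from photoobj fgetnearbyobjeq specobj platex "
    "where objid objid logic objid bestobjid logic plateid plateid logic "
    "class logic r logic r logic g logic g logic plate"
)
features = to_features(normalized)
```

Stacking one such count vector per query, with one column per distinct feature name, yields the term-document matrix.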

Page 28: Masters Thesis Defense Presentation

SQL queries to feature vectors

Preprocessing
• Parsing
• Cleaning / Tokenization
• Normalization

Transformation
• Term Weighting
• Scaling

Term Document matrix

Page 29: Masters Thesis Defense Presentation

Data mining

Input: Term Document matrix

Data Mining
• Fuzzy C-Means
• Self-Organizing Maps

Page 30: Masters Thesis Defense Presentation

Outline

1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Concluding remarks

Page 31: Masters Thesis Defense Presentation

Number of clusters with FCM


Page 32: Masters Thesis Defense Presentation

Number of clusters with FCM


Page 33: Masters Thesis Defense Presentation

Visualization with SOM


Page 34: Masters Thesis Defense Presentation

Visualization with SOM

22: Finding spectra by classification (object type)
select top 100 specobjid
from specobj
where class = 'star' and zwarning = 0

43: QSOs by spectroscopy
select top 100 specobjid, z
from specobj
where class = 'qso' and zwarning = 0

Cosine distance:
Term-Frequency: 0.0205
SOM U-Matrix: 0.0

39: Classifications from Galaxy Zoo
select objid, nvote, p_el as elliptical, p_cw as spiralclock, p_acw as spiralanticlock, p_edge as edgeon, p_dk as dontknow, p_mg as merger
from zoonospec
where objid = 1237656495650570395

39B: Classifications from Galaxy Zoo
select top 100 g.objid, zns.nvote, zns.p_el as elliptical, zns.p_cw as spiralclock, zns.p_acw as spiralanticlock, zns.p_edge as edgeon, zns.p_dk as dontknow, zns.p_mg as merger
from galaxy as g join zoonospec as zns on g.objid = zns.objid
where g.clean=1 and zns.nvote >= 10 and zns.p_cw > 0.8

Cosine distance:
Term-Frequency: 0.1610
SOM U-Matrix: 0.0
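For reference, the cosine distance between two query feature vectors can be computed as sketched below. The two vectors are hypothetical raw counts loosely modeled on queries 22 and 43; they are not the weighted and scaled features used for the figures above, so the result is not expected to match the reported values.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors given as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical raw term counts, for illustration only.
q22 = {"select_specobjid": 1, "from_specobj": 1, "where_class": 1,
       "where_logic": 1, "where_zwarning": 1}
q43 = dict(q22, select_z=1)  # same tokens plus one extra selected column

distance = cosine_distance(q22, q43)
```

Queries that share most of their tokens end up close to 0, while queries with no tokens in common are at distance 1.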

Page 35: Masters Thesis Defense Presentation

Outline

1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Concluding remarks

Page 36: Masters Thesis Defense Presentation

Conclusions

• A methodology for proper parsing, cleaning and tokenization of SQL statements into feature vectors was defined, which can be used for KDD.
• Preprocessing and transformation can be tuned according to the data mining goal.
• Foreseen applications include:
  • Detailed SQL and database usage statistics
  • Query recommendation systems
  • Running time prediction

Page 37: Masters Thesis Defense Presentation

Publications

• Clustering SQL queries to analyse database usage, IASC Satellite for the ISI WSC Conference, 2015
• Text Mining Applied to SQL Queries: A Case Study for the SDSS SkyServer, 2nd International Symposium on Information Management and Big Data, 2015

Page 38: Masters Thesis Defense Presentation

Thank you!