Make Accumulated Data in Companies Eloquent by SQL Statement Constructors IEEE BigData 2017 (Boston) Dec. 11 Toshiyuki Shimono Digital Garage, Inc.


Page 1: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Make Accumulated Data in Companies Eloquent

by SQL Statement Constructors

IEEE BigData 2017 (Boston), Dec. 11

Toshiyuki Shimono, Digital Garage, Inc.

Page 2: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Work Contributions

1. For exploring an unknown DB:
Ø Organized the milestones.
Ø Conceived the methods.

2. Proposed a beneficial software tool for “Big Data”.
Ø No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].

3. Reduced the labor of understanding a DB.
Ø Shrinking it from months to a week.
Ø “Knowing” latently dominates a data analysis project.

A similar slide appears again at the end.

Page 3: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

I. Background (6 slides)

Page 4: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Background

• Many organizations have accumulated their own business data in recent years.

• In practice, however, their DBs are rarely well designed, so their data is far from being fully utilized.

Page 5: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

The real situation today, as of 2017

The accumulated data is big and complex. Which part of it should be taken for analysis?

Which tables are needed?

Which columns are needed?

Where are the meaningful date/user columns?

How many bytes will be exported?

What do the tables/columns mean?

How can the dates/customers be narrowed down?

How can damage to the exported data be detected?

And the database system is so old that meaningful “pre-analysis” is very difficult!

Page 6: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

How can data scientists utilize data?

So many tables and columns. Difficulties occur in:
Ø knowing the meanings,
Ø reading the documents,
Ø discussing with the clients.

What is a good way to utilize the data sleeping in the database?

▲ Many tables, each entailing many columns!

Page 7: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

1. Understanding the DB
Ø Are the data effective to analyze?
Ø What are the special/error values?
Ø How are columns connected across tables?

2. Building a new environment for analysis

3. Serendipitous discovery in business by some advanced analysis

Page 8: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Preliminary Knowledge

• A database is a collection of tables.
• A table is like an MS Excel sheet, with rows (records) and columns (attributes).
• Many databases are handled by SQL, such as MySQL, Oracle, PostgreSQL, and SQL Server.

Page 9: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Ø Some columns are connected by sharing the same coding system.
Ø Then how can one determine all of the connected columns?

One needs to see the values of each column, but how?

Figure retouched from https://commons.wikimedia.org/wiki/File:Data_model_in_ER.png

Page 10: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

II. Introduction of the novel software (10 slides)

Page 11: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

• Current software doesn’t cover today's needs.
• New software is necessary.
• So I created it for SQL-type DBs.

Page 12: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Assumptions about the environment

• CLI (Command Line Interface)
  Ø to produce SQL statements,
  Ø to store the data,
  Ø to process the data.
• SQL-type DB.

▲ Command Line Interface (CLI) ▲ SQL client software

SQL statements are entered here.

The SQL output appears here.

Page 13: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Complement: SQL statements

create table T (x numeric, y varchar, z date)  ← Make a table.
insert into T values (2, 'abc', '2017-12-11')  ← Add a record.
select * from T  ← Output all the records of T.
select count(*) from T  ← The number of records in T.
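The four statements above can be tried without a database server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the SQL-type DB. The slides do not name SQLite; it is used here only because it needs no setup.

```python
import sqlite3

# In-memory database: an assumption made for illustration only.
conn = sqlite3.connect(":memory:")

# Make a table.
conn.execute("create table T (x numeric, y varchar, z date)")
# Add a record.
conn.execute("insert into T values (2, 'abc', '2017-12-11')")
# Output all the records of T.
rows = conn.execute("select * from T").fetchall()
# The number of records in T.
count = conn.execute("select count(*) from T").fetchone()[0]

print(rows)
print(count)
```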

Page 14: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

To get a “listed” result by SQL:
1. Prepare the SQL statement:

2. The table returned by the SQL statement:

This output is very useful, but the SQL statement is too long to enter manually.

So a SQL statement generator is desirable!
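To illustrate why a generator helps, here is a minimal sketch of a statement constructor that builds one long "listed" row-count query from a list of table names. The function name `table_lines_sql`, the sample tables, and the use of sqlite3 are assumptions for illustration; this is not the author's implementation.

```python
import sqlite3

def table_lines_sql(tables):
    """Return one SQL statement that lists the row count of every table.

    Table names are assumed to be trusted identifiers (no quoting is done).
    """
    selects = [
        f"select '{name}' as table_name, count(*) as row_count from {name}"
        for name in tables
    ]
    return "\nunion all\n".join(selects)

# Tiny in-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer)")
conn.execute("create table orders (id integer)")
conn.executemany("insert into users values (?)", [(1,), (2,), (3,)])
conn.execute("insert into orders values (10)")

sql = table_lines_sql(["users", "orders"])
print(sql)                      # the generated statement
counts = conn.execute(sql).fetchall()
print(counts)                   # one (table_name, row_count) pair per table
```

With hundreds of tables the generated statement grows to hundreds of lines, which is exactly the part that is tedious to type by hand.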

Page 15: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Time flies. Time information is helpful.

Process-time and date-time information are attached.

A combination of 20 SQL statements, yielded by the same command (using an option).

Page 16: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

SQL statements are yielded in the CLI environment.

Page 17: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

20 commands for many functions

Each command receives either:
(1) table names, or
(2) column names with their table names.

Each command accepts option switches:
--help: to show the online manual
-a, -b, -c, ..., -z: various minor functions.

Page 18: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

The program functions

Program name         What the produced SQL statement(s) do
serverInfo           SQL DB system version information.
tableLines           Counts the records of each table.
tableColumns         Column information of all tables.
sampleRows           Random sampling of rows.
minMax               Takes the min/max of each column. Also takes the 4 values.
mostFreq/FewId       Takes the most/least frequent values of each column.
distinctCount        Counts the distinct values of each column.
hasChar/nullCount    Counts the values with a specific character, or the null values.
byteTable/byteCol    Computes or estimates the byte size of each table or column.
vennTwo              Calculates how sets of values overlap.
newTable             Creates a table with ease.
hashSum              Sums numerically mapped SHA-1 values to compare tables.
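As an illustration of the kind of statement such commands could produce, here is a minimal sketch that combines the ideas of distinctCount and nullCount into one generated query. The function name `column_profile_sql`, the sample table, and sqlite3 are assumptions; the actual commands may produce different SQL.

```python
import sqlite3

def column_profile_sql(table, columns):
    """Build a query returning distinct and null counts for each column.

    Identifiers are assumed to be trusted (no quoting is done).
    """
    selects = [
        f"select '{table}.{col}' as column_name, "
        f"count(distinct {col}) as distinct_count, "
        f"sum(case when {col} is null then 1 else 0 end) as null_count "
        f"from {table}"
        for col in columns
    ]
    return "\nunion all\n".join(selects)

# Hypothetical sample table with one null value in column b.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer, b text)")
conn.executemany("insert into t values (?, ?)", [(1, "x"), (1, None), (2, "x")])

profile = conn.execute(column_profile_sql("t", ["a", "b"])).fetchall()
print(profile)
```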

Page 19: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

SQL generators demo (PowerPoint animation)

Page 20: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

GitHub page (program repository)

Find the web page: github.com/tulamili
Both English and Japanese skills are necessary to use it. Sorry!

Page 21: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

To improve the UI on the CLI:
• I have been creating commands one by one with the policy of “using 2 English words for a program”.
• To keep the namespace of Unix/Linux command names clean, the UI should be altered.
• The next step would be the style of using a command argument to specify the function.

Skip this page unless there is enough time.
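The "command argument specifies the function" style mentioned above could look like the following sketch, which uses Python's argparse subcommands. The program name `sqlgen` is hypothetical, and only two of the functions are wired up; this is one possible shape of such a UI, not the author's design.

```python
import argparse

def build_parser():
    # "sqlgen" is a hypothetical single entry point replacing many commands.
    parser = argparse.ArgumentParser(
        prog="sqlgen", description="Generate SQL statements for DB exploration."
    )
    sub = parser.add_subparsers(dest="function", required=True)

    lines = sub.add_parser("tableLines", help="count the records of each table")
    lines.add_argument("tables", nargs="+")

    distinct = sub.add_parser(
        "distinctCount", help="count distinct values of each column"
    )
    distinct.add_argument("columns", nargs="+", metavar="table.column")
    return parser

def generate(args):
    """Return the list of SQL statements for the chosen function."""
    if args.function == "tableLines":
        return [f"select count(*) from {t}" for t in args.tables]
    statements = []
    for spec in args.columns:
        table, column = spec.split(".", 1)
        statements.append(f"select count(distinct {column}) from {table}")
    return statements

# Equivalent to running: sqlgen distinctCount users.country
args = build_parser().parse_args(["distinctCount", "users.country"])
statements = generate(args)
print(statements)
```

Only one name (`sqlgen` here) enters the Unix/Linux command namespace, while each function keeps its two-word name as an argument.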

Page 22: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

III. Tricky functions to see the values of a DB (11 slides)

Page 23: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

What is the most concise way to see the values?

Ø Column names don’t usually tell whether the values are:
  Ø substance names (man, woman, Japan, USA, ...)
  Ø coded values (1, 2, JP, US, ...)
Ø The column relations across tables are not easy to see.
Ø Knowing the special/error values is a craft.

Seeing the concrete values is a must.

Page 24: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Idea to get 4 values from each column

What is a good, simple method to get some typical values from a column, whether its data type is text, number, date or whatever?

(1) Color the values whose first character is the minimum first character.
(2) From the colored values, extract the minimum and the maximum.
(3) From the uncolored values, extract the minimum and the maximum.
(4) Those 4 values *would* tell the column's characteristics well ☺

Page 25: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

The Venn Diagram and SQL statement

All the values from a column, split into two sets:

All those whose first character is the minimum:
select C from T where left(C,1) = (select min(left(C,1)) from T)

All the others:
select C from T where left(C,1) != (select min(left(C,1)) from T)

The minimum (v11) and the maximum (v12) of the first set:
select min(C), max(C) from T where left(C,1) = (select min(left(C,1)) from T)

The minimum (v21) and the maximum (v22) of the second set:
select min(C), max(C) from T where left(C,1) != (select min(left(C,1)) from T)

Skip this page unless there is enough time.
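The four-value idea can be sketched as a statement constructor. This is a minimal illustration under stated assumptions: it runs on sqlite3, so `substr(C, 1, 1)` stands in for `left(C,1)` (SQLite has no `left` function), and the function name `four_values_sql` and the sample values are hypothetical.

```python
import sqlite3

def four_values_sql(table, column):
    """Build one query returning v11, v12, v21, v22 for a column.

    Identifiers are assumed to be trusted (no quoting is done).
    """
    first = f"substr({column}, 1, 1)"      # SQLite spelling of left(C, 1)
    min_first = f"(select min({first}) from {table})"
    return (
        "select "
        f"(select min({column}) from {table} where {first} = {min_first}) as v11, "
        f"(select max({column}) from {table} where {first} = {min_first}) as v12, "
        f"(select min({column}) from {table} where {first} != {min_first}) as v21, "
        f"(select max({column}) from {table} where {first} != {min_first}) as v22"
    )

# Hypothetical text column mixing ordinary values with special negative codes.
conn = sqlite3.connect(":memory:")
conn.execute("create table T (C text)")
conn.executemany(
    "insert into T values (?)",
    [("-9990",), ("-5",), ("12",), ("340",), ("78",)],
)

four = conn.execute(four_values_sql("T", "C")).fetchone()
two = conn.execute("select min(C), max(C) from T").fetchone()
print(four)   # values from both the "colored" and the "uncolored" set
print(two)    # plain min/max for comparison
```

On this sample, the plain min/max pair shows only "-5" and "78" (text ordering), while the four values also expose "-9990" and "12".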

Page 26: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Are the 4 values enough to see a column?

• 2 values (e.g. min/max) would not work ☹.
• 4 values can mislead with small probability, but so far they actually work well, as shown later.
• How about 5, 6 or more values?
  • The min/max from a 3rd set can be added.
  • Indeed good for seeing various/lengthy text values ☺.
  • But it is no longer simple: it requires complex SQL.
  • Much computation time, as I found when I tried it ☹.

Skip this page unless there is enough time.

Page 27: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Applied to all the columns

[Figure: the 4 values listed for every column. Annotations mark time-dimension, user-dimension and weight-dimension columns, and distinguish non-numeric order from numeric order.]

Page 28: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

What if only 2 values?

select min(C), max(C) from T

This table only assures that at least 2 distinct values exist for each original column. (3 or 4 instead of 2 is desirable.)

Hid the meaningful minimum 2014-07-07.

Hid the meaningful minimum “-9990”.

Hid the meaningful minimum “-5”.

Hid the existence of “00000”.

Only 1 country code can be seen, due to the existence of the special value “ZZZ”.

Skip this page unless there is enough time.

Page 29: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Complement: SQL statements.

Page 30: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Relations found by sharing the same codes
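One way to test whether two columns share the same codes is to count how their sets of distinct values overlap, in the spirit of the vennTwo function listed earlier. The sketch below is an assumption about how such a query might be built; the function name `venn_two_sql`, the sample tables, and sqlite3 are illustrative only.

```python
import sqlite3

def venn_two_sql(table1, col1, table2, col2):
    """Build a query counting values only in column 1, in both, only in column 2.

    Nulls are ignored; identifiers are assumed to be trusted (no quoting).
    """
    a = f"select {col1} from {table1} where {col1} is not null"
    b = f"select {col2} from {table2} where {col2} is not null"
    # except / intersect work on distinct values, like sets in a Venn diagram.
    return (
        "select "
        f"(select count(*) from ({a} except {b})) as only_left, "
        f"(select count(*) from ({a} intersect {b})) as in_both, "
        f"(select count(*) from ({b} except {a})) as only_right"
    )

# Hypothetical tables whose country columns may share a coding system.
conn = sqlite3.connect(":memory:")
conn.execute("create table customers (country text)")
conn.execute("create table orders (country text)")
conn.executemany(
    "insert into customers values (?)", [("JP",), ("US",), ("US",), ("ZZZ",)]
)
conn.executemany("insert into orders values (?)", [("JP",), ("FR",), ("JP",)])

overlap = conn.execute(
    venn_two_sql("customers", "country", "orders", "country")
).fetchone()
print(overlap)   # (only_left, in_both, only_right)
```

A large `in_both` count relative to the other two suggests that the columns use the same codes and are candidates for a relation.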

Page 31: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Special/anomalous values

Cf. random sampling

Page 32: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

How about random sampling?
1. SQL statement building

2. The SQL output (part of the results)
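A random-sampling statement can be generated in the same way. This sketch assumes the `order by random()` idiom, which works in SQLite and PostgreSQL (MySQL spells it `rand()`); whether the sampleRows command uses this idiom is not stated in the slides. The function name and sample table are hypothetical.

```python
import sqlite3

def sample_rows_sql(table, n):
    """Build a query returning n randomly chosen rows of a table.

    The table name is assumed to be a trusted identifier.
    """
    return f"select * from {table} order by random() limit {int(n)}"

# Hypothetical table with 1000 distinct rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table members (id integer, age integer, marriage integer)")
rows = [(i, 20 + i % 50, 1 + i % 2) for i in range(1000)]
conn.executemany("insert into members values (?, ?, ?)", rows)

sample = conn.execute(sample_rows_sql("members", 5)).fetchall()
for row in sample:
    print(row)
```

Sorting the whole table by a random key is simple but scans every row, so on very large tables a cheaper sampling method may be preferable.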

Page 33: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Randomness helps to see column relevance

• “Age” and “marriage” are highlighted in yellow.
• Probably, 1 means married and 2 means unmarried.

Page 34: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Estimating how 2 tables differ

Assume you can access both “the running DB” and its exported data. Exporting may take a lot of time, so “data change over time” occurs. Then, how can you estimate the number of records that differ between the two? Establishing a method is required.

I tried using the SHA-1 function to see the difference. The difference in line counts was only 6 lines, but the tables actually differ in somewhere between 83 and 441 records, at 95% confidence.

You can regard each of the 3 values on a row as a Gaussian variable whose variance is determined by the number of records in T1, T2, and (T1−T2) ∪ (T2−T1), respectively. By changing the conditions, you get 12 repeated measurements. Then you can estimate the record numbers based on the estimate of the population variance, as shown in the lower part of the table.

Skip this page unless there is enough time.

Page 35: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Continued from the previous slide.

If you sum up N i.i.d. variables from a distribution with mean zero and variance one, then the sum obeys a distribution that is well approximated by the normal distribution with mean zero and variance N.

If you obtain such a sum in K repetitions, how can you estimate the number N backward? It can be estimated from the total of the squares of those K values, divided by the 2.5th and 97.5th percentile points of the chi-square distribution with K degrees of freedom.
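Written out, the estimate above takes the following form. The symbols S_k (the k-th observed sum) and χ²_{K,p} (the p-quantile of the chi-square distribution with K degrees of freedom) are notation introduced here for illustration; they do not appear on the slide.

```latex
S_k \sim \mathcal{N}(0, N) \quad (k = 1, \dots, K)
\;\Longrightarrow\;
\frac{1}{N} \sum_{k=1}^{K} S_k^2 \sim \chi^2_K ,
\qquad\text{so with 95\% confidence}\qquad
\frac{\sum_{k=1}^{K} S_k^2}{\chi^2_{K,\,0.975}}
\;\le\; N \;\le\;
\frac{\sum_{k=1}^{K} S_k^2}{\chi^2_{K,\,0.025}} .
```

The larger quantile gives the lower bound and the smaller quantile gives the upper bound, because N appears in the denominator of the chi-square statistic.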

And an easy way to obtain a variable with mean zero and variance one is to transform the SHA-1 value into the interval [−sqrt(3), sqrt(3)].

Skip this page unless there is enough time.
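The SHA-1 mapping and the hash-sum comparison can be sketched as follows. A uniform variable on [−sqrt(3), sqrt(3)] has mean zero and variance (2·sqrt(3))²/12 = 1, which is why that interval is chosen. Everything else here is an assumption for illustration: records are treated as strings, the first 8 digest bytes are used, and the "changed conditions" of the repeated measurements are modeled as salts prepended to each record.

```python
import hashlib
import math

SQRT3 = math.sqrt(3.0)

def sha1_value(record, salt=""):
    """Map a record to a pseudo-random value in [-sqrt(3), sqrt(3)]."""
    digest = hashlib.sha1((salt + record).encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform on [0, 1]
    return (2.0 * u - 1.0) * SQRT3

def hash_sum(records, salt=""):
    """Sum the mapped SHA-1 values of all records of a table."""
    return sum(sha1_value(r, salt) for r in records)

def estimate_diff_count(differences):
    """Point estimate of N: the mean of the squared hash-sum differences."""
    return sum(d * d for d in differences) / len(differences)

# Hypothetical tables: T2 lost 50 rows of T1 and gained 20 new ones.
t1 = [f"row-{i}" for i in range(1000)]
t2 = [f"row-{i}" for i in range(50, 1020)]

# 12 repetitions, each with a different (hypothetical) salt.
salts = [f"salt{k}:" for k in range(12)]
differences = [hash_sum(t1, s) - hash_sum(t2, s) for s in salts]
print(estimate_diff_count(differences))   # rough estimate of the 70 differing rows
```

Records present in both tables cancel in the difference of the two hash sums, so only the rows in (T1−T2) ∪ (T2−T1) contribute to its variance.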

Page 36: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

IV. Summary (3 slides)

Page 37: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Steps to know a DB toward analysis:

1. Knowing the tables. (Row numbers, table comparison.)

2. Knowing the columns (individually). (The 4-value taking, random sampling.)

3. Knowing the column connections (relations). (Determining all the same-code sharing.)

4. Knowing how (row-wise) special conditions occur. (Random sampling from the special lines.)

Those above should be fulfilled before going beyond.

Page 38: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

[Diagram: from the DB to business value. Labels recovered from the figure:]

DB; SQL cmd generator; Generated SQL cmd

Short-cutting operations

Extracted info: Table info, Column info, Concrete values

Findings before main-analysis:
✓ Value formats
✓ Special/err values
✓ Columns’ relations
Ø Meanings

Simpler table(s) <- column selecting <- time (date) narrowing <- customer narrowing

Ø Visualization
Ø Math methods

Business value by main-analysis

Big discovery from data + Big business values

Page 39: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

An application example.

1. You may have a lot of tables.

2. You understand each of their columns by:
• seeing some of the concrete values,
• seeing the special and anomalous values,
• determining all of the columns sharing the same codes.

3. Thereafter, you can:
(1) narrow down to modest-sized tables,
(2) easily handle the data for visualizations,
(3) summarize the data you need into one table that can be handled by many mathematical methods.

Page 40: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)


Page 41: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Contributions (summary)

1. For exploring an unknown DB:
Ø Organized the milestones.
Ø Conceived the methods.

2. Proposed a beneficial software tool for Big Data.
Ø No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].

3. Reduced the labor of understanding a DB.
Ø Shrinking it from months to only a week.
Ø “Knowing” latently dominates a data analysis project.

Page 42: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

V. Extra Slides (9 slides)

Page 43: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

We must “understand DB contents” before any analysis

Reasons:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding another DB environment

Page 44: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

1. Understanding the DB (we focus on DB understanding)

2. Building a new environment for analysis

3. Business-related calculation (monthly sales, ...); advanced analysis employing math-related methods

※ Note: Preprocessing exists everywhere, but we do not touch on it explicitly.

Reasons why we must understand the DB:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding another DB environment

Page 45: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

SQuirreL SQL (since 2001)

Page 46: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Detail: Line Number Listing Example

The command “lineNumber” can yield various types of SQL statements by using command options such as -n and -t.

To make the output properly, it is designed so that the SQL output contains: 1) a sequence number, 2) table (and column) names, 3) process time in seconds, 4) the time of the calculation.
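The four pieces of information listed above can be illustrated with a small client-side sketch. Note the assumption: the slides describe the generated SQL output itself carrying these fields, whereas this sketch measures the time in Python around each query, which is a swapped-in technique chosen for portability. The function name `count_with_timing` and the sample tables are hypothetical.

```python
import sqlite3
import time
from datetime import datetime

def count_with_timing(conn, tables):
    """Count the rows of each table and attach sequence and time information.

    Each report row is (sequence number, table name, row count,
    process time in seconds, time of the calculation).
    """
    report = []
    for seq, table in enumerate(tables, start=1):
        started = time.perf_counter()
        (row_count,) = conn.execute(f"select count(*) from {table}").fetchone()
        seconds = round(time.perf_counter() - started, 3)
        finished_at = datetime.now().isoformat(timespec="seconds")
        report.append((seq, table, row_count, seconds, finished_at))
    return report

# Hypothetical in-memory tables.
conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer)")
conn.execute("create table orders (id integer)")
conn.executemany("insert into users values (?)", [(1,), (2,), (3,)])
conn.execute("insert into orders values (10)")

report = count_with_timing(conn, ["users", "orders"])
for line in report:
    print(line)
```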

Page 47: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

✓ A DB has tables, which have columns, which have values.
Ø One needs to determine column connections across tables.

One needs to see the values of each column, but how?

©2013 Microsoft Corporation. All rights reserved.

Page 48: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

How to see values in each column.

Page 49: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

To output an “integrated” table by SQL:
1. A new command yields a SQL statement:

2. The table returned by querying with the SQL statement:

This output is very useful, but the SQL statement is too long to enter manually.

Thus a SQL statement generator is desirable!

Output by “newCmd < tables.txt”

Page 50: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Deciphering so many columns at once.

Page 51: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Are the 4 values enough to see a column?

• Only 2 values (e.g. min/max) would not work ☹.
• Only 4 values may possibly mislead ☹.
• Aligning more than 4 values:
  • The min/max from some (the third) set can be added.
  • Indeed good for seeing various/lengthy text values ☺.
  • Much computation time, as I found when I tried it ☹.
  • SQL may really need “second_min” and “second_max”.
• Misc.
  • Null value care is desirable.
  • The frequency number may be desirable.
  • The value length information is helpful.

Skip this page unless there is enough time.

Page 52: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)


Page 53: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Random sampling, also weighting

Page 54: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Remaining issues before building a new DB environment for analysis

A combinatorial explosion in calculation can occur when reducing the redundancy of columns/rows.

• Grasping all the redundant columns through knowing the relations inside a table:
  1. The values of 2 or more columns are the same in every row.
  2. The values of a column can be determined from other column values.
• Grasping all the redundant rows.
• How can one know the conditions under which a column has a null, special, anomalous or rare value, when the other column values seem to hold a clue?

Page 55: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)


Page 56: Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)