make accumulated data in companies eloquent by sql statement constructors (pdf)

Post on 23-Jan-2018


Category:

Data & Analytics


Make Accumulated Data in Companies Eloquent
by SQL Statement Constructors

IEEE BigData 2017 (Boston), Dec. 11

Toshiyuki Shimono, Digital Garage, Inc.

Work Contributions

1. For exploring an unknown DB:
   ➢ Organized the milestones.
   ➢ Conceived the methods.

2. Proposed a beneficial software tool for "Big Data".
   ➢ No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].

3. Reduced the labor needed to understand a DB.
   ➢ By shrinking it from months to a week.
   ➢ "Knowing" latently dominates a data analysis project.

A similar slide appears again at the end.

I. Background (6 slides)

Background

• Many organizations have accumulated their own business data in recent years.
• In practice, however, their DBs are rarely well designed, so their data is far from being fully utilized.

The real situation today (as of 2017)

The accumulated data is so big and complex. Which part of it should be taken for analysis?

• Which tables are needed?
• Which columns are needed?
• Where are the meaningful date/user columns?
• How many bytes will be exported?
• What do the tables/columns mean?
• How can the dates/customers be narrowed down?
• How is damage to the exported data detected?

And the database system is so old that meaningful "pre-analysis" is very difficult!

How can data scientists utilize data?

There are so many tables and columns. Difficulties occur in:
➢ knowing the meanings,
➢ reading the documents,
➢ discussing with the clients.

What is a good way to utilize the data sleeping in the database?

▲ Many tables, each entailing many columns!

1. Understanding the DB
   ➢ Are the data effective to analyze?
   ➢ What are the special/error values?
   ➢ How are columns connected across tables?
2. Building a new environment for analysis
3. Serendipitous discovery in business by some advanced analysis

Preliminary Knowledge

• A database is a collection of tables.
• A table is like an MS Excel sheet, with rows (records) and columns (attributes).
• Many databases are handled by SQL, such as MySQL, Oracle, PostgreSQL and SQL Server.

➢ Some columns are connected in that they share the same coding system.
➢ Then how can one determine all of the connected columns?

◆ One needs to see the values of each column, but how?

(Figure: ER data model diagram, retouched. https://commons.wikimedia.org/wiki/File:Data_model_in_ER.png)

II. Introduction of the novel software (10 slides)

• Current software does not cover today's needs.
• New software is necessary.
• So I created it for SQL-type DBs.

Assumed environment

■ CLI (Command Line Interface)
   ■ to produce SQL statements,
   ■ to store the data,
   ■ to process the data.
■ SQL-type DB.

(Figure: two panels. ▲ Command Line Interface (CLI). ▲ SQL client software: SQL statements are entered in one pane, and the SQL output appears in the other.)

Complement: SQL statements

create table T (x numeric, y varchar, z date)    ← Make a table.
insert into T values (2, 'abc', '2017-12-11')    ← Add a record.
select * from T                                  ← Output all the records of T.
select count(*) from T                           ← The number of records in T.
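The four statements above can be tried without a database server. The following is a minimal sketch that runs them on an in-memory SQLite database through Python's standard `sqlite3` module; SQLite is an assumption made here for convenience, not the environment used in the talk.

```python
import sqlite3

# In-memory SQLite database: a stand-in for any SQL-type DB (assumption).
con = sqlite3.connect(":memory:")

con.execute("create table T (x numeric, y varchar, z date)")   # make a table
con.execute("insert into T values (2, 'abc', '2017-12-11')")   # add a record

all_rows = con.execute("select * from T").fetchall()            # all records of T
n_rows = con.execute("select count(*) from T").fetchone()[0]    # number of records

print(all_rows)
print(n_rows)
```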

To get a "listed" result by SQL:

1. Prepare the SQL statement.
2. Look at the table returned by the SQL statement.

This output is very useful, but the SQL statement is too long to enter manually.

So an SQL statement generator is desirable!
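What such a generator might look like can be sketched in a few lines. This is not the author's implementation: the function name `table_lines_sql`, the sample tables and the use of SQLite are all assumptions for illustration. It builds one long statement that lists the record count of every table, which is exactly the kind of text that is tedious to type by hand.

```python
import sqlite3

def table_lines_sql(tables):
    """Build one SQL statement listing the record count of each table.

    Hypothetical helper. Table names are pasted into the SQL text as-is,
    so they must come from a trusted list.
    """
    parts = [
        f"select {seq} as seq, '{t}' as table_name, count(*) as n_rows from {t}"
        for seq, t in enumerate(tables, start=1)
    ]
    return "\nunion all\n".join(parts) + "\norder by seq"

# Made-up sample data.
con = sqlite3.connect(":memory:")
con.execute("create table customers (id integer)")
con.execute("create table orders (id integer)")
con.executemany("insert into customers values (?)", [(1,), (2,), (3,)])
con.executemany("insert into orders values (?)", [(10,), (11,)])

statement = table_lines_sql(["customers", "orders"])
print(statement)
rows = con.execute(statement).fetchall()
print(rows)
```

With hundreds of tables the generated text grows to hundreds of lines, while the command that produces it stays one line long.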

Time flies. Time information is helpful.

Process-time and date-time information are attached.

A combination of SQL statements (20 of them) yielded by the same command (using an option).

SQL statements are yielded in the CLI environment.

20 commands for many functions

Each command receives either:
(1) table names, or
(2) column names with their table names.

Each command accepts option switches:
--help: shows the online manual.
-a, -b, -c, ..., -z: various minor functions.

The program functions

Program name           Function of the produced SQL statement(s)
serverInfo             Version information of the SQL DB system.
tableLines             Counting the records of each table.
tableColumns           Column information of all.
sampleRows             Random sampling of rows.
minMax                 Taking the min/max of each column. Also taking the 4 values.
mostFreq/FewId         Taking the most/least frequent values of each column.
distinctCount          Counting the distinct values of each column.
hasChar/nullCount      Counting the values with a specific character, or the null values.
byteTable/byteCol      Computing or estimating the byte size of each table or column.
vennTwo                Calculating how sets of values overlap.
newTable               Creating a table with ease.
hashSum                Summing numerically mapped SHA-1 values to compare tables.
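As an illustration of the column-level rows in this table, the sketch below generates a statement that reports the distinct count and the null count of several columns at once. It only mimics what commands such as distinctCount and nullCount are described as doing; the helper name `column_profile_sql`, the sample table and the SQLite backend are assumptions.

```python
import sqlite3

def column_profile_sql(table, columns):
    """Build a statement giving row, distinct and null counts per column.

    Hypothetical helper; identifiers are inserted into the SQL text unquoted.
    """
    parts = [
        f"select '{table}.{c}' as col, count(*) as n_rows, "
        f"count(distinct {c}) as n_distinct, "
        f"sum(case when {c} is null then 1 else 0 end) as n_null "
        f"from {table}"
        for c in columns
    ]
    return "\nunion all\n".join(parts)

# Made-up sample data.
con = sqlite3.connect(":memory:")
con.execute("create table users (id integer, country text)")
con.executemany(
    "insert into users values (?, ?)",
    [(1, "JP"), (2, "US"), (3, "JP"), (4, None)],
)

statement = column_profile_sql("users", ["id", "country"])
profile = {row[0]: row[1:] for row in con.execute(statement)}
print(profile)
```

A column whose distinct count equals its row count is a candidate identifier, while a small distinct count suggests a coded value.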

SQL generators demo (PowerPoint animation)

GitHub page (program repository)

Find the web page: github.com/tulamili
Both English and Japanese skills are necessary to use it. Sorry!

To improve the UI on the CLI:
• I have been creating commands one by one with the policy of "using 2 English words for a program".
• To keep the namespace of Unix/Linux command names clean, the UI should be altered.
• The next step would be the style of using a command argument to specify the function.

Skip this page unless time is enough.

III. Tricky functions to see the values of a DB (11 slides)

What is the most concise way to see the values?

➢ Column names do not usually tell whether the values are:
   ➢ substance names (man, woman, Japan, USA, ...), or
   ➢ coded values (1, 2, JP, US, ...).
➢ The column relations across tables are not easy to see.
➢ Knowing the special/error values is craftwork.

Seeing the concrete values is a must.

Idea: get 4 values from each column

What is a good, simple method to get some typical values from a column, whether its data type is text, number, date or whatever?

(1) Color the values whose first character is the minimum among all first characters.
(2) From the colored values, extract the minimum and the maximum.
(3) From the uncolored values, extract the minimum and the maximum.
(4) Those 4 values *would* tell the column characteristics well ☺

The Venn diagram and SQL statements

All the values from a column are split into two sets:
• All those whose first character is the minimum: their minimum is v11 and their maximum is v12.
• All the others: their minimum is v21 and their maximum is v22.

The set whose first character is the minimum:
select C from T where left(C,1) = (select min(left(C,1)) from T)

All the others:
select C from T where left(C,1) != (select min(left(C,1)) from T)

v11 and v12:
select min(C), max(C) from T where left(C,1) = (select min(left(C,1)) from T)

v21 and v22:
select min(C), max(C) from T where left(C,1) != (select min(left(C,1)) from T)
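The two min/max statements can be combined into one generated statement that returns v11, v12, v21 and v22 in a single row. The sketch below is one possible way to do so, not the author's code. Assumptions: the helper name `four_values_sql` is hypothetical, SQLite is the backend (so `substr(C, 1, 1)` replaces `left(C,1)`, which SQLite does not provide), and the sample dates, including the placeholder value '0000-00-00', are made up.

```python
import sqlite3

def four_values_sql(table, col):
    """Build a statement returning (v11, v12, v21, v22) for one column."""
    first = f"substr({col}, 1, 1)"                       # first character
    min_first = f"(select min({first}) from {table})"    # minimum first character
    return (
        "select a.v11, a.v12, b.v21, b.v22 from "
        f"(select min({col}) as v11, max({col}) as v12 from {table} "
        f"where {first} = {min_first}) a, "
        f"(select min({col}) as v21, max({col}) as v22 from {table} "
        f"where {first} != {min_first}) b"
    )

# Made-up column: a placeholder date mixed with real dates.
con = sqlite3.connect(":memory:")
con.execute("create table T (C text)")
con.executemany(
    "insert into T values (?)",
    [("0000-00-00",), ("2014-07-07",), ("2015-01-31",), ("2016-12-25",)],
)

four = con.execute(four_values_sql("T", "C")).fetchone()
two = con.execute("select min(C), max(C) from T").fetchone()
print(four)
print(two)
```

In this sample, plain min/max returns the placeholder as the minimum, whereas the 4 values show both the placeholder and the earliest real date.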

Skip this page unless time is enough.

Are the 4 values enough to see a column?

Skip this page unless time is enough.

• 2 values (e.g. min/max) would not work ☹.
• 4 values can mislead with small probability, but so far they have actually worked well, as shown later.
• How about 5, 6 or more values?
   • The min/max from a 3rd set can be added.
   • Indeed good for seeing various/lengthy text values ☺.
   • But it stops being simple, requiring complex SQL.
   • It took much computation time when I once tried ☹.

Applied to all the columns

(Figure: the 4 values of every column. Annotations mark the time dimension columns, the user dimension columns and the weight dimension column, and distinguish non-numeric order from numeric order.)

What if only 2 values?

Complement: SQL statement.
select min(C), max(C) from T

(Figure annotations on the resulting min/max table:)
• Hidden meaningful minimum: 2014-07-07.
• Hidden meaningful minimum: "-9990".
• Hidden meaningful minimum: "-5".
• Hidden existence of "00000".
• Only 1 country code can be seen, due to the existence of the special value "ZZZ".

This table only assures that at least 2 distinct values exist for each original column. (3 or 4 instead of 2 is desirable.)

Skip this page unless time is enough.

Relations found by sharing the same codes
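One way to test whether two columns share the same codes is to count how their sets of distinct values overlap, which is what vennTwo is described as calculating. The sketch below is an assumption-laden illustration: the helper name `venn_two_sql`, the sample tables and the SQLite backend are not from the talk.

```python
import sqlite3

def venn_two_sql(t1, c1, t2, c2):
    """Build a statement counting values only in t1.c1, in both, only in t2.c2."""
    a = f"select distinct {c1} as v from {t1} where {c1} is not null"
    b = f"select distinct {c2} as v from {t2} where {c2} is not null"
    return (
        f"select (select count(*) from ({a} except {b})) as only_left, "
        f"(select count(*) from ({a} intersect {b})) as in_both, "
        f"(select count(*) from ({b} except {a})) as only_right"
    )

# Made-up sample data: order records referring to customer codes.
con = sqlite3.connect(":memory:")
con.execute("create table orders (cust_code text)")
con.execute("create table customers (code text)")
con.executemany("insert into orders values (?)", [("C1",), ("C2",), ("C2",), ("C9",)])
con.executemany("insert into customers values (?)", [("C1",), ("C2",), ("C3",)])

overlap = con.execute(
    venn_two_sql("orders", "cust_code", "customers", "code")
).fetchone()
print(overlap)
```

A large `in_both` count with small counts on either side suggests that the two columns use the same coding system; values appearing on only one side are candidates for special or erroneous codes.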

Special/anomalous values

Cf. random sampling

How about random sampling?
1. SQL statement building
2. The SQL output (part of the results)
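A random-sampling statement is short, but its exact form depends on the DB product. The sketch below assumes SQLite, where `random()` is available (PostgreSQL also has `random()`, while MySQL uses `rand()`); the helper name `sample_rows_sql` and the sample table are hypothetical.

```python
import sqlite3

def sample_rows_sql(table, n):
    """Build a statement that returns n randomly chosen rows of a table."""
    return f"select * from {table} order by random() limit {int(n)}"

# Made-up sample data: 100 rows.
con = sqlite3.connect(":memory:")
con.execute("create table T (id integer, v text)")
con.executemany(
    "insert into T values (?, ?)",
    [(i, f"value-{i}") for i in range(100)],
)

sample = con.execute(sample_rows_sql("T", 5)).fetchall()
for row in sample:
    print(row)
```

Sorting the whole table by a random key is simple but costly on very large tables, so a real tool may need a cheaper sampling strategy.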

Randomness helps to see column relevance

• "Age" and "marriage" are highlighted with a yellow background.
• Probably, 1 means married and 2 means unmarried.

Estimating how 2 tables differ

Assume you can access both "the running DB" and its exported data. Exporting may take a lot of time, so "data change over time" occurs. Then how can you estimate the number of records that differ between the two? A method needs to be established.

I tried using the SHA-1 function to see the difference. The line counts differed by only 6, but the number of differing records actually lies somewhere in [83, 441] with 95% confidence.

You can regard each of the 3 values on a row as a Gaussian variable whose variance is determined by the number of records of T1, T2 and (T1 - T2) ∪ (T2 - T1), respectively. By changing the conditions, you get 12 repeated measurements. Then you can estimate the record numbers from the estimate of the population variance, as shown in the lower part of the table.

Skip this page unless time is enough.

Continued from the previous slide.

If you sum up N i.i.d. variables from a distribution with mean zero and variance one, the sum obeys a distribution that is well approximated by the normal distribution with mean zero and variance N.

If you obtain such a sum K times, how can N be estimated backward? It can be estimated from the total of the squares of those K values, divided by the 97.5th and 2.5th percentile points of the chi-square distribution with K degrees of freedom; the two quotients are the lower and upper ends of a 95% interval for N.

An easy way to obtain a variable with mean zero and variance one is to transform the SHA-1 value into [-sqrt(3), sqrt(3)], since a uniform variable on that interval has variance one.
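The estimation described on these two slides can be sketched as follows. This is an illustration under stated assumptions, not the hashSum program itself: records are treated as strings, "changing the conditions" is imitated by prepending one of K = 12 salt strings before hashing, and the chi-square percentile points for 12 degrees of freedom (4.404 and 23.337) are taken from a standard table. All function names are hypothetical.

```python
import hashlib
import math

SQRT3 = math.sqrt(3.0)

def unit_variance_hash(record, salt):
    """Map SHA-1(salt|record) to a value in [-sqrt(3), sqrt(3)).

    A uniform variable on that interval has mean 0 and variance 1.
    """
    digest = hashlib.sha1(f"{salt}|{record}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64   # pseudo-uniform in [0, 1)
    return (2.0 * u - 1.0) * SQRT3

def hash_sum(records, salt):
    """Sum of the mapped hash values over all records of a table."""
    return sum(unit_variance_hash(r, salt) for r in records)

# 2.5% and 97.5% points of the chi-square distribution with 12 d.f.
CHI2_12_LOW, CHI2_12_HIGH = 4.404, 23.337

def estimate_difference(t1, t2, salts):
    """Return (lower, point, upper) estimates of the number of differing records.

    Records common to both tables cancel in hash_sum(t1) - hash_sum(t2),
    so each difference has variance close to |(T1 - T2) U (T2 - T1)|.
    """
    assert len(salts) == 12, "the percentile constants assume K = 12"
    diffs = [hash_sum(t1, s) - hash_sum(t2, s) for s in salts]
    total_sq = sum(d * d for d in diffs)
    return total_sq / CHI2_12_HIGH, total_sq / len(salts), total_sq / CHI2_12_LOW

# Made-up tables: 200 records removed and 100 added, so 300 records differ.
t1 = [f"row-{i}" for i in range(1000)]
t2 = t1[200:] + [f"new-{i}" for i in range(100)]
salts = [f"salt-{k}" for k in range(12)]

lower, point, upper = estimate_difference(t1, t2, salts)
print(lower, point, upper)
```

Only the per-table sums need to be computed inside each DB, so the two tables never have to be joined or transferred record by record. The interval is wide for K = 12, which is consistent with the [83, 441] range quoted above.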

Skip this page unless time is enough.

IV. Summary (3 slides)

Steps to know a DB toward analysis:

1. Knowing the tables: row numbers, table comparison.
2. Knowing the columns (individually): taking the 4 values, random sampling.
3. Knowing the column connections (relations): determining all the columns sharing the same codes.
4. Knowing how (row-wise) special conditions occur: random sampling from the special lines.

Those above should be fulfilled before going beyond.

Workflow diagram (labels):

• DB
• SQL cmd generator → generated SQL cmd
• Extracted info: table info, column info
• Short-cutting operations
• Findings before the main analysis:
   • Concrete values
      ✓ Value formats
      ✓ Special/err values
      ✓ Columns' relations
      ➢ Meanings
   • Simpler table(s) ← column selecting ← time (date) narrowing ← customer narrowing
      ➢ Visualization
      ➢ Math methods
• Business value by the main analysis: big discovery from data + big business values

An application example.

1. You may have a lot of tables.
2. You understand each column of them by:
   • seeing some of the concrete values,
   • seeing the special and anomalous values,
   • determining all of the columns sharing the same codes.
3. Thereafter, you can:
   1. narrow down to modest-sized tables,
   2. easily handle the data for visualizations,
   3. summarize the data you need into one table that can be handled by many mathematical methods.
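The narrowing step in item 3 can be illustrated with a generated `create table ... as select` statement that keeps only chosen columns and a date range. This is a sketch under assumptions: the helper name `narrowed_table_sql`, the sample `sales` table and the SQLite backend are made up, and some products use a different syntax (SQL Server, for example, uses `select ... into`).

```python
import sqlite3

def narrowed_table_sql(new_table, source, columns, date_col, start, end):
    """Build a statement creating a smaller table: chosen columns, start <= date < end."""
    cols = ", ".join(columns)
    return (
        f"create table {new_table} as select {cols} from {source} "
        f"where {date_col} >= '{start}' and {date_col} < '{end}'"
    )

# Made-up sample data.
con = sqlite3.connect(":memory:")
con.execute("create table sales (user_id integer, sold_on text, amount integer, memo text)")
con.executemany(
    "insert into sales values (?, ?, ?, ?)",
    [
        (1, "2016-12-31", 10, "a"),
        (1, "2017-01-05", 20, "b"),
        (2, "2017-06-30", 30, "c"),
        (3, "2018-01-01", 40, "d"),
    ],
)

con.execute(
    narrowed_table_sql(
        "sales_2017", "sales", ["user_id", "sold_on", "amount"],
        "sold_on", "2017-01-01", "2018-01-01",
    )
)
narrowed = con.execute("select * from sales_2017 order by sold_on").fetchall()
print(narrowed)
```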

Contributions (summary)

1. For exploring an unknown DB:
   ➢ Organized the milestones.
   ➢ Conceived the methods.

2. Proposed a beneficial software tool for Big Data.
   ➢ No other such tool seems to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16] and [Kumar, Alencar 16].

3. Reduced the labor needed to understand a DB.
   ➢ By shrinking it from months to only a week.
   ➢ "Knowing" latently dominates a data analysis project.

V. Extra Slides (9 slides)

We must "understand DB contents" before any analysis

Reasons:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding other DB environments

We focus on DB understanding

1. Understanding the DB
2. Building a new environment for analysis
3. Business-related calculation (monthly sales, ...) and advanced analysis employing math-related methods

※ Note: Preprocessing exists everywhere, but we do not touch on it explicitly.

Reasons why we must understand the DB:
1. Effectiveness check for the analysis purpose
2. Seeing typical/special/anomalous values
3. Handling relations among columns
4. Rebuilding other DB environments

SquirrelSQL (since 2001)

Detail: Line Number Listing Example

The command "lineNumber" can yield various types of SQL statements by using command options such as -n and -t.

To make the output easy to use, it is designed so that the SQL output contains: 1) a sequence number, 2) table (and column) names, 3) the process time in seconds, 4) the time of calculation.
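The four output fields listed above can be imitated as follows. Note the assumption: the talk states that the SQL output itself carries these fields, while this sketch measures the process time and the timestamp on the client side in Python, because how the original statements obtain them is not shown here. The function name `timed_counts` and the sample tables are hypothetical.

```python
import sqlite3
import time
from datetime import datetime

def timed_counts(con, tables):
    """Count the records of each table, attaching a sequence number,
    the table name, the elapsed seconds and the time of calculation."""
    results = []
    for seq, table in enumerate(tables, start=1):
        started = time.perf_counter()
        n_rows = con.execute(f"select count(*) from {table}").fetchone()[0]
        elapsed = time.perf_counter() - started          # client-side timing
        finished_at = datetime.now().isoformat(timespec="seconds")
        results.append((seq, table, n_rows, round(elapsed, 3), finished_at))
    return results

# Made-up sample data.
con = sqlite3.connect(":memory:")
con.execute("create table customers (id integer)")
con.execute("create table orders (id integer)")
con.executemany("insert into customers values (?)", [(1,), (2,), (3,)])
con.executemany("insert into orders values (?)", [(10,), (11,)])

report = timed_counts(con, ["customers", "orders"])
for line in report:
    print(line)
```

On a large and slow DB, the elapsed time per table helps to decide which later queries are affordable, and the timestamp records when each count was valid.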

✓ A DB has tables, which have columns, which have values.
➢ One needs to determine the column connections across tables.

(Figure © 2013 Microsoft Corporation. All rights reserved.)

◆ One needs to see the values of each column, but how?

How to see the values in each column.

To output an "integrated" table by SQL:

1. A new command yields an SQL statement (output by "newCmd < tables.txt").
2. The table is returned by a query with that SQL statement.

This output is very useful, but the SQL statement is too long to enter manually.

Thus an SQL statement generator is desirable!

Deciphering so many columns at once.

Are the 4 values enough to see a column?

Skip this page unless time is enough.

• Only 2 values (e.g. min/max) would not work ☹.
• Only 4 values may possibly mislead ☹.
• Aligning more than 4 values:
   • The min/max from some (the third) set can be added.
   • Indeed good for seeing various/lengthy text values ☺.
   • It took much computation time when I once tried ☹.
   • SQL may really need "second_min" and "second_max".
• Misc.
   • Care for null values is desirable.
   • The frequency number may be desirable.
   • Information on the value lengths is helpful.

Random sampling, also weighting

Remaining issues before building a new DB environment for analysis

A combinatorial explosion in calculation can occur when reducing the redundancy of columns/rows.
• Grasping all the redundant columns through knowing the relations inside a table:
   1. The values of 2 or more columns are the same in every row.
   2. The values of a column can be determined by other column values.
• Grasping all the redundant rows.
• How can one know the condition under which a column has a null, special, anomalous or rare value, when the other column values seem to hold a clue?
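For a single pair of columns, the second kind of redundancy (one column determined by another) can be checked with a generated statement. The sketch below is an illustration rather than a method from the talk: the helper name `determines_sql`, the sample table and the SQLite backend are assumptions. The combinatorial explosion mentioned above arises because such a check would have to be repeated over all pairs, or larger groups, of columns.

```python
import sqlite3

def determines_sql(table, a, b):
    """Build a statement counting the values of column a that appear with
    more than one distinct value of column b. A result of 0 means that,
    in the current data, b is determined by a (nulls in b are ignored)."""
    return (
        f"select count(*) from (select {a} from {table} "
        f"group by {a} having count(distinct {b}) > 1) as g"
    )

# Made-up sample data.
con = sqlite3.connect(":memory:")
con.execute("create table places (country_code text, country_name text, city text)")
con.executemany(
    "insert into places values (?, ?, ?)",
    [("JP", "Japan", "Tokyo"), ("JP", "Japan", "Osaka"), ("US", "USA", "Boston")],
)

name_violations = con.execute(
    determines_sql("places", "country_code", "country_name")
).fetchone()[0]
city_violations = con.execute(
    determines_sql("places", "country_code", "city")
).fetchone()[0]
print(name_violations, city_violations)
```

Here `country_name` is redundant given `country_code`, while `city` is not.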
