exploring data -...

Post on 24-Aug-2020

Exploring Data

CENG 499 Introduction to Data Science

Erdoğan Doğdu

Content

•  Ch. 10 Working with Data

Exploring Data

•  Before you start building models and predicting, know your data
   – Explore your data first

One-dimensional data

•  Example: a collection of numbers
   – The number of minutes each user spends on your website

•  How to explore?
   – Summary statistics
      •  Number of items, the smallest, the largest, the mean, the standard deviation
   – Histograms
      •  Group data into buckets
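The two exploration steps above can be sketched in plain Python. This is a minimal illustration, not the course's own code: the function names `summarize`, `bucketize`, and `make_histogram` and the `minutes` values are assumptions made up for the example.

```python
import math
from collections import Counter

def summarize(xs):
    """Return basic summary statistics for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    # Sample variance (divides by n - 1); needs at least two values.
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return {
        "count": n,
        "min": min(xs),
        "max": max(xs),
        "mean": mean,
        "std_dev": math.sqrt(variance),
    }

def bucketize(x, bucket_size):
    """Floor x to the lower edge of its bucket."""
    return bucket_size * math.floor(x / bucket_size)

def make_histogram(xs, bucket_size):
    """Count how many values fall into each bucket."""
    return Counter(bucketize(x, bucket_size) for x in xs)

# Hypothetical minutes-per-user values, invented for illustration.
minutes = [3, 7, 12, 18, 25, 31, 31, 44]

stats = summarize(minutes)
histogram = make_histogram(minutes, 10)

print(stats)
for bucket in sorted(histogram):
    print(f"{bucket:>3} to {bucket + 10:>3}: {histogram[bucket]}")
```

Each histogram key is the lower edge of a bucket, so a key of 30 counts the values from 30 up to (but not including) 40.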

Histograms

•  Mean: 0, std. dev. = 58 for both distributions
•  Distribution?
   – plot_histogram(uniform, 10, "Uniform Histogram")
   – plot_histogram(normal, 10, "Normal Histogram")
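To see why the histogram matters here, the sketch below builds two samples whose mean and standard deviation are nearly identical but whose shapes differ. The transcript does not show how `uniform` and `normal` were generated or how `plot_histogram` is defined, so the generators, the seed, and the helper `bucket_counts` are assumptions; the sketch prints bucket counts instead of drawing a plot.

```python
import math
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Assumed generators: uniform on [-100, 100] has std. dev. 200 / sqrt(12), about 57.7.
uniform = [random.uniform(-100, 100) for _ in range(10000)]
# A normal sample with a similar spread.
normal = [random.gauss(0, 57) for _ in range(10000)]

def bucket_counts(points, bucket_size):
    """Count points per bucket, keyed by the bucket's lower edge."""
    return Counter(bucket_size * math.floor(x / bucket_size) for x in points)

uniform_hist = bucket_counts(uniform, 10)
normal_hist = bucket_counts(normal, 10)

for name, data, hist in [("uniform", uniform, uniform_hist),
                         ("normal", normal, normal_hist)]:
    print(name,
          "mean:", round(statistics.mean(data), 1),
          "std. dev.:", round(statistics.pstdev(data), 1),
          "buckets span:", min(hist), "to", max(hist))
```

The summary statistics come out almost the same, but the uniform buckets stop at the edges of the interval while the normal buckets extend well beyond it with thin tails.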

Two dimensions

•  Example:
   – Users' daily minutes on the website (dim 1)
   – Users' experience in years in data science (dim 2)
   – How do they vary together?

•  plot_histogram(ys1, 10, "ys1")
•  plot_histogram(ys2, 10, "ys2")
•  Same mean and std. dev., both normally distributed

•  But each has a very different joint distribution with xs

print(correlation(xs, ys1))  # 0.9
print(correlation(xs, ys2))  # -0.9
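The transcript does not show how `xs`, `ys1`, and `ys2` were built or how `correlation` is implemented. The following sketch is one plausible construction, stated as an assumption: `ys1` follows `xs` plus noise and `ys2` follows `-xs` plus noise, so both have similar one-dimensional histograms but opposite relationships with `xs`.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

random.seed(0)  # fixed seed so the run is repeatable

# Assumed construction, not taken from the slides.
xs = [random.gauss(0, 1) for _ in range(1000)]
ys1 = [x + random.gauss(0, 1) / 2 for x in xs]
ys2 = [-x + random.gauss(0, 1) / 2 for x in xs]

print(round(correlation(xs, ys1), 2))  # close to 0.9
print(round(correlation(xs, ys2), 2))  # close to -0.9
```

With this construction the theoretical correlation is 1 / sqrt(1.25), roughly 0.89, which matches the values quoted on the slide.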

Many Dimensions

•  How do all the dimensions relate to one another?

•  Correlation matrix
   – Row i, Col j: correlation of dim i and dim j
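A correlation matrix as described above can be sketched with nested lists. The helper names `correlation` and `correlation_matrix` and the three small columns are assumptions chosen so the result is easy to check by hand.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Entry [i][j] is the correlation of column i with column j."""
    k = len(columns)
    return [[correlation(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]

# Hypothetical data: one list per dimension, all the same length.
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],   # exactly twice the first column
    [5, 3, 4, 1, 2],    # tends to fall as the first column rises
]

matrix = correlation_matrix(columns)
for row in matrix:
    print([round(value, 2) for value in row])
```

The diagonal is always 1 because each dimension is perfectly correlated with itself, and the matrix is symmetric, so only the entries above the diagonal carry new information.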

•  Scatterplot matrix: plt.subplots()

Cleaning and Munging

•  Real-world data is dirty
•  Convert strings to numbers (e.g. float(s) for a string s)
•  What if a value cannot be converted?
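One common answer to the last question is to wrap the conversion so that a bad value yields `None` instead of stopping the whole load. The sketch below assumes that policy; the helper `parse_float` and the sample row are hypothetical.

```python
def parse_float(text):
    """Convert text to float, or return None if it cannot be converted."""
    try:
        return float(text)
    except (ValueError, TypeError):
        # ValueError: not a number, e.g. "n/a"; TypeError: not a string or number, e.g. None.
        return None

# Hypothetical raw row as it might come out of a delimited file.
raw_row = ["90.91", " 41.68 ", "n/a", "", None]

cleaned = [parse_float(value) for value in raw_row]
print(cleaned)

# Keep only the values that parsed successfully.
valid = [value for value in cleaned if value is not None]
print(valid)
```

Returning `None` keeps the bad entries visible so they can be counted or inspected later; silently replacing them with 0 would distort the summary statistics.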

Manipulating Data

•  Stock prices data

•  The highest-ever closing price for AAPL?
   – Restrict ourselves to AAPL rows.
   – Grab the closing_price from each row.
   – Take the max of those prices.
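The three steps map directly onto a generator expression. The row layout below (dicts with `symbol`, `date`, and `closing_price` keys) and the prices themselves are assumptions invented for illustration; only the field name `closing_price` appears in the slides.

```python
# Hypothetical rows; the prices are made up, not real market data.
data = [
    {"symbol": "AAPL", "date": "2014-06-18", "closing_price": 92.18},
    {"symbol": "AAPL", "date": "2014-06-19", "closing_price": 91.86},
    {"symbol": "MSFT", "date": "2014-06-19", "closing_price": 41.51},
    {"symbol": "AAPL", "date": "2014-06-20", "closing_price": 90.91},
    {"symbol": "MSFT", "date": "2014-06-20", "closing_price": 41.68},
]

max_aapl_price = max(
    row["closing_price"]          # step 2: grab the closing price
    for row in data
    if row["symbol"] == "AAPL"    # step 1: restrict to AAPL rows
)                                 # step 3: take the max

print(max_aapl_price)
```

Note that `max` raises `ValueError` on an empty sequence, so a symbol with no rows needs separate handling (for example, by passing `default=None`).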

•  The highest-ever closing price for each stock in our dataset?
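For the per-stock version, one approach is to group the rows by symbol first and then take the maximum within each group. This is a sketch under the same assumed row layout as before; the data and the names `by_symbol` and `max_price_by_symbol` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; the prices are made up, not real market data.
data = [
    {"symbol": "AAPL", "date": "2014-06-18", "closing_price": 92.18},
    {"symbol": "AAPL", "date": "2014-06-19", "closing_price": 91.86},
    {"symbol": "MSFT", "date": "2014-06-19", "closing_price": 41.51},
    {"symbol": "MSFT", "date": "2014-06-20", "closing_price": 41.68},
    {"symbol": "FB", "date": "2014-06-20", "closing_price": 64.5},
]

# Group rows by symbol.
by_symbol = defaultdict(list)
for row in data:
    by_symbol[row["symbol"]].append(row)

# Take the highest closing price within each group.
max_price_by_symbol = {
    symbol: max(row["closing_price"] for row in rows)
    for symbol, rows in by_symbol.items()
}

print(max_price_by_symbol)
```

Grouping first keeps the full rows available, so the same `by_symbol` structure can answer follow-up questions such as the date on which each maximum occurred.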

Rescaling

•  Cluster body sizes?
   – Euclidean distance between (height, weight) pairs
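The reason rescaling matters for distance-based clustering is that Euclidean distance depends on the units of each dimension. The sketch below uses three made-up people, measures height once in inches and once in centimeters, and then rescales each column to mean 0 and standard deviation 1. All names (`distance`, `rescale`, `nearest`) and values are assumptions for illustration.

```python
import math
import statistics

def distance(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rescale(points):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*points))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else v for v, m, s in zip(p, means, stdevs))
        for p in points
    ]

def nearest(labels, points, target):
    """Label of the point closest to the target label's point."""
    i = labels.index(target)
    others = [j for j in range(len(points)) if j != i]
    return labels[min(others, key=lambda j: distance(points[i], points[j]))]

labels = ["A", "B", "C"]
# Hypothetical (height in inches, weight in pounds) pairs.
inches = [(63, 150), (67, 160), (70, 171)]
# Same people with height converted to centimeters.
centimeters = [(h * 2.54, w) for h, w in inches]

print(nearest(labels, inches, "B"))       # A
print(nearest(labels, centimeters, "B"))  # C

# After rescaling, the choice of unit no longer changes the distances.
scaled_in = rescale(inches)
scaled_cm = rescale(centimeters)
print(distance(scaled_in[0], scaled_in[1]), distance(scaled_cm[0], scaled_cm[1]))
```

Changing only the height unit flips B's nearest neighbor from A to C, which is exactly the kind of arbitrary result rescaling is meant to prevent.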
