10 mistakes to avoid in data science

Xavier Prudent

Upload: xavier-prudent

Post on 11-Apr-2017

Page 1: 10 Mistakes to avoid in data science

Xavier Prudent

Page 2: 10 Mistakes to avoid in data science
Page 3: 10 Mistakes to avoid in data science

Mistake 10 Not Having (enough) Data

Enough? That depends on what you want to achieve (training a neural network, A/B testing…). Go back to statistics (mathematical conditions) plus rules of thumb.

Many formats can easily be read with R: text, CSV, Excel, protocol buffers, JSON, XML, HTML, SQL…

Many sources are already available: Kaggle, website mining, open data, government agencies.

library(XML)
Web.page <- htmlTreeParse("http://lapresse.ca")

More details in coming lectures

Montréal Big Data Meetup, 22nd March 2017 – Xavier Prudent – www.xavierprudent.com
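One way to make "enough?" concrete for an A/B test is a power calculation. The sketch below uses power.t.test from R's built-in stats package. The effect size (0.5 standard deviations), the 5% significance level and the 80% power are illustrative assumptions, not values from the talk.

```r
# How many observations per group does a two-sample t-test need?
# Assumed for illustration: an effect of 0.5 standard deviations,
# a 5% significance level and 80% power.
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# Round up: you cannot collect a fraction of an observation.
n_per_group <- ceiling(res$n)
print(n_per_group)
```

A smaller expected effect (a smaller delta) pushes the required sample size up quickly, which is where the rules of thumb come from.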

Page 4: 10 Mistakes to avoid in data science

Mistake 9 Do not check data quality

Where does your data come from? Stand up and get out, talk with people, read the documentation.

library(ggplot2)
library(tabplot)
tableplot(diamonds)

Do not underestimate naïve tests…

R package Datacheck
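The "naïve tests" can be as simple as counting missing values, duplicates and implausible entries. The following is a minimal sketch in base R; the toy table, its column names and the 0 to 120 age range are hypothetical and chosen only for illustration.

```r
# Hypothetical toy table with deliberate problems (not data from the talk).
df <- data.frame(
  id   = c(1, 2, 2, 4),
  age  = c(34, NA, 27, 150),
  city = c("Laval", "Longueuil", "Longueuil", ""),
  stringsAsFactors = FALSE
)

# Missing values per column.
na_counts <- colSums(is.na(df))

# Identifiers that appear more than once.
n_dup_ids <- sum(duplicated(df$id))

# Rows whose age falls outside a plausible range (0 to 120 is an assumption).
bad_age_rows <- which(!is.na(df$age) & (df$age < 0 | df$age > 120))

# Empty strings hiding as valid text.
n_empty_city <- sum(df$city == "")

print(na_counts)
print(c(duplicates = n_dup_ids, empty_city = n_empty_city))
print(bad_age_rows)
```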

Page 5: 10 Mistakes to avoid in data science

Mistake 8 Do not look at your data

Look at your data

What kind of data? What do I want to know? Geographic? Time series? Correlation?

Which visualization? Histogram, boxplot, mosaic, heatmap, hexbinning, scatterplot, line chart, 3D

Many R packages available: ggplot2, leaflet, plot_ly (plotly), corrplot

More details in coming lectures
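A classic illustration of why looking at the data matters is Anscombe's quartet, which ships with base R as the anscombe data frame. The sketch below is not from the talk. It shows that the four sets share almost the same summary statistics, while the plots (drawn only in an interactive session) look nothing alike.

```r
# anscombe ships with base R: four (x, y) pairs in columns x1..x4 and y1..y4.
pairs_idx <- 1:4

# Summary statistics are nearly identical across the four sets...
y_means <- sapply(pairs_idx, function(i) mean(anscombe[[paste0("y", i)]]))
xy_cors <- sapply(pairs_idx, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(y_means, 2))
print(round(xy_cors, 3))

# ...but the scatterplots show four very different shapes.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in pairs_idx) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i))
  }
  par(op)
}
```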

Page 6: 10 Mistakes to avoid in data science

Choose the right color: color blindness, printing, meaning…

Level of interactivity?

Page 7: 10 Mistakes to avoid in data science

Mistake 7 Not having a plan

Have a plan and focus on it

What is the question you want to answer?

- Set up deadlines
- Make to-do lists
- Use a project management tool (Asana)
- Plan & monitor your time

Do not forget the big picture by getting lost in technical tools

Page 8: 10 Mistakes to avoid in data science

Mistake 6 Focus on training

Observe > Clean > Understand > Train > Predict

Question: How much snow will fall on Montréal during the next 5 years?
Data: Snowfall and temperature of the last 80 years

?

R package caret: evaluate models, choose between them, estimate performance (regression & classification)

Statistical tests: goodness of fit, R², Hosmer-Lemeshow test (MKmisc), Wald test, k-fold validation

Retrain "often"
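To show what k-fold validation adds beyond training, here is a minimal sketch in base R rather than caret. The mtcars dataset, the formula mpg ~ wt + hp and the choice of 5 folds are stand-ins chosen for illustration; none of them come from the talk.

```r
set.seed(42)  # reproducible fold assignment

k <- 5
cars_df <- mtcars  # built-in dataset, used here only as a stand-in

# Assign each row to one of k folds at random.
fold <- sample(rep(seq_len(k), length.out = nrow(cars_df)))

# For each fold: train on the other folds, predict on the held-out one.
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- cars_df[fold != i, ]
  test  <- cars_df[fold == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})

cv_rmse <- mean(fold_rmse)

# Error measured on the training data itself, for comparison.
full_fit   <- lm(mpg ~ wt + hp, data = cars_df)
train_rmse <- sqrt(mean(residuals(full_fit)^2))

print(c(cross_validated = cv_rmse, training = train_rmse))
```

The cross-validated error estimates how the model behaves on data it has not seen, which the training error alone cannot tell you.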

Page 9: 10 Mistakes to avoid in data science

Mistake 5 Keep it complex

Do not jump straight to the fashionable, complicated method

- Keep your method as simple as possible (focus on the question)
- Know the limits of this method
- Compare the methods (caret, ROC)

Boosted decision tree coupled to a neural network

Linear regression

Complexity comes at a price (speed, error-proneness, expertise, amount of data)

Can you afford it?
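Comparing methods by ROC can be done with a few lines of base R. The sketch below computes the area under the ROC curve through the rank-sum identity instead of caret. The helper name auc and the labels and scores are made up for illustration and do not come from the talk.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
# labels: 0/1 vector; scores: higher means "more likely 1".
auc <- function(labels, scores) {
  r     <- rank(scores)  # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical held-out labels and scores from two models (made-up numbers).
labels         <- c(0, 0, 1, 1, 0, 1)
simple_scores  <- c(0.2, 0.4, 0.6, 0.9, 0.3, 0.7)
complex_scores <- c(0.1, 0.6, 0.5, 0.95, 0.2, 0.8)

print(c(simple  = auc(labels, simple_scores),
        complex = auc(labels, complex_scores)))
```

If the simple method scores as well as the complicated one on held-out data, the extra complexity has not paid for itself.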

Page 10: 10 Mistakes to avoid in data science

Mistake 4 Do not correct for multiple tests

Multiplication of sensors, data gathering protocols → era of Big Data

The more data you analyze, the more weird cases will pop up regularly

Are they significant?

You are a miserable shooter, with a 1% probability of hitting. You shoot 10,000 lasers and hit at the 10,001st shot. Does that make you a shooting genius?

R standard function: p.adjust (Bonferroni, Benjamini-Hochberg)
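The effect is easy to reproduce. The sketch below simulates 10,000 tests in which nothing is going on, then applies p.adjust with the two corrections named on the slide. The simulation set-up (uniform p-values, a 5% threshold, the seed) is an assumption made for illustration.

```r
set.seed(1)

# Chance that a 1%-accuracy shooter hits at least once in 10,000 shots.
p_at_least_one_hit <- 1 - 0.99^10000

# 10,000 tests where nothing is going on: p-values are uniform on [0, 1].
p_raw <- runif(10000)

# Adjust with the two methods named on the slide.
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bh   <- p.adjust(p_raw, method = "BH")

hits <- c(
  uncorrected = sum(p_raw  < 0.05),
  bonferroni  = sum(p_bonf < 0.05),
  bh          = sum(p_bh   < 0.05)
)
print(p_at_least_one_hit)
print(hits)
```

Without correction, roughly 5% of the tests look "significant" by chance alone; after adjustment, few or none should remain.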

Page 11: 10 Mistakes to avoid in data science

Mistake 3 Do not communicate or document

Document your work, for the others as well as for yourself

R Markdown, Shiny

- Create HTML, PDF, Word, slides, web pages, CV, journal, book
- Automatically include & update the results of your analysis

More interactive? Dashboards, interactive maps…

http://rmarkdown.rstudio.com/gallery.html

More details in coming lectures

Page 12: 10 Mistakes to avoid in data science

R Markdown

Page 13: 10 Mistakes to avoid in data science

Mistake 2 Stay alone

Do not stay alone, do not work alone

- Cafés, meetups, colleagues, board games, jogging club
- Publish online (blog)
- Ask for an external view of your work

Page 14: 10 Mistakes to avoid in data science

Mistake 1 Ethics is a useless luxury

Tendency to focus on the technical side, on the challenge

What are you doing? For whom?

What is the impact of your work?
- Company, society, yourself
- Short term, long term

What type of data are you analyzing?
- Law & regulation
- Privacy

Do you have any conflict of interest?

"Yes, but" answers?

Data Science Association: code of conduct
http://www.datascienceassn.org/code-of-conduct.html

Page 15: 10 Mistakes to avoid in data science

CAST!

Xavier Prudent: XAVIER PRUDENT
Organizer: MICHAEL ALBO
The Audience: ALL OF YOU
Technical Support: OVH
Design - Photography: CHRISTINE NAULLEAU

Special thanks to George Lucas and to the audience for their attention

Questions? Comments? Feel free to contact me:

Xavier Prudent, [email protected]