10 mistakes to avoid in data science
TRANSCRIPT
Xavier Prudent
Enough? Depends on what you want to achieve (train a neural network, A/B testing…). Back to statistics (mathematical conditions) + rule of thumb.
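For the A/B testing case, the "back to statistics" step can be sketched as a power calculation. This is a minimal illustration using power.prop.test from R's built-in stats package. The conversion rates (10% vs. 12%), the 5% significance level and the 80% power are assumed values chosen for the example, not figures from the talk.

```r
# Hypothetical A/B test: baseline conversion 10%, hoped-for conversion 12%.
# All four numbers below are illustrative assumptions.
res <- power.prop.test(p1 = 0.10, p2 = 0.12, sig.level = 0.05, power = 0.80)

# Required sample size per group, rounded up to a whole observation
n_per_group <- ceiling(res$n)
print(n_per_group)
```

If the data at hand is far below this number, the test is unlikely to detect the effect, whatever tool is used.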
Many formats can be easily read with R: text, CSV, Excel, protocol buffers, JSON, XML, HTML, SQL…
Many sources already available: Kaggle, website mining, open data, government agencies
library(XML)
Web.page <- htmlTreeParse("http://lapresse.ca")
More details in coming lectures
Montréal Big Data Meetup, 22nd March 2017 – Xavier Prudent – www.xavierprudent.com
Mistake 10 Not Having (enough) Data
Where does your data come from? Stand up and get out, talk with people, read the documentation.
library(ggplot2)
library(tabplot)
tableplot(diamonds)
Do not underestimate naïve tests…
Mistake 9 Do not check data quality
R package: datacheck
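A few of the "naïve tests" can be sketched in base R without any extra package. This is only an illustration: the data frame, its column names and the plausible age range are made up for the example, and the code does not use or reproduce the datacheck package's interface.

```r
# Toy data frame with deliberate problems (all values are made up)
df <- data.frame(
  id    = c(1, 2, 2, 4, 5),
  age   = c(34, NA, 27, -3, 51),
  price = c(10.5, 12.0, 12.0, NA, 9.9)
)

# Missing values per column
na_counts <- colSums(is.na(df))

# Number of repeated identifiers
n_dup_ids <- sum(duplicated(df$id))

# Rows whose age falls outside an assumed plausible range of 0 to 120
bad_age <- which(!is.na(df$age) & (df$age < 0 | df$age > 120))

print(na_counts)
print(n_dup_ids)
print(bad_age)
```

Checks like these are cheap to run before any modelling and often reveal problems in how the data was collected.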
What kind of data? What do I want to know? Geographic? Time series? Correlation?
Which visualization? Histogram, boxplot, mosaic, heatmap, hexbinning, scatterplot, line chart, 3D
Many R packages available: ggplot2, leaflet, plot_ly, corrplot
Mistake 8 Do not look at your data
Look at your data
More details in coming lectures
Choose the right colors: color blindness, printing, meaning…
Level of interactivity?
Set up deadlines. Make to-do lists. Project management tool (Asana). Plan & monitor your time.
Mistake 7 Not having a plan
Have a plan and focus on it
Do not forget the big picture by getting lost in technical tools
What is the question you want to answer?
R package caret: model evaluation, model choice, performance estimation (regression & classification)
Statistical tests: goodness of fit, R², Hosmer-Lemeshow test (MKmisc), Wald test, k-fold cross-validation
Retrain "often"
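To make the k-fold idea concrete, here is a hand-rolled sketch in base R on the built-in mtcars data. The choice of data set, of 5 folds and of the model mpg ~ wt + hp are assumptions made for illustration. In practice caret's train() with trainControl(method = "cv") automates the same loop.

```r
# Hand-rolled 5-fold cross-validation of a linear model (illustrative only)
set.seed(42)                                   # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse <- numeric(k)
for (i in 1:k) {
  train_set <- mtcars[folds != i, ]            # fit on k-1 folds
  held_out  <- mtcars[folds == i, ]            # evaluate on the remaining fold
  fit  <- lm(mpg ~ wt + hp, data = train_set)  # assumed model, for the example
  pred <- predict(fit, newdata = held_out)
  rmse[i] <- sqrt(mean((held_out$mpg - pred)^2))
}

# Average out-of-sample error across the folds
cv_rmse <- mean(rmse)
print(cv_rmse)
```

The point is that performance is measured on data the model has not seen, which is what "do not focus only on training" amounts to.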
Observe > Clean > Understand > Train > Predict
Mistake 6 Focus on training
Question: How much snow will fall on Montréal during the next 5 years?
Data: Snowfall and temperature of the last 80 years
?
Mistake 5 Keep it complex
Do not jump straight to the fashionable, complicated method
Keep your method as simple as possible (focus on the question). Know the limits of this method. Compare the methods (caret, ROC).
Boosted decision tree coupled to a neural network
Linear regression
Complexity comes at a price (speed, error-prone, expertise, amount of data)
Can you afford it?
R standard function: p.adjust (Bonferroni, Benjamini-Hochberg)
You are a miserable shooter, with a 1% probability to hit. You shoot 10,000 lasers, hit at the 10,001st shot. Does that make you a shooting genius?
Mistake 4 Do not correct for multiple tests
Multiplication of sensors, data gathering protocols → era of Big Data
The more data you analyze, the more weird cases will pop up regularly
Are they significant?
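A small simulation can illustrate why the correction matters. This sketch assumes 10,000 independent tests in which nothing is actually going on, so the p-values are simulated as uniform random numbers. The 0.05 threshold and the number of tests are illustrative choices. Only p.adjust from R's stats package is used.

```r
set.seed(1)

# 10,000 tests under the null hypothesis: p-values are uniform on [0, 1]
p_raw <- runif(10000)

# Without correction, roughly 5% of pure-noise tests look "significant"
n_raw <- sum(p_raw < 0.05)

# Bonferroni controls the chance of even one false positive
n_bonf <- sum(p.adjust(p_raw, method = "bonferroni") < 0.05)

# Benjamini-Hochberg controls the expected share of false discoveries
n_bh <- sum(p.adjust(p_raw, method = "BH") < 0.05)

print(c(raw = n_raw, bonferroni = n_bonf, BH = n_bh))
```

Every raw "discovery" here is a false positive by construction, which is the laser shooter's situation: shoot often enough and something will be hit.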
Document your work
R Markdown, Shiny
Create HTML, PDF, Word, slides, web pages, CV, journal, book
Automatically include & update the results of your analysis
More interactive? Dashboards, interactive maps…
http://rmarkdown.rstudio.com/gallery.html
Mistake 3 Do not communicate or document
For others as well as for yourself
More details in coming lectures
R Markdown
Café, meetups, colleagues, board game, jogging club
Publish online (blog)
Ask for an external view of your work
Mistake 2 Stay alone
Do not stay alone, do not work alone
Data Science Association: code of conduct http://www.datascienceassn.org/code-of-conduct.html
Mistake 1 Ethics is a useless luxury
What are you doing? For whom?
What is the impact of your work?
 - Company, society, yourself
 - Short – long term
What type of data are you analyzing?
 - Law & regulation
 - Privacy
Do you have any conflict of interest?
Tendency to focus on the technical side, on the challenge
"Yes, but" answers?
CAST!
Xavier Prudent: XAVIER PRUDENT
Organizer: MICHAEL ALBO
The Audience: ALL OF YOU
Technical Support: OVH
Design - Photography: CHRISTINE NAULLEAU
Special thanks to George Lucas and to the audience for their attention
Questions? Comments? Feel free to contact me:
Xavier Prudent, [email protected]