Dr. Datascience, or: How I Learned to Stop Munging and Love Tests

Post on 14-Apr-2017


Dr. Datascience

Or: How I Learned to Stop Munging and Love Tests

Mike Malecki (mike@crunch.io)

Neal Richardson (neal@crunch.io)

About us

• Political scientists

• Then worked in survey research industry

• Now in data product development

• Crunch.io

Data “Science”

vs. “Faith-based coding”

• Misplaced faith in own infallibility ✔︎

• Your code works because you believe it does

• Its output feels true

Tests

• Make the implicit explicit

• Turn assumptions into assertions

• Are a form of documentation

• Reduce complexity

• Are liberating

What are tests?

• Assertions, written in code, that your functions do what you expect

• That if you give certain inputs, you’ll get known, expected outputs

• That giving invalid input results in an expected failure

• Tests are code: code that must be run every time you make changes
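A minimal sketch of both kinds of assertion, written with testthat. The helper `percent_of()` is a hypothetical stand-in invented for illustration; it is not from the talk.

```r
library(testthat)

# Hypothetical helper: share of `part` in `total`, as a percentage.
percent_of <- function(part, total) {
    if (!is.numeric(part) || !is.numeric(total)) {
        stop("part and total must be numeric")
    }
    100 * part / total
}

# Known inputs give known, expected outputs.
test_that("percent_of returns known outputs for known inputs", {
    expect_equal(percent_of(1, 4), 25)
    expect_equal(percent_of(c(1, 2), 4), c(25, 50))
})

# Invalid input results in an expected failure.
test_that("percent_of fails loudly on invalid input", {
    expect_error(percent_of("1", 4), "must be numeric")
})
```

The second test is the less obvious one: it pins down how the function is supposed to fail, so a silent coercion cannot creep in later.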

Getting started

• Make a package

    source("mycode.R")
    df <- read.csv("data.csv")
    doThings(df)

Getting started

• Make a package. Not that different.

Use a package skeleton, such as https://github.com/nealrichardson/skeletor

    library(rmycode)
    df <- read.csv("data.csv")
    doThings(df)

Testing flow

• Write test. Run it and see it fail.

• Write code that makes test pass.

• Run tests again. See them pass.

• Repeat
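The cycle above can be sketched in a single script. This is only an illustration: `count_rows()` is a hypothetical function, and the `tryCatch()` wrapper is there so the script keeps running after the deliberate first failure. In a package you would run the test suite instead.

```r
library(testthat)

# Step 1: write the test first. `count_rows()` (hypothetical) does not exist yet.
run_test <- function() {
    test_that("count_rows counts rows", {
        expect_equal(count_rows(data.frame(x = 1:3)), 3)
    })
}

# Step 2: run it and see it fail (the function is missing).
first_run <- tryCatch({
    run_test()
    "passed"
}, error = function(e) "failed")

# Step 3: write the code that makes the test pass.
count_rows <- function(df) nrow(df)

# Step 4: run the test again and see it pass.
second_run <- tryCatch({
    run_test()
    "passed"
}, error = function(e) "failed")

print(c(first = first_run, second = second_run))
```

Seeing the failure first matters: it shows that the test can fail, so a later pass actually tells you something.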

Example

Read and analyze AWS Elastic Load Balancer logs

Example

    enpiar:c npr$ R -e 'skeletor::skeletor("elbr")'
    enpiar:c npr$ cd elbr
    enpiar:elbr npr$ atom .

Example

    # elbr/tests/testthat/test-read.R

    context("read.elb")

    test_that("read.elb returns a data.frame", {
        expect_true(is.data.frame(read.elb("example.log")))
    })

Example

    enpiar:elbr npr$ make test
    ...
    Loading required package: elbr
    read.elb: 1

    Failed -------------------------------------------------------------------------
    1. Error: read.elb returns a data.frame (@test-something.R#4) ------------------
    could not find function "read.elb"
    1: .handleSimpleError(function (e)
       {
           e$call <- sys.calls()[(frame + 11):(sys.nframe() - 2)]
           register_expectation(e, frame + 11, sys.nframe() - 2)
           signalCondition(e)
       }, "could not find function \"read.elb\"", quote(eval(expr, envir, enclos))) at testthat/test-something.R:4
    2: eval(expr, envir, enclos)

    DONE ===========================================================================
    Error: Test failures

Example

    # elbr/R/read-elb.R

    read.elb <- function (file, stringsAsFactors=FALSE, ...) {
        read.delim(file,
            sep=" ",
            stringsAsFactors=stringsAsFactors,
            col.names=c("timestamp", "elb", "client_port", "backend_port",
                "request_processing_time", "backend_processing_time",
                "response_processing_time", "elb_status_code",
                "backend_status_code", "received_bytes", "sent_bytes",
                "request", "user_agent", "ssl_cipher", "ssl_protocol"),
            ...)
    }

Example

    enpiar:elbr npr$ make test
    ...
    Loading required package: elbr
    read.elb: .

    DONE ===========================================================================

Example

    test_that("read.elb returns a data.frame", {
        df <- read.elb("example.log")
        expect_true(is.data.frame(df))
        expect_equal(dim(df), c(4, 15))
    })

Example

    enpiar:elbr npr$ make test
    ...
    Loading required package: elbr
    read.elb: .1

    Failed -------------------------------------------------------------------------
    1. Failure: read.elb returns a data.frame (@test-something.R#6) ----------------
    dim(df) not equal to c(4, 15).
    1/2 mismatches
    [1] 3 - 4 == -1

    DONE ===========================================================================
    Error: Test failures

Example

    read.elb <- function (file, stringsAsFactors=FALSE, ...) {
        read.delim(file,
            sep=" ",
            header=FALSE, # <-- Oh, right.
            stringsAsFactors=stringsAsFactors,
            col.names=c("timestamp", "elb", "client_port", "backend_port",
                "request_processing_time", "backend_processing_time",
                "response_processing_time", "elb_status_code",
                "backend_status_code", "received_bytes", "sent_bytes",
                "request", "user_agent", "ssl_cipher", "ssl_protocol"),
            ...)
    }

Example

    enpiar:elbr npr$ make test
    ...
    Loading required package: elbr
    read.elb: ..

    DONE ===========================================================================
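The missing-row failure can be reproduced without the package. The sketch below assumes a space-separated, 15-field layout and uses two made-up log lines (not real traffic) written to a temporary file. It calls base `read.delim()` directly rather than the talk's `read.elb()`.

```r
library(testthat)

# Two illustrative, made-up log lines with 15 space-separated fields each.
log_lines <- c(
    paste('2017-04-14T00:00:01.000000Z my-elb 192.0.2.10:2817 10.0.0.1:80',
          '0.000073 0.001048 0.000057 200 200 0 29',
          '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -'),
    paste('2017-04-14T00:00:02.000000Z my-elb 192.0.2.11:2818 10.0.0.2:80',
          '0.000081 0.002210 0.000049 404 404 0 512',
          '"GET http://www.example.com:80/missing HTTP/1.1" "curl/7.38.0" - -')
)
log_file <- tempfile(fileext = ".log")
writeLines(log_lines, log_file)

# read.delim() defaults to header = TRUE, so the first line is consumed as names.
with_header <- read.delim(log_file, sep = " ", stringsAsFactors = FALSE)
no_header <- read.delim(log_file, sep = " ", header = FALSE,
                        stringsAsFactors = FALSE)

test_that("every log line becomes a row", {
    expect_equal(nrow(with_header), length(log_lines) - 1)  # the bug
    expect_equal(dim(no_header), c(length(log_lines), 15))  # the fix
})
```

A "dumb" dimension check is what catches this: the data frame looks plausible either way, and only the row count shows that one request went missing.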

Tests make explicit

• Tradeoffs everywhere ⚖

• Is an integer an implicit categorical?

• Don’t try to be clever.
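One way to make the integer-versus-categorical decision explicit is to assert the type you decided on. A minimal sketch, assuming a hypothetical survey column where the codes 1 to 3 stand for party labels:

```r
library(testthat)

# Hypothetical raw column: integer codes that really mean categories.
raw <- c(1L, 3L, 2L, 1L)

# The explicit decision: these codes are labels, not quantities.
party <- factor(raw, levels = 1:3, labels = c("R", "I", "D"))

test_that("party is stored as a categorical, not a number", {
    expect_true(is.factor(party))
    expect_equal(levels(party), c("R", "I", "D"))
    expect_false(anyNA(party))  # no code fell outside the known levels
})
```

If a new code such as 4 ever shows up in the data, `factor()` turns it into `NA` and the last expectation fails instead of the value being silently averaged.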

Tests assert

• You can assert dumb things like row counts

• Despite lubridate, [time] is never simple

• Don’t be surprised by being wrong later
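A short sketch of both kinds of assertion, using made-up data. It parses timestamps with base R's `as.POSIXct()` rather than lubridate to stay dependency-free, and the ISO 8601 format string is an assumption about how the timestamps look.

```r
library(testthat)

# Made-up tables for illustration.
requests <- data.frame(
    id = 1:3,
    timestamp = c("2017-04-14T23:39:43.945958Z",
                  "2017-04-14T23:59:59.000000Z",
                  "2017-04-15T00:00:00.500000Z"),
    stringsAsFactors = FALSE
)
status <- data.frame(id = 1:3, code = c(200L, 404L, 200L))

merged <- merge(requests, status, by = "id")
merged$time <- as.POSIXct(merged$timestamp,
                          format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

test_that("the merge neither drops nor duplicates rows", {
    expect_equal(nrow(merged), nrow(requests))
})

test_that("every timestamp parses, in an explicit time zone", {
    expect_false(anyNA(merged$time))
    expect_equal(attr(merged$time, "tzone"), "UTC")
})
```

Neither check is clever. Both describe failures that would otherwise surface much later, as a wrong total or a chart shifted by a few hours.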

Tests document

• “I combined categories” aka “recode”

• The data itself doesn’t preserve this relationship

• Missingness is hard

• Did I already do it?

• df$col[df$col == 1 | df$col == 2] <- 1

• expect_equal(unique(col), 1:5)
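The recode above can be wrapped in a function and documented by a test. This is a sketch under assumptions: the five-category variable and the name `recode_col()` are hypothetical. Note the element-wise `|`; the scalar `||` does not work on whole columns.

```r
library(testthat)

# Hypothetical variable with categories 1 to 5.
col <- c(1, 2, 3, 4, 5, 2, 1)

# Collapse categories 1 and 2 into 1.
recode_col <- function(x) {
    x[x == 1 | x == 2] <- 1  # `|` compares element by element
    x
}

recoded <- recode_col(col)

test_that("categories 1 and 2 are combined", {
    expect_equal(sort(unique(col)), 1:5)               # before the recode
    expect_equal(sort(unique(recoded)), c(1, 3, 4, 5)) # after the recode
})

test_that("recoding twice is harmless", {
    expect_equal(recode_col(recoded), recoded)
})
```

The first test records which categories exist before and after, which the data alone no longer shows. The second answers "did I already do it?" by making a repeat run a no-op.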

Tests simplify

• Turn big, hard-to-reason-about problems into small ones

• expect_equal(dimnames(pred), dimnames(population))

     num [1:4, 1:4, 1:6, 1:51, 1:3] 0.0196 0.0414 0.038 0.0106 0.0167 ...
     - attr(*, "dimnames")=List of 5
      ..$ edu        : chr [1:4] "<HS" "HS" "Some" "Grad"
      ..$ age        : chr [1:4] "18-29" "30-44" "45-64" "≥65"
      ..$ race.female: chr [1:6] "WhiteM" "BlackM" "HispanicM" "WhiteF" ...
      ..$ state      : chr [1:51] "AK" "AL" "AR" "AZ" ...
      ..$ party      : chr [1:3] "R" "I" "D"
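The same check on a small scale: a sketch with two made-up two-dimensional arrays instead of the five-dimensional one above. The numbers are invented for illustration.

```r
library(testthat)

dims <- list(age = c("18-29", "30-44"), party = c("R", "I", "D"))

# Made-up cell counts and made-up predicted rates, sharing one layout.
population <- array(c(10, 20, 30, 40, 50, 60), dim = c(2, 3), dimnames = dims)
pred <- array(0.5, dim = c(2, 3), dimnames = dims)

test_that("pred and population are laid out identically", {
    expect_equal(dimnames(pred), dimnames(population))
})

# Only after that check is element-wise arithmetic safe to trust.
weighted <- pred * population

# A transposed array holds the same numbers but would misalign silently.
transposed <- aperm(pred, c(2, 1))
test_that("a transposed array is caught by the same check", {
    expect_false(identical(dimnames(transposed), dimnames(population)))
})
```

Checking the dimension names replaces reasoning about every cell of a large array with one small question: do the two arrays agree on what each axis means?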

Tests liberate

• Free to extend your code without worrying about breaking what it already does

• Fix bugs and handle unforeseen complications only once

Why not just hack?

Because data contracts can't be trusted

Because you'll have to extend your code to do something else

Because someone else will pick up your code in the future

Because that someone else could be your future self

Because you’re already testing, just not systematically
