data.table tutorial uros2018r-project.ro/conference2018/presentations/data... · 3 developers: matt...
Post on 23-Jul-2020
5 Views
Preview:
TRANSCRIPT
JaapWalhoutSeptember12,2018
tutorialuRos 2018
2
1. Introduction
2. Fast read &write
3. Syntax
4. Basicoperations(filteringrows &selecting columns)
5. Summarizing
6. Adding /updatingvariables
7. Joining datasets
8. Reshaping data
Specialsymbols:.N +.SD +.I Specialoperator::=
Overview
3
Developers:MattDowle,Arun Srinivasan,JanGorecki,MichaelChirico,Pasha Stetsenko,TomShort,SteveLianoglou,EduardAntonyan,MarkusBonsch,Hugh Parsonage
Since 2006 onCRAN,>35 releasesso far
678 packagesimport/depend/suggest data.table (543 CRAN+135 Bioconductor)
Homepage:http://r-datatable.com
Introduction
4
Why use data.table?
Pros:- speed- memoryefficiency- coding flexibility- non-equi joins
Cons:- ‘different’syntax
Introduction
5
50million rows /10columns/± 4GB
fread("datafile.csv")
expr time
data.table_fread 15.6 readr_read_csv 92.6 base_read.csv 559.9
fwrite(DT,"datafile.csv")
expr time
data.table_fread 32.6 readr_read_csv 102.2 base_read.csv 201.9
timesinseconds
Fast read &write
6
Threemain enhancements:
1. Columnnames can be used asvariables inside [….]
2. Because they arevariables,wecan use columnnames
to calculate stuffinside [….]
3. Anadditional grouping argument:by
Syntax:data.table ==enhanced data.frame
7
Columnar datastructure:2D– rows and columns
- subset rows df[df$id =="01",]
- select columns df[,"val1"]
- subset rows &select columns df[df$id =="01","val1"]
- that’s about it ….
Syntax:dataframerefresher
8
DT[i,j,by]
Syntax:general form
Which rows? What to do? Grouped by what?
- vectorofrownumbers- logical vector- another data.table
- summarizing- updatingvariable(s)- adding variable(s)
- one ormorecolumns- onthe fly grouping var(s)
9
DT[i,j,by]
data.table: i j bySQL: where select | update group by
Syntax:general form
10
build iniris dataset:
irisDT <- as.data.table(iris)
Example data
11
syntax:DT[i,j,by]
subset rows
select columns
subset rows &select columns
irisDT[Species=="setosa",]
irisDT[,Petal.Width]
irisDT[,.(Petal.Width)]
irisDT[Species=="setosa",Petal.Width]
irisDT[Species=="setosa",.(Petal.Width)]
Filteringrows &selecting columns
12
subset rows irisDT[between(Petal.Width,1,1)]
irisDT[Petal.Width %between%c(1,2)]
select columns irisDT[,.(Species,Sepal.Length)]
Filteringrows &selecting columns
13
Openthe fileex1.R
subset rows : getonly the rows with aday lower than or
equal to 10
select columns : selectonly the Month columnand make
sure you getadata.table back
subset rows &select columns : getonly the Wind&Tempcolumnsfor the
rows with aday higher than 5and lower
than orequal to 10
Exercise 1
14
1. Counts
2. Aggregating
3. Groupby
Summarizing
15
syntax:DT[i,j,by]
count irisDT[Species=="setosa",.N]
count distinct irisDT[,.uniqueN(Species)]
irisDT[Petal.Width <0.9,. uniqueN(Species)]
uniqueN(irisDT,by ="Species")
Counts
16
syntax:DT[i,j,by]
Simpleaggregation: irisDT[,.(count =.N,average =mean(Petal.Width))]
Including filtering: irisDT[Petal.Width <0.9,.(count =.N,average =mean(Petal.Width))]
Aggregating
17
syntax:DT[i,j,by]
irisDT[,.N,by =Species]
irisDT[,.(average =mean(Petal.Width)),by =Species]
irisDT[Sepal.Length <5.3,.(average =mean(Petal.Width)),by =Species]
irisDT[,.(average =mean(Petal.Width)),by =.(Species,logi =Sepal.Length <5.3)]
Groupby
18
specialsymbol:.SD
SD=SubsetofData
- adata.table by itself
- holds dataofcurrent goup asdefined inby
- when noby,.SDapplies to whole data.table
- allows for calculations onmultiplecolumns
Groupby
19
specialsymbol:.SD
irisDT[,lapply(.SD,mean),by =Species]
irisDT[Sepal.Length <5.3,lapply(.SD,mean),by =Species]
Groupby
20
specialsymbol:.SD
specialsymbol:.SDcols
irisDT[,lapply(.SD,mean),by =Species,.SDcols =1:2]
irisDT[,lapply(.SD,mean),by =Species,.SDcols =grep("Length",names(irisDT))]
Groupby
21
DT[i,j,by]
DT[1,3,2]
Orderofexecution
22
Openthe fileex2.R
- Count the number ofdays permonth
- Calculate the average Windspeedby month for only those days that havean
ozone value
- Calculate the mean temperature for the odd and evendays for each month
Exercise 2
23
specialoperator::=
- updatesadata.table inplace (by reference)
- can be used to:
o updateexisting column(s)
o add newcolumn(s)
o deletecolumn(s)
Updating,adding &deleting variables
24
specialoperator::=
irisDT[,Sepal.Length :=Sepal.Length *2]
irisDT[,`:=`(Sepal.Length =Sepal.Length *2,Petal.Width =Petal.Width /2)]
Updatingvariables
25
specialoperator::=
irisDT[,Sepal.Length :=Sepal.Length *uniqueN(Sepal.Width)/.N,by =Species]
irisDT[,`:=`(Sepal.Length =Sepal.Length *uniqueN(Sepal.Width),Petal.Width =Petal.Width /.N)
,by =Species]
Updatingvariablesby group
26
specialoperator::= specialsymbol:.I
irisDT[,rownumber :=.I]
irisDT[,Sepal.Area :=Sepal.Length *Sepal.Width]
irisDT[,`:=`(Sepal.Area =Sepal.Length *Sepal.Width,Petal.Area =Petal.Length *Petal.Width)]
Adding variables
27
specialoperator::=
irisDT[,Total.Sepal.Area :=sum(Sepal.Area),by =Species]
irisDT[,`:=`(Total.Sepal.Area =sum(Sepal.Area),Total.Petal.Area =sum(Petal.Area))
,by =Species]
Adding variablesby group
28
specialoperator::=
irisDT[,Sepal.Length :=NULL]
irisDT[,(1:4):=NULL]
irisDT[,grep("Length",names(irisDT)) :=NULL]
Deleting variables
29
Openthe fileex3.R
- Changethe Windcolumnfrom miles perhour to kilometersperhour
(1mph =1.6kmh)
- Calculate anewchill variable (Wind*Temperature)
- Calculate the average chill by month and add that asanewvariable
- Remove the Ozone and Solar.R columns
Exercise 3
30
- subsetrows DT[id ==“01”,]
- selectcolumns DT[,val1]
- subsetrows &select columns DT[id ==“01”,val1]
Joining datasets
31
DT[i,j,by]
Joining datasets
Which rows? What to do? Grouped by what?
- vectorofrownumbers- logical vector- another data.table
- summarizing- updatingvariable(s)- adding variable(s)
- one ormorecolumns- onthe fly grouping var(s)
32
Example data
irisDT <- copy(iris)setDT(irisDT)
irisH <- data.table(Species=c("setosa","versicolor","virginica"),Species.full =c("Irissetosa","Irisversicolor","Iris virginica"),height =1:3,soil =c("mud","rock","sand"))
Joining datasets
33
syntax:DT[i,on,j,by]
irisDT[irisH,on=.(Species)]
irisDT[irisH,on="Species"]
irisDT[irisH,on=.(Species=Spec,other_col)]
Joining datasets
34
syntax:DT[i,on,j,by]
irisDT[irisH,on=.(Species),Species.full :=Species.full]
irisDT[irisH,on=.(Species),`:=`(Species.full =Species.full,height =height,soil =soil)]
irisDT[irisH,on=.(Species),`:=`(Species.full =i.Species.full,height =i.height,soil =i.soil)]
Joining datasets
35
syntax:DT[i,on,j,by]
like%>% from the tidyverse,you can also chaindata.table operationstogether
irisDT[… ][… ][… ]
irisDT[irisH,on=.(Species),Species.full :=Species.full][,median(Sepal.Length),by =Species.full]
Joining &chaining
36
Openthe fileex4.R
- Use ajoin to add the month namefrom 'airmonths' to 'air'
- Use ajoin to add both the month nameand the month abbreviation from
'airmonths' to 'air'
- Use ajoin to add the month namefrom 'airmonths' to 'air’;then use chaining to
calculate the median Windspeedfor each month name
Exercise 4
37
From wide to long: irisMelted <- melt(irisDT,id ="Species")
melt(data,id.vars,measure.vars,variable.name ="variable",value.name ="value",na.rm =FALSE,variable.factor =TRUE,value.factor =FALSE)
Seealso:?melt
Reshaping data
38
From longto wide: dcast(irisMelted,Species~variable)
dcast(data,formula,fun.aggregate =NULL,sep="_",...,margins =NULL,subset=NULL,fill =NULL,drop=TRUE,value.var =guess(data))
Seealso:?dcast
Reshaping data
39
morejoins: non-equi joins +rollingjoins
morespecialsymbols: .BY +.GRP
specialgrouping functions: rowid +rleid
set*functions: setkey +setorder +setcolorder +setnames +…..
and evenmore: frank +shift +CJ +tstrsplit +…..
What else isthere to discover?
40
Overview ofgettingstarted vignettes
Datacamp’s data.tablecourse (paid)
StackOverflow [data.table]tag (>7700questions)
Wantto learn more?
41
Thank you for your attention!
TheEnd
top related