Introduction to R for historians (part 4: data manipulation)
Recap | Data manipulation | data.table package | Basic statistical techniques
Data manipulation in R
Richard L. Zijdeman
May 29, 2015
Richard L. Zijdeman Data manipulation in R
1 Recap
2 Data manipulation
3 data.table package
4 Basic statistical techniques
Recap
What we’ve seen so far
functions to read in data: read.csv(), read.xlsx()
objects
  assignment: <-
  characteristics, e.g.: str(), summary(), head(), tail()
calculus: mean(), min(), max()
plotting: plot(), ggplot()
  ggplot() paints by ‘layer’
Before we go on...
Structure your R script
  Filename, Date, Purpose, Author, Last change
  Use comments to tell what you are doing
    read in data
    changing variables (why did you do it)
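A minimal sketch of such a script layout. The file name, the header values, and the tiny inline data set are placeholders made up for illustration, not taken from the course material.

```r
# Filename:    prepare_marriages.R      (hypothetical name)
# Date:        <date created>
# Purpose:     read in marriage records and clean the age variables
# Author:      <your name>
# Last change: <date and short note>

# ---- read in data ----
# A tiny inline CSV stands in for a real file here.
marriages <- read.csv(text = "age_groom,age_bride\n25,23\n-2,30",
                      stringsAsFactors = FALSE)

# ---- changing variables ----
# Negative ages are impossible, so they are treated as missing.
marriages$age_groom[marriages$age_groom < 0] <- NA

marriages
```

The comment above each step says why the change was made, not only what was done.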
Create a working directory, with subdirs
+ documents
+ data
  - source
  - derived
+ analysis
+ figures
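The folder layout above could be created from R itself. This is only a sketch: the project root is placed in R's temporary directory and called "project", both of which are assumptions for the example.

```r
# Hypothetical project root inside R's temporary directory
root <- file.path(tempdir(), "project")

# Sub-directories as listed on the slide
subdirs <- c("documents", "data/source", "data/derived", "analysis", "figures")

# recursive = TRUE also creates the parent folders ("project", "data")
for (d in subdirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Show what was created, relative to the project root
list.dirs(root, full.names = FALSE)
```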
Set a working directory
  setwd(), getwd()
  use relative paths to save things
    “./” = current directory
    “./../” = folder up
Read J. Scott Long’s “Workflow”
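A small sketch of how "./" and "./../" resolve once a working directory is set. The folder names are assumptions, and a temporary directory is used so nothing on your own disk is touched.

```r
old_wd <- getwd()                       # remember where we started

# Hypothetical working directory: <tempdir>/project/analysis
work <- file.path(tempdir(), "project", "analysis")
dir.create(work, recursive = TRUE, showWarnings = FALSE)
setwd(work)

here <- normalizePath(".")              # "./"    = current directory
up   <- normalizePath("./..")           # "./../" = one folder up

basename(here)
basename(up)

setwd(old_wd)                           # restore the original working directory
```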
Data manipulation
Assignment and Indexing
First, we’ll read in the HSN marriages again
hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)
Change case of text
tolower(), toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
## [1] "can we please just have lower cases?"
names(hmar) <- tolower(names(hmar))
names(hmar)
##  [1] "id_marriage"   "idnr"          "m_loc"         "m_year"
##  [5] "sex_hsnrp"     "age_groom"     "occ_groom"     "civilst_groom"
##  [9] "sign_groom"    "b_loc_groom"   "l_loc_groom"   "age_bride"
## [13] "occ_bride"     "civilst_bride" "sign_bride"    "b_loc_bride"
## [17] "l_loc_bride"   "a_f_groom"     "occ_f_groom"   "sign_f_groom"
## [21] "a_m_groom"     "occ_m_groom"   "sign_m_groom"  "a_f_bride"
## [25] "occ_f_bride"   "sign_f_bride"  "a_m_bride"     "occ_m_bride"
## [29] "sign_m_bride"
Indexing
There were way too many names to print on a slide... How many names are there actually?
Use the length() command to find out:
length(names(hmar))
## [1] 29
So let’s print just the first two:
names(hmar)[1:2]
## [1] "id_marriage" "idnr"
The technique using square brackets is called indexing
Any idea how we would show the last two names?
x <- length(names(hmar))
names(hmar)[(x-1):x]
## [1] "occ_m_bride" "sign_m_bride"
Using c() (concatenate) we can also extract several names
names(hmar)[c(1, 3, 5)]
## [1] "id_marriage" "m_loc" "sex_hsnrp"
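The indexing patterns so far can be tried on a short stand-alone vector. The five names are copied from the output above; the negative index and tail() are additions that were not on the slides.

```r
x <- c("id_marriage", "idnr", "m_loc", "m_year", "sex_hsnrp")
n <- length(x)

first_two     <- x[1:2]          # a range of positions
last_two      <- x[(n - 1):n]    # note the brackets around n - 1
picked        <- x[c(1, 3, 5)]   # several positions via c()
without_first <- x[-1]           # a negative index drops that position
also_last_two <- tail(x, 2)      # tail() also works on vectors

last_two
picked
```

Without the brackets, n - 1:n is read as n - (1:n), which is something else entirely.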
We can also apply indexing to a data.frame:
hmar[1:2, 1:3]
##   id_marriage idnr              m_loc
## 1           1 1001 Abcoude-Baambrugge
## 2           2 1005              Baarn
# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
head() and tail()
So actually, you should now be able to replace head() and tail()
How?
# head()
hmar[1:6, ]
# tail()
y <- nrow(hmar)
hmar[(y-5):y, ]
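To check that indexing really reproduces head() and tail(), here is a sketch on mtcars, a small data.frame that ships with R and stands in for hmar (an assumption made so the example runs without the HSN file).

```r
# First six rows, like head()
first_six <- mtcars[1:6, ]

# Last six rows, like tail(): from row n - 5 up to row n
n <- nrow(mtcars)
last_six <- mtcars[(n - 5):n, ]

# Compare with the built-in functions
isTRUE(all.equal(first_six, head(mtcars)))
isTRUE(all.equal(last_six, tail(mtcars)))
```

Counting from n - 5 to n gives six rows; starting at n - 6 would give seven.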
data.table package
Developed by Matt Dowle
Website: https://github.com/Rdatatable/data.table/wiki
Why data.table?
  fast subsetting on large files
  more consistent ‘grammar’
  less typing
install.packages("data.table")
library(data.table)
Class: data.table
For data.table functions to work we need to define a data.frame as class data.table
is.data.table(hmar)
## [1] FALSE
hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)
## [1] TRUE
is.data.frame(hmar.dt)
## [1] TRUE
Friends with benefits
data.frame and data.table are like ‘friends with benefits’
all.equal(hmar, hmar.dt)
## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first 2 components >"
## [3] "Attributes: < Component 1: Modes: character, externalptr >"
## [4] "Attributes: < Component 1: target is character, current is externalptr >"
## [5] "Attributes: < Component 2: Modes: numeric, character >"
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, current is character >"
# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
NB: the next series of commands will only work for data.tables
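A short sketch of what the double class means in practice, on a made-up three-row table (the column names and values are assumptions for illustration).

```r
library(data.table)

df <- data.frame(id = 1:3, age = c(24, 31, 45))
dt <- data.table(df)

# A data.table carries both classes, so data.frame tools keep working
class(dt)
is.data.frame(dt)

# Ordinary data.frame-style calls on the data.table
nrow(dt)
mean(dt$age)
```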
Sort with setkey
Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are given as character strings
hmar.dt[1:6, m_year]
## [1] 1849 1851 1864 1840 1843 1858
# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]
## [1] 1831 1831 1833 1833 1834 1834
identical(hmar.dt, hmar)
## [1] FALSE
Multiple keys
It is also possible to sort on multiple keys
setkeyv(hmar.dt, c("id_marriage", "idnr"))
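What sorting on two keys does can be seen on a toy table. The columns id and year and their values are invented for this sketch.

```r
library(data.table)

dt <- data.table(id   = c(3L, 1L, 2L, 1L),
                 year = c(1850L, 1840L, 1845L, 1835L))

# Sort by id first, then by year within id.
# setkeyv() changes dt in place; no assignment is needed.
setkeyv(dt, c("id", "year"))

key(dt)   # the columns the table is now keyed (sorted) on
dt
```

Because the table is modified in place, the original row order is gone afterwards; keep a copy with copy(dt) first if you still need it.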
Subsetting
groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)
## [1] 2493 29
groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)
## [1] 9590 29
groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]
dim(groom.sig)
## [1] 2358 29
groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]
dim(groom.sig)
## [1] 9985 2
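The same three kinds of subset, sketched on a made-up five-row table so the row counts can be checked by eye. The values are invented; only the column names follow the HSN data.

```r
library(data.table)

dt <- data.table(id         = 1:5,
                 age_groom  = c(24L, 31L, 45L, 28L, 33L),
                 sign_groom = c("h", "h", "a", "h", "h"),
                 m_year     = c(1840L, 1841L, 1840L, 1850L, 1860L))

# Rows only: grooms older than 30
older <- dt[age_groom > 30, ]

# Two conditions combined with &
signed_older <- dt[sign_groom == "h" & age_groom > 30, ]

# Rows and columns: drop marriages from 1840, keep two columns
not_1840 <- dt[m_year != 1840, list(id, m_year)]

dim(older)
dim(signed_older)
dim(not_1840)
```

Inside the square brackets of a data.table, column names can be used directly, without the dt$ prefix.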
Creating new variables
Let’s create a variable for the mean age at marriage of grooms
hmar.dt[, mean.gage := mean(age_groom)]
summary(hmar.dt$age_groom)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -2.00   24.00   26.00   28.38   30.00   79.00
summary(hmar.dt$mean.gage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   28.38   28.38   28.38   28.38   28.38   28.38
Another example (from yesterday)
Dummy variable for equal municipality of birth
hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]
summary(hmar.dt$eq_b_loc)
##    Mode   FALSE    TRUE    NA's
## logical    6957    3043       0
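Both uses of := can be followed on a three-row toy table. The place names and ages are made up for this sketch.

```r
library(data.table)

dt <- data.table(b_loc_groom = c("Baarn", "Utrecht", "Baarn"),
                 b_loc_bride = c("Baarn", "Baarn", "Utrecht"),
                 age_groom   = c(24, 30, 36))

# One overall mean, repeated on every row
dt[, mean.gage := mean(age_groom)]

# Logical dummy: TRUE when both were born in the same place
dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]

dt
```

The := operator adds the column to dt in place, which is why nothing is assigned with <-.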
Creating variables by group
As we saw, a var with mean age wasn’t really interesting
average age of grooms at marriage by civil status
hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]
table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)
##
##     27.2427939112599 40.8829787234043 42.9548286604361   53
##   1             9263                0                0    0
##   2                0                0              642    0
##   3                0               94                0    0
##   6                0                0                0    1
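How by = fills in a group mean can be followed on a toy table with three civil-status codes. The codes and ages are invented for this sketch.

```r
library(data.table)

dt <- data.table(civilst_groom = c(1L, 1L, 2L, 2L, 3L),
                 age_groom     = c(24, 28, 40, 44, 50))

# Each row receives the mean age of its own civil-status group
dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]

dt

# One line per group is often easier to read than a cross table
unique(dt[, list(civilst_groom, gage.mean.civ)])
```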
Summary subsets of the data
So far, added vars to original data.frame
  can be redundant though
Think of context, say municipalities
  archival material on characteristics, e.g.:
    population
    steam power
You can also make context characteristics by aggregation
mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]
summary(mc)
##  b_loc_groom              V1
##  Length:1184        Min.   :-2.00
##  Class :character   1st Qu.:26.00
##  Mode  :character   Median :28.17
##                     Mean   :29.36
##                     3rd Qu.:31.00
##                     Max.   :69.00
We can improve by naming the variable directly, and adding more variables
mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]
summary(mc2)
##  b_loc_groom          mean_gage       mean_bage
##  Length:1184        Min.   :-2.00   Min.   :-2.00
##  Class :character   1st Qu.:26.00   1st Qu.:23.80
##  Mode  :character   Median :28.17   Median :25.88
##                     Mean   :29.36   Mean   :26.53
##                     3rd Qu.:31.00   3rd Qu.:28.00
##                     Max.   :69.00   Max.   :64.00
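The difference with := is that aggregation returns a new, smaller table with one row per group. A sketch on three made-up marriages in two places:

```r
library(data.table)

dt <- data.table(b_loc_groom = c("Baarn", "Baarn", "Utrecht"),
                 age_groom   = c(24, 28, 30),
                 age_bride   = c(22, 26, 27))

# One row per birth place, with named summary columns
mc <- dt[, list(mean_gage = mean(age_groom),
                mean_bage = mean(age_bride)),
         by = b_loc_groom]

mc
nrow(dt)   # the original table is unchanged
```

If the age columns contained missing values, mean(age_groom, na.rm = TRUE) would be needed to avoid NA results.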
One more... counts
Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it.
Now let’s do it properly:
mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]
# notice the .N ... N is often used for nr. of obs
library(ggplot2)
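What .N counts can be checked on a toy table where two marriages share the same year and bride's age. The values are invented for this sketch.

```r
library(data.table)

dt <- data.table(m_year    = c(1850L, 1850L, 1850L, 1851L),
                 age_bride = c(22L, 22L, 25L, 22L))

# .N is the number of rows in each (m_year, age_bride) combination
counts <- dt[, list(frequency = .N),
             by = list(m_year, age_bride)]

counts
```

Each overlapping point in the scatter plot becomes a single row with its frequency, which can then be mapped to colour or size.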
Using colour
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(colour = frequency),
             size = 10, shape = 18) +
  theme_bw()
[Figure: age_bride (y-axis, 0 to 60) against m_year (x-axis, 1850 to 2000); point colour shows frequency (legend: 10, 20, 30)]
Using size
ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(size = frequency),
             colour = "blue", shape = 18) +
  theme_bw()
[Figure: age_bride (y-axis, 0 to 60) against m_year (x-axis, 1850 to 2000); point size shows frequency (legend: 10, 20, 30)]
Basic statistical techniques
Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3 - Q1
Outliers (Tukey, 1977):
  x < Q1 - 1.5*IQR
  x > Q3 + 1.5*IQR
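The outlier rule can be worked through by hand on ten made-up numbers. This sketch uses quantile() with its default settings for Q1 and Q3, which is an assumption; boxplot() itself uses Tukey's hinges, which can differ slightly on small samples.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                            # interquartile range

lower <- q1 - 1.5 * iqr                   # lower fence
upper <- q3 + 1.5 * iqr                   # upper fence

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]

c(q1 = q1, q3 = q3, iqr = iqr, lower = lower, upper = upper)
outliers
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5, the fences are -3.5 and 14.5, and only the value 100 falls outside them.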
boxplot(hmar.dt$age_bride,
        ylab = "Age")
[Figure: box plot of age_bride; y-axis "Age", 0 to 60]
hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
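Why missing values deserve attention when recoding can be seen on a toy table that does contain them. The four rows are invented, and NA_integer_ is used instead of plain NA here only so the typed missing value matches the integer column.

```r
library(data.table)

dt <- data.table(age_bride  = c(12L, 25L, NA, 40L),
                 sign_bride = c("h", "a", NA, "h"))

# A comparison with a missing value is itself missing, not FALSE
dt[, sign.bride.cln := sign_bride == "h"]

# Rows where the condition is NA are left untouched by the recode
dt[age_bride < 14, age_bride := NA_integer_]

dt
```

The third bride ends up as NA on the dummy rather than as "not signed", so she drops out of any plot or table that groups on it.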
boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))
[Figure: box plots of age_bride for "not signed" and "signed" brides; y-axis 20 to 70]