Introduction to R for historians (part 4: data manipulation)


Data manipulation in R

Richard L. Zijdeman

May 29, 2015




1 Recap

2 Data manipulation

3 data.table package

4 Basic statistical techniques




Recap




What we’ve seen so far

functions to read in data: read.csv(), read.xlsx()

objects
  assignment: <-
  characteristics, e.g.: str(), summary(), head(), tail()

calculus: mean(), min(), max()

plotting: plot(), ggplot()
  paint by ‘layer’




Before we go on. . .

Structure your R script
  Filename, Date, Purpose, Author, Last change
  Use comments to tell what you are doing
    read in data
    changing variables (why did you do it)




Create a working directory, with subdirs

+ documents
+ data
  - source
  - derived
+ analysis
+ figures




Set a working directory
  setwd(), getwd()
  use relative paths to save things
  “./” = current directory
  “./../” = one folder up
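A minimal sketch of building a relative path in base R. The folder names follow the directory layout above; the file name example.csv is hypothetical and the file does not need to exist for the code to run.

```r
# Where are we now?
getwd()

# Build a relative path: one folder up, then data/derived/
# ("example.csv" is a made-up file name for illustration)
csv_path <- file.path(".", "..", "data", "derived", "example.csv")
csv_path

# Show the full path that the relative path points to
normalizePath(csv_path, mustWork = FALSE)
```

file.path() inserts the separators for you, which avoids typing mistakes in long paths.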

Read J. Scott Long’s “Workflow”


http://www.indiana.edu/~jslsoc/web_workflow/wf_home.htm



Data manipulation




Assignment and Indexing

First, we’ll read in the HSN marriages again

hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)




Change case of text

tolower()
toupper()

tolower("CaN we pleASe jUSt have LOWER cases?")

## [1] "can we please just have lower cases?"

names(hmar) <- tolower(names(hmar))
names(hmar)

##  [1] "id_marriage"   "idnr"          "m_loc"         "m_year"
##  [5] "sex_hsnrp"     "age_groom"     "occ_groom"     "civilst_groom"
##  [9] "sign_groom"    "b_loc_groom"   "l_loc_groom"   "age_bride"
## [13] "occ_bride"     "civilst_bride" "sign_bride"    "b_loc_bride"
## [17] "l_loc_bride"   "a_f_groom"     "occ_f_groom"   "sign_f_groom"
## [21] "a_m_groom"     "occ_m_groom"   "sign_m_groom"  "a_f_bride"
## [25] "occ_f_bride"   "sign_f_bride"  "a_m_bride"     "occ_m_bride"
## [29] "sign_m_bride"




Indexing

There were way too many names to print on a slide... How many names are there actually?




Use the length() command to find out:

length(names(hmar))

## [1] 29

So let’s print just the first two:

names(hmar)[1:2]

## [1] "id_marriage" "idnr"

The technique using square brackets is called indexing




Any idea how we would show the last two names?




x <- length(names(hmar))
names(hmar)[(x-1):x]

## [1] "occ_m_bride"  "sign_m_bride"

Using c() (concatenate) we can also extract several names at once

names(hmar)[c(1, 3, 5)]

## [1] "id_marriage" "m_loc" "sex_hsnrp"




We can also apply indexing to a data.frame:

hmar[1:2, 1:3]

##   id_marriage idnr              m_loc
## 1           1 1001 Abcoude-Baambrugge
## 2           2 1005              Baarn

# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
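The data.frame[rows, columns] pattern also accepts column names and logical conditions. A small sketch on an invented toy data.frame (the values are made up and are not the HSN data):

```r
# Toy data.frame with invented values
toy <- data.frame(id    = 1:4,
                  place = c("Baarn", "Utrecht", "Leiden", "Delft"),
                  year  = c(1849L, 1851L, 1864L, 1840L),
                  stringsAsFactors = FALSE)

toy[2, ]                        # row 2, all columns
toy[, c("id", "year")]          # all rows, columns selected by name
toy[toy$year > 1850, "place"]   # rows meeting a condition, one column
```

Leaving the part before or after the comma empty means "all rows" or "all columns".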




head() and tail()

So actually, you should now be able to replace head() and tail()

How?




# head()
hmar[1:6, ]

# tail(): the last 6 rows run from y-5 up to y
y <- nrow(hmar)
hmar[(y-5):y, ]
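To check that indexing really reproduces head() and tail(), the results can be compared on a small invented data.frame (toy is a made-up object, not the HSN data):

```r
# Toy data.frame with 20 rows of invented values
toy <- data.frame(id = 1:20, value = (1:20)^2)
n <- nrow(toy)

first_six <- toy[1:6, ]        # same rows as head(toy)
last_six  <- toy[(n - 5):n, ]  # same rows as tail(toy)

# Compare with the built-in functions
all.equal(first_six, head(toy))
all.equal(last_six, tail(toy))
```

Note that (n - 6):n would return seven rows, one more than tail() gives by default.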




data.table package




Developed by Matt Dowle
Website: https://github.com/Rdatatable/data.table/wiki

Why data.table?
  fast subsetting on large files
  more consistent ‘grammar’
  less typing





install.packages("data.table")

library(data.table)




Class: data.table

For data.table functions to work we need to define a data.frame as class data.table

is.data.table(hmar)

## [1] FALSE

hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)

## [1] TRUE

is.data.frame(hmar.dt)

## [1] TRUE



Friends with benefits

Data.frame and data.table are like ‘friends with benefits’

all.equal(hmar, hmar.dt)

## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first 2 components >"
## [3] "Attributes: < Component 1: Modes: character, externalptr >"
## [4] "Attributes: < Component 1: target is character, current is externalptr >"
## [5] "Attributes: < Component 2: Modes: numeric, character >"
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, current is character >"

# so we have all the benefits of a data.frame
# ... and additional benefits of data.table
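One way to see why both sets of benefits apply is to look at the class of the object. A sketch on an invented toy data.frame, assuming the data.table package is installed:

```r
library(data.table)

# Toy data.frame with invented values
toy <- data.frame(id = 1:3, place = c("Baarn", "Utrecht", "Leiden"))
toy.dt <- as.data.table(toy)

class(toy.dt)                    # lists data.table first, then data.frame
inherits(toy.dt, "data.frame")   # data.frame functions therefore still apply
nrow(toy.dt)                     # e.g. nrow() works as before
```

Because a data.table carries both classes, functions written for data.frames generally accept it too.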

NB: the next series of commands will only work for data.tables



Sort with setkey

Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are given as character strings

hmar.dt[1:6, m_year]

## [1] 1849 1851 1864 1840 1843 1858

# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]

## [1] 1831 1831 1833 1833 1834 1834

identical(hmar.dt, hmar)

## [1] FALSE
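The sorting step can be tried on a small invented data.table first. This sketch assumes the data.table package is installed; toy.dt and its values are made up:

```r
library(data.table)

# Toy data.table with invented values
toy.dt <- data.table(id   = 1:5,
                     year = c(1849L, 1851L, 1831L, 1840L, 1833L))

setkeyv(toy.dt, "year")   # sorts toy.dt itself; no assignment needed
key(toy.dt)               # shows which column(s) form the key
toy.dt[, year]            # the years are now in ascending order
```

Note that setkeyv() changes the object in place, which is why the sorted data.table is no longer identical to the original data.frame.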



Multiple keys

It is also possible to sort on multiple keys

setkeyv(hmar.dt, c("id_marriage", "idnr"))




Subsetting

groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)

## [1] 2493 29

groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)

## [1] 9590 29




groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]

dim(groom.sig)

## [1] 2358 29

groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]

dim(groom.sig)

## [1] 9985 2
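The pattern above is data.table[rows, columns]: a condition before the comma selects rows, and list() after the comma selects columns. A sketch on an invented toy data.table, assuming the data.table package is installed (the column names age and sig are made up):

```r
library(data.table)

# Toy data.table with invented values
toy.dt <- data.table(id  = 1:6,
                     age = c(24L, 31L, 45L, 28L, 33L, 22L),
                     sig = c("h", "h", "a", "h", "h", "a"))

older        <- toy.dt[age > 30, ]               # one condition on rows
older.signed <- toy.dt[sig == "h" & age > 30, ]  # two conditions combined with &
ids          <- toy.dt[age <= 30, list(id, sig)] # rows and columns at once

dim(older)
dim(older.signed)
dim(ids)
```

Inside the square brackets column names are used without quotes and without the toy.dt$ prefix.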




Creating new variables

Let’s create a variable for the mean age at marriage of grooms

hmar.dt[, mean.gage := mean(age_groom)]

summary(hmar.dt$age_groom)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -2.00   24.00   26.00   28.38   30.00   79.00

summary(hmar.dt$mean.gage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   28.38   28.38   28.38   28.38   28.38   28.38




Another example (from yesterday)

Dummy variable for equal municipality of birth

hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]

summary(hmar.dt$eq_b_loc)

##    Mode   FALSE    TRUE    NA's
## logical    6957    3043       0




Creating variables by group

As we saw, a var with mean age wasn’t really interesting

average age of grooms at marriage by civil status

hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]

table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)

##
##     27.2427939112599 40.8829787234043 42.9548286604361   53
##   1             9263                0                0    0
##   2                0                0              642    0
##   3                0               94                0    0
##   6                0                0                0    1
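The same := with by step on an invented toy data.table makes the arithmetic easy to follow by hand. This is a sketch, assuming the data.table package is installed; all names and values are made up:

```r
library(data.table)

# Toy data.table with invented values: two civil-status groups
toy.dt <- data.table(civilst = c(1L, 1L, 2L, 2L, 1L),
                     age     = c(24, 26, 40, 44, 28))

# Every row receives the mean age of its own group
toy.dt[, age.mean.civ := mean(age), by = civilst]
toy.dt

# One line per group, as a more compact check than table()
unique(toy.dt[, list(civilst, age.mean.civ)])
```

Group 1 holds the ages 24, 26 and 28, so its mean is 26; group 2 holds 40 and 44, so its mean is 42.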




Summary subsets of the data

So far, we added vars to the original data.frame
  can be redundant though

Think of context, say municipalities
  archival material on characteristics, e.g.:
    population
    steam power

You can also make context characteristics by aggregation




mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]

summary(mc)

##  b_loc_groom              V1
##  Length:1184        Min.   :-2.00
##  Class :character   1st Qu.:26.00
##  Mode  :character   Median :28.17
##                     Mean   :29.36
##                     3rd Qu.:31.00
##                     Max.   :69.00




We can improve this by naming the variable directly, and by adding more variables

mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]

summary(mc2)

##  b_loc_groom          mean_gage       mean_bage
##  Length:1184        Min.   :-2.00   Min.   :-2.00
##  Class :character   1st Qu.:26.00   1st Qu.:23.80
##  Mode  :character   Median :28.17   Median :25.88
##                     Mean   :29.36   Mean   :26.53
##                     3rd Qu.:31.00   3rd Qu.:28.00
##                     Max.   :69.00   Max.   :64.00




One more. . . counts

Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it.

Now let’s do it properly:

mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]

# notice the .N ... N is often used for nr. of obs
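What .N does is easiest to see on a handful of invented rows. A sketch, assuming the data.table package is installed; toy.dt and its values are made up:

```r
library(data.table)

# Toy data.table with invented values: some year/age pairs occur twice
toy.dt <- data.table(year = c(1850L, 1850L, 1850L, 1851L, 1851L),
                     age  = c(22L, 22L, 25L, 22L, 22L))

# One row per unique combination of year and age, with its count
counts <- toy.dt[, list(frequency = .N), by = list(year, age)]
counts

# The counts add up to the number of rows in the original data
sum(counts$frequency) == nrow(toy.dt)
```

Each overlapping point in a scatter plot thus becomes a single row with a frequency, which can then be mapped to colour or size.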

library(ggplot2)




Using colour

ggplot(mc3, aes(x = m_year, y = age_bride)) +
    geom_point(aes(colour = frequency),
               size = 10, shape = 18) +
    theme_bw()

[Figure: scatter plot of age_bride (y-axis) against m_year (x-axis); point colour shows frequency]




Using size

ggplot(mc3, aes(x = m_year, y = age_bride)) +
    geom_point(aes(size = frequency),
               colour = "blue", shape = 18) +
    theme_bw()

[Figure: scatter plot of age_bride (y-axis) against m_year (x-axis); point size shows frequency]




Box and whisker plot

Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3 - Q1
Outliers (Tukey, 1977):
  x < Q1 - 1.5*IQR
  x > Q3 + 1.5*IQR
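The outlier rule can be worked through by hand in base R. A sketch on an invented vector of ages (made-up values, not the HSN data); the quartiles here come from quantile(), while boxplot() uses hinges, so its boundaries may differ slightly on other data:

```r
# Invented ages, with one suspiciously high value
ages <- c(19, 21, 22, 23, 24, 25, 26, 27, 29, 31, 58)

q1  <- quantile(ages, 0.25, names = FALSE)   # 1st quartile
q3  <- quantile(ages, 0.75, names = FALSE)   # 3rd quartile
iqr <- q3 - q1                               # interquartile range

lower <- q1 - 1.5 * iqr   # values below this are outliers
upper <- q3 + 1.5 * iqr   # values above this are outliers

outliers <- ages[ages < lower | ages > upper]
outliers
```

Here Q1 is 22.5 and Q3 is 28, so the IQR is 5.5 and the boundaries are 14.25 and 36.25; only the age of 58 falls outside them.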




boxplot(hmar.dt$age_bride,
        ylab = "Age")

[Figure: box plot of age_bride; y-axis: Age]




hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
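Once implausible values are recoded to NA, functions such as mean() return NA unless told to skip missing values. A base R sketch on an invented vector (made-up values, not the HSN data):

```r
# Invented ages, including two implausible values (13 and -2)
ages <- c(25, 13, 31, -2, 22)

# Recode everything below 14 to missing
ages[ages < 14] <- NA

sum(is.na(ages))           # how many values are now missing
mean(ages)                 # NA, because missing values are present
mean(ages, na.rm = TRUE)   # mean of the remaining values: 26
```

The na.rm = TRUE argument is also available in min(), max() and similar functions.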




boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))

[Figure: box plots of age_bride for two groups: "not signed" and "signed"]

