Introduction into R for historians (part 4: data manipulation)

Data manipulation in R

Richard L. Zijdeman

May 29, 2015



1 Recap

2 Data manipulation

3 data.table package

4 Basic statistical techniques


Recap


What we've seen so far

- functions to read in data: read.csv(), read.xlsx()
- objects
  - assignment: <-
  - characteristics, e.g.: str(), summary(), head(), tail()
- calculus: mean(), min(), max()
- plotting: plot(), ggplot()
  - paint by 'layer'


Before we go on...

Structure your R script
- Filename, Date, Purpose, Author, Last change
- Use comments to tell what you are doing
  - read in data
  - changing variables (why did you do it)
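A script header along these lines could look as follows. This is only a sketch: the file name, the dates and the toy data are made up for illustration and are not part of the course material.

```r
# Filename:    04_data_manipulation.R   (hypothetical name)
# Date:        2015-05-29
# Purpose:     read in marriage records and clean the variables
# Author:      (your name)
# Last change: 2015-05-29

# read in data (a toy data frame here, so the sketch runs anywhere)
marriages <- data.frame(M_YEAR = c(1849, 1851), AGE_GROOM = c(27, -2))

# changing variables: lower-case names, because mixed case invites typos
names(marriages) <- tolower(names(marriages))

# changing variables: a negative age is impossible, so treat it as missing
marriages$age_groom[marriages$age_groom < 0] <- NA
```

The comments say why a variable was changed, not just that it was changed.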


Create a working directory, with subdirs

+ documents
+ data
  - source
  - derived
+ analysis
+ figures


Set a working directory
- setwd(), getwd()
- use relative paths to save things
  - "./" = current directory
  - "./../" = one folder up

Read J. Scott Long's "Workflow"
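To see relative paths at work, here is a small sketch. It assumes a hypothetical project in the temporary directory with an analysis folder and a data/derived folder; the file toy.csv is invented for the example.

```r
old_wd <- getwd()                       # remember where we started

# hypothetical project layout in a temporary folder
root <- file.path(tempdir(), "wd_demo")
dir.create(file.path(root, "analysis"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data", "derived"),
           recursive = TRUE, showWarnings = FALSE)

setwd(file.path(root, "analysis"))      # work from the analysis folder

# "./../" climbs one folder up, from analysis/ to the project root
write.csv(data.frame(x = 1:3), "./../data/derived/toy.csv",
          row.names = FALSE)
toy <- read.csv("./../data/derived/toy.csv")

setwd(old_wd)                           # restore the original directory
```

Because the paths are relative, the script keeps working when the whole project folder is moved or shared.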


Data manipulation


Assignment and Indexing

First, we’ll read in the HSN marriages again

hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 10000)


Change case of text

tolower()
toupper()

tolower("CaN we pleASe jUSt have LOWER cases?")

## [1] "can we please just have lower cases?"

names(hmar) <- tolower(names(hmar))
names(hmar)

##  [1] "id_marriage"   "idnr"          "m_loc"         "m_year"
##  [5] "sex_hsnrp"     "age_groom"     "occ_groom"     "civilst_groom"
##  [9] "sign_groom"    "b_loc_groom"   "l_loc_groom"   "age_bride"
## [13] "occ_bride"     "civilst_bride" "sign_bride"    "b_loc_bride"
## [17] "l_loc_bride"   "a_f_groom"     "occ_f_groom"   "sign_f_groom"
## [21] "a_m_groom"     "occ_m_groom"   "sign_m_groom"  "a_f_bride"
## [25] "occ_f_bride"   "sign_f_bride"  "a_m_bride"     "occ_m_bride"
## [29] "sign_m_bride"


Indexing

There were way too many names to print on a slide... How many names are there actually?


Use the length() command to find out:

length(names(hmar))

## [1] 29

So let’s print just the first two:

names(hmar)[1:2]

## [1] "id_marriage" "idnr"

The technique using square brackets is called indexing


Any idea how we would show the last two names?


x <- length(names(hmar))
names(hmar)[(x-1):x]

## [1] "occ_m_bride"  "sign_m_bride"

Using concatenate, c(), we could also extract various names

names(hmar)[c(1, 3, 5)]

## [1] "id_marriage" "m_loc"       "sex_hsnrp"
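The indexing variants so far can be tried on a small vector. A sketch, using a made-up vector `nms` that stands in for names(hmar); the negative index and tail() are additions not shown on the slides.

```r
# toy character vector standing in for names(hmar)
nms <- c("id_marriage", "idnr", "m_loc", "m_year", "sex_hsnrp")

n <- length(nms)             # number of elements

first_two <- nms[1:2]        # a range of positions
last_two  <- nms[(n - 1):n]  # mind the brackets: n - 1:n means something else
picked    <- nms[c(1, 3, 5)] # arbitrary positions via c()
dropped   <- nms[-1]         # negative index: everything except position 1

# tail() with n = 2 gives the same result as the manual "last two" index
identical(last_two, tail(nms, 2))
```

Note that length() needs the vector itself, so length(names(hmar)) and not length(names).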


We can also apply indexing to a data.frame:

hmar[1:2, 1:3]

##   id_marriage idnr              m_loc
## 1           1 1001 Abcoude-Baambrugge
## 2           2 1005              Baarn

# shows the first 2 rows and first 3 columns
# so, in general: data.frame[rows, columns]
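The [rows, columns] pattern also accepts column names instead of positions. A small sketch on an invented data frame (the values are not from the HSN data):

```r
# toy data frame (hypothetical values)
df <- data.frame(id    = 1:4,
                 year  = c(1849, 1851, 1864, 1840),
                 place = c("Abcoude", "Baarn", "Utrecht", "Zeist"),
                 stringsAsFactors = FALSE)

top_left <- df[1:2, 1:2]               # first 2 rows, first 2 columns
one_col  <- df[, "year"]               # leave rows empty: all rows, one column
by_name  <- df[3, c("id", "place")]    # row 3, two columns selected by name
```

Selecting a single column this way returns a plain vector rather than a data.frame.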


head() and tail()

So actually, you should now be able to replace head() and tail()

How?


# head()
hmar[1:6, ]

# tail(): the last six rows run from y-5 to y
y <- nrow(hmar)
hmar[(y-5):y, ]
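A quick way to convince yourself is to compare the manual index with head() and tail() on a small table. A sketch with an invented 10-row data frame:

```r
# toy data frame with 10 rows (hypothetical data)
df <- data.frame(id = 1:10, year = 1841:1850)

y <- nrow(df)

# head() returns the first six rows by default ...
same_head <- isTRUE(all.equal(head(df), df[1:6, ]))

# ... and tail() the last six: rows y-5 up to y (y-6 would give seven rows)
same_tail <- isTRUE(all.equal(tail(df), df[(y - 5):y, ]))

c(same_head, same_tail)
```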


data.table package


Developed by Matt Dowle
Website: https://github.com/Rdatatable/data.table/wiki
Why data.table?
- fast subsetting on large files
- more consistent 'grammar'
- less typing


install.packages("data.table")

library(data.table)


Class: data.table

For data.table functions to work, we need to convert the data.frame to class data.table

is.data.table(hmar)

## [1] FALSE

hmar.dt <- data.table(hmar)
is.data.table(hmar.dt)

## [1] TRUE

is.data.frame(hmar.dt)

## [1] TRUE
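The same check on a toy table, as a sketch. It uses as.data.table(), which is an alternative to the data.table() call on the slide; the data are invented, and the data.table package is assumed to be installed.

```r
library(data.table)

# toy data frame (hypothetical values)
df <- data.frame(id = 1:3, m_year = c(1849, 1851, 1864))

# as.data.table() returns a converted copy; df itself stays a data.frame
dt <- as.data.table(df)

class(df)   # data.frame only
class(dt)   # both data.table and data.frame
```

Because a data.table also carries the class data.frame, functions written for data.frames generally accept it as well.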


Friends with benefits

Data.frame and data.table are like 'friends with benefits'

all.equal(hmar, hmar.dt)

## [1] "Attributes: < Names: 2 string mismatches >"
## [2] "Attributes: < Length mismatch: comparison on first 2 components >"
## [3] "Attributes: < Component 1: Modes: character, externalptr >"
## [4] "Attributes: < Component 1: target is character, current is externalptr >"
## [5] "Attributes: < Component 2: Modes: numeric, character >"
## [6] "Attributes: < Component 2: Lengths: 10000, 2 >"
## [7] "Attributes: < Component 2: target is numeric, current is character >"

# so we have all the benefits of a data.frame
# ... and additional benefits of data.table

NB: the next series of commands will only work for data.tables


Sort with setkey

Often we want to sort our data. We can do so with setkey(), or with setkeyv() when the column names are given as character strings

hmar.dt[1:6, m_year]

## [1] 1849 1851 1864 1840 1843 1858

# note for data.frame hmar it would be:
# hmar[1:6, "m_year"]
setkeyv(hmar.dt, "m_year")
hmar.dt[1:6, m_year]

## [1] 1831 1831 1833 1833 1834 1834

identical(hmar.dt, hmar)

## [1] FALSE


Multiple keys

It is also possible to sort on multiple keys

setkeyv(hmar.dt, c("id_marriage", "idnr"))
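Sorting with one and with two keys can be tried on a toy table. A sketch with invented values, assuming the data.table package is installed; key() is used only to inspect which keys are set.

```r
library(data.table)

# toy table (hypothetical values)
dt <- data.table(idnr   = c(3, 1, 2, 1),
                 m_year = c(1851, 1849, 1849, 1840))

# setkey() takes unquoted column names and sorts the table in place
setkey(dt, m_year)
key(dt)

# setkeyv() takes a character vector, handy for several keys at once:
# sort on m_year first, then on idnr within the same year
setkeyv(dt, c("m_year", "idnr"))
key(dt)
dt
```

Both functions change `dt` itself, so there is no need to assign the result to a new object.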


Subsetting

groom.sig <- hmar.dt[age_groom > 30, ]
dim(groom.sig)

## [1] 2493 29

groom.sig <- hmar.dt[sign_groom == "h", ]
dim(groom.sig)

## [1] 9590 29


groom.sig <- hmar.dt[sign_groom == "h" &
                     age_groom > 30, ]
dim(groom.sig)

## [1] 2358 29

groom.sig <- hmar.dt[m_year != 1840,
                     list(id_marriage, idnr)]
dim(groom.sig)

## [1] 9985 2
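The subsetting pattern is dt[rows to keep, columns to keep]. A sketch on a toy table, assuming the data.table package is installed; the values are invented, and "x" is a hypothetical code for "did not sign".

```r
library(data.table)

# toy table (hypothetical values; "x" = hypothetical code for not signed)
dt <- data.table(id_marriage = 1:5,
                 m_year      = c(1840, 1849, 1851, 1840, 1864),
                 age_groom   = c(25, 34, 41, 29, 31),
                 sign_groom  = c("h", "h", "x", "h", "h"))

# rows only: grooms older than 30
older <- dt[age_groom > 30]

# two conditions combined with & (both must hold)
older_signed <- dt[sign_groom == "h" & age_groom > 30]

# rows and columns: drop marriages from 1840, keep two columns
not_1840 <- dt[m_year != 1840, list(id_marriage, m_year)]

dim(older); dim(older_signed); dim(not_1840)
```

Inside the square brackets the column names can be used directly, without the `dt$` prefix.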


Creating new variables

Let's create a variable for the mean age at marriage of grooms

hmar.dt[, mean.gage := mean(age_groom)]

summary(hmar.dt$age_groom)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -2.00   24.00   26.00   28.38   30.00   79.00

summary(hmar.dt$mean.gage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   28.38   28.38   28.38   28.38   28.38   28.38
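The minimum of -2 suggests that some ages are codes rather than real ages, and such values pull the mean down. A sketch on toy data, under the assumption that negative ages are missing-value codes (the slides do not say so explicitly); the data.table package is assumed to be installed.

```r
library(data.table)

# toy table (hypothetical values; -2 assumed to be a missing-value code)
dt <- data.table(age_groom = c(24, 30, -2, 28))

# := adds a column in place; every row gets the same overall mean
dt[, mean.gage := mean(age_groom)]

# recode the impossible age to NA, then average the remaining values
dt[age_groom < 0, age_groom := NA]
dt[, mean.gage.cln := mean(age_groom, na.rm = TRUE)]

dt
```

The column name mean.gage.cln is made up for this example; na.rm = TRUE tells mean() to skip the missing values.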


Another example (from yesterday)

Dummy variable for equal municipality of birth

hmar.dt[, eq_b_loc := (b_loc_groom == b_loc_bride)]

summary(hmar.dt$eq_b_loc)

##    Mode   FALSE    TRUE    NA's
## logical    6957    3043       0


Creating variables by group

As we saw, a var with mean age wasn't really interesting

average age of grooms at marriage by civil status

hmar.dt[, gage.mean.civ := mean(age_groom),
        by = civilst_groom]

table(hmar.dt$civilst_groom, hmar.dt$gage.mean.civ)

##
##     27.2427939112599 40.8829787234043 42.9548286604361   53
##   1             9263                0                0    0
##   2                0                0              642    0
##   3                0               94                0    0
##   6                0                0                0    1
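The effect of by = is easiest to see on a handful of rows. A sketch with invented values, assuming the data.table package is installed:

```r
library(data.table)

# toy table (hypothetical values)
dt <- data.table(civilst_groom = c(1, 1, 2, 2, 3),
                 age_groom     = c(24, 28, 40, 44, 39))

# the mean is computed within each civil status and
# written back to every row of that group
dt[, gage.mean.civ := mean(age_groom), by = civilst_groom]

# one line per group, to inspect the result
unique(dt[, list(civilst_groom, gage.mean.civ)])
```

The table keeps all five rows; only the new column differs between groups.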


Summary subsets of the data

- So far, added vars to original data.frame
  - can be redundant though
- Think of context, say municipalities
  - archival material on characteristics, e.g.:
    - population
    - steam power
- You can also make context characteristics by aggregation


mc <- hmar.dt[, mean(age_groom), by = b_loc_groom]

summary(mc)

##  b_loc_groom              V1
##  Length:1184        Min.   :-2.00
##  Class :character   1st Qu.:26.00
##  Mode  :character   Median :28.17
##                     Mean   :29.36
##                     3rd Qu.:31.00
##                     Max.   :69.00


We can improve by naming the variable directly, and adding more variables

mc2 <- hmar.dt[, list(mean_gage = mean(age_groom),
                      mean_bage = mean(age_bride)),
               by = b_loc_groom]

summary(mc2)

##  b_loc_groom          mean_gage       mean_bage
##  Length:1184        Min.   :-2.00   Min.   :-2.00
##  Class :character   1st Qu.:26.00   1st Qu.:23.80
##  Mode  :character   Median :28.17   Median :25.88
##                     Mean   :29.36   Mean   :26.53
##                     3rd Qu.:31.00   3rd Qu.:28.00
##                     Max.   :69.00   Max.   :64.00
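Unlike :=, which adds a column to the existing table, list() inside the brackets returns a new and smaller table with one row per group. A sketch with invented values, assuming the data.table package is installed:

```r
library(data.table)

# toy table (hypothetical values)
dt <- data.table(b_loc_groom = c("Baarn", "Baarn", "Utrecht"),
                 age_groom   = c(24, 30, 40),
                 age_bride   = c(22, 26, 35))

# aggregate: one row per birth place, with named summary columns
mc <- dt[, list(mean_gage = mean(age_groom),
                mean_bage = mean(age_bride)),
         by = b_loc_groom]

mc
```

The original table `dt` is left untouched, so the aggregated table can be kept as a separate 'context' data set.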


One more... counts

Yesterday, we talked about the problem of overlapping points. We used geom_jitter to solve it.

Now let's do it properly:

mc3 <- hmar.dt[, list(frequency = .N),
               by = list(m_year, age_bride)]

# notice the .N ... N is often used for nr. of obs

library(ggplot2)
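What .N does can be checked by hand on a few rows. A sketch with invented values, assuming the data.table package is installed:

```r
library(data.table)

# toy table: three marriages in 1849 (two brides aged 22), one in 1851
dt <- data.table(m_year    = c(1849, 1849, 1849, 1851),
                 age_bride = c(22, 22, 25, 22))

# .N is the number of rows in each combination of year and age
counts <- dt[, list(frequency = .N), by = list(m_year, age_bride)]

counts
```

Each overlapping point in the scatter plot becomes a single row with its own frequency, which can then be mapped to colour or size.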


Using colour

ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(colour = frequency),
             size = 10, shape = 18) +
  theme_bw()

[Figure: scatter plot of age_bride (y-axis, 0 to 60) against m_year (x-axis, 1850 to 2000); point colour shows frequency (legend 10 to 30)]


Using size

ggplot(mc3, aes(x = m_year, y = age_bride)) +
  geom_point(aes(size = frequency),
             colour = "blue", shape = 18) +
  theme_bw()

[Figure: scatter plot of age_bride (y-axis, 0 to 60) against m_year (x-axis, 1850 to 2000); point size shows frequency (legend 10 to 30)]


Basic statistical techniques


Box and whisker plot

- Distribution of data
- Median: 50% of the cases above and below
- Box: 1st and 3rd quartile
- Interquartile range (IQR): Q3-Q1
- Outliers (Tukey, 1977):
  - x < Q1 - 1.5*IQR
  - x > Q3 + 1.5*IQR
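The outlier rule can be worked out by hand with quantile(). A sketch on ten invented ages; note that boxplot() itself uses hinges, which can differ slightly from the quartiles returned by quantile().

```r
# toy ages (hypothetical values)
ages <- c(19, 21, 22, 23, 24, 25, 26, 28, 31, 64)

# 1st and 3rd quartile, and the interquartile range
q   <- quantile(ages, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey's fences
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# values outside the fences count as outliers
outliers <- ages[ages < lower | ages > upper]
outliers
```

Here Q1 = 22.25 and Q3 = 27.5, so the fences lie at 14.375 and 35.375 and only the age of 64 falls outside.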


boxplot(hmar.dt$age_bride,
        ylab = "Age")

[Figure: box plot of age_bride; y-axis "Age", 0 to 60]


hmar.dt[, sign.bride.cln := sign_bride == "h"]
hmar.dt[age_bride < 14, age_bride := NA]
# NB: no missing values here, but mind this when recoding!
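Why mind missing values when recoding? A comparison with NA is itself NA, so a recoded dummy silently inherits the gaps. A base R sketch with invented values ("x" is a hypothetical code for "did not sign"):

```r
# toy vector (hypothetical values; "x" = hypothetical code for not signed)
sign_bride <- c("h", "x", NA)

# comparing NA with "h" gives NA, not FALSE
signed <- sign_bride == "h"
signed

# counts: NA propagates unless it is removed explicitly
sum(signed)                  # NA
sum(signed, na.rm = TRUE)    # counts only the TRUE values

# show the missing category in a table as well
table(signed, useNA = "ifany")
```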


boxplot(hmar.dt$age_bride ~ hmar.dt$sign.bride.cln,
        names = c("not signed", "signed"),
        col = c("red", "green"))

[Figure: box plots of age_bride for brides who did not sign ("not signed") and who signed ("signed"); y-axis 20 to 70]
