Introduction into R for historians (part 3: examine and import data)


Upload: richard-zijdeman

Post on 09-Jan-2017


Page 1: Introduction into R for historians (part 3: examine and import data)

Recap | Getting data in R

Do it yourself! | Plotting using ggplot2

Examining data and importing data in R

Richard L. Zijdeman

May 29, 2015

Richard L. Zijdeman Examining data and importing data in R

Page 2: Introduction into R for historians (part 3: examine and import data)


1 Recap

2 Getting data in R

3 Do it yourself!

4 Plotting using ggplot2

Page 3: Introduction into R for historians (part 3: examine and import data)

Recap

Page 4: Introduction into R for historians (part 3: examine and import data)

The structure of objects

Store just about anything in R: numbers, sentences, datasets
Objects
Study the structure of objects: str()
    type of object
    features of object

ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

Page 5: Introduction into R for historians (part 3: examine and import data)

Study the structure of object “ships”

str(ships)

## 'data.frame': 4 obs. of 3 variables:
##  $ year    : num 1850 1860 1870 1880
##  $ inbound : num 215 237 237 NA
##  $ outbound: num 212 239 260 265

Page 6: Introduction into R for historians (part 3: examine and import data)

Characteristics of objects

Class: class()
Length: length()
Dimensions: dim()

class(ships)

## [1] "data.frame"

length(ships) # for a data.frame: the number of columns (variables)

## [1] 3

dim(ships) # rows, columns

## [1] 4 3

Page 7: Introduction into R for historians (part 3: examine and import data)


Closer inspection of data.frames

names of columns (variables): names()
top/bottom rows: head(), tail()
missing data: is.na()

names(ships)

## [1] "year" "inbound" "outbound"

is.na(ships)

##       year inbound outbound
## [1,] FALSE   FALSE    FALSE
## [2,] FALSE   FALSE    FALSE
## [3,] FALSE   FALSE    FALSE
## [4,] FALSE    TRUE    FALSE
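
head() and tail() are listed above but not shown. A minimal sketch on the same ships data.frame; the object names first_rows, last_rows and year_missing_inbound are only illustrative:

```r
# The ships data.frame from the recap slides
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

first_rows <- head(ships, n = 2)  # top two rows
last_rows  <- tail(ships, n = 2)  # bottom two rows

# is.na() on a single column gives a logical vector,
# which can be used to look up the year with a missing inbound count
year_missing_inbound <- ships$year[is.na(ships$inbound)]

print(first_rows)
print(last_rows)
print(year_missing_inbound)
```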

Page 8: Introduction into R for historians (part 3: examine and import data)

Summarizing data in data.frames

descriptive statistics: summary()
calculations: e.g. min(), mean(), sum()
results in table format: table()

summary(ships)

##       year         inbound         outbound
##  Min.   :1850   Min.   :215.0   Min.   :212.0
##  1st Qu.:1858   1st Qu.:226.0   1st Qu.:232.2
##  Median :1865   Median :237.0   Median :249.5
##  Mean   :1865   Mean   :229.7   Mean   :244.0
##  3rd Qu.:1872   3rd Qu.:237.0   3rd Qu.:261.2
##  Max.   :1880   Max.   :237.0   Max.   :265.0
##                 NA's   :1
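
summary() skips the missing value by itself, but min(), mean() and sum() do not. A small sketch of the difference, again on the ships data.frame; the result names are only illustrative:

```r
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# With a missing value in the column, the result is itself missing
mean_with_na <- mean(ships$inbound)

# na.rm = TRUE drops missing values before calculating
mean_inbound <- mean(ships$inbound, na.rm = TRUE)

# Columns without missing values need no extra argument
min_outbound   <- min(ships$outbound)
total_outbound <- sum(ships$outbound)

print(mean_with_na)
print(mean_inbound)
print(min_outbound)
print(total_outbound)
```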

Page 9: Introduction into R for historians (part 3: examine and import data)

is.na(ships)

##       year inbound outbound
## [1,] FALSE   FALSE    FALSE
## [2,] FALSE   FALSE    FALSE
## [3,] FALSE   FALSE    FALSE
## [4,] FALSE    TRUE    FALSE

table(is.na(ships))

##
## FALSE  TRUE
##    11     1
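
table(is.na(ships)) counts missing cells for the whole data.frame. To see where they are, one option (a sketch, not the only way) is to count per column with colSums() and to keep only complete rows with complete.cases():

```r
ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# is.na() returns a logical matrix; summing each column counts the TRUEs
missing_per_column <- colSums(is.na(ships))

# complete.cases() is TRUE for rows without any missing value
complete_rows <- ships[complete.cases(ships), ]

print(missing_per_column)
print(complete_rows)
```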

Page 10: Introduction into R for historians (part 3: examine and import data)

Visualizing your data

Not just for analyses!
Data quality
    representativeness
    missing data

Page 11: Introduction into R for historians (part 3: examine and import data)

plot(ships)

[Figure: scatterplot matrix of the variables year, inbound and outbound]

Page 12: Introduction into R for historians (part 3: examine and import data)

Getting data in R

Page 13: Introduction into R for historians (part 3: examine and import data)

Data already in R

The “datasets” package
very slim datasets
specific example data

To obtain a list of datasets, type:

library(help = "datasets")

To obtain information on a specific dataset, type:

help(swiss) # thus: help(name_of_dataset)

or to just see the data:

swiss
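
The inspection functions from the recap work on these built-in datasets as well. A short sketch using swiss, which ships with R in the datasets package; the object names are only illustrative:

```r
# Load the built-in dataset explicitly into the workspace
data(swiss)

swiss_dim   <- dim(swiss)        # rows, columns
swiss_names <- names(swiss)      # variable names
swiss_top   <- head(swiss, n = 3) # first three rows

print(swiss_dim)
print(swiss_names)
print(swiss_top)
```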

Page 14: Introduction into R for historians (part 3: examine and import data)

Reading in data

Different functions for different files:
Base R: read.table() (read.csv())
foreign package: read.spss(), read.dta(), read.dbf()
openxlsx package: read.xlsx()
alternative packages:
    xlsx (Java required)
    gdata (perl-based)

Page 15: Introduction into R for historians (part 3: examine and import data)

read.xlsx() from openxlsx package

file: your file, including directory
sheet: name of sheet
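
A self-contained sketch of read.xlsx(), assuming the openxlsx package is installed. Because no Excel file comes with these slides, the example first writes the ships data.frame to a temporary file; the sheet name "ships" and the object names are made up for illustration. The file is passed as the first argument, positionally.

```r
library(openxlsx)

ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

# Create a small workbook to read back in (stand-in for your own file)
xlsx_path <- tempfile(fileext = ".xlsx")
wb <- createWorkbook()
addWorksheet(wb, "ships")
writeData(wb, sheet = "ships", x = ships)
saveWorkbook(wb, xlsx_path, overwrite = TRUE)

# file: path including directory; sheet: name of the sheet
ships_back <- read.xlsx(xlsx_path, sheet = "ships")

str(ships_back)
```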

Page 16: Introduction into R for historians (part 3: examine and import data)

read.csv()

file: your file, including directory
header: variable names or not?
sep: separator
    read.csv default: “,”
    read.csv2 default: “;”
skip: number of rows to skip
nrows: total number of rows to read
stringsAsFactors
encoding (e.g. “latin1” or “UTF-8”)
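
The arguments above can be tried out without an external file. The sketch below writes a small semicolon-separated file to a temporary location (its contents are invented for illustration, based on the ships example) and reads it back in two ways:

```r
# Write a small example file: one note line, a header, four data rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c("note: hypothetical export of the ships example",
             "year;inbound;outbound",
             "1850;215;212",
             "1860;237;239",
             "1870;237;260",
             "1880;;265"),
           csv_path)

# read.csv() expects "," by default, so sep must be set for this file
ships_csv <- read.csv(csv_path,
                      header = TRUE,            # first line read holds variable names
                      sep = ";",                # separator used in the file
                      skip = 1,                 # skip the note line
                      nrows = 3,                # read only three data rows
                      stringsAsFactors = FALSE) # keep text as character

# read.csv2() uses ";" as its default separator
ships_all <- read.csv2(csv_path, skip = 1)

print(ships_csv)
print(ships_all)
```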

Page 17: Introduction into R for historians (part 3: examine and import data)

Do it yourself!

Page 18: Introduction into R for historians (part 3: examine and import data)

Read in the following files as data.frames:

HSN_basic.xlsx
check the data.frame: using dim(), length()
check the variables: using summary(), min(), table()
Repeat for HSN_marriages.csv:
    read in only 100 lines

Page 19: Introduction into R for historians (part 3: examine and import data)

Plotting using ggplot2

Page 20: Introduction into R for historians (part 3: examine and import data)

ggplot2

Package by Hadley Wickham
Generic plotting for a great range of plots
ggplot2 website: http://ggplot2.org
excellent tutorial: https://jofrhwld.github.io/avml2012/#Section_1.1

Page 21: Introduction into R for historians (part 3: examine and import data)

Building your graph

Each plot consists of multiple layers
Think of a canvas on which you ‘paint’
    data layer
    geometries layer
    statistics layer

Page 22: Introduction into R for historians (part 3: examine and import data)

Data layer

data.frame and aesthetics

ggplot(data.frame, aes(x= ..., y = ...))

geometries layer

ggplot(..., aes(x = ..., y = ...)) +
  geom_...() # e.g. geom_line

statistics layer

ggplot(..., aes(x = ..., y = ...)) +
  geom_...() +
  stat_...() # e.g. stat_smooth
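
The three layers can be tried on the small ships data.frame from the recap. This is only a sketch, assuming ggplot2 is installed; the choice of geom_point() and a linear stat_smooth() is for illustration:

```r
library(ggplot2)

ships <- data.frame(year = c(1850, 1860, 1870, 1880),
                    inbound = c(215, 237, 237, NA),
                    outbound = c(212, 239, 260, 265))

p <- ggplot(ships, aes(x = year, y = outbound)) + # data layer: data.frame and aesthetics
  geom_point() +                                  # geometries layer: draw points
  stat_smooth(method = "lm", se = FALSE)          # statistics layer: fitted straight line

print(p)
```

Storing the plot in an object (here p) makes it easy to add further layers step by step.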

Page 23: Introduction into R for historians (part 3: examine and import data)

An example

Reading in the data

hmar <- read.csv("./../data/derived/HSN_marriages.csv",
                 stringsAsFactors = FALSE,
                 encoding = "latin1",
                 header = TRUE,
                 nrows = 100)

Page 24: Introduction into R for historians (part 3: examine and import data)

Plotting the data

install.packages("ggplot2") # package name in quotes; only needed once
library(ggplot2)
ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point()

Page 25: Introduction into R for historians (part 3: examine and import data)

[Figure: scatter plot of Age_bride (y-axis, 20 to 50) against M_year (x-axis, 1830 to 1870)]

Page 26: Introduction into R for historians (part 3: examine and import data)

Improving the plot

Specify characteristics of the geom_layer

ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(colour = "blue", size = 3, shape = 18)

See http://www.cookbook-r.com/Graphs/Shapes_and_line_types/

Page 27: Introduction into R for historians (part 3: examine and import data)

Specify characteristics of the geom_layer

[Figure: the same scatter plot of Age_bride against M_year, now drawn with the point characteristics specified above]

Page 28: Introduction into R for historians (part 3: examine and import data)

A PTE example

Does age at marriage depend on educational attainment?
To marry you need resources
    the more attainment, the longer it takes to acquire resources
    ergo: brides with educational attainment marry later in life
Not a statistical test, but let’s graph this
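
Alongside a graph, a quick descriptive comparison per group can help. The sketch below uses a made-up toy data.frame: the values are invented for illustration and are not taken from the HSN data. Only the column names Age_bride and SIgn_bride and the values "a" and "h" follow the slides.

```r
# Toy data, invented for illustration only
toy <- data.frame(Age_bride = c(22, 25, 31, 24, 28, 35),
                  SIgn_bride = c("a", "a", "a", "h", "h", "h"),
                  stringsAsFactors = FALSE)

# Number of brides per group
group_sizes <- table(toy$SIgn_bride)

# Mean age at marriage per group
mean_age <- tapply(toy$Age_bride, toy$SIgn_bride, mean)

print(group_sizes)
print(mean_age)
```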

Page 29: Introduction into R for historians (part 3: examine and import data)

A request from yesterday

Can I plot labels?

ggplot(hmar, aes(x = M_year, y = Age_bride,
                 label = SIgn_bride)) +
  geom_text()

Page 30: Introduction into R for historians (part 3: examine and import data)

Yes you can!

Not really useful though...

[Figure: Age_bride (y-axis, 20 to 50) against M_year (x-axis, 1830 to 1870), each observation drawn as its SIgn_bride label, "a" or "h"]

Page 31: Introduction into R for historians (part 3: examine and import data)

Let’s try with colours...

ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(aes(colour = factor(SIgn_bride)),
             size = 3, shape = 18)
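
Note the difference with the earlier blue points: a colour given outside aes() is a fixed setting, while a colour given inside aes() is mapped to a variable and gets a legend. A sketch on made-up toy data (values invented for illustration; ggplot2 assumed to be installed):

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(M_year = c(1835, 1842, 1850, 1858, 1866, 1870),
                  Age_bride = c(22, 25, 31, 24, 28, 35),
                  SIgn_bride = c("a", "a", "a", "h", "h", "h"))

# Setting: every point gets the same colour, no legend
p_fixed <- ggplot(toy, aes(x = M_year, y = Age_bride)) +
  geom_point(colour = "blue", size = 3, shape = 18)

# Mapping: colour depends on the value of SIgn_bride, legend added
p_mapped <- ggplot(toy, aes(x = M_year, y = Age_bride)) +
  geom_point(aes(colour = factor(SIgn_bride)), size = 3, shape = 18)

print(p_fixed)
print(p_mapped)
```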

Page 32: Introduction into R for historians (part 3: examine and import data)

[Figure: scatter plot of Age_bride against M_year, points coloured by factor(SIgn_bride), legend values "a" and "h"]

No real pattern, though...

Page 33: Introduction into R for historians (part 3: examine and import data)

Finalizing the graph

ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_point(aes(colour = factor(SIgn_bride)),
             size = 3,
             shape = 18) +
  labs(title = "Age of marriage over time",
       x = "time (years since A.D.)",
       y = "age of bride (years)",
       colour = "Signature")

# labels are passed as named arguments (the list() wrapper is not needed)
# here we use colour since legend shows colour

Page 34: Introduction into R for historians (part 3: examine and import data)

[Figure: "Age of marriage over time"; x-axis: time (years since A.D.), 1830 to 1870; y-axis: age of bride (years), 20 to 50; legend "Signature" with values "a" and "h"]

Page 35: Introduction into R for historians (part 3: examine and import data)

Satisfied?

Page 36: Introduction into R for historians (part 3: examine and import data)

Actually not... the points are plotted on top of each other...

Solution: geom_jitter

ggplot(hmar, aes(x = M_year, y = Age_bride)) +
  geom_jitter(aes(colour = factor(SIgn_bride)),
              size = 3,
              shape = 18) +
  labs(title = "Age of marriage over time",
       x = "time (years since A.D.)",
       y = "age of bride (years)",
       colour = "Signature")

# labels are passed as named arguments (the list() wrapper is not needed)
# here we use colour since legend shows colour
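
geom_jitter() moves each point by a small random amount, so overlapping points become visible. The amount can be controlled with its width and height arguments. A sketch on made-up toy data with deliberately identical points (values invented for illustration; the 0.5 settings are an arbitrary choice):

```r
library(ggplot2)

# Toy data, invented for illustration: three identical points per year
toy <- data.frame(M_year = rep(c(1840, 1850), each = 3),
                  Age_bride = c(24, 24, 24, 30, 30, 30),
                  SIgn_bride = c("a", "h", "h", "a", "a", "h"))

# Jitter is random; fixing the seed makes the plot reproducible
set.seed(1)

p <- ggplot(toy, aes(x = M_year, y = Age_bride)) +
  geom_jitter(aes(colour = factor(SIgn_bride)),
              width = 0.5,   # maximum horizontal shift, in years
              height = 0.5,  # maximum vertical shift, in years of age
              size = 3,
              shape = 18)

print(p)
```

Keep the jitter small relative to the scale of the data, since the plotted positions no longer match the recorded values exactly.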

Page 37: Introduction into R for historians (part 3: examine and import data)

[Figure: "Age of marriage over time" with jittered points; x-axis: time (years since A.D.), 1830 to 1870; y-axis: age of bride (years), 20 to 50; legend "Signature" with values "a" and "h"]

Page 38: Introduction into R for historians (part 3: examine and import data)

Final remarks on ggplot2

We have just scratched the surface of ggplot2
Build your graph slowly
    start with the basics
    add complexity step-wise

Now it’s your turn!

Page 39: Introduction into R for historians (part 3: examine and import data)

A small PTE project

Look at the variables in the HSN files
Think of a research question
Provide a general mechanism and hypothesis
Plot your results
