Introduction to using R

An Introduction to R Graeme L. Hickey 31st October 2014 Graeme L. Hickey An Introduction to R 31st October 2014 1 / 125

Upload: graeme-hickey

Post on 14-Jun-2015


DESCRIPTION

R is the lingua franca of statistical computing. One of its attractions for users of statistics is that it encompasses an enormous range of modern statistical methods developed by world-leading statistics researchers. In order to exploit its capabilities for data analysis and statistics, a basic understanding of the core functions is required. In this session we will cover all of the preliminaries that are common to all uses of R, with particular focus on the topics of functions and data objects. Statistical methods are not formally covered, although some basic functions will be demonstrated.

TRANSCRIPT


An Introduction to R

Graeme L. Hickey

31st October 2014


Getting ready to use R


Logistics

- Owing to room change, there will unfortunately be less hands-on experience (possibly none)
- I will email the slides to all who registered – no need to make notes
- We will take a short break during


What is R?

Derives from the S language, best known through the proprietary software package S-Plus

“R is a free software programming language and software environment for statistical computing and graphics” Wikipedia (2014)

The “lingua franca of data analysts” The New York Times (2009)

Used worldwide by bioinformaticians, data scientists, high-level statisticians, app developers, . . .


Why use R?

- Keep whole analysis together (data processing, analysis, publication figures, reports)
- Reproducible research
- State of the art statistical methods are wrapped up in ‘R packages’
- It’s free
- It’s cross platform (Windows, OSX, Linux) compatible
- It will be extensively used in EPH / IGH statistical training


Objectives

The primary objective is for you to be able to apply statistical functions available in R to your own data

To achieve this we should be able to:

- Understand the core concepts of R and its syntax
- Be able to read and write data files
- Be able to interrogate a dataset
- Be able to use functions and options
- Be able to write a simple function


How to install R?

- Download and install from: http://www.r-project.org
- Can use in isolation, but combining with an IDE front-end makes life easier when starting out
- Recommend using R Studio: http://www.rstudio.com
- Once both programs are installed, only ever need to run R Studio


R Studio


R Console vs. Script Editor

Would you consider writing your thesis using a typewriter?

- Don’t just use the console – not reproducible!
- Always write analysis as an R Script
- File -> New File -> R Script
- Highlight code and press Ctrl + Enter to execute


R as a calculator


Simple maths

1 + 2

## [1] 3

13 * 17

## [1] 221

((6 * 7) + 4 - 7) / 2^8

## [1] 0.1523438


Routine mathematical functions and constants

exp(3)

## [1] 20.08554

sin(2 * pi) - 1

## [1] -1

atan(1) ^ 2

## [1] 0.6168503

All of these examples use base R functions – we’ll revisit these later


Other ‘numbers’ to look out for

1 / 0

## [1] Inf

-1 / 0

## [1] -Inf

0 / 0

## [1] NaN

If you see these, you have probably done something wrong!
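A minimal sketch of how these values could be detected inside a script, using the base R functions is.finite() and is.nan(). The vector name vals is an illustrative assumption, not something from the slides.

```r
# Hypothetical vector holding the special values alongside an ordinary number
vals <- c(1 / 0, -1 / 0, 0 / 0, 5)

finite_flags <- is.finite(vals)  # TRUE only for ordinary numbers
nan_flags <- is.nan(vals)        # TRUE only for NaN

finite_flags
nan_flags
```

Checks like these are handy before passing results on to a statistical function.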


Data objects


Assignment operator

- We can tell R to remember things so that we can call them later
- We do this using either assignment operators = or <-

x <- 5
x

## [1] 5

x = 5
x

## [1] 5


A warning!

R is case SENSITIVE

This is a common error and can lead to a great deal of pain in tracing bugs!

x # Lowercase

## [1] 5

X # Uppercase

## Error in eval(expr, envir, enclos): object 'X' not found


Another warning!

Names cannot begin with numbers or include spaces

gra.eme_30 <- 5 # OK
30graeme <- 5 # Not allowed!
gra eme <- 5 # Not allowed either!


Task

What do you think the following lines of R code will output at the end?

x <- 2
y <- pi
z <- (sin(y) + x)^2
z

Try it!


Solution

x <- 2
y <- pi
z <- (sin(y) + x)^2
z

## [1] 4


Task

What do you think the following lines of R code will output at the end?

x <- 2
x <- x + 5
x

Try it!


Solution

x <- 2
x <- x + 5
x

## [1] 7


Vectors

We often have more than a single number, which we combine into a vector using the c() function, e.g.

c(184, 162, 145, 200, 178, 154, 172, 142)

## [1] 184 162 145 200 178 154 172 142

heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
heights

## [1] 184 162 145 200 178 154 172 142


Task

What do you think the following lines of R code will output at the end?

heights/10 + 1

Try it!


Solution

heights/10 + 1

## [1] 19.4 17.2 15.5 21.0 18.8 16.4 18.2 15.2


Selection

- We might want to select the 5-th value from a vector
- We use square brackets for this, e.g.

heights[5]

## [1] 178


Task

What do you think the following lines of R code will output at the end?

heights[c(1, 3, 5)]

Try it!


Solution

heights[c(1, 3, 5)]

## [1] 184 145 178
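Two further selection patterns, sketched here as an aside: a range of positions and a negative index. The object names first_three and without_first are illustrative assumptions.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

first_three <- heights[1:3]   # a range of positions
without_first <- heights[-1]  # a negative index drops that position

first_three
without_first
```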


Logic

Boils down to something being TRUE or FALSE

x > y asks: is x greater than y?
x < y asks: is x less than y?
x == y asks: is x equal to y?
x >= y asks: is x greater than or equal to y?
x <= y asks: is x less than or equal to y?


Basic examples

5 < 10

## [1] TRUE

3 > 5

## [1] FALSE

sin(pi) == cos(pi) + 1

## [1] FALSE
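The last comparison is FALSE only because of floating-point rounding: sin(pi) is a tiny non-zero number. One common workaround, sketched here with the base R function all.equal(), is to compare up to a small tolerance. The names a, b, exact and approx are illustrative assumptions.

```r
a <- sin(pi)
b <- cos(pi) + 1

exact <- (a == b)                   # exact comparison of the stored values
approx <- isTRUE(all.equal(a, b))   # comparison up to a small tolerance

exact
approx
```

Wrapping all.equal() in isTRUE() matters, because all.equal() returns a description of the difference rather than FALSE when the values differ.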


Logic

We can combine logical statements, for example

(5 < 10) & (3 < 5)

## [1] TRUE

(1 > 2) | (3 > 4)

## [1] FALSE


Logic & selection

We can use a logical vector to pick out elements of a vector so long as the logical vector and the vector of data are the same length

logic.vec <- c(TRUE, FALSE, TRUE, FALSE, TRUE,
               FALSE, TRUE, FALSE)

heights[logic.vec]

## [1] 184 145 178 172

Task: How do you extract heights greater than 160cm?


Solution

How to extract heights greater than 160cm?

i <- (heights > 160)
i

## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE

heights[i]

## [1] 184 162 200 178 172

And once you understand, simply. . .

heights[heights > 160]

## [1] 184 162 200 178 172
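A logical vector can also be counted or turned into positions. A small sketch using the base R functions which() and sum(); the name tall is an illustrative assumption.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
tall <- heights > 160

tall_positions <- which(tall)  # positions of the TRUE values
n_tall <- sum(tall)            # TRUE counts as 1, FALSE as 0

tall_positions
n_tall
```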


Character data

- Vectors don’t just store numbers
- They can store items of class: integer, numeric, character, date, factor, etc.
- For character data, just put things inside quotation marks

subjects <- c("Bob", "Amy", "Amy", "Bob", "Amy",
              "Bob", "Bob", "Amy", "Amy")

subjects

## [1] "Bob" "Amy" "Amy" "Bob" "Amy" "Bob" "Bob" "Amy" "Amy"


Matrices

These are generalizations of vectors: instead of being one vector, we have multiple columns of vectors:

matrix(heights, nrow = 2)

##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142

matrix(subjects, nrow = 3)

##      [,1]  [,2]  [,3]
## [1,] "Bob" "Bob" "Bob"
## [2,] "Amy" "Amy" "Amy"
## [3,] "Amy" "Bob" "Amy"


Matrices

We can apply the same arithmetic as per vectors

myMat <- matrix(heights, nrow = 2)
myMat

##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142

0.5*myMat + 3

##      [,1]  [,2] [,3] [,4]
## [1,]   95  75.5   92   89
## [2,]   84 103.0   80   74


Matrices

- All entries of a matrix have to be of the same type (e.g. numeric, character, logical); you can’t mix-and-match
- Most useful when doing linear algebra (e.g. PCA, solving systems of equations)
- R often coerces into matrix form when required by functions
- If you want different data types, need to use objects called data.frames
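To illustrate the single-type rule, here is a small sketch of what happens when numbers and text are forced into one matrix with the base R function cbind(). The name mixed is an illustrative assumption.

```r
# Mixing numbers and text in one matrix: every entry becomes character
mixed <- cbind(c(1, 2), c("a", "b"))

mixed
is.character(mixed)  # the numbers are now stored as text
```

This silent conversion is one reason to prefer a data frame when columns hold different kinds of data.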


Data frames

- Think of these like Microsoft Excel spreadsheets
- Columns represent different variables, e.g. age, sex, number of cells, . . .
- Rows represent samples, e.g. patients, tests
- Like matrices, they are a generalization of vectors, but can store different types of data


Data frames

R has some pre-installed data frames, including the infamous Sir Ronald Fisher iris dataset¹, to allow us to practice

iris

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa

. . .

¹ R. A. Fisher (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2): 179–188.


Selection in data frames

- Earlier, we learnt how to select individual elements from a vector
- For a data frame the same principles apply, except there are now 2 dimensions: rows and columns (note the order!)


Selection in data frames

There are 3 primary methods of selecting data from data frames

1. Square brackets
2. Using the dollar ($) operator (for columns only)
3. Using the subset function (we won’t discuss this today)

They all do the same thing (sort of), and you can combine these methods


Selection using square brackets

One method of selection is the square brackets:

dat[i , ] would select the i-th row (which is a one-row data frame)
dat[ , j] would select the j-th column (which is a vector)
dat[i, j] would select the value from the i-th row and j-th column


iris[1, 1]

## [1] 5.1

iris[ , 1]

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8
##  [14] 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0
##  [27] 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4
##  [40] 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4
##  [53] 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6
##  [66] 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
##  [79] 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5
##  [92] 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3
## [105] 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5
## [118] 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
## [131] 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8
## [144] 6.8 6.7 6.7 6.3 6.5 6.2 5.9


Selection using square brackets

i and j don’t have to be single numbers, they can be:

- vectors of numbers
- logical vectors (which need to be the same length as the rows or columns)

iris[c(1, 3) , ]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
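Square brackets also accept column names, which is often easier to read than column numbers. A brief sketch; the object name small is an illustrative assumption.

```r
# Rows 1 to 3, with two columns picked by name
small <- iris[1:3, c("Sepal.Length", "Species")]

small
dim(small)
```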


Selection using the dollar operator

- Each column in a data frame should have a name
- We use dat$foo1 to extract the column called foo1 from a data frame called dat, e.g.

iris$Petal.Width

##   [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1
##  [14] 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2
##  [27] 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 0.2
##  [40] 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5
##  [53] 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3
##  [66] 1.4 1.5 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7
##  [79] 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2
##  [92] 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8
## [105] 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8
## [118] 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6
## [131] 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9
## [144] 2.3 2.5 2.3 1.9 2.0 2.3 1.8


Tasks

1. Select all rows of the iris data where the sepal length is >7.6cm
2. Extract the sepal lengths of iris flowers sp. virginica with petal widths >2.4cm

N.B. there are multiple ways of solving these problems


Solution (1)

iris[iris$Sepal.Length > 7.6, ]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 118          7.7         3.8          6.7         2.2
## 119          7.7         2.6          6.9         2.3
## 123          7.7         2.8          6.7         2.0
## 132          7.9         3.8          6.4         2.0
## 136          7.7         3.0          6.1         2.3
##       Species
## 118 virginica
## 119 virginica
## 123 virginica
## 132 virginica
## 136 virginica


Solution (2)

I’ll break this one into pieces to make it clearer. . .

lvec1 <- (iris$Petal.Width > 2.4)
lvec2 <- (iris$Species == "virginica")
iris2 <- iris[lvec1 & lvec2, ]
iris2

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 101          6.3         3.3          6.0         2.5
## 110          7.2         3.6          6.1         2.5
## 145          6.7         3.3          5.7         2.5
##       Species
## 101 virginica
## 110 virginica
## 145 virginica

iris2$Sepal.Length

## [1] 6.3 7.2 6.7


I could have combined all of this into a single line. . .

iris[(iris$Petal.Width > 2.4) &
     (iris$Species == "virginica"), ]$Sepal.Length

## [1] 6.3 7.2 6.7
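For comparison, the same selection can be sketched with the subset function that was listed earlier as the third selection method. Inside subset() the column names can be used directly, without the iris$ prefix. The object name big is an illustrative assumption.

```r
# Keep only the rows that satisfy both conditions
big <- subset(iris, Petal.Width > 2.4 & Species == "virginica")

big$Sepal.Length
```

This should give the same three sepal lengths as the square-bracket version.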


Factors

- An important class of data in R are factors
- They are categorical variables, e.g. gender, country
- They are similar to character data, except that R is “aware” of them, which allows us to do lots of clever things with our data


iris$Species

##   [1] setosa     setosa     setosa     setosa     setosa
##   [6] setosa     setosa     setosa     setosa     setosa
##  [11] setosa     setosa     setosa     setosa     setosa
##  [16] setosa     setosa     setosa     setosa     setosa
##  [21] setosa     setosa     setosa     setosa     setosa
##  [26] setosa     setosa     setosa     setosa     setosa
##  [31] setosa     setosa     setosa     setosa     setosa
##  [36] setosa     setosa     setosa     setosa     setosa
##  [41] setosa     setosa     setosa     setosa     setosa
##  [46] setosa     setosa     setosa     setosa     setosa
##  [51] versicolor versicolor versicolor versicolor versicolor
##  [56] versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor
##  [66] versicolor versicolor versicolor versicolor versicolor
##  [71] versicolor versicolor versicolor versicolor versicolor
##  [76] versicolor versicolor versicolor versicolor versicolor
##  [81] versicolor versicolor versicolor versicolor versicolor
##  [86] versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor
##  [96] versicolor versicolor versicolor versicolor versicolor
## [101] virginica  virginica  virginica  virginica  virginica
## [106] virginica  virginica  virginica  virginica  virginica
## [111] virginica  virginica  virginica  virginica  virginica
## [116] virginica  virginica  virginica  virginica  virginica
## [121] virginica  virginica  virginica  virginica  virginica
## [126] virginica  virginica  virginica  virginica  virginica
## [131] virginica  virginica  virginica  virginica  virginica
## [136] virginica  virginica  virginica  virginica  virginica
## [141] virginica  virginica  virginica  virginica  virginica
## [146] virginica  virginica  virginica  virginica  virginica
## Levels: setosa versicolor virginica
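A short sketch of what R being “aware” of a factor can look like in practice, using the base R functions levels() and table(). The names species_levels and species_counts are illustrative assumptions.

```r
species_levels <- levels(iris$Species)  # the distinct categories
species_counts <- table(iris$Species)   # number of rows in each category

species_levels
species_counts
```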


Matrices and data frames too limited?

- What if you need something more than a flat matrix or data.frame?
- E.g. recording 100 measurements for 70 subjects at 25 time points
- ?array and ?list
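A minimal sketch of both structures, assuming the measurements are numeric. The object names obs and study, and the contents of the list, are illustrative assumptions.

```r
# 100 measurements x 70 subjects x 25 time points, filled with NA to start
obs <- array(NA_real_, dim = c(100, 70, 25))
dim(obs)

# A list can hold objects of different types and sizes side by side
study <- list(name = "trial", heights = c(184, 162), data = head(iris))
study$heights
```

An array is selected from with one index per dimension, e.g. obs[1, 2, 3], while list elements are reached by name with the dollar operator.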


Functions


What are functions?

- In short, you put something in and get something out
- In order to do interesting things with our data and apply the wealth of statistical methods available, we need to understand about functions first


Recognising a function

- Functions must have an assigned name
- Functions are applied using round brackets
- Functions generally take arguments (either required or optional)
- Arguments can be anything, depending on the function, but often one of them will be some data

e.g. myFunc(x)


Base R functions

R has lots of built in functions, which you can apply to most data objects


summary

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length
##  Min.   :4.300   Min.   :2.000   Min.   :1.000
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
##  Median :5.800   Median :3.000   Median :4.350
##  Mean   :5.843   Mean   :3.057   Mean   :3.758
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
##  Max.   :7.900   Max.   :4.400   Max.   :6.900
##   Petal.Width          Species
##  Min.   :0.100   setosa    :50
##  1st Qu.:0.300   versicolor:50
##  Median :1.300   virginica :50
##  Mean   :1.199
##  3rd Qu.:1.800
##  Max.   :2.500


head & tail

- If you want to inspect a data frame, you don’t want to look at the whole thing
- We use either the head() or tail() functions
- Or if using R Studio, click the Environment tab

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa


ncol, nrow, dim

ncol(iris)

## [1] 5

nrow(iris)

## [1] 150

dim(iris)

## [1] 150 5


length

dim() returns NULL for vectors, so we have to use length()

length(heights)

## [1] 8


names

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length"
## [4] "Petal.Width"  "Species"


Mathematical functions

We saw some of these before, e.g.

sin(pi) # Not zero as R is using numerical approximation

## [1] 1.224647e-16

exp(pi*4)

## [1] 286751.3


Statistical functions

mean(heights)

## [1] 167.125

sd(heights)

## [1] 20.09575

range(heights)

## [1] 142 200


Warnings & Errors

Sometimes we get messages listed as Warning and Error

Warnings mean that the function was able to do something, but what it returns may not be what you were expecting

Errors mean that the function aborted as something did not make sense

Don’t ignore either unless you are 100% confident why it happened!


mean(iris)

## Warning in mean.default(iris): argument is not numeric or
## logical: returning NA

## [1] NA

sin(Pi)

## Error in eval(expr, envir, enclos): object 'Pi' not found
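The warning arises because mean() expects a vector, not a whole data frame. One way to get what was probably intended, sketched with the base R function colMeans() applied to the four numeric columns; the name col_means is an illustrative assumption.

```r
# Means of the four numeric columns only (column 5 is the Species factor)
col_means <- colMeans(iris[, 1:4])

col_means
```

The values should agree with the Mean rows reported by summary(iris).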


Tasks

1. How many iris samples have petal widths >2cm?
2. Of these, what is the mean and SD of their petal lengths?


Solutions

x <- iris[iris$Petal.Width > 2, ]
nrow(x)

## [1] 23

mean(x$Petal.Length)

## [1] 5.76087

sd(x$Petal.Length)

## [1] 0.4793358


seq

Some functions take multiple arguments and it is often best to formally declare them, e.g.

seq(from = 1, to = 10, by = 2)

## [1] 1 3 5 7 9

But if confident of the order, we could just apply

seq(1, 10, 2)

## [1] 1 3 5 7 9


Shorthand trick

We can replace the function seq(x, y, by = 1) with x:y, e.g.

seq(1, 10, 1)

## [1] 1 2 3 4 5 6 7 8 9 10

1:10

## [1] 1 2 3 4 5 6 7 8 9 10
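A few related variations, sketched as an aside: seq() can be told how many values to produce instead of the step size, seq_len() counts from 1, and the colon also counts downwards. The object names are illustrative assumptions.

```r
evenly <- seq(0, 1, length.out = 5)  # five evenly spaced values from 0 to 1
count_up <- seq_len(4)               # same as 1:4
count_down <- 10:7                   # the colon also counts downwards

evenly
count_up
count_down
```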


Making a data frame from vectors

If we have vectors v1, v2, v3, then we can make our own data frame using the data.frame function

ID <- seq(1, 8, 1)
heights.m <- heights / 100 # Heights in metres (heights are in cm)
data.frame(ID, heights, heights.m)

##   ID heights heights.m
## 1  1     184      1.84
## 2  2     162      1.62
## 3  3     145      1.45
## 4  4     200      2.00
## 5  5     178      1.78
## 6  6     154      1.54
## 7  7     172      1.72
## 8  8     142      1.42


Coercion

We can coerce one data type into another using the as.* functions, e.g.

as.data.frame()
as.matrix()
as.vector()
as.numeric()

Don’t worry about these for now, but handy for your own studies one day
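For the curious, a brief sketch of three of these coercions in action. The object names num, txt and m are illustrative assumptions.

```r
num <- as.numeric("3.14")       # text to number
txt <- as.character(42)         # number to text
m <- as.matrix(iris[1:3, 1:4])  # data frame (numeric columns) to matrix

num
txt
m
```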


Merging 2 (or more) data frames

- If you have 2 data frames that share a common field, e.g. subject IDs, we can merge them together using merge()
- This is particularly useful for longitudinal datasets


Let’s make another data set

Species <- c("setosa", "versicolor", "virginica")
Colours <- c("red", "blue", "violet")
flowerCols <- data.frame(Species, Colours)
flowerCols

##      Species Colours
## 1     setosa     red
## 2 versicolor    blue
## 3  virginica  violet


Now let’s merge them

irisMerge <- merge(iris, flowerCols)
head(irisMerge, 5)

##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
##   Colours
## 1     red
## 2     red
## 3     red
## 4     red
## 5     red
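By default merge() joins on every column name the two data frames share, which here is just Species. A sketch of the same merge with the shared column named explicitly through the by argument, and with all.x = TRUE so that every row of the first data frame is kept even if it has no match.

```r
Species <- c("setosa", "versicolor", "virginica")
Colours <- c("red", "blue", "violet")
flowerCols <- data.frame(Species, Colours)

# Name the shared column explicitly; all.x = TRUE keeps every iris row
irisMerge <- merge(iris, flowerCols, by = "Species", all.x = TRUE)
dim(irisMerge)
```

Spelling out by is a cheap safeguard when two data frames happen to share more column names than intended.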


Writing our own function

- We can write our own functions when needed
- We use the function() function
- We must remember to assign it to a name, otherwise we can’t use it

myFun <- function(arguments) {
  # do something
}


E.g. $f(x) = e^x + x^2 + 1$

fx <- function(x) {
  exp(x) + x^2 + 1
}
fx(5)

## [1] 174.4132

fx(seq(3, 12, 3))

## [1] 30.08554 440.42879 8185.08393 162899.79142
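Functions can also take optional arguments with default values. A minimal sketch: the function name convert and its divisor argument are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: rescale a measurement, e.g. cm to metres
# 'divisor' is an optional argument with a default value of 100
convert <- function(x, divisor = 100) {
  return(x / divisor)
}

convert(184)                # uses the default divisor
convert(184, divisor = 10)  # overrides the default
```

The explicit return() is optional in R, since a function returns the value of its last expression, as fx does.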


Comments

- Notice that anything written after a hash-symbol is ignored by R
- Use this to annotate your R scripts to remember what you are doing

# Graeme thinks R is great!
# R will ignore all of this


Help with functions

- There are thousands of functions in R
- Some are loaded on launch of R (e.g. mean, seq, dim)
- Others require packages to be loaded first
- If you know the name of a function, you can use the ? operator to access the help file, e.g.

?sd


Help with functions

- You can also use the search bar in the R Studio software
- If you don’t know the name of the function, try help.search() for a list of possible candidates

help.search("sequences")

When all else fails: Google it!


Conditional statements and loops


Introduction

- Inherent to all programming languages are conditional statements and loops
- Require them when we need to make complex rules
- Would require a more advanced tutorial to fully appreciate the power
- If interested to learn more, see references at end


Conditional statements

The if statement, which is technically a function, only does something if its condition is TRUE, e.g.

if(3 < 4) {3 + 3}

## [1] 6

if(3 > 4) {2 + 2}

Also, look up while and else using help.search()
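As a starting point, a minimal sketch of else and while. The variable names x, size and n are illustrative assumptions.

```r
x <- 3
if (x > 4) {
  size <- "big"
} else {
  size <- "small"   # runs when the condition is FALSE
}

# while repeats its body for as long as the condition is TRUE
n <- 0
while (n < 5) {
  n <- n + 1
}

size
n
```

Note that else has to sit on the same line as the closing brace of the if block when typed at the top level of a script.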


Loops

- We might want to sequentially do something, conditional on something else
- For example, let $Y_i = Y_{i-1} + i/10$ with $Y_1 = 0$
- Calculate $Y = \sum_{i=1}^{20} Y_i$

E.g. $0 + (0 + 2/10) + (2/10 + 3/10) + \dots + (2/10 + 3/10 + \dots + 20/10)$


Yi <- 0 # Start with Y_1
Y <- Yi # Running total, starting at Y_1
for(i in 2:20) {
  Yi <- Yi + i/10 # Calculate Y_i from Y_{i-1}
  Y <- Y + Yi # Cumulative sum
}
Y # Solution

## [1] 152
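As a cross-check of the recurrence as stated on the Loops slide ($Y_i = Y_{i-1} + i/10$ with $Y_1 = 0$), here is a sketch of a loop-free version using the base R function cumsum(). Each $Y_i$ is the running sum of the increments $0, 2/10, 3/10, \dots$, so summing those running sums gives $Y$, which works out by hand to 152. The names Yvals and total are illustrative assumptions.

```r
Yvals <- cumsum(c(0, (2:20) / 10))  # Y_1, Y_2, ..., Y_20
total <- sum(Yvals)                 # Y = sum of all Y_i

total
```

Vectorised functions like cumsum() are usually shorter and faster than an explicit loop, and are a handy way to verify loop code.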


Task

For each value in our heights vector earlier, how can we calculate the difference between it and the previous one, i.e. calculate $\text{heights}_i - \text{heights}_{i-1}$?


Solution

d <- 0
for(i in 2:length(heights)) {
  d[i] <- heights[i] - heights[i-1]
}
d

## [1]   0 -22 -17  55 -22 -24  18 -30
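For reference, base R also has a built-in function for this, diff(). A short sketch; note that its result is one element shorter than the loop version, because there is no difference for the first value. The name d2 is an illustrative assumption.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

d2 <- diff(heights)  # heights[i] - heights[i-1] for i = 2, ..., 8
d2
```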


Reading and writing files


Reading

- You want to get your data into R
- Data comes in lots of different formats, luckily R can handle almost all of them!
- Most use packages – we’ll explore these later


read.csv

The simplest way is to convert your data to a comma separated value (*.csv) file and use

my.data <- read.csv(file.choose())

- Instead of writing file.choose() we could have specified the file location
- Look at the help file for more customization settings
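A self-contained sketch of reading from a specified file location. So that it runs anywhere, it first writes a small example file to a temporary location; the variable name path and the use of tempfile() are illustrative assumptions, and in practice the path would point at your own file.

```r
# Write a small example file to a temporary location, then read it back
path <- tempfile(fileext = ".csv")
write.csv(head(iris), path, row.names = FALSE)

my.data <- read.csv(path)  # a file path instead of file.choose()
dim(my.data)
```

Giving the path in the script, rather than choosing the file by hand, keeps the analysis reproducible.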


read.xlsx

- Converting our data to CSV format is a pain!
- We can use the xlsx package instead

library("xlsx")
my.data <- read.xlsx(file.choose(), sheetIndex = 1)


foreign

- What if our data is in a Stata, SPSS, SAS, etc. file?
- We can use the foreign package instead, e.g.

library("foreign")
my.data <- read.spss(file.choose(), to.data.frame = TRUE) # Return a data frame rather than a list


Data on the web

- What if our data is in the cloud?
- We can use the utils package function download.file instead, e.g.

library("utils")
download.file("http://www.liv.ac.uk/dat.csv", destfile = "dat.csv") # Save a local copy
my.data <- read.csv("dat.csv") # Then read the local copy in

Useful if you share your data on a public Dropbox folder


Other formats

If data exists, R can read it in
Just need to find the right package

Page 94: Introduction to using R

Reading and writing files

Writing

Usually as simple as changing read. to write.

Need to specify:

1 What data object we want to save
2 A name for the file we will save

write.csv(iris, "IrisData.csv")
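A small round-trip sketch: write the iris data out, then read it back to check the result. The temporary file path and the row.names = FALSE choice are illustrative, not part of the original slide:

```r
# Write the built-in iris data to a temporary file (path is illustrative)
out.file <- file.path(tempdir(), "IrisData.csv")
write.csv(iris, out.file, row.names = FALSE)  # row.names = FALSE drops the row-number column

# Read it back to check what was written
iris.check <- read.csv(out.file)
dim(iris.check)
```

Without row.names = FALSE, write.csv adds an extra first column holding the row names.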

Page 95: Introduction to using R

Graphics

Graphics

Page 96: Introduction to using R

Graphics

Introduction

R has 3 primary graphics packages (see footnote 2):

1 Base R - those built into R
2 lattice - a functional extension of the base graphics (requires a package)
3 ggplot2 - built on the grammar of graphics (requires a package)

All are called using functions, and typically have lots of optional arguments for customization of figures

Footnote 2: I will discuss packages shortly.

Page 97: Introduction to using R

Graphics

Plot

The plot function can be applied to most data objects

Alternatively, one can give it two arguments:

x - x-axis coordinates
y - y-axis coordinates

Can also specify arguments to: label the axes; colour the points; etc. See: ?plot
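A short sketch of the two-argument form with some of the optional arguments. The data and labels are made up for illustration:

```r
# Hypothetical data, made up for illustration
x <- 1:10
y <- x^2

plot(x = x, y = y,
     xlab = "x value", ylab = "x squared",  # axis labels
     main = "A simple scatterplot",         # title
     col = "blue", pch = 19)                # point colour and symbol
```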

Page 98: Introduction to using R

Graphics

plot(iris)

[Figure: scatterplot matrix of the iris data, with panels for Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]

Page 99: Introduction to using R

Graphics

Task

How can I plot the sepal length against the petal length of the iris data, and colour the points by species?

Hint: ?as.numeric + ?plot

Page 100: Introduction to using R

Graphics

Solution

plot(x = iris$Sepal.Length, y = iris$Petal.Length,
     col = as.numeric(iris$Species))

[Figure: scatterplot of iris$Petal.Length (y-axis) against iris$Sepal.Length (x-axis), points coloured by species]
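Why does as.numeric work here? Species is a factor, and as.numeric returns the integer code of each level (1, 2, 3), which plot interprets as colour numbers. The sketch below shows the codes and adds a legend; the legend position and symbol are illustrative choices, not part of the original solution:

```r
# as.numeric() on a factor returns the integer code of each level
levels(iris$Species)
table(as.numeric(iris$Species))

plot(x = iris$Sepal.Length, y = iris$Petal.Length,
     col = as.numeric(iris$Species))

# Add a legend mapping colours 1, 2, 3 back to the species names
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 1)
```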

Page 101: Introduction to using R

Graphics

Histograms

hist(iris$Petal.Length,
     col = "grey", xlab = "Petal length (cm)")

[Figure: "Histogram of iris$Petal.Length"; x-axis: Petal length (cm); y-axis: Frequency]

Page 102: Introduction to using R

Graphics

Boxplots

boxplot(Petal.Length ~ Species, data = iris)

[Figure: boxplots of Petal.Length for each species: setosa, versicolor, virginica]

Page 103: Introduction to using R

Graphics

ggplot2

Flexible publication quality graphics

library("ggplot2")

Page 104: Introduction to using R

Graphics

ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length),
       data = iris) +
  geom_point() + geom_smooth() +
  facet_wrap(~ Species, scales = "free")

[Figure: Petal.Width (y-axis) against Petal.Length (x-axis), one panel per species (setosa, versicolor, virginica), points coloured by Sepal.Length]

Page 105: Introduction to using R

Graphics

Saving figures

In R Studio, click Export then select file type
In R, right click + Save
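Figures can also be saved from code, which is handy inside a script. A minimal sketch using the base pdf graphics device; the file path and figure size are illustrative:

```r
# Open a PDF graphics device (file path is illustrative)
fig.file <- file.path(tempdir(), "petal-hist.pdf")
pdf(fig.file, width = 6, height = 4)   # size in inches

hist(iris$Petal.Length, col = "grey", xlab = "Petal length (cm)")

dev.off()  # close the device; the file is only complete after this
```

The same pattern works with png() or jpeg() in place of pdf().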

Page 106: Introduction to using R

Packages

Packages

Page 107: Introduction to using R

Packages

Introduction

Packages are like apps
Many state of the art statistical methods are published in journals with R packages
A large number of books are also released with specific R packages, e.g. time series, prognostic modelling, survival regression, . . .
Packages are published:
  On CRAN
  On GitHub
  On Bioconductor
  As code files on personal websites (not really packages per se)

Page 108: Introduction to using R

Packages

Where to find relevant packages

There are thousands of R packages!
Books, journal articles (look at what is reported in the Statistical Analysis section), word of mouth, your local friendly statistician
www.rseek.org

Page 109: Introduction to using R

Packages

What packages are already installed?

R comes with a number of packages pre-installed
We can see all of them, plus any we have installed ourselves, using

library() # No arguments!

or click the Packages tab if working from R Studio
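To work with the list of installed packages in code, rather than just view it, one option is the installed.packages function. A small sketch:

```r
# installed.packages() returns a matrix with one row per installed package
pkgs <- installed.packages()

head(rownames(pkgs))          # first few package names
"stats" %in% rownames(pkgs)   # is a given package installed?
```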

Page 110: Introduction to using R

Packages

Installing a package from CRAN

Once you have identified the package you want, run:

install.packages("ggplot2")

Page 111: Introduction to using R

Packages

Loading a package

R does not automatically load all installed packages, as this would slow your computer down and can also cause clashes (e.g. two packages defining functions with the same name)

When you know what package(s) you want to use, run:

library("ggplot2") # Load ggplot2 package
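Two related tools, sketched here with the pre-installed stats package so the example runs anywhere: requireNamespace checks whether a package is available, and the pkg::fun form calls a function from a named package without attaching it.

```r
# requireNamespace() returns TRUE if the package is available;
# it loads the namespace but does not attach it to the search path
has.stats <- requireNamespace("stats", quietly = TRUE)
has.stats

# pkg::fun calls a function from a named package, which avoids name clashes
stats::median(c(1, 5, 9))
```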

Page 112: Introduction to using R

Statistics

Statistics

Page 113: Introduction to using R

Statistics

Introduction

In R, statistical method is just another term for function
Beware of applying statistical methods to your data without first understanding what they do: G-I-G-O (garbage in, garbage out)!

Page 114: Introduction to using R

Statistics

Student’s t-test

t.test(x = heights, mu = 180)

##
##  One Sample t-test
##
## data:  heights
## t = -1.8121, df = 7, p-value = 0.1129
## alternative hypothesis: true mean is not equal to 180
## 95 percent confidence interval:
##  150.3245 183.9255
## sample estimates:
## mean of x
##   167.125

Page 115: Introduction to using R

Statistics

Notice that we applied a function to two arguments:

1 Some data
2 A number

And it gave us back some statistics and a P-value
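The result of t.test can also be stored and its pieces pulled out individually. A sketch with made-up numbers (not the heights data from the slides):

```r
# Hypothetical data, made up for illustration; not the heights from the slides
x <- c(1, 2, 3, 4, 5)

res <- t.test(x = x, mu = 3)

# The result is a list-like object of class "htest"
res$statistic  # t statistic
res$p.value    # P-value
res$conf.int   # 95% confidence interval
res$estimate   # sample mean
```

Storing the result is useful when the P-value or confidence interval is needed later in a script.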

Page 116: Introduction to using R

Statistics

Linear regression

The first argument of many statistical functions is a formula

A formula is written as:

The outcome variable on the LHS (one of the variables in your dataset)
A tilde symbol (~) which means model onto
The explanatory variables on the RHS (separated by a + if more than one)
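A formula is an ordinary R object, so it can be stored and inspected. A small sketch using the iris variables; the particular choice of explanatory variables and the name fit2 are illustrative:

```r
# A formula is an ordinary R object: outcome ~ explanatory variables
f <- Petal.Width ~ Sepal.Length + Species

class(f)     # "formula"
all.vars(f)  # variable names: the outcome first, then the explanatory variables

# The same formula object can be passed to a modelling function
fit2 <- lm(f, data = iris)
names(coef(fit2))
```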

Page 117: Introduction to using R

Statistics

fit <- lm(Petal.Width ~ Sepal.Length, data = iris)
fit

##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Coefficients:
##  (Intercept)  Sepal.Length
##      -3.2002        0.7529

Page 118: Introduction to using R

Statistics

Recall we can apply the summary function to different things?
We have just assigned our linear model to the name fit

summary(fit)

##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.96671 -0.35936 -0.01787  0.28388  1.23329
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.20022    0.25689  -12.46   <2e-16 ***
## Sepal.Length  0.75292    0.04353   17.30   <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16
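Individual pieces of the fitted model can be extracted with helper functions. A sketch that refits the same model so it runs on its own; the sepal length of 6 cm used for prediction is a made-up value:

```r
fit <- lm(Petal.Width ~ Sepal.Length, data = iris)

coef(fit)     # intercept and slope as a named vector
confint(fit)  # 95% confidence intervals for the coefficients

# Predicted petal width for a hypothetical flower with sepal length 6 cm
predict(fit, newdata = data.frame(Sepal.Length = 6))
```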

Page 119: Introduction to using R

Statistics

We can apply other functions to fit also, e.g.

plot(fit)

[Figure: four diagnostic plots for fit: Residuals vs Fitted; Normal Q-Q; Scale-Location; Cook's distance]
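The slide does not show how the four panels were arranged. One way to get a similar 2 x 2 grid, assuming the standard par(mfrow) approach and the which argument of plot for lm objects:

```r
fit <- lm(Petal.Width ~ Sepal.Length, data = iris)

old.par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
plot(fit, which = 1:4)           # residuals vs fitted, Q-Q, scale-location, Cook's distance
par(old.par)                     # restore the previous layout
```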

Page 120: Introduction to using R

Statistics

Probability

Most routine statistical distributions are built-in, e.g. Gaussian (norm), Binomial (binom), Poisson (pois), . . .

We use a prefix:

d to get the density
p to get the cumulative probability distribution
q to get the quantile function
r to sample random values from the distribution
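The four prefixes can be sketched as follows; the distributions, parameter values and random seed are arbitrary choices for illustration:

```r
dnorm(0, mean = 0, sd = 1)    # density of N(0, 1) at 0
pnorm(0, mean = 0, sd = 1)    # P(X <= 0) = 0.5
qnorm(0.5, mean = 0, sd = 1)  # median of N(0, 1) = 0

set.seed(1)                   # arbitrary seed, so the sample is reproducible
rpois(3, lambda = 2)          # 3 random draws from Poisson(2)

dbinom(10, size = 10, prob = 0.5)  # P(exactly 10 heads in 10 fair tosses) = 0.5^10
```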

Page 121: Introduction to using R

Statistics

E.g.

# Sample 5 values from a N(2, 9) distribution
rnorm(5, mean = 2, sd = 3)

## [1] -2.370804 3.118035 -2.090598 2.051935 3.879390

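# Probability of tossing 10 heads in a row
# (pbinom(0, ...) is P(0 heads in 10 tosses), which equals P(10 heads) by symmetry when prob = 0.5)
pbinom(0, size = 10, prob = 0.5)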

## [1] 0.0009765625

Page 122: Introduction to using R

Wrapping up

Wrapping up

Page 123: Introduction to using R

Wrapping up

Saving your workspace

You can save everything you have done using File -> Save (R Studio will prompt you when exiting)
I never save my workspace
Why? Because I save the R Script (copy & paste)
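For completeness, a workspace (or selected objects) can also be saved from code with the base functions save, save.image and load. A sketch with a made-up object and a temporary file path:

```r
# Illustrative object and file path
heights.example <- c(170, 165, 180)
ws.file <- file.path(tempdir(), "my-workspace.RData")

save(heights.example, file = ws.file)  # save selected objects
# save.image(ws.file) would save every object in the workspace instead

rm(heights.example)                    # remove the object ...
load(ws.file)                          # ... and restore it from the file
heights.example
```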

Page 124: Introduction to using R

Wrapping up

Where to learn more

Venables WN, Smith DM (2014). An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Shahbaba B (2012). Biostatistics with R. Springer, NY.
Data Camp online course: https://www.datacamp.com/courses/
Ask me, Peter, Elisabeth or Helen!

Page 125: Introduction to using R

Wrapping up

Further IGH Statistical Seminars

Other sessions in this series will make use of R

Introduction to Time Series (5th Dec 2014)
Regression Modelling (Jan 2015)
Time-to-Event Analysis (Feb 2015)
Geostatistical Methods for Disease Prevalence Mapping (March 2015)
Statistical Power + Sample Size Calculations (April 2015)
Quantile Regression (May 2015)
