Introduction to using R

Post on 14-Jun-2015

DESCRIPTION

R is the lingua franca of statistical computing. One of its attractions for users of statistics is that it encompasses an enormous range of modern statistical methods developed by world-leading statistics researchers. In order to exploit its capabilities for data analysis and statistics, a basic understanding of the core functions is required. In this session we will cover all of the preliminaries that are common to all uses of R, with particular focus on the topics of functions and data objects. Statistical methods are not formally covered, although some basic functions will be demonstrated.

TRANSCRIPT

An Introduction to R

Graeme L. Hickey

31st October 2014

Graeme L. Hickey An Introduction to R 31st October 2014 1 / 125

Getting ready to use R


Logistics

Owing to room change, there will unfortunately be less hands-on experience (possibly none)
I will email the slides to all who registered – no need to make notes
We will take a short break during


What is R?

Derives from S, a statistical language best known through the proprietary software package S-Plus

“R is a free software programming language and software environment for statistical computing and graphics” Wikipedia (2014)

The “lingua franca of data analysts” The New York Times (2009)

Used worldwide by bioinformaticians, data scientists, high-level statisticians, app developers, . . .


Why use R?

Keep whole analysis together (data processing, analysis, publication figures, reports)
Reproducible research
State of the art statistical methods are wrapped up in ‘R packages’
It’s free
It’s cross-platform compatible (Windows, OSX, Linux)
It will be extensively used in EPH / IGH statistical training


Objectives

The primary objective is for you to be able to apply statistical functions available in R to your own data

To achieve this we should be able to:

Understand the core concepts of R and its syntax
Read and write data files
Interrogate a dataset
Use functions and options
Write a simple function


How to install R?

Download and install from: http://www.r-project.org
Can use in isolation, but combining with an IDE front-end makes life easier when starting out
Recommend using R Studio: http://www.rstudio.com
Once both programs are installed, you only ever need to run R Studio


R Studio


R Console vs. Script Editor

Would you consider writing your thesis using a typewriter?

Don’t just use the console – not reproducible!
Always write analysis as an R Script
File -> New File -> R Script
Highlight code and press Ctrl + Enter to execute


R as a calculator


Simple maths

1 + 2

## [1] 3

13 * 17

## [1] 221

((6 * 7) + 4 - 7) / 2^8

## [1] 0.1523438


Routine mathematical functions and constants

exp(3)

## [1] 20.08554

sin(2 * pi) - 1

## [1] -1

atan(1) ^ 2

## [1] 0.6168503

All of these examples use base R functions – we’ll revisit these later

Other ‘numbers’ to look out for

1 / 0

## [1] Inf

-1 / 0

## [1] -Inf

0 / 0

## [1] NaN

If you see these, you have probably done something wrong!
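A minimal sketch of how these special values could be detected in your own results, using the base R functions is.finite(), is.infinite() and is.nan(). The vector name results is only an illustration and does not appear elsewhere in these slides.

```r
# Hypothetical vector holding the three special values plus an ordinary number
results <- c(1 / 0, -1 / 0, 0 / 0, 1 / 2)

finite_flags   <- is.finite(results)    # TRUE only for the ordinary number
infinite_flags <- is.infinite(results)  # TRUE for Inf and -Inf
nan_flags      <- is.nan(results)       # TRUE only for NaN

finite_flags
infinite_flags
nan_flags
```

Checks like these can be placed after a calculation to catch a division by zero before it spreads through an analysis.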

Data objects


Assignment operator

We can tell R to remember things so that we can call them later
We do this using either of the assignment operators = or <-

x <- 5
x

## [1] 5

x = 5
x

## [1] 5


A warning!

R is case SENSITIVE

This is a common error and can lead to a great deal of pain in tracing bugs!

x # Lowercase

## [1] 5

X # Uppercase

## Error in eval(expr, envir, enclos): object 'X' not found


Another warning!

Names cannot begin with numbers or include spaces

gra.eme_30 <- 5 # OK
30graeme <- 5 # Not allowed!
gra eme <- 5 # Not allowed either!


Task

What do you think the following lines of R code will output at the end?

x <- 2
y <- pi
z <- (sin(y) + x)^2
z

Try it!


Solution

x <- 2
y <- pi
z <- (sin(y) + x)^2
z

## [1] 4


Task

What do you think the following lines of R code will output at the end?

x <- 2
x <- x + 5
x

Try it!


Solution

x <- 2
x <- x + 5
x

## [1] 7


Vectors

We often have more than a single number, which we combine into a vector using the c() function, e.g.

c(184, 162, 145, 200, 178, 154, 172, 142)

## [1] 184 162 145 200 178 154 172 142

heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
heights

## [1] 184 162 145 200 178 154 172 142


Task

What do you think the following lines of R code will output at the end?

heights/10 + 1

Try it!


Solution

heights/10 + 1

## [1] 19.4 17.2 15.5 21.0 18.8 16.4 18.2 15.2


Selection

We might want to select the 5-th value from a vector
We use square brackets for this, e.g.

heights[5]

## [1] 178


Task

What do you think the following lines of R code will output at the end?

heights[c(1, 3, 5)]

Try it!


Solution

heights[c(1, 3, 5)]

## [1] 184 145 178


Logic

Boils down to something being TRUE or FALSE

x > y asks: is x greater than y?
x < y asks: is x less than y?
x == y asks: is x equal to y?
x >= y asks: is x greater than or equal to y?
x <= y asks: is x less than or equal to y?


Basic examples

5 < 10

## [1] TRUE

3 > 5

## [1] FALSE

sin(pi) == cos(pi) + 1

## [1] FALSE
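The last comparison is FALSE only because of floating-point rounding: sin(pi) evaluates to a tiny non-zero number. A minimal sketch of a tolerance-based comparison using the base R functions all.equal() and isTRUE(); the variable names are illustrative only.

```r
# Exact comparison: fails because sin(pi) is about 1.2e-16, not exactly 0
exact_match <- sin(pi) == cos(pi) + 1

# Comparison up to a small numerical tolerance
approx_match <- isTRUE(all.equal(sin(pi), cos(pi) + 1))

exact_match
approx_match
```

When comparing the results of calculations (rather than whole numbers typed in by hand), the tolerance-based form is usually the safer choice.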


Logic

We can combine logical statements, for example

(5 < 10) & (3 < 5)

## [1] TRUE

(1 > 2) | (3 > 4)

## [1] FALSE


Logic & selection

We can use a logical vector to pick out elements of a vector so long as the logical vector and the vector of data are the same length

logic.vec <- c(TRUE, FALSE, TRUE, FALSE, TRUE,
               FALSE, TRUE, FALSE)

heights[logic.vec]

## [1] 184 145 178 172

Task: How do you extract heights greater than 160cm?


Solution

How to extract heights greater than 160cm?

i <- (heights > 160)
i

## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE

heights[i]

## [1] 184 162 200 178 172

And once you understand, simply. . .

heights[heights > 160]

## [1] 184 162 200 178 172
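The logical operators from the earlier slide can be combined with this kind of selection. A minimal sketch, assuming we want heights strictly between 150cm and 180cm (thresholds chosen purely for illustration); which() is a base R function that returns the positions of the TRUE values.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# Combine two conditions element by element with &
mid_heights <- heights[(heights > 150) & (heights < 180)]
mid_heights

# Positions (not values) of the heights greater than 160cm
tall_positions <- which(heights > 160)
tall_positions
```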


Character data

Vectors don’t just store numbers
They can store items of class: integer, numeric, character, date, factor, etc.
For character data, just put things inside quotation marks

subjects <- c("Bob", "Amy", "Amy", "Bob", "Amy",
              "Bob", "Bob", "Amy", "Amy")

subjects

## [1] "Bob" "Amy" "Amy" "Bob" "Amy" "Bob" "Bob" "Amy" "Amy"


Matrices

These are generalizations of vectors: instead of being one vector, we have multiple columns of vectors:

matrix(heights, nrow = 2)

##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142

matrix(subjects, nrow = 3)

##      [,1]  [,2]  [,3]
## [1,] "Bob" "Bob" "Bob"
## [2,] "Amy" "Amy" "Amy"
## [3,] "Amy" "Bob" "Amy"


Matrices

We can apply the same arithmetic as per vectors

myMat <- matrix(heights, nrow = 2)
myMat

##      [,1] [,2] [,3] [,4]
## [1,]  184  145  178  172
## [2,]  162  200  154  142

0.5*myMat + 3

##      [,1]  [,2] [,3] [,4]
## [1,]   95  75.5   92   89
## [2,]   84 103.0   80   74


Matrices

Every element of a matrix has to be of the same type (e.g. numeric, character, logical); you can’t mix-and-match
Most useful when doing linear algebra (e.g. PCA, solving systems of equations)
R often coerces into matrix form when required by functions
If you want different data types, you need to use objects called data.frames
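A minimal sketch of what happens if types are mixed anyway: R silently converts everything to one common type, here character. The object name mixed is illustrative only.

```r
# Mixing a number, a string, a logical and another number
mixed <- matrix(c(1, "a", TRUE, 2.5), nrow = 2)
mixed

# Every element has been coerced to character, including the numbers
class(mixed[1, 1])
is.character(mixed)
```

Arithmetic such as mixed + 1 would now fail, which is one reason to reach for a data frame when columns hold different kinds of data.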


Data frames

Think of these like Microsoft Excel spreadsheets
Columns represent different variables, e.g. age, sex, number of cells, . . .
Rows represent samples, e.g. patients, tests
Like matrices, they are a generalization of vectors, but can store different types of data


Data frames

R has some pre-installed data frames, including the infamous Sir Ronald Fisher iris dataset [1], to allow us to practice

iris

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa

. . .

[1] R. A. Fisher (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2): 179–188.


Selection in data frames

Earlier, we learnt how to select individual elements from a vector
For a data frame the same principles apply, except there are now 2 dimensions: rows and columns (note the order!)


There are 3 primary methods of selecting data from data frames

1 Square brackets
2 Using the dollar ($) operator (for columns only)
3 Using the subset function (we won’t discuss this today)

They all do the same thing (sort of), and you can combine these methods


Selection using square brackets

One method of selection is the square brackets:

dat[i, ] would select the i-th row (which is a one-row data frame)
dat[ , j] would select the j-th column (which is a vector)
dat[i, j] would select the value from the i-th row and j-th column
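A minimal sketch, using the built-in iris data, of what each form of selection returns. Note that a single row of a data frame comes back as a one-row data frame, whereas a single column comes back as a plain vector; drop = FALSE is a standard argument of the square brackets that keeps the data frame structure. The object names are illustrative only.

```r
first_row <- iris[1, ]        # one row, all columns
is.data.frame(first_row)      # still a data frame

first_col <- iris[, 1]        # all rows, one column
is.numeric(first_col)         # a plain numeric vector

# Keep a single column as a data frame rather than a vector
first_col_df <- iris[, 1, drop = FALSE]
dim(first_col_df)
```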


iris[1, 1]

## [1] 5.1

iris[ , 1]

##   [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8
##  [14] 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0
##  [27] 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4
##  [40] 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4
##  [53] 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6
##  [66] 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
##  [79] 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5
##  [92] 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3
## [105] 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5
## [118] 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
## [131] 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8
## [144] 6.8 6.7 6.7 6.3 6.5 6.2 5.9


i and j don’t have to be single numbers, they can be:

vectors of numbers
logical vectors (which need to be the same length as the rows or columns)

iris[c(1, 3) , ]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa


Selection using the dollar operator

Each column in a data frame should have a name
We use dat$foo1 to extract the column called foo1 from a data frame called dat, e.g.

iris$Petal.Width

##   [1] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1
##  [14] 0.1 0.2 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 0.2
##  [27] 0.4 0.2 0.2 0.2 0.2 0.4 0.1 0.2 0.2 0.2 0.2 0.1 0.2
##  [40] 0.2 0.3 0.3 0.2 0.6 0.4 0.3 0.2 0.2 0.2 0.2 1.4 1.5
##  [53] 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3
##  [66] 1.4 1.5 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7
##  [79] 1.5 1.0 1.1 1.0 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2
##  [92] 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3 2.5 1.9 2.1 1.8
## [105] 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8
## [118] 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6
## [131] 1.9 2.0 2.2 1.5 1.4 2.3 2.4 1.8 1.8 2.1 2.4 2.3 1.9
## [144] 2.3 2.5 2.3 1.9 2.0 2.3 1.8


Tasks

1 Select all rows of the iris data where the sepal length is >7.6cm
2 Extract the sepal lengths of iris flowers sp. virginica with petal widths >2.4cm

N.B. there are multiple ways of solving these problems


Solution (1)

iris[iris$Sepal.Length > 7.6, ]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 118          7.7         3.8          6.7         2.2
## 119          7.7         2.6          6.9         2.3
## 123          7.7         2.8          6.7         2.0
## 132          7.9         3.8          6.4         2.0
## 136          7.7         3.0          6.1         2.3
##       Species
## 118 virginica
## 119 virginica
## 123 virginica
## 132 virginica
## 136 virginica


Solution (2)

I’ll break this one into pieces to make it clearer. . .

lvec1 <- (iris$Petal.Width > 2.4)
lvec2 <- (iris$Species == "virginica")
iris2 <- iris[lvec1 & lvec2, ]
iris2

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 101          6.3         3.3          6.0         2.5
## 110          7.2         3.6          6.1         2.5
## 145          6.7         3.3          5.7         2.5
##       Species
## 101 virginica
## 110 virginica
## 145 virginica

iris2$Sepal.Length

## [1] 6.3 7.2 6.7


I could have combined all of this into a single line. . .

iris[(iris$Petal.Width > 2.4) &
     (iris$Species == "virginica"), ]$Sepal.Length

## [1] 6.3 7.2 6.7


Factors

An important class of data in R are factors
They are categorical variables, e.g. gender, country
They are similar to character data, except that R is “aware” of them, which allows us to do lots of clever things with our data


iris$Species

##   [1] setosa     setosa     setosa     setosa     setosa
##   [6] setosa     setosa     setosa     setosa     setosa
##  [11] setosa     setosa     setosa     setosa     setosa
##  [16] setosa     setosa     setosa     setosa     setosa
##  [21] setosa     setosa     setosa     setosa     setosa
##  [26] setosa     setosa     setosa     setosa     setosa
##  [31] setosa     setosa     setosa     setosa     setosa
##  [36] setosa     setosa     setosa     setosa     setosa
##  [41] setosa     setosa     setosa     setosa     setosa
##  [46] setosa     setosa     setosa     setosa     setosa
##  [51] versicolor versicolor versicolor versicolor versicolor
##  [56] versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor
##  [66] versicolor versicolor versicolor versicolor versicolor
##  [71] versicolor versicolor versicolor versicolor versicolor
##  [76] versicolor versicolor versicolor versicolor versicolor
##  [81] versicolor versicolor versicolor versicolor versicolor
##  [86] versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor
##  [96] versicolor versicolor versicolor versicolor versicolor
## [101] virginica  virginica  virginica  virginica  virginica
## [106] virginica  virginica  virginica  virginica  virginica
## [111] virginica  virginica  virginica  virginica  virginica
## [116] virginica  virginica  virginica  virginica  virginica
## [121] virginica  virginica  virginica  virginica  virginica
## [126] virginica  virginica  virginica  virginica  virginica
## [131] virginica  virginica  virginica  virginica  virginica
## [136] virginica  virginica  virginica  virginica  virginica
## [141] virginica  virginica  virginica  virginica  virginica
## [146] virginica  virginica  virginica  virginica  virginica
## Levels: setosa versicolor virginica
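A minimal sketch of a few things R can do because it is "aware" of factors, using only base R functions (levels(), nlevels(), table() and as.integer()); the object names are illustrative.

```r
species_levels <- levels(iris$Species)   # the distinct categories
species_levels

n_species <- nlevels(iris$Species)       # how many categories there are
n_species

species_counts <- table(iris$Species)    # number of flowers in each category
species_counts

# Behind the scenes each level is stored as an integer code
species_codes <- as.integer(iris$Species)[c(1, 51, 101)]
species_codes
```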


Matrices and data frames too limited?

What if you need something more than a flat matrix or data.frame?
E.g. recording 100 measurements for 70 subjects at 25 time points
?array and ?list
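A minimal sketch of both objects, under the assumption that the example above is stored with one dimension each for measurements, subjects and time points. All object names and values here are made up for illustration.

```r
# A 3-dimensional array: 100 measurements x 70 subjects x 25 time points
measurements <- array(NA_real_, dim = c(100, 70, 25))
dim(measurements)

# Store, then retrieve, measurement 3 for subject 10 at time point 5
measurements[3, 10, 5] <- 1.5
measurements[3, 10, 5]

# A list can hold objects of different types and sizes side by side
study <- list(title = "Example study", n_subjects = 70, times = 1:25)
study$n_subjects
length(study$times)
```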


Functions


What are functions?

In short, you put something in and get something out
In order to do interesting things with our data and apply the wealth of statistical methods available, we need to understand functions first


Recognising a function

Functions must have an assigned name
Functions are applied using round brackets
Functions generally take arguments (either required or optional)
Arguments can be anything, depending on the function, but often one of them will be some data

e.g. myFunc(x)


Base R functions

R has lots of built in functions, which you can apply to most data objects


summary

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length
##  Min.   :4.300   Min.   :2.000   Min.   :1.000
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
##  Median :5.800   Median :3.000   Median :4.350
##  Mean   :5.843   Mean   :3.057   Mean   :3.758
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
##  Max.   :7.900   Max.   :4.400   Max.   :6.900
##   Petal.Width          Species
##  Min.   :0.100   setosa    :50
##  1st Qu.:0.300   versicolor:50
##  Median :1.300   virginica :50
##  Mean   :1.199
##  3rd Qu.:1.800
##  Max.   :2.500


head & tail

If you want to inspect a data frame, you don’t want to look at the whole thing
We use either the head() or tail() functions
Or if using R Studio, click the Environment tab

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa


ncol, nrow, dim

ncol(iris)

## [1] 5

nrow(iris)

## [1] 150

dim(iris)

## [1] 150 5


length

dim() doesn’t work on vectors, so we have to use length()

length(heights)

## [1] 8


names

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length"
## [4] "Petal.Width"  "Species"


Mathematical functions

We saw some of these before, e.g.

sin(pi) # Not zero as R is using numerical approximation

## [1] 1.224647e-16

exp(pi*4)

## [1] 286751.3


Statistical functions

mean(heights)

## [1] 167.125

sd(heights)

## [1] 20.09575

range(heights)

## [1] 142 200


Warnings & Errors

Sometimes we get messages listed as Warning and Error

Warnings mean that the function was able to do something, but what it returns may not be what you were expecting

Errors mean that the function aborted as something did not make sense

Don’t ignore either unless you are 100% confident why it happened!


mean(iris)

## Warning in mean.default(iris): argument is not numeric or
## logical: returning NA

## [1] NA

sin(Pi)

## Error in eval(expr, envir, enclos): object 'Pi' not found
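A minimal sketch of how warnings and errors could be caught in a script rather than just read off the console, using the base R function tryCatch(). This goes a little beyond what the slides cover; the object names are illustrative, and the second example assumes no object called Pi has been created.

```r
# A warning: the conversion "works" but returns NA
converted <- tryCatch(
  as.numeric("abc"),
  warning = function(w) {
    message("Caught a warning: ", conditionMessage(w))
    NA  # value to use instead
  }
)
converted

# An error: the object Pi does not exist, so sin(Pi) cannot be evaluated
caught <- tryCatch(
  sin(Pi),
  error = function(e) e  # return the error object instead of stopping
)
inherits(caught, "error")
```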


Tasks

1 How many iris samples have petal widths >2cm?
2 Of these, what is the mean and SD of their petal lengths?


Solutions

x <- iris[iris$Petal.Width > 2, ]
nrow(x)

## [1] 23

mean(x$Petal.Length)

## [1] 5.76087

sd(x$Petal.Length)

## [1] 0.4793358


seq

Some functions take multiple arguments and it is often best to formally declare them, e.g.

seq(from = 1, to = 10, by = 2)

## [1] 1 3 5 7 9

But if confident of the order, we could just apply

seq(1, 10, 2)

## [1] 1 3 5 7 9
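A minimal sketch of why naming arguments is the safer habit: named arguments can be given in any order, and they make optional arguments such as length.out (a standard argument of seq()) easy to use. The object names are illustrative.

```r
# Named arguments can be supplied in any order
a <- seq(from = 1, to = 10, by = 2)
b <- seq(by = 2, to = 10, from = 1)
identical(a, b)

# Ask for a fixed number of equally spaced values instead of a step size
five_points <- seq(from = 0, to = 1, length.out = 5)
five_points
```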


Shorthand trick

We can replace the function seq(x, y, by = 1) with x:y, e.g.

seq(1, 10, 1)

## [1] 1 2 3 4 5 6 7 8 9 10

1:10

## [1] 1 2 3 4 5 6 7 8 9 10


Making a data frame from vectors

If we have vectors v1, v2, v3, then we can make our own data frame using the data.frame function

ID <- seq(1, 8, 1)
heights.m <- heights / 100 # Heights in metres (100cm = 1m)
data.frame(ID, heights, heights.m)

##   ID heights heights.m
## 1  1     184      1.84
## 2  2     162      1.62
## 3  3     145      1.45
## 4  4     200      2.00
## 5  5     178      1.78
## 6  6     154      1.54
## 7  7     172      1.72
## 8  8     142      1.42


Coercion

We can coerce one data type into another using the as.* functions, e.g.

as.data.frame()
as.matrix()
as.vector()
as.numeric()

Don’t worry about these for now, but handy for your own studies one day
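For when that day comes, a minimal sketch of two of these functions in use, including a common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the numbers shown on screen. The object names and values are made up for illustration.

```r
# Text that looks like a number becomes a real number
num_from_text <- as.numeric("3.14")
num_from_text + 1

# Pitfall: a factor whose labels look like numbers
doses <- factor(c("10", "20", "10"))
dose_codes  <- as.numeric(doses)                # internal codes: 1 2 1
dose_values <- as.numeric(as.character(doses))  # the actual values: 10 20 10

dose_codes
dose_values

# A matrix can be turned into a data frame
mat_df <- as.data.frame(matrix(1:6, nrow = 2))
dim(mat_df)
```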


Merging 2 (or more) data frames

If you have 2 data frames that share a common field, e.g. subject IDs, we can merge them together using merge()
This is particularly useful for longitudinal datasets


Let’s make another data set

Species <- c("setosa", "versicolor", "virginica")
Colours <- c("red", "blue", "violet")
flowerCols <- data.frame(Species, Colours)
flowerCols

##      Species Colours
## 1     setosa     red
## 2 versicolor    blue
## 3  virginica  violet


Now let’s merge them

irisMerge <- merge(iris, flowerCols)
head(irisMerge, 5)

##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
##   Colours
## 1     red
## 2     red
## 3     red
## 4     red
## 5     red
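By default merge() joins on the columns the two data frames have in common and silently drops rows that have no match. A minimal sketch of making the join column explicit with by, and of keeping unmatched rows with all.x = TRUE (both are standard merge() arguments). The lookup table partialCols is a made-up example that deliberately leaves out virginica.

```r
# Hypothetical lookup table covering only two of the three species
partialCols <- data.frame(Species = c("setosa", "versicolor"),
                          Colours = c("red", "blue"))

# Default behaviour: the 50 virginica rows have no match and are dropped
inner <- merge(iris, partialCols, by = "Species")
nrow(inner)

# all.x = TRUE keeps every row of iris; unmatched colours become NA
left <- merge(iris, partialCols, by = "Species", all.x = TRUE)
nrow(left)
sum(is.na(left$Colours))
```

Checking nrow() before and after a merge is a cheap way to spot rows that went missing.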


Writing our own function

We can write our own functions when needed
We use the function() function
We must remember to assign it to a name, otherwise we can’t use it

myFun <- function(arguments) {
  # do something
}


E.g. $f(x) = e^x + x^2 + 1$

fx <- function(x) {
  exp(x) + x^2 + 1
}
fx(5)

## [1] 174.4132

fx(seq(3, 12, 3))

## [1] 30.08554 440.42879 8185.08393 162899.79142


Comments

Notice that anything written after a hash-symbol is ignored by R
Use this to annotate your R scripts to remember what you are doing

# Graeme thinks R is great!
# R will ignore all of this


Help with functions

There are thousands of functions in R
Some are loaded on launch of R (e.g. mean, seq, dim)
Others require packages to be loaded first
If you know the name of a function, you can use the ? operator to access the help file, e.g.

?sd


Help with functions

You can also use the search bar in the R Studio software
If you don’t know the name of the function, try the help.search() function for a list of possible candidates

help.search("sequences")

When all else fails: Google it!


Conditional statements and loops


Introduction

Inherent to all programming languages are conditional statements and loops
We require them when we need to make complex rules
It would require a more advanced tutorial to fully appreciate their power
If interested to learn more, see references at end


Conditional statements

The if statement, which is technically a function, only does something if its condition is TRUE, e.g.

if(3 < 4) {3 + 3}

## [1] 6

if(3 > 4) {2 + 2}

Also, look up while and else using help.search()
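A minimal sketch of else and while in use, together with ifelse(), the base R function that applies a condition to every element of a vector at once. The variable names and the "tall"/"short" labels are made up for illustration.

```r
x <- 3

# if ... else: choose between two actions
if (x > 4) {
  size <- "big"
} else {
  size <- "small"
}
size

# while: repeat until the condition becomes FALSE
count <- 0
while (count < 5) {
  count <- count + 1
}
count

# ifelse: the same kind of test, applied to every element of a vector
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)
group <- ifelse(heights > 160, "tall", "short")
group
```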


Loops

We might want to sequentially do something, conditional on something else
For example, let $Y_i = Y_{i-1} + i/10$ with $Y_1 = 0$
Calculate $Y = \sum_{i=1}^{20} Y_i$

E.g. $0 + (0 + 2/10) + (2/10 + 3/10) + \ldots + (2/10 + 3/10 + \ldots + 20/10)$


Yi <- 0 # Start with Y_1
Y <- Yi # Running total, so far just Y_1
for(i in 2:20) {
  Yi <- Yi + i/10 # Calculate Y_i from Y_{i-1}
  Y <- Y + Yi # Cumulative sum
}
Y # Solution

## [1] 152


Task

For each value in our heights vector earlier, how can we calculate the difference between it and the previous one, i.e. calculate $\text{heights}_i - \text{heights}_{i-1}$?


Solution

d <- 0
for(i in 2:length(heights)) {
  d[i] <- heights[i] - heights[i-1]
}
d

## [1] 0 -22 -17 55 -22 -24 18 -30
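Many loops of this kind have a ready-made vectorised equivalent in base R. A minimal sketch using diff(), which returns the differences between consecutive elements; the leading 0 is added by hand to match the loop's output.

```r
heights <- c(184, 162, 145, 200, 178, 154, 172, 142)

# diff() returns heights[i] - heights[i-1] for i = 2, ..., 8
d <- c(0, diff(heights))
d
```

The loop is still worth understanding, since not every calculation has a built-in shortcut.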


Reading and writing files


Reading

You want to get your data into R
Data comes in lots of different formats; luckily R can handle almost all of them!
Most use packages – we’ll explore these later


read.csv

The simplest way is to convert your data to a comma separated value (*.csv) file and use

my.data <- read.csv(file.choose())

Instead of writing file.choose() we could have specified the file location
Look at the help file for more customization settings


read.xlsx

Converting our data to CSV format is a pain!
We can use the xlsx package instead

library("xlsx")
my.data <- read.xlsx(file.choose(), sheetIndex = 1)


foreign

What if our data is in a Stata, SPSS, SAS, etc. file?
We can use the foreign package instead, e.g.

library("foreign")
my.data <- read.spss(file.choose())


Data on the web

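What if our data is in the cloud?
We can use the utils package function download.file to fetch a local copy first, e.g.

library("utils")
download.file("http://www.liv.ac.uk/dat.csv", destfile = "dat.csv") # Save a local copy
my.data <- read.csv("dat.csv") # Then read it in as usual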

Useful if you share your data on a public Dropbox folder


Other formats

If data exists, R can read it in
Just need to find the right package


Writing

Usually as simple as changing the read. to write.

Need to specify:

1 What object we want to save
2 A name for the file we will save

write.csv(iris, "IrisData.csv")
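A minimal sketch of a full round trip: write a data frame out and read it straight back in to check nothing was lost. It writes to R's temporary directory (via the base function tempdir()) so nothing in your working folder is touched; row.names = FALSE is a standard write.csv() argument that stops the row numbers being saved as an extra column. The object names are illustrative.

```r
# Build a file path inside R's temporary directory
out_file <- file.path(tempdir(), "IrisData.csv")

# Write the data, without the row numbers as an extra column
write.csv(iris, out_file, row.names = FALSE)

# Read it back and check that the shape and column names are unchanged
iris_back <- read.csv(out_file)
dim(iris_back)
names(iris_back)
```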


Graphics


Introduction

R has 3 primary graphics packages [2]:

1 Base R - those built into R
2 lattice - a functional extension of the base graphics (requires a package)
3 ggplot2 - built on the grammar of graphics (requires a package)

All called using functions, and typically have lots of optional arguments for customization of figures

[2] I will discuss packages shortly.

Plot

The plot function can be applied to most data objects

Alternatively, one can give it two arguments:

x - x-axis coordinates
y - y-axis coordinates

Can also specify arguments to: label the axes; colour the points; etc. See: ?plot


plot(iris)

[Figure: scatterplot matrix of the iris data, with panels for Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]

Task

How can I plot the sepal length against the petal length of the iris data, andcolour the points by species?

Hint: ?as.numeric + ?plot

Graeme L. Hickey An Introduction to R 31st October 2014 99 / 125

Graphics

Solution

plot(x = iris$Sepal.Length, y = iris$Petal.Length,
     col = as.numeric(iris$Species))

[Figure: scatterplot of iris$Petal.Length (y-axis) against iris$Sepal.Length (x-axis)]

Histograms

hist(iris$Petal.Length,
     col = "grey", xlab = "Petal length (cm)")

[Figure: "Histogram of iris$Petal.Length"; x-axis: Petal length (cm), y-axis: Frequency]

Boxplots

boxplot(Petal.Length ~ Species, data = iris)

[Figure: boxplots of Petal.Length for setosa, versicolor and virginica]

ggplot2

Flexible publication quality graphics

library("ggplot2")


ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length),
       data = iris) +
  geom_point() + geom_smooth() +
  facet_wrap(~ Species, scales = "free")

[Figure: Petal.Width (y-axis) against Petal.Length (x-axis) in three panels (setosa, versicolor, virginica), with a colour scale for Sepal.Length]

Saving figures

In R Studio, click Export then select file type
In R, right click + Save
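Figures can also be saved from within a script, which keeps the analysis reproducible. A minimal sketch using the base R graphics device pdf() and dev.off(); the file name and the 6 x 4 inch size are arbitrary choices for illustration, and the file is written to R's temporary directory.

```r
# Hypothetical output location inside R's temporary directory
out_file <- file.path(tempdir(), "petal_hist.pdf")

pdf(out_file, width = 6, height = 4)  # open a PDF file to draw into
hist(iris$Petal.Length,
     col = "grey", xlab = "Petal length (cm)")
dev.off()                             # close the file so it is written to disk

file.exists(out_file)
```

Everything drawn between pdf() and dev.off() goes into the file rather than onto the screen.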

Packages

Introduction

Packages are like apps
Many state-of-the-art statistical methods are published in journals with R packages
A large number of books are also released with specific R packages, e.g. time series, prognostic modelling, survival regression, ...
Packages are published:
  On CRAN
  GitHub
  Bioconductor
  Code files on personal websites (not really packages per se)

Where to find relevant packages

There are thousands of R packages!
Books, journal articles (look at what is reported in the Statistical Analysis section), word of mouth, your local friendly statistician
www.rseek.org

What packages are already installed?

R comes with a number of packages pre-installed
We can see all of them, plus any we have installed ourselves

library() # No arguments!

or click the Packages tab if working from RStudio
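To check for a package from code rather than by eye, the base function installed.packages() can be used. This is a small sketch: the variable name pkgs is arbitrary, and "stats" is used because it ships with every R installation.

```r
# installed.packages() returns a matrix with one row per installed package
pkgs <- installed.packages()

head(rownames(pkgs))          # the first few package names
"stats" %in% rownames(pkgs)   # TRUE if the named package is installed
```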

Installing a package from CRAN

Once you have identified the package you want, run:

install.packages("ggplot2")

Loading a package

R does not automatically load all installed packages, as this would slow your computer down and can also cause clashes

When you know what package(s) you want to use, run:

library("ggplot2") # Load ggplot2 package
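Installing and loading can be combined so that a script installs a package only when it is missing. The helper load_or_install below is hypothetical (it is not part of R or of these slides) and is only a sketch of one common pattern. It is demonstrated on "stats", which ships with R, so that nothing is downloaded.

```r
# Hypothetical helper: install a package from CRAN only if it is missing,
# then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

load_or_install("stats")   # already installed, so it is simply loaded

# The package::function form calls a function from a named package,
# which sidesteps name clashes between loaded packages
stats::median(c(1, 5, 9))
```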

Statistics

Introduction

In R, a statistical method is just another term for a function
Beware of applying statistical methods to your data without first understanding what they do: G-I-G-O (garbage in, garbage out)!

Student’s t-test

t.test(x = heights, mu = 180)

##
##  One Sample t-test
##
## data:  heights
## t = -1.8121, df = 7, p-value = 0.1129
## alternative hypothesis: true mean is not equal to 180
## 95 percent confidence interval:
##  150.3245 183.9255
## sample estimates:
## mean of x
##   167.125

Notice that we applied a function to two arguments:

1 Some data
2 A number

And it gave us back some statistics and a P-value
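The result of t.test() can also be stored and its pieces extracted by name. In this sketch the vector heights_example is made-up data, not the heights used on the slide, so the numbers it gives will differ from the output shown there.

```r
# Hypothetical data: these are NOT the heights used on the slide
heights_example <- c(172, 165, 180, 158, 190, 175, 168, 161)

res <- t.test(x = heights_example, mu = 180)

res$statistic   # the t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # the P-value
res$conf.int    # 95% confidence interval for the mean
res$estimate    # the sample mean
```

Use names(res) to list every component that is available.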

Linear regression

The first argument of many statistical functions is a formula

A formula is written as:

The outcome variable on the LHS (one of the variables in your dataset)
A tilde symbol (~), which means "model onto"
The explanatory variables on the RHS (separated by a + if more than one)
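A formula is an R object in its own right, so it can be stored and inspected before it is passed to a modelling function. This is a small sketch: the names f and fit2, and the choice of Petal.Length as a second explanatory variable, are illustrative only.

```r
# Outcome on the left of ~, explanatory variables on the right, joined by +
f <- Petal.Width ~ Sepal.Length + Petal.Length

class(f)      # "formula"
all.vars(f)   # variable names: the outcome first, then the explanatory variables

# Passing the formula to lm() fits an intercept plus one coefficient
# per explanatory variable
fit2 <- lm(f, data = iris)
names(coef(fit2))
```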

fit <- lm(Petal.Width ~ Sepal.Length, data = iris)
fit

##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Coefficients:
##  (Intercept)  Sepal.Length
##      -3.2002        0.7529

Recall we can apply the summary function to different things?
We have just assigned our linear model to the name fit

summary(fit)

##
## Call:
## lm(formula = Petal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.96671 -0.35936 -0.01787  0.28388  1.23329
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -3.20022    0.25689  -12.46   <2e-16 ***
## Sepal.Length  0.75292    0.04353   17.30   <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.44 on 148 degrees of freedom
## Multiple R-squared:  0.669,  Adjusted R-squared:  0.6668
## F-statistic: 299.2 on 1 and 148 DF,  p-value: < 2.2e-16
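Individual numbers from the summary can be pulled out with standard extractor functions rather than read off the printout. This is a minimal sketch: the model is refitted so the block runs on its own, and the sepal length of 6 cm used for the prediction is an arbitrary illustrative value.

```r
fit <- lm(Petal.Width ~ Sepal.Length, data = iris)

coef(fit)                # intercept and slope as a named vector
confint(fit)             # 95% confidence intervals for the coefficients
summary(fit)$r.squared   # multiple R-squared

# Predicted petal width for a flower with a sepal length of 6 cm
predict(fit, newdata = data.frame(Sepal.Length = 6))
```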

We can apply other functions to fit also, e.g.

plot(fit)

[Figure: four diagnostic plots for fit: Residuals vs Fitted; Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location (standardized residuals against fitted values); Cook's distance against observation number]

Probability

Most routine statistical distributions are built-in, e.g. Gaussian (norm), Binomial (binom), Poisson (pois), ...

We use a prefix:

d to get the density
p to get the cumulative distribution function
q to get the quantile function
r to sample random values from the distribution

E.g.

# Sample 5 values from a N(2, 9) distribution
rnorm(5, mean = 2, sd = 3)

## [1] -2.370804 3.118035 -2.090598 2.051935 3.879390

# Probability of tossing 10 heads in a row, i.e. P(no tails in 10 tosses)
pbinom(0, size = 10, prob = 0.5)

## [1] 0.0009765625
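For comparison, the four prefixes can be tried side by side on the standard normal distribution. This is a short sketch: the seed and the evaluation points are arbitrary choices made for illustration.

```r
dnorm(0)       # density of N(0, 1) at 0
pnorm(1.96)    # P(X <= 1.96) for X ~ N(0, 1)
qnorm(0.975)   # the value x for which P(X <= x) = 0.975
set.seed(1)    # arbitrary seed so the random draws are reproducible
rnorm(3)       # three random draws from N(0, 1)

# The p and q functions are inverses of each other
pnorm(qnorm(0.975))

# dbinom gives P(exactly 10 heads in 10 tosses) directly; for a fair coin
# this equals pbinom(0, size = 10, prob = 0.5)
dbinom(10, size = 10, prob = 0.5)
```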

Wrapping up

Saving your workspace

You can save everything you have done using File -> Save (RStudio will prompt you when exiting)
I never save my workspace
Why? Because I save the R Script (copy & paste)
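Objects can also be saved and restored from code. This is a minimal sketch using the base functions save() and load(). The file name ws_file and the object x are illustrative; save.image() works in the same way but saves the whole workspace.

```r
# tempfile() is used only so the sketch runs anywhere; choose your own path
ws_file <- tempfile(fileext = ".RData")

x <- c(1, 2, 3)
save(x, file = ws_file)   # save selected objects to a file

rm(x)                     # remove x from the session ...
load(ws_file)             # ... and restore it from the file
x
```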

Where to learn more

Venables WN, Smith DM (2014). An Introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Shahbaba B (2012). Biostatistics with R. Springer, NY.
DataCamp online course: https://www.datacamp.com/courses/
Ask me, Peter, Elisabeth or Helen!

Further IGH Statistical Seminars

Other sessions in this series will make use of R

Introduction to Time Series (5th Dec 2014)
Regression Modelling (Jan 2015)
Time-to-Event Analysis (Feb 2015)
Geostatistical Methods for Disease Prevalence Mapping (March 2015)
Statistical Power + Sample Size Calculations (April 2015)
Quantile Regression (May 2015)
