A short tutorial on R for data science Bowei Chen School of Computer Science University of Lincoln 2016 - 2017


A short tutorial on R for data science

Bowei Chen

School of Computer Science

University of Lincoln

2016 - 2017


Preface

This short tutorial gives a practical introduction to R for data science programming. It is aimed at undergraduate students and practitioners who have no background or experience in data science or statistics. Note that the tutorial focuses on teaching basic R data programming skills from scratch rather than data science algorithms.

The tutorial is based on several open-source materials from the R community (see the key references section for details). It has been used in the workshops of the Data Science module in the School of Computer Science at the University of Lincoln, UK. The content takes around 8 hours of study. Thanks to the module demonstrators Deema Abdal Hafeth and Jingmin Huang, who provided help with the exercises.


Table of contents

1. Introduction

2. Data structures

3. Data I/O

4. Graphics

5. Handling and processing strings

6. Linear regression

7. Logistic regression

8. Other topics


What is R?

• R is a free software environment for statistical computing and graphics.
• R compiles and runs on a wide variety of UNIX platforms, Windows and macOS.
• R can be downloaded at: https://cran.r-project.org/

[Figure: the old and new R logos]


Comprehensive R Archive Network (CRAN)

• CRAN includes packages which provide additional functionality.
• Over 7,801 additional packages (as of January 2016) are available at CRAN, Bioconductor, Omegahat, GitHub, and other repositories.
• R packages are written mainly by academics and company staff.
• The R Foundation is seated in Vienna, Austria, and is currently hosted by the Vienna University of Economics and Business. It is a registered association under Austrian law and is active worldwide.


Short history of R (1/2)

• S is a statistical programming language developed primarily by John Chambers, Rick Becker and Allan Wilks at Bell Laboratories since 1976.

• The two modern implementations of S are:

– R: part of the GNU free software project

– S-PLUS (or S+): A commercial product sold by TIBCO Software


Short history of R (2/2)

• S-PLUS is a commercial implementation of the S programming language sold by TIBCO Software Inc.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers is a member. R is named partly after the first names of the first two R authors and partly as a play on the name of S.


What can you do using R? (1/2)

• Data entry and manipulation

– Input data

• from keyboard

• from spreadsheet

• from another statistics package

– Manipulate data

• Statistical analysis

– Descriptive statistics

– Statistical inference


What can you do using R? (2/2)

• Graphical display

– Predefined plots for some models

– Flexible, powerful options

– Save to image files in various formats

• Write new functions

– Make a change to an existing function

– Create new functions tailored to your exact needs

– Contribute a new package

• Create documents (with Sweave, knitr)

– PDF (article and slides)

– HTML


Why use R for statistical computing?

• Open source (R is a GNU implementation of S)
• Good visualisations (ggplot2, lattice, standard plot library)
• Easier for writing custom packages and functions
• Closer to the statistics and machine learning community
• Better LaTeX support (Sweave, knitr)
• Works with big data (RHadoop, SparkR, Rcpp)


[Figure: O’Reilly 2016 Data Science Salary Survey]


Limitations of R

• The quality of some packages is less than perfect. They are not error-free!
• Many R commands give little thought to memory management, and so R can very quickly consume all available memory. This can be a restriction when doing data mining. There are various solutions, including using 64-bit operating systems that can access much more memory than 32-bit ones.
• Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. However, some very high-standard books are increasingly plugging the documentation gaps.


RGui

When R is waiting for us to tell it what to do, it begins the line with >

Type:
• 'demo()' for some demos
• 'help()' for on-line help
• 'help.start()' for an HTML browser interface
• 'q()' to quit R


Editors and IDEs

• RStudio

• Jupyter Notebook

• Vim

• Emacs (ESS)

• Eclipse (StatET)

• Tinn-R

• …


https://www.rstudio.com/


RStudio panes and their keyboard shortcuts:
• R source editor (Ctrl+1)
• R console (Ctrl+2)
• Help (Ctrl+3)
• History (Ctrl+4)
• Files (Ctrl+5)
• Plots (Ctrl+6)
• Packages (Ctrl+7)
• Environment (Ctrl+8)


Objects

• Everything in R is an object, and every object has a class.
• Data and intermediate results are stored in R objects.
• The class of an object describes both what the object contains and what many standard functions do with it.
• Objects are usually accessed by name.
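As a small illustration of the points above, the class() function reports the class of an object. The object names n, s and f are made up for this sketch.

```r
# Hypothetical objects, used only for illustration
n <- 42L                    # an integer
s <- "data science"         # a character string
f <- function(x) x + 1      # a function is an object too

class(n)                    # "integer"
class(s)                    # "character"
class(f)                    # "function"
```

Objects are accessed by name: typing n at the prompt prints the stored value.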


R commands

• R commands are either assignments or expressions

• Commands are separated either by a semicolon ; or newline


Assignment operations

An assignment command evaluates an expression and passes the value to a variable but the result is not printed.

x <- 1+2
`<-`(x, 1+2) # same thing
x = 1+2 # same thing
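A minimal sketch of the difference between assigning a value and printing it. The variable names are arbitrary.

```r
x <- 1 + 2      # assigns silently; nothing is printed
x               # evaluating the name prints the value: 3
(y <- x * 2)    # wrapping the assignment in parentheses assigns and prints: 6
```

The parenthesised form is handy for checking an intermediate result without a separate print() call.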


Expression operations

An expression command is evaluated and (normally) printed. If the statement results in a value, R will print that value automatically.

> 1+2
[1] 3
> 1+2*3
[1] 7
> (1+2)*3
[1] 9

In R, any number that you print out in the console is interpreted as a vector. A vector is an ordered collection of numbers. The “[1]” means that the index of the first item displayed in the row is 1.


Workspace

• R stores objects in a workspace that is kept in memory.
• When you quit, R asks whether you want to save that workspace.
• The saved workspace, containing all the objects you worked on, can then be restored the next time you work with R, along with a history of the commands used.
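The workspace can also be inspected and saved by hand. The sketch below uses a temporary file (an assumption made so the example does not touch your own .RData file) and arbitrary object names.

```r
a <- 1:3
b <- "hello"
ls()                          # names of the objects currently in the workspace
rm(a)                         # remove a single object

ws_file <- tempfile(fileext = ".RData")  # temporary file, chosen only for this sketch
save(b, file = ws_file)       # write the object b to disk
rm(b)                         # b is now gone from memory
load(ws_file)                 # restore b from the file
b                             # "hello"
```

save.image() does the same for the whole workspace at once, which is what R offers to do when you quit.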


Variables (1/3)

A variable is a symbol that holds a value, which can be any R object.

The types of variables are:

• Integer

• Double

• Character

• Logical

• Factor or categorical
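The types listed above can be checked with typeof(). This is a small sketch with made-up values; note that a plain number such as 49 is stored as a double unless it carries the L suffix.

```r
typeof(49L)       # "integer"
typeof(49)        # "double"
typeof("text")    # "character"
typeof(TRUE)      # "logical"

# A factor is stored as integer codes plus a set of levels
g <- factor(c("low", "high", "low"))
typeof(g)         # "integer"
class(g)          # "factor"
levels(g)         # "high" "low"
```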


Variables (2/3)

Integer, double (numerical values)

> a = 49

> sqrt(a)

[1] 7

> a <- pi

> print(a)

[1] 3.141593

Character, string, logical

> a = "The dog ate my homework"

> sub("dog","cat",a)

[1] "The cat ate my homework"

> a = (1+1==3)

> a

[1] FALSE


Variables (3/3)

Factor

> a <- factor(c("H", "e", "l", "l", "o"))

> print(a)

[1] H e l l o

Levels: e H l o

> class(a)

[1] "factor"


Types of numerical variables (1/2)

When we use numerical objects, variables can be classified in mathematical terms as:

• Scalars

• Vectors

• Matrices

A scalar is a single number

> x <- 5

> Y <- 100


Types of numerical variables (2/2)

A vector is a sequence of numbers

> x <- c(3, 5, 2)

> x

[1] 3 5 2

A matrix is a two-way table of numbers

> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)

> x

     [,1] [,2]
[1,]    2    5
[2,]    3    6
[3,]    4    7


Variable names

• You can use simple variable names like x, y, A, and a (note that A and a are different variable names). You can also use longer names like counter, index1, or subject_id.
• A variable name can contain digits, but it cannot begin with a digit.
• Be careful not to reuse the names of built-in functions or symbols for your own variables! For example, you could create a variable named log, which makes code confusing to read; and if you assign a function to that name, it masks the built-in logarithm function until you remove it.
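A minimal sketch of how a name clash with a built-in function behaves in practice. When a name is used in a call, R looks for a function of that name, so a numeric variable called log does not stop log() from working, whereas a function stored under that name does mask it.

```r
log <- 5                      # a numeric variable that happens to be called log
log(100, base = 10)           # still 2: in a call, R looks for a function named log

log <- function(x) "not a logarithm"   # a function under the same name masks base::log
log(100)                      # "not a logarithm"
base::log(100, base = 10)     # the original is still reachable through its namespace

rm(log)                       # removing our object restores the normal behaviour
log(exp(1))                   # 1
```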


Comments

A comment is anything you write in your program code that is ignored by the computer. Comments help others understand your code. Anything following a “#” character is a comment in R.

> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.


Arithmetic operators

Addition                          +
Subtraction                       -
Multiplication                    *
Division                          /
Exponentiation                    ^ or **
Modulus (x mod y), 5%%2 is 1      x %% y
Integer division, 5%/%2 is 2      x %/% y
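The operators in the table can be tried directly at the console. The values of x and y below are arbitrary.

```r
x <- 17
y <- 5

x + y        # 22
x - y        # 12
x * y        # 85
x / y        # 3.4
2^3          # 8
2**3         # 8, ** is an alternative spelling of ^
x %/% y      # 3, integer division
x %% y       # 2, remainder
(-7) %% 5    # 3, the result takes the sign of the divisor
(-7) %/% 5   # -2, integer division rounds down
```

Integer division and modulus fit together: (x %/% y) * y + x %% y gives back x.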


Comparison operators

Equal ==

Not equal !=

Greater than >

Greater than or equal >=

Less than <

Less than or equal <=


Logical operators

x and y x & y

x or y x | y

Not x !x

Test if x is TRUE isTRUE(x)
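The logical operators above work element by element on vectors, while isTRUE() asks whether its argument is a single TRUE. A short sketch with made-up vectors:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y          # TRUE FALSE FALSE
x | y          # TRUE TRUE TRUE
!x             # FALSE TRUE FALSE
isTRUE(x)      # FALSE: x has three elements, so it is not a single TRUE
isTRUE(x[1])   # TRUE
```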


Numeric functions

Absolute value                        abs(x)
Square root                           sqrt(x)
ceiling(3.475) is 4                   ceiling(x)
floor(3.475) is 3                     floor(x)
round(3.475, digits=2) is 3.48        round(x, digits=n)
signif(3.475, digits=2) is 3.5        signif(x, digits=n)
Cosine, sine, tan, …                  cos(x), sin(x), tan(x)
Natural logarithm                     log(x)
Common logarithm                      log10(x)
Exponential of x                      exp(x)
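A few of these functions applied to concrete values, as a quick sketch. The input 3.475 is the one used in the table; the other numbers are arbitrary.

```r
x <- 3.475

abs(-x)                 # 3.475
ceiling(x)              # 4
floor(x)                # 3
signif(x, digits = 2)   # 3.5
sqrt(49)                # 7
log10(1000)             # 3
exp(log(10))            # 10, up to floating-point error
```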


Control structures: if

Syntax:

if (cond) { cmd1 }

> if (TRUE) {

+ "this will be printed if it is TRUE"

+ }

[1] "this will be printed if it is TRUE"


Control structures: if-else

Syntax:

if (cond) { cmd1 } else { cmd2 }

> if(1==0) {

+ print(1)

+ } else {

+ print(2)

+ }

[1] 2


Control structures: ifelse

Syntax:

ifelse(cond, yes, no)

> ifelse(1 == 0,
+ "this will be printed if 1==0",
+ "this will be printed if 1!=0")
[1] "this will be printed if 1!=0"
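Unlike if, which tests a single condition, ifelse() is vectorised: it tests every element of its first argument and returns a vector of the same length. A minimal sketch with hypothetical exam scores:

```r
scores <- c(35, 72, 50, 91)                       # hypothetical exam scores
result <- ifelse(scores >= 50, "pass", "fail")
result                                            # "fail" "pass" "pass" "pass"
```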


Control structures: for

Syntax:

for (var in seq) { expr }

> x <- c("a", "a", "a", "a", "a")

> for (i in x){

+ print(i)

+ }

[1] "a"

[1] "a"

[1] "a"

[1] "a"

[1] "a"


Control structures: repeat

Syntax:

repeat { expr; if (cond) break }

> i <- 10
> repeat {
+ if (i > 25)
+ break
+ else {
+ print(i); i <- i + 5;
+ }
+ }
[1] 10
[1] 15
[1] 20
[1] 25


Control structures: while

Syntax:

while (cond) { expr }

> i <- 10

> while (i <= 25) {

+ print(i); i <- i + 5

+ }

[1] 10

[1] 15

[1] 20

[1] 25


Control structures: switch

Syntax:

switch(expr, ...)

> AA = 'foo'
> switch(AA,
+ foo = {
+ print('foo') # case 'foo'
+ },
+ bar = {
+ print('bar') # case 'bar'
+ },
+ {
+ print('default')
+ })
[1] "foo"
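switch() is often wrapped in a function so that it returns a value rather than printing. In the sketch below, the function name describe_drink and its categories are made up; an empty case (beer = ,) falls through to the next non-empty one, and the final unnamed argument is the default.

```r
# Hypothetical helper: classify a drink name
describe_drink <- function(drink) {
  switch(drink,
         beer  = ,                  # falls through to the value for wine
         wine  = "alcoholic",
         water = "non-alcoholic",
         "unknown")                 # default when nothing matches
}

describe_drink("beer")    # "alcoholic"
describe_drink("water")   # "non-alcoholic"
describe_drink("tea")     # "unknown"
```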


Installing R and RStudio on your machine

• Download R from https://cran.r-project.org/

• Download RStudio at https://www.rstudio.com/


Exercise 1/10

demo(graphics)

demo(plotmath)

demo(Japanese)

demo(lm.glm)

demo(hclColors)


Exercise 2/10

x<-c(4,2,6)

y<-c(1,0,-1)

length(x)

sum(x)

sum(x^2)

x+y

x*y

x-2

x^2


Exercise 3/10

7:11

seq(2,9)

seq(4,10,by=2)

seq(3,30,length=10)

seq(6,-4,by=-2)


Exercise 4/10

rep(2,4)

rep(c(1,2),4)

rep(c(1,2),c(4,4))

rep(1:4,4)

rep(1:4,rep(3,4))


Exercise 5/10

c(T,T,F,F) & c(T,F,F,T)

!x

x <- seq(-3,3,length=200) > 0

1:3 + c(T,F,T)

intersect(1:10,5:15)

drinks <- factor(c("beer","beer","wine","water"))


Exercise 6/10

x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y);

print(z)

c(1, 2, 3, . . . , 19, 20)

x <- c(3,6,8); y <- c(2,5,1);

x[y>1.5]

x <- c(3,6,8); y <- c(2,5,1);

y[x==6]


Exercise 7/10

x <- 1:15
if (sample(x, 1) <= 10) {
  print("the sampled value is at most 10")
} else {
  print("the sampled value is greater than 10")
}

Clean all the variables (the workspace)
rm(list=ls())

Clean one variable
rm(x)


Exercise 8/10

x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {

print(i)

}

for (i in 1:4) {

print(x[i])

}

for (i in seq(x)) {

print(x[i])

}

for (i in 1:4) print(x[i])


Exercise 9/10

i <- 1

while (i < 10) {

print(i)

i <- i + 1

}


Exercise 10/10

z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)

x <- c(0.5, 0.7)

x <- c(TRUE, FALSE)

x <- c("a", "b", "c", "d", "e")

x <- 9:100

x <- c(1 + (0+0i), 2 + (0+4i))


R data structures


Vectors (1/3)

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Note that all the data in a vector must be of a single data type (numeric, character, or logical).

> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# a is a numeric vector,
# b is a character vector, and
# c is a logical vector
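Because a vector holds a single data type, c() silently converts mixed inputs to the most general type among them. A short sketch with made-up values:

```r
v1 <- c(1, "two", TRUE)   # everything becomes character
v1                        # "1" "two" "TRUE"
class(v1)                 # "character"

v2 <- c(1, TRUE, FALSE)   # logicals become numbers
v2                        # 1 1 0
class(v2)                 # "numeric"
```

This coercion is a common source of surprises when a numeric column is read in with one stray text entry.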


Vectors (2/3)

Scalars are one-element vectors.

> f <- 3

> x <- TRUE

> y <- 100.01


Vectors (3/3)

You can refer to elements of a vector using a numeric vector of positions within brackets.

> a <- c(1, 2, 5, 3, 6, -2, 4)

> a[3]

[1] 5

> a[c(1, 3, 5)]

[1] 1 5 6

> a[2:6]

[1] 2 5 3 6 -2
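Beyond positive positions, brackets also accept negative positions (to drop elements) and logical conditions (to filter). A brief sketch using the same vector a:

```r
a <- c(1, 2, 5, 3, 6, -2, 4)

a[-1]          # drop the first element: 2 5 3 6 -2 4
a[a > 3]       # keep the elements greater than 3: 5 6 4
which(a > 3)   # positions of those elements: 3 5 7
a[10]          # NA: there is no tenth element
```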


Matrices (1/4)

A matrix is a two-dimensional array where each element has the same data type (numeric, character, or logical). Matrices are created with the matrix() function.

mymatrix <- matrix(vector,
                   nrow=number_of_rows,
                   ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames,
                                 char_vector_colnames)
)


Matrices (2/4)

# Create a matrix from a vector

> vector <-c(1,2,3,4)

> foo <-matrix(vector, nrow=2, ncol=2)

> foo

     [,1] [,2]
[1,]    1    3
[2,]    2    4

# Create a 5x4 matrix

> y <- matrix(1:20, nrow=5, ncol=4)

> y

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20


Matrices (3/4)

Create a 2x2 matrix with labels and fill the matrix by rows

> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames) )
> mymatrix
   C1 C2
R1  1 26
R2 24 68

Create a 2x2 matrix with labels and fill the matrix by columns

> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68


Matrices (4/4)

You can identify rows, columns, or elements of a matrix, x, by using subscripts and brackets.

• x[i,] refers to the ith row

• x[,j] refers to jth column

• x[i,j] refers to the i,jth element

> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1]  2  4  6  8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
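Two details of matrix subscripting are worth trying out: selecting a single row or column returns a plain vector unless drop = FALSE is given, and a logical condition selects individual elements. A sketch using the same matrix x:

```r
x <- matrix(1:10, nrow = 2)

dim(x)                          # 2 5
x[, 2]                          # 3 4, returned as a plain vector
col2 <- x[, 2, drop = FALSE]    # keep the result as a matrix
dim(col2)                       # 2 1
x[x > 6]                        # elements greater than 6: 7 8 9 10
```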


Arrays (1/2)

Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, you’ll use arrays.

myarray <- array(vector, dimensions, dimnames)


Arrays (2/2)

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24
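Array elements are selected in the same way as matrix elements, with one subscript per dimension, by position or by name. A minimal sketch that rebuilds the array z from above:

```r
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

z["A1", "B2", "C3"]    # 15, selected by name
z[2, 3, 4]             # 24, selected by position
dim(z[, , "C1"])       # 2 3: one slice of the array is a matrix
```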


Data frame (1/4)

A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.). A data frame is created with the data.frame() function. It’s similar to the datasets you’d typically see in SAS, SPSS, Stata, and Python (pandas).

Each column must have only one data type, but you can put columns of different data types together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames.

mydata <- data.frame(col1, col2, col3,…)


Data frame (2/4)

> patientID <- c(1, 2, 3, 4)

> age <- c(25, 34, 28, 52)

> diabetes <- c("Type1", "Type2", "Type1", "Type1")

> status <- c("Poor", "Improved", "Excellent", "Poor")

> patientdata <- data.frame(patientID, age, diabetes, status)

> patientdata

  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor


Data frame (3/4)

Accessing data frame elements is straightforward. Columns can be accessed by name.

> patientdata$patientID

[1] 1 2 3 4

> patientdata$diabetes

[1] Type1 Type2 Type1 Type1

Levels: Type1 Type2

> patientdata$status

[1] Poor Improved Excellent Poor

Levels: Excellent Improved Poor
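The "Levels:" lines in the output above appear because the character columns were stored as factors. From R 4.0.0 onwards, data.frame() keeps character columns as character by default, so the sketch below passes stringsAsFactors = TRUE explicitly to reproduce that behaviour. It also shows selection of rows and columns with brackets; the name older is made up for this example.

```r
patientID <- c(1, 2, 3, 4)
age <- c(25, 34, 28, 52)
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status <- c("Poor", "Improved", "Excellent", "Poor")

# stringsAsFactors = TRUE converts the character columns to factors
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = TRUE)

levels(patientdata$status)       # "Excellent" "Improved" "Poor"
patientdata[["age"]]             # 25 34 28 52, same as patientdata$age

# Rows where age exceeds 30, keeping two columns
older <- patientdata[patientdata$age > 30, c("patientID", "age")]
older
```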


Data frame (4/4)

To cross-tabulate diabetes type by status, use the table() function.

> table(patientdata$diabetes, patientdata$status)
        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0


Some useful functions for data frame (1/7)

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable.

> summary(mtcars$mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mtcars$mpg, mtcars$disp)

> plot(mtcars$mpg, mtcars$wt)

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)


Some useful functions for data frame (2/7)

The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely.

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)


Some useful functions for data frame (3/7)

The limitations of this approach become evident when more than one object has the same name.

Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn’t what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements.

> mpg <- c(25, 36, 47)

> attach(mtcars)

The following object is masked _by_ .GlobalEnv:

mpg

> plot(mpg, wt)

Error in xy.coords(x, y, xlabel, ylabel, log) :

'x' and 'y' lengths differ


Some useful functions for data frame (4/7)

An alternative approach is to use the with() function. In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts here. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.

> with(mtcars, {
+ print(summary(mpg))
+ plot(mpg, disp)
+ plot(mpg, wt)
+ })


Some useful functions for data frame (5/7)

The limitation of the with() function

is that assignments will only exist

within the function brackets.

> with(mtcars, {

+ stats <- summary(mpg)

+ stats

+ })

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.43 19.20 20.09 22.80 33.90

> stats

Error: object 'stats' not found


Some useful functions for data frame (6/7)

If you need to create objects that will

exist outside of the with() construct,

use the special assignment operator <<- instead of the standard one <-. It will

save the object to the global

environment outside of the with() call.

> with(mtcars, {

+ nokeepstats <- summary(mpg)

+ keepstats <<- summary(mpg)

+ })

> nokeepstats

Error: object 'nokeepstats' not found

> keepstats

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.43 19.20 20.09 22.80 33.90
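As an alternative to <<-, and only as a sketch that the slides do not show, the value returned by with() can be captured with an ordinary assignment. with() returns the value of the last expression it evaluates, so nothing has to be written into the global environment from inside the call.

```r
# with() returns the value of its last expression,
# so the result can be stored with the standard <- operator
keepstats <- with(mtcars, summary(mpg))
keepstats
```

This keeps the assignment visible at the top level, which is often easier to follow than a <<- buried inside a block.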


Some useful functions for data frame (7/7)

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa


Factors (1/3)

Categorical (nominal) and ordered

categorical (ordinal) variables in R are

called factors.

The function factor() stores the

categorical values as a vector of integers

in the range [1... k] (where k is the

number of unique values in the nominal

variable), and an internal vector of

character strings (the original values)

mapped to these integers.

> diabetes <- c("Type1", "Type2", "Type1", "Type1")

> diabetes

[1] "Type1" "Type2" "Type1" "Type1"
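The integer codes and the character labels described above can be inspected directly. The following is a minimal sketch; diabetes_f is a name chosen here for illustration only.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
diabetes_f <- factor(diabetes)   # diabetes_f is a hypothetical name

levels(diabetes_f)       # the character labels, sorted alphabetically
as.integer(diabetes_f)   # the integer codes that index into the levels
```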


Factors (2/3)

> patientID <- c(1, 2, 3, 4)

> age <- c(25, 34, 28, 52)

> diabetes <- c("Type1", "Type2", "Type1", "Type1")

> status <- c("Poor", "Improved", "Excellent", "Poor")

> diabetes <- factor(diabetes)

> status <- factor(status, ordered = TRUE)

> patientdata <- data.frame(patientID, age, diabetes, status)

> str(patientdata)

'data.frame': 4 obs. of 4 variables:

$ patientID: num 1 2 3 4

$ age : num 25 34 28 52

$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1

$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3


Factors (3/3)

> summary(patientdata)

patientID age diabetes status

Min. :1.00 Min. :25.00 Type1:3 Excellent:1

1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1

Median :2.50 Median :31.00 Poor :2

Mean :2.50 Mean :34.75

3rd Qu.:3.25 3rd Qu.:38.50

Max. :4.00 Max. :52.00
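In the example above the ordered factor status takes its level order from the alphabet (Excellent < Improved < Poor), which is probably not the intended ranking. A minimal sketch of one way to state the order explicitly, assuming the ranking runs from worst to best:

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Spell out the order instead of relying on alphabetical sorting
# (the worst-to-best ranking is an assumption made for this sketch)
status <- factor(status, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

levels(status)
status[1] < status[2]   # comparisons follow the stated level order
```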


Lists (1/2)

Lists are the most complex of the R data

types. Basically, a list is an ordered

collection of objects (components). A list

allows you to gather a variety of (possibly

unrelated) objects under one name.

mylist <- list(object1, object2, …)

mylist <- list(name1=object1, name2=object2, …)


Lists (2/2)

> g <- "My First List"

> h <- c(25, 26, 18, 39)

> j <- matrix(1:10, nrow=5)

> k <- c("one", "two", "three")

> mylist <- list(title=g, ages=h, j, k)

> mylist

$title

[1] "My First List"

$ages

[1] 25 26 18 39

[[3]]

[,1] [,2]

[1,] 1 6

[2,] 2 7

[3,] 3 8

[4,] 4 9

[5,] 5 10

[[4]]

[1] "one" "two" "three"

> mylist[[2]]

[1] 25 26 18 39

> mylist[["ages"]]

[1] 25 26 18 39
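The difference between the three ways of reaching into a list is easy to miss, so here is a small sketch using a cut-down version of the list above. Single brackets return a sub-list, while double brackets and $ return the component itself.

```r
mylist <- list(title = "My First List", ages = c(25, 26, 18, 39))

class(mylist["ages"])     # single brackets keep the list wrapper
class(mylist[["ages"]])   # double brackets extract the component itself
mean(mylist$ages)         # $ is shorthand for [[ ]] with a name
```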


Exercise 1/10

# Declare variables of different types

my_numeric <- 42

my_character <- "universe"

my_logical <- FALSE

# Check class of my_numeric

class(my_numeric)

# Check class of my_character

class(my_character)

# Check class of my_logical

class(my_logical)


Exercise 2/10

# Vector operations

a) Create a vector like 1, 2, 3, ..., 10

b) Get the length of the above vector

c) Get the last three numbers from the vector

d) Sort the numbers in decreasing order

e) Remove the number 9 from the above vector


Exercise 3/10

# Vector operations

a) Create a vector from 1 to 3.1415 with the length of 100

b) Create a vector from -2 to 0.1 with the length of 100

c) Get the sum and inner product of a and b


Exercise 4/10

# Vector operations

a) Create a vector x that contains 2, 3, 4, 1

b) Create a vector y that contains 1, 1, 3, 7

c) Combine x and y as column vectors


Exercise 5/10

# Vector operations

Use rep() function to create the following vectors:

a) “0” “x” “0” “x” “0” “x”

b) 1 3 2 1 3 2 1 3 2 1 3 2

c) 1 1 1 2 2 2 3 3 3


Exercise 6/10

# Matrix operations

a) Create a matrix which contains values from 1 to 100 with 5 rows and 20 columns

b) Print out the dimensions of the matrix

c) Find out the 4th column’s sum

d) Find out the sum of row 3 and column 17

e) Assign the following names to the rows:

“A”, “B”, “C”, “D”, “E”


Exercise 7/10

# Matrix operations

a) Use matrix() function to create the following matrix:

         TypeA TypeB TypeC

Navarra    190     8    22

Zaragoza   191     4   1.7

Madrid     223    80   2.0

b) Add the following column into the matrix:

TypeD

2.00

3.50

2.75

c) Use apply() function to calculate the means of each column of the matrix


Exercise 8/10

# Array operations

Create the following array:

, , 1

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

, , 2

[,1] [,2] [,3]

[1,] 10 13 16

[2,] 11 14 17

[3,] 12 15 18

, , 3

[,1] [,2] [,3]

[1,] 19 22 25

[2,] 20 23 26

[3,] 21 24 27


Exercise 9/10

# Data frame operations

Type df <- iris, then

a) Print out the dimensions of df

b) Find out the sum of “Sepal.Width” column

c) Rename column “Species” as “label”

d) Find out how many records have “Petal.Length” larger than 1.41


Exercise 10/10

# List operations

Create the following list and save it to the variable x:

[[1]]

[1] 2 3 5

[[2]]

[1] "aa" "bb" "cc" "dd" "ee"

[[3]]

[1] TRUE FALSE TRUE FALSE FALSE

[[4]]

[1] 3


Table of contents

1. Introduction

2. Data structures

3. Data I/O

4. Graphics

5. Handling and processing strings

6. Linear regression

7. Logistic regression

8. Other topics


Sources of data for R


Entering data from the keyboard

Perhaps the simplest method of data entry

is from the keyboard. The edit() function

in R will invoke a text editor that will allow

you to enter your data manually.

> mydata <- data.frame(age = numeric(0),

+ gender = character(0),

+ weight = numeric(0))

> mydata <- edit(mydata)


Importing data from Excel

Many R packages allow you to import data from Excel. For example:

openxlsx

XLConnect

xlsx

…

Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format.

> install.packages("openxlsx")

> library("openxlsx")

> df <-

+ read.xlsx(

+ "PublicHealthEnglandDataTableDistrict.xlsx",

+ sheet = 1,

+ startRow = 1,

+ colNames = TRUE

+ )


Importing data from a delimited text file (1/2)

You can import data from delimited text files using read.table(), a function that

reads a file in table format and saves it as a data frame.

> mydataframe <- read.table(file, header = logical_value,

+ sep = "delimiter",

+ row.names = "name")

where file is a delimited ASCII file, header is a logical value indicating whether

the first row contains variable names (TRUE or FALSE), sep specifies the delimiter

separating data values, and row.names is an optional parameter specifying one or more

variables to represent row identifiers.
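To make the arguments concrete, here is a self-contained sketch that writes a tiny comma-delimited file to a temporary location and reads it back. The file contents and the names tmp and scores are made up for illustration.

```r
# Write a tiny comma-delimited file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "a1,Ann,90",
             "b2,Bob,85"), tmp)

# header = TRUE   : the first row holds the variable names
# sep = ","       : values are separated by commas
# row.names = "id": use the id column as row identifiers
scores <- read.table(tmp, header = TRUE, sep = ",", row.names = "id")
scores
str(scores)
```

Because id is used for the row names, it no longer appears as a regular column in the resulting data frame.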


Importing data from a delimited text file (2/2)

> file <- paste0(path, '/AMZN.csv')

> df <- read.table(file, header = TRUE, sep = ",")

> head(df)

Date Open High Low Close Volume Adj.Close

1 2016-03-04 581.07 581.40 571.07 575.14 3405100 575.14

2 2016-03-03 577.96 579.87 573.11 577.49 2736700 577.49

3 2016-03-02 581.75 585.00 573.70 580.21 4576900 580.21

4 2016-03-01 556.29 579.25 556.00 579.04 5014400 579.04

5 2016-02-29 554.00 564.81 552.51 552.52 4013400 552.52

6 2016-02-26 560.12 562.50 553.17 555.23 4858200 555.23


Importing data from XML

> # install and load the necessary package

> install.packages("XML")

> library(XML)

> xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

> xmlfile <- xmlTreeParse(xml.url)

> class(xmlfile)

[1] "XMLDocument" "XMLAbstractDocument"

> xmltop = xmlRoot(xmlfile)

> plantcat <- xmlSApply(xmltop, function(x) { xmlSApply(x, xmlValue) } )

> # Finally, get the data in a data-frame and have a look at the first rows and columns

> plantcat_df <- data.frame(t(plantcat),row.names = NULL)

> plantcat_df[1:5,1:4]


Importing data from R package

> library(MASS)

> data()

> data(phones)

> phones

$year

[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73

$calls

[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9 16.1

[14] 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0


Importing data from other sources

• Importing SPSS files into R

• Importing Stata files into R

• Importing SAS files into R

• Importing Minitab files into R

• Importing Matlab files into R

• …


Importing data in RStudio (1/2)


Importing data in RStudio (2/2)


Writing data frame into csv or txt files

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",

eol = "\n", na = "NA", dec = ".", row.names = TRUE,

col.names = TRUE, qmethod = c("escape", "double"),

fileEncoding = "")

write.csv(...)

write.csv2(...)
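The signature above is easier to read alongside a round trip. The following sketch writes a small made-up data frame with write.csv() and reads it back; the names cases, out and cases_back are chosen here for illustration.

```r
# A small made-up data frame
cases <- data.frame(ID = 1:3, Number = c(10, 20, 30))

out <- tempfile(fileext = ".csv")

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(cases, file = out, row.names = FALSE)

# Read the file back to check what was written
cases_back <- read.csv(out)
cases_back
```

write.csv2() follows the same pattern but uses a semicolon as the separator and a comma as the decimal mark.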


Useful functions for working with data objects (1/2)

Number of elements/components       length(object)

Dimensions of an object             dim(object)

Structure of an object              str(object)

Class or type of an object          class(object)

How an object is stored             mode(object)

Names of components in an object    names(object)

Combines objects into a vector      c(object, object, ...)

Combines objects as columns         cbind(object, object, ...)

Combines objects as rows            rbind(object, object, ...)

Prints the object                   object


Useful functions for working with data objects (2/2)

Lists the first part of the object      head(object)

Lists the last part of the object       tail(object)

Lists current objects                   ls()

Deletes one or more objects             rm(object, object, ...)

Edits object and saves as new object    newobject <- edit(object)


The drop= argument

By default, subscripting operations reduce

the dimensions of an array

whenever possible. To avoid that, we can

use the drop=FALSE argument

> mat <- matrix(1:12, 3, 4, byrow = TRUE)

> mat

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> s1 <- mat[1,]; s1

[1] 1 2 3 4

> dim(s1)

NULL

> s2 <- mat[1,,drop=FALSE]; s2

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

> dim(s2)

[1] 1 4
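The same behaviour applies to data frames, which is a common source of surprises when a single column is selected. A short sketch using the built-in iris data (col_vec and col_df are names chosen for illustration):

```r
col_vec <- iris[, "Sepal.Length"]                 # collapses to a numeric vector
col_df  <- iris[, "Sepal.Length", drop = FALSE]   # stays a one-column data frame

class(col_vec)
class(col_df)
dim(col_df)
```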


Combined selection

Suppose we want to get all the columns for which the element at the first row is less than 3:

> mat <- matrix(1:12, 3, 4, byrow = TRUE)

> mat

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> mycols <- mat[1,] < 3; mycols

[1] TRUE TRUE FALSE FALSE

> mat[ , mycols, drop=FALSE]

[,1] [,2]

[1,] 1 2

[2,] 5 6

[3,] 9 10


Using SQL statements to manipulate data frames

# install the package

> install.packages("sqldf")

> library(sqldf)

> newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE)

> newdf

mpg cyl disp hp drat wt qsec vs am gear carb

Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1

Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1

Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1

Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1

Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
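For comparison, the same selection can be sketched in base R without the sqldf package, using logical indexing and order(). This is an equivalent written for illustration, not part of the original slides; carb1 and newdf_base are hypothetical names.

```r
# Keep the rows with carb equal to 1 ...
carb1 <- mtcars[mtcars$carb == 1, ]

# ... then sort them by mpg in increasing order
newdf_base <- carb1[order(carb1$mpg), ]
newdf_base
```

The SQL form may read more naturally to people coming from databases, while the base R form avoids an extra dependency.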


Exercise 1/8

1) Create a vector x representing the numbers from 1 to 11

2) Save x into the x.RData file

3) Remove the object x from R workspace

4) Import the x.RData file into R and save it to x.

Please google it


Exercise 2/8

1) Create the following data frame dfA

ID Case Number

1 case1 10

2 case2 20

3 case3 30

2) Save dfA into the dfA.csv file


Exercise 3/8

1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard

2) Import the dataset into R using openxlsx package

3) Show the first and last 20 lines of the dataset, respectively

4) Obtain the column names of the dataset

5) Create a new data frame which has the same column names of the dataset and has

the first and last 20 lines of the dataset


Exercise 4/8

1) Download the file AMZN.csv from Blackboard

2) Import the dataset into R

3) Show the class of all columns/fields

4) Create a new data frame where Open <= 570 and Close >= 550

5) Sort the data frame by High (in decreasing order)

6) Create another data frame where Close >= Open


Exercise 5/8

1) Import data from url "http://www.w3schools.com/xml/plant_catalog.xml"

2) Use the xmlTreeParse function to parse the XML file directly from the web

3) Use the xmlRoot function to access the top node


Exercise 6/8

1) Install the MASS package

2) Find Cars93 dataset

3) Extract all the records for the Volkswagen from the field Manufacturer

4) Order the extracted records by Price (ascending) and save the result to a data frame

5) Write the data frame to Cars93FilteredData.csv


Exercise 7/8

1) Use SQL statements to manipulate the data frame as required in Exercise 6/8

2) Write the data frame to Cars93FilteredData.RData


Exercise 8/8

1) Download the files iris1.csv and iris2.csv from Blackboard

2) Import these two files into R

3) Combine these two datasets into one data frame

4) Calculate the mean value of every column

5) What will you do with missing values?


Table of contents

1. Introduction

2. Data structures

3. Data I/O

4. Graphics

5. Handling and processing strings

6. Linear regression

7. Logistic regression

8. Other topics


Exploratory graphs

If you are familiar with statistical graphical representations, please

skip this part


Pie chart

[Figure: pie chart titled "taxs" with one slice per state (AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI), each between 3% and 7%. Dataset: Cigarette]

A pie chart shows the relative frequencies or percentages of the levels of a categorical variable as wedges of a circle.

It is very useful in a well-designed document intended for readers who will not look at the underlying data (e.g., management).


Scatter plot

With a scatter plot a mark,

usually a dot or small circle,

represents a single data point.

With one mark (point) for every

data point a visual distribution

of the data can be seen.

Depending on how tightly the

points cluster together, you may

be able to discern a clear trend

in the data.

[Figure: scatter plot of AirPassengers with a fitted linear trend line, y = 31.887x - 62057. x-axis: Date (1949 to 1959); y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]


Line plot

A line plot provides an excellent

way to map independent and

dependent variables that are both

quantitative.

The rises and falls of the line make it easy to see how the values change.

[Figure: line plot of AirPassengers. x-axis: Date (from 1949); y-axis: Number of air passengers (0 to 700). Dataset: AirPassengers]


Multiple line plot

Multiple line plots save space. Because the data values are marked by small marks (points) and not bars, they do not have to be offset from each other (this only becomes a problem when the data values are very dense).

[Figure: multiple line plot of Stock1, Stock2 and Stock3. x-axis: Day (1 to 96); y-axis: Cumulative percentage (0 to 1). Dataset: StockShare]


Area chart/graph

An area chart/graph displays quantitative data graphically.

[Figure: area chart of Stock1, Stock2 and Stock3. x-axis: Day (1 to 96); y-axis: Cumulative percentage (0% to 100%). Dataset: StockShare]


Bar chart

A bar plot is a chart that shows

grouped data with rectangular

bars with lengths proportional to

the values that they show. The

bars can be plotted vertically or

horizontally.

It is one of the best methods to

summarise categorical data.

[Figure: bar chart of Stock1, Stock2 and Stock3. x-axis: Day (1 to 10); y-axis: Percentage (0 to 0.9). Dataset: StockShare]


Histogram

A histogram is a graphical

representation of the distribution of

quantitative data. It is an estimate

of the probability distribution of a

quantitative variable and was first introduced by Karl Pearson.

[Figure: histogram. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]


Histogram with distribution fit

A histogram with a distribution

fit is normally used to show the

empirical distribution of the

variable. Sometimes, we use

the Normal/Gaussian distribution to fit the histogram.

[Figure: histogram with a fitted distribution curve. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]
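A histogram with a normal fit can be sketched in base R as follows. The MSFT data are not available here, so the example uses simulated values; prices, m and s are names chosen for illustration.

```r
set.seed(1)                              # reproducible made-up data
prices <- rnorm(200, mean = 50, sd = 4)  # stand-in for a price series

# Estimate the parameters of the normal distribution from the sample
m <- mean(prices)
s <- sd(prices)

# freq = FALSE puts densities on the y-axis so the curve is on the same scale
hist(prices, freq = FALSE, main = "Histogram with normal fit", xlab = "price")

# Overlay the normal density with the estimated mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")
```

The parameters are computed before the call to curve() because, inside curve(), x refers to the plotting grid rather than the data.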


Base plotting system in R


Dataset (1/3)

> data(Chem97, package = "mlmRev")

> head(Chem97)

lea school student score gender age gcsescore gcsecnt

1 1 1 1 4 F 3 6.625 0.3393157

2 1 1 2 10 F -3 7.625 1.3393157

3 1 1 3 10 F -4 7.250 0.9643157

4 1 1 4 10 F -2 7.500 1.2143157

5 1 1 5 8 F -1 6.444 0.1583157

6 1 1 6 10 F 4 7.750 1.4643157


Dataset (2/3)

> data(iris)

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa


Dataset (3/3)

> data(EuStockMarkets)

> EuStockMarkets <- data.frame(EuStockMarkets)

> head(EuStockMarkets)

DAX SMI CAC FTSE

1 1628.75 1678.1 1772.8 2443.6

2 1613.63 1688.5 1750.5 2460.2

3 1606.51 1678.6 1718.0 2448.2

4 1621.04 1684.1 1708.1 2470.4

5 1618.16 1686.6 1723.1 2484.7

6 1610.61 1671.6 1714.3 2466.8


Histogram (1/2)

> hist(Chem97$gcsescore)


Histogram (2/2)

> hist(

+ Chem97$gcsescore,

+ main = "Histogram",

+ xlab = "gcsescore",

+ ylab = "Frequency",

+ col = "green"

+ )
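Beyond titles and colours, hist() also accepts a breaks argument and returns an object describing the bins. The sketch below uses the built-in iris data instead of Chem97 so that it runs without the mlmRev package.

```r
# breaks suggests the number of bins; plot = FALSE returns the bin data only
h <- hist(iris$Sepal.Length, breaks = 10, plot = FALSE)

h$breaks   # bin boundaries chosen by R
h$counts   # number of observations in each bin
```

R treats a single number given to breaks as a suggestion, so the actual number of bins may differ slightly.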


Boxplot (1/2)

> boxplot(Chem97$gcsescore,

+ main = 'title',

+ ylab = 'gcsescore')


Boxplot (2/2)

> boxplot(

+ Chem97$gcsescore,

+ Chem97$age,

+ main = 'title',

+ ylab = 'value',

+ names = c('gcsescore','age')

+ )
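A related pattern, not shown on the slide, is to draw one box per group with the formula interface. The sketch below uses the built-in iris data so that it runs without extra packages.

```r
# One box per species: response ~ grouping variable
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species", ylab = "Sepal.Length")

# The same call with plot = FALSE returns the statistics behind the boxes
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
b$names
```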


Scatter plot (1/3)

> plot(

+ Chem97$gcsescore,

+ Chem97$gcsecnt,

+ main = "title",

+ xlab = "gcsescore",

+ ylab = 'gcsecnt',

+ col = "blue"

+ )


Scatter plot (2/3)

> pairs(iris)


Scatter plot (3/3)

> pairs(iris, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])


Line plot (1/3)

> plot(

+ EuStockMarkets$DAX,

+ type = "l",

+ main = 'EuStockMarkets',

+ xlab = 'Day',

+ ylab = 'DAX'

+ )


Line plot (2/3)

> plot(

+ EuStockMarkets$DAX,

+ type = "l", col = 'red',

+ xlab = 'Day', ylab = 'Price'

+ )

> lines(EuStockMarkets$FTSE,

+ type = "l", col = 'blue')

> title("EuStockMarkets", cex.main = 1.1)

> legend(

+ 100, 5500, c("DAX", "FTSE"),

+ col = c('red', 'blue'),

+ text.col = "black",

+ lty = c(1,1), merge = TRUE

+ )


Line plot (3/3)

> plot(

+ EuStockMarkets$DAX,

+ EuStockMarkets$CAC,

+ type = "l",

+ main = 'EuStockMarkets',

+ xlab = 'DAX',

+ ylab = 'CAC'

+ )


Exercise 1/5

1) Create a vector x from a series 1 to 1000

2) Create a vector y from a series 12 to 10002

3) Generate the following scatter plot, with x on the x-axis and y on the y-axis


Exercise 2/5

1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)

2) Each variable has 500 observations (i.e., 500 rows)

3) x follows a standard normal distribution N(0,1)

4) y follows a continuous uniform distribution U[0,1]

5) z follows a Poisson distribution Poisson(0.5)

6) Generate a pairs plot for x, y, z

Please google the pairs function


Exercise 3/5

Plot the following figure, where x is in [0, 2π] and y is sin(x)


Exercise 4/5

1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard

2) Import the dataset into R using the openxlsx package

3) Save the data frame into df

4) Plot the histogram of df (the same as the one on the right)

Hint: a) bin width; b) values on the x-axis


Exercise 5/5

1) Download the file AMZN.csv from Blackboard

2) Import the dataset into R

3) Plot the multiple-line figure shown below


R graphics packages:

lattice & ggplot2


Author

lattice was developed and

maintained by Deepayan Sarkar,

Assistant Professor at Indian

Statistical Institute.

http://www.isid.ac.in/~deepayan/


Histogram by wrap

> pl <- histogram(~ gcsescore |

+ factor(score), data = Chem97)

> print(pl)
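In the lattice formula ~ gcsescore | factor(score), the part before | is the variable to plot and the part after it is the conditioning variable: lattice draws one panel per level. As a minimal sketch of the same pattern on a dataset that ships with R (using iris here is an assumption made only to keep the example self-contained; the slides use Chem97, which according to the exercises comes from the mlmRev package), one might write:

```r
library(lattice)

# One histogram panel of Sepal.Length for each level of Species
pl <- histogram(~ Sepal.Length | Species, data = iris)
print(pl)

# The conditioning variable has three levels, so three panels are drawn
nlevels(iris$Species)
```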


Density by wrap

> pl <- densityplot(

+ ~ gcsescore | factor(score),

+ data = Chem97,

+ plot.points = FALSE,

+ ref = TRUE

+ )

> print(pl)


Density plot by different colour

> pl <- densityplot(

+ ~ gcsescore,

+ data = Chem97,

+ groups = score,

+ plot.points = FALSE,

+ ref = TRUE,

+ auto.key = list(columns = 3)

+ )

> print(pl)


boxplot by wrap (1/2)

> pl <- bwplot(

+ gcsescore ^ 2.34 ~ gender | factor(score),

+ Chem97,

+ varwidth = TRUE,

+ layout = c(6, 1),

+ ylab = "Transformed GCSE score"

+ )

> print(pl)


boxplot by wrap (2/2)

> pl <- densityplot(

+ ~ gcsescore,

+ data = Chem97,

+ groups = score,

+ plot.points = FALSE,

+ ref = TRUE,

+ auto.key = list(columns = 3)

+ )

> print(pl)
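Note that the call above repeats the grouped density plot from the previous slide, whereas the ggplot2 slide with the same title, boxplot by wrap (2/2), draws horizontal box plots of gcsescore for each score level with one panel per gender. A lattice sketch of that box plot, offered as an assumption about what was intended rather than as the author's code, and assuming Chem97 is loaded from the mlmRev package as the exercises state, could be:

```r
library(lattice)

# Assumption: Chem97 comes from the mlmRev package, as the exercises state.
if (requireNamespace("mlmRev", quietly = TRUE)) {
  data(Chem97, package = "mlmRev")
  # Horizontal box plots: one box per score level, one panel per gender
  pl <- bwplot(
    factor(score) ~ gcsescore | gender,
    data = Chem97,
    xlab = "Average GCSE score"
  )
  print(pl)
}
```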


There are many other functions in lattice

The references below will be useful:

• http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf

• https://www.stat.auckland.ac.nz/~paul/RGraphics/chapter4.pdf

• https://fas-web.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm


Author

ggplot2 was developed by Hadley

Wickham, Chief Scientist at RStudio, and

an Adjunct Professor of Statistics at the

University of Auckland.

http://hadley.nz/


Histogram by wrap

> pg <-

+ ggplot(Chem97, aes(gcsescore)) +

+ geom_histogram(binwidth = 0.5) +

+ facet_wrap( ~ score)

> print(pg)


Density plot by wrap

> pg <- ggplot(Chem97, aes(gcsescore)) +

+ stat_density(geom = "path",

+ position = "identity") +

+ facet_wrap(~ score)

> print(pg)


Density plot by different colour

> pg <- ggplot(Chem97, aes(gcsescore)) +

+ stat_density(geom = "path",

+ position = "identity",

+ aes(colour = factor(score)))

> print(pg)


boxplot by wrap (1/2)

> pg <- ggplot(Chem97,

+ aes(factor(gender),

+ gcsescore^2.34)) +

+ geom_boxplot() +

+ facet_grid(~score) +

+ ylab("Transformed GCSE score")

> print(pg)


boxplot by wrap (2/2)

> pg <- ggplot(Chem97,

+ aes(factor(score),

+ gcsescore)) +

+ geom_boxplot() +

+ coord_flip() +

+ ylab("Average GCSE score") +

+ facet_wrap( ~ gender)

> print(pg)


There are many other functions in ggplot2

The references below will be useful:

• http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf

• http://www.statmethods.net/advgraphs/ggplot2.html

• http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/

• http://www.stat.wisc.edu/~larget/stat302/chap2.pdf


Exercise 1/11

Use the lattice or ggplot2 package to draw the figure below

> data(postdoc, package = "latticeExtra")

> pl <- barchart(prop.table(postdoc, margin = 1),

+ xlab = "Proportion",

+ auto.key = list(adj = 1))

> print(pl)

Exercise 2/11

1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx

2) Plot the following figure using the lattice package

Hint: xyplot

Exercise 3/11

1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx

2) Plot the following figure using the ggplot2 package

Hint: 1) ggplot; 2) plot points; 3) by wrap

Exercise 4/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 5/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 6/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 7/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 8/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 9/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 10/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the ggplot2 package

Exercise 11/11

1) Read the dataset Chem97 from the mlmRev package

2) Plot the following figure using the lattice package

5. Handling and processing strings

Empty string

An empty string can be produced by

consecutive quotation marks: ""

> empty_str = ""

> empty_str

[1] ""

> class(empty_str)

[1] "character"


Vector of empty strings

character() will produce a character

vector with as many empty strings as the length passed to it

> # vector with 5 empty strings

> char_vector = character(5)

> char_vector

[1] "" "" "" "" ""
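A pre-allocated vector like this is typically filled in afterwards by index. A short sketch (the values are made up for illustration):

```r
# Pre-allocate five empty strings, then fill some of them by position
char_vector <- character(5)
char_vector[1] <- "first"
char_vector[3] <- "third"
print(char_vector)

# The length is unchanged by assigning to existing positions
length(char_vector)

# character(0) is an empty vector, which is not the same thing as ""
length(character(0))   # zero elements
length("")             # one element (an empty string)
```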


is.character() and as.character()

as.character() and is.character() are

generic methods for creating and testing

for objects of type "character"

> a = "test me"

> b = 8 + 9

> # are 'a' and 'b' characters?

> is.character(a)

[1] TRUE

> is.character(b)

[1] FALSE
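The example above only tests with is.character(). As a complementary sketch, as.character() performs the conversion itself (b_chr is a name introduced here for illustration):

```r
b <- 8 + 9

# Convert the number 17 into the string "17"
b_chr <- as.character(b)
print(b_chr)

# The converted object now passes the character test
is.character(b_chr)

# Converting back to numeric makes arithmetic possible again
as.numeric(b_chr) + 1
```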


c() for character vector

As you can tell, the resulting vector from

combining integers (1:5), the number pi,

and some "text" is a vector with all its

elements treated as character strings. In

other words, when we combine mixed

data in vectors, strings will dominate.

> a <- c("x", "y", "c")

> a

[1] "x" "y" "c"

> b <- c(1:5, pi, "text")

> b

[1] "1"

[2] "2"

[3] "3"

[4] "4"

[5] "5"

[6] "3.14159265358979"

[7] "text"


paste()

paste() takes one or more R objects,

converts them to "character", and then it

concatenates (pastes) them to form one or

several character strings

> PI = paste("The life of", pi)

> PI

[1] "The life of 3.14159265358979"

> IloveR = paste("I", "love", "R")

> IloveR

[1] "I love R"

> IloveR = paste0("I", "love", "R")

> IloveR

[1] "IloveR"

> IloveR = paste("I", "love", "R", sep = "-")

> IloveR

[1] "I-love-R"

> paste(1:3, c("!", "?", "+"), sep = "",

+ collapse = "")

[1] "1!2?3+"
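The last call shows that paste() is vectorised: sep joins the arguments element by element, and collapse then joins the resulting elements into one string. A small sketch separating the two steps (labels is a name chosen for illustration):

```r
# Shorter arguments are recycled: "X" is paired with each of 1, 2, 3
labels <- paste("X", 1:3, sep = ".")
print(labels)

# collapse joins the elements of the result into a single string
joined <- paste(labels, collapse = ", ")
print(joined)

# Missing values are converted to the text "NA"
with_na <- paste("value:", NA)
print(with_na)
```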


Printing characters

Function Description

print() Generic printing

noquote() Print with no quotes

cat() Concatenation

format() Special formats

toString() Convert to string

sprintf() C-style formatted printing
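As a rough illustration of how these printing functions differ, the following sketch applies each of them to small made-up values:

```r
x <- c("alpha", "beta")

print(x)            # generic printing, with quotes and an index
print(noquote(x))   # same values, quotes suppressed
cat(x, "\n")        # concatenates and writes plain text to the console

# format() returns a string with the requested number of significant digits
pi_txt <- format(pi, digits = 3)

# toString() collapses a vector into one comma-separated string
num_txt <- toString(1:5)

# sprintf() fills a C-style template, once per element of x
msg <- sprintf("%s has %d characters", x, nchar(x))
print(msg)
```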


Basic string manipulations

Function Description

nchar() Number of characters

tolower() Convert to lower case

toupper() Convert to upper case

casefold() Case folding

chartr() Character translation

abbreviate() Abbreviation

substring() Substrings of a character vector

substr() Substrings of a character vector
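A short sketch of several of these base functions applied to one string (the example string is chosen for illustration):

```r
s <- "University of Lincoln"

n <- nchar(s)                       # number of characters, spaces included
up <- toupper(s)                    # all upper case
low <- tolower(s)                   # all lower case
folded <- casefold(s, upper = TRUE) # casefold() with upper = TRUE acts like toupper()

# chartr() translates characters one-for-one: i -> I and l -> L
swapped <- chartr("il", "IL", s)

# substr() extracts one substring by start and stop position
first_word <- substr(s, 1, 10)

# substring() is vectorised over the start and stop positions
words <- substring(s, c(1, 15), c(10, 21))
```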


Set operations

Function Description

union() Set union

intersect() Intersection

setdiff() Set difference

setequal() Equal sets

identical() Exact equality

is.element() Is element

sort() Sorting

paste(rep()) Repetition
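To make the table concrete, here is a minimal sketch of the set functions on two made-up character vectors:

```r
set1 <- c("some", "random", "words")
set2 <- c("some", "many", "none")

u <- union(set1, set2)       # all distinct elements of both vectors
i <- intersect(set1, set2)   # elements present in both
d <- setdiff(set1, set2)     # elements of set1 that are not in set2

has_many <- is.element("many", set1)  # membership test
s <- sort(set1)                       # alphabetical order

# Repetition: repeat a string and glue the copies together
r <- paste(rep("ab", 3), collapse = "")
```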


setequal() vs identical()

> set7 = c("some", "random", "string")
> set8 = c("some", "random", "none", "few")
> set9 = c("string", "some", "random")
> setequal(set7, set8)
[1] FALSE
> setequal(set7, set9)
[1] TRUE
> identical(set7, set7)
[1] TRUE
> identical(set7, set9)
[1] FALSE


stringr package

Thanks to Hadley Wickham, we have the

package stringr that adds more

functionality to the base functions for

handling strings in R.

stringr provides functions for:

1) Basic manipulations

2) Regular expression operations.

http://hadley.nz/


Basic string manipulations in stringr

Function       Description                               Similar to
str_c()        string concatenation                      paste()
str_length()   number of characters                      nchar()
str_sub()      extracts substrings                       substring()
str_dup()      duplicates characters
str_trim()     removes leading and trailing whitespace
str_pad()      pads a string
str_wrap()     wraps a string paragraph                  strwrap()


paste() vs str_c()

> paste("University", "of", "Lincoln")

[1] "University of Lincoln"

> paste("University", "of", "Lincoln", NULL)

[1] "University of Lincoln "

> paste("University", "of", "Lincoln", character(0))

[1] "University of Lincoln "

> library(stringr)

> str_c("University", "of", "Lincoln")

[1] "UniversityofLincoln"

> str_c("University", "of", "Lincoln", NULL)

[1] "UniversityofLincoln"

> str_c("University", "of", "Lincoln", character(0))

[1] "UniversityofLincoln"


nchar() vs str_length()

> nchar("The life of PI")

[1] 14

> str_length("The life of PI")

[1] 14

>

> text_str = c("one", "two", "three", NA)

> nchar(text_str)  # older R counts NA as 2 characters; R 3.3.0 and later return NA here

[1] 3 3 5 2

> str_length(text_str)

[1] 3 3 5 NA


str_sub()

> hw <- "Hadley Wickham"

> str_sub(hw, 1, 6)

[1] "Hadley"

> str_sub(hw, end = 6)

[1] "Hadley"

> str_sub(hw, 8, 14)

[1] "Wickham"

> str_sub(hw, 8)

[1] "Wickham"

> str_sub(hw, c(1, 8), c(6, 14))

[1] "Hadley" "Wickham"

> str_sub(hw, 1:3)

[1] "Hadley Wickham" "adley Wickham" "dley Wickham"


What is a regular expression?

A regular expression (regex or regexp for short) is a pattern that describes a set of

strings. It is a way for a user or programmer to tell a program which pattern to look

for in text and what the program should do each time a match is found.


Functions of regex in R

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)


Regular expression functions

Function Description

grep() Find regex matches and return (index or value)

grepl() Find regex matches and return (TRUE & FALSE)

sub() Replace the first match

gsub() Replace all the matches

regexpr() Find regex matches (position of the first match)

gregexpr() Find regex matches (positions of all matches)

regexec() Find regex matches (hybrid of regexpr() and gregexpr())

strsplit() Split regex matches
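The functions in the table can be compared side by side on one small vector. The sketch below uses made-up strings and only base R:

```r
x <- c("apple pie", "banana split", "cherry tart")

idx <- grep("an", x)                 # index of the elements that match
val <- grep("an", x, value = TRUE)   # the matching elements themselves
has_e <- grepl("e", x)               # one TRUE/FALSE per element

first_a <- sub("a", "A", x)          # replace only the first match in each string
all_a <- gsub("a", "A", x)           # replace every match in each string

# Position of the first "a" in each string (-1 would mean no match)
pos <- as.integer(regexpr("a", x))

# Split each string at the space; the result is a list
parts <- strsplit(x, " ")
```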


Metacharacters in R (1/2)

There are some special characters that have a reserved status and they are known as metacharacters.

The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

In R, we need to escape them with a double backslash \\ when we want to represent them literally in a regex pattern.

Metacharacter Escape in R

. \\.

$ \\$

* \\*

+ \\+

? \\?

| \\|

\ \\\\

^ \\^

[ \\[

] \\]

{ \\{

} \\}

( \\(

) \\)


Metacharacters in R (2/2)

> money = "$money"

>

> sub(pattern = "$", replacement = "XXXXXX", x = money)

[1] "$moneyXXXXXX"

> money = "$money"

>

> sub(pattern = "\\$", replacement = "XXXXXX", x = money)

[1] "XXXXXXmoney"
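The double backslash is needed because R itself uses the backslash as an escape character inside strings, so "\\." in R source is the two-character regex \. by the time the regex engine sees it. The sketch below shows this for the dot, together with two alternatives that avoid backslashes altogether (fixed = TRUE and a bracket expression):

```r
x <- "Peace.Love"

# Unescaped, "." matches any character, so the first letter is replaced
any_char <- sub(".", "-", x)

# Escaped, the pattern matches a literal dot
literal <- sub("\\.", "-", x)

# fixed = TRUE treats the pattern as plain text rather than a regex
fixed_dot <- sub(".", "-", x, fixed = TRUE)

# Inside a bracket expression the dot loses its special meaning
bracket_dot <- sub("[.]", "-", x)

# The R string "\\." contains two characters: a backslash and a dot
n_chars <- nchar("\\.")
```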


Sequences (1/4)

Sequence Description

\\d match a digit character

\\D match a non-digit character

\\s match a space character

\\S match a non-space character

\\w match a word character

\\W match a non-word character

\\b match a word boundary

\\B match a non-(word boundary)

\\h match a horizontal space

\\H match a non-horizontal space

\\v match a vertical space

\\V match a non-vertical space


Sequences (2/4)

> sub("\\d", "_", "the dandelion war 2010")

[1] "the dandelion war _010"

> gsub("\\d", "_", "the dandelion war 2010")

[1] "the dandelion war ____"

>

> sub("\\D", "_", "the dandelion war 2010")

[1] "_he dandelion war 2010"

> gsub("\\D", "_", "the dandelion war 2010")

[1] "__________________2010"


Sequences (3/4)

> # replace space with "_"
> sub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion war 2010"
> gsub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion_war_2010"
>
> # replace non-space with "_"
> sub("\\S", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\S", "_", "the dandelion war 2010")
[1] "___ _________ ___ ____"


Sequences (4/4)

> # replace word boundary with "_"

> sub("\\b", "_", "the dandelion war 2010")

[1] "_the dandelion war 2010"

> gsub("\\b", "_", "the dandelion war 2010")

[1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"

> # replace non-word-boundary with "_"

> sub("\\B", "_", "the dandelion war 2010")

[1] "t_he dandelion war 2010"

> gsub("\\B", "_", "the dandelion war 2010")

[1] "t_he d_an_de_li_on w_ar 2_01_0"


Some regex character classes (1/2)

Anchor Description

[aeiou] Match any one lower case vowel

[AEIOU] Match any one upper case vowel

[0123456789] Match any digit

[0-9] Match any digit (same as previous class)

[a-z] Match any lower case ASCII letter

[A-Z] Match any upper case ASCII letter

[a-zA-Z0-9] Match any of the above classes

[^aeiou] Match anything other than a lowercase vowel

[^0-9] Match anything other than a digit


Some regex character classes (2/2)

> # some string
> transport = c("car", "bike", "plane", "boat")
> # look for e or i
> grep(pattern = "[ei]", transport, value = TRUE)
[1] "bike" "plane"
>
> # some numeric strings
> numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
> grep(pattern = "[01]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[0-9]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[^0-9]", numerics, value = TRUE)
[1] "17-April" "I-II-III" "R 3.0.1"


POSIX character classes (1/2)

Notation Description

[[:lower:]] Lower-case letters

[[:upper:]] Upper-case letters

[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])

[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])

[[:blank:]] Blank characters: space and tab

[[:cntrl:]] Control characters

[[:punct:]] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[[:space:]] Space characters: tab, newline, vertical tab, form feed, carriage return, and space

[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f

[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)

[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])


POSIX character classes (2/2)

> # la vie (string)
> la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you print la_vie
> print(la_vie)
[1] "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you cat la_vie
> cat(la_vie)
La vie en #FFC0CB (rose);
Cest la vie! 	tres jolie
> # remove blank characters (spaces and tabs)
> gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
[1] "Lavieen#FFC0CB(rose);\nCestlavie!tresjolie"
> # remove punctuation characters
> gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"


Quantifiers (1/2)

Notation Description

* The preceding item will be matched zero or more times

+ The preceding item will be matched one or more times

? The preceding item is optional and will be matched at most once

{n} The preceding item is matched exactly n times

{n,} The preceding item is matched n or more times

{n,m} The preceding item is matched at least n times, but not more than m times


Quantifiers (2/2)

> strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
> grep("ac*b", strings, value = TRUE)
[1] "ab" "acb" "accb" "acccb" "accccb"
> grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
> grepl("ac*b", strings)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
> grep("ac+b", strings, value = TRUE)
[1] "acb" "accb" "acccb" "accccb"
> grep("ac?b", strings, value = TRUE)
[1] "ab" "acb"
> grep("ac{2}b", strings, value = TRUE)
[1] "accb"
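The examples above cover *, +, ? and {n}. The two remaining quantifiers from the table, {n,} and {n,m}, can be sketched on the same vector:

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# {2,}: "c" repeated two or more times between "a" and "b"
two_or_more <- grep("ac{2,}b", strings, value = TRUE)
print(two_or_more)

# {2,3}: "c" repeated at least two and at most three times
two_to_three <- grep("ac{2,3}b", strings, value = TRUE)
print(two_to_three)
```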


Regex functions in stringr

Function Description

str_detect() Detect the presence or absence of a pattern in a string

str_extract() Extract first piece of a string that matches a pattern

str_extract_all() Extract all pieces of a string that match a pattern

str_match() Extract first matched group from a string

str_match_all() Extract all matched groups from a string

str_locate() Locate the position of the first occurrence of a pattern in a string

str_locate_all() Locate the position of all occurrences of a pattern in a string

str_replace() Replace first occurrence of a matched pattern in a string

str_replace_all() Replace all occurrences of a matched pattern in a string

str_split() Split up a string into a variable number of pieces

str_split_fixed() Split up a string into a fixed number of pieces
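A brief sketch of a few of these functions follows. It assumes the stringr package is installed (hence the guard) and uses made-up strings; the variable names are illustrative only.

```r
# Assumption: stringr is installed; the guard skips the example otherwise
if (requireNamespace("stringr", quietly = TRUE)) {
  library(stringr)

  x <- c("R 3.0.1", "17-April", "I-II-III")

  # Does each string contain a digit?
  has_digit <- str_detect(x, "[0-9]")

  # First run of digits in each string (NA where there is none)
  first_num <- str_extract(x, "[0-9]+")

  # Remove every digit
  no_digits <- str_replace_all(x, "[0-9]", "")

  # Split one string at the hyphens; the result is a list
  pieces <- str_split("I-II-III", "-")
}
```

The pattern always comes second in these functions, after the string, which is the reverse of the argument order used by grep() and sub().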


Exercise 1/3

# dollar
sub("\\$", "", "$Peace-Love")

# dot
sub("\\.", "", "Peace.Love")

# plus
sub("\\+", "", "Peace+Love")

# caret
sub("\\^", "", "Peace^Love")

# vertical bar
sub("\\|", "", "Peace|Love")

# opening round bracket
sub("\\(", "", "Peace(Love)")

# closing round bracket
sub("\\)", "", "Peace(Love)")

# opening square bracket
sub("\\[", "", "Peace[Love]")

# closing square bracket
sub("\\]", "", "Peace[Love]")

# opening curly bracket
sub("\\{", "", "Peace{Love}")


Exercise 2/3

# replace word character with "_"

sub("\\w", "_", "the dandelion war 2010")

gsub("\\w", "_", "the dandelion war 2010")

# replace non-word character with "_"

sub("\\W", "_", "the dandelion war 2010")

gsub("\\W", "_", "the dandelion war 2010")


Exercise 3/3

# people names
people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus",
           "jacob", "youna", "flora", "adi")

# match "m" at most once
grep(pattern = "m?", people, value = TRUE)

# match "m" exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)

# match "m" zero or more times, and "t"
grep(pattern = "m*t", people, value = TRUE)

# match "t" zero or more times, and "m"
grep(pattern = "t*m", people, value = TRUE)

6. Linear regression

Example (1/13)

This is an example of implementing linear regression models in R.

We will use the dataset Cars93 from the MASS package

> library(MASS)

> df <- Cars93

> dim(df)

[1] 93 27

Use the dim() function to see the size of the data. There are 93

observations and 27 variables (features/predictors) in the dataset


Example (2/13)

> head(df,3)

  Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway            AirBags DriveTrain
1        Acura Integra   Small      12.9  15.9      18.8       25          31               None      Front
2        Acura  Legend Midsize      29.2  33.9      38.7       18          25 Driver & Passenger      Front
3         Audi      90 Compact      25.9  29.1      32.3       20          26        Driver only      Front
  Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1         4        1.8        140 6300         2890             Yes               13.2          5    177
2         6        3.2        200 5500         2335             Yes               18.0          5    195
3         6        2.8        172 5500         2280             Yes               16.9          5    180
  Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1       102    68          37           26.5           11   2705 non-USA Acura Integra
2       115    71          38           30.0           15   3560 non-USA  Acura Legend
3       102    67          37           28.0           14   3375 non-USA       Audi 90

Use the head() function to look at a few sample observations of the data. This is an important step in data analysis!

Example (3/13)

> sapply(df, class)

Manufacturer Model Type Min.Price Price Max.Price

"factor" "factor" "factor" "numeric" "numeric" "numeric"

MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize

"integer" "integer" "factor" "factor" "factor" "numeric"

Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers

"integer" "integer" "integer" "factor" "numeric" "integer"

Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room

"integer" "integer" "integer" "integer" "numeric" "integer"

Weight Origin Make

"integer" "factor" "factor"

Use sapply() to check the data type of each variable.

Example (4/13)

> plot(df$Horsepower, df$Price,

+ xlab = "Horsepower",

+ ylab = "Price")

Let’s look at two variables of the cars: horsepower and price. Are they correlated?
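The visual impression from the scatter plot can be backed by a number. The following is a minimal sketch, assuming the same Cars93 data from MASS; the variable name r is illustrative only. It computes the Pearson correlation between the two columns with the base R function cor().

```r
library(MASS)   # Cars93 ships with the MASS package

df <- Cars93

# Pearson correlation between horsepower and price
# (a value close to 1 indicates a strong positive linear association)
r <- cor(df$Horsepower, df$Price)
print(r)
```

A value well above zero suggests that a straight-line model is a reasonable first attempt for these two variables.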

Example (5/13)

> # Simple linear regression (method 2) -----------------

> x <- df$Horsepower  # predictor

> y <- df$Price       # response

> model <- lm(y ~ x)

> model$coefficients

(Intercept) x

-1.3987691 0.1453712

> beta0 <- model$coefficients[1]

> beta1 <- model$coefficients[2]

>

> plot(df$Horsepower, df$Price,

+ xlab = "Horsepower",

+ ylab = "Price")

> y_hat_vec <- beta1 * df$Horsepower + beta0

> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)

> legend(50,

+ 30,

+ lty = 2,

+ col = 4,

+ "Regression line")

Estimate the parameters of a simple linear regression model using R’s lm() function.

Example (6/13)

> summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-16.413  -2.792  -0.821   1.803  31.753

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.3988     1.8200  -0.769    0.444
x             0.1454     0.0119  12.218   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared:  0.6213, Adjusted R-squared:  0.6171
F-statistic: 149.3 on 1 and 91 DF,  p-value: < 2.2e-16

The residual here means the error $y_i - \hat{y}_i$. The same summary can be computed directly:

> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-16.4100  -2.7920  -0.8208   0.0000   1.8030  31.7500

Example (7/13)

In the summary(model) output above, the Std. Error column gives the standard deviation of the sampling distribution of each coefficient estimate under standard regression assumptions.

You are not required to understand how standard errors are calculated. However, if you are interested, please read Chapters 11-12 of Casella’s book.

Example (8/13)

In the summary(model) output above:

• t value is the t-statistic for testing whether the corresponding regression coefficient is different from 0.

• Pr(>|t|) is the p-value of that test. The null hypothesis is that the coefficient is zero.

You are not required to understand how the t value and p-value are calculated. However, if you are interested, please read Chapters 11-12 of Casella’s book.
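The columns of the coefficient table are related in a simple way, which the sketch below tries to make concrete. It assumes the same Cars93 data and refits the model with the formula Price ~ Horsepower (equivalent to y ~ x in the slides); the names coef_table, t_manual and p_manual are illustrative only.

```r
library(MASS)

model <- lm(Price ~ Horsepower, data = Cars93)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(model))

# The t value is the estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(model), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

For the slope this reproduces the arithmetic 0.1454 / 0.0119, which is about 12.2, in line with the t value printed by summary().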

Example (9/13)

R-squared, reported as Multiple R-squared in the summary(model) output above, is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, simply defined by

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$

In general, the higher the R-squared, the better the model fits your data.

You are not required to understand how R-squared, multiple R-squared, adjusted R-squared and their tests are calculated. However, if you are interested, please read Chapters 11-12 of Casella’s book.
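The definition of R-squared can be checked by hand. The sketch below assumes the same Cars93 data and the formula Price ~ Horsepower; the names tss, rss and r_squared are illustrative only. It measures "variation" as a sum of squares.

```r
library(MASS)

model <- lm(Price ~ Horsepower, data = Cars93)

# Total variation of the response around its mean
tss <- sum((Cars93$Price - mean(Cars93$Price))^2)

# Unexplained variation (residual sum of squares)
rss <- sum(residuals(model)^2)

# Explained variation divided by total variation
r_squared <- (tss - rss) / tss
print(r_squared)
```

The result should agree with the Multiple R-squared value of 0.6213 reported by summary().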

Example (10/13)

Prediction

If a new Audi A4 has 175 horsepower, what is its predicted selling price?

> # Prediction ------------------------------------------

>

> x_i <- 175

> y_hat_i <- beta1 * x_i + beta0

>

> plot(df$Horsepower, df$Price,

+ xlab = "Horsepower",

+ ylab = "Price")

> y_hat <- beta1 * df$Horsepower + beta0

> lines(df$Horsepower, y_hat, lty = 2, col = 4)

> points(x_i, y_hat_i, col = 2, pch=9)

> legend(75,

+ 50,

+ lty = c(2,NA),

+ pch = c(NA,9),

+ col = c(4,2),

+ c("Regression line", "New Audi A4"))
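As an alternative to computing beta1 * x_i + beta0 by hand, the built-in predict() function gives the same point prediction and can also attach an interval. This is a minimal sketch, assuming the model is refitted with the formula Price ~ Horsepower so that the new data can use the column name Horsepower; new_car and pred are illustrative names.

```r
library(MASS)

model <- lm(Price ~ Horsepower, data = Cars93)

# The new car, described with the same column name used in the formula
new_car <- data.frame(Horsepower = 175)

# Point prediction plus a 95% prediction interval (columns fit, lwr, upr)
pred <- predict(model, newdata = new_car, interval = "prediction", level = 0.95)
print(pred)
```

The fit column should be close to 0.1453712 * 175 - 1.3987691, i.e. roughly 24.04. Cars93 records Price in thousands of US dollars, so this corresponds to about $24,000.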


Example (11/13)

> attach(df)

> pairs(

+ data.frame(

+ MPG.city,

+ MPG.highway,

+ EngineSize,

+ Horsepower,

+ Fuel.tank.capacity,

+ Length,

+ Width,

+ Rear.seat.room,

+ Luggage.room

+ )

+ )

> detach(df)

Let’s look at the pairwise relationships among several variables of the cars.

Example (12/13)

> attach(df)

> model.multiple <-

+ lm(

+ Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room

+ )

> detach(df)

> model.multiple$coefficients

(Intercept) MPG.city MPG.highway EngineSize Horsepower Fuel.tank.capacity Length

59.1474034 0.2363122 -0.3766282 1.8048313 0.1290087 0.6154648 0.1150924

Width Rear.seat.room Luggage.room

-1.3785983 0.1206144 0.2735771

Estimate the parameters of a multiple linear regression model using R’s lm() function.

Example (13/13)

> summary(model.multiple)

Call:
lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower +
    Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)

Residuals:
     Min       1Q   Median       3Q      Max
-11.7444  -3.7098  -0.2932   2.9824  28.7627

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        59.14740   27.51934   2.149  0.03497 *
MPG.city            0.23631    0.44678   0.529  0.59848
MPG.highway        -0.37663    0.44106  -0.854  0.39598
EngineSize          1.80483    1.85233   0.974  0.33314
Horsepower          0.12901    0.02576   5.008 3.78e-06 ***
Fuel.tank.capacity  0.61546    0.50620   1.216  0.22801
Length              0.11509    0.11504   1.000  0.32044
Width              -1.37860    0.49336  -2.794  0.00666 **
Rear.seat.room      0.12061    0.33957   0.355  0.72348
Luggage.room        0.27358    0.39166   0.699  0.48711
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.868 on 72 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.6914, Adjusted R-squared:  0.6528
F-statistic: 17.92 on 9 and 72 DF,  p-value: 3.547e-15
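The summary reports "11 observations deleted due to missingness": lm() silently drops every row that has a missing value in any model variable. The sketch below, assuming the same Cars93 data, shows one way to see where those rows go; vars, na_counts and n_complete are illustrative names.

```r
library(MASS)

# The response and the nine predictors used in the multiple regression
vars <- c("Price", "MPG.city", "MPG.highway", "EngineSize", "Horsepower",
          "Fuel.tank.capacity", "Length", "Width", "Rear.seat.room",
          "Luggage.room")

# Number of missing values per variable (only variables with at least one)
na_counts <- colSums(is.na(Cars93[, vars]))
print(na_counts[na_counts > 0])

# Rows with no missing value in any model variable: the rows lm() keeps
n_complete <- sum(complete.cases(Cars93[, vars]))
print(n_complete)
```

The number of complete rows should equal 93 - 11 = 82, which also matches the 72 residual degrees of freedom plus the 10 estimated coefficients.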

7. Logistic regression

Example (1/13)

This is an example of implementing logistic regression models in R.

We will use the Housing.csv dataset.

> df <- read.csv("C:/Housing.csv")

> dim(df)

[1] 546 12

Use the dim() function to check the size of the data. There are 546 observations and 12 variables in the dataset.

Example (2/13)

> head(df)

price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea

1 420 5850 3 1 2 1 0 1 0 0 1 0

2 385 4000 2 1 1 1 0 0 0 0 0 0

3 495 3060 3 1 1 1 0 0 0 0 0 0

4 605 6650 3 1 2 1 1 0 0 0 0 0

5 610 6360 2 1 1 1 0 0 0 0 0 0

6 660 4160 3 1 1 1 1 1 0 1 0 0

Use the head() function to look at the first few observations of the data (6 by default).

Example (3/13)

> lapply(df,class)

$price

[1] "numeric"

$housesize

[1] "integer"

$bedrooms

[1] "integer"

$bathrms

[1] "integer"

$stories

[1] "integer"

$driveway

[1] "integer"

...

Use lapply() to check the data type of each variable (the result is a list, displayed vertically).

Example (4/13)

> summary(df)

price housesize bedrooms bathrms stories driveway

Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000

1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000

Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000

Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859

3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000

Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000

recroom fullbase gashw airco garagepl prefarea

Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000

1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000

Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000

Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344

3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000

Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000

Use summary() to produce summary statistics for every variable.

Example (5/13)

> summary(df$price)

Min. 1st Qu. Median Mean 3rd Qu. Max.

250.0 491.2 620.0 681.2 820.0 1900.0

Use summary() on a single column to produce the summary statistics for one variable at a time.

Example (6/13)

Let’s create a graph with two subplots, one for each predictor. This helps us understand the effect of each predictor on the response variable.

> par(mfrow=c(1, 2))

> plot(df$price, df$fullbase,xlab = "Price",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,

+ pch = 16, col = "green",cex.lab=1.5, cex.axis=1.5,

+ cex.sub=1.5)

> plot(df$housesize, df$fullbase,xlab = "Housesize",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,

+ pch = 16, col = "blue",cex.lab=1.5, cex.axis=1.5,

+ cex.sub=1.5)


Example (7/13)

> model1<-glm(fullbase~price,data=df,family=binomial)

> model1$coefficients

(Intercept) price

-1.622737e+00 1.447098e-03

> plot(df$price, df$fullbase,xlab = "Price",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,pch = 16,

+ col = "blue",cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(min(df$price),max(df$price))

> yprice<-predict(model1,list(price=xprice),type="response")

> lines(xprice,yprice)

Fit a logistic regression model using R’s built-in glm() function.

Note: The fitted curve may not be clearly visible because the price variable spans a wide range of values.

Example (8/13)

# get better regression line plot

> range(df$price)

[1] 250 1900

>

> plot(df$price, df$fullbase, xlim=c(0,2150),ylim=c(-1,2),

+ xlab = "Price", ylab = "Fullbase", col = "blue",

+ frame.plot=TRUE,cex=1.5,pch = 16,cex.lab=1.5,

+ cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(0,2150)

> yprice<-predict(model1,list(price=xprice),type="response")

> lines(xprice,yprice)

Example (9/13)

> summary(model1)

Call:
glm(formula = fullbase ~ price, family = binomial, data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6778  -0.8992  -0.8012   1.3529   1.7316

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365  0.2567345  -6.321 2.60e-10 ***
price        0.0014471  0.0003423   4.228 2.36e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 706.89  on 545  degrees of freedom
Residual deviance: 688.28  on 544  degrees of freedom
AIC: 692.28

Number of Fisher Scoring iterations: 4

Here we see:

• Whether the response variable and the predictor(s) are positively or negatively associated (the sign of each estimate).

• The 𝑧 value and 𝑝-value of the hypothesis test of whether each coefficient is zero. The null hypothesis is that the coefficient is zero. As the 𝑝-value is much less than 0.05, we reject the null hypothesis that 𝛽 = 0.

Example (10/13)

Deviance is a measure of the goodness of fit of a regression model (higher numbers indicate a worse fit). The ‘Null deviance’ shows how well the response variable is predicted by a model that includes only the intercept; the ‘Residual deviance’ shows how well it is predicted once the predictors are included.

In R:

model1$null.deviance   (the null deviance)

model1$deviance   (the residual deviance)

For example, in the summary(model1) output above the null deviance is 706.89 on 545 degrees of freedom. Including the independent variable (price) decreases the deviance to 688.28 on 544 degrees of freedom.

The deviance is reduced by 18.61 at the cost of one degree of freedom.
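One common way to judge whether such a drop in deviance is large is to compare it with a chi-squared distribution (a likelihood-ratio test). The sketch below is only an illustration: it uses the deviances printed by summary(model1) rather than the data file, and the variable names are assumptions.

```r
# Deviances and degrees of freedom copied from the summary(model1) output
null_deviance     <- 706.89
residual_deviance <- 688.28
null_df           <- 545
residual_df       <- 544

# Drop in deviance from adding price, and the degrees of freedom it costs
deviance_drop <- null_deviance - residual_deviance
df_used       <- null_df - residual_df

# Chi-squared tail probability: a small value suggests that price
# improves on the intercept-only model
p_value <- pchisq(deviance_drop, df = df_used, lower.tail = FALSE)
print(c(deviance_drop = deviance_drop, df_used = df_used, p_value = p_value))
```

With the fitted objects available, the same quantities could be read from model1$null.deviance, model1$deviance, model1$df.null and model1$df.residual.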

Example (11/13)

The Akaike Information Criterion (AIC), reported as 692.28 in the summary(model1) output above, provides a method for assessing the quality of your model through comparison with related models (the model with the smallest AIC is the best fitted model).

Fisher scoring is a variant of Newton’s method for solving maximum likelihood problems numerically.
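The AIC value can be traced back to the deviance. The sketch below relies on one fact about models with 0/1 responses, as here: the residual deviance equals minus twice the maximised log-likelihood, so AIC is the residual deviance plus twice the number of estimated parameters. The names k and aic_manual are illustrative.

```r
residual_deviance <- 688.28   # from summary(model1)
k <- 2                        # estimated parameters: intercept and price

# AIC = -2 * log-likelihood + 2 * k; for 0/1 responses the first term
# is the residual deviance
aic_manual <- residual_deviance + 2 * k
print(aic_manual)
```

This reproduces the AIC of 692.28 shown in the summary.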

Example (12/13)

Prediction

If a new house has a rental price of 385.00 pounds, what is the probability that fullbase = 1 for this house?
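Before turning to predict(), it may help to see the calculation that a logistic regression prediction performs. The sketch below is an illustration only: it plugs the coefficient estimates printed by summary(model1) into the inverse logit function; the names log_odds, p_manual and p_plogis are assumptions.

```r
beta0 <- -1.6227365   # intercept, from summary(model1)
beta1 <-  0.0014471   # coefficient of price, from summary(model1)
price <- 385

# Linear predictor on the log-odds scale
log_odds <- beta0 + beta1 * price

# Inverse logit turns log-odds into a probability;
# plogis() is the base R shortcut for the same formula
p_manual <- 1 / (1 + exp(-log_odds))
p_plogis <- plogis(log_odds)
print(c(log_odds = log_odds, probability = p_manual))
```

The result is a probability of roughly 0.26, which is what predict(..., type = "response") should return for the same price, up to rounding of the coefficients.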

> # Prediction ------------------------------------------

> model1<-glm(fullbase~price,data=df,family=binomial)

> plot(df$price, df$fullbase,xlab = "Price", ylab = "Fullbase",

+ frame.plot=TRUE,cex=1.5,pch = 16, col = "blue",

+ cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(min(df$price),max(df$price))

> yprice<-predict(model1,list(price=xprice),type="response")

> lines(xprice,yprice)

> newdata <- data.frame(price = 385.00)

> y_hat_i<-predict(model1, newdata, type="response")

> points(newdata$price, y_hat_i, col = 2, pch=20)

Example (13/13)

> model2 <- glm(fullbase ~ price + housesize, data = df, family = binomial)
> model2$coefficients
  (Intercept)         price     housesize
-1.466744e+00  1.766831e-03 -7.286285e-05

> summary(model2)

Call:
glm(formula = fullbase ~ price + housesize, family = binomial,
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7777  -0.8973  -0.7971   1.3701   1.7224

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.467e+00  2.784e-01  -5.269 1.37e-07 ***
price        1.767e-03  4.120e-04   4.289 1.80e-05 ***
housesize   -7.286e-05  5.108e-05  -1.427    0.154
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 706.89  on 545  degrees of freedom
Residual deviance: 686.19  on 543  degrees of freedom
AIC: 692.19

Number of Fisher Scoring iterations: 4
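The two logistic models can be compared using the numbers in their summaries. The sketch below is an illustration under the assumption that the printed deviances and AIC values are used as given; the variable names are not from the slides. It checks whether adding housesize buys a meaningful drop in deviance.

```r
# Residual deviances and AIC values copied from the two summaries
dev_model1 <- 688.28; aic_model1 <- 692.28   # fullbase ~ price
dev_model2 <- 686.19; aic_model2 <- 692.19   # fullbase ~ price + housesize

# Extra drop in deviance from adding housesize (one extra parameter)
deviance_drop <- dev_model1 - dev_model2
p_value <- pchisq(deviance_drop, df = 1, lower.tail = FALSE)

# Difference in AIC between the two models
aic_gain <- aic_model1 - aic_model2

print(c(deviance_drop = deviance_drop, p_value = p_value, aic_gain = aic_gain))
```

The drop of 2.09 is small for one degree of freedom and the AIC barely changes, which is consistent with the non-significant z test for housesize (p = 0.154).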

8. Other topics

Assignment operators: ‘=’ vs. ‘<-’

In R, you can use both ‘=’ and ‘<-’ as assignment operators. So what’s the difference between them, and which one should you use?

What’s the difference?

The main difference between the two assignment operators is scope. It’s easiest to see the difference with an example:

> mean(x=1:10)
[1] 5.5
> x
Error: object 'x' not found

Here ‘=’ only names the argument x inside the call to mean(), so x doesn’t exist in the user workspace.

> mean(x<-1:10)
[1] 5.5
> x
[1] 1 2 3 4 5 6 7 8 9 10

This time the x variable is created in the user workspace.
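The same behaviour can be checked programmatically with exists(). The sketch below assumes it is run in a fresh session or script; the names m1, m2, x_after_equals and x_after_arrow are illustrative only.

```r
# Start from a state where x is not defined in the current environment
if (exists("x", inherits = FALSE)) rm(x)

m1 <- mean(x = 1:10)                         # "=" only names the argument
x_after_equals <- exists("x", inherits = FALSE)

m2 <- mean(x <- 1:10)                        # "<-" assigns, then passes the value
x_after_arrow <- exists("x", inherits = FALSE)

print(c(x_after_equals, x_after_arrow))
```

Both calls return the same mean; only the second leaves a variable x behind.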


When does the assignment take place? (1/2)

In the code above, you may be tempted to think that we “assign 1:10 to x, then calculate the mean.” This would be true for languages such as C, but it isn’t true in R. Consider the function below. Notice that the value of a hasn’t changed, because f never uses its argument!

> a <- 1

> f <- function(a) {

+ return(TRUE)

+ }

> result <- f(a <- a + 1); a

[1] 1

When does the assignment take place? (2/2)

In R, the value of a will only change if the function actually evaluates the argument. This can lead to unpredictable behaviour (each call prints the value returned by f, then the value of a; the output depends on the random draw):

> f <- function(a) {

+ if (runif(1) > 0.5)

+ TRUE

+ else

+ a

+ }

> a <- 1

> f(a <- a+1); a

[1] 2

[1] 2

> f(a <- a+1); a

[1] 3

[1] 3

> f(a <- a+1); a

[1] TRUE

[1] 3
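Because the example above depends on a random number, it gives different output on every run. The following sketch shows the same lazy-evaluation effect deterministically; the extra parameter use_arg and the names a_after_unused and a_after_used are hypothetical additions for illustration.

```r
# f only touches its first argument when use_arg is TRUE
f <- function(a, use_arg) {
  if (use_arg) a else TRUE
}

a <- 1

f(a <- a + 1, use_arg = FALSE)   # argument never evaluated: no assignment
a_after_unused <- a

f(a <- a + 1, use_arg = TRUE)    # argument evaluated: the assignment runs
a_after_used <- a

print(c(a_after_unused, a_after_used))
```

Keeping assignments out of function arguments avoids this kind of surprise.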

Which one should I use? (1/2)

There’s quite a strong following for the “<-” operator:

• The Google R style guide prohibits the use of “=” for assignment.

• Hadley Wickham’s style guide recommends “<-”.

• If you want your code to be compatible with S-plus you should use “<-” (note: it seems that S-plus now accepts “=”).

• The general R community recommends using “<-”.

Which one should I use? (2/2)

Some people use the “=” operator for the following reasons:

• Many other languages use the “=” operator, e.g., Python, C.

• It’s quicker to type “=” than “<-”.

• They want the declared variable to exist in the current workspace.

• Using “=” avoids misleading expressions like if (x[1]<-2), which assigns 2 to x[1] instead of testing whether x[1] < -2.

Computer representation of numbers (1/2)

> a <- sqrt(2)

> a * a == 2

[1] FALSE

> a * a - 2

[1] 4.440892e-16

> all.equal(a * a, 2)

[1] TRUE

Real numbers are not stored exactly on computers. They are stored in a binary version of “scientific” notation (in decimal, this notation looks like $1.24 \times 10^{2}$).

The function all.equal() compares two objects using a numeric tolerance of 1.5e-8 (default). If you want much greater accuracy than this you will need to consider error propagation carefully.
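When a comparison is used inside a condition, all.equal() is usually wrapped in isTRUE(), because it returns a descriptive character string rather than FALSE when the values differ. A minimal sketch follows; the names exact, approx and tol are illustrative only.

```r
a <- sqrt(2)

# Exact comparison fails because of rounding in the last binary digits
exact <- a * a == 2

# Tolerant comparison, safe to use in if() because isTRUE() always
# returns a single TRUE or FALSE
approx <- isTRUE(all.equal(a * a, 2))

# The default tolerance is the square root of the machine epsilon
tol <- sqrt(.Machine$double.eps)

print(c(exact = exact, approx = approx))
print(tol)
```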


Computer representation of numbers (2/2)

> x<- seq(0,0.5,0.1)

> x

[1] 0.0 0.1 0.2 0.3 0.4 0.5

> y <- c(0,0.1,0.2,0.3,0.4,0.5)

> y

[1] 0.0 0.1 0.2 0.3 0.4 0.5

> x == y

[1] TRUE TRUE TRUE FALSE TRUE TRUE

> for (i in seq_along(x)) {

+ print(all.equal(x[i], y[i]))

+ }

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE


Assigning a value (1/2)

> x <- c(8, 6, 4)

> x[7] <- 10

> x

[1] 8 6 4 NA NA NA 10

Assigning a value to a nonexistent element

of a vector, matrix, array, or list will

expand that structure to accommodate the

new value.
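The same expansion can be observed for a list, where the gap is filled with NULL rather than NA. This is a small sketch with illustrative names (lst is not from the slides).

```r
# Vector: assigning past the end pads the gap with NA
x <- c(8, 6, 4)
x[7] <- 10
print(length(x))
print(sum(is.na(x)))

# List: assigning past the end pads the gap with NULL
lst <- list(a = 1)
lst[[3]] <- "z"
print(length(lst))
print(is.null(lst[[2]]))
```

This silent growth is convenient, but it can also hide an off-by-one index, so it is worth checking length() after such assignments.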


Assigning a value (2/2)

In R, the use of semicolons between statements is optional, and most people don’t bother. If a statement is split over two lines, e.g.,

y <- 2 + 3

+ 5

there is a risk that the first statement ended on the first line, i.e. that you said y <- 2 + 3.

It is better to signal to R that an expression is incomplete, e.g.,

y <- 2 + 3 +

5
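The difference between the two layouts can be verified directly. In the sketch below the names y1 and y2 are illustrative; the second line of the first layout is a separate statement that evaluates to 5 and is then discarded.

```r
# Layout 1: the statement is complete at the end of the first line
y1 <- 2 + 3
+ 5

# Layout 2: the trailing "+" tells R that the expression continues
y2 <- 2 + 3 +
  5

print(c(y1 = y1, y2 = y2))
```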


Debugging with RStudio

Usually, I do not recommend using R for projects with many dependency files; instead, calling R from other languages such as Java/Python/C++ for the statistical analysis would be a better solution.

Debugging with RStudio is easy (similar to Matlab).

For detailed instructions, see: https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio

LaTeX

LaTeX is a document preparation system

for high-quality typesetting. It is freely

available for Windows, Mac, and Linux

platforms.

Donald E. Knuth

http://cs.stanford.edu/~uno/

https://latex-project.org/intro.html


Sweave (R + LaTeX)

• Install LaTeX on your PC

• Install the Sweave library in RStudio

• Download the SweaveDemo.rnw file from Blackboard

• Open the file and compile the PDF as shown on the right!

R markdown

• Download the RMarkdownDemo.rmd file from Blackboard

• Open the file and compile the HTML as shown below!


Other topics in R that are not covered in our lectures

• Rcpp: R and C++ mixed programming

• rJava: R and Java mixed programming

• rPython: R and Python mixed programming

• Creating your own R package

• R for statistical modelling (gbm, etc.)

• R for machine learning (kernlab, RWeka, caret, nnet, etc.)

• R for time series analysis

• …

Key references

• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.

• P. Teetor (2011) R Cookbook. O’Reilly.

• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.