A short tutorial on R for data science

Bowei Chen

School of Computer Science

University of Lincoln

2016 - 2017

Preface

This short tutorial is to give a practical introduction to R for data science programming. It

aims at the undergraduate students or practitioners who have no background or experience

in data science or statistics. It should be noted that the tutorial is focused on teaching basic

R data programming skills from scratch rather than data science algorithms.

The tutorial is based on several open-source materials from the R community (see the key

references section for details). It has been used in the workshops of the Data Science

module in the School of Computer Science at the University of Lincoln, UK. The content

is around 8 hours’ study. Thanks to the module demonstrators Deema Abdal Hafeth and

Jingmin Huang who have provided help with exercises.

2

Table of contents

1. Introduction

2. Data structures

3. Data I/O

4. Graphics

5. Handling and processing strings

6. Linear regression

7. Logistic regression

8. Other topics

3

1. Introduction

4

What is R?

• R is a free software environment for

statistical computing and graphics.

• R compiles and runs on a wide variety of

UNIX platforms, Windows and MacOS.

• R can be downloaded at:

https://cran.r-project.org/

5

https://cran.r-project.org/

Comprehensive R Archive Network (CRAN)

• CRAN includes packages which provide additional functionalities.

• Over 7,801 additional packages (as of January 2016) available at CRAN, Bioconductor,

Omegahat, GitHub, and other repositories.

• R packages are written mainly by academics and company staff.

• The R Foundation is seated in Vienna, Austria and currently hosted by the Vienna

University of Economics and Business. It is a registered association under Austrian law

and active worldwide.

6

Short history of R (1/2)

• S is a statistical programming language developed primarily by John Chambers, Rick

Becker and Allan Wilks at Bell Laboratories since 1976.

• The two modern implementations of S are:

– R: part of the GNU free software project

– S-PLUS (or S+): A commercial product sold by TIBCO Software

7

Short history of R (2/2)

• S-PLUS is a commercial implementation of the S programming language sold by TIBCO

Software Inc.

• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland,

New Zealand, and is currently developed by the R Development Core Team, of which

John Chambers is a member. R is named partly after

the first names of the first two R authors and partly as a play on the name of S.

8

What can you do using R? (1/2)

• Data entry and manipulation

– Input data

• from keyboard

• from spreadsheet

• from another statistics package

– Manipulate data

• Statistical analysis

– Descriptive statistics

– Statistical inference

9

What can you do using R? (2/2)

• Graphical display

– Predefined plots for some models

– Flexible, powerful options

– Save to image files in various formats

• Write new functions

– Make a change to an existing function

– Create new functions tailored to your exact needs

– Contribute a new package

• Create documents (with Sweave, knitr)

– PDF (article and slides)

– HTML

10

Why use R for statistical computing?

• Open source (R is “GNU S”, a free implementation of the S language)

• Good visualisations (ggplot2, lattice, standard plot library)

• Easier for writing custom packages and functions

• Closer to the statistics and machine learning community

• Better LaTeX support (Sweave, knitr)

• Works with big data (RHadoop, Rspark, Rcpp)

11

[Figure: O’Reilly 2016 Data Science Salary Survey]

Limitations of R

• The quality of some packages is less than perfect. They are not error-free!

• Many R commands give little thought to memory management, and so R can very

quickly consume all available memory. This can be a restriction when doing data

mining. There are various solutions, including using 64-bit operating systems that can

access much more memory than 32-bit ones.

• Documentation is sometimes patchy and terse, and impenetrable to the non-

statistician. However, some very high-standard books are increasingly plugging the

documentation gaps.

13

RGui

When R is waiting for us to tell it what to do, it begins the line with >

Type

• 'demo()' for some demos

• 'help()' for on-line help

• 'help.start()' for an HTML browser interface

• 'q()' to quit R

14

Editors and IDEs

• RStudio

• Jupyter Notebook

• Vim

• Emacs (ESS)

• Eclipse (StatET)

• Tinn-R

• …

15

https://www.rstudio.com/

RStudio panes and their keyboard shortcuts:

• R source editor (Ctrl+1)

• R console (Ctrl+2)

• Help (Ctrl+3)

• History (Ctrl+4)

• Files (Ctrl+5)

• Plots (Ctrl+6)

• Packages (Ctrl+7)

• Environment (Ctrl+8)

Objects

• Everything in R is an object, having a class.

• Data and intermediate results are stored in R objects.

• The class of an object describes both what the object contains and what many

standard functions do with it.

• Objects are usually accessed by name.

18

R commands

• R commands are either assignments or expressions

• Commands are separated either by a semicolon ; or newline

19

x <- 1+2

`<-`(x, 1+2) #same thing

x = 1+2 #same thing

Assignment operations

An assignment command evaluates

an expression and passes the value

to a variable but the result is not

printed.
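The difference between assigning and printing can be tried directly at the console. The following is a minimal sketch; the variable names x and y are arbitrary.

```r
# Assignment stores the value without printing it.
x <- 1 + 2

# Typing the name on its own is an expression, so the value is printed.
x

# Wrapping an assignment in parentheses assigns and prints in one step.
(y <- x * 10)
```

The parenthesised form is handy when you want to check a value immediately after creating it.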

20

Expression operations

An expression command is evaluated

and (normally) printed.

If the statement results in a value, R will

print that value automatically.

> 1+2

[1] 3

> 1+2*3

[1] 7

> (1+2)*3

[1] 9

In R, any number that you print out in the console is interpreted as a vector. A vector is an ordered collection of numbers. The “[1]” means that the index of the first item displayed in the row is 1.

21

Workspace

• R stores objects in a workspace that is kept in memory.

• When you quit, R asks whether you want to save the workspace.

• The saved workspace, containing all the objects you worked on, can then be restored the next

time you work with R, along with a history of the commands used.

22

Variables (1/3)

A variable is a symbol that holds a value,

which can be any R object.

The types of variables are:

• Integer

• Double

• Character

• Logical

• Factor or categorical

23

Variables (2/3)

Integer, double (numerical values)

> a = 49

> sqrt(a)

[1] 7

> a <- pi

> print(a)

[1] 3.141593

Character, string, logical

> a = "The dog ate my homework"

> sub("dog","cat",a)

[1] "The cat ate my homework"

> a = (1+1==3)

> a

[1] FALSE

24

Variables (3/3)

Factor

> a <- factor(c("H", "e", "l", "l", "o"))

> print(a)

[1] H e l l o

Levels: e H l o

> class(a)

[1] "factor"

25

Types of numerical variables (1/2)

When we use numerical objects, in

mathematical terms, variables can be

classified as:

• Scalars

• Vectors

• Matrices

A scalar is a single number

> x <- 5

> Y <- 100

26

Types of numerical variables (2/2)

A vector is a sequence of numbers

> x <- c(3, 5, 2)

> x

[1] 3 5 2

A matrix is a two-way table of numbers

> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)

> x

[,1] [,2]

[1,] 2 5

[2,] 3 6

[3,] 4 7

27

Variable names

• You can use simple variable names like x, y, A, and a (note that A and a are different

variable names). You can also use longer names like counter, index1, or

subject_id.

• A variable name can contain digits, but it cannot begin with a digit.

• Be careful when reusing the names of built-in functions or symbols for your own objects!

For example, if you define your own function named log, it masks the built-in logarithm

function until you remove it with rm(log).
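A minimal sketch of how a clash with a built-in name behaves in practice. It relies on R skipping non-function objects when it looks up a function call; the replacement function below is purely hypothetical.

```r
# A plain variable named `log` does not block calls to log(),
# because R ignores non-function objects when resolving a call.
log <- 5
log(100, base = 10)

# Defining your own *function* called `log` does mask the built-in one.
log <- function(x) "not a logarithm"
masked <- log(100)

# The original stays reachable through its namespace prefix,
# and rm() removes the masking object from the workspace.
original <- base::log(100, base = 10)
rm(log)
restored <- log(100, base = 10)
```

Choosing a distinct name in the first place avoids having to reason about any of this.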

28

Comments

A comment is anything you write in your

program code that is ignored by

the computer.

Comments help others understand your

code. Anything following a “#” character is

a comment in R.

> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.

29

Arithmetic operators

Addition +

Subtraction -

Multiplication *

Division /

Exponentiation ^ or **

Modulus (x mod y) x %% y (5 %% 2 is 1)

Integer division x %/% y (5 %/% 2 is 2)

30

Comparison operators

Equal ==

Not equal !=

Greater than >

Greater than or equal >=

Less than <

Less than or equal <=

31

Logical operators

x and y x & y

x or y x | y

Not x !x

Test if x is TRUE isTRUE(x)
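The arithmetic, comparison and logical operators all work element by element on vectors. A short sketch with two made-up vectors:

```r
x <- c(4, 2, 6)
y <- c(1, 0, -1)

x %% 4          # remainder of each element divided by 4
x %/% 4         # integer part of each division
x > 3           # element-wise comparison gives a logical vector
x > 3 & y < 1   # element-wise AND of two logical vectors
!(x > 3)        # element-wise negation

# isTRUE() is meant for a single value: TRUE only for a length-one TRUE.
isTRUE(all(x > 1))
```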

32

Numeric functions

Absolute value                        abs(x)
Square root                           sqrt(x)
ceiling(3.475) is 4                   ceiling(x)
floor(3.475) is 3                     floor(x)
round(3.475, digits=2) is 3.48        round(x, digits=n)
signif(3.475, digits=2) is 3.5        signif(x, digits=n)
Cosine, sine, tangent, …              cos(x), sin(x), tan(x)
Natural logarithm                     log(x)
Common logarithm                      log10(x)
Exponential of x                      exp(x)

33

Control structures: if

Syntax:

if (cond) { cmd1 }

> if (TRUE) {

+ "this will be printed if it is TRUE"

+ }

[1] "this will be printed if it is TRUE"

34

Control structures: if-else

Syntax:

if (cond) { cmd1 } else { cmd2 }

> if(1==0) {

+ print(1)

+ } else {

+ print(2)

+ }

[1] 2

35

Control structures: ifelse

Syntax:

ifelse(cond, yes, no)

> ifelse(1 == 0,
+ "this will be returned if 1==0",
+ "this will be returned if 1!=0")
[1] "this will be returned if 1!=0"
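Unlike if, the ifelse() function is vectorised: it tests every element of the condition and returns a vector of the same length. A small sketch with made-up scores:

```r
scores <- c(35, 72, 50, 48)

# Each element is tested separately against the threshold.
result <- ifelse(scores >= 50, "pass", "fail")
result
```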

36

Control structures: for

Syntax:

for (var in seq) { expr }

> x <- c("a", "a", "a", "a", "a")

> for (i in x){

+ print(i)

+ }

[1] "a"

[1] "a"

[1] "a"

[1] "a"

[1] "a"

37

Control structures: repeat

Syntax:

repeat { expr; if (cond) break }

> i <- 10
> repeat {
+   if (i > 25)
+     break
+   else {
+     print(i); i <- i + 5;
+   }
+ }
[1] 10
[1] 15
[1] 20
[1] 25

38

Control structures: while

Syntax:

while (cond) { expr }

> i <- 10

> while (i <= 25) {

+ print(i); i <- i + 5

+ }

[1] 10

[1] 15

[1] 20

[1] 25

39

Control structures: switch

Syntax:

switch(expr, ...)

> AA = 'foo'
> switch(AA,
+   foo = {
+     print('foo') # case 'foo'
+   },
+   bar = {
+     print('bar') # case 'bar'
+   },
+   {
+     print('default')
+   })
[1] "foo"
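switch() also returns the value of the branch it selects, so it can be wrapped in a function. The helper name describe_code below is hypothetical and only illustrates the pattern, including the unnamed default branch.

```r
# Hypothetical helper: map a short code to a description.
describe_code <- function(code) {
  switch(code,
    foo = "matched foo",
    bar = "matched bar",
    "no match, so the unnamed default is used"
  )
}

describe_code("bar")
describe_code("baz")
```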

40

Installing R and RStudio on your machine

• Download R from https://cran.r-project.org/

• Download RStudio at https://www.rstudio.com/

41

https://cran.r-project.org/


Exercise 1/10

demo(graphics)

demo(plotmath)

demo(Japanese)

demo(lm.glm)

demo(hclColors)

42

Exercise 2/10

x<-c(4,2,6)

y<-c(1,0,-1)

length(x)

sum(x)

sum(x^2)

x+y

x*y

x-2

x^2

43

Exercise 3/10

7:11

seq(2,9)

seq(4,10,by=2)

seq(3,30,length=10)

seq(6,-4,by=-2)

44

Exercise 4/10

rep(2,4)

rep(c(1,2),4)

rep(c(1,2),c(4,4))

rep(1:4,4)

rep(1:4,rep(3,4))

45

Exercise 5/10

c(T,T,F,F) & c(T,F,F,T)

!x

x <- seq(-3,3,length=200) > 0

1:3 + c(T,F,T)

intersect(1:10,5:15)

drinks <- factor(c("beer","beer","wine","water"))

46

Exercise 6/10

x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y);

print(z)

c(1, 2, 3, . . . , 19, 20)

x <- c(3,6,8); y <- c(2,5,1);

x[y>1.5]

x <- c(3,6,8); y <- c(2,5,1);

y[x==6]

47

Exercise 7/10

x <- 1:15
if (sample(x, 1) <= 10) {
  print("The sampled value is 10 or less")
} else {
  print("The sampled value is greater than 10")
}

# Clean one variable
rm(x)

# Clean all the variables (the workspace)
rm(list=ls())

48

Exercise 8/10

x <- c("apples", "oranges", "bananas", "strawberries")

for (i in x) {

print(i)

}

for (i in 1:4) {

print(x[i])

}

for (i in seq(x)) {

print(x[i])

}

for (i in 1:4) print(x[i])

49

Exercise 9/10

i <- 1

while (i < 10) {

print(i)

i <- i + 1

}

50

Exercise 10/10

z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)

x <- c(0.5, 0.7)

x <- c(TRUE, FALSE)

x <- c("a", "b", "c", "d", "e")

x <- 9:100

x <- c(1 + (0+0i), 2 + (0+4i))

51

2. Data structures

52

R data structures

53

Vectors (1/3)

Vectors are one-dimensional arrays that

can hold numeric data, character data, or

logical data. The combine function c() is

used to form the vector.

Note that the data in a vector must only

be one data type (numeric, character, or

logical).

> a <-c(1, 2, 5, 3, 6, -2, 4)

> b <-c("one", "two", "three")

> c <-c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# a is numeric vector,

# b is a character vector, and

# c is a logical vector

54

Vectors (2/3)

Scalars are one-element vectors.

> f <- 3

> x <- TRUE

> y <- 100.01

55

Vectors (3/3)

You can refer to elements of a vector using

a numeric vector of positions within

brackets.

> a <- c(1, 2, 5, 3, 6, -2, 4)

> a[3]

[1] 5

> a[c(1, 3, 5)]

[1] 1 5 6

> a[2:6]

[1] 2 5 3 6 -2
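Besides positive positions, brackets also accept negative positions and logical conditions. A brief sketch using the same vector a:

```r
a <- c(1, 2, 5, 3, 6, -2, 4)

a[-1]           # drop the first element
a[a > 3]        # keep only the elements greater than 3
a[length(a)]    # last element
which(a == 6)   # position of the value 6
```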

56

Matrices (1/4)

A matrix is a two-dimensional array where each element has the same data type

(numeric, character, or logical). Matrices are created with the matrix() function.

mymatrix <- matrix(vector,
                   nrow=number_of_rows,
                   ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames,
                                 char_vector_colnames)
)

57

Matrices (2/4)

# Create a matrix from a vector

> vector <-c(1,2,3,4)

> foo <-matrix(vector, nrow=2, ncol=2)

> foo

[,1] [,2]

[1,] 1 3

[2,] 2 4

# Create a 5x4 matrix

> y <- matrix(1:20, nrow=5, ncol=4)

> y

[,1] [,2] [,3] [,4]

[1,] 1 6 11 16

[2,] 2 7 12 17

[3,] 3 8 13 18

[4,] 4 9 14 19

[5,] 5 10 15 20

58

Matrices (3/4)

# Create a 2x2 matrix with labels and fill the matrix by rows
> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames) )
> mymatrix
   C1 C2
R1  1 26
R2 24 68

# Create a 2x2 matrix with labels and fill the matrix by columns
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68

59

Matrices (4/4)

You can identify rows, columns, or elements of a matrix, x, by using subscripts and brackets.

• x[i,] refers to the ith row

• x[,j] refers to jth column

• x[i,j] refers to the i,jth element

> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1]  2  4  6  8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9

60

Arrays (1/2)

Matrices are two-dimensional and, like vectors, can contain only one data type. When

there are more than two dimensions, you’ll use arrays.

myarray <- array(vector, dimensions, dimnames)

61

Arrays (2/2)

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24

62

Data frame (1/4)

A data frame is more general than a matrix in that different columns can contain different

modes of data (numeric, character, etc.). A data frame is created with the data.frame() function.

It’s similar to the datasets you’d typically see in SAS, SPSS, Stata, and Python (pandas).

Each column must have only one data type, but you can put columns of different data

types together to form the data frame. Because data frames are close to what analysts

typically think of as datasets, we’ll use the terms columns and variables interchangeably

when discussing data frames.

mydata <- data.frame(col1, col2, col3,…)
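To see the "one type per column, different types across columns" rule in action, here is a small sketch. The data frame pets and its values are made up for illustration.

```r
# Hypothetical toy data: each column has one type, but types differ by column.
pets <- data.frame(
  name   = c("Rex", "Tom", "Kiki"),
  age    = c(3, 5, 2),
  indoor = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

sapply(pets, class)    # class of each column
dim(pets)              # number of rows and columns
pets[pets$age > 2, ]   # rows where age is greater than 2
```

Setting stringsAsFactors = FALSE keeps the name column as plain character strings regardless of the R version in use.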

63

Data frame (2/4)

> patientID <- c(1, 2, 3, 4)

> age <- c(25, 34, 28, 52)

> diabetes <- c("Type1", "Type2", "Type1", "Type1")

> status <- c("Poor", "Improved", "Excellent", "Poor")

> patientdata <- data.frame(patientID, age, diabetes, status)

> patientdata

patientID age diabetes status

1 1 25 Type1 Poor

2 2 34 Type2 Improved

3 3 28 Type1 Excellent

4 4 52 Type1 Poor

64

Data frame (3/4)

Accessing data frame elements is straightforward. Columns can be accessed by name using the $ operator.

> patientdata$patientID

[1] 1 2 3 4

> patientdata$diabetes

[1] Type1 Type2 Type1 Type1

Levels: Type1 Type2

> patientdata$status

[1] Poor Improved Excellent Poor

Levels: Excellent Improved Poor

65

Data frame (4/4)

If you want to cross tabulate diabetes type by status.

> table(patientdata$diabetes, patientdata$status)

        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0

66

Some useful functions for data frame (1/7)

The attach() function adds the data frame

to the R search path. When a variable name

is encountered, data frames in the search

path are checked in order to locate the

variable.

> summary(mtcars$mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mtcars$mpg, mtcars$disp)

> plot(mtcars$mpg, mtcars$wt)

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)

67


The detach() function removes the data

frame from the search path. Note that

detach() does nothing to the data frame

itself. The statement is optional but is good

programming practice and should be

included routinely.

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)

68


The limitations with this approach are

evident when more than one object can

have the same name.

Here we already have an object named mpg

in our environment when the mtcars data

frame is attached. In such cases, the

original object takes precedence, which

isn’t what you want. The plot statement

fails because mpg has 3 elements and disp

has 32 elements.

> mpg <- c(25, 36, 47)

> attach(mtcars)

The following object is masked _by_ .GlobalEnv:

mpg

> plot(mpg, wt)

Error in xy.coords(x, y, xlabel, ylabel, log) :

'x' and 'y' lengths differ

69


In this case, the statements within the

{} brackets are evaluated with reference

to the mtcars data frame. You don’t

have to worry about name conflicts

here. If there’s only one statement (for

example, summary(mpg)), the {} brackets are optional.

> with(mtcars, {
+ print(summary(mpg))
+ plot(mpg, disp)
+ plot(mpg, wt)
+ })

70


The limitation of the with() function

is that assignments will only exist

within the function brackets.

> with(mtcars, {
+ stats <- summary(mpg)
+ stats
+ })
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90

> stats

Error: object ‘stats’ not found

71


If you need to create objects that will

exist outside of the with() construct,

use the special assignment operator <<- instead of the standard one <-. It will

save the object to the global

environment outside of the with() call.

> with(mtcars, {
+ nokeepstats <- summary(mpg)
+ keepstats <<- summary(mpg)
+ })
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
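Another way to keep results from a with() call, sketched here as an alternative rather than as the tutorial's own method, is to use the value that with() returns: the last expression evaluated inside it. The names mpg_stats and results are arbitrary.

```r
# with() returns the value of its last expression,
# so the result can be assigned with the ordinary <- operator.
mpg_stats <- with(mtcars, summary(mpg))
mpg_stats

# Several results can be returned together as a list.
results <- with(mtcars, list(mean_mpg = mean(mpg), max_wt = max(wt)))
results$mean_mpg
```

This keeps assignments visible at the top level and avoids modifying the global environment from inside a block.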

72


> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

73

Factors (1/3)

Categorical (nominal) and ordered

categorical (ordinal) variables in R are

called factors.

The function factor() stores the

categorical values as a vector of integers

in the range [1... k] (where k is the

number of unique values in the nominal

variable), and an internal vector of

character strings (the original values)

mapped to these integers.


> diabetes <- c("Type1", "Type2", "Type1", "Type1")

> diabetes

[1] "Type1" "Type2" "Type1" "Type1"
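The integer codes and the character levels that factor() stores can be inspected separately. A minimal sketch using the same diabetes values:

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
f <- factor(diabetes)

levels(f)       # the distinct values, stored once as character strings
as.integer(f)   # the integer codes that point into levels(f)
table(f)        # number of observations per level
```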

74

Factors (2/3)

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered = TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
 $ patientID: num 1 2 3 4
 $ age      : num 25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

75

Factors (3/3)

> summary(patientdata)

patientID age diabetes status

Min. :1.00 Min. :25.00 Type1:3 Excellent:1

1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1

Median :2.50 Median :31.00 Poor :2

Mean :2.50 Mean :34.75

3rd Qu.:3.25 3rd Qu.:38.50

Max. :4.00 Max. :52.00

76

Lists (1/2)

Lists are the most complex of the R data

types. Basically, a list is an ordered

collection of objects (components). A list

allows you to gather a variety of (possibly

unrelated) objects under one name.

mylist <- list(object1, object2, …)

mylist <- list(name1=object1, name2=object2, …)

77

Lists (2/2)

> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"

> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
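Single and double brackets behave differently on lists, which is a common source of confusion. The sketch below uses a shortened version of the list above; the added city component is a made-up value.

```r
mylist <- list(title = "My First List", ages = c(25, 26, 18, 39))

class(mylist["ages"])      # single brackets return a (shorter) list
class(mylist[["ages"]])    # double brackets return the component itself
mylist$ages[2]             # $name is shorthand for [["name"]]

mylist$city <- "Lincoln"   # add a new named component (hypothetical value)
names(mylist)
```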

78

Exercise 1/10

# Declare different variable types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE

# Check class of my_numeric
class(my_numeric)

# Check class of my_character
class(my_character)

# Check class of my_logical
class(my_logical)

Exercise 2/10

# Vector operations

a) Create a vector like 1,2,3, . . ., 10

b) Get the length of the above vector

c) Get the last three numbers from the vector

d) Sort the numbers in decreasing order

e) Remove the number 9 from the above vector

Exercise 3/10

# Vector operations

a) Create a vector from 1 to 3.1415 with the length of 100

b) Create a vector from -2 to 0.1 with the length of 100

c) Get the sum and inner product of a and b

Exercise 4/10

# Vector operations

a) Create a vector x containing 2, 3, 4, 1

b) Create a vector y containing 1, 1, 3, 7

c) Combine x and y as column vectors

Exercise 5/10

# Vector operations

Use rep() function to create the following vectors:

a) “0” “x” “0” “x” “0” “x”

b) 1 3 2 1 3 2 1 3 2 1 3 2

c) 1 1 1 2 2 2 3 3 3

Exercise 6/10

# Matrix operations

a) Create a matrix which contains values from 1 to 100 with 5 rows and 20 columns

b) Print out the dimensions of the matrix

c) Find out the 4th column’s sum

d) Find out the sum of row 3 and column 17

e) Assign the following names to the rows:

“A”, “B”, “C”, “D”, “E”

Exercise 7/10

# Matrix operations

a) Use matrix() function to create the following matrix:

TypeA TypeB TypeC

Navarra 190 8 22

Zaragoza 191 4 1.7

Madrid 223 80 2.0

b) Add the following column into the matrix:

TypeD

2.00

3.50

2.75

c) Use apply() function to calculate the means of each column of the matrix

Exercise 8/10

# Array operations

Create the following array:

, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Exercise 9/10

# Data frame operations

Type df <- iris, then

a) Print out the dimensions of df

b) Find out the sum of “Sepal.Width” column

c) Rename column “Species” as “label”

d) Find out how many records with “Petal.Length” larger than 1.41

Exercise 10/10

# List operations

Create the following list and save it to the variable x:

[[1]]

[1] 2 3 5

[[2]]

[1] "aa" "bb" "cc" "dd" "ee"

[[3]]

[1] TRUE FALSE TRUE FALSE FALSE

[[4]]

[1] 3

3. Data I/O

89

Sources of data for R

90

Entering data from the keyboard

Perhaps the simplest method of data entry

is from the keyboard. The edit() function

in R will invoke a text editor that will allow

you to enter your data manually.

> mydata <- data.frame(age = numeric(0),

+ gender = character(0),

+ weight = numeric(0))

> mydata <- edit(mydata)

91

Importing data from Excel

Many R packages allow you to import data from Excel. For example:

• openxlsx

• XLConnect

• xlsx

• …

Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format.

> install.packages("openxlsx")
> library("openxlsx")
> df <-
+   read.xlsx(
+     "PublicHealthEnglandDataTableDistrict.xlsx",
+     sheet = 1,
+     startRow = 1,
+     colNames = TRUE
+   )

92

Importing data from a delimited text file (1/2)

You can import data from delimited text files using read.table(), a function that reads a file in table format and saves it as a data frame.

> mydataframe <- read.table(file, header = logical_value,
+ sep = "delimiter",
+ row.names = "name")

where file is a delimited ASCII file, header is a logical value indicating whether the first row contains variable names (TRUE or FALSE), sep specifies the delimiter separating data values, and row.names is an optional parameter specifying one or more variables to represent row identifiers.
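The read.table() arguments can be tried without any external file by first writing a small file to a temporary location. This is only a sketch; the file contents and the column names id, name and score are made up.

```r
# Write a small comma-delimited file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score",
             "a1,Ann,90",
             "a2,Bob,85"), tmp)

# header = TRUE: the first row holds the variable names;
# sep = ",": values are separated by commas;
# row.names = "id": use the id column as row identifiers.
scores <- read.table(tmp, header = TRUE, sep = ",", row.names = "id")
scores
rownames(scores)
```

Because id becomes the row names, the resulting data frame has only the name and score columns.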

93

Importing data from a delimited text file (2/2)

> file <- paste0(path, '/AMZN.csv')

> df <- read.table(file, header = TRUE, sep = ",")

> head(df)

Date Open High Low Close Volume Adj.Close

1 2016-03-04 581.07 581.40 571.07 575.14 3405100 575.14

2 2016-03-03 577.96 579.87 573.11 577.49 2736700 577.49

3 2016-03-02 581.75 585.00 573.70 580.21 4576900 580.21

4 2016-03-01 556.29 579.25 556.00 579.04 5014400 579.04

5 2016-02-29 554.00 564.81 552.51 552.52 4013400 552.52

6 2016-02-26 560.12 562.50 553.17 555.23 4858200 555.23

94

Importing data from XML

> # install and load the necessary package

> install.packages("XML")

> library(XML)

> xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

> xmlfile <- xmlTreeParse(xml.url)

> class(xmlfile)

[1] "XMLDocument" "XMLAbstractDocument"

> xmltop = xmlRoot(xmlfile)

> plantcat <- xmlSApply(xmltop, function(x) { xmlSApply(x, xmlValue) } )

> # Finally, get the data in a data-frame and have a look at the first rows and columns

> plantcat_df <- data.frame(t(plantcat),row.names = NULL)

> plantcat_df[1:5,1:4]

95

Importing data from R package

> library(MASS)

> data()

> data(phones)

> phones

$year

[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73

$calls

[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9 16.1

[14] 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0

96

Importing data from other sources

• Importing SPSS files into R

• Importing Stata files into R

• Importing SAS files into R

• Importing Minitab files into R

• Importing Matlab files into R

• …

97

Importing data in RStudio (1/2)

98

Importing data in RStudio (2/2)

99

Writing data frame into csv or txt files

write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",

eol = "\n", na = "NA", dec = ".", row.names = TRUE,

col.names = TRUE, qmethod = c("escape", "double"),

fileEncoding = "")

write.csv(...)

write.csv2(...)
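write.csv() is a wrapper around write.table() that uses a comma as the separator, while write.csv2() uses a semicolon as the separator and a comma as the decimal mark. A minimal round-trip sketch with made-up data and a temporary file:

```r
df <- data.frame(case = c("case1", "case2"), number = c(10, 20))
out <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row numbers as an extra first column.
write.csv(df, out, row.names = FALSE)

readLines(out)          # inspect the raw text that was written
back <- read.csv(out)   # read the file back into a data frame
back
```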

100

Useful functions for working with data objects (1/2)

Number of elements/components         length(object)
Dimensions of an object               dim(object)
Structure of an object                str(object)
Class or type of an object            class(object)
How an object is stored               mode(object)
Names of components in an object      names(object)
Combines objects into a vector        c(object, object, ...)
Combines objects as columns           cbind(object, object, ...)
Combines objects as rows              rbind(object, object, ...)
Prints the object                     object

101

Useful functions for working with data objects (2/2)

Lists the first part of the object               head(object)
Lists the last part of the object                tail(object)
Lists current objects                            ls()
Deletes one or more objects                      rm(object, object, ...)
Edits object and saves as new object             newobject <- edit(object)

102

The drop= argument

By default, subscripting operations reduce

the dimensions of an array

whenever possible. To avoid that, we can

use the drop=FALSE argument

> mat <- matrix(1:12, 3, 4, byrow = TRUE)

> mat

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

> s1 <- mat[1,]; s1

[1] 1 2 3 4

> dim(s1)

NULL

> s2 <- mat[1,,drop=FALSE]; s2

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

> dim(s2)

[1] 1 4

103

Combined selection

Suppose we want to get all the columns for which the element at the first row is less than 3:

> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

> mycols <- mat[1,] < 3; mycols
[1]  TRUE  TRUE FALSE FALSE

> mat[ , mycols, drop=FALSE]
     [,1] [,2]
[1,]    1    2
[2,]    5    6
[3,]    9   10

104

Using SQL statements to manipulate data frames

# install the package

> install.packages("sqldf")

> library(sqldf)

> newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE)

> newdf

mpg cyl disp hp drat wt qsec vs am gear carb

Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1

Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1

Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1

Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1

Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1

Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1

105

Exercise 1/8

1) Create a vector x representing the numbers from 1 to 11

2) Save x into the x.RData file

3) Remove the object x from R workspace

4) Import the x.RData file into R and save it to x.

Please google it

Exercise 2/8

1) Create the following data frame dfA

ID Case  Number
1  case1 10
2  case2 20
3  case3 30

2) Save dfA into the dfA.csv file

Exercise 3/8

1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard

2) Import the dataset into R using openxlsx package

3) Show the first and last 20 lines of the dataset, respectively

4) Obtain the column names of the dataset

5) Create a new data frame which has the same column names of the dataset and has

the first and last 20 lines of the dataset

Exercise 4/8

1) Download the file AMZN.csv from Blackboard

2) Import the dataset into R

3) Show the class of all columns/fields

4) Create a new data frame where Open <= 570 and Close >= 550

5) Sort the data frame by High (in decreasing order)

6) Create another data frame where Close >= Open

Exercise 5/8

1) Import data from url "http://www.w3schools.com/xml/plant_catalog.xml"

2) Use the xmlTreeParse function to parse the xml file directly from the web

3) Use the xmlRoot function to access the top node

Exercise 6/8

1) Install the MASS package

2) Find Cars93 dataset

3) Extract all the records for the Volkswagen from the field Manufacturer

4) Order the extracted records by Price (ascending) and save the result to a data frame

5) Write the data frame to Cars93FilteredData.csv

Exercise 7/8

1) Use SQL statements to manipulate data frame as required in Exercise 4/6

2) Write the data frame to Cars93FilteredData.RData

Exercise 8/8

1) Download the files iris1.csv and iris2.csv from Blackboard

2) Import these two files into R

3) Combine these two datasets into one data frame

4) Calculate the mean value of every column

5) What will you do with missing values?

4. Graphics

114

Exploratory graphs

If you are familiar with statistical graphical representations, please

skip this part

Pie chart

[Figure: pie chart “taxs”, one wedge per state (AL, AR, AZ, CA, CO, CT, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI). Dataset: Cigarette]

A pie chart shows the relative frequencies or percentages of the levels of a categorical variable as wedges of a circle.

It is very useful in a well-designed document intended for people who will not read the underlying data (e.g., management).

Scatter plot

With a scatter plot a mark,

usually a dot or small circle,

represents a single data point.

With one mark (point) for every

data point a visual distribution

of the data can be seen.

Depending on how tightly the

points cluster together, you may

be able to discern a clear trend

in the data.
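A scatter plot with a fitted trend line can be sketched in base R as follows. This assumes the built-in AirPassengers time series; the data frame air and its column names are introduced here only for illustration.

```r
# Build a plain data frame from the built-in AirPassengers time series.
air <- data.frame(
  date       = as.numeric(time(AirPassengers)),
  passengers = as.numeric(AirPassengers)
)

# One point per observation.
plot(air$date, air$passengers,
     xlab = "Date", ylab = "Number of air passengers")

# Fit a straight line by least squares and draw it over the points.
fit <- lm(passengers ~ date, data = air)
abline(fit, col = "red")
coef(fit)   # intercept and slope of the fitted line
```

The upward slope of the fitted line summarises the trend that the point cloud suggests.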

[Figure: scatter plot of AirPassengers with linear trend line y = 31.887x - 62057. Dataset: AirPassengers; x-axis: Date; y-axis: Number of air passengers]

Line plot

A line plot provides an excellent

way to map independent and

dependent variables that are both

quantitative.

It is clear to see how things are

going by the rises and falls a line

plot shows.

[Figure: line plot of AirPassengers. Dataset: AirPassengers; x-axis: Date; y-axis: Number of air passengers]

Multiple line plot

Multiple line plots have space-

saving characteristics. Because

the data values are marked by

small marks (points) and not

bars, they do not have to be

offset from each other (only

when data values are very dense does this become a problem).

[Figure: multiple line plot of cumulative percentage (y-axis, 0 to 1) against day (x-axis, 1 to 96) for Stock1, Stock2 and Stock3. Dataset: StockShare]

Area chart/graph

An area chart/graph displays quantitative data graphically, with the area between the line and the axis filled in.

[Figure: area chart of cumulative percentage (y-axis, 0% to 100%) against day (x-axis, 1 to 96). Dataset: StockShare]

Bar chart

A bar plot is a chart that shows

grouped data with rectangular

bars with lengths proportional to

the values that they show. The

bars can be plotted vertically or

horizontally.

It is one of the best methods to

summarise categorical data.

[Figure: bar chart of percentage (y-axis, 0 to 0.9) against day (x-axis, 1 to 10). Dataset: StockShare]

Histogram

A histogram is a graphical

representation of the distribution of

quantitative data. It is an estimate

of the probability distribution of a

quantitative variable and was first introduced by Karl Pearson.

[Figure: histogram of the adjusted closing price (x-axis, 40 to 60) against frequency (y-axis, 0 to 25). Dataset: MSFT]

Histogram with

distribution fit

A histogram with a distribution

fit is normally used to show the

empirical distribution of the

variable. Sometimes, we use

the Normal/Gaussian distribution to fit the histogram.

[Figure: histogram of the adjusted closing price (x-axis, 40 to 60) against frequency (y-axis, 0 to 25) with a fitted distribution curve. Dataset: MSFT]
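A histogram with a Normal fit can be sketched in base R as follows. The data are simulated (the MSFT prices are not available here), and the fit simply uses the sample mean and standard deviation, which is one common choice rather than the only one.

```r
set.seed(1)
# Simulated values standing in for a price series (an assumption)
x <- rnorm(500, mean = 50, sd = 4)

# Estimate the Normal parameters before plotting
mu <- mean(x)
s <- sd(x)

# freq = FALSE puts the histogram on the density scale,
# so that the fitted density curve is comparable
hist(x, freq = FALSE, main = "Histogram with distribution fit", xlab = "x")
curve(dnorm(x, mean = mu, sd = s), add = TRUE, col = "red")
```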

Base plotting system in R

Dataset (1/3)

> data(Chem97, package = "mlmRev")

> head(Chem97)

lea school student score gender age gcsescore gcsecnt

1 1 1 1 4 F 3 6.625 0.3393157

2 1 1 2 10 F -3 7.625 1.3393157

3 1 1 3 10 F -4 7.250 0.9643157

4 1 1 4 10 F -2 7.500 1.2143157

5 1 1 5 8 F -1 6.444 0.1583157

6 1 1 6 10 F 4 7.750 1.4643157


Dataset (2/3)

> data(iris)

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa


Dataset (3/3)

> data(EuStockMarkets)

> EuStockMarkets <- data.frame(EuStockMarkets)

> head(EuStockMarkets)

DAX SMI CAC FTSE

1 1628.75 1678.1 1772.8 2443.6

2 1613.63 1688.5 1750.5 2460.2

3 1606.51 1678.6 1718.0 2448.2

4 1621.04 1684.1 1708.1 2470.4

5 1618.16 1686.6 1723.1 2484.7

6 1610.61 1671.6 1714.3 2466.8


Histogram (1/2)

> hist(Chem97$gcsescore)

Histogram (2/2)

> hist(
+ Chem97$gcsescore,
+ main = "Histogram",
+ xlab = "gcsescore",
+ ylab = "Frequency",
+ col = "green"
+ )

Boxplot (1/2)

> boxplot(Chem97$gcsescore,

+ main = 'title',

+ ylab = 'gcsescore')

Boxplot (2/2)

> boxplot(
+ Chem97$gcsescore,
+ Chem97$age,
+ main = 'title',
+ ylab = 'value',
+ names = c('gcsescore','age')
+ )

Scatter plot (1/3)

> plot(

+ Chem97$gcsescore,

+ Chem97$gcsecnt,

+ main = "title",

+ xlab = "gcsescore",

+ ylab = 'gcsecnt',

+ col = "blue"

+ )


Scatter plot (2/3)

> pairs(iris)

Scatter plot (3/3)

> pairs(iris, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Line plot (1/3)

> plot(

+ EuStockMarkets$DAX,

+ type = "l",

+ main = 'EuStockMarkets',

+ xlab = 'Day',

+ ylab = 'DAX'

+ )

Line plot (2/3)

> plot(
+ EuStockMarkets$DAX,
+ type = "l", col = 'red',

+ xlab = 'Day', ylab = 'Price'

+ )

> lines(EuStockMarkets$FTSE,

+ type = "l", col = 'blue')

> title("EuStockMarkets", cex.main = 1.1)

> legend(

+ 100, 5500, c("DAX", "FTSE"),

+ col = c('red', 'blue'),

+ text.col = "black",

+ lty = c(1,1), merge = TRUE

+ )

Line plot (3/3)

> plot(

+ EuStockMarkets$DAX,

+ EuStockMarkets$CAC,

+ type = "l",

+ main = 'EuStockMarkets',

+ xlab = 'DAX',

+ ylab = 'CAC'

+ )


Exercise 1/5

1) Create a vector x from a series 1 to 1000

2) Create a vector y from a series 12 to 10002

3) Generate the following scatter plot, with x on the x-axis and y on the y-axis

Exercise 2/5

1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)

2) Each variable has 500 observations (i.e., 500 rows)

3) x follows a standard normal distribution N(0,1)

4) y follows a continuous uniform distribution U[0,1]

5) z follows a Poisson distribution Poisson(0.5)

6) Generate a pairs plot for x, y, z

Please google the pairs function

Exercise 3/5

Plot the following figure, where x is in [0, 2𝜋] and y is sin(𝑥)

Exercise 4/5

1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard

2) Import the dataset into R using openxlsx package

3) Save the data frame into df

4) Plot the histogram of df (the same as the one on the right)

hint: a) bin width; b) values on x-axis

Exercise 5/5

1) Download the file AMZN.csv from Blackboard

2) Import the dataset into R

3) Plot the multiple lines figure as below

R graphics packages:

lattice & ggplot2

Author

lattice was developed and

maintained by Deepayan Sarkar,

Assistant Professor at Indian

Statistical Institute.

http://www.isid.ac.in/~deepayan/


Histogram by wrap

> library(lattice)

> pl <- histogram(~ gcsescore |

+ factor(score), data = Chem97)

> print(pl)


Density by wrap

> pl <- densityplot(

+ ~ gcsescore | factor(score),

+ data = Chem97,

+ plot.points = FALSE,

+ ref = TRUE

+ )

> print(pl)


Density plot by different colour

> pl <- densityplot(

+ ~ gcsescore,

+ data = Chem97,

+ groups = score,

+ plot.points = FALSE,

+ ref = TRUE,

+ auto.key = list(columns = 3)

+ )

> print(pl)


boxplot by wrap (1/2)

> pl <- bwplot(

+ gcsescore ^ 2.34 ~ gender | factor(score),

+ Chem97,

+ varwidth = TRUE,

+ layout = c(6, 1),

+ ylab = "Transformed GCSE score"

+ )

> print(pl)


There are many other functions in lattice

The references below will be useful:

• http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf

• https://www.stat.auckland.ac.nz/~paul/RGraphics/chapter4.pdf

• https://fas-web.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm


Author

ggplot2 was developed by Hadley

Wickham, Chief Scientist at RStudio, and

an Adjunct Professor of Statistics at the

University of Auckland.

http://hadley.nz/


Histogram by wrap

> library(ggplot2)

> pg <-

+ ggplot(Chem97, aes(gcsescore)) +

+ geom_histogram(binwidth = 0.5) +

+ facet_wrap( ~ score)

> print(pg)


Density plot by wrap

> pg <- ggplot(Chem97, aes(gcsescore)) +

+ stat_density(geom = "path",

+ position = "identity") +

+ facet_wrap(~ score)

> print(pg)


Density plot by different colour

> pg <- ggplot(Chem97, aes(gcsescore)) +

+ stat_density(geom = "path",

+ position = "identity",

+ aes(colour = factor(score)))

> print(pg)

Boxplot by wrap (1/2)

> pg <- ggplot(Chem97,

+ aes(factor(gender),

+ gcsescore^2.34)) +

+ geom_boxplot() +

+ facet_grid(~score) +

+ ylab("Transformed GCSE score")

> print(pg)

Boxplot by wrap (2/2)

> pg <- ggplot(Chem97,

+ aes(factor(score),

+ gcsescore)) +

+ geom_boxplot() +

+ coord_flip() +

+ ylab("Average GCSE score") +

+ facet_wrap( ~ gender)

> print(pg)


There are many other functions in ggplot2

The references below will be useful:

• http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf

• http://www.statmethods.net/advgraphs/ggplot2.html

• http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/

• http://www.stat.wisc.edu/~larget/stat302/chap2.pdf


Exercise 1/11

Use lattice or ggplot package to draw the figure as below

> data(postdoc, package = "latticeExtra")

> pl <- barchart(prop.table(postdoc, margin = 1),

+ xlab = "Proportion",

+ auto.key = list(adj = 1))

> print(pl)

Exercise 2/11

1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx

2) Plot the following figure using lattice package

Hint: xyplot

Exercise 3/11

1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx

2) Plot the following figure using ggplot2 package

Hint: 1) ggplot; 2) plot points; 3) by wrap

Exercise 4/11

1) Read the dataset “Chem97” from “mlmRev” package

2) Plot the following figure using ggplot2 package

Exercise 5/11



Exercise 6/11



Exercise 7/11



Exercise 8/11

1) Read the dataset “Chem97” from

“mlmRev” package

2) Plot the following figure using

ggplot2 package

Exercise 9/11



2) Plot the following figure using ggplot2

package

Exercise 10/11



2) Plot the following figure using ggplot2

package

Exercise 11/11


2) Plot the following figure using lattice package

5. Handling and processing strings

Empty string

An empty string can be produced by

consecutive quotation marks: ""

> empty_str = ""

> empty_str

[1] ""

> class(empty_str)

[1] "character"


Vector of empty strings

character() will produce a character

vector with as many empty strings as

the length given as its argument

> # vector with 5 empty strings

> char_vector = character(5)

> char_vector

[1] "" "" "" "" ""


is.character() and as.character()

as.character() and is.character() are

generic methods for creating and testing

for objects of type "character"

> a = "test me"

> b = 8 + 9

> # are 'a' and 'b' characters?

> is.character(a)

[1] TRUE

> is.character(b)

[1] FALSE
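The transcript above only exercises is.character(). A short sketch of the companion as.character(), reusing the same value of b, might look like this (the name b_chr is introduced here for illustration):

```r
b <- 8 + 9

# Convert the number to a character string
b_chr <- as.character(b)
b_chr                  # "17"
is.character(b_chr)    # TRUE

# Converting back to a number makes arithmetic possible again
as.numeric(b_chr) + 1  # 18
```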


c() for character vector

As you can tell, the resulting vector from

combining integers (1:5), the number pi,

and some "text" is a vector with all its

elements treated as character strings. In

other words, when we combine mixed

data in vectors, strings will dominate.

> a <- c("x", "y", "c")

> a

[1] "x" "y" "c"

> b <- c(1:5, pi, "text")

> b

[1] "1"

[2] "2"

[3] "3"

[4] "4"

[5] "5"

[6] "3.14159265358979"

[7] "text"


paste()

paste() takes one or more R objects,

converts them to "character", and then it

concatenates (pastes) them to form one or

several character strings

> PI = paste("The life of", pi)

> PI

[1] "The life of 3.14159265358979"

> IloveR = paste("I", "love", "R")

> IloveR

[1] "I love R"

> IloveR = paste0("I", "love", "R")

> IloveR

[1] "IloveR"

> IloveR = paste("I", "love", "R", sep = "-")

> IloveR

[1] "I-love-R"

> paste(1:3, c("!", "?", "+"), sep = "",

+ collapse = "")

[1] "1!2?3+"

Printing characters

Function Description

print() Generic printing

noquote() Print with no quotes

cat() Concatenation

format() Special formats

toString() Convert to string

sprintf() C-style formatted printing
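A brief illustration of the printing functions listed above. The input values are arbitrary, and the variable names are introduced here only so the results can be inspected.

```r
s <- "I love R"

print(s)       # [1] "I love R"
noquote(s)     # [1] I love R
cat(s, "\n")   # I love R  (no index, no quotes)

# format() pads to a given width; toString() joins with ", "
padded <- format(42, width = 6)     # "    42"
joined <- toString(1:5)             # "1, 2, 3, 4, 5"

# sprintf() uses C-style format specifications
rounded <- sprintf("%.2f", pi)      # "3.14"
```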


Basic string manipulations

Function Description

nchar() Number of characters

tolower() Convert to lower case

toupper() Convert to upper case

casefold() Case folding

chartr() Character translation

abbreviate() Abbreviation

substring() Substrings of a character vector

substr() Substrings of a character vector
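The functions in this table can be tried on a single string. The example string and the variable names below are chosen purely for illustration.

```r
s <- "Data Science"

n <- nchar(s)                     # 12
lower <- tolower(s)               # "data science"
upper <- toupper(s)               # "DATA SCIENCE"

# chartr() translates characters one-for-one: every "a" becomes "A"
translated <- chartr("a", "A", s) # "DAtA Science"

# substr() needs both start and stop; substring() defaults to the end
first <- substr(s, 1, 4)          # "Data"
rest <- substring(s, 6)           # "Science"
```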


Set operations

Function Description

union() Set union

intersect() Intersection

setdiff() Set difference

setequal() Equal sets

identical() Exact equality

is.element() Is element

sort() Sorting

paste(rep()) Repetition
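A small sketch of the set operations applied to character vectors. The vectors set1 and set2 are hypothetical examples.

```r
set1 <- c("some", "random", "words")
set2 <- c("some", "many", "none")

u <- union(set1, set2)        # "some" "random" "words" "many" "none"
i <- intersect(set1, set2)    # "some"
d <- setdiff(set1, set2)      # "random" "words"
e <- is.element("many", set2) # TRUE
srt <- sort(set1)             # "random" "some" "words"

# Repetition: repeat a string and glue the copies together
rp <- paste(rep("x", 4), collapse = "")  # "xxxx"
```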


setequal() vs identical()

> set7 = c("some", "random", "string")
> set8 = c("some", "random", "none", "few")
> set9 = c("string", "some", "random")
> setequal(set7, set8)
[1] FALSE
> setequal(set7, set9)
[1] TRUE
> identical(set7, set7)
[1] TRUE
> identical(set7, set9)
[1] FALSE

stringr package

Thanks to Hadley Wickham, we have the

package stringr that adds more

functionality to the base functions for

handling strings in R.

stringr provides functions for:

1) Basic manipulations

2) Regular expression operations.

http://hadley.nz/


Basic string manipulations in stringr

Function Description Similar to

str_c() string concatenation paste()

str_length() number of characters nchar()

str_sub() extracts substrings substring()

str_dup() duplicates characters

str_trim() removes leading and trailing whitespace

str_pad() pads a string

str_wrap() wraps a string paragraph strwrap()


paste() vs str_c()

> paste("University", "of", "Lincoln")

[1] "University of Lincoln"

> paste("University", "of", "Lincoln", NULL)

[1] "University of Lincoln "

> paste("University", "of", "Lincoln", character(0))

[1] "University of Lincoln "

> library(stringr)

> str_c("University", "of", "Lincoln")

[1] "UniversityofLincoln"

> str_c("University", "of", "Lincoln", NULL)

[1] "UniversityofLincoln"

> str_c("University", "of", "Lincoln", character(0))

[1] "UniversityofLincoln"

nchar() vs str_length()

> nchar("The life of PI")

[1] 14

> str_length("The life of PI")

[1] 14

>

> text_str = c("one", "two", "three", NA)

> nchar(text_str)

[1] 3 3 5 2

> # note: since R 3.3.0, nchar() returns NA for a missing string, so this prints 3 3 5 NA

> str_length(text_str)

[1] 3 3 5 NA


str_sub()

> hw <- "Hadley Wickham"

> str_sub(hw, 1, 6)

[1] "Hadley"

> str_sub(hw, end = 6)

[1] "Hadley"

> str_sub(hw, 8, 14)

[1] "Wickham"

> str_sub(hw, 8)

[1] "Wickham"

> str_sub(hw, c(1, 8), c(6, 14))

[1] "Hadley" "Wickham"

> str_sub(hw, 1:3)

[1] "Hadley Wickham" "adley Wickham" "dley Wickham"

What is a regular expression?

A regular expression (regex or regexp for short) is a pattern describing a certain amount

of text. Basically, it is a way for a computer user or programmer to express how a

computer program should look for a specified pattern in text and then what the program

is to do when each pattern match is found.
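As a first taste, the sketch below uses one pattern both to look for matches and to act on them. The input vector is hypothetical, and the pattern [0-9] means "any single digit".

```r
x <- c("apple 12", "banana", "cherry 7")

# Look for the pattern: does each string contain a digit?
has_digit <- grepl("[0-9]", x)   # TRUE FALSE TRUE

# Act on each match: delete every digit
cleaned <- gsub("[0-9]", "", x)  # "apple " "banana" "cherry "
```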


Functions of regex in R

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)


Regular expression functions

Function Description

grep() Find regex matches and return (index or value)

grepl() Find regex matches and return (TRUE & FALSE)

sub() Replace the first match

gsub() Replace all the matches

regexpr() Find regex matches (position of the first match)

gregexpr() Find regex matches (position of all matches)

regexec() Find regex matches (hybrid of regexpr() and gregexpr())

strsplit() Split regex matches


Metacharacters in R (1/2)

There are some special characters that have a reserved status and they are known as metacharacters.

The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

In R, we need to escape them with a double backslash \\ when we want to match them literally in a regex pattern.

Metacharacter Escape in R

. \\.

$ \\$

* \\*

+ \\+

? \\?

| \\|

\ \\\\

^ \\^

[ \\[

] \\]

{ \\{

} \\}

( \\(

) \\)
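The effect of escaping can be seen on the dot, which otherwise matches any character. The sketch below also shows fixed = TRUE, an argument of sub() that treats the pattern as a literal string, as an alternative to escaping. The variable names are illustrative.

```r
x <- "Peace.Love"

# Unescaped "." is a metacharacter: it matches the first character, "P"
unescaped <- sub(".", "_", x)              # "_eace.Love"

# Escaped "\\." matches a literal dot
escaped <- sub("\\.", "_", x)              # "Peace_Love"

# fixed = TRUE switches regex interpretation off altogether
literal <- sub(".", "_", x, fixed = TRUE)  # "Peace_Love"
```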

Metacharacters in R (2/2)

> money = "$money"

>

> sub(pattern = "$", replacement = "XXXXXX", x = money)

[1] "$moneyXXXXXX"

> money = "$money"

>

> sub(pattern = "\\$", replacement = "XXXXXX", x = money)

[1] "XXXXXXmoney"


Sequences (1/4)

Sequence Description

\\d match a digit character

\\D match a non-digit character

\\s match a space character

\\S match a non-space character

\\w match a word character

\\W match a non-word character

\\b match a word boundary

\\B match a non-(word boundary)

\\h match a horizontal space

\\H match a non-horizontal space

\\v match a vertical space

\\V match a non-vertical space

Sequences (2/4)

> sub("\\d", "_", "the dandelion war 2010")

[1] "the dandelion war _010"

> gsub("\\d", "_", "the dandelion war 2010")

[1] "the dandelion war ____"

>

> sub("\\D", "_", "the dandelion war 2010")

[1] "_he dandelion war 2010"

> gsub("\\D", "_", "the dandelion war 2010")

[1] "__________________2010"


Sequences (3/4)

> # replace space with "_"
> sub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion war 2010"
> gsub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion_war_2010"
>
> # replace non-space with "_"
> sub("\\S", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\S", "_", "the dandelion war 2010")
[1] "___ _________ ___ ____"

Sequences (4/4)

> # replace word boundary with "_"

> sub("\\b", "_", "the dandelion war 2010")

[1] "_the dandelion war 2010"

> gsub("\\b", "_", "the dandelion war 2010")

[1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"

> # replace non-(word boundary) with "_"

> sub("\\B", "_", "the dandelion war 2010")

[1] "t_he dandelion war 2010"

> gsub("\\B", "_", "the dandelion war 2010")

[1] "t_he d_an_de_li_on w_ar 2_01_0"


Some regex character classes (1/2)

Class Description

[aeiou] Match any one lower case vowel

[AEIOU] Match any one upper case vowel

[0123456789] Match any digit

[0-9] Match any digit (same as previous class)

[a-z] Match any lower case ASCII letter

[A-Z] Match any upper case ASCII letter

[a-zA-Z0-9] Match any of the above classes

[^aeiou] Match anything other than a lowercase vowel

[^0-9] Match anything other than a digit

Some regex character classes (2/2)

> # some string
> transport = c("car", "bike", "plane", "boat")
> # look for e or i
> grep(pattern = "[ei]", transport, value = TRUE)
[1] "bike" "plane"
>
> # some numeric strings
> numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
> grep(pattern = "[01]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[0-9]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[^0-9]", numerics, value = TRUE)
[1] "17-April" "I-II-III" "R 3.0.1"

POSIX character classes (1/2)

Notation Description

[[:lower:]] Lower-case letters

[[:upper:]] Upper-case letters

[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])

[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])

[[:blank:]] Blank characters: space and tab

[[:cntrl:]] Control characters

[[:punct:]] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

[[:space:]] Space characters: tab, newline, vertical tab, form feed, carriage return, and space

[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f

[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)

[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])

POSIX character classes (2/2)

> # la vie (string)
> la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you print la_vie
> print(la_vie)
[1] "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you cat la_vie
> cat(la_vie)
La vie en #FFC0CB (rose);
Cest la vie! 	tres jolie
>
> # remove blank characters (spaces and tabs)
> gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
[1] "Lavieen#FFC0CB(rose);\nCestlavie!tresjolie"
> # remove punctuation
> gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"

Quantifiers (1/2)

Quantifier Description

* The preceding item will be matched zero or more times

+ The preceding item will be matched one or more times

? The preceding item is optional and will be matched at most once

{n} The preceding item is matched exactly n times

{n,} The preceding item is matched n or more times

{n,m} The preceding item is matched at least n times, but not more than m times
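The range quantifiers {n,} and {n,m} can be checked on a small test vector. This is a sketch with illustrative variable names, in the same spirit as the other quantifier examples in this section.

```r
strings <- c("ab", "acb", "accb", "acccb", "accccb")

# "c" repeated two or more times between "a" and "b"
two_or_more <- grep("ac{2,}b", strings, value = TRUE)
# "accb" "acccb" "accccb"

# "c" repeated at least once but not more than twice
one_to_two <- grep("ac{1,2}b", strings, value = TRUE)
# "acb" "accb"
```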


Quantifiers (2/2)

> strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
> grep("ac*b", strings, value = TRUE)
[1] "ab" "acb" "accb" "acccb" "accccb"
> grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
> grepl("ac*b", strings)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
> grep("ac+b", strings, value = TRUE)
[1] "acb" "accb" "acccb" "accccb"
> grep("ac?b", strings, value = TRUE)
[1] "ab" "acb"
> grep("ac{2}b", strings, value = TRUE)
[1] "accb"

Regex functions in stringr

Function Description

str_detect() Detect the presence or absence of a pattern in a string

str_extract() Extract first piece of a string that matches a pattern

str_extract_all() Extract all pieces of a string that match a pattern

str_match() Extract first matched group from a string

str_match_all() Extract all matched groups from a string

str_locate() Locate the position of the first occurrence of a pattern in a string

str_locate_all() Locate the position of all occurrences of a pattern in a string

str_replace() Replace first occurrence of a matched pattern in a string

str_replace_all() Replace all occurrences of a matched pattern in a string

str_split() Split up a string into a variable number of pieces

str_split_fixed() Split up a string into a fixed number of pieces

Exercise 1/3

# dollar
sub("\\$", "", "$Peace-Love")

# dot
sub("\\.", "", "Peace.Love")

# plus
sub("\\+", "", "Peace+Love")

# caret
sub("\\^", "", "Peace^Love")

# vertical bar
sub("\\|", "", "Peace|Love")

# opening round bracket
sub("\\(", "", "Peace(Love)")

# closing round bracket
sub("\\)", "", "Peace(Love)")

# opening square bracket
sub("\\[", "", "Peace[Love]")

# closing square bracket
sub("\\]", "", "Peace[Love]")

# opening curly bracket
sub("\\{", "", "Peace{Love}")

Exercise 2/3

# replace word character with "_"

sub("\\w", "_", "the dandelion war 2010")

gsub("\\w", "_", "the dandelion war 2010")

# replace non-word character with "_"

sub("\\W", "_", "the dandelion war 2010")

gsub("\\W", "_", "the dandelion war 2010")


Exercise 3/3

# people names
people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus",
"jacob", "youna", "flora", "adi")

# match "m" at most once
grep(pattern = "m?", people, value = TRUE)

# match "m" exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)

# match "m" zero or more times, and "t"
grep(pattern = "m*t", people, value = TRUE)

# match "t" zero or more times, and "m"
grep(pattern = "t*m", people, value = TRUE)

6. Linear regression

Example (1/13)

This is an example of implementing linear regression models in R.

We will use the R dataset Cars93 in the MASS library

> library(MASS)

> df <- Cars93

> dim(df)

[1] 93 27

Use the dim() function to see the size of the data. There are 93

observations and 27 variables in the dataset

Example (2/13)

> head(df,3)

Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain

1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front

2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front

3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front

Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length

1 4 1.8 140 6300 2890 Yes 13.2 5 177

2 6 3.2 200 5500 2335 Yes 18.0 5 195

3 6 2.8 172 5500 2280 Yes 16.9 5 180

Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make

1 102 68 37 26.5 11 2705 non-USA Acura Integra

2 115 71 38 30.0 15 3560 non-USA Acura Legend

3 102 67 37 28.0 14 3375 non-USA Audi 90

Use the head() function to look at a few

sample observations of the data. This is an

important step in data analysis!

Example (3/13)

> sapply(df, class)

Manufacturer Model Type Min.Price Price Max.Price

"factor" "factor" "factor" "numeric" "numeric" "numeric"

MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize

"integer" "integer" "factor" "factor" "factor" "numeric"

Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers

"integer" "integer" "integer" "factor" "numeric" "integer"

Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room

"integer" "integer" "integer" "integer" "numeric" "integer"

Weight Origin Make

"integer" "factor" "factor"

Use sapply() to look at the

data type of each variable

Example (4/13)

> plot(df$Horsepower, df$Price,

+ xlab = "Horsepower",

+ ylab = "Price")

Let’s look at two variables of cars:

horsepower and price. Are they

correlated?
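One way to put a number on what the scatter plot suggests is the Pearson correlation coefficient from cor(). This is a small sketch that reads Cars93 directly from MASS, so it does not depend on the df object defined earlier; the name r is chosen here for illustration.

```r
library(MASS)  # provides the Cars93 data

# Pearson correlation between horsepower and price
r <- cor(Cars93$Horsepower, Cars93$Price)
r

# For a simple linear regression, r^2 equals the R-squared of the fit
r^2
```

A value of r close to 1 indicates a strong positive linear association between the two variables.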

Example (5/13)

> # Simple linear regression (method 2) -----------------
> # x and y are the predictor and response used in the formula below
> x <- df$Horsepower
> y <- df$Price
> model <- lm(y ~ x)
> model$coefficients
(Intercept) x
-1.3987691 0.1453712
> beta0 <- model$coefficients[1]
> beta1 <- model$coefficients[2]
>
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
> y_hat_vec <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)
> legend(50,
+ 30,
+ lty = 2,
+ col = 4,
+ "Regression line")

Estimate the parameters of a simple linear

regression model by using the R function lm()

> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.4100 -2.7920 -0.8208 0.0000 1.8030 31.7500

Example (6/13)

> summary(model)

Call:

lm(formula = y ~ x)

Residuals:

Min 1Q Median 3Q Max

-16.413 -2.792 -0.821 1.803 31.753

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.3988 1.8200 -0.769 0.444

x 0.1454 0.0119 12.218 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.977 on 91 degrees of freedom

Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171

F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16

The residual here means the error $y_i - \hat{y}_i$, i.e. the observed value minus the fitted value.




Example (7/13)

Std. Error is the standard deviation of the sampling

distribution of the coefficient estimate under

standard regression assumptions.

It should be noted that you are not required to

understand how standard errors are calculated.

However, if you are interested, please read

Casella’s book Chapters 11-12




Example (8/13)

• t value is the t-statistic value for testing

whether the corresponding regression

coefficient is different from 0.

• Pr(>|t|) is the p-value for the hypothesis test

on the t value. The null hypothesis is that the

coefficient is zero.

It should be noted that you are not required to

understand how the t value and p-value are calculated.






Example (9/13)

R-squared is a statistical measure of how close

the data are to the fitted regression line. It is also

known as the coefficient of determination,

simply defined by

$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$

In general, the higher the R-squared, the better

the model fits your data.

It should be noted that you are not required to

understand how R-squared, multiple R-squared,

adjusted R-squared and their tests are calculated.
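For readers who do want to see the arithmetic, the definition can be checked by hand. The sketch below refits the simple regression of price on horsepower with a data argument (the names fit, sse, sst and r2 are introduced here) and uses the equivalent form 1 - unexplained variation / total variation.

```r
library(MASS)  # provides the Cars93 data

fit <- lm(Price ~ Horsepower, data = Cars93)

# Unexplained variation: sum of squared residuals
sse <- sum(residuals(fit)^2)

# Total variation: squared deviations of the response from its mean
sst <- sum((Cars93$Price - mean(Cars93$Price))^2)

# Explained / total = 1 - unexplained / total
r2 <- 1 - sse / sst
r2
```

The result should agree with the Multiple R-squared value reported by summary().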





Example (10/13)

Prediction

If a new Audi A4 has 175 horsepower, what is

the selling price of this Audi A4?

> # Prediction ------------------------------------------

>

> x_i <- 175

> y_hat_i <- beta1 * x_i + beta0

>

> plot(df$Horsepower, df$Price,

+ xlab = "Horsepower",

+ ylab = "Price")

> y_hat <- beta1 * df$Horsepower + beta0

> lines(df$Horsepower, y_hat, lty = 2, col = 4)

> points(x_i, y_hat_i, col = 2, pch=9)

> legend(75,

+ 50,

+ lty = c(2,NA),

+ pch = c(NA,9),

+ col = c(4,2),

+ c("Regression line", "New Audi A4"))
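The same prediction can be obtained with predict(), which avoids handling the coefficients manually. This sketch assumes the model is refitted with a data argument so that the new observation can be supplied by column name; fit, new_car, pred and pred_int are names introduced here.

```r
library(MASS)  # provides the Cars93 data

fit <- lm(Price ~ Horsepower, data = Cars93)

# New observation: a car with 175 horsepower
new_car <- data.frame(Horsepower = 175)

# Point prediction (Cars93 prices are in thousands of US dollars)
pred <- predict(fit, newdata = new_car)
pred

# The same prediction with a 95% prediction interval
pred_int <- predict(fit, newdata = new_car, interval = "prediction")
pred_int
```

The prediction interval is a reminder that a single new car can deviate considerably from the regression line.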

Example (11/13)

> attach(df)

> pairs(

+ data.frame(

+ MPG.city,

+ MPG.highway,

+ EngineSize,

+ Horsepower,

+ Fuel.tank.capacity,

+ Length,

+ Width,

+ Rear.seat.room,

+ Luggage.room

+ )

+ )

> detach(df)

Let’s look at many

variables of cars

Example (12/13)

> attach(df)

> model.multiple <-

+ lm(

+ Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room

+ )

> detach(df)

> model.multiple$coefficients

(Intercept) MPG.city MPG.highway EngineSize Horsepower Fuel.tank.capacity Length

59.1474034 0.2363122 -0.3766282 1.8048313 0.1290087 0.6154648 0.1150924

Width Rear.seat.room Luggage.room

-1.3785983 0.1206144 0.2735771
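An alternative to attach()/detach() is the data argument of lm(), which looks up the variables in the data frame without changing the search path. This is a sketch of that variant rather than the approach used on the slide; fit_multi is a name introduced here.

```r
library(MASS)  # provides the Cars93 data

fit_multi <- lm(
  Price ~ MPG.city + MPG.highway + EngineSize + Horsepower +
    Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room,
  data = Cars93
)

# Coefficients should match those of model.multiple
coef(fit_multi)
```

Rows with missing values are dropped by default in both variants, so the estimates are the same.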

Estimate the parameters of a multiple linear

regression model by using the R function lm()

Example (13/13)

> summary(model.multiple)

Call:

lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)

Residuals:

Min 1Q Median 3Q Max

-11.7444 -3.7098 -0.2932 2.9824 28.7627

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 59.14740 27.51934 2.149 0.03497 *

MPG.city 0.23631 0.44678 0.529 0.59848

MPG.highway -0.37663 0.44106 -0.854 0.39598

EngineSize 1.80483 1.85233 0.974 0.33314

Horsepower 0.12901 0.02576 5.008 3.78e-06 ***

Fuel.tank.capacity 0.61546 0.50620 1.216 0.22801

Length 0.11509 0.11504 1.000 0.32044

Width -1.37860 0.49336 -2.794 0.00666 **

Rear.seat.room 0.12061 0.33957 0.355 0.72348

Luggage.room 0.27358 0.39166 0.699 0.48711

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.868 on 72 degrees of freedom

(11 observations deleted due to missingness)

F-statistic: 17.92 on 9 and 72 DF, p-value: 3.547e-15

7. Logistic regression

Example (1/13)

This is an example of implementing logistic regression models in R.

We will use the Housing.csv dataset

> df <- read.csv("C:/Housing.csv")

> dim(df)

[1] 546 12

Use the dim() function to see the size of the data. There are 546

observations and 12 variables in the dataset

Example (2/13)

> head(df)

price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea

1 420 5850 3 1 2 1 0 1 0 0 1 0

2 385 4000 2 1 1 1 0 0 0 0 0 0

3 495 3060 3 1 1 1 0 0 0 0 0 0

4 605 6650 3 1 2 1 1 0 0 0 0 0

5 610 6360 2 1 1 1 0 0 0 0 0 0

6 660 4160 3 1 1 1 1 1 0 1 0 0

Use the head() function to look at a few sample

(default 6) observations of the data.

Example (3/13)

> lapply(df,class)

$price

[1] "numeric"

$housesize

[1] "integer"

$bedrooms

[1] "integer"

$bathrms

[1] "integer"

$stories

[1] "integer"

$driveway

[1] "integer"

…….

Use lapply() to look at the data type of

each variable (displayed vertically)

Example (4/13)

> summary(df)

price housesize bedrooms bathrms stories driveway

Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000

1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000

Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000

Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859

3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000

Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000

recroom fullbase gashw airco garagepl prefarea

Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000

1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000

Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000

Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344

3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000

Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000

Use summary() to produce summary statistics for every variable

Example (5/13)

> summary(df$price)

Min. 1st Qu. Median Mean 3rd Qu. Max.

250.0 491.2 620.0 681.2 820.0 1900.0

Use summary() to produce summary statistics for one variable at a time

Example (6/13)

Let’s create a graph with two subplots, one for each predictor. This can be very helpful for

understanding the relationship between each predictor and the response variable.

> par(mfrow=c(1, 2))

> plot(df$price, df$fullbase,xlab = "Price",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,

+ pch = 16, col = "green",cex.lab=1.5, cex.axis=1.5,

+ cex.sub=1.5)

> plot(df$housesize, df$fullbase,xlab = "Housesize",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,

+ pch = 16, col = "blue",cex.lab=1.5, cex.axis=1.5,

+ cex.sub=1.5)

Example (7/13)

> model1<-glm(fullbase~price,data=df,family=binomial)

> model1$coefficients

(Intercept) price

-1.622737e+00 1.447098e-03

> plot(df$price, df$fullbase,xlab = "Price",

+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,pch = 16,

+ col = "blue",cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(min(df$price),max(df$price))

> yprice<-predict(model1,list(price=xprice),type="response")

> lines(xprice,yprice)

Fit a logistic regression model using R's built-in glm() function.

Note: The regression line may not be clear because the price variable covers a wide range of values.
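The fitted curve drawn by lines() is the logistic function applied to the linear predictor. The sketch below checks this on simulated data (the data frame sim and the coefficients used to generate it are assumptions for illustration, not the housing data): predict(..., type = "response") and plogis() applied to intercept + slope × price give the same probabilities.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical data: prices in the range 250..1900 and a binary response
sim <- data.frame(price = runif(200, min = 250, max = 1900))
sim$fullbase <- rbinom(200, size = 1,
                       prob = plogis(-1.6 + 0.0014 * sim$price))

fit <- glm(fullbase ~ price, data = sim, family = binomial)

# Probabilities from predict() on a grid of prices
grid <- data.frame(price = seq(250, 1900, by = 50))
p_predict <- predict(fit, newdata = grid, type = "response")

# The same probabilities computed by hand from the coefficients
b <- coef(fit)
p_manual <- plogis(b[["(Intercept)"]] + b[["price"]] * grid$price)

print(all.equal(unname(p_predict), p_manual))
```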

Example (8/13)

# get better regression line plot

> range(df$price)

[1] 250 1900

> plot(df$price, df$fullbase, xlim=c(0,2150),ylim=c(-1,2),

+ xlab = "Price", ylab = "Fullbase", col = "blue",

+ frame.plot=TRUE,cex=1.5,pch = 16,cex.lab=1.5,

+ cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(0,2150)



Fit a logistic regression model using R's built-in glm() function.

In the model summary we see:

• Whether the response variable and the predictor(s) are positively or negatively associated (the sign of each coefficient).

• The 𝑧 value and 𝑝-value for the hypothesis test of whether each coefficient is zero. The null hypothesis is that the coefficient is zero. As the 𝑝-value for price is much less than 0.05, we reject the null hypothesis that 𝛽 = 0.

Example (9/13)

> summary(model1)

Call:

glm(formula = fullbase ~ price, family = binomial, data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6778 -0.8992 -0.8012 1.3529 1.7316

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***

price 0.0014471 0.0003423 4.228 2.36e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 688.28 on 544 degrees of freedom

AIC: 692.28

Number of Fisher Scoring iterations: 4
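The numbers in the coefficient table can be reproduced by hand. The sketch below takes the estimate and standard error for price from the output above and recomputes the z value and two-sided p-value; it also converts the coefficient, which is on the log-odds scale, into an odds ratio. The choice of a 100-pound price step is an assumption made purely for readability.

```r
# Estimate and standard error for price, copied from summary(model1)
b_price  <- 0.0014471
se_price <- 0.0003423

# Wald z statistic and two-sided p-value
z_value <- b_price / se_price
p_value <- 2 * pnorm(-abs(z_value))

# Odds ratio for a 100-pound increase in price (step size is an assumption)
odds_ratio_100 <- exp(100 * b_price)

print(c(z = z_value, p = p_value, odds_ratio_100 = odds_ratio_100))
```

An odds ratio above 1 means the odds of fullbase = 1 rise as price rises, which matches the positive sign of the coefficient.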

Example (10/13)

> summary(model1)

…….

Null deviance: 706.89 on 545 degrees of freedom

Residual deviance: 688.28 on 544 degrees of freedom

AIC: 692.28

…….

Deviance is a measure of the goodness of fit of a regression model (higher numbers indicate a worse fit). The ‘Null deviance’ shows how well the response variable is predicted by a model that includes only the intercept; the ‘Residual deviance’ shows how well it is predicted once the predictors are included.

In R:

model1$null.deviance (find the Null deviance)

model1$deviance (find the Residual deviance)

For example, the null deviance is 706.89 on 545 degrees of freedom. Including the independent variable (price) decreases the deviance to 688.28 on 544 degrees of freedom.

The deviance is reduced by 18.61 at the cost of one degree of freedom.
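One common way to judge whether such a drop in deviance is large is a likelihood-ratio (chi-squared) test. This test is not part of the original slides; the sketch below applies it to the two deviances reported above.

```r
# Deviances and degrees of freedom copied from summary(model1)
null_dev  <- 706.89; null_df  <- 545
resid_dev <- 688.28; resid_df <- 544

# Drop in deviance and the number of parameters added
dev_drop <- null_dev - resid_dev
df_drop  <- null_df - resid_df

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

print(c(dev_drop = dev_drop, df_drop = df_drop, p_value = p_value))
```

With a fitted model at hand, anova(model1, test = "Chisq") reports the same comparison.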

Example (11/13)

> summary(model1)

…….

AIC: 692.28

Number of Fisher Scoring iterations: 4

The Akaike Information Criterion (AIC) provides a method for assessing the quality of your model through comparison with related models (the model with the smallest AIC is the best-fitting one).

Fisher scoring is a variant of Newton’s method for solving maximum likelihood problems numerically.
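For a 0/1 response, the AIC reported by glm() equals the residual deviance plus twice the number of estimated coefficients. The sketch below checks this identity on simulated data (x, y and fit are made up for illustration) and then applies it to the numbers reported for model1.

```r
set.seed(42)  # reproducible simulated data

# Hypothetical binary-response data
x <- rnorm(150)
y <- rbinom(150, size = 1, prob = plogis(0.8 * x))
fit <- glm(y ~ x, family = binomial)

# AIC by hand: residual deviance + 2 * number of coefficients
k <- length(coef(fit))
aic_by_hand <- deviance(fit) + 2 * k
print(all.equal(aic_by_hand, AIC(fit)))

# Same arithmetic with the values from summary(model1):
# residual deviance 688.28, two coefficients (intercept and price)
aic_model1 <- 688.28 + 2 * 2
print(aic_model1)
```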

Example (12/13) Prediction

If a new house has a rental price of 385.00 pounds, what is the probability that fullbase = 1 for this house?

> # Prediction ------------------------------------------

> model1<-glm(fullbase~price,data=df,family=binomial)

> plot(df$price, df$fullbase,xlab = "Price", ylab = "Fullbase",

+ frame.plot=TRUE,cex=1.5,pch = 16, col = "blue",

+ cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)

> xprice<-seq(min(df$price),max(df$price))



> newdata <- data.frame(price = 385.00)

> y_hat_i<-predict(model1, newdata, type="response")

> points(newdata$price, y_hat_i, col = 2, pch=20)
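The predicted probability can also be computed by hand from the coefficients reported in summary(model1). This is a sketch using the rounded estimates shown earlier, so the result may differ slightly in the last digits from what predict() returns on the real data.

```r
# Rounded coefficients copied from summary(model1)
b0 <- -1.6227365   # intercept
b1 <-  0.0014471   # price

new_price <- 385.00

# Linear predictor (log-odds), then the logistic transform
log_odds <- b0 + b1 * new_price
p_hat <- plogis(log_odds)   # same as 1 / (1 + exp(-log_odds))

print(p_hat)
```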

Example (13/13)

> model2<-glm(fullbase~price+housesize,data=df,family=binomial)

> model2$coefficients

(Intercept) price housesize

-1.466744e+00 1.766831e-03 -7.286285e-05

> summary(model2)

Call:

glm(formula = fullbase ~ price + housesize, family = binomial,

data = df)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.7777 -0.8973 -0.7971 1.3701 1.7224

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.467e+00 2.784e-01 -5.269 1.37e-07 ***

price 1.767e-03 4.120e-04 4.289 1.80e-05 ***

housesize -7.286e-05 5.108e-05 -1.427 0.154

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

…….

AIC: 692.19

8. Other topics

Assignment operators: ‘=’ vs. ‘<-’

In R, you can use both ‘=’ and ‘<-’ as assignment operators. So what’s the difference between them, and which one should you use?

What’s the difference?

The main difference between the two assignment operators is scope. It’s easiest to see the difference with an example:

> mean(x=1:10)

[1] 5.5

> x

Error: object 'x' not found

Here x=1:10 only sets the argument x inside the call to mean(), so no x is created in the user workspace.

> mean(x<-1:10)

[1] 5.5

> x

[1] 1 2 3 4 5 6 7 8 9 10

This time the x variable is created in the user workspace.
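The same comparison can be checked programmatically with exists(). In this sketch the helper names exists_after_eq and exists_after_arrow are made up for illustration.

```r
# Start from a state where no x exists in the current environment
if (exists("x", inherits = FALSE)) rm(x)

# "=" inside a call only names the argument; nothing is assigned
m1 <- mean(x = 1:10)
exists_after_eq <- exists("x", inherits = FALSE)

# "<-" inside a call assigns x in the calling environment,
# then passes the value on to mean()
m2 <- mean(x <- 1:10)
exists_after_arrow <- exists("x", inherits = FALSE)

print(c(after_eq = exists_after_eq, after_arrow = exists_after_arrow))
```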

When does the assignment take place? (1/2)

In the code above, you may be tempted to think that we “assign 1:10 to x, then calculate the mean.” This would be true for languages such as C, but it isn’t true in R: arguments are evaluated lazily, i.e., only when the function actually uses them. Consider the function below, which never uses its argument. Notice that the value of a hasn’t changed!

> a <- 1

> f <- function(a) {

+ return(TRUE)

+ }

> res <- f(a <- a + 1); a

[1] 1
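The effect of lazy evaluation can be made explicit with force(), which evaluates an argument immediately. In this sketch the function names ignore_arg and use_arg are made up for illustration.

```r
a <- 1

# Never touches its argument, so the assignment is never evaluated
ignore_arg <- function(arg) {
  TRUE
}

# force() evaluates the argument, so the assignment takes place
use_arg <- function(arg) {
  force(arg)
  TRUE
}

ignore_arg(a <- a + 1)
a_after_ignore <- a   # a is unchanged

use_arg(a <- a + 1)
a_after_use <- a      # a has been incremented once

print(c(a_after_ignore, a_after_use))
```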

When does the assignment take place? (2/2)

In R, the value of a changes only if the function actually evaluates its argument. This can lead to unpredictable behaviour:

> f <- function(a) {

+ if (runif(1) > 0.5)

+ TRUE

+ else

+ a

+ }

> a <- 1

> f(a <- a+1); a

[1] 2

[1] 2

> f(a <- a+1); a

[1] 3

[1] 3

> f(a <- a+1); a

[1] TRUE

[1] 3

Which one should I use? (1/2)

There’s quite a strong following for the “<-” operator:

• The Google R style guide prohibits the use of “=” for assignment.

• Hadley Wickham’s style guide recommends “<-”.

• If you want your code to be compatible with S-plus you should use “<-”

(Note: it seems that S-plus now accepts “=”).

• The general R community recommends using “<-”.

Which one should I use? (2/2)

Some people use the “=” operator for the following reasons:

• Other languages use the “=” operator, e.g., Python, C

• It’s quicker to type “=” than “<-”

• They want the declared variable to exist in the current workspace

• Using “=” avoids ambiguous expressions like if (x[1]<-2), which assigns 2 to x[1] although it could be read as the comparison x[1] < -2
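The ambiguity in the last point depends only on spacing, as this small sketch shows (the vector x and its values are made up for illustration).

```r
x <- c(5, 1)

# With a space after "<", this is a comparison: is x[1] less than -2?
is_less <- x[1] < -2

# Without the space, the same characters are parsed as an assignment
x[1]<-2

print(is_less)
print(x)
```

Writing spaces around operators (x[1] <- 2 or x[1] < -2) makes the intent clear whichever assignment operator is used.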

Computer representation of numbers (1/2)

> a <- sqrt(2)

> a * a == 2

[1] FALSE

> a * a - 2

[1] 4.440892e-16

> all.equal(a * a, 2)

[1] TRUE

Real numbers are not stored exactly on computers. They are stored in a binary version of “scientific” notation (the decimal analogue is, e.g., 1.24 × 10^2), so most values are rounded.

The function all.equal() compares two objects using a numeric tolerance (1.5e-8 by default). If you want much greater accuracy than this you will need to consider error propagation carefully.
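When all.equal() finds a difference it returns a description of the difference rather than FALSE, so inside a condition it is usually wrapped in isTRUE(). The sketch below also shows the equivalent manual check; the tolerance name tol is an assumption for illustration.

```r
a <- sqrt(2)

# Exact comparison fails because of rounding error
exact <- (a * a == 2)

# Tolerance-based comparison, safe to use in if() conditions
near <- isTRUE(all.equal(a * a, 2))

# The same idea by hand, with an explicit tolerance
tol <- sqrt(.Machine$double.eps)   # about 1.5e-8
manual <- abs(a * a - 2) < tol

print(c(exact = exact, near = near, manual = manual))
```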

Computer representation of numbers (2/2)

> x<- seq(0,0.5,0.1)

> x

[1] 0.0 0.1 0.2 0.3 0.4 0.5

> y <- c(0,0.1,0.2,0.3,0.4,0.5)

> y

[1] 0.0 0.1 0.2 0.3 0.4 0.5

> x == y

[1] TRUE TRUE TRUE FALSE TRUE TRUE

> for (i in seq_along(x)) {

+ print(all.equal(x[i], y[i]))

+ }

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE

[1] TRUE
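The loop is not strictly needed, because the comparison can be vectorised. This sketch compares the two vectors element by element with an explicit tolerance; the value 1e-8 is an assumption, chosen to be close to the all.equal() default.

```r
x <- seq(0, 0.5, 0.1)
y <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

# Exact element-wise comparison: at least one element differs
exact <- x == y

# Element-wise comparison within a tolerance: all elements agree
close <- abs(x - y) < 1e-8

print(exact)
print(close)
```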

Assigning a value (1/2)

> x <- c(8, 6, 4)

> x[7] <- 10

> x

[1] 8 6 4 NA NA NA 10

Assigning a value to a nonexistent element

of a vector, matrix, array, or list will

expand that structure to accommodate the

new value.
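The same expansion happens for lists, where the gap is filled with NULL instead of NA. A short sketch (the object names v and lst are made up for illustration):

```r
# Vector: the gap is filled with NA
v <- c(8, 6, 4)
v[7] <- 10
print(v)

# List: the gap is filled with NULL
lst <- list(1, "a")
lst[[4]] <- TRUE
print(length(lst))
print(is.null(lst[[3]]))
```

Because this happens silently, an index that is off by one can grow an object rather than raise an error.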

Assigning a value (2/2)

In R, semicolons between statements are optional, and most people don't bother with them. The risk is that a statement split across lines ends earlier than intended, e.g.,

y <- 2 + 3

+ 5

Here the first statement ends on the first line, i.e., R reads y <- 2 + 3 and then evaluates + 5 as a separate statement. It is better to signal to R that the expression is incomplete by ending the line with the operator, e.g.,

y <- 2 + 3 +

5
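Running both forms shows the difference. In this sketch the names y1 and y2 are used only to keep the two results apart.

```r
# The statement is complete at the end of the first line;
# the second line is a separate expression whose value is discarded
y1 <- 2 + 3
+ 5

# The trailing "+" tells R that the expression continues
y2 <- 2 + 3 +
  5

print(c(y1 = y1, y2 = y2))
```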

Debugging with RStudio

I do not usually recommend R for projects with many dependency files; calling R from other languages such as Java/Python/C++ for the statistical analysis would be a better solution.

Debugging with RStudio is easy (similar to Matlab).

For detailed instructions, see:

https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio

LaTeX

LaTeX is a document preparation system

for high-quality typesetting. It is freely

available for Windows, Mac, and Linux

platforms.

Donald E. Knuth (creator of TeX, on which LaTeX is built)

http://cs.stanford.edu/~uno/

https://latex-project.org/intro.html

Sweave (R + LaTeX)

• Install LaTeX on your PC

• Make sure Sweave is available in RStudio (Sweave itself ships with R in the utils package)

• Download the SweaveDemo.rnw file from Blackboard

• Open the file and compile the PDF as shown on the right!

R Markdown

• Download the RMarkdownDemo.rmd file from Blackboard

• Open the file and compile the HTML as shown below!

Other topics in R that are not covered in our lectures

• Rcpp: R and C++ mixed programming

• rJava: R and Java mixed programming

• rPython: R and Python mixed programming

• Creating your own R package

• R for statistical modelling (gbm, etc.)

• R for machine learning (kernlab, RWeka, caret, nnet, etc.)

• R for time series analysis

• …

Key references

• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.

• P. Teetor (2011) R Cookbook. O’Reilly.

• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.

Thank you!
