A short tutorial on R for data science
Bowei Chen
School of Computer Science
University of Lincoln
2016 - 2017
Preface
This short tutorial gives a practical introduction to R programming for data science. It
is aimed at undergraduate students and practitioners who have no background or experience
in data science or statistics. Note that the tutorial focuses on teaching basic
R data programming skills from scratch rather than data science algorithms.
The tutorial is based on several open-source materials from the R community (see the key
references section for details). It has been used in the workshops of the Data Science
module in the School of Computer Science at the University of Lincoln, UK. The content
takes around 8 hours of study. Thanks to the module demonstrators Deema Abdal Hafeth and
Jingmin Huang, who provided help with the exercises.
2
Table of contents
1. Introduction
2. Data structures
3. Data I/O
4. Graphics
5. Handling and processing strings
6. Linear regression
7. Logistic regression
8. Other topics
3
What is R?
• R is a free software environment for
statistical computing and graphics.
• R compiles and runs on a wide variety of
UNIX platforms, Windows and MacOS.
• R can be downloaded at:
https://cran.r-project.org/
5
Comprehensive R Archive Network (CRAN)
• CRAN includes packages which provide additional functionalities.
• Over 7,801 additional packages (as of January 2016) available at CRAN, Bioconductor,
Omegahat, GitHub, and other repositories.
• R packages are written mainly by academics and company staff.
• The R Foundation is seated in Vienna, Austria and currently hosted by the Vienna
University of Economics and Business. It is a registered association under Austrian law
and active worldwide.
6
Short history of R (1/2)
• S is a statistical programming language developed primarily by John Chambers, Rick
Becker and Allan Wilks at Bell Laboratories since 1976.
• The two modern implementations of S are:
– R: part of the GNU free software project
– S-PLUS (or S+): A commercial product sold by TIBCO Software
7
Short history of R (2/2)
• S-PLUS is a commercial implementation of the S programming language sold by TIBCO
Software Inc.
• R was created by Ross Ihaka and Robert Gentleman at the University of Auckland,
New Zealand, and is currently developed by the R Development Core Team, of which
John Chambers is a member. R is named partly after
the first names of the first two R authors and partly as a play on the name of S.
8
What can you do using R? (1/2)
• Data entry and manipulation
– Input data
• from keyboard
• from spreadsheet
• from another statistics package
– Manipulate data
• Statistical analysis
– Descriptive statistics
– Statistical inference
9
What can you do using R? (2/2)
• Graphical display
– Predefined plots for some models
– Flexible, powerful options
– Save to image files in various formats
• Write new functions
– Make a change to an existing function
– Create new functions tailored to your exact needs
– Contribute a new package
• Create documents (with Sweave, knitr)
– PDF (article and slides)
– HTML
10
Why use R for statistical computing?
• Open source (R is part of the GNU project and implements the S language)
• Good visualisations (ggplot2, lattice, standard plot library)
• Easier for writing custom packages and functions
• Closer to the statistics and machine learning community
• Better LaTeX support (Sweave, knitr)
• Works with big data (RHadoop, SparkR, Rcpp)
11
Figure: O’Reilly 2016 Data Science Salary Survey
Limitations of R
• The quality of some packages is less than perfect. They are not error-free!
• Many R commands give little thought to memory management, and so R can very
quickly consume all available memory. This can be a restriction when doing data
mining. There are various solutions, including using 64 bit operating systems that can
access much more memory than 32 bit ones.
• Documentation is sometimes patchy and terse, and impenetrable to the non-
statistician. However, some very high-standard books are increasingly plugging the
documentation gaps.
13
RGui
When R is waiting for us to tell it what to do, it begins the line with >
Type
• 'demo()' for some demos
• 'help()' for on-line help
• 'help.start()' for an HTML browser interface
• 'q()' to quit R
14
Editors and IDEs
• RStudio
• Jupyter Notebook
• Vim
• Emacs (ESS)
• Eclipse (StatET)
• Tinn-R
• …
15
https://www.rstudio.com/
R source editor (Ctrl+1)
R console (Ctrl+2)
Environment (Ctrl+8), History (Ctrl+4)
Help (Ctrl+3), Files (Ctrl+5), Plots (Ctrl+6), Packages (Ctrl+7)
Objects
• Everything in R is an object, and every object has a class.
• Data and intermediate results are stored in R objects.
• The class of an object describes both what the object contains and what many
standard functions do with it.
• Objects are usually accessed by name.
18
R commands
• R commands are either assignments or expressions
• Commands are separated either by a semicolon ; or newline
19
x <- 1+2
`<-`(x, 1+2) #same thing
x = 1+2 #same thing
Assignment operations
An assignment command evaluates
an expression and passes the value
to a variable but the result is not
printed.
20
Expression operations
An expression command is evaluated
and (normally) printed.
If the statement results in a value, R will
print that value automatically.
> 1+2
[1] 3
> 1+2*3
[1] 7
> (1+2)*3
[1] 9
In R, any number that you print out in the console is interpreted as a vector. A vector is an ordered collection of numbers. The “[1]” means that the index of the first item displayed in the row is 1.
21
Workspace
• R stores objects in a workspace that is kept in memory.
• When you quit, R asks whether you want to save the workspace.
• The workspace containing all the objects you worked on can then be restored the next time you
work with R, along with a history of the commands used.
22
Variables (1/3)
A variable is a symbol that holds a value,
which can be any R object.
The types of variables are:
• Integer
• Double
• Character
• Logical
• Factor or categorical
23
Variables (2/3)
Integer, double (numerical values)
> a = 49
> sqrt(a)
[1] 7
> a <- pi
> print(a)
[1] 3.141593
Character, string, logical
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)
> a
[1] FALSE
24
Variables (3/3)
Factor
> a <- factor(c("H", "e", "l", "l", "o"))
> print(a)
[1] H e l l o
Levels: e H l o
> class(a)
[1] "factor"
25
Types of numerical variables (1/2)
When we use numerical objects, in
mathematical terms, variables can be
classified as:
• Scalars
• Vectors
• Matrices
A scalar is a single number
> x <- 5
> Y <- 100
26
Types of numerical variables (2/2)
A vector is a sequence of numbers
> x <- c(3, 5, 2)
> x
[1] 3 5 2
A matrix is a two-way table of numbers
> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)
> x
[,1] [,2]
[1,] 2 5
[2,] 3 6
[3,] 4 7
27
Variable names
• You can use simple variable names like x, y, A, and a (note that A and a are different
variable names). You can also use longer names like counter, index1, or
subject_id.
• A variable name can contain digits, but it cannot begin with a digit.
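• Be careful not to reuse the names of built-in functions or constants for your own variables!
For example, R lets you create a variable named log. Calls such as log(10) still work, because R
looks for a function when a name is called, but the code becomes confusing, and a function of your
own named log would mask the built-in logarithm function.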
28
Comments
A comment is anything you write in your
program code that is ignored by
the computer.
Comments help others understand your
code. Anything following a “#” character is
a comment in R.
> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.
29
Arithmetic operators
Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^ or **
Modulus (x mod y; 5 %% 2 is 1) x %% y
Integer division (5 %/% 2 is 2) x %/% y
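As a small check of the last two operators, the sketch below verifies that integer division and modulus together recover the original number. The variable names x, y, q and r are illustrative and not part of the original slides.

```r
# Integer division and modulus split a number into quotient and remainder
x <- 17
y <- 5
q <- x %/% y   # quotient
r <- x %% y    # remainder
print(c(quotient = q, remainder = r))

# The two parts recombine into the original value
print(q * y + r == x)

# For a negative number the remainder takes the sign of the divisor
neg_remainder <- (-17) %% 5
print(neg_remainder)
```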
30
Comparison operators
Equal ==
Not equal !=
Greater than >
Greater than or equal >=
Less than <
Less than or equal <=
31
Logical operators
x and y x & y
x or y x | y
Not x !x
Test if x is TRUE isTRUE(x)
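The operators above work element by element on logical vectors, whereas isTRUE() tests a single value. A minimal sketch, using two made-up vectors a and b:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

both    <- a & b    # element-wise AND
either  <- a | b    # element-wise OR
neither <- !either  # element-wise NOT
print(both)
print(either)
print(neither)

# isTRUE() is TRUE only for a single TRUE value, not for a longer vector
print(isTRUE(TRUE))
print(isTRUE(a))
```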
32
Numeric functions
Absolute value abs(x)
Square root sqrt(x)
Ceiling (ceiling(3.475) is 4) ceiling(x)
Floor (floor(3.475) is 3) floor(x)
Round (round(3.475, digits=2) is 3.48) round(x, digits=n)
Signif (signif(3.475, digits=2) is 3.5) signif(x, digits=n)
Cosine, sine, tangent, … cos(x), sin(x), tan(x)
Natural logarithm log(x)
Common logarithm log10(x)
Exponential of x exp(x)
33
Control structures: if
Syntax:
if (cond) { cmd1 }
> if (TRUE) {
+ "this will be printed if it is TRUE"
+ }
[1] "this will be printed if it is TRUE"
34
Control structures: if-else
Syntax:
if (cond) { cmd1 } else { cmd2 }
> if(1==0) {
+ print(1)
+ } else {
+ print(2)
+ }
[1] 2
35
Control structures: ifelse
Syntax:
ifelse(cond, yes, no)
> ifelse(1 == 0,
+ "this is returned if 1 == 0",
+ "this is returned if 1 != 0")
[1] "this is returned if 1 != 0"
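Unlike if, the ifelse() function is vectorised: it tests every element of a vector and returns a vector of the same length. A minimal sketch, assuming a made-up vector of scores and a pass mark of 50:

```r
# Hypothetical exam scores; the pass mark of 50 is an assumption for illustration
scores <- c(35, 72, 50, 90)

# One result per element of scores
result <- ifelse(scores >= 50, "pass", "fail")
print(result)
```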
36
Control structures: for
Syntax:
for (var in seq) { expr }
> x <- c("a", "a", "a", "a", "a")
> for (i in x){
+ print(i)
+ }
[1] "a"
[1] "a"
[1] "a"
[1] "a"
[1] "a"
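A for loop can also run over positions rather than values, which is useful when the index itself is needed. The sketch below uses seq_along() to build a running total; the vector x and the variable total are illustrative.

```r
x <- c(4, 2, 6)
total <- 0

# seq_along(x) gives the positions 1, 2, 3
for (i in seq_along(x)) {
  total <- total + x[i]
  cat("step", i, "running total", total, "\n")
}
```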
37
Control structures: repeat
Syntax:
repeat { expr; if (cond) break }
> i <- 10
> repeat {
+ if (i > 25)
+ break
+ else {
+ print(i); i <- i + 5;
+ }
+ }
[1] 10
[1] 15
[1] 20
[1] 25
38
Control structures: while
Syntax:
while (cond) { expr }
> i <- 10
> while (i <= 25) {
+ print(i); i <- i + 5
+ }
[1] 10
[1] 15
[1] 20
[1] 25
39
Control structures: switch
Syntax:
switch(expr, ...)
> AA = 'foo'
> switch(AA,
+ foo = {
+ print('foo') # case 'foo'
+ },
+ bar = {
+ print('bar') # case 'bar'
+ },
+ {
+ print('default')
+ })
[1] "foo"
40
Installing R and RStudio on your machine
• Download R from https://cran.r-project.org/
• Download RStudio at https://www.rstudio.com/
41
Exercise 1/10
demo(graphics)
demo(plotmath)
demo(Japanese)
demo(lm.glm)
demo(hclColors)
42
Exercise 2/10
x<-c(4,2,6)
y<-c(1,0,-1)
length(x)
sum(x)
sum(x^2)
x+y
x*y
x-2
x^2
43
Exercise 3/10
7:11
seq(2,9)
seq(4,10,by=2)
seq(3,30,length=10)
seq(6,-4,by=-2)
44
Exercise 4/10
rep(2,4)
rep(c(1,2),4)
rep(c(1,2),c(4,4))
rep(1:4,4)
rep(1:4,rep(3,4))
45
Exercise 5/10
c(T,T,F,F) & c(T,F,F,T)
!x
x <- seq(-3,3,length=200) > 0
1:3 + c(T,F,T)
intersect(1:10,5:15)
drinks <- factor(c("beer","beer","wine","water"))
46
Exercise 6/10
x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y);
print(z)
c(1, 2, 3, . . . , 19, 20)
x <- c(3,6,8); y <- c(2,5,1);
x[y>1.5]
x <- c(3,6,8); y <- c(2,5,1);
y[x==6]
47
Exercise 7/10
x <- 1:15
if (sample(x, 1) <= 10) {
print("x is less than or equal to 10")
} else {
print("x is greater than 10")
}
Clean all the variables (the workspace)
rm(list=ls())
Clean one variable
rm(x)
48
Exercise 8/10
x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
print(i)
}
for (i in 1:4) {
print(x[i])
}
for (i in seq(x)) {
print(x[i])
}
for (i in 1:4) print(x[i])
49
Exercise 9/10
i <- 1
while (i < 10) {
print(i)
i <- i + 1
}
50
Exercise 10/10
z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1 + (0+0i), 2 + (0+4i))
51
R data structures
53
Vectors (1/3)
Vectors are one-dimensional arrays that
can hold numeric data, character data, or
logical data. The combine function c() is
used to form the vector.
Note that the data in a vector must only
be one data type (numeric, character, or
logical).
> a <-c(1, 2, 5, 3, 6, -2, 4)
> b <-c("one", "two", "three")
> c <-c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# a is a numeric vector,
# b is a character vector, and
# c is a logical vector
54
Vectors (2/3)
Scalars are one-element vectors.
> f <- 3
> x <- TRUE
> y <- 100.01
55
Vectors (3/3)
You can refer to elements of a vector using
a numeric vector of positions within
brackets.
> a <- c(1, 2, 5, 3, 6, -2, 4)
> a[3]
[1] 5
> a[c(1, 3, 5)]
[1] 1 5 6
> a[2:6]
[1] 2 5 3 6 -2
56
Matrices (1/4)
A matrix is a two-dimensional array where each element has the same data type
(numeric, character, or logical). Matrices are created with the matrix() function.
mymatrix <- matrix(vector,
nrow=number_of_rows,
ncol=number_of_columns,
byrow=logical_value,
dimnames=list(char_vector_rownames,
char_vector_colnames)
)
57
Matrices (2/4)
# Create a matrix from a vector
> vector <-c(1,2,3,4)
> foo <-matrix(vector, nrow=2, ncol=2)
> foo
[,1] [,2]
[1,] 1 3
[2,] 2 4
# Create a 5x4 matrix
> y <- matrix(1:20, nrow=5, ncol=4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
58
Matrices (3/4)
Create a 2x2 matrix with labels and
fill the matrix by rows
> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames) )
> mymatrix
C1 C2
R1 1 26
R2 24 68
Create a 2x2 matrix with labels and
fill the matrix by columns
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
> mymatrix
C1 C2
R1 1 24
R2 26 68
59
Matrices (4/4)
You can identify rows, columns, or elements of a matrix, x, by using subscripts and brackets.
• x[i,] refers to the ith row
• x[,j] refers to jth column
• x[i,j] refers to the i,jth element
> x <- matrix(1:10, nrow=2)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> x[2,]
[1] 2 4 6 8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
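Once rows and columns can be addressed, whole-matrix summaries follow naturally. The sketch below reuses the same 2x5 matrix and shows row totals, column means and selection by a condition; the result names are illustrative.

```r
x <- matrix(1:10, nrow = 2)

row_totals   <- rowSums(x)   # one total per row
col_means    <- colMeans(x)  # one mean per column
large_values <- x[x > 6]     # elements greater than 6, returned as a vector

print(row_totals)
print(col_means)
print(large_values)
```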
60
Arrays (1/2)
Matrices are two-dimensional and, like vectors, can contain only one data type. When
there are more than two dimensions, you’ll use arrays.
myarray <- array(vector, dimensions, dimnames)
61
Arrays (2/2)
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
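Array elements are addressed with one subscript per dimension, either by position or by dimension name. A minimal sketch that rebuilds the same array z and picks out a single element and a whole slice (the names elem and slice are illustrative):

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

elem  <- z["A1", "B2", "C3"]  # a single element, selected by name
slice <- z[, , "C1"]          # the first 2x3 layer as a matrix

print(elem)
print(slice)
```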
62
Data frame (1/4)
A data frame is more general than a matrix in that different columns can contain different
modes of data (numeric, character, etc.). A data frame is created with the data.frame() function.
It’s similar to the datasets you’d typically see in SAS, SPSS, Stata, and Python (pandas).
Each column must have only one data type, but you can put columns of different data
types together to form the data frame. Because data frames are close to what analysts
typically think of as datasets, we’ll use the terms columns and variables interchangeably
when discussing data frames.
mydata <- data.frame(col1, col2, col3,…)
63
Data frame (2/4)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
64
Data frame (3/4)
Accessing data frame elements is
straightforward. Columns can be accessed
by name with the $ operator.
> patientdata$patientID
[1] 1 2 3 4
> patientdata$diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> patientdata$status
[1] Poor Improved Excellent Poor
Levels: Excellent Improved Poor
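The Levels lines in the output above appear when character columns are converted to factors, which was the default behaviour of data.frame() before R 4.0.0. A minimal sketch that makes the choice explicit, so the result does not depend on the R version (the names pd_chr and pd_fct are illustrative):

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")

# Keep the column as plain character strings
pd_chr <- data.frame(diabetes, stringsAsFactors = FALSE)
# Convert the column to a factor, as in the output shown above
pd_fct <- data.frame(diabetes, stringsAsFactors = TRUE)

print(class(pd_chr$diabetes))
print(class(pd_fct$diabetes))
print(levels(pd_fct$diabetes))
```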
65
Data frame (4/4)
To cross-tabulate diabetes type by status, use the table() function.
> table(patientdata$diabetes, patientdata$status)
Excellent Improved Poor
Type1 1 0 2
Type2 0 1 0
66
Some useful functions for data frame (1/7)
The attach() function adds the data frame
to the R search path. When a variable name
is encountered, data frames in the search
path are checked in order to locate the
variable.
> summary(mtcars$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mtcars$mpg, mtcars$disp)
> plot(mtcars$mpg, mtcars$wt)
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
67
Some useful functions for data frame (2/7)
The detach() function removes the data
frame from the search path. Note that
detach() does nothing to the data frame
itself. The statement is optional but is good
programming practice and should be
included routinely.
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
68
Some useful functions for data frame (3/7)
The limitations with this approach are
evident when more than one object can
have the same name.
Here we already have an object named mpg
in our environment when the mtcars data
frame is attached. In such cases, the
original object takes precedence, which
isn’t what you want. The plot statement
fails because mpg has 3 elements and wt
has 32 elements.
> mpg <- c(25, 36, 47)
> attach(mtcars)
The following object is masked _by_ .GlobalEnv:
mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
69
Some useful functions for data frame (4/7)
In this case, the statements within the
{} brackets are evaluated with reference
to the mtcars data frame. You don’t
have to worry about name conflicts
here. If there’s only one statement (for
example, summary(mpg)), the {} brackets are optional.
> with(mtcars, {
+ print(summary(mpg))
+ plot(mpg, disp)
+ plot(mpg, wt)
+ })
70
Some useful functions for data frame (5/7)
The limitation of the with() function
is that assignments will only exist
within the function brackets.
> with(mtcars, {
stats <- summary(mpg)
stats
})
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
> stats
Error: object ‘stats’ not found
71
Some useful functions for data frame (6/7)
If you need to create objects that will
exist outside of the with() construct,
use the special assignment operator <<- instead of the standard one <-. It will
save the object to the global
environment outside of the with() call.
> with(mtcars, {
nokeepstats <- summary(mpg)
keepstats <<- summary(mpg)
})
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
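An alternative that avoids <<- altogether is to assign the value returned by with(), since with() returns the value of its last expression. A minimal sketch (the name avg_mpg is illustrative):

```r
# with() returns the value of the expression, so it can be assigned directly
keepstats <- with(mtcars, summary(mpg))
print(keepstats)

avg_mpg <- with(mtcars, mean(mpg))
print(avg_mpg)
```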
72
Some useful functions for data frame (7/7)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
73
Factors (1/3)
Categorical (nominal) and ordered
categorical (ordinal) variables in R are
called factors.
The function factor() stores the
categorical values as a vector of integers
in the range [1... k] (where k is the
number of unique values in the nominal
variable), and an internal vector of
character strings (the original values)
mapped to these integers.
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
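The integer codes and the level labels described above can be inspected directly. The sketch below converts the character vector to a factor and also builds an ordered factor; the explicit level order Poor < Improved < Excellent is an assumption chosen for illustration and is not the alphabetical order R would use by default.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
f <- factor(diabetes)

print(levels(f))      # the distinct labels
print(as.integer(f))  # the integer code stored for each value

# Ordered factor with an explicitly chosen level order (an assumption)
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)
print(status[1] < status[2])  # compares Poor with Improved
```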
74
Factors (2/3)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
$ patientID: num 1 2 3 4
$ age : num 25 34 28 52
$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
75
Factors (3/3)
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
76
Lists (1/2)
Lists are the most complex of the R data
types. Basically, a list is an ordered
collection of objects (components). A list
allows you to gather a variety of (possibly
unrelated) objects under one name.
mylist <- list(object1, object2, …)
mylist <- list(name1=object1, name2=object2, …)
77
Lists (2/2)
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
[[4]]
[1] "one" "two" "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
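Single and double brackets behave differently on lists: double brackets (or $) extract the component itself, while single brackets return a shorter list. A minimal sketch with a reduced version of the list (the names sub and val are illustrative):

```r
mylist <- list(title = "My First List", ages = c(25, 26, 18, 39))

sub <- mylist["ages"]    # a list containing one component
val <- mylist[["ages"]]  # the numeric vector itself

print(class(sub))
print(class(val))
print(mean(mylist$ages)) # $ also extracts the component
```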
78
Exercise 1/10
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE
# Check class of my_numeric
class(my_numeric)
# Check class of my_character
class(my_character)
# Check class of my_logical
class(my_logical)
Exercise 2/10
# Vector operations
a) Create a vector like 1, 2, 3, . . ., 10
b) Get the length of the above vector
c) Get the last three numbers from the vector
d) Sort the numbers in decreasing order
e) Remove the number 9 from the above vector
Exercise 3/10
# Vector operations
a) Create a vector from 1 to 3.1415 with the length of 100
b) Create a vector from -2 to 0.1 with the length of 100
c) Get the sum and inner product of a and b
Exercise 4/10
# Vector operations
a) Create a vector x containing 2, 3, 4, 1
b) Create a vector y containing 1, 1, 3, 7
c) Combine x and y as column vectors
Exercise 5/10
# Vector operations
Use rep() function to create the following vectors:
a) “0” “x” “0” “x” “0” “x”
b) 1 3 2 1 3 2 1 3 2 1 3 2
c) 1 1 1 2 2 2 3 3 3
Exercise 6/10
# Matrix operations
a) Create a matrix which contains values from 1 to 100 with 5 rows and 20 columns
b) Print out the dimensions of the matrix
c) Find out the 4th column’s sum
d) Find out the sum of row 3 and the sum of column 17
e) Assign the following names to the rows:
“A”, “B”, “C”, “D”, “E”
Exercise 7/10
# Matrix operations
a) Use matrix() function to create the following matrix:
TypeA TypeB TypeC
Navarra 190 8 22
Zaragoza 191 4 1.7
Madrid 223 80 2.0
b) Add the following column into the matrix:
TypeD
2.00
3.50
2.75
c) Use apply() function to calculate the means of each column of the matrix
Exercise 8/10
# Array operations
Create the following array:
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
, , 3
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
Exercise 9/10
# Data frame operations
Type df <- iris, then
a) Print out the dimensions of df
b) Find out the sum of “Sepal.Width” column
c) Rename column “Species” as “label”
d) Find out how many records with “Petal.Length” larger than 1.41
Exercise 10/10
# List operations
Create the following list and save it to the variable x:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
Sources of data for R
90
Entering data from the keyboard
Perhaps the simplest method of data entry
is from the keyboard. The edit() function
in R will invoke a text editor that will allow
you to enter your data manually.
> mydata <- data.frame(age = numeric(0),
+ gender = character(0),
+ weight = numeric(0))
> mydata <- edit(mydata)
91
Importing data from Excel
Many R packages allow you to import data from Excel. For example:
• openxlsx
• XLConnect
• xlsx
• …
Note that much of the available advice on importing Excel data is for pre-Excel 2007 spreadsheets (.xls) and not the later .xlsx format.
> install.packages("openxlsx")
> library("openxlsx")
> df <-
+ read.xlsx(
+ "PublicHealthEnglandDataTableDistrict.xlsx",
+ sheet = 1,
+ startRow = 1,
+ colNames = TRUE
+ )
92
Importing data from a delimited text file (1/2)
You can import data from delimited text files using read.table(), a function that
reads a file in table format and saves it as a data frame:
> mydataframe <- read.table(file, header = logical_value,
+ sep = "delimiter",
+ row.names = "name")
where file is a delimited ASCII file, header is a logical value indicating whether
the first row contains variable names (TRUE or FALSE), sep specifies the delimiter
separating data values, and row.names is an optional parameter specifying one or more
variables to represent row identifiers.
93
Importing data from a delimited text file (2/2)
> file <- paste0(path, '/AMZN.csv')
> df <- read.table(file, header = TRUE, sep = ",")
> head(df)
Date Open High Low Close Volume Adj.Close
1 2016-03-04 581.07 581.40 571.07 575.14 3405100 575.14
2 2016-03-03 577.96 579.87 573.11 577.49 2736700 577.49
3 2016-03-02 581.75 585.00 573.70 580.21 4576900 580.21
4 2016-03-01 556.29 579.25 556.00 579.04 5014400 579.04
5 2016-02-29 554.00 564.81 552.51 552.52 4013400 552.52
6 2016-02-26 560.12 562.50 553.17 555.23 4858200 555.23
94
Importing data from XML
> # install and load the necessary package
> install.packages("XML")
> library(XML)
> xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
> xmlfile <- xmlTreeParse(xml.url)
> class(xmlfile)
[1] "XMLDocument" "XMLAbstractDocument"
> xmltop = xmlRoot(xmlfile)
> plantcat <- xmlSApply(xmltop, function(x) { xmlSApply(x, xmlValue) } )
> # Finally, get the data in a data-frame and have a look at the first rows and columns
> plantcat_df <- data.frame(t(plantcat),row.names = NULL)
> plantcat_df[1:5,1:4]
95
Importing data from R package
> library(MASS)
> data()
> data(phones)
> phones
$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9 16.1
[14] 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0
96
Importing data from other sources
• Importing SPSS files into R
• Importing Stata files into R
• Importing SAS files into R
• Importing Minitab files into R
• Importing Matlab files into R
• …
97
Importing data in RStudio (1/2)
98
Importing data in RStudio (2/2)
99
Writing data frame into csv or txt files
write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ",
eol = "\n", na = "NA", dec = ".", row.names = TRUE,
col.names = TRUE, qmethod = c("escape", "double"),
fileEncoding = "")
write.csv(...)
write.csv2(...)
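Writing and reading can be checked together with a round trip through a temporary file. The sketch below is a minimal example under stated assumptions: the data frame scores_df and its columns are made up, and tempfile() is used so that no file is left in the working directory.

```r
# A small made-up data frame
scores_df <- data.frame(id = 1:3,
                        name = c("Ann", "Bob", "Cat"),
                        score = c(71.5, 64, 88),
                        stringsAsFactors = FALSE)

# Write to a temporary CSV file without the row names column
tmp <- tempfile(fileext = ".csv")
write.csv(scores_df, file = tmp, row.names = FALSE)

# Read the file back and compare with the original
scores_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(scores_back)
print(all.equal(scores_df, scores_back))

unlink(tmp)  # remove the temporary file
```

Setting row.names = FALSE matters here: with the default, the row names are written as an extra first column and come back as an additional variable.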
100
Useful functions for working with data objects (1/2)
Number of elements/components length(object)
Dimensions of an object dim(object)
Structure of an object str(object)
Class or type of an object class(object)
How an object is stored mode(object)
Names of components in an object names(object)
Combines objects into a vector c(object, object,...)
Combines objects as columns cbind(object, object, ...)
Combines objects as rows rbind(object, object, ...)
Prints the object object
101
Useful functions for working with data objects (2/2)
Lists the first part of the object head(object)
Lists the last part of the object tail(object)
Lists current objects ls()
Deletes one or more objects rm(object, object, ...)
Edits object and saves as new object newobject <- edit(object)
102
The drop= argument
By default, subscripting operations reduce
the dimensions of an array
whenever possible. To avoid that, we can
use the drop=FALSE argument
> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> s1 <- mat[1,]; s1
[1] 1 2 3 4
> dim(s1)
NULL
> s2 <- mat[1,,drop=FALSE]; s2
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
> dim(s2)
[1] 1 4
103
Combined selection
Suppose we want to get all the columns for which the element at the first row is less than 3:
> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> mycols <- mat[1,] < 3; mycols
[1] TRUE TRUE FALSE FALSE
> mat[ , mycols, drop=FALSE]
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 9 10
104
Using SQL statements to manipulate data frames
# install the package
> install.packages("sqldf")
> library(sqldf)
> newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE)
> newdf
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1
Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
105
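The same selection can be written in base R without an extra package, which is a useful cross-check of the SQL result. A minimal sketch (the intermediate name sel is illustrative):

```r
# Rows where carb equals 1, all columns
sel <- mtcars[mtcars$carb == 1, ]

# Sort those rows by mpg in increasing order
newdf <- sel[order(sel$mpg), ]
print(newdf)
```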
Exercise 1/8
1) Create a vector x representing the numbers from 1 to 11
2) Save x into the x.RData file
3) Remove the object x from the R workspace
4) Import the x.RData file into R and save it to x.
Hint: search online for the functions you need
Exercise 2/8
1) Create the following data frame dfA
ID Case Number
1 case1 10
2 case2 20
3 case3 30
2) Save dfA into the dfA.csv file
Exercise 3/8
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using openxlsx package
3) Show the first and last 20 lines of the dataset, respectively
4) Obtain the column names of the dataset
5) Create a new data frame which has the same column names of the dataset and has
the first and last 20 lines of the dataset
Exercise 4/8
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Show the class of all columns/fields
4) Create a new data frame where Open <= 570 and Close >= 550
5) Sort the data frame by High (in decreasing order)
6) Create another data frame where Close >= Open
Exercise 5/8
1) Import data from url "http://www.w3schools.com/xml/plant_catalog.xml"
2) Use the xmlTreeParse function to parse the XML file directly from the web
3) Use the xmlRoot function to access the top node
Exercise 6/8
1) Install the MASS package
2) Find Cars93 dataset
3) Extract all the records for the Volkswagen from the field Manufacturer
4) Order the extracted records by Price (ascending) and save them to a data frame
5) Write the data frame to Cars93FilteredData.csv
Exercise 7/8
1) Use SQL statements to manipulate data frame as required in Exercise 4/6
2) Write the data frame to Cars93FilteredData.RData
Exercise 8/8
1) Download the files iris1.csv and iris2.csv from Blackboard
2) Import these two files into R
3) Combine these two datasets into one data frame
4) Calculate the mean value of every column
5) What will you do with missing values?
Exploratory graphs
If you are familiar with statistical graphical representations, please
skip this part
Pie chart
Figure: pie chart "taxs" with one wedge per state: AL 5%, AR 5%, AZ 5%, CA 5%, CO 4%, CT 7%, DE 4%, FL 6%, GA 4%, IA 5%, ID 4%, IL 4%, IN 4%, KS 5%, KY 3%, LA 5%, MA 6%, MD 4%, ME 6%, MI 5%. Dataset: Cigarette
A pie chart is used to show the
relative frequencies or percentages
of the levels of a categorical variable
with wedges of a pie/circle.
It is very useful when creating a well
designed document that is intended
for people who will not read the data
(e.g., management).
Scatter plot
With a scatter plot a mark,
usually a dot or small circle,
represents a single data point.
With one mark (point) for every
data point a visual distribution
of the data can be seen.
Depending on how tightly the
points cluster together, you may
be able to discern a clear trend
in the data.
Figure: scatter plot of AirPassengers with a linear trend line, y = 31.887x - 62057 (x-axis: Date; y-axis: Number of air passengers). Dataset: AirPassengers
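A scatter plot with a fitted straight line can be sketched in base R as follows. This is only an illustration using the built-in AirPassengers series; the variable names are made up, and the fitted coefficients may differ slightly from any trend line drawn elsewhere, depending on the date range used.

```r
# Convert the monthly time series into plain numeric vectors
year       <- as.numeric(time(AirPassengers))  # e.g. 1949.000, 1949.083, ...
passengers <- as.numeric(AirPassengers)

# Fit a straight line: passengers = intercept + slope * year
fit <- lm(passengers ~ year)
print(coef(fit))

# Scatter plot with the fitted line on top
plot(year, passengers, xlab = "Date", ylab = "Number of air passengers")
abline(fit)
```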
Line plot
A line plot provides an excellent
way to map independent and
dependent variables that are both
quantitative.
It is clear to see how things are
going by the rises and falls a line
plot shows.
Figure: line plot of AirPassengers (x-axis: Date; y-axis: Number of air passengers). Dataset: AirPassengers
Multiple line plot
Multiple line plots have space-
saving characteristics. Because
the data values are marked by
small marks (points) and not
bars, they do not have to be
offset from each other (only
when data values are very dense does this become a problem).
Figure: multiple line plot of Stock1, Stock2 and Stock3 (x-axis: Day; y-axis: Cumulative percentage). Dataset: StockShare
Area chart/graph
An area chart/graph displays
quantitative data graphically.
[Figure: area chart of Stock1, Stock2 and Stock3. x-axis: Day (1 to 96); y-axis: Cumulative percentage (0% to 100%). Dataset: StockShare]
Bar chart
A bar plot is a chart that shows
grouped data with rectangular
bars with lengths proportional to
the values that they show. The
bars can be plotted vertically or
horizontally.
It is one of the best methods to
summarise categorical data.
[Figure: grouped bar chart of Stock1, Stock2 and Stock3. x-axis: Day (1 to 10); y-axis: Percentage (0 to 0.9). Dataset: StockShare]
Histogram
A histogram is a graphical
representation of the distribution of
quantitative data. It is an estimate
of the probability distribution of a
quantitative variable and was first introduced by Karl Pearson.
[Figure: histogram. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]
Histogram with distribution fit
A histogram with a distribution
fit is normally used to show the
empirical distribution of the
variable. Sometimes, we use
the Normal/Gaussian distribution to fit the histogram.
[Figure: histogram with a fitted Normal curve. x-axis: Adjusted closing price (40 to 60); y-axis: Frequency (0 to 25). Dataset: MSFT]
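One way to overlay a Normal fit on a histogram in base R is sketched below. The MSFT data are not included in this tutorial, so the prices here are simulated stand-ins, and the mean of 50 and standard deviation of 4 are arbitrary assumptions.

```r
set.seed(1)                                # make the simulated data reproducible
price <- rnorm(200, mean = 50, sd = 4)     # simulated stand-in for a price series

# Plot densities rather than counts so that the bars and the curve share a scale
hist(price, freq = FALSE,
     xlab = "Adjusted closing price",
     main = "Histogram with Normal fit")

# Overlay a Normal density using the sample mean and standard deviation
curve(dnorm(x, mean = mean(price), sd = sd(price)),
      add = TRUE, col = "red", lwd = 2)
```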
Base plotting system in R
Dataset (1/3)
> data(Chem97, package = "mlmRev")
> head(Chem97)
lea school student score gender age gcsescore gcsecnt
1 1 1 1 4 F 3 6.625 0.3393157
2 1 1 2 10 F -3 7.625 1.3393157
3 1 1 3 10 F -4 7.250 0.9643157
4 1 1 4 10 F -2 7.500 1.2143157
5 1 1 5 8 F -1 6.444 0.1583157
6 1 1 6 10 F 4 7.750 1.4643157
Dataset (2/3)
> data(iris)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Dataset (3/3)
> data(EuStockMarkets)
> EuStockMarkets <- data.frame(EuStockMarkets)
> head(EuStockMarkets)
DAX SMI CAC FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8
Histogram (1/2)
> hist(Chem97$gcsescore)
Histogram (2/2)
> hist(
+ Chem97$gcsescore,
+ main = "Histogram",
+ xlab = "gcsescore",
+ ylab = "Frequency",
+ col = "green"
+ )
Boxplot (1/2)
> boxplot(Chem97$gcsescore,
+ main = 'title',
+ ylab = 'gcsescore')
Boxplot (2/2)
> boxplot(
+ Chem97$gcsescore,
+ Chem97$age,
+ main = 'title',
+ ylab = 'value',
+ names = c('gcsescore','age')
+ )
Scatter plot (1/3)
> plot(
+ Chem97$gcsescore,
+ Chem97$gcsecnt,
+ main = "title",
+ xlab = "gcsescore",
+ ylab = 'gcsecnt',
+ col = "blue"
+ )
Scatter plot (2/3)
> pairs(iris)
Scatter plot (3/3)
> pairs(iris, pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
Line plot (1/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'Day',
+ ylab = 'DAX'
+ )
Line plot (2/3)
> plot(
+ EuStockMarkets$DAX,
+ type = "l", col = 'red',
+ xlab = 'Day', ylab = 'Price'
+ )
> lines(EuStockMarkets$FTSE,
+ type = "l", col = 'blue')
> title("EuStockMarkets", cex.main = 1.1)
> legend(
+ 100, 5500, c("DAX", "FTSE"),
+ col = c('red', 'blue'),
+ text.col = "black",
+ lty = c(1,1), merge = TRUE
+ )
Line plot (3/3)
> plot(
+ EuStockMarkets$DAX,
+ EuStockMarkets$CAC,
+ type = "l",
+ main = 'EuStockMarkets',
+ xlab = 'DAX',
+ ylab = 'CAC'
+ )
Exercise 1/5
1) Create a vector x containing the series 1 to 1000
2) Create a vector y containing the series 12 to 10002
3) Generate the following scatter plot, with x on the x-axis and y on the y-axis
Exercise 2/5
1) Create a data frame df that contains 3 variables: x, y, z (i.e., 3 columns)
2) Each variable has 500 observations (i.e., 500 rows)
3) x follows a standard normal distribution N(0,1)
4) y follows a continuous uniform distribution U[0,1]
5) z follows a Poisson distribution Poisson(0.5)
6) Generate a pairs plot for x, y, z
Hint: look up the pairs() function
Exercise 3/5
Plot the following figure, where $x \in [0, 2\pi]$ and $y = \sin(x)$
Exercise 4/5
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using openxlsx package
3) Save the data frame into df
4) Plot the histogram of df (the same as the one shown on the right)
Hint: a) bandwidth; b) values on x-axis
Exercise 5/5
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Plot the multiple-line figure shown below
R graphics packages:
lattice & ggplot2
Author
lattice was developed and
maintained by Deepayan Sarkar,
Assistant Professor at Indian
Statistical Institute.
http://www.isid.ac.in/~deepayan/
Histogram by wrap
> library(lattice)
> pl <- histogram(~ gcsescore |
+ factor(score), data = Chem97)
> print(pl)
Density by wrap
> pl <- densityplot(
+ ~ gcsescore | factor(score),
+ data = Chem97,
+ plot.points = FALSE,
+ ref = TRUE
+ )
> print(pl)
Density plot by different colour
> pl <- densityplot(
+ ~ gcsescore,
+ data = Chem97,
+ groups = score,
+ plot.points = FALSE,
+ ref = TRUE,
+ auto.key = list(columns = 3)
+ )
> print(pl)
boxplot by wrap (1/2)
> pl <- bwplot(
+ gcsescore ^ 2.34 ~ gender | factor(score),
+ Chem97,
+ varwidth = TRUE,
+ layout = c(6, 1),
+ ylab = "Transformed GCSE score"
+ )
> print(pl)
boxplot by wrap (2/2)
> pl <- bwplot(
+ factor(score) ~ gcsescore | gender,
+ data = Chem97,
+ xlab = "Average GCSE score"
+ )
> print(pl)
There are many other functions in lattice
The references below will be useful:
• http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
• https://www.stat.auckland.ac.nz/~paul/RGraphics/chapter4.pdf
• https://fas-web.sunderland.ac.uk/~cs0her/Statistics/UsingLatticeGraphicsInR.htm
Author
ggplot2 was developed by Hadley
Wickham, Chief Scientist at RStudio, and
an Adjunct Professor of Statistics at the
University of Auckland.
http://hadley.nz/
Histogram by wrap
> library(ggplot2)
> pg <-
+ ggplot(Chem97, aes(gcsescore)) +
+ geom_histogram(binwidth = 0.5) +
+ facet_wrap( ~ score)
> print(pg)
Density plot by wrap
> pg <- ggplot(Chem97, aes(gcsescore)) +
+ stat_density(geom = "path",
+ position = "identity") +
+ facet_wrap(~ score)
> print(pg)
Density plot by different colour
> pg <- ggplot(Chem97, aes(gcsescore)) +
+ stat_density(geom = "path",
+ position = "identity",
+ aes(colour = factor(score)))
> print(pg)
boxplot by wrap (1/2)
> pg <- ggplot(Chem97,
+ aes(factor(gender),
+ gcsescore^2.34)) +
+ geom_boxplot() +
+ facet_grid(~score) +
+ ylab("Transformed GCSE score")
> print(pg)
boxplot by wrap (2/2)
> pg <- ggplot(Chem97,
+ aes(factor(score),
+ gcsescore)) +
+ geom_boxplot() +
+ coord_flip() +
+ ylab("Average GCSE score") +
+ facet_wrap( ~ gender)
> print(pg)
There are many other functions in ggplot2
The references below will be useful:
• http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf
• http://www.statmethods.net/advgraphs/ggplot2.html
• http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/
• http://www.stat.wisc.edu/~larget/stat302/chap2.pdf
Exercise 1/11
Use the lattice or ggplot2 package to draw the figure below
> data(postdoc, package = "latticeExtra")
> pl <- barchart(prop.table(postdoc, margin = 1),
+ xlab = "Proportion",
+ auto.key = list(adj = 1))
> print(pl)
Exercise 2/11
1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx
2) Plot the following figure using lattice package
Hint: xyplot
Exercise 3/11
1) Read the dataset PublicHealthEnglandDataTableDistrict.xlsx
2) Plot the following figure using ggplot2 package
Hint: 1) ggplot; 2) plot points; 3) by wrap
Exercise 4/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 5/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 6/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 7/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 8/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 9/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 10/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the ggplot2 package
Exercise 11/11
1) Read the dataset “Chem97” from the “mlmRev” package
2) Plot the following figure using the lattice package
5. Handling and processing strings
Empty string
An empty string can be produced by
consecutive quotation marks: ""
> empty_str = ""
> empty_str
[1] ""
> class(empty_str)
[1] "character"
Vector of empty strings
character(n) will produce a character
vector containing n empty strings
> # vector with 5 empty strings
> char_vector = character(5)
> char_vector
[1] "" "" "" "" ""
is.character() and as.character()
as.character() and is.character() are
generic methods for creating and testing
for objects of type "character"
> a = "test me"
> b = 8 + 9
> # are 'a' and 'b' characters?
> is.character(a)
[1] TRUE
> is.character(b)
[1] FALSE
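The listing above only tests with is.character(). As a small complementary sketch, as.character() can convert the numeric b into a string, and as.numeric() converts it back (the name b_chr is introduced here for illustration only).

```r
b <- 8 + 9                  # numeric value 17
b_chr <- as.character(b)    # convert the number to the string "17"

is.character(b)             # FALSE: b is still numeric
is.character(b_chr)         # TRUE: b_chr is a character string
as.numeric(b_chr) + 1       # converting back to a number gives 18
```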
c() for character vector
As you can tell, the resulting vector from
combining integers (1:5), the number pi,
and some "text" is a vector with all its
elements treated as character strings. In
other words, when we combine mixed
data in vectors, strings will dominate.
> a <- c("x", "y", "c")
> a
[1] "x" "y" "c"
> b <- c(1:5, pi, "text")
> b
[1] "1"
[2] "2"
[3] "3"
[4] "4"
[5] "5"
[6] "3.14159265358979"
[7] "text"
paste()
paste() takes one or more R objects,
converts them to "character", and then it
concatenates (pastes) them to form one or
several character strings
> PI = paste("The life of", pi)
> PI
[1] "The life of 3.14159265358979"
> IloveR = paste("I", "love", "R")
> IloveR
[1] "I love R"
> IloveR = paste0("I", "love", "R")
> IloveR
[1] "IloveR"
> IloveR = paste("I", "love", "R", sep = "-")
> IloveR
[1] "I-love-R"
> paste(1:3, c("!", "?", "+"), sep = "",
+ collapse = "")
[1] "1!2?3+"
Printing characters
Function Description
print() Generic printing
noquote() Print with no quotes
cat() Concatenation
format() Special formats
toString() Convert to string
sprintf() C-style formatted printing
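The functions in the table can be compared side by side on one small vector. This is only an illustrative sketch; the vector x and the chosen format arguments are assumptions made for the example.

```r
x <- c("R", "is", "fun")

print(x)                    # quoted elements with an index prefix
noquote(x)                  # the same elements, printed without quotes
cat(x, "\n")                # elements joined by spaces, no index and no quotes
toString(x)                 # one single string: "R, is, fun"
format(42L, width = 6)      # right-justified in 6 characters: "    42"
sprintf("%.2f", pi)         # C-style formatting with 2 decimals: "3.14"
```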
Basic string manipulations
Function Description
nchar() Number of characters
tolower() Convert to lower case
toupper() Convert to upper case
casefold() Case folding
chartr() Character translation
abbreviate() Abbreviation
substring() Substrings of a character vector
substr() Substrings of a character vector
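A short sketch of these functions applied to one example string (the string s is chosen here purely for illustration):

```r
s <- "Data Science"

nchar(s)                     # 12 characters, including the space
tolower(s)                   # "data science"
toupper(s)                   # "DATA SCIENCE"
casefold(s, upper = TRUE)    # same result as toupper(s)
chartr("a", "A", s)          # replace every "a" by "A": "DAtA Science"
substr(s, 1, 4)              # characters 1 to 4: "Data"
substring(s, 6)              # from character 6 to the end: "Science"
```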
Set operations
Function Description
union() Set union
intersect() Intersection
setdiff() Set difference
setequal() Equal sets
identical() Exact equality
is.element() Is element
sort() Sorting
paste(rep()) Repetition
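As an illustrative sketch, the set functions can be tried on two small character vectors (set1 and set2 are hypothetical examples, not taken from the slides):

```r
set1 <- c("some", "random", "words")
set2 <- c("some", "other", "words")

union(set1, set2)                     # "some" "random" "words" "other"
intersect(set1, set2)                 # "some" "words"
setdiff(set1, set2)                   # elements only in set1: "random"
is.element("other", set1)             # FALSE
sort(set1)                            # alphabetical order: "random" "some" "words"
paste(rep("ab", 3), collapse = "")    # repetition: "ababab"
```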
setequal() vs identical()
> set7 = c("some", "random", "string")
> set8 = c("some", "random", "none", "few")
> set9 = c("string", "some", "random")
> setequal(set7, set8)
[1] FALSE
> setequal(set7, set9)
[1] TRUE
> identical(set7, set7)
[1] TRUE
> identical(set7, set9)
[1] FALSE
stringr package
Thanks to Hadley Wickham, we have the
package stringr that adds more
functionality to the base functions for
handling strings in R.
stringr provides functions for:
1) Basic manipulations
2) Regular expression operations.
http://hadley.nz/
Basic string manipulations in stringr
Function       Description                                Similar to
str_c()        string concatenation                       paste()
str_length()   number of characters                       nchar()
str_sub()      extracts substrings                        substring()
str_dup()      duplicates characters                      (none)
str_trim()     removes leading and trailing whitespace    (none)
str_pad()      pads a string                              (none)
str_wrap()     wraps a string paragraph                   strwrap()
paste() vs str_c()
> paste("University", "of", "Lincoln")
[1] "University of Lincoln"
> paste("University", "of", "Lincoln", NULL)
[1] "University of Lincoln "
> paste("University", "of", "Lincoln", character(0))
[1] "University of Lincoln "
> library(stringr)
> str_c("University", "of", "Lincoln")
[1] "UniversityofLincoln"
> str_c("University", "of", "Lincoln", NULL)
[1] "UniversityofLincoln"
> str_c("University", "of", "Lincoln", character(0))
[1] "UniversityofLincoln"
nchar() vs str_length()
> nchar("The life of PI")
[1] 14
> str_length("The life of PI")
[1] 14
>
> text_str = c("one", "two", "three", NA)
> # Note: the output below is from R < 3.3.0; newer versions return NA for the missing value
> nchar(text_str)
[1] 3 3 5 2
> str_length(text_str)
[1] 3 3 5 NA
str_sub()
> hw <- "Hadley Wickham"
> str_sub(hw, 1, 6)
[1] "Hadley"
> str_sub(hw, end = 6)
[1] "Hadley"
> str_sub(hw, 8, 14)
[1] "Wickham"
> str_sub(hw, 8)
[1] "Wickham"
> str_sub(hw, c(1, 8), c(6, 14))
[1] "Hadley" "Wickham"
> str_sub(hw, 1:3)
[1] "Hadley Wickham" "adley Wickham" "dley Wickham"
What is a regular expression?
A regular expression (regex or regexp for short) is a pattern describing a certain amount
of text. It is a way for a computer user or programmer to express how a computer program
should look for a specified pattern in text, and what the program should do each time a
match is found.
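A minimal sketch of this idea: the pattern "[0-9]+" describes "one or more digits", grepl() reports which strings contain a match, and sub() replaces the first match. The vector rooms is a made-up example.

```r
rooms <- c("room 101", "no digits here", "lab 7b")

# Which strings contain at least one run of digits?
grepl("[0-9]+", rooms)        # TRUE FALSE TRUE

# What to do with a match: replace the first run of digits by "#"
sub("[0-9]+", "#", rooms)     # "room #" "no digits here" "lab #b"
```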
Functions of regex in R
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE)
Regular expression functions
Function Description
grep() Find regex matches and return (index or value)
grepl() Find regex matches and return (TRUE & FALSE)
sub() Replace the first match
gsub() Replace all the matches
regexpr() Find regex matches (position of the first match)
gregexpr() Find regex matches (positions of all matches)
regexec() Find regex matches (hybrid of regexpr() and gregexpr())
strsplit() Split regex matches
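The differences between these functions are easiest to see when they are applied to the same input. The following sketch uses a made-up vector x and simple literal patterns.

```r
x <- c("apple pie", "banana split", "cherry tart")

grep("an", x)                   # index of the matching element: 2
grep("an", x, value = TRUE)     # the matching element itself: "banana split"
grepl("an", x)                  # one logical per element: FALSE TRUE FALSE
sub("a", "A", x)                # first match only: "Apple pie" "bAnana split" "cherry tArt"
gsub("a", "A", x)               # all matches: "Apple pie" "bAnAnA split" "cherry tArt"
as.integer(regexpr("a", x))     # position of the first match in each element: 1 2 9
gregexpr("a", "banana")[[1]]    # positions of all matches: 2 4 6 (plus attributes)
strsplit(x, " ")                # split each element at the spaces (returns a list)
```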
Metacharacters in R (1/2)
There are some special characters that have a reserved status; they are known as metacharacters.
The metacharacters in Extended Regular Expressions (EREs) are:
. \ | ( ) [ { ^ $ * + ?
In R, we need to escape them with a double backslash \\ when we want to match them literally in a regex pattern
Metacharacter   Escape in R
.               \\.
$               \\$
*               \\*
+               \\+
?               \\?
|               \\|
\               \\\\
^               \\^
[               \\[
]               \\]
{               \\{
}               \\}
(               \\(
)               \\)
Metacharacters in R (2/2)
> money = "$money"
>
> sub(pattern = "$", replacement = "XXXXXX", x = money)
[1] "$moneyXXXXXX"
> money = "$money"
>
> sub(pattern = "\\$", replacement = "XXXXXX", x = money)
[1] "XXXXXXmoney"
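Besides the double backslash, base R offers two other ways to treat a metacharacter literally: the argument fixed = TRUE, which switches off regex interpretation, and a bracket expression such as "[$]". A small sketch, reusing the string from the example above:

```r
money <- "$money"

sub("\\$", "", money)                 # escaped metacharacter: "money"
sub("$", "", money, fixed = TRUE)     # pattern taken literally: "money"
sub("[$]", "", money)                 # "$" inside a character class: "money"
```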
Sequences (1/4)
Anchor Description
\\d match a digit character
\\D match a non-digit character
\\s match a space character
\\S match a non-space character
\\w match a word character
\\W match a non-word character
\\b match a word boundary
\\B match a non-(word boundary)
\\h match a horizontal space
\\H match a non-horizontal space
\\v match a vertical space
\\V match a non-vertical space
Sequences (2/4)
> sub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war _010"
> gsub("\\d", "_", "the dandelion war 2010")
[1] "the dandelion war ____"
>
> sub("\\D", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\D", "_", "the dandelion war 2010")
[1] "__________________2010"
Sequences (3/4)
> # replace space with "_"
> sub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion war 2010"
> gsub("\\s", "_", "the dandelion war 2010")
[1] "the_dandelion_war_2010"
>
> # replace non-space with "_"
> sub("\\S", "_", "the dandelion war 2010")
[1] "_he dandelion war 2010"
> gsub("\\S", "_", "the dandelion war 2010")
[1] "___ _________ ___ ____"
Sequences (4/4)
> # replace word boundary with "_"
> sub("\\b", "_", "the dandelion war 2010")
[1] "_the dandelion war 2010"
> gsub("\\b", "_", "the dandelion war 2010")
[1] "_t_h_e_ _d_a_n_d_e_l_i_o_n_ _w_a_r_ _2_0_1_0_"
> # replace non-word-boundary with "_"
> sub("\\B", "_", "the dandelion war 2010")
[1] "t_he dandelion war 2010"
> gsub("\\B", "_", "the dandelion war 2010")
[1] "t_he d_an_de_li_on w_ar 2_01_0"
Some regex character classes (1/2)
Anchor Description
[aeiou] Match any one lower case vowel
[AEIOU] Match any one upper case vowel
[0123456789] Match any digit
[0-9] Match any digit (same as previous class)
[a-z] Match any lower case ASCII letter
[A-Z] Match any upper case ASCII letter
[a-zA-Z0-9] Match any of the above classes
[^aeiou] Match anything other than a lowercase vowel
[^0-9] Match anything other than a digit
Some regex character classes (2/2)
> # some string
> transport = c("car", "bike", "plane", "boat")
> # look for e or i
> grep(pattern = "[ei]", transport, value = TRUE)
[1] "bike" "plane"
>
> # some numeric strings
> numerics = c("123", "17-April", "I-II-III", "R 3.0.1")
> grep(pattern = "[01]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[0-9]", numerics, value = TRUE)
[1] "123" "17-April" "R 3.0.1"
> grep(pattern = "[^0-9]", numerics, value = TRUE)
[1] "17-April" "I-II-III" "R 3.0.1"
POSIX character classes (1/2)
Notation Description
[[:lower:]] Lower-case letters
[[:upper:]] Upper-case letters
[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]] Blank characters: space and tab
[[:cntrl:]] Control characters
[[:punct:]] Punctuation characters: ! " # % & ' ( ) * + , - . / : ;
[[:space:]] Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])
POSIX character classes (2/2)
> # la vie (string)
> la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you print la_vie
> print(la_vie)
[1] "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"
> # if you cat la_vie
> cat(la_vie)
La vie en #FFC0CB (rose);
Cest la vie!    tres jolie
>
> # remove blank characters (spaces and tabs)
> gsub(pattern = "[[:blank:]]", replacement = "", la_vie)
[1] "Lavieen#FFC0CB(rose);\nCestlavie!tresjolie"
> # remove punctuation
> gsub(pattern = "[[:punct:]]", replacement = "", la_vie)
[1] "La vie en FFC0CB rose\nCest la vie \ttres jolie"
Quantifiers (1/2)
Notation Description
* The preceding item will be matched zero or more times
+ The preceding item will be matched one or more times
? The preceding item is optional and will be matched at most once
{n} The preceding item is matched exactly n times
{n,} The preceding item is matched n or more times
{n,m} The preceding item is matched at least n times, but not more than m times
Quantifiers (2/2)
> strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
> grep("ac*b", strings, value = TRUE)
[1] "ab" "acb" "accb" "acccb" "accccb"
> grep("ac*b", strings, value = FALSE)
[1] 2 3 4 5 6
> grepl("ac*b", strings)
[1] FALSE TRUE TRUE TRUE TRUE TRUE
> grep("ac+b", strings, value = TRUE)
[1] "acb" "accb" "acccb" "accccb"
> grep("ac?b", strings, value = TRUE)
[1] "ab" "acb"
> grep("ac{2}b", strings, value = TRUE)
[1] "accb"
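The listing above does not exercise the {n,} and {n,m} quantifiers from the table. A brief sketch with the same vector of strings:

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# "c" repeated two or more times between "a" and "b"
grep("ac{2,}b", strings, value = TRUE)     # "accb" "acccb" "accccb"

# "c" repeated at least two and at most three times
grep("ac{2,3}b", strings, value = TRUE)    # "accb" "acccb"
```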
Regex functions in stringr
Function Description
str_detect() Detect the presence or absence of a pattern in a string
str_extract() Extract first piece of a string that matches a pattern
str_extract_all() Extract all pieces of a string that match a pattern
str_match() Extract first matched group from a string
str_match_all() Extract all matched groups from a string
str_locate() Locate the position of the first occurrence of a pattern in a string
str_locate_all() Locate the position of all occurrences of a pattern in a string
str_replace() Replace first occurrence of a matched pattern in a string
str_replace_all() Replace all occurrences of a matched pattern in a string
str_split() Split up a string into a variable number of pieces
str_split_fixed() Split up a string into a fixed number of pieces
Exercise 1/3
# dollar
sub("\\$", "", "$Peace-Love")
# dot
sub("\\.", "", "Peace.Love")
# plus
sub("\\+", "", "Peace+Love")
# caret
sub("\\^", "", "Peace^Love")
# vertical bar
sub("\\|", "", "Peace|Love")
# opening round bracket
sub("\\(", "", "Peace(Love)")
# closing round bracket
sub("\\)", "", "Peace(Love)")
# opening square bracket
sub("\\[", "", "Peace[Love]")
# closing square bracket
sub("\\]", "", "Peace[Love]")
# opening curly bracket
sub("\\{", "", "Peace{Love}")
Exercise 2/3
# replace word character with "_"
sub("\\w", "_", "the dandelion war 2010")
gsub("\\w", "_", "the dandelion war 2010")
# replace non-word character with "_"
sub("\\W", "_", "the dandelion war 2010")
gsub("\\W", "_", "the dandelion war 2010")
Exercise 3/3
# people names
people = c("rori", "emilia", "matteo", "mehmet", "filipe", "anna", "tyler", "rasmus",
"jacob", "youna", "flora", "adi")
# match "m" at most once
grep(pattern = "m?", people, value = TRUE)
# match "m" exactly once
grep(pattern = "m{1}", people, value = TRUE, perl = FALSE)
# match "m" zero or more times, and "t"
grep(pattern = "m*t", people, value = TRUE)
# match "t" zero or more times, and "m"
grep(pattern = "t*m", people, value = TRUE)
6. Linear regression
Example (1/13)
This is an example of implementing linear regression models in R.
We will use the R dataset Cars93 in the MASS library
> library(MASS)
> df <- Cars93
> dim(df)
[1] 93 27
Use the dim() function to see the size of the data. There are 93
observations and 27 variables (features/predictors) in the dataset
Example (2/13)
> head(df,3)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain
1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front
3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front
Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1 4 1.8 140 6300 2890 Yes 13.2 5 177
2 6 3.2 200 5500 2335 Yes 18.0 5 195
3 6 2.8 172 5500 2280 Yes 16.9 5 180
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1 102 68 37 26.5 11 2705 non-USA Acura Integra
2 115 71 38 30.0 15 3560 non-USA Acura Legend
3 102 67 37 28.0 14 3375 non-USA Audi 90
Use the head() function to look at the first
few observations of the data. This is an
important step in data analysis!
Example (3/13)
> sapply(df, class)
Manufacturer Model Type Min.Price Price Max.Price
"factor" "factor" "factor" "numeric" "numeric" "numeric"
MPG.city MPG.highway AirBags DriveTrain Cylinders EngineSize
"integer" "integer" "factor" "factor" "factor" "numeric"
Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
"integer" "integer" "integer" "factor" "numeric" "integer"
Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room
"integer" "integer" "integer" "integer" "numeric" "integer"
Weight Origin Make
"integer" "factor" "factor"
Use sapply() to look at the data type
of each variable
Example (4/13)
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
Let’s look at two variables of cars:
horsepower and price. Are they
correlated?
Example (5/13)
Estimate parameters of a simple linear
regression model by using R function
> # Simple linear regression (method 2) -----------------
> x <- df$Horsepower # predictor (assumed definition, not shown on the slide)
> y <- df$Price # response (assumed definition, not shown on the slide)
> model <- lm(y ~ x)
> model$coefficients
(Intercept) x
-1.3987691 0.1453712
> beta0 <- model$coefficients[1]
> beta1 <- model$coefficients[2]
>
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
> y_hat_vec <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat_vec, lty = 2, col = 4)
> legend(50,
+ 30,
+ lty = 2,
+ col = 4,
+ "Regression line")
> residuals_vec <- df$Price - y_hat_vec
> summary(residuals_vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-16.4100 -2.7920 -0.8208 0.0000 1.8030 31.7500
Example (6/13)
> summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-16.413 -2.792 -0.821 1.803 31.753
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.3988 1.8200 -0.769 0.444
x 0.1454 0.0119 12.218 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.977 on 91 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6171
F-statistic: 149.3 on 1 and 91 DF, p-value: < 2.2e-16
The residual here means the error $y_i - \hat{y}_i$, i.e., the observed value minus the fitted value.
Example (7/13)
> summary(model)
(Same output as in Example (6/13).)
Std. Error is the standard deviation of the sampling
distribution of the coefficient estimate under
standard regression assumptions.
It should be noted that you are not required to
understand how standard errors are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12
Example (8/13)
> summary(model)
(Same output as in Example (6/13).)
• t value is the t-statistic for testing
whether the corresponding regression
coefficient is different from 0.
• Pr(>|t|) is the p-value of the hypothesis test
for the t value. The null hypothesis is that the
coefficient is zero.
It should be noted that you are not required to
understand how t value and p-value are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12
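For readers who are curious, the Std. Error, t value and Pr(>|t|) columns can be reproduced from the fitted model. The following is a sketch under the assumption that the model on the slides is Price regressed on Horsepower for the Cars93 data; the names est, se, t_value and p_value are introduced here for illustration.

```r
library(MASS)                                   # provides the Cars93 data
model <- lm(Price ~ Horsepower, data = Cars93)

est <- coef(model)                              # coefficient estimates
se <- sqrt(diag(vcov(model)))                   # standard errors from the covariance matrix
t_value <- est / se                             # t-statistic for H0: coefficient = 0

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df.residual(model), lower.tail = FALSE)

cbind(Estimate = est, "Std. Error" = se, "t value" = t_value, "Pr(>|t|)" = p_value)
```

The resulting table should agree with coef(summary(model)).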
Example (9/13)
> summary(model)
(Same output as in Example (6/13).)
R-squared is a statistical measure of how close
the data are to the fitted regression line. It is also
known as the coefficient of determination,
simply defined by
$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$
In general, the higher the R-squared, the better
the model fits your data.
It should be noted that you are not required to
understand how R-squared, multiple R-squared,
adjusted R-squared and their tests are calculated.
However, if you are interested, please read
Casella’s book, Chapters 11-12
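The definition of R-squared can be checked numerically. This sketch again assumes the model is Price regressed on Horsepower for Cars93; for a model with an intercept, explained variation divided by total variation equals one minus residual variation divided by total variation.

```r
library(MASS)                                   # provides the Cars93 data
model <- lm(Price ~ Horsepower, data = Cars93)

y <- Cars93$Price
ss_total <- sum((y - mean(y))^2)                # total variation
ss_residual <- sum(residuals(model)^2)          # unexplained variation
ss_explained <- ss_total - ss_residual          # explained variation

r_squared <- ss_explained / ss_total
r_squared
summary(model)$r.squared                        # value reported by summary()
```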
Example (10/13)
Prediction
If a new Audi A4 has 175 horsepower, what is
its predicted selling price?
> # Prediction ------------------------------------------
>
> x_i <- 175
> y_hat_i <- beta1 * x_i + beta0
>
> plot(df$Horsepower, df$Price,
+ xlab = "Horsepower",
+ ylab = "Price")
> y_hat <- beta1 * df$Horsepower + beta0
> lines(df$Horsepower, y_hat, lty = 2, col = 4)
> points(x_i, y_hat_i, col = 2, pch=9)
> legend(75,
+ 50,
+ lty = c(2,NA),
+ pch = c(NA,9),
+ col = c(4,2),
+ c("Regression line", "New Audi A4"))
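The same prediction can also be obtained with predict() instead of computing beta1 * x_i + beta0 by hand. This is a sketch that assumes the model is fitted with a formula and a data argument, so that the new observation can be supplied as a data frame; new_car is a name introduced here.

```r
library(MASS)                                   # provides the Cars93 data
model <- lm(Price ~ Horsepower, data = Cars93)

# The new observation must use the same column name as the predictor
new_car <- data.frame(Horsepower = 175)

# Point prediction
price_hat <- predict(model, newdata = new_car)
price_hat

# Point prediction together with a 95% prediction interval
predict(model, newdata = new_car, interval = "prediction")
```

The interval gives an idea of how uncertain the predicted price is for a single new car.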
Example (11/13)
> attach(df)
> pairs(
+ data.frame(
+ MPG.city,
+ MPG.highway,
+ EngineSize,
+ Horsepower,
+ Fuel.tank.capacity,
+ Length,
+ Width,
+ Rear.seat.room,
+ Luggage.room
+ )
+ )
> detach(df)
Let’s look at many
variables of cars
Example (12/13)
> attach(df)
> model.multiple <-
+ lm(
+ Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room
+ )
> detach(df)
> model.multiple$coefficients
(Intercept) MPG.city MPG.highway EngineSize Horsepower Fuel.tank.capacity Length
59.1474034 0.2363122 -0.3766282 1.8048313 0.1290087 0.6154648 0.1150924
Width Rear.seat.room Luggage.room
-1.3785983 0.1206144 0.2735771
Estimate parameters of a multiple linear
regression model by using R function
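As an alternative to attach() and detach(), the data frame can be passed to lm() through its data argument, which avoids leaving variables on the search path. A sketch of the same model written that way:

```r
library(MASS)                                   # provides the Cars93 data

# data = Cars93 tells lm() where to find the variables named in the formula
model.multiple <- lm(
  Price ~ MPG.city + MPG.highway + EngineSize + Horsepower +
    Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room,
  data = Cars93
)

round(coef(model.multiple), 4)
nobs(model.multiple)    # number of rows actually used (rows with NA are dropped)
```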
Example (13/13)
> summary(model.multiple)
Call:
lm(formula = Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + Fuel.tank.capacity + Length + Width + Rear.seat.room + Luggage.room)
Residuals:
Min 1Q Median 3Q Max
-11.7444 -3.7098 -0.2932 2.9824 28.7627
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.14740 27.51934 2.149 0.03497 *
MPG.city 0.23631 0.44678 0.529 0.59848
MPG.highway -0.37663 0.44106 -0.854 0.39598
EngineSize 1.80483 1.85233 0.974 0.33314
Horsepower 0.12901 0.02576 5.008 3.78e-06 ***
Fuel.tank.capacity 0.61546 0.50620 1.216 0.22801
Length 0.11509 0.11504 1.000 0.32044
Width -1.37860 0.49336 -2.794 0.00666 **
Rear.seat.room 0.12061 0.33957 0.355 0.72348
Luggage.room 0.27358 0.39166 0.699 0.48711
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.868 on 72 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.6914, Adjusted R-squared: 0.6528
F-statistic: 17.92 on 9 and 72 DF, p-value: 3.547e-15
7. Logistic regression
Example (1/13)
This is an example of implementing logistic regression models in R.
We will use the Housing.csv dataset
> df <- read.csv("C:/Housing.csv")
> dim(df)
[1] 546 12
Use the dim() function to see the size of the data. There are 546
observations and 12 variables (features/predictors) in the dataset
Example (2/13)
> head(df)
price housesize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
1 420 5850 3 1 2 1 0 1 0 0 1 0
2 385 4000 2 1 1 1 0 0 0 0 0 0
3 495 3060 3 1 1 1 0 0 0 0 0 0
4 605 6650 3 1 2 1 1 0 0 0 0 0
5 610 6360 2 1 1 1 0 0 0 0 0 0
6 660 4160 3 1 1 1 1 1 0 1 0 0
Use the head() function to look at the first few
(by default 6) observations of the data.
Example (3/13)
> lapply(df,class)
$price
[1] "numeric"
$housesize
[1] "integer"
$bedrooms
[1] "integer"
$bathrms
[1] "integer"
$stories
[1] "integer"
$driveway
[1] "integer"
…….
Use lapply() to look at the data type of
each variable (displayed vertically)
Example (4/13)
> summary(df)
price housesize bedrooms bathrms stories driveway
Min. : 250.0 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.000
1st Qu.: 491.2 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median : 620.0 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :1.000
Mean : 681.2 Mean : 5150 Mean :2.965 Mean :1.286 Mean :1.808 Mean :0.859
3rd Qu.: 820.0 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.000
Max. :1900.0 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :1.000
recroom fullbase gashw airco garagepl prefarea
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1777 Mean :0.3498 Mean :0.04579 Mean :0.3168 Mean :0.6923 Mean :0.2344
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000 Max. :1.0000
Use summary() to produce summary statistics for each variable.
Example (5/13)
> summary(df$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
250.0 491.2 620.0 681.2 820.0 1900.0
Use summary() on a single column to summarise one variable at a time.
Example (6/13)
Let’s create a graph with two subplots, one for each predictor. This can be very helpful for
understanding the effect of each predictor on the response variable.
> par(mfrow=c(1, 2))
> plot(df$price, df$fullbase,xlab = "Price",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,
+ pch = 16, col = "green",cex.lab=1.5, cex.axis=1.5,
+ cex.sub=1.5)
> plot(df$housesize, df$fullbase,xlab = "Housesize",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,
+ pch = 16, col = "blue",cex.lab=1.5, cex.axis=1.5,
+ cex.sub=1.5)
Example (7/13)
> model1<-glm(fullbase~price,data=df,family=binomial)
> model1$coefficients
(Intercept) price
-1.622737e+00 1.447098e-03
> plot(df$price, df$fullbase,xlab = "Price",
+ ylab = "Fullbase",frame.plot=TRUE,cex=1.5,pch = 16,
+ col = "blue",cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(min(df$price),max(df$price))
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
Fit a logistic regression model by
using the R built-in function glm() with family=binomial.
Note: The regression curve may not be clear because of the
wide range of values of the price variable.
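The curve drawn with lines() comes from predict(..., type="response"). As a minimal sketch of what that option does, the code below fits the same kind of model to simulated data and compares the two prediction scales. The data, the "true" coefficients -1.6 and 0.0015, and all sim_* names are assumptions made for illustration; they are not the Housing data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated (not real) data: a binary response whose probability rises with price
sim_price    <- runif(200, min = 250, max = 1900)
sim_fullbase <- rbinom(200, size = 1, prob = plogis(-1.6 + 0.0015 * sim_price))
sim_df       <- data.frame(price = sim_price, fullbase = sim_fullbase)

sim_model <- glm(fullbase ~ price, data = sim_df, family = binomial)

grid <- data.frame(price = seq(250, 1900, by = 50))

# type = "link" gives the linear predictor b0 + b1 * price (the log-odds)
eta <- predict(sim_model, grid, type = "link")

# type = "response" gives probabilities between 0 and 1
prob <- predict(sim_model, grid, type = "response")

# The two scales are related by the logistic function plogis()
same <- isTRUE(all.equal(unname(prob), unname(plogis(eta))))
print(same)
```

Plotting prob against grid$price gives the kind of curve that lines(xprice, yprice) draws in the example.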
Example (8/13)
# get better regression line plot
> range(df$price)
[1] 250 1900
>
> plot(df$price, df$fullbase, xlim=c(0,2150),ylim=c(-1,2),
+ xlab = "Price", ylab = "Fullbase", col = "blue",
+ frame.plot=TRUE,cex=1.5,pch = 16,cex.lab=1.5,
+ cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(0,2150)
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
Here the plot limits are widened so that more of the fitted curve is visible.
In the model summary (Example 9/13) we see:
• Whether the response variable and the predictor(s) are
positively or negatively associated (the sign of each estimate)
• The z value and p-value are for the hypothesis
test of whether the coefficient is zero or not.
The null hypothesis is that the coefficient is
zero. As the p-value is much less than 0.05,
we reject the null hypothesis that β = 0.
Example (9/13)
> summary(model1)
Call:
glm(formula = fullbase ~ price, family = binomial, data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6778 -0.8992 -0.8012 1.3529 1.7316
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6227365 0.2567345 -6.321 2.60e-10 ***
price 0.0014471 0.0003423 4.228 2.36e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 688.28 on 544 degrees of freedom
AIC: 692.28
Number of Fisher Scoring iterations: 4
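The z value and p-value in the coefficient table can be reproduced by hand. The sketch below uses the rounded estimate and standard error printed for price, so the result may differ from the table in the last digit. The variable names are illustrative only.

```r
# Values copied from the price row of the summary(model1) table
estimate  <- 0.0014471
std_error <- 0.0003423

# Wald statistic: the estimate divided by its standard error
z_value <- estimate / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_value))

print(round(z_value, 3))
print(signif(p_value, 3))
```

A large |z| means the estimate is many standard errors away from zero, which is why the p-value is so small here.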
Example (10/13)
> summary(model1)
(The output is identical to that shown in Example (9/13).)
Deviance is a measure of goodness of fit of a regression
model (higher numbers indicate worse fit). The ‘Null
deviance’ shows how well the response variable is
predicted by a model that includes only the intercept.
In R:
model1$null.deviance (find the null deviance)
model1$deviance (find the residual deviance)
For example, the null deviance is 706.89 on 545 degrees
of freedom. Including the independent variable (price)
decreases the deviance to 688.28 on 544 degrees of
freedom.
The residual deviance is 18.61 lower than the null deviance, at the cost
of one degree of freedom.
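One common way to judge whether such a drop in deviance is large is a chi-squared (likelihood-ratio) test. This test is an addition for illustration and is not part of the original slides. The numbers are taken from the summary(model1) output.

```r
# Deviances and degrees of freedom copied from summary(model1)
null_dev  <- 706.89
null_df   <- 545
resid_dev <- 688.28
resid_df  <- 544

# Drop in deviance and the degrees of freedom used up by adding price
dev_drop <- null_dev - resid_dev
df_drop  <- null_df - resid_df

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)

print(dev_drop)
print(p_value)
```

With the fitted model available, anova(model1, test = "Chisq") reports the same comparison.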
Example (11/13)
> summary(model1)
(The output is identical to that shown in Example (9/13).)
The Akaike Information Criterion (AIC) provides a
method for assessing the quality of your model through
comparison of related models (among the models compared, the one with the
smallest AIC is the best-fitting model).
Fisher scoring is a variant of Newton’s
method for solving maximum likelihood
problems numerically.
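For a 0/1 response such as fullbase, the residual deviance equals minus twice the log-likelihood, so the AIC printed by summary() is the residual deviance plus twice the number of estimated coefficients. The sketch below checks this against the printed values for model1 and for the two-predictor model2 of Example (13/13). The helper name aic_from_deviance is hypothetical.

```r
# AIC for a binary-response logistic model: deviance + 2 * number of coefficients
aic_from_deviance <- function(deviance, n_coef) {
  deviance + 2 * n_coef
}

# model1: intercept + price (2 coefficients), residual deviance 688.28
aic1 <- aic_from_deviance(688.28, n_coef = 2)

# model2: intercept + price + housesize (3 coefficients), residual deviance 686.19
aic2 <- aic_from_deviance(686.19, n_coef = 3)

print(c(model1 = aic1, model2 = aic2))
```

The two AIC values are almost equal: adding housesize lowers the deviance by about as much as the penalty for the extra coefficient.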
Example (12/13) Prediction
If a new house has a rental price of 385.00 pounds, what is
the predicted probability that fullbase = 1 for this house?
> # Prediction ------------------------------------------
> model1<-glm(fullbase~price,data=df,family=binomial)
> plot(df$price, df$fullbase,xlab = "Price", ylab = "Fullbase",
+ frame.plot=TRUE,cex=1.5,pch = 16, col = "blue",
+ cex.lab=1.5, cex.axis=1.5, cex.sub=1.5)
> xprice<-seq(min(df$price),max(df$price))
> yprice<-predict(model1,list(price=xprice),type="response")
> lines(xprice,yprice)
> newdata <- data.frame(price = 385.00)
> y_hat_i<-predict(model1, newdata, type="response")
> points(newdata$price, y_hat_i, col = 2, pch=20)
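The prediction can also be worked out by hand from the fitted coefficients, which shows what predict(..., type="response") computes. This is a sketch using the rounded estimates printed by summary(model1). The names b0, b1 and new_price are illustrative.

```r
# Coefficients copied from summary(model1)
b0 <- -1.6227365   # intercept
b1 <-  0.0014471   # slope for price

new_price <- 385

# Linear predictor (log-odds) for the new house
eta <- b0 + b1 * new_price

# Inverse logit turns log-odds into a probability
p_hat    <- plogis(eta)
p_manual <- 1 / (1 + exp(-eta))   # the same formula written out

print(round(p_hat, 3))
```

The result is roughly 0.26, i.e. the model gives a house at this price about a one-in-four chance of having fullbase = 1.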
Example (13/13)
> model2<-glm(fullbase~price+housesize,data=df,family=binomial)
> model2$coefficients
(Intercept) price housesize
-1.466744e+00 1.766831e-03 -7.286285e-05
> summary(model2)
Call:
glm(formula = fullbase ~ price + housesize, family = binomial,
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7777 -0.8973 -0.7971 1.3701 1.7224
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.467e+00 2.784e-01 -5.269 1.37e-07 ***
price 1.767e-03 4.120e-04 4.289 1.80e-05 ***
housesize -7.286e-05 5.108e-05 -1.427 0.154
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 706.89 on 545 degrees of freedom
Residual deviance: 686.19 on 543 degrees of freedom
AIC: 692.19
Number of Fisher Scoring iterations: 4
8. Other topics
Assignment operators: ‘=’ vs. ‘<-’
In R, you can use both ‘=’ and ‘<-’ as assignment operators. So what’s the difference
between them and which one should you use?
What’s the difference?
The main difference between the two assignment operators is scope. It’s easiest to see the
difference with an example:
> mean(x=1:10)
[1] 5.5
> x
Error: object 'x' not found
Here x is declared within the scope of the function, so it doesn’t exist in the user workspace.
> mean(x<-1:10)
[1] 5.5
> x
[1] 1 2 3 4 5 6 7 8 9 10
This time the x variable is declared within the user workspace.
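The same effect can be checked programmatically. The sketch below runs both calls inside a function so that it does not touch your workspace; the function's own environment plays the role of the workspace here. The function name compare_assignment is hypothetical.

```r
compare_assignment <- function() {
  here <- environment()   # the environment in which the calls below are made

  # `=` inside a call only names the argument x of mean()
  m1 <- mean(x = 1:10)
  after_equals <- exists("x", envir = here, inherits = FALSE)

  # `<-` inside a call assigns x in the calling environment
  m2 <- mean(x <- 1:10)
  after_arrow <- exists("x", envir = here, inherits = FALSE)

  list(m1 = m1, m2 = m2,
       after_equals = after_equals, after_arrow = after_arrow)
}

res <- compare_assignment()
print(res$after_equals)
print(res$after_arrow)
```

Both calls return the same mean; only the second leaves a variable x behind.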
When does the assignment take place? (1/2)
In the code above, you may be tempted
to think that we “assign 1:10 to x, then
calculate the mean.” This would be
true for languages such as C, but it isn’t
true in R. Consider the function
below. Notice that the
value of a hasn’t changed!
> a <- 1
> f <- function(a) {
+ return(TRUE)
+ }
> res <- f(a <- a + 1); a
[1] 1
When does the assignment take place? (2/2)
In R, the value of a will only change if
we need to evaluate the argument in
the function. This can lead to
unpredictable behaviour:
> f <- function(a) {
+ if (runif(1) > 0.5)
+ TRUE
+ else
+ a
+ }
> a <- 1
> f(a <- a+1); a
[1] 2
[1] 2
> f(a <- a+1); a
[1] 3
[1] 3
> f(a <- a+1); a
[1] TRUE
[1] 3
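If a function should always evaluate its argument, base R's force() makes that explicit. A minimal sketch contrasting the two behaviours follows; f_lazy and f_forced are hypothetical names.

```r
f_lazy <- function(a) {
  TRUE            # `a` is never used, so the argument is never evaluated
}

f_forced <- function(a) {
  force(a)        # evaluate the argument now, whatever happens later
  TRUE
}

a <- 1
f_lazy(a <- a + 1)
a_after_lazy <- a      # the assignment in the argument never ran

f_forced(a <- a + 1)
a_after_forced <- a    # the assignment ran exactly once

print(c(a_after_lazy, a_after_forced))
```

In practice it is simpler to avoid assignments inside function calls altogether, so that the result does not depend on when (or whether) the argument is evaluated.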
Which one should I use? (1/2)
Well, there’s quite a strong following for the “<-” operator:
• The Google R style guide prohibits the use of “=” for assignment.
• Hadley Wickham’s style guide recommends “<-”.
• If you want your code to be compatible with S-plus you should use “<-”
(Note: it seems that S-plus now accepts “=”).
• The general R community recommends using “<-”.
Which one should I use? (2/2)
Some people use the “=” operator for the following reasons:
• Other languages use the “=” operator, e.g., Python, C
• It’s quicker to type “=” than “<-”
• They want the declared variable to exist in the current workspace
• Using “=” avoids misleading expressions like if (x[1]<-2), which assigns 2 to x[1] rather than testing x[1] < -2
Computer representation of numbers (1/2)
> a <- sqrt(2)
> a * a == 2
[1] FALSE
> a * a - 2
[1] 4.440892e-16
> all.equal(a * a, 2)
[1] TRUE
Most real numbers cannot be stored exactly on
computers. They are stored in a binary version of
“scientific” notation (the decimal analogue is, e.g., 1.24 × 10^2).
The function all.equal() compares two
objects using a numeric tolerance of 1.5e-8 (default). If you want much greater
accuracy than this you will need to
consider error propagation carefully.
Computer representation of numbers (2/2)
> x<- seq(0,0.5,0.1)
> x
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> y <- c(0,0.1,0.2,0.3,0.4,0.5)
> y
[1] 0.0 0.1 0.2 0.3 0.4 0.5
> x == y
[1] TRUE TRUE TRUE FALSE TRUE TRUE
> for (i in seq_along(x)) {
+ print(all.equal(x[i], y[i]))
+ }
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
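The element-by-element loop can be replaced by vectorised comparisons. The sketch below shows two options; the tolerance tol is an arbitrary choice made for illustration.

```r
x <- seq(0, 0.5, 0.1)
y <- c(0, 0.1, 0.2, 0.3, 0.4, 0.5)

# Exact comparison: at least one element differs because of rounding error
exact <- x == y

# Element-wise comparison with an explicit tolerance (assumed value)
tol <- 1e-8
close_enough <- abs(x - y) < tol

# Whole-vector comparison; wrap in isTRUE() because all.equal()
# returns a character description, not FALSE, when objects differ
whole <- isTRUE(all.equal(x, y))

print(exact)
print(close_enough)
print(whole)
```

Use isTRUE(all.equal(...)) inside if() conditions; a bare all.equal() call is not safe there.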
Assigning a value (1/2)
> x <- c(8, 6, 4)
> x[7] <- 10
> x
[1] 8 6 4 NA NA NA 10
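Assigning a value to a nonexistent element
of a vector or list will
expand that structure to accommodate the
new value, filling the gap with NA (or NULL for a list).
A matrix or array does not grow this way: an out-of-range [i, j] assignment is an error.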
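A short sketch of this behaviour for a vector and a list. As an addition to the slide, it also shows that an out-of-range [i, j] assignment on a matrix raises an error instead of growing the matrix. The names lst, m and matrix_result are illustrative.

```r
# Vector: the gap between the old end and the new element is filled with NA
x <- c(8, 6, 4)
x[7] <- 10
print(x)

# List: the gap is filled with NULL
lst <- list(a = 1)
lst[[3]] <- "z"
print(length(lst))

# Matrix: indexing outside the dimensions with [i, j] is an error
m <- matrix(1:4, nrow = 2)
matrix_result <- tryCatch({
  m[3, 3] <- 1L
  "no error"
}, error = function(e) "error")
print(matrix_result)
```

Because silent expansion can hide indexing mistakes, it is worth checking length() after such assignments.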
Assigning a value (2/2)
In R, the use of semicolons between statements is optional, and most people don't bother.
However, if a statement is split over two lines, e.g.,
y <- 2 + 3
+ 5
there is a risk that R treats the statement as ended on the first line, i.e. that you said y <- 2 + 3
It is better to signal to R that an expression is incomplete, e.g., by ending the first line with the operator:
y <- 2 + 3 +
5
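The difference between the two layouts can be verified directly. The sketch below also shows a third option, wrapping the expression in parentheses, which is an addition to the slide. The names y1, y2 and y3 are illustrative.

```r
# The first line is a complete statement; the second line is evaluated
# separately as the expression +5 and its value is discarded
y1 <- 2 + 3
+ 5

# A trailing operator tells R that the expression continues
y2 <- 2 + 3 +
  5

# An open parenthesis also keeps the expression open across lines
y3 <- (2 + 3
       + 5)

print(c(y1, y2, y3))
```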
Debugging with RStudio
Usually, I do not recommend using R for
projects with many dependency files.
Instead, calling R from other languages
such as Java/Python/C++ for statistical
analysis would be a better solution.
Debugging with RStudio is easy and
simple (similar to Matlab).
For detailed operations see:
https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio
LaTeX
LaTeX is a document preparation system
for high-quality typesetting. It is freely
available for Windows, Mac, and Linux
platforms.
Donald E. Knuth
http://cs.stanford.edu/~uno/
https://latex-project.org/intro.html
Sweave (R + LaTeX)
• Install LaTeX on your PC
• Make sure Sweave is available in RStudio (Sweave itself ships with R in the utils package)
• Download the SweaveDemo.rnw file
from Blackboard
• Open the file and compile the PDF as
shown on the right!
R markdown
• Download the RMarkdownDemo.rmd file from Blackboard
• Open the file and compile the HTML as shown below!
Other topics in R that are not covered in our lectures
• Rcpp: R and C++ mixed programming
• rJava: R and Java mixed programming
• rPython: R and Python mixed programming
• Creating your own R package
• R for statistical modelling (gbm, etc.)
• R for machine learning (kernlab, RWeka, caret, nnet, etc.)
• R for time series analysis
• …
Key references
• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.
• P. Teetor (2011) R Cookbook. O’Reilly.
• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.