An Introduction to Statistical Computing in R
K2I Data Science Boot Camp - Day 1 AM Session
May 15, 2017
Statistical Computing in R May 15, 2017 1 / 55
AM Session Outline
Intro to R Basics
Plotting In R
Data Manipulation
R Basics
Here we will give a quick overview of the R language and the RStudio IDE.
Our emphasis will be to explore the most used features of R, especially those used in later courses.
This won’t cover all the details, but will cover the most important parts.
Working with RStudio
Before beginning with R let’s orient ourselves with RStudio.
Our initial view of RStudio is:
Go to: File -> New File -> R Script. This gives:
Try It Out
Type the following into the console:
?lm
??linear
plot(1:20, 1:20)
There are several useful shortcut keys in RStudio. A few popular ones:
Ctrl+Enter - When pressed in Editor, sends current line to console.
Ctrl+1, Ctrl+2 - switch between editor and console
Ctrl+Shift+Enter - run entire script in console
tab completion - this is perhaps the most used feature
For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.
It’s important to know our working directory.
Given a file name, R will assume it is located in your current working directory.
R will also save output to the working directory by default.
It is important to set your working directory to the correct location or specify full path names.
Try out the following in the console window:
getwd()
list.files()
To change your working directory go to: Session -> Set Working Directory -> Choose Directory
Alternatively,
setwd("/path/to/directory")
Reading, Writing, Saving, and Loading
Here we’ll look at bringing data into R and getting it out
We’ll also see how to save R objects and environments
Reading In Data
read.table
read.csv
read.fwf
Check out the options for each with ?read.table
Syntax
?read.table
?read.csv
read.table("/path/to/your/file.ext",
header=TRUE,
sep=",",
stringsAsFactors = FALSE)
Most Common Options
sep tells how fields/variables are separated. Common values are:
"," (comma)
" " (single space)
"\t" (tab escape character)
stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.
header tells whether first line of file has variable names
na.strings tells how missing values are encoded in the file.
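The options above can be tried without any file on disk. The following is a minimal sketch that reads a few lines of made-up text through read.table's text argument; the data, the semicolon separator, and the "?" missing-value code are all invented for this illustration.

```r
# Hypothetical semicolon-separated data with a header line;
# missing values are encoded as "?" (an assumption for this example)
raw.text <- "name;height;group
ann;1.62;a
bob;?;b
cy;1.80;a"

demo <- read.table(text = raw.text,
                   header = TRUE,             # first line holds variable names
                   sep = ";",                 # fields are separated by semicolons
                   na.strings = "?",          # "?" marks a missing value
                   stringsAsFactors = FALSE)  # keep strings as character
str(demo)
```

With na.strings set, the height column should come back numeric with one NA rather than as text.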
Standard Procedure
Open file in text editor
Check items relevant to options. Header? Separator type?
For big files, Linux tools are helpful: head -n10 BigFile.txt > OpenMe
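If command line tools are not available, something similar can be done from within R. The sketch below uses readLines with its n argument to peek at the start of a file; the temporary file and its contents are made up so that the example is self-contained.

```r
# Create a small throwaway file so the example is self-contained
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("id,value", "1,3.5", "2,4.1", "3,NA"), tmp.file)

# Peek at the first two lines without reading the whole file
first.lines <- readLines(tmp.file, n = 2)
first.lines
```

From these lines we can already see that this file has a header and uses a comma separator.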
Try it Out
Let’s read the ReadMeInX.txt files into R.
Try it on your own before looking at the answer on the next slides.
Example workflow:
1 Set your working directory to the directory containing the files.
2 Examine the files in a text editor to check for common options (header, separator, etc.)
# read.table's default separator is ok for this one
set0 <- read.table("ReadMeIn0.txt",
header=TRUE)
# specify a new separator
set1 <- read.table("ReadMeIn1.txt",
header=TRUE,
sep=',')
# Or use read.csv
set1 <- read.csv("ReadMeIn1.txt",
header=TRUE)
# another change of separator
set2 <- read.table("ReadMeIn2.txt",
header=TRUE,
sep=';')
# check for missing
set3 <- read.table("ReadMeIn3.txt",
header=FALSE,
sep=',',
na.strings = '')
Writing Data
write.table
write.csv
Syntax and Common Options
?write.csv
write.csv(myRObject,
file="/path/to/save/spot/file.csv",
row.names=FALSE)
Options largely the same as their read counterparts
row.names = FALSE is helpful to avoid writing the row names 1,2,3,... as an extra variable/column
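To see what row.names changes, the sketch below writes the same small data frame twice and reads both files back. The data frame and the temporary file names are made up for the illustration.

```r
small.df <- data.frame(x = c(10, 20), y = c("a", "b"))
with.rn <- tempfile(fileext = ".csv")
without.rn <- tempfile(fileext = ".csv")

write.csv(small.df, file = with.rn)                        # default: row names are written
write.csv(small.df, file = without.rn, row.names = FALSE)  # row names are left out

names(read.csv(with.rn))     # an extra first column holds the old row names
names(read.csv(without.rn))  # only x and y
```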
Try It Out
Write out one of the files you imported. Try varying options like sep and quote (write.csv ignores sep, so use write.table to change the separator).
Saving Objects
saveRDS/readRDS are used to save (a compressed version of) individual R objects
# save our data set
saveRDS(set1,file="TstObj.rds")
# get it back
newtst <- readRDS("TstObj.rds")
# can save any R object. Try a vector
my.vector <- c(1,8,-100)
saveRDS(my.vector, file="JustAVector.rds")
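A quick way to convince ourselves that nothing is lost is to read the object back and compare it with the original. This is a small sketch using a temporary file (the file name is arbitrary):

```r
my.vector <- c(1, 8, -100)
rds.file <- tempfile(fileext = ".rds")

saveRDS(my.vector, file = rds.file)   # write the single object
restored <- readRDS(rds.file)         # read it back under a new name

identical(my.vector, restored)        # TRUE if the round trip was lossless
```

Unlike load, readRDS returns the object, so we choose the name it is assigned to.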
Saving Environment
We can save all variables in the current R workspace with save.image
We can load in a saved workspace with load
R will ask whether to save your workspace when you exit
# Save all our work
save.image("AllMyWork.RData")
# Reload it
load("AllMyWork.RData")
# name given to default save
load(".RData")
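When only a few objects matter, save (rather than save.image) can store just those. The sketch below is one possible way to do this; the object names and the temporary file are made up, and loading into a separate environment is a design choice to avoid overwriting variables in the current workspace.

```r
a <- 1:3
b <- "keep me"
rdata.file <- tempfile(fileext = ".RData")

# save only the named objects, not the whole workspace
save(a, b, file = rdata.file)

# load into a scratch environment so nothing in the workspace is overwritten
scratch <- new.env()
loaded.names <- load(rdata.file, envir = scratch)
loaded.names   # load() returns the names of the objects it restored
scratch$b
```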
The Basics of R
Let’s do a whirlwind tour of R: its syntax and data structures
This won’t cover all the details, but will cover the most important parts
Basic R Data Types
# numeric types: integer, double
348
# character
"my string"
# logical
TRUE
FALSE
# arithmetic as you'd expect
43 + 1 * 2^4
# so too logical operators/comparison
TRUE | FALSE
1 + 7 != 7
# Other logical operators:
# &, |, !
# <,>,<=,>=, ==, !=
Data Types Cont.
# variables assignment is done with the <- operator
my.number <- 483
# the '.' above does nothing. we could have done:
# mynumber <- 483
# instead
# it's an Rism to use .'s in variable names.
# typeof() tells us the type
typeof(my.number)
## [1] "double"
# we can convert between types
my.int <- as.integer(my.number)
typeof(my.int)
## [1] "integer"
# we can test for types
is.logical(my.int)
## [1] FALSE
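A few more conversions in the same spirit, as a small sketch. The values are arbitrary; the L suffix is R's way of writing an integer literal.

```r
typeof(483L)        # "integer": the L suffix requests an integer
as.integer(3.9)     # 3: conversion truncates toward zero
as.character(483)   # "483"
as.logical(0)       # FALSE: zero converts to FALSE, other numbers to TRUE
is.numeric(483L)    # TRUE: integers and doubles both count as numeric
```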
R Data Structures - Vectors
# the vector is the most important data structure
# create it with c()
my.vec <- c(1,2,67,-98)
# get some properties
str(my.vec)
## num [1:4] 1 2 67 -98
length(my.vec)
## [1] 4
# access elements with []
my.vec[3]
## [1] 67
my.vec[c(3,4)]
## [1] 67 -98
# can do assignment too
my.vec[5] <- 41.2
Vectors - Cont.
# other ways to create vectors
x <- 1:6
y <- seq(7,12,by=1)
# Operations get recycled through whole vector
x + 1
## [1] 2 3 4 5 6 7
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
# Can do component wise operations between vectors
x * y
## [1] 7 16 27 40 55 72
x / y
## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000
y %/% x
## [1] 7 4 3 2 2 2
Try It Out
# Try to guess what the following lines will do
# Will it run at all? If so, what will it give?
# Think about it and run to confirm
7 -> w
w <- z <- 44
1 + TRUE
0 | 15 & 3
my.vec[2:4]
my.vec[-2]
my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]
my.vec[
sum(
c(TRUE,FALSE,FALSE,TRUE,TRUE)
)
] <- TRUE
my.vec[3] <- "I'm a string"
as.numeric(my.vec)
x[x>3]
x + c(1,2)
Matrices
# matrices are 2d vectors.
# create using matrix()
my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)
# rnorm(20) draws 20 random samples from a N(0,1) distribution
my.matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5351131 1.08710882 0.5670939 0.2800755 -0.8050743
## [2,] -1.9263838 0.86267009 0.7318280 0.4177110 -0.9576529
## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516 0.3747967
## [4,] -2.6190953 -0.04829205 1.3157181 1.2562005 0.1131199
# note matrices are filled by column
# Get details
dim(my.matrix)
## [1] 4 5
nrow(my.matrix)
## [1] 4
ncol(my.matrix)
## [1] 5
Matrices - Cont.
# Indexing is similar to vectors but with 2 dimensions
# get second row
my.matrix[2,]
## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529
# get the first and fourth columns of row three
my.matrix[3,c(1,4)]
## [1] -1.293177 1.978752
# transposing done with t()
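A small sketch of transposing, using a made-up 2 x 3 matrix whose entries make the column-wise filling easy to see. The byrow argument is shown as the alternative way to fill a matrix.

```r
small <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
small
small.t <- t(small)                        # rows and columns swap
dim(small.t)                               # 3 2

# byrow = TRUE fills row by row instead
by.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by.row
```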
Lists
# lists are similar to vectors but can contain different types
# create with list
my.list <- list("just a string",
44,
my.matrix,
c(TRUE,TRUE,FALSE))
# access items via double brackets [[]]
my.list[[4]]
## [1] TRUE TRUE FALSE
# access multiple items
my.list[1:2]
## [[1]]
## [1] "just a string"
##
## [[2]]
## [1] 44
# list items can be named too
named.list <- list(Item1="my string",
Item2=my.list)
# access of named item is via dollar sign operator
# [[]] also works
c(named.list$Item1,named.list[[1]])
## [1] "my string" "my string"
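The difference between single and double brackets is easy to trip over, so here is a small sketch with a made-up list: single brackets return a (shorter) list, double brackets return the element itself.

```r
demo.list <- list("just a string", 44, c(TRUE, TRUE, FALSE))

class(demo.list[2])     # "list": still a list, of length one
class(demo.list[[2]])   # "numeric": the element itself
demo.list[[2]] + 1      # 45
# demo.list[2] + 1      # would fail: arithmetic is not defined for lists
```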
Putting it together
Let’s practice with R data types by doing PCA on the iris data.
data("iris")
head(iris)
str(iris)
Note that iris is a data.frame, which is a special kind of list.
PCA outline
Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)
Center and scale the matrix (Hint: ?scale)
Compute the correlation matrix
$$R = \frac{1}{n-1} X^T X$$
Here $X$ is our (centered and scaled) data matrix, $n$ is the number of rows/observations in our data, and $X^T$ is the transpose of $X$.
(Hint: t(X) is the transpose operator and A%*%B performs matrix multiplication on the matrices A and B)
PCA outline cont.
Obtain the two leading eigenvectors of the correlation matrix R. Denote these as $v_1$, $v_2$. (Hint: ?eigen)
Compute the first and second principal components via
$$z_1 = X v_1$$
$$z_2 = X v_2$$
Produce a scatter plot of $z_1$ vs $z_2$ (Hint: ?plot)
Take a few moments to try it yourself before looking at the answers on the next slides.
PCA from scratch
data("iris")
# get numeric portions of list and make a matrix
X <- as.matrix(iris[1:4])
# center and scale
X <- scale(X,center = TRUE,scale=TRUE)
# get the number of rows
n <- nrow(X)
# compute correlation matrix
R <- (1/(n-1))*t(X)%*%X
# perform eigen decomposition
Reig <- eigen(R)
# get eigenvectors
Reig.vecs <- Reig$vectors
# create principal components
pc1 <- X%*%Reig.vecs[,1]
pc2 <- X%*%Reig.vecs[,2]
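As a sanity check on the correlation step, the hand-computed matrix can be compared with R's built-in cor function. This sketch repeats the first few lines above so it runs on its own; check.attributes = FALSE is used so that only the numbers are compared.

```r
X <- scale(as.matrix(iris[1:4]), center = TRUE, scale = TRUE)
n <- nrow(X)
R <- (1 / (n - 1)) * t(X) %*% X

# should agree with the built-in correlation of the raw columns
all.equal(R, cor(iris[1:4]), check.attributes = FALSE)
```

If this returns TRUE, the centering, scaling, and matrix product were done correctly.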
PCA from scratch cont.
# compare to R's PCA function
their.pcs <-prcomp(iris[1:4],center = TRUE,scale. = TRUE)
head(their.pcs$x[,1:2])
## PC1 PC2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013 0.6718827
## [3,] -2.356335 0.3407664
## [4,] -2.291707 0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
# our result
head(cbind(pc1,pc2))
## [,1] [,2]
## [1,] -2.257141 -0.4784238
## [2,] -2.074013 0.6718827
## [3,] -2.356335 0.3407664
## [4,] -2.291707 0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053
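The two results agree exactly here, but an eigenvector is only determined up to its sign, so on another setup a column could come out negated. A hedged way to compare the two programmatically is to compare absolute values, as in this self-contained sketch (the names ours and theirs are made up):

```r
X <- scale(as.matrix(iris[1:4]))
vecs <- eigen(cor(iris[1:4]))$vectors
ours <- X %*% vecs[, 1:2]
theirs <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)$x[, 1:2]

# compare absolute values so a sign flip does not count as a mismatch
all.equal(abs(unname(ours)), abs(unname(theirs)))
```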
PCA from scratch cont.
plot(pc1,pc2,col=iris$Species)
[Figure: scatter plot of pc2 (vertical axis) against pc1 (horizontal axis), with points colored by species]
Factors
# Factors are like vectors, but with predefined allowed values called levels
# Factors are used to represent categorical variables in R
# create a factor
factor1 <- factor(c('Good','Bad','Ugly'))
# find its levels
levels(factor1)
## [1] "Bad" "Good" "Ugly"
# below gives warning, but not error
factor1[4] <- 17
## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated
# see what happened
factor1
## [1] Good Bad Ugly <NA>
## Levels: Bad Good Ugly
factor1[4] <- 'Bad'
# get the breakdown
table(factor1)
## factor1
## Bad Good Ugly
## 2 1 1
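By default the levels are sorted alphabetically. The sketch below shows one way to fix the order of the levels up front and to add a new allowed level before assigning it; the level "Great" is made up for the example.

```r
# give the levels explicitly to control their order
ratings <- factor(c("Good", "Bad", "Ugly"),
                  levels = c("Good", "Bad", "Ugly"))
levels(ratings)

# extend the allowed values first, then assign: no warning, no NA
levels(ratings) <- c(levels(ratings), "Great")
ratings[4] <- "Great"
table(ratings)
```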
Note that in one of our previous examples, R filled in the improper factor value with NA
NA is R’s way of specifying missing data
Note that missing data is handled differently than ordinary values, as we will see as we go along.
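A short sketch of how NA behaves differently, using a made-up vector: comparisons involving NA give NA, so is.na is the way to test for missing values.

```r
vals <- c(4, NA, 9)

is.na(vals)                # FALSE TRUE FALSE: the right way to find missing values
vals == NA                 # NA NA NA: comparing with NA never gives TRUE
vals > 5                   # FALSE NA TRUE
mean(vals)                 # NA: a missing value propagates through the mean
mean(vals, na.rm = TRUE)   # 6.5: missing values dropped first
```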
Questions
What will the following lines of code do?
my.matrix[3:4,1:2] <- c(4,5)
my.matrix[4,5] <- 'string'
mf.strings <- c('F','F','M','F')
factor2 <- as.factor(mf.strings)
c(factor1, factor2)
factor1 == 'Ugly'
my.list[[3]][2,]
sum(c(1,2,3,NA))
sum(c(1,2,3,NA),na.rm = TRUE)
Data Frames
The data.frame is how R represents data sets. They are simply lists, with a few additional restrictions.
# create your own
my.df <- data.frame(
age = c(45,27,19,59,71,13,5),
gender = factor(c('M','M','M','F','M','F','F'))
)
str(my.df)
## 'data.frame': 7 obs. of 2 variables:
## $ age : num 45 27 19 59 71 13 5
## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1
Data Frames - Cont.
Individual variables can be accessed via $ operator
my.df$age
## [1] 45 27 19 59 71 13 5
summary(my.df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 16.00 27.00 34.14 52.00 71.00
table(my.df$gender)
##
## F M
## 3 4
# data frames are really just lists
my.df[[2]]
## [1] M M M F M F F
## Levels: F M
Data Frames - Cont.
# data.frames can be subsetted like matrices
my.df[1:3,c("age")]
## [1] 45 27 19
# logical subsetting is especially useful for data.frames
# get ages over 40
age.logic <- my.df$age > 40
# take a subset of these rows
my.df[age.logic,]
## age gender
## 1 45 M
## 4 59 F
## 5 71 M
# create a new variable age.sq
my.df$age.sq <- my.df$age^2
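Logical conditions can also be combined with & and |. The sketch below rebuilds the small data frame so it runs on its own, and also shows subset() as an alternative that some find easier to read.

```r
my.df <- data.frame(
  age = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c("M", "M", "M", "F", "M", "F", "F"))
)

# rows where both conditions hold
older.men <- my.df[my.df$age > 40 & my.df$gender == "M", ]
older.men

# the same selection written with subset()
older.men2 <- subset(my.df, age > 40 & gender == "M")
```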
Try It Out
Let’s use R’s internal iris data set to practice with data frames
my.iris <- iris
my.iris
1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.
2 Use subsetting and R’s mean function to find the average Length.Sum of the setosa species
my.iris$Length.Sum = my.iris$Sepal.Length +
my.iris$Petal.Length
my.iris$Width.Sum = my.iris$Sepal.Width +
my.iris$Petal.Width
setosa.inds <- my.iris$Species == 'setosa'
mean(my.iris[setosa.inds,]$Length.Sum)
## [1] 6.468
Control Structures
R has all the typical control structures:
if-else statements
for loops
while loops
Syntax
if(logical_expression){
  execute_code
} else{
  execute_other_code
}
for(value in sequence){
  work_with_value
}
while(expression_is_true){
  execute_code
}
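A concrete, runnable instance of each structure may help. This is only a small sketch; the variable names and numbers are made up.

```r
# if-else: decide between two branches
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for: add up the numbers 1 to 5
total <- 0
for (value in 1:5) {
  total <- total + value
}

# while: find the smallest power of 2 that is at least 100
count <- 0
while (2^count < 100) {
  count <- count + 1
}

parity   # "odd"
total    # 15
count    # 7
```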
Functions
Defining functions in R is easy
# use the function keyword with assignment <-
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # the value of the last expression gets returned
  # (invisibly here, because the last expression is an assignment)
  return.me <- sum / length(input.vector)
}
my.mean(1:10)
Functions cont.
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  return.me <- sum / length(input.vector)
  # returns 1 now, because 1 is the last expression
  1
}
my.mean(1:10)
## [1] 1
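To avoid relying on the last expression, a function can also end with an explicit return(). The sketch below is a variation on the function above, not a replacement for it; the name my.mean2 and the na.rm argument with its default value are additions made up for this illustration.

```r
# na.rm is a hypothetical extra argument with a default value
my.mean2 <- function(input.vector, na.rm = FALSE) {
  if (na.rm) {
    # drop missing values before averaging
    input.vector <- input.vector[!is.na(input.vector)]
  }
  return(sum(input.vector) / length(input.vector))
}

my.mean2(1:10)                        # 5.5
my.mean2(c(1, NA, 3), na.rm = TRUE)   # 2
```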
Try It Out
Create a function my.summary which takes a vector x as input, calculates the mean, standard deviation, max, and min of x, and returns these in a list
Try out R’s internal functions mean, sd, max, min
my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}
Try It Out cont.
Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum.
Hint: Use is.numeric to test each variable before applying my.summary
for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}
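The same kind of result can be had without an explicit loop. The sketch below is one alternative using sapply; it works on the built-in iris data directly (rather than my.iris) so that it runs on its own, and it calls max directly instead of going through my.summary.

```r
# which columns are numeric?
numeric.cols <- sapply(iris, is.numeric)

# apply max to each numeric column; the result is a named vector
col.max <- sapply(iris[numeric.cols], max)
col.max
```

The explicit loop is often easier to read when first learning; sapply becomes convenient once the pattern is familiar.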