
Part I - Getting Started & Manipulating Data with R

Gilles Lamothe

February 21, 2017

Contents

1 URL for these notes and data
2 Origins of R
3 Downloading and Installing R
4 R Console and Editor
5 RStudio
6 Working Directory
7 Importing & Exporting data with R / Data structures (Vectors and Dataframes)
8 Types, Test and Coercion / sapply / Factors
9 Functions of a numerical vector (Descriptive Statistics) / Group statistics
10 Missing values / User-defined functions
11 To source an R script
12 Applying our new skills
13 Logical vectors / Table of frequencies / Convert a numerical vector into a factor
14 Operations on numerical vectors
15 Cut Functions Revisited
16 Plotting one numerical variable (Boxplot & Histogram) / Installing an R package
17 Applying our new skills - Part II
18 Saving the workspace


Intro to R Workshop - Psychology Statistics Club

1 URL for these notes and data

To download these notes, follow the link:

aix1.uottawa.ca/~glamothe/PsyStatClub

The name of the file is Rworkshop.pdf. To find the corresponding data, follow the link:

aix1.uottawa.ca/~glamothe/PsyStatClub

The data is in the folder called data.

2 Origins of R

• R was designed by Ross Ihaka (computer scientist) and Robert Gentleman (statistician) in the early 1990s at the University of Auckland (New Zealand).

• R is an implementation of the S language, which was developed at Bell Laboratories in Murray Hill, New Jersey.

• R is much more than a statistical package. It is its own computer language and environment within which statistical techniques are implemented.

• To learn more about R, follow the link:

https://www.r-project.org/about.html

• R's strength is that it allows the user to retain full control. It is highly flexible.

3 Downloading and Installing R

• To find R, go to google.ca and search for R. Follow the link for:

R: The R Project for Statistical Computing.

• Here is the URL:

www.r-project.org/

• Here is a YouTube video with instructions for installing R.

https://www.youtube.com/watch?v=7iuKrPS8fMM


4 R Console and Editor

• R is a command-based programming language. We enter commands at the prompt > in the R console.

• Here is an example, where we compute √2 + 4. We enter the command at the prompt and press ENTER.

> sqrt(2)+4
[1] 5.414214

• Alternatively, we can enter commands in an R editor window. Select File → New script in Windows or select File → New document on a Mac.

• In an editor, we select the commands that we want to send to the prompt and we press CTRL-R in Windows or CMD-Enter on a Mac.

https://www.youtube.com/watch?v=pGhjRJ9le7g

5 RStudio

• RStudio is a free and open-source integrated development environment (IDE) for R.

• RStudio is a more user-friendly interface to facilitate the use of R.

• To install RStudio, download the installer for your operating system from the following webpage: https://www.rstudio.com/products/rstudio/download/

• You will see many windows within the RStudio environment. On the left-hand side, there are:

- the console with the R prompt, which is waiting patiently for your commands.

- an R editor window, within which we can work and eventually submit our commands to the console.

6 Working Directory

We use the command getwd() to display the working directory.

We use the function setwd() to set the working directory.

Below, we set the directory to C:/Rstuff and we verify that the working directory has been properly set by displaying the current working directory.

> setwd("c:/Rstuff")
> getwd()
[1] "c:/Rstuff"


Comments:

• Windows users are used to seeing the backslash (\) as the path separator. However, even on Windows, R uses the forward slash (/) as the path separator.

• Using a working directory will save you time. However, we will see later that we can ignore the working directory by using the function file.choose(). It forces R to open a window and ask you for the location of your file.
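The working-directory commands above can be tried safely with the following minimal sketch. It assumes nothing about your own folders: tempdir() stands in for C:/Rstuff, and the names oldwd, newwd and examplePath are hypothetical, chosen only for illustration.

```r
# Remember the current working directory so it can be restored.
oldwd <- getwd()

# tempdir() always exists, so it is a safe stand-in for C:/Rstuff.
setwd(tempdir())
newwd <- getwd()

# file.path() joins path components with forward slashes on every platform.
examplePath <- file.path("c:", "Rstuff", "example.txt")
examplePath

# Restore the original working directory.
setwd(oldwd)
```

Building paths with file.path() avoids typing separators by hand, which sidesteps the backslash issue mentioned in the comments.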

7 Importing & Exporting data with R / Data structures (Vectors and Dataframes)

• R does not have a good worksheet. So you will need to use some other editing software with a worksheet, e.g. Excel or OpenOffice Calc, to save your data.

• We recommend that you save your data as a text file, preferably as tab delimited, or CSV (i.e. comma separated).

• We will be using tab delimited. This allows me to have commas in my file that are not interpreted as the start of a new column.

• R has several data structures, including the dataframe, vector, list, and matrix. For now, we will discuss the dataframe and the vector.

Dataframe:

- A dataframe is a data table.

- The rows are the statistical units.

- The columns are the variables that are used to describe the statistical units.

- Here is an example of a dataframe:

subject ID   gender   age (in years)   measured response   complient
1            Male     34               1.2                 1
2            Female   45               1.4                 1
3            Female   62               1.6                 2
4            Female   19               1.8                 2
5            Female   23               1.9                 2
6            Male     44               3.2                 2

- We saved the above table in the file example.txt. It is a tab-delimited text file.

- Assuming that the file is in our working directory (for me it is: C:/Rstuff), then we can use the following command to assign the data in the file to a dataframe that we called data.

> data<-read.table("example.txt",header=TRUE,sep="\t")
> names(data)
[1] "subject.ID"        "gender"            "age..in.years."
[4] "measured.response" "complient"


Comments:

1. <- is used for assignments. Using the read.table function, we assigned a dataframe to data. The name of the dataframe is provided by the user.

2. By default, R assumes that we will not give names to our columns. If the first row in your table has column names, we need to add the argument header=TRUE.

3. By default, read.table() separates columns on white space (spaces or tabs). Our file is tab delimited and some of its column names contain spaces, so we add the argument sep="\t". For a CSV file, use sep=","

4. We can use the function names() to display the names of the columns of a dataframe.

5. R does not permit certain symbols in the name of a column, e.g. spaces and parentheses. In all of those cases, the symbol is replaced by a dot.

6. Here we display the first three rows of the dataframe data.

> head(data,n=3)
  subject.ID gender age..in.years. measured.response complient
1          1   Male             34               1.2         1
2          2 Female             45               1.4         1
3          3 Female             62               1.6         2

7. Here we display the dimensions (# of rows & # of columns) of the dataframe data.

> dim(data)
[1] 6 5
> nrow(data)
[1] 6
> ncol(data)
[1] 5

8. Instead of giving the name of the file to import, we can use the file.choose() function. R will open a window to ask you the location of the file.

> data<-read.table(file.choose(),header=TRUE,sep="\t")
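The import steps in the comments above can be reproduced without the workshop files. The sketch below writes a tiny tab-delimited file to a temporary location (the file and its three rows are made up for illustration) and reads it back with the header and sep arguments discussed above; d1 and d2 are hypothetical names.

```r
# Write a small tab-delimited file so the example is self-contained.
# The contents are invented for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("subject ID\tgender\tage (in years)",
             "1\tMale\t34",
             "2\tFemale\t45",
             "3\tFemale\t62"), tmp)

# header=TRUE: the first row holds the column names.
# sep="\t": columns are separated by tabs.
d1 <- read.table(tmp, header = TRUE, sep = "\t")

# Since R 4.0.0 text columns are read as character vectors;
# stringsAsFactors=TRUE stores them as factors instead.
d2 <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = TRUE)

names(d1)          # spaces and parentheses have become dots
dim(d1)            # 3 rows, 3 columns
class(d1$gender)   # character
class(d2$gender)   # factor
```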

Vector:

• Think of a vector as a column in a dataframe.

• It is a homogeneous data structure. All of its elements are of the same type. Here are some common types:

integer, double (often called numeric), logical (boolean), and character. Categorical data is stored as a character vector or as a factor.

Basically, a vector is either a numerical (quantitative) variable or a categorical (qualitative) variable.

• Reference to a vector in a dataframe:

- with the column's name:

> data$gender
[1] Male   Female Female Female Female Male
Levels: Female Male

Remark: R noticed that data$gender is categorical, so it also displayed its levels. (Since R 4.0.0, read.table() keeps text columns as character vectors unless we add the argument stringsAsFactors=TRUE.)

- with the column's position:

> data[,2]
[1] Male   Female Female Female Female Male
Levels: Female Male

Remark: We use square brackets for subsetting data with R. Here [,2] refers to the second column, [2,] refers to the second row, and [2,3] refers to the 2nd row and 3rd column.

- Here we display the 2nd row in the dataframe data and, then, we ask R if it is a vector or a dataframe. A row is not a vector.

> data[2,]
  subject.ID gender age..in.years. measured.response complient
2          2 Female             45               1.4         1
> is.data.frame(data[2,])
[1] TRUE
> is.vector(data[2,])
[1] FALSE

- Here we display the 3rd column in the dataframe data and, then, we ask R if it is a vector.

> data[,3]
[1] 34 45 62 19 23 44
> is.vector(data[,3])
[1] TRUE
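As a small illustration of the subsetting rules above, the following sketch uses a made-up dataframe d (not the workshop data) to show when a column comes back as a vector and when it stays a dataframe.

```r
# A small dataframe with invented values, for illustration only.
d <- data.frame(id = 1:3, age = c(34, 45, 62))

# A single column extracted with [,j] is simplified to a vector ...
v <- d[, 2]
# ... unless drop = FALSE is given, in which case it stays a dataframe.
k <- d[, 2, drop = FALSE]

# [[ ]] (like $) also returns the column as a vector.
w <- d[["age"]]

is.vector(v)       # TRUE
is.data.frame(k)   # TRUE
identical(v, w)    # TRUE
d[2, 2]            # the value in the 2nd row and 2nd column: 45
```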

• A vector does not need to be a column of a dataframe. Just think of it as a list of elements of the same type.

Examples of vectors:

1. The column names of a dataframe form a character vector.

> names(data)
[1] "subject.ID"        "gender"            "age..in.years."
[4] "measured.response" "complient"
> is.vector(names(data))
[1] TRUE

2. We can construct vectors with the function c() (which stands for combine).

Here is a command to assign a character vector to names(data).


names(data)<-c("subject ID","gender","age","measured response","complient")

We now redisplay the names of the columns in the dataframe data.

> names(data)
[1] "subject ID"        "gender"            "age"
[4] "measured response" "complient"

Comment: If you have a space in the name of a column, you must enclose the name in backticks (or quotes) to refer to it with $.

> data$`measured response`
[1] 1.2 1.4 1.6 1.8 1.9 3.2

3. Here are a few useful ways to build vectors.

> # a numerical vector of zeros of size 6
> numeric(6)
[1] 0 0 0 0 0 0
> # build a vector by repeating a vector 4 times
> rep(c(1,2),4)
[1] 1 2 1 2 1 2 1 2
> # an empty vector
> c()
NULL
> # a numerical vector with the integers from 1 to 6
> 1:6
[1] 1 2 3 4 5 6
> # a numerical vector with the integers from 10 to 20
> 10:20
[1] 10 11 12 13 14 15 16 17 18 19 20
> # build a sequence of numbers with an increment of 0.5
> # starting at 1 and ending at 5.
> seq(1,5,by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Remark: We use the symbol # to write comments that R does not interpret.

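• We can easily add a vector to a dataframe. Here degree holds one value per subject (the values are those of the degree column displayed later in these notes).

> degree<-factor(c("yes","no","no","yes","yes","yes"))
> data<-data.frame(data,degree)
> names(data)
[1] "subject.ID"        "gender"            "age"
[4] "measured.response" "complient"         "degree"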

We end this section by exporting a dataframe with the function write.table(). It will save the file in the working directory, unless you give it a path.

write.table(data, file = "SavedExample.txt", sep = "\t",
  row.names = FALSE, col.names = TRUE)


Comments:

• With the above command, we saved the dataframe data in the file 'SavedExample.txt' in the working directory.

• R might modify the names of the columns, since some characters are not permitted, e.g. spaces.
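The remark about modified column names can be checked with a round trip. This is a minimal sketch with an invented two-row dataframe and a temporary file; check.names=FALSE is a standard read.table() argument that keeps the names exactly as stored in the file.

```r
# A tiny dataframe whose column names contain spaces (invented values).
d <- data.frame(id = 1:2, score = c(1.2, 1.4))
names(d) <- c("subject ID", "measured response")

# Export to a temporary tab-delimited file.
out <- tempfile(fileext = ".txt")
write.table(d, file = out, sep = "\t", row.names = FALSE, col.names = TRUE)

# Default import: the spaces in the names are replaced by dots.
backDefault <- read.table(out, header = TRUE, sep = "\t")

# check.names = FALSE keeps the names exactly as written in the file.
backExact <- read.table(out, header = TRUE, sep = "\t", check.names = FALSE)

names(backDefault)
names(backExact)
```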

8 Types, Test and Coercion / sapply / Factors

• Here are a few test functions: is.character(), is.logical(), is.numeric(), is.factor().

• Here are a few coercion functions: as.character(), as.logical(), as.numeric(), as.factor(), factor().

• Here is an example of testing whether a column is of type factor and then coercing it to type factor.

> is.factor(data$complient)
[1] FALSE
> is.numeric(data$complient)
[1] TRUE
> data$complient<-factor(data$complient)
> is.factor(data$complient)
[1] TRUE
> is.numeric(data$complient)
[1] FALSE

Factors: Factors are categorical variables.

• They have levels, i.e. the categories.

> levels(data$complient)
[1] "1" "2"

• Here is a display of the vector:

> data$complient
[1] 1 1 2 2 2 2
Levels: 1 2

• We can change the labels of the levels with the function factor by using a labels argument. The labels must be a vector of the same length as the levels.

> data$complient<-factor(data$complient,labels=c("Yes","No"))
> data$complient
[1] Yes Yes No  No  No  No
Levels: Yes No
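Labels are matched to levels by position, so it can be safer to state the levels explicitly. The sketch below is one way to do that, assuming (as in the example above) that code 1 means "Yes" and code 2 means "No"; the vector complient is re-created here so the block runs on its own.

```r
# The coded values from the example dataframe.
complient <- c(1, 1, 2, 2, 2, 2)

# Giving levels and labels together makes the mapping explicit:
# level 1 is labelled "Yes" and level 2 is labelled "No".
f <- factor(complient, levels = c(1, 2), labels = c("Yes", "No"))

levels(f)         # "Yes" "No"
as.character(f)   # "Yes" "Yes" "No" "No" "No" "No"
table(f)          # 2 Yes, 4 No
```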


sapply:

• It is good practice to verify the type of all the columns in a dataframe. Verifying the type one column at a time can be time consuming.

• We can use the function sapply to apply the function is.factor to all the columns of a dataframe. It returns a logical vector.

> sapply(data,is.factor)
       subject.ID            gender               age
            FALSE              TRUE             FALSE
measured.response         complient            degree
            FALSE              TRUE              TRUE

• We identify which columns are numerical.

> sapply(data,is.numeric)
       subject.ID            gender               age
             TRUE             FALSE              TRUE
measured.response         complient            degree
             TRUE             FALSE             FALSE
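Instead of asking one yes/no question per type, sapply can report the class of every column in a single call. A minimal sketch on an invented dataframe d:

```r
# An invented dataframe with three column types.
d <- data.frame(id = 1:3,
                gender = factor(c("Male", "Female", "Female")),
                age = c(34, 45, 62))

# class() is applied to each column; the result is a named character vector.
types <- sapply(d, class)
types

# The names of the numerical columns only.
names(d)[sapply(d, is.numeric)]
```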

9 Functions of a numerical vector (Descriptive Statistics) / Group statistics

Let x be a numerical vector. For example, consider the following assignment.

x<-c(16,15,14,16,13,12,14,13,10)

By using the command summary(x), we obtain some descriptive statistics for x.

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   13.00   14.00   13.67   15.00   16.00

Here are some descriptives for the age of the subjects in the dataframe data.

> summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  19.00   25.75   39.00   37.83   44.75   62.00

Here is a list of some common statistics.

sum(x)      # sum of the components
mean(x)     # mean
var(x)      # variance
sd(x)       # standard deviation
min(x)      # min
max(x)      # max
median(x)   # median
quantile(x) # 5-number summary (min,q1,median,q3,max)
length(x)   # number of components
sort(x)     # arrange values in ascending order
rank(x)     # rank the values from 1 to length(x)

Let us use the sapply function to get the mean for each column in the dataframe data.

> sapply(data,mean)
       subject.ID            gender               age
          3.50000                NA          37.83333
measured.response         complient            degree
          1.85000                NA                NA
Warning messages:
1: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

Comments:

� R is warning us that it cannot compute the mean of a non-numerical vector.

• If we know that a certain command will display warnings, we can suppress warnings before using the command. It is good practice to turn the warnings back on afterwards.

> # suppress warnings
> options(warn=-1)
> sapply(data,mean)
       subject.ID            gender               age
          3.50000                NA          37.83333
measured.response         complient            degree
          1.85000                NA                NA
> # put warnings back on
> options(warn=0)

• Recall that we use square brackets for subsetting. We could extract the numerical columns and only compute the mean for each of these columns.

> numericCol<-sapply(data,is.numeric)
> sapply(data[,numericCol],mean)
       subject.ID               age measured.response
          3.50000          37.83333           1.85000
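An alternative to switching the global warn option off and on is the base function suppressWarnings(), which silences warnings for a single expression only. A minimal sketch on an invented dataframe d:

```r
# An invented dataframe with one numerical and one categorical column.
d <- data.frame(age = c(34, 45, 62),
                gender = factor(c("Male", "Female", "Female")))

# Warnings are silenced for this one call; no global option needs restoring.
m <- suppressWarnings(sapply(d, mean))
m   # age is 47, gender is NA
```

Because nothing global is changed, there is no risk of forgetting to turn the warnings back on.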

Group Statistics:


• We can use the aggregate function to get group statistics for a numerical vector y according to the levels of another variable x in a dataframe data. fun is the descriptive statistic that you want to compute. Its usage is

aggregate(y~x,data,fun)

• Here is the mean age of the subjects in the dataframe data according to the levels of gender.

> Mean<-aggregate(age~gender,data,mean)
> Mean
  gender   age
1 Female 37.25
2   Male 39.00

• Here we build a dataframe with a few descriptive statistics for the age of the subjects in the dataframe data according to the levels of gender.

> Mean<-aggregate(age~gender,data,mean)
> SD<-aggregate(age~gender,data,sd)
> n<-aggregate(age~gender,data,length)
> AgeStats<-data.frame(Mean,SD[,2],n[,2])
> names(AgeStats)<-c("Gender","Mean","Std dev","n")
> AgeStats
  Gender  Mean   Std dev n
1 Female 37.25 20.072784 4
2   Male 39.00  7.071068 2
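The three aggregate calls above can also be combined into one, by passing a function that returns several statistics at once. This is a sketch of that variation, not the method used in the notes; the dataframe d re-creates the gender and age columns of the example so that the block is self-contained.

```r
# Gender and age of the six subjects from the example table.
d <- data.frame(gender = c("Male", "Female", "Female", "Female", "Female", "Male"),
                age = c(34, 45, 62, 19, 23, 44))

# One call: the function returns a named vector of three statistics.
stats <- aggregate(age ~ gender, data = d,
                   FUN = function(v) c(Mean = mean(v), SD = sd(v), n = length(v)))

# aggregate() stores the three statistics as a matrix column;
# data.frame() spreads that matrix into ordinary columns.
AgeStats <- data.frame(Gender = stats$gender, stats$age)
AgeStats
```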

Remark: The discussion in this section assumes that there are no missing values.

> mean(c(NA,0,1))
[1] NA

Missing values can cause some problems. R could not compute the mean of the vector, since there was a missing value.

10 Missing values / User-defined functions

Consider the data in the file OmitExample.txt. We import the data and assign it to the dataframe data1.

> data1<-read.table("OmitExample.txt",header=TRUE,sep="\t")
> data1
  subject.ID gender age..in.years. measured.response complient degree
1          1   Male             34               1.2         1    yes
2          2 Female             45                NA         1     no
3          3 Female             62               1.6         2     no
4          4 Female             NA               1.8         2    yes
5          5 Female             23               1.9         2    yes
6          6   Male             44               3.2         2    yes

There is a missing value in the third column and also in the fourth column. So we will have difficulties computing the mean for each of these columns.

> numericCol<-sapply(data1,is.numeric)
> sapply(data1[,numericCol],mean)
       subject.ID    age..in.years. measured.response         complient
         3.500000                NA                NA          1.666667

We can omit the missing values with the function na.omit. However, it will delete all rows with a missing value.

> na.omit(data1)
  subject.ID gender age..in.years. measured.response complient degree
1          1   Male             34               1.2         1    yes
3          3 Female             62               1.6         2     no
5          5 Female             23               1.9         2    yes
6          6   Male             44               3.2         2    yes

User-defined function: Another solution is to build our own function to compute the mean of a vector after omitting missing values in this vector.

mean.na<-function(x){
  x<-x[!is.na(x)] # keep value if not NA
  return(mean(x))
}

We use our user-defined function to compute the mean of the numerical vectors in the dataframe data1.

> sapply(data1[,numericCol],mean.na)
       subject.ID    age..in.years. measured.response         complient
         3.500000         41.600000          1.940000          1.666667
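Base R also offers a built-in route to the same result: mean(), sd() and several other summary functions accept the argument na.rm=TRUE. The sketch below re-creates the two columns with missing values from data1 so that it runs on its own.

```r
# The age and measured response columns of data1, including the NA values.
age <- c(34, 45, 62, NA, 23, 44)
response <- c(1.2, NA, 1.6, 1.8, 1.9, 3.2)

mean(age)                 # NA, because of the missing value
mean(age, na.rm = TRUE)   # 41.6, missing value removed first

# Extra arguments given to sapply() are passed on to the function.
d <- data.frame(age = age, response = response)
means <- sapply(d, mean, na.rm = TRUE)
means
```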

R does not have a sample size function. We can use the function length, but it gives the number of components in the vector (including missing values).

> length(data1$age..in.years.)
[1] 6
> data1$age..in.years.
[1] 34 45 62 NA 23 44

We define functions to count the number of missing values in a vector and to count the number of non-missing values in a vector.

numNA<-function(x){
  return(length(x[is.na(x)]))
}

SampleSize<-function(x){
  return(length(x)-numNA(x))
}

We use our functions on the vector age..in.years. in the dataframe data1.

> SampleSize(data1$age..in.years.)
[1] 5
> numNA(data1$age..in.years.)
[1] 1
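Since TRUE counts as 1 and FALSE as 0, the same counts can be obtained by summing a logical vector. A short sketch, using the age values shown above; the names nMissing, nObserved and nComplete are hypothetical.

```r
age <- c(34, 45, 62, NA, 23, 44)

nMissing <- sum(is.na(age))     # number of missing values: 1
nObserved <- sum(!is.na(age))   # number of non-missing values: 5

# complete.cases() flags the rows of a dataframe that have no missing value.
d <- data.frame(age = age, response = c(1.2, NA, 1.6, 1.8, 1.9, 3.2))
nComplete <- sum(complete.cases(d))   # 4 complete rows
```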

11 To source an R script

We will want to save our user-defined R functions.

Steps to save your functions:

1. In the RStudio menu, select File → New File → R Script.

2. In the new editor window, enter your functions.

3. Save the current editor window. I saved it as myFunctions.R.

Two ways to access your functions:

1. In the RStudio menu, select File → Open File. Browse for myFunctions.R and run its commands.

2. Alternatively, we can source the file. R will run all the commands in the sourced file. I am assuming that myFunctions.R is in the working directory.

> source("myFunctions.R")
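To see source() at work without creating a file by hand, the following sketch writes a one-function script to the temporary directory and then sources it. The location (tempdir()) is an assumption made for the example; in practice the script would sit in your working directory.

```r
# Create a small script file containing the mean.na function.
scriptFile <- file.path(tempdir(), "myFunctions.R")
writeLines(c("mean.na <- function(x) {",
             "  x <- x[!is.na(x)]  # keep value if not NA",
             "  mean(x)",
             "}"), scriptFile)

# source() runs every command in the file, so mean.na now exists.
source(scriptFile)
mean.na(c(NA, 0, 1))   # 0.5
```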


12 Applying our new skills

Exercises to try in class.

1. Consider a company that produces items made from glass. The data is in the file glassworks.txt. (Source: Applied Linear Models, by Kutner et al.).

A large company is studying the effects of the length of special training for new employees on the quality of work. Employees are randomly assigned to have either 6, 8, 10, or 12 hours of training. After the special training, each employee is given the same amount of material. The response variable is the number of acceptable pieces.

(a) Compute the number of subjects per group. Is it a balanced study (i.e. same # of observations per group)?

(b) Compute the mean response and the standard deviation of the response for each group.

2. Consider the data in the file FinalGrades.txt.

(a) For each assignment, compute the mean and the standard deviation.

(b) For the final exam, compute the mean according to the levels of the faculty.


13 Logical vectors / Table of frequencies / Convert a numerical vector into a factor

It is easy to compute the proportion of statistical units in a dataframe that satisfy a certain condition. We use logical operators.

Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE

What proportion of the subjects in data are over 60?

> data$age>60
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

R considers FALSE as 0 and TRUE as 1. So by computing the mean, we get the proportion of TRUE values:

> mean(data$age>60)
[1] 0.1666667

So 16.7% of the subjects are over 60.

What proportion of the subjects in data are male?

> data$gender=="Male"
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
> mean(data$gender=="Male")
[1] 0.3333333

So 33.3% of the subjects are male.

What proportion of the subjects in data are female and at most 30?

> (data$gender=="Female")&(data$age<=30)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE
> mean((data$gender=="Female")&(data$age<=30))
[1] 0.3333333

So 33.3% of the subjects are female and at most 30.
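The same idea gives counts and positions as well as proportions. This sketch re-creates the age and gender columns of the example and uses sum() and which() on logical vectors; the cut-off of 40 is chosen only for illustration.

```r
age <- c(34, 45, 62, 19, 23, 44)
gender <- c("Male", "Female", "Female", "Female", "Female", "Male")

over40 <- age > 40
sum(over40)     # how many subjects are over 40: 3
mean(over40)    # what proportion: 0.5
which(over40)   # which positions: 2 3 6

# | is OR: subjects who are female or over 40 (or both).
sum(gender == "Female" | age > 40)   # 5
```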

Table of frequencies: We can use the function table to obtain a table of frequencies for a categorical vector.

> table(data$gender)

Female   Male
     4      2

> # expressed as a percentage
> table(data$gender)/length(data$gender)*100

  Female     Male
66.66667 33.33333

Cut a numeric vector into intervals: We can use the function cut to break a numerical vector into intervals. It has an argument called breaks. It can be a number for the number of intervals or a numeric vector for the values of the break points.

> cut(data$age,breaks=3)
[1] (33.3,47.7] (33.3,47.7] (47.7,62]   (19,33.3]   (19,33.3]
[6] (33.3,47.7]
Levels: (19,33.3] (33.3,47.7] (47.7,62]
> cut(data$age,breaks=c(19,30,40,65))
[1] (30,40] (40,65] (40,65] <NA>    (19,30] (40,65]
Levels: (19,30] (30,40] (40,65]
> ageCat<-cut(data$age,breaks=c(19,30,40,65))

Here we have a two-way contingency table for the joint distribution of gender and age.

> table(data$gender,ageCat)
        ageCat
         (19,30] (30,40] (40,65]
  Female       1       0       2
  Male         0       1       1
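In the cut example above, the subject aged 19 is assigned NA because the intervals are open on the left, so 19 itself does not belong to (19,30]. That subject is therefore also missing from the contingency table. One possible remedy, sketched below with the same ages, is the include.lowest argument of cut.

```r
age <- c(34, 45, 62, 19, 23, 44)

# Default: intervals of the form (a,b], so the lowest break (19) is excluded.
a <- cut(age, breaks = c(19, 30, 40, 65))
is.na(a[4])   # TRUE: the subject aged 19 falls in no interval

# include.lowest = TRUE closes the first interval on the left: [19,30].
b <- cut(age, breaks = c(19, 30, 40, 65), include.lowest = TRUE)
levels(b)     # "[19,30]" "(30,40]" "(40,65]"
table(b)      # 2, 1 and 3 subjects
```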

Logical vector for subsetting: We can use a logical vector to obtain a subset of a dataframe.

Suppose that we only want to retain the female subjects.

> dataFemale <- data[data$gender=="Female",]
> dataFemale
  subject.ID gender age measured.response complient degree
2          2 Female  45               1.4       Yes     no
3          3 Female  62               1.6        No     no
4          4 Female  19               1.8        No    yes
5          5 Female  23               1.9        No    yes

Comment: [data$gender=="Female",] means that we want to retain the rows that correspond to females, but keep all of the columns.
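Conditions can be combined inside the brackets, and a column can be selected at the same time. The sketch below uses an invented dataframe d with the gender and age values of the example; subset() is a base R convenience function that gives the same rows.

```r
d <- data.frame(gender = c("Male", "Female", "Female", "Female", "Female", "Male"),
                age = c(34, 45, 62, 19, 23, 44))

# Rows: female AND at most 30. Columns: all of them.
youngFemale <- d[d$gender == "Female" & d$age <= 30, ]

# subset() evaluates the condition inside the dataframe, so d$ is not needed.
sameRows <- subset(d, gender == "Female" & age <= 30)

# Rows: females. Columns: age only (returned as a vector).
agesOnly <- d[d$gender == "Female", "age"]

youngFemale
agesOnly
```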


14 Operations on numerical vectors

R has a few arithmetic operators.

Operator   Description
+          addition
-          subtraction
*          multiplication
/          division
^ or **    exponentiation
x %% y     modulus (x mod y); 5 %% 2 is 1
x %/% y    integer division; 5 %/% 2 is 2

Operations are done component-wise. Here are a few examples.

> x<-c(1,2,3)
> y<-c(-1,0,1)
> x+y
[1] 0 2 4
> x-y
[1] 2 2 2
> x*y
[1] -1  0  3
> x/y
[1]  -1 Inf   3
> 2+x
[1] 3 4 5
> 2*x
[1] 2 4 6
> x/2
[1] 0.5 1.0 1.5
> x^2
[1] 1 4 9

Here are a few other functions that might be useful.

> sqrt(x) ## square root
[1] 1.000000 1.414214 1.732051
> log(x) ## natural logarithm
[1] 0.0000000 0.6931472 1.0986123
> exp(x) ## exponential of base e
[1]  2.718282  7.389056 20.085537
> w<-c(1.2,4.5,-7.8)
> ceiling(w) ## round up
[1]  2  5 -7
> floor(w) ## round down
[1]  1  4 -8
> round(w) ## round
[1]  1  4 -8
> abs(w) ## absolute value
[1] 1.2 4.5 7.8

We can apply arithmetic operations in the presence of missing values. Wherever a component is missing, the result will be NA.

> z<-c(NA,1,1)
> x+z
[1] NA  3  4

Grades Example: We import the data from the file SmallClass.txt and display the names of the columns.

> grades<-read.table("SmallClass.txt",header=TRUE,sep="\t")
> names(grades)
[1] "Student.ID"                  "Assignment1..Total.Pts..20."
[3] "Assignment2..Total.Pts..30." "Test...Total.Pts..12."
[5] "Assignment3..Total.Pts..50." "Assignment4..Total.Pts..20."
[7] "FinalExam..Total.Pts..100."

We will build a dataframe just for the assignments.

> assignments<-grades[,c(2,3,5,6)]
> names(assignments)
[1] "Assignment1..Total.Pts..20." "Assignment2..Total.Pts..30."
[3] "Assignment3..Total.Pts..50." "Assignment4..Total.Pts..20."

We will convert each assignment into a percentage.

> assignments<-grades[,c(2,3,5,6)]
> assignments[,1]<-assignments[,1]/20*100
> assignments[,2]<-assignments[,2]/30*100
> assignments[,3]<-assignments[,3]/50*100
> assignments[,4]<-assignments[,4]/20*100
> names(assignments)<-c("hw1","hw2","hw3","hw4")
> sapply(assignments,mean)
 hw1  hw2  hw3  hw4
95.0 87.6 94.8 89.0

Remark: There are no NA in the means. This means that there are no missing assignments.
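The four conversion lines above repeat the same operation with a different total each time. As a possible shortcut, the base function sweep() divides every column by its own total in one step. The marks in raw are invented for this sketch; only the totals (20, 30, 50, 20) come from the example.

```r
# Invented raw marks for two students on the four assignments.
raw <- data.frame(hw1 = c(20, 18), hw2 = c(30, 27),
                  hw3 = c(48, 45), hw4 = c(19, 16))

# Total points for each assignment, in column order.
totals <- c(20, 30, 50, 20)

# sweep(x, 2, totals, "/") divides column j of x by totals[j].
pct <- sweep(raw, 2, totals, "/") * 100
pct
```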

Say that we would like to keep the best 3 of 4 assignments. We will build a function that does the following:

• For each student (i.e. each statistical unit), we construct a numeric vector of the four marks and convert any NA to a zero.

• We sort the four assignments and keep the best 3 of the four.

• We add the row of the best 3 of 4 assignments to a dataframe and we compute the average of the three marks. We use the function rbind to add a row to a dataframe.

• The result of the function will be a list containing a dataframe and a vector.

BestOf <- function(x,n){
  nStudents<-nrow(x)
  nAssignments<-ncol(x)

  # empty dataframe
  BestMarks<-data.frame(NULL)

  # initialize a vector for the averages
  Avg<-numeric(nStudents)

  for (i in 1:nStudents){
    # marks of student i; a missing assignment counts as zero
    marks<-as.numeric(x[i,])
    marks[is.na(marks)]<-0
    # keep the n highest marks
    best<-sort(marks,decreasing=TRUE)[1:n]
    Avg[i]<-mean(best)
    BestMarks<-rbind(BestMarks,best)
  }
  names(BestMarks)<-paste0("Best",1:n)

  return(list(BestMarks=BestMarks,Mean=Avg))
}

We are ready to get the best 3 assignments for each student.

> BestOf(assignments,3)
$BestMarks
  Best1 Best2 Best3
1   100   100    96
2    92    90    87
3    92    89    88
4   100    95    92
5   100    98    95

$Mean
[1] 98.66667 89.66667 89.66667 95.66667 97.66667
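If only the averages are needed, the loop can be replaced by apply(), which calls a function on every row. This is a sketch of that alternative, not the function used in the notes; the helper bestMean and the marks in marks are hypothetical.

```r
# Invented marks (in percent) for two students; one assignment is missing.
marks <- data.frame(hw1 = c(100, 92), hw2 = c(96, NA),
                    hw3 = c(100, 90), hw4 = c(80, 87))

# Mean of the n highest marks, counting a missing mark as zero.
bestMean <- function(v, n) {
  v[is.na(v)] <- 0
  mean(sort(v, decreasing = TRUE)[1:n])
}

# MARGIN = 1 applies the function to each row (each student).
avg <- apply(marks, 1, bestMean, n = 3)
avg
```

Note that missing values are detected with is.na(); a comparison such as v==NA always returns NA and cannot be used for this purpose.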

We will add these means to the original dataframe.

> grades<-data.frame(grades,BestOf(assignments,3)$Mean)

> names(grades)
[1] "Student.ID"                  "Assignment1..Total.Pts..20."
[3] "Assignment2..Total.Pts..30." "Test...Total.Pts..12."
[5] "Assignment3..Total.Pts..50." "Assignment4..Total.Pts..20."
[7] "FinalExam..Total.Pts..100."  "BestOf.assignments..3..Mean"

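We are ready to compute the final grades with the following scheme.

Best 3 of 4 assignments   20%
Max(Test,Final)           25%
Final Exam                55%

If the student did not write the final exam, it should remain an NA.

> test<-grades[,4]/12*100
> test[is.na(test)]<-0
> exam<-grades[,7]
> exam
[1] 53.50 60.50 51.50 78.75 70.25
> hw<-grades[,8]
> FinalGrade<-.2*hw+.25*max(exam,test)+.55*exam
> FinalGrade
[1] 72.0750 74.1250 69.1750 85.3625 81.0875

Remark: max(exam,test) returns a single number, the largest value found in either vector (here 91.67, a test mark), so every student above receives the same Max(Test,Final) component. To take the larger of the two marks student by student, use pmax(exam,test) instead; the grades displayed in these notes were computed with max().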
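The difference between max() and pmax() is easiest to see on a small case. The marks below are invented for illustration; the third student has no final exam mark.

```r
exam <- c(53.5, 60.5, NA)
test <- c(75, 50, 80)

# max() pools all values and returns one number.
max(exam, test, na.rm = TRUE)   # 80

# pmax() compares the vectors position by position.
# A missing exam mark stays NA, as required by the grading scheme.
pmax(exam, test)                # 75.0 60.5 NA
```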

15 Cut Functions Revisited

We are now ready to compute the alpha grades, i.e. convert a numeric vector into a factor. We use the cut function as follows. The labels must be listed in the same order as the intervals, from the lowest to the highest.

## letter grades:
##90-100=A+; 85-89=A; 80-84=A-;
##75-79=B; 70-74=B-;
##65-69=C; 60-64=C-;
##55-59=D; 50-54=D-;
##40-49=E; 0-39=F;
letters=c("F","E","D-","D","C-","C","B-","B","A-","A","A+")
# right-hand limits are not included
alpha.grade<-cut(FinalGrade,c(0,40,50,55,60,65,70,75,80,85,90,Inf),
  right=FALSE,labels=letters)

Here are the alpha grades for the 5 students and a corresponding table of frequency.

> alpha.grade
[1] B- B- C  A  A-
Levels: F E D- D C- C B- B A- A A+
> table(alpha.grade)
alpha.grade
 F  E D-  D C-  C B-  B A-  A A+
 0  0  0  0  0  1  2  0  1  1  0
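To check that the breaks and labels line up with the grading scale in the comment lines, it helps to run cut on a few boundary values. The marks below are chosen only for this check, and the names gradeBreaks and gradeLabels are hypothetical (a separate name also avoids masking R's built-in constant letters).

```r
gradeBreaks <- c(0, 40, 50, 55, 60, 65, 70, 75, 80, 85, 90, Inf)
gradeLabels <- c("F", "E", "D-", "D", "C-", "C", "B-", "B", "A-", "A", "A+")

# Boundary values: with right = FALSE each interval is [a,b),
# so a mark of exactly 80 falls in [80,85) and earns an A-.
marks <- c(39.9, 40, 79.99, 80, 85, 90, 100)
alpha <- cut(marks, gradeBreaks, right = FALSE, labels = gradeLabels)
as.character(alpha)
# "F" "E" "B" "A-" "A" "A+" "A+"
```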


16 Plotting one numerical variable (Boxplot & Histogram) / Installing an R package

To produce a boxplot of x and a histogram of x, we use boxplot(x) and hist(x), respectively.

> par(mfrow=c(1,2)) # graphics window of 1 row, 2 columns
> hist(FinalGrade)
> boxplot(FinalGrade)

The result is a histogram and a boxplot of FinalGrade, displayed side by side.

Installing an R package: R has a very large community of contributors who are developing packages that are freely available through CRAN (The Comprehensive R Archive Network).

We will install the package packHV with the following command:

install.packages("packHV")

Comments:

• You only need to install the package once onto your computer.

• If you do not have administrator privileges on the computer, then you might not be able to install a package in the default location for packages. If that is the case, then create a new directory on the computer, say C:/myPackages. You should be able to install in this location: install.packages("packHV",lib="C:/myPackages")

• To use a package during a session, we must load it by using the function library. For example:

> library(packHV)
> # you can also give the location of the package on your computer
> library(packHV,lib.loc="C:/myPackages")

We can now use the function hist_boxplot from the package packHV.

par(mfrow=c(1,1)) # graphics window of 1 row, 1 column
hist_boxplot(FinalGrade,col="lightblue",freq=TRUE,ylab="Frequency",
  xlab="Final Grade (in percentage)",main="Distribution of the Final Grades",
  cex.lab=1.15)
# increase the font size on both axes
axis(2,cex.axis=1.05)
axis(1,cex.axis=1.15)

Now this is where R shines. We can easily modify a graph. For example, we can add text to the plot with the function text.

text(85,1.7,"mean = 76, sd = 6.7, n = 5",cex=1.15)

The result is the combined histogram and boxplot of the final grades, with the added text.

17 Applying our new skills - Part II

Exercises to try in class.

1. Consider a company that produces items made from glass. The data is in the file glassworks.txt. (Source: Applied Linear Models, by Kutner et al.).

A large company is studying the effects of the length of special training for new employees on the quality of work. Employees are randomly assigned to have either 6, 8, 10, or 12 hours of training. After the special training, each employee is given the same amount of material. The response variable is the number of acceptable pieces.

(a) Compute comparative boxplots to describe the number of acceptable pieces according to the amount of training. Hint: try boxplot(y~x,data)

2. Consider the data in the file FinalGrades.txt.

(a) Compute the final grades with the following distribution. Convert the final grades to alpha grades and produce a frequency table for the alpha grades.

Best 2 of 4 assignments   15%
Max(Test,Final)           15%
Final Exam                60%

(b) Produce a plot to describe the distribution of the final grades.

(c) What is the failure rate? That is, what proportion of the students have a final grade lower than 50%?


18 Saving the workspace

You can save objects to retrieve in the future. Here we saved the objects Cereal and methadone.

> # save specific objects to a file
> # if you don't specify the path, the cwd is assumed
> save(Cereal,methadone,file="myfile.RData")

To retrieve the saved objects, use the function load.

# load a workspace into the current session
# if you don't specify the path, the cwd is assumed
load("myfile.RData")
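The save and load steps can be tried end to end with two throwaway objects. In this sketch x and y are hypothetical stand-ins for Cereal and methadone, and the file is written to the temporary directory rather than the working directory.

```r
# Two small objects to save (stand-ins for Cereal and methadone).
x <- c(16, 15, 14)
y <- data.frame(id = 1:2)

f <- file.path(tempdir(), "myfile.RData")
save(x, y, file = f)

# Remove the objects, then bring them back from the file.
rm(x, y)
loaded <- load(f)   # load() returns the names of the restored objects
loaded              # "x" "y"
x                   # 16 15 14
```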