Session 3: More features of R and the Central Limit Theorem
Class web site: http://ludwig-sun2.unil.ch/~darlene/
Statistics for Microarray Data Analysis with R
Today’s Outline
• Further features of the R language
• Preliminary data analysis exercise
• Central Limit Theorem (CLT)
• CLT exercise
• Some material included here was adapted from materials available at http://www.bioconductor.org/ and is used by permission
R: factors
• Categorical variables in R should be specified as factors
• Factors can take on a limited number of values, called levels
• Levels of a factor may have a natural order
• Functions in R for creating factors: factor(), ordered()
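A minimal sketch of the two functions named above. The variable names and values (`localisation`, `grade`) are made up for illustration and are not part of the course data.

```r
# Unordered factor: tumor localisation (illustrative values)
localisation <- factor(c("proximal", "distal", "proximal"))
levels(localisation)   # levels are sorted alphabetically by default
table(localisation)    # counts per level

# Ordered factor: the levels have a natural order (hypothetical grading)
grade <- ordered(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"))
grade < "high"         # comparisons respect the level order
```

Passing `levels` explicitly to `ordered()` matters here: without it the order would be alphabetical rather than low, medium, high.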
R: data frames (review)
• data frame: the type of R object normally used to store a data set
• A data frame is a rectangular table with rows and columns
– data within each column has the same type (e.g. number, character, logical)
– different columns may have different types
• Example:
> tumor.info
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE
R: making data frames
• Data frames can be created in R by importing a data set
• A data frame can also be created from pre-existing variables
• Example:
> localisation<-c("proximal","distal","proximal")
> tumorsize<-c(6.3,8,10)
> progress<-c(FALSE,TRUE,FALSE)
> tumor.info<-data.frame(localisation,tumorsize,progress)
> rownames(tumor.info)<-c("XX348","XX234","XX987")
> tumor.info$tumorsize
[1]  6.3  8.0 10.0
R: more on subsetting
• Subset rows by a vector of indices:
> tumor.info[c(1,3),]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE
• Subset rows by a logical vector:
> tumor.info[c(TRUE,FALSE,TRUE),]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE
• Subset a column:
> tumor.info$localisation
[1] "proximal" "distal"   "proximal"
• Comparison resulting in a logical vector:
> tumor.info$localisation=="proximal"
[1]  TRUE FALSE  TRUE
• Subset the selected rows:
> tumor.info[tumor.info$localisation=="proximal",]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE
R: loops
• Loops are used when the same or similar tasks need to be performed multiple times in an iterative fashion
• Examples:
> for(i in 1:10) {
    print(i*i)
  }
> i = 1
> while(i<=10) {
    print(i*i)
    i=i+sqrt(i)
  }
• Explicit loops such as these should be avoided where possible
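One common way to avoid an explicit loop is to rely on R's vectorized arithmetic. The sketch below computes the squares of 1 to 10 both ways; the variable names `squares.loop` and `squares.vec` are illustrative.

```r
# Explicit loop: fill a pre-allocated vector with the squares of 1..10
squares.loop <- numeric(10)
for (i in 1:10) {
  squares.loop[i] <- i * i
}

# Vectorized equivalent: arithmetic operates on the whole vector at once
squares.vec <- (1:10)^2

all(squares.loop == squares.vec)   # both approaches give the same values
```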
R: lapply, sapply
• Use these when the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array
• These implicit loops are usually more compact than explicit ‘for’ loops and are often faster
• lapply(the.list,the.function)
– the.function is applied to each element of the.list
– result is a list whose elements are the individual results for the.function
• sapply(the.list,the.function)
– Like lapply, but tries to simplify the result, by converting it into a vector or array of appropriate size
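A short sketch contrasting the two functions on the same input. The list `the.list` and its values are made up for illustration.

```r
# A small list of numeric vectors (made-up values)
the.list <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

res.l <- lapply(the.list, mean)  # a list with one element per input element
res.s <- sapply(the.list, mean)  # simplified to a named numeric vector

str(res.l)
res.s
```

Because every call to `mean` returns a single number, `sapply` can collapse the results into a vector; when the individual results have different lengths it returns a list, just as `lapply` does.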
R: apply
• apply(array, margin, the.function)
– applies the.function along the dimension of array specified by margin
– result is a vector or matrix of the appropriate size
• Example:
> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5
> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
R: sweep and scale
• sweep(...) removes a statistic from dimensions of an array
• Example: Subtract column medians
> col.med<-apply(my.data,2,median)
> sweep(my.data,2,col.med)
• scale(...) centers and/or rescales columns of a matrix
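The following sketch runs both functions on a small matrix. Here `my.data` is an illustrative stand-in (the values of the matrix `x` from the apply example), not a data set from the course.

```r
# Illustrative matrix, filled column by column
my.data <- matrix(c(5, 7, 4, 6,
                    7, 9, 6, 3,
                    0, 8, 7, 5), nrow = 4)

col.med  <- apply(my.data, 2, median)
centered <- sweep(my.data, 2, col.med)   # default operation is subtraction
apply(centered, 2, median)               # each column median is now 0

# scale() with its defaults subtracts column means and divides by column SDs
scaled <- scale(my.data)
colMeans(scaled)                         # approximately 0
apply(scaled, 2, sd)                     # 1 for each column
```

`sweep` is the more general tool, since any statistic and any operation (via its `FUN` argument) can be used; `scale` covers the common case of centering and standardizing columns.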
R: importing and exporting data (review)
• Many ways to get data into and out of R
• One straightforward way is to use tab-delimited text files (e.g. save an Excel sheet as tab-delimited text, for easy import into R)
• Useful R functions: read.delim(), read.table(), read.csv(), write.table()
• Example:
> x = read.delim("filename.txt")
> write.table(x, file="x.txt", sep="\t")
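A self-contained round trip can be sketched as follows. The data frame and its column names (`gene`, `intensity`) are hypothetical, and a temporary file is used so that no real file is overwritten.

```r
# Small illustrative data frame
x <- data.frame(gene = c("g1", "g2", "g3"), intensity = c(6.3, 8, 10))

# Write to a temporary tab-delimited file
path <- tempfile(fileext = ".txt")
write.table(x, file = path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; read.delim() assumes a header line and tab separators
y <- read.delim(path)
y
```

Setting `row.names = FALSE` keeps the file in the plain column layout that a spreadsheet program would also produce.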
R: introduction to object orientation
• Primitive (or atomic) data types in R are:
– numeric (integer, double, complex)
– character
– logical
– function
• From these, vectors, arrays, lists can be built
• An object is an abstract term for anything that can be assigned to a variable
• Components of objects are called slots
• Example: a microarray experiment
– probe intensities
– patient data (tissue location, diagnosis, follow-up)
– gene data (sequence, IDs, annotation)
R: classes and generic functions
• Object-oriented programming aims to create coherent data systems and methods that work on them
• In general, there is a class of data objects and a (print, plot, etc.) method for that class
• Generic functions, such as print, act differently depending on the class of their argument
• This means that we don’t need to worry about a lot of the programming details
• In R, an object has a (character vector) class attribute which determines the mode of action for the generic function
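The dispatch mechanism can be sketched with R's simple (S3) class system. The class name `toyExperiment` and the object's contents are hypothetical, chosen only to illustrate the idea.

```r
# Build a toy object and give it a (hypothetical) class attribute
exper <- list(intensities = c(6.3, 8.0, 10.0),
              patients    = c("XX348", "XX234", "XX987"))
class(exper) <- "toyExperiment"

# A print method for that class; print() dispatches on the class attribute
print.toyExperiment <- function(x, ...) {
  cat("Toy experiment with", length(x$patients), "patients\n")
  invisible(x)
}

print(exper)            # uses print.toyExperiment
print(unclass(exper))   # without the class, default list printing is used
```

The same call, `print(...)`, produces different output for the two objects because the generic picks the method from the class attribute.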
Exercises: Bittner et al. dataset
• You should have downloaded the dataset gene_list-Cutaneous_Melanoma.xls from the web
• Use the handout as a guide to get this dataset into R and do some preliminary analyses
• If you do not have this dataset, you can use your own data
Sample surveys
• Surveys are carried out with the aim of learning about characteristics (or parameters) of a target population, the group of interest
• The survey may select all population members (census) or only a part of the population (sample)
• Typically studies sample individuals (rather than obtain a census) because of time, cost, and other practical constraints
Sampling variability
• Say we sample from a population in order to estimate the population mean of some (numerical) variable of interest (e.g. weight, height, number of children, etc.)
• We would use the sample mean as our guess for the unknown value of the population mean
• Our sample mean is very unlikely to be exactly equal to the (unknown) population mean just due to chance variation in sampling
• Thus, it is useful to quantify the likely size of this chance variation (also called ‘chance error’ or ‘sampling error’, as distinct from ‘nonsampling errors’ such as bias)
Sampling variability of the sample mean
• Say the SD in the population for the variable is known to be some number $\sigma$
• If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of chance error of the sample mean (called the ‘standard error’) is
$\mathrm{SE}(\text{mean}) = \sigma / \sqrt{n}$
• If $\sigma$ is not known, you can substitute an estimate
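As a small worked sketch, the standard error of the mean is the population SD divided by the square root of the sample size. The numbers below are hypothetical, and the sample SD is used as the estimate when the population SD is unknown.

```r
# Known population SD (hypothetical): SD = 2, n = 100
se.known <- 2 / sqrt(100)
se.known                       # 0.2

# Unknown population SD: substitute the sample SD as an estimate
x <- c(6.3, 8.0, 10.0, 7.1, 9.4, 5.8, 8.8, 7.7, 6.9, 9.0)  # made-up sample
n <- length(x)
se.mean <- sd(x) / sqrt(n)
se.mean
```

Quadrupling the sample size halves the standard error, since n enters through its square root.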
Sampling variability of the sample proportion
• Similarly, we could use the sample proportion as a guess for the unknown population proportion p with some characteristic (e.g. proportion of females)
• If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of chance error of the sample proportion is
$\mathrm{SE}(\text{proportion}) = \sqrt{p(1-p)/n}$
• Of course, we don’t know p (or we would not need to estimate it), so we substitute our estimate
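A sketch of the plug-in calculation, in which the sample proportion replaces the unknown p inside the square root of p(1-p)/n. The 0/1 observations below are made up for illustration.

```r
# Hypothetical sample: 1 = has the characteristic, 0 = does not
obs <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0)
n <- length(obs)

p.hat   <- mean(obs)                        # sample proportion
se.prop <- sqrt(p.hat * (1 - p.hat) / n)    # plug-in standard error
c(p.hat = p.hat, se = se.prop)
```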
Central Limit Theorem (CLT)
• The CLT says that if we
– repeat the sampling process many times
– compute the sample mean (or proportion) each time
– make a histogram of all the means (or proportions)
• then that histogram of sample means (or proportions) should look like the normal distribution
• Of course, in practice we only get one sample from the population
• The CLT provides the basis for making confidence intervals and hypothesis tests for means or proportions
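The repeated-sampling idea can be tried out by simulation. The sketch below assumes an exponential population (strongly right-skewed, with mean 1 and SD 1); the seed, the number of repetitions and the sample size are arbitrary choices.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n.rep <- 5000                    # number of repeated samples
n     <- 50                      # size of each sample

# Draw n.rep samples of size n and keep the mean of each one
sample.means <- replicate(n.rep, mean(rexp(n, rate = 1)))

mean(sample.means)               # should be close to the population mean, 1
sd(sample.means)                 # should be close to 1/sqrt(n)

hist(sample.means, breaks = 40, main = "Means of 5000 samples of size 50")
```

Although the population itself is far from normal, the histogram of the sample means is roughly bell-shaped, and its spread is close to the standard error.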
What the CLT does not say
• The CLT does not say that the histogram of variable values will look like the normal distribution
• The distribution of the individual variable values will look like the population distribution of variable values for a big enough sample
• This population distribution does not have to be normal, and in practice is typically not normal
CLT: technical details
• A few technical conditions must be met for the CLT to hold
• The most important ones in practice are that
– the sampling should be random (in a carefully defined sense)
– the sample size should be ‘big enough’
• How big is ‘big enough’? There is no single answer because it depends on the variable’s distribution in the population: the less symmetric the distribution, the larger the sample size you need
Exercises: CLT simulations
• Here, you will simulate flipping coins
• The coins will have differing probabilities of ‘heads’
• The object is to see how the number of coin flips required for the distribution of the proportion of heads in the simulated flips to become approximately normal depends on the probability of ‘heads’
• See the handout for details
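One possible starting point for such a simulation is sketched below. It is not taken from the handout: the helper `sim.props`, the seed, the probabilities of heads and the numbers of flips are all illustrative assumptions.

```r
set.seed(2)                       # arbitrary seed
n.rep <- 2000                     # simulated experiments per setting

# Proportion of heads in n.flips flips of a coin with P(heads) = p.heads
sim.props <- function(n.flips, p.heads) {
  rbinom(n.rep, size = n.flips, prob = p.heads) / n.flips
}

# Compare a fair coin with a very unfair one, for few and for many flips
op <- par(mfrow = c(2, 2))
for (p.heads in c(0.5, 0.05)) {
  for (n.flips in c(10, 200)) {
    hist(sim.props(n.flips, p.heads),
         main = paste("p =", p.heads, ", n =", n.flips),
         xlab = "proportion of heads")
  }
}
par(op)
```

Comparing the four histograms gives a first impression of how the shape depends on both the number of flips and the probability of heads.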