Introduction to Statistics and Data Analysis
Chris Holmes
Professor of Biostatistics, Department of Statistics,
& Wellcome Trust Centre for Human Genetics
DTC 2012
Chris Holmes Intro Stats
What is Statistics?
Statistics:
◦ Statistics is the science and art of data analysis
- from observational studies
- or from planned experiments
◦ Statistics is concerned with the collection, analysis and interpretation of data
◦ It is the science of the scientific method
Branches of statistics
◦ Statistics covers a wide range of areas, from how best to collect data (optimal design of experiments) to the construction of predictive stochastic (empirical) models
◦ Some areas of note include
- graphical displays of data
- stochastic modelling of systems
- predictive algorithms
Uncertainty
◦ At the heart of Statistics is the rigorous treatment of uncertainty or random variation characterised via probability
◦ Statistics works in units of uncertainty (you can think of probability as the currency)
◦ Probability:
- probability provides a formal system to quantify uncertainty
- probability calculus provides a formal system to update uncertainty (beliefs) in light of information (data)
- allows for coherent accumulation of evidence supporting or refuting a scientific hypothesis of interest
◦ Statistics is about being precise about our level of imprecision
Probability
Probability (of continuous, real-valued observations) has deep-rooted mathematical foundations
But we shall only need to deal with some simple aspects:
Consider an arbitrary event denoted A, e.g., A ≡ “Britain will join the Euro currency in 2012”
Then:
Pr(A) ∈ [0, 1]
- Probability ranges between 0 and 1
Pr(A) = 0
- Event A can never occur: “I’d take a bet of 1p for A occurring in return for my life” (assuming you don’t wish to die)
Pr(A) = 1
- Event A will surely occur: “I’ll take a bet of 1p in return for all my worldly wealth if A does not happen”
And so on, e.g. Pr(A) = 0.5, equal chance of A occurring or not (note: we need to define “chance” without referring to probability!), “I’d be happy for my worst enemy to decide on A or A′ for me in a bet of equal odds”
Note that the use of Probability to refer to degrees of belief in arbitrary events is not without controversy
Some (many) feel that Probability should be restricted to events that can be measured via the long-run frequency of outcomes under perfectly repeatable trials (such as in hypothetical games of chance)
- The probability that a fair coin tossed 5 times gives {H,H,H,H,H}
- The probability of a Royal Flush in Poker
I find this overly restrictive and am happy to interpret Probability in terms of personal degrees of belief (measures of uncertainty) in events (Savage, 1954)
- What’s the probability that I had more than 10 quid in my wallet at some point yesterday?
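The two frequency-based examples on this slide can be computed by simple counting. A minimal sketch in Python, assuming (this is not stated in the slides) that the Royal Flush refers to a single 5-card hand dealt from a standard 52-card deck; the variable names are illustrative only:

```python
from fractions import Fraction
from math import comb

# Five heads in five independent tosses of a fair coin: (1/2)^5
p_five_heads = Fraction(1, 2) ** 5

# Royal Flush (assumption: one 5-card hand from a 52-card deck):
# 4 favourable hands (one per suit) out of C(52, 5) possible hands
p_royal_flush = Fraction(4, comb(52, 5))

print(p_five_heads)   # 1/32
print(p_royal_flush)  # 1/649740
```

The wallet question has no such counting argument behind it, which is why it calls for the degree-of-belief interpretation.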
Updating uncertainty
Coherent updating of uncertainty in light of information follows conditional probability calculus
Pr(A|B), to be read “my updated beliefs in event A occurring given knowledge of the status of event B”,
$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \qquad (1)$$
where Pr(A,B) is the joint probability of both (A,B) occurring, and Pr(B) is a normalising constant (that does not change with the outcome of A) and ensures Pr(A|B) ∈ [0, 1]; in fact (Theorem of Total Probability)
Pr(B) = Pr(A,B) + Pr(A′,B)
Note also from (1) we have the useful identity Pr(A,B) = Pr(A|B)Pr(B)
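As a small numerical illustration of equation (1) and the Theorem of Total Probability, here is a sketch in Python. The joint probabilities are made-up values chosen only so the arithmetic is easy to follow; they do not come from the lecture.

```python
from fractions import Fraction

# Hypothetical joint probabilities over the four (A, B) outcomes
joint = {
    ("A", "B"): Fraction(3, 20),
    ("A", "not B"): Fraction(1, 4),
    ("not A", "B"): Fraction(1, 20),
    ("not A", "not B"): Fraction(11, 20),
}

# Theorem of Total Probability: Pr(B) = Pr(A, B) + Pr(A', B)
pr_b = joint[("A", "B")] + joint[("not A", "B")]

# Equation (1): Pr(A|B) = Pr(A, B) / Pr(B)
pr_a_given_b = joint[("A", "B")] / pr_b

print(pr_b)          # 1/5
print(pr_a_given_b)  # 3/4
```

Multiplying back, Pr(A|B)Pr(B) = 3/4 × 1/5 = 3/20 recovers the joint probability Pr(A,B).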
Updating uncertainty II – Bayes Rule
Given the definition of conditional probability of A|B, then clearly also
$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)}$$
and equating terms and rearranging leads us to
Bayes Rule:
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$
Bayes Rule allows us to express beliefs in A|B in terms of B|A and background beliefs Pr(A) (before we knew the status of B), which turns out to be extremely useful!
This led to Bayesian updating being referred to as “inverse probability”
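A worked example may help to show the “inverse” direction. The sketch below is in Python; the function name `bayes_rule` and the numbers (a prior of 1/100, Pr(B|A) = 9/10, Pr(B|A′) = 1/20, loosely in the spirit of a diagnostic test) are hypothetical and not taken from the lecture.

```python
from fractions import Fraction

def bayes_rule(prior_a, pr_b_given_a, pr_b_given_not_a):
    """Return Pr(A|B) from Pr(A), Pr(B|A) and Pr(B|A')."""
    # Total probability: Pr(B) = Pr(B|A)Pr(A) + Pr(B|A')Pr(A')
    pr_b = pr_b_given_a * prior_a + pr_b_given_not_a * (1 - prior_a)
    # Bayes Rule: Pr(A|B) = Pr(B|A)Pr(A) / Pr(B)
    return pr_b_given_a * prior_a / pr_b

posterior = bayes_rule(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))
print(posterior)  # 2/13
```

Even though B is much more likely under A than under A′, the updated belief in A is only 2/13 (about 0.15), because the background belief Pr(A) was small.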
Statistical Data Analysis
Broadly speaking, the analysis of data proceeds in two stages
◦ Exploratory analysis of data via graphing and summary statistics
◦ Formal statistical modelling of dependence structures of interest, e.g. for prediction, or for evaluating the empirical evidence for a particular scientific hypothesis
Today we will deal with topics related to exploratory analysis of data
Due to time constraints we will not cover the optimal design of experiments that precedes the above tasks for experimental studies, although this is an important discipline: “How to set up an experiment and collect samples so as to maximise the information content, reduce bias, and reduce confounding?”
Graphing data
◦ ALL good statistical data analysis begins with graphical plots and summary statistics of the data
◦ ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!
◦ Why?
Graphical Excellence
“Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency”
- Edward Tufte: The Visual Display of Quantitative Information
Graphical Excellence
Excellent graphics:
◦ show the data
◦ induce the viewer to think about the substance
◦ avoid bias
◦ make large complex data sets coherent
◦ encourage data exploration and debate
Moreover:
◦ Graphical plots and summary stats provide a feel for the variation in the data
◦ They can also highlight unusual results, measurement errors, outliers
- Such features can severely distort your results if left unchecked!
- Many formal tests assume that the data follows a certain pattern (a probability distribution such as Normal); if these assumptions are invalid the results will be completely misleading
- Confidence in these assumptions can be gained through plotting the data
In the intro stats course (last term) we covered histograms, boxplots, scatterplots, qq-plots, pie-charts (don’t use them!), barcharts. It is expected that you familiarise yourself with these
However, modern experiments in the life sciences, genetics, genomics, molecular biology can produce huge data sets, involving 1000s-to-1000000s of measurements on 100s-to-1000s of samples
For example, mRNA microarrays can measure gene-expression levels for all known genes in the genome within a tissue sample, and whole genome sequencing can read off all of the $3 \times 10^9$ DNA bases in a human genome
These data sets can hold signals associated with heritable disease risk, or provide a snapshot of the functional status of a population of cells, which have important applications in
◦ Biomarkers
◦ Understanding function
Exploring “Big Data”
Modern bio-data sets are large and highly structured
Underlying biological mechanisms lead to strong dependences between variables
Conventional exploratory tools such as scatterplots are not usable due to the dimensionality
To explore such high-dimensional data it is useful to consider multivariate techniques. We shall consider two,
◦ Cluster Analysis
◦ Principal Components Analysis
Cluster Analysis
In Big Data it is interesting to search for and highlight sub-groups of samples embedded within high-d data that show self-similarity, such that objects within a group are more similar to one another than those in other groups, or to know that no such groups exist
E.g., in mRNA gene-expression analysis of tumour samples it is interesting to see if there are undetected sub-groups that perhaps relate to heterogeneity in clinical outcome; or, in population genetics, to highlight individuals that are (cryptically) more closely or distantly related than expected
This process, of detecting and assigning objects to sub-groups, can be generally referred to as Cluster Analysis
Cluster Analysis
Broadly speaking, there are two characterisations of cluster analysis methods
- Model based or Model free, for defining the similarity between objects
- Hierarchical or Partition, for assigning objects to clusters based on their similarity
so you can have {{hierarchical model based}, {hierarchical model free}, {model based partition}, {model free partition}} clustering
Model based
- assumes a probability distribution for each group
- e.g. model objects within a group as arising from a Normal (Gaussian) distribution
Model free (or algorithmic)
- define an arbitrary distance metric between objects
- e.g., Euclidean $d_{ij} = \sqrt{\sum_{s} (x_{is} - x_{js})^2}$
Partition
- divide up the space (probabilistically for model based)
- assign objects to clusters by the partition they fall in
(a) Model free Partition (b) Model based Partition
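To make the model-free partition idea concrete, the sketch below divides a one-dimensional space around a set of fixed cluster centres and assigns each object to the region it falls in. It is written in Python; the function name, the centres and the data are hypothetical, and how the centres would be chosen in practice is not addressed here.

```python
def assign_to_nearest(points, centres):
    """Label each point with the index of its nearest centre."""
    labels = []
    for p in points:
        # Absolute distance from this point to every centre
        dists = [abs(p - c) for c in centres]
        # The nearest centre defines the partition cell the point falls in
        labels.append(dists.index(min(dists)))
    return labels

points = [0.5, 1.2, 4.8, 5.1, 9.7]   # made-up 1-D observations
centres = [1.0, 5.0, 10.0]           # made-up cluster centres
print(assign_to_nearest(points, centres))  # [0, 0, 1, 1, 2]
```

A model-based partition would replace the hard nearest-centre rule with the probability of each point under each group's distribution.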
Partitioning
One issue with Partitioning is that you need to define the number of clusters
If you wish to explore the data over differing resolutions then you’d want to examine the clusterings obtained from k = {1, 2, . . . , n} clusters (with n data points)
You could simply run n parallel Partition models for the differing values of k. But then there is no dependence between the clusterings, such that, say, the k = 5 and k = 6 clusterings might be very dissimilar
Hierarchical clustering allows one to explore the data across multiple resolutions via recursive partitioning (division) or recursive merging (agglomeration) of data objects
Hierarchical Clustering
HC works in either divisive or agglomerative fashion
Divisive (top down)
- Start with all data points in a single cluster
- Partition the cluster into two clusters
- For each cluster, partition into two clusters; Repeat
Agglomerative (bottom up)
- Start with each point in its own cluster
- Merge two points (clusters) into a single cluster
- Repeat
To complete either we’ll need to define a score (model free) or a probability (model based) that measures the similarity between two clusters
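The agglomerative (bottom up) steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objects are one-dimensional numbers, the score between two clusters is the smallest absolute distance between their members (a choice made here for simplicity), and the function names are hypothetical.

```python
def cluster_distance(a, b):
    """Score between two clusters: smallest pairwise absolute distance."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Merge the closest pair of clusters repeatedly; return the merge history."""
    clusters = [[p] for p in points]      # start: each point in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append((merged, d))        # record what was merged and at what distance
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

for merged, d in agglomerate([1.0, 2.0, 6.0, 7.0, 20.0]):
    print(merged, d)
```

The recorded distances (1, 1, 4 and 13 for this made-up data) are the heights at which branches would join if the merge history were drawn as a tree.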
Hierarchical Clustering: dendrogram
Such a recursive approach then produces a dendrogram (tree) that represents the clustering,
where the length of the branches quantifies the “distance” between clusters
The dendrogram provides a useful semi-quantitative description of the similarity and major groupings of objects in a data table
Model free HC
In order to decide which clusters to join / split we need to define a distance between objects
Common choices are
- Euclidean,
$$d_{ij} = \lVert x_i - x_j \rVert_2 = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2}$$
where $d_{ij}$ is the distance between the i’th and j’th objects, and $x_{iv}$ denotes the v’th of p measurements on $x_i$
- Absolute (Manhattan)
$$d_{ij} = \lVert x_i - x_j \rVert_1 = \sum_{v=1}^{p} |x_{iv} - x_{jv}|$$
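Both distances translate directly into code. A minimal Python sketch, with objects represented as equal-length sequences of p measurements (the function names and example points are illustrative):

```python
from math import sqrt

def euclidean(x_i, x_j):
    """Euclidean (L2) distance between two objects with p measurements each."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def manhattan(x_i, x_j):
    """Absolute (Manhattan, L1) distance between two objects."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of objects, so the two choices can rank pairs differently.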
Linkage
Given a metric we can calculate the pairwise distance matrix, D, that records the distance between every pair of objects, $(D)_{ij} = d_{ij}$, $i = 1, \ldots, n$, $j = i + 1, \ldots, n$
We now need to score any potential split / merge of a cluster(s) to decide on the best next step
The linkage method defines the overall distance between two sets (clusters) of observations
Consider two clusters A and B
Common types of linkage include
- Single linkage: $\min_{i \in A,\, j \in B} d_{ij}$
- Complete linkage: $\max_{i \in A,\, j \in B} d_{ij}$
- Average linkage: $\frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij}$
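The three linkage rules differ only in how they summarise the pairwise distances between the two clusters. A short Python sketch, assuming one-dimensional objects and absolute distance purely to keep the example small (the clusters A and B are made up):

```python
def pairwise_distances(cluster_a, cluster_b, dist):
    """All distances d_ij with i in A and j in B."""
    return [dist(x, y) for x in cluster_a for y in cluster_b]

def single_linkage(cluster_a, cluster_b, dist):
    return min(pairwise_distances(cluster_a, cluster_b, dist))

def complete_linkage(cluster_a, cluster_b, dist):
    return max(pairwise_distances(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist):
    d = pairwise_distances(cluster_a, cluster_b, dist)
    return sum(d) / (len(cluster_a) * len(cluster_b))

def abs_dist(x, y):
    return abs(x - y)

A, B = [0, 1], [4, 6]   # hypothetical clusters
print(single_linkage(A, B, abs_dist))    # 3
print(complete_linkage(A, B, abs_dist))  # 6
print(average_linkage(A, B, abs_dist))   # 4.5
```

Single linkage looks only at the closest pair and complete linkage only at the farthest pair, so the average always lies between the two.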
Single, Complete, and Average Linkage
Biclustering
Suppose the data is recorded in a matrix X with n rows of objects and p columns of measurements
It may well be of interest to cluster both the objects (rows) and the measurements (columns)
Then plot out the joint dendrogram on top of the distance matrix
This is known as biclustering
Biclustering of mRNA from case-control samples reveals gene expression profiles
Principal Components Analysis
In biological systems, measurements of molecular phenotypes, such as mRNA, miRNA, DNA, are high dimensional and strongly dependent due to fundamental mechanisms such as gene function, biological pathways or recombination
In exploratory data analysis we would like to identify patterns embedded in complex data tables (from experiments)
PCA is one of the most important and widely used methods in exploratory statistics, used in a huge variety of applications
- reveals patterns hidden in high-d data tables
- provides low-dimensional views of high-d data
PCA
Suppose X is a (m × n) data table, e.g. m rows measuring gene-expression, on n columns of samples
$$X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}$$
where
$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{bmatrix}$$
We suspect that signal (interesting patterns) may be contained in dimensions much lower than m. For example, within a few correlated genes (on a common pathway)
Change of Basis
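We can seek a linear projection of the m dimensional space into a new basis space via,
$$PX = Y$$
where P is an (m × m) square matrix,
$$PX = \begin{bmatrix} - & p_1 & - \\ & \vdots & \\ - & p_m & - \end{bmatrix} \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}$$
and Y is an (m × n) matrix,
$$Y = \begin{bmatrix} p_1 x_1 & \cdots & p_1 x_n \\ \vdots & \ddots & \vdots \\ p_m x_1 & \cdots & p_m x_n \end{bmatrix}$$
with elements $(Y)_{ij} = p_i x_j$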
The row vectors $p_i$, $i = 1, \ldots, m$, provide linear combinations of the m original measurements
$$p_1 = [p_{11}, p_{12}, \ldots, p_{1m}]$$
You should convince yourself that information is not lost by making such a projection (transformation): if P is square and invertible, $P^{-1}P = I$, then
$$X = P^{-1}PX = P^{-1}Y$$
So we can invert the transformation to get from Y back to X
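A tiny numerical check can make the invertibility argument tangible. The Python sketch below uses a hypothetical 2 × 2 matrix P (a quarter-turn rotation, chosen because its inverse is simply its transpose) and a made-up 2 × 3 data table X; the helper functions are written out by hand to keep the example self-contained.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

P = [[0, 1],
     [-1, 0]]          # hypothetical change of basis (a rotation)
X = [[1, 2, 3],
     [4, 5, 6]]        # made-up data: m = 2 measurements, n = 3 samples

Y = matmul(P, X)                   # Y = PX
P_inv = transpose(P)               # for this P, the inverse equals the transpose
X_back = matmul(P_inv, Y)          # X = P^{-1} Y

print(Y)       # [[4, 5, 6], [-1, -2, -3]]
print(X_back)  # [[1, 2, 3], [4, 5, 6]]
```

For a general invertible P the inverse would have to be computed explicitly; the transpose shortcut holds only because this particular P is a rotation.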
Optimal choice of basis
We wish to reveal patterns embedded in X
Hence we should construct P to compress (i.e. project) the structured parts of X (the signal) into the first few bases (dimensions of Y)
Then we can visualise and explore X in a much lower dimensional space
That is, choose the first few $p_1, p_2, \ldots, p_k$ with $k \ll m$ to preserve the signal, so that $p_{k+1}, \ldots, p_m$ contain unstructured variation (noise)
Optimal choice of basis
We still need to define precisely what we mean by “optimal”
We shall first constrain P to form an orthonormal basis (to make the problem well posed)
$$PP^T = I$$
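Think of a scatterplot: each $p_i$ is now akin to a rotation of the original axes
$$p_i \cdot p_i^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_i \\ | \end{bmatrix} = 1$$
and
$$p_i \cdot p_j^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_j \\ | \end{bmatrix} = 0 \quad \text{for } j \neq i$$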
Signal-to-noise
If we believe the unstructured noise is identically distributed and independent across the m measurements, then a second order statistic to define “optimality” is to maximise the signal-to-noise ratio (SNR),
$$\mathrm{SNR} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}$$
where $\sigma^2$ denotes the variance – note the assumption here is that variance is a good measure of “signal”
So $\mathrm{SNR} \gg 1$ suggests a lot of signal (pattern) in the data
Constructive derivation of basis
Suppose we wish to construct our first basis (projection) p1
If the noise is common across measurements (equivalent to a ball of noise in the original axes) then to maximise the SNR in the first projection we can simply maximise the variance of the first axis (row) of Y
Let $C_Y$ denote the variance-covariance matrix of Y, and assume Y is centred (mean zero in each direction)
$$C_Y = \frac{1}{n - 1} YY^T$$
so $(C_Y)_{11}$ records the variance along the first row of Y
So find $p_1$ so as to maximise the spread of points (as defined by the variance) along the axis of $Y_{1\cdot}$
Then, having set $p_1$, find $p_2$ that maximises the variance, subject to $p_1 \cdot p_2^T = 0$
and then $p_3$ subject to { $p_1 \cdot p_3^T = 0$ and $p_2 \cdot p_3^T = 0$ }, and so on for $p_4$ etc.
Calculating the basis – for the more mathematical among you
Recall we are trying to find P so as to maximise the diagonal elements of $C_Y = \frac{1}{n-1} YY^T$, and
$$\begin{aligned} C_Y &= \frac{1}{n-1} YY^T \\ &= \frac{1}{n-1} (PX)(PX)^T \\ &= \frac{1}{n-1} PXX^TP^T \\ &= \frac{1}{n-1} P(XX^T)P^T \end{aligned}$$
and we recognise $XX^T$ as the (unnormalised) variance-covariance of X
Eigen-decomposition
$XX^T$ is square symmetric and hence has an eigen-decomposition
$$XX^T = UDU^T$$
where U is a matrix whose columns are eigenvectors of $XX^T$ and D is a diagonal matrix storing the m eigenvalues of $XX^T$ in decreasing order
Now, select the rows of P to be the eigenvectors of $XX^T$, so $P = U^T$,
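in which case, using $XX^T = UDU^T = P^TDP$,
$$\begin{aligned} C_Y &= \frac{1}{n-1} P(XX^T)P^T \\ &= \frac{1}{n-1} P(P^TDP)P^T \\ &= \frac{1}{n-1} (PP^T)D(PP^T) \\ &= \frac{1}{n-1} (I)D(I) \\ &= \frac{1}{n-1} D \end{aligned}$$
*** Setting $P = U^T$ makes P an orthonormal basis, $PP^T = I$, and maximises the variance in each direction *** !!!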
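The result can be checked numerically on a toy example. The Python sketch below assumes m = 2 measurements so that the eigen-decomposition of the symmetric 2 × 2 matrix can be written in closed form; the helper `eig_sym_2x2` and the centred data X are hypothetical, and a real analysis would use a numerical linear algebra library instead.

```python
from math import sqrt, hypot

def eig_sym_2x2(a, b, c):
    """Eigenvalues (decreasing) and unit eigenvectors of [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    values = [mean + radius, mean - radius]
    if b == 0:  # already diagonal: the eigenvectors are the original axes
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for lam in values:
            vx, vy = b, lam - a           # solves ([[a, b], [b, c]] - lam I) v = 0
            norm = hypot(vx, vy)
            vectors.append((vx / norm, vy / norm))
    return values, vectors

# Made-up centred data: m = 2 rows (measurements), n = 4 columns (samples)
X = [[2.0, -2.0, 1.0, -1.0],
     [2.0, -2.0, -1.0, 1.0]]
n = len(X[0])

# Entries of the symmetric matrix XX^T
a = sum(v * v for v in X[0])
b = sum(u * v for u, v in zip(X[0], X[1]))
c = sum(v * v for v in X[1])

D, P = eig_sym_2x2(a, b, c)    # rows of P are the eigenvectors, i.e. P = U^T
Y = [[sum(P[i][k] * X[k][j] for k in range(2)) for j in range(n)] for i in range(2)]
C_Y = [[sum(Y[i][t] * Y[j][t] for t in range(n)) / (n - 1) for j in range(2)]
       for i in range(2)]

print(D)    # [16.0, 4.0]
print(C_Y)  # approximately [[16/3, 0], [0, 4/3]]
```

The off-diagonal entries of $C_Y$ vanish (up to rounding) and the diagonal holds the eigenvalues divided by n − 1, with the largest variance on the first row of Y.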
Example of PCA
In population genetics and genetic epidemiology we often genotype individuals from a large cross-section of the population
Genotyping measures positions in the genome where we know common variation exists between individuals
- there are roughly 3,000,000 DNA bases in the human genome where we can expect greater than 1% of the population to show variation
E.g. suppose at a locus we know that some people might be {A,A} and some {A,T} and others {T,T}. We could encode this as {0, 1, 2}
Now construct a large data table X with elements $(X)_{ij} \in \{0, 1, 2\}$ for say m = 3,000,000 loci (rows) and n = 1000s of individuals (columns)
Perform PCA on X and project individuals into the first few PCs and explore
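The encoding step can be sketched as follows in Python. The assumption made here (not spelled out in the slides) is that the code counts copies of one chosen allele, T in this example, so {A,A} becomes 0, {A,T} becomes 1 and {T,T} becomes 2; the function name and the tiny genotype table are hypothetical.

```python
def encode_genotype(genotype, counted_allele="T"):
    """Number of copies of `counted_allele` in an unordered genotype pair."""
    return sum(1 for allele in genotype if allele == counted_allele)

# Made-up genotype calls: 2 loci (rows) by 3 individuals (columns)
calls = [
    [("A", "A"), ("A", "T"), ("T", "T")],
    [("T", "A"), ("A", "A"), ("A", "A")],
]

X = [[encode_genotype(g) for g in locus] for locus in calls]
print(X)  # [[0, 1, 2], [1, 0, 0]]

# Centre each row (locus) before PCA, as the derivation assumes mean-zero data
X_centred = [[v - sum(row) / len(row) for v in row] for row in X]
```

Counting alleles makes the code independent of the order in which the two alleles are written, so ("T", "A") and ("A", "T") both map to 1.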
PCA on genotype matrix mirrors geography, Nature, 456, 98-101
What have we learnt?
Statistics is the scientific discipline concerned with the analysis and interpretation of data (in an increasingly data-rich world)
Statistics concerns itself with uncertainty, quantified using probability, and updated using probability calculus
The analysis of data should proceed via exploratory analysis followed by formal modelling (if required)
Exploratory analysis involves graphical interrogation of the data and summary statistics
In high-dimensional data, cluster analysis and PCA can provide useful tools for exploration of structure
Key References
◦ Cleveland, W. S. (1993). Visualizing Data. Hobart Press.
◦ Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edn. Springer.
◦ Savage, L. J. (1954). The Foundations of Statistics. Dover. - worth reading Chapters 1-8 for those fascinated with the foundations of subjective probability (Bayesian stats)
◦ Tufte, E. (2001). The Visual Display of Quantitative Information. 2nd Edn. Graphics Press. - you will never look at a graph in the same way again
◦ Wainer, H. (1984). How to display data badly. American Statistician. Vol. 38, No. 2.