
Introduction to Statistics and Data Analysis

Chris Holmes

Professor of Biostatistics, Department of Statistics,

& Wellcome Trust Centre for Human Genetics

DTC 2012

Chris Holmes Intro Stats

What is Statistics?


Statistics:

◦ Statistics is the science and art of data analysis - from observational studies - or from planned experiments

◦ Statistics is concerned with the collection, analysis and interpretation of data

◦ It is the science of the scientific method


Branches of statistics

◦ Statistics covers a wide range of areas, from how best to collect data (optimal design of experiments) to the construction of predictive stochastic (empirical) models

◦ Some areas of note include

- graphical displays of data

- stochastic modelling of systems

- predictive algorithms


Uncertainty

◦ At the heart of Statistics is the rigorous treatment of uncertainty or random variation characterised via probability

◦ Statistics works in units of uncertainty (you can think of probability as the currency)

◦ Probability:

- probability provides a formal system to quantify uncertainty

- probability calculus provides a formal system to update uncertainty (beliefs) in light of information (data)

- allows for coherent accumulation of evidence supporting or refuting a scientific hypothesis of interest

◦ Statistics is about being precise about our level of imprecision


Probability

Probability (of continuous, real-valued observations) has deep-rooted mathematical foundations

But for us we shall only need to deal with some simple aspects:


Consider an arbitrary event denoted A, e.g., A ≡ “Britain will join the Euro currency in 2012”

Then:

Pr(A) ∈ [0, 1]

- Probability ranges between 0 and 1

Pr(A) = 0

- Event A can never occur: “I’d take a bet of 1p for A occurring in return for my life” (assuming you don’t wish to die)

Pr(A) = 1

- Event A will surely occur: “I’ll take a bet of 1p in return for all my worldly wealth if A does not happen”

And so on, e.g. Pr(A) = 0.5, equal chance of A occurring or not (note: we need to define “chance” without referring to probability!), “I’d be happy for my worst enemy to decide on A or A′ for me in a bet of equal odds”


Note that the use of Probability to refer to degrees of belief in arbitrary events is not without controversy

Some (many) feel that Probability should be restricted to events that can be measured via the long-run frequency of outcomes under perfectly repeatable trials (such as in hypothetical games of chance)

- The probability that a fair coin tossed 5 times gives {H,H,H,H,H}

- The probability of a Royal Flush in Poker
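Both frequency-style probabilities above can be computed by simple counting. The sketch below assumes independent tosses of a fair coin and, for the poker case, a single 5-card hand dealt from a standard 52-card deck (the slide does not specify the poker variant, so that setting is an assumption).

```python
from fractions import Fraction
from math import comb

# Five independent tosses of a fair coin, all landing heads.
p_five_heads = Fraction(1, 2) ** 5

# Royal flush in a 5-card hand from a 52-card deck (assumed setting):
# exactly one royal flush per suit, out of C(52, 5) equally likely hands.
p_royal_flush = Fraction(4, comb(52, 5))

print(p_five_heads)   # 1/32
print(p_royal_flush)  # 1/649740
```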

I find this overly restrictive and am happy to interpret Probability in terms of personal degrees of belief (measures of uncertainty) in events (Savage, 1954)

- What’s the probability that I had more than 10 quid in my wallet at some point yesterday?


Updating uncertainty

Coherent updating of uncertainty in light of information follows conditional probability calculus

Pr(A|B), to be read “my updated beliefs in event A occurring given knowledge of the status of event B”,

$$\Pr(A \mid B) = \frac{\Pr(A, B)}{\Pr(B)} \qquad (1)$$

where Pr(A,B) is the joint probability of both (A,B) occurring, and Pr(B) is a normalising constant (that does not change with the outcome of A) and ensures Pr(A|B) ∈ [0, 1]; in fact (Theorem of Total Probability)

Pr(B) = Pr(A,B) + Pr(A′,B)

Note also from (1) we have the useful identity Pr(A,B) = Pr(A|B)Pr(B)
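As a small worked illustration of equation (1) and the Theorem of Total Probability, the sketch below starts from a joint probability table. The four joint probabilities are made-up numbers chosen only so that they sum to one; they are not from the course.

```python
from fractions import Fraction

# Hypothetical joint probabilities Pr(A, B) over the four outcomes.
joint = {
    ("A", "B"): Fraction(3, 10),
    ("A", "not B"): Fraction(1, 10),
    ("not A", "B"): Fraction(2, 10),
    ("not A", "not B"): Fraction(4, 10),
}

# Theorem of Total Probability: Pr(B) = Pr(A, B) + Pr(A', B).
pr_b = joint[("A", "B")] + joint[("not A", "B")]

# Equation (1): Pr(A | B) = Pr(A, B) / Pr(B).
pr_a_given_b = joint[("A", "B")] / pr_b

print(pr_b)          # 1/2
print(pr_a_given_b)  # 3/5

# The identity Pr(A, B) = Pr(A | B) Pr(B) recovers the joint entry.
print(pr_a_given_b * pr_b == joint[("A", "B")])  # True
```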


Updating uncertainty II – Bayes Rule

Given the definition of conditional probability of A|B, then clearly also

$$\Pr(B \mid A) = \frac{\Pr(A, B)}{\Pr(A)}$$

and equating terms and rearranging leads us to,

Bayes Rule:

$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}$$

Bayes Rule allows us to express beliefs in A|B in terms of B|A and background beliefs Pr(A) (before we knew the status of B), which turns out to be extremely useful!

This led to Bayesian updating being referred to as “inverse probability”
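A minimal numerical sketch of Bayes Rule follows. The prior Pr(A) and the two conditional probabilities Pr(B|A) and Pr(B|A′) are hypothetical values picked for illustration; Pr(B) is obtained from the Theorem of Total Probability.

```python
from fractions import Fraction

# Hypothetical inputs (illustrative only).
pr_a = Fraction(1, 100)              # prior belief Pr(A)
pr_b_given_a = Fraction(9, 10)       # Pr(B | A)
pr_b_given_not_a = Fraction(5, 100)  # Pr(B | A')

# Normalising constant: Pr(B) = Pr(B | A) Pr(A) + Pr(B | A') Pr(A').
pr_b = pr_b_given_a * pr_a + pr_b_given_not_a * (1 - pr_a)

# Bayes Rule: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B).
pr_a_given_b = pr_b_given_a * pr_a / pr_b

print(pr_b)          # 117/2000
print(pr_a_given_b)  # 2/13
```

Under these assumed numbers, observing B raises the belief in A from 1/100 to 2/13, which shows how strongly the background belief Pr(A) tempers the update.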


Statistical Data Analysis

Broadly speaking, the analysis of data proceeds in two stages

◦ Exploratory analysis of data via graphing and summary statistics

◦ Formal statistical modelling of dependence structures of interest, e.g. for prediction, or for evaluating the empirical evidence for a particular scientific hypothesis

Today we will deal with topics related to exploratory analysis of data


Due to time constraints we will not cover the optimal design of experiments that precedes the above tasks for experimental studies, although this is an important discipline: “How to set up an experiment and collect samples so as to maximise the information content, reduce bias, and reduce confounding?”


Graphing data

◦ The starting point of ALL good statistical data analysis is graphical plots and summary statistics of the data

◦ ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

◦ Why?


Graphical Excellence

“Graphics reveal data, communicate complex ideas and dependencies with clarity, precision and efficiency”

- Edward Tufte: The Visual Display of Quantitative Information


Graphical Excellence

Excellent graphics:

◦ show the data

◦ induce the viewer to think about the substance

◦ avoid bias

◦ make large complex data sets coherent

◦ encourage data exploration and debate


Moreover:

◦ Graphical plots and summary stats provide a feel for the variation in the data

◦ They can also highlight unusual results, measurement errors, outliers

- Such features can severely distort your results if left unchecked!

- Many formal tests assume that the data follow a certain pattern (a probability distribution such as the Normal); if these assumptions are invalid the results can be completely misleading

- Confidence in these assumptions can be gained through plotting the data
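To see how a single unchecked measurement error can distort summary statistics, consider the sketch below. The measurements are invented for illustration, with one value assumed to be a recording error (for example a misplaced decimal point).

```python
from statistics import mean, median, stdev

# Hypothetical repeated measurements of the same quantity.
clean = [4.8, 5.1, 5.0, 4.9, 5.2]
# The same data with one assumed recording error appended.
with_error = clean + [50.0]

for name, data in (("clean", clean), ("with error", with_error)):
    # The mean and standard deviation are pulled strongly by the outlier;
    # the median barely moves.
    print(name, round(mean(data), 2), round(median(data), 2), round(stdev(data), 2))
```

A plot of the six values would expose the rogue point immediately, whereas the mean alone gives no hint that anything is wrong.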


In the intro stats course (last term) we covered histograms, boxplots, scatterplots, qq-plots, pie-charts (don’t use them!), bar charts. It is expected that you familiarise yourself with these

However, modern experiments in the life sciences, genetics, genomics, molecular biology can produce huge data sets, involving 1000s-to-1000000s of measurements on 100s-to-1000s of samples

For example, mRNA microarrays can measure gene-expression levels for all known genes in the genome within a tissue sample, and whole genome sequencing can read off all of the 3 × 10^9 DNA bases in a human genome

These data sets can hold signals associated with heritable disease risk, or provide a snapshot of the functional status of a population of cells, which have important applications in

◦ Biomarkers

◦ Understanding function


Exploring “Big Data”

Modern bio-data sets are large and highly structured

Underlying biological mechanisms lead to strong dependences between variables

Conventional exploratory tools such as scatterplots are not usable due to the dimensionality

To explore such high-dimensional data it is useful to consider multivariate techniques. We shall consider two,

◦ Cluster Analysis

◦ Principal Components Analysis


Cluster Analysis

In Big Data it is interesting to search for and highlight sub-groups of samples embedded within high-d data that show self-similarity, such that objects within a group are more similar to one another than those in other groups, or to know that no such groups exist


E.g., in mRNA gene-expression analysis of tumour samples it is interesting to see if there are undetected sub-groups, that perhaps relate to heterogeneity in clinical outcome; or in population genetics highlighting individuals that are (cryptically) more closely or distantly related than expected

This process, of detecting and assigning objects to sub-groups, can be generally referred to as Cluster Analysis


Cluster Analysis

Broadly speaking, there are two characterisations of cluster analysis methods

- Model based or Model free, for defining the similarity between objects

- Hierarchical or Partition, for assigning objects to clusters based on their similarity

so you can have { {hierarchical model based}, {hierarchical model free}, {model based partition}, {model free partition} } clustering


Model based

- assumes a probability distribution for each group

- e.g. model objects within a group as arising from a Normal (Gaussian) distribution

Model free (or algorithmic)

- define an arbitrary distance metric between objects

- e.g., Euclidean $d_{ij} = \sqrt{\sum_{s} (x_{is} - x_{js})^2}$


Partition

- divide up the space (probabilistically for model based)

- assign objects to clusters by the partition they fall in

[Figure: (a) Model free Partition; (b) Model based Partition]
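One concrete instance of a model free partition method is k-means (Lloyd's algorithm). The slides do not name a specific algorithm, so k-means is swapped in here purely as an illustration; the function `kmeans`, the toy points and the starting centres are all hypothetical.

```python
import math

def kmeans(points, centres, n_iter=20):
    """Lloyd's algorithm: alternate nearest-centre assignment and centre updates."""
    centres = [tuple(c) for c in centres]
    labels = []
    for _ in range(n_iter):
        # Assign each point to the partition of its nearest centre (Euclidean).
        labels = [
            min(range(len(centres)), key=lambda k: math.dist(p, centres[k]))
            for p in points
        ]
        # Move each centre to the mean of the points assigned to it.
        new_centres = []
        for k, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centres.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centres.append(old)  # leave an empty cluster's centre alone
        if new_centres == centres:
            break  # assignments have stabilised
        centres = new_centres
    return labels, centres

# Toy data: two visibly separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels, centres = kmeans(points, centres=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is fixed in advance by the number of starting centres supplied.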


Partitioning

One issue with Partitioning is that you need to define the number of clusters

If you wish to explore the data over differing resolutions then you’d want to examine the clusterings obtained from k = {1, 2, . . . , n} clusters (with n data points)

You could simply run n parallel Partition models for the differing values. But then there is no dependence between the clusterings such that, say, k = 5 and k = 6 clusterings might be very dissimilar

Hierarchical clustering allows one to explore the data across multiple resolutions via recursive partitioning (division) or recursive merging (agglomeration) of data objects


Hierarchical Clustering

HC works in either divisive or agglomerative fashion

Divisive (top down)

- Start with all data points in a single cluster

- Partition the cluster into two clusters

- For each cluster, partition into two clusters; Repeat

Agglomerative (bottom up)

- Start with each point in its own cluster

- Merge two points (clusters) into a single cluster

- Repeat

To complete either we’ll need to define a score (model free) or a probability (model based) that measures the similarity between two clusters


Hierarchical Clustering: dendrogram

Such a recursive approach then produces a dendrogram (tree) that represents the clustering

where the length of the branches quantifies the “distance” between clusters

The dendrogram provides a useful semi-quantitative description of the similarity and major groupings of objects in a data table


Model free HC

In order to decide which clusters to join / split we need to define a distance between objects

Common choices are

- Euclidean,

$$d_{ij} = \lVert x_i - x_j \rVert_2 = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2}$$

where dij is the distance between the i’th and j’th objects, and xiv denotes the v’th of p measurements on xi

- Absolute (Manhattan)

$$d_{ij} = \lVert x_i - x_j \rVert_1 = \sum_{v=1}^{p} |x_{iv} - x_{jv}|$$
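The two distances can be written directly from their definitions. In the sketch below the function names and the two example objects (each with p = 3 measurements) are illustrative choices.

```python
import math

def euclidean(x_i, x_j):
    """Square root of the summed squared differences over the p measurements."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def manhattan(x_i, x_j):
    """Sum of absolute differences over the p measurements."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

# Two hypothetical objects with p = 3 measurements each.
x_i = (1.0, 2.0, 3.0)
x_j = (4.0, 6.0, 3.0)

print(euclidean(x_i, x_j))  # 5.0
print(manhattan(x_i, x_j))  # 7.0
```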


Linkage

Given a metric we can calculate the pairwise distance matrix, D, that records the distance between every pair of objects, (D)ij = dij, i = 1, . . . , n, j = i + 1, . . . , n

We now need to score any potential split / merge of a cluster(s) to decide on the best next step

The linkage method defines the overall distance between two sets (clusters) of observations


Consider two clusters A and B

Common types of linkage include

- Single linkage

$$\min_{i \in A,\, j \in B} d_{ij}$$

- Complete linkage

$$\max_{i \in A,\, j \in B} d_{ij}$$

- Average linkage

$$\frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij}$$
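The three linkage rules plug straight into agglomerative (bottom up) hierarchical clustering. The sketch below is a naive, unoptimised implementation under the assumption of Euclidean distances; the function names and the four one-dimensional toy points are hypothetical.

```python
import math
from itertools import combinations

def linkage_distance(A, B, d, method):
    """Distance between clusters A and B (lists of object indices) given matrix d."""
    pairwise = [d[i][j] for i in A for j in B]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / (len(A) * len(B))
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, method="single"):
    """Merge clusters bottom up; return (cluster A, cluster B, distance) per merge."""
    n = len(points)
    # Pairwise Euclidean distance matrix D.
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with each point in its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage_distance(clusters[ab[0]], clusters[ab[1]], d, method),
        )
        dist = linkage_distance(clusters[a], clusters[b], d, method)
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Four hypothetical objects measured on a single variable.
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
for method in ("single", "complete", "average"):
    final = agglomerate(points, method)[-1]
    print(method, final[2])
```

The recorded merge distances are the branch heights one would draw in the dendrogram; the choice of linkage changes those heights even when the merge order is the same, as it is for these toy points.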


Single, Complete, and Average Linkage


Biclustering

Suppose the data is recorded in a matrix X with n rows of objects and p columns of measurements

It may well be of interest to cluster both objects (rows) and to cluster measurements (columns)

Then plot out the joint dendrogram on top of the distance matrix

Known as biclustering


Biclustering: of mRNA from case-control samples reveals gene expression profiles


Principal Components Analysis

In biological systems, measurements of molecular phenotypes, such as mRNA, miRNA, DNA, are high dimensional and strongly dependent due to fundamental mechanisms such as gene function, biological pathways or recombination

In exploratory data analysis we would like to identify patterns embedded in complex data tables (from experiments)

PCA is one of the most important and widely used methods in exploratory statistics, used in a huge variety of applications

- reveals patterns hidden in high-d data tables

- provides low dimensional views of high-d data


PCA

Suppose X is a (m × n) data table, e.g. m rows measuring gene-expression, on n columns of samples

$$X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}$$

where

$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{bmatrix}$$

We suspect that signal (interesting patterns) may be contained in dimensions much lower than m. For example, within a few correlated genes (on a common pathway)


Change of Basis

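We can seek a linear projection of the m dimensional space into a new basis space via,

PX = Y

where P is a (m × m) square matrix,

$$PX = \begin{bmatrix} - & p_1 & - \\ & \vdots & \\ - & p_m & - \end{bmatrix} \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_n \\ | & | & & | \end{bmatrix}$$

and Y is an (m × n) matrix,

$$Y = \begin{bmatrix} p_1 x_1 & \cdots & p_1 x_n \\ \vdots & \ddots & \vdots \\ p_m x_1 & \cdots & p_m x_n \end{bmatrix}$$

with elements $(Y)_{ij} = p_i x_j$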


The row vectors pi, i = 1, . . . , m, provide linear combinations of the m original measurements

p1 = [p11, p12, . . . , p1m]

You should convince yourself that information is not lost by making such a projection (transformation) if P is square and invertible, $P^{-1}P = I$, then,

$$X = P^{-1}PX = P^{-1}Y$$

So we can invert the transformation to get from Y back to X
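The round trip from X to Y and back can be checked numerically. The sketch below takes P to be a 2 × 2 rotation matrix, a convenient special case (an assumption of this example) whose inverse is simply its transpose; the angle and the data values are arbitrary.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

theta = math.pi / 6  # arbitrary rotation angle (illustrative)
P = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]

# X is (m x n): m = 2 measurements (rows) on n = 3 samples (columns).
X = [[1.0, 2.0, 3.0],
     [0.5, 1.5, 2.0]]

Y = matmul(P, X)                  # change of basis, Y = PX
X_back = matmul(transpose(P), Y)  # for a rotation, the inverse of P is its transpose

# X_back agrees with X up to floating-point rounding.
print(all(abs(a - b) < 1e-12
          for row_a, row_b in zip(X, X_back)
          for a, b in zip(row_a, row_b)))
```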


Optimal choice of basis

We wish to reveal patterns embedded in X

Hence we should construct P to compress (i.e. project) the structured parts of X (the signal) into the first few bases (dimensions of Y)

Then we can visualise and explore X in a much lower dimensional space

That is, choose the first few p1, p2, . . . , pk with k << m to preserve the signal, so that pk+1, . . . , pm contain unstructured variation (noise)


Optimal choice of basis

We still need to define precisely what we mean by “optimal”

We shall first constrain P to form an orthonormal basis (to make the problem well posed)

$$PP^T = I$$


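Think of a scatterplot: each pi is then akin to a rotation of the original axes

$$p_i \cdot p_i^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_i \\ | \end{bmatrix} = 1$$

and

$$p_i \cdot p_j^T = \begin{bmatrix} - & p_i & - \end{bmatrix} \begin{bmatrix} | \\ p_j \\ | \end{bmatrix} = 0 \quad \text{for } j \neq i$$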

Signal-to-noise

If we believe the unstructured noise is identically distributed and independent across the m measurements

Then a second order statistic to define “optimality” is to maximise the signal-to-noise ratio (SNR),

$$\mathrm{SNR} = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}}$$

where $\sigma^2$ denotes the variance; note the assumption here is that variance is a good measure of “signal”

So SNR >> 1 suggests a lot of signal (pattern) in the data


Constructive derivation of basis

Suppose we wish to construct our first basis (projection) p1

If the noise is common across measurements (equivalent to a ball of noise in the original axes) then to maximise the SNR in the first projection we can simply maximise the variance of the first axis (row) of Y

Let $C_Y$ denote the variance-covariance matrix of Y, and assume Y is centred (mean zero in each direction)

$$C_Y = \frac{1}{n-1} Y Y^T$$

so $(C_Y)_{11}$ records the variance along the first row of Y


So find p1 so as to maximise the spread of points (as defined by the variance) along the first row of Y, $Y_{1\cdot}$

Then, having set p1, find p2 that maximises the variance, subject to $p_1 \cdot p_2^T = 0$

and then p3 subject to { $p_1 \cdot p_3^T = 0$ and $p_2 \cdot p_3^T = 0$ }, and so on for p4 etc.


Calculating the basis – for the more mathematical among you

Recall we are trying to find P so as to maximise the diagonal elements of $C_Y = \frac{1}{n-1} Y Y^T$ and,

$$\begin{aligned}
C_Y &= \frac{1}{n-1} Y Y^T \\
    &= \frac{1}{n-1} (PX)(PX)^T \\
    &= \frac{1}{n-1} P X X^T P^T \\
    &= \frac{1}{n-1} P (X X^T) P^T
\end{aligned}$$

and we recognise $XX^T$ as the (unnormalised) variance-covariance of X


Eigen-decomposition

$XX^T$ is square symmetric and hence has an eigen-decomposition

$$XX^T = U D U^T$$

where U is a matrix whose columns are the eigenvectors of $XX^T$ and D is a diagonal matrix storing the m eigenvalues of $XX^T$ in decreasing order

Now, select the rows of P to be the eigenvectors of $XX^T$, so $P = U^T$,

in which case,

$$\begin{aligned}
C_Y &= \frac{1}{n-1} P (X X^T) P^T \\
    &= \frac{1}{n-1} P (P^T D P) P^T \\
    &= \frac{1}{n-1} (P P^T) D (P P^T) \\
    &= \frac{1}{n-1} (I) D (I) \\
    &= \frac{1}{n-1} D
\end{aligned}$$

*** Setting $P = U^T$ makes P an orthonormal basis, $PP^T = I$, and maximises the variance in each direction *** !!!
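The first principal direction p1 can be found numerically as the leading eigenvector of $XX^T$. The sketch below uses plain power iteration, swapped in here only to keep the example dependency-free (in practice a library eigen-solver would normally be used); the data table and function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def top_eigenvector(S, n_iter=200):
    """Power iteration for the leading eigenpair of a symmetric matrix S."""
    v = [1.0] * len(S)
    for _ in range(n_iter):
        w = matvec(S, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(S, v)))
    return eigenvalue, v

# Hypothetical data: m = 2 correlated measurements (rows), n = 5 samples (columns).
X = [[2.0, 0.0, -1.0, 1.0, -2.0],
     [1.9, 0.2, -1.1, 0.8, -1.8]]
n = len(X[0])

# Centre each row so that each measurement has mean zero.
X = [[x - sum(row) / n for x in row] for row in X]

# Unnormalised variance-covariance matrix, S = X X^T.
S = [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

lam1, p1 = top_eigenvector(S)

# First row of Y = PX, i.e. the samples projected onto p1.
y1 = [sum(p * x for p, x in zip(p1, col)) for col in zip(*X)]
var_y1 = sum(y * y for y in y1) / (n - 1)

# Variances along the original axes, for comparison.
row_vars = [S[i][i] / (n - 1) for i in range(len(S))]

print(var_y1, lam1 / (n - 1), row_vars)
```

The variance along p1 matches the leading eigenvalue divided by n − 1, in line with $C_Y = \frac{1}{n-1} D$, and exceeds the variance along either original axis.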


Example of PCA

In population genetics and genetic epidemiology we often genotype individuals from a large cross-section of the population

Genotyping measures positions in the genome where we know common variation exists between individuals

- there are roughly 3,000,000 DNA bases in the human genome where we can expect greater than 1% of the population to show variation

E.g. suppose at a locus we know that some people might be {A,A} and some {A,T} and others {T,T}. We could encode this as {0, 1, 2}

Now construct a large data table X with elements (X)ij ∈ {0, 1, 2} for say m = 3,000,000 loci (rows) and n = 1000s of individuals (columns)
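The {0, 1, 2} encoding amounts to counting copies of one chosen allele at each locus. The sketch below assumes T is the allele being counted and uses a tiny invented genotype table (2 loci by 3 individuals); the function name `encode` is hypothetical.

```python
def encode(genotype, alt="T"):
    """Count copies of the `alt` allele in an unordered genotype pair."""
    return sum(allele == alt for allele in genotype)

# Hypothetical genotypes: rows are loci, columns are individuals.
genotypes = [
    [("A", "A"), ("A", "T"), ("T", "T")],
    [("A", "T"), ("A", "A"), ("A", "A")],
]

# Data table X with elements in {0, 1, 2}.
X = [[encode(g) for g in locus] for locus in genotypes]
print(X)  # [[0, 1, 2], [1, 0, 0]]
```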

Perform PCA on X and project individuals into the first few PCs and explore


PCA on genotype matrix mirrors geography, Nature, 456, 98-101


What have we learnt?

Statistics is the scientific discipline concerned with the analysis and interpretation of data (in an increasingly data rich world)

Statistics concerns itself with uncertainty, quantified using probability, and updated using probability calculus

The analysis of data should proceed via exploratory analysis followed by formal modelling (if required)

Exploratory analysis involves graphical interrogation of the data and summary statistics

In high-d data, cluster analysis and PCA can provide useful tools for exploration of structure


Key References

◦ Cleveland, W. S. (1993) Visualising data. Hobart Press

◦ Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. 2nd Ed.

◦ Savage, L. J. (1954). The Foundations of Statistics. Dover. - worth reading Chapters 1-8 for those fascinated with foundations of subjective probability (Bayesian stats)

◦ Tufte, E. (2001) The Visual Display of Quantitative Information. 2nd Edn. Graphics Press. - you will never look at a graph in the same way again

◦ Wainer, H. (1984). How to display data badly. American Statistician. Vol. 38, No. 2

