introduction to statistics alastair kerr, phd. think about these statements (discuss at end)...

Introduction to Statistics

Alastair Kerr, PhD

Think about these statements (discuss at end)

Paraphrased from real conversations:
– “We used a t-test to compare our samples”
– “These genes are the most highly expressed in my experiment: this must be significant”
– “No significant difference between these samples therefore the samples are the same”
– “Yes I have replicates, I ran the same sample 3 times”
– “We ignored those points, they are obviously wrong!”
– “X and Y are related as the p-value is 1e-168!”
– “I need you to show this data is significant”

Basic Probability

Which of these sequences of numbers is random? (outcomes 0 or 1, unsorted data)

1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Binomial Distribution

Thought experiment
• Everyone flip a coin 10 times and count the number of ‘heads’
• Most frequent observation? Least? Pattern of observations between these?

How would these factors affect the graph shape?
• Using a die instead of a coin and looking for the number 6?
• Increasing the number of times the coin was flipped?
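The thought experiment can also be sketched in code. This is only an illustration and not part of the original slides: the helper name `binomial_pmf` is made up here, and the 10 trials come from the coin example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10  # number of flips or rolls per person

# Probability of 0..10 heads with a fair coin, and 0..10 sixes with a fair die
coin = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]
die = [binomial_pmf(k, n, 1 / 6) for k in range(n + 1)]

for k in range(n + 1):
    print(f"{k:2d}  coin={coin[k]:.4f}  die={die[k]:.4f}")

# The most frequent observation is the k with the highest probability
most_likely_coin = max(range(n + 1), key=lambda k: coin[k])
most_likely_die = max(range(n + 1), key=lambda k: die[k])
```

The coin column is symmetric around its peak, while the die column is skewed towards low counts because a six is much less likely than a head. Raising `n` is a quick way to explore the second question.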

Binomial Distributions

Types of Data

Discrete or Continuous
• Discrete: only distinct, countable values are possible (e.g. counts)
• Continuous: any value within a range is possible, so infinitely many values...

Parametric or Non-parametric
• Parametric data fits a known distribution
• Fits specific properties
• Specific tests are available if and only if the data is parametric

Normal Distribution

• the curve has a single peak
• the mean (average) lies at the centre of the distribution
• the distribution is symmetrical around the mean
• the two tails of the distribution extend indefinitely and never touch the horizontal axis (continuous distribution)
• the shape of the distribution is determined by its Mean (µ) and Standard Deviation (σ).

Variance and standard deviation

Variance is just how dispersed your data is from the mean.
• Formalised: "The average of the square of the distance of each data point from the mean"

Standard deviation is the square root of the variance
• aka RMS [root mean square] deviation
• Really just the distance to the mean from an ‘average’ sample
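A small worked example of the definitions above, using made-up numbers. It follows the slide's wording literally (divide by the number of points, i.e. the population variance); the sample variance used by many tools divides by n - 1 instead.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values

mean = sum(data) / len(data)

# "The average of the square of the distance of each data point from the mean"
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: square root of the variance (RMS deviation)
sd = sqrt(variance)

print(mean, variance, sd)
```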

Normal distribution

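About 95% of the data are within 2σ [standard deviations] of the mean (more precisely, 1.96σ)

aka the 95% confidence interval (loosely: strictly, a confidence interval describes the uncertainty of an estimate, not the spread of the data)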
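The 2σ rule of thumb can be checked against the standard normal distribution. A minimal sketch using Python's `statistics.NormalDist`; the helper name `fraction_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)

# Number of standard deviations that encloses exactly 95% of the distribution
z_for_95 = standard_normal.inv_cdf(0.975)

print(f"within 1 sd: {within_1sd:.4f}")
print(f"within 2 sd: {within_2sd:.4f}")
print(f"95% lies within {z_for_95:.3f} sd")
```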

Understanding 'average'

When talking about average or mean, we commonly refer to the arithmetic mean.
• sum of samples / number of samples

Other Pythagorean means: geometric and harmonic
• geometric mean – average of factors
• harmonic mean – average of rates

Other ways to describe
• Mode – most common value
• Median – central value in an ordered list of numbers

When geometric mean is useful

• nth root of the product of n numbers
• Or the mean of the log values of a dataset, converted back by taking the antilog (e.g. 10^x for base-10 logs)
• Suits factors such as ratio microarray data, e.g. 'fold change' or other non-linear proportions
• Less sensitive to extremely large values, so it can be applied to data with relatively large fluctuations
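Both routes to the geometric mean give the same answer, which a short sketch can confirm. The fold-change values are made up, and natural logs are used here; any base works as long as the same base is used to convert back.

```python
from math import exp, log, prod
from statistics import geometric_mean, mean

fold_changes = [0.5, 2.0, 4.0, 8.0]  # made-up ratios

# Route 1: nth root of the product of n numbers
via_root = prod(fold_changes) ** (1 / len(fold_changes))

# Route 2: mean of the logs, converted back with the matching antilog
via_logs = exp(mean(log(x) for x in fold_changes))

# The standard library provides the same calculation directly
builtin = geometric_mean(fold_changes)

# Arithmetic mean for comparison: pulled upwards by the largest ratio
arithmetic = mean(fold_changes)

print(via_root, via_logs, builtin, arithmetic)
```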

When harmonic mean is useful

• Mean of the reciprocals of the values, then take the reciprocal again to convert back
• Looking at ‘rates of change’: I’ve used it for the rate of change of nucleotide substitutions
• Gives the lowest value of the three means (for positive values)
• Good way of limiting the effect of outliers (if outliers are all large values…)
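A short comparison of the three Pythagorean means on made-up positive values with one large outlier. It is a sketch to illustrate the ordering, not data from the lecture.

```python
from statistics import geometric_mean, harmonic_mean, mean

rates = [1.0, 2.0, 4.0, 100.0]  # made-up rates; 100 is a large outlier

# Harmonic mean by hand: reciprocal of the mean of the reciprocals
by_hand = len(rates) / sum(1 / r for r in rates)

hm = harmonic_mean(rates)
gm = geometric_mean(rates)
am = mean(rates)

print(f"harmonic={hm:.3f}  geometric={gm:.3f}  arithmetic={am:.3f}")
```

For positive values the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the large outlier moves the harmonic mean the least.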

Why use median?

Remember median is the central value of a ranked list
• What is the median of <pick 5 numbers>
• Great to use for skewed distributions
• Similar to the mean in a normal distribution. Why?
• Cannot really use SD or variance – instead use quartiles and interquartile range [IQR]
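A tiny made-up example of why the median suits skewed data: one extreme value drags the mean but leaves the median alone.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]   # made-up, evenly spread values
skewed = [1, 2, 2, 3, 50]     # made-up, one extreme value

symmetric_mean, symmetric_median = mean(symmetric), median(symmetric)
skewed_mean, skewed_median = mean(skewed), median(skewed)

print("symmetric:", symmetric_mean, symmetric_median)
print("skewed:   ", skewed_mean, skewed_median)
```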

Quartiles and Quantiles

Quantiles are points taken at regular intervals on a ranked list of data
• The 100-quantiles are called percentiles.
• The 10-quantiles are called deciles.
• The 5-quantiles are called quintiles.
• The 4-quantiles are called quartiles.

Quartiles
• 'middle 50', or interquartile range [IQR] = 1st to 3rd quartile
• first quartile (lower quartile): cuts off lowest 25% of data = 25th percentile
• second quartile (median): cuts data set in half = 50th percentile
• third quartile (upper quartile): cuts off highest 25% of data, or lowest 75% = 75th percentile
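Quartiles and other quantiles can be computed with Python's `statistics.quantiles`. This is a minimal sketch on made-up data; note that software packages use slightly different interpolation rules, so exact cut points can differ between tools.

```python
from statistics import median, quantiles

data = list(range(1, 12))  # made-up data: the integers 1 to 11

# n=4 returns the three cut points that split the data into quartiles
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: the 'middle 50'

# n=10 returns the nine cut points that split the data into deciles
deciles = quantiles(data, n=10)

print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("deciles:", deciles)
```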

Visualisation: boxplot

aka candlestick
• box = 50% of data
• whisker = lines
• dots = outliers

Easy way to visualise the properties of multiple distributions beside each other
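The numbers behind a boxplot can be sketched without any plotting library. The slides do not say how far the whiskers extend; the 1.5 × IQR rule used below is a common convention and is an assumption here, as is the helper name `boxplot_summary`.

```python
from statistics import quantiles

def boxplot_summary(values, whisker=1.5):
    """Box, whisker ends and outliers, assuming the 1.5 x IQR whisker rule."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "box": (q1, med, q3),                  # the middle 50% of the data
        "whisker_low": inside[0],              # lowest point that is not an outlier
        "whisker_high": inside[-1],            # highest point that is not an outlier
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40])  # made-up data
print(summary)
```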

Visualisation: Cumulative Distribution Function

How does this CDF differ?

Hypothesis testing

Define your question
• Bad: “Is this significant?”
• You need to compare to a model, usually that model is random chance
• Good: “Does this data differ significantly from random chance compared to this other set?”

Hypothesis testing

Test a hypothesis NOT a result
• Bad: Gene XYZ is the most expressed in our data set, is it significant?
• OK to get a hypothesis to test from eye-balling data, but define it on a biological concept, not a cherry-picked data point
• OK to use to build a hypothesis: cold shock protein cspC is the most expressed gene, does this experiment enrich for cold shock proteins?
• OK if enough REPLICATES

Hypothesis testing

'Bayesian' analysis
– the model you test against is not random chance
– instead 'priors' exist: knowledge of the system
– Examples
  • The 3 envelope puzzle
  • Odds at racing

Hypothesis testing

Test if parametric by using a non-parametric test against the normal distribution, e.g. Shapiro-Wilk or Anderson-Darling test

Question: are samples A and B different?
• Null hypothesis: the differences between A and B are from random chance
• The test asks: what is the likelihood of seeing differences this large if that were true?

You are testing ONE hypothesis. If it does not pass, the inverse question is not necessarily true
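One way to make "differences from random chance" concrete is a permutation test: shuffle the group labels many times and count how often chance alone gives a difference at least as large as the one observed. The slides do not mention this test; it is swapped in here purely as an illustration, and the function name, sample values and number of permutations are all assumptions.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign values to the two groups at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Made-up measurements for two groups
a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.0, 6.1]

p_different = permutation_p_value(a, b)
p_identical = permutation_p_value(a, a)
print(p_different, p_identical)
```

A large p-value for the identical groups only means the test found no evidence of a difference. It does not prove the groups are the same.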

Testing 2 groups

If Normal Distribution
• Analysis of variance [ANOVA], e.g. t-test
• Most powerful tests to use but data MUST resemble a parametric distribution

If Non-Parametric
• KS [Kolmogorov-Smirnov] test (Q-Q testing)
• Mann-Whitney (rank sum)
• Chi-squared
• Fisher's exact test if small numbers
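To show what a rank-based test does, here is a hand-rolled sketch of the Mann-Whitney U statistic. It uses a normal approximation for the p-value with no tie or continuity correction, which is a simplifying assumption; for real analyses a statistics package is the safer choice. The function name and the sample values are made up.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p-value) for two independent samples."""
    n_a, n_b = len(a), len(b)
    # U for sample a: number of (x, y) pairs where x > y, with ties counting half
    u_a = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    u_b = n_a * n_b - u_a
    u = min(u_a, u_b)
    # Under the null hypothesis U is roughly normal with this mean and spread
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu) / sigma
    p = min(1.0, 2 * NormalDist().cdf(z))
    return u, p

low = [1, 2, 3, 4, 5]     # made-up values
high = [6, 7, 8, 9, 10]   # made-up values, all larger than 'low'

u_separated, p_separated = mann_whitney_u(low, high)
u_same, p_same = mann_whitney_u(low, low)
print(u_separated, p_separated)
print(u_same, p_same)
```

Only the ordering of the values matters here, not their size, which is why the test makes no assumption about the shape of the distribution.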

Test if parametric by using a non-parametric test against the normal distribution – e.g. Shapiro-Wilk or Anderson-Darling test

P-values: multiple testing

P-values: Correlation & Causation

Replicates

• Your statement about your data is limited by what you tested by replication.
  – It may be significant but for different reasons than you think
• Replicates show the noise in the system: but what system?
  – Technical: each experimental unit
    • Machine variance
    • Pipetting variance, temperature variance...
  – Biological: changes in what you are examining
    • from person to person, cell to cell, growth condition to growth condition

Define the Number of Biological Repeats

Discuss the problems with each of these

“We used a t-test to compare our samples”

“These genes are the most highly expressed in my experiment: this must be significant”

“No significant difference between these samples therefore the samples are the same”

“Yes I have replicates, I ran the same sample 3 times”

“We ignored those points, they are obviously wrong!”

“X and Y are related as the p-value is 1e-168!”

“I need you to show this data is significant”
