biostatistics. why statistics? you want to make the strongest conclusions based on limited data...

BioStatistics

Why Statistics?

You want to make the strongest conclusions based on limited data

Differences in biological systems sometimes cannot be easily observed

Random variation?

Real difference?

Statistics are sometimes unnecessary

Large differences in observed events

And small scatter within groups

In most instances, though, the use of statistics can provide you with mathematically-based conclusions

Clinical research

Field research

Statistics extrapolate from sample to population

The only way to draw absolute conclusions about a population is to measure the trait(s) of interest of every individual in that population

The reality is, this is almost always impossible to do

Thus, randomly sampling some of the individuals can provide information about the entire population

Sometimes random sampling can be difficult to define

If your sample is not random, then conclusions drawn from it are not reliable
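A minimal sketch of extrapolating from a random sample to a population. The population here is simulated, and its size, mean, spread, and the seed are made-up values chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements (made-up values).
population = [rng.gauss(100.0, 15.0) for _ in range(10_000)]

# Simple random sample: every individual has the same chance of selection.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean:    {population_mean:.2f}")
print(f"sample mean (n=50): {sample_mean:.2f}")
```

The sample mean lands near the population mean but rarely equals it, which is why the sample value is treated as an estimate.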

Samples and Populations

Quality control

A company manufactures 20,000 vials (population) of a vaccine from a single production run

About 50 vials (samples) are taken from this production run and analyzed for a variety of characteristics

The results on 50 vials are then extrapolated to the remaining vials


Political polls

The number of eligible U.S. voters is about 125,000,000 (population)

A few hundred or thousands (sample) are asked to respond to political questions


Clinical studies

Patients in a clinical study (sample) have a clinical condition (e.g., disease)

They rarely reflect the entire population

However, they often reflect the population with the condition

Sampling humans can be particularly difficult


Field experiments

Local variations

Impact of weather

Environmental conditions/changes

Human impact

Sampling bias


Laboratory experiments

Statistics are usually not necessary

Highly-controlled experiments

Single variable

Genetically-defined organisms

Very little variation

What statistical calculations can do

Statistical estimation

Calculation of a mean within a sample gives a precise number

However, the number is only an estimate of the mean of the whole population

Statistical hypothesis testing

Helps determine if an observed difference is due simply to random chance

Provides a P value; if P is small, a difference this large would be unlikely to arise from random chance alone, and the conclusion is called statistically significant

Statistical modeling

Tests how well experimental data fit a mathematical model

The most common form of statistical modeling is linear regression

LR usually determines the best straight line through a set of data points
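As a rough sketch of how linear regression finds the best straight line, the function below computes the ordinary least-squares slope and intercept by hand. The function name `fit_line` and the dose/response numbers are hypothetical, invented for this example.

```python
import statistics

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data: dose (x) and response (y).
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(dose, response)
print(f"response = {slope:.2f} * dose + {intercept:.2f}")
```

The line minimizes the sum of squared vertical distances between the data points and the line.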

What statistical calculations cannot do

Analysis of a simple experiment

Define a population you are interested in

Randomly select a sample of subjects to study

Randomly split the sample subjects into two groups

One group gets one treatment

The other group gets another treatment

Measure a single variable trait in each subject

Use statistical tests to determine if there’s a difference between the groups
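The random split into two treatment groups can be sketched as follows. The helper name `randomize_two_groups` and the subject IDs are hypothetical; a real study would follow its own randomization protocol.

```python
import random

def randomize_two_groups(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]

group_a, group_b = randomize_two_groups(subjects, seed=1)
print("treatment A:", sorted(group_a))
print("treatment B:", sorted(group_b))
```

Letting chance decide group membership keeps the experimenter's choices from biasing which subjects receive which treatment.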

What statistical calculations cannot do

The problems with real experiments

Populations can be more diverse than your samples

Samples are collected by convenience rather than randomly

The measured value is a proxy for what you’re really interested in

Errors in data collection

Data are recorded incorrectly

Assays may not report what you think they report

You need to combine different types of measurements to reach an overall conclusion (multiple variables)

Why statistics are difficult to learn

Deceptive terminology (significant, error, hypothesis)

Statistical conclusions are never absolute (statistically significant)

Statistics uses abstract concepts (populations, probabilities)

Statistics are at the interface of math and science

Many statistical calculations require complex math

Variables

Independent variable - The variable scientists manipulate to evaluate a response

Dependent variable - The variable (i.e., trait) resulting from a treatment with an independent variable

Variables

Types of variables in biology

Measurement variables

Continuous

Discontinuous

Ranked variables

Attributes

Variables

Measurement variables - Those whose differing states can be expressed in a numerically-ordered fashion

Continuous

Can assume any value between two distinct points

For example, there are infinitely many values between 1.5 and 1.6

Include: lengths, areas, volumes, weights, angles, temperatures, periods of time, percentages, rates

Discontinuous (discrete)

Can take only certain fixed numerical values

The number of segments in an insect’s appendage may be 4, 5, or 6, but not 4.3

Variables

Ranked variables

Variables that cannot be measured, but can be placed in order (ranked)

For example, order of emergence of pupae without regard to time

Attribute variables

Variables that cannot be measured, but must be expressed qualitatively

For example: black/white; pregnant/nonpregnant; male/female; live/dead

Appropriate tests

Design: 1 variable, 1 sample

Measurement variables: computing median and frequencies; computing means; computing standard deviations

Attribute variables: confidence limits for percentages; runs test for randomness

Design: 1 variable, 2 samples

Measurement variables: t-tests; test of equality; paired comparisons test

Ranked variables: Mann-Whitney U-test; Kolmogorov-Smirnov two-sample test

Attribute variables: testing differences between two percentages

Design: 1 variable, 2+ samples

Measurement variables: ANOVA; Tukey-Kramer test

Ranked variables: Kruskal-Wallis test; Friedman’s randomized block test

Attribute variables: G-test for percentages

Design: 2 variables, 1 sample

Measurement variables: regression analysis; polynomial regression; Olmstead and Tukey’s corner test

Ranked variables: ordering test; Spearman’s rank test

Attribute variables: chi-square test; Fisher’s exact test

Means and Standard Deviations

The mean is the average of a measured trait in a sample from a population

In biology, we usually compare two or more populations, which we call groups

The standard deviation measures the scatter of values around the mean (it is the square root of the variance)

Many statistical tests use means and standard deviations to determine if there are significant differences between groups
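A short sketch of computing the mean and standard deviation for two groups with Python's standard `statistics` module. The group names and measurements are made up for illustration.

```python
import statistics

# Made-up measurements (e.g., body mass in grams) for two groups.
control = [21.3, 22.1, 20.8, 21.9, 22.4, 21.0]
treated = [24.0, 23.1, 25.2, 24.6, 23.8, 24.9]

for name, values in (("control", control), ("treated", treated)):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    print(f"{name}: mean = {mean:.2f}, SD = {sd:.2f}")
```

`statistics.stdev` is the sample standard deviation, which is the usual choice when the values are a sample rather than the whole population.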

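Null hypothesis

Assumes there is no real difference or effect (for example, that two groups have the same mean)

Statistics can be used to reject the null hypothesis, although never to prove or disprove it absolutely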

This lends support to an alternative hypothesis

Nearly every experiment that uses statistics should define null and alternative hypotheses
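One way to see what a null hypothesis means in practice is a permutation test. This technique is not covered in these slides and is swapped in here only because it needs no statistical tables. If the null hypothesis is true, group labels are interchangeable, so the labels are shuffled many times and the shuffled differences are compared with the observed one. The function name, data, and number of permutations are assumptions made for this sketch.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided P value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            extreme += 1
    # Adding 1 to both counts keeps the estimate from being exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up measurements for two groups.
control = [21.3, 22.1, 20.8, 21.9, 22.4, 21.0]
treated = [24.0, 23.1, 25.2, 24.6, 23.8, 24.9]

p = permutation_p_value(control, treated)
print(f"estimated P value: {p:.4f}")
```

A small P value means that shuffled labels rarely produce a difference as large as the observed one, which is evidence against the null hypothesis and lends support to the alternative.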

Student’s t-test

Determines if there is a significant difference between the means of two groups of measured data

Paired - compares matched values between members of a group

Unpaired - assumes values between members are not related

Assumes values fit a normal (aka Gaussian) distribution (“bell curve”)

If they do not, then use nonparametric testing

One-tailed vs. two-tailed

One-tailed: You must specify which group will have a larger mean in advance of data collection

Two-tailed: You do not know which group will have a larger mean in advance of data collection

Student’s t-test

P value: Is there a significant difference between the means of the two groups?

Generally, if the P value is less than or equal to 0.05, then the difference is considered significant

t-value:

Positive if the first mean is larger than the second and negative if it is smaller
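The sign convention for the t-value can be checked with a small sketch of the unpaired Student's t statistic using a pooled variance. The function name `unpaired_t` and the data are hypothetical. Turning t into a P value needs the t distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would normally be used for that step.

```python
import math
import statistics

def unpaired_t(group_1, group_2):
    """Return (t, degrees of freedom) for Student's unpaired t-test (pooled variance)."""
    n1, n2 = len(group_1), len(group_2)
    mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
    var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / std_err, df

# Made-up measurements for two groups.
control = [21.3, 22.1, 20.8, 21.9, 22.4, 21.0]
treated = [24.0, 23.1, 25.2, 24.6, 23.8, 24.9]

t_first_larger, df = unpaired_t(treated, control)   # first mean larger: t > 0
t_first_smaller, _ = unpaired_t(control, treated)   # first mean smaller: t < 0
print(f"t = {t_first_larger:.2f} (df = {df}); reversed order gives t = {t_first_smaller:.2f}")
```

Swapping the order of the groups flips the sign of t but leaves its magnitude unchanged.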

Student’s t-test

Confidence interval

The calculated mean is unlikely to be exactly the same as the mean of the entire population

Assumes your samples are randomly collected and fit a normal distribution

If your sample is large with a small standard deviation, then your calculated mean likely is close to the actual mean

The CI is a calculation based upon sample size and standard deviation

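If the CI is 95%, then the range around your calculated mean (the mean plus or minus a margin based on the standard deviation and sample size, not the standard deviation itself) is expected to include the actual mean of the population under study in 95% of repeated samples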
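A rough sketch of a 95% confidence interval for a mean, using the normal approximation (a multiplier of about 1.96). This is an assumption that is reasonable only for fairly large samples; for small samples the multiplier should come from the t distribution, which gives a wider interval. The function name `mean_ci_95` and the simulated data are made up for illustration.

```python
import math
import random
import statistics

def mean_ci_95(values):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.975)         # about 1.96
    return mean - z * std_err, mean + z * std_err

rng = random.Random(7)
values = [rng.gauss(50.0, 5.0) for _ in range(40)]  # simulated sample, made-up parameters

low, high = mean_ci_95(values)
print(f"mean = {statistics.mean(values):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the margin is the standard deviation divided by the square root of the sample size, a larger sample or a smaller standard deviation gives a narrower interval.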
