Statistical Genomics and Bioinformatics Workshop, 8/16/2013

Statistical Genomics and Bioinformatics Workshop:

Genetic Association and RNA-Seq Studies

Overview of Genetics, Data Resources, Terminology and Statistics

Brooke L. Fridley, PhD
University of Kansas Medical Center

Schedule for the Day
• 9:45 – 10 am: Morning Break
• 11:30 – 12:30 pm: Lunch Break
• 2:30 – 2:45 pm: Afternoon Break

Schedule of Topics:
• Overview of Genetics and Genomics
– Genetics
– Technologies for genotyping
– Databases and publicly available resources
• Review of Statistical Aspects
– Study Design
– Power/Sample Size
– Hypothesis Testing

Schedule for the Day (cont.)
• Population Genetics (LD)
• Genetic Association Studies
– Study Design
– Quality Control
– Genetic Models and Association Methods
– Haplotypes
– Power / Sample Size
– Population Stratification
– Genotype Imputation
– Multiple locus methods
• Example: GWAS for Hormone Levels

Schedule for the Day (cont.)
• Multiple Testing
– FWER
– FDR
– Permutation based p-values
• Example: Acetaminophen toxicity GWAS
• Limitations and Common Errors with GWAS
• RNA-Seq
– Goals and review of types of RNAs
– NGS and Experimental Design
– Bioinformatics and processing RNA-Seq data
– Quality Control
– Differential Expression Testing Methods
• Clustering
– Goals
– Methods
– Validation

REVIEW OF GENETICS

5

Individualized Medicine

6


4

Anticipated benefits of Individualized Medicine

• More powerful medicines

• Better, safer drugs the first time

• More accurate methods of determining appropriate drug dosages

• Advanced screen for diseases

• Better vaccines

• Improvements in the drug discovery and approval process

• Decrease in the overall cost of health care

From the Human Genome Project website: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml#whatis

Integrative ‘Omics

Genome DNA

RNA

Proteins

Transcriptome

Proteome

Metabolome

Epigenome

Metabolites (e.g. Lipids)

Phenotype& Function

Phenome

Regulatory Elements

8


5

DNA‐mRNA‐Protein

9

p

Centromere

q

(Chromosome 5)

Telomere

22 pairs of autosomes, 1 pair of sex chromosomes

Humans have 46 chromosomes

10


6

5' end

Promoter

Start site

3' end

Stop site

Intron Exon 2 IntronExon 1 Exon 3

Splice sites

Exon 2Exon 1 Exon 3

Messenger RNA

The exons encode the actual “blueprint” for a protein

Gene Structure

11

Adenine (A)

Thymine (T)

Cytosine (C)

Guanine (G)

Nucleotidebases

Sugar phosphate backbone

Base pair

The DNA double helix

12


7

Adenine (A)

Thymine (T)

Cytosine (C)

Guanine (G)

T A A T A C T C A T T G G G T C

A T T A T G A G T A A C C C A G

DNA (uncoiled)

13

Adenine (A)

Thymine (T)

Uracil (U)

Cytosine (C)

Guanine (G)

T A A T A C T C A T T G G G T C

A U U A U G A G U A A C C C A G

Codons: DNA bases are read in threes

14


8

Genetic Code

• A codon is made of 3 bases (nucleotides)
• There are 64 possible codons
– 1 codon (AUG) encodes methionine and starts translation of all proteins
– 3 codons stop protein translation
– 61 codons encode 20 amino acids (redundant code)

Examples: AUG = Met, GCA = Ala, UAA = stop
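The codon counts on this slide (64 codons, 3 stop codons, 61 codons for 20 amino acids) can be checked with a short Python sketch. It builds the standard genetic code table from the conventional U, C, A, G ordering. The names `CODON_TABLE` and `translate` are illustrative and are not part of the workshop material.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G order.
# '*' marks a stop codon. Names here are illustrative only.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


n_stop = sum(aa == "*" for aa in CODON_TABLE.values())
n_sense = len(CODON_TABLE) - n_stop
n_amino = len(set(CODON_TABLE.values()) - {"*"})
print(len(CODON_TABLE), n_sense, n_stop, n_amino)  # 64 61 3 20
print(translate("AUGGCAUAA"))  # MA (Met, Ala, then stop)
```

One-letter amino acid codes are used (M = Met, A = Ala), so the slide's example codons AUG, GCA and UAA translate to "MA" followed by a stop.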

15

DNA Mutation

• A mutation is a change in the normal DNA base pair sequence

16


9

Functional protein Nonfunctional or missing protein

Proteins are chains of amino acids

Mutations can cause disease

17

SNP Markers

• SNP (single nucleotide polymorphism): a single-base difference between sequences, e.g.
AATGCAGGTGCAATCGATTTC
AATGCAGGTGCAATTGATTTC

• SNPs make up 90% of all human genetic variation

• SNPs with a minor allele frequency of ≥ 1% occur, on average, every 100 to 300 bases along the 3-billion-base human genome.

• Variations in the DNA sequences of humans can affect how humans develop disease or respond to drug treatments (pharmacogenomics)
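As a small illustration of what a SNP looks like in data, the following Python sketch compares the two example sequences from this slide position by position. The function name `find_snps` is hypothetical, and the sketch assumes the sequences are already aligned and of equal length.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch; positions are 0-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


allele_a = "AATGCAGGTGCAATCGATTTC"
allele_b = "AATGCAGGTGCAATTGATTTC"
snps = find_snps(allele_a, allele_b)
print(snps)  # [(14, 'C', 'T')]
```

The two sequences differ at a single base (C versus T), which is the SNP shown on the slide.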

18


10

Some types of mutations

Normal: THE BIG RED DOG RAN OUT.
Missense: THE BIG RAD DOG RAN OUT.
Nonsense: THE BIG RED.
Frameshift (deletion): THE BRE DDO GRA.
Frameshift (insertion): THE BIG RED ZDO GRA.

19

Polymorphisms

• A change in the normal DNA base pair sequence

• Mutations that do not alter protein function can become common in the population

• A polymorphism is defined as a ‘common’ genetic change; a frequency above 1% is usually considered common.

20


11

Marker Types

Alternative forms of a DNA sequence or gene

SNP
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGTGCAATTGATTTC…….

Insertion/Deletion
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGATTTC…….

Microsatellite
allele A …….AATGCGAGAGAGAGAGATTTC…….
allele B …….AATGCGAGAGAGATTTC…….

21

SNPs in the Human GenomeGAAATAATTAATGTTTTCCTTCCTTCTCCTATTTTGTCCTTTACTTCAATTTATTTATTTATTATTAATATTATTATTTTTTGAGACGGAGTTTCACTCTTGTTGCCAACCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCACACTCCGCTTTC[C/T]GGTTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGTCACACACCACCACGCCCGGCTAATTTTTGTATTTTTAGTAGAGTTGGGGTTTCACCATGTTGGCCAGACTGGTCTCGAACTCCTGACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCTGGGATTACAGGCGTGAGCCACCGCGCTCGGCCCTTTGCATCAATTTCTACAGCTTGTTTTCTTTGCCTGGACTTTACAAGTCTTACCTTGTTCTGCCTTCAGATATTTGTGTGGTCTCATTCTGGTGTGCCAGTAGCTAAAAATCCATGATTTGCTCTCATCCCACTCCTGTTGTTCATCTCCTCTTATCTGGGGTCACTTTTATCTCTTCGTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCTCTGCTTTCCCAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAATGCATTGTAAGTTAAATTATTAAAGATTTTAAATATAGGAAAAAAGTAAGCAAACATAAGGAACAAAAAGGAAAGAACATGTATTCTAATCCATTATTTATTATACAATTAAGAAATTTGGAAACTTTAGATTACACTGCTTTTAGAGATGGAGATGTAGTAAGTCTTTTACTCTTTACAAAATACATGTGTTAGCAATTTTGGGAAGAATAGTAACTCACCCGAACAGTGTAATGTGAATATGTCACTTACTAGAGGAAAGAAGGCACTTGAAAAACATCTCTAAACCGTATAAAAACAATTACATCATAATGATGAAAACCCAAGGAATTTTTTTAGAAAACATTACCAGGGCTAATAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTGTGTCTGTGTGAGAATTCTAGAGTTATATTTGTACATAGCATGGAAAAATGAGAGGCTAGTTTATCAACTAGTTCATTTTTTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTGTGTCTGTGTGTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTGTGTCTGTGTGAAA[A/C]AGTCTAACACATCCTAGGTATAGGTGAACTGTCCTCCTGCCAATGTATTGCACATTTGTGCCCAGATCCAGCATAGGGTATGTTTGCCATTTACAAACGTTTATGTCTTAAGAGAGGAAATATGAAGAGCAAAACAGTGCATGCTGGAGAGAGAAAGCTGATACAAATATAAATGAAACAATAATTGGAAAAATTGAGAAACTACTCATTTTCTAAATTACTCATGTATTTTCCTAGAATTTAAGTCTTTTAATTTTTGATAAATCCCAATGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATCTGTTATATAATATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGTTGATTATTTTTTCTAGAGGGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCTCTGCTTTCCCAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAACTCTTTCCACTAAGAAAGTTCAACTATTAATTTAGGCACATACAATAATTACTCCATTCTAAAATGCCAAAAAGGTAATTTGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATCTGTTATATAATATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGT
TGATTATTTTTTCTAGAGGGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCATAAGAGACTTAAAACTGAAAACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCCCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCGTTTAAGATAGTCACACTGAACTATATTAAAAAATCCACAGGGTGGTTGGAACTAGGCCTTATATTAAAGAGGCTAAAAATTGCAATAAGACCACAGGCTTTAAATATGGCTTTAAACTGTGAAAGGTGAAACTAGAATGAATAAAATCCTATAAATTTAAATCAAAAGAAAGAAACAAACT[A/G]AAATTAAAGTTATTATACAAGAATATGGTGGCCTGGATCTAGTGAACATATAGTAAAGATAAAACAGAATATTTCTGAAAAATCCTGGAAAATCTTTTGGGCTAACCTGAAAACAGTATATTTGAAACTATTTTTAAAATGCAGTGATACTAGAAATATTTTAGAATCATATGTATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAAAATTTCGA

22


12

Terminology

• Locus (plural: loci): may also be called a polymorphism, marker, variant, mutation

• Allele: variant forms of the same locus, e.g., A, C

• “wildtype” vs “variant”

• Genotype: Pair of alleles

• Phenotype: Expressed trait

• Homozygote: AA, CC

• Heterozygote: AC

• Carrier: AA + AC

• Phase: do alleles occur together on the same chromosome?

• Haplotype: a collection of closely linked alleles, usually inherited as a unit, e.g., CTG

• Penetrance: P(Phenotype|Genotype)

23

Variant Types

Type        Effect                          Freq    RR of phenotype
Nonsense    stop AA seq.                    v. low  v. high
Missense    change AA                       low     low - v. high
Frameshift  change frame of protein coding  low     v. high
Intronic    No known function               Med     v. low
Intergenic  No known function               High    v. low

24


13

Typical Steps in Most Genomic Studies

• Hypotheses
• Tissue/sample processing
• Study Design
– Focused/candidate regions vs whole genome
– ‘Omic data type
– Array vs NGS
– Sample size and power
– Confounding issues, covariates (epi, drug/trt)
• Bioinformatics processing of raw data
• Statistical Analysis
• Annotation of results and relationship (IPA, etc.)
• Validation studies (replication, functional studies)

25

Evolution of Genomics Research

Candidate Gene Studies (< 2005)
• Events leading up to Candidate Studies
– 1953: Structure of DNA
– 1970s: Sanger Sequencing
– 1983: PCR
– 1990: HGP begins
– 1997: NHGRI formed
• Technologies: Genotyping; RT-PCR; Resequencing genes (exons) with Sanger Sequencing

Genome-wide Association Studies (2005-2010)
• Events leading up to GWAS
– 2000-1: Draft version of human genome sequence completed
– 2002: HapMap begins
– 2003: HGP ends
• Technologies: SNP arrays; mRNA arrays; Methylation arrays

Next-Gen Sequencing (2010-Present)
• Events leading up to Next-Gen Sequencing
– 2005: 1st commercial platform (Roche 454)
– 2006: Illumina’s Genome Analyzer (GA) IIx
– 2008: 1KGP begins
• Technologies: DNA (Exome & WGS); RNA-seq; Bisulfite or RRBS (methylation)

3rd Generation Sequencing
• Single Molecule Sequencing (PacBio, Complete Genomics, etc.)

Translation to clinical practice

26


14

Human Genome Project

• Completed in 2003; 13 year project

• Goals:

– Identify all ~25,000 genes in human DNA

– Determine the sequences of the 3 billion bp

– Store this information in databases

– Improve tools for data analysis

– Address ethical, legal, social issues (ELSI)

nature

February, 2001

http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml

Human Genome Facts

• 3 billion base pairs

• Around 25,000 genes

– Functions unknown for ~50%

• Average gene size is 3000 nucleotides

• Coding is about 1.5% of genome

nature

February, 2001

28


15

High Throughput Methods for Measuring DNA

• Many approaches for genotyping
– Hybridization Methods (Affymetrix, TaqMan)
– Primer extension (Pyrosequencing)
– Ligation (Illumina)

• Custom Content / Design
– GoldenGate, Infinium at Illumina
– Disease Specific panels (PGx, Cancer, Carbo‐Metabo)

• Standard large arrays
– Genome‐wide arrays (> 1 million SNPs)
– Exome Arrays (rare variants)

• Next‐Generation Sequencing

29

NGS Technologies

• Illumina (Solexa) HiSeq 2000 (2500) & MiSeq, Life Technologies SOLiD, PacBio, Ion Torrent PGM, Roche 454, ... , and many more to come

– No one-size-fits-all solution

– Each has pros and cons

30


16

Integrative Genomic Viewer (IGV)

Thorvaldsdottir, Robinson, Mesirov (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration Briefings in Bioinformatics

31

ENCODE (Encyclopedia of DNA Elements)

• Goal to build a comprehensive parts list of functional elements in the human genome.

32


17

Mouse ENCODE Project

• Enhance the human ENCODE Project through relevant comparative studies

• Access cell types, tissues, and developmental time points that are not addressable by the human project

• Provide a general resource to inform and accelerate ongoing efforts in mouse genomics and disease modeling with human translational potential

33

J Barretina et al. Nature 483, 603-607 (2012) doi:10.1038/nature11003

Cancer Cell Line Encyclopedia

34


18

The Cancer Genome Atlas (TCGA)

• Began in 2006 as a three-year pilot project (NCI & NHGRI) for three tumors.

• NIH is now committed to characterizing more than 20 additional tumors.

• Extensive data available on 17 cancers

• Tumor and normal tissue being analyzed on multiple levels, such as:

– nucleotide variation (SNP, Indel, SNV)

– gene copy number variation

– gene expression levels

– DNA methylation levels

Other Public Data and Information

36


19

Public databases

The entire human genome sequence can be found in several public databases.

– National Center for Biotechnology Information (NCBI)

http://www.ncbi.nlm.nih.gov

Entrez – NCBI's search and retrieval system; Build 37

– University of California at Santa Cruz (UCSC)

http://genome.ucsc.edu/

Genome Browser; hg19

– Ensembl Genome Browser

http://www.ensembl.org/index.html

37

Public databases

Species  UCSC  Release Date  Release Name                        Status
Human    hg19  Feb. 2009     Genome Reference Consortium GRCh37  Available
         hg18  Mar. 2006     NCBI Build 36.1                     Available
         hg17  May 2004      NCBI Build 35                       Available
         hg16  Jul. 2003     NCBI Build 34                       Available
         hg15  Apr. 2003     NCBI Build 33                       Archived

• Compare NCBI Build to UCSC assembly (hg18)

http://genome.ucsc.edu/FAQ/FAQreleases.html


20

UCSC Genome Browser

39

Haplotype Map of the Human Genome

Goals:
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)

Phase I: 1.3 M markers in 269 people
Phase II: +2.8 M markers in 270 people
Phase III: 1.6 M markers on 1,184 people (11 populations)


40


21

1000 Genomes Project (1KGP)

• International project to construct a foundational data set for human genetics
– Discover virtually all common human variations by investigating many genomes at the base pair level
– Consortium with multiple centers, platforms, funders

• Aims
– Discover population level human genetic variations of all types (95% of variation > 1% frequency)
– Define haplotype structure in the human genome
– Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects

41

42

3 pilot coverage strategies


22

1KGP Projects

43

• 1000 Genomes Pilot project
– Started in 2008
– Paper release contained ~14 million SNPs
– 179 individuals
– 4 populations
– Low coverage next generation sequencing

• 1000 Genomes Phase 1
– Started in 2009
– Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
– 1094 individuals
– 14 populations
– Low coverage and exome next generation sequencing

• 1000 Genomes Phase 2
– Started in 2011
– 1715 individuals
– 19 populations
– Low coverage and exome next generation sequencing

Methodological Impact of 1000 Genomes

• 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing.

• Developed methods to integrate information across several algorithms and diverse data sources.

• Joint calling and phasing of haplotypes

Flannick J, Korn JM, Fontanillas P, Grant GB, et al. (2012) Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation. PLoS Comput Biol 8(7): e1002604. doi:10.1371/journal.pcbi.1002604
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604

44


23

Bioinformatics and Statistical Genomics

Statistics

Biostatistics

Biology & Medicine

Computer Science Informatics

Statistical Genomics Bioinformatics

Genomics

Computational Genomics

45

Bioinformatics-Statistics “continuum”

Processing of data via computers

Biological knowledge/annotation

Algorithms to determine function, structure

Informatics

New algorithms for processing next‐generation sequence data

Data mining

Clustering/Profile

Network and Interactions

Gene set and pathway analysis

Experimental Design

Association Analysis

Differential Analysis

GWAS & Haplotype

Modeling & Prediction

Pedigree Studies (Linkage)

New statistical methods

Bioinformatics Statistical Genomics

46


24

Questions?

47

Statistics Overview

48


25

What is Statistics/Biostatistics?

• It is the science of gaining information from data (i.e., collecting, analyzing, and interpreting data)

• Statistics is mainly used in practice for evaluating data to gain an understanding of some subject matter.

49

3 Parts of Statistics

• Collecting Data
– Experiments and Experimental design
– Sampling and observational studies

• Analyzing Data
– Graphs and numerical summaries
– Estimation and confidence intervals
– Hypothesis Testing
– Statistical Modeling (e.g., fitting lines)

50


26

3 Parts of Statistics (cont.)

• Interpreting Results
– Was the statistical analysis appropriate?
– Were the data reliable?
– What do the results tell about the research question?
– What do the results tell about the estimate of an effect?

51

Statistics

• Useful definitions:
– population: all objects, individuals, etc. in which we are interested
– sample: the subset of a population that is actually measured
– data: information collected on objects, individuals, etc. of the sample

52


27

Statistical Inference

Statistical Analysis

Population

Sample

Data

measure, question, read, record, etc.

data manipulation summary statistics

inferences

Research question, hypothesis

53

Types of Data/Variables

• Qualitative Variables / Data

– Categorical

• These variables (data) classify subjects or objects into groups. The data can be character or numeric. If numeric, the numbers have no inherent meaning.

54


28


• Qualitative Variables/Data: Types

– Nominal

• Qualitative variables (data) in which the classifications/groups/categories are unordered.

• Examples

– blood group: A, B, O, AB

– group: 0 = control, 1 = study

– gender: 0 = female, 1 = male

55


• Qualitative Variables/Data: Types

– Ordinal

• Qualitative variables (data) in which the classifications/groups/categories are ordered.

• Examples

–smoking status: 0-never, 1-former, 2-current

–cancer stage: 1, 2, 3

–class: I, II, III, IV

56


29


• Quantitative Variables/Data

– These variables (data) are numeric with inherent numeric meaning. They typically arise from measurements or counts.

57


• Quantitative Variables/Data: Types

– Count or Discrete

• Quantitative variables (data) that arise from a counting process (only integers).

• Examples

–number of affected individuals in a family

–number of renal arteries with more than 50% stenosis

–number of bacterial colonies on a slide

58


30


• Quantitative Variables/Data: Types (cont.)

– Continuous

• Quantitative variables (data) for which, if measured with sufficient accuracy, there would be no gaps between possible values (continuum of values).

• Examples

– height

– systolic blood pressure

– time from diagnosis to last date alive or end of study

59

Graphical Displays of Data

• Examples of Data Distribution Shapes

60

(Figure: example histograms. Panel titles: skewed right; symmetric; skewed left; uni-modal; bi-modal; symmetric; symmetric with outlier; somewhat symmetric.)


31

Comparing sample mean and median

• In a perfectly symmetric distribution, the mean and median are always the same.

• The sample mean is influenced by outliers and skewed data, while the median is resistant to them.

• The mean moves away from the median toward the tail of skewed data or toward an outlier.
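A minimal sketch of this point, using Python's standard `statistics` module and made-up numbers: replacing one value with an outlier pulls the mean toward it while the median stays put. The data values are hypothetical.

```python
from statistics import mean, median

# Hypothetical data: the second list replaces the largest value with an outlier.
values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]

summary = {
    "symmetric": (mean(values), median(values)),
    "with_outlier": (mean(with_outlier), median(with_outlier)),
}
print(summary)  # {'symmetric': (3, 3), 'with_outlier': (12, 3)}
```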

61

mean mean

median median

Measures of Spread (Variability)

• Sample Variance
– idea: a measure of variability that depends on all the observations, looks at amount of variation about the mean
– notation
• population variance: σ²
• sample variance: s²
– Formula

$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$

62


32


• Characteristics of variance

– s² = 0 means no spread in the data

– s² is never negative

– As variability increases, so does s²

– Is in squared units of the values of the variable

– Influenced by outliers

63


• Sample Standard Deviation (SD)

– notation
• population standard deviation: σ
• sample standard deviation: s

– formula

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$

– characteristics

• square root of the variance

• has same units as the value of the variable
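The sample variance and standard deviation formulas (with the N - 1 divisor) can be written out directly and compared against Python's standard library. This is a minimal sketch with made-up data; the function name `sample_variance` is illustrative.

```python
from math import sqrt
from statistics import stdev, variance


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical measurements
s2 = sample_variance(data)
s = sqrt(s2)
print(s2, s)
print(variance(data), stdev(data))  # library versions use the same N - 1 divisor
```

Here the mean is 5 and the squared deviations sum to 24, so s² = 24 / 5 = 4.8, in squared units, while s is back in the original units.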


33

Boxplots

• Extremely useful for comparing groups

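(Figure: boxplot. The box spans QL to QU with a line at the median; points beyond the whiskers are plotted as outliers. IQR = QU – QL.)

Lower whisker: maximum of (1) the minimum value not flagged as an outlier and (2) QL – 1.5*IQR

Upper whisker: minimum of (1) the maximum value not flagged as an outlier and (2) QU + 1.5*IQR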
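The boxplot quantities can be computed by hand as in the Python sketch below. It assumes the "inclusive" quartile method from the standard `statistics` module; statistical packages use slightly different quartile conventions, so QL and QU may differ a little elsewhere. The data and the function name `boxplot_summary` are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, whisker ends and flagged outliers using the 1.5*IQR rule."""
    xs = sorted(values)
    q_l, med, q_u = quantiles(xs, n=4, method="inclusive")
    iqr = q_u - q_l
    low_fence, high_fence = q_l - 1.5 * iqr, q_u + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "QL": q_l,
        "median": med,
        "QU": q_u,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outliers
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


box = boxplot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(box)
```

With these values QL = 4, QU = 8 and IQR = 4, so the fences are at -2 and 14 and the value 30 is flagged as an outlier.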

Data Collection

• Two General Ways to get data

– Observational study: gathers information about individuals through response to questions or observations of an individual's "normal" actions
• sampling, surveys, retrospective studies
• Will not be able to conclude a “causative” effect

– Experiment: deliberately imposes some treatment in order to observe a response
• Completely randomized design, clinical trials
• Will be able to conclude a “causative” effect

66


34

Experimental Terms

• Experimental Unit: object on which experiment is performed

• Measurement Unit: object on which a measurement is taken; usually the same as the experimental unit, but not always

Example: Apply a fertilizer (treatment) to an orange tree (experimental unit); measure the acid level in the oranges (measurement unit).

• Treatment: specific experimental conditions applied to the units

67

Experimental Terms

• Experimental Error:
– Natural differences in experimental units
– Variation in the measuring device
– Variation in setting the experimental/treatment conditions
– The effect on the response variable of all extraneous factors other than the experimental factors

– WISH TO MINIMIZE THE EXPERIMENTAL ERROR

68


35

Design Of Experiments (DOE) Principles

1. Control Group / Comparison Group(s)
a. Controls for lurking variables

2. Randomization of experimental units to treatment groups
a. Avoids bias due to assignment
b. Produces similar treatment groups

3. Replication of experiment on many experimental units (n)
a. Better able to find differences in treatments

69

Completely randomized design

(Diagram: experimental units → random allocation → group 1 (n1 units) receives TRT 1 and group 2 (n2 units) receives TRT 2 → compare responses)
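Random allocation in a completely randomized design can be sketched in a few lines of Python. The unit labels, group sizes, and the function name `randomize` are hypothetical; the seed is fixed only so the allocation can be reproduced.

```python
import random


def randomize(units, n1, seed=None):
    """Randomly split units into two treatment groups of size n1 and len(units) - n1."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return shuffled[:n1], shuffled[n1:]


units = [f"unit{i:02d}" for i in range(1, 13)]  # 12 hypothetical experimental units
trt1, trt2 = randomize(units, n1=6, seed=2013)
print("TRT 1:", sorted(trt1))
print("TRT 2:", sorted(trt2))
```

Because assignment depends only on the shuffle, no characteristic of a unit can influence which treatment it receives, which is what protects against assignment bias.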

70


36

Variation in Experiments

• If we redo the experiment, we will have a different randomization and a different outcome

• Some differences are due to chance differences in the groups

• Statistically significant differences are differences that are too large to occur by chance alone (will study later with hypothesis / significance testing)

• The larger the sample the better we are able to detect differences in treatment groups

71

Study Design

1. What is the Question of Interest? Objectives?

2. Determine the Scope of the Inference
a. Will this be a randomized experiment or an observational study?
b. What experimental or sampling units will be used?
c. What are the populations of interest?

3. Understand the system under study.

72


37

Study Design

4. Decide how to measure a response.

5. List Factors that can affect the response.
a. Design Factors
i. Factors to vary (treatments and controls)
ii. Factors to fix
b. Confounding Factors
i. Factors to control by design (blocking)
ii. Factors to control by analysis (covariates)
iii. Factors to control by randomization

73

Study Design

6. Plan the conduct of the experiment (time line)

7. Outline the statistical analysis

8. Determine the sample size / power

74


38

Some Other Experimental Designs

• Block Designs
• Factorial Designs
• Cross-over
• Matched / Paired Design
– special case of blocked design
• Latin Square
• Split-plot
• Fractional factorial
• Randomized incomplete block design

75

Probability and Statistics

Probability (deductive)

Population Sample

Statistics (inference)

76


39

Probability Distribution

• A probability distribution tells us what the possible outcomes are and the probability assigned to each outcome.
– Example: Table with blood type probabilities

• Examples:
– Uniform distribution
– Normal distribution
– T-distribution

77

What makes a good estimator?

If you have 6 darts; what locations of the darts on the dart board would represent:

1) Unbiased & Low Variability?

2) Biased & High Variability?

78


40

Questions?

79

Probability for a Statistic

• A sampling distribution is a probability distribution for a statistic

• We use statistics to estimate unknown population parameters.

• Sampling distribution will be centered around the true value of the parameter (if the statistic is unbiased)

• As sample size increases, the statistic gets closer to the parameter (less spread in the distribution)
– The larger the sample size, the more precise the estimate

• Sampling distribution looks approximately normal (i.e. symmetrical/bell-shaped) and has no outliers.
– Will be “more normal” for larger samples.
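A small simulation can make the idea of a sampling distribution concrete. The sketch below repeatedly draws samples from a hypothetical normal population (mean 100, SD 15; these values are assumptions for illustration) and compares the spread of the sample means for two sample sizes.

```python
import random
from statistics import mean, stdev


def simulate_sample_means(n, reps=2000, seed=1):
    """Draw `reps` samples of size n and return the list of sample means."""
    rng = random.Random(seed)
    return [mean([rng.gauss(100, 15) for _ in range(n)]) for _ in range(reps)]


means_n5 = simulate_sample_means(n=5)
means_n50 = simulate_sample_means(n=50)

# Both sampling distributions are centered near the true mean (100),
# but the larger sample size gives a much smaller spread.
print(round(mean(means_n5), 1), round(stdev(means_n5), 1))
print(round(mean(means_n50), 1), round(stdev(means_n50), 1))
```

The spread of the sample means shrinks roughly like 15 / sqrt(n), so the n = 50 means cluster much more tightly around 100 than the n = 5 means.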

80


41

Significance Testing

• Often called “Hypothesis” Testing

• Statistical Inferences: two most common types
1. Confidence Interval: used when your goal is to estimate a population parameter.
2. Hypothesis Testing: used to assess the evidence provided by the data in favor of some claim about the population.

• Reasoning for both types is based on asking what would happen if we repeated the sample or experiment many times.

81

Idea of Hypothesis Testing

• Does our sample statistic indicate a TRUE effect? OR

• Could we easily get this sample statistic by chance alone?
– That is, taking into account variability in samples (and thus the statistic), is our observed value not an uncommon value?

• We would like to simply prove our alternative hypothesis is true, but statistics can never prove anything. Instead, we accumulate evidence against the null hypothesis.

82


42

Null Hypothesis

• Status quo

– usually not the hypothesis believed by the investigator

• population parameter does not differ from established value

• two population parameters do not differ

– indirect method for ascertaining whether data support researcher’s belief

• Why status quo? It makes it possible to calculate probabilities (p-values)

83

Alternative Hypothesis

• Research hypothesis: typically the hypothesis the investigator believes is true

• Format

– two-sided (two-tailed)
Ha: parameter ≠ hypothesized quantity

– upper-tailed (one-sided)
Ha: parameter > hypothesized quantity

– lower-tailed (one-sided)
Ha: parameter < hypothesized quantity

• Generally will use a two-sided hypothesis


43

p-value

• Informally: The p-value helps answer the question, “Is the observed difference real or merely the result of chance?”

• It does not answer this question directly. Rather, it indicates how likely a difference at least as large as the one observed would be if chance alone were at work (i.e., assuming H0 is true).

• Formally: The p-value is the probability of observing the statistic value you got (or a value more extreme) if the null hypothesis is true.
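For a test statistic that is approximately standard normal under H0 (an assumption for this illustration; the slides do not commit to a particular test), the formal definition turns into a one-line calculation. The function name `two_sided_p_value` is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. a result at least as extreme as z."""
    return erfc(abs(z) / sqrt(2))


for z in (0.0, 1.0, 1.96, 3.3):
    print(z, round(two_sided_p_value(z), 4))
```

A statistic of 1.96 gives a two-sided p-value of about 0.05, and larger statistics give smaller p-values.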

85

Reasoning of Hypothesis Testing

• We assume Ho is true.

• Look at the data to see if the evidence is against Ho (i.e., suggests Ho is false)

• Results that are very unlikely if Ho is true have very small p‐values and are evidence against the null hypothesis (Ho)
– Small p‐value = evidence in favor of Ha
– Large p‐value = not enough evidence for Ha

• Cut‐off for p‐value is significance level α

86


44

Interpreting p-values

• Use p-value to determine which possibility is supported by data

– p-value ≤ 0.001

• if the null hypothesis is true, there is a 1 in 1000 chance or less of observing our data or data more extreme

• strong evidence against the null hypothesis

87

Interpreting p-values

• Use p-value to determine which possibility is supported by data (cont.)

– p-value ≤ 0.05

• if the null hypothesis is true, there is a 1 in 20 chance or less of observing our data or data more extreme

• evidence against the null hypothesis

88


45

Interpreting p-values

• Use p-value to determine which possibility is supported by data (cont.)

– p-value ≥ 0.1

• if the null hypothesis is true, there is a 1 in 10 chance or more of observing our data or data more extreme

• no evidence against the null hypothesis

• NOTE:
– p-value is based on the assumption that the null hypothesis is true and so it cannot tell you if the null hypothesis is really true
– not having enough evidence against the null hypothesis does not “prove” the null hypothesis is true

Practical Significance

• When the sample size is large, you are more likely to get a significant p-value.

• This is because the spread in the sampling distribution is getting very small and thus, the test statistic is getting large in magnitude.

• Don't confuse statistical significance with practical significance.

90


46

Type I and II errors

• Type I error: reject the null hypothesis when it is true; α = Prob(Type I error)

• Type II error: fail to reject null hypothesis when it is false; β = Prob(Type II error)

• We control Type I error by setting α as low as possible

• α and β act inversely: for a fixed sample size, as α gets smaller β gets larger.

• Make n large to control for both types of error

                             Ho: True          Ho: False
Decision: Reject Ho          Type I error (α)  OK
Decision: Fail to Reject Ho  OK                Type II error (β)

91

Power of a Test

• Power
– 1 – β = 1 – P(Type II error)
– If the true parameter takes a specific alternative value, what is the chance of obtaining a significant result?

• Larger samples have greater power to obtain a significant result. In other words, when you increase sample size, you increase power.

• For a given sample size, we have greater power to detect larger effects.
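These two statements about power can be illustrated with a normal-approximation sketch for comparing two group means. The formula used here is a common textbook approximation (two-sided test, equal group sizes, known common SD, ignoring the negligible far tail) and is an assumption of this example rather than something given on the slide. The function name `approx_power` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test to detect a mean difference delta."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n_per_group)
    return z.cdf(delta / standard_error - z_crit)


# Power grows with sample size (fixed effect) and with effect size (fixed n).
for n in (20, 40, 63, 100):
    print(n, round(approx_power(n, delta=0.5, sd=1.0), 3))
for delta in (0.25, 0.5, 1.0):
    print(delta, round(approx_power(40, delta=delta, sd=1.0), 3))
```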

92


47

Power and Sample Size: Why important

• You are planning an experiment and you want to give yourself the best possible chance of determining the truth.
– Incorrect decisions:
• Failure to reject Ho when we should --- Type II error (1-Power)
• Reject Ho when we shouldn’t --- Type I error

• Planning stage:
– What effects do you think are possible?
– What is a clinically meaningful effect?
• What result do you need to proceed to the next stage?
• What result do you need to recommend a change in clinical practice?
– What sample size is required to make it all work?

93

Ways to Determine Sample Size

Two ways to determine sample size

1. Estimate n based on precision of confidence interval
– Studies should be designed with sample size sufficient to estimate precisely

2. Estimate n based on power of study
– Studies should be designed with sample size sufficient to provide good power (.8 or greater) to detect the smallest effect that would be clinically meaningful.

94


48

Estimating the Sample Size

(Figure: null and alternative sampling distributions with the rejection cutoff, split into "Fail to reject H0" and "Reject H0" regions. Labels: P(reject H0 | H0 is true) = α; Power = P(reject H0 | HA is true); P(fail to reject H0 | HA is true) = β. Example values: Alpha = 0.025, Power = 0.8, Beta = 0.2.)

• We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape.

• We need to find the sample size that will give the curves the shape so that the α level and power equal the specified values.

96


49

Sample Size Determination for Test of Significance

• Necessary components
– α, level of significance
– 1 – β, power
– the minimum difference between population parameters that is of clinical usefulness
– s, the standard deviation of each group (better to overestimate than underestimate)

• Cautions:
– formulas provided are only an approximation
– based on many assumptions

• Need to be clear in presenting how the power/sample size estimates were computed

– need to inflate the sample size you compute to account for loss to follow-up, dropouts, etc.
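The components listed above combine into the usual normal-approximation formula for comparing two group means, n per group ≈ 2 (z_{1-α/2} + z_{1-β})² s² / δ², where δ is the minimum clinically useful difference. The slides do not give this formula, so treat the Python sketch below as one common approximation under its own assumptions (two-sided test, equal group sizes, common SD). The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


print(n_per_group(delta=0.5, sd=1.0))  # 63 per group
print(n_per_group(delta=1.0, sd=1.0))  # 16 per group: larger effects need fewer subjects
```

Overestimating the standard deviation gives a larger, more conservative n, which is why the slide suggests erring on that side. The result is before any inflation for dropouts or loss to follow-up.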

97

http://www.stat.uiowa.edu/~rlenth/Power/index.html

98


50

• What impact does variance in population have on power? Higher variability, lower power
• What impact does effect size have on power? Smaller effect size, lower power
• What impact does type I level (α) have on power? Lower α, lower power

99

Time To Event (TTR, OS, DFS)

Two group comparison (no covariates): KM curves and log-rank test

Regression framework: Cox Proportional Hazards models

Logistic regression (binary outcome with covariates)

Poisson regression (count data; RNA-seq)

Linear models

100


51

Simple Linear Regression & Correlation

• Goals:
1. Describe the nature of the relationship between two variables.
2. To find out whether some variables help explain, predict or even cause the value of another variable.

• Response Variable: the result, effect, or outcome that we are interested in; also called the dependent variable.

• Explanatory Variable(s): explains, causes, or helps to predict the response; also called the independent variable.

• A relationship between two variables does not always imply that the one variable causes a change in the other variable

Correlation

• Correlation (r): a numerical measure of the strength and direction of a linear relationship between two quantitative variables.

• $r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_x S_y} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}, \qquad -1 \le r \le 1$

• Values of r close to 0 indicate a weak linear relationship (r = 0 indicates no linear relationship)

• Values of r close to -1 or +1 indicate a strong linear relationship

• r has no units of measurement
– r will not change if we change weight from lbs. to kg. or height from inches to cm.
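A small sketch of the correlation calculation may help make the unit-free property concrete. The height and weight values below are made-up illustrative data, and `pearson_r` is a hypothetical helper written for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: sum of cross-products of deviations divided by
    the square root of the product of the sums of squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: heights in inches, weights in pounds.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 165, 180]
r_imperial = pearson_r(height_in, weight_lb)

# Same individuals, converted to centimeters and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

print(r_imperial, r_metric)  # the two values agree up to rounding error
```

Rescaling multiplies the numerator and the denominator by the same positive constant, so r is unchanged.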


Least Squares Regression

• We will use the line to predict y from x, so we want the line that is as close as possible to the points in the vertical (y) direction

[Figure: x-axis: X variable (Explanatory); y-axis: Y variable (Response)]

Least Squares Regression

• A "good" regression line is one that makes the errors / residuals (ε), the vertical distances from the points to the line, as small as possible

• A least squares regression line of y on x is the line that makes the sum of the squared vertical distances (errors) of the data points from the line as small as possible; that is, it minimizes Σ(errors)²
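To illustrate the minimization, the sketch below fits a least squares line to a small made-up data set and compares its sum of squared errors with that of several nearby lines. The data, the helper names `sse` and `least_squares`, and the perturbation sizes are all assumptions chosen for the example.

```python
def sse(b0, b1, x, y):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form least squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
best = sse(b0, b1, x, y)

# SSE for lines obtained by nudging the intercept and/or slope.
others = [
    sse(b0 + d0, b1 + d1, x, y)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.2, 0.0, 0.2)
    if (d0, d1) != (0.0, 0.0)
]

print(best, min(others))  # every perturbed line has a larger SSE
```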


LSR Line

• Ŷ = b0 + b1 (X), Ŷ = predicted response
– Based on data / sample

• b1 = slope = r (Sy/Sx)
= rate of change
= amount of predicted change in Y when X is increased by 1 unit

• b0 = y-intercept = $\bar{Y} - b_1 \bar{X}$
= value of Ŷ when X = 0
= statistically meaningful only when X can take values close to 0
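The slope and intercept formulas can be applied directly. The following is a minimal sketch using made-up data; `lsr_line` and `predict` are hypothetical helper names, and the sample standard deviations use the usual n − 1 denominator.

```python
from statistics import mean, stdev


def lsr_line(x, y):
    """Return (b0, b1) using b1 = r * (Sy / Sx) and b0 = Ybar - b1 * Xbar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)        # sample SDs (n - 1 denominator)
    r = sum((xi - x_bar) * (yi - y_bar)
            for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted response Y-hat for a given x value."""
    return b0 + b1 * x_new


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = lsr_line(x, y)
print(b0, b1)
print(predict(b0, b1, 3))   # prediction at X-bar equals Y-bar
```

Because b0 is defined as Ȳ − b1 X̄, the fitted line always passes through the point (X̄, Ȳ).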


Prediction

• Prediction: substitute an x-value into the equation to get Ŷ, the predicted response value for that x value.

• The predicted value/point (X, Ŷ) is always on your line.

• Not all the observed values (Y) will fall on the line unless r=1.0 or r=‐1.0.


Interpreting correlation and regression

• Know limitations:
– Correlation and simple linear regression describe only linear relationships
– Both r and the LSR line are influenced by extreme observations (outliers/influential points)
– One outlier can change r and the LSR line dramatically
– Always plot your data before you interpret correlation and regression

• Influential Point: a point that, when removed, changes the position of the LSR line and affects the correlation.
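The effect of a single extreme observation is easy to demonstrate. In the sketch below, the data are invented: eight points lie almost exactly on a line, and one added point far from the rest acts as the influential point. `pearson_r` is a hypothetical helper defined here for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient from sums of squared deviations."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data with a nearly perfect linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0]
r_clean = pearson_r(x, y)

# Add one point that is extreme in x and far below the trend.
x_out = x + [20]
y_out = y + [0.0]
r_with_outlier = pearson_r(x_out, y_out)

print(r_clean, r_with_outlier)
```

With these values the correlation drops from nearly 1 to a value far below it after adding just one point, which is why plotting the data first matters.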

[Figure: plot with the influential point labeled]

Categorical Data Analysis

• When looking at categorical data, one often looks at proportions as opposed to means.

• Test that a proportion differs from a given value

• Test that proportions for 2 populations differ

• Test for relationship/association between two categorical variables.

– Ex. Disease status and genotype frequency


Chi-Square Tests

• Uses:

– Comparison of several (2 or more) proportions

– Test for relationship/association/independence between two categorical variables.

• Ex. Disease status and genotype frequency

– Test that k subpopulations are the same (homogeneity)

– Goodness of fit


Example: Genetic Association Testing

Observed counts, with expected counts in parentheses:

            aa          aA          AA          Total
Case        10 (7.5)    25 (22.5)   50 (55)     85
Control     5 (7.5)     20 (22.5)   60 (55)     85
Total       15          45          110         170

Expected Count = (Row Total)(Column Total) / (Table Total)

• If the expected counts are far away from the observed counts, this is evidence against H0.

• Chi-square test statistic: $X^2 = \sum \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

• Under the null hypothesis, $X^2 \sim \chi^2$ with df = (R-1)(C-1)
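The calculation for this table can be reproduced in a few lines. The sketch below recomputes the expected counts, the test statistic, and the degrees of freedom from the observed counts; the function name `chi_square_test` is hypothetical. For the p-value it relies on the fact that a chi-square distribution with exactly 2 df has upper-tail probability exp(−x/2), so it does not generalize to other df without a statistics library.

```python
from math import exp

# Observed genotype counts (aa, aA, AA) from the example table.
observed = {"Case": [10, 25, 50], "Control": [5, 20, 60]}


def chi_square_test(table):
    """Return (expected counts, X^2 statistic, df) for an R x C table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    # Expected count = (row total)(column total) / (table total)
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return expected, stat, df


expected, stat, df = chi_square_test(observed)

# Closed-form upper tail, valid only because df == 2 here.
p_value = exp(-stat / 2)

print(expected)
print(round(stat, 3), df, round(p_value, 3))
```

For these counts the statistic is about 3.13 on 2 df, giving a p-value of about 0.21, so the data do not provide evidence against H0 at the 0.05 level.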



Chi‐Square Distribution

• Takes only positive values

• Skewed distribution

– A standard normal random variable squared is a chi-square with 1 df (i.e. $Z^2 \sim \chi^2_{1}$)

[Figure: chi-square distribution (df = 1), with the X² test statistic and the p-value marked]
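Because Z² follows a chi-square distribution with 1 df, the upper-tail p-value of a 1-df chi-square statistic equals the two-sided p-value of a standard normal test at Z = √X². The sketch below uses that identity; `chi2_df1_p_value` is a hypothetical helper name, and the input 3.841 is the familiar 5% critical value used here only as an illustration.

```python
from statistics import NormalDist


def chi2_df1_p_value(x2):
    """Upper-tail p-value for a chi-square statistic with 1 df,
    computed as the two-sided normal tail area at sqrt(x2)."""
    z = x2 ** 0.5
    return 2 * (1 - NormalDist().cdf(z))


p = chi2_df1_p_value(3.841)
print(round(p, 4))  # close to 0.05, since 3.841 is about 1.96 squared
```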

Logistic Regression

• Used when response (dependent) variable has only two possible outcomes, “success” (y=1) or “failure” (y=0).

• Interested in what explanatory variables explain the response variable in terms of P(success).

• Type of nonlinear model (generalized linear model).

• Poisson regression is used when the response variable is a count (0, 1, 2, …).


Logistic Regression

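• Probability of success = π, so probability of failure = 1 – π

• Relate a function of π, g(π), to a linear combination of explanatory (independent) variables or predictors.

• Simple logistic regression model:

$\log(\text{odds}_i) = \log\left(\dfrac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$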

Logistic Regression

• Thus, the probability in terms of Xi (independent variable) is

• $\pi_i = \dfrac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$

• β1 measures the degree of association between the probability of success and the value of the explanatory or predictor variable.

• $\exp(\beta_1)$ is referred to as the ODDS RATIO.
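The link between the model coefficients, the probability of success, and the odds ratio can be seen with a short numerical sketch. The coefficient values below are hypothetical, and the predictor could be, for example, a genotype coded 0/1/2; nothing here is estimated from data.

```python
from math import exp, log


def prob_success(beta0, beta1, x):
    """Probability of success under a simple logistic regression model.
    Algebraically equal to exp(eta) / (1 + exp(eta)) with eta = beta0 + beta1*x."""
    eta = beta0 + beta1 * x
    return 1 / (1 + exp(-eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


# Hypothetical coefficients chosen only for illustration.
beta0, beta1 = -2.0, 0.7

for x in (0, 1, 2):
    p0 = prob_success(beta0, beta1, x)
    p1 = prob_success(beta0, beta1, x + 1)
    ratio = odds(p1) / odds(p0)   # odds ratio for a one-unit increase in x
    print(x, round(p0, 4), round(ratio, 4), round(exp(beta1), 4))
```

The odds ratio for a one-unit increase in X is the same at every starting value of X and equals exp(β1), even though the change in probability itself depends on where X starts.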



Questions?
