Statistical Genomics and Bioinformatics Workshop, 8/16/2013
Statistical Genomics and Bioinformatics Workshop:
Genetic Association and RNA-Seq Studies
Overview of Genetics, Data Resources, Terminology and Statistics
Brooke L. Fridley, PhD, University of Kansas Medical Center
Schedule for the Day
• 9:45 – 10 am: Morning Break
• 11:30 – 12:30 pm: Lunch Break
• 2:30 – 2:45 pm: Afternoon Break
Schedule of Topics:
• Overview of Genetics and Genomics
– Genetics
– Technologies for genotyping
– Databases and publicly available resources
• Review of Statistical Aspects
– Study Design
– Power/Sample Size
– Hypothesis Testing
Schedule for the Day (cont.)
• Population Genetics (LD)
• Genetic Association Studies
– Study Design
– Quality Control
– Genetic Models and Association Methods
– Haplotypes
– Power / Sample Size
– Population Stratification
– Genotype Imputation
– Multiple locus methods
• Example: GWAS for Hormone Levels
Schedule for the Day (cont.)
• Multiple Testing
– FWER
– FDR
– Permutation-based p-values
• Example: Acetaminophen toxicity GWAS
• Limitations and Common Errors with GWAS
• RNA-Seq
– Goals and review of types of RNAs
– NGS and Experimental Design
– Bioinformatics and processing RNA-Seq data
– Quality Control
– Differential Expression Testing Methods
• Clustering
– Goals
– Methods
– Validation
REVIEW OF GENETICS
Individualized Medicine
Anticipated benefits of Individualized Medicine
• More powerful medicines
• Better, safer drugs the first time
• More accurate methods of determining appropriate drug dosages
• Advanced screening for diseases
• Better vaccines
• Improvements in the drug discovery and approval process
• Decrease in the overall cost of health care
From the Human Genome Project website: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml#whatis
Integrative ‘Omics
[Figure: layers of 'omic data and what each measures]
• Genome: DNA
• Epigenome: Regulatory Elements
• Transcriptome: RNA
• Proteome: Proteins
• Metabolome: Metabolites (e.g. Lipids)
• Phenome: Phenotype & Function
DNA‐mRNA‐Protein
Humans have 46 chromosomes
22 pairs of autosomes, 1 pair of sex chromosomes
[Figure: Chromosome 5, labeled with the p arm, centromere, q arm, and telomere]
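Gene Structure
[Figure: a gene drawn from the 5' end to the 3' end: promoter, start site, Exon 1, intron, Exon 2, intron, Exon 3, stop site; splice sites mark the exon-intron boundaries]
[Figure: the spliced messenger RNA: Exon 1, Exon 2, Exon 3]
The exons encode the actual “blueprint” for a protein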
The DNA double helix
Nucleotide bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G)
[Figure: double helix labeled with the sugar phosphate backbone and a base pair]
DNA (uncoiled)
Bases: Adenine (A), Thymine (T), Cytosine (C), Guanine (G)
T A A T A C T C A T T G G G T C
A T T A T G A G T A A C C C A G
Codons: DNA base pairs are read by threes
Bases: Adenine (A), Thymine (T), Uracil (U), Cytosine (C), Guanine (G)
DNA:  T A A T A C T C A T T G G G T C
mRNA: A U U A U G A G U A A C C C A G
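The pairing shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `transcribe` and `codons` are made up for the example, and the sketch assumes the top strand on the slide is the template that is copied into mRNA.

```python
# Illustrative sketch: copy the slide's DNA strand into mRNA, then read it by threes.

# Base-pairing rules for transcription (DNA template base -> mRNA base).
PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template: str) -> str:
    """Return the mRNA complementary to a DNA template strand."""
    return "".join(PAIRING[base] for base in template)


def codons(mrna: str, start: str = "AUG") -> list[str]:
    """Split mRNA into complete triplets, beginning at the first start codon."""
    begin = mrna.find(start)
    if begin == -1:
        return []
    coding = mrna[begin:]
    # Stop before any incomplete trailing triplet.
    return [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]


template = "TAATACTCATTGGGTC"   # top strand from the slide
mrna = transcribe(template)
print(mrna)                     # AUUAUGAGUAACCCAG
print(codons(mrna))             # ['AUG', 'AGU', 'AAC', 'CCA']
```

Reading starts at the first AUG, so the leading AUU and the final unpaired G are not part of any codon.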
Genetic Code
• A codon is made of 3 bases
• There are 64 possible codons
– 1 codon (AUG) encodes methionine and starts translation of all proteins
– 3 codons stop protein translation
– 61 codons encode 20 amino acids (redundant code)
• Example: AUG (Met), GCA (Ala), UAA (stop)
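The counts on this slide follow from simple enumeration: four bases in three positions give 4 × 4 × 4 = 64 codons. A short sketch, assuming the standard genetic code with its three stop codons UAA, UAG, and UGA (variable names are illustrative):

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the standard genetic code
START_CODON = "AUG"                  # start codon; also encodes methionine

# Every ordered triplet of the four RNA bases.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
# Codons that encode an amino acid (everything except the stops).
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons))    # 64
print(len(sense_codons))  # 61
```

Because 61 codons map onto only 20 amino acids, most amino acids are encoded by more than one codon, which is what "redundant code" refers to.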
DNA Mutation
• A mutation is a change in the normal DNA base pair sequence
Mutations can cause disease
Proteins are chains of amino acids
[Figure: functional protein vs. nonfunctional or missing protein]
SNP Markers
• SNP:
AATGCAGGTGCAATCGATTTC
AATGCAGGTGCAATTGATTTC
• SNPs make up 90% of all human genetic variation
• SNPs with a minor allele frequency of ≥ 1% occur, on average, every 100 to 300 bases along the 3-billion-base human genome.
• Variations in the DNA sequences of humans can affect how humans develop disease or respond to drug treatments (pharmacogenomics)
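As a small illustration of what a SNP is, the two example sequences from this slide can be compared position by position. The helper name `find_snps` is hypothetical, and the last two lines simply redo the slide's "every 100 to 300 bases" arithmetic for a 3-billion-base genome.

```python
def find_snps(allele_a: str, allele_b: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, base in A, base in B) wherever two aligned sequences differ."""
    if len(allele_a) != len(allele_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position, base_a, base_b)
        for position, (base_a, base_b) in enumerate(zip(allele_a, allele_b))
        if base_a != base_b
    ]


allele_a = "AATGCAGGTGCAATCGATTTC"
allele_b = "AATGCAGGTGCAATTGATTTC"
snps = find_snps(allele_a, allele_b)
print(snps)  # [(14, 'C', 'T')]

# Rough number of common SNPs implied by one SNP every 100 to 300 bases.
genome_size = 3_000_000_000
print(genome_size // 300, "to", genome_size // 100)  # 10000000 to 30000000
```

The only difference is a C/T change at the 15th base (index 14), which is the single-nucleotide polymorphism.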
Some types of mutations
Normal:                 THE BIG RED DOG RAN OUT.
Missense:               THE BIG RAD DOG RAN OUT.
Nonsense:               THE BIG RED.
Frameshift (deletion):  THE BRE DDO GRA.
Frameshift (insertion): THE BIG RED ZDO GRA.
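The frameshift rows in the sentence analogy can be reproduced by reading letters in groups of three after a deletion or insertion. A minimal sketch (the function name `read_in_threes` is made up, and the positions of the deleted and inserted letters are chosen to match the slide):

```python
def read_in_threes(letters: str) -> list[str]:
    """Split a string into complete three-letter 'codons', dropping any incomplete tail."""
    return [letters[i:i + 3] for i in range(0, len(letters) - 2, 3)]


normal = "THEBIGREDDOGRANOUT"
deletion = normal[:4] + normal[6:]          # delete the two letters "IG"
insertion = normal[:9] + "Z" + normal[9:]   # insert one letter "Z" after "RED"

print(read_in_threes(normal))     # ['THE', 'BIG', 'RED', 'DOG', 'RAN', 'OUT']
print(read_in_threes(deletion))   # ['THE', 'BRE', 'DDO', 'GRA', 'NOU']
print(read_in_threes(insertion))  # ['THE', 'BIG', 'RED', 'ZDO', 'GRA', 'NOU']
```

Every "word" downstream of the change is altered, which is why frameshift mutations tend to be more damaging than a single substituted letter.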
Polymorphisms
• A change in the normal DNA base pair sequence
• Mutations that do not alter protein function can become common in the population
• A polymorphism is defined as a ‘common’ genetic change, usually >1% is considered common.
Marker Types
Alternative forms of a DNA sequence or gene
• SNP
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGTGCAATTGATTTC…….
• Insertion/Deletion
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGATTTC…….
• Microsatellite
allele A …….AATGCGAGAGAGAGAGATTTC…….
allele B …….AATGCGAGAGAGATTTC…….
SNPs in the Human Genome
[Figure: a stretch of human genomic sequence, roughly 3,000 bases long, with three SNPs marked in place as [C/T], [A/C], and [A/G]]
Terminology
• Locus (plural: loci): may also be called a polymorphism, marker, variant, mutation
• Allele: variant forms of the same locus, e.g., A, C
• “Wildtype” vs “variant”
• Genotype: pair of alleles
• Phenotype: expressed trait
• Homozygote: AA, CC
• Heterozygote: AC
• Carrier: AA + AC
• Phase: do alleles occur together on the same chromosome?
• Haplotype: a collection of closely linked alleles, usually inherited as a unit, e.g., CTG
• Penetrance: P(Phenotype | Genotype)
Variant Types
Type         Effect                           Freq     RR of phenotype
Nonsense     stop AA seq.                     v. low   v. high
Missense     change AA                        low      low - v. high
Frameshift   change frame of protein coding   low      v. high
Intronic     No known function                Med      v. low
Intergenic   No known function                High     v. low
Typical Steps in Most Genomic Studies
• Hypotheses
• Tissue/sample processing
• Study Design
– Focused/candidate regions vs whole genome
– ‘Omic data type
– Array vs NGS
– Sample size and power
– Confounding issues, covariates (epi, drug/trt)
• Bioinformatics processing of raw data
• Statistical Analysis
• Annotation of results and relationship (IPA, etc.)
• Validation studies (replication, functional studies)
Evolution of Genomics Research
• Candidate Gene Studies (< 2005)
– Events leading up to Candidate Studies
• 1950 – Structure of DNA
• 1970s – Sanger Sequencing
• 1983 – PCR
• 1990 – HGP begins
• 1997 – NHGRI formed
– Technologies: Genotyping; RT-PCR; Resequencing genes (exons) with Sanger Sequencing
• Genome-wide Association Studies (2005-2010)
– Events leading up to GWAS
• 2000-1 – Draft version of human genome sequence completed
• 2002 – HapMap begins
• 2003 – HGP ends
– Technologies: SNP arrays; mRNA arrays; Methylation arrays
• Next-Gen Sequencing (2010-Present)
– Events leading up to Next-Gen Sequencing
• 2005 – 1st Commercial platform (Roche 454)
• 2006 – Illumina’s Genome Analyzer (GA) IIx
• 2008 – 1KGP begins
– Technologies: DNA (Exome & WGS); RNA-seq; Bisulfite or RRBS (methylation)
• 3rd Generation Sequencing
– Single Molecule Sequencing: PacBio, Complete Genomics, etc.
• Translation to clinical practice
Human Genome Project
• Completed in 2003; 13 year project
• Goals:
– Identify all ~25,000 genes in human DNA
– Determine the sequences of the 3 billion bp
– Store this information in databases
– Improve tools for data analysis
– Address ethical, legal, social issues (ELSI)
[Figure: cover of Nature, February 2001]
http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml
Human Genome Facts
• 3 billion base pairs
• Around 25,000 genes
– Functions unknown for ~50%
• Average gene size is 3000 nucleotides
• Coding is about 1.5% of genome
High Throughput Methods for Measuring DNA
• Many approaches for genotyping
– Hybridization Methods (Affymetrix, TaqMan)
– Primer extension (Pyrosequencing)
– Ligation (Illumina)
• Custom Content / Design
– GoldenGate, Infinium at Illumina
– Disease Specific panels (PGx, Cancer, Carbo-Metabo)
• Standard large arrays
– Genome-wide arrays (> 1 million SNPs)
– Exome Arrays (rare variants)
• Next-Generation Sequencing
NGS Technologies
• Illumina (Solexa) HiSeq 2000 (2500) & MiSeq, Life Technologies SOLiD, PacBio, Ion Torrent PGM, Roche 454, ... , and many more to come
– No one-size-fits-all solution
– Each has pros and cons
Integrative Genomic Viewer (IGV)
Thorvaldsdottir, Robinson, Mesirov (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration Briefings in Bioinformatics
ENCODE (Encyclopedia of DNA Elements)
• Goal to build a comprehensive parts list of functional elements in the human genome.
Mouse ENCODE Project
• Enhance the human ENCODE Project through relevant comparative studies
• Access cell types, tissues, and developmental time points that are not addressable by the human project
• Provide a general resource to inform and accelerate ongoing efforts in mouse genomics and disease modeling with human translational potential
J Barretina et al. Nature 483, 603-607 (2012) doi:10.1038/nature11003
Cancer Cell Line Encyclopedia
The Cancer Genome Atlas (TCGA)
• Began in 2006 as a three-year pilot project (NCI & NHGRI) for three tumors.
• NIH is now committed to characterizing more than 20 additional tumors.
• Extensive data available on 17 cancers
• Tumor and normal tissue being analyzed on multiple levels, such as:
– nucleotide variation (SNP, Indel, SNV)
– gene copy number variation
– gene expression levels
– DNA methylation levels
Other Public Data and Information
Public databases
The entire human genome sequence can be found in several public databases.
– National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov
Entrez: NCBI's search and retrieval system; Build 37
– University of California at Santa Cruz (UCSC)
http://genome.ucsc.edu/
Genome Browser; hg19
– Ensembl Genome Browser
http://www.ensembl.org/index.html
Public databases
Species   UCSC Version   Release Date   Release Name                          Status
Human     hg19           Feb. 2009      Genome Reference Consortium GRCh37    Available
          hg18           Mar. 2006      NCBI Build 36.1                       Available
          hg17           May 2004       NCBI Build 35                         Available
          hg16           Jul. 2003      NCBI Build 34                         Available
          hg15           Apr. 2003      NCBI Build 33                         Archived
• Compare NCBI Build to UCSC assembly (hg18)
http://genome.ucsc.edu/FAQ/FAQreleases.html
UCSC Genome Browser
Haplotype Map of the Human Genome
Goals:
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)
Phase I: 1.3 M markers in 269 people
Phase II: +2.8 M markers in 270 people
Phase III: 1.6 M markers on 1,184 people (11 populations)
1000 Genomes Project (1KGP)
• International project to construct a foundational data set for human genetics
– Discover virtually all common human variations by investigating many genomes at the base pair level
– Consortium with multiple centers, platforms, funders
• Aims
– Discover population-level human genetic variations of all types (95% of variation > 1% frequency)
– Define haplotype structure in the human genome
– Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects
3 pilot coverage strategies
1KGP Projects
• 1000 Genomes Pilot project
– Started in 2008
– Paper release contained ~14 million SNPs
– 179 individuals
– 4 populations
– Low coverage next generation sequencing
• 1000 Genomes Phase 1
– Started in 2009
– Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
– 1094 individuals
– 14 populations
– Low coverage and exome next generation sequencing
• 1000 Genomes Phase 2
– Started in 2011
– 1715 individuals
– 19 populations
– Low coverage and exome next generation sequencing
Methodological Impact of 1000 Genomes
• 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing.
• Developed methods to integrate information across several algorithms and diverse data sources.
• Joint calling and phasing of haplotypes
Flannick J, Korn JM, Fontanillas P, Grant GB, et al. (2012) Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation. PLoS Comput Biol 8(7): e1002604. doi:10.1371/journal.pcbi.1002604
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604
Bioinformatics and Statistical Genomics
[Figure: diagram of related fields: Statistics, Biostatistics, Biology & Medicine, Computer Science, Informatics, and Genomics, with Statistical Genomics, Bioinformatics, and Computational Genomics where they overlap]
Bioinformatics-Statistics “continuum”
[Figure: topics arranged along a continuum running from Bioinformatics to Statistical Genomics]
• Processing of data via computers
• Biological knowledge/annotation
• Algorithms to determine function, structure
• Informatics
• New algorithms for processing next-generation sequence data
• Data mining
• Clustering/Profile
• Network and Interactions
• Gene set and pathway analysis
• Experimental Design
• Association Analysis
• Differential Analysis
• GWAS & Haplotype
• Modeling & Prediction
• Pedigree Studies (Linkage)
• New statistical methods
Questions?
Statistics Overview
What is Statistics/Biostatistics?
• It is the science of gaining information from data (i.e., collecting, analyzing, and interpreting data)
• Statistics is mainly used in practice for evaluating data to gain an understanding of some subject matter.
3 Parts of Statistics
• Collecting Data
– Experiments and Experimental design
– Sampling and observational studies
• Analyzing Data
– Graphs and numerical summaries
– Estimation and confidence intervals
– Hypothesis Testing
– Statistical Modeling (e.g., fitting lines)
3 Parts of Statistics (cont.)
• Interpreting Results
– Was the statistical analysis appropriate?
– Was the data reliable?
– What do the results tell about the research question?
– What do the results tell about the estimate of an effect?
Statistics
• Useful definitions:
– population: all objects, individuals, etc. in which we are interested
– sample: the subset of a population that is actually measured
– data: information collected on objects, individuals, etc. of the sample
Statistical Inference
[Figure: flow of statistical inference: research question or hypothesis → population → sample → data (measure, question, read, record, etc.) → statistical analysis (data manipulation, summary statistics) → inferences about the population]
Types of Data/Variables
• Qualitative Variables / Data
– Categorical
• These variables (data) classify subjects or objects into groups. The data can be character or numeric. If numeric, the numbers have no inherent meaning.
Types of Data/Variables
• Qualitative Variables/Data: Types
– Nominal
• Qualitative variables (data) in which the classifications/groups/categories are unordered.
• Examples
– blood group: A, B, O, AB
– group: 0 = control, 1 = study
– gender: 0 = female, 1 = male
Types of Data/Variables
• Qualitative Variables/Data: Types
– Ordinal
• Qualitative variables (data) in which the classifications/groups/categories are ordered.
• Examples
–smoking status: 0-never, 1-former, 2-current
–cancer stage: 1, 2, 3
–class: I, II, III, IV
Types of Data/Variables
• Quantitative Variables/Data
– These variables (data) are numeric with inherent numeric meaning. They typically arise from measurements or counts.
Types of Data/Variables
• Quantitative Variables/Data: Types
– Count or Discrete
• Quantitative variables (data) that arise from a counting process (only integers).
• Examples
–number of affected individuals in a family
–number of renal arteries with more than 50% stenosis
–number of bacterial colonies on a slide
Types of Data/Variables
• Quantitative Variables/Data: Types (cont.)
– Continuous
• Quantitative variables (data) for which, if measured with sufficient accuracy, there would be no gaps between possible values (continuum of values).
• Examples
– height
– systolic blood pressure
– time from diagnosis to last date alive or end of study
Graphical Displays of Data
• Examples of Data Distribution Shapes
[Figure: example histograms labeled skewed right; symmetric; skewed left; uni-modal; bi-modal; symmetric; symmetric with outlier; somewhat symmetric]
Comparing sample mean and median
• In a perfectly symmetric distribution, the mean and median are always the same.
• Sample mean is influenced by outliers and skewed data, while the median is largely unaffected.
• Mean will move away from the median toward the tail of skewed data or outlier.
[Figure: two skewed distributions with the mean and the median marked; the mean lies farther into the tail than the median]
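A quick numerical check of these points, using Python's standard `statistics` module on two small made-up data sets (the numbers are invented purely for illustration):

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]        # perfectly symmetric about 3
with_outlier = [1, 2, 3, 4, 100]   # same data with the largest value replaced by an outlier

print(mean(symmetric), median(symmetric))        # 3 3
print(mean(with_outlier), median(with_outlier))  # 22 3
```

The outlier pulls the mean from 3 up to 22, toward the tail, while the median stays at 3.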
Measures of Spread (Variability)
• Sample Variance
– idea: a measure of variability that depends on all the observations, looks at amount of variation about the mean
– notation
• population variance: $\sigma^2$
• sample variance: $s^2$
– Formula
$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$
Measures of Spread (Variability)
• Characteristics of variance
– $s^2 = 0$ means no spread in the data
– $s^2$ is never negative
– As variability increases, so does $s^2$
– squared units of the values of the variable
– Influenced by outliers
Measures of Spread (Variability)
• Sample Standard Deviation (SD)
– notation
• population standard deviation: $\sigma$
• sample standard deviation: $s$
– formula
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$
– characteristics
• square root of the variance
• has same units as the value of the variable
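The variance and standard deviation formulas can be checked by hand and against Python's `statistics` module. The data values below are made up for illustration, and `sample_variance` is a hypothetical helper name.

```python
import math
from statistics import stdev, variance


def sample_variance(values: list[float]) -> float:
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5; squared deviations sum to 32
s2 = sample_variance(data)        # 32 / 7
s = math.sqrt(s2)                 # the SD is the square root of the variance

print(round(s2, 3), round(s, 3))
print(math.isclose(s2, variance(data)), math.isclose(s, stdev(data)))  # True True
```

The variance is in squared units of the data, while the standard deviation is back in the original units.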
Boxplots
• Extremely useful for comparing groups
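[Figure: annotated boxplot]
– Lower whisker: maximum of (1) minimum value not flagged as outlier, (2) QL – 1.5*IQR
– Box: QL, median, QU
– Upper whisker: minimum of (1) maximum value not flagged as outlier, (2) QU + 1.5*IQR
– Points beyond the whiskers are plotted individually as outliers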
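The quantities a boxplot displays can be computed directly. This is a sketch under stated assumptions: quartiles are computed with `statistics.quantiles(..., method="inclusive")` (other software may use slightly different quartile definitions), the data are invented, and `boxplot_summary` is a hypothetical helper name.

```python
from statistics import quantiles


def boxplot_summary(values: list[float]) -> dict:
    """Quartiles, 1.5*IQR fences, whisker ends, and flagged outliers."""
    data = sorted(values)
    q_l, med, q_u = quantiles(data, n=4, method="inclusive")
    iqr = q_u - q_l
    lower_fence = q_l - 1.5 * iqr
    upper_fence = q_u + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "QL": q_l, "median": med, "QU": q_u, "IQR": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```

For this data set the value 50 lies above QU + 1.5*IQR, so it is flagged as an outlier and the upper whisker stops at 9.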
Data Collection
• Two General Ways to get data
– Observational study: gathers information about individuals through response to questions or observations of an individual's "normal" actions
• sampling, surveys, retrospective studies
• Will not be able to conclude a “causative” effect
– Experiment: deliberately imposes some treatment in order to observe a response
• Completely randomized design, clinical trials
• Will be able to conclude a “causative” effect
Experimental Terms
• Experimental Unit: object on which experiment is performed
• Measurement Unit: object on which a measurement is taken; usually the same as the experimental unit, but not always
Example: Apply a fertilizer (treatment) to an orange tree (experimental unit); measure the acid level in the oranges (measurement unit).
• Treatment: specific experimental conditions applied to the units
Experimental Terms
• Experimental Error:
– Natural differences in experimental units
– Variation in the measuring device
– Variation in setting the experimental/treatment conditions
– The effect on the response variable of all extraneous factors other than the experimental factors
– WISH TO MINIMIZE THE EXPERIMENTAL ERROR
Design Of Experiments (DOE) Principles
1. Control Group / Comparison Group(s)
a. Controls for lurking variables
2. Randomization of experimental units to treatment groups
a. Avoids bias due to assignment
b. Produces similar treatment groups
3. Replication of experiment on many experimental units (n)
a. Better able to find differences in treatments
Completely randomized design
[Figure: experimental units → random allocation → n1 units receive TRT 1 and n2 units receive TRT 2 → compare responses]
Variation in Experiments
• If we redo the experiment, we will have a different randomization and a different outcome
• Some differences are due to chance differences in the groups
• Statistically significant differences are differences that are too large to occur by chance alone (will study later with hypothesis / significance testing)
• The larger the sample the better we are able to detect differences in treatment groups
Study Design
1. What is the Question of Interest? Objectives?
2. Determine the Scope of the Inference
a. Will this be a randomized experiment or an observational study?
b. What experimental or sampling units will be used?
c. What are the populations of interest?
3. Understand the system under study.
Study Design
4. Decide how to measure a response.
5. List Factors that can affect the response.
a. Design Factors
i. Factors to vary (treatments and controls)
ii. Factors to fix
b. Confounding Factors
i. Factors to control by design (blocking)
ii. Factors to control by analysis (covariates)
iii. Factors to control by randomization
Study Design
6. Plan the conduct of the experiment (time line)
7. Outline the statistical analysis
8. Determine the sample size / power
Some Other Experimental Designs
• Block Designs
• Factorial Designs
• Cross-over
• Matched / Paired Design
– special case of blocked design
• Latin Square
• Split-plot
• Fractional factorial
• Randomized incomplete block design
Probability and Statistics
• Probability (deductive): Population → Sample
• Statistics (inference): Sample → Population
Probability Distribution
• A probability distribution tells us what the possible outcomes are and the probability assigned to each outcome.
– Example: Table with blood type probabilities
• Examples:
– Uniform distribution
– Normal distribution
– t-distribution
What makes a good estimator?
If you have 6 darts, what locations of the darts on the dart board would represent:
1) Unbiased & Low Variability?
2) Biased & High Variability?
Questions?
Probability for a Statistic
• A sampling distribution is a probability distribution for a statistic
• We use statistics to estimate unknown population parameters.
• Sampling distribution will be centered around the true value of the parameter (if the statistic is unbiased)
• As sample size increases, the statistic gets closer to the parameter (less spread in the distribution)
– The larger the sample size, the more precise the estimate
• Sampling distribution looks approximately normal (i.e. symmetrical/bell-shaped) and has no outliers.
– Will be “more normal” for larger samples.
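These properties of a sampling distribution can be seen in a small simulation. The sketch below assumes a Uniform(0, 1) population (true mean 0.5), draws many samples of size 10 and of size 100, and compares the spread of the resulting sample means. The seed, sample sizes, and number of replicates are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev


def sample_means(n: int, replicates: int, rng: random.Random) -> list[float]:
    """Sample means from repeated samples of size n drawn from Uniform(0, 1)."""
    return [mean(rng.random() for _ in range(n)) for _ in range(replicates)]


rng = random.Random(1)
means_n10 = sample_means(10, 2000, rng)
means_n100 = sample_means(100, 2000, rng)

# Both sampling distributions are centered near the true mean of 0.5,
# but the one based on larger samples has much less spread.
print(round(mean(means_n10), 3), round(stdev(means_n10), 3))
print(round(mean(means_n100), 3), round(stdev(means_n100), 3))
```

In theory the spread of the sample mean shrinks like one over the square root of the sample size, so the n = 100 means should be roughly three times less variable than the n = 10 means.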
Significance Testing
• Often called “Hypothesis” Testing
• Statistical Inferences: two most common types
1. Confidence Interval: used when your goal is to estimate a population parameter.
2. Hypothesis Testing: used to assess the evidence provided by the data in favor of some claim about the population.
• Reasoning for both types is based on asking what would happen if we repeated the sample or experiment many times.
Idea of Hypothesis Testing
• Does our sample statistic indicate a TRUE effect? OR
• Could we easily get this sample statistic by chance alone?
– That is, taking into account variability in samples (and thus the statistic), is our observed value not an uncommon value?
• We would like to simply prove our alternative hypothesis is true, but statistics can never prove anything. Instead, we accumulate evidence against the null hypothesis.
Null Hypothesis
• Status quo
– usually not the hypothesis believed by the investigator
• population parameter does not differ from established value
• two population parameters do not differ
– indirect method for ascertaining whether data support researcher’s belief
• Why status quo?
– makes it possible to calculate probabilities (p-values)
Alternative Hypothesis
• Research hypothesis
– typically the hypothesis the investigator believes is true
• Format
– two-sided (two-tailed)
Ha: parameter ≠ hypothesized quantity
– upper-tailed (one-sided)
Ha: parameter > hypothesized quantity
– lower-tailed (one-sided)
Ha: parameter < hypothesized quantity
• Generally will use a two-sided hypothesis
p-value
• Informally: The p-value helps answer the question, “Is the observed difference real or merely the result of chance?”
• It does not answer this question directly. Rather, it indicates how likely a difference at least as large as the one observed would be if chance alone were at work (i.e., assuming H0 is true).
• Formally: The p-value is the probability of observing the statistic value you got (or a value more extreme) if the null hypothesis is true.
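The formal definition can be made concrete for one simple case. The sketch below assumes the test statistic is a z statistic that follows a standard normal distribution when the null hypothesis is true; the function name `two_sided_p_value` is made up for the example.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, i.e. the two-sided p-value under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


for z in (0.5, 1.96, 3.29):
    print(z, round(two_sided_p_value(z), 4))
```

A statistic of 1.96 standard errors from the null value gives a p-value of about 0.05, and the p-value shrinks quickly as the statistic moves farther into the tails.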
Reasoning of Hypothesis Testing
• We assume Ho is true.
• Look at data to see if evidence is against Ho(Ho false)
• Results that are very unlikely if Ho is true have very small p-values and are evidence against the null hypothesis (Ho)
– Small p-value = evidence for Ha (reject Ho)
– Large p-value = insufficient evidence for Ha (fail to reject Ho)
• Cut-off for p-value is significance level α
Interpreting p-values
• Use p-value to determine which possibility is supported by data
– p-value ≤ 0.001
• if the null hypothesis is true, there is a 1 in 1000 chance or less of observing our data or data more extreme
• strong evidence against the null hypothesis
Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
– p-value ≤ 0.05
• if the null hypothesis is true, there is a 1 in 20 chance or less of observing our data or data more extreme
• evidence against the null hypothesis
Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
– p-value ≥ 0.1
• if the null hypothesis is true, there is a 1 in 10 chance or more of observing our data or data more extreme
• no evidence against the null hypothesis
• NOTE:
– the p-value is calculated assuming that the null hypothesis is true, so it cannot tell you whether the null hypothesis is really true
– not having enough evidence against the null hypothesis does not “prove” the null hypothesis is true
Practical Significance
• When the sample size is large, you are more likely to get a significant p-value.
• This is because the spread in the sampling distribution is getting very small and thus, the test statistic is getting large in magnitude.
• Don't confuse statistical significance with practical significance.
Type I and II errors
• Type I error: reject the null hypothesis when it is true; α = Prob(Type I error)
• Type II error: fail to reject null hypothesis when it is false; β = Prob(Type II error)
• We control the Type I error rate by choosing a small α in advance
• α and β act inversely: for a fixed sample size, as α gets smaller β gets larger.
• Make n large to control for both types of error

                              Ho: True            Ho: False
Decision: Reject Ho           Type I error (α)    OK
Decision: Fail to Reject Ho   OK                  Type II error (β)
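The meaning of α as the Type I error rate can be checked by simulation: generate many data sets for which the null hypothesis really is true, test each one at α = 0.05, and count how often the null is rejected. The sketch assumes a two-sided one-sample z-test with known standard deviation; the sample size, seed, and number of trials are arbitrary illustrative choices.

```python
import math
import random
from statistics import NormalDist, mean


def rejects_null(sample: list[float], mu0: float, sigma: float, alpha: float) -> bool:
    """Two-sided one-sample z-test of H0: mean = mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


rng = random.Random(42)
alpha = 0.05
trials = 4000

# Every simulated sample comes from a population where H0 (mean = 0) is true,
# so each rejection is a Type I error.
rejections = sum(
    rejects_null([rng.gauss(0.0, 1.0) for _ in range(25)], mu0=0.0, sigma=1.0, alpha=alpha)
    for _ in range(trials)
)
type_i_rate = rejections / trials
print(type_i_rate)
```

The observed rejection rate should land close to 0.05: about 1 in 20 true null hypotheses is rejected by chance alone.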
Power of a Test
• Power
– 1 – β = 1 – Prob(Type II error)
– If the true parameter takes a particular value under Ha, what is the chance of obtaining a significant result?
• Larger samples have greater power to obtain a significant result. In other words, when you increase sample size, you increase power.
• For a given sample size, there is greater power to detect larger effects.
Power and Sample Size: Why important
• You are planning an experiment and you want to give yourself the best possible chance of determining the truth.
– Incorrect decisions:
• Failure to reject Ho when we should --- Type II error (1-Power)
• Reject Ho when we shouldn’t --- Type I error
• Planning stage:
– What effects do you think are possible?
– What is a clinically meaningful effect?
• What result do you need to proceed to the next stage?
• What result do you need to recommend a change in clinical practice?
– What sample size is required to make it all work?
Ways to Determine Sample Size
Two ways to determine sample size
1. Estimate n based on precision of confidence interval
– Studies should be designed with sample size sufficient to estimate precisely
2. Estimate n based on power of study
– Studies should be designed with sample size sufficient to provide good power (0.8 or greater) to detect the smallest effect that would be clinically meaningful.
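[Figure: null and alternative sampling distributions divided into “Fail to reject H0” and “Reject H0” regions]
• α = P(reject H0 | H0 is true)
• Power = P(reject H0 | HA is true)
• β = P(fail to reject H0 | HA is true)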
Estimating the Sample Size
• We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape.
• We need to find the sample size that will give the curves the shape so that the α level and power equal the specified values.
[Figure: null and alternative curves with Alpha=0.025, Power=0.8, Beta=0.2]
96
Sample Size Determination for Test of Significance
• Necessary components
– α, level of significance
– 1 – β, power
– the minimum difference between population parameters that is of clinical usefulness
– s, the standard deviation of each group (better to overestimate than underestimate)
• Cautions:
– Formulas provided are only an approximation
– Based on many assumptions
• Need to be clear in presenting how the power/sample size estimates were computed
– Need to inflate the sample size you compute to account for loss to follow-up, dropouts, etc.
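The components listed above can be combined in the usual normal-approximation formula for comparing two means, n per group = 2 (z(1–α/2) + z(1–β))² s² / δ². The sketch below assumes that formula, a two-sided test (two-sided α = 0.05 puts 0.025 in each tail), and equal group sizes. The names `n_per_group`, `inflate_for_dropout` and `delta` are hypothetical, and the example inputs are made up for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(delta, s, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    delta: minimum clinically useful difference (assumed symbol)
    s: standard deviation of each group (better over- than under-estimated)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_beta) * s / delta) ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Enroll enough subjects that about n remain after the expected dropout."""
    return math.ceil(n / (1 - dropout_rate))


if __name__ == "__main__":
    n = n_per_group(delta=0.5, s=1.0)            # hypothetical planning values
    print("per group:", n)
    print("with 20% dropout:", inflate_for_dropout(n, 0.20))
```

As the slide cautions, this is only an approximation; an exact t-test calculation typically gives a slightly larger n.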
97
http://www.stat.uiowa.edu/~rlenth/Power/index.html
98
• What impact does variance in the population have on power? Higher variability, lower power.
• What impact does effect size have on power? Smaller effect size, lower power.
• What impact does the type I error level (α) have on power? Lower α, lower power.
99
Time To Event (TTR, OS, DFS)
Two group comparison (no covariates): KM curves and log-rank test
Regression framework: Cox Proportional Hazards models
Logistic regression (binary outcome with covariates)
Poisson regression (count data; RNA-seq)
Linear models
100
Simple Linear Regression & Correlation
• Goals:
1. Describe the nature of the relationship between two variables.
2. Find out whether some variables help explain, predict or even cause the value of another variable.
• Response Variable : the result, effect, or outcome that we are interested in; also called the dependent variable.
• Explanatory Variable(s): explains, causes, or helps to predict the response; also called the independent variable.
• A relationship between two variables does not always imply that one variable causes a change in the other variable.
101
Correlation
• Correlation (r): a numerical measure for the strength and direction of a linear relationship between two quantitative variables.
• $r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_x S_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}, \qquad -1 \le r \le 1$
• Values of r close to 0 indicate a weak linear relationship (r = 0 indicates no linear relationship)
• Values of r close to ‐1 or +1 indicate a strong linear relationship
• r has no units of measurement
– r will not change if we change weight from lbs. to kg. or height from inches to cm.
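A short sketch of the correlation formula, using only the Python standard library, may help make the "no units" property concrete. The height and weight values are made up for illustration, and the function name `pearson_r` is an assumption.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation: sum of cross-products over (n - 1) * Sx * Sy."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * stdev(x) * stdev(y))


if __name__ == "__main__":
    # Hypothetical data: height in inches, weight in pounds.
    height_in = [60, 62, 65, 68, 70, 72]
    weight_lb = [115, 120, 140, 155, 160, 180]
    r_original = pearson_r(height_in, weight_lb)

    # Convert units: inches -> cm, pounds -> kg. r should not change.
    height_cm = [h * 2.54 for h in height_in]
    weight_kg = [w * 0.45359237 for w in weight_lb]
    r_converted = pearson_r(height_cm, weight_kg)

    print(r_original, r_converted)
```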
102
Least Squares Regression
• We will use the line to predict y from x, so we want the line that is as close as possible to the points in the vertical (y) direction
[Figure: horizontal axis "X variable (Explanatory)", vertical axis "Y variable (Response)"]
103
Least Squares Regression
• A "good" regression line is one that makes the errors / residuals (ε) or distances as small as possible
• A Least Squares regression line of y on x is the line that makes the sum of the squared vertical distances (errors) of the data points from the line as small as possible, or minimizes Σ(errors)²
104
LSR Line
• Ŷ = b0 + b1 (X), Ŷ = predicted response
– Based on data / sample
• b1 = slope = r (Sy/Sx)
= rate of change
= amount of predicted change in Y when X is increased by 1 unit
• b0 = y‐intercept = $\bar{Y} - b_1 \bar{X}$
= value of Ŷ when X = 0.
= statistically meaningful only when X can take values close to 0
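The slope and intercept formulas can be sketched directly in code. The example below assumes the standard definitions b1 = r (Sy/Sx) and b0 = Ȳ – b1 X̄; the function names and the five data points are illustrative only.

```python
from statistics import mean, stdev


def least_squares_line(x, y):
    """Return (b0, b1) using b1 = r * (Sy / Sx) and b0 = Y-bar - b1 * X-bar."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = cross / ((n - 1) * s_x * s_y)   # sample correlation
    b1 = r * s_y / s_x                  # slope
    b0 = y_bar - b1 * x_bar             # intercept
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted response (Y-hat) for a given x value."""
    return b0 + b1 * x_new


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]   # made-up explanatory values
    y = [2, 4, 5, 4, 5]   # made-up responses
    b0, b1 = least_squares_line(x, y)
    print("b0 =", b0, "b1 =", b1)
    print("prediction at x = 6:", predict(b0, b1, 6))
```

By construction the fitted line passes through the point (X̄, Ȳ).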
105
Prediction
• Prediction: substitute an x‐value into the equation to get Ŷ, the predicted response value for that x value.
• The predicted value/point (X, Ŷ) is always on your line.
• Not all the observed values (Y) will fall on the line unless r=1.0 or r=‐1.0.
106
Interpreting correlation and regression
• Know limitations:
– Correlation and simple linear regression describe only linear relationships
– Both r and LSR line are influenced by extreme observations (outliers/influential points)
– One outlier can change r and LSR line dramatically
– Always plot your data before you interpret correlation and regression
• Influential Point : a point that when removed changes the position of the LSR line and affects the correlation.
[Figure label: Influential Point]
107
Categorical Data Analysis
• When looking at categorical data, one often looks at proportions as opposed to means.
• Testing that a proportion differs from a given value
• Test that proportions for 2 populations differ
• Test for relationship/association between two categorical variables.
– Ex. Disease status and genotype frequency
108
Chi‐Square Tests
• Uses:
– Comparison of Several (2 or more) Proportions
– Test for relationship/association/independence between two categorical variables.
• Ex. Disease status and genotype frequency
– Test that k subpopulations are the same (homogeneity)
– Goodness of fit
109
Example: Genetic Association Testing
Observed counts (expected counts in parentheses):

           aa          aA           AA         Total
Case       10 (7.5)    25 (22.5)    50 (55)    85
Control    5 (7.5)     20 (22.5)    60 (55)    85
Total      15          45           110        170

Expected Count = (Row Total)(Column Total) / (Table Total)
• If the expected counts are far away from the observed counts, this is evidence against Ho.
• Chi-square test statistic: $X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• Under null hypothesis, $X^2 \sim \chi^2$ with df = (R-1)(C-1)
110
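The expected counts and the test statistic for the genotype table can be reproduced with a few lines of code. This is a minimal sketch using only the standard library; the function name `chi_square_test` is hypothetical. The p-value line relies on the fact that, for df = 2 specifically, the upper tail of the chi-square distribution is exp(–x/2), so it does not generalize to other table sizes.

```python
import math


def chi_square_test(table):
    """Return (expected counts, X^2 statistic, df) for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count = (row total)(column total) / (table total)
    expected = [[rt * ct / total for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, stat, df


# Observed counts from the slide: rows = Case, Control; columns = aa, aA, AA
observed = [[10, 25, 50], [5, 20, 60]]
expected, x2, df = chi_square_test(observed)

# Upper-tail probability; this closed form is valid only because df = 2 here.
p_value = math.exp(-x2 / 2)

if __name__ == "__main__":
    print("expected:", expected)
    print("X^2 =", round(x2, 4), "df =", df, "p =", round(p_value, 4))
```

For general degrees of freedom a statistics library would normally be used to obtain the p-value.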
Chi‐Square Distribution
• Takes only positive values
• Skewed distribution
– A standard normal random variable squared is a Chi‐square with 1 df (i.e. Z² ~ χ² with df = 1)
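Because a squared standard normal variable is chi-square with 1 df, a 1-df chi-square p-value can be computed from the normal distribution alone. The sketch below illustrates that relationship; the function name is an assumption.

```python
from statistics import NormalDist


def chi2_1df_p_value(x2):
    """Upper-tail p-value for a chi-square statistic with 1 df.

    Uses P(Z^2 > x2) = P(|Z| > sqrt(x2)) = 2 * (1 - Phi(sqrt(x2))).
    """
    return 2 * (1 - NormalDist().cdf(x2 ** 0.5))


if __name__ == "__main__":
    # 1.96 squared is the familiar 1-df critical value at the 0.05 level.
    print(chi2_1df_p_value(1.96 ** 2))
```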
111
[Figure: chi-square distribution (df = 1), with the X² test statistic and p-value marked]
Logistic Regression
• Used when response (dependent) variable has only two possible outcomes, “success” (y=1) or “failure” (y=0).
• Interested in what explanatory variables explain the response variable in terms of P(success).
• Type of nonlinear model (generalized linear model).
• Poisson regression is used when the response variable is a count taking values 0, 1, 2, …
112
Logistic Regression
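• Probability of success = $\pi_i$; probability of failure = $1 - \pi_i$
• Relate a function of $\pi_i$, $g(\pi_i)$, to a linear combination of explanatory (independent) variables or predictors.
• Simple logistic regression model:
$\log(\text{odds}_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$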
113
Logistic Regression
• Thus, the probability in terms of Xi (independent variable) is
• $\pi_i = P(y_i = 1) = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$
• β1 measures the degree of association between the probability of success and the value of the explanatory or predictor variable.
• $\exp(\beta_1)$ is referred to as the ODDS RATIO.
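The link between β1 and the odds ratio can be verified numerically. The sketch below assumes the simple logistic model with one predictor; the coefficient values are hypothetical and chosen only for illustration.

```python
import math


def prob_success(x, beta0, beta1):
    """P(success) under a simple logistic model with one predictor."""
    eta = beta0 + beta1 * x               # linear predictor = log odds
    return math.exp(eta) / (1 + math.exp(eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)


# Hypothetical coefficients, not estimated from any data set.
beta0, beta1 = -2.0, 0.7

# Ratio of the odds at x + 1 to the odds at x, for an arbitrary x.
odds_ratio = odds(prob_success(3, beta0, beta1)) / odds(prob_success(2, beta0, beta1))

if __name__ == "__main__":
    print("odds ratio from probabilities:", odds_ratio)
    print("exp(beta1):", math.exp(beta1))
```

The two printed values agree for any choice of x, which is why exp(β1) is interpreted as the odds ratio per one-unit increase in the predictor.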
114
Questions?
115