Gene Expression Data Analysis
BMIF 310, Fall 2009
Bing Zhang
Department of Biomedical Informatics
Vanderbilt University
Gene expression technologies (summary)
Hybridization-based approaches
  Printed arrays
    cDNA arrays: customizable, high array variation
  Synthesized oligo arrays
    Affymetrix arrays: high density, low array variation
      Classic arrays: probes on 3’ UTR
      Exon arrays: probes on all known exons
      Tiling arrays: probes spread across the genomic sequence
Sequencing-based approaches
  Traditional Sanger sequencing-based approaches
    Serial analysis of gene expression (SAGE): ~10bp tag at the 3’ end
  2nd generation sequencing-based approaches
    RNA-Seq: high-throughput unbiased profiling
Bioinformatics tasks
[Diagram: tasks in a microarray study]
Microarray experiment
Data storage
Data integration
Data visualization
Biological question
Experiment design
Image analysis
Normalization
Data Mining
  Differential expression
  Clustering
  Classification
  Network analysis
Hypothesis
Experimental verification
Biological interpretation
Well begun is half done
A clearly defined biological question
Good control of potential sources of variation (biological and technical)
Statistically sound microarray experimental arrangement (replicates)
Compliance with the standard of microarray information collection (MIAME)
http://www.mged.org/Workgroups/MIAME/miame.html
Image analysis
Analysis of the image of the scanned array in order to extract an intensity for each spot or feature on the array.
Gridding: align a grid to the spots
Segmentation: identify the shape of each spot
Intensity extraction: extract the intensity of each spot and, optionally, of its surrounding background
Background correction: subtract background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot
Garbage in, garbage out
Remove bad arrays
Remove poor-quality spots
Remove data points with low signal/noise ratio
Remove genes with too many missing values
[Figure: example of a bad array]
Normalization
The purpose of normalization is to remove systematic variation in a microarray experiment that affects the measured gene expression levels
Sources of systematic variation
  Unequal quantities of starting RNA
  Differences in labelling and detection efficiencies
  Topographical slide variation
  Scanner-introduced bias
Normalization methods
Multiply each array by a constant to make the mean (or median) intensity the same for every array (global normalization)
Match the percentiles of each array (quantile normalization)
Adjust using a nonlinear smoothing curve
Adjust the arrays using control or housekeeping genes that are expected to have the same intensity level across all samples
Adjust using spike-in controls
[Figure panels: No normalization, Global normalization, Quantile normalization]
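The first two methods can be sketched in a few lines. This is a minimal illustration, not a production routine: it assumes each array is a plain list of raw intensities for the same genes in the same order, it uses the median for global normalization, and it ignores ties in quantile normalization. The function names and the toy numbers are made up for the example.

```python
from statistics import median

def global_normalize(arrays):
    """Scale each array so that all arrays share the same median intensity."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)  # common target: mean of the array medians
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_genes)]
    result = []
    for a in arrays:
        order = sorted(range(n_genes), key=lambda i: a[i])  # gene indices by rank
        normalized = [0.0] * n_genes
        for rank, gene in enumerate(order):
            normalized[gene] = reference[rank]  # replace value by reference quantile
        result.append(normalized)
    return result

# Toy data (hypothetical): two arrays, three genes each
arrays = [[2.0, 4.0, 6.0], [10.0, 30.0, 20.0]]
global_norm = global_normalize(arrays)
quantile_norm = quantile_normalize(arrays)
```

Global normalization only rescales each array, so differences in the shape of the intensity distribution remain. Quantile normalization forces the distributions to be identical and keeps only the rank of each gene within its array.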
Get to know your data matrix
Rows are genes, columns are samples
ID          Samp1   Samp2   Samp3   ……   Samp(m-1)   Samp(m)
Gene1       5.25    6.37    7.30    ……   6.02        7.17
Gene2       6.96    5.01    7.23    ……   5.87        5.02
Gene3       5.44    5.67    4.23    ……   5.33        6.34
Gene4       12.83   10.35   12.56   ……   9.98        11.13
Gene5       3.20    3.07    3.19    ……   3.27        3.16
Gene6       7.74    7.66    7.12    ……   7.46        7.95
……          ……      ……      ……      ……   ……          ……
Gene(n-1)   8.92    8.52    7.62    ……   7.90        8.02
Gene(n)     6.06    6.04    6.35    ……   6.44        6.60
Differential Gene Expression
n-fold change
  Arbitrarily selected fold-change cut-offs
  Usually ≥ 2 fold
Pros
  Intuitive and easily visualised
  Simple and rapid
Cons
  Statistically inefficient
  Magnitude does not necessarily indicate importance
  Often too restrictive
MVA plot
  M: log ratio, log2(I1/I2)
  A: average log intensity, log2(I1*I2)/2
  I1 and I2 are the intensities of the same gene in the two samples being compared
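The M and A values and a fold-change cut-off can be computed as below. This is a small sketch that assumes raw, positive intensities for one gene in two samples; the function names and the example intensities are hypothetical.

```python
import math

def ma_values(i1, i2):
    """Return (M, A) for two positive intensities of the same gene."""
    m = math.log2(i1 / i2)        # M: log ratio
    a = 0.5 * math.log2(i1 * i2)  # A: average log intensity
    return m, a

def passes_fold_change(i1, i2, cutoff=2.0):
    """True when the change is at least `cutoff`-fold in either direction."""
    return abs(math.log2(i1 / i2)) >= math.log2(cutoff)

# Hypothetical intensities: a 4-fold increase
m, a = ma_values(800.0, 200.0)
up_4_fold = passes_fold_change(800.0, 200.0)    # 4-fold up
up_1_5_fold = passes_fold_change(300.0, 200.0)  # 1.5-fold up
down_2_fold = passes_fold_change(100.0, 200.0)  # 2-fold down
```

Working on the log2 scale makes up- and down-regulation symmetric: a 2-fold increase gives M = 1 and a 2-fold decrease gives M = -1.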
Differential Gene Expression
Statistical tests
  Test for significant change between repeated measurements of a variable in two groups or in multiple groups
  Calculate a test statistic, select a cut-off value, and decide whether to reject the null hypothesis
Methods
  Two independent groups
    Student’s t-test: parametric
    Mann-Whitney U test: nonparametric
  Two or more independent groups
    ANOVA (analysis of variance): parametric
    Kruskal-Wallis test: nonparametric
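For two independent groups, the Student's t statistic for one gene can be sketched as follows. This assumes the equal-variance (pooled) form of the test; the function name and the expression values are made up. Turning t into a p-value needs the t distribution, which the standard library does not provide; a library routine such as SciPy's `scipy.stats.ttest_ind` performs both steps.

```python
import math
from statistics import mean, variance

def student_t(group1, group2):
    """Two-sample Student's t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2  # degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    t = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical log2 expression values of one gene in two groups
control = [5.0, 6.0, 7.0]
treated = [8.0, 9.0, 10.0]
t, df = student_t(treated, control)
```

In a microarray study the same test is repeated for every gene, which is why the p-values need the multiple-testing correction discussed next.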
Correction for multiple testing
Why?
  In an experiment with a 10,000-gene array in which the significance level is set at 0.05, about 10,000 x 0.05 = 500 genes would be called significant even if none is differentially expressed
  Using unadjusted p-values inflates the number of Type I errors (false positives)
Methods
  Control the family-wise error rate (FWER), the probability of at least one Type I error in the entire set (family) of hypotheses tested, e.g. standard Bonferroni correction: adjusted p-value = uncorrected p-value x number of genes tested
  Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, e.g. Benjamini and Hochberg correction
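Both corrections can be written as adjusted p-values. The sketch below assumes a plain list of per-gene p-values; the function names and the four example p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes
p_values = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

A gene is then called significant when its adjusted p-value is below the chosen level. Bonferroni is the stricter of the two, so it usually keeps fewer genes.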
What is clustering
Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within-group similarities are larger than between-group similarities
Unsupervised technique: does not require any prior knowledge to be incorporated in the process
Why clustering?
Exploratory data analysis, providing rough maps and suggesting directions for further study
Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram
Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or co-expressed genes
Clustering methods
Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters
Partitioning: divide the data into g groups using some reallocation algorithm, e.g. K-means
Fuzzy clustering: each object has a set of weights indicating how likely it is to belong to each cluster
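As an illustration of the reallocation idea behind partitioning, here is a minimal K-means sketch. It assumes Euclidean distance and starting centroids supplied by the caller (real implementations usually choose them at random and run several restarts); the function name and the toy points are hypothetical.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]  # work on a copy
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no object was reallocated: converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data (hypothetical): four objects measured in two samples, g = 2 groups
points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
labels, centers = kmeans(points, initial_centroids=[[1.0, 1.0], [9.0, 9.5]])
```

The number of groups g must be chosen in advance, and the result can depend on the starting centroids.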
Hierarchical clustering
Agglomerative clustering (bottom-up)
  Start with n groups, join the two closest, continue
Divisive clustering (top-down)
  Start with 1 group, split into 2, then into 3, …, into n
Require distance measurement
  Between two objects
  Between clusters
Between-object distance measurement
Euclidean distance
  Focus on the absolute expression value
Pearson correlation coefficient
  Focus on the expression profile shape
  Parametric: assumes normally distributed data and a linear relationship
Spearman correlation coefficient
  Focus on the expression profile shape
  Non-parametric: no distributional assumption
  Less sensitive than Pearson
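The three measures can be sketched as follows. The helper names and the toy profiles are made up for illustration; a correlation r is commonly turned into a distance as 1 - r.

```python
import math
from statistics import mean

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles across four samples
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same shape as gene_a, shifted up by 10
gene_c = [1.0, 4.0, 9.0, 16.0]     # rises with gene_a, but not linearly
```

Here `gene_b` is far from `gene_a` in Euclidean terms but perfectly correlated with it, and `gene_c` has a Spearman correlation of 1 with `gene_a` while its Pearson correlation is slightly below 1.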
Different measurement, different distance
Most similar profile to GeneA (blue) based on different distance measurement:
Euclidean: GeneB (pink)
Pearson: GeneC (green)
Spearman: GeneD (red)
Between-cluster distance measurement
Single linkage: the smallest distance of all pairwise distances
Complete linkage: the maximum distance of all pairwise distances
Average linkage: the average distance of all pairwise distances
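The three linkage rules differ only in how they summarize the pairwise distances between the members of two clusters. A minimal sketch, assuming Euclidean distance between objects; the function name and the one-dimensional toy clusters are hypothetical.

```python
import math

def linkage_distance(cluster1, cluster2, method="average"):
    """Distance between two clusters under single, complete or average linkage."""
    pairwise = [math.dist(p, q) for p in cluster1 for q in cluster2]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

# Toy clusters (hypothetical), one coordinate per object
left = [[0.0], [1.0]]
right = [[3.0], [6.0]]
single = linkage_distance(left, right, "single")
complete = linkage_distance(left, right, "complete")
average = linkage_distance(left, right, "average")
```

The pairwise distances here are 3, 6, 2 and 5, so the three rules give 2, 6 and 4 respectively.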
Hierarchical clustering
Dendrogram
  Output of a hierarchical clustering
  Tree structure with the genes or samples as the leaves
  The height of the join indicates the distance between the left branch and the right branch
Problems
  Hard to define distinct clusters
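The join heights of a dendrogram come directly from the agglomerative procedure. The sketch below is a naive illustration, assuming Euclidean distance and single or complete linkage; it records each merge with its height rather than drawing a tree. The function name and the toy points are hypothetical.

```python
import math

def agglomerative(points, method="single"):
    """Bottom-up clustering; returns the merges as (members_a, members_b, height)."""
    combine = {"single": min, "complete": max}[method]
    clusters = [[i] for i in range(len(points))]  # start with n singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters under the chosen linkage
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(
                    math.dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # d is the height of this join
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy data (hypothetical): four objects on a line
points = [[0.0], [1.0], [5.0], [7.0]]
merges = agglomerative(points)
```

Cutting the tree at a chosen height gives distinct clusters: with the merges above, a cut at height 3 leaves the two groups {0, 1} and {2, 3}. Choosing that height is the hard part noted under Problems.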
What is classification
Classification algorithms are methods to classify objects into predefined classes
Supervised technique: requires training data and predefined classes
Two-step process
  Model construction: describe a set of predetermined classes using training data
  Model application: classify new objects into predefined classes
Classification methods
K-nearest neighbor
Decision tree
Support vector machine
Naïve Bayes classifier
Artificial neural network
…
Feature selection
Microarray data have a large number of variables (genes) relative to very few observations (samples). We therefore need to select a subset of genes likely to be predictive, i.e. strongly associated with the classes to be predicted
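One simple way to pick such a subset is to score each gene by how well it separates two classes and keep the top-scoring genes. The sketch below uses a signal-to-noise score (difference of class means divided by the sum of class standard deviations); this choice of score is an assumption for illustration, not a method named in the slides, and the gene names and values are made up.

```python
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """|difference of class means| / (sum of class standard deviations) for one gene."""
    classes = sorted(set(labels))  # assumes exactly two classes
    g1 = [v for v, l in zip(values, labels) if l == classes[0]]
    g2 = [v for v, l in zip(values, labels) if l == classes[1]]
    return abs(mean(g1) - mean(g2)) / (stdev(g1) + stdev(g2))

def select_top_genes(expression, labels, k):
    """Return the k gene names with the highest score."""
    scores = {gene: signal_to_noise(values, labels) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: three genes measured in six samples
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "GeneX": [9.0, 9.5, 10.0, 5.0, 5.5, 6.0],  # well separated between classes
    "GeneY": [6.0, 7.5, 5.0, 6.5, 5.5, 7.0],   # overlapping
    "GeneZ": [8.0, 7.0, 9.0, 6.0, 7.0, 5.0],   # moderately separated
}
top = select_top_genes(expression, labels, k=2)
```

The score is undefined for a gene that is constant within both classes (zero standard deviations), so such genes would need to be filtered out first.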
Model construction
Training Data
Sample  GeneA  GeneB  Tumor
A       H      H      N
B       H      L      Y
C       L      L      N
D       H      L      Y
E       L      L      N
F       L      H      N
Classification Algorithms
Classifier (Model)
IF GeneA = ‘H’ AND GeneB = ‘L’ THEN Tumor = ‘yes’
Model application
New objects
Sample  GeneA  GeneB  Tumor
Z       H      L      ?
Classifier (Model)
IF GeneA = ‘H’ AND GeneB = ‘L’ THEN Tumor = ‘yes’
Sample Z = ‘Tumor’? Yes
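The two steps can be traced in code using the slide's own training table and rule. The rule only states when Tumor is 'yes'; returning 'N' in every other case is an assumption made here so that the function always gives an answer.

```python
# Training data from the slide: (sample, GeneA, GeneB, Tumor)
training = [
    ("A", "H", "H", "N"),
    ("B", "H", "L", "Y"),
    ("C", "L", "L", "N"),
    ("D", "H", "L", "Y"),
    ("E", "L", "L", "N"),
    ("F", "L", "H", "N"),
]

def classify(gene_a, gene_b):
    """The slide's rule: Tumor = yes when GeneA is 'H' and GeneB is 'L'."""
    # ASSUMPTION: anything that does not match the rule is labelled 'N'
    return "Y" if gene_a == "H" and gene_b == "L" else "N"

# Model construction check: samples whose training label the rule gets wrong
training_errors = [s for s, a, b, tumor in training if classify(a, b) != tumor]

# Model application: classify the new sample Z (GeneA = H, GeneB = L)
prediction_z = classify("H", "L")
```

The rule reproduces all six training labels, and sample Z is classified as a tumor.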
K-nearest neighbor
Objects are points in an n-dimensional space
Compute the distance between the new case and all learning cases
Return the most common class among the k learning cases nearest to the new case
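These three steps map onto a few lines of code. A minimal sketch, assuming Euclidean distance; the function name, the training points and their labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Majority vote among the k training cases nearest to new_point."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training cases: expression of two genes per sample
train_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
train_labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
prediction = knn_predict(train_points, train_labels, [2.0, 2.0], k=3)
```

With two classes, an odd k avoids tied votes.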
Over-fitting and cross-validation
Over-fitting
  The classifier is very effective in classifying the training samples but not accurate enough for new samples
Cross-validation
  Hold-out
    Split the data into training and testing data
    Learn with the training data and estimate the true error with the testing data
  N-fold
    Randomly split the data into n folds
    In turn, hold out each fold for testing, learn with the remaining folds, and estimate the error on the held-out fold
    Average the test performance
  Leave-one-out
    Leave one case out for testing
    Learn with the remaining data and estimate the error on the left-out case
    Repeat for every case and average the test performance
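N-fold cross-validation and leave-one-out differ only in the number of folds. The sketch below assumes the learner is passed in as a function that trains on one part of the data and predicts the rest; `train_and_predict`, `majority_class` and the toy labels are hypothetical names and data used only to make the example runnable.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def cross_validate(samples, labels, train_and_predict, n_folds):
    """Average test accuracy over the folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), n_folds):
        held_out = set(test_idx)
        train_x = [s for i, s in enumerate(samples) if i not in held_out]
        train_y = [l for i, l in enumerate(labels) if i not in held_out]
        predicted = train_and_predict(train_x, train_y, [samples[i] for i in test_idx])
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

def majority_class(train_x, train_y, test_x):
    """Toy learner: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)

# Hypothetical data: 10 samples, 8 normal (N) and 2 tumor (Y)
samples = list(range(10))
labels = ["N"] * 8 + ["Y"] * 2
accuracy = cross_validate(samples, labels, majority_class, n_folds=5)
loo_accuracy = cross_validate(samples, labels, majority_class, n_folds=len(samples))
```

Setting `n_folds` to the number of samples gives leave-one-out. Every case is used for testing exactly once, and the learner never sees its test cases during training.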
Importance of biological interpretation
Normalizing, filtering, clustering and visualizing the data identifies sets of genes of potential interest
These are numerical techniques and do not reveal the biological implications hidden in expression data
Evaluating the functional significance of large, heterogeneous and noisy sets of genes is a major challenge
Gene Ontology
Structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products
Three major categories describe the attributes of a gene product: biological process, molecular function and cellular component
Terms within each category are organized in a Directed Acyclic Graph (DAG)
http://geneontology.org
Gene Ontology Tree Machine (GOTM)
A web-based tool for the analysis and visualization of sets of genes identified from high-throughput technologies
User-friendly data navigation and visualization
Statistical analysis suggesting biological areas that warrant further study
http://bioinfo.vanderbilt.edu/gotm
GOTM
[Figure: overlap between a gene set and the GO category "mitotic cell cycle". Observed (up-regulated genes): overlap 24, set sizes 147 and 69. Expected (random gene set): overlap 0.5, set sizes 147 and 69. p=1.92e-34]
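A common way to compare an observed overlap with the overlap expected by chance is the hypergeometric test, sketched below. Several things here are assumptions rather than facts from the slides: that the p-value shown was obtained with this test, that the figure's numbers mean 147 up-regulated genes, 69 genes annotated to mitotic cell cycle and 24 genes in both, and that the reference set holds 20,000 genes. The function names are made up.

```python
from math import comb

def enrichment_p_value(n_reference, n_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    upper = min(n_category, n_selected)
    favourable = sum(
        comb(n_category, k) * comb(n_reference - n_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_reference, n_selected)

def expected_overlap(n_reference, n_category, n_selected):
    """Overlap expected for a random gene set of size n_selected."""
    return n_selected * n_category / n_reference

# ASSUMPTION: a reference set of 20,000 genes; the slides do not state this number
N_REFERENCE = 20000
expected = expected_overlap(N_REFERENCE, n_category=69, n_selected=147)
p_value = enrichment_p_value(N_REFERENCE, n_category=69, n_selected=147, n_overlap=24)
```

Under these assumptions the expected overlap is about 0.5 genes, in line with the figure, and observing 24 is extremely unlikely by chance. The exact p-value depends on the reference set actually used.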