Gene Expression Data Analysis
BMIF 310, Fall 2009
Bing Zhang, Department of Biomedical Informatics, Vanderbilt University

Gene expression technologies (summary)

  Hybridization-based approaches

    Printed arrays

      cDNA arrays: customizable, high array variation

    Synthesized oligo arrays

      Affymetrix arrays: high density, low array variation

        Classic arrays: probes on the 3’ UTR

        Exon arrays: probes on all known exons

        Tiling arrays: probes spread across the genomic sequence

  Sequencing-based approaches

    Traditional Sanger sequencing-based approaches

      Serial analysis of gene expression (SAGE): ~10 bp tag at the 3’ end

    2nd generation sequencing-based approaches

      RNA-Seq: high-throughput unbiased profiling

Bioinformatics tasks

[Figure: workflow of a microarray study, covering the tasks below]

  Biological question

  Experiment design

  Microarray experiment

  Image analysis

  Normalization

  Data mining: differential expression, clustering, classification, network analysis

  Biological interpretation

  Hypothesis

  Experimental verification

  Supporting tasks: data storage, data integration, data visualization

Well begun is half done

  A clearly defined biological question

  Good control of potential sources of variation (biological and technical)

  Statistically sound microarray experimental design (replicates)

  Compliance with the MIAME standard (Minimum Information About a Microarray Experiment) for microarray information collection

  http://www.mged.org/Workgroups/MIAME/miame.html

Image analysis

  Analysis of the image of the scanned array in order to extract an intensity for each spot or feature on the array.

  Gridding: align a grid to the spots

  Segmentation: identify the shape of each spot

  Intensity extraction: extract an intensity for each spot and, where possible, for its surrounding background

  Background correction: subtract the background signal from the spot intensity to get a more accurate estimate of the biological signal from the spot
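The background correction step can be sketched in a few lines of Python. This is only an illustration: the function name and the `floor` parameter are assumptions, not part of the lecture, and real image analysis software uses more elaborate background models.

```python
def background_correct(foreground, background, floor=1.0):
    """Subtract the local background from each spot's foreground intensity.

    `floor` is an illustrative assumption: corrected values are clipped to it
    so that a later log transform is still defined.
    """
    return [max(f - b, floor) for f, b in zip(foreground, background)]


# Hypothetical intensities for three spots and their surrounding background.
spots = background_correct([1200.0, 310.0, 95.0], [100.0, 120.0, 110.0])
print(spots)
```

The third spot is dimmer than its background, so plain subtraction would give a negative value; the floor keeps it positive.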

Garbage in, garbage out

  Remove bad arrays

  Remove poor-quality spots

  Remove data points with low signal/noise ratio

  Remove data points with too many missing values

[Figure: example of a bad array]

Normalization

  The purpose of normalization is to remove systematic variation in a microarray experiment which affects the measured gene expression levels

  Systematic variation

    Unequal quantities of starting RNA

    Differences in labelling and detection efficiencies

    Topographical slide variation

    Scanner-introduced bias

Normalization method

  Multiply each array by a constant to make the mean (median) intensity the same for each individual array (Global normalization)

  Match the percentiles of each array (Quantile normalization)

  Adjust using a nonlinear smoothing curve

  Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples

  Adjust using spike control

[Figure: comparison of no normalization, global normalization, and quantile normalization]
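As a rough sketch of the first two methods listed above, the following Python code scales arrays to a common median (global normalization) and forces all arrays onto one shared intensity distribution (quantile normalization). The function names and the toy data are assumptions for illustration; using the mean of the array medians as the common target is one possible choice, and ties are broken by position rather than averaged.

```python
from statistics import mean, median


def global_normalize(arrays):
    """Multiply each array by a constant so that all arrays share one median."""
    medians = [median(a) for a in arrays]
    target = mean(medians)  # assumption: common target is the mean of the medians
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Replace the k-th smallest value of every array by the same reference value."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # gene indices, lowest first
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result


# Three hypothetical arrays (one list per array), four genes each.
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
scaled = global_normalize(arrays)
matched = quantile_normalize(arrays)
```

After quantile normalization every array contains exactly the same set of values; only the assignment of values to genes differs between arrays.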

Get to know your data matrix

ID          Samp1   Samp2   Samp3   ……   Samp(m-1)   Samp(m)
Gene1        5.25    6.37    7.30   ……        6.02      7.17
Gene2        6.96    5.01    7.23   ……        5.87      5.02
Gene3        5.44    5.67    4.23   ……        5.33      6.34
Gene4       12.83   10.35   12.56   ……        9.98     11.13
Gene5        3.20    3.07    3.19   ……        3.27      3.16
Gene6        7.74    7.66    7.12   ……        7.46      7.95
……             ……      ……      ……   ……          ……        ……
Gene(n-1)    8.92    8.52    7.62   ……        7.90      8.02
Gene(n)      6.06    6.04    6.35   ……        6.44      6.60

Rows: genes. Columns: samples.

Differential Gene Expression

  n-fold change

    Arbitrarily selected fold-change cut-offs

    Usually ≥ 2-fold

  Pros

    Intuitive and easily visualised

    Simple and rapid

  Cons

    Statistically inefficient

    Magnitude does not necessarily indicate importance

    Often too restrictive

MVA plot (M versus A), where x and y are the intensities of a gene in the two samples being compared

M: log ratio, log2(x/y)

A: average log intensity, log2(x*y)/2
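A minimal Python sketch of the fold-change approach and of the M and A values follows. The function names, the variable names `x` and `y` for the two samples, and the example intensities are assumptions made for illustration.

```python
import math


def ma_values(x, y):
    """Return (M, A) lists for paired intensities x and y of the same genes."""
    m = [math.log2(a / b) for a, b in zip(x, y)]          # log ratio
    a_vals = [0.5 * math.log2(a * b) for a, b in zip(x, y)]  # average log intensity
    return m, a_vals


def fold_change_calls(x, y, cutoff=2.0):
    """Flag genes whose fold change is at least `cutoff` in either direction."""
    threshold = math.log2(cutoff)
    return [abs(math.log2(a / b)) >= threshold for a, b in zip(x, y)]


# Hypothetical raw intensities for three genes in two samples.
x = [400.0, 100.0, 50.0]
y = [100.0, 100.0, 200.0]
m, a = ma_values(x, y)
calls = fold_change_calls(x, y)
```

Working on the log2 scale makes a 2-fold increase (M = 1) and a 2-fold decrease (M = -1) symmetric, which is why the cut-off is applied to the absolute value of M.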

Differential Gene Expression

  Statistical tests

    Test for significant change between repeated measurements of a variable in two groups or in multiple groups

    Calculate a test statistic, select a cut-off value, and decide whether to reject the null hypothesis

  Methods

    Two independent groups

      Student’s t-test: parametric

      Mann-Whitney U test: nonparametric

    Two or more independent groups

      ANOVA (analysis of variance): parametric

      Kruskal-Wallis test: nonparametric
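To make the "calculate a test statistic" step concrete, here is a small sketch of the two-sample Student's t statistic (equal-variance form) for one gene. The function name and the example values are assumptions. The sketch stops at the statistic and its degrees of freedom; converting them to a p-value requires the t distribution, which a statistics package would normally supply.

```python
import math
from statistics import mean, variance


def student_t(group1, group2):
    """Return (t, degrees_of_freedom) for two independent groups.

    Uses the pooled-variance (equal variance) form of Student's t-test.
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    t = (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical log expression values of one gene in two groups of three samples.
t_stat, df = student_t([7.0, 8.0, 9.0], [4.0, 5.0, 6.0])
```

In a microarray analysis the same calculation is repeated for every row of the data matrix, which is what creates the multiple testing problem.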

Correction for multiple testing

  Why?

    In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, about 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed

    Unadjusted p-values are likely to inflate Type I errors (false positives)

  Methods

    Control the family-wise error rate (FWER), the probability that there is at least one type I error in the entire set (family) of hypotheses tested, e.g. standard Bonferroni correction: uncorrected p-value x number of genes tested

    Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses, e.g. Benjamini and Hochberg correction
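Both corrections can be written as adjusted p-values. The sketch below is one common formulation in plain Python; the function names and the example p-values are assumptions for illustration.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: p x number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(pvals)
fdr_adjusted = benjamini_hochberg(pvals)
```

A gene is then called significant when its adjusted p-value is below the chosen level. The Bonferroni values are never smaller than the Benjamini-Hochberg values, which reflects that controlling the FWER is the stricter criterion.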

What is clustering

  Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within group similarities are larger than between group similarities

  Unsupervised technique: does not require any prior knowledge to be incorporated in the process

Why clustering?

  Exploratory data analysis, providing rough maps and suggesting directions for further study

  Representing distances among high-dimensional expression profiles in a concise, visually effective way, such as a tree or dendrogram

  Identifying candidate subgroups in complex data, e.g. novel sub-types in cancer or co-expressed genes

Clustering method

  Hierarchical clustering: generate a hierarchy of clusters going from 1 cluster to n clusters

  Partitioning: divide the data into g groups using some reallocation algorithm, e.g. K-means

  Fuzzy clustering: each object has a set of weights suggesting the probability of it belonging to each cluster

Hierarchical clustering

  Agglomerative clustering (bottom-up)

    Start with n groups, join the two closest, continue

  Divisive clustering (top-down)

    Start with 1 group, split into 2, then into 3, …, into n

  Requires distance measurements

    Between two objects

    Between clusters

Between objects distance measurement

  Euclidean distance

    Focus on the absolute expression value

  Pearson correlation coefficient

    Focus on the expression profile shape

    Parametric: assumes the data are normally distributed and follow a linear relationship

  Spearman correlation coefficient

    Focus on the expression profile shape

    Non-parametric: no distributional assumption

    Less sensitive than Pearson
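The three measures can be compared on toy profiles with a short Python sketch. The function names and the example profiles are assumptions; a correlation r is usually turned into a distance as 1 - r before clustering.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Pearson correlation coefficient between two profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


gene_a = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]   # same shape, higher absolute level
curved = [1.0, 4.0, 9.0, 16.0]       # same ordering, nonlinear relationship
```

Here `shifted` is far from `gene_a` in Euclidean terms but perfectly correlated with it, while `curved` has a perfect Spearman correlation and a Pearson correlation below 1.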

Different measurement, different distance

[Figure: expression profiles of GeneA to GeneD]

Most similar profile to GeneA (blue) under each distance measure:

  Euclidean: GeneB (pink)

  Pearson: GeneC (green)

  Spearman: GeneD (red)

Between cluster distance measurement

  Single linkage: the smallest of all pairwise distances between members of the two clusters

  Complete linkage: the largest of all pairwise distances between members of the two clusters

  Average linkage: the average of all pairwise distances between members of the two clusters
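The following is a naive sketch of agglomerative clustering with the three linkage rules, assuming Euclidean distance between objects. The function names and example points are hypothetical, and the implementation favours clarity over speed.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    if linkage == "average":
        return sum(d) / len(d)
    raise ValueError("unknown linkage: " + linkage)


def agglomerate(points, linkage="average"):
    """Bottom-up clustering; returns the merges as (cluster, cluster, height)."""
    clusters = [[i] for i in range(len(points))]  # start with n singleton groups
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))  # join the two closest
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Four hypothetical one-dimensional objects.
points = [[0.0], [1.0], [5.0], [6.5]]
for linkage in ("single", "complete", "average"):
    print(linkage, agglomerate(points, linkage)[-1][2])
```

The recorded heights are what a dendrogram draws: on these points the final join happens at a different height under each linkage rule even though the same groups are formed.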

Hierarchical clustering

  Dendrogram

    Output of a hierarchical clustering

    Tree structure with the genes or samples as the leaves

    The height of the join indicates the distance between the left branch and the right branch

  Problems

    Hard to define distinct clusters

What is classification

  Classification algorithms are methods to classify objects into predefined classes

  Supervised technique: requires training data and predefined classes

  Two-step process

    Model construction: describe a set of predetermined classes using training data

    Model application: classify new objects into the predefined classes

Classification methods

  K-nearest neighbor

  Decision tree

  Support vector machine

  Naïve Bayes classifier

  Artificial neural network

  …

Feature selection

  Microarray data have a large number of variables (genes) relative to very few observations (samples), so we need to select a subset of genes likely to be predictive (i.e. strongly associated with the classes to be predicted)

Model construction

Training Data

Sample   GeneA   GeneB   Tumor
A        H       H       N
B        H       L       Y
C        L       L       N
D        H       L       Y
E        L       L       N
F        L       H       N

Classification Algorithms

Classifier (Model)

IF GeneA = ‘H’ AND GeneB = ‘L’ THEN Tumor = ‘yes’

Model application

New objects

Sample   GeneA   GeneB   Tumor
Z        H       L       ?

Classifier (Model)

IF GeneA = ‘H’ AND GeneB = ‘L’ THEN Tumor = ‘yes’

Sample Z = ‘Tumor’? Yes
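The two steps can be mirrored in a few lines of Python, using the training table and rule from the slides. The rule is written by hand here rather than learned by an algorithm, and the function and variable names are assumptions.

```python
# Training data from the slide: (sample, GeneA, GeneB, tumor).
TRAINING = [
    ("A", "H", "H", "N"),
    ("B", "H", "L", "Y"),
    ("C", "L", "L", "N"),
    ("D", "H", "L", "Y"),
    ("E", "L", "L", "N"),
    ("F", "L", "H", "N"),
]


def classify(gene_a, gene_b):
    """The classifier from the slide: tumor if GeneA is high and GeneB is low."""
    return "Y" if gene_a == "H" and gene_b == "L" else "N"


# Model construction check: the rule reproduces every training label.
training_errors = [s for s, a, b, label in TRAINING if classify(a, b) != label]

# Model application: classify the new sample Z (GeneA = H, GeneB = L).
prediction_z = classify("H", "L")
```

A rule that fits all six training samples is not guaranteed to generalise, which is why performance should be estimated on samples that were not used to build the model.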

K-Nearest neighbor

  Objects are points in an n-D space

  Compute the distance between the new case and all learning cases

  Return the most common class among the k learning cases nearest to the new case
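A compact sketch of the K-nearest neighbor procedure, assuming Euclidean distance and a simple majority vote. The function name, the choice k=3, and the training points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, new_case, k=3):
    """Predict the class of `new_case` from (point, label) training pairs."""
    # Compute the distance between the new case and all learning cases.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_case))
    # Return the most common class among the k nearest learning cases.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles with known tumor status.
train = [
    ([1.0, 1.0], "N"), ([1.2, 0.8], "N"), ([0.9, 1.1], "N"),
    ([5.0, 5.0], "Y"), ([5.2, 4.9], "Y"), ([4.8, 5.1], "Y"),
]
print(knn_predict(train, [4.9, 5.0]))
print(knn_predict(train, [1.0, 0.9]))
```

An odd k avoids tied votes in a two-class problem; the value of k itself is typically chosen by cross-validation.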

Over-fitting and cross-validation

  Over-fitting

    The classifier is very effective in classifying the training samples but not accurate enough for new samples

  Cross-validation

    Hold-out

      Split data into Training and Testing data

      Learn with Training data and estimate true error with Testing data

    N-fold

      Randomly split data into n parts; each part serves once as Testing data while the remaining parts serve as Training data

      Learn with Training data and estimate true error with Testing data in each split separately

      Average test performance

    Leave-one-out

      Leave one case out for Testing

      Learn with the remaining data and estimate true error with the Testing case

      Repeat for every case and average test performance
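The splitting schemes can be sketched as index generators in Python. All names here are assumptions, and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the testing indices.

```python
import random


def n_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        train = [i for i in indices if i not in test]
        yield train, test


def leave_one_out_splits(n_samples):
    """Leave-one-out is n-fold cross-validation with one sample per fold."""
    return n_fold_splits(n_samples, n_samples)


def cross_validate(splits, evaluate):
    """Average the score returned by `evaluate(train, test)` over all splits."""
    scores = [evaluate(train, test) for train, test in splits]
    return sum(scores) / len(scores)


five_fold = list(n_fold_splits(10, 5))
loo = list(leave_one_out_splits(6))
```

Hold-out validation corresponds to using a single one of these splits. Any feature selection should be repeated inside each training split, otherwise the testing data leak into model construction.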

Importance of biological interpretation

[Figure: normalize, filter, cluster and visualize]

  Identification of sets of genes of potential interest

  Numerical techniques do not reveal the biological implications hidden in expression data

  Evaluating the functional significance of large, heterogeneous and noisy sets of genes is a major challenge

Gene Ontology

  Structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products

  Three major categories that describe the attributes of biological process, molecular function and cellular component for a gene product

  Categories of concepts are held within a Directed Acyclic Graph (DAG)

  http://geneontology.org

Gene Ontology Tree Machine (GOTM)

  A web-based tool for the analysis and visualization of sets of genes identified from high-throughput technologies

  User friendly data navigation and visualization

  Statistical analysis suggesting biological areas that warrant further study

  http://bioinfo.vanderbilt.edu/gotm

GOTM

[Figure: GOTM enrichment example for the mitotic cell cycle category; gene set sizes shown are 147 and 69]

  Up-regulated genes: 24 in mitotic cell cycle (observed)

  Random gene set: 0.5 in mitotic cell cycle (expected)

  p=1.92e-34
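An observed versus expected comparison of this kind is often evaluated with a hypergeometric test. The sketch below shows that calculation in Python as one possible approach, not necessarily the exact procedure used by GOTM. The population size of 20,000 genes is a made-up assumption (the slides do not state the reference set), so the resulting p-value is not expected to reproduce the figure.

```python
from math import comb


def expected_overlap(population, set_a, set_b):
    """Expected overlap of two gene sets if one were drawn at random."""
    return set_a * set_b / population


def hypergeom_upper_tail(population, category_size, sample_size, observed):
    """P(overlap >= observed) when `sample_size` genes are drawn at random."""
    total = comb(population, sample_size)
    upper = min(category_size, sample_size)
    hits = sum(
        comb(category_size, k) * comb(population - category_size, sample_size - k)
        for k in range(observed, upper + 1)
    )
    return hits / total


POPULATION = 20000  # hypothetical number of genes in the reference set
expected = expected_overlap(POPULATION, 147, 69)
p_value = hypergeom_upper_tail(POPULATION, 69, 147, 24)
print(expected, p_value)
```

Because Python integers are exact, the binomial coefficients can be summed without overflow before the final division.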
