Whole Genome Expression Analysis


In this presentation…

Part 1 – Gene Expression Microarray Data

Part 2 – Global Expression & Sequence Data Analysis

Part 3 – Proteomic Data Analysis

Part 1 – Microarray Data

Examples of Problems

Gene sequence problems: Given a DNA sequence, state which sections are coding or noncoding regions, which sections are promoters, etc.

Protein structure problems: Given a DNA or amino acid sequence, state what structure the resulting protein takes.

Gene expression problems: Given DNA/gene microarray expression data, infer either clinical or biological class labels, or the genetic machinery that gives rise to the expression data.

Protein expression problems: Study expression of proteins and their function.

Microarray Technology

Basic idea:

The state of the cell is determined by its proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of mRNA copies is the expression level of a gene.

Microarray technology allows us to measure the expression of thousands of genes at once.

Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.

Oligo vs. cDNA Arrays

[Figure: comparison of the two array formats, from Lockhart and Winzeler, 2000]

What is whole genome expression analysis?

Clustering algorithms:
– Hierarchical clustering
– K-means clustering
– Principal component analysis
– Self-organizing maps

Beyond clustering:
– Support vector machines
– Automatic discovery of regulatory patterns in promoter regions
– Bayesian network analysis

What is whole genome expression analysis?

Messenger RNA is only an intermediate of gene expression

What is whole genome expression analysis? (continued)

Why measure mRNA?


Why Separate Feature Selection?

• Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small # of records and a large # of genes

• We first reduce the number of genes by a linear method, e.g. T-values

• Heuristic: select genes from each class
• Then apply a favorite machine learning algorithm

Feature selection approach

• Rank genes by measure; select top 200-500

• T-test for mean difference:

  T = (Avg₁ − Avg₂) / √(s₁²/N₁ + s₂²/N₂)

• Signal to noise (S2N):

  S2N = (Avg₁ − Avg₂) / (s₁ + s₂)

(Avgᵢ, sᵢ, Nᵢ: mean, standard deviation, and number of samples in class i)

• Other: information-based, biological?

• Almost any method works well with a good feature selection
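A minimal sketch of both rankings in Python, assuming a genes × samples matrix `X` and 0/1 class labels `y` (illustrative names, not from the slides):

```python
import numpy as np

def rank_genes(X, y, top_k=200):
    """Rank genes by |T-value|; X is (n_genes, n_samples), y holds 0/1 labels."""
    X1, X2 = X[:, y == 0], X[:, y == 1]
    avg1, avg2 = X1.mean(axis=1), X2.mean(axis=1)
    s1, s2 = X1.std(axis=1, ddof=1), X2.std(axis=1, ddof=1)
    n1, n2 = X1.shape[1], X2.shape[1]

    t = (avg1 - avg2) / np.sqrt(s1**2 / n1 + s2**2 / n2)  # T-value
    s2n = (avg1 - avg2) / (s1 + s2)                       # signal to noise

    # Keep the top_k genes with the largest |T| (could rank by |S2N| instead).
    return np.argsort(-np.abs(t))[:top_k], t, s2n
```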

Heatmap Visualization of Selected Fields (ALL vs. AML)

Heatmap visualization is done by normalizing each gene to mean 0, std. 1, to get a picture like this.

Good correlation overall

[Heatmap figure: AML-related and ALL-related gene clusters]

Possible outlier: sample #12
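A minimal sketch of that normalization step; `X_sel` is an illustrative name for the expression matrix restricted to the selected genes, and the data here is synthetic:

```python
import numpy as np
import matplotlib.pyplot as plt

def zscore_rows(X):
    """Normalize each gene (row) to mean 0, std 1, as done for the heatmap."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Synthetic data standing in for the selected genes (50 genes x 38 samples).
X_sel = np.random.default_rng(0).normal(loc=5, scale=2, size=(50, 38))

plt.imshow(zscore_rows(X_sel), aspect="auto", cmap="bwr")
plt.colorbar(label="normalized expression")
plt.xlabel("samples"); plt.ylabel("genes")
plt.show()
```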

Controlling False Positives

[Table: CD37 antigen expression values for individual samples with class labels (1, 1, 2, 2)]

Mean difference between classes: T-value = −3.25; significance: p = 0.0007

Class  Avg     Std
1      2287.9  1452.4
2      4457.5  2010.3
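As a worked check, the T-value can be recomputed from these summary statistics. The class sizes used below (27 and 11) are an assumption based on the 38-sample ALL/AML training set described in Part 3; they do not appear on the slide:

```python
import math

# Class summary statistics from the slide; n1, n2 are assumed (27 vs. 11).
avg1, s1, n1 = 2287.9, 1452.4, 27
avg2, s2, n2 = 4457.5, 2010.3, 11

t = (avg1 - avg2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
print(round(t, 2))  # -3.25, matching the slide
```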

Controlling False Positives with Randomization

[Table: the same CD37 antigen values with the original class labels (1, 1, 2, 2) and with randomized labels (2, 1, 1, 2)]

Randomizing the class labels while keeping the expression values gives T-value = −1.1 for CD37 antigen.

Randomization is less conservative and preserves the inner structure of the data.

Controlling false positives with randomization, II

[Table: gene expression values with class labels randomized repeatedly]

Randomize 500 times; the bottom 1% of the randomized T-values is −2.08.

Select potentially interesting genes at the 1% level.
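A minimal sketch of the randomization procedure, assuming a samples × genes matrix `X` and 0/1 labels `y` (illustrative names):

```python
import numpy as np

def t_values(X, y):
    """Per-gene T-values; X is (n_samples, n_genes), y holds 0/1 labels."""
    X1, X2 = X[y == 0], X[y == 1]
    num = X1.mean(axis=0) - X2.mean(axis=0)
    den = np.sqrt(X1.var(axis=0, ddof=1) / len(X1)
                  + X2.var(axis=0, ddof=1) / len(X2))
    return num / den

def randomization_cutoff(X, y, n_perm=500, pct=1.0, seed=0):
    """Permute labels n_perm times; return the bottom pct% of null T-values."""
    rng = np.random.default_rng(seed)
    null_t = np.concatenate(
        [t_values(X, rng.permutation(y)) for _ in range(n_perm)])
    return np.percentile(null_t, pct)   # e.g. about -2.08 on the slide

# Genes whose actual T-value falls below the cutoff are potentially interesting:
# interesting = np.where(t_values(X, y) < randomization_cutoff(X, y))[0]
```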

Part 2 – Microarray Data Classification

Classification

• Desired features:
– robust in the presence of false positives
– understandable
– returns confidence/probability
– fast enough

• simplest approaches are most robust

Popular Classification Methods

• Decision trees/rules – find the smallest gene sets, but also false positives
• Neural nets – work well if the # of genes is reduced
• SVMs – good accuracy, do their own gene selection, hard to understand
• K-nearest neighbor – robust for small # of genes
• Bayesian nets – simple, robust

Model-building Methodology

• Select the gene set (and other parameters) on the training set.

• When the final model is built, evaluate it on an independent test set that was not used during model building (a minimal sketch follows).
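A minimal sketch of this train/test discipline on synthetic data; the classifier and the 34-sample test split are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7128))          # 72 samples x 7128 genes (synthetic)
y = rng.integers(0, 2, size=72)

# Hold out a test set that plays no role in gene selection or tuning.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=34, random_state=0)

# ... select genes and tune parameters on (X_tr, y_tr) only ...
model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print(model.score(X_te, y_te))           # one final evaluation on the test set
```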

Selecting the best gene set

[Plot: average cross-validation error (0–30%) vs. genes per class (1–40)]

We tested 9 gene-set sizes with 10-fold cross-validation (90 neural nets). Select the gene set with the lowest average error; heuristically, use at least 10 genes overall. A sketch of this selection loop follows.
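A minimal sketch of the selection loop on synthetic data; a small scikit-learn MLP stands in for the slides' neural nets, and ranking by signed T-value within each fold is an assumption about the details:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7128))                  # synthetic training data
y = np.r_[np.zeros(19, int), np.ones(19, int)]   # balanced 0/1 labels

def signed_t(X, y):
    """Signed per-gene T-value; X is (n_samples, n_genes)."""
    X1, X2 = X[y == 0], X[y == 1]
    num = X1.mean(0) - X2.mean(0)
    return num / np.sqrt(X1.var(0, ddof=1) / len(X1)
                         + X2.var(0, ddof=1) / len(X2))

def cv_error(X, y, genes_per_class, n_splits=10):
    errs = []
    for tr, va in StratifiedKFold(n_splits).split(X, y):
        order = np.argsort(signed_t(X[tr], y[tr]))  # rank on training fold only
        # Most-negative genes favor one class, most-positive the other.
        sel = np.r_[order[:genes_per_class], order[-genes_per_class:]]
        net = MLPClassifier(max_iter=2000, random_state=0)
        net.fit(X[tr][:, sel], y[tr])
        errs.append(1 - net.score(X[va][:, sel], y[va]))
    return np.mean(errs)

sizes = [1, 2, 3, 4, 5, 10, 20, 30, 40]   # the 9 gene-set sizes tested
best = min(sizes, key=lambda g: cv_error(X, y, g))
print(best)
```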

Results on the test data

• Evaluating the 10-genes-per-class model on the test data (34 samples) gives:
– 33 correct predictions (97% accuracy)
– 1 error, on sample 66

• Actual class: AML; net prediction: ALL

• Net confidence was low

Classification - other applications

• Combining clinical and genetic data

• Outcome / treatment prediction:
– age, sex, and stage of disease are useful
– e.g., if the data come from a male patient, ovarian cancer can be ruled out

Classification: Multi-Class

Similar approach:

• select top genes most correlated to each class

• select best subset using cross-validation

• build a single model separating all classes

• Advanced (sketched below):
– build a separate model for each class vs. the rest
– choose the model making the strongest prediction
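A minimal sketch of the one-vs-rest scheme; scikit-learn's OneVsRestClassifier applies exactly this "strongest prediction wins" rule. The data is synthetic, shaped like the brain example below:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 7000))          # 42 examples x ~7,000 genes
y = rng.integers(0, 5, size=42)          # 5 classes

# One binary model per class vs. the rest; the class whose model returns
# the largest decision value makes the prediction.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ovr.predict(X[:3]))
```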

Multi-class Data Example

• Brain data: Pomeroy et al., Nature 415, Jan 2002
– 42 examples, about 7,000 genes, 5 classes

• Selected top 100 genes most correlated to each class

• Selected the best subset by testing subsets of 1, 2, …, 20 genes per class, with leave-one-out cross-validation for each

Brain data results

[Plot: average leave-one-out error vs. number of genes (same number from each class)]

Optimal set: 8 genes per class (4 errors out of 42)

Part 3 – Classification of Cancer Microarray Data

Cancer Classification

38 training examples of myeloid (AML) and lymphoblastic (ALL) leukemias; Affymetrix Human 6800 arrays (7,128 genes, including control genes)

34 examples to test classifier

Results: 33/34 correct

[Figure: test data plotted by d, the perpendicular distance from the separating hyperplane]
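A minimal sketch of this setup with synthetic data in place of the real arrays; `decision_function` returns the signed margin, which is proportional to the perpendicular distance d:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(38, 7128))    # 38 training samples (synthetic)
y_train = rng.integers(0, 2, size=38)
X_test = rng.normal(size=(34, 7128))     # 34 test samples

svm = SVC(kernel="linear").fit(X_train, y_train)
d = svm.decision_function(X_test)        # signed margin; d / ||w|| gives the
print(d[:5])                             # perpendicular distance from the plane
```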

Gene expression and Coregulation

Nonlinear classifier

Nonlinear SVM

A nonlinear SVM does not help when using all genes, but does help when removing the top genes ranked by signal to noise (Golub et al.). A sketch of this comparison follows.
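A minimal sketch of that comparison on synthetic data; removing 100 top-ranked genes is an illustrative choice:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 7128))
y = np.r_[np.zeros(36, int), np.ones(36, int)]

# Rank genes by signal to noise and drop the 100 top-ranked ones.
s2n = (X[y == 0].mean(0) - X[y == 1].mean(0)) / (X[y == 0].std(0) + X[y == 1].std(0))
keep = np.argsort(-np.abs(s2n))[100:]

# Compare a linear and an RBF-kernel SVM on the remaining genes.
for kernel in ("linear", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel), X[:, keep], y, cv=5).mean()
    print(kernel, round(acc, 3))
```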

Rejections

Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes.

We need to introduce the concept of rejection into the SVM.

[Figure: two-gene space (g1, g2) with Normal, Cancer, and Reject regions]

Rejections for SVM: Estimating a CDF

Using the regularized solution, estimate P(c = 1 | d), the probability of class 1 as a function of the distance d from the hyperplane.

[Plot: P(c = 1 | d) vs. 1/d, with the .95 confidence level marked]

95% confidence (p = .05) corresponds to d = .107
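A minimal sketch of the rejection rule: fit a sigmoid estimate of P(c = 1 | d) on training distances (in the spirit of Platt scaling, which may differ from the slides' exact CDF estimate) and reject predictions below 95% confidence. Data and fitted threshold are illustrative; the d = .107 on the slide comes from the actual experiment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 50))                    # synthetic: 38 samples, 50 genes
y = np.r_[np.zeros(19, int), np.ones(19, int)]

svm = SVC(kernel="linear").fit(X, y)
d = svm.decision_function(X).reshape(-1, 1)
cal = LogisticRegression().fit(d, y)             # sigmoid estimate of P(c=1 | d)

def predict_with_reject(X_new, conf=0.95):
    """Predict the most probable class, or -1 ("reject") below conf."""
    p = cal.predict_proba(svm.decision_function(X_new).reshape(-1, 1))
    labels = p.argmax(axis=1)
    return np.where(p.max(axis=1) >= conf, labels, -1)

print(predict_with_reject(rng.normal(size=(5, 50))))
```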

Results with rejections

Results: 31 correct, 3 rejected of which 1 is an error

[Figure: test data plotted by distance d, showing the rejection region]

Why Feature Selection

• SVMs as stated use all genes/features

• Molecular biologists and oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want to know which genes are most important in discriminating

• Practical reasons: a clinical device that measures thousands of genes is not financially practical

• Possible performance improvement

Results with Gene Selection

AML vs. ALL: 40 genes, 34/34 correct, 0 rejects; 5 genes, 31/31 correct, 3 rejects of which 1 is an error.

B vs. T cells for ALL: 10 genes, 33/33 correct, 0 rejects.

[Figures: test data plotted by distance d for each experiment]

Leave-one-out Procedure

The Basic Idea

Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the one that minimizes the bound.

When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.

The rescaling can be done in the input space or in a “Principal Components” space.
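A minimal sketch of the simplest linear instance of this idea: for a linear SVM, |w_j| can serve as the scaling of feature j, so keeping the largest-|w| features approximates one pruning step (the idea behind recursive feature elimination). The full gradient descent on the LOO bound is not shown; this is an assumption-laden stand-in, not the slides' exact method:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7128))                  # synthetic training data
y = np.r_[np.zeros(19, int), np.ones(19, int)]

svm = SVC(kernel="linear").fit(X, y)
w = np.abs(svm.coef_.ravel())    # |w_j| acts as the scaling of feature j
keep = np.argsort(-w)[:40]       # keep the 40 features with the largest |w_j|
print(keep[:10])
```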