Post on 19-Dec-2015

Statistical Classification for Gene Analysis based on Micro-array Data

Fan Li & Yiming Yang
hustlf@cs.cmu.edu

In collaboration with Judith Klein-Seetharaman

Principles of cDNA microarray

[Figure (G. Gibson et al.): cDNA microarray workflow. Labels: DNA clones, PCR purification, robot printing; reference and treated samples, reverse transcription, labeling with fluorescent dyes; hybridize target to microarray; excitation (Laser 1, Laser 2), emission, computer analysis.]

Microarray data: what does it look like?

[Figure: expression matrix with experiments Exp 1, Exp 2, Exp 3, …, Exp i, …, Exp M along one axis and genes G1, G2, …, GN-1, GN along the other.]

• Expression level of a gene across treatments
• Expression profiles of genes in a certain condition

Typical examples
• Conditions: heat shock, G phase in cell cycle, etc.
• Samples: liver cancer patient, normal person, etc.
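
As a rough illustration of the two ways of reading an expression matrix, the sketch below stores a tiny matrix as a list of rows. The values, the gene and experiment names, and the rows-are-genes layout are assumptions made for this example, not data from the slides.

```python
# Toy expression matrix: one row per gene, one column per experiment.
# All numbers are made up for illustration.
genes = ["G1", "G2", "G3"]
experiments = ["Exp 1", "Exp 2", "Exp 3", "Exp 4"]
matrix = [
    [0.2, 1.5, 1.7, 0.1],   # G1
    [2.0, 2.1, 1.9, 2.2],   # G2
    [0.9, 0.1, 0.0, 1.1],   # G3
]

def gene_profile(matrix, gene_index):
    """Expression level of one gene across all experiments (a row)."""
    return list(matrix[gene_index])

def experiment_profile(matrix, exp_index):
    """Expression of every gene in one experiment (a column)."""
    return [row[exp_index] for row in matrix]

g1_across_experiments = gene_profile(matrix, 0)
all_genes_in_exp2 = experiment_profile(matrix, 1)
```

Reading a row gives the behaviour of one gene across treatments; reading a column gives the profile of all genes under one condition or in one sample.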

AML/ALL micro-array dataset

This dataset can be downloaded from http://genome-www.standford.edu/clustering

Matrix
• Each row: a gene
• Each column: a patient (a sample)
• Each patient belongs to one of two disease types: AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia)
• The 72 patient samples are further divided into a training set (27 ALLs and 11 AMLs) and a test set (20 ALLs and 14 AMLs). The whole dataset covers 7129 probes from 6817 human genes.

Published work on AML/ALL

Classification task: gene expression -> {AML, ALL}

Techniques: Support Vector Machines (SVM), Rocchio-style and logistic regression classifiers

Main findings: classifiers perform better with a small subset (8) of genes than with thousands

Implication: Many genes are irrelevant or redundant?

Possible Relationship (Hypothesis)

[Figure: hypothesized network linking the disease node and Gene1 through Gene8.]

How can we find such a structure?

• Find the most informative genes (the “primary” ones): statistical feature selection (brief)
• Find the genes related (or “similar”) to the primary ones: unsupervised clustering (detailed), based on statistical patterns of genes distributed over microarrays
• Bayes network for causal reasoning (future direction)

Possible Relationship (Hypothesis)

[Figure: the hypothesized network of the disease node and Gene1 through Gene8, shown again.]

Feature selection

• Choose a small subset of input variables (a few instead of 7000+ genes, for example)
• In text categorization: features = words in documents; output variables = subject categories of a document
• In protein classification: features = amino acid motifs, …; output variables = protein categories
• In genome micro-array data: features = “useful” genes; output variables = whether a patient is diseased or not

Feature selection on micro-array (AML vs ALL)

• Golub-Slonim: GS-ranking (filtering method)
• Ben-Dor: TNoM-ranking (filtering method)
• Isabelle Guyon: recursive SVM (wrapper method)
  • Selected 8 genes (out of 1000+ in that dataset)
  • Accuracy 100%
• Our work (Fan & Yiming) (best)
  • Selected 3 genes (using ridge regression)
  • Accuracy 100%
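
To make the idea of a filtering method concrete, here is a minimal sketch of a signal-to-noise ranking in the spirit of GS-ranking, scoring each gene by (mean_a - mean_b) / (std_a + std_b). The function names, the use of the sample standard deviation, and the toy values are assumptions for illustration; this is not the code behind the results quoted in the slides.

```python
from statistics import mean, stdev

def gs_score(values_a, values_b):
    """Signal-to-noise score: (mean_a - mean_b) / (std_a + std_b)."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_genes(expression, labels, class_a, class_b):
    """Return (gene names sorted by decreasing |score|, score per gene).

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of class labels, one per sample.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == class_a]
        b = [v for v, y in zip(values, labels) if y == class_b]
        scores[gene] = gs_score(a, b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked, scores

# Made-up toy data: 3 samples per class.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
expression = {
    "gene_x": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],   # clearly separates the classes
    "gene_y": [2.0, 3.0, 1.0, 2.1, 2.9, 1.0],   # nearly uninformative
    "gene_z": [1.0, 1.5, 0.5, 3.0, 3.5, 2.5],   # higher in ALL
}
ranked, scores = rank_genes(expression, labels, "AML", "ALL")
```

A filter of this kind scores every gene independently of the classifier, which is why it is cheap but can keep several redundant genes near the top of the list.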

Feature selection experiments already done on this micro-array data

The 3 genes we found
• Id1882: CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage), M27891_at
• Id6201: INTERLEUKIN-8 PRECURSOR, Y00787_at
• Id4211: VIL2 Villin 2 (ezrin), X51521_at

Some analysis of the results we obtained

• The first two genes are strongly correlated with each other.
• The third gene is very different from the first two genes.
• 1st gene + 2nd gene performs poorly (10/34 errors)
• 1st gene + 3rd gene performs well (1/34 errors)
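
The effect described here can be reproduced on toy data. The sketch below uses a nearest-centroid classifier as a simple stand-in (it is not the ridge regression model used for the numbers above), and every sample value is hypothetical: gene2 is built to be roughly twice gene1, while gene3 carries complementary information.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def count_errors(train, test, feature_idx):
    """Fit a nearest-centroid classifier on the chosen features and
    return the number of misclassified test samples."""
    def pick(sample):
        return tuple(sample[i] for i in feature_idx)

    centroids = {}
    for label in {y for _, y in train}:
        centroids[label] = centroid([pick(x) for x, y in train if y == label])

    errors = 0
    for x, y in test:
        predicted = min(centroids, key=lambda c: sq_dist(pick(x), centroids[c]))
        if predicted != y:
            errors += 1
    return errors

# Hypothetical samples: ((gene1, gene2, gene3), label).
# gene2 is roughly 2 * gene1, so it is strongly correlated with gene1.
train = [
    ((0.0, 0.1, 1.0), "ALL"), ((2.0, 3.9, 1.0), "ALL"),
    ((1.0, 2.0, 0.0), "ALL"), ((1.0, 2.0, 2.0), "ALL"),
    ((3.0, 6.1, 4.0), "AML"), ((3.0, 5.9, 2.0), "AML"),
    ((2.0, 4.0, 3.0), "AML"), ((4.0, 8.0, 3.0), "AML"),
]
test = [
    ((2.5, 5.1, 0.5), "ALL"), ((0.5, 0.9, 1.0), "ALL"), ((1.5, 3.0, 1.5), "ALL"),
    ((1.5, 2.9, 3.5), "AML"), ((3.5, 7.0, 3.0), "AML"), ((2.5, 5.0, 2.5), "AML"),
]

errors_gene1_gene2 = count_errors(train, test, (0, 1))
errors_gene1_gene3 = count_errors(train, test, (0, 2))
```

On this toy set the correlated pair misclassifies the two samples that gene1 alone gets wrong, while the complementary pair classifies all six correctly.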

Question: as the next step, can we find more gene-gene relationships?

Several techniques are available:
• Clustering
• Bayesian network learning
• Independent component analysis
• …

Clustering Analysis in micro-array data

Clustering methods have already been widely used to find similar genes or common binding sites from micro-array data.

Many different clustering algorithms exist:
• Hierarchical clustering
• K-means
• SOM
• CAST
• …
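
As one concrete instance of the algorithms listed, the following is a minimal k-means sketch (Lloyd's algorithm) over gene expression profiles. The gene names, profile values, and the choice of initial centroids are assumptions for illustration, not the setup used in the experiments reported in these slides.

```python
def kmeans(points, init_centroids, iterations=20):
    """Plain k-means on a list of equal-length tuples.
    Returns (assignments, centroids)."""
    centroids = [tuple(c) for c in init_centroids]
    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, new_assignments) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
    return assignments, centroids

# Hypothetical expression profiles of four genes across four samples.
profiles = {
    "gene_a": (1.0, 1.1, 5.0, 5.2),
    "gene_b": (0.9, 1.0, 5.1, 5.0),
    "gene_c": (5.0, 5.1, 1.0, 0.9),
    "gene_d": (5.2, 4.9, 1.1, 1.0),
}
names = list(profiles)
points = [profiles[n] for n in names]

# Deterministic initialisation: start from the profiles of gene_a and gene_c.
assignments, centroids = kmeans(points, [profiles["gene_a"], profiles["gene_c"]])
clusters = dict(zip(names, assignments))
```

Genes with similar profiles across samples end up with the same cluster index; the result of k-means in general depends on the initial centroids, which are fixed here to keep the example reproducible.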

An example of hierarchical clustering analysis (from Spellman et al.)

Our clustering experiment on AML/ALL dataset

We cluster the top 1000 genes most relevant to the disease.

The feature-selection curve

Our clustering result on the top 1000 genes

Some analysis of the clustering result

• The first two genes always fall in the same cluster (cluster 1 in hierarchical clustering, cluster 2 in k-means clustering).
• The third gene never falls in the same cluster as the first two (it is in cluster 23 in hierarchical clustering and in cluster 1 in k-means clustering).
• This validates our previous analysis.

Disadvantages of Clustering

However…
• It cannot find the internal relationships inside one cluster
• It cannot find the relationships between clusters
• Genes connected to each other may not be in the same cluster

Clustering vs Bayesian network learning (copied from David K. Gifford, Science, Vol. 293, Sept. 2001)

A counterexample for clustering analysis
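
The figure for this slide is not recoverable from the transcript. As a separate, hypothetical illustration of how a dependency can escape similarity-based clustering, the sketch below builds a target gene that is fully determined by a regulator yet has zero Pearson correlation with it. The variable names and values are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical log-ratio expression values of a regulator gene.
regulator = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Target responds to the magnitude of the regulator (nonlinear dependence).
target = [x * x for x in regulator]

# Follower tracks the regulator linearly.
follower = [2.0 * x + 1.0 for x in regulator]

r_target = pearson(regulator, target)
r_follower = pearson(regulator, follower)
```

A correlation-based clustering would group the follower with the regulator but would see no similarity between the regulator and the target, even though the target depends entirely on it.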

Bayesian network learning

• Thus a Bayesian network seems a much better technique if we want to model the relationships among genes.
• Researchers have run experiments and constructed Bayesian networks from micro-array data.
• They found that a few genes have many connections with other genes.
• They used prior biological knowledge to validate the learned edges (interactions between genes) and found them reasonable.
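
For readers new to the model, here is a small sketch of what a Bayesian network encodes: the joint distribution factorizes into one conditional probability per node given its parents. The three-node chain and all probabilities below are hypothetical, and the sketch shows inference in a fixed network, not structure learning from micro-array data.

```python
from itertools import product

# Hypothetical network: GeneA -> GeneB -> Disease (all variables binary).
p_a = {1: 0.3, 0: 0.7}
p_b_given_a = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}   # p_b_given_a[a][b]
p_d_given_b = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}   # p_d_given_b[b][d]

def joint(a, b, d):
    """P(a, b, d) = P(a) * P(b | a) * P(d | b)."""
    return p_a[a] * p_b_given_a[a][b] * p_d_given_b[b][d]

def prob_disease_given_a(a):
    """P(Disease = 1 | GeneA = a), summing out GeneB."""
    numerator = sum(joint(a, b, 1) for b in (0, 1))
    denominator = sum(joint(a, b, d) for b, d in product((0, 1), repeat=2))
    return numerator / denominator

# The factorized joint must sum to 1 over all assignments.
total = sum(joint(a, b, d) for a, b, d in product((0, 1), repeat=3))
p_high = prob_disease_given_a(1)
p_low = prob_disease_given_a(0)
```

In this chain GeneA influences the disease only through GeneB, which is the kind of directed, indirect relationship that cluster membership alone does not express.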

An example of a Bayesian network

Part of the Bayesian network constructed by Nir Friedman. The graph has 800 genes (nodes) in total, all of them cell-cycle regulated genes.

Our plan in genetic regulatory network construction

There are several possible directions:

• Use feature selection to make the network learning task more robust and less computationally expensive.
• Learn gene regulatory networks on microarray datasets with disease labels (so that we may find pathways relevant to a specific disease).
• Use ICA to find hidden variables (hidden layers) and check their consistency with the Bayes network learning result.

Our plan in genetic regulatory network construction

Use prior biological knowledge about gene networks, such as “network motifs”. The following example is copied from Shai S. Shen-Orr et al., Nature Genetics, 2002. Previous network learning algorithms have not considered these characteristics.
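
One well-known network motif is the feed-forward loop, in which X regulates Y and both X and Y regulate Z. The sketch below finds such triples in a directed edge list. The gene names and edges are hypothetical, and how motif knowledge would be fed into a network learning algorithm is left open here, as it is in the plan above.

```python
def feed_forward_loops(edges):
    """Return all (x, y, z) with directed edges x->y, y->z and x->z."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)

    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            for z in targets.get(y, ()):
                # z must also be a direct target of x; skip self-loops.
                if z in x_targets and len({x, y, z}) == 3:
                    loops.append((x, y, z))
    return sorted(loops)

# Hypothetical regulatory edges (regulator, target).
edges = [
    ("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC"),  # one feed-forward loop
    ("geneC", "geneD"),
]
loops = feed_forward_loops(edges)
```

Counting how often such patterns occur in a learned network, compared with what is expected, is one way a prior over structures could favour biologically plausible graphs.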

References

• Using Bayesian Networks to Analyze Expression Data. Friedman, N., Linial, M., Nachman, I. Journal of Computational Biology, 7:601-620, 2000.

• Gene selection for cancer classification using support vector machines. Guyon, I. et al. Machine Learning, 46:389-422.

• Cluster analysis and display of genome-wide expression patterns. Eisen, M.B. et al. PNAS, 95:14863-14868, 1998.

• Clustering gene expression patterns. Ben-Dor, A., Shamir, R., and Yakhini, Z. Journal of Computational Biology, 6(3/4):281-297, 1999.