TRANSCRIPT
Making sense of gene expression data
Clustering
Visualization
→ Discrimination
Dirk Husmeier, Biomathematics & Statistics Scotland (BioSS)
JCMB, The King’s Buildings, Edinburgh EH9 3JZ, UK
http://www.bioss.ac.uk/∼dirk
Typical application:
• Predict tumour type or survival time for a given gene expression profile.
• Learn a predictive model from a training set of gene expression profiles plus target values (tumour class or survival time).
Linear models
Examples:
• Fisher linear discriminant analysis
• Linear maximum likelihood models
• Linear perceptrons
Non-linear models
Examples:
• Classification and regression trees (CART)
• Support vector machines
• Neural networks
Linear versus non-linear models

              Linear models                Non-linear models
Advantage     Training simple and robust   Unbiased
Shortcoming   Biased                       Training complex: heuristic
                                           tuning parameters, risk of
                                           over-fitting
• The “knowledge” of a neural network resides in its weights.
• Training: gradient descent on the error surface.
• This leads to the generalized Hebb rule for adapting the weights: weight change = learning rate × activity of parent node × error of child node.
• Errors of internal nodes are obtained by backpropagation.
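The update rule above can be sketched as follows. This is a minimal illustration on random placeholder data (not the authors' code): each weight change is the learning rate times the parent node's activity times the child node's error, with hidden-node errors obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                        # 20 profiles, 3 inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

W1 = rng.normal(scale=0.5, size=(3, 4))             # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))             # hidden -> output weights
lr = 0.5                                            # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1)                             # hidden activities
    return h, sigmoid(h @ W2)                       # output activity

mse_before = np.mean((y - forward(X)[1]) ** 2)

for epoch in range(500):                            # gradient descent on the error surface
    h, out = forward(X)
    err_out = (y - out) * out * (1 - out)           # error of the output node
    err_hid = (err_out @ W2.T) * h * (1 - h)        # hidden errors via backpropagation
    # Generalized Hebb rule: Δw = learning rate * parent activity * child error
    W2 += lr * h.T @ err_out
    W1 += lr * X.T @ err_hid

mse_after = np.mean((y - forward(X)[1]) ** 2)
```

The network sizes and learning rate here are arbitrary choices for illustration; the point is the shape of the update, not the particular architecture.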
How can we guard against overfitting?
Split the data into a training set and a cross-validation set. Use the latter for monitoring the generalization performance. When over-fitting sets in, stop the training process.
What is the shortcoming of this approach?
Cross-validation sacrifices a large proportion of the data, which is then no longer available for training. This reduces the training efficiency.
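The early-stopping scheme described above can be sketched on synthetic data (a hypothetical linear model, not the setup from the lecture): hold out a validation set, monitor its error after every update, and stop once it has failed to improve for a fixed number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
true_w = rng.normal(size=(5, 1))
y = X @ true_w + rng.normal(scale=0.5, size=(60, 1))   # noisy targets

X_tr, y_tr = X[:40], y[:40]        # training set
X_va, y_va = X[40:], y[40:]        # cross-validation set

w = np.zeros((5, 1))
lr, patience = 0.01, 10
best_val, best_w, wait = np.inf, w.copy(), 0

for step in range(2000):
    # one gradient-descent step on the training error
    grad = -2 * X_tr.T @ (y_tr - X_tr @ w) / len(X_tr)
    w -= lr * grad
    # monitor generalization performance on the held-out set
    val = np.mean((y_va - X_va @ w) ** 2)
    if val < best_val - 1e-6:
        best_val, best_w, wait = val, w.copy(), 0
    else:
        wait += 1
        if wait >= patience:       # validation error no longer improves: stop
            break
```

Note the shortcoming mentioned above is visible here: a third of the data (20 of 60 samples) is spent on monitoring rather than training.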
Classification and diagnostic prediction of cancer using gene expression profiling and artificial neural networks
Khan et al. (2001), Nature Medicine 7(6), 673–679
• Small, round blue cell tumours (SRBCT)
• Four sub-classes:
– Neuroblastoma (NB)
– Rhabdomyosarcoma (RMS)
– Non-Hodgkin lymphoma (NHL)
– Ewing family of tumours (EWS)
• Accurate diagnosis is essential → treatment options, responses to therapy and prognoses vary widely.
• Currently, no single test (light microscopy, immunohistochemistry, etc.) can precisely distinguish these cancers.
• Khan et al. discuss the classification of these tumours with neural networks on the basis of gene-expression profiling using cDNA microarrays.
• 2308 genes, 63 training examples, 25 test cases.
• How can you reduce the complexity of the input space?
• PCA: 10 leading principal components.
• Correct classification of all samples in the test set.
• Identification of genes most relevant to the classification.
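The dimensionality-reduction step can be sketched as follows: project the expression profiles onto the 10 leading principal components via an SVD of the centred data matrix. The dimensions follow the paper (63 training samples, 2308 genes), but the data here are random placeholders, not the SRBCT measurements.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(63, 2308))            # 63 samples x 2308 genes (placeholder data)

X_centred = X - X.mean(axis=0)             # centre each gene across samples
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
components = Vt[:10]                       # 10 leading principal components
X_reduced = X_centred @ components.T       # 63 x 10 matrix of network inputs
```

This turns a 2308-dimensional input space into 10 inputs per sample, which is what makes training a neural network on only 63 examples feasible.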
Taxonomy of machine-learning methods:

Machine learning
• Supervised
– Linear (e.g. Fisher linear discriminant analysis)
– Non-linear (e.g. neural networks)
• Unsupervised
– Clustering
  – Partitive (e.g. K-means)
  – Hierarchical
    – agglomerative or bottom-up (e.g. UPGMA)
    – divisive or top-down (e.g. BTSVQ)
– Visualization
  – Linear (e.g. PCA)
  – Non-linear (e.g. SOM, GTM)