
Page 1: Computational Genomics and Proteomics

Computational Genomics and Proteomics

Lecture 9

CGP Part 2: Introduction

Based on: Martin Bachler slides

Page 2: Computational Genomics and Proteomics

Overview CGP part 2

• Lesson 9: Introduction, description of assignment, short overview papers

• Lesson 10: Mass Spectrometric Data Analysis (Guest lecturer: Thang Pham, VUmc)

• Lesson 11: Team + E.M. meeting 1

• Lesson 12: Team + E.M. meeting 2

• Lesson 13: Presentation/Evaluation Part 2

Page 3: Computational Genomics and Proteomics

CGP part 2

• Main focus of CGP2: Feature Selection

• Three papers available for the teams' study
  – Each team (pair of students) chooses one paper
  – Critical study of the paper's topic/method (meeting with E.M.)
  – Proposal of method changes (meeting with E.M.)
  – Public presentation, discussion, evaluation (all teams)

• One review paper on Feature Selection techniques in Bioinformatics:
  – Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.

Page 4: Computational Genomics and Proteomics

Papers

• K. Jong, E. Marchiori, M. Sebag, A. van der Vaart. Feature Selection in Proteomic Pattern Data with Support Vector Machines. CIBCB, pp. 41-48, IEEE, 2004.

• K. Ye, A. Feenstra, A. IJzerman, J. Heringa, E. Marchiori. Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine Learning approach for feature weighting. Bioinformatics, 2007.

• K. Ye, E.W. Lameijer, M.W. Beukers, and A.P. IJzerman (2006) A two-entropies analysis to identify functional positions in the transmembrane region of class A G protein-coupled receptors. Proteins, Structure, Function and Bioinformatics 63, 1018-1030.

Page 5: Computational Genomics and Proteomics

Students’ Evaluation

• Based on:
  – Discussions
  – Report describing the topic/method and proposed changes
  – Presentation

Page 6: Computational Genomics and Proteomics

Problem: Where to focus attention?

• A universal problem of intelligent (learning) agents is where to focus their attention.

• What aspects of the problem at hand are important/necessary to solve it?

• Discriminate between the relevant and irrelevant parts of experience.

Page 7: Computational Genomics and Proteomics

What is Feature selection?

• Feature selection: Problem of selecting some subset of a learning algorithm’s input variables upon which it should focus attention, while ignoring the rest (DIMENSIONALITY REDUCTION)

• Humans/animals do that constantly!

Page 8: Computational Genomics and Proteomics

Feature Selection in ML?

Why even think about Feature Selection in ML?

- The information about the target class is inherent in the variables!

- Naive theoretical view: more features => more information => more discrimination power.

- In practice: there are many reasons why this is not the case!

- Also: optimization is (usually) good, so why not try to optimize the input coding?

Page 9: Computational Genomics and Proteomics

Feature Selection in ML? YES!

- Many explored domains have hundreds to tens of thousands of variables/features, with many irrelevant and redundant ones!

- In domains with many features, the underlying probability distribution can be very complex and very hard to estimate (e.g. dependencies between variables)!

- Irrelevant and redundant features can "confuse" learners!

- Limited training data!

- Limited computational resources!

- Curse of dimensionality!

Page 10: Computational Genomics and Proteomics

Curse of dimensionality

[Figure: scatter plots of the same positive and negative examples in one dimension (x1), two dimensions (x1, x2), and three dimensions (x1, x2, x3), each axis ranging from 0 to 1.]

Page 11: Computational Genomics and Proteomics

Curse of dimensionality

• The required number of samples (to achieve the same accuracy) grows exponentially with the number of variables!

• In practice: the number of training examples is fixed! => the classifier's performance will usually degrade for a large number of features!

In many cases the information that is lost by discarding variables is made up for by a more accurate mapping/sampling in the lower-dimensional space!

Page 12: Computational Genomics and Proteomics

Example of an ML problem: Gene selection from microarray data

– Variables: gene expression coefficients corresponding to the amount of mRNA in a patient‘s sample (e.g. tissue biopsy)

– Task: Separate healthy patients from cancer patients

– Usually there are only about 100 examples (patients) available for training and testing (!!!)

– Number of variables in the raw data: 6,000 – 60,000
– Does this work? ([8])

[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS Vol. 99, 6562-6566 (2002)

Page 13: Computational Genomics and Proteomics

Example of an ML problem: Text categorization

- Documents are represented by a vector whose dimension is the size of the vocabulary, containing word frequency counts (see the sketch below)

- Vocabulary ~ 15,000 words (i.e. each document is represented by a 15,000-dimensional vector)

- Typical tasks:
  - Automatic sorting of documents into web directories
  - Detection of spam email
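To make the representation concrete, here is a minimal sketch on a hypothetical three-document corpus; scikit-learn's CountVectorizer is an illustrative choice, not something prescribed by the slides.

```python
# Minimal sketch: bag-of-words representation of documents (hypothetical corpus).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cheap meds buy now",             # spam-like example (made up)
        "meeting agenda attached",        # regular mail example (made up)
        "buy cheap tickets now"]

vectorizer = CountVectorizer()            # vocabulary is built from the corpus itself
X = vectorizer.fit_transform(docs)        # sparse matrix: n_documents x vocabulary size

print(X.shape)                            # (3, 8) here; ~15,000 columns for a real corpus
print(vectorizer.get_feature_names_out()) # the words indexing the dimensions
```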

Page 14: Computational Genomics and Proteomics

Motivation

• Especially when dealing with a large number of variables, there is a need for dimensionality reduction!

• Feature Selection can significantly improve a learning algorithm’s performance!

Page 15: Computational Genomics and Proteomics

Approaches

• Wrapper
  – feature selection takes into account the contribution to the performance of a given type of classifier

• Filter
  – feature selection is based on an evaluation criterion quantifying how well features (or feature subsets) discriminate between the two classes

• Embedded
  – feature selection is part of the training procedure of a classifier (e.g. decision trees)

Page 16: Computational Genomics and Proteomics

Embedded methods

• Attempt to jointly or simultaneously train both a classifier and a feature subset

• Often optimize an objective function that jointly rewards accuracy of classification and penalizes the use of more features

• Intuitively appealing

Example: tree-building algorithms (see the sketch below)

Adapted from J. Fridlyand
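A minimal sketch of the tree example on synthetic data (the dataset and the scikit-learn decision tree are illustrative assumptions): the feature subset is chosen while the classifier is trained, and the fitted model reports which variables it actually used.

```python
# Minimal sketch: a decision tree as an embedded feature-selection method (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # 100 samples, 20 features
y = (X[:, 3] + X[:, 7] > 0).astype(int)   # only features 3 and 7 carry signal

# Training selects features and fits the classifier at the same time
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importance of the features as determined during training
print(np.argsort(tree.feature_importances_)[::-1][:5])
```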

Page 17: Computational Genomics and Proteomics

Approaches to Feature Selection

[Diagram contrasting the two approaches:]

• Filter approach: Input Features → Feature Selection by Distance Metric Score → Train Model → Model

• Wrapper approach: Input Features → Feature Selection Search → Feature Set → Train Model → Model, with the importance of features given by the model fed back into the search

Adapted from Shin and Jasso

Page 18: Computational Genomics and Proteomics

Filter methods

[Diagram: R^p → Feature selection → R^s (s << p) → Classifier design]

• Features are scored independently and the top s are used by the classifier (see the sketch below)

• Score: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.

Easy to interpret. Can provide some insight into the disease markers.

Adapted from J. Fridlyand
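A minimal sketch of a univariate filter on synthetic data; the F-statistic score and scikit-learn's SelectKBest are illustrative choices among the scores listed above.

```python
# Minimal sketch: filter method, each feature scored independently (synthetic data).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))           # p = 1000 features, 60 samples
y = rng.integers(0, 2, size=60)
X[y == 1, :10] += 1.0                     # make only the first 10 features informative

selector = SelectKBest(score_func=f_classif, k=10)   # keep the top s = 10 features
X_reduced = selector.fit_transform(X, y)             # (60, 10), passed on to the classifier

print(selector.get_support(indices=True))            # indices of the selected features
```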

Page 19: Computational Genomics and Proteomics

Problems with the filter method

• Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information

• Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)

• The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others

Adapted from J. Fridlyand

Page 20: Computational Genomics and Proteomics

Dimension reduction: a variant on a filter method

• Rather than retaining a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA, etc.), as in the sketch below

• The problem is that we are no longer dealing with one feature at a time, but rather a linear or possibly more complicated combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (even though we don't want to confuse the tasks)

• These methods tend not to work better than simple filter methods.

Adapted from J. Fridlyand
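A minimal sketch of this kind of projection on synthetic data, using scikit-learn's PCA as the illustrative choice of dimension-reduction method.

```python
# Minimal sketch: dimension reduction by projection onto principal components (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))           # 100 samples, 500 features

pca = PCA(n_components=10)                # project onto s = 10 principal components
X_proj = pca.fit_transform(X)             # each new column mixes all original features ("supergene")

print(X_proj.shape)                            # (100, 10)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```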

Page 21: Computational Genomics and Proteomics

Wrapper methods

[Diagram: R^p → Feature selection → R^s (s << p) → Classifier design]

• Iterative approach: many feature subsets are scored based on classification performance and the best one is used

• Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. (a forward-selection sketch follows below)

Adapted from J. Fridlyand
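A minimal sketch of greedy forward selection on synthetic data; the SVM classifier and 5-fold cross-validated accuracy are illustrative assumptions, and the repeated retraining shows why wrappers are expensive.

```python
# Minimal sketch: wrapper method with greedy forward selection (synthetic data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))             # 80 samples, 30 features
y = (X[:, 0] - X[:, 5] > 0).astype(int)   # two relevant features

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):                        # grow the subset greedily, up to 5 features
    # every candidate subset is scored by the classifier's cross-validated accuracy
    scores = {f: cross_val_score(SVC(), X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)    # the feature that helps the classifier most
    selected.append(best)
    remaining.remove(best)

print(selected)                           # greedily chosen feature subset
```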

Page 22: Computational Genomics and Proteomics

Problems with wrapper methods

• Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated

• No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only

• Easy to overfit

Adapted from J. Fridlyand

Page 23: Computational Genomics and Proteomics

Example: Microarray Analysis

[Workflow diagram:]

• "Labeled" cases: 38 bone marrow samples (27 AML, 11 ALL), each containing 7129 gene expression values

• Train a model (using Neural Networks, Support Vector Machines, Bayesian nets, etc.)

• Apply the model to 34 new, unlabeled bone marrow samples

• Outputs: an AML/ALL call for each new sample, and the key genes

A minimal sketch of this workflow follows below.
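The sketch uses random arrays with the same shapes as stand-ins for the real expression data; the linear SVM is just one of the model types listed above.

```python
# Minimal sketch: train on the 38 labeled samples, predict the 34 new ones (synthetic stand-in data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(38, 7129))     # stand-in for the 38 labeled samples
y_train = np.array([0] * 27 + [1] * 11)   # 0 = AML (27 samples), 1 = ALL (11 samples)
X_new = rng.normal(size=(34, 7129))       # stand-in for the 34 new, unlabeled samples

model = SVC(kernel="linear").fit(X_train, y_train)   # train the model on the labeled cases
predictions = model.predict(X_new)                   # AML/ALL call for each new sample
print(predictions)
```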

Page 24: Computational Genomics and Proteomics

Microarray Data Challenges:

• Few samples for analysis (38 labeled)

• Extremely high-dimensional data (7129 gene expression values per sample)

• Noisy data

• Complex underlying mechanisms, not fully understood

Page 25: Computational Genomics and Proteomics

Some genes are more useful than others for building classification models

Example: genes 36569_at and 36495_at are useful

Page 26: Computational Genomics and Proteomics

Some genes are more useful than others for building classification models

Example: genes 36569_at and 36495_at are useful

[Plot of the expression values of genes 36569_at and 36495_at, with the AML and ALL samples labeled]

Page 27: Computational Genomics and Proteomics

Example: genes 37176_at and 36563_at are not useful

Some genes are more useful than others for building classification models

Page 28: Computational Genomics and Proteomics

Importance of Feature (Gene) Selection

• The majority of genes are not directly related to leukemia

• Having a large number of features enhances the model's flexibility, but makes it prone to overfitting

• Noise and the small number of training samples make this even more likely

• Some types of models, like kNN, do not scale well with many features

Page 29: Computational Genomics and Proteomics

Classification: CV error

• Training error
  – Empirical error

• Error on an independent test set
  – Test error

• Cross validation (CV) error
  – Leave-one-out (LOO)
  – N-fold CV

[Diagram: the N samples are repeatedly split, with 1/n of them used for testing and the rest for training in each round; the errors are counted and summarized into the CV error rate.]
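A minimal sketch of LOO and N-fold CV error estimation on synthetic data; the kNN classifier and the scikit-learn utilities are illustrative assumptions.

```python
# Minimal sketch: leave-one-out and N-fold cross-validation error (synthetic data).
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 50))
y = rng.integers(0, 2, size=38)

clf = KNeighborsClassifier(n_neighbors=3)

# LOO: each sample is held out once for testing
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
# N-fold CV: 1/n of the samples held out for testing in each round
cv_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()

print(1 - loo_acc, 1 - cv_acc)            # the two CV error estimates
```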

Page 30: Computational Genomics and Proteomics

Two schemes of cross validation

[Diagram of the two schemes:]

• CV1: N samples → LOO loop in which both the feature-selector and the classifier are trained and tested → count errors

• CV2: N samples → feature selection (done once, on all samples) → LOO loop in which only the classifier is trained and tested → count errors

Page 31: Computational Genomics and Proteomics

Difference between CV1 and CV2

• CV1: gene selection within LOOCV
• CV2: gene selection before LOOCV
• CV2 can yield an optimistic estimate of the true classification error

• CV2 was used in the paper by Golub et al.:
  – 0 training errors
  – 2 CV errors (5.26%)
  – 5 test errors (14.7%)
  – CV error different from test error!
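A minimal sketch contrasting CV1 and CV2 on purely random data, where the true error is 50%; the gene selector, classifier, and data sizes are illustrative assumptions. Selecting genes before the LOO loop (CV2) typically reports a much lower, optimistic error than selecting inside each fold (CV1).

```python
# Minimal sketch: CV1 (selection inside LOO) vs. CV2 (selection before LOO), synthetic data.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))           # purely random data: no real signal
y = rng.integers(0, 2, size=40)
clf = KNeighborsClassifier(n_neighbors=3)

def loo_error(X, y, select_inside):
    errors = 0
    for train, test in LeaveOneOut().split(X):
        if select_inside:                 # CV1: select genes on the training fold only
            sel = SelectKBest(f_classif, k=20).fit(X[train], y[train])
            Xtr, Xte = sel.transform(X[train]), sel.transform(X[test])
        else:                             # CV2: selection was already done on all the data
            Xtr, Xte = X[train], X[test]
        errors += clf.fit(Xtr, y[train]).predict(Xte)[0] != y[test][0]
    return errors / len(y)

sel_all = SelectKBest(f_classif, k=20).fit(X, y)     # CV2's up-front gene selection
print("CV1 error:", loo_error(X, y, select_inside=True))                      # expected near 0.5
print("CV2 error:", loo_error(sel_all.transform(X), y, select_inside=False))  # typically much lower (optimistic)
```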

Page 32: Computational Genomics and Proteomics

Significance of classification results

• Permutation test:
  – Permute the class labels of the samples
  – Compute the LOOCV error on the data with permuted labels
  – Repeat the process a large number of times
  – Compare with the LOOCV error on the original data:

• P-value = (# times LOOCV error on permuted data <= LOOCV error on original data) / (total # of permutations considered)
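A minimal sketch of the permutation test, assuming a hypothetical helper loo_error(X, y) that returns the LOOCV error for a given labeling (for example the CV1 procedure above); the number of permutations is an arbitrary choice.

```python
# Minimal sketch: permutation test for the significance of a LOOCV result.
import numpy as np

def permutation_p_value(X, y, loo_error, n_perm=100, seed=0):
    """loo_error(X, y) is a hypothetical helper returning the LOOCV error."""
    rng = np.random.default_rng(seed)
    observed = loo_error(X, y)                # LOOCV error on the original labels
    count = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)           # permute the class labels of the samples
        if loo_error(X, y_perm) <= observed:  # a permuted run does at least as well
            count += 1
    return count / n_perm                     # the p-value formula from the slide
```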

Page 33: Computational Genomics and Proteomics

Feature Selection techniques in a nutshell

WHY ?

WHAT ?

HOW ?

Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007 Oct 1;23(19):2507-17.

Page 34: Computational Genomics and Proteomics

Now it is the teams' work!

• Meetings: 29/11, 6/12:
  – each team comes to discuss with E.M. once a week, for 30 minutes

• In the last lecture (13 Dec):
  – Report due date
  – Public talk by each team

See you on the 29/11!