Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

Upload: bowie

Post on 21-Jan-2016

DESCRIPTION

Proteome Analyst. Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors. Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell. PowerPoint presentation.

TRANSCRIPT

Page 1: Proteome Analyst

Proteome Analyst

Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

Page 2: Proteome Analyst

Proteome Analyst

Duane Szafron, Paul Lu, Russell Greiner, David Wishart, Zhiyong Lu, Brett Poulin, Roman Eisner, John Anvik, Cam Macdonell

Page 3: Proteome Analyst

Proteome Analyst

Proteome
  one of many ‘-omes’
  set of all proteins in an organism

Analysis
  prediction of protein function or localization from sequence data

Page 7: Proteome Analyst

Analyze a Protein

We have examples of annotated proteins in various protein classes.

We have more examples of unannotated proteins.

What do we do?
  Find homologues to each protein and assume similar function.
  Find characteristics of each protein that affect function.

Page 15: Proteome Analyst

Analyzing Proteins

One Protein? Just do it.

5 Proteins? Post-doc familiar with protein classes.

50 Proteins? Grad student.

5000 Proteins? Summer students.

Page 16: Proteome Analyst

Proteome Analyst

Page 17: Proteome Analyst

Proteome Analyst

High-throughput Transparent Prediction of
  Protein Function
  Protein Localization
  Custom Classification

Page 19: Proteome Analyst

Machine Learning Task

Training
  INPUT: sequences, classes
  OUTPUT: Classifier

Analysis
  INPUT: sequences, Classifier
  OUTPUT: classes, explanation

Page 20: Proteome Analyst

Training

INPUT
  sequences, classes

PA Tools
  sequences → features

ML Algorithm
  features, classes → Classifier

OUTPUT
  Classifier

Page 22: Proteome Analyst

Training: INPUT

>class A<Training Seq 1
MVGSGLLWLALVSCILTQASAVQRGYGNPIEASSYGL...
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS...
>class B<Training Seq 3
EVIAYLRDPNCSSILQTEERNWVSVTSPVQASACRNI...
...

Labels: classes; protein sequences
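The training input shown on this slide can be read with a few lines of code. The following is a minimal Python sketch, not Proteome Analyst's own parser. It assumes each record is a header line of the form `>class<name` followed by one or more sequence lines, which is how the slide appears to lay the data out. The function name `parse_labeled_fasta` is hypothetical, and the example sequences are only the visible prefixes from the slide.

```python
def parse_labeled_fasta(text):
    """Parse '>class<name' headers plus sequence lines into (class, name, sequence) tuples.

    Assumption: the header layout is inferred from the slide, not from a format spec.
    """
    records = []
    label = name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, name, "".join(chunks)))
            label, _, name = line[1:].partition("<")
            label, name = label.strip(), name.strip()
            chunks = []
        elif label is not None:
            # Sequence lines may wrap, so they are collected and joined.
            chunks.append(line)
    if label is not None:
        records.append((label, name, "".join(chunks)))
    return records


example = """>class A<Training Seq 1
MVGSGLLWLALVSCILTQ
ASAVQRGYGNPIEASSYGL
>class B<Training Seq 2
LLDEPFRSTENSAGSQGCDKNMSGWYRFVGEGGVRMS
"""
records = parse_labeled_fasta(example)
for label, name, sequence in records:
    print(label, name, len(sequence))
```

Keeping the class label and the sequence in one tuple makes it easy to pass the sequences to the feature tools and the labels to the learning algorithm separately.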

Page 24: Proteome Analyst

Training: PA Tools

sequences → features

Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features

Page 26: Proteome Analyst

Homology Tool

sequence → features

[Diagram: sequence → BLAST (against seq DB) → homologues → retrieve → annotations → parse → features]

Example annotation retrieved for a homologue:

DBSOURCE swissprot: locus MPPB_NEUCR, ...
xrefs (non-sequence databases): ... InterPro IPR001431, ...
KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; Oxidoreductase; Electron transport; Respiratory chain.
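The "parse" step of the homology tool turns a homologue's annotation into features. As a rough illustration only, the Python sketch below extracts the semicolon-separated terms from a `KEYWORDS` line like the one on the slide. It is an assumption that keywords alone serve as features here; the slide also shows database cross-references such as InterPro identifiers, which this sketch ignores. The function name `keyword_features` is hypothetical.

```python
def keyword_features(annotation_text):
    """Collect lower-cased keyword terms from any line starting with 'KEYWORDS'.

    Assumption: terms are separated by ';' and the list ends with '.',
    as in the example annotation on the slide.
    """
    features = set()
    for line in annotation_text.splitlines():
        line = line.strip()
        if line.startswith("KEYWORDS"):
            body = line[len("KEYWORDS"):].strip().rstrip(".")
            features.update(term.strip().lower() for term in body.split(";") if term.strip())
    return features


annotation = (
    "DBSOURCE swissprot: locus MPPB_NEUCR\n"
    "KEYWORDS Hydrolase; Metalloprotease; Zinc; Mitochondrion; Transit peptide; "
    "Oxidoreductase; Electron transport; Respiratory chain."
)
homology_features = keyword_features(annotation)
print(sorted(homology_features))
```

Representing the result as a set of tokens fits a classifier that only asks whether a feature is present or absent for a given sequence.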

Page 28: Proteome Analyst

Training: PA Tools

sequences → features

Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features

Pattern Tools (PFAM, ProSite, …)
  sequences → motifs
  motifs → features

Page 30: Proteome Analyst

Pattern Tool

sequence → features

[Diagram: sequence → find (against pattern DB) → patterns → parse → features]

Example pattern hits:

Pfam; PF00234; tryp_alpha_amyl; 1.
PROSITE; PS00940; GAMMA_THIONIN; 1.
PROSITE; PS00305; 11S_SEED_STORAGE; 1.
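Pattern hits in the form shown on this slide can be turned into features in much the same way as annotations. The Python sketch below is illustrative only: it assumes each hit is a line of `database; accession; name; count.` and builds one feature per hit from the database name and accession. The feature naming scheme (`PFAM:PF00234`) and the function name `pattern_features` are hypothetical choices, not something the slides specify.

```python
def pattern_features(hit_lines):
    """Build 'DATABASE:ACCESSION' features from semicolon-separated pattern hits.

    Assumption: the first field is the database and the second is the accession.
    """
    features = set()
    for line in hit_lines:
        fields = [field.strip() for field in line.rstrip(". \n").split(";")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            features.add(f"{fields[0].upper()}:{fields[1]}")
    return features


hits = [
    "Pfam; PF00234; tryp_alpha_amyl; 1.",
    "PROSITE; PS00940; GAMMA_THIONIN; 1.",
    "PROSITE; PS00305; 11S_SEED_STORAGE; 1.",
]
motif_features = pattern_features(hits)
print(sorted(motif_features))
```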

Page 31: Proteome Analyst

Pattern Tool

sequence → features

Not included in current results.

[Diagram: sequence → find (against pattern DB) → patterns → parse → features]

Page 33: Proteome Analyst

Training: ML Algorithm

features, classes → Classifier

Any ML algorithm may be used.

Default = naïve Bayes
  consistently near-best accuracy (SVM, ANN slightly better)
  efficient (for high-throughput)
  easy to interpret
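To make the default learner concrete, here is a small naïve Bayes sketch in Python over sets of present/absent features. It is one common formulation (Bernoulli naïve Bayes with add-one smoothing), chosen for illustration; the slides do not say which variant or smoothing Proteome Analyst uses. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_set, class_label) pairs. Returns a model dict."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return {
        "class_counts": class_counts,
        "feature_counts": feature_counts,
        "vocabulary": vocabulary,
        "total": len(examples),
    }


def class_probabilities(model, features):
    """Return a normalized probability for every class given a feature set."""
    log_scores = {}
    for label, n in model["class_counts"].items():
        score = math.log(n / model["total"])  # class prior
        for feature in model["vocabulary"]:
            # Add-one smoothing keeps probabilities away from 0 and 1.
            p = (model["feature_counts"][label][feature] + 1) / (n + 2)
            score += math.log(p if feature in features else 1.0 - p)
        log_scores[label] = score
    top = max(log_scores.values())
    unnormalized = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Toy data, invented for illustration only.
examples = [
    ({"zinc", "hydrolase"}, "A"),
    ({"zinc"}, "A"),
    ({"membrane"}, "B"),
    ({"membrane", "transport"}, "B"),
]
model = train_naive_bayes(examples)
probs = class_probabilities(model, {"zinc"})
print(probs)
```

Training only counts features per class, which is consistent with the slide's point about efficiency, and every count can be inspected afterwards, which is what makes the model easy to interpret.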

Page 34: Proteome Analyst

Training: OUTPUT

Classifier

Page 35: Proteome Analyst

Analysis (Classification)

INPUT
  sequences

PA Tools
  sequences → features

Classifier
  features → classes, explanation

OUTPUT
  classes

Page 37: Proteome Analyst

Analysis: INPUT

>Seq 1
DTILNINFQCAYPLDMKVSLQAALQPIVSSLNVSVDG...
>Seq 2
AVELSVESVLYVGAILEQGDTSRFNLVLRNCYATPTE...
>Seq 3
HVEENGQSSESRFSVQMFMFAGHYDLVFLHCEIHLCD...
...

Label: protein sequences

Page 39: Proteome Analyst

Analysis: PA Tools

sequences → features

Homology Tools (BLAST)
  sequence → homologues
  homologues → annotations
  annotations → features

Pattern Tools (PFAM, ProSite, …)
  sequences → motifs
  motifs → features

Page 41: Proteome Analyst

Analysis: Classification

features → classes

naïve Bayes
  returns probabilities of each class for each sequence
  efficient (for high-throughput)
  easy to interpret

Page 42: Proteome Analyst

Analysis: Classification

features → classes, explanation
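The transcript does not show what an explanation looks like. One plausible form for a naïve Bayes classifier, sketched below in Python purely as an assumption, is to rank the features present in a sequence by how strongly each one favours the predicted class over an alternative, using the log ratio of the per-class feature probabilities. The function name `explain` and the probability values are hypothetical.

```python
import math


def explain(feature_probs, features, predicted, other):
    """Rank present features by their log-likelihood ratio for `predicted` vs `other`.

    feature_probs maps class -> {feature: P(feature present | class)}.
    Features unknown to either class are skipped.
    """
    contributions = {}
    for feature in features:
        p_predicted = feature_probs[predicted].get(feature)
        p_other = feature_probs[other].get(feature)
        if p_predicted is None or p_other is None:
            continue
        contributions[feature] = math.log(p_predicted / p_other)
    # Largest positive value = strongest evidence for the predicted class.
    return sorted(contributions.items(), key=lambda item: item[1], reverse=True)


# Invented probabilities, for illustration only.
feature_probs = {
    "A": {"zinc": 0.75, "membrane": 0.25},
    "B": {"zinc": 0.25, "membrane": 0.75},
}
ranked = explain(feature_probs, {"zinc", "membrane"}, predicted="A", other="B")
for feature, weight in ranked:
    print(f"{feature}: {weight:+.3f}")
```

Because naïve Bayes treats features independently, each feature's contribution can be reported on its own, which is what makes this kind of per-feature breakdown possible.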

Page 48: Proteome Analyst

Results: General Function

GeneQuiz classification
5-fold x-val accuracy on 14 classes

E. coli (2370)   82.5%
Yeast (2359)     78.8%
Fly (3842)       76.6%
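For readers unfamiliar with 5-fold cross-validation ("x-val"), the Python sketch below shows the general procedure: split the labelled data into five folds, train on four, test on the fifth, and pool the results. It is a generic illustration, not the evaluation code behind these numbers. The `train` and `predict` callables are placeholders, and the majority-class learner and toy data are invented for the example.

```python
import random
from collections import Counter


def k_fold_accuracy(examples, train, predict, k=5, seed=0):
    """Pooled accuracy over k folds; examples is a list of (input, label) pairs."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being tested.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        correct += sum(1 for x, y in held_out if predict(model, x) == y)
    return correct / len(items)


def train_majority(training):
    """Placeholder learner: remember the most common label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, _features):
    return model


# Toy data: eight sequences of class A, two of class B.
data = [({"f"}, "A")] * 8 + [({"g"}, "B")] * 2
accuracy = k_fold_accuracy(data, train_majority, predict_majority)
print(f"{accuracy:.1%}")
```

Every sequence is tested exactly once by a model that never saw it during training, so the reported accuracy is an estimate of performance on unseen proteins.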

Page 50: Proteome Analyst

Results: Specific Function

K+ Ion Channel Proteins
5-fold x-val accuracy on 78 sequences, 4 classes

             Accuracy
1st effort   97.4%
2nd effort   100%

Page 52: Proteome Analyst

Results: Localization

Sub-cellular localization prediction
3146 sequences from 10 classes

                   Accuracy   Coverage
Nair and Rost      81.5%      36.9%
Proteome Analyst   87.8%      100%
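The slides do not define the two columns. A common reading, assumed here and not confirmed by the source, is that coverage is the fraction of sequences for which a predictor returns any answer, and accuracy is the fraction of those answers that are correct. The Python sketch below computes both under that assumption; the function name and the toy data are hypothetical.

```python
def accuracy_and_coverage(predictions, truth):
    """predictions: name -> label, or None when the predictor abstains.
    truth: name -> correct label.

    Assumed definitions: coverage = answered / all, accuracy = correct / answered.
    """
    answered = {name: label for name, label in predictions.items() if label is not None}
    coverage = len(answered) / len(truth)
    if not answered:
        return 0.0, coverage
    correct = sum(1 for name, label in answered.items() if truth[name] == label)
    return correct / len(answered), coverage


# Invented example: four proteins, the predictor abstains on two of them.
truth = {"p1": "nucleus", "p2": "cytoplasm", "p3": "nucleus", "p4": "membrane"}
predictions = {"p1": "nucleus", "p2": "nucleus", "p3": None, "p4": None}
accuracy, coverage = accuracy_and_coverage(predictions, truth)
print(f"accuracy={accuracy:.1%} coverage={coverage:.1%}")
```

Under this reading, a predictor with 100% coverage is being scored on every sequence, including the hard ones that a lower-coverage predictor would decline to answer.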

Page 54: Proteome Analyst

Proteome Analyst

High-throughput Transparent Prediction of
  Protein Function
  Protein Localization
  Custom Classification

Page 55: Proteome Analyst

Acknowledgements

Student developers
  Cynthia Luk
  Samer Nassar
  Kevin McKee

Biologists
  Warren Gallin
  Kathy Magor

Data
  Nair and Rost

Page 56: Proteome Analyst

Acknowledgements

Funding
  PENCE - Protein Engineering Network of Centres of Excellence
  NSERC - Natural Sciences and Engineering Research Council
  Sun Microsystems
  AICML - Alberta Ingenuity Centre for Machine Learning

Page 57: Proteome Analyst

Acknowledgements

Many ‘-ome’ jokes
  my wife, Jen

Page 58: Proteome Analyst

Contact

http://www.cs.ualberta.ca/~bioinfo/PA

[email protected]