vienna afp2011

Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function

David Jones, UCL Depts. of Computer Science and Structural and Molecular Biology

Background

[Figure: annotation status of human proteins by GO aspect (MF, BP, CC)]

In Uniprot, 30% of human proteins still have no functional annotations at all

… and only 0.5% have completely specific ones for all aspects

Main approaches for function annotation

• Annotation transfers by homology e.g. BLAST, HMMER
  – Only applicable to a subset of the data
  – Has reached a plateau in terms of novel function annotation but provides highest quality information

• Model-classifier based using sequence features
  – Limited to common and broad functions for which there are many examples

FFPRED - Function Prediction Pipeline

[Figure: FFPRED pipeline. Novel sequence (amino acid sequence) → Characteristics (aa, transmem, structure, disorder, motifs, localisation) → Classification (GO term SVM) → posterior probability estimate]

Going further – computing gene function from multiple data sources

• FFPRED is a currently available server for human (and vertebrate) proteins

• It works well but is limited to predicting only the functional classes that it was trained to recognize

• Extending the library requires time consuming training of new SVM models

• It also cannot be applied to rare functional classes due to limited training sets

Desirable features of a new approach

• Able to annotate all sequences

• Able to predict rare functions

• Able to offer something more than simple homology-based approaches

• Amenable to easy and quick updating

FunctionSpace Data Sources for H. sapiens

• Sequence similarity
• Signal peptides and other local features
• Predicted secondary structure
• Transmembrane segments
• Predicted disordered regions
• Domain architecture patterns
• Gene fusion information
• Gene co-expression
• Protein-protein interactions

For each sequence 49,231 features were derived

Aim

To estimate the functional similarity (a.k.a. semantic distance) between two human proteins from their sequence features plus available high throughput data.

[Figure: Protein A and Protein B mapped to a functional similarity score]

Large-scale (domain-based) evolutionary features

• Patterns of domain occurrence can provide valuable functional clues

• “Deeper” homology detection allows greater coverage

• We make use of our in-house fold/domain recognition method and several public domain libraries

pDomTHREADER Domain Coverage

[Figure: bar chart of domain annotation counts (axis 0 to 7,000,000), public domain vs threading]

37.56% increase in domain annotations across 5.5M sequences

~1.7 million novel domain assignments over public domain data

[Figure: CATH domain annotations, coverage by residues and by sequences. Gene3d: 59.4%, 35.7%; threading: 64.8%, 81.6%]

Computational Practicalities

[Figure: workflow. 5.5M query sequences → PSIBLAST on Legion nodes against the sequence database (5.5M seqs) → find matches & generate alignments → store & post process. Other labels: 2Gb; 1 min – 3 hours]

“Embarrassingly parallel” application: one sequence = one job.

Ideal capacity filling task for a modern supercomputer like Legion.
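The "one sequence = one job" pattern can be sketched as follows. This is only an illustration of the job layout, assuming a local thread pool in place of the cluster scheduler; `run_search` is a hypothetical stand-in that does not call PSI-BLAST, and all function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def run_search(record):
    """Hypothetical stand-in for one search job on one query sequence."""
    name, sequence = record
    return name, len(sequence)  # placeholder result: the sequence length


def run_all(records, workers=4):
    """Run one independent job per sequence and collect results by identifier."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_search, records))


example = ">seqA test\nMKT\nAYI\n>seqB\nGG\n"
results = run_all(read_fasta(example))
```

Because no job depends on any other, the same structure maps directly onto a batch queue where each record becomes a separate submission.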

Gene Fusion Events can Predict Protein-Protein Interactions from Sequence Data

[Figure: domain fusion examples; panel labels H1, H2]

Example in Mycobacterium avium, Mycobacterium tuberculosis and Mycobacterium paratuberculosis:
• Domains: 3.90.850.10 (fumarylacetoacetase), 3.60.15.10 (beta lactamase)
• Hydrolase activity: hydrolysis of C-N bonds, hydrolysis of C-C bonds
• Bi-functional enzyme

Example in Saccharopolyspora erythraea and Syntrophomonas wolfei:
• Domains: 3.40.120.10 (Alpha-D-Glucose-1,6-Bisphosphate), 3.40.50.300 (P-loop nucleotide triphosphate hydrolases)
• Functions: oxidative stress, D-glucose metabolism, DNA repair
• Proteins: transcription coupling repair factor, DNA repair (RAD50), phosphoglyceromutase

A Novel Gene Fusion Discovered using CATH domain fusion analysis

[Figure: fused architecture combining 3.40.50.300 and 3.40.120.10 domains in Saccharopolyspora erythraea and Syntrophomonas wolfei]

Novel annotations

• Rice PGM1 gene annotated as GO:0006950 response to stress

• PGM3 has relationship with DNA repair sequence

Kanazawa K, Ashida H (1991) Relationship between oxidative stress and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225

Domain based features

Score complexes

Score architectures

7960 features 11210 features

Fusion scoring

Each domain is a feature; the score has 2 components:

1. Prediction quality (logistic transform of feature)
2. Promiscuity weight related to the number of times the sequence occurs as part of a fused product: $w_i = \log(\mathrm{fus}_i)$
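A minimal sketch of the two score components follows. The slide does not say how the components are combined or which logarithm base is used, so the product and the natural logarithm here are assumptions, as are the function names.

```python
import math


def logistic(x):
    """Logistic transform mapping a raw feature value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_score(raw_feature, fusion_count):
    """Score one domain feature from its two components.

    Assumption: the components are multiplied, and log is the natural log.
    """
    if fusion_count < 1:
        raise ValueError("fusion_count must be at least 1")
    quality = logistic(raw_feature)   # component 1: prediction quality
    weight = math.log(fusion_count)   # component 2: w_i = log(fus_i)
    return quality * weight


example_score = fusion_score(0.0, 10)  # logistic(0) = 0.5, weight = log(10)
```

Under these assumptions a domain seen in only one fused product gets weight log(1) = 0, so it contributes nothing regardless of prediction quality.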

Integration of “External” Features: Microarray Expression Data

[Figure: probe signal (log2) for Gene A and Gene B across 16 experiments (conditions) in normalised microarray datasets, compared by Pearson correlation (R)]
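As a reminder of the co-expression measure named on this slide, here is a short sketch of the Pearson correlation between two expression profiles. The profiles are made-up numbers, not data from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Illustrative log2 probe signals for two genes over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(gene_a, gene_b)
```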

Biclustering Microarray Expression Data

• Zinc binding sequences: global correlation 0.42

• A set of transcription factors: global correlation 0.48

23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using BIMAX algorithm

FunctionSpace: Two-stage Integration of Data

[Figure: feature vectors for Protein A and Protein B feed 11 first-stage SVMs (SVMsw, SVMss, SVMdis, SVMloc, SVMge, SVMppi, SVMtm, SVMdpc, SVMdpp, SVMgfc, SVMgfp); their outputs feed a second-stage SVM (SVMfsc) that produces the functional similarity score]
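The data flow of a two-stage integration can be sketched as below. Fixed linear functions stand in for the trained SVMs, three feature groups stand in for the eleven, and every name and weight is hypothetical; the sketch shows only how per-group outputs become the input of the final model.

```python
def make_linear_scorer(weights, bias=0.0):
    """Return a stand-in model: a fixed linear function of a feature vector."""
    def score(features):
        return bias + sum(w * x for w, x in zip(weights, features))
    return score


def two_stage_score(pair_features, stage_one, stage_two):
    """Reduce each feature group to one value, then combine the values.

    pair_features: dict mapping group name to the feature vector of a protein pair.
    stage_one: dict mapping group name to its first-stage model.
    stage_two: model applied to the first-stage outputs (sorted by group name).
    """
    groups = sorted(stage_one)
    reduced = [stage_one[g](pair_features[g]) for g in groups]
    return reduced, stage_two(reduced)


# Hypothetical example with three groups instead of eleven.
pair_features = {"ge": [1.0, 2.0], "ppi": [1.0], "tm": [0.0, 1.0]}
stage_one = {
    "ge": make_linear_scorer([0.5, 0.5]),
    "ppi": make_linear_scorer([2.0]),
    "tm": make_linear_scorer([1.0, 1.0]),
}
stage_two = make_linear_scorer([1.0, 1.0, 1.0])
reduced, similarity = two_stage_score(pair_features, stage_one, stage_two)
```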

A 3-D Projection of Annotated Human Proteins

• 49,231 dimensions first reduced to 11 dimensions by SVM regression with 11 different groups of features

• Each protein is here represented as a point in this derived 11-D feature space projected into 3-D

• Colouring is according to functional similarity which shows that proteins with similar functions (warmer colours) cluster strongly in this space

• 75% of nearest neighbour pairs share common GO terms
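One way to compute a statistic like the nearest-neighbour figure above is sketched here. It assumes Euclidean distance and counts each protein once with its single nearest neighbour; the talk does not state either choice, and the points and GO term sets are toy values (2-D instead of 11-D).

```python
import math


def nearest_neighbour(index, points):
    """Index of the closest other point by Euclidean distance."""
    return min(
        (j for j in range(len(points)) if j != index),
        key=lambda j: math.dist(points[index], points[j]),
    )


def fraction_sharing_terms(points, go_terms):
    """Fraction of proteins whose nearest neighbour shares at least one GO term."""
    shared = 0
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if go_terms[i] & go_terms[j]:
            shared += 1
    return shared / len(points)


# Toy data: four proteins in a 2-D space with illustrative annotations.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
go_terms = [
    {"GO:0003677"},
    {"GO:0003677", "GO:0016301"},
    {"GO:0016301"},
    {"GO:0005515"},
]
fraction = fraction_sharing_terms(points, go_terms)
```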

Individual Feature Contributions

[Figure: individual feature contributions; y-axis: Matthews Correlation Coefficient]

Each sequence is classed “Easy”, “Medium” or “Hard” depending on degree of homology to functionally annotated proteins in UNIPROT.
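The axis label on this slide refers to the Matthews correlation coefficient. As a reminder of how it is computed from a confusion matrix, here is a minimal sketch; the function name and the example counts are illustrative and not taken from the talk.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
chance = matthews_corrcoef(tp=1, tn=1, fp=1, fn=1)
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), which makes it usable when positive and negative classes are unbalanced.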

Function Annotation Results for 20674 Unannotated IPI Human Sequences

Preliminary Results

In 2009 FunctionSpace produced GO term predictions for 19678 IPI uncharacterized human sequences. 2746 have been annotated since.

[Figure: two panels, each with an axis running from less specific to more specific]

Measure                  MF     BP
% Exact Matches          16%    9%
Mean semantic distance   -1.3   -1.7

Initial considerations for CAFA

• 50,000 sequences

• 11 eukaryotic & 7 prokaryotic species

• High specificity annotations needed

• Partial descriptive text already in Swiss-Prot/Uniprot for some entries

• FFPRED/FunctionSpace would not be enough

• Need to incorporate textual information from databases and comprehensive homology(orthology)-derived labels

• Need to get all this working in a few months!

Best Laid Plans for CAFA

• Plan A
  – Build separate annotation pipelines for missing data
  – Calibrate each pipeline according to precision values derived from benchmark on 500 highly annotated Swiss-Prot entries
  – Combine pipeline annotations using high-level classifier (SVM or Naive Bayes)

• Plan B
  – No time to build high-level classifier!
  – Combine annotation sources using heuristic graphical approach

• Hope for the best! (and expect the worst...)

GO term prediction from Swiss-Prot text-mining

• For targets which already had descriptive text, keywords or comments in Swiss-Prot, GO terms were assigned using a naive Bayes text-classification approach

• Single words and groups of 2 and 3 words were counted

• Words occurring in different Swiss-Prot record types were distinguished in the analysis, and some simple pre-parsing of feature (FT) records was carried out in addition.
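The idea of naive Bayes classification over single words and groups of 2 and 3 words can be sketched as below. This is a generic multinomial naive Bayes with add-one smoothing, not the classifier used for CAFA; it ignores the distinction between record types, and the training texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


class NaiveBayesText:
    """Multinomial naive Bayes with add-one smoothing over n-gram counts."""

    def __init__(self):
        self.label_docs = Counter()         # number of training texts per label
        self.counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()                  # all n-grams seen in training

    def train(self, text, label):
        grams = ngrams(text)
        self.label_docs[label] += 1
        self.counts[label].update(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        total_docs = sum(self.label_docs.values())
        grams = [g for g in ngrams(text) if g in self.vocab]  # skip unseen n-grams
        scores = {}
        for label, n_docs in self.label_docs.items():
            total = sum(self.counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for gram in grams:
                score += math.log((self.counts[label][gram] + 1)
                                  / (total + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training snippets labelled with GO terms (kinase activity, DNA binding).
model = NaiveBayesText()
model.train("serine threonine protein kinase", "GO:0016301")
model.train("tyrosine protein kinase receptor", "GO:0016301")
model.train("zinc finger transcription factor", "GO:0003677")
model.train("homeobox DNA binding protein", "GO:0003677")
kinase_guess = model.predict("probable protein kinase")
binding_guess = model.predict("zinc finger protein")
```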

Homology-based annotation sources

• PSI-BLAST searches against Uniprot
  – Low E-value threshold to ensure close homologues used for annotation transfer
  – Alignment length threshold to avoid domain problem

• Transfer of annotations from orthologues
  – EggNOG 2.0
  – More reliable GO term transfer than for PSI-BLAST but lower coverage

• Profile-profile searches against Swiss-Prot
  – Low reliability transfer from very distant homologues
  – Improves coverage where needed (at expense of specificity)
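The two PSI-BLAST filters above (E-value and alignment length) can be sketched as a simple hit filter. The threshold values, the coverage definition and the `Hit` fields are all assumptions chosen for illustration; the talk does not give the actual cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database hit; the field names are hypothetical."""
    subject_id: str
    evalue: float
    alignment_length: int
    query_length: int
    subject_length: int
    go_terms: frozenset


def transferable_terms(hits, max_evalue=1e-6, min_coverage=0.8):
    """Collect GO terms from hits that pass both thresholds.

    Coverage is taken relative to the longer of the two sequences, so a match
    to a single shared domain in a long protein is rejected.
    """
    terms = set()
    for hit in hits:
        coverage = hit.alignment_length / max(hit.query_length, hit.subject_length)
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            terms |= hit.go_terms
    return terms


hits = [
    Hit("close", 1e-50, 290, 300, 310, frozenset({"GO:0016301"})),
    Hit("domain_only", 1e-20, 100, 300, 900, frozenset({"GO:0003677"})),
    Hit("weak", 1e-2, 295, 300, 300, frozenset({"GO:0005515"})),
]
terms = transferable_terms(hits)
```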

$P' = 1 - (1 - P)(1 - Q)$

Back-propagation of precision estimates

Heuristic back-propagation of precision estimates

Back-propagation repeated for each annotation source to define a consensus for each node
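A rough sketch of how precision estimates might be pushed up the GO graph follows. It assumes that each predicted term passes its precision to all of its ancestors and that estimates meeting at a node are merged with the combination rule P' = 1 - (1 - P)(1 - Q); the slides call the procedure heuristic and do not spell out these details. The tiny graph and its term names are hypothetical.

```python
def combine(p, q):
    """Merge two precision estimates: P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def back_propagate(predictions, parents):
    """Accumulate a consensus score per node from (term, precision) predictions."""
    scores = {}
    for term, precision in predictions:
        # Each prediction contributes once to the term itself and to each ancestor.
        for node in {term} | ancestors(term, parents):
            scores[node] = combine(scores.get(node, 0.0), precision)
    return scores


# Hypothetical graph: A1 -> A -> root, B -> root.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}
scores = back_propagate([("A1", 0.5), ("B", 0.5)], parents)
```

In this toy case the root receives support from both branches and ends at 0.75, while each leaf keeps its own 0.5.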

Final steps

• After back-propagation, all referenced GO terms are ranked according to final confidence scores

• To reduce conflicting annotations, pairs of terms with zero observed co-occurrence frequency in GOA are subjected to pairwise tournament selection.

• Results submitted to server using the mouse-window-cut-paste-click-submit algorithm
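The conflict-reduction step can be sketched as follows, assuming that in each conflicting pair the term with the higher confidence wins and that terms are processed in rank order. The term names, scores and the set of never-co-occurring pairs are invented; in practice the pairs would come from co-occurrence counts in GOA.

```python
def resolve_conflicts(scores, never_together):
    """Return terms in rank order, dropping any that conflict with a kept term.

    scores: dict mapping term to final confidence score.
    never_together: set of frozenset pairs with zero observed co-occurrence.
    """
    kept = []
    for term in sorted(scores, key=scores.get, reverse=True):
        if all(frozenset((term, other)) not in never_together for other in kept):
            kept.append(term)
    return kept


# Hypothetical terms: T1 and T3 are never observed together.
scores = {"T1": 0.9, "T2": 0.8, "T3": 0.4}
never_together = {frozenset(("T1", "T3"))}
ranked = resolve_conflicts(scores, never_together)
```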

CASP vs CAFA from a Predictor’s Point of View

• Number of targets
  – Manual vs automated approaches

• Difficulty of targets
  – A major limit in driving CASP forwards

• Assessment
  – Hard to pre-judge impact of decisions made during prediction season

• Tools for the community
  – Standards and methods in CASP have been very useful

• Getting the word out to the wider community

Acknowledgements

Anna Lobley, Domenico Cozzetto, Daniel Buchan, Kevin Bryson, Christine Orengo
