
Integration of Full-Coverage Probabilistic Functional Networks with Relevance to Specific Biological Processes

James, K., Wipat, A. & Hallinan, J.

School of Computing Science, Newcastle University

Data Integration in the Life Sciences 2009

Integrated functional networks

• Bring together data from a wide range of sources

• High-throughput data is

– Large (one node per gene; multiple interactions per node)

– Noisy (FP 20 – 90%)

– Incomplete (to different extents)

• Assess quality of each dataset against a Gold Standard

• Weighted edges reflect the summed probability that the edge actually exists

• Network can be thresholded to draw attention to most probable edges

• Suitable for manual (interactive) or computational analysis
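The thresholding step mentioned above can be sketched in a few lines. This is a minimal illustration only, assuming the network is held as a dictionary from gene pairs to edge weights; the gene names, the weights, the cutoff, and the `threshold_network` helper are all hypothetical and do not come from the slides.

```python
def threshold_network(edges, cutoff):
    """Keep only the edges whose weight is at least `cutoff`."""
    return {pair: weight for pair, weight in edges.items() if weight >= cutoff}


# Hypothetical weighted edges (names and weights are illustrative only).
edges = {
    ("geneA", "geneB"): 7.2,
    ("geneA", "geneC"): 1.4,
    ("geneB", "geneD"): 3.9,
}

# Hypothetical cutoff; in practice it would be chosen from the weight distribution.
strong = threshold_network(edges, cutoff=3.0)
print(sorted(strong))
```

Raising the cutoff leaves a smaller network of the most probable edges, which is easier to inspect interactively.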

Dataset bias

• Different experiment types provide different types of information

• Overlap between datasets is usually low

– 1% of synthetic lethal pairs physically interact

• Genes involved in the same process may be transcribed together

– Ribosomal biogenesis in yeast

• Some types of interaction may provide more information about a particular biological process

– Complex formation: Y2H

– Signal transduction: phosphorylation

Bias in HTP datasets

From Myers and Troyanskaya, Bioinformatics 2007.

Bias & Relevance

• Most network analyses are related to a Process of Interest (PoI)

• Probabilistic functional integrated networks (PFINs) tend to be very large

• Interactions with equal probability will have different utility

• Several attempts to eliminate bias

– Loss of data

• We aim to use bias

– Relevance

Hypothesis

Functional annotations can be applied to probabilistic integrated functional networks to identify interactions relevant to a biological process of interest

Network integration

$$\mathrm{LLS} = \ln\left(\frac{P(L \mid E)\,/\,P({\sim}L \mid E)}{P(L)\,/\,P({\sim}L)}\right)$$

$$\mathrm{WS} = \sum_{i=1}^{n} \frac{L_i}{D^{(i-1)}}$$
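The two formulas on this slide, the log-likelihood score LLS = ln([P(L|E)/P(~L|E)] / [P(L)/P(~L)]) and the weighted sum WS = Σ L_i / D^(i-1), can be sketched as follows. The reading of the symbols is an assumption based on Lee et al. (2004), which the deck cites as its integration method: P(L|E) is taken as the fraction of a dataset's linkages that agree with the Gold Standard, P(L) as the prior expectation, and L_i as the i-th highest LLS among the datasets supporting an edge. The function names and the example numbers are hypothetical.

```python
import math


def log_likelihood_score(p_l_given_e, p_l):
    """LLS = ln( [P(L|E) / P(~L|E)] / [P(L) / P(~L)] )."""
    posterior_odds = p_l_given_e / (1.0 - p_l_given_e)
    prior_odds = p_l / (1.0 - p_l)
    return math.log(posterior_odds / prior_odds)


def weighted_sum(scores, d):
    """WS = sum over i = 1..n of L_i / D**(i - 1), scores taken in descending order."""
    ranked = sorted(scores, reverse=True)
    # enumerate() starts at 0, which matches the exponent (i - 1).
    return sum(score / d ** i for i, score in enumerate(ranked))


# Hypothetical dataset: 80% of its linkages are in the Gold Standard, prior is 20%.
lls = log_likelihood_score(p_l_given_e=0.8, p_l=0.2)
print(round(lls, 4))  # 2.7726, i.e. ln(16)

# Hypothetical edge supported by three datasets with these LLS values.
print(weighted_sum([1.0, 4.0, 2.0], d=2.0))  # 4 + 2/2 + 1/4 = 5.25
print(weighted_sum([1.0, 4.0, 2.0], d=1.0))  # D = 1 is a plain sum: 7.0
```

With D = 1 every dataset counts in full; larger D values progressively discount lower-ranked evidence, so the best dataset dominates the edge weight.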

Effect of D value

Relevance scoring

• GO annotations

• One-tailed Fisher’s exact test to score over-representation of genes related to the PoI

• PoI: term of interest plus any descendants, excluding annotations inferred from electronic annotation (IEA)

• Control network integrated in order of confidence

• Relevance network integrated in order of relevance

• We use the method of Lee et al. (2004), but the approach can be applied to any network and any data integration algorithm
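The relevance score described above can be sketched with a one-tailed Fisher's exact test computed from the hypergeometric distribution. This is a minimal sketch under stated assumptions: the slides do not say how the 2x2 table is built, so the code assumes it counts genes inside and outside a dataset, split by whether they are annotated to the PoI. The function name, dataset names, and counts are hypothetical.

```python
from math import comb


def fisher_one_tailed(in_set_poi, in_set_other, out_set_poi, out_set_other):
    """P(X >= in_set_poi) under the hypergeometric null (over-representation)."""
    n_set = in_set_poi + in_set_other
    n_poi = in_set_poi + out_set_poi
    n_total = n_set + out_set_poi + out_set_other
    upper = min(n_set, n_poi)
    favourable = sum(
        comb(n_poi, k) * comb(n_total - n_poi, n_set - k)
        for k in range(in_set_poi, upper + 1)
    )
    return favourable / comb(n_total, n_set)


# Hypothetical counts per dataset:
# (PoI genes in dataset, other genes in dataset, PoI genes outside, other genes outside)
counts = {
    "dataset_x": (3, 1, 1, 3),
    "dataset_y": (1, 3, 3, 1),
}

p_values = {name: fisher_one_tailed(*table) for name, table in counts.items()}

# Smaller p-value = stronger over-representation = more relevant to the PoI.
relevance_order = sorted(p_values, key=p_values.get)
print(relevance_order)
```

Sorting by p-value gives the order in which a relevance network would integrate the datasets, whereas the control network uses the confidence order instead.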


Data sets

• Saccharomyces cerevisiae data from BioGRID v.38

• Split by PMID, duplicates removed

• Datasets with > 100 interactions treated individually

– 50 data sets, max 14,421 interactions

• Datasets with < 100 interactions grouped by BioGRID Experimental categories

– 22 data sets, min 33 interactions

• Gene Ontology terms

– Telomere Maintenance (GO:0000723)

– Ageing (GO:0007568)

Choice of D value

• GO annotations

• Assign function to nodes based on annotation of neighbour with highest weighted edge

• Leave-one-out on full network

• Construct Receiver Operating Characteristic (ROC) curve

– Area Under Curve (AUC)

– SE(W) using Wilcoxon statistic
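The AUC and its standard error can be sketched as below. The AUC is computed directly as the Wilcoxon (Mann-Whitney) statistic. For the standard error, the code uses the Hanley and McNeil (1982) approximation; that this is the formula behind "SE(W)" on the slide is an assumption, not something the slide states. The scores and function names are hypothetical.

```python
import math


def auc_wilcoxon(pos_scores, neg_scores):
    """AUC as the Wilcoxon statistic: P(positive > negative) + 0.5 * P(tie)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_standard_error(auc, n_pos, n_neg):
    """Hanley & McNeil (1982) approximation to the standard error of the AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Hypothetical classifier scores for genes that do / do not carry the annotation.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]

auc = auc_wilcoxon(positives, negatives)
se = auc_standard_error(auc, len(positives), len(negatives))
print(round(auc, 4))  # 8 of 9 positive/negative pairs are ordered correctly: 0.8889
```

Comparing AUC values together with their standard errors is one way to judge whether a difference between D values is larger than sampling noise.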

Classifier output

[Figure: classifier output for positives and negatives, divided by a classification threshold into TP, FP, TN and FN. Moving the threshold one way increases specificity; moving it the other way increases sensitivity.]

ROC Curves

[Figure: ROC curves plotting True Positives against False Positives, both axes from 0 to 1, with example curves labelled "No power", "Intermediate" and "Perfect classification".]
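How a ROC curve of this kind is built from classifier scores can be sketched as follows: the classification threshold is lowered step by step and the true-positive and false-positive rates are recorded at each step. This is an illustrative sketch, assuming both classes are present; the scores, labels, and function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for a falling threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting one point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores; label 1 = positive, 0 = negative.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = trapezoid_area(points)
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
```

A classifier with no power follows the diagonal (area 0.5), while perfect classification passes through the top-left corner (area 1.0).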

D value

Ranking

Dataset  Conf. Score  Conf. Rank  Ageing Rank  Telomere Rank  A&T Rank
1        6.6937       1           4            6              6
2        5.7054       2           8            7              8
3        5.7040       3           6            2              2
4        5.0842       4           3            4              4
5        4.9335       5           7            5              5
6        4.5212       6           5            8              7
7        4.4641       7           1            1              1
8        4.2253       8           2            3              3
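The effect of integrating in order of relevance rather than confidence can be illustrated with the values in the ranking table. The sketch below is an assumption-laden example, not the authors' implementation: the edge, the set of datasets supporting it, and the D value are hypothetical, and the rank exponent is counted only over the datasets that support the edge, following the weighted sum of Lee et al. (2004).

```python
# Confidence scores and Ageing ranks copied from the ranking table.
conf_score = {1: 6.6937, 2: 5.7054, 3: 5.7040, 4: 5.0842,
              5: 4.9335, 6: 4.5212, 7: 4.4641, 8: 4.2253}
ageing_rank = {1: 4, 2: 8, 3: 6, 4: 3, 5: 7, 6: 5, 7: 1, 8: 2}


def integrate(supporting, order_key, d):
    """Weighted sum of dataset scores, with datasets taken in `order_key` order."""
    ordered = sorted(supporting, key=order_key)
    return sum(conf_score[ds] / d ** i for i, ds in enumerate(ordered))


supporting = [1, 7]  # hypothetical edge observed in datasets 1 and 7
d = 2.0              # hypothetical D value

# Control network: highest confidence score first.
control = integrate(supporting, lambda ds: -conf_score[ds], d)
# Relevance network: best Ageing rank first.
relevance = integrate(supporting, lambda ds: ageing_rank[ds], d)

print(round(control, 5))    # 6.6937 + 4.4641 / 2 = 8.92575
print(round(relevance, 5))  # 4.4641 + 6.6937 / 2 = 7.81095
```

In the relevance ordering, dataset 7 (Ageing rank 1) contributes at full weight and the more confident but less relevant dataset 1 is down-weighted by D.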

Results

Evaluation - Clustering

• MCL (Markov Cluster) algorithm

• Considers network topology and edge weights
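The way MCL combines topology and edge weights can be sketched with its two alternating steps: expansion (squaring a column-stochastic matrix built from the weighted adjacency matrix) and inflation (raising entries to a power and renormalising). This is a simplified illustration, not the reference `mcl` program: it has no pruning, runs a fixed number of iterations, and the self-loop weight, inflation value, and example graph are all hypothetical choices.

```python
def normalise_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] for c in range(n)] for r in range(n)]


def mcl(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Simplified Markov clustering on a symmetric weighted adjacency matrix."""
    n = len(weights)
    # Add self-loops (weight 1.0 is an arbitrary choice) and normalise columns.
    m = [[weights[r][c] + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (flow along two-step random walks).
        m = [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
             for r in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalise.
        m = normalise_columns([[v ** inflation for v in row] for row in m])
    # Each non-empty row of the result lists the members of one cluster.
    clusters = {frozenset(c for c in range(n) if m[r][c] > eps) for r in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical network: two triangles (nodes 0-2 and 3-5) with unit edge weights.
weights = [[0.0] * 6 for _ in range(6)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    weights[a][b] = weights[b][a] = 1.0

print(mcl(weights))  # [[0, 1, 2], [3, 4, 5]]
```

Because the starting matrix is built from edge weights, heavier edges carry more flow, so higher-scoring interactions pull their endpoints into the same cluster.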

Results

Net  Bias  Clusters  % COI  >2 nodes  >3 nodes  >4 nodes
A    C     573       21.29  26.14     28.86     35.19
A    R     523       22.37  27.73     31.75     36.92
T    C     573       5.06   6.14      7.02      6.53
T    R     508       6.50   7.73      8.90      8.59
C    C     573       24.26  29.55     33.80     37.98
C    R     523       24.67  29.83     33.33     38.35

Cluster annotation

Conclusions

• Function assignment is statistically significantly better, but probably not practically useful

– Simplistic algorithm

– Dependent upon existing annotation

• Clustering

– Fewer, larger clusters

– Clusters draw together genes of interest

– Different GO terms perform differently

• Relevance networks are better for interactive exploration

– Related PoIs

Future work

• Which GO terms work best with relevance?

• Why?

• Further exploration of experimental types and relevance

• Implement algorithms in Ondex

• Optimize function assignment / clustering algorithms

• Extend technique to edges

Acknowledgements

• Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN)

• Newcastle Systems Biology Resource Centre

• Research Councils of the UK

• BBSRC SABR Ondex Project

• Integrative Bioinformatics Research Group
