Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction

Page 1

Integrating Ontological Prior Knowledge into Relational Learning for Protein Function Prediction

Stefan Reckow, Max Planck Institute of Psychiatry

Volker Tresp, Siemens, Corporate Technology

Page 2

Proteins and Protein Ontologies

Page 3

Proteins and Protein Functions

Motivation
• proteins are the molecular machines of every organism
• understanding protein function is essential for all areas of the biosciences
• diverse sources of knowledge about proteins exist

Challenges
• experimental determination of functions is difficult and expensive
• homologies can be misleading
• most proteins have several functions

Page 4

Protein function prediction

What function does this protein have?

Candidate functions, in order of increasing specificity:

catalytic activity (catalyzes a reaction)

isomerase activity

intramolecular oxidoreductase activity

intramolecular oxidoreductase activity, interconverting aldoses and ketoses

triose-phosphate isomerase activity (catalyzes a very specific reaction)

Page 5

“Function” Ontologies

[Figure: example hierarchy. Root: function. Second level: energy, transcription, cell fate. Third level: glycolysis, fermentation, respiration, cell growth, cell death. Fourth level: aerobic, anaerobic.]

Ontologies are a way of bringing order to the functions of proteins

An ontology is a description of the concepts of a domain and their relationships

Hierarchical representation (subclass relationship)
• tree
• directed acyclic graph
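The two hierarchy shapes named on this slide can be made concrete with a small sketch. The concept names below are taken from the slide's figure, but the parent links are assumptions (in particular, placing "aerobic" and "anaerobic" under "respiration"), and the helper names `is_acyclic` and `is_tree` are hypothetical. The acyclicity check relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter, CycleError

# Subclass links: each concept maps to the set of its direct parents.
# Concept names come from the slide; the parent assignments are assumed.
PARENTS = {
    "function": set(),
    "energy": {"function"},
    "transcription": {"function"},
    "cell fate": {"function"},
    "glycolysis": {"energy"},
    "fermentation": {"energy"},
    "respiration": {"energy"},
    "cell growth": {"cell fate"},
    "cell death": {"cell fate"},
    "aerobic": {"respiration"},
    "anaerobic": {"respiration"},
}


def is_acyclic(parents):
    """Return True if the subclass graph contains no cycle."""
    try:
        # static_order() raises CycleError when a cycle is present.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


def is_tree(parents):
    """Return True if the graph is acyclic, has one root, and every
    other concept has exactly one direct parent."""
    nodes = set(parents)
    for ps in parents.values():
        nodes |= set(ps)
    roots = [n for n in nodes if not parents.get(n)]
    single_parent = all(len(parents[n]) == 1 for n in nodes if parents.get(n))
    return is_acyclic(parents) and len(roots) == 1 and single_parent


print(is_tree(PARENTS))

# Giving one concept a second parent keeps the graph acyclic,
# but it is no longer a tree.
dag = {**PARENTS, "aerobic": {"respiration", "energy"}}
print(is_acyclic(dag), is_tree(dag))
```

Storing parents rather than children is a design choice made here because subclass reasoning usually walks upward from a specific concept to its more general ones.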

Page 6

“Complex” Ontology

Complex: a structure formed by a group of two or more proteins to perform certain functions in concert

[Figure: excerpt of the hierarchy. Root: Complex. Nodes shown: Cytoskeleton, Proteasome, Intracellular transport, Actin filaments, Microtubules, 10 nm filaments, Clathrin, Intermediate filaments, Septin filaments, Golgi transport.]

Page 7

Ontologies as a Great Source of Prior Knowledge in Machine Learning

A considerable amount of community effort is invested in designing ontologies

Typically this prior knowledge is deterministic (logical constraints)

Machine learning should be able to exploit this knowledge

• Protein interactions are important information for predicting function: statistical relational learning

Page 8

Statistical Relational Learning with the IHRM

Page 9

Statistical Relational Learning (SRL)

SRL generalizes standard machine learning to domains where relations between entities (and not just entity attributes) play a significant role

Examples: PRM, DAPER, MLN, RMN, RDN

The IHRM is an easily applicable general model that performs a cluster analysis of relational domains and requires no structural learning

Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proc. 22nd UAI, 2006.

C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. AAAI 2006.

Page 10

Standard Latent Model for Proteins: Mixture Models

[Figure: two separate proteins, Protein1 and Protein2. Protein i has a latent variable Z_i and an attribute node A_i.]

In a Bayesian approach, we can permit an infinite number of states in the latent variables and obtain a Dirichlet Process Mixture Model (DPM)

Advantage: the model only uses a finite number of those states; thus no time-consuming structural optimization is required
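The claim that an infinite-state model still uses only finitely many states can be illustrated with the Chinese restaurant process, the standard sequential view of the partition induced by a Dirichlet process prior. This is a minimal sketch and not the authors' inference code; the function name `crp_assignments` and the value of the concentration parameter `alpha` are assumptions chosen for illustration.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items already assigned to cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels


labels = crp_assignments(n_items=1000, alpha=1.0)
print(len(set(labels)), "clusters used for", len(labels), "items")
```

No number of clusters is fixed in advance: new clusters can always be opened, yet after any finite number of items only a finite set is occupied, which is why no separate search over the number of latent states is needed.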

Page 11

Infinite Hidden Relational Model (IHRM)

[Figure: three proteins, Protein1, Protein2 and Protein3. Protein i has a latent variable Z_i and an attribute node A_i. Pairs of proteins are connected by "interact" relation nodes such as R_{2,1} and R_{3,2}.]

• Permits us to include protein-protein interactions in the model
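How interactions enter such a model can be sketched generatively: each protein draws a latent state, its attribute depends on that state, and whether two proteins interact depends on the states of both. The sketch below is a simplification under stated assumptions: it truncates the model to a fixed number of states with fixed parameters (the IHRM places a Dirichlet process prior on the states and learns the parameters), and the names `sample_relational_model`, `pi`, `theta` and `phi` are hypothetical.

```python
import random


def sample_relational_model(n_proteins, pi, theta, phi, seed=0):
    """Sample latent states, attributes and pairwise interactions.

    pi[k]       : probability of latent state k
    theta[k][v] : probability of attribute value v given state k
    phi[k][l]   : probability that proteins in states k and l interact
    """
    rng = random.Random(seed)
    states = range(len(pi))
    # One latent state per protein.
    z = [rng.choices(states, weights=pi)[0] for _ in range(n_proteins)]
    # Each attribute depends only on the protein's own latent state.
    a = [rng.choices(range(len(theta[zi])), weights=theta[zi])[0] for zi in z]
    # Each relation depends on the latent states of both proteins.
    r = {}
    for i in range(n_proteins):
        for j in range(i + 1, n_proteins):
            r[(i, j)] = rng.random() < phi[z[i]][z[j]]
    return z, a, r


# Assumed toy parameters: two states whose members interact mostly
# with proteins in the same state.
pi = [0.5, 0.5]
theta = [[0.9, 0.1], [0.1, 0.9]]
phi = [[0.8, 0.05], [0.05, 0.8]]

z, a, r = sample_relational_model(3, pi, theta, phi)
print(z, a, r)
```

Because the relation probability is indexed by both latent states, observing an interaction carries information about the states of both partners, which is what lets interaction data influence the clustering.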

Page 12

Ground Network

[Figure: ground network for three proteins. Each latent variable Z1, Z2 and Z3 has the attribute nodes motif, complex and function; three "interact" nodes connect the latent variables.]

Page 13

Experimental Results

KDD Cup 2001

Yeast genome data

1243 genes/proteins: 862 (training) / 381 (test)

Attributes
• Chromosome
• Motif (351) [1-6]: a gene might contain one or more characteristic motifs (information about the amino acid sequence of the protein)
• Essential
• Structural class (24) [1-2]: the protein coded by the gene might belong to one or more structural categories
• Phenotype (11) [1-6]: observed phenotypes in the organism
• Interaction
• Complex (56) [1-3]: the expression of the gene can complex with others to form a larger protein
• Function (14) [1-4]: cell growth, cell organization, transport, …

Genes were anonymous

Page 14

Results

[Figure: ROC curve]

Comparison with Supervised Models

Model            Accuracy
IHRM             93.16
Krogel et al.    93.63
SVM              93.48

Page 15

IHRM Result

[Figure: clustered interaction network.] Node: gene. Link: interaction. Color: cluster.

Page 16

Integrating Ontological Prior Knowledge into the IHRM

Page 17

Integration of ontologies

Deductive closure
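One reading of deductive closure for a subclass hierarchy is that every annotation implies all of its more general concepts, and that these implied annotations are added to the data before learning. The following is a minimal sketch under that reading. The concept names appear elsewhere in the slides, but the parent links in `PARENTS` are assumed for illustration, and the helper names `ancestors` and `deductive_closure` are hypothetical.

```python
def ancestors(concept, parents):
    """Return all strict ancestors of a concept in a subclass DAG."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def deductive_closure(annotations, parents):
    """Add every concept implied by the given annotations."""
    closed = set(annotations)
    for concept in annotations:
        closed |= ancestors(concept, parents)
    return closed


# Hypothetical excerpt of a "complex" hierarchy (parent links assumed).
PARENTS = {
    "cytoskeleton": {"complex"},
    "actin filaments": {"cytoskeleton"},
    "microtubules": {"cytoskeleton"},
}

print(sorted(deductive_closure({"actin filaments"}, PARENTS)))
# ['actin filaments', 'complex', 'cytoskeleton']
```

A single specific annotation becomes three annotations after closure in this toy example, so proteins annotated with different leaf concepts can still share their general ones.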

Page 18

Integration of ontologies

[Figure: latent variable Zi with the attribute nodes motif, function and complex. Complex concepts shown: signal peptidase, actin filaments, microtubules, cytoskeleton, translocon. The figure distinguishes independent concepts from dependent concepts.]

Page 19

Experiments: Including the “Complex” Ontology

Data collected from CYGD of MIPS

1000 genes/proteins: 800 (training) / 200 (test)

Attributes
• chromosome, motif, essential, structural class, phenotype, interaction, complex, function

Interactions from DIP

Usage of ontological knowledge on complex
• five levels of hierarchy
• in our model: 258 nodes (concepts) using 66 top-level categories
• every protein has at least one complex annotation
• after including ontological constraints: about three annotations per protein on average

Page 20

Results

AUC

Split (training / test)    w/o ontology    with ontology
800 / 200                  0.895           0.928
200 / 200                  0.832           0.894
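For readers unfamiliar with the metric, AUC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below is a generic implementation of that definition on made-up scores; it does not reproduce the evaluation behind the numbers above, and the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : predicted scores, higher means more likely positive
    labels : 1 for positive examples, 0 for negative examples
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of examples, which is acceptable for a test set of a few hundred proteins.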

Page 21

Results

[Figure: explicit modeling of dependencies]

Page 22

Results

• proteins acting in cell division
• control proteins
• "Septins": septins have several roles throughout the cell cycle and carry out essential functions in cytokinesis
• the three highlighted proteins fit into this cluster ("cell fate" and "cell type differentiation")

• proteins concerned with secretion and transportation
• the "Golgi apparatus" works together with the "endoplasmic reticulum (ER)" as the transport and delivery system of the cell
• "SNARE" proteins help to direct material to the correct destination
• the test proteins are also annotated with "cellular transport"

• Grey: in test set

Page 23

Results

[Figure: sampling convergence]

Page 24

Results

[Figure: distribution of proteins in the clusters]

Page 25

Results

• Tasks occurring during DNA replication
• The former singleton "DNA polymerase", a main actor in replication, is assigned to the correct cluster here

• Cellular transport cluster
• The former singleton "Clathrin light chain", a major constituent of coated vesicles (a component for transport), fits into this cluster quite well

• Grey: former singletons

Page 26

Conclusion

Application of the IHRM to function prediction
• competitive with supervised learning methods
• insights into the solution

Advantages of integrating ontological knowledge
• improvement of the clustering structure
• robustness: stable results with varying parameterization
• deductive closure prior to learning is a general, powerful principle

Future challenges
• usage of several or more complex ontologies
• further analysis of dependent vs. independent concepts

Acknowledgements: Karsten Borgwardt (MPIs Tübingen); Hans-Peter Kriegel (LMU)