"probabilistic latent semantic analysis for prediction of gene ontology annotations" -...


Upload: davide-chicco

Post on 21-Jun-2015

DESCRIPTION

Talk delivered by Davide Chicco at PhDay 2012 at Dipartimento di Elettronica e Informazione of Politecnico di Milano, Milan, September 2012.

TRANSCRIPT

Page 1: "Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations" - Davide Chicco (PoliMi) @ Dei PoliMi PhDay 2012

DIPARTIMENTO DI ELETTRONICA E INFORMAZIONE

Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations

Davide Chicco, Pietro Pinoli, Marco Masseroli

2012

Summary

1. The problem

• Biomolecular annotations

• Prediction of biomolecular annotations

2. The methods

• SVD – Singular Value Decomposition

• pLSA – Probabilistic Latent Semantic Analysis

3. Evaluation

• Evaluation data set

• Evaluation results

4. Conclusions

Biomolecular annotations

• The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code

• The association of such a code with a gene or protein ID constitutes an annotation

[Diagram: a Gene / Protein is linked to a Biological function feature by an Annotation (gene2bff)]

Biomolecular annotations (2)

• The association of a piece of information (a feature) with a gene or protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”

Prediction of biomolecular annotations

• Many annotations are available in different databanks

• However, the available annotations are incomplete

• Only a few of them represent highly reliable, human-curated information

• To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful

• These lists could be generated by software that implements machine learning algorithms

Annotation prediction through Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$

− m rows: genes / proteins

− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)

A(i,j) = 0 otherwise (it is unknown)

         term01  term02  term03  term04  …  termN
gene01   0       0       0       0       …  0
gene02   0       1       1       0       …  1
…        …       …       …       …       …  …
geneM    0       0       0       0       …  0
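
As an illustration of how such a matrix could be built, here is a minimal sketch in Python. The toy ontology, gene names, direct annotations and the helper names (`ancestors`, `build_annotation_matrix`) are all hypothetical and are not taken from the talk. The sketch only shows the true path rule: a direct annotation to a term also sets the entries of all its ancestor terms.

```python
import numpy as np

# Hypothetical toy ontology: child term -> list of parent terms.
parents = {
    "term02": ["term01"],
    "term03": ["term01"],
    "term04": ["term02", "term03"],
}
terms = ["term01", "term02", "term03", "term04"]
genes = ["gene01", "gene02", "gene03"]

# Hypothetical direct annotations (gene -> directly annotated terms).
direct = {"gene01": [], "gene02": ["term04"], "gene03": ["term03"]}


def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def build_annotation_matrix(genes, terms, direct, parents):
    """Binary matrix A with A[i, j] = 1 if gene i is annotated to term j
    directly or through a descendant of j (true path rule)."""
    column = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=int)
    for i, gene in enumerate(genes):
        for term in direct[gene]:
            for t in ancestors(term, parents):
                A[i, column[t]] = 1
    return A


A = build_annotation_matrix(genes, terms, direct, parents)
print(A)
```

In this toy case gene02 is directly annotated only to term04, yet its whole row is set to 1 because every other term is an ancestor of term04.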

Singular Value Decomposition – SVD

Compute SVD:

$A = U \Sigma V^T$

Compute reduced rank approximation:

$A_k = U_k \Sigma_k V_k^T$

• An annotation prediction is performed by computing a reduced rank approximation $A_k$ of the annotation matrix $A$ (where $0 < k < r$, with $r$ the number of non-zero singular values of $A$, i.e. the rank of $A$)
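
A minimal sketch of the reduced rank approximation, using NumPy's `numpy.linalg.svd`. The toy matrix and the function name `truncated_svd` are assumptions made for illustration only; the talk does not show its own implementation.

```python
import numpy as np


def truncated_svd(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T of the matrix A."""
    U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Hypothetical toy annotation matrix (4 genes x 4 terms).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])

r = np.linalg.matrix_rank(A)   # number of non-zero singular values
k = 2                          # truncation level, chosen so that 0 < k < r
A_k = truncated_svd(A, k)

# Entries of A_k are real-valued scores: a high score where A is 0
# is a candidate for a predicted annotation.
print(np.round(A_k, 2))
```

The entries of A_k are no longer binary, which is why a threshold is needed later to turn them into predictions.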

Probabilistic Latent Semantic Analysis - pLSA

pLSA:

• An alternative to the SVD method

• Based on Latent Semantic Indexing (LSI)

Latent Semantic Indexing – LSI:

• Identifies latent relationships between different elements in a certain class

− e.g. between documents and words within them

− between genes and their biomolecular features described by controlled annotation terms

• Maps class elements to a vector space of reduced dimensionality, and then analyzes it

Probabilistic Latent Semantic Analysis - pLSA (2)

Suppose you have:

• A set of genes G = {g1, …, gn} related to a set of feature terms F = {f1, …, fn} which, together, form a set of controlled biomolecular annotations

• A set of class variables T = {t1, …, tn}, called topics, with every feature term f ∈ F that can be associated with a topic t ∈ T

The pLSA statistical model associates an unobserved class variable (topic) with each observation (the occurrence of a feature term with a gene)

Probabilistic Latent Semantic Analysis - pLSA (3)

• P(f | t): probability of a feature term f being associated with a topic t

• P(t | g): probability of getting a topic t by selecting a gene g

• The following conditions hold:

$\forall t \in T: \sum_{f \in F} P(f \mid t) = 1$

$\forall g \in G: \sum_{t \in T} P(t \mid g) = 1$

• The joint probability between g and f is given by:

$P(g, f) = \sum_{t \in T} P(t)\, P(g \mid t)\, P(f \mid t)$

Probabilistic Latent Semantic Analysis - pLSA (4)

Model training

• Aim: maximum likelihood estimation of P(f|t) by using the Expectation Maximization (EM) algorithm on a training set

Model validation

• A validation set of genes and feature terms, with the same feature terms as the training set but completely different genes

• Aim: maximize the formula in [1], using the P(f|t) calculated in the training phase and varying the parameters P(t|g) related to the new genes in the validation set

$L = \sum_{g \in G} \sum_{f \in F} a(g, f) \log P(g, f)$   [1]

Probabilistic Latent Semantic Analysis - pLSA (5)

EM Algorithm:

It seeks a maximum likelihood estimate by iteratively applying:

• Expectation step: the a posteriori probabilities for the latent variables t are computed, as

$P(t \mid g, f) \propto P(t)\, P(g, f \mid t)$

• Maximization step: the parameter values are updated in order to maximize the log-likelihood.
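
The two steps can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation. It assumes the symmetric parameterization P(g, f) = Σ_t P(t) P(g|t) P(f|t), with P(g, f | t) = P(g|t) P(f|t), and it takes a(g, f) to be the entries of a toy binary annotation matrix. The function names, the random initialization and the fixed iteration count are all assumptions.

```python
import numpy as np


def plsa_em(A, n_topics, n_iter=50, seed=0):
    """Fit P(t), P(g|t), P(f|t) to the matrix A with plain EM (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    eps = 1e-12  # guards against division by zero

    # Random initial distributions; each column sums to 1.
    P_t = np.full(n_topics, 1.0 / n_topics)
    P_g_t = rng.random((m, n_topics))
    P_g_t /= P_g_t.sum(axis=0)
    P_f_t = rng.random((n, n_topics))
    P_f_t /= P_f_t.sum(axis=0)

    for _ in range(n_iter):
        # E-step: post[g, f, t] = P(t | g, f), proportional to
        # P(t) P(g|t) P(f|t) and normalized over the topics t.
        post = P_t[None, None, :] * P_g_t[:, None, :] * P_f_t[None, :, :]
        post /= np.maximum(post.sum(axis=2, keepdims=True), eps)

        # M-step: re-estimate the parameters from a(g, f) * P(t | g, f).
        weighted = A[:, :, None] * post
        P_g_t = weighted.sum(axis=1)
        P_f_t = weighted.sum(axis=0)
        P_t = weighted.sum(axis=(0, 1))
        P_g_t = P_g_t / np.maximum(P_g_t.sum(axis=0), eps)
        P_f_t = P_f_t / np.maximum(P_f_t.sum(axis=0), eps)
        P_t = P_t / np.maximum(P_t.sum(), eps)
    return P_t, P_g_t, P_f_t


def log_likelihood(A, P_t, P_g_t, P_f_t):
    """L = sum over g, f of a(g, f) * log P(g, f)."""
    P_gf = (P_g_t * P_t) @ P_f_t.T
    mask = A > 0
    return float(np.sum(A[mask] * np.log(P_gf[mask])))


# Hypothetical toy annotation matrix (4 genes x 4 terms).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])
P_t, P_g_t, P_f_t = plsa_em(A, n_topics=2)
print(log_likelihood(A, P_t, P_g_t, P_f_t))
```

The dense three-dimensional posterior array keeps the sketch short but grows as genes × terms × topics, so a real data set would need a sparse formulation.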

Probabilistic Latent Semantic Analysis - pLSA (5)

In comparison to SVD:

$U_k = [P(g_i \mid t_k)]_{ik}$

$\Sigma_k = \mathrm{diag}[P(t_k)]_k$

$V_k = [P(f_j \mid t_k)]_{jk}$

$A_k = [P(g_i, f_j)]_{ij} = U_k \Sigma_k V_k^T$
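
To make the correspondence concrete, the following sketch checks numerically that the matrix product U_k Σ_k V_k^T equals the topic-by-topic sum Σ_t P(t) P(g|t) P(f|t). The probability values are made up for illustration (3 genes, 4 feature terms, 2 topics) and do not come from the talk.

```python
import numpy as np

# Hypothetical toy parameters; every column is a probability distribution.
P_t = np.array([0.6, 0.4])            # P(t_k)
U_k = np.array([[0.5, 0.1],
                [0.3, 0.2],
                [0.2, 0.7]])          # P(g_i | t_k)
V_k = np.array([[0.4, 0.1],
                [0.3, 0.1],
                [0.2, 0.3],
                [0.1, 0.5]])          # P(f_j | t_k)

Sigma_k = np.diag(P_t)
A_k = U_k @ Sigma_k @ V_k.T           # [P(g_i, f_j)]_ij

# The same joint probabilities written as an explicit sum over topics.
explicit = sum(
    P_t[t] * np.outer(U_k[:, t], V_k[:, t]) for t in range(len(P_t))
)
print(np.round(A_k, 3))
```

Unlike the SVD factors, these factors are non-negative and normalized rather than orthogonal, so A_k is itself a joint probability distribution whose entries sum to 1.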

Probabilistic Latent Semantic Analysis - pLSA (6)

The pLSA model imposes the constraint:

$\forall g \in G: \sum_{f \in F} P(f \mid g) = 1$

• This can bias the prediction because the more annotations a gene has, the lower its average conditional probability is

• To avoid such bias we propose a normalized extension of pLSA:

• $\forall g \in G$:

i. Compute: $M = \max_{f \in F} P(f \mid g)$

ii. Compute the normalized P(f | g) vector as: $P_{norm}(f \mid g) = \frac{1}{M} P(f \mid g)$

• Thus, the feature terms with the highest conditional probability for a gene are always predicted to be annotated to that gene
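
A minimal sketch of this normalization, assuming the conditional probabilities are stored in a matrix with one row per gene and one column per feature term. The storage layout, the function name and the numbers are illustrative assumptions.

```python
import numpy as np


def normalize_by_row_max(P_f_given_g):
    """Divide each gene's P(f | g) vector by its maximum value M."""
    M = P_f_given_g.max(axis=1, keepdims=True)
    return P_f_given_g / M


# Hypothetical P(f | g): each row sums to 1.
P = np.array([
    [0.50, 0.30, 0.20, 0.00],   # gene with one dominant term
    [0.25, 0.25, 0.25, 0.25],   # gene with probability spread over many terms
])
P_norm = normalize_by_row_max(P)
print(P_norm)
```

After normalization the top term of every gene scores 1, so a gene with many annotations is no longer penalized for spreading its probability mass over more terms.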

Evaluation of the prediction

To evaluate the prediction, we compare each element A(i,j) to its corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & Ak(i,j) > τ: AC: Annotation Confirmed (AC ← AC + 1)

• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR: Annotation to be Reviewed (AR ← AR + 1)

• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC: No Annotation Confirmed (NAC ← NAC + 1)

• if A(i,j) = 0 & Ak(i,j) > τ: AP: Annotation Predicted (AP ← AP + 1)
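
The four counters can be computed for one threshold as in the sketch below. The matrices and the function name `evaluation_counts` are hypothetical; only the four comparison rules come from the slide.

```python
import numpy as np


def evaluation_counts(A, A_k, tau):
    """Count AC, AR, NAC and AP for a single threshold tau."""
    annotated = A == 1
    above = A_k > tau
    return {
        "AC": int(np.sum(annotated & above)),     # A = 1, A_k > tau
        "AR": int(np.sum(annotated & ~above)),    # A = 1, A_k <= tau
        "NAC": int(np.sum(~annotated & ~above)),  # A = 0, A_k <= tau
        "AP": int(np.sum(~annotated & above)),    # A = 0, A_k > tau
    }


# Hypothetical input annotations and predicted scores.
A = np.array([[1, 0, 1],
              [0, 0, 1]])
A_k = np.array([[0.9, 0.2, 0.4],
                [0.7, 0.1, 0.5]])

counts = evaluation_counts(A, A_k, tau=0.5)
print(counts)
```

Note that the score 0.5 at position (2, 3) is not strictly above τ = 0.5, so that annotation counts as AR rather than AC.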

New concept: Receiver Operating Characteristic (ROC) curve

Starting from the annotation prediction evaluation factors we just introduced:

Factor                          Input  Output
AC: Annotation Confirmed        Yes    Yes
AR: Annotation to be Reviewed   Yes    No
NAC: No Annotation Confirmed    No     No
AP: Annotation Predicted        No     Yes

We can draw the Receiver Operating Characteristic curves for every prediction:

On the x axis, the annotation to be reviewed rate: AR / (AC + AR)

On the y axis, the annotation predicted rate: AP / (AP + NAC)
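
A sketch of how the points of such a curve could be obtained by sweeping the threshold τ, using the rates AR / (AC + AR) and AP / (AP + NAC). The toy matrices, the threshold grid and the function name `roc_points` are assumptions; the sketch also assumes that the input matrix contains both annotated and non-annotated entries, so that neither denominator is zero.

```python
import numpy as np


def roc_points(A, A_k, thresholds):
    """Return (tau, AR rate, AP rate) for every threshold tau."""
    annotated = A == 1
    points = []
    for tau in thresholds:
        above = A_k > tau
        ac = np.sum(annotated & above)
        ar = np.sum(annotated & ~above)
        nac = np.sum(~annotated & ~above)
        ap = np.sum(~annotated & above)
        ar_rate = ar / (ac + ar)      # x axis
        ap_rate = ap / (ap + nac)     # y axis
        points.append((float(tau), float(ar_rate), float(ap_rate)))
    return points


# Hypothetical input annotations and predicted scores.
A = np.array([[1, 0, 1],
              [0, 0, 1]])
A_k = np.array([[0.9, 0.2, 0.4],
                [0.7, 0.1, 0.5]])

points = roc_points(A, A_k, thresholds=[0.0, 0.5, 1.0])
for tau, ar_rate, ap_rate in points:
    print(f"tau={tau:.1f}  AR rate={ar_rate:.2f}  AP rate={ap_rate:.2f}")
```

With this choice of axes the curve runs from (0, 1) at τ = 0 to (1, 0) at τ = 1 in this toy example.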

Evaluation data set

• We considered the Gene Ontology annotations of two organisms: Gallus gallus (Chicken) and Bos taurus (Cattle)

− Excluding the less reliable Inferred Electronic Annotations

• After this filtering, the organism data sets were as follows, with the total (true-path-rule) annotations about 10 times as many as the direct annotations:

Organism       Ontology  Genes  Terms  Annotations (direct)
Gallus gallus  BP        275    527    738
Gallus gallus  CC        260    148    478
Gallus gallus  MF        309    225    509
Bos taurus     BP        512    930    1,557
Bos taurus     CC        497    234    921
Bos taurus     MF        543    422    934

Evaluation results

• The ROC curves of annotation to be reviewed rate AR / (AC + AR) and annotation predicted rate AP / (AP + NAC) for the Bos taurus (Cattle) Cellular Component (top left), Molecular Function (top) and Biological Process (left) data sets, for SVD with the best truncation value (in red) and for pLSAnorm with the best number of topics (in green)

Evaluation results (3)

• As an aggregated indicator of prediction performance, we computed the Area Under the Curve (AUC) in the [0; 0.01] range of AP rate values

− We are interested in the low range of the AP rate, since it corresponds to the top-ranked predictions of newly inferred annotations (AP), those with the highest score

Area under ROC curves (AUC, %) and execution time (sec)

Organism       Ontology  AUC SVD  AUC pLSAnorm  Time SVD  Time pLSAnorm
Bos taurus     BP        44.30    34.75         33        28,188
Bos taurus     CC        53.03    27.31         36        4,674
Bos taurus     MF        80.96    30.69         11        1,890
Gallus gallus  BP        47.33    44.83         98        3,990
Gallus gallus  CC        75.39    37.22         10        796
Gallus gallus  MF        65.76    29.87         5         422

Conclusions

• We proposed the pLSAnorm method as a novel contribution in the context of prediction of genomic ontological annotations

- Our pLSAnorm method gives better predictions than the Singular Value Decomposition (SVD) method

- The higher execution time of pLSAnorm compared with SVD calls for further optimization, and currently limits its use to off-line analysis or small data sets

Conclusions (2)

• Our approach is not limited to the Gene Ontology considered here and can be applied to any controlled annotations

• The increasingly available annotations from different controlled vocabularies and ontologies could be considered jointly to further improve prediction reliability (both in SVD and pLSAnorm)

Thank you for your attention

Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations