"genome-wide annotation prediction with svd truncation based on roc analysis" - davide...

Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis Escs 2012 ISCB European Student Council Symposium September 8 th 2012, Basel, Switzerland Davide Chicco, Marco Masseroli [email protected]

Upload: davide-chicco

Post on 11-May-2015


DESCRIPTION

Presentation at International Society of Computational Biology European Student Council Symposium in Basel, Switzerland. September 2012

TRANSCRIPT

Page 1: "Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis" - Davide Chicco (PoliMi) @ ISCB ESCS 2012

Genome-Wide Annotation Prediction with SVD Truncation based on ROC Analysis

ESCS 2012, ISCB European Student Council Symposium

September 8th 2012, Basel, Switzerland

Davide Chicco, Marco Masseroli


Summary

1. The context & the problem

• Biomolecular annotations

• Prediction of biomolecular annotations

• SVD (Singular Value Decomposition)

• SVD Truncation

2. The proposed solution

• ROC Area Under the Curve comparison

• Truncation level choices

3. Evaluation

• Evaluation data set & results

4. Conclusions


Biomolecular annotations

• The concept of annotation: association of nucleotide or amino acid sequences with useful information describing their features

• This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code

• The association of such a code with a gene or protein ID constitutes an annotation

[Figure: a Gene / Protein linked to a Biological function feature by an Annotation (gene2bff)]


Biomolecular annotations (2)

• The association of a piece of information (a feature) with a gene or protein ID constitutes an annotation

• Annotation example:

• gene: GD4

• feature: “is present in the mitochondrial membrane”



Prediction of biomolecular annotations

• Many annotations are available in different databanks

• However, the available annotations are incomplete

• Only a few of them represent highly reliable, human-curated information

• To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful

• These lists can be generated by software that implements Machine Learning algorithms


Annotation prediction through Singular Value Decomposition – SVD

• Annotation matrix $A \in \{0, 1\}^{m \times n}$

− m rows: genes / proteins

− n columns: annotation terms

A(i,j) = 1 if gene / protein i is annotated to term j or to any descendant of j in the considered ontology structure (true path rule)

A(i,j) = 0 otherwise (it is unknown)

         term01  term02  term03  term04  …  termN
gene01   0       0       0       0       …  0
gene02   0       1       1       0       …  1
…        …       …       …       …       …  …
geneM    0       0       0       0       …  0
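As an illustration of the true path rule, here is a minimal Python sketch that builds the binary matrix A from direct annotations and propagates each annotation to all ancestor terms. The function name, the toy ontology and the term names are hypothetical and only serve the example; the gene name GD4 is borrowed from the annotation example given earlier.

```python
import numpy as np

def build_annotation_matrix(genes, terms, direct, parents):
    """Return the binary m x n matrix A with the true path rule applied.

    genes, terms: lists giving the row and column order.
    direct: dict gene -> set of directly annotated terms.
    parents: dict term -> set of parent terms (hypothetical toy ontology).
    """
    col = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=int)
    for i, g in enumerate(genes):
        stack = list(direct.get(g, ()))
        seen = set()
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            A[i, col[t]] = 1                  # annotated to t ...
            stack.extend(parents.get(t, ()))  # ... and to every ancestor of t
    return A

# Toy data (assumed for illustration only).
genes = ["gene01", "GD4"]
terms = ["cellular_component", "membrane", "mitochondrial_membrane"]
parents = {"mitochondrial_membrane": {"membrane"},
           "membrane": {"cellular_component"}}
direct = {"GD4": {"mitochondrial_membrane"}}

A = build_annotation_matrix(genes, terms, direct, parents)
print(A)
```

A single direct annotation of GD4 yields three entries equal to 1 in its row, which is why the total annotation count is much larger than the direct one.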


Singular Value Decomposition – SVD

Compute SVD: $A = U \Sigma V^T$

Compute reduced rank approximation: $A_k = U_k \Sigma_k V_k^T$

• An annotation prediction is performed by computing a reduced rank approximation A_k of the annotation matrix A (where 0 < k < r, with r the number of non-zero singular values of A, i.e. the rank of A)
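The reduced rank approximation can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation; the function name `truncated_svd` and the small matrix are assumptions made for the example.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation of A from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy annotation matrix (5 genes x 4 terms), assumed for illustration only.
A = np.array([[0, 0, 0, 0],
              [0, 1, 1, 0],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

r = int(np.linalg.matrix_rank(A))   # number of non-zero singular values
A_1 = truncated_svd(A, 1)           # strongest truncation
A_r = truncated_svd(A, r)           # no truncation: reproduces A
print(np.round(A_1, 2))
```

Keeping all r singular values returns A itself, so only levels below r can produce new predictions.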


Singular Value Decomposition – SVD (2)

• A_k contains real-valued entries related to the likelihood that gene i should be annotated to term j

For a certain real threshold τ:

if A_k(i,j) > τ, gene i is predicted to be annotated to term j

− The threshold τ can be chosen in order to obtain the best predicted annotations [Khatri et al., 2005]


Singular Value Decomposition – SVD (3)

• It is possible to rewrite the SVD decomposition in an equivalent form, such that the predicted annotation profile is given by:

$a_{k,i}^T = a_i^T V_k V_k^T$

where $a_{k,i}^T$ is a row vector containing the predictions for gene i

• Note that V_k depends on the whole set of genes

• Indeed, the columns of V_k are a set of eigenvectors of the global term-to-term correlation matrix $T = A^T A$, estimated from the whole set of available annotations
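The equivalence follows in a few steps from the SVD, assuming the usual convention that U and V have orthonormal columns and that U_k, Σ_k, V_k keep the first k singular triplets:

```latex
\begin{align*}
A V_k &= U \Sigma V^T V_k = U_k \Sigma_k
  && \text{(since } V^T V_k \text{ selects the first } k \text{ columns)} \\
A V_k V_k^T &= U_k \Sigma_k V_k^T = A_k \\
a_{k,i}^T &= a_i^T V_k V_k^T
  && \text{(row } i \text{ of the previous identity)} \\
T = A^T A &= V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T
  && \text{(so the columns of } V \text{ are eigenvectors of } T \text{)}
\end{align*}
```

The prediction for gene i is therefore the projection of its annotation profile onto the subspace spanned by the first k eigenvectors of T.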


Evaluation of the prediction

To evaluate the prediction, we compare each element A(i,j) to its corresponding A_k(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0

• if A(i,j) = 1 & A_k(i,j) > τ: AC: Annotation Confirmed (AC <- AC+1)

• if A(i,j) = 1 & A_k(i,j) ≤ τ: AR: Annotation to be Reviewed (AR <- AR+1)

• if A(i,j) = 0 & A_k(i,j) ≤ τ: NAC: No Annotation Confirmed (NAC <- NAC+1)

• if A(i,j) = 0 & A_k(i,j) > τ: AP: Annotation Predicted (AP <- AP+1)
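The four counters can be computed for one threshold with boolean masks, as in the following sketch. The function name and the toy matrices are assumptions for illustration.

```python
import numpy as np

def evaluation_counts(A, A_k, tau):
    """Count AC, AR, NAC and AP for one threshold tau."""
    known = A == 1          # annotation present in the input matrix
    predicted = A_k > tau   # annotation predicted at this threshold
    return {
        "AC": int(np.sum(known & predicted)),
        "AR": int(np.sum(known & ~predicted)),
        "NAC": int(np.sum(~known & ~predicted)),
        "AP": int(np.sum(~known & predicted)),
    }

# Toy input and toy reconstructed values (assumed for illustration only).
A = np.array([[1, 0],
              [0, 1]])
A_k = np.array([[0.9, 0.6],
                [0.1, 0.3]])

counts = evaluation_counts(A, A_k, tau=0.5)
print(counts)
```

The four counters always sum to the number of entries of A, which is a convenient sanity check.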


SVD truncation

• The main problem of truncated SVD: how to choose the truncation level?

• Where to truncate?

How to choose the k here?


New concept: Receiver Operating Characteristic (ROC) curve

Starting from the annotation prediction evaluation factors we just introduced:

Factor                          Input  Output
AC: Annotation Confirmed        Yes    Yes
AR: Annotation to be Reviewed   Yes    No
NAC: No Annotation Confirmed    No     No
AP: Annotation Predicted        No     Yes

We can draw a Receiver Operating Characteristic curve for every prediction:

On the x axis: the annotation to be reviewed rate

On the y axis: the annotation predicted rate
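A possible way to trace such a curve and measure its area is sketched below. The formulas of the two rates did not survive in this transcript, so the definitions used here are assumptions: AR rate = AR / (AC + AR) and AP rate = AP / (AP + NAC). The function names are also hypothetical.

```python
import numpy as np

def roc_points(A, A_k, thresholds):
    """Return (AR rate, AP rate) for each threshold; rate formulas are assumed."""
    known = A == 1
    xs, ys = [], []
    for tau in thresholds:
        predicted = A_k > tau
        ac = np.sum(known & predicted)
        ar = np.sum(known & ~predicted)
        nac = np.sum(~known & ~predicted)
        ap = np.sum(~known & predicted)
        xs.append(ar / (ac + ar))    # assumes A has at least one 1
        ys.append(ap / (ap + nac))   # assumes A has at least one 0
    return np.array(xs), np.array(ys)

def auc(xs, ys):
    """Area under the curve by the trapezoidal rule, after sorting by x."""
    order = np.argsort(xs, kind="stable")
    x, y = xs[order], ys[order]
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

A = np.array([[1, 0],
              [0, 1]])
A_k = np.array([[0.9, 0.6],
                [0.1, 0.3]])
thresholds = np.linspace(0.0, 1.0, 11)   # 0 <= tau <= 1.0

xs, ys = roc_points(A, A_k, thresholds)
area = auc(xs, ys)
print(area)
```

Under these assumed definitions both axes measure disagreement with the input matrix, so a reconstruction identical to A gives an area of 0 and smaller areas indicate closer agreement.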


SVD truncation choice

Algorithm:

1) Choose some possible truncation levels

2) Compute the Receiver Operating Characteristic curve for the SVD prediction of each of those truncation levels

3) Compute the Area Under the Curve (AUC) of each ROC

4) Choose the truncation level whose ROC has the minimum AUC


Quite easy!


Quite challenging!


Minimum AUC between all the ROCs of various

truncation levels

1) Choose some possible truncation levels

We cannot compute the SVD, its ROC and its AUC for every truncation value, because it would be too expensive (in time and resources).

Algorithm:

1) Since the matrix A has m rows (genes) and n columns (annotation terms), we take p = min(m, n)

2) Since r ≤ p is the number of non-zero singular values along the diagonal of Σ, the best truncation value is in the interval [1; r]

3) We limit the range to [10% of r; 90% of r], to avoid truncation levels that, during the SVD reconstruction phase, would consider too few of the main singular values, or almost all the non-zero singular values of A
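The three steps above can be sketched as follows. The function name and the rounding of the two bounds (ceiling for the lower one, floor for the upper one) are assumptions, since the slides do not say how fractional values are handled.

```python
import math
import numpy as np

def truncation_range(A, low_pct=10, high_pct=90):
    """Return p, r and the limited interval [k_min, k_max] of truncation levels."""
    m, n = A.shape
    p = min(m, n)                        # upper bound on the rank
    r = int(np.linalg.matrix_rank(A))    # number of non-zero singular values
    k_min = max(1, math.ceil(r * low_pct / 100))             # assumed rounding
    k_max = max(k_min, math.floor(r * high_pct / 100))       # assumed rounding
    return p, r, k_min, k_max

# Toy full-rank matrix with 20 rows and 30 columns (for illustration only).
A = np.eye(20, 30)
p, r, k_min, k_max = truncation_range(A)
print(p, r, k_min, k_max)
```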


Minimum AUC between all the ROCs of various

truncation levels (2)

4) We take the value 25% of r as the first possible truncation level, and compute the SVD for it and for the next four levels: q1, q2, q3, q4, q5

5) We compute the ROC and its AUC for q1, q2, q3, q4, q5

6) We take the level that has the minimum AUC


Minimum AUC between all the ROCs of various

truncation levels (3)

If the minimum AUC among those of (q1, q2, q3, q4, q5) is that of the middle element q3, q3 is taken as the best truncation value, and the algorithm finishes.

This means we found a local minimum.


Minimum AUC between all the ROCs of various

truncation levels (4)

If the minimum AUC among those of (q1, q2, q3, q4, q5) is that of the 4th element q4, q4 is taken as the best truncation value, and the algorithm finishes.

This means we found a local minimum.


Minimum AUC between all the ROCs of various

truncation levels (5)

If the minimum AUC among those of (q1, q2, q3, q4, q5) is that of the 2nd element q2, q2 is taken as the best truncation value, and the algorithm finishes.

This means we found a local minimum.


Minimum AUC between all the ROCs of various

truncation levels (6)

If the minimum among (q1, q2, q3, q4, q5) is q1, the first, the AUC values will probably decrease again moving to the left, so we move the truncation interval to the left.


Minimum AUC between all the ROCs of various

truncation levels (7)

If the minimum among (q1, q2, q3, q4, q5) is q5, the last, the AUC values will probably decrease again moving to the right, so we move the truncation interval to the right.


Minimum AUC between all the ROCs of various

truncation levels (8)

The levels are computed by adding 2*q5-q1 to each element

of the first analysis


Minimum AUC between all the ROCs of various

truncation levels (9)

On the new group of levels, we repeat the minimum computation and the choice.

If the ROC of q7, q8 or q9 has the minimum AUC, the algorithm stops.

If this local minimum is lower than the previous ones, it is considered the global minimum and chosen as the best truncation value.


Minimum AUC between all the ROCs of various

truncation levels (11)

The algorithm stops when:

• One of the middle elements is chosen, or

• The maximum number of attempts (e.g. 5) is reached
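The whole search can be sketched as a loop over windows of five levels. This is a hedged reading of the slides, not the authors' code: the function name, the `step` between consecutive levels and the toy AUC function are hypothetical, and the shift of a full window (so that the new group is q6 to q10, with q7, q8, q9 as middle elements) is an assumption.

```python
def search_truncation(auc_of, start, step, k_min, k_max, width=5, max_attempts=5):
    """Slide a window of `width` levels until an inner level has the minimum AUC.

    auc_of: callable k -> AUC of the ROC obtained with truncation level k.
    Returns the best level over everything evaluated, plus all evaluated AUCs.
    """
    evaluated = {}
    first = start
    for _ in range(max_attempts):
        levels = [first + i * step for i in range(width)]
        levels = [k for k in levels if k_min <= k <= k_max]
        if not levels:
            break
        for k in levels:
            if k not in evaluated:
                evaluated[k] = auc_of(k)   # one SVD + ROC + AUC per new level
        local = min(levels, key=lambda k: evaluated[k])
        if local == levels[0] and len(levels) > 1:
            first -= width * step          # minimum on the left edge: move left
        elif local == levels[-1] and len(levels) > 1:
            first += width * step          # minimum on the right edge: move right
        else:
            break                          # a middle element won: local minimum
    if not evaluated:
        raise ValueError("no truncation level inside [k_min, k_max]")
    best = min(evaluated, key=evaluated.get)
    return best, evaluated

# Hypothetical AUC profile with a single minimum at k = 60 (not real data).
def toy_auc(k):
    return 0.40 + ((k - 60) / 100) ** 2

best, evaluated = search_truncation(toy_auc, start=25, step=5, k_min=10, k_max=90)
print(best, len(evaluated))
```

Returning the best level over all evaluated windows mirrors the remark that a new local minimum replaces the previous ones only if it is lower. The number of AUC evaluations is bounded by width times max_attempts.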


Evaluation data set

• We considered the Gene Ontology annotations of two organisms: Gallus gallus (chicken) and Bos taurus (cattle)

− Excluding the less reliable Inferred from Electronic Annotation (IEA) annotations

• After this, the resulting data sets were the following, with total (true-path-rule) annotations about 10 times more numerous than the direct annotations:

Organism       Ontology            Genes  Terms  Annotations (direct)
Gallus gallus  Biological Process  275    527    738
Gallus gallus  Cellular Component  260    148    478
Gallus gallus  Molecular Function  309    225    509
Bos taurus     Biological Process  512    930    1,557
Bos taurus     Cellular Component  497    234    921
Bos taurus     Molecular Function  543    422    934


Results

• To evaluate the performance of our method, we used annotations between:

− terms: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) GO features

− genes of the organisms Gallus gallus and Bos taurus

• These annotations were available in July 2009 in an old version of the Gene Ontology Annotation (GOA) database ( http://GeneOntology.org/ ).

• For example, by analyzing the Gallus gallus annotations between genes and BP terms (8,731 annotations; 275 genes; 610 BP terms), our method suggested k=77 as the best truncation value for the SVD.


Results (2)

• This value of k led to a ROC curve with AUC=40.27%, while the 2nd best k value, 59, led to AUC=40.46%

• From the 8,731 input annotations, with τ=0.4, the SVD method with the best truncation level k=77 predicted 44 annotations as APs.

• Out of these, 28 (63.6%) turned out to be present among the GO annotations in a GOA database version 27 months more recent (Oct. 2011); these 28 APs included 14 annotations (50%) with GO evidence different from IEA or ND.

• Other truncation levels led to worse results


Results (3)

• Costs (time & resources): maximum number of SVD computations:

5 * 5 = 25 << min(#genes, #terms)

− first 5: maximum number of elements in the truncation interval

− second 5: maximum number of truncation intervals

− min(#genes, #terms): maximum number of SVD computations if all the possible truncation levels were considered (in the previous table, from 148 upward)


Conclusions

Problem: choosing the SVD truncation level in the context of genomic annotation prediction

Proposed solution: finding the truncation level corresponding to the minimum AUC of the ROC curve


Conclusions (2)

• To avoid computing the SVD for all the possible truncation levels (too expensive!), we proposed an algorithm that searches for local and global minima.

• The best SVD truncation levels suggested by this algorithm for our dataset (annotations between Bos taurus and Gallus gallus genes and GO terms) gave better results than other truncation levels, in a reasonable time.


Future developments

• To obtain the best sampling, we could study the gradient variations in the distribution of the AUC values for different truncation levels, and the histogram of the eigenvalues

• Our approach is not limited to the Gene Ontology and can be applied to any controlled annotations


Thanks for your attention!!!

www.DavideChicco.it
