Presentation at the International Society for Computational Biology (ISCB) European Student Council Symposium in Basel, Switzerland, September 2012.
Genome-Wide Annotation Prediction
with SVD Truncation
based on ROC Analysis
ESCS 2012, ISCB European Student Council Symposium
September 8th 2012, Basel, Switzerland
Davide Chicco, Marco Masseroli
Summary
1. The context & the problem
• Biomolecular annotations
• Prediction of biomolecular annotations
• SVD (Singular Value Decomposition)
• SVD Truncation
2. The proposed solution
• ROC Area Under the Curve comparison
• Truncation level choices
3. Evaluation
• Evaluation data set & results
4. Conclusions
Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• This information is expressed through controlled vocabularies,
sometimes structured as ontologies, where every controlled
term of the vocabulary is associated with a unique
alphanumeric code
• The association of such a code with a gene or protein ID
constitutes an annotation
[Diagram: Gene / Protein, linked by an Annotation (gene2bff) to a Biological function feature]
Biomolecular annotations (2)
• The association of an information/feature with a gene or
protein ID constitutes an annotation
• Annotation example:
• gene: GD4
• feature: “is present in the mitochondrial membrane”
Prediction of biomolecular annotations
• Many annotations are available in different databanks
• However, the available annotations are incomplete
• Only a few of them represent highly reliable, human-curated
information
• To support and speed up the time-consuming curation process,
prioritized lists of computationally predicted annotations
are extremely useful
• These lists can be generated by software that implements
machine learning algorithms
Annotation prediction through
Singular Value Decomposition – SVD
• Annotation matrix $A \in \{0, 1\}^{m \times n}$
− m rows: genes / proteins
− n columns: annotation terms
A(i,j) = 1 if gene / protein i is annotated to term j or to any
descendant of j in the considered ontology structure (true
path rule)
A(i,j) = 0 otherwise (it is unknown)

         term01  term02  term03  term04  …  termN
gene01   0       0       0       0       …  0
gene02   0       1       1       0       …  1
…        …       …       …       …       …  …
geneM    0       0       0       0       …  0
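
The true path rule above can be illustrated with a small sketch. The gene names, term names, and the `parents` mapping below are hypothetical toy data, and `build_annotation_matrix` is an illustrative helper, not code from the presented work. It sets A(i,j) = 1 for every directly annotated term and for all of its ancestors.

```python
def build_annotation_matrix(genes, terms, direct, parents):
    """Return a 0/1 matrix (list of rows) with the true path rule applied.

    direct:  gene -> list of directly annotated terms (assumed input format)
    parents: term -> list of parent terms in the ontology (assumed input format)
    """
    col = {t: j for j, t in enumerate(terms)}
    A = [[0] * len(terms) for _ in genes]
    for i, g in enumerate(genes):
        stack = list(direct.get(g, ()))
        seen = set()
        while stack:  # walk up the ontology from each direct annotation
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            A[i][col[t]] = 1
            stack.extend(parents.get(t, ()))
    return A

# Toy ontology: term03 is a child of term02, which is a child of term01.
genes = ["gene01", "gene02"]
terms = ["term01", "term02", "term03"]
parents = {"term03": ["term02"], "term02": ["term01"]}
direct = {"gene02": ["term03"]}

A = build_annotation_matrix(genes, terms, direct, parents)
```

Here gene02 is directly annotated only to term03, yet its row is all ones because both ancestors inherit the annotation.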
Singular Value Decomposition – SVD
Compute SVD:
$A = U \Sigma V^T$
Compute reduced rank approximation:
$A_k = U_k \Sigma_k V_k^T$
• An annotation prediction is performed by computing a reduced
rank approximation Ak of the annotation matrix A
(where 0 < k < r, with r the number of non-zero singular values
of A, i.e. the rank of A)
Singular Value Decomposition – SVD (2)
• Ak contains real valued entries related to the likelihood that
gene i shall be annotated to term j
For a certain real threshold τ:
if Ak(i,j) > τ, gene i is predicted to be annotated to term j
− The threshold τ can be chosen in order to obtain the
best predicted annotations [Khatri et al., 2005]
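
A minimal sketch of the reduced rank approximation and thresholding described above, using NumPy. The toy matrix, the value k=2, and the threshold 0.4 are arbitrary choices for illustration, and `truncated_svd_prediction` is a hypothetical helper name.

```python
import numpy as np

def truncated_svd_prediction(A, k):
    """Return the rank-k approximation A_k = U_k Sigma_k V_k^T of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by the first k singular values.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy annotation matrix: 5 genes (rows) x 4 terms (columns).
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0]], dtype=float)

A_k = truncated_svd_prediction(A, k=2)
tau = 0.4
predicted = A_k > tau  # True where gene i is predicted to be annotated to term j
```

Entries where `predicted` is True but A is 0 are the candidate new annotations.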
Singular Value Decomposition – SVD (3)
• It is possible to rewrite the SVD decomposition in an equivalent
form, such that the predicted annotation profile is given by:
$a_{k,i}^T = a_i^T V_k V_k^T$
where $a_{k,i}^T$ is a row vector containing the predictions for gene i
• Note that Vk depends on the whole set of genes
• Indeed, the columns of Vk are a set of eigenvectors of the
global term-to-term correlation matrix $T = A^T A$, estimated from
the whole set of available annotations
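
A short derivation sketch of why the two forms agree. It assumes the usual SVD conventions (orthonormal columns in U and V, with Vk the first k columns of V) and is not taken from the slides:

```latex
\begin{align*}
A &= U \Sigma V^T, \qquad V^T V_k = \begin{bmatrix} I_k \\ 0 \end{bmatrix} \\
A V_k &= U \Sigma V^T V_k = U_k \Sigma_k \\
A V_k V_k^T &= U_k \Sigma_k V_k^T = A_k \\
T V_k &= A^T A V_k = V \Sigma^T \Sigma V^T V_k = V_k \Sigma_k^2
\end{align*}
```

Reading the third line row by row gives the profile formula for each gene i. The last line shows that the columns of Vk are eigenvectors of T, with the squared singular values as eigenvalues.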
Evaluation of the prediction
To evaluate the prediction, we compare each element A(i,j) to its
corresponding Ak(i,j) for each real threshold τ, with 0 ≤ τ ≤ 1.0:
• if A(i,j) = 1 & Ak(i,j) > τ: AC, Annotation Confirmed
(AC <- AC+1)
• if A(i,j) = 1 & Ak(i,j) ≤ τ: AR, Annotation to be Reviewed
(AR <- AR+1)
• if A(i,j) = 0 & Ak(i,j) ≤ τ: NAC, No Annotation Confirmed
(NAC <- NAC+1)
• if A(i,j) = 0 & Ak(i,j) > τ: AP, Annotation Predicted
(AP <- AP+1)
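
The four counters above can be sketched in plain Python as follows. The function name `evaluation_counts` and the toy matrices are illustrative assumptions.

```python
def evaluation_counts(A, A_k, tau):
    """Count AC, AR, NAC and AP by comparing A(i,j) with A_k(i,j) against tau."""
    counts = {"AC": 0, "AR": 0, "NAC": 0, "AP": 0}
    for row_a, row_k in zip(A, A_k):
        for a, score in zip(row_a, row_k):
            if a == 1:
                # Existing annotation: confirmed if the score exceeds tau.
                counts["AC" if score > tau else "AR"] += 1
            else:
                # No existing annotation: predicted if the score exceeds tau.
                counts["AP" if score > tau else "NAC"] += 1
    return counts

# Toy input: 2 genes x 2 terms, with made-up reconstructed scores.
A = [[1, 0],
     [0, 1]]
A_k = [[0.9, 0.5],
       [0.1, 0.3]]

counts = evaluation_counts(A, A_k, tau=0.4)
```

With these toy values each of the four categories is hit exactly once.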
SVD truncation
• The main problem of truncated SVD: how to choose the
truncation level?
• Where to truncate?
How to choose k here?
New concept: Receiver Operating Characteristic
(ROC) curve
Starting from the annotation prediction evaluation counts we just
introduced:

Category                        Input: A(i,j) = 1?   Output: Ak(i,j) > τ?
AC: Annotation Confirmed        Yes                  Yes
AR: Annotation to be Reviewed   Yes                  No
NAC: No Annotation Confirmed    No                   No
AP: Annotation Predicted        No                   Yes

we can draw a Receiver Operating Characteristic curve for
every prediction:
On the x axis: the annotation to be reviewed rate
On the y axis: the annotation predicted rate
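
A rough sketch of how such a curve and its area could be computed. The slide formulas for the two rates did not survive in this transcript, so the definitions used here, AR rate = AR / (AC + AR) and AP rate = AP / (AP + NAC), are assumptions. So are the two anchor points (0, 1) and (1, 0), which correspond to predicting everything and predicting nothing. Under these assumptions both axes are error rates, so a smaller AUC means a better prediction.

```python
def rates(A, A_k, tau):
    """Return (AR rate, AP rate) at threshold tau. Assumed rate definitions."""
    ac = ar = nac = ap = 0
    for row_a, row_k in zip(A, A_k):
        for a, score in zip(row_a, row_k):
            if a == 1:
                if score > tau:
                    ac += 1
                else:
                    ar += 1
            elif score > tau:
                ap += 1
            else:
                nac += 1
    ar_rate = ar / (ac + ar) if ac + ar else 0.0
    ap_rate = ap / (ap + nac) if ap + nac else 0.0
    return ar_rate, ap_rate

def roc_auc(A, A_k, steps=100):
    """Trapezoidal area under the (AR rate, AP rate) curve for tau in [0, 1]."""
    # Increasing tau moves right along x (more AR) and down along y (fewer AP).
    points = [(0.0, 1.0)]  # assumed anchor: everything predicted
    points += [rates(A, A_k, t / steps) for t in range(steps + 1)]
    points.append((1.0, 0.0))  # assumed anchor: nothing predicted
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

A = [[1, 0], [0, 1]]
auc_perfect = roc_auc(A, [[1.0, 0.0], [0.0, 1.0]])  # scores equal to A
auc_worst = roc_auc(A, [[0.0, 1.0], [1.0, 0.0]])    # scores inverted
```

A reconstruction identical to A yields the smallest possible area, and an inverted one the largest, which matches the "minimum AUC is best" convention used in the rest of the talk.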
SVD truncation choice
Algorithm:
1) Choose some possible truncation levels
2) Compute the Receiver Operating Characteristic for each
SVD prediction of those truncation levels
3) Compute the Area Under the Curve (AUC) of each ROC
4) Choose the truncation level of the ROC that has minimum
AUC
Steps 2) to 4): quite easy!
Step 1): quite challenging!
Minimum AUC between all the ROCs of various
truncation levels
Step 1) Choose some possible truncation levels
We cannot compute the SVD, its ROC and its AUC for every
truncation level, because it would be too expensive (in time
and resources).
Algorithm:
1) Since the matrix A has m rows (genes) and n columns
(annotation terms), we take p = min(m, n)
2) Since r ≤ p is the number of non-zero singular values
along the diagonal of Σ, the best truncation value is in the
interval [1; r]
3) We limit the range to [10% of r; 90% of r], to avoid
truncation levels that, during the SVD reconstruction phase,
would consider too few of the main singular values, or almost all
the non-zero singular values of A
Minimum AUC between all the ROCs of various
truncation levels (2)
4) We take 25% of r as the first possible truncation level, and
compute the SVD for it and for the next four levels: q1, q2, q3,
q4, q5
5) We compute the ROC and its AUC for q1, q2, q3, q4, q5
6) We take the level that has minimum AUC
Minimum AUC between all the ROCs of various
truncation levels (3)
If the minimum AUC among those of (q1, q2, q3, q4, q5) belongs
to one of the middle elements (q2, q3 or q4), that element is
taken as the best truncation value, and the algorithm finishes.
This means we found a local minimum.
Minimum AUC between all the ROCs of various
truncation levels (6)
If the minimum among (q1, q2, q3, q4, q5) is q1, the first,
the AUC values will probably keep decreasing
moving to the left,
so we move the truncation interval to the left.
Minimum AUC between all the ROCs of various
truncation levels (7)
If the minimum among (q1, q2, q3, q4, q5) is q5, the last,
the AUC values will probably keep decreasing
moving to the right,
so we move the truncation interval to the right.
Minimum AUC between all the ROCs of various
truncation levels (8)
The levels are computed by adding 2*q5-q1 to each element
of the first analysis
Minimum AUC between all the ROCs of various
truncation levels (9)
On the new group of levels, we repeat the minimum
computation and the choice.
If the ROC of q7, q8 or q9 has minimum AUC, the algorithm stops.
If this local minimum is lower than the previous ones, it is
considered the global minimum and chosen as the best truncation
value.
The algorithm stops when:
• One of the middle elements is chosen, or
• Max number of attempts (e.g. 5) is made
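
The window search described over the last slides can be sketched as below. This is an interpretation, not the authors' implementation. `auc_of` is a hypothetical callable that computes the SVD, ROC and AUC for one truncation level. The spacing `step` between levels and the shift of five steps between windows are assumptions, since the slides do not state them unambiguously.

```python
def choose_truncation(auc_of, r, step, max_attempts=5):
    """Search windows of 5 truncation levels for the one with minimum AUC."""
    lo, hi = max(1, round(0.10 * r)), round(0.90 * r)  # allowed range of levels
    start = round(0.25 * r)                            # first candidate level
    best_k, best_auc = None, float("inf")
    for _ in range(max_attempts):
        # Keep each level's position i in the window, dropping out-of-range levels.
        window = [(i, start + i * step) for i in range(5)
                  if lo <= start + i * step <= hi]
        if not window:
            break
        auc, i, k = min((auc_of(k), i, k) for i, k in window)
        if auc < best_auc:  # remember the lowest AUC seen in any window
            best_k, best_auc = k, auc
        if i in (1, 2, 3):  # a middle element wins: local minimum, stop
            break
        # Minimum at an edge: slide the window left (i == 0) or right (i == 4).
        start += (5 if i == 4 else -5) * step
    return best_k, best_auc

# Made-up AUC profile with its minimum at k = 60, for a matrix of rank r = 100.
calls = []
def toy_auc(k):
    calls.append(k)
    return (k - 60) ** 2 / 10000

best_k, best_auc = choose_truncation(toy_auc, r=100, step=5)
```

In this toy run the first window (25 to 45) has its minimum at the right edge, so the search moves right and stops in the second window. The number of `auc_of` calls never exceeds 5 * 5 = 25.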
Evaluation data set
• We considered the Gene Ontology annotations of two organisms:
Gallus gallus (chicken) and Bos taurus (cattle)
− Excluding the less reliable Inferred from Electronic Annotation (IEA) annotations
• After this, the resulting data sets were:

Organism       Ontology            Genes  Terms  Annotations (direct)
Gallus gallus  Biological Process  275    527    738
Gallus gallus  Cellular Component  260    148    478
Gallus gallus  Molecular Function  309    225    509
Bos taurus     Biological Process  512    930    1,557
Bos taurus     Cellular Component  497    234    921
Bos taurus     Molecular Function  543    422    934

with total (true-path-rule) annotations about 10 times more
than the direct annotations
Results
• To evaluate the performance of our method, we used
annotations of:
− terms: Biological Process (BP), Cellular Component (CC) and
Molecular Function (MF) GO features
− organisms: Gallus gallus and Bos taurus genes
• These were available in July 2009 in an old version of the Gene Ontology
Annotation (GOA) database ( http://GeneOntology.org/ ).
• For example, by analyzing the Gallus gallus annotations between
genes and BP terms (8,731 annotations; 275 genes; 610 terms), our
method suggested k=77 as the best truncation value for the SVD.
Results (2)
• This value of k led to a ROC curve having AUC=40.27%, while
the 2nd best k value, 59, led to AUC=40.46%
• From the 8,731 input annotations, with τ=0.4, the SVD method
with best truncation level k=77 predicted 44 annotations as
APs.
• Out of these, 28 (63.63%) turned out to be present among the
GO annotations in a GOA database version 27 months more
recent (Oct. 2011); these 28 APs included 14 annotations
(50%) with GO evidence different from IEA or ND.
• Other truncation levels led to worse results
Results (3)
• Costs (time & resources): maximum number of SVD computations:
5 * 5 = 25 << min(#genes, #terms)
− first factor 5: maximum number of elements in the truncation interval
− second factor 5: maximum number of truncation intervals
− min(#genes, #terms): maximum number of SVD computations if all
the possible truncation levels were considered (in the previous
table, from 148 upwards)
Conclusions
Problem: SVD truncation in the context of genomic annotation
prediction
Proposed solution: finding the truncation level corresponding to
the minimum AUC of the ROC curve
Conclusions (2)
• To avoid computing the SVD for all the possible truncation levels
(too expensive!), we proposed an algorithm for the search of
local and global minima.
• The best SVD truncation levels suggested by this algorithm for
our dataset (annotations between Bos taurus and Gallus gallus genes
and GO terms) gave better results than other truncation levels, in
a reasonable time.
Future developments
• To obtain the best sampling, we could study the gradient
variations in the distribution of the AUC values for different
truncation levels and the histogram of the eigenvalues
• Our approach is not limited to the Gene Ontology and can be
applied to any controlled annotations
Thanks for your attention!!!
www.DavideChicco.it