a graph-theoretic modeling on go space for biological interpretation of gene clusters
DESCRIPTION
A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics Unit, ISTECH Inc. Cancer Metastasis Research Center, Yonsei University College of Medicine Sung Geun Lee, Jung Uk Hur and Yang Seok Kim. 報告人 : 張家榮. Introduction. Gene Ontology (GO) - PowerPoint PPT PresentationTRANSCRIPT
A graph-theoretic modeling on GO space for biological interpretation of gene
clusters
Bioinformatics Unit, ISTECH Inc.Cancer Metastasis Research Center, Yonsei University College of
Medicine Sung Geun Lee,Jung Uk Hur andYang Seok Kim
報告人 :張家榮
Introduction
Gene Ontology (GO)– Controlled vocabulary of various genomic databases about
diverse species
Clusters of microarray data– Each cluster has some genes
Extracts GO terms for a gene cluster– Each gene has several corresponding GO terms
Purpose– To discover the meaning of each cluster
Metric structure of GO tree
Lowest common ancestor
Given a non-empty subset U, v is a common ancestor of U if every node in U is on a subtree having v as the root and v0 is an LCA of U if v0 is greater than or equal to the level of w for any common ancestor w of U.
Lowest common ancestor
Principal distance
Each level has its own weight– W: IH -> R+
– W(i) > W(i+1)– For example: W(k)=150-10(k-1)
Where w0 is LCA of v1 and v2
Principal distance
40
30
20
10
0
Multiset
Mathematically, the following three sets {1}, {1, 1}, {1, 1, 1} are equal in the set notation.
Yet, we want to take the number of occurrences of elements into account.
Such set is called as a multiset.
MaxPd and AverPd
given a multiset G ={v1, v2, . . . , vn}
MaxPd and AverPd
MaxPd– give the comprehensive biological meanings of a
gene cluster– Not flexible but informs us of the existence of
some functional outliers
AverPd– Signifies the most frequent GO codes– More than one
Algorithmic approach (1)
c[i,j] is j’st GO code of i’st gene We consider ordered GO codes g[m] where 1
≤m≤α α is a constant related to the input data α≤Ω , Ω is the total number of GO code
Algorithmic approach (2)
MaxPd is used to find LCA of C Complexity : 3αn
min
MaxPd
40
30
20
10
0
40 4030
40
20
Algorithmic approach (3)
AverPd is used to find an optimal GO code g[m0] such that the average distance between g[m0] and each gene in C is smaller than that of any g[m]
Complexity : 3αn
AverPd
Discussion
Other algorithms consider GO term frequencies or compare specific GO term-related gene groups
In our modeling, the topological property of GO hierarchy is used.
Utility
Biological assessment of the clustering results of DNA microarray data
Coupled with any clustering technique to predict the functional category of the unknown genes
Not only DNA microarray data, but also any kinds of group analysis with any ontology having an identical structure with GO
Another approach
The length of GO code is about logα Take one number of GO code each timepseudo-code:for 1≤k≤ logα
cluster C by the kth numberbreak if no cluster above ndelete clusters under n
end
MaxPd
Complexity
Worst case : O( nlogα) Best case : O (n) Alternate
– Change the break time to cluster it in detail
disadvantage
Cannot obtain information not contained in GO such as disease-related genes
GO terms on the same level have different level information
GO hierarchy is dynamic and flexible