
Genomics 84 (2004) 922–928, doi:10.1016/j.ygeno.2004.08.005

Learnability-based further prediction of gene functions in Gene Ontology

Kang Tu a,*, Hui Yu a, Zheng Guo a,b,c, Xia Li a,c

a Department of Bioinformatics, Harbin Medical University, Harbin 150086, People's Republic of China
b School of Biology Science and Technology, Tongji University, Shanghai 200092, People's Republic of China
c Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, People's Republic of China
* Corresponding author. E-mail address: [email protected] (K. Tu).

Received 5 June 2004; accepted 2 August 2004

Available online 17 September 2004

Abstract

Currently the functional annotations of many genes are not specific enough, limiting their further application in biology and medicine. It is necessary to push the gene functional annotations deeper in Gene Ontology (GO), or to predict further annotated genes with more specific GO terms. A framework of learnability-based further prediction of gene functions in GO is proposed in this paper. Local classifiers are constructed in local classification spaces rooted at qualified parent nodes in GO, and their classification performances are evaluated with the averaged Tanimoto index (ATI). Classification spaces with higher ATIs are selected out, and genes annotated only to the parent classes are predicted to child classes. Through learnability-based further predicting, the functional annotations of annotated genes are made more specific. Experiments on the fibroblast serum response dataset reported further functional predictions for several human genes and also gave interesting clues to the varied learnability between classes of different GO ontologies, different levels, and different numbers of child classes.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Gene expression data; Functional further prediction; Gene Ontology; Neural network; Tanimoto index

In the postgenome era, computational biologists are exploiting various data sources such as gene or protein sequences [1], protein–protein interaction relationships [2,3], and gene expression data [4–6] to help characterize unknown genes. Among these diverse data sources, gene expression data [7] are of special interest because they allow parallel quantification of thousands of genes in different biological settings. Since the first trial of Brown et al. [8], Hvidsten et al. [5], Kuramochi et al. [9], and Mateos et al. [6] have used such supervised learning methods as rough set theory [10], support vector machines [11], and k-nearest neighbors [12] to predict gene functions from gene expression profiles and have obtained encouraging results.

To predict gene functions from gene expression profiles, one must have a gene functional categorization system that provides the functional class labels of known genes. Gene Ontology (GO) [13] is such a gene functional categorization system, and it has been utilized in many applications of expression-data-based gene functional prediction. GO provides a valuable source of structured knowledge of gene function (more accurately, gene product function) in terms of "molecular function" (MF), "biological process" (BP), and "cellular component" (CC). The gene function categorizations are represented with nodes (also called terms or classes) in a directed acyclic graph in which one node may have more than one parent node. There are two relationships between child nodes and parent nodes in GO: "Is-a," where the child node is an instance of its parent (function), and "Part-of," where the child node is a part of its parent (component). With existing gene function knowledge, known genes can be annotated to the GO class(es) representing their most specific function(s).

With the scarce and fragmented status of present knowledge of gene functionality, however, we generally have rather abstract gene annotations with GO, since most known genes are annotated to nodes at high levels in GO [4]. It is necessary to push the gene functional annotations deeper in GO, or to predict further the annotated genes with more specific GO terms.

Fig. 1. The distribution of ATI values of the 44 candidate classification spaces.

The first point in this work is to predict further the functions of known genes from parent classes to child classes in GO. Specifically, we regard genes that are annotated to a parent node but not to its child nodes as unknown genes, and we train a classification model to classify them into one or more child nodes. We thus encounter separate classification tasks at separate parent nodes (PNs), so different PN-specific classifiers are trained at different parent nodes to distinguish among their child nodes (CNs). For conciseness, we refer to the set of a PN and its CNs as a classification space.

There are a large number of classes (nodes) in a gene functional categorization system, but not all of these classes can be learned well by a learning model, given one expression dataset with a finite number of experimental conditions. Brown et al. [8] pointed out that many classes in MYGD [14] would be unlearnable from expression data by any classifier, and they selected only five classes for learning that were expected to exhibit similar expression profiles. Among the enormous number of classes in GO, some are expression-irrelevant and cannot be learned from any expression data. Other classes exhibit experiment-dependent learnability: they may be learned well from expression data obtained under relevant experimental conditions, but they cannot yield good prediction performance on expression data from irrelevant experimental conditions. Under irrelevant experimental conditions, such functional classes are not influenced and therefore do not exhibit sufficient expression similarity to be tracked by any classifier. Kuramochi et al. [9] noted this point and suggested including more diverse expression data to allow more classes to be learned. Given one gene expression dataset obtained under particular experimental conditions, we suggest that the more learnable classes be selected from the enormous number of GO classes, with the expression-irrelevant and experiment-irrelevant classes being excluded from the prediction study. We adopted the learnability-based class scoring idea of Pavlidis et al. [15] to evaluate and select functional classes.

The ideas of further prediction and learnability-based selection of classification spaces are combined in this paper to yield a novel gene function prediction framework, which we term learnability-based further prediction. In the first step, we set up the classification problem and prepare the training examples at every candidate classification space, build candidate classifiers under a fivefold cross-validation schema, and then evaluate the prediction precision with the averaged Tanimoto index (ATI). In the second step, we select the learnable classification spaces, whose ATI values exceed a threshold, and then perform the gene functional further prediction tasks at these learnable classification spaces. The experiments on the fibroblast serum response dataset reported further functional predictions for several human genes and also gave interesting clues to the varied learnability between classes of different GO ontologies, different levels, and different numbers of child classes.

Results and discussion

The known genes in the small serum dataset were first annotated directly to nodes in the "biological process" ontology, and then the direct annotations were generalized to ancestor nodes of the same branches. In this way we got 160 nodes that were each annotated with at least one gene. We required that a candidate parent node to be classified have at least two child nodes and more than five generally annotated genes, which restricted the number of candidate parent nodes to 44. We then constructed PN-specific classifiers separately at the 44 candidate classification spaces. All 44 classification spaces together involved 131 classes (including PNs and CNs), and we term them the candidate classes.

The distribution of ATI values of the 44 candidate classification spaces is shown in Fig. 1. From Fig. 1, we can see that the classification performances (measured by the ATI value) of the candidate classification spaces vary, and more than half are unsatisfactory (ATI < 0.6). With the threshold of ATI set at 0.7, 14 candidate classification spaces passed this criterion and were regarded as the learnable classification spaces. All 14 learnable classification spaces together involved 45 classes (including PNs and CNs), which we term the learnable classes.

The 14 learnable parent nodes are shown in Table 1. As discussed by Iyer et al. [16], fibroblast cells are programmed to interpret the abrupt exposure to serum not only as a general mitogenic stimulus but also as a specific physiological signal signifying a wound. Among the 14 learnable parent nodes, "M phase," "transcription," "transcription, DNA-dependent," and "biosynthesis" are related to the proliferative program; "response to biotic stimulus" and "homeostasis" are related to the wound repair program. "Steroid metabolism" appears in Table 1 because the cholesterol biosynthetic pathway is downregulated in a coordinated fashion in the face of the low-density lipoproteins provided by serum.

Fig. 2. The scatter plot of ATI values vs Onto-Express p values of the 44 candidate classification spaces.

Table 1
The learnable parent nodes and their classification performances

Learnable parent node                         ATI    TPR    TNR
GO:0008150, biological process                0.71   0.84   0.84
GO:0050875, cellular physiological process    0.80   0.89   0.85
GO:0030003, cation homeostasis                0.95   0.95   0.95
GO:0000279, M phase                           0.92   0.96   0.94
GO:0046907, intracellular transport           0.80   0.75   0.72
GO:0009653, morphogenesis                     0.76   0.73   0.70
GO:0042592, homeostasis                       0.93   0.93   0.87
GO:0009058, biosynthesis                      0.72   0.73   0.55
GO:0008202, steroid metabolism                0.95   0.95   0.95
GO:0006350, transcription                     0.98   0.99   0.99
GO:0006351, transcription, DNA-dependent      0.73   0.86   0.84
GO:0006464, protein modification              0.79   0.77   0.75
GO:0007600, sensory perception                0.90   0.95   0.90
GO:0009607, response to biotic stimulus       0.79   0.90   0.81

ATI, the averaged Tanimoto index; TPR, the true positive rate; TNR, the true negative rate.

Table 2
The cross-table of the 44 candidate parent nodes considering their ATI values and Onto-Express p values

                                           Onto-Express p value
                                           < 0.05     > 0.05
Learnable parent classes (ATI ≥ 0.7)        9          5
Nonlearnable parent classes (ATI < 0.7)     6          24

The learnable classification spaces are enriched with the differentially expressed genes

Under the hypothesis that the classes involved in the learnable classification spaces are relevant to the experimental conditions, it follows that most of the genes annotated in these learnable classification spaces are induced or repressed by the experimental conditions. We tested this hypothesis with the algorithm described in Onto-Express [17]. The p value of a functional class returned by the Onto-Express algorithm tells how much the class is enriched with differentially expressed genes. The smaller the p value, the more the class is enriched with differentially expressed genes and therefore the more experiment-relevant it is.

Fig. 2 illustrates the relationship between the classification performance (measured by ATI values) and the experiment relevance (measured by Onto-Express p values) of the 44 candidate parent classes. Though the trend is not very obvious, we can still see that classes with lower p values tend to exhibit higher ATI values, which suggests that higher learnability is associated with higher experiment relevance.

Setting the threshold of the Onto-Express p value at 0.05, 9 of the 14 learnable parent classes and 6 of the remaining 30 nonlearnable parent classes passed the criterion (Table 2). The classes with Onto-Express p values less than 0.05 are termed experiment-relevant classes for conciseness. We tested with Fisher's exact test whether the experiment-relevant classes were uniformly distributed between the nonlearnable parent classes and the learnable parent classes and obtained a p value of 0.004. This very small p value suggests that the distribution of experiment-relevant classes was not uniform in the two sets of classes and that the differentially expressed genes were significantly more likely to be enriched in the learnable parent classes.
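As a sanity check, this contingency test can be reproduced with SciPy; the following is a minimal sketch using the counts from Table 2 (the variable names are ours, and the code is illustrative rather than part of the original analysis).

```python
# Fisher's exact test on the 2 x 2 counts of Table 2 (illustrative sketch).
from scipy.stats import fisher_exact

# Rows: learnable (ATI >= 0.7) vs nonlearnable parent classes.
# Columns: experiment-relevant (Onto-Express p < 0.05) vs not.
table = [[9, 5],
         [6, 24]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# A p value of roughly the reported magnitude (about 0.004) indicates that the
# experiment-relevant classes are not uniformly distributed over the learnable
# and nonlearnable parent classes.
```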

The relationship between the classification performance and the number of child nodes

We studied the relationship between the classification performance (measured by ATI) and the number of child nodes in the candidate classification spaces (Fig. 3). The trend is clear: ATI decreases as the number of child nodes increases. This is easy to understand: as the number of child nodes increases, more classes are involved in a classification task, and the classification model needs to distinguish more subtle subclass patterns. In addition, single genes are more likely to be labeled with more than one child class label. In summary, with more child nodes involved in the classification space, the multiclass and multilabel aspects of the classification task become more severe, leading to decreased classification performance.

The relationship between the classification performance and the level of classification spaces

We also studied the relationship between the classification performance and the level of the parent nodes of the candidate classification spaces (Fig. 4). From Fig. 4 we can see that, in the range from level 2 to level 6, ATI increases with increasing level. This may be explained by the fact that deeper functional classes in GO are more homogeneous and are therefore easier for classification models to learn. The ATI drops, however, after level 6, probably due to the shrinking sample sizes at the deeper functional classes.

Fig. 3. The relationship between the classification performance and the number of child nodes.

Fig. 4. The relationship between the classification performance and the level of classification spaces. The level of a classification space is defined as the level of its parent node, beginning from the level-1 node "GO:0008150:Biological_Process".

Fewer learnable classification spaces discovered in MF and CC ontologies

In addition to the biological process ontology, we also carried out studies in the MF and CC ontologies. To our surprise, we found only three learnable classification spaces in the MF ontology and one in the CC ontology (ATI higher than 0.7). Ashburner et al. [13] emphasized that the BP functional classes exhibited a higher degree of expression similarity than the other two ontologies; Hvidsten et al. [5] claimed that BP was the only aspect of gene function in which one may expect a significant correlation between annotations and temporal gene expression profiles. Their claims and findings suggested that the classification performances of BP classes may be higher than those of the MF and CC classes. Our results once again validated the significantly higher correlation between the biological process categorization and the gene expression similarity.

Gene functional further prediction

With the threshold of the ATI value set at 0.7, we obtained 14 learnable classification spaces with relatively higher classification performances. We then performed the functional further predictions on genes that were annotated directly to these learnable parent nodes. Only 6 learnable parent nodes were found with directly annotated genes, and finally a total of 12 genes were predicted to 9 child nodes, resulting in 17 records in Table 3. The fact that only a few genes were involved in the functional further prediction can be explained in two ways: first, only a few functional classes were affected by the particular experimental settings, leading to a limited number of learnable classification spaces; second, the fibroblast serum response dataset was rather small, containing only 517 cDNA sequences in all. Nevertheless, the few further prediction findings give hints of the more specific biological roles of the involved genes and could guide further biological experiments to validate these predictions. The same framework can be applied to other gene expression datasets, and gene functional further predictions involving other genes may be obtained.

Materials and methods

Annotation source

Results were based on the GO database released on May 1, 2004, the UniGene database released on December 20, 2003, and the LocusLink database released on May 12, 2003.

Data

We used two datasets provided by Iyer et al. [16] that describe the transcript levels of genes during the first 24 h of the serum response in serum-starved human fibroblasts. The large dataset consists of 9706 cDNAs, and 517 of them, showing extreme expression regulation or large expression deviation, make up the small serum dataset. Roughly speaking, the 517 cDNAs can be regarded as differentially expressed genes [18,19]. We implemented our learnability-based further prediction on the small serum dataset consisting of these 517 differentially expressed genes. The 9706 cDNAs were utilized in the Onto-Express analysis (see below).

Gene annotation

The 517 cDNAs were mapped to 426 unique genes according to UniGene and LocusLink.

Fig. 5. A classification space example.

Table 3
The functional further prediction results for genes annotated directly at the learnable parent nodes

LocusID  Symbol    Directly annotated term                     Further predicted term                                    Prediction steps (a)
212      ALAS2     GO:0009058, biosynthesis                    GO:0009059, macromolecule biosynthesis                    1
2149     F2R       GO:0009653, morphogenesis                   GO:0000902, cellular morphogenesis                        1
2149     F2R       GO:0009653, morphogenesis                   GO:0009887, organogenesis                                 1
2222     FDFT1     GO:0009058, biosynthesis                    GO:0046138, coenzyme and prosthetic group biosynthesis    1
3156     HMGCR     GO:0009058, biosynthesis                    GO:0046138, coenzyme and prosthetic group biosynthesis    1
3164     NR4A1     GO:0006351, transcription, DNA-dependent    GO:0006366, transcription from Pol II promoter            2
3207     HOXA11    GO:0009653, morphogenesis                   GO:0000902, cellular morphogenesis                        1
3207     HOXA11    GO:0009653, morphogenesis                   GO:0009887, organogenesis                                 1
4281     MID1      GO:0009653, morphogenesis                   GO:0000902, cellular morphogenesis                        1
4281     MID1      GO:0009653, morphogenesis                   GO:0009887, organogenesis                                 1
5352     PLOD2     GO:0006464, protein modification            GO:0006470, protein amino acid dephosphorylation          1
8870     IER3      GO:0009653, morphogenesis                   GO:0009887, organogenesis                                 1
10559    SLC35A1   GO:0006464, protein modification            GO:0006470, protein amino acid dephosphorylation          1
10559    SLC35A1   GO:0006464, protein modification            GO:0006468, protein amino acid phosphorylation            1
25830    SULT4A1   GO:0008202, steroid metabolism              GO:0016125, sterol metabolism                             1
25830    SULT4A1   GO:0008202, steroid metabolism              GO:0006694, steroid biosynthesis                          1
114876   OSBPL1A   GO:0008202, steroid metabolism              GO:0006694, steroid biosynthesis                          1

(a) When a gene is predicted from its directly annotated GO term to a one-level-lower child term, the prediction step is 1. If the further predicted child term is also a learnable parent node, the gene is again further predicted to a one-level-lower child term, which is a grandchild term of the original parent term. In this case, there are 2 prediction steps.

Of the 426 genes, 277 could be annotated to at least one class in GO. As implied by the ontology structure, a gene annotated to a class is also involved (annotated) in the ancestor classes of the same branch, from the annotated node up to the ROOT node. To distinguish these two types of annotation, we refer to the annotation provided by LocusLink as the "direct annotation" and the generalized annotation derived later from the GO structure as the "general annotation." The goal of this work is to select learnable classification spaces in GO and to predict further the known genes from their directly annotated classes to deeper classes in these learnable classification spaces.
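The generalization step can be sketched as follows, assuming the GO DAG is available as a mapping from each term to its parent terms; go_parents and direct_annotations are hypothetical inputs, not data structures from the paper.

```python
# Minimal sketch of generalizing direct annotations to all ancestor terms.
from collections import defaultdict

def generalize(direct_annotations, go_parents):
    """direct_annotations: gene -> set of directly annotated GO terms;
    go_parents: GO term -> list of parent terms.
    Returns GENE(N) for every node N: term -> set of generally annotated genes."""
    general = defaultdict(set)
    for gene, terms in direct_annotations.items():
        stack, seen = list(terms), set()
        while stack:
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            general[term].add(gene)
            stack.extend(go_parents.get(term, []))  # climb toward the ROOT node
    return general
```

In this scheme a gene directly annotated to a deep term is also counted in GENE(N) for every ancestor N of that term, up to the root of the ontology.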

Framework of learnability-based further prediction

Constructing PN-specific classifiers at candidate classification spaces

We first annotate known genes directly to GO nodes according to the LocusLink database [20] and generalize these direct annotations into general annotations. The set of genes that are generally annotated to a node N is denoted GENE(N).

If a node has more than one child node and is annotated with more than five genes, it is termed a candidate parent node, and the classification space rooted at that candidate parent node is a candidate classification space. A PN-specific classifier is constructed at each candidate classification space. As the task is to assign genes from parent nodes to child nodes correctly, only genes that have been annotated to child nodes are useful for learning the classifier. We'll explain the methodological details with the example classification space illustrated in Fig. 5. The training examples of the classification problem at node P make up the set

TE(P) = \bigcup_{i=1}^{q} GENE(C_i),

where q is the number of child nodes of P and C_i represents the ith child node.
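A compact sketch of how candidate classification spaces and their training sets could be assembled is given below; candidate_spaces is a hypothetical helper built on the GENE(N) sets from the generalization step, not code from the paper.

```python
# Select candidate parent nodes and assemble TE(P) (illustrative sketch).
def candidate_spaces(general, go_children, min_children=2, min_genes=5):
    """general: GO term -> GENE(N); go_children: GO term -> list of child terms."""
    spaces = {}
    for parent, children in go_children.items():
        # Candidate parent node: at least two child nodes and more than five
        # generally annotated genes.
        if len(children) >= min_children and len(general.get(parent, ())) > min_genes:
            te = set().union(*(general.get(c, set()) for c in children))
            spaces[parent] = {"children": children, "TE": te}
    return spaces
```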

According to our statistics, 70% of parent nodes in the BP ontology are linked with more than two child nodes, and 37.9% of the genes in the serum dataset were found annotated to some sibling nodes. So our classification task is a multiclass, multilabel problem. The feedforward artificial neural network [21] is chosen as the learning algorithm, as it is well suited to this kind of classification task. For the classification task formalized in Fig. 5, we let the neural network have an output layer of q neurons, each one standing for a different child node. The neurons in the input layer correspond to the experimental conditions of the gene expression dataset, and their number is equal to the number of experimental conditions, say p. We included only one hidden layer in the numerical experiment, and the number of neurons in this layer was empirically set to the maximum integer less than 2√p. For a training example, a gene g, the input vector is Xg = (xg1, xg2, ..., xgp), and the true output vector is Og = (og1, og2, ..., ogq), where ogi = 1 signifies that g is annotated to Ci and ogi = 0 signifies that g is not annotated to Ci. Xg is input into the neural network, and a predicted output vector Pg = (pg1, pg2, ..., pgq) is output and compared with the true output vector Og. By adjusting the weight matrix, the neural network learns the training example and corrects the prediction error. All training examples are processed sequentially, over and over, until finally a trained neural network is output. Note that the elements of the predicted output vectors originally range in [0,1], but each is converted to the closer value in {0,1}.
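The sketch below illustrates such a one-hidden-layer multilabel network with scikit-learn's MLPClassifier, which likewise thresholds its sigmoid outputs at 0.5; it is our approximation of the setup described above, not the authors' implementation, and X, Y, and the example sizes are synthetic.

```python
# One-hidden-layer feedforward network for the multilabel task (sketch).
import math
import numpy as np
from sklearn.neural_network import MLPClassifier

p, q = 13, 3                            # experimental conditions, child nodes (example sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, p))            # expression profiles of the genes in TE(P)
Y = rng.integers(0, 2, size=(40, q))    # binary child-node membership (multilabel targets)

hidden = max(1, math.floor(2 * math.sqrt(p)))   # "maximum integer less than 2*sqrt(p)" (floor used here)
clf = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=0)
clf.fit(X, Y)                           # Y is a multilabel indicator matrix
P_pred = clf.predict(X)                 # predicted output vectors, rounded to {0,1}
```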

Selecting learnable classification spaces according to the averaged Tanimoto index

As we mentioned in the introduction, we believe that it is unrealistic to predict gene functions accurately at all parent nodes in GO. We select only the more accurate classifiers to perform the functional further prediction. To obtain an objective evaluation of the classifiers at different parent nodes, we adopted the fivefold cross-validation schema. For example, for the classification task in Fig. 5, we partition TE(P) into five subsets of about the same size. We then repeat the training and testing procedures five times, each time setting a different subset aside from the training set and using it as the testing examples. In this way, every gene is predicted exactly once. The Tanimoto index [22], originally proposed as a similarity measure between binary vectors, is used to compare the predicted output vector Pg with the true output vector Og, and hence to measure the prediction accuracy for one gene.

Given a gene g that has a true output vector Og = (og1, og2, ..., ogq) and a predicted output vector Pg = (pg1, pg2, ..., pgq), the Tanimoto index is computed as follows:

TI(g) = \frac{\sum_{i=1}^{q} o_{gi}\, p_{gi}}{\sum_{i=1}^{q} o_{gi}\, p_{gi} + \sum_{i=1}^{q} (1 - o_{gi})\, p_{gi} + \sum_{i=1}^{q} o_{gi}\, (1 - p_{gi})}

The Tanimoto index differs from other binary similarity measures in that it takes both the true 1's and the predicted 1's into the denominator. If only the true 1's were considered, the index would reflect only how many annotated functions are recalled in the predictions; if only the predicted 1's were considered, the index would reflect only how many of the predicted functions are accurate. Only when both the true 1's and the predicted 1's are considered in the denominator can the index reflect both the recall and the precision of the predictions.

Through averaging the Tanimoto indices over all genes in TE(P), we obtain an overall prediction performance measurement, the averaged Tanimoto index, for the classifier at the current node P:

ATI(P) = \frac{\sum_{g \in TE(P)} TI(g)}{\lVert TE(P) \rVert}
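The two quantities can be computed directly from the binary output vectors; the helpers below are an illustrative sketch (o, p, O_true, and P_pred are hypothetical numpy arrays), not the authors' code.

```python
# Tanimoto index of one gene and its average over TE(P) (illustrative sketch).
import numpy as np

def tanimoto_index(o, p):
    """o, p: binary vectors of length q (true and predicted child-node labels)."""
    both = np.sum(o * p)
    denom = both + np.sum((1 - o) * p) + np.sum(o * (1 - p))
    return 1.0 if denom == 0 else both / denom   # guard against two all-zero vectors

def averaged_tanimoto_index(O_true, P_pred):
    """O_true, P_pred: arrays of shape (number of genes in TE(P), q)."""
    return float(np.mean([tanimoto_index(o, p) for o, p in zip(O_true, P_pred)]))
```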

In the numerical experiment we empirically set the threshold of the averaged Tanimoto index at 0.7. Only classification spaces with ATI values above 0.7 were selected as learnable classification spaces and were employed to predict further the child functions for genes that were annotated directly to the parent functions.
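Putting the cross-validation and the threshold together, a classification space could be scored as sketched below; this reuses the averaged_tanimoto_index helper from the previous sketch and accepts any multilabel classifier (for example the MLPClassifier configured earlier), so it is again an illustration rather than the published code.

```python
# Fivefold cross-validated ATI and the learnability decision (sketch).
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def evaluate_space(clf, X, Y, threshold=0.7, n_splits=5, seed=0):
    P_pred = np.zeros_like(Y)
    folds = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed).split(X):
        fold_clf = clone(clf).fit(X[train_idx], Y[train_idx])
        P_pred[test_idx] = fold_clf.predict(X[test_idx])   # every gene predicted once
        folds.append(fold_clf)
    ati = averaged_tanimoto_index(Y, P_pred)               # helper defined above
    return ati, ati >= threshold, folds                    # learnable if the ATI passes the threshold
```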

Further prediction of genes annotated directly to parent nodes in learnable classification spaces

If node P is trained and selected as a learnable classification space, then we perform the gene functional further prediction at node P. For a gene g′ ∈ GENE(P) − TE(P), we predict its function(s) with the five neural networks generated in the fivefold cross-validation schema and obtain five predicted output vectors, termed voter output vectors. Following the ensemble idea [23], we examine each child node individually and accept the majority value among the five voter output vectors. This yields our final decision on the functions of gene g′, in the form of an ensemble output vector.
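The majority vote can be sketched as below, where fold_classifiers is the list of five networks returned by the cross-validation sketch and x_gene is the expression profile of g′; again an illustration, not the authors' code.

```python
# Majority-vote ensemble over the five cross-validation networks (sketch).
import numpy as np

def ensemble_predict(fold_classifiers, x_gene):
    """Return the ensemble output vector for one gene in GENE(P) - TE(P)."""
    votes = np.vstack([clf.predict(x_gene.reshape(1, -1))[0] for clf in fold_classifiers])
    majority = (len(fold_classifiers) + 1) // 2            # 3 out of 5
    return (votes.sum(axis=0) >= majority).astype(int)     # per child node
```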

Measuring the degree of enrichment of differentially expressed genes in functional classes

We measured the degree of enrichment of differentially expressed genes in a class with the algorithm described in Onto-Express [17]. We used the 517 cDNAs in the small serum dataset as the study set of differentially expressed genes and took the 9706 cDNAs in the large dataset as the set of reference genes. With the study set and reference set as inputs, we computed the p value of a functional class, which quantifies the degree of enrichment of differentially expressed genes in the class in question.
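Onto-Express implements its own calculation; as a rough stand-in of the same flavor, a hypergeometric tail probability could be computed as sketched below (the function and the example counts are hypothetical, not taken from the paper or from Onto-Express).

```python
# Generic hypergeometric enrichment p value (illustrative stand-in only).
from scipy.stats import hypergeom

def enrichment_p(k_class_in_study, n_study, k_class_in_reference, n_reference):
    """Probability of observing at least k_class_in_study genes of the class
    among the study set, drawn from the reference set."""
    return hypergeom.sf(k_class_in_study - 1, n_reference,
                        k_class_in_reference, n_study)

# e.g. enrichment_p(12, 517, 40, 9706) -> chance of seeing 12 or more genes of a
# 40-gene class among the 517 differentially expressed genes (numbers made up).
```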

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grants 30170515, 30370388, and 30370798), the Chinese 863 Program (Grants 2003AA2Z2051 and 2002AA2Z2052), and the 211 Project, the Tenth "Five-Year" Plan, Harbin Medical University.

References

[1] R.D. King, A. Karwath, A. Clare, L. Dehaspe, The utility of different representations of protein sequences for predicting functional class, Bioinformatics 17 (2001) 445–454.
[2] S. Letovsky, S. Kasif, Predicting protein function from protein/protein interaction data: a probabilistic approach, Bioinformatics 19 (2003) i197–i204.
[3] C. Brun, et al., Functional classification of proteins for the prediction of cellular function from a protein–protein interaction network, Genome Biol. 5 (2004) R6.
[4] A. Lagreid, T.R. Hvidsten, H. Midelfart, J. Komorowski, A.K. Sandvik, Predicting gene ontology biological process from temporal gene expression patterns, Genome Res. 13 (2003) 965–979.
[5] T.R. Hvidsten, J. Komorowski, Predicting gene function from gene expressions and ontologies, Pacific Symposium on Biocomputing, Hawaii, 2001.
[6] A. Mateos, et al., Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons, Genome Res. 12 (2002) 1703–1715.
[7] G. Ramsay, DNA chips: state-of-the art, Nat. Biotechnol. 16 (1998) 40–44.
[8] M. Brown, et al., Knowledge-based analysis of microarray gene expression data using support vector machines, Proc. Natl. Acad. Sci. USA 97 (2000) 262–267.
[9] M. Kuramochi, G. Karypis, Gene classification using expression profiles: a feasibility study, 2nd IEEE International Symposium on Bioinformatics and Bioengineering, Bethesda, MD, 2001.
[10] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[11] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998) 121–167.
[12] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[13] The Gene Ontology Consortium, Gene Ontology: tool for the unification of biology, Nat. Genet. 25 (2000) 25–29.
[14] H.W. Mewes, et al., MIPS: a database for genomes and protein sequences, Nucleic Acids Res. 28 (2000) 37–40.
[15] P. Pavlidis, D.P. Lewis, W.S. Noble, Exploring gene expression data with class scores, Pacific Symposium on Biocomputing, 2002.
[16] V. Iyer, et al., The transcriptional program in the response of human fibroblasts to serum, Science 283 (1999) 83–87.
[17] S. Draghici, P. Khatri, Global functional profiling of gene expression data, in: D.P. Berrar, M. Granzow (Eds.), A Practical Approach to Microarray Data Analysis, Kluwer Academic, Dordrecht, 2003, pp. 98–104.
[18] T.J. Mariani, et al., A variable fold change threshold determines significance for expression microarrays, FASEB J. 17 (2003) 321–323.
[19] G.K. Smyth, Y.H. Yang, T. Speed, Statistical issues in cDNA microarray data analysis, Methods Mol. Biol. 224 (2003) 111–136.
[20] D.L. Wheeler, et al., Database resources of the National Center for Biotechnology, Nucleic Acids Res. 31 (2003) 28–33.
[21] P.D. Wasserman, Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York, 1989.
[22] P. Willett, Similarity and Clustering in Chemical Information Systems, Research Studies Press, Letchworth, UK, 1987.
[23] T.G. Dietterich, Ensemble methods in machine learning, in: F. Roli (Ed.), International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Springer-Verlag, New York, 2000, pp. 1–15.