Genomics 84 (2004) 922–928
Learnability-based further prediction of gene functions in Gene Ontology
Kang Tua,*, Hui Yua, Zheng Guoa,b,c, Xia Lia,c
aDepartment of Bioinformatics, Harbin Medical University, Harbin 150086, People's Republic of China
bSchool of Biology Science and Technology, Tongji University, Shanghai 200092, People's Republic of China
cDepartment of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, PeopleTs Republic of China
Received 5 June 2004; accepted 2 August 2004
Available online 17 September 2004
Abstract
Currently the functional annotations of many genes are not specific enough, limiting their further application in biology and medicine. It is
necessary to push gene functional annotations deeper in Gene Ontology (GO), that is, to predict more specific
GO terms for annotated genes. A framework of learnability-based further prediction of gene functions in GO is proposed in this paper. Local classifiers are
constructed in local classification spaces rooted at qualified parent nodes in GO, and their classification performances are evaluated with the
averaged Tanimoto index (ATI). Classification spaces with higher ATIs are selected, and genes annotated only to the parent classes are
predicted to child classes. Through learnability-based further prediction, the functional annotations of known genes are made more
specific. Experiments on the fibroblast serum response dataset reported further functional predictions for several human genes and also gave
interesting clues to the varied learnability between classes of different GO ontologies, different levels, and different numbers of child classes.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Gene expression data; Functional further prediction; Gene Ontology; Neural network; Tanimoto index
In the postgenome era, computational biologists are
exploiting various data sources such as gene or protein
sequences [1], protein–protein interaction relationships
[2,3], and gene expression data [4–6] to help characterize
unknown genes. Among these diverse data sources, gene
expression data [7] are of special interest because they allow
parallel quantification of thousands of genes in different
biological settings. Since the first trial of Brown et al. [8],
Hvidsten et al. [5], Kuramochi et al. [9], and Mateos et al.
[6] used such supervised learning methods as rough set
theory [10], support vector machine [11], and k-nearest
neighbors [12] to predict gene functions from gene
expression profiles, and they obtained encouraging results.
To predict gene functions from gene expression profiles,
one must have a gene functional categorization system that
provides the functional class labels of known genes. Gene
Ontology (GO) [13] is such a gene functional categorization
doi:10.1016/j.ygeno.2004.08.005
* Corresponding author.
E-mail address: [email protected] (K. Tu).
system that has been utilized in many applications of
expression-data-based gene functional prediction. GO pro-
vides a valuable source of structured knowledge of gene
function (more accurately, gene product function) in terms
of "molecular function" (MF), "biological process" (BP),
and "cellular component" (CC). The gene function categorizations
are represented as nodes (also called terms or
classes) in a directed acyclic graph in which one node may
have more than one parent node. There are two relationships
between child nodes and parent nodes in GO: "is-a," where the child node is an instance of its parent (function), and
"part-of," where the child node is a part of its parent
(component). With existing gene function knowledge,
known genes can be annotated to the GO class(es)
representing their most specific function(s).
With the scarce and fragmented status of present knowledge
of gene functionality, however, we generally have
rather abstract gene annotations in GO, since most known
genes are annotated to nodes of high levels in GO [4]. It is
necessary to push the gene functional annotations deeper in
GO, that is, to predict more specific
Fig. 1. The distribution of ATI values of the 44 candidate classification
spaces.
GO terms. The first aim of this work is to further predict
the functions of known genes from parent classes to child
classes in GO. Specifically, we regard genes that are
annotated to a parent node but not to their child nodes as
unknown genes, and we train a classification model to
classify them into one or more child nodes. Thus we
encounter separate classification tasks at separate parent
nodes (PNs), and so different PN-specific classifiers are
trained at different parent nodes to distinguish among their
child nodes (CNs). For conciseness, we refer to the set of a
PN and its CNs as a classification space.
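For concreteness, the DAG bookkeeping behind a classification space can be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the child-to-parent map covers only a miniature GO fragment:

```python
# Hypothetical sketch: a GO fragment as a child -> parents map, and the
# "classification space" rooted at a parent node (the PN plus its CNs).
PARENTS = {
    "GO:0016125": ["GO:0008202"],   # sterol metabolism  is-a  steroid metabolism
    "GO:0006694": ["GO:0008202"],   # steroid biosynthesis  is-a  steroid metabolism
    "GO:0008202": ["GO:0008152"],   # steroid metabolism  is-a  metabolism
}

def children(parent_node):
    """Return the child nodes of `parent_node` in the DAG."""
    return [c for c, ps in PARENTS.items() if parent_node in ps]

def classification_space(parent_node):
    """The set of a PN and its CNs, as defined in the text."""
    return {parent_node, *children(parent_node)}

print(classification_space("GO:0008202"))
```

Because GO is a directed acyclic graph rather than a tree, a node may appear in more than one classification space through its multiple parents.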
There are a large number of classes (nodes) in a gene
functional categorization system, but not all of these classes
can be well learned by the learning model, given one
expression dataset with a finite number of experimental conditions.
Brown et al. [8] pointed out that many classes in MYGD
[14] would be unlearnable from expression data by any
classifier, and they selected for learning only five classes
that were expected to exhibit similar expression profiles.
Among the enormous number of classes in GO, some are expression-irrelevant
and cannot be learned from any expression
data. Other classes exhibit experiment-dependent learnability:
they may be well learned from expression data
obtained under relevant experimental conditions,
but yield poor prediction performance on
expression data from irrelevant experimental conditions.
Under irrelevant experimental conditions,
functional classes are not influenced and therefore do not
exhibit sufficient expression similarity to be tracked by any
classifier. Kuramochi et al. [9] noted this point and
suggested including more diverse expression data to allow
more classes to be learned. Given one gene expression dataset
under particular experimental conditions, we suggest that the
more learnable classes be selected out of the enormous GO
classes, with the expression-irrelevant and experiment-irrelevant
classes excluded from the prediction study.
We adopted the learnability-based class scoring idea from
Pavlidis et al. [15] to evaluate and select functional classes.
The ideas of further prediction and learnability-based
selection of classification spaces are combined in this paper
to yield a novel gene function prediction framework,
the learnability-based further prediction, as we term it. In
the first step, we set up the classification problem and
prepare the training examples at every candidate classi-
fication space, build up candidate classifiers under a
fivefold cross-validation schema, and then evaluate the
prediction precision with ATI, the averaged Tanimoto index.
In the second step, we select the learnable classification
spaces where the ATI values go beyond a threshold and then
perform the gene functional further prediction tasks at these
learnable classification spaces. The experiments on the
fibroblast serum response dataset reported further functional
predictions for several human genes and also gave interest-
ing clues to the varied learnability between classes of
different GO ontologies, different levels, and different
numbers of child classes.
Results and discussion
The known genes in the small serum dataset were first
annotated directly to nodes in the "biological process" ontology, and then the direct annotations were generalized
to ancestor nodes of the same branches. In this way we obtained
160 nodes that were each annotated with at least one gene.
We required that for a candidate parent node to be classified
it must have at least two child nodes and more than five
generally annotated genes, which restricted the number of
candidate parent nodes to 44. We then constructed PN-
specific classifiers separately at the 44 candidate classifica-
tion spaces. All 44 classification spaces involved 131
classes (including PNs and CNs), and we term them the
candidate classes.
The distribution of ATI values of the 44 candidate
classification spaces is shown in Fig. 1. From Fig. 1, we can
see that the classification performances (measured by the
ATI value) of the candidate classification spaces vary, and
more than half are unsatisfactory (ATI < 0.6). With the
threshold of ATI set at 0.7, 14 candidate classification
spaces passed this criterion and were regarded as the learnable
classification spaces. All 14 learnable classification spaces
involved 45 classes (including PNs and CNs), and they are
termed the learnable classes.
The 14 learnable parent nodes are shown in Table 1. As
discussed by Iyer et al. [16], the fibroblast cells are
programmed to interpret the abrupt exposure to serum not
only as a general mitogenic stimulus but also as a specific
physiological signal, signifying a wound. Among the 14
learnable parent nodes, "M phase," "transcription," "transcription, DNA-dependent," and "biosynthesis" are
related to the proliferative program; "response to biotic
stimulus" and "homeostasis" are related to the wound repair
program. "Steroid metabolism" appears in Table 1 because
the cholesterol biosynthetic pathway is downregulated in a
coordinated fashion in the face of the low-density lipoproteins
provided by serum.
Fig. 2. The scatter plot of ATI values vs Onto-Express p values of the 44
candidate classification spaces.
Table 1
The learnable parent nodes and their classification performances

Learnable parent node                          ATI    TPR    TNR
GO:0008150, biological process                 0.71   0.84   0.84
GO:0050875, cellular physiological process     0.80   0.89   0.85
GO:0030003, cation homeostasis                 0.95   0.95   0.95
GO:0000279, M phase                            0.92   0.96   0.94
GO:0046907, intracellular transport            0.80   0.75   0.72
GO:0009653, morphogenesis                      0.76   0.73   0.70
GO:0042592, homeostasis                        0.93   0.93   0.87
GO:0009058, biosynthesis                       0.72   0.73   0.55
GO:0008202, steroid metabolism                 0.95   0.95   0.95
GO:0006350, transcription                      0.98   0.99   0.99
GO:0006351, transcription, DNA-dependent       0.73   0.86   0.84
GO:0006464, protein modification               0.79   0.77   0.75
GO:0007600, sensory perception                 0.90   0.95   0.90
GO:0009607, response to biotic stimulus        0.79   0.90   0.81

ATI, the averaged Tanimoto index; TPR, the true positive rate; TNR, the
true negative rate.
Table 2
The cross-table of the 44 candidate parent nodes considering their ATI
values and Onto-Express p values

                                            Onto-Express p value
                                            < 0.05     > 0.05
Learnable parent classes (ATI >= 0.7)          9          5
Nonlearnable parent classes (ATI < 0.7)        6         24
The learnable classification spaces are enriched with the
differentially expressed genes
Under the hypothesis that the classes involved in the
learnable classification spaces are relevant to the experimental
conditions, it follows that most of the genes
annotated in these learnable classification spaces are induced
or repressed by the experimental conditions. We tested the
hypothesis with the algorithm described in Onto-Express
[17]. The p value of a functional class returned by the Onto-
Express algorithm tells how much the class is enriched with
differentially expressed genes. The smaller the p value, the
more the class is enriched with differentially expressed genes,
and therefore the more experiment-relevant the class is.
Fig. 2 illustrates the relationship between the classification
performance (measured by ATI values) and the experiment
relevance (measured by Onto-Express p values) of the
44 candidate parent classes. Though the trend is not very
pronounced, classes with lower p values tend to exhibit
higher ATI values, which suggests that higher learnability is
associated with higher experiment relevance.
Setting the threshold of the Onto-Express p value at 0.05,
9 of the 14 learnable parent classes and 6 of the remaining
30 nonlearnable parent classes passed the criterion (Table 2).
The classes with Onto-Express p values less than 0.05 are
termed experiment-relevant classes for conciseness. We
tested with Fisher's exact test whether the experiment-
relevant classes were uniformly distributed between the non-
learnable parent classes and the learnable parent classes and
obtained a p value of 0.004. This small p value suggests that
the distribution of experiment-relevant classes was not
uniform in the two sets of classes and that there was a
significantly enhanced likelihood that the differentially
expressed genes were enriched in the learnable parent
classes.
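The test above can be reproduced from the Table 2 counts. The paper does not say which implementation was used; the following is a standard-library re-implementation of the two-sided Fisher's exact test, shown here only to make the calculation concrete:

```python
# Fisher's exact test on a 2x2 table, using only the standard library.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def pmf(k):  # hypergeometric probability of k in the top-left cell
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = pmf(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))

# Learnable: 9 experiment-relevant, 5 not; nonlearnable: 6 relevant, 24 not.
p = fisher_exact_2x2(9, 5, 6, 24)
print(round(p, 3))
```

Exact two-sided p values depend on the tie-handling convention, so a library implementation may differ slightly in the last digits; the qualitative conclusion (p well below 0.05) is unaffected.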
The relationship between the classification performance and
the number of child nodes
We studied the relationship between the classification
performance (measured by ATI) and the number of child
nodes in the candidate classification spaces (Fig. 3). The
trend is clear that ATI decreases with increasing number of
child nodes. This is easy to understand: as the number of
child nodes increases, there are more classes involved in a
classification task, and the classification model needs to
distinguish more subtle subclass patterns. In addition, single
genes are more likely to be labeled with more than one child
class label. In summary, with more child nodes involved in
the classification space, the multiclass and multilabel
problems become more severe,
leading to decreased classification performance.
The relationship between the classification performance and
the level of classification spaces
We also studied the relationship between the classifica-
tion performance and the level of parent nodes in the
candidate classification space (Fig. 4). From Fig. 4 we can
see that in the range from level 2 to level 6, ATI increases
with the increasing level of the classification space. This may be explained
by the fact that deeper functional classes in GO are more
homogeneous and are therefore easier for classification
models to learn. The ATI drops, however, after level 6,
Fig. 3. The relationship between the classification performance and the
number of child nodes.
Fig. 4. The relationship between the classification performance and the
level of classification spaces. The level of a classification space is defined
as the level of its parent node, beginning from the level-1 node
"GO:0008150, biological process."
probably due to the shrinking sample size at deep
functional classes.
Fewer learnable classification spaces discovered in MF and
CC ontologies
In addition to the biological process ontology, we also
carried out studies in the MF and CC ontologies. To our
surprise, we found only three learnable classification spaces
in the MF ontology and one in the CC ontology (ATI higher
than 0.7). Ashburner et al. [13] emphasized that the BP
functional classes exhibit a higher degree of expression
similarity than those of the other two ontologies; Hvidsten et al. [5]
claimed that BP was the only aspect of gene function in
which one may expect a significant correlation between
annotations and temporal gene expression profiles. Their
claims and findings suggested that the classification
performances of BP classes may be higher than those of
the MF and CC classes. Our results once again validated the
significantly higher correlation between the biological
process categorization and the gene expression similarity.
Gene functional further prediction
With the threshold of ATI value set at 0.7, we obtained 14
learnable classification spaces with relatively higher classi-
fication performances. We then performed the functional
further predictions on genes that were annotated directly to
these learnable parent nodes. Only 6 learnable parent nodes
were found with directly annotated genes, and finally a total
of 12 genes were predicted to 9 child nodes, resulting in 17
records in Table 3. The fact that there are only a few genes
involved in the functional further prediction can be explained
in two ways: first, only a few functional classes were perturbed
by the particular experimental settings, leading to a limited
number of learnable classification spaces; second, the
fibroblast serum response dataset was rather small, contain-
ing only 517 cDNA sequences in all. Nevertheless, the few
further prediction findings give hints of the more specific
biological roles of the involved genes and could guide
further biological experiments to validate these predictions.
The same framework can be applied to other gene expression
datasets, and gene functional further predictions involving
other genes may be obtained.
Materials and methods
Annotation source
Results were based on the GO database released on May
1, 2004, the UniGene database released on December 20,
2003, and the LocusLink database released on May 12, 2003.
Data
We used two datasets provided by Iyer et al. [16] that
describe the transcript levels of genes during the first 24 h of
the serum response in serum-starved human fibroblasts. The
large dataset consists of 9706 cDNAs, and 517 of them
showing extreme expression regulation or large expression
deviation make up the small serum dataset. Roughly speak-
ing, the 517 cDNAs can be regarded as differentially
expressed genes [18,19]. We implemented our learnability-
based further prediction on the small serum dataset consist-
ing of 517 differentially expressed genes. The 9706 cDNAs
were utilized in the Onto-Express analysis (see below).
Gene annotation
The 517 cDNAs were mapped to 426 unique genes
according to UniGene and LocusLink. Of the 426 genes, 277
Fig. 5. A classification space example.
Table 3
The functional further prediction results for genes annotated directly at the learnable parent nodes
LocusID Symbol Directly annotated term Further predicted term Prediction stepsa
212 ALAS2 GO:0009058, biosynthesis GO:0009059, macromolecule biosynthesis 1
2149 F2R GO:0009653, morphogenesis GO:0000902, cellular morphogenesis 1
2149 F2R GO:0009653, morphogenesis GO:0009887, organogenesis 1
2222 FDFT1 GO:0009058, biosynthesis GO:0046138, coenzyme and prosthetic group biosynthesis 1
3156 HMGCR GO:0009058, biosynthesis GO:0046138, coenzyme and prosthetic group biosynthesis 1
3164 NR4A1 GO:0006351, transcription, DNA-dependent GO:0006366, transcription from Pol II promoter 2
3207 HOXA11 GO:0009653, morphogenesis GO:0000902, cellular morphogenesis 1
3207 HOXA11 GO:0009653, morphogenesis GO:0009887, organogenesis 1
4281 MID1 GO:0009653, morphogenesis GO:0000902, cellular morphogenesis 1
4281 MID1 GO:0009653, morphogenesis GO:0009887, organogenesis 1
5352 PLOD2 GO:0006464, protein modification GO:0006470, protein amino acid dephosphorylation 1
8870 IER3 GO:0009653, morphogenesis GO:0009887, organogenesis 1
10559 SLC35A1 GO:0006464, protein modification GO:0006470, protein amino acid dephosphorylation 1
10559 SLC35A1 GO:0006464, protein modification GO:0006468, protein amino acid phosphorylation 1
25830 SULT4A1 GO:0008202, steroid metabolism GO:0016125, sterol metabolism 1
25830 SULT4A1 GO:0008202, steroid metabolism GO:0006694, steroid biosynthesis 1
114876 OSBPL1A GO:0008202, steroid metabolism GO:0006694, steroid biosynthesis 1
a When a gene is predicted from its directly annotated GO term to a one-level-lower child term, the prediction step is 1. If the further predicted child term is
also a learnable parent node, the gene is again further predicted to a one-level-lower child term, which is a grandchild term of the original parent term. In this
case, there are 2 prediction steps.
could be annotated to at least one class in GO. As implied by
the ontology structure, a gene annotated to a class is also
involved (annotated) in the ancestor classes of the same
branch from the annotated node to the ROOT node. To
distinguish these two types of annotation, we refer to the
annotation provided by LocusLink as the "direct annotation" and the annotation later generalized according to the GO
structure as the "general annotation." The goal of this work
is to select learnable classification spaces in GO and further predict
the known genes from their directly annotated classes
to deeper classes in these learnable classification spaces.
Framework of learnability-based further prediction
Constructing PN-specific classifiers at candidate
classification spaces
We first annotate known genes directly to GO nodes
according to the LocusLink database [20] and generalize
these direct annotations into general annotations. The set of
genes that are generally annotated to a node N is denoted as
GENE(N).
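The generalization from direct to general annotations can be sketched as propagating each gene up to all ancestors of its annotated node. The miniature DAG and annotations below are hypothetical, for illustration only:

```python
# Sketch of turning direct annotations into general annotations by
# propagating each gene to all ancestors of its directly annotated node.
PARENTS = {"C1": ["P"], "C2": ["P"], "P": ["ROOT"], "ROOT": []}
DIRECT = {"geneA": ["C1"], "geneB": ["C2"], "geneC": ["P"]}

def ancestors(node):
    """All ancestors of `node`, up to and including ROOT."""
    out = set()
    stack = list(PARENTS[node])
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(PARENTS[n])
    return out

def gene(node):
    """GENE(N): the set of genes generally annotated to node N."""
    return {g for g, nodes in DIRECT.items()
            for d in nodes if node == d or node in ancestors(d)}

print(gene("P"))  # geneA and geneB propagate up; geneC is direct
```

The stack-based traversal handles the DAG case where a node has multiple parents, visiting each ancestor once.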
If a node has more than one child node, and is annotated
with more than five genes, it is termed a candidate parent
node, and the corresponding classification space rooted at the
candidate parent node is a candidate classification space. A
PN-specific classifier is constructed at each candidate
classification space. As the task is to assign genes from
parent nodes to child nodes correctly, only genes that have
been annotated to child nodes are useful for learning the
classifier. We'll explain the methodological details with the
example classification space illustrated in Fig. 5. The training
examples of the classification problem at node P make up
the set $TE(P) = \bigcup_{i=1}^{q} GENE(C_i)$, where q is the number of
child nodes of P and $C_i$ represents the ith child node.
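Selecting candidate parent nodes and assembling their training sets might look like the following sketch, where GENE is a hypothetical precomputed mapping from nodes to their generally annotated genes:

```python
# Sketch: select candidate parent nodes (more than one child node, more than
# five generally annotated genes) and build TE(P) as the union of GENE(Ci)
# over the child nodes Ci.  GENE and CHILDREN are illustrative stand-ins.
GENE = {
    "P":  {"g1", "g2", "g3", "g4", "g5", "g6", "g7"},
    "C1": {"g1", "g2", "g3"},
    "C2": {"g3", "g4", "g5"},
}
CHILDREN = {"P": ["C1", "C2"], "C1": [], "C2": []}

def is_candidate_parent(node):
    return len(CHILDREN[node]) > 1 and len(GENE[node]) > 5

def training_examples(parent_node):
    """TE(P): the union of GENE(Ci) over the child nodes Ci of P."""
    te = set()
    for child in CHILDREN[parent_node]:
        te |= GENE[child]
    return te

assert is_candidate_parent("P")
print(training_examples("P"))  # g6 and g7 are annotated only to P
```

Genes such as g6 and g7 here, annotated to the parent but to none of its children, are exactly the targets of the later further-prediction step.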
According to our statistics, 70% of parent nodes in the
BP ontology are linked with more than two child nodes, and
37.9% of the genes in the serum dataset were found
annotated to some sibling nodes. So our classification task
is a multiclass, multilabel problem. The feedforward
artificial neural network [21] is chosen as the learning
algorithm, as it is well suited to this kind of classification
task. For the classification task formalized in Fig. 5, we let
the neural network have an output layer of q neurons, with
each one standing for a different child node. The neurons in
the input layer correspond to the experimental conditions of
the gene expression dataset, and the number of neurons is equal
to the number of experimental conditions, say p. We
included only one hidden layer in the numerical experiment,
and the number of neurons in this layer was empirically set
to the maximum integer less than $2\sqrt{p}$. For a training example,
a gene g, the input vector is Xg = (xg1, xg2, . . ., xgp), and the true
output vector is Og = (og1, og2, . . ., ogq), where ogi = 1 signifies
that g is annotated to Ci, and ogi = 0 signifies that g is not
annotated to Ci. Xg is input into the neural network, and a
predicted output vector Pg = ( pg1, pg2, . . ., pgq) is output and
compared with the true output vector Og = (og1, og2, . . ., ogq).
Through adjusting the weight matrix, the neural network learns the
training example and corrects the prediction error. All training
examples are processed sequentially and repeatedly until finally a
trained neural network is output. Note that the predicted output
vectors originally have elements ranging in [0,1], but these are
rounded to the closer value in {0,1}.
Selecting learnable classification spaces according to the
averaged Tanimoto index
As we mentioned in the introduction, we believe that it is
unrealistic to predict accurately gene functions at all parent
nodes in GO. We select only more accurate classifiers to
perform the functional further prediction. To obtain an
objective evaluation of the classifiers at different parent
nodes, we adopted the fivefold cross-validation schema. For
example, for the classification task in Fig. 5, we partition
TE(P) into five subsets of about the same size. Then we
repeat the training and testing procedures five times, where
each time a different subset is held out from the
training set and used as the testing examples. In this
way, every gene is predicted once. The Tanimoto index [22],
originally proposed as a similarity measure between binary
vectors, is used to compare the predicted output vector Pg
and the true output vector Og, and hence to measure the
prediction accuracy for one gene.
Given a gene g that has a true output vector Og = (og1,
og2, . . ., ogq) and a predicted output vector Pg = ( pg1, pg2,
. . ., pgq), the Tanimoto index is computed as follows:
$$TI(g) = \frac{\sum_{i=1}^{q} o_{gi}\, p_{gi}}{\sum_{i=1}^{q} o_{gi}\, p_{gi} + \sum_{i=1}^{q} (1 - o_{gi})\, p_{gi} + \sum_{i=1}^{q} o_{gi}\,(1 - p_{gi})}$$
The Tanimoto index differs from other binary similarity
measures in that it takes both the true 1's and the predicted
1's into the denominator. If only the true 1's were
considered, the index would reflect only how many
annotated functions are recalled in the predictions; if only
the predicted 1's were considered, the index would reflect
only how many of the predicted functions are accurate. Only
when both the true 1's and the predicted 1's are
considered in the denominator can the index reflect
both the recall and the precision of the predictions.
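The index transcribes directly into code; a minimal sketch for binary true/predicted output vectors:

```python
# The Tanimoto index for binary true (o) and predicted (p) output vectors:
# shared 1's divided by all positions where either vector has a 1.
def tanimoto_index(o, p):
    both = sum(oi * pi for oi, pi in zip(o, p))            # true and predicted
    false_pos = sum((1 - oi) * pi for oi, pi in zip(o, p)) # predicted only
    false_neg = sum(oi * (1 - pi) for oi, pi in zip(o, p)) # true only
    return both / (both + false_pos + false_neg)

# One shared 1, one missed 1, one spurious 1 -> TI = 1/3.
print(tanimoto_index([1, 1, 0, 0], [1, 0, 1, 0]))
```

Genes in TE(P) are annotated to at least one child node, so the denominator is never zero in this setting.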
Through averaging the Tanimoto indices over all genes
in TE(P), we obtain an overall prediction performance
measurement, the averaged Tanimoto index, for the classi-
fier at the current node P:
$$ATI(P) = \frac{\sum_{g \in TE(P)} TI(g)}{|TE(P)|}$$
In the numerical experiment we empirically set the
threshold of the averaged Tanimoto index at 0.7. Only
classification spaces with ATI values above 0.7 were
selected as learnable classification spaces and were
employed to predict further the child functions for genes
that were annotated directly to the parent functions.
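Averaging the Tanimoto index over the training set and applying the 0.7 threshold can be sketched as follows; the true and predicted vectors below are hypothetical stand-ins for cross-validated classifier outputs:

```python
# ATI(P): average the Tanimoto index over all genes in TE(P), then apply
# the empirical 0.7 threshold to decide learnability.
def tanimoto_index(o, p):
    both = sum(oi * pi for oi, pi in zip(o, p))
    fp = sum((1 - oi) * pi for oi, pi in zip(o, p))
    fn = sum(oi * (1 - pi) for oi, pi in zip(o, p))
    return both / (both + fp + fn)

def ati(true_vectors, predicted_vectors):
    """The averaged Tanimoto index over all genes in TE(P)."""
    scores = [tanimoto_index(o, p)
              for o, p in zip(true_vectors, predicted_vectors)]
    return sum(scores) / len(scores)

def is_learnable(true_vectors, predicted_vectors, threshold=0.7):
    return ati(true_vectors, predicted_vectors) >= threshold

true_v = [[1, 0], [0, 1], [1, 1]]
pred_v = [[1, 0], [0, 1], [1, 0]]        # one label missed on the third gene
print(ati(true_v, pred_v))               # (1 + 1 + 0.5) / 3
```
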
Further prediction of genes annotated directly to parent
nodes in learnable classification spaces
If node P is trained and selected as a learnable clas-
sification space, then we perform the gene functional further
prediction at node P. For a gene $g' \in GENE(P) - TE(P)$, we predict its function(s) with the five neural networks
generated in the fivefold cross-validation schema and
obtain five predicted output vectors, termed voter output
vectors. According to the ensemble idea [23], we examine
each child node individually and accept the majority value
of the five voter output vectors. This yields our final
decision on the functions of gene $g'$, in the form of an
ensemble output vector.
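The per-child-node majority vote over the five fold-specific predictions can be sketched as:

```python
# Majority vote over the voter output vectors from the cross-validation
# folds, taken independently for each child node (vector position).
def ensemble_vote(voter_vectors):
    """Per-position majority over an odd number of binary vectors."""
    n = len(voter_vectors)
    return [int(sum(bits) * 2 > n) for bits in zip(*voter_vectors)]

# Five voters, two child nodes: four of five predict the first child node.
voters = [[1, 0], [1, 1], [0, 0], [1, 0], [1, 0]]
print(ensemble_vote(voters))  # -> [1, 0]
```

With an odd number of voters, as in the fivefold schema here, ties cannot occur.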
Measuring the degree of enrichment of differentially
expressed genes in functional classes
We measured the degree of enrichment of differentially
expressed genes in a class with the algorithm described in
Onto-Express [17]. We used the 517 cDNAs in the small
serum dataset as the study set of differentially expressed
genes and took the 9706 cDNAs in the large dataset as the
set of reference genes. With the study set and reference set
as inputs, we computed the p value of a functional class,
which quantifies the degree of enrichment of differentially
expressed genes in the class in question.
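A hypergeometric upper-tail probability is the usual way to quantify this kind of enrichment; the exact algorithm in Onto-Express [17] may differ, so the following is only a sketch of the general idea, with toy counts rather than the paper's:

```python
# Hypergeometric enrichment p value: the probability of seeing at least k
# study-set genes in a class, given the reference-set counts.
from math import comb

def enrichment_p(n_ref, n_class, n_study, k):
    """P(X >= k) with X ~ Hypergeom(N=n_ref, K=n_class, n=n_study)."""
    hi = min(n_class, n_study)
    return sum(comb(n_class, i) * comb(n_ref - n_class, n_study - i)
               for i in range(k, hi + 1)) / comb(n_ref, n_study)

# Toy numbers (not the paper's): 6 of 20 study genes fall in a class
# that holds 10 of 200 reference genes -- a strong enrichment.
print(enrichment_p(n_ref=200, n_class=10, n_study=20, k=6))
```

In the paper's setting the 517 small-dataset cDNAs play the role of the study set and the 9706 large-dataset cDNAs the reference set.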
Acknowledgments
This work was supported in part by the National
Natural Science Foundation of China (Grants 30170515,
30370388, and 30370798), the Chinese 863 Program
(Grants 2003AA2Z2051 and 2002AA2Z2052), and the
211 Project, the Tenth "Five-Year" Plan, Harbin Medical
University.
References
[1] R.D. King, A. Karwath, A. Clare, L. Dehaspe, The utility of different
representations of protein sequences for predicting functional class,
Bioinformatics 17 (2001) 445–454.
[2] S. Letovsky, S. Kasif, Predicting protein function from protein/protein
interaction data: a probabilistic approach, Bioinformatics 19 (2003)
i197–i204.
[3] C. Brun, et al., Functional classification of proteins for the prediction
of cellular function from a protein–protein interaction network,
Genome Biol. 5 (2004) R6.
[4] A. Lagreid, T.R. Hvidsten, H. Midelfart, J. Komorowski, A.K.
Sandvik, Predicting gene ontology biological process from temporal
gene expression patterns, Genome Res. 13 (2003) 965–979.
[5] T.R. Hvidsten, J. Komorowski, Predicting gene function from gene
expressions and ontologies, Pacific Symposium on Biocomputing,
2001, Hawaii.
[6] A. Mateos, et al., Systematic learning of gene functional classes from
DNA array expression data by using multilayer perceptrons, Genome
Res. 12 (2002) 1703–1715.
[7] G. Ramsay, DNA chips: state-of-the-art, Nat. Biotechnol. 16 (1998)
40–44.
[8] M. Brown, et al., Knowledge-based analysis of microarray gene
expression data using support vector machines, Proc. Natl. Acad. Sci.
USA 97 (2000) 262–267.
[9] M. Kuramochi, G. Karypis, Gene classification using expression
profiles: a feasibility study, 2nd IEEE International Symposium on
Bioinformatics and Bioengineering, Bethesda, MD, 2001.
[10] Z. Pawlak, Rough sets, Int. J. Comput. Inf. Sci. 11 (1982) 341–356.
[11] C.J.C. Burges, A tutorial on support vector machines for pattern
recognition, Data Mining Knowl. Discovery 2 (1998) 121–167.
[12] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[13] G.O. Consortium, Gene Ontology: tool for the unification of biology,
Nat. Genet. 25 (2000) 25–29.
[14] H.W. Mewes, et al., MIPS: a database for genomes and protein
sequences, Nucleic Acids Res. 28 (2000) 37–40.
[15] P. Pavlidis, D.P. Lewis, W.S. Noble, Exploring gene expression data
with class scores, Pacific Symposium on Biocomputing, 2002.
[16] V. Iyer, et al., The transcriptional program in the response of human
fibroblasts to serum, Science 283 (1999) 17–18.
[17] S. Draghici, P. Khatri, Global functional profiling of gene
expression data, in: D.P. Berar, M. Granzow (Eds.), A Practical
Approach to Microarray Data Analysis, Kluwer Academic, Dor-
drecht, 2003, pp. 98–104.
[18] T.J. Mariani, et al., A variable fold change threshold determines
significance for expression microarrays, FASEB J. 17 (2003)
321–323.
[19] G.K. Smyth, Y.H. Yang, T. Speed, Statistical issues in cDNA
microarray data analysis, Methods Mol. Biol. 224 (2003) 111–136.
[20] D.L. Wheeler, et al., Database resources of the National Center for
Biotechnology, Nucleic Acids Res. 31 (2003) 28–33.
[21] P.D. Wasserman, Neural Computing: Theory and Practice, Van
Nostrand–Reinhold, New York, 1989.
[22] P. Willett, Similarity and Clustering in Chemical Information Systems,
Research Studies Press, Letchworth, UK, 1987.
[23] T.G. Dietterich, Ensemble methods in machine learning, in: F. Roli
(Ed.), International Workshop on Multiple Classifier Systems,
Lecture Notes in Computer Science, Springer-Verlag, New York,
2000, pp. 1–15.