unsupervised pattern - informatics: indiana university boutros is a phd student in the department of...
TRANSCRIPT
Paul Boutros
is a PhD student in the
Department of Medical
Biophysics, at the University of
Toronto, Canada.
Dr Allan Okey
is a professor in the
Department of Pharmacology,
at the University of Toronto,
Canada.
Keywords: patternrecognition, clustering,microarray, systems biology,data integration, co-regulation
Paul C. Boutros,
Department of Pharmacology,
University of Toronto,
4302 Medical Sciences Building,
1 King’s College Circle,
Toronto, Ontario, Canada M5S 1A8
Tel: +1 416 978 2207
Fax: +1 416 978 6395
E-mail: [email protected]
Unsupervised patternrecognition: An introductionto the whys and wherefores ofclustering microarray dataPaul C. Boutros and Allan B. OkeyDate received (in revised form): 21st June 2005
Abstract
Clustering has become an integral part of microarray data analysis and interpretation. The
algorithmic basis of clustering – the application of unsupervised machine-learning techniques to
identify the patterns inherent in a data set – is well established. This review discusses the
biological motivations for and applications of these techniques to integrating gene expression
data with other biological information, such as functional annotation, promoter data and
proteomic data.
INTRODUCTIONThe advent of high-throughput genomics
and proteomics is leading to an
unprecedented surge of cross-disciplinary
work between biology and statistics. The
growing interface between these distinct
groups of researchers has shown both
significant successes and surprising
challenges. One major area of progress has
been the application of machine-learning
techniques to large biological data sets.
Machine-learning techniques (also known
as pattern-recognition techniques) have
been successfully used on microarray-
based gene-expression data to identify
groups of functionally related genes1 as
well as to predict broader biological
phenotypes such as genetic interactions,2
social behaviour3 or cancer type.4
While the various algorithms and
methods used for machine-learning
techniques have been extensively
reviewed elsewhere,5–8 the biological
theory underlying and motivating these
approaches has rarely been discussed. We
focus here on unsupervised pattern
recognition, and, in particular, on the
clustering techniques commonly used in
the analysis of mRNA expression
microarray data. Before discussing the
hypothesis of co-regulation that underlies
many of the applications of unsupervised
pattern recognition, we briefly outline
both the basic microarray data-analysis
pipeline and the general classes of
clustering algorithms in use. We then
discuss the underlying assumptions behind
unsupervised pattern recognition and
conclude with a series of examples from
the recent literature.
AN OVERVIEW OFMICROARRAY DATAANALYSISThe pre-processing of microarray data can
be thought of as a pipeline, with a series
of stages through which the data must
travel before useful results can be
extracted (Figure 1). The analytical
pipeline is virtually identical for single-
channel (Affymetrix) and dual-channel
(cDNA) arrays, although different
algorithms are employed for different
platforms. The first step in analysis is
quantification of the scanned image into
numeric data amenable to statistical
analysis.9 This raw data set is filtered to
remove high-noise and low-signal
spots.10–12 The filtered data set is then
normalised (often twice) to remove spatial
variability, channel imbalances and inter-
array heterogeneity.13–16 The filtered and
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 1
normalised data can then be analysed in a
variety of ways to extract useful biological
data. For example, results can be
subjected to significance-testing to
identify differentially expressed genes.17,18
Alternatively, a variety of clustering
approaches can be used to identify broad
patterns in the biological system.5 The
clustering can then form a platform for
network reconstruction (see Example II
below) or other downstream analyses
(Figure 1).
It is intriguing to note that the early
stages of this pipeline have, in our view,
largely been the most challenging to
model. Genomics analyses start with low-
level operations such as filtering and
normalisation and terminate with high-
level operations such as clustering or other
types of pattern recognition. Low-level
analyses remain poorly understood from a
statistical perspective, and are largely
handled empirically. By contrast, higher-
level analyses such as clustering appear
better able to leverage techniques from
other fields,19 often with genomics-
specific enhancements.20 This implies that
at a high level, microarray genomics data
are not particularly different from other
types of data, despite their uniqueness at
lower levels.
Pattern recognition generally occurs at
the end of the analysis pipeline, although
it occasionally finds use earlier, such as for
$�'0���
= ����������
���� ��! ���
4�������
$����0���0�����
2����>�����������������
����0��������0����0�������
2����>�����������������
%��'��3������ �����
�� ������2�������0������������0����������
���������0����
�����������������
����0������0����0�������
Figure 1: The basic pipeline for the pre-processing of microarray data. The pipeline starts with image quantitation. Thisfigure shows a two-channel array, but most steps are the same for one-channel platforms. The quantified image data arefirst filtered to remove bad spots, then normalised to bring the two channels of an array into balance. This procedure isrepeated for each array in an experiment, after which the entire set is subject to a second, inter-array normalisation. Thenormalised dataset is significance-tested to identify differentially expressed genes. These genes can then be clustered todisplay trends in gene-expression
332 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
data exploration. The goal of
unsupervised pattern recognition is to
identify small subsets of genes that display
coordinated expression patterns. This process
is also known as ‘clustering’, and the two
terms are often used interchangeably. The
goal of clustering large sets of expression
data is to rationalise changes in expression
of entire groups of genes and to discover
possible functional relationships among
them.
There are two key questions one might
ask at this point. First, how can we find
these sets of genes that exhibit
coordinated expression profiles? Second,
why is it biologically useful and important
to identify subsets of genes with
coordinated expression? We answer these
two questions in turn.
WHAT IS CLUSTERANALYSIS?What is clustering?Clustering algorithms – a subset of
unsupervised pattern-recognition or
machine-learning algorithms – identify
patterns within a data set. In genomic
work, this is often described as identifying
‘clusters’ of genes that have similar
expression patterns. To be succinct:
unsupervised pattern recognition (clustering)
explicitly identifies the underlying patterns
inherent to a data set.
This definition reveals one key
weakness of unsupervised pattern-
recognition approaches: they generally
operate under the assumption that there is
an underlying pattern within the data.
Most unsupervised pattern-recognition
algorithms ‘see’ patterns in random data;
hence their output must be rigorously
validated, both statistically and
scientifically.
Figure 2A shows a simple two-
dimensional example of the goal of
clustering. The raw data are a set of points
on the two-dimensional graph. In a
genomic data set, each point represents a
single gene, while the axes represent
changes in expression levels in response to
different treatments or stimuli. The
clustering algorithm identifies the
underlying order in these data by
identifying three ‘clusters’. Each cluster is
a group of genes that show similar,
coordinated expression profiles.
The goal of clusteringmicroarray data is toidentify patternsinherent in a dataset
��������
���������������
����������
���
�
�
��
��
Figure 2: Theunderlying idea ofclustering is to grouptogether sets of genes(white circles) that havesimilar expressionprofiles (dashed-blackovals). In (A) we seethree distinct clusters,along with two outlyinggenes that do not easilyfit into one of the maingroups. In (B) the effectof ‘forcing’ these twogenes into clusters canbe seen. The within-cluster distance (WC)grows substantially,while the between-cluster distance (BC)drops dramatically
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 3
Unsupervised pattern recognition
Note that in this case, two genes were
classified as outliers: they were not
sufficiently similar to any other gene to be
grouped into a cluster. Not all clustering
algorithms are capable of making this
choice to ‘not cluster’ an outlier gene. If
the outliers had been forced into one of
the existing clusters (Figure 2B), the
clusters become significantly larger. This
is reflected by an increase in the size of
clusters, termed the within-cluster
distance (WC on Figure 2B) and a
decrease in the space between clusters,
termed the between-cluster distance (BC
on Figure 2B). It is important for a
clustering algorithm to correctly handle
outliers in order to minimise within-
cluster distance (to maximise the similarity
of genes in a cluster) and to maximise the
between-cluster distance (to ensure that
each cluster is functionally distinct).
Correct parameterisation of clustering
algorithms is important to ensure that
these outliers are not grouped with one of
the strongly supported clusters.7
How is clustering done?There are numerous machine-learning
algorithms, many of which have been
successfully used in the clustering of
microarray data. The following section
lists references that give details on both
the algorithms and their applications.
Here, we give a brief overview of the
three most commonly used clustering
algorithms: k-means, hierarchical and self-
organising maps.
Before describing these three
algorithms, note that clustering algorithms
are generally classified into two groups,
based on the information they process. If
an algorithm starts with only the raw data,
it is termed an ‘unsupervised’ algorithm.
If, on the other hand, the algorithm also is
given data about the types of samples or
how they are related to one another, then
the algorithm is said to be ‘supervised’.
Supervised and unsupervised
algorithms generally are used for very
different purposes. Supervised algorithms
excel at classification. For example, they
allow identification of sets of genes that
can be toxicologically21,22 or
diagnostically4 informative. Unsupervised
algorithms, on the other hand, are
primarily used for ‘pattern discovery’, the
identification of novel and unexpected
patterns in a data set. While both types of
clustering – supervised and unsupervised
– are important, unsupervised algorithms
like the three clustering algorithms
described here are most commonly used
in microarray data analysis.
To compare and contrast the three
methods, we ran each algorithm on the
same set of data (Figure 3). This data set
comes from a developmental study of the
murine pre-frontal cortex between two
and four weeks after birth (Semeralul et
al., manuscript in preparation). Each time
point contains four biological replicates
analysed on Affymetrix MOE430A arrays.
Here we clustered the 185 most
developmentally responsive genes with
each algorithm using the popular Cluster
package.23
The agglomerative hierarchical
clustering algorithm (Figure 3A) works by
successively grouping together very
similar pairs of genes. The algorithm starts
by comparing each gene with every other
gene. It selects the two genes with the
most similar expression patterns, groups
these together into a ‘node’, then repeats
the procedure with the remaining genes.
This process continues until every gene
has been placed in a node. This procedure
is grounded in the assumption that the
internal structure of the data is
fundamentally hierarchical – that the
genes can be accurately ranked in terms of
their similarity to one another. If this is
not always the case, or if genes are not
correlated across all conditions,
hierarchical clustering will not perform
well.
Both k-means clustering (Figure 3B)
and self-organising maps (SOMs; Figure
3C) are classified as ‘iterative relocation
algorithms’. These algorithms do not
assume that the data are hierarchical.
Instead, they require the user to estimate
the number of clusters present in the data.
Sometimes these estimates can be inferred
Improper outlierhandling can decreaseclustering
Supervised versusunsupervised clustering
33 4 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
from knowledge of the biological system.
For example, in the k-means clustering
(Figure 3B) we chose three clusters of
samples (mirroring the three
developmental time-points) and eight
clusters of genes (to reflect the idea that at
any time point a gene can go up or down,
leading to 23 ¼ 8 combinations). Because
such estimates can ‘freeze’ prior bias into
the analysis, several methods have been
developed to estimate the number of
clusters inherent in the data.24
Iterative relocation algorithms work by
first randomly placing every gene into one
of the preset clusters. They then proceed
to move (relocate) the genes one at a time
(iteratively) to produce increasingly
homogeneous clusters. This can require
many steps (relocations), but on modern
computers these algorithms generally
require minutes, rather than hours or
days.
The primary difference between k-
means clustering and self-organising maps
lies in the type of clusters assigned. In k-
means clustering each gene is assigned to a
distinct and independent cluster; a total of
k clusters are used, giving the algorithm its
��0:����������� ��0�>���� ��0����>��������0���
Figure 3: Hierarchical clustering (A) generates dendrograms for each axis of the data: the y-axis represents genes and thex-axis represents samples. Each week has four replicated animals, and these biological replicates can be seen to clustertogether on the x-axis. On the y-axis, the genes are ordered by the similarity of their expression profiles into a tree-likestructure called a dendrogram. K-means clustering (B) was performed with k ¼ 8 for the genes (y) axis and k ¼ 3 for thesamples (x) axis. It was repeated 100,000 times and the solution with minimal within-cluster distance was chosen. BecauseSOMs (C) are themselves two-dimensional, there is no easy way to display SOM analysis of both genes and samples on asingle plot. The SOM figures give the response of the average gene with each cell for each of the three time-frames. Notethat adjacent cells of SOM are more similar to one another than non-adjacent cells
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 5
Unsupervised pattern recognition
name. In contrast, self-organising maps
use a ‘gradient’ of related clusters. A two-
dimensional grid of clusters is created, and
the user supplies the size of the grid as a
parameter. Each grid within the map is
more closely related to the adjacent
clusters than it is to non-adjacent clusters.
For example, in Figure 3C the grid (1,1)
is far more similar to grid (1,2) than it is to
grid (1,4).
Details on clustering andclustering algorithmsThe many algorithms used in the
clustering of microarray data have been
described in detail several times. The
classic text on statistical pattern-
recognition, Duda et al., includes a lucid
overview of clustering algorithms.8
Several excellent reviews of the
mathematical details of clustering
algorithms have also appeared.5,7,25–27
Further, a large number of primary
publications making extensive use of
clustering have been published.3,28–31
Other workers have specifically studied
the performance of different clustering
algorithms for identifying co-expressed
sets of genes.32,33 Finally, a large number
of methods for visualising,34,35
improving20,35–38 and validating39–42 the
performance of clustering algorithms on
microarray data have been developed.
WHY IS CLUSTERINGPOSSIBLE?Co-expression: The rationalefor clusteringUnsupervised pattern recognition is one
of the most common analytical tasks
performed on microarray data. The heat-
map type of displays introduced by
Michael Eisen19 are intuitive and these
plots abound in journals. But aside from
providing a clear display methodology,
what is the underlying biological
motivation for clustering expression data?
To answer this question it is helpful to
return to some basic biological concepts.
The phenomenon of genes displaying
similar patterns of expression over a
variety of conditions is termed ‘co-
expression’. Virtually all clustering
techniques identify sets of genes that are
co-expressed. The question of why
clustering is possible is essentially
equivalent to asking why we expect genes
to be co-expressed. Fortunately co-
expression is supported both by an
immense body of empirical observations
and by detailed mechanistic explanation.
We first explain co-expression intuitively,
then mechanistically.
Co-expression explained intuitively
Intuitively it makes sense that if two or
more separate genes are involved in a
single biological process, they will be
biologically ‘needed’ at the same time.
Because they are needed in unison, they
will also be expressed in unison – co-
expression. For example, drugs and
environmental chemicals are eliminated
from the body by the combined action of
multiple enzymes. Phase I enzymes (such
as the various species of cytochrome
P450) convert foreign chemicals into
chemically reactive intermediates. Phase
II enzymes then conjugate these reactive
intermediates with water-soluble moieties
such as glutathione or glucuronic acid,
thereby facilitating elimination of the
foreign chemical via the kidney. In
animals exposed to polycyclic aromatic
hydrocarbons, expression of specific forms
of cytochrome P450 (CYP1A1, CYP1A2
and CYP1B1) is highly upregulated, ie
the enzymes are induced at the
transcriptional level. In most instances,
Phase II conjugating enzymes such as
glutathione S-transferases and UDP-
glucuronosytransferases are simultaneously
induced so that an increase in reactive
intermediates (generated by the P450s) is
accompanied by an increase in Phase II
enzymes that conjugate these reactive
metabolites and detoxify them.43 These
gene batteries will be co-expressed:
conditions that ‘require’ the activity of
one gene-product will usually ‘require’
the activity of the others.
From this viewpoint, co-expression can
be seen as an inherent property of
biological systems. When two or more
Clustering identifies co-expressed genes
33 6 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
genes function in the same biological
process – either independently or by
complementing one another – they will
be expressed at the same time. Co-
expression follows naturally, and clusters
of gene expressions are inevitably
observed in microarray studies.
Co-expression explained mechanistically
Our intuitive description of co-expression
emphasizes co-expression as a necessary
outgrowth of the fact that biological
systems have multiple coordinated ways
of responding to stimuli. The body
responds to drugs or disease in many
different ways, and the effects occur in
parallel, as well as in series.
From a mechanistic point of view,
however, this explanation remains
unsatisfying. It helps explain, from a
teleological perspective, why co-
expression occurs but not how it occurs.
The key is that biological responses to
external stimuli are not stochastic – they
are regulated. The regulatory systems that
help coordinate biological responses
underlie co-expression: genes that are co-
expressed are often coordinately regulated via a
common mechanism.
Continuing our earlier example, the
co-expression of various xenobiotic
metabolising enzymes can be explained
mechanistically: the P450 enzymes and
the conjugating enzymes are coordinately
regulated by a common ligand-dependent
transcription factor, the AH receptor
(reviewed in Okey et al.44). These genes
will thus be co-expressed: conditions that
‘require’ the activity of one gene usually
will simultaneously alter the activity of
other genes that share the regulatory
mechanism.
In other words, co-expression can be
the result of co-regulation. It is important
to emphasise here that co-regulation is
not the only mechanism that can lead to
co-expression. If only one or two
conditions are considered, many genes
will appear to respond in a similar fashion.
As more and more conditions are added,
however, these types of random
observations will be diminished.
Alternately, two genes may be regulated
separately, but their independent
regulatory mechanisms may overlap
significantly, making it difficult to
distinguish their separability.
Despite these caveats, it is often
reasonable to make the assumption that
co-expressed genes are co-regulated
through one or a few common
mechanisms. For example, many of the
genes that are co-expressed in response to
the toxicant TCDD respond through a
specific transcription factor, the aryl
hydrocarbon receptor (AHR). These
AHR-mediated genes fall into two
distinct mechanisms, depending on
whether the AHR functions as a
transcription factor45,46 or as a
coactivator.47
Based on this assumption, clustering
becomes a way of identifying sets of genes
that are putatively co-regulated, thereby
generating testable hypotheses. For
example, if the regulatory mechanisms for
one or two genes in a cluster are known,
the hypothesis would be that all genes in
the cluster are regulated by that
mechanism. This hypothesis can be tested
through such approaches as gene-
knockouts or knockdowns, the use of
transfection assays with reporter-
constructs, or direct assessment of
transcription-factor binding.
The regulatory mechanisms that alter
gene expression are not fully understood,
but the basic outlines are well known.
Specific proteins, called transcription
factors, are capable of binding to short,
specific sequences of DNA. These specific
sequences, called transcription factor
binding sites (TFBS), usually lie close to
the coding regions of genes they regulate.
When a transcription factor binds to its
TFBS, it modulates expression of that
gene. Transcription factors can either
induce (activate) or repress expression.
In summary, co-expressed genes often
also are co-regulated. This regulation is
achieved by short sequences of DNA
(TFBSs), to which a transcription factor
can bind. This relationship has been
empirically demonstrated many times and
Co-expressed is oftenthe result of co-regulation
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 7
Unsupervised pattern recognition
was recently validated experimentally
with high-throughput ChIP-Chip
studies.32,48,49 For example, sets of co-
expressed yeast genes have been used to
clearly identify novel TFBSs, which have
then been identified with specific
transcription factors and biological
regulatory mechanisms.50
CLUSTERING AND CO-REGULATION: EXAMPLESThe presence of coordinated regulation in
biological systems might seem trivial, but
in fact, the significance is quite profound.
The knowledge that co-expressed genes
are often co-regulated gives us a platform
upon which to integrate different types of
biological data. By amalgamating
(integrating) multiple data sets, one can
remove false-positives and help identify
disparities between the different levels of
data. Further, integration helps to explain
biological phenomena on deeper levels
than a single data set alone could.
Integration can be performed at many
levels. Below, we outline three recent
examples of how clustering algorithms
were used to integrate microarray
expression data with other types of data to
yield unexpected conclusions. The first
example details the integration of array
data with functional annotation, the
second with genome-wide transcription-
factor binding-site searches, and the third
with proteomic data.
Example I: Expression data andfunctional annotationHughes et al.1 demonstrated that
clustering of expression data is one way to
assess the function of novel,
uncharacterised genes. They considered
300 distinct, significantly different yeast
conditions. These conditions included
both genetic mutations and chemical
treatments. A microarray experiment was
performed for each condition. The entire
data set (compendium, in Hughes’s
terminology) was then subjected to two-
dimensional agglomerative hierarchical
clustering. Sets of genes that showed
similar expression patterns (ie clustered
together) were then annotated with any
known functions. Uncharacterised genes
were then hypothesised to be functionally
related to other genes in their cluster. In
this way, function could be inferred from
a set of co-expressed genes, thereby
generating a set of testable hypotheses
about the biological roles of
uncharacterised transcripts.
It is important to note that the 300
individual experiments were each of
biological interest. The individual
knockouts and chemical treatments used
could each be analysed for their effects on
gene expression across the entire yeast
transcriptome. By aggregating this entire
data set, however, Hughes et al. were able
to extract an enormous amount of
additional information regarding gene co-
regulation and function that was not
available from any of the individual data
sets in isolation. In yeast, this work has
been extended to include both knock-
outs of non-essential genes and knock-
downs of essential genes, allowing the
effects of genetic perturbation of every
single gene to be exploited for
identification of co-regulation.51
This type of integration of expression
data with functional annotation has been
used many times.28,29 It is becoming a
standard part of comprehensive
microarray data analysis, and a wide array
of software tools have been developed to
automate the testing of each individual
cluster for different functional
annotations.52–54
Example II: Expression dataand TFBS identificationSegal et al. extended this integration
approach to encompass transcription-
factor data.55,56 They did so by integrating
yeast gene-expression data with
automated transcription factor binding
site (TFBS) searches in an attempt to
rationalise the mechanism of regulation.
They compared the results of their
clustering based on both expression and
TFBS data with the results of clustering
based only on expression data, and
demonstrated the superiority of combined
Using co-expression topredict gene function
33 8 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
clusters, in part by the greater
concordance of functional annotations
within a cluster. In other words, Segal et
al. used the integration of functional data
as a tool to help test their algorithm for
combining TFBS and expression data.
Recall that a TFBS is a short sequence
of DNA near the coding region of a gene
that modulates the gene’s expression level.
Because TFBSs are short sequences, both
functional and non-functional sites are
present throughout the genome. The
false-positives generated by non-
functional sites often outnumber the true
positives, leading to the development of a
variety of methods for winnowing non-
functional TFBSs from functional ones57
and for rationalising observed patterns of
co-regulation based on the activity and
expression levels of specific transcription
factors. One such method is to search sets
of co-expressed genes for common,
conserved sequences. A sequence that
appears in an entire set of co-expressed
genes can be hypothesised to be at least
partially responsible for the regulation
underlying this co-regulation.
Segal et al. exploited this idea
extensively. Like Hughes et al., they
clustered a large set of distinct microarray
experiments, then tried to rationalise the
observed clusters. In this case, however,
the rationalisation was initially done on
the basis of observed TFBSs rather than in
terms of functional annotation. Segal et
al.’s algorithm did not stop there: they
used the initial clustering and TFBS
searches as seeds for an iterative algorithm.
If a cluster did not contain a known
TFBS, then they automatically searched
for a novel one. This TFBS knowledge
was then incorporated back into a new
round of clustering (Figure 4). After 50–
60 iterations of this process, the clustering
patterns stabilised and the resulting groups
were interpreted both with functional
annotation (as per Hughes) and with
manual inspection of how the proposed
regulatory mechanisms matched the
known biology of yeast regulation.
Although this study was somewhat
limited by some of the simplifying
assumptions used – in particular the
assumption that the regulatory species
themselves will change expression – it
helped identify a large number of novel
regulatory mechanisms. Further, it
provided one of the first large-scale
examples of integrating expression-data
with TFBS data to estimate regulatory
networks. This approach will be
particularly valuable when it is used as a
complement to other high-throughput
approaches to identify regulatory
networks.58 This approach has since been
extended in scope59 to include mRNA
degradation rates and transcription factor
levels. A similar approach has also been
developed for higher eukaryotes, and was
tested using the human cell cycle.60
Example III: Expression dataand proteomic dataA final example of the importance of co-
regulated genes involves the integration of
genomic and proteomic data. Genomic
experiments assay biological activity at the
level of genes, whereas proteomic
experiments assay activity at the protein
level. Ideker et al. performed an in-depth
study (yet again in yeast) integrating these
Using co-expression toidentify TFBSs
2������0�� ������ 4����0�� ������
���0<4��0����
Figure 4: The incorporation of TFBS datainto a pattern-recognition scheme candramatically improve the coherency of theclusters identified. Many iterations of theclustering/TFBS searching procedure may berequired to identify an invariant pattern ofclusters
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 9
Unsupervised pattern recognition
two sources of data.61,62 In particular, this
study integrated genomic assays of
mRNA expression levels with proteomic
studies reflecting protein–protein
interactions and proteomic studies
reflecting protein concentrations. They
focused on a single biochemical pathway:
the utilisation of the sugar galactose. They
first investigated the effects of knocking
out different enzymes in the galactose
pathway to identify the transcriptional
network underlying the pathways. Next,
they integrated this information with
protein–protein interactions and used this
combined data set to significantly expand
and clarify the textbook representation of
this important metabolic pathway.61
Although mRNA levels are not always
predictive of protein levels,63 they are
typically correlated. Genome-wide
protein and mRNA levels have been
extensively compared in yeast,64,65 and
were found to be highly related. In
particular, there was found to be, on
average, about 5,000 protein molecules
per mRNA molecule. This ratio was
largely independent of whether the gene
was expressed at a high or a low level.
These studies have shown that the yeast
translatome – the set of expressed yeast
proteins – is biased towards short,
cytoplasmic proteins primarily involved in
fundamental cell-maintenance
functions.66
Similar studies, directly comparing
protein and mRNA levels in eukaryotic
systems have been performed
occasionally,67–70 but more often parallel
proteomic and genomic studies have been
done with no attempt made to integrate
the data.71,72 It is not yet clear if protein
and mRNA levels correlate in eukaryotic
cells, or if the level of correlation is at all
cell-type or species dependent. Pattern-
recognition techniques will have an
important role to play in answering these
types of questions as greater volumes of
proteomic data become available.
CONCLUSIONSThe biological rationale underlying the
clustering of microarray data is the fact
that many co-expressed genes are also co-
regulated. This trivial observation enables
an enormous variety of new science: the
functional annotation of uncharacterised
genes, the de novo identification of TFBSs,
and the elucidation of complete biological
pathways. Each of these findings is based
on exploitation of unsupervised pattern-
recognition techniques. Pattern discovery
has become an essential part of the
‘-omic’ sciences.
The increasing role for machine-
learning is a natural outgrowth of
reductionist biology. In theory, co-
regulation can simplify our understanding
of biological response. Instead of
characterising how a drug or toxicant
affects each gene in a cell, we can simply
identify how it changes cellular
regulation. We no longer consider
thousands of separate genes, but rather
only hundreds of regulators. Given the
diversity of regulatory mechanisms in a
cell – signal transduction, transcriptional,
translational and post-translational – it is
certain that identifying complete
regulatory networks will require
immensely complex pattern recognition.
The co-regulation hypothesis is very
powerful. However, it is just that – a
hypothesis. It can be supported by in silico
evidence such as functional annotation
and integration with TFBS or proteomic
data. But while these sources can give
additional confidence in the assertion that
a pair of genes is co-regulated, they do
not provide proof. Instead, at least some
of the putative co-regulated genes must
be shown, in vivo, to be co-regulated
through genetic manipulation of the
relevant transcription factors. And even in
such cases, additional in vitro work to
detail the mechanism of co-regulation is
often necessary. In short: in silico work
may generate valuable hypotheses, but
traditional techniques are still required to
validate them.
As an increasing number and variety of
high-throughput data sets become
available, it is increasingly important to
find methods for integrating different
levels of biological data. This integration
Hypotheses generatedby clustering need
34 0 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
lies at the heart of the emerging discipline
of systems biology. Unsupervised pattern-
recognition methods are a critical tool in
that integration and an important
technique for the interpretation of
biological data.
Acknowledgments
The authors gratefully thank Dr Albert Wong and
Ms Mawahib Semeralul for permission to use their
brain-development data in figures. The authors also
thank Dr Rafal Kustra and Dr Hugh Chipman for
stimulating conversations and Ms Joanne Kim for
critical reading of the manuscript. A.B.O.
recognises funding from the Canadian Institutes of
Health Research (grant MOP-57903) and the
National Cancer Institute of Canada with funds
from the Canadian Cancer Society.
References
1. Hughes, T. R., Marton, M. J., Jones, A. R.et al. (2000), ‘Functional discovery via acompendium of expression profiles’, Cell, Vol.102(1), pp. 109–126.
2. Tong, A. H., Lasage, G., Bader, G. D. et al.(2004), ‘Global mapping of the yeast geneticinteraction network’, Science, Vol. 303(5659),pp. 808–813.
3. Whitfield, C. W., Cziko, A. M. andRobinson, G. E. (2003), ‘Gene expressionprofiles in the brain predict behavior inindividual honey bees’, Science, Vol. 302(5643),pp. 296–299.
4. Golub, T. R., Slonim, D. K., Tamayo, P. et al.(1999), ‘Molecular classification of cancer:Class discovery and class prediction by geneexpression monitoring’, Science, Vol.286(5439), pp. 531–537.
5. Sherlock, G. (2001), ‘Analysis of large-scalegene expression data’, Brief. Bioinformatics,Vol. 2(4), pp. 350–362.
6. Sherlock, G. (2000), ‘Analysis of large-scalegene expression data’, Curr. Opin. Immunol.,Vol. 12(2), pp. 201–205.
7. Azuaje, F. (2003), ‘Clustering-basedapproaches to discovering and visualisingmicroarray data patterns’, Brief. Bioinformatics,Vol. 4(1), pp. 31–42.
8. Duda, R. O., Hart, P. E. and Stork, D. G.(2001), ‘Pattern Classification’, 2nd edn,Wiley, New York, p. 654.
9. Jain, A. N., Tokuyasu, T. A., Snijders, A. M.et al. (2002), ‘Fully automatic quantification ofmicroarray image data’, Genome Res., Vol.12(2), pp. 325–332.
10. Tseng, G. C., Oh, M. K., Rohlin, L. et al.(2001), ‘Issues in cDNA microarray analysis:Quality filtering, channel normalization,
models of variations and assessment of geneeffects’, Nucleic Acids Res., Vol. 29(12), pp.2549–2557.
11. Wang, X., Ghosh, S. and Guo, S. W. (2001),‘Quantitative quality control in microarrayimage processing and data acquisition’, NucleicAcids Res., Vol. 29(15), E75.
12. Jenssen, T. K., Langaas, M., Kuo, W. P. et al.(2002), ‘Analysis of repeatability in spottedcDNA microarrays’, Nucleic Acids Res., Vol.30(14), pp. 3235–3244.
13. Bolstad, B. M., Irizarry, R. A., Astrand, M.and Speed, T. P. (2003), ‘A comparison ofnormalization methods for high densityoligonucleotide array data based on varianceand bias’, Bioinformatics, Vol. 19(2), pp.185–193.
14. Workman, C., Lensen, L. J., Jarmer, H. et al.(2002), ‘A new non-linear normalizationmethod for reducing variability in DNAmicroarray experiments’, Genome Biol., Vol.3(9), p. research0048.
15. Yang, Y. H., Dudoit, S., Luu, P. et al. (2002),‘Normalization for cDNA microarray data: Arobust composite method addressing single andmultiple slide systematic variation’, NucleicAcids Res., Vol. 30(4), p. e15.
16. Smyth, G. K. and Speed, T. (2003),‘Normalization of cDNA microarray data’,Methods, Vol. 31(4), pp. 265–273.
17. Smyth, G. K. (2003), ‘Linear models andempirical Bayes methods for assessingdifferential expression in microarrayexperiments’, Stat. Appl. Genet. Mol. Biol.,Vol. 3(1), pp. 1–26.
18. Pan, W. (2002), ‘A comparative review ofstatistical methods for discovering differentiallyexpressed genes in replicated microarrayexperiments’, Bioinformatics, Vol. 18(4), pp.546–554.
19. Eisen, M. B., Spellman, P. T., Brown, P. O.and Botstein, D. (1998), ‘Cluster analysis anddisplay of genome-wide expression patterns’,Proc. Natl Acad. Sci. USA, Vol. 95(25), pp.14863–14868.
20. Yeung, K. Y., Fraley, C., Murua, A. et al.(2001), ‘Model-based clustering and datatransformations for gene expression data’,Bioinformatics, Vol. 17(10), pp. 977–987.
21. Waring, J. F., Gum, R., Morfitt, D. et al.(2002), ‘Identifying toxic mechanisms usingDNA microarrays: Evidence that anexperimental inhibitor of cell adhesionmolecule expression signals through the arylhydrocarbon nuclear receptor’, Toxicology,Vol. 181–182, pp. 537–550.
22. Thomas, R. S., Penn, S. G., Holden, K. et al.(2002), ‘Sequence variation and phylogenetichistory of the mouse Ahr gene’,Pharmacogenetics, Vol. 12(2), pp. 151–163.
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 34 1
Unsupervised pattern recognition
23. de Hoon, M. J., Imoto, S., Nolan, J. andMiyano, S. (2004), ‘Open source clusteringsoftware’, Bioinformatics, Vol. 20(9), pp.1453–1454.
24. Dudoit, S. and Fridlyand, J. (2002), ‘Aprediction-based resampling method forestimating the number of clusters in a dataset’,Genome Biol., Vol. 3(7), p. research0036.
25. Altman, R. B. and Raychaudhuri, S. (2001),‘Whole-genome expression analysis: challengesbeyond clustering’, Curr. Opin. Struct. Biol.,Vol. 11(3), pp. 340–347.
26. Raychaudhuri, S., Sutphin, P. D., Chang, J. T.and Altman, R. B. (2001), ‘Basic microarrayanalysis: Grouping and feature reduction’,Trends Biotechnol., Vol. 19(5), pp. 189–193.
27. Simon, R., Radmacher, M. D., Dobbin, K.and McShane, L. M. (2003), ‘Pitfalls in the useof DNA microarray data for diagnostic andprognostic classification. J. Natl Cancer Inst.,2003. 95(1), pp. 14–18.
28. Stuart, J. M., Sergal, E., Koller, D. and Kim,S. K. (2003), A gene-coexpression network forglobal discovery of conserved geneticmodules’, Science, Vol. 302(5643), pp. 249–255.
29. Tavazoie, S., Hughes, J. D., Campbell, M. J.et al. (1999), ‘Systematic determination ofgenetic network architecture’, Nat. Genet.,Vol. 22(3), pp. 281–285.
30. Daoud, S. S., Munson, P. J., Reinhold, W.et al. (2003), ‘Impact of p53 knockout andtopotecan treatment on gene expressionprofiles in human colon carcinoma cells: Apharmacogenomic study’, Cancer Res., Vol.63(11), pp. 2782–2793.
31. Fertuck, K. C., Eckel, J. E., Gennings, C. andZacharewski, T. R. (2003), ‘Identification oftemporal patterns of gene expression in theuteri of immature, ovariectomized micefollowing exposure to ethynylestradiol’,Physiol. Genomics, Vol. 15(2), pp. 127–141.
32. Allocco, D. J., Kohane, I. S. and Butte, A. J.(2004), ‘Quantifying the relationship betweenco-expression, co-regulation and genefunction’, BMC Bioinformatics, Vol. 5(1), p. 18.
33. Yeung, K. Y., Medvedovic, M. andBumgarner, R. E. (2004), ‘From co-expressionto co-regulation: How many microarrayexperiments do we need?’, Genome Biol., Vol.5(7), p. R48.
34. Gilbert, D. R., Schroeder, M. and van Helden,J. (2000), ‘Interactive visualization andexploration of relationships between biologicalobjects’, Trends Biotechnol., Vol. 18(12), pp.487–494.
35. Sultan, M., Wigle, D. A., Cumbaa, C. A. et al.(2002), ‘Binary tree-structured vectorquantization approach to clustering andvisualizing microarray data’, Bioinformatics, Vol.18 (Suppl 1), pp. S111–119.
36. Hanisch, D., Zien, A., Zimmer, R. andLengauer, T. (2002), ‘Co-clustering ofbiological networks and gene expression data’,Bioinformatics, Vol. 18 (Suppl 1), pp.S145–154.
37. McLachlan, G. J., Bean, R. W. and Peel, D.(2002), ‘A mixture model-based approach tothe clustering of microarray expression data’,Bioinformatics, Vol. 18(3), pp. 413–422.
38. Getz, G., Levine, E. and Domany, E. (2000),‘Coupled two-way clustering analysis of genemicroarray data’, Proc. Natl Acad. Sci. USA,Vol. 97(22), pp. 12079–12084.
39. Datta, S. (2003), ‘Comparisons and validationof statistical clustering techniques formicroarray gene expression data’,Bioinformatics, Vol. 19(4), pp. 459–466.
40. Gat-Viks, I., Sharan, R. and Shamir, R.(2003), ‘Scoring clustering solutions by theirbiological relevance’, Bioinformatics, Vol.19(18), pp. 2381–2389.
41. Kerr, M. K. and Churchill, G. A. (2001),‘Bootstrapping cluster analysis: Assessing thereliability of conclusions from microarrayexperiments’, Proc. Natl Acad. Sci. USA, Vol.98(16), pp. 8961–8965.
42. McShane, L. M., Radmacher, M. D., Freidlin,B. et al. (2002), ‘Methods for assessingreproducibility of clustering patterns observedin analyses of microarray data’, Bioinformatics,Vol. 18(11), pp. 1462–1469.
43. Nebert, D. W., Dalton, T. P., Okey, A. B.and Gonzalez, F. J. (2004), ‘Role of arylhydrocarbon receptor-mediated induction ofthe CYP1 enzymes in environmental toxicityand cancer’, J. Biol. Chem., Vol. 279(23), pp.23847–23850.
44. Okey, A. B., Franc, M. A., Moffat, I. D. et al.(2005), ‘Toxicological implications ofpolymorphisms in receptors for xenobioticchemicals: The case of the aryl hydrocarbonreceptor’, Toxicol. Appl. Pharmacol., Vol. 207(2 Suppl), pp. 43–51.
45. Denison, M. S. and Nagy, S. R. (2003),‘Activation of the aryl hydrocarbon receptorby structurally diverse exogenous andendogenous chemicals’, Annu. Rev. Pharmacol.Toxicol., Vol. 43, pp. 309–334.
46. Sun, Y. V., Boverhof, D. R., Burgoon, L. D.et al. (2004), ‘Comparative analysis of dioxinresponse elements in human, mouse and ratgenomic sequences’, Nucleic Acids Res., Vol.32(15), pp. 4512–4523.
47. Boutros, P. C., Moffat, I. K., Franc, M. A.et al. (2004), ‘Dioxin-responsive AHRE-IIgene battery: Identification by phylogeneticfootprinting’, Biochem. Biophys. Res. Commun.,Vol. 321(3), pp. 707–715.
48. Mao, D. Y., Watson, J. D., Yan, P. S. et al.(2003), ‘Analysis of Myc bound loci identifiedby CpG island arrays shows that Max is
34 2 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005
Boutros and Okey
essential for Myc-dependent repression’, Curr.Biol., Vol. 13(10), pp. 882–886.
49. Odom, D. T., Zizlsperger, N., Gordon, D. B.et al. (2004), ‘Control of pancreas and livergene expression by HNF transcription factors’,Science, Vol. 303(5662), pp. 1378–1381.
50. Zhu, Z., Pilpel, Y. and Church, G. M. (2002),‘Computational identification of transcriptionfactor binding sites via a transcription-factor-centric clustering (TFCC) algorithm’, J. Mol.Biol., Vol. 318(1), pp. 71–81.
51. Mnaimneh, S., Davierwala, A. P., Haynes, J.et al. (2004), ‘Exploration of essential genefunctions via titratable promoter alleles’, Cell,Vol. 118(1), pp. 31–44.
52. Beissbarth, T. and Speed, T. (2004), ‘GOstat:Find statistically overrepresented geneontologies within a group of genes’,Bioinformatics, Vol 20(9), pp. 1464–1465.
53. Cheng, J., Sun, S., Tracy, A. et al. (2004),‘NetAffx Gene Ontology Mining Tool: Avisual approach for microarray data analysis’,Bioinformatics, Vol 20(9), pp. 1462–1463.
54. Zeeberg, B. R., Feng, W., Wang, G. et al.(2003), ‘GoMiner: A resource for biologicalinterpretation of genomic and proteomic data’,Genome Bio.l, Vol. 4(4), p. R28.
55. Segal, E., Shapira, M., Regev, A. et al. (2003),‘Module networks: Identifying regulatorymodules and their condition-specific regulatorsfrom gene expression data’, Nat. Genet., Vol.34(2), pp. 166–176.
56. Segal, E., Yelensky, R. and Koller, D. (2003),‘Genome-wide discovery of transcriptionalmodules from DNA sequence and geneexpression’, Bioinformatics, Vol. 19 (Suppl 1),pp. I273–I282.
57. Wasserman, W. W. and Sandelin, A. (2004),‘Applied bioinformatics for the identificationof regulatory elements’, Nat. Rev. Genet., Vol.5(4), pp. 276–287.
58. Lee, T. I., Rinaldi, N. J., Robert, F. et al.(2002), ‘Transcriptional regulatory networks inSaccharomyces cerevisiae’, Science, Vol. 298(5594),pp. 799–804.
59. Nachman, I., Regev, A. and Friedman, N.(2004), ‘Inferring quantitative models ofregulatory networks from expression data’,Bioinformatics, Vol. 20 (Suppl 1), pp. I248–I256.
60. Elkon, R., Linhart, C., Sharan, R. et al.(2001), ‘Genome-wide in silico identification oftranscriptional regulators controlling the cellcycle in human cells’, Genome Res., Vol. 13(5),pp. 773–780.
61. Ideker, T., Thorsson, V., Ranish, J. A. et al.(2001), ‘Integrated genomic and proteomicanalyses of a systematically perturbed metabolicnetwork’, Science, Vol. 292(5518), pp.929–934.
62. Ideker, T., Ozier, O, Schwikowski, B. andSiegel, A. F. (2002), ‘Discovering regulatoryand signalling circuits in molecular interactionnetworks’, Bioinformatics, Vol. 18 (Suppl 1),pp. S233–240.
63. McFadyen, M. C., Rooney, P. H., Melvin,W. T. and Murray, G. I. (2003), ‘Quantitativeanalysis of the Ah receptor/cytochrome P450CYP1B1/CYP1A1 signalling pathway’,Biochem. Pharmacol., Vol. 65(10), pp. 1663–1674.
64. Ghaemmaghami, S., Huh, W. K., Bower, K.et al. (2003), ‘Global analysis of proteinexpression in yeast’, Nature, Vol. 425(6959),pp. 737–741.
65. Huh, W. K., Falvo, J. V., Gerke, L. C. et al.(2003), ‘Global analysis of protein localizationin budding yeast’, Nature, Vol. 425(6959), pp.686–691.
66. Greenbaum, D., Jansen, R. and Gerstein, M.(2002), ‘Analysis of mRNA expression andprotein abundance data: an approach for thecomparison of the enrichment of features inthe cellular population of proteins andtranscripts’, Bioinformatics, Vol. 18(4), pp.585–596.
67. Lian, Z., Kluger, Y, Greenbaum, D. S. et al.(2002), ‘Genomic and proteomic analysis ofthe myeloid differentiation program: Globalanalysis of gene expression during induceddifferentiation in the MPRO cell line’, Blood,Vol. 100(9), pp. 3209–3220.
68. Feezor, R. J., Baker, H. V., Ziao, W. et al.(2004), ‘Genomic and proteomic determinantsof outcome in patients undergoingthoracoabdominal aortic aneurysm repair’,J. Immunol., Vol. 172(11), pp. 7103–7109.
69. Tian, Q., Stepaniants, S. B., Mao, M. et al.(2004), ‘Integrated genomic and proteomicanalyses of gene expression in mammaliancells’, Mol. Cell Proteomics, Vol. 3(10), pp.960–969.
70. Kim, C. H., Kim do K., Choi, S. J. et al.(2003), ‘Proteomic and transcriptomic analysisof interleukin-1 beta treated lung carcinomacell line’, Proteomics, Vol. 3(12), pp. 2454–2471.
71. Kirby, J., Menzies, F. M., Cookson, M. R.et al. (2002), ‘Differential gene expression in acell culture model of SOD1-related familialmotor neurone disease’, Hum. Mol. Genet.,Vol. 11(17), pp. 2061–2075.
72. Allen, S., Heath, P. R., Kirby, J. et al. (2003),‘Analysis of the cytosolic proteome in a cellculture model of familial amyotrophic lateralsclerosis reveals alterations to the proteasome,antioxidant defenses, and nitric oxide syntheticpathways’, J. Biol. Chem., Vol. 278(8), pp.6371–6383.
& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 34 3
Unsupervised pattern recognition