unsupervised pattern - informatics: indiana university boutros is a phd student in the department of...

Paul Boutros

is a PhD student in the

Department of Medical

Biophysics, at the University of

Toronto, Canada.

Dr Allan Okey

is a professor in the

Department of Pharmacology,

at the University of Toronto,

Canada.

Keywords: patternrecognition, clustering,microarray, systems biology,data integration, co-regulation

Paul C. Boutros,

Department of Pharmacology,

University of Toronto,

4302 Medical Sciences Building,

1 King’s College Circle,

Toronto, Ontario, Canada M5S 1A8

Tel: +1 416 978 2207

Fax: +1 416 978 6395

E-mail: [email protected]

Unsupervised patternrecognition: An introductionto the whys and wherefores ofclustering microarray dataPaul C. Boutros and Allan B. OkeyDate received (in revised form): 21st June 2005

Abstract

Clustering has become an integral part of microarray data analysis and interpretation. The

algorithmic basis of clustering – the application of unsupervised machine-learning techniques to

identify the patterns inherent in a data set – is well established. This review discusses the

biological motivations for and applications of these techniques to integrating gene expression

data with other biological information, such as functional annotation, promoter data and

proteomic data.

INTRODUCTIONThe advent of high-throughput genomics

and proteomics is leading to an

unprecedented surge of cross-disciplinary

work between biology and statistics. The

growing interface between these distinct

groups of researchers has shown both

significant successes and surprising

challenges. One major area of progress has

been the application of machine-learning

techniques to large biological data sets.

Machine-learning techniques (also known

as pattern-recognition techniques) have

been successfully used on microarray-

based gene-expression data to identify

groups of functionally related genes1 as

well as to predict broader biological

phenotypes such as genetic interactions,2

social behaviour3 or cancer type.4

While the various algorithms and

methods used for machine-learning

techniques have been extensively

reviewed elsewhere,5–8 the biological

theory underlying and motivating these

approaches has rarely been discussed. We

focus here on unsupervised pattern

recognition, and, in particular, on the

clustering techniques commonly used in

the analysis of mRNA expression

microarray data. Before discussing the

hypothesis of co-regulation that underlies

many of the applications of unsupervised

pattern recognition, we briefly outline

both the basic microarray data-analysis

pipeline and the general classes of

clustering algorithms in use. We then

discuss the underlying assumptions behind

unsupervised pattern recognition and

conclude with a series of examples from

the recent literature.

AN OVERVIEW OFMICROARRAY DATAANALYSISThe pre-processing of microarray data can

be thought of as a pipeline, with a series

of stages through which the data must

travel before useful results can be

extracted (Figure 1). The analytical

pipeline is virtually identical for single-

channel (Affymetrix) and dual-channel

(cDNA) arrays, although different

algorithms are employed for different

platforms. The first step in analysis is

quantification of the scanned image into

numeric data amenable to statistical

analysis.9 This raw data set is filtered to

remove high-noise and low-signal

spots.10–12 The filtered data set is then

normalised (often twice) to remove spatial

variability, channel imbalances and inter-

array heterogeneity.13–16 The filtered and

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005 33 1

normalised data can then be analysed in a

variety of ways to extract useful biological

data. For example, results can be

subjected to significance-testing to

identify differentially expressed genes.17,18

Alternatively, a variety of clustering

approaches can be used to identify broad

patterns in the biological system.5 The

clustering can then form a platform for

network reconstruction (see Example II

below) or other downstream analyses

(Figure 1).

It is intriguing to note that the early

stages of this pipeline have, in our view,

largely been the most challenging to

model. Genomics analyses start with low-

level operations such as filtering and

normalisation and terminate with high-

level operations such as clustering or other

types of pattern recognition. Low-level

analyses remain poorly understood from a

statistical perspective, and are largely

handled empirically. By contrast, higher-

level analyses such as clustering appear

better able to leverage techniques from

other fields,19 often with genomics-

specific enhancements.20 This implies that

at a high level, microarray genomics data

are not particularly different from other

types of data, despite their uniqueness at

lower levels.

Pattern recognition generally occurs at

the end of the analysis pipeline, although

it occasionally finds use earlier, such as for

$�'0��

= ��

�� ! ��

4��

$��0��0��

2��>��

��0��0��0��

2��>��

%��'��3��

�� 2��0��0��

��0��

��

��0��0��0��

Figure 1: The basic pipeline for the pre-processing of microarray data. The pipeline starts with image quantitation. Thisfigure shows a two-channel array, but most steps are the same for one-channel platforms. The quantified image data arefirst filtered to remove bad spots, then normalised to bring the two channels of an array into balance. This procedure isrepeated for each array in an experiment, after which the entire set is subject to a second, inter-array normalisation. Thenormalised dataset is significance-tested to identify differentially expressed genes. These genes can then be clustered todisplay trends in gene-expression

332 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005

Boutros and Okey

data exploration. The goal of

unsupervised pattern recognition is to

identify small subsets of genes that display

coordinated expression patterns. This process

is also known as ‘clustering’, and the two

terms are often used interchangeably. The

goal of clustering large sets of expression

data is to rationalise changes in expression

of entire groups of genes and to discover

possible functional relationships among

them.

There are two key questions one might

ask at this point. First, how can we find

these sets of genes that exhibit

coordinated expression profiles? Second,

why is it biologically useful and important

to identify subsets of genes with

coordinated expression? We answer these

two questions in turn.

WHAT IS CLUSTERANALYSIS?What is clustering?Clustering algorithms – a subset of

unsupervised pattern-recognition or

machine-learning algorithms – identify

patterns within a data set. In genomic

work, this is often described as identifying

‘clusters’ of genes that have similar

expression patterns. To be succinct:

unsupervised pattern recognition (clustering)

explicitly identifies the underlying patterns

inherent to a data set.

This definition reveals one key

weakness of unsupervised pattern-

recognition approaches: they generally

operate under the assumption that there is

an underlying pattern within the data.

Most unsupervised pattern-recognition

algorithms ‘see’ patterns in random data;

hence their output must be rigorously

validated, both statistically and

scientifically.

Figure 2A shows a simple two-

dimensional example of the goal of

clustering. The raw data are a set of points

on the two-dimensional graph. In a

genomic data set, each point represents a

single gene, while the axes represent

changes in expression levels in response to

different treatments or stimuli. The

clustering algorithm identifies the

underlying order in these data by

identifying three ‘clusters’. Each cluster is

a group of genes that show similar,

coordinated expression profiles.

The goal of clusteringmicroarray data is toidentify patternsinherent in a dataset

��

��

��

��

�

�

��

��

Figure 2: Theunderlying idea ofclustering is to grouptogether sets of genes(white circles) that havesimilar expressionprofiles (dashed-blackovals). In (A) we seethree distinct clusters,along with two outlyinggenes that do not easilyfit into one of the maingroups. In (B) the effectof ‘forcing’ these twogenes into clusters canbe seen. The within-cluster distance (WC)grows substantially,while the between-cluster distance (BC)drops dramatically


Unsupervised pattern recognition

Note that in this case, two genes were

classified as outliers: they were not

sufficiently similar to any other gene to be

grouped into a cluster. Not all clustering

algorithms are capable of making this

choice to ‘not cluster’ an outlier gene. If

the outliers had been forced into one of

the existing clusters (Figure 2B), the

clusters become significantly larger. This

is reflected by an increase in the size of

clusters, termed the within-cluster

distance (WC on Figure 2B) and a

decrease in the space between clusters,

termed the between-cluster distance (BC

on Figure 2B). It is important for a

clustering algorithm to correctly handle

outliers in order to minimise within-

cluster distance (to maximise the similarity

of genes in a cluster) and to maximise the

between-cluster distance (to ensure that

each cluster is functionally distinct).

Correct parameterisation of clustering

algorithms is important to ensure that

these outliers are not grouped with one of

the strongly supported clusters.7

How is clustering done?There are numerous machine-learning

algorithms, many of which have been

successfully used in the clustering of

microarray data. The following section

lists references that give details on both

the algorithms and their applications.

Here, we give a brief overview of the

three most commonly used clustering

algorithms: k-means, hierarchical and self-

organising maps.

Before describing these three

algorithms, note that clustering algorithms

are generally classified into two groups,

based on the information they process. If

an algorithm starts with only the raw data,

it is termed an ‘unsupervised’ algorithm.

If, on the other hand, the algorithm also is

given data about the types of samples or

how they are related to one another, then

the algorithm is said to be ‘supervised’.

Supervised and unsupervised

algorithms generally are used for very

different purposes. Supervised algorithms

excel at classification. For example, they

allow identification of sets of genes that

can be toxicologically21,22 or

diagnostically4 informative. Unsupervised

algorithms, on the other hand, are

primarily used for ‘pattern discovery’, the

identification of novel and unexpected

patterns in a data set. While both types of

clustering – supervised and unsupervised

– are important, unsupervised algorithms

like the three clustering algorithms

described here are most commonly used

in microarray data analysis.

To compare and contrast the three

methods, we ran each algorithm on the

same set of data (Figure 3). This data set

comes from a developmental study of the

murine pre-frontal cortex between two

and four weeks after birth (Semeralul et

al., manuscript in preparation). Each time

point contains four biological replicates

analysed on Affymetrix MOE430A arrays.

Here we clustered the 185 most

developmentally responsive genes with

each algorithm using the popular Cluster

package.23

The agglomerative hierarchical

clustering algorithm (Figure 3A) works by

successively grouping together very

similar pairs of genes. The algorithm starts

by comparing each gene with every other

gene. It selects the two genes with the

most similar expression patterns, groups

these together into a ‘node’, then repeats

the procedure with the remaining genes.

This process continues until every gene

has been placed in a node. This procedure

is grounded in the assumption that the

internal structure of the data is

fundamentally hierarchical – that the

genes can be accurately ranked in terms of

their similarity to one another. If this is

not always the case, or if genes are not

correlated across all conditions,

hierarchical clustering will not perform

well.

Both k-means clustering (Figure 3B)

and self-organising maps (SOMs; Figure

3C) are classified as ‘iterative relocation

algorithms’. These algorithms do not

assume that the data are hierarchical.

Instead, they require the user to estimate

the number of clusters present in the data.

Sometimes these estimates can be inferred

Improper outlierhandling can decreaseclustering

Supervised versusunsupervised clustering

33 4 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEF INGS IN BIOINFORMATICS . VOL 6. NO 4. 331–343. DECEMBER 2005

Boutros and Okey

from knowledge of the biological system.

For example, in the k-means clustering

(Figure 3B) we chose three clusters of

samples (mirroring the three

developmental time-points) and eight

clusters of genes (to reflect the idea that at

any time point a gene can go up or down,

leading to 23 ¼ 8 combinations). Because

such estimates can ‘freeze’ prior bias into

the analysis, several methods have been

developed to estimate the number of

clusters inherent in the data.24

Iterative relocation algorithms work by

first randomly placing every gene into one

of the preset clusters. They then proceed

to move (relocate) the genes one at a time

(iteratively) to produce increasingly

homogeneous clusters. This can require

many steps (relocations), but on modern

computers these algorithms generally

require minutes, rather than hours or

days.

The primary difference between k-

means clustering and self-organising maps

lies in the type of clusters assigned. In k-

means clustering each gene is assigned to a

distinct and independent cluster; a total of

k clusters are used, giving the algorithm its

��0:�� 0�>�� 0��>��0��

Figure 3: Hierarchical clustering (A) generates dendrograms for each axis of the data: the y-axis represents genes and thex-axis represents samples. Each week has four replicated animals, and these biological replicates can be seen to clustertogether on the x-axis. On the y-axis, the genes are ordered by the similarity of their expression profiles into a tree-likestructure called a dendrogram. K-means clustering (B) was performed with k ¼ 8 for the genes (y) axis and k ¼ 3 for thesamples (x) axis. It was repeated 100,000 times and the solution with minimal within-cluster distance was chosen. BecauseSOMs (C) are themselves two-dimensional, there is no easy way to display SOM analysis of both genes and samples on asingle plot. The SOM figures give the response of the average gene with each cell for each of the three time-frames. Notethat adjacent cells of SOM are more similar to one another than non-adjacent cells



name. In contrast, self-organising maps

use a ‘gradient’ of related clusters. A two-

dimensional grid of clusters is created, and

the user supplies the size of the grid as a

parameter. Each grid within the map is

more closely related to the adjacent

clusters than it is to non-adjacent clusters.

For example, in Figure 3C the grid (1,1)

is far more similar to grid (1,2) than it is to

grid (1,4).

Details on clustering andclustering algorithmsThe many algorithms used in the

clustering of microarray data have been

described in detail several times. The

classic text on statistical pattern-

recognition, Duda et al., includes a lucid

overview of clustering algorithms.8

Several excellent reviews of the

mathematical details of clustering

algorithms have also appeared.5,7,25–27

Further, a large number of primary

publications making extensive use of

clustering have been published.3,28–31

Other workers have specifically studied

the performance of different clustering

algorithms for identifying co-expressed

sets of genes.32,33 Finally, a large number

of methods for visualising,34,35

improving20,35–38 and validating39–42 the

performance of clustering algorithms on

microarray data have been developed.

WHY IS CLUSTERINGPOSSIBLE?Co-expression: The rationalefor clusteringUnsupervised pattern recognition is one

of the most common analytical tasks

performed on microarray data. The heat-

map type of displays introduced by

Michael Eisen19 are intuitive and these

plots abound in journals. But aside from

providing a clear display methodology,

what is the underlying biological

motivation for clustering expression data?

To answer this question it is helpful to

return to some basic biological concepts.

The phenomenon of genes displaying

similar patterns of expression over a

variety of conditions is termed ‘co-

expression’. Virtually all clustering

techniques identify sets of genes that are

co-expressed. The question of why

clustering is possible is essentially

equivalent to asking why we expect genes

to be co-expressed. Fortunately co-

expression is supported both by an

immense body of empirical observations

and by detailed mechanistic explanation.

We first explain co-expression intuitively,

then mechanistically.

Co-expression explained intuitively

Intuitively it makes sense that if two or

more separate genes are involved in a

single biological process, they will be

biologically ‘needed’ at the same time.

Because they are needed in unison, they

will also be expressed in unison – co-

expression. For example, drugs and

environmental chemicals are eliminated

from the body by the combined action of

multiple enzymes. Phase I enzymes (such

as the various species of cytochrome

P450) convert foreign chemicals into

chemically reactive intermediates. Phase

II enzymes then conjugate these reactive

intermediates with water-soluble moieties

such as glutathione or glucuronic acid,

thereby facilitating elimination of the

foreign chemical via the kidney. In

animals exposed to polycyclic aromatic

hydrocarbons, expression of specific forms

of cytochrome P450 (CYP1A1, CYP1A2

and CYP1B1) is highly upregulated, ie

the enzymes are induced at the

transcriptional level. In most instances,

Phase II conjugating enzymes such as

glutathione S-transferases and UDP-

glucuronosytransferases are simultaneously

induced so that an increase in reactive

intermediates (generated by the P450s) is

accompanied by an increase in Phase II

enzymes that conjugate these reactive

metabolites and detoxify them.43 These

gene batteries will be co-expressed:

conditions that ‘require’ the activity of

one gene-product will usually ‘require’

the activity of the others.

From this viewpoint, co-expression can

be seen as an inherent property of

biological systems. When two or more

Clustering identifies co-expressed genes


Boutros and Okey

genes function in the same biological

process – either independently or by

complementing one another – they will

be expressed at the same time. Co-

expression follows naturally, and clusters

of gene expressions are inevitably

observed in microarray studies.

Co-expression explained mechanistically

Our intuitive description of co-expression

emphasizes co-expression as a necessary

outgrowth of the fact that biological

systems have multiple coordinated ways

of responding to stimuli. The body

responds to drugs or disease in many

different ways, and the effects occur in

parallel, as well as in series.

From a mechanistic point of view,

however, this explanation remains

unsatisfying. It helps explain, from a

teleological perspective, why co-

expression occurs but not how it occurs.

The key is that biological responses to

external stimuli are not stochastic – they

are regulated. The regulatory systems that

help coordinate biological responses

underlie co-expression: genes that are co-

expressed are often coordinately regulated via a

common mechanism.

Continuing our earlier example, the

co-expression of various xenobiotic

metabolising enzymes can be explained

mechanistically: the P450 enzymes and

the conjugating enzymes are coordinately

regulated by a common ligand-dependent

transcription factor, the AH receptor

(reviewed in Okey et al.44). These genes

will thus be co-expressed: conditions that

‘require’ the activity of one gene usually

will simultaneously alter the activity of

other genes that share the regulatory

mechanism.

In other words, co-expression can be

the result of co-regulation. It is important

to emphasise here that co-regulation is

not the only mechanism that can lead to

co-expression. If only one or two

conditions are considered, many genes

will appear to respond in a similar fashion.

As more and more conditions are added,

however, these types of random

observations will be diminished.

Alternately, two genes may be regulated

separately, but their independent

regulatory mechanisms may overlap

significantly, making it difficult to

distinguish their separability.

Despite these caveats, it is often

reasonable to make the assumption that

co-expressed genes are co-regulated

through one or a few common

mechanisms. For example, many of the

genes that are co-expressed in response to

the toxicant TCDD respond through a

specific transcription factor, the aryl

hydrocarbon receptor (AHR). These

AHR-mediated genes fall into two

distinct mechanisms, depending on

whether the AHR functions as a

transcription factor45,46 or as a

coactivator.47

Based on this assumption, clustering

becomes a way of identifying sets of genes

that are putatively co-regulated, thereby

generating testable hypotheses. For

example, if the regulatory mechanisms for

one or two genes in a cluster are known,

the hypothesis would be that all genes in

the cluster are regulated by that

mechanism. This hypothesis can be tested

through such approaches as gene-

knockouts or knockdowns, the use of

transfection assays with reporter-

constructs, or direct assessment of

transcription-factor binding.

The regulatory mechanisms that alter

gene expression are not fully understood,

but the basic outlines are well known.

Specific proteins, called transcription

factors, are capable of binding to short,

specific sequences of DNA. These specific

sequences, called transcription factor

binding sites (TFBS), usually lie close to

the coding regions of genes they regulate.

When a transcription factor binds to its

TFBS, it modulates expression of that

gene. Transcription factors can either

induce (activate) or repress expression.

In summary, co-expressed genes often

also are co-regulated. This regulation is

achieved by short sequences of DNA

(TFBSs), to which a transcription factor

can bind. This relationship has been

empirically demonstrated many times and

Co-expressed is oftenthe result of co-regulation



was recently validated experimentally

with high-throughput ChIP-Chip

studies.32,48,49 For example, sets of co-

expressed yeast genes have been used to

clearly identify novel TFBSs, which have

then been identified with specific

transcription factors and biological

regulatory mechanisms.50

CLUSTERING AND CO-REGULATION: EXAMPLESThe presence of coordinated regulation in

biological systems might seem trivial, but

in fact, the significance is quite profound.

The knowledge that co-expressed genes

are often co-regulated gives us a platform

upon which to integrate different types of

biological data. By amalgamating

(integrating) multiple data sets, one can

remove false-positives and help identify

disparities between the different levels of

data. Further, integration helps to explain

biological phenomena on deeper levels

than a single data set alone could.

Integration can be performed at many

levels. Below, we outline three recent

examples of how clustering algorithms

were used to integrate microarray

expression data with other types of data to

yield unexpected conclusions. The first

example details the integration of array

data with functional annotation, the

second with genome-wide transcription-

factor binding-site searches, and the third

with proteomic data.

Example I: Expression data andfunctional annotationHughes et al.1 demonstrated that

clustering of expression data is one way to

assess the function of novel,

uncharacterised genes. They considered

300 distinct, significantly different yeast

conditions. These conditions included

both genetic mutations and chemical

treatments. A microarray experiment was

performed for each condition. The entire

data set (compendium, in Hughes’s

terminology) was then subjected to two-

dimensional agglomerative hierarchical

clustering. Sets of genes that showed

similar expression patterns (ie clustered

together) were then annotated with any

known functions. Uncharacterised genes

were then hypothesised to be functionally

related to other genes in their cluster. In

this way, function could be inferred from

a set of co-expressed genes, thereby

generating a set of testable hypotheses

about the biological roles of

uncharacterised transcripts.

It is important to note that the 300

individual experiments were each of

biological interest. The individual

knockouts and chemical treatments used

could each be analysed for their effects on

gene expression across the entire yeast

transcriptome. By aggregating this entire

data set, however, Hughes et al. were able

to extract an enormous amount of

additional information regarding gene co-

regulation and function that was not

available from any of the individual data

sets in isolation. In yeast, this work has

been extended to include both knock-

outs of non-essential genes and knock-

downs of essential genes, allowing the

effects of genetic perturbation of every

single gene to be exploited for

identification of co-regulation.51

This type of integration of expression

data with functional annotation has been

used many times.28,29 It is becoming a

standard part of comprehensive

microarray data analysis, and a wide array

of software tools have been developed to

automate the testing of each individual

cluster for different functional

annotations.52–54

Example II: Expression dataand TFBS identificationSegal et al. extended this integration

approach to encompass transcription-

factor data.55,56 They did so by integrating

yeast gene-expression data with

automated transcription factor binding

site (TFBS) searches in an attempt to

rationalise the mechanism of regulation.

They compared the results of their

clustering based on both expression and

TFBS data with the results of clustering

based only on expression data, and

demonstrated the superiority of combined

Using co-expression topredict gene function


Boutros and Okey

clusters, in part by the greater

concordance of functional annotations

within a cluster. In other words, Segal et

al. used the integration of functional data

as a tool to help test their algorithm for

combining TFBS and expression data.

Recall that a TFBS is a short sequence

of DNA near the coding region of a gene

that modulates the gene’s expression level.

Because TFBSs are short sequences, both

functional and non-functional sites are

present throughout the genome. The

false-positives generated by non-

functional sites often outnumber the true

positives, leading to the development of a

variety of methods for winnowing non-

functional TFBSs from functional ones57

and for rationalising observed patterns of

co-regulation based on the activity and

expression levels of specific transcription

factors. One such method is to search sets

of co-expressed genes for common,

conserved sequences. A sequence that

appears in an entire set of co-expressed

genes can be hypothesised to be at least

partially responsible for the regulation

underlying this co-regulation.

Segal et al. exploited this idea

extensively. Like Hughes et al., they

clustered a large set of distinct microarray

experiments, then tried to rationalise the

observed clusters. In this case, however,

the rationalisation was initially done on

the basis of observed TFBSs rather than in

terms of functional annotation. Segal et

al.’s algorithm did not stop there: they

used the initial clustering and TFBS

searches as seeds for an iterative algorithm.

If a cluster did not contain a known

TFBS, then they automatically searched

for a novel one. This TFBS knowledge

was then incorporated back into a new

round of clustering (Figure 4). After 50–

60 iterations of this process, the clustering

patterns stabilised and the resulting groups

were interpreted both with functional

annotation (as per Hughes) and with

manual inspection of how the proposed

regulatory mechanisms matched the

known biology of yeast regulation.

Although this study was somewhat

limited by some of the simplifying

assumptions used – in particular the

assumption that the regulatory species

themselves will change expression – it

helped identify a large number of novel

regulatory mechanisms. Further, it

provided one of the first large-scale

examples of integrating expression-data

with TFBS data to estimate regulatory

networks. This approach will be

particularly valuable when it is used as a

complement to other high-throughput

approaches to identify regulatory

networks.58 This approach has since been

extended in scope59 to include mRNA

degradation rates and transcription factor

levels. A similar approach has also been

developed for higher eukaryotes, and was

tested using the human cell cycle.60

Example III: Expression dataand proteomic dataA final example of the importance of co-

regulated genes involves the integration of

genomic and proteomic data. Genomic

experiments assay biological activity at the

level of genes, whereas proteomic

experiments assay activity at the protein

level. Ideker et al. performed an in-depth

study (yet again in yeast) integrating these

Using co-expression toidentify TFBSs

2��0�� 4��0��

��0<4��0��

Figure 4: The incorporation of TFBS datainto a pattern-recognition scheme candramatically improve the coherency of theclusters identified. Many iterations of theclustering/TFBS searching procedure may berequired to identify an invariant pattern ofclusters



two sources of data.61,62 In particular, this

study integrated genomic assays of

mRNA expression levels with proteomic

studies reflecting protein–protein

interactions and proteomic studies

reflecting protein concentrations. They

focused on a single biochemical pathway:

the utilisation of the sugar galactose. They

first investigated the effects of knocking

out different enzymes in the galactose

pathway to identify the transcriptional

network underlying the pathways. Next,

they integrated this information with

protein–protein interactions and used this

combined data set to significantly expand

and clarify the textbook representation of

this important metabolic pathway.61

Although mRNA levels are not always

predictive of protein levels,63 they are

typically correlated. Genome-wide

protein and mRNA levels have been

extensively compared in yeast,64,65 and

were found to be highly related. In

particular, there was found to be, on

average, about 5,000 protein molecules

per mRNA molecule. This ratio was

largely independent of whether the gene

was expressed at a high or a low level.

These studies have shown that the yeast

translatome – the set of expressed yeast

proteins – is biased towards short,

cytoplasmic proteins primarily involved in

fundamental cell-maintenance

functions.66

Similar studies, directly comparing

protein and mRNA levels in eukaryotic

systems have been performed

occasionally,67–70 but more often parallel

proteomic and genomic studies have been

done with no attempt made to integrate

the data.71,72 It is not yet clear if protein

and mRNA levels correlate in eukaryotic

cells, or if the level of correlation is at all

cell-type or species dependent. Pattern-

recognition techniques will have an

important role to play in answering these

types of questions as greater volumes of

proteomic data become available.

CONCLUSIONSThe biological rationale underlying the

clustering of microarray data is the fact

that many co-expressed genes are also co-

regulated. This trivial observation enables

an enormous variety of new science: the

functional annotation of uncharacterised

genes, the de novo identification of TFBSs,

and the elucidation of complete biological

pathways. Each of these findings is based

on exploitation of unsupervised pattern-

recognition techniques. Pattern discovery

has become an essential part of the

‘-omic’ sciences.

The increasing role for machine-

learning is a natural outgrowth of

reductionist biology. In theory, co-

regulation can simplify our understanding

of biological response. Instead of

characterising how a drug or toxicant

affects each gene in a cell, we can simply

identify how it changes cellular

regulation. We no longer consider

thousands of separate genes, but rather

only hundreds of regulators. Given the

diversity of regulatory mechanisms in a

cell – signal transduction, transcriptional,

translational and post-translational – it is

certain that identifying complete

regulatory networks will require

immensely complex pattern recognition.

The co-regulation hypothesis is very

powerful. However, it is just that – a

hypothesis. It can be supported by in silico

evidence such as functional annotation

and integration with TFBS or proteomic

data. But while these sources can give

additional confidence in the assertion that

a pair of genes is co-regulated, they do

not provide proof. Instead, at least some

of the putative co-regulated genes must

be shown, in vivo, to be co-regulated

through genetic manipulation of the

relevant transcription factors. And even in

such cases, additional in vitro work to

detail the mechanism of co-regulation is

often necessary. In short: in silico work

may generate valuable hypotheses, but

traditional techniques are still required to

validate them.

As an increasing number and variety of

high-throughput data sets become

available, it is increasingly important to

find methods for integrating different

levels of biological data. This integration

Hypotheses generatedby clustering need


Boutros and Okey

lies at the heart of the emerging discipline

of systems biology. Unsupervised pattern-

recognition methods are a critical tool in

that integration and an important

technique for the interpretation of

biological data.

Acknowledgments

The authors gratefully thank Dr Albert Wong and

Ms Mawahib Semeralul for permission to use their

brain-development data in figures. The authors also

thank Dr Rafal Kustra and Dr Hugh Chipman for

stimulating conversations and Ms Joanne Kim for

critical reading of the manuscript. A.B.O.

recognises funding from the Canadian Institutes of

Health Research (grant MOP-57903) and the

National Cancer Institute of Canada with funds

from the Canadian Cancer Society.

References

1. Hughes, T. R., Marton, M. J., Jones, A. R.et al. (2000), ‘Functional discovery via acompendium of expression profiles’, Cell, Vol.102(1), pp. 109–126.

2. Tong, A. H., Lasage, G., Bader, G. D. et al.(2004), ‘Global mapping of the yeast geneticinteraction network’, Science, Vol. 303(5659),pp. 808–813.

3. Whitfield, C. W., Cziko, A. M. andRobinson, G. E. (2003), ‘Gene expressionprofiles in the brain predict behavior inindividual honey bees’, Science, Vol. 302(5643),pp. 296–299.

4. Golub, T. R., Slonim, D. K., Tamayo, P. et al.(1999), ‘Molecular classification of cancer:Class discovery and class prediction by geneexpression monitoring’, Science, Vol.286(5439), pp. 531–537.

5. Sherlock, G. (2001), ‘Analysis of large-scalegene expression data’, Brief. Bioinformatics,Vol. 2(4), pp. 350–362.

6. Sherlock, G. (2000), ‘Analysis of large-scalegene expression data’, Curr. Opin. Immunol.,Vol. 12(2), pp. 201–205.

7. Azuaje, F. (2003), ‘Clustering-basedapproaches to discovering and visualisingmicroarray data patterns’, Brief. Bioinformatics,Vol. 4(1), pp. 31–42.

8. Duda, R. O., Hart, P. E. and Stork, D. G.(2001), ‘Pattern Classification’, 2nd edn,Wiley, New York, p. 654.

9. Jain, A. N., Tokuyasu, T. A., Snijders, A. M.et al. (2002), ‘Fully automatic quantification ofmicroarray image data’, Genome Res., Vol.12(2), pp. 325–332.

10. Tseng, G. C., Oh, M. K., Rohlin, L. et al.(2001), ‘Issues in cDNA microarray analysis:Quality filtering, channel normalization,

models of variations and assessment of geneeffects’, Nucleic Acids Res., Vol. 29(12), pp.2549–2557.

11. Wang, X., Ghosh, S. and Guo, S. W. (2001),‘Quantitative quality control in microarrayimage processing and data acquisition’, NucleicAcids Res., Vol. 29(15), E75.

12. Jenssen, T. K., Langaas, M., Kuo, W. P. et al.(2002), ‘Analysis of repeatability in spottedcDNA microarrays’, Nucleic Acids Res., Vol.30(14), pp. 3235–3244.

13. Bolstad, B. M., Irizarry, R. A., Astrand, M.and Speed, T. P. (2003), ‘A comparison ofnormalization methods for high densityoligonucleotide array data based on varianceand bias’, Bioinformatics, Vol. 19(2), pp.185–193.

14. Workman, C., Lensen, L. J., Jarmer, H. et al.(2002), ‘A new non-linear normalizationmethod for reducing variability in DNAmicroarray experiments’, Genome Biol., Vol.3(9), p. research0048.

15. Yang, Y. H., Dudoit, S., Luu, P. et al. (2002),‘Normalization for cDNA microarray data: Arobust composite method addressing single andmultiple slide systematic variation’, NucleicAcids Res., Vol. 30(4), p. e15.

16. Smyth, G. K. and Speed, T. (2003),‘Normalization of cDNA microarray data’,Methods, Vol. 31(4), pp. 265–273.

17. Smyth, G. K. (2003), ‘Linear models andempirical Bayes methods for assessingdifferential expression in microarrayexperiments’, Stat. Appl. Genet. Mol. Biol.,Vol. 3(1), pp. 1–26.

18. Pan, W. (2002), ‘A comparative review ofstatistical methods for discovering differentiallyexpressed genes in replicated microarrayexperiments’, Bioinformatics, Vol. 18(4), pp.546–554.

19. Eisen, M. B., Spellman, P. T., Brown, P. O.and Botstein, D. (1998), ‘Cluster analysis anddisplay of genome-wide expression patterns’,Proc. Natl Acad. Sci. USA, Vol. 95(25), pp.14863–14868.

20. Yeung, K. Y., Fraley, C., Murua, A. et al.(2001), ‘Model-based clustering and datatransformations for gene expression data’,Bioinformatics, Vol. 17(10), pp. 977–987.

21. Waring, J. F., Gum, R., Morfitt, D. et al.(2002), ‘Identifying toxic mechanisms usingDNA microarrays: Evidence that anexperimental inhibitor of cell adhesionmolecule expression signals through the arylhydrocarbon nuclear receptor’, Toxicology,Vol. 181–182, pp. 537–550.

22. Thomas, R. S., Penn, S. G., Holden, K. et al.(2002), ‘Sequence variation and phylogenetichistory of the mouse Ahr gene’,Pharmacogenetics, Vol. 12(2), pp. 151–163.



23. de Hoon, M. J., Imoto, S., Nolan, J. andMiyano, S. (2004), ‘Open source clusteringsoftware’, Bioinformatics, Vol. 20(9), pp.1453–1454.

24. Dudoit, S. and Fridlyand, J. (2002), ‘Aprediction-based resampling method forestimating the number of clusters in a dataset’,Genome Biol., Vol. 3(7), p. research0036.

25. Altman, R. B. and Raychaudhuri, S. (2001),‘Whole-genome expression analysis: challengesbeyond clustering’, Curr. Opin. Struct. Biol.,Vol. 11(3), pp. 340–347.

26. Raychaudhuri, S., Sutphin, P. D., Chang, J. T.and Altman, R. B. (2001), ‘Basic microarrayanalysis: Grouping and feature reduction’,Trends Biotechnol., Vol. 19(5), pp. 189–193.

27. Simon, R., Radmacher, M. D., Dobbin, K.and McShane, L. M. (2003), ‘Pitfalls in the useof DNA microarray data for diagnostic andprognostic classification. J. Natl Cancer Inst.,2003. 95(1), pp. 14–18.

28. Stuart, J. M., Sergal, E., Koller, D. and Kim,S. K. (2003), A gene-coexpression network forglobal discovery of conserved geneticmodules’, Science, Vol. 302(5643), pp. 249–255.

29. Tavazoie, S., Hughes, J. D., Campbell, M. J.et al. (1999), ‘Systematic determination ofgenetic network architecture’, Nat. Genet.,Vol. 22(3), pp. 281–285.

30. Daoud, S. S., Munson, P. J., Reinhold, W.et al. (2003), ‘Impact of p53 knockout andtopotecan treatment on gene expressionprofiles in human colon carcinoma cells: Apharmacogenomic study’, Cancer Res., Vol.63(11), pp. 2782–2793.

31. Fertuck, K. C., Eckel, J. E., Gennings, C. andZacharewski, T. R. (2003), ‘Identification oftemporal patterns of gene expression in theuteri of immature, ovariectomized micefollowing exposure to ethynylestradiol’,Physiol. Genomics, Vol. 15(2), pp. 127–141.

32. Allocco, D. J., Kohane, I. S. and Butte, A. J.(2004), ‘Quantifying the relationship betweenco-expression, co-regulation and genefunction’, BMC Bioinformatics, Vol. 5(1), p. 18.

33. Yeung, K. Y., Medvedovic, M. andBumgarner, R. E. (2004), ‘From co-expressionto co-regulation: How many microarrayexperiments do we need?’, Genome Biol., Vol.5(7), p. R48.

34. Gilbert, D. R., Schroeder, M. and van Helden,J. (2000), ‘Interactive visualization andexploration of relationships between biologicalobjects’, Trends Biotechnol., Vol. 18(12), pp.487–494.

35. Sultan, M., Wigle, D. A., Cumbaa, C. A. et al.(2002), ‘Binary tree-structured vectorquantization approach to clustering andvisualizing microarray data’, Bioinformatics, Vol.18 (Suppl 1), pp. S111–119.

36. Hanisch, D., Zien, A., Zimmer, R. andLengauer, T. (2002), ‘Co-clustering ofbiological networks and gene expression data’,Bioinformatics, Vol. 18 (Suppl 1), pp.S145–154.

37. McLachlan, G. J., Bean, R. W. and Peel, D.(2002), ‘A mixture model-based approach tothe clustering of microarray expression data’,Bioinformatics, Vol. 18(3), pp. 413–422.

38. Getz, G., Levine, E. and Domany, E. (2000),‘Coupled two-way clustering analysis of genemicroarray data’, Proc. Natl Acad. Sci. USA,Vol. 97(22), pp. 12079–12084.

39. Datta, S. (2003), ‘Comparisons and validationof statistical clustering techniques formicroarray gene expression data’,Bioinformatics, Vol. 19(4), pp. 459–466.

40. Gat-Viks, I., Sharan, R. and Shamir, R.(2003), ‘Scoring clustering solutions by theirbiological relevance’, Bioinformatics, Vol.19(18), pp. 2381–2389.

41. Kerr, M. K. and Churchill, G. A. (2001),‘Bootstrapping cluster analysis: Assessing thereliability of conclusions from microarrayexperiments’, Proc. Natl Acad. Sci. USA, Vol.98(16), pp. 8961–8965.

42. McShane, L. M., Radmacher, M. D., Freidlin,B. et al. (2002), ‘Methods for assessingreproducibility of clustering patterns observedin analyses of microarray data’, Bioinformatics,Vol. 18(11), pp. 1462–1469.

43. Nebert, D. W., Dalton, T. P., Okey, A. B.and Gonzalez, F. J. (2004), ‘Role of arylhydrocarbon receptor-mediated induction ofthe CYP1 enzymes in environmental toxicityand cancer’, J. Biol. Chem., Vol. 279(23), pp.23847–23850.

44. Okey, A. B., Franc, M. A., Moffat, I. D. et al.(2005), ‘Toxicological implications ofpolymorphisms in receptors for xenobioticchemicals: The case of the aryl hydrocarbonreceptor’, Toxicol. Appl. Pharmacol., Vol. 207(2 Suppl), pp. 43–51.

45. Denison, M. S. and Nagy, S. R. (2003),‘Activation of the aryl hydrocarbon receptorby structurally diverse exogenous andendogenous chemicals’, Annu. Rev. Pharmacol.Toxicol., Vol. 43, pp. 309–334.

46. Sun, Y. V., Boverhof, D. R., Burgoon, L. D.et al. (2004), ‘Comparative analysis of dioxinresponse elements in human, mouse and ratgenomic sequences’, Nucleic Acids Res., Vol.32(15), pp. 4512–4523.

47. Boutros, P. C., Moffat, I. K., Franc, M. A.et al. (2004), ‘Dioxin-responsive AHRE-IIgene battery: Identification by phylogeneticfootprinting’, Biochem. Biophys. Res. Commun.,Vol. 321(3), pp. 707–715.

48. Mao, D. Y., Watson, J. D., Yan, P. S. et al.(2003), ‘Analysis of Myc bound loci identifiedby CpG island arrays shows that Max is


Boutros and Okey

essential for Myc-dependent repression’, Curr.Biol., Vol. 13(10), pp. 882–886.

49. Odom, D. T., Zizlsperger, N., Gordon, D. B.et al. (2004), ‘Control of pancreas and livergene expression by HNF transcription factors’,Science, Vol. 303(5662), pp. 1378–1381.

50. Zhu, Z., Pilpel, Y. and Church, G. M. (2002),‘Computational identification of transcriptionfactor binding sites via a transcription-factor-centric clustering (TFCC) algorithm’, J. Mol.Biol., Vol. 318(1), pp. 71–81.

51. Mnaimneh, S., Davierwala, A. P., Haynes, J.et al. (2004), ‘Exploration of essential genefunctions via titratable promoter alleles’, Cell,Vol. 118(1), pp. 31–44.

52. Beissbarth, T. and Speed, T. (2004), ‘GOstat:Find statistically overrepresented geneontologies within a group of genes’,Bioinformatics, Vol 20(9), pp. 1464–1465.

53. Cheng, J., Sun, S., Tracy, A. et al. (2004),‘NetAffx Gene Ontology Mining Tool: Avisual approach for microarray data analysis’,Bioinformatics, Vol 20(9), pp. 1462–1463.

54. Zeeberg, B. R., Feng, W., Wang, G. et al.(2003), ‘GoMiner: A resource for biologicalinterpretation of genomic and proteomic data’,Genome Bio.l, Vol. 4(4), p. R28.

55. Segal, E., Shapira, M., Regev, A. et al. (2003),‘Module networks: Identifying regulatorymodules and their condition-specific regulatorsfrom gene expression data’, Nat. Genet., Vol.34(2), pp. 166–176.

56. Segal, E., Yelensky, R. and Koller, D. (2003),‘Genome-wide discovery of transcriptionalmodules from DNA sequence and geneexpression’, Bioinformatics, Vol. 19 (Suppl 1),pp. I273–I282.

57. Wasserman, W. W. and Sandelin, A. (2004),‘Applied bioinformatics for the identificationof regulatory elements’, Nat. Rev. Genet., Vol.5(4), pp. 276–287.

58. Lee, T. I., Rinaldi, N. J., Robert, F. et al.(2002), ‘Transcriptional regulatory networks inSaccharomyces cerevisiae’, Science, Vol. 298(5594),pp. 799–804.

59. Nachman, I., Regev, A. and Friedman, N.(2004), ‘Inferring quantitative models ofregulatory networks from expression data’,Bioinformatics, Vol. 20 (Suppl 1), pp. I248–I256.

60. Elkon, R., Linhart, C., Sharan, R. et al.(2001), ‘Genome-wide in silico identification oftranscriptional regulators controlling the cellcycle in human cells’, Genome Res., Vol. 13(5),pp. 773–780.

61. Ideker, T., Thorsson, V., Ranish, J. A. et al.(2001), ‘Integrated genomic and proteomicanalyses of a systematically perturbed metabolicnetwork’, Science, Vol. 292(5518), pp.929–934.

62. Ideker, T., Ozier, O, Schwikowski, B. andSiegel, A. F. (2002), ‘Discovering regulatoryand signalling circuits in molecular interactionnetworks’, Bioinformatics, Vol. 18 (Suppl 1),pp. S233–240.

63. McFadyen, M. C., Rooney, P. H., Melvin,W. T. and Murray, G. I. (2003), ‘Quantitativeanalysis of the Ah receptor/cytochrome P450CYP1B1/CYP1A1 signalling pathway’,Biochem. Pharmacol., Vol. 65(10), pp. 1663–1674.

64. Ghaemmaghami, S., Huh, W. K., Bower, K.et al. (2003), ‘Global analysis of proteinexpression in yeast’, Nature, Vol. 425(6959),pp. 737–741.

65. Huh, W. K., Falvo, J. V., Gerke, L. C. et al.(2003), ‘Global analysis of protein localizationin budding yeast’, Nature, Vol. 425(6959), pp.686–691.

66. Greenbaum, D., Jansen, R. and Gerstein, M.(2002), ‘Analysis of mRNA expression andprotein abundance data: an approach for thecomparison of the enrichment of features inthe cellular population of proteins andtranscripts’, Bioinformatics, Vol. 18(4), pp.585–596.

67. Lian, Z., Kluger, Y, Greenbaum, D. S. et al.(2002), ‘Genomic and proteomic analysis ofthe myeloid differentiation program: Globalanalysis of gene expression during induceddifferentiation in the MPRO cell line’, Blood,Vol. 100(9), pp. 3209–3220.

68. Feezor, R. J., Baker, H. V., Ziao, W. et al.(2004), ‘Genomic and proteomic determinantsof outcome in patients undergoingthoracoabdominal aortic aneurysm repair’,J. Immunol., Vol. 172(11), pp. 7103–7109.

69. Tian, Q., Stepaniants, S. B., Mao, M. et al.(2004), ‘Integrated genomic and proteomicanalyses of gene expression in mammaliancells’, Mol. Cell Proteomics, Vol. 3(10), pp.960–969.

70. Kim, C. H., Kim do K., Choi, S. J. et al.(2003), ‘Proteomic and transcriptomic analysisof interleukin-1 beta treated lung carcinomacell line’, Proteomics, Vol. 3(12), pp. 2454–2471.

71. Kirby, J., Menzies, F. M., Cookson, M. R.et al. (2002), ‘Differential gene expression in acell culture model of SOD1-related familialmotor neurone disease’, Hum. Mol. Genet.,Vol. 11(17), pp. 2061–2075.

72. Allen, S., Heath, P. R., Kirby, J. et al. (2003),‘Analysis of the cytosolic proteome in a cellculture model of familial amyotrophic lateralsclerosis reveals alterations to the proteasome,antioxidant defenses, and nitric oxide syntheticpathways’, J. Biol. Chem., Vol. 278(8), pp.6371–6383.



unsupervised pattern - informatics: indiana university boutros is a phd student in the department of...

Documents