
BNF074, Fall 2003

Making Sense of Microarray Data

Carsten Peterson & Markus Ringnér

Complex Systems Division, Department of Theoretical Physics, Lund University

[email protected] 046-2229337

1. Introduction

- Data representation
- Normalization
- Metrics

2. Clustering

- K-means
- Hierarchical
- Self-organization
- Examples

3. Supervised Learning

- Simple statistics
- Regression models
- Examples

4. Kinetic Equations

These lectures serve as an elementary introduction to microarray analysis. They have the same scope as the microarray lecture in BIM083, but do things in somewhat more detail. Furthermore, they serve as an introduction to the computer exercises.


1 Introduction

Goals

Map out interactions between genes and gene products inside cells and relate to phenotypes, functionality etc.

Measurements

1. RNA

- ”gene chips”/oligonucleotides (Cf. visit to Immunotechnology Dept.)
- cDNA microarrays/cDNA genes (Cf. visit to BMC.)

- For a review that provides a good overview of the cDNA array technology see D. Duggan et al., Nat. Genet. vol. 21, suppl. 10-14 (1999) (freely available at: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v21/n1s/full/ng0199supp 10.html)

2. Proteins

- 2D gels etc. (Proteomics Lab/next week)

(1) and (2) are to some extent complementary. Also, remember that 10-20 % of proteins in HUGO correspond to intercellular signaling.

Analysis

Regardless of measurement techniques one has the steps:

1. Clean up the data – remove noise.

2. Map out underlying structures in the data.

- Data mining tools

- Compare with models (coming, not yet mature)

3. Link results from (2) to databases, regulatory motifs, ontologies, pathways, functionality, etc.

In what follows we illustrate with cDNA technology, but the procedures are generic. For a review see J. Quackenbush, Nat. Rev. Genet. (2001).


Types of Measurements

• Static

Each experiment (hybridization to one array slide) corresponds to e.g. a tissue sample. Clustering and supervised learning techniques are used to classify data and find relations between gene expressions and categories.

• Dynamic

Each experiment corresponds to a time-point for e.g. a cell line. Clustering techniques are used to find common behaviour among genes as a function of time. Dynamic data can probe causal structures.

Data Representation

We represent each experiment by a high-dimensional column vector ~x(k), and a set of experiments by the following matrix:

\bigl(\,\vec{x}(1)\;\;\vec{x}(2)\;\;\cdots\;\;\vec{x}(k)\;\;\cdots\;\;\vec{x}(M)\,\bigr) =
\begin{pmatrix}
x_1(1) & x_1(2) & \cdots & x_1(k) & \cdots & x_1(M) \\
x_2(1) & x_2(2) & \cdots & x_2(k) & \cdots & x_2(M) \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_i(1) & x_i(2) & \cdots & x_i(k) & \cdots & x_i(M) \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_N(1) & x_N(2) & \cdots & x_N(k) & \cdots & x_N(M)
\end{pmatrix}

where x_i(k) is e.g. the logarithmic ratio of red versus green intensities for gene i and sample k.

Typically one has M << N. Indices i and j will be used for genes and k and l for experiments, which appear as arguments in our notation.

In this picture each of the M experiments is a data point in N-dimensional gene space. One could also look at it the other way around; each gene is an M-dimensional point in sample space. The latter will be the case when studying time course experiments. (Note: Quackenbush defines genes in ’expression space’ and experiments in ’experiment space’.)
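In practice this matrix is just a numerical array. A minimal NumPy sketch, with invented log-ratio values, showing both views of the data:

```python
import numpy as np

# Toy expression matrix with invented values: N = 4 genes (rows), M = 3
# experiments (columns); entries are log ratios x_i(k) as in the text.
X = np.array([
    [ 0.5, -0.2,  1.1],
    [-1.3,  0.4,  0.0],
    [ 2.0,  1.8,  2.2],
    [ 0.1,  0.0, -0.1],
])

experiment_1 = X[:, 0]   # one experiment: a point in N-dimensional gene space
gene_3 = X[2, :]         # one gene: a point in M-dimensional sample space

assert X.shape == (4, 3)  # N x M; in real data typically M << N
```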


Normalization

The matrix above only contains ratios of red versus green intensities and not the intensities themselves. However, before the intensities can be used to calculate ratios they have to be normalized to adjust for differences in the efficiencies of the fluorescent dyes and in the quantities of initial RNA for the two samples used in the hybridization. There are many widely used methods for normalization. It is important to remember that each method is based on assumptions that may not be valid for any given data set. Two common methods are:

Total intensity normalization. Here one assumes that the total amount of initial mRNA is the same for the sample of interest and the reference sample. Furthermore, one assumes that most of the genes are unchanged in expression in the sample as compared to the reference. Hence, the average log ratio is expected to be zero. Under these assumptions, a normalization factor can be calculated and used to re-scale the log ratio for each gene in the array.
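At the log-ratio level this amounts to subtracting the per-array mean, so that the average log ratio becomes zero. A minimal sketch, with invented numbers:

```python
import numpy as np

def normalize_log_ratios(log_ratios):
    """Re-scale log ratios so their mean is zero (total-intensity style).

    Assumes most genes are unchanged between sample and reference."""
    return log_ratios - np.mean(log_ratios)

raw = np.array([0.9, 1.1, 0.8, 1.2, 1.0])  # invented: a systematic bias of ~+1
norm = normalize_log_ratios(raw)
assert abs(np.mean(norm)) < 1e-12           # average log ratio is now zero
```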

LOWESS normalization. Here one goes one step further and assumes that a significant fraction of the genes should have log ratios close to zero for any intensity interval. If one plots the log ratio versus the logarithm of the product of the two intensities for all genes in an array, one would then expect the points to follow a straight line with slope zero. The normalization is then carried out by fitting a line to these data using a regression technique called LOWESS (LOcally WEighted Scatterplot Smoothing) regression, and adjusting the intensities by using this fitted line so that the calculated slope is zero (see Fig. 1).
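A much simplified LOWESS-style fit can be sketched as follows (this is not the robust LOWESS used in practice: no robustness iterations, a naive neighbour search, and invented data). Each point gets a locally weighted straight-line fit along the intensity axis, and the fitted trend is subtracted so the normalized log ratios scatter around zero:

```python
import numpy as np

def simple_lowess(a, m, frac=0.5):
    """For each point, fit a straight line to its nearest neighbours along a,
    weighted by the tricube kernel, and return the fitted value there."""
    n = len(a)
    k = max(2, int(frac * n))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(a - a[i])
        idx = np.argsort(d)[:k]                        # k nearest neighbours
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
        A = np.vstack([np.ones(k), a[idx]]).T
        beta = np.linalg.lstsq(w[:, None] * A, w * m[idx], rcond=None)[0]
        fitted[i] = beta[0] + beta[1] * a[i]
    return fitted

a = np.linspace(0.0, 10.0, 50)        # average log intensities (x axis of Fig. 1)
m = 0.3 * a - 1.0                     # log ratios with an intensity-dependent bias
normalized = m - simple_lowess(a, m)  # subtract the fitted trend -> slope zero
assert np.max(np.abs(normalized)) < 1e-8
```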

Metrics

Looking at the matrix we use to represent the data, we note that there are a lot of numbers for each sample (or gene). A simple approach is to reduce this to one number for each pair of experiments (genes): a distance between pairs. How ”close” are two data points, be it in gene or sample space? We need to define distances; a metric.

In gene space, which is common in static applications (how close are two experiments?), the Euclidean distance appears natural, but is by no means holy.

d(k, l) = \sqrt{(x_1(k) - x_1(l))^2 + \ldots + (x_i(k) - x_i(l))^2 + \ldots + (x_N(k) - x_N(l))^2}   (1)

Correspondingly in sample space one has


d_{ij} = \sqrt{(x_i(1) - x_j(1))^2 + \ldots + (x_i(k) - x_j(k))^2 + \ldots + (x_i(M) - x_j(M))^2}   (2)

Euclidean distances might sometimes be misleading. Consider the genes i and j in a time course experiment with the behaviour shown in Fig. 2.

The distance d_{ij} is \sqrt{0.0^2 + 0.4^2 + 0.6^2 + 0.4^2 + 0.0^2} \approx 0.82, which is misleadingly high in the biological context; genes which behave in a related way should be ”close”, since they are expected to have a similar origin. Hence we compute the correlation C_{ij} between genes i and j. The averages of genes i and j over experiments are denoted \langle x_i \rangle and \langle x_j \rangle respectively. One has

C_{ij} = \frac{\sum_{k=1}^{M} x_i(k)\,x_j(k) - \bigl(\sum_{k=1}^{M} x_i(k)\bigr)\bigl(\sum_{k=1}^{M} x_j(k)\bigr)/M}{\sqrt{\bigl(\sum_{k=1}^{M} x_i(k)^2 - \bigl(\sum_{k=1}^{M} x_i(k)\bigr)^2/M\bigr)\bigl(\sum_{k=1}^{M} x_j(k)^2 - \bigl(\sum_{k=1}^{M} x_j(k)\bigr)^2/M\bigr)}}
       = \frac{\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle}{\sqrt{(\langle x_i^2 \rangle - \langle x_i \rangle^2)(\langle x_j^2 \rangle - \langle x_j \rangle^2)}}   (3)

where M is the number of experiments. Since correlation and anti-correlation should count the same, we disregard the sign of C_{ij} and consider |C_{ij}|. We define the ”distance” between the two curves as

d_{ij} = 1 - |C_{ij}|   (4)

In our example above one gets C_{ij} = -1 (verify this) and hence d_{ij} = 0, as desired. See http://davidmlane.com/hyperstat/A56626.html for a numerical example of a calculation of correlation.
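Both metrics can be checked on the Fig. 2 example. The two profiles below are invented so that their pointwise differences are 0.0, 0.4, 0.6, 0.4, 0.0 as in the text:

```python
import numpy as np

gene_i = np.array([1.0, 1.2, 1.3, 1.2, 1.0])
gene_j = np.array([1.0, 0.8, 0.7, 0.8, 1.0])  # mirror image: anticorrelated

def euclidean(x, y):
    """Euclidean distance of Eqs. (1)/(2)."""
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    """d = 1 - |C| of Eq. (4), with C the Pearson correlation of Eq. (3)."""
    c = np.corrcoef(x, y)[0, 1]
    return 1.0 - abs(c)

assert abs(euclidean(gene_i, gene_j) ** 2 - 0.68) < 1e-12  # sqrt(0.68) ~ 0.82
assert correlation_distance(gene_i, gene_j) < 1e-12        # C = -1  ->  d = 0
```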

2 Clustering

Armed with distance measures we proceed to clustering algorithms. Here the agenda is to group data points (in gene or experiment space) together into clusters and have cluster centers representing the clusters.

2.1 K-means Clustering

The oldest clustering algorithm is K-means clustering1. One here needs to preset the number of clusters (K). As an example consider genes measured by two experiments

1This is covered very briefly by Mattias in BIM083 with a slightly different notation.


x_i(k) (k = 1, 2) in Fig. 3. Denote the cluster centers y_m, where m = 1, ..., K. The algorithm goes as follows:

1. Initialize the K cluster centers y_m close to the center of gravity of all ~x_i.

2. Pick one xi at random.

3. Move the y_m closest to x_i towards x_i. More precisely, for each dimension do for the ”winner”

\Delta y_m = \eta D_{mi}   (5)

where \Delta y_m is the change in y_m, D_{mi} is the distance according to some metric between y_m and x_i, and \eta is a step size; large distances imply large moves etc.

4. Redo 2 and 3 until convergence.

The procedure is schematically illustrated in Fig. 3. In this way every x_i gets assigned to the cluster y_m that is closest2.
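Steps 1-4 can be sketched as a toy implementation with invented data. The winner update uses the vector form \Delta y_m = \eta (x_i - y_m), whose size grows with the distance, as in Eq. (5):

```python
import numpy as np

rng = np.random.default_rng(0)

def online_kmeans(X, K, eta=0.05, sweeps=200):
    """Online K-means following steps 1-4; parameters are invented defaults."""
    # 1. initialize cluster centers near the center of gravity of all points
    y = X.mean(axis=0) + 0.01 * rng.standard_normal((K, X.shape[1]))
    for _ in range(sweeps):
        for i in rng.permutation(len(X)):     # 2. pick a point at random
            m = np.argmin(np.linalg.norm(y - X[i], axis=1))  # find the winner
            y[m] += eta * (X[i] - y[m])       # 3. move the winner towards x_i
        # 4. repeat until convergence (here: a fixed number of sweeps)
    return y

# two well-separated blobs: the K = 2 centers should land near (0,0) and (5,5)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers = online_kmeans(X, K=2)
```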

A potential drawback with this method is that one has to know the number of clusters (K) in advance. To some extent this will be remedied in the next subsection, which deals with hierarchical clustering.

2.2 Hierarchical Clustering

There are many variants of hierarchical clustering. We here pick one that closely relates to K-means clustering above.

Basically, it goes as follows:

1. Perform K-means clustering with K=2, i.e. divide the data set into two halves.

2. Within each cluster separately redo 1.

3. Redo the above until the number of clusters equals the number of data points.

4. Based upon subcluster assignments reconstruct the tree – a dendrogram.

The above represents a top-down procedure, often called divisive. Alternatively, one might start at the bottom, form pairs based upon shortest distances and proceed upwards. This bottom-up procedure is called agglomerative. In an agglomerative approach there are many ways to replace a pair of experiments with a common

2One can also have ”fuzzy” cluster assignments, where each x_i in principle belongs to all clusters with different weights computed from the distances.


representative. In array analysis it is common to use average linkage, which means each sample gets a distance to a merged pair that is the average distance to the two experiments that make up the pair. It is important to be aware that different choices may lead to different dendrograms.
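A naive agglomerative sketch with average linkage, for illustration only (invented points; real packages use much more efficient algorithms and also record the merge tree for the dendrogram):

```python
import numpy as np

def average_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters, where the distance between
    clusters is the average pairwise Euclidean distance between their members."""
    clusters = [[i] for i in range(len(points))]
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_dist(a, b):
        return np.mean([D[i, j] for i in a for j in b])

    while len(clusters) > n_clusters:
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_dist(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = average_linkage(pts, 2)
assert sorted(sorted(g) for g in groups) == [[0, 1], [2, 3]]
```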

In Fig. 4 a dendrogram is shown, where static small round blue cell tumor of childhood (SRBCT) samples [there exist four kinds] are clustered in gene space.

2.3 Self-Organizing Networks

Self-organizing networks (also known as self-organizing maps, SOM) are yet another approach that starts off from K-means clustering. In this case one defines a topology among the clusters, e.g. upon a 2-dimensional grid. This is done such that adjacent grid points represent feature similarities. See Fig. 5 for a synthetic 2-dimensional example.

This concept is implemented such that when updating Eq. (5) not only the ”winner” is updated but also its neighbours on the grid [in the 2-dimensional case there are 4 of them (except at the boundaries)]. In this way, the clusters are related to one another, which is not the case for K-means clustering.
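One SOM update step can be sketched as follows (invented grid and data; the winner moves with step \eta and its four grid neighbours with a smaller, invented step \eta_nb):

```python
import numpy as np

def som_update(nodes, x, eta=0.1, eta_nb=0.05):
    """One online update on a 2-D SOM grid. nodes: (rows, cols, dim) array."""
    rows, cols, _ = nodes.shape
    d = np.linalg.norm(nodes - x, axis=-1)
    r, c = np.unravel_index(np.argmin(d), (rows, cols))  # find the winner
    nodes[r, c] += eta * (x - nodes[r, c])               # move winner, as in Eq. (5)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:    # 4 grid neighbours
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:            # except at the boundaries
            nodes[rr, cc] += eta_nb * (x - nodes[rr, cc])
    return (r, c)

nodes = np.zeros((2, 3, 2))       # a 2x3 grid of 2-dimensional reference vectors
nodes[1, 2] = [0.9, 0.9]          # one node already near the data point
winner = som_update(nodes, np.array([1.0, 1.0]))
assert winner == (1, 2)           # the nearby node wins and drags its neighbours
```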

This method is particularly useful for time course studies. In Fig. 6 a 5 × 6 SOM is shown for the yeast cell cycle. Here the yeast genes are compressed into 30 clusters; the curves represent averages and the error bars the standard deviations. As can be seen, the 30 screens are behaviour-wise connected with their neighbours; other than that it is just standard clustering.

2.4 Clustering - So What?

Clustering provides visual insights into data, as did Multidimensional Scaling [cf. BIM083]. More importantly:

1. What biology is learned from cluster exercises on static data [typically in gene space]?

• Class discovery [e.g. diseases].

• Investigate by removal etc. which genes are responsible for the classification.

2. What biology is learned from cluster exercises on time course data [typically in experiment space]?

• Which genes behave similarly?


• Do genes that behave similarly have a common upstream motif? If so, one may find regulatory elements [more in computer exercise].

3 Supervised Learning

In static measurements, where a classification (e.g. phenotype) is already known in advance, one often employs supervised learning techniques. The goals here are twofold.

• Based upon the ”historical data” develop a classifier that handles ”test data”.

• Determine the genes that are responsible for the classification.

3.1 Simple Statistics

Signal-to-noise ratio. Based upon the averages (\bar{x}_i) and standard deviations (\sigma_i) for classes 1 and 2, compute for each gene i the weight (see Fig. 7)

w_i = \frac{\bar{x}_i(1) - \bar{x}_i(2)}{\sigma_i(1) + \sigma_i(2)}   (6)

In this way one obtains a ranked list of genes based upon w_i.

How do we know that the top-ranked genes are significant? Perform random permutation tests3:

1. Permute sample labels (1 and 2)

2. Calculate weights for the random labels.

3. Redo (1) and (2).

4. Are weights for the ”true” labeling significant?

In Fig. 8 weight distributions are shown for ”true” and random labels respectively. From these compute P(w): the probability that any weight is larger than w for a random labeling.

In Fig. 9 the weights for one gene are shown for ”true” and random labels respectively. From these compute α(g): the probability that any weight is larger than w(g) for a random labeling for the particular gene.
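The weight of Eq. (6) and its permutation test can be sketched as follows (invented single-gene data; labels 1 and 2 mark the two classes, and the number of permutations is an invented choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def s2n_weight(x, labels):
    """Signal-to-noise weight of Eq. (6) for one gene across samples."""
    a, b = x[labels == 1], x[labels == 2]
    return (a.mean() - b.mean()) / (a.std() + b.std())

def permutation_p(x, labels, n_perm=2000):
    """Fraction of random labelings whose |weight| reaches the true |weight|."""
    w_true = abs(s2n_weight(x, labels))
    count = 0
    for _ in range(n_perm):
        count += abs(s2n_weight(x, rng.permutation(labels))) >= w_true
    return count / n_perm

# invented example: a gene clearly higher in class 1 than in class 2
x = np.array([2.1, 2.0, 2.2, 1.9, 0.1, 0.0, 0.2, -0.1])
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2])
assert s2n_weight(x, labels) > 2          # large separation relative to spread
assert permutation_p(x, labels) < 0.05    # rarely matched by random labelings
```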

3Cf. P value calculations in sequence alignment (BIM083).


3.2 Regression Models

The analysis above assumes that the classification depends upon the genes one-by-one. How about collective dependencies?

We next describe how to model the data to account for more general dependencies. For this we employ Multilayer Perceptrons (MLP) [these belong to the family of Artificial Neural Networks (see Mattias' notes in BIM083)]. For schematic 2-dimensional examples of linear and non-linear networks see Fig. 10.

When calibrating such models, the number of parameters (links) should not exceed the number of experiments. With O(10000) genes (number of inputs) per sample, we are in trouble. Hence, one needs a way to reduce the number of genes to a smaller number prior to calibrating the models. It is here convenient to use Principal Component Analysis (PCA), where one rotates the coordinate system of gene space such that the variance is maximized along a few axes. For a 2-dimensional example see Fig. 11.

PCA is a standard procedure abundantly available in software packages. For our purposes one retains the most important directions. The integrated PCA + MLP architecture is shown in Fig. 12.
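A PCA reduction via singular value decomposition can be sketched as follows (invented data, with the variance concentrated along one direction so the first component captures nearly everything):

```python
import numpy as np

rng = np.random.default_rng(2)

def pca_reduce(X, n_components):
    """X: (samples, genes). Returns (samples, n_components) PCA scores."""
    Xc = X - X.mean(axis=0)                  # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T          # project onto top components

# invented data: 20 samples, 1000 "genes", variance dominated by one direction
t = rng.standard_normal(20)
X = np.outer(t, rng.standard_normal(1000)) + 0.01 * rng.standard_normal((20, 1000))
Z = pca_reduce(X, 3)
assert Z.shape == (20, 3)                    # a few inputs instead of O(10000)
assert Z[:, 0].var() > 100 * Z[:, 1].var()   # first axis carries the variance
```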

Once the MLP has been calibrated, it can be used to classify test experiments. By computing the derivative of the output (o) with respect to x_i, a sensitivity S_i can be defined.

S_i = \sum_{\text{samples}} \left| \frac{do}{dx_i} \right|   (7)

Based upon S_i the genes can be ranked as above and random permutation tests can be performed (CPU consuming though).

The above procedure can be generalized to multi-class instances (multiple outputs).

There exist more supervised learning schemes not covered here. One should in particular mention Support Vector Machines (SVM).

Example – Estrogen Receptor Status. Estrogens are important regulators in the development and progression of breast cancer and regulate gene expression via the estrogen receptor (ER). Microarray images of node-negative sporadic breast cancer tumors are investigated with respect to ER status (+ or -) using microarray data for 23 ER+ and 24 ER- tissues and 11 ”blind test” samples. 8 PCA components and 200 MLP models (a committee) are used, sensitivities computed and ranked.

ER status of the blind test samples was predicted with 100 % accuracy (even when ER and


GATA3 genes were excluded). For results confined to the top 100 and top 301-400 genes see Fig. 13.

Conclusion: The ER expression signature is deep!

Example – Correlations with other array data

In addition to hybridizing the RNA contents of cells in a sample to investigate expression profiles, one can do the same thing for the DNA contents. Why would one do this? Would we not then expect a ratio of 1, corresponding to having the same number of copies of each gene in every cell in a genome? In solid tumors, there is uncontrolled growth and many of the control mechanisms that ensure that DNA is copied properly are broken down. This results in regions of DNA that are deleted or amplified into multiple copies. By performing DNA hybridizations (comparative genomic hybridization [CGH] on microarrays) it is possible to investigate the impact of copy number changes on expression levels. A nice approach to investigate this is using a simple statistical measure to see if the expression level for each gene correlates with whether the gene is amplified in a sample or not.

4 Models

So far we have been focusing on various data mining tools, i.e. model-independent analysis. Biocomputing is now moving towards models – Systems Biology. One such approach, the Kauffman model, was briefly presented in BIM083. This Boolean model is on a ”meta” level for gene-gene interactions and not intended for detailed comparisons. Below, we go through Michaelis-Menten kinetics, which will be used for detailed modeling of cell cycle behaviour in an exercise.

Consider the reaction:

S + E \;\underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}}\; ES \;\overset{k_2}{\longrightarrow}\; P + E   (8)

where

S = substrate
E = enzyme
ES = substrate/enzyme complex
P = product

with reaction rates k_1, k_{-1} and k_2 respectively. In what follows, concentrations will be denoted by [...].


We are interested in the production rate

\frac{dP}{dt} = k_2 [ES]   (9)

subject to the constraints:

1. Enzyme conservation

E_0 = [E] + [ES]   (10)

where E_0 is a constant.

2. Equilibrium of the first step ([S] = constant)

k_1 [S][E] = k_{-1} [ES]   (11)

We want to express dP/dt (Eq. (9)) in terms of [S] alone. To this end, we multiply Eq. (10) by k_1[S] and use Eq. (11) to eliminate [E]. One obtains for [ES]

[ES] = \frac{k_1 [S] E_0}{k_1 [S] + k_{-1}}   (12)

which yields

\frac{dP}{dt} = \frac{k_1 k_2 [S] E_0}{k_1 [S] + k_{-1}} = \frac{v_m [S]}{[S] + k_m}   (13)

where we have divided both numerator and denominator by k_1 and lumped together the constants into v_m = k_2 E_0 and k_m = k_{-1}/k_1. These are the Michaelis-Menten equations. As can be seen, the production rate saturates with [S], with a steepness regulated by the ratio k_{-1}/k_1.
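Eq. (13) can be checked numerically (the test values are invented):

```python
def michaelis_menten_rate(s, vm=1.0, km=1.0):
    """dP/dt = v_m [S] / ([S] + k_m) of Eq. (13)."""
    return vm * s / (s + km)

assert michaelis_menten_rate(1.0) == 0.5       # half-maximal rate at [S] = k_m
assert michaelis_menten_rate(1e6) > 0.999      # saturates towards v_m
# a smaller k_m = k_-1/k_1 gives a steeper initial rise, as in Fig. 17
assert michaelis_menten_rate(0.5, km=0.01) > michaelis_menten_rate(0.5, km=1.0)
```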


List of Figures

1  LOWESS normalization.

2  Time courses for two anticorrelated genes.

3  A 2-dimensional synthetic clustering example.

4  Hierarchical clustering of SRBCT samples.

5  Schematic picture of a self-organizing network on a 2×3 grid with 2-dimensional data [from PNAS 96, 2907-2912 (1999)].

6  A 5×6 SOM of the yeast cell cycle [from PNAS 96, 2907-2912 (1999)].

7  Calculation of gene classification weight using Eq. (6).

8  Distribution of weights w for ”true” and random labelings.

9  Distribution of weights w for ”true” and random labelings for a single gene.

10 Schematic examples of linear and nonlinear separations of 2-dimensional spaces.

11 Rotation of a 2-dimensional gene space to maximize the variance along a ”new” axis.

12 MLP + PCA architecture.

13 MLP output values from a committee of models for the top 100 and top 301-400 genes from the ranking list.

14 Impact of gene copy number on global gene expression levels [from Hyman et al., Cancer Research (2002)].

15 Genome-wide copy number and expression analysis in the MCF-7 breast cancer cell line [from Hyman et al., Cancer Research (2002)].

16 List of 50 genes with a statistically significant correlation (α < 0.05) between gene copy number and gene expression.

17 Michaelis-Menten kinetics.


Figure 1: LOWESS normalization. Instead of plotting the log of the red intensity versus the log of the green intensity, one often rotates the entire plot 45 degrees. After the rotation the vertical axis will correspond to the log ratio and the horizontal axis to the average of the log intensities. In LOWESS normalization, one fits a line to this plot and uses the fitted line to normalize the data such that the line has slope zero and corresponds to a typical log ratio of zero.


Figure 2: Time courses for two anticorrelated genes.


Figure 3: A 2-dimensional synthetic clustering example.


Figure 4: Hierarchical clustering of SRBCT samples. Colours from left to right are supposed to be yellow, green, red and blue respectively.


Figure 5: Schematic picture of a self-organizing network on a 2×3 grid with 2-dimensional data [from PNAS 96, 2907-2912 (1999)]. Initial geometry of nodes in the 3×2 rectangular grid is indicated by solid lines connecting the nodes. Hypothetical trajectories of nodes as they migrate to fit data during successive iterations of the SOM algorithm are shown. Data points are represented by black dots, the six nodes of the SOM by large circles, and trajectories by arrows.


Figure 6: A 5×6 SOM of the yeast cell cycle [from PNAS 96, 2907-2912 (1999)]. Yeast Cell Cycle SOM. (a) 6×5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on the y-axis and time points on the x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior. (b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1. Normalized expression patterns of the 30 genes nearest the centroid are shown. (c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to the G1, S, G2 and M phases of the cell cycle, are shown. (d) Centroids for groups of genes identified by visual inspection by Cho et al. (4) as having peak expression in the G1, S, G2, or M phase of the cell cycle are shown.


Figure 7: Calculation of gene classification weight using Eq. (6)


Figure 8: Distribution of weights w for ”true” and random labelings.


Figure 9: Distribution of weights w for ”true” and random labelings for a single gene.


Figure 10: Schematic examples of linear and nonlinear separations of 2-dimensional spaces.


Figure 11: Rotation of a 2-dimensional gene space to maximize the variance along a ”new” axis.


Figure 12: MLP + PCA architecture


Figure 13: MLP output values from a committee of models; the top 100 and top 301-400 genes are used respectively from the ranking list.


Figure 14: From Hyman et al., Cancer Research (2002). Impact of gene copy number on global gene expression levels. A, percentage of over- and underexpressed genes (Y axis) according to copy number ratios (X axis). Threshold values used for over- and underexpression were > 2.184 (global upper 7% of the cDNA ratios) and < 0.4826 (global lower 7% of the expression ratios). B, percentage of amplified and deleted genes according to expression ratios. Threshold values for amplification and deletion were > 1.5 and < 0.7.


Figure 15: From Hyman et al., Cancer Research (2002). Genome-wide copy number and expression analysis in the MCF-7 breast cancer cell line. A, chromosomal CGH analysis of MCF-7. The copy number ratio profile (blue line) across the entire genome from 1p telomere to Xq telomere is shown along with ±1 SD (orange lines). The black horizontal line indicates a ratio of 1.0; the red line, a ratio of 0.8; and the green line, a ratio of 1.2. B-C, genome-wide copy number analysis in MCF-7 by CGH on cDNA microarray. The copy number ratios were plotted as a function of the position of the cDNA clones along the human genome. In B, individual data points are connected with a line, and a moving median of 10 adjacent clones is shown. Red horizontal line, the copy number ratio of 1.0. In C, individual data points are labeled by color coding according to cDNA expression ratios. The bright red dots indicate the upper 2%, and dark red dots, the next 5% of the expression ratios in MCF-7 cells (overexpressed genes); bright green dots indicate the lowest 2%, and dark green dots, the next 5% of the expression ratios (underexpressed genes); the rest of the observations are shown with black crosses. The chromosome numbers are shown at the bottom of the figure, and chromosome boundaries are indicated with a dashed line.


Figure 16: List of 50 genes with a statistically significant correlation α value < 0.05 between gene copy number and gene expression (from permutation tests using the signal-to-noise statistic and defining two classes (gene amplified in a sample or not amplified) separately for each gene). Name, chromosomal location, and the α value for each gene are indicated. The genes have been ordered according to their position in the genome. The color maps on the right illustrate the copy number and expression ratio patterns in the 14 cell lines. The key to the color code is shown at the bottom of the graph. Gray squares, missing values.


Figure 17: Michaelis-Menten kinetics. dP/dt (here normalized to v_m = 1) as a function of substrate concentration [S] for k_{-1}/k_1 = 1 and 0.01 respectively.
