
An algorithm for chemical genomic profiling that minimizes batch effects: Bucket Evaluations

by

Daniel Shabtai

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Cell and Systems Biology University of Toronto

© Copyright by Daniel Shabtai 2011



Abstract

Chemical genomics is an interdisciplinary field that combines small molecule perturbation with

genomics to understand gene function and to study the mode(s) of drug action. Existing methods

for correlating chemical genomic profiles are not ideal as they often require one to define the

disrupting effects, commonly known as batch effects. These effects are not always known, and

they can mask true biological differences.

I present a method, Bucket Evaluations (BE), which surmounts these problems. This method is a

non-parametric correlation approach, which is suitable for locating correlations in somewhat

perturbed datasets such as chemical genomic profiles. BE can be used on other datasets such as

those obtained via gene expression profiling and performs well on both array-based and

sequence-based readouts. Applying BE, alongside several other correlation methods, to a collection of

datasets showed it to be highly accurate at locating similarity between experiments.


Acknowledgments

I would like to thank Dr. Corey Nislow, who gave me an opportunity upon my arrival to

Toronto, supported and guided my ideas, and always had an open door for sharing thoughts. I

also want to thank my co-supervisor, Dr. Tim Westwood, and my supervisory committee

members Drs. Guri Giaever and Nick Provart for their guidance.

In addition, I would like to thank the Giaever/Nislow lab members, and the CCBR 6th floor

computational researchers, with whom I had the pleasure of working and whose company I

enjoyed during work and beyond.

Finally, I would like to thank my parents Janet and Yaakov, who trust and support me, with great

love, in any step I take, no matter what. I also thank my siblings Arei and Runn, and my family, for

their long-distance (>9,000 km) video chats that fill my energy stores. Most of all, I thank Michal

Sibony for her love, knowledge and support throughout my research.


Table of Contents

Abstract

Acknowledgments

Table of Contents

List of Figures

List of Tables

List of Abbreviations

1. Introduction

1.1. Microarrays

1.2. High Throughput Sequencing

1.3. Chemogenomic Profiles

1.4. Batch Effects

1.4.1. History

1.4.2. Definition

1.4.3. Sources of Batch Effects

1.5. Analysis Approaches

1.5.1. Overview

1.5.2. Supervised vs. Unsupervised Methods

1.5.3. Parametric vs. Non-Parametric Methods

1.6. Software Design

1.6.1. Threads

2. Rationale

3. Results and Discussion

3.1. TAG4 Barcode Microarray Dataset

3.2. TAG3 Microarray 2004 PNAS Dataset

3.3. Gene Expression (Transcript Abundance) Dataset

3.4. High Throughput Sequencing Dataset

4. Conclusions

5. Methods

5.1. Levelled scoring matrix

5.2. Software imaging and implementation

6. Bucket Evaluations Software

6.1. User Experience

6.2. Main GUI Window (MGUIW)

6.2.1. User Input

6.2.2. Status Notifications

6.2.3. Cancel Run

6.3. Information Form

6.4. BE Thread Manager (BETM)

7. References


List of Figures

Figure 1 Experimental procedure for creating chemical genomic profiles

Figure 2 The score distribution of chemogenomic profiles sorted by date

Figure 3 Two chemogenomic experiments performed using the same conditions

Figure 4 A simplified example of a basic implementation of BE for scoring experiments

Figure 5 Expected results of the ideal outcome and a random outcome

Figure 6 Four correlation methods heat maps applied to the same dataset

Figure 7 Four correlation methods score distribution applied to the same dataset

Figure 8 A comparison of barcode TAG3 microarray similarity results

Figure 9 BE method results on the Gasch et al. dataset

Figure 10 Gasch et al. dataset differentiation between the induced and repressed genes

Figure 11 The distribution of scores of the Gasch et al. study dataset

Figure 12 Results for running the BE method on high throughput sequencing data

Figure 13 A comparison of several methods run on high throughput sequencing data

Figure 14 The score distribution of several methods on high throughput sequencing data

Figure 15 Bucket Evaluations Software Graphical User Interface

Figure 16 Bucket Evaluations Software Architecture

Figure 17 Bucket Evaluations Software - Load file location window

Figure 18 Bucket Evaluations Software - Save file location window

Figure 19 Example of different bucket sizes on clustering of data

Figure 20 Bucket Evaluations Software GUI once executed


List of Tables

Table 1 Non-biological effects on datasets

Table 2 A scoring matrix formula in accordance to the guidelines needed for BE scoring

Table 3 Implementation example of the scoring matrix

Table 4 Top three drug similarity scores located by several correlation methods


List of Abbreviations

AtCAST Arabidopsis thaliana: DNA Microarray Correlation Analysis Tool

bp base pairs

BE Bucket Evaluations

BETM BE Thread Manager

CPU central processing unit

DMSO dimethyl sulfoxide

DNA deoxyribonucleic acid

FDA Food and Drug Administration

GUI graphical user interface

MASTA microarray overlap search tool and analysis

MGUIW Main GUI Window

mRNA messenger ribonucleic acid

NGS next generation sequencing

PAM Partitioning Around Medoids

RNA ribonucleic acid

SD standard deviation

SOLiD Supported Oligo Ligation Detection

1. Introduction

High throughput analysis of genes is a developing technology that allows the user to analyse

thousands of genes simultaneously. High throughput analysis first emerged at the beginning of

the 1990s with microarray technology (Brown and Botstein, 1999; Lockhart et al., 1996; Schena

et al., 1995), and continued with new technologies such as next-generation sequencing (Alkan et

al., 2009; Bentley et al., 2008; Hillier et al., 2008; Ley et al., 2008; Smith et al., 2009).

One issue that has hindered high throughput experiments since their introduction is the limited

ability to compare results from experiment to experiment, lab to lab and between different dates.

Assessing similarity between experiments helps in understanding the activity of specific genes

under new experimental conditions. In my research, I address similarity evaluation problems of

high throughput analysis of genes using an analysis algorithm I have developed, and by

designing and implementing software for using this algorithm.


1.1. Microarrays

A microarray is a collection of single- or double-stranded DNA segments that are attached to a

surface. A sample of cDNA made from RNA or genomic DNA is hybridized to these segments,

allowing one to measure gene transcript abundance in the sample for expression

microarrays, or barcode/genomic DNA abundance for barcode/genomic microarrays (see

section 1.3). Such measurements enable researchers to monitor the expression of all known

genes of an organism simultaneously as part of genome-wide studies. For example, the user can

measure the abundance of gene transcripts (or gene copy number) either over time, under stress

conditions, or in the presence of chemical compounds (Ammar et al., 2009; Lieb et al., 2001;

Lockhart et al., 1996; Redon et al., 2006; Shoemaker et al., 2001; Wang et al., 1998). In yeast,

some of the landmark studies using microarrays to measure either transcript levels or gene copy

number include identifying drug targets using yeast deletion strains (Giaever et al., 2004;

Giaever et al., 1999; Marton et al., 1998), locating the group of yeast genes that are affected by

cell stress conditions (Gasch et al., 2000), showing yeast can acquire stress resistance when

exposed to mild stress (Berry and Gasch, 2008), discovering phenotypic activity for almost all

the genes in yeast (Hillenmeyer et al., 2008), and the creation of a drug-gene connectivity map

for understanding drug mechanism in mammalian species (Lamb et al., 2006).

Several manufacturing methods exist for the creation of single-stranded DNA-based

microarrays. The Affymetrix platform makes use of a technology known as

photolithography (Fodor et al., 1991). This technology selectively shields regions of a

substrate, called a wafer, from light or exposes them to it. By selectively exposing portions of the wafer to ultraviolet light, it is

possible to synthesize nucleic acids through photochemically directed reactions in a spatially

dependent manner. Like Affymetrix, NimbleGen uses a photolithographic technique to create

the microarrays (Singh-Gasson et al., 1999). However, NimbleGen uses an array of micro-mirrors

to selectively direct light onto regions of the microarray substrate, unlike Affymetrix, which uses

physical masks. This makes it easier to change oligonucleotide sequences, as the micro-mirrors

are controlled by software. Agilent Technologies creates microarrays using inkjet technology

(Hughes et al., 2001). This technology uses modified inkjets to spatially deliver and separate

chemical reagents on a glass substrate (Lausted et al., 2004). This technology is very flexible

since a computer file is used, rather than physical masks, to define the pattern and sequences of

oligonucleotides on the microarray. Illumina uses a method called Bead Chips. This technology

uses the random placement of bulk-synthesized oligonucleotides on polymer beads. Each

bead carries a unique nucleotide sequence defined by the user plus an identifier sequence. The

beads are attached to a substrate at random locations, and thus each microarray has a different

distribution of beads on the array (Illumina, 2011). A set of control probes that hybridize to the

identifier sequences distinguishes which bead is which on the array. Finally, spotted microarrays

are created using a robot whose pins repetitively pick up DNA probes (either single-

stranded oligos or double-stranded DNA fragments) from microtiter plates and deposit them on a

coated glass slide in a rectilinear pattern (Schena et al., 1995).


1.2. High Throughput Sequencing

High throughput sequencing is another method that is used for assessing the abundance of DNA

segments. This method is able to produce an enormous amount of data cheaply, and in this

introduction I mention several technologies for massively parallel DNA sequencing (Metzker,

2010). One example is Illumina technology, which works by

generating a DNA library for sequencing (Bennett, 2004). Adaptor sequences are ligated onto

the ends of each molecule. Next, the DNA is placed on a glass slide coated with DNA

complementary to the adaptor sequence. Each molecule is then amplified, generating clusters of

the cloned DNA molecules. A single fluorescently labeled base is incorporated into each

chain, and its terminator blocks further extension (Ju et al., 2006; Turcatti et al., 2008). The

cluster is then imaged, and the base is assigned according to the fluorescent colour. Next, the

terminator block is reversed, allowing incorporation of the next base. The process of

incorporating a base, reading an image, and unblocking the terminator is repeated 20-150 times,

which allows sequencing of a 20-150 base long sequence (Smith et al., 2009). Another example of

sequencing technology is Sequencing by Oligonucleotide Ligation and Detection (SOLiD),

created by Applied Biosystems (Mardis, 2008). Similar to Illumina, this technology starts by

generating a DNA library for sequencing. Adaptor sequences are ligated onto the ends of each

DNA molecule. Then, the DNA is added to an emulsion PCR, where a single DNA molecule is

amplified onto a bead. The beads are deposited onto a slide and sequenced. Unlike Illumina,

SOLiD uses a DNA ligase to add DNA bases (Mardis, 2008).

High throughput sequencing has several advantages over microarrays, as paralogous sequences

can be distinguished, quantitation is ‘digital’ rather than ‘analog’, and prior sequence knowledge

is not required, to name a few.


1.3. Chemogenomic Profiles

There are many advantages to using Saccharomyces cerevisiae as a model organism in drug

discovery research. Some of these advantages are its well-characterized genome and proteome

(Lourdes Peña-Castillo, 2007; Pena-Castillo and Hughes, 2007), the availability of a complete

molecularly barcoded deletion strain collection (Giaever et al., 2002; Winzeler et al., 1999), its low

maintenance cost in the lab, and its facile genetics.

The chemogenomic profiles I compared were created by using the yeast Saccharomyces

cerevisiae deletion strains collection (Giaever et al., 2002; Giaever et al., 2004; Giaever et al.,

1999; Winzeler et al., 1999). Heterozygous and homozygous diploid gene deletion collections

were used to determine those gene products of pathways most affected by treatment

(Deutschbauer et al., 2005). In this method each deletion strain is tagged with a barcode, which is

a unique 20bp sequence used for identification of the strain. Once a collection of strains is grown

in the presence of a compound, the sensitivity of a certain strain with a deleted gene is measured

as a decrease in its abundance by PCR amplification of the strain specific barcodes followed by

barcode microarray hybridization or barcode sequencing (Bar-Seq) (Giaever et al., 1999; Smith

et al., 2010). This method allows the identification of potential drug targets and/or genes and pathways

required for growth in the presence of a compound (Deutschbauer et al., 2005; Giaever et al.,

2004) (Figure 1).

The results of each experiment are microarray signal intensities or barcode sequence counts,

which reflect barcode abundance and, by extrapolation, strain abundance. These values are

normalized by evaluating the log2 ratio between the signal intensities of drug-treated pools and

control pools, which are treated only with DMSO. This value represents the strain’s fitness

defect. In a typical experiment, a few strains show a high fitness defect while the majority show

little or no defect relative to the control treatment. Strains with lower values may still be truly sensitive,

yet they are not necessarily detected using a set threshold, as they are concealed within

midrange values that are considered background noise.
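
The log2 ratio normalization described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the function name, the strain names, the pseudocount, and the sign convention (control over treated, so that a depleted strain receives a positive score) are all assumptions made for the example.

```python
import math

def fitness_defect(control, treated, pseudocount=1.0):
    """Return a log2(control / treated) score per strain.

    Assumptions (not taken from the text): the ratio is control over
    treated, so a strain depleted by the drug scores positive, and a
    pseudocount is added to avoid division by zero.
    """
    scores = {}
    for strain, control_signal in control.items():
        ratio = (control_signal + pseudocount) / (treated[strain] + pseudocount)
        scores[strain] = math.log2(ratio)
    return scores

# Hypothetical signal intensities for three deletion strains.
control = {"strainA": 1023.0, "strainB": 511.0, "strainC": 255.0}
treated = {"strainA": 1023.0, "strainB": 63.0, "strainC": 255.0}

scores = fitness_defect(control, treated)
for strain, score in scores.items():
    print(strain, score)
```

In this toy input only strainB is depleted in the drug-treated pool, so it is the only strain with a non-zero score.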

The fitness defect values vary in repeated experiments under identical conditions due to many

factors such as systematic effects (Baryshnikova et al., 2010) and batch effects (Fare et al., 2003;

Lander, 1999). Examples of possible effects are different dates of the experiment, different

plates, and the machinery used in an experiment. Due to the relatively high variability of

experiment results from experiment to experiment, the ability to compare experiments is limited

to those with higher fitness defects according to a selected threshold. To achieve a meaningful

comparison of different experiments, it is desirable to look at large collections of repeated

experiments, rather than individual, singleton experiments. This will allow evaluating midrange

values, which are not always seen as significant when looking at a single experiment.

Normalization solutions for minimizing non-biological effects exist; however, these solutions

each have limitations because they are designed for defined datasets and known batch effect

conditions (Alter et al., 2000; Benito et al., 2004; Fare et al., 2003; Johnson and Li, 2007;

Lander, 1999; Mecham et al., 2010), which are not always obvious. This led me to hypothesize

that the similarity between different chemical genomic profiles can be evaluated using the BE

method based on a weighted rank scoring.


Figure 1| Experimental procedure for creating chemical genomic profiles. The traditional barcode microarray assay is depicted on the left and the Barcode sequencing modification is presented on the right. In both assays, yeast is grown (1) in the presence of a chemical compound. In this example, the green strain grows poorly in this specific condition. The genomic DNA is isolated (2) and barcodes are amplified (3 or 5). Samples are either hybridized to a barcode microarray (4) or sequenced (6). Only one of the two yeast barcodes is shown, while the red, blue and green boxes represent the barcode which uniquely identifies that particular strain. This figure is used with permission from Andrew M. Smith, adapted from Pierce et al. 2006.


1.4. Batch Effects

1.4.1. History

Microarrays are a powerful tool; however, they are not devoid of disadvantages. Like any other

measurement process, high throughput analysis methods are susceptible to variability in results

due to technical, biological, and other non-biological sources (Table 1). Due to the variability in

the results of microarray experiments, researchers questioned the validity (Frantz, 2005;

Ioannidis, 2005; Strauss, 2006; Ying and Sarwal, 2009) and challenged the reproducibility of

microarray results (Dobbin et al., 2005; Ein-Dor et al., 2006; Irizarry et al., 2005; Larkin et al.,

2005; Marshall, 2004). Possible technical factors were investigated with an aim to minimize their

effect (Bakay et al., 2002; Boedigheimer et al., 2008; Eklund and Szallasi, 2008; Fare et al.,

2003; Han et al., 2004; Lusa et al., 2007; Novak et al., 2002; Zakharkin et al., 2005). Solutions

for minimizing the effects of technical sources were introduced during microarray studies. Such

solutions located problems in technical steps of microarray procedures, such as improper

experimental design (Lee et al., 2005; Rothman et al., 1980), RNA extraction (Bakay et al.,

2002; Boedigheimer et al., 2008; Huang et al., 2001; Lin et al., 2006; Thompson et al., 2007;

Whitney et al., 2003), RNA processing (Boelens et al., 2007; Lynch et al., 2006; Ma et al.,

2006), hybridization (Schaupp et al., 2005), washing (Branham et al., 2007; Fare et al., 2003),

scanning (Satterfield et al., 2008; Shi et al., 2005), clinical diagnosis (Daskalakis et al., 2008;

Furness et al., 2003) and data interpretation (Ambroise and McLachlan, 2002). These solutions

eliminated technical effects, though they did not resolve other non-biological effects.


Non-technical sources affecting results, such as the person performing the experiment, date of

the experiment, etc., were not resolved by accounting for the technical issues. Therefore,

statistical tools, rather than technical procedures, were developed for accounting for these

effects.

Table 1 Non-biological effects on microarray datasets

Source | Result
Date of experiments | Grouping of experiment similarity according to date (Baggerly et al., 2008)
Location of experiment | Grouping of experiment similarity according to location (Shi et al., 2006)
Experimental design | Masking of physiological state being studied (Lee et al., 2005; Ransohoff, 2005a; Rothman et al., 1980)
Tissue heterogeneity (sample and RNA extraction) | Masking of tissue or cell population being studied (Bakay et al., 2002; Scherer, 2009)
Temporal and biological variation in expression (sample and RNA extraction) | Masking of the biological state being studied (Boedigheimer et al., 2008; Scherer, 2009; Whitney et al., 2003)
Expression changes after tissue extraction (sample and RNA extraction) | Measured RNA abundances are different than true physiological state being studied (Huang et al., 2001; Lin et al., 2006; Scherer, 2009)
Degraded RNA (sample and RNA extraction) | Measured RNA abundances different than true physiological state being studied (Scherer, 2009; Thompson et al., 2007)
Amplification biases (RNA processing) | RNA abundances change with different protocols and handling (Boelens et al., 2007; Ma et al., 2006; Scherer, 2009)
Labeling biases (RNA processing) | Measured signals differ from actual abundances and are dependent on actual procedure used (Lynch et al., 2006; Scherer, 2009)
Non-uniform hybridization (hybridization) | Spatial signal biases and non-uniform high backgrounds (Schaupp et al., 2005)
Cy5 degradation (washing) | Cy5 molecule degrades under ozone exposure (Branham et al., 2007; Fare et al., 2003; Scherer, 2009)
System stability (scanning) | Variation in signal outputs (Scherer, 2009)
System settings (scanning) | Scan-to-scan variability (Scherer, 2009)
Subjective analysis of specimen (clinical diagnostics) | Systematic bias in assessment due to single or multiple pathologists making diagnosis (Daskalakis et al., 2008; Furness et al., 2003; Scherer, 2009)
Selection bias (data interpretation) | Bias in selecting data sets for training and validation (Ambroise and McLachlan, 2002; Scherer, 2009)
Subtle differences in growth conditions, such as incubation time, from one array plate to the next (systematic effects) | Plate effect (Baryshnikova et al., 2010)
Growth of different subsets of colonies on the same plate (systematic effects) | Local nutrient availability (Baryshnikova et al., 2010)
Angle at which agar medium was allowed to solidify (systematic effects) | Gradients in growth medium (Baryshnikova et al., 2010)
Increased colony size next to less fit mutants (systematic effects) | Neighboring mutant strain fitness change due to local competition for nutrients (Baryshnikova et al., 2010)

Table 1 | Non-biological effects on microarray datasets. This table mentions several possible effects on microarray experiments that may change the data.


1.4.2. Definition

Batch effects are non-biological experimental variations that affect the outcome of experiments

(Johnson and Li, 2007). The term originates from statistical process control, where it refers to

systematic differences of quality parameters between different production batches (Scherer,

2009). Such differences in parameters become significant if the average difference between

batches is larger than the within-batch random variation (Scherer, 2009). These effects create

differences in the gene expression intensities of samples processed in different batches, and

distort real biological effects. As a result, the distribution of intensities is largely due to the batch

effect, rather than the true biological variation (Figure 2). Although my focus is on avoiding batch

effects in high throughput chemogenomic research, batch effects are also found in

other fields, such as physics (Youden, 1972). Examples of batch effects have been documented

in many studies, showing high correlation between variables that are not study related (Petricoin

et al., 2002; Spielman et al., 2007). Such effects created concerns regarding the credibility of

biological conclusions even after the publication of results (Akey et al., 2007; Baggerly et al.,

2004), which led the FDA to block the use of one such assay until further validation

(Ransohoff, 2005b). When not detected, batch effects can result in a lack of reproducibility and

subsequent misallocation of resources (Baggerly et al., 2008).
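
The criterion above, that batch differences matter once the difference between batches exceeds the random variation within them, can be illustrated with a small sketch. The particular statistics used here (the range of batch means against the mean within-batch standard deviation), the function name, and the data are assumptions chosen for illustration; they are not a test defined in the cited sources.

```python
from statistics import mean, stdev

def batch_separation(values_by_batch):
    """Return (between-batch difference, within-batch variation).

    Assumed definitions for this sketch: 'between' is the range of the
    batch means, and 'within' is the mean of the per-batch sample
    standard deviations.
    """
    batch_means = [mean(values) for values in values_by_batch.values()]
    between = max(batch_means) - min(batch_means)
    within = mean(stdev(values) for values in values_by_batch.values())
    return between, within

# Hypothetical scores for the same strain measured on two dates.
batches = {
    "day1": [1.0, 2.0, 3.0],
    "day2": [6.0, 7.0, 8.0],
}

between, within = batch_separation(batches)
flagged = between > within
print(between, within, flagged)
```

Here the two dates differ by far more than the scatter within either date, so the example is flagged as showing a batch difference.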


Figure 2| The score distribution of chemogenomic profiles sorted by date. The red rectangle shows a group of experiments that display a similar score distribution of the gene fitness defect. These experiments were performed on the same date, but not under the same conditions, which is an example of a batch effect due to the date of the experiment.


1.4.3. Sources of Batch Effects

Batch effect sources can be ambiguous; therefore, dealing with specific known effects may not be

enough to avoid batch effects completely. Batch effects vary with respect to their impact on the

data. For instance, in almost every gene expression study there are variations that are associated

with processing date of the microarray (Scherer, 2009) (Figure 3). Another example is seen in

the comparison of microarray experiments between laboratories which show strong lab-specific

effects (Irizarry et al., 2005). Another example is seen in the large variations associated with

DNA preparation groups, such as for different batches or reagents (Scharpf et al., 2011). These

‘strong’ effects are commonly used to account for batch effects, though they may be surrogates

for other sources of variation such as ozone levels, lab temperatures, reagent quality, etc. (Fare et

al., 2003; Leek and Storey, 2007; Scherer, 2009).

Most batch effects are masked by the ‘strong’ effects and are therefore not recorded as

potential effects, which makes it impossible to account for them. Taking into account the ‘strong’

factors, which may affect the data, and ignoring the factors that were not recorded, may not be

enough to clear the data of non-biological effects. Batch effects remain because

neither date nor biological factors are completely associated with all of the affecting

components, which also suggests that other, unknown sources are present. In other words, we

cannot explicitly account for undetected or unmeasured effects. Therefore, accounting only for

the known batch effects is not sufficient for removing non-biological effects (Leek et al., 2010).

Samples in which batch effects are confounded with the outcome of interest may lead to incorrect

biological or clinical conclusions. For example, an experiment where all control samples are

processed on one day and case samples on another may not provide useful data.
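
The control/case example above can be checked mechanically before any analysis is run. The sketch below is only an illustration under assumed names: it takes hypothetical (group, batch) labels for each sample and reports whether every batch contains a single group, which is the situation where batch and outcome cannot be separated.

```python
from collections import defaultdict

def every_batch_has_one_group(samples):
    """samples: list of (group, batch) pairs.

    Returns True when there is more than one group overall and each
    batch contains samples from only one of them, i.e. the batch label
    fully determines the group label.
    """
    groups_per_batch = defaultdict(set)
    for group, batch in samples:
        groups_per_batch[batch].add(group)
    all_groups = {group for group, _ in samples}
    return len(all_groups) > 1 and all(
        len(groups) == 1 for groups in groups_per_batch.values()
    )

# Hypothetical designs: one confounded, one balanced across days.
confounded_design = [
    ("control", "day1"), ("control", "day1"),
    ("case", "day2"), ("case", "day2"),
]
balanced_design = [
    ("control", "day1"), ("case", "day1"),
    ("control", "day2"), ("case", "day2"),
]

print(every_batch_has_one_group(confounded_design))
print(every_batch_has_one_group(balanced_design))
```

A check of this kind only covers batch variables that were recorded; as noted above, unrecorded sources cannot be detected this way.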


Figure 3| Two chemogenomic experiments performed using the same conditions (cantharidin, a protein phosphatase inhibitor) on different dates (a). These images show the extent of the differences between experiments that were performed under the same conditions. There is a difference in the scale of results (the left experiment’s top value is ~22, representing a 10^[…]-fold difference in abundance, while the right experiment’s top value is ~31, representing a 10^[…]-fold difference in abundance). The lower results are the least affected genes, and include the majority of strains. These results vary in range between experiments, and are assessed as noise as they are due to unmanageable differences between experiments, e.g. temperature perturbations. Despite the fact that the experiments were performed under the same conditions, the most sensitive deletion strains are not necessarily in the same ratio to each other, nor are they necessarily ranked in the same order (e.g. a strain can obtain the second highest fitness defect value in one experiment, yet the third highest in another). Another representation of the differences between experiments is shown in image b. The scatterplot shows an example of scores of two experiments performed using the same conditions. Top fitness defect scores are similar, though these strains are not ranked the same in both experiments and have a different range of scores.


1.5. Analysis Approaches

1.5.1. Overview

Several approaches exist for evaluating similarities between experiments. All methods attempt to

overcome batch effects within the data, while some methods require more information than

others about the experiments. Ideally, a dataset of experiments would hold all possible

data about the conditions of the experiments, allowing these variables to be used in analyzing

the data; in reality, however, having all the information about the experiments’ conditions is not

possible. Normalizing the data is a standard step in data analysis of gene expression experiments

(Allison et al., 2006), yet it does not completely remove batch effects, which can affect (in

chemogenomics) different genes in different ways, as different biological pathways are affected

by conditions unrelated to the experiment (Bolstad et al., 2003; Dudoit et al., 2002; Tseng et al.,

2001; Wu et al., 2004).

Several tools for comparing microarray data are available online. For example, the

Arabidopsis thaliana: DNA Microarray Correlation Analysis Tool (AtCAST) (Sasaki et al.,

2011) compares microarray results specifically for Arabidopsis thaliana. This

method uses a module-based correlation analysis, incorporating accumulated microarray data and

known shared biological activity of genes, to identify biological relationships. Another example is the

microarray overlap search tool and analysis (MASTA) (Reina-Pinto et al., 2010), which is

used for comparing differentially expressed genes against publicly available Arabidopsis

microarray datasets. Another method uses Eu.Gene (Cavalieri et al., 2007) to assess similarity of

samples within microarray databases. Eu.Gene is used to generate pathway signatures,

recapitulating the biologically meaningful pathways related to some clinical/biological variable

of interest. It then uses them to compare different microarray experiments (Beltrame et al.,

2009). There are many other methods which require prior knowledge such as information

regarding gene regulation (Breitling et al., 2004; Gasch and Eisen, 2002), biological pathways

(Ovaska et al., 2008), or defining the groups of batch effects (see section 1.5.2). Such methods

may be powerful analysis tools, though they rely on prior knowledge and/or accumulated data

for performing the analysis. Here I present a method that examines independent datasets and

does not rely on prior knowledge for the analysis.

1.5.2. Supervised vs. Unsupervised Methods

Methods for assessing similarities between experiments can be divided into two main categories.

First, supervised methods, which are methods that take into account study design, and attempt to

use all measured variables as the basis to correct for the batch effects (Baird et al., 2004; Dabney

and Storey, 2007; Johnson and Li, 2007; Mecham et al., 2010; Wolfinger et al., 2001; Wu et al.,

2004; Wu and Irizarry, 2007). However, these methods require a highly structured experimental

design. They assume that the experimenter has identified all sources of variation, and that all

these sources of variations are recorded in the data (Mecham et al., 2010). In contrast,

unsupervised methods are methods that do not require the utilization of batch variables and can

also be used when these data are missing.

I have developed an unsupervised method which does not assume prior knowledge of the data.

Most of the datasets I used were published without additional information beyond the results themselves,

and the use of such additional data was not needed because the method does not require prior

knowledge of possible batch effects. To assess its abilities, I compared this method to other

commonly used unsupervised methods, such as Pearson, Spearman and Kendall correlations

(Kendall, 1938; Pearson, 1909; Spearman, 1904).

1.5.3. Parametric vs. Non-Parametric Methods

Unsupervised methods are widely used in biostatistics (Armstrong et al., 2011; Nugent and

Meila, 2010), and are divided into parametric and non-parametric methods. Some statistical tests

depend on certain assumptions about the data behaviour for accurate evaluations. Tests that

require such prior assumptions are defined as parametric methods (Nugent and Meila, 2010).

Parametric statistics depend on a particular distribution of the data, and will base their conclusion

on presumptions regarding data parameters (e.g. standard deviation, variance etc.). Unlike

parametric tests, the non-parametric methods do not make assumptions regarding the probability

distribution of the data (Qualls et al., 2010). Here I present a non-parametric method I developed,

and compare its performance to other commonly used methods, including non-parametric

methods such as Spearman (Spearman, 1904) and Kendall (Kendall, 1938) correlations, and to a

parametric method, Pearson correlation (Pearson, 1909).
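The difference between the parametric and non-parametric measures named above can be illustrated with a minimal sketch. It assumes two hypothetical replicate profiles (`exp1`, `exp2`) that rank five strains identically but on different scales; the function names are illustrative, Kendall's coefficient is computed in its simple tau-a form, and none of this is taken from the software described in this thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation: a parametric measure of linear association."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) pairs divided by all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical fitness defect scores for five strains in two replicate screens.
# The second screen keeps the same ordering but stretches the scale.
exp1 = [1.0, 2.0, 3.0, 4.0, 5.0]
exp2 = [1.0, 4.0, 9.0, 16.0, 25.0]

print(round(pearson(exp1, exp2), 3))  # 0.981: penalised by the change of scale
print(spearman(exp1, exp2))           # 1.0: only the ordering matters
print(kendall(exp1, exp2))            # 1.0: every pair of strains is concordant
```

In this toy case the rank-based measures ignore the change of scale, whereas Pearson correlation does not. Neither rank-based measure, however, gives extra weight to the most sensitive strains.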

1.6. Software Design

Part of my research project included design and implementation of software that will accompany

the algorithm I developed.

1.6.1. Threads

One of the major concepts I use in my software implementation is multi-threading. Threads are

useful for parallelizing applications and are similar to processes. Computer processes consume

the computer’s central processing unit (CPU) time, and run concurrently with other processes.

The operating system allocates CPU time to each application. A single core computer processor

consists of a single CPU; therefore, the processes do not truly run simultaneously. Despite this

interleaved execution of processes, when users run several programs at the same time, they

experience all the applications as if they are running concurrently. The illusion of concurrent

activity is attainable due to fast context switching, in which the CPU saves the state of one

process and loads the state of the next. When the processor consists of multiple cores, the

processes are actually running concurrently on several CPUs, while there is still context

switching within the cores, as there are often more processes than cores (Microsoft, 2011).

A thread is a unit of execution that is allocated CPU time; therefore, multiple threads

allow a single program to run multiple tasks at the same time. Multiple threads are useful when

using a graphical user interface (GUI), because it should remain responsive while tasks run

in the background. In practice, at least one thread is dedicated to the GUI, while additional

threads perform tasks in the background. Another advantage of threads is the use of multiple

threads for concurrent analysis of a single resource (e.g. file or data in the computer memory).

Such concurrent analysis can provide a faster outcome compared to a

single thread. I have created a multithreaded program which allows the GUI to remain active

while multiple threads analyse a shared dataset. However, excessive usage of threads can

slow the performance of a program, because there are too few CPUs and the system is

overwhelmed with time-consuming context switching.
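The pattern of several worker threads analysing one shared, read-only dataset can be sketched as follows. This is an illustration in Python only; the thesis software is not reproduced here, and the names `column_sum`, `analyse_in_background` and the 3 x 3 dataset are hypothetical. In standard CPython builds the global interpreter lock makes threads that run pure-Python computation take turns, so the sketch shows the structure of the approach rather than a guaranteed speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def column_sum(dataset, col):
    """Background task: read-only analysis of one column of the shared dataset."""
    return sum(row[col] for row in dataset)


def analyse_in_background(dataset, max_workers=4):
    """Analyse every column of a shared dataset with a small pool of worker threads."""
    n_cols = len(dataset[0])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker reads the same in-memory dataset; nothing is copied or modified.
        futures = [pool.submit(column_sum, dataset, c) for c in range(n_cols)]
        # The submitting thread (the GUI thread in a real program) is free to do
        # other work here until it asks for the results.
        return [f.result() for f in futures]


shared = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical 3 x 3 dataset
print(analyse_in_background(shared))  # [12, 15, 18]
```

Capping `max_workers` reflects the caveat above: adding threads beyond the number of available cores mainly adds context switching.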

2. Rationale

Chemogenomics, the study of how the genome is affected by chemical compounds, is a valuable

approach to elucidate the mechanism of action of small molecules by identifying their cellular

targets and target pathways (Wuster and Babu, 2008). Recent applications of chemical genomics

in yeast include haploinsufficiency profiling and homozygote profiling of barcoded deletion

collections (Giaever et al., 2002; Giaever et al., 2004; Giaever et al., 1999; Winzeler et

al., 1999), exploration of essential genes using temperature-sensitive mutants (Li et al., 2011),

molecular barcoded open reading frame libraries (Ho et al., 2009), decreased abundance by

mRNA perturbation (Yan et al., 2008), multi-copy suppression profiling (Hoon et al., 2008) and

gene function and drug action analysis using the relationships between gene fitness profiles and

drug inhibition profiles (Hillenmeyer et al., 2010), to name a few. To apply chemical genomics

on a larger scale (i.e. thousands to hundreds of thousands of tests), a robust, extensible means to correct for

variation is needed. This variation can come from many sources, including operator, laboratory,

sample preparation and date (Irizarry et al., 2005; Scherer, 2009). Taken together, many profiles

will cluster based on these non-biological parameters, into "batches", which adversely affect the

validity of the conclusions of a study (Akey et al., 2007; Spielman et al., 2007). Furthermore, as

throughput increases, batch effects are likely to increase.

I used chemogenomic profiles obtained from experiments that utilized the yeast Saccharomyces

cerevisiae gene deletion collections (Deutschbauer et al., 2005), which include heterozygous and

homozygous diploid deletions and haploid deletions. These screens primarily measure growth of

individual strains in a mixed population of deletion strains in the presence of diverse small

molecules. In these screens, a strain’s fitness defect can reflect that the deleted gene is the target

of the chemical compound present (in heterozygous diploid deletion strains) or that a particular

pathway is the target of the small molecule (homozygous diploid deletion strains).

In a genome-wide chemical-genetic profile, the fitness of each strain can be determined by

measuring the abundance of each deletion strain at the conclusion of the experiment, relative to a

mock treatment control profile. As each chemical compound produces a unique profile of gene

sensitivities, comparing the profiles helps understand the similarity between the modes of action

of compounds (Baetz et al., 2004; Hillenmeyer et al., 2008). This “guilt-by-association”

approach may help uncover therapeutic applications for known compounds as well as the

mode(s) of action of novel compounds (Buchdunger et al., 1996; Druker et al., 1996). Because

most chemical profiles display a range of fitness defects, identifying similarities between

chemical profiles requires locating shared gene targets of each profile and emphasizing genes

with highest fitness defect values, i.e. the strains most sensitive to treatment.

Batch effects, defined as non-biological variation in results (Scherer, 2009), interfere with the

ability to compare profiles because they mask the actual biological differences. Each

experiment is subject to non-biological effects; some of these recur every time the assay

is performed, while others arise when new technical variables are introduced in an unscheduled manner (e.g. a new

lot of lab consumables). Such variations are often termed batch effects (Leek et al., 2010). Batch

effects can be caused by many factors, such as the date on which an experiment was done, the

experimenter, the machinery used, etc. While most of these factors are recorded for each

experiment, one cannot account for all variation, and even when these factors are logged, some

are very difficult to normalize. One example of an effect that is not always recorded is the level

of training of the person performing the experiment, which varies over time. Other examples are

the atmospheric ozone level on the day of the experiment, which affects certain types of

microarrays (Fare et al., 2003) (Table 1), and temperature, which affects all next generation

sequencing experiments.

Due to batch effects, correlation between experiments displays unwanted similarity according to

these effects rather than the similarity of the underlying chemical biology (Johnson and Li, 2007;

Leek et al., 2010). Comparison algorithms that do not consider batch effects provide

inaccurate correlation mapping of profiles. Some algorithms require that one defines the variable

that affects the results for an accurate comparison (Baryshnikova et al., 2010; Benito et al., 2004;

Johnson and Li, 2007; Leek et al., 2010; Mecham et al., 2010), yet these variables and their

relative impact are not always known.

To find correlation between experiments in a way that accommodates such uncertainty, I devised

a method which finds correlation between experiments without the need to define the batch

effect variables. This method is based on scaled ranks, which are scored according to a levelled

scoring matrix. The levelled scoring matrix provides a score for each gene comparison. I

evaluated the method using chemogenomic profiles (see section 1.3), and compared the method

to other existing correlation methods, including Pearson (Pearson, 1909), Spearman (Spearman,

1904), and Kendall (Kendall, 1938) correlations, which also do not require prior knowledge of

the variables that affect the results. Finally, I explored the extensibility of the Bucket Evaluations

(BE) algorithm on other microarray data and barcode sequencing data (see results). Because

many different clustering methods exist (e.g. hierarchical, PAM, etc. (Bozinov and

Rahnenfuhrer, 2002)) and each method relies on different agglomeration/division methods, each

approach can yield different results. It is therefore essential, when comparing performance, to

statistically evaluate the different profile similarity metrics irrespective of their clustering

method. I will demonstrate the performance of the BE algorithm compared to other correlation

methods, and will illustrate its applications on a variety of data types.

3. Results and Discussion

The BE algorithm is based on ranking and comparing a large number of columns within a

dataset, and was initially applied to chemogenomic profiles. To better understand the

applicability of the algorithm, consider an example from the world of spiders. There are over

40000 species of spiders around the world, living in a variety of areas ranging from the freezing

arctic to the hot deserts. Similar spider habitats are expected to have similar groups of spider

species, as these species have adapted to the same type of environments. To evaluate similarity

between spider habitats, one would compare the groups of successful species, those that are most

prosperous in numbers, of each habitat, rather than comparing the single most successful species

alone. The reason for such a comparison is that for very similar habitats A and B, the most

successful species in habitat A is not necessarily the most successful species in habitat B. One

can determine that habitats A and B are similar if the most successful species in habitat A is in

the top fifty most successful species in habitat B, as such a rank is still very high considering

there are 40000 species.

Similar to the world of spiders, comparing the effect of chemical compounds requires examining

the groups of genes affected by the chemical compounds rather than the top gene alone. There

are many differences between profiles, such as scale of results, standard deviation, and a

changing rank of gene values, even when the experiment was performed with the same

compound at the same dosage (Figure 3). These differences require analysing the ranking, not by

comparing specific ranks, but by comparing groups of ranks. A pure rank comparison, meaning

the highest value in one profile against the highest value in another profile and so on, gives poor

results because it does not take into account the variability of ranks between genome-wide

profiles.

Back to the spider world, the widow spider species can be found in dry warm areas.

Environments that have the widow spider as one of the successful species will be considered as

similar habitats, while the specific rank of the widow spider’s success can vary between

these habitats. I addressed this problem using section comparisons, dividing each profile’s gene

scores into sections, defined as buckets. The algorithm creates a weighted scoring system by

ranking sections separately, and holding a higher score for highly ranked gene scores compared

to lower ranked gene scores. Each section, or “bucket”, is defined as a subgroup of ranked scores

and itself is scored according to significance. The genes with the highest fitness defect scores are

considered the most significant for comparing profiles, as these deletion strains are the most

influenced by the chemical compound. Therefore, I define the bucket sizes in each experiment

according to significance, i.e. smaller buckets contain the most significant genes (genes with the

higher fitness defect scores and lower fitness), whereas larger buckets contain the least

significant genes (those with lower fitness defect scores and higher fitness). After the genes of

each profile were parsed into buckets, I used a levelled scoring matrix (see section 5.1) with

weighted scores for scoring similarity between profiles, and evaluated a summed similarity score

(Figure 4).
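The parsing of a profile into buckets of increasing size can be sketched as follows. The geometric growth of bucket sizes (`first_size`, `growth`) is an assumption made purely for illustration, since the bucket sizes actually used are not given in this section; the gene names and scores are likewise hypothetical.

```python
def bucket_sizes(n_genes, first_size=1, growth=2):
    """Return bucket sizes that grow geometrically and sum to n_genes.

    first_size and growth are illustrative assumptions, not values from the text.
    The last bucket takes whatever genes remain.
    """
    sizes, size, remaining = [], first_size, n_genes
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= growth
    return sizes


def assign_buckets(profile, sizes):
    """Map each gene to a 1-based bucket number by descending fitness defect score."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    buckets, start = {}, 0
    for number, size in enumerate(sizes, start=1):
        for gene in ranked[start:start + size]:
            buckets[gene] = number
        start += size
    return buckets


# Hypothetical profile: gene name -> fitness defect score.
profile = {"g1": 22.0, "g2": 0.4, "g3": 15.5, "g4": 1.2, "g5": 9.8, "g6": 0.1}
sizes = bucket_sizes(len(profile))
print(sizes)                           # [1, 2, 3]
print(assign_buckets(profile, sizes))
```

With these assumed sizes, the single most sensitive strain occupies bucket 1 alone, while the three least sensitive strains share bucket 3.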

The levelled scoring matrix guidelines consisted of awarding a higher score to genes located in

lower buckets (e.g. when comparing two experiments, a gene located in bucket 2 for both

experiments is awarded a higher score compared to a gene located in bucket 3 for both

experiments), and to genes located in closer buckets (e.g. when comparing two experiments, a

gene that is located in buckets 2 and 3 will get a higher score than a gene located in buckets 2

and 4). To implement the levelled scoring matrix guidelines, I devised a scoring matrix formula

(Table 2) which meets the requirements of the levelled scoring matrix (Table 3). These

guidelines allowed me to find resemblance between profiles in addition to identifying profiles of

repeated conditions.


Figure 4 | A simplified example of a basic implementation of BE for scoring experiments: (1) Define bucket sizes and scoring table values. (2) For each experiment, insert the strains in the relevant bucket according to rank. Each strain is listed with its bucket definition, while the values in brackets represent the fitness defect score. The fitness defect diagrams represent the buckets as coloured rectangles (red for bucket 1, green for bucket 2, and blue for bucket 3). (3) Compare each experiment to the other experiments, and score similarity according to the scoring table. In this example, there is a higher similarity between Exp1-Exp3 than between Exp2-Exp3. This example demonstrates that the BE algorithm gives greater emphasis to strains with a high value than to strains with a lower value.
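Step (3) of the caption, summing per-strain scores from a scoring table, can be sketched as below. The three-bucket table `SCORES` and the bucket assignments of the three experiments are hypothetical and are not the values shown in Figure 4; they were chosen only to respect the guidelines that lower and closer buckets score higher.

```python
# Hypothetical symmetric scoring table for three buckets (not the values of Figure 4).
SCORES = {
    (1, 1): 4.0, (1, 2): 1.0, (1, 3): 0.125,
    (2, 2): 2.0, (2, 3): 0.25,
    (3, 3): 0.5,
}


def pair_score(bucket_a, bucket_b):
    """Look up the score for one strain, ignoring which experiment is which."""
    key = (min(bucket_a, bucket_b), max(bucket_a, bucket_b))
    return SCORES[key]


def be_similarity(buckets_a, buckets_b):
    """Sum the per-strain scores over the strains present in both experiments."""
    shared = buckets_a.keys() & buckets_b.keys()
    return sum(pair_score(buckets_a[s], buckets_b[s]) for s in shared)


# Output of step (2): strain -> bucket number for three hypothetical experiments.
exp1 = {"s1": 1, "s2": 2, "s3": 2, "s4": 3, "s5": 3, "s6": 3}
exp2 = {"s1": 3, "s2": 3, "s3": 2, "s4": 1, "s5": 2, "s6": 3}
exp3 = {"s1": 1, "s2": 2, "s3": 3, "s4": 2, "s5": 3, "s6": 3}

print(be_similarity(exp1, exp3))  # 7.5
print(be_similarity(exp2, exp3))  # 2.375
```

Most of the Exp1-Exp3 score comes from strain s1, which sits in bucket 1 in both experiments; this is the emphasis on the most sensitive strains that the caption describes.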

Table 2

Buckets   1              2                      3                      ...
1         $2^{(n-1)}$    $S_{(c-1,c)}/c$
2                        $S_{(c-1,c-1)}/c$
3                                               $S_{(c-1,c-1)}/c$
...                                                                    ...

Table 2 | A scoring matrix formula in accordance with the guidelines needed for BE scoring. The top score (bucket 1 vs. bucket 1) depends on the total number of buckets (n) in order to achieve a wide spread of scores throughout the table. For example, the range of scores for n=5 buckets is from $S_{(1,5)} = 2.1 \cdot 10^{-4}$ to $S_{(1,1)} = 2^{(5-1)} = 16$, while the range of scores for 11 buckets is from $S_{(1,11)} = 9.9 \cdot 10^{-16}$ to $S_{(1,1)} = 2^{(11-1)} = 1024$ (as seen in Table 3).

n = Total number of buckets
c = Current bucket column
$S_{(i,j)}$ = Score for comparing bucket i to bucket j.

Table 3

Bucket 1 2 3 4 5 6 7 8 9 10 11

1 1024 256 18.96296 0.666667 0.013653 0.000183 1.727E-06 1.211E-08 6.555E-11 2.822E-13 9.89E-16

2 256 512 56.88889 2.666667 0.068267 0.001097 1.209E-05 9.688E-08 5.9E-10 2.822E-12 1.088E-14

3 18.96296 56.88889 170.6667 10.66667 0.341333 0.006584 8.462E-05 7.75E-07 5.31E-09 2.822E-11 1.197E-13

4 0.666667 2.666667 10.66667 42.66667 1.706667 0.039506 0.0005923 6.2E-06 4.779E-08 2.822E-10 1.316E-12

5 0.013653 0.068267 0.341333 1.706667 8.533333 0.237037 0.0041464 4.96E-05 4.301E-07 2.822E-09 1.448E-11

6 0.000183 0.001097 0.006584 0.039506 0.237037 1.422222 0.0290249 0.0003968 3.871E-06 2.822E-08 1.593E-10

7 1.73E-06 1.21E-05 8.46E-05 0.000592 0.004146 0.029025 0.2031746 0.0031746 3.484E-05 2.822E-07 1.752E-09

8 1.21E-08 9.69E-08 7.75E-07 6.2E-06 4.96E-05 0.000397 0.0031746 0.0253968 0.0003135 2.822E-06 1.927E-08

9 6.56E-11 5.9E-10 5.31E-09 4.78E-08 4.3E-07 3.87E-06 3.484E-05 0.0003135 0.0028219 2.822E-05 2.12E-07

10 2.82E-13 2.82E-12 2.82E-11 2.82E-10 2.82E-09 2.82E-08 2.822E-07 2.822E-06 2.822E-05 0.0002822 2.332E-06

11 9.89E-16 1.09E-14 1.2E-13 1.32E-12 1.45E-11 1.59E-10 1.752E-09 1.927E-08 2.12E-07 2.332E-06 2.565E-05

Table 3 | Implementation example of the scoring matrix (Table 2) where the number of buckets (n) equals 11 (therefore $S_{(1,1)} = 2^{(n-1)} = 1024$). The cell colour, ranging from red to green, indicates the significance of a similarity score when comparing gene ranks between experiments. The most significant buckets hold few genes (buckets are smaller in size), yet have the potential of receiving the highest scores (shown in green) giving more significance to the most sensitive genes, providing that the most sensitive genes appear in close buckets for both experiments being compared (such as the scores in the blue rectangle). If a gene is in distant buckets, the score is lower, i.e. a strain in bucket 6 in both experiments is scored 1.42, while a strain in bucket 6 in one experiment, and in bucket 5 in another is scored 0.237. For hits in the same bucket, the score will be more significant for a lower bucket, i.e. a strain in bucket 2 in both experiments will get a score of 512, while a strain in bucket 4 in both experiments will get a score of 42.67.
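The values in Table 3 can be regenerated with a short sketch. The recurrences used below are my reading of the table rather than a formula quoted from the text: the top-left cell is 2^(n-1), each diagonal cell is the previous diagonal cell divided by its column number c, each cell above the diagonal is the cell directly beneath it divided by c, and the matrix is symmetric. The function name `scoring_matrix` is hypothetical.

```python
def scoring_matrix(n):
    """Build an n x n symmetric scoring matrix as a 0-based list of lists."""
    s = [[0.0] * n for _ in range(n)]
    s[0][0] = 2.0 ** (n - 1)                   # bucket 1 vs. bucket 1
    for c in range(2, n + 1):                  # c is the 1-based bucket column
        s[c - 1][c - 1] = s[c - 2][c - 2] / c  # diagonal: S(c,c) = S(c-1,c-1) / c
        for r in range(c - 1, 0, -1):          # walk up the column from the diagonal
            s[r - 1][c - 1] = s[r][c - 1] / c  # S(r,c) = S(r+1,c) / c
            s[c - 1][r - 1] = s[r - 1][c - 1]  # mirror into the lower triangle
    return s


m = scoring_matrix(11)
print(m[0][0], m[1][1], m[0][1])  # 1024.0 512.0 256.0
print(round(m[5][5], 6))          # bucket 6 vs. bucket 6
print(round(m[5][4], 6))          # bucket 6 vs. bucket 5
```

Under these recurrences, any cell with r ≤ c has the closed form $S_{(r,c)} = 2^{(n-1)} / (c! \cdot c^{(c-r)})$, which makes the rapid fall-off away from the top-left corner explicit.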

3.1. TAG4 Barcode Microarray Dataset

I ran the BE method on a dataset of TAG4 barcode microarray results (see section 5.1), which

included platinum based novel chemical compounds, in addition to well characterised

compounds, such as cisplatin. The dataset was created by screening novel platinum-acridine

conjugates in addition to known DNA-damaging chemical compounds against the complete pool

of ~6,000 barcoded deletion strains of Saccharomyces cerevisiae, 1200 essential genes as

heterozygous diploids and 4800 non-essential genes as homozygous diploids, producing unique

genome-wide profiles (Cheung-Ong et al., In review; Giaever et al., 2002; Giaever et al., 2004;

Giaever et al., 1999; Winzeler et al., 1999). I used several correlation methods, including Pearson

(Pearson, 1909), Spearman (Spearman, 1904) and Kendall (Kendall, 1938), for finding

similarities between the compounds. I then assessed their performance according to batching of

dates, an unwanted cluster outcome, versus batching by chemical compounds, a wanted cluster

outcome (Figure 5, Figure 6). The findings showed the BE method performed better than other

methods, providing an understanding of the mechanism of action of new chemical compounds by

comparing them to better known chemical compounds.

I statistically assessed the distribution of similarity scores generated by each of the algorithms by

using the Wilcoxon test (Figure 7) (Wilcoxon, 1945). Typically, when clustering experiments to

evaluate similarity one would like to see experiments cluster according to experimental factors,

i.e. chemical compound or mechanism of action, and not according to the date of the experiment,

for example. To assess whether the date of the experiment had an effect in batching the scores, I

used a two-sided Wilcoxon test on two vectors. The first vector contained the similarity scores of

pairs of experiments performed on the same date, and the second vector contained scores of pairs

of experiments performed on different dates. The graphs represent the distribution of similarity

scores of both vectors (Figure 7a, 7c, 7e, 7g). They demonstrate a statistically

significant shift in the distribution of scores between the two vectors when Pearson, Spearman or

Kendall algorithms are used (p-values 10^[…] to 10^[…], Figure 7a, 7c, 7e), indicating a strong

unwanted effect of the experiment’s date on the outcome. In contrast, the BE algorithm was not

significantly affected by date (p>0.05, Figure 7g). Indeed, the statistical evaluation confirmed

that, compared to these other methods, the BE algorithm was least influenced by the date of the

experiment, visualized as a highly similar distribution of scores for same dates and different

dates. This is because BE compares groups of genes, rather than single gene ranks (Figure 7g). I

next evaluated whether the chemical compound used in an experiment had an effect in batching

the scores, using the Wilcoxon test. I used two vectors: the first contained similarity scores for

pairs of experiments performed with the same chemical compound, and the second contained

scores of experiment pairs performed using different compounds (Figure 7b, 7d, 7f, 7h).

Repeated experiments, using the same chemical compound, received higher similarity scores

compared to experiments using different chemical compounds. The graphs represent the

distribution of similarity scores of both vectors, and demonstrate a statistically significant shift in

distribution for all algorithms used, indicating all methods used are affected by the chemical

compound present. This shift was most pronounced for the BE algorithm, which attained the lowest p-value

(p=8.28e-23, W=40060) compared to the other methods (1.89e-10<p<0.0041,

26396<W<33347), confirming that the chemical compound has the strongest effect on the

batching of scores rated by the BE method. This is seen where the distribution of scores for different

compounds is much lower than the distribution of scores for identical compounds (Figure 7h).

To summarize this application of the BE algorithm, BE showed a clear difference in the

distributions of scores between date and chemical compound, showing date has less effect on the

BE method (Figure 7g), while chemical compounds have a strong effect on the BE method

(Figure 7h). On the other hand, the differences in score distribution for each one of the

correlation methods other than BE look similar for both date and chemical compound, which

means that experiments performed on the same date receive a score distribution nearly as high as

experiments where the same chemical compound was used (Figure 7a-b, 7c-d, 7e-f).
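The construction of the two vectors and the rank-based comparison can be sketched as follows. The section does not state which implementation of the Wilcoxon test was used; the version below is a simplified two-sample rank-sum test that relies on the normal approximation and omits tie and continuity corrections, so its p-values are approximate. The experiment names, dates and similarity scores are hypothetical.

```python
import math
from itertools import combinations


def split_pair_scores(similarity, labels):
    """Split pairwise similarity scores by whether two experiments share a label."""
    same, different = [], []
    for a, b in combinations(sorted(labels), 2):
        score = similarity[(a, b)]
        (same if labels[a] == labels[b] else different).append(score)
    return same, different


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (W, p), where W is the rank sum of x minus its minimum possible value.
    No tie or continuity correction is applied to the variance.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # tied values share their average rank
        rank_sum_x += mean_rank * sum(1 for k in range(i, j + 1) if pooled[k][1] == 0)
        i = j + 1
    n1, n2 = len(x), len(y)
    w = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return w, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical experiments labelled by date, with pairwise similarity scores.
dates = {"e1": "d1", "e2": "d1", "e3": "d2", "e4": "d2"}
similarity = {
    ("e1", "e2"): 0.9, ("e1", "e3"): 0.1, ("e1", "e4"): 0.2,
    ("e2", "e3"): 0.3, ("e2", "e4"): 0.4, ("e3", "e4"): 0.8,
}

same_date, different_date = split_pair_scores(similarity, dates)
w, p = rank_sum_test(same_date, different_date)
print(w, round(p, 3))
```

Passing a dictionary of chemical compound labels instead of `dates` yields the second comparison described above. A similarity measure that is robust to batch effects should give a large p-value for dates and a small one for compounds.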

Figure 5 (panels). Left column: cluster by date; right column: cluster by chemical compound. Ideal outcome: a, b. Random outcome: c, d.

Figure 5 | Expected results of an ideal outcome and a random outcome. The left column displays the cluster of experiments where the labels are the dates on which the experiment was performed (a, c). Adjacent identical dates are displayed in a red rectangle to indicate when clustering occurs by date. The right column displays the cluster of experiments where the labels are the chemical compound that was used for each experiment (b, d). Adjacent identical chemical compounds are displayed in a green rectangle as shown in the legend, to indicate when the same chemical compounds are clustering together. The ideal result shows that experiments, performed using the same chemical compound, cluster together according to chemical compounds, where each cluster can be seen in a green rectangle (b). The ideal result also shows that the experiments cluster by date only when they were performed using the same chemical compound (a). The random score did not cluster any of the experiments according to chemical compound (d), and clustered experiments by date only by chance.

Figure 6 (panels). Left column: cluster by date; right column: cluster by chemical compound. Pearson: a, b. Spearman: c, d. Kendall: e, f. BE: g, h.

Figure 6 | Four correlation methods applied to the same dataset were clustered to show the performance of BE compared to other methods. The left column displays the cluster of experiments where the labels are the dates on which the experiment was performed (a, c, e, g). Adjacent identical dates are displayed in a red rectangle to indicate when clustering occurs by date. The right column displays the cluster of experiments where the labels are the chemical compound that was used for each experiment (b, d, f, h). Adjacent identical chemical compounds are displayed in a green rectangle to indicate when the same chemical compounds are clustering together. The desired result of a cluster is that similar conditions will cluster together. Examining the Pearson correlation cluster, the experiments cluster by date (a), due to a date batch effect. The BE method minimized the batch effect where identical dates did not cluster together (g), while identical conditions (chemical compounds) did cluster together (h).

Figure 7 (panels). Left column: by date; right column: by chemical compound. Pearson: a, b. Spearman: c, d. Kendall: e, f. BE: g, h.

Figure 7 | The BE algorithm is least affected by the experiment date and most affected by the chemical compound used in the experiment. The graphs show the distribution of scores. The graphs in the left column represent results affected by date (a, c, e, g). The solid blue line represents the score distribution of experiment pairs performed on identical dates, and the broken red line represents the score distribution of experiment pairs performed on different dates (a, c, e, g). The distributions according to date are significantly different for Pearson, Spearman and Kendall correlations (a, c, e), whereas the distributions by date are similar for BE correlation (g), meaning the scores for experiments done on the same date were highly comparable to the scores for experiments done on different dates. The graphs in the right column represent the score distributions affected by chemical compound (b, d, f, h). The solid blue line represents the score distribution of experiment pairs using identical chemical compounds, and the broken red line represents the score distribution of experiment pairs using different chemical compounds. All methods show that the distribution of scores for the same chemical compound is significantly different from the distribution of scores for different chemical compounds, signifying, as expected, that all methods are affected by the chemical compound. The BE method shows the most significant difference in distribution compared to the other methods (h), being most affected by the chemical compound.

3.2. TAG3 Microarray 2004 PNAS Dataset

In order to evaluate the BE method on other types of datasets, I tested the method on a dataset

which included 80 published microarray results for 10 different FDA approved drugs including

anticancer and antifungal agents, statins, alverine citrate, and dyclonine (Giaever et al., 2004).

The assay used Haploinsufficiency Profiling, which comprises 6200 diploid heterozygous yeast

strains that can be sensitized to compounds that inhibit the product of the heterozygous locus.

Sensitization was achieved by lowering gene dosage from two copies to one copy in each yeast

heterozygous deletion strain, and each strain was identified by a unique barcode sequence using TAG3

microarrays (see section 1.3) (Giaever et al., 1999). This dataset consisted of 4 to 16 replicate

experiments for each drug. The BE algorithm successfully located similarity between drugs

(Table 4), recapitulating the previously reported similarity between three drugs: alverine-citrate,

dyclonine, and fenpropimorph (Giaever et al., 2004), demonstrating the accuracy of the

algorithm (Figure 8d). In the original study, the similarity between drugs was found using a

parametric method that set a threshold to ignore genes with low fitness defects (<3SD) (Giaever

et al., 2004), while the BE method is non-parametric and did not ignore any genes for scoring

similarity between experiments. I assessed the similarity results using other methods, including

Pearson, Spearman and Kendall correlations, which all found similarity between these drugs.

However, BE was the only method which found the three drugs as most similar to one another

(Table 4, Figure 8). All methods found the replicate experiments as most similar to one another,

scoring the drug itself within the top two most similar drugs.

All methods found alverine-citrate, dyclonine, and fenpropimorph drugs as highly similar, with

the closest result to BE being the Pearson correlation method. The Pearson results showed high

similarity between the three drugs, with a higher similarity score occurring between dyclonine

and miconazole (Figure 8a). BE found miconazole as the next most similar drug to the three

mentioned drugs (Figure 8d), which suggests there is also some similarity in structure and mode

of action between the three mentioned drugs and miconazole. These lower level similarities are

found by BE when less significant buckets hold the same genes.

Table 4

Method     alverine-citrate   dyclonine    fenpropimorph   Total Identification
Pearson    3/3 (100%)         2/3 (67%)    3/3 (100%)      89%
Spearman   2/3 (67%)          2/3 (67%)    3/3 (100%)      78%
Kendall    2/3 (67%)          2/3 (67%)    3/3 (100%)      78%
BE         3/3 (100%)         3/3 (100%)   3/3 (100%)      100%

Table 4 | Top three drug similarity scores of the group of drugs that were reported as similar. Each drug column gives the number of reported drugs that were among the top three highest scores for that drug. For example, Pearson correlation showed alverine-citrate experiments as most similar to all three reported drugs: alverine-citrate, dyclonine and fenpropimorph. BE is the only method which identified the similarity for all drugs (100%), recapitulating the previously reported similarity of alverine-citrate, dyclonine and fenpropimorph.
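The "Total Identification" column of Table 4 appears consistent with pooling the top-three hits over the three drugs and dividing by the nine possible hits. The short check below is a sketch under that assumption; the counts are copied from the table, while the function name and the rounding to whole percentages are illustrative choices.

```python
# Counts copied from Table 4: for each method, how many of the three reported
# drugs appear in the top three for alverine-citrate, dyclonine, fenpropimorph.
top3_hits = {
    "Pearson": (3, 2, 3),
    "Spearman": (2, 2, 3),
    "Kendall": (2, 2, 3),
    "BE": (3, 3, 3),
}


def total_identification(hits, per_drug=3):
    """Percentage of possible top-three hits that were found (assumed pooling)."""
    return round(100 * sum(hits) / (per_drug * len(hits)))


totals = {method: total_identification(hits) for method, hits in top3_hits.items()}
for method, pct in totals.items():
    print(f"{method}: {pct}%")
```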

Figure 8 | A comparison of barcode TAG3 microarray similarity results between a variety of correlation methods including Pearson (a), Spearman (b), Kendall (c) and BE (d). Each colour represents a drug, and each column represents the similarity scores of one drug to the other drugs, using coloured bars according to the compared drug. An example of a column is seen in panel a, showing similarity levels to alverine citrate as calculated using Pearson correlation. Each bar represents a different drug, and the size of each bar represents the level of similarity to alverine citrate as a percentage of the top score of the method used (e). To recapitulate the previously reported similarity between three drugs (alverine-citrate, dyclonine, and fenpropimorph), I used different methods and ascertained that all methods found similarity between these drugs, as seen in the orange (alverine-citrate), green (dyclonine) and blue (fenpropimorph) bars. For these drugs, the top three most similar drugs are listed within the drug’s similarity column for each method. For the BE method, the top three values for these compounds are the three compounds themselves, where the similarity between these structurally similar drugs is explained by a similar mode of action (d). BE was the only method where all three drugs shared the same top three similar drugs.


3.3. Gene Expression (Transcript Abundance) Dataset

Having shown BE works on barcode data from different studies, I next evaluated the BE method

on an entirely different data type, genome-wide expression profiles from yeast. In this instance,

gene expression is the measurement of transcript abundance, which is used as a proxy to measure

the relative transcriptional activity of genes. Using microarrays, this process allows analyzing

thousands of genes at once, creating a global picture of transcript abundance (see section 1.1).

For this analysis I selected the widely used study of Gasch et al. which contains microarray

results for 173 environmental stress experiments for all ~6000 genes (Gasch et al., 2000). These data comprise the genomic expression responses of Saccharomyces cerevisiae to diverse environmental conditions such as heat shock, oxidative and reductive stress, osmotic shock, nutrient starvation, DNA damage and extreme pH. In this dataset, high correlation scores between genes, represented by the transcript abundance measured, are indicative of a shared response to stress. These data were initially analyzed using fuzzy k-means (Gasch and

Eisen, 2002), a method that differs from the standard k-means, as it provides a membership

value for each gene to a centroid. Such membership permits each gene (scored according to

transcript abundance) to belong to more than one centroid as it may be co-regulated with several

groups. Gasch and co-workers used prior knowledge about the data to select the k value according to the expected number of clusters, and chose the initial centroid locations according to known regulatory elements; I therefore used their results as a benchmark. The BE method positions the most affected genes, those with the highest score as represented by transcript abundance, in the top significant buckets. Experiments with shared top genes therefore receive a high score when their buckets are compared, which resulted in a high correlation score specifically between groups of highly affected genes. This confirmed the previously reported group of ~900 specific genes which were found to be strongly affected throughout all stress treatments (Figure 9). This group of environmental stress response genes represents a common gene expression response to stress, helps in understanding the cell's response to stress, and helps reflect the bias in experimental gene studies due to the activity of these genes in unfavourable conditions (Giaever et al., 2002). Furthermore,

the BE score cluster allowed dividing the specific 868 genes mentioned by Gasch et al. into two

groups of 586 and 282 genes, where each group was affected counter to the other group (Figure

10) (Berry and Gasch, 2008). The affected genes received significantly greater scores than the less affected genes (P<2e-16; Figure 9c, Figure 9f). These findings suggest that one can use the BE algorithm to locate unique groups of genes that display a similar pattern of behaviour within certain experimental conditions, i.e. stress conditions or the presence of chemical compounds. For locating groups of similarly affected genes, the BE method was found to perform as well as other correlation methods, including Pearson, Spearman and Kendall, which also assigned significantly higher scores to the reported genes (Figure 11), presenting an additional application of the method.

Figure 9 | In order to locate genes of interest, the BE method was executed on a dataset of yeast response to environmental changes. Because both negative values and positive values are meaningful, I created two datasets: one included all positive values (negative values were set to 0), and the second included all negative values, set to their absolute value (positive values were set to 0). Results show how the BE method successfully located the most affected genes, according to measured transcript abundance, confirming the 586 positively affected genes (9a) and the 282 negatively affected genes (9d), which are marked in yellow in the ranked scores and seen as the exceedingly affected genes. The higher scores that the 868 genes received compared to other genes can be seen in light green for both positive (9b) and negative (9e) scores. The 868 genes received significantly greater scores than other genes for both positively (9c, P<2e-16) and negatively (9f, P<2e-16) affected genes, where the full green line represents the positively affected, induced genes (9c) or the negatively affected, repressed genes (9f), and the fragmented red line represents the rest of the genes. The distribution of scores for the less affected genes displays two peaks due to lower scores for the negative genes compared to the other genes, seen as two dark stripes (9b) and marked in blue at the low end scores (9a).

Figure 10 | Differentiation between the induced and repressed genes within the group of ~900 genes in the Gasch et al. dataset, represented by transcript abundance measured. Running the BE method separates the induced and repressed genes within the group of ~900 genes by clustering them into 2 separate branches in the dendrogram (10a). These genes are anti-correlated, wherein ~300 genes are either repressed or induced in an anti-correlated manner to ~600 genes, depending on the stress experiments performed (10b).

Figure 11 | The distribution of scores of the Gasch et al. study dataset. The green line represents the score distribution of the previously reported group of genes found to be significantly affected by the stress treatments. For the negative score dataset (a, b, c, d), the green line represents the group of ~300 repressed genes, and for the positive score dataset (e, f, g, h), the green line represents the group of ~600 induced genes. The fragmented red line represents the score distribution of the genes other than the reported group of genes. The methods used for comparing the score distribution included BE, Pearson, Spearman and Kendall correlations. All methods showed significantly higher scores for the reported genes (similar W statistic value), successfully locating the affected genes. The BE method performed as well as other methods in identifying the affected group of genes. Moreover, it differentiated the lower results and identified anti-correlation between the two groups of ~300 and ~600 affected genes by showing two peaks for the lower scores.


3.4. High Throughput Sequencing Dataset

An additional type of dataset on which I evaluated the BE method was high throughput sequencing data of chemogenomic profiles, generated in a manner similar to that described for my initial test (see section 1.3). The fitness of the yeast strains was assessed using SOLiD sequencing in a

multiplex format, allowing sequencing of many experiments concurrently (Smith et al., 2010).

For this method, each strain carries a strain specific barcode. In addition, each individual experiment carries a second, unique barcode (the multiplexing tag), so that one can simultaneously identify both the strain and the experiment to which a sequence belongs. The sequencing results consisted of counts of barcode sequences representing the abundance of strains for each experiment. The fitness defects are expressed as a log2 ratio of the strain specific barcode counts in the treatment versus the mock (control) condition. This produced a sequencing result matrix of strain fitness, which provided the input dataset for BE. I ran the algorithm on 12 experiments, which included 4 repeated experiments for each of 3 different drugs. The BE method successfully clustered the repeated experiments together according to the drug (Figure 12a). Same drug experiments had significantly higher scores than different drug experiments (P=1.27e-20; Figure 12b). Such findings are significant as they confirm that one can use the BE method to compare different chemical compounds using data originating from high throughput sequencing experiments. The BE method performed better than the Pearson correlation method (seen in the clustering of repeated experiments in Figure 13a compared to Figure 13d), and as well as non-parametric methods including Spearman and Kendall correlations (Figure 12, Figure 13, Figure 14). This is an important result as many experimental methods have recently been superseded by sequencing alternatives, such as assessing abundance of yeast deletion strains using barcodes (Smith et al., 2009), mapping of the yeast genome (Nagalakshmi et al., 2008), stem cell transcriptome profiling (Cloonan et al., 2008), mammalian cell transcriptome mapping (Mortazavi et al., 2008) and epigenetics studies of plants (Lister et al., 2008).
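The conversion of barcode counts into a log2 ratio against the mock condition can be sketched as follows. This is only an illustration: the strain names and counts are hypothetical, the treatment/control orientation of the ratio is assumed, and the pseudocount (used to avoid dividing by zero) is an assumption that the text does not describe.

```python
import math


def log2_fitness_ratios(treatment_counts, control_counts, pseudocount=1):
    """Return log2(treatment / control) for every strain barcode.

    The pseudocount is an assumption of this sketch: it avoids division by
    zero (and log of zero) when a barcode has no reads in one condition.
    """
    ratios = {}
    for strain, control in control_counts.items():
        treated = treatment_counts.get(strain, 0)
        ratios[strain] = math.log2((treated + pseudocount) / (control + pseudocount))
    return ratios


# Hypothetical barcode counts for three strains (not taken from the study).
control = {"strainA": 799, "strainB": 399, "strainC": 99}
treated = {"strainA": 99, "strainB": 399, "strainC": 199}

ratios = log2_fitness_ratios(treated, control)
for strain, value in ratios.items():
    print(strain, value)
```

With this orientation a depleted strain receives a negative value; a matrix of such values, one column per experiment, is the kind of input the BE comparison operates on.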

I implemented the BE method as a program with a graphical user interface. The application loads an input dataset, provided by the user, and produces a similarity matrix according to the BE variable definitions.

Figure 12 | Running the BE method on high throughput sequencing data successfully clusters experiments using the same drug (a). I used the Wilcoxon test to compare the distributions (b) of same drug experiment scores (green line) and different drug experiment scores (red line). These results showed that same drug experiments received significantly higher scores than different drug experiments (P=1.27e-20).

Figure 13 | A comparison of several methods, including Pearson (a), Spearman (b), Kendall (c) and BE (d), for finding correlations between barcode sequencing experiments. A heat-map and dendrogram display the clustering of experiments for each method. For the BE, Spearman and Kendall methods, all experiments that were performed using the same drug clustered together, showing that BE (d) performed as well as other non-parametric methods, including Spearman (b) and Kendall (c). BE performed better than the Pearson correlation (a), where not all same-drug experiments clustered together.

Figure 14 | The distribution of correlation scores of barcode sequencing experiments for several methods, including Pearson (a), Spearman (b), Kendall (c) and BE (d). The full green line represents the similarity score distribution of experiments performed using the same drug, while the fragmented red line represents the score distribution of experiments performed using different drugs. All methods give significantly greater scores to experiments performed using the same drug.


4. Conclusions

Rigorous evaluations on several datasets, which included TAG4 microarrays, TAG3

microarrays, gene expression microarrays, and high throughput sequencing data, show that the

BE algorithm overcomes the batch effects (Figure 6). I confirmed that the BE algorithm

outperforms other well-established methods, by statistically validating the differences of score

distributions, and comparing these differences between the BE method and other methods

(Figure 7). Clustering of results showed the BE algorithm successfully identified similar

conditions for microarray and sequencing data (Figure 6, Figure 8d and Figure 12). The BE

method performed as well as other methods by successfully locating the group of key genes as

most sensitive to environmental changes, by attaining the highest similarity scores, confirming

the findings of Gasch et al. (Gasch et al., 2000) (Figure 9). The BE algorithm can thus provide

another analytical tool to aid in the understanding of the mechanism of action of characterized

and uncharacterised compounds according to similarity between compounds, and by learning the

gene targets of specific experimental conditions (chemical compound in use or environmental

changes). Similarity of an unknown chemical compound to other known drugs suggests a similar

mode of action and provides information about possible applications of the unknown chemical

compound. Similarity of a drug to other known drugs can suggest additional applications and a better understanding of the mode of action of that drug.

Having tested the BE method on data arising from different technological platforms, I conclude that the method is applicable to other datasets where correlation between values is needed. The BE variables can be changed to fine tune the method for different datasets; e.g. for high throughput sequencing data I modified the first bucket size to be 0.05% of the total number of genes, and set the maximum number of buckets to 20. In general, achieving accurate correlation of results may involve changing these variables (as explained in section 6.2.1). The general concept of bucket weighted scores can therefore be applicable to both groups of highly similar profiles and diverse matrices, according to the definition of the variables. This method may also be applicable to data collected from emerging technologies, such as next generation sequencing, as finding correlation between results will continue to be beneficial (Smith et al., 2010).

I note that despite being applicable to many dataset models, like any algorithm it may not suit all datasets. When considering whether to use the BE method or other methods instead, one should take into account several factors. The first is whether both positive and negative values in the data are significant. As the BE method evaluates scores according to rank, datasets in which both positive and negative values are significant are not analyzed properly, because negative values are appraised as insignificant relative to positive values. For example, a genomic expression dataset can hold positive scores for induced genes and negative scores for repressed genes, represented by transcript abundance. Both positive and negative values are then significant, as they both show a change in cell response to the conditions measured in the experiment. A possible way to surmount this problem, which I used in my study, is to create two datasets from the original dataset. The first dataset holds the positive values, and the second dataset holds the absolute values of the original negative values, with the original positive values removed. Running separate analyses for positive and negative values can be sufficient for locating the affected genes, represented by transcript abundance. Such a solution is not ideal, however, as it adds several steps to the analysis, and may not be accurate in cases where none of the genes, represented by transcript abundance, were repressed in the experimental conditions.
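The two-dataset workaround described above can be sketched in a few lines. The function name and the example values are hypothetical; setting the removed values to 0 follows the description in the Figure 9 caption.

```python
def split_by_sign(matrix):
    """Split a numeric matrix into two non-negative matrices.

    positives keeps values > 0 and sets everything else to 0;
    negatives keeps the absolute value of values < 0 and sets the rest to 0.
    """
    positives = [[v if v > 0 else 0 for v in row] for row in matrix]
    negatives = [[-v if v < 0 else 0 for v in row] for row in matrix]
    return positives, negatives


# Hypothetical log-ratio expression values: rows are genes, columns experiments.
expression = [
    [2.5, -0.3, 1.1],
    [-1.8, -2.2, 0.0],
    [0.4, 3.0, -0.9],
]
induced, repressed = split_by_sign(expression)
print(induced)
print(repressed)
```

Each of the two matrices can then be analysed separately, since in both of them a larger value means a stronger effect.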

The second factor is whether there is prior data regarding the dataset which the user wishes to

take into account when assessing similarity between experiments. An example is the work done

by Gasch and co-workers (see section 3.3), in which they wished to filter out highly regulated

genes. To do so, Gasch and co-workers used the fuzzy k-means method, which uses prior

knowledge about the expected number of clusters, and regulatory elements (see section 3.3). This

resulted in filtering out many genes that are highly co-regulated, based on prior knowledge of the

regulation factors. If the user wishes to ignore subsections of the dataset, the BE method is not suitable, as it is specifically designed to avoid the need for prior knowledge about the dataset, and to analyse the entire dataset in order to maximize the number of scientifically significant results that can be discovered in it. If the user insists on using the BE method, the dataset would have to be edited to exclude the data the user wishes to ignore; however, this approach is not as straightforward as using a method that relies on prior knowledge.

The researcher should opt to use the BE method when he/she is not interested in including prior knowledge in the analysis or when no prior knowledge is available, when the scores are deemed more significant as their value increases, and when he/she wants to include all the gene scores in the dataset without restricting them using a threshold value. The researcher can safely choose to use the BE method, as comparison to various methods showed that BE consistently performs better than or as well as other parametric and non-parametric methods. In the TAG3 microarray dataset (see section 3.2), the BE method clearly performed better than the other non-parametric methods, while Pearson correlation, a parametric method, performed almost as well as the BE method. Furthermore, in the high throughput sequencing dataset results (see section 3.4), the BE method’s performance was comparable to that of the non-parametric methods Spearman and Kendall in terms of statistical results (BE: $W = 4484$, $P \approx 10^{-20}$, consistent with the P=1.27e-20 reported in Figure 12b; Spearman and Kendall: W = 4608, P ≈ 10��). On the other hand, Pearson correlation performed worse than the non-parametric methods on this dataset, with a smaller statistic (W = 3640, P ≈ 10��).


5. Methods

5.1. Levelled scoring matrix

The levelled scoring matrix is constructed of decreasing scores, from high scores for a gene in closely ranked groups (buckets) to low scores for a gene in distant groups (buckets). When comparing profiles, the score matrix yields the score $S_{i,j}$ for a gene located in bucket $i$ in one of the compared profiles and in bucket $j$ in the other. For a score of $S_{i,j}$, the scoring matrix follows these guidelines:

(1) For each experiment, the strains are divided into buckets. The buckets are ordered by importance, so that a lower-numbered bucket holds the strains with the highest fitness defect.

(2) Assign higher scores for hits in different experiments which fall within the same bucket, while taking into consideration that first buckets are more significant than last buckets, where $S_{i,j}$, for experiments $Exp_1$ and $Exp_2$, is the score of a fitness defect strain which is located in bucket $i$ in $Exp_1$, and in bucket $j$ in $Exp_2$.

(3) $\forall i, j \mid i < j \Rightarrow S_{i,i} > S_{j,j}$. For example: $S_{1,1} > S_{2,2}$.

(4) Assign a higher score for hits in closer buckets: $\forall i, j, k \mid i < j < k \Rightarrow S_{i,j} > S_{i,k}$. For example: $S_{2,3} > S_{2,4}$.
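To make guidelines (3) and (4) concrete, the sketch below builds one possible matrix that satisfies both. The scoring formula is a hypothetical choice made only for illustration and is not the scoring used by the BE software; only the two inequalities come from the text.

```python
from fractions import Fraction


def levelled_scoring_matrix(num_buckets):
    """Build a hypothetical score table S[(i, j)] for buckets 1..num_buckets.

    Illustrative formula (an assumption, not the thesis's own): start from
    the weight of the more significant of the two buckets and divide it by
    one plus the distance between the buckets.
    """
    scores = {}
    for i in range(1, num_buckets + 1):
        for j in range(1, num_buckets + 1):
            weight = num_buckets - min(i, j) + 1
            scores[(i, j)] = Fraction(weight, 1 + abs(i - j))
    return scores


S = levelled_scoring_matrix(5)

# Guideline (3): matches in earlier buckets score higher than in later ones.
print(S[(1, 1)] > S[(2, 2)])
# Guideline (4): closer buckets score higher than more distant ones.
print(S[(2, 3)] > S[(2, 4)])
```

Exact fractions are used so that the inequalities can be checked without floating point rounding.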


5.2. Software imaging and implementation

Images and analysis were created using R (R Development Core Team, 2011). Figure 3b was created using SPSS. The

BE software was developed using C# .NET 3.0 Framework. The software is available for

download at: http://chemogenomics.med.utoronto.ca/supplemental/BE/.


6. Bucket Evaluations Software

In order to create a program for running the BE algorithm, I took into consideration several

design approaches, such as: (1) creating a program that allows the user to decide on resource

allocation for better performance according to the hardware abilities, and (2) an easy-to-use

graphical user interface (GUI) (Figure 15).

The program includes a multithreaded architecture that allows the GUI to remain active, while

multiple threads are executing the analysis on the dataset provided (Figure 16). The design of the

program consisted of 3 independent threads, including the main GUI window (MGUIW) thread,

information form thread, and the BE Thread Manager (BETM). In addition to these threads are the dataset analysis threads. The number of Dataset Threads (DT) is a variable set by the user (Figure 16).

Figure 15 | Bucket Evaluations Software Graphical User Interface. This image shows the main window, which is displayed to the user. It guides the user through the steps needed to load a dataset (step 1), and to run the BE algorithm and produce a similarity matrix as a file (step 3). This window gives the user an option to set the algorithm’s variables (step 2), and provides help for each of the variables using a tool-tip hover button.

Figure 16 | Bucket Evaluations Software Architecture. Each rectangle represents a class in the program. The green, blue and orange rectangles represent separate threads. The Thread Barrier (purple rectangle) is a separate object used by the BE Thread Manager. The classes shown are the Main GUI Window, the Info Form, the BE Thread Manager, the Thread Barrier, and multiple Dataset Threads.

6.1. User Experience

Running the analysis requires three steps, as described in the MGUIW (Figure 15):

1. Choose a data source file for which you wish to create a similarity matrix. Clicking the

button opens a file browser for choosing the data source file (Figure 17). The

data source file should be a standard tab delimited file, which includes the column names,

row names and numeric data results.

2. Set algorithm variables according to the data type you are using. As previously mentioned, results may be more accurate after adjusting these values. It is therefore recommended to run the source dataset multiple times with different values. This allows the variables to be fine tuned to best fit the user’s data source.

3. Clicking the button opens a window requesting a file save location from the user (Figure 18). The program writes the similarity matrix information to the save location, which is the output target file. The program commences the analysis of the data once the user confirms the output location. If the dataset is in an incorrect format, or if there is any other problem with the run, a message is displayed to the user by exception handlers, which are code sections that deal with faulty input.

Once the program commences the analysis, the status of the run is displayed to the user. The

status is displayed by using a progress bar, percentage status, and status text components, which

are constantly updated throughout the run.

Figure 17 | Bucket Evaluations Software - Load file location window

Figure 18 | Bucket Evaluations Software - Save file location window

6.2. Main GUI Window (MGUIW)

The MGUIW thread is the initial execution thread, and is responsible for user input and status

notifications to the user. The user input includes input file selection window, initial input file

validations, algorithm parameter input etc. (Figure 15). The MGUIW thread also includes tool-tip hover labels, which provide the user with a text explanation of the variable fields to fill.

6.2.1. User Input

Each input parameter consists of a GUI object that is relevant to the type of data needed (e.g. file

location requires text and initial bucket size requires a number). The parameters that the user can modify include:

• “Choose data source” – A variable input that consists of a textbox and a button

components. Allows users to select the data file which they wish to load. The data must

be a tab delimited file including column and row names. For example (file format:

“\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9”):

C1 C2 C3

R1 1 2 3

R2 4 5 6

R3 7 8 9

• “Use Pre-set Values” – A variable input that consists of a combo box component. Allows the user to choose pre-set values for the BE variables, including: “Stringent” - the score depends on closer agreement between the ranks of values, and “Broad” - the group of top ranked values is larger. A large group of top ranked values produces a high score for more distant ranks in each bucket. These values are set while assuming there are ~6000 values to compare. The sizes vary for different sizes of datasets.

• “Number of Additional Threads” – A variable input that consists of a numeric up-down

component. Increasing the number of threads allows the algorithm to run analysis

concurrently. Concurrent running of the algorithm may result in a faster outcome. However, thread performance is dependent on the computer's hardware, so increasing the number of threads may instead result in a delayed outcome (see section 1.6.1).

• “Initial Bucket Size (%)” – A variable input that consists of a numeric up-down

component. Allows the user to select the size of the first bucket. This value is the

percentage of the dataset size. For example, if there are 10000 values to compare, and the

value of the initial bucket size is set to 0.05, then the first bucket, which holds the most

significant values, will hold 5 top values (which are 0.05% of 10000). Following buckets

will be larger in size (Table 2). A small value will result in fewer scores considered as top

hits, resulting in a stringent result when comparing columns/rows. The value set for the

initial bucket size affects the results of the algorithm, therefore it is recommended to run

the algorithm several times, while using different sets of variables, for finding the ideal

variable values (Figure 19). These values can also be changed by using the different

options in the pre-set values combo-box.

• “Maximum Number of Buckets” – A variable input that consists of a numeric up-down component. Allows the user to select the maximum number of levels into which the algorithm divides each experiment. For example, when comparing fitness defects of genes, this variable will represent the maximum number of groups into which the genes will be divided. A small value in this field will result in a reduced effect on the similarity score of the lower values in the dataset (Table 2).

• “Comparing Columns/Rows” – A variable input that consists of radio button

components. Allows the users to select which analysis they wish to run on the dataset. For example, if the columns represent the experiments and the rows represent the genes’ fitness defects, then selecting the “Column” radio button will create an experiment similarity scoring matrix. For example, for an input file such as (file format:

“\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9”):

C1 C2 C3

R1 1 2 3

R2 4 5 6

R3 7 8 9

selecting 'Column' will result in a similarity scoring matrix of C1-C3 versus C1-C3, while

selecting 'Rows' will result in a scoring matrix of R1-R3 versus R1-R3.

• “Set score of lowest bucket to 0” – A variable input that consists of a checkbox

component. If checked, the score for the lowest buckets is set to 0. This results in giving a

score of 0 similarity to the lowest bucket, which contains the lowest ranks, hence

avoiding the low end of ranks. If left unchecked, the lowest ranks can cause a higher

score of similarity, as it is included in the final similarity score.
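The input format and two of the parameters above can be illustrated with a short sketch. It parses the tab delimited example shown in this section, selects column or row profiles for comparison, and computes the size of the first bucket from a percentage. The function names, the rounding rule and the minimum of one value per bucket are assumptions made for the illustration; they are not taken from the BE software.

```python
import csv
import io


def read_matrix(text):
    """Parse a tab delimited table with column names in the first line
    and a row name at the start of every following line."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    col_names = rows[0][1:]          # the first header cell is empty
    row_names = [r[0] for r in rows[1:]]
    values = [[float(v) for v in r[1:]] for r in rows[1:]]
    return col_names, row_names, values


def profiles(col_names, row_names, values, compare="columns"):
    """Return {name: profile} for the axis that is being compared."""
    if compare == "columns":
        return {c: [row[k] for row in values] for k, c in enumerate(col_names)}
    return dict(zip(row_names, values))


def first_bucket_size(num_values, initial_bucket_percent):
    """Number of top values in the first bucket (assumed: rounded, at least 1)."""
    return max(1, round(num_values * initial_bucket_percent / 100))


text = "\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9"
cols, rows, vals = read_matrix(text)
by_column = profiles(cols, rows, vals, compare="columns")
by_row = profiles(cols, rows, vals, compare="rows")

print(by_column["C1"])
print(by_row["R3"])
print(first_bucket_size(10000, 0.05))
```

Comparing columns yields the C1-C3 versus C1-C3 matrix described above, while comparing rows yields R1-R3 versus R1-R3.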

Figure 19 | Example of different result outputs for setting different bucket sizes when running the bucket evaluations algorithm. The dataset originated from high throughput sequencing (see section 3.4) with initial bucket sizes set to a broad value of 5% (a), and a stringent value of 0.05% (b). These dendrograms show the importance of running several parameter definitions for finding the best fit for the dataset. Both results show same drug treatments clustering together. However, this dataset required a stringent set of variables: in dendrogram b all experiments using the same chemical compound clustered together, while for broad values of the variables not all of them did (cisplatin).

6.2.2. Status Notifications

In addition to the user input, the MGUIW is responsible for status notifications to the user. Status

notifications are important as the user can get information regarding what stage of the program is

running at each moment. In order to allow the BETM and DTs to change the components of the

MGUIW, delegates execute MGUIW methods from external threads. These delegates provide an

up-to-date status to the user by manipulating several components, such as:

• Progress Bar – Provides a graphical display of progress percentage (Figure 20B).

• Percentage Label – Provides a numerical display of progress percentage (Figure 20C).

• Status Text Label – Provides a brief text explanation of the current analysis action that is

performed on the dataset (Figure 20D). Once the analysis is completed, the Status Text

Label displays the output file location.

Figure 20 | Program GUI once executed. The user can cancel the run by clicking the “Cancel” button (A). This button is located in the same place that the “Run” button occupied prior to execution. The status is presented to the user using a progress bar (B), the current action percentage of the run (C), and text that provides a brief explanation of the current analysis action being performed (D).

6.2.3. Cancel Run

Once a run has been started by clicking the “Run” button, the MGUIW allows the user to cancel the run by clicking the “Cancel” button (Figure 20A). The “Cancel” button is located at the same location that the “Run” button occupied. Once “Cancel” is clicked, a series of events is initiated for terminating all running threads. The button-click event raises a flag that is periodically checked by the BETM. The raised flag leads to the following actions: it (1) prevents the creation of additional DTs by the BETM, (2) prevents existing DTs that are in the queue and not yet running from starting analysis on the dataset, and (3) leaves the DTs that are already running to terminate upon completion.
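The flag-based cancellation described above can be sketched as follows. The BE software was written in C#; this Python sketch only illustrates the pattern, and all names in it are hypothetical. To keep the example deterministic, the "Cancel" click is simulated after a fixed number of tasks and each task is joined before the next is created.

```python
import threading


def dataset_task(task_id, cancel_flag, results):
    """Stand-in for a Dataset Thread: it checks the flag before starting."""
    if cancel_flag.is_set():
        return                      # a queued task never starts its analysis
    results.append(task_id)         # placeholder for the real analysis


def run_manager(task_ids, cancel_flag, cancel_after):
    """Stand-in for the BETM loop: stop creating tasks once the flag is raised."""
    results = []
    for count, task_id in enumerate(task_ids):
        if count == cancel_after:
            cancel_flag.set()       # simulate the user clicking "Cancel"
        if cancel_flag.is_set():
            break                   # no additional tasks are created
        worker = threading.Thread(
            target=dataset_task, args=(task_id, cancel_flag, results)
        )
        worker.start()
        worker.join()               # a task that is already running finishes
    return results


flag = threading.Event()
completed = run_manager(range(10), flag, cancel_after=4)
print(completed)
```

The tasks that started before the flag was raised run to completion, and nothing is started afterwards, which mirrors actions (1) to (3).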

6.3. Information Form

The Information Form is a separate thread, accessible through the button on the MGUIW. It

provides information about the software version and usage citation. This form is executed as a separate thread, and can therefore also be displayed while the analysis is underway.

6.4. BE Thread Manager (BETM)

The BETM is a thread that controls the flow of the analysis according to the user’s parameters and orchestrates multiple DTs. To do so, it uses thread-control tools such as semaphores and thread barriers.

Semaphores are software objects that limit the number of concurrently running threads. If the number of running threads has reached the defined maximum, the semaphore puts any additional threads to sleep. Once a running thread terminates, the semaphore activates one of the queued threads. Semaphores were used to limit the number of DTs running at the same time. The number of allowed threads is set by the user prior to the run. If the user sets the number of additional threads to 0 (Figure 15), the analysis is performed on the BETM thread itself rather than on additional DTs.
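The semaphore behaviour described above can be illustrated with a short Python sketch. This is not the .NET code of the software; `MAX_THREADS` is a hypothetical stand-in for the user-selected thread limit, and `detached_task` stands in for a DT. The sketch records the peak number of simultaneously running tasks to show that the limit holds.

```python
import threading
import time

MAX_THREADS = 3                           # hypothetical user-selected limit
slots = threading.Semaphore(MAX_THREADS)
stats = {"running": 0, "peak": 0}
stats_lock = threading.Lock()


def detached_task() -> None:
    """Stand-in for a DT: acquires a slot, works, then releases the slot."""
    with slots:                           # sleeps while MAX_THREADS tasks are running
        with stats_lock:
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
        time.sleep(0.01)                  # placeholder for real analysis work
        with stats_lock:
            stats["running"] -= 1


threads = [threading.Thread(target=detached_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All ten threads are created up front, but at most `MAX_THREADS` of them are inside the guarded region at any moment; the others wait until a slot is released.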

A thread barrier is an object that prevents a selected thread from running until a certain group of signal threads has completed its run. The barrier does so by putting the selected thread to sleep; once the group of signal threads has completed, the sleeping thread is awakened. A thread barrier was not part of the .NET Framework 3.0, so I implemented the barrier object as a separate class. The barrier was used to control the stages of the analysis, adding DTs to the queue only for the relevant sections of the dataset. I also used the thread barrier to prevent the queue of DTs from becoming too large. Keeping the DT queue small is important when the “Cancel” button is clicked, because only a limited number of queued threads then need to be cancelled.
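A barrier of the kind described above can be sketched in a few lines. The following Python class is an illustration only and is not the class written for the software; the name `SignalBarrier` and its methods are hypothetical. It uses a condition variable: each signal thread reports completion, and the waiting thread sleeps until all of them have done so.

```python
import threading


class SignalBarrier:
    """Blocks waiters until `count` signal threads have reported completion."""

    def __init__(self, count: int) -> None:
        self._remaining = count
        self._cond = threading.Condition()

    def signal(self) -> None:
        """Called by a signal thread when it finishes its work."""
        with self._cond:
            self._remaining -= 1
            if self._remaining <= 0:
                self._cond.notify_all()  # wake the sleeping thread(s)

    def wait(self) -> None:
        """Called by the selected thread; sleeps until all signals arrive."""
        with self._cond:
            self._cond.wait_for(lambda: self._remaining <= 0)


events = []
events_lock = threading.Lock()


def rank_column(column: int, barrier: SignalBarrier) -> None:
    """Stand-in for a first-stage subtask that must finish before stage two."""
    with events_lock:
        events.append(f"rank {column}")
    barrier.signal()


barrier = SignalBarrier(3)
workers = [threading.Thread(target=rank_column, args=(c, barrier)) for c in range(3)]
for w in workers:
    w.start()
barrier.wait()                 # the manager sleeps until all three have signalled
events.append("next stage")    # safe: every worker has already recorded its event
for w in workers:
    w.join()
```

Used this way, the barrier separates stages of the analysis: the manager queues the next batch of tasks only after the current batch has signalled, which also bounds how many tasks are waiting at any time.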

Each part of the analysis is divided into subtasks that are performed by DTs. The BETM creates a limited number of DTs at a time so that, if the user cancels the run, only a limited number of DTs are awaiting execution. Because the dataset is analysed by multiple threads, the BETM makes sure that the section being analysed by one DT is not overwritten by other threads, and therefore provides each DT with a mutually exclusive section to work on. The subtasks include ranking the scores of a specific column, entering the values of a specific column into buckets, and comparing experiments. For example, for a dataset of size 100 × 100 there will be ~10200 subtasks (100 rankings + 100 bucket definitions + ~10000 comparisons) assigned to threads. Once the analysis is completed, the BETM saves the output to a standard tab-delimited file located in the path selected by the user.
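The subtask count in the example above can be reproduced with a small calculation. The sketch below is in Python and assumes that each experiment occupies one column and that the comparison count is taken as columns × columns, which matches the ~10000 figure quoted in the text; the function name `subtask_count` is hypothetical.

```python
def subtask_count(n_columns: int) -> dict:
    """Count subtasks for a dataset with `n_columns` experiments (columns)."""
    rankings = n_columns                 # one ranking subtask per column
    bucketings = n_columns               # one bucket-definition subtask per column
    comparisons = n_columns * n_columns  # assumption: every column against every column
    return {
        "rankings": rankings,
        "bucketings": bucketings,
        "comparisons": comparisons,
        "total": rankings + bucketings + comparisons,
    }


counts = subtask_count(100)
```

Under these assumptions the total for 100 columns is 10200. If only unordered pairs of distinct experiments were compared, the comparison term would instead be n(n − 1)/2.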
