An algorithm for chemical genomic profiling that minimizes batch effects: Bucket Evaluations
by
Daniel Shabtai
A thesis submitted in conformity with the requirements for the degree of Master of Science
Department of Cell and Systems Biology University of Toronto
© Copyright by Daniel Shabtai 2011
An algorithm for chemical genomic profiling that minimizes batch effects: Bucket Evaluations
Daniel Shabtai
Master of Science
Department of Cell and Systems Biology University of Toronto
2011
Abstract
Chemical genomics is an interdisciplinary field that combines small molecule perturbation with
genomics to understand gene function and to study the mode(s) of drug action. Existing methods
for correlating chemical genomic profiles are not ideal as they often require one to define the
disrupting effects, commonly known as batch effects. These effects are not always known, and
they can mask true biological differences.
I present a method, Bucket Evaluations (BE), which surmounts these problems. This method is a
non-parametric correlation approach, which is suitable for locating correlations in somewhat
perturbed datasets such as chemical genomic profiles. BE can be used on other datasets such as
those obtained via gene expression profiling and performs well on both array-based and
sequence-based readouts. Applying BE, alongside various other correlation methods, to a collection of
datasets showed it to be highly accurate at locating similarity between experiments.
Acknowledgments
I would like to thank Dr. Corey Nislow, who gave me an opportunity upon my arrival to
Toronto, supported and guided my ideas, and always had an open door for sharing thoughts. I
also want to thank my co-supervisor, Dr. Tim Westwood, and my supervisory committee
members Drs. Guri Giaever and Nick Provart for their guidance.
In addition, I would like to thank the Giaever/Nislow lab members and the CCBR 6th floor
computational researchers, with whom I had the pleasure of working and whose company I enjoyed
during work and beyond.
Finally, I would like to thank my parents Janet and Yaakov, who trust and support me, with great
love, in any step I take, no matter what. I also thank my siblings Arei and Runn, and my family, for their
long-distance (>9,000 km) video chats that fill my energy stores. Most of all, I thank Michal Sibony for her
love, knowledge and support throughout my research.
Table of Contents
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
List of Abbreviations
1. Introduction
1.1. Microarrays
1.2. High Throughput Sequencing
1.3. Chemogenomic Profiles
1.4. Batch Effects
1.4.1. History
1.4.2. Definition
1.4.3. Sources of Batch Effects
1.5. Analysis Approaches
1.5.1. Overview
1.5.2. Supervised vs. Unsupervised Methods
1.5.3. Parametric vs. Non-Parametric Methods
1.6. Software Design
1.6.1. Threads
2. Rationale
3. Results and Discussion
3.1. TAG4 Barcode Microarray Dataset
3.2. TAG3 Microarray 2004 PNAS Dataset
3.3. Gene Expression (Transcript Abundance) Dataset
3.4. High Throughput Sequencing Dataset
4. Conclusions
5. Methods
5.1. Levelled scoring matrix
5.2. Software imaging and implementation
6. Bucket Evaluations Software
6.1. User Experience
6.2. Main GUI Window (MGUIW)
6.2.1. User Input
6.2.2. Status Notifications
6.2.3. Cancel Run
6.3. Information Form
6.4. BE Thread Manager (BETM)
7. References
List of Figures
Figure 1 Experimental procedure for creating chemical genomic profiles
Figure 2 The score distribution of chemogenomic profiles sorted by date
Figure 3 Two chemogenomic experiments performed using the same conditions
Figure 4 A simplified example of a basic implementation of BE for scoring experiments
Figure 5 Expected results of the ideal outcome and a random outcome
Figure 6 Four correlation methods heat maps applied to the same dataset
Figure 7 Four correlation methods score distribution applied to the same dataset
Figure 8 A comparison of barcode TAG3 microarray similarity results
Figure 9 BE method results on the Gasch et al. dataset
Figure 10 Gasch et al. dataset differentiation between the induced and repressed genes
Figure 11 The distribution of scores of the Gasch et al. study dataset
Figure 12 Results for running the BE method on high throughput sequencing data
Figure 13 A comparison of several methods run on high throughput sequencing data
Figure 14 The score distribution of several methods on high throughput sequencing data
Figure 15 Bucket Evaluations Software Graphical User Interface
Figure 16 Bucket Evaluations Software Architecture
Figure 17 Bucket Evaluations Software - Load file location window
Figure 18 Bucket Evaluations Software - Save file location window
Figure 19 Example of different bucket sizes on clustering of data
Figure 20 Bucket Evaluations Software GUI once executed
List of Tables
Table 1 Non-biological effects on datasets
Table 2 A scoring matrix formula in accordance to the guidelines needed for BE scoring
Table 3 Implementation example of the scoring matrix
Table 4 Top three drug similarity scores located by several correlation methods
List of Abbreviations
AtCAST Arabidopsis thaliana: DNA Microarray Correlation Analysis Tool
bp base pairs
BE Bucket Evaluations
BETM BE Thread Manager
CPU central processing unit
DMSO dimethyl sulfoxide
DNA deoxyribonucleic acid
FDA Food and Drug Administration
GUI graphical user interface
MASTA microarray overlap search tool and analysis
MGUIW Main GUI Window
mRNA messenger ribonucleic acid
NGS next generation sequencing
PAM Partitioning Around Medoids
RNA ribonucleic acid
SD standard deviation
SOLiD Supported Oligo Ligation Detection
1. Introduction
High throughput analysis of genes is a developing technology that allows the user to analyse
thousands of genes simultaneously. High throughput analysis first emerged at the beginning of
the 1990s with microarray technology (Brown and Botstein, 1999; Lockhart et al., 1996; Schena
et al., 1995), and continued with new technologies such as next-generation sequencing (Alkan et
al., 2009; Bentley et al., 2008; Hillier et al., 2008; Ley et al., 2008; Smith et al., 2009).
One issue that has hindered high throughput experiments since their introduction is the limited
ability to compare results from experiment to experiment, lab to lab and between different dates.
Similarity between experiments helps to understand the activity of specific genes
under new experimental conditions. In my research, I address similarity evaluation problems of
high throughput analysis of genes using an analysis algorithm I have developed, and by
designing and implementing software for using this algorithm.
1.1. Microarrays
Microarrays are a collection of single or double stranded DNA segments that are attached to a
surface. A sample of cDNA made from RNA or genomic DNA is hybridized to these segments,
allowing one to measure gene transcript abundance in the sample for expression
microarrays, or barcode/genomic DNA abundance for barcode/genomic microarrays (see
section 1.3). Such measurements enable researchers to monitor the expression of all known
genes of an organism simultaneously as part of genome-wide studies. For example, the user can
measure the abundance of gene transcripts (or gene copy number) either over time, under stress
conditions, or in the presence of chemical compounds (Ammar et al., 2009; Lieb et al., 2001;
Lockhart et al., 1996; Redon et al., 2006; Shoemaker et al., 2001; Wang et al., 1998). In yeast,
some of the landmark studies using microarrays to measure either transcript levels or gene copy
number include identifying drug targets using yeast deletion strains (Giaever et al., 2004;
Giaever et al., 1999; Marton et al., 1998), locating the group of yeast genes that are affected by
cell stress conditions (Gasch et al., 2000), showing yeast can acquire stress resistance when
exposed to mild stress (Berry and Gasch, 2008), discovering phenotypic activity for almost all
the genes in yeast (Hillenmeyer et al., 2008), and the creation of a drug-gene connectivity map
for understanding drug mechanism in mammalian species (Lamb et al., 2006).
Several manufacturing methods exist for creating single-stranded DNA-based
microarrays. The Affymetrix platform makes use of a technology known as
photolithography (Fodor et al., 1991). This technology selectively shields regions of a
substrate, called a wafer, from light or exposes them to it. By selectively exposing portions of the wafer to ultraviolet light, it is
possible to synthesize nucleic acids through photochemically directed reactions in a spatially
dependent manner. Like Affymetrix, Nimblegen uses a photolithographic technique to create
the microarrays (Singh-Gasson et al., 1999). However, Nimblegen uses an array of micro-mirrors
to selectively direct light onto regions of the microarray substrate, whereas Affymetrix uses
physical masks. This makes it easier to change oligonucleotide sequences, as the micro-mirrors
are controlled by software. Agilent Technologies creates microarrays using inkjet technology
(Hughes et al., 2001). This technology uses modified inkjets to spatially deliver and separate
chemical reagents on a glass substrate (Lausted et al., 2004). This technology is very flexible
since a computer file is used, rather than physical masks, to define the pattern and sequences of
oligonucleotides on the microarray. Illumina uses a method called Bead Chips. This technology
relies on the random placement of bulk-synthesized oligonucleotides on polymer beads. Each
bead carries a unique nucleotide sequence defined by the user plus an identifier sequence. The
beads are attached to a substrate at random locations, and thus each microarray has a different
distribution of beads (Illumina, 2011). A set of control probes that hybridize to the
identifier sequences distinguishes which bead is which on the array. Finally, spotted microarrays
are created using a robot whose pins repeatedly pick up DNA probes (either
single-stranded oligos or double-stranded DNA fragments) from microtiter plates and deposit them on a
coated glass slide in a rectilinear pattern (Schena et al., 1995).
1.2. High Throughput Sequencing
High throughput sequencing is another method used for assessing the abundance of DNA
segments. This method is able to produce an enormous amount of data cheaply, and in this
introduction I describe two technologies for massively parallel DNA sequencing (Metzker,
2010). One example is Illumina technology, which begins by
generating a DNA library for sequencing (Bennett, 2004). Adaptor sequences are ligated onto
the ends of each molecule. Next, the DNA is placed on a glass slide coated with DNA
complementary to the adaptor sequence. Each molecule is then amplified, generating clusters of
the cloned DNA molecules. A single fluorescently labeled base is incorporated into each
chain, and its terminator blocks further extension (Ju et al., 2006; Turcatti et al., 2008). The
cluster is then imaged, and the base is assigned according to the fluorescent colour. Next, the
terminator block is reversed, allowing the next base to be incorporated. The process of
incorporating a base, reading an image, and unblocking the terminator is repeated 20-150 times,
which allows a sequence of 20-150 bases to be read (Smith et al., 2009). Another example of
sequencing technology is Sequencing by Oligonucleotide Ligation and Detection (SOLiD),
created by Applied Biosystems (Mardis, 2008). Similar to Illumina, this technology starts by
generating a DNA library for sequencing. Adaptor sequences are ligated onto the ends of each
DNA molecule. Then, the DNA is added to an emulsion PCR, where a single DNA molecule is
amplified onto a bead. The beads are deposited onto a slide and sequenced. Unlike Illumina,
SOLiD uses a DNA ligase to add DNA bases (Mardis, 2008).
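The Illumina-style cycle described above (incorporate one labeled base, image the cluster, unblock the terminator, repeat) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function name `sequence_cluster`, the dye colours assigned to each base, and the example template are hypothetical, and strand orientation, amplification and imaging errors are ignored.

```python
# Hypothetical dye colour for each labeled base; real instruments differ.
DYE_FOR_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE_FOR_BASE.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_cluster(template, cycles):
    """Simulate reading one cluster, calling one base per cycle."""
    read = []
    for position in range(min(cycles, len(template))):
        # 1. Incorporate a single labeled, terminated base that is
        #    complementary to the template at this position.
        incorporated = COMPLEMENT[template[position]]
        # 2. Image the cluster: the observed signal is the dye colour.
        colour = DYE_FOR_BASE[incorporated]
        # 3. Assign the base according to the observed colour.
        read.append(BASE_FOR_DYE[colour])
        # 4. Reverse the terminator block; here this is modelled simply
        #    by moving on to the next template position.
    return "".join(read)


print(sequence_cluster("ACGTAC", cycles=4))
```

The number of cycles sets the read length, which is why repeating the process 20-150 times yields reads of 20-150 bases.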
High throughput sequencing has several advantages over microarrays, as paralogous sequences
can be distinguished, quantitation is ‘digital’ rather than ‘analog’, and prior sequence knowledge
is not required, to name a few.
1.3. Chemogenomic Profiles
There are many advantages to using Saccharomyces cerevisiae as a model organism in drug
discovery research. Some of these advantages are its well characterized genome and proteome
(Peña-Castillo and Hughes, 2007), the availability of a complete
collection of molecularly barcoded deletion strains (Giaever et al., 2002; Winzeler et al., 1999), its low
maintenance cost in the lab, and its facile genetics.
The chemogenomic profiles I compared were created by using the yeast Saccharomyces
cerevisiae deletion strains collection (Giaever et al., 2002; Giaever et al., 2004; Giaever et al.,
1999; Winzeler et al., 1999). Heterozygous and homozygous diploid gene deletion collections
were used to determine those gene products of pathways most affected by treatment
(Deutschbauer et al., 2005). In this method, each deletion strain is tagged with a barcode, which is
a unique 20 bp sequence used to identify the strain. Once a collection of strains is grown
in the presence of a compound, the sensitivity of a strain carrying a given gene deletion is measured
as a decrease in its abundance, determined by PCR amplification of the strain-specific barcodes followed by
barcode microarray hybridization or barcode sequencing (Bar-Seq) (Giaever et al., 1999; Smith
et al., 2010). This method allows the identification of potential drug targets and/or genes and pathways
required for growth in the presence of a compound (Deutschbauer et al., 2005; Giaever et al.,
2004) (Figure 1).
The results of each experiment are microarray signal intensities or barcode sequence counts,
which reflect barcode abundance and, by extrapolation, strain abundance. These values are
normalized by evaluating the log2 ratio between the signal intensities of drug-treated pools and
control pools, which are treated only with DMSO. This value is represented as the strain's fitness
defect. In a typical experiment, a few strains show a high fitness defect while the majority show
little or no defect relative to the control treatment. Strains with lower fitness defect values may nonetheless be truly sensitive,
yet they are not necessarily detected using a set threshold, as they are concealed among
midrange values that are considered background noise.
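The log2 ratio normalization described above can be sketched in a few lines. This is a minimal illustration and not the software used in this thesis: the function name `log2_ratios`, the strain names, the signal values and the pseudocount are assumptions made for the example, as is the choice to place the drug-treated signal in the numerator (the opposite convention simply flips the sign).

```python
import math


def log2_ratios(treated, control, pseudocount=1.0):
    """Return log2(treated / control) for every strain present in both pools.

    The pseudocount is an illustrative assumption that avoids division by
    zero and log(0) when a barcode has no signal in one of the pools.
    """
    return {
        strain: math.log2(
            (treated[strain] + pseudocount) / (control[strain] + pseudocount)
        )
        for strain in treated
        if strain in control
    }


# Hypothetical signal intensities (or barcode counts) per deletion strain.
treated_pool = {"strainA": 99, "strainB": 399, "strainC": 24}
control_pool = {"strainA": 399, "strainB": 399, "strainC": 399}

for strain, value in log2_ratios(treated_pool, control_pool).items():
    print(strain, value)
```

Under this sign convention, a strain that is depleted in the drug-treated pool receives a negative value, and a strain whose abundance is unchanged receives a value near zero.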
The fitness defect values vary in repeated experiments under identical conditions due to many
factors such as systematic effects (Baryshnikova et al., 2010) and batch effects (Fare et al., 2003;
Lander, 1999). Examples of possible effects are different dates of the experiment, different
plates, and the machinery used in an experiment. Due to the relatively high variability of
experiment results from experiment to experiment, the ability to compare experiments is limited
to those with higher fitness defects according to a selected threshold. To achieve a meaningful
comparison of different experiments, it is desirable to look at large collections of repeated
experiments, rather than individual, singleton experiments. This allows the evaluation of midrange
values, which are not always seen as significant when looking at a single experiment.
Normalization solutions for minimizing non-biological effects exist; however, each of these solutions
has limitations because it is designed for defined datasets and known batch effect
conditions (Alter et al., 2000; Benito et al., 2004; Fare et al., 2003; Johnson and Li, 2007;
Lander, 1999; Mecham et al., 2010), which are not always obvious. This led me to hypothesize
that the similarity between different chemical genomic profiles can be evaluated using the BE
method, which is based on weighted rank scoring.
Figure 1
Figure 1| Experimental procedure for creating chemical genomic profiles. The traditional barcode microarray assay is depicted on the left and the Barcode sequencing modification is presented on the right. In both assays, yeast is grown (1) in the presence of a chemical compound. In this example, the green strain grows poorly in this specific condition. The genomic DNA is isolated (2) and barcodes are amplified (3 or 5). Samples are either hybridized to a barcode microarray (4) or sequenced (6). Only one of the two yeast barcodes is shown, while the red, blue and green boxes represent the barcode which uniquely identifies that particular strain. This figure is used with permission from Andrew M. Smith, adapted from Pierce et al. 2006.
1.4. Batch Effects
1.4.1. History
Microarrays are a powerful tool; however, they are not devoid of disadvantages. Like any other
measurement process, high throughput analysis methods are susceptible to variability in results
due to technical, biological, and other non-biological sources (Table 1). Due to the variability in
the results of microarray experiments, researchers questioned the validity (Frantz, 2005;
Ioannidis, 2005; Strauss, 2006; Ying and Sarwal, 2009) and challenged the reproducibility of
microarray results (Dobbin et al., 2005; Ein-Dor et al., 2006; Irizarry et al., 2005; Larkin et al.,
2005; Marshall, 2004). Possible technical factors were investigated with an aim to minimize their
effect (Bakay et al., 2002; Boedigheimer et al., 2008; Eklund and Szallasi, 2008; Fare et al.,
2003; Han et al., 2004; Lusa et al., 2007; Novak et al., 2002; Zakharkin et al., 2005). Solutions
for minimizing the effects of technical sources were introduced during microarray studies. Such
solutions located problems in technical steps of microarray procedures, such as improper
experimental design (Lee et al., 2005; Rothman et al., 1980), RNA extraction (Bakay et al.,
2002; Boedigheimer et al., 2008; Huang et al., 2001; Lin et al., 2006; Thompson et al., 2007;
Whitney et al., 2003), RNA processing (Boelens et al., 2007; Lynch et al., 2006; Ma et al.,
2006), hybridization (Schaupp et al., 2005), washing (Branham et al., 2007; Fare et al., 2003),
scanning (Satterfield et al., 2008; Shi et al., 2005), clinical diagnosis (Daskalakis et al., 2008;
Furness et al., 2003) and data interpretation (Ambroise and McLachlan, 2002). These solutions
addressed technical effects, but they did not resolve other non-biological effects.
Non-technical sources affecting results, such as the person performing the experiment, date of
the experiment, etc., were not resolved by accounting for the technical issues. Therefore,
statistical tools, rather than technical procedures, were developed to account for these
effects.
Table 1 Non-biological effects on microarray datasets

| Source | Result |
| --- | --- |
| Date of experiments | Grouping of experiment similarity according to date (Baggerly et al., 2008) |
| Location of experiment | Grouping of experiment similarity according to location (Shi et al., 2006) |
| Experimental design | Masking of physiological state being studied (Lee et al., 2005; Ransohoff, 2005a; Rothman et al., 1980) |
| Tissue heterogeneity (sample and RNA extraction) | Masking of tissue or cell population being studied (Bakay et al., 2002; Scherer, 2009) |
| Temporal and biological variation in expression (sample and RNA extraction) | Masking of the biological state being studied (Boedigheimer et al., 2008; Scherer, 2009; Whitney et al., 2003) |
| Expression changes after tissue extraction (sample and RNA extraction) | Measured RNA abundances are different than true physiological state being studied (Huang et al., 2001; Lin et al., 2006; Scherer, 2009) |
| Degraded RNA (sample and RNA extraction) | Measured RNA abundances different than true physiological state being studied (Scherer, 2009; Thompson et al., 2007) |
| Amplification biases (RNA processing) | RNA abundances change with different protocols and handling (Boelens et al., 2007; Ma et al., 2006; Scherer, 2009) |
| Labeling biases (RNA processing) | Measured signals differ from actual abundances and are dependent on actual procedure used (Lynch et al., 2006; Scherer, 2009) |
| Non-uniform hybridization (hybridization) | Spatial signal biases and non-uniform high backgrounds (Schaupp et al., 2005) |
| Cy5 degradation (washing) | Cy5 molecule degrades under ozone exposure (Branham et al., 2007; Fare et al., 2003; Scherer, 2009) |
| System stability (scanning) | Variation in signal outputs (Scherer, 2009) |
| System settings (scanning) | Scan-to-scan variability (Scherer, 2009) |
| Subjective analysis of specimen (clinical diagnostics) | Systematic bias in assessment due to single or multiple pathologists making diagnosis (Daskalakis et al., 2008; Furness et al., 2003; Scherer, 2009) |
| Selection bias (data interpretation) | Bias in selecting data sets for training and validation (Ambroise and McLachlan, 2002; Scherer, 2009) |
| Subtle differences in growth conditions, such as incubation time, from one array plate to the next (systematic effects) | Plate effect (Baryshnikova et al., 2010) |
| Growth of different subsets of colonies on the same plate (systematic effects) | Local nutrient availability (Baryshnikova et al., 2010) |
| Angle at which agar medium was allowed to solidify (systematic effects) | Gradients in growth medium (Baryshnikova et al., 2010) |
| Increased colony size next to less fit mutants (systematic effects) | Neighboring mutant strain fitness change due to local competition for nutrients (Baryshnikova et al., 2010) |

Table 1 | Non-biological effects on microarray datasets. This table mentions several possible effects on microarray experiments that may change the data.
1.4.2. Definition
Batch effects are non-biological experimental variations that affect the outcome of experiments
(Johnson and Li, 2007). The term originates from statistical process control, where it refers to
systematic differences of quality parameters between different production batches (Scherer,
2009). Such differences in parameters become significant if the average difference between
batches is larger than the within-batch random variation (Scherer, 2009). These effects create
differences in the gene expression intensities of samples processed in different batches, and
distort real biological effects. As a result, the distribution of intensities is largely due to the batch
effect, rather than the true biological variation (Figure 2). Although my focus is on avoiding batch
effects in high-throughput chemogenomic research, batch effects are also found in
other fields, such as physics (Youden, 1972). Examples of batch effects have been documented
in many studies, showing high correlation between variables that are not study related (Petricoin
et al., 2002; Spielman et al., 2007). Such effects raised concerns regarding the credibility of
biological conclusions even after the publication of results (Akey et al., 2007; Baggerly et al.,
2004), which led the FDA to block the use of such an assay until further validation
(Ransohoff, 2005b). When not detected, batch effects can result in a lack of reproducibility and
subsequent misallocation of resources (Baggerly et al., 2008).
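The comparison in the definition above, between the average difference across batches and the random variation within a batch, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `batch_effect_ratio`, the use of the mean absolute difference of batch means for the between-batch term, the use of the mean within-batch standard deviation for the within-batch term, and the example scores are all hypothetical choices and are not taken from the cited sources.

```python
from itertools import combinations
from statistics import mean, pstdev


def batch_effect_ratio(batches):
    """Ratio of between-batch difference to within-batch variation.

    `batches` maps a batch label (e.g. a processing date) to a list of
    measurements. Both summary statistics are assumptions made for this
    sketch; a ratio above 1 means the batches differ by more than the
    scatter observed inside a batch.
    """
    batch_means = {label: mean(values) for label, values in batches.items()}
    # Average absolute difference between every pair of batch means.
    between = mean(
        abs(batch_means[a] - batch_means[b])
        for a, b in combinations(batch_means, 2)
    )
    # Average spread of the measurements inside each batch.
    within = mean(pstdev(values) for values in batches.values())
    return between / within


# Hypothetical scores for the same condition measured on two dates.
scores_by_date = {
    "day_1": [1.0, 1.2, 0.8, 1.0],
    "day_2": [2.0, 2.2, 1.8, 2.0],
}
print(batch_effect_ratio(scores_by_date))
```

In this made-up example the two dates differ by far more than the scatter within either date, which is the situation the definition describes as a significant batch difference.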
Figure 2 | The score distribution of chemogenomic profiles sorted by date. The red rectangle marks a group of experiments that display a similar score distribution of gene fitness defects. These experiments were performed on the same date but not under the same conditions, an example of a batch effect due to the date of the experiment.
1.4.3. Sources of Batch Effects
The sources of batch effects can be ambiguous; therefore, dealing with specific known effects may not be
enough to avoid batch effects completely. Batch effects also vary in their impact on the
data. For instance, almost every gene expression study shows variation associated
with the processing date of the microarray (Scherer, 2009) (Figure 3). Comparisons of
microarray experiments between laboratories show strong lab-specific
effects (Irizarry et al., 2005), and large variations are associated with
DNA preparation groups, such as for different batches or reagents (Scharpf et al., 2011). These
‘strong’ effects are commonly used to account for batch effects, though they may be surrogates
for other sources of variation such as ozone levels, lab temperatures and reagent quality (Fare et
al., 2003; Leek and Storey, 2007; Scherer, 2009).
Most batch effects are masked by the ‘strong’ effects and are therefore not recorded as
potential effects, which makes it impossible to account for them. Taking into account the ‘strong’
factors that may affect the data, while ignoring the factors that were not recorded, may not be
enough to clear the data of non-biological effects. Batch effects remain because
neither date nor biological factors are completely associated with all affecting
components, which suggests that other unknown sources are present. In other words, we
cannot explicitly account for undetected or unmeasured effects. Therefore, accounting only for
the known batch effects is not sufficient for removing non-biological effects (Leek et al., 2010).
Samples in which batch effects are confounded with the outcome of interest may lead to wrong
biological or clinical conclusions. For example, an experiment where all control samples are
processed on one day and all case samples on another may not provide useful data, because a difference between the groups cannot be separated from a difference between the days.
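The control/case example above can be checked mechanically before an experiment is run. The following Python sketch is a hypothetical helper (the name `is_fully_confounded` and the labels are assumptions for illustration) that reports whether every batch contains samples from only one group, which is the design in which a batch effect and the outcome of interest cannot be told apart.

```python
from collections import defaultdict


def is_fully_confounded(groups, batches):
    """Return True when each batch holds samples from a single group.

    `groups` and `batches` are parallel lists giving, for each sample,
    its experimental group (e.g. "control" or "case") and its batch
    (e.g. the processing day). The check is an illustrative assumption,
    not a method described in the thesis.
    """
    groups_per_batch = defaultdict(set)
    for group, batch in zip(groups, batches):
        groups_per_batch[batch].add(group)
    one_group_per_batch = all(len(g) == 1 for g in groups_per_batch.values())
    # With a single group overall there is nothing to confound.
    return one_group_per_batch and len(set(groups)) > 1


groups = ["control", "control", "case", "case"]

# All controls on one day and all cases on another: confounded.
print(is_fully_confounded(groups, ["day_1", "day_1", "day_2", "day_2"]))

# Each day holds one control and one case: group and day can be separated.
print(is_fully_confounded(groups, ["day_1", "day_2", "day_1", "day_2"]))
```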
Figure 3 | Two chemogenomic experiments performed under the same conditions (cantharidin, a protein phosphatase inhibitor) on different dates (a). These images show the extent of the differences between experiments that were performed under the same conditions. There is a difference in the scale of results (the left experiment’s top value is ~22, representing a 10^(…)-fold difference in abundance, while the right experiment’s top value is ~31, representing a 10^(…)-fold difference in abundance). The lower results are the least affected genes, and include the majority of strains. These results vary in range between experiments, and are assessed as noise because they are due to unmanageable differences between experiments, e.g. temperature perturbations. Although the experiments were performed under the same conditions, the most sensitive deletion strains are not necessarily in the same ratio to each other, nor are they necessarily ranked in the same order (e.g. a strain can obtain the second highest fitness defect value in one experiment, yet the third highest in another). Another representation of the differences between experiments is shown in image b. The scatterplot shows an example of the scores of two experiments performed under the same conditions. The top fitness defect scores are similar, though these strains are not ranked the same in both experiments and have a different range of scores.
1.5. Analysis Approaches
1.5.1. Overview
Several approaches exist for evaluating similarities between experiments. All methods attempt to
overcome batch effects within the data, though some require more information
about the experiments than others. The ideal dataset would record all
conditions of each experiment, allowing these variables to be used in analyzing
the data; in practice, complete information about experimental conditions is not
available. Normalizing the data is a standard step in the analysis of gene expression experiments
(Allison et al., 2006), yet it does not completely remove batch effects, which (in
chemogenomics) can affect different genes in different ways, as different biological pathways are affected
by conditions unrelated to the experiment (Bolstad et al., 2003; Dudoit et al., 2002; Tseng et al.,
2001; Wu et al., 2004).
Several tools for comparing microarray data are available online. For example,
the Arabidopsis thaliana: DNA Microarray Correlation Analysis Tool (AtCAST) (Sasaki et al.,
2011) compares microarray results specifically for Arabidopsis thaliana. This
method uses a module-based correlation analysis, incorporating accumulated microarray data and
known shared biological activity of genes, to identify biological relationships. Another example is the
microarray overlap search tool and analysis (MASTA) (Reina-Pinto et al., 2010), which is
used for comparing differentially expressed genes against publicly available Arabidopsis
microarray datasets. Another method uses Eu.Gene (Cavalieri et al., 2007) to assess similarity of
samples within microarray databases. Eu.Gene is used to generate pathway signatures,
recapitulating the biologically meaningful pathways related to some clinical/biological variable
of interest. It then uses them to compare different microarray experiments (Beltrame et al.,
2009). There are many other methods which require prior knowledge such as information
regarding gene regulation (Breitling et al., 2004; Gasch and Eisen, 2002), biological pathways
(Ovaska et al., 2008), or defining the groups of batch effects (see section 1.5.2). Such methods
may be powerful analysis tools, though they rely on prior knowledge and/or accumulated data
for performing the analysis. Here I present a method that examines independent datasets and
does not rely on prior knowledge for the analysis.
1.5.2. Supervised vs. Unsupervised Methods
Methods for assessing similarities between experiments can be divided into two main categories.
Supervised methods take study design into account and attempt to
use all measured variables as the basis for correcting batch effects (Baird et al., 2004; Dabney
and Storey, 2007; Johnson and Li, 2007; Mecham et al., 2010; Wolfinger et al., 2001; Wu et al.,
2004; Wu and Irizarry, 2007). However, these methods require a highly structured experimental
design. They assume that the experimenter has identified all sources of variation, and that all
of these sources are recorded in the data (Mecham et al., 2010). In contrast,
unsupervised methods do not require batch variables and can
also be used when these data are missing.
I have developed an unsupervised method which does not assume prior knowledge of the data.
Most of the datasets I used were published without additional information beyond the results themselves,
and such additional data were not needed because the method does not require prior
knowledge of possible batch effects. To assess its abilities, I compared this method to other commonly used
unsupervised methods, such as Pearson, Spearman and Kendall correlations
(Kendall, 1938; Pearson, 1909; Spearman, 1904).
1.5.3. Parametric vs. Non-Parametric Methods
Unsupervised methods are widely used in biostatistics (Armstrong et al., 2011; Nugent and
Meila, 2010), and are divided into parametric and non-parametric methods. Some statistical tests
depend on certain assumptions about the behaviour of the data for accurate evaluations. Tests that
require such prior assumptions are defined as parametric methods (Nugent and Meila, 2010).
Parametric statistics depend on a particular distribution of the data, and base their conclusions
on presumptions regarding data parameters (e.g. standard deviation, variance, etc.). Unlike
parametric tests, non-parametric methods do not make assumptions regarding the probability
distribution of the data (Qualls et al., 2010). Here I present a non-parametric method I developed,
and compare its performance to other commonly used methods, including the non-parametric
Spearman (Spearman, 1904) and Kendall (Kendall, 1938) correlations and the
parametric Pearson correlation (Pearson, 1909).
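To make the distinction concrete, the three reference correlations can be written out in plain Python. This is a minimal sketch for illustration only: the function names, the average-rank handling of ties, the choice of the tau-a variant of Kendall's coefficient, and the example data are assumptions made here, not the implementations used in the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear association of the raw values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            balance += (product > 0) - (product < 0)
    return balance / (n * (n - 1) / 2)


# Hypothetical data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y), kendall_tau(x, y))
```

Because the two rank-based measures use only the ordering of the values, they report a perfect association for this monotonic example, whereas the Pearson coefficient falls below 1 because the relationship is not linear.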
1.6. Software Design
Part of my research project included design and implementation of software that will accompany
the algorithm I developed.
1.6.1. Threads
One of the major concepts I use in my software implementation is multi-threading. Threads are
useful for parallelizing applications and are similar to processes. Computer processes consume
the computer’s central processing unit (CPU) time, and run concurrently with other processes.
The operating system allocates CPU time to each application. A single-core computer processor
consists of a single CPU; therefore, its processes do not truly run at the same time. Although the
processes take turns on the CPU, users who run several programs at the same time
experience all the applications as if they were running simultaneously. The illusion of simultaneous
activity is attainable due to fast context switching, in which the processor saves the state of
one process and loads the state of the next. When the processor consists of multiple cores, the
processes actually run simultaneously on several CPUs, while there is still context
switching within the cores, as there are often more processes than cores (Microsoft, 2011).
A thread is a unit of execution that is allocated CPU time; therefore, multiple threads
allow a single program to perform multiple tasks at the same time. Multiple threads are useful when
using a graphical user interface (GUI), because the GUI should remain responsive while tasks run
in the background. In practice, at least one thread is dedicated to the GUI, while additional
threads perform tasks in the background. Another advantage of threads is that multiple
threads can analyse a single resource concurrently (e.g. a file or data in the computer memory).
On a multi-core processor, concurrent analysis by multiple threads can therefore provide a faster outcome than a
single thread. I have created a multithreaded program which allows the GUI to remain active
while multiple threads analyse a shared dataset. Such a multithreaded analysis results in a faster
outcome compared to single-threaded analysis. However, excessive use of threads can
slow a program, because the threads compete for a limited number of CPUs and overwhelm the system with
time-consuming context switching.
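The pattern described in this section, several worker threads analysing one shared dataset with a cap on the number of threads, can be sketched with Python's standard `concurrent.futures` module. The sketch does not reproduce the thesis software, whose source is not shown here; the functions `column_total` and `analyse`, the per-column task and the pool size of four are hypothetical. Note also that in standard CPython builds CPU-bound threads are largely serialised by the global interpreter lock, so the example illustrates the structure rather than the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def column_total(dataset, column):
    """Analyse one column of the shared, read-only dataset."""
    return sum(row[column] for row in dataset)


def analyse(dataset, max_workers=4):
    """Analyse every column using a bounded pool of worker threads.

    Capping `max_workers` avoids creating far more threads than there
    are CPUs, which would add context-switching overhead.
    """
    n_columns = len(dataset[0])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(column_total, dataset, column)
            for column in range(n_columns)
        ]
        # Results are collected in column order once each task finishes.
        return [future.result() for future in futures]


# Hypothetical dataset: rows are strains, columns are experiments.
shared_dataset = [
    [1, 2, 3],
    [4, 5, 6],
]
print(analyse(shared_dataset))
```

Because the workers only read the shared dataset, no locking is needed in this sketch; a program in which the threads also write to shared data would need explicit synchronisation.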
2. Rationale
Chemogenomics, the study of how the genome is affected by chemical compounds, is a valuable
approach to elucidate the mechanism of action of small molecules by identifying their cellular
targets and target pathways (Wuster and Babu, 2008). Recent applications of chemical genomics
in yeast include haploinsufficiency profiling and homozygote profiling of barcoded deletion
collections (Giaever et al., 2002; Giaever et al., 2004; Giaever et al., 1999; Winzeler et
al., 1999), exploration of essential genes using temperature-sensitive mutants (Li et al., 2011),
molecular barcoded open reading frame libraries (Ho et al., 2009), decreased abundance by
mRNA perturbation (Yan et al., 2008), multi-copy suppression profiling (Hoon et al., 2008) and
gene function and drug action analysis using the relationships between gene fitness profiles and
drug inhibition profiles (Hillenmeyer et al., 2010), to name a few. To apply chemical genomics
on a larger scale (i.e. thousands to hundreds of thousands of tests), a robust, extensible means to correct for
variation is needed. This variation can come from many sources, including operator, laboratory,
sample preparation and date (Irizarry et al., 2005; Scherer, 2009). Taken together, many profiles
will cluster based on these non-biological parameters into "batches", which adversely affects the
validity of the conclusions of a study (Akey et al., 2007; Spielman et al., 2007). Furthermore, as
throughput increases, batch effects are likely to increase.
I used chemogenomic profiles obtained from experiments that utilized the yeast Saccharomyces
cerevisiae gene deletion collections (Deutschbauer et al., 2005), which include heterozygous and
homozygous diploid deletions and haploid deletions. These screens primarily measure growth of
individual strains in a mixed population of deletion strains in the presence of diverse small
molecules. In these screens, a strain’s fitness defect can reflect that the deleted gene is the target
of the chemical compound present (in heterozygous diploid deletion strains) or that a particular
pathway is the target of the small molecule (homozygous diploid deletion strains).
In a genome-wide chemical-genetic profile, the fitness of each strain can be determined by
measuring the abundance of each deletion strain at the conclusion of the experiment, relative to a
mock treatment control profile. As each chemical compound produces a unique profile of gene
sensitivities, comparing the profiles helps understand the similarity between the modes of action
of compounds (Baetz et al., 2004; Hillenmeyer et al., 2008). This “guilt-by-association”
approach may help uncover therapeutic applications for known compounds as well as the
mode(s) of action of novel compounds (Buchdunger et al., 1996; Druker et al., 1996). Because
most chemical profiles display a range of fitness defects, identifying similarities between
chemical profiles requires locating shared gene targets of each profile and emphasizing genes
with highest fitness defect values, i.e. the strains most sensitive to treatment.
Batch effects, defined as non-biological variation in results (Scherer, 2009), interfere with the
ability to compare profiles because they mask the actual biological differences. Each
experiment is subject to non-biological effects. Some of these co-occur every time the assay
is performed, while others arise when new technical variables are introduced in an unscheduled manner (e.g. a new
lot of lab consumables); such variations are often termed batch effects (Leek et al., 2010). Batch
effects can be caused by many factors, such as the date on which an experiment was done, the
experimenter, the machinery used, etc. While most of these factors are recorded for each
experiment, one cannot account for all variation, and even when these factors are logged, some
are very difficult to normalize. One example of an effect that is not always recorded is the level
of training of the person performing the experiment, which varies over time. Another example is
the atmospheric ozone level on the day of the experiment, which affects certain types of
microarrays (Fare et al., 2003) (Table 1), and temperature, which affects all next generation
sequencing experiments.
Due to batch effects, correlation between experiments displays unwanted similarity according to
these effects rather than the similarity of the underlying chemical biology (Johnson and Li, 2007;
Leek et al., 2010). Comparison algorithms that do not consider batch effects provide
inaccurate correlation mapping of profiles. Some algorithms require that one defines the variable
that affects the results for an accurate comparison (Baryshnikova et al., 2010; Benito et al., 2004;
Johnson and Li, 2007; Leek et al., 2010; Mecham et al., 2010), yet these variables and their
relative impact are not always known.
To find correlation between experiments in a way that accommodates such uncertainty, I devised
a method which finds correlation between experiments without the need to define the batch
effects variables. This method is based on scaled ranks, which are scored according to a levelled
scoring matrix. The levelled scoring matrix provides a score for each gene comparison. I
evaluated the method using chemogenomic profiles (see section 1.3), and compared the method
to other existing correlation methods, including Pearson (Pearson, 1909), Spearman (Spearman,
1904), and Kendall (Kendall, 1938) correlations, which also do not require prior knowledge of
the variables that affect the results. Finally, I explored the extensibility of the Bucket Evaluations
(BE) algorithm on other microarray data and barcode sequencing data (see results). Because
many different clustering methods exist (e.g. hierarchical, PAM, etc. (Bozinov and
Rahnenfuhrer, 2002)) and each method relies on different agglomeration/division methods, each
approach can yield different results. It is therefore essential, when comparing performance, to
statistically evaluate the different profile similarity metrics irrespective of their clustering
method. I will demonstrate the performance of the BE algorithm compared to other correlation
methods, and will illustrate its applications on a variety of data types.
3. Results and Discussion
The BE algorithm is based on ranking and comparing a large number of columns within a
dataset, and was initially applied to chemogenomic profiles. To better understand the
applicability of the algorithm, consider an example from the world of spiders. There are over
40,000 species of spiders around the world, living in a variety of areas ranging from the freezing
arctic to the hot deserts. Similar spider habitats are expected to have similar groups of spider
species, as these species have adapted to the same type of environments. To evaluate similarity
between spider habitats, one would compare the groups of successful species, those that are most
prosperous in numbers, of each habitat, rather than comparing the single most successful species
alone. The reason for such a comparison is that for very similar habitats A and B, the most
successful species in habitat A is not necessarily the most successful species in habitat B. One
can determine that habitats A and B are similar if the most successful species in habitat A is in
the top fifty most successful species in habitat B, as such a rank is still very high considering
there are 40,000 species.
Similar to the world of spiders, comparing the effect of chemical compounds requires examining
the groups of genes affected by the chemical compounds rather than the top gene alone. There
are many differences between profiles, such as scale of results, standard deviation, and a
changing rank of gene values, even when the experiment was performed with the same
compound at the same dosage (Figure 3). These differences require analysing the ranking, not by
comparing specific ranks, but by comparing groups of ranks. A pure rank comparison, meaning
the highest value in one profile against the highest value in another profile and so on, gives poor
results because it does not take into account the variability of ranks between genome-wide
profiles.
Back in the spider world, widow spider species can be found in dry, warm areas.
Environments that have the widow spider as one of the successful species will be considered
similar habitats, while the specific rank of the widow spider’s success can vary between
these habitats. I addressed this problem using section comparisons, dividing each profile’s gene
scores into sections, defined as buckets. The algorithm creates a weighted scoring system by
ranking sections separately, and assigning a higher score to highly ranked gene scores than
to lower ranked gene scores. Each section, or “bucket”, is defined as a subgroup of ranked scores
and is itself scored according to significance. The genes with the highest fitness defect scores are
considered the most significant for comparing profiles, as these deletion strains are the most
influenced by the chemical compound. Therefore, I define the bucket sizes in each experiment
according to significance, i.e. smaller buckets contain the most significant genes (genes with
higher fitness defect scores and lower fitness), whereas larger buckets contain the least
significant genes (those with lower fitness defect scores and higher fitness). After the genes of
each profile are parsed into buckets, I use a levelled scoring matrix (see section 5.1) with
weighted scores for scoring similarity between profiles, and evaluate a summed similarity score
(Figure 4).
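The bucketing step described above can be sketched as follows. The exact bucket sizes used by BE are not given in this passage, so the sketch assumes, purely for illustration, that the first bucket holds one strain and that each later bucket is twice the size of the previous one; the function name `assign_buckets`, its parameters and the example scores are likewise hypothetical.

```python
def assign_buckets(scores, first_size=1, growth=2):
    """Place strains into buckets of increasing size by fitness defect.

    `scores` maps a strain name to its fitness defect score. Strains
    are ranked from the highest score down; bucket 1 receives the top
    `first_size` strains and each later bucket is `growth` times larger
    than the one before it (an assumed size rule). Returns a dict from
    strain name to 1-based bucket number.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    buckets = {}
    bucket, size, start = 1, first_size, 0
    while start < len(ranked):
        for strain in ranked[start:start + size]:
            buckets[strain] = bucket
        start += size
        size *= growth
        bucket += 1
    return buckets


# Hypothetical profile of seven strains.
profile = {"g1": 7.0, "g2": 6.0, "g3": 5.0, "g4": 4.0,
           "g5": 3.0, "g6": 2.0, "g7": 1.0}
print(assign_buckets(profile))
```

With these assumed sizes the most sensitive strain sits alone in bucket 1, the next two share bucket 2 and the remaining four fall into bucket 3, so small changes in rank among the weakly affected strains do not move them to a different bucket.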
The levelled scoring matrix guidelines consist of awarding a higher score to genes located in
lower-numbered buckets (e.g. when comparing two experiments, a gene located in bucket 2 for both
experiments is awarded a higher score than a gene located in bucket 3 for both
experiments), and to genes located in closer buckets (e.g. when comparing two experiments, a
gene that is located in buckets 2 and 3 will get a higher score than a gene located in buckets 2
and 4). To implement the levelled scoring matrix guidelines, I devised a scoring matrix formula
(Table 2) which meets the requirements of the levelled scoring matrix (Table 3). These
guidelines allowed me to find resemblance between profiles in addition to identifying profiles of
repeated conditions.
Figure 4 | A simplified example of a basic implementation of BE for scoring experiments: (1) Define bucket sizes and scoring table values. (2) For each experiment, insert the strains into the relevant bucket according to rank. Each strain is listed with its bucket, while the values in brackets represent the fitness defect score. The fitness defect diagrams mark the buckets with coloured rectangles (red for bucket 1, green for bucket 2, and blue for bucket 3). (3) Compare each experiment to the other experiments, and score similarity according to the scoring table. In this example, the similarity between Exp1 and Exp3 is higher than that between Exp2 and Exp3. This example demonstrates that the BE algorithm gives greater emphasis to strains with a high value than to strains with a lower value.
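Step (3) of the figure, summing a table score for every strain shared by two experiments, can be sketched as below. The scoring table values, the strain names and the bucket assignments are invented for this sketch and are not the values shown in the figure; the function name `similarity` is also hypothetical.

```python
def similarity(buckets_a, buckets_b, table):
    """Sum the table score of every strain present in both experiments.

    `buckets_a` and `buckets_b` map strain names to 1-based bucket
    numbers; `table[i - 1][j - 1]` is the score for a strain found in
    bucket i of one experiment and bucket j of the other.
    """
    shared = buckets_a.keys() & buckets_b.keys()
    return sum(table[buckets_a[s] - 1][buckets_b[s] - 1] for s in shared)


# Hypothetical 3-bucket scoring table: high scores for low-numbered,
# matching buckets and low scores for distant buckets.
table = [
    [4.0, 1.0, 0.1],
    [1.0, 2.0, 0.2],
    [0.1, 0.2, 0.5],
]

# Hypothetical bucket assignments for three experiments.
exp1 = {"A": 1, "B": 2, "C": 3, "D": 3}
exp2 = {"A": 3, "B": 3, "C": 1, "D": 2}
exp3 = {"A": 1, "B": 3, "C": 2, "D": 3}

print(similarity(exp1, exp3, table))  # strain A is in bucket 1 in both
print(similarity(exp2, exp3, table))  # no strain shares a top bucket
```

In this made-up example Exp1 and Exp3 agree on their top strain and therefore score higher than Exp2 and Exp3, even though the lower-ranked strains are shuffled in both comparisons.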
Table 2
Scoring matrix formula for buckets 1, 2, 3, ..., n:
$S_{(1,1)} = 2^{(n-1)}$
$S_{(c,c)} = S_{(c-1,c-1)} / c$ for the diagonal cell of column $c \ge 2$
$S_{(r,c)} = S_{(r+1,c)} / c$ for a cell above the diagonal of column $c$ ($r < c$), e.g. $S_{(c-2,c)} = S_{(c-1,c)} / c$; the matrix is symmetric, $S_{(c,r)} = S_{(r,c)}$
Table 2 | A scoring matrix formula in accordance with the guidelines needed for BE scoring. The top score (bucket 1 vs. bucket 1) depends on the total number of buckets (n) in order to achieve a wide spread of scores throughout the table. For example, the range of scores for n=5 buckets is from $S_{(1,5)} = 2.1 \cdot 10^{-4}$ to $S_{(1,1)} = 2^{(5-1)} = 16$, while the range of scores for 11 buckets is from $S_{(1,11)} = 9.9 \cdot 10^{-16}$ to $S_{(1,1)} = 2^{(11-1)} = 1024$ (as seen in Table 3).
n = Total number of buckets
c = Current bucket column
$S_{(i,j)}$ = Score when comparing bucket i to bucket j.
Table 3
Bucket 1 2 3 4 5 6 7 8 9 10 11
1 1024 256 18.96296 0.666667 0.013653 0.000183 1.727E-06 1.211E-08 6.555E-11 2.822E-13 9.89E-16
2 256 512 56.88889 2.666667 0.068267 0.001097 1.209E-05 9.688E-08 5.9E-10 2.822E-12 1.088E-14
3 18.96296 56.88889 170.6667 10.66667 0.341333 0.006584 8.462E-05 7.75E-07 5.31E-09 2.822E-11 1.197E-13
4 0.666667 2.666667 10.66667 42.66667 1.706667 0.039506 0.0005923 6.2E-06 4.779E-08 2.822E-10 1.316E-12
5 0.013653 0.068267 0.341333 1.706667 8.533333 0.237037 0.0041464 4.96E-05 4.301E-07 2.822E-09 1.448E-11
6 0.000183 0.001097 0.006584 0.039506 0.237037 1.422222 0.0290249 0.0003968 3.871E-06 2.822E-08 1.593E-10
7 1.73E-06 1.21E-05 8.46E-05 0.000592 0.004146 0.029025 0.2031746 0.0031746 3.484E-05 2.822E-07 1.752E-09
8 1.21E-08 9.69E-08 7.75E-07 6.2E-06 4.96E-05 0.000397 0.0031746 0.0253968 0.0003135 2.822E-06 1.927E-08
9 6.56E-11 5.9E-10 5.31E-09 4.78E-08 4.3E-07 3.87E-06 3.484E-05 0.0003135 0.0028219 2.822E-05 2.12E-07
10 2.82E-13 2.82E-12 2.82E-11 2.82E-10 2.82E-09 2.82E-08 2.822E-07 2.822E-06 2.822E-05 0.0002822 2.332E-06
11 9.89E-16 1.09E-14 1.2E-13 1.32E-12 1.45E-11 1.59E-10 1.752E-09 1.927E-08 2.12E-07 2.332E-06 2.565E-05
Table 3 | Implementation example of the scoring matrix (Table 2) where the number of buckets (n) equals 11 (therefore $S_{(1,1)} = 2^{(n-1)} = 1024$). The cell colour, ranging from red to green, indicates the significance of a similarity score when comparing gene ranks between experiments. The most significant buckets hold few genes (buckets are smaller in size), yet have the potential of receiving the highest scores (shown in green), giving more significance to the most sensitive genes, provided that the most sensitive genes appear in close buckets for both experiments being compared (such as the scores in the blue rectangle). If a gene is in distant buckets, the score is lower, e.g. a strain in bucket 6 in both experiments is scored 1.42, while a strain in bucket 6 in one experiment and in bucket 5 in another is scored 0.237. For hits in the same bucket, the score is more significant for a lower-numbered bucket, e.g. a strain in bucket 2 in both experiments gets a score of 512, while a strain in bucket 4 in both experiments gets a score of 42.67.
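The values printed in Table 3 can be regenerated with a short Python sketch. The closed form used here, $S_{(i,j)} = 2^{(n-1)} / (m! \cdot m^{|i-j|})$ with $m = \max(i, j)$, is an assumption inferred from the printed values: it reproduces the top score $2^{(n-1)}$, the division of each diagonal cell by its column number, and the entries quoted in the caption, but it is a reading of the table rather than the formula or code of the thesis. The function name `scoring_matrix` is hypothetical.

```python
from math import factorial


def scoring_matrix(n):
    """Build an n x n levelled scoring matrix (assumed closed form).

    Entry [i - 1][j - 1] is the score for a strain in bucket i of one
    experiment and bucket j of the other. The formula is inferred from
    the values in Table 3 and is an assumption of this sketch.
    """
    def score(i, j):
        m = max(i, j)
        return 2 ** (n - 1) / (factorial(m) * m ** abs(i - j))

    return [[score(i, j) for j in range(1, n + 1)] for i in range(1, n + 1)]


matrix = scoring_matrix(11)
print(matrix[0][0])   # bucket 1 vs. bucket 1
print(matrix[1][1])   # bucket 2 vs. bucket 2
print(matrix[3][3])   # bucket 4 vs. bucket 4
print(matrix[5][4])   # bucket 6 vs. bucket 5
print(matrix[0][10])  # bucket 1 vs. bucket 11
```

Under this assumed form, scores fall quickly both with the bucket number and with the distance between the two buckets, which matches the two guidelines stated for the levelled scoring matrix.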
3.1. TAG4 Barcode Microarray Dataset
I ran the BE method on a dataset of TAG4 barcode microarray results (see section 5.1), which
included novel platinum-based chemical compounds in addition to well-characterised
compounds such as cisplatin. The dataset was created by screening novel platinum-acridine
conjugates, together with known DNA-damaging chemical compounds, against the complete pool
of ~6,000 barcoded deletion strains of Saccharomyces cerevisiae (1200 essential genes as
heterozygous diploids and 4800 non-essential genes as homozygous diploids), producing unique
genome-wide profiles (Cheung-Ong et al., In review; Giaever et al., 2002; Giaever et al., 2004;
Giaever et al., 1999; Winzeler et al., 1999). I used several correlation methods, including Pearson
(Pearson, 1909), Spearman (Spearman, 1904) and Kendall (Kendall, 1938), to find
similarities between the compounds. I then assessed each method by whether experiments
clustered by date, an unwanted outcome, or by chemical compound, the wanted
outcome (Figure 5, Figure 6). The BE method performed better than the other
methods, and can therefore help in understanding the mechanism of action of new chemical compounds by
comparing them to better-known chemical compounds.
I statistically assessed the distribution of similarity scores generated by each of the algorithms
using the Wilcoxon test (Figure 7) (Wilcoxon, 1945). Typically, when clustering experiments to
evaluate similarity, one would like to see experiments cluster according to experimental factors,
e.g. chemical compound or mechanism of action, and not according to, for example, the date of the
experiment. To assess whether the date of the experiment had an effect in batching the scores, I
used a two-sided Wilcoxon test on two vectors. The first vector contained the similarity scores of
pairs of experiments performed on the same date, and the second vector contained the scores of pairs
of experiments performed on different dates. The graphs represent the distribution of similarity
scores of both vectors (Figure 7a, 7c, 7e, 7g). There is a statistically
significant shift in the distribution of scores between the two vectors when the Pearson, Spearman or
Kendall algorithms are used (p-values 10^[…] to 10^[…], Figure 7a, 7c, 7e), indicating a strong
unwanted effect of the experiment’s date on the outcome. In contrast, the BE algorithm was not
significantly affected by date (p>0.05, Figure 7g). The statistical evaluation thus confirmed
that, compared to these other methods, the BE algorithm was least influenced by the date of the
experiment, visualized as a highly similar distribution of scores for same dates and different
dates (Figure 7g). This is because BE compares groups of genes, rather than single gene ranks. I
next evaluated whether the chemical compound used in an experiment had an effect in batching
the scores, again using the Wilcoxon test. I used two vectors: the first contained similarity scores for
pairs of experiments performed with the same chemical compound, and the second contained
scores of experiment pairs performed using different compounds (Figure 7b, 7d, 7f, 7h).
Repeated experiments, using the same chemical compound, received higher similarity scores
than experiments using different chemical compounds. The graphs represent the
distribution of similarity scores of both vectors, and demonstrate a statistically significant shift in
distribution for all algorithms used, indicating that all methods are affected by the chemical
compound present. The shift was largest for the BE algorithm, which attained the lowest p-value
(p=8.28e-23, W=40060) compared to the other methods (1.89e-10<p<0.0041,
26396<W<33347). This confirms that the chemical compound has the strongest effect on the
batching of scores produced by the BE method, as seen where the distribution of scores for different
compounds is much lower than the distribution of scores for identical compounds (Figure 7h).
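The same-versus-different comparison described above can be sketched in a few lines. This is a minimal illustration and not the code used for the thesis: the function names `split_pair_scores` and `rank_sum_test` are hypothetical, the p-value uses a normal approximation without continuity correction (so it will differ slightly from a statistics package), and W is assumed to follow the convention of the rank sum of the first vector minus its smallest possible value.

```python
import math
from itertools import groupby


def split_pair_scores(sim, labels):
    """Split the upper triangle of a similarity matrix into scores of
    same-label pairs and different-label pairs (label = date or compound)."""
    same, diff = [], []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            (same if labels[i] == labels[j] else diff).append(sim[i][j])
    return same, diff


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test, normal approximation with tie correction.

    Returns (W, p), where W is the rank sum of x minus n1 * (n1 + 1) / 2.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    rank_sum_x, tie_term, pos = 0.0, 0.0, 0
    for _, grp in groupby(pooled, key=lambda t: t[0]):
        grp = list(grp)
        t = len(grp)
        avg_rank = pos + (t + 1) / 2.0  # mean of ranks pos+1 .. pos+t
        rank_sum_x += avg_rank * sum(1 for _, src in grp if src == 0)
        tie_term += t ** 3 - t
        pos += t
    w = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return w, 1.0
    z = (w - mean) / math.sqrt(var)
    return w, math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Toy similarity matrix for four experiments; the numbers are illustrative only.
    sim = [[1.0, 0.9, 0.2, 0.3],
           [0.9, 1.0, 0.1, 0.4],
           [0.2, 0.1, 1.0, 0.8],
           [0.3, 0.4, 0.8, 1.0]]
    same, diff = split_pair_scores(sim, ["drugA", "drugA", "drugB", "drugB"])
    print(rank_sum_test(same, diff))
```

Running the split once with dates as labels and once with compounds as labels gives the two comparisons reported per method in Figure 7.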
To summarize this application of the BE algorithm, BE showed a clear difference between the
distributions of scores by date and by chemical compound: date has little effect on the
BE method (Figure 7g), while the chemical compound has a strong effect on the BE method
(Figure 7h). For each of the correlation methods other than BE, by contrast, the differences in
score distribution look similar for date and for chemical compound, which
means that experiments performed on the same date receive a score distribution nearly as high as
experiments where the same chemical compound was used (Figure 7a-b, 7c-d, 7e-f).
[Figure 5. Columns: cluster by date (left), cluster by chemical compound (right). Rows: ideal (a, b), random (c, d).]
Figure 5 | Expected results of an ideal outcome and a random outcome. The left column displays the cluster of experiments where the labels are the dates on which the experiment was performed (a, c). Adjacent identical dates are displayed in a red rectangle to indicate when clustering occurs by date. The right column displays the cluster of experiments where the labels are the chemical compound that was used for each experiment (b, d). Adjacent identical chemical compounds are displayed in a green rectangle as shown in the legend, to indicate when the same chemical compounds are clustering together. The ideal result shows that experiments, performed using the same chemical compound, cluster together according to chemical compounds, where each cluster can be seen in a green rectangle (b). The ideal result also shows that the experiments cluster by date only when they were performed using the same chemical compound (a). The random score did not cluster any of the experiments according to chemical compound (d), and clustered experiments by date only by chance.
[Figure 6. Columns: cluster by date (left), cluster by chemical compound (right). Rows: Pearson (a, b), Spearman (c, d), Kendall (e, f), BE (g, h).]
Figure 6 | Four correlation methods applied to the same dataset were clustered to show the performance of BE compared to other methods. The left column displays the cluster of experiments where the labels are the dates on which the experiment was performed (a, c, e, g). Adjacent identical dates are displayed in a red rectangle to indicate when clustering occurs by date. The right column displays the cluster of experiments where the labels are the chemical compound that was used for each experiment (b, d, f, h). Adjacent identical chemical compounds are displayed in a green rectangle to indicate when the same chemical compounds are clustering together. The desired result of a cluster is that similar conditions will cluster together. Examining the Pearson correlation cluster, the experiments cluster by date (a), due to a date batch effect. The BE method minimized the batch effect where identical dates did not cluster together (g), while identical conditions (chemical compounds) did cluster together (h).
[Figure 7. Columns: by date (left), by chemical compound (right). Rows: Pearson (a, b), Spearman (c, d), Kendall (e, f), BE (g, h).]
Figure 7 | The BE algorithm is least affected by the experiment date and most affected by the chemical compound used. The graphs show the distribution of scores. The graphs in the left column represent results by date (a, c, e, g). The solid blue line represents the score distribution of experiment pairs performed on identical dates, and the fragmented red line represents the score distribution of experiment pairs performed on different dates. The distributions by date differ significantly for Pearson, Spearman and Kendall correlations (a, c, e), whereas they are similar for BE (g), meaning that experiments done on the same date scored about the same as experiments done on different dates. The graphs in the right column represent the score distributions by chemical compound (b, d, f, h). The solid blue line represents the score distribution of experiment pairs using identical chemical compounds, and the fragmented red line represents the score distribution of experiment pairs using different chemical compounds. All methods show that the distribution of same-compound scores is significantly different from the distribution of different-compound scores, signifying, as expected, that all methods are affected by the chemical compound. The BE method shows the most significant difference in distribution compared to the other methods (h), being most affected by the chemical compound.
3.2. TAG3 Microarray 2004 PNAS Dataset
In order to evaluate the BE method on other types of datasets, I tested the method on a dataset
which included 80 published microarray results for 10 different FDA approved drugs including
anticancer and antifungal agents, statins, alverine citrate, and dyclonine (Giaever et al., 2004).
The assay used haploinsufficiency profiling, which comprises 6200 diploid heterozygous yeast
strains that can be sensitized to compounds that inhibit the product of the heterozygous locus.
Sensitization results from lowering gene dosage from two copies to one copy in the yeast
heterozygous deletion strain, and each strain was identified by its unique barcode sequence using TAG3
microarrays (see section 1.3) (Giaever et al., 1999). This dataset consisted of 4 to 16 replicate
experiments for each drug. The BE algorithm successfully located similarity between drugs
(Table 4), recapitulating the previously reported similarity between three drugs, alverine-citrate,
dyclonine and fenpropimorph (Giaever et al., 2004), which demonstrates the accuracy of the
algorithm (Figure 8d). In the original study, the similarity between drugs was found using a
parametric method that set a threshold to ignore genes with low fitness defects (<3SD) (Giaever
et al., 2004), whereas the BE method is non-parametric and did not ignore any genes when scoring
similarity between experiments. I also assessed similarity using other methods, including
Pearson, Spearman and Kendall correlations, all of which found similarity between these drugs.
However, BE was the only method that found the three drugs to be most similar to one another
(Table 4, Figure 8). All methods found the replicate experiments to be most similar to one another,
scoring the drug itself within the top two most similar drugs.
All methods found alverine-citrate, dyclonine and fenpropimorph to be highly similar, with
the closest result to BE being the Pearson correlation method. The Pearson results showed high
similarity between the three drugs, but with a higher similarity score between dyclonine
and miconazole (Figure 8a). BE found miconazole to be the next most similar drug to the three
mentioned drugs (Figure 8d), which suggests there is also some similarity in structure and mode
of action between the three mentioned drugs and miconazole. These lower level similarities are
found by BE when less significant buckets hold the same genes.
Table 4
Method     alverine-citrate   dyclonine    fenpropimorph   Total Identification
Pearson    3/3 (100%)         2/3 (67%)    3/3 (100%)      89%
Spearman   2/3 (67%)          2/3 (67%)    3/3 (100%)      78%
Kendall    2/3 (67%)          2/3 (67%)    3/3 (100%)      78%
BE         3/3 (100%)         3/3 (100%)   3/3 (100%)      100%
Table 4 | Top three drug similarity scores of the group of drugs that were reported as similar. Each drug column gives the number of reported drugs that were among that drug's top three highest scores. For example, Pearson correlation showed alverine-citrate experiments as most similar to all three reported drugs: alverine-citrate, dyclonine and fenpropimorph. BE is the only method which identified the similarity for all drugs (100%), recapitulating the previously reported similarity of alverine-citrate, dyclonine and fenpropimorph.
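The counting behind Table 4 can be sketched as follows. This is an illustration under stated assumptions: it takes a drug-by-drug score table as given (how replicate experiments are combined into one score per drug pair is not specified here), the function names `top_k` and `identification` are hypothetical, and the drug names and scores in the example are invented placeholders, not values from the study.

```python
def top_k(scores, drug, k=3):
    """Return the k drugs with the highest similarity to `drug` (the drug itself included)."""
    row = scores[drug]
    return sorted(row, key=row.get, reverse=True)[:k]


def identification(scores, group, k=3):
    """Count, for each drug in `group`, how many group members are in its top k,
    and return those counts with the overall fraction identified."""
    hits = {d: len(set(top_k(scores, d, k)) & set(group)) for d in group}
    per_drug_max = min(k, len(group))
    total = sum(hits.values()) / (per_drug_max * len(group))
    return hits, total


if __name__ == "__main__":
    # Placeholder scores: drug_A..drug_C play the role of the reported group,
    # drug_D is an unrelated drug that outranks one group member for drug_B.
    scores = {
        "drug_A": {"drug_A": 1.0, "drug_B": 0.8, "drug_C": 0.7, "drug_D": 0.2},
        "drug_B": {"drug_A": 0.8, "drug_B": 1.0, "drug_C": 0.3, "drug_D": 0.6},
        "drug_C": {"drug_A": 0.7, "drug_B": 0.3, "drug_C": 1.0, "drug_D": 0.1},
    }
    print(identification(scores, ["drug_A", "drug_B", "drug_C"]))
```

With a group of three drugs and k = 3, the overall fraction is the sum of the per-drug counts divided by 9, which is how totals such as 8/9 (89%) and 7/9 (78%) arise.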
[Figure 8. Panels a-e.]
Figure 8 | A comparison of barcode TAG3 microarray similarity results between a variety of correlation methods including Pearson (a), Spearman (b), Kendall (c) and BE (d). Each colour represents a drug, and each column represents the similarity scores of one drug to the other drugs, using bars coloured according to the compared drug. An example of a column is seen in panel a, showing similarity levels to alverine citrate as calculated using Pearson correlation. Each bar represents a different drug, and the size of each bar represents the level of similarity to alverine citrate as a percentage of the top score of the method used (e). To recapitulate the previously reported similarity between three drugs (alverine-citrate, dyclonine and fenpropimorph), I used different methods and ascertained that all methods found similarity between these drugs, as seen in the orange (alverine-citrate), green (dyclonine) and blue (fenpropimorph) bars. The top three most similar drugs are listed within each drug’s similarity column for each method. For the BE method, the top three values for these compounds are the three compounds themselves; the chemical structures of these drugs are similar, consistent with a similar mode of action (d). BE was the only method where all three drugs shared the same top three similar drugs.
3.3. Gene Expression (Transcript Abundance) Dataset
Having shown that BE works on barcode data from different studies, I next evaluated the BE method
on an entirely different data type: genome-wide expression profiles from yeast. In this instance,
gene expression is the measurement of transcript abundance, which is used as a proxy for
the relative transcriptional activity of genes. Using microarrays, thousands of genes can be
analyzed at once, creating a global picture of transcript abundance (see section 1.1).
For this analysis I selected the widely used study of Gasch et al., which contains microarray
results for 173 environmental stress experiments for all ~6000 genes (Gasch et al., 2000). These
data comprise the genomic expression responses of Saccharomyces cerevisiae to diverse
environmental conditions such as heat shock, oxidative and reductive stress, osmotic shock,
nutrient starvation, DNA damage and extreme pH. In this dataset, high correlation
scores between genes, represented by the transcript abundance measured, are indicative of a
shared response to stress. These data were initially analyzed using fuzzy k-means (Gasch and
Eisen, 2002), a method that differs from standard k-means in that it provides a membership
value for each gene to each centroid. Such membership permits each gene (scored according to
transcript abundance) to belong to more than one centroid, as it may be co-regulated with several
groups. Gasch and co-workers used prior knowledge about the data to select the k value
according to the expected number of clusters, and chose the initial centroid locations according
to known regulatory elements; I therefore used their results as a benchmark. The BE method
positions the most affected genes, those with the highest score represented by transcript
abundance, in the top significant buckets, so that experiments sharing top genes receive a high
score when their buckets are compared. This resulted in a high correlation score specifically
between groups of highly affected genes, confirming the previously reported group of ~900
specific genes that were found to be strongly affected throughout all stress treatments (Figure
9). This group of environmental stress response genes represents a common gene expression response to
stress, helps in understanding the cell response to stress, and helps reflect the bias in experimental gene
study due to the activity of these genes in unfavourable conditions (Giaever et al., 2002). Furthermore,
the BE score cluster allowed dividing the specific 868 genes mentioned by Gasch et al. into two
groups of 586 and 282 genes, where each group was affected counter to the other group (Figure
10) (Berry and Gasch, 2008). The affected genes received significantly greater scores
than the less affected genes (p<2e-16; Figure 9c, Figure 9f). These findings suggest that
one can use the BE algorithm to locate unique groups of genes that display a similar pattern of
behaviour within certain experimental conditions, e.g. stress conditions or the presence of
chemical compounds. For locating groups of similarly affected genes, the BE method performed
as well as the other correlation methods (Pearson, Spearman and Kendall), which also gave
significantly higher scores to the reported genes (Figure 11),
presenting an additional application of the method.
[Figure 9. Panels a-f.]
Figure 9 | In order to locate genes of interest, the BE method was executed on a dataset of yeast response to environmental changes. Because both negative values and positive values are meaningful, I created two datasets: the first included all positive values (negative values were set to 0), and the second included all negative values, set to their absolute value (positive values were set to 0). Results show that the BE method successfully located the most affected genes, according to measured transcript abundance, confirming the 586 positively affected genes (a) and the 282 negatively affected genes (d), marked in yellow in the ranked scores as the most strongly affected genes. The higher scores that the 868 genes received compared to other genes can be seen in light green for both positive (b) and negative (e) scores. The 868 genes received significantly greater scores than other genes, both for positively (c, P<2e-16) and negatively (f, P<2e-16) affected genes, where the full green line represents the positively affected, induced genes (c) and the negatively affected, repressed genes (f), and the fragmented red line represents the rest of the genes. The distribution of scores for the less affected genes displays two peaks, due to lower scores for the negative genes compared to the other genes, seen as two dark stripes (b) and marked in blue at the low-end scores (a).
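The split into a positive and a negative dataset described in the caption can be sketched as below. The function name `split_by_sign` and the list-of-lists layout (rows as genes, columns as experiments) are assumptions made for illustration.

```python
def split_by_sign(matrix):
    """Split a signed matrix into two non-negative matrices.

    positive: negative values replaced by 0.
    negative: negative values replaced by their absolute value, positive values by 0.
    """
    positive = [[v if v > 0 else 0.0 for v in row] for row in matrix]
    negative = [[-v if v < 0 else 0.0 for v in row] for row in matrix]
    return positive, negative


if __name__ == "__main__":
    # Illustrative log-ratio values for two genes across three experiments.
    ratios = [[1.5, -2.0, 0.0],
              [-0.5, 0.25, 3.0]]
    pos, neg = split_by_sign(ratios)
    print(pos)
    print(neg)
```

Each of the two resulting matrices can then be ranked and scored separately, so that strongly induced and strongly repressed genes each land in the top buckets of their own dataset.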
[Figure 10. Panels a, b.]
Figure 10 | Differentiation between the induced and repressed genes within the group of ~900 genes in the Gasch et al. dataset, represented by transcript abundance measured. Running the BE method separates the induced and repressed genes by clustering them into 2 separate branches in the dendrogram (a). These genes are anti-correlated: ~300 genes are either repressed or induced in an anti-correlated manner to ~600 genes, depending on the stress experiments performed (b).
[Figure 11. Panels a-h.]
Figure 11 | The distribution of scores of the Gasch et al. study dataset. The green line represents the score distribution of the previously reported group of genes found to be significantly affected by the stress treatments. For the negative score dataset (a, b, c, d), the green line represents the group of ~300 repressed genes, and for the positive score dataset (e, f, g, h), the green line represents the group of ~600 induced genes. The fragmented red line represents the score distribution of the genes other than the reported group of genes. The methods used for comparing the score distribution included BE, Pearson, Spearman and Kendall correlations. All methods showed significantly higher scores for the reported genes (similar W statistic value), successfully locating the affected genes. The BE method performed as well as other methods in identifying the affected group of genes. Moreover, it differentiated the lower results and identified anti-correlation between the two groups of ~300 and ~600 affected genes by showing two peaks for the lower scores.
3.4. High Throughput Sequencing Dataset
An additional type of dataset on which I evaluated the BE method was high throughput sequencing
data of chemogenomic profiles, performed in a manner similar to that described in my initial test
(see section 1.3). The fitness of the yeast strains was assessed using SOLiD sequencing in a
multiplex format, allowing many experiments to be sequenced concurrently (Smith et al., 2010).
For this method, each strain carries a strain-specific barcode. In addition, each individual
experiment carries a second, unique barcode (the multiplexing tag), so that one can simultaneously
identify both the strain and the experiment from which a sequence originated. The sequencing
results consisted of counts of barcode
sequences representing the abundance of strains in each experiment. The fitness defects are
expressed as a log2 ratio of the strain-specific barcode counts versus the mock condition, which
captures the differences between the treatment and control. This creates a sequencing result matrix
of strain fitness that provided a dataset for BE. I ran the algorithm on 12 experiments,
comprising 4 repeated experiments for each of 3 different drugs. The BE method
successfully clustered the repeated experiments together according to
the drug (Figure 12a). Same-drug experiments had significantly higher scores than
different-drug experiments (P=1.27e-20, Figure 12b). These findings
confirm that one can use the BE method to compare different chemical compounds using data
originating from high throughput sequencing experiments. The BE method performed better than
the Pearson correlation method (compare the clustering of repeated experiments in Figure 13a
with Figure 13d), and as well as the non-parametric Spearman and Kendall
correlations (Figure 12, Figure 13, Figure 14). This is an important result, as many experiments
have recently been superseded by sequencing alternatives, such as assessing abundance of yeast
deletion strains using barcodes (Smith et al., 2009), mapping of the yeast genome (Nagalakshmi
et al., 2008), stem cell transcriptome profiling (Cloonan et al., 2008), mammalian cell
transcriptome mapping (Mortazavi et al., 2008) and epigenetics studies of plants (Lister et al.,
2008).
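The log2 ratio of barcode counts against the mock condition, described above for the sequencing data, can be sketched as follows. Several details here are assumptions and not taken from the text: the function name `log2_fitness`, the pseudocount added to avoid division by zero, the absence of any library-size normalization, and the sign convention (treatment over control, so depleted strains get negative values).

```python
import math


def log2_fitness(treatment_counts, control_counts, pseudocount=1.0):
    """Return {strain: log2((treatment + pseudocount) / (control + pseudocount))}.

    Both arguments map a strain barcode to its read count in one experiment.
    Strains missing from the treatment are treated as having zero reads.
    """
    return {
        strain: math.log2(
            (treatment_counts.get(strain, 0) + pseudocount)
            / (control + pseudocount)
        )
        for strain, control in control_counts.items()
    }


if __name__ == "__main__":
    # Illustrative counts for three hypothetical strains.
    control = {"strain_1": 7, "strain_2": 1, "strain_3": 15}
    treated = {"strain_1": 3, "strain_2": 0, "strain_3": 15}
    print(log2_fitness(treated, control))
```

Applying this to every experiment against its mock control yields one column of the strain-by-experiment fitness matrix that a correlation method then takes as input.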
I implemented the BE method as a program with a graphical user interface.
The application loads an input dataset, provided by the user, and produces a similarity
matrix according to the BE variable definitions.
[Figure 12. Panels a, b.]
Figure 12 | Running the BE method on high throughput sequencing data successfully clusters experiments using the same drug (a). I used the Wilcoxon test to compare the score distributions (b) of same-drug experiments (green line) and different-drug experiments (red line). Same-drug experiments received significantly higher scores than different-drug experiments (P=1.27e-20).
[Figure 13. Panels a-d.]
Figure 13 | A comparison of several methods, including Pearson (a), Spearman (b), Kendall (c) and BE (d), for finding correlations between barcode sequencing experiments. A heat-map and dendrogram display the clustering of experiments for each method. For the BE, Spearman and Kendall methods, all experiments that were performed using the same drug clustered together, showing that BE (d) performed as well as the other non-parametric methods, Spearman (b) and Kendall (c). BE performed better than the Pearson correlation (a), where not all same-drug experiments clustered together.
[Figure 14. Panels a-d.]
Figure 14 | The distribution of correlation scores of barcode sequencing experiments for several methods, including Pearson (a), Spearman (b), Kendall (c) and BE (d). The full green line represents the similarity score distribution of experiments performed using the same drug, while the fragmented red line represents the score distribution of experiments performed using different drugs. All methods give significantly greater scores to experiments performed using the same drug.
4. Conclusions
Rigorous evaluations on several datasets, which included TAG4 microarrays, TAG3
microarrays, gene expression microarrays, and high throughput sequencing data, show that the
BE algorithm overcomes batch effects (Figure 6). I confirmed that the BE algorithm
outperforms other well-established methods by statistically validating the differences between
score distributions and comparing these differences between the BE method and other methods
(Figure 7). Clustering of results showed that the BE algorithm successfully identified similar
conditions for microarray and sequencing data (Figure 6, Figure 8d and Figure 12). The BE
method performed as well as other methods at identifying the group of key genes that are
most sensitive to environmental changes: these genes attained the highest similarity scores,
confirming the findings of Gasch et al. (2000) (Figure 9). The BE algorithm can thus provide
another analytical tool to aid in understanding the mechanism of action of characterized
and uncharacterised compounds, both through the similarity between compounds and by identifying the
gene targets of specific experimental conditions (the chemical compound in use or environmental
changes). Similarity of an unknown chemical compound to known drugs suggests a similar
mode of action and provides information about possible applications of the unknown
compound. Similarity between a drug and other known drugs can suggest additional applications
and a better understanding of the mode of action of that drug.
Having tested the BE method on data arising from different technological platforms, I conclude
that the method is applicable to other datasets where correlation between values is needed.
The BE variables can be changed to fine-tune the method for different datasets. For example, for high
throughput sequencing data I set the first bucket size to 0.05% of the total number of
genes and the maximum number of buckets to 20. In general, achieving accurate correlation
of results may involve changing these variables (as explained in section 6.2.1). The general
concept of bucket weighted scores can therefore be applied both to groups of highly similar
profiles and to diverse matrices, depending on how the variables are defined. This method may also
be applicable to data collected from emerging technologies, such as next generation sequencing,
as finding correlations between results will continue to be beneficial (Smith et al., 2010).
I note that, although the BE method is applicable to many kinds of datasets, like any algorithm it may not suit
all datasets. When considering whether to use the BE method or another method, one
should take several factors into account. The first is whether both positive
and negative values in the data are significant. Because the BE method evaluates scores according to rank,
datasets in which both positive and negative values are significant are not analyzed properly:
negative values are appraised as insignificant relative to positive values. For example, a genomic
expression dataset, which measures transcript abundance, can hold positive scores for induced genes
and negative scores for repressed genes. Both positive and negative values are then
significant, as both show a change in the cell's response to the conditions measured in the
experiment. A possible way to surmount this problem, which I used in my study, is to create
two datasets from the original dataset. The first dataset holds the positive values, and the
second dataset holds the absolute values of the original negative values, with the
original positive values removed. Running separate analyses for positive and negative values can be
sufficient for locating the affected genes. However, such a
solution is not ideal, as it adds several steps to the analysis and may not be accurate in cases
where none of the genes were repressed in the
experimental conditions.
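The two-dataset workaround described above can be sketched in a few lines of Python. This is only an illustration, not the thesis software: the dictionary layout, the function name `split_by_sign`, and the choice to replace removed values with 0 (so that both datasets keep the same gene list) are assumptions.

```python
from typing import Dict, Tuple

# Hypothetical layout: gene name -> signed expression score.
Profile = Dict[str, float]


def split_by_sign(profile: Profile) -> Tuple[Profile, Profile]:
    """Return (induced, repressed) profiles whose values are all non-negative.

    induced   keeps the original positive values; all other genes become 0.
    repressed keeps the absolute value of the original negative values;
              all other genes (including the original positives) become 0.
    ASSUMPTION: "removed" values are zeroed rather than dropped.
    """
    induced = {gene: (value if value > 0 else 0.0) for gene, value in profile.items()}
    repressed = {gene: (-value if value < 0 else 0.0) for gene, value in profile.items()}
    return induced, repressed
```

Each of the two resulting datasets can then be analysed separately, since larger values are more significant in both.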
The second factor is whether there is prior knowledge about the dataset that the user wishes to
take into account when assessing similarity between experiments. An example is the work done
by Gasch and co-workers (see section 3.3), in which they wished to filter out highly regulated
genes. To do so, they used the fuzzy k-means method, which uses prior
knowledge about the expected number of clusters and about regulatory elements. This
resulted in filtering out many genes that are highly co-regulated, based on prior knowledge of the
regulation factors. If the user wishes to ignore subsections of the dataset, the BE method is not
suitable, as it is specifically designed to avoid the need for prior knowledge about the dataset and
to analyse the entire dataset in order to maximize the number of scientifically
significant results that can be discovered in it. If the user insists on using the BE
method, the dataset would have to be edited to exclude the data the user wishes to ignore;
however, this approach is not as straightforward as using a method that relies on prior
knowledge.
The researcher should opt to use the BE method when he/she is not interested in including prior
knowledge in the analysis or when no prior knowledge is available, when the scores are deemed
more significant as their value increases, and when he/she wants to include all the gene scores in
the dataset without restricting them using a threshold value. The researcher can safely choose to
use the BE method, as comparison with various methods showed that BE consistently
performs better than or as well as other parametric and non-parametric methods. In
the results for the TAG3 microarray dataset (see section 3.2), the BE method clearly performed
better than other non-parametric methods. Pearson correlation, a parametric method, performed
almost as well as the BE method in this analysis. Furthermore, in the high throughput sequencing
dataset results (see section 3.4), the BE method’s performance was comparable to that of the
non-parametric methods Spearman and Kendall in terms of statistical results (BE: test statistic = 4484,
$p \approx 10^{[\ldots]}$; Spearman and Kendall: test statistic = 4608, $p \approx 10^{[\ldots]}$). On the other hand, the Pearson correlation
performed worse than the non-parametric methods on this dataset, with a
smaller test statistic (test statistic = 3640, $p \approx 10^{[\ldots]}$).
5. Methods
5.1. Levelled scoring matrix
The levelled scoring matrix is constructed of decreasing scores, from high scores for a gene in
closely ranked groups (buckets) to low scores for a gene in distant groups (buckets). When
comparing profiles, the score matrix yields the score $S_{i,j}$ for a gene located in bucket $i$ in one
profile and bucket $j$ in the other. The scoring matrix follows these
guidelines: (1) For each experiment, the strains are divided into buckets. The buckets are ordered
by importance, so that the lowest-numbered bucket holds the strains with the highest fitness defect.
(2) Assign higher scores to hits in different experiments that fall within the same bucket,
while taking into consideration that the first buckets are more significant than the last buckets. Here
$S_{i,j}$, for experiments $Exp_1$ and $Exp_2$, is the score of a fitness defect strain that is located in
bucket $i$ in $Exp_1$ and in bucket $j$ in $Exp_2$. (3) $\forall i,j \mid i<j \Rightarrow S_{i,i} > S_{j,j}$. For example: $S_{1,1} > S_{2,2}$.
(4) Assign a higher score to hits in closer buckets: $\forall i,j,k \mid i<j<k \Rightarrow S_{i,j} > S_{i,k}$. For example:
$S_{2,3} > S_{2,4}$.
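To make guidelines (3) and (4) concrete, the following Python sketch builds one possible matrix that satisfies them and checks both inequalities for every valid index combination. The formula used here is hypothetical and chosen only because it satisfies the guidelines; it is not taken from the BE software, and the function names are illustrative.

```python
from typing import List


def levelled_scores(n_buckets: int) -> List[List[float]]:
    """Build an illustrative n x n levelled scoring matrix (buckets are 1-based).

    HYPOTHETICAL rule: S[i][j] = (n - max(i, j) + 1) / (1 + |i - j|).
    The numerator rewards early buckets; the denominator penalizes distance.
    """
    n = n_buckets
    return [
        [(n - max(i, j) + 1) / (1 + abs(i - j)) for j in range(1, n + 1)]
        for i in range(1, n + 1)
    ]


def check_guidelines(scores: List[List[float]]) -> bool:
    """Verify guidelines (3) and (4) for every valid index combination."""
    n = len(scores)

    def s(i: int, j: int) -> float:
        return scores[i - 1][j - 1]  # translate 1-based buckets to list indices

    # (3) for all i < j: S[i,i] > S[j,j]
    rule3 = all(s(i, i) > s(j, j) for i in range(1, n + 1) for j in range(i + 1, n + 1))
    # (4) for all i < j < k: S[i,j] > S[i,k]
    rule4 = all(
        s(i, j) > s(i, k)
        for i in range(1, n + 1)
        for j in range(i + 1, n + 1)
        for k in range(j + 1, n + 1)
    )
    return rule3 and rule4
```

Any other matrix can be passed to `check_guidelines` in the same way to confirm that it respects the two ordering constraints.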
5.2. Software imaging and implementation
Images and analyses were created using R (Team, 2011). Figure 3b was created using SPSS. The
BE software was developed in C# on the .NET Framework 3.0. The software is available for
download at: http://chemogenomics.med.utoronto.ca/supplemental/BE/.
6. Bucket Evaluations Software
To create a program for running the BE algorithm, I took several
design goals into consideration, such as: (1) allowing the user to decide on resource
allocation for better performance according to the capabilities of the hardware, and (2) providing an easy-to-use
graphical user interface (GUI) (Figure 15).
The program has a multithreaded architecture that allows the GUI to remain responsive while
multiple threads execute the analysis on the dataset provided (Figure 16). The design of the
program consists of 3 independent threads: the main GUI window (MGUIW) thread, the
information form thread, and the BE Thread Manager (BETM). In addition to these threads are
the dataset analysis threads. The number of Dataset Threads (DT) is a variable set by the user
(Figure 16).
Figure 15 | Bucket Evaluations Software Graphical User Interface. This image shows the main window that is displayed to the user. It guides the user through the steps needed to load a dataset (step 1), set the algorithm’s variables (step 2), and run the BE algorithm to produce a similarity matrix as a file (step 3). Help for each variable is available through a tool-tip hover button.
Figure 16 | Bucket Evaluations Software Architecture. Each rectangle represents a class in the program. The green, blue and orange rectangles represent separate threads. The Thread Barrier (purple rectangle) is a separate object used by the BE Thread Manager.
[Figure 16 diagram labels: Main GUI Window; Info Form; BE Thread Manager; Thread Barrier; Dataset Thread (repeated, one per analysis thread)]
6.1. User Experience
Running the analysis requires three steps, as described in the MGUIW (Figure 15):
1. Choose a data source file for which you wish to create a similarity matrix. Clicking the
button opens a file browser for choosing the data source file (Figure 17). The
data source file should be a standard tab delimited file, which includes the column names,
row names and numeric data results.
2. Set the algorithm variables according to the data type you are using. As previously
mentioned, adjusting these values may make the results more accurate. It is therefore
recommended to run the source dataset multiple times with different values. This
allows the variables to be fine-tuned to best suit the user’s data source.
3. Clicking the “Run” button opens a window asking the user for a file save
location (Figure 18). The program writes the similarity matrix to the save
location, which is the output target file. The program commences the analysis of the data
once the user confirms the output location. If the dataset is in an incorrect format, or if
there is any other problem with the run, a message is displayed to the user by
exception handlers, which are code sections that deal with faulty input.
Once the program commences the analysis, the status of the run is displayed to the user. The
status is displayed by using a progress bar, percentage status, and status text components, which
are constantly updated throughout the run.
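The input format from step 1 (a tab delimited file with column names, row names and numeric values) and the validation mentioned in step 3 can be illustrated with a short Python sketch. The function name `read_matrix` and its error messages are assumptions made for this example; the BE software itself is a C# program and its parsing code is not shown in this document.

```python
import csv
import io
from typing import List, Tuple


def read_matrix(text: str) -> Tuple[List[str], List[str], List[List[float]]]:
    """Parse a tab delimited table with a header row and a row-name column.

    Returns (column_names, row_names, values). Raises ValueError on faulty
    input, mirroring the idea of reporting format problems to the user.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if len(rows) < 2:
        raise ValueError("expected a header row and at least one data row")
    column_names = rows[0][1:]  # the first header cell is the empty corner
    row_names: List[str] = []
    values: List[List[float]] = []
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(column_names) + 1:
            raise ValueError(f"line {line_no}: wrong number of fields")
        row_names.append(row[0])
        try:
            values.append([float(cell) for cell in row[1:]])
        except ValueError as exc:
            raise ValueError(f"line {line_no}: non-numeric value") from exc
    return column_names, row_names, values
```

Calling `read_matrix("\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9")` yields the three column names, the three row names and a 3 × 3 list of floats.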
Figure 17 | Bucket Evaluations Software - Load file location window
Figure 18 | Bucket Evaluations Software - Save file location window
6.2. Main GUI Window (MGUIW)
The MGUIW thread is the initial execution thread and is responsible for user input and status
notifications to the user. The user input includes the input file selection window, initial input file
validation, algorithm parameter input, etc. (Figure 15). The MGUIW thread also includes
tool-tip hover labels, which provide the user with a text explanation of the variable fields to fill in.
6.2.1. User Input
Each input parameter consists of a GUI object that is relevant to the type of data needed (e.g. file
location requires text and initial bucket size requires a number). The parameters that the user
can modify include:
• “Choose data source” – A variable input that consists of textbox and button
components. Allows users to select the data file that they wish to load. The data must
be a tab delimited file including column and row names. For example (file format:
“\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9”):
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
• “Use Pre-set Values” – A variable input that consists of a combo box component. Allows
the user to choose pre-set values for the BE variables: “Stringent”, in which the score
depends on closer agreement between the ranks of values, and “Broad”, in which the group of top
ranked values is larger. A large group of top ranked values produces a high score for
more distant ranks in each bucket. The pre-set values assume that there are ~6000
values to compare; suitable sizes vary with the size of the dataset.
• “Number of Additional Threads” – A variable input that consists of a numeric up-down
component. Increasing the number of threads allows the algorithm to run parts of the analysis
concurrently, which may produce results faster. Thread
performance depends on the computer's hardware; increasing the number of
threads beyond what the hardware can support may therefore slow the analysis (see section 1.6.1).
• “Initial Bucket Size (%)” – A variable input that consists of a numeric up-down
component. Allows the user to select the size of the first bucket as a
percentage of the dataset size. For example, if there are 10000 values to compare and the
initial bucket size is set to 0.05, then the first bucket, which holds the most
significant values, will hold the top 5 values (0.05% of 10000). Subsequent buckets
are larger in size (Table 2). A small value results in fewer scores being considered top
hits, giving a more stringent result when comparing columns/rows. Because the
initial bucket size affects the results of the algorithm, it is recommended to run
the algorithm several times with different sets of variables to find the ideal
variable values (Figure 19). These values can also be changed by using the different
options in the pre-set values combo-box.
• “Maximum Number of Buckets” – A variable input that consists of a numeric up-down
component. Allows the user to select the maximum number of levels into which the algorithm
divides each experiment. For example, when comparing fitness defects of genes, this
variable represents the maximum number of groups into which the genes are divided. A small
value in this field reduces the effect of the lower values
in the dataset on the similarity score (Table 2).
• “Comparing Columns/Rows” – A variable input that consists of radio button
components. Allows the users to select which analysis they wish to run on the dataset. For
example, if the columns represent the experiments and the rows represent the genes’
fitness defects, then selecting the “Column” radio button will create an experiment
similarity scoring matrix. For example, for an input file such as (file format:
“\tC1\tC2\tC3\nR1\t1\t2\t3\nR2\t4\t5\t6\nR3\t7\t8\t9”):
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9
selecting 'Column' will result in a similarity scoring matrix of C1-C3 versus C1-C3, while
selecting 'Rows' will result in a scoring matrix of R1-R3 versus R1-R3.
• “Set score of lowest bucket to 0” – A variable input that consists of a checkbox
component. If checked, the score for the lowest bucket, which contains the lowest ranks, is set to 0,
so that agreement at the low end of the ranks does not contribute to the similarity score.
If left unchecked, the lowest ranks are included in the final similarity score
and can raise it.
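The effect of the bucket parameters and of the column/row choice listed above can be sketched in Python. The first-bucket calculation follows the worked example in the text (0.05% of 10000 values gives 5). The growth rule for later buckets is an assumption: the text only states that subsequent buckets are larger (Table 2), so the doubling used here is purely illustrative, as are the function names.

```python
from typing import List


def first_bucket_size(n_values: int, initial_percent: float) -> int:
    """Size of the first bucket: a percentage of the dataset, at least 1."""
    return max(1, round(n_values * initial_percent / 100))


def bucket_sizes(n_values: int, initial_percent: float, max_buckets: int) -> List[int]:
    """Split n_values ranks into at most max_buckets buckets of growing size.

    ASSUMPTION: each bucket is twice as large as the previous one, and the
    last bucket absorbs whatever ranks remain. The real sizes may differ.
    """
    sizes: List[int] = []
    remaining = n_values
    size = first_bucket_size(n_values, initial_percent)
    while remaining > 0 and len(sizes) < max_buckets - 1:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= 2
    if remaining > 0:
        sizes.append(remaining)  # final bucket holds the lowest ranks
    return sizes


def profiles(values: List[List[float]], compare: str) -> List[List[float]]:
    """Return one list per profile: columns (transposed) or rows (as given)."""
    if compare == "columns":
        return [list(column) for column in zip(*values)]
    if compare == "rows":
        return [list(row) for row in values]
    raise ValueError("compare must be 'columns' or 'rows'")
```

With a small maximum number of buckets, most of the lower-ranked values fall into the single last bucket, which matches the description that a small value reduces the influence of the lower values on the similarity score.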
Figure 19 | Example of the different outputs produced by different bucket sizes when running the bucket evaluations algorithm. The dataset originated from high throughput sequencing (see section 3.4), with the initial bucket size set to a broad value of 5% (a) and a stringent value of 0.05% (b). These dendrograms show the importance of trying several parameter settings to find the best fit for the dataset. Both results show same-drug treatments clustering together, but this dataset required a stringent set of variables: in dendrogram b all experiments with the same chemical compound clustered together, whereas with the broad values not all did (cisplatin).
6.2.2. Status Notifications
In addition to the user input, the MGUIW is responsible for status notifications to the user. Status
notifications are important because they tell the user which stage of the analysis is
running at each moment. To allow the BETM and DTs to change the components of the
MGUIW, delegates execute MGUIW methods from external threads. These delegates provide an
up-to-date status to the user by manipulating several components, such as:
• Progress Bar – Provides a graphical display of progress percentage (Figure 20B).
• Percentage Label – Provides a numerical display of progress percentage (Figure 20C).
• Status Text Label – Provides a brief text explanation of the current analysis action that is
performed on the dataset (Figure 20D). Once the analysis is completed, the Status Text
Label displays the output file location.
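The status components above are updated from worker threads through delegates. The same idea can be sketched in Python with a thread-safe counter that forwards each update to a callback. The class name `ProgressReporter` and the callback signature are assumptions for illustration; in a real GUI toolkit the callback would additionally have to be marshalled onto the GUI thread, which is the role the delegates play in the .NET program.

```python
import threading
from typing import Callable


class ProgressReporter:
    """Thread-safe progress counter that forwards updates to a UI callback."""

    def __init__(self, total: int, on_update: Callable[[int, str], None]) -> None:
        self._total = total
        self._done = 0
        self._lock = threading.Lock()
        self._on_update = on_update  # hypothetical hook: (percent, status text)

    def step(self, status_text: str) -> None:
        """Record one finished subtask and push the percentage and status text."""
        with self._lock:  # only one worker updates the counter at a time
            self._done += 1
            percent = 100 * self._done // self._total
            self._on_update(percent, status_text)
```

A single callback can drive the progress bar, the percentage label and the status text label, since all three display the same underlying state.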
Figure 20 | Program GUI during a run. The user can cancel the run by clicking the “Cancel” button (A), which is located where the “Run” button was prior to execution. The status is presented to the user using a progress bar (B), the percentage of the run completed (C), and text that provides a brief explanation of the current analysis action being performed (D).
6.2.3. Cancel Run
Once a run has been started by clicking the “Run” button, the MGUIW allows the user to
cancel the run by clicking the “Cancel” button (Figure 20A). The “Cancel” button is located at
the same location as the “Run” button was. Once “Cancel” is clicked, a series of events
is initiated to terminate all running threads. The button-click event raises a flag that is
periodically checked by the BETM. The raised flag leads to the following actions: (1) it prevents
the creation of additional DTs by the BETM, (2) it prevents existing DTs that are queued but not yet
running from starting analysis on the dataset, and (3) it leaves the DTs that are already running to
terminate upon completion.
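The cooperative cancellation described above can be illustrated with a minimal, single-threaded Python sketch. It uses `threading.Event` as the flag; the function name `run_subtasks` is hypothetical, and the real program checks the flag from the BETM while DTs run on their own threads.

```python
import threading
from typing import Callable, List


def run_subtasks(subtasks: List[Callable[[], None]], cancel: threading.Event) -> int:
    """Run subtasks in order, checking the cancel flag before starting each one.

    A subtask that has already started is left to finish; subtasks still in
    the queue are skipped once the flag is raised. Returns the number of
    subtasks that were run.
    """
    completed = 0
    for task in subtasks:
        if cancel.is_set():  # flag raised by the "Cancel" button handler
            break            # do not start any further subtasks
        task()
        completed += 1
    return completed
```

For example, if the second of three subtasks raises the flag by calling `cancel.set()`, that subtask still completes and the third is never started, so the function returns 2.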
6.3. Information Form
The Information Form is a separate thread, accessible through a button on the MGUIW. It
provides information about the software version and usage citation. Because this form is executed as a
separate thread, it can also be displayed while the analysis is underway.
6.4. BE Thread Manager (BETM)
The BETM is a thread that controls the flow of the analysis according to the user’s parameters
and orchestrates multiple DTs. To do so, it uses thread control tools such as
semaphores and thread barriers.
Semaphores are software objects that limit the number of running threads. If the defined number
of running threads is at its maximum, the semaphore puts any additional threads into sleep mode.
Once a running thread terminates, the semaphore activates one of the queued threads.
Semaphores were used to limit the number of DTs running at the same time. The number of
allowed threads is set by the user prior to the run. If the user sets the number of additional threads
to 0 (Figure 15), the analysis is performed on the BETM thread and not on additional
DTs.
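The way a semaphore caps the number of concurrently running DTs can be sketched with Python's `threading.Semaphore`. This is an analogue of the mechanism rather than the C# implementation; the function name `run_limited` and the peak-tracking bookkeeping are added only so the limit can be observed.

```python
import threading
from typing import Callable, List


def run_limited(subtasks: List[Callable[[], None]], max_running: int) -> int:
    """Run every subtask in its own thread, at most max_running at a time.

    Returns the highest number of subtasks observed running simultaneously.
    If max_running is 0, the subtasks run on the calling thread instead,
    as described for the case of 0 additional threads.
    """
    if max_running == 0:
        for task in subtasks:
            task()
        return 0

    slots = threading.Semaphore(max_running)
    lock = threading.Lock()
    running = 0
    peak = 0

    def worker(task: Callable[[], None]) -> None:
        nonlocal running, peak
        with slots:  # blocks (sleeps) while all slots are taken
            with lock:
                running += 1
                peak = max(peak, running)
            try:
                task()
            finally:
                with lock:
                    running -= 1

    threads = [threading.Thread(target=worker, args=(task,)) for task in subtasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak
```

The returned peak never exceeds `max_running`, because a worker can only start its subtask after acquiring one of the semaphore's slots.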
The thread barrier is an object that prevents a selected thread from running as long
as a certain group of signal threads has not completed its run. The barrier does so
by putting the selected thread into sleep mode. Once the group of signal threads
completes its run, the sleeping thread is awakened. The thread barrier was not part of the .NET
Framework 3.0; I therefore implemented the barrier object as a separate class. The barrier was used
for controlling the stages of the analysis, adding DTs to the queue only for relevant sections of
the dataset. I also used the thread barrier to prevent the queue of DTs from becoming too
large. Preventing an oversized DT queue is important in the event that the “Cancel” button is
clicked, as it limits the number of queued threads that need to be cancelled.
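A barrier of the kind described, where one thread sleeps until a group of signal threads has finished, can be sketched in Python with a condition variable. The class name `SignalBarrier` and its method names are assumptions for this example and do not reproduce the thesis's C# class.

```python
import threading


class SignalBarrier:
    """Blocks waiting threads until `count` signal threads have reported in."""

    def __init__(self, count: int) -> None:
        self._remaining = count
        self._cond = threading.Condition()

    def signal(self) -> None:
        """Called by each signal thread when it completes its run."""
        with self._cond:
            self._remaining -= 1
            if self._remaining <= 0:
                self._cond.notify_all()  # wake the sleeping thread(s)

    def wait(self) -> None:
        """Called by the selected thread; sleeps until all signals have arrived."""
        with self._cond:
            self._cond.wait_for(lambda: self._remaining <= 0)
```

A manager thread can create one such barrier per analysis stage, start the stage's worker threads, and call `wait()` before queueing the next stage, so that bucket assignment, for instance, only begins once every ranking subtask has signalled.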
Each part of the analysis is divided into subtasks that are performed by DTs. The BETM creates
a limited number of DTs so that, if the user cancels the run, only a limited number of DTs are
awaiting execution. Because the dataset is analysed by multiple threads, the BETM makes sure
that the sections being analysed are not overwritten by other threads by providing each DT
with a mutually exclusive section to work on. The subtasks include
ranking the scores of a specific column, entering the values of a specific column into buckets, and
comparing experiments. For example, for a dataset of size 100 × 100 there will be ~10200
subtasks (100 rankings + 100 bucket definitions + ~10000 comparisons) assigned to threads. Once
the analysis is completed, the BETM saves the output to a standard tab delimited file
located in the path selected by the user.
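The subtask arithmetic in the example above can be written out explicitly. The sketch assumes that one comparison is scheduled per ordered pair of profiles (n × n), which reproduces the ~10000 comparisons quoted for 100 profiles; the function name is hypothetical.

```python
def subtask_count(n_profiles: int) -> int:
    """Approximate number of subtasks for comparing n_profiles profiles.

    One ranking subtask and one bucket-assignment subtask per profile, plus
    one comparison per ordered pair of profiles (ASSUMPTION: n * n pairs,
    matching the 100 x 100 example in the text).
    """
    ranking = n_profiles
    bucketing = n_profiles
    comparisons = n_profiles * n_profiles
    return ranking + bucketing + comparisons
```

For 100 profiles this gives 100 + 100 + 10000 = 10200 subtasks, and the quadratic comparison term dominates as datasets grow.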
7. References
Akey, J.M., Biswas, S., Leek, J.T., and Storey, J.D. (2007). On the design and analysis of gene expression
studies in human populations. Nat Genet 39, 807-808; author reply 808-809.
Alkan, C., Kidd, J.M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J.O., Baker,
C., Malig, M., Mutlu, O., et al. (2009). Personalized copy number and segmental duplication maps using
next-generation sequencing. Nat Genet 41, 1061-1067.
Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. (2006). Microarray data analysis: from disarray to
consolidation and consensus. Nat Rev Genet 7, 55-65.
Alter, O., Brown, P.O., and Botstein, D. (2000). Singular value decomposition for genome-wide
expression data processing and modeling. Proc Natl Acad Sci U S A 97, 10101-10106.
Ambroise, C., and McLachlan, G.J. (2002). Selection bias in gene extraction on the basis of microarray
gene-expression data. Proc Natl Acad Sci U S A 99, 6562-6566.
Ammar, R., Smith, A.M., Heisler, L.E., Giaever, G., and Nislow, C. (2009). A comparative analysis of DNA
barcode microarray feature size. BMC Genomics 10.
Armstrong, R.A., Davies, L.N., Dunne, M.C., and Gilmartin, B. (2011). Statistical guidelines for clinical
studies of human vision. Ophthalmic Physiol Opt 31, 123-136.
Baetz, K., McHardy, L., Gable, K., Tarling, T., Reberioux, D., Bryan, J., Andersen, R.J., Dunn, T., Hieter, P.,
and Roberge, M. (2004). Yeast genome-wide drug-induced haploinsufficiency screen to determine drug
mode of action. Proc Natl Acad Sci U S A 101, 4525-4530.
Baggerly, K.A., Coombes, K.R., and Neeley, E.S. (2008). Run batch effects potentially compromise the
usefulness of genomic signatures for ovarian cancer. J Clin Oncol 26, 1186-1187; author reply 1187-
1188.
Baggerly, K.A., Edmonson, S.R., Morris, J.S., and Coombes, K.R. (2004). High-resolution serum proteomic
patterns for ovarian cancer detection. Endocr Relat Cancer 11, 583-584; author reply 585-587.
Baird, D., Johnstone, P., and Wilson, T. (2004). Normalization of microarray data using a spatial mixed
model analysis which includes splines. Bioinformatics 20, 3196-3205.
Bakay, M., Chen, Y.W., Borup, R., Zhao, P., Nagaraju, K., and Hoffman, E.P. (2002). Sources of variability
and effect of experimental approach on expression profiling data interpretation. BMC Bioinformatics 3,
4.
Baryshnikova, A., Costanzo, M., Kim, Y., Youn, J.-Y., Ding, H., Koh, J., Toufighi, K., Luis, B.-J.S.,
Bandyopadhyay, S., Hibbs, M., et al. (2010). Quantitative analysis of fitness and genetic interactions in
yeast on a genome scale. Nat Methods 7, 1017-1024.
Beltrame, L., Rizzetto, L., Paola, R., Rocca-Serra, P., Gambineri, L., Battaglia, C., and Cavalieri, D. (2009).
Using pathway signatures as means of identifying similarities among microarray experiments. PLoS One
4, e4128.
Benito, M., Parker, J., Du, Q., Wu, J., Xiang, D., Perou, C.M., and Marron, J.S. (2004). Adjustment of
systematic microarray data biases. Bioinformatics 20, 105–114.
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics 5, 433-438.
Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers,
D.J., Barnes, C.L., Bignell, H.R., et al. (2008). Accurate whole human genome sequencing using reversible
terminator chemistry. Nature 456, 53-59.
Berry, D.B., and Gasch, A.P. (2008). Stress-activated genomic expression changes serve a preparative
role for impending stress in yeast. Mol Biol Cell 19, 4580-4587.
Boedigheimer, M.J., Wolfinger, R.D., Bass, M.B., Bushel, P.R., Chou, J.W., Cooper, M., Corton, J.C., Fostel,
J., Hester, S., Lee, J.S., et al. (2008). Sources of variation in baseline gene expression levels from
toxicogenomics study control animals across multiple laboratories. BMC Genomics 9, 285.
Boelens, M.C., te Meerman, G.J., Gibcus, J.H., Blokzijl, T., Boezen, H.M., Timens, W., Postma, D.S., Groen,
H.J., and van den Berg, A. (2007). Microarray amplification bias: loss of 30% differentially expressed
genes due to long probe - poly(A)-tail distances. BMC Genomics 8, 277.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A comparison of normalization methods
for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-193.
Bozinov, D., and Rahnenfuhrer, J. (2002). Unsupervised technique for robust target separation and
analysis of DNA microarray spots through adaptive pixel clustering. Bioinformatics 18, 747-756.
Branham, W.S., Melvin, C.D., Han, T., Desai, V.G., Moland, C.L., Scully, A.T., and Fuscoe, J.C. (2007).
Elimination of laboratory ozone leads to a dramatic improvement in the reproducibility of microarray
gene expression measurements. BMC Biotechnol 7, 8.
Breitling, R., Armengaud, P., Amtmann, A., and Herzyk, P. (2004). Rank products: a simple, yet powerful,
new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett
573, 83-92.
Brown, P.O., and Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nat
Genet 21, 33-37.
Buchdunger, E., Zimmermann, J., Mett, H., Meyer, T., Muller, M., Druker, B.J., and Lydon, N.B. (1996).
Inhibition of the Abl protein-tyrosine kinase in vitro and in vivo by a 2-phenylaminopyrimidine
derivative. Cancer Res 56, 100-104.
Cavalieri, D., Castagnini, C., Toti, S., Maciag, K., Kelder, T., Gambineri, L., Angioli, S., and Dolara, P.
(2007). Eu.Gene Analyzer a tool for integrating gene expression data with pathway databases.
Bioinformatics 23, 2631-2632.
Cheung-Ong, K., Song, K., Ma, Z., Shabtai, D., Heisler, L.M., Bierbach, U., Giaever, G., and Nislow, C. (In
review). Insights into the mechanism of action of nonclassical platinum–acridine anticancer agents from
comprehensive chemogenomic fitness screens.
Cloonan, N., Forrest, A.R., Kolle, G., Gardiner, B.B., Faulkner, G.J., Brown, M.K., Taylor, D.F., Steptoe, A.L.,
Wani, S., Bethel, G., et al. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing.
Nat Methods 5, 613-619.
Dabney, A.R., and Storey, J.D. (2007). Normalization of two-channel microarrays accounting for
experimental design and intensity-dependent relationships. Genome Biol 8, R44.
Daskalakis, A., Kostopoulos, S., Spyridonos, P., Glotsos, D., Ravazoula, P., Kardari, M., Kalatzis, I.,
Cavouras, D., and Nikiforidis, G. (2008). Design of a multi-classifier system for discriminating benign from
malignant thyroid nodules using routinely H&E-stained cytological images. Comput Biol Med 38, 196-
203.
Deutschbauer, A.M., Jaramillo, D.F., Proctor, M., Kumm, J., Hillenmeyer, M.E., Davis, R.W., Nislow, C.,
and Giaever, G. (2005). Mechanisms of Haploinsufficiency Revealed by Genome-Wide Profiling in Yeast.
Genetics 169.
Dobbin, K.K., Kawasaki, E.S., Petersen, D.W., and Simon, R.M. (2005). Characterizing dye bias in
microarray experiments. Bioinformatics 21, 2430-2437.
Druker, B.J., Tamura, S., Buchdunger, E., Ohno, S., Segal, G.M., Fanning, S., Zimmermann, J., and Lydon,
N.B. (1996). Effects of a selective inhibitor of the Abl tyrosine kinase on the growth of Bcr-Abl positive
cells. Nat Med 2, 561-566.
Dudoit, S., Yang, Y.H., Callow, M.J., and Speed, T.P. (2002). Statistical methods for identifying
differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111-139.
Ein-Dor, L., Zuk, O., and Domany, E. (2006). Thousands of samples are needed to generate a robust gene
list for predicting outcome in cancer. Proc Natl Acad Sci U S A 103, 5923-5928.
Eklund, A.C., and Szallasi, Z. (2008). Correction of technical bias in clinical microarray data improves
concordance with known biological information. Genome Biol 9, R26.
Fare, T.L., Coffey, E.M., Dai, H., He, Y.D., Kessler, D.A., Kilian, K.A., Koch, J.E., LeProust, E., Marton, M.J.,
Meyer, M.R., et al. (2003). Effects of Atmospheric Ozone on Microarray Data Quality. Analytical
Chemistry 75, 4672-4675.
Fodor, S.P., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T., and Solas, D. (1991). Light-directed, spatially
addressable parallel chemical synthesis. Science 251, 767-773.
Frantz, S. (2005). An array of problems. Nat Rev Drug Discov 4, 362-363.
Furness, P.N., Taub, N., Assmann, K.J., Banfi, G., Cosyns, J.P., Dorman, A.M., Hill, C.M., Kapper, S.K.,
Waldherr, R., Laurinavicius, A., et al. (2003). International variation in histologic grading is large, and
persistent feedback does not improve reproducibility. Am J Surg Pathol 27, 805-810.
Gasch, A.P., and Eisen, M.B. (2002). Exploring the conditional coregulation of yeast gene expression
through fuzzy k-means clustering. Genome Biol 3, RESEARCH0059.
Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., and Brown,
P.O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol
Biol Cell 11, 4241-4257.
Giaever, G., Chu, A.M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A., Anderson,
K., Andre, B., et al. (2002). Functional profiling of the Saccharomyces cerevisiae genome. Nature 418,
387-391.
Giaever, G., Flaherty, P., Kumm, J., Proctor, M., Nislow, C., Jaramillo, D.F., Chu, A.M., Jordan, M.I., Arkin,
A.P., and Davis, R.W. (2004). Chemogenomic profiling: Identifying the functional interactions of small
molecules in yeast. Proc Natl Acad Sci U S A 101, 793-798.
Giaever, G., Shoemaker, D.D., Jones, T.W., Liang, H., Winzeler, E.A., Astromoff, A., and Davis, R.W.
(1999). Genomic profiling of drug sensitivities via induced haploinsufficiency. Nature Genetics 21, 278-
283.
Han, E.S., Wu, Y., McCarter, R., Nelson, J.F., Richardson, A., and Hilsenbeck, S.G. (2004). Reproducibility,
sources of variability, pooling, and sample size: important considerations for the design of high-density
oligonucleotide array experiments. J Gerontol A Biol Sci Med Sci 59, 306-315.
Hillenmeyer, M.E., Ericson, E., Davis, R.W., Nislow, C., Koller, D., and Giaever, G. (2010). Systematic
analysis of genome-wide fitness data in yeast reveals novel gene function and drug action. Genome
Biology 11.
Hillenmeyer, M.E., Fung, E., Wildenhain, J., Pierce, S.E., Hoon, S., Lee, W., Proctor, M., St Onge, R.P.,
Tyers, M., Koller, D., et al. (2008). The chemical genomic portrait of yeast: uncovering a phenotype for
all genes. Science 320, 362-365.
Hillier, L.W., Marth, G.T., Quinlan, A.R., Dooling, D., Fewell, G., Barnett, D., Fox, P., Glasscock, J.I.,
Hickenbotham, M., Huang, W., et al. (2008). Whole-genome sequencing and variant discovery in C.
elegans. Nat Methods 5, 183-188.
Ho, C.H., Magtanong, L., Barker, S.L., Gresham, D., Nishimura, S., Natarajan, P., Koh, J.L., Porter, J., Gray,
C.A., Andersen, R.J., et al. (2009). A molecular barcoded yeast ORF library enables mode-of-action
analysis of bioactive compounds. Nat Biotechnol 27, 369-377.
Hoon, S., Smith, A.M., Wallace, I.M., Suresh, S., Miranda, M., Fung, E., Proctor, M., Shokat, K.M., Zhang,
C., Davis, R.W., et al. (2008). An integrated platform of genomic assays reveals small-molecule
bioactivities. Nat Chem Biol 4, 498-506.
Huang, J., Qi, R., Quackenbush, J., Dauway, E., Lazaridis, E., and Yeatman, T. (2001). Effects of ischemia
on gene expression. J Surg Res 99, 222-227.
Hughes, T.R., Mao, M., Jones, A.R., Burchard, J., Marton, M.J., Shannon, K.W., Lefkowitz, S.M., Ziman,
M., Schelter, J.M., Meyer, M.R., et al. (2001). Expression profiling using microarrays fabricated by an ink-
jet oligonucleotide synthesizer. Nat Biotechnol 19, 342-347.
Illumina (2011).
Ioannidis, J.P. (2005). Microarrays and molecular research: noise discovery? Lancet 365, 454-455.
Irizarry, R.A., Warren, D., Spencer, F., Kim, I.F., Biswal, S., Frank, B.C., Gabrielson, E., Garcia, J.G.,
Geoghegan, J., Germino, G., et al. (2005). Multiple-laboratory comparison of microarray platforms. Nat
Methods 2, 345-350.
Johnson, W.E., and Li, C. (2007). Adjusting batch effects in microarray expression data using empirical
Bayes methods. Biostatistics 8, 118-127.
Ju, J., Kim, D.H., Bi, L., Meng, Q., Bai, X., Li, Z., Li, X., Marma, M.S., Shi, S., Wu, J., et al. (2006). Four-color
DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proc Natl
Acad Sci U S A 103, 19635-19640.
Kendall, M.G. (1938). A new measure of rank correlation. Biometrika 30, 81-93.
Lamb, J., Crawford, E.D., Peck, D., Modell, J.W., Blat, I.C., Wrobel, M.J., Lerner, J., Brunet, J.P.,
Subramanian, A., Ross, K.N., et al. (2006). The Connectivity Map: using gene-expression signatures to
connect small molecules, genes, and disease. Science 313, 1929-1935.
Lander, E.S. (1999). Array of hope. Nature Genetics 21, 3-4.
Larkin, J.E., Frank, B.C., Gavras, H., Sultana, R., and Quackenbush, J. (2005). Independence and
reproducibility across microarray platforms. Nat Methods 2, 337-344.
Lausted, C., Dahl, T., Warren, C., King, K., Smith, K., Johnson, M., Saleem, R., Aitchison, J., Hood, L., and
Lasky, S.R. (2004). POSaM: a fast, flexible, open-source, inkjet oligonucleotide synthesizer and
microarrayer. Genome Biol 5, R58.
Lee, K.M., Kim, J.H., and Kang, D. (2005). Design issues in toxicogenomics using DNA microarray
experiment. Toxicol Appl Pharmacol 207, 200-208.
Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K.,
and Irizarry, R.A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput
data. Nat Rev Genet 11, 733-739.
Leek, J.T., and Storey, J.D. (2007). Capturing heterogeneity in gene expression studies by surrogate
variable analysis. PLoS Genet 3, 1724-1735.
Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLellan, M.D., Chen, K., Dooling, D., Dunford-Shore, B.H.,
McGrath, S., Hickenbotham, M., et al. (2008). DNA sequencing of a cytogenetically normal acute myeloid
leukaemia genome. Nature 456, 66-72.
Li, Z., Vizeacoumar, F.J., Bahr, S., Li, J., Warringer, J., Vizeacoumar, F.S., Min, R., Vandersluis, B., Bellay, J.,
Devit, M., et al. (2011). Systematic exploration of essential yeast gene function with temperature-
sensitive mutants. Nat Biotechnol 29, 361-367.
Lieb, J.D., Liu, X., Botstein, D., and Brown, P.O. (2001). Promoter-specific binding of Rap1 revealed by
genome-wide maps of protein-DNA association. Nat Genet 28, 327-334.
Lin, D.W., Coleman, I.M., Hawley, S., Huang, C.Y., Dumpit, R., Gifford, D., Kezele, P., Hung, H., Knudsen,
B.S., Kristal, A.R., et al. (2006). Influence of surgical manipulation on prostate gene expression:
implications for molecular correlates of treatment effects and disease prognosis. J Clin Oncol 24, 3763-
3770.
Lister, R., O'Malley, R.C., Tonti-Filippini, J., Gregory, B.D., Berry, C.C., Millar, A.H., and Ecker, J.R. (2008).
Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523-536.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C.,
Kobayashi, M., Horton, H., et al. (1996). Expression monitoring by hybridization to high-density
oligonucleotide arrays. Nat Biotechnol 14, 1675-1680.
Lusa, L., McShane, L.M., Reid, J.F., De Cecco, L., Ambrogi, F., Biganzoli, E., Gariboldi, M., and Pierotti,
M.A. (2007). Challenges in projecting clustering results across gene expression-profiling datasets. J Natl
Cancer Inst 99, 1715-1723.
Lynch, J.L., deSilva, C.J., Peeva, V.K., and Swanson, N.R. (2006). Comparison of commercial probe
labeling kits for microarray: towards quality assurance and consistency of reactions. Anal Biochem 355,
224-231.
Ma, C., Lyons-Weiler, M., Liang, W., LaFramboise, W., Gilbertson, J.R., Becich, M.J., and Monzon, F.A.
(2006). In vitro transcription amplification and labeling methods contribute to the variability of gene
expression profiling with DNA microarrays. J Mol Diagn 8, 183-192.
Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387-
402.
Marshall, E. (2004). Getting the noise out of gene arrays. Science 306, 630-631.
Marton, M.J., DeRisi, J.L., Bennett, H.A., Iyer, V.R., Meyer, M.R., Roberts, C.J., Stoughton, R., Burchard, J.,
Slade, D., Dai, H., et al. (1998). Drug target validation and identification of secondary drug target effects
using DNA microarrays. Nat Med 4, 1293-1301.
Mecham, B.H., Nelson, P.S., and Storey, J.D. (2010). Supervised normalization of microarrays.
Bioinformatics 26, 1308-1315.
Metzker, M.L. (2010). Sequencing technologies - the next generation. Nat Rev Genet 11, 31-46.
Microsoft (2011). Microsoft Developer Network.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying
mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628.
Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., and Snyder, M. (2008). The
transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344-1349.
Novak, J.P., Sladek, R., and Hudson, T.J. (2002). Characterization of variability in large-scale gene
expression data: implications for study design. Genomics 79, 104-113.
Nugent, R., and Meila, M. (2010). An overview of clustering applied to molecular biology. Methods Mol
Biol 620, 369-404.
Ovaska, K., Laakso, M., and Hautaniemi, S. (2008). Fast gene ontology based clustering for microarray
experiments. BioData Min 1, 11.
Pearson, K. (1909). Determination of the Coefficient of Correlation. Science 30, 23-25.
Pena-Castillo, L., and Hughes, T.R. (2007). Why are there still over 1000 uncharacterized yeast genes?
Genetics 176, 7-14.
Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone,
C., Fishman, D.A., Kohn, E.C., et al. (2002). Use of proteomic patterns in serum to identify ovarian
cancer. Lancet 359, 572-577.
Qualls, M., Pallin, D.J., and Schuur, J.D. (2010). Parametric versus nonparametric statistical tests: the
length of stay example. Acad Emerg Med 17, 1113-1121.
Ransohoff, D.F. (2005a). Bias as a threat to the validity of cancer molecular-marker research. Nat Rev
Cancer 5, 142-149.
Ransohoff, D.F. (2005b). Lessons from controversy: ovarian cancer screening and serum proteomics. J
Natl Cancer Inst 97, 315-319.
Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson,
A.R., Chen, W., et al. (2006). Global variation in copy number in the human genome. Nature 444, 444-
454.
Reina-Pinto, J.J., Voisin, D., Teodor, R., and Yephremov, A. (2010). Probing differentially expressed genes
against a microarray database for in silico suppressor/enhancer and inhibitor/activator screens. Plant J
61, 166-175.
Rothman, K.J., Greenland, S., and Walker, A.M. (1980). Concepts of interaction. Am J Epidemiol 112,
467-470.
Sasaki, E., Takahashi, C., Asami, T., and Shimada, Y. (2011). AtCAST, a tool for exploring gene expression
similarities among DNA microarray experiments using networks. Plant Cell Physiol 52, 169-180.
Satterfield, M., Lippa, K., and Lu, Z. (2008). Microarray scanner performance over a five-week period as
measured with Cy5 and Cy3 serial dilution slides. Journal of Research of the National Institute of
Standards and Technology 113, 154-174.
Scharpf, R.B., Ruczinski, I., Carvalho, B., Doan, B., Chakravarti, A., and Irizarry, R.A. (2011). A multilevel
model to address batch effects in copy number estimation using SNP arrays. Biostatistics 12, 33-50.
Schaupp, C.J., Jiang, G., Myers, T.G., and Wilson, M.A. (2005). Active mixing during hybridization
improves the accuracy and reproducibility of microarray results. Biotechniques 38, 117-119.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression
patterns with a complementary DNA microarray. Science 270, 467-470.
Scherer, A. (2009). Batch effects and noise in microarray experiments: sources and solutions
(Chichester, U.K., J. Wiley).
Shi, L., Reid, L.H., Jones, W.D., Shippy, R., Warrington, J.A., Baker, S.C., Collins, P.J., de Longueville, F.,
Kawasaki, E.S., Lee, K.Y., et al. (2006). The MicroArray Quality Control (MAQC) project shows inter- and
intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24, 1151-1161.
Shi, L., Tong, W., Su, Z., Han, T., Han, J., Puri, R.K., Fang, H., Frueh, F.W., Goodsaid, F.M., Guo, L., et al.
(2005). Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics 6
Suppl 2, S11.
Shoemaker, D.D., Schadt, E.E., Armour, C.D., He, Y.D., Garrett-Engele, P., McDonagh, P.D., Loerch, P.M.,
Leonardson, A., Lum, P.Y., Cavet, G., et al. (2001). Experimental annotation of the human genome using
microarray technology. Nature 409, 922-927.
Singh-Gasson, S., Green, R.D., Yue, Y., Nelson, C., Blattner, F., Sussman, M.R., and Cerrina, F. (1999).
Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat
Biotechnol 17, 974-978.
Smith, A.M., Heisler, L.E., St Onge, R.P., Farias-Hesson, E., Wallace, I.M., Bodeau, J., Harris, A.N., Perry,
K.M., Giaever, G., Pourmand, N., et al. (2010). Highly-multiplexed barcode sequencing: an efficient
method for parallel analysis of pooled samples. Nucleic Acids Research.
Smith, A.M., Heisler, L.E., Mellor, J., Kaper, F., Thompson, M.J., Chee, M., Roth, F.P., Giaever, G., and Nislow, C.
(2009). Quantitative phenotyping via deep barcode sequencing. Genome Research.
Spearman, C. (1904). The proof and measurement of association between two things. Am J Psychol 15,
72-101.
Spielman, R.S., Bastone, L.A., Burdick, J.T., Morley, M., Ewens, W.J., and Cheung, V.G. (2007). Common
genetic variants account for differences in gene expression among ethnic groups. Nat Genet 39, 226-
231.
Strauss, E. (2006). Arrays of hope. Cell 127, 657-659.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing (Vienna, Austria, R
Foundation for Statistical Computing).
Thompson, K.L., Pine, P.S., Rosenzweig, B.A., Turpaz, Y., and Retief, J. (2007). Characterization of the
effect of sample quality on high density oligonucleotide microarray data using progressively degraded
rat liver RNA. BMC Biotechnol 7, 57.
Tseng, G.C., Oh, M.K., Rohlin, L., Liao, J.C., and Wong, W.H. (2001). Issues in cDNA microarray analysis:
quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic
Acids Res 29, 2549-2557.
Turcatti, G., Romieu, A., Fedurco, M., and Tairi, A.P. (2008). A new class of cleavable fluorescent
nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis.
Nucleic Acids Res 36, e25.
Wang, D.G., Fan, J.B., Siao, C.J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester,
E., Spencer, J., et al. (1998). Large-scale identification, mapping, and genotyping of single-nucleotide
polymorphisms in the human genome. Science 280, 1077-1082.
Whitney, A.R., Diehn, M., Popper, S.J., Alizadeh, A.A., Boldrick, J.C., Relman, D.A., and Brown, P.O.
(2003). Individuality and variation in gene expression patterns in human blood. Proc Natl Acad Sci U S A
100, 1896-1901.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bull 1, 80-83.
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito,
R., Boeke, J.D., Bussey, H., et al. (1999). Functional characterization of the S. cerevisiae genome by gene
deletion and parallel analysis. Science 285, 901-906.
Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules,
R.S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. J
Comput Biol 8, 625-637.
Wu, Z., Irizarry, R.A., Gentleman, R., Murillo, F.M., and Spencer, F. (2004). A Model Based Background
Adjustment for Oligonucleotide Expression Arrays. Johns Hopkins University, Dept of Biostatistics
Working Papers 99, 909-917.
Wu, Z., and Irizarry, R. (2007). A statistical framework for the analysis of microarray probe-level data.
Johns Hopkins University, Dept of Biostatistics Working Papers 1.
Wuster, A., and Babu, M.M. (2008). Chemogenomics and biotechnology. Trends in Biotechnology 26,
252-258.
Yan, Z., Costanzo, M., Heisler, L.E., Paw, J., Kaper, F., Andrews, B.J., Boone, C., Giaever, G., and Nislow, C.
(2008). Yeast Barcoders: a chemogenomic application of a universal donor-strain collection carrying bar-
code identifiers. Nat Methods 5, 719-725.
Ying, L., and Sarwal, M. (2009). In praise of arrays. Pediatr Nephrol 24, 1643-1659; quiz 1655, 1659.
Youden, W.J. (1972). Enduring values. Technometrics 14.
Zakharkin, S.O., Kim, K., Mehta, T., Chen, L., Barnes, S., Scheirer, K.E., Parrish, R.S., Allison, D.B., and
Page, G.P. (2005). Sources of variation in Affymetrix microarray experiments. BMC Bioinformatics 6, 214.