combining drug and gene similarity measures for drug … · combining drug and gene similarity...

60
Tel-Aviv University The Raymond and Beverly Sackler Faculty of Exact Sciences Blavatnik School of Computer Science Combining drug and gene similarity measures for drug-target elucidation This thesis is submitted in partial fulfillment of the requirements towards the M.Sc. degree Tel-Aviv University School of Computer Science by Liat Perlman The research work in this thesis has been carried out under the supervision of Prof. Roded Sharan October, 2010

Upload: doduong

Post on 09-Apr-2018

230 views

Category:

Documents


3 download

TRANSCRIPT

Tel-Aviv University

The Raymond and Beverly Sackler Faculty of Exact Sciences

Blavatnik School of Computer Science

Combining drug and gene similarity

measures for drug-target elucidation

This thesis is submitted in partial fulfillment

of the requirements towards the M.Sc. degree

Tel-Aviv University

School of Computer Science

by

Liat Perlman

The research work in this thesis has been carried out

under the supervision of Prof. Roded Sharan

October, 2010

2

Acknowledgments

I would like to express my gratitude to my thesis supervisor, Prof. Roded Sharan, for the

helpful guidance throughout the course of this work and for his fruitful and creative ideas

which led to its establishment. I would also like to thank Assaf Gottlieb for

accompanying me in every step of this research, for the constant help and the

constructive advices. I am grateful to Nir Atias for the incorporation of the side effect

algorithm in our method, for the innovative ideas and for the assistance in the creation of

this work. Finally, I would like to thank my wonderful husband for his endless support

and his constant belief in me. He has been my source of strength in every step of the way.

This study was supported by a fellowship from the Edmond J. Safra Bioinformatics

Program at Tel Aviv University.

3

Abstract Understanding drugs and their modes of action is a fundamental challenge in systems

medicine. Key to addressing this challenge is the elucidation of drug targets, an important

step in the search for new drugs or novel targets for existing drugs. Incorporating

multiple biological information sources is of essence for improving the accuracy of drug

target prediction. In this thesis, we introduce a novel framework -- Similarity-based

Inference of drug-TARgets (SITAR) -- for incorporating multiple drug-drug and gene-

gene similarity measures for drug target prediction. The framework consists of a new

scoring scheme for drug-gene associations based on a given pair of drug-drug and gene-

gene similarity measures, combined with a logistic regression component that integrates

the scores of multiple measures to yield the final association score. We apply our

framework to predict targets for hundreds of drugs using commonly used and novel drug-

drug and gene-gene similarity measures and compare our results to existing state of the

art methods, markedly outperforming them. We then employ our framework to make

novel target predictions for hundreds of drugs; we validate these predictions via curated

databases that were not used in the learning stage. Our framework provides an extensible

platform for incorporating additional emerging similarity measures among drugs and

genes.

Part of the results in this thesis have been accepted for publication in the Journal of

Computational Biology [54].

4

Contents

1 Introduction ................................................................................................ 5

2 Background ................................................................................................ 8

2.1 Introduction to drugs and targets ............................................................................... 8

2.2 Motivation for prediction of drug target associations ............................................. 12

2.3 Drug related information sources ............................................................................ 12

2.3.1 Database of cellular response data .................................................................... 13

2.3.2 Databases of drugs and their targets ................................................................. 15

3 Previous work and Our approach .......................................................... 17

3.1 Previous Methods .................................................................................................... 17

3.1.1 Ligand-based similarity .................................................................................... 17

3.1.2 Side effect-based similarity .............................................................................. 19

3.1.3 Integrating chemical and sequence similarities for drug target inference ........ 22

3.2 Summary and Our approach .................................................................................... 27

4 SITAR: An algorithm for predicting drug targets ............................... 29

4.1 Similarity measure construction .............................................................................. 31

4.2 Combining measures ............................................................................................... 34

4.3 Measure construction considerations ...................................................................... 36

4.4 Prediction assessment and parameter optimization ................................................. 37

4.5 Computing novel predictions .................................................................................. 39

5 Experimental Results ............................................................................... 40

5.1 Feature selection and performance evaluation ........................................................ 40

5.2 Comparison to other drug-target prediction methods ............................................. 43

5.3 Novel predictions .................................................................................................... 44

6 Conclusions ............................................................................................... 50

5

Chapter 1

Introduction

Deciphering drug targets is a primary task in the development of new drugs, in finding

new ways to utilize existing drugs and in pinpointing their side effects. Experimental

identification of drug-target associations remains a laborious and costly task [26], calling

for faster computational prediction methods. Such methods can be used to augment the

limited available information on drug targets, which is in sharp contrast to the vast

number of compounds existing in chemical databases.

Early attempts in computational prediction included docking simulations [14] and

text mining [79]. The former, however, can be applied only to targets with known 3D

structure. The latter searches for co-occurrences of drugs and genes in texts and is limited

to current knowledge and prone to detection problems due to multiple gene and

compound names. Additional attempts were based on reverse engineering of gene

regulatory networks, inferring possible targets from cellular responses to administered

drugs [6, 23, 47]. These methods suffer from the complex and noisy nature of molecular

networks. Recently, several algorithms have been proposed to predict drug-target

associations by combining drug-drug and gene-gene similarity measures [39, 7, 76, 11].

The key assumption underlying these algorithms is that similar drugs tend to share

similar targets [50]. This has been observed with respect to chemical similarity [61, 48],

side effect similarity [11] and more.

6

Several authors had previously predicted drug-target interactions by combining

chemical drug-drug similarity and sequence-based gene-gene similarity [7, 76]. Keiser et

al. [39] compared the chemical structure of drugs to a compendium of ligands, known to

modulate the function of protein receptors, providing indirect connections between drugs

and targets via these ligands. Several approaches, concentrating mainly on indirect drug-

gene associations, employed additional similarity measures to gain insights on drugs.

Specifically, protein-protein interaction (PPI) network similarity was used in [28] to

predict drug-gene genetic interactions (termed Pharmacogenes), and gene expression data

was combined with drug-response data in [42] to infer co-modules of genes and drugs.

Last, a recent approach used information on compound-induced fitness defects of yeast

deletion strains to predict drug-targets in S. cerevisiae [29].

Different similarity measures utilize different attributes of drugs or genes, hence the

use of just one or few of them may miss information that is relevant to predicting new

associations. In particular, the 2D chemical space does not always comply with 3D

structural similarity [40]. Moreover, sequence similar targets tend to bind the same

ligands in the case of G-protein coupled receptor (GPCR) targets but less so for protein

kinases [40]. The use of pharmacological information is limited to marketed drugs whose

indication/side-effect information is available [11].

To overcome these limitations we have designed a new prediction scheme --

Similarity-based Inference of drug-TARgets (SITAR) -- that integrates multiple measures

to facilitate the prediction task. Our contribution is two-fold: (i) We introduce novel drug-

drug similarity measures and combine them into the prediction process; and (ii) we

propose a way of integrating the drug-drug and gene-gene similarities to create

7

classification features. The result is a new drug-target prediction algorithm, which

markedly outperforms previous methods and can cope with new drugs with no known

targets.

This thesis is organized as follows: Chapter 2 is divided into two parts: The first

part provides biological background on drugs, targets and their required interplay for the

establishment of drugs as therapeutic substances. We then highlight the pivotal role of

computational methods for drug-target prediction in facilitating the formidable procedure

of drug discovery. In the second part we describe several contemporary drug related

information sources.

In Chapter 3 we present the state of the art methods for drug-target prediction and

discuss their caveats. Then, we briefly review our framework and emphasize its novelty

compared to other methods.

In Chapter 4 we introduce our method in detail; starting from the construction of

both known and novel similarity measures, continuing to the scheme by which we

integrated all the similarity measures into a single association score and ending with a

short description of how we exploited our framework to predict novel associations

between drugs and targets.

In Chapter 5 we evaluate the performance of our method and compare it to the

performance of previous state of the art methods. Then, we discuss novel associations

predicted by our method and their validation.

Finally, in Chapter 6 we summarize this work and discuss future research.

8

Chapter 2

Background

In this chapter we review some of the basic concepts characterizing drugs and targets.

Then, we discuss the importance of computational schemes for drug-target prediction in

advancing drug discovery and development. Finally, we introduce several up to date drug

related information sources.

2.1 Introduction to drugs and targets

There is no dispute as to the pivotal role of drugs and their development in human

healthcare and medicine. The relation between drugs and diseases can be pictured as

follows: If we consider a disease state as a situation in which the biological system is

shifted from its equilibrium, namely its healthy state, the role of the drug is to ideally

restore the system to its equilibrium [34]. The effectiveness of a drug relies on a wide

spectrum of parameters, among which the type of the disease and its progression, the

general health of the patient, his past medical profile and more.

Formally, a drug is any substance that, when absorbed into the body of a living

organism, alters normal bodily function. An example of a well-known drug is penicillin

which was fortuitously discovered in 1929 by Alexander Fleming. Penicillin is a

metabolite produced from penicillium mold that can lyse staphylococci. Due to its

9

efficacy and lack of toxicity penicillin became one of the most influential antibiotics back

in the 19’th century [17].

A drug mostly confers its effect by binding to its target. A drug target is a biological

molecule or molecules that are inhibited, activated or modulated by a drug molecule.

They can be any form or combination of proteins, DNA or RNA. Generally, the linkage

of a drug to its target brings about a cascade of reactions that ultimately results in the

drug response.

The druggability of a given target is defined either by how well the drug or any other

therapeutic substance can access the target, or by the efficacy the drug can actually

achieve once it interacts with the target. It is influenced by many parameters such as the

cellular location of the target, its related side effects and toxicity [18]. Conceivably, the

majority of the target proteins of FDA approved drugs are localized in the membrane

[77], as illustrated in Figure 2.1. Membrane proteins are much easier to target since the

drug need not pass through the cellular membrane, which is a highly complex task, to

confer its effect.

Inspection of the functional distribution of target proteins indicates that there are

only a few prominent druggable protein families (see Figure 2.2): Enzymes that

Figure 2.1: Cellular localization of target proteins of FDA approved drugs. 62% of

the target proteins are localized in the membrane. Figure taken from [77].

10

constitute slightly under half of current targets, GPCR (G-protein coupled receptors), ion

channels and nuclear receptors. The rest of the target proteins belong to a variety of

protein families that altogether comprise a very small portion of current targets.

Recent publications conducted a large-scale assessment of the druggability of the

human genome; their estimate is that only 10% of the genes in the human genome are

druggable, and only 5% are both druggable and relevant to disease [14].

It is noteworthy that the distribution of the number of targets per drug is far from

uniform. The majority of the drugs target only a few proteins but some have many targets

as depicted in Figure 2.3. For example, the drugs propiomazine and promazine are

associated with the highest number of targets, having 14 targets each [77].

Historically, the preferable drug candidates were drugs that specifically bind very

few targets. Following the success of anticancer drugs and nonselective drugs for mood

disorders the pharmaceutical industry has shifted toward multi-target drug development,

or polypharmacology [77, 39, 17]. These drugs bind promiscuously to many targets, and

consequently have a more disperse impact than the one caused by a drug targeting a

Figure 2.2: Functional distribution of target proteins of marketed drugs. Figure taken from [30].

11

single target. Hence, polypharmacology has a huge potential in treating complex diseases,

having many casual genes.

Not all the effects of a drug are desirable. In practice, most approved drugs are

known to have a series of unwanted effects, termed the side effects of a drug. Side effects

are complex phenomenological observations that can be caused by several molecular

scenarios. The most common scenario is the binding of a drug with its direct targets or

additional targets named the drug’s off targets [11]. The drug binds its off targets with

some degree of affinity; however unlike its direct targets the drug was not originally

designated to interact with them. Additional side effect causing scenarios are downstream

pathway perturbations, kinetics and dosage effects, drug-drug interference, effects of

active metabolites, and aggregation or irreversible target binding of the drug [11]. Side

effects may vary for each individual depending on the person's disease state, age, weight,

gender, ethnicity and general health.

Figure 2.3: Distribution of FDA approved drugs with respect to the number of their

targets. Figure taken from [77].

12

2.2 Motivation for prediction of drug target associations

Drug discovery and development is a laborious multi-step process that requires enormous

money investment, great endeavors and extensive analysis of massive data.

A critical point in the development of a novel drug is the elucidation of its molecular

targets, as this is crucial for understanding the cellular processes taking place in the

response to the drug. Failure in correct identification of targets is a significant factor in

the low drug approval rate stemming from safety issues [45]. Specifically, the association

of a drug with additional targets, beyond its direct ones may give rise to hazardous side

effects, precluding further drug development and usage. Additionally, drug target

identification can broaden the therapeutic areas of extant drugs.

Since the experimental determination of drug-protein interactions remains a

formidable task [26] there is a strong incentive to develop computational high throughput

methods for drug-target prediction. These methods identify putative drug-target

associations by exploiting extant information related to drugs and targets. The large

potential of such methods lies in the disproportion between the sizes of the compound

space and target space and the number of presently known interactions between

compounds and target proteins. This observation suggests that there is a large array of

associations that are yet uncovered [76].

2.3 Drug related information sources

In recent years there has been a tremendous increase in our knowledge on drugs’

properties and mechanisms of action. It is attributed to the emergence of a plethora of

13

information sources, for example a compilation of high-throughput gene-expression

profiling of drug perturbation and comprehensive public resources that cover a

substantial portion of drugs’ properties. In the following we summarize several of these

information sources that are pertinent to this thesis.

2.3.1 Database of cellular response data

The connectivity map project [43] is an innovative framework that aims to relate drugs,

genes and diseases via comparison of their gene-expression profiles or signatures, to a

large database of signatures and using pattern-matching tools to reveal significant

similarities among these signatures. The signature database was produced from a series of

microarray experiments in which distinct cell-lines were treated with various drug

dosages across several time points. In each experiment the genes were ranked according

to their differential expression relative to a control cell line which was not treated with

the drug, giving rise to a ranked list of ~22,000 genes. Over 6000 such microarray

experiments were performed on 5 distinct cell lines treated with 1309 compounds.

In order to identify drugs with similar gene-expression profiles a nonparametric,

ranked-based approach termed GSEA (Gene Set Enrichment Analysis) was adopted. By

this approach, a query signature is compared against each ranked gene list in the

connectivity map data set to determine whether the query genes exhibit a similar or an

opposite trend to the assayed ranked list. Each comparison of the query signature to a

reference ranked gene list yields a connectivity score ranging from +1 to -1. Scores

approaching +1 reflect positive connectivity, meaning that the up-regulated genes of the

query tend to appear near the top of the ranked genes list and the down-regulated genes of

14

the query near the bottom. Scores approaching -1 express negative connectivity and

exhibit the opposite trend, i.e., the query signature genes are anti-correlated with the

ranked gene list. All instances in the database were then ordered according to their

connectivity scores, from the highest connectivity scores to the smallest ones. Figure 2.4

illustrates the above technique.

The authors showed that the proposed method can elucidate new therapeutic roles for

established drugs, generate hypotheses on the mechanism of action of novel drugs and

offer new therapeutic substances for treatment of complex diseases.

Figure 2.4: The Connectivity Map Concept. Gene-expression profiles

derived from treatment of cells with a large number of drugs constitute a

reference database. Gene-expression signatures represent any induced or

organic cell state of interest (left). Pattern-matching algorithms score each

reference profile for the direction and strength of enrichment with the query

signature (center). Drugs are ranked by their “connectivity score”; those at

the top ("positive") and bottom ("negative") are functionally connected with

the query state (right) through the common gene-expression changes.

Figure taken from [43].

15

2.3.2 Databases of drugs and their targets

As mentioned in previous sections, identification of drug target associations is paramount

for drug development. In order to promote the process, multiple drug-target databases

have evolved during recent years.

DrugBank [74, 73] is a comprehensive drug-centered resource that contains detailed

information on drugs and their action. Each drug entry is composed of over 100 data

fields with half of them devoted to drug/chemical data and the other half devoted to

pharmacological, pharmacogenomic and molecular biology data. In addition to the drug

related data, DrugBank maintains information on drug targets (i.e. sequence, structure,

and pathway) and their associations with drugs. DrugBank spans ~4900 drugs comprised

of ~1350 FDA-approved drugs, ~3100 experimental drugs, 188 illicit drugs and small

amounts of biotec, nutraceutical and withdrawn drugs. The database also retains ~3000

drug targets, more than half of them targeted by approved drugs.

KEGG DRUG [35, 37, 36] is an extensive drug information resource that contains

chemical structures and/or chemical components of all drugs in Japan, including crude

drugs and TCM (Traditional Chinese Medicine) formulas, and drugs in the USA and

Europe. It also maintains information on drug target interactions. It accommodates data

of about 9000 drugs (as of September 2009).

DCDB (Drug Combination database) [46] holds information on approved and

investigational drug combinations, also known as multicomponent drugs. The

combination of several drugs to produce an extra powerful and more effective therapeutic

substance has proven useful to treat complicated diseases, e.g., cancer [55] and AIDS

[27]. Apart from drug combinations data it also contains information on individual drugs

16

and their known molecular targets. The drug target information was drawn from the

literature and from several pertinent databases such as DrugBank [73], Uniprot [33],

PubChem [72] and Drugs.com. Overall it encompasses 499 drug combinations,

involving 485 individual drugs from more than 6000 references.

Table 2.1 summarizes the data content of the different databases.

DrugBank KEGG DRUG DCDB

# total drugs

4897 ~9000 485

# total targets

3037 ~2500 497

# drug target associations ~6000

2619 967

# data fields per drug

107 ~10 18

Contains target related

information*

+ - +

Main content Chemical,

pharmacological

and molecular

biology data

Chemical

structures and

chemical

components of

drugs

Drug combinations

*e.g. sequence, structure and GO annotation

Table 2.1: Comparison of the data content among the databases DrugBank, KEGG DRUG

and DCDB.

17

Chapter 3

Previous work and Our approach

In this chapter we review in detail the state of the art methods for drug-target prediction

and discuss their drawbacks. Eventually, we briefly present our approach to tackle the

task and highlight its contribution with respect to the other methods.

3.1 Previous Methods

As mentioned earlier, a large array of computational methods for drug target associations

prediction have evolved during the last decade. In the following we elaborate on four

notable ones: The ligand based similarity method of Keiser et al. [39], the side effect

based similarity method of Campillos et al. [11] and two methods that incorporate

chemical and sequence similarities [76, 7] for the prediction task.

3.1.1 Ligand-based similarity

The algorithm of [39] quantitatively associates drugs with known drug targets based on

the chemical 2D similarity of each drug to a compendium of ligands annotated into sets

for hundreds of drug targets. The motivating hypothesis of this approach is that two

structurally similar molecules are likely to bind the same targets [50]. Analogously, a

drug which is structurally similar to the binding molecules of a certain target, namely its

ligands, is likely to bind the target by itself.

18

Formally, for each drug target pair the 2D structural similarity of to 's

ligand set { } was quantified as an E-value using a statistics-based

chemoinformatics approach called SEA [38], which is described in the sequel. The E-

value serves as a reliability score to the association between ; the smaller the E-

value is the more likely the association takes place.

The structural similarity between a drug and 's ligand set { } is

computed as follows: Tanimoto score [70] is measured across all pairwise drug ligand

combinations . Define { } as a subset of indexes for

which ( ) . The Tanimoto score threshold

was set to 0.57 in this study. The similarity raw score of the pair is evaluated

by the formula:

A statistical model for the random chemical similarity of the raw scores which

originated from BLAST paradigm was developed empirically (for further details on the

development of the model and the selection of the Tanimoto score threshold see the

Methods section of [38]). Z-score for the similarity raw score is computed based on

the model parameters. Finally, the probability to obtain the similarity raw score by

random chance alone, given the Z-score is transformed to an expectation value [38]. The

combination of the structural comparisons with the statistical model is referred to as SEA

(Similarity Ensemble Approach).

In [39] 3,665 drugs drawn from MDL Comprehensive Medicinal Chemistry database

(CMC) were compared against 65,241 ligands organized into 246 targets extracted from

19

MDL Drug Data Report (MDDR) [2] database. Each target was represented solely by its

set of known ligands. The molecules were represented by two topological descriptors:

2048-bit Daylight [3] and the 1024-bit folded ECFP_4 [1] fingerprints. SEA was applied

with each descriptor individually and drug target pairs having small E-values were

selected for further exploration.

The main drawback of this approach is that it is constrained to proteins whose related

ligands are both known and readily-available. Hence, it cannot provide any predictions

involving proteins with no a priori information regarding the ligands that modulate their

activities. Furthermore, this method does not take into account any sequence information,

and relies heavily on the ligands as the sole representatives of the targets. This may result

in an increased susceptibility of the method to noise in the ligand information.

3.1.2 Side effect-based similarity

Campillos et al. [11] used side effect similarity to relate drugs to novel targets. The

underlying assumption of this approach is that drugs having similar side effects tend to

share similar targets. This hypothesis was confirmed empirically by the authors,

demonstrating a remarkable correlation between side-effect similarity and the likelihood

that two drugs share a protein target. In general, the authors formulated a novel measure

for side effect similarity between each pair of drugs. This measure was then combined

with the widely-used Tanimoto 2D chemical similarity to form a statistical model that

estimates the probability of a pair of drugs to share a target. In the following we review

this scheme in detail.

20

Side effects were extracted from publicly available package inserts via a text-mining

procedure. The side effects were then classified according to the Unified Medical

Language System (UMLS) ontology for medical symptoms [8]. For the computation of

the side-effect similarity a weighing scheme which consists of two parts, a rareness

weight and a correlation weight, was defined. The rareness weight reflects the

relationship between the side effect frequency and the probability two drugs share a

target. Due to an inverse correlation between the two parameters, the rareness weight of a

side effect ri is defined as the negative logarithm of the side-effect frequency fi.

Some side-effects are correlated, for instance nausea and vomiting tend to commonly

co-occur. The correlation weight gi is defined to penalize for this redundancy, it does so

by down-weighing correlated side effects in a manner which resembles the down-

weighing of similar protein sequences within multiple alignments.

A raw score for the side-effect similarity between a drug pair (X,Y) was

calculated by summing the product of the weights across all their shared side effects:

The raw scores were then converted to P-values by randomizing the side effects. These P

values represent the pairwise side-effect similarity between drugs.

The Chemical Development Kit tool [65] was utilized to calculate the Tanimoto 2D

chemical similarity between each pair of drugs on chemical hashed fingerprints of a 1024

bits length ,thus yielding the chemical similarity measure.

Given a pair of drugs we wish to evaluate their probability to share a target based on

a function of side effect similarity and chemical similarity. To this end, both similarity

21

measures of the raw data were distributed in a two-dimensional histogram. In its first

dimension, drug pairs were distributed using logarithmic percentiles of their side effect

similarity P values, and in the second dimension, they were distributed using percentiles

of their chemical similarity scores (see Figure 3.1). The two-dimensional histogram of the

raw data was approximated by a sigmoid function. Drug pairs with over 25% probability

were predicted to share a target.

The main caveat of this approach is that it can be applied only to drugs with known

side-effects, namely marketed drugs, a fact that substantially limits its applicability.

Moreover, this method does not take into consideration any target related information

sources, such as sequence and molecular function. Another minor weakness is that it

indicates which drug pairs are likely to share a target; however, it does not state explicitly

which of their targets are anticipated to be the joint ones.

Figure 3.1: Demonstration of the approach adopted by Campillos et al. [11]. The

probability of drugs to share a target was assigned on the basis of a combination of side-

effect similarity and chemical similarity. The line delimits the area used to relate drugs

to each other. All drug pairs in this area have a >25% probability to share a target and

side-effect similarity P value < 0.1.

22

3.1.3 Integrating chemical and sequence similarities for drug

target inference

In [76] and [7] algorithms were developed that integrate two types of information –

chemical similarity between compounds and sequence similarity between proteins.

Yamanishi et al. [76] devised a kernel regression-based method (KRM) to infer

unknown drug-target interactions by integrating both of these similarities into a unified

Euclidean space termed the ‘pharmacological space’. The drug-target association

inference was conducted by a multi-step procedure as follows:

(i) Interacting compounds and proteins are embedded into a unified space called the

‘pharmacological space’.

(ii) A regression model is learned between the chemical/genomic space and the

unified space and any compounds/proteins are mapped to the unified space.

(iii) Compounds and proteins that are closer than a threshold in the unified space are

predicted to interact with each other.

Let { } be a set of known drugs and { }

a set of known target proteins,

respectively, where is the number of known drugs and the number of known

target proteins. Initially, a gold standard set of known drug-target interactions is

represented by a bipartite graph G(V1+V2, E), where V1 is the set of drugs, V2 is the

set of target proteins and E represents the interactions between them. The authors

proposed to transform the bipartite graph structure to an Euclidian space such that

both compounds and proteins are represented by sets of q-dimensional feature vectors

{ }

and {

}

, respectively. To this end, a graph-based similarity matrix

23

(

) is constructed, where the elements are computed by

employing Gaussian functions as follows:

for i,j = 1,…,nc

for i,j = 1,…,ng

for i = 1,…,nc, j = 1,…,ng

where is the shortest distance between the node pair (x,y) in the bipartite graph and h

is a width parameter, set to 2 in this study.

Subsequently, the matrix K is transformed into a positive definite matrix by

repeatedly adding a small multiple of the identity matrix to its diagonal until all its

eigenvalues become positive. Then, K is decomposed into eigenvalues as follows:

where the diagonal elements of the matrix are eigenvalues and the columns of the

matrix are eigenvectors and

⁄ . The row vectors of the matrix

(

)

are used to represent all drugs and target proteins. The

features ( )

and (

)

constitute the pharmacological

feature space, i.e., the unified space of compounds and proteins.

Let denote the matrix of chemical similarity between compounds and the

matrix of sequence similarity between proteins; these similarity matrices represent the

chemical and genomic space, respectively. The aim of the method is to construct

regression models that best describe the relationship between the chemical/genomic space

24

and the pharmacological feature space. To this end, a non-parametric technique known as

kernel regression was adopted. This technique learns a non-linear relation between the

chemical/genomic space and the pharmacological feature space such that the inferred

relation maximally fits the data. The kernel regression model for the problem at hand is

given by:

where x is an element in a space X, is the feature vector of x in the pharmacological

space, n is the size of the space X, is a similarity function, is a weight vector and

is a noise vector. We wish to optimize the equation so that its right hand side would

match its left hand side as much as possible for all the elements in the space X. This is

achieved by finding a weight matrix that minimalizes the following

loss function:

‖ ‖

where S is an similarity matrix and ‖ ‖ is Frobenius norm. The above function can

be formulated in a trace form as:

{ }

From setting

, the solution is analytically obtained by:

Two regression models were generated by this scheme: One for the correlation

between the chemical space and the pharmacological feature space with the weight

matrix Wc = (

)

and the other for the correlation between the

25

genomic space and the pharmacological feature space with

(

)

, respectively. Formally, given a new compound it is mapped onto

the pharmacological feature space by the formula:

where is a weight vector and is the chemical similarity score of a compound

pair. An analogous procedure is employed to map a new protein gnew onto the

pharmacological feature space:

where is a weight vector and is the sequence similarity score between a protein

pair. Finally, a feature-based similarity score is computed for each compound-

protein pair by calculating the inner product of their features in the

pharmacological space:

( )

The feature-based similarity score serves as a measure of the closeness between

compounds and proteins in the pharmacological feature space. Compound-protein pairs

whose score exceeds a certain threshold are predicted to interact with each other.

The shortcoming of this method stems from the utility of only two types of similarity

measures – the chemical structure similarity and sequence similarity. The information

required for their construction is usually at hand, so its acquisition does not restrict the

method's applicability. Nonetheless, these similarity measures comprise a minor portion

26

in the huge space of drug and target properties, and therefore their usage might not

suffice to exhaust this space.

Bleakley et al. [7] proposed another integrative technique to cope with the task. The

authors use a scheme called bipartite local models (BLM) to assign an association score

to each drug-target pair, in a two-phase procedure. They defined the problem as follows:

Given a drug target pair (di,tj) we wish to predict whether the drug di interacts with the

target tj. Preliminarily, both the chemical similarity and sequence similarity were

transformed to positive semi-definite matrices; this permitted their utility as kernel

functions on which a support vector machine (SVM) model was applied throughout the

course of the algorithm.

In the first phase the target tj is discarded and two separate lists are assembled, one

containing all the other known targets of the drug di and another containing targets not

known to be bound by di. The targets in the first list are assigned a label +1 and the

targets in the other list are assigned a label of -1. Next, an SVM model is trained on the

sequence similarity among the concerned targets to discriminate between the two data

types. This model is then employed on the sequence similarity between the target tj to the

targets in the training set to predict the label of tj as a target or non-target of the drug di.

An analogous procedure is conducted on the drugs; namely, the target tj is fixed and

the drug di is discarded. Again, an SVM model is trained on the chemical similarity

among the relevant drugs to discriminate between drugs targeting tj from those not

targeting tj. Eventually, this classification rule is applied to the chemical similarity

between the drug di to the drugs in the training set, predicting whether di binds to tj or not.

27

The final stage of the algorithm is the integration of the two scores, the one assigned

to the target tj and the one assigned to the drug di into a unified association score. To this

end, the authors use a max function; therefore, the association score is simply the highest

score among the two.

The primary pitfall of this algorithm is that it cannot be employed on drug target

pairs (di,tj) having no prior information with regard to the targets of the drug di and the

drugs targeting the target tj. Hence, this method is not applicable in scenarios in which

the query association involves both a novel drug and a novel target. Furthermore, it

suffers from the same weakness of the previously mentioned method [76], as both

methods exploit the same similarity measures.

3.2 Summary and Our approach

Detection of associations between drugs and targets is a crucial task in genomic drug

discovery. Its great importance has led to the rapid development of a large array of

computational methods from various disciplines, each trying to address the problem from

a different perspective. In the previous sections we reviewed several existing methods for

this task and pointed out their drawbacks. A shared weakness of all these methods is that

they take advantage of only a thin slice in the huge amount of data attributed to drugs and

targets. As alluded in the introduction, the utilization of just one or a few similarity

measures may miss information that is valuable for uncovering novel drug-target

associations.

The next chapter will be dedicated to the description of our approach; here, we

provide an overview of the method and highlight its contribution compared to other state

28

of the art methods. Our method relies on the construction of various drug-drug and gene-

gene similarity measures derived from diverse readily available databases pertaining to

drugs and targets. Each pair of drug-drug and gene-gene similarity measures is combined

to form a single classification feature. Every feature is calculated by combining the drug-

drug similarities between a query drug and other drugs and the gene-gene similarities

between the query gene and other target genes across all true drug-target associations.

The features are automatically combined using a logistic regression classifier that is

coupled with a wrapper feature selection procedure and yield the final classification

scores.

The contribution of our method can be summarized as follows: (i) We introduce

novel drug-drug similarity measures and combine them into the prediction process. (ii)

We propose a way of integrating the drug-drug and gene-gene similarities to create

classification features. (iii) Our novel drug-target prediction algorithm markedly

outperforms previous methods. (iv) Finally, our method can be easily extended by newly

constructed similarity measures, and hence is scalable and remarkably adequate to current

times where new data sources are emerging in an exceedingly rapid manner.

29

Chapter 4

SITAR: An algorithm for predicting

drug targets

We designed a drug-target prediction algorithm with three main components (see Figure

4.1): (i) drug-drug and gene-gene similarity computations; (ii) combining the drug and

gene similarity measures into classification features; and (iii) feature selection and

prediction using logistic regression. In the following we describe these components in

detail.

Similarity measures

In order to overcome the limitations engulfed in using similarity measures of a single

type, we set out to incorporate a multitude of similarity measures, including both novel

and already published ones. Overall, we considered five drug-drug similarities and three

gene-gene similarities from different biological and chemical sources. The drug-drug

similarity measures were computed using chemical, registered and predicted side effects

[41] of the drug, drug response gene expression profiles and the Anatomical, Therapeutic

and chemical (ATC) classification system. The gene-gene similarity measures used are

based on sequence, closeness in a protein-protein interaction network and semantic GO

similarities. Overall, 315 drugs and 250 protein targets were represented in all measures,

spanning 782 known associations.

30

Figure 4.1: Algorithmic pipeline: Formation of drug-drug and gene-gene similarity

measures (A), integration of the similarities to classification features (B) and

classification with wrapper feature selection (C).

A

B

C

31

4.1 Similarity measure construction

We defined and computed five drug-drug similarity measures and three gene-gene

similarity measures. All similarity measures were normalized to be in the range [0,1].

We build the following drug-drug similarity measures:

1. Chemical-based: Canonical simplified molecular input line entry specification

(SMILES) of the drug molecules were downloaded from DrugBank [73]. Hashed

fingerprints were computed using the Chemical Development Kit (CDK) with default

parameters [66]. The similarity score between two drugs is computed on their

fingerprints according to the two-dimensional Tanimoto score [70], which is equivalent

to the Jaccard score [32] of their fingerprints, i.e., the size of the intersection over the

union when viewing each fingerprint as specifying a set of elements.

2. Ligand-based: The Similarity Ensemble Approach (SEA) [38] relates protein

receptors based on the chemical 2D similarity of the ligand-sets modulating their

function. Given a drug’s canonical SMILES, the SEA search tool compares it against a

compendium of ligand-sets and computes E-values for those ligand sets. To compute a

drug-drug similarity we queried drugs using their canonical SMILES on the SEA tool. To

obtain robust results, we queried the drug against the two ligand databases provided in

the tool (MDL Drug data report [2] and WOMBAT [52]) and used two different methods

to compute the drug fingerprint (Scitegic ECFP4 [1] and Daylight [3]), resulting in four

lists of similar ligand sets. We utilized both fingerprints since according to the SEA tool

Scitegic ECFP4 provides more precise results, while Daylight can find more novel

connections. Unifying the four lists and filtering drug-ligand set pairs with E-values>10-5

,

we obtained a list of relevant protein-receptor families for each drug. Finally, the

32

similarity between a pair of drugs was computed as the Jaccard score between the

corresponding sets of receptor families. We note that due to a partial mapping of the

receptor families to proteins, we could not use the drug-receptor associations directly as

classification features.

3. Expression-based: Gene expression responses to drugs were retrieved from the

Connectivity Map project [44]. We experimented with three different methods to

calculate drug similarity from Connectivity Map ranked gene expression profiles: (i)

Spearman rank correlation coefficient; (ii) calculating a Jaccard score between the 500

most differentially expressed genes (250 most up-regulated and 250 most down-regulated

genes); and (iii) using the method proposed by [31], employing Gene Set Enrichment

Analysis (GSEA) [69] as a similarity measure. We dealt with multiple experiments per

drug as follows: In the Spearman calculation, we averaged over the d1xd2 different

correlation coefficients obtained between the d1 experiments of one drug against the d2

repeated experiments of the second drug. In the Jaccard case, we used differentially

expressed genes that appeared in at least 50% of the gene expression responses to the

same drug. The method proposed by [31] handles repeated experiments of the same drug

through iterative merging.

4. Side-effect based: Drug side effects were obtained from SIDER [41], an online

database containing drug side effects associations extracted from package inserts using

text mining methods . Recently, we developed an algorithmic framework to predict side

effects for drugs by combining side effect information on known drugs with their

chemical properties [5]. Following this work, we defined the similarity between drugs

according to the Jaccard score between their top ten predicted side effects.

33

5. Annotation-based: We used the World Health Organization (WHO) Anatomical ,

Therapeutic and Chemical (ATC) classification system [63]. This hierarchical

classification system categorizes drugs at five different levels according to the organ or

system on which they act, their therapeutic and their chemical characteristics.

ATC codes were obtained from DrugBank. To define a similarity between ATC terms we

used the semantic similarity algorithm of [58]. This algorithm associates probabilities

p(x) with all the nodes (i.e., terms) x in the hierarchy and calculates the similarity of two

drugs as the maximum over all their common ancestors c of –log (p(c)).

The gene-gene similarity measures we used include:

1. Sequence similarity: based on a Smith-Waterman sequence alignment score [64].

Following the normalization suggested in [7], we divide the Smith-Waterman score

between two protein sequences by the geometric mean of the scores obtained from

aligning each sequence against itself.

2. Closeness in a protein-protein interaction (PPI) network: Human protein-protein

interactions were compiled from [9, 20, 60, 67, 75]. The distances between each pair of

genes were calculated based on their corresponding proteins using an all-pairs shortest

paths algorithm. Distances were transformed to similarity values using:

)',()',( ppbDAeppS (1)

where S(p,p’) is the computed similarity value between two proteins, D(p,p’) is the

shortest path between these proteins in the PPI network and A, b are free parameters (see

Measure construction considerations section for their optimization).

3. Gene Ontology (GO) semantic similarity: GO annotations [4] were downloaded

from UniProt [33]. The scores of the semantic similarity between the GO annotation

34

terms of the targets were calculated according to [58], using the csbl.go R package [53].

All three ontologies were used in the computation as similar drugs are expected to

interact with proteins that act in similar biological processes, or have similar molecular

functions or reside in similar compartments.

4.2 Combining measures

We view the drug target prediction problem as a classification problem, where the goal is

to learn a classifier that can distinguish true and false drug-target associations. True drug-

target interactions were retrieved from KEGG DRUG [36], DrugBank [73] and DCDB

[46]. An independent set of drug-target interactions was downloaded from Matador [24]

for validation purposes only. The classification features that we use are constructed from

scores calculated on pairs combined of one of M drug-drug measures (5 in our case) and

one of N gene-gene similarity measures (3 in our case), resulting in an MxN set of drug-

gene measure features for each drug-target association (15 in our case). For a given

measure pair (i.e., feature), the score of a given association (d,t) is calculated by

considering the similarity, according to the given measure pair, of all true drug-target

interactions to this association. The computation is done as follows: First, for each true

interaction (d’,t’) we compute the drug-drug similarity S(d,d’) and the target-target

similarity S(t,t’). Next, we combine the two similarities to a single score. Finally, we

integrate the scores of all true interactions to form the classification feature.

We experimented with several ways of combining the two similarities and

integrating the scores over the various true associations, aiming to compute a score that

would differ as much as possible between true and false drug-target interaction. This

35

difference was tested using the Wilcoxon rank sum test for equal medians. To combine a

drug-drug and a gene-gene similarity, we considered two types of averaging functions:

arithmetic mean and geometric mean. The two options scored similarly (p<1e-10

) with a

slight advantage to geometric mean (AUPR difference < 0.01), which we used in the

sequel. To integrate the scores of the known associations, we studied the effect of

averaging the top-k scores for various selections of k and discovered that taking only the

top scoring drug-target interaction yields the lowest p-value, as displayed in Figure 4.2.

This effect was observed also under the arithmetic mean score.

To conclude, we used the maximum value over the following weighted version of

geometric mean to score drug target associations:

1r0 ,)',()',(max),( )1(,'',

rr

tdtd ttSddStdScore (2)

The optimization of the scoring parameter r was done in a cross validation setting (see

the Prediction assessment and parameter optimization section for further details).

Figure 4.2: -log(p-value) of the Wilcoxon rank sum test for the distribution of scores of

true versus false drug-target associations. The score was computed by averaging the top k

scores, k ranging from 1 to 50.

36

4.3 Measure construction considerations

The transformation of a data source into a similarity measure can be done in various

ways. In order to choose a transformation that will perform well in drug target prediction,

we applied the Wilcoxon rank sum test to choose the transformation with greatest

separation power. Precisely, a suggested drug-drug similarity measure was combined

with all the gene-gene similarity measures to form classification features. A Wilcoxon

rank-sum test was then applied to each of these features to test whether the medians of

the distributions of feature values on true and false drug-targets interactions differ. The

logarithms of the resulting p-values were averaged to yield the separation score for that

suggested measure. An analogous procedure was applied to score gene-gene similarity

measures.

We used these separation scores to specify two measures: the co-expression based

drug similarity and the PPI-based gene similarity. For the co-expression based drug

similarity, Jaccard-based similarities scored better (average rank sum p< 7e-89

) than

GSEA-based similarities (average rank sum p< 2e-35

) and Spearman-based similarities

(p<3e-11

). The decreased performance of the Spearman-based similarity measure is

understandable in light of the fact that gene expression profiles tend to be informative

mainly with regard to the top (upregulated) and bottom (downregulated) genes. For the

PPI-based gene similarity, we used the separation score in order to choose the optimal

parameters A and b (equation 1). Hence, we picked b=1 and chose A such that direct

network neighbors would get a value close to 1 (A=0.9). However, the results were

robust over a wide range of parameters – similar AUPRs (AUPR difference<0.003) were

obtained with b ranging from 0.5 to 2 and A ranging from 0.5 to 1.

37

4.4 Prediction assessment and parameter optimization

We used a 10-fold cross validation scheme to evaluate the accuracy of our prediction

algorithm. The training set used for the cross validation included 782 true drug-target

interactions and a randomly generated set of drug-gene pairs (not part of the positive set),

twice larger than the positive set. We note that taking a negative set of equal size or three

times larger did not alter the results substantially (AUPR difference of 0.035 between a

negative set of equal size to a twice larger set and AUPR difference of 0.027 between a

negative set twice larger to a three times larger set).

In order to improve the classification accuracy, we optimized the weight factor r in

Equation 2 and performed feature selection on the 15 features associated with the various

measure pairs. Specifically, we trained a logistic regression classifier where the data was

divided into 10 parts and we iteratively selected eight parts to be the training set, a 9th

part was considered a test set on which we optimized the parameter r and the 10th part

was considered the validation set, on which the performance of the classification was

assessed. We found that the AUPR was robust to the selection of the weight r, with

slightly better classification performance using r=0.4 (AUPR difference<0.02).

To select features, we used a wrapper feature selection procedure, which is based on

a classification accuracy measure (see [25]). A common method to avoid exhaustively

searching for the optimal set of features in terms of classification accuracy is through an

iterative greedy algorithm. One variant, termed forward selection, starts with an empty

set and adds in each step the most contributing feature, while another, termed backward

elimination, starts with the full set and removes the least contributing feature in each step.

38

These two algorithms yield nested subsets of features from which a set attaining the

highest accuracy is finally chosen.

We considered two different scores as indicators for the classification accuracy – The

Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) curve

[21] and the Area Under the Precision Recall (AUPR) curve [16]. Standard deviations for

both measures were assessed by using 100 random partitions of the training data into

cross validation sets. Importantly, when dealing with uneven number of true and false

examples, Precision-Recall (PR) curves give a more informative picture of an algorithm's

performance [16]. Furthermore, in [16] it is shown that a curve dominates in ROC space

if and only if it dominates in PR space. Hence we focus on the AUPR criterion as an

indication for classification performance. We evaluated the classification quality also in

regard to different target types, including Enzymes, G-protein coupled receptors

(GPCRs), ion-channels and nuclear receptors. The target type was inferred from

DrugBank [73] and Uniprot [33] annotations.

The classification was done using MATLAB implemented logistic regression.

Logistic regression is used for the prediction of the probability of an event by fitting the

data to a logistic curve. In our study, the logistic regression predicts the probability of an

association between a drug-gene pair to occur by concordant weighting of a list of

features score, estimated with respect to a query association. A cutoff was selected

according to the best F1-measure, defined as the harmonic mean between the precision

and recall. We also tested a Support Vector Machines (SVM) classifier with different

kinds of kernel functions (linear, polynomial, radial basis function and sigmoid) using the

LIBSVM package for MATLAB [12] with default parameters.

39

4.5 Computing novel predictions

To predict additional targets for drugs in our data set, we used a training set that included

all the true associations and a two-fold larger randomly generated set of drug-gene pairs

that are not known to interact. We applied the trained classifier to all remaining drug-

gene pairs to form our prediction set. To assign prediction scores also to the random

negative set that we used, we repeated the analysis with another randomly-picked

negative set, distinct from the first one. Overall, we obtained classification scores for all

77968 unknown drug-gene pairs. We used a classification threshold that yielded the best

F1-measure. Out of the resulting prediction set, we picked for each drug its maximal

scoring target as the most promising target. Overall, we could predict targets for 307

drugs.

40

Chapter 5

Experimental Results

In this chapter, we demonstrate the results of the feature selection procedure and inspect

the individual contribution of each feature to the predictive power of our method.

Subsequently, we compare our method to two extant approaches. We conclude the

chapter with a thorough verification of our novel predictions.

5.1 Feature selection and performance evaluation

We performed feature selection using both forward selection and backward elimination,

converging to a selected set of ten features, constructed from pairs of drug-drug and

gene-gene similarity measures. The area under the precision-recall curve (AUPR) scores

before and after the feature selection phase, as well as the AUPR achieved when using

the final ten selected features are listed in Table 5.1. Similar results were obtained when

using an SVM classifier as shown in Table 5.2. Examining the individual contribution of

each of the features, we find that best AUPR scores are achieved using the ligand-based

drug-drug similarity of [39] together with sequence similarity for genes (AUPR=0.85).

The least contributing combination is the drug co-expression together with PPI network

closeness (AUPR=0.54). Accordingly, we find that the best drug-drug similarity averaged

over all gene-gene similarities is the ligand similarity (AUPR=0.83±0.02) and the worst

is the co-expression similarity (AUPR=0.67±0.11). For genes, the best gene-gene

41

measure across all drug-drug similarities is the GO semantic similarity and the worst is

the PPI closeness measure. We note that the feature selection process did not affect the

results significantly, but we expect it to have more impact when additional similarity

measures are incorporated as features.

Drug similarity Gene similarity AUPR* AUC†

All features 0.905 0.935

Selected features 0.908 0.935

Ligand Sequence similarity 0.851 0.867

Ligand GO semantic

similarity

0.845 0.867

Predicted Side

Effect

GO semantic

similarity

0.832 0.863

ATC similarity GO semantic

similarity

0.81 0.858

Ligand PPI closeness 0.809 0.844

Chemical GO semantic

similarity

0.805 0.84

ATC similarity PPI closeness 0.762 0.809

Chemical Sequence similarity 0.749 0.763

Predicted Side

Effect

PPI closeness 0.729 0.759

Co-expression Sequence similarity 0.724 0.748

* All standard deviations are below AUPR of 0.01 † All standard deviations are below AUC of 0.005

Table 5.1: AUPR and AUC scores calculated using different features.

42

Kernel type AUPR AUC

Linear 0.902 0.928

Polynomial (gamma=0.1,

degree=3)

0.867 0.88

Radial basis function

(gamma=0.1)

0.903 0.931

Sigmoid (gamma=0.1) 0.879 0.903

Table 5.2: AUPR and AUC scores calculated using different SVM kernels on the

selected features of Table 5.1.

We further assessed the classification quality of different target types (GPCRs, ion

channels, enzymes and nuclear receptors), as in [7]. The results are listed in Table 5.3,

with GPCRs attaining the best scores.

Target type Number AUPR AUC

GPCR 49 0.939 0.946

Ion channels 37 0.889 0.927

Enzymes 94 0.877 0.922

Others 56 0.87 0.935

Nuclear receptors 14 0.851 0.863

Table 5.3: AUPR and AUC scores calculated for different target types.

43

5.2 Comparison to other drug-target prediction

methods

We compared our method to two state of the art methods: (i) The kernel regression-based

method (KRM) of [76]. This method embeds drugs and targets into a unified Euclidean

space termed the ‘pharmacological space’, using a regression model. Predicted

interacting drug-gene pairs are those which are closer to each other below a certain

threshold in the pharmacological space. (ii) The bipartite local models (BLM) method of

[7] constructs local models to learn drug-target associations based on additional targets of

the query drug and additional drugs targeting the query target . We note that we originally

aimed to compare our approach to the method proposed by [39], however since the SEA

tool of [38] provides receptors code names which cannot be mapped to our list of targets,

no direct comparison was feasible. Figure 5.1 displays the Precision-Recall curves of the

three methods and Table 5.4 summarizes the AUPR and AUC scores between the

different methods, overall demonstrating the marked improvement obtained by our new

method (AUPR of 0.908, exceeding the KRM and BLM methods by 0.07 and 0.15 AUPR

difference, respectively).

Figure 5.1: Precision-Recall curves of our logistic regression classifier (SITAR),

KRM and BLM methods.

44

Measure type AUPR AUC

SITAR 0.908 0.935

KRM [76] 0.838 0.884

BLM [7] 0.754 0.814

Table 5.4: Comparison of AUPR and AUC scores calculated using different methods.

5.3 Novel predictions

After demonstrating the utility of our method in predicting drug target associations, we

set out to predict novel targets for drugs in our data set (Section 4.5). We focused on the

best scoring gene for each drug, obtaining putative novel targets for 307 of the 315 drugs

used in this study. To validate these predictions we compared the predicted associations

to those reported in two other sources of drug-target interactions that were not used in the

learning process: The Therapeutic Target Database (TTD) [78] and the Matador database

[24]. TTD included only six interactions that were not part of our original data. Two of

them were predicted by us (p<2.4e-4

). Matador, on the other hand, reported 219 additional

interactions, out of which we predicted thirteen (p<8.4e-12

).

For an additional validation, we utilized information on drug-related pathways of

KEGG [36] and REACTOME [49] databases, as cataloged in the Comparative

Toxicogenomics Database (CTD) [15]. Out of the 315 drugs used in this study, 217 drugs

are associated with known pathways in CTD. For each drug, we constructed a merged list

45

of genes that appear in all the pathways associated with that drug. We then searched for

enrichment of the predicted targets of each drug in this merged list. We found 20 drugs

whose pathways are enriched for our predicted targets (at a false discovery rate (FDR) of

0.05). For comparison, we randomly permuted the association between drug target

predictions and drug pathways and recomputed the number of enriched drugs. Across 100

such permutation tests, 9.2±3.4 drugs were enriched on average, with all tests attaining

less than 20 enrichments (p<0.01).

The distribution of target types in our novel predictions is different from the general

distribution of known targets. Specifically, nuclear receptors and GPCRs are

overrepresented in our novel predictions relative to their general abundance in all the

targets as depicted in Figure 5.2 (relative abundance in novel predictions is 2.4 and 1.9

times the relative abundance in drug targets for nuclear receptors and GPCRs,

respectively). This overrepresentation is attributed to higher scores achieved by drugs

associated with nuclear receptors and GPCRs due to higher sequence similarity of nuclear

receptors and GPCRs relative to other targets (twice and 1.7 more similar than ion

channels, respectively). In addition, nuclear receptors are much closer to each other on

the protein-protein interaction (PPI) network (average PPI distance between nuclear

receptors is smaller by a factor of 0.6 from all other types). We note that dopamine

receptor D3 (DRD3), having only two known interactions in our data, had exceptionally

high occurrence in our novel prediction set (28 times). Dopamine receptor D3 is

primarily predicted to be targeted by drugs indicated for Parkinson’s disease and

Schizophrenia (eight out of nine existing in DrugBank). It is noteworthy that half of the

predicted drugs targeting DRD3 are also indicated for Parkinson’s disease or

46

Schizophrenia. These predictions make DRD3 a promising candidate for further

investigation.

Next, we applied our method to provide novel predictions for drugs that have no

known interacting targets in DrugBank. There are 20 such drugs that are included in all

our similarity measures. Applying a cross-validation scheme and choosing a

classification threshold maximizing the F1-measure, we could predict targets for 14

drugs. The top 5 scoring predictions are listed in Table 5.5. Evaluating our set of top

predictions for drugs with unknown targets, we find indirect evidence for the first four of

the top five associations. Specifically, Theobromine is predicted to target Adenosine A1

receptor (Adora1). Theobromine, derived from the cacao bean, belongs to the

methylxanthine class of chemical compounds. Other methylxanthine derivatives like

Caffeine and Theophylline are reported to target Adenosine A1 receptors [22]. Cefotetan

is predicted to target Paraoxonase 1 (PON1). Cefotetan belongs to the Cephamycin class

of antibiotics, which is often classified with second generation Cephalosporins, the latter

inhibits PON1 [62]. Rolitetracycline is predicted to target Cytochrome C (CYCS) and

Caspase 3 (Casp3) with equal score. Rolitetracycline is a member of the tetracycline

Figure 5.2: Relative abundance of (A) enzymes, (B) ion channels, (C) GPCRs, (D)

nuclear receptors and (E) others in the novel predicted set.

47

family of antibiotics. Other members from this family like Minocycline and Doxycycline

are reported to inhibit Cytochrome C (Minocycline [80]) and Caspase 3 (Minocycline and

Doxycycline [13, 51]). Last, Sulfamerazine is predicted to target Dihydrofolate reductase

(Dhfr). Sulfamerazine belongs to antibacterial sulfonamides whose inhibition of

Dihydrofolate reductase has been extensively studied [59].

Rank Drug name Target name

1 Theobromine Adenosine A1 receptor (Adora1)

2 Cefotetan Paraoxonase 1 (PON1)

3 Rolitetracycline Cytochrome C, somatic (CYCS)

3 Rolitetracycline Caspase 3, apoptosis-related cysteine peptidase

(Casp3)

4 Sulfamerazine Dihydrofolate reductase (Dhfr)

5 Guaifenesin Solute carrier family 6 (neurotransmitter

transporter, serotonin), member 4 (SLC6A4)

Last, we applied our method to detect genes that to date are not recognized as drug

targets. In order to reduce the amount of tested genes, we constructed a list of the most

promising targets in the following way: for each of the 250 known targets, we took the

most similar gene in at least one of the three used gene-gene similarity measures

(sequence similarity, closeness on the protein-protein interaction network and Gene

Ontology semantic similarity). We obtained 135 potential new targets on which we

applied a similar procedure as in the case of drugs with no known targets: we performed a

Table 5.5: Top 5 predicted drug-target associations for drugs with no known targets.

The two top predicted targets of Rolitetracycline achieved the same score.

48

10-fold cross validation to obtain a classification threshold that was used to classify the

135 potential targets. We then considered the top ranked predicted drug for each of the

135 potential targets. The top 5 drug predictions are listed in Table 5.6, all of which are

supported by current knowledge as discussed below. Specifically, Fluorometholone is

predicted to target calreticulin (CALR). Fluorometholone is a glucocorticoid, which is a

class of steroid hormones that binds to the glucocorticoid receptor (GR). Interestingly, it

has been shown that CALR binds to the GR, thus inhibiting the effect of glucocorticoids

[10]. Menadione (Vitamin K3) is predicted to target coagulation factor III (tissue factor,

F3). According to DrugBank, Menadione is involved as a cofactor of vitamin K-

dependent coagulation factors II, VII, IX and X, two of which are known to be part of the

tissue factor pathway [71]. In addition, tissue factor is known to be targeted by

Coagulation factor VIIa, which is itself a vitamin K-dependent glycoprotein. Methyldopa,

primarily used for hypertension is predicted to target nitric oxide synthase 1 (neuronal)

(NOS1). It has been shown that Methyldopa is associated with changes in nitric-oxide

synthesis [56]. Clofibrate is predicted to target apolipoprotein E (APOE). According to

DrugBank, Clofibrate inhibits the synthesis and increases the clearance of apolipoprotein

B, which is part of the low-density lipoproteins (LDL). APOE has a role in the

conversion of very low density lipoproteins to LDL [19]. Finally, Bacitracin, an

antibiotic, is predicted to target thrombospondin 1 (Thbs1/TSP1). While we could not

find a direct connection between the two, it is noteworthy that Bacitracin acts by

interfering with bacterial cell wall maintenance of Gram-positive organisms [68].

Interestingly, it has been shown that TSP1 promotes cellular adherence of Gram-positive

49

pathogens via the recognition of peptidoglycan, the main component of the cell wall [57].

Rank Drug name Target name

1 Fluorometholone Calreticulin (CALR)

2 Menadione Coagulation factor III

(thromboplastin, tissue

factor) (F3)

3 Methyldopa Nitric oxide synthase 1

(neuronal) (NOS1)

4 Clofibrate Hypothetical

LOC100129500;

apolipoprotein E (APOE)

5 Bacitracin Thrombospondin 1

(Thbs1/TSP1)

Table 5.6: Top 5 predicted drug-target associations for genes with no known drugs

targeting them.

50

Chapter 6

Conclusions

We introduced a novel method, SITAR, for predicting drug-target interactions. Our

method incorporates an extensive set of drug-drug and gene-gene similarity measures.

Newly incorporated drug-drug similarities are based on predicted side effects, gene

expression drug response profiles and the ATC classification system. The classification

features are constructed based on a new score integrating the drug-drug and gene-gene

similarity spaces. These features are integrated via a logistic regression classifier,

combined with a feature selection process. Our method is flexible and allows the

incorporation of new emerging measures without altering already computed scores on

other measures. Using our method, we show marked improvement of classification

performance over previous drug-target prediction approaches. We provide novel

predictions of drug-target interactions and validate them against public databases. Last,

we predict targets for drugs which to-date have no known targets.

Having shown that our method is robust with respect to different score choices,

selected features and different classification methods, it seems that the primary reason for

the increased performance compared to previous methods stems from the use of multiple

similarity measures. Each of the resulting features alone does not have enough predictive

power, but the combination of multiple features allows the classification procedure to

perform well. Accordingly, we noticed that using a low number of features (less than 5)

deteriorates the results. Nevertheless, our method can be enhanced in several ways. First,

51

one could improve and expand the measures used. Of special interest is improving the

gene co-expression similarity based on the Connectivity Map data, which currently

exhibits the worst performance. Another extension would be to increase the number of

represented drugs and genes shared between the different measures. This could be

achieved either by predicting missing similarities from existing ones (as in [5]) or by

incorporating imputation methods to overcome missing information in some of the

measures.

52

Bibliography

[1] Accelrys, Inc. http://accelrys.com/products/scitegic/. 2009.

[2] MDL Drug Data Report 2006.1 (MDL Information Systems Inc., San Leandro, CA,

2006).

[3] Thor and Merlin; Version 4.62; Daylight Chemical Information Systems Inc.:

Irvine, CA. Theory at www.daylight.com.

[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.

Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-

Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M.

Rubin and G. Sherlock, Gene ontology: tool for the unification of biology. The

Gene Ontology Consortium, Nat Genet, 25 (2000), pp. 25-9.

[5] N. Atias and R. Sharan, An Algorithmic Framework for Predicting Side-Effects of

Drugs, RECOMB 2010, to appear. (2010).

[6] D. d. Bernardo, M. J. Thompson, T. S. Gardner, S. E. Chobot, E. L. Eastwood, A.

P. Wojtovich, S. J. Elliott, S. E. Schaus and J. J. Collins, Chemogenomic profiling

on a genome-wide scale using reverse-engineered gene networks, Nature

Biotechnology, 23 (2005), pp. 377 - 383.

[7] K. Bleakley and Y. Yamanishi, Supervised prediction of drug-target interactions

using bipartite local models, Bioinformatics, 25 (2009), pp. 2397-403.

[8] O. Bodenreider, The Unified Medical Language System (UMLS): integrating

biomedical terminology, Nucleic Acids Research, 32 (2004), pp. D267-D270.

[9] B. J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R.

Oughtred, D. H. Lackner, J. Bahler, V. Wood, K. Dolinski and M. Tyers, The

BioGRID Interaction Database: 2008 update, Nucleic Acids Res, 36 (2008), pp.

D637-40.

[10] K. Burns, M. Opas and M. Michalak, Calreticulin inhibits glucocorticoid- but not

cAMP-sensitive expression of tyrosine aminotransferase gene in cultured McA-

RH7777 hepatocytes, Mol Cell Biochem, 171 (1997), pp. 37-43.

[11] M. Campillos, M. Kuhn, A. C. Gavin, L. J. Jensen and P. Bork, Drug target

identification using side-effect similarity, Science, 321 (2008), pp. 263-6.

[12] C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines,

Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001).

53

[13] M. Chen, V. O. Ona, M. Li, R. J. Ferrante, K. B. Fink, S. Zhu, J. Bian, L. Guo, L.

A. Farrell, S. M. Hersch, W. Hobbs, J. P. Vonsattel, J. H. Cha and R. M.

Friedlander, Minocycline inhibits caspase-1 and caspase-3 expression and delays

mortality in a transgenic mouse model of Huntington disease, Nat Med, 6 (2000),

pp. 797-801.

[14] A. C. Cheng, R. G. Coleman, K. T. Smyth, Q. Cao, P. Soulard, D. R. Caffrey, A. C.

Salzberg and E. S. Huang, Structure-based maximal affinity model predicts small-

molecule druggability, Nat Biotechnol, 25 (2007), pp. 71-5.

[15] A. P. Davis, C. G. Murphy, C. A. Saraceni-Richards, M. C. Rosenstein, T. C.

Wiegers and C. J. Mattingly, Comparative Toxicogenomics Database: a

knowledgebase and discovery tool for chemical-gene-disease networks, Nucleic

Acids Res, 37 (2009), pp. D786-92.

[16] J. Davis and M. Goadrich, The relationship between Precision-Recall and ROC

curves, ICML '06: Proceedings of the 23rd international conference on Machine

learning (2006), pp. 233-240.

[17] J. Drews, Drug discovery: a historical perspectiv, Science, 287 (2000), pp. 1960-

1964.

[18] J. Eckstein, Drug disovery tutorial.

[19] C. Ehnholm, R. W. Mahley, D. A. Chappell, K. H. Weisgraber, E. Ludwig and J. L.

Witztum, Role of apolipoprotein E in the lipolytic conversion of beta-very low

density lipoproteins to low density lipoproteins in type III hyperlipoproteinemia,

Proc Natl Acad Sci U S A, 81 (1984), pp. 5566-70.

[20] R. M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-

Cerajewski, M. D. Robinson, L. O'Connor, M. Li, R. Taylor, M. Dharsee, Y. Ho, A.

Heilbut, L. Moore, S. Zhang, O. Ornatsky, Y. V. Bukhman, M. Ethier, Y. Sheng, J.

Vasilescu, M. Abu-Farha, J. P. Lambert, H. S. Duewel, Stewart, II, B. Kuehl, K.

Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S. L. Adams, M. F. Moran,

G. B. Morin, T. Topaloglou and D. Figeys, Large-scale mapping of human protein-

protein interactions by mass spectrometry, Mol Syst Biol, 3 (2007), pp. 89.

[21] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27

(2005), pp. 861-874.

[22] B. B. Fredholm and C. G. Persson, Xanthine derivatives as adenosine receptor

antagonists, Eur J Pharmacol, 81 (1982), pp. 673-6.

[23] T. S. Gardner, D. d. Bernardo, D. Lorenz and J. J. Collins, Inferring Genetic

Networks and Identifying Compound Mode of Action via Expression Profiling,

Science, 301 (2003), pp. 102 - 105.

54

[24] S. Gunther, M. Kuhn, M. Dunkel, M. Campillos, C. Senger, E. Petsalaki, J. Ahmed,

E. G. Urdiales, A. Gewiess, L. J. Jensen, R. Schneider, R. Skoblo, R. B. Russell, P.

E. Bourne, P. Bork and R. Preissner, SuperTarget and Matador: resources for

exploring drug-target relationships, Nucleic Acids Res, 36 (2008), pp. D919-22.

[25] I. Guyon and A. Elisseeff, An Introduction to Variable and Feature Selection,

Journal of Machine Learning Research, 3 (2003), pp. 1157-1182.

[26] S. J. Haggarty, K. M. Koeller, J. C. Wong, R. A. Butcher and S. L. Schreiber,

Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived

deacetylase inhibitors using cell-based assays, Chem Biol, 10 (2003), pp. 383-96.

[27] S. M. Hammer, D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley,

R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, M. S. Hirsch

and T. C. Merigan, A Trial Comparing Nucleoside Monotherapy with Combination

Therapy in HIV-Infected Adults with CD4 Cell Counts from 200 to 500 per Cubic

Millimeter, N Engl J Med, 335 (1996), pp. 1081-1090.

[28] N. T. Hansen, S. Brunak and R. B. Altman, Generating genome-scale candidate

gene lists for pharmacogenomics, Clin Pharmacol Ther, 86 (2009), pp. 183-9.

[29] M. E. Hillenmeyer, E. Ericson, R. W. Davis, C. Nislow, D. Koller and G. Giaever,

Systematic analysis of genome-wide fitness data in yeast reveals novel gene

function and drug action, Genome Biol, 11 (2010), pp. R30.

[30] A. L. Hopkins and C. R. Groom, The druggable genome, Nature Reviews Drug

Discovery, 1 (2002), pp. 727-730.

[31] F. Iorio, R. Tagliaferri and D. d. Bernardo, Identifying network of drug mode of

action by gene expression profiling, J Comput Biol 16 (2009), pp. 241-251.

[32] P. Jaccard, Nouvelles recherches sur la distribution florale., Bul. Soc. Vaudoise

Sci. Nat., 44 (1908), pp. 223–270.

[33] E. Jain, A. Bairoch, S. Duvaud, I. Phan, N. Redaschi, B. E. Suzek, M. J. Martin, P.

McGarvey and E. Gasteiger, Infrastructure for the life sciences: design and

implementation of the UniProt website, BMC Bioinformatics, 10 (2009), pp. 136.

[34] C. S. Janga and A. Tzakos, Structure and organiztion of drug-target networks:

insights from genomic approaches for drug discovery Molecular BioSystems, 5

(2009), pp. 1536-1548.

[35] M. Kanehisa and S. Goto, KEGG: kyoto encyclopedia of genes and genomes,

Nucleic Acids Res, 28 (2000), pp. 27-30.

[36] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe and M. Hirakawa, KEGG for

representation and analysis of molecular networks involving diseases and drugs,

Nucleic Acids Res, 38 (2010), pp. D355-60.

55

[37] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T.

Katayama, M. Araki and M. Hirakawa, From genomics to chemical genomics: new

developments in KEGG, Nucleic Acids Res, 34 (2006), pp. D354-7.

[38] M. J. Keiser, B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin and B. K.

Shoichet, Relating protein pharmacology by ligand chemistry, Nat Biotechnol, 25

(2007), pp. 197-206.

[39] M. J. Keiser, V. Setola, J. J. Irwin, C. Laggner, A. I. Abbas, S. J. Hufeisen, N. H.

Jensen, M. B. Kuijer, R. C. Matos, T. B. Tran, R. Whaley, R. A. Glennon, J. Hert,

K. L. Thomas, D. D. Edwards, B. K. Shoichet and B. L. Roth, Predicting new

molecular targets for known drugs, Nature, 462 (2009), pp. 175-81.

[40] M. Kuhn, M. Campillos, P. Gonzalez, L. J. Jensen and P. Bork, Large-scale

prediction of drug-target relationships, FEBS Lett, 582 (2008), pp. 1283-90.

[41] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen and P. Bork, A side effect resource

to capture phenotypic effects of drugs, Mol Syst Biol, 6 (2010), pp. 343.

[42] Z. Kutalik, J. S. Beckmann and S. Bergmann, A modular approach for integrative

analysis of large-scale gene-expression and drug-response data, Nat Biotechnol,

26 (2008), pp. 531-9.

[43] J. Lamb, The Connectivity Map: a new tool for biomedical research, Nat Rev

Cancer, 7 (2007), pp. 54-60.

[44] J. Lamb, E. D. Crawford, D. Peck, J. W. Modell, I. C. Blat, M. J. Wrobel, J. Lerner,

J. P. Brunet, A. Subramanian, K. N. Ross, M. Reich, H. Hieronymus, G. Wei, S. A.

Armstrong, S. J. Haggarty, P. A. Clemons, R. Wei, S. A. Carr, E. S. Lander and T.

R. Golub, The Connectivity Map: using gene-expression signatures to connect

small molecules, genes, and disease, Science, 313 (2006), pp. 1929-35.

[45] M. A. Lindsay, Target discovery, Nature Reviews Drug Discovery, 2 (2003), pp.

831-838.

[46] Y. Liu, B. Hu, C. Fu and X. Chen, DCDB: drug combination database,

Bioinformatics, 26 (2010), pp. 587-8.

[47] K. M. Mani, C. Lefebvre, K. Wang, W. K. Lim, K. Basso, R. Dalla-Favera and A.

Califano, A systems biology approach to prediction of oncogenes and molecular

perturbation targets in B-cell lymphomas, Mol Syst Biol, 4 (2008), pp. 169.

[48] Y. C. Martin, J. L. Kofron and L. M. Traphagen, Do structurally similar molecules

have similar biological activity?, J Med Chem, 45 (2002), pp. 4350-8.

[49] L. Matthews, G. Gopinath, M. Gillespie, M. Caudy, D. Croft, B. de Bono, P.

Garapati, J. Hemish, H. Hermjakob, B. Jassal, A. Kanapin, S. Lewis, S. Mahajan,

B. May, E. Schmidt, I. Vastrik, G. Wu, E. Birney, L. Stein and P. D'Eustachio,

56

Reactome knowledgebase of human biological pathways and processes, Nucleic

Acids Res, 37 (2009), pp. D619-22.

[50] J. B. Mitchell, The relationship between the sequence identities of alpha helical

proteins in the PDB and the molecular similarities of their ligands, J Chem Inf

Comput Sci, 41 (2001), pp. 1617-22.

[51] P. X. Mouratidis, K. W. Colston and A. G. Dalgleish, Doxycycline induces caspase-

dependent apoptosis in human pancreatic cancer cells, Int J Cancer, 120 (2007),

pp. 743-52.

[52] M. Olah, M. Mracec, L. Ostopovici, R. Rad, A. Bora, N. Hadaruga, I. Olah, M.

Banda, Z. Simon, M. Mracec and T. I. Oprea, WOMBAT: World of Molecular

Bioactivity, in I. O. Prof. Dr. Tudor, ed., Chemoinformatics in Drug Discovery,

2005, pp. 221-239.

[53] K. Ovaska, M. Laakso and S. Hautaniemi, Fast Gene Ontology based clustering for

microarray experiments, BioData Min, 1 (2008), pp. 11.

[54] L. Perlman, A. Gottlieb, N. Atias, E. Ruppin and R. Sharan, Combining drug and

gene similarity measures for drug-target elucidation, Journal of Computational

Biology, Special issue of RECOMB Systems Biology 2010, in press.

[55] G. J. Peters, C. L. van der Wilt, C. J. A. van Moorsel, J. R. Kroep, A. M. Bergmann

and S. P. Ackland, Basis for effective combination cancer chemotherapy with

antimetabolites, Pharmacology & Therapeutics, 87 (2000), pp. 227–253.

[56] E. Podjarny, S. Benchetrit, B. Katz, J. Green and J. Bernheim, Effect of methyldopa

on renal function in rats with L-NAME-induced hypertension in pregnancy,

Nephron, 88 (2001), pp. 354-9.

[57] C. Rennemeier, S. Hammerschmidt, S. Niemann, S. Inamura, U. Zahringer and B.

E. Kehrel, Thrombospondin-1 promotes cellular adherence of gram-positive

pathogens via recognition of peptidoglycan, FASEB J, 21 (2007), pp. 3118-32.

[58] P. Resnik, Semantic Similarity in a Taxonomy: An Information-Based Measure and

its Application to Problems of Ambiguity in Natural Language, J of Artificial

Intelligence Research, 11 (1999), pp. 95-130.

[59] I. M. Rollo, Dihydrofolate reductase inhibitors as antimicrobial agents and their

potentiation by sulfonamides, CRC Crit Rev Clin Lab Sci, 1 (1970), pp. 565-83.

[60] J. F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F.

Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon,

M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G.

Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex,

P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak,

R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth and M. Vidal,

57

Towards a proteome-scale map of the human protein-protein interaction network,

Nature, 437 (2005), pp. 1173-8.

[61] A. Schuffenhauer, P. Floersheim, P. Acklin and E. Jacoby, Similarity metrics for

ligands reflecting the similarity of the target proteins, J Chem Inf Comput Sci, 43

(2003), pp. 391-405.

[62] S. Sinan, F. Kockar and O. Arslan, Novel purification strategy for human PON1

and inhibition of the activity by cephalosporin and aminoglikozide derived

antibiotics, Biochimie, 88 (2006), pp. 565-74.

[63] A. Skrbo, B. Begovic and S. Skrbo, [Classification of drugs using the ATC system

(Anatomic, Therapeutic, Chemical Classification) and the latest changes], Med

Arh, 58 (2004), pp. 138-41.

[64] T. F. Smith, M. S. Waterman and C. Burks, The statistical distribution of nucleic

acid similarities, Nucleic Acids Res, 13 (1985), pp. 645-56.

[65] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann and E. Willighagen, The

Chemistry Development Kit (CDK): an open-source Java library for Chemo- and

Bioinformatics, J Chem Inf Comput Sci, 43 (2003), pp. 493-500.

[66] C. Steinbeck, C. Hoppe, S. Kuhn, M. Floris, R. Guha and E. L. Willighagen, Recent

developments of the chemistry development kit (CDK) - an open-source java library

for chemo- and bioinformatics, Curr Pharm Des, 12 (2006), pp. 2111-20.

[67] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler, M.

Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff, C.

Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Krobitsch,

B. Korn, W. Birchmeier, H. Lehrach and E. E. Wanker, A human protein-protein

interaction network: a resource for annotating the proteome, Cell, 122 (2005), pp.

957-68.

[68] K. J. Stone and J. L. Strominger, Mechanism of action of bacitracin: complexation

with metal ion and C 55 -isoprenyl pyrophosphate, Proc Natl Acad Sci U S A, 68

(1971), pp. 3223-7.

[69] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A.

Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander and J. P. Mesirov,

Gene set enrichment analysis: a knowledge-based approach for interpreting

genome-wide expression profiles, Proc Natl Acad Sci U S A, 102 (2005), pp.

15545-50.

[70] T. T. Tanimoto, IBM Internal Report 17th Nov 1957.

[71] A. E. Thomson, E. J. Squires and P. A. Gentry, Assessment of factor V, VII and X

activities, the key coagulant proteins of the tissue factor pathway in poultry plasma,

Br Poult Sci, 43 (2002), pp. 313-21.

58

[72] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang and S. H. Bryant, PubChem: a

public information system for analyzing bioactivities of small molecules, Nucleic

Acids Research, 37 (2009), pp. W623–W633.

[73] D. S. Wishart, C. Knox, A. C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam

and M. Hassanali, DrugBank: a knowledgebase for drugs, drug actions and drug

targets, Nucleic Acids Res, 36 (2008), pp. D901-6.

[74] D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z.

Chang and J. Woolsey, DrugBank: a comprehensive resource for in silico drug

discovery and exploration, Nucleic Acids Res, 34 (2006), pp. D668-72.

[75] I. Xenarios, L. Salwinski, X. J. Duan, P. Higney, S. M. Kim and D. Eisenberg, DIP,

the Database of Interacting Proteins: a research tool for studying cellular networks

of protein interactions, Nucleic Acids Res, 30 (2002), pp. 303-5.

[76] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda and M. Kanehisa, Prediction of

drug-target interaction networks from the integration of chemical and genomic

spaces, Bioinformatics, 24 (2008), pp. i232-40.

[77] M. Yildrim, K. Goh, M. Cusick, A. Barabasi and M. Vidal, Drug-target network,

Nature Biotechnology, 25 (2007), pp. 1119-1126.

[78] F. Zhu, B. Han, P. Kumar, X. Liu, X. Ma, X. Wei, L. Huang, Y. Guo, L. Han, C.

Zheng and Y. Chen, Update of TTD: Therapeutic Target Database, Nucleic Acids

Res, 38 (2010), pp. D787-91.

[79] S. Zhu, Y. Okuno, G. Tsujimoto and H. Mamitsuka, A probabilistic model for

mining implicit 'chemical compound-gene' relations from literature,

Bioinformatics, 21 Suppl 2 (2005), pp. ii245-51.

[80] S. Zhu, I. G. Stavrovskaya, M. Drozda, B. Y. Kim, V. Ona, M. Li, S. Sarang, A. S.

Liu, D. M. Hartley, D. C. Wu, S. Gullans, R. J. Ferrante, S. Przedborski, B. S.

Kristal and R. M. Friedlander, Minocycline inhibits cytochrome c release and

delays progression of amyotrophic lateral sclerosis in mice, Nature, 417 (2002), pp.

74-8.

ירתקצ

דרך יעילה . הרפואי פיתוחקידום הב משמעותי אתגר מהווה פעולתן ואופן תרופות של הבנה

מרכזי צעד הינו . גילוי המטרותתרופותלהתמודד עם אתגר זה היא זיהוי חלבוני המטרה של

על מנת לשפר את הדיוק תרופות קיימות.ל מטרות חדשותקישור בפיתוח תרופות חדשות וב

, זו בעבודה מגוון רחב ככל האפשר של מקורות מידע ביולוגים. אגדבחיזוי המטרות יש צורך ל

תרופה-תרופה של מדדי דמיוןמגוון המשלבת SITAR חדשנית הנקראת מציגים שיטה אנו

חדשה ציון הערכת על מושתתת השיטה .תרופות לש מטרה חלבוני חיזוי לשם גן-גןו

תרופה-תרופה של דמיון מדדינתון של זוג סמך על המחושבת לגנים תרופות בין לאסוציאציות

זוגות המדדים כדי מבחרשל סטית מאגד את הציונים יכאשר אלמנט של רגרסיה לוג גן-וגן

למאות מטרותרחב היקף של לחיזוי בשיטה השתמשנו ציון סופי לאסוציאציה הנדונה.להניב

-של תרופה ידינו, על לראשונה פותחוש, חדשים וכאלו קיימים דמיון מדדי באמצעות תרופות

והראינו את ,קיימות שיטות של לאלו תוצאות השיטה את והשווינ כמו כןו גן-תרופה וגן

, תרופות למאות חדשות מטרות חיזוישם ל שיטהה הפעלנו את מכן, לאחר .יתרונה עליהן

בשלב הלימוד. שימוש בהם נעשה שלא מידע מאגרי באמצעות הללו התחזיותמ חלק ואימתנו

תרופות בקרב נוספיםחדשים דמיון מדדי הכללת י"ע קלות ביתר להרחבה ניתנת שלנו השיטה

כתב העתב לפרסום התקבלו זו בעבודה המוצגות מהתוצאות חלק. וגנים

Journal of Computational Biology [45].

אביב-תל אוניברסיטת

סאקלר ובברלי ריימונד שם על מדויקים למדעים הפקולטה

בלבטניק שם על המחשב למדעי הספר-בית

וגנים תרופות של דמיון מדדי של אינטגרציה

תרופותשל מטרה חלבוני גילוי לצורך

תואר לקבלת מהדרישות כחלק הוגש זה חיבור

אביב-תל באוניברסיטת M.Sc. - "אוניברסיטה מוסמך"

המחשב למדעי ס"ביה

ידי על

פרלמן ליאת

שרן רודד פרופסור של בהדרכתו הוכנה העבודה

ע"אתש תשרי