comp150-final-project-proteins homology

Tufts University

Protein Homology Detection Using a Bag of Words Classifier

Exploring a Word-within-Documents Classifier for Protein Sequences within Families or Folds

Zina Saadi
12/9/2010
Final Project, Comp 150, taught by Dr. Lenore Cowen

Proteins Homology Detection


Abstract: The continuing growth of protein sequences in biological databases has encouraged researchers to explore new methods for predicting the function and structure of proteins. Homology studies show that proteins belonging to the same family or superfamily share similar functional or structural properties. The first part of this study uses the benchmark SCOP dataset to train and test a Bag of Words (BOW) classifier, built on the Naive Bayesian classifier, for protein homology detection. The second part calculates, for each class (family or fold), the likelihood of each sub-sequence in the training files in order to find the most distinct sub-sequence for that class. Results from the second part suggest that the most distinct sub-sequence of a family or fold can be used to identify the homology of unknown proteins. These results can give biologists and other researchers insight into the benefit of exploring simpler methods. Data and code are available at http://www.eecs.tufts.edu/~zsaadi01/proteins_homology/ (see footnote 1).

Keywords: Homology detection, Bag of Words, protein classification, Naive Bayesian classifier

1 I renamed all Perl files to .txt because, no matter how I changed their permissions, their content remained hidden.



Contents

1 Introduction
1.1 Predicting Homology among Proteins
1.2 Motivation and Overview
2 Major Design Specifications
2.1 Design Overview
2.2 Data Processing
2.2.1 Data Specification
2.2.2 Methods
2.2.3 Results and Measurements
3 Future Experiments
4 Appendix A
5 Appendix B
6 Appendix C
7 Appendix D
8 Appendix E



1 Introduction

Advances in automated sequencing tools have steadily increased the amount of DNA and protein sequence data in public biological databases [4][6]. This growth often creates a gap between the amount of sequence information available in these databases and what is known about the structure and function of the proteins themselves, because of a lack of promising automated tools that can predict protein function and structure [6]. Over the past decade, researchers have explored various ways to predict protein function using statistical and machine learning approaches, which are less costly than time-consuming lab experiments [5].

1.1 Predicting Homology among Proteins

Traditional approaches to identifying protein homology range from linear sequence-based comparison to network comparison. Finding sufficient similarity between protein sequences relies on sequence alignment, which is a way of arranging protein sequences to identify regions of similarity between them. Global alignments attempt to align every amino acid in two sequences and are generally useful for similar sequences of close length. Local alignments, on the other hand, attempt to find regions of local similarity between sequences and are generally useful for less similar sequences. There can be many ways to align two sequences. What matters in this approach is the scoring function, which judges the degree of similarity between two protein sequences and thereby determines the best alignment. In addition to sequence-based comparisons, network comparisons across species have also been used to identify proteins with similar functions and to detect homology [1].

Most of these approaches rely on comparing amino acid sequences to look for similarities, which is the only logical inference for building evolutionary relationship trees. However, when these comparisons are strictly followed to produce an evolutionary tree, many embarrassing results are obtained, such as showing that the turtle is more closely related to the birds than to the snake, or that the chicken is grouped with the penguin rather than the duck [2].

This study proposes predicting the family of protein sequences across organisms through pattern recognition, using the “bag of words” conditional Bayesian classifier.

1.2 Motivation and Overview

The bag of words (BOW) conditional Bayesian classifier has gained great importance in Natural Language Processing (NLP) in domains such as spam filtering, document classification and word-sense disambiguation. I was first introduced to the BOW classifier in comp134 with Dr. Carla Brodley, where the professor asked us to build this classifier for classifying U.S. presidential speeches. Classifying speeches by the same president over a period of time is analogous to classifying protein sequences across organisms: a model is trained to assign protein sequences to their annotated family. The U.S. presidential speech classifier was adjusted to:

Treat each protein sequence within the same family as an instance.

Measure the rate of false positives and specificity, instead of precision and accuracy.

2 Major Design Specifications

Four elements were critical to the accomplishment of this study:

1. Finding appropriate annotated data to be used for training and testing.

2. Adapting the previously used code for protein homology classification/detection.

3. Designing the code so that its measurements can be easily compared with those of other experiments in the same domain.

4. Finding results of other machine learning classifiers to compare BOW results with.

2.1 Design Overview

The implementation was written in Perl v5.10.1 and divided into two .pl files, namely <homology.bagOfWordsTest.pl> and <homology.bagOfWordsTrain.pl>. To run the code, one should specify the root directory of the training and testing data extracted from <fisher-scop-data.tar.gz> (see footnote 2; last modified on 23-Dec-2002 00:16), together with the maximum word length (sequence sub-segment length) to be used to split a given protein sequence.

2.2 Data Processing

The BOW classifier has mostly been applied to NLP problems, so it is crucial to note the difference between handling human-language text and handling protein sequences. In human-language text, English for instance, word boundaries are easily found by splitting the text on white space and punctuation. Protein sequences lack word/segment boundaries, which makes it challenging to decide which segments the BOW classifier should consider. To overcome this challenge, the code takes a command-line argument giving the maximum desired protein word/segment length and splits each sequence into words of exactly the given length (the rightmost word holds the remainder and may therefore be shorter than the rest of the words). The code then runs the model in a loop, segmenting the training and testing protein sequences for every word length from 1 to MaxSpecifiedLength.
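The segmentation step can be illustrated with a minimal Python sketch. It is not the author's Perl code: the function names `split_into_words` and `segmentations` are hypothetical, and the sketch assumes the words are consecutive and non-overlapping, with any remainder left in the rightmost word.

```python
def split_into_words(sequence, word_length):
    """Split a protein sequence into consecutive, non-overlapping words.

    Every word has exactly `word_length` residues except possibly the last,
    which holds whatever remainder is left at the right end of the sequence.
    """
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return [sequence[i:i + word_length]
            for i in range(0, len(sequence), word_length)]


def segmentations(sequence, max_word_length):
    """Return {word_length: words} for every length from 1 to max_word_length."""
    return {k: split_into_words(sequence, k)
            for k in range(1, max_word_length + 1)}
```

For example, `split_into_words("MTEYKLVVVG", 4)` yields `["MTEY", "KLVV", "VG"]`: two full words and a shorter remainder.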

2 Data source: http://compbio.soe.ucsc.edu/discriminative/



2.2.1 Data Specification

The data used in this study was obtained from the SCOP benchmark data set (http://compbio.soe.ucsc.edu/discriminative/). This version was chosen so that the results could be compared with those of the other machine learning classifiers described in [5]. After extracting the directories and files from <fisher-scop-data.tar.gz>, the data files look as follows:

Figure 1 A sample screenshot of the directory layout

There were 33 class labels (Families ID) to predict a given protein sequence’s membership.

There were 12 class labels (Folds ID) to predict a given protein sequence’s membership.

Within each familyID folder, files ending in pos-train.seq and pos-test.seq were taken into consideration as training and testing data respectively.

2.2.2 Methods

The code used for this study was inspired by the project 3 homework for Dr. Carla Brodley's comp135 (Introduction to Machine Learning and Data Mining). I received a grade of 100% for that project, which gives me confidence that the code's computations are reliable. For that project we had to build a classifier for presidential speeches and then identify each president's most distinct word using the likelihood method. Applying this analogy to protein homology, for each specified word/subsequence length, I treated each training protein sequence as an instance of the class (either fold or family) that it belongs to. For testing, I computed the predicted class of each testing protein sequence, counted the false positives, true positives and false negatives, and then calculated the rate of false positives (Experiment 1). Then, for each specified word/subsequence length, I made the code find the most distinct word/subsequence among all protein sequence instances of each particular class (family or fold) (Experiment 2).
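The per-class measurements can be sketched in Python as follows. This is an illustration under stated assumptions, not the counting logic of the Perl scripts: the function name `per_class_metrics` is hypothetical, each class is scored one-versus-rest, and the standard definitions RFP = FP / (FP + TN), specificity = 1 - RFP and precision = TP / (TP + FP) are assumed. The relation between specificity and RFP agrees with the result tables, where the two columns sum to 1.

```python
def per_class_metrics(true_labels, predicted_labels, target_class):
    """One-versus-rest precision, rate of false positives and specificity."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == target_class and truth == target_class:
            tp += 1          # correctly assigned to the target class
        elif guess == target_class:
            fp += 1          # assigned to the target class by mistake
        elif truth == target_class:
            fn += 1          # member of the target class that was missed
        else:
            tn += 1          # correctly kept out of the target class

    def ratio(numerator, denominator):
        # None marks an undefined ratio (zero denominator).
        return numerator / denominator if denominator else None

    rfp = ratio(fp, fp + tn)
    return {
        "precision": ratio(tp, tp + fp),
        "rfp": rfp,
        "specificity": None if rfp is None else 1 - rfp,
    }
```

An undefined ratio is returned as `None`; whether the NA entries in the result tables arise in the same way is not stated in the text.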

2.2.2.1 Experiment 1

In the first part of this study, each protein sequence of a given class (family or fold) was split into words of a user-specified length and labeled with the class it belongs to. To predict the class of a given protein sequence, the naïve Bayesian classifier was used:

Train:
1. For each class Cj (fold or family ID), estimate P(Cj).
2. For each word Wi, estimate P(Wi | Cj).

Classify (doc): Assign the protein sequence to the most probable class, assuming the words are conditionally independent given the class:

$$\hat{C} = \arg\max_{C_j} P(C_j) \prod_{i} P(W_i \mid C_j)$$

In general, when there is very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are too risky to use since they can give incorrect results. To reduce this risk, it has been suggested to work with the logarithms of the probabilities and to use Laplace's estimate, which adds 1 to the numerator and 2 to the denominator (in my code I added 1 and 3, respectively).
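A minimal Python sketch of a naive Bayes classifier with log probabilities and Laplace-style smoothing is given below. It is an assumption-laden illustration rather than the original Perl code: the names `train`, `log_posterior` and `classify` are hypothetical, P(Wi | Cj) is assumed to be estimated from word counts within class Cj, and the smoothing constants default to the values 1 and 3 mentioned in the text.

```python
import math
from collections import Counter, defaultdict


def train(labelled_word_lists):
    """Count classes and words from (class_label, list_of_words) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, words in labelled_word_lists:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts


def log_posterior(words, label, class_counts, word_counts,
                  add_num=1, add_den=3):
    """Unnormalised log P(label) + sum of smoothed log P(word | label)."""
    total_sequences = sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / total_sequences)
    for word in words:
        # Smoothing keeps unseen words from producing log(0).
        smoothed = (word_counts[label][word] + add_num) / (total_words + add_den)
        score += math.log(smoothed)
    return score


def classify(words, class_counts, word_counts):
    """Return the class with the highest log posterior score."""
    return max(class_counts,
               key=lambda c: log_posterior(words, c, class_counts, word_counts))
```

Summing logarithms instead of multiplying probabilities also avoids numerical underflow when a sequence contains many words.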

2.2.2.2 Experiment 2

The second part of this study calculated the likelihood of a word given a class (family or fold) in the training protein sequences, for each class and for each user-specified word length. More precisely, the quantity of interest is the ratio P(word | class) / P(word | other classes). However, some words appear in the sequences of certain classes but not in others, so the denominator can be zero. To overcome this problem, I took the log of the likelihood ratio, log P(word | class) - log P(word | other classes), and used the Laplace estimate for log P(word | other classes). I then sorted the results in decreasing order and took the word with the highest likelihood ratio for each class.
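The ranking described above can be sketched in Python as follows. The function name `most_distinct_words` is hypothetical, and the sketch assumes that word probabilities are relative word counts, that all other classes are pooled into one count, and that the smoothing constants 1 and 3 are applied only to the "other classes" term, as the text describes. Every class is assumed to have at least one training word.

```python
import math
from collections import Counter


def most_distinct_words(words_by_class, add_num=1, add_den=3):
    """Map each class to its word with the highest log-likelihood ratio.

    words_by_class: {class_label: list of all training words of that class}.
    """
    counts = {label: Counter(words) for label, words in words_by_class.items()}
    result = {}
    for label, own in counts.items():
        own_total = sum(own.values())
        # Pool the word counts of every other class.
        others = Counter()
        for other_label, other_counts in counts.items():
            if other_label != label:
                others.update(other_counts)
        others_total = sum(others.values())

        def score(word):
            log_own = math.log(own[word] / own_total)
            # Smoothed so that words unseen elsewhere do not give log(0).
            log_others = math.log((others[word] + add_num)
                                  / (others_total + add_den))
            return log_own - log_others

        result[label] = max(own, key=score)
    return result
```

A word that is frequent in its own class and absent from all others receives the highest score, which matches the observation that the top fragments with RFP = 0 do not occur outside their own fold or family.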

2.2.3 Results and Measurements



Experiment part 1

[Figure: average rate of false positives (Ave RFP, vertical axis from 0 to 0.8) against word length (1 to 10), with one series for Family (Ave RFP) and one for Fold (Ave RFP).]

The graph plots the average rate of false positives for the family and fold classes. For subsequence lengths equal to 1 or greater than 3, the classifier appears to perform better at classifying protein sequences into their correct folds than into their correct families. Further research would be interesting to investigate why it is the other way around for subsequences of length 2 and 3.

Experiment part 2

The following two tables show the distinct fragments obtained from the training data for all sequences under the same fold (the first table) or under the same family (the second table). The first table lists the distinct fragments for which the rate of false positives is zero (RFP = 0), while the second table lists those with the 10 lowest RFP values. After some investigation, it turned out that, when RFP is zero, these distinct fragments appear in some protein sequences of the fold or family they were extracted from but in no protein sequence of any other fold or family. This finding suggests that the method could be used to classify an unknown protein sequence by locating the distinct fragment of a fold or family within it. I took a few distinct fragments, namely “TLGNSTITTQ”, “KELGTVMRSL”, “MTEYKLVVVG” and “VSSFFTY”, and searched online for anything specific about them and their corresponding classes, but I could not find anything interesting.

Word Length FoldID Spec Precision RFP Distinct-Fragment

10 2.8 1 1 0 TLGNSTITTQ

10 1.34 1 1 0 KELGTVMRSL

10 3.25 1 1 0 MTEYKLVVVG

10 2.5 1 1 0 HQWYWSYEYS

2 1.34 1 1 0 AZ

4 2.8 1 1 0 QIMY

4 3.33 1 1 0 APWC

4 1.34 1 1 0 QDMI

4 3.25 1 1 0 QNHF

4 3.19 1 1 0 WKLD

5 1.25 1 1 0 SFYFK

5 2.8 1 1 0 CGYSD

5 3.33 1 1 0 CGHCK

5 1.34 1 1 0 DKDGD

5 2.5 1 1 0 HQWYW

6 3.73 1 1 0 TANLAA

6 1.25 1 1 0 WEVVRA

6 3.33 1 1 0 WCGHCK

6 1.34 1 1 0 LFDKDG

6 3.19 1 1 0 GAGILD

6 3.1 1 1 0 EPFVTL

6 2.5 1 1 0 HQWYWS

7 1.25 1 1 0 AWEVVRA

7 2.8 1 1 0 VENYGGE

7 3.33 1 1 0 PWCGHCK

7 1.34 1 1 0 EAFSLFD

7 3.25 1 1 0 IEDSYRK

7 2.5 1 1 0 IGHQWYW

8 1.25 1 1 0 CAWEVVRA

8 3.33 1 1 0 APWCGHCK

8 1.34 1 1 0 LGTVMRSL

8 3.25 1 1 0 DPTIEDSY

8 2.5 1 1 0 WYWSYEYS

9 3.33 1 1 0 EFYAPWCGH

9 1.34 1 1 0 ITTKELGTV

9 2.5 1 1 0 LRLLYLLDE

Word Length Family-ID Spec Precision RFP Distinct-Fragment

2 1.34.1.5 1 1 0 AZ

7 2.19.1.1 1 1 0 VSSFFTY

1 2.5.1.1 0.972727273 0 0.027272727 X



1 3.1.1.5 0.957098284 0.166666667 0.042901716 x

1 1.25.1.1 0.953125 0 0.046875 X

3 2.1.1.4 0.952238806 0 0.047761194 TPZ

2 2.1.1.4 0.921414538 0 0.078585462 KB

7 2.1.1.4 0.918533605 0 0.081466395 SGVAGTH

10 2.1.1.4 0.917610711 0 0.082389289 NSGDAIYDAD

6 2.1.1.4 0.916753382 0 0.083246618 RPGQQP

Note that I had planned to compare my results with those discussed in reference [5]. However, I could not find a table to convert the SCOP family IDs used in version 1.37 to the names used in reference [5]. I spent a long time looking for a conversion table but found nothing that went back to SCOP 1.37; making this comparison had been my objective before running the code.

3 Future Experiments

Many ideas for this project remain to be implemented and would be interesting to explore in the future, such as:

1. Building a discriminative model that takes into account both the positive and the negative training and testing data.

2. Improving the code so that the accuracy and sensitivity measurement can be calculated. In the current design, the False Negative values cannot be calculated.

3. Exploring statistical segmentation approaches developed for human languages that lack spaces, such as Chinese, for more accurate segmentation of protein sequences, which also lack word boundaries.

4. Creating separate classes for those proteins that belong to more than one class.



4 Appendix A

Table of the average RFP (Ave RFP) for each word length when detecting family IDs or fold IDs

Word Length Family (Ave RFP) Fold (Ave RFP)

1 0.731403473 0.645392247

2 0.594537429 0.59567166

3 0.568716552 0.676720305

4 0.600529232 0.362620053

5 0.617939183 0.469614604

6 0.604334648 0.320435088

7 0.602217334 0.449013135

8 0.629653251 0.529603441

9 0.591933171 0.557651688

10 0.611017176 0.554913204
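As a quick check of the comparison drawn in Section 2.2.3, the short Python sketch below takes the Ave RFP values from the table above and reports, for each word length, which level has the lower average rate of false positives. The names `AVE_RFP` and `better_level` are hypothetical and only serve this illustration.

```python
# Word length -> (family Ave RFP, fold Ave RFP), copied from the table above.
AVE_RFP = {
    1: (0.731403473, 0.645392247),
    2: (0.594537429, 0.59567166),
    3: (0.568716552, 0.676720305),
    4: (0.600529232, 0.362620053),
    5: (0.617939183, 0.469614604),
    6: (0.604334648, 0.320435088),
    7: (0.602217334, 0.449013135),
    8: (0.629653251, 0.529603441),
    9: (0.591933171, 0.557651688),
    10: (0.611017176, 0.554913204),
}


def better_level(family_rfp, fold_rfp):
    """Return the level with the lower (better) average RFP."""
    return "fold" if fold_rfp < family_rfp else "family"


summary = {length: better_level(*values) for length, values in AVE_RFP.items()}
family_wins = [length for length, level in summary.items() if level == "family"]
```

With these values, `family_wins` contains only the word lengths 2 and 3, in line with the observation made from the graph.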

All results for word lengths from 1 to 10 for detecting Family IDs

Word Length Family-ID Specificity Precision RFP Distinct-Fragment

1 2.5.1.1 0.972727273 0 0.027272727 X

1 3.1.1.5 0.957098284 0.166666667 0.042901716 x

1 1.25.1.1 0.953125 0 0.046875 X

1 1.34.1.4 0.868020305 0 0.131979695 Z

1 1.34.1.5 0.865168539 0 0.134831461 Z

1 3.1.1.3 0.768149883 0 0.231850117 x

1 2.31.1.1 0.763157895 0 0.236842105 x

1 2.19.1.1 0.741803279 0 0.258196721 B

1 2.41.1.1 0.526315789 0 0.473684211 X

1 1.25.1.3 0.365079365 0 0.634920635 B

1 2.8.1.2 0.36 0 0.64 X

1 3.25.1.3 0.28 0 0.72 x

1 3.73.1.2 0.157894737 0 0.842105263 X

1 1.1.1.2 0.12605042 0 0.87394958 B

1 2.1.1.4 0.101123596 0 0.898876404 B

1 2.31.1.2 0.057971014 0 0.942028986 X

1 2.1.1.5 0 0 1 B

1 2.1.1.3 0 0 1 B

1 3.33.1.5 0 0 1 x

1 2.5.1.3 0 0 1 Z

1 3.19.1.5 0 0 1 x

1 2.1.1.2 0 0 1 B

1 2.1.1.1 0 0 1 Z



1 3.19.1.4 0 0 1 x

1 3.1.1.1 0 0 1 x

1 3.33.1.1 0 0 1 x

1 3.19.1.1 0 0 1 x

1 2.34.1.1 0 0 1 x

1 3.19.1.3 0 0 1 X

1 3.25.1.1 0 0 1 x

1 1.25.1.2 0 0 1 B

1 3.50.1.7 0 0 1 x

1 2.8.1.4 0 0 1 X

2 1.34.1.5 1 1 0 AZ

2 2.1.1.4 0.921414538 0 0.078585462 KB

2 2.31.1.2 0.834645669 0.192307692 0.165354331 MX

2 3.73.1.2 0.805668016 0 0.194331984 XW

2 2.41.1.1 0.78313253 0 0.21686747 XQ

2 2.34.1.1 0.753846154 0 0.246153846 Lx

2 1.1.1.2 0.715068493 0 0.284931507 ZH

2 3.19.1.3 0.692307692 0 0.307692308 CX

2 1.25.1.3 0.68 0 0.32 BP

2 3.19.1.4 0.642857143 0 0.357142857 Sx

2 2.1.1.1 0.633986928 0 0.366013072 PB

2 2.31.1.1 0.581395349 0 0.418604651 xI

2 3.25.1.1 0.566929134 0 0.433070866 xT

2 1.25.1.2 0.5625 0 0.4375 BP

2 3.25.1.3 0.552795031 0 0.447204969 xM

2 1.25.1.1 0.543478261 0 0.456521739 XE

2 1.34.1.4 0.434782609 0 0.565217391 AZ

2 3.19.1.1 0.416666667 0 0.583333333 Sx

2 3.19.1.5 0.405940594 0 0.594059406 Sx

2 2.8.1.4 0.292543021 0 0.707456979 XQ

2 3.33.1.5 0.198019802 0 0.801980198 xF

2 2.8.1.2 0.189873418 0 0.810126582 XC

2 2.5.1.3 0.172413793 0 0.827586207 BR

2 2.1.1.5 0 0 1 KB

2 2.1.1.3 0 0 1 KB

2 2.1.1.2 0 0 1 KB

2 3.1.1.1 0 0 1 Tx

2 2.5.1.1 0 0 1 XH

2 3.1.1.3 0 0 1 Ax

2 3.33.1.1 0 0 1 xF

2 3.1.1.5 0 0 1 Ax



2 3.50.1.7 0 0 1 xS

2 2.19.1.1 0 0 1 BM

3 2.1.1.4 0.952238806 0 0.047761194 TPZ

3 1.1.1.2 0.896825397 0.875 0.103174603 YH

3 2.41.1.1 0.892086331 0.166666667 0.107913669 XQT

3 2.5.1.1 0.871428571 0 0.128571429 XPM

3 3.33.1.1 0.833333333 0 0.166666667 MGx

3 1.34.1.5 0.74742268 0.416666667 0.25257732 HM

3 2.8.1.2 0.737704918 0 0.262295082 TXH

3 3.1.1.3 0.725 0 0.275 KAx

3 2.31.1.1 0.643564356 0 0.356435644 HCW

3 1.25.1.1 0.64 0 0.36 CCF

3 1.25.1.3 0.611650485 0 0.388349515 WNB

3 3.19.1.3 0.6 0 0.4 XDL

3 2.31.1.2 0.580645161 0 0.419354839 MWC

3 3.25.1.1 0.566929134 0 0.433070866 xTG

3 3.19.1.4 0.545454545 0 0.454545455 XDL

3 2.34.1.1 0.505154639 0 0.494845361 LxS

3 2.1.1.1 0.504424779 0 0.495575221 WCH

3 3.25.1.3 0.433070866 0 0.566929134 LTx

3 1.34.1.4 0.385826772 0 0.614173228 XKF

3 3.19.1.1 0.363636364 0 0.636363636 XDL

3 3.1.1.1 0.336734694 0 0.663265306 xDD

3 3.33.1.5 0.27027027 0 0.72972973 PNX

3 2.8.1.4 0.267326733 0 0.732673267 CCY

3 2.5.1.3 0.223880597 0.133333333 0.776119403 ZKG

3 3.19.1.5 0.097744361 0 0.902255639 XDL

3 2.1.1.5 0 0 1 TPZ

3 2.1.1.3 0 0 1 TPZ

3 2.1.1.2 0 0 1 TPZ

3 3.73.1.2 0 0 1 WWF

3 3.1.1.5 0 0 1 KAx

3 1.25.1.2 0 0 1 WNB

3 3.50.1.7 0 0 1 IxS

3 2.19.1.1 0 0 1 FXP

4 2.1.1.4 0.913606911 0 0.086393089 YQMY

4 2.5.1.1 0.904255319 0 0.095744681 YVWA

4 3.33.1.1 0.872727273 0 0.127272727 HLGR

4 2.41.1.1 0.794871795 0.333333333 0.205128205 MPNF

4 2.8.1.2 0.742971888 0 0.257028112 QIMY

4 3.1.1.3 0.725 0 0.275 YVWI



4 2.1.1.1 0.678653405 0 0.321346595 WCGK

4 2.31.1.1 0.657142857 0 0.342857143 ICLP

4 1.25.1.3 0.63963964 0 0.36036036 QLCH

4 3.19.1.3 0.636363636 0 0.363636364 GMGT

4 3.25.1.1 0.620689655 0 0.379310345 QNHF

4 1.25.1.1 0.60625 0 0.39375 LNF

4 1.34.1.5 0.588235294 0 0.411764706 EVD

4 2.31.1.2 0.573770492 0 0.426229508 HCGM

4 1.34.1.4 0.518518519 0 0.481481481 QNRD

4 3.19.1.5 0.444444444 0 0.555555556 GMGT

4 3.25.1.3 0.433070866 0 0.566929134 DMFR

4 1.1.1.2 0.422222222 0 0.577777778 HLDN

4 3.19.1.1 0.416666667 0 0.583333333 GMGT

4 2.8.1.4 0.336917563 0 0.663082437 DGCP

4 3.1.1.1 0.336734694 0 0.663265306 WDDP

4 2.5.1.3 0.172413793 0 0.827586207 KCTP

4 3.33.1.5 0.147368421 0 0.852631579 DCQD

4 2.1.1.5 0 0 1 LYYR

4 2.1.1.3 0 0 1 LYYR

4 2.1.1.2 0 0 1 LYYR

4 3.19.1.4 0 0 1 GMGT

4 2.34.1.1 0 0 1 WILG

4 3.73.1.2 0 0 1 WWFF

4 3.1.1.5 0 0 1 APNH

4 1.25.1.2 0 0 1 CAWE

4 3.50.1.7 0 0 1 HMVP

4 2.19.1.1 0 0 1 HFNP

5 2.1.1.4 0.912758997 0 0.087241003 MLWYR

5 2.5.1.1 0.901639344 0 0.098360656 NLIEA

5 3.33.1.1 0.86407767 0 0.13592233 QSWKE

5 2.8.1.2 0.771019678 0 0.228980322 CGYSD

5 3.1.1.3 0.725 0 0.275 WGQNG

5 1.25.1.3 0.718309859 0 0.281690141 LDTLQ

5 1.34.1.5 0.703180212 0 0.296819788 NEAP

5 2.1.1.1 0.680608365 0 0.319391635 IYVKQ

5 1.25.1.1 0.67357513 0 0.32642487 VLNF

5 2.31.1.1 0.643564356 0 0.356435644 SRPYM

5 2.41.1.1 0.621052632 0 0.378947368 VGFAT

5 3.19.1.5 0.583333333 0 0.416666667 CKNTK

5 3.25.1.1 0.566929134 0 0.433070866 IWDTA

5 3.19.1.3 0.555555556 0 0.444444444 CKNTK



5 1.34.1.4 0.518518519 0 0.481481481 GCINY

5 3.25.1.3 0.433070866 0 0.566929134 AGKGT

5 2.31.1.2 0.417040359 0 0.582959641 AGHCT

5 1.1.1.2 0.409090909 0 0.590909091 LHVDP

5 3.1.1.1 0.336734694 0 0.663265306 ALQRS

5 2.8.1.4 0.27734375 0 0.72265625 ATYGG

5 3.33.1.5 0.147368421 0 0.852631579 MVKQI

5 2.5.1.3 0.130434783 0 0.869565217 GMVGK

5 2.1.1.2 0.017800381 0 0.982199619 IFYIW

5 2.1.1.5 0 0 1 IFYIW

5 2.1.1.3 0 0 1 IFYIW

5 3.19.1.4 0 0 1 CKNTK

5 3.19.1.1 0 0 1 CKNTK

5 2.34.1.1 0 0 1 DTGTS

5 3.73.1.2 0 0 1 VREEV

5 3.1.1.5 0 0 1 PAIYY

5 1.25.1.2 0 0 1 CLKDR

5 3.50.1.7 0 0 1 GGPGC

5 2.19.1.1 0 0 1 LHFNP

6 2.1.1.4 0.916753382 0 0.083246618 RPGQQP

6 3.33.1.1 0.875 0 0.125 YFPVRG

6 2.5.1.1 0.869565217 0 0.130434783 IGGHGD

6 2.8.1.2 0.742971888 0 0.257028112 NASKFH

6 3.1.1.3 0.725 0 0.275 HWDDLA

6 1.25.1.3 0.69924812 0 0.30075188 TSAFQR

6 2.1.1.1 0.692645445 0 0.307354555 TDSLDL

6 2.31.1.1 0.643564356 0 0.356435644 LTAAHC

6 1.25.1.1 0.63372093 0 0.36627907 IVSFYF

6 3.19.1.5 0.630769231 0 0.369230769 FMLIPE

6 3.25.1.1 0.566929134 0 0.433070866 GVGKSA

6 3.19.1.3 0.555555556 0 0.444444444 FMLIPE

6 3.25.1.3 0.529411765 0 0.470588235 GIPQIS

6 1.34.1.5 0.527027027 0.166666667 0.472972973 EDLHDM

6 1.34.1.4 0.503184713 0 0.496815287 KVLGNP

6 3.50.1.7 0.419354839 0 0.580645161 NGGPGC

6 2.41.1.1 0.368421053 0 0.631578947 GVGFAT

6 2.31.1.2 0.356435644 0 0.643564356 IAGGEA

6 3.1.1.1 0.336734694 0 0.663265306 ARSVQA

6 2.8.1.4 0.310986965 0 0.689013035 KLIDLG

6 2.19.1.1 0.258823529 0 0.741176471 FHFNPR

6 2.1.1.3 0.21875 0 0.78125 TDSLDL



6 2.34.1.1 0.213114754 0 0.786885246 SNLWVP

6 3.1.1.5 0.185185185 0 0.814814815 STDVIY

6 3.33.1.5 0.147368421 0 0.852631579 VDFSAT

6 2.5.1.3 0.130434783 0 0.869565217 NNAGFP

6 2.1.1.5 0 0 1 TDSLDL

6 1.1.1.2 0 0 1 HVDPEN

6 2.1.1.2 0 0 1 KIDKTF

6 3.19.1.4 0 0 1 FMLIPE

6 3.19.1.1 0 0 1 FMLIPE

6 3.73.1.2 0 0.833333333 1 TANLAA

6 1.25.1.2 0 0 1 WEVVRA

7 2.19.1.1 1 1 0 VSSFFTY

7 2.1.1.4 0.918533605 0 0.081466395 SGVAGTH

7 2.5.1.1 0.869565217 0 0.130434783 AVGALTG

7 3.33.1.1 0.852631579 0 0.147368421 FPVRGRC

7 2.8.1.2 0.750972763 0 0.249027237 VENYGGE

7 1.25.1.3 0.72027972 0 0.27972028 NYGLLYC

7 3.1.1.3 0.716332378 0 0.283667622 IYWGQNG

7 2.1.1.1 0.691176471 0 0.308823529 LTIEKVT

7 2.31.1.1 0.643564356 0 0.356435644 VLTAAHC

7 3.25.1.3 0.638190955 0 0.361809045 APGAGKG

7 3.19.1.5 0.615384615 0 0.384615385 TPAEQFD

7 1.34.1.5 0.593023256 0.166666667 0.406976744 LGEKMKE

7 1.34.1.4 0.571428571 0 0.428571429 IDQNRDG

7 3.25.1.1 0.566929134 0 0.433070866 IEDSYRK

7 1.25.1.1 0.565517241 0 0.434482759 VSFYFKL

7 3.19.1.3 0.555555556 0 0.444444444 TPAEQFD

7 2.41.1.1 0.419354839 0 0.580645161 TFKNTEI

7 3.1.1.1 0.360655738 0 0.639344262 LGNSAG

7 2.31.1.2 0.356435644 0 0.643564356 LTAGHCT

7 2.8.1.4 0.257028112 0 0.742971888 DLGQLGI

7 1.25.1.2 0.16 0 0.84 AWEVVRA

7 3.33.1.5 0.147368421 0 0.852631579 VLIEFYA

7 2.5.1.3 0.130434783 0 0.869565217 FKNNAGF

7 2.1.1.2 0.026465028 0 0.973534972 SCDYKFC

7 2.1.1.5 0 0 1 SCDYKFC

7 2.1.1.3 0 0 1 SCDYKFC

7 1.1.1.2 0 0 1 PWTQRFF

7 3.19.1.4 0 0 1 TPAEQFD

7 3.19.1.1 0 0 1 TPAEQFD

7 2.34.1.1 0 0 1 SNLWVPS



7 3.73.1.2 0 0 1 TITLVRE

7 3.1.1.5 0 0 1 QALAFTL

7 3.50.1.7 0 0 1 LNGGPGC

8 2.1.1.4 0.912758997 0 0.087241003 VGYDETDK

8 2.5.1.1 0.898876404 0 0.101123596 HLIGGHGD

8 3.33.1.1 0.882352941 0 0.117647059 RLLLEYTD

8 2.8.1.2 0.742971888 0 0.257028112 APHRVLAT

8 3.1.1.3 0.725 0 0.275 HCNPAANT

8 2.1.1.1 0.698275862 0 0.301724138 TESKKPAF

8 1.34.1.4 0.656387665 0 0.343612335 QNGFISAA

8 2.19.1.1 0.647058824 0.333333333 0.352941176 WDEIDIEF

8 2.31.1.1 0.643564356 0 0.356435644 GKDSCQGD

8 1.25.1.1 0.63372093 0 0.36627907 IVSFYFKL

8 3.19.1.5 0.615384615 0 0.384615385 HRESTWSD

8 1.25.1.3 0.611650485 0 0.388349515 AFQRRAGG

8 1.34.1.5 0.588235294 0 0.411764706 CITTKELG

8 3.25.1.1 0.566929134 0 0.433070866 DPTIEDSY

8 3.19.1.3 0.538461538 0 0.461538462 HRESTWSD

8 3.25.1.3 0.433070866 0 0.566929134 PGAGKGTQ

8 2.31.1.2 0.356435644 0 0.643564356 NATARIGG

8 3.1.1.1 0.336734694 0 0.663265306 KAQKGVTA

8 2.8.1.4 0.257028112 0 0.742971888 YCNDSATV

8 3.33.1.5 0.256880734 0 0.743119266 FSGANKEK

8 2.5.1.3 0.130434783 0 0.869565217 AGFPHNVV

8 1.1.1.2 0.071428571 0 0.928571429 SELHCDKL

8 2.1.1.2 0.017800381 0 0.982199619 TESKKPAF

8 2.41.1.1 0 0 1 GVGFATRQ

8 2.1.1.5 0 0 1 TESKKPAF

8 2.1.1.3 0 0 1 TESKKPAF

8 3.19.1.4 0 0 1 HRESTWSD

8 3.19.1.1 0 0 1 HRESTWSD

8 2.34.1.1 0 0 1 GSSNLWVP

8 3.73.1.2 0 0 1 APLTITLV

8 3.1.1.5 0 0 1 ALAFTLTS

8 1.25.1.2 0 0 1 CAWEVVRA

8 3.50.1.7 0 0 1 LNGGPGCS

9 2.1.1.4 0.909502262 0 0.090497738 SASSQVNVA

9 3.33.1.1 0.902777778 0 0.097222222 LNEKFKLGL

9 2.5.1.1 0.869565217 0 0.130434783 NGAVGALTG

9 2.8.1.2 0.742971888 0 0.257028112 PVTTTVENY

9 3.1.1.3 0.716332378 0 0.283667622 RPLGDAVLD



9 3.19.1.1 0.695652174 0 0.304347826 GRQTRAARS

9 2.19.1.1 0.685897436 0.222222222 0.314102564 LGKDTTKVQ

9 2.1.1.1 0.680365297 0 0.319634703 KGYNGRLKV

9 2.41.1.1 0.643564356 0 0.356435644 STFKNTEIS

9 2.31.1.1 0.643564356 0 0.356435644 CQGDSGGPL

9 1.34.1.5 0.627659574 0.166666667 0.372340426 MIDQNRDGF

9 1.25.1.3 0.611650485 0 0.388349515 LQALAGISP

9 3.19.1.5 0.594594595 0 0.405405405 GRQTRAARS

9 3.25.1.1 0.566929134 0 0.433070866 LTIQLIQNH

9 1.25.1.1 0.565517241 0 0.434482759 QSQIVSFYF

9 2.31.1.2 0.530685921 0 0.469314079 VGFSVTRGA

9 3.25.1.3 0.503448276 0 0.496551724 QISTGDMLR

9 1.34.1.4 0.472972973 0 0.527027027 VFDKDQNGF

9 3.1.1.1 0.360655738 0 0.639344262 ANPNLGSPQ

9 1.1.1.2 0.341772152 0 0.658227848 LHCDKLHVD

9 2.8.1.4 0.318600368 0 0.681399632 GFQYDMADT

9 2.5.1.3 0.230769231 0 0.769230769 GMVGKVTVN

9 3.33.1.5 0.147368421 0 0.852631579 MIKPFFHSL

9 3.19.1.3 0.076923077 0 0.923076923 GRQTRAARS

9 2.1.1.2 0.026465028 0 0.973534972 IEGIKRSLS

9 2.1.1.5 0 0 1 IEGIKRSLS

9 2.1.1.3 0 0 1 IEGIKRSLS

9 3.19.1.4 0 0 1 GRQTRAARS

9 2.34.1.1 0 0 1 TGSSNLWVP

9 3.73.1.2 0 0 1 YIGVSVVLF

9 3.1.1.5 0 0 1 SRGVPAIYY

9 1.25.1.2 0 0 1 AFVLSLLMA

9 3.50.1.7 0 0 1 LNGGPGCSS

10 2.1.1.4 0.917610711 0 0.082389289 NSGDAIYDAD

10 2.5.1.1 0.869565217 0 0.130434783 YVGEQDFYVP

10 3.33.1.1 0.852631579 0 0.147368421 EYTDSSYEEK

10 2.19.1.1 0.808219178 0.555555556 0.191780822 EFLGKDTTKV

10 2.8.1.2 0.75 0 0.25 TLGNSTITTQ

10 3.1.1.3 0.716332378 0 0.283667622 ADYLWNNFLG

10 2.1.1.1 0.683972912 0 0.316027088 EVLVPPRIE

10 1.34.1.5 0.651452282 0 0.348547718 AFRVFDKDQN

10 2.41.1.1 0.650485437 0 0.349514563 FYIKTSTTVR

10 2.31.1.1 0.643564356 0 0.356435644 DSCQGDSGGP

10 1.25.1.3 0.611650485 0 0.388349515 QRRAGGVLVA

10 3.19.1.5 0.594594595 0 0.405405405 LRTVPLDVSK

10 3.25.1.1 0.566929134 0 0.433070866 MTEYKLVVVG



10 1.25.1.1 0.565517241 0 0.434482759 SPA

10 1.34.1.4 0.561797753 0 0.438202247 RVFDKDQNGF

10 3.19.1.3 0.555555556 0 0.444444444 LRTVPLDVSK

10 2.31.1.2 0.490196078 0 0.509803922 LFAGSTALGL

10 3.25.1.3 0.433070866 0 0.566929134 AGKGTQAQFI

10 3.1.1.1 0.360655738 0 0.639344262 TWKFFDGVDI

10 2.8.1.4 0.257028112 0 0.742971888 TLYFPQPTNT

10 3.33.1.5 0.147368421 0 0.852631579 KLVVVDFSAT

10 2.5.1.3 0.130434783 0 0.869565217 NNAGFPHNVV

10 2.1.1.2 0.017800381 0 0.982199619 EVLVPPRIE

10 2.1.1.5 0 0 1 EVLVPPRIE

10 2.1.1.3 0 0 1 EVLVPPRIE

10 1.1.1.2 0 0 1 LHCDKLHVDP

10 3.19.1.4 0 0 1 LRTVPLDVSK

10 3.19.1.1 0 0 1 LRTVPLDVSK

10 2.34.1.1 0 0 1 TGSSNLWVPS

10 3.73.1.2 0 0.333333333 1 LTITLVREEV

10 3.1.1.5 0 0 1 TSRGVPAIYY

10 1.25.1.2 0 0 1 ANAVLRAQHL

10 3.50.1.7 0 0 1 VLWLNGGPGC

All results for word lengths from 1 to 10 for detecting Fold IDs

Word Length FoldID Specificity Precision RFP Distinct-Fragment

1 1.34 0.887108 0 0.112892 Z

1 3.1 0.872453 0.061111 0.127547 x

1 1.25 0.839246 0 0.160754 B

1 2.19 0.827869 0 0.172131 B

1 2.41 0.576471 0 0.423529 X

1 2.31 0.552106 0 0.447894 x

1 3.73 0.448276 0 0.551724 X

1 1.1 0.174603 0 0.825397 B

1 2.8 0.145798 0 0.854202 X

1 2.34 0.142857 0 0.857143 x

1 3.25 0.105634 0 0.894366 x

1 2.5 0.098039 0 0.901961 Z

1 2.1 0.003264 0 0.996736 Z

1 3.5 0 0 1 x

1 3.33 0 0 1 x

1 3.19 0 0 1 x

2 1.34 1 1 0 AZ



2 3.73 0.907193 0.166667 0.092807 XW

2 1.25 0.896104 0.944828 0.103896 QZ

2 3.19 0.88935 0.74359 0.11065 Sx

2 2.41 0.837838 0 0.162162 XQ

2 2.34 0.823529 0 0.176471 Lx

2 1.1 0.800384 0 0.199616 ZH

2 3.25 0.098765 0.425197 0.901235 xM

2 2.1 0.090387 0.769287 0.909613 ZP

2 3.33 0.068966 0.147368 0.931034 xF

2 2.5 0.056738 0.036232 0.943262 BR

2 2.19 0 0 1 BM

2 3.5 0 0 1 xS

2 2.8 0 0.060241 1 XC

2 2.31 0 0.618812 1 xI

2 3.1 0 0 1 Ax

3 1.1 0.960123 0.875 0.039877 YH

3 2.41 0.918478 0.166667 0.081522 XQT

3 1.34 0.884956 0.919753 0.115044 GBG

3 2.34 0.675676 0 0.324324 LxS

3 2.5 0.612903 0.826087 0.387097 XPM

3 3.19 0.381974 0.538462 0.618026 xNE

3 3.33 0.330579 0.147368 0.669421 PNX

3 2.1 0.084507 0.834425 0.915493 XSG

3 2.19 0 0 1 FXP

3 3.73 0 0 1 WWF

3 3.5 0 0 1 IxS

3 2.8 0 0.947791 1 TXH

3 2.31 0 0.866337 1 HCW

3 3.25 0 0.96063 1 DxM

3 3.1 0 0.969444 1 KAx

3 1.25 NA 1 NA WNB

4 2.8 1 1 0 QIMY

4 3.33 1 1 0 APWC

4 1.34 1 1 0 QDMI

4 3.25 1 1 0 QNHF

4 3.19 1 1 0 WKLD

4 2.5 0.929204 0.942029 0.070796 WYWS

4 2.41 0.873016 0.333333 0.126984 MPNF

4 2.31 0.780488 0.955446 0.219512 ICLP

4 1.1 0.643836 0 0.356164 HLDN

4 2.34 0.52 0 0.48 WILG



4 2.1 0.176776 0.827147 0.823224 VYYC

4 2.19 0 0 1 HFNP

4 3.73 0 0 1 WWFF

4 3.5 0 0 1 HMVP

4 1.25 NA 1 NA CAWE

4 3.1 NA 1 NA IYIT

5 1.25 1 1 0 SFYFK

5 2.8 1 1 0 CGYSD

5 3.33 1 1 0 CGHCK

5 1.34 1 1 0 DKDGD

5 2.5 1 1 0 HQWYW

5 2.41 0.84 0 0.16 VGFAT

5 3.19 0.448276 0.948718 0.551724 FVKAI

5 1.1 0.409091 0 0.590909 LHVDP

5 2.19 0.112676 0 0.887324 LHFNP

5 2.1 0.084967 0.847162 0.915033 RFSGS

5 2.34 0 0 1 DTGTS

5 3.73 0 0 1 VREEV

5 3.5 0 0 1 GGPGC

5 2.31 NA 1 NA DSGGP

5 3.25 NA 1 NA IWDTA

5 3.1 NA 1 NA HWDLP

6 3.73 1 1 0 TANLAA

6 1.25 1 1 0 WEVVRA

6 3.33 1 1 0 WCGHCK

6 1.34 1 1 0 LFDKDG

6 3.19 1 1 0 GAGILD

6 3.1 1 1 0 EPFVTL

6 2.5 1 1 0 HQWYWS

6 2.41 0.817259 0 0.182741 GVGFAT

6 3.5 0.419355 0 0.580645 NGGPGC

6 2.1 0.384615 0.935953 0.615385 RFSGSG

6 2.34 0.213115 0 0.786885 SNLWVP

6 2.19 0 0 1 FHFNPR

6 1.1 0 0 1 HVDPEN

6 2.8 NA 1 NA NASKFH

6 2.31 NA 1 NA GDSGGP

6 3.25 NA 1 NA GVGKSA

7 1.25 1 1 0 AWEVVRA

7 2.8 1 1 0 VENYGGE

7 3.33 1 1 0 PWCGHCK



7 1.34 1 1 0 EAFSLFD

7 3.25 1 1 0 IEDSYRK

7 2.5 1 1 0 IGHQWYW

7 2.41 0.803279 0 0.196721 TFKNTEI

7 2.1 0.359551 0.917031 0.640449 WVRQAPG

7 2.34 0 0 1 SNLWVPS

7 2.19 0 0.555556 1 VSSFFTY

7 1.1 0 0 1 PWTQRFF

7 3.73 0 0 1 TITLVRE

7 3.5 0 0 1 LNGGPGC

7 2.31 NA 1 NA GDSGGPL

7 3.19 NA 1 NA QIERTIA

7 3.1 NA 1 NA TLHHFDT

8 1.25 1 1 0 CAWEVVRA

8 3.33 1 1 0 APWCGHCK

8 1.34 1 1 0 LGTVMRSL

8 3.25 1 1 0 DPTIEDSY

8 2.5 1 1 0 WYWSYEYS

8 2.41 0.643564 0 0.356436 GVGFATRQ

8 2.1 0.284091 0.931223 0.715909 KGRFTISR

8 1.1 0.1875 0 0.8125 SELHCDKL

8 2.34 0 0 1 GSSNLWVP

8 2.19 0 0 1 WDEIDIEF

8 3.73 0 0 1 APLTITLV

8 3.5 0 0 1 LNGGPGCS

8 3.19 0 0.923077 1 IYQVPVYS

8 2.8 NA 1 NA APHRVLAT

8 2.31 NA 1 NA GDSGGPLV

8 3.1 NA 1 NA TLHHFDTP

9 3.33 1 1 0 EFYAPWCGH

9 1.34 1 1 0 ITTKELGTV

9 2.5 1 1 0 LRLLYLLDE

9 2.41 0.798883 0 0.201117 STFKNTEIS

9 1.1 0.341772 0 0.658228 LHCDKLHVD

9 2.19 0.222222 0 0.777778 LGKDTTKVQ

9 2.1 0.060606 0.966157 0.939394 VKGRFTISR

9 2.34 0 0 1 TGSSNLWVP

9 3.73 0 0 1 YIGVSVVLF

9 3.5 0 0 1 LNGGPGCSS

9 1.25 NA 1 NA QSQIVSFYF

9 2.8 NA 1 NA PVTTTVENY



9 2.31 NA 1 NA CQGDSGGPL

9 3.25 NA 1 NA LTIQLIQNH

9 3.19 NA 1 NA FVAGLGGIG

9 3.1 NA 1 NA EPFVTLHHF

10 2.8 1 1 0 TLGNSTITTQ

10 1.34 1 1 0 KELGTVMRSL

10 3.25 1 1 0 MTEYKLVVVG

10 2.5 1 1 0 HQWYWSYEYS

10 2.41 0.809524 0 0.190476 FYIKTSTTVR

10 2.1 0.360465 0.939956 0.639535 SCKASGYTFT

10 2.19 0.171053 0 0.828947 EFLGKDTTKV

10 2.34 0 0 1 TGSSNLWVPS

10 1.1 0 0 1 LHCDKLHVDP

10 3.73 0 0.333333 1 LTITLVREEV

10 3.5 0 0 1 VLWLNGGPGC

10 3.19 0 0.948718 1 FWDKRKGGPG

10 1.25 NA 1 NA LQCLEEELKP

10 3.33 NA 1 NA KLVVVDFSAT

10 2.31 NA 1 NA DSCQGDSGGP

10 3.1 NA 1 NA FVTLHHFDTP



5 Appendix B References

[1] N. Yosef, R. Sharan, and W.S. Noble. Improved network-based identification of protein orthologs. Bioinformatics, 24(16):i200–i206, 2008.

[2] C. Knaub. Molecular Evolution? http://www.icr.org/article/molecular-evolution/

[3] I. Budowski-Tal, Y. Nov, and R. Kolodny. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci U S A. 2010 Feb 3; 20133727.

[4] M. Ester and X. Zhang. "A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data." Proc. SIAM Int'l Conf. on Data Mining (SDM '04), Apr. 2004.

[5] N.M. Zaki, R.M. Ilias, and S. Derus. "A Comparative Analysis of Protein Homology Detection Methods." Journal of Theoretics, 5-4, 2003.

[6] S. Diplaris, G. Tsoumakas, P.A. Mitkas, and I. Vlahavas. Protein Classification with Multiple Algorithms. In: Proc. of 10th Panhellenic Conference in Informatics, Volos, Greece, November 21-23. LNCS. Springer, Heidelberg (2005).


6 Appendix C The code:

#!/usr/bin/perl -w
##### Training Code #########
# Usage: perl homology.bagOfWordsTrain.pl fisher-scop-data maxLengthWords (fold or family)
# Notes: classLabel is the same as FamilyID
#        Negative training data are not used in this code
# Author: Zina Saadi ([email protected])
# Purpose: Protein homology detection based on predicting the familyID
#          of the sequence using positive training data and positive and negative test data
# Training Data: http://compbio.soe.ucsc.edu/discriminative/fisher-scop-data.tar.gz
# Runs in versions of Perl above 5.0

use strict;
use File::Find;

# Command-line input arguments
my $directory   = $ARGV[0];      # command line directory name
my $maxSplit    = int($ARGV[1]); # max length for words
my $classOption = $ARGV[2];      # type of class (fold or family)

# Save the training vocabulary: each word found in the sequences with its index,
# in the format wordOfSeq,indexOfWord
open(TRAINKEY, "> trainKey.txt") or die("Couldn't open Keys file\n");

# Save the training data as sparse vectors, one line per sequence: the indexed words
# found in the sequence with their frequencies, followed by the class label,
# in the format indexOfWord1 freqOfWord1,indexOfWord2 freqOfWord2,...,familyID
open(TRAINSPARSE, "> trainSparseVectors.txt") or die("Couldn't open Sparse Data\n");

# Root directory to search for training data
my $trainDirectory = "./$directory";

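# Initialize global variables
my %overallI   = ();  # word -> global index across all training files
my @files      = ();  # paths of the positive training files
my $classLabel = "";
my %tokensFreq = ();  # word -> frequency within the current sequence
my %fileI      = ();  # word -> order of first appearance in the current sequence

# Search all sub-directories for the positive training data files
find(sub { push @files, $File::Find::name if /pos\-train\.seq$/ }, $trainDirectory);

# Loop over every positive training file
foreach my $fileName (@files) {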

    # Get the class label from the file path
    if ($classOption eq "fold") {
        ($classLabel) = $fileName =~ /\/(\d{1,3}\.\d{1,3})\//i;  # capture fold ID
    } elsif ($classOption eq "family") {
        ($classLabel) = $fileName =~ /\/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\//i;  # capture family ID
    } else {
        die("last argument must be of the string value \"family\" or \"fold\"\n");
    }

    open(INPUT, "< $fileName") or die("Couldn't open $fileName.\n");
    my $seq = "";
    while (<INPUT>) {  # loop through each line of the file
        chomp;
        if (index($_, ">") != 0) {
            $seq .= $_;   # concatenate segments of the same sequence
        } elsif ($seq ne "") {
            $seq .= ",";  # a header line flags a sequence boundary
        }
    }
    close INPUT;

    my @tokens = ();
    if ($seq ne "") {
        my @seqArray = split(/,/, $seq);
        foreach (@seqArray) {  # one sparse vector per sequence
            %tokensFreq = ();
            %fileI      = ();
            # Cut the sequence into consecutive, non-overlapping words of up to
            # $maxSplit residues; the capturing group makes split return the words
            # (interleaved with empty strings, which are skipped below)
            @tokens = split(/(.{1,$maxSplit})/, $_);
            for (my $i = 0; $i <= $#tokens; $i++) {
                chomp($tokens[$i]);
                if ($tokens[$i] ne "") {
                    if (exists($tokensFreq{$tokens[$i]})) {
                        $tokensFreq{$tokens[$i]} += 1;
                    } else {
                        # First time this word is seen anywhere: assign its global index
                        if (not exists($overallI{$tokens[$i]})) {
                            my $size = keys %overallI;
                            $overallI{$tokens[$i]} = $size;
                            print TRAINKEY "$tokens[$i],$size\n";
                        }
                        # First time this word is seen in this sequence: record its order
                        if (not exists($fileI{$tokens[$i]})) {
                            my $size = keys %fileI;
                            $fileI{$tokens[$i]} = $size;
                        }
                        $tokensFreq{$tokens[$i]} = 1;
                    }
                }
            }
            # Emit "index freq," pairs in order of first appearance, then the class label
            for my $k1 (sort { $fileI{$a} <=> $fileI{$b} } keys %fileI) {
                print TRAINSPARSE "$overallI{$k1} $tokensFreq{$k1},";
            }
            print TRAINSPARSE "$classLabel\n";
        }
    }
}
close TRAINKEY;
close TRAINSPARSE;
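
The training step above can be restated compactly. What follows is a minimal sketch in Python rather than the report's Perl, assuming the same word definition (consecutive, non-overlapping chunks of up to max_len residues). The names split_words and sparse_vector and the toy sequences are illustrative assumptions and do not appear in the original code.

```python
from collections import Counter

def split_words(seq, max_len):
    """Cut seq into consecutive, non-overlapping words of up to max_len residues."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]

def sparse_vector(seq, max_len, vocab):
    """Return [(word_index, frequency), ...] for one sequence.

    vocab maps word -> global index and grows as new words appear,
    which mirrors the role of trainKey.txt in the training script.
    """
    counts = Counter(split_words(seq, max_len))
    vector = []
    for word, freq in counts.items():  # insertion order = order of first appearance
        index = vocab.setdefault(word, len(vocab))
        vector.append((index, freq))
    return vector

vocab = {}
first = sparse_vector("HQWYWSHQ", 2, vocab)  # hypothetical toy sequence
second = sparse_vector("WSGG", 2, vocab)     # reuses the index already given to "WS"
```

Each returned list corresponds to one line of trainSparseVectors.txt before the class label is appended.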

#!/usr/local/bin/perl
##### Testing Code #########
# Usage: perl homology.bagOfWordsTest.pl fisher-scop-data maxLengthWords (family or fold)
# Notes: classLabel is the same as FamilyID
# Author: Zina Saadi ([email protected])
# Purpose: Protein homology detection based on predicting the familyID
#          of the sequence using positive training and testing data
# Testing Data: http://compbio.soe.ucsc.edu/discriminative/fisher-scop-data.tar.gz
# Runs in versions of Perl above 5.0

use strict;
use File::Find;

my $directory   = $ARGV[0];      # directory (root of train and test data)
my $maxSplit    = int($ARGV[1]); # max length for words of a sequence
my $classOption = $ARGV[2];      # type of class (fold or family)

# Output results data to a file
open(RESULTS, "> results".$maxSplit.".".$classOption.".txt") or die("Couldn't open results file\n");

# Results table header
print RESULTS "SeqSplit\tFamID\tSpec\tPrecision\tRFP\tDistinct-Fragment\n";

my $testDirectory = "./$directory";
my $indexSplit    = $maxSplit;

# Initialize global variables
my %count          = ();  # class -> number of training sequences
my @trainKeys      = ();  # word index -> word
my %totalNumbers   = ();  # "<class>.TPos", "<class>.FPos", "<class>.TNeg" -> tally
my %wordsCountP    = ();  # class -> total word count in the training data
my %distinctWords  = ();  # class -> word -> log-likelihood score
my %instancesArray = ();  # class -> word -> frequency in the training data
my @files          = ();  # paths of the positive test files
my $classLabel     = "";

# Remove any previously generated training files
unlink("trainKey.txt", "trainSparseVectors.txt");

# Call the training code and report its status
system("perl homology.bagOfWordsTrain.pl $directory $indexSplit $classOption");
print "done with >perl homology.bagOfWordsTrain.pl $directory $indexSplit $classOption\n";

# Retrieve the training vocabulary, one "wordOfSeq,indexOfWord" pair per line
open(TRAINKEY, "./trainKey.txt") or die("Couldn't open Keys file\n");
while (<TRAINKEY>) {
    chomp;
    my @tempSplit = split(/,/, $_);
    push(@trainKeys, $tempSplit[0]);
}
close TRAINKEY;

# Retrieve the sparse training vectors, one sequence per line, in the format
# indexOfWord1 freqOfWord1,indexOfWord2 freqOfWord2,...,familyID
open(SPARSE, "./trainSparseVectors.txt") or die("Couldn't open Sparse Data file\n");
while (<SPARSE>) {
    chomp;
    my @tempISplit = split(/,/, $_);
    my $label = $tempISplit[$#tempISplit];  # the last field is the class label
    for (my $i = 0; $i < $#tempISplit; $i++) {
        if ($tempISplit[$i] ne "") {
            my @tempJSplit = split(/ /, $tempISplit[$i]);
            my $word = $trainKeys[$tempJSplit[0]];
            $instancesArray{$label}{$word} += $tempJSplit[1];  # per-class word frequency
            $wordsCountP{$label}           += $tempJSplit[1];  # per-class total word count
        }
    }
    $count{$label} += 1;  # number of training sequences per class
}
close SPARSE;

# Search all sub-directories for the positive testing data files
find(sub { push @files, $File::Find::name if /pos\-test\.seq$/ }, $testDirectory);

# Loop over every positive test file
foreach my $fileName (@files) {

    open(INPUT, "< $fileName") or die("Couldn't open $fileName.\n");

    # Get the class label from the file path
    if ($classOption eq "fold") {
        ($classLabel) = $fileName =~ /\/(\d{1,3}\.\d{1,3})\//i;  # capture fold ID
    } elsif ($classOption eq "family") {
        ($classLabel) = $fileName =~ /\/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\//i;  # capture family ID
    } else {
        die("last argument must be of the string value \"family\" or \"fold\"\n");
    }

    # Word counts for this file; they accumulate across the sequences of the file
    my %tokensFreq     = ();
    my $totalWordsFreq = 0;

    # Combine all the lines of the same protein sequence and separate
    # the sequences within the same file with a comma
    my $seq = "";
    while (<INPUT>) {  # loop through each line of the file
        chomp;
        if (index($_, ">") != 0) {
            $seq .= $_;   # concatenate segments of the same sequence
        } elsif ($seq ne "") {
            $seq .= ",";  # a header line flags a sequence boundary
        }
    }
    close INPUT;

    if ($seq ne "") {
        my @seqArray = split(/,/, $seq);
        my @tokens   = ();
        foreach (@seqArray) {  # loop through each sequence instance
            # Cut the sequence into consecutive, non-overlapping words of up to $maxSplit residues
            @tokens = split(/(.{1,$maxSplit})/, $_);
            for (my $i = 0; $i <= $#tokens; $i++) {
                chomp($tokens[$i]);
                if ($tokens[$i] ne "") {
                    $tokensFreq{$tokens[$i]} += 1;
                    $totalWordsFreq++;
                }
            }

            # Calculate the prediction by taking the argmax over classes
            # of the sum of the logs of the per-word scores
            my %prediction  = ();
            my $labels_size = keys %count;
            for my $class (keys %count) {
                my $probP     = 1 / $labels_size;  # uniform class prior
                my $bayesProb = log($probP);
                for my $word (keys %tokensFreq) {
                    my $probWord       = $tokensFreq{$word} / $totalWordsFreq;
                    # A word never seen in this class contributes a count of 0
                    my $probWordGivenP = ($instancesArray{$class}{$word} || 0) / $wordsCountP{$class};
                    # "/" binds tighter than "+", so this term is
                    # log( ((P(word|class) + 1) / P(word)) + 3 )
                    $bayesProb += log(($probWordGivenP + 1) / $probWord + 3);
                }
                $prediction{$class} = $bayesProb;
            }

            # Take the argmax
            my @sortedP    = reverse sort { $prediction{$a} <=> $prediction{$b} } keys %prediction;
            my $predictedL = $sortedP[0];
            if ($classLabel eq $predictedL) {
                $totalNumbers{$classLabel . ".TPos"} += 1;
            } else {
                $totalNumbers{$classLabel . ".FPos"} += 1;
                $totalNumbers{$predictedL . ".TNeg"} += 1;
            }
        }  # foreach (@seqArray)
    }  # if ($seq ne "")
}  # loop through files

# Calculate all applicable types of measurement
my $specificity = 0;
my $precision   = 0;
my $FP_rate     = 0;
my $Ave_RFP     = 0;  # running sum of the false positive rates, averaged at the end
my $countRFP    = 0;  # number of classes with a defined false positive rate
$classLabel     = ""; # reset the default value

for my $class (keys %count) {
    my $TPos  = $totalNumbers{$class . ".TPos"} || 0;
    my $FPos  = $totalNumbers{$class . ".FPos"} || 0;
    my $TNeg  = $totalNumbers{$class . ".TNeg"} || 0;
    my $TN_FP = $TNeg + $FPos;  # denominator for the specificity and false positive rate
    my $TP_FP = $TPos + $FPos;  # denominator for the precision

    if ($TN_FP != 0) {
        $specificity = $TNeg / $TN_FP;
        $FP_rate     = $FPos / $TN_FP;
        $Ave_RFP    += $FP_rate;
        $countRFP++;
    } else {
        $specificity = "NA";
        $FP_rate     = "NA";
    }
    if ($TP_FP != 0) { $precision = $TPos / $TP_FP; } else { $precision = "NA"; }

    # Calculate the likelihood of each word given this class (family or fold)
    # in the training files, relative to every other class
    for my $word (reverse sort { $instancesArray{$class}{$a} <=> $instancesArray{$class}{$b} }
                  keys %{ $instancesArray{$class} }) {
        my $probWordGivenP = $instancesArray{$class}{$word} / $wordsCountP{$class};
        my $log_likelihood = log($probWordGivenP);
        for my $other_protein (keys %count) {
            if ($class ne $other_protein) {
                # Additive smoothing (+1 in the numerator, +3 in the denominator)
                # avoids taking the log of zero
                my $probWordGivenNotP = (($instancesArray{$other_protein}{$word} || 0) + 1)
                                      / ($wordsCountP{$other_protein} + 3);
                $log_likelihood -= log($probWordGivenNotP);
            }
        }
        $distinctWords{$class}{$word} = $log_likelihood;
    }

    # The most distinct fragment is the word with the highest score
    my @sortedW = reverse sort { $distinctWords{$class}{$a} <=> $distinctWords{$class}{$b} }
                  keys %{ $distinctWords{$class} };
    print RESULTS "$indexSplit\t$class\t$specificity\t$precision\t$FP_rate\t$sortedW[0]\n";
}

# Final row: average false positive rate over the classes where it is defined
print RESULTS "$indexSplit\t", ($countRFP ? $Ave_RFP / $countRFP : "NA"), "\n";
close RESULTS;
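
The distinct-fragment score computed at the end of the test script can be illustrated on a small example. This is a minimal sketch in Python rather than the report's Perl, assuming the same scoring rule: the log of the word's probability in the target class, minus the log of its smoothed probability, (count + 1) / (total + 3), in every other class. The function name distinct_word_scores and the toy counts are hypothetical and are not taken from the SCOP data.

```python
import math

def distinct_word_scores(counts, target):
    """Score each word of the target class by how specific it is to that class.

    counts maps class -> {word: frequency}. The score is log P(word | target)
    minus, for every other class, the log of the smoothed probability
    (count + 1) / (total + 3).
    """
    totals = {c: sum(words.values()) for c, words in counts.items()}
    scores = {}
    for word, freq in counts[target].items():
        score = math.log(freq / totals[target])
        for other, words in counts.items():
            if other != target:
                score -= math.log((words.get(word, 0) + 1) / (totals[other] + 3))
        scores[word] = score
    return scores

# Hypothetical toy counts for two classes
counts = {
    "2.5": {"HQWYW": 4, "GGPGC": 1},
    "3.5": {"GGPGC": 5, "LNGGP": 5},
}
scores = distinct_word_scores(counts, "2.5")
best = max(scores, key=scores.get)  # word reported as the distinct fragment
```

In this toy case the word that is frequent in the target class and absent from the other class receives the highest score, which is the behavior the Distinct-Fragment column relies on.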



7 Appendix D Slides



8 Appendix E Measurement definition and formulas from (http://webdocs.cs.ualberta.ca/~bioinfo/data/publications/theses/2004-Theses-BrettPoulin.pdf )
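
For reference, the ratios computed at the end of the test script correspond to the standard confusion-matrix definitions below. The symbols TP, FP, and TN (true positive, false positive, and true negative counts for a class) are shorthand introduced here; the cited thesis remains the authoritative source for the definitions.

```latex
\begin{align}
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{RFP (rate of false positives)} &= \frac{FP}{TN + FP} = 1 - \text{Specificity}
\end{align}
```

The last identity is why the Spec and RFP columns of the results table sum to 1 in every row where both are defined.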

