Tufts University
Protein Homology Detection Using Bag of Words Classifier
Exploring a Word within Documents Classifier for Protein Sequences within families or folds
Zina Saadi, 12/9/2010. Final Project, Comp 150, taught by Dr. Lenore Cowen
Proteins Homology Detection
Abstract: The rapid growth of protein sequences in biological databases has encouraged researchers to explore new methods for predicting the function and structure of proteins. Studies of homology show that proteins belonging to the same family or superfamily share similar functional or structural properties. The first part of this study uses the benchmark SCOP dataset to train and test a Bag of Words (BOW) classifier, built on the Naive-Bayesian classifier, for protein homology detection. The second part calculates, for each class (family or fold), the likelihood of its most distinct subsequence in the training files. Results from the second part suggest that the most distinct subsequence of a family or fold can be used to identify the homology of unknown proteins. These results can give biologists and other researchers insight into the benefit of exploring simpler methods. Data and code are available at http://www.eecs.tufts.edu/~zsaadi01/proteins_homology/ (see footnote 1).
Keywords: Homology detection, Bag of Words, protein classification, Naive-Bayesian classifier
Footnote 1: I renamed all Perl files to .txt because, no matter how I changed their permissions, their content stayed hidden.
Contents
1 Introduction
1.1 Predicting Homology among Proteins
1.2 Motivation and Overview
2 Major Design Specifications
2.1 Design Overview
2.2 Data Processing
2.2.1 Data Specification
2.2.2 Methods
2.2.3 Results and Measurements
3 Future Experiments
4 Appendix A
5 Appendix B
6 Appendix C
7 Appendix D
8 Appendix E
1 Introduction
Advances in automated sequencing tools have steadily increased the amount of DNA and protein sequence data in public biological databases [4][6]. This growth creates a gap between the amount of sequence information available in these databases and what is known about the structure and function of the proteins themselves, because there are few reliable automated tools that can predict protein function and structure [6]. Over the past decade, researchers have explored ways to predict protein function with statistical and machine learning approaches, which are less costly than time-consuming lab experiments [5].
1.1 Predicting Homology among Proteins
Traditional approaches to identifying protein homology range from linear sequence-based comparison to network comparison. The search for sufficient similarity between protein sequences relies on sequence alignment, a way of arranging protein sequences to identify regions of similarity between them. Global alignments attempt to align every amino acid in two sequences and are generally useful for similar sequences of similar length. Local alignments, on the other hand, attempt to find regions of local similarity between sequences and are generally useful for less similar sequences. There can be many ways to align two sequences. What matters in this approach is the scoring function, which judges the degree of similarity between two protein sequences and thereby determines the best alignment. In addition to sequence-based comparisons, network comparisons across species have also been used to identify proteins with similar functions and to detect homology [1].
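To make the role of the scoring function concrete, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). It is written in Python for illustration and is not part of this project's code. The scores of +1 for a match and -1 for a mismatch or gap are assumptions chosen for simplicity; practical protein alignment normally uses a substitution matrix and tuned gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (Needleman-Wunsch).

    The match/mismatch/gap values are illustrative assumptions.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]
```

Changing the three score parameters changes which alignment wins, which is why the choice of scoring function matters so much.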
Most of these approaches rely on comparing amino acid sequences to look for similarities, which is the only logical basis for building evolutionary relationship trees. However, when these comparisons are followed strictly to produce an evolutionary tree, they yield many embarrassing results, such as showing that the turtle is more closely related to the birds than to the snake, or grouping the chicken with the penguin rather than the duck [2].
This study proposes predicting the family of protein sequences across organisms by pattern recognition, using the “bag of words” conditional Bayesian classifier.
1.2 Motivation and Overview
The bag of words (BOW) conditional Bayesian classifier has gained great importance in Natural Language Processing (NLP), in domains such as spam filtering, document classification and word-sense disambiguation. I first encountered the BOW classifier in Comp 135 with Dr. Carla Brodley, where we were asked to build it for classifying U.S. presidential speeches. Predicting the family of protein sequences across organisms is analogous to attributing speeches given by the same president over a period of time: in both cases a model is trained on labeled instances, here protein sequences labeled with their annotated family. The U.S. presidential speech classifier was adjusted to:
Treat each protein sequence within the same family as an instance.
Measure the rate of false positives and the specificity instead of precision and accuracy.
2 Major Design Specifications
Four elements were critical to completing this study:
1. Finding appropriate annotated data to be used for training and testing.
2. Adapting the previously used code to protein homology classification and detection.
3. Designing the code so that its measurements can be compared easily with those of other experiments in the same domain.
4. Finding results of other machine learning classifiers to compare the BOW results with.
2.1 Design Overview
The implementation was written in Perl v5.10.1 and divided into two .pl files, <homology.bagOfWordsTest.pl> and <homology.bagOfWordsTrain.pl>. To run the code, one specifies the root directory of the training and testing data extracted from <fisher-scop-data.tar.gz> (footnote 2; last modified on 23-Dec-2002 00:16), together with the maximum word length (sequence sub-segment length) used to split each protein sequence.
2.2 Data Processing
The BOW classifier has mostly been applied to NLP problems, so it is crucial to note the difference between processing human language text and processing protein sequences. In human language text, English for instance, word boundaries are easily found by splitting the text on white space and punctuation. A protein sequence has no such word or segment boundaries, which makes it hard to decide which segments the BOW classifier should consider. To overcome this challenge, the code takes the maximum desired word (segment) length as a command-line argument and splits each sequence into words of exactly that length, except that the rightmost word holds the remainder and may be shorter. The code then runs the model in a loop, segmenting the training and testing protein sequences for every word length from 1 to the specified maximum.
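The segmentation described above can be sketched as follows. This is a Python illustration rather than the project's Perl code. It assumes the words are consecutive and non-overlapping, which is one reading of the description. The function names are hypothetical.

```python
def split_into_words(sequence, word_length):
    """Cut a sequence into consecutive, non-overlapping words of word_length residues.

    The rightmost word keeps whatever remainder is left, so it may be shorter.
    """
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return [sequence[i:i + word_length]
            for i in range(0, len(sequence), word_length)]


def words_for_all_lengths(sequence, max_length):
    """Segment the same sequence once per word length from 1 to max_length."""
    return {k: split_into_words(sequence, k) for k in range(1, max_length + 1)}


# A 10-residue sequence split with word length 4 leaves a 2-residue remainder.
print(split_into_words("MTEYKLVVVG", 4))
```

Under this reading, a fragment shorter than the requested word length can only be the remainder at the right end of a sequence.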
Footnote 2: Data source: http://compbio.soe.ucsc.edu/discriminative/
2.2.1 Data Specification
The data used in this study was obtained from the SCOP benchmark dataset (http://compbio.soe.ucsc.edu/discriminative/). This version was chosen so that the results could be compared with those of other machine learning classifiers, as illustrated in [5]. After extracting the directories and files from <fisher-scop-data.tar.gz>, the data files look as follows:
Figure 1 A sample screenshot of the directory layout
There were 33 class labels (family IDs) for predicting a given protein sequence’s membership.
There were 12 class labels (fold IDs) for predicting a given protein sequence’s membership.
Within each family ID folder, files ending in pos-train.seq and pos-test.seq were used as training and testing data, respectively.
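The file selection can be sketched in Python as below. The exact directory layout is an assumption here, because the screenshot in Figure 1 is not reproduced: the sketch assumes one folder per family ID somewhere under a root directory, and the function name is hypothetical.

```python
from pathlib import Path


def collect_positive_files(root):
    """Map each class folder name to its positive training and testing files.

    Assumes (hypothetically) that every family ID has its own folder under root.
    """
    files_by_class = {}
    for path in sorted(Path(root).rglob("*.seq")):
        if path.name.endswith("pos-train.seq"):
            split = "train"
        elif path.name.endswith("pos-test.seq"):
            split = "test"
        else:
            continue  # any other .seq file is ignored by this sketch
        entry = files_by_class.setdefault(path.parent.name, {"train": [], "test": []})
        entry[split].append(path)
    return files_by_class
```

Keying the result by folder name keeps the class label attached to every file, which is what the training step needs.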
2.2.2 Methods
The code used for this study was inspired by project 3 of Dr. Carla Brodley’s Comp 135 (Introduction to Machine Learning and Data Mining). That project received a grade of 100%, which gives me confidence that the code’s computations are reliable. For that project we had to build a classifier for presidential speeches and then identify each president’s most distinct word using the likelihood method. Applying this analogy to protein homology, for each specified word (subsequence) length I treated each protein sequence as a training instance of the class (either fold or family) that it belongs to. For testing, I computed the prediction for each test protein sequence, counted the false positives, true positives and false negatives, and then calculated the rate of false positives (Experiment 1). Then, for each specified word (subsequence) length, the code predicted the most distinct word (subsequence) among all protein sequence instances of each class (family or fold) (Experiment 2).
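The measurements reported later (specificity, precision and RFP) can be computed from the counts as in the Python sketch below. It uses the standard definitions, which is an assumption about the project's Perl code. The function name is hypothetical. Returning None when a denominator is zero is also an assumption, chosen to mirror the "NA" entries in the result tables.

```python
def rate_metrics(tp, fp, tn):
    """Return (specificity, precision, rfp) from true/false positive and true negative counts.

    A value is None when its denominator is zero.
    """
    negatives = fp + tn            # sequences that do not belong to the class
    predicted_positive = tp + fp   # sequences the classifier assigned to the class
    rfp = fp / negatives if negatives else None
    specificity = tn / negatives if negatives else None
    precision = tp / predicted_positive if predicted_positive else None
    return specificity, precision, rfp
```

With these definitions specificity and RFP always sum to 1, which matches every numeric row in the result tables.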
2.2.2.1 Experiment 1
The first part of this study consisted of splitting each protein sequence of a given class (family or fold) into words of a user-specified length and labeling them with the class the sequence belongs to. To predict the class of a given protein sequence, the naïve Bayesian classifier was used:
Train:
1. For each class $C_j$ (fold or family ID), estimate $P(C_j)$.
2. For each word $W_i$, estimate $P(W_i \mid C_j)$.
Classify (doc): Assign the protein sequence to the most probable class, assuming the words are conditionally independent given the class: $\hat{C} = \arg\max_{C_j} P(C_j) \prod_i P(W_i \mid C_j)$.
In general, when there is very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are risky to use because they can produce incorrect results. To reduce this risk, it has been suggested to take the logs of the probabilities and to use Laplace’s estimate, which adds 1 to the numerator and 2 to the denominator (in my code I added 1 and 3, respectively).
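A minimal Python sketch of the training and classification steps, with log probabilities and the smoothing constants quoted above, might look like this. It is not the project's Perl code. How the constants 1 and 3 enter the real computation is not shown in this report, so applying them to each word count and to the class's total word count is an assumption, as are the function names.

```python
import math
from collections import Counter


def train(labeled_docs):
    """labeled_docs: list of (class_label, words). Returns class priors and word counts."""
    doc_counts = Counter(label for label, _ in labeled_docs)
    word_counts = {label: Counter() for label in doc_counts}
    for label, words in labeled_docs:
        word_counts[label].update(words)
    priors = {label: n / len(labeled_docs) for label, n in doc_counts.items()}
    return priors, word_counts


def log_likelihood(word, counts, add_num=1, add_den=3):
    """Smoothed log P(word | class); 1 and 3 are the constants quoted in the text."""
    return math.log((counts[word] + add_num) / (sum(counts.values()) + add_den))


def classify(words, priors, word_counts):
    """Return the class with the largest log prior plus summed word log likelihoods."""
    def score(label):
        counts = word_counts[label]
        return math.log(priors[label]) + sum(log_likelihood(w, counts) for w in words)
    return max(priors, key=score)
```

Summing logs instead of multiplying probabilities avoids numeric underflow on long sequences, and the smoothing keeps an unseen word from driving a class score to minus infinity.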
2.2.2.2 Experiment 2
The second part of this study consisted of calculating, for a user-specified word length, the likelihood of a word given a class (family or fold) in the training protein sequences of each class. More precisely, the quantity is P(word | class) / P(word | other classes). Because some words appear in the sequences of certain classes but not in others, the denominator can be zero. To overcome this problem, I took the log of the likelihood ratio, log P(word | class) - log P(word | other classes), and used the Laplace estimate for log P(word | other classes). I then sorted the results in decreasing order and took the word with the highest likelihood for each class.
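The ranking step can be sketched in Python as follows. This is an illustration, not the project's Perl code. Pooling all other classes into one set of counts, and applying the constants 1 and 3 to those pooled counts, are assumptions. The function name is hypothetical.

```python
import math
from collections import Counter


def most_distinct_word(word_counts, target, add_num=1, add_den=3):
    """word_counts: {class_label: Counter of words}.

    Returns (word, log_ratio) for the word of the target class with the largest
    log P(word | target) - log P(word | other classes).
    """
    own = word_counts[target]
    others = Counter()
    for label, counts in word_counts.items():
        if label != target:
            others.update(counts)  # pool the counts of every other class
    own_total = sum(own.values())
    others_total = sum(others.values())

    def log_ratio(word):
        log_own = math.log(own[word] / own_total)
        # Smoothing keeps the denominator positive for words unseen elsewhere.
        log_others = math.log((others[word] + add_num) / (others_total + add_den))
        return log_own - log_others

    best = max(own, key=log_ratio)
    return best, log_ratio(best)
```

Only words that occur in the target class are ranked, so the unsmoothed numerator is never zero.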
2.2.3 Results and Measurements
Experiment part 1
Figure: Average rate of false positives (Ave RFP, vertical axis, 0 to 0.8) for family and fold classes at word lengths 1 to 10 (horizontal axis). Series: Family (Ave RFP) and Fold (Ave RFP).
From the graph, it appears that for subsequence lengths of 1 or greater than 3, the classifier performs better at classifying protein sequences into their correct folds than into their correct families. Further research would be interesting to investigate why it is the other way around for subsequence lengths 2 and 3.
Experiment part 2
The following two tables list the distinct fragments obtained from the training data for all sequences under the same fold (first table) or under the same family (second table). The first table lists the distinct fragments whose rate of false positives is zero (RFP = 0), while the second table lists those with the 10 smallest RFP values. Further investigation showed that, when RFP is zero, these distinct fragments appear in some protein sequences of the fold or family they were extracted from, but in no protein sequence of any other fold or family. This finding suggests that the method could be used to classify an unknown protein sequence by locating within it the distinct fragment of a fold or family. I searched online for several distinct fragments, namely “TLGNSTITTQ”, “KELGTVMRSL”, “MTEYKLVVVG” and “VSSFFTY”, to find anything specific about them and their classes, but I could not find anything interesting.
Word Length FoldID Spec Precision RFP Distinct-Fragment
10 2.8 1 1 0 TLGNSTITTQ
10 1.34 1 1 0 KELGTVMRSL
10 3.25 1 1 0 MTEYKLVVVG
10 2.5 1 1 0 HQWYWSYEYS
2 1.34 1 1 0 AZ
4 2.8 1 1 0 QIMY
4 3.33 1 1 0 APWC
4 1.34 1 1 0 QDMI
4 3.25 1 1 0 QNHF
4 3.19 1 1 0 WKLD
5 1.25 1 1 0 SFYFK
5 2.8 1 1 0 CGYSD
5 3.33 1 1 0 CGHCK
5 1.34 1 1 0 DKDGD
5 2.5 1 1 0 HQWYW
6 3.73 1 1 0 TANLAA
6 1.25 1 1 0 WEVVRA
6 3.33 1 1 0 WCGHCK
6 1.34 1 1 0 LFDKDG
6 3.19 1 1 0 GAGILD
6 3.1 1 1 0 EPFVTL
6 2.5 1 1 0 HQWYWS
7 1.25 1 1 0 AWEVVRA
7 2.8 1 1 0 VENYGGE
7 3.33 1 1 0 PWCGHCK
7 1.34 1 1 0 EAFSLFD
7 3.25 1 1 0 IEDSYRK
7 2.5 1 1 0 IGHQWYW
8 1.25 1 1 0 CAWEVVRA
8 3.33 1 1 0 APWCGHCK
8 1.34 1 1 0 LGTVMRSL
8 3.25 1 1 0 DPTIEDSY
8 2.5 1 1 0 WYWSYEYS
9 3.33 1 1 0 EFYAPWCGH
9 1.34 1 1 0 ITTKELGTV
9 2.5 1 1 0 LRLLYLLDE
Word Length Family-ID Spec Precision RFP Distinct-Fragment
2 1.34.1.5 1 1 0 AZ
7 2.19.1.1 1 1 0 VSSFFTY
1 2.5.1.1 0.972727273 0 0.027272727 X
1 3.1.1.5 0.957098284 0.166666667 0.042901716 x
1 1.25.1.1 0.953125 0 0.046875 X
3 2.1.1.4 0.952238806 0 0.047761194 TPZ
2 2.1.1.4 0.921414538 0 0.078585462 KB
7 2.1.1.4 0.918533605 0 0.081466395 SGVAGTH
10 2.1.1.4 0.917610711 0 0.082389289 NSGDAIYDAD
6 2.1.1.4 0.916753382 0 0.083246618 RPGQQP
Note that I had planned to compare my results with those discussed in reference [5]. However, I could not find a table for converting the SCOP family IDs used in version 1.37 to the names used in reference [5]. I spent considerable time looking for a conversion table but found nothing that went back to SCOP 1.37, even though this comparison was my initial objective before running the code.
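Returning to the distinct fragments of Experiment part 2, the check that a fragment occurs in only one class, and the idea of labeling an unknown sequence by the fragments it contains, can be sketched in Python as below. Plain substring search is an assumption, since the report does not say whether fragments were matched as substrings or as segmented words. The function names are hypothetical.

```python
def classes_containing(fragment, sequences_by_class):
    """Return the sorted labels of every class with a sequence containing fragment."""
    return sorted(
        label
        for label, sequences in sequences_by_class.items()
        if any(fragment in seq for seq in sequences)
    )


def predict_by_fragment(sequence, distinct_fragments):
    """distinct_fragments: {fragment: class_label}.

    Returns the sorted labels whose distinct fragment occurs in the sequence.
    """
    return sorted({label
                   for fragment, label in distinct_fragments.items()
                   if fragment in sequence})
```

A fragment is exclusive to its class when classes_containing returns a single label. An empty result from predict_by_fragment means no known fragment was found, so the sequence stays unclassified.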
3 Future Experiments
Several ideas for this project remain to be implemented and would be interesting to explore in the future:
1. Building a discriminative model that takes into account both the positive and the negative training and testing data.
2. Improving the code so that accuracy and sensitivity can be calculated. In the current design, the false negative values cannot be calculated.
3. Exploring statistical segmentation approaches developed for human languages that lack spaces, such as Chinese, to segment protein sequences more accurately, since they also lack word boundaries.
4. Creating separate classes for proteins that belong to more than one class.
4 Appendix A
Table: Ave RFP for each word length when detecting family ID or fold ID
Word Length Family (Ave RFP) Fold (Ave RFP)
1 0.731403473 0.645392247
2 0.594537429 0.59567166
3 0.568716552 0.676720305
4 0.600529232 0.362620053
5 0.617939183 0.469614604
6 0.604334648 0.320435088
7 0.602217334 0.449013135
8 0.629653251 0.529603441
9 0.591933171 0.557651688
10 0.611017176 0.554913204
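The averages in this table appear to be plain means of the per-class RFP values, with "NA" entries skipped. That reading is inferred from the numbers rather than stated in the report. A short Python sketch, using the fold RFP values listed for word length 2 in the fold results table of this appendix:

```python
def average_rfp(rfp_values):
    """Mean of the numeric RFP entries; entries equal to 'NA' are skipped."""
    numeric = [float(v) for v in rfp_values if v != "NA"]
    return sum(numeric) / len(numeric) if numeric else None


# Fold RFP values for word length 2, copied from the fold results table.
fold_rfp_len2 = [0, 0.092807, 0.103896, 0.11065, 0.162162, 0.176471, 0.199616,
                 0.901235, 0.909613, 0.931034, 0.943262, 1, 1, 1, 1, 1]
print(average_rfp(fold_rfp_len2))
```

The result agrees with the 0.59567166 reported above to within the rounding of the tabulated values.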
All results for word lengths 1 to 10 for detecting family IDs
Word Length Family-ID Specificity Precision RFP Distinct-Fragment
1 2.5.1.1 0.972727273 0 0.027272727 X
1 3.1.1.5 0.957098284 0.166666667 0.042901716 x
1 1.25.1.1 0.953125 0 0.046875 X
1 1.34.1.4 0.868020305 0 0.131979695 Z
1 1.34.1.5 0.865168539 0 0.134831461 Z
1 3.1.1.3 0.768149883 0 0.231850117 x
1 2.31.1.1 0.763157895 0 0.236842105 x
1 2.19.1.1 0.741803279 0 0.258196721 B
1 2.41.1.1 0.526315789 0 0.473684211 X
1 1.25.1.3 0.365079365 0 0.634920635 B
1 2.8.1.2 0.36 0 0.64 X
1 3.25.1.3 0.28 0 0.72 x
1 3.73.1.2 0.157894737 0 0.842105263 X
1 1.1.1.2 0.12605042 0 0.87394958 B
1 2.1.1.4 0.101123596 0 0.898876404 B
1 2.31.1.2 0.057971014 0 0.942028986 X
1 2.1.1.5 0 0 1 B
1 2.1.1.3 0 0 1 B
1 3.33.1.5 0 0 1 x
1 2.5.1.3 0 0 1 Z
1 3.19.1.5 0 0 1 x
1 2.1.1.2 0 0 1 B
1 2.1.1.1 0 0 1 Z
1 3.19.1.4 0 0 1 x
1 3.1.1.1 0 0 1 x
1 3.33.1.1 0 0 1 x
1 3.19.1.1 0 0 1 x
1 2.34.1.1 0 0 1 x
1 3.19.1.3 0 0 1 X
1 3.25.1.1 0 0 1 x
1 1.25.1.2 0 0 1 B
1 3.50.1.7 0 0 1 x
1 2.8.1.4 0 0 1 X
2 1.34.1.5 1 1 0 AZ
2 2.1.1.4 0.921414538 0 0.078585462 KB
2 2.31.1.2 0.834645669 0.192307692 0.165354331 MX
2 3.73.1.2 0.805668016 0 0.194331984 XW
2 2.41.1.1 0.78313253 0 0.21686747 XQ
2 2.34.1.1 0.753846154 0 0.246153846 Lx
2 1.1.1.2 0.715068493 0 0.284931507 ZH
2 3.19.1.3 0.692307692 0 0.307692308 CX
2 1.25.1.3 0.68 0 0.32 BP
2 3.19.1.4 0.642857143 0 0.357142857 Sx
2 2.1.1.1 0.633986928 0 0.366013072 PB
2 2.31.1.1 0.581395349 0 0.418604651 xI
2 3.25.1.1 0.566929134 0 0.433070866 xT
2 1.25.1.2 0.5625 0 0.4375 BP
2 3.25.1.3 0.552795031 0 0.447204969 xM
2 1.25.1.1 0.543478261 0 0.456521739 XE
2 1.34.1.4 0.434782609 0 0.565217391 AZ
2 3.19.1.1 0.416666667 0 0.583333333 Sx
2 3.19.1.5 0.405940594 0 0.594059406 Sx
2 2.8.1.4 0.292543021 0 0.707456979 XQ
2 3.33.1.5 0.198019802 0 0.801980198 xF
2 2.8.1.2 0.189873418 0 0.810126582 XC
2 2.5.1.3 0.172413793 0 0.827586207 BR
2 2.1.1.5 0 0 1 KB
2 2.1.1.3 0 0 1 KB
2 2.1.1.2 0 0 1 KB
2 3.1.1.1 0 0 1 Tx
2 2.5.1.1 0 0 1 XH
2 3.1.1.3 0 0 1 Ax
2 3.33.1.1 0 0 1 xF
2 3.1.1.5 0 0 1 Ax
2 3.50.1.7 0 0 1 xS
2 2.19.1.1 0 0 1 BM
3 2.1.1.4 0.952238806 0 0.047761194 TPZ
3 1.1.1.2 0.896825397 0.875 0.103174603 YH
3 2.41.1.1 0.892086331 0.166666667 0.107913669 XQT
3 2.5.1.1 0.871428571 0 0.128571429 XPM
3 3.33.1.1 0.833333333 0 0.166666667 MGx
3 1.34.1.5 0.74742268 0.416666667 0.25257732 HM
3 2.8.1.2 0.737704918 0 0.262295082 TXH
3 3.1.1.3 0.725 0 0.275 KAx
3 2.31.1.1 0.643564356 0 0.356435644 HCW
3 1.25.1.1 0.64 0 0.36 CCF
3 1.25.1.3 0.611650485 0 0.388349515 WNB
3 3.19.1.3 0.6 0 0.4 XDL
3 2.31.1.2 0.580645161 0 0.419354839 MWC
3 3.25.1.1 0.566929134 0 0.433070866 xTG
3 3.19.1.4 0.545454545 0 0.454545455 XDL
3 2.34.1.1 0.505154639 0 0.494845361 LxS
3 2.1.1.1 0.504424779 0 0.495575221 WCH
3 3.25.1.3 0.433070866 0 0.566929134 LTx
3 1.34.1.4 0.385826772 0 0.614173228 XKF
3 3.19.1.1 0.363636364 0 0.636363636 XDL
3 3.1.1.1 0.336734694 0 0.663265306 xDD
3 3.33.1.5 0.27027027 0 0.72972973 PNX
3 2.8.1.4 0.267326733 0 0.732673267 CCY
3 2.5.1.3 0.223880597 0.133333333 0.776119403 ZKG
3 3.19.1.5 0.097744361 0 0.902255639 XDL
3 2.1.1.5 0 0 1 TPZ
3 2.1.1.3 0 0 1 TPZ
3 2.1.1.2 0 0 1 TPZ
3 3.73.1.2 0 0 1 WWF
3 3.1.1.5 0 0 1 KAx
3 1.25.1.2 0 0 1 WNB
3 3.50.1.7 0 0 1 IxS
3 2.19.1.1 0 0 1 FXP
4 2.1.1.4 0.913606911 0 0.086393089 YQMY
4 2.5.1.1 0.904255319 0 0.095744681 YVWA
4 3.33.1.1 0.872727273 0 0.127272727 HLGR
4 2.41.1.1 0.794871795 0.333333333 0.205128205 MPNF
4 2.8.1.2 0.742971888 0 0.257028112 QIMY
4 3.1.1.3 0.725 0 0.275 YVWI
4 2.1.1.1 0.678653405 0 0.321346595 WCGK
4 2.31.1.1 0.657142857 0 0.342857143 ICLP
4 1.25.1.3 0.63963964 0 0.36036036 QLCH
4 3.19.1.3 0.636363636 0 0.363636364 GMGT
4 3.25.1.1 0.620689655 0 0.379310345 QNHF
4 1.25.1.1 0.60625 0 0.39375 LNF
4 1.34.1.5 0.588235294 0 0.411764706 EVD
4 2.31.1.2 0.573770492 0 0.426229508 HCGM
4 1.34.1.4 0.518518519 0 0.481481481 QNRD
4 3.19.1.5 0.444444444 0 0.555555556 GMGT
4 3.25.1.3 0.433070866 0 0.566929134 DMFR
4 1.1.1.2 0.422222222 0 0.577777778 HLDN
4 3.19.1.1 0.416666667 0 0.583333333 GMGT
4 2.8.1.4 0.336917563 0 0.663082437 DGCP
4 3.1.1.1 0.336734694 0 0.663265306 WDDP
4 2.5.1.3 0.172413793 0 0.827586207 KCTP
4 3.33.1.5 0.147368421 0 0.852631579 DCQD
4 2.1.1.5 0 0 1 LYYR
4 2.1.1.3 0 0 1 LYYR
4 2.1.1.2 0 0 1 LYYR
4 3.19.1.4 0 0 1 GMGT
4 2.34.1.1 0 0 1 WILG
4 3.73.1.2 0 0 1 WWFF
4 3.1.1.5 0 0 1 APNH
4 1.25.1.2 0 0 1 CAWE
4 3.50.1.7 0 0 1 HMVP
4 2.19.1.1 0 0 1 HFNP
5 2.1.1.4 0.912758997 0 0.087241003 MLWYR
5 2.5.1.1 0.901639344 0 0.098360656 NLIEA
5 3.33.1.1 0.86407767 0 0.13592233 QSWKE
5 2.8.1.2 0.771019678 0 0.228980322 CGYSD
5 3.1.1.3 0.725 0 0.275 WGQNG
5 1.25.1.3 0.718309859 0 0.281690141 LDTLQ
5 1.34.1.5 0.703180212 0 0.296819788 NEAP
5 2.1.1.1 0.680608365 0 0.319391635 IYVKQ
5 1.25.1.1 0.67357513 0 0.32642487 VLNF
5 2.31.1.1 0.643564356 0 0.356435644 SRPYM
5 2.41.1.1 0.621052632 0 0.378947368 VGFAT
5 3.19.1.5 0.583333333 0 0.416666667 CKNTK
5 3.25.1.1 0.566929134 0 0.433070866 IWDTA
5 3.19.1.3 0.555555556 0 0.444444444 CKNTK
5 1.34.1.4 0.518518519 0 0.481481481 GCINY
5 3.25.1.3 0.433070866 0 0.566929134 AGKGT
5 2.31.1.2 0.417040359 0 0.582959641 AGHCT
5 1.1.1.2 0.409090909 0 0.590909091 LHVDP
5 3.1.1.1 0.336734694 0 0.663265306 ALQRS
5 2.8.1.4 0.27734375 0 0.72265625 ATYGG
5 3.33.1.5 0.147368421 0 0.852631579 MVKQI
5 2.5.1.3 0.130434783 0 0.869565217 GMVGK
5 2.1.1.2 0.017800381 0 0.982199619 IFYIW
5 2.1.1.5 0 0 1 IFYIW
5 2.1.1.3 0 0 1 IFYIW
5 3.19.1.4 0 0 1 CKNTK
5 3.19.1.1 0 0 1 CKNTK
5 2.34.1.1 0 0 1 DTGTS
5 3.73.1.2 0 0 1 VREEV
5 3.1.1.5 0 0 1 PAIYY
5 1.25.1.2 0 0 1 CLKDR
5 3.50.1.7 0 0 1 GGPGC
5 2.19.1.1 0 0 1 LHFNP
6 2.1.1.4 0.916753382 0 0.083246618 RPGQQP
6 3.33.1.1 0.875 0 0.125 YFPVRG
6 2.5.1.1 0.869565217 0 0.130434783 IGGHGD
6 2.8.1.2 0.742971888 0 0.257028112 NASKFH
6 3.1.1.3 0.725 0 0.275 HWDDLA
6 1.25.1.3 0.69924812 0 0.30075188 TSAFQR
6 2.1.1.1 0.692645445 0 0.307354555 TDSLDL
6 2.31.1.1 0.643564356 0 0.356435644 LTAAHC
6 1.25.1.1 0.63372093 0 0.36627907 IVSFYF
6 3.19.1.5 0.630769231 0 0.369230769 FMLIPE
6 3.25.1.1 0.566929134 0 0.433070866 GVGKSA
6 3.19.1.3 0.555555556 0 0.444444444 FMLIPE
6 3.25.1.3 0.529411765 0 0.470588235 GIPQIS
6 1.34.1.5 0.527027027 0.166666667 0.472972973 EDLHDM
6 1.34.1.4 0.503184713 0 0.496815287 KVLGNP
6 3.50.1.7 0.419354839 0 0.580645161 NGGPGC
6 2.41.1.1 0.368421053 0 0.631578947 GVGFAT
6 2.31.1.2 0.356435644 0 0.643564356 IAGGEA
6 3.1.1.1 0.336734694 0 0.663265306 ARSVQA
6 2.8.1.4 0.310986965 0 0.689013035 KLIDLG
6 2.19.1.1 0.258823529 0 0.741176471 FHFNPR
6 2.1.1.3 0.21875 0 0.78125 TDSLDL
6 2.34.1.1 0.213114754 0 0.786885246 SNLWVP
6 3.1.1.5 0.185185185 0 0.814814815 STDVIY
6 3.33.1.5 0.147368421 0 0.852631579 VDFSAT
6 2.5.1.3 0.130434783 0 0.869565217 NNAGFP
6 2.1.1.5 0 0 1 TDSLDL
6 1.1.1.2 0 0 1 HVDPEN
6 2.1.1.2 0 0 1 KIDKTF
6 3.19.1.4 0 0 1 FMLIPE
6 3.19.1.1 0 0 1 FMLIPE
6 3.73.1.2 0 0.833333333 1 TANLAA
6 1.25.1.2 0 0 1 WEVVRA
7 2.19.1.1 1 1 0 VSSFFTY
7 2.1.1.4 0.918533605 0 0.081466395 SGVAGTH
7 2.5.1.1 0.869565217 0 0.130434783 AVGALTG
7 3.33.1.1 0.852631579 0 0.147368421 FPVRGRC
7 2.8.1.2 0.750972763 0 0.249027237 VENYGGE
7 1.25.1.3 0.72027972 0 0.27972028 NYGLLYC
7 3.1.1.3 0.716332378 0 0.283667622 IYWGQNG
7 2.1.1.1 0.691176471 0 0.308823529 LTIEKVT
7 2.31.1.1 0.643564356 0 0.356435644 VLTAAHC
7 3.25.1.3 0.638190955 0 0.361809045 APGAGKG
7 3.19.1.5 0.615384615 0 0.384615385 TPAEQFD
7 1.34.1.5 0.593023256 0.166666667 0.406976744 LGEKMKE
7 1.34.1.4 0.571428571 0 0.428571429 IDQNRDG
7 3.25.1.1 0.566929134 0 0.433070866 IEDSYRK
7 1.25.1.1 0.565517241 0 0.434482759 VSFYFKL
7 3.19.1.3 0.555555556 0 0.444444444 TPAEQFD
7 2.41.1.1 0.419354839 0 0.580645161 TFKNTEI
7 3.1.1.1 0.360655738 0 0.639344262 LGNSAG
7 2.31.1.2 0.356435644 0 0.643564356 LTAGHCT
7 2.8.1.4 0.257028112 0 0.742971888 DLGQLGI
7 1.25.1.2 0.16 0 0.84 AWEVVRA
7 3.33.1.5 0.147368421 0 0.852631579 VLIEFYA
7 2.5.1.3 0.130434783 0 0.869565217 FKNNAGF
7 2.1.1.2 0.026465028 0 0.973534972 SCDYKFC
7 2.1.1.5 0 0 1 SCDYKFC
7 2.1.1.3 0 0 1 SCDYKFC
7 1.1.1.2 0 0 1 PWTQRFF
7 3.19.1.4 0 0 1 TPAEQFD
7 3.19.1.1 0 0 1 TPAEQFD
7 2.34.1.1 0 0 1 SNLWVPS
7 3.73.1.2 0 0 1 TITLVRE
7 3.1.1.5 0 0 1 QALAFTL
7 3.50.1.7 0 0 1 LNGGPGC
8 2.1.1.4 0.912758997 0 0.087241003 VGYDETDK
8 2.5.1.1 0.898876404 0 0.101123596 HLIGGHGD
8 3.33.1.1 0.882352941 0 0.117647059 RLLLEYTD
8 2.8.1.2 0.742971888 0 0.257028112 APHRVLAT
8 3.1.1.3 0.725 0 0.275 HCNPAANT
8 2.1.1.1 0.698275862 0 0.301724138 TESKKPAF
8 1.34.1.4 0.656387665 0 0.343612335 QNGFISAA
8 2.19.1.1 0.647058824 0.333333333 0.352941176 WDEIDIEF
8 2.31.1.1 0.643564356 0 0.356435644 GKDSCQGD
8 1.25.1.1 0.63372093 0 0.36627907 IVSFYFKL
8 3.19.1.5 0.615384615 0 0.384615385 HRESTWSD
8 1.25.1.3 0.611650485 0 0.388349515 AFQRRAGG
8 1.34.1.5 0.588235294 0 0.411764706 CITTKELG
8 3.25.1.1 0.566929134 0 0.433070866 DPTIEDSY
8 3.19.1.3 0.538461538 0 0.461538462 HRESTWSD
8 3.25.1.3 0.433070866 0 0.566929134 PGAGKGTQ
8 2.31.1.2 0.356435644 0 0.643564356 NATARIGG
8 3.1.1.1 0.336734694 0 0.663265306 KAQKGVTA
8 2.8.1.4 0.257028112 0 0.742971888 YCNDSATV
8 3.33.1.5 0.256880734 0 0.743119266 FSGANKEK
8 2.5.1.3 0.130434783 0 0.869565217 AGFPHNVV
8 1.1.1.2 0.071428571 0 0.928571429 SELHCDKL
8 2.1.1.2 0.017800381 0 0.982199619 TESKKPAF
8 2.41.1.1 0 0 1 GVGFATRQ
8 2.1.1.5 0 0 1 TESKKPAF
8 2.1.1.3 0 0 1 TESKKPAF
8 3.19.1.4 0 0 1 HRESTWSD
8 3.19.1.1 0 0 1 HRESTWSD
8 2.34.1.1 0 0 1 GSSNLWVP
8 3.73.1.2 0 0 1 APLTITLV
8 3.1.1.5 0 0 1 ALAFTLTS
8 1.25.1.2 0 0 1 CAWEVVRA
8 3.50.1.7 0 0 1 LNGGPGCS
9 2.1.1.4 0.909502262 0 0.090497738 SASSQVNVA
9 3.33.1.1 0.902777778 0 0.097222222 LNEKFKLGL
9 2.5.1.1 0.869565217 0 0.130434783 NGAVGALTG
9 2.8.1.2 0.742971888 0 0.257028112 PVTTTVENY
9 3.1.1.3 0.716332378 0 0.283667622 RPLGDAVLD
9 3.19.1.1 0.695652174 0 0.304347826 GRQTRAARS
9 2.19.1.1 0.685897436 0.222222222 0.314102564 LGKDTTKVQ
9 2.1.1.1 0.680365297 0 0.319634703 KGYNGRLKV
9 2.41.1.1 0.643564356 0 0.356435644 STFKNTEIS
9 2.31.1.1 0.643564356 0 0.356435644 CQGDSGGPL
9 1.34.1.5 0.627659574 0.166666667 0.372340426 MIDQNRDGF
9 1.25.1.3 0.611650485 0 0.388349515 LQALAGISP
9 3.19.1.5 0.594594595 0 0.405405405 GRQTRAARS
9 3.25.1.1 0.566929134 0 0.433070866 LTIQLIQNH
9 1.25.1.1 0.565517241 0 0.434482759 QSQIVSFYF
9 2.31.1.2 0.530685921 0 0.469314079 VGFSVTRGA
9 3.25.1.3 0.503448276 0 0.496551724 QISTGDMLR
9 1.34.1.4 0.472972973 0 0.527027027 VFDKDQNGF
9 3.1.1.1 0.360655738 0 0.639344262 ANPNLGSPQ
9 1.1.1.2 0.341772152 0 0.658227848 LHCDKLHVD
9 2.8.1.4 0.318600368 0 0.681399632 GFQYDMADT
9 2.5.1.3 0.230769231 0 0.769230769 GMVGKVTVN
9 3.33.1.5 0.147368421 0 0.852631579 MIKPFFHSL
9 3.19.1.3 0.076923077 0 0.923076923 GRQTRAARS
9 2.1.1.2 0.026465028 0 0.973534972 IEGIKRSLS
9 2.1.1.5 0 0 1 IEGIKRSLS
9 2.1.1.3 0 0 1 IEGIKRSLS
9 3.19.1.4 0 0 1 GRQTRAARS
9 2.34.1.1 0 0 1 TGSSNLWVP
9 3.73.1.2 0 0 1 YIGVSVVLF
9 3.1.1.5 0 0 1 SRGVPAIYY
9 1.25.1.2 0 0 1 AFVLSLLMA
9 3.50.1.7 0 0 1 LNGGPGCSS
10 2.1.1.4 0.917610711 0 0.082389289 NSGDAIYDAD
10 2.5.1.1 0.869565217 0 0.130434783 YVGEQDFYVP
10 3.33.1.1 0.852631579 0 0.147368421 EYTDSSYEEK
10 2.19.1.1 0.808219178 0.555555556 0.191780822 EFLGKDTTKV
10 2.8.1.2 0.75 0 0.25 TLGNSTITTQ
10 3.1.1.3 0.716332378 0 0.283667622 ADYLWNNFLG
10 2.1.1.1 0.683972912 0 0.316027088 EVLVPPRIE
10 1.34.1.5 0.651452282 0 0.348547718 AFRVFDKDQN
10 2.41.1.1 0.650485437 0 0.349514563 FYIKTSTTVR
10 2.31.1.1 0.643564356 0 0.356435644 DSCQGDSGGP
10 1.25.1.3 0.611650485 0 0.388349515 QRRAGGVLVA
10 3.19.1.5 0.594594595 0 0.405405405 LRTVPLDVSK
10 3.25.1.1 0.566929134 0 0.433070866 MTEYKLVVVG
10 1.25.1.1 0.565517241 0 0.434482759 SPA
10 1.34.1.4 0.561797753 0 0.438202247 RVFDKDQNGF
10 3.19.1.3 0.555555556 0 0.444444444 LRTVPLDVSK
10 2.31.1.2 0.490196078 0 0.509803922 LFAGSTALGL
10 3.25.1.3 0.433070866 0 0.566929134 AGKGTQAQFI
10 3.1.1.1 0.360655738 0 0.639344262 TWKFFDGVDI
10 2.8.1.4 0.257028112 0 0.742971888 TLYFPQPTNT
10 3.33.1.5 0.147368421 0 0.852631579 KLVVVDFSAT
10 2.5.1.3 0.130434783 0 0.869565217 NNAGFPHNVV
10 2.1.1.2 0.017800381 0 0.982199619 EVLVPPRIE
10 2.1.1.5 0 0 1 EVLVPPRIE
10 2.1.1.3 0 0 1 EVLVPPRIE
10 1.1.1.2 0 0 1 LHCDKLHVDP
10 3.19.1.4 0 0 1 LRTVPLDVSK
10 3.19.1.1 0 0 1 LRTVPLDVSK
10 2.34.1.1 0 0 1 TGSSNLWVPS
10 3.73.1.2 0 0.333333333 1 LTITLVREEV
10 3.1.1.5 0 0 1 TSRGVPAIYY
10 1.25.1.2 0 0 1 ANAVLRAQHL
10 3.50.1.7 0 0 1 VLWLNGGPGC
All results for word lengths 1 to 10 for detecting fold IDs
Word Length FoldID Specificity Precision RFP Distinct-Fragment
1 1.34 0.887108 0 0.112892 Z
1 3.1 0.872453 0.061111 0.127547 x
1 1.25 0.839246 0 0.160754 B
1 2.19 0.827869 0 0.172131 B
1 2.41 0.576471 0 0.423529 X
1 2.31 0.552106 0 0.447894 x
1 3.73 0.448276 0 0.551724 X
1 1.1 0.174603 0 0.825397 B
1 2.8 0.145798 0 0.854202 X
1 2.34 0.142857 0 0.857143 x
1 3.25 0.105634 0 0.894366 x
1 2.5 0.098039 0 0.901961 Z
1 2.1 0.003264 0 0.996736 Z
1 3.5 0 0 1 x
1 3.33 0 0 1 x
1 3.19 0 0 1 x
2 1.34 1 1 0 AZ
2 3.73 0.907193 0.166667 0.092807 XW
2 1.25 0.896104 0.944828 0.103896 QZ
2 3.19 0.88935 0.74359 0.11065 Sx
2 2.41 0.837838 0 0.162162 XQ
2 2.34 0.823529 0 0.176471 Lx
2 1.1 0.800384 0 0.199616 ZH
2 3.25 0.098765 0.425197 0.901235 xM
2 2.1 0.090387 0.769287 0.909613 ZP
2 3.33 0.068966 0.147368 0.931034 xF
2 2.5 0.056738 0.036232 0.943262 BR
2 2.19 0 0 1 BM
2 3.5 0 0 1 xS
2 2.8 0 0.060241 1 XC
2 2.31 0 0.618812 1 xI
2 3.1 0 0 1 Ax
3 1.1 0.960123 0.875 0.039877 YH
3 2.41 0.918478 0.166667 0.081522 XQT
3 1.34 0.884956 0.919753 0.115044 GBG
3 2.34 0.675676 0 0.324324 LxS
3 2.5 0.612903 0.826087 0.387097 XPM
3 3.19 0.381974 0.538462 0.618026 xNE
3 3.33 0.330579 0.147368 0.669421 PNX
3 2.1 0.084507 0.834425 0.915493 XSG
3 2.19 0 0 1 FXP
3 3.73 0 0 1 WWF
3 3.5 0 0 1 IxS
3 2.8 0 0.947791 1 TXH
3 2.31 0 0.866337 1 HCW
3 3.25 0 0.96063 1 DxM
3 3.1 0 0.969444 1 KAx
3 1.25 NA 1 NA WNB
4 2.8 1 1 0 QIMY
4 3.33 1 1 0 APWC
4 1.34 1 1 0 QDMI
4 3.25 1 1 0 QNHF
4 3.19 1 1 0 WKLD
4 2.5 0.929204 0.942029 0.070796 WYWS
4 2.41 0.873016 0.333333 0.126984 MPNF
4 2.31 0.780488 0.955446 0.219512 ICLP
4 1.1 0.643836 0 0.356164 HLDN
4 2.34 0.52 0 0.48 WILG
4 2.1 0.176776 0.827147 0.823224 VYYC
4 2.19 0 0 1 HFNP
4 3.73 0 0 1 WWFF
4 3.5 0 0 1 HMVP
4 1.25 NA 1 NA CAWE
4 3.1 NA 1 NA IYIT
5 1.25 1 1 0 SFYFK
5 2.8 1 1 0 CGYSD
5 3.33 1 1 0 CGHCK
5 1.34 1 1 0 DKDGD
5 2.5 1 1 0 HQWYW
5 2.41 0.84 0 0.16 VGFAT
5 3.19 0.448276 0.948718 0.551724 FVKAI
5 1.1 0.409091 0 0.590909 LHVDP
5 2.19 0.112676 0 0.887324 LHFNP
5 2.1 0.084967 0.847162 0.915033 RFSGS
5 2.34 0 0 1 DTGTS
5 3.73 0 0 1 VREEV
5 3.5 0 0 1 GGPGC
5 2.31 NA 1 NA DSGGP
5 3.25 NA 1 NA IWDTA
5 3.1 NA 1 NA HWDLP
6 3.73 1 1 0 TANLAA
6 1.25 1 1 0 WEVVRA
6 3.33 1 1 0 WCGHCK
6 1.34 1 1 0 LFDKDG
6 3.19 1 1 0 GAGILD
6 3.1 1 1 0 EPFVTL
6 2.5 1 1 0 HQWYWS
6 2.41 0.817259 0 0.182741 GVGFAT
6 3.5 0.419355 0 0.580645 NGGPGC
6 2.1 0.384615 0.935953 0.615385 RFSGSG
6 2.34 0.213115 0 0.786885 SNLWVP
6 2.19 0 0 1 FHFNPR
6 1.1 0 0 1 HVDPEN
6 2.8 NA 1 NA NASKFH
6 2.31 NA 1 NA GDSGGP
6 3.25 NA 1 NA GVGKSA
7 1.25 1 1 0 AWEVVRA
7 2.8 1 1 0 VENYGGE
7 3.33 1 1 0 PWCGHCK
7 1.34 1 1 0 EAFSLFD
7 3.25 1 1 0 IEDSYRK
7 2.5 1 1 0 IGHQWYW
7 2.41 0.803279 0 0.196721 TFKNTEI
7 2.1 0.359551 0.917031 0.640449 WVRQAPG
7 2.34 0 0 1 SNLWVPS
7 2.19 0 0.555556 1 VSSFFTY
7 1.1 0 0 1 PWTQRFF
7 3.73 0 0 1 TITLVRE
7 3.5 0 0 1 LNGGPGC
7 2.31 NA 1 NA GDSGGPL
7 3.19 NA 1 NA QIERTIA
7 3.1 NA 1 NA TLHHFDT
8 1.25 1 1 0 CAWEVVRA
8 3.33 1 1 0 APWCGHCK
8 1.34 1 1 0 LGTVMRSL
8 3.25 1 1 0 DPTIEDSY
8 2.5 1 1 0 WYWSYEYS
8 2.41 0.643564 0 0.356436 GVGFATRQ
8 2.1 0.284091 0.931223 0.715909 KGRFTISR
8 1.1 0.1875 0 0.8125 SELHCDKL
8 2.34 0 0 1 GSSNLWVP
8 2.19 0 0 1 WDEIDIEF
8 3.73 0 0 1 APLTITLV
8 3.5 0 0 1 LNGGPGCS
8 3.19 0 0.923077 1 IYQVPVYS
8 2.8 NA 1 NA APHRVLAT
8 2.31 NA 1 NA GDSGGPLV
8 3.1 NA 1 NA TLHHFDTP
9 3.33 1 1 0 EFYAPWCGH
9 1.34 1 1 0 ITTKELGTV
9 2.5 1 1 0 LRLLYLLDE
9 2.41 0.798883 0 0.201117 STFKNTEIS
9 1.1 0.341772 0 0.658228 LHCDKLHVD
9 2.19 0.222222 0 0.777778 LGKDTTKVQ
9 2.1 0.060606 0.966157 0.939394 VKGRFTISR
9 2.34 0 0 1 TGSSNLWVP
9 3.73 0 0 1 YIGVSVVLF
9 3.5 0 0 1 LNGGPGCSS
9 1.25 NA 1 NA QSQIVSFYF
9 2.8 NA 1 NA PVTTTVENY
9 2.31 NA 1 NA CQGDSGGPL
9 3.25 NA 1 NA LTIQLIQNH
9 3.19 NA 1 NA FVAGLGGIG
9 3.1 NA 1 NA EPFVTLHHF
10 2.8 1 1 0 TLGNSTITTQ
10 1.34 1 1 0 KELGTVMRSL
10 3.25 1 1 0 MTEYKLVVVG
10 2.5 1 1 0 HQWYWSYEYS
10 2.41 0.809524 0 0.190476 FYIKTSTTVR
10 2.1 0.360465 0.939956 0.639535 SCKASGYTFT
10 2.19 0.171053 0 0.828947 EFLGKDTTKV
10 2.34 0 0 1 TGSSNLWVPS
10 1.1 0 0 1 LHCDKLHVDP
10 3.73 0 0.333333 1 LTITLVREEV
10 3.5 0 0 1 VLWLNGGPGC
10 3.19 0 0.948718 1 FWDKRKGGPG
10 1.25 NA 1 NA LQCLEEELKP
10 3.33 NA 1 NA KLVVVDFSAT
10 2.31 NA 1 NA DSCQGDSGGP
10 3.1 NA 1 NA FVTLHHFDTP
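The rows above appear to follow the column order of the header that the testing script in Appendix C prints (SeqSplit, FamID, Spec, Precision, RFP, Distinct-Fragment). Under that assumption, a minimal sketch for reading one row and checking that Spec and RFP sum to 1 (both are divided by the same denominator in the script) could look like this. The helper names `parse_result_row` and `spec_rfp_consistent` are hypothetical and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: split one whitespace-separated result row into named
# fields. The column names are assumed from the header printed by the
# testing script.
sub parse_result_row {
    my ($line) = @_;
    my %row;
    @row{qw(SeqSplit FamID Spec Precision RFP Fragment)} = split ' ', $line;
    return \%row;
}

# True when Spec and RFP are both numeric and sum to 1 within rounding error.
sub spec_rfp_consistent {
    my ($row) = @_;
    return 0 if $row->{Spec} eq "NA" or $row->{RFP} eq "NA";
    return abs($row->{Spec} + $row->{RFP} - 1) < 1e-6;
}

my $row = parse_result_row("4 2.5 0.929204 0.942029 0.070796 WYWS");
print "$row->{FamID}: ", (spec_rfp_consistent($row) ? "consistent" : "check"), "\n";
```

Rows whose Spec and RFP are both NA correspond to classes for which the script found a zero denominator, so the check is skipped for them.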
5 Appendix B References
[1] N. Yosef, R. Sharan, and W.S. Noble. Improved network-based identification of protein orthologs. Bioinformatics, 24(16):i200–i206, 2008.
[2] C. Knaub. Molecular Evolution? http://www.icr.org/article/molecular-evolution/
[3] I. Budowski-Tal, Y. Nov, and R. Kolodny. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci U S A, 2010 Feb 3. PMID 20133727.
[4] M. Ester and X. Zhang. A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data. Proc. SIAM Int'l Conf. on Data Mining (SDM '04), Apr. 2004.
[5] N.M. Zaki, R.M. Ilias, and S. Derus. A Comparative Analysis of Protein Homology Detection Methods. Journal of Theoretics, 5-4, 2003.
[6] S. Diplaris, G. Tsoumakas, P.A. Mitkas, and I. Vlahavas. Protein Classification with Multiple Algorithms. In: Proc. of 10th Panhellenic Conference in Informatics, Volos, Greece, November 21-23. LNCS. Springer, Heidelberg (2005).
6 Appendix C The code:
#!/usr/bin/perl -w
##### Training Code #########
#Usage: perl homology.bagOfWordsTrain.pl fisher-scop-data maxLengthWords (fold or family)
#Notes: classLabel is the same as FamilyID
# Negative training data are not used in this code
#author: Zina Saadi ([email protected])
#Purpose: Protein homology detection based on predicting the familyID
# of the sequence using positive training data and positive and negative test data
#Training Data: http://compbio.soe.ucsc.edu/discriminative/fisher-scop-data.tar.gz
#Runs in versions of Perl above 5.0
use strict;
use File::Find;
#command line input arguments
my $directory = $ARGV[0]; # command line directory name
my $maxSplit = int($ARGV[1]); # max length for words
my $classOption = $ARGV[2]; # specify type of class (fold or family)
#saving training data composed of indexed words found in sequences
#in the format of wordOfSeq,indexOfWord
open(TRAINKEY, "> trainKey.txt") or die("Couldn't open Keys file\n");
#saving training data composed of the indexes of words found in sequences and their frequencies per family
#in the format of indexOfWord1 freqOfWord1, indexOfWord2 freqOfWord2,...., familyID
open(TRAINSPARSE, "> trainSparseVectors.txt") or die("Couldn't open Sparse Data\n");
#root directory to search in for training data
my $trainDirectory = "./$directory";
#Initialize global variables
my %overallI=();
my @files=(); #start with an empty list so that file names begin at index 0
my $classLabel="";
my %tokensFreq=();
my %fileI=();
#search for all the training files in all the sub-directories
find(sub { push @files, $File::Find::name if /pos\-train\.seq$/ }, $trainDirectory); #find all positive training data files
for (my $f=0;$f<=$#files;$f++){
    my $fileName=$files[$f];
    #get class labels
    if ($classOption eq "fold") {
        ($classLabel)= $fileName =~ /\/(\d{1,3}\.\d{1,3})\//i; #to capture fold ID
    } elsif ($classOption eq "family") {
        ($classLabel)= $fileName =~ /\/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\//i; #to capture familyID
    }else {
        die("last argument must be of the string value \"family\" or \"fold\"\n");
    }
    open(INPUT, "< $fileName") or die("Couldn't open $fileName.\n");
    my $seq="";
    while (<INPUT>) { #loop through each line of the file
        chomp;
        if (not index($_,">")==0){
            $seq .= $_; #concatenate segments of the same seq
        } elsif (index($_,">")==0 and not $seq eq ""){
            $seq .= ","; #to flag seq boundaries
        }
    }
    close INPUT;
    my @tokens=();
    if (not $seq eq ""){
        my @seqArray=split (/,/,$seq);
        foreach (@seqArray){ #loop through each sequence of the file
            %tokensFreq=();
            %fileI=();
            if ($maxSplit==1){
                @tokens = split(/(.{1})/, $_);
            }else{
                @tokens = split(/(.{1,$maxSplit})/, $_);
            }
            for (my $i=0;$i<=$#tokens;$i++){
                chomp($tokens[$i]);
                if(not $tokens[$i] eq "") {
                    if (exists($tokensFreq{$tokens[$i]})){
                        $tokensFreq{$tokens[$i]}+=1;
                    }else{
                        if (not (exists($overallI{$tokens[$i]}))){
                            my $size =keys %overallI;
                            $overallI{$tokens[$i]}=$size;
                            print TRAINKEY "$tokens[$i],$size\n";
                        }
                        if (not (exists($fileI{$tokens[$i]}))){
                            my $size =keys %fileI;
                            $fileI{$tokens[$i]}=$size;
                        }
                        $tokensFreq{$tokens[$i]}=1;
                    }
                }
            }
            for my $k1 (sort {$fileI{$a} <=> $fileI{$b}} keys %fileI) {
                print TRAINSPARSE "$overallI{$k1} $tokensFreq{$k1},";
            }
            print TRAINSPARSE "$classLabel\n";
        }
    }
}
close TRAINKEY;
close TRAINSPARSE;
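The training script above builds its "words" by calling split with a capturing group, which cuts each sequence into consecutive, non-overlapping chunks of at most maxLengthWords residues. The following is a minimal, standalone sketch of that step only; the helper names `split_into_words` and `word_counts` are hypothetical and do not appear in the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence into consecutive, non-overlapping
# words of at most $max_len residues.
sub split_into_words {
    my ($seq, $max_len) = @_;
    # The capturing group makes split return the matched chunks themselves;
    # the empty strings that split places between them are filtered out.
    return grep { $_ ne "" } split /(.{1,$max_len})/, $seq;
}

# Hypothetical helper: count how often each word occurs in one sequence.
sub word_counts {
    my ($seq, $max_len) = @_;
    my %freq;
    $freq{$_}++ for split_into_words($seq, $max_len);
    return \%freq;
}

my @words = split_into_words("HQWYWSYEYS", 4);
print join(" ", @words), "\n";    # prints: HQWY WSYE YS
```

Because the chunks do not overlap, a word is only counted when it happens to start at a multiple of the word length; a sliding window would be a different design and is not what the script does.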
#!/usr/local/bin/perl
##### Testing Code #########
#Usage: perl homology.bagOfWordsTest.pl fisher-scop-data maxLengthWords (family or fold)
#Notes: classLabel is the same as FamilyID
#
#author: Zina Saadi ([email protected])
#Purpose: Protein homology detection based on predicting the familyID
# of the sequence using positive training and testing data
#Testing Data: http://compbio.soe.ucsc.edu/discriminative/fisher-scop-data.tar.gz
#Runs in versions of Perl above 5.0
use strict;
use File::Find;
my $directory = $ARGV[0]; # directory (root of train and test data)
my $maxSplit = int($ARGV[1]); # max length for words of a sequence
my $classOption = $ARGV[2]; # specify type of class (fold or family)
#output results data to a file
open(RESULTS, "> results".$maxSplit.".".$classOption.".txt") or die("Couldn't open results file\n");
#Results Header Table
print RESULTS "SeqSplit\tFamID\tSpec\tPrecision\tRFP\tDistinct-Fragment\n";
my $testDirectory = "./$directory";
my $indexSplit=$maxSplit;
#Initialize global variables
my %count=();
my @trainKeys=();
my %totalNumbers=();
my %wordsCountP=();
my %distinctWords=();
my %instancesArray=();
my @files=[];
my $classLabel="";
system("rm trainKey.txt"); #remove any previous generated file
system("rm trainSparseVectors.txt"); #remove any previous generated file
system("perl homology.bagOfWordsTrain.pl $directory $indexSplit $classOption"); #call training code
print "done with >perl homology.bagOfWordsTrain.pl $directory $indexSplit $classOption\n"; #print to system the status of the loop
#retrieve training data composed of indexed words found in sequences
#in the format of wordOfSeq,indexOfWord
open(TRAINKEY, "./trainKey.txt") or die("Couldn't open Keys file\n");
while (<TRAINKEY>) {
    chomp;
    my @tempSplit=split(/,/,$_);
    push (@trainKeys, $tempSplit[0]);
}
close TRAINKEY;
#retrieve training data composed of the indexes of words found in sequences and their frequencies per family
#in the format of indexOfWord1 freqOfWord1, indexOfWord2 freqOfWord2,...., familyID
open(SPARSE, "./trainSparseVectors.txt") or die("Couldn't open Sparse Data file\n");
while (<SPARSE>) {
    chomp;
    my @tempISplit=split(/,/,$_);
    my $label=$tempISplit[$#tempISplit]; #the class label is the last field
    for (my $i=0;$i<$#tempISplit;$i++){
        if(not $tempISplit[$i] eq ""){
            my @tempJSplit=split(/ /,$tempISplit[$i]); #(indexOfWord, freqOfWord)
            my $word=$trainKeys[$tempJSplit[0]];
            $instancesArray{$label}{$word}+=$tempJSplit[1]; #count of this word in the class
            $wordsCountP{$label}+=$tempJSplit[1]; #total number of words in the class
        }
    }
    $count{$label}+=1; #number of training sequences in the class
}
close SPARSE;
#search for all the testing files in all the sub-directories
@files=(); #start from an empty list so that file names begin at index 0
find(sub { push @files, $File::Find::name if /pos\-test\.seq$/ }, $testDirectory); #find all positive testing data files
for (my $f=0;$f<=$#files;$f++){
    my $fileName=$files[$f];
    open(INPUT, "< $fileName") or die("Couldn't open $fileName.\n");
    #get class labels
    if ($classOption eq "fold") {
        ($classLabel)= $fileName =~ /\/(\d{1,3}\.\d{1,3})\//i; #to capture fold ID
    } elsif ($classOption eq "family") {
        ($classLabel)= $fileName =~ /\/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\//i; #to capture familyID
    }else {
        die("last argument must be of the string value \"family\" or \"fold\"\n");
    }
    my %tokensFreq=();
    my $totalWordsFreq=0;
    #combine all the lines of the same protein sequence
    #and separate those instances within the same family with comma
    my $seq ="";
    while (<INPUT>) { #loop through each line of the file
        chomp;
        if (not index($_,">")==0){
            $seq .= $_; #concatenate segments of the same seq
        } elsif (index($_,">")==0 and not $seq eq ""){
            $seq .= ","; #to flag seq boundaries
        }
    }
    close INPUT;
    if (not $seq eq ""){
        my @seqArray=split (/,/,$seq);
        my @tokens=();
        foreach (@seqArray){ # loop thru each seq instance
            if ($maxSplit==1){
                @tokens = split(/(.{1})/, $_);
            }else{
                @tokens = split(/(.{1,$maxSplit})/, $_);
            }
            for (my $i=0;$i<=$#tokens;$i++){
                chomp($tokens[$i]);
                if (not $tokens[$i] eq "") {
                    if (exists($tokensFreq{$tokens[$i]})){
                        $tokensFreq{$tokens[$i]}+=1;
                    }else{
                        $tokensFreq{$tokens[$i]}=1;
                    }
                    $totalWordsFreq++;
                }
            }
            #calculate prediction based on taking the argmax
            #of the sum of the log of the probabilities
            my %prediction=();
            my $labels_size= keys %count;
            for my $class (keys %count){
                my $probP=1/$labels_size; #uniform prior over the classes
                my $bayesProb=log($probP);
                for my $word (keys %tokensFreq){
                    my $probWord=$tokensFreq{$word}/$totalWordsFreq;
                    #words never seen in this class count as 0
                    my $probWordGivenP=($instancesArray{$class}{$word} || 0)/($wordsCountP{$class});
                    #note: by operator precedence the 3 is added after the division,
                    #i.e. this is log( (P(word|class)+1)/P(word) + 3 )
                    $bayesProb+=log(($probWordGivenP+1)/$probWord+3);
                }
                $prediction{$class}=$bayesProb;
            }
            #take the argmax
            my @sortedP = reverse sort {$prediction{$a} <=> $prediction{$b}} keys %prediction;
            my $predictedL =$sortedP[0];
            if ($classLabel eq $predictedL){
                $totalNumbers{$classLabel.".TPos"}+=1;
            }else{
                $totalNumbers{$classLabel.".FPos"}+=1;
                $totalNumbers{$predictedL.".TNeg"}+=1;
            }
        } #foreach (@seqArray)
    } #if (not $seq eq "")
} #loop through files
#calculating all applicable types of measurement
my $specifity=0;
my $precision=0;
my $FP_rate=0;
my $Ave_RFP=0; #running sum of the false positive rates (averaged at the end)
my $countRFP=0; #number of classes with a defined false positive rate
my @seenBefore=(); #to remove duplicates
$classLabel=""; #reset the default value
for my $class (keys %count){
    my $TN_FP=($totalNumbers{$class.".TNeg"}+$totalNumbers{$class.".FPos"}); #denominator for the specificity measure
    my $TP_FP=($totalNumbers{$class.".TPos"}+$totalNumbers{$class.".FPos"}); #denominator for the precision measure
    if (not $TN_FP ==0){ $specifity=($totalNumbers{$class.".TNeg"}/$TN_FP);}else {$specifity="NA";}
    if (not $TN_FP ==0){ $FP_rate=($totalNumbers{$class.".FPos"}/$TN_FP); $Ave_RFP+=$FP_rate; $countRFP++;}else {$FP_rate="NA";}
    if (not $TP_FP ==0){ $precision=($totalNumbers{$class.".TPos"}/$TP_FP); }else {$precision="NA";}
    #calculating the likelihood for a word given a class (family or fold) in the training files for each class.
    for my $word (reverse sort {$instancesArray{$class}{$a} <=> $instancesArray{$class}{$b}} keys %{$instancesArray{$class}}){
        my $probWordGivenP=($instancesArray{$class}{$word})/($wordsCountP{$class});
        my $log_likelihood=log($probWordGivenP);
        for my $other_protein (keys %count){
            if ($class ne $other_protein){
                #applying Laplace smoothing so that words unseen in the other class do not give a zero probability
                my $probWordGivenNotP=(($instancesArray{$other_protein}{$word} || 0)+1)/($wordsCountP{$other_protein}+3);
                $log_likelihood-=log($probWordGivenNotP);
            }
        }
        $distinctWords{$class}{$word}=$log_likelihood;
    }
    my @sortedW= reverse sort {$distinctWords{$class}{$a} <=> $distinctWords{$class}{$b}} keys %{$distinctWords{$class}};
    print RESULTS "$indexSplit\t$class\t$specifity\t$precision\t$FP_rate\t$sortedW[0]\n";
}
print RESULTS "$indexSplit\t",$Ave_RFP/$countRFP, "\n";
close RESULTS;
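The last loop of the testing script ranks each training word of a class by its log-probability in that class minus the smoothed log-probabilities in every other class, and reports the top-ranked word as the distinct fragment. A minimal, standalone sketch of that ranking on toy counts follows; the function name `most_distinct_word` and the example counts are hypothetical, and ties are broken alphabetically here, which the original script does not specify.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the ranking loop of the testing script.
# $counts->{class}{word} holds the training count of a word in a class.
# Returns the word with the highest score for $class.
sub most_distinct_word {
    my ($counts, $class) = @_;

    # Total number of words seen per class.
    my %total;
    for my $c (keys %$counts) {
        $total{$c} += $_ for values %{ $counts->{$c} };
    }

    my ($best, $best_score);
    for my $word (sort keys %{ $counts->{$class} }) {
        # log P(word | class)
        my $score = log($counts->{$class}{$word} / $total{$class});
        # Subtract the smoothed log P(word | other class), using the same
        # +1 / +3 constants as the script.
        for my $other (grep { $_ ne $class } keys %$counts) {
            my $n = exists $counts->{$other}{$word} ? $counts->{$other}{$word} : 0;
            $score -= log(($n + 1) / ($total{$other} + 3));
        }
        ($best, $best_score) = ($word, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Toy counts, invented for illustration only.
my %counts = (
    "1.1" => { HQWY => 3, GGPG => 1 },
    "2.1" => { GGPG => 4, RFSG => 3 },
);
print most_distinct_word(\%counts, "1.1"), "\n";    # prints: HQWY
```

A word scores highest when it is frequent in its own class and rare or absent elsewhere, which is why a word shared between classes (GGPG in the toy data) loses to one that is unique to a class.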
7 Appendix D Slides
8 Appendix E Measurement definitions and formulas from (http://webdocs.cs.ualberta.ca/~bioinfo/data/publications/theses/2004-Theses-BrettPoulin.pdf)
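The formulas themselves are not reproduced in this text. As a rough guide, and assuming the cited thesis uses the standard confusion-matrix definitions, the three measures reported in Appendix A can be sketched as follows. The function name `confusion_metrics` and the example counts are hypothetical; "NA" is returned for an undefined ratio, as in the results table.

```perl
use strict;
use warnings;

# Hypothetical helper: standard textbook measures from confusion-matrix
# counts (tp, fp, tn). These are assumed, not copied from the cited thesis.
sub confusion_metrics {
    my (%n) = @_;
    # Return "NA" instead of dividing by zero.
    my $ratio = sub {
        my ($num, $den) = @_;
        return $den ? $num / $den : "NA";
    };
    return (
        specificity => $ratio->($n{tn}, $n{tn} + $n{fp}),   # TN / (TN + FP)
        precision   => $ratio->($n{tp}, $n{tp} + $n{fp}),   # TP / (TP + FP)
        fp_rate     => $ratio->($n{fp}, $n{tn} + $n{fp}),   # FP / (TN + FP)
    );
}

# Invented example counts.
my %m = confusion_metrics(tp => 8, fp => 2, tn => 18);
printf "specificity=%.2f precision=%.2f fp_rate=%.2f\n",
    @m{qw(specificity precision fp_rate)};
```

With these definitions the specificity and the false positive rate always sum to 1 whenever both are defined.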