MODEL QUALITY ASSESSMENT OF PREDICTED PROTEIN STRUCTURES USING A SUPPORT VECTOR
MACHINE
A Dissertation
Submitted In Partial Fulfilment Of The Requirements For The Degree Of
MASTER OF SCIENCE
In
NETWORK AND E-BUSINESS CENTERED COMPUTING,
in
THE BIOCENTRE / SCHOOL OF BIOLOGICAL SCIENCES
UNIVERSITY OF READING
by
Albert Vilamala Muñoz
27/06/2007
Supervisor Dr. Liam James McGuffin
ACKNOWLEDGEMENTS
I would especially like to thank my supervisor, Dr. Liam James McGuffin, who helped me greatly in the realisation of this project. Many thanks also to Ana Carmen Cajaraville Leiro for sharing her knowledge of Biochemistry, to Bérengère Bressenot, who spent time revising the grammar of previous reports, to Simon Clarke for revising the grammar of this document, and to Jean-Loup Rakhodai for the printed version of this dissertation. Elena Carrara and Anaïs Chevigny also helped with the references section.
To all of them: MANY THANKS!
ABSTRACT
The function that a protein performs is strictly related to its physical (3D) structure.
Determining a protein's 3D structure from its amino acid sequence using experimental techniques is
very costly and time consuming. For this reason, finding computational techniques for this purpose
is an ongoing area of research.
This project aims to develop software that increases the cumulative mean model quality (CMMQ) of
protein models supplied by different Model Quality Assessment Programs (MQAPs). To achieve this
target, the output scores of several MQAPs are taken and combined using a Support Vector Machine
(SVM) in order to build a system which offers better accuracy than the existing ones.
Although the CMMQ was not increased, the classification of high quality models at varying
thresholds of model quality was achieved by using a decision tree of binary SVMs.
CONTENTS
Acknowledgements............................................................................................................................. 2
Abstract ............................................................................................................................................... 3
Contents .............................................................................................................................................. 4
Introduction ........................................................................................................................................ 6
DNA and Genetic Information ................................................................................................... 6
Proteins ....................................................................................................................................... 7
Protein Structure Prediction...................................................................................................... 10
Ab-initio or Template Free Modelling ............................................................................ 11
Comparative or Template Based Protein Modelling ....................................... 11
Steps in the Homology Modelling Process ...................................................................... 12
Model Quality Assessment Programs (MQAP) ....................................................................... 13
PRO-Q ............................................................................................................................. 13
MODCHECK .................................................................................................................. 13
MODSSEA ....................................................................................................................... 13
ModFOLD ....................................................................................................................... 14
Goals and Milestones................................................................................................................ 14
Preliminary System Design ............................................................................................................. 16
Support Vector Machine (SVM) .............................................................................................. 16
An Easy Example ............................................................................................................. 16
Applying SVM to our problem ......................................................................................... 17
Kernel Function ........................................................................................................................ 17
SVM Multi-Class ...................................................................................................................... 19
SVM Multi-Class Using Binary Classifiers ..................................................................... 19
SVM Multi-Class Using All Data at Once ...................................................................... 20
Technical Requirements ........................................................................................................... 21
Implementation Strategy ................................................................................................................. 22
Planning the Project ....................................................................................................... 22
Background Study ........................................................................................................... 22
SVM Study ....................................................................................................................... 22
First Approach ................................................................................................................ 22
Choosing Parameters ...................................................................................................... 23
First Multi-Class Approach ............................................................................................ 24
Testing the First Multi-Class Approach .......................................................................... 24
10-Class Test ................................................................................................................... 25
20-Class Test ................................................................................................................... 26
BST approach .................................................................................................................. 29
4-Class BST ..................................................................................................................... 29
8-Class BST ..................................................................................................................... 30
First Approach for High Quality Models ........................................................................ 30
Focusing on Very High Quality Models (above 0.875) .................................................. 32
Calculating Sensitivity and Specificity ............................................................................ 32
Independent Binary Classification .................................................................................. 33
Detailed Software Design................................................................................................................. 37
Scope ........................................................................................................................................ 37
Identification ................................................................................................................... 37
System Overview ............................................................................................................. 37
Document Overview ........................................................................................................ 37
System-Wide Design ................................................................................................................ 37
Architectural Design ....................................................................................................... 37
Components ..................................................................................................................... 38
Concept of Execution ...................................................................................................... 39
System Interface Design .................................................................................................. 40
Detailed Description of Components ....................................................................................... 42
Class BST ........................................................................................................................ 42
Class Node ...................................................................................................................... 42
Class NodeSVM ............................................................................................................... 44
Class Fileformat .............................................................................................................. 45
Class SVMtree ................................................................................................................. 45
Class ProteinQualityCalculator ...................................................................................... 45
LIBSVM Library .............................................................................................................. 45
Class Reformat ................................................................................................................ 46
Conclusions ....................................................................................................................................... 47
References ......................................................................................................................................... 49
Appendices ........................................................................................................................................ 51
Class BST Java code ....................................................................................................... 51
Class Node Java code ..................................................................................................... 56
Class NodeSVM Java code .............................................................................................. 58
Class Fileformat Java code ............................................................................................. 61
Class SVMtree Java code ................................................................................................ 64
Class ProteinQualityCalculator Java code .................................................................... 66
Class Reformat Java code ............................................................................................... 72
INTRODUCTION
In recent years, great strides have been made in the field of molecular biology. One of the most
important was the recent mapping of the entire human genome sequence. Genome sequencing
projects have opened the door to a better understanding of molecular processes, but continuing
research in this field is needed in order to improve our knowledge and to enable us to use it for
different purposes, such as in medicine (e.g. understanding diseases and developing drugs
against them) and agriculture (e.g. developing products against pests).
This project aims to contribute to the molecular biology field, specifically in Protein Structure
Prediction, where better methods to assess the quality of a prediction are needed; providing such a
method is the main goal of the project.
This dissertation starts with a brief explanation of the basic biochemistry concepts which are
essential to understand the motivation for this project. First of all, an introduction to DNA and
its processes is given. Secondly, the knowledge provided in the DNA section is used to
understand protein structure; the amino acids that make up proteins are presented, too. This, in
turn, helps us to understand some of the basic functions a protein can carry out. Then, an
overview of Protein Structure Prediction is given; different approaches to the problem are
shown and the most widely used procedure is outlined. Finally, the last step of this process, called
Model Quality Assessment, is described, which is the focus of this project. In this part, the need
for computer programs that perform this function is introduced and the basic ideas behind several
MQAPs (which are used in the development of the project) are presented.
In the second chapter, called Preliminary System Design, the mathematical background of the
method chosen to solve the problem is explained. Moreover, all the decisions made during the
project are justified in this section.
The third section (Implementation Strategy) outlines all the work carried out in this project; in
other words, it explains in detail all the steps we followed.
A detailed software design description is given in the fourth chapter. To write this documentation,
the Software Design Description standard developed by IEEE (IEEE 1016-1998) has been taken as
a guide.
The last section of this dissertation is the Conclusions, where the results of the project are
evaluated against the goals presented in this chapter.
DNA AND GENETIC INFORMATION
The nucleus of most cells contains deoxyribonucleic acid or DNA. This material stores the genetic
information needed by the cell in order to synthesise the proteins, which carry out important
biological functions.
How is this information stored?
The whole DNA string is known as a genome. This huge string can be broken down into several
substrings (known as genes), each of which contains the information needed to synthesise a
specific protein.
Each string is made up of 4 nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T).
DNA can be seen as a long string of A's, C's, G's and T's. Some of these substrings of characters,
known as exons, contain genetic information (i.e. for encoding a protein) and some others are non-
coding regions known as introns.
The DNA sequences are translated to mRNA and the introns are excised.
The mRNA is read in sets of 3 characters (called codons), each of which encodes one amino acid,
apart from some codons that are used as 'punctuation marks' to signal the beginning and the end of
the gene.
In summary, each protein is made up from a sequence of amino acids, which are encoded by genes.
(Cold Spring Harbor Laboratory, 2002)
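As a toy illustration of this reading process, the sketch below translates a short mRNA string into an amino acid sequence. The codon table here is a hand-picked subset of the real genetic code (which has 64 codons), so the example is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

public class CodonTranslator {
    // Toy subset of the standard genetic code (the real table has 64 codons).
    private static final Map<String, String> CODE = new HashMap<>();
    static {
        CODE.put("AUG", "M"); // Methionine, also the start codon
        CODE.put("UUU", "F"); // Phenylalanine
        CODE.put("GGC", "G"); // Glycine
        CODE.put("UAA", "");  // Stop codon (a 'punctuation mark')
    }

    // Read the mRNA in sets of three characters (codons) until a stop codon.
    public static String translate(String mrna) {
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= mrna.length(); i += 3) {
            String aa = CODE.get(mrna.substring(i, i + 3));
            if (aa == null || aa.isEmpty()) break; // unknown or stop codon
            protein.append(aa);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("AUGUUUGGCUAA")); // MFG
    }
}
```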
From DNA to protein
This idea is known as the central dogma of molecular biology and is composed of three parts.
In the first step (known as replication), the DNA sequence replicates its information.
In the second step (known as transcription), the information that the DNA contains is copied into
the mRNA.
This mRNA carries the information from the nucleus of the cell to the cytoplasm, where the third
step (translation) takes place. Ribosomes located in the cell 'read' the information carried by the
mRNA and synthesise the protein sequence. (Access Excellence at the National Health Museum, 1999)
Figure 1. Central Dogma of Molecular Biology.
Source http://library.thinkquest.org/C0122429/pictures/centraldogma2.gif
PROTEINS
What is a protein and what is it made for?
Proteins are macromolecules made of amino acids which are contained in the cells of all living
organisms. They are essential to the existence of an organism since they perform a wide range of
functions needed for all cellular processes. (Berg, J.M., Tymoczko, J.L. and Stryer, L., 2002)
What is an amino acid?
An amino acid is a molecule that consists of a carbon atom linked to an amino group, a carboxylic
acid group, a hydrogen atom and an R group (Figure 2). This R group is the part of the molecule
that differs between amino acids, varying in size, shape, charge, etc. Depending on this R group,
the 20 different amino acids which are needed to construct proteins can be defined. (Berg et al., 2002)
Figure 2. Amino acid chemical structure. The central carbon atom is linked to a hydrogen atom (top), to a carboxylic
acid group (right), to an amino group (left) and to an R group (bottom).
Source http://plantandsoil.unl.edu/croptechnology2005/UserFiles/Image/siteImages/AminoAcidLG.gif
Amino acids can be connected to one another by peptide bonds, forming a linear polymer (or
polypeptide chain). This is known as the Primary Structure of a protein.
These polymers can fold into regular structures which stabilise them (Secondary Structure).
Typical secondary structures are the Alpha Helix, the Beta Sheet, and turns and loops (Figure 3).
Hence, the Secondary Structure refers to a local spatial arrangement of amino acids.
On the other hand, the Tertiary Structure defines the overall shape of a protein, that is to say, the
spatial arrangement of the whole amino acid sequence.
A Quaternary Structure describes protein complexes made up of several interacting folded chains
of amino acids. Thus, this structure defines the highest level at which a protein can be described.
(Berg et al., 2002)
Figure 3. Protein Structure Levels. The amino acid sequence (a) folds, forming regular structures (alpha helix and
beta sheet) (b) which, in turn, are linked together forming a bigger structure (c). The overall shape of the protein
(d) is constructed by combining several Tertiary Structures.
Source http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/notes/chpt3/proteinlevels.gif
Although the physical and chemical properties which permit a protein to configure its three-dimensional
structure from the amino acid sequence are not completely understood, several experiments reveal
that the amino acid sequence does indeed completely determine the spatial structure. In other words,
the Tertiary Structure is defined by the Primary Structure.
Moreover, as the function of a protein is strictly linked to its shape, the amino acid
sequence transitively determines the function of the protein. (Berg et al., 2002)
Protein functions
It has been explained that proteins are essential to the vital processes within organisms, but what,
exactly, are the functions that proteins perform?
The main characteristic of a protein is its capacity to bind other molecules. This property is strictly
related to its Tertiary Structure (e.g. the formation of a binding site which interacts with the external
molecule) as well as to the chemical properties of its surface amino acids (e.g. enabling chemical bonds).
The chief function of proteins is acting as enzymes. Enzymes catalyse chemical reactions in
processes such as cellular metabolism, catabolism or DNA replication.
Other proteins, known as antibodies, bind foreign substances in the body and destroy them
as part of the immune system.
Haemoglobin is another important protein, found in the blood, whose function is transporting
oxygen from the lungs to other organs of the body.
Other proteins are used for structural purposes; for example keratin, which is found in hair and
nails.
These are just a few of the roles proteins can carry out, given as examples; in fact, proteins
perform a huge number of functions. Even though the functions of many proteins are known, many
proteins with unknown functionality exist. In order to improve our understanding of them,
research in this field is ongoing.
It has been shown that most proteins that share a similar structure have similar functionality as
well; this makes sense, since the function of a protein is related to its Tertiary Structure. (Lodish,
H., Berk, A., Matsudaira, P., Kaiser, C.A., Krieger, M., Scott, M.P., Zipurksy, S.L. and Darnell,
J., 2004) (Voet, D. and Voet, J.G., 2004) (Bairoch, A., 2000)
PROTEIN STRUCTURE PREDICTION
The link between a protein's structure and the function it performs has been explained in the
previous section. Clearly, given a protein to study, knowledge of its structure can be very useful in
order to understand the processes in which it is involved. Determining this structure using
experimental techniques (such as X-ray crystallography or Nuclear Magnetic Resonance) is very
costly and time consuming, making it an unviable process for studying the structures of proteins
which take part in biological processes on a genomic scale.
Taking into account that recent advances in genomics allow us to obtain the amino acid sequence of
a protein from its DNA sequence, computational techniques which predict the Tertiary Structure
of a protein directly from its Primary Structure can be used.
There are two main approaches to protein structure prediction: Ab-Initio or Template Free
Modelling, which predicts the structure “from first principles”, and Comparative or Template Based
Modelling, which predicts the structure by comparing it to known template structures. (Zhang, Y.
and Skolnick, J., 2005) (Bowie, J.U., Luthy, R., Eisenberg, D., 1991)
Figure 4. Protein Structure Prediction Methods: Ab-Initio (Template Free) Modelling and Comparative (Template
Based) Protein Modelling, the latter comprising Homology Modelling and Protein Threading.
Ab-Initio or Template Free Modelling
This method attempts to find the structure of a protein directly from its amino acid sequence,
trying to imitate the natural folding of the protein using physical principles (which are not
completely understood) or stochastic methods (e.g. using an energy function to evaluate different
chain conformations).
Ab-Initio modelling is very computationally expensive. Several research projects are attempting to
address this weakness, using supercomputers (such as Blue Gene) or distributed computing, but the
results are still far from the desired ones. Therefore, this technique cannot yet be applied
practically on a genomic scale, and most projects focus on small proteins (~100 amino acids).
(Zhang et al., 2005) (Bowie et al., 1991)
Comparative or Template Based Protein Modelling
This method compares the amino acid sequence of a protein with unknown structure (target) against
sequences with known protein structures (templates).
The theory behind this method is that although a vast number of proteins exist, they seem to fold
into a relatively small set of structures. Therefore, there is a high probability that the structure of
the protein being studied has already been found in other proteins.
Protein Threading and Homology Modelling are the most widely used techniques exploiting this
theory. (Zhang et al., 2005) (Bowie et al., 1991)
Protein Threading
Protein Threading (also known as Fold Recognition) relies on the estimate that up to 70% of new
proteins have a structure with folds similar to one already studied in the PDB (Protein Data
Bank) (Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov,
I.N. and Bourne, P.E., 2000). The reasons for this are the physical and chemical laws governing
polypeptide chains, as well as the evolutionary relationships between proteins.
This strategy compares the target with a library of fold templates using a scoring scheme (e.g. an
energy function or statistical potentials).
Depending on the procedure, this method can be split into two subgroups:
The first one compares a 1-D profile with the fold library (e.g. marking each amino acid
depending on whether it belongs to the internal part of the protein or to the surface).
The second group compares the whole 3-D structure (e.g. comparing the inter-atomic distances for a
set of amino acids).
Most of the methods can give several measures of similarity which have to be interpreted by a
human expert. Hence, this is often a difficult method to automate. (Jones, D.T. and Hadley, C.,
2000)
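The 1-D profile idea can be sketched as follows, assuming each residue has already been labelled buried ('B') or exposed ('E'). The matching-fraction scoring scheme and the template names are invented for illustration and are not any particular threading program's method:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileThreading {
    // Score a target profile against a template profile of equal length:
    // the fraction of positions where the burial labels agree.
    public static double score(String target, String template) {
        int matches = 0;
        for (int i = 0; i < target.length(); i++) {
            if (target.charAt(i) == template.charAt(i)) matches++;
        }
        return (double) matches / target.length();
    }

    // Return the name of the best-scoring template in the fold library.
    public static String bestTemplate(String target, Map<String, String> library) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : library.entrySet()) {
            double s = score(target, e.getValue());
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> library = new LinkedHashMap<>();
        library.put("foldA", "BBEEBB"); // hypothetical template profiles
        library.put("foldB", "EEBBEE");
        System.out.println(bestTemplate("BBEEBE", library)); // foldA
    }
}
```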
Homology Modelling
The underlying theory behind this method relies on the premise that similar sequences will have
similar structures, taking advantage of proteins with known structures. In this method, the structure
of the template which optimally aligns to the target sequence is assigned as the structure of the
target. The process is explained below.
First, it must be mentioned that the quality of the prediction depends on the quality of the sequence
alignment as well as the quality of the template structure. (Marti-Renom, M.A., Stuart, A.C., Fiser,
A., Sanchez, R., Melo, F. and Sali, A., 2000)
Steps in the homology modelling process
This process consists of four steps: In the first one, the template that best fits the target is
selected. In the second step, the sequence of the template and the sequence of the target are
aligned. Usually these two steps are performed together, since the aligned sequences can be useful
for finding the proper template.
The third step consists of generating the 3D structure model for the target; and in the last step, the
correctness of the model is assessed (Marti-Renom et al., 2000).
Template Selection and Sequence Alignment
The first approach for selecting a template is using pairwise sequence alignment with programs
such as FASTA or BLAST. Another strategy is using multiple sequence alignment (e.g. the PSI-
BLAST software).
Alternatively, the Protein Threading method explained in the above section can be used for this
stage.
In general, a template is selected according to a similarity function (e.g. an E-value) or by
searching for a model which is close in evolution to the target. Other features which aid the search
are finding a template that has similar functionality to the target or considering the fraction of the
query sequence structure that can be predicted from the template. (Marti-Renom et al., 2000)
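Selection by similarity score can be sketched as picking the candidate with the lowest E-value below some cutoff. The hit records and the threshold below are hypothetical, not actual BLAST output handling:

```java
import java.util.Arrays;
import java.util.List;

public class TemplateSelector {
    // A candidate template hit, e.g. as parsed from a sequence search result.
    public static class Hit {
        public final String pdbId;
        public final double eValue; // lower E-value = less likely a chance match
        public Hit(String pdbId, double eValue) { this.pdbId = pdbId; this.eValue = eValue; }
    }

    // Pick the template with the lowest E-value, provided it passes the cutoff.
    public static String selectTemplate(List<Hit> hits, double cutoff) {
        Hit best = null;
        for (Hit h : hits) {
            if (h.eValue <= cutoff && (best == null || h.eValue < best.eValue)) best = h;
        }
        return best == null ? null : best.pdbId;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("1abc", 1e-30), new Hit("2xyz", 1e-5), new Hit("3def", 0.5));
        System.out.println(selectTemplate(hits, 1e-3)); // 1abc
    }
}
```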
Model Generation
In this step, the selected template and the alignment are combined to generate a 3D structure of the
target: Cartesian coordinates locating each atom within the protein are defined.
There are several strategies for building models; the most widely used method is Satisfaction of
Spatial Restraints which is based on calculations typically used in NMR spectroscopy. (Baker, D.
and Sali, A., 2001) (Sali, A. and Blundell, T.L., 1993)
Model Assessment
Once the structure has been generated, the quality of the model is assessed, typically using Model
Quality Assessment Programs, which are explained in detail in the next section. This problem has
been tackled using “statistical potential” or “physics-based energy” approaches.
The first approach is based on the frequencies of structural features (assigning a probability to each
pairwise interaction between amino acids) in the known proteins of the PDB. This method can also
be applied locally to identify poor regions in a protein.
The second approach uses a function to calculate the stability of a protein (regarding physical and
chemical interactions).
Recently, other methods which assess the quality of a model using Machine Learning have been
developed. After training, these systems can estimate the quality of a model directly from its
structure or by combining several Model Quality Assessment Programs. (Sippl, M.J., 1993)
(Lazaridis, T. and Karplus, M., 1999) (Eramian, D., Shen, M., Devos, D., Melo, F., Sali, A. and
Marti-Renom, M.A., 2006)
MODEL QUALITY ASSESSMENT PROGRAMS (MQAPs)
An MQAP is a computer program that evaluates the quality of a given 3D protein structure (i.e.
coordinates in PDB format). The output of the program is a real number that represents the quality
of the model. Notice that the protein structure is evaluated without comparison with any native
structure.
The basics of the MQAPs used in this project are explained in this section, both as examples of
MQAPs and as background required for the project outline. (CAFASP)
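Viewed as software, an MQAP is simply a function from a 3D model to a real-valued quality score, with no native structure anywhere in the signature. A minimal interface sketch (the names are illustrative, not taken from any real MQAP):

```java
public class MqapInterfaceSketch {
    // An MQAP maps a model (here just a PDB file path) to a real-valued
    // quality score; crucially, no native (experimentally solved) structure
    // appears in the signature.
    public interface Mqap {
        double score(String pdbFilePath);
    }

    // A dummy MQAP that returns a constant score, for illustration only.
    public static final Mqap DUMMY = path -> 0.5;

    public static void main(String[] args) {
        System.out.println(DUMMY.score("model1.pdb")); // 0.5
    }
}
```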
PRO-Q (ProQ-LG and ProQ-MX)
This method was created in 2002 by Björn Wallner and Arne Elofsson.
It uses structural features for evaluation, such as atom-atom contacts or residue-residue contacts,
and combines them through a Neural Network which is trained to learn the observed model quality
as defined by either the LG score (ProQ-LG) or the MaxSub score (ProQ-MX). (Wallner, B. and
Elofsson A., 2003)
MODCHECK
This method was created in 2005 by Chris S. Pettitt, Liam J. McGuffin and David T. Jones.
It is based on classical threading potentials in order to calculate protein structure quality; it uses
statistical analysis of highly resolved protein X-ray crystal structures combined with a solvation
potential and the inverse Boltzmann equation (Pettitt, C.S., McGuffin, L.J. and Jones, D.T., 2005).
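The inverse Boltzmann idea behind such statistical potentials can be sketched as turning observed versus expected contact frequencies into a pseudo-energy, E = -kT ln(P_obs / P_exp). The frequencies below are invented, and MODCHECK's actual potentials are considerably more elaborate:

```java
public class PseudoEnergy {
    // Inverse Boltzmann: E = -kT * ln(pObs / pExp), expressed here in units
    // of kT (i.e. kT = 1). Contacts observed more often than expected get a
    // negative (favourable) energy; rarer-than-expected contacts a positive one.
    public static double energy(double pObs, double pExp) {
        return -Math.log(pObs / pExp);
    }

    public static void main(String[] args) {
        // A hypothetical residue-residue contact seen twice as often as expected:
        System.out.println(energy(0.02, 0.01)); // negative => favourable
    }
}
```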
MODSSEA
ModSSEA determines the protein structure quality based on secondary structure element
alignments (SSEA) between the predicted secondary structure of the target protein and the
secondary structure represented in the 3D model.
This MQAP was created by Liam James McGuffin in 2007. (McGuffin, L.J., 2007a)
ModFOLD
Another recent example of work in this area is the ModFOLD project, developed by Liam J.
McGuffin (Bioinformatics and Systems Biology Unit, The University of Reading). The ModFOLD
method attempts to combine the output of the different individual MQAP methods presented above
(ModSSEA, MODCHECK, ProQ-MX and ProQ-LG) using a feed-forward back-propagation
neural network in order to increase the cumulative observed model quality scores. The premise is
that by combining the output of many different methods to form a consensus model quality
prediction, we can achieve an increase in the accuracy of the observed model quality. The results of
this initial study were that ModFOLD did indeed achieve better accuracy than the other individual
methods, although further research can be carried out, since the mean score is still far from the
theoretical maximum MQAP score. (McGuffin, L.J., 2007a)
This project attempts to refine the solution proposed by ModFOLD, in order to bring the observed
model quality score as close to the theoretical maximum achievable score as possible (the
theoretical maximum being the score reached by consistently selecting the highest quality model).
To achieve this objective, a protocol similar to the ModFOLD method is followed, but here we
investigate the use of a Support Vector Machine (SVM) instead of a neural network and explore
several alternative approaches to classifying the data.
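The consensus idea can be sketched as follows: each model yields a feature vector of per-method MQAP scores, and a trained classifier combines them into one prediction. Below, a toy linear decision function f(x) = w·x + b stands in for the trained predictor; the weights, bias and scores are invented for illustration (the real systems learn these parameters, e.g. with a neural network or an SVM):

```java
public class ConsensusSketch {
    // Toy linear decision function f(x) = w.x + b standing in for a trained
    // classifier that combines per-method MQAP scores into one prediction.
    public static double decision(double[] scores, double[] weights, double bias) {
        double f = bias;
        for (int i = 0; i < scores.length; i++) f += weights[i] * scores[i];
        return f;
    }

    public static void main(String[] args) {
        // Feature vector: [ModSSEA, MODCHECK, ProQ-MX, ProQ-LG] scores for one model.
        double[] scores  = {0.7, 0.6, 0.8, 0.75};
        double[] weights = {1.0, 0.5, 1.2, 0.9};  // hypothetical trained weights
        double bias = -2.0;
        // f > 0 => the model is classified as high quality.
        System.out.println(decision(scores, weights, bias) > 0 ? "high" : "low");
    }
}
```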
GOALS AND MILESTONES
To achieve the main goal of this project (increasing the accuracy of model quality assessment), a
set of milestones is defined in order to be able to evaluate the success of the project. The list of
milestones follows:
1.- Understanding the problem: This is the first step in every project. A basic understanding of the
background is essential to confront the problem.
In this case, knowledge about molecular biology is required: what is DNA and what is its function,
what are proteins, their composition and function?
Protein Structure Prediction is another concept that must be understood at the beginning of the
project: what are the different methods to solve this problem and what steps do these methods
require?
The concept of MQAPs must also be understood and some examples reviewed, since the program
developed in this project is an MQAP.
2.- Choosing the method: The main goal is very clear, but choosing the best approach to it is a
difficult task at this early stage of the project. The selection of a concrete method implies a period
of learning the theory behind it and how to use the software that provides it.
3.- Training the SVM: The results of this project depend mainly on correctly choosing the different
options SVMs provide and on correctly formatting the data. This means choosing the proper
kernel function and parameters, training the SVM with training data and checking the system with
testing data.
4.- Developing the application: Once the first steps have been completed, it is time to develop
the application itself, integrating the trained SVM into the system.
5.- Developing the web based interface: The last step of this project is to develop a web based
interface in order to supply the resource for public use.
PRELIMINARY SYSTEM DESIGN
In the next two sections, Preliminary System Design (PSD) and Implementation Strategy (IS), all of
the processes carried out throughout the project are explained. The difference between them is
that the PSD focuses on theory while the IS focuses on practice. In other words, the PSD explains
the mathematical theory behind each method used and justifies each decision made against the other
possible strategies, whilst the IS outlines the work carried out within this project.
It was known that the results reached by current MQAPs were still far from the theoretical best
results they could achieve (i.e. the selection of the highest quality models of protein
structures). Hence, finding a way to improve these results was the aim of this project.
ModFOLD (McGuffin, L.J., 2007a) was known as a good consensus MQAP method, developed by
University of Reading staff, which combined scores from many individual MQAP methods using a
neural network in order to improve selection of high quality models. In other words, as each one of
the MQAPs focuses on specific features of the protein, a wide range of features can be taken into
account for assessing the model by combining several MQAPs. (McGuffin, L.J., 2007a)
For that reason, ModFOLD was taken as a reference, and we tried to identify the main limitations
of the method and to refine the consensus approach using new techniques.
One of the difficulties found with ModFOLD was the fact that when using neural networks it is very
difficult to control the system; that is to say, once the activation function and its parameters
are chosen and the system has been trained, it is very complicated to understand its behaviour and
consequently to refine the results. A new approach allowing more control over the learning and
testing stages was therefore sought.
Recently, a method for classification problems, the Support Vector Machine (SVM), has become
widely used in the bioinformatics community, often giving better results than previously used
methods on a large variety of problems, including those which previously used neural networks.
This encouraged us to find a way to use an SVM for our purpose. Firstly the SVM technique is
introduced; afterwards, the way this technique is integrated into our project is presented.
SUPPORT VECTOR MACHINE (SVM)
A Support Vector Machine is a mathematical method which is used for classification and
regression. The detailed mathematical background of the method is not explained here; instead, a
brief overview of the goals of the method is given using an example.
An easy example
Imagine the owner of a bank who wants to know which of the clients are thrifty and which are
wasteful.
Draw a 2-dimensional grid with the average monthly expenses on the abscissa and the average
monthly income on the ordinate.
After that, draw a straight line that separates the thrifty people from the wasteful ones.
In this example it is obvious that the line that fits best is the identity function, but an
interval can be considered in which a person is neither thrifty nor wasteful even though they have
a small positive or negative balance. In this case, the function that separates the data needs to
be refined.
Using the SVM method, this line is found by a mathematical optimisation that separates the data
with the maximum margin between the two groups. Some parameters of the function can be fixed in
order to achieve the purpose explained in the paragraph above.
The example shows how an SVM works in a 2-dimensional space, but the same idea can be generalized
to an n-dimensional space (checking n attributes in the same study). This is achieved by finding
the (n-1)-dimensional hyperplane that separates the data into different subsets. And of course,
the problem is not restricted to two groups; several SVMs can be used, where each one answers a
Boolean question: all data on one side of the hyperplane belong to one set; the data on the other
side do not belong to this set. This is the most intuitive approach to multi-class SVM, but other
methods are introduced further on in this section. (Moore, A.W., 2001)
Figure 5. An easy example using SVM. The plot shows the income vs. expenses of different people.
The deep blue line (hyperplane) separates the thrifty people from the wasteful ones, while the
cyan interval is the margin where people are considered neither thrifty nor wasteful.
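The toy example above can be sketched in a few lines of code. This is only an illustration of the decision rule, not a trained SVM: the boundary is the identity line income = expenses, and the margin width and names are invented for the sketch.

```java
public class ThriftDemo {
    // Labels for the toy example: thrifty, wasteful, or inside the margin.
    public enum Label { THRIFTY, WASTEFUL, UNDECIDED }

    // The decision boundary is the identity line income = expenses.
    // Points whose balance falls inside +/- margin are left unlabelled.
    public static Label classify(double expenses, double income, double margin) {
        double balance = income - expenses;
        if (balance > margin) return Label.THRIFTY;
        if (balance < -margin) return Label.WASTEFUL;
        return Label.UNDECIDED;
    }

    public static void main(String[] args) {
        System.out.println(classify(1000, 1500, 100)); // THRIFTY
        System.out.println(classify(1500, 1000, 100)); // WASTEFUL
        System.out.println(classify(1000, 1050, 100)); // UNDECIDED
    }
}
```

An SVM trained on such data would find both the separating line and the widest possible margin automatically; here they are fixed by hand purely for illustration.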
Applying SVM to our problem
The problem to solve in this project can be seen as a classification one: classify the quality of
the model into a concrete grade. For example, ten different grades (integers from 0 to 10) can be
defined, and the problem to solve is to classify the quality of a given model into one of these
grades.
Obviously a multi-class SVM is needed for this purpose. The system can be refined by defining more
intervals: the more intervals defined, the more precise the assessment of the model. Realistically,
there may be a limit to the number of intervals, reached when the SVM is unable to classify the
data. Another aim of the project is therefore to discover how "fine grained" the classification
grades can be made while still separating very similar high quality models from one another.
KERNEL FUNCTION
The SVM method requires a function that evaluates each sample in order to classify it into one of
the available classes; this is called the kernel function. The discovery of an optimal kernel
function for an SVM is an ongoing subject of research. Nowadays, the most used functions are:
Linear Function:
K(xi, xj) = xi^T xj

Polynomial Function:
K(xi, xj) = (γ xi^T xj + r)^d, γ > 0.

Radial Basis Function (RBF):
K(xi, xj) = exp(−γ ||xi − xj||^2), γ > 0.

Sigmoid:
K(xi, xj) = tanh(γ xi^T xj + r).
where xi, xj are training vectors and γ, r and d are kernel parameters. (Hsu, C.W., Chang, C.C. and
Lin, C.J., 2003)
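The four kernels above translate directly into code. This is a minimal sketch of the formulas only (a real library such as LIBSVM evaluates them inside the training and prediction routines):

```java
public class Kernels {
    // Inner product xi^T xj of two vectors of equal length.
    static double dot(double[] xi, double[] xj) {
        double s = 0;
        for (int k = 0; k < xi.length; k++) s += xi[k] * xj[k];
        return s;
    }

    // Linear kernel: K(xi, xj) = xi^T xj
    static double linear(double[] xi, double[] xj) {
        return dot(xi, xj);
    }

    // Polynomial kernel: K(xi, xj) = (gamma * xi^T xj + r)^d
    static double polynomial(double[] xi, double[] xj, double gamma, double r, int d) {
        return Math.pow(gamma * dot(xi, xj) + r, d);
    }

    // RBF kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    static double rbf(double[] xi, double[] xj, double gamma) {
        double sq = 0;
        for (int k = 0; k < xi.length; k++) {
            double diff = xi[k] - xj[k];
            sq += diff * diff;
        }
        return Math.exp(-gamma * sq);
    }

    // Sigmoid kernel: K(xi, xj) = tanh(gamma * xi^T xj + r)
    static double sigmoid(double[] xi, double[] xj, double gamma, double r) {
        return Math.tanh(gamma * dot(xi, xj) + r);
    }
}
```

Note, for example, that the RBF of a vector with itself is always exp(0) = 1, and that it decays towards 0 as the two vectors move apart; γ controls how fast that decay is.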
The RBF is the most popularly used kernel because it can classify the data even when the samples
have a non-linear relationship between them. To achieve that, it maps the data into a higher
dimensional space (a device known as the kernel trick). Using the kernel trick, some data can be
split into the correct subsets where this would not be possible in a low dimensional space. (Hsu
et al., 2003)
Figure 6. Kernel Trick. This graphic shows how the data can be classified by mapping it in a higher dimensional space
using the kernel trick. Source: http://www.dtreg.com/svm.htm
Furthermore, the documentation read indicates that the linear kernel is a special case of the RBF
kernel, and that the sigmoid kernel behaves like the RBF kernel for certain parameters. (Hsu et
al., 2003)
Moreover, the RBF kernel has fewer hyperparameters to tune (compared with the polynomial kernel,
for example), which permits an easier parameter search.
In this project we have chosen to use the RBF kernel, as it is expected that there is a non-linear
relationship between the different types of input data, and thus it appears to be suitable for the
problem. (Hsu et al., 2003)
It is possible that alternative kernels may behave better than RBF for this concrete problem, but as
the time for this project is limited this could be a future extension of this work.
SVM MULTI-CLASS
Another decision we had to make at a more advanced stage of the project was the selection of the
multi-class system the SVM uses to classify the data into several groups, instead of the typical
behaviour of an SVM where the data is split into two groups. There are two main approaches for
dealing with this: using Binary Classifiers and using All the Data at Once. (Hsu, C.W. and
Lin, C.J., 2002)
The most used methods for these approaches are explained below:
SVM multi-class using Binary Classifiers
There are several methods for multi-class classification using binary classifiers. A brief
overview of the most widely used ones is given in this section.
One against all
This method consists of using as many SVs as there are classes. Each SV splits the data between
belonging to a certain class or not. Given a sample to classify, it is checked iteratively against
each SV until the appropriate class is found. (Hsu et al., 2002)
Example:
Let's suppose the data has to be split into 4 classes: A, B, C and D. Our system will have 4 SV.
Once the system is trained and given a certain sample, the class to which this sample belongs can
be predicted. First of all, the system uses one SV to decide if the sample belongs to
A or not. Then, in case the sample does not belong to A, the system uses another SV to check if the
data belongs to B or not. This process continues until the data is allocated in one of these groups.
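The sequential check described above can be sketched as follows. The per-class binary classifiers are stood in for by placeholder predicates, since the real classifiers would be trained SVMs from a library such as LIBSVM; the class names and thresholds are illustrative only.

```java
import java.util.List;
import java.util.function.Predicate;

public class OneAgainstAll {
    // Pairs a class label with a binary classifier answering "does this
    // sample belong to my class?". The predicates are placeholders; in the
    // real system each would be a trained SVM.
    static class Classifier {
        final String label;
        final Predicate<double[]> belongs;
        Classifier(String label, Predicate<double[]> belongs) {
            this.label = label;
            this.belongs = belongs;
        }
    }

    // Check the sample against each classifier in turn, returning the first
    // class that claims it; the last classifier acts as a catch-all.
    static String classify(List<Classifier> classifiers, double[] sample) {
        for (Classifier c : classifiers) {
            if (c.belongs.test(sample)) return c.label;
        }
        return classifiers.get(classifiers.size() - 1).label;
    }

    public static void main(String[] args) {
        // Toy stand-ins: the class is decided by which quarter of [0,1] the
        // single input feature falls into.
        List<Classifier> cs = List.of(
                new Classifier("A", s -> s[0] <= 0.25),
                new Classifier("B", s -> s[0] <= 0.50),
                new Classifier("C", s -> s[0] <= 0.75),
                new Classifier("D", s -> true));
        System.out.println(classify(cs, new double[]{0.6})); // C
    }
}
```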
One against one
This strategy consists of creating k(k-1)/2 SVs (k being the number of classes needed).
Each SV splits the data between two classes; hence, there exists one SV for each pair of classes.
Given a sample, it is compared using one SV. This SV determines whether the sample belongs to
class A or class B, incrementing a counter by one for the chosen class.
This process is repeated iteratively for all the SVs, each of which splits the data between
another pair of classes.
At the end, the class with the most votes is the class that contains the sample. (Hsu et al., 2002)
Example:
Let's suppose the data has to be split into 4 classes: A, B, C and D. Our system will have 6 SV.
Once the system is trained and given a certain sample, the class to which this sample belongs can
be predicted. In the first step, the system uses one SV to check whether the sample belongs to
class A or B, incrementing by one the counter of the class the system has determined. In the second
step, the system checks if the data belongs to A or C. The counter of the selected class is
incremented by one. This process goes on until all the SVs are checked.
Here is a table with possible results. The first row shows the discrimination each SV has to make
and the second row shows the decision of that SV.
A vs B A vs C A vs D B vs C B vs D C vs D
B A A B B D
Table I. One Against One example. Decision made by the SVM using OAO multi-class approach.
After the process, class A obtains 2 votes, class B 3 votes, class C 0 votes and class D 1 vote.
In this case, the system determines that the sample belongs to class B.
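The vote count of Table I can be reproduced with a short tally. This sketch takes the winners of the six pairwise decisions as plain strings and returns the majority class; in the real system each winner would come from a trained pairwise SVM.

```java
import java.util.HashMap;
import java.util.Map;

public class OneAgainstOneVote {
    // Tally the pairwise decisions and return the class with the most votes.
    // Each decision is the winner of one "X vs Y" binary classifier.
    static String vote(String[] decisions) {
        Map<String, Integer> votes = new HashMap<>();
        for (String winner : decisions) {
            votes.merge(winner, 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        // The decisions from Table I: A vs B -> B, A vs C -> A, A vs D -> A,
        // B vs C -> B, B vs D -> B, C vs D -> D.
        String[] decisions = {"B", "A", "A", "B", "B", "D"};
        System.out.println(vote(decisions)); // B (3 votes)
    }
}
```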
Directed Acyclic Graph Support Vector Machine (DAGSVM)
This method uses the same technique as the One Against One method regarding Support Vectors, but
it uses a rooted binary directed acyclic graph, which has k(k-1)/2 nodes and k leaves, to guide
the search. Nodes represent the SVs and leaves are the classes to sort by. Given a sample, the
search starts from the root, checking if it belongs to class A or B. Depending on the result, one
branch or another is followed and the process is repeated at the next node. This goes on until a
leaf is reached, which is the class that the sample belongs to. (Hsu et al., 2002)
Example:
Let us suppose the same example as in the previous case, with the same sample. Figure 7 shows the
path the algorithm follows through the graph to determine the class the sample belongs to.
Figure 7. Decision tree for DAGSVM method used to solve the example.
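A common way to implement this traversal (a sketch, not the exact code of any particular library) keeps a list of candidate classes and, at each node, lets the pairwise classifier for the first and last candidates eliminate the loser, so a leaf is reached after k-1 comparisons. The `decide` function here is a toy stand-in for a trained pairwise SVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class DagSvmWalk {
    // Walk the DAG: at each step the binary classifier for the first and last
    // remaining candidate classes eliminates the losing class, until only one
    // class is left. "decide" returns the winning label of the pair.
    static String classify(List<String> classes,
                           BiFunction<String, String, String> decide) {
        List<String> remaining = new ArrayList<>(classes);
        while (remaining.size() > 1) {
            String first = remaining.get(0);
            String last = remaining.get(remaining.size() - 1);
            String winner = decide.apply(first, last);
            // Remove the class that lost this pairwise comparison.
            remaining.remove(winner.equals(first) ? last : first);
        }
        return remaining.get(0);
    }

    public static void main(String[] args) {
        // Toy stand-in: "B" beats everything, matching the outcome of Table I.
        BiFunction<String, String, String> decide =
                (x, y) -> x.equals("B") ? x : (y.equals("B") ? y : x);
        System.out.println(classify(List.of("A", "B", "C", "D"), decide)); // B
    }
}
```

With 4 classes this performs 3 pairwise decisions, compared with the 6 votes of plain One Against One, which is the attraction of the DAG formulation.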
SVM multi-class using all data at once
The mathematical theory behind this method is quite complicated and the results of using it were
very poor; therefore, only a brief explanation is provided.
The idea of this method is the same as in the One Against All (OAA) method explained above, but it
differs in the number of Support Vectors: while OAA uses one SV for each comparison, All Data at
Once (ADO) formulates all the decision functions in just one optimisation problem. (Hsu et al., 2002)
A comparison between these methods can be found in the multi-class SVM study by Hsu and Lin. The
study concludes that binary classifiers are more suitable for testing on large data sets than ADO,
so a binary classifier has been implemented in the LIBSVM (Chang et al., 2001) library to solve
the multi-class problem, specifically the OAO method, because its results are similar to those of
the other binary methods and the training time is shorter. (Hsu et al., 2002)
Another problem regarding multi-class classification is hyperparameter selection, since it would
be computationally very expensive to search for different parameters for each binary SVM. Hence,
all the SVMs share the same parameters. (Chang, C.C. and Lin, C.J., 2001)
TECHNICAL REQUIREMENTS
Another important step in the preliminary system design was choosing the platform the MQAP had to
run on, the programming language used to develop the program and the library that supplies the
SVM methods. An explanation of the decisions made is given here.
Software was developed using Sun Java Technology (Sun Microsystems, Inc., 1994) (Java Servlets
for the web server) on a Linux machine.
The reason for using Java is simple: it is an established language, very widely used in the
computer science community, which offers a vast quantity of well documented libraries; this makes
the language powerful, comfortable and easy to use. Furthermore, it is distributed for free and
can be run on Linux machines.
Linux machines were used for this project because most of the research software used in it is
developed on this platform, and interaction between applications is easier if they run on the same
platform.
Moreover, as part of the software is a server, Linux is well suited for this purpose, since Apache
(The Apache Software Foundation, 1999), the most widely used web server, runs on it.
LIBSVM was chosen as the library that implements the SVM method. There is no single decisive
reason for selecting it over another, but several features make it a good candidate: it has a Java
interface, runs on Linux, is in constant evolution and is well documented.
IMPLEMENTATION STRATEGY
In this section, the processes carried out in this project are outlined, starting with the plan,
following with the integration of the system and finishing with the testing.
Planning the project
The first thing in the early stage of the project was to analyse and plan its structure,
identifying all the milestones and breaking them down into a list of tasks. Time and resources
were allocated to each task using the GanttProject software (GanttProject.org, 2003), which
allowed us to draw a Gantt chart (Figure 8). Although the plan was studied meticulously,
re-planning was needed in order to face the problems arising from the optimising step. Because of
that, the outline followed and explained in this chapter corresponds to the revised version of the
preliminary Gantt chart.
Figure 8. Preliminary Gantt chart.
Background study
After the planning, a background study was carried out. Specifically, concepts were learned by
reading articles on the internet and speaking with experts in these fields, such as biologists and
geneticists (see the references section for more details).
SVM study
Then we studied SVM methods. This was approached by reading the LIBSVM (Chang et al., 2001)
documentation and papers (Hsu et al., 2003 and Chang et al., 2001) as well as experimenting with
the software. Several data sets from other projects (Chang et al., 2001) were tested in order to
understand the functionality of this library.
First approach
Afterwards, a Java (Sun Microsystems, Inc., 1994) mapper was developed in order to transform
data from the ModFOLD software into the input format expected by the LIBSVM functions. Moreover,
the data needed to be rescaled in order to achieve better results.
The core of the project was initiated at this point. The first objective was to develop a system
that permits classification of our data into two basic subsets, good models against bad models
(defined below), in order to observe the general SVM behaviour.
We had a file with training set results from the ModFOLD project. For each sample, we had
information concerning the protein model, as shown in the following example:
0.7357142567634583 0.8888083166000735 0.494 0.4807000160217285 0.9072
Each of the first four columns represents the predicted quality of the model according to one
individual MQAP, on a scale ranging between 0 and 1 (0 – bad quality, 1 – good quality).
The last column contains the observed model quality (OMQ). The OMQ measures the accuracy of the
models and is supplied by the TM-score (Zhang, Y. and Skolnick, J., 2004), which determines the
structural distance between the predicted model of the protein target and the solved structure.
Two classes were specified depending on the OMQ: if the value was less than or equal to 0.5, the
model was placed in the 'bad models' class, whilst if the OMQ was greater than 0.5, the model was
placed in the 'good models' class.
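This thresholding can be sketched in a few lines. The method names are illustrative, and the equal-width n-class variant anticipates the multi-class tests described later; note that its handling of samples lying exactly on a boundary may differ from the project's inclusive-upper convention.

```java
public class OmqLabels {
    // Binary labelling: OMQ <= 0.5 -> class 0 (bad), OMQ > 0.5 -> class 1 (good).
    static int binaryClass(double omq) {
        return omq <= 0.5 ? 0 : 1;
    }

    // Equal-width labelling into n classes over [0, 1]; an OMQ of exactly 1.0
    // is clamped into the top class.
    static int multiClass(double omq, int n) {
        int c = (int) (omq * n);
        return Math.min(c, n - 1);
    }

    public static void main(String[] args) {
        // The sample row shown above has OMQ = 0.9072.
        System.out.println(binaryClass(0.9072)); // 1 (good model)
        System.out.println(multiClass(0.9072, 10)); // 9
        System.out.println(multiClass(0.62, 4)); // 2
    }
}
```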
The SVM was trained using half of our data (30,000 samples) and the remaining half became the
testing set. The SVM was then used to classify the testing data into these two classes, and the
results were checked against the OMQ to assess the accuracy of the method.
The decision (explained in the Preliminary System Design section) about choosing the best kernel
for our problem was made at this point.
Choosing parameters
Once a kernel function had been selected, it was time to choose appropriate values for the kernel
parameters. In our case, we needed to find the best values for the γ and C parameters of the RBF
kernel.
The first way to find these parameters was to use a subset of the data and carry out a grid search
using the tool supplied by LIBSVM.
After carrying out several tests, we concluded the best parameter pairs were C=8, γ=2; C=32,
γ=0.5; and C=2048, γ=0.5.
These values provide an 83% average accuracy (accuracy was defined as correct classification
according to the OMQ) for binary classification (i.e. separating good models from bad ones).
This seemed to be a good result, provided the accuracy did not fall for multi-class classification.
Unfortunately, this was not the case and we needed to use different techniques for parameter
searching.
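A grid search of the kind LIBSVM's tool performs can be sketched as follows. The exponentially spaced candidate ranges follow common practice for RBF kernels, and the evaluator is a toy stand-in; in the real project each evaluation would be a cross-validation run through LIBSVM.

```java
public class GridSearch {
    // Stands in for "train with (C, gamma) and return cross-validation
    // accuracy"; in the real project this call would go to LIBSVM.
    interface Evaluator {
        double accuracy(double c, double gamma);
    }

    // Exhaustively try exponentially spaced (C, gamma) pairs and return
    // {bestC, bestGamma, bestAccuracy}.
    static double[] search(Evaluator eval) {
        double bestC = 0, bestGamma = 0, bestAcc = -1;
        for (int ce = -5; ce <= 15; ce += 2) {       // C = 2^-5 .. 2^15
            for (int ge = -15; ge <= 3; ge += 2) {   // gamma = 2^-15 .. 2^3
                double c = Math.pow(2, ce), gamma = Math.pow(2, ge);
                double acc = eval.accuracy(c, gamma);
                if (acc > bestAcc) {
                    bestAcc = acc;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        return new double[]{bestC, bestGamma, bestAcc};
    }

    public static void main(String[] args) {
        // Toy evaluator whose optimum sits at C = 8, gamma = 2, echoing one
        // of the pairs found in the project.
        Evaluator toy = (c, gamma) ->
                1.0 / (1.0 + Math.abs(Math.log(c / 8)) + Math.abs(Math.log(gamma / 2)));
        double[] best = search(toy);
        System.out.println("C=" + best[0] + " gamma=" + best[1]);
    }
}
```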
After reading documentation about hyperparameter tuning, software that deals with hyperparameter
search was downloaded from the webpage of the Mathematics Department of King's College, University
of London. The software is called Bayesian Support Vector Machine Hyperparameter Tuning (BSVMHT)
(Gold, C. and Sollich, P., 2005).
The best hyperparameter values this software found were C=10, γ=2.303, but the average accuracy
was the same as with the values found by the grid search.
A Java Data Mapper had to be developed in order to convert the data into a valid input for this
software.
24
First multi-class approach
After completing the classification between good and bad models using the RBF kernel and choosing
the best parameter values for this concrete problem, a new step forward was needed. The data had
to be classified not only into two broad classes (good and bad), but into as many subclasses as
possible. The more classes the data was split into, the more useful the results would be (as it is
important to distinguish the finer differences between the most accurate models rather than just
separate the best from the worst).
The first approach to multi-class classification was to try to sort the data into 4 subsets. The
SVM worked quite well for binary classification, but its effectiveness for non-binary
classification was unclear.
Documentation was read (Hsu et al., 2002) and two main ways to tackle the problem were found: the
first consisted of using several binary classifiers, while the second considered all the data in
one formulation. Finding out which strategy would be better was our next research goal. The
decision made is explained in the Preliminary System Design section.
Once the multi-class method was chosen, it was time to test the data against the system and check
the results.
Testing the first multi-class approach
The first step was to test the system by classifying the data into 4 classes depending on the OMQ
(<=0.25: Very Bad Models; >0.25 & <=0.5: Bad Models; >0.5 & <=0.75: Good Models; >0.75 & <=1:
Very Good Models). The training and testing procedures were the same as for the binary
classification explained above. It was observed that the parameters that fitted best were the same
as for binary classification, but the accuracy fell to around 50%.
Then, it was decided to classify the data into 10 classes in order to see whether the accuracy
continued to fall. Once more, the best values for the hyperparameters were the same as in the
other tests and, indeed, the accuracy fell to 37%.
Observing these results, a deviation from the preliminary Gantt plan was needed in order to
investigate the problem and the behaviour of the SVM. Knowing that, it would be easier to guide
the SVM towards our goal.
10 class test
First of all, the data from a 10-subset classification was plotted in a 2D graph (OMQ values on
the x-axis and predicted values on the y-axis) and a table with the same information was created:
Predicted
Real 0 1 2 3 4 5 6 7 8 9
0 0 896 7 4 11 40 8 1 0 5
1 0 7516 30 16 70 311 9 9 0 27
2 0 3744 73 28 75 347 32 16 0 78
3 0 2254 50 39 106 462 65 24 0 91
4 0 1675 24 25 133 743 144 38 0 174
5 0 1163 22 16 106 906 299 99 0 262
6 0 612 15 11 68 793 364 283 0 400
7 0 230 13 10 31 358 306 395 0 655
8 0 79 3 5 8 79 96 190 0 655
9 0 430 0 1 22 213 133 207 0 2054
Table II. Predicted class against Real class for a 10-class classification. Each cell shows the
number of samples that were classified as 'Predicted' when they belonged, in fact, to the class
'Real'. It can be seen that the SVM works well for binary classification but not for multi-class
classification.
20 class test
The next step was to classify the data into 20 classes, in order to obtain finer results, and to
plot them using the same technique as with 10 classes.
Pred
Real 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 65 155 0 0 0 1 5 0 5 11 5 0 0 0 0 0 0 4
1 0 0 418 278 0 0 0 0 0 0 2 8 5 0 2 0 0 0 0 8
2 0 0 1830 2025 0 0 0 2 6 0 20 30 5 0 0 0 0 0 0 34
3 0 0 518 3289 0 0 0 6 13 0 54 92 7 0 2 1 0 0 0 54
4 0 0 213 2110 3 0 0 8 7 0 41 83 16 0 3 0 0 0 0 78
5 0 0 122 1449 3 0 0 8 7 1 42 65 21 1 4 2 0 0 0 106
6 0 0 114 1213 2 0 0 10 15 0 40 94 42 0 10 4 0 0 0 103
7 0 0 76 999 1 0 0 11 11 0 71 95 48 4 4 6 0 0 0 118
8 0 0 74 842 1 0 0 11 18 0 84 158 85 5 10 5 0 0 0 164
9 0 0 68 821 0 0 0 8 13 0 114 171 74 5 16 0 0 0 0 209
10 0 0 47 644 1 0 0 7 8 0 107 211 110 9 22 10 0 0 0 273
11 0 0 43 573 0 0 0 3 8 0 80 241 116 8 23 14 0 0 0 315
12 0 0 23 396 0 0 0 5 9 0 87 233 108 14 35 15 0 0 0 398
13 0 0 17 260 1 0 0 6 3 0 58 178 108 17 37 22 0 0 0 516
14 0 0 9 147 2 0 0 1 2 0 32 146 79 13 33 23 0 0 0 556
15 0 0 9 98 3 0 0 1 3 0 19 79 61 16 27 15 0 0 0 624
16 0 0 7 42 1 0 0 1 2 0 8 40 25 6 13 12 0 0 0 538
17 0 0 3 21 0 0 0 0 0 0 2 13 16 2 9 7 0 0 0 347
18 0 0 2 9 0 0 0 0 0 0 0 5 2 1 1 0 0 0 0 159
19 0 0 141 272 0 0 0 3 1 0 35 79 38 15 16 9 0 0 0 2272
Table III. Predicted class against Real class for a 20-class classification. Each cell shows the
number of samples that were classified as 'Predicted' when they belonged, in fact, to the class
'Real'. It can be seen that the SVM works well for binary classification but not for multi-class
classification.
Then, it was decided to plot the same data (20 classes) using real values for the OMQ instead of
discrete ones, because the plots above were not easy to interpret.
Figure 9. Distribution of a 20-subset classification. The x-axis contains the real value for the
OMQ. The y-axis contains the classification our system predicts after using the multi-class SVM. A
roughly linear distribution can be seen: most of the lower classes are classified as lower ones,
most of the middle classes as middle ones and most of the top classes as top ones. Apart from
these three main groups, the method does not seem to be very accurate for each of the separate
classes.
After observing the results above, it was decided to plot a histogram for each OMQ value in order
to see the distribution of the predicted values for a given OMQ. Some of these histograms are
shown below.
Figure 10. Distribution for class 0. The x-axis shows the class number, while the y-axis shows the
number of samples. Notice most of the data is allocated in one of the lower classes.
Figure 11. Distribution for class 12. The x-axis shows the class number, while the y-axis shows
the number of samples. Notice the data is split equally into two main groups (class 3 and class
19), with some data classified in subset 11 and its surroundings.
Figure 12. Distribution for class 19. The x-axis shows the class number, while the y-axis shows
the number of samples. Notice most of the data is allocated in one of the higher classes.
After this analysis, the conclusion was that the SVM is very good for binary classification of
protein structure models, but the multi-class SVM does not fit our problem very well.
BST approach
Therefore, we decided to develop a binary decision tree to help the SVM classify the data more
precisely. The functionality of this tree is explained below.
The height of the tree depends on the precision to be obtained, and the number of leaves equals
the number of classes to sort by.
Every node in the tree represents an SV, which is trained following the normal procedure.
Given a sample, this is checked in the root node of the tree. This node determines whether the
sample belongs to those classes where the OMQ is less than or equal to 0.5 or on the contrary, those
where the OMQ is greater than 0.5.
In this example, it is assumed that the first node has determined that the sample belongs to the
first group (<=0.5). In the next node, the SV determines whether the sample belongs to those
classes where the OMQ is less than or equal to 0.25 or to those where the OMQ is greater than
0.25 (and, of course, less than or equal to 0.5).
This procedure keeps going on until a leaf is reached. This leaf determines the class that the sample
belongs to.
Figure 13. Three levels Complete Binary Decision Tree. This decision tree shows the procedure to find the class of a
given sample. Beginning from the root, in each node it is decided which is the next node to follow depending on the
OMQ. Each one of the leaf nodes is a class.
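The tree walk of Figure 13 can be sketched as follows. For illustration, each node's trained SVM is stood in for by a threshold test on the sample's known OMQ; in the real system the branch taken at each node would be the prediction of that node's SVM.

```java
public class BinaryTreeClassify {
    // A node of the complete binary decision tree. Internal nodes hold the
    // OMQ threshold their SVM discriminates on; leaves hold a class index.
    static class Node {
        double threshold;
        Node low, high;     // low: OMQ <= threshold, high: OMQ > threshold
        int leafClass = -1; // valid only for leaves

        static Node leaf(int c) { Node n = new Node(); n.leafClass = c; return n; }
        static Node inner(double t, Node lo, Node hi) {
            Node n = new Node(); n.threshold = t; n.low = lo; n.high = hi; return n;
        }
    }

    // Walk from the root until a leaf is reached; the threshold comparison
    // stands in for each node's trained SVM decision.
    static int classify(Node root, double omq) {
        Node n = root;
        while (n.leafClass < 0) {
            n = (omq <= n.threshold) ? n.low : n.high;
        }
        return n.leafClass;
    }

    public static void main(String[] args) {
        // The three-level tree of Figure 13: threshold 0.5 at the root,
        // then 0.25 and 0.75, with leaves 0..3.
        Node root = Node.inner(0.5,
                Node.inner(0.25, Node.leaf(0), Node.leaf(1)),
                Node.inner(0.75, Node.leaf(2), Node.leaf(3)));
        System.out.println(classify(root, 0.62)); // 2
    }
}
```

Because each sample only visits the nodes on its root-to-leaf path, a tree of height h resolves 2^(h-1) classes with h-1 binary decisions per sample.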
With this idea in mind, we started developing a program that combines the SVM with the BST
structure. Details about the program can be found in the Detailed Software Design chapter.
4 class BST
The first approach using this technique was to classify the data into 4 subsets in order to check
whether the results were better than with the raw multi-class SVM (Figure 13). In this early test,
all the nodes in the BST shared the same parameters, which made the training process faster than
training each node independently. The problem we found here was the difficulty of working with our
large amount of data (30,000 samples), since the Java Virtual Machine (Sun Microsystems, Inc.,
1994) did not have enough memory to work with these complex data structures. Hence, we tried to
increase the JVM heap memory, but even with that technique, the machine we were using to run the
program did not have enough memory for our purpose.
At that point, another decision had to be made: on the one hand, we could save some data to disk
in order to work with the whole training set; on the other hand, we could use just a subset of it
to train the SVM (specifically 5,000 samples). We chose the second option, because the process is
time-consuming enough working only in memory; if we had to use the disk as well, the running time
would be prohibitive.
The results achieved using this technique were slightly better than using the multi-class SVM: the
accuracy for the new method was 58%, compared with 50% for the old one. We were on the right path,
but our system still needed more refining.
8 class BST
Figure 14. Four levels Complete Binary Decision Tree. This decision tree shows the procedure to find the class of a
given sample. Beginning from the root, in each node it is decided which is the next node to follow depending on the
OMQ. Each one of the leaf nodes is a class.
The next step was to try a similar test, splitting the data into 8 classes instead (Figure 14). In
this case, the accuracy fell to 42%. Although we knew from the multi-class SVM experiments that
using the same parameters to train all the SVMs achieved the same results as using different
parameters for each one, the method we were using was slightly different from the multi-class SVM,
so we decided to try different parameters for each of the SVMs. This means training the SVM in
each node independently. It was clear that this strategy would not scale to a very deep tree, but
for an 8-class tree it was an option that had to be tried.
The results were similar to those obtained using the same parameters for all the SVMs, but the
training time was much longer; therefore, we rejected this option.
First approach for high quality models
The following step in our research was to focus only on the high quality models. In other words,
at each decision node we continue searching down the right branch (the higher quality one) in
order to be more accurate for high quality models rather than low quality ones. Since it is
relatively easy to separate the good models from the bad ones using MQAPs, the difficulty lies in
identifying the best model when the quality is uniformly high.
A 6-class structure like the one shown in Figure 15 was developed and the test was carried out.
The results were not good enough (Table IV), which forced us to continue our research.
Figure 15. Six Levels Unbalanced Binary Search Tree. Starting from the root, a decision is made in
each node until a leaf is reached. This unbalanced structure permits focusing on the High Quality
Models rather than the Low Quality ones.
Class Belonging Predicted
0 19401 20500
1 6464 6280
2 1888 952
3 329 0
4 82 0
5 2831 3261
Table IV. Results for the Six Levels Unbalanced Binary Search Tree. A testing set containing
30,933 samples was used in this test. Although the test should have been done counting the "well
predicted samples" for each class instead of "all the predicted samples" for a certain class, the
predictions for classes 3 and 4 show the bad behaviour of the system.
Focusing on very high quality models (above 0.875)
Afterwards, we focused on more specific high quality models. We selected the models whose accuracy
was greater than 0.875 for both the training and testing steps. Accordingly, we developed a
3-class structure such as the one in Figure 16. The results were very bad, since the system was
unable to classify the data in the two highest quality classes (Table V).
Figure 16. Three Levels Unbalanced Binary Search Tree. Starting from the root, a decision is made
in each node until a leaf is reached. This unbalanced structure permits focusing on the High
Quality Models rather than the Low Quality ones. This technique differs from that of Figure 15 in
the prediction stage: here the models whose quality is better than 0.875 are chosen according to
the OMQ, while in Figure 15 they are selected after several prediction processes.
Class Belonging Well predicted
0 350 2
1 90 0
2 3089 3086
Table V. Results for the Three-Level Unbalanced Binary Search Tree. This test was carried out using a testing set
containing 3529 samples. The results for classes 0 and 1 demonstrate the failure of this approach.
Calculating Sensitivity and Specificity
After these demoralising results, we decided to return to the 6-class structure and calculate
additional measures, such as sensitivity and specificity, hoping to discover clues that would suggest
a way forward; but the results were, again, not good at all, and there was no clue that allowed us to
refine the solution (Table VI).
(In a binary classification, sensitivity and specificity are statistical measures computed for each of
the two possible outcomes of the classification.
In this case, taking class 0 as a reference:
Sensitivity is the proportion of samples correctly predicted as 0 among all the samples that really
belong to class 0.
Specificity is the proportion of samples correctly predicted as not 0 among all the samples that
really do not belong to class 0.)
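The definitions above can be sketched in a few lines of Java; as a check, the class 0 counts from Table VI (17239 of 19401 true class 0 samples predicted as 0) reproduce the 88.86% sensitivity reported there. The specificity counts below are purely illustrative.

```java
// Sketch of the sensitivity/specificity definitions above, using plain counts.
// tp: true class-0 samples predicted as 0; fn: true class-0 samples predicted as not 0;
// tn: true non-0 samples predicted as not 0; fp: true non-0 samples predicted as 0.
public class BinaryStats {
    static double sensitivity(int tp, int fn) { return 100.0 * tp / (tp + fn); }
    static double specificity(int tn, int fp) { return 100.0 * tn / (tn + fp); }

    public static void main(String[] args) {
        // Class 0 in Table VI: 17239 of 19401 true class-0 samples were predicted as 0.
        System.out.printf("sensitivity = %.2f%%%n", sensitivity(17239, 19401 - 17239)); // 88.86%
        // Illustrative counts only, not taken from the tables.
        System.out.printf("specificity = %.2f%%%n", specificity(9000, 3000)); // 75.00%
    }
}
```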
Class Belonging Well predicted Sensitivity Specificity
0 19401 17239 88.86% 71.87%
1 6462 3052 76.92% 71.76%
2 1888 316 30.30% 89.08%
3 329 0 0.00% 100.00%
4 82 0 0.00% 100.00%
5 2831 1587 100.00% 0.00%
Table VI. Sensitivity and Specificity for the Six-Level Unbalanced BST. The same testing set used in Table IV was
used here, this time counting the well predicted samples instead of all the predicted samples for a given class, and
calculating the sensitivity and specificity at each node. As can be seen, classes 3, 4 and 5 continue to give problems and
the statistical values offer no clue.
Despite these poor test results, we decided to carry on with the project and search for alternative
ways to tackle the problem, since the overall approach seemed correct and an appropriate solution
was likely close; we simply had not found it yet.
Independent Binary Classification
We decided to remove the clear cases, i.e. the models with quality = 0 or quality = 1, because
these models could bias the SVM training and produce unbalanced data, which leads to poor SVM
behaviour. Moreover, we decided to perform strictly binary studies, treating each node as an
individual problem.
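A minimal sketch of this filtering step, assuming the observed qualities are available as a list of doubles:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the filtering described above: models with observed quality exactly
// 0 or 1 are removed before training, to avoid unbalancing the data.
public class FilterExtremes {
    static List<Double> filter(List<Double> qualities) {
        return qualities.stream()
                .filter(q -> q > 0.0 && q < 1.0) // drop the clear cases
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of(0.0, 0.42, 1.0, 0.9))); // prints [0.42, 0.9]
    }
}
```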
[Figure 17: six independent two-level binary trees, a) to f), with decision thresholds 0.5, 0.75, 0.875, 0.9375, 0.96875 and 0.984375 respectively (left branch <=, right branch >) and leaf classes 0 and 1 in each tree.]
Figure 17. ‘Forest’ of BSTs. Each BST solves an independent binary problem. Starting from a), if the system decides
the model quality belongs to class 1, the model is checked against the next BST (i.e. b)). The process finishes when a
class 0 is reached, and that quality threshold is assigned to the model.
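The walk through the ‘forest’ can be sketched as follows; the isAboveThreshold predicate is a hypothetical stand-in for the trained binary SVM at each root, and the thresholds are those shown in Figure 17.

```java
import java.util.function.DoublePredicate;

// Sketch of the 'forest' walk of Figure 17: each threshold has its own binary
// classifier; the walk stops at the first classifier that answers class 0.
public class ForestWalk {
    static final double[] THRESHOLDS = {0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375};

    // Returns the last threshold the model was predicted to exceed,
    // or 0.0 if it fails at the root.
    static double walk(DoublePredicate isAboveThreshold) {
        double reached = 0.0;
        for (double t : THRESHOLDS) {
            if (!isAboveThreshold.test(t)) break; // class 0: stop here
            reached = t;                          // class 1: try the next BST
        }
        return reached;
    }

    public static void main(String[] args) {
        // Toy stand-in: a model of true quality 0.9 passes every threshold below 0.9.
        double quality = 0.9;
        System.out.println(walk(t -> quality > t)); // prints 0.875
    }
}
```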
The strategy we followed was to start from the root node and continue through the BSTs until the
random chance of classifying the data correctly became better than the accuracy achieved by the
system.
(Random chance is the probability of making a correct prediction for a certain class purely by
chance, which depends on the class frequency. For example, if there are 70 samples belonging to
class 0 and 30 to class 1, the random chance for class 0 is 70% and the random chance for class 1
is 30%.)
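The random chance baseline, and the stopping rule it implies, can be sketched as follows; the counts are those of Table XII, where the system's 18.33% accuracy for class 0 falls below the baseline and the algorithm stops.

```java
// Sketch of the random chance baseline described above: the chance of guessing
// a class correctly is simply its relative frequency in the test set.
public class RandomChance {
    static double randomChance(int classCount, int total) {
        return 100.0 * classCount / total;
    }

    public static void main(String[] args) {
        // Counts from Table XII: 60 class-0 and 85 class-1 samples (145 in total).
        double rc0 = randomChance(60, 145);
        System.out.printf("random chance class 0 = %.2f%%%n", rc0); // ~41.38% (Table XII rounds to 41.37%)
        // The system's accuracy for class 0 in Table XII was only 18.33%,
        // which is below this baseline, so the walk stops at this node.
    }
}
```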
Finally, we achieved quite good results with this experiment (Tables VII – XII).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 21697 20177 92.99% 68.70% 92.99% 62.63%
1 9883 6190 62.63% 31.30% 62.63% 92.99%
Table VII. Results for ‘Forest’ of BSTs (Figure 17.a).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 9254 8541 92.29% 72.69% 92.29% 47.93%
1 3476 1666 47.93% 27.31% 47.93% 92.29%
Table VIII. Results for ‘Forest’ of BSTs (Figure 17.b).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 2222 2132 95.95% 77.37% 95.95% 22.92%
1 650 149 22.92% 22.63% 22.92% 95.95%
Table IX. Results for ‘Forest’ of BSTs (Figure 17.c).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 366 267 72.95% 58.10% 72.95% 42.42%
1 264 112 42.42% 41.90% 42.42% 72.95%
Table X. Results for ‘Forest’ of BSTs (Figure 17.d).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 91 39 42.86% 35.83% 42.86% 66.87%
1 163 109 66.87% 64.17% 66.87% 42.86%
Table XI. Results for ‘Forest’ of BSTs (Figure 17.e).
Class Belonging Well Predicted Well Pred. (%) Random Sensitivity Specificity
0 60 11 18.33% 41.37% 18.33% 76.47%
1 85 65 76.47% 58.62% 76.47% 18.33%
Table XII. Results for ‘Forest’ of BSTs (Figure 17.f). Notice that the random chance for class 0 is better than the
prediction made by the system; the algorithm therefore finishes at this point.
Unfortunately, time was running against us and we could not refine this solution further. We
therefore considered the core refinement process concluded, and the project turned to developing a
webpage where we could integrate the system for external use.
Obviously, the point we reached did not permit a system that works as an ideal MQAP (receiving
the model of the protein structure in PDB format (Berman et al., 2000) as input and giving the
quality score of the model as output), but we were able to develop an intermediate quality
assessment system which receives the scores of the four MQAPs as input and gives the
probabilities of different quality scores as output. Indeed, this system would be very useful for an
initial evaluation of predicted models, and it allowed us to develop and test an interface for the
program (Figure 19).
Hence, we installed the Apache Tomcat HTTP Server (The Apache Software Foundation, 1999),
which supports Java technology, and developed a basic Servlet to provide a user interface for the
system.
DETAILED SOFTWARE DESIGN
1-SCOPE
1.1-Identification
This system is a tool to assess the quality of a protein structure prediction. Specifically, given the
scores supplied by ModSSEA, ModCHECK, ProQ-MX and ProQ-LG, the program combines them
to evaluate the protein structure quality. The results are shown as several cut-off points on the
quality scale from 0 to 1, together with the average correctness achieved when assessing the models.
1.2-System Overview
The system uses the mathematical approach to classification called the Support Vector Machine in
order to tackle the problem. A library designed for this purpose (LIBSVM (Chang et al., 2001)) is
used as the core of the system, supplying the functions for training and testing (steps needed in all
machine learning based systems).
Several ways to interact with the system are provided: a command line interface for the system
administrator to train and test the system, and a web based interface to predict a single sample.
Moreover, a formatting tool that converts the data from the ModFOLD project into suitable input
for this system is also included.
1.3-Document Overview
This document has been written following the Software Design Description defined by IEEE (IEEE
1016-1998).
Section 1 gives a brief overview about the system and this document.
Section 2 explains the main design decisions for the system development. Specifically, Section 2.1
shows the architectural design: it presents the classes in the system and the interrelationships
between them. Section 2.2 presents all the components in the system. The flow of information when
carrying out a use case is explained in Section 2.3. The design of, and ways to interact with, the user
interfaces built for this system are presented in Section 2.4.
Section 3 explains all the components in detail: the most important methods of each class are
described, specifying the parameters and return values involved in each method.
2-SYSTEM-WIDE DESIGN
2.1-Architectural Design
The system deals with multi-class classification by using SVM binary classification and guiding the
search with a ‘forest’ of Binary Search Trees.
These trees are implemented in the BST class, the main class with which the system administrator
interface interacts for all requests to the system.
As the results showed that using independent data for each classification problem behaves better
than a single BST, the system uses several BSTs for this purpose.
The Node class implements the node structure used to build each BST. Each node has an associated
NodeSVM, which contains all the information regarding the related SVM and interacts with the
LIBSVM package.
2.2-Components
Class BST
Implements the structure and logic for constructing and using a Binary Search Tree.
Class Node
Implements the structure of a node contained in a tree.
Class NodeSVM
Implements the structure needed in each node for constructing a SVM using LIBSVM functions.
Class Fileformat
Contains procedures for dealing with the intermediate data files needed for the correct functionality
of the system.
Class SVMtree
Provides the user interface for the system administrator, creates the software objects by calling the
BST class and trains and tests the system.
LIBSVM package
External library that provides all the methods for working with SVMs.
Class ProteinQualityCalculator
Web based user interface that provides the protein structure quality prediction for a sample.
Class reformat
This class implements a tool for reformatting the data from the ModFOLD program into suitable
input for this system.
[Figure 18: UML class diagram of Node, nodeSVM, BST, Fileformat, SVMtree, ProteinQualityCalculator and reformat, showing their attributes, methods and multiplicities. A BST holds N Node objects (root, size N); each Node stores its threshold (mqap), class id, C and G parameters, left and right children, and one associated nodeSVM; nodeSVM wraps the LIBSVM parameter, problem and model objects together with the input and model file names, cross-validation settings and error message. The diagram also shows the ‘uses’ relationships with the LIBSVM package, one of which uses only data types from the library, not its methods.]
Figure 18. Class Diagram.
2.3-Concept of Execution
From system administrator:
The execution starts in the SVMtree class, where the main procedure creates a ‘forest’ of BSTs
using the BST constructor for each of them. Each BST represents a binary decision for a different
cut-off point. Then the training and testing methods associated with each BST (trainTree and
predictTree from the BST class, respectively) are called iteratively.
The trainTree method moves recursively through the tree, visiting all the non-leaf nodes; for each of
them, it reads the training set file, splits the data into two new files for further usage (using the Split
method of the Fileformat class), uses this information to set all the parameters of the associated
nodeSVM structure, and trains its SVM.
In the predictTree method, the testing file is read and split into two new files for further usage
(using the Fileformat class). Then, for each sample in the file, the method recursively visits the
proper nodes in the tree, deciding whether to advance the prediction through the left child node or
the right child node. When the program reaches a leaf, the associated class is assigned to the
sample. To carry out this prediction, for each Node object it gets the associated nodeSVM (the
structure that contains the information regarding the associated SVM) and calls its predict function,
which in turn calls the predict function of the LIBSVM package.
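A simplified sketch of this recursive descent, with a simplified Node class and a hypothetical svmSaysRight predicate standing in for the call to the node's SVM:

```java
import java.util.function.Predicate;

// Sketch of the recursive descent performed by predictTree: at each internal
// node the associated SVM decides left or right until a leaf class is reached.
public class TreeDescent {
    static class Node {
        final double threshold;   // cut-off value at this node
        final int classe;         // class id if leaf, -1 otherwise
        final Node left, right;
        Node(double threshold, int classe, Node left, Node right) {
            this.threshold = threshold; this.classe = classe;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    static int predict(Node node, Predicate<Node> svmSaysRight) {
        if (node.isLeaf()) return node.classe;
        return predict(svmSaysRight.test(node) ? node.right : node.left, svmSaysRight);
    }

    public static void main(String[] args) {
        // Tiny two-level tree: root at 0.5, right child at 0.75, leaf classes 0/1/2.
        Node tree = new Node(0.5, -1,
                new Node(-1, 0, null, null),
                new Node(0.75, -1,
                        new Node(-1, 1, null, null),
                        new Node(-1, 2, null, null)));
        // Toy stand-in for the SVM: a sample with quality 0.8 always goes right.
        double quality = 0.8;
        System.out.println(predict(tree, n -> quality > n.threshold)); // prints 2
    }
}
```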
Finally, the average reliability percentage for each prediction obtained from the predictTree
function is written into the reliability file.
From ProteinQualityCalculator:
In this case, the Servlet interacts directly with LIBSVM. It uses all the models generated in the
training step (loading them with the svm_load_model function from LIBSVM) to predict the score
of the given sample (using the svm_predict method from LIBSVM), together with the data from the
reliability file.
All this information is displayed in the browser as an HTML file.
2.4-System Interface Design
The design of the three interfaces is explained in this section.
2.4.1- Protein Quality Calculator Interface
This interface is implemented as a Servlet which prints the data in the client's browser as an HTML
file.
On the left side of the page, there is a form that must be filled with the scores provided by
ModSSEA, ModCHECK, ProQ-MX and ProQ-LG systems.
When the submit button is pressed, the request is sent from the browser in the client machine to the
server, which calls the proper functions in the server machine to calculate the prediction.
Afterwards, the information regarding the prediction is printed in the client's browser in the right
side of the screen.
The data introduced in the form must be decimal numbers ranging from 0 to 1 inclusive.
The output supplied consists of two columns. The first column indicates whether the prediction is
above or below a certain threshold (ranging from 0 to 1, since it is represented on the same scale as
the input), and the second is a percentage representing the correctness of this prediction, based on
the results of the testing stage.
In order to display the correctness of the prediction, a file called “reliability”, which contains the
accuracy achieved in the testing stage for each threshold, has to be read.
The file is structured in two columns (separated by a blank), each containing a decimal number. In
each row, the left number represents the accuracy when predicting class 0 and the right one the
accuracy for class 1 at a certain threshold. The data in the file is sorted in ascending order of
threshold value.
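A sketch of how such a reliability file can be parsed, shown on an in-memory string; the sample percentages are illustrative only:

```java
// Sketch of parsing the "reliability" file format described above: one row per
// threshold, two blank-separated decimals (class 0 accuracy, class 1 accuracy),
// rows sorted by ascending threshold.
public class ReliabilityParser {
    static double[][] parse(String content) {
        String[] rows = content.trim().split("\\R");   // one row per threshold
        double[][] acc = new double[rows.length][2];
        for (int i = 0; i < rows.length; i++) {
            String[] cols = rows[i].trim().split("\\s+");
            acc[i][0] = Double.parseDouble(cols[0]);   // class 0 accuracy
            acc[i][1] = Double.parseDouble(cols[1]);   // class 1 accuracy
        }
        return acc;
    }

    public static void main(String[] args) {
        String sample = "92.99 62.63\n92.29 47.93\n95.95 22.92"; // illustrative values
        double[][] acc = parse(sample);
        System.out.println(acc[1][0]); // prints 92.29
    }
}
```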
It must be mentioned that the form uses a Javascript (Sun Microsystems, Inc., 1994) function to
validate the input data on the client side (faster than validating with Java (Sun Microsystems, Inc.,
1994) on the server, because of the transmission time). This validation has been implemented using
regular expressions.
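The exact regular expression used in the form is not given in the text; the following Java sketch shows one plausible pattern for accepting decimal numbers between 0 and 1 inclusive:

```java
import java.util.regex.Pattern;

// Sketch of the regular-expression validation described above; this pattern is
// an assumption, not the one actually used by the Javascript form.
public class InputValidator {
    // Accepts 0, 1, 0.xxx, and 1.0 (with any number of trailing zeros).
    private static final Pattern SCORE =
            Pattern.compile("^(0(\\.\\d+)?|1(\\.0+)?)$");

    static boolean isValidScore(String s) {
        return SCORE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidScore("0.875")); // prints true
        System.out.println(isValidScore("1.25"));  // prints false
    }
}
```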
Figure 19. Protein Quality Calculator User Interface.
2.4.2- System Administrator Interface
Although this is a non-interactive interface, communication between the user and the system is
achieved through the “trainingset” and “testingset” files inside the “data” directory.
Both files use the same data format. Each consists of a set of rows containing the ModSSEA,
ModCHECK, ProQ-MX and ProQ-LG scores separated by a blank. At the end, after the last blank,
comes the observed quality score.
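A minimal sketch of reading one such row; the field names and the sample values are illustrative, not taken from the real data files:

```java
// Sketch of one row of the "trainingset"/"testingset" format described above:
// four MQAP scores (ModSSEA, ModCHECK, ProQ-MX, ProQ-LG) followed by the
// observed quality score, all separated by blanks.
public class SampleRow {
    final double modssea, modcheck, proqMx, proqLg, observedQuality;

    SampleRow(String line) {
        String[] f = line.trim().split("\\s+");
        modssea = Double.parseDouble(f[0]);
        modcheck = Double.parseDouble(f[1]);
        proqMx = Double.parseDouble(f[2]);
        proqLg = Double.parseDouble(f[3]);
        observedQuality = Double.parseDouble(f[4]); // last column
    }

    public static void main(String[] args) {
        SampleRow r = new SampleRow("0.71 0.64 0.55 0.60 0.68"); // illustrative values
        System.out.println(r.observedQuality); // prints 0.68
    }
}
```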
This interface outputs a set of files which are used for prediction and evaluation tasks, or by the
system itself to carry out its function. Specifically, there are six kinds of files:
·X.model -> These files contain information regarding the SVM with X value as a discriminator.
·X-class0 -> These files contain the dataset that belongs to class 0 after splitting the data
depending on X.
·X-class1 -> The same concept as the above file but the data belongs to class 1 instead.
·X-classified -> These files contain the MQAPs scores for dealing with X but the observed
quality score has been converted into 0 or 1 in order to compare the prediction quality.
·outputX -> These files contain the predictions for the SVM with the X discriminator value.
·reliability -> File containing the accuracy of each prediction made (explained above).
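The conversion of the observed quality score into 0 or 1 for the X-classified files can be sketched as follows, assuming the same <= / > convention as the trees in Figures 15 to 17:

```java
// Sketch of the binarisation performed when producing the X-classified files:
// the observed quality is replaced by 0 or 1 depending on whether it falls
// below or above the cut-off X (<= goes to class 0, as in Figures 15-17).
public class Binarise {
    static int label(double observedQuality, double cutoff) {
        return observedQuality <= cutoff ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(label(0.68, 0.5)); // prints 1
        System.out.println(label(0.42, 0.5)); // prints 0
    }
}
```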
2.4.3- Reformat Interface
This interface receives the file name of the ModFOLD file from the command line.
A file named “filename.form” is generated as output, containing data suitable for use in the system.
3-DETAILED DESCRIPTION OF COMPONENTS
3.1-class BST
3.1.1-Method put
void put(double mqap, int classe, nodeSVM svm, double c, double g)
*Creates a node and inserts it at the proper position inside the tree
*@param mqap: a double representing the key value and threshold value of the node
*@param classe: The number that identifies the class in a leaf node, -1 otherwise
*@param svm: SVM associated to the node
*@param c: double representing the c parameter for the SVM
*@param g: double representing the g parameter for the SVM
3.1.2-Method trainTree
void trainTree(String filename)
*Trains all the SVMs within the tree
*@param filename: a String with the filename of the file that contains the training set
3.1.3-Method predictTree
void predictTree(String fileinput, String fileoutput, double separador, double[] reliability)
*Predicts the whole testing set using the previously trained tree
*@param fileinput: name of the file which contains the testing set
*@param fileoutput: name of the file to which the output will be written
*@param separador: the threshold value which will be used to split the data for further
classification
*@param reliability: array of doubles that contains the values of the average accuracy in the
classification
3.2-Class Node
3.2.1-Method Node
Node(double mqap, int classe, nodeSVM svm, double c, double g)
*Class constructor
*@param mqap: a double representing the key value and threshold value of the node
*@param classe: The number that identifies the class in a leaf node, -1 otherwise
*@param svm: SVM associated to the node
*@param c: double representing the c parameter for the SVM
*@param g: double representing the g parameter for the SVM
3.2.2-Method getMQAP
double getMQAP()
*Gets key and threshold value of the node
*@return: Double representing the key of the node
3.2.3-Method getClasse
int getClasse()
*Gets the number identifying the class
*@return: Integer representing the id of the class
3.2.4-Method getNodeSVM
nodeSVM getNodeSVM()
*Gets the nodeSVM associated to the node
*@return: the instance of nodeSVM associated to the current node
3.2.5-Method setNodeSVM
void setNodeSVM(nodeSVM nSVM)
*Sets the nodeSVM
*@param nSVM: a reference to nodeSVM to associate to the current node
3.2.6-Method getLeftChild
Node getLeftChild()
*Gets the left child of the current node
*@return: a reference to the left child node
3.2.7-Method setLeftChild
void setLeftChild(Node x)
*Sets the left child of the current node
*@param x: a reference to the left child which will be associated to the current node
3.2.8-Method getRightChild
Node getRightChild()
*Gets the right child of the current node
*@return: a reference to the right child node
3.2.9-Method setRightChild
void setRightChild(Node x)
*Sets the right child of the current node
*@param x: a reference to the right child which will be associated to the current node
3.2.10-Method getG
double getG()
*Gets the G parameter of the SVM associated to the current node
*@return: a double representing the G parameter
3.2.11-Method setG
void setG(double g)
*Sets the G parameter of the SVM associated to the current node
*@param g: a double representing the G parameter
3.2.12-Method getC
double getC()
*Gets the C parameter of the SVM associated to the current node
*@return: a double representing the C parameter
3.2.13-Method setC
void setC(double c)
*Sets the C parameter of the SVM associated to the current node
*@param c: a double representing the C parameter
3.3-Class NodeSVM
3.3.1-Method read_problem
void read_problem()
*Reads the input file defined in the input_file_name variable and creates the problem to
*be used for the LIBSVM
3.3.2-Method predict
int predict(svm_node[] linea)
*Calls the predict function of the LIBSVM library in order to make a prediction
*@param linea: array of svm_node (type defined by the LIBSVM library) which contains
*all the information needed to predict a single sample
3.3.3-Get/Set methods
Several get and set methods have been created to access the attributes of the
class which are used by LIBSVM to carry out its functionality. These methods are not
explained here because they are used by the library, not by the system itself.
3.4-Class Fileformat
3.4.1-Method Split
void Split(String filename, double separator, boolean training)
*Splits the data from the file named “filename” depending on the “separator” and saves
*the results in 3 files (X-classified, X-class0, X-class1). The meaning of these files is
*explained in Section 2.4.2 of this document.
*@param filename: name of the input file to split
*@param separator: double representing the cut-off point
*@param training: boolean that indicates whether the split is made for training or testing
3.5-Class SVMtree
3.5.1-Method Main
void main(String[] args)
*Main procedure for the System Administrator Interface. This procedure creates a ‘forest’
*of BSTs, then trains and tests the SVMs.
*This procedure does not receive any parameters from the command line
3.6-class ProteinQualityCalculator
3.6.1-Method doGet
void doGet(HttpServletRequest request, HttpServletResponse response)
*Fills the response object with the HTML code which will be used to display a
*webpage in a browser. It also deals with the user request, calling the proper functions
*@param request: the HttpServletRequest object that holds the information of the request
*@param response: the HttpServletResponse object that holds the information regarding
*the response which will be sent to the client
3.6.2-Method doPost
void doPost(HttpServletRequest request, HttpServletResponse response)
*Fills the response object with the HTML code which will be used to display a
*webpage in a browser. It also deals with the user request, calling the proper functions
*@param request: the HttpServletRequest object that holds the information of the request
*@param response: the HttpServletResponse object that holds the information regarding
*the response which will be sent to the client
3.7-LIBSVM library
The description of this component can be found in the documentation provided by the LIBSVM.
3.8-Class reformat
3.8.1-Method Main
void main(String[] args)
*Main procedure for the Reformat tool. The file received as a parameter is reformatted into
*a file suitable for use in the system
*The filename of the input file has to be passed as a parameter on the command line
CONCLUSIONS
This project ends with a certain bittersweet flavour. On the one hand, the results that we aspired to
at the beginning of the project have not been fulfilled: we could not develop a full MQAP that
increases the accuracy of model quality assessment. On the other hand, a new approach to this very
difficult problem has been attempted, and the results achieved show that this is indeed a suitable
way to face the problem and the beginning of a path that may allow us to meet our goals.
In the following bullet points, we check the results obtained, and the way this project was
developed, against the goals and milestones defined in the introduction of this dissertation, in order
to identify the points where the project was carried out as expected and the phases where it was
weak.
1-Understanding the problem:
This milestone was perfectly covered at the early stage of the project, since it was needed to carry
on with the rest of the stages. This important research phase was required to understand the scope of
the project.
2-Choosing the method:
This goal was achieved easily, because many successful studies have recently been carried out in
the bioinformatics field using SVMs. Hence, after reading specific documentation about this
method, the decision to use it was made, since it seemed suitable for meeting our objectives.
Afterwards, the usage of the library that implements the method was learnt; once more, the
documentation was very good and the project could proceed without much problem.
3-Training SVM:
The main problems arose when trying to reach this milestone. The proper kernel was chosen, the
best parameters for the kernel were found, and a way to tackle the multi-class classification was
decided upon, but the results achieved were not as expected (the system was unable to classify the
data properly). These decisions (i.e. the choice of the RBF kernel, the selection of parameters, etc.)
were made based on the available documentation, since checking everything by ourselves would
have required a separate project for each decision.
In the end, we found a method that could be useful for determining the quality of high quality
models, using a decision tree of binary SVMs for classification at varying thresholds of model
quality. Unfortunately, due to time limitations we were unable to refine this strategy.
To sum up, although this difficult goal was not achieved completely, we can say that it was partially
covered. (However, it must be said that optimisation of machine learning methods is a complete
field of research in itself.)
4-Developing the application:
Though a program that helps in calculating the quality of the models has been developed, this
difficult milestone cannot be considered fully reached.
The current program takes the ModSSEA, ModCHECK, ProQ-MX and ProQ-LG scores as a
starting point and combines them, giving as output whether the current sample is above or below
certain cut-off points, together with the average accuracy of the classifier during the testing stage; a
true MQAP, in contrast, receives a model of a protein structure as input and gives the quality of the
model as output. Obviously, the program built cannot be described as a true MQAP; however, it is a
useful tool and, with further modifications, will contribute towards the development of future
methods.
Moreover, the system architecture is perhaps not optimal, since the structure of the classes is a
consequence of all the different approaches tried during the project. For example, the system uses a
‘forest’ of BSTs because a previous approach structured all the information regarding classification
decisions in one single BST, not because it is the best structure for the task.
The prototype system has not yet been further refined, because most of the time for the project was
spent on meeting the Training SVM goal.
Therefore, a new analysis and design of this program could be carried out in order to develop a
better and more efficient piece of software.
Currently, this system would be useful to a small part of the biochemistry community because the
problem it solves is quite specific. However, the tool which has been developed is a starting point
for further research with wider reaching application.
5-Developing web based interface:
A basic web page for use as an interface for the system has been developed. Therefore, this
milestone has been reached.
The web page is graphically simple and functional, as we decided to focus on achieving the 3rd
milestone rather than spend time building an elaborate user interface in the front end of the
application.
The normal procedures for installing a web page such as setting up the server or programming the
Servlet have been carried out.
In conclusion, even though not all of the ambitious goals set out at the beginning of the project have
been achieved, it has been found that the way to increase the accuracy of model quality assessment
using SVMs lies in using several SVMs and treating each classification as a single problem,
independent of the previous classification. Furthermore, the models with observed quality equal to
1 or 0 must be removed from the training and testing data in order to keep the data balanced.
Following this advice, further research can be done using the same Kernel and parameters we used
in this project and good results for differentiating high quality models may be achieved.
We would discourage future researchers from trying to use a multi-class SVM, or a BST with
dependent data, with the RBF kernel.
We anticipate that an extension of this work will be carried out very soon which will take into
account our findings and we are optimistic that the ultimate difficult goal that we were not able to
achieve in this short project will be fulfilled.
REFERENCES
Access Excellence at the National Health Museum (1999): The Central Dogma of Molecular
Biology [online] Available from: http://www.accessexcellence.org/RC/VL/GG/central.html
[accessed January 2007]
Bairoch, A. (2000): „The ENZYME database in 2000‟. Nucleic Acids Research. 28, 304-305.
Baker, D. and Sali, A. (2001): „Protein structure prediction and structural genomics‟. Science
294(5540):93-96.
Berg, J.M., Tymoczko, J.L. and Stryer, L. (2002): Biochemistry, W. H. Freeman and Company,
Houndmills, Basingstoke, England, chap.3, 41-76.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and
Bourne, P.E. (2000): „The Protein Data Bank‟. Nucleic Acids Research. Vol.28, No.1 235-242.
Bowie, J.U., Luthy, R. and Eisenberg, D. (1991): „A method to identify protein sequences that fold
into a known three-dimensional structure‟. Science. 253 (5016), 164-170.
CAFASP Critical Assessment of Fully Automated Structure Prediction. CAFASP4 MQAP [online]
Available from: http://www.cs.bgu.ac.il/~dfischer/CAFASP4/mqap.html [accessed January 2007]
Chang, C.C. and Lin, C.J. (2001): LIBSVM: a Library for Support Vector Machines [online]
Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf [accessed February 2007]
Cold Spring Harbor Laboratory (2002): DNA From The Beginning [online] Available from:
http://www.dnaftb.org [accessed January 2007]
DTREG. SVM – Support Vector Machines [online] Available from: http://www.dtreg.com/svm.htm
[accessed February 2007]
Eramian, D., Shen, M., Devos, D., Melo, F., Sali, A. and Marti-Renom, M.A. (2006): „A composite
score for predicting errors in protein structure models‟. Protein Science. 15, 1653-1666.
GanttProject.org (2003): GanttProject.org [online] Available from: http://ganttproject.org [accessed
January 2007]
Gold, C. and Sollich, P. (2005): Software for parameter tuning for SVM classifiers. [online]
Available from: http://www.mth.kcl.ac.uk/~psollich/BayesSVM/ [accessed March 2007]
Guy, N.C. Protein 3D Structural Prediction & Fold Recognition [online] Available from:
http://darwin.nmsu.edu/~molb470/fall2003/Projects/guy/ [accessed January 2007]
Hsu, C.W., Chang, C.C. and Lin, C.J. (2003): A Practical Guide to Support Vector Classification.
[online] Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf [accessed
January 2007]
Hsu, C.W and Lin, C.J. (2002): A Comparison of Methods for Multi-class Support Vector
Machines. [online] Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/multisvm.pdf
[accessed February 2007]
50
Jones, D.T. and Hadley, C. (2000): „Threading methods for protein structure prediction‟.
Bioinformatics: Sequence, structure and databanks. Higgins, D. & Taylor, W.R. Eds., 1-13,
Springer-Verlag, Heidelberg.
Lazaridis, T. and Karplus, M. (1999): 'Discrimination of the native from misfolded protein models
with an energy function including implicit solvation'. J. Mol. Biol. 288, 477-487.
Lodish, H., Berk, A., Matsudaira, P., Kaiser, C.A., Krieger, M., Scott, M.P., Zipursky, S.L. and
Darnell, J. (2004): Molecular Cell Biology, WH Freeman and Company, New York.
Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sanchez, R., Melo, F. and Sali, A. (2000):
'Comparative protein structure modeling of genes and genomes'. Annu Rev Biophys Biomol Struct.
29, 291-325.
McGuffin, L. J. (2007a): 'Benchmarking consensus model quality assessment for protein fold
recognition'. BMC Bioinformatics, submitted.
McGuffin, L. J. (2007b): 'Aligning sequences to structures'. Methods in Molecular Biology, Protein
Structure Prediction: Methods and Protocols. Humana Press, In press.
Moore, A.W. (2001): Support Vector Machines [online] Available from:
http://www.autonlab.org/tutorials/svm.html [accessed January 2007]
Nelson, D.L. and Cox, M.M. (2000): Lehninger Principles of Biochemistry. Worth Publishers, New
York, chap. 5-6, 115-202.
Pettitt, C.S., McGuffin, L.J. and Jones, D.T. (2005): 'Improving sequence-based fold recognition by
using 3D model quality assessment'. Bioinformatics. 21, 3509-3515.
Sali, A. and Blundell, T.L. (1993): 'Comparative protein modelling by satisfaction of spatial
restraints'. J Mol Biol 234(3), 779-815.
Sippl, M.J. (1993): 'Recognition of Errors in Three-Dimensional Structures of Proteins'. Proteins.
17, 355-362.
Sun Microsystems, Inc. (1994): Sun Developer Network (SDN). [online] Available from:
http://java.sun.com/ [accessed January 2007]
The Apache Software Foundation (1999): Apache Tomcat. [online] Available from:
http://tomcat.apache.org/ [accessed June 2007]
Voet, D. and Voet, J.G. (2004): Biochemistry, Wiley, Hoboken, NJ, Vol. 1.
Wallner, B. and Elofsson, A. (2003): 'Can correct protein models be identified?' Protein Science.
12, 1073-1086.
Zhang, Y. and Skolnick, J. (2004): 'Scoring function for automated assessment of protein structure
template quality'. Proteins. 57, 702-710.
Zhang, Y. and Skolnick, J. (2005): 'The protein structure prediction problem could be solved using
the current PDB library'. Proc Natl Acad Sci USA, 102 (4), 1029-1034.
APPENDICES
Class BST Java code:
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.StringTokenizer;
import libsvm.svm;
import libsvm.svm_node;
import libsvm.svm_parameter;
public class BST
{
private Node root;
private int N;
public BST()
{
root = null;
N = 0;
}
public int size() { return N; }
public void put(double mqap, int classe, nodeSVM svm,double c, double g)
{
root=insert(root,mqap,classe,svm,c,g);
N++;
}
private Node insert(Node x, double mqap, int classe, nodeSVM svm, double c, double g)
{
Node aux;
if(x==null)
{
x=new Node(mqap,classe,svm,c,g);
}
else
{
if(mqap<=x.getMQAP())
{
aux=x.getLeftChild();
aux=insert(aux,mqap,classe,svm,c,g);
x.setLeftChild(aux);
}
else
{
aux=x.getRightChild();
aux=insert(aux,mqap,classe,svm,c,g);
x.setRightChild(aux);
}
}
return x;
}
public void trainTree(String filename) throws Exception
{
trainNode(root,filename);
}
private void trainNode(Node x,String filename) throws Exception
{
if(x.getClasse()!=-1)
{
return;
}
else
{
//Format the input
Fileformat ff=new Fileformat();
ff.Split(filename, x.getMQAP(),true);
nodeSVM nSVM=new nodeSVM();
fillSVMParameters(nSVM, filename,x.getMQAP(),x.getC(),x.getG());
nSVM.read_problem();
nSVM.seterrormsg(svm.svm_check_parameter(nSVM.getproblem(),nSVM.getparam()));
if(nSVM.geterrormsg() != null)
{
System.err.print("Error: "+nSVM.geterrormsg()+"\n");
System.exit(1);
}
else
{
nSVM.setmodel(svm.svm_train(nSVM.getproblem(),nSVM.getparam()));
svm.svm_save_model(nSVM.getmodelfilename(),nSVM.getmodel());
}
x.setNodeSVM(nSVM);
trainNode(x.getLeftChild(),"data/"+x.getMQAP()+"-train-class0");
trainNode(x.getRightChild(),"data/"+x.getMQAP()+"-train-class1");
}
}
private void fillSVMParameters(nodeSVM nSVM, String filename, double separador, double c,
double g)
{
nSVM.setinputfilename("data/"+separador+"-train-classified");
nSVM.setmodelfilename("data/"+separador+".model");
nSVM.setcrossvalidation(0);
nSVM.setnrfold(0);
nSVM.setparam(new svm_parameter());
nSVM.getparam().svm_type = svm_parameter.C_SVC;
nSVM.getparam().kernel_type = svm_parameter.RBF;
nSVM.getparam().degree = 3;
nSVM.getparam().gamma = g; // -g option from grid search
nSVM.getparam().coef0 = 0;
nSVM.getparam().nu = 0.5;
nSVM.getparam().cache_size = 100;
nSVM.getparam().C = c; // -c option from grid search
nSVM.getparam().eps = 1e-3;
nSVM.getparam().p = 0.1;
nSVM.getparam().shrinking = 1;
nSVM.getparam().probability = 0;
nSVM.getparam().nr_weight = 0;
nSVM.getparam().weight_label = new int[0];
nSVM.getparam().weight = new double[0];
}
public void predictTree(String fileinput, String fileoutput, double separador, double[] reliability)
throws Exception
{
//Formatting the file
Fileformat ff=new Fileformat();
ff.Split(fileinput, separador, false);
//Calling to predictNode for each line within the file
BufferedReader input = new BufferedReader(new FileReader("data/"+separador+"-test-classified"));
DataOutputStream output = new DataOutputStream(new FileOutputStream(fileoutput));
double is0 = 0, isNO0 =0;
double predictedas0 = 0,predictedasNO0 = 0;
while(true)
{
String line = input.readLine();
if(line == null) break;
StringTokenizer st = new StringTokenizer(line," \t\n\r\f:");
double target = atof(st.nextToken());
int m = st.countTokens()/2;
svm_node[] x = new svm_node[m];
for(int j=0;j<m;j++)
{
x[j] = new svm_node();
x[j].index = atoi(st.nextToken());
x[j].value = atof(st.nextToken());
}
int v;
v = predictNode(root,x);
output.writeBytes(v+"\n");
if(target==0)
{
is0++;
if(v==0)
predictedas0++;
}
else
{
isNO0++;
if(v!=0)
predictedasNO0++;
}
}
reliability[0]=(predictedas0/is0);
reliability[1]=(predictedasNO0/isNO0);
}
private int predictNode(Node x,svm_node[] linea) throws Exception
{
if(x.getClasse()!=-1)
{
return x.getClasse();
}
else
{
nodeSVM auxSVM=x.getNodeSVM();
int classe=auxSVM.predict(linea);
if(classe==0)
{
return predictNode(x.getLeftChild(),linea);
}
else
{
return predictNode(x.getRightChild(),linea);
}
}
}
private static double atof(String s)
{
return Double.valueOf(s).doubleValue();
}
private static int atoi(String s)
{
return Integer.parseInt(s);
}
}
Class Node Java code:
public class Node
{
private double mqap; // mqap value
private int classe; // associated class
private nodeSVM svm; // SVM associated to the node
private Node left, right; // left and right subtrees
private double c;
private double g;
public Node(double mqap, int classe, nodeSVM svm, double c, double g)
{
this.mqap = mqap;
this.classe = classe;
this.svm = svm;
this.c=c;
this.g=g;
}
public double getMQAP()
{
return mqap;
}
public int getClasse()
{
return classe;
}
public nodeSVM getNodeSVM()
{
return svm;
}
public void setNodeSVM(nodeSVM nSVM)
{
this.svm=nSVM;
}
public Node getLeftChild()
{
return left;
}
public void setLeftChild(Node x)
{
left=x;
}
public Node getRightChild()
{
return right;
}
public void setRightChild(Node x)
{
right=x;
}
public double getC()
{
return c;
}
public void setC(double c)
{
this.c=c;
}
public double getG()
{
return g;
}
public void setG(double g)
{
this.g=g;
}
}
Class NodeSVM Java code:
import libsvm.*;
import java.io.*;
import java.util.*;
class nodeSVM
{
private svm_parameter param;
private svm_problem prob;
private svm_model model;
private String input_file_name;
private String model_file_name;
private String error_msg;
private int cross_validation;
private int nr_fold;
public svm_parameter getparam()
{
return param;
}
public void setparam(svm_parameter param)
{
this.param=param;
}
public svm_problem getproblem()
{
return prob;
}
public void setproblem(svm_problem prob)
{
this.prob=prob;
}
public svm_model getmodel()
{
return model;
}
public void setmodel(svm_model model)
{
this.model=model;
}
public String getinputfilename()
{
return input_file_name;
}
public void setinputfilename(String file)
{
this.input_file_name=file;
}
public String getmodelfilename()
{
return model_file_name;
}
public void setmodelfilename(String file)
{
this.model_file_name=file;
}
public String geterrormsg()
{
return error_msg;
}
public void seterrormsg(String error_msg)
{
this.error_msg=error_msg;
}
public int getcrossvalidation()
{
return cross_validation;
}
public void setcrossvalidation(int cross_val)
{
this.cross_validation=cross_val;
}
public int getnrfold()
{
return nr_fold;
}
public void setnrfold(int nr_fold)
{
this.nr_fold=nr_fold;
}
//SVM functions
public void read_problem() throws IOException
{
BufferedReader fp = new BufferedReader(new FileReader(input_file_name));
Vector vy = new Vector();
Vector vx = new Vector();
while(true)
{
String line = fp.readLine();
if(line == null) break;
StringTokenizer st = new StringTokenizer(line," \t\n\r\f:");
vy.addElement(st.nextToken());
int m = st.countTokens()/2;
svm_node[] x = new svm_node[m];
for(int j=0;j<m;j++)
{
x[j] = new svm_node();
x[j].index = atoi(st.nextToken());
x[j].value = atof(st.nextToken());
}
vx.addElement(x);
}
prob = new svm_problem();
prob.l = vx.size();
prob.x = new svm_node[prob.l][];
for(int i=0;i<prob.l;i++)
prob.x[i] = (svm_node[])vx.elementAt(i);
prob.y = new double[prob.l];
for(int i=0;i<prob.l;i++)
prob.y[i] = atof((String)vy.elementAt(i));
fp.close();
}
public int predict(svm_node[] linea) throws IOException
{
return (int)svm.svm_predict(this.model,linea);
}
private static double atof(String s)
{
return Double.valueOf(s).doubleValue();
}
private static int atoi(String s)
{
return Integer.parseInt(s);
}
}
Class Fileformat Java code:
import java.io.*;
public class Fileformat
{
/*This function classifies the data into two classes, depending on
the separator, and writes the result in libsvm format with the
class label first. It also creates two new files containing the
lines of each class together with their original quality value.
Example for separator 0.5:
Input-> 0:0.3478 1:0.4672 2:0.8724 3:0.273 0.7458
0:0.5673 1:0.6496 2:0.4976 3:0.234 0.4598
Output-> 1 0:0.3478 1:0.4672 2:0.8724 3:0.273
0 0:0.5673 1:0.6496 2:0.4976 3:0.234
*/
public void Split(String filename, double separator, boolean training) throws Exception
{
String infix="-test";
if(training)
infix="-train";
FileInputStream fs=new FileInputStream(filename);
FileOutputStream out=new FileOutputStream("data/"+separator+infix+"-classified");
FileOutputStream out1=new FileOutputStream("data/"+separator+infix+"-class0");
FileOutputStream out2=new FileOutputStream("data/"+separator+infix+"-class1");
String line;
String lineout;
String lineout1;
String lineout2;
//Number of bytes in the input file to be read
int totalbytes=fs.available();
byte[] inarray=new byte[totalbytes];
byte[] outarray=new byte[totalbytes];
byte[] outarray1=new byte[totalbytes];
byte[] outarray2=new byte[totalbytes];
//Reading the file and storing the data in inarray
fs.read(inarray);
//line number
int i=1;
//i+k: current byte in the input array
int k=-1;
//i+m: current byte in the output array
int m=-1;
int m1=0;
int m2=0;
while((i+k)<totalbytes)
{
int j=0;
line="";
//Reading the line i
while(inarray[i+k]!='\n')
{
line+=(char)inarray[i+k];
k++;
}
//Formatting the line
if(!line.equals(""))
{
int lasttabindex=line.lastIndexOf(' ');
String sQuality=line.substring(lasttabindex+1);
lineout=line.substring(0,lasttabindex);
lineout1="";
lineout2="";
//Checking the model quality
double quality=Double.parseDouble(sQuality);
if(quality<=separator)
{
lineout1=lineout+" "+sQuality;
lineout="0 "+lineout;
}
else
{
lineout2=lineout+" "+sQuality;
lineout="1 "+lineout;
}
if(quality!=0 && quality!=1)
{
//Writing the line in the output array
for(j=0;j<lineout.length();j++)
{
outarray[i+m]=(byte)lineout.charAt(j);
m++;
}
outarray[i+m]='\n';
int mbefore=m1;
for(int l=0;l<lineout1.length();l++)
{
outarray1[m1]=(byte)lineout1.charAt(l);
m1++;
}
if(mbefore!=m1)
{
outarray1[m1]='\n';
m1++;
}
mbefore=m2;
for(int l=0;l<lineout2.length();l++)
{
outarray2[m2]=(byte)lineout2.charAt(l);
m2++;
}
if(mbefore!=m2)
{
outarray2[m2]='\n';
m2++;
}
}
else
{
m--;
}
i++;
}
}
//Writing the output array to the file
out.write(outarray,0,i+m);
out1.write(outarray1,0,m1);
out2.write(outarray2,0,m2);
}
}
Class SVMtree Java code:
import java.io.*;
public class SVMtree
{
public static void main(String[] args) throws Exception
{
PrintWriter out = new PrintWriter(new FileWriter("data/reliability"));
//Create the decision trees
BST st05 = new BST();
BST st075 = new BST();
BST st0875 = new BST();
BST st09375 = new BST();
BST st096875 = new BST();
BST st0984375 = new BST();
double[] reliability=new double[2];
st05.put(0.5,-1,null,512,0.5);
st05.put(0.2,0,null,-1,-1);
st05.put(0.7,1,null,-1,-1);
st05.trainTree("data/trainingset");
st05.predictTree("data/testingset","data/output05",0.5,reliability);
out.println(reliability[0]+" "+reliability[1]);
st075.put(0.75,-1,null,32,0.03125);
st075.put(0.6,0,null,-1,-1);
st075.put(0.8,1,null,-1,-1);
st075.trainTree("data/0.5-train-class1");
st075.predictTree("data/0.5-test-class1","data/output075",0.75,reliability);
out.println(reliability[0]+" "+reliability[1]);
st0875.put(0.875,-1,null,32768,2);
st0875.put(0.6,0,null,-1,-1);
st0875.put(0.9,1,null,-1,-1);
st0875.trainTree("data/0.75-train-class1");
st0875.predictTree("data/0.75-test-class1","data/output0875",0.875,reliability);
out.println(reliability[0]+" "+reliability[1]);
st09375.put(0.9375,-1,null,512,8);
st09375.put(0.6,0,null,-1,-1);
st09375.put(0.94,1,null,-1,-1);
st09375.trainTree("data/0.875-train-class1");
st09375.predictTree("data/0.875-test-class1","data/output09375",0.9375,reliability);
out.println(reliability[0]+" "+reliability[1]);
st096875.put(0.96875,-1,null,512,8);
st096875.put(0.6,0,null,-1,-1);
st096875.put(0.97,1,null,-1,-1);
st096875.trainTree("data/0.9375-train-class1");
st096875.predictTree("data/0.9375-test-class1","data/output096875",0.96875,reliability);
out.println(reliability[0]+" "+reliability[1]);
st0984375.put(0.984375,-1,null,512,8);
st0984375.put(0.6,0,null,-1,-1);
st0984375.put(0.99,1,null,-1,-1);
st0984375.trainTree("data/0.96875-train-class1");
st0984375.predictTree("data/0.96875-test-class1",
"data/output0984375",0.984375,reliability);
out.println(reliability[0]+" "+reliability[1]);
out.close();
}
}
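The six thresholds hard-coded in SVMtree above (0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375) follow a simple halving pattern: each split takes the upper half of the interval left by the previous one, i.e. the n-th threshold is 1 - 2^-n. As an illustrative sketch only (not part of the original program, and the class name ThresholdSketch is hypothetical), they can be generated as follows:

```java
public class ThresholdSketch
{
	public static void main(String[] args)
	{
		// Each threshold halves the remaining high-quality interval:
		// 0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375
		for (int n = 1; n <= 6; n++)
		{
			double t = 1.0 - Math.pow(2.0, -n);
			System.out.println(t);
		}
	}
}
```

Because the thresholds are all negative powers of two away from 1, they are represented exactly in double precision, which is why they can be used safely as file-name infixes (e.g. "data/0.9375-train-class1") in the classes above.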
Class ProteinQualityCalculator Java code:
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import libsvm.*;
public class ProteinQualityCalculator extends HttpServlet
{
public void doGet(HttpServletRequest request, HttpServletResponse response)throws
IOException, ServletException
{
response.setContentType("text/html");
PrintWriter out = response.getWriter();
out.println("<html>");
out.println("<head>");
//JavaScript to validate the form
out.println("<script type='text/javascript'>");
out.println("function validate(form){\n");
out.println("\tif(form.modssea.value.length==0 || form.modcheck.value.length==0 || " +
"form.proqmx.value.length==0 || form.proqlg.value.length==0){\n");
out.println("\t\talert('You must fill all the fields');\n");
out.println("\t\treturn false;\n");
out.println("\t}\n");
out.println("\tvar decimal = /^((0(\\.[0-9]*)?)|(1))$/;");
out.println("\tif(form.modssea.value.match(decimal)==null || " +
"form.modcheck.value.match(decimal)==null || " +
"form.proqmx.value.match(decimal)==null || " +
"form.proqlg.value.match(decimal)==null){\n");
out.println("\t\talert('Valid values are decimal numbers ranging from 0 to 1');\n");
out.println("\t\treturn false;\n");
out.println("\t}\n");
out.println("\treturn true;\n");
out.println("}");
out.println("</script>");
out.println("<title>Protein Quality Calculator</title>");
out.println("</head>");
out.println("<body bgcolor=\"white\">");
out.println("<TABLE>");
out.println("<TR><TD><TABLE><TR><TD>");
out.println("<H1><CENTER><B>PROTEIN QUALITY " +
"CALCULATOR</B></CENTER></H1>");
out.println("</TD></TR></TABLE></TD></TR>");
out.println("<TR>");
out.println("<TD width='50%'>");
out.println("<H2><CENTER>INPUT</CENTER></H2>");
out.println("</TD><TD width='50%'>");
out.println("<H2><CENTER>RESULTS</CENTER></H2>");
out.println("</TD>");
out.println("</TR>");
out.println("<TR>");
out.println("<TD><CENTER>");
out.println("<form action='ProteinQualityCalculator' onsubmit='return validate(this);' " +
"method='POST'>");
out.println("<TABLE border='1'>");
out.println("<TR>");
out.println("<TD>ModSSEA score:</TD><TD><input type='text' name='modssea'></TD>");
out.println("</TR><TR>");
out.println("<TD>MODCHECK score:</TD><TD><input type='text' " +
"name='modcheck'></TD>");
out.println("</TR><TR>");
out.println("<TD>ProQ-MX score:</TD><TD><input type='text' name='proqmx'></TD>");
out.println("</TR><TR>");
out.println("<TD>ProQ-LG score:</TD><TD><input type='text' name='proqlg'></TD>");
out.println("</TR>");
out.println("</TABLE>");
out.println("<br><input type='submit' value='Submit'>");
out.println("</form>");
out.println("</CENTER></TD>");
out.println("<TD VALIGN='top' ALIGN='center'>");
if(request.getParameter("modssea")!=null && request.getParameter("modcheck")!=null &&
request.getParameter("proqmx")!=null && request.getParameter("proqlg")!=null)
{
calculate(out,request);
}
out.println("</TD>");
out.println("</TR>");
out.println("</TABLE>");
out.println("</body>");
out.println("</html>");
}
public void doPost(HttpServletRequest request, HttpServletResponse response)throws
IOException, ServletException
{
doGet(request, response);
}
private static void calculate(PrintWriter out, HttpServletRequest request) throws IOException
{
svm_node[] x = new svm_node[4];
x[0] = new svm_node();
x[1] = new svm_node();
x[2] = new svm_node();
x[3] = new svm_node();
x[0].index = 1;
x[0].value = Double.parseDouble(request.getParameter("modssea"));
x[1].index = 2;
x[1].value = Double.parseDouble(request.getParameter("modcheck"));
x[2].index = 3;
x[2].value = Double.parseDouble(request.getParameter("proqmx"));
x[3].index = 4;
x[3].value = Double.parseDouble(request.getParameter("proqlg"));
out.println("<TABLE><TR><TD width='50%'><H3>QUALITY</H3></TD><TD " +
"width='50%' align='right'><H3>ACCURACY</H3></TD></TR>");
FileReader fs=new FileReader("/var/lib/tomcat5.5/webapps/servlets-examples/WEB-" +
"INF/classes/PQC/data/reliability");
BufferedReader in=new BufferedReader(fs);
double aux;
String line="";
boolean classified=false;
//Load all the models
svm_model model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-" +
"examples/WEB-INF/classes/PQC/data/0.5.model");
int classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.5</P></TD><TD><P " +
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.5</P></TD><TD><P " +
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
if(!classified)
{
model=null;
model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+
"WEB-INF/classes/PQC/data/0.75.model");
classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.75</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.75</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
}
if(!classified)
{
model=null;
model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+
"WEB-INF/classes/PQC/data/0.875.model");
classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.875</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.875</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
}
if(!classified)
{
model=null;
model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+
"WEB-INF/classes/PQC/data/0.9375.model");
classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.9375</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.9375</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
}
if(!classified)
{
model=null;
model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+
"WEB-INF/classes/PQC/data/0.96875.model");
classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.96875</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.96875</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
}
if(!classified)
{
model=null;
model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+
"WEB-INF/classes/PQC/data/0.984375.model");
classe=(int)svm.svm_predict(model,x);
line=in.readLine();
if(classe==0)
{
aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));
out.println("<TR><TD><P><= 0.984375</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
classified=true;
}
else
{
aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));
out.println("<TR><TD><P>> 0.984375</P></TD><TD><P "+
"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");
}
}
out.println("</TABLE>");
}
}
Class Reformat Java code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
public class Reformat
{
/**
* This program converts the input data into the format required by the SVM.
* You must specify the input file name as a parameter.
* It writes an output file called inputfilename.form
* @param args args[0] contains the input file name to be formatted
*/
public static void main(String[] args) throws Exception
{
if(args.length==0)
throw new Exception("You must specify the input file");
FileInputStream fs=new FileInputStream(args[0]);
FileOutputStream out=new FileOutputStream(args[0]+".form");
String line;
String lineout;
//Number of bytes in the input file to be read
int totalbytes=fs.available();
byte[] inarray=new byte[totalbytes];
byte[] outarray=new byte[totalbytes];
//Reading the file and storing the data in inarray
fs.read(inarray);
//line number
int i=1;
//i+k: current byte in the input array
int k=-1;
//i+m: current byte in the output array
int m=-1;
while((i+k)<totalbytes)
{
int j=0;
line="";
//Reading the line i
while(inarray[i+k]!='\n')
{
line+=(char)inarray[i+k];
k++;
}
k+=2;
//Formatting the line
if(!line.equals(""))
{
// libsvm
String aux;
lineout="";
//skipping two first columns
line=line.substring(line.indexOf(" ")+1);
line=line.substring(line.indexOf(" ")+1);
aux=line.substring(0,line.indexOf(" "));
lineout+="1:"+aux+" ";
line=line.substring(line.indexOf(" ")+1);
aux=line.substring(0,line.indexOf(" "));
lineout+="2:"+aux+" ";
line=line.substring(line.indexOf(" ")+1);
aux=line.substring(0,line.indexOf(" "));
lineout+="3:"+aux+" ";
line=line.substring(line.indexOf(" ")+1);
aux=line.substring(0,line.indexOf(" "));
lineout+="4:"+aux+" ";
line=line.substring(line.indexOf(" ")+1);
//Checking the model quality
double quality=Double.parseDouble(line);
lineout+=quality;
//Writing the line in the output array
for(j=0;j<lineout.length();j++)
{
outarray[i+m]=(byte)lineout.charAt(j);
m++;
}
if(!lineout.equals(""))
{
outarray[i+m]='\n';
i++;
}
}
}
//Writing the output array to the file
out.write(outarray,0,i+m);
}
}