

MODEL QUALITY ASSESSMENT OF PREDICTED PROTEIN STRUCTURES USING A SUPPORT VECTOR MACHINE

A Dissertation

Submitted In Partial Fulfilment Of The Requirements For The Degree Of

MASTER OF SCIENCE

In

NETWORK AND E-BUSINESS CENTERED COMPUTING,

in

THE BIOCENTRE / SCHOOL OF BIOLOGICAL SCIENCES

UNIVERSITY OF READING

by

Albert Vilamala Muñoz

27/06/2007

Supervisor Dr. Liam James McGuffin


ACKNOWLEDGEMENTS

I would especially like to thank my supervisor, Dr. Liam James McGuffin, who helped me a great deal in carrying out this project. Many thanks also to Ana Carmen Cajaraville Leiro for sharing her knowledge of biochemistry, and to Bérengère Bressenot, who spent some time revising the grammar of previous reports. Thanks as well to Simon Clarke for revising the grammar of this document and to Jean-Loup Rakhodai for the printed version of this dissertation. Elena Carrara and Anaïs Chevigny also helped with the references section.

To all of them: MANY THANKS!


ABSTRACT

The function that a protein performs is strictly related to its physical (3D) structure. Determining the 3D structure of a protein using experimental techniques is very costly and time consuming. For this reason, finding computational techniques which predict the structure from the amino acid sequence is an ongoing area of research.

This project aims to develop software that increases the cumulative mean model quality (CMMQ) achieved by different Model Quality Assessment Programs (MQAPs). To this end, the output scores of several MQAPs are combined using a Support Vector Machine (SVM) in order to build a system which is more accurate than the existing ones.

Although the CMMQ has not been increased, a decision tree of binary SVMs has made it possible to classify high quality models at varying thresholds of model quality.


CONTENTS

Acknowledgements
Abstract
Contents
Introduction
    DNA and Genetic Information
    Proteins
    Protein Structure Prediction
        Ab-initio or Template Free Modelling
        Comparative or Template Based Protein Modelling
        Steps in the Homology Modelling Process
    Model Quality Assessment Programs (MQAP)
        PRO-Q
        MODCHECK
        MODSSEA
        ModFOLD
    Goals and Milestones
Preliminary System Design
    Support Vector Machine (SVM)
        An Easy Example
        Applying SVM to our problem
    Kernel Function
    SVM Multi-Class
        SVM Multi-Class Using Binary Classifiers
        SVM Multi-Class Using All Data at Once
    Technical Requirements
Implementation Strategy
        Planning the Project
        Background Study
        SVM Study
        First Approach
        Choosing Parameters
        First Multi-Class Approach
        Testing the First Multi-Class Approach
        10-Class Test
        20-Class Test
        BST approach
        4-Class BST
        8-Class BST
        First Approach for High Quality Models
        Focusing on Very High Quality Models (above 0.875)
        Calculating Sensitivity and Specificity
        Independent Binary Classification
Detailed Software Design
    Scope
        Identification
        System Overview
        Document Overview
    System-Wide Design
        Architectural Design
        Components
        Concept of Execution
        System Interface Design
    Detailed Description of Components
        Class BST
        Class Node
        Class NodeSVM
        Class Fileformat
        Class SVMtree
        Class ProteinQualityCalculator
        LIBSVM Library
        Class Reformat
Conclusions
References
Appendices
        Class BST Java code
        Class Node Java code
        Class NodeSVM Java code
        Class Fileformat Java code
        Class SVMtree Java code
        Class ProteinQualityCalculator Java code
        Class Reformat Java code


INTRODUCTION

In recent years, big steps have been made in the field of molecular biology. One of the most important was the recent mapping of the entire human genome sequence. Genome sequencing projects have opened the door to a better understanding of molecular processes, but continuing research in this field is needed in order to improve our knowledge and to enable us to use it for different purposes, such as in medicine (e.g. understanding some diseases and developing drugs against them) and agriculture (e.g. developing a product against pests).

This project aims to contribute to the field of molecular biology, specifically to Protein Structure Prediction, where better methods for assessing the quality of a prediction are needed. Developing such a method is the main goal of the project.

This dissertation starts with a brief explanation of the basic biochemistry concepts that are essential to understand the motivation for this project. First, an introduction to DNA and its processes is given. Secondly, the knowledge from the DNA section is used to explain protein structure and the amino acids that proteins are made of. This, in turn, helps to explain some of the basic functions a protein can carry out. Then, an overview of Protein Structure Prediction is given: different approaches to the problem are shown and the most widely used procedure is outlined. Finally, the last step of this procedure, Model Quality Assessment, which is the focus of this project, is described. Here, the need for computer programs that perform this task is introduced and the basic ideas behind several MQAPs (which are used in the development of the project) are presented.

In the second chapter, Preliminary System Design, the mathematical background of the method chosen to solve the problem is explained. The decisions made during the project are also justified in that chapter.

The third chapter (Implementation Strategy) outlines all the work carried out in this project; that is, it explains in detail each of the steps we followed.

A detailed software design description is given in the fourth chapter. To write this documentation, the Software Design Description standard developed by IEEE (IEEE 1016-1998) has been taken as a guide.

The last section of this dissertation is the Conclusions, where the results of the project are evaluated according to the goals presented in this chapter.

DNA AND GENETIC INFORMATION

The nucleus of most cells contains deoxyribonucleic acid or DNA. This material stores the genetic information needed by the cell in order to synthesise the proteins, which carry out important biological functions.

How is this information stored?

The whole DNA string is known as a genome. This huge string can be broken down into several substrings (known as genes), each of which contains the information needed to synthesise a specific protein.

Each string is made up of 4 nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). DNA can therefore be seen as a long string of A's, C's, G's and T's. Some substrings of characters, known as exons, contain genetic information (i.e. they encode a protein), while others, known as introns, are non-coding regions.

The DNA sequences are transcribed into mRNA and the introns are excised.


The mRNA is read in sets of 3 characters (called codons), each of which encodes one amino acid, apart from some codons that are used as 'punctuation marks' which signal the beginning and the end of the gene.

In summary, each protein is made up from a sequence of amino acids, which are encoded by genes.

(Cold Spring Harbor Laboratory, 2002)
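As an illustration of how codons are read, the following minimal Java sketch walks along an mRNA string three characters at a time. The class name CodonReader, the method translate and the deliberately partial codon table are assumptions made for this example only (the full genetic code has 64 codons), and taking the first AUG found as the start signal is a simplification. The sketch writes mRNA with U (uracil), which takes the place of T in RNA.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: reads an mRNA string codon by codon. */
final class CodonReader {

    // Deliberately partial codon table (hypothetical subset of the 64 codons).
    private static final Map<String, String> CODON_TABLE = Map.of(
            "AUG", "Met",   // also acts as the usual start signal
            "UUU", "Phe",
            "GGC", "Gly",
            "GCU", "Ala",
            "UAA", "STOP",
            "UAG", "STOP",
            "UGA", "STOP");

    /** Returns the amino acids encoded between the first AUG and the first stop codon. */
    static List<String> translate(String mrna) {
        List<String> protein = new ArrayList<>();
        int start = mrna.indexOf("AUG");          // simplification: first AUG in any frame
        if (start < 0) {
            return protein;                       // no start signal found
        }
        for (int i = start; i + 3 <= mrna.length(); i += 3) {
            String codon = mrna.substring(i, i + 3);
            String aminoAcid = CODON_TABLE.get(codon);
            if (aminoAcid == null) {
                throw new IllegalArgumentException("Codon not in the partial table: " + codon);
            }
            if (aminoAcid.equals("STOP")) {
                break;                            // 'punctuation mark': end of the gene
            }
            protein.add(aminoAcid);
        }
        return protein;
    }
}
```

For example, translate("GGAUGUUUGGCGCUUAAGG") skips the two leading characters, reads AUG, UUU, GGC and GCU, and stops at UAA, returning [Met, Phe, Gly, Ala].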

From DNA to protein

The flow of information from DNA to protein is known as the central dogma of molecular biology and is composed of three steps.

In the first step (known as replication), the DNA sequence replicates its information.

In the second step (known as transcription), the information that the DNA contains is copied into

the mRNA.

This mRNA carries the information from the nucleus of the cell to the cytoplasm. In the third step (translation), ribosomes located in the cell 'read' the information carried by the mRNA and synthesise the protein sequence. (Access Excellence at the National Health Museum, 1999)

Figure 1. Central Dogma of Molecular Biology.

Source http://library.thinkquest.org/C0122429/pictures/centraldogma2.gif

PROTEINS

What is a protein and what is it for?

Proteins are macromolecules made of amino acids and are found in the cells of all living organisms. They are essential to the existence of an organism, since they perform a wide range of functions needed for all cellular processes. (Berg, J.M., Tymoczko, J.L. and Stryer, L., 2002)

What is an amino acid?

An amino acid is a molecule that consists of a carbon atom linked to an amino group, a carboxylic acid group, a hydrogen atom and an R group (Figure 2). The R group is the part of the molecule that differs between amino acids, varying in size, shape, charge, etc. According to this R group, 20 different amino acids, from which proteins are constructed, can be defined. (Berg et al., 2002)


Figure 2. Amino acid chemical structure. The central carbon atom is linked to a hydrogen atom (top), to a carboxylic acid group (right), to an amino group (left) and to an R group (bottom).

Source http://plantandsoil.unl.edu/croptechnology2005/UserFiles/Image/siteImages/AminoAcidLG.gif

Amino acids can be connected to one another by peptide bonds, forming a linear polymer (or polypeptide chain). This sequence is known as the Primary Structure of a protein.

These polymers can fold into regular structures that stabilise them (Secondary Structure). Typical secondary structures are the Alpha Helix, the Beta Sheet, and turns and loops (Figure 3). Hence, the Secondary Structure refers to a local spatial arrangement of amino acids.

The Tertiary Structure, on the other hand, defines the overall shape of a protein, that is to say, the spatial arrangement of the whole amino acid sequence.

The Quaternary Structure describes protein complexes made up of several interacting folded chains of amino acids. It is thus the highest level at which a protein can be described. (Berg et al., 2002)


Figure 3. Protein Structure Levels. The amino acid sequence (a) folds to form regular structures (alpha helix and beta sheet) (b) which, in turn, are linked together to form a bigger structure (c). The overall shape of the protein (d) is constructed by combining several Tertiary Structures.

Source http://ghs.gresham.k12.or.us/science/ps/sci/ibbio/chem/notes/chpt3/proteinlevels.gif

Although the physical and chemical properties that allow a protein to adopt its three-dimensional structure from the amino acid sequence are not completely understood, several experiments reveal that the amino acid sequence does indeed completely determine the spatial structure. In other words, the Tertiary Structure is defined by the Primary Structure.

Moreover, as the function of a protein is strictly linked to its shape, it follows that the amino acid sequence determines the function of a protein. (Berg et al., 2002)


Protein functions

It has been explained that proteins are essential for the vital processes within organisms, but what functions do proteins actually perform?

The main characteristic of a protein is its capacity to bind other molecules. This property is strictly related to its Tertiary Structure (e.g. the formation of a binding site which interacts with the external molecule) as well as to the chemical properties of its external amino acids (e.g. enabling chemical bonds).

A chief function of proteins is to act as enzymes. Enzymes catalyse chemical reactions, such as those involved in cellular metabolism, catabolism or DNA replication.

Other proteins, known as antibodies, bind foreign substances in the body and help to destroy them, as part of the immune system.

Haemoglobin is another important protein, found in the blood, whose function is to transport oxygen from the lungs to other organs in the body.

Other proteins are used for structural purposes; for example keratin, which is found in hair and nails.

These are only a few of the roles proteins can carry out, given here as examples; in fact, proteins perform a huge number of functions. Even though the function of many proteins is known, many proteins with unknown function exist. Research in this field is being carried out in order to improve our understanding of them.

It has been shown that most proteins that share a similar structure also have a similar function; this makes sense, since the function of a protein is related to its Tertiary Structure. (Lodish, H., Berk, A., Matsudaira, P., Kaiser, C.A., Krieger, M., Scott, M.P., Zipursky, S.L. and Darnell, J., 2004) (Voet, D. and Voet, J.G., 2004) (Bairoch, A., 2000)

PROTEIN STRUCTURE PREDICTION

The link between the structure of a protein and the function it performs has been explained in the previous section. Clearly, knowledge of the structure of a protein under study can be very useful for understanding the processes in which it is involved. Determining this structure using experimental techniques (such as X-ray crystallography or Nuclear Magnetic Resonance) is very costly and time consuming, which makes it unviable for studying, on a genomic scale, the structures of the proteins that take part in biological processes.

Since recent advances in genomics allow us to obtain the amino acid sequence of a protein from its DNA sequence, computational techniques which predict the Tertiary Structure of a protein directly from its Primary Structure can be used instead.

There are two main approaches to protein structure prediction: Ab-Initio or Template Free Modelling, which predicts the structure “from first principles”, and Comparative or Template Based Modelling, which predicts the structure by comparison with known template structures. (Zhang, Y. and Skolnick, J., 2005) (Bowie, J.U., Luthy, R., Eisenberg, D., 1991)


Figure 4. Protein Structure Prediction Methods. (Diagram: Protein Structure Prediction branches into Ab-Initio or Template Free Modelling and Comparative or Template Based Protein Modelling; the latter branches into Homology Modelling and Protein Threading.)

Ab-Initio or Template Free Modelling

This method attempts to find the structure of a protein directly from its amino acid sequence, trying to imitate the natural folding of the protein using physical principles (which are not completely understood) or stochastic methods (e.g. using an energy function to evaluate different chain conformations).

Ab-Initio modelling is computationally very expensive. Several research projects are trying to address this weakness by using supercomputers (such as Blue Gene) or distributed computing, but the results are still far from the desired ones. Therefore, this technique cannot yet be applied in practice on a genomic scale, and most projects focus on small proteins (~100 amino acids). (Zhang et al., 2005) (Bowie et al., 1991)

Comparative or Template Based Protein Modelling

This method compares the amino acid sequence of a protein with unknown structure (target) against sequences with known protein structures (templates).

The theory behind this method is that, although a vast number of proteins exist, they seem to fold into a relatively small set of structures. Therefore, there is a high probability that the structure of a protein under study has already been found in another protein.

Protein Threading and Homology Modelling are the most widely used techniques that exploit this theory. (Zhang et al., 2005) (Bowie et al., 1991)

Protein Threading

Protein Threading (also known as Fold Recognition) relies on the estimate that up to 70% of new proteins have a structure with folds similar to those of a protein already deposited in the PDB (Protein Data Bank) (Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E., 2000). The reasons for this are the physical and chemical laws governing polypeptide chains, as well as the evolutionary relationships between proteins.

This strategy compares the target with a library of fold templates using a scoring scheme (e.g. an energy function or statistics).


Depending on how the comparison is carried out, this method can be split into two subgroups:

The first one compares a 1-D profile with the fold library (e.g. marking each amino acid according to whether it belongs to the interior of the protein or to the surface).

The second group compares the whole 3-D structure (e.g. comparing the inter-atomic distances for a set of amino acids).

Most of the methods give several measures of similarity which have to be interpreted by a human expert. Hence, this method is often difficult to automate. (Jones, D.T. and Hadley, C., 2000)
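To make the idea of a 1-D profile more concrete, the sketch below encodes each residue as 'B' (buried, interior) or 'E' (exposed, surface) and picks the library fold whose profile agrees with the target at the most positions. This is a toy, ungapped, position-by-position comparison and not a real threading scoring scheme; the class BurialProfileMatcher, its methods and the 'B'/'E' encoding are all assumptions introduced for illustration.

```java
import java.util.Map;

/** Toy illustration of comparing 1-D burial profiles; not a real threading method. */
final class BurialProfileMatcher {

    /** Fraction of positions at which two equal-length profiles agree ('B' buried, 'E' exposed). */
    static double matchFraction(String target, String template) {
        if (target.isEmpty() || target.length() != template.length()) {
            throw new IllegalArgumentException("Profiles must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < target.length(); i++) {
            if (target.charAt(i) == template.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / target.length();
    }

    /** Name of the best-matching fold, or null if no library profile has the target's length. */
    static String bestFold(String target, Map<String, String> foldLibrary) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, String> entry : foldLibrary.entrySet()) {
            String profile = entry.getValue();
            if (profile.length() != target.length()) {
                continue;                         // ungapped toy comparison: skip other lengths
            }
            double score = matchFraction(target, profile);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

A real fold recognition method would align the profiles with gaps and use a statistically derived score; the sketch only shows what comparing a target profile against a fold library means.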

Homology Modelling

This method relies on the premise that similar sequences will have similar structures, and so takes advantage of proteins whose structure is already known. The structure of the template which optimally aligns to the target sequence is assigned as the structure of the target. The process is explained below.

First, it must be mentioned that the quality of the prediction depends on the quality of the sequence alignment as well as the quality of the template structure. (Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sanchez, R., Melo, F. and Sali, A., 2000)

Steps in the homology modelling process

This process consists of four steps. In the first, the template that best fits the target is selected. In the second, the sequence of the template and the sequence of the target are aligned. These two steps are usually performed together, since the aligned sequences can be useful for finding the proper template.

The third step consists of generating the 3D structure model for the target and, in the last step, the correctness of the model is assessed (Marti-Renom et al., 2000).

Template Selection and Sequence Alignment

The first approach to selecting a template is pairwise sequence alignment with programs such as FASTA or BLAST. Another strategy is multiple sequence alignment (e.g. with the PSI-BLAST software).

Alternatively, the Protein Threading method explained in the section above can be used for this stage.

In general, a template is selected according to a similarity function (e.g. the E-value) or by searching for a model which is evolutionarily close to the target. Other criteria which aid the search are finding a template that has a similar function to the target, or considering the fraction of the query sequence structure that can be predicted from the template. (Marti-Renom et al., 2000)
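A very simple similarity measure between an aligned target and template is the percentage of identical residues. The sketch below illustrates only that measure; it is not the E-value reported by FASTA, BLAST or PSI-BLAST, and the class and method names are hypothetical. Both inputs are assumed to be rows of an existing alignment, of equal length, with '-' marking gaps.

```java
/** Illustrative sketch: percent identity between two rows of a pairwise alignment. */
final class AlignmentIdentity {

    /**
     * Percentage of aligned (non-gap) columns in which target and template
     * carry the same residue. Returns 0.0 if no column is free of gaps.
     */
    static double percentIdentity(String alignedTarget, String alignedTemplate) {
        if (alignedTarget.length() != alignedTemplate.length()) {
            throw new IllegalArgumentException("Aligned rows must have the same length");
        }
        int alignedColumns = 0;
        int identicalColumns = 0;
        for (int i = 0; i < alignedTarget.length(); i++) {
            char a = alignedTarget.charAt(i);
            char b = alignedTemplate.charAt(i);
            if (a == '-' || b == '-') {
                continue;                         // gap in either row: column not counted
            }
            alignedColumns++;
            if (a == b) {
                identicalColumns++;
            }
        }
        return alignedColumns == 0 ? 0.0 : 100.0 * identicalColumns / alignedColumns;
    }
}
```

With the rows "MKT-AYIA" and "MKTGAYLA", seven columns are free of gaps and six of them are identical, so the method returns roughly 85.7. Ranking candidate templates by such a value is only a rough proxy for the statistical scores used by the programs named above.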

Model Generation

In this step, the selected template and the alignment are combined to generate a 3D structure of the target: the Cartesian coordinates locating each atom within the protein are defined.

There are several strategies for building models; the most widely used method is Satisfaction of Spatial Restraints, which is based on calculations typically used in NMR spectroscopy. (Baker, D. and Sali, A., 2001) (Sali, A. and Blundell, T.L., 1993)


Model Assessment

Once the structure has been generated, the quality of the model is assessed, typically using Model Quality Assessment Programs, which are explained in detail in the next section. This problem has been tackled using “statistical potential” or “physics-based energy” approaches.

The first approach is based on structure frequencies (assigning a probability to each pairwise interaction between amino acids) in the known proteins of the PDB. This method can also be applied to parts of a model in order to identify poor regions in a protein.

The second approach uses a function to calculate the stability of a protein (considering physical and chemical interactions).

Recently, other methods that assess the quality of a model using Machine Learning have been developed. After training, they can estimate the quality of a model directly from its structure or by combining several Model Quality Assessment Programs. (Sippl, M.J., 1993) (Lazaridis, T. and Karplus, M., 1999) (Eramian, D., Shen, M., Devos, D., Melo, F., Sali, A. and Marti-Renom, M.A., 2006)

MODEL QUALITY ASSESSMENT PROGRAMS (MQAPs)

An MQAP is a computer program that evaluates the quality of a given 3D protein structure (i.e. coordinates in PDB format). The output of the program is a real number that represents the quality of the model. Note that the protein structure is evaluated without comparing it with any native structure.

The basics of the MQAPs used in this project are explained in this section, both as examples of MQAPs and as background needed to understand the project outline. (CAFASP)

PRO-Q (ProQ-LG and ProQ-MX)

This method was created in 2002 by Björn Wallner and Arne Elofsson.

It uses structural features such as atom-atom contacts or residue-residue contacts for evaluation and combines them through a Neural Network, which is trained to learn the observed model quality as defined by either the LG score (ProQ-LG) or the MaxSub score (ProQ-MX). (Wallner, B. and Elofsson, A., 2003)

MODCHECK

This method was created in 2005 by Chris S. Pettitt , Liam J. McGuffin and David T. Jones.

It is based on classical threading potentials in order to calculate the protein structure quality; it uses

statistical analysis of highly resolved protein X-ray crystal structures combined with a solvation

potential and the inverse Boltzmann equation (Pettitt, C.S., McGuffin, L.J. and Jones, D.T., 2005).

MODSSEA

ModSSEA determines the protein structure quality based on secondary structure element

alignments (SSEA) between the predicted secondary structure of the target protein and the

secondary structure represented in the 3D model.

This MQAP was created by Liam James McGuffin in 2007. (McGuffin, L.J., 2007a)

ModFOLD

Another recent example of work in this area is the ModFOLD project, developed by Liam J.

McGuffin (Bioinformatics and Systems Biology Unit, The University of Reading). The ModFOLD

method attempts to combine the output of the different individual MQAP methods presented above

(ModSSEA, MODCHECK, ProQ-MX and ProQ-LG) using a feed-forward back-propagation neural

network in order to increase the cumulative observed model quality scores. The premise is that by

combining the output of many different methods to form a consensus model quality prediction we

can achieve an increase in accuracy of the observed model quality. The results of this initial study

were that, indeed, ModFOLD obtained a better accuracy compared with the other individual

methods, although further research can be carried out, since the mean score is still far from the

theoretical maximum MQAP score. (McGuffin, L.J., 2007a)

This project attempts to refine the solution proposed by ModFOLD, in order to increase the

observed model quality score as close to the theoretical maximum achievable score as possible (i.e.

the score that would be reached by consistently selecting the highest quality model).

In order to achieve this objective, a similar protocol to the ModFOLD method is followed, but here

we investigate the use of a Support Vector Machine (SVM) instead of a neural network and explore

several alternative approaches to classifying the data.

GOALS AND MILESTONES

To achieve the main goal of this project (increasing the accuracy of model quality assessment), a

set of milestones is defined so that the success of the project can be evaluated. A list of

milestones follows:

1.- Understanding the problem: This is the first step in every project. A basic understanding of the

background is essential to confront the problem.

In this case, knowledge about molecular biology is required: what is DNA and what is its function,

what are proteins, their composition and function?

Protein Structure Prediction is another concept that must be understood at the beginning of the

project: what are the different methods to solve this problem and what are the steps required for

these methods?

The concept of MQAPs must also be understood, and some examples reviewed, since the program developed in

this project is an MQAP.

2.- Choosing the method: The main goal to achieve is a very clear one, but choosing the best

approach for it is a difficult task in this early stage of the project. The selection of a concrete

method implies a period of learning the theory behind it and how to use the software that provides

this method.

3.- Training the SVM: The results of this project depend mainly on correctly choosing among the

options the SVM provides and on correctly formatting the data. This means choosing the proper

Kernel function and proper parameters, training the SVM with training data and checking the

system with testing data.

4.- Developing the application: Once the first steps have been completed, it is time to develop

the application itself, integrating the trained SVM into the system.

5.- Developing the web based interface: The last step of this project is to develop a web based

interface in order to supply the resource for public use.

PRELIMINARY SYSTEM DESIGN

In the next two sections, Preliminary System Design (PSD) and Implementation Strategy (IS), all of

the processes carried out throughout the project are explained. The difference between them lies

in the fact that the PSD is focused on the theory while the IS is mainly focused on the practice. In

other words, the PSD explains the mathematical theory behind each method used and justifies

each decision made against the other possible strategies, whilst the IS outlines the work

carried out within this project.

It was known that the results reached by current MQAPs were still far from the theoretical best

results they could achieve (i.e. the selection of the highest quality models of protein structures).

Hence, finding a way to improve these results was the aim of this project.

ModFOLD (McGuffin, L.J., 2007a) was known as a good consensus MQAP method, developed by

University of Reading staff, which combined scores from many individual MQAP methods using a

neural network in order to improve selection of high quality models. In other words, as each one of

the MQAPs focuses on specific features of the protein, a wide range of features can be taken into

account for assessing the model by mixing several MQAPs. (McGuffin, L.J., 2007a)

For that reason, ModFOLD was taken as a reference, with the aim of identifying the main limitations of

the method and refining the consensus approach using new techniques.

One of the difficulties found in ModFOLD was the fact that when using Neural Networks it is very

difficult to control the system; that is to say, once the activation function and its parameters are

chosen and the system has been trained, it is very hard to understand how it works

and consequently to refine the results. A new approach allowing more control over the learning and

testing stages was wanted.

Lately, a new method for classification problems, Support Vector Machine – SVM, has been widely

used in the bioinformatics community, often giving better results than other methods used before in

a large variety of problems including those which previously used neural networks.

This observation encouraged us to find a way to apply SVMs to our problem. First we

introduce the SVM technique; afterwards, the way this technique is integrated into our project is

presented.

SUPPORT VECTOR MACHINE (SVM)

A Support Vector Machine is a mathematical method which is used for classification and

regression. The detailed mathematical background of the method is not explained here, but a brief

overview of the goals of the method is given using an example.

An easy example

Imagine the owner of a bank who wants to know which of the clients are thrifty people and which

ones are wasteful.

Draw a 2-dimensional grid with the average monthly expenses on the abscissa and the average

monthly income on the ordinate.

After that, draw a straight line that separates the thrifty people from the wasteful people.

In this example it is obvious that the line that fits best is the identity function, but an interval can be

defined within which a person is considered neither thrifty nor wasteful because they have only a small

positive or negative balance. In this case, the function that separates the data needs to be refined.

Using the SVM method, this line is drawn using a mathematical function that separates the

data with the maximum separation between them. Some parameters of the function can be fixed in

order to achieve the purpose explained in the paragraph above.

The example shows how an SVM works in a 2-dimensional space, but the same idea can be

generalized to an n-dimensional space (checking n attributes in the same study). This can be achieved

by finding the (n-1)-dimensional hyperplane that separates the data into different subsets. And of course, the problem is

not restricted to two groups; several SVMs can be used, where each one answers a Boolean

question: all data on one side of the hyperplane belong to one set; the data on the other side do not

belong to this set. This is the most intuitive approach to multi-class SVM, but other methods will

be introduced further in this section. (Moore, A.W., 2001)

Figure 5. An easy example using SVM. The plot shows the income vs. expenses of different people. The deep blue

line (hyperplane) separates the thrifty people from the wasteful ones, while the cyan interval is the margin where the

people are considered neither thrifty nor wasteful.
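The geometry of the bank example can be sketched in a few lines. This is not a trained SVM: the separating line is fixed by hand to the identity function (income = expenses), and the `margin` value and the function name are hypothetical choices made only for illustration. A real SVM would learn both the line and the margin from labelled clients.

```python
def classify_client(income, expenses, margin=100.0):
    """Classify a client from average monthly income and expenses.

    The separating line is income = expenses; `margin` is an assumed
    half-width of the undecided band around that line.
    """
    balance = income - expenses  # signed offset from the separating line
    if balance > margin:
        return "thrifty"
    if balance < -margin:
        return "wasteful"
    return "undecided"


if __name__ == "__main__":
    for income, expenses in [(2000, 1500), (1500, 2000), (1500, 1450)]:
        print(income, expenses, classify_client(income, expenses))
```

Widening or narrowing `margin` corresponds to changing the cyan interval in Figure 5.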

Applying SVM to our problem

The problem to solve in this project can be seen as a classification one: classify the quality of the

model into a particular grade. For example, ten different grades (integers from 0 to 9) can be defined

and the problem to solve is to classify the quality of a given model into one of these grades.

Obviously a multi-class SVM is needed for this purpose. The system can be refined

by giving more intervals to the system. The more intervals defined, the more precise the assessment of the

model. Realistically, there may be a limit to the number of intervals, which is reached when the SVM

is unable to classify the data. One of the other aims of the project is to discover how “fine grained”

we can make the classification grades for separating very similar high quality models from one

another.

KERNEL FUNCTION

The SVM method requires a function that evaluates each sample in order to classify them in one of

the available classes; this is called the Kernel Function. The search for an optimal function for the

SVM is an ongoing subject of research. Nowadays, the most used functions are:

Linear Function:

$K(x_i, x_j) = x_i^T x_j$

Polynomial Function:

$K(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \quad \gamma > 0.$

Radial Basis Function (RBF):

$K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0.$

Sigmoid:

$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r).$

where $x_i$, $x_j$ are training vectors and $\gamma$, $r$ and $d$ are kernel parameters. (Hsu, C.W., Chang, C.C. and

Lin, C.J., 2003)
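The four kernels listed above can be written directly as small functions. The sketch below is a plain-Python illustration, not the LIBSVM implementation; the function names and the default parameter values are assumptions chosen for readability.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    # (gamma * x^T y + r)^d, with gamma > 0
    return (gamma * dot(x, y) + r) ** d


def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * ||x - y||^2), with gamma > 0
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    # tanh(gamma * x^T y + r)
    return math.tanh(gamma * dot(x, y) + r)


if __name__ == "__main__":
    a, b = [1.0, 2.0], [3.0, 4.0]
    print(linear_kernel(a, b), polynomial_kernel(a, b, r=1.0, d=2))
    print(rbf_kernel(a, b, gamma=0.5), sigmoid_kernel(a, b, gamma=0.1))
```

Note that the RBF kernel depends only on the distance between the two vectors, so identical vectors always give a value of 1.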

The RBF function is the most popular kernel because it can classify the data even if the

samples have a non-linear relationship between them. To achieve that, it maps the data into a higher

dimensional space (known as the Kernel Trick). Using the Kernel Trick, some data can be split into the

correct subsets, which would not be possible in a low dimensional space. (Hsu et al., 2003)

Figure 6. Kernel Trick. This graphic shows how the data can be classified by mapping it in a higher dimensional space

using the kernel trick. Source: http://www.dtreg.com/svm.htm
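A small worked example may make the Kernel Trick concrete. The feature space of the RBF kernel is infinite-dimensional and cannot be written out, so this sketch uses the degree-2 polynomial kernel $K(x, y) = (x^T y)^2$ in two dimensions instead, whose feature map is known in closed form. It checks that the kernel value equals an inner product in the mapped 3-dimensional space, and that XOR-like points, which no straight line separates in 2D, are separated by a single coordinate after mapping. All names are illustrative.

```python
import math


def phi(x):
    """Explicit feature map of K(x, y) = (x . y)^2 for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Same quantity computed without ever building phi(x)."""
    return dot(x, y) ** 2


if __name__ == "__main__":
    x, y = (1.0, 2.0), (3.0, -1.0)
    print(poly2_kernel(x, y), dot(phi(x), phi(y)))

    # XOR-like data: the sign of the middle mapped coordinate gives the class.
    for point in [(1, 1), (-1, -1), (1, -1), (-1, 1)]:
        print(point, "positive" if phi(point)[1] > 0 else "negative")
```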

Furthermore, the documentation consulted indicates that the Linear Function is a special case of RBF and that the sigmoid

behaves like RBF for certain parameters. (Hsu et al., 2003)

Moreover, RBF has fewer hyperparameters to tune (compared with the Polynomial

Function, for example), which makes the parameter search easier.

In this project we have chosen to use the RBF kernel as it is expected that there is a non-linear

relationship between the different types of input data and thus it appears to be suitable for the

problem. (Hsu et al., 2003)

It is possible that alternative kernels may behave better than RBF for this particular problem, but as

the time for this project is limited, exploring them is left as a future extension of this work.

SVM MULTI-CLASS

Another decision we had to make in a more advanced stage of the project was the selection of the

multi-class system that SVM uses to classify the data into several groups, instead of the typical

behaviour of SVM where the data is split into two groups. There are two main approaches for

dealing with this problem: using Binary Classifiers and using All the Data at Once. (Hsu, C.W. and

Lin, C.J., 2002)

The most used methods for these approaches are explained below:

SVM multi-class using Binary Classifiers

There are several methods for multi-class classification using binary classifiers. A brief overview

of the most widely used ones is given in this section.

One against all

This method consists of using as many Support Vectors (SV; here, binary SVM classifiers) as there are classes. Every SV then decides

whether a sample belongs to a certain class or not. Given a sample to classify, it is checked

against each SV in turn until the appropriate class is found. (Hsu et al., 2002)

Example:

Let's suppose the data has to be split into 4 classes: A, B, C and D. Our system will have 4 SV.

Once the system is trained and given a certain sample, the class that this sample belongs to

can be predicted. First of all, the system uses one SV to decide if the sample belongs to

A or not. Then, if the sample does not belong to A, the system uses another SV to check if the

sample belongs to B or not. This process continues until the sample is allocated to one of these groups.
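The sequential procedure in this example can be sketched as follows. Each trained binary classifier is replaced by a hypothetical stand-in predicate, so the thresholds and names below are assumptions made for illustration only. A common variant of One Against All instead evaluates every classifier and picks the class with the largest decision value; that variant is not shown here.

```python
def one_against_all(sample, classifiers):
    """Return the first class whose binary classifier accepts the sample.

    `classifiers` is an ordered list of (class_label, predicate) pairs;
    each predicate stands in for one trained "class versus rest" SVM.
    """
    for label, belongs_to_class in classifiers:
        if belongs_to_class(sample):
            return label
    return None  # no classifier claimed the sample


# Hypothetical stand-ins for four trained classifiers on a 1-D score.
CLASSIFIERS = [
    ("A", lambda s: s <= 0.25),
    ("B", lambda s: 0.25 < s <= 0.5),
    ("C", lambda s: 0.5 < s <= 0.75),
    ("D", lambda s: s > 0.75),
]

if __name__ == "__main__":
    for score in (0.1, 0.4, 0.9):
        print(score, one_against_all(score, CLASSIFIERS))
```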

One against one

This strategy consists of creating k(k-1)/2 Support Vectors (k being the number of classes needed).

Each SV splits the data into two classes. Hence, there is one SV for each pair of classes.

Given a sample, it is first evaluated by one SV. This SV determines whether the sample belongs to class

A or class B, incrementing a counter by one for the chosen class.

This process is repeated for all the SV, each of which distinguishes between a different pair of

classes.

At the end, the class that has the most votes is taken as the class of the sample. (Hsu et al., 2002)

Example:

Let's suppose the data has to be split into 4 classes: A, B, C and D. Our system will have 6 SV.

Once the system is trained and given a certain sample, the class to which this sample belongs

can be predicted. In the first step, the system uses one SV to check whether the sample belongs to

class A or B, incrementing by one the counter of the class the system has determined. In the second

step, the system checks if the data belongs to A or C. The counter of the selected class is

incremented by one. This process goes on until all the SVs are checked.

Here is the table with possible results. The first row shows the discrimination the SV has to check

and the second row shows what the decision of the current SV is.

Comparison: A vs B | A vs C | A vs D | B vs C | B vs D | C vs D

Decision: B | A | A | B | B | D

Table I. One Against One example. Decision made by the SVM using OAO multi-class approach.

After the process, class A obtains 2 votes, class B 3 votes, class C 0 votes and class D 1 vote. In this

case, the system determines the sample belongs to class B.
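The voting in this example can be reproduced with a short sketch. The pairwise decisions are hard-coded from Table I and stand in for the six trained classifiers; the function and variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def one_against_one(classes, decide):
    """Run every pairwise comparison and return (winner, vote counts).

    `decide(a, b)` stands in for the trained classifier of the pair (a, b)
    and returns whichever of the two classes it chooses.
    """
    votes = Counter({label: 0 for label in classes})
    for a, b in combinations(classes, 2):
        votes[decide(a, b)] += 1
    winner = max(classes, key=lambda label: votes[label])
    return winner, dict(votes)


# Decisions taken from Table I.
TABLE_I = {
    ("A", "B"): "B", ("A", "C"): "A", ("A", "D"): "A",
    ("B", "C"): "B", ("B", "D"): "B", ("C", "D"): "D",
}

if __name__ == "__main__":
    winner, votes = one_against_one(["A", "B", "C", "D"],
                                    lambda a, b: TABLE_I[(a, b)])
    print(winner, votes)
```

For k = 4 classes the loop runs k(k-1)/2 = 6 comparisons, one per column of Table I.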

Directed Acyclic Graph Support Vector Machine (DAGSVM)

This method uses the same technique as the One Against One method regarding Support Vectors,

but this one uses a rooted binary directed acyclic graph, which has k(k-1)/2 nodes and k leaves, to

guide the search. Nodes represent the SV and leaves are the classes to sort by. Given a sample, it

starts from the root checking if it belongs to class A or B. Depending on the result it follows one

branch or another in the decision tree and the process is repeated in the next node. This goes on

until a leaf is reached, which is the class that the sample belongs to. (Hsu et al., 2002)

Example:

Let's suppose the same example as in the previous case with the same sample. The graph below

shows the path the algorithm will follow throughout the tree to determine the class the sample

belongs to.

Figure 7. Decision tree for DAGSVM method used to solve the example.
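The traversal can be sketched with the same pairwise decisions as in Table I. This sketch assumes the common list-elimination formulation of DAGSVM, in which each node compares the first and last remaining classes and discards the loser; the order of comparisons drawn in Figure 7 may differ. All names are illustrative.

```python
def dagsvm(classes, decide):
    """Walk the decision graph and return (predicted class, nodes visited).

    `decide(a, b)` stands in for the trained classifier of the pair (a, b),
    with a appearing before b in `classes`.
    """
    candidates = list(classes)
    evaluations = 0
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        evaluations += 1
        if decide(first, last) == first:
            candidates.pop()     # the last class is eliminated
        else:
            candidates.pop(0)    # the first class is eliminated
    return candidates[0], evaluations


# Decisions taken from Table I.
TABLE_I = {
    ("A", "B"): "B", ("A", "C"): "A", ("A", "D"): "A",
    ("B", "C"): "B", ("B", "D"): "B", ("C", "D"): "D",
}

if __name__ == "__main__":
    print(dagsvm(["A", "B", "C", "D"], lambda a, b: TABLE_I[(a, b)]))
```

Only k-1 classifiers are evaluated per sample, compared with k(k-1)/2 for One Against One, although the same number of classifiers has to be trained.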

SVM multi-class using all data at once

The mathematical theory behind this method is quite complicated and the results of using it were

very poor, therefore only a brief explanation is provided:

The idea of this method is the same as One Against All (OAA) method explained above, but differs

from it in the number of Support Vectors. While OAA uses one SV for each comparison, All Data

at Once (ADO) formulates all the decision functions in just one function. (Hsu et al., 2002)

A comparison of these methods can be found in the multi-class SVM paper (Hsu et al., 2002). The study concludes that binary

classifiers are more suitable for large data sets than ADO, so a binary classifier has been

implemented in the LIBSVM (Chang et al., 2001) library to solve the multi-class problem,

specifically the One Against One (OAO) method because the results are similar to other binary methods and the

training time is shorter. (Hsu et al., 2002)

Another problem regarding multi-class classification is the hyperparameter selection, since it would

be computationally very expensive to search for different parameters for each Support Vector.

Hence, all the SVMs share the same parameters. (Chang, C.C. and Lin, C.J., 2001)

TECHNICAL REQUIREMENTS

Another important step in the preliminary system design was choosing the platform on which the

MQAP had to run, the programming language in which to develop the program and the library that

supplies SVM methods. The decisions made are explained here.

Software was developed using Sun Java Technology (Sun Microsystems, Inc., 1994) (Java Servlets

for the web server) on a Linux machine.

The reason for using Java is simple: it is an established language, very widely used by the

computer science community, which offers a vast quantity of well documented libraries that make

this language very powerful, comfortable and easy to use. Furthermore, it is distributed for free and

can be run on Linux machines.

Linux machines were used for this project because most of the research software used in it is

developed on this platform and the interaction between applications is easier if they run on the same

platform.

Moreover, as part of the software is a server, Linux is widely regarded as a suitable Operating

System for this purpose, and Apache (The Apache Software Foundation, 1999) is the most widely

used web server.

LIBSVM has been chosen as the library that implements the logic of the SVM method. There is no

special reason for selecting this one instead of another, but some features make it a good candidate:

it has a Java interface, runs on Linux, is in constant evolution and has good documentation.

IMPLEMENTATION STRATEGY

In this section, all the processes carried out in this project are outlined, starting with the plan,

continuing with the integration of the system and finishing with the testing.

Planning the project

The first task in the early stage of the project was to analyse and plan its structure, identifying

all the milestones and breaking them down into a list of tasks. Time and resources were allocated

to each task using the Gantt Project Software (GanttProject.org, 2003). This software

allowed us to draw a Gantt chart (Figure 8). Although the plan was studied meticulously,

re-planning was needed in order to deal with the problems that arose in the optimising step. Because of that,

the outline followed and explained in this chapter corresponds to the revised version of the

preliminary Gantt chart.

Figure 8. Preliminary Gantt chart.

Background study

After the planning, a background study was carried out. Specifically, concepts were understood by

reading articles from the internet and speaking with experts in these fields, such as biologists and

geneticists. (See references section for more details).

SVM study

Then we studied SVM methods. This was approached by reading the LIBSVM (Chang et al., 2001)

documentation and papers (Hsu et al., 2003 and Chang et al., 2001) as well as experimenting with

the software. Several data sets from other projects (Chang et al., 2001) were tested in order to

understand the functionality of this library.

First approach

Afterwards, a Java (Sun Microsystems, Inc., 1994) mapper was developed in order to transform

data from the ModFOLD software into the input format required by LIBSVM functions. Moreover, the data needed to

be rescaled in order to achieve better results.
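The text does not say how the rescaling was done. One common choice, assumed here purely for illustration, is linear (min-max) scaling of every feature to a fixed interval such as [-1, 1], with the ranges learned on the training data and then reused for the testing data. The function names and the handling of constant features are hypothetical.

```python
def fit_scaler(rows):
    """Record the (min, max) of every feature column in the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def scale(row, ranges, lower=-1.0, upper=1.0):
    """Map each feature linearly from its training range to [lower, upper]."""
    scaled = []
    for value, (low, high) in zip(row, ranges):
        if high == low:
            # Constant feature: carries no information (assumed handling).
            scaled.append(lower)
        else:
            scaled.append(lower + (upper - lower) * (value - low) / (high - low))
    return scaled


if __name__ == "__main__":
    training = [[0.0, 2.0], [1.0, 4.0]]
    ranges = fit_scaler(training)
    print(scale([0.5, 3.0], ranges))
```

Reusing the training ranges for the testing set keeps both sets on the same scale.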

The core of the project began at this point. The first objective to tackle was to develop a

system that permits classification of our data into two basic subsets: good models against bad

models (defined below) in order to see the general SVM behaviour.

We had a file with training set results from the ModFOLD project. For each sample, we had

information concerning the protein model, as shown in the following example:

0.7357142567634583 0.8888083166000735 0.494 0.4807000160217285 0.9072

Each one of the first four columns represents information concerning the predicted quality of a

model using each individual MQAP on a scale ranging between 0 and 1 (0 – bad quality, 1 – good

quality).

The last column contains the observed model quality (OMQ). The OMQ is a measure of the

accuracy of a model given by the TM-score (Zhang, Y. and Skolnick, J., 2004), which

quantifies the structural similarity between the predicted model of the protein target and the solved

structure.

Two classes, depending on the OMQ, were specified: If the value was less than or equal to 0.5, the

model was set into the 'bad models' class whilst if the OMQ was greater than 0.5, the model was set

into the 'good models' class.
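The mapping from one ModFOLD training line to a labelled sample can be sketched as follows. The project's mapper was written in Java and is not shown in the text, so this is only an illustration in Python. The `<label> <index>:<value>` layout is LIBSVM's sparse text input format; the numeric labels +1 (good) and -1 (bad) and the function name are assumptions.

```python
def to_libsvm_line(raw_line, threshold=0.5):
    """Convert 'score1 score2 score3 score4 OMQ' into a LIBSVM input line.

    The four MQAP scores become features 1 to 4; the OMQ in the last column
    is turned into a binary label using the 0.5 threshold from the text.
    """
    *scores, omq = [float(value) for value in raw_line.split()]
    label = 1 if omq > threshold else -1   # assumed label encoding
    features = " ".join(f"{index}:{score}"
                        for index, score in enumerate(scores, start=1))
    return f"{label} {features}"


if __name__ == "__main__":
    sample = ("0.7357142567634583 0.8888083166000735 0.494 "
              "0.4807000160217285 0.9072")
    print(to_libsvm_line(sample))
```

The sample line from the text has an OMQ of 0.9072, so it falls into the 'good models' class.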

The SVM was trained using half of our data (30,000 samples) and the remaining half became the

testing set. The SVM was used to classify the testing set into these two classes, and

the results were compared against the OMQ to measure the accuracy of the method.

The decision (explained in the Preliminary System Design section) about choosing the best kernel

for our problem was made at this point.

Choosing parameters

Once a Kernel Function had been selected, it was time to choose the appropriate values for the

kernel parameters. In our case, we needed to find the best values for the γ and C parameters to suit

the RBF.

The first way to find these parameters was to use a subset of the data and perform a Grid Search

using the tool supplied by LIBSVM.

After carrying out several tests, we concluded the best parameters were C=8, γ=2; C=32, γ=0.5;

C=2048, γ=0.5.

These values provide an 83% average accuracy (accuracy was defined as the proportion of correct classifications

according to the OMQ) for binary classification (i.e. separating good models from bad ones).

This would be a good result provided the accuracy did not fall in multi-class classification.

Unfortunately, the accuracy did fall, and we needed to use different techniques for parameter

searching.
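The idea behind a grid search can be sketched as follows. This is not the LIBSVM tool itself: the scoring callback `evaluate` is a hypothetical stand-in for "train with (C, γ) and return the cross-validation accuracy", and the exponentially spaced grid follows the ranges suggested in the guide by Hsu et al. (2003), which is an assumption about the grid actually used in this project.

```python
import math


def grid_search(evaluate, c_exponents=range(-5, 16, 2),
                gamma_exponents=range(-15, 4, 2)):
    """Try every (C, gamma) = (2^a, 2^b) pair and keep the best score."""
    best = None
    for c_exp in c_exponents:
        for gamma_exp in gamma_exponents:
            c, gamma = 2.0 ** c_exp, 2.0 ** gamma_exp
            score = evaluate(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Toy stand-in for cross-validation accuracy, peaking at C=8, gamma=2."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) - 1)


if __name__ == "__main__":
    score, c, gamma = grid_search(toy_score)
    print(c, gamma)
```

All the parameter pairs reported above (C=8, γ=2; C=32, γ=0.5; C=2048, γ=0.5) are powers of two that lie on this grid.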

After reading documentation about hyperparameter tuning, software that deals with hyperparameter

search was downloaded from the Mathematics Department of King's College, University of London

webpage. The software is called Bayesian Support Vector Machine Hyperparameter Tuning

(BSVMHT) (Gold, C. and Sollich, P., 2005).

The best hyperparameter values this software found were C=10, γ=2.303 but the average accuracy

was the same as using the other values found by Grid Search.

A Java Data Mapper had to be developed in order to convert the data into a valid input for this

software.

First multi-class approach

After completing the classification between good and bad models using the RBF kernel and

choosing the best parameter values for this concrete problem, a new step forward in the project was

needed. The data not only had to be classified into two broad classes (good and bad), but into as many

sub-classes as possible. The more classes the data was split into, the more useful the results would be (as

it is important to distinguish the finer differences between the most accurate models rather than just

separate the best from the worst).

The first approach for multi-class classification was trying to sort the data into 4 subsets. The SVM

worked well for binary classification, but the effectiveness for non-binary classification was

unclear.

Documentation was read (Hsu et al., 2002) and two main ways to tackle the problem were found:

The first one consisted of using several binary classifiers while the second one consisted of

considering all the data in one formulation. Finding out which strategy would be better was our next

research goal. The decision made is explained in the Preliminary System Design section.

Once the multi-class method was chosen, it was time to test the data against the system and check

the results.

Testing the first multi-class approach

The first step was to test the data by classifying it into 4 classes depending on the OMQ (<=0.25:

Very Bad Models; >0.25 & <=0.5: Bad Models; >0.5 & <=0.75: Good Models; >0.75 & <=1: Very

Good Models). The training and testing procedures were the same as for the binary classification

explained above. It was observed that the parameters that fit best were the same as for binary

classification, but the accuracy fell to around 50%.

Then, it was decided to classify the data into 10 classes in order to see if the accuracy continued

to fall. Once more, the best values for the hyperparameters were the same as in the other tests and

indeed, the accuracy fell to 37%.
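The assignment of an OMQ value to a class can be sketched with a single function. It assumes equal-width intervals that are closed on the right, as in the 4-class definition above (<=0.25, >0.25 and <=0.5, and so on); that the 10-class test used the same scheme is an assumption, and the function name is illustrative.

```python
import math


def omq_to_class(omq, n_classes):
    """Return the index (0 .. n_classes-1) of the interval containing omq.

    Class i covers the interval (i / n_classes, (i + 1) / n_classes];
    an OMQ of exactly 0 is placed in class 0. For some values of n_classes,
    floating point rounding may shift a sample lying exactly on a boundary.
    """
    if not 0.0 <= omq <= 1.0:
        raise ValueError("OMQ must lie in [0, 1]")
    return max(0, math.ceil(omq * n_classes) - 1)


if __name__ == "__main__":
    for value in (0.0, 0.25, 0.26, 0.5, 0.9072, 1.0):
        print(value, omq_to_class(value, 4))
```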

Given these results, a deviation from the Preliminary Gantt chart was needed in order to investigate

the problem and the behaviour of the SVM. With that knowledge, it should be easier to guide the SVM

towards our purpose.

10 class test

First of all, the data from a 10-class classification was plotted on a 2D graph (OMQ values on the

x-axis and predicted values on the y-axis) and a table with the same information was created:

Predicted

Real 0 1 2 3 4 5 6 7 8 9

0 0 896 7 4 11 40 8 1 0 5

1 0 7516 30 16 70 311 9 9 0 27

2 0 3744 73 28 75 347 32 16 0 78

3 0 2254 50 39 106 462 65 24 0 91

4 0 1675 24 25 133 743 144 38 0 174

5 0 1163 22 16 106 906 299 99 0 262

6 0 612 15 11 68 793 364 283 0 400

7 0 230 13 10 31 358 306 395 0 655

8 0 79 3 5 8 79 96 190 0 655

9 0 430 0 1 22 213 133 207 0 2054

Table II. Predicted class against Real class table for a 10-class classification. Each cell shows the number of

samples that were classified as 'Predicted' when they belonged, in fact, to the class 'Real'. It can be seen that the SVM

works well for binary classification but not for multi-class classification.
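The overall accuracy can be recomputed from Table II as the share of samples on the diagonal, i.e. those whose predicted class equals their real class. The sketch below copies the counts exactly as printed in the table (rows are real classes, columns are predicted classes); the function name is illustrative.

```python
# Counts copied from Table II: TABLE_II[real][predicted].
TABLE_II = [
    [0, 896, 7, 4, 11, 40, 8, 1, 0, 5],
    [0, 7516, 30, 16, 70, 311, 9, 9, 0, 27],
    [0, 3744, 73, 28, 75, 347, 32, 16, 0, 78],
    [0, 2254, 50, 39, 106, 462, 65, 24, 0, 91],
    [0, 1675, 24, 25, 133, 743, 144, 38, 0, 174],
    [0, 1163, 22, 16, 106, 906, 299, 99, 0, 262],
    [0, 612, 15, 11, 68, 793, 364, 283, 0, 400],
    [0, 230, 13, 10, 31, 358, 306, 395, 0, 655],
    [0, 79, 3, 5, 8, 79, 96, 190, 0, 655],
    [0, 430, 0, 1, 22, 213, 133, 207, 0, 2054],
]


def accuracy(confusion):
    """Fraction of samples whose predicted class equals the real class."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


if __name__ == "__main__":
    print(f"{accuracy(TABLE_II):.1%}")
```

With the counts as printed, the result is close to the 37% accuracy reported for the 10-class test.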

20 class test

The next step was to classify the data into 20 classes in order to obtain finer-grained results and to plot

them using the same technique as with 10 classes.

Pred

Real 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

0 0 0 65 155 0 0 0 1 5 0 5 11 5 0 0 0 0 0 0 4

1 0 0 418 278 0 0 0 0 0 0 2 8 5 0 2 0 0 0 0 8

2 0 0 1830 2025 0 0 0 2 6 0 20 30 5 0 0 0 0 0 0 34

3 0 0 518 3289 0 0 0 6 13 0 54 92 7 0 2 1 0 0 0 54

4 0 0 213 2110 3 0 0 8 7 0 41 83 16 0 3 0 0 0 0 78

5 0 0 122 1449 3 0 0 8 7 1 42 65 21 1 4 2 0 0 0 106

6 0 0 114 1213 2 0 0 10 15 0 40 94 42 0 10 4 0 0 0 103

7 0 0 76 999 1 0 0 11 11 0 71 95 48 4 4 6 0 0 0 118

8 0 0 74 842 1 0 0 11 18 0 84 158 85 5 10 5 0 0 0 164

9 0 0 68 821 0 0 0 8 13 0 114 171 74 5 16 0 0 0 0 209

10 0 0 47 644 1 0 0 7 8 0 107 211 110 9 22 10 0 0 0 273

11 0 0 43 573 0 0 0 3 8 0 80 241 116 8 23 14 0 0 0 315

12 0 0 23 396 0 0 0 5 9 0 87 233 108 14 35 15 0 0 0 398

13 0 0 17 260 1 0 0 6 3 0 58 178 108 17 37 22 0 0 0 516

14 0 0 9 147 2 0 0 1 2 0 32 146 79 13 33 23 0 0 0 556

15 0 0 9 98 3 0 0 1 3 0 19 79 61 16 27 15 0 0 0 624

16 0 0 7 42 1 0 0 1 2 0 8 40 25 6 13 12 0 0 0 538

17 0 0 3 21 0 0 0 0 0 0 2 13 16 2 9 7 0 0 0 347

18 0 0 2 9 0 0 0 0 0 0 0 5 2 1 1 0 0 0 0 159

19 0 0 141 272 0 0 0 3 1 0 35 79 38 15 16 9 0 0 0 2272

Table III. Predicted class against Real class table for a 20 classes classification. Each cell shows the number of

samples that were classified as 'Predicted' when they belonged, in fact, to the class 'Real'. It can be seen SVM works

well for binary classification but not for multi-classification.

Then, it was decided to plot the same data (20 classes), but using real values for the OMQ instead of

discrete ones because the plots above were not clear to interpret.

Figure 9. Distribution of a 20-class classification. The X-axis contains the real value of the OMQ. The Y-axis

contains the classification our system predicts using the multi-class SVM. A roughly linear distribution can be seen: Most of

the lower classes are classified as lower ones, most of the middle classes are classified as middle ones and most of the

top classes are classified as top ones. Apart from these three main groups, the method does not seem to be very accurate

for each of the separate classes.

After observing the results above it was decided to plot a histogram for each OMQ value in order to

see the distribution of the predicted values for a given OMQ. Some of these histograms are shown

below.

Figure 10. Distribution for class 0. The X-axis is the class number, while the Y-axis is the number of samples. Notice most

of the data is allocated in one of the lower classes.

Figure 11. Distribution for class 12. The X-axis is the class number, while the Y-axis is the number of samples. Notice the

data is split equally in two main groups (class 3 and class 19). There is some data classified in subset 11 and

surroundings.

Figure 12. Distribution for class 19. The X-axis is the class number, while the Y-axis is the number of samples. Notice

most of the data is allocated in one of the higher classes.

After this analysis, the conclusion was that SVM is very good for binary classification of protein

structure models, but the multi-class SVM does not suit our problem very well.

BST approach

Therefore, we decided to develop a binary decision tree which helps the SVM to classify the data

more precisely. The functionality of this tree is explained below.

The height of the tree depends on the precision required, and the number of leaves equals the number of classes to sort into.

Every decision (non-leaf) node in the tree represents an SVM, which is trained following the normal procedure.

Given a sample, this is checked in the root node of the tree. This node determines whether the

sample belongs to those classes where the OMQ is less than or equal to 0.5 or on the contrary, those

where the OMQ is greater than 0.5.

In this example, it is assumed that the first node has determined that the sample belongs to the first group (<= 0.5). In the next node, the SVM determines whether the sample belongs to the classes where the OMQ is less than or equal to 0.25 or to those where the OMQ is greater than 0.25 and, of course, less than or equal to 0.5.

This procedure keeps going on until a leaf is reached. This leaf determines the class that the sample

belongs to.

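[Tree diagram. Root: OMQ <= 0.5 (left) or > 0.5 (right). Left child: <= 0.25 or > 0.25. Right child: <= 0.75 or > 0.75. Leaves, from left to right: classes 0, 1, 2 and 3.]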

Figure 13. Three levels Complete Binary Decision Tree. This decision tree shows the procedure to find the class of a

given sample. Beginning from the root, in each node it is decided which is the next node to follow depending on the

OMQ. Each one of the leaf nodes is a class.
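The descent described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the names DecisionTreeSketch, TreeNode and goesRight are hypothetical and do not come from the original program, and each node's trained SVM is replaced by a simple predicate (here, a toy rule that compares the mean of the input scores with the node threshold).

```java
import java.util.function.Predicate;

/** Sketch of descending a binary decision tree whose decision nodes hold a binary classifier. */
final class DecisionTreeSketch {

    /** A tree node: either a leaf carrying a class id, or a decision node with two children. */
    static final class TreeNode {
        final double threshold;               // OMQ cut-off this node decides on (NaN for a leaf)
        final int classId;                    // class id for a leaf, -1 otherwise
        final Predicate<double[]> goesRight;  // hypothetical stand-in for the node's trained SVM
        final TreeNode left, right;

        TreeNode(int classId) {               // leaf
            this(Double.NaN, classId, null, null, null);
        }

        TreeNode(double threshold, Predicate<double[]> goesRight, TreeNode left, TreeNode right) {
            this(threshold, -1, goesRight, left, right);
        }

        private TreeNode(double t, int c, Predicate<double[]> p, TreeNode l, TreeNode r) {
            threshold = t; classId = c; goesRight = p; left = l; right = r;
        }

        boolean isLeaf() { return left == null && right == null; }
    }

    /** Follows the decisions from the root until a leaf is reached and returns its class. */
    static int classify(TreeNode root, double[] scores) {
        TreeNode node = root;
        while (!node.isLeaf()) {
            node = node.goesRight.test(scores) ? node.right : node.left;
        }
        return node.classId;
    }

    /** Builds the four-class tree of Figure 13, using the mean score as a toy decider. */
    static TreeNode fourClassTree() {
        TreeNode low = new TreeNode(0.25, s -> mean(s) > 0.25, new TreeNode(0), new TreeNode(1));
        TreeNode high = new TreeNode(0.75, s -> mean(s) > 0.75, new TreeNode(2), new TreeNode(3));
        return new TreeNode(0.5, s -> mean(s) > 0.5, low, high);
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }
}
```

In a real run, each predicate would wrap the prediction of the SVM trained for that node, so a wrong decision near the root cannot be corrected further down the tree.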

With this idea in mind, we developed a program that combines the SVM with the BST structure. Details about the program can be found in the Detailed Software Design chapter.

4 class BST

The first test of this technique classified the data into 4 subsets in order to check whether the results were better than those of the raw multisvm (Figure 13). In this early test, all the nodes in the BST shared the same parameters, which made training faster than training each node independently. The problem we found was the difficulty of working with our large amount of data (30,000 samples), since the Java Virtual Machine (Sun Microsystems, Inc., 1994) did not have enough memory to hold these complex data structures. We tried increasing the JVM heap size, but even then the machine running the program did not have enough memory for the task.

At that point, another decision had to be made. On the one hand, we could store part of the data on disk in order to work with the whole training set. On the other hand, we could train the SVM on just a subset of it (specifically 5,000 samples). We chose the second option, because the process is time-consuming enough when working in memory alone; had we also worked from disk, the running time would have been prohibitive.
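How the 5,000 samples were chosen is not described here. One common way to draw such a subset, shown purely as an assumption, is a seeded shuffle followed by truncation; the class name Subsample and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of drawing a fixed-size random subset of the training samples. */
final class Subsample {

    /** Returns up to n samples chosen uniformly at random; the input list is left untouched. */
    static <T> List<T> draw(List<T> samples, int n, long seed) {
        List<T> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));   // fixed seed keeps runs repeatable
        return new ArrayList<>(shuffled.subList(0, Math.min(n, shuffled.size())));
    }
}
```

Calling draw(allSamples, 5000, seed) on the 30,000 training rows would give a subset of the size used in this test.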

The results we achieved using this technique were slightly better than using the multi-class SVM.

The accuracy for the new method was 58% compared with the old method whose accuracy was

50%. We were on the right path, but our system still needed more refining.

8 class BST

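[Tree diagram. Root: OMQ <= 0.5 or > 0.5. Second level: <= 0.25 or > 0.25 (left); <= 0.75 or > 0.75 (right). Third level, from left to right: thresholds 0.125, 0.375, 0.625 and 0.875. Leaves, from left to right: classes 0 to 7.]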

Figure 14. Four levels Complete Binary Decision Tree. This decision tree shows the procedure to find the class of a

given sample. Beginning from the root, in each node it is decided which is the next node to follow depending on the

OMQ. Each one of the leaf nodes is a class.

The next step was a similar test that split the data into 8 classes instead (Figure 14). In this case, the accuracy fell to 42%. We knew from the multi-class SVM that training all the SVMs with the same parameters achieved the same results as using different parameters for each SVM. However, because the method we were using differed somewhat from multisvm, we decided to try different parameters for each SVM, which means training the SVM in each node independently. It was clear that this strategy would not be practical for a deep tree, but with only 8 classes it was an option worth trying.

The results were similar to those obtained with the same parameters for all the SVMs, but the training time was much longer, so we rejected this option.

First approach for high quality models

The following step in our research was to focus only on the high quality models. In other words, at each decision node we continued refining the quality only along the right branch (the higher-quality one), in order to be more accurate for high quality models than for low quality ones. Since it is relatively easy to separate the good models from the bad ones using MQAPs, the difficulty lies in identifying the best model when their quality is high.

A 6-class structure like the one shown in Figure 15 was developed and tested. The results were not good enough (Table IV), which forced us to continue our research.

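[Tree diagram. Each decision node sends the samples at or below its threshold to a leaf and passes the rest to the next node: <= 0.5 gives class 0; <= 0.75 gives class 1; <= 0.875 gives class 2; <= 0.9375 gives class 3; <= 0.96875 gives class 4; > 0.96875 gives class 5.]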

Figure 15. Six Levels Unbalanced Binary Search Tree. Starting from the root, a decision is made in each node until a leaf is reached. This unbalanced structure makes it possible to focus on the High Quality Models rather than the Low Quality ones.

Class   Belonging   Predicted
0       19401       20500
1       6464        6280
2       1888        952
3       329         0
4       82          0
5       2831        3261

Table IV. Results for Six Levels Unbalanced Binary Search Tree. A testing set containing 30933 samples was used in this test. Although the test should have counted the “well predicted samples” for each class instead of “all the predicted samples” for a class, the predictions for classes 3 and 4 already show the poor behaviour of the system.


Focusing on very high quality models (above 0.875)

Afterwards, we focused on more specific high quality models. We selected the models whose observed quality (OMQ) was greater than 0.875 for both the training and testing steps, and developed a 3-class structure such as the one in Figure 16. The results were very bad: the system assigned almost every sample to class 2 and was unable to recognise classes 0 and 1 (Table V).

[Tree diagram. Root: <= 0.9375 gives class 0; > 0.9375 passes to the next node, where <= 0.96875 gives class 1 and > 0.96875 gives class 2.]

Figure 16. Three Levels Unbalanced Binary Search Tree. Starting from the root, a decision is made in each node until a leaf is reached. This unbalanced structure makes it possible to focus on the High Quality Models rather than the Low Quality ones. This technique differs from that of Figure 15 in the prediction stage: here the models whose quality is better than 0.875 are selected according to their OMQ, while in Figure 15 they are selected after several prediction steps.

Class   Belonging   Well predicted
0       350         2
1       90          0
2       3089        3086

Table V. Results for Three Levels Unbalanced Binary Search Tree. This test was made using a testing set

containing 3529 samples. The results for class 0 and 1 demonstrate the failure of this approach.

Calculating Sensitivity and Specificity

After these demoralising results, we returned to the 6-class structure and calculated additional measures, such as sensitivity and specificity, hoping to find a clue about how to proceed. Again the results were poor, and nothing in them suggested how to refine the solution (Table VI).

(In a binary classification, sensitivity and specificity are statistical measures which are taken for each one of the two possible results of the classification.

In this case, taking class 0 as a reference:

Sensitivity is the proportion of samples predicted as 0 among all the samples that really belong to class 0 (whether they were predicted as 0 or as 1).

Specificity is the proportion of samples predicted as not 0 among all the samples that really do not belong to class 0.)
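The two definitions can be written as a short Java sketch. The class and method names are illustrative only; taking class 0 as the reference, a "true positive" is a class 0 sample predicted as 0 and a "true negative" is a class 1 sample predicted as 1.

```java
/** Sketch of sensitivity and specificity for one reference class of a binary classifier. */
final class BinaryMetrics {

    /** Sensitivity = TP / (TP + FN): share of the reference class that was recognised. */
    static double sensitivity(long truePositives, long falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    /** Specificity = TN / (TN + FP): share of the other class that was correctly rejected. */
    static double specificity(long trueNegatives, long falsePositives) {
        return (double) trueNegatives / (trueNegatives + falsePositives);
    }
}
```

As a check against Table VII, with class 0 as the reference: 20177 of the 21697 class 0 samples are well predicted (1520 missed) and 6190 of the 9883 class 1 samples are well predicted (3693 missed), so sensitivity(20177, 1520) is about 92.99% and specificity(6190, 3693) is about 62.63%, matching the table.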


Class   Belonging   Well predicted   Sensitivity   Specificity
0       19401       17239            88.86%        71.87%
1       6462        3052             76.92%        71.76%
2       1888        316              30.30%        89.08%
3       329         0                0.00%         100.00%
4       82          0                0.00%         100.00%
5       2831        1587             100.00%       0.00%

Table VI. Sensitivity and Specificity calculations for a Six Levels Unbalanced BST. The same testing set used in Table IV was used in this test, counting the well predicted samples instead of all the predicted samples for a given class and calculating the Sensitivity and Specificity in each node. As can be seen, classes 3, 4 and 5 remain problematic and the statistics do not offer any clue.

Despite these discouraging tests, we decided to carry on with the project and search for alternative ways to tackle the problem. We believed the overall approach was correct and that a suitable solution was close, even though we had not yet found it.

Independent Binary Classification

We decided to remove the clear-cut cases, namely the models with quality = 0 or quality = 1, because these models could distort the SVM training and unbalance the data, which degrades SVM behaviour. Moreover, we decided to carry out strictly binary studies, treating each node as an individual problem.


[Diagram of six independent binary trees, each with leaves 0 (left) and 1 (right): a) <= 0.5 / > 0.5; b) <= 0.75 / > 0.75; c) <= 0.875 / > 0.875; d) <= 0.9375 / > 0.9375; e) <= 0.96875 / > 0.96875; f) <= 0.984375 / > 0.984375.]

Figure 17. ‘Forest’ of BSTs. Each BST solves an independent binary problem. Starting from a), if the system decides

the model quality belongs to class 1, the model is checked in the next BST (i.e. b)). The process finishes when a class 0

is reached, giving this quality threshold to the model.
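The cascade in Figure 17 can be sketched in Java as below. The thresholds are those of the figure; everything else is an assumption made for illustration. In particular, ForestCascade, firstRejectingStage and toyStages are hypothetical names, and each trained SVM is replaced by a predicate that returns true for class 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of the cascade in Figure 17: one independent binary classifier per threshold. */
final class ForestCascade {

    /** Thresholds of classifiers a) to f) in Figure 17. */
    static final double[] THRESHOLDS = {0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375};

    /**
     * Returns the index of the first classifier that answers class 0 (quality <= its threshold),
     * or the number of classifiers if every one of them answers class 1.
     * Each predicate is a hypothetical stand-in for a trained SVM and returns true for class 1.
     */
    static int firstRejectingStage(List<Predicate<double[]>> predictsClassOne, double[] scores) {
        for (int stage = 0; stage < predictsClassOne.size(); stage++) {
            if (!predictsClassOne.get(stage).test(scores)) {
                return stage;
            }
        }
        return predictsClassOne.size();
    }

    /** Toy stages that threshold the first score; real stages would wrap trained SVM models. */
    static List<Predicate<double[]>> toyStages() {
        List<Predicate<double[]>> stages = new ArrayList<>();
        for (double t : THRESHOLDS) {
            stages.add(scores -> scores[0] > t);
        }
        return stages;
    }
}
```

If the returned index is k and k is within range, the model is judged to be at or below THRESHOLDS[k] and, when k > 0, above THRESHOLDS[k - 1].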

The strategy we followed was to start from the root node and continue through the BSTs until the random chance of classifying the data correctly was better than the accuracy achieved by the system.

(Random chance is the probability of making a correct prediction for a certain class by chance, given the class frequency. For example, if there are 70 samples belonging to class 0 and 30 to class 1, the random chance for class 0 is 70% and the random chance for class 1 is 30%.)
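The baseline and the stopping rule can be expressed as a small Java sketch. The names RandomChance, randomChance and shouldStop are illustrative and are not taken from the original program.

```java
/** Sketch of the random-chance baseline and the stopping rule described above. */
final class RandomChance {

    /** Relative frequency of a class: the chance of guessing it correctly by frequency alone. */
    static double randomChance(long samplesInClass, long totalSamples) {
        return (double) samplesInClass / totalSamples;
    }

    /** True when the baseline beats the classifier's accuracy for that class. */
    static boolean shouldStop(double accuracy, double randomChance) {
        return accuracy < randomChance;
    }
}
```

For instance, in Table XII class 0 holds 60 of the 145 samples, so its random chance is roughly 41%, while the system predicts only 11 of those 60 correctly (18.33%); shouldStop returns true at that point.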

Finally, we achieved quite good results with this experiment (Tables VII – XII).

Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       21697       20177            92.99%           68.70%   92.99%        62.63%
1       9883        6190             62.63%           31.30%   62.63%        92.99%

Table VII. Results for ‘Forest’ of BSTs (Figure 17.a).


Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       9254        8541             92.29%           72.69%   92.29%        47.93%
1       3476        1666             47.93%           27.31%   47.93%        92.29%

Table VIII. Results for ‘Forest’ of BSTs (Figure 17.b).

Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       2222        2132             95.95%           77.37%   95.95%        22.92%
1       650         149              22.92%           22.63%   22.92%        95.95%

Table IX. Results for ‘Forest’ of BSTs (Figure 17.c).

Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       366         267              72.95%           58.10%   72.95%        42.42%
1       264         112              42.42%           41.90%   42.42%        72.95%

Table X. Results for ‘Forest’ of BSTs (Figure 17.d).

Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       91          39               42.86%           35.83%   42.86%        66.87%
1       163         109              66.87%           64.17%   66.87%        42.86%

Table XI. Results for ‘Forest’ of BSTs (Figure 17.e).

Class   Belonging   Well Predicted   Well Pred. (%)   Random   Sensitivity   Specificity
0       60          11               18.33%           41.37%   18.33%        76.47%
1       85          65               76.47%           58.62%   76.47%        18.33%

Table XII. Results for ‘Forest’ of BSTs (Figure 17.f). Notice the random chance for class 0 is better than the

prediction made by the system. The algorithm finishes at this point.

Unfortunately, time was running out and we could not refine this solution further. We therefore considered the core refinement process concluded, and the project turned to developing a webpage through which the system could be offered for external use.

Obviously, the point we reached did not yield a system that works as an ideal MQAP (receiving the amino acid sequence in PDB format (Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E., 2000) as input and giving the quality score of the model as output). However, we were able to develop an intermediate quality assessment system which receives the scores of the 4 MQAPs as input and gives the likelihood of different quality scores as output. This system would be very useful for an initial evaluation of predicted models, and it allowed us to develop and test an interface for the program (Figure 19).


Hence, we installed the Apache Tomcat server (The Apache Software Foundation, 1999), which supports Java technology, and developed a basic Servlet to provide a user interface for the system.


DETAILED SOFTWARE DESIGN

1-SCOPE

1.1-Identification

This system is a tool to assess the quality of a predicted protein structure. Specifically, given the scores supplied by ModSSEA, ModCHECK, ProQ-MX and ProQ-LG, the program combines them to evaluate the quality of the protein structure. The results are shown as several quality cut-off points, ranging from 0 to 1, together with the average correctness of the assessment at each one.

1.2-System Overview

The system uses the mathematical classification approach called the Support Vector Machine to tackle the problem. A library written for this purpose (LIBSVM (Chang et al., 2001)) is used as the core of the system, supplying the functions for training and testing (the steps needed in all machine-learning-based systems).

Several ways to interact with the system are provided: A command line interface for the system

administrator for training and testing the system and a web based interface to predict a single

sample.

Moreover, a formatting tool that formats the data from ModFOLD project into suitable data for this

system is also included.

1.3-Document Overview

This document has been written following the Software Design Description defined by IEEE (IEEE

1016-1998).

Section 1 gives a brief overview about the system and this document.

Section 2 explains the main design decisions for the system development. Specifically, Section 2.1 shows the architectural design: it presents the existing classes in the system and the relationships between them. Section 2.2 presents all the components in the system. The flow of information when carrying out a use case is explained in Section 2.3. The design of the user interfaces built for this system, and the ways to interact with them, are presented in Section 2.4.

Section 3 explains all the components in detail: the most important methods for each class are

detailed specifying the parameters and the return values involved in each method.

2-SYSTEM-WIDE DESIGN

2.1-Architectural Design

The system performs multi-class classification using binary SVM classification, guiding the search with a ‘forest’ of Binary Search Trees.

These trees are implemented in the BST class, which is the main class that the system administrator interface interacts with for all requests to the system.

As the results showed that independent data for each classification problem behave better than a single BST, the system uses several BSTs for this purpose.


Node class implements the node structures for building each BST. Each node has a NodeSVM

associated, which contains all the information regarding the related SVM to interact with the

LIBSVM package.

2.2-Components

Class BST

Implements the structure and logic for constructing and using a Binary Search Tree.

Class Node

Implements the structure of a node contained in a tree

Class NodeSVM

Implements the structure needed in each node for constructing a SVM using LIBSVM functions.

Class Fileformat

Contains procedures for dealing with the intermediate data files needed for the correct functionality

of the system.

Class SVMtree

Provides the user interface for the system administrator, creates the software objects by calling the

BST class and trains and tests the system.

LIBSVM package

External library that provides all the methods for working with SVMs.

Class ProteinQualityCalculator

Web based user interface that provides the Protein Quality Structure Prediction of a sample.

Class reformat

This class implements a tool for reformatting the data from ModFOLD program into suitable data

for this system.


Class Node
Attributes:
-mqap : double
-classe : int
-svm : nodeSVM
-left : Node
-right : Node
-c : double
-g : double
Operations:
+Node(in mqap : double, in classe : int, in svm : nodeSVM, in c : double, in g : double)
+getMQAP() : double
+getClasse() : int
+getNodeSVM() : nodeSVM
+getLeftChild() : Node
+getRightChild() : Node
+getC() : double
+getG() : double
+setNodeSVM(in nSVM : nodeSVM)
+setLeftChild(in x : Node)
+setRightChild(in x : Node)
+setC(in c : double, in g : double)

Class nodeSVM
Attributes:
-param : object
-prob : object
-model : object
-input_file_name : string
-model_file_name : string
-error_msg : string
-cross_validation : int
-nr_fold : int
Operations:
+read_problem()
+predict(in linea : object)
+getparam() : object
+getproblem() : object
+getmodel() : object
+getinputfilename() : string
+getmodelfilename() : string
+geterrormsg() : string
+getcrossvalidation() : int
+getnrfold() : int
+setparam(in param : object)
+setproblem(in prob : object)
+setmodel(in model : object)
+setinputfilename(in file : string)
+setmodelfilename(in file : string)
+seterrormsg(in error_msn : string)
+setcrossvalidation(in cross_val : int)
+setnrfold(in nr_fold : int)

Class BST
Attributes:
-root : Node
-N : int
Operations:
+size() : int
+put(in mqap : double, in classe : int, in svm : nodeSVM, in c : double, in g : double)
+trainTree(in filename : string)
+predictTree(in fileinput : string, in fileoutput : string, in separador : double, inout reliability : object)

Class Fileformat
Operations:
+Split(in filename : string, in separator : double, in training : bool)

Class SVMtree
Operations:
+main(in args : object)

Class ProteinQualityCalculator
Operations:
+doGet(in request : object, out response : object)
+doPost(in request : object, out response : object)

Class reformat
Operations:
+main(in args : object)

Associations between the classes carry multiplicities 1 and N. The libsvm package is connected through ‘Uses’ dependencies, one of them annotated: “Only uses data types from the library, not methods”.

Figure 18. Class Diagram.

2.3-Concept of Execution

From system administrator:

The execution starts in the SVMtree class, where the main procedure creates a ‘forest’ of BSTs, using the BST constructor for each one of them. Each BST represents a binary decision for a different cut-off point. Then, the training and testing methods of each BST (trainTree and predictTree from the BST class, respectively) are called iteratively.

The trainTree method traverses the tree recursively, visiting all the non-leaf nodes. For each one, it reads the training set file, splits the data into two new files for further use (using the Split method from the Fileformat class), sets all the parameters of the associated nodeSVM structure and trains its SVM.


In the predictTree method, the testing file is read and split into two new files for further use (using the Fileformat class). Then, for each sample in the file, the method descends recursively through the tree, deciding at each node whether to continue the prediction with the left child or with the right child. When the program reaches a leaf, the associated class is assigned to the sample. To make each decision, it gets the nodeSVM associated with the Node object (the structure that contains the information about the associated SVM) and calls its predict function, which in turn calls the predict function of the LIBSVM package.

Finally, the average reliability percentage for each prediction obtained from the predictTree function is written into the reliability file.

From ProteinQualityCalculator:

In this case, the Servlet interacts directly with LIBSVM. It uses all the models generated in the training step (loading them with the svm_load_model function from LIBSVM) to predict the score of the given sample (using the svm_predict method from LIBSVM), together with the data from the reliability file.

All this information is displayed in the browser as an HTML file.

2.4-System Interface Design

The design of the three interfaces is explained in this section.

2.4.1- Protein Quality Calculator Interface

This interface is implemented as a Servlet which prints the data in the client's browser as an HTML

file.

On the left side of the page, there is a form that must be filled with the scores provided by

ModSSEA, ModCHECK, ProQ-MX and ProQ-LG systems.

When the submit button is pressed, the request is sent from the browser in the client machine to the

server, which calls the proper functions in the server machine to calculate the prediction.

Afterwards, the information regarding the prediction is printed in the client's browser in the right

side of the screen.

The data introduced in the form must be decimal numbers ranging from 0 to 1, both included.

The output consists of two columns. The first column tells whether the prediction is above or below a certain threshold (ranging from 0 to 1, since it is represented on the same scale as the input) and the second is a percentage representing the correctness of this prediction, based on the results of the testing stage.

In order to display the correctness of the prediction, a file called “reliability”, which contains the accuracy achieved in the testing stage for each threshold, has to be read.

The file is structured in two columns (separated by a blank), each containing a decimal number. For each row, the left number represents the accuracy when predicting class 0 and the right one the accuracy for class 1 at a certain threshold. The rows are sorted in ascending order of threshold value.
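Reading such a file could look like the following Java sketch. It assumes only what is stated above (two blank-separated decimal numbers per row, rows in ascending threshold order); the names ReliabilityFile, Row and parse are hypothetical, and whether the stored numbers are percentages or fractions is not specified, so the sketch keeps them as read.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of parsing the two-column "reliability" file described above. */
final class ReliabilityFile {

    /** Accuracy pair for one threshold: class 0 in the left column, class 1 in the right. */
    static final class Row {
        final double class0Accuracy;
        final double class1Accuracy;

        Row(double class0Accuracy, double class1Accuracy) {
            this.class0Accuracy = class0Accuracy;
            this.class1Accuracy = class1Accuracy;
        }
    }

    /** Parses rows of the form "<class0> <class1>", keeping the order of the input lines. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;           // tolerate blank lines
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected two columns: " + line);
            }
            rows.add(new Row(Double.parseDouble(parts[0]), Double.parseDouble(parts[1])));
        }
        return rows;
    }
}
```

The lines themselves could be obtained with java.nio.file.Files.readAllLines; the i-th parsed row then corresponds to the i-th threshold in ascending order.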

It must be mentioned that the form uses a JavaScript (Sun Microsystems, Inc., 1994) function to validate the input data on the client side (which is faster than validating with Java (Sun Microsystems, Inc., 1994) on the server, because it avoids the transmission time). This validation has been implemented using regular expressions.
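The regular expression actually used in the form is not reproduced in this document. As a hedged illustration of the idea, written in Java rather than JavaScript, a pattern that accepts decimal numbers from 0 to 1 (both included) might look like this; the class name ScoreValidator and the exact pattern are assumptions.

```java
import java.util.regex.Pattern;

/** Sketch of validating a score field as a decimal number in [0, 1]. */
final class ScoreValidator {

    // Accepts "0", "1", "0.5", "0.125", "1.0", "1.00"; rejects "1.5", "-0.1", ".5", "0.", "abc".
    private static final Pattern SCORE = Pattern.compile("0(\\.\\d+)?|1(\\.0+)?");

    static boolean isValidScore(String text) {
        // matches() requires the whole trimmed string to fit the pattern
        return text != null && SCORE.matcher(text.trim()).matches();
    }
}
```

The same pattern text could be reused in a client-side script, since it relies only on constructs that are common to both regular expression dialects.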

Figure 19. Protein Quality Calculator User Interface.

2.4.2- System Administrator Interface

Although this is a non-interactive interface, the user communicates with the system through the “trainingset” and “testingset” files inside the “data” directory.

Both files use the same data format. Each consists of a set of rows containing the ModSSEA, ModCHECK, ProQ-MX and ProQ-LG scores separated by blanks. At the end of each row, after the last blank, comes the observed quality score.
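A row in this format can be parsed with a few lines of Java. The sketch below assumes exactly five blank-separated numbers per row, in the order given above; SampleRow and its field names are hypothetical.

```java
import java.util.Arrays;

/** Sketch of parsing one row of the "trainingset" / "testingset" files. */
final class SampleRow {
    final double[] mqapScores;     // ModSSEA, ModCHECK, ProQ-MX, ProQ-LG, in that order
    final double observedQuality;  // last column of the row

    SampleRow(double[] mqapScores, double observedQuality) {
        this.mqapScores = mqapScores;
        this.observedQuality = observedQuality;
    }

    static SampleRow parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected 4 scores and 1 observed quality: " + line);
        }
        double[] values = Arrays.stream(parts).mapToDouble(Double::parseDouble).toArray();
        return new SampleRow(Arrays.copyOf(values, 4), values[4]);
    }
}
```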

This interface outputs a set of files which are used for prediction and evaluation tasks or by the system itself. Specifically, there are six kinds of files:

·X.model -> These files contain information regarding the SVM with X value as a discriminator.

·X-class0 -> These files contain the dataset that belongs to class 0 after splitting the data depending on X.

·X-class1 -> The same concept as the above file but the data belongs to class 1 instead.

·X-classified -> These files contain the MQAP scores for the problem with discriminator X, but the observed quality score has been converted into 0 or 1 in order to compare the prediction quality.

·outputX -> These files contain the predictions for the SVM with the X discriminator value.

·reliability -> File containing the accuracy of each prediction made (explained above).

2.4.3- Reformat Interface

This interface receives the name of the ModFOLD file from the command line.

A file named “filename.form” is generated as an output containing suitable data for using in the

system.


3-DETAILED DESCRIPTION OF COMPONENTS

3.1-Class BST

3.1.1-Method put

void put(double mqap, int classe, nodeSVM svm,double c, double g)

*Creates a node and inserts it to the proper position inside the tree

*@param mqap: a double representing the key value and threshold value of the node

*@param classe: The number that identifies the class in a leaf node, -1 otherwise

*@param svm: SVM associated to the node

*@param c: double representing the c parameter for the SVM

*@param g: double representing the g parameter for the SVM
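The source of put is not listed here, so the following Java sketch only illustrates the standard binary-search-tree insertion that the description suggests, keyed on the threshold value. It is an assumption, not the original code: ThresholdTree and Entry are hypothetical names, and the class id, nodeSVM, c and g that the real method also stores are left out for brevity.

```java
/** Sketch of inserting threshold keys into a binary search tree. */
final class ThresholdTree {

    static final class Entry {
        final double key;      // threshold value, playing the role of the mqap parameter
        Entry left, right;

        Entry(double key) { this.key = key; }
    }

    private Entry root;
    private int size;

    /** Inserts a key; smaller keys go to the left subtree, larger ones to the right. */
    void put(double key) {
        root = put(root, key);
    }

    private Entry put(Entry node, double key) {
        if (node == null) {
            size++;
            return new Entry(key);
        }
        if (key < node.key) {
            node.left = put(node.left, key);
        } else if (key > node.key) {
            node.right = put(node.right, key);
        }                                   // equal keys are ignored in this sketch
        return node;
    }

    int size() { return size; }

    Entry root() { return root; }
}
```

Inserting 0.5, 0.25 and 0.75 in that order reproduces the decision nodes of Figure 13, with 0.5 at the root.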

3.1.2-Method trainTree

void trainTree(String filename)

*Trains all the SVMs within the tree

*@param filename: a String with the filename of the file that contains the training set

3.1.3-Method predictTree

void predictTree(String fileinput, String fileoutput, double separador, double[] reliability)

*Predicts all the testing set using the previously trained tree

*@param fileinput: name of the file which contains the testing set

*@param fileoutput: name of the file to which the output will be written

*@param separador: threshold value which will be used to split the data for further classification

*@param reliability: array of doubles that contains the average accuracy values of the classification

3.2-Class Node

3.2.1-Method Node

Node(double mqap, int classe, nodeSVM svm, double c, double g)

*Class constructor

*@param mqap: a double representing the key value and threshold value of the node

*@param classe: The number that identifies the class in a leaf node, -1 otherwise

*@param svm: SVM associated to the node

*@param c: double representing the c parameter for the SVM

*@param g: double representing the g parameter for the SVM


3.2.2-Method getMQAP

double getMQAP()

*Gets key and threshold value of the node

*@return: Double representing the key of the node

3.2.3-Method getClasse

int getClasse()

*Gets the number identifying the class

*@return: Integer representing the id of the class

3.2.4-Method getNodeSVM

nodeSVM getNodeSVM()

*Gets the nodeSVM associated to the node

*@return: the instance of nodeSVM associated to the current node

3.2.5-Method setNodeSVM

void setNodeSVM(nodeSVM nSVM)

*Sets the nodeSVM

*@param nSVM: a reference to nodeSVM to associate to the current node

3.2.6-Method getLeftChild

Node getLeftChild()

*Gets the left child of the current node

*@return: a reference to the left child node

3.2.7-Method setLeftChild

void setLeftChild(Node x)

*Sets the left child of the current node

*@param x: a reference to the left child which will be associated to the current node

3.2.8-Method getRightChild

Node getRightChild()

*Gets the right child of the current node

*@return: a reference to the right child node

3.2.9-Method setRightChild

void setRightChild(Node x)

*Sets the right child of the current node


*@param x: a reference to the right child which will be associated to the current node

3.2.10-Method getG

double getG()

*Gets the G parameter of the SVM associated to the current node

*@return: a double representing the G parameter

3.2.11-Method setG

void setG(double g)

*Sets the G parameter of the SVM associated to the current node

*@param g: a double representing the G parameter

3.2.12-Method getC

double getC()

*Gets the C parameter of the SVM associated to the current node

*@return: a double representing the C parameter

3.2.13-Method setC

void setC(double c)

*Sets the C parameter of the SVM associated to the current node

*@param c: a double representing the C parameter

3.3-Class NodeSVM

3.3.1-Method read_problem

void read_problem()

*Reads the input file defined in the input_file_name variable and creates the problem to

*be used for the LIBSVM

3.3.2-Method predict

int predict(svm_node[] linea)

*Calls the predict function of the LIBSVM library in order to make a prediction

*@param linea: array of svm_node (type defined by the LIBSVM library) which contains

*all the information needed to predict a single sample.

3.3.3-Get/Set methods

Several get and set methods have been created to access the attributes of the class that LIBSVM uses to carry out its work. These methods are not explained here because they are used by the library, but not by the system itself.

3.4-Class Fileformat

3.4.1-Method Split

void Split(String filename, double separator, boolean training)

*Splits the data from the file named “filename” according to the “separator” and saves

*the results in 3 files (X-classified, X-class0, X-class1). The meaning of these files is

*explained in section 2.4.2 of this document.

*@param filename: Name of the fileinput to split

*@param separator: double representing the cut off point

*@param training: Boolean that informs whether the split is made for training or testing
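The core of the split can be sketched in memory as follows. This is an illustration only: the real method reads and writes files and also takes the training flag, both of which are omitted, and the names SplitSketch and Result are hypothetical. Following the thresholds used in the figures, rows with an observed quality less than or equal to the separator are treated as class 0 and the rest as class 1.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of splitting samples around a cut-off point. */
final class SplitSketch {

    /** Result of a split: the rows on each side of the cut-off. */
    static final class Result {
        final List<String> class0 = new ArrayList<>();   // observed quality <= separator
        final List<String> class1 = new ArrayList<>();   // observed quality >  separator
    }

    /** Each row is assumed to end with the observed quality score, separated by blanks. */
    static Result split(List<String> rows, double separator) {
        Result result = new Result();
        for (String row : rows) {
            String[] parts = row.trim().split("\\s+");
            double observed = Double.parseDouble(parts[parts.length - 1]);
            (observed <= separator ? result.class0 : result.class1).add(row);
        }
        return result;
    }
}
```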

3.5-Class SVMtree

3.5.1-Method Main

void main(String[] args)

*Main procedure for the System Administrator Interface. This procedure creates a ‘forest’

*of BSTs, then trains the SVMs and runs the predictions.

*This procedure does not receive any parameter from the command line

3.6-Class ProteinQualityCalculator

3.6.1-Method doGet

void doGet(HttpServletRequest request, HttpServletResponse response)

*Fills the response object with the HTML code which will be used for displaying a

*webpage in a browser. It also handles the user request, calling the proper functions

*@param request: the HttpServletRequest object that has the information of the request

*@param response: the HttpServletResponse object that has the information regarding

*the response which will be sent to the client

3.6.2-Method doPost

void doPost(HttpServletRequest request, HttpServletResponse response)

*Fills the response object with the HTML code which will be used for displaying a

*webpage in a browser. It also handles the user request, calling the proper functions

*@param request: the HttpServletRequest object that has the information of the request

*@param response: the HttpServletResponse object that has the information regarding

*the response which will be sent to the client

3.7-LIBSVM library

The description of this component can be found in the documentation provided by the LIBSVM.

3.8-Class reformat

3.8.1-Method Main

void main(String[] args)

*Main procedure for the Reformat tool. The file received as a parameter is reformatted into

*a file suitable for use in the system

*The filename of the input file has to be passed as a parameter on the command line

CONCLUSIONS

This project ends on a bittersweet note. On the one hand, the results that we aspired to at the beginning of the project have not been achieved; we could not develop a full MQAP that increases the accuracy of model quality assessment. On the other hand, a new approach to this very difficult problem has been attempted, and the results show that it is a suitable way to address the problem and the beginning of a path that may allow us to meet our goals.

In the following points, we check the results obtained and the way this project has been developed against the goals and milestones defined in the introduction of this dissertation, in order to identify where the project has been carried out as expected and where it has been weak.

1-Understanding the problem:

This milestone was perfectly covered at the early stage of the project, since it was needed to carry

on with the rest of the stages. This important research phase was required to understand the scope of

the project.

2-Choosing the method:

This goal was achieved easily because many successful studies using SVMs have recently been carried out in the bioinformatics field. Hence, after reading specific documentation about this method, the decision to use it was made, since it should be suitable to meet our objectives.

Afterwards, we learnt how to use the library that implements the method; once more, the documentation was very good and the project could proceed without much difficulty.

3-Training SVM:

The main problems arose when trying to reach this milestone. A suitable kernel was chosen, the best parameters for the kernel were found and a way to tackle the multi-class classification was selected, but the results achieved were not as expected (the system was unable to classify the data properly). These decisions (i.e. the choice of the RBF kernel, the selection of parameters, etc.) were based on the available documentation, since checking every option ourselves would have required a separate project for each decision.

In the end, we found a method that could be useful for determining the quality of high quality models: a decision tree of binary SVMs that classify at varying thresholds of model quality. Unfortunately, due to time limitations we were unable to refine this strategy.
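One possible reading of classifying at varying thresholds is a cascade: a sample is passed to the classifier for the next, higher cut-off only while the current classifier places it above its own cut-off. The sketch below illustrates that control flow only. The class CascadeSketch is a hypothetical name, and meanAbove is a deliberately naive stand-in for a trained binary SVM, not the classifier used in this project.

```java
import java.util.List;
import java.util.function.Predicate;

final class CascadeSketch {

    // Hypothetical stand-in for a trained binary SVM: it answers "above the
    // cut-off" when the mean of the input scores exceeds that cut-off.
    static Predicate<double[]> meanAbove(final double cutoff) {
        return scores -> {
            double sum = 0.0;
            for (double s : scores) sum += s;
            return sum / scores.length > cutoff;
        };
    }

    // Walks the stages in order of increasing cut-off and stops at the first
    // stage whose classifier answers "not above". Returns that stage index,
    // or stages.size() when every classifier answers "above".
    static int firstFailedStage(double[] scores, List<Predicate<double[]>> stages) {
        for (int i = 0; i < stages.size(); i++) {
            if (!stages.get(i).test(scores)) return i;
        }
        return stages.size();
    }

    // Turns the stage index into a quality interval expressed with the cut-offs.
    static String describe(double[] cutoffs, int stage) {
        if (stage == 0) return "<= " + cutoffs[0];
        if (stage == cutoffs.length) return "> " + cutoffs[cutoffs.length - 1];
        return "> " + cutoffs[stage - 1] + " and <= " + cutoffs[stage];
    }
}
```

With the cut-offs 0.5, 0.75 and 0.875, a sample that passes the first two stand-in classifiers and fails the third is reported as lying between 0.75 and 0.875.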

To sum up, although this difficult goal was not achieved completely, we can say that it was partially

covered. (However, it must be said that optimisation of machine learning methods is a complete

field of research in itself.)

4-Developing the application:

Though a program that helps in calculating the quality of the models has been developed, this difficult milestone cannot be considered fully reached.

The current program takes the ModSSEA, MODCHECK, ProQ-MX and ProQ-LG scores as a starting point and combines them, giving output that states whether the current sample is above or below certain cut-off points, together with the average accuracy of the classifier during the testing stage. A true MQAP, by contrast, receives a model of protein structure as input and gives the quality of the model as output. The program built therefore cannot be described as a true MQAP; however, it is a useful tool and, with further modifications, will contribute towards the development of future methods.
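The accuracy reported alongside each cut-off can be understood as a per-class hit rate measured on the test set: the fraction of class 0 samples predicted as class 0, and the fraction of the remaining samples predicted as not class 0. A minimal sketch, assuming integer class labels; the class name ReliabilitySketch is hypothetical.

```java
final class ReliabilitySketch {

    // Returns {hit rate for class 0, hit rate for every other class}.
    // A rate is NaN when the corresponding class has no test samples.
    static double[] perClassAccuracy(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int is0 = 0, isNot0 = 0, hit0 = 0, hitNot0 = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 0) {
                is0++;
                if (predicted[i] == 0) hit0++;
            } else {
                isNot0++;
                if (predicted[i] != 0) hitNot0++;
            }
        }
        double rate0 = is0 == 0 ? Double.NaN : (double) hit0 / is0;
        double rateNot0 = isNot0 == 0 ? Double.NaN : (double) hitNot0 / isNot0;
        return new double[] {rate0, rateNot0};
    }
}
```

Reporting the two rates separately, rather than one overall accuracy, keeps a classifier that always predicts the majority class from looking better than it is.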

Moreover, the system architecture is perhaps not optimal, since the structure of the classes is a consequence of all the different approaches tried during the project. For example, the system uses a ‘forest’ of BSTs not because this is the best structure for it, but because a previous approach structured all the information regarding classification decisions in one single BST.

The prototype system has not yet been refined further because most of the project time was spent on meeting the Training SVM goal.

Therefore, a new analysis and design for this program could be carried out in order to develop a

better and more efficient piece of software.

Currently, this system would be useful to a small part of the biochemistry community because the

problem it solves is quite specific. However, the tool which has been developed is a starting point

for further research with wider reaching application.

5-Developing web based interface:

A basic web page for use as an interface for the system has been developed. Therefore, this

milestone has been reached.

The web page is graphically simple and functional, as we decided to focus on achieving the 3rd

milestone rather than spend time building an elaborate user interface in the front end of the

application.

The normal procedures for installing a web page such as setting up the server or programming the

Servlet have been carried out.

In conclusion, even though not all of the ambitious goals set out at the beginning of the project have been achieved, we have found that increasing the accuracy of model quality assessment with SVMs relies on using several SVMs and treating each classification as a single problem, independent of the previous classification. Furthermore, the models with observed quality equal to 1 or to 0 must be removed from the training and testing data in order to keep the data balanced.

Following this advice, further research can be done using the same Kernel and parameters we used

in this project and good results for differentiating high quality models may be achieved.

We would discourage future researchers from trying to use a multi class SVM or a BST with

dependent data and the RBF kernel.

We anticipate that an extension of this work will be carried out very soon which will take into

account our findings and we are optimistic that the ultimate difficult goal that we were not able to

achieve in this short project will be fulfilled.

REFERENCES

Access Excellence at the National Health Museum (1999): The Central Dogma of Molecular

Biology [online] Available from: http://www.accessexcellence.org/RC/VL/GG/central.html

[accessed January 2007]

Bairoch, A. (2000): ‘The ENZYME database in 2000’. Nucleic Acids Research. 28, 304-305.

Baker, D. and Sali, A. (2001): ‘Protein structure prediction and structural genomics’. Science

294(5540):93-96.

Berg, J.M., Tymoczko, J.L. and Stryer, L. (2002): Biochemistry, W. H. Freeman and Company,

Houndmills, Basingstoke, England, chap.3, 41-76.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and

Bourne, P.E. (2000): ‘The Protein Data Bank’. Nucleic Acids Research. Vol.28, No.1 235-242.

Bowie, J.U., Luthy, R. and Eisenberg, D. (1991): ‘A method to identify protein sequences that fold

into a known three-dimensional structure’. Science. 253 (5016), 164-170.

CAFASP Critical Assessment of Fully Automated Structure Prediction. CAFASP4 MQAP [online]

Available from: http://www.cs.bgu.ac.il/~dfischer/CAFASP4/mqap.html [accessed January 2007]

Chang, C.C. and Lin, C.J. (2001): LIBSVM: a Library for Support Vector Machines [online]

Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf [accessed February 2007]

Cold Spring Harbor Laboratory (2002): DNA From The Beginning [online] Available from:

http://www.dnaftb.org [accessed January 2007]

DTREG. SVM – Support Vector Machines [online] Available from: http://www.dtreg.com/svm.htm

[accessed February 2007]

Eramian, D., Shen, M., Devos, D., Melo, F., Sali, A. and Marti-Renom, M.A. (2006): ‘A composite

score for predicting errors in protein structure models’. Protein Science. 15, 1653-1666.

GanttProject.org (2003): GanttProject.org [online] Available from: http://ganttproject.org [accessed

January 2007]

Gold, C. and Sollich, P. (2005): Software for parameter tuning for SVM classifiers. [online]

Available from: http://www.mth.kcl.ac.uk/~psollich/BayesSVM/ [accessed March 2007]

Guy, N.C. Protein 3D Structural Prediction & Fold Recognition [online] Available from:

http://darwin.nmsu.edu/~molb470/fall2003/Projects/guy/ [accessed January 2007]

Hsu, C.W., Chang, C.C. and Lin, C.J. (2003): A Practical Guide to Support Vector Classification.

[online] Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf [accessed

January 2007]

Hsu, C.W and Lin, C.J. (2002): A Comparison of Methods for Multi-class Support Vector

Machines. [online] Available from: http://www.csie.ntu.edu.tw/~cjlin/papers/multisvm.pdf

[accessed February 2007]

Jones, D.T. and Hadley, C. (2000): ‘Threading methods for protein structure prediction’.

Bioinformatics: Sequence, structure and databanks. Higgins, D. & Taylor, W.R. Eds., 1-13,

Springer-Verlag, Heidelberg.

Lazaridis, T. and Karplus, M. (1999): ‘Discrimination of the native from misfolded protein models

with an energy function including implicit solvation’. J. Mol. Biol. 288, 477–487.

Lodish, H., Berk, A., Matsudaira, P., Kaiser, C.A., Krieger, M., Scott, M.P., Zipursky, S.L. and

Darnell, J. (2004): Molecular Cell Biology, WH Freeman and Company, New York.

Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sanchez, R., Melo, F. and Sali, A. (2000):

‘Comparative protein structure modeling of genes and genomes’. Annu Rev Biophys Biomol Struct.

29, 291-325.

McGuffin, L. J. (2007a): ‘Benchmarking consensus model quality assessment for protein fold

recognition’. BMC Bioinformatics, submitted.

McGuffin, L. J. (2007b): ‘Aligning sequences to structures’. Methods in Molecular Biology, Protein

Structure Prediction: Methods and Protocols. Humana Press, In press.

Moore, A.W. (2001): Support Vector Machines [online] Available from:

http://www.autonlab.org/tutorials/svm.html [accessed January 2007]

Nelson, D.L. and Cox, M.M. (2000): Lehninger. Principles of Biochemistry. Worth Publishers, New

York, chap. 5-6, 115-202.

Pettit, C.S., McGuffin, L.J. and Jones, D.T. (2005): ‘Improving sequence-based fold recognition by

using 3D model quality assessment’. Bioinformatics. 21, 3509-3515.

Sali, A. and Blundell T.L. (1993): ‘Comparative protein modelling by satisfaction of spatial

restraints’. J Mol Biol 234(3), 779-815.

Sippl, M.J. (1993): ‘Recognition of Errors in Three-Dimensional Structures of Proteins’. Proteins.

17, 355-62

Sun Microsystems, Inc. (1994): Sun Developer Network (SDN). [online] Available from:

http://java.sun.com/ [accessed January 2007]

The Apache Software Foundation (1999): Apache Tomcat. [online] Available from:

http://tomcat.apache.org/ [accessed June 2007]

Voet, D. and Voet, J.G. (2004): Biochemistry, Wiley, Hoboken, NJ, Vol. 1.

Wallner, B. and Elofsson A. (2003): ‘Can correct protein models be identified?’ Protein Science.

12, 1073-1086.

Zhang, Y. and Skolnick, J. (2004): ‘Scoring function for automated assessment of protein structure

template quality’. Proteins. 57, 702-710.

Zhang, Y. and Skolnick, J. (2005): ‘The protein structure prediction problem could be solved using

the current PDB library’. Proc Natl Acad Sci USA, 102 (4), 1029-1034.

APPENDICES

Class BST Java code:

import java.io.BufferedReader;

import java.io.DataOutputStream;

import java.io.FileOutputStream;

import java.io.FileReader;

import java.util.StringTokenizer;

import libsvm.svm;

import libsvm.svm_node;

import libsvm.svm_parameter;

public class BST

{

private Node root;

private int N;

public BST()

{

root = null;

N = 0;

}

public int size() { return N; }

public void put(double mqap, int classe, nodeSVM svm,double c, double g)

{

root=insert(root,mqap,classe,svm,c,g);

N++;

}

private Node insert(Node x, double mqap, int classe, nodeSVM svm, double c, double g)

{

Node aux;

if(x==null)

{

x=new Node(mqap,classe,svm,c,g);

}

else

{

if(mqap<=x.getMQAP())

{

aux=x.getLeftChild();

aux=insert(aux,mqap,classe,svm,c,g);

x.setLeftChild(aux);

}

else

{

aux=x.getRightChild();

aux=insert(aux,mqap,classe,svm,c,g);

x.setRightChild(aux);

}

}

return x;

}

public void trainTree(String filename) throws Exception

{

trainNode(root,filename);

}

private void trainNode(Node x,String filename) throws Exception

{

if(x.getClasse()!=-1)

{

return;

}

else

{

//Format the input

Fileformat ff=new Fileformat();

ff.Split(filename, x.getMQAP(),true);

nodeSVM nSVM=new nodeSVM();

fillSVMParameters(nSVM, filename,x.getMQAP(),x.getC(),x.getG());

nSVM.read_problem();

nSVM.seterrormsg(svm.svm_check_parameter(nSVM.getproblem(),nSVM.getparam()));

if(nSVM.geterrormsg() != null)

{

System.err.print("Error: "+nSVM.geterrormsg()+"\n");

System.exit(1);

}

else

{

nSVM.setmodel(svm.svm_train(nSVM.getproblem(),nSVM.getparam()));

svm.svm_save_model(nSVM.getmodelfilename(),nSVM.getmodel());

}

x.setNodeSVM(nSVM);

trainNode(x.getLeftChild(),"data/"+x.getMQAP()+"-train-class0");

trainNode(x.getRightChild(),"data/"+x.getMQAP()+"-train-class1");

}

}

private void fillSVMParameters(nodeSVM nSVM, String filename, double separador, double c,

double g)

{

nSVM.setinputfilename("data/"+separador+"-train-classified");

nSVM.setmodelfilename("data/"+separador+".model");

nSVM.setcrossvalidation(0);

nSVM.setnrfold(0);

nSVM.setparam(new svm_parameter());

nSVM.getparam().svm_type = svm_parameter.C_SVC;

nSVM.getparam().kernel_type = svm_parameter.RBF;

nSVM.getparam().degree = 3;

nSVM.getparam().gamma = g; // -g option from grid search

nSVM.getparam().coef0 = 0;

nSVM.getparam().nu = 0.5;

nSVM.getparam().cache_size = 100;

nSVM.getparam().C = c; // -c option from grid search

nSVM.getparam().eps = 1e-3;

nSVM.getparam().p = 0.1;

nSVM.getparam().shrinking = 1;

nSVM.getparam().probability = 0;

nSVM.getparam().nr_weight = 0;

nSVM.getparam().weight_label = new int[0];

nSVM.getparam().weight = new double[0];

}

public void predictTree(String fileinput, String fileoutput, double separador, double[] reliability)

throws Exception

{

//Formatting the file

Fileformat ff=new Fileformat();

ff.Split(fileinput, separador, false);

//Calling to predictNode for each line within the file

BufferedReader input = new BufferedReader(new FileReader("data/"+separador+"-test-classified"));

DataOutputStream output = new DataOutputStream(new FileOutputStream(fileoutput));

double is0 = 0, isNO0 =0;

double predictedas0 = 0,predictedasNO0 = 0;

while(true)

{

String line = input.readLine();

if(line == null) break;

StringTokenizer st = new StringTokenizer(line," \t\n\r\f:");

double target = atof(st.nextToken());

int m = st.countTokens()/2;

svm_node[] x = new svm_node[m];

for(int j=0;j<m;j++)

{

x[j] = new svm_node();

x[j].index = atoi(st.nextToken());

x[j].value = atof(st.nextToken());

}

int v;

v = predictNode(root,x);

output.writeBytes(v+"\n");

if(target==0)

{

is0++;

if(v==0)

predictedas0++;

}

else

{

isNO0++;

if(v!=0)

predictedasNO0++;

}

}

//Release the file handles once every line has been processed

input.close();

output.close();

reliability[0]=(predictedas0/is0);

reliability[1]=(predictedasNO0/isNO0);

}

private int predictNode(Node x,svm_node[] linea) throws Exception

{

if(x.getClasse()!=-1)

{

return x.getClasse();

}

else

{

nodeSVM auxSVM=x.getNodeSVM();

int classe=auxSVM.predict(linea);

if(classe==0)

{

return predictNode(x.getLeftChild(),linea);

}

else

{

return predictNode(x.getRightChild(),linea);

}

}

}

private static double atof(String s)

{

return Double.valueOf(s).doubleValue();

}

private static int atoi(String s)

{

return Integer.parseInt(s);

}

}

Class Node Java code:

public class Node

{

private double mqap; // mqap value

private int classe; // associated class

private nodeSVM svm; // SVM associated to the node

private Node left, right; // left and right subtrees

private double c;

private double g;

public Node(double mqap, int classe, nodeSVM svm, double c, double g)

{

this.mqap = mqap;

this.classe = classe;

this.svm = svm;

this.c=c;

this.g=g;

}

public double getMQAP()

{

return mqap;

}

public int getClasse()

{

return classe;

}

public nodeSVM getNodeSVM()

{

return svm;

}

public void setNodeSVM(nodeSVM nSVM)

{

this.svm=nSVM;

}

public Node getLeftChild()

{

return left;

}

public void setLeftChild(Node x)

{

left=x;

}

public Node getRightChild()

{

return right;

}

public void setRightChild(Node x)

{

right=x;

}

public double getC()

{

return c;

}

public void setC(double c)

{

this.c=c;

}

public double getG()

{

return g;

}

public void setG(double g)

{

this.g=g;

}

}

Class NodeSVM Java code:

import libsvm.*;

import java.io.*;

import java.util.*;

class nodeSVM

{

private svm_parameter param;

private svm_problem prob;

private svm_model model;

private String input_file_name;

private String model_file_name;

private String error_msg;

private int cross_validation;

private int nr_fold;

public svm_parameter getparam()

{

return param;

}

public void setparam(svm_parameter param)

{

this.param=param;

}

public svm_problem getproblem()

{

return prob;

}

public void setproblem(svm_problem prob)

{

this.prob=prob;

}

public svm_model getmodel()

{

return model;

}

public void setmodel(svm_model model)

{

this.model=model;

}

public String getinputfilename()

{

return input_file_name;

}

public void setinputfilename(String file)

{

this.input_file_name=file;

}

public String getmodelfilename()

{

return model_file_name;

}

public void setmodelfilename(String file)

{

this.model_file_name=file;

}

public String geterrormsg()

{

return error_msg;

}

public void seterrormsg(String error_msg)

{

this.error_msg=error_msg;

}

public int getcrossvalidation()

{

return cross_validation;

}

public void setcrossvalidation(int cross_val)

{

this.cross_validation=cross_val;

}

public int getnrfold()

{

return nr_fold;

}

public void setnrfold(int nr_fold)

{

this.nr_fold=nr_fold;

}

//SVM functions

public void read_problem() throws IOException

{

BufferedReader fp = new BufferedReader(new FileReader(input_file_name));

Vector vy = new Vector();

Vector vx = new Vector();

while(true)

{

String line = fp.readLine();

if(line == null) break;

StringTokenizer st = new StringTokenizer(line," \t\n\r\f:");

vy.addElement(st.nextToken());

int m = st.countTokens()/2;

svm_node[] x = new svm_node[m];

for(int j=0;j<m;j++)

{

x[j] = new svm_node();

x[j].index = atoi(st.nextToken());

x[j].value = atof(st.nextToken());

}

vx.addElement(x);

}

prob = new svm_problem();

prob.l = vx.size();

prob.x = new svm_node[prob.l][];

for(int i=0;i<prob.l;i++)

prob.x[i] = (svm_node[])vx.elementAt(i);

prob.y = new double[prob.l];

for(int i=0;i<prob.l;i++)

prob.y[i] = atof((String)vy.elementAt(i));

fp.close();

}

public int predict(svm_node[] linea) throws IOException

{

return (int)svm.svm_predict(this.model,linea);

}

private static double atof(String s)

{

return Double.valueOf(s).doubleValue();

}

private static int atoi(String s)

{

return Integer.parseInt(s);

}

}

Class Fileformat Java code:

import java.io.*;

public class Fileformat

{

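/*This function classifies the data into two different classes

depending on the separator and writes the labelled lines to the

"-classified" file, with the class label at the start of each line.

Moreover, it creates 2 new files ("-class0" and "-class1") holding the

lines of each class with their original quality value.

Lines whose quality is exactly 0 or 1 are left out of all three files.

Example for separator 0.5:

Input-> 0:0.3478 1:0.4672 2:0.8724 3: 0.273 0.7458

0:0.5673 1:0.6496 2:0.4976 3: 0.234 0.4598

Output-> 1 0:0.3478 1:0.4672 2:0.8724 3: 0.273

0 0:0.5673 1:0.6496 2:0.4976 3: 0.234

*/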

public void Split(String filename, double separator, boolean training) throws Exception

{

String infix="-test";

if(training)

infix="-train";

FileInputStream fs=new FileInputStream(filename);

FileOutputStream out=new FileOutputStream("data/"+separator+infix+"-classified");

FileOutputStream out1=new FileOutputStream("data/"+separator+infix+"-class0");

FileOutputStream out2=new FileOutputStream("data/"+separator+infix+"-class1");

String line;

String lineout;

String lineout1;

String lineout2;

//Number of bytes in the input file to be read

int totalbytes=fs.available();

byte[] inarray=new byte[totalbytes];

byte[] outarray=new byte[totalbytes];

byte[] outarray1=new byte[totalbytes];

byte[] outarray2=new byte[totalbytes];

//Reading the file and storing the data in inarray

fs.read(inarray);

//line number

int i=1;

//i+k: current byte in the input array

int k=-1;

//i+m: current byte in the output array

int m=-1;

int m1=0;

int m2=0;

while((i+k)<totalbytes)

{

int j=0;

line="";

//Reading the line i

while(inarray[i+k]!='\n')

{

line+=(char)inarray[i+k];

k++;

}

//Formatting the line

if(!line.equals("")) //compare the contents, not the object references

{

int lasttabindex=line.lastIndexOf(' ');

String sQuality=line.substring(lasttabindex+1);

lineout=line.substring(0,lasttabindex);

lineout1="";

lineout2="";

//Checking the model quality

double quality=Double.parseDouble(sQuality);

if(quality<=separator)

{

lineout1=lineout+" "+sQuality;

lineout="0 "+lineout;

}

else

{

lineout2=lineout+" "+sQuality;

lineout="1 "+lineout;

}

if(quality!=0 && quality!=1)

{

//Writing the line in the output array

for(j=0;j<lineout.length();j++)

{

outarray[i+m]=(byte)lineout.charAt(j);

m++;

}

outarray[i+m]='\n';

int mbefore=m1;

for(int l=0;l<lineout1.length();l++)

{

outarray1[m1]=(byte)lineout1.charAt(l);

m1++;

}

if(mbefore!=m1)

{

outarray1[m1]='\n';

m1++;

}

mbefore=m2;

for(int l=0;l<lineout2.length();l++)

{

outarray2[m2]=(byte)lineout2.charAt(l);

m2++;

}

if(mbefore!=m2)

{

outarray2[m2]='\n';

m2++;

}

}

else

{

m--;

}

i++;

}

}

//Writing the output array to the file

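out.write(outarray,0,i+m);

out1.write(outarray1,0,m1);

out2.write(outarray2,0,m2);

//Closing the streams to release the file handles

fs.close();

out.close();

out1.close();

out2.close();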

}

}

Class SVMtree Java code:

import java.io.*;

public class SVMtree

{

public static void main(String[] args) throws Exception

{

PrintWriter out = new PrintWriter(new FileWriter("data/reliability"));

//Create the decision trees

BST st05 = new BST();

BST st075 = new BST();

BST st0875 = new BST();

BST st09375 = new BST();

BST st096875 = new BST();

BST st0984375 = new BST();

double[] reliability=new double[2];

st05.put(0.5,-1,null,512,0.5);

st05.put(0.2,0,null,-1,-1);

st05.put(0.7,1,null,-1,-1);

st05.trainTree("data/trainingset");

st05.predictTree("data/testingset","data/output05",0.5,reliability);

out.println(reliability[0]+" "+reliability[1]);

st075.put(0.75,-1,null,32,0.03125);

st075.put(0.6,0,null,-1,-1);

st075.put(0.8,1,null,-1,-1);

st075.trainTree("data/0.5-train-class1");

st075.predictTree("data/0.5-test-class1","data/output075",0.75,reliability);

out.println(reliability[0]+" "+reliability[1]);

st0875.put(0.875,-1,null,32768,2);

st0875.put(0.6,0,null,-1,-1);

st0875.put(0.9,1,null,-1,-1);

st0875.trainTree("data/0.75-train-class1");

st0875.predictTree("data/0.75-test-class1","data/output0875",0.875,reliability);

out.println(reliability[0]+" "+reliability[1]);

st09375.put(0.9375,-1,null,512,8);

st09375.put(0.6,0,null,-1,-1);

st09375.put(0.94,1,null,-1,-1);

st09375.trainTree("data/0.875-train-class1");

st09375.predictTree("data/0.875-test-class1","data/output09375",0.9375,reliability);

out.println(reliability[0]+" "+reliability[1]);

st096875.put(0.96875,-1,null,512,8);

st096875.put(0.6,0,null,-1,-1);

st096875.put(0.97,1,null,-1,-1);

st096875.trainTree("data/0.9375-train-class1");

st096875.predictTree("data/0.9375-test-class1","data/output096875",0.96875,reliability);

out.println(reliability[0]+" "+reliability[1]);

st0984375.put(0.984375,-1,null,512,8);

st0984375.put(0.6,0,null,-1,-1);

st0984375.put(0.99,1,null,-1,-1);

st0984375.trainTree("data/0.96875-train-class1");

st0984375.predictTree("data/0.96875-test-class1",

"data/output0984375",0.984375,reliability);

out.println(reliability[0]+" "+reliability[1]);

out.close();

}

}

Class ProteinQualityCalculator Java code:

import java.io.*;

import javax.servlet.*;

import javax.servlet.http.*;

import libsvm.*;

public class ProteinQualityCalculator extends HttpServlet

{

public void doGet(HttpServletRequest request, HttpServletResponse response)throws

IOException, ServletException

{

response.setContentType("text/html");

PrintWriter out = response.getWriter();

out.println("<html>");

out.println("<head>");

//JavaScript to validate the form

out.println("<script type='text/javascript'>");

out.println("function validate(form){\n");

out.println("\tif(form.modssea.value.length==0 || form.modcheck.value.length==0 || " +

"form.proqmx.value.length==0 || form.proqlg.value.length==0){\n");

out.println("\t\talert('You must fill all the fields');\n");

out.println("\t\treturn false;\n");

out.println("\t}\n");

out.println("\tvar decimal = /^((0(\\.[0-9]*)?)|(1))$/;");

out.println("\tif(form.modssea.value.match(decimal)==null ||"+

"form.modcheck.value.match(decimal)==null || " +

"form.proqmx.value.match(decimal)==null || "+

" form.proqlg.value.match(decimal)==null){\n");

out.println("\t\talert('Valid values are decimal numbers ranging from 0 to 1');\n");

out.println("\t\treturn false;\n");

out.println("\t}\n");

out.println("\treturn true;\n");

out.println("}");

out.println("</script>");

out.println("<title>Protein Quality Calculator</title>");

out.println("</head>");

out.println("<body bgcolor=\"white\">");

out.println("<TABLE>");

out.println("<TR><TD><TABLE><TR><TD>");

out.println("<H1><CENTER><B>PROTEIN QUALITY "+

"CALCULATOR</B></CENTER></H1>");

out.println("</TD></TR></TABLE></TD></TR>");

out.println("<TR>");

out.println("<TD width='50%'>");

out.println("<H2><CENTER>INPUT</CENTER></H2>");

out.println("</TD><TD width='50%'>");

out.println("<H2><CENTER>RESULTS</CENTER></H2>");

out.println("</TD>");

out.println("</TR>");

out.println("<TR>");

out.println("<TD><CENTER>");

out.println("<form action='ProteinQualityCalculator' onsubmit='return validate(this);' "+

"method='POST'>");

out.println("<TABLE border='1'>");

out.println("<TR>");

out.println("<TD>ModSSEA score:</TD><TD><input type='text' name='modssea'></TD>");

out.println("</TR><TR>");

out.println("<TD>MODCHECK score:</TD><TD><input type='text' "+

"name='modcheck'></TD>");

out.println("</TR><TR>");

out.println("<TD>ProQ-MX score:</TD><TD><input type='text' name='proqmx'></TD>");

out.println("</TR><TR>");

out.println("<TD>ProQ-LG score:</TD><TD><input type='text' name='proqlg'></TD>");

out.println("</TR>");

out.println("</TABLE>");

out.println("<br><input type='submit' value='Submit'>");

out.println("</form>");

out.println("</CENTER></TD>");

out.println("<TD VALIGN='top' ALIGN='center'>");

if(request.getParameter("modssea")!=null && request.getParameter("modcheck")!=null &&

request.getParameter("proqmx")!=null && request.getParameter("proqlg")!=null)

{

calculate(out,request);

}

out.println("</TD>");

out.println("</TR>");

out.println("</TABLE>");

out.println("</body>");

out.println("</html>");

}

public void doPost(HttpServletRequest request, HttpServletResponse response)throws

IOException, ServletException

{

doGet(request, response);

}

private static void calculate(PrintWriter out, HttpServletRequest request) throws IOException

{

svm_node[] x = new svm_node[4];

x[0] = new svm_node();

x[1] = new svm_node();

x[2] = new svm_node();

x[3] = new svm_node();

x[0].index = 1;

x[0].value = Double.parseDouble(request.getParameter("modssea"));

x[1].index = 2;

x[1].value = Double.parseDouble(request.getParameter("modcheck"));

x[2].index = 3;

x[2].value = Double.parseDouble(request.getParameter("proqmx"));

x[3].index = 4;

x[3].value = Double.parseDouble(request.getParameter("proqlg"));

out.println("<TABLE><TR><TD width='50%'><H3>QUALITY</H3></TD><TD "+

"width='50%' align='right'><H3>ACCURACY</H3></TD></TR>");

FileReader fs=new FileReader("/var/lib/tomcat5.5/webapps/servlets-examples/WEB-"+

"INF/classes/PQC/data/reliability");

BufferedReader in=new BufferedReader(fs);

double aux;

String line="";

boolean classified=false;

//Load all the models

svm_model model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-"+

"examples/WEB-INF/classes/PQC/data/0.5.model");

int classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.5</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.5</P></TD><TD><P"+

" ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

if(!classified)

{

model=null;

model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+

"WEB-INF/classes/PQC/data/0.75.model");

classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.75</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.75</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

}

if(!classified)

{

model=null;

model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+

"WEB-INF/classes/PQC/data/0.875.model");

classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.875</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.875</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

}

if(!classified)

{

model=null;

model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+

"WEB-INF/classes/PQC/data/0.9375.model");

classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.9375</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.9375</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

}

if(!classified)

{

model=null;

model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+

"WEB-INF/classes/PQC/data/0.96875.model");

classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.96875</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.96875</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

}

if(!classified)

{

model=null;

model=svm.svm_load_model("/var/lib/tomcat5.5/webapps/servlets-examples/"+

"WEB-INF/classes/PQC/data/0.984375.model");

classe=(int)svm.svm_predict(model,x);

line=in.readLine();

if(classe==0)

{

aux=Double.parseDouble(line.substring(0,line.indexOf(" ")));

out.println("<TR><TD><P><= 0.984375</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

classified=true;

}

else

{

aux=Double.parseDouble(line.substring(line.indexOf(" ")+1));

out.println("<TR><TD><P>> 0.984375</P></TD><TD><P "+

"ALIGN='center'>"+(int)(aux*100.)+"%</P></TD></TR>");

}

}

//Releasing the reliability file handle before finishing the response

in.close();

out.println("</TABLE>");

}

}
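
The six near-identical blocks in calculate() follow one pattern: load the model for the next threshold, predict, read the matching line of the reliability file, and stop at the first threshold the model falls below. A minimal sketch of that cascade written as a loop is given here. It is not the servlet's code: CascadeSketch, classify and the belowThreshold predicate are hypothetical names, the predicate stands in for loading the i-th model with svm.svm_load_model and checking whether svm.svm_predict returns 0, and the reliability values in main are made-up placeholders rather than measured accuracies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class CascadeSketch {
    // Thresholds taken from the model file names used in the listing above.
    static final String[] THRESHOLDS =
        {"0.5", "0.75", "0.875", "0.9375", "0.96875", "0.984375"};

    /**
     * Walks the thresholds in order and stops at the first one the model is below.
     *
     * @param belowThreshold hypothetical stand-in for "the i-th SVM predicts class 0"
     * @param reliability    one line per threshold: "accuracyIfBelow accuracyIfAbove"
     * @return one text row per evaluated threshold
     */
    static List<String> classify(IntPredicate belowThreshold, String[] reliability) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < THRESHOLDS.length; i++) {
            String[] parts = reliability[i].trim().split("\\s+");
            if (belowThreshold.test(i)) {
                // Class 0: quality is at or below this threshold, so the cascade ends.
                rows.add("<= " + THRESHOLDS[i] + " "
                    + (int) (Double.parseDouble(parts[0]) * 100.) + "%");
                break;
            }
            // Class 1: quality is above this threshold, so try the next model.
            rows.add("> " + THRESHOLDS[i] + " "
                + (int) (Double.parseDouble(parts[1]) * 100.) + "%");
        }
        return rows;
    }

    public static void main(String[] args) {
        // Placeholder reliability lines (illustrative values only).
        String[] reliability =
            {"0.5 0.75", "0.5 0.75", "0.25 0.5", "0.5 0.5", "0.5 0.5", "0.5 0.5"};
        // Pretend the 0.5 and 0.75 models predict class 1 and the 0.875 model class 0.
        for (String row : classify(i -> i >= 2, reliability)) {
            System.out.println(row);
        }
    }
}
```

Keeping the thresholds in one array means a further model could be added by extending the array and the reliability file, without copying another block.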

Class Reformat Java code:

import java.io.FileInputStream;

import java.io.FileOutputStream;

public class reformat

{

/**

* This program formats the data input to a suitable format to be used with SVM.

* You must specify the input file name as a parameter.

* It writes an output file called inputfilename.form

* @param args[0] contains the input file name to be formatted

*/

public static void main(String[] args) throws Exception

{

if(args.length==0)

throw new Exception("You must specify the input file");

FileInputStream fs=new FileInputStream(args[0]);

FileOutputStream out=new FileOutputStream(args[0]+".form");

String line;

String lineout;

//Number of bytes in the input file to be read

int totalbytes=fs.available();

byte[] inarray=new byte[totalbytes];

byte[] outarray=new byte[totalbytes];

//Reading the file and storing the data in inarray

fs.read(inarray);

//line number

int i=1;

//i+k: current byte in the input array

int k=-1;

//i+m: current byte in the output array

int m=-1;

while((i+k)<totalbytes)

{

int j=0;

line="";

//Reading the line i

while(inarray[i+k]!='\n')

{

line+=(char)inarray[i+k];

k++;

}

k+=2;

//Formatting the line

if(line.length()>0)

{

// libsvm

String aux;

lineout="";

//skipping two first columns

line=line.substring(line.indexOf(" ")+1);

line=line.substring(line.indexOf(" ")+1);

aux=line.substring(0,line.indexOf(" "));

lineout+="1:"+aux+" ";

line=line.substring(line.indexOf(" ")+1);

aux=line.substring(0,line.indexOf(" "));

lineout+="2:"+aux+" ";

line=line.substring(line.indexOf(" ")+1);

aux=line.substring(0,line.indexOf(" "));

lineout+="3:"+aux+" ";

line=line.substring(line.indexOf(" ")+1);

aux=line.substring(0,line.indexOf(" "));

lineout+="4:"+aux+" ";

line=line.substring(line.indexOf(" ")+1);

//Checking the model quality

double quality=Double.parseDouble(line);

lineout+=quality;

//Writing the line in the output array

for(j=0;j<lineout.length();j++)

{

outarray[i+m]=(byte)lineout.charAt(j);

m++;

}

if(lineout.length()>0)

{

outarray[i+m]='\n';

i++;

}

}

}

//Writing the output array to the file

out.write(outarray,0,i+m);

//Releasing the file handles

out.close();

fs.close();

}

}
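
The byte-array bookkeeping in reformat (the i, k and m offsets) can be hard to follow. As a hedged alternative, the same column mapping can be sketched with readLine and split. This is not the dissertation's implementation: ReformatSketch, formatLine and reformat are hypothetical names, the sample rows in main are invented, and splitting on any run of whitespace is an assumption that is more tolerant than the single-space parsing in the listing. The output layout is kept as in the listing: the first two columns are dropped, the next four become 1: to 4: features, and the quality value is written last.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

class ReformatSketch {
    /** Converts one input row (2 skipped columns, 4 scores, 1 quality) to indexed form. */
    static String formatLine(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 7) {
            throw new IllegalArgumentException("Expected 7 columns: " + line);
        }
        StringBuilder sb = new StringBuilder();
        // Columns 0 and 1 are skipped; columns 2..5 become features 1..4.
        for (int f = 0; f < 4; f++) {
            sb.append(f + 1).append(':').append(cols[2 + f]).append(' ');
        }
        // The quality value is parsed and re-emitted, as in the listing above.
        sb.append(Double.parseDouble(cols[6]));
        return sb.toString();
    }

    /** Reformats every non-empty line from the reader and writes it to the writer. */
    static void reformat(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are ignored
            }
            out.print(formatLine(line));
            out.print('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Invented sample rows; real input would come from a file named in args[0].
        String sample = "target1 model1 0.5 0.25 0.75 0.125 0.625\n"
                      + "\n"
                      + "target1 model2 1 0 1 0 1\n";
        StringWriter result = new StringWriter();
        reformat(new BufferedReader(new StringReader(sample)), new PrintWriter(result));
        System.out.print(result);
    }
}
```

Reading line by line avoids sizing the output buffer from the input file length, which matters here because the reformatted line can be longer than the original one when the two skipped columns are short.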