b. sc. engg. thesis applications of graphs in bioinformatics

B. Sc. Engg. Thesis

Applications of Graphs in Bioinformatics

By

Abdullah Al Mueen

Student No. 0005040

&

Md. Nurul Amin

Student No. 0005091

Submitted to

Department of Computer Science and Engineering

in partial fulfillment of the requirements for the degree of

Bachelor of Science in Computer Science and Engineering

Department of Computer Science and Engineering

Bangladesh University of Engineering and Technology (BUET)

Dhaka 1000, Bangladesh

November, 2006

CERTIFICATE

This is to certify that the work presented in this thesis entitled “Applications of Graphs

in Bioinformatics” is the outcome of the investigation carried out by us under the supervision

of Professor Dr. Md. Saidur Rahman in the Department of Computer Science and Engineering,

Bangladesh University of Engineering and Technology, Dhaka. It is also declared that neither

this thesis nor any part thereof has been submitted or is being currently submitted anywhere

else for the award of any degree or diploma.

(Supervisor)

Dr. Md. Saidur Rahman

Professor

Department of Computer Science

and Engineering, BUET, Dhaka.

(Author)

Abdullah Al Mueen

Student No: 0005040

(Author)

Md. Nurul Amin

Student No: 0005091

Contents

Acknowledgments vi

Abstract vii

1 Introduction 1

1.1 History of Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Computations in Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Scope of This Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Preliminaries 6

2.1 Cell-The Unit of Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Genome Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.1 Genome Organization in Eukaryotic Cell . . . . . . . . . . . . . . . . . . 7

2.2.2 Genome Organization in Prokaryotic Cell . . . . . . . . . . . . . . . . . . 7

2.3 Genome Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Generation of Genomic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.4.1 DNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.4.2 Haplotype Reconstruction and Inference . . . . . . . . . . . . . . . . . . 10

2.5 Genomic Data Analysis and Applications . . . . . . . . . . . . . . . . . . . . . . 11

2.5.1 Phylogeny . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5.2 Study of Single Nucleotide Polymorphism . . . . . . . . . . . . . . . . . 13

2.5.3 SNPs and Disease Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . 13

i

CONTENTS ii

2.5.4 SNPs and Personalized Drug Prediction . . . . . . . . . . . . . . . . . . 14

2.6 Graph Terminologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Haplotyping 18

3.1 Haplotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Individual Haplotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.2 Minimum Fragment Removal : MFR . . . . . . . . . . . . . . . . . . . . 22

3.2.3 Minimum SNP Removal : MSR . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.4 Minimum Error Correction : MEC . . . . . . . . . . . . . . . . . . . . . 25

3.2.5 A Heuristic Algorithm for MEC Problem . . . . . . . . . . . . . . . . . . 26

3.2.6 Comparison of Different Individual Haplotyping Problems . . . . . . . . 35

3.3 Population Haplotyping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3.2 Pure Parsimony Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3.3 Maximum Resolution Problem – MR . . . . . . . . . . . . . . . . . . . . 39

3.3.4 Perfect Phylogeny Haplotyping Problem – PPH . . . . . . . . . . . . . . 40

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4 Phylogenetic Tree 44

4.1 Majority Rule Consensus Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.3 A Simulated Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2 Uniform Sampling of Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . 51

4.2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.2 Sampling Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

CONTENTS iii

5 Conclusion 55

References 57

Index 59

List of Figures

2.1 Organization of DNA molecule in a cell . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Directed and undirected graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1 Minimum fragment removal using fragment conflict graph . . . . . . . . . . . . . 23

3.2 Minimum SNP removal using SNP conflict graph . . . . . . . . . . . . . . . . . 25

3.3 SNP matrix and its partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 An example calculation of Gain measure . . . . . . . . . . . . . . . . . . . . . . 30

3.5 An example iteration of HMEC . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.6 An example log table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.7 An example of haplotype perfect phylogeny (hpp) . . . . . . . . . . . . . . . . . 42

3.8 Forbidden matrix for hpp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.1 A sample phylogenetic tree of living organisms . . . . . . . . . . . . . . . . . . . 45

4.2 An example of majority rule consensus tree for l = 1

2. . . . . . . . . . . . . . . 47

4.3 Steps of majority rule tree construction . . . . . . . . . . . . . . . . . . . . . . . 50

4.4 An induced subtree of a phylogenetic tree . . . . . . . . . . . . . . . . . . . . . 52

iv

List of Tables

3.1 A chromosome and two haplotypes assembled from it . . . . . . . . . . . . . . . 19

3.2 An SNP matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 Different relations between fragments . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4 The pseudocode for the HMEC algorithm . . . . . . . . . . . . . . . . . . . . . . 32

3.5 Comparison among the SNP problems . . . . . . . . . . . . . . . . . . . . . . . 36

3.6 A chromosome and its genotype . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.7 Example of optimal application of haplotype inference rule . . . . . . . . . . . . 40

v

Acknowledgments

All the appraisal belongs to the Almighty ALLAH.

We would like to express our deep gratitude to our supervisor Professor Dr. Md. Saidur

Rahman. We would like to thank him for introducing us in this beautiful field of bioinformatics

and for teaching us how to carry on research work. We again express our most sincere gratitude

to him for his affectionate supervision, continuous motivation and valuable advice, without

which this thesis would not have come to a completion.

We would like to thank every one associated with our research group. We specially thank

Mr. Abdullah Adnan, Mr. Abul Hasan Samee and Mr. Md. Shariful Islam Bhuyan for their

invaluable comments and encouragement throughout the period of this thesis.

Each presentation session was greatly supported by the lab assistants and officials of the

department. We would like to thank them for each of the services and facilities they provided

us.

Finally, we are grateful to our families for giving us the moral support to overcome the

tedium of repetitive trials to new findings.

vi

Abstract

Bioinformatics principally is the science of biological information processing. It requires wide

range of computational models for effective representation and efficient computation of biolog-

ical data. This thesis describes graphs as computational models in bioinformatics. Here we

focus on two major areas of bioinformatics: Haplotyping and Phylogeny.

We have compiled four optimization problems of individual haplotyping and their graphical

modelings. We propose a new heuristic algorithm to solve minimum error correction problem

which constructs a pair of haplotypes of an individual from aligned and overlapping but inter-

mixed and erroneous fragments. We describe the perfect phylogeny haplotyping problem which

arranges a set of haplotypes for a population in a phylogenetic tree.

We describe an algorithm to build a single consensus tree from a number of different small

phylogenetic trees on the same set of taxa using majority rule. We discuss the uniform sampling

of a phylogenetic tree that generates an induced subtree uniformly at random and carrying a

subset of leaves of the actual tree satisfying some constraints. We annotated three problems of

uniform sampling having constraints on leaf depths, edge lengths and pairwise leaf distances of

the resulting samples.

vii

Chapter 1

Introduction

Analysis of biological experiments and processes is labor intensive and time consuming due to

the increasing complexity of the processes and explosive growth of biological data emerging

from laboratories worldwide. Hence, the recent challenge is to transform this huge data into

knowledge for complete understanding of the biological processes and experiments relating to

both health and diseases. The quest for this knowledge has given rise to a new era of science,

bioinformatics.

Bioinformatics is a highly interdisciplinary field where techniques and concepts from infor-

matics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics merge into a

single branch of science. The most prominent and important of these disciplines is informa-

tion processing. Therefore, “bioinformatics” can be stretched as “Biological information

processing”. More precisely, bioinformatics derives knowledge from computer analysis of bi-

ological data. The development of powerful computers, and the availability of experimental

data, that can readily be treated by computation, launched bioinformatics as an independent

field.

From the perspective of information processing, bioinformatics can be divided into several

areas. The major areas are :

• Develop better tools for data generation, capture, and annotation. An example of data

generation would be “shotgun sequencing” which identifies the sequence of nucleotides of a

1

Chapter 1. Introduction 2

target DNA molecule.

• Develop and improve tools for comprehensive functional and relational analysis. Numerous

tools for phylogenetic analysis of biological data have already been developed to determine the

evolutionary relationship among species.

• Develop and improve tools for representing and comparing sequence similarity and variation.

An example would be BLAST (Basic Local Alignment Search Tool) which aligns two or more

sequence and computes different similarity and dissimilarity measures.

• Improve content and utility of biological databases to store the data in effective ways. An

example would be GenBank. GenBank, developed and housed at NCBI (National Center for

Biotechnology information), is the U.S depository for all DNA and protein sequences containing

more than 61 million sequence records.

• Create mechanisms to support effective approaches for producing robust, exportable software

that can be widely shared.

Each of these areas are evolving day by day and are subjects to extensive research. The next

section of this chapter contains a snap of the history of bioinformatics. The section following it

contains an overview of the computations required in bioinformatics. Finally the scope of this

thesis is detailed.

1.1 History of Bioinformatics

The generation of micro-biological data was started first in 1955 when amino acid sequence

of a protein (bovine insulin) was announced for the first time by F. Sanger. Since then, hun-

dreds of proteins have been sequenced and analyzed and the need for creating a data bank

for these proteins so that the information can be spread very easily to all. In 1973 one such

Protein Data Bank (PDB) named Brookhaven was announced. This PDB was fully described

in 1977 (http://www.pdb.bnl.gov). These PDBs were actually in printed form because of the

unavailability of desktop computers and communication facilities at those days.

The first complete genome sequence for an organism (FX174) was published in 1980. The

1.2. Computations in Bioinformatics 3

gene consists of 5,386 nucleotide base pairs encoding nine proteins. Just after one year, in 1981,

IBM introduced its personal computer in the market and concurrently the smith-waterman

algorithm for sequence alignment was published. The growth of computer performance and

biological data is exponential since then and leads to the creation of some new databases

like SWISS-PR0T in the ’80s. Many organizations were also founded in that decade namely

Genetics Computer Group(GCG), National Center for Biotechnology Information(NCBI), etc.

Whole genome sequences are published for different single cell organisms like Haemophilus

influenza of 1.6 Mbp (million base pair), baker’s yeast of 12.1 Mbp, E. coli of 4.7 Mbp in

’90s. The Human Genome Project has successfully published the complete sequence of human

genome (3000 Mbp) in 2001.

Biochemical methods to sequence, recombine and engineer the DNA sequences have been

developed by the 90s. Besides, new modeling and analysis provides new dimensions in the

datasets. Thus, the growth of the datasets obviates the necessity of various forms of computa-

tion in bioinformatics from the beginning of this decade.

1.2 Computations in Bioinformatics

Computations required in bioinformatics varies largely depending on the objective of the ap-

plication. Almost all the developed branches of computer science have strong applications in

bioinformatics. Some examples of the computational tools that are used in bioinformatics are

given below.

Graphs are used to represent relationships among species on different physical and micro-

biological criteria. For example, the evolutionary relationships among the existing species are

expressed in a tree structure called phylogenetic tree. Graphs are also used in problems to

analyze biological data.

Numerical simulations of biological systems are used to model systems that are very difficult

to be modeled by analytical methods and deterministic operations. For example, the genetic

regulatory networks can be modeled by stochastic process. Similarly, host-parasite system,

Chapter 1. Introduction 4

ecosystem etc are well studied through numerical simulation.

Machine learning has many applications in bioinformatics. Generally biological data are

huge in quantity but with no established theory. For such data, learning theories provide

methods to gain an insight into the underlying theory of the origin of these data. Besides,

statistical analysis can be used in population oriented biology like epidemic controlling, drug

designing, etc.

Data mining and advanced database technology are one of the main part of biological

information analysis and preservation. The huge amount of data requires efficient processing

to maximize their use in research and educational purpose.

1.3 Scope of This Thesis

There are varieties of applications of graph theory in modeling, representing, analyzing and

comparing biological data. Our study covers two major areas concerning the data generation

through biological experiments and analyzing represented data.

The first area we covered is computational problems of constructing haplotypes. Haplotypes

are sequence of nucleotides that are positioned into some fixed sites in the DNA molecule called

the SNP sites. Experimental readout can not completely determine haplotypes due to the lack

of precision and correctness in constructing haplotypes, that’s why methods are required to

build the haplotypes in an optimum way so that the reliability of the haplotypes maximizes. In

modeling such optimization problems graphs are extensively used [CIKT05, BVDL03, BILR05].

We studied several models and in this thesis, we have compiled the problems in a well sorted

fashion. Besides, we provide an heuristic algorithm to one of the optimization problems of

correcting minimum errors in a set of nucleotide sequences.

The second area we covered is the analysis of relationships among species through phy-

logeny. Phylogeny is also a well developed branch of bioinformatics. Construction, assembling,

visualizing and sampling of phylogenetic trees are major applications of graph in this branch

[ACJ03, KMP03, MM81]. In this thesis, we discuss the majority rule method of assembling

1.4. Summary 5

phylogenetic tree from many small trees. We also discuss a way of arranging haplotypes in

phylogenetic order which justifies the relation between the major two areas of our study.

1.4 Summary

In this chapter highlighted bioinformatics from the biological and computational viewpoints.

We also placed a short history of bioinformatics to demonstrate the rapid growth of biological

data throughout the world. The final section describes the scope of our thesis.

Chapter 2

Preliminaries

Basic concepts and terminologies used in this thesis have come from two distinct and different

fields of natural and applied science. Hence, it is worth reviewing and initiating these concepts

in a brief manner. The first section is about the typical structure of a cell down to molecular

level including major biological processes. Generation, assembly, analysis and applications of

biological data are discussed in the later sections. The final section contains the definitions and

terminologies of graph theory that are required in this thesis.

2.1 Cell-The Unit of Life

Cells are the structural and functional unit of all living organisms. Cells range in size from one

millimeter down to one micrometer. Cells are highly organized and complicated assembly of

large polymeric molecules with specific compartments called organelles. The most important

organelle is the nucleus, which houses most of the cellular DNA-the hereditary material of

living organisms. Cells are of two types - prokaryotic cells (e.g. bacteria, amoebae), which lack

a defined nucleus and with a simplified internal organization, and eukaryotic cells (body cells

of human, animals), which have a more complicated organization including a defined nucleus.

In addition to the nucleus, there are several other organelles in typical eukaryotic cells:

the mitochondria, where the cell’s energy metabolism is carried out; the rough and smooth

6

2.2. Genome Organization 7

endoplasmic reticula, a membranous network in which glycoproteins and lipids are synthesized;

Golgi vesicles, which transfers membrane constituents to appropriate places in the cell; and

peroxisome, in which fatty acids and amino acids undergo degradation. Animal cells, but not

plant cells, contain lysosomes, which degrade unnecessary materials taken in by the cell. Plant

cells have chloroplasts, where photosynthesis takes place.

2.2 Genome Organization

2.2.1 Genome Organization in Eukaryotic Cell

DNA (deoxyribonucleic acid) is the master molecule of the cell that stores genetic information

and transfers genetically determined characteristics from one generation to the next. The three-

dimensional structure of DNA, consists of two long helical strands coiled around a common axis

forming a double helix. Each strand of DNA is composed of continually varying sequence of just

four different types of monomers(Adenine, Cytosine,Guanine and Thymine) called nucleotides.

Genes are simply portion of nucleotide sequence residing in the DNA molecule that codes

polypeptides or proteins- the main constituents of cells. DNA also contains instructions in form

of nucleotide sequence to direct when, which proteins are to be made and in what quantities.

The DNA in the nucleus of a eukaryotic cell is organized among 1 to more than 50 long

linear, compact structures. They are called chromosomes. All cells of an organism contain

chromosomes of same size and number but they vary among different types of organisms. Each

chromosome comprises of a single DNA molecule associated various DNA binding proteins.

The total DNA content in the chromosomes of an organism is referred as its genome. The

organization schematic is shown in Fig. 2.1 [www.mtsinai.on.ca/pdmg/images/chromosome.jpg]

2.2.2 Genome Organization in Prokaryotic Cell

In all prokaryotic cells, most of or all the genetic information resides in a single circular DNA

molecule that folds back on itself many times. Usually it is one millimeter in length and stays

Chapter 2. Preliminaries 8

Figure 2.1: Organization of DNA molecule in a cell

2.3. Genome Expression 9

in the central region of the cell. The genomic DNA molecule in prokaryotes is also associated

with proteins and often is referred to as a chromosome. But the organization of DNA within a

bacterial chromosome differs greatly from that within the chromosomes of eukaryotic cells.

2.3 Genome Expression

DNA encodes all of the RNA(ribonucleic acid) and protein molecules of the cells of an organ-

ism. Proteins are the most abundant and functionally versatile of the cellular macromolecules.

Proteins are polymers formed from only 20 different monomers, the amino acids. Many pro-

teins within cells are enzymes, which accelerate (catalyze) biochemical reactions. Proteins also

direct their own synthesis and that of other macromolecules, maintain internal cell rigidity, and

transport small molecules and ions across membranes,.

Protein synthesis from genes does not occur directly. RNA acts as an intermediary molecule.

Firstly, a portion of DNA sequence of the large DNA molecule in a chromosome is copied into

RNA. The process is called transcription. These RNA copies of segments of the DNA works

as templates to direct the synthesis of the protein. This process is called translation as genetic

information stored in the form of nucleotide sequence is decoded in the form of amino acid

sequence in proteins. DNA can undergo replication (synthesis of new DNA) also. Therefore

genetic information in cells flows from DNA to RNA to protein. This fundamental principle

genome expression is termed as the central dogma of molecular biology. Genome expression is

under fine regulation at various levels.

2.4 Generation of Genomic Data

Bioinformatics primarily deals with a huge range of genomic data gathered from different bi-

ological experiment. Computational biology studies the data and derives knowledge from it.

Thus generation of genomic data is of prime importance. This section briefly discuss about few

effective ways for generation of genomic data.


2.4.1 DNA Sequencing

DNA sequencing is the laboratory technique by which chemical code of the genome is deci-

phered. DNA sequencing determines the exact order of chemical building blocks, nucleotides

(abbreviated A, T, C, and G) that make up the DNA. First, chromosomes are broken into much

shorter pieces. After sequencing of the short sequences (in blocks of about 500 bases each, called

the read length) they are assembled into long continuous stretches that are analyzed for errors,

gene-coding regions, and other characteristics.

The DNA sequence acts as a blueprint, determining what species of organism is produced,

and within a species, the DNA sequence of each individual is unique. The DNA sequence that

makes up a genome may include coding DNA (gene) and also non-coding DNA. Non-coding

DNA does not encode a protein but may regulate where and when genes and proteins are active.

Advances in molecular biology, genomics, and robotics have resulted in automatic sequencing

of DNA. Today, the DNA sequence of an entire microbial genome can be determined in just

a few weeks rather than several years as in the past. The resulting DNA sequence maps are

being used by 21st century scientists to explore biology phenomena.

2.4.2 Haplotype Reconstruction and Inference

A haplotype, is simply the genetic constitution of an individual chromosome. It also refers

to a set of single nucleotide polymorphisms (SNPs) found to be statistically associated on a

single chromatid (one-half of a replicated chromosome). Such information is most valuable to

investigate the genetics behind common diseases. Haplotypes may be used to compare different

populations. Haplotype diversity refers to the uniqueness of a particular haplotype in a given

population. Haplogroups are large groups of haplotypes that define genetic populations and

are often geographically oriented.

2.5. Genomic Data Analysis and Applications 11

2.5 Genomic Data Analysis and Applications

Bioinformatics creates the tools to store, manage, analyze, compare genomic data. Vast

amounts of sequence are now stored in organized computer databases. The genome sequence

has been interpreted using computational tools combined with biological knowledge. Computer

software associated with the database is being used for easier data retrieval and data analysis

process. Sequence analysis tools can also translate the DNA sequence into protein sequence and

can provide information on the predicted physical properties of the protein such as molecular

weight. Sequence comparisons also can be used to categorize groups of related gene or sequences

into families. Sequences in the same family suggest that the genes or proteins perform similar

functions. Another use for sequence comparisons is studying the relatedness and evolution of

different genes or organisms.

Here are some research areas where important relationships and predictions are being gen-

erated by genomic data analysis with the help of bioinformatics tools:

• Gene number, exact locations, and functions

• Gene regulation

• DNA sequence organization

• Chromosomal structure and organization

• Non-coding DNA types, amount, distribution, information content, and functions

• Coordination of gene expression, protein synthesis, and post-translational events

• Predicted vs experimentally determined gene function

• Evolutionary conservation among organisms

• Correlation of SNPs (single-base DNA variations among individuals) with health and

disease

• Disease-susceptibility prediction based on gene sequence variation

• Genes involved in complex traits and multi-gene diseases

Some uses of genomic data are discussed in this section which will be elaborated throughout

this thesis.


2.5.1 Phylogeny

Phylogeny is the description of biological relationships based on classification according to

similarity of one or more sets of characters or on a model of evolutionary processes. Phylogenetic

relationship based different characters are consistent and support one another. Phylogeny is

usually expressed as trees called phylogenetic trees.

The goals of phylogeny is to work out the relationships among species, populations, indi-

viduals or genes(generally referred as taxa). Relationship is taken as assignment of a scheme of

descendants of a common ancestor.The results are usually presented as an evolutionary tree.

A phylogenetic tree is a two dimensional graph composed of nodes representing the taxa

and branches representing the relationships among the taxa. A tree is a connected graph in

which there is exactly one path (consecutive set of edges beginning at one point and ending at

the other) between every two points. A particular node may be selected as a root. Abstract

trees may be rooted or unrooted. Unrooted tree displays the topology of relationship. But it

does not show the pattern of the descent. Rooted trees are directed graphs in which each edge

is a one-way street and the ancestor-descendant relationship implies the direction of each edge.

If every node of rooted trees has two descendants, they are called Binary trees. Numbers are

often assigned to the edges of a graph to signify a distance between the nodes connected by

the edges. Thus the sizes of the edges can be drawn proportional to the assigned lengths. The

length of a path through the graph is the sum of the edge lengths.

In phylogenetic trees edge length signify either some measure of the dissimilarity between

two species or the period since their separation. There are two approaches for derivation of

phylogenetic trees

• Phenetic approach proceeds by measuring a set of distances between species to generate

a tree by hierarchical clustering procedure.

• Cladistic approach considers possible pathways of evolution, inferring the characteristics

of the ancestor at each node.

There are two types of data used for building phylogenetic trees:

• Distance-based: A matrix of distances between the species is used as input(e.g., the

2.5. Genomic Data Analysis and Applications 13

alignment score between them or the fraction of residues they agree on).

• Character-based: Each character (e.g., a base in a specific position in the DNA) is exam-

ined separately.

2.5.2 Study of Single Nucleotide Polymorphism

Single nucleotide polymorphisms or SNPs (pronounced ”snips”) are DNA sequence variations

of single nucleotide (A,T,C,or G) in the genome sequence. For example, a change in the DNA

sequence AATTAC to ATTTAC is a SNP. When a variation occurs in at least 1% of the

population, it is considered as an SNP. SNPs are found in both coding (gene) and non-coding

regions of the genome. Most SNPs occur outside of “coding sequences”. SNPs found within a

coding sequence are of particular interest to researchers because they are more likely to alter

the biological function of a protein.

SNPs, make up about 90% of all human genome sequence variation. Two of every three

SNPs are due to the replacement of cytosine (C) with thymine (T). These variations in DNA

sequence are believed to be associated with humans respond to disease; environmental insults

such as bacteria, viruses, chemicals; and therapies. Thus SNPs has become of great value for

biomedical research and for developing pharmaceutical products or medical diagnostics. SNPs

are easier to follow in population studies because they are evolutionarily stable - not changing

much from generation to generation.

2.5.3 SNPs and Disease Diagnosis

Each person has a unique SNP pattern. In most cases, SNPs do not cause disease, they only

serve as biological markers for pinpointing a disease on the human genome map, because they

are usually located near a gene found to be associated with a certain disease. Thus SNPs

help to determine a person’s genetic predisposition to a particular disease based on genes and

hereditary factors. Researchers may also identify relevant genes associated with a disease by

studying stretches of DNA that have been found to harbor a SNP associated with a disease


trait. Thus SNPs will also allow researchers a better evaluation about the impact of non-genetic

factors like behavior, diet, lifestyle, and physical activities diseases.

To create a genetic test to screen a disease in which the disease-causing gene has already

been identified, scientists may compare SNP patterns of a group of individuals affected by the

disease with that of individuals unaffected by the disease. This association study can indicate

which pattern is most likely associated with the disease-causing gene. Then SNP profiles that

are characteristic of a variety of diseases can be established and a physician would easily screen

individuals for susceptibility to a disease just by analyzing their DNA samples for specific SNP

patterns. SNP maps may help them identify the multiple genes associated with such complex

diseases as cancer, diabetes and mental disorder, etc.

2.5.4 SNPs and Personalized Drug Prediction

A treatment proved effective in one patient for a particular disease may be found ineffective

in others. SNPs are also believed to be useful in helping to determine and understand why

individuals differ in their abilities to absorb or clear certain drugs, as well as to determine why

an individual may experience an adverse side effect to a particular drug while other may not

have any. The most appropriate drug for an individual could be determined by analyzing a

patient’s SNP pattern. When a medicine works well for a group of people, researchers tries

to find out the DNA markers (SNPs) that are alike for these people. Scientists could identify

which medicines are best for any one person by using these markers. For example, when a

person is in need of medicine, doctors will compare the person’s SNP pattern with several SNP

patterns of different groups of individuals having same disease and will prescribe individualized

therapies specific to a patient’s needs. The prediction of the appropriate treatment to the right

person is referred to as “personalized drug prediction”. Personalized medicine would allow

pharmaceutical companies to bring many more drugs to market and allow doctors to suggest a

drug that will be most effective for that individual patient.

Therefore, the recent discovery of SNPs are on the way of a revolution in the process of

disease detection and the practice of preventative and curative medicine

2.6. Graph Terminologies 15

2.6 Graph Terminologies

The focus of this thesis is to study the use of graphs in bioinformatics. The necessary graph

terminologies are very briefly described here. Interested readers are referred to detailed texts

of the literature [Wes03,Die05].

A graph is a tuple (V, E), which consists of a finite set of vertices V and a finite set of edges

E. Each edge is an ordered or unordered pair of distinct vertices. The set of vertices are denoted

by V (G) and the set of edges are denoted by E(G). A graph may look like the Fig. 2.2(i) where

each vertex in V(G) = {v1, v2, v3, v4, v5} is drawn by a small black circle and each edge in

E(G) = {e1, e2, e3, e4, e5} is drawn by a line connecting a pair of vertices. Depending upon

the directional property of the edges a graph can be one of two classes: Directed Graph and

Undirected Graph. A graph where the edges are simple undirected line between two vertices

is an undirected graph. A graph where the edges are directed line from a source vertex to a

destination vertex is called directed graph.

vv

v

v

v

1

3

2

4

5

ee

e

ee

1

2

3

45

vv

v

v

v

1

3

2

4

5

ee

e

ee

1

2

3

45

(i) (ii)

Figure 2.2: Directed and undirected graphs

In this thesis an edge e of an undirected graph is represented by (u, v) where u and v are the

vertices which are connected by e and an edge of a directed graph is represented as < u, v >

where u is the source vertex and v is the destination vertex. From here on, we will discuss

every terminology for undirected graph unless otherwise specified.

Let G be a graph. A graph H is called a subgraph of G if every edge in H is an edge in


G and every vertex of H is a vertex of G. It is to be noted that the definition is a one way

implication. A v0 − vl walk, v0; e1; v1; . . . ; vl−1; el; vl, in G is an alternating sequence of vertices

and edges of G, beginning and ending with a vertex, in which from each vertex one can move

to the following vertex through their intermediate edge. If the vertices v0, v1, . . . , vl are distinct

(except possibly v0 and vl), then the walk is called a path and usually denoted either by the

sequence of vertices v0, v1, . . . , vl or by the sequence of edges e1, e2, . . . , el. A path with l edges

has length l. A path or walk is closed if v0 = vl. A closed path containing at least one edge is

called a cycle. A graph G is connected if for every pair u, v of distinct vertices there exists a

path between them on G. Number of incident edges in a vertex v is called the degree of v in

an undirected graph. It is denoted by d(v). For directed graph there are two types of degree

for each vertex. The in-degree denotes the number of incoming edges and the out-degree the

number of outgoing edges from a vertex. A tree is a connected graph without any cycle. Since

there is no cycle there exists exactly one path between any two vertices. The vertices of a tree

are usually called nodes. The vertex v of a tree with d(v) = 1 is called a leaf and with d(v) > 1

is called an internal node. Thus a leaf has exactly one edge connected to it. A rooted tree is

a tree that has a distinguished vertex called root. Since the tree is a connected graph, there

is a path from root to every other vertex in the tree. The vertex p that immediately precedes

vertex u in the path from root to u is called the parent of u and u is called the child of p. An

internal node may have more than one child but exactly one parent. A leaf has no child. If

u1, u2, u3, . . . , ul is a sequence of nodes in a tree such that u1 is the parent of u2 which is the

parent of u3 and so on, then u1 is called the ancestor of ul and ul is called the descendant of

u1. The root is the ancestor of all nodes and it is descendant of none. The height of a tree is

the length of the longest path from root to a leaf.

2.7 Summary

This chapter is written as a suitable starting point for the readers who lack necessary biological

background to read the rest of the thesis. The very basics of bioinformatics is actually deeply

2.7. Summary 17

rooted in the concepts and discoveries of biologists. That’s why, as a student of computer

science, to study of bioinformatics is a difficult job and requires much time and effort to grasp

the ideas inherent in this beautiful research area.

Chapter 3

Haplotyping

One of the biggest achievements in the quest for mystery of life is the complete sequencing of

human genome. The Human Genome Project successfully discovered the fact that humans are

almost 99% identical at the DNA level. Therefore what makes us different is very small region

of the whole genome. The smallest possible variation in DNA sequence is the Single Nucleotide

Polymorphism abbreviated as SNP and pronounced as “snip”. It is also the prominent kind of

variation in humans. Let us describe SNP more elaborately because haplotyping is the process

of locating and determining SNPs.

An SNP is a specific nucleotide, placed inside a DNA molecule which is otherwise identical

for all of us, whose value varies, in a statistically significant way, within a population. For a

chromosome all the sites where SNP occurs have been well identified in the human genome

project. The base at an SNP site is called allele. The possible variations in a particular SNP

site over the entire population are most of the times between two alleles. Such SNP is called

bi-allelic. It is still to be explained the reason why bi-allelic SNPs are prominent than multi-

allelic SNPs which has three or more different bases. Snips are the most extensive research

topic in recent years for the computational biologists.

18

3.1. Haplotype 19

3.1 Haplotype

Several computational problems related to SNPs have been devised in few years. One of the

popular problems is Haplotyping which deals with generating haplotypes from genomic se-

quence data. Diploid organisms are organized in pairs of chromosomes. A diploid organism

has one copy of a specific chromosome inherited from father and another copy of that same

chromosome inherited from mother. For each SNP site one can be homozygous (same allele

on both chromosomes) or heterozygous (different alleles). The values of a set of SNPs on a

particular chromosome copy define a haplotype. An example is shown in Table. 3.1. Two copies

of the same chromosome of an individual are aligned and the SNP sites are shown by capital

letters. The individual is heterozygous at SNPs 1 and 3 and homozygous at SNPs 2 and 4.

The haplotypes are TGGT and AGCT.

paternal copy: atcatcTcaagtGgaattGctcTctaa

maternal copy: atcatcAcaagtGgaattCctcTctaa

Haplotype 1: T G G T

Haplotype 2: A G C T

Table 3.1: A chromosome and two haplotypes assembled from it

Two major category of problems related to haplotyping are

• Individual Haplotyping - The Haplotype Assembly Problem

• Population Haplotyping - The Haplotype Inference Problem

The following sections elaborate the idea of haplotyping problems in details.

3.2 Individual Haplotyping

Haplotyping an individual consists of determining a pair of haplotypes, one for each copy of

a given chromosome. This pair of haplotypes completely define the SNP fingerprints of an

Chapter 3. Haplotyping 20

individual for a specific chromosome. Given a sequence of bases of a specific chromosome, we

just need to check all the SNP sites to generate the haplotypes. But, the situation is a bit

complicated for generating haplotypes from sequencing data. Sequencing data for a genome

does not contain the total sequence of bases for a specific chromosome, rather it provides

sequences of a set of fragments of the whole genome. Hence the actual problem of individual

haplotyping is to find two haplotypes from the set of overlapping fragments of the both copies

of a chromosome where fragments may have error and to which copy of the chromosome a

fragment belongs is not determined. The very general formulation of the problem is given

below [BILR05].

“Given a set of fragments obtained by DNA sequencing from the two copies of a chromosome,

find the smallest amount of data to remove so that there exist two haplotypes compatible with

all the data remaining.”

Before describing some of the optimization problems of individual haplotyping, the mathe-

matical terminology is presented below.

3.2.1 Terminology

Let, S be the set of n bi-allelic SNP sites over which the haplotypes will be constructed. Let, F

be the set of m fragments of the two chromosomes where each fragment contains information

for nonzero number of SNPs in S. Without loss of generality, let the two alleles for each SNP

are 0 and 1 which are two different elements of {A, T, G, C}. Thus each fragment f ∈ F is a

string of symbols {0, 1,−} of length n where ‘−’ denotes an undetermined or unknown SNP

named as hole. All the fragments can be arranged in an m × n matrix M [f, s] named as SNP

matrix . An example of an SNP matrix is given in Table. 3.2.

The Consecutive sequence of ‘−’ that lies between two non-hole symbols is called a gap. A

gapless SNP matrix is the one that has no gap in any of the fragments. In the example, the

first, second and third rows have no gaps while the fourth and sixth rows both have one gap.

Two fragments f and g are said to have conflict, if there exist an SNP position s where two

fragments disagree, i.e. M [f, s] = 0 and M [g, s] = 1 or vice versa. If two fragments do not have

3.2. Individual Haplotyping 21

- - - -1101- - - - - - - - - - - -

- - - - -0001110101- - - - -

11010010011- - - - - - - - -

- - -10100- - -010- - - - - -

- - - - - - - - -10110101011

010111- - - - - - - - -01011

Table 3.2: An SNP matrix.

conflict on any SNP, they are said to agree. Some pairs of fragments are given in Table. 3.3 to

illustrate the idea of conflict and agree.

relation pair of fragments

conflict 0111010101100

- - - -010001- --

agree 0111010101100

- - - -010101- --

agree 011101- - - - - --

- - - - - - -110001

conflict 010001- - - - - --

- - - - -00001- --

Table 3.3: Different relations between fragments

An SNP matrix is error-free if it can be partitioned into two classes of non-conflicting

fragments. For a specific SNP matrix there may be more than one such partitions. From

each non-conflicting partition of fragments, a haplotype can be constructed by just taking the

common allele of the non-conflicting fragments for a particular SNP site. Let, partition of rows

in M are Ml and Mr where rows in each set are pairwise non-conflicting. From these two


partitions, the construction of haplotypes Hl and Hr by combining the rows can be described

as

Hij =

1 if Mij = 1 for at least one row

0 if Mij = 0 for at least one row

− if Mij = − for all fragments

(3.1)

where i ∈ {l, r} and j = 1, 2, . . . , n. Although actual haplotypes can not have any hole,

haplotype assembly problems can introduce holes in the haplotypes if there is no allelic infor-

mation in the fragments of the partition. Thus the general problem can now be redefined as to

finding an error-free matrix from a given SNP matrix through optimal changes. The following

sections deal with different approaches of finding error-free matrix and finally the comparison

of the complexities of these approaches is presented.

3.2.2 Minimum Fragment Removal : MFR

The fragment data generated from DNA sequencing contains not only overlapping fragments

but also multiple fragments from the same region of the DNA. That’s why there are many

redundant fragments and optimal removing of redundant fragments to make the matrix error-

free is the most reviewed SNP problem.

Problem : Minimum Fragment Removal – MFR

Input : An SNP matrix M .

Output : Find the smallest set of fragments (rows) whose removal makes M error-free.

MFR has been modeled using a graph so that its tractability can be found. Let GF (F, EF )

is a fragment conflict graph , where there is an edge between two fragments if they conflict on

any of their common SNPs. Therefore, to find out two classes of non-conflicting fragments is

equivalent to find out a Bipartite graph from G by removing minimum number of vertices, as

illustrated in Fig. 3.1. Here Fig. 3.1(a) is an SNP matrix, Fig. 3.1(b) is its fragment conflict

graph and Fig. 3.1(c) is the bipartite graph resulted after removing the fragment E.


101 0111 0001

101 00

ABCDE

1 00011000 1

(a)

A

C

E

D

B

(b) (c)

A

D

B

C

Figure 3.1: Minimum fragment removal using fragment conflict graph

Depending upon the presence of gaps, solution to MFR varies in its time complexity.

For gapless SNP matrix, MFR is polynomial time solvable [LBIS01]. Dynamic programming

method is a good approach to implement MFR for gapless SNP matrix [BILR05]. MFR problem

with gapped SNP matrix is NP-hard [BILR05].

Variation

MFR deletes a fragment if necessary. Whenever there are more than one fragments, any of

whose deletion can make the matrix error-free, MFR does not chose wisely. Therefore, a

variation of MFR is readily available and described below [CIKT05].

The length of a haplotype Hi, i ∈ {l, r} is the number of non-hole SNP sites. Let, the sum

of the lengths of the two haplotypes reconstructed from an error-free matrix M is L(M). In

this problem, the matrix M is made error-free by removing fragments in such a way that the

haplotypes constructed from the matrix achieve maximum length. Thus maximizing L(M) as

well as removing fragments is the goal of the new version.

Problem : Longest Haplotype Reconstruction – LHR


Output : Find the smallest set of fragments (rows) whose removal makes M error-free with

maximum L(M)

As for MFR, LHR is also polynomial time solvable for gapless cases and NP-Hard for

gapped versions. The complexity status of LHR for “at most k-holes” is still open. Here


also, dynamic programing algorithm could be an approach to implement LHR for gapless SNP

matrix [CIKT05].

3.2.3 Minimum SNP Removal : MSR

If a fixed sequencing technique is applied to all the fragments then it is likely that errors occur

in the same SNPs for all the fragments. That’s why, MSR makes the SNP matrix error-free

by removing columns from the matrix. This problem is similar to MFR from algorithmic

viewpoint.

Problem : Minimum SNP Removal – MSR


Output : Find the smallest set of SNPs (columns) whose removal makes M error-free.

Similar to the conflict between fragments, conflict between two SNPs is defined. Two SNPs

s and t are said to be conflicting if they have both 0 and 1 in their corresponding columns and

there exists two fragments f and g, where M [f, s], M [f, t], M [g, s] and M [g, t] are non-hole

symbols in which exactly one is different from the other three. Let GS(S, ES) is a SNP conflict

graph , where there is an edge between two SNPs if they conflict. For an error free SNP matrix

the GS is an independent set i.e. ES = φ [LBIS01]. Hence, to make an SNP matrix error

free by minimum SNP removal is equivalent to finding the maximum independent set of GS

and removing all vertices (SNPs) s ∈ S that are not in the maximum independent set of GS.

Here Fig. 3.2(a) is an SNP matrix, Fig. 3.2(b) is its SNP conflict graph and Fig. 3.2(c) is the

maximum independent set resulted after removing the SNPs 3, 5 and 6 from S.

Depending upon the presence of gaps, MSR also varies in its time complexity. For a gapless

SNP matrix the GS is a perfect graph with no chordless cycle of length > 4 and for perfect

graphs the maximum independent set problem is polynomial time solvable. Hence, for gapless

SNP matrix, MSR is polynomial time solvable. A class of gapped SNP matrix can also be

solved in polynomial time. This class of matrices is called C1P matrices [LBIS01].


1 2 3 4 5 6

101 110

101001100010000 001

000 01

(a)3

2

1 6

5

4(c)

2 4

1

−

(b)

Figure 3.2: Minimum SNP removal using SNP conflict graph

3.2.4 Minimum Error Correction : MEC

With the advent of sophisticated sequencing methods, errors in SNP matrices are getting

smaller. As a result, correcting these errors rather than removing a whole row or column

is more preferable approach. Erroneous SNP values can be corrected by just flipping it to the

other allele. Obviously it would have not been possible if it were a multi-allele (more than two

possible symbols in an SNP) SNP matrix. A good example of saving information by correcting

the errors would be for the SNP matrix in Fig. 3.1. In that matrix fragment E has conflict with

fragment B just for the last SNP that is common to them. Thus flipping that SNP of fragment

E to 0 would result into an error-free matrix having the partition {A,C} and {B,D,E}. MEC

retains the information of the other four alleles of fragment E, while MFR would delete all the

five alleles from the matrix.

Problem : Minimum Error Correction – MEC


Output : The smallest set of SNP alleles whose flipping makes M error-free.

MEC problem for gapless SNP matrix is NP-hard [CIKT05]. Hence, MEC is more difficult

than the gapless MFR or MSR which have polynomial time algorithm. For 1-gap case MEC is

proved to be APX-Hard. It has been showed that haplotype assembly problem has reasonable

input size for practical exact algorithms [Huf05]. An exact algorithm based on branch and

bound technique is available. It searches all possible pairs of haplotypes to find the solution

[WWLZ05]. Now, we present a heuristic algorithm to find the solution of a minimum error


correction problem.

3.2.5 A Heuristic Algorithm for MEC Problem

Terminologies and Definitions

Before describing the algorithm, let us redefine the terminologies. Let, M = {Mij} is an SNP

matrix of dimension m × k where Mij ∈ {0, 1,−}. There are m fragments; each of which has

either a fixed allele (i.e. {0, 1}) or a gap (i.e. ‘−’) in each of its k SNP sites.

Conceptually, the matrix M = {M1, M2, . . . , Mm} is a set of m fragments where a fragment

Mi = {Mi1, Mi2, . . . , Mik} is an array of k alleles or holes (see Fig. 3.3(a)). A fragment Mi is

called to cover the jth SNP if Mij ∈ {0, 1} and called to skip the jth SNP if Mij =‘−’. Let,

Ms and Mt are two fragments. The distance between two fragments, D(Ms, Mt) is defined as

the number of SNPs that are covered by both of the fragments and have different alleles. For

example in Fig. 3.3(a) the D(M3, M2) = 4.

D(Ms, Mt) =k

∑

j=1

d(Msj, Mtj) (3.2)

where d(x, y) is defined as

d(x, y) =

1 if x 6= − and y 6= − and x 6= y

0 Otherwise(3.3)

Two fragments Ms and Mt are said to be conflicting if D(Ms, Mt) > 0. Now, P (C1, C2)

is a partition of M , where C1 and C2 are two collections (sets) of fragments taken from M so

that C1

⋃

C2 = M and C1

⋂

C2 = φ. In Fig. 3.3(b) a partition is shown. Therefore, M will be

an error-free matrix if and only if there exists a partition P (C1, C2) of M such that for any two

fragments x, y ∈ Ci, i ∈ {1, 2}, x and y are non-conflicting, in other words D(x, y) = 0 for all x

and y belonging to the same collection Ci. Such a partition is called a error-free partition. The

partition in the figure is not error free because D(M1, M2) > 0 and D(M5, M6) > 0 insert error


6

5

4

3

2

1 01000

1000001

11001

010010

1000001

00111000

(a)

H2 = 0100000001

H1 = 0100000000

(c)

3

12

C1 C2

456

P

(b)

M

Figure 3.3: SNP matrix and its partition

in C1 and C2 respectively. Construction of the two haplotypes from an error free partition is

similar to that of LHR. A haplotype Hi, i ∈ {1, 2} is constructed by taking common allele from

the component fragments for both the parts Ci, i ∈ {1, 2}. The mathematical expression that

describes such construction is

Hij =

1 if at least one fragment in Ci has a 1 in jth SNP

0 if at least one fragment in Ci has a 0 in jth SNP

− if all the fragments in Ci skips jth SNP

(3.4)

If the matrix M is not error-free, there will be no error-free partition P as well and there

will be conflicts between fragments in any or both of the collections of all possible partitions.

Thus, for such M , there is no partition which we can use to construct two haplotypes by just

taking the common allele for a particular.

So far we described how an SNP matrix becomes erroneous, now the question is to correct

an erroneous SNP matrix to make it error-free. If we are given a partition P (C1, C2) of an

erroneous matrix M and the two actual haplotypes H1 and H2, the number of errors E(P ) that

must be corrected are just the sum of the distances of all the fragments from their corresponding

haplotypes. More mathematical expression is given below.

E(P ) =2

∑

i=1

∑

f∈Ci

D(f, Hi) (3.5)


Now, the MEC problem reduces to finding a partition P that minimizes the error function

E(P ).

Algorithm - HMEC

Now, to minimize the E(P ), we need to search all possible partitions of a matrix M . This

would certainly require running time exponential to the number of fragments in M . But such

a search is not possible because we don’t have the real haplotypes to calculate E(P ). Hence,

we have to approximate E(P ) by constructing two haplotypes from the given partition P .

For best approximation we should construct haplotypes which are minimum conflicting with

the fragments of their corresponding collections. Therefore, for each SNP site, the haplotype

Hi should take the allele that is present in majority of the fragments in Ci. In case of ties, 0 is

favored because it is the more common than 1. In Fig. 3.3(c) the two haplotypes H1 and H2

are shown for the partition P in Fig. 3.3(b) which are constructed in this method. To define

this construction more mathematically, let N 0

j (Ci) is the number of fragments in a collection

Ci that have 0 in jth SNP. Similarly, N 1

j (Ci) defines the number of 1s. Thus to minimize the

number of errors E(P ) for a specific partition P , the haplotype should be constructed following

the rule

Hij =

1 if N1

j (Ci) > N0

j (Ci)

0 if N0

j (Ci) >= N1

j (Ci) and N0

j (Ci) 6= 0

− if N1

j (Ci) = N0

j (Ci) = 0

(3.6)

where i ∈ {1, 2} and j = 1, 2, . . . , k.

To find the best partition we will use a local search heuristic. The algorithm iteratively

searches a better partition with respect to the current one and chooses it to move to. Because

the algorithm searches from within a small set of partitions, the chosen partition may not

lead to the optimum solution and the algorithm has a chance to fall into the local optimum

solution. The algorithm is inspired from the famous Fiduccia and Mattheyses (FM) algorithm

for bipartitioning a hypergraph minimizing the cut size.


This algorithm starts with an arbitrary partition as for example P (M, φ) and iteratively

searches a better partition. In each iteration the algorithm performs a sequence of transfer of

distinct fragments from their present collection to the other one so that the partition becomes

less erroneous. It should be noted that, a fragment’s transfer of collection can both increase or

decrease the error function E(P ).

Let, the partition before transferring a fragment f is Pp and the partition resulted is Pn. We

define the gain of the transfer as Gain(f) = E(Pp)−E(Pn). Let, F =< fi >, i = {1, 2, . . . , m}

is an ordering of all the fragments in a partition P in such a way that fragment fi will precede

fragment fj if all the fragments preceding fi have been transferred to form an intermediate

partition Pi and Gain(fi) >= Gain(fj) over Pi. Thus the first intermediate partition P1 is the

current partition Pc of the ongoing iteration. We also define the cumulative gain of a fragment

ordering F up to its nth fragment as CGain(F, n) =∑n

j=0Gain(fj). Note that, all the gains

used to compute the cumulative gain are calculated over different intermediate partitions. The

maximum cumulative gain, MCGain(F ) is defined as

MCGain(F ) = max1 ≤ i ≤ m

CGain(F, i)

Now, in each iteration the algorithm finds the ordering Fc of current partition Pc and

transfers only those fragments of Fc that can achieve the MCGain(Fc). The fragment that is

the last to be transferred is referred as fmax. Thus, the algorithm iterates from one partition to

another reducing the error function. Since our algorithm is local search algorithm, it continues

whenever MCGain(Fc) > 0 and stops as soon as MCGain(Fc) ≤ 0.

Implementation and Complexity

There are several issues to discuss about the above described algorithms. We will discuss the

data structure for each such issue.

To find Fc in each iteration the algorithm repeatedly transfers the fragment that is not

transferred previously in this iteration and has maximum gain over all such fragments. To

accomplish this we use a locking mechanism. At the beginning of each iteration all the fragments


are set free. The free fragment with maximum gain is found out and tentatively transferred

to the other collection. After the transfer the fragment is locked at the new collection. This

tentative transfer creates the first intermediate partition P1. The algorithm then finds the next

free fragment with maximum gain in P1 and transfer and lock that fragment to create the P2.

Thus, free fragments are transferred until all the fragments are locked and the order of the

transfer (Fc) is stored in the log table along with the cumulative gains (CGain). MCGain is

the maximum CGain and fmax is the fragment corresponding to MCGain in the log table.

Although after finishing all such tentative transfers Pc has been changed to an undefined

partition, the algorithm checks the log to find the MCGain(Fc) and fmax and rollback the

transfer of all the fragments that were transferred after fmax. When the rollback completes the

Pc becomes ready for the next iteration.

H1 = 0100000000H2 = 0100000001

Pp

C1 C2

123

456

H1 = 01000001−−H2 = 0100101001

nP

C1 C2

1

3

456

2

E(P ) = 8p E(P ) = 6nE(P )p E(P )nGain(2) = = 2

Figure 3.4: An example calculation of Gain measure

While tentatively transferring a free fragment, the algorithm needs to find the fragment

with maximum gain among the free fragments (which are not yet transferred). This requires

calculating gains for each of them. To calculate the Gain(f) = E(Pp)−E(Pn) for a fragment we

need to calculate two error values of two different partitions; the present intermediate partition

and the next partition which will be resulted if f is transferred. Each of these error functions

requires calculation of two new haplotypes from their corresponding collections (see Fig. 3.4).

Although E(Pp) and the haplotypes of Pp can be found from the previous transfer, calculation

of E(Pn) requires construction of haplotypes of Pn. Since, the difference between Pp and Pn

is only one transfer, we can introduce differential calculation of haplotypes Hnj, j ∈ {1, 2} of


next partition from the haplotypes of Hpi, i ∈ {1, 2} of present partition. For this purpose, the

algorithm stores N 1

j (Cpi) and N0

j (Cpi) values of the present partition. After a transfer these

values will either be incremented or decremented by 1 or remain the same. Hence, it is now

possible to construct Hnj, j ∈ {1, 2} in O(k) time. To compute E(Pn) from the haplotypes

requires O(mk) time. Therefore, running time to compute the E(Pn) as well as to compute

Gain(f) is O(mk + k).

For each intermediate partition Pi, i = 1, . . . , n we need to compute Gain measures for

m − i unlocked fragments to find the maximum one. The transfer of this fragments require

updating of N1

j (Ci) and N0

j (Ci), i ∈ {1, 2} and j = 1, 2, . . . , k. So, it also needs O(k) time to

run. Finally, there will be m such transfer in each iteration. Thus each iteration will require

O(m(m(mk + k) + k)) ∼ O(m3k) running time.

Approximation to Improve Performance

For large SNP matrix O(m3k) running time is critical to the performance of the algorithm. We

can use an approximation in the calculation of the Gain(f) by using only the fragment f and

not using the m − 1 other fragments. The approximate gain should be

AppxGain(f) = D(Hpi , f) − D(Hn

j , f) (3.7)

where Hpi is the haplotype of f ’s present collection Ci of partition Pp and Hnj is the

haplotype of f ’s next collection Cj of partition Pn. This function ignores the effect of fragments

other than f on Gain(f) but reduces the run time of calculating gain to O(k). The total run

time of each iteration will be O(m2k)

A Simulated Example

In Fig. 3.5 we present an example iteration of HMEC. We consider that the current partition

Pc = P1 is the partition given in Fig. 3.3(b) for the matrix M . All the intermediate partitions

Pi, i ∈ {1, . . . , 7} are shown sequentially and the gains of each fragment over the intermediate


Algorithm HMEC(M)

1: Pc = P (M, φ)

2: FREE LOCKS()

3: CLEAR LOG()

4: repeat always

5: while there is an unlocked fragment in Pc do

begin

6: find a free fragment f so that Gain(f) is maximum

7: transfer f to the other collection

8: update the haplotypes after the transfer

9: LOCK(f)

10: LOG RECORD(f ,Gain(f))

end

11: FREE LOCKS()

12: check the log and find MCGain(Fc) and fmax

13: if MCGain(Fc) > 0

begin

14: set new Pc by rolling back the transfers

that occurred after the transfer of fmax

15: calculate haplotypes of Pc

16: CLEAR LOG()

17: continue the loop

end

18: else

begin

19: terminate the algorithm and output current haplotypes

end

end repeat

Table 3.4: The pseudocode for the HMEC algorithm


123

456

Gain(1)=0Gain(2)=2Gain(3)=2Gain(4)=1Gain(5)=1Gain(6)=1

1

3

2456

136

245

Gain(1)=−1Gain(3)=−2Gain(4)=−1Gain(5)=0

1

6

3 4

52 Gain(1)=−1

Gain(3)=−3

1

2

3

54

6

5

6

4

2

3

1

1

653

2

4

P1

P2

P3 P5

P6

P7

P4

Gain(1)=−1Gain(3)=−2Gain(4)=−2Gain(5)=1Gain(6)=2

Gain(1)=−2Gain(3)=−3Gain(4)=−1

Gain(3)=−2

locked fragment

Figure 3.5: An example iteration of HMEC


partitions are shown on the right of each partition. The free fragment with maximum gain is

marked in each intermediate partition. For example, the maximum gaining fragment on P2 is

fragment 6 with gain 2. After each transfer the transferred fragment is locked by a circle. Here,

the ordering Fc of the fragments is < 2, 6, 5, 1, 4, 3 > which is also the order of locking of the

fragments. In the log this ordering will be stored along with the CGains (see Fig. 3.6). All the

tentative transfers after fmax have to be rolled back so that the P2 becomes the next Pc.

MCGain

244320

CGain

265413

Fc

Log Table

Figure 3.6: An example log table

Performance Evaluation

We have implemented the algorithm and tested it thoroughly. We first chose two independent

set of haplotypes. One of them was based on real sequences of DCP1 genes that contains lot

of similarity among the haplotypes [RTCN99]. There were 7 pairs of haplotypes of 53 SNPs

in this set. The other one was generated sequences reflecting true randomness. There were 6

pairs of haplotypes of 90 SNPs in this set. Upon each of these sets of haplotypes, we ran a

simulated sequencing operation that generates a set of fragments (i.e. the SNP matrix) from

every pair of haplotypes within the set. The simulation program was controlled in several ways.

We varied the number of fragment in the matrix as well as the length (number of non-hole SNP

sites) of each fragment. We also varied the maximum amount of errors that was introduced

in the matrix. Over each of the SNP matrices we ran our algorithm and it performed very

well to reconstruct the haplotypes. The worst scenario, that was tested, was a matrix from

the generated haplotypes with only 40 fragments each having 60% holes in them and and 40%

errors in the alleles. Our algorithm constructed on the average 85% of the haplotypes. This


percentage is quite a good one considering the true randomness of the SNP matrix and obviously

better performing than a genetic algorithm for MEC [WWLZ05]. Besides the correctness, it

is lot more efficient in execution time. It took 2.46 sec to process all the 6 matrices of 40X90

dimension.

The most successful feature of this search algorithm is its Independence of gap. The position

of holes (i.e. gaps) in the SNP matrix will not create any difference to the HMEC algorithm.

That’s why the number of gaps was not a control parameter of the sequencing simulation.

An algorithmic comparison with the FM algorithm for bipartitioning with minimum cut

size reveals that our FM algorithm runs in O(m) time in updating the gain values in one pass

where our algorithm runs in O(mk) if no approximation is applied.

Variation

A new model of MEC has been proposed to incorporate the genotypic information of the

organism, which is more readily available, to increase the correctness of the solution [WWLZ05].

From algorithmic point of view there is another variation of MEC [CITK05]. The problem

is open at its complexity status.

Problem : Binary-Witness-MEC

Input : An SNP matrix M that does not contain any holes.

Output : For an input matrix M of size n × m, two haplotypes H1,H2 ∈ {0, 1}m minimizing.

D(H1, H2) =∑

rows r∈M

min(d(r, H1), d(r, H2))

3.2.6 Comparison of Different Individual Haplotyping Problems

We described four haplotype assembly problems in this section. Those are MFR, LHR, MSR

and MEC. All these problems are very extensively studied over the last few years and several

algorithms are established for each of the settings. Several papers are published regarding the

complexity status of the problems. Several new models derived from these four problems are

also coming by the last couple of years. The problems are themselves varied in nature because of


the presence of gap in the input SNP matrix. It should be noted here that a hole in a fragment is

not necessarily a gap. Gap is a sequence of one or more holes separated by nonempty sequence

of alleles. The HMEC that we proposed is significantly better than a Genetic algorithm. Other

than these two we did not find any approximation algorithm for the other problems. The most

recent compiled state of these four problems (as per our knowledge) is shown in the Table. 3.5

Problem Gap and hole type Time Reference

MFR UnGapped O(m2n + m3) [BILR05]

At most k holes O(22km2n + 23km3) [BILR05]

At most 1-gap NP-Hard [LBIS01]

MSR UnGapped O(mn2) [BILR05]

At most k holes O(2kmn2) [BILR05]

At most 1-gap NP-Hard [BILR05]

LHR UnGapped O(mn2 + n3) [CITK05]

At most k holes ?

At most 1-gap NP-Hard [CITK05]

MEC UnGapped NP-Hard [CITK05]

At most k holes NP-Hard [CITK05]

At most 1-gap NP-Hard [CITK05]

Table 3.5: Comparison among the SNP problems

3.3 Population Haplotyping

Sequencing an individual DNA and having all of its haplotypic information is not the major

concern of the researchers who deals with disease analysis, drug design, hybridization of crops

etc. They need the trend of SNPs in the haplotypes in many individuals of a particular diploid

organism. But, it is not possible for even the most powerful sequencer built by today to

3.3. Population Haplotyping 37

generate large scale haplotype data in short time and cost. The more economical method for

population haplotype mapping is to generate large scale “genotype” data and compute the

map from it. To be more precise, a genotype is not a fragment of a haplotype, rather it is a

combined information of the fragments of the haplotypes. A genotype can only say whether an

individual is homozygous or heterozygous in some SNP sites. Population haplotyping problems

takes on genotypes of a large number of individuals over a fixed number of SNP sites, and

infers the haplotypes of this population. Haplotype is a sequence of SNP values on a copy of

chromosome of an individual. A genotype vector, or simply genotype, represents two haplotypes

as a sequence of unordered pairs of alleles. Each pair represents the nucleotides in a given site.

The following Table. 3.6 shows the genotype for a chromosome.

paternal copy: atc T cat G gat G ctc T ctaa

maternal copy: atc A cat G gat C ctc T ctaa

Haplotype 1: T G G T

Haplotype 2: A G C T

Genotype: (T,A) (G,G) (C,G) (T,T)

Table 3.6: A chromosome and its genotype

Before describing some of the haplotype inference problems, the mathematical terminology

is described as usually.

3.3.1 Terminology

For bi-allelic SNPs, haplotypes can also be represented as a string of symbols {0, 1}. Thus a

genotype is a string of unordered pairs over the set {0, 1}. Whenever an SNP site is homozygous

the genotype pair will be either (0,0) or (1,1), while for the heterozygous site the pair is (0,1).

For example two haplotypes of length 3 are 0,1,1 and 1,0,1 which are combined into the genotype

(0,1),(0,1),(1,1). Since, only 3 distinct pairs are possible we can use a different alphabet {0, 1, ?}


to represent genotypes more precisely. For the homozygous pairs we will use 0 and 1 respectively

and for the heterozygous pair the ? will be used. Hence, for the previous example the genotype

should be ?,?,1.

Given a genotype g = {gi}, i = 1, 2, . . . , m, the resolution of g is pair h , k of haplotypes,

where h = {hi}, i = 1, 2, . . . , m and k = {ki}, i = 1, 2, . . . , m such that hi = ki = gi if g 6=?

and hi, ki ∈ {0, 1}, hi 6= ki if gi =?. We say that h, k resolves g. Given a genotype g and a

haplotype h, we say that h is compatible with g if hi = gi whenever g 6=?. If h is compatible

with g we can always find a haplotype h′ such that h and h′ resolves g. Here, h′ is called the

realization of g by h and denoted as R(g, h). It should be noted here that, h and k can be

compatible with g separately but they may not resolve g together. Given a g and h, R(g, h) is

a unique haplotype computed as

R(g, h) =

hi if gi 6=?, i = 1, 2, . . . , m

1 − hi if gi =?, i = 1, 2, . . . , m

Now, the general problem of haplotype inference is defined below [BVDL03].

“Given a set of genotypes G = {g1, g2, . . . , gn}, for each g ∈ G find a pair h, k of haplotypes

resolving g.”

Preserving this general settings, there are many optimization problems with different mod-

els. The following subsections will discuss several of such problems along with their algorithms.

3.3.2 Pure Parsimony Problem

The first optimization problem that comes in mind about this general problem is to minimize

the number of haplotypes. The pure parsimony problem states to do so.

Problem : Pure Parsimony Haplotyping Problem

Input : A set G of genotypes.

Output : Find the cardinality-smallest set of haplotypes H such that for each g ∈ G, there is

a pair of haplotypes h, h′ ∈ H that resolves g.

An algorithm introducing the maximum parsimony approach to haplotyping was described

earlier [Cla90]. That was actually an approximation algorithm using the rationale that ho-


mozygous site in one genotype should be common to the other genotype where the site is ?.

Later, Lancia et.al proved that the pure parsimony problem is NP-hard for genotype set that

has no genotype with more than 3 heterozygous sites. It has been proved that pure parsimony

problem can be solved in polynomial time (O(mnlog(n) + n3/2)) for a genotype set G of no

more than 2 heterozygous sites in a single genotype [CITK05]. It is yet to prove the hardness

of pure parsimony problem for a genotype set with more than 3 heterozygous sites.

The haplotype inference problems has another name called “genotype phasing”. Recent soft-

ware applications approach this problem for large population set and real time operations. An

entropy minimization algorithm can be used for practical implementation [PM06]. Algorithm

for a highly reliable and large scale implementation is also available [BZ06].

3.3.3 Maximum Resolution Problem – MR

With the general settings, many optimization problems have been devised by introducing rule

of inference over the genotype data. The inference rule is an operation that is performed

to expand a known set of haplotypes to make it consistent with a known set of genotypes

[BVLD03]. The definition is as follows

Definition 3.3.1 (Inference rule) Let G = {g1, g2, . . . , gm} be a set of genotypes and let H

be a nonempty set of haplotypes. The application of the inference rule to a haplotype h ∈ H

compatible with a genotype g ∈ G consists of adding R(g, h) to H and removing g from G.

The MR problem deals with the application of inference rule over given (G, H). It states

to find the sequence of inference that achieve maximum resolution of G by H.

Problem : Maximum Resolution – MR

Input : A set G of genotypes and a set H of haplotypes.

Output : A maximum cardinality subset G′ of G of genotypes that are removed from G by a

sequence of application of the inference rule starting from G and H.

An example of application of inference rule will give an insight into the problems. Let

G = {g1 = 00??0?, g2 = 00?1?1} and H = {h1 = 001000, h2 = 000001}. The optimal sequence


of application of inference rule in comparison to a non-optimal one is shown in Table. 3.7

Step Optimal Sequence Non Optimal Sequence

0 G = {g1, g2}, H = {h1, h2} G = {g1, g2}, H = {h1, h2}

G = {g1 = 00??0?, g2 = 00?1?1} and

H = {h1 = 001000, h2 = 000001}

G = {g1 = 00??0?, g2 = 00?1?1} and

H = {h1 = 001000, h2 = 000001}

1 h3 = R(g1, h1) = 000101 h5 = R(g1, h2) = 001100

G = {g2}, H = {h1, h2, h3} G = {g2}, H = {h1, h2, h5}

2 h4 = R(g2, h3) = 001111 No haplotype in H is compatible with g2

G = {}, H = {h1, h2, h3, h4} G = {g2}, H = {h1, h2, h5}

Table 3.7: Example of optimal application of haplotype inference rule

A formal framework to analyze and investigate the computational complexity of haplo-

type inference problems is presented by stating an optimization problem, whose corresponding

decision version is NP-hard [Gus01].

Variation

A variation of the MR problem named “single genotype resolution problem” is presented whose

computational complexity is yet to determine [BVLD05]. This open problem deals with a single

genotype to determine its existence in a population.

Problem : Single Genotype Resolution – SGR

Input : A set G of genotypes and a set H of haplotypes and a distinguished genotype g ∈ G.

Output : A sequence of applications of the inference rule that resolves a subset of G including

g.

3.3.4 Perfect Phylogeny Haplotyping Problem – PPH

Difference in SNPs in individuals of a particular population is not a frequent occurrence. A

white man and a black man surely have some differences in their SNPs. But, for a population


of white people, living in the same geographic region, having many similar traits, SNPs of indi-

viduals should not differ very much except some cases of atomic disaster like that of Hiroshima

and Nagasaki which can mutate many SNPs of reproductive cell and contaminate the whole

next generation of that individual.

A mathematical description of the above mentioned fact will clear the idea. Let, there are

n SNPs in the entire human race, so there might be 2n possible SNP configuration if all these

SNPs are bi-allelic. But actually only a few of these 2n configurations can be identified so far

in humans. All those configuration are very much different from each other and every single

configuration is common to some population. This changes in SNPs occur for natural mutations

and represents the evolution of humans through a large epoch. Thus haplotypes of a population

contains the evolutionary information and they could be arranged in some phylogenetic tree.

Thus the very general population haplotyping problem can degenerate into the phylogeny of

haplotypes. Let, us define a phylogeny of the haplotypes as haplotype perfect phylogeny.

Definition 3.3.2 (Haplotype Perfect Phylogeny) Let B be an n×m {0, 1} matrix where

each row in B is a binary haplotype and each column i is the n vector of the SNP sites of the m

haplotypes. A haplotype perfect phylogeny for B, in short hpp, is a rooted tree T with n leaves

such that the following properties hold:

• each leaf of the tree is labeled by a distinct haplotype from B, that is a distinct row of B.

• each internal edge of T can be labeled by at least a SNP site j changing from 0 to 1, while

each site labels at most one edge.

• for each haplotype leaf h, the unique path from the root of T to h specifies exactly all SNP

sites that are 1 in h.

An example of an hpp is given in the Fig. 3.7. It should be noted that the edges without

label are not the internal edges of the tree. Hence, the example completely agrees the definition.

Here, the matrix of haplotypes on the left is arranged in the hpp on the right. Since the

hpp intends the use of haplotypes as a matrix, we should define the resolution of a genotype


ABCD 000

0

0

1 0123456

001100 101

10100010

1010 00E

(a)

A

CB

ED

(b)

3

1 5

2

6

4

Figure 3.7: An example of haplotype perfect phylogeny (hpp)

matrix. A {0,1} matrix H of haplotypes resolves a {0,1,?} matrix G of genotypes if each row

of G is resolved by a pair of rows of H. The perfect phylogeny haplotyping problem is

Problem : Perfect Phylogeny Haplotyping – PPH

Input : an n × m matrix G over alphabet {0,1,?}.

Output : A matrix H which is a realization of matrix G and a haplotype perfect phylogeny

for H, or decide that such tree does not exist.

Now, the issue of existence of a solution is important in this problem. We need to know how

to check that a matrix H has no hpp. In the Fig. 3.8 a matrix that has no hpp is shown. The

trees in the figure are possible trials to represent the matrix in hpp, but both failed to represent

a haplotype. Such matrix, where three rows and two columns and the rows are (1,1),(0,1) and

(1,0) is called the forbidden matrix. It is proved that when a matrix has a sub matrix of any

two column and three rows equal to a forbidden matrix, the matrix has no hpp [BVLD05].

3.4 Summary

In this chapter, we described the haplotyping problem, one of the most important SNP related

problems. We described both types of haplotyping: individual and population. The individual

haplotyping determines the haplotype of an individual from the DNA sequence data of that

3.4. Summary 43

2h1h

3h 0 1

1 0

1 1

i j

3h2h1h

j

i

(a) (b) (c)

1h

i

j

Figure 3.8: Forbidden matrix for hpp

individual. There are several variation of this generic problem. The MFR, MSR and MEC

problem are described along with some graph based computational model and their variations.

The fragment conflict graph, SNP conflict graph are examples of such models. A local search

algorithm is given for the MEC problem. Population haplotyping determines the haplotypes

of a population from the genotypes of a number of individuals. The pure parsimony problem,

maximum resolution problem and perfect phylogeny haplotyping problem are described. For

each problem, illustrative example is used to elaborate the computation required to solve the

problem.

Chapter 4

Phylogenetic Tree

Phylogeny is the description of relations among species, population, individuals or genes. This

relationship can be of several kinds having the property of ancestor-descendant relationship.

For a quick example, we can describe a phylogeny of human languages of different civil societies

where “Hindi” and “Bangla” may have same ancestor named “Sanskrit” while “arabic” and

“Persian” may have same ancestor. Generally relations that can be expressed as descendant-

ancestor have a close tie with time. That’s why phylogeny is extensively used to research the

evolutionary history of biological elements like species, population, genes, etc (generally referred

as taxa). In this setting, the most reliable source of information that can be used to relate taxa

are cellular information like ribosomal RNAs, gene families, repetitive DNA sequences, protein

sequences, etc.

The best representation of a phylogeny should be nothing but tree. A tree organization

of taxas enables easier analysis and makes it more suitable for applying computer algorithms.

A tree representation of a phylogeny is called the Phylogenetic Tree. An example of a

phylogenetic tree constructed from RNA comparison is shown in Fig. 4.1 taken from the book

Fundamental concepts of bioinformatics by D. E. Krane et. al. (page 113).

From the point of view of computer science, phylogenetic tree can be studied in two different

phases.

• Construction of phylogenetic tree using some distance metric to measure the relationship

44

45

Purple Bacteria

Cyanobacteria

Flavobacteria

Greennonsulfurbacteria

Thermotogales

Euryarchaeota Crenarchaeota

FungiPlants

Animals

Ciliates

Slime moldsEntamoebae

FlagellatesTrichomonadsDiplomonads

Microsporidia

Archaea

Eucarya

Bacteria

BacteriaGram−Positive

Figure 4.1: A sample phylogenetic tree of living organisms

Chapter 4. Phylogenetic Tree 46

among taxa so that the topology of the tree represent the relation reliably.

• Analysis of information (ancestral sequence, path length, subtree, etc.) from an already

constructed phylogenetic tree accurately and visualization the trees in easy shapes.

In this chapter we will discuss two different analysis algorithm to enchant the current trend

of research in phylogenetic tree analysis.

4.1 Majority Rule Consensus Tree

Most of the construction problems of phylogenetic trees are NP-hard and run heuristic algo-

rithms to generate phylogenetic tree for smaller set of taxa than required. While looking for

building a tree for a set of million taxa, smaller trees generated by these algorithms should be

combined reliably and efficiently. A consensus tree for a set of input trees is a single tree which

include features on which all or most of the input trees agree. Depending upon the definition

of the feature, on which the trees should agree, consensus tree can be different. For example, it

can preserve the topological structure of the trees or can preserve the pairwise path lengths of

individual taxa. The algorithm that we will discuss in this section builds consensus tree using

majority rule. In this problem, a node in a tree is equal to another node in a different tree if

they have the same set of taxa in the corresponding subtree rooted at them. The majority rule

consensus tree problem states to include all nodes that are present in more than fraction l of

the input trees into the consensus tree so that the topology of the consensus tree agrees with

majority of the input trees [ACJ03].

Problem : Majority Rule Consensus Tree Problem

Input : A set of phylogenetic trees

Output : A consensus tree only having all the nodes that appear in majority of the input

trees.

The algorithm we are presenting is a simplified linear time randomized algorithm.

4.1. Majority Rule Consensus Tree 47

4.1.1 Terminology

Let S = {s0, s1, . . . , sn−1} is a set of taxa and T = {T1, T2, . . . , Tt} is a set of phylogenetic trees,

each having leaves labeled by s such that s ∈ S.

Without loss of generality, let s0 is a distinguished taxon that is present in all the trees Tj

and is connected directly to the root of each of the trees. Such taxon is called the outgroup

to the rest of the tree. If there is no common outgroup present in T , to continue with the

algorithm we will insert s dummy node to each of the trees and make it the outgroup.

Now, let i is a node in a tree Tj. If we delete the edge from i towards the root of Tj, we will

get a bipartition : a subtree rooted at i and the rest of the tree having s0. Such a bipartition B

can be represented uniquely only by the leaves of the subtree rooted at i and denoted as S(B).

For example, in the Fig. 4.2 a bipartition in tree T2 is s2s3|s0s1s4 and S(s2s3|s0s1s4) = {s2, s3}.

The cardinality of the bipartition is 2. Therefore, two nodes i and i′ of two different phylogenetic

trees Tj and Tk are equal if their respective bipartitions B and B ′ are equal, i.e. S(B) = S(B ′)

s0 s1 s4s3s2 s0 s3 s4s2s1 s0 s2s1 s4 s3

s0 s3 s4s2s1

Majority rule consensus tree

Figure 4.2: An example of majority rule consensus tree for l = 1

2

In majority rule tree, or Ml tree, includes nodes for exactly those bipartitions which occur

in more than some fraction l of the input trees. It has been showed that there exists such tree

for any 1/2 < l ≤ 1 [MM81].


4.1.2 Algorithm

The algorithm can be divided into two phase. The counting phase counts the number of

occurrence of all the bipartition present in T and the construction phase constructs an Ml for

some 1

2< l ≤ 1.

Counting Bipartitions

Counting phase counts the the number of occurrences of a bipartition so that we can determine

whether the bipartition is a majority one. To count every possible bipartition we need a

storage for them. A tree with n leaves can have O(n) distinct bipartition. Because |S|= n,

maximum number of distinct bipartitions is O(tn). We can represent each bipartition B by

B = (b1, b2, . . . , bn), a bit-string of n-bits, where bit bj carries a 1 if the respective taxon sj ∈ B.

The bit-string for a bipartition B of node i can be calculated from the bit-strings of the its

children’s bipartition’s B1, B2, . . . , Br by just taking a logical OR of them. This is because the

Bi, i = 1, 2, . . . , r are disjoint set of taxa.

B =r

∨

i=1

Bi. (4.1)

We will use a hash function that hashes a bipartition to a hash table address. Each location

of the hash table contains the record of the running count of the bipartition, the bit-string

representation of the bipartition. The hash function for a bipartition B is h(B).

h(B) =n

∑

i=1

biai mod m (4.2)

where a = (a1, a2, . . . , an) is a list of random integers in (0, . . . , m − 1) and m is a prime

number and m > tn. Such hash function has an advantage of implicit calculation. The hash

index for a bipartition B of node i can also be calculated from the bipartitions of the its

children’s bipartition’s B1, B2, . . . , Br by just adding and taking mod of their hash indices.

4.1. Majority Rule Consensus Tree 49

h(B) =n

∑

i=1

biai mod m =r

∑

j=1

n∑

i=1

bjiai mod m =r

∑

j=1

h(Bj) mod m (4.3)

To count bipartitions the algorithm traverses every tree in postorder and calculates the hash

index and the bit-string representation from those of its children. Using index the algorithm

accesses the hash table which uses separate chaining method to resolve the collision and inserts

the bit-string as a new record if it is not in the table and increments its count.

Constructing Majority Rule Tree

The construction phase makes use of the following facts.

Fact 4.1.1 Let Ba and B are two majority bipartitions. If Ba is an ancestor of B in a tree in

T , then Ba is an ancestor of B in Ml.

Fact 4.1.2 Let Bp and B are two majority bipartitions. If Bp is the parent of B in Ml, then

at least one tree in T has Bp as Bs ancestor.

Fact 4.1.3 Let Ba, Bp and B are all majority bipartitions. If Bp is the parent of B and Ba is

an ancestor of B other than Bp in Ml, then |Ba| > |Bp|.

The Fact 4.1.1 tells that a preorder traversal of the input trees will be congruent with the

top-down order of bipartitions in the Ml. The Fact 4.1.2 tells that if we traverse all the input

trees in preorder, we won’t miss any edge of Ml.

The algorithm starts a preorder traversal of each tree in T . Let, the algorithm is in node i.

Each traversal uses a pointer c which always points to the nearest ancestor of i that is majority

node. We create a dummy node whose bipartition is S and before each traversal c is initialized

to dummy. After the algorithm, the Ml will be the sole subtree of this dummy. Let C be the

bipartition of c.

At node i, the algorithm calculates the Bi and h(Bi) using Equ 4.1 and Equ 4.3 and using

h(Bi) it finds the count of Bi. If Bi is not a majority node, it is just ignored. If Bi is a majority


node then it is searched in the Ml. If there is no node for Bi in the Ml then a node for Bi is

created. Since, c is the ancestor of i in Ml, a node for C obviously exists in Ml and the node

for C is made the parent of Bi. If, on the other hand, a node for Bi does exist in Ml, the

algorithm finds the current parent P of Bi in Ml. If P has a greater cardinality than C, then

C is a strong contender to be the parent of Bi (using Fact 4.1.3). Hence, C is made the parent

of Bi in Ml. Otherwise no change is done in Ml.

4.1.3 A Simulated Example

s3 s4s2s1s0

s2s3s1 s4

s2s3

s3 s4s2s1s0

s2s3s1 s4

s2s3

s2s3s1

s3 s4s2s1s0

s2s3s1 s4

s2s3

s2s3s1

s3 s4s2s1s0

s2s3s1 s4

s2s3

s2s3s1consensus treeMajority rule

s3 s4s2s0 s1

s2s3s1

s4s2s3s1

s2s3

s0 s1 s4s3s2 s0 s3 s4s2s1

T1 2T 3T

s0 s2s1 s3 s4

(a) (b) (c)

(d)(e)

Figure 4.3: Steps of majority rule tree construction

In Fig. 4.3 the steps of applying the above algorithm to the trees of the first row is shown.

The order of traversal of the tree is < T3, T2, T1 >.

After finishing the traversal of T3, the tree would look like (a). Since s2s3s4 (third node

4.2. Uniform Sampling of Phylogenetic Trees 51

from the top) of T3 is not a majority node, both s1s2 and s4 is added under s1s2s3s4.

The Ml would look like (b) after finishing the traversal of s1s2s3 (third node from the top) of

T2. Since, Bi = s1s2s3 was not in the Ml built so far, a node for it is created and C = s1s2s3s4

is made its parent. When the traversal move on to Bi = s1 the c moves down to s1s2s3. Now,

the parent of Bi in Ml is P = s1s2s3s4 and it is greater in cardinality than the C = s1s2s3.

Therefore, Bi’s parent is updated to C = s1s2s3 from P = s1s2s3s4 (see (c)). Similarly When

the traversal move on to Bi = s2s3 of T2, it updates its parent to C = s1s2s3 from P = s1s2s3s4

(see (d)).

After finishing the traversal of T2 and T1 the Ml would look like (e) which is the desired

majority rule consensus tree. Although the example uses more simple binary tree, the algorithm

equally performs in case of n-ary tree.

4.2 Uniform Sampling of Phylogenetic Trees

Most of the analysis algorithms proposed earlier were tested on the simulated topological and

sequence data. The common reasons behind this are unavailability of biologically obtained

phylogenetic trees and the large sizes of the few available ones. With the increase in the

study of phylogenetic analysis, the necessity of more correct and reliable phylogenetic trees is

also increasing. So, recent trend is to analyze an algorithm with a smaller and representative

sample from a real phylogenetic tree. One might think it as a trivial problem by just choosing

an arbitrary subtree from a large phylogenetic tree. But it becomes non-trivial when the

samples are required to be uniformly selected over some biologically motivated constrained set

of induced subtrees. In this section, the general problem of uniform tree sampling is defined

and some specific problems are described.

4.2.1 Terminology

Let, T is a phylogenetic tree over a set of taxa S, where |S| = n. If S ′ is a nonempty set of

taxa such that S ′ ⊂ S, then the subtree induced by S ′ from T is denoted by T ′ and is obtained


by the following sequence of steps

• deleting all leaves from S\S ′

• deleting all edges connecting those leaves

• deleting all degree-2 internal nodes by merging the edges connected to it into a single one.

If T is a weighted tree then the merging of two weighted edges should be defined. In

the Fig. 4.4 the T is shown in (a) where S = {1, 2, . . . , 10} and a induced subtree T ′ for

S ′ = {1, 3, 4, 5, 6, 7} is shown in (b).

1

2

3 4

5

6

78

9

10

1

3 4

5

6

7

(a) (b)

Figure 4.4: An induced subtree of a phylogenetic tree

Now, if we are given a T ,S and S ′, a unique T ′ is defined. But if we are given |S ′| instead of

S ′ then a number of T ′s can be found. This |S ′| is named as samplesize. The uniform sampling

problem states to find a fixed sized sample from T uniformly.

Problem : Uniform phylogenetic tree sampling problem

Input : A phylogenetic tree T over a set of taxa S such that |S| = n and the sample size

2 ≤ m < n.

Output : A uniformly chosen induced subtree T ′ over all such trees where T ′ has m distinct

leaves taken from S and satisfies some constraints.

Although phylogenetic trees can be n-ary tree, the binary representation is the mostly used.

4.2. Uniform Sampling of Phylogenetic Trees 53

In this section we will introduce three sampling problem of binary phylogenetic trees [KMP03].

4.2.2 Sampling Problems

All the problems stated in this section consider trees weighted by positive real numbers. To

merge two edges of a degree-2 internal node, the node is just removed and the solitary edge

gets a weight equal to the sum of the previous two weights.

Problem : Leaf depth problem

Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n

and two non-negative real number dmin and dmax.


leaves taken from S and satisfies dmin < dist(r, v) < dmax for all the leaves v of T ′ and root r

of T ′.

Problem : Edge length problem


and two non-negative real number emin and emax.


leaves taken from S and satisfies emin < |e| < emax for all the edges e of T ′.

Problem : Pair wise leaf distance problem


and two non-negative real number dmin and dmax.


leaves taken from S and satisfies dmin < dist(v1, v2) < dmax for all pair of distinct leaves v1 and

v2 of T ′.

Efficient algorithms for all the above problems are available considering T to be binary

[KMP03]. For arbitrary trees, the complexities of these problems are still to be explored.


Although every arbitrary phylogenetic tree can be represented as binary phylogenetic tree by

adding internal nodes, to our belief determining the complexity of sampling arbitrary tree for

the above problems will be a tough task.

Since most of the phylogenetic software packages use binary representation, it will also be

a good option to generate binary samples from an arbitrary n-ary phylogenetic tree satisfying

the above constraints.

4.3 Summary

Phylogeny is one of the most extensively studied area of bioinformatics. The obvious represen-

tation of tree and rich theory behind its analysis made phylogeny more advanced than other

recent areas. Phylogenetic tree analysis is currently widely used in disease tracing, Biodiversity

analysis, customized vaccine research, understanding microbial ecology and many such fields of

applied biology.

In this chapter we present an algorithm to construct consensus tree using majority rule.

The algorithm is carried out in two phases: counting phase and construction phase. We also

annotated three phylogenetic tree sampling problems. Uniform sampling of small induced trees

from large phylogenetic trees of millions of taxa is relatively new area of study. There could be

many variations of this nontrivial uniform sampling problem. Some directions are also discussed

at the end of the chapter.

Chapter 5

Conclusion

This thesis reveals several uses of graph in generating and representing biological data. We

organize the thesis around two specific problems on these two broad classes : haplotyping and

phylogeny.

Generation of haplotypes for an individual from fragments of SNP sequence can be done

using several algorithms. Minimum fragment removal, minimum SNP removal and minimum

error correction are such algorithms suitable for different experimental set ups. In this thesis, we

described the computational model (fragment conflict graph and SNP conflict graph) for these

problems. We proposed a local search based algorithm for MEC problem. We also incorporate

variations of these problems, some of which are still open in their complexity and algorithmic

status.

Generation of haplotypes for a population from their genotype data is another important

area. Pure parsimony problem, maximum resolution problem, perfect phylogeny haplotyp-

ing are discussed in this thesis. We provide illustrative examples to explain the computation

required to solve these problems.

Phylogenetic tree is the most widely used tool to represent the parent-child relationships

of biological elements, like DNA sequences, external features, physiological processes, etc. We

described an algorithm to construct consensus tree from many small candidate trees using

majority rule. We also described the uniform sampling problems of large phylogenetic tree.

55

Chapter 5. Conclusion 56

Let us justify the organization of this thesis again. In chapter 1, we defined the research

area : “bioinformatics” and provide two different perspectives to look at it : biological and

computational perspective. We explicitly defined the scope of this thesis with respect to the

possible area of researches.

In chapter 2 we define several preliminary ideas to help a reader to grasp the content of this

thesis.

In chapter 3 we discussed the haplotyping problems. We described the LSMEC, an algorithm

for solving minimum error correction problem. The algorithm is an iterative search algorithm

and performs quite efficiently even for true random fragment data. Two graphs are defined for

the MFR and MSR as computational model. The computational status of different versions

with respect to these models are also presented. We also explained the perfect phylogeny

haplotyping problem and maximum resolution problem.

In chapter 4 we discussed the phylogeny. We explained the steps of the majority rule

algorithm to construct consensus tree. The algorithm is illustrated very elegantly for easier

understanding. Uniform sampling of phylogenetic tree meeting some defined constraints is a

non-trivial sampling problem. Two different constraints on pair-wise leaf distances and on edge

lengths of the induced subtrees are defined here. Sampling of binary tree from arbitrary n-ary

tree would be an interesting problem in this aspect.

The use of graph in bioinformatics as a computational tool is the subject matter of this

thesis. Other branches of computer science are also significant in their contribution to bioin-

formatics and this thesis inspires us to study those areas very soon.

References

[ACJ03] N. Amenta, F. Clarke, K. S. Jhon: A Linear-Time majority tree algorithm, Proc. of

3rd International Workshop on Algorithms in Bioinformatics, LNBI 2812, Springer, pp.

216-227, 2003.

[BILR05] V. Bafna, S. Istrail, G. Lancia, R. Rizzi: Polynomial and APX-hard cases of the

individual haplotyping problem, Theoretical Computer Science, 335, pp. 109-125, 2005.

[BVDL03] P. Bonizzoni, G. D. Vedova, R. Dondi, J. Li: The Haplotyping problem: an overview

of computational models and solutions, Journal of Computer Science and Technology, 18(6),

pp. 675-688, 2003.

[BZ06] D. Branza, A. Zelikovsky: 2snp: scalable phasing based on 2-snp haplotypes, Bioinfor-

matics, 22(3), pp. 371-373, 2006.

[CIKT05] R. Cilibrasi, L. V. Iersel, S. Kelk, J. Tromp: On the complexity of several haplotyping

problems, Proc. of 5th International Workshop on Algorithms in Bioinformatics, LNCS

3692, Springer, pp. 128-139, 2005.

[Cla90] A. G. Clark: Inference of haplotypes from PCR-amplified samples of diploid populations,

Molecular Biology and Evolution, 7(2), pp. 111-122, 1990.

[Die05] R. Diestel: Graph theory, Springer, Heidelberg, 2005.

[Gus01] D. M. Gusfield: Inference of haplotypes from samples of diploid populations: Complex-

ity and algorithms, Journal of Computational Biology, 8(3), pp. 305-323, 2001.

57

References 58

[Huf05] Huffner F.: Algorithm engineering for optimal graph bipartization, Proc. of 4th Inter-

national Workshop on Efficient and Experimental Algorithms, LNCS 3503, Springer, pp.

240-252, 2005.

[KMP03] P. Kearney, J. I. Munro, D. Phillips: Efficient generation of uniform samples from

phylogenetic trees, Proc. of 3rd International Workshop on Algorithms in Bioinformatics,

LNBI 2812, Springer, pp. 177-189, 2003.

[LBIS01] G. Lancia, V. Bafna, S. Istrail, R. Schwartz: SNP Problems, complexity and algo-

rithms, Proc. of 9th Annual European Symposium on Algorithms (ESA), 2161, LNCS,

Springer, pp. 182-193, 2001.

[MM81] T. Margush, F. R. McMorris : Consensus n-trees, Bulletin of Mathematical Biology,

43, pp. 239-244, 1981.

[PM06] B. Pasaniuc, I. Mandoiu: Highly scalable genotype phasing by entropy minimization,

Proc. of IEEE International Conference of the Engineering in Medicine and Biology Society,

pp. 3482-3486, 2006.

[RTCN99] M. J. Rieder, S. L. Taylor, A. G. Clark, D. A. Nickrson: Sequence variation in the

human angiotensin aonverting enzyme, Nature Genetics, 22, pp. 59-62, 1999.

[WWLZ05] R.-S. Wang, L.-Y. Wu, Z.-P. Li, X.-S. Zhang: Haplotype reconstruction from SNP

fragments by minimum error correction, Bioinformatics, 21(10), pp. 2456-2462, 2005.

[Wes03] D. B. West: Introduction to graph theory, Prentice Hall of India, 2003.

Index

allele, 18, 26

heterozygous, 19

homozygous, 19

bi-allelic, 20

binary-witness MEC, 35

bioinformatics, 1

areas, 1

bipartition, 47

bit-string, 48

BLAST, 2

cardinality, 51

chromosome, 7

collection, 26

conflict, 20

consensus tree, 46

majority rule tree, 47

diploid, 19

DNA, 7

forbidden matrix, 42

fragment, 26

fragment conflict graph, 22

gain, 29

gap, 20

GenBank, 2

genotype, 37

phasing, 39

resolution, 38

graph, 15

bipartite, 22

connected graph, 16

cycle, 16

degree, 16

directed, 15

path, 16

subgraph, 15

undirected, 15

haplotype, 10, 19

construction, 22

length, 23

haplotype perfect phylogeny, 41

haplotyping, 19

individual, 19

population, 19, 36

HMEC, 26

hole, 20

59

INDEX 60

induced subtree, 52

inference rule, 39

LHR, 23

marker, 14

maximum cumulative gain, 29

MEC, 25

MFR, 22

MR, 39

MSR, 24

NCBI, 2

nucleotide, 7

partition, 26

PDB, 2

perfect graph, 24

personalized drug prediction, 14

phylogenetic Tree, 44

phylogenetic tree, 12

character-based, 13

cladistic, 12

distance-based, 12

phenetic, 12

phylogeny, 12

protein, 9

pure parsimony, 38

replication, 9

RNA, 9

SNP, 18

SNP conflict graph, 24

SNP matrix, 20

taxa, 12, 44

outgroup, 47

transcription, 9

tree, 16

topology, 46

b. sc. engg. thesis applications of graphs in bioinformatics

Documents