B. Sc. Engg. Thesis
Applications of Graphs in Bioinformatics
By
Abdullah Al Mueen
Student No. 0005040
&
Md. Nurul Amin
Student No. 0005091
Submitted to
Department of Computer Science and Engineering
in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science and Engineering
Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology (BUET)
Dhaka 1000, Bangladesh
November, 2006
CERTIFICATE
This is to certify that the work presented in this thesis entitled “Applications of Graphs
in Bioinformatics” is the outcome of the investigation carried out by us under the supervision
of Professor Dr. Md. Saidur Rahman in the Department of Computer Science and Engineering,
Bangladesh University of Engineering and Technology, Dhaka. It is also declared that neither
this thesis nor any part thereof has been submitted or is being currently submitted anywhere
else for the award of any degree or diploma.
(Supervisor)
Dr. Md. Saidur Rahman
Professor
Department of Computer Science
and Engineering, BUET, Dhaka.
(Author)
Abdullah Al Mueen
Student No: 0005040
(Author)
Md. Nurul Amin
Student No: 0005091
Contents
Acknowledgments
Abstract
1 Introduction
1.1 History of Bioinformatics
1.2 Computations in Bioinformatics
1.3 Scope of This Thesis
1.4 Summary
2 Preliminaries
2.1 Cell-The Unit of Life
2.2 Genome Organization
2.2.1 Genome Organization in Eukaryotic Cell
2.2.2 Genome Organization in Prokaryotic Cell
2.3 Genome Expression
2.4 Generation of Genomic Data
2.4.1 DNA Sequencing
2.4.2 Haplotype Reconstruction and Inference
2.5 Genomic Data Analysis and Applications
2.5.1 Phylogeny
2.5.2 Study of Single Nucleotide Polymorphism
2.5.3 SNPs and Disease Diagnosis
2.5.4 SNPs and Personalized Drug Prediction
2.6 Graph Terminologies
2.7 Summary
3 Haplotyping
3.1 Haplotype
3.2 Individual Haplotyping
3.2.1 Terminology
3.2.2 Minimum Fragment Removal : MFR
3.2.3 Minimum SNP Removal : MSR
3.2.4 Minimum Error Correction : MEC
3.2.5 A Heuristic Algorithm for MEC Problem
3.2.6 Comparison of Different Individual Haplotyping Problems
3.3 Population Haplotyping
3.3.1 Terminology
3.3.2 Pure Parsimony Problem
3.3.3 Maximum Resolution Problem – MR
3.3.4 Perfect Phylogeny Haplotyping Problem – PPH
3.4 Summary
4 Phylogenetic Tree
4.1 Majority Rule Consensus Tree
4.1.1 Terminology
4.1.2 Algorithm
4.1.3 A Simulated Example
4.2 Uniform Sampling of Phylogenetic Trees
4.2.1 Terminology
4.2.2 Sampling Problems
4.3 Summary
5 Conclusion
References
Index
List of Figures
2.1 Organization of DNA molecule in a cell
2.2 Directed and undirected graphs
3.1 Minimum fragment removal using fragment conflict graph
3.2 Minimum SNP removal using SNP conflict graph
3.3 SNP matrix and its partition
3.4 An example calculation of Gain measure
3.5 An example iteration of HMEC
3.6 An example log table
3.7 An example of haplotype perfect phylogeny (hpp)
3.8 Forbidden matrix for hpp
4.1 A sample phylogenetic tree of living organisms
4.2 An example of majority rule consensus tree for l = 1/2
4.3 Steps of majority rule tree construction
4.4 An induced subtree of a phylogenetic tree
List of Tables
3.1 A chromosome and two haplotypes assembled from it
3.2 An SNP matrix
3.3 Different relations between fragments
3.4 The pseudocode for the HMEC algorithm
3.5 Comparison among the SNP problems
3.6 A chromosome and its genotype
3.7 Example of optimal application of haplotype inference rule
Acknowledgments
All praise belongs to the Almighty ALLAH.
We would like to express our deep gratitude to our supervisor Professor Dr. Md. Saidur
Rahman. We would like to thank him for introducing us to this beautiful field of bioinformatics
and for teaching us how to carry out research work. We again express our most sincere gratitude
to him for his affectionate supervision, continuous motivation and valuable advice, without
which this thesis would not have come to completion.
We would like to thank everyone associated with our research group. We especially thank
Mr. Abdullah Adnan, Mr. Abul Hasan Samee and Mr. Md. Shariful Islam Bhuyan for their
invaluable comments and encouragement throughout the period of this thesis.
Each presentation session was greatly supported by the lab assistants and officials of the
department. We would like to thank them for each of the services and facilities they provided
us.
Finally, we are grateful to our families for giving us the moral support to overcome the
tedium of repetitive trials to new findings.
Abstract
Bioinformatics is principally the science of biological information processing. It requires a wide
range of computational models for effective representation and efficient computation of biological
data. This thesis describes graphs as computational models in bioinformatics. Here we
focus on two major areas of bioinformatics: haplotyping and phylogeny.
We have compiled four optimization problems of individual haplotyping and their graph
models. We propose a new heuristic algorithm for the minimum error correction problem,
which constructs a pair of haplotypes of an individual from aligned and overlapping but
intermixed and erroneous fragments. We describe the perfect phylogeny haplotyping problem, which
arranges a set of haplotypes for a population in a phylogenetic tree.
We describe an algorithm that uses the majority rule to build a single consensus tree from a
number of different small phylogenetic trees on the same set of taxa. We discuss the uniform
sampling of a phylogenetic tree, which generates uniformly at random an induced subtree carrying a
subset of the leaves of the actual tree and satisfying some constraints. We annotate three problems
of uniform sampling that have constraints on leaf depths, edge lengths and pairwise leaf distances
of the resulting samples.
Chapter 1
Introduction
Analysis of biological experiments and processes is labor intensive and time consuming due to
the increasing complexity of the processes and the explosive growth of biological data emerging
from laboratories worldwide. Hence, the current challenge is to transform this huge volume of data
into knowledge for a complete understanding of the biological processes and experiments relating to
both health and disease. The quest for this knowledge has given rise to a new era of science,
bioinformatics.
Bioinformatics is a highly interdisciplinary field where techniques and concepts from informatics,
statistics, mathematics, chemistry, biochemistry, physics, and linguistics merge into a
single branch of science. The most prominent and important of these disciplines is information
processing. Therefore, “bioinformatics” can be expanded as “biological information
processing”. More precisely, bioinformatics derives knowledge from computer analysis of biological
data. The development of powerful computers and the availability of experimental
data that can readily be treated by computation launched bioinformatics as an independent
field.
From the perspective of information processing, bioinformatics can be divided into several
areas. The major areas are:
• Develop better tools for data generation, capture, and annotation. An example of data
generation would be “shotgun sequencing” which identifies the sequence of nucleotides of a
target DNA molecule.
• Develop and improve tools for comprehensive functional and relational analysis. Numerous
tools for phylogenetic analysis of biological data have already been developed to determine the
evolutionary relationship among species.
• Develop and improve tools for representing and comparing sequence similarity and variation.
An example would be BLAST (Basic Local Alignment Search Tool), which aligns two or more
sequences and computes different similarity and dissimilarity measures.
• Improve the content and utility of biological databases to store the data in effective ways. An
example would be GenBank. GenBank, developed and housed at NCBI (National Center for
Biotechnology Information), is the U.S. repository for all DNA and protein sequences, containing
more than 61 million sequence records.
• Create mechanisms to support effective approaches for producing robust, exportable software
that can be widely shared.
Each of these areas is evolving day by day and is subject to extensive research. The next
section of this chapter contains a snapshot of the history of bioinformatics. The section following
it contains an overview of the computations required in bioinformatics. Finally, the scope of this
thesis is detailed.
1.1 History of Bioinformatics
The generation of microbiological data started in 1955, when the amino acid sequence
of a protein (bovine insulin) was announced for the first time by F. Sanger. Since then, hundreds
of proteins have been sequenced and analyzed, and the need arose for a data bank
for these proteins so that the information could be spread easily to all. In 1973 one such
Protein Data Bank (PDB), named Brookhaven, was announced. This PDB was fully described
in 1977 (http://www.pdb.bnl.gov). These PDBs were actually in printed form because of the
unavailability of desktop computers and communication facilities in those days.
The first complete genome sequence for an organism (ΦX174) was published in 1980. The
genome consists of 5,386 nucleotide base pairs encoding nine proteins. Just one year later, in 1981,
IBM introduced its personal computer to the market and, concurrently, the Smith-Waterman
algorithm for sequence alignment was published. The growth of computer performance and
biological data has been exponential since then and led to the creation of new databases
such as SWISS-PROT in the ’80s. Many organizations were also founded in that decade, namely
Genetics Computer Group (GCG), National Center for Biotechnology Information (NCBI), etc.
Whole genome sequences were published in the ’90s for different single-cell organisms such as
Haemophilus influenzae (1.6 Mbp, million base pairs), baker’s yeast (12.1 Mbp) and E. coli
(4.7 Mbp). The Human Genome Project successfully published the complete sequence of the human
genome (3000 Mbp) in 2001.
Biochemical methods to sequence, recombine and engineer DNA sequences had been
developed by the ’90s. Besides, new modeling and analysis provide new dimensions in the
datasets. Thus, the growth of the datasets has necessitated various forms of computation
in bioinformatics since the beginning of this decade.
1.2 Computations in Bioinformatics
Computations required in bioinformatics vary largely depending on the objective of the
application. Almost all the developed branches of computer science have strong applications in
bioinformatics. Some examples of the computational tools that are used in bioinformatics are
given below.
Graphs are used to represent relationships among species based on different physical and
microbiological criteria. For example, the evolutionary relationships among the existing species are
expressed in a tree structure called a phylogenetic tree. Graphs are also used in problems of
analyzing biological data.
Numerical simulations of biological systems are used to model systems that are very difficult
to model by analytical methods and deterministic operations. For example, genetic
regulatory networks can be modeled by stochastic processes. Similarly, host-parasite systems,
ecosystems, etc. are well studied through numerical simulation.
Machine learning has many applications in bioinformatics. Generally biological data are
huge in quantity but with no established theory. For such data, learning theories provide
methods to gain an insight into the underlying theory of the origin of these data. Besides,
statistical analysis can be used in population-oriented biology, such as epidemic control and drug
design.
Data mining and advanced database technology are among the main parts of biological
information analysis and preservation. The huge amount of data requires efficient processing
to maximize its use for research and educational purposes.
1.3 Scope of This Thesis
There is a variety of applications of graph theory in modeling, representing, analyzing and
comparing biological data. Our study covers two major areas: the generation of data
through biological experiments and the analysis of the represented data.
The first area we covered is the computational problems of constructing haplotypes. Haplotypes
are sequences of nucleotides located at certain fixed sites in the DNA molecule, called
the SNP sites. Experimental readouts cannot completely determine haplotypes due to their lack
of precision and correctness. Methods are therefore required to
build the haplotypes in an optimal way so that their reliability is maximized. Graphs are used
extensively in modeling such optimization problems [CIKT05, BVDL03, BILR05].
We studied several models, and in this thesis we have compiled the problems in a well-sorted
fashion. Besides, we provide a heuristic algorithm for one of the optimization problems, that of
correcting the minimum number of errors in a set of nucleotide sequences.
The second area we covered is the analysis of relationships among species through phylogeny.
Phylogeny is also a well-developed branch of bioinformatics. Construction, assembly,
visualization and sampling of phylogenetic trees are major applications of graphs in this branch
[ACJ03, KMP03, MM81]. In this thesis, we discuss the majority rule method of assembling a
phylogenetic tree from many small trees. We also discuss a way of arranging haplotypes in
phylogenetic order, which justifies the relation between the two major areas of our study.
1.4 Summary
In this chapter we highlighted bioinformatics from the biological and computational viewpoints.
We also presented a short history of bioinformatics to demonstrate the rapid growth of biological
data throughout the world. The final section described the scope of our thesis.
Chapter 2
Preliminaries
Basic concepts and terminologies used in this thesis come from two distinct
fields of natural and applied science. Hence, it is worth introducing these concepts
briefly. The first section is about the typical structure of a cell down to the molecular
level, including major biological processes. Generation, assembly, analysis and applications of
biological data are discussed in the later sections. The final section contains the definitions and
terminologies of graph theory that are required in this thesis.
2.1 Cell-The Unit of Life
Cells are the structural and functional units of all living organisms. Cells range in size from one
millimeter down to one micrometer. Cells are highly organized and complicated assemblies of
large polymeric molecules with specific compartments called organelles. The most important
organelle is the nucleus, which houses most of the cellular DNA, the hereditary material of
living organisms. Cells are of two types: prokaryotic cells (e.g. bacteria), which lack
a defined nucleus and have a simplified internal organization, and eukaryotic cells (e.g. body cells
of humans and animals), which have a more complicated organization including a defined nucleus.
In addition to the nucleus, there are several other organelles in typical eukaryotic cells:
the mitochondria, where the cell’s energy metabolism is carried out; the rough and smooth
endoplasmic reticula, a membranous network in which glycoproteins and lipids are synthesized;
Golgi vesicles, which transfer membrane constituents to appropriate places in the cell; and
peroxisomes, in which fatty acids and amino acids undergo degradation. Animal cells, but not
plant cells, contain lysosomes, which degrade unnecessary materials taken in by the cell. Plant
cells have chloroplasts, where photosynthesis takes place.
2.2 Genome Organization
2.2.1 Genome Organization in Eukaryotic Cell
DNA (deoxyribonucleic acid) is the master molecule of the cell that stores genetic information
and transfers genetically determined characteristics from one generation to the next. The
three-dimensional structure of DNA consists of two long helical strands coiled around a common axis,
forming a double helix. Each strand of DNA is composed of a continually varying sequence of just
four different types of monomers (Adenine, Cytosine, Guanine and Thymine) called nucleotides.
Genes are simply portions of the nucleotide sequence residing in the DNA molecule that code for
polypeptides or proteins, the main constituents of cells. DNA also contains instructions, in the
form of nucleotide sequences, that direct which proteins are to be made, when, and in what
quantities.
The DNA in the nucleus of a eukaryotic cell is organized among 1 to more than 50 long,
linear, compact structures called chromosomes. All cells of an organism contain
chromosomes of the same size and number, but these vary among different types of organisms. Each
chromosome consists of a single DNA molecule associated with various DNA-binding proteins.
The total DNA content in the chromosomes of an organism is referred to as its genome. The
organization schematic is shown in Fig. 2.1 [www.mtsinai.on.ca/pdmg/images/chromosome.jpg].
2.2.2 Genome Organization in Prokaryotic Cell
In all prokaryotic cells, most or all of the genetic information resides in a single circular DNA
molecule that folds back on itself many times. Usually it is one millimeter in length and stays
in the central region of the cell. The genomic DNA molecule in prokaryotes is also associated
with proteins and is often referred to as a chromosome. But the organization of DNA within a
bacterial chromosome differs greatly from that within the chromosomes of eukaryotic cells.
Figure 2.1: Organization of DNA molecule in a cell
2.3 Genome Expression
DNA encodes all of the RNA (ribonucleic acid) and protein molecules of the cells of an
organism. Proteins are the most abundant and functionally versatile of the cellular macromolecules.
Proteins are polymers formed from only 20 different monomers, the amino acids. Many proteins
within cells are enzymes, which accelerate (catalyze) biochemical reactions. Proteins also
direct their own synthesis and that of other macromolecules, maintain internal cell rigidity, and
transport small molecules and ions across membranes.
Protein synthesis from genes does not occur directly. RNA acts as an intermediary molecule.
First, a portion of the DNA sequence of the large DNA molecule in a chromosome is copied into
RNA. This process is called transcription. These RNA copies of segments of the DNA work
as templates to direct the synthesis of the protein. This process is called translation, as genetic
information stored in the form of a nucleotide sequence is decoded into the form of an amino acid
sequence in proteins. DNA can also undergo replication (synthesis of new DNA). Therefore
genetic information in cells flows from DNA to RNA to protein. This fundamental principle of
genome expression is termed the central dogma of molecular biology. Genome expression is
under fine regulation at various levels.
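The flow from DNA to RNA to protein can be illustrated with a minimal Python sketch. The input sequence and the function names are invented for illustration, and the codon table is only a small subset of the standard genetic code, so this is a toy model rather than a usable translation tool.

```python
# Toy illustration of the DNA -> RNA -> protein flow described above.
# CODON_TABLE is deliberately a small subset of the standard genetic code.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": "Stop",
}

def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> list[str]:
    """Read the mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCGCTTAA")   # hypothetical coding-strand fragment
protein = translate(mrna)
print(mrna, protein)
```

A codon outside the small table raises a KeyError, which makes the limits of the subset explicit instead of silently skipping residues.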
2.4 Generation of Genomic Data
Bioinformatics primarily deals with a huge range of genomic data gathered from different
biological experiments. Computational biology studies the data and derives knowledge from it.
Thus the generation of genomic data is of prime importance. This section briefly discusses a few
effective ways of generating genomic data.
2.4.1 DNA Sequencing
DNA sequencing is the laboratory technique by which the chemical code of the genome is
deciphered. DNA sequencing determines the exact order of the chemical building blocks, the
nucleotides (abbreviated A, T, C, and G), that make up the DNA. First, chromosomes are broken into
much shorter pieces. After the short pieces have been sequenced (in blocks of about 500 bases each,
called the read length), they are assembled into long continuous stretches that are analyzed for
errors, gene-coding regions, and other characteristics.
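The assembly step can be pictured with a toy greedy procedure that repeatedly merges the two reads with the longest suffix-prefix overlap. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); the reads and function names below are invented, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the two best-matching reads, keeping the overlap only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0]

# Hypothetical short reads covering one stretch of DNA.
assembled = greedy_assemble(["ATTAGAC", "GACCTGC", "CTGCCGGA"])
print(assembled)
```

The greedy choice is a heuristic: it does not guarantee the shortest common superstring, and the pairwise overlap search is quadratic in the number of reads.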
The DNA sequence acts as a blueprint, determining what species of organism is produced,
and within a species, the DNA sequence of each individual is unique. The DNA sequence that
makes up a genome may include coding DNA (genes) and also non-coding DNA. Non-coding
DNA does not encode a protein but may regulate where and when genes and proteins are active.
Advances in molecular biology, genomics, and robotics have resulted in automatic sequencing
of DNA. Today, the DNA sequence of an entire microbial genome can be determined in just
a few weeks rather than several years as in the past. The resulting DNA sequence maps are
being used by 21st century scientists to explore biological phenomena.
2.4.2 Haplotype Reconstruction and Inference
A haplotype is simply the genetic constitution of an individual chromosome. The term also refers
to a set of single nucleotide polymorphisms (SNPs) found to be statistically associated on a
single chromatid (one half of a replicated chromosome). Such information is most valuable for
investigating the genetics behind common diseases. Haplotypes may be used to compare different
populations. Haplotype diversity refers to the uniqueness of a particular haplotype in a given
population. Haplogroups are large groups of haplotypes that define genetic populations and
are often geographically oriented.
2.5 Genomic Data Analysis and Applications
Bioinformatics creates the tools to store, manage, analyze, and compare genomic data. Vast
amounts of sequence data are now stored in organized computer databases. The genome sequence
has been interpreted using computational tools combined with biological knowledge. Computer
software associated with the databases is used to make data retrieval and data analysis
easier. Sequence analysis tools can also translate a DNA sequence into a protein sequence and
can provide information on the predicted physical properties of the protein, such as molecular
weight. Sequence comparisons can also be used to categorize groups of related genes or sequences
into families. Membership in the same family suggests that the genes or proteins perform similar
functions. Another use for sequence comparisons is studying the relatedness and evolution of
different genes or organisms.
Here are some research areas where important relationships and predictions are being gen-
erated by genomic data analysis with the help of bioinformatics tools:
• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organization
• Chromosomal structure and organization
• Non-coding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Predicted vs experimentally determined gene function
• Evolutionary conservation among organisms
• Correlation of SNPs (single-base DNA variations among individuals) with health and
disease
• Disease-susceptibility prediction based on gene sequence variation
• Genes involved in complex traits and multi-gene diseases
Some uses of genomic data are discussed in this section and will be elaborated throughout
this thesis.
2.5.1 Phylogeny
Phylogeny is the description of biological relationships based on classification according to
similarity of one or more sets of characters or on a model of evolutionary processes. Phylogenetic
relationships based on different characters are consistent and support one another. Phylogeny is
usually expressed as trees called phylogenetic trees.
The goal of phylogeny is to work out the relationships among species, populations,
individuals or genes (generally referred to as taxa). A relationship is understood as the assignment
of a scheme of descent from a common ancestor. The results are usually presented as an
evolutionary tree.
A phylogenetic tree is a two-dimensional graph composed of nodes representing the taxa
and branches representing the relationships among the taxa. A tree is a connected graph in
which there is exactly one path (a consecutive set of edges beginning at one point and ending at
the other) between every two points. A particular node may be selected as a root. Abstract
trees may be rooted or unrooted. An unrooted tree displays the topology of the relationships but
does not show the pattern of descent. Rooted trees are directed graphs in which each edge
is a one-way street and the ancestor-descendant relationship implies the direction of each edge.
If every internal node of a rooted tree has two descendants, the tree is called a binary tree.
Numbers are often assigned to the edges of a graph to signify a distance between the nodes
connected by the edges. The edges can then be drawn with sizes proportional to the assigned
lengths. The length of a path through the graph is the sum of the edge lengths.
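As a small illustration of these definitions, the Python sketch below stores a weighted tree as an adjacency dictionary, finds the unique path between two leaves, and sums its edge lengths. The four-taxon tree, its node names and its edge lengths are invented for this example.

```python
# Hypothetical four-taxon tree; node names and edge lengths are invented.
# Each edge is stored in both directions so the tree can be walked freely.
tree = {
    "root": {"x": 1.0, "y": 2.0},
    "x": {"root": 1.0, "A": 0.5, "B": 1.5},
    "y": {"root": 2.0, "C": 1.0, "D": 3.0},
    "A": {"x": 0.5},
    "B": {"x": 1.5},
    "C": {"y": 1.0},
    "D": {"y": 3.0},
}

def path_between(tree, start, goal, parent=None):
    """Return the unique path from start to goal as a list of nodes."""
    if start == goal:
        return [start]
    for neighbor in tree[start]:
        if neighbor != parent:  # never walk back along the edge we came from
            rest = path_between(tree, neighbor, goal, start)
            if rest is not None:
                return [start] + rest
    return None

def path_length(tree, path):
    """Sum the edge lengths along consecutive nodes of the path."""
    return sum(tree[u][v] for u, v in zip(path, path[1:]))

path = path_between(tree, "A", "D")
length = path_length(tree, path)
print(path, length)
```

Because a tree has exactly one path between any two nodes, a depth-first search that avoids stepping back to its parent is enough; no visited set is needed.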
In phylogenetic trees, edge lengths signify either some measure of the dissimilarity between
two species or the period since their separation. There are two approaches to the derivation of
phylogenetic trees:
• Phenetic approach proceeds by measuring a set of distances between species to generate
a tree by hierarchical clustering procedure.
• Cladistic approach considers possible pathways of evolution, inferring the characteristics
of the ancestor at each node.
There are two types of data used for building phylogenetic trees:
• Distance-based: A matrix of distances between the species is used as input (e.g., the
alignment score between them or the fraction of residues they agree on).
• Character-based: Each character (e.g., a base in a specific position in the DNA) is
examined separately.
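For the distance-based case, one simple choice of distance is the fraction of aligned positions at which two sequences disagree, i.e. one minus the fraction of residues they agree on. The sketch below builds such a matrix; the three sequences and taxon names are invented and assumed to be already aligned.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Invented, already-aligned sequences for three hypothetical taxa.
taxa = {"s1": "AATTACGT", "s2": "ATTTACGT", "s3": "ATTTGCGA"}
names = sorted(taxa)

# Symmetric distance matrix, rows and columns in the order of `names`.
matrix = [[p_distance(taxa[p], taxa[q]) for q in names] for p in names]

for name, row in zip(names, matrix):
    print(name, row)
```

A matrix of this kind is the typical input of a hierarchical clustering procedure, whereas a character-based method would instead examine each column of the alignment on its own.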
2.5.2 Study of Single Nucleotide Polymorphism
Single nucleotide polymorphisms or SNPs (pronounced “snips”) are DNA sequence variations
of a single nucleotide (A, T, C, or G) in the genome sequence. For example, a change in the DNA
sequence AATTAC to ATTTAC is a SNP. When a variation occurs in at least 1% of the
population, it is considered a SNP. SNPs are found in both coding (gene) and non-coding
regions of the genome. Most SNPs occur outside of “coding sequences”. SNPs found within a
coding sequence are of particular interest to researchers because they are more likely to alter
the biological function of a protein.
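The two parts of this definition can be sketched in Python: locating the positions where two aligned sequences differ, and applying the 1% rule to the alleles observed at one site. The function names and population samples are invented, and reading the 1% rule as a threshold on the frequency of the non-major alleles is an assumption made for this illustration.

```python
from collections import Counter

def variant_sites(reference: str, sample: str) -> list[int]:
    """0-based positions where two aligned sequences differ."""
    return [i for i, (r, s) in enumerate(zip(reference, sample)) if r != s]

def is_snp(alleles: list[str], threshold: float = 0.01) -> bool:
    """Apply the 1% rule: the non-major alleles must reach the threshold."""
    counts = Counter(alleles)
    major = counts.most_common(1)[0][1]
    return (len(alleles) - major) / len(alleles) >= threshold

# The example from the text: AATTAC changes to ATTTAC at the second base.
sites = variant_sites("AATTAC", "ATTTAC")

# Invented population samples for a single site.
common = is_snp(["A"] * 95 + ["T"] * 5)    # variant seen in 5% of samples
rare = is_snp(["A"] * 999 + ["T"] * 1)     # variant seen in 0.1% of samples
print(sites, common, rare)
```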
SNPs make up about 90% of all human genome sequence variation. Two of every three
SNPs are due to the replacement of cytosine (C) with thymine (T). These variations in DNA
sequence are believed to be associated with how humans respond to disease, to environmental
insults such as bacteria, viruses, and chemicals, and to therapies. Thus SNPs have become of great
value for biomedical research and for developing pharmaceutical products or medical diagnostics.
SNPs are easier to follow in population studies because they are evolutionarily stable, not
changing much from generation to generation.
2.5.3 SNPs and Disease Diagnosis
Each person has a unique SNP pattern. In most cases, SNPs do not cause disease; they only
serve as biological markers for pinpointing a disease on the human genome map, because they
are usually located near a gene found to be associated with a certain disease. Thus SNPs
help to determine a person’s genetic predisposition to a particular disease based on genes and
hereditary factors. Researchers may also identify relevant genes associated with a disease by
studying stretches of DNA that have been found to harbor a SNP associated with a disease
trait. SNPs will also allow researchers to better evaluate the impact of non-genetic
factors like behavior, diet, lifestyle, and physical activity on diseases.
To create a genetic test to screen for a disease in which the disease-causing gene has already
been identified, scientists may compare the SNP patterns of a group of individuals affected by the
disease with those of individuals unaffected by the disease. This association study can indicate
which pattern is most likely associated with the disease-causing gene. SNP profiles that
are characteristic of a variety of diseases can then be established, and a physician could easily
screen individuals for susceptibility to a disease just by analyzing their DNA samples for specific
SNP patterns. SNP maps may help them identify the multiple genes associated with complex
diseases such as cancer, diabetes and mental disorders.
2.5.4 SNPs and Personalized Drug Prediction
A treatment that proves effective in one patient for a particular disease may be found ineffective
in others. SNPs are believed to be useful in understanding why
individuals differ in their abilities to absorb or clear certain drugs, as well as why
one individual may experience an adverse side effect from a particular drug while others do not.
The most appropriate drug for an individual could be determined by analyzing the
patient’s SNP pattern. When a medicine works well for a group of people, researchers try
to find the DNA markers (SNPs) that these people have in common. Scientists could then identify
which medicines are best for any one person by using these markers. For example, when a
person is in need of medicine, doctors will compare the person’s SNP pattern with the SNP
patterns of different groups of individuals having the same disease and will prescribe individualized
therapies specific to the patient’s needs. Predicting the appropriate treatment for the right
person is referred to as “personalized drug prediction”. Personalized medicine would allow
pharmaceutical companies to bring many more drugs to market and allow doctors to suggest the
drug that will be most effective for an individual patient.
Therefore, the recent discovery of SNPs is leading towards a revolution in the process of
disease detection and in the practice of preventive and curative medicine.
2.6 Graph Terminologies
The focus of this thesis is the use of graphs in bioinformatics. The necessary graph
terminologies are briefly described here. Interested readers are referred to the detailed texts
in the literature [Wes03, Die05].
A graph is a tuple (V, E), which consists of a finite set of vertices V and a finite set of edges
E. Each edge is an ordered or unordered pair of distinct vertices. The set of vertices of a graph G is denoted
by V(G) and the set of edges is denoted by E(G). An example graph is shown in Fig. 2.2(i), where
each vertex in V(G) = {v1, v2, v3, v4, v5} is drawn as a small black circle and each edge in
E(G) = {e1, e2, e3, e4, e5} is drawn as a line connecting a pair of vertices. Depending upon
the directional property of the edges, a graph belongs to one of two classes: directed graphs and
undirected graphs. A graph whose edges are simple undirected lines between two vertices
is an undirected graph. A graph whose edges are directed lines from a source vertex to a
destination vertex is called a directed graph.
[Figure 2.2: two drawings, (i) and (ii), of a graph with vertices v1, ..., v5 and edges e1, ..., e5]
Figure 2.2: Directed and undirected graphs
In this thesis an edge e of an undirected graph is represented by (u, v) where u and v are the
vertices which are connected by e and an edge of a directed graph is represented as < u, v >
where u is the source vertex and v is the destination vertex. From here on, we will discuss
every terminology for undirected graph unless otherwise specified.
Let G be a graph. A graph H is called a subgraph of G if every edge in H is an edge in
G and every vertex of H is a vertex of G. It is to be noted that the definition is a one way
implication. A v0 − vl walk, v0; e1; v1; . . . ; vl−1; el; vl, in G is an alternating sequence of vertices
and edges of G, beginning and ending with a vertex, in which from each vertex one can move
to the following vertex through their intermediate edge. If the vertices v0, v1, . . . , vl are distinct
(except possibly v0 and vl), then the walk is called a path and usually denoted either by the
sequence of vertices v0, v1, . . . , vl or by the sequence of edges e1, e2, . . . , el. A path with l edges
has length l. A path or walk is closed if v0 = vl. A closed path containing at least one edge is
called a cycle. A graph G is connected if for every pair u, v of distinct vertices there exists a
path between them in G. The number of edges incident to a vertex v of an undirected graph is
called the degree of v and is denoted by d(v). In a directed graph each vertex has two types of
degree: the in-degree is the number of incoming edges and the out-degree is the
number of outgoing edges of the vertex.
A tree is a connected graph without any cycle. Since
there is no cycle, there exists exactly one path between any two vertices. The vertices of a tree
are usually called nodes. A vertex v of a tree with d(v) = 1 is called a leaf and one with d(v) > 1
is called an internal node. Thus a leaf has exactly one edge connected to it. A rooted tree is
a tree that has a distinguished vertex called the root. Since a tree is a connected graph, there
is a path from the root to every other vertex in the tree. The vertex p that immediately precedes
vertex u on the path from the root to u is called the parent of u, and u is called a child of p. An
internal node may have more than one child but exactly one parent. A leaf has no child. If
u1, u2, u3, . . . , ul is a sequence of nodes in a tree such that u1 is the parent of u2, which is the
parent of u3, and so on, then u1 is called an ancestor of ul and ul is called a descendant of
u1. The root is an ancestor of all other nodes and a descendant of none. The height of a tree is
the length of the longest path from the root to a leaf.
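The definitions above can be illustrated with a small sketch. It assumes an undirected graph stored as an adjacency list in a Python dictionary; the vertex names and the helper functions `degree`, `is_connected`, `is_tree` and `height` are illustrative choices, not notation used elsewhere in this thesis.

```python
from collections import deque

# Undirected graph as an adjacency list; the vertex names are illustrative only.
tree = {
    "r": ["a", "b"],
    "a": ["r", "c", "d"],
    "b": ["r"],
    "c": ["a"],
    "d": ["a"],
}

def degree(graph, v):
    """Number of edges incident to v."""
    return len(graph[v])

def is_connected(graph):
    """True if every vertex is reachable from an arbitrary start vertex."""
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(graph)

def is_tree(graph):
    """A connected graph with |V| - 1 edges has no cycle, hence it is a tree."""
    edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
    return is_connected(graph) and edge_count == len(graph) - 1

def height(graph, root):
    """Length, in edges, of the longest path from the root to a leaf."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in depth:
                depth[w] = depth[u] + 1
                queue.append(w)
    return max(depth.values())

leaves = sorted(v for v in tree if degree(tree, v) == 1)
print(is_tree(tree), leaves, height(tree, "r"))
```

Here the leaves are found directly from the rule d(v) = 1, and the height is obtained by a breadth-first traversal from the chosen root.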
2.7 Summary
This chapter is intended as a starting point for readers who lack the biological
background necessary to read the rest of the thesis. The very basics of bioinformatics are deeply
rooted in the concepts and discoveries of biologists. For a student of computer
science, therefore, the study of bioinformatics is a difficult job that requires much time and effort to grasp
the ideas inherent in this beautiful research area.
Chapter 3
Haplotyping
One of the biggest achievements in the quest to unravel the mystery of life is the complete sequencing of
the human genome. The Human Genome Project established that humans are
almost 99% identical at the DNA level. Therefore what makes us different is a very small region
of the whole genome. The smallest possible variation in a DNA sequence is the Single Nucleotide
Polymorphism, abbreviated as SNP and pronounced “snip”. It is also the most common kind of
variation in humans. Let us describe SNPs in more detail, because haplotyping is the process
of locating and determining SNPs.
An SNP is a specific nucleotide, placed inside a DNA molecule which is otherwise identical
for all of us, whose value varies, in a statistically significant way, within a population. For each
chromosome, the sites where SNPs occur have been well identified by the Human Genome
Project. The base at an SNP site is called an allele. At a particular SNP site, the variation
over the entire population is in most cases between two alleles. Such an SNP is called
bi-allelic. Why bi-allelic SNPs are more common than multi-allelic SNPs, which have
three or more different bases, is yet to be explained. SNPs have been one of the most extensively
researched topics among computational biologists in recent years.
3.1 Haplotype
Several computational problems related to SNPs have been devised in recent years. One of the
popular problems is haplotyping, which deals with generating haplotypes from genomic
sequence data. The chromosomes of a diploid organism are organized in pairs. A diploid organism
has one copy of a specific chromosome inherited from the father and another copy of the same
chromosome inherited from the mother. At each SNP site an individual can be homozygous (same allele
on both chromosomes) or heterozygous (different alleles). The values of a set of SNPs on a
particular chromosome copy define a haplotype. An example is shown in Table 3.1. The two copies
of the same chromosome of an individual are aligned and the SNP sites are shown by capital
letters. The individual is heterozygous at SNPs 1 and 3 and homozygous at SNPs 2 and 4.
The haplotypes are TGGT and AGCT.
paternal copy: atcatcTcaagtGgaattGctcTctaa
maternal copy: atcatcAcaagtGgaattCctcTctaa
Haplotype 1: T G G T
Haplotype 2: A G C T
Table 3.1: A chromosome and two haplotypes assembled from it
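As a minimal sketch of how the haplotypes of Table 3.1 are read off, the snippet below assumes that the two copies are already aligned and that SNP sites are marked by capital letters, as in the table. The function name `read_haplotypes` is a hypothetical helper introduced only for this illustration.

```python
paternal = "atcatcTcaagtGgaattGctcTctaa"
maternal = "atcatcAcaagtGgaattCctcTctaa"

def read_haplotypes(copy_a, copy_b):
    """Collect the alleles at the SNP sites, which are marked by capital letters."""
    sites = [i for i, base in enumerate(copy_a) if base.isupper()]
    hap_a = "".join(copy_a[i] for i in sites)
    hap_b = "".join(copy_b[i] for i in sites)
    # SNP numbers are 1-based, as in the text.
    heterozygous = [n for n, (a, b) in enumerate(zip(hap_a, hap_b), start=1) if a != b]
    return hap_a, hap_b, heterozygous

hap1, hap2, het = read_haplotypes(paternal, maternal)
print(hap1, hap2, het)  # TGGT AGCT [1, 3]
```

This only works when complete, error-free copies are available; the following sections deal with the realistic case where only fragments are known.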
The two major categories of problems related to haplotyping are
• Individual Haplotyping - The Haplotype Assembly Problem
• Population Haplotyping - The Haplotype Inference Problem
The following sections elaborate on these haplotyping problems in detail.
3.2 Individual Haplotyping
Haplotyping an individual consists of determining a pair of haplotypes, one for each copy of
a given chromosome. This pair of haplotypes completely defines the SNP fingerprint of an
individual for a specific chromosome. Given the complete sequence of bases of a specific chromosome, we
just need to check all the SNP sites to generate the haplotypes. The situation is more
complicated when haplotypes are generated from sequencing data. Sequencing data for a genome
does not contain the complete sequence of bases of a specific chromosome; rather, it provides
the sequences of a set of fragments of the whole genome. Hence the actual problem of individual
haplotyping is to find two haplotypes from a set of overlapping fragments of both copies
of a chromosome, where the fragments may contain errors and it is not known to which copy
of the chromosome a fragment belongs. The general formulation of the problem is given
below [BILR05].
“Given a set of fragments obtained by DNA sequencing from the two copies of a chromosome,
find the smallest amount of data to remove so that there exist two haplotypes compatible with
all the data remaining.”
Before describing some of the optimization problems of individual haplotyping, the mathe-
matical terminology is presented below.
3.2.1 Terminology
Let S be the set of n bi-allelic SNP sites over which the haplotypes will be constructed. Let F
be the set of m fragments of the two chromosome copies, where each fragment contains information
for a nonzero number of SNPs in S. Without loss of generality, let the two alleles of each SNP
be denoted by 0 and 1, which stand for two different elements of {A, T, G, C}. Thus each fragment f ∈ F is a
string of length n over the symbols {0, 1, −}, where ‘−’ denotes an undetermined or unknown SNP,
called a hole. All the fragments can be arranged in an m × n matrix M[f, s], called the SNP
matrix. An example of an SNP matrix is given in Table 3.2.
----1101------------
-----0001110101-----
11010010011---------
---10100---010------
---------10110101011
010111---------01011
Table 3.2: An SNP matrix.
A maximal consecutive sequence of ‘−’ symbols that lies between two non-hole symbols is called a gap. A
gapless SNP matrix is one that has no gap in any of its fragments. In the example, the
first, second, third and fifth rows have no gaps, while the fourth and sixth rows each have one gap.
Two fragments f and g are said to conflict if there exists an SNP position s where the two
fragments disagree, i.e. M[f, s] = 0 and M[g, s] = 1 or vice versa. If two fragments do not
conflict on any SNP, they are said to agree. Some pairs of fragments are given in Table 3.3 to
illustrate the ideas of conflict and agreement.
relation    pair of fragments
conflict    0111010101100
            ----010001---
agree       0111010101100
            ----010101---
agree       011101-------
            -------110001
conflict    010001-------
            -----00001---
Table 3.3: Different relations between fragments
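The notions of gap and conflict can be checked mechanically. The sketch below assumes fragments are stored as Python strings over the symbols 0, 1 and '-' (an ASCII hyphen standing for a hole); the function names `gap_count` and `conflict` are illustrative and do not come from the cited literature. The data are the rows of Table 3.2 and the pairs of Table 3.3.

```python
HOLE = "-"

def gap_count(fragment):
    """Number of maximal runs of holes that lie between two non-hole symbols."""
    core = fragment.strip(HOLE)  # leading and trailing holes are not gaps
    runs = [run for run in core.replace("0", "1").split("1") if run]
    return len(runs)

def conflict(f, g):
    """True if some SNP is covered by both fragments with different alleles."""
    return any(a != HOLE and b != HOLE and a != b for a, b in zip(f, g))

table_3_2 = [
    "----1101------------",
    "-----0001110101-----",
    "11010010011---------",
    "---10100---010------",
    "---------10110101011",
    "010111---------01011",
]
table_3_3 = [
    ("0111010101100", "----010001---"),
    ("0111010101100", "----010101---"),
    ("011101-------", "-------110001"),
    ("010001-------", "-----00001---"),
]

gaps = [gap_count(row) for row in table_3_2]
relations = ["conflict" if conflict(f, g) else "agree" for f, g in table_3_3]
print(gaps)       # [0, 0, 0, 1, 0, 1]
print(relations)  # ['conflict', 'agree', 'agree', 'conflict']
```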
An SNP matrix is error-free if its rows can be partitioned into two classes of non-conflicting
fragments. For a specific SNP matrix there may be more than one such partition. From
each class of non-conflicting fragments, a haplotype can be constructed by simply taking the
common allele of the fragments at each SNP site. Let the rows of M be partitioned into
two sets Ml and Mr such that the rows in each set are pairwise non-conflicting. The
construction of the haplotypes Hl and Hr from these two sets can be described
as
\[
H_{ij} =
\begin{cases}
1 & \text{if } M_{fj} = 1 \text{ for at least one row } f \in M_i \\
0 & \text{if } M_{fj} = 0 \text{ for at least one row } f \in M_i \\
- & \text{if } M_{fj} = - \text{ for all rows } f \in M_i
\end{cases}
\tag{3.1}
\]
where i ∈ {l, r} and j = 1, 2, . . . , n. Although actual haplotypes cannot have any holes,
haplotype assembly can introduce holes in the haplotypes if the fragments of a class carry no
allelic information for an SNP site. The general problem can now be restated as
finding an error-free matrix from a given SNP matrix through an optimal set of changes. The following
sections deal with different approaches to finding an error-free matrix, and finally a comparison
of the complexities of these approaches is presented.
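Equation 3.1 can be sketched as a column-wise merge of one class of pairwise non-conflicting fragments. The helper name `merge_class` is hypothetical, and fragments are assumed to be strings over 0, 1 and '-'. The example reuses the third pair of Table 3.3, which agrees but leaves one SNP uncovered.

```python
def merge_class(fragments):
    """Build one haplotype from pairwise non-conflicting fragments (Eq. 3.1)."""
    haplotype = []
    for column in zip(*fragments):
        alleles = {c for c in column if c != "-"}
        if len(alleles) > 1:
            raise ValueError("fragments of one class must not conflict")
        # A column covered by no fragment leaves a hole in the haplotype.
        haplotype.append(alleles.pop() if alleles else "-")
    return "".join(haplotype)

print(merge_class(["011101-------", "-------110001"]))  # 011101-110001
```

The hole at the seventh position shows how an assembled haplotype can remain partly undetermined when no fragment of the class covers a site.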
3.2.2 Minimum Fragment Removal : MFR
The fragment data generated by DNA sequencing contain not only overlapping fragments
but also multiple fragments from the same region of the DNA. Consequently there are many
redundant fragments, and removing an optimal set of fragments to make the matrix
error-free is the most studied of these SNP problems.
Problem : Minimum Fragment Removal – MFR
Input : An SNP matrix M .
Output : Find the smallest set of fragments (rows) whose removal makes M error-free.
MFR has been modeled using a graph so that its tractability can be studied. Let GF = (F, EF)
be the fragment conflict graph, in which there is an edge between two fragments if they conflict on
any of their common SNPs. Finding two classes of non-conflicting fragments is then
equivalent to obtaining a bipartite graph from GF by removing the minimum number of vertices, as
illustrated in Fig. 3.1. Here Fig. 3.1(a) is an SNP matrix, Fig. 3.1(b) is its fragment conflict
graph and Fig. 3.1(c) is the bipartite graph that results after removing fragment E.
[Figure 3.1: (a) an SNP matrix with fragments A, B, C, D and E; (b) its fragment conflict graph; (c) the bipartite graph on A, B, C and D obtained after removing E]
Figure 3.1: Minimum fragment removal using fragment conflict graph
Depending upon the presence of gaps, the time complexity of MFR varies.
For a gapless SNP matrix, MFR is polynomial time solvable [LBIS01]. Dynamic programming
is a suitable approach for solving MFR on a gapless SNP matrix [BILR05]. The MFR problem
with a gapped SNP matrix is NP-hard [BILR05].
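The graph model can be sketched as follows. The code builds the fragment conflict graph, tests it for bipartiteness by 2-colouring, and solves MFR by exhaustive search. This brute force is exponential in the number of fragments and only illustrates the problem definition; it is not the dynamic programming method of [BILR05]. The toy matrix and all function names are assumptions made for the illustration.

```python
from itertools import combinations

def conflict(f, g):
    return any(a != "-" and b != "-" and a != b for a, b in zip(f, g))

def conflict_graph(rows):
    """Adjacency list of the fragment conflict graph, indexed by row number."""
    n = len(rows)
    return {i: [j for j in range(n) if j != i and conflict(rows[i], rows[j])]
            for i in range(n)}

def two_colour(rows):
    """Return a 2-colouring of the conflict graph, or None if it is not bipartite."""
    graph = conflict_graph(rows)
    colour = {}
    for start in graph:
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for w in graph[u]:
                if w not in colour:
                    colour[w] = 1 - colour[u]
                    stack.append(w)
                elif colour[w] == colour[u]:
                    return None  # odd cycle: the matrix is not error-free
    return colour

def mfr_brute_force(rows):
    """Smallest set of row indices whose removal makes the matrix error-free."""
    for size in range(len(rows) + 1):
        for removed in combinations(range(len(rows)), size):
            kept = [r for i, r in enumerate(rows) if i not in removed]
            if two_colour(kept) is not None:
                return list(removed)

# Toy matrix: rows 0, 1 and 2 conflict pairwise, so one of them must go.
toy = ["00", "11", "01", "0-"]
print(two_colour(toy))       # None
print(mfr_brute_force(toy))  # [0]
```

The two colour classes of the remaining graph are exactly the two classes of non-conflicting fragments from which the haplotypes are built.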
Variation
MFR deletes fragments only when necessary. However, when there are several fragments, the
deletion of any one of which would make the matrix error-free, MFR does not distinguish between them. Therefore, a
variation of MFR has been proposed and is described below [CIKT05].
The length of a haplotype Hi, i ∈ {l, r}, is the number of its non-hole SNP sites. Let L(M) be the sum
of the lengths of the two haplotypes reconstructed from an error-free matrix M. In
this problem, the matrix M is made error-free by removing fragments in such a way that the
haplotypes constructed from the matrix achieve maximum length. Thus the goal of the new version
is to maximize L(M) while removing fragments.
Problem : Longest Haplotype Reconstruction – LHR
Input : An SNP matrix M .
Output : Find the smallest set of fragments (rows) whose removal makes M error-free with
maximum L(M)
Like MFR, LHR is polynomial time solvable for the gapless case and NP-hard for the
gapped case. The complexity status of LHR for “at most k holes” is still open. Here
too, a dynamic programming algorithm can be used to solve LHR for a gapless SNP
matrix [CIKT05].
3.2.3 Minimum SNP Removal : MSR
If a fixed sequencing technique is applied to all the fragments, it is likely that errors occur
at the same SNPs in all the fragments. For this reason, MSR makes the SNP matrix error-free
by removing columns from the matrix. This problem is similar to MFR from an algorithmic
viewpoint.
Problem : Minimum SNP Removal – MSR
Input : An SNP matrix M .
Output : Find the smallest set of SNPs (columns) whose removal makes M error-free.
A conflict between two SNPs is defined analogously to a conflict between fragments. Two SNPs
s and t are said to be conflicting if both of their columns contain both 0 and 1 and
there exist two fragments f and g such that M[f, s], M[f, t], M[g, s] and M[g, t] are non-hole
symbols of which exactly one is different from the other three. Let GS = (S, ES) be the SNP conflict
graph, in which there is an edge between two SNPs if they conflict. For an error-free SNP matrix,
the vertices of GS form an independent set, i.e. ES = φ [LBIS01]. Hence, making an SNP matrix
error-free by minimum SNP removal is equivalent to finding a maximum independent set of GS
and removing all vertices (SNPs) s ∈ S that are not in that maximum independent set.
Here Fig. 3.2(a) is an SNP matrix, Fig. 3.2(b) is its SNP conflict graph and Fig. 3.2(c) is the
maximum independent set that results after removing SNPs 3, 5 and 6 from S.
Depending upon the presence of gaps, the time complexity of MSR also varies. For a gapless
SNP matrix, GS is a perfect graph with no chordless cycle of length greater than 4, and for perfect
graphs the maximum independent set problem is polynomial time solvable. Hence, for a gapless
SNP matrix, MSR is polynomial time solvable. MSR can also be solved in polynomial time for
one class of gapped SNP matrices, called C1P matrices [LBIS01].
[Figure 3.2: (a) an SNP matrix with SNPs 1 to 6; (b) its SNP conflict graph; (c) the maximum independent set {1, 2, 4} left after removing SNPs 3, 5 and 6]
Figure 3.2: Minimum SNP removal using SNP conflict graph
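The SNP conflict test and the reduction to an independent set can be sketched as below. The search is an exhaustive one over sets of columns (equivalently, a minimum vertex cover of GS, whose complement is a maximum independent set); it is exponential and is not the polynomial algorithm for perfect graphs mentioned above. The toy matrix and the function names are assumptions for illustration only.

```python
from itertools import combinations

def has_both_alleles(rows, s):
    column = {r[s] for r in rows}
    return "0" in column and "1" in column

def snp_conflict(rows, s, t):
    """True if SNPs s and t conflict according to the definition above."""
    if not (has_both_alleles(rows, s) and has_both_alleles(rows, t)):
        return False
    for f, g in combinations(rows, 2):
        cells = [f[s], f[t], g[s], g[t]]
        if "-" in cells:
            continue
        # Exactly one of the four symbols differs from the other three.
        if cells.count("1") in (1, 3):
            return True
    return False

def conflict_edges(rows):
    n = len(rows[0])
    return [(s, t) for s, t in combinations(range(n), 2) if snp_conflict(rows, s, t)]

def msr_brute_force(rows):
    """Smallest set of columns meeting every edge of the SNP conflict graph."""
    n = len(rows[0])
    edges = conflict_edges(rows)
    for size in range(n + 1):
        for removed in combinations(range(n), size):
            if all(s in removed or t in removed for s, t in edges):
                return list(removed)

toy = ["000", "111", "010"]   # columns are numbered from 0 here
print(conflict_edges(toy))    # [(0, 1), (1, 2)]
print(msr_brute_force(toy))   # [1]
```

After dropping the middle column the rows become 00, 11 and 00, which split into two classes of non-conflicting fragments.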
3.2.4 Minimum Error Correction : MEC
With the advent of sophisticated sequencing methods, errors in SNP matrices are becoming
fewer. As a result, correcting these errors rather than removing a whole row or column
is the preferable approach. An erroneous SNP value can be corrected by simply flipping it to the
other allele. This would not be possible in a multi-allelic SNP matrix (one with more than two
possible symbols at an SNP). A good example of saving information by correcting
errors is the SNP matrix in Fig. 3.1. In that matrix fragment E conflicts with
fragment B only at the last SNP that is common to them. Thus flipping that SNP of fragment
E to 0 results in an error-free matrix with the partition {A, C} and {B, D, E}. MEC
retains the information of the other four alleles of fragment E, while MFR would delete all
five alleles from the matrix.
Problem : Minimum Error Correction – MEC
Input : An SNP matrix M .
Output : The smallest set of SNP alleles whose flipping makes M error-free.
The MEC problem is NP-hard even for a gapless SNP matrix [CIKT05]. Hence, MEC is more difficult
than gapless MFR or MSR, which have polynomial time algorithms. For the 1-gap case MEC has been
proved to be APX-hard. It has been shown that the haplotype assembly problem has a reasonable
input size for practical exact algorithms [Huf05]. An exact algorithm based on the branch and
bound technique is available; it searches all possible pairs of haplotypes to find the solution
[WWLZ05]. We now present a heuristic algorithm for the minimum error
correction problem.
3.2.5 A Heuristic Algorithm for MEC Problem
Terminologies and Definitions
Before describing the algorithm, let us restate the terminology. Let M = {Mij} be an SNP
matrix of dimension m × k, where Mij ∈ {0, 1, −}. There are m fragments, each of which has
either a fixed allele (0 or 1) or a hole (‘−’) at each of its k SNP sites.
Conceptually, the matrix M = {M1, M2, . . . , Mm} is a set of m fragments, where a fragment
Mi = {Mi1, Mi2, . . . , Mik} is an array of k alleles or holes (see Fig. 3.3(a)). A fragment Mi is
said to cover the jth SNP if Mij ∈ {0, 1} and to skip the jth SNP if Mij = ‘−’. Let
Ms and Mt be two fragments. The distance between the two fragments, D(Ms, Mt), is defined as
the number of SNPs that are covered by both fragments and have different alleles. For
example, in Fig. 3.3(a), D(M3, M2) = 4.
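\[
D(M_s, M_t) = \sum_{j=1}^{k} d(M_{sj}, M_{tj})
\tag{3.2}
\]
where d(x, y) is defined as
\[
d(x, y) =
\begin{cases}
1 & \text{if } x \neq - \text{ and } y \neq - \text{ and } x \neq y \\
0 & \text{otherwise}
\end{cases}
\tag{3.3}
\]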
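Equations 3.2 and 3.3 translate directly into code. The following sketch assumes fragments are strings over 0, 1 and '-'; the function names mirror the symbols d and D but are otherwise arbitrary. The example pair is the first pair of Table 3.3.

```python
def d(x, y):
    """Eq. 3.3: 1 if both symbols are alleles and they differ, else 0."""
    return 1 if x != "-" and y != "-" and x != y else 0

def distance(ms, mt):
    """Eq. 3.2: number of SNPs covered by both fragments with different alleles."""
    return sum(d(x, y) for x, y in zip(ms, mt))

print(distance("0111010101100", "----010001---"))  # 1
print(distance("011101-------", "-------110001"))  # 0
```

A distance of zero does not mean the fragments are identical; it only means they agree wherever both are covered.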
Two fragments Ms and Mt are said to be conflicting if D(Ms, Mt) > 0. Let P(C1, C2)
be a partition of M, where C1 and C2 are two collections (sets) of fragments taken from M such
that C1 ∪ C2 = M and C1 ∩ C2 = φ. A partition is shown in Fig. 3.3(b). M is
an error-free matrix if and only if there exists a partition P(C1, C2) of M such that any two
fragments x, y ∈ Ci, i ∈ {1, 2}, are non-conflicting; in other words, D(x, y) = 0 for all x
and y belonging to the same collection Ci. Such a partition is called an error-free partition. The
partition in the figure is not error-free because D(M1, M2) > 0 and D(M5, M6) > 0 introduce errors
in C1 and C2 respectively.
[Figure 3.3: (a) an SNP matrix M with six fragments over ten SNP sites; (b) a partition P of M with C1 = {1, 2, 3} and C2 = {4, 5, 6}; (c) the haplotypes H1 = 0100000000 and H2 = 0100000001]
Figure 3.3: SNP matrix and its partition
The construction of the two haplotypes from an error-free partition is
similar to that of LHR. A haplotype Hi, i ∈ {1, 2}, is constructed by taking the common allele of
the component fragments of the collection Ci. The mathematical expression that
describes this construction is
\[
H_{ij} =
\begin{cases}
1 & \text{if at least one fragment in } C_i \text{ has a 1 in the } j\text{th SNP} \\
0 & \text{if at least one fragment in } C_i \text{ has a 0 in the } j\text{th SNP} \\
- & \text{if all the fragments in } C_i \text{ skip the } j\text{th SNP}
\end{cases}
\tag{3.4}
\]
If the matrix M is not error-free, there is no error-free partition P, and in every possible
partition there are conflicts between fragments in one or both of the collections.
Thus, for such an M, there is no partition from which two haplotypes can be constructed by simply
taking the common allele at each SNP site.
So far we have described when an SNP matrix is erroneous; the question now is how to correct
an erroneous SNP matrix to make it error-free. If we are given a partition P(C1, C2) of an
erroneous matrix M and the two actual haplotypes H1 and H2, the number of errors E(P) that
must be corrected is simply the sum of the distances of all the fragments from their corresponding
haplotypes:
\[
E(P) = \sum_{i=1}^{2} \sum_{f \in C_i} D(f, H_i)
\tag{3.5}
\]
The MEC problem thus reduces to finding a partition P that minimizes the error function
E(P).
Algorithm - HMEC
To minimize E(P), we would need to search all possible partitions of the matrix M, which
requires running time exponential in the number of fragments in M. Moreover, E(P) cannot be
calculated exactly because the real haplotypes are not known. Hence,
we have to approximate E(P) by constructing two haplotypes from the given partition P.
For the best approximation we should construct the haplotypes that conflict least with
the fragments of their corresponding collections. Therefore, for each SNP site, the haplotype
Hi should take the allele that is present in the majority of the fragments in Ci. In case of a tie, 0 is
favored because it is more common than 1. Fig. 3.3(c) shows the two haplotypes H1 and H2
constructed in this way for the partition P in Fig. 3.3(b). To define
this construction more formally, let $N^0_j(C_i)$ be the number of fragments in a collection
Ci that have 0 in the jth SNP. Similarly, $N^1_j(C_i)$ denotes the number of fragments that have 1. To minimize the
number of errors E(P) for a specific partition P, the haplotypes should be constructed following
the rule
\[
H_{ij} =
\begin{cases}
1 & \text{if } N^1_j(C_i) > N^0_j(C_i) \\
0 & \text{if } N^0_j(C_i) \geq N^1_j(C_i) \text{ and } N^0_j(C_i) \neq 0 \\
- & \text{if } N^1_j(C_i) = N^0_j(C_i) = 0
\end{cases}
\tag{3.6}
\]
where i ∈ {1, 2} and j = 1, 2, . . . , k.
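The majority rule of Eq. 3.6 and the resulting estimate of E(P) from Eq. 3.5 can be sketched as follows. The collections below are toy data invented for the illustration (the matrix of Fig. 3.3 is not reproduced here), and `majority_haplotype` and `error` are hypothetical helper names.

```python
def distance(f, h):
    return sum(1 for x, y in zip(f, h) if x != "-" and y != "-" and x != y)

def majority_haplotype(collection, k):
    """Eq. 3.6: per SNP take the majority allele, favour 0 on ties, '-' if uncovered."""
    haplotype = []
    for j in range(k):
        n0 = sum(1 for f in collection if f[j] == "0")
        n1 = sum(1 for f in collection if f[j] == "1")
        if n1 > n0:
            haplotype.append("1")
        elif n0 > 0:          # here n0 >= n1 and n0 != 0
            haplotype.append("0")
        else:                 # n0 == n1 == 0
            haplotype.append("-")
    return "".join(haplotype)

def error(c1, c2, k):
    """Eq. 3.5 evaluated with the majority haplotypes in place of the real ones."""
    total = 0
    for collection in (c1, c2):
        h = majority_haplotype(collection, k)
        total += sum(distance(f, h) for f in collection)
    return total

c1 = ["0011", "0-10", "01-1"]   # toy collections over k = 4 SNPs
c2 = ["1100", "11--"]
print(majority_haplotype(c1, 4))  # 0011
print(majority_haplotype(c2, 4))  # 1100
print(error(c1, c2, 4))           # 2
```

The two errors counted for C1 are the last allele of the second fragment and the second allele of the third fragment, each of which disagrees with the majority haplotype.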
To find the best partition we use a local search heuristic. The algorithm iteratively
searches for a partition better than the current one and moves to it. Because
the algorithm searches within a small set of partitions, the chosen partition may not
lead to the optimum solution and the algorithm may get trapped in a local optimum.
The algorithm is inspired by the well-known Fiduccia and Mattheyses (FM) algorithm
for bipartitioning a hypergraph while minimizing the cut size.
The algorithm starts with an arbitrary partition, for example P(M, φ), and iteratively
searches for a better partition. In each iteration the algorithm performs a sequence of transfers of
distinct fragments from their present collection to the other one so that the partition becomes
less erroneous. It should be noted that transferring a fragment can either increase or
decrease the error function E(P).
Let the partition before transferring a fragment f be Pp and the resulting partition be Pn. We
define the gain of the transfer as Gain(f) = E(Pp) − E(Pn). Let F = <f1, f2, . . . , fm>
be an ordering of all the fragments in a partition P such that fragment fi precedes
fragment fj if, once all the fragments preceding fi have been transferred to form an intermediate
partition Pi, Gain(fi) ≥ Gain(fj) over Pi. Thus the first intermediate partition P1 is the
current partition Pc of the ongoing iteration. We also define the cumulative gain of a fragment
ordering F up to its nth fragment as
\[
CGain(F, n) = \sum_{j=1}^{n} Gain(f_j)
\]
Note that the gains
used to compute the cumulative gain are calculated over different intermediate partitions. The
maximum cumulative gain, MCGain(F), is defined as
\[
MCGain(F) = \max_{1 \leq i \leq m} CGain(F, i)
\]
In each iteration the algorithm finds the ordering Fc of the current partition Pc and
transfers only those fragments of Fc that are needed to achieve MCGain(Fc). The fragment that is
the last to be transferred is referred to as fmax. Thus, the algorithm moves from one partition to
another, reducing the error function. Since it is a local search algorithm, it continues
as long as MCGain(Fc) > 0 and stops as soon as MCGain(Fc) ≤ 0.
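The cumulative gain and the choice of fmax amount to a prefix-sum computation. The sketch below uses the transfer order and cumulative gains recorded in the log table of Fig. 3.6; the individual gains are the differences of consecutive CGain entries. Taking the first position that attains the maximum is an assumption about tie-breaking, which the definitions above leave open.

```python
from itertools import accumulate

order = [2, 6, 5, 4, 1, 3]        # fragments in the order they were transferred
gains = [2, 2, 0, -1, -1, -2]     # gain of each transfer over its intermediate partition

cgain = list(accumulate(gains))   # CGain(F, n) for n = 1..m
mcgain = max(cgain)               # MCGain(F)
n_keep = cgain.index(mcgain) + 1  # assumption: first position reaching the maximum
f_max = order[n_keep - 1]
kept, rolled_back = order[:n_keep], order[n_keep:]

print(cgain)              # [2, 4, 4, 3, 2, 0]
print(mcgain, f_max)      # 4 6
print(kept, rolled_back)  # [2, 6] [5, 4, 1, 3]
```

Because the last cumulative gain is the gain of swapping the two collections entirely, it is zero here, and only a proper prefix of the ordering can improve the partition.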
Implementation and Complexity
Several implementation issues of the algorithm described above need to be discussed, together
with the data structures used to address them.
To find Fc in each iteration, the algorithm repeatedly transfers the fragment that has not been
transferred previously in this iteration and has the maximum gain over all such fragments. To
accomplish this we use a locking mechanism. At the beginning of each iteration all the fragments
are set free. The free fragment with maximum gain is found and tentatively transferred
to the other collection. After the transfer the fragment is locked in its new collection. Starting
from P1 = Pc, this tentative transfer creates the intermediate partition P2. The algorithm then finds the next
free fragment with maximum gain in P2 and transfers and locks that fragment to create P3.
Free fragments are transferred in this way until all the fragments are locked, and the order of the
transfers (Fc) is stored in a log table along with the cumulative gains (CGain). MCGain is
the maximum CGain and fmax is the fragment corresponding to MCGain in the log table.
After all these tentative transfers Pc has been changed to an unintended
partition, so the algorithm checks the log to find MCGain(Fc) and fmax and rolls back the
transfers of all the fragments that were transferred after fmax. When the rollback completes,
Pc is ready for the next iteration.
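[Figure 3.4: the present partition Pp with C1 = {1, 2, 3}, C2 = {4, 5, 6}, H1 = 0100000000, H2 = 0100000001 and E(Pp) = 8; the next partition Pn, obtained by transferring fragment 2, with C1 = {1, 3}, C2 = {2, 4, 5, 6}, H1 = 01000001−−, H2 = 0100101001 and E(Pn) = 6; hence Gain(2) = E(Pp) − E(Pn) = 2]
Figure 3.4: An example calculation of Gain measure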
Before each tentative transfer, the algorithm needs to find the fragment
with maximum gain among the free fragments (those not yet transferred). This requires
calculating the gain of each of them. To calculate Gain(f) = E(Pp) − E(Pn) for a fragment we
need the error values of two different partitions: the present intermediate partition
and the next partition, which results if f is transferred. Each of these error values
requires the two haplotypes of the corresponding collections (see Fig. 3.4).
Although E(Pp) and the haplotypes of Pp are known from the previous transfer, the calculation
of E(Pn) requires constructing the haplotypes of Pn. Since Pp and Pn differ
by only one transfer, we can compute the haplotypes $H^n_j$, j ∈ {1, 2}, of the
next partition incrementally from the haplotypes $H^p_i$, i ∈ {1, 2}, of the present partition. For this purpose, the
algorithm stores the $N^1_j(C^p_i)$ and $N^0_j(C^p_i)$ values of the present partition. After a transfer these
values are each incremented or decremented by 1 or remain the same. Hence, it is
possible to construct $H^n_j$, j ∈ {1, 2}, in O(k) time. Computing E(Pn) from the haplotypes
requires O(mk) time. Therefore, the running time to compute E(Pn), and hence
Gain(f), is O(mk + k).
For each intermediate partition Pi, i = 1, . . . , m, we need to compute the gains of the
remaining unlocked fragments to find the maximum one. Transferring the chosen fragment requires
updating $N^1_j(C_i)$ and $N^0_j(C_i)$ for i ∈ {1, 2} and j = 1, 2, . . . , k, which also takes O(k)
time. Finally, there are m such transfers in each iteration. Thus each iteration requires
$O(m(m(mk + k) + k)) = O(m^3 k)$ running time.
Approximation to Improve Performance
For a large SNP matrix, the $O(m^3 k)$ running time is critical to the performance of the algorithm. We
can approximate Gain(f) by using only the fragment f itself and
ignoring the m − 1 other fragments. The approximate gain is
\[
AppxGain(f) = D(H^p_i, f) - D(H^n_j, f)
\tag{3.7}
\]
where $H^p_i$ is the haplotype of f’s present collection Ci in partition Pp and $H^n_j$ is the
haplotype of f’s next collection Cj in partition Pn. This function ignores the effect of fragments
other than f on Gain(f) but reduces the time to calculate a gain to O(k). The total running
time of each iteration then becomes $O(m^2 k)$.
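A minimal sketch of Eq. 3.7 follows. It reads $H^n_j$ as the majority haplotype of the destination collection after f has been added to it, which is one plausible interpretation of "the haplotype of f's next collection in Pn". For brevity the haplotypes are recomputed from scratch rather than updated from stored counts, so this sketch does not itself achieve the O(k) bound. The collections are toy data and the function names are hypothetical.

```python
def distance(h, f):
    return sum(1 for x, y in zip(h, f) if x != "-" and y != "-" and x != y)

def majority_haplotype(collection):
    """Majority rule of Eq. 3.6 for a non-empty collection of fragments."""
    haplotype = []
    for column in zip(*collection):
        n0, n1 = column.count("0"), column.count("1")
        haplotype.append("1" if n1 > n0 else "0" if n0 > 0 else "-")
    return "".join(haplotype)

def appx_gain(f, present, other):
    """Eq. 3.7, where `present` contains f and `other` is where f would move."""
    h_present = majority_haplotype(present)
    h_next = majority_haplotype(other + [f])  # assumed reading of H^n_j
    return distance(h_present, f) - distance(h_next, f)

present = ["0000", "0000", "1111"]   # toy data: the last fragment looks misplaced
other = ["111-", "-111"]
print(appx_gain("1111", present, other))  # 4
```

A large positive value indicates that the fragment fits the haplotype of the other collection much better than that of its own.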
A Simulated Example
In Fig. 3.5 we present an example iteration of HMEC. We consider that the current partition
Pc = P1 is the partition given in Fig. 3.3(b) for the matrix M . All the intermediate partitions
Pi, i ∈ {1, . . . , 7} are shown sequentially and the gains of each fragment over the intermediate
Algorithm HMEC(M)
    Pc = P(M, φ)
    FREE LOCKS()
    CLEAR LOG()
    repeat always
        while there is an unlocked fragment in Pc do
        begin
            find a free fragment f so that Gain(f) is maximum
            transfer f to the other collection
            update the haplotypes after the transfer
            LOCK(f)
            LOG RECORD(f, Gain(f))
        end
        FREE LOCKS()
        check the log and find MCGain(Fc) and fmax
        if MCGain(Fc) > 0
        begin
            set new Pc by rolling back the transfers
                that occurred after the transfer of fmax
            calculate haplotypes of Pc
            CLEAR LOG()
            continue the loop
        end
        else
        begin
            terminate the algorithm and output current haplotypes
        end
    end repeat
Table 3.4: The pseudocode for the HMEC algorithm
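The pseudocode can be turned into a compact runnable sketch. The version below is a simplified reading of HMEC, not the implementation evaluated in this thesis: it recomputes E(P) from scratch for every candidate transfer (the exact-gain variant, without incremental counts or the approximation of Eq. 3.7), breaks ties by lowest fragment index and by the first maximum of the cumulative gain, and assumes a non-empty matrix. The toy fragments are invented for the example.

```python
HOLE = "-"

def distance(f, h):
    return sum(1 for x, y in zip(f, h) if x != HOLE and y != HOLE and x != y)

def majority_haplotype(frags, k):
    hap = []
    for j in range(k):
        n0 = sum(1 for f in frags if f[j] == "0")
        n1 = sum(1 for f in frags if f[j] == "1")
        hap.append("1" if n1 > n0 else "0" if n0 > 0 else HOLE)
    return "".join(hap)

def error(matrix, side):
    """Estimated E(P); side[i] in {0, 1} is the collection of fragment i."""
    k = len(matrix[0])
    total = 0
    for c in (0, 1):
        frags = [f for f, s in zip(matrix, side) if s == c]
        hap = majority_haplotype(frags, k)
        total += sum(distance(f, hap) for f in frags)
    return total

def hmec(matrix):
    m, k = len(matrix), len(matrix[0])
    side = [0] * m                        # start from the partition P(M, empty)
    while True:
        trial = side[:]                   # tentative transfers act on a copy
        free = set(range(m))              # all locks released
        log, cgain = [], 0
        current = error(matrix, trial)
        while free:
            best, best_err = None, None
            for f in sorted(free):        # free fragment with maximum gain
                trial[f] ^= 1
                e = error(matrix, trial)
                trial[f] ^= 1
                if best is None or e < best_err:
                    best, best_err = f, e
            trial[best] ^= 1              # transfer the fragment and lock it
            free.remove(best)
            cgain += current - best_err
            current = best_err
            log.append((best, cgain))
        cgains = [c for _, c in log]
        mcgain = max(cgains)
        if mcgain <= 0:
            break                         # no improving prefix: local optimum
        n_keep = cgains.index(mcgain) + 1
        for f, _ in log[:n_keep]:         # keep transfers up to f_max only
            side[f] ^= 1
    parts = [[f for f, s in zip(matrix, side) if s == c] for c in (0, 1)]
    h1, h2 = (majority_haplotype(p, k) for p in parts)
    return h1, h2, error(matrix, side)

# Toy fragments sampled without errors from the haplotypes 000000 and 111111.
toy = ["000---", "-000--", "--0000", "111---", "--111-", "---111"]
print(hmec(toy))  # ('111111', '000000', 0)
```

On this input a single pass transfers the three all-zero fragments to the second collection, after which no further transfer has positive cumulative gain and the search stops.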
[Figure 3.5: the intermediate partitions P1 to P7 of one iteration, starting from C1 = {1, 2, 3} and C2 = {4, 5, 6}. Locked fragments are circled. The gains of the free fragments listed beside the partitions are:
P1: Gain(1)=0, Gain(2)=2, Gain(3)=2, Gain(4)=1, Gain(5)=1, Gain(6)=1
P2: Gain(1)=−1, Gain(3)=−2, Gain(4)=−2, Gain(5)=1, Gain(6)=2
P3: Gain(1)=−1, Gain(3)=−2, Gain(4)=−1, Gain(5)=0
P4: Gain(1)=−2, Gain(3)=−3, Gain(4)=−1
P5: Gain(1)=−1, Gain(3)=−3
P6: Gain(3)=−2]
Figure 3.5: An example iteration of HMEC
partitions are shown on the right of each partition. The free fragment with maximum gain is
marked in each intermediate partition. For example, the maximum gaining fragment in P2 is
fragment 6 with gain 2. After each transfer the transferred fragment is marked as locked by a circle. Here,
the ordering Fc of the fragments is < 2, 6, 5, 4, 1, 3 >, which is also the order in which the
fragments are locked. This ordering is stored in the log along with the CGains (see Fig. 3.6). All the
tentative transfers after fmax (fragment 6) have to be rolled back, so that P3, the partition
reached after transferring fragments 2 and 6, becomes the next Pc.
Log Table
Fc:     2  6  5  4  1  3
CGain:  2  4  4  3  2  0
MCGain = 4
Figure 3.6: An example log table
Performance Evaluation
We have implemented the algorithm and tested it thoroughly. We first chose two independent
sets of haplotypes. One of them was based on real sequences of DCP1 genes, which contain a lot
of similarity among the haplotypes [RTCN99]. There were 7 pairs of haplotypes of 53 SNPs
in this set. The other set consisted of randomly generated sequences. There were 6
pairs of haplotypes of 90 SNPs in this set. On each of these sets of haplotypes, we ran a
simulated sequencing operation that generates a set of fragments (i.e. an SNP matrix) from
every pair of haplotypes within the set. The simulation program was controlled in several ways.
We varied the number of fragments in the matrix as well as the length (number of non-hole SNP
sites) of each fragment. We also varied the maximum amount of error that was introduced
into the matrix. We ran our algorithm on each of the SNP matrices and it reconstructed
the haplotypes well. The worst scenario that was tested was a matrix from
the generated haplotypes with only 40 fragments, each having 60% holes and 40%
errors in the alleles. In this case our algorithm reconstructed on average 85% of the haplotypes. This
percentage is quite good considering the randomness of the SNP matrix, and it is
better than that of a genetic algorithm for MEC [WWLZ05]. Besides being more accurate, the algorithm
is much more efficient in execution time. It took 2.46 sec to process all 6 matrices of dimension
40 × 90.
The most successful feature of this search algorithm is its independence of gaps. The positions
of the holes (i.e. gaps) in the SNP matrix make no difference to the HMEC algorithm.
That is why the number of gaps was not a control parameter of the sequencing simulation.
An algorithmic comparison with the FM algorithm for bipartitioning with minimum cut
size reveals that the FM algorithm updates the gain values in O(m) time in one pass,
whereas our algorithm takes O(mk) time if no approximation is applied.
Variation
A new model of MEC has been proposed that incorporates the genotype information of the
organism, which is more readily available, to increase the correctness of the solution [WWLZ05].
From the algorithmic point of view there is another variation of MEC [CITK05], whose
complexity status is open.
Problem : Binary-Witness-MEC
Input : An SNP matrix M that does not contain any holes.
Output : For an input matrix M of size n × m, two haplotypes H1, H2 ∈ {0, 1}^m minimizing
\[ D(H_1, H_2) = \sum_{\text{rows } r \in M} \min\bigl(d(r, H_1),\, d(r, H_2)\bigr) \]
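To make the objective concrete, the following sketch evaluates D(H1, H2) and finds a minimizing pair by exhaustive search. It assumes that d is the Hamming distance and that rows and haplotypes are strings over '0' and '1'; the function names are hypothetical, and the search is exponential in m, so it is only meant for very small matrices.

```python
from itertools import product


def hamming(r, h):
    """Number of sites where row r and haplotype h differ."""
    return sum(a != b for a, b in zip(r, h))


def objective(M, h1, h2):
    """D(H1, H2): every row is charged its distance to the nearer haplotype."""
    return sum(min(hamming(r, h1), hamming(r, h2)) for r in M)


def brute_force(M):
    """Try every pair of haplotypes in {0,1}^m; returns (cost, H1, H2)."""
    m = len(M[0])
    candidates = ["".join(bits) for bits in product("01", repeat=m)]
    return min((objective(M, h1, h2), h1, h2) for h1 in candidates for h2 in candidates)


M = ["000", "001", "111", "110"]
best = brute_force(M)
```

For this matrix no pair of haplotypes can match more than two of the four distinct rows exactly, so the minimum value of D is 2.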
3.2.6 Comparison of Different Individual Haplotyping Problems
We described four haplotype assembly problems in this section: MFR, LHR, MSR and MEC.
All of these problems have been studied extensively over the last few years, and several
algorithms have been established for each setting. Several papers have been published on the
complexity status of the problems, and several new models derived from these four problems
have appeared in the last couple of years. The problems vary in nature because of the presence
of gaps in the input SNP matrix. It should be noted here that a hole in a fragment is not
necessarily a gap: a gap is a run of one or more consecutive holes that lies between two
nonempty sequences of alleles. The HMEC algorithm that we proposed performs significantly
better than a genetic algorithm. Other than these two, we did not find any approximation
algorithm for the other problems. The most recent state of these four problems (to the best of
our knowledge) is compiled in Table 3.5.
Problem   Gap and hole type   Time                            Reference
MFR       Ungapped            O(m^2 n + m^3)                  [BILR05]
          At most k holes     O(2^{2k} m^2 n + 2^{3k} m^3)    [BILR05]
          At most 1-gap       NP-Hard                         [LBIS01]
MSR       Ungapped            O(mn^2)                         [BILR05]
          At most k holes     O(2^k mn^2)                     [BILR05]
          At most 1-gap       NP-Hard                         [BILR05]
LHR       Ungapped            O(mn^2 + n^3)                   [CITK05]
          At most k holes     ?
          At most 1-gap       NP-Hard                         [CITK05]
MEC       Ungapped            NP-Hard                         [CITK05]
          At most k holes     NP-Hard                         [CITK05]
          At most 1-gap       NP-Hard                         [CITK05]
Table 3.5: Comparison among the SNP problems
3.3 Population Haplotyping
Sequencing the DNA of one individual and obtaining all of its haplotype information is not the
major concern of researchers who deal with disease analysis, drug design, hybridization of crops,
etc. They need the trend of SNPs in the haplotypes of many individuals of a particular diploid
organism. However, even the most powerful sequencers available today cannot generate large
scale haplotype data in a short time and at low cost. The more economical method for
population haplotype mapping is to generate large scale “genotype” data and compute the
map from it. To be more precise, a genotype is not a fragment of a haplotype; rather, it is the
combined information of the fragments of the two haplotypes. A genotype can only say whether
an individual is homozygous or heterozygous at a given SNP site. Population haplotyping
problems take the genotypes of a large number of individuals over a fixed number of SNP sites
and infer the haplotypes of this population. A haplotype is the sequence of SNP values on one
copy of a chromosome of an individual. A genotype vector, or simply genotype, represents two
haplotypes as a sequence of unordered pairs of alleles. Each pair represents the nucleotides at a
given site. Table 3.6 shows the genotype for a chromosome.
paternal copy: atc T cat G gat G ctc T ctaa
maternal copy: atc A cat G gat C ctc T ctaa
Haplotype 1: T G G T
Haplotype 2: A G C T
Genotype: (T,A) (G,G) (C,G) (T,T)
Table 3.6: A chromosome and its genotype
Before describing some of the haplotype inference problems, we introduce the mathematical
terminology.
3.3.1 Terminology
For bi-allelic SNPs, a haplotype can also be represented as a string over the symbols {0, 1}.
A genotype is then a string of unordered pairs over the set {0, 1}. Whenever an SNP site is
homozygous, the genotype pair is either (0,0) or (1,1), while for a heterozygous site the pair is
(0,1). For example, the two haplotypes 0,1,1 and 1,0,1 of length 3 are combined into the genotype
(0,1),(0,1),(1,1). Since only 3 distinct pairs are possible, we can use a different alphabet
{0, 1, ?} to represent genotypes more concisely. For the homozygous pairs (0,0) and (1,1) we
use 0 and 1 respectively, and for the heterozygous pair we use ?. Hence, for the previous
example the genotype is ?,?,1.
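The encoding over {0, 1, ?} can be illustrated with a few lines of Python. This is a small sketch in which haplotypes and genotypes are plain strings; the function name genotype is our own choice.

```python
def genotype(h, k):
    """Combine two haplotypes over {0,1} into a genotype over {0,1,?}."""
    if len(h) != len(k):
        raise ValueError("haplotypes must have the same length")
    # equal alleles give a homozygous site, different alleles a heterozygous one
    return "".join(a if a == b else "?" for a, b in zip(h, k))


example = genotype("011", "101")
```

For the haplotypes 0,1,1 and 1,0,1 of the example above this gives ?,?,1.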
Given a genotype g = {g_i}, i = 1, 2, . . . , m, a resolution of g is a pair h, k of haplotypes,
where h = {h_i} and k = {k_i}, i = 1, 2, . . . , m, such that h_i = k_i = g_i if g_i ≠ ?,
and h_i, k_i ∈ {0, 1} with h_i ≠ k_i if g_i = ?. We say that h, k resolve g. Given a genotype g
and a haplotype h, we say that h is compatible with g if h_i = g_i whenever g_i ≠ ?. If h is
compatible with g, we can always find a haplotype h′ such that h and h′ resolve g. Here, h′ is
called the realization of g by h and is denoted R(g, h). It should be noted here that h and k can
each be compatible with g separately and yet not resolve g together. Given g and h, R(g, h) is
the unique haplotype whose i-th site, for i = 1, 2, . . . , m, is computed as
\[ R(g, h)_i = \begin{cases} h_i & \text{if } g_i \neq\ ?, \\ 1 - h_i & \text{if } g_i =\ ?. \end{cases} \]
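The three notions above (compatibility, resolution and realization) translate directly into code. The sketch below assumes strings over '0', '1' and '?'; the function names are hypothetical.

```python
def compatible(g, h):
    """h is compatible with g if it agrees with g on every homozygous site."""
    return all(gi == "?" or gi == hi for gi, hi in zip(g, h))


def resolves(g, h, k):
    """h, k resolve g: equal to g on homozygous sites, different on '?' sites."""
    return all(
        (hi != ki) if gi == "?" else (hi == ki == gi)
        for gi, hi, ki in zip(g, h, k)
    )


def realization(g, h):
    """R(g, h): keep h on homozygous sites and flip it on heterozygous sites."""
    if not compatible(g, h):
        raise ValueError("h is not compatible with g")
    flip = {"0": "1", "1": "0"}
    return "".join(flip[hi] if gi == "?" else hi for gi, hi in zip(g, h))


g = "??1"
r = realization(g, "011")
```

Here 001 and 011 are each compatible with g = ?,?,1, but they do not resolve it together, because they agree on the first site, which is heterozygous in g.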
Now, the general problem of haplotype inference is defined below [BVDL03].
“Given a set of genotypes G = {g1, g2, . . . , gn}, for each g ∈ G find a pair h, k of haplotypes
resolving g.”
Within this general setting, there are many optimization problems with different models.
The following subsections discuss several such problems along with their algorithms.
3.3.2 Pure Parsimony Problem
The first optimization problem that comes to mind for this general problem is to minimize
the number of haplotypes. The pure parsimony problem asks for exactly this.
Problem : Pure Parsimony Haplotyping Problem
Input : A set G of genotypes.
Output : A minimum-cardinality set of haplotypes H such that for each g ∈ G, there is
a pair of haplotypes h, h′ ∈ H that resolves g.
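For very small inputs the problem statement can be checked by exhaustive search. The sketch below is not one of the algorithms cited in this section; it simply enumerates every resolving pair of every genotype and keeps the choice whose union is smallest. The function names are hypothetical, and the running time is exponential in the number of heterozygous sites.

```python
from itertools import product


def resolutions(g):
    """All unordered haplotype pairs that resolve genotype g over {0,1,?}."""
    het = [i for i, c in enumerate(g) if c == "?"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h, k = list(g), list(g)
        for i, b in zip(het, bits):
            h[i] = b
            k[i] = "1" if b == "0" else "0"
        pairs.add(frozenset(("".join(h), "".join(k))))
    return [tuple(sorted(p)) for p in pairs]


def pure_parsimony(G):
    """Smallest set of haplotypes that resolves every genotype in G."""
    best = None
    for choice in product(*(resolutions(g) for g in G)):
        H = set().union(*choice)          # haplotypes used by this choice of pairs
        if best is None or len(H) < len(best):
            best = H
    return best


H = pure_parsimony(["0?", "?0", "??"])
```

In this example the genotype ?,? can be resolved either by {00, 11} or by {01, 10}; the second choice reuses haplotypes already needed by the other two genotypes, so three haplotypes suffice.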
An algorithm introducing the maximum parsimony approach to haplotyping was described
earlier [Cla90]. It was actually an approximation algorithm based on the rationale that a
homozygous site in one genotype should be common to another genotype in which the site is ?.
Later, Lancia et al. proved that the pure parsimony problem is NP-hard even for genotype sets
in which no genotype has more than 3 heterozygous sites. It has also been proved that the pure
parsimony problem can be solved in polynomial time, O(mn log n + n^{3/2}), for a genotype set
G in which no genotype has more than 2 heterozygous sites [CITK05]. The hardness of the pure
parsimony problem for genotype sets with more than 3 heterozygous sites is yet to be proved.
The haplotype inference problem is also known as “genotype phasing”. Recent software
applications address this problem for large population sets and real time operation. An
entropy minimization algorithm can be used for practical implementation [PM06]. An algorithm
for highly reliable, large scale implementation is also available [BZ06].
3.3.3 Maximum Resolution Problem – MR
Within the general setting, many optimization problems have been devised by introducing a
rule of inference over the genotype data. The inference rule is an operation that expands a
known set of haplotypes to make it consistent with a known set of genotypes [BVLD03]. It is
defined as follows.
Definition 3.3.1 (Inference rule) Let G = {g1, g2, . . . , gm} be a set of genotypes and let H
be a nonempty set of haplotypes. The application of the inference rule to a haplotype h ∈ H
compatible with a genotype g ∈ G consists of adding R(g, h) to H and removing g from G.
The MR problem deals with the application of the inference rule to a given pair (G, H). It asks
for the sequence of inferences that achieves the maximum resolution of G by H.
Problem : Maximum Resolution – MR
Input : A set G of genotypes and a set H of haplotypes.
Output : A maximum cardinality subset G′ of G of genotypes that are removed from G by a
sequence of application of the inference rule starting from G and H.
An example of the application of the inference rule gives an insight into the problem. Let
G = {g1 = 00??0?, g2 = 00?1?1} and H = {h1 = 001000, h2 = 000001}. An optimal sequence
of applications of the inference rule is compared with a non-optimal one in Table 3.7.
Step   Optimal Sequence                    Non Optimal Sequence
0      G = {g1 = 00??0?, g2 = 00?1?1}      G = {g1 = 00??0?, g2 = 00?1?1}
       H = {h1 = 001000, h2 = 000001}      H = {h1 = 001000, h2 = 000001}
1      h3 = R(g1, h1) = 000101             h5 = R(g1, h2) = 001100
       G = {g2}, H = {h1, h2, h3}          G = {g2}, H = {h1, h2, h5}
2      h4 = R(g2, h3) = 001111             No haplotype in H is compatible with g2
       G = {}, H = {h1, h2, h3, h4}        G = {g2}, H = {h1, h2, h5}
Table 3.7: Example of optimal application of haplotype inference rule
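The two sequences of Table 3.7 can be replayed with a short exhaustive search over all orders of applying the inference rule. This is only an illustration for tiny instances, with hypothetical function names; it is not an efficient algorithm for MR.

```python
def compatible(g, h):
    """h agrees with g on every homozygous site of g."""
    return all(gi == "?" or gi == hi for gi, hi in zip(g, h))


def realization(g, h):
    """R(g, h): flip h on the heterozygous sites of g (h must be compatible)."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[hi] if gi == "?" else hi for gi, hi in zip(g, h))


def max_resolution(G, H):
    """Largest number of genotypes removable from G by the inference rule."""
    G, H = frozenset(G), frozenset(H)
    best = 0
    for g in G:
        for h in H:
            if compatible(g, h):
                # apply the rule: add R(g, h) to H and remove g from G
                rest = max_resolution(G - {g}, H | {realization(g, h)})
                best = max(best, 1 + rest)
    return best


g1, g2 = "00??0?", "00?1?1"
h1, h2 = "001000", "000001"
h3 = realization(g1, h1)      # first step of the optimal sequence
h5 = realization(g1, h2)      # first step of the non-optimal sequence
```

Starting with h1 yields h3, which is compatible with g2, so both genotypes are resolved; starting with h2 yields h5, after which no haplotype is compatible with g2.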
A formal framework to analyze and investigate the computational complexity of haplo-
type inference problems is presented by stating an optimization problem, whose corresponding
decision version is NP-hard [Gus01].
Variation
A variation of the MR problem named the “single genotype resolution problem” has been
presented, whose computational complexity is yet to be determined [BVLD05]. This open
problem concerns a single distinguished genotype and whether it can be resolved in a population.
Problem : Single Genotype Resolution – SGR
Input : A set G of genotypes and a set H of haplotypes and a distinguished genotype g ∈ G.
Output : A sequence of applications of the inference rule that resolves a subset of G including
g.
3.3.4 Perfect Phylogeny Haplotyping Problem – PPH
Differences in SNPs between individuals of a particular population are not frequent. Two
people from different populations surely have some differences in their SNPs. But for a
population of people living in the same geographic region and having many similar traits, the
SNPs of individuals should not differ very much, except in cases of atomic disaster like those of
Hiroshima and Nagasaki, which can mutate many SNPs of reproductive cells and affect the
whole next generation of that individual.
A mathematical description of the above fact makes the idea clearer. Suppose there are
n SNPs in the entire human race, so that there are 2^n possible SNP configurations if all these
SNPs are bi-allelic. In practice, only a few of these 2^n configurations have been identified in
humans so far. These configurations are very different from each other, and every single
configuration is common to some population. These changes in SNPs occur through natural
mutations and represent the evolution of humans over a long epoch. Thus the haplotypes of a
population contain evolutionary information, and they can be arranged in a phylogenetic tree.
The very general population haplotyping problem can therefore be narrowed down to the
phylogeny of haplotypes. Let us define a phylogeny of the haplotypes as a haplotype perfect
phylogeny.
Definition 3.3.2 (Haplotype Perfect Phylogeny) Let B be an n × m {0, 1} matrix where
each row of B is a binary haplotype and each column j is the n-vector of the values of SNP site
j in the n haplotypes. A haplotype perfect phylogeny for B, in short hpp, is a rooted tree T
with n leaves such that the following properties hold:
• each leaf of the tree is labeled by a distinct haplotype from B, that is, a distinct row of B.
• each internal edge of T is labeled by at least one SNP site j changing from 0 to 1, while
each site labels at most one edge.
• for each haplotype leaf h, the unique path from the root of T to h specifies exactly the SNP
sites that are 1 in h.
An example of an hpp is given in Fig. 3.7. It should be noted that the edges without
labels are not internal edges of the tree. Hence, the example agrees completely with the
definition. Here, the matrix of haplotypes on the left is arranged in the hpp on the right.
Figure 3.7: An example of haplotype perfect phylogeny (hpp): (a) a matrix of haplotypes A to E over SNP sites 1 to 6, (b) the corresponding hpp
Since the hpp uses haplotypes in the form of a matrix, we should define the resolution of a
genotype matrix. A {0,1} matrix H of haplotypes resolves a {0,1,?} matrix G of genotypes if
each row of G is resolved by a pair of rows of H. The perfect phylogeny haplotyping problem is
Problem : Perfect Phylogeny Haplotyping – PPH
Input : an n × m matrix G over alphabet {0,1,?}.
Output : A matrix H which is a realization of matrix G and a haplotype perfect phylogeny
for H, or decide that such tree does not exist.
Now, the existence of a solution is an important issue in this problem. We need to know how
to check that a matrix H has no hpp. Fig. 3.8 shows a matrix that has no hpp. The trees in the
figure are attempts to represent the matrix as an hpp, but each of them fails to represent one
haplotype. Such a matrix, with three rows and two columns, where the rows are (1,1), (0,1) and
(1,0), is called the forbidden matrix. It has been proved that when a matrix has a submatrix of
any two columns and three rows equal to the forbidden matrix, the matrix has no hpp [BVLD05].
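The check for the forbidden matrix can be sketched as follows: for every pair of columns we collect the value pairs that occur in the rows and look for (1,1), (0,1) and (1,0) together. The function name is hypothetical, and rows are assumed to be strings over '0' and '1'.

```python
from itertools import combinations

FORBIDDEN = {("1", "1"), ("0", "1"), ("1", "0")}


def contains_forbidden(B):
    """True if some two columns of B contain the rows (1,1), (0,1) and (1,0)."""
    m = len(B[0]) if B else 0
    for i, j in combinations(range(m), 2):
        seen = {(row[i], row[j]) for row in B}   # value pairs occurring in columns i, j
        if FORBIDDEN <= seen:
            return True
    return False


no_hpp = contains_forbidden(["11", "01", "10"])
maybe_hpp = contains_forbidden(["100", "110", "001"])
```

The first matrix is the forbidden matrix itself, so it has no hpp; the second one contains no forbidden submatrix. The check inspects O(m^2) column pairs and reads n rows for each.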
Figure 3.8: Forbidden matrix for hpp: (a) the matrix with rows h1, h2, h3 and columns i, j, (b) and (c) two failed attempts to build an hpp
3.4 Summary
In this chapter, we described the haplotyping problem, one of the most important SNP related
problems. We described both types of haplotyping: individual and population. Individual
haplotyping determines the haplotypes of an individual from the DNA sequence data of that
individual. There are several variations of this generic problem. The MFR, MSR and MEC
problems are described along with some graph based computational models and their variations.
The fragment conflict graph and the SNP conflict graph are examples of such models. A local
search algorithm is given for the MEC problem. Population haplotyping determines the
haplotypes of a population from the genotypes of a number of individuals. The pure parsimony
problem, the maximum resolution problem and the perfect phylogeny haplotyping problem are
described. For each problem, an illustrative example is used to elaborate the computation
required to solve the problem.
Chapter 4
Phylogenetic Tree
Phylogeny is the description of relations among species, populations, individuals or genes.
These relations can be of several kinds, all having the property of an ancestor-descendant
relationship. As a quick example, we can describe a phylogeny of human languages in which
“Hindi” and “Bangla” may have a common ancestor named “Sanskrit”, while “Arabic” and
“Persian” may have a common ancestor. Generally, relations that can be expressed as
descendant-ancestor are closely tied to time. That is why phylogeny is extensively used to study
the evolutionary history of biological elements like species, populations, genes, etc. (generally
referred to as taxa). In this setting, the most reliable sources of information for relating taxa
are cellular information like ribosomal RNAs, gene families, repetitive DNA sequences, protein
sequences, etc.
The natural representation of a phylogeny is a tree. A tree organization of taxa enables
easier analysis and is well suited to computer algorithms. A tree representation of a phylogeny
is called a Phylogenetic Tree. An example of a phylogenetic tree constructed from RNA
comparison is shown in Fig. 4.1, taken from the book Fundamental Concepts of Bioinformatics
by D. E. Krane et al. (page 113).
Figure 4.1: A sample phylogenetic tree of living organisms (three domains: Bacteria, Archaea and Eucarya)
From the point of view of computer science, phylogenetic trees can be studied in two different
phases.
• Construction of a phylogenetic tree using some distance metric to measure the relationship
among taxa, so that the topology of the tree represents the relation reliably.
• Accurate analysis of information (ancestral sequence, path length, subtree, etc.) from an
already constructed phylogenetic tree, and visualization of the trees in easily readable shapes.
In this chapter we discuss two different analysis algorithms to illustrate the current trend
of research in phylogenetic tree analysis.
4.1 Majority Rule Consensus Tree
Most construction problems for phylogenetic trees are NP-hard, so heuristic algorithms are
run to generate phylogenetic trees for smaller sets of taxa than required. To build a tree for a
set of a million taxa, the smaller trees generated by these algorithms should be combined
reliably and efficiently. A consensus tree for a set of input trees is a single tree which includes
the features on which all or most of the input trees agree. Depending on the definition of the
feature on which the trees should agree, the consensus tree can be different. For example, it
can preserve the topological structure of the trees or the pairwise path lengths between
individual taxa. The algorithm that we discuss in this section builds a consensus tree using the
majority rule. In this problem, a node in one tree is equal to a node in a different tree if the
subtrees rooted at them have the same set of taxa. The majority rule consensus tree problem
asks to include in the consensus tree all nodes that are present in more than a fraction l of
the input trees, so that the topology of the consensus tree agrees with the majority of the
input trees [ACJ03].
Problem : Majority Rule Consensus Tree Problem
Input : A set of phylogenetic trees
Output : A consensus tree only having all the nodes that appear in majority of the input
trees.
The algorithm we are presenting is a simplified linear time randomized algorithm.
4.1.1 Terminology
Let S = {s0, s1, . . . , sn−1} be a set of taxa and T = {T1, T2, . . . , Tt} be a set of phylogenetic
trees, each having leaves labeled by taxa s ∈ S.
Without loss of generality, let s0 be a distinguished taxon that is present in all the trees Tj
and is connected directly to the root of each tree. Such a taxon is called the outgroup
to the rest of the tree. If there is no common outgroup in T, we insert a dummy node into
each of the trees and make it the outgroup, so that the algorithm can proceed.
Now, let i be a node in a tree Tj. If we delete the edge from i towards the root of Tj, we
get a bipartition: the subtree rooted at i and the rest of the tree, which contains s0. Such a
bipartition B can be represented uniquely by the leaves of the subtree rooted at i alone, denoted
S(B). For example, in Fig. 4.2 a bipartition in tree T2 is s2s3|s0s1s4 and S(s2s3|s0s1s4) = {s2, s3}.
The cardinality of this bipartition is 2. Therefore, two nodes i and i′ of two different
phylogenetic trees Tj and Tk are equal if their respective bipartitions B and B′ are equal,
i.e. S(B) = S(B′).
Figure 4.2: An example of majority rule consensus tree for l = 1/2
The majority rule tree, or Ml tree, includes nodes for exactly those bipartitions which occur
in more than some fraction l of the input trees. It has been shown that such a tree exists
for any 1/2 < l ≤ 1 [MM81].
4.1.2 Algorithm
The algorithm can be divided into two phases. The counting phase counts the number of
occurrences of every bipartition present in T, and the construction phase constructs an Ml
tree for some 1/2 < l ≤ 1.
Counting Bipartitions
The counting phase counts the number of occurrences of each bipartition so that we can
determine whether the bipartition is a majority one. To count every possible bipartition we need
storage for them. A tree with n leaves can have O(n) distinct bipartitions. Since |S| = n and
there are t trees, the maximum number of distinct bipartitions is O(tn). We can represent each
bipartition B by a bit-string of n bits, B = (b1, b2, . . . , bn), where bit bj is 1 if the respective
taxon sj ∈ S(B). The bit-string for the bipartition B of a node i can be calculated from the
bit-strings B1, B2, . . . , Br of its children's bipartitions by taking their logical OR. This is
because the Bi, i = 1, 2, . . . , r, are disjoint sets of taxa.
\[ B = \bigvee_{i=1}^{r} B_i \tag{4.1} \]
We use a hash function that hashes a bipartition to a hash table address. Each location
of the hash table contains a record with the running count of the bipartition and the bit-string
representation of the bipartition. The hash function for a bipartition B is
\[ h(B) = \sum_{i=1}^{n} b_i a_i \bmod m \tag{4.2} \]
where a = (a1, a2, . . . , an) is a list of random integers in (0, . . . , m − 1) and m is a prime
number with m > tn. Such a hash function has the advantage that it can be computed
implicitly: the hash index for the bipartition B of a node i can be calculated from the hash
indices of its children's bipartitions B1, B2, . . . , Br by adding them and taking the result mod m.
Writing b_{ji} for bit i of B_j,
\[ h(B) = \sum_{i=1}^{n} b_i a_i \bmod m = \sum_{j=1}^{r} \sum_{i=1}^{n} b_{ji} a_i \bmod m = \sum_{j=1}^{r} h(B_j) \bmod m \tag{4.3} \]
To count bipartitions, the algorithm traverses every tree in postorder and calculates the hash
index and the bit-string representation of each node from those of its children. Using the index,
the algorithm accesses the hash table, which uses separate chaining to resolve collisions. It
inserts the bit-string as a new record if it is not yet in the table, and increments the count of
the record.
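The counting phase can be sketched in Python as follows. The sketch makes a few assumptions of its own: a tree is a nested tuple whose leaves are taxon indices 1, . . . , n − 1, the outgroup s0 is left out of the tuples, the topmost tuple stands for the node that contains all remaining taxa, and bit-strings are stored as Python integers. A dictionary of lists plays the role of the hash table with separate chaining. The function name and the example trees are hypothetical.

```python
import random


def count_bipartitions(trees, n, m, seed=0):
    """Count the bipartitions of all trees using the additive hash of Eq. 4.2.

    m should be a prime number larger than t * n, where t is the number of trees.
    """
    rng = random.Random(seed)
    a = [rng.randrange(m) for _ in range(n)]   # random integers in 0..m-1
    table = {}                                 # hash index -> chain of [bits, count]

    def visit(node):
        if isinstance(node, int):              # leaf s_j: a single bit
            bits, h = 1 << node, a[node]
        else:
            bits, h = 0, 0
            for child in node:                 # postorder: children first
                cb, ch = visit(child)
                bits |= cb                     # Eq. 4.1
                h = (h + ch) % m               # Eq. 4.3
        chain = table.setdefault(h, [])
        for record in chain:                   # separate chaining
            if record[0] == bits:
                record[1] += 1
                break
        else:
            chain.append([bits, 1])
        return bits, h

    for tree in trees:
        visit(tree)
    return {bits: count for chain in table.values() for bits, count in chain}


T1 = (1, ((2, 3), 4))
T2 = ((1, (2, 3)), 4)
T3 = (((1, 2), 3), 4)
counts = count_bipartitions([T1, T2, T3], n=5, m=17)
```

Because every record stores the full bit-string, two different bipartitions that happen to share a hash index are still counted separately.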
Constructing Majority Rule Tree
The construction phase makes use of the following facts.
Fact 4.1.1 Let Ba and B be two majority bipartitions. If Ba is an ancestor of B in a tree in
T, then Ba is an ancestor of B in Ml.
Fact 4.1.2 Let Bp and B be two majority bipartitions. If Bp is the parent of B in Ml, then
at least one tree in T has Bp as B's ancestor.
Fact 4.1.3 Let Ba, Bp and B all be majority bipartitions. If Bp is the parent of B and Ba is
an ancestor of B other than Bp in Ml, then |Ba| > |Bp|.
Fact 4.1.1 tells us that a preorder traversal of the input trees is consistent with the
top-down order of bipartitions in Ml. Fact 4.1.2 tells us that if we traverse all the input
trees in preorder, we will not miss any edge of Ml.
The algorithm performs a preorder traversal of each tree in T. Suppose the algorithm is at
node i. Each traversal uses a pointer c which always points to the nearest ancestor of i that is
a majority node. We create a dummy node whose bipartition is S, and before each traversal c
is initialized to this dummy node. When the algorithm finishes, Ml is the sole subtree of the
dummy node. Let C be the bipartition of c.
At node i, the algorithm calculates Bi and h(Bi) using Eq. 4.1 and Eq. 4.3, and uses h(Bi)
to find the count of Bi. If Bi is not a majority node, it is simply ignored. If Bi is a majority
node, it is searched for in Ml. If there is no node for Bi in Ml, a node for Bi is created. Since
c is an ancestor of i in Ml, a node for C already exists in Ml, and the node for C is made the
parent of Bi. If, on the other hand, a node for Bi does exist in Ml, the algorithm finds the
current parent P of Bi in Ml. If P has a greater cardinality than C, then C is a better candidate
for the parent of Bi (by Fact 4.1.3). Hence, C is made the parent of Bi in Ml. Otherwise no
change is made to Ml.
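The construction phase described above can be sketched as follows. As before, this is a sketch under our own assumptions: trees are nested tuples over taxon indices 1, . . . , n − 1 with the outgroup s0 left out, bipartitions are integer bit masks, and a plain dictionary replaces the hash table of the counting phase. The result is returned as a map from each majority bipartition to its parent in Ml; all names are hypothetical.

```python
def annotate(node):
    """Return (bit mask of the taxa below node, list of annotated children)."""
    if isinstance(node, int):
        return (1 << node, [])
    kids = [annotate(child) for child in node]
    bits = 0
    for child_bits, _ in kids:
        bits |= child_bits
    return (bits, kids)


def cardinality(bits):
    return bin(bits).count("1")


def majority_tree(trees, l=0.5):
    """Map every majority bipartition to its parent bipartition in Ml."""
    forest = [annotate(t) for t in trees]
    counts = {}

    def count(node):                       # counting phase, using a dict
        bits, kids = node
        counts[bits] = counts.get(bits, 0) + 1
        for kid in kids:
            count(kid)

    for t in forest:
        count(t)

    dummy = 1                              # bit 0 stands for the outgroup s0
    for bits, _ in forest:
        dummy |= bits                      # the dummy node covers the whole set S
    parent = {}

    def walk(node, c):                     # preorder; c = nearest majority ancestor
        bits, kids = node
        if counts[bits] > l * len(trees):
            if bits not in parent:
                parent[bits] = c           # new node of Ml under C
            elif cardinality(parent[bits]) > cardinality(c):
                parent[bits] = c           # C is a closer ancestor (Fact 4.1.3)
            c = bits
        for kid in kids:
            walk(kid, c)

    for t in forest:
        walk(t, dummy)
    return parent


T1 = (1, ((2, 3), 4))
T2 = ((1, (2, 3)), 4)
T3 = (((1, 2), 3), 4)
parent = majority_tree([T1, T2, T3])
```

For these three trees the majority bipartitions are {s2, s3}, {s1, s2, s3} and the set of all ingroup taxa, together with the leaves, so the resulting Ml is ((s1, (s2, s3)), s4).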
4.1.3 A Simulated Example
Figure 4.3: Steps of majority rule tree construction (top row: input trees T1, T2, T3; panels (a) to (e): intermediate and final states of the majority rule consensus tree)
Fig. 4.3 shows the steps of applying the above algorithm to the trees in the first row.
The trees are traversed in the order < T3, T2, T1 >.
After the traversal of T3 is finished, the tree looks like (a). Since s2s3s4 (third node
from the top) of T3 is not a majority node, both s1s2 and s4 are added under s1s2s3s4.
Ml looks like (b) after the traversal of s1s2s3 (third node from the top) of T2. Since
Bi = s1s2s3 was not in the Ml built so far, a node for it is created and C = s1s2s3s4
is made its parent. When the traversal moves on to Bi = s1, the pointer c moves down to
s1s2s3. Now, the parent of Bi in Ml is P = s1s2s3s4, and it is greater in cardinality than
C = s1s2s3. Therefore, the parent of Bi is updated from P = s1s2s3s4 to C = s1s2s3 (see (c)).
Similarly, when the traversal moves on to Bi = s2s3 of T2, its parent is updated from
P = s1s2s3s4 to C = s1s2s3 (see (d)).
After the traversals of T2 and T1 are finished, Ml looks like (e), which is the desired
majority rule consensus tree. Although the example uses simple binary trees, the algorithm
works equally well for n-ary trees.
4.2 Uniform Sampling of Phylogenetic Trees
Most of the analysis algorithms proposed earlier were tested on simulated topological and
sequence data. The common reasons behind this are the unavailability of biologically obtained
phylogenetic trees and the large sizes of the few available ones. With the growth of
phylogenetic analysis, the need for more correct and reliable phylogenetic trees is
also increasing. So the recent trend is to analyze an algorithm with a smaller, representative
sample from a real phylogenetic tree. One might think of this as a trivial problem of choosing
an arbitrary subtree from a large phylogenetic tree. But it becomes non-trivial when the
samples are required to be selected uniformly over some biologically motivated, constrained set
of induced subtrees. In this section, the general problem of uniform tree sampling is defined
and some specific problems are described.
4.2.1 Terminology
Let T be a phylogenetic tree over a set of taxa S, where |S| = n. If S′ is a nonempty set of
taxa such that S′ ⊂ S, then the subtree induced by S′ from T is denoted by T′ and is obtained
by the following sequence of steps:
• deleting all leaves in S\S′
• deleting all edges connecting those leaves
• deleting every degree-2 internal node by merging the two edges connected to it into a
single one.
If T is a weighted tree, then the merging of two weighted edges must also be defined. In
Fig. 4.4, T is shown in (a), where S = {1, 2, . . . , 10}, and the induced subtree T′ for
S′ = {1, 3, 4, 5, 6, 7} is shown in (b).
Figure 4.4: An induced subtree of a phylogenetic tree: (a) the tree T over taxa 1 to 10, (b) the induced subtree T′ over taxa 1, 3, 4, 5, 6, 7
Now, given T, S and S′, a unique T′ is defined. But if we are given |S′| instead of
S′, then a number of different T′s can be found. This |S′| is called the sample size. The
uniform sampling problem asks to draw a sample of fixed size from T uniformly at random.
Problem : Uniform phylogenetic tree sampling problem
Input : A phylogenetic tree T over a set of taxa S such that |S| = n and the sample size
2 ≤ m < n.
Output : A uniformly chosen induced subtree T ′ over all such trees where T ′ has m distinct
leaves taken from S and satisfies some constraints.
Although phylogenetic trees can be n-ary trees, the binary representation is the most widely
used. In this section we introduce three sampling problems on binary phylogenetic trees [KMP03].
4.2.2 Sampling Problems
All the problems stated in this section consider trees weighted by positive real numbers. To
merge the two edges of a degree-2 internal node, the node is simply removed and the resulting
single edge gets a weight equal to the sum of the two previous weights.
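The construction of a weighted induced subtree can be sketched as follows. We assume a representation of our own: a leaf is a taxon label and an internal node is a list of (child, edge weight) pairs. A node that is left with a single child is removed by its parent and the two edge weights are added; the root itself is kept even if it ends up with only one child, which is an assumption on our part. The function name and the example tree are hypothetical.

```python
def induced(node, keep):
    """Subtree of `node` induced by the taxa in `keep`, or None if nothing is left."""
    if not isinstance(node, list):                # leaf
        return node if node in keep else None
    kids = []
    for child, w in node:
        sub = induced(child, keep)
        if sub is None:
            continue                              # the leaf and its edge are deleted
        if isinstance(sub, list) and len(sub) == 1:
            # sub has become a degree-2 node: remove it and add the edge weights
            grandchild, w2 = sub[0]
            sub, w = grandchild, w + w2
        kids.append((sub, w))
    return kids if kids else None


T = [
    ("a", 1.0),
    ([("b", 2.0), ("c", 3.0)], 4.0),
    ([("d", 1.5), ("e", 2.5)], 0.5),
]
T_prime = induced(T, {"a", "b", "d", "e"})
```

In the example, deleting leaf c leaves the parent of b with a single child, so that node is removed and b is attached directly to the root with weight 4.0 + 2.0 = 6.0.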
Problem : Leaf depth problem
Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n
and two non-negative real number dmin and dmax.
Output : A uniformly chosen induced subtree T ′ over all such trees where T ′ has m distinct
leaves taken from S and satisfies dmin < dist(r, v) < dmax for all the leaves v of T ′ and root r
of T ′.
Problem : Edge length problem
Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n
and two non-negative real number emin and emax.
Output : A uniformly chosen induced subtree T ′ over all such trees where T ′ has m distinct
leaves taken from S and satisfies emin < |e| < emax for all the edges e of T ′.
Problem : Pair wise leaf distance problem
Input : A phylogenetic tree T over a set of taxa S such that |S| = n , the sample size 2 ≤ m < n
and two non-negative real number dmin and dmax.
Output : A uniformly chosen induced subtree T ′ over all such trees where T ′ has m distinct
leaves taken from S and satisfies dmin < dist(v1, v2) < dmax for all pair of distinct leaves v1 and
v2 of T ′.
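A naive way to obtain uniform samples for the pairwise leaf distance problem is rejection sampling: draw a set of m leaves uniformly at random and accept it only if it satisfies the constraint. Every accepted set is then uniformly distributed over the valid sets, and each valid set defines a unique induced subtree. This is not one of the efficient algorithms of [KMP03]; it is a baseline sketch with hypothetical names. It relies on the observation that merging edges adds their weights, so the distance between two leaves in T′ equals their distance in T. The tree representation is our own assumption: a leaf is a label and an internal node is a list of (child, edge weight) pairs.

```python
import random
from itertools import combinations


def pairwise_distances(tree):
    """Return the sorted leaf labels and the path length between every two leaves."""
    dist = {}

    def below(node):
        """Map every leaf under `node` to its distance from `node`."""
        if not isinstance(node, list):
            return {node: 0.0}
        groups = [
            {leaf: d + w for leaf, d in below(child).items()} for child, w in node
        ]
        for g1, g2 in combinations(groups, 2):    # leaves that meet at this node
            for u, du in g1.items():
                for v, dv in g2.items():
                    dist[frozenset((u, v))] = du + dv
        merged = {}
        for g in groups:
            merged.update(g)
        return merged

    return sorted(below(tree)), dist


def sample_leaves(tree, m, dmin, dmax, rng, max_tries=10000):
    """Uniformly sample m leaves with dmin < dist(v1, v2) < dmax for all pairs."""
    leaves, dist = pairwise_distances(tree)
    for _ in range(max_tries):
        chosen = rng.sample(leaves, m)            # uniform over all m-subsets
        if all(dmin < dist[frozenset(p)] < dmax for p in combinations(chosen, 2)):
            return set(chosen)
    return None                                   # no valid sample was found


T = [
    ("a", 1.0),
    ([("b", 2.0), ("c", 3.0)], 4.0),
    ([("d", 1.5), ("e", 2.5)], 0.5),
]
leaves, dist = pairwise_distances(T)
picked = sample_leaves(T, 2, 3.5, 5.5, random.Random(1))
```

The expected number of trials grows as the constraint becomes more restrictive, which is why dedicated algorithms are needed for large trees.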
Efficient algorithms for all the above problems are available when T is binary
[KMP03]. For arbitrary trees, the complexities of these problems are still to be explored.
Although every arbitrary phylogenetic tree can be represented as a binary phylogenetic tree by
adding internal nodes, we believe that determining the complexity of sampling arbitrary trees
for the above problems will be a tough task.
Since most phylogenetic software packages use the binary representation, it would also be
a good option to generate binary samples from an arbitrary n-ary phylogenetic tree satisfying
the above constraints.
4.3 Summary
Phylogeny is one of the most extensively studied areas of bioinformatics. The natural tree
representation and the rich theory behind its analysis have made phylogeny more advanced than
other recent areas. Phylogenetic tree analysis is currently widely used in disease tracing,
biodiversity analysis, customized vaccine research, understanding microbial ecology and many
other fields of applied biology.
In this chapter we presented an algorithm to construct a consensus tree using the majority rule.
The algorithm is carried out in two phases: a counting phase and a construction phase. We also
described three phylogenetic tree sampling problems. Uniform sampling of small induced trees
from large phylogenetic trees of millions of taxa is a relatively new area of study. There could be
many variations of this nontrivial uniform sampling problem. Some directions are also discussed
at the end of the chapter.
Chapter 5
Conclusion
This thesis presents several uses of graphs in generating and representing biological data. We
organized the thesis around two specific problems, one from each of these two broad classes:
haplotyping and phylogeny.
Generation of haplotypes for an individual from fragments of SNP sequences can be formulated
in several ways. Minimum fragment removal, minimum SNP removal and minimum
error correction are such formulations, suitable for different experimental setups. In this thesis, we
described the computational models (fragment conflict graph and SNP conflict graph) for these
problems. We proposed a local-search-based algorithm for the MEC problem. We also included
variations of these problems, some of which are still open with respect to their complexity and
algorithmic status.
Generation of haplotypes for a population from genotype data is another important
area. The pure parsimony problem, the maximum resolution problem and perfect phylogeny
haplotyping are discussed in this thesis. We provided illustrative examples to explain the
computation required to solve these problems.
The phylogenetic tree is the most widely used tool to represent the parent-child relationships
of biological elements such as DNA sequences, external features and physiological processes. We
described an algorithm to construct a consensus tree from many small candidate trees using
the majority rule. We also described the uniform sampling problems on large phylogenetic trees.
Let us review the organization of this thesis. In Chapter 1, we defined the research
area, bioinformatics, and provided two perspectives on it: the biological and the
computational. We explicitly defined the scope of this thesis with respect to the
possible areas of research.
In Chapter 2 we defined several preliminary ideas to help the reader grasp the content of this
thesis.
In Chapter 3 we discussed the haplotyping problems. We described LSMEC, an algorithm
for solving the minimum error correction problem. It is an iterative search algorithm
and performs quite efficiently even on truly random fragment data. Two graphs are defined as
computational models for MFR and MSR. The computational status of the different versions
with respect to these models is also presented. We also explained the perfect phylogeny
haplotyping problem and the maximum resolution problem.
In Chapter 4 we discussed phylogeny. We explained the steps of the majority rule
algorithm to construct a consensus tree. The algorithm is illustrated for easier
understanding. Uniform sampling of a phylogenetic tree under defined constraints is a
non-trivial sampling problem. Constraints on root-to-leaf distances, on pairwise leaf distances
and on edge lengths of the induced subtrees are defined there. Sampling binary trees from an
arbitrary n-ary tree would be an interesting problem in this respect.
The use of graphs as a computational tool in bioinformatics is the subject matter of this
thesis. Other branches of computer science also contribute significantly to bioinformatics,
and this thesis inspires us to study those areas in the near future.
References
[ACJ03] N. Amenta, F. Clarke, K. St. John: A linear-time majority tree algorithm, Proc. of
3rd International Workshop on Algorithms in Bioinformatics, LNBI 2812, Springer, pp.
216-227, 2003.
[BILR05] V. Bafna, S. Istrail, G. Lancia, R. Rizzi: Polynomial and APX-hard cases of the
individual haplotyping problem, Theoretical Computer Science, 335, pp. 109-125, 2005.
[BVDL03] P. Bonizzoni, G. Della Vedova, R. Dondi, J. Li: The haplotyping problem: an overview
of computational models and solutions, Journal of Computer Science and Technology, 18(6),
pp. 675-688, 2003.
[BZ06] D. Brinza, A. Zelikovsky: 2SNP: scalable phasing based on 2-SNP haplotypes,
Bioinformatics, 22(3), pp. 371-373, 2006.
[CIKT05] R. Cilibrasi, L. van Iersel, S. Kelk, J. Tromp: On the complexity of several haplotyping
problems, Proc. of 5th International Workshop on Algorithms in Bioinformatics, LNCS
3692, Springer, pp. 128-139, 2005.
[Cla90] A. G. Clark: Inference of haplotypes from PCR-amplified samples of diploid populations,
Molecular Biology and Evolution, 7(2), pp. 111-122, 1990.
[Die05] R. Diestel: Graph theory, Springer, Heidelberg, 2005.
[Gus01] D. M. Gusfield: Inference of haplotypes from samples of diploid populations: Complex-
ity and algorithms, Journal of Computational Biology, 8(3), pp. 305-323, 2001.
[Huf05] F. Hüffner: Algorithm engineering for optimal graph bipartization, Proc. of 4th
International Workshop on Efficient and Experimental Algorithms, LNCS 3503, Springer, pp.
240-252, 2005.
[KMP03] P. Kearney, J. I. Munro, D. Phillips: Efficient generation of uniform samples from
phylogenetic trees, Proc. of 3rd International Workshop on Algorithms in Bioinformatics,
LNBI 2812, Springer, pp. 177-189, 2003.
[LBIS01] G. Lancia, V. Bafna, S. Istrail, R. Schwartz: SNPs problems, complexity and
algorithms, Proc. of 9th Annual European Symposium on Algorithms (ESA), LNCS 2161,
Springer, pp. 182-193, 2001.
[MM81] T. Margush, F. R. McMorris: Consensus n-trees, Bulletin of Mathematical Biology,
43, pp. 239-244, 1981.
[PM06] B. Pasaniuc, I. Mandoiu: Highly scalable genotype phasing by entropy minimization,
Proc. of IEEE International Conference of the Engineering in Medicine and Biology Society,
pp. 3482-3486, 2006.
[RTCN99] M. J. Rieder, S. L. Taylor, A. G. Clark, D. A. Nickerson: Sequence variation in the
human angiotensin converting enzyme, Nature Genetics, 22, pp. 59-62, 1999.
[WWLZ05] R.-S. Wang, L.-Y. Wu, Z.-P. Li, X.-S. Zhang: Haplotype reconstruction from SNP
fragments by minimum error correction, Bioinformatics, 21(10), pp. 2456-2462, 2005.
[Wes03] D. B. West: Introduction to graph theory, Prentice Hall of India, 2003.
Index
allele, 18, 26
  heterozygous, 19
  homozygous, 19
bi-allelic, 20
binary-witness MEC, 35
bioinformatics, 1
  areas, 1
bipartition, 47
bit-string, 48
BLAST, 2
cardinality, 51
chromosome, 7
collection, 26
conflict, 20
consensus tree, 46
  majority rule tree, 47
diploid, 19
DNA, 7
forbidden matrix, 42
fragment, 26
fragment conflict graph, 22
gain, 29
gap, 20
GenBank, 2
genotype, 37
  phasing, 39
  resolution, 38
graph, 15
  bipartite, 22
  connected graph, 16
  cycle, 16
  degree, 16
  directed, 15
  path, 16
  subgraph, 15
  undirected, 15
haplotype, 10, 19
  construction, 22
  length, 23
haplotype perfect phylogeny, 41
haplotyping, 19
  individual, 19
  population, 19, 36
HMEC, 26
hole, 20
induced subtree, 52
inference rule, 39
LHR, 23
marker, 14
maximum cumulative gain, 29
MEC, 25
MFR, 22
MR, 39
MSR, 24
NCBI, 2
nucleotide, 7
partition, 26
PDB, 2
perfect graph, 24
personalized drug prediction, 14
phylogenetic tree, 12, 44
  character-based, 13
  cladistic, 12
  distance-based, 12
  phenetic, 12
phylogeny, 12
protein, 9
pure parsimony, 38
replication, 9
RNA, 9
SNP, 18
SNP conflict graph, 24
SNP matrix, 20
taxa, 12, 44
  outgroup, 47
transcription, 9
tree, 16
  topology, 46