ABSTRACT
The use of compression techniques in various fields of data management has been
very encouraging lately. DNA data have become large, and this causes problems of
storage and data transfer. The common approach is to put these data on a server,
which adds to the cost of data management. Furthermore, online transfer is no
longer the best solution: for a research center with a low-speed Internet
connection, the transfer is almost impossible to carry out. This study proposes an
enhancement of the LZ77 algorithm, a common non-greedy, dictionary-based method
that uses the sliding-window concept for the compression of alphabetical data. By
introducing sectioned sliding windows with a hash table approach, the proposed
compression algorithm can solve the storage problem of large DNA sequences. This
implementation speeds up compression and improves the data compression rate. Two
formats of DNA data (binary and FASTA) are tested and analysed. Simulation shows
that the data compression rate is promising and scales with the size of the DNA,
compressing at a rate of 56% per bit. Compared to the LZ77-based DNA compression
algorithm BioCompress, which has a 44% compression rate, the proposed algorithm
outperforms it by 12 percentage points. The implications of this study will allow
cost reduction in handling large-scale DNA data.
ABSTRAK
The use of compression techniques in various fields of data management has been
very encouraging lately. However, with the emergence of various new techniques,
the size of DNA data grows ever larger, and this causes storage and data transfer
problems. The common approach is to place these data on a server, which however
adds to the cost of data management. For a research center with low Internet
access, this transfer is almost impossible to carry out. The study discusses an
enhancement of the LZ77 algorithm, which uses a non-greedy concept, is of the
data-dictionary type, and applies the sliding-window approach, for the compression
of alphabetical sequence data. The algorithm is further improved with sectioned
sliding windows together with a hash table approach. This method reduces
compression time and increases the compression rate. Two DNA data formats (binary
and FASTA) were tested and analysed. The simulation results scale with the
increase in DNA size, where it can compress at a rate of 56% per bit. For
comparison, that rate surpasses the latest LZ77-based compression technique,
BioCompress, which can compress only at a rate of 44%; higher by 12 percentage
points. This study can reduce the cost of maintaining large DNA data.
TABLE OF CONTENTS
CHAPTER TITLE PAGE
DECLARATION ii
DEDICATION iii
ACKNOWLEDGEMENT iv
ABSTRACT v
ABSTRAK vi
TABLE OF CONTENTS vii
LIST OF TABLES xi
LIST OF FIGURES xii
LIST OF TERMINOLOGIES xiv
LIST OF ABBREVIATIONS xv
LIST OF APPENDICES xvi
1 INTRODUCTION 1
1.1 Overview 1
1.2 Background of DNA Sequencing 3
1.2.1 DNA Sequence Identification 5
1.2.2 Large-scale DNA Sequencing 9
1.2.3 Benefits of Genome Research 10
1.3 Motivation of the Research 11
1.4 Statement of the Problem 13
1.5 Objectives of the Study 14
1.6 Scope of the Study 14
1.7 Thesis Outline 16
2 LITERATURE REVIEW 17
2.1 Overview 17
2.2 Basic Compression Method 18
2.3 Dictionary Compression Method 19
2.3.1 The Huffman Coding 20
2.3.2 Lempel-Ziv Coding 21
2.4 Dictionary Based Compression 23
2.4.1 LZ77 24
2.4.2 LZW 27
2.5 Existing DNA Compression Algorithm 30
2.5.1 BioCompress 31
2.5.2 GenCompress 33
2.5.3 DNACompress 34
2.6 Hash Table 35
2.6.1 Hash Function 39
2.7 Discussion 42
3 RESEARCH METHODOLOGY 44
3.1 Overview 44
3.2 Preliminary Study 45
3.3 Research Framework 46
3.4 Research Approach 50
3.4.1 Function Based Algorithm 51
3.4.2 Reconstructing LZ77 Algorithm and Hash Table Approach Code 51
3.4.3 Combination between LZ77 and Hash Table 52
3.4.4 Testing and Simulate with Data 52
3.4.5 Database 53
3.4.6 Hardware and Software 53
3.5 Hash Table Allocation 54
3.5.1 Write Method 55
3.5.2 Read Method 56
3.6 Compression Algorithm / Code 57
3.7 Decompression Algorithm / Code 60
3.8 Summary 62
4 IMPLEMENTATION AND RESULT 63
4.1 Overview 63
4.2 DNA Sequence for Sampling 64
4.3 DNA Sequence Data Set 65
4.3.1 FASTA Format 65
4.3.2 Binary Format 66
4.4 The Compression Metrics 67
4.4.1 Bit Rate 67
4.4.2 Percentage of Compression 68
4.4.3 Time Consumption 68
4.5 Experiments Result 69
4.6 Summary 71
5 ANALYSIS AND DISCUSSION 72
5.1 Overview 72
5.2 Bit Rate Discussion among Algorithm 73
5.3 Compression Percentage between FASTA and Binary Format 75
5.4 Speed to Compress / Decompress FASTA 77
5.5 Prediction Graph for Large Scale DNA Sequence 78
6 CONCLUSION AND FUTURE WORKS 81
6.1 Discussion 81
6.2 Future Works 83
6.2.1 A Solution for Binary Data Type 83
6.2.2 Multiple Architecture Capabilities 83
6.2.3 Greedy Method 84
6.2.4 Data Transfer Through Network 84
6.2.5 Pattern of DNA Sequence 85
6.3 Conclusion 85
REFERENCES 87
APPENDIX A 92
APPENDIX B 94
CHAPTER 1
INTRODUCTION
1.1 Overview
Data compression is important in making maximal use of limited information
storage and transmission capabilities. One might think that as such capabilities
increase, data compression would become less relevant. But so far this has not been
the case, since the volume of data always seems to increase more rapidly than
capabilities for storing and transmitting it. Wolfram (2002) noted that
compression is likely to remain relevant in the future wherever there are
physical constraints, such as transmission by electromagnetic radiation that is
not spatially localized.
There are many types of specialized compression, such as for text, images and
sound. In this research, DNA sequences will be the subject of the experiments;
they consist of a restricted kind of text only. Deoxyribonucleic acid (DNA)
constitutes the physical medium in which all properties of living organisms are
encoded. Biological databases such as EMBL, GenBank, and DDBJ were developed
around the world to store nucleotide sequences (DNA, RNA) and the amino-acid
sequences of proteins, and the number and size of those entries now grow
exponentially fast (Grumbach and Tahi, 1994). Although not as big as some other
scientific databases, their size is in the hundreds of gigabytes.
The first ever compression scheme was invented in 1838: the Morse code, for use
in telegraphy. It applies data compression by assigning shorter codewords to
letters such as "e" and "t" that are more common in English. In 1949 Claude
Shannon and Robert Fano developed a systematic way to assign codewords based on
the probabilities of blocks (Wolfram, 2002). In the mid-1970s, the idea emerged
of dynamically updating the codewords of Huffman encoding (Huffman, 1952) based
on the actual data encountered. In the late 1970s, with online storage of text
files becoming common, software compression programs began to be developed,
almost all based on adaptive Huffman coding. In 1977 Abraham Lempel and Jacob Ziv
(1977, 1978) suggested the basic idea of pointer-based encoding. In the
mid-1980s, following work by Terry Welch (1984), the so-called LZW algorithm
rapidly became the method of choice for most general-purpose compression systems.
It was used in programs such as PKZIP, as well as in hardware devices such as
modems (Nevill, Witten and Olson, 1996).
This research focuses on enhancing a commonly used character-compression scheme
to handle large-scale DNA sequences. Selected large-scale genes will be tested
using the proposed scheme. The next section discusses the background of DNA
sequencing, which leads to an understanding of why compression of large-scale DNA
sequences must be done, in the motivation of the research (Section 1.3). The
problem statement is given in Section 1.4, the objectives of the study are
presented in Section 1.5, and the scope of the study is presented in Section 1.6.
The thesis outline is introduced in Section 1.7.
1.2 Background of DNA Sequencing
Finding a single gene amid the vast stretches of DNA that make up the
human genome - three billion base pairs' worth - requires a set of powerful
tools. The Human Genome Project (HGP) was devoted to developing new and better
tools to make gene hunts faster, cheaper and practicable for almost any scientist
to accomplish (Watson, 1990; Francis et al., 1998).
These tools include genetic maps, physical maps and DNA sequence - a detailed
description of the order of the chemical building blocks, or bases, in a given
stretch of DNA. Indeed, the monumental achievement of the HGP was its successful
sequencing of the entire length of human DNA, also called the human genome
(Adams et al., 1991).
Scientists need to know the sequence of bases because it tells them the kind
of genetic information that is carried in a particular segment of DNA. For example,
they can use sequence information to determine which stretches of DNA contain
genes, as well as to analyze those genes for changes in sequence, called mutations,
that may cause disease.
DNA sequencing involves a process called the Polymerase Chain Reaction, or PCR.
The purpose of sequencing is to determine the order of the nucleotides of a gene.
This order is the key to the understanding of the human genome. Frederick Sanger
was first credited with the invention of DNA sequencing techniques, as noted by
Roberts (1987).
Sanger's approach involved copying DNA strands that would show the location of
the nucleotides in the strands through the use of X-ray machines. This technique
is very slow and tedious, usually taking many years to sequence only a few
million letters in a string of DNA that often contains hundreds of millions or
even billions of letters. Modern techniques make use of fluorescent tags instead
of X-rays, which significantly reduces the time required to process a given batch
of DNA.
In 1991, working with Nobel laureate Hamilton Smith, Venter's genomic
research institute (TIGR) created a bold new sequencing process coined
'shotgunning' (Weber and Myers, 1997):
"Using an ordinary kitchen blender, they would shatter the organism's DNA
into millions of small fragments, run them through the sequencers (which can read
500 letters at a time), then reassemble them into full genomes using a high speed
computer and novel software written in-house" (Weber and Myers, 1997).
This new method not only uses super fast automated machines, but also the
fluorescent detection process and the PCR DNA copying procedure. This method is
very fast and accurate compared to older techniques.
1.2.1 DNA Sequence Identification
DNA sequencing is a complex nucleotide-sequencing technique including
three identifiable steps:
1. Polymerase Chain Reaction (PCR)
2. Sequencing Reaction
3. Gel Electrophoresis & Computer Processing
Chromosomes (Roberts, 1987), which range from 50 million to 250 million
bases, must first be broken into much shorter pieces (PCR step). Each short piece is
used as a template to generate a set of fragments that differ in length from each other
by a single base that will be identified in a later step (template preparation and
sequencing reaction steps).
The fragments in a set are separated by gel electrophoresis (separation step).
New fluorescent dyes allow separation of all four fragments in a single lane on the
gel.
Figure 1.1: The Separation of the Molecules with Electrophoresis
The final base at the end of each fragment is identified (base-calling step).
This process recreates the original sequence of As, Ts, Cs, and Gs for each short
piece generated in the first step. Current electrophoresis limits are about 500 to 700
bases sequenced per read. Automated sequencers analyze the resulting
electropherograms and the output is a four-color chromatogram showing peaks that
represent each of the 4 DNA bases, as shown in Figure 1.1.
The fluorescently labeled fragments that migrate through the gel are passed
through a laser beam at the bottom of the gel. The laser excites the fluorescent
molecule, which sends out light of a distinct color. That light is collected and
focused by lenses into a spectrograph. Based on the wavelength, the spectrograph
separates the light across a CCD camera (charge coupled device). Each base has its
own color, so the sequencer can detect the order of the bases in the sequenced gene
as shown in Figure 1.2.
Figure 1.2: The Scanning and Detection System on the ABI Prism 377 Sequencer
After the bases are "read," computers are used to assemble the short
sequences (in blocks of about 500 bases each, called the read length) into long
continuous stretches that are analyzed for errors, gene-coding regions, and other
characteristics. This work uses the ABI Prism 377 sequencer shown in Figure 1.2.
Figure 1.3: A Snapshot of the Detection of the Molecules on the Sequencer
After the sequencer finishes its job, a window similar to Figure 1.3 is shown.
Each dot and its color represent one of the A, C, T, and G codes. This image is
then studied to produce a DNA sequence.
In the end, the DNA data are made available to the public to serve human needs.
Figure 1.4 summarizes the DNA sequencing steps.
Figure 1.4: DNA Sequencing Work Flow Summary
1.2.2 Large-scale DNA Sequencing
The evolution of the Human Genome Project (HGP) promises that the cells of all
organisms can be mapped. The human genome is about three billion (3,000,000,000)
base pairs long (Collins et al., 2003); if the average fragment length is 500
bases, it would take a minimum of 6 million reads (3 billion / 500) to sequence
the human genome (not allowing for overlap, i.e. 1-fold coverage). Keeping track
of such a high number of sequences presents significant challenges that can only
be met by developing and coordinating several procedural and computational
techniques, such as efficient database development and management.
Advancement of this knowledge will motivate further research towards completing
other genome projects. Therefore, a huge database with a good algorithm will make
such large-scale DNA sequencing reliable and feasible, without limitations.
1.2.3 Benefits of Genome Research
Rapid progress in genome science and a glimpse into its potential
applications have spurred observers to predict that biology will be the foremost
science of the 21st century. Technology and resources generated by the Human
Genome Project and other genomics research are already having a major impact on
research across the life sciences. The potential for commercial development of
genomics research presents U.S. industry with a wealth of opportunities, and sales of
DNA-based products and technologies in the biotechnology industry are projected to
exceed $45 billion by 2009 (Consulting Resources Corporation Newsletter, Spring
1999).
Technology and resources promoted by the HGP are starting to have
profound impacts on biomedical research and promise to revolutionize the wider
spectrum of biological research and clinical medicine. Increasingly detailed genome
maps have aided researchers seeking genes associated with dozens of genetic
conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis
types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast
cancer.
On the horizon is a new era of molecular medicine characterized less by
treating symptoms and more by looking to the most fundamental causes of disease.
Rapid and more specific diagnostic tests will make possible earlier treatment of
countless maladies. Medical researchers also will be able to devise novel therapeutic
regimens based on new classes of drugs, immunotherapy techniques, avoidance of
environmental conditions that may trigger disease, and possible augmentation or
even replacement of defective genes through gene therapy.
Other benefits include:
- Decoding of microbes
- Finding out about our potential weaknesses and problems
- Finding out about evolution and our links with life
- Helping to solve crimes
- Agricultural benefits
1.3 Motivation of the Research
The rapid advancement of next-generation DNA sequencers has been
possible due to vast improvements in computer technology, specifically in speed
and size. These new systems produce enormous amounts of data - one run can
generate close to one terabyte - and bioinformatics and data-management tools
have to play catch-up to enable the analysis and storage of these data.
Data management and storage will always be an issue for the life science and
medical research industries, and is something that vendors will constantly have
to improve to appease the research world. Luckily, there is hope for software
vendors. Researchers will only begin to warm to the idea that next-generation
technologies produce better data, and will provide time and cost savings, if
there are adequate software applications to analyze the data.
However, no matter how much researchers spend on storage devices, the
transmission problem remains. Even transferring data among computers can take
several hours for a 30-gigabyte file; what about terabytes of data?
Therefore, specific compression techniques for DNA have been invented
lately. Most of them use the LZ77 idea, because its dictionary function makes
sequential data easy to compress. Many compression algorithms focus on schemes to
shorten the process, enhance the compression ratio, and speed up compression.
From BioCompress to the newest algorithm, Graph Compression, these studies
concentrate on the compression of sequence data. Logically, if one sequence of
nucleotides (GTACCTATG…) is compressed using any such technique, its size will be
reduced. For example, using the BioCompress algorithm on the CHNTXX sequence, the
compression rate is 16.26% (Susan, 1998). For more details about existing DNA
sequence compression, please refer to Chapter 2.
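As a hedged illustration of what such a compression rate means (the exact metric definitions used in this thesis appear in Section 4.4), the helper functions below compute a bit rate and a percentage saving; the function names and toy figures are illustrative, not taken from any published algorithm:

```python
def bit_rate(compressed_bits, num_bases):
    """Bits needed per DNA base after compression (lower is better)."""
    return compressed_bits / num_bases

def compression_percentage(original_bits, compressed_bits):
    """Percentage of the original size saved by compression."""
    return 100.0 * (original_bits - compressed_bits) / original_bits

# Plain ASCII text spends 8 bits per base; merely packing A, C, G, T
# into 2 bits each already gives a 75% saving at 2 bits per base.
n = 1000  # bases in a toy sequence
print(bit_rate(2 * n, n))                    # 2.0
print(compression_percentage(8 * n, 2 * n))  # 75.0
```

A DNA-specific compressor such as BioCompress aims to push the bit rate below 2 by exploiting repeats within the sequence.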
Based on the issue above (previous section), the terabytes of DNA sequence data
from the Human Genome Project are not suitable for non-specific DNA compression.
Mostly, small sequences have been tested, and it has been proven that existing
techniques can solve the problem at that scale. Based on those experiments,
however, the compression ratio becomes worse as the data grow bigger.
1.4 Statement of the Problem
i. The size of DNA sequence databases, and of the chains themselves, rises
drastically in line with the advancement of sequencing technology. A
storage problem will occur soon.
ii. Advancement of LZ77 (LZSS) has always focused on general data; for DNA
sequences, many researchers keep testing and experimenting on popular,
small sequences instead of applying it to large-scale data.
iii. Huge data are not compatible with mobile device usage and data transfer
(Kwong and Ho, 2001). A good compression scheme for large-scale data
must be implemented to support mobile technology.
iv. The transfer rate between two research centers (e.g. the National Center
for Biotechnology Information in the United States and the Institute of
Medical Research in Malaysia) must be enhanced to cater for knowledge
transfer. Mostly, large-scale data take a lot of time to transfer.
1.5 Objectives of the Study
i. To find the best solution for large-scale DNA sequence compression.
This research will be the first to focus on large-scale DNA sequences.
ii. To enhance LZ77 (LZSS) from a universal data compression scheme to one
suited to the large-scale DNA sequence problem. This is feasible because
the characteristics of LZ77 match those of DNA sequences.
iii. To study the hash table approach, which has been applied to many types
of data (e.g. sequence data and images), and implement it in the LZ77
environment. This approach keeps the DNA sequence in computer memory
while compression or decompression is performed.
iv. To optimize the hash table to suit large-scale DNA sequence data with a
suitable method. A hash table cannot achieve optimum performance if the
data environment is not suitable for the hashing scheme.
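Objective (iii) can be illustrated with a minimal sketch (not the thesis implementation): a hash table that maps each k-mer of the sequence to the positions where it has already occurred, so an LZ77-style search can find earlier matches in expected constant time instead of scanning the whole window. The function name and the choice of k = 4 are illustrative assumptions:

```python
from collections import defaultdict

def build_kmer_index(sequence, k=4):
    """Map every length-k substring to the list of positions it starts at."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

seq = "GTACCTATGGTACC"
index = build_kmer_index(seq)
# "GTAC" occurs at positions 0 and 9, so the occurrence at 9 can be
# encoded as a back-reference to position 0.
print(index["GTAC"])  # [0, 9]
```

gzip's deflate implementation uses essentially this trick, hashing short prefixes of the lookahead to speed up its LZ77 match search.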
1.6 Scope of the Study
Compression of DNA sequences is a huge area within bioinformatics. Several
weighting factors have been identified for compressing the sequences. Some
approaches use the uniqueness of the DNA sequence or palindromes within the
sequence; this research will focus only on similarities among the characters.
However, the latest compression scheme, which uses dynamic programming, does not
use any of these factors.
There are two types of sequence stored in the NCBI databases: the FASTA and
binary formats. Bioinformatics applications sometimes need to use one or both
formats. Unfortunately, all DNA-specific compressors focus only on FASTA, which
makes it easy for a researcher to identify which DNA belongs to which organism,
unlike binary. On the other hand, binary is very good for data transmission: it
minimizes the size, which eases the transfer. This research will use and analyze
DNA sequence data in both FASTA and binary format (text sequence).
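The difference between the two formats can be sketched as follows (a hedged illustration, not the thesis code: the record content and function names are hypothetical). A FASTA record is plain text with a header line, while a binary representation can pack four bases into each byte:

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])
    return header, sequence

def pack_2bit(sequence):
    """Pack a DNA string into bytes, 2 bits per base, 4 bases per byte.

    If the length is not a multiple of 4, the final bases occupy the
    low-order bits of the last byte.
    """
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

record = ">chr_example\nGTAC\nCTAT"
header, seq = parse_fasta(record)
print(header, seq)          # chr_example GTACCTAT
print(len(pack_2bit(seq)))  # 2 -- two bytes instead of eight ASCII characters
```

This quartering of the size is one reason the binary form transmits so much more cheaply than FASTA text.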
In the modern world, many DNA databases (servers) reside in research centers.
Some of them focus only on a certain part of the living world: for example, the
bioinformatics database in Japan focuses on bacteria, while in Malaysia they tend
to store crop DNA such as Jatropha and oil palm. Only one universal server
supports databases from around the world: the National Center for Biotechnology
Information (NCBI) in the United States. Bioinformaticians trust this server
because of its capability to store multiple types of organism, including human.
This research follows that trend and uses it as the primary source.
The special characteristic highlighted in this research is the similarity among
sequences. In computer science, several compression algorithms have been
introduced, and the one that best suits the needs of this research is LZ77. It
uses a sliding window and creates a dictionary to compare against future
characters. Using the sliding-window technique, the compression work can be done
losslessly: the original data remain fully recoverable from the simplified
representation.
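The sliding-window idea can be made concrete with a toy LZ77-style codec (a minimal sketch under simplifying assumptions, not the enhanced algorithm developed in this thesis: matches are found by brute-force search and the match source is not allowed to overlap the lookahead):

```python
def lz77_compress(data, window=64):
    """Encode data as (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):  # scan the sliding window
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and j + length < i):  # source must stay inside the window
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Replay the triples, copying matched runs from the dictionary."""
    out = []
    for off, length, nxt in triples:
        for _ in range(length):
            out.append(out[-off])
        if nxt:
            out.append(nxt)
    return "".join(out)

seq = "GTACCTATGGTACCTAT"
assert lz77_decompress(lz77_compress(seq)) == seq  # lossless round trip
```

The round-trip assertion is what losslessness means here: the decoder reconstructs the original sequence exactly from the back-reference triples.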
1.7 Thesis Outline
This section gives a general description of the contents of the subsequent
chapters of this thesis. Chapter 2 reviews the various techniques for solving the
DNA sequence compression problem. Chapter 3 describes the methodology adopted to
achieve the objectives of this research. Chapter 4 discusses the algorithm
construction, focusing on the enhancement of the hash table to suit large DNA
sequences. Chapter 5 presents various experiments using several types of data and
environments. Chapter 6 summarizes the findings of the research and future works.