DNA Sequencing & Data Compression
DESCRIPTION
PPT on DNA Sequencing & Data Compression
TRANSCRIPT
-
DNA Sequencing
&
Data Compression
Presented by:-
Amit Jain
Sauradip Ghosh
Sumit Agarwal
-
Summary Layout
Part 1  Introduction
Part 2  Methods for DNA Sequencing
Part 3  Methods for Data Compression
Part 4  Experimental Results
Part 5  Future Scope, Limitation and Conclusion
-
The first genome was read by Frederick Sanger.
For low cost & high throughput, Next Generation Sequencing (NGS) is used.
NGS is divided into two parts:-
i. Reference Sequencing
ii. De-Novo Sequencing
Introduction
-
Data compression is a technique to reduce the number of bits needed to represent data.
It minimizes the cost of data storage & transmission.
There are basically two types of data compression:-
i. Lossless Compression
ii. Lossy Compression
Introduction
-
Methods for DNA Sequencing
1. Sequencing
2. Overlapping Concept
3. Sequence by Hybridization (SBH)
-
Sequencing
-
Overlapping Concept
Overlap (Si, Sj) is defined as the length of the longest prefix of Sj that
matches a suffix of Si.
Example:-
Si= ATGGCTA
Sj= GCTAATGG
-
Overlapping Concept
ATGGCTA
   GCTAATGG
ATGGCTAATGG
Overlapping Score = 4
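The overlap score above can be checked with a few lines of Python. This is a minimal sketch of the definition given on the previous slide; the function names `overlap` and `merge` are illustrative and not taken from the presentation.

```python
def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    # Try the longest candidate first and shrink until one matches.
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def merge(si: str, sj: str) -> str:
    """Join si and sj, writing the overlapping part only once."""
    return si + sj[overlap(si, sj):]


print(overlap("ATGGCTA", "GCTAATGG"))  # 4
print(merge("ATGGCTA", "GCTAATGG"))    # ATGGCTAATGG
```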
-
Two types of approaches used in SBH:-
1. Hamiltonian Path Approach
2. Eulerian Path Approach
Spectrum
S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
Sequence by Hybridization (SBH)
-
Example 1:-
Spectrum
S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG}
Overlapping the i-mers in the order of the path gives:
ATG
 TGC
  GCA
   CAG
    AGG
     GGT
      GTC
       TCC
ATGCAGGTCC
Hamiltonian Path Approach
-
So the Hamiltonian Path for the spectrum
S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG} is:-
ATG TGC GCA CAG AGG GGT GTC TCC
Hamiltonian Path Approach
-
For some spectra, more than one Hamiltonian path is possible.
Example 2:-
Spectrum S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
Hamiltonian Path Approach
-
Path 1(H1) :-
Spectrum
S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
ATG
 TGC
  GCG
   CGT
    GTG
     TGG
      GGC
       GCA
Final Sequence- ATGCGTGGCA
Hamiltonian Path Approach
-
The Hamiltonian Path of H1 is:
ATG TGC GCG CGT GTG TGG GGC GCA
Hamiltonian Path Approach
-
Path 2(H2) :-
Spectrum
S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
ATG
 TGG
  GGC
   GCG
    CGT
     GTG
      TGC
       GCA
Final Sequence- ATGGCGTGCA
Hamiltonian Path Approach
-
The Hamiltonian Path of H2 is:
ATG TGG GGC GCG CGT GTG TGC GCA
Hamiltonian Path Approach
-
So for the spectrum
S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
two sequences, ATGCGTGGCA and ATGGCGTGCA, are formed.
Hamiltonian Path Approach
-
Algorithm
Step 1. Input: A set S, representing all i-mers from an (unknown) string s.
Step 2. Draw a directed graph H, where every vertex p represents an i-mer of the spectrum.
Step 3. Join two vertices p1 and p2 by a directed edge from p1 to p2 if overlap(p1, p2) = i-1.
Step 4. Find a path that starts from the node whose indegree = 0, covers every vertex exactly once and finishes on the node whose outdegree = 0.
Step 5. Overlap the i-mers representing the vertices, in the order of the Hamiltonian path.
Step 6. Output: String s such that Spectrum(s, i) = S
Hamiltonian Path Approach
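The steps above can be sketched in Python as a brute-force search. This is only an illustration under stated assumptions: all i-mers have the same length, the spectrum is small enough for exhaustive backtracking, and the function name `hamiltonian_sequences` is hypothetical. If no node has indegree 0, the sketch falls back to trying every node as a start, which goes beyond Step 4.

```python
from typing import Dict, List, Set


def hamiltonian_sequences(spectrum: List[str]) -> List[str]:
    """Return every string spelled by a Hamiltonian path over the i-mers."""
    # Step 3: edge p -> q when the last i-1 letters of p equal the first i-1 of q.
    succ: Dict[str, List[str]] = {
        p: [q for q in spectrum if q != p and p[1:] == q[:-1]] for p in spectrum
    }
    indegree = {p: 0 for p in spectrum}
    for p in spectrum:
        for q in succ[p]:
            indegree[q] += 1
    # Step 4: start from nodes with indegree 0 (fallback: any node).
    starts = [p for p in spectrum if indegree[p] == 0] or list(spectrum)

    found: Set[str] = set()

    def extend(path: List[str], seen: Set[str]) -> None:
        if len(path) == len(spectrum):
            # Step 5: overlap the i-mers in path order.
            found.add(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for q in succ[path[-1]]:
            if q not in seen:
                seen.add(q)
                path.append(q)
                extend(path, seen)
                path.pop()
                seen.remove(q)

    for start in starts:
        extend([start], {start})
    return sorted(found)


example2 = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(hamiltonian_sequences(example2))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Exhaustive search is used here because finding a Hamiltonian path is NP-complete in general, so this approach only suits small spectra.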
-
Input spectrum
S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}
The (i-1)-mers corresponding to S are
{ AT, TG, GC, GG, GT, CA, CG }
In this case too, more than one path is possible.
Eulerian Path Approach
-
For the Eulerian Path 1(E1):-
AT
TG
GG
GC
CG
GT
TG
GC
CA
Final Sequence:- ATGGCGTGCA
Eulerian Path Approach
-
For the Eulerian Path 2(E2):-
AT
TG
GC
CG
GT
TG
GG
GC
CA
Final Sequence: ATGCGTGGCA
Eulerian Path Approach
-
Algorithm
Step 1. Input: A set S, representing all i-mers from an (unknown) string s.
Step 2. Form the (i-1)-mers corresponding to every i-mer from S (like "AT" and "TG" from the i-mer "ATG").
Step 3. Eliminate the duplicate (i-1)-mers.
Step 4. Draw a directed graph G, where every vertex p represents an (i-1)-mer of the spectrum.
Step 5. Join two vertices p1 and p2 by a directed edge from p1 to p2 if there is an i-mer whose first (i-1) characters coincide with p1 and whose last (i-1) characters coincide with p2.
Step 6. Find a path (Eulerian path) that starts from the node whose indegree = 0, covers every edge exactly once and finishes on the node whose outdegree = 0.
Step 7. Overlap the (i-1)-mers representing the vertices, in the order of the Eulerian path.
Step 8. Output: String s such that Spectrum(s, i) = S
Eulerian Path Approach
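A minimal Python sketch of these steps follows. It enumerates every Eulerian path by backtracking, which is only practical for small spectra like the one in the example. The function name `eulerian_sequences` is hypothetical, and the start node is chosen as the one with one more outgoing than incoming edge, which includes the indegree = 0 case of Step 6.

```python
from collections import Counter
from typing import List, Set, Tuple


def eulerian_sequences(spectrum: List[str]) -> List[str]:
    """Return every string spelled by an Eulerian path over the (i-1)-mers."""
    # Steps 2-5: each i-mer becomes one edge from its prefix to its suffix.
    edges: List[Tuple[str, str]] = [(mer[:-1], mer[1:]) for mer in spectrum]
    out_degree = Counter(u for u, _ in edges)
    in_degree = Counter(v for _, v in edges)
    nodes = set(out_degree) | set(in_degree)
    # Step 6: start where outdegree exceeds indegree by one (fallback: any node).
    starts = [n for n in nodes if out_degree[n] - in_degree[n] == 1] or sorted(nodes)

    found: Set[str] = set()

    def walk(node: str, unused: List[Tuple[str, str]], text: str) -> None:
        if not unused:
            found.add(text)  # every edge was used exactly once
            return
        for k, (u, v) in enumerate(unused):
            if u == node:
                # Step 7: moving along an edge appends one new character.
                walk(v, unused[:k] + unused[k + 1:], text + v[-1])

    for start in starts:
        walk(start, edges, start)
    return sorted(found)


spectrum = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(eulerian_sequences(spectrum))  # ['ATGCGTGGCA', 'ATGGCGTGCA']
```

For large inputs a linear-time method such as Hierholzer's algorithm would find one Eulerian path instead of listing all of them.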
-
Data Compression
Lossy Compression
A lossy data compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original, but is "close enough" to be useful in some way.
Lossless Compression
Lossless data compression makes use of data compression algorithms that allow the exact original data to be reconstructed from the compressed data.
-
Here we use 2 types of Lossless Data Compression:-
1. Huffman coding
2. Lempel-Ziv-Welch (LZW) coding
Lossless Data Compression
-
Example 1:-
"happy hip hop"
Frequency of p = 4
Frequency of h = 3
Frequency of space = 2
Frequency of a = 1
Frequency of i = 1
Frequency of o = 1
Frequency of y = 1
[Figure: Huffman tree built step by step; tree weights after each merge]
Start: 4 3 2 1 1 1 1 (p, h, space, a, i, o, y)
Merge o + y: 4 3 2 2 1 1
Merge a + i: 4 3 2 2 2
Merge space + (o, y): 4 4 3 2
Merge h + (a, i): 5 4 4
Merge p + (space, o, y): 8 5
Merge the last two trees: 13 (root)
Huffman Data Compression
-
Algorithm:
Step 1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.
Step 2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two trees and with the two trees as its left and right subtrees.
Step 3. Add the new combined tree back into the collection.
Step 4. Repeat steps 2 and 3 until there is only one tree left.
Step 5. The remaining node is the root of the optimal encoding tree.
Huffman Data Compression
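The five steps can be sketched with Python's `heapq` module as the collection of trees. This is an illustrative sketch, not the code used for the experimental results; the name `huffman_codes` and the tie-breaking counter are assumptions. Ties between equal weights may be broken differently than on the slides, so individual codes can differ, but the total encoded length is the same.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict


def huffman_codes(text: str) -> Dict[str, str]:
    """Build a prefix code for the characters of text (Steps 1 to 5)."""
    if not text:
        return {}
    order = count()  # unique tie-breaker so trees are never compared directly
    # Step 1: one singleton tree per character, weighted by frequency.
    heap = [(weight, next(order), char) for char, weight in Counter(text).items()]
    heapq.heapify(heap)
    # Steps 2-4: repeatedly merge the two lightest trees.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(order), (left, right)))

    codes: Dict[str, str] = {}

    def assign(tree, prefix: str) -> None:
        if isinstance(tree, tuple):      # internal node: (left, right)
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                            # leaf: a single character
            codes[tree] = prefix or "0"  # single-symbol input still gets one bit

    # Step 5: the remaining tree is the encoding tree.
    assign(heap[0][2], "")
    return codes


codes = huffman_codes("happy hip hop")
total_bits = sum(len(codes[ch]) for ch in "happy hip hop")
print(total_bits)  # 34
```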
-
Example 2: via ASCII Encoding
"happy hip hop"
char    ASCII   bit pattern (binary)
h       104     01101000
a       97      01100001
p       112     01110000
y       121     01111001
i       105     01101001
o       111     01101111
space   32      00100000
Encoded string in ASCII: 104 97 112 112 121 32 104 105 112 32 104 111 112
01101000 01100001 01110000 01110000 01111001 00100000 01101000
01101001 01110000 00100000 01101000 01101111 01110000
Huffman Data Compression
-
Example 3:- Fixed Length Encoding
"happy hip hop"
char    Number  bit pattern (binary)
H       0       000
A       1       001
P       2       010
Y       3       011
I       4       100
O       5       101
space   6       110
String will be encoded as:- 0 1 2 2 3 6 0 4 2 6 0 5 2
000 001 010 010 011 110 000 100 010 110
000 101 010
Huffman Data Compression
-
Example 4:- Variable Length Encoding
"happy hip hop"
Char    bit pattern (binary)
H       01
A       000
P       10
Y       1111
I       001
O       1110
Space   110
01 000 10 10 1111 110 01 001 10 110
01 1110 10
Huffman Data Compression
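The three encodings of "happy hip hop" can be compared by counting bits. The sketch below assumes the code table of Example 4 applies to the lowercase letters of the string; the names `VARIABLE`, `encode` and `decode` are illustrative.

```python
TEXT = "happy hip hop"

# Variable-length code table from Example 4 (keys lowercased, an assumption).
VARIABLE = {
    "h": "01", "a": "000", "p": "10", "y": "1111",
    "i": "001", "o": "1110", " ": "110",
}


def encode(text: str, table: dict) -> str:
    """Concatenate the bit pattern of every character."""
    return "".join(table[ch] for ch in text)


def decode(bits: str, table: dict) -> str:
    """Read bits left to right; valid because no code is a prefix of another."""
    inverse = {pattern: ch for ch, pattern in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


ascii_bits = 8 * len(TEXT)                   # Example 2: 8 bits per character
fixed_bits = 3 * len(TEXT)                   # Example 3: 3 bits per character
variable_bits = len(encode(TEXT, VARIABLE))  # Example 4

print(ascii_bits, fixed_bits, variable_bits)  # 104 39 34
```

The variable-length code is shortest because the frequent characters p and h get the 2-bit patterns.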
-
Example 5:-
"ATOZOFCATOZOFCATOZOFC#"
Character Read  String Stored / Retrieved  Process in Table  In file
A                                          Store
T               AT                         Store             Store
O               TO                         Store             Store
Z               OZ                         Store             Store
O               ZO                         Store             Store
F               OF                         Store             Store
C               FC                         Store             Store
A               CA                         Store             -
T               AT                         Retrieve          Store Relevant Code
O               ATO                        Store             -
Z               OZ                         Retrieve          -
O               OZO                        Store             Store Relevant Code
F               OF                         Retrieve          -
C               OFC                        Store             -
A               CA                         Retrieve          Store Relevant Code
LZW Data Compression
-
Example 5:-
"ATOZOFCATOZOFCATOZOFC#"
Character Read  String Stored / Retrieved  Process in Table  In file
T               CAT                        Store             -
O               TO                         Retrieve          Store Relevant Code
Z               TOZ                        Store             -
O               ZO                         Retrieve          Store Relevant Code
F               ZOF                        Store             -
C               FC                         Retrieve          Store Relevant Code
LZW Data Compression
-
Example 5:-
Strings given after converting from 12 bit format to 8 bit format, with the compressed bytes (in hex)
Strings    Compressed Bytes (in hex)
A , T      04 10 54
O , Z      04 F0 5A
O , F      04 F0 46
C , AT     04 31 00
OZ , OF    10 21 04
CA , TO    10 61 01
ZO , FC    10 31 05
LZW Data Compression
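The trace of Example 5 can be reproduced with a small table-based LZW encoder. This is a sketch under stated assumptions: the table starts with the 256 single-byte characters, new strings get codes from 256 upward, the trailing "#" is treated as an end marker and left out, and two 12-bit codes are packed into three bytes. The names `lzw_encode` and `pack12` are hypothetical.

```python
from typing import List


def lzw_encode(text: str) -> List[int]:
    """Encode text (characters below 256 assumed) into a list of LZW codes."""
    table = {chr(c): c for c in range(256)}  # initial single-character table
    next_code = 256
    current = ""
    codes: List[int] = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate              # "Retrieve": keep extending the match
        else:
            codes.append(table[current])     # write the code of the longest match
            table[candidate] = next_code     # "Store": add the new string
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes


def pack12(codes: List[int]) -> bytes:
    """Pack pairs of 12-bit codes into 3 bytes; an odd tail is padded with 0."""
    padded = codes + [0] * (len(codes) % 2)
    out = bytearray()
    for a, b in zip(padded[0::2], padded[1::2]):
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)


codes = lzw_encode("ATOZOFCATOZOFCATOZOFC")
print(codes)
# [65, 84, 79, 90, 79, 70, 67, 256, 258, 260, 262, 257, 259, 261]
print(pack12(codes).hex())
```

The 21 input characters become 14 codes, which is also 21 bytes in 12-bit packing, so this short input gains nothing; longer repetitive inputs compress better as the table fills with longer strings.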
-
Encoding Algorithm
Step 1. Initialize the dictionary to a known value.
Step 2. Read an uncoded string that is the length of the maximum allowable match.
Step 3. Search for the longest matching string in the dictionary.
Step 4. If a match is found greater than or equal to the minimum allowable match length:
A) Write the encoded flag, then the offset and length to the encoded output.
B) Otherwise, write the uncoded flag and the first uncoded symbol to the encoded output.
Step 5. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary.
Step 6. Read a number of symbols from the uncoded input equal to the number of symbols written in Step 4.
Step 7. Repeat from Step 3, until the entire input has been encoded.
LZW Data Compression
-
Decoding Algorithm
Step 1. Initialize the dictionary to a known value.
Step 2. Read the encoded/not encoded flag.
Step 3. If the flag indicates an encoded string:
A. Read the encoded length and offset, then copy the specified number of symbols from the dictionary to the decoded output.
B. Otherwise, read the next character and write it to the decoded output.
Step 4. Shift a copy of the symbols written to the decoded output into the dictionary.
Step 5. Repeat from Step 2, until the entire input has been decoded.
LZW Data Compression
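The flag, offset and length steps above read like a sliding-window dictionary scheme, while Example 5 builds a string table. The sketch below follows the table-based reading and inverts the code sequence of Example 5. It assumes the same initial table of 256 single characters with new codes starting at 256; the name `lzw_decode` is hypothetical.

```python
from typing import Iterable


def lzw_decode(codes: Iterable[int]) -> str:
    """Rebuild the text from LZW codes, regrowing the table while decoding."""
    table = {c: chr(c) for c in range(256)}  # same initial table as the encoder
    next_code = 256
    iterator = iter(codes)
    try:
        previous = table[next(iterator)]
    except StopIteration:
        return ""                            # empty input
    out = [previous]
    for code in iterator:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Code not yet in the table: it must be previous + its first char.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # Add the string the encoder stored at this point.
        table[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)


example_codes = [65, 84, 79, 90, 79, 70, 67, 256, 258, 260, 262, 257, 259, 261]
print(lzw_decode(example_codes))  # ATOZOFCATOZOFCATOZOFC
```

The decoder never needs the table to be transmitted, because it can rebuild each entry from the codes it has already read.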
-
Hamiltonian Approach of SBH
Experimental Results
-
Eulerian Approach of SBH
Experimental Results
-
Huffman Coding
Experimental Results
-
LZW Coding
Experimental Results
Output
Input
-
Future Work & Limitation
I. Selecting the correct sequence when more than one sequence is produced in SBH, which requires extensive biological knowledge.
II. Showing the graph in the output & giving better graphics.
III. Refactoring and optimizing all the code.
IV. Representing the code table in the output along with the output stream.
V. Integrating both concepts so that a single input can be used to find the DNA sequence, compress it & store it.
-
Conclusion
I. Finding the DNA sequence using the Hamiltonian & Eulerian approaches, which are low cost & high throughput methods and can be very helpful for bioinformatics.
II. Compressing data using the Huffman & LZW methods, thereby reducing the storage & transmission cost.
III. Implementation of the Hamiltonian & Eulerian approaches in a new way.