DNA Sequencing & Data Compression

Presented by:-

Amit Jain

Sauradip Ghosh

Sumit Agarwal

Summary Layout

Part 1: Introduction

Part 2: Methods for DNA Sequencing

Part 3: Methods for Data Compression

Part 4: Experimental Results

Part 5: Future Scope, Limitation and Conclusion

Introduction

The first genome was read by Frederick Sanger.

For low cost & high throughput, Next Generation Sequencing (NGS) is used.

NGS is divided into two parts:-

i. Reference Sequencing

ii. De-Novo Sequencing

Data compression is a technique to reduce the number of bits needed to represent data.

It minimizes the cost of data storage & transmission.

There are basically two types of data compression:-

i. Lossless Compression

ii. Lossy Compression

Methods for DNA Sequencing

1. Sequencing

2. Overlapping Concept

3. Sequence by Hybridization (SBH)

Sequencing

Overlapping Concept

Overlap (Si, Sj) is defined as the length of the longest prefix of Sj that

matches a suffix of Si.

Example:-

Si= ATGGCTA

Sj= GCTAATGG

ATGGCTA
   GCTAATGG
ATGGCTAATGG

Overlapping Score = 4
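The overlap definition can be checked with a few lines of Python. This is a minimal sketch: the function names `overlap` and `merge` are illustrative choices and do not come from the slides.

```python
def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    for length in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:length]):
            return length
    return 0


def merge(si: str, sj: str) -> str:
    """Join si and sj, writing the overlapping part only once."""
    return si + sj[overlap(si, sj):]


print(overlap("ATGGCTA", "GCTAATGG"))  # 4
print(merge("ATGGCTA", "GCTAATGG"))    # ATGGCTAATGG
```

Trying the longest candidate first means the first hit is the overlap score, so no further bookkeeping is needed.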

Sequence by Hybridization (SBH)

Two types of approaches are used in SBH:-

1. Hamiltonian Path Approach

2. Eulerian Path Approach

Spectrum

S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

Hamiltonian Path Approach

Example 1:-

Spectrum

S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG}

Overlapping the i-mers in the order of the path gives:

ATG
 TGC
  GCA
   CAG
    AGG
     GGT
      GTC
       TCC
ATGCAGGTCC

So the Hamiltonian Path for the spectrum

S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG} is:-

ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC


For some sequences, more than one Hamiltonian path is possible.

Example 2:-

Spectrum S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}


Path 1 (H1):-

ATG
 TGC
  GCG
   CGT
    GTG
     TGG
      GGC
       GCA

Final Sequence- ATGCGTGGCA

The Hamiltonian Path of H1 is:

ATG -> TGC -> GCG -> CGT -> GTG -> TGG -> GGC -> GCA


Path 2 (H2):-

ATG
 TGG
  GGC
   GCG
    CGT
     GTG
      TGC
       GCA

Final Sequence- ATGGCGTGCA

The Hamiltonian Path of H2 is:

ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA


So for the spectrum S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}, two sequences, ATGCGTGGCA and ATGGCGTGCA, are formed.
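That both sequences are valid reconstructions can be confirmed by computing their spectra and comparing them. The helper below is a small sketch; the name `spectrum` and its signature are assumptions made for illustration.

```python
def spectrum(s: str, i: int) -> set:
    """Set of all i-mers (substrings of length i) occurring in s."""
    return {s[k:k + i] for k in range(len(s) - i + 1)}


h1 = "ATGCGTGGCA"
h2 = "ATGGCGTGCA"
# Both sequences contain exactly the same eight 3-mers.
print(spectrum(h1, 3) == spectrum(h2, 3))  # True
```

Because the spectrum is a set, it carries no information about the order of the i-mers, which is why two different sequences can share it.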


Algorithm

Step 1. Input: A set S, representing all i-mers from an (unknown) string s.

Step 2. Draw a directed graph H, where every vertex (p) represents an i-mer of the spectrum.

Step 3. Two vertices p1 and p2 are joined by a directed edge if overlap (p1, p2) = i-1.

Step 4. Find a path that starts from the node whose indegree = 0, covers every vertex exactly once, and finishes on the node whose outdegree = 0.

Step 5. Overlap the i-mers representing the vertices, in the order of the Hamiltonian path.

Step 6. Output: String s such that Spectrum (s, i) = S
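The steps above can be sketched as a brute-force search. This is an illustrative implementation, not the presenters' code: the name `hamiltonian_sbh` is hypothetical, and the search tries every vertex as a start instead of checking indegrees, which is only practical for small spectra.

```python
def hamiltonian_sbh(spectrum):
    """Return every string whose i-mers, visited once each, form the spectrum."""
    kmers = list(spectrum)
    # Edge p -> q when the last i-1 characters of p equal the first i-1 of q.
    succ = {p: [q for q in kmers if q != p and p[1:] == q[:-1]] for p in kmers}
    results = set()

    def extend(path, used):
        if len(path) == len(kmers):
            # Overlap the i-mers in path order: first i-mer plus last letters.
            results.add(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for q in succ[path[-1]]:
            if q not in used:
                used.add(q)
                extend(path + [q], used)
                used.remove(q)

    for start in kmers:
        extend([start], {start})
    return sorted(results)


print(hamiltonian_sbh(["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]))
# ['ATGCAGGTCC']
print(hamiltonian_sbh(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Finding a Hamiltonian path is NP-complete in general, so backtracking like this does not scale to large spectra.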


Eulerian Path Approach

Input spectrum

S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

The (i-1)-mers corresponding to S are

{ AT, TG, GC, GG, GT, CA, CG }

In this case too, more than one path is possible.

For the Eulerian Path 1 (E1):-

AT -> TG -> GG -> GC -> CG -> GT -> TG -> GC -> CA

Final Sequence:- ATGGCGTGCA


For the Eulerian Path 2 (E2):-

AT -> TG -> GC -> CG -> GT -> TG -> GG -> GC -> CA

Final Sequence:- ATGCGTGGCA


Algorithm

Step 1. Input: A set S, representing all i-mers from an (unknown) string s.

Step 2. Form the (i-1)-mers corresponding to every i-mer from S (like "AT" and "TG" from the i-mer "ATG").

Step 3. Eliminate the duplicate (i-1)-mers.

Step 4. Draw a directed graph G, where every vertex (p) represents an (i-1)-mer of the spectrum.

Step 5. Two vertices p1 and p2 are joined by a directed edge from p1 to p2 if there is an i-mer whose first (i-1) characters coincide with p1 and whose last (i-1) characters coincide with p2.

Step 6. Find a path (Eulerian path) that starts from the node whose indegree = 0, covers every edge exactly once, and finishes on the node whose outdegree = 0.

Step 7. Overlap the (i-1)-mers representing the vertices, in the order of the Eulerian path.

Step 8. Output: String s such that Spectrum (s, i) = S
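A compact way to try the Eulerian steps is to treat each i-mer as an edge between its two (i-1)-mers and enumerate all walks that use every edge once. The sketch below is an assumption-laden illustration: `eulerian_sbh` is a hypothetical name, the start node is chosen as the one with one more outgoing than incoming edge (which includes the indegree = 0 case), and all paths are enumerated by backtracking, which suits small examples only.

```python
from collections import defaultdict


def eulerian_sbh(spectrum):
    """Return every string spelled by an Eulerian path over the i-mer edges."""
    out_edges = defaultdict(list)   # (i-1)-mer -> list of successor (i-1)-mers
    indegree = defaultdict(int)
    for kmer in spectrum:
        out_edges[kmer[:-1]].append(kmer[1:])
        indegree[kmer[1:]] += 1
    nodes = set(out_edges) | set(indegree)
    starts = [v for v in nodes
              if len(out_edges.get(v, [])) - indegree.get(v, 0) == 1]
    if not starts:                  # closed tour: any node with an edge will do
        starts = [v for v in nodes if out_edges.get(v)]
    n_edges = len(spectrum)
    results = set()

    def walk(node, seq, used):
        if len(used) == n_edges:
            results.add(seq)
            return
        for idx, nxt in enumerate(out_edges.get(node, [])):
            if (node, idx) not in used:
                used.add((node, idx))
                walk(nxt, seq + nxt[-1], used)   # overlap adds one letter
                used.remove((node, idx))

    for start in starts:
        walk(start, start, set())
    return sorted(results)


print(eulerian_sbh(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

When only one reconstruction is needed, Hierholzer's algorithm finds a single Eulerian path in time linear in the number of edges.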


Data Compression

Lossy Compression

A lossy data compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original, but is "close enough" to be useful in some way.

Lossless Compression

Lossless data compression makes use of algorithms that allow the exact original data to be reconstructed from the compressed data.

Lossless Data Compression

Here we use 2 types of Lossless Data Compression:-

1. Huffman coding

2. Lempel-Ziv-Welch (LZW) coding

Huffman Data Compression

Example 1:-

"happy hip hop"

Frequency of p = 4
Frequency of h = 3
Frequency of space = 2
Frequency of a = 1
Frequency of i = 1
Frequency of o = 1
Frequency of y = 1

[Figure: step-by-step construction of the Huffman tree for "happy hip hop". Tree weights in the collection after each merge: 4 3 2 2 1 1; 4 3 2 2 2; 4 4 3 2; 5 4 4; 8 5; 13 (root).]

Algorithm:

Step 1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.

Step 2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two trees and with the two trees as its left and right subtrees.

Step 3. Add the new combined tree back into the collection.

Step 4. Repeat steps 2 and 3 until there is only one tree left.

Step 5. The remaining node is the root of the optimal encoding tree.
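The five steps map directly onto a priority queue. The following is a minimal sketch using Python's standard `heapq` and `collections.Counter`; the function name `huffman_codes` and the choice of "0" for left and "1" for right are assumptions, and the exact bit patterns depend on how ties between equal weights are broken.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a Huffman code table {character: bit string} for text."""
    tie = count()  # tie-breaker so the heap never compares trees directly
    # Step 1: one singleton tree per character, weighted by frequency.
    heap = [(w, next(tie), ch) for ch, w in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Steps 2-4: merge the two lightest trees until one remains.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    # Step 5: walk the remaining tree, 0 = left subtree, 1 = right subtree.
    codes = {}

    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    assign(heap[0][2], "")
    return codes


text = "happy hip hop"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded))  # 34
```

Different tie-breaking produces differently shaped trees, but every Huffman tree for these frequencies encodes the string in 34 bits.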


Example 2 : via ASCII Encoding

char     ASCII   bit pattern (binary)
h        104     01101000
a        97      01100001
p        112     01110000
y        121     01111001
i        105     01101001
o        111     01101111
space    32      00100000

Encoded String in ASCII - 104 97 112 112 121 32 104 105 112 32 104 111 112.

01101000 01100001 01110000 01110000 01111001 00100000 01101000

01101001 01110000 00100000 01101000 01101111 01110000


Example 3 :- Fixed Length Encoding

char     Number   bit pattern (binary)
H        0        000
A        1        001
P        2        010
Y        3        011
I        4        100
O        5        101
space    6        110

String will be encoded as :- 0 1 2 2 3 6 0 4 2 6 0 5 2

000 001 010 010 011 110 000 100 010 110

000 101 010


Example 4 :- Variable Length Encoding

Char     bit pattern (binary)
H        01
A        000
P        10
Y        1111
I        001
O        1110
Space    110

01 000 10 10 1111 110 01 001 10 110

01 1110 10
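The saving from the variable-length table can be measured directly. The sketch below hard-codes the code table from Example 4 (in lowercase, to match the input string); the names `CODE`, `encode` and `decode` are illustrative. Decoding works bit by bit because no code word is a prefix of another.

```python
# Variable-length code table taken from Example 4.
CODE = {"h": "01", "a": "000", "p": "10", "y": "1111",
        "i": "001", "o": "1110", " ": "110"}


def encode(text, table):
    """Concatenate the code word of every character."""
    return "".join(table[ch] for ch in text)


def decode(bits, table):
    """Read bits until they form a code word, emit it, and start again."""
    inverse = {word: ch for ch, word in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "happy hip hop"
bits = encode(text, CODE)
print(len(text) * 8)   # 104 bits with 8-bit ASCII
print(len(text) * 3)   # 39 bits with the 3-bit fixed-length code
print(len(bits))       # 34 bits with the variable-length code
print(decode(bits, CODE) == text)  # True
```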


LZW Data Compression

Example 5 :-

"ATOZOFCATOZOFCATOZOFC#"

Character Read   String Stored / Retrieved   Process in Table   In file
A                                            Store
T                AT                          Store              Store
O                TO                          Store              Store
Z                OZ                          Store              Store
O                ZO                          Store              Store
F                OF                          Store              Store
C                FC                          Store              Store
A                CA                          Store              -
T                AT                          Retrieve           Store Relevant Code
O                ATO                         Store              -
Z                OZ                          Retrieve           -
O                OZO                         Store              Store Relevant Code
F                OF                          Retrieve           -
C                OFC                         Store              -
A                CA                          Retrieve           Store Relevant Code
T                CAT                         Store              -
O                TO                          Retrieve           Store Relevant Code
Z                TOZ                         Store              -
O                ZO                          Retrieve           Store Relevant Code
F                ZOF                         Store              -
C                FC                          Retrieve           Store Relevant Code



Compressed Bytes (in hex)   Strings given after converting from 12 bit format to 8 bit format
04 10 54                    A , T
04 F0 5A                    O , Z
04 F0 46                    O , F
04 31 00                    C , AT
10 21 04                    OZ , OF
10 61 01                    CA , TO
10 31 05                    ZO , FC
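The table can be reproduced with a short LZW encoder whose dictionary starts with the 256 single-byte characters, so the first new string receives code 256. This is a sketch under stated assumptions: the names `lzw_encode` and `pack12` are hypothetical, two 12-bit codes are packed into three bytes, an odd final code is padded with a zero code, and the 4096-entry limit of a 12-bit dictionary is not enforced.

```python
def lzw_encode(text):
    """Return the list of LZW codes for text (codes 0-255 are single characters)."""
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current, codes = "", []
    for ch in text:
        if current + ch in table:
            current += ch                      # keep extending the match
        else:
            codes.append(table[current])       # emit code of the longest match
            table[current + ch] = next_code    # store the new string
            next_code += 1
            current = ch
    if current:
        codes.append(table[current])
    return codes


def pack12(codes):
    """Pack pairs of 12-bit codes into three bytes each."""
    if len(codes) % 2:
        codes = codes + [0]                    # assumption: pad with a zero code
    out = bytearray()
    for a, b in zip(codes[0::2], codes[1::2]):
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)


codes = lzw_encode("ATOZOFCATOZOFCATOZOFC#")
print(codes)
print(pack12(codes).hex(" "))
```

The 22 input characters become 15 codes: seven single letters, seven two-letter dictionary entries (AT, OZ, OF, CA, TO, ZO, FC) and the final "#".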


Encoding Algorithm

Step 1. Initialize the dictionary to a known value.

Step 2. Read an uncoded string that is the length of the maximum allowable match.

Step 3. Search for the longest matching string in the dictionary.

Step 4. If a match is found that is greater than or equal to the minimum allowable match length:

A) Write the encoded flag, then the offset and length, to the encoded output.

B) Otherwise, write the uncoded flag and the first uncoded symbol to the encoded output.

Step 5. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary.

Step 6. Read a number of symbols from the uncoded input equal to the number of symbols written in Step 4.

Step 7. Repeat from Step 3 until the entire input has been encoded.


Decoding Algorithm

Step 1. Initialize the dictionary to a known value.

Step 2. Read the encoded/not encoded flag.

Step 3. If the flag indicates an encoded string:

A) Read the encoded length and offset, then copy the specified number of symbols from the dictionary to the decoded output.

B) Otherwise, read the next character and write it to the decoded output.

Step 4. Shift a copy of the symbols written to the decoded output into the dictionary.

Step 5. Repeat from Step 2 until the entire input has been decoded.
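The encoding and decoding steps, as written, use flags, offsets and lengths over a dictionary of recently seen symbols, which is the sliding-window style of LZ77/LZSS and not the code table of LZW. A minimal sketch of such a flag/offset/length scheme follows. Everything in it is an assumption made for illustration: the names, the token layout `(flag, ...)`, the window and match-length limits, and the choice to start with an empty dictionary.

```python
def lzss_encode(data, window=4096, min_match=3, max_match=18):
    """Encode data as tokens: (0, symbol) or (1, offset, length)."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Search the window of already-encoded symbols for the longest match.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (length < max_match and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= min_match:
            tokens.append((1, best_off, best_len))   # encoded flag
            pos += best_len
        else:
            tokens.append((0, data[pos]))            # uncoded flag
            pos += 1
    return tokens


def lzss_decode(tokens):
    """Rebuild the original string from the token list."""
    out = []
    for token in tokens:
        if token[0] == 1:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])   # copy symbol by symbol (may overlap)
        else:
            out.append(token[1])
    return "".join(out)


tokens = lzss_encode("ATOZOFCATOZOFCATOZOFC#")
print(tokens)
print(lzss_decode(tokens) == "ATOZOFCATOZOFCATOZOFC#")  # True
```

On the example string, the seven letters of the first "ATOZOFC" are written uncoded, the two repeats collapse into a single (offset 7, length 14) token, and "#" is written uncoded.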


Hamiltonian Approach of SBH


Eulerian Approach of SBH


Huffman Coding


LZW Coding


Output

Input

Future Work & Limitation

I. Selecting the correct sequence when more than one sequence is produced in SBH, which requires extensive biological knowledge.

II. Showing the graph in the output & giving better graphics.

III. Refactoring and optimizing all the code.

IV. Representing the code table in the output along with the output stream.

V. Integrating both concepts so that a single input can be used to find the DNA sequence, compress it & store it.

Conclusion

I. Finding the DNA sequence using the Hamiltonian & Eulerian approaches, which are low cost & high throughput methods and can be very helpful for bio-informatics.

II. Compressing data using the Huffman & LZW methods, thereby reducing the storage & transmission cost.

III. Implementation of the Hamiltonian & Eulerian approaches in a new way.

