DNA Sequencing & Data Compression

Presented by: Amit Jain, Sauradip Ghosh, Sumit Agarwal

Upload: khawaja-ali

Post on 23-Nov-2015


DESCRIPTION

PPT on DNA Sequencing & Data Compression

TRANSCRIPT

  • DNA Sequencing & Data Compression

    Presented by:

    Amit Jain

    Sauradip Ghosh

    Sumit Agarwal

  • Summary Layout

    Part 1: Introduction

    Part 2: Methods for DNA Sequencing

    Part 3: Methods for Data Compression

    Part 4: Experimental Results

    Part 5: Future Scope, Limitation and Conclusion

  • The first genome was read by Frederick Sanger.

    For low cost & high throughput, Next Generation Sequencing (NGS) is used.

    NGS is divided into two parts:-

    i. Reference Sequencing

    ii. De-Novo Sequencing

    Introduction

  • Data compression is a technique to reduce or compress data.

    Minimizes the cost of data storage & transmission

    There are two basic types of data compression:-

    i. Lossless Compression

    ii. Lossy Compression

    Introduction

  • Methods for DNA Sequencing

    1. Sequencing

    2. Overlapping Concept

    3. Sequence by Hybridization (SBH)

  • Sequencing

  • Overlapping Concept

    Overlap (Si, Sj) is defined as the length of the longest prefix of Sj that

    matches a suffix of Si.

    Example:-

    Si= ATGGCTA

    Sj= GCTAATGG

  • Overlapping Concept

    ATGGCTA
       GCTAATGG

    ATGGCTAATGG

    Overlapping Score = 4
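The overlap definition above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation; the function names `overlap` and `merge` are assumptions introduced here.

```python
def overlap(si: str, sj: str) -> int:
    """Length of the longest prefix of sj that matches a suffix of si."""
    # Try the longest possible overlap first and shrink until one fits.
    for k in range(min(len(si), len(sj)), 0, -1):
        if si.endswith(sj[:k]):
            return k
    return 0


def merge(si: str, sj: str) -> str:
    """Join si and sj, writing the overlapping part only once."""
    return si + sj[overlap(si, sj):]


print(overlap("ATGGCTA", "GCTAATGG"))  # 4
print(merge("ATGGCTA", "GCTAATGG"))    # ATGGCTAATGG
```

With the slide's strings, the overlap is GCTA, so the score is 4 and the merged string is ATGGCTAATGG.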

  • Two types of approaches used in SBH:-

    1. Hamiltonian Path Approach

    2. Eulerian Path Approach

    Spectrum

    S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    Sequence by Hybridization (SBH)

  • Example 1:-

    Spectrum

    S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG}

    Overlapping the 3-mers in path order gives:

    ATG
     TGC
      GCA
       CAG
        AGG
         GGT
          GTC
           TCC

    ATGCAGGTCC

    Hamiltonian Path Approach

  • So the Hamiltonian Path for the spectrum

    S(n,i)={ ATG,AGG,TGC,TCC,GTC,GGT,GCA,CAG} is:-

    ATG TGC GCA CAG AGG GGT GTC TCC

    Hamiltonian Path Approach

  • For some sequences, more than one Hamiltonian path is possible.

    Example 2:-

    Spectrum S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    Hamiltonian Path Approach

  • Path 1(H1) :-

    Spectrum

    S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    ATG
     TGC
      GCG
       CGT
        GTG
         TGG
          GGC
           GCA

    Final Sequence: ATGCGTGGCA

    Hamiltonian Path Approach

  • The Hamiltonian Path of H1 is:

    ATG TGC GCG CGT GTG TGG GGC GCA

    Hamiltonian Path Approach

  • Path 2(H2) :-

    Spectrum

    S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    ATG
     TGG
      GGC
       GCG
        CGT
         GTG
          TGC
           GCA

    Final Sequence: ATGGCGTGCA

    Hamiltonian Path Approach

  • The Hamiltonian Path of H2 is:

    ATG TGG GGC GCG CGT GTG TGC GCA

    Hamiltonian Path Approach

  • So for the spectrum

    S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    two sequences, ATGCGTGGCA and ATGGCGTGCA, are formed.

    Hamiltonian Path Approach

  • Algorithm

    Step 1. Input: a set S containing all i-mers of an (unknown) string s.

    Step 2. Draw a directed graph H in which every vertex p represents an i-mer of the spectrum.

    Step 3. Join two vertices p1 and p2 by a directed edge from p1 to p2 if overlap(p1, p2) = i-1.

    Step 4. Find a path that starts from the node whose in-degree is 0, covers every vertex exactly once, and finishes on the node whose out-degree is 0.

    Step 5. Overlap the i-mers represented by the vertices, in the order given by the Hamiltonian path.

    Step 6. Output: a string s such that Spectrum(s, i) = S.

    Hamiltonian Path Approach
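The steps above can be sketched as a small backtracking search. This is an illustrative sketch only, not the presenters' code: the function name `hamiltonian_sequences` is an assumption, the search is exhaustive (exponential in the worst case), and it falls back to trying every vertex as a start when no vertex has in-degree 0.

```python
def hamiltonian_sequences(spectrum):
    """Return every string whose i-mers, visited once each, form the spectrum."""
    kmers = sorted(set(spectrum))
    # Edge p -> q when the last i-1 letters of p equal the first i-1 letters of q.
    succ = {p: [q for q in kmers if q != p and p[1:] == q[:-1]] for p in kmers}
    has_pred = {q for p in kmers for q in succ[p]}
    # Start from in-degree-0 vertices; if there are none, try all (assumption).
    starts = [p for p in kmers if p not in has_pred] or kmers
    results = set()

    def extend(path, visited):
        if len(path) == len(kmers):
            # Overlap the i-mers in path order: first i-mer plus one letter each.
            results.add(path[0] + "".join(p[-1] for p in path[1:]))
            return
        for q in succ[path[-1]]:
            if q not in visited:
                visited.add(q)
                path.append(q)
                extend(path, visited)
                path.pop()
                visited.remove(q)

    for s in starts:
        extend([s], {s})
    return results


example1 = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
example2 = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(hamiltonian_sequences(example1))
print(sorted(hamiltonian_sequences(example2)))
```

For the spectrum of Example 1 the search finds one sequence, and for Example 2 it finds the two sequences produced by paths H1 and H2.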

  • Input spectrum

    S(n,i)={ ATG,TGG,TGC,GTG,GGC,GCA,GCG,CGT}

    The (i-1)-mers corresponding to S are

    { AT, TG, GC, GG, GT, CA, CG }

    In this case too, more than one path is possible.

    Eulerian Path Approach

  • For the Eulerian Path 1(E1):-

    AT

    TG

    GG

    GC

    CG

    GT

    TG

    GC

    CA

    Final Sequence:- ATGGCGTGCA

    Eulerian Path Approach

  • For the Eulerian Path 2(E2):-

    AT

    TG

    GC

    CG

    GT

    TG

    GG

    GC

    CA

    Final Sequence: ATGCGTGGCA

    Eulerian Path Approach

  • Algorithm

    Step 1. Input: a set S containing all i-mers of an (unknown) string s.

    Step 2. Form the (i-1)-mers corresponding to every i-mer in S (for example, "AT" and "TG" from the i-mer "ATG").

    Step 3. Eliminate the duplicate (i-1)-mers.

    Step 4. Draw a directed graph G in which every vertex p represents an (i-1)-mer of the spectrum.

    Step 5. Join two vertices p1 and p2 by a directed edge from p1 to p2 if there is an i-mer whose first (i-1) characters coincide with p1 and whose last (i-1) characters coincide with p2.

    Step 6. Find a path (Eulerian path) that starts from the node whose in-degree is 0, covers every edge exactly once, and finishes on the node whose out-degree is 0.

    Step 7. Overlap the (i-1)-mers represented by the vertices, in the order given by the Eulerian path.

    Step 8. Output: a string s such that Spectrum(s, i) = S.

    Eulerian Path Approach
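The Eulerian steps above can be sketched in the same way. This is a hedged illustration, not the presenters' code: the name `eulerian_sequences` is an assumption, and the search enumerates all Eulerian paths by backtracking so that every candidate sequence is reported, rather than using a linear-time method that returns a single path.

```python
from collections import defaultdict


def eulerian_sequences(spectrum):
    """Return every string obtained by using each i-mer (edge) exactly once."""
    kmers = sorted(set(spectrum))
    out_edges = defaultdict(list)  # (i-1)-mer -> list of (edge id, next (i-1)-mer)
    in_deg = defaultdict(int)
    for idx, kmer in enumerate(kmers):
        out_edges[kmer[:-1]].append((idx, kmer[1:]))
        in_deg[kmer[1:]] += 1
    nodes = set(out_edges) | set(in_deg)
    # Start from in-degree-0 vertices; if there are none, try all (assumption).
    starts = [v for v in nodes if in_deg.get(v, 0) == 0] or list(nodes)
    used = [False] * len(kmers)
    results = set()

    def walk(node, seq, count):
        if count == len(kmers):
            results.add(seq)
            return
        for idx, nxt in out_edges.get(node, []):
            if not used[idx]:
                used[idx] = True
                # Moving along an edge appends the last letter of the next vertex.
                walk(nxt, seq + nxt[-1], count + 1)
                used[idx] = False

    for s in sorted(starts):
        walk(s, s, 0)
    return results


spectrum = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(sorted(eulerian_sequences(spectrum)))
```

For the input spectrum of this section, the two results correspond to paths E1 and E2.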

  • Data Compression

    Lossy Compression

    A lossy data compression method is one where compressing data and then decompressing it retrieves data that may well be different from the original, but is "close enough" to be useful in some way.

    Lossless Compression

    Lossless data compression makes use of algorithms that allow the exact original data to be reconstructed from the compressed data.

  • Here we use 2 types of Lossless Data Compression:-

    1. Huffman coding

    2. Lempel-Ziv-Welch (LZW) coding

    Lossless Data Compression

  • Example 1:-

    "happy hip hop"

    Frequency of p = 4

    Frequency of h = 3

    Frequency of space = 2

    Frequency of a = 1

    Frequency of i = 1

    Frequency of o = 1

    Frequency of y = 1

    Huffman Data Compression

  • Example 1:- "happy hip hop" (tree construction, one merge per slide)

    1. Merge o (1) and y (1) into a subtree of weight 2. Remaining weights: 4, 3, 2, 2, 1, 1

    2. Merge a (1) and i (1) into a subtree of weight 2. Remaining weights: 4, 3, 2, 2, 2

    3. Merge space (2) and the o-y subtree (2) into a subtree of weight 4. Remaining weights: 4, 4, 3, 2

    4. Merge h (3) and the a-i subtree (2) into a subtree of weight 5. Remaining weights: 5, 4, 4

    5. Merge p (4) and the space-o-y subtree (4) into a subtree of weight 8. Remaining weights: 8, 5

    6. Merge the two remaining subtrees into the root of weight 13.

    Huffman Data Compression

  • Algorithm:

    Step 1. Create a collection of singleton trees, one for each character, with weight equal to the character frequency.

    Step 2. From the collection, pick out the two trees with the smallest weights and remove them. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two trees and with the two trees as its left and right subtrees.

    Step 3. Add the new combined tree back into the collection.

    Step 4. Repeat steps 2 and 3 until there is only one tree left.

    Step 5. The remaining node is the root of the optimal encoding tree.

    Huffman Data Compression
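The five steps above map directly onto a priority queue. The following is a minimal sketch using Python's standard `heapq` module, not the presenters' implementation; the function name `huffman_codes` and the tie-breaking counter are assumptions, so the exact bit patterns may differ from the slides even though the total encoded length is the same.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a Huffman code table {character: bit string} for text."""
    tie = count()  # unique tie-breaker so trees themselves are never compared
    # Step 1: one singleton tree per character, weighted by frequency.
    heap = [(w, next(tie), ch) for ch, w in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Steps 2-4: repeatedly combine the two lightest trees.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (left, right)))
    # Step 5: the remaining tree is the encoding tree; read codes off it.
    codes = {}

    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    assign(heap[0][2], "")
    return codes


text = "happy hip hop"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)
print(len(encoded))  # 34
```

For "happy hip hop" any Huffman tree gives a 34-bit encoding, regardless of how ties between equal weights are broken.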

  • Example 2: via ASCII Encoding

    "happy hip hop"

    char     ASCII    bit pattern (binary)
    h        104      01101000
    a        97       01100001
    p        112      01110000
    y        121      01111001
    i        105      01101001
    o        111      01101111
    space    32       00100000

    Encoded string in ASCII: 104 97 112 112 121 32 104 105 112 32 104 111 112

    01101000 01100001 01110000 01110000 01111001 00100000 01101000

    01101001 01110000 00100000 01101000 01101111 01110000

    Huffman Data Compression

  • Example 3:- Fixed Length Encoding

    "happy hip hop"

    char     Number   bit pattern (binary)
    h        0        000
    a        1        001
    p        2        010
    y        3        011
    i        4        100
    o        5        101
    space    6        110

    String will be encoded as:- 0 1 2 2 3 6 0 4 2 6 0 5 2

    000 001 010 010 011 110 000 100 010 110

    000 101 010

    Huffman Data Compression

  • Example 4:- Variable Length Encoding

    "happy hip hop"

    char     bit pattern (binary)
    h        01
    a        000
    p        10
    y        1111
    i        001
    o        1110
    space    110

    01 000 10 10 1111 110 01 001 10 110

    01 1110 10

    Huffman Data Compression
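To compare the three encodings of Examples 2 to 4, the variable-length table above can be applied directly. This is a small sketch; the names `CODES`, `encode`, and `decode` are assumptions introduced for illustration, and the decoder relies on the table being prefix-free (no code is a prefix of another).

```python
# Variable-length code table copied from Example 4.
CODES = {"h": "01", "a": "000", "p": "10", "y": "1111",
         "i": "001", "o": "1110", " ": "110"}


def encode(text, codes):
    """Concatenate the bit pattern of every character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits left to right, emitting a character whenever a code matches."""
    inverse = {pattern: ch for ch, pattern in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "happy hip hop"
bits = encode(text, CODES)
print(len(text) * 8, len(text) * 3, len(bits))  # 104 39 34
print(decode(bits, CODES))                      # happy hip hop
```

The 13-character string takes 104 bits in 8-bit ASCII, 39 bits with the 3-bit fixed-length code, and 34 bits with the variable-length code.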

  • Example 5:-

    "ATOZOFCATOZOFCATOZOFC#"

    Character read   String stored / retrieved   Process in table   In file
    A                                            Store
    T                AT                          Store              Store
    O                TO                          Store              Store
    Z                OZ                          Store              Store
    O                ZO                          Store              Store
    F                OF                          Store              Store
    C                FC                          Store              Store
    A                CA                          Store              -
    T                AT                          Retrieve           Store relevant code
    O                ATO                         Store              -
    Z                OZ                          Retrieve           -
    O                OZO                         Store              Store relevant code
    F                OF                          Retrieve           -
    C                OFC                         Store              -
    A                CA                          Retrieve           Store relevant code

    LZW Data Compression

  • Example 5:-

    "ATOZOFCATOZOFCATOZOFC#"

    Character read   String stored / retrieved   Process in table   In file
    T                CAT                         Store              -
    O                TO                          Retrieve           Store relevant code
    Z                TOZ                         Store              -
    O                ZO                          Retrieve           Store relevant code
    F                ZOF                         Store              -
    C                FC                          Retrieve           Store relevant code

    LZW Data Compression

  • Example 5:-

    Compressed bytes (in hex)   Strings given after converting from 12-bit format to 8-bit format
    04 10 54                    A, T
    04 F0 5A                    O, Z
    04 F0 46                    O, F
    04 31 00                    C, AT
    10 21 04                    OZ, OF
    10 61 01                    CA, TO
    10 31 05                    ZO, FC

    LZW Data Compression
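The dictionary growth traced in Example 5 can be reproduced with a textbook LZW encoder. This is a sketch under stated assumptions rather than the presenters' program: the table is pre-loaded with the 256 single-byte characters, new strings get codes from 256 (hex 100) upward, the trailing "#" is treated as an end marker and left out of the input, and the hypothetical helper `pack12` packs each pair of 12-bit codes into three bytes, high bits first.

```python
def lzw_encode(text):
    """Textbook LZW: emit the code of the longest string already in the table."""
    table = {chr(i): i for i in range(256)}  # assumed initial dictionary
    next_code = 256
    current = ""
    output = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            output.append(table[current])  # write code of the longest match
            table[candidate] = next_code   # store the new string in the table
            next_code += 1
            current = ch
    if current:
        output.append(table[current])
    return output


def pack12(codes):
    """Pack pairs of 12-bit codes into 3 bytes each (codes must be < 4096)."""
    packed = bytearray()
    for i in range(0, len(codes) - 1, 2):
        packed += ((codes[i] << 12) | codes[i + 1]).to_bytes(3, "big")
    if len(codes) % 2:  # odd leftover code: pad with four zero bits (assumption)
        packed += (codes[-1] << 4).to_bytes(2, "big")
    return bytes(packed)


codes = lzw_encode("ATOZOFCATOZOFCATOZOFC")
print(codes)
print(pack12(codes).hex())
```

For this input the encoder emits the codes for A, T, O, Z, O, F, C, AT, OZ, OF, CA, TO, ZO, FC, which is the same order of strings listed in the table above.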

  • Encoding Algorithm

    Step 1. Initialize the dictionary to a known value.

    Step 2. Read an uncoded string that is the length of the maximum allowable match.

    Step 3. Search for the longest matching string in the dictionary.

    Step 4. If a match is found that is greater than or equal to the minimum allowable match length:

    A) Write the encoded flag, then the offset and length, to the encoded output.

    B) Otherwise, write the uncoded flag and the first uncoded symbol to the encoded output.

    Step 5. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary.

    Step 6. Read a number of symbols from the uncoded input equal to the number of symbols written in Step 4.

    Step 7. Repeat from Step 3 until the entire input has been encoded.

    LZW Data Compression

  • Decoding Algorithm

    Step 1. Initialize the dictionary to a known value.

    Step 2. Read the encoded/not encoded flag.

    Step 3. If the flag indicates an encoded string:

    A) Read the encoded length and offset, then copy the specified number of symbols from the dictionary to the decoded output.

    B) Otherwise, read the next character and write it to the decoded output.

    Step 4. Shift a copy of the symbols written to the decoded output into the dictionary.

    Step 5. Repeat from Step 2 until the entire input has been decoded.

    LZW Data Compression
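The flag, offset, and length steps in the encoding and decoding algorithms above read like a sliding-window scheme in the LZ77/LZSS style, which differs from the code-table LZW traced in Example 5. The sketch below follows those steps literally. It is an illustration under assumptions: the names `lz_encode` and `lz_decode`, the window size, and the match limits are all hypothetical, the dictionary is simply the text already processed (it starts empty), and tokens are kept as Python tuples instead of packed bits.

```python
def lz_encode(data, window=255, min_match=3, max_match=18):
    """Emit (1, offset, length) for matches and (0, symbol) for literals."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len, best_off = 0, 0
        # Search the already-processed window for the longest match.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while (length < max_match and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= min_match:
            tokens.append((1, best_off, best_len))  # encoded flag
            pos += best_len
        else:
            tokens.append((0, data[pos]))           # uncoded flag
            pos += 1
    return tokens


def lz_decode(tokens):
    """Rebuild the text by copying from the output produced so far."""
    out = []
    for tok in tokens:
        if tok[0] == 1:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])  # symbol-by-symbol copy allows overlap
        else:
            out.append(tok[1])
    return "".join(out)


text = "ATOZOFCATOZOFCATOZOFC"
tokens = lz_encode(text)
print(tokens)
print(lz_decode(tokens) == text)  # True
```

On the Example 5 input this produces seven literal tokens followed by one match token that copies 14 symbols from 7 positions back.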

  • Hamiltonian Approach of SBH

    Experimental Results

  • Experimental Results

  • Experimental Results

  • Eulerian Approach of SBH

    Experimental Results

  • Experimental Results

  • Huffman Coding

    Experimental Results

  • LZW Coding

    Experimental Results

    Output

    Input

  • Future Work & Limitation

    I. Selecting the correct sequence when more than one sequence is produced in SBH, which requires extensive biological knowledge.

    II. Showing the graph in the output & giving better graphics.

    III. Refactoring and optimizing all the code.

    IV. Representing the code table in the output along with the output stream.

    V. Integrating both concepts so that a single input can be used to find the DNA sequence, compress it & store it.

  • Conclusion

    I. Finding the DNA sequence using the Hamiltonian & Eulerian approaches, which are low cost & high throughput methods and can be very helpful for bioinformatics.

    II. Compressing data using the Huffman & LZW methods, thereby reducing the storage & transmission cost.

    III. Implementation of the Hamiltonian & Eulerian approaches in a new way.
