approaches for analyzing biological sequences

Upload: dr-singh

Post on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    1/33

    New Approaches for Analyzing Biological Sequences

    Prof. Rushen Chahal

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    2/33

    Contributions

    Developed methods for:
    Identifying new genes
    Constructing evolutionary trees
    Comparing phylogenetic solution space

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    3/33

    Biological Sequences

    DNA (gene): cgttaacaaagc...
    RNA
    protein: MAEKPKLH...

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    4/33

    Main Tasks

    Finding new genes:
      Polymerase Chain Reaction (PCR)
      Cloning
      Genomic sequencing

    Determining gene functions:
      Lab work
      Related genes (homology)
      Genome comparison

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    5/33

    Evolutionary Trees

    [Figure: evolutionary tree drawn along a time axis]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    6/33

    Putting it together

    [Diagram labels: new sequences, database, related sequences, relationships, Evolution (marked "?")]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    7/33

    Primer Selection for Polymerase Chain Reactions (PCR)

    Goal:
      Discover previously unknown genes

    Strategy:
      Design PCR primers for large set of known gene family members
      Unknown genes will (hopefully) be amplified

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    8/33

    Gene Family

    [Figure: gene family members]
    herpesEC, crnvHH2, cmvHH3, humIL8, bovLOR1, RBS11, humSSR1, musdelto, humC5a, ratBK2, humTHR, ratRTA, humMRG, humMAS, humfMLF, ratANG, ratG10d, dogRDC1, humRSC, chkGPCR, gpPAF, musP2u, chkP2y, ratODOR, ratLH, bovOP, humEDG1, ratCGPCR, ratPOT, humACTH, humMSH, musEP3, humTXA2, musEP2, ratCCKA, dogCCKB, dogAd1, humD2, humA2a, hamA1a, hamB2, hum5HT1a, ratD1, bovH1, humM1, ratNPYY1, ratNK1, flyNK, flyNPY, musGIR, ratNTR, musTRH, musGnRH, ratV1a, bovETA, musGRP

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    9/33

    Polymerase Chain Reaction

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    10/33

    Primers

    Common region => common primer

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    11/33

    Primer Group

    [Figure: the same gene family members as on slide 8/33, with three "?" marks]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    12/33

    Primer Selection Problem

    Optimal Primer Selection Problem:
      input: set of DNA sequences
      output: optimal set of primers

    Theorem: NP-complete
    Proof: reduction from set cover

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    13/33

    Approaches

    Exact algorithms:
      exhaustive brute-force
      branch-and-bound

    Provably-good heuristics:
      solution quality: within a factor of log(# sequences) of OPT
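The slide does not say which heuristic achieves the logarithmic bound. Since the hardness proof reduces from set cover, one natural candidate is the classic greedy set-cover rule, which is known to stay within a factor of about ln(n) + 1 of the optimum. The sketch below is only an illustration under that assumption: the function name `greedy_primers`, the fixed primer length `k`, and the rule "a primer covers a sequence if it occurs in it as an exact substring" are all hypothetical simplifications, not the authors' method.

```python
def greedy_primers(sequences, k):
    """Pick k-mers greedily until every sequence contains a chosen k-mer.

    Assumption: a primer "covers" a sequence when it is an exact substring.
    """
    # Map each candidate k-mer to the set of sequence indices containing it.
    covers = {}
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            covers.setdefault(seq[start:start + k], set()).add(idx)

    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered:
        # Greedy rule: take the k-mer covering the most uncovered sequences.
        # Sorting the keys makes tie-breaking deterministic.
        best = max(sorted(covers),
                   key=lambda p: len(covers[p] & uncovered),
                   default=None)
        if best is None or not (covers[best] & uncovered):
            break  # leftover sequences are shorter than k
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


if __name__ == "__main__":
    family = ["aacgtt", "ccacgg", "ttacga", "ggggtt"]
    print(greedy_primers(family, 3))
```

In the toy input, one 3-mer is shared by the first three sequences, so the greedy rule picks it first and then needs one more primer for the last sequence.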

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    14/33

    Extension: Inexact Primers

    Goal: Optimize mismatches & #primers
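To make "inexact" concrete, a primer can be allowed to match a sequence with a bounded number of mismatches. The sketch below is a minimal illustration, assuming mismatches are counted as Hamming distance over a same-length window and ignoring strand complementarity and position-specific effects. The function name `covers_inexact` and the parameter `max_mismatches` are hypothetical and do not come from the slides.

```python
def covers_inexact(primer, sequence, max_mismatches):
    """Return True if some window of `sequence` differs from `primer`
    in at most `max_mismatches` positions (Hamming distance)."""
    k = len(primer)
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        mismatches = sum(a != b for a, b in zip(primer, window))
        if mismatches <= max_mismatches:
            return True
    return False
```

Raising `max_mismatches` lets each primer cover more sequences, so fewer primers are needed. That is the trade-off between mismatches and number of primers named in the goal.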

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    15/33

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    16/33

    The Big Picture

    [Diagram: "Evolution", marked with "?"]

    Identifying new genes using PCR

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    17/33

    Evolutionary Tree Reconstruction

    NP-complete [Foulds & Graham 1982, Day 1987]

    Optimality Criteria:
      Least-Squares
      Minimum-Evolution
      Maximum-Parsimony
      Maximum-Likelihood

    [Figure label: tree cost]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    18/33

    Previous Approaches

    Fitch-Margoliash [1967]
    Neighbor-Joining [1987]
    Quartet-Puzzling [1997]
    Split-Decomposition [1995]
    PAUP [1998], PHYLIP [1993]

    All use greedy & target best solution

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    19/33

    However...

    Topologically distant solutions may exist

    Detect diverse low cost solutions

    [Figure: three trees over leaves 1-6 with costs 0.2562, 0.2560, and 0.2561; the label "33" appears three times between them]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    20/33

    [Maddison 1991, Penny 1995, Swofford 1997]

    Random Starting Trees + Heuristics?

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    21/33

    Neighbor-Joining Method

    [Figure: neighbor-joining on leaves 1-5, joining one pair of neighbors per step]
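For readers who have only the figure placeholder, the following is a minimal sketch of the standard neighbor-joining procedure: start from a star tree, repeatedly join the pair (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all current nodes, and replace the pair with a new node. The representation (nested tuples, no branch lengths) and the function name `neighbor_joining` are choices made for this illustration and are not taken from the slides.

```python
from itertools import combinations


def neighbor_joining(dist, names):
    """Build an unrooted tree topology as nested tuples using classic NJ.

    dist  -- symmetric distance matrix (list of lists)
    names -- leaf labels, in the same order as the matrix rows
    """
    # Active nodes: id -> subtree built so far.
    nodes = {i: names[i] for i in range(len(names))}
    d = {i: {j: dist[i][j] for j in nodes} for i in nodes}
    next_id = len(names)

    while len(nodes) > 3:
        n = len(nodes)
        # r[i]: total distance from node i to every active node.
        r = {i: sum(d[i][k] for k in nodes) for i in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(combinations(sorted(nodes), 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = next_id
        next_id += 1
        # Distances from the new internal node u to the remaining nodes.
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        joined = (nodes[i], nodes[j])
        del nodes[i], nodes[j]
        nodes[u] = joined

    # The last three subtrees hang off the central node of the unrooted tree.
    return tuple(nodes[k] for k in sorted(nodes))


if __name__ == "__main__":
    matrix = [[0, 5, 9, 9, 8],
              [5, 0, 10, 10, 9],
              [9, 10, 0, 8, 7],
              [9, 10, 8, 0, 3],
              [8, 9, 7, 3, 0]]
    print(neighbor_joining(matrix, ["a", "b", "c", "d", "e"]))
```

Because only the single best pair is kept at each step, this procedure returns exactly one tree.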

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    22/33

    Generalized Neighbor-Joining

    [Figure: generalized neighbor-joining on leaves 1-5, with several alternative partial trees kept at each step]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    23/33

    Generalized Neighbor-Joining

    Controlling solution space sampling:
      K: max # partial solutions maintained
      Q (quality): # candidates selected for low cost
      D (diversity): # candidates selected for variety

    Tradeoff quality & topological diversity:
      K = Q + D
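The split K = Q + D can be illustrated with a small selection routine. The slide does not say how the D "variety" candidates are chosen, so the sketch below assumes a greedy farthest-point rule: after taking the Q cheapest candidates, it repeatedly adds the candidate whose minimum distance to the already chosen ones is largest. The names `select_candidates`, `cost`, and `distance` are hypothetical placeholders for whatever tree cost and topological distance are in use.

```python
def select_candidates(candidates, cost, distance, q, d):
    """Keep up to q + d candidates: q by lowest cost, then d for diversity.

    Assumption: diversity is chosen greedily by max-min distance to the
    candidates already kept (the slides do not specify the rule).
    """
    ranked = sorted(candidates, key=cost)
    chosen = ranked[:q]          # Q: the lowest-cost candidates
    rest = ranked[q:]
    for _ in range(min(d, len(rest))):
        # D: the candidate farthest from everything kept so far.
        pick = max(rest,
                   key=lambda c: min((distance(c, s) for s in chosen),
                                     default=0.0))
        chosen.append(pick)
        rest.remove(pick)
    return chosen
```

With D = 0 this reduces to keeping the K cheapest partial solutions, and with Q = 0 it selects purely for spread.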

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    24/33

    GNJ Performance (8 leaves)

    [Figure: two panels plotted against least-squares cost (log scale, 0.001 to 1): number of solutions (log scale, 1 to 100) and topological distance (0 to 6). Series: exhaustive search and GNJ with (Q, D) = (50, 0), (45, 5), (25, 25), (5, 45), (0, 50).]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    25/33

    Solution Cost (16 leaves)

    [Figure: solution cost (log scale, 10^-8 to 10^-2) for LS and ME, by method: K=1, K=20, K=100]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    26/33

    Solution Diversity (16 leaves)

    [Figure: topological distance (0 to 12) for LS-max, ME-max, LS-ave, ME-ave, by method: K=1, K=20, K=100]

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    27/33

    GNJ Time Complexity

    Generate candidates: O(K N^2)
    Select candidates: O(K N^2 (lg K + lg N))
    Overall: O(K N^3 (lg K + lg N))

    GNJ run times (seconds)

    K      N=8     N=16    N=32
    20     0.08    0.8     9.8
    50     0.2     2.1     25.1
    100    0.5     4.4     52.1
    200    1.1     8.8     103.7
    500    3.1     24.2    262.7
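One way to read these bounds is sketched below. It rests on three assumptions that the slide does not state: each of the K partial trees proposes O(N^2) pairwise joins per step, candidates are selected by sorting, and a star tree over N leaves needs N - 3 joins to become fully resolved.

```latex
\begin{align*}
\text{candidates per step} &= O(K N^2) \\
\text{selection per step} &= O\bigl(K N^2 \lg (K N^2)\bigr)
                           = O\bigl(K N^2 (\lg K + \lg N)\bigr) \\
\text{total over } N - 3 \text{ steps} &= O\bigl(K N^3 (\lg K + \lg N)\bigr)
\end{align*}
```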

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    28/33

    Summary

    [Diagram: "Evolution"]

    Identifying new genes using PCR
    Detecting low-cost diverse trees

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    29/33

    Refereed Publications

    Pearson, W. R., Robins, G., and Zhang, T., Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction, to appear in Journal of Molecular Biology and Evolution.

    Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., On the Primer Selection Problem for Polymerase Chain Reaction Experiments, Discrete and Applied Mathematics, Vol. 71, 1996, pp. 231-246.

    Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., A New Approach to Primer Selection in Polymerase Chain Reaction Experiments, Proc. International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, July, 1995, pp. 285-291.

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    30/33

    Refereed Publications (cont.)

    Griffith, J., Robins, G., Salowe, J. S., and Zhang, T., Closing the Gap: Near-Optimal Steiner Trees in Polynomial Time, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 11, November 1994, pp. 1351-1365.

    Barrera, T., Griffith, J., McKee, S. A., Robins, G., and Zhang, T., Toward a Steiner Engine: Enhanced Serial and Parallel Implementations of the Iterated 1-Steiner MRST Algorithm, Proc. Great Lakes Symposium on VLSI, Kalamazoo, MI, March 1993, pp. 90-94.

    Barrera, T., Griffith, J., Robins, G., and Zhang, T., Narrowing the Gap: Near-Optimal Steiner Trees in Polynomial Time, Proc. IEEE International ASIC Conference, Rochester, September 1993, pp. 87-90.

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    31/33

    Generalization

    Generate Partial Solutions
    Evaluate Partial Solutions
    Select Partial Solutions

    n-i & i
    Parsimony, Least-Squares
    Prefer distant trees

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    32/33

    Future Work

    Generalize GNJ:
      other optimality criteria
      other solution space sampling
      alternative topological distance metrics

    Examine solution space using GNJ

    Identify pathological data sets

  • 8/3/2019 Approaches for Analyzing Biological Sequences

    33/33

    Generalized Neighbor-Joining

    Input: a set of leaves S, the distance matrix over S
    Output: a set of possible phylogenetic trees for S

    1. T = {t}, where t is the star-tree over S
    2. Repeat
         T* ← All next-step trees derived from T
         T ← Select up to K trees from T*
       Until (all trees in T are fully resolved)
    3. Output T
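The three steps above can be sketched as a small beam-style search. This is an illustration only, under several assumptions that are not in the slides: trees are nested tuples whose top level lists the subtrees attached to the central node, a "next-step tree" joins any two of those subtrees, and selection simply keeps the K cheapest candidates under a caller-supplied `cost` function (a real implementation would use a distance-based criterion and the Q/D split). All function names here are hypothetical.

```python
from itertools import combinations


def next_step_trees(tree):
    """All trees obtained by joining one pair of subtrees at the central node."""
    out = []
    for i, j in combinations(range(len(tree)), 2):
        rest = tuple(c for n, c in enumerate(tree) if n not in (i, j))
        out.append(rest + ((tree[i], tree[j]),))
    return out


def is_resolved(tree):
    """An unrooted binary tree has at most three subtrees at its center."""
    return len(tree) <= 3


def gnj(leaves, k, cost):
    trees = [tuple(leaves)]                           # 1. T = {t}, the star tree over S
    while not all(is_resolved(t) for t in trees):     # 2. Repeat ... Until
        candidates = [c for t in trees for c in next_step_trees(t)]
        # Drop exact duplicates (isomorphic trees are not detected here).
        candidates = list(dict.fromkeys(candidates))
        trees = sorted(candidates, key=cost)[:k]      # select up to K trees
    return trees                                      # 3. Output T


if __name__ == "__main__":
    # Placeholder cost: every tree costs the same, so the first K are kept.
    for t in gnj([1, 2, 3, 4, 5], k=3, cost=lambda tree: 0):
        print(t)
```

With k = 1 and a neighbor-joining style cost this collapses to a single greedy path, while larger k keeps several alternative trees alive until all are fully resolved.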