approaches for analyzing biological sequences
TRANSCRIPT
-
8/3/2019 Approaches for Analyzing Biological Sequences
New Approaches for Analyzing Biological Sequences
Prof. Rushen Chahal
-
Contributions
Developed methods for:
Identifying new genes
Constructing evolutionary trees
Comparing phylogenetic solution space
-
Biological Sequences
DNA (gene) -> RNA -> protein
cgttaacaaagc... MAEKPKLH...
-
Main Tasks
Finding new genes:
Polymerase Chain Reaction (PCR)
Cloning
Genomic sequencing
Determining gene functions:
Lab work
Related genes (homology)
Genome comparison
-
Evolutionary Trees
[Figure: evolutionary tree drawn along a time axis]
-
Putting it together
[Figure: diagram with the labels "new sequences", "database", "related sequences", "relationships", "Evolution", and a "?"]
-
Primer Selection for Polymerase Chain Reactions (PCR)
Goal:
Discover previously unknown genes
Strategy:
Design PCR primers for a large set of known gene family members
Unknown genes will (hopefully) be amplified
-
Gene Family
[Figure: tree of a gene family with leaves herpesEC, crnvHH2, cmvHH3, humIL8, bovLOR1, RBS11, humSSR1, musdelto, humC5a, ratBK2, humTHR, ratRTA, humMRG, humMAS, humfMLF, ratANG, ratG10d, dogRDC1, humRSC, chkGPCR, gpPAF, musP2u, chkP2y, ratODOR, ratLH, bovOP, humEDG1, ratCGPCR, ratPOT, humACTH, humMSH, musEP3, humTXA2, musEP2, ratCCKA, dogCCKB, dogAd1, humD2, humA2a, hamA1a, hamB2, hum5HT1a, ratD1, bovH1, humM1, ratNPYY1, ratNK1, flyNK, flyNPY, musGIR, ratNTR, musTRH, musGnRH, ratV1a, bovETA, musGRP]
-
Polymerase Chain Reaction
-
Primers
Common region => common primer
-
Primer Group
[Figure: the gene family tree from the "Gene Family" slide, with the same leaf labels and three "?" markers]
-
Primer Selection Problem
Optimal Primer Selection Problem:
input: set of DNA sequences
output: optimal set of primers
Theorem: NP-complete
Proof: reduction from set cover
-
Approaches
Exact algorithms:
exhaustive brute-force
branch-and-bound
Provably-good heuristics:
solution quality: log(# sequences) × OPT
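A minimal sketch of one provably-good heuristic of this kind, assuming the standard greedy set-cover rule (the slides do not say which heuristic was used). It also simplifies a primer to an exact length-k substring; the function name `greedy_primer_cover` and the parameter `k` are illustrative, not taken from the talk.

```python
from collections import defaultdict


def greedy_primer_cover(sequences, k):
    """Greedy set cover: repeatedly pick the length-k substring (candidate
    primer) that occurs in the most still-uncovered sequences."""
    # Map each candidate primer to the indices of the sequences containing it.
    covers = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            covers[seq[start:start + k]].add(idx)

    # Sequences shorter than k have no candidate primer and are skipped.
    uncovered = {i for i, s in enumerate(sequences) if len(s) >= k}
    chosen = []
    while uncovered:
        # Most newly covered sequences first; ties broken alphabetically
        # so that the result is deterministic.
        best = min(covers, key=lambda p: (-len(covers[p] & uncovered), p))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen
```

For example, `greedy_primer_cover(["ACGTAC", "TTACGT", "GGGACG", "CCCTTT"], 3)` returns `["ACG", "CCC"]`: "ACG" covers the first three sequences and "CCC" covers the last. The greedy rule for set cover is known to stay within a logarithmic factor of the optimum, which matches the log(# sequences) bound quoted above.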
-
Extension: Inexact Primers
Goal: Optimize mismatches & #primers
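To make the mismatch side of this tradeoff concrete, here is a small sketch that assumes mismatches are counted as Hamming distance between a primer and the best-matching window of a sequence. That scoring model and the names `min_mismatches` and `sequences_within` are assumptions for illustration; the slides do not define how mismatches are measured.

```python
def min_mismatches(primer, sequence):
    """Smallest Hamming distance between the primer and any window of the
    sequence with the same length; None if the sequence is too short."""
    k = len(primer)
    if len(sequence) < k:
        return None
    return min(
        sum(a != b for a, b in zip(primer, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    )


def sequences_within(primer, sequences, max_mismatches):
    """Indices of the sequences the primer matches with at most
    max_mismatches mismatched positions."""
    hits = set()
    for idx, seq in enumerate(sequences):
        d = min_mismatches(primer, seq)
        if d is not None and d <= max_mismatches:
            hits.add(idx)
    return hits
```

Raising `max_mismatches` lets one primer cover more sequences, so fewer primers are needed, at the price of weaker matches; that is the two-objective tension the goal above names.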
-
The Big Picture
[Figure: overview diagram labelled "Evolution", with a "?"]
Identifying new genes using PCR
-
Evolutionary Tree Reconstruction
NP-complete [Foulds & Graham 1982, Day 1987]
Optimality Criteria:
Least-Squares
Minimum-Evolution
Maximum-Parsimony
Maximum-Likelihood
[Figure label: tree cost]
-
Previous Approaches
Fitch-Margoliash [1967]
Neighbor-Joining [1987]
Quartet-Puzzling [1997]
Split-Decomposition [1995]
PAUP [1998], PHYLIP [1993]
All use greedy & target best solution
-
However...
Topologically distant solutions may exist
Detect diverse low cost solutions
[Figure: three alternative trees with costs 0.2562, 0.2560, and 0.2561]
-
Random Starting Trees + Heuristics?
[Maddison 1991, Penny 1995, Swofford 1997]
-
Neighbor-Joining Method
[Figure: neighbor-joining illustrated step by step on five leaves, numbered 1 to 5]
-
Generalized Neighbor-Joining
[Figure: generalized neighbor-joining on five leaves, numbered 1 to 5, showing several alternative trees]
-
Generalized Neighbor-Joining
Controlling solution space sampling:
K: max # partial solutions maintained
Q (quality): # candidates selected for low cost
D (diversity): # candidates selected for variety
Tradeoff quality & topological diversity:
K = Q + D
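One way the K = Q + D split could be realized is sketched below. The Q cheapest candidates are kept first; the D diversity picks then use a farthest-point rule, which is an assumption made here for illustration, since the slides only say that D candidates are "selected for variety". The `cost` and `distance` callables are supplied by the caller and are hypothetical names.

```python
def select_candidates(candidates, cost, distance, q, d):
    """Keep q candidates for low cost, then up to d more for variety,
    so that at most K = q + d candidates survive."""
    ranked = sorted(candidates, key=cost)
    selected = ranked[:q]
    remaining = ranked[q:]
    for _ in range(d):
        if not remaining:
            break
        if selected:
            # Farthest-point rule: take the candidate whose nearest
            # already-selected neighbour is as far away as possible.
            # Ties go to the cheaper candidate because `remaining`
            # is still sorted by cost.
            pick = max(
                remaining,
                key=lambda c: min(distance(c, s) for s in selected),
            )
        else:
            pick = remaining[0]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

With trees represented as sets of splits, `distance` could be the size of the symmetric difference of two split sets. Setting `d = 0` gives a purely cost-driven selection, while `q = 0` selects almost entirely for variety.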
-
GNJ Performance (8 leaves)
[Figure: two panels plotted against least-squares cost (log scale, 0.001 to 1): number of solutions (log scale, 1 to 100) and topological distance (0 to 6). Series: exhaustive search and GNJ with (Q, D) = (50, 0), (45, 5), (25, 25), (5, 45), (0, 50)]
-
Solution Cost (16 leaves)
[Figure: solution cost (log scale, 10^-2 down to 10^-8) by method, for LS and ME at K=1, K=20, K=100]
-
Solution Diversity (16 leaves)
[Figure: topological distance (0 to 12) by method, for the series LS-max, ME-max, LS-ave, ME-ave at K=1, K=20, K=100]
-
GNJ Time Complexity
Generate candidates: O(K N^2)
Select candidates: O(K N^2 (lg K + lg N))
Overall: O(K N^3 (lg K + lg N))
GNJ run times (seconds):
K     N=8    N=16   N=32
20    0.08   0.8    9.8
50    0.2    2.1    25.1
100   0.5    4.4    52.1
200   1.1    8.8    103.7
500   3.1    24.2   262.7
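The per-step bounds can be combined into the cubic bound as follows. This is a sketch of one plausible reading, assuming that selection is dominated by sorting the candidates and that resolving a star tree on N leaves takes N - 3 joins; neither assumption is stated on the slide.

```latex
\begin{align*}
\text{candidates per step} &\le K \binom{N}{2} = O(K N^2) \\
\text{selection per step}  &= O\big(K N^2 \lg(K N^2)\big)
                            = O\big(K N^2 (\lg K + \lg N)\big) \\
\text{number of steps}     &= N - 3 = O(N) \\
\text{total}               &= O(N) \cdot O\big(K N^2 (\lg K + \lg N)\big)
                            = O\big(K N^3 (\lg K + \lg N)\big)
\end{align*}
```

The measured times are roughly in line with this: at K = 20, doubling N from 16 to 32 multiplies the run time by about 12 (0.8 s to 9.8 s), compared with the factor of 8 from the cubic term alone plus the growth of the logarithmic factor.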
-
Summary
Evolution
Identifying new genes using PCR
Detecting low-cost diverse trees
-
Refereed Publications
Pearson, W. R., Robins, G., and Zhang, T., Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction, to appear in Journal of Molecular Biology and Evolution.
Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., On the Primer Selection Problem for Polymerase Chain Reaction Experiments, Discrete and Applied Mathematics, Vol. 71, 1996, pp. 231-246.
Pearson, W. R., Robins, G., Wrege, D. E., and Zhang, T., A New Approach to Primer Selection in Polymerase Chain Reaction Experiments, Proc. International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, July, 1995, pp. 285-291.
-
Refereed Publications (cont.)
Griffith, J., Robins, G., Salowe, J. S., and Zhang, T., Closing the Gap: Near-Optimal Steiner Trees in Polynomial Time, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 13, No. 11, November 1994, pp. 1351-1365.
Barrera, T., Griffith, J., McKee, S. A., Robins, G., and Zhang, T., Toward a Steiner Engine: Enhanced Serial and Parallel Implementations of the Iterated 1-Steiner MRST Algorithm, Proc. Great Lakes Symposium on VLSI, Kalamazoo, MI, March 1993, pp. 90-94.
Barrera, T., Griffith, J., Robins, G., and Zhang, T., Narrowing the Gap: Near-Optimal Steiner Trees in Polynomial Time, Proc. IEEE International ASIC Conference, Rochester, September 1993, pp. 87-90.
-
Generalization
Generate Partial Solutions
Evaluate Partial Solutions
Select Partial Solutions
n-i & i
Parsimony, Least-Squares
Prefer distant trees
-
Future Work
Generalize GNJ:
other optimality criteria
other solution space sampling
alternative topological distance metrics
Examine solution space using GNJ
Identify pathological data sets
-
Generalized Neighbor-Joining
Input: a set of leaves S, the distance matrix over S
Output: a set of possible phylogenetic trees for S
1. T = {t}, where t is the star-tree over S
2. Repeat
     T* <- All next-step trees derived from T
     T <- Select up to K trees from T*
   Until (all trees in T are fully resolved)
3. Output T
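The loop above can be sketched as a beam search over neighbor-joining steps. This is an illustrative reading under explicit assumptions, not the authors' implementation: candidates are ranked by the classical neighbor-joining criterion of their most recent join (the talk evaluates trees with criteria such as least-squares), selection is by score only with no diversity quota, and leaves are numbered 0 to n-1. The names `gnj`, `next_step_trees`, and `pair` are hypothetical.

```python
from itertools import combinations


def pair(a, b):
    """Unordered key for the distance between two active nodes."""
    return frozenset((a, b))


def next_step_trees(state, all_leaves):
    """Yield (score, new_state) for every way of joining two active nodes."""
    nodes, dist, splits = state
    n = len(nodes)
    total = {a: sum(dist[pair(a, b)] for b in nodes if b != a) for a in nodes}
    for a, b in combinations(nodes, 2):
        d_ab = dist[pair(a, b)]
        # Classical neighbor-joining criterion (lower is better).
        score = (n - 2) * d_ab - total[a] - total[b]
        merged = a | b
        rest = [c for c in nodes if c != a and c != b]
        new_dist = {pair(x, y): dist[pair(x, y)]
                    for x, y in combinations(rest, 2)}
        for c in rest:
            new_dist[pair(merged, c)] = (
                dist[pair(a, c)] + dist[pair(b, c)] - d_ab) / 2
        # Store the new internal edge as the side of the split without
        # leaf 0, so one unrooted topology always gets the same key.
        split = all_leaves - merged if 0 in merged else merged
        yield score, (rest + [merged], new_dist, splits | {split})


def gnj(d, k):
    """Return up to k topologies (sets of splits) for the symmetric
    distance matrix d."""
    n = len(d)
    all_leaves = frozenset(range(n))
    nodes = [frozenset([i]) for i in range(n)]
    dist = {pair(nodes[i], nodes[j]): d[i][j]
            for i, j in combinations(range(n), 2)}
    beam = [(nodes, dist, frozenset())]        # T = {t}, the star tree
    while len(beam[0][0]) > 3:                 # until fully resolved
        best = {}
        for state in beam:                     # T* = all next-step trees
            for score, nxt in next_step_trees(state, all_leaves):
                key = nxt[2]
                if key not in best or score < best[key][0]:
                    best[key] = (score, nxt)   # drop duplicate topologies
        ranked = sorted(best.values(), key=lambda item: item[0])
        beam = [nxt for _, nxt in ranked[:k]]  # select up to K trees
    return [state[2] for state in beam]
```

With `k = 1` the sketch reduces to ordinary neighbor-joining. Larger `k` keeps several partial trees alive at each step, which is the mechanism that lets the method return more than one candidate topology.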