How Do We Compare Biological Sequences?

Dynamic Programming

Phillip Compeau and Pavel Pevzner Bioinformatics Algorithms: An Active Learning Approach

©2015 by Compeau and Pevzner. All rights reserved.

Outline

•  Introduction to Sequence Alignment

•  The Manhattan Tourist Problem

•  Sequence Alignment is the Manhattan Tourist Problem in Disguise

•  An Introduction to Dynamic Programming: The Change Problem

•  The Manhattan Tourist Problem Revisited

•  From Global to Local Alignment

•  Penalizing Insertions and Deletions in Sequence Alignment

•  Multiple Sequence Alignment

Recursive Manhattan Tourist

SouthOrEast(i, j)
    if i = 0 and j = 0
        return 0
    x ← -infinity, y ← -infinity
    if i > 0
        x ← SouthOrEast(i - 1, j) + weight of vertical edge into (i, j)
    if j > 0
        y ← SouthOrEast(i, j - 1) + weight of horizontal edge into (i, j)
    return max(x, y)

Exercise Break: How many times is SouthOrEast(3, 2) called in the computation of SouthOrEast(9, 7)?
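The SouthOrEast recursion translates almost line for line into Python. The sketch below is only an illustration: it assumes a hypothetical grid in which every edge has weight 1 (the slides do not give the weights), and it adds a `calls` counter, which is not part of SouthOrEast, to record how often each node is visited.

```python
from collections import Counter


def south_or_east(i, j, down, right, calls):
    """Length of a longest path from (0, 0) to (i, j), computed recursively.

    down[i - 1][j]  is the weight of the vertical edge (i - 1, j) -> (i, j).
    right[i][j - 1] is the weight of the horizontal edge (i, j - 1) -> (i, j).
    calls           counts how many times each node is visited (illustration only).
    """
    calls[(i, j)] += 1
    if i == 0 and j == 0:
        return 0
    x = y = float("-inf")
    if i > 0:
        x = south_or_east(i - 1, j, down, right, calls) + down[i - 1][j]
    if j > 0:
        y = south_or_east(i, j - 1, down, right, calls) + right[i][j - 1]
    return max(x, y)


# Hypothetical grid reaching node (9, 7); every edge weight is assumed to be 1.
n, m = 9, 7
down = [[1] * (m + 1) for _ in range(n)]
right = [[1] * m for _ in range(n + 1)]

calls = Counter()
longest = south_or_east(n, m, down, right, calls)
print("longest path length:", longest)
print("calls to SouthOrEast(3, 2):", calls[(3, 2)])
print("total recursive calls:", sum(calls.values()))
```

Comparing the total number of calls with the 80 nodes of this grid shows how often the recursion recomputes the same subproblem.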

Dynamic Programming Manhattan

STOP and Think: Which element of the table should we fill in next and what should its value be?

STOP and Think: Do you see a longest path in this grid? What algorithm did you use?

Reconstructing an Optimal Path

STOP and Think: In general, how do we reconstruct this path?

Answer: Start at the ending node and follow edges backwards to the beginning node.
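One way to make the backward walk concrete is to store, for every node, the predecessor that produced its table value while the table is being filled. The sketch below assumes a small 5 x 5 grid of nodes with illustrative edge weights (the weights on the slides did not survive extraction); `manhattan_tourist`, `DOWN`, and `RIGHT` are names chosen for this example.

```python
def manhattan_tourist(down, right):
    """Fill the table row by row; return (longest path length, list of nodes).

    down[i - 1][j]  is the weight of the edge (i - 1, j) -> (i, j).
    right[i][j - 1] is the weight of the edge (i, j - 1) -> (i, j).
    """
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # predecessor of each node

    for i in range(1, n + 1):            # first column: only vertical moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
        back[i][0] = (i - 1, 0)
    for j in range(1, m + 1):            # first row: only horizontal moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
        back[0][j] = (0, j - 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            from_north = s[i - 1][j] + down[i - 1][j]
            from_west = s[i][j - 1] + right[i][j - 1]
            if from_north >= from_west:
                s[i][j], back[i][j] = from_north, (i - 1, j)
            else:
                s[i][j], back[i][j] = from_west, (i, j - 1)

    # Start at the ending node and follow stored edges back to (0, 0).
    path = [(n, m)]
    while back[path[-1][0]][path[-1][1]] is not None:
        r, c = path[-1]
        path.append(back[r][c])
    path.reverse()
    return s[n][m], path


# Illustrative weights (an assumption, not taken from the slides).
DOWN = [[1, 0, 2, 4, 3],
        [4, 6, 5, 2, 1],
        [4, 4, 5, 2, 1],
        [5, 6, 8, 5, 3]]
RIGHT = [[3, 2, 4, 0],
         [3, 2, 4, 2],
         [0, 7, 3, 3],
         [3, 3, 0, 2],
         [1, 3, 2, 2]]

length, path = manhattan_tourist(DOWN, RIGHT)
print(length)  # 34
print(path)
```

Storing one predecessor per node costs as much memory as the table itself, but it makes reconstruction a single backward walk.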

Finding an LCS

[Figure: alignment grid whose columns are labeled by ATCGTCC and whose rows are labeled by ATGTTATA]

Exercise Break: Find a longest common subsequence of ATGTTATA and ATCGTCC.
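A longest common subsequence can be found with the same table-filling idea: diagonal edges for matching symbols have weight 1 and all other edges have weight 0. The following is a minimal sketch of that approach; the function name `longest_common_subsequence` is chosen for this example.

```python
def longest_common_subsequence(v, w):
    """Return one longest common subsequence of the strings v and w."""
    n, m = len(v), len(w)
    # s[i][j] = length of an LCS of v[:i] and w[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])

    # Backtrack from the ending node (n, m) to recover the subsequence.
    lcs = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            lcs.append(v[i - 1])
            i, j = i - 1, j - 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(lcs))


result = longest_common_subsequence("ATGTTATA", "ATCGTCC")
print(result, len(result))
```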

Strengthening Alignment Scoring

0 1 2 2 3 4 5 6 7 8
 A T - G T T A T A
 A T C G T - C - C
0 1 2 3 4 5 5 6 6 7

Alignment score: divided into three components:
•  match reward (+1)
•  mismatch penalty (-μ)
•  insertion/deletion penalty (-σ)
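The three components can be applied column by column to any alignment. The sketch below scores the alignment of ATGTTATA and ATCGTCC shown on this slide; `alignment_score` and its parameter names are assumptions made for this example.

```python
def alignment_score(row_v, row_w, mu, sigma, match=1):
    """Score an alignment given as two equal-length rows that use '-' for gaps."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            score -= sigma      # insertion/deletion penalty
        elif a == b:
            score += match      # match reward
        else:
            score -= mu         # mismatch penalty
    return score


# 4 matches, 2 mismatches, 3 indels
print(alignment_score("AT-GTTATA", "ATCGT-C-C", mu=1, sigma=1))  # -1
print(alignment_score("AT-GTTATA", "ATCGT-C-C", mu=3, sigma=2))  # -8
```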

Global Alignment Problem: Find a highest-scoring alignment of two strings.
•  Input: Two strings.
•  Output: An alignment of the strings with maximum alignment score.

STOP and Think: How can we solve this problem?

[Figure: alignment graph for the strings TGTTA (rows) and TCGT (columns). Diagonal match edges are labeled +1; the remaining edges are labeled with penalties (-σ for indels, -μ for mismatches).]

Answer: A slight modification to the alignment graph ...

Exercise Break: Find a best alignment (σ = 2, μ = 3).
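With weighted edges in the alignment graph, a best alignment is again a longest path, so the table-filling and backtracking steps carry over. The sketch below is one possible implementation under that reading; `global_alignment` is a name chosen here, and the strings TGTTA and TCGT are the ones labeling the alignment graph on this slide.

```python
def global_alignment(v, w, mu, sigma, match=1):
    """Return (score, row_v, row_w) for a highest-scoring global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i            # v[:i] aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -sigma * j            # w[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + diag)

    # Backtrack from (n, m), re-deriving which edge produced each value.
    row_v, row_w = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = match if v[i - 1] == w[j - 1] else -mu
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + diag:
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            row_v.append(v[i - 1]); row_w.append("-")
            i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(row_v)), "".join(reversed(row_w))


score, row_v, row_w = global_alignment("TGTTA", "TCGT", mu=3, sigma=2)
print(score)
print(row_v)
print(row_w)
```

When several alignments share the maximum score, the order of the checks during backtracking decides which one is returned.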

Finding “Local” Similarities

Homeobox genes: regulate embryonic development and are present in a large variety of species.

Limitations of global alignment

Analysis of homeobox genes offers an example of a problem for which global alignment may fail to reveal biologically relevant similarities. These genes regulate embryonic development and are present in a large variety of species, from flies to humans. Homeobox genes are long, and they differ greatly between species, but an approximately 60 amino acid-long region in each gene, called the homeodomain, is highly conserved. For instance, consider the mouse and human homeodomains below.

Mouse  ...ARRSRTHFTKFQTDILIEAFEKNRFPGIVTREKLAQQTGIPESRIHIWFQNRRARHPDPG...
Human  ...ARQKQTFITWTQKNRLVQAFERNPFPDTATRKKLAEQTGLQESRIQMWFQKQRSLYLKKS...

The immediate question is how to find this conserved segment within the much longer genes and ignore the flanking areas, which exhibit little similarity. Global alignment seeks similarities between two strings across their entire length; however, when searching for homeodomains, we are looking for smaller, local regions of similarity and do not need to align the entire strings. For example, the global alignment below has 22 matches, 18 indels, and 2 mismatches, resulting in the score 22 - 18 - 2 = 2 (if σ = μ = 1):

GCC-C-AGTC-TATGT-CAGGGGGCACG--A-GCATGCACA-
GCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-T-CAGAT

However, these sequences can be aligned differently (with 17 matches and 32 indels) based on a highly conserved interval represented by the substrings CAGTCTATGTCAG and CAGTTATGTTCAG:

---G----C-----C--CAGTCTATG-TCAGGGGGCACGAGCATGCACA
GCCGCCGTCGTTTTCAGCAGT-TATGTTCAG-----A------T-----

This alignment has fewer matches and a lower score of 17 - 32 = -15, even though the conserved region of the alignment contributes a score of 12 - 2 = 10, which is hardly an accident.

Figure 5.19 shows the two alignment paths corresponding to these two different alignments. The upper path, corresponding to the second alignment above, loses out because it contains many heavily penalized indels on either side of the diagonal corresponding to the conserved interval. As a result, global alignment outputs the biologically irrelevant lower path.

Homeodomain: area within homeobox genes of shared “local” similarity.

STOP and Think: Will our dynamic programming algorithm find regions of local similarity?

Exercise Break: Score these alignments (σ = μ = 1). Does our scoring function make sense?

Visualizing Local Alignments

FIGURE 5.19 Global and local alignments of two DNA strings that share a highly conserved interval. The relevant alignment that captures this interval (upper path) loses to an irrelevant alignment (lower path), since the former incurs heavy indel penalties.

When biologically significant similarities are present in some parts of sequences v and w and absent from others, biologists attempt to ignore global alignment and instead align substrings of v and w, which yields a local alignment of the two strings. The problem of finding substrings that maximize the global alignment score over all substrings of v and w is called the Local Alignment Problem.

Revisiting Global Alignment

Global Alignment Problem: Find a highest-scoring alignment of two strings.
•  Input: Two strings.
•  Output: An alignment of the strings with maximum alignment score.

STOP and Think: How can we reformulate the problem to find areas of “local” similarity?

Local Alignment Problem: Find a highest-scoring “local” alignment of two strings.
•  Input: Two strings v and w.
•  Output: Substrings of v and w whose best global alignment score is maximized over all substrings.

STOP and Think: What algorithm would you propose to solve this problem?

“Free Taxi Rides” for Local Alignment

[Figure: alignment graph for the strings GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA and GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT, with two edges labeled 0 (the free taxi rides)]
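In code, a free taxi ride from the source into every node amounts to adding 0 as a fourth option when filling each table entry, and a free ride from every node to the sink amounts to taking the maximum over the whole table. The sketch below follows that reading; `local_alignment` is a name chosen for this example, and the strings are the two DNA strings labeling the graph.

```python
def local_alignment(v, w, mu, sigma, match=1):
    """Return (score, row_v, row_w) for a highest-scoring local alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(0,                        # free ride from the source
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma,
                          s[i - 1][j - 1] + diag)
            if s[i][j] > best:                      # free ride to the sink
                best, best_i, best_j = s[i][j], i, j

    # Backtrack from the best node until a free ride (value 0) is reached.
    row_v, row_w = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and s[i][j] > 0:
        diag = match if v[i - 1] == w[j - 1] else -mu
        if s[i][j] == s[i - 1][j - 1] + diag:
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - sigma:
            row_v.append(v[i - 1]); row_w.append("-")
            i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(row_v)), "".join(reversed(row_w))


v = "GCCCAGTCTATGTCAGGGGGCACGAGCATGCACA"
w = "GCCGCCGTCGTTTTCAGCAGTTATGTTCAGAT"
score, row_v, row_w = local_alignment(v, w, mu=1, sigma=1)
print(score)
print(row_v)
print(row_w)
```

The running time stays proportional to the size of the table, instead of requiring a separate global alignment for every pair of substrings.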

Comparing Same-Score Alignments

STOP and Think: Which of these two alignments (which have the same score) is “better”? Why?

GATCCAG      GATCCAG
GA-C-AG      GA--CAG

Affine penalty: a way of scoring contiguous gaps higher than discontiguous gaps.
•  gap opening penalty (σ): assessed to the first symbol of a gap.
•  gap extension penalty (ε): assessed to each subsequent symbol of the gap.

If σ = 5 and ε = 1, then the alignment on the left is penalized by 2σ = 10, whereas the alignment on the right is only penalized by σ + ε = 6.
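The arithmetic above can be checked with a short function that walks along an alignment and charges σ for the first symbol of each gap and ε for each later symbol of the same gap. This is a sketch for illustration; `affine_gap_penalty` is a name chosen here and only the gap penalties are counted, not matches or mismatches.

```python
def affine_gap_penalty(row_v, row_w, sigma, epsilon):
    """Total affine gap penalty of an alignment given as two rows with '-' gaps."""
    penalty = 0
    prev_gap_v = prev_gap_w = False
    for a, b in zip(row_v, row_w):
        gap_v, gap_w = a == "-", b == "-"
        if gap_v:   # gap symbol in the first row
            penalty += epsilon if prev_gap_v else sigma
        if gap_w:   # gap symbol in the second row
            penalty += epsilon if prev_gap_w else sigma
        prev_gap_v, prev_gap_w = gap_v, gap_w
    return penalty


left = affine_gap_penalty("GATCCAG", "GA-C-AG", sigma=5, epsilon=1)
right = affine_gap_penalty("GATCCAG", "GA--CAG", sigma=5, epsilon=1)
print(left, right)  # 10 6
```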

Adding Affine Gap Penalties

Alignment with Affine Gap Penalties Problem: Construct a highest-scoring global alignment of two strings (with affine gap penalties).
•  Input: Two strings along with numbers σ and ε.
•  Output: A highest-scoring global alignment between these strings, as defined by the gap opening and extension penalties σ and ε.

STOP and Think: How can we modify the alignment graph to solve this problem?

Adding “Long” Edges to Graph

One solution: Add a (huge) number of new edges to alignment graph to facilitate longer gaps.
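To see why the number of new edges is a concern, the long-edge idea can be written down directly: every node receives one incoming edge for each possible gap length, and a gap of length k is charged σ + (k - 1)ε. The sketch below computes only the score, assumes σ ≥ ε (so a single long edge is never worse than two shorter consecutive ones), and uses the hypothetical name `affine_score_long_edges`.

```python
def affine_score_long_edges(v, w, mu, sigma, epsilon, match=1):
    """Score of a best global alignment with affine gaps, using long gap edges."""
    n, m = len(v), len(w)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:                     # diagonal edge
                diag = match if v[i - 1] == w[j - 1] else -mu
                best = s[i - 1][j - 1] + diag
            for k in range(1, i + 1):               # long vertical edges
                best = max(best, s[i - k][j] - (sigma + (k - 1) * epsilon))
            for k in range(1, j + 1):               # long horizontal edges
                best = max(best, s[i][j - k] - (sigma + (k - 1) * epsilon))
            s[i][j] = best
    return s[n][m]


print(affine_score_long_edges("GATCCAG", "GACAG", mu=1, sigma=5, epsilon=1))  # -1
```

Each node now has up to i + j + 1 incoming edges instead of 3, so for two strings of length n the work grows roughly like n cubed rather than n squared.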

Moving to Multiple Sequences

Multiple Alignment Problem: Find the highest-scoring alignment between multiple strings.
•  Input: A collection of t strings.
•  Output: A multiple alignment of these strings having maximal score.

STOP and Think: What algorithm would you propose to solve this problem?

Moving to Multiple Dimensions

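[Figure: one cell of the three-dimensional alignment graph, in which node (i, j, k) has incoming edges from its seven predecessors (i – 1, j, k), (i, j – 1, k), (i, j, k – 1), (i – 1, j – 1, k), (i – 1, j, k – 1), (i, j – 1, k – 1), and (i – 1, j – 1, k – 1)]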

STOP and Think: What is the issue with the dynamic programming approach in multiple dimensions?

Answer: The number of edges in a single block grows like 2^t – 1 ...
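The count can be checked by enumeration: each predecessor of a node is obtained by decreasing a nonempty subset of its t coordinates by 1. The following sketch lists those predecessors; `predecessors` is a name chosen for this example.

```python
from itertools import product


def predecessors(node):
    """All nodes with an edge into `node` in the t-dimensional alignment graph."""
    preds = []
    for step in product((0, 1), repeat=len(node)):
        if not any(step):
            continue                    # the all-zero step is the node itself
        pred = tuple(x - d for x, d in zip(node, step))
        if min(pred) >= 0:              # stay inside the grid
            preds.append(pred)
    return preds


for t in range(2, 7):
    print(t, len(predecessors((1,) * t)))   # 3, 7, 15, 31, 63
```

For t = 3 this reproduces the seven predecessors of (i, j, k). Since the table for t strings of length n also has about n^t entries, both factors grow exponentially with t.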

STOP and Think: What heuristic might you propose to align multiple sequences?

Greedy Heuristic for Multiple Alignment

1.  Find an optimal pairwise alignment of each pair of strings.

2.  Combine the set of optimal pairwise alignments into a multiple alignment.

STOP and Think: Try this approach on the strings CCCCTTTT, TTTTGGGG, and GGGGCCCC.

Pairwise alignments of the three strings:

CCCCTTTT----    ----CCCCTTTT    TTTTGGGG----
----TTTTGGGG    GGGGCCCC----    ----GGGGCCCC

There is no way to combine these optimal pairwise alignments into a meaningful multiple alignment!