
Upload: peter-brown

Post on 11-Jan-2016


Page 1

Sequence Analysis

Determining how similar 2 (or more) gene/protein sequences are to each other is a “staple” function in bioinformatics.

This information is utilized to:

1) Identify Genes/Proteins

2) Infer Gene/Protein Function

3) Measure Genetic Distance

This ENTIRE exercise relies on the comparison between 2 (or more) sequences, and is independent of any functional content within the sequence(s).

Page 2

In “Pairwise” analysis and “Multiple Sequence Alignments”, two (or more) sequences are compared to each other and a similarity measurement is derived. This process is completely computational and there is no need for a database query.

From this process we can:

1) Identify common regions of sequence identity (infer function).

2) Rank order multiple sequences to identify the sequences that are most similar (measure genetic distance).
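As a rough illustration of deriving a similarity measurement and rank ordering sequences by it, the sketch below computes percent identity between sequences that are assumed to be already aligned (equal length, "-" for gaps). The function names and the toy DNA strings are hypothetical and chosen only for this example; real tools use richer scoring schemes.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns where both sequences carry the same residue.

    Assumes `a` and `b` are already aligned (same length); gap characters never count as a match.
    """
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def rank_by_similarity(query: str, candidates: list[str]) -> list[str]:
    """Return the candidates sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda s: percent_identity(query, s), reverse=True)


if __name__ == "__main__":
    query = "ACGTACGT"  # hypothetical toy sequence
    for seq in rank_by_similarity(query, ["TTTTACGT", "ACGTACGA", "ACGTACGT"]):
        print(seq, percent_identity(query, seq))
```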

Page 3

In “Sequence Identification”, we compare our sequence(s) of interest to an entire database of (known) sequences, and identify those sequences that are most similar to our sequence of interest.

Theoretical Basis of Pairwise Sequence Analysis

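Needleman-Wunsch Algorithm: Global Alignment (entire sequence contributes to alignment)

Fundamental Principle: calculate the best alignment score across the full length of two sequences. Every possible pairing of a residue from one sequence with a residue from the other is represented by a cell in a two-dimensional array, and every possible alignment is represented by a pathway through the array.

This is an example of Dynamic Programming: breaking a computational problem into a series of smaller sub-problems, solving each one once, storing the results, and reusing them to solve the entire problem. It is related to “Divide and Conquer”, except that the sub-problems overlap.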
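A minimal sketch of the global alignment idea follows. It fills the two-dimensional array cell by cell and then traces one optimal pathway back from the far corner. The scoring values (match +1, mismatch -1, a flat gap cost of -1) are assumptions picked for illustration, not values given in these slides, and the function name is hypothetical.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment with a flat (linear) gap cost. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: pair the two residues, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace one optimal pathway back from the bottom-right corner to the origin.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("OFFICEUNIVERSITY", "COFFEEICEVARSITY")
    print(total)
    print(top)
    print(bottom)
```

When several pathways tie for the best score, this sketch reports only one of them (it prefers pairing residues over opening a gap).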

Page 4

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

'Dynamic programming' is an efficient programming technique for solving certain combinatorial problems. It is particularly important in bioinformatics as it is the basis of sequence alignment algorithms for comparing protein and DNA sequences.

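In bioinformatics, dynamic programming gives a spectacular efficiency gain over a purely recursive algorithm: a plain recursion re-solves the same sub-alignments again and again, so its running time grows exponentially with sequence length, whereas dynamic programming computes each cell of the alignment matrix only once, for a cost proportional to the product of the two sequence lengths.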
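The efficiency gain can be made concrete by counting how often the scoring recurrence is evaluated with and without stored results. This is a sketch under assumed scores (match +1, mismatch -1, gap -1); the function names and the short test sequences are hypothetical.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -1  # assumed scores, for illustration only


def naive_score(a: str, b: str):
    """Purely recursive global alignment score. Returns (score, number_of_calls)."""
    calls = 0

    def rec(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(rec(i - 1, j - 1) + pair, rec(i - 1, j) + GAP, rec(i, j - 1) + GAP)

    return rec(len(a), len(b)), calls


def memoized_score(a: str, b: str):
    """Same recurrence, but each (i, j) sub-problem is solved once and cached."""
    calls = 0

    @lru_cache(maxsize=None)
    def rec(i: int, j: int) -> int:
        nonlocal calls
        calls += 1  # counts only cache misses, i.e. distinct sub-problems
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(rec(i - 1, j - 1) + pair, rec(i - 1, j) + GAP, rec(i, j - 1) + GAP)

    return rec(len(a), len(b)), calls


if __name__ == "__main__":
    print(naive_score("GATTACA", "GCATGCG"))     # same score, many thousands of calls
    print(memoized_score("GATTACA", "GCATGCG"))  # same score, at most 8 * 8 calls
```

Both functions return the same score; only the amount of repeated work differs.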

Don't expect much enlightenment from the etymology of the term 'dynamic programming,' though. Dynamic programming was formalized in the early 1950s by mathematician Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning, thus 'dynamic' and 'programming' (note, nothing particularly to do with computer programming). Bellman especially liked 'dynamic' because "it's impossible to use the word dynamic in a derogatory sense"; he figured dynamic programming was "something not even a Congressman could object to."

Page 5

Alignment of 2 “Sequences” (words for demo purposes)

OFFICEUNIVERSITY
COFFEEICEVARSITY

“Ungapped Alignment”

OFFICEUNIVERSITY
  |  |   | |||||
COFFEEICEVARSITY

The same two words with the first shifted by one position:

-OFFICEUNIVERSITY
 |||
COFFEEICEVARSITY
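The ungapped comparison of the two demo words can be sketched by sliding one word along the other and counting identical letters at each offset. The function names and the offset convention (a positive offset shifts the second word to the right) are assumptions made for this example.

```python
def ungapped_matches(a: str, b: str, offset: int = 0) -> int:
    """Count identical letters when b is shifted right by `offset` relative to a.

    A negative offset shifts a to the right instead. No internal gaps are allowed.
    """
    count = 0
    for i, letter in enumerate(a):
        j = i - offset  # position in b that lines up with position i in a
        if 0 <= j < len(b) and b[j] == letter:
            count += 1
    return count


def best_ungapped_offset(a: str, b: str):
    """Try every possible shift and return (offset, matches) for the best one."""
    offsets = range(-(len(a) - 1), len(b))
    best = max(offsets, key=lambda off: ungapped_matches(a, b, off))
    return best, ungapped_matches(a, b, best)


if __name__ == "__main__":
    a, b = "OFFICEUNIVERSITY", "COFFEEICEVARSITY"
    print(ungapped_matches(a, b, 0))   # the two words written directly under each other
    print(ungapped_matches(a, b, -1))  # first word shifted right by one position
    print(best_ungapped_offset(a, b))
```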

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 6

Alignment of 2 “Sequences” (words for demo purposes)

OFFICEUNIVERSITY
COFFEEICEVARSITY

“Gapped Alignment”

-OFF--ICEUNIVERSITY
 |||  |||   | |||||
COFFEEICE---VARSITY

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

If gaps at any position (and any length) are allowed, the process becomes computationally expensive, and in many cases the alignment does not provide meaningful information. Hence gaps must be limited to a useful and manageable number.

Page 7

  O F F I C E U N I V E R S I T Y
C
O
F
F
E
E
I
C
E
V
A
R
S
I
T
Y

Dynamic Programming (Initialization Step)

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 8

[Alignment matrix figure: OFFICEUNIVERSITY across the columns, COFFEEICEVARSITY down the rows. The cell contents did not survive extraction.]

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 9

[Alignment matrix figure: OFFICEUNIVERSITY across the columns, COFFEEICEVARSITY down the rows. Values that survived extraction, by row (column positions were lost): fifth row (E): -0.3; sixth row (E): -3; ninth row (E): -0.3, -0.3, -3; eleventh row (A): -3.]

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Gap Penalties:

1) Reduce number of gaps in the alignment

2) Ensure a more meaningful alignment

3) Opening a gap is costly

4) Extending a gap is cheap

Gap opening penalty: should be 2 to 3 times the magnitude of the most negative value in the substitution matrix that is being used.

Gap extension penalty: should be 0.1 to 0.3 times the value of the gap opening penalty.
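The two rules of thumb above can be written down as a small calculation. This is a sketch only: the function names, the default factors (3 times and 0.1 times, taken from within the stated ranges), and the convention that a gap of length k costs one opening plus k - 1 extensions are assumptions; some alignment programs count the first gap position differently.

```python
def suggested_gap_penalties(most_negative_score: float,
                            open_factor: float = 3.0,
                            extend_fraction: float = 0.1):
    """Apply the rules of thumb: opening is 2 to 3 times the magnitude of the most
    negative substitution score, extension is 0.1 to 0.3 times the opening penalty.

    Returns (gap_open, gap_extend) as positive costs.
    """
    if not 2.0 <= open_factor <= 3.0:
        raise ValueError("open_factor is expected to lie between 2 and 3")
    if not 0.1 <= extend_fraction <= 0.3:
        raise ValueError("extend_fraction is expected to lie between 0.1 and 0.3")
    gap_open = open_factor * abs(most_negative_score)
    return gap_open, extend_fraction * gap_open


def affine_gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one gap of `length` positions: open once, then extend (assumed convention)."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # Hypothetical substitution matrix whose most negative entry is -1.
    gap_open, gap_extend = suggested_gap_penalties(-1)
    for k in (1, 2, 3):
        print(k, affine_gap_cost(k, gap_open, gap_extend))
```

Because extending is much cheaper than opening, one gap of three positions costs far less than three separate gaps of one position each, which is what pushes the alignment towards fewer, longer gaps.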

Page 10

[Alignment matrix figure: OFFICEUNIVERSITY across the columns, COFFEEICEVARSITY down the rows. Values that survived extraction, by row (column positions were lost): fifth row (E): -0.9; sixth row (E): -0.6; ninth row (E): -0.6, -0.3, 0; eleventh row (A): 2.]

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 11

[Alignment matrix figure: OFFICEUNIVERSITY across the columns, COFFEEICEVARSITY down the rows. Values that survived extraction, by row (column positions were lost): fifth row (E): 0; sixth row (E): -0.3; ninth row (E): -0.3, -0.6, -0.9; eleventh row (A): -2.9.]

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 12

-OFF--ICEUNIVERSITY
 |||  |||   | |||||
COFFEEICE---VARSITY

-OFF--ICE
 |||  |||
COFFEEICE

VERSITY
| |||||
VARSITY

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

Page 13

Theoretical Basis of Pairwise Sequence Analysis

Smith-Waterman Algorithm : Local Alignment

Fundamental Principle: based on Needleman-Wunsch, but compares segments of all possible lengths and chooses whichever optimizes the similarity measure. Allows the user to search for conserved/functional domains within sequences.

Functionally, a global alignment starts its traceback at the far corner of the alignment matrix and works back across the entire length of both sequences, whereas a local alignment starts at the highest-scoring cell and stops where the score falls to zero, so it reports only the region(s) that align well.
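The difference from the global method can be sketched in a few lines: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell rather than at the far corner. The scores used here (match +2, mismatch -1, flat gap cost -1) and the function name are assumptions for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Local alignment with a flat gap cost. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere instead of carrying a negative score.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACAGGG", "CCCGATTACATTT"))
    print(smith_waterman("OFFICEUNIVERSITY", "COFFEEICEVARSITY"))
```

With a cheap flat gap cost the two demo words may still align as one long region; a more expensive gap penalty tends to split the result into separate local segments.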

Page 14

Pairwise Alignment
Process: Compares 2 sequences
Objective: Find common sequence motifs
Application: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Multiple Alignments
Process: Compares 3 or more sequences
Objective: Find common sequence motifs, rank based on alignment scores.
Application: http://www.ebi.ac.uk/clustalw/

Sequence Searching
Process: Compares 1 sequence against thousands
Objective: Sequence Identification, Comparative genomics
Application: http://www.ncbi.nlm.nih.gov/BLAST/

Page 15

BLAST (Basic Local Alignment Search Tool)

Why is BLAST so fast?

By pre-indexing every 11-letter word that occurs in the database, together with the records that contain it.

EXAMPLE “AGTGTCGATCG”

Steps:

1) Find all the 11-letter words in your query sequence, plus a few variations.

2) Look these up in the 11-letter-word index.

3) Retrieve all sequences containing those words.

4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
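The word-index idea behind the first three steps can be sketched with a plain dictionary. This is a simplified illustration, not how BLAST itself is implemented: it looks up exact words only (no "variations"), leaves out the extension step, and the function names and toy database records are hypothetical.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 11):
    """Map every k-letter word to the (record name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query: str, index, k: int = 11):
    """Look up each k-letter word of the query; return (query_pos, record, record_pos) hits."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    # Hypothetical two-record database; the index is built once and reused for every query.
    database = {
        "rec1": "TTTAGTGTCGATCGTTT",
        "rec2": "CCCCCCCCCCCCCCC",
    }
    index = build_word_index(database)
    print(find_seeds("GGAGTGTCGATCGAA", index))
```

Each seed marks a position where a rigorous alignment could then be extended in both directions; records with no seed are never examined, which is where the speed comes from.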