
Sequence Analysis

Determining how similar 2 (or more) gene/protein sequences are to each other is a “staple” function in bioinformatics.

This information is utilized for:

1) Gene/Protein Identification

2) Inference of Gene/Protein Function

3) Measure Genetic Distance

This ENTIRE exercise relies on the comparison between 2 (or more) sequences, and is independent of any functional content within the sequence(s).

In “Pairwise” analysis and “Multiple Sequence Alignments”, two (or more) sequences are compared to each other and a similarity measurement is derived. This process is completely computational and does not require a database query.

From this process we can:

1) Identify common regions of sequence identity (infer function).

2) Rank order multiple sequences to identify the sequences that are most similar (measure genetic distance).

In “Sequence Identification”, we compare our sequence(s) of interest to an entire database of (known) sequences, and identify those sequences that are most similar to our sequence of interest.

Theoretical Basis of Pairwise Sequence Analysis

Needleman-Wunsch Algorithm: Global Alignment (entire sequence contributes to alignment)

Fundamental Principle: calculate the alignment score across two sequences. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array.

Represents Dynamic Programming: solving a series of smaller, overlapping subproblems and reusing their stored results to solve the entire problem. This resembles “Divide and Conquer”, except that each subproblem is solved only once.
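The array-and-pathways idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the scoring values (match = 1, mismatch = -1, and a single linear gap penalty of -1) and the function name `needleman_wunsch` are assumptions chosen for the demo. Real tools use a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill: each cell reuses the three neighbouring subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Traceback from the far corner recovers one optimal pathway.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, top, bottom = needleman_wunsch("OFFICEUNIVERSITY", "COFFEEICEVARSITY")
print(total)
print(top)
print(bottom)
```

When several pathways tie for the best score, this sketch returns only one of them, preferring the diagonal step.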

DYNAMIC PROGRAMMING and SEQUENCE ALIGNMENTS

'Dynamic programming' is an efficient programming technique for solving certain combinatorial problems. It is particularly important in bioinformatics as it is the basis of sequence alignment algorithms for comparing protein and DNA sequences.

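In sequence alignment, dynamic programming gives a spectacular efficiency gain over a purely recursive algorithm: each cell of the alignment matrix is computed once, so the work grows with the product of the two sequence lengths rather than exponentially.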

Don't expect much enlightenment from the etymology of the term 'dynamic programming,' though. Dynamic programming was formalized in the early 1950s by mathematician Richard Bellman, who was working at RAND Corporation on optimal decision processes. He wanted to concoct an impressive name that would shield his work from US Secretary of Defense Charles Wilson, a man known to be hostile to mathematics research. His work involved time series and planning, thus 'dynamic' and 'programming' (note, nothing particularly to do with computer programming). Bellman especially liked 'dynamic' because "it's impossible to use the word dynamic in a derogatory sense"; he figured dynamic programming was "something not even a Congressman could object to."

OFFICEUNIVERSITY
  |  |   | |||||
COFFEEICEVARSITY

Alignment of 2 “Sequences” (words for demo purposes)

“Ungapped Alignment”

-OFFICEUNIVERSITY
 |||
COFFEEICEVARSITY

-OFF--ICEUNIVERSITY
 |||  |||   | |||||
COFFEEICE---VARSITY

Alignment of 2 “Sequences” (words for demo purposes)

“Gapped Alignment”
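The bar lines in these demo alignments mark columns where the two rows carry the same letter. A small helper makes that explicit; the name `match_line` is hypothetical, and the function assumes the two rows have already been aligned (padded with `-` to equal length).

```python
def match_line(top, bottom):
    """Return a line with '|' under every column where both rows share a letter."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join("|" if x == y and x != "-" else " "
                   for x, y in zip(top, bottom))


# Ungapped: 8 identical columns.
print(match_line("OFFICEUNIVERSITY", "COFFEEICEVARSITY"))
# Gapped: 12 identical columns.
print(match_line("-OFF--ICEUNIVERSITY", "COFFEEICE---VARSITY"))
```

Counting the bars gives a crude identity measure: allowing gaps raises the number of identical columns in the demo words from 8 to 12.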


If gaps at any position (and any length) are allowed, the process becomes computationally expensive, and in many cases the alignment does not provide meaningful information. Hence gaps must be limited to a useful and manageable number.
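To get a feel for how quickly the number of gapped alignments grows, the sketch below counts every pathway through the alignment matrix, where each step is a diagonal move (pair two letters), a move down, or a move right (insert a gap). The function name `count_alignments` is hypothetical, and counting pathways this way treats adjacent gaps in opposite sequences as distinct alignments, which is a simplifying assumption.

```python
from functools import lru_cache


def count_alignments(m, n):
    """Count pathways through an m x n alignment matrix (gaps allowed anywhere)."""

    @lru_cache(maxsize=None)
    def count(i, j):
        if i == 0 or j == 0:
            return 1  # only gaps remain: a single way to finish
        # last column is a letter pair, a gap in one row, or a gap in the other
        return count(i - 1, j - 1) + count(i - 1, j) + count(i, j - 1)

    return count(m, n)


for length in (1, 2, 3, 16):
    print(length, count_alignments(length, length))
```

Two 3-letter words already have 63 possible gapped alignments, and the count for the two 16-letter demo words is far too large to enumerate by hand, which is why gaps have to be penalized and the search organized with dynamic programming.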

[Figure: empty alignment matrix with OFFICEUNIVERSITY across the columns and COFFEEICEVARSITY down the rows.]

Dynamic Programming (Initialization Step)



[Figure: alignment matrix with COFFEEICEVARSITY down the rows, shown first empty and then partly filled. Only a few cell values survive: -3 and -0.3.]


Gap Penalties:

1) Reduce number of gaps in the alignment

2) Ensure a more meaningful alignment

3) Opening a gap is costly

4) Extending a gap is cheap

Gap opening penalty: should be 2 – 3 times larger than the most negative value in the substitution matrix that is being used.

Gap extension penalty: should be 0.1 to 0.3 times the value of the gap opening penalty.
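The two rules of thumb can be turned into a small calculation. This is a sketch under stated assumptions: the helper names, the default factors (2.0 and 0.1, picked from the ranges given above), and the toy substitution scores are all hypothetical. The cost formula assumes the common convention that the first gap position pays the opening penalty and each further position pays the extension penalty; some programs charge the extension penalty on the first position as well.

```python
def suggest_gap_penalties(substitution_scores, open_factor=2.0, extend_fraction=0.1):
    """Derive gap penalties from the most negative substitution score."""
    worst = min(substitution_scores)          # most negative value in the matrix
    gap_open = open_factor * worst            # 2 to 3 times the worst score
    gap_extend = extend_fraction * gap_open   # 0.1 to 0.3 times the opening penalty
    return gap_open, gap_extend


def affine_gap_cost(length, gap_open, gap_extend):
    """Total penalty for one gap: costly to open, cheap to extend."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


toy_scores = [5, 1, -1, -4]                   # hypothetical substitution values
gap_open, gap_extend = suggest_gap_penalties(toy_scores)
print(gap_open, gap_extend)

# Illustrative penalties of -3 (open) and -0.3 (extend):
for length in (1, 2, 3):
    print(length, affine_gap_cost(length, -3, -0.3))
```

With these illustrative values, one gap of length 3 costs less than three separate gaps of length 1, which is how the scheme steers the alignment toward fewer, longer gaps.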


[Figure: two further alignment matrices with COFFEEICEVARSITY down the rows. Only a few cell values survive: -0.9, -0.6, -0.3, 0 and 2 in the first; 0, -0.3, -0.6, -0.9 and -2.9 in the second.]


-OFF--ICEUNIVERSITY
 |||  |||   |||||||
COFFEEICE---VARSITY

-OFF--ICE
 |||  |||
COFFEEICE

VERSITY
| |||||
VARSITY


Theoretical Basis of Pairwise Sequence Analysis

Smith-Waterman Algorithm : Local Alignment

Fundamental Principle: based on Needleman-Wunsch, but compares segments of all possible lengths and chooses whichever optimizes the similarity measure. Allows the user to search for conserved/functional domains within sequences.

Functionally, global alignments start at the far end of the alignment matrix and trace back across the entire length of both sequences, whereas local alignments report only the regions that align well.
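A minimal local-alignment sketch shows the two changes relative to the global case: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name `smith_waterman` are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # restart the alignment here
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AAAGATTACATTT", "CCGATTACAGG"))
print(smith_waterman("OFFICEUNIVERSITY", "COFFEEICEVARSITY"))
```

In the first call the flanking letters share nothing, so only the common core is reported; a global alignment of the same pair would be forced to include the flanks.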

Pair Wise Alignment

Process: Compares 2 sequences

Objective: Find common sequence motifs

Application: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Multiple Alignments

Process: Compares 3 or more sequences

Objective: Find common sequence motifs, rank based on alignment scores

Application: http://www.ebi.ac.uk/clustalw/

Sequence Searching

Process: Compares 1 sequence against thousands

Objective: Sequence identification, comparative genomics

Application: http://www.ncbi.nlm.nih.gov/BLAST/

BLAST (Basic Local Alignment Search Tool)

Why is BLAST so fast?

By pre-indexing all the possible 11-letter words found in the database records, so that each word points directly to the records that contain it.

EXAMPLE “AGTGTCGATCG”

Steps:

1) Find all the 11-letter words in your query sequence, plus a few variations.

2) Look these up in the 11-letter-word index.

3) Retrieve all sequences containing those words.

4) Use a rigorous algorithm (e.g. Smith-Waterman) to extend the match in both directions.
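The indexing and lookup steps can be sketched with a plain dictionary. This is a toy illustration of the word-index idea, not BLAST itself: the function names and the two-record database are hypothetical, only exact word matches are looked up (the "few variations" are omitted), and the final extension step is left out.

```python
from collections import defaultdict


def build_word_index(database, word_size=11):
    """Map every word of length word_size to the (record, position) pairs holding it."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size=11):
    """Look up each query word and return (record, query_pos, record_pos) hits."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, dpos in index.get(word, []):
            seeds.append((name, qpos, dpos))
    return seeds


# Hypothetical database; seq1 contains the example word AGTGTCGATCG.
database = {
    "seq1": "TTTTAGTGTCGATCGAAAA",
    "seq2": "CCCCCCCCCCCCCCC",
}
index = build_word_index(database)          # done once, ahead of any query
print(find_seeds("GGAGTGTCGATCGCC", index))
```

The index is built once for the whole database, so each query only pays for dictionary lookups; the slower, rigorous alignment is then run only around the few seed positions that were found.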
