Pairwise Sequence Alignment Part 2

Upload: loraine-cross

Post on 18-Jan-2016


Page 1: Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment

Pairwise Sequence Alignment Part 2


Outline

• Summary Local and Global alignments

• FASTA and BLAST algorithms

• Evaluating significance of alignments

• Alignment of protein sequences


Pairwise Alignment Summary

Local
• Best score for aligning part of the sequences
• Dynamic programming
• Algorithm: Smith-Waterman
• Table cells never score below zero

Global
• Best score for aligning the full-length sequences
• Dynamic programming
• Algorithm: Needleman-Wunsch
• Table cells are allowed any score
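
The difference between the two table-filling rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=False: global alignment (Needleman-Wunsch), cells may take any score.
    local=True:  local alignment (Smith-Waterman), cells never drop below zero.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are paid for in the first row and column.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                       table[i - 1][j] + gap,        # gap in b
                       table[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)                  # local: restart instead of going negative
                best = max(best, cell)
            table[i][j] = cell
    # Local: best cell anywhere in the table. Global: bottom-right cell.
    return best if local else table[-1][-1]
```

For example, `alignment_score("AAACGTAAA", "CCCCGTCCC", local=True)` scores only the shared `CGT` core, while `local=False` must also pay for the mismatching flanks.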


Gap Scores

• Example showed -1 score per indel
  – So gap cost is proportional to its length
• Biologically, indels occur in groups
  – We want our gap score to reflect this
• Standard solution: affine gap model
  – Once-off cost for opening a gap
  – Lower cost for extending the gap
  – Changes required to the algorithm
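
As a small illustration of why the affine model favours grouped indels, the two cost functions can be compared directly. The parameter values below (`gap_open=5`, `gap_extend=1`) are hypothetical, and the convention that the opening cost covers the first gapped position is an assumption; some tools charge opening plus extension for every position.

```python
def linear_gap_cost(length, per_indel=1):
    """Linear model: every gapped position costs the same."""
    return per_indel * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Affine model: one opening cost, then a lower cost per extra position.

    Assumes the opening cost already covers the first gapped position.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of length 4 versus four separate gaps of length 1.
one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

Under the linear model both arrangements cost the same, whereas the affine model makes the single long gap cheaper.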


Assessing Alignment Significance

Evaluate the significance of the alignment by comparing its score to the scores of random alignments:

• Generate random alignments and calculate their scores
• Compute the mean and the standard deviation (SD) of the random scores
• Compute the deviation (in SD) of the actual score from the mean of the random scores: Z = (x - mean) / SD
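
The procedure can be sketched as follows. The slides do not say how the random alignments are generated; this sketch assumes they come from shuffling one sequence, which keeps its composition fixed. The names `shuffled_scores` and `z_score` are hypothetical, and `score_fn` stands for any alignment scoring function supplied by the caller.

```python
import random
import statistics


def shuffled_scores(query, target, score_fn, n=100, seed=0):
    """Score the query against n shuffled copies of the target.

    Shuffling is an assumed way of producing "random" sequences.
    """
    rng = random.Random(seed)
    letters = list(target)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn(query, "".join(letters)))
    return scores


def z_score(actual, random_scores):
    """Z = (x - mean) / SD, using the population standard deviation.

    Raises ZeroDivisionError if all random scores are identical.
    """
    mean = statistics.mean(random_scores)
    sd = statistics.pstdev(random_scores)
    return (actual - mean) / sd
```

A usage example with a toy scoring function that counts identical positions: `z_score(identity(q, t), shuffled_scores(q, t, identity))`, where `identity = lambda q, t: sum(x == y for x, y in zip(q, t))`.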


Complexity

• Complexity is determined by the size of the table
  – Aligning a sequence of length m against one of length n requires calculating m × n cells
• Estimate: we calculate 10^8 cells per second
  – Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells ≈ 0.64 seconds
  – Aligning an mRNA and a 10^7 bp chromosome requires ~10^11 cells ≈ 1,000 secs ≈ 17 minutes


Complexity for GenBank

• GenBank contains 3 × 10^10 base pairs
  – Searching an mRNA against GenBank requires ~2.5 × 10^14 cells ≈ 2.5 × 10^6 secs ≈ 1 month!
  – So each computer could support just one GenBank search per month
• We need to cut down on alignment
  – Use a heuristic method to narrow down the part of GenBank that could be of interest
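
The estimates on the two complexity slides can be reproduced with a few lines of arithmetic. The rate of 10^8 cells per second and the sequence lengths are the figures quoted in the slides; the function names are only for this example.

```python
CELLS_PER_SECOND = 1e8  # the slides' estimate of table cells computed per second


def table_cells(m, n):
    """Number of dynamic programming cells for sequences of length m and n."""
    return m * n


def estimated_seconds(m, n, rate=CELLS_PER_SECOND):
    """Time to fill the whole table at the given rate."""
    return table_cells(m, n) / rate


cases = [
    ("mRNA vs mRNA", 8_000, 8_000),
    ("mRNA vs chromosome", 8_000, 10**7),
    ("mRNA vs GenBank", 8_000, 3 * 10**10),
]
for label, m, n in cases:
    secs = estimated_seconds(m, n)
    print(f"{label}: {table_cells(m, n):.2e} cells, "
          f"{secs:,.2f} s ({secs / 86_400:.2f} days)")
```

The unrounded products come out slightly below the rounded figures quoted on the slides, which does not change the conclusion: a full dynamic programming search of GenBank takes on the order of a month.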


Database Searches

• Using pairwise comparison, each database search normally yields two groups of scores, one for genuinely related sequences and one for unrelated sequences, with some overlap between them.

• A good search method should separate the two score groups completely.


[Figure: two panels, "Ideal" and "No Good", each showing the Random and Related score groups]


Heuristic Methods: FASTA and BLAST

• FASTA (Lipman & Pearson 1985)
  – First fast sequence searching algorithm for comparing a query sequence against a database
• BLAST: Basic Local Alignment Search Tool (Altschul et al. 1990)
  – Improvement on FASTA: search speed, ease of use, statistical rigor
  – Gapped BLAST (Altschul et al. 1997)


FASTA and BLAST

• Common idea: a good alignment contains subsequences of absolute identity
  – First, identify very short (almost) exact matches.
  – Next, extend the best short hits from the first step to longer regions of similarity.
  – Finally, optimize the best hits using the Smith-Waterman algorithm.


FASTA

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6.

[Figure: seq1 plotted against seq2]
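
A rough sketch of this first step: index the words of one sequence, look up each word of the other, and count the hits per diagonal (the difference between the two word positions). Diagonals with many hits correspond to regions with a high density of exact word matches. The function names are hypothetical and the real FastA program is considerably more elaborate.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=6):
    """Count exact word matches on each diagonal (query pos - target pos)."""
    index = word_positions(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        # .get avoids inserting empty entries into the defaultdict index
        for i in index.get(target[j:j + k], ()):
            hits[i - j] += 1
    return dict(hits)


print(diagonal_hits("ACGTTGCA", "GGACGTTGCA"))
```

In this example all three shared 6-letter words fall on the same diagonal, because the query matches the target shifted by two positions.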


The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.

[Figure: seq1 plotted against seq2]


FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.

[Figure: seq1 plotted against seq2]


The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap.

[Figure: seq1 plotted against seq2]
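
Written out as a formula, the joined score is simple arithmetic. The sketch below assumes one gap between each pair of consecutive joined regions; the penalty value of 20 is a hypothetical placeholder, not a value taken from the slides.

```python
def joined_score(region_scores, joining_penalty=20):
    """Sum of the initial region scores minus a penalty for each joining gap.

    Assumes one gap between each pair of consecutive regions.
    """
    gaps = max(len(region_scores) - 1, 0)
    return sum(region_scores) - joining_penalty * gaps


# Three regions joined by two gaps.
print(joined_score([50, 40, 30]))
```

Joining only pays off when the added region scores more than the penalty, which is why weak regions on distant diagonals are left out.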


FastA uses dynamic programming (the Smith-Waterman algorithm) over a narrow band of high-scoring diagonals between the query sequence and the search set sequence to produce an alignment with a new score.

[Figure: seq1 plotted against seq2]
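
Restricting Smith-Waterman to a band can be sketched as below: only cells whose diagonal lies within `width` of a chosen `center` diagonal are computed, so the work grows with the band width rather than with the full table. The function name, the parameters, and the scoring values are assumptions for illustration, and a linear gap cost is used for brevity.

```python
def banded_local_score(a, b, center=0, width=3, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score restricted to diagonals within `width` of `center`.

    Cell (i, j) lies on diagonal i - j. Cells outside the band are never
    computed and count as 0, i.e. as the start of a fresh local alignment.
    """
    prev = {}  # in-band scores of the previous row, keyed by column j
    best = 0
    for i in range(1, len(a) + 1):
        curr = {}
        lo = max(1, i - center - width)
        hi = min(len(b), i - center + width)
        for j in range(lo, hi + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,
                       prev.get(j - 1, 0) + pair,  # diagonal move
                       prev.get(j, 0) + gap,       # gap in b
                       curr.get(j - 1, 0) + gap)   # gap in a
            curr[j] = cell
            best = max(best, cell)
        prev = curr
    return best
```

In a FASTA-like setting, `center` would be taken from the best diagonal found in the earlier word-matching steps. A band around the wrong diagonal misses the alignment, which is the price of the speed-up.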


• Using the distribution of the z-score, the program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E value.


FASTA: Summary

• Search for regions with exact word matches
• Keep the 10 highest-scoring regions and re-score them using a scoring matrix
• Join diagonals by introducing gaps
• Apply the Smith-Waterman algorithm to achieve the best alignment
• Calculate the Z-score
• Evaluate the significance of the Z-scores: E values


BLAST

• Basic Local Alignment Search Tool

• A set of tools developed at NCBI (BlastN, BlastP, ...)

• BLAST benefits
  – Search speed
  – Ease of use
  – Statistical rigor


BLAST Algorithm

(1) Break the query sequence into words of length W (W default = 11)

(2) Compare the word list to the database and identify exact matches


(3) For each word match, extend alignment in both directions

(4) Compute E-value
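
Steps (1) to (3) can be sketched as a seed-and-extend routine. This is an illustrative simplification and not the NCBI implementation: the function names, the scoring values, and the stopping rule (give up once the running score falls more than `x_drop` below the best score seen) are assumptions for this example.

```python
def seed_hits(query, subject, w=11):
    """Steps (1)-(2): list the query words of length w, find exact matches."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]


def extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in direction step; return (best gain, length kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # assumed stopping rule: score dropped too far below the best
        i += step
        j += step
    return best, best_len


def extend_hit(query, subject, i, j, w=11, match=1, mismatch=-3, x_drop=5):
    """Step (3): extend a word match at (i, j) in both directions, no gaps."""
    right, right_len = extend_one_way(query, subject, i + w, j + w, 1,
                                      match, mismatch, x_drop)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1,
                                    match, mismatch, x_drop)
    score = w * match + left + right
    return score, (i - left_len, i + w + right_len)  # query span, end exclusive
```

With `query = "TTACGTTGCAAA"`, `subject = "CCACGTTGCAGG"` and `w=4`, the first seed is at `(2, 2)` and extending it recovers the shared core `ACGTTGCA`.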


• Bit score (S)
  – Similar to alignment score
  – Normalized
  – Higher means more significant

• E value: number of hits of score ≥ S expected by chance
  – Based on random database of similar size
  – Lower means more significant
  – Used to assess the statistical significance of the alignment
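
The slides do not give the formulas, but the relation between a raw alignment score, the bit score, and the E value is usually written in the Karlin-Altschul form: bit score = (λ·raw - ln K) / ln 2 and E = m·n·2^(-bit score), where m and n are the query and database lengths. The sketch below follows that form. The parameters `lam` and `k` depend on the scoring system, and the numbers used in the usage lines are placeholders, not values reported by BLAST.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2 ** (-bits)


# Placeholder parameters for illustration only.
lam, k = 0.3, 0.1
for raw in (30, 60, 90):
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, query_len=8_000, db_len=3 * 10**10))
```

Because the bit score absorbs `lam` and `k`, bit scores from different scoring systems can be compared directly, while the E value also grows with the size of the search space.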


Gapped BLAST

• The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.


Low Complexity Sequences

• AAAAAAAAAAA

• ATATATATATATA

• Alu sequences