Sequence Alignment Part 3: Protein Sequence Alignment, Multiple Sequence Alignment

Upload: bruce-riley

Post on 18-Jan-2018



TRANSCRIPT

Page 1: Protein Sequence Alignment Multiple Sequence Alignment

Sequence Alignment Part 3

Protein Sequence Alignment
Multiple Sequence Alignment


  

Table 3.1. Web sites for alignment of sequence pairs

Name of site | Web address | Reference
Bayes block aligner [a] | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment [b] | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher [c] | http://searchlauncher.bcm.tmc.edu/ | see Web site
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite [d] | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST [e] | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView [f], shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | see Web site
BLAT [f], fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer [f], predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 [f] | http://globin.cse.psu.edu | Florea et al. (1998)


Protein Sequence Alignment


Protein Pairwise Sequence Alignment

• The alignment tools are similar to the DNA alignment tools
  – BLASTP, FASTA
• Main difference: instead of scoring match (+2) and mismatch (-1), we have similarity scores:
  – Score s(i,j) > 0 if amino acids i and j have similar properties
  – Score s(i,j) ≤ 0 otherwise
• How should we score s(i,j)?


The 20 Amino Acids


Chemical Similarities Between Amino Acids

Acids & Amides DENQ (Asp, Glu, Asn, Gln)

Basic HKR (His, Lys, Arg)

Aromatic FYW (Phe, Tyr, Trp)

Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)

Hydrophobic ILMV (Ile, Leu, Met, Val)
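One simple way to turn the groups above into a score s(i,j) is sketched below. This is only an illustration: the +1/0 values and the function names are assumptions made for this example, not a published scoring scheme.

```python
# Property groups exactly as listed on the slide.
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each one-letter residue code to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def similarity(a: str, b: str) -> int:
    """Assumed toy score: 1 if both residues share a group, else 0."""
    return 1 if GROUP_OF[a] == GROUP_OF[b] else 0


def score_ungapped(x: str, y: str) -> int:
    """Sum the similarity scores over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(similarity(a, b) for a, b in zip(x, y))
```

For example, `score_ungapped("DKFI", "ERWA")` counts the D/E, K/R and F/W columns as similar and the I/A column as dissimilar.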


Amino Acid Substitutions Matrices

• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns
• Matrices represent biological processes
  – Mutation causes changes in sequence
  – Evolution tends to conserve protein function
  – Similar function requires similar amino acids
• Could base matrix on amino acid properties
  – In practice: based on empirical data


[Figure: identity vs. similarity]


Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.

[Figure: columns from an alignment of closely related sequences]

In this column, E and D are found in 8 of 10 sequences.
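Counting how often residues share a column can be sketched as follows. The column used here is hypothetical (4 D, 4 E, one A, one G), chosen only so that E and D make up 8 of 10 residues; it is not the column from the slide figure.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column: str) -> Counter:
    """Count unordered residue pairs over all sequence pairs in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        if a == "-" or b == "-":
            continue  # pairs involving a gap are not substitutions
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical 10-sequence column in which D and E account for 8 residues.
column = "DDDDEEEEAG"
counts = column_pair_counts(column)
for pair, n in counts.most_common():
    print(pair, n)
```

The D/E pair dominates the counts, which is the kind of evidence that leads to a high score s(D,E).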


Amino Acid Matrices

Symmetric matrix of 20x20 entries: entry (i,j) = entry (j,i)

Entry (i,j): the score of aligning amino acid i against amino acid j.

Entry (i,i) is greater than any entry (i,j), j ≠ i.
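The two properties above can be checked mechanically. The sketch below uses a toy 4-residue table whose numbers are illustrative only and should be checked against a published matrix before any real use; the helper names are assumptions made for this example.

```python
# Upper triangle of a toy score table for D, E, I, L (illustrative values).
UPPER = {
    ("D", "D"): 6, ("D", "E"): 2, ("D", "I"): -3, ("D", "L"): -4,
    ("E", "E"): 5, ("E", "I"): -3, ("E", "L"): -3,
    ("I", "I"): 4, ("I", "L"): 2,
    ("L", "L"): 4,
}


def build_symmetric(upper):
    """Mirror the upper triangle so that entry (i,j) == entry (j,i)."""
    full = dict(upper)
    for (a, b), score in upper.items():
        full[(b, a)] = score
    return full


def diagonal_dominates(matrix):
    """True if entry (i,i) exceeds every entry (i,j) with j != i."""
    residues = {a for a, _ in matrix}
    return all(
        matrix[(a, a)] > matrix[(a, b)]
        for a in residues
        for b in residues
        if a != b
    )


MATRIX = build_symmetric(UPPER)
```

Storing only the upper triangle and mirroring it guarantees symmetry by construction.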


PAM - Point Accepted Mutations

• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
  – Proteins are evolutionarily close.
  – Alignment is easy.
  – Point mutations: mainly substitutions
  – Accepted mutations: accepted by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
• Found that common substitutions involved chemically similar amino acids.


PAM 250

• Similar amino acids are close to each other.
• Regions define conserved substitutions.


Selecting a PAM Matrix

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.
  – PAM120 recommended for general use (40% identity)
  – PAM60 for close relations (60% identity)
  – PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended
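The identity guidelines above can be written as a small lookup. This is a sketch only: picking the nearest listed identity value is an assumption made for this example, and the slide's advice to try several matrices when uncertain still applies.

```python
# Percent identity -> matrix name, taken from the guidelines on the slide.
PAM_BY_IDENTITY = {60: "PAM60", 40: "PAM120", 20: "PAM250"}


def suggest_pam(percent_identity: float) -> str:
    """Return the matrix whose listed identity is closest (assumed rule)."""
    nearest = min(PAM_BY_IDENTITY, key=lambda p: abs(p - percent_identity))
    return PAM_BY_IDENTITY[nearest]
```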


BLOSUM

• Blocks Substitution Matrix
  – Steven and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
  – Families of proteins with identical function
  – Highly conserved protein domains
• Ungapped local alignment to identify motifs
  – Each motif is a block of local alignment
  – Counts amino acids observed in same column
  – Symmetrical model of substitution

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
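Column counts from a block are commonly converted into scores as log-odds: observed pair frequency divided by the frequency expected from the residue composition. The sketch below assumes scores in half-bit units, rounded to integers, and skips the clustering of similar sequences; the function names are made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Observed frequency of each unordered pair, pooled over all columns."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds_scores(block):
    """Score observed pairs as round(2 * log2(observed / expected))."""
    q = pair_frequencies(block)
    # Residue frequency: p_a = q_aa + (sum over b != a of q_ab) / 2
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(f / expected))
    return scores


# First six (ungapped) columns of the block shown on the slide.
block = ["AABCDA", "DABCDA", "BBBCDA", "AAACDA", "CCBADA", "AAACAA"]
scores = log_odds_scores(block)
```

Pairs that never occur in the block are left out of the result, because the logarithm of a zero frequency is undefined.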


BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on sequences that are at most n percent identical.


Selecting a BLOSUM Matrix

• For BLOSUMn, higher n is suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations


Multiple Sequence Alignment


Multiple Alignment

• Like pairwise alignment
  – n input sequences instead of 2
  – Add indels to make same length
  – Local and global alignments

• Score columns in alignment independently

• Seek an alignment to maximize score


Alignment Example

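Sequences without gaps:

GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC

Score = 1*1 + 2*0.75 + 11*0.5 = 8

Sequences with gaps inserted:

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

Score = 4*1 + 11*0.75 + 2*0.5 = 13.25

Score per column (fraction of the 4 sequences sharing the most common residue): 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0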
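The column scoring rule (4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0) can be sketched as below and applied to the gapped four-sequence alignment of this example. Treating gap characters as non-matching is an assumption, although it reproduces the score of 13.25 given on the slide.

```python
from collections import Counter


def column_score(column) -> float:
    """Fraction of rows sharing the most common residue; 0 if no residue repeats."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return 0.0 if top < 2 else top / len(column)


def alignment_score(rows) -> float:
    """Sum of independent column scores over an alignment of equal-length rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(column_score(col) for col in zip(*rows))


gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(gapped))
```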


Dynamic Programming

• Pairwise A–B alignment table
  – Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  – Complexity: length of A × length of B
• 3-way A–B–C alignment table
  – Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  – Complexity: length A × length B × length C
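A minimal sketch of the pairwise table follows, using global alignment with a linear gap penalty. The match (+2) and mismatch (-1) values follow the DNA scoring mentioned earlier in the deck; the gap penalty of -2 is an assumption made for this example.

```python
def alignment_table(a: str, b: str, match=2, mismatch=-1, gap=-2):
    """Fill the table where cell (i, j) is the best score for a[:i] vs b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            table[i][j] = max(
                diagonal,                 # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,    # gap in b
                table[i][j - 1] + gap,    # gap in a
            )
    return table


table = alignment_table("GTC", "GC")
best_score = table[-1][-1]
```

The table has (length of A + 1) × (length of B + 1) cells, which is where the quadratic cost comes from.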


MSA Complexity

• n-way S1–S2–…–Sn-1–Sn alignment table
  – Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, xn elements of Sn
  – Complexity: length S1 × … × length Sn
• Example: protein family alignment
  – 100 proteins, 1000 amino acids each
  – Complexity: 10^300 table cells
  – Calculation time: beyond the big bang!
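The size of the n-way table for the protein family example (100 proteins of 1000 amino acids each) can be checked directly, since Python integers have arbitrary precision. The function name is an assumption made for this example.

```python
import math


def table_cells(lengths) -> int:
    """Number of table cells, estimated as the product of the sequence lengths."""
    return math.prod(lengths)


cells = table_cells([1000] * 100)
exponent = len(str(cells)) - 1  # power of ten, because cells is exactly 10**exponent
```

Even at an assumed rate of a billion cells per second, filling such a table would take far longer than the age of the universe.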


Feasible Approach

• Based on pairwise alignment scores
  – Build n by n table of pairwise scores
• Align similar sequences first
  – After alignment, consider as single sequence
  – Continue aligning with further sequences
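The first two steps (build the n by n score table, then pick the most similar pair to align first) can be sketched as follows. The scoring values (+2, -1, gap -2) and the four short test sequences are assumptions chosen for illustration.

```python
from itertools import combinations


def nw_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Global alignment score, keeping only one table row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pairwise_table(seqs):
    """Symmetric n by n table of pairwise alignment scores."""
    n = len(seqs)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        table[i][i] = nw_score(seqs[i], seqs[i])
    for i, j in combinations(range(n), 2):
        table[i][j] = table[j][i] = nw_score(seqs[i], seqs[j])
    return table


def most_similar_pair(table):
    """Indices of the highest-scoring pair of distinct sequences."""
    n = len(table)
    return max(combinations(range(n), 2), key=lambda p: table[p[0]][p[1]])


seqs = ["NYLS", "NKYLS", "NFS", "NFLS"]
table = pairwise_table(seqs)
first_pair = most_similar_pair(table)
```

Raw scores favor longer sequences, so practical tools usually convert scores to distances before choosing which sequences to join.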


• Sum of pairwise alignment scores
  – For n sequences, there are n(n-1)/2 pairs

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
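A sum-of-pairs score adds up the pairwise score of every pair of rows in every column. The sketch below assumes +2 for a match, -1 for a mismatch, -2 for a residue against a gap and 0 for gap against gap; these values are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Score one pair of aligned characters (assumed toy values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows) -> int:
    """Sum pair_score over all n(n-1)/2 row pairs in every column."""
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


alignment = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
total = sum_of_pairs(alignment)
```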


1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C


ClustalW Algorithm

Progressive Sequence Alignment (Higgins and Sharp 1988)

• Compute pairwise alignments for all pairs of sequences.
• Use the alignment scores to build a phylogenetic tree (the guide tree) such that
  – similar sequences are neighbors in the tree
  – distant sequences are distant from each other in the tree.
• The sequences are progressively aligned according to the branching order in the guide tree.
• http://www.ebi.ac.uk/clustalw/
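The idea of a guide tree can be illustrated with simple average-linkage clustering of pairwise distances. Note the substitution: ClustalW itself builds its guide tree by neighbor-joining, so this sketch shows the general principle rather than the ClustalW procedure. The sequence names and distances are hypothetical.

```python
from itertools import combinations


def guide_tree(names, distance):
    """Build a nested-tuple tree by repeatedly joining the two closest clusters.

    distance maps frozenset({name_a, name_b}) to a pairwise distance.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(distance[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


names = ["s1", "s2", "s3", "s4"]
distance = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
distance[frozenset(("s1", "s2"))] = 0.1  # hypothetical close pair
distance[frozenset(("s3", "s4"))] = 0.2  # hypothetical close pair
tree = guide_tree(names, distance)
```

Reading the nested tuples from the innermost pair outward gives the order in which sequences and sub-alignments would be aligned.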


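Progressive Sequence Alignment (Protein sequences example)

Input sequences:

N Y L S
N K Y L S
N F S
N F L S

After aligning the most similar pairs:

N K/- Y L S
N F L/- S

After aligning the two pairs with each other:

N K/- Y/F L/- S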


Treating Gaps in ClustalW

• Penalty for opening gaps and additional penalty for extending the gap

• Gaps found in initial alignment remain fixed

• New gaps are introduced as more sequences are added (decreased penalty if gap exists)

• Gap penalty is decreased within stretches of hydrophilic residues
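The opening-plus-extension scheme is usually called an affine gap penalty. In the sketch below the penalty values (10 to open, 1 per additional position) and the convention that the opening penalty covers the first gap position are assumptions; programs differ on both.

```python
from itertools import groupby


def affine_gap_penalty(length: int, gap_open=10, gap_extend=1) -> int:
    """Penalty for one gap: opening cost plus a cost per extra position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def row_gap_penalty(row: str, gap_open=10, gap_extend=1) -> int:
    """Total penalty over all runs of '-' in one aligned row."""
    total = 0
    for is_gap, run in groupby(row, key=lambda c: c == "-"):
        if is_gap:
            total += affine_gap_penalty(len(list(run)), gap_open, gap_extend)
    return total
```

With these values, one gap of length 3 costs 12 while three separate single-position gaps cost 30, so the scheme favors a few long gaps over many short ones.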


MSA Approaches

• Progressive approach
  – CLUSTALW (CLUSTALX)
  – PILEUP
  – T-COFFEE
• Iterative approach: repeatedly realign subsets of sequences
  – MultAlin, DiAlign
• Statistical methods: hidden Markov models
  – SAM2K
• Genetic algorithm
  – SAGA