protein sequence alignment multiple sequence alignment

Sequence Alignment Part 3

Protein Sequence Alignment

Multiple Sequence Alignment

Table 3.1. Web sites for alignment of sequence pairs

Name of site | Web address | Reference
Bayes block aligner [a] | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment [b] | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher [c] | http://searchlauncher.bcm.tmc.edu/ | see Web site
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite [d] | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST [e] | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView [f]: shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | see Web site
BLAT [f]: fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer [f]: predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 [f] | http://globin.cse.psu.edu | Florea et al. (1998)


Protein Sequence Alignment

Protein Pairwise Sequence Alignment

• The alignment tools are similar to the DNA alignment tools
  – BLASTP, FASTA

• Main difference: instead of scoring match (+2) and mismatch (-1) we have similarity scores:
  – Score s(i,j) > 0 if amino acids i and j have similar properties
  – Score s(i,j) is ≤ 0 otherwise

• How should we score s(i,j)?

The 20 Amino Acids

Chemical Similarities Between Amino Acids

Acids & Amides DENQ (Asp, Glu, Asn, Gln)

Basic HKR (His, Lys, Arg)

Aromatic FYW (Phe, Tyr, Trp)

Hydrophilic ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)

Hydrophobic ILMV (Ile, Leu, Met, Val)
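As a first, rough answer to the scoring question, the chemical groups above can be turned into a score directly. The sketch below is illustrative only: the helper name `group_score` and the choice of 1 for same-group pairs and 0 otherwise are assumptions, not a published matrix.

```python
# Chemical groups from the table above (one-letter amino acid codes).
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each amino acid to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def group_score(a: str, b: str) -> int:
    """Return 1 if a and b belong to the same chemical group, else 0."""
    return 1 if GROUP_OF[a] == GROUP_OF[b] else 0


print(group_score("D", "E"))  # same group (acids and amides)
print(group_score("D", "K"))  # different groups
print(len(GROUP_OF))          # number of amino acids covered by the groups
```

A score like this ignores how often substitutions actually occur, which is why empirically derived matrices are used in practice.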

Amino Acid Substitutions Matrices

• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns

• Matrices represent biological processes
  – Mutation causes changes in sequence
  – Evolution tends to conserve protein function
  – Similar function requires similar amino acids

• Could base matrix on amino acid properties
  – In practice: based on empirical data

Given an alignment of closely related sequences, we can score the relation between amino acids based on how frequently they substitute for each other.

[Figure: example alignment columns, illustrating identity and similarity]

In this column, E and D are found in 8 of 10 positions.
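The counting idea can be sketched as follows. The column used here is made up for illustration (eight D/E residues out of ten, mirroring the 8/10 example), and `pair_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def pair_counts(column: str) -> Counter:
    """Count unordered residue pairs among all sequence pairs in one column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        if a == "-" or b == "-":
            continue  # gaps do not contribute substitution counts
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical column of ten aligned residues: five D, three E, two K.
column = "DDDDDEEEKK"
counts = pair_counts(column)
print(counts[("D", "E")])    # number of D/E pairs
print(counts[("D", "D")])    # number of D/D pairs
print(sum(counts.values()))  # all pairs among ten residues
```

Pairs that are seen often across many such columns are the ones that end up with high substitution scores.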

Amino Acid Matrices

Symmetric matrix of 20×20 entries: entry (i,j) = entry (j,i)

Entry (i,j): the score of aligning amino acid i against amino acid j.

Entry (i,i) is greater than any entry (i,j), j ≠ i.

PAM - Point Accepted Mutations

• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
  – Proteins are evolutionarily close.
  – Alignment is easy.
  – Point mutations: mainly substitutions
  – Accepted mutations: accepted by natural selection.
• Used global alignment.
• Counted the number of substitutions (i,j) per amino acid pair: many i<->j substitutions => high score s(i,j)
• Found that common substitutions involved chemically similar amino acids.

PAM 250

• Similar amino acids are close to each other.
• Regions define conserved substitutions.

Selecting a PAM Matrix

• Low PAM numbers: short sequences, strong local similarities.

• High PAM numbers: long sequences, weak similarities.
  – PAM120 recommended for general use (40% identity)
  – PAM60 for close relations (60% identity)
  – PAM250 for distant relations (20% identity)

• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended
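To make the guideline concrete, the sketch below computes percent identity for an already aligned pair and picks the PAM matrix whose guideline identity (60%, 40%, 20%) is nearest. The nearest-value rule and both function names are assumptions made for illustration; the slide itself only gives the three reference points.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned (non-gap) columns in which both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def suggest_pam(identity: float) -> str:
    """Pick the matrix whose guideline identity is closest to the observed one."""
    guideline = {"PAM60": 60.0, "PAM120": 40.0, "PAM250": 20.0}
    return min(guideline, key=lambda name: abs(guideline[name] - identity))


identity = percent_identity("HEAGAWGHEE", "HEA-AWKHEE")
print(round(identity, 1), suggest_pam(identity))
print(suggest_pam(22.0))
```

Since the identity is only known after a first alignment, trying several matrices, as the slide recommends, remains the safer route.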

BLOSUM

• Blocks Substitution Matrix
  – Steven Henikoff and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
  – Families of proteins with identical function
  – Highly conserved protein domains
• Ungapped local alignment to identify motifs
  – Each motif is a block of local alignment
  – Counts amino acids observed in same column
  – Symmetrical model of substitution

AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
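How column counts become scores can be sketched with a log-odds calculation over the toy block (four-letter alphabet, "." treated as positions outside the block). This is a simplified sketch: it uses the half-bit scaling 2·log2(observed/expected) and skips the clustering of similar sequences that distinguishes the different BLOSUMn matrices. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Toy block over a four-letter alphabet; "." marks positions outside the block.
BLOCK = [
    "AABCDA...BBCDA",
    "DABCDA.A.BBCBB",
    "BBBCDABA.BCCAA",
    "AAACDAC.DCBCDB",
    "CCBADAB.DBBDCC",
    "AAACAA...BBCCC",
]


def pair_frequencies(block):
    """Observed frequency q[(i, j)] of each unordered pair, pooled over columns."""
    counts = Counter()
    for column in zip(*block):
        residues = [r for r in column if r != "."]
        for a, b in combinations(residues, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds(q):
    """Rounded half-bit log-odds scores: 2 * log2(observed / expected)."""
    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2
    scores = {}
    for (a, b), f in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(f / expected))
    return scores


q = pair_frequencies(BLOCK)
scores = log_odds(q)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Because pairs are stored unordered, the resulting matrix is symmetric by construction, matching the symmetrical model of substitution.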

BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• BLOSUMn is based on sequences that are at most n percent identical.

Selecting a BLOSUM Matrix

• For BLOSUMn, higher n suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations

Multiple Sequence Alignment

Multiple Alignment

• Like pairwise alignment
  – n input sequences instead of 2
  – Add indels to make same length
  – Local and global alignments

• Score columns in alignment independently

• Seek an alignment to maximize score

Alignment Example

Without gaps:

GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC

Score = 1*1 + 2*0.75 + 11*0.5 = 8

With gaps:

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

Score = 4*1 + 11*0.75 + 2*0.5 = 13.25

Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
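The column scoring rule can be checked with a short script. It scores the gapped four-sequence alignment from the example; treating gaps as non-residues when picking the most common character is an assumption, since the slide does not say how gaps are counted.

```python
from collections import Counter

# The gapped alignment from the example, one row per sequence.
ALIGNMENT = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]


def column_score(column) -> float:
    """Fraction of rows sharing the most common residue; a lone residue scores 0."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return 0.0 if top == 1 else top / len(column)


def alignment_score(rows) -> float:
    """Sum of independent column scores."""
    return sum(column_score(col) for col in zip(*rows))


print(alignment_score(ALIGNMENT))  # 13.25
```

The result reproduces the 13.25 given for the gapped alignment: four columns score 1, eleven score 0.75 and two score 0.5.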

Dynamic Programming

• Pairwise A–B alignment table
  – Cell (i,j) = score of best alignment between first i elements of A and first j elements of B
  – Complexity: length of A × length of B

• 3-way A–B–C alignment table
  – Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C
  – Complexity: length A × length B × length C

• n-way S1–S2–…–Sn-1–Sn alignment table
  – Cell (x1,…,xn) = best alignment score between first x1 elements of S1, …, xn elements of Sn
  – Complexity: length S1 × … × length Sn

• Example: protein family alignment
  – 100 proteins, 1000 amino acids each
  – Complexity: 10^300 table cells
  – Calculation time: beyond the big bang!
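A minimal sketch of the pairwise table and of the cell-count arithmetic follows. The match (+2) and mismatch (-1) values are the ones quoted earlier for DNA; the linear gap penalty of -2 and both function names are assumptions made for illustration.

```python
def global_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Global alignment score: cell (i, j) is the best score for a[:i] vs b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Boundary cells: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


def table_cells(lengths) -> int:
    """Product of the sequence lengths, as on the slide (boundary cells ignored)."""
    cells = 1
    for n in lengths:
        cells *= n
    return cells


print(global_alignment_score("GTCGTAG", "GTCTAG"))
print(table_cells([1000, 1000]))                  # pairwise table
print(table_cells([1000] * 100) == 10 ** 300)     # 100-way table
```

The pairwise table with a million cells is easy to fill; the 100-way table is not, which motivates the heuristic approaches that follow.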

MSA Complexity

Feasible Approach

• Based on pairwise alignment scores
  – Build n by n table of pairwise scores

• Align similar sequences first
  – After alignment, consider as single sequence
  – Continue aligning with further sequences

• Sum of pairwise alignment scores
  – For n sequences, there are n(n-1)/2 pairs

GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT

3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C
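The sum-of-pairs idea can be sketched on the merged four-sequence alignment. The scoring values (+2 match, -1 mismatch, -2 per gap) and the function names are assumptions for illustration; columns where both sequences of a pair have a gap are skipped.

```python
from itertools import combinations

# The merged alignment of sequences 1-4.
ALIGNMENT = [
    "GTCGTA-GTCG-GC-TCGAC",
    "GTC-TA-G-CGAGCGT-GAT",
    "G-C-GAAGA-G-GCG-AG-C",
    "G-CCGTCGC-G-TCGTAA-C",
]


def pair_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Score the pairwise alignment induced by two rows, column by column."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is empty for this pair
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows) -> int:
    """Add the pairwise scores of all n(n-1)/2 sequence pairs."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


n = len(ALIGNMENT)
print(n * (n - 1) // 2)         # number of pairs
print(sum_of_pairs(ALIGNMENT))  # sum-of-pairs score under the assumed values
```

For four sequences this evaluates six pairs; the number of pairs grows quadratically with n, which stays manageable, unlike the n-way table.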

ClustalW Algorithm

• Compute pairwise alignment for all the pairs of sequences.

• Use the alignment scores to build a phylogenetic tree such that
  – similar sequences are neighbors in the tree
  – distant sequences are distant from each other in the tree.

• The sequences are progressively aligned according to the branching order in the guide tree.

• http://www.ebi.ac.uk/clustalw/
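The first two steps can be sketched as below. This is not ClustalW's actual procedure: ClustalW builds its guide tree by neighbor joining, whereas this sketch swaps in a simpler greedy average-linkage clustering, and the scoring values, sequence names and function names are all assumptions. The progressive alignment step itself is not shown.

```python
from itertools import combinations


def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment score, keeping only the previous row of the table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Greedy average-linkage clustering; returns the tree as nested tuples of names."""
    # Step 1: pairwise scores for all pairs of sequences.
    score = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
             for p in combinations(seqs, 2)}
    clusters = {name: (name,) for name in seqs}  # tree node -> member names

    def similarity(x, y):
        pairs = [score[frozenset((a, b))] for a in clusters[x] for b in clusters[y]]
        return sum(pairs) / len(pairs)

    # Step 2: repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        x, y = max(combinations(clusters, 2), key=lambda p: similarity(*p))
        members = clusters.pop(x) + clusters.pop(y)
        clusters[(x, y)] = members
    return next(iter(clusters))


# Hypothetical toy input: two clearly separated groups of sequences.
toy = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "CCCCCCCC", "d": "CCCCCCGG"}
print(guide_tree(toy))
```

The nested tuples give the branching order: the innermost pairs would be aligned first, then the resulting groups are aligned to each other.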

Progressive Sequence Alignment (Higgins and Sharp 1988)

N Y L S     N K Y L S     N F S     N F L S

N K/- Y L S     N F L/- S

N K/- Y/F L/- S

Progressive Sequence Alignment (Protein sequences example)

Treating Gaps in ClustalW

• Penalty for opening gaps and additional penalty for extending the gap

• Gaps found in initial alignment remain fixed

• New gaps are introduced as more sequences are added (decreased penalty if gap exists)

• Gap penalties are decreased within stretches of hydrophilic residues
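The opening-plus-extension scheme can be written as a small function. This is a sketch under stated assumptions: the numeric values are illustrative, charging the extension penalty from the second gap position onward is one of several conventions, and `open_scale` is a hypothetical parameter standing in for position-specific reductions such as those at existing gaps or in hydrophilic stretches.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.2,
                open_scale: float = 1.0) -> float:
    """Affine gap cost: a (possibly scaled) opening penalty plus an extension
    penalty for every gap position after the first."""
    if length <= 0:
        return 0.0
    return gap_open * open_scale + gap_extend * (length - 1)


print(gap_penalty(1))                  # a single-position gap
print(gap_penalty(5))                  # one gap of five positions
print(5 * gap_penalty(1))              # five separate one-position gaps
print(gap_penalty(5, open_scale=0.5))  # same gap where opening is cheaper
```

With these values one long gap costs far less than several short ones, which reflects the idea that a single insertion or deletion event can span many residues.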

MSA Approaches

• Progressive approach: CLUSTALW (CLUSTALX), PILEUP, T-COFFEE

• Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign.

• Statistical methods: Hidden Markov Models. SAM2K

• Genetic algorithm: SAGA
