Pairwise Alignment
How do we tell whether two sequences are similar?
BIO520 Bioinformatics Jim Lund
Assigned reading: Ch 4.1-4.7, Ch 5.1; get what you can out of 5.2, 5.4
Pairwise alignment
• DNA:DNA
• polypeptide:polypeptide
The BASIC Sequence Analysis Operation
Alignments
• Pairwise sequence alignments
– One-to-One
– One-to-Database
• Multiple sequence alignments
– Many-to-Many
Origins of Sequence Similarity
• Homology
– Common evolutionary descent
• Chance
– Short similar segments are very common.
• Similarity in function
– Convergence (very rare)
Visual sequence comparison: Dotplot
Visual sequence comparison: Filtered dotplot
4 bp window, 75% identity cutoff
Visual sequence comparison: Dotplot
4 bp window, 75% identity cutoff
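The filtered dotplot can be sketched in a few lines. This is a minimal illustration, not the program used to draw the slides: the function names `dotplot` and `render` are made up for this example, and the defaults simply mirror the 4 bp window and 75% identity cutoff quoted above.

```python
def dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return the set of (i, j) window start positions that pass the cutoff."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical bases inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches / window >= min_identity:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in hits else "." for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)


hits = dotplot("GAACAAT", "GAACAAT")
print(render("GAACAAT", "GAACAAT", hits))
```

Raising `window` or `min_identity` removes more of the chance matches and leaves the diagonal; an unfiltered dotplot corresponds to `window=1`.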
Dotplots of sequence rearrangements
Assessing similarity
GAACAAT
|||||||    7/7 or 100%
GAACAAT
GAACAAT
    |        1/7 or 14%
  GAACAAT
Which is BETTER? How do we SCORE?
Similarity
GAACAAT
|||||||    7/7 or 100%
GAACAAT

GAACAAT
||| |||    6/7 or 86%
GAATAAT
MISMATCH
Mismatches
GAACAAT
||| |||    6/7 or 86%
GAATAAT

GAACAAT
||| |||    6/7 or 86%
GAAGAAT
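Counting matches over aligned columns is the simplest similarity measure. A minimal sketch, assuming the two sequences are already aligned and of equal length (the function name `percent_identity` is made up for this example):

```python
def percent_identity(top, bottom):
    """Return (matches, columns, percent) for two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(top, bottom))
    return matches, len(top), 100.0 * matches / len(top)


matches, columns, percent = percent_identity("GAACAAT", "GAATAAT")
print(f"{matches}/{columns} or {percent:.0f}%")
```

Note that this measure cannot tell the two mismatch examples apart: both score 6/7, about 86%, whatever base replaces the C.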
Terminal Mismatch
     GAACAATttttt
     ||| |||
aaaccGAATAAT         6/7 or 86%
INDELS
GAAgCAAT
||| ||||    7/7 or 100%
GAA*CAAT
Indels, cont’d
GAAgCAAT
||| ||||
GAA*CAAT

GAAggggCAAT
|||    ||||
GAA****CAAT
Similarity Scoring
Common Method:
• Terminal mismatches (0)
• Match score (1)
• Mismatch penalty (-3)
• Gap penalty (-1)
• Gap extension penalty (-1)
DNA Defaults
DNA Scoring
GGGGGGAGAA
|||||*|*||         8(1)+2(-3) = 2
GGGGGAAAAAGGGGG

GGGGGGAGAA--GGG
|||||*|*||  |||    11(1)+2(-3)+1(-1)+1(-1) = 3
GGGGGAAAAAGGGGG
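The arithmetic above can be checked with a short scoring routine. This is a sketch under stated assumptions: `-` marks a gap column, the first column of a gap is charged the gap penalty and each further column the gap extension penalty (which reproduces the sums on the slide), and overhanging ends are written as terminal gap columns and score 0. The function name `score_alignment` is made up for this example.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gap_open=-1, gap_extend=-1):
    """Score two aligned strings with the DNA defaults listed above."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    columns = list(zip(top, bottom))
    # Terminal overhangs score 0: drop leading and trailing gap columns.
    while columns and "-" in columns[0]:
        columns.pop(0)
    while columns and "-" in columns[-1]:
        columns.pop()
    score = 0
    in_gap = False
    for a, b in columns:
        if a == "-" or b == "-":
            # First gap column pays gap_open, later columns pay gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(score_alignment("GGGGGGAGAA-----", "GGGGGAAAAAGGGGG"))  # ungapped, free overhang
print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))  # two-column internal gap
```

With these penalties the gapped alignment wins by one point, because the two-column gap costs 2 and buys three extra matches.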
Absurdity of Low Gap Penalty
GATCGCTACGCTCAGC
 A.C.C..C..T
Perfect similarity, every time!
Sequence alignment algorithms
• Local alignment– Smith-Waterman
• Global alignment– Needleman-Wunsch
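As an illustration of global alignment, here is a compact Needleman-Wunsch sketch. It assumes a linear gap cost (every gap column costs the same) and reuses the DNA match and mismatch values from the scoring slide; production programs such as GAP use affine gaps and are considerably more elaborate. The function name is made up for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-3, gap=-1):
    """Global alignment with a linear gap cost; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Global alignment: leading gaps are charged along the first row and column.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("GAAGCAAT", "GAACAAT"))
```

On the indel example from the earlier slide the routine places a single gap opposite the extra G, giving 7 matches and one gap column.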
Alignment Programs
• Local alignment (Smith-Waterman)
– BLAST (heuristic approximation of Smith-Waterman)
– FASTA (heuristic approximation of Smith-Waterman)
– BESTFIT (GCG program)
• Global alignment (Needleman-Wunsch)
– GAP
Local vs. global alignment
 10 gaggc 15
    |||||
  3 gaggc 7
1 gggggaaaaagtggccccc 19 || |||| ||1 gggggttttttttgtggtttcc 22
Global alignment: alignment of the full length of the sequences
Local alignment: alignment of regions of substantial similarity
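Local alignment differs from global alignment in two small ways: scores are never allowed to drop below zero, and the traceback starts at the best cell rather than the corner. A minimal Smith-Waterman sketch, assuming a linear gap cost and the DNA scores from the scoring slide (the function name and the two test sequences are made up for this example):

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-1):
    """Local alignment with a linear gap cost; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # restart: local alignments may begin anywhere
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences that share only the segment "gaggc".
print(smith_waterman("cccccccccgaggcaaaa", "ttgaggctt"))
```

The unrelated flanks are simply left out of the result, which is the behavior wanted when only a region of substantial similarity is expected.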
Local vs. global alignment
BLAST Algorithm
Look for local alignments: High-scoring Segment Pairs (HSPs)
• Find words of length W shared by query and subject, with score > T.
• Extend the local alignment in both directions until the score falls X below its maximum.
• Keep HSPs with scores > S.
• Find multiple HSPs per query if present.
• Compute the expectation value (E value) using Karlin-Altschul statistics.
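The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: seeds are exact word matches only (no neighborhood words scored against T), extension is ungapped, and the values chosen for W, X and S are arbitrary. All function names and the two test sequences are made up for this example.

```python
def find_seeds(query, subject, w):
    """Return (i, j) positions where the same word of length w occurs in both."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [
        (i, j)
        for i in range(len(query) - w + 1)
        for j in index.get(query[i:i + w], [])
    ]


def extend_right(query, subject, i, j, x_drop, match=1, mismatch=-3):
    """Extend rightwards from (i, j); stop once the score is x_drop below the best."""
    score = best = best_len = k = 0
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


def find_hsps(query, subject, w=4, x_drop=5, s_cutoff=6):
    """Return sorted (score, query_start, subject_start, length) tuples above s_cutoff."""
    found = set()
    for i, j in find_seeds(query, subject, w):
        right_score, right_len = extend_right(query, subject, i, j, x_drop)
        # Extend left by running the same routine on the reversed prefixes.
        left_score, left_len = extend_right(
            query[:i][::-1], subject[:j][::-1], 0, 0, x_drop
        )
        total = left_score + right_score
        if total > s_cutoff:
            found.add((total, i - left_len, j - left_len, left_len + right_len))
    return sorted(found, reverse=True)


print(find_hsps("TTTTTGAACAATGGCTTTTT", "AAAAAGAACAATGGCAAAAA"))
```

Every seed inside the shared 10-base core extends to the same segment pair, so the set collapses them into a single HSP.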
BLAST statistical significance: assessing the likelihood a match occurs by chance
Karlin-Altschul statistic: E = k m N exp(-Lambda S)
m = Size of query sequence
N = Size of database
k = Search space scaling parameter
Lambda = scoring scaling parameter
S = BLAST HSP score
Low E -> good match
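The formula is easy to explore numerically. In this sketch the values given to `k` and `lam` are placeholders chosen for illustration only; real values depend on the scoring matrix and gap penalties and are reported by BLAST itself.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """E = k * m * N * exp(-Lambda * S), using the symbols defined above."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters, not taken from the slides.
K, LAMBDA = 0.1, 0.3

for s in (30, 50, 70):
    e = expect_value(s, query_len=300, db_len=1_000_000_000, k=K, lam=LAMBDA)
    print(f"S = {s}: E = {e:.3g}")
```

Two properties follow directly from the formula: E grows in proportion to the database size, so the same score is less convincing in a larger database, and E falls exponentially as the score rises.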
BLAST statistical significance:
Rule of thumb for a good match:
• Nucleotide match
– E < 1e-6
– Identity > 70%
• Protein match
– E < 1e-3
– Identity > 25%
Protein Similarity Scoring
• Identity - Easy
• WEAK Alignments
• Chemical Similarity
– L vs I, K vs R…
• Evolutionary Similarity
– How do proteins evolve?
– How do we infer similarities?
BLOSUM62
   C  S  T  P  A  G  N  D
C  9 -1 -1 -3  0 -3 -3 -3
S -1  4  1 -1  1  0  1  0
T -1  1  4  1 -1  1  0  1
P -3 -1  1  7 -1 -2 -1 -1
A  0  1 -1 -1  4  0 -1 -2
G -3  0  1 -2  0  6 -2 -1
N -3  1  0 -2 -2  0  6  1
D -3  0  1 -1 -2 -1  1  6
Single-base evolution changes the encoded AA
CAU=H
CAC=H CGU=R UAU=Y
CAA=Q CCU=P GAU=D
CAG=Q CUU=L AAU=N
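The nine single-base neighbors of a codon can be enumerated mechanically. The sketch below builds the standard genetic code from the usual compact 64-letter string (bases ordered U, C, A, G; `*` marks a stop codon); the function name `single_base_neighbors` is made up for this example.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def single_base_neighbors(codon):
    """Map each codon one substitution away from `codon` to its amino acid."""
    neighbors = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


for mutant, amino in sorted(single_base_neighbors("CAU").items()):
    print(f"{mutant}={amino}")
```

For CAU (His) only one of the nine changes is silent (CAC), and all three third-position changes stay within His or Gln, which is part of why substitution matrices score some replacements more gently than others.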
Substitution Matrices
Two main classes:
• PAM-Dayhoff
• BLOSUM-Henikoff
PAM-Dayhoff
• Built from closely related proteins; substitutions constrained by evolution and function
• Substitutions “accepted” by evolution (Point Accepted Mutation = PAM)
• 1 PAM = 1% divergence (one accepted point mutation per 100 residues)
• PAM120 = closely related proteins
• PAM250=divergent proteins
BLOSUM-Henikoff & Henikoff
• Built from ungapped alignments in proteins: “BLOCKS”
• Merge sequences within a block that are at least a given % identical into one sequence
• Calculate “target” frequencies
• BLOSUM62 = blocks merged at 62% identity
– good general purpose
• BLOSUM30
– Detects weak similarities, used for distantly related proteins
BLOSUM62
   C  S  T  P  A  G  N  D
C  9 -1 -1 -3  0 -3 -3 -3
S -1  4  1 -1  1  0  1  0
T -1  1  4  1 -1  1  0  1
P -3 -1  1  7 -1 -2 -1 -1
A  0  1 -1 -1  4  0 -1 -2
G -3  0  1 -2  0  6 -2 -1
N -3  1  0 -2 -2  0  6  1
D -3  0  1 -1 -2 -1  1  6
Gapped alignments
• No general theory for significance of matches!!
• Gap penalty G + L(n): G = gap opening penalty, L = gap extension penalty, n = gap length
– indel mutations rare
– variation in gap length “easy”, so G > L
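The effect of choosing G > L shows up as soon as one long gap is compared with several short ones. A small sketch of the G + L(n) penalty, with placeholder values for G and L that are not taken from the slides:

```python
def gap_penalty(length, gap_open, gap_extend):
    """Affine gap penalty G + L*n for a gap of n columns (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


G, L = 10, 1  # placeholder values with G > L

one_long_gap = gap_penalty(4, G, L)
four_short_gaps = 4 * gap_penalty(1, G, L)
print(one_long_gap, four_short_gaps)
```

With these values a single four-column gap costs 14 while four separate one-column gaps cost 44, so the alignment is pushed toward few indel events of variable length, which matches the biological reasoning on the slide.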
Real Alignments
Phylogeny
Cow-to-Pig Protein
  1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50
    ||||||||||||| |||||||||||||||||||| |||||||||||||||
  1 MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL 50

 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100
    |.| ||||||||||||||||||||||||||||||||.  ||:||| ||||
 51 KSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKI 100

101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150
    |||||||||:||| || .||| ||||||| |||||||||||||||.|| |
101 PVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKEL 150

151 GFHG 154
    || |
151 GFQG 154
Cow-to-Pig cDNA
  1 CAGCTGTCGGAGACAGACACCCAGTCAGTCCCGCCCTTGTTCTTTTTCTC 50
           | ||| |||    ||    |  ||||| |||| ||| ||||||
  1 .......CAGAGCCAGGACACCCAGTACGCCCGCACTTGCTCTGTTTCTC 43

 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100
    |||| ||||||| |||||||||||||||||||||||||||||| ||||||
 44 TTCTGCAGACTGTGCCATGGGGCTCAGCGACGGGGAATGGCAGCTGGTGC 93

101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143

151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200
    ||||||||||||||||| |  ||||| |||||||||||||||||||||||
144 GTCCTCATCAGGCTCTTTAAGGGTCACCCCGAGACCCTGGAGAAATTTGA 193

201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
    |||||| |||||||||||| |||||| ||||||||||||||| |||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243
80% Identity (88% at aa!)
DNA similarity reflects polypeptide similarity
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143

501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550
    | ||||||||| |||||||||||| ||||||||||| |    |  || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543
Coding vs Non-coding Regions
451 CAGGCTGCCATGAGCAAGGCCCTGGAACTGTTCCGGAATGACATGGCTGC 500
    ||||  ||||||||||||||||||||||| |||||||| |||||||| ||
444 CAGGGAGCCATGAGCAAGGCCCTGGAACTCTTCCGGAACGACATGGCGGC 493

501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550
    | ||||||||| |||||||||||| ||||||||||| |    |  || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543

551 CAC.CCCACCCACCTGGG...........CAGGGTGGGCGGGGACTGAAT 588
    | | |||| |||| ||||           |  || ||| |||  |||||
544 CCCACCCATCCACTTGGGCCAGGGCCCCCCGCGGAGGGTGGGCGCTGAAG 593

589 CCCAAGTAGTTATAGGGTTTGCTTCTGAGTGTGTGCTTTGTTTAGGAGAG 638
    | |  |||| | |||||||||||||||||||| ||||||||| | |||||
594 CTCCTGTAGCTGTAGGGTTTGCTTCTGAGTGT.TGCTTTGTTCATGAGAG 642

639 GTGGGTGGAAGAGGTGGATGGGTTAGGGGTGGAGG............... 673
    |||||||| ||||||||| ||| | | ||||| ||
643 GTGGGTGGGAGAGGTGGAGGGGCTGGTGGTGGTGGTGGGGGGGTGTTCAG 692
90% in coding (70% in non-coding)
Third Base of Codon is Hypervariable
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
    ||||||*||||||||||||*||||||*|||||||||||||||*|||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243

251 TGAAGAAGCATGGCAACACGGTGCTCACGGCCCTGGGGGGTATCCTGAAG 300
    ||||||||||*||||||||||||||*||*|||||||||||*|||||*|||
244 TGAAGAAGCACGGCAACACGGTGCTGACTGCCCTGGGGGGCATCCTTAAG 293
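The third-base effect can be counted directly from this stretch of the cow and pig cDNAs. The sketch assumes the reading frame begins at the ATG at cow position 67 (the ATGGGGCTCAGC... that translates to the MGLS... start of the protein alignment earlier in the deck); the function name is made up for this example.

```python
from collections import Counter

COW_START = 201  # coordinate of the first cow base in the strings below
ATG_POS = 67     # assumed position of the A of the start codon in the cow cDNA

cow = (
    "CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC"
    "TGAAGAAGCATGGCAACACGGTGCTCACGGCCCTGGGGGGTATCCTGAAG"
)
pig = (
    "CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC"
    "TGAAGAAGCACGGCAACACGGTGCTGACTGCCCTGGGGGGCATCCTTAAG"
)


def mismatches_by_codon_position(top, bottom, first_coord, atg_pos):
    """Count mismatched columns by codon position (1, 2 or 3) of the top sequence."""
    counts = Counter()
    for offset, (a, b) in enumerate(zip(top, bottom)):
        if a != b:
            codon_pos = (first_coord + offset - atg_pos) % 3 + 1
            counts[codon_pos] += 1
    return counts


counts = mismatches_by_codon_position(cow, pig, COW_START, ATG_POS)
print({pos: counts[pos] for pos in (1, 2, 3)})
```

Under that frame assumption, seven of the nine mismatches in these 100 bases fall on the third codon position, and the two that do not are the ones that change the protein (T to S and A to D in the protein alignment).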
Cow-to-Fish Protein
1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50 :. :|| || .||| | || || |||| |||||. | || : 1 ....MADFDMVLKCWGPMEADHATHGSLVLTRLFTEHPETLKLFPKFAGI 46 . . . . . 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100 :: . || ||| || :|| :| | | .| |. ||| |||| 47 .AHGDLAGDAGVSAHGATVLNKLGDLLKARGAHAALLKPLSSSHATKHKI 95 . . . . . 101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150 |: . |.: | |: | | | | |: : : || | || | 96 PIINFKLIAEVIGKVMEEKAGLD..AAGQTALRNVMAIIITDMEADYKEL 143
151 GFHG 154 || 144 GFTE 147
42% identity, 51% similarity
Cow-to-Fish DNA
 32 .ACAGGACATTTTACTACTCTGCAGATAATGGCTGACTTTGACATGGTAC 80
      |   | | |   | |    ||  |     |  || |   |  |||| |
 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100

 81 TGAAGTGCTGGGGTCCAATGGAGGCGGACCACGCAACCCACGGGAGTCTG 130
    ||||   ||||||     ||||||| ||   ||||  ||| |||     |
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150

131 GTGCTGACCCGTTTATTCACAGAGCACCCAGAAACCCTAAAGTTATTCCC 180
    || || | | |  | |||||||  || || || |||||  ||  |||
151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200

181 CAAGTTTGCTGGC...ATCGCCCATGGGGACCTGGCCGGGGATGCAGGTG 227
    ||||||      |   |   |  | |  ||  ||   |     |  |
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
48% similarity
Protein vs. DNA Alignments
• Polypeptide similarity > DNA
• Coding DNA > Non-coding
• 3rd base of codon hypervariable
• Moderate distance: poor DNA similarity
Rules of Thumb
• DNA-DNA similarities
– 50% significant if “long”
– E < 1e-6, 70% identity
• Protein-protein similarities
– 80% end-end: same structure, same function
– 30% over domain: similar function, structure overall similar
– 15-30%: “twilight zone”
– Short, strong match…could be a “motif”
Basic BLAST Family
• BLASTN – DNA to DNA database
• BLASTP – protein to protein database
• TBLASTN – protein to DNA database (translated)
• BLASTX – DNA (translated) to protein database
• TBLASTX – DNA (translated) to DNA database (translated)
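Choosing among the five programs comes down to the query type and the database type. The lookup below is a small sketch following the standard NCBI definitions of the basic programs; the dictionary layout and the function name `candidate_programs` are made up for this example.

```python
# program -> (query type, database type, what is translated before comparison)
BLAST_PROGRAMS = {
    "blastn": ("dna", "dna", "nothing translated"),
    "blastp": ("protein", "protein", "nothing translated"),
    "blastx": ("dna", "protein", "query translated in six frames"),
    "tblastn": ("protein", "dna", "database translated in six frames"),
    "tblastx": ("dna", "dna", "query and database both translated"),
}


def candidate_programs(query_type, database_type):
    """List the basic BLAST programs that accept this query/database pairing."""
    return sorted(
        name
        for name, (query, database, _) in BLAST_PROGRAMS.items()
        if (query, database) == (query_type, database_type)
    )


print(candidate_programs("dna", "protein"))
print(candidate_programs("protein", "dna"))
print(candidate_programs("dna", "dna"))
```

A DNA query against a DNA database has two options: BLASTN compares the nucleotides directly, while TBLASTX compares the translations, which is slower but better suited to moderately diverged coding sequences.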
DNA Databases
• nr (non-redundantish merge of Genbank, EMBL, etc…)
– EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS
• est (expressed sequence tags)
• htgs (high throughput genome seq.)
• gss (genome survey sequence)
• vector, yeast, ecoli, mito
• chromosome (complete genomes)
• And more
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
Protein Databases
• nr (non-redundant Swiss-prot, PIR, PRF, PDB, Genbank CDS translations)
• swissprot
• ecoli, yeast, fly
• month (sequences released or revised in the last 30 days)
• And more
BLAST Input
• Program
• Database
• Options - see more
• Sequence
– FASTA
– gi or accession #
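FASTA is the plain-text format expected in the sequence box: a header line starting with `>` followed by one or more lines of sequence. A minimal parser sketch (the function name and the example record are made up for illustration; the sequence is the first 50 residues of the cow protein shown earlier):

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>query1 hypothetical example record
MGLSDGEWQLVLNAWGKVEADVAGH
GQEVLIRLFTGHPETLEKFDKFKHL
"""
print(parse_fasta(example))
```

Sequence lines are joined without separators, so line wrapping in the input does not affect the sequence that is searched.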
BLAST Options
• Algorithm and output options
– # descriptions, # alignments returned
– Probability cutoff
– Strand
• Alignment parameters
– Scoring Matrix
• PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80
– Filter (low complexity) PPPPP->XXXXX
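The PPPPP->XXXXX example can be imitated with a simple run masker. This is only a stand-in for illustration: the real low-complexity filters (SEG for proteins, DUST for nucleotides) use compositional complexity measures rather than plain repeats, and the function name and `min_run` default are made up here.

```python
import re


def mask_runs(sequence, min_run=5, mask_char="X"):
    """Replace runs of at least min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), sequence)


print(mask_runs("MKPPPPPLV"))
```

Masking keeps the sequence length unchanged, so alignment coordinates still refer to the original query; the masked residues simply cannot seed or score matches.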
Extended BLAST Family
• Gapped Blast (default)
• PSI-Blast (Position-specific iterated blast)
– “self” generated scoring matrix
• PHI BLAST (motif plus BLAST)
• BLAST2 client (align two seqs)
• megablast (genomic sequence)
• rpsblast (search for domains)