Multiple Sequence Alignment, School of B&I, TCD, May 2010
![Page 1: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/1.jpg)
Multiple Sequence Alignment
School of B&I TCD May 2010
![Page 2: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/2.jpg)
MSA
• A central technique in bioinformatics
– homology searching
– multiple sequence alignment
– phylogenetic trees
![Page 3: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/3.jpg)
An example
“all you have to do” is re-write your sequences so that similar features finish up in the same columns
![Page 4: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/4.jpg)
Evolutionary relationship
• “similar features” ideally means homologous – with a shared ancestor
• ClustalW and T-Coffee mimic the process of evolution
– by weighting similar residues by how conserved they are in evolution
• Important AAs don’t mutate
• Less important AAs change easily, even randomly
– by inserting judicious gaps
![Page 5: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/5.jpg)
Applications
• Discover conserved patterns/motifs
– A step to describing a protein domain
– MSA can add a distant relative to your protein family
• To define DNA regulatory elements
• Prediction of secondary structure; also helps 3-D structure prediction
• A step to phylogenetic trees
• PCR analysis/primer design
– find most and least degenerate regions of your sequence
![Page 6: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/6.jpg)
So why difficult?
Trivial 2 seq alignment: 3 possibilities. As length and # of seqs increase, the number of possible alignments becomes astronomical
FGDERTHHS   FGDERTHHS   FGDERTHHS
FGD--DHRS   FGDD--HRS   FGD-D-HRS
Where to put the gap?
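To get a feel for how quickly the search space grows, the sketch below counts the gapped alignments of two sequences using a standard recurrence: every alignment ends in either a residue-residue column, a gap in the first sequence, or a gap in the second. This counting convention and the helper name `count_alignments` are assumptions of the sketch, not something stated on the slide.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of gapped alignments of sequences of length m and n.

    Assumed convention: the last column is a residue pair, a gap in
    sequence 1, or a gap in sequence 2, which gives the recurrence below.
    """
    if m == 0 or n == 0:
        return 1  # only one way: pad the remaining residues with gaps
    return (
        count_alignments(m - 1, n - 1)
        + count_alignments(m - 1, n)
        + count_alignments(m, n - 1)
    )


if __name__ == "__main__":
    for length in (1, 2, 5, 10, 20):
        print(length, count_alignments(length, length))
```

Even for two short sequences the count grows far too fast to enumerate, which is why practical tools rely on dynamic programming and heuristics.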
![Page 7: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/7.jpg)
Some data
• Cat ATGAAACGTCGGATCTAA
• Dog ATGAATCGACCCATCTAA
• Mus ATGGCGTGGCTTGGCATGTGA
• Rat ATGGCATGTCGTGGCATGTAG
Protocol step 1
• Align each pair of seqs C-D, C-M, C-R etc
• Get a score for each alignment
• And make a …
![Page 8: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/8.jpg)
Similarity matrix
     Cat  Dog  Mus  Rat
Cat  ID   14   10   10
Dog       ID   10   10
Mus            ID   16
Rat                 ID
• Number of identical residues
– Which pair of sequences is most similar?
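A minimal sketch of how such a matrix can be filled in, assuming the sequences are already aligned to a common length. It uses the gapped rows from the "An alignment" slide later in this deck; `identical_residues` is a hypothetical helper. The within-group counts (Cat/Dog, Mus/Rat) reproduce the slide, but the cross-group counts depend on how each pair is aligned, so they need not equal the 10s shown above.

```python
from itertools import combinations

# Gapped rows copied from the "An alignment" slide.
aligned = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}


def identical_residues(a: str, b: str) -> int:
    """Count columns where both rows carry the same residue (gaps never count)."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


# Upper triangle of the similarity matrix, keyed by (name1, name2).
matrix = {
    (p, q): identical_residues(aligned[p], aligned[q])
    for p, q in combinations(aligned, 2)
}
best_pair = max(matrix, key=matrix.get)

if __name__ == "__main__":
    for pair, score in matrix.items():
        print(pair, score)
    print("most similar:", best_pair)
```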
![Page 9: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/9.jpg)
Progressive alignment
• Align the two most similar sequences, inserting any gaps.
• Mus/Rat: lock these sequences together (call it “RODent”)
• Return to similarity matrix to find next most similar seqs or sequence cluster
• Dog/Cat: align and lock (call it “CARnivore”)
– if next step requires a gap, then the gap is inserted in both carnivore sequences
• Align next most … (now it’s iterative)
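The order in which sequences get locked together can be sketched as a greedy clustering over the similarity matrix. The scores below are the identity counts from the similarity matrix slide; scoring a cluster by the average of its members' pairwise scores is an assumption of this sketch (real tools build a guide tree with their own method), and `merge_order` is a hypothetical name.

```python
from itertools import combinations

# Identity counts taken from the similarity matrix slide.
scores = {
    frozenset(("Cat", "Dog")): 14,
    frozenset(("Cat", "Mus")): 10,
    frozenset(("Cat", "Rat")): 10,
    frozenset(("Dog", "Mus")): 10,
    frozenset(("Dog", "Rat")): 10,
    frozenset(("Mus", "Rat")): 16,
}


def cluster_score(a: tuple, b: tuple) -> float:
    """Average pairwise score between two clusters (assumed linkage rule)."""
    pairs = [(x, y) for x in a for y in b]
    return sum(scores[frozenset(p)] for p in pairs) / len(pairs)


def merge_order(names: list) -> list:
    """Repeatedly lock together the two most similar clusters."""
    clusters = [(n,) for n in names]
    order = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_score(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # the merged cluster is aligned as one unit from now on
        order.append((a, b))
    return order


if __name__ == "__main__":
    for step, (a, b) in enumerate(merge_order(["Cat", "Dog", "Mus", "Rat"]), 1):
        print(step, a, "+", b)
```

With these scores the rodents are joined first, then the carnivores, and finally the two clusters are aligned to each other.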
![Page 10: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/10.jpg)
An alignment
Cat ATGAAACGTCGG---ATCTAA
Dog ATGAATCGACCC---ATCTAA
Mus ATGGCGTGGCTTGGCATGTGA
Rat ATGGCATGTCGTGGCATGTAG
    ***    * *     ** *
• Good: Always a two “sequence” problem
– So computationally possible
• Bad: Can’t rewrite or decouple (part of) the dog/cat alignment in the light of later info. Locked in a (suboptimal?) trough.
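The row of asterisks under the alignment marks columns where every sequence has the same residue. A small sketch of how that line can be derived from the four aligned rows (`conservation_line` is a hypothetical helper; it only reports full identity, not the similarity symbols ClustalW also prints):

```python
rows = [
    "ATGAAACGTCGG---ATCTAA",  # Cat
    "ATGAATCGACCC---ATCTAA",  # Dog
    "ATGGCGTGGCTTGGCATGTGA",  # Mus
    "ATGGCATGTCGTGGCATGTAG",  # Rat
]


def conservation_line(aligned_rows: list) -> str:
    """Return '*' under fully identical, gap-free columns and ' ' elsewhere."""
    if len({len(r) for r in aligned_rows}) != 1:
        raise ValueError("all rows must have the same aligned length")
    return "".join(
        "*" if len(set(col)) == 1 and col[0] != "-" else " "
        for col in zip(*aligned_rows)
    )


if __name__ == "__main__":
    for row in rows:
        print(row)
    print(conservation_line(rows))
```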
![Page 11: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/11.jpg)
Choosing the right seqs
• Use MSA to inform you!
• Always use AA/protein if possible
– can copy gaps back to DNA later
• Start with 6-15 sequences
• Eliminate very different (<30% id) seqs
• Eliminate identical sequences
• Watch out for partial sequences
• …or sequences that need ++ gaps to align
• Check for repeats with dotlet, Lalign
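Copying gaps back to DNA means threading the coding sequence onto the protein alignment, three bases per residue and three dashes per gap. A minimal sketch, assuming the DNA is in frame and codes for exactly the residues in the aligned protein row (including a `*` for the stop codon). `copy_gaps_to_dna` is a hypothetical helper, and the protein row `MKRR-I*` is an illustrative translation of the Cat sequence from the data slide with one assumed gap.

```python
def copy_gaps_to_dna(aligned_protein: str, dna: str) -> str:
    """Map a gapped protein row back onto its in-frame coding DNA."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) % 3 != 0 or n_residues != len(codons):
        raise ValueError("DNA must be in frame and match the residue count")
    remaining = iter(codons)
    # Each protein gap becomes a three-base gap; each residue takes its codon.
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)


if __name__ == "__main__":
    cat_dna = "ATGAAACGTCGGATCTAA"   # M K R R I stop
    print(copy_gaps_to_dna("MKRR-I*", cat_dna))
```

Aligning at the protein level and threading the DNA afterwards keeps gaps on codon boundaries, which a direct DNA alignment does not guarantee.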
![Page 12: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/12.jpg)
Less is more
• Large alignments
– take ++ CPU and time
– are hard to do well
– are difficult to display
– are difficult to use: in trees for example
– may include marginal seqs that wreck whole alignment
• So start small and add/eliminate seqs until you have a clear informative picture
![Page 13: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/13.jpg)
Level of variation is important
• Choose sequence family with best rate of evolution for your taxonomic group
– Histones evolve very slowly (compare kingdoms)
– Transferrins are fast (compare classes, orders)
• Closely related sequences may have identical protein (but variable DNA)
• Distantly related sequences have no DNA signal (“saturated”)
![Page 14: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/14.jpg)
Comparing related sequences
• Case 1, human vs chimp
Seq1 A C G T A A A A G C
     |   | | | | | | | |
Seq2 A A G T A A A A G C
• How many changes? D=0.1 d=?
• Case 2, aardvark vs human
Seq1 A C G T A A A A G C
     | |         |
Seq2 A C A C G G A T A G
• How many changes? D=0.7 d=?
• Need to compensate for multiple hits.
G 100mya G
G 90mya G
G 70mya C
A 50mya C
C 30mya C
C 10mya G
A now G
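The slides do not say which correction is meant for "d". One common choice for DNA is the Jukes-Cantor model, which assumes all substitutions are equally likely; the sketch below applies it to the two observed distances above purely as an illustration, and `jukes_cantor_distance` is a hypothetical name.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site from the observed proportion of differences.

    Jukes-Cantor correction: d = -(3/4) * ln(1 - (4/3) * p).
    Only defined for 0 <= p < 0.75; at 0.75 the signal is saturated.
    """
    if not 0 <= p < 0.75:
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    for observed in (0.1, 0.7):
        print(f"D={observed}  d={jukes_cantor_distance(observed):.3f}")
```

Under this model a small observed distance is barely changed, while D=0.7 corresponds to roughly two substitutions per site, which is the "multiple hits" effect the slide refers to.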
![Page 15: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/15.jpg)
Multiple substitution
Ancestor G
What really happened: (diagram of substitutions along each pair of lineages; intermediate states G C, G, A C, G, A A)
What diffs we can see:
• G C: 1 seen
• A A: 0 seen
• A C: 1 seen
Greater distance – more likely multiple substitution
![Page 16: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/16.jpg)
EBI: loads of options
![Page 17: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/17.jpg)
T-coffee
Minimal input parameters and STILL a better job than ClustalW
![Page 18: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/18.jpg)
Output EBI clustalW
• Pairwise distance etc
• Alignment
• Guide tree
• What you submitted
• Jalview alignment editor
![Page 19: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/19.jpg)
An alignment fragment
ACT_CANAL  -MDGEEVAALIIDNGSGMCKA
ACT_CANDU  -MDGEEVAALVIDNGSGMCKA
ACT_PICAN  -MDGEDVAALVIDNGSGMCKA
ACT_PICPA  -MDGEDVAALVIDNGSGMCKA
ACT_KLULA  -MDS-EVAALVIDNGSGMCKA
ACT_YEAST  -MDS-EVAALVIDNGSGMCKA
ACT_YARLI  -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO --MDDEIQAVVIDNGSGMCKA
                :  *:::**.******
* All AA in column identical
: AA similar size & hydrophobicity
. AA similar size or hydrophobicity
ClustalW format
![Page 20: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/20.jpg)
The alignment, so what next?
• Look at it very closely
• Hand edit if necessary (probably)
• Eliminate problem sequences and redo?
• Use display option best for next step
– Phylip format for trees
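As an illustration of the Phylip option, the sketch below writes an alignment in strict sequential PHYLIP layout: a header with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters. `to_phylip` is a hypothetical helper; relaxed PHYLIP variants with longer names exist, and truncating names to 10 characters can create duplicates, so check what your tree program expects.

```python
def to_phylip(alignment: dict) -> str:
    """Format {name: aligned_sequence} as strict sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        # Names occupy exactly 10 columns: truncated or space-padded.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "Cat": "ATGAAACGTCGG---ATCTAA",
        "Dog": "ATGAATCGACCC---ATCTAA",
        "Mus": "ATGGCGTGGCTTGGCATGTGA",
        "Rat": "ATGGCATGTCGTGGCATGTAG",
    }
    print(to_phylip(example), end="")
```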
![Page 21: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/21.jpg)
Parameter changes
• Substitution matrix: PAM, Gonnet, Blosum
– ClustalW chooses which matrix within family
• PAM30 for closely related pairs; PAM120; PAM250 for more distant
– Difficult alignment: matrix change may help
• Gap penalties (open and extend) have optimal values for each family: find which by trial and error.
– ClustalW puts gaps (which are often external loops) near previous gaps (longer loop)
• MSA does the grunt work. YOU do the fine tuning.
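To see what the open and extend penalties do, the sketch below scores the gaps in the three candidate rows from the "So why difficult?" slide. The penalty values and the convention (each gap run costs open + extend × length) are assumptions for illustration only; programs differ in both, and ClustalW additionally adjusts penalties by position.

```python
import re

GAP_OPEN = 10.0    # illustrative value, not a recommended setting
GAP_EXTEND = 0.5   # illustrative value, not a recommended setting


def gap_penalty(aligned_seq: str,
                gap_open: float = GAP_OPEN,
                gap_extend: float = GAP_EXTEND) -> float:
    """Total affine gap cost of one aligned row.

    Assumed convention: every run of '-' costs gap_open + gap_extend * run_length.
    """
    runs = re.findall(r"-+", aligned_seq)
    return sum(gap_open + gap_extend * len(run) for run in runs)


if __name__ == "__main__":
    for row in ("FGD--DHRS", "FGDD--HRS", "FGD-D-HRS"):
        print(row, gap_penalty(row))
```

With a high open penalty and a low extend penalty, one two-residue gap is much cheaper than two separate one-residue gaps, so the aligner prefers a single longer loop.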
![Page 22: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/22.jpg)
Alignment display: weblogo
Always remember: sequence represents a 3-D structure
![Page 23: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/23.jpg)
Patterns to recognise (more reliable in MSA than in single seq)
• Alternate hydrophobic residues
– Surface β-sheet (zig-zag-zig-zag)
• Runs of hydrophobic residues
– Interior/buried β-sheet
• Residues with 3.5 AA spacing (amphipathic)
– α-helix WNNWFNNFNNWNNNF
• Gaps/indels
– Probably surface not core
MSA improves secondary structure (α-helix, β-sheet) prediction by >6%
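A quick way to look for these patterns is to list where the hydrophobic residues fall and how far apart they are. The sketch below does this for the helix example from the slide; the set of residues treated as hydrophobic is an assumption (classifications vary), and both helper names are hypothetical.

```python
# Assumed hydrophobic set for illustration; other classifications exist.
HYDROPHOBIC = set("AVLIMFWY")


def hydrophobic_positions(seq: str) -> list:
    """0-based indices of residues in the assumed hydrophobic set."""
    return [i for i, aa in enumerate(seq) if aa in HYDROPHOBIC]


def spacings(positions: list) -> list:
    """Distances between consecutive positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    helix_example = "WNNWFNNFNNWNNNF"  # pattern from the slide
    pos = hydrophobic_positions(helix_example)
    print(pos)
    print(spacings(pos))
```

Spacings that mix 3s and 4s, averaging about 3.5, put the hydrophobic residues on one face of a helix; a strict alternation (spacing 2) would instead suggest a surface strand.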
![Page 24: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/24.jpg)
Conserved residues
• W,F,Y large hydrophobic, internal/core
– conserved WFY best signal for domains
• G,P turns, can mark end of α-helix or β-sheet
• C conserved with reliable spacing indicates C-C disulphide bridges (e.g. defensins)
• H,S often catalytic sites in proteases (and other enzymes)
• KRDE charged: ligand binding or salt-bridge
• L very common AA but not conserved
– except in Leucine zipper L234567L234567L234567L
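The leucine zipper spacing above (a leucine at every seventh position, four times) translates directly into a regular expression. This is a minimal sketch with a hypothetical helper name and a made-up test sequence; a match on sequence alone is only a hint, since this pattern also occurs by chance in proteins that are not zippers.

```python
import re

# L, six of any residue, repeated: L234567L234567L234567L
LEUCINE_ZIPPER = re.compile(r"L.{6}L.{6}L.{6}L")


def find_leucine_zipper(seq: str):
    """Return the 0-based start of the first heptad-leucine match, or None."""
    match = LEUCINE_ZIPPER.search(seq)
    return match.start() if match else None


if __name__ == "__main__":
    demo = "MK" + "LAAAAAA" * 3 + "L" + "GG"  # made-up sequence for illustration
    print(find_leucine_zipper(demo))
```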
![Page 25: Multiple Sequence Alignment School of B&I TCD May 2010](https://reader035.vdocuments.net/reader035/viewer/2022081514/56649e4f5503460f94b46fea/html5/thumbnails/25.jpg)
Finish with an alignment: defensins
3 pairs of C residues: 3 disulphide bridges