
  • Swiss Institute of Bioinformatics / Institut Suisse de Bioinformatique

    CN+LF-2003.10

    An introduction to multiple alignments

    © Cédric Notredame

    Overview

    Multiple alignments: how-to, goal, problems, use

    Patterns: PROSITE database, syntax, use

    PSI-BLAST: BLAST, matrices, use

    [ Profiles/HMMs ] …

  • Overview

    What are multiple alignments?

    How can I use my alignments?

    How does the computer align the sequences?

    The progressive alignment algorithm

    What are the difficulties?

    Pre-requisite?

    How can we compare sequences?

    How can we align sequences?

    Sometimes two sequences are not enough

    The man with TWO watches NEVER knows the exact time

  • What is a multiple sequence alignment?

    What can it do for me?

    How can I produce one of these?

    How can I use it?

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    Structural/biochemical criterion: residues playing a similar role end up in the same column.

    Evolutionary criterion: residues having the same ancestor end up in the same column.

  • How can I use a multiple alignment?

    chite   ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat   --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr   KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite   AATAKQNYIRALQEYERNGG-
    wheat   ANKLKGEYNKAIAAYNKGESA
    trybr   AEKDKERYKREM---------
    unknown AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    Extrapolation

    SwissProt

    Unknown Sequence

    Homology?

    Less than 30% identity, BUT

    conserved where it MATTERS

  • How can I use a multiple alignment?

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    Extrapolation

    Prosite Patterns

    P-K-R-[PA]-x(1)-[ST]…
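
    A pattern of this kind can be turned into a regular expression and scanned against sequences. The sketch below is only an illustration and not the PROSITE scanning software: `prosite_to_regex` is a hypothetical helper, it covers only part of the syntax (residues, [..] sets, {..} exclusions, x wildcards and repeat counts), and it uses only the visible start of the pattern above because the rest is elided on the slide.

```python
import re

# One pattern element: a residue, a [set], an {exclusion} or the wildcard x,
# optionally followed by a repeat count such as (1) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {AB} means anything but A or B
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


# Aligned rows copied from the slide (first block only); gaps are removed before scanning.
ROWS = {
    "chite": "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",
    "wheat": "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",
    "trybr": "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",
    "mouse": "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",
}

regex = re.compile(prosite_to_regex("P-K-R-[PA]-x(1)-[ST]"))
hits = {}
for name, row in ROWS.items():
    found = regex.search(row.replace("-", ""))
    hits[name] = found.group(0) if found else None
print(hits)
```

    Every row of the alignment contains a match, which is what one expects of a pattern derived from a conserved block.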

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    L?K>R

    Prosite Profiles

    -More Sensitive

    -More Specific

    AFDEFGHQIVLW

  • PROSITE profile (see also HMMs)

    A Substitution Cost For Every Amino Acid, At Every Position
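
    The idea of one score per amino acid per position can be sketched as a small log-odds table built from alignment columns. This is a simplified illustration under stated assumptions and not the actual PROSITE profile format: the uniform background frequency, the pseudocount of 1 and the function names are all choices made for the example, and the six columns are the conserved block from the alignment shown earlier.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(rows, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*rows):
        # Gap characters are simply ignored when counting residues.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[res] for column, res in zip(profile, segment))


motif_rows = ["PKRPLS", "PKRAPS", "PKRAMT", "PKRPRS"]
profile = build_profile(motif_rows)
print(round(score_segment(profile, "PKRPLS"), 2))
print(round(score_segment(profile, "GGGGGG"), 2))
```

    A segment that follows the conserved columns scores well above an unrelated one, and unlike a pattern the score degrades gradually rather than switching from match to no match.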

    How can I use a multiple alignment?

    Phylogeny

    [Figure: tree relating chite, wheat, trybr and mouse]

    -Evolution

    -Paralogy/Orthology

  • How can I use a multiple alignment?

    Phylogeny

    Structure prediction

    Column Constraint

    Evolution Constraint

    Structure Constraint

    PsiPred or PHD for secondary structure prediction: 75% accurate.

    Threading: improving, but not yet as good.

  • Caution!

    Automatic multiple sequence alignment methods are not always perfect…

  • The problem

    Why is it difficult to compute a multiple sequence alignment?

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    Computation: what is the good alignment?

    Biology: what is a good alignment?

    CIRCULAR PROBLEM...

    Good Sequences

    Good Alignment

  • The problem

    Same as the pairwise alignment problem:

    We do NOT know how sequences evolve.

    We do NOT understand the relation between structures and sequences.

    We would NOT recognize the “correct” alignment if we had it IN FRONT of our eyes…

    The Charlie Chaplin paradox

  • What do I need to know to make a good multiple alignment?

    How do sequences evolve?

    How does the computer align the sequences?

    How can I choose my sequences?

    What is the best program?

    How can I use my alignment?

    An alignment is a story

    ADKPKRPLSAYMLWLN

    ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN

    ADKPRRPLS-YMLWLN ADKPKRPKPRLSAYMLWLN

    Mutations + Selection

    ADKPRRP---LS-YMLWLN
    ADKPKRPKPRLSAYMLWLN

    Insertion, Deletion, Mutation

  • Homology

    Same sequences -> same origin? -> same function? -> same 3D fold?

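    [Figure: plot of % sequence identity against alignment length, separating the "Same 3D Fold" region from the "Twilight Zone", with marks at 30% identity and length 100]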

    Convergent evolution

    AFGP with (ThrAlaAla)n, similar to trypsinogen

    AFGP with (ThrAlaAla)n, NOT similar to trypsinogen

    N

    S

    Chen et al., 1997, PNAS, 94, 3811-16

  • Residues and mutations

    All residues are equal, but some more than others…

    [Figure: Venn diagram grouping the amino acids into aliphatic, aromatic, hydrophobic, polar and small residues]

    Accurate matrices are data driven rather than knowledge driven

    Substitution matrices

    Different Flavors:

    • PAM: 250, 350

    • BLOSUM: 45, 62

    • …

  • What is the best substitution matrix?

    Mutation rates depend on families

    Choosing the right matrix may be tricky

    Gonnet250 > BLOSUM62 > PAM250

    Depends on the family, the program used and its tuning

    Family           S     N
    Histone3         6.4   0
    Insulin          4.0   0.1
    Interleukin I    4.6   1.4
    α-Globin         5.1   0.6
    Apolipoprot. AI  4.5   1.6
    Interferon G     8.6   2.8

    Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (0.08 Billion years)

    Insertions and deletions?

    Indel cost

    [Figure: three plots of gap cost against gap length L]

    Affine gap penalty: Cost = GOP + GEP*L
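
    The affine formula above is easy to check numerically. The following is a minimal sketch in which the default values of `gop` and `gep` are arbitrary illustrations, not the defaults of any particular program.

```python
def affine_gap_cost(length, gop=10.0, gep=0.5):
    """Cost of one gap of `length` residues: GOP + GEP * L (0 when there is no gap)."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gop + gep * length


print(affine_gap_cost(4))      # one indel of length 4 -> 12.0
print(4 * affine_gap_cost(1))  # four separate indels of length 1 -> 42.0
```

    With an opening penalty much larger than the extension penalty, one long indel is far cheaper than several short ones, which pushes alignments toward a few contiguous gaps.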

  • How to align many sequences?

    Exact algorithms (Needleman & Wunsch, Smith & Waterman) are time-consuming to compute:

    2 globins => 1 sec

    3 globins => 2 min

    4 globins => 5 hours -> heuristic wanted

    5 globins => 3 weeks -> heuristic really wanted!

    6 globins => 9 years -> heuristic required!

    7 globins => 1000 years -> heuristic definitely required!

    8 globins => 150 000 years -> heuristic please!…
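
    The explosion in running time follows from the size of the search space: exact dynamic programming over N sequences of length L fills a lattice of about (L+1)^N cells, and each cell looks at up to 2^N - 1 predecessors. The sketch below only counts this work; the length of 150 is an assumption (roughly globin-sized) and the function names are hypothetical.

```python
def lattice_cells(n_sequences, length):
    """Number of cells in the N-dimensional dynamic-programming lattice."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences):
    """Each cell is reached from every non-empty combination of sequences advancing."""
    return 2 ** n_sequences - 1


LENGTH = 150  # assumption: roughly the length of a globin
for n in range(2, 9):
    work = lattice_cells(n, LENGTH) * predecessors_per_cell(n)
    print(f"{n} sequences: about {work:.1e} elementary steps")
```

    Each extra sequence multiplies the work by a few hundred, which is the same order of growth as the timings quoted above.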

    Existing methods

    1-Carrillo and Lipman:

    -MSA, DCA.

    -Few, small, closely related sequences.

    2-Segment Based:

    -DIALIGN, MACAW.

    -May Align Too Few Residues

    -Do Well When They Can Run.

    3-Iterative:

    -HMMs, HMMER, SAM.

    -Slow, Sometimes Inaccurate

    -Good Profile Generators

    4-Progressive:

    -ClustalW, Pileup, Multalign…

    -Fast and Sensitive

  • Progressive alignment

    Feng and Doolittle, 1987; Taylor, 1988

    Dynamic Programming Using A Substitution Matrix
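
    Each pairwise step of a progressive method rests on dynamic programming. The sketch below shows global alignment in the Needleman and Wunsch style under simplifying assumptions: `simple_score` is a stand-in for a real substitution matrix, the gap penalty is linear rather than affine, and all names are hypothetical. The two example sequences are the mutated sequences from the "An alignment is a story" slide with gaps removed.

```python
def simple_score(x, y):
    """Stand-in for a substitution matrix lookup: +1 match, -1 mismatch."""
    return 1 if x == y else -1


def needleman_wunsch(seq_a, seq_b, score=simple_score, gap=-2):
    """Return (best score, aligned seq_a, aligned seq_b) for a global alignment."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),  # substitution
                dp[i - 1][j] + gap,                                    # gap in seq_b
                dp[i][j - 1] + gap,                                    # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = needleman_wunsch("ADKPRRPLSYMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(best)
print(aligned_a)
print(aligned_b)
```

    Changing the scoring function or the gap penalty can change the alignment that comes out, which is why the choice of matrix and penalties matters for every progressive method.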

    -Depends on the ORDER of the sequences (Tree).

    -Depends on the CHOICE of the sequences.

    -Depends on the PARAMETERS:

    •Substitution Matrix.

    •Penalties (Gop, Gep).

    •Sequence Weight.

    •Tree making Algorithm.

  • Progressive alignment

    Works well when the phylogeny is dense

    No outlier sequence

    Example: river crossing

    Selecting sequences from a BLAST output

  • A common mistake

    Sequences too closely related

    Identical sequences bring no information

    Multiple sequence alignments thrive on diversity

    PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
    PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
    PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
    PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
    PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
    PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
               :**::*.*******:***:* :****************..::******:***********

    PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
    PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
    PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
    PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
    PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
    PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
               :*** ******.******.**** *:************.:******:**
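
    One way to spot this mistake before aligning is to measure pairwise identity and set aside near-duplicates. The sketch below is an illustration only: the 90% cut-off, the greedy selection and the function names are assumptions, not a recommendation from the course. The two rows are the first block of PRVA_MACFU and PRVA_HUMAN from the alignment above.

```python
def percent_identity(row_a, row_b):
    """Percent identity over the columns where neither aligned row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def drop_near_duplicates(alignment, max_identity=90.0):
    """Greedily keep rows that are at most `max_identity` percent identical to every kept row."""
    kept = {}
    for name, row in alignment.items():
        if all(percent_identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


macfu = "SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE"
human = "SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE"
print(round(percent_identity(macfu, human), 1))
print(list(drop_near_duplicates({"PRVA_MACFU": macfu, "PRVA_HUMAN": human})))
```

    The two parvalbumins differ at only two of these sixty positions, so the second one adds almost nothing to the alignment.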

  • Respect information!

    PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
    PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
    PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
    PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
    PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
    PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
    TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

    : :*. .*::::

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
    PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
    PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
    PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
    PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
    PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

    :. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
    PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
    PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
    PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
    PRVA_RAT   LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
    PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

    *: . .. :: .: : *: ***:.**:*. :** ::

    -This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.

    -A better spread of the sequences is needed

    Selecting diverse sequences

    PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
    PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
    PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
    PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
    PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
    PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
    PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

    : *: .: . .* .:*. * ** *: * : * :* * **:**

    PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
    PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
    PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
    PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
    PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
    PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
    PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

    :** .*:.* .* *: ** :: .* **** **::** **

    -A REASONABLE model now exists.

    -Going further: remote homologues.

  • Aligning remote homologues

    PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
    PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
    PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
    PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
    PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
    PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
    PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
    TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
    TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
    TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

    : ::

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
    PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
    PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
    PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
    PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
    PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
    PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
    TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
    TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

    : . .: .. . *: * : * :* : .*:*: :** .

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
    PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
    PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
    PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
    PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
    PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
    PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
    TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
    TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

    :: .. :: : :: .* :.** *. :** ::

    Going further…

    PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
    PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
    PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
    TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
    TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
    TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
    TPC_PATYE  SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

    . : .. . :: . : * :* : .* *. : * .

    PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
    PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
    PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
    TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
    TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
    TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
    TPC_PATYE  LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

    : . :: : :: * :..* :. :** ::

  • What makes a good alignment…

    The more divergent the sequences, the better

    The fewer indels, the better

    Nice ungapped blocks separated with indels

    Different classes of residues within a block:

    -Completely conserved (*)

    -Size and hydropathy conserved (:)

    -Size or hydropathy conserved (.)

    The ultimate evaluation is a matter of personal judgment and knowledge
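
    The three conservation symbols can be computed column by column. The sketch below is an approximation under a stated assumption: the "strong" and "weak" residue groups are the Clustal-style groups as commonly quoted, used here as a proxy for "size and hydropathy" and "size or hydropathy", and they should be checked against the documentation of the program actually used. The rows are the first block of the HMG alignment shown earlier.

```python
# Assumption: Clustal-style residue groups, quoted from memory of its documentation.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                     # columns with gaps get no symbol
    if len(residues) == 1:
        return "*"                     # completely conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def conservation_line(rows):
    """Build the annotation line printed under an alignment block."""
    return "".join(conservation_symbol(column) for column in zip(*rows))


rows = [
    "---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD",
    "--DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE",
    "KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP",
    "-----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP",
]
print(conservation_line(rows))
```

    A line like this helps the eye find the ungapped, well-conserved blocks, but it does not replace the personal judgment mentioned above.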

    Avoiding pitfalls

  • Naming your sequences the right way

    Never use white spaces in your sequence names.

    Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones like @, #, |, *, >,
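
    The naming rule above can be applied automatically before running an alignment program. This is a minimal sketch; `safe_sequence_name` is a hypothetical helper and the example names are made up.

```python
import re


def safe_sequence_name(name):
    """Keep letters and digits; turn every run of other characters into one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


for raw in [">tr ybr@lab #2", "sp|X00000|PRVA_RAT", "PRVA_MACFU"]:
    print(raw, "->", safe_sequence_name(raw))
```

    Two different names can collapse to the same cleaned name, so it is worth checking that the cleaned names are still unique.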

  • Beware of repeats

    There is a problem when two sequences do not contain the same number of repeats.

    It is then better to manually extract the repeats and to align them separately. Individual repeats can be recognized using Dotlet or Dotter.
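
    The principle behind a dot plot can be sketched in a few lines. This is not how Dotlet or Dotter are implemented (they score windows with a substitution matrix); the sketch assumes exact word matches, and the toy sequence with two copies of MKTAY is invented for the example.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the set of (i, j) where the words of length `window` starting at i and j are identical."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(dots, len_a, len_b):
    """Draw the plot as text, one row per position of the first sequence."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len_b))
        for i in range(len_a)
    )


toy = "MKTAYGGSMKTAYPL"  # made-up sequence containing the repeat MKTAY twice
dots = dot_plot(toy, toy)
print(render(dots, len(toy), len(toy)))
```

    Comparing a sequence with itself gives the main diagonal plus short parallel diagonals, and those off-diagonal runs mark the repeats to extract.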

    Keep a biological perspective

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
    wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
    trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
    mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

    * *** .:: ::... : * . . . : * . *: *

    chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
    wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
    trybr RKVYEEMAEKDKERY----K--RE-M-------
    mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----

    : : * : .* :

    DIFFERENT PARAMETERS

  • Do not overtune!!!

    DO NOT PLAY WITH PARAMETERS!

    IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

    chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. ::: .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
    wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
    trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
    mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

    ***. * .: .. . : . . * . *: *

    chite AATAKQNYIRALQEYERNGG-
    wheat ANKLKGEYNKAIAAYNKGESA
    trybr AEKDKERYKREM---------
    mouse AKDDRIRYDNEMKSWEEQMAE

    * : .* . :

    BaliBase classification and benchmark

    PROBLEM (Description)

    Even phylogenetic spread

    One outlier sequence

    Two distantly related groups

    Long internal indel

    Long terminal indel

  • Choosing the right method

    Source: BaliBase

    Thompson et al, NAR, 1999

    PROBLEM Strategy Strategy

    ClustalW, T-Coffee, MSA, DCA

    PrrP, T-Coffee

    Dialign II

    T-Coffee

    T-Coffee

    Dialign II

    T-Coffee

    Some interesting links

  • Conclusion

    The best alignment method:

    -Your brain

    -The right data

    The best evaluation method:

    -Your eyes

    -Experimental information (SwissProt)

    Choosing the sequences well is important

    Beware of repeated elements

    What can I conclude?

    -Homology -> information extrapolation

    How can I go further?

    -Patterns

    -Profiles

    -HMMs…