Multiple sequence alignment and their reliability

The Bioinformatics Unit, G.S. Wise Faculty of Life Science, Tel Aviv University, Israel
By Haim Ashkenazy
TAU Bioinformatics Workshop, January 2013
http://guidance.tau.ac.il/workshop_2013/


Page 1: Multiple sequence alignment and their reliability

Multiple sequence alignment and their reliability

The Bioinformatics Unit
G.S. Wise Faculty of Life Science
Tel Aviv University, Israel

By Haim Ashkenazy

http://guidance.tau.ac.il/workshop_2013/

TAU Bioinformatics Workshop, January 2013

Page 2: Multiple sequence alignment and their reliability

What are alignments good for?

• To compare sequences
  o Find homology
  o Similar sequences often have similar function

• To learn about sequence evolution
  o Mismatch = point mutation
  o Gap = indel (insertion or deletion)
  o Reconstruct phylogenetic trees
  o Infer selection forces, e.g., detecting positive selection, co-evolving sites

• For structure prediction
  o Similar regions potentially have similar structure

Page 3: Multiple sequence alignment and their reliability

Making an alignment (pairwise)

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• For 2 sequences: pairwise alignment
  o Local alignment: finds regions of high similarity in parts of the sequences
  o Global alignment: finds the best alignment across the entire two sequences

• Use an exact solution
  o Needleman-Wunsch (for global) or Smith-Waterman (for local): http://www.ebi.ac.uk/Tools/psa/
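The exact global method named above can be sketched in a few lines. This is a minimal illustration of Needleman-Wunsch, assuming a simple match/mismatch/linear-gap scoring scheme (the values 1, -1 and -2 are arbitrary choices, not taken from the slides; real tools use substitution matrices and affine gap penalties). The function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ")
    print(s)
    print(x)
    print(y)
```

Several alignments may share the optimal score; the traceback here returns only one of them, which is part of why alignment reliability matters.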

Page 4: Multiple sequence alignment and their reliability

Sequence evolution

[Figure: a tree with time points 30 MYA, 5 MYA and Today, and tips labeled Human, Chimp and Mouse]

Sequences shown in the tree, from root to tips:
ATGAAATAA
ATGTTTTAA, ATGCCCAAATAA
ATGTTTTCA, ATGTTTTAA, ATGCCCAAA

Two candidate alignments of the present-day sequences:

ATG---TTTTAA
ATG---TTTTCA
ATGCCCAAA---

ATG---TTTTAA
ATG---TTTTCA
ATGCCC---AAA

Page 5: Multiple sequence alignment and their reliability

Alignment and phylogeny are mutually dependent

[Diagram labels: Unaligned sequences; Inaccurate tree building; Sequence alignment; MSA; Phylogeny reconstruction; tree with scale bar 0.4]

Page 6: Multiple sequence alignment and their reliability

Alignment and phylogeny are both challenging

~25% of residues are wrongly aligned
Based on BAliBASE: a large representative set of proteins

Page 7: Multiple sequence alignment and their reliability

Alignment and phylogeny are both challenging

5% of tree branches are wrong
Based on simulations of 100 protein sequences

Page 8: Multiple sequence alignment and their reliability

Making an alignment (MSA)

• For more than 2 sequences: multiple sequence alignment (MSA)
  o Exact methods are not feasible (too slow)
  o We use heuristic methods
  o Several advanced MSA programs are available

Basically two recommended methods:
• MAFFT: fastest and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions

Page 9: Multiple sequence alignment and their reliability

Progressive alignment

First step: compute pairwise distances

Sequences: A, B, C, D, E

Compute the pairwise alignments for all against all (10 pairwise alignments).
The similarities are converted to distances and stored in a table.

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32
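The slide does not say how similarities are converted to distances. As a rough sketch, assuming the simplest choice (the p-distance, i.e. the fraction of mismatching positions among columns without gaps), the first step could look like this. The helper name `p_distance` is hypothetical, and real programs typically use faster or corrected distance estimates.

```python
from itertools import combinations


def p_distance(x, y):
    """Fraction of mismatches between two pairwise-aligned sequences.

    Columns where either sequence has a gap are ignored (an assumption).
    """
    pairs = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(p != q for p, q in pairs) / len(pairs)


# Five sequences give 5 * 4 / 2 = 10 pairs to align, as stated on the slide.
names = ["A", "B", "C", "D", "E"]
all_pairs = list(combinations(names, 2))

if __name__ == "__main__":
    print(len(all_pairs))
    # The pairwise alignment shown earlier in the talk: 5 mismatches in 14 gap-free columns.
    print(p_distance("ADLGAVFALCDRYFQ", "ADLGRTQN-CDRYYQ"))
```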

Page 10: Multiple sequence alignment and their reliability

Second step: build a guide tree

Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: guide tree with leaves A, D, C, B, E, built from the pairwise distance table]

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
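The slide only says that the sequences are clustered; it does not name the clustering method. As an illustration, assuming UPGMA (average-linkage clustering, one common choice for guide trees), the distance table above yields the merge order below. The function name `upgma` and the nested-tuple tree representation are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Average-linkage clustering.

    dist maps pairs of names, e.g. ("A", "B"), to distances.
    Returns (tree, merges): the tree as nested tuples and the list of
    merges in the order they were made.
    """
    d = {frozenset(k): v for k, v in dist.items()}
    size = {n: 1 for n in names}   # number of leaves in each cluster
    active = list(names)
    merges = []
    while len(active) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        size[new] = size[a] + size[b]
        # Distance to every other cluster: size-weighted average.
        for c in active:
            if c != a and c != b:
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / size[new]
        active = [c for c in active if c != a and c != b] + [new]
        merges.append(new)
    return active[0], merges


# Distance table from the slide.
table = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}

if __name__ == "__main__":
    tree, merges = upgma(["A", "B", "C", "D", "E"], table)
    for step, m in enumerate(merges, 1):
        print(step, m)
```

Under these assumptions A and B are joined first, then C and D, then the two pairs, and finally E. A progressive aligner would align the sequences in that same bottom-up order.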

Page 11: Multiple sequence alignment and their reliability

Third step: align sequences in a bottom-up order

1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree

[Figure: guide tree with leaves A, D, C, B, E; aligned groups shown as Sequence A + Sequence B, Sequence C + Sequence D, Sequence E]

Page 12: Multiple sequence alignment and their reliability

Multiple sequence alignment (MSA): progressive alignment

[Diagram labels: sequences A, B, C, D, E; Pairwise distance table; Guide tree; MSA; Iterative]

Page 13: Multiple sequence alignment and their reliability

Sources of alignment errors

Progressive alignment algorithms are greedy heuristics

Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)

Alignment of the sequences as given:

GEELTNWPSPVCHNRLASGIDDSTAFRFPRPQKWIISYSLHCVI...
GEELTLWPSPVCHNRLASGIDASIAFRFPRAQKRFYRYSLHCVI...
TEELTHWPFPVCRNRLARGIGSAIAFRCPRSQEHI-RNSLPCVI...
TEELRYWPFPVCQN--ARGNGSVIEARNPGSQ-----KVLPYVI...

Alignment of the reversed sequences:

...IVCHLSYSIIWKQPRPFRFATSDDIGSALRNHCVPSPWNTLEEG
...IVCHLSYRYFRKQARPFRFAISADIGSALRNHCVPSPWLTLEEG
...IVCPLSNRI-HEQSRPCRFAIASGIGRALRNRCVPFPWHTLEET
...IVYPLVK-----QSGPNRAEIVSGNGRA--NQCVPFPWYRLEET

Page 14: Multiple sequence alignment and their reliability

GUIDANCE: Guide-tree based alignment confidence scores

[Diagram labels: Base alignment; Bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100); Progressive alignment (MSA 1, MSA 2, …, MSA 99, MSA 100); GUIDANCE Scores, on a scale from 0 to 1]

Penn, Privman et al. MBE. 2010

Page 15: Multiple sequence alignment and their reliability

Comparing alignments

Common measures to quantify distance between two MSAs:
1. CS (column score): Each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given the score 0.
2. SP (sum-of-pairs score): Each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given the score 0.
3. Sum-of-pairs column score (SPC): The score of each column is simply the average of the SPs over all pairs in it.

Page 16: Multiple sequence alignment and their reliability

Accuracy of GUIDANCE scores

Page 17: Multiple sequence alignment and their reliability

http://guidance.tau.ac.il

As a rule of thumb, use HoT for fewer than 8 sequences

Page 18: Multiple sequence alignment and their reliability

http://guidance.tau.ac.il

[Screenshot of the submission form, annotated:]
• Un-aligned sequences (FASTA format)
• Choose sequence type
• Choose alignment method

Page 19: Multiple sequence alignment and their reliability

GUIDANCE results

MSA colored by confidence score

Page 20: Multiple sequence alignment and their reliability

GUIDANCE results

[Screenshot labels: Confident; Uncertain; Sequence score; Column score]

Page 21: Multiple sequence alignment and their reliability

GUIDANCE outputs

• Download MSA for down-stream analysis
• Text files with all scores
• Mask residue by score
• Remove unreliable sequences

Page 22: Multiple sequence alignment and their reliability

GUIDANCE results

[Screenshot labels: Confident; Uncertain; Sequence score; Column score]

Page 23: Multiple sequence alignment and their reliability

GUIDANCE outputs

• Remove unreliable sequences
• Re-align sequences after filtration
• Sequences left after filtration

Page 24: Multiple sequence alignment and their reliability

Filtering sequences with low scores and re-align

But always remember not to remove too much data and consider the biology…

Page 25: Multiple sequence alignment and their reliability

GUIDANCE outputs

• Remove unreliable columns
• MSA after filtration

Page 26: Multiple sequence alignment and their reliability

Filtering columns with low scores
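Column filtering amounts to dropping every column whose confidence score falls below a chosen cutoff. A minimal sketch, assuming the column scores have already been read from the server's score files into a list with one value per column (the function name `filter_columns`, the input layout and the cutoff value are illustrative assumptions, not the server's own code):

```python
def filter_columns(msa, col_scores, cutoff):
    """Keep only the columns whose score is at least the cutoff.

    msa: list of equal-length gapped strings.
    col_scores: one score per alignment column (assumed already parsed).
    """
    keep = [i for i, s in enumerate(col_scores) if s >= cutoff]
    return ["".join(seq[i] for i in keep) for seq in msa]


if __name__ == "__main__":
    toy_msa = ["ACGT", "A-GT"]
    toy_scores = [1.0, 0.4, 0.95, 0.93]   # made-up scores for illustration
    print(filter_columns(toy_msa, toy_scores, cutoff=0.93))
```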

Page 27: Multiple sequence alignment and their reliability

GUIDANCE outputs

• Masking unreliably aligned residues

Page 28: Multiple sequence alignment and their reliability

Filtering residues with low scores
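Masking keeps the alignment dimensions intact and replaces only the individual residues whose score is below a cutoff. A minimal sketch, assuming per-residue scores are available as one list per sequence with `None` at gap positions, and assuming "X" as the mask character for proteins (the function name `mask_residues`, the input layout and the mask character are illustrative assumptions):

```python
def mask_residues(msa, res_scores, cutoff, mask_char="X"):
    """Replace residues scoring below the cutoff with mask_char; gaps are untouched."""
    masked = []
    for seq, scores in zip(msa, res_scores):
        masked.append("".join(
            mask_char if ch != "-" and sc is not None and sc < cutoff else ch
            for ch, sc in zip(seq, scores)
        ))
    return masked


if __name__ == "__main__":
    toy_msa = ["ACD-", "AC-E"]
    toy_scores = [[0.9, 0.2, 0.8, None], [0.9, 0.7, None, 0.1]]  # made-up scores
    print(mask_residues(toy_msa, toy_scores, cutoff=0.5))
```

Unlike removing columns or sequences, masking discards only the doubtful residues, so less data is lost.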

Page 29: Multiple sequence alignment and their reliability

Filtering unreliable regions can improve down-stream analysis

(Mol Biol Evol 2012;29:1-5)

Page 30: Multiple sequence alignment and their reliability

Acknowledgments

• Prof. Tal Pupko
• Dr. Eyal Privman
• Dr. Osnat Penn
• Pupko’s lab members

1. Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research 38 (Web Server issue): W23-W28. doi: 10.1093/nar/gkq443

2. Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution 27(8): 1759-67. doi: 10.1093/molbev/msq066

3. Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13: 15-24.

Page 31: Multiple sequence alignment and their reliability

Thanks for your attention!

Page 32: Multiple sequence alignment and their reliability

1. Download and save the sequences file "Seq_For_GUIDANCE.fs" from http://guidance.tau.ac.il/workshop_2013/ (File, then "Save as"). This file contains 20 protein sequences in FASTA format.

2. Run the GUIDANCE web server to create a protein alignment:
   a. Use the GUIDANCE algorithm
   b. Select "amino acids" as the sequence type
   c. Select MAFFT as the alignment method
   d. Run (press the "Submit" button)
   e. In case it does not run for you, you can see the results at: http://guidance.tau.ac.il/results/13589321556364/output.php

3. What is the alignment score? What does it mean about the alignment achieved?

4. Which sequences can be removed to improve the alignment? What is the biological justification for that? Try it!

Page 33: Multiple sequence alignment and their reliability

Appendix: MSA servers

Page 34: Multiple sequence alignment and their reliability

MAFFT

• Web server & download: http://mafft.cbrc.jp/alignment/server/

Page 35: Multiple sequence alignment and their reliability

Choosing a MAFFT strategy

• Efficiency-tuned variants: quick & dirty or slow but accurate

[Figure: MAFFT strategies arranged from "quick & dirty" to "slow but accurate"]

Page 36: Multiple sequence alignment and their reliability

Choosing a MAFFT strategy

[Figure: MAFFT strategies arranged from "quick & dirty" to "slow but accurate"]

Page 37: Multiple sequence alignment and their reliability

Choosing a MAFFT strategy

[Figure: MAFFT strategies arranged from "quick & dirty" to "slow but accurate"]

Page 38: Multiple sequence alignment and their reliability

Choosing a MAFFT strategy

L-INS-i

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------

--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------

------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------

--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo

--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

G-INS-i

XXXXXXXXXXX-XXXXXXXXXXXXXXX

XX-XXXXXXXXXXXXXXX-XXXXXXXX

XXXXX----XXXXXXXX---XXXXXXX

XXXXX-XXXXXXXXXX----XXXXXXX

XXXXXXXXXXXXXXXX----XXXXXXX

E-INS-i

oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo

---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------

-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo

---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------

---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------

[Figure axis: from "quick & dirty" to "slow but accurate"]
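For those running MAFFT locally rather than through the web server, the three strategies above correspond to command-line options documented in the MAFFT manual. The sketch below only assembles and runs the command; it assumes a `mafft` executable is installed and on the PATH, and the helper names `mafft_command` and `run_mafft` are hypothetical.

```python
import shutil
import subprocess

# Flags as given in the MAFFT manual for each accuracy-oriented strategy.
STRATEGY_FLAGS = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
    "auto": ["--auto"],  # let MAFFT choose a strategy from the data size
}


def mafft_command(fasta_path, strategy="auto"):
    """Build the argument list for a MAFFT run (nothing is executed here)."""
    return ["mafft", *STRATEGY_FLAGS[strategy], fasta_path]


def run_mafft(fasta_path, strategy="auto"):
    """Run MAFFT and return the alignment, which it writes to standard output."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft executable not found on PATH")
    result = subprocess.run(
        mafft_command(fasta_path, strategy),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(mafft_command("Seq_For_GUIDANCE.fs", "L-INS-i")))
```

As the diagrams suggest, L-INS-i suits sequences with one alignable domain and flanking regions, G-INS-i suits sequences alignable over their whole length, and E-INS-i suits several alignable blocks separated by long unalignable stretches.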

Page 39: Multiple sequence alignment and their reliability

MAFFT output

• A colored view of the alignment
• Choose a format (Clustal, Fasta) and save as a text file
• Run GUIDANCE also from here!!

Page 40: Multiple sequence alignment and their reliability

PRANK

Page 41: Multiple sequence alignment and their reliability

Classical alignment errors for HIV env

Page 42: Multiple sequence alignment and their reliability

PRANK

• Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

Page 43: Multiple sequence alignment and their reliability

PRANK output

If you need a different format, copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/