
Medical Natural Sciences Year 2: Introduction to Bioinformatics

Lecture 9:

Multiple sequence alignment (III)

Centre for Integrative Bioinformatics VU

Intermezzo: Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)

Victor Simossis Jaap Heringa

Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands

Symmetry-derived secondary structure prediction using multiple sequence alignments (SymSSP)

• Modern state-of-the-art methods use multiple sequence alignments

• Methods like PHD, Profs, SSpro, etc., predict for the top sequence in the alignment by cutting out positions with gaps in the top sequence

• What if two helices ‘out of phase’ are pasted together? Or a strand and a helix?

• Approach: correct by permuting alignments and consensus prediction
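The gap-removal step in the bullets above can be sketched as follows. This is a minimal illustration, not the code used by PHD or SymSSP: the function name `project_onto_top` and the toy alignment are assumptions made for the example.

```python
def project_onto_top(alignment, top_index=0):
    """Move one sequence to the top and drop every column where it has a gap.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    """
    top = alignment[top_index]
    # Columns that survive: those where the top sequence has a residue.
    keep = [col for col, residue in enumerate(top) if residue != "-"]
    # Permute the alignment so the chosen sequence comes first.
    reordered = [top] + [s for i, s in enumerate(alignment) if i != top_index]
    return ["".join(seq[col] for col in keep) for seq in reordered]


# Hypothetical toy alignment: each sequence takes a turn on top.
toy = ["AC-GT",
       "A-TGT",
       "ACTG-"]
for k in range(len(toy)):
    print(k, project_onto_top(toy, top_index=k))
```

Dropping columns joins residues of the other sequences that were not adjacent before, which is how two segments can end up pasted together.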

Secondary structure periodicity patterns

Buried β-strand

Edge β-strand

α-helix

hydrophobic hydrophilic

Symmetry-derived Secondary structure prediction using MA (SymSSP)

[Figure: the alignment of sequences 1-4 is permuted so that each sequence in turn is placed on top (orders 1234, 2134, 3124, 4123). For each ordering the predicted segments are shown as runs of E, H and ? symbols.]

Optimal segmentation of predicted secondary structures

Predictions: 1->1, 1->2, 1->3, 1->4

Per-position scores (states C, E, H):

H score   0 0 0 0 0 ...
E score   3 4 4 4 3 ...
C score   1 0 0 0 0 ...
? score   0 0 0 0 1 ...
Region    0 1 1 1 0 ...

Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment.

The predictions are recorded by secondary structure type and region position in a single matrix
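A minimal sketch of how such a matrix could be tallied is given below. The function name `tally_library` and the four-prediction toy library are assumptions, chosen so that the counts reproduce the first five H, E, C and ? values shown on the slide. The "Region" row is not modelled here.

```python
STATES = ("H", "E", "C", "?")


def tally_library(predictions):
    """Count, per position, how many predictions assign each state."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    scores = {state: [0] * length for state in STATES}
    for pred in predictions:
        for pos, state in enumerate(pred.upper()):
            if state in scores:  # ignore any symbol outside STATES
                scores[state][pos] += 1
    return scores


# Hypothetical library of n = 4 predictions for one sequence.
library = ["EEEEE", "EEEE?", "CEEEE", "EEEEE"]
scores = tally_library(library)
for state in STATES:
    print(state, scores[state])
```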

Optimal segmentation of predicted secondary structures by Dynamic Programming

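[Figure: dynamic-programming matrix indexed by sequence position and window size. Each cell records Max score, Offset and Label, based on the H score, E score and C score.]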

The recorded values are used in a weighted function according to their secondary structure type, that gives each position a window-specific score. The more probable the secondary structure element, the higher the score.

Restrictions (ws = window size):

H only if ws >= 4

E only if ws >= 2

5H

2 6

Segmentation score (Total score of each path)

? score, Region
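The segmentation step can be sketched as a small dynamic program. The slides do not spell out the weighted scoring function, so this sketch assumes that a segment's score is simply the sum of the per-position scores for its state. Only the window restrictions (H needs ws >= 4, E needs ws >= 2) come from the slide. The name `segment` and the toy scores are hypothetical.

```python
MIN_WINDOW = {"H": 4, "E": 2, "C": 1}  # C may have any length (assumption)


def segment(scores):
    """Return (labels, total) for the best-scoring segmentation.

    `scores` maps each state to a list of per-position scores.
    """
    n = len(scores["C"])
    # Prefix sums give the score of any window in O(1).
    prefix = {}
    for state in MIN_WINDOW:
        running = [0]
        for value in scores[state]:
            running.append(running[-1] + value)
        prefix[state] = running

    best = [float("-inf")] * (n + 1)   # best[i]: best score for positions 0..i-1
    best[0] = 0
    back = [None] * (n + 1)            # back[i]: (segment start, state)
    for end in range(1, n + 1):
        for state, min_ws in MIN_WINDOW.items():
            for ws in range(min_ws, end + 1):
                start = end - ws
                cand = best[start] + prefix[state][end] - prefix[state][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, state)

    # Trace the optimal path back from the last position.
    pieces = []
    end = n
    while end > 0:
        start, state = back[end]
        pieces.append(state * (end - start))
        end = start
    return "".join(reversed(pieces)), best[n]


toy_scores = {
    "H": [0, 0, 0, 3, 4, 4, 4, 0],
    "E": [3, 4, 4, 0, 0, 0, 0, 0],
    "C": [1, 0, 0, 1, 0, 0, 0, 4],
}
print(segment(toy_scores))
```

Because of the minimum window sizes, a strong helix signal spanning only three positions cannot be labelled H. The path falls back on other states.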

Example of an optimally segmented secondary structure prediction library for sequence 3chy

3chy ---------------GYVV-----KPFTAATLEEKLNKIFEKLGM------

3chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ????????

3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ????????

3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ????????

3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ????????

3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ????????

3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ?????????

3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ?????????

3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ??????

3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ??????

3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh

3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ????

3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ??????

3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ???????????

3chy <- 3chy --------------- ----- hhhhhhhhhhhhhh ------

Consensus ---------------EEEE----- HHHHHHHHHHHHH ------

Consensus-DSSP ...............****.....****xx***************......

PHD --------------- ----- HHHHHHHHHHHHHH ------

PHD-DSSP ...............xxxx.....******************x**......

DSSP ...............EEEE.....SS HHHHHHHHHHHHHHHT ......

LumpDSSP ...............EEEE..... HHHHHHHHHHHHHHH ......

Symmetry-derived secondary structure prediction (SymSSP)

• Tried over 120 different consensus weighting schemes (global, regional, positional)

• Tested on ~2700 HOMSTRAD alignments and compared to PHD: on average 0.5% better

• 60% of the alignments are improved, 20% are not affected and 20% are made worse

• Tried to correlate schemes with “cheap” a priori data (pairwise identities, sequence lengths, number of sequences, etc.)

Integrating secondary structure prediction and multiple sequence alignment

• Low-key example shown, using fairly homogeneous data (strings of letters in both cases)

• But already difficult to do and methods are not easily tunable

• How to scale up to knowledge-integrating and inference engines?

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Globalised local alignment

• Aim: fill each cell of the DP search matrix with the score of the highest-scoring local alignment going through that cell

• Problem: Forward calculation + traceback for each local alignment is too slow

• Solution: Double dynamic programming

1. Local DP in forward and reverse direction (no traceback) + matrix summation

2. Global DP over matrix from step 1 + traceback

Globalised local alignment

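[Figure: forward local-DP matrix + reverse local-DP matrix = summed matrix]

1. Local (SW) alignment (M + Po,e)

2. Global (NW) alignment (no M or Po,e)

Double dynamic programming

M = BLOSUM62, Po = 0, Pe = 0

(M: amino acid exchange matrix; Po, Pe: gap-open and gap-extension penalties)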
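The two steps can be sketched as below, under several assumptions: a toy match/mismatch score stands in for BLOSUM62, and gaps in the local step use a single linear penalty instead of separate open and extension values. The pair score is also subtracted once after summation so that the aligned pair is not counted twice. All function names are hypothetical.

```python
def simple_score(x, y):
    """Toy exchange score (stand-in for a real matrix such as BLOSUM62)."""
    return 2 if x == y else -1


def pair_end_scores(a, b, score, gap):
    """P[i][j]: best local alignment score ending with a[i] aligned to b[j]."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # Smith-Waterman matrix
    P = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            P[i - 1][j - 1] = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            H[i][j] = max(0, P[i - 1][j - 1], H[i - 1][j] - gap, H[i][j - 1] - gap)
    return P


def through_scores(a, b, score=simple_score, gap=2):
    """Step 1: forward + reverse local DP, summed per cell (no traceback)."""
    n, m = len(a), len(b)
    fwd = pair_end_scores(a, b, score, gap)
    rev = pair_end_scores(a[::-1], b[::-1], score, gap)
    return [[fwd[i][j] + rev[n - 1 - i][m - 1 - j] - score(a[i], b[j])
             for j in range(m)] for i in range(n)]


def global_align(a, b, T):
    """Step 2: global DP over T with no exchange matrix or gap penalties."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + T[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # traceback, preferring the diagonal on ties
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + T[i - 1][j - 1]:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j]:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


T = through_scores("ACGT", "ACGT")
print(global_align("ACGT", "ACGT", T))
```

Only one forward and one reverse pass are needed, so the cost stays quadratic in sequence length rather than requiring a separate local alignment per cell.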

Strategies for multiple sequence alignment

• Profile pre-processing

• Secondary structure-induced alignment

• Globalised local alignment

• Matrix extension

Objective: try to avoid (early) errors

Integrating alignment methods and alignment information with T-Coffee

• Integrating different pair-wise alignment techniques (NW, SW, ..)

• Combining different multiple alignment methods (consensus multiple alignment)

• Combining sequence alignment methods with structural alignment techniques

• Plug in user knowledge

Matrix extension

T-Coffee

Tree-based Consistency Objective Function For alignmEnt Evaluation

Cedric Notredame, Des Higgins, Jaap Heringa

J. Mol. Biol., 302, 205-217; 2000

Using different sources of alignment information

[Figure: alignment information from Clustal, Dialign, Lalign, structure alignments and manual alignments feeds into T-Coffee.]

Progressive multiple alignment

[Figure: pairwise scores (Score 1-2, Score 1-3, ..., Score 4-5) for sequences 1-5 fill a 5×5 similarity matrix, which leads to the guide tree and then the multiple alignment.]

Default T-COFFEE

• Uses information from all sequences for each pair-wise alignment

• Reconciles global and local alignment information

T-Coffee matrix extension

12

13

14

23

24

34

Search matrix extension

T-Coffee

• Combine different alignment techniques by adding scores:

$W(A(x), B(y)) = \sum S(A(x), B(y))$

– A(x) is residue x in sequence A

– summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))

– S is sequence identity percentage of the associated alignment

• Combine direct alignment seqA-seqB with each seqA-seqI-seqB:

$W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)),\ W(I(z), B(y))\big)$

– Summation over all third sequences I other than A or B
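The two weighting rules above can be sketched in a few lines. This is an illustration under assumptions, not T-Coffee's actual implementation: residues are represented as `(sequence name, index)` tuples, the function names are hypothetical, and the identity values in the toy data are made up.

```python
from collections import defaultdict
from itertools import combinations


def primary_library(pairwise_alignments):
    """W: sum of identity percentages of all alignments containing a pair.

    `pairwise_alignments` is a list of (identity_percent, aligned_pairs),
    where each aligned pair is ((seqA, x), (seqB, y)).
    """
    W = defaultdict(float)
    for identity, pairs in pairwise_alignments:
        for p, q in pairs:
            W[frozenset((p, q))] += identity
    return W


def extend_library(W):
    """W': add min(W(A,I), W(I,B)) for every intermediate residue I(z)."""
    neighbours = defaultdict(dict)
    for key, w in W.items():
        p, q = tuple(key)
        neighbours[p][q] = w
        neighbours[q][p] = w
    extended = defaultdict(float, W)
    for linked in neighbours.values():
        # Each residue acts once as the intermediate for its neighbour pairs.
        for (p, w_p), (q, w_q) in combinations(linked.items(), 2):
            if p[0] != q[0]:  # the pair must come from two different sequences
                extended[frozenset((p, q))] += min(w_p, w_q)
    return extended


# Hypothetical toy data: one aligned residue pair per pairwise alignment.
a0, b0, c0 = ("A", 0), ("B", 0), ("C", 0)
W = primary_library([(88, [(a0, b0)]), (77, [(a0, c0)]), (100, [(c0, b0)])])
W_ext = extend_library(W)
print(W_ext[frozenset((a0, b0))])
```

Taking the minimum means a path through a third sequence is only as trustworthy as its weaker link.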

T-Coffee

[Figure: a residue pair is weighted using its direct alignment and the alignments through the other sequences.]

T-Coffee library system

Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35

T-Coffee progressive alignment

MDAGSTVILCFVG
MDAASTILCGS

Amino acid exchange matrix

Gap penalties (open, extension)

Search matrix

MDAGSTVILCFVG-
MDAAST-ILC--GS

Kinase nucleotide binding sites

Comparing T-Coffee with other methods

but.....

T-COFFEE (V1.23) multiple sequence alignment

Flavodoxin-cheY

1fx1 ----PKALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----

FLAV_DESVH ---MPKALIVYGSTTGNTEYTAETIARELADAG-YEVDSRDAASVE-AGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPL-FDSLEETGAQGRK-----

FLAV_DESGI ---MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-METTVVNVADVT-APGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPL-YEDLDRAGLKDKK-----

FLAV_DESSA ---MSKSLIVYGSTTGNTETAAEYVAEAFENKE-IDVELKNVTDVS-VADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPL-YDSLENADLKGKK-----

FLAV_DESDE ---MSKVLIVFGSSTGNTESIAQKLEELIAAGG-HEVTLLNAADAS-AENLADGYDAVLFGCSAWGMEDLE------MQDDFLSL-FEEFNRFGLAGRK-----

4fxn ------MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVN-IDELL-NEDILILGCSAMGDEVLE-------ESEFEPF-IEEIS-TKISGKK-----

FLAV_MEGEL -----MVEIVYWSGTGNTEAMANEIEAAVKAAG-ADVESVRFEDTN-VDDVA-SKDVILLGCPAMGSEELE-------DSVVEPF-FTDLA-PKLKGKK-----

FLAV_CLOAB ----MKISILYSSKTGKTERVAKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQ-ESEGIIFGTPTYYAN---------ISWEMKKW-IDESSEFNLEGKL-----

2fcr -----KIGIFFSTSTGNTTEVADFIGKTLGAKA---DAPIDVDDVTDPQAL-KDYDLLFLGAPTWNTGA----DTERSGTSWDEFLYDKLPEVDMKDLP-----

FLAV_ENTAG ---MATIGIFFGSDTGQTRKVAKLIHQKLDGIA---DAPLDVRRAT-REQF-LSYPVLLLGTPTLGDGELPGVEAGSQYDSWQEF-TNTLSEADLTGKT-----

FLAV_ANASP ---SKKIGLFYGTQTGKTESVAEIIRDEFGNDV---VTLHDVSQAE-VTDL-NDYQYLIIGCPTWNIGEL--------QSDWEGL-YSELDDVDFNGKL-----

FLAV_AZOVI ----AKIGLFFGSNTGKTRKVAKSIKKRFDDET-M-SDALNVNRVS-AEDF-AQYQFLILGTPTLGEGELPGLSSDCENESWEEF-LPKIEGLDFSGKT-----

FLAV_ECOLI ----AITGIFFGSDTGNTENIAKMIQKQLGKDV---ADVHDIAKSS-KEDL-EAYDILLLGIPTWYYGEA--------QCDWDDF-FPTLEEIDFNGKL-----

3chy ADKELKFLVVD--DFSTMRRIVRNLLKELGFN-NVE-EAEDGVDALNKLQ-AGGYGFVISDWNMPNMDGLE--------------LLKTIRADGAMSALPVLMV

:. . . : . ::

1fx1 ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------

FLAV_DESVH ---------VACFGCGDSS--YEYFCGA-VDAIEEKLKNLGAEIVQDG---------------------LRIDGDPRAA--RDDIVGWAHDVRGAI--------

FLAV_DESGI ---------VGVFGCGDSS--YTYFCGA-VDVIEKKAEELGATLVASS---------------------LKIDGEPDSA----EVLDWAREVLARV--------

FLAV_DESSA ---------VSVFGCGDSD--YTYFCGA-VDAIEEKLEKMGAVVIGDS---------------------LKIDGDPE----RDEIVSWGSGIADKI--------

FLAV_DESDE ---------VAAFASGDQE--YEHFCGA-VPAIEERAKELGATIIAEG---------------------LKMEGDASND--PEAVASFAEDVLKQL--------

4fxn ---------VALFGS------YGWGDGKWMRDFEERMNGYGCVVVETP---------------------LIVQNEPD--EAEQDCIEFGKKIANI---------

FLAV_MEGEL ---------VGLFGS------YGWGSGEWMDAWKQRTEDTGATVIGTA---------------------IV--NEMP--DNAPECKELGEAAAKA---------

FLAV_CLOAB ---------GAAFSTANSI--AGGSDIA-LLTILNHLMVKGMLVY----SGGVAFGKPKTHLGYVHINEIQENEDENARIFGERIANKVKQIF-----------

2fcr ---------VAIFGLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRDG-KFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------

FLAV_ENTAG ---------VALFGLGDQLNYSKNFVSA-MRILYDLVIARGACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSWLEKLKPAVL-------

FLAV_ANASP ---------VAYFGTGDQIGYADNFQDA-IGILEEKISQRGGKTVGYWSTDGYDFNDSKALRNG-KFVGLALDEDNQSDLTDDRIKSWVAQLKSEFGL------

FLAV_AZOVI ---------VALFGLGDQVGYPENYLDA-LGELYSFFKDRGAKIVGSWSTDGYEFESSEAVVDG-KFVGLALDLDNQSGKTDERVAAWLAQIAPEFGLSL----

FLAV_ECOLI ---------VALFGCGDQEDYAEYFCDA-LGTIRDIIEPRGATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKWVKQISEELHLDEILNA

3chy TAEAKKENIIAAAQAGASGYVVKPFT---AATLEEKLNKIFEKLGM----------------------------------------------------------

.

Evaluating multiple alignments

• Conflicting standards of truth

– evolution

– structure

– function

• With orphan sequences, no additional information is available

• Benchmarks depend on reference alignments

• Quality issue of available reference alignment databases

• Different ways to quantify agreement with reference alignment (sum-of-pairs, column score)

• “Charlie Chaplin” problem

Evaluating multiple alignments

• As a standard of truth, often a reference alignment based on structural superpositioning is taken

Evaluation measures (query alignment vs. reference alignment)

Column score

Sum-of-Pairs score
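The slides only name the two measures. The sketch below assumes the definitions commonly used with reference-alignment benchmarks: the sum-of-pairs agreement is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly. Function names and the toy alignments are hypothetical.

```python
def columns(alignment):
    """Turn gapped strings into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, seq in enumerate(alignment):
            if seq[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(alignment):
    """Set of residue pairs ((seq, index), (seq, index)) sharing a column."""
    pairs = set()
    for col in columns(alignment):
        present = [(s, i) for s, i in enumerate(col) if i is not None]
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def sp_agreement(query, reference):
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref)


def column_score(query, reference):
    query_cols = set(columns(query))
    ref_cols = columns(reference)
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


# Same sequences in the same order, aligned in two different ways.
reference = ["ACGT", "A-GT"]
query = ["ACGT", "AG-T"]
print(sp_agreement(query, reference), column_score(query, reference))
```

The column score is the stricter of the two, since one misplaced residue invalidates the whole column.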

Scoring a multiple alignment

Query

Sum-of-Pairs score:

• For each alignment position: take the sum over all pairs of sequences (add amino acid exchange values)

• As an option, subtract gap penalties
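A minimal sketch of this scoring is given below. The toy exchange function is a stand-in for a real amino acid exchange matrix. Charging the optional gap penalty once per residue-gap pair is an assumption, since the slide does not say how gaps are counted.

```python
from itertools import combinations


def toy_exchange(x, y):
    """Stand-in for an exchange matrix lookup (assumption)."""
    return 2 if x == y else -1


def sum_of_pairs(alignment, exchange=toy_exchange, gap_penalty=0):
    """Sum exchange values over all sequence pairs in every column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                  # gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total -= gap_penalty      # optional penalty for residue-gap pairs
            else:
                total += exchange(x, y)
    return total


msa = ["ACG",
       "A-G",
       "ATG"]
print(sum_of_pairs(msa), sum_of_pairs(msa, gap_penalty=1))
```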

Evaluating multiple alignments

[Figure: SP score plotted against BAliBASE alignment size (nseq * len)]

Summary

• Weighting schemes simulating simultaneous multiple alignment

– Profile pre-processing (global/local)

– Matrix extension (well balanced scheme)

• Smoothing alignment signals

– globalised local alignment

• Using additional information

– secondary structure driven alignment

• Schemes strike balance between speed and sensitivity

References

• Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.

• Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.

• Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.

Where to find this: http://www.ibivu.cs.vu.nl/teaching
