The Bioinformatics Toolkit at the MPI for Developmental Biology Workshop Systems Biology Berlin, March 3, 2006 Johannes Söding Department for Protein Evolution (Andrei Lupas) Max-Planck-Institute for Developmental Biology

Page 1

The Bioinformatics Toolkit at the MPI for Developmental Biology

Workshop Systems Biology

Berlin, March 3, 2006

Johannes Söding

Department for Protein Evolution (Andrei Lupas)

Max-Planck-Institute for Developmental Biology

Page 2

Our toolkit assists the department’s research in protein evolution …

… and makes methods developed in our group accessible to a larger public

Sequence similarity searches

Multiple sequence alignment

Sequence analysis (repeats, periodicities, subtyping)

Secondary structure and transmembrane prediction

Tertiary structure prediction and structure analysis

Phylogeny and classification

Utilities (reformatting, sequence retrieval, filtering)


Page 3

Overview page for Sequence Search

toolkit

Page 4

The toolkit's PSI-BLAST offers functionality beyond the NCBI version

• Select subsets out of >300 genomes

• Upload personal databases

• Change databases between search rounds

• Show colored multiple alignment (JalView)

• Submit results to other tools

Page 5

Quick2D integrates results of various secondary structure prediction programs

Contributed by Christian Mayer, MPI-DevBio

Page 6

REPPER detects periodic regions in proteins

Gruber M, Söding J, and Lupas AN. (2005) NAR 33, W239-243.

Page 7

Several tools rely on a sensitive new method for remote homology detection

HHrep: De novo repeat detection

HHpred: Structure and function prediction by detecting remote homologs in databases such as the PDB, SCOP, Pfam, SMART, InterPro, CDD at NCBI, …

HHsenser: Sequence search method that employs exhaustive intermediate profile search

Underlying method: Pairwise comparison of profile hidden Markov models (HMMs)

What is a sequence profile?

What is a profile HMM?

Page 8

Sequence profiles are a condensed representation of multiple alignments

Multiple alignment:

HBA_human   ... W G K V G A - - H A G E ...
HBB_human   ... W G K V - - - - N V D E ...
MYG_phyca   ... W G K V E A - - D V A G ...
LGB2_luplu  ... W K D F N A - - N I P K ...
GLB1_glydi  ... W E E I A G A D N G A G ...

Profile (seven columns shown; each column is headed by its HBA_human residue):

        V     G     A     H     A     G     E
A  ...  0     0.25  0.75  0     0.2   0.4   0    ...
D  ...  0     0     0     0.2   0     0.2   0    ...
E  ...  0     0.25  0     0     0     0     0.4  ...
F  ...  0.2   0     0     0     0     0     0    ...
G  ...  0     0.25  0.25  0     0.2   0.2   0.4  ...
H  ...  0     0     0     0.2   0     0     0    ...
I  ...  0.2   0     0     0     0.2   0     0    ...
K  ...  0     0     0     0     0     0     0.2  ...
N  ...  0     0.25  0     0.6   0     0     0    ...
P  ...  0     0     0     0     0     0.2   0    ...
V  ...  0.6   0     0     0     0.4   0     0    ...
C, L, M, Q, R, S, T, W, Y: 0 in all seven columns

Each column of the profile p_j(a) contains the amino acid frequencies in the multiple sequence alignment.

Figure label: master sequence
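The profile values above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that gaps are left out of the denominator (which is what the 0.25 and 0.75 entries imply) and that no sequence weighting or pseudocounts are applied. The names `ALIGNMENT` and `column_profile` are illustrative and not part of the toolkit.

```python
from collections import Counter

# Alignment rows as shown on the slide ('-' marks a gap).
ALIGNMENT = {
    "HBA_human":  "WGKVGA--HAGE",
    "HBB_human":  "WGKV----NVDE",
    "MYG_phyca":  "WGKVEA--DVAG",
    "LGB2_luplu": "WKDFNA--NIPK",
    "GLB1_glydi": "WEEIAGADNGAG",
}

def column_profile(rows, j):
    """Return {amino acid: frequency} for column j, ignoring gaps."""
    residues = [row[j] for row in rows if row[j] != "-"]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}

rows = list(ALIGNMENT.values())
profile = [column_profile(rows, j) for j in range(len(rows[0]))]

# Column 3 is the V/V/V/F/I column, column 5 the A/-/A/A/G column.
print(profile[3])
print(profile[5])
```

Unlike the slide, the sketch also computes the two columns that are populated only in GLB1_glydi; the slide skips them.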

Page 9

HMMs include position-specific gap penalties

Multiple alignment (lowercase letters are insertions, "-" are deletions):

HBA_human   ... V G A . . H A G E Y ...
HBB_human   ... V - - . . N V D E V ...
MYG_phyca   ... V E A . . D V A G H ...
LGB2_luplu  ... F N A . . N I P K H ...
GLB1_glydi  ... I A G a d N G A G V ...

Column states: M/D M/D M/D I I M/D M/D M/D M/D M/D (M/D = Match or Delete, I = Insertions)

Amino acid frequencies per match column (each column is headed by its HBA_human residue):

        V     G     A     H     A     G     E     Y
A  ...  0     0.25  0.75  0     0.2   0.4   0     0    ...
D  ...  0     0     0     0.2   0     0.2   0     0    ...
E  ...  0     0.25  0     0     0     0     0.4   0    ...
F  ...  0.2   0     0     0     0     0     0     0    ...
G  ...  0     0.25  0.25  0     0.2   0.2   0.4   0    ...
H  ...  0     0     0     0.2   0     0     0     0.4  ...
I  ...  0.2   0     0     0     0.2   0     0     0    ...
K  ...  0     0     0     0     0     0     0.2   0    ...
N  ...  0     0.25  0     0.6   0     0     0     0    ...
P  ...  0     0     0     0     0     0.2   0     0    ...
Y  ...  0     0     0     0     0     0     0     0.2  ...
C, L, M, W: 0 in all eight columns
...

Probabilities for Insert Open (MI), Insert Extend (II), Delete Open (MD), Delete Extend (DD):

        V     G     A     H     A     G     E     Y
MD ...  0.2   0     0     0     0     0     0     0    ...
DD ...  0     1.0   0     0     0     0     0     0    ...
II ...  0     0     0.5   0     0     0     0     0    ...
MI ...  0     0     0.25  0     0     0     0     0    ...
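The position-specific gap probabilities can be derived from the same alignment by counting state transitions. The following is a minimal sketch, assuming plain maximum-likelihood counts without pseudocounts or sequence weighting (production tools typically add both). The function names are illustrative and not taken from any HH tool.

```python
from collections import defaultdict

# Rows as on the slide: uppercase = match, '-' = delete,
# lowercase = insert, '.' = padding where other rows have an insert.
ROWS = ["VGA..HAGEY", "V--..NVDEV", "VEA..DVAGH", "FNA..NIPKH", "IAGadNGAGV"]

def state_path(row):
    """Translate one aligned row into (match_column, state) pairs."""
    path, col = [], 0
    for ch in row:
        if ch == ".":
            continue                      # padding: no residue, no state
        if ch.islower():
            path.append((col, "I"))       # insert after match column `col`
        else:
            col += 1
            path.append((col, "D" if ch == "-" else "M"))
    return path

def transition_probabilities(rows):
    """Return {(column, 'XY'): probability of moving from state X to Y}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        path = state_path(row)
        for (col, src), (_, dst) in zip(path, path[1:]):
            counts[(col, src)][dst] += 1
    return {
        (col, src + dst): n / sum(dsts.values())
        for (col, src), dsts in counts.items()
        for dst, n in dsts.items()
    }

p = transition_probabilities(ROWS)
print(p[(1, "MD")], p[(2, "DD")], p[(3, "MI")], p[(3, "II")])
```

The four printed values correspond to the non-zero Delete Open, Delete Extend, Insert Open and Insert Extend entries on the slide.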

Page 10

Profile HMMs can be represented as states connected by transitions

HBA_human   ... V G A . . H A G E Y ...
HBB_human   ... V - - . . N V D E V ...
MYG_phyca   ... V E A . . D V A G H ...
LGB2_luplu  ... F N A . . N I P K H ...
GLB1_glydi  ... I A G a d N G - G V ...

Column states: M/D M/D M/D I I M/D M/D M/D M/D M/D

[Figure: HMM p drawn as a chain of match (M), delete (D) and insert (I) states connected by transitions. Matrix: each position i carries amino acid emission probabilities p_i(a) and transition probabilities p_i(XY), e.g. MD, DD, II, MI.]

Page 14

Find path through two HMMs that maximizes co-emission probability

[Figure: HMM q and HMM p, each drawn as a chain of match (M), delete (D) and insert (I) states. A path of paired states (state q, state p) through both HMMs co-emits the sequence x1 x2 x3 x4 x5 x6.]

Include null model → maximize “log-sum-of-odds score”

Söding, J. (2005) Bioinformatics 21, 951-960.
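For a single pair of aligned match columns, the log-sum-of-odds score compares the probability that both columns co-emit the same residue with the probability expected under the null model: S(q_i, p_j) = log Σ_a q_i(a) p_j(a) / f(a). Below is a minimal sketch of that column score only. The uniform background frequencies, the base-2 logarithm and the absence of pseudocounts are simplifying assumptions made here, and the transition terms that enter the full path score are left out.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for illustration: uniform null model f(a) = 1/20.
BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def log_sum_of_odds(q_col, p_col, background=BACKGROUND):
    """Column score log2( sum_a q(a) * p(a) / f(a) )."""
    total = sum(
        q_col.get(aa, 0.0) * p_col.get(aa, 0.0) / f
        for aa, f in background.items()
    )
    # Without pseudocounts, columns sharing no residue give a zero sum.
    return math.log2(total) if total > 0 else float("-inf")

hydrophobic = {"V": 0.6, "F": 0.2, "I": 0.2}   # first profile column shown earlier
polar = {"N": 0.6, "H": 0.2, "D": 0.2}         # the H/N/D/N/N column

same = log_sum_of_odds(hydrophobic, hydrophobic)
different = log_sum_of_odds(hydrophobic, polar)
print(same, different)
```

A positive score means the two columns agree more often than chance under the null model; the alignment algorithm then searches for the path through both HMMs that maximizes the summed score.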

Page 15

HHrep detects repeats by HMM-HMM comparison of the sequence with itself

The dotplot with suboptimal alignments reveals internal symmetries

[Dot plot; both axes labeled repeat 1, repeat 2, repeat 3, repeat 4]

Page 16

Outer membrane barrels might have evolved by duplication of a single hairpin

OmpA

… but is there an internal symmetry in the sequences?

Page 17

HHrep indeed finds a fourfold sequence symmetry in OMPs

[Dot plot of OmpA against itself; both axes in residues with ticks at 50, 100, 150. Blue: significant alignments]

Page 18

TIM barrels possess approximate structural symmetry …

… but up to now it has not been possible to detect this repeat pattern on the sequence level

Page 20

Did TIM barrels evolve by duplication of a quarter barrel peptide?

HisF

KDPG aldolase

Fourfold symmetry Eightfold symmetry

profile-profile dot plot

after consistency transformation

same, but lower score threshold

Page 21

HMM-HMM comparison improves upon profile-profile comparison

All-against-all benchmark on SCOP (20% seq. id.)

[Benchmark plot. Methods compared: seq-seq, profile-seq, HMM-seq, profile-profile, HMM-HMM, HMM-HMM+SS, HMM-HMM+corr, HMM-HMM+predSS. Marked line: 10% rate of false positives]

Page 22

The HHpred input page

1. Paste ScbA sequence

2. Select database

3. Submit job

All input parameters are linked to explanations on help pages

ScbA from Streptomyces is involved in regulating the onset of antibiotic production, but its function is unknown

Page 23

Search results: alignment view

Graphical representation of best database hits along query sequence

Summary hit list for best database matches

Match quality

Statistical significance

Alignments with database sequences (templates)

Query sequence (ScbA)

Template sequence (from database)

Predicted secondary structure (query)

Predicted secondary structure (template)

Actual secondary structure (template)

View alignments as histograms

View template alignment

View template structure

Create 3D model

Interesting region of high similarity

Six best hits belong to a superfamily of enzymes from the fatty acid synthesis pathway!

Page 24

Histogram view

Highly conserved residues E and Q are catalytic residues in FabZ / FabA!

FabA

FabZ

Highly conserved arginine: catalytic?

Page 25

Homology between histones and C-terminal subdomain in AAA+ ATPases

RuvB (AAA+)

kink

TAFII62

TAFII42

Work in progress, V. Alva Kullanja and M. Ammelburg et al.

Page 26

The prediction of transmembrane barrel proteins is a challenging problem

• TM β-barrel proteins occur in outer membranes of bacteria, mitochondria and plastids

• TM β-barrel proteins are normally amphiphilic → more difficult to identify than α-helical TMPs

• Only a handful of known structures exist

• No structure of OmpW has yet been released → use OmpW as test case

OmpA MspA porin

Page 27

Most dedicated TM β-barrel predictors fail to predict Erwinia carotovora OmpW correctly

Server                              Result
TBBpred (Chandigarh, India)         “Protein is likely to be globular”
TMBETA-NET (AIST, Tokyo)            Confidence? Nine strands predicted with unrealistic positions
PROFtmb (Columbia University)       Low confidence (Z-score 5.8 ≈ 35% accuracy); six strands predicted with realistic positions
Pred-TMBB (University of Athens)    Score below threshold; nine strands predicted, 4 probably misplaced, 5 correct

Page 28

HHpred model of Erwinia carotovora OmpW (default parameters, no refinements)

Correct topology predicted, with 8 strands at realistic positions

High confidence for OMP prediction (Probability = 100%)

Only needs refinement for precise placement of loop inserts

Page 29

HHsenser is a novel method to search for remote homologs in sequence databases

• Recursive search strategy employing PSI-BLAST to build new alignments that may be homologous to the query

• HMM-HMM comparison for validation of homology between newly built alignment and alignment of validated sequences

• Very sensitive!

[Figure: sequences (dots) scattered around the query, with search radii E<10 and E<10^-3; shaded: accepted sequences]
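The recursive strategy can be pictured as a breadth-first search in which every accepted sequence seeds a new search. This is a rough sketch under stated assumptions, not the HHsenser implementation: `find_candidates` stands in for a PSI-BLAST run, `validate` stands in for the HMM-HMM comparison against the accepted alignment, and the two E-value constants are read off the figure labels. All names and the toy data are hypothetical.

```python
from collections import deque

STRICT_E = 1e-3       # assumed meaning of the figure label "E<10-3"
PERMISSIVE_E = 10.0   # assumed meaning of the figure label "E<10"

def transitive_search(query, find_candidates, validate):
    """Collect sequences reachable from the query via accepted hits."""
    accepted = {query}
    queue = deque([query])
    while queue:
        seed = queue.popleft()
        for hit, evalue in find_candidates(seed):
            if hit in accepted or evalue >= PERMISSIVE_E:
                continue
            # Strict hits are accepted directly; permissive hits only
            # if the (stand-in) validation step confirms homology.
            if evalue < STRICT_E or validate(hit, accepted):
                accepted.add(hit)
                queue.append(hit)
    return accepted

# Toy data: each seed maps to (hit, E-value) pairs.
TOY_HITS = {
    "query": [("a", 1e-5), ("b", 0.5), ("x", 50.0)],
    "a": [("c", 1e-4)],
    "b": [("d", 2.0)],
}
VALIDATED = {"b"}  # pretend the validation step confirms only "b"

result = transitive_search(
    "query",
    lambda seed: TOY_HITS.get(seed, []),
    lambda hit, accepted: hit in VALIDATED,
)
print(sorted(result))  # ['a', 'b', 'c', 'query']
```

In the toy run, "c" is found only through the intermediate "a", and "d" is rejected because its weak hit is not confirmed by the validation step.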

Page 30

HHsenser defines a diverse superfamily of transcription factors around AbrB/SpoVT

[Figure: structures with chain termini marked (N, C, N’, C’): MazE (1mvf), MraZ (1n0g), AbrB (new, 1yfb), AbrB (1ekt)]

Sequences obtained with HHsenser, clustered with CLANS:

[Cluster map labels: SpoVT, AbrB, MraZ-C, MraZ-N, cyano TF, YjiW, Archaeal PhoU, PemI / MazE, Vir, VagC, PrlF; structures 1n0g, 1yfb, 1mvf]

M. Coles et al. (2005) Structure 13, 919-928.

Page 31

Retroactive from Drosophila was identified in a screen for chitin-associated defects

● The retroactive fly larvae are bloated and show a characteristic disarrangement of chitin fibres in the cuticle

● Except for the orthologous genes from D. pseudoobscura and Anopheles, no homologs are found in the databases

● Understanding chitin-related developmental and metabolic pathways is important for pest control

wildtype

rtv mutant

Page 32

Based on remote homology with CD59 and snake toxins, HHpred could generate a 3D model for Rtv

B. Moussian, J. Söding, H. Schwarz, and C. Nüsslein-Volhard, Dev Dyn 2005

• Rtv is membrane-bound and adopts a three-finger neurotoxin fold

• The long fingers carry two exposed aromatic residues each

• These exposed residues are likely to bind chitin at the surface of epidermal cells

Page 33
Page 34

HHsenser finds homology between P5 protein of phage phi-6 and lytic transglycosylases

(default parameters)

Page 35

HHpred confidently predicts Gas1 (target 5 from AFP-SIG) to be a GDNF receptor

(default parameters, database: CDD)

In collaboration with Mart Saarma, Helsinki

Page 36

Outlook

Toolkit as open-source package

Continuous integration of the best available tools

Several new tools planned or in development

• Cluster known folds by sequence similarity (Galaxy of folds)

• Functional subtyping

• PDB remote homology alert

• Barrel membrane protein prediction

• Repeat detection (database-assisted)

• Expert system

Page 37

The Toolkit Team

Andreas Biegert

Michael Remmert

Christian Mayer

Andrei Lupas

Johannes Söding

Many thanks to

• Tancred Frickey, Markus Gruber, Alex Diemand, and Pavel Szczesny for contributing tools

• Alexander Diemand for systems admin and support

• Members of our group for critical feedback

http://toolkit.tuebingen.mpg.de

Page 38

Structure is more conserved in evolution than function

Conservation of structure

Sequence identity    Main-chain RMSD in conserved core    Fraction of amino acids in conserved core
60%                  0.85 Å                               90%
50%                  1.0 Å                                80%
40%                  1.2 Å                                70%
30%                  1.5 Å                                60%
20%                  1.8 Å                                50%

Structure prediction based on homology to a template with known structure can yield useful 3D models even at sequence identities below 20% (twilight zone)

Page 39

Sequence identity is a good indicator of functional similarity …

Conservation of enzyme function (EC code) in proteins

Sequence identity    Substrate specificity conserved    Reaction mechanism conserved
                     (all four EC digits)               (first three EC digits)
50% - 60%            75%                                85%
40% - 50%            60%                                70%
30% - 40%            35%                                50%
20% - 30%            15%                                25%

… but function evolves quickly: below 50%, direct functional inference gets problematic

Analysis of conserved functional residues, comparative sequence analysis, structure prediction, …

Page 40

Global versus local alignment

BLAST and PSI-BLAST use a local alignment method

HHpred can construct both local and global alignments

• Probabilities / E-values more reliable for local alignment

• Global alignment mode useful for making 3D models and for determination of structural domain boundaries

[Diagrams: global alignment and local alignment, each showing the query against a db match]
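The difference between the two modes shows up clearly on a toy example. The sketch below uses plain sequence-sequence dynamic programming with a linear gap cost as a stand-in (HHpred itself aligns HMMs, and BLAST uses heuristics). The scoring values and the function name `align_score` are arbitrary choices for illustration.

```python
def align_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b with linear gap costs.

    local=False: global alignment, end gaps are penalized.
    local=True:  local alignment, the best-scoring sub-alignment wins.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)     # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]

query = "WGKV"
db_match = "TTTTWGKVTTTT"
print(align_score(query, db_match, local=True))   # 8: the four matching residues
print(align_score(query, db_match, local=False))  # -8: eight end gaps are charged
```

The local score ignores the unmatched flanks of the database sequence, whereas the global score pays for them, which is why the global mode is the one that forces the alignment to cover full domain lengths.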