
Post on 15-Jan-2016


Power and weakness of data

Power: data + software + bioinformatician = answer.

Weakness: Data errors. Data poorly understood. Poor software. Never enough data. Few bioinformaticians available.

Laerte about structures:

“Use the Force, Luke” sequence, Gert

Signals in Sequences

The number of sequences available for analysis rapidly approaches infinity.

We need new ways to look at all this information.

The First Law:

First law of sequence analysis:

A conserved residue is important.

With thousands of aligned sequences:

Second law of sequence analysis:

A very conserved residue is very important.

Signals in sequences: Conserved, CMA, variable

QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW

Black = conserved
White = variable
Green = correlated mutations (CMA)
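
The three signal types in the toy alignment above can be sketched in code. This is a minimal illustration and not the CMA method itself: it assumes a deliberately strict rule in which a column is "conserved" if all sequences share one residue, "correlated" if at least one other column splits the sequences into exactly the same groups, and "variable" otherwise. The function names are hypothetical, and the colouring of the original slide cannot be recovered from the transcript, so the labels are only what this rule yields.

```python
# Toy alignment taken from the slide (four sequences, fourteen columns).
ALIGNMENT = [
    "QWERTYASDFGRGH",
    "QWERTYASDTHRPM",
    "QWERTNMKDFGRKC",
    "QWERTNMKDTHRVW",
]


def column_partition(column):
    """Group sequence indices by the residue they carry in this column."""
    groups = {}
    for index, residue in enumerate(column):
        groups.setdefault(residue, set()).add(index)
    return frozenset(frozenset(members) for members in groups.values())


def classify_columns(alignment):
    """Label every column as 'conserved', 'correlated' or 'variable'.

    Assumed rule (an illustration only): two columns are correlated when
    they split the sequences into identical groups, and that split is
    neither a single group nor one group per sequence.
    """
    n_sequences = len(alignment)
    partitions = [column_partition(col) for col in zip(*alignment)]
    labels = []
    for i, partition in enumerate(partitions):
        if len(partition) == 1:
            labels.append("conserved")
        elif len(partition) < n_sequences and any(
            partition == other for j, other in enumerate(partitions) if j != i
        ):
            labels.append("correlated")
        else:
            labels.append("variable")
    return labels


if __name__ == "__main__":
    for position, label in enumerate(classify_columns(ALIGNMENT), start=1):
        print(position, label)
```

With only four sequences the rule is far too easy to satisfy by chance, which is one reason the slides insist on thousands of aligned sequences.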

Sequence Signals

Three types of information from multiple sequence alignments:

1) Conservation
2) Correlation
3) Variability

Artefacts

Wrong sequence signals can result from:

Not enough sequences
Too conserved sequences
Too variable sequences
Over-alignment
Over-interpretation

Recalcitrant residues

Sequence Entropy

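$E = -\sum_{i=1}^{20} p_i \ln(p_i)$

Here $p_i$ is the fraction of sequences that have residue type $i$ at the alignment position, so $E = 0$ for a fully conserved position.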
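
The entropy of a single alignment column can be computed along these lines. This is a minimal sketch under stated assumptions: it uses the standard Shannon form with a leading minus sign and the natural logarithm, and it skips gap characters (`-`), a choice the slides do not specify. The function name `column_entropy` is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (natural log) of the residue types in one column.

    Assumption: gap characters are ignored; the slides do not say how
    gaps should be treated.
    """
    counts = Counter(residue for residue in column if residue != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of -p * ln(p) over the residue types that actually occur.
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


if __name__ == "__main__":
    print(column_entropy("DDDD"))  # fully conserved column
    print(column_entropy("FTFT"))  # two residue types, equally frequent
```

A fully conserved column gives 0, and twenty equally frequent residue types give the maximum, ln(20).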

Sequence Variability

Sequence variability is the number of residue types that are present in more than 0.5% of the sequences.
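
The variability definition above translates directly into a count per alignment column. This is a minimal sketch: the 0.5% threshold comes from the slide, while the function name `column_variability` and the decision to ignore gap characters are assumptions made for illustration.

```python
from collections import Counter


def column_variability(column, min_fraction=0.005, gap="-"):
    """Number of residue types present in more than min_fraction of sequences.

    Assumption: gap characters are ignored and fractions are taken over
    the non-gap residues only.
    """
    counts = Counter(residue for residue in column if residue != gap)
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for n in counts.values() if n / total > min_fraction)


if __name__ == "__main__":
    # 1000 sequences: a residue type seen only 4 times (0.4%) is not counted.
    column = "A" * 900 + "S" * 96 + "T" * 4
    print(column_variability(column))
```

The threshold keeps a handful of sequencing errors or rare outliers from inflating the variability of an otherwise restricted position.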

Entropy - Variability

Evolution = try everything (and keep what works well)

Variability = Chaos (try everything)

Entropy = Information (keep what works well)

Entropy - Variability

Variability is the result of DNA trying everything.

Entropy is the protein’s brake on evolutionary speed.

Ras Entropy - Variability

Colour key for the entropy-variability boxes (repeated on each of the following plots): 11 Red, 12 Orange, 22 Yellow, 23 Green, 33 Blue

Ras Location

Protease Entropy - Variability

Protease Location

Globin Entropy - Variability

Globin Location

And now for drug design: GPCR

GPCRs: (Membrane facing amino acids left out)

Summary

Given many sequences:

Every residue’s role known.
Signaling paths detectable.
Two step evolutionary model: First main site, soon after modulator site.

Beyond the summary

Sequence -> structure -> function is wrong.
It should be: Structure -> sequence -> function.

And, because active sites are at the surface, conserved residues are at or near the surface.

Beyond the summary

Why do all TIM-barrel enzymes have the functional residues at the C-terminal side of the strands?

Beyond the summary

11 Red: main site
12 Orange: Around main site
22 Yellow: Core
23 Green: Modulator

[Figure: entropy-variability plot with boxes 11, 12, 22, 23 and 33; labels "Up to 4 residue types", "Up to 8 residue types", "Up to 14 residue types" and "Up to 18 residue types".]

The weakness of data

Data errors.
Poor software.
Data poorly understood.
Never enough data.
Few bioinformaticians around.

The weakness of data

Rob Hooft

WHAT_CHECK

www.cmbi.kun.nl/gv/servers/
www.cmbi.kun.nl/gv/pdbreport/

Structure validation

Everything that can go wrong, will go wrong, especially with things as complicated as protein structures.

Why?

Why does a sane (?) human being spend fourteen years searching for twelve million errors in the PDB?

Because:

All we know about proteins is derived from PDB files.

If a template is wrong the model will be wrong.

Errors become smaller when you know about them.

What do we check?

Administrative errors.
Crystal-specific errors.
NMR-specific errors.
Really wrong things.
Improbable things.
Things worth looking at.
Ad hoc things.

Error detection

Detecting errors is one thing, fixing them another…

We try not to say that the structure is wrong, but to say what is wrong with the structure.
We give hints on how to fix things.

How difficult can it be?

Your best check:

Planarity

Little things hurt big

Improbable things

How wrong is wrong?

Our errors

Four sigma: 12,000 false positives.
Administrative errors misunderstood.
Improbable is not wrong.
Poor data makes errors unavoidable.
Bugs.
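
The four-sigma remark can be made concrete with a little arithmetic. This is a rough sketch under loud assumptions: it treats the checked quantities as normally distributed, uses a two-sided cutoff, and pretends all 12,000 false positives come from one such test. The slides state none of this, nor the number of values checked, so the implied count is only an order-of-magnitude illustration. The function names are hypothetical.

```python
import math


def two_sided_tail_probability(n_sigma):
    """P(|Z| > n_sigma) for a standard normal variable Z (assumed model)."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def expected_false_positives(n_checks, n_sigma):
    """Correct values expected to be flagged purely by chance."""
    return n_checks * two_sided_tail_probability(n_sigma)


if __name__ == "__main__":
    p = two_sided_tail_probability(4.0)
    print(f"chance of a correct value lying beyond four sigma: {p:.2e}")
    # Number of checked values that would yield 12,000 chance outliers.
    print(f"implied number of checks: {12_000 / p:.2e}")
```

Under these assumptions, roughly 6 in 100,000 perfectly good values fall outside four sigma, so a validation run over a couple of hundred million values flags thousands of them without any real error being present.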

Contact Probability

DACA

Contact probability box

Using contact probability

His, Asn, Gln ‘flips’

Where are the protons?

Hydrogen bond network

Hydrogen bond force field

15% should be flipped

Summary

Everything that could go wrong has gone wrong.
Errors are on a ‘sliding scale’.
Error detection can detect a lot, but surely not everything (yet).

Beyond the summary, For Drug Design:

Forget: High throughput.
Forget: Docking.
Forget: Structure in absence of many, many sequences.

First gather and digest all experimental data.

Beyond the summary, For Drug Design:

First know your enemy,

then defeat it.

Thanks to:

Laerte Oliveira, Sao Paulo
Florence Horn, San Francisco
Rob Hooft, Delft
Wilma Kuipers, Weesp
Bob Bywater, Copenhagen
Nora vd Wenden, The Hague
Mike Singer, Boston
Ad IJzerman, Leiden
Margot Beukers, Leiden
Amos Bairoch, Geneva
Fabien Campagne, San Diego
