Power and weakness of data

Post on 15-Jan-2016


Power and weakness of data

Power: data + software + bioinformatician = answer.

Weakness: Data errors. Data poorly understood. Poor software. Never enough data. Few bioinformaticians available.

Laerte about structures:

“Use the Force, Luke” sequence, Gert

Signals in Sequences

The number of sequences available for analysis rapidly approaches infinity.

We need new ways to look at all this information.

The First Law:

First law of sequence analysis:

A conserved residue is important.

With thousands of aligned sequences:

Second law of sequence analysis:

A very conserved residue is very important.

Signals in sequences: Conserved, CMA, variable

QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW

Black = conserved
White = variable
Green = correlated mutations (CMA)
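The three signal types can be illustrated on the toy alignment from this slide. The sketch below is a minimal Python example, assuming the alignment is four sequences of 14 residues each. The colours are lost in the transcript, so the classification is computed here rather than copied from the slide. The helper names and the all-or-nothing correlation rule are hypothetical simplifications for illustration, not the CMA method the author used.

```python
from itertools import combinations

# Toy alignment from the slide, read as four sequences of 14 residues
# (assumption: the transcript lost the original line breaks).
ALIGNMENT = [
    "QWERTYASDFGRGH",
    "QWERTYASDTHRPM",
    "QWERTNMKDFGRKC",
    "QWERTNMKDTHRVW",
]


def columns(alignment):
    """Return the alignment as a list of column strings."""
    return ["".join(col) for col in zip(*alignment)]


def conserved_positions(alignment):
    """1-based positions where every sequence has the same residue."""
    return [i + 1 for i, col in enumerate(columns(alignment))
            if len(set(col)) == 1]


def fully_variable_positions(alignment):
    """1-based positions where every sequence has a different residue."""
    return [i + 1 for i, col in enumerate(columns(alignment))
            if len(set(col)) == len(col)]


def partition(column):
    """Group sequence indices by the residue they carry in this column."""
    groups = {}
    for idx, residue in enumerate(column):
        groups.setdefault(residue, set()).add(idx)
    return {frozenset(group) for group in groups.values()}


def correlated_pairs(alignment):
    """Pairs of columns that split the sequences into the same groups.

    Hypothetical criterion for this sketch: both columns vary, they induce
    identical partitions, and every group holds at least two sequences.
    """
    cols = columns(alignment)
    pairs = []
    for i, j in combinations(range(len(cols)), 2):
        part_i, part_j = partition(cols[i]), partition(cols[j])
        if (len(part_i) > 1 and part_i == part_j
                and all(len(group) > 1 for group in part_i)):
            pairs.append((i + 1, j + 1))  # report 1-based positions
    return pairs


print("conserved: ", conserved_positions(ALIGNMENT))
print("variable:  ", fully_variable_positions(ALIGNMENT))
print("correlated:", correlated_pairs(ALIGNMENT))
```

On this alignment the sketch reports positions 1 to 5, 9 and 12 as conserved, positions 13 and 14 as fully variable, and positions 6 to 8 and 10 to 11 as two groups of mutually correlated columns. With real data a statistical measure over many sequences would be needed instead of exact partition matching.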

Sequence Signals

Three types of information from multiple sequence alignments:

1) Conservation
2) Correlation
3) Variability

Artefacts

Wrong sequence signals can result from:

Not enough sequences
Too conserved sequences
Too variable sequences
Over-alignment
Over-interpretation

Recalcitrant residues

Sequence Entropy

$$E = -\sum_{i=1}^{20} p_i \ln(p_i)$$

where $p_i$ is the fraction of sequences that have residue type $i$ at the alignment position.
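As a minimal sketch of this calculation, the Python function below computes the entropy of one alignment column. It assumes the standard Shannon form (a sum over the residue types of p times ln p, with a leading minus sign and the natural logarithm). The function name and the choice to skip gap characters are assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (natural log) of the residue types in one column.

    Assumption of this sketch: gap characters are ignored, and an empty
    column has entropy 0.
    """
    counts = Counter(residue for residue in column if residue != gap)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Residue types that do not occur contribute nothing to the sum.
    return sum(-(n / total) * math.log(n / total) for n in counts.values())


print(column_entropy("QQQQ"))   # fully conserved column
print(column_entropy("YYNN"))   # two residue types, equally frequent
```

A fully conserved column gives an entropy of 0, and a column in which all 20 residue types are equally frequent gives the maximum, ln(20), which is about 3.0.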

Sequence Variability

Sequence variability is the number of residue types that are present in more than 0.5% of the sequences.
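This definition translates directly into a count over one alignment column. The Python sketch below is one possible reading: the function name is hypothetical, the 0.5% threshold is taken as a fraction of all sequences in the column, and gap characters are assumed not to count as a residue type.

```python
from collections import Counter


def column_variability(column, min_fraction=0.005, gap="-"):
    """Number of residue types present in more than min_fraction of sequences.

    Assumptions of this sketch: the denominator is the total number of
    sequences in the column, and gaps are not counted as a residue type.
    """
    total = len(column)
    if total == 0:
        return 0
    counts = Counter(residue for residue in column if residue != gap)
    return sum(1 for n in counts.values() if n / total > min_fraction)


# 1000 sequences: A in 99.0%, C in 0.6%, D in 0.4% of them.
column = "A" * 990 + "C" * 6 + "D" * 4
print(column_variability(column))
```

In this example D occurs in only 0.4% of the sequences and is therefore not counted, so the variability is 2. The threshold keeps rare residue types, which may be sequencing or alignment errors, from inflating the count.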

Entropy - Variability

Evolution = try everything (and keep what works well)

Variability = Chaos (try everything)

Entropy = Information (keep what works well)

Entropy - Variability

Variability is the result of DNA trying everything.

Entropy is the protein’s brake on evolutionary speed.

Ras Entropy - Variability

11 Red

12 Orange

22 Yellow

23 Green

33 Blue

Ras Location

11 Red
12 Orange
22 Yellow
23 Green
33 Blue

Protease Entropy - Variability

11 Red

12 Orange

22 Yellow

23 Green

33 Blue

Protease Location

11 Red
12 Orange
22 Yellow
23 Green
33 Blue

Globin Entropy - Variability

GPCR

11 Red

12 Orange

22 Yellow

23 Green

33 Blue

Globin Location

11 Red
12 Orange
22 Yellow
23 Green
33 Blue

And now for drug design: GPCR

11 Red

12 Orange

22 Yellow

23 Green

33 Blue

GPCRs: (Membrane facing amino acids left out)

11 Red
12 Orange
22 Yellow
23 Green
33 Blue

Summary

Given many sequences:

Every residue’s role known.
Signaling paths detectable.
Two step evolutionary model: First main site, soon after modulator site.

Beyond the summary

Sequence -> structure -> function is wrong. It should be: Structure -> sequence -> function.

And, because active sites are at the surface, conserved residues are at or near the surface.

Beyond the summary

Why do all TIM-barrel enzymes have the functional residues at the C-terminal side of the strands?

Beyond the summary

22 Yellow: Core

11 Red: main site

23 Green: Modulator

12 Orange: Around main site

(Figure: boxes 11, 12, 22, 23 and 33, with the labels “Up to 18 residue types”, “Up to 14 residue types”, “Up to 8 residue types” and “Up to 4 residue types”.)

The weakness of data

Data errors.
Poor software.
Data poorly understood.
Never enough data.
Few bioinformaticians around.

The weakness of data

Rob Hooft

WHAT_CHECK

www.cmbi.kun.nl/gv/servers/
www.cmbi.kun.nl/gv/pdbreport/

Structure validation

Everything that can go wrong, will go wrong, especially with things as complicated as protein structures.

Why?

Why does a sane (?) human being spend fourteen years searching for twelve million errors in the PDB?

Because:

All we know about proteins is derived from PDB files.

If a template is wrong the model will be wrong.

Errors become smaller when you know about them.

What do we check?

Administrative errors.
Crystal-specific errors.
NMR-specific errors.
Really wrong things.
Improbable things.
Things worth looking at.
Ad hoc things.

Error detection

Detecting errors is one thing, fixing them another…

We try not to say that the structure is wrong, but to say what is wrong with the structure, and to give hints on how to fix things.

How difficult can it be?

Your best check:

Planarity

Little things hurt big

Improbable things

How wrong is wrong?

Our errors

Four sigma: 12,000 false positives.
Administrative errors misunderstood.
Improbable is not wrong.
Poor data makes errors unavoidable.
Bugs.
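The first point can be put in perspective with a back-of-the-envelope calculation. The Python sketch below assumes that the checked values are normally distributed and that the four-sigma test is two-sided. Neither assumption is stated on the slide, and the slide does not say how many values were checked, so the implied number of checks is only illustrative.

```python
import math


def two_sided_tail(z):
    """P(|X| > z) for a standard normal variable X."""
    return math.erfc(z / math.sqrt(2))


# Chance that a perfectly good value falls outside four sigma.
p_outlier = two_sided_tail(4.0)

# Number of independent checks that would produce 12,000 such outliers
# by chance alone (the 12,000 figure is the one quoted on the slide).
implied_checks = 12_000 / p_outlier

print(f"P(|X| > 4 sigma) = {p_outlier:.2e}")
print(f"implied number of checks = {implied_checks:.2e}")
```

Under these assumptions, roughly six in every hundred thousand correct values are flagged, so even a strict cut-off yields thousands of false positives once the number of checked values reaches the hundreds of millions.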

Contact Probability

DACA

Contact probability box

Using contact probability

His, Asn, Gln ‘flips’

Where are the protons?

Hydrogen bond network

Hydrogen bond force field

15% should be flipped

Summary

Everything that could go wrong has gone wrong.
Errors are on a ‘sliding scale’.
Error detection can detect a lot, but surely not everything (yet).

Beyond the summary, for drug design:

Forget: High throughput.
Forget: Docking.
Forget: Structure in absence of many, many sequences.

First gather and digest all experimental data.

Beyond the summary, for drug design:

First know your enemy,

then defeat it.

Thanks to:

Laerte Oliveira, Sao Paulo
Florence Horn, San Francisco
Rob Hooft, Delft
Wilma Kuipers, Weesp
Bob Bywater, Copenhagen
Nora vd Wenden, The Hague
Mike Singer, Boston
Ad IJzerman, Leiden
Margot Beukers, Leiden
Amos Bairoch, Geneva
Fabien Campagne, San Diego