EMBL-EBI
the European Macromolecular Structure Database (EMSD).
http://www.ebi.ac.uk/msd/education/Tutorial.html
http://www.ebi.ac.uk/msd/roadshow.html
MSD Roadshow Co-ordinator: Janet Copeland
2nd November 2005
Oxford University
Introduction to MSD and to Quaternary Structures/Assemblies as Basis of MSD database
SSM Fold recognition
PISA Surface and assembly toolkit
MSDchem Chemistry reference data
MSDlite/MSDpro generalised search systems
MSDsite Active sites
MSDmotif small structural motifs
Visualisation and Patterns
Integration Projects with Sequence and Domain data
Validation/Deposition
Clustering methods used at MSD
MSDmine – generalised data access to the MSD
PIMS – Protein Information System
Targets – Workflow for Target selection tools
NMR – NMR tools and data at MSD
Data Mining and an example: MSDtemplates
DataBases at MSD including data warehouse technologies
DataBase Replication
Genomes
Hypotheses and in silico models
Bioinformatics
Expression-profiling
Comparative genomics
Mutant/RNAi data
Metabolic data
Literature
Proteome data
Biochemistry
Bioinformatics
Role of Bioinformatics
To Support Experimental Biology
To Collect and Archive Data
To provide Framework and Integration
To give Easy Access to Data
To make New Discoveries through Data Analysis
Databanks and Databases
The PDB Archive is a “databank”: a series of flat files in a format originally designed for Fortran card readers.
The MSD provides “databases”: collections of data (thousands of attributes) organized into relational tables and held in an RDBMS.
PQS biological assemblies
MSDchem ligand data
Electron Density Visualisation
AstexViewer
MSDPro, MSDlite
SSM fold matching
Surface Matching
MSDsite Active sites
Linking to Domain data, eFamily
Sequence Mapping, SIFTS
Data & information
ATOM   2567  N   PHE B 175       7.821 -25.530 -22.848  1.00  8.71
ATOM   2568  CA  PHE B 175       8.845 -25.172 -21.877  1.00  9.41
ATOM   2569  C   PHE B 175       9.449 -23.798 -22.169  1.00 10.02
ATOM   2570  O   PHE B 175      10.664 -23.613 -22.103  1.00 10.37
ATOM   2571  CB  PHE B 175       9.928 -26.251 -21.848  1.00  9.53
ATOM   2572  CG  PHE B 175      10.969 -26.137 -22.982  1.00 10.03
ATOM   2573  CD1 PHE B 175      12.356 -25.819 -22.988  1.00 10.51
ATOM   2574  CD2 PHE B 175      11.725 -27.211 -23.402  1.00 10.25
ATOM   2575  CE1 PHE B 175      11.821 -27.095 -22.869  1.00 11.17
ATOM   2576  CE2 PHE B 175      12.282 -26.086 -24.008  1.00 10.95
ATOM   2577  CZ  PHE B 175      10.953 -26.335 -23.622  1.00 11.38
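The step from "data" to "information" can be sketched in a few lines of Python. This is only an illustration, not how the MSD loads the PDB archive: the `Atom` class, its field names and `parse_atoms` are hypothetical, and the records are split on whitespace because that is how they survive in this transcript. Real PDB files are fixed-column, and a production parser should slice by column position instead.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """One ATOM record with named, typed fields (field names are illustrative)."""
    serial: int
    name: str
    residue: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


# First three records from the slide, run together as in the transcript.
RAW = (
    "ATOM 2567 N PHE B 175 7.821 -25.530 -22.848 1.00 8.71 "
    "ATOM 2568 CA PHE B 175 8.845 -25.172 -21.877 1.00 9.41"
    "ATOM 2569 C PHE B 175 9.449 -23.798 -22.169 1.00 10.02"
)


def parse_atoms(text):
    """Split a run of ATOM records and convert each one to an Atom.

    Assumption: fields are whitespace-separated, which holds for this
    snippet but not for PDB files in general.
    """
    atoms = []
    for chunk in text.split("ATOM"):
        fields = chunk.split()
        if len(fields) != 10:
            continue  # skip the empty leading chunk or malformed records
        serial, name, residue, chain, res_seq, x, y, z, occ, b = fields
        atoms.append(
            Atom(int(serial), name, residue, chain, int(res_seq),
                 float(x), float(y), float(z), float(occ), float(b))
        )
    return atoms


if __name__ == "__main__":
    for atom in parse_atoms(RAW):
        print(atom.serial, atom.name, atom.residue, atom.b_factor)
```

Once each value has a name and a type, it can be stored in a table column and constrained in a search, which is the difference between the flat-file databank and the relational databases described earlier.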
MSD service provider
We provide a service to the scientific community 24/7 (almost):
parallel DB with fail-over, etc.
Service “ping” baseline check several times/day
New data is added weekly
Systems are extensible
Query capabilities
Browsing (click and read)
Simple search: select records with some constraints
More elaborate search: select specific fields of some records with constraints on some fields
Complex querying: ability to return an answer that results from a "live" computation, and was not part of any record of the database
What is the function of this structure?
What is the function of this sequence?
What is the function of this motif?
The fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions. Knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level.
1H8E (ADP.ALF4)2(ADP.SO4) BOVINE F1-ATPASE (ALL THREE CATALYTIC SITES OCCUPIED)
MENZ, R.I., WALKER, J.E., LESLIE, A.G.W.
ATPase
Ground rules for bioinformatics
Don't always believe what programs tell you: they're often misleading & sometimes wrong!
Don't always believe what databases tell you: they're often misleading & sometimes wrong!
Don't always believe what lecturers tell you: they're often misleading & sometimes wrong!
In short, don't be a naive user.
When computers are applied to biology, it is vital to understand the difference between mathematical & biological significance.
Computers don’t do biology; they do sums quickly!
General Evaluation Criteria
Be sceptical and cynical!
When you are searching for information you need to judge its quality and suitability.
Think critically about each piece of information you find and how you found it.
Relevance: Does the information you have found adequately support your research? Does it answer the question, or support one of your arguments? How general or specific is the information about the topic?
Appreciate how difficult it is to draw a complex 3-D object and appreciate the complexity of the requirements for storing sequence and structural information of molecules in a database.
There are a lot of interrelated pieces of information about a biomolecule, such as:
sequence similarities
genome location
protein structure
expression
chemistry
Some of the obstacles to searching databases are:
Data formats are not standard.
The nomenclature is not standard.
There is more than one database offering the same information (data redundancy).
Links between databases may not be easy to follow.
The number of databases available makes it confusing to choose from.
Quality Control Issues
The quality of archived data is no better than the data determined in the contributing laboratories.
Curation of the data can help to identify errors. Disagreement between duplicate determinations is a
clear warning of an error in one or the other. Similarly, results that disagree with established
principles may contain errors. It is useful, for instance, to flag deviations from
expected stereochemistry in protein structures, but such “outliers” are not necessarily wrong.
Data quality
Data consistency
Data models
Reliability
Evidence? Level of confidence?
Assignment of function by similarity: a recursive process, with propagation of errors
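A back-of-the-envelope sketch of why recursive annotation by similarity propagates errors. The model is an assumption made purely for illustration and does not come from the slides: each transfer of a function from one entry to the next is treated as independently correct with the same probability, so the chance that an annotation is still right after several transfers is that probability raised to the number of steps. The function name and the 0.9 figure are hypothetical.

```python
def chain_reliability(per_step_accuracy, steps):
    """Probability that an annotation survives `steps` transfers intact.

    Assumes every transfer is independently correct with probability
    `per_step_accuracy` (a deliberate simplification).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Illustrative value only: 90% of single transfers assumed correct.
    for n in (1, 2, 5, 10):
        print(n, round(chain_reliability(0.9, n), 3))
```

Under this toy model, an annotation copied through five similarity steps at 90% per-step accuracy is right less than 60% of the time, which is why the evidence and level of confidence behind an assignment matter.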
Data quality
It’s hard to judge whether something “makes sense”.
The lack of labeling on many web pages makes it hard to know the source.
Calculations based on databases are even harder to deal with
Logical deductions may be worse.
“tacR gene regulates the human nervous system”
“tacQ gene is similar to tacR but is found in E. coli”
“so tacQ gene regulates the E. coli nervous system”
Significance
Appreciating that mathematical & biological significance are different is crucial
This is important in understanding the limitations of:
database search algorithms
multiple sequence alignment algorithms
pattern recognition techniques
functional site & structure prediction tools
Contrary to popular opinion, there is currently still:
no biologically-reliable automatic multiple alignment algorithm
no infallible pattern-recognition technique
no reliable gene, function or structure prediction algorithm
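One way to see the gap between mathematical and biological significance in a database search is a simple counting argument. The sketch below assumes, for illustration only, that each comparison against the database is independent and produces a chance match with the same small probability; the function names and the numbers are hypothetical and are not taken from any particular search tool.

```python
def expected_chance_hits(p_per_comparison, database_size):
    """Expected number of matches that arise by chance alone."""
    return p_per_comparison * database_size


def prob_at_least_one_chance_hit(p_per_comparison, database_size):
    """Probability of seeing one or more chance matches.

    Assumes independent comparisons (a simplification).
    """
    return 1.0 - (1.0 - p_per_comparison) ** database_size


if __name__ == "__main__":
    p = 1e-6        # illustrative per-comparison chance probability
    n = 1_000_000   # illustrative number of database entries
    print(expected_chance_hits(p, n))
    print(round(prob_at_least_one_chance_hit(p, n), 3))
```

A match that looks like a one-in-a-million event for a single comparison is therefore expected about once by chance in a search of a million entries. The arithmetic says nothing about whether the hit shares a function with the query; that judgement remains a biological one.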
As a result, we will have to give up the “safe” idea of a stable databank composed of entries that are correct when they are first distributed in mature form and stay fixed thereafter.
Databanks are dynamic in information content and growing in size, and maturing in quality.
Maintaining local copies, largely by “top up”, is not sufficient.
Proliferation of various copies in various states with out-of-date linkages
New Problems