Ewan Birney, Biocuration 2013
Ewan Birney's slides from Biocuration 2013
Ewan Birney (tweetable)
Curation
Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (> 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed – sometimes led – human/mouse/rat/platypus etc genomes, ENCODE, others.
EBI is in Hinxton, South Cambridgeshire
EBI is part of EMBL, ~like CERN for molecular biology
Molecular Biology
• The study of how life works – at a molecular level
• Key molecules:
• DNA – Information store (Disk)
• RNA – Key information transformer, also does stuff (RAM)
• Proteins – The business end of life (Chip, robotic arms)
• Metabolites – Fuel and signalling molecules (electricity)
• Theories of how these interact – no theories to predict what they are
• Instead we determine attributes of molecules and store them in globally accessible, open, databases
[Figure: a spectrum from Theory (“Can accurately predict from models”) to Observation (“Must directly observe”), with Molecular Biology, Geology, Astronomy, Climate modelling and High Energy Physics placed along it]
This ratio is not well correlated with data size
[Figure: Data Size plotted against Ratio of model predictability for Molecular Biology, Astronomy, Climate Models and High Energy Physics; labelled data sizes ~60PB and ~5PB]
“Knowing stuff” is critical to biology…
• The bases of the human genome
• … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow….
• The functions of proteins
• Enzymes, Transcription Factors, Signalling….
• The types of cells, their lineages and organ composition
• …and all the molecular components in each cell
• Small molecules
• … and their conversions, binding partners
• Structures of molecules, complexes and cells
• … at atomic and higher resolution
Two fundamental types of information
• Experimental data
• The result of a specific experiment
• Often an experiment-specific, data-heavy part plus a “meta-data” part
• Might be contradictory
• “Primary paper”
• Consensus Knowledge
• Integration of different strands of information on a topic
• Realised as a computationally accessible scheme
• “Review article”
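The split between the two types can be sketched in code. This is only an illustration under stated assumptions: the `ExperimentalRecord` class, the `consensus` function, the majority-vote rule and the `paper_A`-style identifiers are all hypothetical and not taken from the slides or from any real database. Real consensus curation relies on expert judgement, not a vote.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    """One experiment: a data-heavy part plus a descriptive meta-data part."""
    source: str                                   # hypothetical primary-paper identifier
    data: dict                                    # experiment-specific observations
    metadata: dict = field(default_factory=dict)  # e.g. sample, method


def consensus(records, key):
    """Integrate possibly contradictory records into one consensus value.

    Assumption for illustration: a simple majority vote over the value each
    record reports for `key`; ties go to the value seen first.
    Returns (value, list of supporting sources).
    """
    votes = Counter(r.data[key] for r in records if key in r.data)
    if not votes:
        return None, []
    value, _ = votes.most_common(1)[0]
    support = [r.source for r in records if r.data.get(key) == value]
    return value, support


records = [
    ExperimentalRecord("paper_A", {"interacts": True}, {"method": "assay_1"}),
    ExperimentalRecord("paper_B", {"interacts": True}, {"method": "assay_2"}),
    ExperimentalRecord("paper_C", {"interacts": False}, {"method": "assay_1"}),
]
value, support = consensus(records, "interacts")
print(value, support)  # True ['paper_A', 'paper_B']
```

The experimental records keep their contradictions. Only the consensus layer resolves them, and it keeps track of which sources support the result.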
Five types of curation
Experimental Data Entry
• IntAct – Protein:Protein interactions
• GWAS Catalog – extraction of summary statistics
Experimental Meta-data capture
• Sample, CDS lines in ENA
• Sample in Metabolights, PRIDE etc
• Machine and analysis specification in PDB, PRIDE, ENA
Consensus integration of information
• GENCODE gene models in human
• Summaries and GO assignment in UniProt
• Pathway information in Reactome
• GO assignment and summaries in MODs (eg, PomBase, WormBase, PhytoPathDB etc)
Knowledge frameworks
• The EC classification
• Cell type ontologies
• Cell lineages – Worms!
• SNOMED, HPO etc
• GO ontologies
Knowledge management
• Creation of rules representing ENA standards compliance
• Cross-ontology coordination (eg, EFO) or tying (GO and ChEBI)
• RuleBase / UniRule curation processes
Data Entry vs Programming
Direct Data Entry
Programmatic Data Entry
Improved data entry tools
RuleBase, Computationally Accessible Standards
“Messy” Scripting
Thank You!
Curation Dilemma
• If you do your job well…
• Everyone assumes it’s easy
• People forget about the complexity
• You are ignored
• If you do your job badly…
• Everyone assumes it’s easy
• People forget about the complexity
• People complain
Why we need an infrastructure…
Infrastructures are critical…
But we only notice them when they go wrong
Biology already needs an information infrastructure
• For the human genome
• (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl
• For the function of genes and proteins
• For all genes, in text and computational – UniProt and GO
• For all 3D structures
• To understand how proteins work – PDBe
• For where things are expressed
• The differences and functionality of cells - Atlas
…But this keeps on going…
• We have to scale across all of (interesting) life
• There are a lot of species out there!
• We have to handle new areas, in particular medicine
• A set of European haplotypes for good imputation
• A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
• Of biological chemicals
• Of chemicals which interfere with Biology
ELIXIR’s mission
To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:
• medicine
• environment
• bioindustries
• society
How?
Fully Centralised
Pros: Stability, reuse, learning ease
Cons: Hard to concentrate expertise across all of life science; geographic, language placement; bottlenecks and lack of diversity
Fully Distributed
Pros: Responsive, geographic and language responsive
Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability
Research vs Healthcare
International, EBI / Elixir, English, low legalities
National, Healthcare, National Language, complex legalities
Other infrastructures needed for biology
• EuroBioImaging
• Cellular and whole organism Imaging
• BioBanks (BBMRI)
• We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
• A baseline set of knockouts and phenotypes in our most tractable mammalian model
• (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
• The ability to reliably use state of the art molecular techniques in a clinical research setting
Questions?
(you can follow me on twitter @ewanbirney)
I blog and update this on Google Plus publicly