PAG 2004
DATA MANAGEMENT AND CURATION AT TAIR
Margarita Garcia-Hernandez
The ‘systems biology’ paradigm
• FACT: huge amounts of data
• NEED: systematic harvesting & easy accessibility (store, sort, interlink)
• PROBLEM: complexity & heterogeneity of data
• CHALLENGE: to describe complete biological systems in an integrated way (organizing, defining relationships, defining metadata standards, interpreting, quality control assessment – DATA MANAGEMENT)
Data Management Flow Chart
Data generation
Database population
Collection
Selection
Organization of similar data types
• Remove redundancy
• Correct errors
Association of different data types
• Establish unambiguous identifiers
• Define and validate relationships
Resolve data heterogeneity - standardization
Data modeling
Data curation
Data dissemination
Annotation – add descriptions
• Define standard vocabulary
Quality Control
Issues
1. Accuracy of information
2. Consistency in format and content
3. Currency (keeping information up to date)
4. Conflicting data
TAIR’s approaches
• Personnel training (Ph.D. level biologists)
• User input
• Source attribution
• Checking curation consistency (computationally and manually)
• Adopt Standard Operating Procedures (SOPs)
• Define and use controlled vocabularies
Many Data Types with Many Sources
Sources
• Literature
• Public Databases
• Community submissions
• Computational analysis
• Functional Genomic projects
Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Expression
• Stocks
• Metabolism
Two Examples of Data Curation
Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Microarray Expression
• Stocks
• Metabolism
Sources
• Literature
• Public Databases (SMD)
• Community submissions
• Computational analysis
• Functional Genomic projects (AFGC)
Literature Curation
PubSearch
• A literature curation management system designed to store and manage the available literature for an organism of interest
• PubSearch software is freely available at http://pubsearch.org
• Generic Model Organism Database (GMOD), http://www.gmod.org: a joint effort by several model organism databases to develop reusable components for creating new biological databases
Literature Curation
Step 1: Collection of References
Sources
• Biosis, PubMed, Agricola (‘Arabidopsis’ in title or abstract)
• Meeting abstracts
• Dissertations
• Textbooks
Processing
• Remove redundancy
• Journal names standardization
Results
• PubSearch DB (curation tool): Arabidopsis References (21,527)
• Full text papers (scanning, online)
• TAIR DB (public db)
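The two processing steps above (removing redundancy and standardizing journal names) can be sketched in a few lines. This is only an illustration, not PubSearch code: the record fields, the alias table, and the choice of (title, year) as the duplicate key are all assumptions.

```python
import re

# Hypothetical alias table; a real one would cover many more journals.
JOURNAL_ALIASES = {
    "plant cell": "The Plant Cell",
    "the plant cell": "The Plant Cell",
    "plant j": "The Plant Journal",
    "plant journal": "The Plant Journal",
    "the plant journal": "The Plant Journal",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def standardize_journal(name):
    """Map a journal name variant to one canonical spelling, if known."""
    return JOURNAL_ALIASES.get(normalize(name), name.strip())


def merge_references(records):
    """Keep one record per (normalized title, year); the first source wins."""
    unique = {}
    for rec in records:
        cleaned = dict(rec, journal=standardize_journal(rec["journal"]))
        key = (normalize(rec["title"]), rec["year"])
        unique.setdefault(key, cleaned)
    return list(unique.values())
```

A record imported from two bibliographic sources with slightly different punctuation in the title would collapse to a single entry here. Anything the alias table does not know is passed through unchanged, so a curator can still see and fix it.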
Literature Curation
Step 2: Assigning References to Genes
• Term list (17,470): known gene names and candidate gene names
• Arabidopsis References (PubSearch)
• Scanning references for terms in list (programmatically)
• Gene X: Ref 1, Ref 2, ..., Ref n
• Reference hit validation (by curators using PubSearch)
• Validated list of references for each gene
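The programmatic scanning step could look roughly like the sketch below. It is not the PubSearch implementation: the reference fields, the gene names, and the whole-token matching rule are assumptions chosen for illustration. The output is only a candidate list, which curators would still validate.

```python
import re
from collections import defaultdict


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole token."""
    # Longer terms come first so they are tried before shorter ones.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    body = "|".join(escaped)
    return re.compile(r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])", re.IGNORECASE)


def candidate_hits(references, term_to_gene):
    """Return {gene: [reference ids]} for each term found in a title or abstract."""
    pattern = build_pattern(term_to_gene)
    lookup = {term.lower(): gene for term, gene in term_to_gene.items()}
    hits = defaultdict(list)
    for ref in references:
        text = ref["title"] + " " + ref["abstract"]
        genes = {lookup[m.group(1).lower()] for m in pattern.finditer(text)}
        for gene in sorted(genes):
            hits[gene].append(ref["id"])
    return dict(hits)
```

The lookbehind and lookahead keep a short name such as a hypothetical "ABC1" from matching inside "ABC10", which is the kind of false hit that would otherwise be left for the validation step.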
Literature Curation
Step 3: Extracting information
• Gene-centric curation approach
• Each curator is assigned 2 genes per day
• Papers are read and information extracted (following SOPs and using PubSearch curation tool):
– Name validation & add aliases
– Add sequence info
– Assign locus (mapping to the genome by BLAST)
– Merge/split genes
– Write summary sentence
– Correct errors
– Annotation using controlled vocabularies (GO, POC)
Controlled Vocabularies
• A collection of defined terms (organized in a hierarchy) intended to serve as a standard nomenclature
• Provide a common set of terms that users of a single system (or across multiple systems) can share
• Allow retrieval of ALL relevant information
Example:
– Find all the genes that have transporter activity (regardless of how they are named, or what type they are)
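The transporter example works because the terms are organized in a hierarchy: a query for a general term also returns genes annotated to more specific terms below it. A minimal sketch follows; the small hierarchy and the gene names are made up for illustration and are not taken from GO.

```python
# Hypothetical mini-hierarchy: each term maps to its parent terms.
PARENTS = {
    "transporter activity": ["molecular function"],
    "ion transporter activity": ["transporter activity"],
    "sugar transporter activity": ["transporter activity"],
    "potassium ion transporter activity": ["ion transporter activity"],
    "kinase activity": ["molecular function"],
}


def ancestors(term):
    """Return the term itself plus every term above it in the hierarchy."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def genes_annotated_to(query, annotations):
    """Find genes annotated to the query term or to any more specific term."""
    return sorted({gene for gene, term in annotations if query in ancestors(term)})
```

With annotations such as ("Gene A", "potassium ion transporter activity") and ("Gene B", "sugar transporter activity"), a query for "transporter activity" returns both genes, whatever the genes themselves are called.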
Controlled Vocabularies used at TAIR
• Gene Ontology (GO) http://www.geneontology.org/
Goal: to produce a controlled vocabulary for describing genes and proteins that can be applied to all organisms
– Molecular Function
– Cellular Component
– Biological Process
• Plant Ontology Consortium (POC) http://www.plantontology.org/
Gramene, TAIR, Univ Missouri St Louis, MaizeDB, IRIS, MIPS, Oryzabase & Monsanto & Pioneer as collaborators
Goal: to develop structured controlled vocabularies for plant-specific knowledge domains:
– Plant Anatomy (morphology, organs, tissue and cell types)
– Temporal stages (plant growth and developmental stages)
– Phenotype Ontology (in the works)
Qualifying Annotations with supporting evidence
References
Evidence code usage
A controlled vocabulary of codes that state the kind of evidence supporting the association between a gene product and an annotation
– IDA: Inferred from Direct Assay
– IMP: Inferred from Mutant Phenotype
– ISS: Inferred from Sequence Similarity
– IEA: Inferred from Electronic Annotation
– IEP: Inferred from Expression Pattern
– ……
Evidence code description
E.g., IPI: Inferred from Physical Interaction
» Co-immunoprecipitation
» Co-purification
» Co-sedimentation
» ….
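One way to picture an annotation qualified by evidence is as a record that refuses to exist without a known evidence code and a reference. The sketch below is an assumption about how such a record might be modeled, not TAIR's schema. The field names are hypothetical, and only the six codes named on the slide are included.

```python
from dataclasses import dataclass
from datetime import date

# Only the codes listed on the slide; the full set is larger.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IPI": "Inferred from Physical Interaction",
}


@dataclass(frozen=True)
class Annotation:
    gene: str
    term: str            # controlled vocabulary term
    evidence_code: str   # one of EVIDENCE_CODES
    reference: str       # supporting publication
    curator: str         # hypothetical field: who made the annotation
    annotated_on: date   # hypothetical field: when it was made

    def __post_init__(self):
        # Reject annotations that lack valid supporting evidence.
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code!r}")
        if not self.reference:
            raise ValueError("an annotation needs a supporting reference")
```

Rejecting a bad code at construction time is one simple form of automatic checking. Recording the curator and date makes later random checks traceable.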
Gene Annotation Display in TAIR
QC of Literature Curation
• Weekly annotation meeting
• Quality control manager
• Use of standardized vocabularies
• Random checks of annotations
• Annotations are tagged by date and curator
• Automatic checks in software
• Use SOPs – curation guidelines
Curation of Microarray Data
Sources
• Literature
• Public Databases (SMD)
• Community submissions
• Computational analysis
• Functional Genomic projects (AFGC)
Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Microarray Data
• Stocks
• Metabolism
Curation of AFGC Microarray Data
Data Collection and Selection
Sources
• Stanford Microarray Database
• Arabidopsis Functional Genomics Consortium
Information collected
• sample info, proposal abstracts, protocols
• results, array design, minimal descriptions of individual arrays
Selection
• All Arabidopsis public arrays
• Exclude QC arrays (45)
Selected Arrays (516)
• Numeric Results Data (raw and normalized)
• Metadata: Experiments, Samples, Array Elements, Protocols
Curation of Metadata
Array elements
1. Classify, organize, add missing sequences, correct errors
2. Mapping to the Arabidopsis genome & association to genes (pipeline)
Samples & Experiments
1. Data extraction from flat files (abstracts, RNA forms), and database (SMD), e.g., tissue type, treatments, experimental design
2. Organization of data & parsing into tables
3. Develop controlled vocabularies for experiment categorization & treatments
4. Standardization using those vocabularies
5. Data association
– grouping arrays into replicate sets and experiments
– merging replicate samples to minimize redundancy
– linking to other related data (germplasm, clones, publications, people)
6. Annotation
– Experiments: GO process, category, experimental variables
– Samples: tissues (POC anatomy & temporal) & treatment
Data Submission http://arabidopsis.org/info/microarray.submission.jsp
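Steps 4 and 5 for samples (standardizing with a controlled vocabulary, then grouping arrays into replicate sets) can be illustrated with a short sketch. The synonym table, the field names, and the grouping key are hypothetical. Real sample descriptions are messier, and unmapped values would go to a curator.

```python
from collections import defaultdict

# Hypothetical synonym table: free-text tissue description -> controlled term.
TISSUE_SYNONYMS = {
    "leaf": "leaf",
    "leaves": "leaf",
    "rosette leaves": "leaf",
    "root": "root",
    "roots": "root",
    "flower": "flower",
    "flowers": "flower",
}


def standardize_tissue(raw):
    """Return the controlled term, or None so a curator can review the value."""
    key = " ".join(raw.lower().split())
    return TISSUE_SYNONYMS.get(key)


def group_replicates(arrays):
    """Group array ids sharing experiment, standardized tissue, and treatment."""
    sets = defaultdict(list)
    for a in arrays:
        key = (
            a["experiment"],
            standardize_tissue(a["tissue"]),
            a["treatment"].strip().lower(),
        )
        sets[key].append(a["array_id"])
    return dict(sets)
```

Standardizing before grouping is the point of the ordering on the slide: two arrays described as "Leaves" and "rosette leaves" only land in the same replicate set once both map to the same controlled term.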
Curation of Microarray Results Data
Quality control
• Remove poor quality arrays (2)
• Exclude spots flagged as bad
• Re-normalize using lowess method (minimize spatial bias)
• Remove arrays with strong spatial/plate bias (72) (ANOVA)
• Exclude array elements with intensity < 350 in both channels
• Exclude array elements with null values in 80% of arrays
Analysis
• Calculate log2 ratio [ch2N/ch1N]
• Calculate fold change [ch2N/ch1N]
• Calculate averages for each array element (array & replicates)
Numeric Results Data
• Per array: element fold change/log2 ratio, std error
• Per replicate arrays: element fold change/log2 ratio, std error
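The spot-level filters and the ratio calculations above can be sketched as follows. This is a simplified illustration and not the pipeline TAIR used. It assumes the channel intensities are already normalized, and it leaves out the lowess re-normalization and the ANOVA-based bias check. Reading "null values in 80% of arrays" as "80% or more", and deriving fold change from the averaged log2 ratio, are also assumptions.

```python
import math
from statistics import mean, stdev

MIN_INTENSITY = 350  # threshold quoted on the slide


def log2_ratio(ch1, ch2, flagged=False):
    """Return log2(ch2/ch1) for one spot, or None if the spot is excluded."""
    if flagged or ch1 is None or ch2 is None:
        return None
    if ch1 <= 0 or ch2 <= 0:
        return None  # assumption: ratio is undefined, so treat as null
    if ch1 < MIN_INTENSITY and ch2 < MIN_INTENSITY:
        return None  # too weak in both channels
    return math.log2(ch2 / ch1)


def summarize(ratios, max_null_fraction=0.8):
    """Average one array element's log2 ratios across arrays or replicates."""
    valid = [r for r in ratios if r is not None]
    if not ratios or (len(ratios) - len(valid)) / len(ratios) >= max_null_fraction:
        return None  # element excluded: too many null values
    avg = mean(valid)
    std_error = stdev(valid) / math.sqrt(len(valid)) if len(valid) > 1 else None
    return {
        "log2_ratio": avg,
        "fold_change": 2 ** avg,  # assumption: back-transformed mean log2 ratio
        "std_error": std_error,
        "n": len(valid),
    }
```

Averaging in log2 space treats a two-fold increase and a two-fold decrease symmetrically, which is the usual reason for reporting log2 ratios alongside fold change.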
Conclusions
• Requires trained biologists familiar with data
• Can be facilitated computationally (repetitive tasks), but is mainly a knowledge-based task that can only be done by humans
• Essential for assuring data quality
• Adds value to data
• Slow process
• Can be inconsistent