Data Management and Curation at TAIR

Margarita Garcia-Hernandez

PAG 2004

The ‘systems biology’ paradigm

• FACT: huge amounts of data

• NEED: systematic harvesting & easy accessibility (store, sort, interlink)

• PROBLEM: complexity & heterogeneity of data

• CHALLENGE: to describe complete biological systems in an integrated way (organizing, defining relationships, defining metadata standards, interpreting, quality control assessment – DATA MANAGEMENT)


Data Management Flow Chart

Data generation

Database population

Collection

Selection

Organization of similar data types
• Remove redundancy
• Correct errors

Association of different data types
• Establish unambiguous identifiers
• Define and validate relationships

Resolve data heterogeneity - standardization

Data modeling

Data curation

Data dissemination

Annotation – add descriptions
• Define standard vocabulary
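To illustrate the "define and validate relationships" step, here is a minimal Python sketch of a link check between two data types. The function name `find_broken_links`, the record layout, and all identifiers are hypothetical and do not reflect TAIR's actual schema.

```python
def find_broken_links(links, gene_ids, reference_ids):
    """Return the (gene, reference) links that point at an unknown identifier."""
    return [
        (gene, ref)
        for gene, ref in links
        if gene not in gene_ids or ref not in reference_ids
    ]

# Made-up identifiers, for illustration only.
gene_ids = {"gene_1", "gene_2"}
reference_ids = {"ref_1", "ref_2"}
links = [("gene_1", "ref_1"), ("gene_2", "ref_9"), ("gene_3", "ref_2")]

broken = find_broken_links(links, gene_ids, reference_ids)
print(broken)
```

A check like this only works once each record has one unambiguous identifier, which is why that step comes first in the chart.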

Quality Control

Issues

1. Accuracy of information

2. Consistency in format and content

3. Currency (keeping data up to date)

4. Conflicting data

TAIR’s approaches

• Personnel training (Ph.D. level biologists)

• User input

• Source attribution

• Checking curation consistency (computationally and manually)

• Adopt Standard Operating Procedures (SOPs)

• Define and use controlled vocabularies


Many Data Types with Many Sources

Sources
• Literature
• Public Databases
• Community submissions
• Computational analysis
• Functional Genomic projects

Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Expression
• Stocks
• Metabolism

Two Examples of Data Curation

Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Microarray Expression
• Stocks
• Metabolism

Sources
• Literature
• Public Databases (SMD)
• Functional Genomic projects (AFGC)

Literature Curation

PubSearch

• A literature curation management system designed to store and manage the available literature for an organism of interest

PubSearch software is freely available at http://pubsearch.org

Generic Model Organism Database (GMOD) http://www.gmod.org

A joint effort by several model organism databases to develop reusable components for creating new biological databases

Literature Curation

Step 1: Collection of References

Sources: Biosis, PubMed, Agricola (‘Arabidopsis’ in title or abstract)

• Meeting abstracts
• Dissertations
• Textbooks

Remove redundancy

Journal name standardization

Arabidopsis References
• PubSearch DB (21,527) (curation tool)
• TAIR DB (public db)

Full-text papers (scanning, online)
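The redundancy removal and journal name standardization steps could be sketched in Python as follows. This is an illustrative sketch only: the alias table, the `(title, year)` matching key, and the sample records are all made up, and PubSearch's actual matching rules are not described in these slides.

```python
import re

# Hypothetical lookup from journal-name variants to one standard form.
JOURNAL_ALIASES = {
    "plant cell": "Plant Cell",
    "the plant cell": "Plant Cell",
    "plant physiol": "Plant Physiology",
    "plant physiology": "Plant Physiology",
}

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())

def standardize_journal(name):
    """Return the standard journal name, or the input if no alias is known."""
    return JOURNAL_ALIASES.get(normalize(name), name.strip())

def merge_references(*sources):
    """Merge reference lists, keeping one record per (title, year) key."""
    merged = {}
    for source in sources:
        for ref in source:
            key = (normalize(ref["title"]), ref["year"])
            if key not in merged:
                merged[key] = dict(ref, journal=standardize_journal(ref["journal"]))
    return list(merged.values())

# Made-up records standing in for downloads from two bibliographic sources.
source_a = [
    {"title": "A flowering gene in Arabidopsis.", "year": 2003, "journal": "Plant Cell"},
]
source_b = [
    {"title": "A flowering gene in Arabidopsis", "year": 2003, "journal": "The Plant Cell"},
    {"title": "Root hair development", "year": 2002, "journal": "Plant Physiol."},
]

refs = merge_references(source_a, source_b)
print(len(refs), [r["journal"] for r in refs])
```

Matching on a normalized title plus year is a deliberately simple choice. A real pipeline would probably also compare identifiers, authors, and page numbers.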

Literature Curation

Step 2: Assigning References to Genes

• Term list (17,470): known gene names and candidate gene names
• Scanning Arabidopsis references (PubSearch) for terms in the list (programmatically)
• Result per gene: Gene X: Ref 1, Ref 2, ... Ref n
• Reference hit validation (by curators using PubSearch)
• Validated list of references for each gene
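The programmatic scanning step might look like the Python sketch below. It assumes whole-token, case-insensitive matching, which the slides do not specify. The gene names, gene IDs, and reference texts are invented for illustration, and every hit would still go to a curator for validation.

```python
import re
from collections import defaultdict

def build_matcher(terms):
    """Compile one case-insensitive regex matching any term as a whole token."""
    # Try longer terms first so "ABC10" is preferred over its prefix "ABC1".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + pattern + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def candidate_hits(references, term_to_gene):
    """Map each gene to the sorted reference IDs that mention one of its terms."""
    matcher = build_matcher(term_to_gene)
    lookup = {term.lower(): gene for term, gene in term_to_gene.items()}
    hits = defaultdict(set)
    for ref_id, text in references.items():
        for match in matcher.finditer(text):
            hits[lookup[match.group(0).lower()]].add(ref_id)
    return {gene: sorted(ids) for gene, ids in hits.items()}

# Made-up term list and abstracts.
term_to_gene = {"ABC1": "gene_1", "ABC10": "gene_2", "XYZ2": "gene_3"}
references = {
    "ref_1": "The abc1 mutant shows delayed flowering.",
    "ref_2": "ABC10 and XYZ2 interact in roots.",
    "ref_3": "Only ABC100 is discussed here.",
}

hits = candidate_hits(references, term_to_gene)
print(hits)
```

The token boundaries keep `ABC100` from counting as a hit for `ABC1` or `ABC10`. Ambiguous names can still produce false hits, which is what the curator validation step addresses.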

Literature Curation

Step 3: Extracting information

• Gene-centric curation approach
• Each curator is assigned 2 genes per day
• Papers are read and information extracted (following SOPs and using PubSearch curation tool):
– Name validation & add aliases
– Add sequence info
– Assign locus (mapping to the genome by BLAST)
– Merge/split genes
– Write summary sentence
– Correct errors
– Annotation using controlled vocabularies (GO, POC)

Controlled Vocabularies

• A collection of defined terms (organized in a hierarchy) intended to serve as a standard nomenclature

• Provide a common set of terms that users of a single system (or across multiple systems) can share

• Allow retrieval of ALL relevant information

Example:
– Find all the genes that have transporter activity (regardless of how they are named, or what type they are)
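The transporter example can be sketched in Python with a toy hierarchy. The term names, the parent links, and the gene annotations below are invented for illustration and are not taken from GO. The point is that a query on a parent term also returns genes annotated to its more specific child terms.

```python
# Toy hierarchy: each term maps to its parent terms (hypothetical subset).
PARENTS = {
    "transporter activity": [],
    "ion transporter activity": ["transporter activity"],
    "sugar transporter activity": ["transporter activity"],
    "potassium ion transporter activity": ["ion transporter activity"],
    "kinase activity": [],
}

# Hypothetical annotations: gene -> terms assigned by curators.
ANNOTATIONS = {
    "gene_1": ["potassium ion transporter activity"],
    "gene_2": ["sugar transporter activity"],
    "gene_3": ["kinase activity"],
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def genes_annotated_to(term):
    """Return genes annotated to the term or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in ancestors(assigned) for assigned in terms)
    )

transporters = genes_annotated_to("transporter activity")
print(transporters)
```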

Controlled Vocabularies used at TAIR

• Gene Ontology (GO) http://www.geneontology.org/
Goal: to produce a controlled vocabulary for describing genes and proteins that can be applied to all organisms
– Molecular function
– Cellular component
– Biological process

• Plant Ontology Consortium (POC) http://www.plantontology.org/
Gramene, TAIR, Univ Missouri St Louis, MaizeDB, IRIS, MIPS, Oryzabase & Monsanto & Pioneer as collaborators
Goal: to develop structured controlled vocabularies for plant-specific knowledge domains:
– Plant Anatomy (morphology, organs, tissue and cell types)
– Temporal stages (plant growth and developmental stages)
– Phenotype Ontology (in the works)

Qualifying Annotations with supporting evidence

References

Evidence code usage

A controlled vocabulary of codes that records the type of evidence supporting the association between gene products and annotations

– IDA: Inferred from Direct Assay
– IMP: Inferred from Mutant Phenotype
– ISS: Inferred from Sequence Similarity
– IEA: Inferred from Electronic Annotation
– IEP: Inferred from Expression Pattern
– …

Evidence code description

E.g., IPI: Inferred from Physical Interaction
» Co-immunoprecipitation
» Co-purification
» Co-sedimentation
» …
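One way to picture an annotation qualified by evidence and a reference is the Python sketch below. The `Annotation` record and its field names are assumptions made for illustration, not TAIR's data model. The evidence codes are the ones listed on the slide.

```python
from dataclasses import dataclass

# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IPI": "Inferred from Physical Interaction",
}

@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: a gene, a vocabulary term, and its support."""
    gene: str
    term: str
    evidence_code: str
    reference: str

    def __post_init__(self):
        # Reject annotations without a known evidence code or a source.
        if self.evidence_code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence_code}")
        if not self.reference:
            raise ValueError("an annotation needs a supporting reference")

# Made-up example values.
annotation = Annotation("gene_1", "transporter activity", "IDA", "ref_1")
print(annotation.evidence_code, "=", EVIDENCE_CODES[annotation.evidence_code])
```

Rejecting unknown codes when the record is created is one possible form of the automatic software checks mentioned under quality control.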

Gene Annotation Display in TAIR


QC of Literature Curation

• Weekly annotation meeting
• Quality control manager
• Use of standardized vocabularies
• Random checks of annotations
• Annotations are tagged by date and curator
• Automatic checks in software
• Use SOPs – curation guidelines

Curation of Microarray Data

Data Types
• Genes/Gene Products
• Mutant Phenotypes
• Microarray Data
• Stocks
• Metabolism

Sources
• Literature
• Public Databases (SMD)
• Functional Genomic projects (AFGC)

Curation of AFGC Microarray Data

Data Collection and Selection

Sources
• Stanford Microarray Database
• Arabidopsis Functional Genomics Consortium

Data collected
• sample info
• proposal abstracts
• protocols
• results
• array design
• minimal descriptions of individual arrays

Selected Arrays (516)
– All Arabidopsis public arrays
– exclude QC arrays (45)

Numeric Results Data (raw and normalized)

Metadata
• Experiments
• Samples
• Array Elements
• Protocols

Curation of Metadata

Array elements
1. Classify, organize, add missing sequences, correct errors
2. Mapping to the Arabidopsis genome & association to genes (pipeline)

Samples & Experiments
1. Data extraction from flat files (abstracts, RNA forms) and database (SMD), e.g., tissue type, treatments, experimental design
2. Organization of data & parsing into tables
3. Develop controlled vocabularies for experiment categorization & treatments
4. Standardization using those vocabularies
5. Data association
– grouping arrays into replicate sets and experiments
– merging replicate samples to minimize redundancy
– linking to other related data (germplasm, clones, publications, people)
6. Annotation
– Experiments: GO process, category, experimental variables
– Samples: tissues (POC anatomy & temporal) & treatment

Data Submission http://arabidopsis.org/info/microarray.submission.jsp
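Steps 4 and 5 (standardization and grouping arrays into replicate sets) could be sketched in Python as below. The synonym table, the field names, and the grouping key are assumptions made for illustration. The vocabularies and grouping rules TAIR actually used are not given in the slides.

```python
from collections import defaultdict

# Hypothetical synonym table: free-text tissue description -> standard term.
TISSUE_VOCABULARY = {
    "leaf": "leaf",
    "leaves": "leaf",
    "root": "root",
    "roots": "root",
}

def standardize_tissue(free_text):
    """Return the standard tissue term, or None to flag manual curation."""
    key = " ".join(free_text.lower().split())
    return TISSUE_VOCABULARY.get(key)

def group_replicates(arrays):
    """Group array IDs sharing experiment, standardized tissue, and treatment."""
    groups = defaultdict(list)
    for array in arrays:
        key = (
            array["experiment"],
            standardize_tissue(array["tissue"]),
            array["treatment"].strip().lower(),
        )
        groups[key].append(array["array_id"])
    return dict(groups)

# Made-up sample descriptions as they might appear in submitted forms.
arrays = [
    {"array_id": "A1", "experiment": "exp1", "tissue": "Leaves", "treatment": "Cold"},
    {"array_id": "A2", "experiment": "exp1", "tissue": "leaf", "treatment": "cold "},
    {"array_id": "A3", "experiment": "exp1", "tissue": "Roots", "treatment": "cold"},
]

replicate_sets = group_replicates(arrays)
print(replicate_sets)
```

Without the standardization step, "Leaves" and "leaf" would end up in separate groups, so replicate arrays would not be recognized as such.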

Curation of Microarray Results Data

Quality control
– Remove poor quality arrays (2)
– Exclude spots flagged as bad
– Re-normalize using lowess method (minimize spatial bias)
– Remove arrays with strong spatial/plate bias (72) (ANOVA)
– Exclude array elements with intensity < 350 in both channels
– Exclude array elements with null values in 80% of arrays

Analysis
– Calculate log2 ratio [ch2N/ch1N]
– Calculate fold change [ch2N/ch1N]
– Calculate averages for each array element (array & replicates)

Numeric Results Data
– Per array: element, fold change/log2 ratio, std error
– Per replicate arrays: element, fold change/log2 ratio, std error
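The filtering and averaging steps can be sketched in Python as follows. The sketch assumes the intensities are already normalized (the lowess step is not reproduced), treats spots that fail the intensity filter as null, reads "null values in 80% of arrays" as "80% or more", and reports fold change as 2 raised to the mean log2 ratio. All of these are assumptions, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

MIN_INTENSITY = 350       # threshold from the slide
MAX_NULL_FRACTION = 0.8   # "null values in 80% of arrays"

def log2_ratio(ch1n, ch2n):
    """Return log2(ch2N / ch1N), or None if the spot is missing or too dim."""
    if ch1n is None or ch2n is None or ch1n <= 0 or ch2n <= 0:
        return None
    if ch1n < MIN_INTENSITY and ch2n < MIN_INTENSITY:
        return None  # below threshold in both channels
    return math.log2(ch2n / ch1n)

def summarize_element(spots):
    """Summarize one array element from (ch1N, ch2N) pairs across replicates."""
    ratios = [log2_ratio(ch1n, ch2n) for ch1n, ch2n in spots]
    valid = [r for r in ratios if r is not None]
    null_count = len(ratios) - len(valid)
    # Drop the element if too many values are null or too few remain.
    if null_count >= MAX_NULL_FRACTION * len(ratios) or len(valid) < 2:
        return None
    average = mean(valid)
    return {
        "mean_log2_ratio": average,
        "fold_change": 2 ** average,
        "std_error": stdev(valid) / math.sqrt(len(valid)),
    }

# Made-up normalized intensities for one element on three replicate arrays.
summary = summarize_element([(1000, 2000), (500, 1000), (400, 800)])
print(summary)
```

Averaging in log space treats a two-fold increase and a two-fold decrease symmetrically, which is the usual reason for working with log2 ratios rather than raw fold changes.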

Conclusions

• Requires trained biologists familiar with data
• Can be facilitated computationally (repetitive tasks), but is mainly a knowledge-based task that can only be done by humans
• Essential for assuring data quality
• Adds value to data
• Slow process
• Can be inconsistent
