Improving Curation Efficiency: User Contributions and Textpresso-Based
Semi-Automation
SAB 2008
WormBase Literature Curators Textpresso
User submission (email, web forms)
First-pass curation
Institution: Sanger Institute
SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
How does data get into WormBase?
Publication
Flagging/Triage
Curation
Current first-pass curation pipeline
Growing desire amongst biocurators for user submissions
The first people to know what data are in a paper are the authors
TAIR partnered with Plant Physiology: web interface for data submission (February 2008); voluntary, link included in acceptance letter
Submitter
Paper identifier
Locus name
Term/descriptor, method
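The submission fields listed above could be captured in a small record type. This is a minimal sketch only: the class name, field names, blank-field check, and sample values are hypothetical and do not reflect the actual TAIR or WormBase schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AuthorSubmission:
    """One author-submitted flag; all names here are illustrative assumptions."""
    submitter: str         # who filled in the form
    paper_identifier: str  # manuscript or database paper ID
    locus_name: str        # gene/locus the statement is about
    term: str              # term or descriptor chosen by the author
    method: str            # how the result was obtained

    def missing_fields(self) -> List[str]:
        """Return the names of fields left blank, so a curator can follow up."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Hypothetical submission with the method left empty.
example = AuthorSubmission(
    submitter="A. Author",
    paper_identifier="PAPER-0001",
    locus_name="LOCUS1",
    term="root development",
    method="",
)
print(example.missing_fields())
```

A record like this gives the triage step something structured to check before a curator looks at the paper.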
User submissions: first-pass flagging/triage
User-submitted first-pass flags - WormBase
User data-submission forms: Expression Pattern
Full-text searching
Keywords and/or categories
Data extraction: Textpresso
Müller, Kenny, and Sternberg. PLoS Biology, November, 2004.
Paper – entity association: pattern matching
Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12
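Paper-transgene association by pattern matching might look roughly like the sketch below. It assumes the common C. elegans naming convention of a short lowercase laboratory prefix followed by Is (integrated) or Ex (extrachromosomal) and a number. The regular expression, function name, and sample sentence are illustrative assumptions, not the pattern Textpresso actually uses.

```python
import re

# Assumed convention: 1-3 lowercase letters, then "Is" or "Ex", then digits.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)[0-9]+\b")


def associate(papers):
    """Map each paper ID to the sorted, de-duplicated transgene-like names in its text."""
    return {
        paper_id: sorted(set(TRANSGENE_RE.findall(text)))
        for paper_id, text in papers.items()
    }


# Invented sample text; only the paper ID and transgene names come from the slide.
papers = {
    "WBPaper00031242": (
        "Animals carrying gqIs3 or gqIs35 were crossed to oxIs12. "
        "The gqIs3 line was used for imaging."
    )
}
print(associate(papers))
```

Names matched this way would still need a manual check, in line with the manually verified accuracy figure reported for transgenes.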
Fact extraction: specialized categories
Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
Textpresso: What data types?
Textpresso-mediated CC curation: from sentences to annotations
Transgenes: 1,100 new paper-transgene connections; 250 new transgenes
checked manually – 95% accuracy; ultimately, connections will go directly into database
Genetic Interactions: 1,875 (1/2007 – 5/2008); ~5,600 total interactions; keeping current with new papers
GO Cellular Component Annotations: 515 (1/2007 – 5/2008); 2-3X rate prior to categories; nearly complete; keeping up with new data (1-2 hours/week)
Textpresso: How much data?
Textpresso: Other data types
How else can we use Textpresso?
Other data types: Molecular Function Assays, Gene Product Interactions
Pilot: GO molecular function annotations for protein kinase activity
keyword: phosphorylate
category: C. elegans proteins
13 new GO annotations/hour
Extension of this: protein modifications – not yet captured in WB
Pilot: Gene product interactions for WB and BIND
keywords: physically interact
category: C. elegans proteins
310 matches in 237 documents
22 physical interactions – top 15 papers
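Both pilots pair a keyword with a category of terms. A minimal sketch of that kind of query follows, assuming matching is done sentence by sentence. The function names, the naive sentence splitter, the protein names, and the sample text are all hypothetical; this is not Textpresso's query engine.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def search(text, keyword, category_terms):
    """Return (sentence, matched_terms) for sentences with the keyword and >= 1 category term."""
    hits = []
    for sentence in split_sentences(text):
        if keyword.lower() not in sentence.lower():
            continue  # keyword (as a substring) is required
        terms = [
            t for t in category_terms
            if re.search(r"\b" + re.escape(t) + r"\b", sentence)
        ]
        if terms:
            hits.append((sentence, terms))
    return hits


# Hypothetical protein names standing in for a "C. elegans proteins" category.
proteins = ["ABC-1", "XYZ-2"]
text = (
    "ABC-1 phosphorylates XYZ-2 in vitro. "
    "XYZ-2 is expressed in neurons. "
    "The kinase phosphorylates an unknown substrate."
)
hits = search(text, "phosphorylate", proteins)
for sentence, terms in hits:
    print(terms, "|", sentence)
```

Requiring both conditions in one sentence keeps the candidate list short enough for a curator to review by hand.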
Textpresso for triage: Classifying text based on content
Multiple levels:
Organismal triage – C. elegans, Drosophila
Identify, prioritize information-rich papers
Flag for specific data types
Multiple strategies (using existing first-pass papers as training set):
Machine learning – SVM (Support Vector Machine)
Word frequency analysis
Hand-crafted categories
Combine SVM and categories
Supplement with word weighting, contextual analyses
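The word-frequency idea among these strategies can be illustrated with a toy scorer trained on already-curated papers. Note that this is not an SVM: it is a simple smoothed log-odds word weighting, swapped in so the example runs with the standard library alone. The training sentences, function names, and decision threshold of zero are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train_weights(positive_docs, negative_docs):
    """Per-word log-odds of appearing in flagged vs. unflagged papers (add-one smoothing)."""
    pos = Counter(w for d in positive_docs for w in tokenize(d))
    neg = Counter(w for d in negative_docs for w in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum of word weights; positive suggests the paper should be flagged."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


# Invented stand-ins for first-pass papers flagged (or not) for expression data.
flagged = [
    "gfp reporter expression was observed in neurons",
    "expression of the reporter in muscle",
]
unflagged = [
    "the mutant suppresses the phenotype",
    "genetic interaction between mutants",
]
weights = train_weights(flagged, unflagged)
print(score("reporter expression in neurons", weights) > 0)
print(score("mutant suppresses phenotype", weights) < 0)
```

In a combined strategy, a score like this could serve as one signal alongside hand-crafted category matches.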
Keeping better track of curation statistics...
...and making curation statistics more transparent to users.
Users could search for curation status of any paper
Users could search for curation status of a given data type
Each database release would report newly curated papers
Each database release would document increases in data-type curation
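One way to picture the four proposals above is a per-release table mapping each paper to the data types curated from it. The sketch below is only an assumption about how such bookkeeping could work; the paper IDs, data-type labels, and function names are hypothetical.

```python
from collections import Counter


def papers_with(status, data_type):
    """Papers in which the given data type has been curated."""
    return sorted(p for p, types in status.items() if data_type in types)


def counts_by_type(status):
    """Number of papers curated for each data type."""
    return Counter(t for types in status.values() for t in types)


def release_report(previous, current):
    """Return (newly curated papers, per-data-type increases) between two releases."""
    new_papers = sorted(p for p in current if current[p] and not previous.get(p))
    before, after = counts_by_type(previous), counts_by_type(current)
    increases = {t: after[t] - before[t] for t in after if after[t] > before[t]}
    return new_papers, increases


# Hypothetical curation status in two consecutive releases.
previous = {"paper_A": {"transgene"}, "paper_B": set()}
current = {
    "paper_A": {"transgene", "genetic_interaction"},
    "paper_B": {"transgene"},
}

print(current["paper_A"])                 # curation status of one paper
print(papers_with(current, "transgene"))  # curation status of one data type
print(release_report(previous, current))  # what changed in this release
```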
WormBase Literature Curation
Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
First Pass, Genetic Interactions: Andrei Petcherski, Caltech
Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
Anatomy Ontology, Cell Function: Raymond Lee, Caltech
Microarrays, SAGE: Igor Antoshechkin, Caltech
Sequence, Gene Structures: Sanger, Wash U
Authors, Papers: Cecilia Nakamura, Daniel Wang
Curation Tools, Database: Juancarlos Chan, Caltech