Gene Wiki at Phenotype RCN annual meeting
The Gene Wiki: Synthesizing knowledge about human
genes with Wikipedia
Benjamin Good
Feb. 26, 2013
http://www.slideshare.net/goodb
“Knowledge about human genes”
1) There is a lot
2) It is scattered
Biological knowledge is growing rapidly
• More than 22 million articles indexed in PubMed
• Growing at about 1 million/year and rising
Scattered genomic knowledge is a problem
• Scientists faced with new and unfamiliar genes on a daily basis
• Public faced with unfamiliar genes on a daily basis
Hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....
GNF Robotics
Knowledge synthesis
“the pulling together of ideas or information to develop a common framework for understanding”
Knowledge synthesis in biology, aka biocuration
• The production of structured data
Unstructured → Structured
Gene Ontology
“Tool for the unification of biology”[1]
[1] Nature Genetics. 2000 May;25(1):25-9.
A shared, controlled vocabulary for describing gene function
Molecular Function, Biological Process, Cellular Component
> 10,550 Citations in Google Scholar
Gene Ontology Annotation Database (‘GOA’)
• Records gene function using gene ontology terms
• Expert synthesis of the knowledge from thousands of articles
33k articles become 31 gene annotations
31 function annotations for human gene
Gene Ontology Curators
Great!
BUT
GO annotation is not complete
Many genes are not thoroughly annotated
[Figure: GO annotation counts per gene, genes sorted by decreasing counts. Labels: “+ Electronic annotation (IEA)”; “Biological Process only”. Data: NCBI, February 2013]
1 million articles per year....
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
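The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into a short head and a long tail]
• News reporting: newspapers (short head), blogs (long tail)
• Video: TV/Hollywood (short head), YouTube (long tail)
• Product reviews: Consumer Reports (short head), Amazon reviews (long tail)
• Food reviews: food critics (short head), Yelp (long tail)
• Gene annotation: bio-curators (short head), ???? (long tail)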
Wikipedia successfully harnesses the long tail
• Within top 10 most visited websites
• 14 million+ registered users
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
[Table: Wikipedia vs. Britannica Online, compared by articles, words (millions), and words per article]
Wikipedia is reasonably accurate
“We can harness the Long Tail of scientists to directly participate in
the gene annotation process.”
-Andrew Su
The Gene Wiki Hypothesis
Goal of the Gene Wiki project
• Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.
Filtering, extracting, and summarizing PubMed
Success depends on a positive feedback loop
[Figure: feedback loop linking value of service, number of users, and number of contributors]
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Gene “stubs” seed community contributions
A review article for every gene is powerful
Hyperlinks to related concepts
References to the literature
68 editors, 543 edits (as of July 2010)
The Gene Wiki project – 2010 stats
• Value of service: 10,300 articles, 1.2 million words, 67 MB of text (about 1,000 PLoS Biology research articles)
• Number of users: 55 million page views
• Number of contributors: 3,500 editors, 17,000 edits
Monthly growth of words in Gene Wiki articles, page views per month and edits per month between 1 September 2009 and 1 September 2011.
Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261
© The Author(s) 2011. Published by Oxford University Press.
Why is it working?
Google loves Wikipedia
• ...
• 1.86 million results from Google
• courses
• products
• databases
The Gene Wiki hitches a ride on Wikipedia
CC photo by ff137 on flickr
Take home messages
• Where possible, try to hitch a ride
• Success depends on a positive feedback loop among value, users, and contributors
But still, many genes lack structured annotation…
[Figure: GO annotation counts per gene, genes sorted by decreasing counts. Labels: “+ Electronic annotation (IEA)”; “Biological Process only”. Data: NCBI, February 2013]
Can we generate structured annotations from the text of the gene wiki?
Great for people to read
?
Great for building software for people to use
Filtering, extracting, and summarizing PubMed
Documents
Concepts
Document- and concept-centric text mining
Subject, Predicate, Object
Simple text mining for gene annotations
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
Good, BMC Genomics, 2011.
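The wikilink-to-GO step above can be sketched roughly as follows. This is a minimal illustration, not the published pipeline: the function name `candidate_assertions`, the one-entry term table, and the example sentence are all assumptions made for the sketch; only the gene ID (334) and the GO ID (GO:0006897) come from the slide.

```python
import re

# Toy lookup of GO term names to IDs (assumption: a real run would load the
# full ontology). GO:0006897 is the term shown on the slide.
GO_TERMS = {
    "endocytosis": "GO:0006897",
}

# Matches [[Target]], [[Target|label]] and [[Target#section|label]],
# capturing only the link target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def candidate_assertions(entrez_gene_id, wikitext):
    """Return (gene, GO id, link target) for every wikilink whose target
    is an exact, case-insensitive match to a GO term name."""
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = GO_TERMS.get(target)
        if go_id is not None:
            found.append((entrez_gene_id, go_id, target))
    return found

# Hypothetical article text for the gene page mapped to Entrez Gene 334.
text = "The protein is involved in [[endocytosis]] and binds [[Kinesin|kinesin motors]]."
assertions = candidate_assertions(334, text)
print(assertions)
```

Each tuple is only a candidate assertion; nothing in the sketch checks whether the surrounding sentence actually supports it.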
Finding concepts
• NCBO Annotator Web Service
  – Gene Ontology
  – Human Disease Ontology
• Annotator service selected for:
  – Speed, easy API, precision
Clement Jonquet, Nigam H Shah, Mark A Musen, (2009) The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics. 56-60 http://bioportal.bioontology.org/annotator
Mining workflow
1) Gene Wiki articles (10,271)
2) Filtering, cleanup
3) Extract concepts (NCBO)
4) 11,022 matched Gene Ontology terms; 2,983 matched Disease Ontology terms
Results
[Figure: DO and GO results, compared to current databases and by manual evaluation on a random sample]
GO problems
• False match, e.g., “Olfactory receptors .. are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.
• No support in sentence, e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.
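The false-match problem is easy to reproduce with plain label matching. The sketch below is a stand-in for a dictionary-based annotator, not the NCBO Annotator itself; `match_terms` and the one-entry label table are assumptions for illustration, while the sentence and GO:0009293 are taken from the slide.

```python
import re

# One GO label from the slide. In GO, 'transduction' means phage-mediated
# transfer of genetic material, not signal transduction.
GO_LABELS = {
    "transduction": "GO:0009293",
}

def match_terms(sentence, labels):
    """Return (label, GO id) for every label found as a whole word,
    ignoring case. No attempt is made to check the word's sense."""
    hits = []
    for label, go_id in labels.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append((label, go_id))
    return hits

sentence = "Olfactory receptors .. are responsible for the transduction of odorant signals."
hits = match_terms(sentence, GO_LABELS)
print(hits)
```

The string match succeeds even though the sentence is about signal transduction, which is exactly the kind of candidate that has to be caught by manual evaluation or tolerated downstream.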
Applications
• Enrichment analysis
  • even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise.
• GeneWiki+
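To make the enrichment point concrete, here is a minimal sketch of a one-sided hypergeometric test, a common choice for enrichment analysis (the slides do not say which test was used, so treat that as an assumption). All counts below are made up for illustration and are not results from the talk.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 1,000 genes in the background, 20 genes in the hit list.
# Curated annotations only: 30 genes carry the term, 3 of them are hits.
p_curated = hypergeom_pvalue(1000, 30, 20, 3)
# Curated plus text-mined annotations: 45 genes carry the term, 7 are hits.
p_mined = hypergeom_pvalue(1000, 45, 20, 7)

print(f"curated only:         p = {p_curated:.3g}")
print(f"curated + text-mined: p = {p_mined:.3g}")
```

In this toy case the extra annotations add more true hits than background noise, so the term becomes more significant. With a different mix of false positives the p-value could just as well get worse, which is why the evaluation compares both directions.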
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
Good, J Biomed Semantics, 2012.
Dynamic queries across genes, diseases, SNPs
Good, J Biomed Semantics, 2012.
Gene Wiki+ for integrative queries
http://genewikiplus.org
mwsync
{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}
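Read literally, the query asks for pages in Category:Human_proteins that are associated with a page in Category:Breast_cancer and that have a SNP which is itself associated with a page in that category. The sketch below evaluates the same logic over a tiny in-memory wiki. Every page name and the data layout are hypothetical; only the category and property names are taken from the query.

```python
# Hypothetical toy wiki: page name -> categories and property values.
pages = {
    "Gene A": {"categories": {"Human_proteins"},
               "is_associated_with": {"Disease X"}, "HasSNP": {"SNP 1"}},
    "Gene B": {"categories": {"Human_proteins"},
               "is_associated_with": {"Disease X"}, "HasSNP": {"SNP 2"}},
    "Gene C": {"categories": {"Human_proteins"},
               "is_associated_with": set(), "HasSNP": {"SNP 1"}},
    "SNP 1": {"categories": set(), "is_associated_with": {"Disease X"}},
    "SNP 2": {"categories": set(), "is_associated_with": set()},
    "Disease X": {"categories": {"Breast_cancer"}},
}

def in_category(page, category):
    """True if the page exists and belongs to the category."""
    return category in pages.get(page, {}).get("categories", set())

def has_value(page, prop, condition):
    """True if any value of `prop` on `page` satisfies `condition`
    (the role played by a <q>...</q> subquery)."""
    return any(condition(v) for v in pages.get(page, {}).get(prop, ()))

def linked_to_breast_cancer(page):
    return has_value(page, "is_associated_with",
                     lambda d: in_category(d, "Breast_cancer"))

result = sorted(
    p for p in pages
    if in_category(p, "Human_proteins")
    and linked_to_breast_cancer(p)
    and has_value(p, "HasSNP", linked_to_breast_cancer)
)
print(result)
```

Gene B drops out because its SNP has no disease association, and Gene C because the gene itself has none, which mirrors how the two nested subqueries narrow the result.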
…
OMIM, PharmGKB
Good, J Biomed Semantics, 2012.
Text mining take home
• Approach depends on corpus
  • concept-centric text has advantages
• Depends a lot on the ontology
  • (same text, same algorithm, completely different results)
• Approach depends on purpose
  • high false positive rates are common but may be acceptable – e.g. enrichment analysis
Can we skip text mining?
http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/
Wikidata
Provide a database of the world’s knowledge that
anyone can edit
- Denny Vrandečić
Wikidata
Reelin (Q414043)
• is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
• regulates (Property:P128): Neural development (Q1345738)
• interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
http://www.wikidata.org/wiki/Q414043
Wikidata
Properties: P31, P128, P129
Items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
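A request like the one above can be issued and unpacked with the standard library alone. This is a minimal sketch under a few assumptions: `format=json` is added to get a JSON reply, the User-Agent string is a placeholder, the helper names are invented here, and the sample response is hand-written and heavily trimmed to show the shape of the reply, not copied from the live service.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "http://wikidata.org/w/api.php"  # endpoint as given on the slide

def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one entity."""
    params = {"action": "wbgetentities", "ids": entity_id,
              "languages": language, "format": "json"}
    return API + "?" + urlencode(params)

def fetch_entity(entity_id):
    """Fetch the entity over the network (not called in this sketch)."""
    request = Request(entity_url(entity_id),
                      headers={"User-Agent": "gene-wiki-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)

def item_claims(response, entity_id):
    """Map property id -> list of item ids for item-valued statements."""
    claims = response["entities"][entity_id].get("claims", {})
    result = {}
    for prop, statements in claims.items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if not isinstance(value, dict):
                continue  # skip strings, quantities, missing values
            if "id" in value:
                result.setdefault(prop, []).append(value["id"])
            elif "numeric-id" in value:  # older response shape
                result.setdefault(prop, []).append("Q%d" % value["numeric-id"])
    return result

# Hand-written, trimmed sample in the shape of a wbgetentities reply.
sample = {"entities": {"Q414043": {
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {"P31": [{"mainsnak": {"datavalue": {
        "value": {"entity-type": "item", "id": "Q8054"}}}}]},
}}}

print(entity_url("Q414043"))
print(item_claims(sample, "Q414043"))
```

Calling `fetch_entity("Q414043")` and passing the result to `item_claims` would return whatever statements the item carries at that moment, which may differ from the ones shown on the slide.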
Wikidata
http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
“We can harness the Long Tail of scientists to directly participate in
the gene annotation process.”
-Andrew Su
Gene Wiki acknowledgements
“The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012
“Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011
“Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012
“Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012
“A gene wiki for community annotation of gene function” PLoS Biology 2008
“The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009
http://wordle.com
Many Wikipedia editors
WP:MCB Project
Funding and Support
NIH / NIGMS (Gene Wiki: GM089820)
[email protected], @bgood, i9606.blogspot.com, slideshare/goodb
My sister Erin has a PhD in linguistics, lives in Raleigh, and is looking for work in research or teaching.
Help her out!
Gene Wiki content improves enrichment analysis
[Figure: enrichment p-values, PubMed only vs. PubMed + Gene Wiki (GW). Highlighted term: muscle contraction. Regions labeled “More significant, PubMed + GW” and “More significant, PubMed only”]
Good, BMC Genomics, 2011.