gene wiki at phenotype rcn annual meeting

The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia. Benjamin Good, Feb. 26, 2013. http://www.slideshare.net/goodb

Upload: goodb

Post on 10-May-2015


Page 1: Gene Wiki at Phenotype RCN annual meeting

The Gene Wiki: Synthesizing knowledge about human genes with Wikipedia

Benjamin Good

Feb. 26, 2013

http://www.slideshare.net/goodb

Page 2: Gene Wiki at Phenotype RCN annual meeting

“Knowledge about human genes”

Page 3: Gene Wiki at Phenotype RCN annual meeting

“Knowledge about human genes”

1) There is a lot

2) It is scattered

Page 4: Gene Wiki at Phenotype RCN annual meeting

Biological knowledge is growing rapidly

• More than 22 million articles indexed in PubMed

• Growing at about 1 million articles/year and rising

Page 5: Gene Wiki at Phenotype RCN annual meeting

Scattered genomic knowledge is a problem

• Scientists faced with new and unfamiliar genes on a daily basis

• Public faced with unfamiliar genes on a daily basis

Hits: IFITM3, TFE3, BEX1, ST8SIA1, TFEB, BEX2, SKP1A, ....

GNF Robotics

Page 6: Gene Wiki at Phenotype RCN annual meeting

Knowledge synthesis

“the pulling together of ideas or information to develop a common framework for understanding”


Page 7: Gene Wiki at Phenotype RCN annual meeting

Knowledge synthesis in biology, aka biocuration

• The production of structured data

Unstructured → Structured

Page 8: Gene Wiki at Phenotype RCN annual meeting

Gene Ontology

“Tool for the unification of biology” [1]

A shared, controlled vocabulary for describing gene function

Molecular Function, Biological Process, Cellular Component

> 10,550 citations in Google Scholar

[1] Nature Genetics. 2000 May;25(1):25-9.

Page 9: Gene Wiki at Phenotype RCN annual meeting

Gene Ontology Annotation Database (‘GOA’)

• Records gene function using gene ontology terms

• Expert synthesis of the knowledge from thousands of articles
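To make "structured data" concrete, here is a minimal Python sketch of what a single annotation record might look like. The class and field names are hypothetical and do not reflect the actual GOA file format; the gene identifier and GO term are the ones shown later in this deck (NCBI Entrez Gene 334, GO:0006897), and the aspect names follow the three branches listed on the Gene Ontology slide.

```python
from dataclasses import dataclass
from typing import Optional

# The three GO aspects named on the Gene Ontology slide.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")


@dataclass(frozen=True)
class GoAnnotation:
    """One structured statement: 'this gene has this GO term'.

    Field names are hypothetical, chosen for illustration only.
    """
    entrez_gene_id: int
    go_id: str
    aspect: str
    evidence_code: Optional[str] = None  # e.g. "IEA" for electronic annotation

    def __post_init__(self):
        # Reject records that are not well formed.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect}")
        well_formed = (
            self.go_id.startswith("GO:")
            and len(self.go_id) == 10
            and self.go_id[3:].isdigit()
        )
        if not well_formed:
            raise ValueError(f"malformed GO identifier: {self.go_id}")


# Illustrative record built from identifiers that appear in this deck.
example = GoAnnotation(
    entrez_gene_id=334,
    go_id="GO:0006897",
    aspect="biological_process",
)
print(example)
```

A record like this can be filtered, counted, and joined with other databases, which is what free text does not allow.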


Page 10: Gene Wiki at Phenotype RCN annual meeting

33k articles become 31 gene annotations

31 function annotations for human gene

Gene Ontology Curators

Page 11: Gene Wiki at Phenotype RCN annual meeting


Great!

Page 12: Gene Wiki at Phenotype RCN annual meeting


BUT

Page 13: Gene Wiki at Phenotype RCN annual meeting


GO annotation is not complete

Page 14: Gene Wiki at Phenotype RCN annual meeting

Many genes are not thoroughly annotated

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis). Biological Process only, with electronic annotation (IEA) added as a second series. Data: NCBI, February 2013]

Page 15: Gene Wiki at Phenotype RCN annual meeting


1 million articles per year....

Page 16: Gene Wiki at Phenotype RCN annual meeting

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 17: Gene Wiki at Phenotype RCN annual meeting

The Long Tail is a prolific source of content

[Figure: content produced (y-axis) against contributors, sorted (x-axis), split into a Short Head and a Long Tail]

Domain: Short Head / Long Tail

News reporting: Newspapers / Blogs

Video: TV/Hollywood / YouTube

Product reviews: Consumer reports / Amazon reviews

Food reviews: Food critics / Yelp

Gene annotation: Bio-curators / ???

Page 18: Gene Wiki at Phenotype RCN annual meeting

Wikipedia successfully harnesses the long tail

• Within top 10 most visited websites

• 14 million+ registered users

[Table: articles, words (millions), and words per article for Wikipedia versus Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]

Page 19: Gene Wiki at Phenotype RCN annual meeting

Wikipedia is reasonably accurate

Page 20: Gene Wiki at Phenotype RCN annual meeting

The Gene Wiki Hypothesis

“We can harness the Long Tail of scientists to directly participate in the gene annotation process.”

-Andrew Su

Page 21: Gene Wiki at Phenotype RCN annual meeting

Goal of the Gene Wiki project

• Enable the creation of a collaboratively written, continuously updated, high quality review article for every human gene.


Page 22: Gene Wiki at Phenotype RCN annual meeting

Filtering, extracting, and summarizing PubMed

Page 23: Gene Wiki at Phenotype RCN annual meeting

Success depends on a positive feedback loop

[Figure: feedback loop linking value of service, number of users, and number of contributors]

Page 24: Gene Wiki at Phenotype RCN annual meeting

Gene “stubs” seed community contributions

• Protein structure

• Symbols and identifiers

• Tissue expression pattern

• Gene Ontology annotations

• Links to structured databases

• Gene summary

• Protein interactions

• Linked references

Page 25: Gene Wiki at Phenotype RCN annual meeting

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

68 editors, 543 edits (as of July 2010)

Page 26: Gene Wiki at Phenotype RCN annual meeting

The Gene Wiki project: 2010 stats

• Value of service: 10,300 articles, 1.2 million words, 67 MB of text (about 1,000 PLoS Biology research articles)

• Number of users: 55 million page views

• Number of contributors: 3,500 editors, 17,000 edits

Page 27: Gene Wiki at Phenotype RCN annual meeting

Monthly growth of words in Gene Wiki articles, page views per month and edits per month between 1 September 2009 and 1 September 2011.

Good B M et al. Nucl. Acids Res. 2012;40:D1255-D1261

© The Author(s) 2011. Published by Oxford University Press.

Page 28: Gene Wiki at Phenotype RCN annual meeting

Why is it working?


Page 29: Gene Wiki at Phenotype RCN annual meeting

Google loves Wikipedia

• ...

• 1.86 million results from Google

• courses

• products

• databases

Page 30: Gene Wiki at Phenotype RCN annual meeting

The Gene Wiki hitches a ride on Wikipedia

CC photo by ff137 on flickr

Page 31: Gene Wiki at Phenotype RCN annual meeting

Take home messages

• Where possible, try to hitch a ride

• Success depends on a positive feedback loop

[Figure: feedback loop linking value, users, and contributors]

Page 32: Gene Wiki at Phenotype RCN annual meeting

But still, many genes lack structured annotation…

[Figure: GO annotation counts (y-axis) for genes sorted by decreasing counts (x-axis). Biological Process only, with electronic annotation (IEA) added as a second series. Data: NCBI, February 2013]

Page 33: Gene Wiki at Phenotype RCN annual meeting

Can we generate structured annotations from the text of the gene wiki?


Great for people to read

?

Great for building software for people to use

Page 34: Gene Wiki at Phenotype RCN annual meeting

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 35: Gene Wiki at Phenotype RCN annual meeting

Document- and concept-centric text mining

Subject, Predicate, Object

Page 36: Gene Wiki at Phenotype RCN annual meeting

Simple text mining for gene annotations

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

Good, BMC Genomics, 2011.
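The labels on this slide suggest a simple pipeline: find a wikilink in a Gene Wiki article, check whether its target exactly matches a GO term name, map the article to its Entrez Gene ID, and emit a candidate assertion. The following Python sketch shows one way that could look. It is not the published implementation; the lookup tables, the article title, and the example sentence are all hypothetical stand-ins, and only the identifiers 334 and GO:0006897 come from the slide.

```python
import re

# Toy stand-ins (assumptions): a GO term name -> ID lookup and a
# Gene Wiki article title -> Entrez Gene ID mapping.
GO_NAME_TO_ID = {"endocytosis": "GO:0006897"}
ARTICLE_TO_ENTREZ = {"Example gene article": 334}  # hypothetical title

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(article_title, wikitext):
    """Return (entrez_gene_id, go_id) pairs for wikilinks that exactly match a GO term name."""
    gene_id = ARTICLE_TO_ENTREZ.get(article_title)
    if gene_id is None:
        return []
    found = []
    for match in WIKILINK.finditer(wikitext):
        # Normalize the link target the way wiki titles are usually compared.
        target = match.group(1).strip().replace("_", " ").lower()
        go_id = GO_NAME_TO_ID.get(target)
        if go_id is not None and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


# Made-up wikitext for illustration.
text = "The protein is involved in [[endocytosis]] and binds [[Calcium|calcium ions]]."
print(candidate_assertions("Example gene article", text))
```

Relying on wikilinks rather than raw text means an editor has already decided that the phrase refers to a specific concept, which is one reason concept-centric text is easier to mine.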

Page 37: Gene Wiki at Phenotype RCN annual meeting

Finding concepts

• NCBO Annotator Web Service

  – Gene Ontology

  – Human Disease Ontology

• Annotator service selected for:

  – Speed, easy API, precision

Clement Jonquet, Nigam H Shah, Mark A Musen (2009). The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics, 56-60. http://bioportal.bioontology.org/annotator

Page 38: Gene Wiki at Phenotype RCN annual meeting

Mining workflow

Gene Wiki articles (10,271)

Filtering, cleanup

Extract concepts (NCBO)

11,022 matched Gene Ontology terms

2,983 matched Disease Ontology terms

Page 39: Gene Wiki at Phenotype RCN annual meeting

Results

[Figure: results for DO and GO, compared to current databases and by manual evaluation on a random sample]

Page 40: Gene Wiki at Phenotype RCN annual meeting

GO problems

• False match (e.g., “Olfactory receptors ... are responsible for the transduction of odorant signals.” The system incorrectly identifies ‘transduction’ (GO:0009293), defined as the transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector.)

• No support in sentence (e.g., “The protein is composed ... including 10 sialic acid residues, which are attached to the protein during posttranslational modification in the Golgi apparatus.” Such sentences may lead to incorrect annotations of ‘Golgi apparatus’ and ‘Posttranslational modification’.)
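The false match above is easy to reproduce with any matcher that looks only at term names and not at context. The sketch below is a deliberately naive dictionary matcher in Python, offered as a stand-in for illustration; it is not how the NCBO Annotator works internally. GO:0009293 is the identifier quoted on this slide; GO:0005794 (Golgi apparatus) is added here as a second dictionary entry, and the sentence is an abridged version of the slide's example.

```python
import re

# Term names keyed to GO IDs. GO:0009293 is quoted on the slide;
# GO:0005794 (Golgi apparatus) is added here for illustration.
GO_TERMS = {
    "transduction": "GO:0009293",
    "golgi apparatus": "GO:0005794",
}


def naive_matches(sentence):
    """Return (term, GO ID) for every term name found as a whole phrase, ignoring context."""
    lowered = sentence.lower()
    hits = []
    for name, go_id in GO_TERMS.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((name, go_id))
    return hits


# Abridged from the slide's example sentence.
sentence = "Olfactory receptors are responsible for the transduction of odorant signals."
print(naive_matches(sentence))
```

The matcher reports the phage-mediated 'transduction' term even though the sentence is about signal transduction, because nothing in a string comparison distinguishes the two senses.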

Page 41: Gene Wiki at Phenotype RCN annual meeting

Applications

• Enrichment analysis

  • even with false positives, text-mined annotations can improve statistical analyses that are tolerant to noise.

• GeneWiki+
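For readers unfamiliar with the enrichment analysis mentioned above, the sketch below shows one common formulation: a one-sided hypergeometric test asking whether a term is over-represented in a gene list. The slides do not say which test was used, so treat the choice of test, the function name, and the numbers in the usage line as assumptions. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value.

    population: genes in the background set
    annotated:  background genes annotated with the term
    sample:     genes in the list being tested
    overlap:    genes in the list annotated with the term

    Returns P(X >= overlap) when `sample` genes are drawn without replacement.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers, for illustration only.
print(enrichment_p_value(population=20000, annotated=200, sample=50, overlap=5))
```

Adding text-mined annotations changes both `annotated` and `overlap` for a term. Because the test aggregates over many genes, a modest number of false positives shifts the result less than it would in a gene-by-gene lookup, which is the sense in which the analysis is tolerant to noise.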

Page 42: Gene Wiki at Phenotype RCN annual meeting

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Page 43: Gene Wiki at Phenotype RCN annual meeting

Dynamic queries across genes, diseases, SNPs

Good, J Biomed Semantics, 2012.

Page 44: Gene Wiki at Phenotype RCN annual meeting

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}
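In Semantic MediaWiki's `#ask` syntax, `<q>...</q>` marks a subquery, so the query above can be read as: human proteins that are associated with a breast cancer page and that have a SNP which is itself associated with a breast cancer page. The Python sketch below restates that logic over a tiny in-memory dataset. All page names are hypothetical placeholders, not real annotations, and the helper functions are illustrative rather than part of any wiki API.

```python
# Hypothetical toy data; page names are placeholders, not real annotations.
CATEGORIES = {
    "GENE_A": {"Human_proteins"},
    "GENE_B": {"Human_proteins"},
    "Disease_X": {"Breast_cancer"},
    "SNP_1": set(),
    "SNP_2": set(),
}
PROPERTIES = {
    ("GENE_A", "is_associated_with"): {"Disease_X"},
    ("GENE_A", "HasSNP"): {"SNP_1"},
    ("GENE_B", "is_associated_with"): {"Disease_X"},
    ("GENE_B", "HasSNP"): {"SNP_2"},
    ("SNP_1", "is_associated_with"): {"Disease_X"},
}


def in_category(page, category):
    """True if the page belongs to the category."""
    return category in CATEGORIES.get(page, set())


def linked_to_category(page, prop, category):
    """True if any value of `prop` on `page` is a page in `category`."""
    targets = PROPERTIES.get((page, prop), set())
    return any(in_category(target, category) for target in targets)


def breast_cancer_proteins_with_snps():
    """Mirror the three conditions of the #ask query, in order."""
    results = []
    for page in sorted(CATEGORIES):
        if not in_category(page, "Human_proteins"):
            continue
        if not linked_to_category(page, "is_associated_with", "Breast_cancer"):
            continue
        snps = PROPERTIES.get((page, "HasSNP"), set())
        if any(linked_to_category(snp, "is_associated_with", "Breast_cancer") for snp in snps):
            results.append(page)
    return results


print(breast_cancer_proteins_with_snps())
```

In this toy data GENE_B is excluded because its only SNP has no disease association, which is the extra filtering the nested subquery contributes.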

OMIM, PharmGKB

Good, J Biomed Semantics, 2012.

Page 45: Gene Wiki at Phenotype RCN annual meeting

OMIM, PharmGKB

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Good, J Biomed Semantics, 2012.

Page 46: Gene Wiki at Phenotype RCN annual meeting

Text mining take home

• Approach depends on corpus

  • concept-centric text has advantages

• Depends a lot on the ontology

  • (same text, same algorithm, completely different results)

• Approach depends on purpose

  • high false positive rates are common but may be acceptable, e.g. enrichment analysis

Page 47: Gene Wiki at Phenotype RCN annual meeting

Can we skip text mining?

http://fiehnlab.ucdavis.edu/projects/Rice_metabolome/

Page 48: Gene Wiki at Phenotype RCN annual meeting

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Page 49: Gene Wiki at Phenotype RCN annual meeting

Wikidata

Reelin (Q414043), http://www.wikidata.org/wiki/Q414043

• is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)

• regulates (Property:P128): Neural development (Q1345738)

• interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

Page 50: Gene Wiki at Phenotype RCN annual meeting

Wikidata

Properties: P31, P128, P129

Items: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
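The URL above calls the `wbgetentities` action of the Wikidata API. As a rough sketch of how a script might use it, the Python below builds the same request and pulls the label and item-valued statements out of a response. The `format=json` parameter, the HTTPS host, the function names, and the assumed response layout (`labels`, `claims`, `mainsnak`, `datavalue`) are not taken from the slides; the live API may return a different structure or require a descriptive User-Agent header, so the parsing is shown on a hand-written sample rather than a live call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"


def entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one entity."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # assumption: ask for a JSON response
    }
    return API + "?" + urlencode(params)


def fetch_entity(entity_id, language="en"):
    """Network call (not run here); returns the entity's JSON as a dict."""
    with urlopen(entity_url(entity_id, language)) as response:
        return json.load(response)["entities"][entity_id]


def summarize(entity, language="en"):
    """Return (label, {property: [target item IDs]}) from an entity dict."""
    label = entity.get("labels", {}).get(language, {}).get("value")
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        targets = []
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            # Keep only statements whose value is another item.
            if isinstance(value, dict) and "id" in value:
                targets.append(value["id"])
        claims[prop] = targets
    return label, claims


# Hand-written sample in the assumed response shape, for illustration only.
sample = {
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "claims": {
        "P31": [
            {"mainsnak": {"datavalue": {"value": {"entity-type": "item", "id": "Q8054"}}}}
        ]
    },
}

print(entity_url("Q414043"))
print(summarize(sample))
```

Calling `summarize(fetch_entity("Q414043"))` would run the same parsing against the live service, subject to the caveats above.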

Page 53: Gene Wiki at Phenotype RCN annual meeting

“We can harness the Long Tail of scientists to directly participate in the gene annotation process.”

-Andrew Su

Page 54: Gene Wiki at Phenotype RCN annual meeting

Gene Wiki acknowledgements

“The Gene Wiki in 2011: community intelligence applied to human gene annotation” Nucleic Acids Research 2012

“Mining the Gene Wiki for Functional Genomic Knowledge” BMC Genomics 2011

“Linking genes to diseases with a SNPedia-Gene Wiki mashup” Journal of Biomedical Semantics 2012

“Building a biomedical semantic network in Wikipedia with Semantic Wiki Links” Database: The Journal of Biological Databases and Curation 2012

“A gene wiki for community annotation of gene function” PLoS Biology 2008

“The Gene Wiki: community intelligence applied to human gene annotation” Nucleic Acids Research 2009

Many Wikipedia editors, WP:MCB Project

http://wordle.com

Page 55: Gene Wiki at Phenotype RCN annual meeting


Funding and Support

NIH / NIGMS (Gene Wiki: GM089820)

Contact: [email protected], @bgood, i9606.blogspot.com, slideshare/goodb

My sister Erin has a PhD in linguistics, lives in Raleigh, and is looking for work in research or teaching.

Help her out!

Page 56: Gene Wiki at Phenotype RCN annual meeting

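Gene Wiki content improves enrichment analysis

[Figure: enrichment p-values using PubMed only plotted against p-values using PubMed + GW, with regions marking terms that are more significant with PubMed + GW and terms that are more significant with PubMed only. “Muscle contraction” is highlighted.]

Good, BMC Genomics, 2011.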