Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science


Upload: andrew-su

Post on 23-Aug-2014


DESCRIPTION

Screencast video now at: https://www.youtube.com/watch?v=oe7pjHJU-z4

Talk info at: http://1.usa.gov/1kPcRxC

TRANSCRIPT

Page 1: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Andrew Su, Ph.D.

@andrewsu

[email protected]

http://sulab.org

May 14, 2014

CBIIT

Slides: slideshare.net/andrewsu

Citizen Science!

Page 2: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Few genes are well annotated…

[Figure: GO annotation counts per gene, with the 20,473 protein-coding genes sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%. Data: NCBI, February 2013.]

Page 3: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

… because the literature is sparsely curated?

[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; vertical axis from 0 to 1,200,000.]

Page 4: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

… because the literature is sparsely curated?

[Figure: Average capacity of human scientist; axis from 0 to 40.]

Page 5: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 6: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 7: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Long Tail is a prolific source of content

[Figure: Content produced versus Contributors (sorted), divided into a Short Head and a Long Tail.]

Examples (Short Head / Long Tail):
• News: Newspapers / Blogs
• Video: TV and Hollywood / YouTube
• Product reviews: Consumer Reports / Amazon reviews
• Food reviews: Food critics / Yelp
• Talent judging: Olympics / American Idol

Page 8: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Wikipedia is reasonably accurate

Page 9: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Wikipedia has breadth and depth

[Figure: Articles and Words (millions), Wikipedia versus Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]

Page 10: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 11: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

Page 12: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Filtering, extracting, and summarizing PubMed

Documents

Concepts Review article

Page 13: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 14: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Wiki success depends on positive feedback

[Diagram: a feedback loop linking Gene Wiki page utility, number of users, and number of contributors.]

Page 15: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 16: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Utility

Users

Contributors

Page 17: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Gene Wiki has a critical mass of editors

• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles

Good, NAR, 2011

Utility

Users

Contributors

[Figure axes: Editor count (Editors) and Edit count (Edits).]

Page 18: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

• Reelin: 98 editors, 703 edits since July 2002
• Heparin: 358 editors, 654 edits since June 2003
• AMPK: 109 editors, 203 edits since March 2004
• RNAi: 394 editors, 994 edits since October 2002

Page 19: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Making the Gene Wiki more computable

Structured annotations

Free text

Page 20: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

• 6319 novel GO annotations
• 2147 novel DO annotations
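The pipeline on this slide (wikilink, exact match to a GO term, page-to-gene mapping, candidate assertion) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method actually used for the Gene Wiki: the two lookup tables, the page title, and the function name are hypothetical, and only the identifiers 334 and GO:0006897 come from the slide.

```python
import re

# Illustrative lookup tables (assumptions, not real ontology or mapping files).
GO_TERMS = {"endocytosis": "GO:0006897"}      # GO term name -> GO identifier
PAGE_TO_ENTREZ = {"Example gene page": 334}   # Gene Wiki page title -> Entrez Gene ID

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_annotations(page_title, wikitext, existing=frozenset()):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name.

    Pairs already present in `existing` are skipped, so only novel candidates remain.
    """
    gene_id = PAGE_TO_ENTREZ.get(page_title)
    if gene_id is None:
        return []
    found = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip().lower()
        go_id = GO_TERMS.get(target)
        pair = (gene_id, go_id)
        if go_id and pair not in existing and pair not in found:
            found.append(pair)
    return found


sample = "The protein is involved in [[endocytosis]] and binds [[Heparin|heparin]]."
print(candidate_annotations("Example gene page", sample))
```

Candidates produced this way are only assertions to be reviewed; exact string matching will miss synonyms and can pick up links that mention a process without asserting it for the gene.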

Page 21: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Gene Wiki content improves enrichment analysis

[Figure: enrichment p-value (PubMed only) versus p-value (PubMed + GW), with regions marked "More significant: PubMed + GW" and "More significant: PubMed only". Highlighted term: Muscle contraction.]

Good BM et al., BMC Genomics, 2011

Page 22: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Page 23: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Expansion through outreach and incentives

SP-A1, SP-A2, KIF11, LIG3, MIR155, EPHX2

Page 24: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Cardiovascular Gene Wiki Portal

• CAMK2D -- CaM kinase II subunit delta
• CSRP3 -- Cysteine and glycine-rich protein 3
• GJA1 -- Gap junction alpha-1 protein / Connexin-43
• MAPK14 -- Mitogen-activated protein kinase 14 / p38-α
• MYL7 -- Myosin regulatory light chain 2, atrial isoform
• MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform
• PECAM1 -- Platelet endothelial cell adhesion molecule / CD31
• RYR2 -- Ryanodine receptor 2
• ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2
• TNNI3 -- Troponin I, cardiac muscle
• TNNT2 -- Troponin T, cardiac muscle

Peipei Ping, UCLA

Page 25: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Long Tail of scientists is a valuable source of information on gene function

Page 26: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

Page 27: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Gene databases are numerous and overlapping

… and hundreds more …

Page 28: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Why is there so much redundancy?

Users

Requests

Resources

Time

Community development

BioGPS emphasizes community extensibility

Page 29: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Why do developers define the gene report view?

BioGPS emphasizes user customizability

Page 30: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

http://biogps.org

Community extensibility and user customizability

Page 31: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 32: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 33: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 34: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 35: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Page 36: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Utility: A simple and universal plugin interface

Utility

Users

Contributors

Total of > 540 gene-centric online databases registered as BioGPS plugins

Page 37: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Users: BioGPS has critical mass

• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

[Figure: Daily pageviews]

Utility

Users

Contributors

Page 38: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Contributors: Explicit and implicit knowledge

540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains

Utility

Users

Contributors

Page 39: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Gene Annotation Query as a Service

http://mygene.info

• High performance
  – 3M hits/month
• Highly scalable
  – 13k species
  – 16M genes
• Weekly data updates
• JSON output
• REST interface
• Python/R/JS libraries
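As a rough illustration of what "REST interface" and "JSON output" mean in practice, the sketch below builds a query URL and shows how the JSON reply could be read. The `/v3/query` path, the `q`, `species` and `fields` parameters, and the `hits` key are assumptions based on the service's public documentation as I understand it; they may differ from what was live at the time of the talk, so check the current documentation before relying on them. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://mygene.info/v3"  # assumed API root, not stated on the slide


def build_query_url(term, species="human", fields=("symbol", "name", "entrezgene")):
    """Build a gene query URL; parameter names are assumptions (see lead-in)."""
    params = {"q": term, "species": species, "fields": ",".join(fields)}
    return f"{BASE_URL}/query?{urlencode(params)}"


def fetch_hits(term, **kwargs):
    """Send the request and return the list under the assumed "hits" key.

    Needs network access, so it is defined here but not called.
    """
    with urlopen(build_query_url(term, **kwargs)) as response:
        return json.load(response).get("hits", [])


print(build_query_url("CDK9"))
```

Only the URL construction runs here; `fetch_hits` is left uncalled so the example stays runnable offline.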

Page 40: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Long Tail of bioinformaticians can collaboratively build a gene portal.

Page 41: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

Page 42: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The biomedical literature is growing fast

[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; vertical axis from 0 to 1,200,000.]

Page 43: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Page 44: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Disease mentions in PubMed abstracts

NCBI Disease corpus

• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Page 45: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Four types of disease mentions

• Specific Disease: “Diastrophic dysplasia”
• Disease Class: “Cancers”
• Composite Mention: “prostatic , skin , and lung cancer”
• Modifier: ..the “familial breast cancer” gene , BRCA2..

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Page 46: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Page 47: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Turk

http://en.wikipedia.org/wiki/The_Turk

Page 48: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Turk

http://en.wikipedia.org/wiki/The_Turk

Page 49: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Amazon Mechanical Turk (AMT)

Requester

For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task

Amazon

Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising

Workers

1. Create tasks

2. Execute

3. Aggregate

Page 50: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Instructions to workers

• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Page 51: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Qualification test

Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) in trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions

Page 52: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Qualification test results

33/194 workers passed (17%)

[Figure: distribution of qualification test scores across workers, with the threshold for passing and the qualified workers marked.]

Page 53: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Simple annotation interface

Click to see instructions

Highlight disease mentions

Page 54: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Experimental design

• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
  – $0.06 per Human Intelligence Task (HIT)
  – HIT = annotate one abstract from PubMed
  – 5 workers annotate each abstract
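The design implies a fixed number of HITs and a fixed amount of worker pay, which a short calculation makes explicit. This is only the arithmetic implied by the three figures on the slide; the variable names are mine, and the calculation covers worker rewards alone. The total reported later in the talk ($192.90) is higher, and the slides do not itemize the difference.

```python
# Figures taken from the slide.
ABSTRACTS = 593            # abstracts in the training portion of the corpus
WORKERS_PER_ABSTRACT = 5   # independent annotations collected per abstract
REWARD_CENTS = 6           # $0.06 per HIT, kept in cents to avoid float rounding

hits = ABSTRACTS * WORKERS_PER_ABSTRACT
worker_pay_cents = hits * REWARD_CENTS

print(f"{hits} HITs, worker rewards ${worker_pay_cents // 100}.{worker_pay_cents % 100:02d}")
```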

Page 55: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Aggregation function based on simple voting

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Figure: the same passage shown repeatedly, with the spans highlighted by at least K workers for K=1 (1 or more votes), K=2, K=3, and K=4.]
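A vote threshold of this kind can be sketched as below. The slides do not say whether votes were counted per character, per token, or per whole span, so character-level counting is an assumption here, as are the function name and the toy worker data.

```python
from collections import Counter


def aggregate_by_votes(worker_spans, k):
    """Return merged (start, end) spans covering characters marked by at least k workers."""
    votes = Counter()
    for spans in worker_spans:
        marked = set()
        for start, end in spans:
            marked.update(range(start, end))
        votes.update(marked)  # a worker contributes at most one vote per character
    kept = sorted(pos for pos, count in votes.items() if count >= k)
    merged = []
    for pos in kept:
        if merged and pos == merged[-1][1]:
            merged[-1][1] = pos + 1      # extend the current run of adjacent characters
        else:
            merged.append([pos, pos + 1])
    return [tuple(span) for span in merged]


text = "acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL)"


def span(phrase):
    """Locate a phrase in the example text and return its (start, end) offsets."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# Three hypothetical workers highlighting the same sentence.
workers = [
    [span("acute myeloid leukemia"), span("chronic lymphocytic leukemia")],
    [span("acute myeloid leukemia")],
    [span("myeloid leukemia"), span("CLL")],
]

for k in (1, 2, 3):
    print(k, [text[s:e] for s, e in aggregate_by_votes(workers, k)])
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while higher thresholds keep only text that several workers agree on.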

Page 56: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Comparison to gold standard

F = 0.81, k = 2, N = 5

• 593 documents
• 7 days
• 17 workers
• $192.90
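For readers unfamiliar with the F score reported here, the sketch below shows the standard definition: the harmonic mean of precision and recall against the gold standard. Whether the talk scored exact span matches or used a more lenient overlap criterion is not stated, so exact matching of (document, start, end) triples is an assumption, and the data are made up for illustration.

```python
def precision_recall_f(predicted, gold):
    """Compute precision, recall and F1 for exactly matching annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations as (document id, start offset, end offset).
gold = {("doc1", 0, 22), ("doc1", 33, 61), ("doc2", 10, 18), ("doc2", 40, 52)}
crowd = {("doc1", 0, 22), ("doc1", 33, 61), ("doc2", 10, 18), ("doc2", 60, 70)}

p, r, f = precision_recall_f(crowd, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```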

Page 57: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

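Comparison to gold standard

[Figure: score (axis from 0 to 0.9) plotted against N (axis from 0 to 18; labeled points N = 3, 6, 9, 12, 15, 18), with curves labeled by vote threshold k (labels up to k = 8). Max F labels: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]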

Page 58: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

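Comparison to gold standard

[Figure: score (axis from 0 to 0.9) plotted against N (axis from 0 to 18; labeled points N = 3, 6, 9, 12, 15, 18), with curves labeled by vote threshold k (labels up to k = 8). Max F labels: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]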

Page 59: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

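Comparison to gold standard

[Figure: score (axis from 0 to 0.9) plotted against N (axis from 0 to 18; labeled points N = 3, 6, 9, 12, 15, 18), with curves labeled by vote threshold k (labels up to k = 8). Max F labels: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]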

Page 60: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

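Comparison to gold standard

[Figure: score (axis from 0 to 0.9) plotted against N (axis from 0 to 18; labeled points N = 3, 6, 9, 12, 15, 18), with curves labeled by vote threshold k (labels up to k = 8). Max F labels: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]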

Page 61: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Comparisons to text-mining algorithms

Page 62: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Comparisons to human annotators

Average level of agreement between expert annotators (stage 1)

F = 0.76

Page 63: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Comparisons to human annotators

F = 0.76

F = 0.87

Average level of agreement between expert annotators (stage 2)

Page 64: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

In aggregate, our worker ensemble is faster and cheaper than, and as accurate as, a single expert annotator for disease concept recognition.

Page 65: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Page 66: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

[Diagram: subject (GENE), predicate (therapeutic target), object (DISEASE)]

Page 67: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Citizen Science at Mark2Cure.org

Page 68: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

The Long Tail of citizen scientists can collaboratively annotate biomedical text.

Page 69: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Collaborators

• Doug Howe, ZFIN
• John Hogenesch, U Penn
• Jon Huss, GNF
• Luca de Alfaro, UCSC
• Angel Pizzaro, U Penn
• Faramarz Valafar, SDSU
• Pierre Lindenbaum, Fondation Jean Dausset
• Michael Martone, Rush
• Konrad Koehler, Karo Bio
• Warren Kibbe, Simon Lim, Northwestern
• Lynn Schriml, U Maryland
• Paul Pavlidis, U British Columbia
• Peipei Ping, UCLA
• Many Wikipedia editors
• WP:MCB Project

Group members

• Katie Fisch
• Karthik Gangavarapu
• Louis Gioia
• Ben Good
• Salvatore Loguercio
• Adam Mark
• Max Nanis
• Ginger Tseung
• Chunlei Wu

Key group alumni

• Adriel Carolino
• Erik Clarke
• Jon Huss
• Marc Leglise
• Maximilian Ludvigsson
• Ian MacLeod
• Camilo Orozco

Contact

http://[email protected]

@andrewsu

+Andrew Su

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)

Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/

Page 70: Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Related AMT work

• [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions. F = 0.88 compared to the gold standard.

• [2] Burger et al. used microtask workers to identify relationships between genes and mutations.

• [3] Aroyo & Welty used workers to identify relations between concepts in medical text.

[1] Zhai H. et al. (2013) “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res.

[2] Burger, John, et al. (2014) “Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing.” Mitre technical report.

[3] Aroyo, Lora, and Chris Welty (2013) “Harnessing disagreement in crowdsourcing a relation extraction gold standard.” Tech. Rep. RC25371 (WAT1304-058), IBM Research.