Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science

Post on 23-Aug-2014

Category: Science

DESCRIPTION

Screencast video now at: https://www.youtube.com/watch?v=oe7pjHJU-z4

Talk info at: http://1.usa.gov/1kPcRxC

TRANSCRIPT

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org

Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org

May 14, 2014

CBIIT

Slides: slideshare.net/andrewsu

Citizen Science!

Few genes are well annotated…

Figure: GO annotation counts per gene, with genes sorted by decreasing counts, across 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%.

Data: NCBI, February 2013

… because the literature is sparsely curated?

Figure: Number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis from 0 to 1,200,000).

Figure: Average capacity of human scientist (axis from 0 to 40).

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

The Long Tail is a prolific source of content

Figure: Content produced versus contributors (sorted), divided into a Short Head and a Long Tail.

Domain: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol

Wikipedia is reasonably accurate

Wikipedia has breadth and depth

Figure: Articles and words (millions), Wikipedia versus Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

Filtering, extracting, and summarizing PubMed

Documents

Concepts Review article

Wiki success depends on positive feedback

Figure: Feedback loop linking Gene Wiki page utility, number of users, and number of contributors.

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Gene Wiki has a critical mass of editors

• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles

Good, NAR, 2011

Figure: Editor count (Editors) and edit count (Edits).

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

• Reelin: 98 editors, 703 edits since July 2002
• Heparin: 358 editors, 654 edits since June 2003
• AMPK: 109 editors, 203 edits since March 2004
• RNAi: 394 editors, 994 edits since October 2002

Making the Gene Wiki more computable

Structured annotations

Free text

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel GO annotations

2147 novel DO annotations
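The pipeline on this slide (wikilink, GO exact match, Gene Wiki mapping, candidate assertion) can be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: the function names, the case-insensitive matching rule, and the toy wikilinks are assumptions. Only the Entrez Gene ID 334 and the term GO:0006897 come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateAssertion:
    entrez_gene_id: int   # gene that the Gene Wiki page maps to
    go_id: str            # GO term proposed as a new annotation
    matched_link: str     # wikilink text that triggered the match


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(label.lower().split())


def mine_candidates(entrez_gene_id, wikilinks, go_terms, known_go_ids=frozenset()):
    """Propose GO annotations from wikilinks whose text exactly matches a GO term name.

    go_terms maps GO ID -> term name; known_go_ids holds annotations the gene
    already has, so only novel candidates are returned.
    """
    name_to_id = {normalize(name): go_id for go_id, name in go_terms.items()}
    candidates = []
    for link in wikilinks:
        go_id = name_to_id.get(normalize(link))
        if go_id is not None and go_id not in known_go_ids:
            candidates.append(CandidateAssertion(entrez_gene_id, go_id, link))
    return candidates


# Toy data: the wikilinks are hypothetical, the IDs are from the slide.
go_terms = {"GO:0006897": "endocytosis"}
wikilinks = ["Endocytosis", "Cell membrane"]
candidates = mine_candidates(334, wikilinks, go_terms)
for c in candidates:
    print(c)
```

Filtering against `known_go_ids` is what would make a candidate "novel" in the sense of the counts above; a real pipeline would also need curation of the candidates before they are accepted.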

Gene Wiki content improves enrichment analysis

Figure: Enrichment p-value (PubMed only) versus p-value (PubMed + GW), with regions marked "More significant, PubMed + GW" and "More significant, PubMed only". Highlighted term: Muscle contraction.

Good BM et al., BMC Genomics, 2011

Making the Gene Wiki more computable

Structured annotations

Free text

Analyses

Expansion through outreach and incentives

Example genes: SP-A1, SP-A2, KIF11, LIG3, MIR155, EPHX2

Cardiovascular Gene Wiki Portal

• CAMK2D -- CaM kinase II subunit delta
• CSRP3 -- Cysteine and glycine-rich protein 3
• GJA1 -- Gap junction alpha-1 protein / Connexin-43
• MAPK14 -- Mitogen-activated protein kinase 14 / p38-α
• MYL7 -- Myosin regulatory light chain 2, atrial isoform
• MYL2 -- Myosin regulatory light chain 2, ventricular/cardiac isoform
• PECAM1 -- Platelet endothelial cell adhesion molecule / CD31
• RYR2 -- Ryanodine receptor 2
• ATP2A2 -- Sarcoplasmic/endoplasmic reticulum calcium ATPase 2 / SERCA2
• TNNI3 -- Troponin I, cardiac muscle
• TNNT2 -- Troponin T, cardiac muscle

Peipei Ping, UCLA

The Long Tail of scientists is a valuable source of information on gene function

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

Gene databases are numerous and overlapping

… and hundreds more …

Why is there so much redundancy?

Users

Requests

Resources

Time

Community development

BioGPS emphasizes community extensibility

Why do developers define the gene report view?

BioGPS emphasizes user customizability

http://biogps.org

Community extensibility and user customizability

Utility: A simple and universal plugin interface

Total of > 540 gene-centric online databases registered as BioGPS plugins

Users: BioGPS has critical mass

• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month

Top 10 organizations

1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

Figure: Daily pageviews

Contributors: Explicit and implicit knowledge

540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains

Gene Annotation Query as a Service

http://mygene.info

• High performance
  – 3M hits/month
• Highly scalable
  – 13k species
  – 16M genes
• Weekly data updates
• JSON output
• REST interface
• Python/R/JS libraries
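As an illustration of the REST interface with JSON output, the sketch below only builds request URLs and does not contact the service. It assumes the `v3` API path and the `q`, `species`, `fields`, and `size` query parameters of the current public service, which may differ from the version available at the time of the talk. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the current public API lives under /v3.
BASE_URL = "https://mygene.info/v3"


def build_query_url(query, species="human", fields="symbol,name,entrezgene", size=10):
    """Build a gene query URL; the service answers with JSON."""
    params = {"q": query, "species": species, "fields": fields, "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


def build_gene_url(gene_id, fields="symbol,name,summary"):
    """Build a URL that fetches the annotation for one gene ID."""
    return f"{BASE_URL}/gene/{gene_id}?{urlencode({'fields': fields})}"


query_url = build_query_url("symbol:CDK9")
gene_url = build_gene_url(334)  # Entrez Gene ID used earlier in the slides
print(query_url)
print(gene_url)
```

Fetching either URL with any HTTP client and parsing the body as JSON would give the annotation record; the client libraries mentioned on the slide wrap this same pattern.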

The Long Tail of bioinformaticians can collaboratively build a gene portal.

From crowdsourcing to structured data

The Gene Wiki

Citizen Science

The biomedical literature is growing fast

Figure: Number of new PubMed-indexed articles per year, 1983 to 2013 (y-axis from 0 to 1,200,000).

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Disease mentions in PubMed abstracts

NCBI Disease corpus

• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Four types of disease mentions

• Specific Disease: “Diastrophic dysplasia”
• Disease Class: “Cancers”
• Composite Mention: “prostatic , skin , and lung cancer”
• Modifier: ..the “familial breast cancer” gene , BRCA2..

Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

The Turk

http://en.wikipedia.org/wiki/The_Turk

Amazon Mechanical Turk (AMT)

Requester

For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task

Amazon

Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising

Workers

1. Create tasks

2. Execute

3. Aggregate

Instructions to workers

• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Qualification test

Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3'-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”

Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”

Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”

26 yes / no questions

Qualification test results

33/194 workers passed (17%)

Figure: Qualification test scores by worker, with the threshold for passing and the qualified workers marked.

Simple annotation interface

Click to see instructions

Highlight disease mentions

Experimental design

• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
  – $0.06 per Human Intelligence Task (HIT)
  – HIT = annotate one abstract from PubMed
  – 5 workers annotate each abstract

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

Aggregation function based on simple voting

Panels: 1 or more votes (K=1), K=2, K=3, K=4. Each panel shows the same example passage with the highlights retained at that vote threshold:

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
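A minimal sketch of vote-based aggregation, assuming that two workers agree only when they highlight exactly the same span (the talk does not state the matching rule). The function name and the five workers' highlights are hypothetical and chosen to show how the retained set shrinks as K grows.

```python
from collections import Counter


def aggregate_by_votes(worker_highlights, k):
    """Keep every highlight marked by at least k workers.

    worker_highlights holds one collection of highlighted spans per worker.
    Spans count as the same vote only if they are identical (assumption).
    """
    votes = Counter()
    for highlights in worker_highlights:
        votes.update(set(highlights))  # one vote per worker per span
    return {span for span, count in votes.items() if count >= k}


# Hypothetical highlights from N = 5 workers on the example passage.
workers = [
    {"leukemia", "acute myeloid leukemia", "cancer"},
    {"leukemia", "acute myeloid leukemia", "AML"},
    {"leukemia", "cancer", "hematologic malignancies"},
    {"leukemia", "acute myeloid leukemia", "hematologic malignancies"},
    {"leukemia", "CDK9"},
]

for k in range(1, 5):
    print(k, sorted(aggregate_by_votes(workers, k)))
```

A low K favors recall, since a single worker's highlight is enough, while a high K favors precision, since stray highlights such as a gene name are voted out.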

Comparison to gold standard

F = 0.81, k = 2, N = 5

• 593 documents
• 7 days
• 17 workers
• $192.90
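For reference, the F value reported here is the harmonic mean of precision and recall against the gold standard, F = 2PR / (P + R). The sketch below assumes exact span matching, which the slides do not specify. The function name and the toy spans are hypothetical and unrelated to the reported numbers.

```python
def precision_recall_f(predicted, gold):
    """Score predicted annotations against gold-standard annotations.

    Both arguments are collections of hashable annotations, here
    (document_id, start, end) tuples. Only exact matches count (assumption).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: 4 predicted spans, 5 gold spans, 3 in common.
predicted = [("doc1", 0, 8), ("doc1", 20, 42), ("doc2", 5, 11), ("doc2", 30, 34)]
gold = [("doc1", 0, 8), ("doc1", 20, 42), ("doc2", 5, 11), ("doc2", 50, 58), ("doc3", 1, 9)]

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```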

Figure: F-measure against the gold standard (y-axis from 0 to 0.9, x-axis from 0 to 18), with curves for vote thresholds k and for N = 3, 6, 9, 12, 15, 18 workers per abstract. Max F values shown: 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.

Comparisons to text-mining algorithms

Comparisons to human annotators

Average level of agreement between expert annotators (stage 1): F = 0.76

Average level of agreement between expert annotators (stage 2): F = 0.87

In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.

Information Extraction

1. Find mentions of high level concepts in text

2. Map mentions to specific terms in ontologies

3. Identify relationships between concepts

Annotating the relationships

This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.

subject: GENE

predicate: therapeutic target

object: DISEASE
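One way to hold such an annotation as structured data is a subject-predicate-object record. The sketch below is illustrative only: the class and field names are hypothetical, and pairing CDK9 with "cancer" is one reading of the example passage rather than an annotation taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str          # mention as it appears in the passage
    concept_type: str  # e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.obj.text)


relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    obj=Concept("cancer", "DISEASE"),
)
print(relation.as_triple())
```

Keeping the concept type on each end makes it straightforward to check that a predicate such as "therapeutic target" only links a GENE to a DISEASE.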

Citizen Science at Mark2Cure.org

The Long Tail of citizen scientists can collaboratively annotate biomedical text.

Collaborators

• Doug Howe, ZFIN
• John Hogenesch, U Penn
• Jon Huss, GNF
• Luca de Alfaro, UCSC
• Angel Pizzaro, U Penn
• Faramarz Valafar, SDSU
• Pierre Lindenbaum, Fondation Jean Dausset
• Michael Martone, Rush
• Konrad Koehler, Karo Bio
• Warren Kibbe, Simon Lim, Northwestern
• Lynn Schriml, U Maryland
• Paul Pavlidis, U British Columbia
• Peipei Ping, UCLA
• Many Wikipedia editors
• WP:MCB Project

Group members

Katie Fisch, Karthik Gangavarapu, Louis Gioia, Ben Good, Salvatore Loguercio, Adam Mark, Max Nanis, Ginger Tseung, Chunlei Wu

Contact

http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Key group alumni

Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco

Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/

Funding and Support

(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)

Related AMT work

• [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions. F = 0.88 compared to gold standard.

• [2] Burger et al. (2014) used microtask workers to identify relationships between genes and mutations.

• [3] Aroyo & Welty (2013) used workers to identify relations between concepts in medical text.

[1] Zhai H. et al. (2013) "Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing." J Med Internet Res.

[2] Burger J. et al. (2014) "Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing." MITRE technical report.

[3] Aroyo L. and Welty C. (2013) "Harnessing disagreement in crowdsourcing a relation extraction gold standard." Tech. Rep. RC25371 (WAT1304-058), IBM Research.
