Crowdsourcing to structure biological knowledge (USC/ISI)

Post on 10-May-2015

DESCRIPTION

Talk given at USC's Information Sciences Institute (http://www.isi.edu). The AV recording is pretty horrible, but for anyone interested: http://webcasterms1.isi.edu/mediasite/SilverlightPlayer/Default.aspx?peid=89751f8537c44f2fa241db99c793cd231d

TRANSCRIPT

Crowdsourcing to structure biological knowledge

Andrew Su, Ph.D.

Department of Molecular and Experimental Medicine

The Scripps Research Institute

ISI, USC

August 16, 2012

Human genetics underlies human health

~3 billion bases

~23,000 genes

Molecular diagnostics & therapeutics

Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …

“Gene annotation”

Structured gene annotations enable computation

Structured annotations

Few genes are well annotated

[Figure: annotation counts per gene for 23,278 protein-coding genes, with genes sorted by decreasing counts. Series: Gene Ontology (GO), PubMed. Callouts: 38%, 59%.]

Labeled genes: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE

Data: NCBI gene2pubmed, August 2010

Biocuration is a key annotation bottleneck

[Figure: number of PubMed-indexed articles per year, 1979 to 2009; y-axis from 0 to 1,000,000.]

311,696 articles (1.5% of PubMed) have been cited by GO annotations
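As a quick sanity check on the scale involved, the two figures on this slide imply a total PubMed size. The sketch below only rearranges the slide's own numbers; the rounding range is an assumption (that "1.5%" is rounded to one decimal place), not something stated in the talk.

```python
# Figures taken from the slide.
cited_articles = 311_696
stated_share = 0.015  # "1.5% of PubMed"

# Point estimate of the PubMed size implied by the two figures.
implied_total = cited_articles / stated_share

# Assumption: 1.5% is rounded to one decimal place, so the true share
# lies somewhere between 1.45% and 1.55%.
low = cited_articles / 0.0155
high = cited_articles / 0.0145

print(f"implied PubMed size: about {implied_total / 1e6:.1f} million articles")
print(f"plausible range: {low / 1e6:.1f} to {high / 1e6:.1f} million")
```

In other words, roughly 98.5% of the indexed literature had not been cited by any GO annotation at the time of the talk.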

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

The Long Tail is a prolific source of content

[Figure: content produced versus contributors (sorted), divided into a Short Head and a Long Tail.]

Content type: Short Head example; Long Tail example
• News: Newspapers; Blogs
• Video: TV/Hollywood; YouTube
• Product reviews: Consumer reports; Amazon reviews
• Food reviews: Food critics; Yelp
• Talent judging: Olympics; American Idol

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

From crowdsourcing to structured data

The Gene Wiki

Biological Games

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

[Figure: axes labeled Editor count and Edit count; series: Editors, Edits.]

A review article for every gene is powerful

Hyperlinks to related concepts

References to the literature

Reelin: 68 editors, 543 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Document- and concept-centric text mining

Subject Object

Predicate

Simple text mining for gene annotations

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel Gene Ontology annotations

2147 novel Disease Ontology annotations
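The pipeline on this slide (a wikilink whose target exactly matches a GO term name, combined with the Gene Wiki page-to-gene mapping, yields a candidate assertion) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for the talk: the lookup tables, the article title, and the function name are all hypothetical, and only the gene ID 334 and the term GO:0006897 come from the slide.

```python
import re

# Illustrative lookup tables (assumptions, not the talk's actual data).
GO_TERMS = {"endocytosis": "GO:0006897"}          # lowercase GO term name -> GO ID
ARTICLE_TO_GENE = {"Example gene article": 334}   # hypothetical title -> NCBI Entrez Gene ID

# Matches [[Target]], [[Target|label]] and [[Target#section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def candidate_assertions(title, wikitext):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name."""
    gene_id = ARTICLE_TO_GENE.get(title)
    if gene_id is None:
        return []  # page is not mapped to a gene
    found = []
    for target in WIKILINK.findall(wikitext):
        go_id = GO_TERMS.get(target.strip().lower())
        if go_id and (gene_id, go_id) not in found:
            found.append((gene_id, go_id))
    return found


# Made-up article text for demonstration.
text = "This protein is involved in [[endocytosis]] and binds [[heparin]]."
assertions = candidate_assertions("Example gene article", text)
print(assertions)
```

Exact matching keeps the approach simple and precise at the cost of recall; each pair is only a candidate and would still need review before being treated as an annotation.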

Gene Wiki+ for integrative queries

http://genewikiplus.org

mwsync

Dynamic queries across genes, diseases, SNPs

TOP 100 GENES

Gene Wiki+ for integrative queries


{{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: <q>[[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] </q>]]}}

OMIM, PharmGKB


From crowdsourcing to structured data

The Gene Wiki

Biological Games

Not just the biomedical literature…

BioGPS aggregates gene-centric information

http://biogps.org

The plugin interface is simple and universal

• KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}
• STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
• PubMed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}

URL template + gene entity → rendered URL
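The URL-template idea above can be sketched in a few lines: a plugin is just a URL containing `{{Field}}` placeholders, and rendering means substituting identifiers from the gene being viewed. This is a minimal sketch, not BioGPS's actual implementation; the function name and the dictionary layout of the gene entity are assumptions. The placeholder names and the KEGG template come from the slide.

```python
import re
from urllib.parse import quote

# Matches placeholders of the form {{FieldName}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_plugin_url(template, gene):
    """Fill each {{Field}} in a plugin URL template from a gene entity dict."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no field {key!r}")
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical gene entity for TP53, keyed by the placeholder names on the slide.
tp53 = {"Symbol": "TP53", "EntrezGene": 7157, "EnsemblGene": "ENSG00000141510"}

kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
rendered = render_plugin_url(kegg_template, tp53)
print(rendered)
```

Because a plugin only needs a URL pattern, any gene-centric website can be registered without changes on the provider's side, which is what makes the interface "universal".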


Total of 389 gene-centric online databases registered as BioGPS plugins

BioGPS has a critical mass of users

• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week

Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC

[Figure: daily pageviews]

All resources should provide RDF…

Mining structured content from HTML

Defining a data extraction template

TP53 TNF APOE IL6 VEGF …EGFR TGFB1

The BioGPS Semantic Annotator

http://50.112.124.237

All resources should provide flat files…

From crowdsourcing to structured data

The Gene Wiki

Biological Games

Seven million human hours

http://www.flickr.com/photos/archana3k1/4124330493/

Twenty million human hours

http://www.flickr.com/photos/ableman/2171326385/

150 billion human hours per year

http://www.flickr.com/photos/rvp-cw/6243289302/

Using games to fold proteins

Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)

Using games to fold RNAs

http://eterna.cmu.edu/

Using games to align sequences

http://phylo.cs.mcgill.ca

Using games to annotate gene-disease links

http://genegames.org

If it's ‘right’, you get points, then on to the next question

Click the related disease

hurry!

Dizeez players seem pretty smart…

In total:
• 207 unique gamers
• 1045 games played
• 8525 guesses

# Occurrences Gene Disease

7 GAST gastrinoma

7 RBP3 retinoblastoma

7 SSX1 synovial sarcoma

6 TG Graves' disease

6 CRYGC Cataract

6 SOX8 mental retardation

6 WRN Werner syndrome

6 ABL1 leukemia

6 MLL3 leukemia

6 SNAI2 breast carcinoma

Pubmed OMIM PharmGKB Gene Wiki
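A table like the one above can be produced by tallying player guesses per gene-disease pair and ranking pairs by how often they recur. The sketch below assumes that "# Occurrences" counts the number of times players selected a pair; the function name, the threshold parameter, and the toy guess log are all made up for illustration and are not taken from the Dizeez code or data.

```python
from collections import Counter


def tally_guesses(guesses, min_occurrences=2):
    """Rank (gene, disease) pairs by how often players selected them.

    Disease names are lowercased so that case variants count as one pair.
    Pairs seen fewer than `min_occurrences` times are dropped as likely noise.
    """
    counts = Counter((gene, disease.lower()) for gene, disease in guesses)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()
        if n >= min_occurrences
    ]


# Toy guess log (invented for this example).
guesses = [
    ("GAST", "gastrinoma"),
    ("WRN", "Werner syndrome"),
    ("GAST", "Gastrinoma"),
    ("GAST", "gastrinoma"),
    ("WRN", "werner syndrome"),
    ("TG", "cataract"),
]

ranked = tally_guesses(guesses)
for n, gene, disease in ranked:
    print(n, gene, disease)
```

Requiring agreement between several independent players is a simple way to filter out random clicks; the surviving pairs can then be checked against sources such as PubMed, OMIM, PharmGKB, and the Gene Wiki.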

Dizeez players seem pretty smart…

# Occurrences Gene Disease

5 MECOM sarcoma

4 ATF7 cancer

3 ABCB5 acute myeloid leukemia

3 SART1 glioblastoma

3 NCK1 leukemia

3 NEK1 cancer

Pubmed OMIM PharmGKB Gene Wiki


GenESP: Two-player annotation games

COMBO: Genomic predictors for disease

cancer normal

find patterns

make predictions on new samples

cancer

normal
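The "find patterns, make predictions on new samples" workflow is ordinary supervised classification. As a rough illustration only, the sketch below uses a nearest-centroid classifier on made-up two-gene expression values. This is a stand-in technique chosen for brevity, not the method used in COMBO, and all data, names, and values here are hypothetical.

```python
from math import dist
from statistics import mean


def fit_centroids(samples, labels):
    """Find patterns: compute the mean expression profile for each class."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, sample):
    """Make a prediction: pick the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], sample))


# Toy training set: each row holds expression values for two hypothetical genes.
train = [[9.1, 2.0], [8.7, 2.4], [3.2, 7.9], [2.8, 8.3]]
labels = ["cancer", "cancer", "normal", "normal"]

centroids = fit_centroids(train, labels)
prediction = predict(centroids, [8.0, 3.0])
print(prediction)
```

In a real setting the choice of which genes to include as features matters far more than the classifier itself, which is the part a game can ask human players to help with.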


We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Collaborators
• Doug Howe, ZFIN
• John Hogenesch, U Penn
• Jon Huss, GNF
• Luca de Alfaro, UCSC
• Angel Pizzaro, U Penn
• Faramarz Valafar, SDSU
• Pierre Lindenbaum, Fondation Jean Dausset
• Michael Martone, Rush
• Konrad Koehler, Karo Bio
• Warren Kibbe, Simon Lim, Northwestern
• Many Wikipedia editors
• WP:MCB Project

Group members
• Erik Clarke
• Ben Good
• Salvatore Loguercio
• Ian Macleod
• Chunlei Wu

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Summer internships for students!

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/
