A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes

Andrew Su, Ph.D. (@andrewsu, asu@scripps.edu, http://sulab.org)

January 16, 2014, GMOD 2014


Page 1:

A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes

Andrew Su, Ph.D.

@andrewsu

asu@scripps.edu

http://sulab.org

January 16, 2014

GMOD 2014

Page 2:

Why am I giving this keynote?

Page 3:

Harnessing the crowd…

http://www.flickr.com/photos/portland_mike/6140660504/

Page 4:

… to organize information

http://www.flickr.com/photos/45697441@N00/6629580443

Page 5:

My simplified history of MODs

Page 6:

My simplified history of MODs

Page 7:

GMOD is widely used

199 (!) organizations listed as GMOD users

Page 8:

Does the current model scale?

Page 9:

Does the current model scale?

Page 10:

Does the current model scale?

[Figure: # sequenced genomes (log scale, 1 to 1,000,000) by year (1997 to 2025), with series for Bacteria, Eukaryotes, and Archaea]

Page 11:

Does the current model scale?

Page 12:

The Long Tail of genomic data is being lost

Identified 517 operons and 103 small regulatory RNAs...

Page 13:

The Long Tail of genomic data is being lost

Identified 517 operons and 103 small regulatory RNAs...

Page 14:

At least you can download structured data…

Page 15:

Centralized Model Organism Database concept

CMOD

Page 16:

GMOD as a Service (GaaS)

http://www.flickr.com/photos/aigle_dore/5626312363/

Page 17:

http://www.flickr.com/photos/shannonmary/187131727/

Page 18:

Few genes are well annotated…

[Figure: GO annotation counts (y-axis) for 20,473 protein-coding genes, sorted by decreasing counts (x-axis). Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Percentages marked on the plot: 41%, 65%.]

Data: NCBI, February 2013

Page 19:

… because the literature is sparsely curated?

[Figure: number of PubMed-indexed articles (0 to 1,000,000) by year (1979 to 2009)]

Page 20:

… because the literature is sparsely curated?

[Figure labels: "Average capacity of human scientist" and "Number of articles read by typical scientist"; axis ticks 0, 10, 20]

Page 21:

311,696 articles (1.5% of PubMed) have been cited by GO annotations

Page 22:

Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.

Page 23:

The Long Tail is a prolific source of content

[Figure: content produced (y-axis) by contributors, sorted (x-axis), divided into a Short Head and a Long Tail]

News: Newspapers (Short Head); Blogs (Long Tail)

Video: TV/Hollywood (Short Head); YouTube (Long Tail)

Product reviews: Consumer reports (Short Head); Amazon reviews (Long Tail)

Food reviews: Food critics (Short Head); Yelp (Long Tail)

Talent judging: Olympics (Short Head); American Idol (Long Tail)

Page 24:

Wikipedia is reasonably accurate

Page 25:

Wikipedia has breadth and depth

[Figure: articles and words (millions), Wikipedia vs. Britannica Online]

http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

Page 26:

We can harness the Long Tail of scientists to directly participate in the gene annotation process.

Page 27:

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Review article

Page 28:

Filtering, extracting, and summarizing PubMed

Documents

Concepts

Page 29:

Wiki success depends on a positive feedback loop

Gene wiki page utility

Number of users

Number of contributors

1001

2002

Page 30:

10,000 gene “stubs” within Wikipedia

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Utility

Users

Contributors

Page 31:

Gene Wiki has a critical mass of readers

Total: 4.0 million views / month

Huss, PLoS Biol, 2008; Good, NAR, 2011

Utility

Users

Contributors

Page 32:

Gene Wiki has a critical mass of editors

Increase of ~10,000 words / month from >1,000 edits

Currently 1.42 million words

Approximately equal to 230 full-length articles

Good, NAR, 2011

Utility

Users

Contributors

[Figure axes: Editor count (Editors) and Edit count (Edits)]

Page 33:

A review article for every gene is powerful

References to the literature

Hyperlinks to related concepts

Reelin: 98 editors, 703 edits since July 2002

Heparin: 358 editors, 654 edits since June 2003

AMPK: 109 editors, 203 edits since March 2004

RNAi: 394 editors, 994 edits since October 2002

Page 34:

Making the Gene Wiki more computable

Structured annotations

Free text

Page 35:

Filling the gaps in gene annotation

Wikilink

GO exact match

Gene Wiki mapping

NCBI Entrez Gene: 334

Candidate assertion

GO:0006897

6319 novel GO annotations

2147 novel DO annotations
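
The wikilink matching step on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Gene Wiki pipeline itself: the function names, the one-entry `go_names` lookup, and the example wikilink titles are hypothetical. Only Entrez Gene ID 334 and GO:0006897 (the GO ID for endocytosis) come from the slide.

```python
def normalize(title):
    """Lower-case a title and collapse underscores/whitespace for exact matching."""
    return " ".join(title.replace("_", " ").lower().split())


def candidate_assertions(entrez_gene_id, wikilinks, go_names):
    """Return (gene ID, GO ID) pairs for wikilinks that exactly match a GO term name.

    entrez_gene_id: the gene that the wiki page is mapped to (Gene Wiki mapping).
    wikilinks: titles of the pages linked from that gene's wiki page.
    go_names: dict mapping normalized GO term names to GO IDs.
    """
    found = []
    for link in wikilinks:
        go_id = go_names.get(normalize(link))
        if go_id is not None and (entrez_gene_id, go_id) not in found:
            found.append((entrez_gene_id, go_id))
    return found


# Hypothetical inputs, for illustration only.
go_names = {"endocytosis": "GO:0006897"}
wikilinks = ["Endocytosis", "Protein", "endocytosis"]

print(candidate_assertions(334, wikilinks, go_names))
```

Each returned pair is only a candidate assertion: a wikilink shows that the page mentions the concept, so a pair would still need filtering against existing annotations and some form of review before being counted as novel.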

Page 36:

Gene Wiki content improves enrichment analysis

GO term

Gene list

Concept recognition

PubMed abstracts

Enrichment analysis

axon guidance (GO:0007411)

264 genes

Linked genes through PubMed

811 articles

P = 1.55E-20

Contingency table (rows: gene linked through PubMed; columns: gene annotated with the GO term, 13 + 251 = 264 genes):

                 Term: Yes   Term: No
Linked: Yes             13          2
Linked: No             251      12033
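
The slide does not say which test produced its P value. One common choice for a 2×2 table like this (13, 2, 251, 12033) is a one-sided Fisher exact test, which is the upper tail of a hypergeometric distribution. The sketch below assumes that test, and assumes the first column counts genes annotated with the GO term (since 13 + 251 = 264) and the first row counts genes linked through PubMed. The function and argument names are made up for the example.

```python
from math import comb


def hypergeom_upper_tail(both, linked_only, term_only, neither):
    """One-sided Fisher exact p-value: P(overlap >= observed) for a 2x2 table."""
    term_total = both + term_only        # genes annotated with the GO term
    linked_total = both + linked_only    # genes linked through the literature
    population = both + linked_only + term_only + neither
    max_overlap = min(term_total, linked_total)
    # Count the ways to draw linked_total genes with at least `both` term genes.
    favorable = sum(
        comb(term_total, k) * comb(population - term_total, linked_total - k)
        for k in range(both, max_overlap + 1)
    )
    return favorable / comb(population, linked_total)


p = hypergeom_upper_tail(both=13, linked_only=2, term_only=251, neither=12033)
print(f"{p:.3g}")
```

Under these assumptions the result comes out on the order of 1e-20, the same order of magnitude as the P value on the slide. Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point product of probabilities could hit at this scale.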

Page 37:

Gene Wiki content improves enrichment analysis

GO term

Gene list

Concept recognition

PubMed abstracts + Gene Wiki

Enrichment analysis

muscle contraction (GO:0006936)

87 genes

Linked genes through PubMed: P = 1.0

Linked genes through PubMed + Gene Wiki: P = 1.22E-09

251 articles

87 articles

Page 38:

Gene Wiki content improves enrichment analysis

[Figure: p-value (PubMed only) vs. p-value (PubMed + GW), with regions labeled "More significant: PubMed + GW" and "More significant: PubMed only"; "Muscle contraction" is labeled on the plot]

Page 39:

The Long Tail of scientists is a valuable source of information on gene function

Page 40:

http://fiehnlab.ucdavis.edu/projects/rice_metabolome/

Can we skip text mining?

Page 41:

Wikidata

Provide a database of the world’s knowledge that anyone can edit

- Denny Vrandečić

Page 42:

Wikidata understands scale

Page 43:

Wikidata understands scale

14 million Wikidata items…

…13 million total genes in Entrez Gene

Page 44:

Wikidata understands scale

27 million Wikidata statements…

…150k total GO annotations

Page 45:

Wikidata for biology

Reelin (Q414043), http://www.wikidata.org/wiki/Q414043

is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)

regulates (Property:P128): Neural development (Q1345738)

interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)

Page 46:

Wikidata for biology

Property:P31

Property:P128

Property:P129

Q8054

Q187126

Q1345738

Q1979313

Q423510

Q414043

http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
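
As a rough sketch of how the URL above might be used from a script, the following builds the same `wbgetentities` request and pulls property/item pairs out of the JSON reply. Several details are assumptions rather than anything stated on the slide: the `https://www.wikidata.org` host, the added `format=json` parameter, the placeholder `User-Agent` string, the helper names, and the reply layout (`entities`, `claims`, `mainsnak`, `datavalue`, `value`, `id`) that the parser expects.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"  # assumed host; the slide uses http://wikidata.org


def entity_url(qid, language="en"):
    """Build a wbgetentities URL for one item, asking for a JSON reply."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "languages": language,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def fetch_entity(qid, language="en"):
    """Download one entity record (performs a network request)."""
    request = Request(entity_url(qid, language), headers={"User-Agent": "cmod-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)["entities"][qid]


def item_statements(entity):
    """Yield (property ID, target item ID) pairs for item-valued statements."""
    for prop, claims in entity.get("claims", {}).items():
        for claim in claims:
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and "id" in value:
                yield prop, value["id"]


# Offline demonstration on a hand-written record shaped like the assumed reply.
sample = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}}]}}
print(entity_url("Q414043"))
print(list(item_statements(sample)))
```

Calling `fetch_entity("Q414043")` and passing the result to `item_statements` would list the statements currently stored for that item, which may differ from what the slides showed in 2014.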

Page 48:

Loading genomic data into Wikidata

Entrez Gene

Ensembl

UniProt

UCSC

PDB

RefSeq

Page 49:

Wikidata gene model

Added ~1000 human genes so far…

Page 50:

Wikidata as CMOD?

CMOD

Page 51:

Wikidata as CMOD?

CMOD

Powered by:

CMOD

Page 52:

The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).

Page 53:

Gene Wiki Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; Many Wikipedia editors; WP:MCB Project

Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu

Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco

Funding and Support: (BioGPS: GM83924, Gene Wiki: GM089820)

Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su

Recruiting for student, postdoc, outreach, and/or staff positions!