Upload: jan-aerts

Post on 18-Dec-2014

DESCRIPTION

Presentation "AmiGO2: document-oriented approach to ontology software and escaping heartache of SQL" by Seth Carbon at BOSC2012

TRANSCRIPT

Page 1: S Carbon - AmiGO2: document-oriented approach to ontology software and escaping heartache of SQL

Introduction Data as a document What it gets you Development and maintainability Acknowledgments

AmiGO 2: a document-oriented approach to ontology software and escaping the heartache of SQL.

Seth Carbon (with Chris Mungall and Heiko Dietze)

Berkeley BOP (http://berkeleybop.org), Lawrence Berkeley National Lab

13 July 2012

Page 2

Outline

1 Introduction

2 Data as a document

3 What it gets you

4 Development and maintainability

5 Acknowledgments

Page 3

Introduction

Page 4

Where we were at: The Software

AmiGO (http://amigo.geneontology.org) is an open-source web application that allows users to query, browse, and visualize ontologies and related gene product annotation data.

The basic things that we have to do:

Get information about gene products and terms.

Search by text in various fields.

Find direct annotations for a term, filtered by ...

Find all inferred annotations and/or genes to a term, filtered by ...

Page 6

AmiGO 1.8 (term details page)

Page 7

Where we were at: The Problem

More, more, more.

After years of clinging to a core SQL backend, with an increasing number of tricks, extensions, and caches to keep the performance at an acceptable level, things had to change ...

Complicated queries: enrichment, subsets, search, reports, etc.

Data: ~1,500,000 -> ~13,000,000 -> ~80,000,000 -> ???

Provided services

Page 9

How the graph was in SQL

"Give me all of the genes in Drosophila annotated to 'neurogenesis'."

[Chado users should be familiar with our ontology model.]

Page 10

Our Solution: Solr

Solr: a specialized HTTP server over the Lucene document store.

AmiGO 2 has greatly increased in flexibility, speed, reliability, and development turnaround time over its SQL predecessor.

For example: a deep text search from ~30s down to ~0.3s.

It has also made things that were previously not feasible easy to do.

Page 12

Data as a document

Page 13

One minute overview of Solr

It is a document store.

Each document can have any number of named fields.

These field names do not need to be unique; having multiple behaves like a list.

The values of these fields can be any number of atomic types (if they are a string).
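
As a rough illustration of the document model just described, a Solr document can be pictured as a mapping from field names to values, where a repeated field name collapses into a list. The helper name `to_document` is hypothetical and the values are borrowed from the neurogenesis term used in the talk; this is a Python sketch of the idea, not Solr's own API.

```python
from collections import defaultdict


def to_document(pairs):
    """Fold (field, value) pairs into a dict; repeated field names become lists."""
    grouped = defaultdict(list)
    for field, value in pairs:
        grouped[field].append(value)
    # Single-valued fields stay scalar, multi-valued fields stay as lists.
    return {f: v[0] if len(v) == 1 else v for f, v in grouped.items()}


doc = to_document([
    ("id", "GO:0022008"),
    ("label", "neurogenesis"),
    ("synonym", "nervous system cell generation"),
    ("synonym", "neural cell differentiation"),
])
```

Here `doc["synonym"]` ends up as a two-element list while `doc["id"]` remains a plain string.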

Page 14

Example to this point

document_category: ontology_class
is_obsolete: false
label: neurogenesis
label_searchable: neurogenesis
id: GO:0022008
source: biological_process
description: Generation of cells within the nervous system.
description_searchable: Generation of cells within the nervous system.
comment: This term was added by GO_REF:0000021.
comment_searchable: This term was added by GO_REF:0000021.
synonym: nervous system cell generation
synonym: neural cell differentiation
synonym_searchable: nervous system cell generation
synonym_searchable: neural cell differentiation

Page 15

Conversion: how the graph was in SQL

"Give me all of the genes in Drosophila annotated to 'neurogenesis'."

Page 16

Conversion: the graph with Solr

A graph aspect in Solr:

[GO:0048770 GO:0031988 GO:0005623 GO:0031410
 GO:0005575 GO:0031982 GO:0043231 GO:0005622
 GO:0044444 GO:0005737 GO:0043226 GO:0043227
 GO:0044464 GO:0016023 GO:0043229 GO:0044424]

More on this later ...
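
One way to picture where such a list of ids could come from: walk the ontology graph once at load time and store every ancestor of a term in a multi-valued field. The sketch below assumes a toy graph with made-up ids (`T:leaf`, `T:root`, ...) and a hypothetical `closure` helper; it is not the loader AmiGO 2 actually uses.

```python
def closure(term, parents):
    """Return the term plus every ancestor reachable through `parents`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


# Hypothetical toy graph: child -> direct is_a/part_of parents.
parents = {
    "T:leaf": ["T:mid_a", "T:mid_b"],
    "T:mid_a": ["T:root"],
    "T:mid_b": ["T:root"],
}

# One multi-valued closure field per document, sorted for stable output.
doc = {"id": "T:leaf", "isa_partof_closure": sorted(closure("T:leaf", parents))}
```

With the closure stored on each document, "everything under term X" becomes a plain field match instead of a graph traversal at query time.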

Page 17

In for a penny, in for a pound

In fact, why not cram everything in we can? Lucene is designed for data rather larger than what we have.

JSON maps of ids to labels and labels to ids.

Rich graph segments as non-indexed JSON blobs.

Anything that might have been cached.

Want just direct annotations? Add another field.

Page 18

A final schema (annotation)

document_category annotation
id MGI:MGI:107940_:_GO:0014013
bioentity MGI:MGI:107940
bioentity_label Ezh2
bioentity_label_searchable Ezh2
source MGI
date 20110523
taxon NCBITaxon:10090
taxon_label Mus musculus
taxon_label_searchable Mus musculus
reference MGI:MGI:4833736|PMID:20798045
evidence_type IMP
evidence_with MGI:MGI:2661097
annotation_class GO:0014013
annotation_class_label regulation of gliogenesis
annotation_class_label_searchable regulation of gliogenesis
isa_partof_closure_map {"GO:0051239":"regulation of multicellular organismal process","GO:0009987":"cellular process","GO:2000026":"regulation of multicellular organismal development","GO:0048699":"generation of neurons","GO:0065007":"biological regulation","GO:0048869":"cellular developmental process","GO:0007275":"multicellular organismal development","GO:0030154":"cell differentiation","GO:0007399":"nervous system development","GO:0051960":"regulation of nervous system development","GO:0042063":"gliogenesis","GO:0032502":"developmental process","GO:0008150":"biological_process","GO:0032501":"multicellular organismal process","GO:0050767":"regulation of neurogenesis","GO:0050794":"regulation of cellular process","GO:0060284":"regulation of cell development","GO:0050789":"regulation of biological process","GO:0050793":"regulation of developmental process","GO:0014013":"regulation of gliogenesis","GO:0045595":"regulation of cell differentiation","GO:0048468":"cell development","GO:0022008":"neurogenesis"}
isa_partof_closure GO:0051239
isa_partof_closure GO:0009987
...
isa_partof_closure_label regulation of multicellular organismal process
isa_partof_closure_label cellular process
isa_partof_closure_label regulation of multicellular organismal development
...
isa_partof_closure_label_searchable regulation of multicellular organismal process
isa_partof_closure_label_searchable cellular process
isa_partof_closure_label_searchable regulation of multicellular organismal development
...

Page 19

What it gets you

Page 20

Graph example in SQL

"Give me all of the genes in Drosophila annotated to 'neurogenesis'."

SELECT term.name AS superterm_name,
       term.acc AS superterm_acc,
       term.term_type AS superterm_type,
       association.*,
       gene_product.symbol AS gp_symbol,
       gene_product.symbol AS gp_full_name,
       dbxref.xref_dbname AS gp_dbname,
       dbxref.xref_key AS gp_acc,
       species.genus,
       species.species,
       species.ncbi_taxa_id,
       species.common_name
FROM term
  INNER JOIN graph_path ON (term.id = graph_path.term1_id)
  INNER JOIN association ON (graph_path.term2_id = association.term_id)
  INNER JOIN gene_product ON (association.gene_product_id = gene_product.id)
  INNER JOIN species ON (gene_product.species_id = species.id)
  INNER JOIN dbxref ON (gene_product.dbxref_id = dbxref.id)
WHERE term.name = 'neurogenesis'
  AND species.genus = 'Drosophila';

Page 22

Graph example in Solr

"Give me all of the genes in Drosophila annotated to 'neurogenesis'."

Add to URL:

what we want    query arg
any doc         q=*:*
in genes        fq=document_category:"bioentity"
in closure      fq=isa_partof_closure_label:"neurogenesis"
with fly        fq=taxon_label_searchable:"Drosophila"
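
For readers who want to try this, the query arguments can be assembled into a request URL along these lines. The endpoint in `SOLR_SELECT` is a placeholder (the slides do not give the real host or core), and `wt=json` is added here only to ask Solr for a JSON response; treat this as a sketch rather than the AmiGO 2 client code.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint; substitute the real Solr host and core.
SOLR_SELECT = "http://localhost:8983/solr/select"

# A list of pairs (not a dict) so that `fq` can be repeated.
params = [
    ("q", "*:*"),
    ("fq", 'document_category:"bioentity"'),
    ("fq", 'isa_partof_closure_label:"neurogenesis"'),
    ("fq", 'taxon_label_searchable:"Drosophila"'),
    ("wt", "json"),
]

url = SOLR_SELECT + "?" + urlencode(params)

# Round-trip check: decoding the query string gives the filters back.
decoded = parse_qs(urlsplit(url).query)
```

Each `fq` narrows the result set independently, which is why the three filters are sent as separate parameters instead of being joined into one expression.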

Page 24

Text search example in SQL

"Any ontology term that contains a reference to 'pigment', giving certain fields more weight than others, scored and ordered by relevance, transitively related to 'organelle', with the relevant parts highlighted if I want them."

... I'll get back to you.

Page 26

Text search example in Solr

"Any ontology term that contains a reference to 'pigment', giving certain fields more weight than others, scored and ordered by relevance, transitively related to 'organelle', with the relevant parts highlighted if I want them."

Add to URL:

what we want    query arg
only term       fq=document_category:"ontology_class"
has "pigment"   defType=edismax & q=pigment
weights         qf=[...] label_searchable^2 id^2 [...]
in closure      fq=isa_partof_closure_label:"organelle"
highlighting    hl.simple.pre=<em class="hilite"> & hl=true
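
A possible way to build that weighted search from code is sketched below. The `weighted_fields` helper is hypothetical, only the two weights visible on the slide are used (the elided ones are left out), and `hl.simple.post` is an addition of this sketch to close the highlight tag.

```python
from urllib.parse import urlencode


def weighted_fields(weights):
    """Render {'field': boost} as the space-separated 'field^boost' form used by qf."""
    return " ".join(f"{field}^{boost}" for field, boost in weights.items())


# Only the two weights shown on the slide; the elided entries are not guessed.
qf = weighted_fields({"label_searchable": 2, "id": 2})

params = [
    ("defType", "edismax"),
    ("q", "pigment"),
    ("qf", qf),
    ("fq", 'document_category:"ontology_class"'),
    ("fq", 'isa_partof_closure_label:"organelle"'),
    ("hl", "true"),
    ("hl.simple.pre", '<em class="hilite">'),
    ("hl.simple.post", "</em>"),  # assumption: closing tag to match the opener
]

query_string = urlencode(params)
```

Keeping the weights in a dictionary makes it easy to tune relevance without touching the rest of the request.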

Page 28

Ease of Exploration I

Facets can make life a lot easier. From the "neurogenesis" example, a couple of facets we get are:

IMP: 1370
ISO: 495
IGI: 356
IDA: 290
IBA: 104
ISS: 36
TAS: 7
NAS: 6
ISA: 4
IEP: 3
IEA: 2
IRD: 2

...
neuron projection morphogenesis: 636
regulation of neuron differentiation: 533
locomotion: 447
central nervous system development: 408
response to stimulus: 355
...
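
Counts like these can be requested by adding facet parameters to a query and then unpacking the response. The sketch below assumes Solr's default JSON layout, in which each facet comes back as a flat list alternating value and count; the `response` dictionary is a hand-written stand-in that reuses three counts from the list above, and `facet_counts` is a hypothetical helper.

```python
def facet_counts(flat):
    """Pair up a flat [value, count, value, count, ...] facet list."""
    return dict(zip(flat[0::2], flat[1::2]))


# Request side: standard Solr facet parameters, added to the query arguments.
facet_params = [("facet", "true"), ("facet.field", "evidence_type")]

# Hand-written response fragment standing in for a real Solr reply.
response = {
    "facet_counts": {
        "facet_fields": {
            "evidence_type": ["IMP", 1370, "ISO", 495, "IGI", 356],
        }
    }
}

counts = facet_counts(response["facet_counts"]["facet_fields"]["evidence_type"])
```

The resulting mapping is the kind of data a user interface can use to seed its filter controls.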

Page 29

A user interface:

Page 30

Ease of Exploration II

Also, with a small amount of work, calculations like information content.
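
The slide does not say which formula is meant; a common definition takes the information content of a term as the negative log of the fraction of annotations that fall under it. Assuming that definition, and assuming the two counts are read off a facet on a closure field, the calculation could look like this (the function name and the numbers are illustrative only):

```python
import math


def information_content(term_count, total_count):
    """IC = -log2(p), where p is the fraction of annotations under the term."""
    if not 0 < term_count <= total_count:
        raise ValueError("need 0 < term_count <= total_count")
    return -math.log2(term_count / total_count)


# Hypothetical counts: a rare term versus one that covers everything.
ic_specific = information_content(1, 8)
ic_root = information_content(8, 8)
```

Rarely used terms score high, while a term that covers every annotation scores zero.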

Page 31

Caching, speed, and data as a resource

Liberal field creation alleviates the need for a lot of caches of queries.

UI seeding data (facets).

All over HTTP: very easy to add a reverse proxy server in front.

Easy and direct data access for third parties (HTTP clients).

Reusable components such as term completion and spellcheck.

Page 32

All folded into AmiGO 2

Page 33

Development and maintainability

Page 34

Cons

Some contortions to get at the data that we want (e.g. no joins).

We can get by without them with a little thought. Also, more Solr features are coming.

Loading is enabled through a parallel software stack.

But we can leverage a lot of the stuff already out there (e.g. OWL API, SolrJ, etc.).

There is more overhead in the creation and maintenance of the various fields necessary to make this all work out.

We use a configuration and loading manager in OWLTools and AmiGO 2.

Page 36

OWLTools FlexLoader (YAML-based)

id: bbop_ont
description: Test mapping of ontology class for GO.
display_name: Ontology
document_category: ontology_class
weight: 40
boost_weights: id^2.0 label^2.0 description^1.0 comment^0.5 synonym^1.0 alternate_id^1.0
result_weights: label^10.0 id^8.0 description^6.0 source^4.0 synonym^3.0 alternate_id^2.0 comment^1.0
fields:
  - id: label
    description: Common term name.
    display_name: Term
    type: string
    property: [getAnnotationPropertyValues, label]
    searchable: true
  - id: ...
    ...
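
A client that reads this configuration would need to turn a weight string such as `boost_weights` into something it can work with. The sketch below shows one way to do that in Python; `parse_weights` is a hypothetical helper, not part of OWLTools, and the default of 1.0 for a field without an explicit boost is an assumption.

```python
def parse_weights(spec):
    """Turn 'id^2.0 label^2.0' into {'id': 2.0, 'label': 2.0}."""
    weights = {}
    for token in spec.split():
        field, _, boost = token.partition("^")
        # A field without an explicit boost gets 1.0 (an assumption of this sketch).
        weights[field] = float(boost) if boost else 1.0
    return weights


boost_weights = parse_weights(
    "id^2.0 label^2.0 description^1.0 comment^0.5 synonym^1.0 alternate_id^1.0"
)
```

Keeping the weights in the configuration file means search relevance can be adjusted in one place and picked up by both the loader and the clients.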

Page 37

Pros

It's very very fast.

No more long running queries.

No ORMs (or SQL).

Development and debugging are easier for clients: everything is in a uniform return schema/type.

Decreased number of layers necessary to complete a client program: you just need an HTTP client.

New classes of features like JavaScript APIs, autocomplete, and spellcheck ...

Trivial to offer web APIs.

Scales nicely (for example, can chop up store between different servers).

Page 38

The software involved

Solr over Jetty (the store)
License: Apache 2
https://lucene.apache.org/solr/

AmiGO 2 (the clients)
Seth Carbon, Chris Mungall, Shahid Manzoor, Heiko Dietze, Gene Ontology Consortium
License: Modified BSD
http://wiki.geneontology.org/index.php/AmiGO_2
http://amigo2.berkeleybop.org

OWLTools (the loader)
Chris Mungall, Heiko Dietze, Gene Ontology Consortium
License: BSD 2-Clause
https://code.google.com/p/owltools/

Page 39

Acknowledgments

Page 40

Acknowledgments

Berkeley Bioinformatics Open-source Projects

The Gene Ontology Consortium

Saccharomyces Genome Database

All the users of AmiGO

All the future users of AmiGO 2

Page 41

AmiGO 2