
Arabidopsis Information Portal: A Community-Extensible Platform for Open Data

Matt Vaughn

Director, Life Sciences Computing, Texas Advanced Computing Center

University of Texas at Austin

@mattdotvaughn | www.slideshare.net/mattdotvaughn

The Rationale for Araport

• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics (plus technical obsolescence)
• Centralization as a key contributing factor
– Loading of new data into database
– Development of new user experience
– Curation and annotation
– Community support mission
• Araport is designed to be de-centralized and thus sustainable

Modules Proposed: IAIC

‘The design of the AIP will provide core functionality while remaining flexible to encourage multiple contributors and constant innovation.’

IAIC Whitepaper (2012) Plant Cell: “Taking the next step”.

Modules Realized


Core web applications for integration and indexing.


In-house Science Apps

Community Science Apps

Global or faceted search. Will soon extend to community-provided modules.

The GMOD JBrowse app lets you select which data tracks to display (left). An Araport extension, SeqLighter, lets you zoom to sequence (inset).

JBrowse users can select additional epigenomics tracks (obtained live from EPIC CoGe). Users may filter by many attributes (shown: Lab name = Jacobsen Lab).

Araport11: Evidence-based re-annotation of A. thaliana Col-0, incorporating 113 public RNA-seq data sets binned by tissue type. Available as pre-release 3 at araport.org.

ThaleMine gene report pages include…

• Panther Homologs
• Phytozome Homologs
• Physical Interactions via BAR
• Expression Patterns via BAR
• Co-expression via ATTED
• NCBI GeneRIFs
• NCBI Publications
• Gene Ontology associations
• Coming soon: genetic interactions via IntAct.

ThaleMine template queries

• Saved query + user parameters
• Display a dynamic table
• Modify the query

ThaleMine query results

• Alter query filters
• Save results
• Analyze columns
• Alter display columns

ThaleMine results list manipulation


Science Apps backed by Web Services

Predicted phasiRNA sites in Arabidopsis. Blake Meyers Lab, University of Delaware. Deepti Vemaraju, Mayumi Nakano.

Arabidopsis citations by year and category. Nick Provart Lab, University of Toronto. Asher Pasha, Jamie Waese.

Asher Pasha & Nicholas Provart from BAR at University of Toronto.

How It Works

https://www.araport.org

JavaScript in the Browser…

if ( gene && gene.length > 0 )

… calls Araport web services by URL…

$.get('https://api.araport.org…

… which in turn call BAR web services by URL…

http://bar.utoronto.ca/webservices/get_expressologs.php
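The chain of calls above can be sketched end to end. This is a minimal Python sketch, not the app's own JavaScript: the Araport path and the `locus` parameter name are placeholders (the slide elides the real URL), and the network is replaced by injectable stand-in functions so the example runs offline.

```python
import json
from urllib.parse import urlencode

# URL shown on the slide for the upstream BAR service.
BAR_URL = "http://bar.utoronto.ca/webservices/get_expressologs.php"
# Hypothetical Araport endpoint: the slide truncates the real path.
ARAPORT_URL = "https://api.araport.org/HYPOTHETICAL/expressologs"


def araport_service(locus, fetch):
    """Middle hop: Araport forwards the query to BAR and relays the JSON."""
    upstream = BAR_URL + "?" + urlencode({"locus": locus})  # parameter name assumed
    return fetch(upstream)


def science_app(gene, call_araport):
    """First hop: the client checks its input, then calls Araport by URL."""
    if gene and len(gene) > 0:  # mirrors the guard shown on the slide
        url = ARAPORT_URL + "?" + urlencode({"locus": gene})
        return json.loads(call_araport(url))
    return None


def fake_bar(url):
    """Stand-in for the BAR server; returns made-up JSON for any URL."""
    return json.dumps({"requested": url, "result": []})


def fake_araport(url):
    """Stand-in for the Araport API: unpack the locus, call the middle hop."""
    locus = url.split("locus=")[1]
    return araport_service(locus, fake_bar)


reply = science_app("AT2G46830", fake_araport)
print(reply["requested"])
```

The browser code never contacts BAR directly. It only knows the Araport URL, which is what lets Araport add documentation, logging, and failover behind a stable address.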


How It Works

https://www.araport.org

The graph is interactive.
• Users can rearrange nodes by dragging.
• Users can get details by clicking.

This is Cytoscape.
• The graph is drawn by Cytoscape.js.
• This is a free library for JavaScript.

There are many libraries to choose from!
• jsPhyloSVG: phylogenetic trees
• HighCharts: statistical charts
• jQuery DataTables: interactive tables
• d3.js: all sorts of cool stuff
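A graph library such as Cytoscape.js draws whatever nodes and edges it is handed, so most of the app's work is reshaping service output. The sketch below, in Python for brevity, builds the nodes/edges structure that Cytoscape.js accepts as its elements. The gene names are placeholders, and the input shape (a list of pairs) is an assumption about what an interaction service might return.

```python
import json


def to_cytoscape_elements(interactions):
    """Turn (source, target) pairs into Cytoscape.js-style elements."""
    nodes, edges = {}, []
    for source, target in interactions:
        for gene in (source, target):
            # A dict keyed by gene keeps each node unique.
            nodes[gene] = {"data": {"id": gene}}
        edges.append(
            {"data": {"id": f"{source}_{target}", "source": source, "target": target}}
        )
    return {"nodes": list(nodes.values()), "edges": edges}


# Illustrative pairs only, not real interaction data.
pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]
elements = to_cytoscape_elements(pairs)
print(json.dumps(elements, indent=2))
```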


Code Re-use

http://bar.utoronto.ca

The Araport science app (left) reuses code from the pre-existing BAR app (right). The apps look different by choice but they could be made identical.

https://www.araport.org


Key Points

• BAR Interactions Science App
– Example of visualization module
– Uses open-source Cytoscape JavaScript library
– Displays data from BAR web services via Araport-mediated web service API
– Developed at BAR by developers who attended Araport Workshop in 2014
– Similar code deployed at Araport and BAR
• We invite you to develop a visualization module
– Araport engineers available to provide technical support and advice

Pure Data Web Services

Contributed by SUBA group Nov 2015

Cornelia Hooper & Ian Castleden from University of Western Australia.



at2g46830


Automatic Documentation and User Interface

These are the service endpoints
• The endpoint is the verb in the URL.
• The verb is followed by one or more parameters.
• Example: araport/suba/search?locus=AT2G46830

Standard service endpoints at Araport
• /list = which IDs work with this service?
• /search = what are the details for a given ID?
• /prov = who provided this data and how?
• /stats = number of accesses, number of unique users
• /health = results of status check on underlying service

Automatic documentation is generated from simple metadata provided when the service is created. We translate it to the OpenAPI (née Swagger) specification.

That, in turn, is used to build the UI (and language libraries).
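The endpoint convention is regular enough to compose mechanically. The following is a small sketch of that idea; the helper name `endpoint` and the choice to reject unknown verbs are illustrative decisions, not part of the Araport tooling.

```python
from urllib.parse import urlencode

# The five standard verbs listed on the slide.
STANDARD_VERBS = ("list", "search", "prov", "stats", "health")


def endpoint(namespace, service, verb, **params):
    """Build a relative service path such as araport/suba/search?locus=..."""
    if verb not in STANDARD_VERBS:
        raise ValueError(f"unknown endpoint verb: {verb}")
    path = f"{namespace}/{service}/{verb}"
    return f"{path}?{urlencode(params)}" if params else path


print(endpoint("araport", "suba", "search", locus="AT2G46830"))
print(endpoint("araport", "suba", "health"))
```

Because every service answers the same verbs, a client that can query one service can probe any other for its IDs, provenance, and health without reading service-specific documentation first.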


How does it work?

This URL returns this data, in JavaScript-friendly JSON format.*

This transcription factor localizes to the nucleus.

*Not mandatory for Araport APIs but preferred!
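JSON is preferred because a client can read the answer in a couple of lines. The sketch below parses a made-up response body; the field names `result`, `locus`, and `location` are assumptions for illustration and may not match the real SUBA service output.

```python
import json

# Made-up response body; the field names are assumptions for illustration.
body = '{"result": [{"locus": "AT2G46830", "location": "nucleus"}]}'

records = json.loads(body)["result"]
for record in records:
    print(f'{record["locus"]} localizes to {record["location"]}')
```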


Web Service Module

• SUBA provides a web service module to Araport
– URL query takes an Arabidopsis locus as parameter
– URL responds with a web page full of data
  • The data is not formatted for display to humans (e.g. HTML)
  • The data is formatted for JavaScript parsers (in JSON format)
– The service is REST-like in that the data exchange is achieved with just the standard web protocol, HTTP. This is a modern standard for data exchange.
• We can all use this module!
– Build a Science App that colors genes and pathways
– Build a Science App that scores predicted interactions
– ThaleMine could add subcellular localization to gene lists with ease
• We invite you to develop a web service module
– Araport will provide tech support plus documentation & indexing
– Araport will promote auto-discovery, interoperability
– There’s even provisional support for hosting data that doesn’t have an existing web presence yet
– The docs and tooling are getting better all the time (but…)

SUBA module was developed without Araport staff intervention

1. SUBA created a web service at their university.
• Added local URLs that return JSON instead of HTML.
• Re-used their existing database and web server.

2. SUBA wrote an Araport adapter to transform their service into a REST-like API.
• Wrote a small program in Python (2 and 3 supported).
• Program calls their URL and prints results in JSON format.
• Added metadata in YAML format (for auto documentation).
• Saved code to a source code repository on Bitbucket.

3. SUBA deployed the adapter to the Araport platform.
• Used ‘curl’ to send Araport the URL of the source code repository.
• Araport checks out the code, compiles it, containerizes it, deploys it.
• Araport generates interactive documentation using Swagger.
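Step 2 is the only code a contributor writes. The following is a minimal sketch of what such an adapter might look like, not the SUBA group's actual program: the upstream URL is a placeholder, and the `search(args, fetch)` signature and record fields are hypothetical rather than the platform's documented interface.

```python
import json
from urllib.parse import urlencode

# Placeholder for the provider's own JSON URL (step 1); not the real address.
UPSTREAM = "http://provider.example/localization"


def search(args, fetch):
    """Adapter entry point: call the provider's URL, print results as JSON.

    `args` holds the query parameters; `fetch` takes a URL and returns text.
    Both names are illustrative, not the platform's documented interface.
    """
    url = UPSTREAM + "?" + urlencode({"locus": args["locus"]})
    payload = json.loads(fetch(url))
    # Reshape the provider's reply into one flat record per locus.
    record = {"locus": args["locus"], "location": payload.get("location")}
    print(json.dumps(record, sort_keys=True))
    return record


def fake_fetch(url):
    """Offline stand-in for the provider's web server."""
    return json.dumps({"location": "nucleus"})


result = search({"locus": "AT2G46830"}, fake_fetch)
```

In a deployed adapter, `fetch` would be a real HTTP call (for example via `urllib.request.urlopen`). Passing it in as a parameter keeps the reshaping logic testable without network access.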

http://suba.plantenergy.uwa.edu.au/suba-app…

https://bitbucket.org/athaliana/suba-araport

$ curl -kL -X POST -H "$BEARER_TOKEN" -F "git_repository=https://bitbucket.org/athaliana/suba-araport" https://api.araport.org/community/v0.3


Araport Developer Support: 2016


Summary

• The Araport project
– Provides evidence-based annotation for Col-0
– Performs extensive baseline data integration
– Supports the Arabidopsis research community
• The Araport platform
– Enables and hosts modules from community contributors
  • Members gain visibility, accessibility, discoverability
  • Members benefit from documentation, tech support
• Community modules can come in multiple forms
– Visualization Science Apps using JavaScript libraries
– Pure data interchange as REST-like web services
– Computation Science Apps (analysis code + support for running it)
– JBrowse tracks as RESTful web services
• Technical improvements are ongoing
– Improved developer support and tooling
– Federated search, ontology-based interoperation
– User workspaces, drag & drop combinations

Araport Developer Workshops

Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.

The Atted Science App Tutorial is freely available on GitHub. Other training material at araport.org/devzone

Sign up to get updates on 2016 workshop: https://www.araport.org/contact

Acknowledgements

Araport Data Sources and Module Providers


Acknowledgements

J Craig Venter Institute
• Chris Town (PI)
• Jason Miller
• Agnes Chan
• Erik Ferlanti
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Alumni: Konstantinos Krampis, Svetlana Karamycheva, Maria Kim, Ben Rosen, Christopher Nelson, Seth Schobel

University of Cambridge
• Gos Micklem
• Sergio Contrino

Funding Agencies

Texas Advanced Computing Center
• Matt Vaughn
• Josue Balandrano Coronel
• Matt Hanlon
• Rion Dooley
• Joe Stubbs
• Alex Rocha
• John Gentle
Alumni: Walter Moreira, Steve Mock

Araport11 Genome Annotation

[Workflow diagram. Labels: TAIR10 Annotation; NCBI SRA RNA-seq (113 public RNAseq samples); PASA, Trinity, BLAST,…; UniProt Update; NCBI Novel Models; Maker Novel Models; Araport11 Protein Coding Genes]

https://www.araport.org/data/araport11

Araport11 Pre-release 3 (Dec 2015)

• Available via ThaleMine, JBrowse, FTP, APIs

Categories                            TAIR10    Araport11
Gene Loci
  Protein coding loci                 27,416    27,667
  Novel loci in Araport11                       719
  Gene loci with splice isoform       5,665     10,698
Transcripts
  Transcript isoforms                 35,385    48,389
Transcripts altered in Araport11
  CDS altered                                   1,191
  UTR altered                                   24,185

To install a Science App:

📝 Fill out this form.

🕒 Wait a few minutes.

😃 Test the app.

📬 Notify the App Store.

Araport provisions a virtual machine. Araport obtains the source code. Araport installs the program.

Web Services

Also known as APIs:
• Application Programming Interfaces

Computer programs that…
• Run on a web server.
• Use HTTP for communication.

The query is a URL:
• http://my.url/gene?AT2G46830

The response is a “web page”:
• Format is JSON not HTML.
• Simple to read, simple to parse.
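The query-in, JSON-out pattern fits in a few lines. Below is a minimal sketch of the server side, written as a plain function so it runs without a network; the lookup table and its contents are placeholders, and a real service would wrap this logic in a web framework or Python's `http.server`.

```python
import json
from urllib.parse import urlsplit

# Toy lookup table standing in for a real database; values are placeholders.
GENES = {"AT2G46830": {"locus": "AT2G46830", "note": "example record"}}


def handle(url):
    """Answer a query URL with an HTTP-style status code and a JSON body."""
    parts = urlsplit(url)
    locus = parts.query.upper()  # the slide's query string is the bare locus
    if parts.path != "/gene":
        return 404, json.dumps({"error": "unknown endpoint"})
    if locus not in GENES:
        return 404, json.dumps({"error": "unknown locus"})
    return 200, json.dumps(GENES[locus])


status, body = handle("http://my.url/gene?AT2G46830")
print(status, body)
```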


Web Service: SUBA
• Query by URL
• Response in JSON
• Araport online documentation

Web Service: KEGG
• Query by URL
• Response in JSON
• Araport online documentation

Science Apps

Computer programs that are…
• Hosted on a web server.
• Run in the browser.
• Written in JavaScript.

Obtain data by…
• Web services!

Useful for…
• Interactive science.
• Cool visualizations.


JCVI Expression Profile web service (left) and science app (right). Erik Ferlanti, JCVI senior software engineer.

KEGG Pathways web service (left) and science app (right). Brian Liu, intern at JCVI.

PhosPhAt Phosphorylation web service (left) and science app (right). Ismail Liban, intern at JCVI.


Araport Developer Workshops

Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.

The Atted Science App Tutorial is available as open source on GitHub.

Next workshop: Winter 2015

Acknowledgements

J Craig Venter Institute
• Chris Town
• Jason Miller
• Agnes Chan
• Maria Kim
• Erik Ferlanti
• Seth Schobel
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Former members
• Ben Rosen
• Christopher Nelson
• Konstantinos Krampis
• Svetlana Karamycheva

University of Cambridge
• Gos Micklem
• Sergio Contrino

Texas Advanced Computing Center
• Matt Vaughn
• Steve Mock
• Rion Dooley
• Matt Hanlon
• Joe Stubbs
• Walter Moreira
• Chris Jordan

Funding Agencies

Data Sources


Araport User Workspaces

• Status
– Prototype available in 2015
  • Grid layout (user adds rows or columns)
  • User adds Science Apps to grid (app isolation is goal)
– Coming soon
  • Drag and drop
  • Communications bus (blast app sends results to viz app)
– Coming later
  • Automatic discovery (blast app finds my alignment app)
  • Shared workspaces
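A communications bus lets isolated apps exchange results without holding references to each other. The sketch below shows the general publish/subscribe pattern in Python; it is not Araport's implementation, and the topic name `blast.results` and the message fields are invented for the example.

```python
from collections import defaultdict


class Bus:
    """Tiny publish/subscribe bus shared by the apps in one workspace."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber; return how many apps received it.
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


bus = Bus()
received = []

# The "viz app" listens for alignment hits without knowing who sends them.
bus.subscribe("blast.results", received.append)

# The "blast app" announces its results without knowing who listens.
delivered = bus.publish("blast.results", {"query": "AT2G46830", "hits": []})
print(delivered, received)
```

Because neither app names the other, the same mechanism supports the later goal of automatic discovery: an app only needs to learn which topics exist.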


Infrastructure Challenges 1

• Federated search
– Prototype: single search returns results from
  • The Araport content management system (Drupal)
  • The Araport data warehouse (ThaleMine, Lucene)
  • The Araport genome browser (JBrowse metadata)
– Goal during development:
  • Extend search to 3rd party indexes (NCBI, EBI, etc.)
  • Develop web services APIs for distributed indexes
  • Implement rapid response distributed search
• Automatic discovery
– Araport components to discover each other at run time
– User sees available options based on current results
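Federated search amounts to fanning one query out to several indexes and merging the answers. Here is a minimal sketch under stated assumptions: the three index functions are offline stand-ins, and the merged record shape (`source`, `hit`) is invented for the example. Real indexes would be remote web services with their own ranking.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(term, indexes):
    """Send one query to every index in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(indexes)) as pool:
        futures = {name: pool.submit(search, term) for name, search in indexes.items()}
        merged = []
        for name, future in futures.items():
            for hit in future.result():
                merged.append({"source": name, "hit": hit})
    return merged


# Stand-ins for the three prototype indexes; real ones would be web calls.
indexes = {
    "cms": lambda term: [f"page about {term}"],
    "thalemine": lambda term: [f"gene {term}", f"protein {term}"],
    "jbrowse": lambda term: [],
}

results = federated_search("AT2G46830", indexes)
print(len(results), [r["source"] for r in results])
```

Running the queries in parallel means the response time is bounded by the slowest index rather than the sum of all of them, which matters once third-party indexes are included.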



• Interoperable web services
– Prototype: support controlled vocabularies
  • Sequence Ontology (SO) for data organization & display
  • Gene Ontology (GO) associations for display and search
  • Science apps integrate diverse web services
– Goals for development:
  • Ontologies for phenotype, reactome, metabolome
  • Community-driven adoption of controlled vocabularies
  • Web service integration with snap-in ease
• Redundant web services
– Establish web services equivalence classes
– Automatic failover should primary provider fail
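Automatic failover across an equivalence class can be as simple as trying providers in order. The sketch below illustrates the idea with two stand-in services; the function names and the broad `except` are simplifications for the example, not a description of how Araport implements it.

```python
def call_with_failover(providers, query):
    """Try equivalent providers in order; return the first successful answer."""
    errors = []
    for name, service in providers:
        try:
            return name, service(query)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(query):
    """Stand-in for a provider that is currently unreachable."""
    raise ConnectionError("primary is down")


def mirror(query):
    """Stand-in for an equivalent provider that answers normally."""
    return {"query": query, "status": "ok"}


# Both services are assumed to belong to one equivalence class.
used, answer = call_with_failover([("primary", primary), ("mirror", mirror)], "AT2G46830")
print(used, answer)
```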



• Provenance
– Prototype:
  • Submitters provide web services metadata for auto display
  • Submitters may provide an “About” page on Science Apps
  • Web services logs show number of users, number of hits
– Goals for development:
  • Automatic provenance display on every submission
  • Monthly reporting to contributors (e.g. Google Analytics)
  • Standards compliance, e.g. W3C PROV spec
• Community adoption


Web Design for Dynamic Pages

Traditional
• Active server, static client.
• Submit one form, display one result.
• Server provides data and its format.

Modern
• Active client, dynamic pages.
• Continual client/server interaction.
• Server provides data, client formats it.

[Diagram labels: Server, Browser, JS, CSS, DB, HTML <form>, CGI, HTML <table>, URL, HTML, Web Services, JavaScript <table>, HTML3, CSS3, HTML5, HTTP]
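The difference between the two designs is where formatting happens. The following sketch contrasts them in Python with a single illustrative record; the function names are invented, and in a real dynamic page the client role would be played by JavaScript in the browser.

```python
import json
from html import escape

rows = [{"locus": "AT2G46830", "location": "nucleus"}]  # illustrative data


def traditional_server(rows):
    """Traditional: the server returns data already formatted as HTML."""
    cells = "".join(
        f"<tr><td>{escape(r['locus'])}</td><td>{escape(r['location'])}</td></tr>"
        for r in rows
    )
    return f"<table>{cells}</table>"


def modern_server(rows):
    """Modern: the server returns plain data as JSON."""
    return json.dumps(rows)


def modern_client(body):
    """Modern: the client decides how to present the data it received."""
    return [f"{r['locus']}: {r['location']}" for r in json.loads(body)]


print(traditional_server(rows))
print(modern_client(modern_server(rows)))
```

With the modern split, the same JSON can feed a table, a chart, or another program, which is what makes one web service reusable by many Science Apps.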

Requisite Architectural Diagram

[Architecture diagram. Labels: External programs; Portal programs (www.araport.org): Science Apps, ThaleMine, JBrowse; API (api.araport.org): authentication, metering, logging, versioning, security; Agave Core (keep metadata, enroll users): Apps, Jobs, Systems, Computing, Storage, Databases; ADAMA (format data, enroll services): adapters a to f in front of CGI, SOAP, REST, InterMines, Tripal, and other services]

Abstract

The Araport platform for scalable information exchange in genomics.

The Arabidopsis Information Portal (Araport) is a web resource for genome science. Araport is a new and free service centered on Arabidopsis thaliana, the plant whose genome sequence serves as a model for all of plant biology. Araport integrates data from major sources including NCBI, UniProt, PubMed, TAIR, BAR, EPIC CoGe, IntAct, Atted II, KEGG, and the 1001 Genomes Project. Araport also exposes its own “Araport11” update to the organism’s structural and functional gene annotation. Araport was conceived as a new kind of model organism database, one that could keep pace with ever-growing data sets while not burdening funding agencies with an ever-growing data warehouse. Araport is a platform for data sharing, data integration, and data federation. Araport provides means for scientists in the community to develop and deploy web services that expose data residing elsewhere on the internet. Araport provides means for scientists to develop and deploy “Science Apps” that can perform computational analysis and visualization of distributed data. Araport already hosts over 20 Science Apps and almost 100 web services linked to a dozen data sources. Currently shifting from prototype to development mode, Araport provides a model for sustainable growth of model organism community resources. Primarily an information science project, Araport takes on scalability challenges related to real-time integration of distributed services, interoperability between diverse services, indexing for federated search, reliability & responsiveness, security & logging, open-source development for software portability, and usability through automated documentation.