Arabidopsis Information Portal: A Community-Extensible Platform for Open Data

Matt Vaughn
Director, Life Sciences Computing
Texas Advanced Computing Center
University of Texas at Austin
[email protected] | @mattdotvaughn | www.slideshare.net/mattdotvaughn

Upload: matthew-vaughn

Post on 14-Apr-2017




The Rationale for Araport
• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics (plus technical obsolescence)
• Centralization as a key contributing factor
– Loading of new data into database
– Development of new user experience
– Curation and annotation
– Community support mission
• Araport is designed to be de-centralized and thus sustainable


Modules Proposed: IAIC

‘The design of the AIP will provide core functionality while remaining flexible to encourage multiple contributors and constant innovation.’
IAIC Whitepaper (2012) Plant Cell: “Taking the next step”.


Modules Realized


Core web applications for integration and indexing.


In-house Science Apps

Community Science Apps

Global or faceted search. Will soon extend to community-provided modules.


The GMOD JBrowse app lets you select which data tracks to display (left). An Araport extension, SeqLighter, lets you zoom to sequence (inset).


JBrowse users can select additional epigenomics tracks (obtained live from EPIC CoGe). Users may filter by many attributes (shown: Lab name = Jacobsen Lab).


Araport11: Evidence-based re-annotation of A. thaliana Col-0, incorporating 113 public RNAseq data sets binned by tissue type. Available in pre-release v3 at araport.org.


ThaleMine gene report pages include…
• Panther homologs
• Phytozome homologs
• Physical interactions via BAR
• Expression patterns via BAR
• Co-expression via ATTED

Coming soon: genetic interactions via IntAct.


ThaleMine gene report pages include…
• NCBI GeneRIFs
• NCBI Publications
• Gene Ontology associations


ThaleMine template queries
• Saved query + user parameters
• Display a dynamic table
• Modify the query


ThaleMine query results
• Alter query filters
• Save results
• Analyze columns
• Alter display columns


ThaleMine results list manipulation


Science Apps backed by Web Services
• Predicted phasiRNA sites in Arabidopsis. Blake Meyers Lab, University of Delaware. Deepti Vemaraju, Mayumi Nakano.
• Arabidopsis citations by year and category. Nick Provart Lab, University of Toronto. Asher Pasha, Jamie Waese.

Asher Pasha & Nicholas Provart from BAR at University of Toronto.


How It Works

https://www.araport.org

JavaScript in the Browser…

if ( gene && gene.length > 0 )

… calls Araport web services by URL…

$.get('https://api.araport.org…

… which in turn call BAR web services by URL…

http://bar.utoronto.ca/webservices/get_expressologs.php
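The call chain above (browser, then Araport, then BAR) can be sketched from the middle tier's point of view. This is a minimal illustration in Python, not Araport's actual code: the query parameter name `request`, the shape of the returned JSON, and the helpers `mediate` and `fake_fetch` are all assumptions made for the example. Only the BAR URL comes from the slide.

```python
import json
from urllib.parse import urlencode

# Upstream URL as shown on the slide.
BAR_URL = "http://bar.utoronto.ca/webservices/get_expressologs.php"


def upstream_url(locus, param="request"):
    """Build the upstream query URL. The parameter name is a placeholder."""
    return BAR_URL + "?" + urlencode({param: locus})


def mediate(locus, fetch):
    """Middle tier: validate input, call upstream, re-emit JSON for the browser."""
    locus = (locus or "").strip().upper()
    if not locus:
        raise ValueError("a locus identifier is required")
    upstream = json.loads(fetch(upstream_url(locus)))
    return json.dumps({"locus": locus, "result": upstream}, sort_keys=True)


def fake_fetch(url):
    """Stand-in for an HTTP GET so the sketch runs offline."""
    return json.dumps({"requested": url})
```

Passing the fetch function in as an argument keeps the sketch testable without network access; a real deployment would substitute an HTTP client.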


How It Works

https://www.araport.org

The graph is interactive.
• Users can rearrange nodes by dragging.
• Users can get details by clicking.

This is Cytoscape.
• The graph is drawn by Cytoscape.js
• This is a free library for JavaScript.

There are many libraries to choose from!
• jsPhyloSVG: phylogenetic trees
• HighCharts: statistical charts
• jQuery DataTables: interactive tables
• d3.js: all sorts of cool stuff


Code Re-use

http://bar.utoronto.ca

The Araport science app (left) reuses code from the pre-existing BAR app (right). The apps look different by choice but they could be made identical.

https://www.araport.org


Key Points
• BAR Interactions Science App
– Example of visualization module
– Uses open-source Cytoscape JavaScript library
– Displays data from BAR web services via Araport-mediated web service API
– Developed at BAR by developers who attended Araport Workshop in 2014
– Similar code deployed at Araport and BAR
• We invite you to develop a visualization module
– Araport engineers available to provide technical support and advice


Pure Data Web Services

Contributed by SUBA group Nov 2015

Cornelia Hooper & Ian Castleden from University of Western Australia.



at2g46830


Automatic Documentation and User Interface

These are the service endpoints
• The endpoint is the verb in the URL.
• The verb is followed by one or more parameters.
• Example: araport/suba/search?locus=AT2G46830

Standard service endpoints at Araport
• /list = which IDs work with this service?
• /search = what are the details for a given ID?
• /prov = who provided this data and how?
• /stats = number of accesses, number of unique users
• /health = results of status check on underlying service

Automatic documentation is generated based on simple metadata provided when the service is created. We translate it to the OpenAPI (née Swagger) spec for API interfaces. That, in turn, is used to build the UI (and language libraries).
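The endpoint pattern described above (namespace, service, verb, then parameters) can be sketched as a small URL builder. This is an illustration only: the base URL `https://api.example.org` is a placeholder, and the function name `endpoint_url` is hypothetical. The verb list and the example path come from the slide.

```python
from urllib.parse import urlencode

# The five standard verbs listed on the slide.
STANDARD_ENDPOINTS = ("list", "search", "prov", "stats", "health")


def endpoint_url(namespace, service, verb, base="https://api.example.org", **params):
    """Compose namespace/service/verb?param=value under a placeholder base URL."""
    if verb not in STANDARD_ENDPOINTS:
        raise ValueError("unknown endpoint verb: " + verb)
    url = "/".join([base, namespace, service, verb])
    if params:
        # Sort parameters so the same query always yields the same URL.
        url += "?" + urlencode(sorted(params.items()))
    return url
```

For example, `endpoint_url("araport", "suba", "search", locus="AT2G46830")` yields a URL ending in `araport/suba/search?locus=AT2G46830`, matching the example on the slide.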


How does it work?
• This URL returns this data.
• JavaScript-friendly JSON format*
• This transcription factor localizes to the nucleus.

*Not mandatory for Araport APIs but preferred!


Web Service Module
• SUBA provides a web service module to Araport
– URL query takes an Arabidopsis locus as parameter
– URL responds with a web page full of data
  • The data is not formatted for display to humans (e.g. HTML)
  • The data is formatted for JavaScript parsers (in JSON format)
– The service is REST-like in that the data exchange is achieved with just the standard web protocol, HTTP. This is a modern standard for data exchange.
• We can all use this module!
– Build a Science App that colors genes and pathways
– Build a Science App that scores predicted interactions
– ThaleMine could add subcellular localization to gene lists with ease
• We invite you to develop a web service module
– Araport will provide tech support plus documentation & indexing
– Araport will promote auto-discovery, interoperability
– There’s even provisional support for hosting data that doesn’t have an existing web presence yet
– The docs and tooling are getting better all the time (but…)


SUBA module was developed without Araport staff intervention

1. SUBA created a web service at their university.
• Added local URLs that return JSON instead of HTML.
• Re-used their existing database and web server.

2. SUBA wrote an Araport adapter to transform their service into a REST-like API.
• Wrote a small program in Python (2 and 3 supported).
• Program calls their URL & prints results in JSON format.
• Added metadata in YAML format (for auto documentation).
• Saved code to a source code repository on Bitbucket.

3. SUBA deployed the adapter to the Araport platform.
• Used ‘curl’ to send Araport the URL of the source code repository.
• Araport checks out the code, compiles it, containerizes it, deploys it.
• Araport generates interactive documentation using Swagger.

http://suba.plantenergy.uwa.edu.au/suba-app…

https://bitbucket.org/athaliana/suba-araport

$ curl -kL -X POST -H "$BEARER_TOKEN" -F "git_repository=https://bitbucket.org/athaliana/suba-araport" https://api.araport.org/community/v0.3
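Step 2 above describes the adapter as a small Python program that calls the provider's URL and prints the result as JSON. A minimal sketch of that idea follows. It is not the SUBA adapter and does not claim to match the Araport adapter interface: the function name `search`, the `args` dictionary, the `UPSTREAM` placeholder URL, and the `localization` field are all assumptions for illustration.

```python
import json
import sys

# Placeholder; the provider's real URL is truncated on the slide.
UPSTREAM = "https://example.org/localization"


def http_get(url):
    """Fetch a URL and return its body as text."""
    from urllib.request import urlopen
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def search(args, fetch=http_get, out=None):
    """Call the provider's URL for one locus and print the result as JSON."""
    if out is None:
        out = sys.stdout
    locus = args["locus"].upper()
    payload = json.loads(fetch(UPSTREAM + "?locus=" + locus))
    # Keep only the fields this hypothetical service exposes.
    record = {"locus": locus, "localization": payload.get("localization", [])}
    out.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

The `fetch` and `out` parameters exist so the sketch can be exercised offline; the YAML metadata mentioned in step 2 would live in a separate file and is not shown here.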


Araport Developer Support: 2016


Summary
• The Araport project
– Provides evidence-based annotation for Col-0
– Performs extensive baseline data integration
– Supports the Arabidopsis research community
• The Araport platform
– Enables and hosts modules from community contributors
  • Members gain visibility, accessibility, discoverability
  • Members benefit from documentation, tech support
• Community Modules can come in multiple forms
– Visualization Science Apps using JavaScript libraries
– Pure data interchange as REST-like Web Services
– Computation Science Apps (analysis code + support for running it)
– JBrowse tracks as RESTful web services
• Technical improvements are ongoing
– Improved developer support and tooling
– Federated search, ontology-based interoperation
– User workspaces, drag & drop combinations

Araport Developer Workshops

Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.

The Atted Science App Tutorial is freely available on GitHub. Other training material is at araport.org/devzone.

Sign up to get updates on the 2016 workshop: https://www.araport.org/contact


Acknowledgements

Araport Data Sources and Module Providers


Acknowledgements

J Craig Venter Institute
• Chris Town (PI)
• Jason Miller
• Agnes Chan
• Erik Ferlanti
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Alumni: Konstantinos Krampis, Svetlana Karamycheva, Maria Kim, Ben Rosen, Christopher Nelson, Seth Schobel

University of Cambridge
• Gos Micklem
• Sergio Contrino

Texas Advanced Computing Center
• Matt Vaughn
• Josue Balandrano Coronel
• Matt Hanlon
• Rion Dooley
• Joe Stubbs
• Alex Rocha
• John Gentle
Alumni: Walter Moreira, Steve Mock

Funding Agencies


Araport11 Genome Annotation

[Figure: annotation workflow diagram. Labels: TAIR10 annotation; NCBI SRA RNA-seq (113 public RNAseq samples); PASA, Trinity, BLAST,…; UniProt update; NCBI novel models; Maker novel models; Araport11 protein coding genes.]

https://www.araport.org/data/araport11


Araport11 Pre-release 3 (Dec 2015)

• Available via ThaleMine, JBrowse, FTP, APIs

Categories                          TAIR10    Araport11
Gene Loci
  Protein coding loci               27,416    27,667
  Novel loci in Araport11                     719
  Gene loci with splice isoform     5,665     10,698
Transcripts
  Transcript isoforms               35,385    48,389
Transcripts altered in Araport11
  CDS altered                                 1,191
  UTR altered                                 24,185


To install a Science App:

📝 Fill out this form.
🕒 Wait a few minutes.
😃 Test the app.
📬 Notify the App Store.

Araport provisions a virtual machine. Araport obtains the source code. Araport installs the program.


Web Services

Also known as APIs:
• Application Programming Interfaces

Computer programs that…
• Run on a web server.
• Use HTTP for communication.

The query is a URL:
• http://my.url/gene?AT2G46830

The response is a “web page”:
• Format is JSON not HTML.
• Simple to read, simple to parse.
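The definition above (a program on a web server that takes a URL and answers in JSON over HTTP) can be sketched with Python's standard library. This is a toy, not an Araport service: the `/gene` path follows the slide's example URL, while the `RECORDS` table, the `answer` helper, the `serve` function, and the port are assumptions for illustration. The single record reflects the localization result shown earlier in the deck.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit

# Toy data store holding one illustrative record.
RECORDS = {"AT2G46830": {"localization": "nucleus"}}


def answer(path):
    """Map a request path such as /gene?AT2G46830 to (status, JSON text)."""
    parts = urlsplit(path)
    locus = parts.query.upper()
    if parts.path != "/gene" or locus not in RECORDS:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(dict(RECORDS[locus], locus=locus), sort_keys=True)


class GeneHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = answer(self.path)
        data = body.encode("utf-8")
        self.send_response(status)
        # JSON, not HTML: the client decides how to display it.
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Run the toy service locally until interrupted."""
    HTTPServer(("127.0.0.1", port), GeneHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/gene?AT2G46830` would return the JSON record; the query logic lives in `answer` so it can be checked without starting a server.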


Web Service: SUBA
• Query by URL
• Response in JSON
• Araport online documentation


Web Service: KEGG
• Query by URL
• Response in JSON
• Araport online documentation


Science Apps

Computer programs that are…
• Hosted on a web server.
• Run in the browser.
• Written in JavaScript.

Obtain data by…
• Web services!

Useful for…
• Interactive science.
• Cool visualizations.


JCVI Expression Profile web service (left) and science app (right). Erik Ferlanti, JCVI senior software engineer.

KEGG Pathways web service (left) and science app (right). Brian Liu, intern at JCVI.

PhosPhAt Phosphorylation web service (left) and science app (right). Ismail Liban, intern at JCVI.

Araport Developer Workshops

Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.

The Atted Science App Tutorial is available as open source on GitHub.

Next workshop: Winter 2015


Acknowledgements

J Craig Venter Institute
• Chris Town
• Jason Miller
• Agnes Chan
• Maria Kim
• Erik Ferlanti
• Seth Schobel
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Former members
• Ben Rosen
• Christopher Nelson
• Konstantinos Krampis
• Svetlana Karamycheva

University of Cambridge
• Gos Micklem
• Sergio Contrino

Texas Advanced Computing Center
• Matt Vaughn
• Steve Mock
• Rion Dooley
• Matt Hanlon
• Joe Stubbs
• Walter Moreira
• Chris Jordan

Funding Agencies

Data Sources


Araport User Workspaces

• Status
– Prototype available in 2015
  • Grid layout (user adds rows or columns)
  • User adds Science Apps to grid (app isolation is goal)
– Coming soon
  • Drag and drop
  • Communications bus (blast app sends results to viz app)
– Coming later
  • Automatic discovery (blast app finds my alignment app)
  • Shared workspaces


Infrastructure Challenges 1

• Federated search
– Prototype: Single search returns results from
  • The Araport content management system (Drupal)
  • The Araport data warehouse (ThaleMine, Lucene)
  • The Araport genome browser (JBrowse metadata)
– Goal during development:
  • Extend search to 3rd party indexes (NCBI, EBI, etc)
  • Develop web services APIs for distributed indexes
  • Implement rapid response distributed search
• Automatic discovery
– Araport components to discover each other at run time
– User sees available options based on current results
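The federated search idea above (one query fanned out to several indexes, with results merged) might look like the following sketch. It illustrates the general pattern only, not Araport's implementation: the function name `federated_search`, the index names, and the result shape are assumptions. Indexes are queried in parallel threads, and an index that raises an error is skipped; handling slow indexes with timeouts is left out to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def federated_search(term, indexes):
    """Send one term to every index and merge the hits, tagged by source.

    `indexes` maps a source name to a callable that returns a list of hits.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(indexes))) as pool:
        futures = {name: pool.submit(search, term) for name, search in indexes.items()}
        for name, future in futures.items():
            try:
                hits = future.result()
            except Exception:
                # A failing index should not break the whole search.
                continue
            results.extend({"source": name, "hit": hit} for hit in hits)
    return results
```

Results come back grouped in the order the indexes were registered, which keeps the output stable from one run to the next.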


Infrastructure Challenges 2

• Interoperable web services
– Prototype: support controlled vocabularies
  • Sequence Ontology (SO) for data organization & display
  • Gene Ontology (GO) associations for display and search
  • Science apps integrate diverse web services
– Goals for development:
  • Ontologies for phenotype, reactome, metabolome
  • Community-driven adoption of controlled vocabularies
  • Web service integration with snap-in ease
• Redundant web services
– Establish web services equivalence classes
– Automatic failover should primary provider fail
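The redundancy goal above (treat equivalent services as one class and fail over when the primary is down) can be sketched as a simple ordered retry. This shows the general failover pattern, not an Araport feature: the function name `call_with_failover` and the provider names in the usage note are hypothetical.

```python
def call_with_failover(providers, query):
    """Try equivalent providers in order and return the first successful answer.

    `providers` is a list of (name, callable) pairs; the first entry is the primary.
    Returns a (provider_name, result) tuple.
    """
    errors = []
    for name, provider in providers:
        try:
            return name, provider(query)
        except Exception as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append("{}: {}".format(name, exc))
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

With a list such as `[("primary", fetch_a), ("mirror", fetch_b)]`, the caller receives the mirror's answer whenever the primary raises, along with the name of the provider that actually served the request.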


Infrastructure Challenges 3

• Provenance
– Prototype:
  • Submitters provide web services metadata for auto display
  • Submitters may provide an “About” page on Science Apps
  • Web services logs show number of users, number of hits
– Goals for development:
  • Automatic provenance display on every submission
  • Monthly reporting to contributors (e.g. Google Analytics)
  • Standards compliance, e.g. W3C PROV spec
• Community adoption


Web Design for Dynamic Pages

Traditional
Active server, static client.
Submit one form, display one result.
Server provides data and its format.

Modern
Active client, dynamic pages.
Continual client/server interaction.
Server provides data, client formats it.

[Figure: two browser/server diagrams communicating over HTTP. Labels: HTML <form>, URL, CGI, DB, HTML <table>, HTML3; JS, CSS, URL, Web Services, DB, JavaScript <table>, HTML5, CSS3.]


Requisite Architectural Diagram

[Figure: architecture diagram. Labels: External programs; Portal programs (www.araport.org); Science Apps; ThaleMine; JBrowse; API (api.araport.org); Authentication, metering, logging, versioning, security; Agave Core (keep metadata, enroll users); Apps, Jobs, Systems; ADAMA (format data, enroll services); a b c d e f; Computing, Storage, Databases; CGI, SOAP, REST; InterMines, Tripal, Others.]


Abstract

The Araport platform for scalable information exchange in genomics.

The Arabidopsis Information Portal (Araport) is a web resource for genome science. Araport is a new and free service centered on Arabidopsis thaliana, the plant whose genome sequence serves as a model for all of plant biology. Araport integrates data from major sources including NCBI, UniProt, PubMed, TAIR, BAR, EPIC CoGe, IntAct, Atted II, KEGG, and the 1001 Genomes Project. Araport also exposes its own “Araport11” update to the organism’s structural and functional gene annotation. Araport was conceived as a new kind of model organism database, one that could keep pace with ever-growing data sets while not burdening funding agencies with an ever-growing data warehouse. Araport is a platform for data sharing, data integration, and data federation. Araport provides means for scientists in the community to develop and deploy web services that expose data residing elsewhere on the internet. Araport provides means for scientists to develop and deploy “Science Apps” that can perform computational analysis and visualization of distributed data. Araport already hosts over 20 Science Apps and almost 100 web services linked to a dozen data sources. Currently shifting from prototype to development mode, Araport provides a model for sustainable growth of model organism community resources. Primarily an information science project, Araport takes on scalability challenges related to real-time integration of distributed services, interoperability between diverse services, indexing for federated search, reliability & responsiveness, security & logging, open-source development for software portability, and usability through automated documentation.