Arabidopsis Information Portal: A Community-Extensible Platform for Open Data
Matt Vaughn
Director, Life Sciences Computing
Texas Advanced Computing Center
University of Texas at Austin
[email protected] | @mattdotvaughn | www.slideshare.net/mattdotvaughn
The Rationale for Araport
• Loss of TAIR as a publicly funded shared resource for data mining and basic bioinformatics (plus technical obsolescence)
• Centralization as a key contributing factor
  – Loading of new data into database
  – Development of new user experience
  – Curation and annotation
  – Community support mission
• Araport is designed to be de-centralized and thus sustainable
3
Modules Proposed: IAIC
‘The design of the AIP will provide core functionality while remaining flexible to encourage multiple contributors and constant innovation.’
IAIC Whitepaper (2012) Plant Cell: “Taking the next step”.
4
Modules Realized
5
Modules Realized
Core web applications for integration and indexing.
6
Modules Realized
Core web applications for integration and indexing.
In-house Science Apps
Community Science Apps
7
8
Global or faceted search. Will soon extend to community-provided modules.
9
The GMOD JBrowse app lets you select which data tracks to display (left).
An Araport extension, SeqLighter, lets you zoom to sequence (inset).
10
JBrowse users can select additional epigenomics tracks (obtained live from EPIC CoGe).
Users may filter by many attributes (shown: Lab name = Jacobsen Lab).
11
Araport11: Evidence-based re-annotation of A. thaliana Col-0, incorporating 113 public RNA-seq data sets binned by tissue type. Available in pre-release v3 at araport.org.
12
13
14
ThaleMine gene report pages include…
• Panther Homologs
• Phytozome Homologs
• Physical Interactions via BAR
• Expression Patterns via BAR
• Co-expression via ATTED
Coming soon: genetic interactions via IntAct.
15
ThaleMine gene report pages include…
• NCBI GeneRIFs
• NCBI Publications
• Gene Ontology associations
16
ThaleMine template queries
• Saved query + user parameters
• Display a dynamic table
• Modify the query
17
ThaleMine query results
• Alter query filters
• Save results
• Analyze columns
• Alter display columns
18
ThaleMine results list manipulation
19
20
Science Apps backed by Web Services
Predicted phasiRNA sites in Arabidopsis. Blake Meyers Lab, University of Delaware. Deepti Vemaraju, Mayumi Nakano.
Arabidopsis citations by year and category. Nick Provart Lab, University of Toronto. Asher Pasha, Jamie Waese.
21
Asher Pasha & Nicholas Provart from BAR at University of Toronto.
22
How It Works
https://www.araport.org
JavaScript in the Browser…
if ( gene && gene.length > 0 )
… calls Araport web services by URL…
$.get('https://api.araport.org…
… which in turn call BAR web services by URL…
http://bar.utoronto.ca/webservices/get_expressologs.php
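The chain above (browser code calls Araport, Araport calls BAR) can be sketched outside the browser. This is only an illustration in Python, not the Araport implementation: `fetch` stands in for an HTTP GET, the `locus` query parameter name and the `fake_bar` reply are assumptions, and the truncated Araport URL is deliberately left out.

```python
import json
from typing import Callable
from urllib.parse import urlencode

# BAR address as shown on the slide.
BAR_URL = "http://bar.utoronto.ca/webservices/get_expressologs.php"


def araport_proxy(locus: str, fetch: Callable[[str], str]) -> dict:
    """Play the role of the Araport web service: forward the query to BAR.

    `fetch` is any function that takes a URL and returns the response body.
    The `locus` query parameter name is hypothetical.
    """
    if not locus:  # mirrors the `gene && gene.length > 0` guard
        raise ValueError("a gene locus is required")
    upstream = BAR_URL + "?" + urlencode({"locus": locus})
    return {"source": upstream, "result": json.loads(fetch(upstream))}


def fake_bar(url: str) -> str:
    """Stand-in for the remote BAR service so the sketch runs offline."""
    return json.dumps({"echo": url})


response = araport_proxy("AT2G46830", fake_bar)
print(response["source"])
```

Passing the fetcher in as an argument keeps the sketch runnable without network access; a real client would pass a function that performs the HTTP request.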
23
How It Works
https://www.araport.org
The graph is interactive.
• Users can rearrange nodes by dragging.
• Users can get details by clicking.
This is Cytoscape.
• The graph is drawn by Cytoscape.js.
• This is a free library for JavaScript.
There are many libraries to choose from!
• jsPhyloSVG: phylogenetic trees
• HighCharts: statistical charts
• jQuery DataTables: interactive tables
• d3.js: all sorts of cool stuff
24
Code Re-use
http://bar.utoronto.ca
The Araport science app (left) reuses code from the pre-existing BAR app (right).
The apps look different by choice but they could be made identical.
https://www.araport.org
25
Key Points
• BAR Interactions Science App
  – Example of visualization module
  – Uses open-source Cytoscape JavaScript library
  – Displays data from BAR web services via Araport-mediated web service API
  – Developed at BAR by developers who attended Araport Workshop in 2014
  – Similar code deployed at Araport and BAR
• We invite you to develop a visualization module
  – Araport engineers available to provide technical support and advice
26
Pure Data Web Services
Contributed by SUBA group Nov 2015
Cornelia Hooper & Ian Castleden from University of Western Australia.
27
Pure Data Web Services
28
Pure Data Web Services
at2g46830
29
Automatic Documentation and User Interface
suba3
These are the service endpoints
• The endpoint is the verb in the URL.
• Verb is followed by one or more parameters.
• Example: araport/suba/search?locus=AT2G46830
Standard service endpoints at Araport
• /list = which IDs work with this service?
• /search = what are the details for a given ID?
• /prov = who provided this data and how?
• /stats = number of accesses, number of unique users
• /health = results of status check on underlying service
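As a rough illustration of the verb-plus-parameters pattern, the sketch below assembles relative endpoint paths in Python. The helper name `endpoint_url` and its arguments are made up for this example; only the five verbs and the sample path come from the slide.

```python
from urllib.parse import urlencode

# The five standard verbs listed on the slide.
STANDARD_ENDPOINTS = ("list", "search", "prov", "stats", "health")


def endpoint_url(namespace, service, verb, **params):
    """Build a relative service path: namespace/service/verb?param=value."""
    if verb not in STANDARD_ENDPOINTS:
        raise ValueError("unknown endpoint: " + verb)
    path = "/".join((namespace, service, verb))
    # Append a query string only when parameters were supplied.
    return path + ("?" + urlencode(params) if params else "")


print(endpoint_url("araport", "suba", "search", locus="AT2G46830"))
for verb in ("list", "prov", "stats", "health"):
    print(endpoint_url("araport", "suba", verb))
```

The first call reproduces the example path from the slide; the host and version prefix are left out because the slide does not show them.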
Automatic documentation is generated based on simple metadata provided when the service is created. We translate it to the OpenAPI (née Swagger) specification.
That, in turn, is used to build the UI (and language libraries).
30
How does it work?
suba3
This URL
Returns this data
JavaScript-friendly JSON format*
This transcription factor localizes to the nucleus
*Not mandatory for Araport APIs but preferred!
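To show why JSON is convenient for programs, here is a minimal Python sketch that pulls the localization out of a response. The response body and its field names (`result`, `locus`, `location`) are hypothetical; the slides do not show the actual SUBA payload.

```python
import json

# Hypothetical response body; the real SUBA field names are not shown here.
body = '{"result": [{"locus": "AT2G46830", "location": "nucleus"}]}'


def locations(response_text, locus):
    """Return the reported locations for one locus (case-insensitive match)."""
    records = json.loads(response_text).get("result", [])
    wanted = locus.upper()
    return [r["location"] for r in records if r.get("locus", "").upper() == wanted]


print(locations(body, "at2g46830"))
```

One `json.loads` call turns the reply into ordinary dictionaries and lists, which is the practical benefit over scraping an HTML page.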
31
Web Service Module
• SUBA provides a web service module to Araport
  – URL query takes an Arabidopsis locus as parameter
  – URL responds with a web page full of data
    • The data is not formatted for display to humans (e.g. HTML)
    • The data is formatted for JavaScript parsers (in JSON format)
  – The service is REST-like in that the data exchange is achieved with just the standard web protocol, HTTP. This is a modern standard for data exchange.
• We can all use this module!
  – Build a Science App that colors genes and pathways
  – Build a Science App that scores predicted interactions
  – ThaleMine could add subcellular localization to gene lists with ease
• We invite you to develop a web service module
  – Araport will provide tech support plus documentation & indexing
  – Araport will promote auto-discovery, interoperability
  – There’s even provisional support for hosting data that doesn’t have an existing web presence yet
  – The docs and tooling are getting better all the time (but…)
suba3
32
SUBA module was developed without Araport Staff Intervention
suba3
1. SUBA created a web service at their university.
  • Added local URLs that return JSON instead of HTML.
  • Re-used their existing database and web server.
2. SUBA wrote an Araport adapter to transform their service into a REST-like API.
  • Wrote a small program in Python (2 and 3 supported).
  • Program calls their URL & prints results in JSON format.
  • Added metadata in YAML format (for auto documentation).
  • Saved code to a source code repository on Bitbucket.
3. SUBA deployed the adapter to the Araport platform.
  • Used ‘curl’ to send Araport the URL of the source code repository.
  • Araport checks out the code, compiles it, containerizes it, deploys it.
  • Araport generates interactive documentation using Swagger.
http://suba.plantenergy.uwa.edu.au/suba-app…
https://bitbucket.org/athaliana/suba-araport
$ curl -kL -X POST -H "$BEARER_TOKEN" -F "git_repository=https://bitbucket.org/athaliana/suba-araport" https://api.araport.org/community/v0.3
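Step 2 above describes the adapter as a small program that calls the provider's URL and prints JSON. A minimal Python 3 sketch of that idea follows. It is not the SUBA adapter itself: the function name `search`, the `args` dictionary, the `locus` parameter and the placeholder upstream address are all assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical upstream address: the slide truncates the real SUBA URL,
# so a placeholder is used here rather than a guess.
UPSTREAM = "http://example.org/suba/lookup"


def fetch_json(url):
    """Call the upstream URL and decode its JSON body (real network call)."""
    with urlopen(url, timeout=30) as reply:
        return json.loads(reply.read().decode("utf-8"))


def search(args, fetch=fetch_json):
    """Adapter entry point: query upstream, print JSON, return the record."""
    locus = args["locus"].upper()
    upstream_data = fetch(UPSTREAM + "?" + urlencode({"locus": locus}))
    record = {"locus": locus, "data": upstream_data}
    print(json.dumps(record))
    return record


# Offline demonstration with a stub in place of the remote service.
demo = search({"locus": "at2g46830"}, fetch=lambda url: {"requested": url})
```

The stubbed `fetch` keeps the example runnable without the remote service; dropping the `fetch` argument would make the same function perform the real HTTP call.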
33
Araport Developer Support: 2016
34
Summary
• The Araport project
  – Provides evidence-based annotation for Col-0
  – Performs extensive baseline data integration
  – Supports the Arabidopsis research community
• The Araport platform
  – Enables and hosts modules from community contributors
    • Members gain visibility, accessibility, discoverability
    • Members benefit from documentation, tech support
• Community Modules can come in multiple forms
  – Visualization Science Apps using JavaScript libraries
  – Pure data interchange as REST-like Web Services
  – Computation Science Apps (analysis code + support for running it)
  – JBrowse tracks as RESTful web services
• Technical improvements are ongoing all the time
  – Improved developer support and tooling
  – Federated search, ontology-based interoperation
  – User workspaces, drag & drop combinations
Araport Developer Workshops
35
Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.
The Atted Science App Tutorial is freely available on GitHub. Other training material at araport.org/devzone
Sign up to get updates on 2016 workshop
https://www.araport.org/contact
36
Acknowledgements
Araport Data Sources and Module Providers
37
Acknowledgements
J Craig Venter Institute
• Chris Town (PI)
• Jason Miller
• Agnes Chan
• Erik Ferlanti
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Alumni: Konstantinos Krampis, Svetlana Karamycheva, Maria Kim, Ben Rosen, Christopher Nelson, Seth Schobel
University of Cambridge
• Gos Micklem
• Sergio Contrino
Funding Agencies
Texas Advanced Computing Center
• Matt Vaughn
• Josue Balandrano Coronel
• Matt Hanlon
• Rion Dooley
• Joe Stubbs
• Alex Rocha
• John Gentle
Alumni: Walter Moreira, Steve Mock
38
Araport11 Genome Annotation
[Diagram labels: Araport11 Protein Coding Genes; TAIR10 Annotation; NCBI SRA RNA-seq (113 public RNAseq samples); PASA, Trinity, BLAST,…; UniProt Update; NCBI Novel Models; Maker Novel Models]
https://www.araport.org/data/araport11
Araport11 Pre-release 3 (Dec 2015)
• Available via ThaleMine, JBrowse, FTP, APIs

Categories                          TAIR10    Araport11
Gene Loci
  Protein coding loci               27,416    27,667
  Novel loci in Araport11                     719
  Gene loci with splice isoform     5,665     10,698
Transcripts
  Transcript isoforms               35,385    48,389
Transcripts altered in Araport11
  CDS altered                                 1,191
  UTR altered                                 24,185
41
To install a Science App:
📝 Fill out this form.
🕒 Wait a few minutes.
😃 Test the app.
📬 Notify the App Store.
Araport provisions a virtual machine.
Araport obtains the source code.
Araport installs the program.
42
Web Services
Also known as APIs:
• Application Programming Interfaces
Computer programs that…
• Run on a web server.
• Use HTTP for communication.
The query is a URL:
• http://my.url/gene?AT2G46830
The response is a “web page”:
• Format is JSON not HTML.
• Simple to read, simple to parse.
43
Query by URL
Response in JSON
Araport online documentation
Web Service: SUBA
44
Query by URL
Response in JSON
Araport online documentation
Web Service: KEGG
45
Science Apps
Computer programs that…
• Are hosted on a web server.
• Run in the browser.
• Are written in JavaScript.
Obtain data by…
• web services!
Useful for…
• Interactive science.
• Cool visualizations.
46
Web Service Science App
Computer programs that…
• Are hosted on a web server.
• Run in the browser.
• Are written in JavaScript.
Obtain data by…
• web services!
Useful for…
• Interactive science.
• Cool visualizations.
Computer programs that…
• Run on a web server.
• Use HTTP for communication.
The query is a URL:
• http://my.url/gene?AT2G46830
The response is a “web page”:
• Format is JSON not HTML.
• Simple to read, simple to parse.
47
JCVI Expression Profile web service (left) and science app (right).
Erik Ferlanti, JCVI senior software engineer.
Web Service Science App
48
KEGG Pathways web service (left) and science app (right).
Brian Liu, intern at JCVI.
Web Service Science App
49
PhosPhAt Phosphorylation web service (left) and science app (right).
Ismail Liban, intern at JCVI.
Web Service Science App
Araport Developer Workshops
50
Deploying the Atted Science App Tutorial at AIP Developer Workshop, TACC, Nov 2014.
The Atted Science App Tutorial is available as open source on GitHub.
Next workshop: Winter 2015
51
Acknowledgements
J Craig Venter Institute
• Chris Town
• Jason Miller
• Agnes Chan
• Maria Kim
• Erik Ferlanti
• Seth Schobel
• Irina Belyaeva
• Chia-Yi Cheng
• Vivek Krishnakumar
Former members
• Ben Rosen
• Christopher Nelson
• Konstantinos Krampis
• Svetlana Karamycheva
University of Cambridge
• Gos Micklem
• Sergio Contrino
Texas Advanced Computing Center
• Matt Vaughn
• Steve Mock
• Rion Dooley
• Matt Hanlon
• Joe Stubbs
• Walter Moreira
• Chris Jordan
Funding Agencies
Data Sources
52
Araport User Workspaces
• Status
  – Prototype available in 2015
    • Grid layout (user adds rows or columns)
    • User adds Science Apps to grid (app isolation is goal)
  – Coming soon
    • Drag and drop
    • Communications bus (blast app sends results to viz app)
  – Coming later
    • Automatic discovery (blast app finds my alignment app)
    • Shared workspaces
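The communications bus idea (one app sends results, another receives them) can be illustrated with a tiny publish/subscribe sketch in Python. The class `Bus`, the topic name and the message shape are invented for this example and say nothing about how Araport workspaces are actually built.

```python
from collections import defaultdict


class Bus:
    """Minimal in-memory publish/subscribe bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callable that receives every message on `topic`."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver `message` to each subscriber; return the delivery count."""
        handlers = list(self._subscribers[topic])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = Bus()
received = []
bus.subscribe("blast.results", received.append)  # plays the "viz app"
delivered = bus.publish("blast.results", {"hits": ["AT2G46830"]})  # the "blast app"
print(delivered, received)
```

Because the sender only knows the topic name, apps stay isolated from each other, which matches the isolation goal stated for the grid.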
53
Infrastructure Challenges 1
• Federated search
  – Prototype: Single search returns results from
    • The Araport content management system (Drupal)
    • The Araport data warehouse (ThaleMine, Lucene)
    • The Araport genome browser (JBrowse metadata)
  – Goal during development:
    • Extend search to 3rd party indexes (NCBI, EBI, etc)
    • Develop web services APIs for distributed indexes
    • Implement rapid response distributed search
• Automatic discovery
  – Araport components to discover each other at run time
  – User sees available options based on current results
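A single search that returns results from several indexes can be sketched as a fan-out and merge. The Python below is a toy under stated assumptions: each index is a plain function returning hits with a `score` field, and the three lambdas merely stand in for the CMS, ThaleMine and JBrowse indexes. Real indexes would not share a comparable score without extra normalization.

```python
def federated_search(query, indexes):
    """Query every index and merge the hits, tagging each with its source."""
    merged = []
    for name, search in indexes.items():
        try:
            hits = search(query)
        except Exception:
            continue  # an unavailable index should not break the whole search
        merged.extend({"source": name, **hit} for hit in hits)
    # Highest score first; assumes scores are comparable across indexes.
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)


def broken_index(query):
    raise TimeoutError("index did not answer")


indexes = {
    "cms": lambda q: [{"title": "Page mentioning " + q, "score": 0.4}],
    "thalemine": lambda q: [{"title": q + " gene report", "score": 0.9}],
    "jbrowse": lambda q: [{"title": q + " track", "score": 0.6}],
    "third_party": broken_index,
}
results = federated_search("AT2G46830", indexes)
print([hit["source"] for hit in results])
```

Skipping a failed index keeps the search responsive, at the cost of silently incomplete results; a production version would probably report which sources were missing.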
54
Infrastructure Challenges 2
• Interoperable web services
  – Prototype: support controlled vocabularies
    • Sequence Ontology (SO) for data organization & display
    • Gene Ontology (GO) associations for display and search
    • Science apps integrate diverse web services
  – Goals for development:
    • Ontologies for phenotype, reactome, metabolome
    • Community-driven adoption of controlled vocabularies
    • Web service integration with snap-in ease
• Redundant web services
  – Establish web services equivalence classes
  – Automatic fail over should primary provider fail
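Automatic fail over within an equivalence class can be sketched as trying interchangeable providers in order. The Python below is a minimal illustration; the provider functions and their return value are hypothetical stand-ins for equivalent web services.

```python
def call_with_failover(providers, query):
    """Try each (name, function) pair in order; return the first success."""
    errors = []
    for name, provider in providers:
        try:
            return name, provider(query)
        except Exception as exc:
            errors.append("{}: {}".format(name, exc))
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary(query):
    raise ConnectionError("primary is down")


def mirror(query):
    return {"locus": query, "location": "nucleus"}


equivalence_class = [("primary", primary), ("mirror", mirror)]
used, result = call_with_failover(equivalence_class, "AT2G46830")
print(used, result)
```

The caller learns which provider answered, which matters for provenance when equivalent services are not perfectly identical.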
55
Infrastructure Challenges 3
• Provenance
  – Prototype:
    • Submitters provide web services metadata for auto display
    • Submitters may provide an “About” page on Science Apps
    • Web services logs show number of users, number of hits
  – Goals for development:
    • Automatic provenance display on every submission
    • Monthly reporting to contributors (e.g. Google Analytics)
    • Standards compliance e.g. W3C PROV spec
• Community adoption
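The usage figures mentioned under provenance (number of hits, number of users) reduce to simple counting over an access log. The Python sketch below assumes a log of (service, user) pairs, which is a made-up format chosen for illustration.

```python
from collections import Counter


def service_stats(log):
    """Summarize hits and unique users per service from (service, user) pairs."""
    hits = Counter(service for service, _user in log)
    users = {}
    for service, user in log:
        users.setdefault(service, set()).add(user)
    return {
        service: {"hits": hits[service], "unique_users": len(users[service])}
        for service in hits
    }


# Hypothetical access log entries.
log = [("suba", "alice"), ("suba", "bob"), ("suba", "alice"), ("kegg", "bob")]
stats = service_stats(log)
print(stats)
```

A summary of this shape is the kind of data a monthly report to contributors could be built from.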
56
Web Design for Dynamic Pages
Traditional
Active server, static client.
Submit one form, display one result.
Server provides data and its format.
Modern
Active client, dynamic pages.
Continual client/server interaction.
Server provides data, client formats it.
[Diagram labels: Server, Browser, DB, CGI, Web Services, HTML <form>, HTML <table>, JavaScript <table>, JS, CSS, HTML3, CSS3, HTML5, URL, HTML, HTTP]
57
Requisite Architectural Diagram
[Diagram labels: External programs; Portal programs (www.araport.org); Science Apps; ThaleMine; JBrowse; API (api.araport.org); Authentication, metering, logging, versioning, security; Agave Core (keep metadata, enroll users); ADAMA (format data, enroll services); Apps; Jobs; Systems; Computing; Storage; Databases; a b c d e f; CGI; SOAP; REST; InterMines; Tripal; Others]
58
Abstract
The Araport platform for scalable information exchange in genomics.
The Arabidopsis Information Portal (Araport) is a web resource for genome science. Araport is a new and free service centered on Arabidopsis thaliana, the plant whose genome sequence serves as a model for all of plant biology. Araport integrates data from major sources including NCBI, UniProt, PubMed, TAIR, BAR, EPIC CoGe, IntAct, Atted II, KEGG, and the 1001 Genomes Project. Araport also exposes its own “Araport11” update to the organism’s structural and functional gene annotation. Araport was conceived as a new kind of model organism database, one that could keep pace with ever-growing data sets while not burdening funding agencies with an ever-growing data warehouse. Araport is a platform for data sharing, data integration, and data federation. Araport provides means for scientists in the community to develop and deploy web services that expose data residing elsewhere on the internet. Araport provides means for scientists to develop and deploy “Science Apps” that can perform computational analysis and visualization of distributed data. Araport already hosts over 20 Science Apps and almost 100 web services linked to a dozen data sources. Currently shifting from prototype to development mode, Araport provides a model for sustainable growth of model organism community resources. Primarily an information science project, Araport takes on scalability challenges related to real-time integration of distributed services, interoperability between diverse services, indexing for federated search, reliability & responsiveness, security & logging, open-source development for software portability, and usability through automated documentation.