smartAPIs: EUDAT Semantic Working Group presentation @ RDA 9th Plenary


Upload: mark-wilkinson

Post on 21-Apr-2017


TRANSCRIPT

Slide 1

The smartAPI project

Discovering interconnected web APIs with semantic metadata

Mark D. Wilkinson
Center for Plant Biotechnology and Genomics UPM-INIA, Madrid

On behalf of

Michel Dumontier
Maastricht University

@micheldumontier & @markmoby

Slide 2

Background

Biomedical data analysis is increasingly being done using cloud-based, web-friendly application programming interfaces (APIs).

BUT it's pretty much impossible to automatically discover which API to use and how to connect these together to create an effective workflow.

Slide 3

API Catalogs

17,202 APIs
1,187 APIs
6,206 APIs
15,128 APIs
SHARE Registry

Slides 4-7

Variable Metadata

Slide 8

Slide 9

The parameter called "sequence" can have values that are FASTA-formatted sequences

Slide 10

The average bioinformatician can traverse these links, read these API documents, and make reasonably good guesses about how to access the service

But this is limited by the speed and patience of a human

Slide 11

Meanwhile, in another registry

Slide 12

Variable Metadata

Slide 13

Variable Metadata

Different metadata fields describing ~the same operation (BLAST)

Slide 14

Variable Metadata

Slide 15

Variable Metadata

In this case, the parameter is called QUERY, and it can consume an Accession (???...), a GI, or a FASTA-formatted sequence
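The two registry entries describe roughly the same BLAST operation but name the input parameter differently (sequence versus QUERY). The following is a minimal sketch of how such descriptions might be reconciled once each parameter is tagged with the kinds of value it accepts. The dictionaries and the value-kind labels ("fasta", "gi", "accession") are hypothetical and hand-written for illustration; neither registry is claimed to expose its metadata this way.

```python
# Hypothetical, hand-written summaries of two registry entries for a
# BLAST-like service. The value-kind labels are assumptions for illustration.
REGISTRY_A = {"sequence": {"fasta"}}
REGISTRY_B = {"QUERY": {"accession", "gi", "fasta"}}


def parameters_accepting(description, value_kind):
    """Return the parameter names in one description that accept value_kind."""
    return sorted(
        name for name, kinds in description.items() if value_kind in kinds
    )


def equivalent_parameters(desc_a, desc_b, value_kind):
    """Pair up parameters from two descriptions that accept the same value kind."""
    return [
        (a, b)
        for a in parameters_accepting(desc_a, value_kind)
        for b in parameters_accepting(desc_b, value_kind)
    ]


if __name__ == "__main__":
    print(equivalent_parameters(REGISTRY_A, REGISTRY_B, "fasta"))
```

Without a shared vocabulary for the value kinds, this matching step is exactly what a human currently does by reading the documentation.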

Slide 16

If you really work and dig around

A human can use Service Registries to find most of the information they need

(though they still need experience and/or guesswork!)

Slide 17

Weak or absent input/output descriptors make pipelining of services difficult based solely on registry metadata

Slide 18

Weak or absent input/output descriptors

And even with ~well-described services, pipelining remains troublesome

Slide 19

Slide 20

myGene.info: Input parameters (described using the OpenAPI descriptor standard)

Slide 21

myGene.info: Input parameters (described using the OpenAPI descriptor standard)

From the OpenAPI description, a bioinformatician can learn that the geneid parameter can be an Entrez or EnsEMBL gene id
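To make the point concrete, here is a small sketch of reading parameter descriptions out of an OpenAPI-style (Swagger 2.0) document. The SPEC fragment is hypothetical: it follows the general shape of a Swagger 2.0 parameter object, but the path and wording are illustrative and are not copied from the real myGene.info description.

```python
# A hypothetical fragment in the style of an OpenAPI (Swagger 2.0) description.
# The path and description text are illustrative, not taken from myGene.info.
SPEC = {
    "paths": {
        "/gene/{geneid}": {
            "get": {
                "parameters": [
                    {
                        "name": "geneid",
                        "in": "path",
                        "required": True,
                        "type": "string",
                        "description": "Entrez or Ensembl gene id",
                    }
                ]
            }
        }
    }
}


def list_parameters(spec):
    """Yield (path, method, name, required, description) for every parameter."""
    for path, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            for param in operation.get("parameters", []):
                yield (
                    path,
                    method,
                    param["name"],
                    param.get("required", False),
                    param.get("description", ""),
                )


if __name__ == "__main__":
    for row in list_parameters(SPEC):
        print(row)
```

The type is only "string" and the meaning lives in free text, so a person can tell that geneid is a gene identifier but software cannot without further annotation.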

Slide 22

myGene.info: Input parameters (described using the OpenAPI descriptor standard)

Gene
myGene.info

Slide 23

myGene.info: Input parameters (described using the OpenAPI descriptor standard)

Gene
myGene.info
?

Slide 24

myGene.info: Input parameters (described using the OpenAPI descriptor standard)

Gene
myGene.info
JSON

Slide 25

A big block of JSON!

1340 lines

GenBank identifier
Affymetrix identifier
Taxonomy identifier
HGNC symbol?
NCBI Gene Terminology

What do these symbols refer to? How do we find out more?

Slide 26

Two distinct problems:

1. Discovery of a tool that does what you need

2. Understanding how to use the tool you discovered

   Its inputs and outputs (what kind of information, in what format/syntax, with which parameter names, required or optional?)

   How it can be chained with other tools into more complex analytical workflows.

Slide 27

More contemporary registries get us closer

Slide 28

Crowdsourced API registry (some curation)
Features ontology-constrained fields

Slide 29

Crowdsourced API registry (some curation)
Features ontology-constrained fields

GUID

Slide 30

Crowdsourced API registry (some curation)
Features ontology-constrained fields

EDAM:operation_0346

Slide 31

Crowdsourced API registry (some curation)
Features ontology-constrained fields

EDAM:data_2044

Slide 32

Crowdsourced API registry (some curation)
Features ontology-constrained fields

EDAM:data_0857

Slide 33

Crowdsourced API registry (some curation)
Features ontology-constrained fields

No description of I/O parameters (for non-browser-based interaction)

Descriptions of data formats are sometimes available (and also grounded in the EDAM ontology) but inconsistent

Only possible to use this API registry for discovery, not for invocation

(i.e. solves problem #1, but not #2)

Also invented a novel Service Descriptor format, which requires de novo tool-building
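Ontology-constrained fields are what make the discovery half of the problem tractable. The sketch below filters registry entries by the EDAM identifiers shown on the slides. The RegistryEntry structure, the entry names, and the placeholder identifiers ending in XXXX are hypothetical; the real registry's data model is not shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry record annotated with ontology term identifiers."""
    name: str
    operation: str
    input_type: str
    output_type: str


# Illustrative entries. The first reuses the EDAM identifiers from the slides;
# the identifiers ending in XXXX are placeholders, not real EDAM terms.
ENTRIES = [
    RegistryEntry("example-search", "EDAM:operation_0346",
                  "EDAM:data_2044", "EDAM:data_0857"),
    RegistryEntry("example-other", "EDAM:operation_XXXX",
                  "EDAM:data_2044", "EDAM:data_XXXX"),
]


def find_entries(entries, operation=None, input_type=None, output_type=None):
    """Return entries matching every criterion that is not None."""
    wanted = {
        "operation": operation,
        "input_type": input_type,
        "output_type": output_type,
    }
    return [
        entry for entry in entries
        if all(value is None or getattr(entry, field) == value
               for field, value in wanted.items())
    ]


if __name__ == "__main__":
    for entry in find_entries(ENTRIES, operation="EDAM:operation_0346"):
        print(entry.name)
```

This finds a candidate tool but says nothing about parameter names or how to call it, which is why discovery alone leaves problem #2 open.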

Slide 34

Semantic Health and Research Environment (SHARE) Registry (synopsis interface)

Slide 35

Semantic Health and Research Environment (SHARE) Registry

Uses the myGrid Service descriptor (same as )

Slide 36

Semantic Health and Research Environment (SHARE) Registry

Uses ontology terms for both data types and service operation types, much as with (but allows/encourages any ontology)

Slide 37

Semantic Health and Research Environment (SHARE) Registry

SADI standardizes service interfaces such that the interface itself is also defined by these ontology terms (i.e. data must be owl:Individuals of the ontological type)

Slide 38

Semantic Health and Research Environment (SHARE) Registry

and therefore...

Slide 39

Semantic Health and Research Environment (SHARE) Registry

Automated synthesis of, and invocation of, complex Service pipelines from independent providers
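When every service declares the type it consumes and the type it produces, pipeline synthesis reduces to a path search. The following is a toy sketch of that idea, not SHARE's or SADI's actual algorithm: the service names and type labels are hypothetical, and plain string equality stands in for the ontology reasoning a real system would use to decide whether one service's output satisfies another's input.

```python
from collections import deque

# Hypothetical services, each described only by (consumed type, produced type).
SERVICES = {
    "gene_to_protein": ("Gene", "Protein"),
    "protein_to_structure": ("Protein", "Structure"),
    "gene_to_pathway": ("Gene", "Pathway"),
}


def plan_pipeline(services, have, want):
    """Breadth-first search for a shortest chain of services from have to want.

    Returns a list of service names, an empty list if have == want,
    or None when no chain exists.
    """
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        current, chain = queue.popleft()
        if current == want:
            return chain
        for name, (consumes, produces) in services.items():
            if consumes == current and produces not in seen:
                seen.add(produces)
                queue.append((produces, chain + [name]))
    return None


if __name__ == "__main__":
    print(plan_pipeline(SERVICES, "Gene", "Structure"))
```

The search only works because inputs and outputs are typed; with free-text descriptions there is nothing to match on.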

Slide 40

Semantic Health and Research Environment (SHARE) Registry

Automated gap filling for unavailable data

Automated detection of useful data combinations

Slide 41

Semantic Health and Research Environment (SHARE) Registry

SADI assumes a world of 100% OWL/RDF data

(Good) OWL can be quite hard to write!

Slide 42

Barely described
No automation
Hard to find and use
Not FAIR

Richly described
Fully automatable
Fully FAIR

Slide 43

Barely described
No automation
Hard to find and use
Not FAIR

Richly described
Fully automatable
Fully FAIR

An incremental path to increasingly rich, semantically controlled metadata that does not invent new standards and is easy for our end-users to create

Slide 44

Slide 45

smartAPI

The goal is to lower the barrier to the discovery and reuse of web APIs through richer semantic metadata.

A coordinated facility for the intelligent and facile annotation of smart APIs

A web application to discover smart APIs and how they connect to each other

1-year supplement in collaboration with the HeartBD2K center: Peipei Ping (PI), Andrew Su and Chunlei Wu.

Slide 46

Build on API metadata specification standards

SWAGGER

Slide 47

Tools for Intelligent API Metadata Authoring

Build on CEDAR technology

Generate the Service metadata capture Web Form from a smartAPI template (CEDAR)

Discover context-appropriate annotation recommendations to enhance harmonization

Validate and give improvement suggestions

Slide 48

Metadata authoring will connect to numerous existing resources

Identifier syntax and link outs

475 ontologies and terminologies

Slide 49

Slide 50

Smart Profiling

Slide 51

Smart Profiling (not the same as Extreme Vetting ;-) )

Slide 52

Using information from identifiers.org, MIRIAM, and prefix-commons, make some intelligent guesses about what a given data field might be

Enhanced suggestions for the end-user annotator
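One way such guesses could be made is by testing sample values of a field against known identifier patterns and ranking the candidates. This is only a sketch of that idea: the two patterns below are simplified stand-ins written for illustration, not the patterns actually registered with identifiers.org, MIRIAM, or prefix-commons, and the labels are hypothetical.

```python
import re

# Simplified, illustrative patterns. These are assumptions, not the patterns
# registered with identifiers.org, MIRIAM, or prefix-commons.
PATTERNS = {
    "ensembl-gene": re.compile(r"^ENSG\d{11}$"),
    "numeric-id": re.compile(r"^\d+$"),
}


def profile_field(values):
    """Return (label, fraction of values matched) pairs, best match first."""
    scores = []
    for label, pattern in PATTERNS.items():
        matched = sum(1 for value in values if pattern.match(str(value)))
        if matched:
            scores.append((label, matched / len(values)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(profile_field(["ENSG00000123374", "ENSG00000139618"]))
    print(profile_field(["1017", "abc"]))
```

A purely numeric pattern matches many identifier schemes at once, which is why the output is best treated as a ranked suggestion for the annotator to confirm and not as a decision.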

Slide 53

Use this to automatically map API data to Linked Open Data
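Once a field has been profiled as carrying a particular kind of identifier, its values can be rewritten as resolvable URIs. The sketch below shows one possible shape for that mapping. The field names, the field-to-prefix table, and the URI template are all assumptions made for illustration; the talk does not specify how smartAPI performs this step.

```python
# Assumed URI template in the compact-identifier style; check the resolver's
# current documentation before relying on it.
URI_TEMPLATE = "https://identifiers.org/{prefix}:{local_id}"

# Hypothetical mapping from JSON field names to identifier prefixes, e.g. as
# confirmed by an annotator after profiling.
FIELD_PREFIXES = {"entrezgene": "ncbigene", "taxid": "taxonomy"}


def to_linked_data(record):
    """Replace recognised identifier fields with URIs; copy other fields as-is."""
    linked = {}
    for field, value in record.items():
        prefix = FIELD_PREFIXES.get(field)
        if prefix is None:
            linked[field] = value
        else:
            linked[field] = URI_TEMPLATE.format(prefix=prefix, local_id=value)
    return linked


if __name__ == "__main__":
    print(to_linked_data({"entrezgene": 1017, "taxid": 9606, "symbol": "CDK2"}))
```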

Slide 54

Steps along the stairway

Slide 55

Metadata Survey

We performed a survey of 3 repositories (BioCatalogue, ProgrammableWeb, ELIXIR Tools & Services Registry) and 4 specifications (MIAS, OpenAPI, SADI, schema.org), plus a preliminary smartAPI metadata specification.

Slide 56

Metadata Elements: 20 basic, 6 provider, 10 operation, 12 parameters, 6 response

Slide 57

MUST
Name
Access Point

SHOULD
Description
Documentation
Response MIME-Type
Terms of Service
Authentication Mode
Version
SSL Support

MAY
Website
Category
Publications
API Access Restrictions
Access Point Mirrors
API Metadata Format
API Access Mode
API Location
API Implementation Language
API Maturity
Social Media Links
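The MUST and SHOULD levels lend themselves to a simple check: missing MUST elements make a record invalid, while missing SHOULD elements only produce warnings. Below is a minimal sketch under the assumption that a metadata record is a flat dictionary keyed by the element names on the slide. That record shape is hypothetical, and this is not the smartAPI validator.

```python
# Element names taken from the slide. The flat-dictionary record format
# is an assumption made for this sketch.
MUST = ["Name", "Access Point"]
SHOULD = [
    "Description", "Documentation", "Response MIME-Type", "Terms of Service",
    "Authentication Mode", "Version", "SSL Support",
]


def check_metadata(record):
    """Report which MUST and SHOULD elements are missing or empty."""
    present = {key for key, value in record.items() if value not in (None, "", [])}
    missing_must = [field for field in MUST if field not in present]
    missing_should = [field for field in SHOULD if field not in present]
    return {
        "valid": not missing_must,
        "missing_must": missing_must,
        "missing_should": missing_should,
    }


if __name__ == "__main__":
    report = check_metadata({
        "Name": "Example API",
        "Access Point": "http://example.org/api",
        "Version": "1.0",
    })
    print(report)
```

Separating errors from warnings matches the incremental path described earlier: a sparse record is still accepted, and the warnings show the author what to add next.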

Slide 58

Metadata authoring made easier. We augmented the Swagger Editor to autocomplete using the smartAPI Repository and enabled validation against the smartAPI specification.

Slide 59

Slide 60

Faceted Search Interface. We implemented a lightweight web-based tool to perform faceted search and filtering over the Elasticsearch repository of smartAPI descriptions.
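A faceted search over Elasticsearch typically combines a full-text query, exact-match filters for the facets already selected, and terms aggregations that return counts for the remaining facet values. The sketch below only builds such a request body as a Python dictionary. The field names ("name", "description", "category", "authentication") are hypothetical, since the talk does not show the repository's index mapping or its actual queries.

```python
import json


def build_faceted_query(text, filters, facet_fields, size=10):
    """Build a search body: full-text match, term filters, and facet counts.

    All field names passed in or used here are assumptions for illustration.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text,
                                     "fields": ["name", "description"]}}
                ],
                "filter": [
                    {"term": {field: value}}
                    for field, value in sorted(filters.items())
                ],
            }
        },
        "aggs": {
            field: {"terms": {"field": field}} for field in facet_fields
        },
    }


if __name__ == "__main__":
    body = build_faceted_query("blast", {"authentication": "none"}, ["category"])
    print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request; the aggregation buckets in the response supply the facet counts shown next to each filter in the interface.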

Slide 61

API Interoperability WG People

Michel Dumontier
Amrapali Zaveri
Shima Dastgheib
Chunlei Wu
Ruben Verborgh
Caty Chung
Raymond Terryn
Paul Avillach
Gregg Kellogg
Nolan Nichols
Kevin Osborn
David Steinberg
Mark Wilkinson
Mary Shimoyama
Jeff De Pons
Denise Luna
Kathleen Jagodnik

http://mygene.info/
http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/
http://smart-api.info/website/
http://www.lincsproject.org/
http://bd2k-picsure.hms.harvard.edu
https://spec-ops.io
http://nidm.nidash.org/
https://cgl.genomics.ucsc.edu/
http://sadiframework.org
https://bd2kccc.org/
http://rgd.mcw.edu/

Slide 62

Take-home Message

Facilitate the discoverability, interoperability, and reuse of web-based APIs

Eliminate API data silos by providing FAIR (Findable, Accessible, Interoperable, Reusable) Linked Data.

The tools, technologies, and design patterns developed in the pilot and WG should generalize to API development across the BD2K consortium (and beyond).

Slide 63

TEAM

Michel Dumontier
Chunlei Wu
Cyrus Afrasiabi (backend, repository API)
Trish Whetzel (API profiling)
Yash Vyas (recommendation engine)
Amrapali Zaveri (metadata survey, template, web application, evaluation)
Andrew Su (evaluation)
Mark Wilkinson (evaluation)

Slide 64

[email protected]: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier

[email protected]