Anita de Waard
VP Research Data Collaborations
Elsevier, Jericho, VT

Some Thoughts on Collectively Creating Networks of Ideas, Data and Software

How do we unify the needs of the collective and the individual? Let us endeavor to build systems that allow a kid in Mali who wants to learn about proteomics to not be overwhelmed by the irrelevant and the untrue.

- John Perry Barlow, iAnnotate 2014

Collectively create nimble and robust systems of knowledge management that interconnect ideas, data and software.

Automated caption/body text splitting & linking

Precision   Recall   F-score
56.3        76.0     64.7
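The reported F-score is consistent with the usual harmonic mean of precision and recall. A minimal sketch of that arithmetic follows; the function name `f_score` and the `beta` parameter are illustrative choices, not part of the original slides.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values from the slide, given on a 0-100 scale.
print(round(f_score(56.3, 76.0), 1))  # 64.7
```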

Connecting Ideas: Big Mechanism

Statement type

Connecting Ideas: Towards an Elsevier Knowledge Graph

14M articles from ScienceDirect
3.3M triples
475M triples
49M triples
p x r matrix
p x k, k x r latent factor matrices
~102 triples
920K concepts from EMMeT

Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs
Unsupervised, scalable and built with off-the-shelf technologies
Based on recent work at University College London and University of Massachusetts Amherst

Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013).
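To illustrate the step from a p x r matrix to p x k and k x r latent factor matrices, here is a minimal sketch under stated assumptions. It is not the Elsevier Labs pipeline, and it uses plain squared-error factorization trained with stochastic gradient descent rather than the ranking-based objective of Riedel et al. The toy matrix, the function names and all hyperparameters are hypothetical.

```python
import random


def predict(P, R, i, j):
    """Score for row i (an entity pair) and column j (a relation)."""
    return sum(P[i][f] * R[f][j] for f in range(len(R)))


def squared_error(matrix, P, R):
    """Total squared reconstruction error over all cells."""
    return sum((matrix[i][j] - predict(P, R, i, j)) ** 2
               for i in range(len(matrix)) for j in range(len(matrix[0])))


def factorize(matrix, k, epochs=2000, lr=0.05, reg=0.01, seed=0):
    """Factor a p x r matrix into p x k and k x r latent factor matrices."""
    rng = random.Random(seed)
    p, r = len(matrix), len(matrix[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(p)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(r)] for _ in range(k)]
    for _ in range(epochs):
        for i in range(p):
            for j in range(r):
                err = matrix[i][j] - predict(P, R, i, j)
                for f in range(k):
                    pif, rfj = P[i][f], R[f][j]
                    # Gradient step with L2 regularization on both factors.
                    P[i][f] += lr * (err * rfj - reg * pif)
                    R[f][j] += lr * (err * pif - reg * rfj)
    return P, R


# Toy data: rows are entity pairs, columns are relations (1 = triple observed).
M = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
P, R = factorize(M, k=2)
print(len(P), len(P[0]), len(R), len(R[0]))  # 4 2 2 3
```

Because the factors have lower rank than the original matrix, cells that were not observed still receive a score, which is the property that makes the approach usable for suggesting new triples.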

Connecting Research Data:

Linking Papers to Data, Phase 1

Supplementary data at PANGAEA
Bidirectional links between PANGAEA & ScienceDirect
Data visualized next to the article

Linking Papers to Data, Phase 2

ICSU/WDS/RDA Publishing Data Service Working Group
Currently creating a linked-data model for exposing DOI-to-DOI links outside publishers' firewalls
Merged with a National Data Service pilot with the same goal
Collaboration between CrossRef, DataCite, Europe PubMed Central, ANDS, Thomson Reuters, Elsevier
About to deliver:

Objective: move from a plethora of (mostly) bilateral arrangements between the different players to a one-for-all cross-referencing service for articles and data.
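As a rough illustration of what a shared cross-referencing service for DOI-to-DOI links might store, here is a minimal sketch. It is not the working group's actual model: the `Link` fields, the `LinkIndex` class, the relation name and the DOIs are all hypothetical placeholders. The only external fact relied on is that DOIs are compared case-insensitively.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One article-data link, as contributed by a single provider."""
    source_doi: str
    target_doi: str
    relation: str   # hypothetical vocabulary, e.g. "is-supplemented-by"
    provider: str   # who asserted the link


class LinkIndex:
    """Single lookup point replacing many bilateral arrangements."""

    def __init__(self):
        self._by_doi = defaultdict(set)

    def add(self, link):
        # Index the link under both ends so it can be followed either way.
        self._by_doi[link.source_doi.lower()].add(link)
        self._by_doi[link.target_doi.lower()].add(link)

    def linked_to(self, doi):
        """Return the DOIs linked to `doi`, regardless of link direction."""
        doi = doi.lower()
        others = set()
        for link in self._by_doi.get(doi, ()):
            if link.source_doi.lower() == doi:
                others.add(link.target_doi)
            else:
                others.add(link.source_doi)
        return sorted(others)


index = LinkIndex()
# Placeholder DOIs, made up for this example.
index.add(Link("10.0000/example-article", "10.0000/example-dataset",
               "is-supplemented-by", "example-publisher"))
print(index.linked_to("10.0000/EXAMPLE-DATASET"))  # ['10.0000/example-article']
```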

Current Systems for Linking Data

[Diagram: Researchers, Funding Agency, Institution, Data Repository, Dataset, Journal, Paper]

1. Researcher creates datasets
2. Researcher writes paper & publishes in journal
3. (Sometimes) dataset gets posted to repository
4. Researcher reports (post hoc) to Institution and Funder

Issues with the Current Situation:

i. Too much work for researchers

ii. Data posting not mandatory

iii. No link between data and paper

iv. Funders/Institutions informed as an afterthought

A Proposal To Address These Issues:

[Diagram: Researchers, Funding Agency, Institution, Data Repository, Dataset, Journal, Paper]

1. Researcher creates datasets and posts to repository (under embargo)
2. Funder is automatically notified of dataset publication
3. Researcher writes paper & publishes in journal; embargo is lifted and data linked (NB: this also allows release of unused data for negative results and reproducibility)
4. Funder and institution get report on publication and embargo lifting

i. Less Work!
ii. More Data Stored!
iii. Better Linking!
iv. Better Tracking!
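The proposed sequence can be sketched as a small piece of state plus automatic notifications. This is only an illustration under stated assumptions: the `Dataset` fields, the `deposit` and `publish` functions, the `outbox` list and all identifiers are hypothetical, and no real repository or funder API is implied.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Dataset:
    dataset_id: str
    funder: str
    embargoed: bool = True              # deposited under embargo by default
    linked_paper: Optional[str] = None  # set once the paper is published


def deposit(dataset_id: str, funder: str,
            outbox: List[Tuple[str, str]]) -> Dataset:
    """Steps 1 and 2: post the dataset under embargo and notify the funder."""
    ds = Dataset(dataset_id, funder)
    outbox.append((funder, f"dataset {dataset_id} deposited under embargo"))
    return ds


def publish(ds: Dataset, paper_id: str, institution: str,
            outbox: List[Tuple[str, str]]) -> None:
    """Steps 3 and 4: lift the embargo, link data to paper, send reports."""
    ds.embargoed = False
    ds.linked_paper = paper_id
    for recipient in (ds.funder, institution):
        outbox.append((recipient,
                       f"paper {paper_id} published; "
                       f"embargo on {ds.dataset_id} lifted"))


outbox: List[Tuple[str, str]] = []
ds = deposit("dataset-001", "Example Funder", outbox)
publish(ds, "paper-001", "Example University", outbox)
print(ds.embargoed, ds.linked_paper, len(outbox))  # False paper-001 3
```

The point of the design is that the researcher performs only two actions (deposit and publish), while linking and reporting happen as side effects.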

One piece of the puzzle: Mendeley Data:

Linked to published papers or not

Linked to GitHub or not

Versioning and provenance

Another Piece of the Puzzle: DataSearch:

Sources: Federated, Poor API, Rich API, FTP & Index
Data enrichment: Manual, Automated
(User) intent
Ranking and filtering (how to mix federated & indexed, rich & poor)
Search
Rendering
Search all data
Faceted query / results refinement
Store & use results
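One possible answer to the question of how to mix federated and indexed, rich and poor sources is to normalize each source's scores and apply a per-source weight. The sketch below is an assumption for illustration, not a description of how DataSearch works: the function `merge_results`, the source names, the weights and the scores are all hypothetical, and scores are assumed to be positive.

```python
def merge_results(result_lists, weights, limit=10):
    """Merge ranked results from several sources into one ranking.

    result_lists: {source_name: [(record_id, raw_score), ...]}
    weights:      {source_name: trust weight, e.g. higher for rich metadata}
    """
    merged = {}
    for source, results in result_lists.items():
        top = max((score for _, score in results), default=0)
        if top <= 0:
            continue  # nothing usable from this source
        for record_id, score in results:
            # Scale to [0, 1] within the source, then apply the source weight.
            value = (score / top) * weights.get(source, 1.0)
            # A record found by several sources keeps its best score.
            merged[record_id] = max(merged.get(record_id, 0.0), value)
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


results = {
    "federated_poor_api": [("a", 10), ("b", 5)],
    "indexed_rich_api": [("b", 0.9), ("c", 0.3)],
}
weights = {"federated_poor_api": 0.5, "indexed_rich_api": 1.0}
print([record_id for record_id, _ in merge_results(results, weights)])
# ['b', 'a', 'c']
```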

How Do We Evaluate Discoverability?

Birds of a Feather on Data Search:

How do we pay for all this?

RDA Cost Recovery WG
Co-chair with Ingrid Dillo (DANS), Simon Hodson (CODATA)
Goal: write a report on potential new funding models for data repositories, so that they can start sharing this knowledge
Interviewed 24 repositories on their funding (current and future)
Now summarising stories and trends; will present at RDA P7

Terms of funding for main income stream (in %)

Software As A First-Class Knowledge Object:

Working with Networks of Partners

Force11:
Multi-stakeholder, member-driven organisation
Unites scholars, tool developers, librarians, publishers, funding agencies, etc.
E.g. Software citation group, akin to Data Citation Group
Will present at Force16 in Portland, OR, April 17-19, 2016

National Data Service:
Multi-stakeholder group, based around supercomputing centres
Aims to be a connective tissue between data creation, curation, storage, etc. projects
Inviting pilots: two or more partners who have not worked together, interested in collaborating on a data-centric project to solve a real-world need; can include software sharing
E.g. DataSearch, data linking systems

RDA:
Co-lead Data publishing, linking group
Co-lead Cost Recovery group
Active in Chemistry, Earth Science groups
Starting BoF Data Search




In summary:

Let's collectively enable an account of the present undertakings, studies and labours of the ingenious in many considerable parts of the world,

by connecting ideas, data, and software through interconnected partnerships!