OpenMinTeD: making sense of large volumes of data

twitter.com/openminted_eu

Beyond Open Access: Making Sense of Large Volumes of Scientific Content

Stelios Piperidis, Athena Research & Innovation Centre




The global research community generates over 1.5 million new scholarly articles per annum. (The STM report, 2009)

Lokman I. Meho, The rise and rise of citation analysis, 2007


… some 90% of papers … are never cited.

… 50% of papers are never read by anyone other than their authors, referees and journal editors

…about scientific literature?

… one paper published every 30 seconds

… 70,000 papers published on a single protein, the tumor suppressor p53 (Spangler et al., Automated Hypothesis Generation based on Mining Scientific Literature, 2014)

Emerging solution(s)

Machine reading: process textual sources, organise and classify them along various dimensions, and extract the main (indexical) information items.

… and understanding: identify and extract entities and the relations between them, turning unstructured textual sources into structured data.

… and predicting: enable multidimensional analysis of the structured data to extract meaningful insights and improve the ability to predict.

Structuring and mining textual data

Many examples from medical research.

An example from social sciences: study social confrontation in Greek society, with a focus on the years of the crisis, based on newspaper corpora.

What have been the claims of the social agents (parties, unions, different professional associations, etc.), against which government or state bodies, which instruments were used, and how were they reported in different newspapers?

Study social confrontation example

Κατάληψη στα Υποθηκοφυλακεία Πειραιώς και Σαλαμίνας αποφάσισε ο Δικηγορικός Σύλλογος Πειραιώς (ΔΣΠ), στις 26 και 27 Απριλίου 2011, διαμαρτυρόμενος για τα σοβαρότατα προβλήματα λειτουργίας που παρουσιάζουν.

The Piraeus Bar Association (ΔΣΠ) decided to occupy the land registries of Piraeus and Salamis on 26 and 27 April 2011, in protest at the very serious operational problems they present.

[Figure: the example sentence annotated with claim slots: Form, Actor/Addressee, Issue, Time/Location, Claims]

[Figure: ILSP-NLP IE workflow (Named Entity Recognition, Chunking, Dependency Parsing, Co-reference Resolution) and Aggregation/Analytics stack (Summarize/Export, Visualise statistics)]
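The slots named on the slide (form, actor/addressee, issue, time/location) suggest what one extracted claim looks like as structured data. The following is a minimal sketch, not the ILSP-NLP implementation: the `ClaimEvent` class, its field names and the two helper functions are assumptions made for illustration, and the record is filled in by hand from the Piraeus Bar Association sentence rather than produced by any tool.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ClaimEvent:
    """One protest claim; field names are hypothetical, modelled on the slide's slots."""
    form: str                  # form of action, e.g. occupation, strike
    actor: str                 # who makes the claim
    addressee: Optional[str]   # body the claim is directed at, if stated
    issue: str                 # what the claim is about
    start: date
    end: date
    locations: List[str]


# Filled in by hand from the example sentence; the sentence names no addressee.
EXAMPLE = ClaimEvent(
    form="occupation",
    actor="Piraeus Bar Association",
    addressee=None,
    issue="serious operational problems at the land registries",
    start=date(2011, 4, 26),
    end=date(2011, 4, 27),
    locations=["Piraeus", "Salamis"],
)


def export_json(events: List[ClaimEvent]) -> str:
    """Serialise events, writing dates as ISO 8601 strings."""
    rows = []
    for event in events:
        row = asdict(event)
        row["start"] = event.start.isoformat()
        row["end"] = event.end.isoformat()
        rows.append(row)
    return json.dumps(rows, ensure_ascii=False, indent=2)


def count_by_form(events: List[ClaimEvent]) -> Counter:
    """Aggregate events by form of action, a simple input for statistics."""
    return Counter(event.form for event in events)


if __name__ == "__main__":
    print(export_json([EXAMPLE]))
    print(count_by_form([EXAMPLE]))
```

Once claims are held as records like this, the export and statistics steps reduce to ordinary operations on structured data.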

Main objective

Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly-related sources.

Key aspects

• infrastructure - focus on interoperability
• build on existing TDM tools - no new algorithms
• service oriented - discovery, re-use of content & tools
• community driven - user centric requirements
• open science - openness at all levels

The landscape

• Text Mining Researchers
• Content Providers
• End Users
• Computing Infrastructures

The project

• Started: June 2015
• Duration: 3 years
• Total budget: 6,068,074 Euros

16 Partners

• 6 mining research groups
• 3 content providers
• 1 data center
• 1 library association
• 2 legal experts
• 6 community related partners
• 2 SMEs

Partners

• Athena RIC
• Univ. of Manchester (NaCTeM)
• Univ. of Darmstadt
• INRA
• EMBL-EBI
• Agro-Know
• LIBER
• Univ. of Amsterdam
• Open University UK
• EPFL
• CNIO
• Univ. of Sheffield (GATE)
• GESIS
• GRNET
• Frontiers
• Univ. of Stirling

The challenges

Content
• Barriers and obstacles due to non-availability, technical restrictions, copyright law or licensing issues.
• No uniform way to search for, retrieve and access content for TDM.

Services
• How to identify the most fitting one?
• Do I have permission to use it?
• How to combine it with other services I have access to or need?
• How to use them on my content?

Processing
• Where to deploy?
• Are my machines powerful enough?
• How can I get access to powerful machines?
• Where to store intermediate and final results?
• How to ensure persistence of storage?

Main routes

Bring all stakeholders together!

Accessible content

Metadata and transfer protocols
• Document literature content, language resources, data categories taxonomies, provenance information
• Generic and domain-specific metadata descriptions
• Identify standards for metadata harvesting and federated search in distributed repositories
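The talk does not name a harvesting standard. As an illustration of what metadata harvesting from a repository can look like, the sketch below assumes OAI-PMH with Dublin Core (`oai_dc`) records, a protocol widely exposed by repositories. The base URL, the sample response and the record in it are invented; nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces used by OAI-PMH 2.0 responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces the other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, next_token) from one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            # dc:rights is where a licence statement usually appears, if at all.
            "rights": [el.text for el in rec.iterfind(".//dc:rights", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


# Invented response page, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An invented record</dc:title>
          <dc:rights>CC BY 4.0</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE))
```

A harvester would repeat the request with each returned resumption token until none is left. The `rights` field matters for TDM because it is often the only machine-readable hint about licensing that travels with the metadata.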

IPR and licensing
• Study IPR restrictions for reuse of sources
• Exceptions?
• What about non-commercial research?
• Translate the legal & policy aspects into authentication and authorization specifications (GÉANT’s eduGAIN, …)
• User-to-service and service-to-service interactions
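One way to picture the step from legal and policy aspects to authorization specifications: licence terms become machine-readable rules that a service checks before handing content to a mining job. The sketch below is an illustration under stated assumptions. The licence classes, the `Purpose` values and the rule table are invented for this example; they are not OpenMinTeD's specification and do not describe what any real licence permits.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    NON_COMMERCIAL_RESEARCH = "non-commercial research"
    COMMERCIAL = "commercial"


# Hypothetical licence classes mapped to the purposes allowed to mine them.
RULES = {
    "open": {Purpose.NON_COMMERCIAL_RESEARCH, Purpose.COMMERCIAL},
    "research-only": {Purpose.NON_COMMERCIAL_RESEARCH},
    "restricted": set(),
}


@dataclass(frozen=True)
class Request:
    user_authenticated: bool   # e.g. result of a federated login
    licence_class: str         # key into RULES
    purpose: Purpose


def may_mine(req: Request) -> bool:
    """Allow mining only for authenticated users and permitted purposes.

    Unknown licence classes are denied by default.
    """
    if not req.user_authenticated:
        return False
    return req.purpose in RULES.get(req.licence_class, set())


if __name__ == "__main__":
    print(may_mine(Request(True, "research-only", Purpose.NON_COMMERCIAL_RESEARCH)))
    print(may_mine(Request(True, "research-only", Purpose.COMMERCIAL)))
```

Denying unknown licence classes by default is a design choice of this sketch: content whose terms have not been classified is treated as not cleared for mining.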


Starting with repositories and OA publishers via OpenAIRE and CORE

In close collaboration with the FUTURETDM project
http://project.futuretdm.eu/
Community driven

Scholarly communication, life sciences, agriculture, social sciences

From the very beginning: requirements, content, barriers, expected outcomes.

… to the very end: create applications, validate and evaluate the results.

Use cases (1)

Scholarly communication analytics (OpenAIRE, CORE, Frontiers)
• Semantic search and discovery of open scientific outcomes
• Map of academia – scholarly communication network
• Research monitoring and analytics

Life sciences (EBI, Human Brain Project)
• Assisted curation of the EMBL-EBI chemical databases for metabolomics
• Curation of the neurosciences resources KnowledgeBase and Neurolex

Use cases (2)

Agriculture and biodiversity (INRA, AGRO-KNOW, EFSA)
• Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls
• Image, figure and dataset discovery in the AGRIS FAO online service

Social sciences (GESIS)
• Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science publications

Expectations from today’s workshop

• Establish contact and dialogue with content providers, especially OA content providers
• Understand current practices, problems and limitations
• Look into the emerging requirements
• Explore the technical, legal, policy and organisational challenges content providers face in making their data open for text and data mining
• Develop a common vision and strategy

twitter.com/openminted_eu

facebook.com/openminted

bit.do/openmintedlinkedin

vimeo.com/openminted

bit.do/openmintedplus

THANK YOU!
