openminted: making sense of large volumes of data
twitter.com/openminted_eu
Beyond Open Access: MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT
Stelios Piperidis Athena Research & Innovation Centre
The global research community generates over 1.5 million new
scholarly articles per annum. The STM report (2009)
…about scientific literature?
… some 90% of papers … are never cited. … 50% of papers are never read by anyone other than their authors, referees and journal editors (Lokman I. Meho, The rise and rise of citation analysis, 2007)
… one paper published every 30 seconds (The STM report, 2009)
… 70,000 papers published on a single protein, the tumor suppressor p53 (Spangler et al., Automated Hypothesis Generation based on Mining Scientific Literature, 2014)
Emerging solution(s)
Machine reading: process textual sources, organise and classify them in various dimensions, extract the main (indexical) information items
… and understanding: identify and extract entities and relations between entities, facilitate the transformation of unstructured textual sources into structured data
… and predicting: enable the multidimensional analysis of structured data to extract meaningful insights and improve the ability to predict
Structuring and mining textual data
Many examples from medical research.
An example from social sciences: study social confrontation in Greek society, with a focus on the years of the crisis, based on newspaper corpora.
What claims did the social agents (parties, unions, different professional associations, etc.) make, against which government/state bodies, which instruments did they use, and how were they reported in different newspapers?
Study social confrontation example
Κατάληψη στα Υποθηκοφυλακεία Πειραιώς και Σαλαμίνας αποφάσισε ο Δικηγορικός Σύλλογος Πειραιώς (ΔΣΠ), στις 26 και 27 Απριλίου 2011, διαμαρτυρόμενος για τα σοβαρότατα προβλήματα λειτουργίας που παρουσιάζουν.
The Piraeus Bar Association (DSP) decided to occupy the land registries of Piraeus and Salamis on 26 and 27 April 2011, protesting against the very serious operational problems they present.
Study social confrontation example
[Figure: ILSP-NLP stack and IE workflow. Processing steps: Named Entity Recognition, Chunking, Dependency Parsing, Co-reference Resolution, Aggregation, Analytics, Summarize/Export. Extracted fields: Form, Actor/Addressee, Issue, Time/Location, Claims.]
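The idea of turning a newspaper sentence into a structured claim record (form, actor, issue, time/location) can be sketched roughly as follows. This is a toy illustration only, not the ILSP-NLP workflow: the `Claim` class, the keyword list, the place list and the regular expressions are all hypothetical stand-ins for trained components such as named entity recognition and dependency parsing, and the input is a cleaned-up English rendering of the example sentence.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy resources; a real workflow would use trained NLP components.
PROTEST_FORMS = {"occupy": "occupation", "occupation": "occupation", "strike": "strike"}
PLACES = ["Piraeus", "Salamis"]
DATE_RE = re.compile(r"\b\d{1,2}(?: and \d{1,2})? [A-Z][a-z]+ \d{4}\b")


@dataclass
class Claim:
    actor: Optional[str] = None
    form: Optional[str] = None
    locations: List[str] = field(default_factory=list)
    time: Optional[str] = None
    issue: Optional[str] = None


def extract_claim(sentence: str) -> Claim:
    """Fill a Claim record from one English sentence using simple patterns."""
    claim = Claim()
    rest = sentence
    # Actor: the subject span before the verb "decided" (assumed sentence shape).
    actor_match = re.match(r"(?:The )?(.+?) decided\b", sentence)
    if actor_match:
        claim.actor = actor_match.group(1)
        rest = sentence[actor_match.end():]
    # Form: the first protest keyword found after the actor span.
    for word in re.findall(r"[a-z]+", rest.lower()):
        if word in PROTEST_FORMS:
            claim.form = PROTEST_FORMS[word]
            break
    # Location: gazetteer lookup outside the actor span.
    claim.locations = [place for place in PLACES if place in rest]
    # Time: a date expression such as "26 and 27 April 2011".
    date_match = DATE_RE.search(rest)
    if date_match:
        claim.time = date_match.group(0)
    # Issue: whatever follows "protesting about/against".
    issue_match = re.search(r"protesting (?:about|against) (.+?)\.?$", rest)
    if issue_match:
        claim.issue = issue_match.group(1)
    return claim


SENTENCE = (
    "The Piraeus Bar Association (DSP) decided to occupy the land registries "
    "of Piraeus and Salamis on 26 and 27 April 2011, protesting against the "
    "very serious operational problems they present."
)
claim = extract_claim(SENTENCE)
print(claim)
```

Records of this shape are what the aggregation and analytics steps would then count and compare, for example by actor, by form of protest or by newspaper.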
Main objective
Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly-related sources.
Key aspects
infrastructure - focus on interoperability
build on existing TDM tools - no new algorithms
service oriented - discovery, re-use of content & tools
community driven - user-centric requirements
open science - openness at all levels
The landscape
Text Mining Researchers
Content Providers
End Users
Computing Infrastructures
The project
• Started: June 2015
• Duration: 3 years
• Total budget: 6,068,074 Euros
16 Partners
• 6 mining research groups
• 3 content providers
• 1 data center
• 1 library association
• 2 legal experts
• 6 community related partners
• 2 SMEs
Partners
Athena RIC
Univ. of Manchester (NaCTeM)
Univ. of Darmstadt
INRA
EMBL-EBI
Agro-Know
LIBER
Univ. of Amsterdam
Open University UK
EPFL
CNIO
Univ. of Sheffield (GATE)
GESIS
GRNET
Frontiers
Univ. of Stirling
The challenges
Content: Barriers and obstacles due to non-availability, technical restrictions, copyright law or licensing issues. No uniform way to search for, retrieve and access content for TDM.
Services: How to identify the most fitting one? Do I have permission to use it? How to combine it with other services I have access to or need? How to use them on my content?
Processing: Where to deploy? Are my machines powerful enough? How can I get access to powerful machines? Where to store intermediate and final results? How to ensure persistence of storage?
Bring all stakeholders together!
Accessible content
Metadata and transfer protocols
• Document literature content, language resources, data categories taxonomies, provenance information
• Generic and domain-specific metadata descriptions
• Identify standards for metadata harvesting and federated search in distributed repositories
IPR and licensing
• Study IPR restrictions for reuse of sources
• Exceptions?
• What about non-commercial research?
• Translate the legal & policy aspects into authentication and authorization specifications (GEANT’s eduGAIN, …)
• User-to-service and service-to-service interactions
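One way to picture "translating legal and policy aspects into authorization specifications" is a lookup that a service consults before mining a document. The sketch below is purely illustrative: the `POLICY` table, the `Request` fields and the `is_authorized` function are invented for this example and do not describe the project's actual design or any legal rule.

```python
from dataclasses import dataclass

# Hypothetical policy table: licence label -> purposes for which mining is allowed.
# The entries are illustrative assumptions, not a statement of what any licence permits.
POLICY = {
    "CC-BY": {"commercial", "non-commercial"},
    "CC-BY-NC": {"non-commercial"},
    "all-rights-reserved": set(),
}


@dataclass(frozen=True)
class Request:
    requester: str  # a user or a service identifier (user-to-service or service-to-service)
    licence: str    # licence label attached to the content
    purpose: str    # declared purpose of the mining run


def is_authorized(request: Request) -> bool:
    """Allow the request only if the licence is known and lists the declared purpose."""
    return request.purpose in POLICY.get(request.licence, set())


print(is_authorized(Request("user:alice", "CC-BY-NC", "non-commercial")))
print(is_authorized(Request("service:annotator", "CC-BY-NC", "commercial")))
```

Unknown licences are denied by default here; whether exceptions (for instance for non-commercial research) should override that default is exactly the kind of question the bullets above leave open.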
Starting with repositories and OA
publishers
via OpenAIRE and CORE
In close collaboration with the
FUTURETDM project
http://project.futuretdm.eu/
Community driven
Scholarly communication, life sciences, agriculture, social sciences
From the very beginning… Requirements, content, barriers, expected outcomes.
… to the very end: Create applications, validate and evaluate the results.
Use cases (1)
Scholarly communication analytics (OpenAIRE, CORE, Frontiers)
• Semantic search and discovery of open scientific outcomes
• Map of academia – scholarly communication network
• Research monitoring and analytics
Life sciences (EBI, Human Brain Project)
• Assisted curation of the EMBL-EBI chemical databases for metabolomics
• Curation of the neurosciences resources KnowledgeBase and Neurolex
Use cases (2)
Agriculture and biodiversity (INRA, AGRO-KNOW, EFSA)
• Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls
• Image, figure and dataset discovery in the AGRIS FAO online service
Social sciences (GESIS)
• Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science scientific publications
Expectations from today’s workshop
• Establish contact and dialogue with content providers, especially OA content providers
• Understand current practices, problems and limitations
• Look into the emerging requirements
• Explore the technical, legal, policy and organisational challenges content providers face in making their data open for text and data mining
• Develop a common vision and strategy
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus
THANK YOU!