


Post on 13-Jul-2020


LABEX NUMEV: Solutions Numériques, Matérielles et Modélisation pour l’Environnement et le Vivant (Digital and Hardware Solutions and Modelling for the Environment and Life Sciences)


RESULTS:

NUMEV positioning: Data axis

Keywords: Text mining, information retrieval, Ontological and Terminological Resource, N-ary Relations, Unit of measure extraction

ABSTRACT: Automatically extracting relevant information from scientific documents, particularly experimental quantitative data, is a painstaking process. Experimental results can be represented as N-ary Relations, which are used to model a domain of knowledge in an Ontological and Terminological Resource (OTR). Information extraction enriches the OTR with N-ary Relation instances that link a studied object (e.g. a packaging) with its features (e.g. thickness, O2 permeability). Information extraction raises several scientific challenges: (1) locating relevant information buried in the text, (2) automatically extracting symbolic concept instances (e.g. a packaging) and quantitative concept instances (e.g. a thickness), and (3) automatically or semi-automatically feeding the OTR with these new instances. This work proposes a method that locates quantitative data using text mining and extracts specific patterns for recognising symbolic and quantitative concepts using syntactic and semantic analysis.

METHOD

Soumia Lilia Berrahou (1,2), Patrice Buche (1,2), Juliette Dibie-Barthélemy (3), Mathieu Roche (1)

1 – LIRMM, 2 – INRA-UMR IATE, 3 – INRA-Mét@risk-AgroParisTech

Extraction of relevant information in scientific documents guided by an Ontological and Terminological Resource

Contribution 1: Reducing the search space for relevant information to the experimental data involved in N-ary relations.
Approach: Using text mining and supervised learning methods, guided by the OTR.
Key points:
• Several textual contexts evaluated
• Several learning algorithms tested
• Several word weighting measures computed

Expected contribution: Identifying new patterns for symbolic and quantitative concepts involved in N-ary relations.
Approach: Using syntactic and semantic analysis after reducing the search space of relevant information. Designing specific patterns for quantitative data extraction and facilitating the annotation process of scientific documents.

DESCRIPTION OF EXPERIMENTS
• Data: a corpus of 115 scientific documents (i.e. 35,000 sentences)
• Textual contexts: XP1 (only the sentence where at least one unit appears), XP2 (2 sentences after), XP3 (2 sentences before)
• Supervised learning algorithms: Decision Trees, Naive Bayes for text (DMNB),…
• Word weighting measures: TF, TF.IDF, Okapi
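The three textual contexts can be illustrated with a minimal Python sketch. It is not the authors' implementation: the unit list `KNOWN_UNITS` is a hypothetical stand-in for the units stored in the OTR, the function names are invented for illustration, and the sketch assumes that XP2 and XP3 keep the unit-bearing sentence itself in the window.

```python
# Hypothetical stand-in for the units of measure known to the OTR.
KNOWN_UNITS = {"μm", "cm^3", "kPa"}


def contains_unit(sentence, units=KNOWN_UNITS):
    """Return True if any token of the sentence is a known unit."""
    tokens = (tok.strip(".,;:()") for tok in sentence.split())
    return any(tok in units for tok in tokens)


def build_context(sentences, index, experiment):
    """Return the window of sentences around position `index`.

    XP1: the unit-bearing sentence only.
    XP2: that sentence plus the 2 sentences after it (assumed to include it).
    XP3: the 2 sentences before it plus that sentence (assumed to include it).
    """
    if experiment == "XP1":
        start, end = index, index + 1
    elif experiment == "XP2":
        start, end = index, index + 3
    elif experiment == "XP3":
        start, end = max(0, index - 2), index + 1
    else:
        raise ValueError(f"unknown experiment: {experiment}")
    return sentences[start:end]


def extract_contexts(sentences, experiment):
    """Build one context per sentence that mentions a known unit."""
    return [
        build_context(sentences, i, experiment)
        for i, sentence in enumerate(sentences)
        if contains_unit(sentence)
    ]


sentences = [
    "Apples were washed.",
    "They were cut into wedges.",
    "The film had a thickness of 64 μm.",
    "Trays were stored in the dark.",
    "Samples were analysed weekly.",
]
xp1 = extract_contexts(sentences, "XP1")
xp2 = extract_contexts(sentences, "XP2")
xp3 = extract_contexts(sentences, "XP3")
```

Each context would then be turned into a feature vector and passed to a classifier; that step is left out here because the poster does not detail it.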

Result: The relevant context is the sentence in which at least one known unit from the OTR appears, i.e. 5,000 sentences, which reduces the search space by almost 86%.
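The reduction figure follows directly from the two sentence counts given on the poster (35,000 sentences in the corpus, 5,000 retained). A short Python check, with a function name chosen here for illustration:

```python
def search_space_reduction(total_sentences, retained_sentences):
    """Fraction of the corpus that no longer has to be searched."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return 1.0 - retained_sentences / total_sentences


reduction = search_space_reduction(35_000, 5_000)
print(f"{reduction:.1%}")  # prints 85.7%
```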

Table 1. Results of « Unit » instances according to each textual context. (P) Precision, (R) Recall, (F) F-measure
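For reference, the three measures named in the caption can be computed from classification counts as sketched below. The counts in the usage lines are invented for illustration and are not taken from the poster's tables; the balanced F-measure (beta = 1) is assumed.

```python
def precision_recall_f(true_pos, false_pos, false_neg, beta=1.0):
    """Return (precision, recall, F-measure) from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Invented counts, for illustration only.
p, r, f = precision_recall_f(true_pos=80, false_pos=20, false_neg=10)
```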

Table 2. Results of « Unit » instances according to weight-based measures in the XP1 textual context.
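The poster names the weighting measures without giving their formulas. The sketch below uses common textbook definitions: raw term frequency, TF multiplied by a logarithmic IDF, and the Okapi BM25 weight with the usual default parameters `k1 = 1.2` and `b = 0.75`. The exact variants used in the experiments may differ, so treat the formulas, parameter values and the toy corpus as assumptions.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Raw count of `term` in a tokenised document."""
    return Counter(doc_tokens)[term]


def inverse_document_frequency(term, corpus):
    """log(N / n), where n is the number of documents containing `term`."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0


def tf_idf(term, doc_tokens, corpus):
    """TF.IDF weight of `term` in one document of the corpus."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)


def okapi_bm25(term, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 weight (variant with a non-negative IDF, an assumption)."""
    tf = term_frequency(term, doc_tokens)
    n_containing = sum(1 for doc in corpus if term in doc)
    idf = math.log((len(corpus) - n_containing + 0.5) / (n_containing + 0.5) + 1.0)
    avg_len = sum(len(doc) for doc in corpus) / len(corpus)
    length_norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Toy corpus of tokenised sentences, invented for illustration.
corpus = [
    ["film", "thickness", "μm"],
    ["apple", "wedges"],
    ["film", "tray"],
]
weights = {
    "tf": term_frequency("film", corpus[0]),
    "tf_idf": tf_idf("film", corpus[0], corpus),
    "okapi": okapi_bm25("film", corpus[0], corpus),
}
```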

1 Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthélemy, Mathieu Roche: How to Extract Unit of Measure in Scientific Documents? KDIR/KMIS 2013

Example: Eight apple wedges were packaged into polypropylene trays of 500 cm^3 and wrap-sealed using a 64 μm thickness polypropylene film.

Syntactic and semantic analysis

num(film, 64)
nn(film, μm)
nn(film, thickness)
nn(film, polypropylene)

num, nn, amod define syntactic dependencies between those words.
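The dependency triples above can be grouped by their head word to surface a candidate quantitative instance. The following Python sketch only post-processes the triples as printed on the poster; it does not run a parser, and the unit set `known_units` and the function names are hypothetical stand-ins for what the OTR would supply.

```python
import re
from collections import defaultdict

# Matches triples written as relation(head, dependent).
TRIPLE = re.compile(r"(\w+)\((.+?),\s*(.+?)\)")


def parse_dependencies(text):
    """Turn 'num(film, 64) nn(film, μm)' into (relation, head, dependent) tuples."""
    return [(rel, head.strip(), dep.strip()) for rel, head, dep in TRIPLE.findall(text)]


def quantity_candidates(triples, known_units):
    """Pair each numeric modifier of a head with the known units attached to that head."""
    by_head = defaultdict(lambda: defaultdict(list))
    for rel, head, dep in triples:
        by_head[head][rel].append(dep)

    candidates = []
    for head, rels in by_head.items():
        nouns = rels.get("nn", [])
        units = [dep for dep in nouns if dep in known_units]
        modifiers = [dep for dep in nouns if dep not in known_units]
        for value in rels.get("num", []):
            for unit in units:
                candidates.append(
                    {"head": head, "value": float(value), "unit": unit, "modifiers": modifiers}
                )
    return candidates


dependencies = "num(film, 64) nn(film, μm) nn(film, thickness) nn(film, polypropylene)"
triples = parse_dependencies(dependencies)
candidates = quantity_candidates(triples, known_units={"μm"})
```

In this sketch the remaining noun modifiers (`thickness`, `polypropylene`) are kept alongside the value and unit, since they are the words that would be matched against quantitative and symbolic concepts of the OTR.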

Result: The expected result is to find correlated concepts involved in N-ary relations. The example shows dependencies between concepts known in the OTR.
