Extraction of relevant information in scientific documents guided by an Ontological and Terminological Resource (poster, LABEX NUMEV)



LABEX NUMEV: Solutions Numériques Matérielles et Modélisation pour l’Environnement et le Vivant

RESULTS:

NUMEV positioning: Data axis

Keywords: Text mining, Information retrieval, Ontological and Terminological Resource, N-ary relations, Unit of measure extraction

ABSTRACT: Automatically extracting relevant information from scientific documents, particularly experimental quantitative data, is a painstaking process. Experimental results can be represented as N-ary relations, which are used to model a domain of knowledge in an Ontological and Terminological Resource (OTR). Information extraction enriches the OTR with N-ary relation instances that link a studied object (e.g. a packaging) with its features (e.g. thickness, O2 permeability, …). The task raises several scientific challenges: (1) locating relevant information buried in text, (2) automatically extracting instances of symbolic concepts (e.g. a packaging) and of quantitative concepts (e.g. a thickness), and (3) automatically or semi-automatically feeding the OTR with these new instances. This work proposes a method that first locates quantitative data using text mining, then extracts specific patterns for recognizing symbolic and quantitative concepts using syntactic and semantic analysis.
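As a rough illustration of what an extracted N-ary relation instance might look like, the Python sketch below links a studied object to its quantitative features. The class names, field names and the relation name are hypothetical and are not taken from the OTR. Only the values (a polypropylene film with a thickness of 64 μm) come from the poster's example sentence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class QuantityArg:
    """A quantitative concept instance: concept name, numeric value, unit."""
    concept: str
    value: float
    unit: str


@dataclass
class NaryRelationInstance:
    """Links one studied object (symbolic concept) to its quantitative features."""
    relation: str
    studied_object: str
    features: List[QuantityArg] = field(default_factory=list)

    def feature(self, concept: str) -> Optional[QuantityArg]:
        """Return the first feature with the given concept name, or None."""
        for arg in self.features:
            if arg.concept == concept:
                return arg
        return None


# Hypothetical relation name; values taken from the poster's example sentence.
instance = NaryRelationInstance(
    relation="packaging_thickness",
    studied_object="polypropylene film",
    features=[QuantityArg("thickness", 64.0, "μm")],
)
```

Keeping the studied object separate from a list of (concept, value, unit) arguments is one simple way to let the same structure hold relations of any arity.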

METHOD

Soumia Lilia Berrahou (1,2), Patrice Buche (1,2), Juliette Dibie-Barthélemy (3), Mathieu Roche (1)

1 – LIRMM, 2 – INRA-UMR IATE, 3 – INRA-Mét@risk-AgroParisTech

Extraction of relevant information in scientific documents guided by an Ontological and Terminological Resource

Contribution 1: Reducing the search space for relevant information to the experimental data involved in N-ary relations.

Approach: Using text mining and supervised learning methods, guided by the OTR.

Key points:
• Several textual contexts evaluated
• Several learning algorithms tested
• Several word weighting measures computed

Expected contribution: Identifying new patterns for symbolic and quantitative concepts involved in N-ary relations.

Approach: Using syntactic and semantic analysis after reducing the search space of relevant information. Designing specific patterns for quantitative data extraction and facilitating the annotation process of scientific documents.

DESCRIPTION OF EXPERIMENTS
• Data: a corpus of 115 scientific documents (i.e. 35,000 sentences)
• Textual contexts: XP1 (only the sentence where at least one unit appears), XP2 (2 sentences after), XP3 (2 sentences before)
• Supervised learning algorithms: Decision Trees, Naive Bayes for text (DMNB), …
• Word weighting measures: TF, TF.IDF, Okapi
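The three textual contexts can be sketched as sentence windows around each sentence that contains a known unit. This is a minimal illustration, not the authors' implementation. It assumes that XP2 and XP3 include the unit sentence itself plus the two following or preceding sentences, that units can be matched as whitespace-separated tokens, and the function names and sample sentences are invented for the example.

```python
def contains_unit(sentence, units):
    """True if any known unit appears as a whole token in the sentence."""
    tokens = {tok.strip(".,;:()") for tok in sentence.split()}
    return not tokens.isdisjoint(units)


def build_contexts(sentences, units, before=0, after=0):
    """Return one text window per sentence that contains a known unit.

    before/after give the number of neighbouring sentences to include
    (assumption: the unit sentence is always part of the window).
    """
    contexts = []
    for i, sentence in enumerate(sentences):
        if contains_unit(sentence, units):
            start = max(0, i - before)
            end = min(len(sentences), i + after + 1)
            contexts.append(" ".join(sentences[start:end]))
    return contexts


# Invented sample data, for illustration only.
sentences = [
    "Apples were stored at low temperature.",
    "Trays were sealed using a 64 μm film.",
    "Samples were analysed weekly.",
    "Colour was recorded.",
    "Results are discussed below.",
]
units = {"μm", "cm^3"}

xp1 = build_contexts(sentences, units)             # unit sentence only
xp2 = build_contexts(sentences, units, after=2)    # plus 2 sentences after
xp3 = build_contexts(sentences, units, before=2)   # plus 2 sentences before
```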

Result: The relevant context is the sentence where at least one known unit from the OTR appears, i.e. 5,000 sentences, which reduces the search space by almost 86%.
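The reported reduction follows directly from the two sentence counts given on the poster (35,000 sentences in the corpus, 5,000 retained):

```latex
\[
1 - \frac{5\,000}{35\,000} \;=\; 1 - \frac{1}{7} \;\approx\; 0.857
\quad\text{(about 86\%)}
\]
```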

Table 1. Results for « Unit » instances according to each textual context. (P) Precision, (R) Recall, (F) F-measure
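For reference, the three evaluation measures named in the caption can be computed from counts of true positives, false positives and false negatives. The sketch below uses the standard definitions, with F taken as the balanced F1 score (an assumption, since the poster does not state the weighting). The counts in the usage line are made up and are not results from the poster.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F1) from raw classification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up counts, for illustration only.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
```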

Table 2. Results for « Unit » instances according to weight-based measures in the XP1 textual context.
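The weighting measures compared here (TF, TF.IDF, Okapi) can be sketched with the standard library alone. The poster does not say which variants were used, so the formulas below are assumptions: raw term frequency, a natural-log IDF, and the common BM25 form of Okapi with the usual default parameters `k1=1.2` and `b=0.75`. The toy documents are invented.

```python
import math
from collections import Counter


def tf(term, doc):
    """Raw term frequency of term in a tokenised document."""
    return Counter(doc)[term]


def idf(term, docs):
    """Natural-log inverse document frequency (0.0 for unseen terms)."""
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n) if n else 0.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


def okapi_bm25(term, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of term in doc (assumed variant, smoothed IDF)."""
    n = sum(1 for d in docs if term in d)
    idf_bm25 = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)
    avg_len = sum(len(d) for d in docs) / len(docs)
    f = tf(term, doc)
    return idf_bm25 * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))


# Invented toy corpus: each document is a list of tokens.
docs = [
    ["film", "thickness", "64", "μm"],
    ["apple", "wedges", "stored"],
    ["film", "permeability"],
]
```

In this toy corpus a term that occurs in a single document, such as "μm", receives a higher TF.IDF and BM25 weight than "film", which occurs in two of the three documents.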

1 Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthélemy, Mathieu Roche: How to Extract Unit of Measure in Scientific Documents? KDIR/KMIS 2013

Example: Eight apple wedges were packaged into polypropylene trays of 500 cm^3 and wrap-sealed using a 64 μm thickness polypropylene film.

Syntactic and semantic analysis

num(film, 64)
nn(film, μm)
nn(film, thickness)
nn(film, polypropylene)

num, nn and amod denote syntactic dependencies between those words.

Result: The expected result is to find correlated concepts involved in N-ary relations. The example shows dependencies between concepts known in the OTR.
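One way such dependencies could be turned into a candidate pattern is sketched below. This is a hypothetical illustration, not the authors' method. It parses the dependency notation shown on the poster, groups dependents by head word, and pairs each numeric dependent with modifiers found in small lookup sets. The sets `KNOWN_UNITS` and `KNOWN_QUANTITIES` stand in for the OTR and are invented for the example, as are the function names.

```python
import re
from collections import defaultdict

# Matches items written as rel(head, dependent), e.g. "num(film, 64)".
DEP_PATTERN = re.compile(r"(\w+)\(([^,]+),\s*([^)]+)\)")

# Hypothetical stand-ins for unit and quantity concepts known in the OTR.
KNOWN_UNITS = {"μm", "cm^3"}
KNOWN_QUANTITIES = {"thickness"}


def parse_dependencies(text):
    """Parse 'rel(head, dependent)' items into (rel, head, dependent) triples."""
    return [(rel, head.strip(), dep.strip())
            for rel, head, dep in DEP_PATTERN.findall(text)]


def extract_quantities(triples, units, quantities):
    """Pair each numeric dependent of a head with its unit and quantity modifiers."""
    by_head = defaultdict(lambda: defaultdict(list))
    for rel, head, dep in triples:
        by_head[head][rel].append(dep)

    results = []
    for head, rels in by_head.items():
        modifiers = rels.get("nn", [])
        unit = next((m for m in modifiers if m in units), None)
        quantity = next((m for m in modifiers if m in quantities), None)
        if unit is None:
            continue  # no known unit attached to this head word
        for value in rels.get("num", []):
            results.append({"object": head, "value": float(value),
                            "unit": unit, "quantity": quantity})
    return results


deps = "num(film, 64) nn(film, μm) nn(film, thickness) nn(film, polypropylene)"
found = extract_quantities(parse_dependencies(deps), KNOWN_UNITS, KNOWN_QUANTITIES)
```

On the poster's example, this groups 64, μm and thickness under the head word film, which is the kind of correlation between known concepts that the expected result describes.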