LABEX NUMEV: Solutions Numériques, Matérielles et Modélisation pour l'Environnement et le Vivant (Digital and Hardware Solutions and Modelling for the Environment and Life Sciences)
RESULTS:
NUMEV positioning: Data axis
Keywords: Text mining, Information retrieval, Ontological and Terminological Resource, N-ary relations, Unit of measure extraction
ABSTRACT: Automatically extracting relevant information from scientific documents, particularly experimental quantitative data, is a painstaking process. Experimental results can be represented as N-ary relations, which are used to model a domain of knowledge in an Ontological and Terminological Resource (OTR). Information extraction enriches the OTR with N-ary relation instances that link a studied object (e.g. a packaging) with its features (e.g. thickness, O2 permeability, …). The information extraction task raises several scientific challenges: (1) locating relevant information buried in the text, (2) automatically extracting symbolic concept instances (e.g. a packaging) and quantitative concept instances (e.g. a thickness), and (3) automatically or semi-automatically feeding the OTR with those new instances. This work proposes a method that locates quantitative data using text mining, then extracts specific patterns for recognising symbolic and quantitative concepts using syntactic and semantic analysis.
METHOD
Soumia Lilia Berrahou 1,2, Patrice Buche 1,2, Juliette Dibie-Barthélemy 3, Mathieu Roche 1
1 – LIRMM, 2 – INRA-UMR IATE, 3 – INRA-Mét@risk
Extraction of relevant information in scientific documents guided by an Ontological and Terminological Resource
Contribution 1: Reducing the search space for relevant information to the experimental data involved in N-ary relations.
Approach: Using text mining and supervised learning methods, guided by the OTR.
Key points:
§ Several textual contexts evaluated
§ Several learning algorithms tested
§ Several word weighting measures computed
Expected contribution: Identifying new patterns for symbolic and quantitative concepts involved in N-ary relations.
Approach: Using syntactic and semantic analysis after reducing the search space of relevant information. Designing specific patterns for quantitative data extraction and facilitating the annotation process of scientific documents.
DESCRIPTION OF EXPERIMENTS
§ Data: a corpus of 115 scientific documents (i.e. 35,000 sentences)
§ Textual contexts: XP1 (only the sentence where at least one unit appears), XP2 (2 sentences after), XP3 (2 sentences before)
§ Supervised learning algorithms: Decision Trees, Naive Bayes for text (DMNB),…
§ Word weighting measures: TF, TF.IDF, Okapi
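The three word weighting measures listed above can be illustrated with a minimal sketch. It treats each sentence as a "document". The poster does not state which Okapi variant or parameters were used, so the BM25 formula with a smoothed, non-negative IDF and the defaults `k1=1.2`, `b=0.75` are assumptions. The toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def tf(term, sentence):
    """Raw term frequency of `term` in a tokenised sentence."""
    return Counter(sentence)[term]

def doc_freq(term, corpus):
    """Number of sentences in the corpus that contain `term`."""
    return sum(1 for sentence in corpus if term in sentence)

def tf_idf(term, sentence, corpus):
    """TF.IDF with the classic idf = log(N / df); 0.0 for unseen terms."""
    df = doc_freq(term, corpus)
    if df == 0:
        return 0.0
    return tf(term, sentence) * math.log(len(corpus) / df)

def okapi_bm25(term, sentence, corpus, k1=1.2, b=0.75):
    """Okapi BM25 weight (assumed variant, see lead-in)."""
    n = len(corpus)
    df = doc_freq(term, corpus)
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
    avg_len = sum(len(s) for s in corpus) / n
    f = tf(term, sentence)
    # Length normalisation: longer-than-average sentences are penalised.
    norm = f + k1 * (1 - b + b * len(sentence) / avg_len)
    return idf * f * (k1 + 1) / norm

# Hypothetical toy corpus: one token list per sentence.
corpus = [
    ["64", "μm", "thickness", "polypropylene", "film"],
    ["polypropylene", "trays", "of", "500", "cm^3"],
    ["the", "film", "was", "wrap-sealed"],
]

for term in ("μm", "film"):
    print(term, tf(term, corpus[0]),
          round(tf_idf(term, corpus[0], corpus), 3),
          round(okapi_bm25(term, corpus[0], corpus), 3))
```

In this toy corpus the unit "μm" occurs in fewer sentences than "film", so both TF.IDF and BM25 give it a higher weight in the first sentence even though the raw TF is the same.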
Result: The relevant context is the sentence where at least one known unit from the OTR appears. This keeps 5,000 of the 35,000 sentences, reducing the search space by almost 86%.
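As a rough illustration of the XP1 context, the sketch below keeps only sentences that contain a known unit and computes the search-space reduction from the counts reported above. The unit set, the whitespace tokenisation and the second sentence are hypothetical; the real OTR unit list and matching procedure are not described on the poster.

```python
# Hypothetical excerpt of the units known in the OTR.
KNOWN_UNITS = {"μm", "cm^3", "mm"}

def contains_known_unit(sentence, units=KNOWN_UNITS):
    """True if any whitespace-separated token, stripped of punctuation, is a known unit."""
    return any(token.strip(".,;()") in units for token in sentence.split())

def reduction(total, kept):
    """Fraction of the search space that is discarded."""
    return 1 - kept / total

sentences = [
    "Eight apple wedges were packaged into polypropylene trays of 500 cm^3 "
    "and wrap-sealed using a 64 μm thickness polypropylene film.",
    "The samples were stored for one week.",  # hypothetical sentence without a unit
]

relevant = [s for s in sentences if contains_known_unit(s)]
print(len(relevant), "of", len(sentences), "sentences kept")

# Counts reported on the poster: 35,000 sentences, 5,000 kept.
print(f"{reduction(35_000, 5_000):.1%}")
```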
Table 1. Results of « Unit » instances according to each textual context. (P) Precision, (R) Recall, (F) F-measure
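For readers unfamiliar with the three scores named in the caption, here is a minimal sketch of how they are usually computed from classification counts. It assumes the balanced F-measure (F1); the poster does not say which weighting was used. The counts in the example are hypothetical and are not taken from Table 1.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts for illustration only.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))
```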
Table 2. Results of « Unit » instances according to the word weighting measures in the XP1 textual context.
1 Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthélemy, Mathieu Roche: How to Extract Unit of Measure in Scientific Documents? KDIR/KMIS 2013
Example: Eight apple wedges were packaged into polypropylene trays of 500 cm^3 and wrap-sealed using a 64 μm thickness polypropylene film.
Syntactic and semantic analysis
num(film, 64) nn(film, μm)
nn(film, thickness) nn(film, polypropylene)
num, nn and amod denote syntactic dependencies between those words.
Result: The expected result is to find correlated concepts involved in N-ary relations. The example shows dependencies between concepts known in the OTR.
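One way such dependencies could be exploited is sketched below: the four triples from the example are grouped by head word, and a head that carries a number, a known unit and a known quantity is reported as a candidate measure. This is only an illustration, not the authors' extraction patterns. The `UNITS` and `QUANTITIES` sets stand in for OTR lookups and are hypothetical, as are the function names.

```python
from collections import defaultdict

# Dependencies from the poster example, written as (relation, head, dependent).
DEPENDENCIES = [
    ("num", "film", "64"),
    ("nn", "film", "μm"),
    ("nn", "film", "thickness"),
    ("nn", "film", "polypropylene"),
]

# Hypothetical stand-ins for concepts known in the OTR.
UNITS = {"μm"}
QUANTITIES = {"thickness"}

def group_by_head(dependencies):
    """Map each head word to its (relation, dependent) pairs."""
    grouped = defaultdict(list)
    for relation, head, dependent in dependencies:
        grouped[head].append((relation, dependent))
    return grouped

def candidate_measure(head, grouped):
    """Return a candidate measure if the head has a value, a unit and a quantity."""
    value = unit = quantity = None
    for relation, dependent in grouped.get(head, []):
        if relation == "num":
            value = dependent
        elif relation == "nn" and dependent in UNITS:
            unit = dependent
        elif relation == "nn" and dependent in QUANTITIES:
            quantity = dependent
    if None in (value, unit, quantity):
        return None
    return {"object": head, "quantity": quantity, "value": value, "unit": unit}

grouped = group_by_head(DEPENDENCIES)
print(candidate_measure("film", grouped))
```

Grouping by head keeps the sketch independent of word order in the sentence: the value, unit and quantity only need to attach to the same noun.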