>lingway█ >Lingway Fact Extractor (LFE)█ > Introduction > Goals Crossmarc / Lingway > Lingway adaptation of the NHLRT approach > Rule induction > (ongoing work)

Upload: aubrey-dean

Post on 18-Jan-2016


Page 1:

>Lingway Fact Extractor (LFE)

> Introduction

> Goals Crossmarc / Lingway

> Lingway adaptation of the NHLRT approach

> Rule induction

> (ongoing work)

Page 2:

>Introduction

> LR, HLRT and NHLRT approaches

> LR wrapper (Left-Right)
    > a set {<l1,r1>,...,<lk,rk>} of 2k delimiters
    > rigid (the left and right delimiters, and the order between them, are fixed)

> HLRT (Head-Left-Right-Tail)
    > two additional delimiters, a head and a tail (e.g. <ul> and </ul>)

> NHLRT (Nested HLRT)
    > less rigid approach (conditional rules)
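
The difference between LR and HLRT wrappers can be illustrated with a minimal Python sketch. It is not taken from the slides: the function names `lr_extract` and `hlrt_extract` and the sample page used in the note below are assumptions made for illustration, and the delimiter matching is plain substring search.

```python
from typing import List, Tuple

Delims = List[Tuple[str, str]]  # [(l1, r1), ..., (lk, rk)]: hypothetical representation


def lr_extract(page: str, delims: Delims) -> List[Tuple[str, ...]]:
    """LR wrapper: repeatedly scan for each (left, right) pair, in fixed order."""
    tuples: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further left delimiter: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples  # unterminated attribute: stop
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


def hlrt_extract(page: str, head: str, tail: str, delims: Delims) -> List[Tuple[str, ...]]:
    """HLRT wrapper: apply the LR scan only between the head and tail delimiters."""
    h = page.find(head)
    if h == -1:
        return []
    body_start = h + len(head)
    t = page.find(tail, body_start)
    if t == -1:
        return []
    return lr_extract(page[body_start:t], delims)
```

On a page such as `<b>menu</b><ul><li><b>A</b> <i>1</i></li></ul>`, the LR scan alone would pick up `menu` as the first attribute, whereas the HLRT variant with head `<ul>` and tail `</ul>` restricts extraction to the list body.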

Kushmerick, N. Finite-state approaches to Web information extraction. In Proc. 3rd Summer Convention on Information Extraction, Rome, Italy, 2002.

Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence J. 118(1-2):15-68, special issue on Intelligent Internet Systems, 2000.

Page 3:

>Crossmarc architecture constraints and Lingway goals

> division of the process into NERC and FE

> multilingualism of the FE

> semi-automatic approach

> reuse of XTIRP, whose formalism accepts disjunction, missing and repeated elements, and free order

> => the result is a much more flexible formalism than that of the original NHLRT

Page 4:

>Lingway adaptation of the NHLRT approach

> Named entities (NE) are already recognised

> Aspects of the problem are:
    > detecting where a fact starts (the Head),
    > detecting where it ends (the Tail),
    > selecting the relevant NE (= dropping the non-relevant NE),
    > producing the fact (the tuple) proper.

Page 5:

>LFE - WHISK

> LFE is relatively close to WHISK, which can be seen as an extension of Kushmerick's systems using regular expressions that include disjunctions

> But LFE differs: it does not use semantic or linguistic categories, and its concrete algorithm and general philosophy are different.

Page 6:

>Head and Tail

> Recognition of Heads and Tails (technique similar to NERC)

> E.g. "Poste:" ("Position:"), "Intitulé:" ("Title:")

> (Important role of JOB_TITLE)
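
As a rough illustration of Head detection, the Python sketch below splits a text into candidate fact spans at the two cue strings given above. The function names are hypothetical, and the choice to end each span at the next Head (or at the end of the text) is an assumption, since the slides do not list any Tail cues.

```python
import re
from typing import List, Tuple

# Head cues taken from the slide; a real system would use a larger, learned set.
HEAD_CUES = ("Poste:", "Intitulé:")
HEAD_PATTERN = re.compile("|".join(re.escape(cue) for cue in HEAD_CUES))


def find_heads(text: str) -> List[Tuple[int, str]]:
    """Return (offset, cue) for every Head cue found in the text."""
    return [(m.start(), m.group()) for m in HEAD_PATTERN.finditer(text)]


def split_into_fact_spans(text: str) -> List[str]:
    """Cut the text into spans, each starting at a Head.

    Assumption: a span ends where the next Head starts, or at the end of the text.
    """
    heads = find_heads(text)
    spans = []
    for i, (start, _cue) in enumerate(heads):
        end = heads[i + 1][0] if i + 1 < len(heads) else len(text)
        spans.append(text[start:end].strip())
    return spans
```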

Page 7:

>Fact Extraction

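Poste: Ingénieur.

Vous avez une formation BAC + 4 et avez plus de 2 ans d'expérience. […]

(English: "Position: Engineer. You have a BAC + 4 education and more than 2 years of experience. […]")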

[Figure: the example text annotated with the labels Head, Left-cont, Right-cont, NE (named entity), Left-int-elm and Right-int-elm]

Page 8:

>Selecting / dropping elements

> Relations between elements
    > E.g. association between SCHEDULE and DURATION
    > In this case, a DURATION without SCHEDULE could be marked as NONFACT ("dropped"), etc.

> Testing of the context (previous and next NE)
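
The SCHEDULE / DURATION example can be sketched in Python as follows. This is only an illustration: the `NE` class, the `FACT` label and the reading of "association" as adjacency (a SCHEDULE immediately before or after the DURATION) are assumptions, not the rules actually used in LFE.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NE:
    type: str            # e.g. "SCHEDULE", "DURATION", "JOB_TITLE"
    text: str
    label: str = "FACT"  # hypothetical default; "NONFACT" means dropped


def drop_unassociated_durations(entities: List[NE]) -> List[NE]:
    """Mark a DURATION as NONFACT when neither neighbour is a SCHEDULE."""
    for i, ne in enumerate(entities):
        if ne.type != "DURATION":
            continue
        # Context test: look at the previous and next NE, if any.
        prev_type = entities[i - 1].type if i > 0 else None
        next_type = entities[i + 1].type if i + 1 < len(entities) else None
        if "SCHEDULE" not in (prev_type, next_type):
            ne.label = "NONFACT"
    return entities
```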

Page 9:

>Extension of XTIRP formalism

> NEXT, PREV
    > These operators test conditions on the previous and next elements (NE, NUMEX and TIMEX), including their types and attributes

> COUNTER
    > Just a trivial counter (from 0 to ...)

> CURRENT_MARK
    > This operator tests whether the current element is embedded in a given mark (notably for testing emphasised fonts)

> (generalisation) DYNAMIC VARIABLES
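
The semantics of these operators can be mimicked in a few lines of Python. The sketch below does not reproduce XTIRP syntax, which the slides do not show: the `Element` and `RuleContext` classes, their method names and the dictionary used for dynamic variables are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Element:
    type: str                                            # e.g. "NE", "NUMEX", "TIMEX"
    attrs: Dict[str, str] = field(default_factory=dict)
    marks: List[str] = field(default_factory=list)       # enclosing markup, e.g. ["b"]


class RuleContext:
    """Hypothetical evaluation context exposing PREV, NEXT, COUNTER and CURRENT_MARK."""

    def __init__(self, elements: List[Element]) -> None:
        self.elements = elements
        self.index = 0
        # Dynamic variables generalise the counter: any named value can be stored here.
        self.variables: Dict[str, Any] = {"COUNTER": 0}

    def prev_element(self) -> Optional[Element]:
        """PREV: the element before the current one, or None at the start."""
        return self.elements[self.index - 1] if self.index > 0 else None

    def next_element(self) -> Optional[Element]:
        """NEXT: the element after the current one, or None at the end."""
        nxt = self.index + 1
        return self.elements[nxt] if nxt < len(self.elements) else None

    def current_mark(self, mark: str) -> bool:
        """CURRENT_MARK: is the current element embedded in the given mark?"""
        return mark in self.elements[self.index].marks

    def advance(self) -> bool:
        """Move to the next element and increment COUNTER; False when exhausted."""
        self.index += 1
        self.variables["COUNTER"] += 1
        return self.index < len(self.elements)
```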

Page 10:

>Extension implementation

> NEXT, PREV
    > Ongoing development

> COUNTER
    > Implemented

> CURRENT_MARK
    > Ongoing

> (generalisation) DYNAMIC VARIABLES
    > Implemented

Page 11:

>Production of facts proper

> Once the NE are marked as belonging to a fact (fact#1, etc.) or as being "nonfacts":

> a simple XSLT program extracts the facts into the corresponding XML output format
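
The slides use XSLT for this step; the sketch below shows the same grouping idea in Python with the standard `xml.etree.ElementTree` module instead. The input and output formats are assumptions: `ne` elements carrying `type` and `fact` attributes, with `fact="nonfact"` for dropped entities, and a `facts` / `fact` / `slot` output layout. None of these names come from the actual LFE or Crossmarc schemas.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def extract_facts(annotated_xml: str) -> str:
    """Group marked NE elements by fact id and emit one <fact> element per group."""
    root = ET.fromstring(annotated_xml)
    grouped: Dict[str, List[ET.Element]] = {}
    for ne in root.iter("ne"):
        fact_id = ne.get("fact")
        if fact_id is None or fact_id == "nonfact":
            continue  # unmarked or dropped entity
        grouped.setdefault(fact_id, []).append(ne)

    facts = ET.Element("facts")
    for fact_id, entities in grouped.items():
        fact = ET.SubElement(facts, "fact", id=fact_id)
        for ne in entities:
            slot = ET.SubElement(fact, "slot", type=ne.get("type", "UNKNOWN"))
            slot.text = ne.text
    return ET.tostring(facts, encoding="unicode")
```

Grouping by an explicit fact identifier keeps this step independent of how the marking rules decided which entities belong together.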

Page 12:

>Calendar

> First complete version: 21st of July

> Evaluation: end of July