>lingway█
>Lingway Fact Extractor (LFE)█
> Introduction
> Goals Crossmarc / Lingway
> Lingway adaptation of the NHLRT approach
> Rule induction
> (ongoing work)
>Introduction█
> LR, HLRT and NHLRT approaches
> LR wrapper (Left-Right)
  > Set {<l1,r1>,...,<lk,rk>} of 2k delimiters
  > Rigid (the left and right delimiters and the order between them are unique)
> HLRT (Head-Left-Right-Tail)
  > Two additional elements (e.g. <ul> and </ul>)
> NHLRT (Nested HLRT)
  > Less rigid approach (conditional rules)
Kushmerick, N. Finite-state approaches to Web information extraction. In Proc. 3rd Summer Convention on Information Extraction, Rome, Italy, 2002.
Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence J. 118(1-2):15-68, special issue on Intelligent Internet Systems, 2000.
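To make the LR and HLRT wrappers above concrete, here is a minimal Python sketch. It is not Kushmerick's or Lingway's code: the function names `lr_extract` and `hlrt_extract` and the sample page are hypothetical, and HLRT is simplified to "run LR only between the first head and the following tail".

```python
from typing import List, Optional, Tuple


def lr_extract(page: str, delims: List[Tuple[str, str]],
               start: int = 0, end: Optional[int] = None) -> List[Tuple[str, ...]]:
    """LR wrapper: scan for <l1,r1>, ..., <lk,rk> in this fixed order, repeatedly."""
    if end is None:
        end = len(page)
    rows: List[Tuple[str, ...]] = []
    pos = start
    while True:
        row = []
        for left, right in delims:
            i = page.find(left, pos, end)
            if i == -1:
                return rows              # no further left delimiter: stop
            value_start = i + len(left)
            j = page.find(right, value_start, end)
            if j == -1:
                return rows              # unterminated value: stop
            row.append(page[value_start:j])
            pos = j + len(right)
        rows.append(tuple(row))


def hlrt_extract(page: str, head: str, tail: str,
                 delims: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Simplified HLRT: apply the LR wrapper only between head and tail."""
    h = page.find(head)
    if h == -1:
        return []
    start = h + len(head)
    t = page.find(tail, start)
    return lr_extract(page, delims, start, t if t != -1 else len(page))


# Hypothetical page: the first and last <b> elements are not part of the list.
PAGE = ("<b>Jobs</b><ul><li><b>Engineer</b> <i>Paris</i></li>"
        "<li><b>Analyst</b> <i>Lyon</i></li></ul><b>Contact</b> <i>n/a</i>")
DELIMS = [("<b>", "</b>"), ("<i>", "</i>")]
```

On this page the plain LR wrapper also picks up the surrounding `<b>` elements, while the head and tail (`<ul>`, `</ul>`) restrict extraction to the list itself, which is the rigidity the HLRT elements are meant to relieve.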
>Crossmarc architecture constraints and Lingway goals█
> division of the process into NERC and FE
> multilingualism of the FE
> semi-automatic approach
> reuse of XTIRP, whose formalism accepts disjunction, missing and repeated elements, and free order
> => the result is a much more flexible formalism than that of the original NHLRT
>Lingway adaptation of the NHLRT approach█
> Named entities (NE) are already recognised
> Aspects of the problem are:
  > to detect where a fact starts (the Head),
  > to detect where it ends (the Tail),
  > to select the relevant NE (= to drop non-relevant NE),
  > to produce the fact (the tuple) proper.
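The four aspects listed above can be sketched as one pass over a sequence of already-recognised elements. This is only an illustration under assumptions: the `Element` record, its `relevant` flag and the `build_facts` function are hypothetical and are not part of LFE or XTIRP.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    kind: str               # "HEAD", "TAIL" or "NE" (assumed encoding)
    ne_type: str = ""       # e.g. "JOB_TITLE"; empty for HEAD/TAIL
    text: str = ""
    relevant: bool = True   # False means the NE was dropped


def build_facts(elements: List[Element]) -> List[Dict[str, str]]:
    """Group relevant NE found between a Head and a Tail into one tuple per fact."""
    facts: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None   # None: outside any fact
    for el in elements:
        if el.kind == "HEAD":
            if current:                        # a new Head closes an open fact
                facts.append(current)
            current = {}
        elif el.kind == "TAIL":
            if current:
                facts.append(current)
            current = None
        elif el.kind == "NE" and current is not None and el.relevant:
            current.setdefault(el.ne_type, el.text)   # keep the first value per type
    if current:                                # input ended without a Tail
        facts.append(current)
    return facts


# Hypothetical element sequence for two job offers.
ELEMENTS = [
    Element("NE", "DURATION", "3 mois"),               # before any Head: ignored
    Element("HEAD"),
    Element("NE", "JOB_TITLE", "Ingénieur"),
    Element("NE", "SCHEDULE", "temps plein"),
    Element("NE", "DURATION", "6 mois", relevant=False),
    Element("TAIL"),
    Element("HEAD"),
    Element("NE", "JOB_TITLE", "Analyste"),
]
```

Treating a new Head as an implicit Tail is a design choice of this sketch, so that a missing Tail does not merge two facts.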
>LFE - WHISK█
> Relatively close to WHISK, which can be seen as an extension of Kushmerick's systems using regular expressions, including disjunctions
> But LFE differs: it does not use semantic or linguistic categories, and its concrete algorithm and general philosophy are different.
>Head and Tail█
> Recognition of Heads and Tails (technique similar to NERC)
> E.g. "Poste:" ("Position:"), "Intitulé:" ("Title:")
> (Important role of JOB_TITLE)
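As a rough illustration of Head detection with trigger strings such as "Poste:" and "Intitulé:", the sketch below uses a plain regular expression. This is an assumption made for the example: the slides only say the technique is similar to NERC, and the names `HEAD_TRIGGERS` and `find_heads` are hypothetical.

```python
import re
from typing import List

# Trigger words taken from the slide; a real system would likely use a richer lexicon.
HEAD_TRIGGERS = ("Poste", "Intitulé")
HEAD_RE = re.compile(r"\b(?:%s)\s*:" % "|".join(re.escape(t) for t in HEAD_TRIGGERS))


def find_heads(text: str) -> List[int]:
    """Return the character offsets at which a Head trigger starts."""
    return [m.start() for m in HEAD_RE.finditer(text)]


TEXT = "Poste: Ingénieur. Vous avez une formation BAC + 4. Intitulé : Analyste."
```

The `\s*` before the colon tolerates the space that French typography often puts before ":".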
>Fact Extraction█
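> Example: "Poste: Ingénieur. Vous avez une formation BAC + 4 et avez plus de 2 ans d'expérience. […]" ("Position: Engineer. You have a BAC + 4 degree and more than 2 years of experience. […]")
> [Figure: the example annotated with the labels Head, Left-cont, Right-cont, NE (named entity), Left-int-elm and Right-int-elm]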
>Selecting / dropping elements█
> Relations between elements
  > E.g. association between SCHEDULE and DURATION
  > In this case, a DURATION without SCHEDULE could be marked as NONFACT ("dropped"), etc.
> Testing of the context (previous and next NE)
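The SCHEDULE/DURATION association above can be sketched as a context test on the neighbouring NE. The exact condition is an assumption made for illustration: here a DURATION is kept only when the NE immediately before or after it is a SCHEDULE. The function name `mark_nonfacts` is hypothetical.

```python
from typing import List


def mark_nonfacts(ne_types: List[str]) -> List[str]:
    """Label each NE 'FACT' or 'NONFACT' by testing its previous and next NE."""
    marks = []
    for i, ne_type in enumerate(ne_types):
        prev_type = ne_types[i - 1] if i > 0 else None
        next_type = ne_types[i + 1] if i + 1 < len(ne_types) else None
        # Assumed rule: a DURATION with no adjacent SCHEDULE is dropped.
        if ne_type == "DURATION" and "SCHEDULE" not in (prev_type, next_type):
            marks.append("NONFACT")
        else:
            marks.append("FACT")
    return marks


SEQUENCE = ["JOB_TITLE", "SCHEDULE", "DURATION", "JOB_TITLE", "DURATION"]
```

In `SEQUENCE`, the first DURATION follows a SCHEDULE and is kept, while the last one has no SCHEDULE next to it and is marked NONFACT.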
>Extension of XTIRP formalism█
> NEXT, PREV
  > These operators test conditions on the previous and next elements (NE, NUMEX and TIMEX), including their types and attributes
> COUNTER
  > Just a trivial counter (from 0 to ...)
> CURRENT_MARK
  > This operator tests whether the current element is embedded in a given mark (notably for testing emphasised fonts)
> (generalisation) DYNAMIC VARIABLES
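The XTIRP syntax for these operators is not shown in the slides, so the following Python sketch only mimics their described behaviour. Everything in it is hypothetical: the `Elem` and `Cursor` classes, the encoding of marks as a tuple of tag names, and the choice to store COUNTER in a dictionary of dynamic variables.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Elem:
    kind: str                      # "NE", "NUMEX" or "TIMEX"
    type: str                      # e.g. "JOB_TITLE"
    marks: Tuple[str, ...] = ()    # enclosing marks, e.g. ("b",) (assumed encoding)


class Cursor:
    """Position in the element sequence plus a table of dynamic variables."""

    def __init__(self, elems: List[Elem]) -> None:
        self.elems = elems
        self.i = 0
        self.variables: Dict[str, int] = {"COUNTER": 0}   # COUNTER starts at 0

    def current(self) -> Elem:
        return self.elems[self.i]

    def prev_elem(self) -> Optional[Elem]:                # PREV
        return self.elems[self.i - 1] if self.i > 0 else None

    def next_elem(self) -> Optional[Elem]:                # NEXT
        return self.elems[self.i + 1] if self.i + 1 < len(self.elems) else None

    def current_mark(self, mark: str) -> bool:            # CURRENT_MARK
        return mark in self.current().marks


def count_marked(elems: List[Elem], ne_type: str, mark: str) -> int:
    """Count elements of a given type that are embedded in a given mark."""
    cur = Cursor(elems)
    while cur.i < len(elems):
        if cur.current().type == ne_type and cur.current_mark(mark):
            cur.variables["COUNTER"] += 1
        cur.i += 1
    return cur.variables["COUNTER"]
```

For example, `count_marked(elems, "JOB_TITLE", "b")` counts the job titles that appear inside a `b` mark, the kind of test suggested by the note on emphasised fonts.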
>Extension implementation█
> NEXT, PREV: ongoing development
> COUNTER: implemented
> CURRENT_MARK: ongoing
> (generalisation) DYNAMIC VARIABLES: implemented
>Production of facts proper█
> Once the NE are marked as belonging to a fact (fact#1, etc.) or as being "nonfacts":
> a simple XSLT program extracts the facts into the corresponding XML output format
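The slides use XSLT for this step. The sketch below shows the same grouping idea in Python instead, using the standard `xml.etree.ElementTree` module. The input markup is an assumption: the `NE` element name and the `type` and `fact` attributes are hypothetical, since the actual Crossmarc formats are not shown here.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def extract_facts(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Group marked NE by fact identifier, skipping unmarked NE and nonfacts."""
    root = ET.fromstring(xml_text)
    facts: Dict[str, Dict[str, str]] = {}
    for ne in root.iter("NE"):
        fact_id = ne.get("fact")
        if fact_id is None or fact_id == "nonfact":
            continue
        facts.setdefault(fact_id, {})[ne.get("type", "")] = (ne.text or "").strip()
    return facts


# Hypothetical marked-up input.
SAMPLE = """<doc>Poste: <NE type="JOB_TITLE" fact="fact#1">Ingénieur</NE>.
<NE type="SCHEDULE" fact="fact#1">temps plein</NE>
<NE type="DURATION" fact="nonfact">6 mois</NE>
<NE type="JOB_TITLE" fact="fact#2">Analyste</NE></doc>"""
```

Each key of the returned dictionary corresponds to one fact, ready to be serialised in whatever output format is required.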
>Calendar█
> First complete version: 21 July
> Evaluation: end of July