J. Turmo, 2006 Adaptive Information Extraction
Information Extraction
Jordi Turmo
TALP Research Centre
Dep. Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
[email protected]
http://www.lsi.upc.edu/~turmo
Summary
• Information Extraction Systems
  • Introduction
  • Historical framework
  • Architecture
  • Knowledge specific for IE
  • Examples
• Evaluation
• Multilinguality
• Adaptability
Introduction
Definition
• Goal: locating and extracting, in a specific format, the relevant information contained in a collection of documents
• Input requirements: scenario of extraction and document collection
• Output requirements: output format
Typology
• Different points of view:
  − conceptual coverage: restricted-domain IE vs. open-domain IE
  − language coverage: monolingual IE vs. multilingual IE
  − media coverage: written-text IE, speech IE, image IE, multimedia IE
  − document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)
Example 1: Structured documents
• Web pages
• A list of members of an organization per document
• English
• Scenario of extraction: name, degree, school and affiliation of the member
Name       Degree  School   Affiliation
WL Hsu     PhD     Cornell  IIS, Sinica
CS Ho      PhD     NTU      EE,NTIT
C.Chen     PhD     SUNY     EE,NTIT
C.Wu       PhD     Utexas   Cedu,NNU
Mark Liao  PhD     NWU      IIS, Sinica
CJ Liau    PhD     NTU      IIS, Sinica
WK Cheng   PhD     TKU      Tunghai
WC Wang    MS      Syracus  FIT
...
Example 2: Semi-structured documents
• 485 seminar announcements
• A description of one seminar per document
• English
• Scenario of extraction: speaker, location, start time and end time of the seminar
Example 3: Free text
• 318 Wall Street Journal articles
• A description of an incident per document
• English
• Scenario of extraction: type of incident, perpetrator, target, date, location, effects and instrument
A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650.

Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
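The scenario of extraction above can be read as a record type whose slots the system tries to fill. A minimal sketch in Python follows; the class name `IncidentTemplate`, the field names and the helper `filled_slots` are illustrative assumptions, not part of any MUC specification.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class IncidentTemplate:
    """One output record for the incident scenario (slot names follow the slide)."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    physical_target: Optional[str] = None
    human_target: Optional[str] = None
    effect_on_physical_target: Optional[str] = None
    effect_on_human_target: Optional[str] = None
    instrument: Optional[str] = None

    def filled_slots(self) -> Dict[str, str]:
        """Return only the slots that were actually filled."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# The filled template from the slide; the human target stays empty ("-").
example = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="El Salvador: San Salvador (city)",
    perpetrator="urban guerrilla commandos",
    physical_target="power tower",
    effect_on_physical_target="destroyed",
    effect_on_human_target="no injury or death",
    instrument="bomb",
)
```

Keeping unfilled slots as `None` makes it easy to distinguish "nothing extracted" from an extracted value.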
Example 4: Free text
• 78 documents
• A description of one mushroom per document
• Spanish
• Scenario of extraction: colors of parts of mushrooms and the circumstances in which they occur
El color blanco de su sombrero pasa a amarillo crema al corte. El sombrero ennegrece si se corta.
(English gloss: The white colour of its cap turns cream yellow when cut. The cap blackens if it is cut.)

Sombrero_1
  color:
Sombrero_2
  color:
virar_1
  inicio:
  final:
  causa: corte
virar_2
  inicio: indef
  final:
  causa: corte
color_1
  base: blanco
  tono: indef
  luz: indef
color_2
  base: amarillo
  tono: crema
  luz: indef
color_3
  base: indef
  tono: negro
  luz: indef
Example 5: Combination
• 78 documents
• A description of one mushroom per document
• Spanish
• Scenario of extraction: names of the mushroom in different languages, etymology, colors of parts of mushrooms and the circumstances in which they occur
Applications
• IE from the Web
• Building of news DBs
• Information Integration
• Support for QA and Summarization
• …
Limitation when P < 80%
References
• D.E. Appelt, D.J. Israel, 1999
• E. Hovy, 1999
• R.J. Mooney, C. Cardie, 1999
• Muslea, 1999
• J. Cowie, Y. Wilks, 2000
• M.T. Pazienza, 2003
• Turmo, 2003
• Turmo et al. 2005
Recent events
• IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)
• ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)
• AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)
• EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
• COLING-ACL 06 Workshop on Information Extraction Beyond the Document
• ECAI 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
Historical framework
Origin of IE
• Acquisition of the relevant information involved in knowledge-based systems
• Traditionally (high human cost): [Diagram: experts on the domain, manual process, relevant information]
• 1980s (text sources): [Diagram: text-based intelligent systems, relevant information]
• Text-Based Intelligent Systems (TBIS)
  − Information Retrieval
  − Information Integration
  − Information Filtering
  − Information Routing
  − Information Extraction
  − Document Classification
  − Question Answering
  − Automatic Summarization
  − Topic Detection & Tracking
  ...
Relevant historical programs
• Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)
• In the USA
  − (1987-1991): MUC [US Navy]
  − TIPSTER (1991-1998): MUC [DARPA]
  − TIDES (1999-): ACE [NIST]
• In Europe
  − LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE
  − PASCAL excellence network (2003-)
MUC evolution
• MUC-1 (1987)
  – naval operations
  – auto-definition of scenarios
  – auto-evaluation
• MUC-2 (1989)
  – naval operations
  – output structure with 10 attributes (type of event, agent, place, ...)
  – auto-evaluation
• MUC-3 (1991)
  – Latin-American terrorism
  – output structure with 18 attributes (type of incident, date, place, ...)
  – recall and precision measures
[Figure: overlapping sets "extracted" and "relevant", divided into regions a to f; region f is the partially extracted information]
extracted = a + b + e + f
relevant = a + f + d
recall = (a + 0.5 f) / (a + f + d)
precision = (a + 0.5 f) / (a + f + b + e)
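As a worked illustration of the MUC-3 measures, the sketch below computes recall and precision from region counts, giving half credit to partial extractions. The function name `muc3_scores` and the sample counts are made up for the example.

```python
from typing import Tuple


def muc3_scores(a: float, b: float, d: float, e: float, f: float) -> Tuple[float, float]:
    """Recall and precision with half credit for partially extracted items (region f)."""
    extracted = a + b + e + f   # everything the system produced
    relevant = a + f + d        # everything it should have produced
    credit = a + 0.5 * f        # partial extractions count one half
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
recall, precision = muc3_scores(a=6, b=1, d=2, e=1, f=2)
```

With these counts the credit is 6 + 0.5 × 2 = 7, and both the relevant and the extracted totals are 10.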
• MUC-4 (1992)
  – Latin-American terrorism
  – 24 attributes
  – F-score (weighted harmonic mean of precision p and recall r):
    $F = \frac{(\beta^2 + 1)\, p\, r}{\beta^2 p + r}$
• MUC-5 (1993)
  – financial news, microelectronics
  – English, Japanese
• MUC-6 (1995)
  – financial news
  – subtasks: NE, coreference
  – tasks: TE (template element), ST (scenario template)
• MUC-7 (1998)
  – air crashes
  – new task: TR (template relation)
• MUC-6, MUC-7
  – Partial extractions are discarded
[Figure: overlapping sets "extracted" and "relevant", divided into regions a, b, c and d]
extracted = a + b
relevant = a + d
recall = a / (a + d)
precision = a / (a + b)
$F = \frac{(\beta^2 + 1)\, p\, r}{\beta^2 p + r}$, with p = precision and r = recall
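The strict MUC-6/MUC-7 scoring and the F-score can be sketched in a few lines. The function names and the sample counts are illustrative assumptions. In this formulation, β > 1 gives more weight to recall and β < 1 to precision.

```python
from typing import Tuple


def strict_scores(a: float, b: float, d: float) -> Tuple[float, float]:
    """Recall and precision when partial extractions are discarded."""
    recall = a / (a + d) if (a + d) else 0.0
    precision = a / (a + b) if (a + b) else 0.0
    return recall, precision


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * p * r / (beta^2 * p + r)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# Hypothetical counts: 6 correct, 2 spurious, 4 missed.
r, p = strict_scores(a=6, b=2, d=4)
f1 = f_measure(p, r)            # balanced F-score
f2 = f_measure(p, r, beta=2.0)  # recall weighted more heavily
```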
Architecture
General architecture
• Hobbs, 93:
  – Cascade of transducers (or modules) that add structure to text and, often, drop irrelevant information by applying rules
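The cascade idea can be illustrated with plain function composition: each stage takes the document, adds structure or drops material, and passes the result on. This is a toy sketch; the stage names, the dictionary layout and the keyword filter are assumptions made for the example, not Hobbs's design.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Stage = Callable[[Doc], Doc]


def split_sentences(doc: Doc) -> Doc:
    """Add structure: a list of sentences (naive split on full stops)."""
    sentences = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": sentences}


def drop_irrelevant(doc: Doc) -> Doc:
    """Drop sentences that mention none of the scenario keywords."""
    kept = [s for s in doc["sentences"]
            if any(k in s.lower() for k in doc["keywords"])]
    return {**doc, "sentences": kept}


def tokenize(doc: Doc) -> Doc:
    """Add structure: whitespace tokens for each remaining sentence."""
    return {**doc, "tokens": [s.split() for s in doc["sentences"]]}


def run_cascade(doc: Doc, stages: List[Stage]) -> Doc:
    """Apply the stages in order, feeding each output to the next stage."""
    return reduce(lambda current, stage: stage(current), stages, doc)


result = run_cascade(
    {"text": "A bomb went off near a power tower. The weather was mild.",
     "keywords": ["bomb"]},
    [split_sentences, drop_irrelevant, tokenize],
)
```

Because every stage has the same signature, modules can be reordered or replaced without touching the others.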
Traditional architecture
[Diagram, built up over three slides]
• Document preprocessing: text control, lexical analysis, syntactic analysis
• Pattern matching
• Postprocess: discourse analysis, output template generation
• Resources: conceptual hierarchy, pattern base, output format
Text control
• Filtering relevant documents
• Guessing the language of the documents
• Splitting documents into textual zones
• Filtering relevant zones
• Splitting text into appropriate units (e.g. sentences)
• Filtering relevant units
• Tokenizing units
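Two of the steps listed above, splitting text into units and tokenizing them, can be sketched with regular expressions. The rules below are deliberately naive assumptions (a unit ends at ".", "!" or "?" followed by whitespace); document filtering and language guessing are not covered.

```python
import re
from typing import List


def split_units(text: str) -> List[str]:
    """Split text into sentence-like units after '.', '!' or '?'."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]


def tokenize(unit: str) -> List[str]:
    """Separate words (keeping internal hyphens and apostrophes) from punctuation."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", unit)


def mark_unit(unit: str) -> str:
    """Render a tokenized unit in a '<tok tok ...>' style."""
    return "<" + " ".join(tokenize(unit)) + ">"


text = ("Con la edad generalmente palidece sus tonos. "
        "Puede confundirse con otras foliotas comestibles, pero alguna especie es amarga.")
units = split_units(text)
marked = [mark_unit(u) for u in units]
```

A real splitter would also need to handle abbreviations and numbers, which this pattern would cut incorrectly.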
• Example
<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>
…
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>
Lexical analysis
• Identifying morpho-syntactic categories and semantic categories of words (resource: general lexicon)
• Recognizing terminology words (resource: specific dictionaries)
• Recognizing time expressions, quantities, abbreviations, …
• Expanding abbreviations (resource: lists of abbreviations and their expansions)
• Recognizing and classifying proper nouns (Named Entity Recognition and Classification, NERC) (resources: gazetteers, patterns)
• Dealing with unknown words
• Dealing with lexical ambiguities (resources: POS taggers, WSD (???))
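Two of the resources named above, abbreviation lists and gazetteers, lend themselves to a small lookup sketch. The dictionary contents and function names are hypothetical; the gazetteer lookup uses a greedy longest-match strategy, which is one common choice rather than a prescribed one.

```python
from typing import Dict, List, Tuple

# Hypothetical resources, invented for the example.
ABBREVIATIONS: Dict[str, str] = {"cm": "centimetres"}
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("San", "Salvador"): "location",
    ("El", "Salvador"): "location",
}


def expand_abbreviations(tokens: List[str]) -> List[str]:
    """Replace each known abbreviation by its expansion."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def tag_named_entities(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Greedy longest-match gazetteer lookup; returns (start, end, label) spans."""
    max_len = max(len(key) for key in GAZETTEER)
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


tokens = "a power tower in San Salvador".split()
entities = tag_named_entities(tokens)
```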
• Example 1
<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>
…
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>
[Highlighted in the slide: time expressions, mushroom names, abbreviations, numbers, morphological parts. What is recognized depends on the scenario.]
• Example 2
[Highlighted in the slide: time expressions, locations, organizations, persons, …]
<A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .><According to unofficial sources , the bomb-allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>
Syntactic analysis
• Full parsing (LOLITA, LaSIE, LaSIE-II)
  – inefficient; size of the grammars
  – lack of robustness (out-of-vocabulary words)
  – treebank grammars
  – cascaded grammars
• Solves some problems related to tuning and incompleteness
• Partial parsing
  − the most commonly used
  − chunks or phrasal trees (noun phrases, verbal phrases, prep phrases, adj phrases, adv phrases)
  − absence of global dependencies
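To make the notion of chunks concrete, here is a toy noun-phrase chunker over POS-tagged tokens. The chunk rule (optional determiner, any adjectives, one or more nouns) and the tag names are simplifying assumptions; the output is flat, so no dependencies between chunks are recorded.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]          # (word, POS tag)
Chunk = Tuple[str, List[str]]     # (chunk label, words)


def chunk_noun_phrases(tagged: List[Tagged]) -> List[Chunk]:
    """Group runs of det? adj* n+ into NP chunks; other tokens pass through."""
    chunks: List[Chunk] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "det":
            j += 1
        while j < n and tagged[j][1] == "adj":
            j += 1
        k = j
        while k < n and tagged[k][1] == "n":
            k += 1
        if k > j:  # at least one noun: emit an NP covering tokens i..k-1
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:      # no noun follows: emit the current token on its own
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


tagged_sentence = [
    ("the", "det"), ("bomb", "n"), ("blew_up", "v"),
    ("a", "det"), ("power_tower", "n"), ("in", "prep"),
    ("the", "det"), ("northwestern", "adj"), ("part", "n"),
]
chunks = chunk_noun_phrases(tagged_sentence)
```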
Semantic interpretation
• Compositional semantics
  − full parsing + λ-expressions
    − LaSIE, LaSIE-II
    − entries with λ-expressions in the lexicons
  − partial parsing + grammatical relations [Vilain, 99]
  − output = logical forms
• Compositional semantics (example 1)
A bomb went off this morning near a power tower in San Salvador …
[Figure: parse tree over the sentence with nodes np, pp, vp and s]
go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))
power_tower → λ(x) (power_tower(x))
Result: λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
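The λ-expressions of this example can be mimicked with curried Python functions, which makes the order of argument application explicit. The tuple encoding of predicates is an assumption made for the sketch; it is not how LaSIE represents logical forms.

```python
# Lexicon entries as curried functions (predicates encoded as tuples).
# go_off -> λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) bombing(x, y, z, r, s, t)
go_off = lambda t: lambda s: lambda r: lambda z: lambda y: lambda x: (
    "bombing", x, y, z, r, s, t)
# power_tower -> λ(x) power_tower(x)
power_tower = lambda x: ("power_tower", x)

# Apply the arguments found in the sentence: place, then time, then instrument.
partial = go_off(power_tower("San_Salvador"))("today_morning")("bomb")

# 'partial' still expects z, y and x, like the result shown on the slide:
# λ(z) λ(y) λ(x) bombing(x, y, z, bomb, today_morning, power_tower(San_Salvador))
unbound = partial(None)(None)(None)
```

The three remaining arguments are the ones later sentences or inferences have to supply.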
• Compositional semantics (example 2)
A bomb went off this morning near a power tower in San Salvador …
[Figure: grammatical relations subj, time, place and location_of drawn over the sentence]
event(bombing, E)
subj(bomb, E)
time(today_morning, E)
place(power_tower, E)
location_of(power_tower, San_Salvador)
• Pattern matching
  − after partial parsing + SVO dependencies
  − the most widespread
  − patterns can be implemented in different ways
  − scenario-driven approach (TE, TR, ST, …)
  − output = partial templates
• Pattern matching (example)
A bomb went off this morning near a power tower in San Salvador …
np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location) →
  INSTRUMENT := C-instrument
  DATE := C-time
  PHIS_TARGET := C-place
  LOCATION := C-location
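The rule above can be approximated by a small matcher over a chunked sentence. The chunk encoding, the `PATTERN` table and the `match` function are assumptions for illustration. As a simplification, the sketch allows a gap before every pattern element, whereas the rule only marks gaps with "…".

```python
from typing import Dict, List, Optional, Tuple

Chunk = Tuple[str, str, str]  # (category, text, semantic class)

# Each element: (category, required text or None, required class or None, slot or None).
PATTERN = [
    ("np", None, "instrument", "INSTRUMENT"),
    ("vp", "go_off", None, None),
    ("np", None, "time", "DATE"),
    ("word", "near", None, None),
    ("np", None, "place", "PHIS_TARGET"),
    ("word", "in", None, None),
    ("np", None, "location", "LOCATION"),
]


def match(chunks: List[Chunk], pattern=PATTERN) -> Optional[Dict[str, str]]:
    """Scan left to right; return the filled partial template, or None on failure."""
    template: Dict[str, str] = {}
    pos = 0
    for cat, text, sem, slot in pattern:
        while pos < len(chunks):
            c_cat, c_text, c_sem = chunks[pos]
            pos += 1
            if c_cat == cat and text in (None, c_text) and sem in (None, c_sem):
                if slot:
                    template[slot] = c_text
                break
        else:
            return None  # ran out of chunks before matching this element
    return template


sentence = [
    ("np", "a bomb", "instrument"), ("vp", "go_off", ""),
    ("np", "this morning", "time"), ("word", "near", ""),
    ("np", "a power tower", "place"), ("word", "in", ""),
    ("np", "San Salvador", "location"),
]
partial_template = match(sentence)
```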
Discourse analysis
• Inter-sentence analysis
  − Co-reference resolution
  − Ellipsis resolution
  − Alias resolution
  − Traditional semantic interpretation procedures
  − Template merging procedures
• Inference procedures
  − Open-domain and domain-specific knowledge for inferences
• Example
A bomb went off this morning near a power tower in San Salvador …, but no casualties have been reported
λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650
λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))
• Example (continued)
λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))
Unification & inference:
λ(y) (bombing(urban_guerrilla_comandos,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
Inference (blew_up → destroyed):
bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
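The merging step of this example can be sketched as slot-wise unification of two partial argument tuples, followed by one inference rule. The `COMPATIBLE` table stands in for background knowledge (0650 falls within "this morning"; the northwestern part lies in San Salvador) and is purely hypothetical, as is the choice to keep the first template's value when two values are compatible.

```python
from typing import Optional, Tuple

Args = Tuple[Optional[str], ...]  # (perpetrator, phys. effect, human effect, instrument, time, target)

# Hypothetical background knowledge: pairs of values that may describe the same thing.
COMPATIBLE = {
    ("today_morning", "0650"),
    ("power_tower(San_Salvador)",
     "power_tower(the_northwestern_part_of_San_Salvador)"),
}
# Hypothetical inference table: verb -> effect on the physical target.
INFERENCES = {"blew_up": "destroyed"}


def unify_slot(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Merge two slot values; None is an unbound variable."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    if (a, b) in COMPATIBLE or (b, a) in COMPATIBLE:
        return a  # assumption: keep the first template's value
    raise ValueError(f"cannot unify {a!r} with {b!r}")


def merge(t1: Args, t2: Args) -> Args:
    """Unify two partial templates slot by slot."""
    return tuple(unify_slot(a, b) for a, b in zip(t1, t2))


def infer_effect(template: Args, verb: str) -> Args:
    """Fill the physical-effect slot (index 1) from the verb, if still unbound."""
    effect = INFERENCES.get(verb)
    if template[1] is None and effect is not None:
        return template[:1] + (effect,) + template[2:]
    return template


first = (None, None, "no_casualties", "bomb", "today_morning",
         "power_tower(San_Salvador)")
second = ("urban_guerrilla_comandos", None, None, "bomb", "0650",
          "power_tower(the_northwestern_part_of_San_Salvador)")
merged = merge(first, second)
final = infer_effect(merged, "blew_up")
```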
Output template generation
• Mapping of the extracted pieces onto the desired output format
• Specific inferences:
  − Normalization to predefined values of slots
  − Mandatory slots
  − Extracted information that implies different slot values
• Example
bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
Today_morning → March_19
No_casualties = no_injuries_or_death
Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
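A sketch of this last step: internal values are normalized to the predefined slot values, and every slot of the output format is written, with "-" for unfilled ones. The mapping table `VALUE_MAP`, the `document_date` parameter and the function names are assumptions for illustration.

```python
from typing import Dict, Optional

SLOT_ORDER = [
    "Incident type", "Date", "Location", "Perpetrator", "Physical target",
    "Human target", "Effect on physical target", "Effect on human target",
    "Instrument",
]

# Hypothetical normalization table: internal value -> predefined slot value.
VALUE_MAP = {
    "no_casualties": "no injury or death",
    "urban_guerrilla_comandos": "urban guerrilla commandos",
    "power_tower": "power tower",
    "San_Salvador": "El Salvador: San Salvador (city)",
}


def normalize(value: str, document_date: str) -> str:
    """Map an internal value onto its output form; relative dates use the document date."""
    if value == "today_morning":
        return document_date
    return VALUE_MAP.get(value, value)


def write_template(extracted: Dict[str, Optional[str]], document_date: str) -> str:
    """Emit every slot in the required order, with '-' for unfilled slots."""
    lines = []
    for slot in SLOT_ORDER:
        raw = extracted.get(slot)
        value = "-" if raw is None else normalize(raw, document_date)
        lines.append(f"{slot}: {value}")
    return "\n".join(lines)


output = write_template(
    {
        "Incident type": "bombing",
        "Date": "today_morning",
        "Location": "San_Salvador",
        "Perpetrator": "urban_guerrilla_comandos",
        "Physical target": "power_tower",
        "Effect on physical target": "destroyed",
        "Effect on human target": "no_casualties",
        "Instrument": "bomb",
    },
    document_date="March 19",
)
```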
Knowledge specific for IE
Characteristics of IE systems
• Strong dependence on the domain
  − Scenario of extraction
  − Semantics vs. syntax
  − Discourse analysis
• Strong dependence on the text structure
  − Sublanguages
  − Meta-information
• Strong dependence on the output format
  − DBs
  − annotations
• Importance of portability and tuning
• Importance of knowledge engineering
  − Modularity
  − Basic tasks and specific tasks
  − Use of weak and local knowledge
• Importance of NL resources
  − MRDs, ontologies, general lexicons, specific dictionaries, …
Knowledge resources
• More or less stable knowledge
  − general lexicon
  − general grammar
  − basic NL processors: segmenters, taggers, parsers, …
• Domain-dependent knowledge
  − domain-specific vocabularies, terminology
  − gazetteers and patterns for NERC
  − IE patterns (knowledge specifically used for IE)
Types of IE patterns
• Viewpoint 1: type of representation
  − rules
    np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location) →
      Event:INSTRUMENT := C-instrument
      Event:DATE := C-time
      Event:PHIS_TARGET := C-place
      Event:LOCATION := C-location
• Viewpoint 1: type of representation
  − statistical models (BNs, HMMs, ME, hyperplanes, …)
[Figure: HMM whose states emit words. State word lists: {who, speaker, 5409, appointment}; {with, about, how, …}; {seminar, reminder, theater, …}; {that, by, speaker, …}; {dr., professor, robert, michael, mr}; {w, cavalier, stevens, christel}; {will, (, received, has, …}. Transition probabilities shown: 1.0, 1.0, 0.99, 0.76, 0.24, 0.99, 0.56]
• Viewpoint 2: type of values extracted
  − slot filler extraction patterns (the HMM presented before)
  − event extraction patterns (the rule presented before)
  − relation extraction patterns
    np(C-person) … vp(is) pron(C-his) “wife” →
      Married_with:HUSBAND := C-his
      Married_with:WIFE := C-person
• Viewpoint 3: number of slot fillers extracted
  − single-slot IE patterns (the HMM presented before)
  − multi-slot IE patterns (both rules presented before)
Examples of IE systems
Methodologies [Turmo, 2002]

System     Reference
LaSIE      Gaizauskas et al., 1995
LaSIE-II   Humphreys et al., 1998
LOLITA     Garigliano et al., 1998
CIRCUS     Lehnert et al., 1991
FASTUS     Hobbs et al., 1993
BADGER     Fisher et al., 1995
HASTEN     Krupka, 1995
PROTEUS    Grishman, 1995
ALEMBIC    Aberdeen et al., 1993
PIE        Lin, 1995
TURBIO     Turmo, 2002
PLUM       Weischedel et al., 1995
IE2        Aone et al., 1998
LOUELLA    Childs et al., 1995
SIFT       Miller et al., 1998

[Table columns Parsing, Semantics and Discourse: the assignment of cells to systems is not recoverable here. Values that appear: in-depth understanding; chunking; partial parsing; syntactico-semantic parsing; pattern matching; semantic interpretation; grammatical relations interpretation; interpretation procedures; template merging; -]
Knowledge [Turmo, 2002]
Systems: LaSIE, LaSIE-II, LOLITA, CIRCUS, FASTUS, BADGER, HASTEN, PROTEUS, ALEMBIC, TURBIO, PIE, PLUM, IE2, LOUELLA, SIFT
[Table columns Parsing, Semantics and Discourse: the assignment of cells to systems is not recoverable here. Values that appear: treebank grammar; λ-expressions; hand-crafted stratified general grammar; general grammar; semantic network; concept nodes (AutoSlog); concept nodes (CRYSTAL); hand-crafted IE rules; decision trees; phrasal grammar; E-graphs; IE rules (ExDISCO); hand-crafted grammatical relations; IE rules (EVIUS); hand-crafted rules; statistical models for syntactic-semantic parsing and coreference resolution, learned from the PTB and on-domain annotated texts]
LaSIE-II system
[Diagram of modules: sentence splitter, gazetteer lookup, Brill tagger, tagged morph, Buchart parser, name matcher, discourse interpreter, template writer. Resources: lexicon, gazetteers, stratified grammar, conceptual hierarchy. Outputs: TE, TR, ST]
Preprocessing
• NERC preprocessing via gazetteers and keyword lists
• Root form and inflectional suffix for verbs, nouns and adjectives found in sentences
According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern-adj part-n of-prep San Salvador-loc at-prep 0650
Syntactico-semantic interpretation
• bottom-up chart parser
• cascade of NERC grammars (e.g. aircraft, person, money, time, timex)
According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time
[In the slide, NE1 marks the location entity and NE2 the time entity]
• cascade of partial grammars (NPs, PPs, complex NPs, VPs, complex VPs, RelClauses, Sentence)
S(According_to-adv NP(unofficial-adj source[s]-n) , NP(the-det bomb-n) – allegedly-adv VP(detonate[ed]-v) PP(by-prep NP(urban-adj guerrilla-n commando[s]-n)) - VP(blow_up-v) NP(a-det power_tower-n) PP(in-prep NP(the-det NE1-loc)) PP(at-prep NP(NE2-time)))
• QLFs (note: the real implementation of QLFs is not specified)
Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)
Discourse analysis
• Name matcher: matches variants of NEs across the text
• Discourse interpreter:
  • adds the QLF representation to a semantic net (links)
  • adds presuppositions
  • coreference resolution
Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)
[Figure: semantic net fragment with nodes "bombing event", "destroy" and "location of event", connected by "isa" and "implies" links]
Output template generation
• procedure that writes the templates in the desired format
Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
PROTEUS system
[Diagram of modules: lexical analyzer, NERC, partial parsing, scenario patterns, coreference resolution, discourse analysis, output generator. Resources: lexicon, NERC rules, chunk grammar, IE rules, conceptual hierarchy, inference rules, format rules. Outputs: TE, TR, ST]
Preprocessing
According_to-adv unofficial-adj sources-n , the-det bomb-n – allegedly-adv detonated-v by-prep urban-adj guerrilla-n commandos-n - blew_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time
[In the slide, NE1 marks the location entity and NE2 the time entity]
J. Turmo, 2006 Adaptive Information Extraction
Examples of IE systems
PROTEUS systemPROTEUS system
Sintactico-semantic interpretation• basic VP and NP chunks+head_semantics• semantics refer to types of slot fillers (Conceptual hierarchy)
According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)
NERC Partial parsing
ScenarioPatterns
Coreferenceresolution
DiscourseAnalysis
Output generator
Lexicon NERC Rules
Lexical Analizer
TE TR ST
Chunk grammar IE-Rules Format
RulesConceptual hierarchy
Inference Rules
J. Turmo, 2006 Adaptive Information Extraction
Examples of IE systems
PROTEUS systemPROTEUS system
Sintactico-semantic interpretation• basic VP and NP chunks+head_semantics• IE-rules for relations (appositions, PP-attachments, limited conjunctions)
• NP(A-person) , B-integer years old , → instance(X,person), name_of(X,A), age_of(X,B)• NP(A-position) of NP(B-company) → instance(X,person), position_of(X,A), company_of(X,B)
NERC Partial parsing
ScenarioPatterns
Coreferenceresolution
DiscourseAnalysis
Output generator
Lexicon NERC Rules
Lexical Analizer
TE TR ST
Chunk grammar IE-Rules Format
RulesConceptual hierarchy
Inference Rules
Bage
Aname
personClass
ValueSlot
Real implementation as objects
J. Turmo, 2006 Adaptive Information Extraction
Examples of IE systems
PROTEUS systemPROTEUS system
Sintactico-semantic interpretation• basic VP and NP chunks+head_semantics• IE-rules for relations (appositions, PP-attachments, limited conjunctions)• IE-rules for events (PET interface or ExDISCO)
• NP(A-artifact) v-s4 NP(B-building) → instance(E1,s4), instrument_of(E1,A), phisical_target_of(E1,B)
According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)
Discourse analysis
• antecedents are sought in sequential order
• constraints:
  • instance of a hyperclass
  • same number
  • share arguments
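The three constraints above can be sketched as a filter over candidate antecedents. This is a toy interpretation, not the PROTEUS algorithm: the hierarchy, the entity dictionaries, the most-recent-first search order, and the reading of "share arguments" as "no conflicting argument values" are all assumptions.

```python
# Hypothetical toy conceptual hierarchy: class -> parent class.
HIERARCHY = {"company": "organization", "organization": "entity", "person": "entity"}

def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = HIERARCHY.get(cls)
    return False

def find_antecedent(anaphor, previous_entities):
    """Scan earlier entities, most recent first (assumed order), and return
    the first one that satisfies all three constraints, or None."""
    for cand in reversed(previous_entities):
        class_ok = is_a(cand["cls"], anaphor["cls"])       # anaphor class is a hyperclass
        number_ok = cand["number"] == anaphor["number"]    # same number
        args_ok = all(cand["args"].get(k, v) == v          # no conflicting arguments
                      for k, v in anaphor["args"].items())
        if class_ok and number_ok and args_ok:
            return cand
    return None

entities = [
    {"id": 1, "cls": "company", "number": "sg", "args": {"name": "Cuban Cigar Corp."}},
    {"id": 2, "cls": "person", "number": "sg", "args": {"name": "Fred"}},
]
# Hypothetical anaphor such as "the organization"
anaphor = {"cls": "organization", "number": "sg", "args": {}}
match = find_antecedent(anaphor, entities)
```

Here the person is rejected by the class constraint, so the search continues back to the company.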
Discourse analysis
• QLFs + inference rules = more complex QLFs
  • conversion of date expressions
  • inference of slot values from the QLFs already obtained
  • inference of events from others explicitly described
Example: "Fred, the president of Cuban Cigar Corp., was appointed vice president of Microsoft" implies "Fred left the Cuban Cigar Corp."
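The inference in the example above (an appointment elsewhere implies a departure) can be sketched as a rule over extracted facts. The tuple encoding and the `infer_departures` function are hypothetical; the actual PROTEUS inference rules operate on QLFs and are not shown in the slides.

```python
def infer_departures(facts):
    """facts: set of (kind, person, position, company) tuples, where kind is
    "holds" (current position) or "appointed" (new position).
    Returns inferred ("left", person, company) events."""
    inferred = set()
    for kind, person, _position, new_company in facts:
        if kind != "appointed":
            continue
        for kind2, person2, _position2, old_company in facts:
            # Same person held a post at a different company: infer a departure.
            if kind2 == "holds" and person2 == person and old_company != new_company:
                inferred.add(("left", person, old_company))
    return inferred

facts = {
    ("holds", "Fred", "president", "Cuban Cigar Corp."),
    ("appointed", "Fred", "vice president", "Microsoft"),
}
events = infer_departures(facts)
```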
Output template generation
• use of rules to build the templates with the desired format
IE2 system
[Architecture diagram. Modules: NetOwl Extractor 3.0, CustomNameTag, PhraseTag, EventTag, Discourse Module, TempGen. Tasks: TE, TR, ST. Labels: hand-crafted rules, decision tree.]
Preprocessing
• only NERC
• SGML-tagged
• general NE types and subtypes
• restricted-domain NE types and subtypes
<person id=1>Jeff Bantle</person>, <entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight
Syntactico-semantic interpretation
• SGML-tagging of phrases that are values of slots
• NPs denoting persons (PNP), organizations (ENP), artifacts (ANP), …
• local links (location-of, employee-of, owner-of, …)
<person id=1>Jeff Bantle</person>, <PNP affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>
Syntactico-semantic interpretation
• SGML-tagging of phrases that are values of slots in templates
• NPs
• local semantic relations (employee-of, location-of, product-of, …)
• event IE-rules (note: the real implementation is not specified)
  • $Vehicle + LaunchN → launch_event::vehicle_info := $Vehicle
<launch_event id=2 vehicle_info=1><ANP> The <vehicle id=1>Arian 5</vehicle> launch </ANP> was successfully achieved at 6am
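Since the slides note that the real implementation of the event rules is not specified, the following is only one possible way to realise the `$Vehicle + LaunchN` rule: a regular expression over the SGML-tagged text. The pattern, the `tag_launch_events` function, the sequential event ids, and the output dictionary layout are all assumptions.

```python
import re

# A tagged vehicle immediately followed by the noun "launch" (the LaunchN of the rule).
VEHICLE_LAUNCH = re.compile(r"<vehicle id=(\d+)>([^<]+)</vehicle>\s+launch\b")

def tag_launch_events(sgml_text):
    """Return one launch_event record per match, with vehicle_info
    pointing at the id of the tagged vehicle."""
    events = []
    for n, m in enumerate(VEHICLE_LAUNCH.finditer(sgml_text), start=1):
        events.append({
            "type": "launch_event",
            "id": n,                           # hypothetical sequential event id
            "vehicle_info": int(m.group(1)),   # id of the vehicle entity
            "vehicle_name": m.group(2),
        })
    return events

text = "<ANP> The <vehicle id=1>Arian 5</vehicle> launch </ANP> was successfully achieved at 6am"
events = tag_launch_events(text)
```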
Discourse analysis
• Three coreference resolution methods
  • Rule based
  • Machine learning based
  • Hybrid
• Name alias resolution in addition to that performed by NetOwl
• Definite NPs
• Singular personal pronouns
<person id=1>Jeff Bantle</person>, <PNP ref=1 affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>
Output template generation
• Translates SGML output into templates in the desired format
• Resolves and normalizes time expressions
• Performs event merging
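The slides do not say how IE2 merges events. As a rough sketch of the general idea, two partial event records can be merged when they have the same type and no slot holds conflicting values. The `merge_events` function and the record layout are hypothetical.

```python
def merge_events(events):
    """Greedily fold each event into the first earlier event of the same
    type whose filled slots do not conflict with it (assumed criterion)."""
    merged = []
    for ev in events:
        for target in merged:
            same_type = target["type"] == ev["type"]
            no_conflict = all(target.get(k) in (None, v) for k, v in ev.items())
            if same_type and no_conflict:
                target.update(ev)      # fill in the slots the target was missing
                break
        else:
            merged.append(dict(ev))    # nothing compatible: keep as a new event
    return merged

partial = [
    {"type": "launch_event", "vehicle_info": 1},
    {"type": "launch_event", "time": "06:00"},     # e.g. "6am" after normalization
    {"type": "launch_event", "vehicle_info": 2},   # conflicts on vehicle_info
]
result = merge_events(partial)
```

A greedy first-match policy is the simplest choice; a real system would likely also compare time and discourse distance before merging.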
SIFT system
[Architecture diagram. Modules: IdentiFinder™, Sentence level, Cross-sentence level, Output generator. Resources: statistical models. Tasks: TE, TR.]
Preprocessing
• NERC using an HMM [Bikel et al. 97] + Viterbi, maximizing Pr(W,F,C)
• each word is tagged with one NE class
[HMM state diagram. States: start-sentence, person, organization, location, not-a-name, end-sentence.]
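Viterbi decoding over the name-class states can be sketched as follows. This is a plain first-order HMM with made-up probabilities, which is a simplification: it ignores the word features F and the richer conditioning used in the model cited above. All parameter values and the `viterbi` function are assumptions for illustration.

```python
import math

STATES = ["person", "organization", "location", "not-a-name"]

def viterbi(words, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for `words`."""
    def lp(p):
        return math.log(max(p, floor))   # floor stands in for smoothing of unseen events

    # best[t][s] = (log-probability of the best path ending in s at t, backpointer)
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(words[0], 0)), None)
             for s in STATES}]
    for w in words[1:]:
        prev, column = best[-1], {}
        for s in STATES:
            score, back = max((prev[r][0] + lp(trans_p[r].get(s, 0)), r) for r in STATES)
            column[s] = (score + lp(emit_p[s].get(w, 0)), back)
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy parameters: uniform start and transitions, sharply peaked emissions.
start_p = {s: 0.25 for s in STATES}
trans_p = {r: {s: 0.25 for s in STATES} for r in STATES}
emit_p = {"person": {"Jeff": 0.5, "Bantle": 0.5}, "organization": {"NASA": 1.0},
          "location": {}, "not-a-name": {"of": 1.0}}
tags = viterbi(["Jeff", "Bantle", "of", "NASA"], start_p, trans_p, emit_p)
```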
Syntactico-semantic interpretation
• properties of NEs (TE) and relations (TR)
• generative statistical model [Miller et al. 98, 00]
• search for the most likely augmented parse tree (bottom-up, chart-based)
• pruning of low-probability constituents
Syntactico-semantic interpretation
Example: Nance , a paid consultant to ABC News , …
[Augmented parse tree. Leaf tags: per/nnp , det vbn per-desc/nn to org’/nnp org/nnp , Internal nodes: per-r/np, per-desc/np, org-r/np, org-ptr/pp, emp-of/pp-lnk, per-desc-r/np, per/np.]
Syntactico-semantic interpretation
• relations between NEs across sentences
• statistical model [Miller et al. 98]
• classifier of pairs of entities
  • entities in different sentences
  • entities do not take part in local relations
  • their types are compatible with any relation
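The three conditions above define which entity pairs are passed to the cross-sentence classifier. A sketch of that candidate selection, under assumptions: the relation signatures, the entity records, and the reading of the second condition as "neither entity is an argument of a sentence-level relation" are all hypothetical, and the classifier itself is not shown.

```python
from itertools import combinations

# Hypothetical relation signatures: relation name -> (type of arg1, type of arg2).
RELATION_TYPES = {
    "employee_of": ("person", "organization"),
    "location_of": ("organization", "location"),
}

def candidate_pairs(entities, local_relations):
    """entities: dicts with "id", "type", "sentence".
    local_relations: (id1, id2) pairs already found within a sentence."""
    linked = {eid for pair in local_relations for eid in pair}
    allowed = set(RELATION_TYPES.values())
    pairs = []
    for a, b in combinations(entities, 2):
        if a["sentence"] == b["sentence"]:
            continue                                   # must be in different sentences
        if a["id"] in linked or b["id"] in linked:
            continue                                   # already in a local relation
        if (a["type"], b["type"]) in allowed or (b["type"], a["type"]) in allowed:
            pairs.append((a["id"], b["id"]))           # types fit some relation
    return pairs

entities = [
    {"id": 1, "type": "person", "sentence": 0},
    {"id": 2, "type": "organization", "sentence": 0},
    {"id": 3, "type": "organization", "sentence": 1},
    {"id": 4, "type": "location", "sentence": 1},
    {"id": 5, "type": "person", "sentence": 2},
]
pairs = candidate_pairs(entities, local_relations=[(1, 2)])
```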
TURBIO system
[Architecture diagram. Modules: Lexical Analyzer, NERC, Partial parsing, IE-Rule set processor, controller, Output generator. Resources: Lexicon, NERC Rules, Partial-tree grammar, IE-Rule sets, IE-rule set scheduling. Tasks: TE, TR.]
Preprocessing
• WordNet synsets, lemmas, POS tags
• NERC
• parse trees of noun, verbal, and adjectival phrases
Syntactico-semantic interpretation
• Hypothesis: dependence among relations of NEs
• Iterative execution of IE-rule sets depending on the scheduling
• Example:
  • Scenario = mushroom parts, their possible colors, and the circumstances under which they are produced
  • There are colors in the documents that are not related to any mushroom part, but all colors related to a circumstance are colors related to mushroom parts.
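The mushroom example suggests why the order of rule sets matters: once a color is known to be tied to a circumstance, it can be trusted as the color of a part. A toy sketch of scheduled execution follows. The fact encoding, both rule sets, and `run_schedule` are invented for illustration and are not TURBIO's actual rules.

```python
def color_circumstance_rules(facts):
    """Rule set 1: a color mentioned under a circumstance."""
    return {("color_of_circumstance", color, circ)
            for kind, color, circ in facts if kind == "when"}

def part_color_rules(facts):
    """Rule set 2: accept a color next to a phrase only if rule set 1
    already tied that color to a circumstance."""
    trusted = {color for kind, color, _ in facts if kind == "color_of_circumstance"}
    return {("color_of_part", phrase, color)
            for kind, color, phrase in facts if kind == "near" and color in trusted}

def run_schedule(schedule, facts):
    """Run the rule sets in the scheduled order; each one sees what the
    earlier ones extracted."""
    facts = set(facts)
    for rule_set in schedule:
        facts |= rule_set(facts)
    return facts

# Toy observations as (kind, color, other) triples.
observations = {
    ("near", "white", "cap"),       # color mention next to a mushroom part
    ("when", "white", "bruised"),   # same color under a circumstance
    ("near", "green", "moss"),      # color unrelated to any mushroom part
}
good = run_schedule([color_circumstance_rules, part_color_rules], observations)
bad = run_schedule([part_color_rules, color_circumstance_rules], observations)
```

With the first schedule, `white` is linked to `cap` and `green` is ignored; with the reversed schedule, a single pass finds no part colors at all, which is the dependence the hypothesis refers to.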