Information Extraction
• Extract meaningful information from text
• Without fully understanding everything!
• Basic idea:
  – Define domain-specific templates
  – Simple and reliable linguistic processing
  – Recognize known types of entities and relations
  – Fill templates with recognized information
Example
4 Apr. Dallas – Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported.

Event: tornado
Date: 4/3/97
Time: 19:15
Location: “northwest Dallas” : Texas : USA
Damage: “mobile homes” (2)
        “Texaco station” (1)
Injuries: none
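The tornado example above can be sketched as a tiny pattern-based template filler. The regexes and slot names below are illustrative, not taken from any real MUC system:

```python
import re

# Illustrative sketch: domain-specific patterns fill template slots
# without full text understanding.
TEXT = ("4 Apr. Dallas - Early last evening, a tornado swept through "
        "northwest Dallas. The twister occurred without warning at about "
        "7:15 pm and destroyed two mobile homes.")

template = {"Event": None, "Time": None, "Location": None, "Damage": None}

if re.search(r"\btornado\b", TEXT):
    template["Event"] = "tornado"

m = re.search(r"at about (\d{1,2}):(\d{2}) pm", TEXT)
if m:
    template["Time"] = f"{int(m.group(1)) + 12}:{m.group(2)}"  # 7:15 pm -> 19:15

m = re.search(r"swept through ([\w ]+?)\.", TEXT)
if m:
    template["Location"] = m.group(1)

m = re.search(r"destroyed ([\w ]+?)(?: and|\.)", TEXT)
if m:
    template["Damage"] = m.group(1)

print(template)
# {'Event': 'tornado', 'Time': '19:15',
#  'Location': 'northwest Dallas', 'Damage': 'two mobile homes'}
```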
4 Apr. Dallas – Early last evening, a tornado swept through northwest....

Event: tornado
Date: 4/3/97
Time: 19:15
Location: “northwest Dallas” : Texas : USA
...
Tokenization & Tagging
Early/ADV last/ADJ evening/NN:time ,/,
a/DT tornado/NN:weather swept/VBD ...
Sentence Analysis
Early last evening: adv-phrase:time
a tornado: noun-group:subject
swept: verb-group
...
Pattern Extraction
tornado swept: Event: tornado
through northwest Dallas: Loc: “northwest Dallas”
causing extensive damage: Damage
Merging
Early last evening, a tornado swept through northwest Dallas.
The twister occurred without warning at about ....
Template Generation
MUC: Message Understanding Conference
• “Competitive” conference with predefined tasks for research groups to address
• Tasks (MUC-7):
  – Named Entities: Extract typed entities from text
  – Equivalence Classes: Solving coreference
  – Attributes: Fill in attributes of entities
  – Facts: Extract logical relations between entities
  – Events: Extract descriptions of events from text
Tokenization & Tagging
• Tokenization & POS tagging
• Also lexical semantic information, such as “time”, “location”, “weather”, “person”, etc.
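A minimal sketch of this stage, producing the `word/POS:semantics` notation used in the example below; the tiny lexicons are illustrative stand-ins for real resources:

```python
# Illustrative tokenizer/tagger: POS tags plus lexical-semantic labels
# such as "time" and "weather". Both lexicons are toy stand-ins.
POS = {"early": "ADV", "last": "ADJ", "evening": "NN", ",": ",",
       "a": "DT", "tornado": "NN", "swept": "VBD"}
SEM = {"evening": "time", "tornado": "weather"}

def tag(sentence):
    tokens = sentence.replace(",", " , ").split()
    tagged = []
    for tok in tokens:
        pos = POS.get(tok.lower(), "NN")           # default to NN
        sem = SEM.get(tok.lower())
        tagged.append(f"{tok}/{pos}" + (f":{sem}" if sem else ""))
    return tagged

print(tag("Early last evening, a tornado swept"))
# ['Early/ADV', 'last/ADJ', 'evening/NN:time', ',/,',
#  'a/DT', 'tornado/NN:weather', 'swept/VBD']
```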
Sentence Analysis
• Shallow parsing for phrase types
• Use tagging & semantics to tag phrases
• Note phrase heads
Pattern Extraction
• Find domain-specific relations between text units
• Typically use lexical triggers and relation-specific patterns to recognize relations
Concept: Damaged-Object
Trigger: destroyed
Position: direct-object
Constraints: physical-thing

... and [ destroyed ] [ two mobile homes ]
Damaged-Object = “two mobile homes”
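Applying such a pattern can be sketched as follows; the flat list of `(role, phrase, semantic_class)` tuples is an illustrative stand-in for real shallow-parser output:

```python
# Illustrative sketch of trigger-based pattern extraction:
# a pattern fires when its trigger word appears and a phrase in the
# required syntactic position satisfies the semantic constraint.
pattern = {"concept": "Damaged-Object", "trigger": "destroyed",
           "position": "direct-object", "constraints": "physical-thing"}

parse = [("verb", "destroyed", None),
         ("direct-object", "two mobile homes", "physical-thing")]

def extract(pattern, parse):
    verbs = {phrase for role, phrase, _ in parse if role == "verb"}
    if pattern["trigger"] not in verbs:
        return None
    for role, phrase, sem in parse:
        if role == pattern["position"] and sem == pattern["constraints"]:
            return {pattern["concept"]: phrase}
    return None

print(extract(pattern, parse))
# {'Damaged-Object': 'two mobile homes'}
```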
Learning Extraction Patterns
• Very difficult to predefine extraction patterns
• Must be redone for each new domain
• Hence, corpus-based approaches are indicated
• Some methods:
  – AutoSlog (1992) – “syntactic” learning
  – PALKA (1995) – “conceptual” learning
  – CRYSTAL (1995) – covering algorithm
AutoSlog (Lehnert 1992)
• Patterns based on recognizing “concepts”
  – Concept: what concept to recognize
  – Trigger: a word indicating an occurrence
  – Position: what syntactic role the concept will take in the sentence
  – Constraints: what type of entity to allow
  – Enabling conditions: constraints on the linguistic context
• Concept: Event-Time
• Trigger: “at”
• Position: prep-phrase-object
• Constraints: time
• Enabling conditions: post-verb
The twister occurred without warning at about 7:15 pm and destroyed two mobile homes.
Event-Time = 19:15
Learning Patterns
• Supervised: training data is text annotated with the patterns to be extracted from it
• Knowledge: 13 general syntactic patterns
• Algorithm:
  – Find sentence with target noun phrase (“two mobile homes”)
  – Partial parsing of sentence: find syntactic relations
  – Try all linguistic patterns to find match
  – Generate concept pattern from match
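The last step can be sketched as below. Only one of AutoSlog's general syntactic patterns is shown, and the dictionary-based parse representation is an illustrative simplification:

```python
# Illustrative sketch of AutoSlog-style pattern generation: given a
# target noun phrase and a partial parse of its sentence, match one
# general syntactic pattern and emit a concept pattern.
def autoslog(target, parse, concept, sem_class):
    # general pattern: active-voice verb followed by target = direct object
    if parse.get("direct-object") == target and parse.get("voice") == "active":
        return {"Concept": concept,
                "Trigger": parse["verb"],
                "Position": "direct-object",
                "Constraints": sem_class,
                "Enabling conditions": "active-voice"}
    return None

parse = {"verb": "destroyed", "voice": "active",
         "direct-object": "two mobile homes"}
print(autoslog("two mobile homes", parse, "Damaged-Object", "physical-thing"))
# {'Concept': 'Damaged-Object', 'Trigger': 'destroyed', ...}
```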
Linguistic Patterns
• Identify domain-specific thematic roles based on syntactic structure

active-voice-verb followed by target = direct object

Concept = target concept
Trigger = verb of active-voice-verb
Position = direct-object
Constraints = semantic-class of target
Enabling conditions = active-voice
More Examples
– victim was murdered
– perpetrator bombed
– perpetrator attempted to kill
– was aimed at target
• Some bad extraction patterns occur (e.g., “is” as a trigger)
• Human review process
CRYSTAL
• Complex syntactic patterns
• Use “covering” algorithm:
  – Generate most specific possible patterns for all occurrences of targets in corpus
  – Loop:
    • Find most specific unifier of the most similar patterns C & C’, generating new pattern P
    • If P has less than ε error on corpus, replace C and C’ with P
    • Continue until no new patterns can be added
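A simplified sketch of this covering loop is below. Patterns are dicts of slot constraints, the "most specific unifier" just drops slots on which two patterns disagree, any mergeable pair is taken (rather than the most similar), and the error function is a stub; all names are illustrative:

```python
# Simplified sketch of CRYSTAL's covering algorithm.
def unify(p1, p2):
    # most specific generalization: keep only slots both patterns share
    return {k: v for k, v in p1.items() if p2.get(k) == v}

def cover(patterns, error, eps=0.1):
    patterns = list(patterns)
    while True:
        best = None
        for i in range(len(patterns)):
            for j in range(i + 1, len(patterns)):
                p = unify(patterns[i], patterns[j])
                if p and error(p) < eps:      # accept P if error below epsilon
                    best = (i, j, p)
        if best is None:                      # no new patterns can be added
            return patterns
        i, j, p = best                        # replace C and C' with P
        patterns = [q for k, q in enumerate(patterns) if k not in (i, j)] + [p]

seeds = [{"trigger": "destroyed", "obj-class": "building", "subj": "tornado"},
         {"trigger": "destroyed", "obj-class": "building", "subj": "fire"}]
print(cover(seeds, error=lambda p: 0.0))
# the two seeds merge into one generalized pattern
```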
Merging
Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers...
Coreference Resolution
• Many different kinds of linguistic phenomena:– Proper names,
– Aliases (MVI),
– Definite NPs (the Big 10 auto maker),
– Pronouns (it, they),
– Appositives (, the first company to ...)
• Errors of previous phases may be amplified
Learning to Merge
• Treat coreference as a classification task
  – Should this pair of entities be linked?
• Methodology:
  – Training corpus: manually link all coreferential expressions
  – Each possible pair is a training example; if they are linked it is positive, if not, it is negative
  – Create a feature vector for each example
  – Use your favorite learning algorithm
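The example-generation step above can be sketched as follows; the mentions, gold chains, and two toy features are illustrative:

```python
from itertools import combinations

# Illustrative sketch: turn annotated coreference chains into pairwise
# training examples. A pair is positive iff both mentions belong to the
# same gold chain.
mentions = ["MVI Corp.", "MVI", "the auto maker", "John Smith", "he"]
chains = [{0, 1, 2}, {3, 4}]          # gold coreference chains (by index)

def features(a, b):
    # two toy features standing in for a real feature vector
    return {"string-overlap": mentions[a].split()[0] == mentions[b].split()[0],
            "distance": abs(a - b)}

examples = []
for a, b in combinations(range(len(mentions)), 2):
    label = any(a in c and b in c for c in chains)
    examples.append((features(a, b), label))

print(sum(1 for _, y in examples if y), "positive of", len(examples))
# 4 positive of 10
```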
MLR (1995)
• 66 features were used, in 4 categories:
  – Lexical features of each phrase, e.g., do they overlap?
  – Grammatical role of each phrase, e.g., subject, direct-object
  – Semantic classes of each phrase, e.g., physical-thing, company
  – Relative positions of the phrases, e.g., X one sentence after Y
• Decision-tree learning (C4.5)
C4.5
• Incrementally build decision-tree from labeled training examples
• At each stage choose “best” attribute to split dataset
  – E.g., use info-gain to compare features
• After building complete tree, prune the leaves to prevent overfitting
  – Use statistical tests to determine if enough examples are in leaf bins; if not – prune!
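The info-gain criterion used to pick the splitting attribute can be sketched as below; the tiny dataset is illustrative:

```python
import math

# Illustrative sketch of the info-gain splitting criterion:
# gain = entropy(labels) - weighted entropy after splitting on a feature.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def info_gain(examples, feature):
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

data = [({"alias": True}, "coref"), ({"alias": True}, "coref"),
        ({"alias": False}, "not"), ({"alias": False}, "coref")]
print(round(info_gain(data, "alias"), 3))   # 0.311
```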
C4.5
[Figure: example decision tree – 40 training examples split at the root on feature f1 into subsets of 15 and 25; further splits on f2 and f3 yield leaves of 7, 18, 2, and 13 examples, labeled C1, C2, C2, and C1.]
RESOLVE (1995)
• C4.5 with 8 complex features:
  – NAME-{1,2}: does reference include a name?
  – JV-CHILD-{1,2}: does reference refer to part of a joint venture?
  – ALIAS: does one reference contain an alias for the other?
  – BOTH-JV-CHILD: do both refer to part of a joint venture?
  – COMMON-NP: do both contain a common NP?
  – SAME-SENTENCE: are both in the same sentence?
Decision Tree
[Figure: learned decision tree – the root tests COMMON-NP?; further tests on ALIAS, BOTH-JV-CHILD, NAME-2, JV-CHILD-2, and SAME-SENTENCE lead to leaves labeled COREFERENCE or NOT-COREF.]
RESOLVE Results
• 50 texts, leave-1-out cross-validation:
System      Recall   Precision
Unpruned    85.4%    87.6%
Pruned      80.1%    92.4%
Manual      67.7%    94.4%
Full System: FASTUS (1996)

Input Text → Pattern Recognition → Partial Templates → Coreference Resolution → Template Merger → Output Template
Pattern Recognition
• Multiple passes of finite-state methods

John Smith, 47, was named president of ABC Corp.
[Figure: successive finite-state passes label the tokens Pers-Name, Num, Aux, V, N, P, and Org-Name, group them into a V-Group and a Poss-N-Group, and finally recognize a Domain-Event.]
Partially-Instantiated Templates

Person: _______          Person: John Smith
Pos:    President        Pos:    President
Org:    ABC Corp.        Org:    ABC Corp.
Start:                   Start:
End:                     End:

Domain-Dependent!!
The Next Sentence...
He replaces Mike Jones.

Person: Mike Jones       Person: John Smith
Pos:    ________         Pos:    ________
Org:    ________         Org:    ________
Start:                   Start:
End:                     End:

Coreference analysis: He = John Smith
Unification
Unify new template with preceding template(s), if possible...

Person: Mike Jones       Person: John Smith
Pos:    President        Pos:    President
Org:    ABC Corp.        Org:    ABC Corp.
Start:                   Start:
End:                     End:
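Template unification can be sketched as below: two partial templates unify when no filled slot conflicts, and the merged template takes each slot's filled value. The slot names are illustrative:

```python
# Illustrative sketch of template merging by unification.
def unify_templates(t1, t2):
    merged = {}
    for slot in t1.keys() | t2.keys():
        a, b = t1.get(slot), t2.get(slot)
        if a and b and a != b:
            return None            # conflicting filled slots: cannot merge
        merged[slot] = a or b      # prefer whichever value is filled
    return merged

t1 = {"Person": "John Smith", "Pos": "President", "Org": "ABC Corp."}
t2 = {"Person": "John Smith", "Pos": None, "Org": None}
print(unify_templates(t1, t2))
# unifies: all three slots filled from t1

print(unify_templates({"Person": "Mike Jones"}, {"Person": "John Smith"}))
# None: conflicting Person slots
```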
Principle of Least Commitment
• Idea: Maintain options as long as possible
• E.g.: parsing – maintain a lattice structure:

The committee heads announced that...
[Figure: lattice over the sentence – The/DT committee/NN1 heads/NN2|VBZ announced/VBD that/CSub; the N-GRP and Event readings coexist, yielding Event: Announce, Actor: Committee heads.]
Principle of Least Commitment
• Idea: Maintain options as long as possible
• E.g.: parsing – maintain a lattice structure:

The committee heads ABC’s recruitment effort.
[Figure: lattice over the sentence – The/DT committee/NN1 heads/NN2|VBZ ABC’s/NNpos recruitment/NN; here the verb reading of “heads” wins, yielding an N-GRP and an Event with Head: Committee, Effort: ABC’s recruitment.]
More Least Commitment
• Maintain multiple coreference hypotheses:
  – Disambiguate when creating domain-events
  – More information available
• Too many possibilities?
  – Use beam search algorithm: maintain k ‘best’ hypotheses at every stage
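Beam search over hypotheses can be sketched as below; the coreference decisions and the scoring function are toy stand-ins:

```python
import heapq

# Illustrative sketch of beam search: at each step extend every
# hypothesis with every candidate decision, score the results, and
# keep only the k best hypotheses.
def beam_search(initial, steps, extend, score, k=3):
    beam = [initial]
    for step in steps:
        candidates = [h + [choice] for h in beam for choice in extend(step)]
        beam = heapq.nlargest(k, candidates, key=score)
    return beam

# toy example: choose an antecedent (or NEW entity) for each pronoun;
# the scorer slightly prefers NEW so that scores differ
steps = ["he", "it"]
extend = lambda mention: ["John Smith", "MVI", "NEW"]
score = lambda hyp: sum(1.0 if c == "NEW" else 0.5 for c in hyp)

best = beam_search([], steps, extend, score, k=2)
print(best[0])   # highest-scoring hypothesis: ['NEW', 'NEW']
```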