2007. 11. 14. introduction information extraction (ie) a limited form of “complete text...

2007. 11. 14

Upload: elisabeth-morton

Post on 31-Dec-2015


TRANSCRIPT

Page 1: 2007. 11. 14. Introduction. Information Extraction (IE): a limited form of “complete text comprehension”; extracts entities and relationships from documents

2007. 11. 14


Introduction

Information Extraction (IE)
  A limited form of “complete text comprehension”
  Extracts entities and relationships from documents
  Relationship => fact, event
    Fact: static
    Event: dynamic

Document => Entity-relationship or frame………

Structured object


Schematic view of IE


Information Extraction

Simple IE system: term extraction

Complex IE system: frame generation


Data Elements of IE

Entities: basic building blocks
  Ex) people, locations, genes, and drugs

Attributes: features of extracted entities

Facts: relations that exist between entities
  Ex) an employment relationship between a person and a company, or phosphorylation between two proteins

Events: an activity or occurrence of interest in which entities participate
  Ex) a terrorist act, a merger between two companies, a birthday, and so on


Data Elements of IE


MUC IE Tasks

MUC: Message Understanding Conference
  Sponsored by DARPA (Defense Advanced Research Projects Agency)

MUC tasks
  Named Entity Recognition (NER)
  Template Element (TE) Task
  Template Relationship (TR) Task
  Scenario Template (ST)
  Coreference Task (CO)


Named Entity Recognition

NER: identify all mentions of proper names and quantities in the text
  People names, geographic locations, and organizations
  Dates and times
  Monetary amounts and percentages

Distribution of named-entity types in the MUC corpora
  Proper names: 70%
    Organization: 45~50%
    Location: 12~32%
    People: 23~39%
  Dates and times: 25%
  Monetary amounts and percentages: 5%


Template Element Task

TE: a generic object and its attributes
  Person
  Organization
  Location (airport, city, country, province, region, water, etc.)
  Artifact


Template Relationship (TR) Task

TR: find the relationships that exist between the template elements extracted from the text
  Ex) persons and companies can be related by an employee_of relation

Employee_of (Fletcher Maddox, UCSD Business School)

Employee_of (Fletcher Maddox, La Jolla Genomatics)

Product_of (Geninfo, La Jolla Genomatics)

Location_of (La Jolla, La Jolla Genomatics)

Location_of (CA, La Jolla Genomatics)
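One way to hold such relations once they are extracted is as simple (relation, argument, argument) triples that can be filtered. The sketch below is only illustrative: the `Relation` type and the `related` helper are hypothetical names, not part of any MUC specification.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A binary relation between two template elements (hypothetical type)."""
    name: str
    arg1: str
    arg2: str


# The relations listed on the slide, stored as plain triples.
RELATIONS = [
    Relation("Employee_of", "Fletcher Maddox", "UCSD Business School"),
    Relation("Employee_of", "Fletcher Maddox", "La Jolla Genomatics"),
    Relation("Product_of", "Geninfo", "La Jolla Genomatics"),
    Relation("Location_of", "La Jolla", "La Jolla Genomatics"),
    Relation("Location_of", "CA", "La Jolla Genomatics"),
]


def related(name, arg2, relations=RELATIONS):
    """Return every first argument linked to `arg2` by the relation `name`."""
    return [r.arg1 for r in relations if r.name == name and r.arg2 == arg2]
```

For example, `related("Location_of", "La Jolla Genomatics")` collects both locations recorded for the company.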


Scenario Template

ST: expresses domain- and task-specific entities and relations


Coreference Task (CO)

CO: captures information on coreferring expressions (e.g., pronouns or any other mentions of a given entity)

Ex) David came home from school, and saw his mother, Rachel. She told him that his father will be late.

Identified pronominal coreference
  (David, his, him, his)
  (mother, Rachel, she)


IE Examples


Architecture of IE Systems


Architecture of IE Systems

Tokenization module: splits an input document into its basic building blocks (words, sentences, and paragraphs)

Morphological and lexical analysis: assigns POS tags to the document's words, creates basic phrases (such as noun phrases and verb phrases), and disambiguates the sense of ambiguous words and phrases

Syntactic analysis: establishes the connections between the different parts of each sentence through full parsing or shallow parsing

Domain analysis: combines the information collected by the previous components and creates complete frames that describe the relationships between entities
  Can include ‘anaphora resolution’
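The first stage of this architecture can be sketched in a few lines. The function below covers only the tokenization module and uses deliberately naive rules (blank lines separate paragraphs, sentence-final punctuation ends a sentence); the name `tokenize_document` and these rules are assumptions made for illustration, and a real system would need to handle abbreviations, numbers, and similar cases.

```python
import re


def tokenize_document(text):
    """Split text into paragraphs -> sentences -> word tokens (naive rules)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    result = []
    for paragraph in paragraphs:
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # Words and punctuation marks become separate tokens.
        result.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
    return result


doc = "David came home. He saw Rachel.\n\nShe told him the news."
structure = tokenize_document(doc)
```

Here `structure` is a list of paragraphs, each a list of sentences, each a list of tokens, which later modules (tagging, parsing, domain analysis) could consume.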


Information Flow in IE System

Processing initial lexical content: tokenization and lexical analysis
Proper name identification
Shallow parsing
Building relations
Inferencing


Information Flow in IE System

Building relations
  Uses domain-specific patterns
  Ex) Company [Temporal] @Announce Connector Person PersonDetail @Appoint Position

Inferencing
  Infers missing values to complete the extracted information
  Ex) John Edgar was reported to live with Nancy Leroy. His address is 101 Forest Rd., Bethlehem, PA.

  Person(John Edgar)
  Person(Nancy Leroy)
  Livetogether(John Edgar, Nancy Leroy)
  Address(John Edgar, 101 Forest Rd., Bethlehem, PA)

  Address(P2, A) :- person(P1), person(P2), livetogether(P1, P2), address(P1, A)
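The inference rule above can be read as a small forward-chaining step: whenever two known persons live together and one of them has an address, the other receives the same address. The sketch below is one possible rendering under two assumptions that the slide does not state: cohabitation is treated as symmetric, and an inferred address only fills a missing value and never overwrites an extracted one. The function name `infer_addresses` is hypothetical.

```python
def infer_addresses(persons, live_together, addresses):
    """Apply Address(P2, A) :- person(P1), person(P2),
    livetogether(P1, P2), address(P1, A) until nothing new is derived."""
    inferred = dict(addresses)
    changed = True
    while changed:
        changed = False
        for p1, p2 in live_together:
            # Assumption: living together is symmetric, so try both directions.
            for src, dst in ((p1, p2), (p2, p1)):
                known = src in persons and dst in persons
                # Assumption: only fill in addresses that are still missing.
                if known and src in inferred and dst not in inferred:
                    inferred[dst] = inferred[src]
                    changed = True
    return inferred


persons = {"John Edgar", "Nancy Leroy"}
live_together = [("John Edgar", "Nancy Leroy")]
addresses = {"John Edgar": "101 Forest Rd., Bethlehem, PA"}
completed = infer_addresses(persons, live_together, addresses)
```

After the call, `completed` also holds an address for Nancy Leroy, the value that the extraction step could not read directly from the text.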


Anaphora Resolution

Anaphora (coreference) resolution: the process of matching pairs of NLP expressions that refer to the same entity in the real world

Two main approaches
  Knowledge-based approach: linguistic analysis of sentences
  Machine learning-based approach: needs an annotated corpus


Anaphora Resolution

Pronominal anaphora: reflexive/personal/possessive pronouns
Proper name coreference
Apposition
Predicate nominative
Identical sets
Function-value coreference
Ordinal anaphora
One-anaphora
Part-whole coreference


Approaches to Anaphora Resolution

Focus on pronominal resolution

Hobbs Algorithm
  Also called the ‘Naïve Algorithm’
  Constraints:

For two candidate antecedents a and b, if a is encountered before b in the search space, then a is preferred over b.

No two antecedents will have the same salience.


Approaches to Anaphora Resolution

CogNIAC: six ordered rules

Kennedy and Boguraev: salience algorithm

Mitkov: scoring algorithm, with indicators:
  Definiteness
  Givenness
  Indicating verbs
  Lexical reiteration
  Section heading preference
  “Non-prepositional” noun phrases
  Collocation pattern preference
  Immediate reference
  Referential distance
  Domain terminology preference


Approaches to Anaphora Resolution

Machine Learning Approaches

Markables: NLP elements such as nouns, noun phrases, or pronouns

Features for markables
  Sentence distance
  Pronouns
  Exact match
  Definite noun phrase
  Number agreement
  Semantic agreement
  Gender agreement
  Proper name alias
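As an illustration of how such features could be computed for a candidate (antecedent, anaphor) pair, the sketch below builds a small feature dictionary. It covers only a subset of the listed features, and everything in it is an assumption made for the example: the `Markable` fields, the pronoun list, and the simplification that number and gender are already supplied by an earlier analysis step.

```python
from dataclasses import dataclass

# Small illustrative pronoun list (an assumption, far from complete).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "his", "them", "their", "its"}


@dataclass
class Markable:
    text: str
    sentence: int   # index of the sentence that contains the markable
    number: str     # "sg" or "pl", assumed to come from earlier analysis
    gender: str     # "m", "f", "n", or "unknown"


def pair_features(antecedent: Markable, anaphor: Markable) -> dict:
    """Compute a few pairwise features for a coreference classifier."""
    a, b = antecedent.text.lower(), anaphor.text.lower()
    genders = (antecedent.gender, anaphor.gender)
    return {
        "sentence_distance": anaphor.sentence - antecedent.sentence,
        "anaphor_is_pronoun": b in PRONOUNS,
        "exact_match": a == b,
        "anaphor_is_definite_np": b.startswith("the "),
        "number_agreement": antecedent.number == anaphor.number,
        # Simplification: an unknown gender counts as no agreement.
        "gender_agreement": "unknown" not in genders and genders[0] == genders[1],
    }


rachel = Markable("Rachel", sentence=0, number="sg", gender="f")
she = Markable("She", sentence=1, number="sg", gender="f")
features = pair_features(rachel, she)
```

A learner would receive one such dictionary per candidate pair, together with a label saying whether the two markables corefer.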


Machine Learning Approaches: Generating Training Examples

Positive examples
  Suppose {M1, M2, M3, M4} all refer to the same real-world entity
  Positive examples: {M1, M2}, {M2, M3}, {M3, M4}

Negative examples
  Assume that markables a, b, c appear between M1 and M2
  Negative examples: {a, M2}, {b, M2}, {c, M2}
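The pairing scheme can be sketched as follows, assuming the common convention that each mention is paired positively with the closest preceding mention of the same entity and negatively with every markable lying between the two. The input format (a list of `(name, chain_id)` tuples in document order) and the function name `training_pairs` are hypothetical.

```python
def training_pairs(markables):
    """Build (antecedent, anaphor) training pairs.

    markables: list of (name, chain_id) in document order; chain_id is None
    for markables that belong to no coreference chain.
    """
    positives, negatives = [], []
    last_seen = {}  # chain_id -> index of the most recent member of that chain
    for j, (name, chain) in enumerate(markables):
        if chain is None:
            continue
        if chain in last_seen:
            i = last_seen[chain]
            # Adjacent members of the same chain form a positive example.
            positives.append((markables[i][0], name))
            # Every markable strictly between them forms a negative example
            # with the later mention.
            negatives.extend((markables[k][0], name) for k in range(i + 1, j))
        last_seen[chain] = j
    return positives, negatives


doc = [("M1", "e"), ("a", None), ("b", None), ("c", None),
       ("M2", "e"), ("M3", "e"), ("M4", "e")]
positives, negatives = training_pairs(doc)
```

With a, b, and c placed between M1 and M2, the negatives all pair an intervening markable with M2, while M3 and M4 contribute no negatives because nothing separates them from their antecedents.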


Machine Learning Approaches


Machine Learning Approaches: WHISK

Supervised learning algorithm that uses hand-tagged examples to learn information extraction rules in the form of regular expressions

Ex)
  Input:: * (Digit) ‘BR’ * ‘$’ (Number)
  Output:: Rental {Bedrooms $1} {Price $2}
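To see what such a rule does when applied, the example rule can be approximated with an ordinary Python regular expression. This is a hand translation for illustration, not WHISK itself (which learns its rules from tagged examples), and the rental ad and the `extract_rental` helper are made up for the sketch.

```python
import re

# "* (Digit) 'BR' * '$' (Number)" rendered as a Python regex (an approximation):
# skip anything, capture one digit before "BR", skip anything, capture the
# number after "$".
RENTAL_RULE = re.compile(r".*?(\d)\s*BR.*?\$\s*(\d+)", re.DOTALL)


def extract_rental(text):
    """Fill the Rental frame {Bedrooms $1} {Price $2}, or return None."""
    match = RENTAL_RULE.search(text)
    if match is None:
        return None
    return {"Bedrooms": match.group(1), "Price": match.group(2)}


ad = "Capitol Hill - 1 BR townhome, W/D, parking incl. $675."
rental = extract_rental(ad)
```

The two capture groups play the role of `$1` and `$2` in the rule's output template.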


Machine Learning Approaches: BWI (Boosted Wrapper Induction)


Machine Learning Approaches: BWI (Boosted Wrapper Induction)

“Boundary detectors” are pairs of token sequences <p, s>

A detector matches a boundary iff p matches the text before the boundary and s matches the text after the boundary

Detectors can contain wildcards, e.g. “capitalized word”, “number”, etc.

Example: <Date:, [CapitalizedWord]> matches the beginning of the date in

Date: Thursday, October 25
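The matching condition for a single boundary detector can be sketched directly from this definition. The tokenizer, the wildcard names, and the function names below are assumptions made for the example; the boosting step that learns and weights many detectors is not shown.

```python
import re


def tokenize(text):
    """Split into word and punctuation tokens (a naive assumption)."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_matches(pattern, token):
    """Match one detector element, which is a literal token or a wildcard."""
    if pattern == "[CapitalizedWord]":
        return token.isalpha() and token[:1].isupper()
    if pattern == "[Number]":
        return token.isdigit()
    return pattern == token


def matches_boundary(detector, tokens, i):
    """True if detector <p, s> matches the boundary just before tokens[i]."""
    prefix, suffix = detector
    if i < len(prefix) or i + len(suffix) > len(tokens):
        return False
    before = tokens[i - len(prefix):i]
    after = tokens[i:i + len(suffix)]
    return (all(map(token_matches, prefix, before))
            and all(map(token_matches, suffix, after)))


tokens = tokenize("Date: Thursday, October 25")
detector = (["Date", ":"], ["[CapitalizedWord]"])
starts = [i for i in range(len(tokens) + 1)
          if matches_boundary(detector, tokens, i)]
```

The only boundary found lies just before "Thursday", which is where the date field begins.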


Machine Learning Approaches: (LP)2 Algorithm

Induces two sets of rules

Tagging rules
  Ex) stime (start time of a seminar)

Correction rules
  Ex) “at <stime> 4 </stime> pm” => “at <stime> 4 pm </stime>”
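The effect of a correction rule such as the one in the example can be imitated with a single substitution that shifts a misplaced closing tag. The rule below is hand-written for illustration, whereas (LP)2 induces its correction rules from training data; the pattern and the name `correct_stime` are assumptions.

```python
import re

# Illustrative correction rule: when </stime> is immediately followed by
# "am" or "pm", move the closing tag to the right of that token.
SHIFT_END = re.compile(r"\s*</stime>\s*(am|pm)\b", re.IGNORECASE)


def correct_stime(text):
    """Shift a misplaced </stime> tag past a trailing am/pm marker."""
    return SHIFT_END.sub(r" \1 </stime>", text)


fixed = correct_stime("at <stime> 4 </stime> pm")
```

Text whose closing tag already follows the am/pm marker is left unchanged, since the pattern only fires when the marker comes after the tag.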


Evaluation of IE systems

Slot         BWI     HMM     (LP)2   WHISK
Speaker      67.7%   76.6%   77.6%   18.3%
Location     76.7%   78.6%   75.0%   66.4%
Start Time   99.6%   98.5%   99.0%   92.6%
End Time     93.9%   62.1%   95.5%   86%


Structural IE

Introduction
  Considers structural or visual characteristics of the text, e.g., font type, size, location
  A complement to conventional IE (text mining)
  Called ‘Visual Information Extraction (VIE)’


Structural IE

VIE procedure
  Group the primitive elements into meaningful objects (e.g., lines, paragraphs, etc.)
  Establish the hierarchical structure among these objects
  Compare the structure of the query document with the structure of the training document to find the objects corresponding to the target fields


Object Tree


Object Tree Generation

Fit(Y, X): a measure of how well Y fits as an additional member of X

[Figure labels: X, Y, paragraph, line]


Computing Similarity in O-tree


Finding the target fields


Templates


Browsing


Topic distribution Browsing

USA, UK => acq 42/19.09%


Browsing and filtering associations


Browsing associations


Taxonomy (Topic Hierarchy) Management


Taxonomy Editor


Clustering Display using Concept Hierarchy


Query Construction