
Page 1: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL

Unsupervised Strategies for Information Extraction by Text Segmentation

Eli Cortez, Altigran da Silva

Federal University of Amazonas - BRAZIL

Page 2: Outline

Information Extraction by Text Segmentation (IETS)

◦ Scenario and Problem

◦ Challenges and Motivation

◦ Related Work

ONDUX

◦ Preliminary Experiments

Next Steps

Page 3: Information Extraction by Text Segmentation

Text documents containing implicit semi-structured data records

◦ Addresses

◦ Bibliographic References

◦ Classified Ads

◦ Product Descriptions

Page 4: Information Extraction by Text Segmentation

Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Attributes: Neighborhood, Price, Number, Street,..., Phone

Page 5: Information Extraction by Text Segmentation

Why extract information?

◦ Database Storage, Query…

◦ Data Mining

◦ Record Linkage

Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

<Neighborhood> : Regent Square

<Price> : $228,900

<No.> : 1028

<Street> : Mifflin Ave.

<Bed.> : 6 Bedrooms

<Bath.> : 2 Bathrooms

<Phone> : 412-638-7273

Page 6: Information Extraction by Text Segmentation

Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of:

1. Segmenting I

2. Assigning to each segment a label corresponding to an attribute a
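The two steps of the task can be pictured with a small data structure. The sketch below is only an illustration: the type name `LabeledSegment` and the helper `covers_input` are hypothetical, and the segmentation shown is the one the slides give for the classified ad.

```python
from typing import List, Tuple

# Hypothetical representation: the output of IETS is an ordered list of
# (attribute label, segment text) pairs.
LabeledSegment = Tuple[str, str]


def tokens(text: str) -> List[str]:
    """Split on whitespace; punctuation stays attached to its token."""
    return text.split()


def covers_input(record: str, segmentation: List[LabeledSegment]) -> bool:
    """Check that the segments, read in order, reproduce every input token."""
    segment_tokens = [tok for _, seg in segmentation for tok in tokens(seg)]
    return segment_tokens == tokens(record)


record = ("Regent Square $228,900 1028 Mifflin Ave.; "
          "6 Bedrooms; 2 Bathrooms. 412-638-7273")

segmentation: List[LabeledSegment] = [
    ("Neighborhood", "Regent Square"),
    ("Price", "$228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave.;"),
    ("Bed.", "6 Bedrooms;"),
    ("Bath.", "2 Bathrooms."),
    ("Phone", "412-638-7273"),
]
```

Calling `covers_input(record, segmentation)` checks that segmenting did not drop or reorder any part of the input string.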

Page 7: IETS – Challenges (I)

Information Extraction by Text Segmentation (IETS)

◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

Diversity of templates and styles

◦ Attribute Ordering

◦ Capitalization

◦ Abbreviations

Different applications share similar domains

◦ Ex.: Address and Ads

◦ Records from both domains contain address information

Page 8: IETS – Challenges (II)

Diversity of templates and styles: Attribute Ordering; Capitalization; Abbreviations.

Sources: HomePage, DBLP, ACM

Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006

Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006)

Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Page 9: IETS – Challenges (III)

Existing approaches to this problem use Machine Learning techniques

◦ Hidden Markov Models (HMM)

◦ Conditional Random Fields (CRF)

◦ Support Vector Machines (SVM) (SSVM)

• Supervised approaches require a hand-labeled training set created by an expert.

• Each generated model is particular to a given application.

• High computational cost.

Page 10: Related Work – (Semi-)Supervised Approaches

[Borkar et al. @ SIGMOD 2001]

◦ Supervised extraction method based on Hidden Markov Models (HMM)

[McCallum et al. @ ICML 2001]

◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)

[Mansuri et al. @ ICDE 2006]

◦ Semi-supervised approach based on CRF models

All of these approaches require an expert to create a hand-labeled training set for each application.

Page 11: Related Work – (Semi-)Supervised Approaches

Hand-labeled examples

<Neighborhood> Regent Square </Neighborhood> <Price> $228,900 </Price> <No> 1028 </No> <Street> Mifflin Ave, </Street> <Bed> 6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone> 412-638-7273 </Phone>

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

◦ CRF and HMM learn lexical, style, positioning and sequencing features from the given examples

◦ Examples are source-dependent

◦ Scalability problem; reusing pre-existing models?

Page 12: Related Work – Unsupervised Approaches

[Figure labels: Semi-structured Records; Knowledge Bases (Wikipedia Infobox, DBpedia, FreeBase); Structured Records]

Page 13: Related Work – Unsupervised Approaches

Supervised × Unsupervised

Supervised

◦ Hand-labeled examples

◦ Source Dependent

◦ Scalability Problem

◦ Reusability

Unsupervised

◦ Pre-existing information

◦ Domain Representation

◦ Easily adaptable

Page 14: Related Work – Unsupervised Approaches

[Agichtein et al. @ SIGKDD 2004]

◦ Use of reference tables to create an unsupervised model using Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]

◦ Use of reference tables to create unsupervised CRF models (U-CRF)

[Cortez et al. @ JASIST 2009]

◦ Unsupervised method to extract bibliographic information

◦ Domain-specific heuristics, not of general application

Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)

Page 15: Basic Concepts (I1)

Knowledge Base

◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$

◦ Building process is trivial

◦ Web Databases (Freebase, Googlebase)

$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$

$O_{\text{Neigh.}} = \{\text{“Regent Square”}, \text{“Milenight Park”}\}$

$O_{\text{Street}} = \{\text{“Regent St.”}, \text{“Morewood Ave.”}, \text{“Square Ave. Park”}\}$

$O_{\text{Phone}} = \{\text{“323 462-6252”}, \text{“(171) 289-7527”}\}$

KB: Domain representation

Hand-labeled examples: Source representation
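As a rough illustration of why building a Knowledge Base is considered trivial, the sketch below collects the values of each attribute from already-structured records. The function name `build_kb` is hypothetical, and the grouping of the slide's example values into three records is an assumption made only so the example has input data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# A Knowledge Base maps each attribute name to its set of known values (occurrences).
KnowledgeBase = Dict[str, Set[str]]


def build_kb(records: Iterable[Dict[str, str]]) -> KnowledgeBase:
    """Collect, per attribute, every value seen in the structured records."""
    kb = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            kb[attribute].add(value)
    return dict(kb)


# Assumption: how the values are paired into records is invented for this sketch;
# only the values themselves come from the slide.
structured_records = [
    {"Neighborhood": "Regent Square", "Street": "Regent St.", "Phone": "323 462-6252"},
    {"Neighborhood": "Milenight Park", "Street": "Morewood Ave.", "Phone": "(171) 289-7527"},
    {"Street": "Square Ave. Park"},
]

kb = build_kb(structured_records)
```

No record in this input is hand-labeled for a particular source, which is the sense in which the KB represents the domain rather than a source.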

Page 16: Proposed Method

ONDUX [Cortez et al. @ SIGMOD 2010]

◦Blocking

◦Matching

◦Reinforcement

Page 17: ONDUX (II) – Overview

[Figure: ONDUX overview]

Page 18: ONDUX (III) – Blocking

◦ Split the input text into substrings called blocks

◦ Consider the co-occurrence of consecutive terms based on the KB

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
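A minimal sketch of the blocking idea, under stated assumptions: two consecutive terms are kept in the same block when they appear together in at least one KB value, tokens are split on whitespace, and surrounding punctuation is ignored when comparing. The slides do not give the exact co-occurrence test, so `normalize`, `cooccurring_pairs` and `blocking` are hypothetical helpers, and the `Bed.` and `Bath.` values are invented for the example.

```python
from typing import Dict, List, Set, Tuple


def normalize(token: str) -> str:
    """Lowercase a token and drop surrounding punctuation before comparing."""
    return token.strip(".,;()").lower()


def cooccurring_pairs(kb: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """All pairs of distinct terms that share at least one KB value (order ignored)."""
    pairs = set()
    for values in kb.values():
        for value in values:
            terms = {normalize(t) for t in value.split()}
            pairs.update((a, b) for a in terms for b in terms if a != b)
    return pairs


def blocking(text: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Merge consecutive tokens into one block while they co-occur in the KB."""
    pairs = cooccurring_pairs(kb)
    blocks: List[List[str]] = []
    for token in text.split():
        if blocks and (normalize(blocks[-1][-1]), normalize(token)) in pairs:
            blocks[-1].append(token)
        else:
            blocks.append([token])
    return [" ".join(block) for block in blocks]


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
    "Bed.": {"4 Bedrooms", "6 Bedrooms"},     # hypothetical values
    "Bath.": {"1 Bathrooms", "2 Bathrooms"},  # hypothetical values
}

ad = "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273"
blocks = blocking(ad, kb)
```

With this toy KB, `blocks` is `['Regent Square', '$228,900', '1028', 'Mifflin', 'Ave.;', '6 Bedrooms;', '2 Bathrooms.', '412-638-7273']`: "Mifflin" and "Ave." stay apart because no KB value contains both.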

Page 19: ONDUX (IV) – Matching

◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base

◦ We use distinct matching functions:

Textual values: FF function (Field Frequency)

Numeric values: NM function (Numeric Matching)
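The slides name the two matching functions but do not give their formulas, so the sketch below uses stand-ins that should be read as assumptions: `ff_score` rewards terms that are frequent in the values of one attribute and rare in the others, and `nm_score` compares a number with the known values of an attribute through a Gaussian around their mean. All function names and the price values are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Set


def normalize(token: str) -> str:
    return token.strip(".,;()").lower()


def term_counts(values: Set[str]) -> Counter:
    """For one attribute, count in how many values each term appears."""
    counts: Counter = Counter()
    for value in values:
        counts.update({normalize(t) for t in value.split()})
    return counts


def ff_score(block: str, attribute: str, kb: Dict[str, Set[str]]) -> float:
    """Assumed frequency-based score of a textual block for one attribute."""
    per_attribute = {a: term_counts(vs) for a, vs in kb.items()}
    target = per_attribute[attribute]
    f_max = max(target.values(), default=1)
    block_terms = [normalize(t) for t in block.split()]
    if not block_terms:
        return 0.0
    total = 0.0
    for term in block_terms:
        f_ta = target.get(term, 0)  # frequency of the term in this attribute
        n_t = sum(c.get(term, 0) for c in per_attribute.values())  # in all attributes
        if f_ta:
            total += (f_ta / n_t) * (f_ta / f_max)
    return total / len(block_terms)


def nm_score(value: float, known_values: List[float]) -> float:
    """Assumed Gaussian similarity between a number and an attribute's known values."""
    mu, sigma = mean(known_values), pstdev(known_values)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}
known_prices = [200000.0, 250000.0]  # hypothetical Price values
```

In a matching step of this kind, each block would receive the attribute with the highest score, and a block whose scores are all zero, such as "Mifflin" with this KB, would stay unmatched.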

Page 20: ONDUX (V) – Matching

Block: Regent Square | $228,900 | 1028 | Mifflin | Ave.; | 6 Bedrooms; | 2 Bathrooms. | 412-638-7273

Label: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

Page 21: ONDUX (VI)

How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?

Block: Regent Square | $228,900 | 1028 | Mifflin | Ave.; | 6 Bedrooms; | 2 Bathrooms. | 412-638-7273

Label: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

Page 22: ONDUX (VII) – Reinforcement

◦ Review the labeling performed in the Matching step

Unmatched blocks must receive the label of some attribute

Mismatched blocks must be correctly labeled

◦ How to handle these cases?

Using positioning and sequencing information that is obtained On-Demand.

Page 23: ONDUX (VIII) – Reinforcement

◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.

PSM: Positioning and Sequencing Model.
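One way to picture a positioning and sequencing model is as two tables estimated from the labels produced by the Matching step: how often one label follows another, and how often a label occurs at each position. The sketch below is an assumption about what such a structure could contain, not the authors' definition; `build_psm` and the three sample labelings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Labeling = List[Optional[str]]  # one label per block; None marks an unmatched block


def build_psm(labelings: List[Labeling]) -> Tuple[Dict[str, Dict[str, float]],
                                                  Dict[int, Dict[str, float]]]:
    """Estimate label-to-label transition and per-position label probabilities."""
    transition_counts: Dict[str, Counter] = defaultdict(Counter)
    position_counts: Dict[int, Counter] = defaultdict(Counter)
    for labels in labelings:
        for k, label in enumerate(labels):
            if label is None:
                continue  # unmatched blocks contribute no evidence
            position_counts[k][label] += 1
            if k + 1 < len(labels) and labels[k + 1] is not None:
                transition_counts[label][labels[k + 1]] += 1
    transitions = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                   for a, nxt in transition_counts.items()}
    positions = {k: {a: c / sum(cnt.values()) for a, c in cnt.items()}
                 for k, cnt in position_counts.items()}
    return transitions, positions


# Hypothetical Matching outputs for three records from the same input.
matching_output: List[Labeling] = [
    ["Neighborhood", "Price", "No.", None, "Street"],
    ["Neighborhood", "Price", "No.", "Street", "Street"],
    ["Street", "Price", "No.", "Street", "Street"],
]

transitions, positions = build_psm(matching_output)
```

Because the tables are built from the input being processed rather than from a training set, evidence such as "Neighborhood is the usual label at position 0" or "Street usually follows No." becomes available on demand and can be used to revisit unmatched or mismatched blocks.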

Page 24: ONDUX (IX) – Reinforcement

◦ Extraction Result

Block: Regent Square | $228,900 | 1028 | Mifflin | Ave.; | 6 Bedrooms; | 2 Bathrooms. | 412-638-7273

Matching: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

Reinforcement: Neighborhood | Price | No. | Street | Street | Bed. | Bath. | Phone

Page 25: Experiments (I) – Setup

◦ We tested our proposed approach on bibliographic data (CORA, PersonalBib)

◦ The collections are available on the Web

Test Set: Dataset | #Attributes | #Records; KB, Reference Table, …: Source | #Attributes | #Records

CORA | 1..13 | 150; Cora | 1..13 | 350

CORA | 1..13 | 150; PersonalBib | 7 | 395

Page 26: Experiments (II) – Evaluation

◦ Metrics

Precision, Recall and F-Measure

T-Test for the statistical validation of the results

◦ Baseline: Conditional Random Fields (CRF)

U-CRF (unsupervised method)

S-CRF (classical supervised method)
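For reference, the three metrics can be computed per attribute as sketched below. How the experiments count a correct extraction (per term, per value, per record) is not stated on the slides, so the sketch assumes whole values identified by a record id; the function name and the sample data are hypothetical.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[Tuple[str, str]],
                        expected: Set[Tuple[str, str]]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of extracted values against a gold standard."""
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold standard and system output for the Street attribute,
# as (record id, value) pairs.
expected_street = {("r1", "Mifflin Ave."), ("r2", "Morewood Ave."), ("r3", "Regent St.")}
extracted_street = {("r1", "Mifflin Ave."), ("r2", "Morewood Ave."),
                    ("r3", "Regent St."), ("r4", "Regent Square")}

p, r, f = precision_recall_f1(extracted_street, expected_street)
```

Here three of the four extracted values are correct and all three expected values are found, so precision is 0.75 and recall is 1.0.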

Page 27: Experiments (III) – Extraction Quality

S-CRF achieves higher results than U-CRF due to the hand-labeled training set

CORA includes a variety of styles and information (journal, conference, books)

In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models

Page 28: Experiments (IV) – Extraction Quality

As discussed earlier, U-CRF is not able to deal with different attribute orderings

Due to the Matching and Reinforcement Strategies, ONDUX outperforms CRF models

Page 29: Conclusions and Future Work (I)

Partial results of our research on unsupervised strategies for information extraction

ONDUX

◦ Flexible: does not assume any particular style

◦ Unsupervised: does not require any human effort to create a training set

◦ On-Demand: ordering and positioning information is learned through the Matching phase

Page 30: Conclusions and Future Work (II)

The proposed strategy achieves good precision and recall results

◦ Comparison with the state of the art

Future work

◦ Investigate different matching functions

◦ Multi-record extraction

◦ Active learning and feedback

◦ Error detection

◦ Nested structures?

Page 31: Questions?