Transcript
Page 1: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL


Page 2: Outline

Information Extraction by Text Segmentation (IETS)

◦ Scenario and Problem

◦ Challenges and Motivation

◦ Related Work

ONDUX

◦ Preliminary Experiments

Next Steps

Page 3: Information Extraction by Text Segmentation

Text documents containing implicit semi-structured data records

◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions

Page 4: Information Extraction by Text Segmentation

Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Neighborhood, Price, Number, Street, ..., Phone

Page 5: Information Extraction by Text Segmentation

Why extract information?
◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage

Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273

Page 6: Information Extraction by Text Segmentation

Given an input string I representing an implicit textual record (e.g. a classified ad), the IETS task consists of:

1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute a
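As a rough illustration of the task definition above, the following sketch represents an IETS output as an ordered list of labeled segments and checks that the segments, read in order, reproduce the input string. The `Segment` type, the `covers_input` helper and the exact segment boundaries are illustrative assumptions, not notation taken from the slides.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One labeled segment of the input string (hypothetical type)."""
    label: str  # attribute assigned to the segment
    text: str   # substring of the input record


def covers_input(record: str, segments: List[Segment]) -> bool:
    """Check that the segments, read in order, reproduce the record's tokens."""
    segment_tokens = [tok for seg in segments for tok in seg.text.split()]
    return segment_tokens == record.split()


record = ("Regent Square $228,900 1028 Mifflin Ave.; "
          "6 Bedrooms; 2 Bathrooms. 412-638-7273")

# Step 1 (segmenting I) and step 2 (labeling) shown together as the final output.
extraction = [
    Segment("Neighborhood", "Regent Square"),
    Segment("Price", "$228,900"),
    Segment("No.", "1028"),
    Segment("Street", "Mifflin Ave.;"),
    Segment("Bed.", "6 Bedrooms;"),
    Segment("Bath.", "2 Bathrooms."),
    Segment("Phone", "412-638-7273"),
]

print(covers_input(record, extraction))
```

Keeping the segments as an ordered list preserves the positioning and sequencing of attributes in the record.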

Page 7: IETS – Challenges (I)

Information Extraction by Text Segmentation (IETS)
◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

Diversity of templates and styles
◦ Attribute Ordering
◦ Capitalization
◦ Abbreviations

Different applications share similar domains
◦ Ex.: Addresses and Ads
◦ Records from both domains contain address information

Page 8: IETS – Challenges (II)

Diversity of templates and styles
◦ Attribute Ordering; Capitalization; Abbreviations.

HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006

DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)

ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Page 9: IETS – Challenges (III)

Existing approaches to this problem use Machine Learning techniques
◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Support Vector Machines (SVM) (SSVM)

• Supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application.
• High computational cost.

Page 10: Related Work - (Semi) Supervised Approaches

[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)

[McCallum et al. @ ICML 2001]
◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)

[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models

All of these approaches require an expert to create a hand-labeled training set for each application.

Page 11: Related Work - (Semi) Supervised Approaches

Hand-labeled examples

<Neighborhood> Regent Square </Neighborhood> <Price> $228,900 </Price> <No> 1028 </No> <Street> Mifflin Ave, </Street> <Bed> 6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone> 412-638-7273 </Phone>

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

◦ CRF and HMM learn lexical, style, positioning and sequencing features from the given examples
◦ Examples are source-dependent
◦ Scalability problem
◦ Reusing pre-existing models?

Page 12: Related Work - Unsupervised Approaches

Semi-structured Records

Knowledge Bases: Wikipedia Infobox, DBpedia, FreeBase

Structured Records

Page 13: Related Work - Unsupervised Approaches

Supervised vs. Unsupervised

Supervised
◦ Hand-labeled examples
◦ Source Dependent
◦ Scalability Problem
◦ Reusability

Unsupervised
◦ Pre-existing information
◦ Domain Representation
◦ Easily adaptable

Page 14: Related Work - Unsupervised Approaches

[Agichtein et al. @ SIGKDD 2004]
◦ Use of reference tables to create an unsupervised model using Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]
◦ Use of reference tables to create unsupervised CRF models (U-CRF)

[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Domain-specific heuristics, not of general application

Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)

Page 15: Basic Concepts (I1)

Knowledge Base
◦ Set of pairs KB = {(m_1, O_1), ..., (m_n, O_n)}
◦ Building process is trivial
◦ Web Databases (Freebase, Googlebase)

KB = {(Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone)}

O_Neigh. = {“Regent Square”, “Milenight Park”}
O_Street = {“Regent St.”, “Morewood Ave.”, “Square Ave. Park”}
O_Phone = {“323 462-6252”, “(171) 289-7527”}

KB: Domain representation
Hand-labeled examples: Source representation
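To make the knowledge base structure concrete, here is a minimal sketch that stores the example KB as a mapping from attribute to its set of occurrences and derives per-term counts from it. The dictionary layout and the `term_frequencies` helper are assumptions made for illustration; the slides only define the KB as a set of pairs.

```python
from collections import defaultdict
from typing import Dict, Set

# Knowledge base: attribute name -> set of known occurrences (values from the slide).
KB: Dict[str, Set[str]] = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def term_frequencies(kb: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """For each lower-cased term, count the occurrences of each attribute containing it."""
    freq: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for attribute, occurrences in kb.items():
        for occurrence in occurrences:
            # set() so a term repeated inside one occurrence is counted once
            for term in set(occurrence.lower().split()):
                freq[term][attribute] += 1
    return {term: dict(counts) for term, counts in freq.items()}


freq = term_frequencies(KB)
print(freq["regent"])
print(freq["ave."])
```

In this toy KB the terms "regent" and "square" each appear under both Neighborhood and Street, which hints at why a block such as "Regent Square" can be ambiguous when only KB evidence is used.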

Page 16: Proposed Method

ONDUX [Cortez et al. @ SIGMOD 2010]

◦ Blocking

◦ Matching

◦ Reinforcement

Page 17: ONDUX (II) - Overview

[Figure: ONDUX overview]

Page 18: ONDUX (III) - Blocking

◦ Split the input text into substrings called blocks;

◦ Consider the co-occurrence of consecutive terms based on the KB

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
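The Blocking idea above can be sketched as follows: two consecutive terms of the input stay in the same block when they also appear consecutively in some occurrence in the KB. This is a simplified reading of the slide, not the authors' implementation; the `normalize` rule and the Bed./Bath. entries in the toy KB are hypothetical additions.

```python
from typing import Dict, List, Set, Tuple

# Toy knowledge base; the Bed. and Bath. entries are hypothetical.
KB: Dict[str, Set[str]] = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Bed.": {"3 Bedrooms", "6 Bedrooms"},
    "Bath.": {"1 Bathrooms", "2 Bathrooms"},
}


def normalize(term: str) -> str:
    """Lower-case a term and strip surrounding punctuation (assumed rule)."""
    return term.lower().strip(".;,")


def cooccurring_pairs(kb: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Collect pairs of consecutive terms seen in any occurrence in the KB."""
    pairs: Set[Tuple[str, str]] = set()
    for occurrences in kb.values():
        for occurrence in occurrences:
            terms = [normalize(t) for t in occurrence.split()]
            pairs.update(zip(terms, terms[1:]))
    return pairs


def blocking(record: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Split a record into blocks of terms that co-occur consecutively in the KB."""
    pairs = cooccurring_pairs(kb)
    blocks: List[List[str]] = []
    previous = None
    for token in record.split():
        current = normalize(token)
        if previous is not None and (previous, current) in pairs:
            blocks[-1].append(token)  # extend the current block
        else:
            blocks.append([token])    # start a new block
        previous = current
    return [" ".join(block) for block in blocks]


record = ("Regent Square $228,900 1028 Mifflin Ave.; "
          "6 Bedrooms; 2 Bathrooms. 412-638-7273")
print(blocking(record, KB))
```

With this toy KB, "Mifflin" and "Ave.;" end up in separate blocks because the KB never shows them together.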

Page 19: ONDUX (IV) - Matching

◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base

◦ We use distinct matching functions:

  Textual values: FF function (Field Frequency)

  Numeric values: NM function (Numeric Matching)
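The slides name the FF and NM functions without giving their formulas, so the sketch below uses stand-ins chosen only for illustration: a term-share score for textual blocks and a Gaussian similarity for numeric blocks. The scoring rules, the `match` helper and the numeric values in the toy KB are all assumptions, not the definitions used by ONDUX.

```python
import math
import re
from statistics import mean, pstdev
from typing import Dict, List, Optional

# Toy knowledge base; the numeric values are hypothetical.
TEXT_KB: Dict[str, List[str]] = {
    "Neighborhood": ["Regent Square", "Milenight Park"],
    "Street": ["Regent St.", "Morewood Ave.", "Square Ave. Park"],
}
NUMERIC_KB: Dict[str, List[float]] = {
    "Price": [199000.0, 228000.0, 250000.0],
    "No.": [812.0, 1020.0, 1100.0],
}
NUMBER = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")


def terms(text: str) -> List[str]:
    return [t.lower().strip(".;,") for t in text.split()]


def parse_number(block: str) -> Optional[float]:
    """Return the numeric value of a block such as '$228,900', else None."""
    if NUMBER.fullmatch(block) is None:
        return None
    return float(block.lstrip("$").replace(",", ""))


def ff_score(block: str, attribute: str) -> float:
    """Stand-in for FF: average share of each term's KB occurrences under `attribute`."""
    block_terms = terms(block)
    total = 0.0
    for term in block_terms:
        counts = {a: sum(term in terms(o) for o in occs) for a, occs in TEXT_KB.items()}
        seen = sum(counts.values())
        if seen:
            total += counts[attribute] / seen
    return total / len(block_terms) if block_terms else 0.0


def nm_score(block: str, attribute: str) -> float:
    """Stand-in for NM: Gaussian similarity to the values known for `attribute`."""
    value = parse_number(block)
    if value is None:
        return 0.0
    values = NUMERIC_KB[attribute]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


def match(block: str) -> Optional[str]:
    """Return the best-scoring attribute, or None when nothing matches."""
    if parse_number(block) is not None:
        scores = {a: nm_score(block, a) for a in NUMERIC_KB}
    else:
        scores = {a: ff_score(block, a) for a in TEXT_KB}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for block in ["$228,900", "1028", "Mifflin", "Ave.;"]:
    print(block, "->", match(block))
```

A block whose terms never occur in the KB, such as "Mifflin" here, is left unmatched, which is the situation a later step has to resolve.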

Page 20: ONDUX (V) - Matching

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Labels in block order: Street, Price, No., ???, Street, Bed., Bath., Phone

Page 21: ONDUX (VI)

How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Labels in block order: Street, Price, No., ???, Street, Bed., Bath., Phone

Page 22: ONDUX (VII) - Reinforcement

◦ Review the labeling task performed in the Matching step

  Unmatched blocks must receive a label of a given attribute

  Mismatched blocks must be correctly labeled

◦ How to handle these cases?

  Using positioning and sequencing information that is obtained On-Demand.

Page 23: ONDUX (VIII) - Reinforcement

◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.

  PSM: Positioning and Sequencing Model.

Page 24: ONDUX (IX) - Reinforcement

◦ Extraction Result

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Labels in block order after Reinforcement: Neighborhood (was Street), Price, No., Street (was ???), Street, Bed., Bath., Phone
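One way to picture the Reinforcement step is sketched below: positioning and sequencing statistics are estimated from the Matching output itself, then used to label a block that Matching left unmatched. The matched records, the product used to combine the two probabilities and every function name are assumptions for illustration; the sketch covers only unmatched blocks, not the correction of mismatched ones, and it is not the PSM as defined by the authors.

```python
from collections import Counter, defaultdict
from typing import List, Optional

Labels = List[Optional[str]]  # None marks an unmatched block ("???")

# Hypothetical Matching outputs for a few records from the same source.
matched_records: List[Labels] = [
    ["Neighborhood", "Price", "No.", "Street", "Bed.", "Bath.", "Phone"],
    ["Neighborhood", "Price", "No.", "Street", "Bed.", "Bath.", "Phone"],
    ["Street", "Price", "No.", None, "Bed.", "Bath.", "Phone"],
]


def build_psm(records: List[Labels]):
    """Count label transitions (sequencing) and label positions (positioning)."""
    transitions = defaultdict(Counter)  # previous label -> Counter of next labels
    positions = defaultdict(Counter)    # block index -> Counter of labels
    for labels in records:
        for index, label in enumerate(labels):
            if label is None:
                continue
            positions[index][label] += 1
            if index > 0 and labels[index - 1] is not None:
                transitions[labels[index - 1]][label] += 1
    return transitions, positions


def probability(counter: Counter, label: str) -> float:
    total = sum(counter.values())
    return counter[label] / total if total else 0.0


def reinforce(labels: Labels, transitions, positions) -> Labels:
    """Give each unmatched block the label that best fits its position and predecessor."""
    revised = list(labels)
    candidates = sorted({l for counts in positions.values() for l in counts})
    if not candidates:
        return revised
    for index, label in enumerate(revised):
        if label is not None:
            continue
        previous = revised[index - 1] if index > 0 else None

        def score(candidate: str) -> float:
            p_pos = probability(positions[index], candidate)
            p_seq = probability(transitions[previous], candidate) if previous else 1.0
            return p_pos * p_seq  # assumed combination rule

        best = max(candidates, key=score)
        if score(best) > 0:
            revised[index] = best
    return revised


transitions, positions = build_psm(matched_records)
print(reinforce(matched_records[2], transitions, positions))
```

Because the statistics come from the records being extracted, no separate training set is needed for this step.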

Page 25: Experiments (I) - Setup

◦ We tested our proposed approach on:
  Bibliographic Data (CORA, PersonalBib)
  Collections are available on the Web

Test Set                          | KB, Reference Table, …
Dataset  #Attributes  #records    | Source       #Attributes  #records
CORA     1..13        150         | Cora         1..13        350
CORA     1..13        150         | PersonalBib  7            395

Page 26: Experiments (II) - Evaluation

◦ Metrics
  Precision, Recall and F-Measure
  T-Test for the statistical validation of the results

◦ Baseline
  Conditional Random Fields (CRF)
  U-CRF (unsupervised method)
  S-CRF (classical supervised method)
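For reference, the three metrics listed above can be computed per attribute as in the sketch below. Comparing extracted values by exact match and using the balanced F-measure (F1) are assumptions; the slides do not say how matches were counted. The example extraction is made up for illustration.

```python
from typing import List, Tuple

Extraction = List[Tuple[str, str]]  # (attribute, value) pairs


def precision_recall_f1(predicted: Extraction, gold: Extraction,
                        attribute: str) -> Tuple[float, float, float]:
    """Compute precision, recall and F-measure for one attribute (exact match)."""
    predicted_values = {v for a, v in predicted if a == attribute}
    gold_values = {v for a, v in gold if a == attribute}
    correct = len(predicted_values & gold_values)
    precision = correct / len(predicted_values) if predicted_values else 0.0
    recall = correct / len(gold_values) if gold_values else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold standard and system output for one record.
gold = [("Neighborhood", "Regent Square"), ("Street", "Mifflin Ave."),
        ("Price", "$228,900")]
predicted = [("Street", "Regent Square"), ("Street", "Mifflin Ave."),
             ("Price", "$228,900")]

print(precision_recall_f1(predicted, gold, "Street"))
print(precision_recall_f1(predicted, gold, "Neighborhood"))
```

Here the mislabeled "Regent Square" lowers the precision of Street and the recall of Neighborhood at the same time.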

Page 27: Experiments (III) - Extraction Quality

S-CRF achieves better results than U-CRF thanks to its hand-labeled training set

CORA includes a variety of styles and information (journals, conferences, books)

In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models

Page 28: Experiments (IV) - Extraction Quality

As discussed earlier, U-CRF is not able to deal with different attribute orderings

Thanks to its Matching and Reinforcement strategies, ONDUX outperforms the CRF models

Page 29: Conclusions and Future Work (I)

Partial results of our research on unsupervised strategies for information extraction

ONDUX
◦ Flexible: does not assume any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching phase

Page 30: Conclusions and Future Work (II)

The proposed strategy achieves good precision and recall results
◦ Comparison with the state of the art

Future Work
◦ Investigate different matching functions;
◦ Multi-Record Extraction;
◦ Active Learning and Feedback;
◦ Error Detection;
◦ Nested structures?

Page 31: Questions?

