
Unsupervised Strategies for Information Extraction by Text Segmentation

Eli Cortez, Altigran da Silva

Federal University of Amazonas - BRAZIL

Outline

Information Extraction by Text Segmentation (IETS)

◦ Scenario and Problem

◦ Challenges and Motivation

◦ Related Work

ONDUX

◦ Preliminary Experiments

Next Steps

Information Extraction by Text Segmentation

Text documents containing implicit semi-structured data records

Addresses; Bibliographic References; Classified Ads; Product Descriptions

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Classified Ad

Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Address

Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Bibliographic Reference


Information Extraction by Text Segmentation

Neighborhood, Price, Number, Street,..., Phone

Why extract information? Database Storage, Query…; Data Mining; Record Linkage.

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Classified Ad

<Neighborhood> : Regent Square

<Price> : $228,900

<No.> : 1028

<Street> : Mifflin Ave.

<Bed.> : 6 Bedrooms

<Bath.> : 2 Bathrooms

<Phone> : 412-638-7273



Given an input string I representing an implicit textual record (e.g. a classified ad), the IETS task consists of:

1. Segmenting I into substrings (segments);

2. Assigning to each segment a label corresponding to an attribute.
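To make the input and output of the task concrete, the sketch below represents an extraction result as an ordered list of labeled segments for the classified ad used in these slides. The `Segment` type and the `is_valid_segmentation` helper are hypothetical names introduced only for illustration; they are not part of any method discussed here.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # attribute name assigned to the segment
    text: str   # substring of the input record


def is_valid_segmentation(record: str, segments: List[Segment]) -> bool:
    """Check that the segments occur in the record in order, without overlap."""
    cursor = 0
    for seg in segments:
        pos = record.find(seg.text, cursor)
        if pos < 0:
            return False
        cursor = pos + len(seg.text)
    return True


record = ("Regent Square $228,900 1028 Mifflin Ave.; "
          "6 Bedrooms; 2 Bathrooms. 412-638-7273")
segments = [
    Segment("Neighborhood", "Regent Square"),
    Segment("Price", "$228,900"),
    Segment("No.", "1028"),
    Segment("Street", "Mifflin Ave."),
    Segment("Bed.", "6 Bedrooms"),
    Segment("Bath.", "2 Bathrooms"),
    Segment("Phone", "412-638-7273"),
]
```

Keeping the segments as substrings of the record, rather than as free text, makes it easy to check that an extractor only segments and labels the input and never invents content.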



IETS – Challenges (I)

Information Extraction by Text Segmentation (IETS)

◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

Diversity of templates and styles: attribute ordering, capitalization, abbreviations.

Different applications share similar domains. Ex.: Addresses and Ads

Records from both domains contain address information

IETS – Challenges (II)

Diversity of templates and styles: attribute ordering; capitalization; abbreviations.

HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006

DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)

ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

IETS – Challenges (III)

Existing approaches to this problem use Machine Learning techniques

Hidden Markov Models (HMM); Conditional Random Fields (CRF); Support Vector Machines (SVM) (SSVM)

• Supervised approaches require a hand-labeled training set created by an expert.

• Each generated model is particular to a given application

• High computational cost

Related Work: (Semi) Supervised Approaches

[Borkar et al. @ SIGMOD 2001]

◦ Supervised extraction method based on Hidden Markov Models (HMM)

[McCallum et al. @ ICML 2001]

◦ Proposed the usage of Conditional Random Fields (CRF), a supervised model (S-CRF)

[Mansuri et al. @ ICDE 2006]

◦ Semi-supervised approach based on CRF models

All of these approaches require an expert to create a hand-labeled training set for each application.

Related Work: (Semi) Supervised Approaches

Hand-labeled examples

<Neighborhood> Regent Square </Neighborhood> <Price> $228,900 </Price> <No> 1028 </No> <Street> Mifflin Ave. </Street> <Bed> 6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone> 412-638-7273 </Phone>

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

CRF and HMM learn lexical, style, positioning and sequencing features from the given examples

Examples are source-dependent

Scalability problem. Reusing pre-existing models?

Related Work: Unsupervised Approaches

[Figure: sources of pre-existing data: Semi-structured Records, Wikipedia Infobox, DBpedia, Freebase, Knowledge Bases, Structured Records]


Supervised X Unsupervised

Supervised: hand-labeled examples; source dependent; scalability problem; reusability

Unsupervised: pre-existing information; domain representation; easily adaptable

[Agichtein et al. @ SIGKDD 2004]

◦ Usage of reference tables to create an unsupervised model using Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]

◦ Usage of reference tables to create unsupervised CRF models (U-CRF)

[Cortez et al. @ JASIST 2009]

◦ Unsupervised method to extract bibliographic information; relies on domain-specific heuristics, so it is not generally applicable.

Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)


Basic Concepts (I1)

Knowledge Base

◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$

◦ Building process trivial

◦ Web Databases (Freebase, Googlebase)

$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$

$O_{\text{Neigh.}}$ = { “Regent Square”, “Milenight Park” }

$O_{\text{Street}}$ = { “Regent St.”, “Morewood Ave.”, “Square Ave. Park” }

$O_{\text{Phone}}$ = { “323 462-6252”, “(171) 289-7527” }

KB: Domain Representation

Hand-labeled examples: Source representation
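A knowledge base of this kind maps naturally onto a dictionary from attribute names to sets of known values. The sketch below encodes the example KB from this slide; the `KnowledgeBase` alias and the `attributes_containing` helper are hypothetical names, and whitespace tokenization is an assumption made for illustration.

```python
from typing import Dict, Set

# Knowledge base: attribute name -> set of known values (occurrences).
KnowledgeBase = Dict[str, Set[str]]

kb: KnowledgeBase = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(kb: KnowledgeBase, term: str) -> Set[str]:
    """Return the attributes with at least one value containing `term` as a token."""
    term = term.lower()
    return {
        attr
        for attr, values in kb.items()
        if any(term in value.lower().split() for value in values)
    }
```

Note that a term such as "Regent" occurs in values of both Neighborhood and Street, which is one reason why a block can be labeled ambiguously when only KB content is used.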

Proposed Method

ONDUX [Cortez et al. @ SIGMOD 2010]

◦ Blocking

◦ Matching

◦ Reinforcement

ONDUX (II)

Overview

[Figure: ONDUX overview, with steps numbered 1 to 3]

ONDUX (III)

Blocking

◦ Split the input text into substrings called blocks;

◦ Consider the co-occurrence of consecutive terms based on the KB
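The blocking idea can be sketched as follows, assuming that two consecutive terms stay in the same block whenever they co-occur in at least one KB value. This co-occurrence test, the `normalize` rule and the function names are assumptions made for illustration, not the exact procedure of the method.

```python
from typing import Dict, List, Set


def normalize(term: str) -> str:
    """Lowercase a term and strip surrounding punctuation (assumed normalization)."""
    return term.lower().strip(".,;:")


def cooccur(kb: Dict[str, Set[str]], a: str, b: str) -> bool:
    """True if both terms appear together in at least one value of the KB."""
    pair = {normalize(a), normalize(b)}
    return any(
        pair <= {normalize(t) for t in value.split()}
        for values in kb.values()
        for value in values
    )


def blocking(record: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Split a record into blocks of consecutive terms that co-occur in the KB."""
    blocks: List[List[str]] = []
    for term in record.split():
        if blocks and cooccur(kb, blocks[-1][-1], term):
            blocks[-1].append(term)   # extend the current block
        else:
            blocks.append([term])     # start a new block
    return [" ".join(block) for block in blocks]


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}
blocks = blocking("Regent Square 1028 Morewood Ave.", kb)
```

Terms unknown to the KB, such as "1028" here, end up as single-term blocks, so every term of the input belongs to exactly one block.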



ONDUX (IV)

Matching

◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base

◦ We use distinct matching functions:

Textual Values: FF Function (Field Frequency)

Numeric Values: NM Function (Numeric Matching)
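The two kinds of matching functions can be illustrated with the simplified stand-ins below. They are assumptions, not the FF and NM formulas of the method: `field_frequency` scores a block by how exclusively its terms occur in the values of one attribute, and `numeric_matching` scores a number by a Gaussian similarity to the numeric values known for an attribute. All names and the `threshold` parameter are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Optional


def norm(term: str) -> str:
    """Lowercase a term and strip surrounding punctuation (assumed normalization)."""
    return term.lower().strip(".,;:")


def contains(value: str, term: str) -> bool:
    return term in {norm(t) for t in value.split()}


def field_frequency(block: str, attr: str, kb: Dict[str, List[str]]) -> float:
    """Average, over the block's terms, of the share of KB values containing
    the term that belong to `attr` (a simplified frequency-based score)."""
    terms = [norm(t) for t in block.split()]
    if not terms:
        return 0.0
    total = 0.0
    for term in terms:
        in_attr = sum(contains(v, term) for v in kb[attr])
        in_kb = sum(contains(v, term) for values in kb.values() for v in values)
        if in_kb:
            total += in_attr / in_kb
    return total / len(terms)


def numeric_matching(value: float, known: List[float]) -> float:
    """Gaussian similarity between a number and the known values of an attribute."""
    mu, sigma = mean(known), pstdev(known)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


def match(block: str, kb: Dict[str, List[str]], threshold: float = 0.5) -> Optional[str]:
    """Label a textual block with the best-scoring attribute, or None if unmatched."""
    scores = {attr: field_frequency(block, attr, kb) for attr in kb}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


kb = {
    "Neighborhood": ["Regent Square", "Milenight Park"],
    "Street": ["Regent St.", "Morewood Ave.", "Square Ave. Park"],
}
```

With this toy KB, `match("Mifflin", kb)` returns `None` because no KB value contains the term, which corresponds to a block left unlabeled after the matching step.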

ONDUX (V)

Matching

Labels assigned to the blocks: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VI)

How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?

Labels assigned to the blocks: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone

ONDUX (VII)

Reinforcement

◦ Review the labeling task performed in the Matching step

Unmatched blocks must receive the label of some attribute

Mismatched blocks must be relabeled correctly

◦ How to handle these cases? Using positioning and sequencing information that is obtained On-Demand.

ONDUX (VIII)

Reinforcement

◦ Given the extraction output of the matching step, ONDUX automatically builds a graphical structure, the PSM.

PSM: Positioning and Sequencing Model.
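One way to picture a positioning and sequencing model is as two tables of relative frequencies estimated from the labels produced by the matching step: how often a label follows another, and how often a label appears at a given position. The sketch below builds both tables and uses them to label unmatched blocks. The function names, the noisy-OR combination of the two probabilities, and the restriction to unmatched blocks (mismatched blocks are not revised here) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Labels = List[Optional[str]]  # None marks an unmatched block


def build_psm(records: List[Labels]) -> Tuple[Dict, Dict]:
    """Estimate transition and position probabilities from matching output."""
    trans_counts = defaultdict(Counter)  # label -> counts of the next label
    pos_counts = defaultdict(Counter)    # position -> counts of labels
    for labels in records:
        for k, label in enumerate(labels):
            if label is None:
                continue
            pos_counts[k][label] += 1
            if k + 1 < len(labels) and labels[k + 1] is not None:
                trans_counts[label][labels[k + 1]] += 1
    trans = {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
             for a, nxt in trans_counts.items()}
    pos = {k: {lab: c / sum(cnt.values()) for lab, c in cnt.items()}
           for k, cnt in pos_counts.items()}
    return trans, pos


def reinforce(labels: Labels, trans: Dict, pos: Dict) -> Labels:
    """Give each unmatched block the label best supported by the PSM."""
    out = list(labels)
    for k, label in enumerate(out):
        if label is not None:
            continue
        prev = out[k - 1] if k > 0 else None
        t_row = trans.get(prev, {})
        p_row = pos.get(k, {})
        candidates = sorted(set(t_row) | set(p_row))
        if candidates:
            # Noisy-OR of the sequencing and positioning evidence (assumed).
            out[k] = max(
                candidates,
                key=lambda c: 1 - (1 - t_row.get(c, 0.0)) * (1 - p_row.get(c, 0.0)),
            )
    return out


matched = [
    ["Neighborhood", "Price", "No.", "Street"],
    ["Neighborhood", "Price", "No.", None],
    ["Neighborhood", "Price", "No.", "Street"],
]
trans, pos = build_psm(matched)
fixed = reinforce(matched[1], trans, pos)
```

Because the tables are estimated from the very records being extracted, no ordering has to be fixed in advance; the model reflects whatever orderings appear in the input.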

ONDUX (IX)

Reinforcement

◦ Extraction Result

Street → Neighborhood | Price | No. | ??? → Street | Street | Bed. | Bath. | Phone

Experiments (I)

Setup

◦ We tested our proposed approach on bibliographic data (CORA, PersonalBib)

Collections are available on the Web

Test Set                           KB, Reference Table, …
Dataset  #Attributes  #Records     Source       #Attributes  #Records
CORA     1..13        150          CORA         1..13        350
CORA     1..13        150          PersonalBib  7            395

Experiments (II)

Evaluation

◦ Metrics: Precision, Recall and F-Measure

T-Test for the statistical validation of the results

◦ Baseline: Conditional Random Fields (CRF)

U-CRF (unsupervised method); S-CRF (classical supervised method)
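For reference, precision, recall and F-measure for one attribute can be computed as in the sketch below, which compares the set of extracted values with the set of expected values. The function name and the set-based representation are assumptions; the slides do not specify how values are matched against the ground truth.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], expected: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure (harmonic mean) for one attribute."""
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

In this example two of the four extracted values are correct and two of the three expected values are found, so precision is 1/2, recall is 2/3 and the F-measure is 4/7.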

Experiments (III)

Extraction Quality

S-CRF achieves higher results than U-CRF due to its hand-labeled training set

CORA includes a variety of styles and information (jconference, books)

In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models

Experiments (IV)

Extraction Quality

As discussed earlier, U-CRF is able to deal with different attribute orderings

Due to the Matching and Reinforcement strategies, ONDUX outperforms the CRF models

Conclusions and Future Work (I)

Partial results of our research on unsupervised strategies for information extraction

ONDUX

◦ Flexible: does not consider any particular style

◦ Unsupervised: does not require any human effort to create a training set

◦ On-Demand: ordering and positioning information is learned through the Matching phase

The proposed strategy achieves good precision and recall

◦ Comparison with the state of the art

As Future Work

◦ Investigate different matching functions;

◦ Multi-Record Extraction;

◦ Active Learning and Feedback;

◦ Error Detection;

◦ Nested structures?

Conclusions and Future Work (II)

Questions?
