Unsupervised Strategies for Information Extraction by Text Segmentation
Eli Cortez, Altigran da Silva
Federal University of Amazonas - BRAZIL
Outline
Information Extraction by Text Segmentation (IETS)
◦ Scenario and Problem
◦ Challenges and Motivation
◦ Related Work
ONDUX
◦ Preliminary Experiments
Next Steps
Information Extraction by Text Segmentation
Text documents containing implicit semi-structured data records
◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms;
2 Bathrooms. 412-638-7273
Classified Ad
Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214
Address
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
Bibliographic Reference
Information Extraction by Text Segmentation
Neighborhood, Price, Number, Street,..., Phone
Why extract information?
◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Classified Ad
<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273
Information Extraction by Text Segmentation
Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of:
1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute a
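The two steps above can be pictured as a mapping from one string to a list of labeled segments. The sketch below only fixes the input and output shapes; the names `Segment` and `is_valid_segmentation` are hypothetical, and the expected result is the classified-ad example from the slides, written out by hand rather than produced by any extractor.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    text: str       # contiguous piece of the input string I
    attribute: str  # label assigned to that piece


def is_valid_segmentation(record: str, segments: List[Segment]) -> bool:
    """Check that the segments occur in order, without overlap, inside the record."""
    cursor = 0
    for seg in segments:
        pos = record.find(seg.text, cursor)
        if pos < 0:
            return False
        cursor = pos + len(seg.text)
    return True


record = ("Regent Square $228,900 1028 Mifflin Ave.; "
          "6 Bedrooms; 2 Bathrooms. 412-638-7273")

# Hand-written target output for the record above.
expected = [
    Segment("Regent Square", "Neighborhood"),
    Segment("$228,900", "Price"),
    Segment("1028", "No."),
    Segment("Mifflin Ave.", "Street"),
    Segment("6 Bedrooms", "Bed."),
    Segment("2 Bathrooms", "Bath."),
    Segment("412-638-7273", "Phone"),
]
```

An IETS method is then any procedure that takes `record` and returns a list such as `expected`.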
Information Extraction by Text Segmentation
IETS – Challenges (I)
Information Extraction by Text Segmentation (IETS)
◦ Borkar@SIGMOD'01, McCallum@ICML'01,
Agichtein@SIGKDD'04, Mansuri@ICDE'06,
Zhao@SICDM'08, Cortez@JASIST'09
Diversity of templates and styles: attribute ordering, capitalization, abbreviations.
Different applications share similar domains, e.g., addresses and ads:
records from both domains contain address information
IETS – Challenges (II)
Diversity of templates and styles
Attribute Ordering; Capitalization; Abbreviations.
HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006
DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006)
ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
IETS – Challenges (III)
Existing approaches to this problem use Machine Learning techniques
◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Support Vector Machines (SVM) (SSVM)
• Supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application
• High computational cost
Related Work
(Semi) Supervised Approaches
[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)
[McCallum et al. @ ICML 2001]
◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)
[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models
All of these approaches require an expert to create a hand-labeled training set for each application.
Related Work
(Semi) Supervised Approaches
Hand-labeled examples
<Neighborhood> Regent Square </Neighborhood>
<Price> $228,900 </Price> <No> 1028 </No> <Street>
Mifflin Ave, </Street> <Bed> 6 Bedrooms </Bed>
<Bath> 2 Bathrooms </Bath> <Phone>412-638-7273
</Phone>
Regent Square $228,900 1028 Mifflin Ave.;
6 Bedrooms; 2 Bathrooms. 412-638-7273
CRF and HMM learn lexical, style, positioning and sequencing features from the given examples
Examples are source-dependent
Scalability problem
Reusing pre-existing models?
Related Work
UNsupervised Approaches
Semi-structured Records
Wikipedia Infobox
DBpedia
FreeBase
Knowledge Bases
Structured Records
Related Work
UNsupervised Approaches
Supervised X UNsupervised
Supervised: Hand-labeled examples; Source Dependent; Scalability Problem; Reusability
UNsupervised: Pre-existing information; Domain Representation; Easily adaptable
Related Work
UNsupervised Approaches
[Agichtein et al. @ SIGKDD 2004]
◦ Use of reference tables to create an unsupervised model using Hidden Markov Models (HMM)
[Zhao et al. @ SIAM ICDM 2008]
◦ Use of reference tables to create unsupervised CRF models (U-CRF)
[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Domain-specific heuristics, not generally applicable
Both reference-table models (HMM and U-CRF) assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)
Basic Concepts (I1)
Knowledge Base
◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$, each pairing an attribute name $m_i$ with a set $O_i$ of its known values
◦ Building process trivial
◦ Web Databases (Freebase, Googlebase)
$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$
$O_{\text{Neigh.}}$ = { “Regent Square”, “Milenight Park” }
$O_{\text{Street}}$ = { “Regent St.”, “Morewood Ave.”, “Square Ave. Park” }
$O_{\text{Phone}}$ = { “323 462-6252”, “(171) 289-7527” }
KB: Domain Representation
Hand-labeled examples: Source representation
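A knowledge base of this shape maps directly onto a dictionary from attribute names to sets of values. The sketch below uses the example values from the slide; the type alias `KnowledgeBase` and the helper `attributes_containing` are hypothetical names introduced for illustration.

```python
from typing import Dict, Set

# KB = {(m_1, O_1), ..., (m_n, O_n)} as a mapping from attribute name to values.
KnowledgeBase = Dict[str, Set[str]]

kb: KnowledgeBase = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(kb: KnowledgeBase, term: str) -> Set[str]:
    """Return the attributes that have at least one known value containing the term."""
    term = term.lower()
    return {
        attr
        for attr, values in kb.items()
        if any(term in value.lower().split() for value in values)
    }
```

Note that `attributes_containing(kb, "Regent")` yields both `Neighborhood` and `Street`: the same term can be typical of more than one attribute, which is one reason labeling from the KB alone can be ambiguous.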
Proposed Method
ONDUX [Cortez et al. @ SIGMOD 2010]
◦ Blocking
◦ Matching
◦ Reinforcement
ONDUX (II)
Overview
ONDUX (III)
Blocking
◦ Split the input text into substrings called blocks;
◦ Consider the co-occurrence of consecutive terms based on the KB
Regent Square $228,900 1028 Mifflin Ave.;
6 Bedrooms; 2 Bathrooms. 412-638-7273
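A minimal sketch of the blocking idea, under a stated assumption: two consecutive terms stay in the same block when they appear together in at least one value stored in the KB. The slides do not spell out the exact co-occurrence test, so this rule, the tokenizer, and the function names are illustrative only.

```python
from typing import Dict, List, Set

PUNCT = ".,;:"

kb: Dict[str, Set[str]] = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip leading/trailing punctuation from each term."""
    tokens = [t.strip(PUNCT) for t in text.split()]
    return [t for t in tokens if t]


def cooccur(kb: Dict[str, Set[str]], left: str, right: str) -> bool:
    """True if both terms appear together in at least one KB value (assumed rule)."""
    l, r = left.lower(), right.lower()
    for values in kb.values():
        for value in values:
            terms = {t.strip(PUNCT).lower() for t in value.split()}
            if l in terms and r in terms:
                return True
    return False


def blocking(kb: Dict[str, Set[str]], text: str) -> List[str]:
    """Group consecutive terms into blocks using KB co-occurrence."""
    blocks: List[List[str]] = []
    for term in tokenize(text):
        if blocks and cooccur(kb, blocks[-1][-1], term):
            blocks[-1].append(term)   # extend the current block
        else:
            blocks.append([term])     # start a new block
    return [" ".join(block) for block in blocks]
```

With this toy KB, "Regent" and "Square" are merged because the KB contains "Regent Square", while "Mifflin" stays on its own because the KB has never seen it next to "Ave".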
ONDUX (IV)
Matching
◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base
◦ We use distinct matching functions:
  Textual values: FF function (Field Frequency)
  Numeric values: NM function (Numeric Matching)
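The slides name the two matching functions but do not define them, so the sketch below uses simplified stand-ins rather than the authors' formulas: `ff_score` averages, over the terms of a block, the fraction of an attribute's known values that contain the term, and `nm_score` is a Gaussian similarity to the attribute's known numbers. The threshold `min_score` and the price values in `numeric_kb` are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Optional, Set


def ff_score(values: Set[str], block: str) -> float:
    """Stand-in for a field-frequency score on textual values."""
    terms = block.lower().split()
    if not terms or not values:
        return 0.0
    value_terms = [set(v.lower().split()) for v in values]
    hits = [sum(t in vt for vt in value_terms) / len(value_terms) for t in terms]
    return sum(hits) / len(hits)


def nm_score(numbers: List[float], block: str) -> float:
    """Stand-in for a numeric-matching score: Gaussian similarity to known values."""
    try:
        x = float(block.replace("$", "").replace(",", ""))
    except ValueError:
        return 0.0  # the block is not a number
    mu, sigma = mean(numbers), pstdev(numbers)
    if sigma == 0:
        return 1.0 if x == mu else 0.0
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def match(block: str,
          textual_kb: Dict[str, Set[str]],
          numeric_kb: Dict[str, List[float]],
          min_score: float = 0.1) -> Optional[str]:
    """Return the best-scoring attribute, or None if the block stays unmatched."""
    scores = {attr: ff_score(vals, block) for attr, vals in textual_kb.items()}
    scores.update({attr: nm_score(nums, block) for attr, nums in numeric_kb.items()})
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


textual_kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}
numeric_kb = {"Price": [199000.0, 228900.0, 250000.0]}  # hypothetical values
```

A block such as "Mifflin", which shares no term with the KB, comes back as `None`; this is the kind of unmatched block that a later step has to resolve.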
ONDUX (V)
Matching
Regent Square $228,900 1028 Mifflin Ave.;
6 Bedrooms; 2 Bathrooms. 412-638-7273
Street Price No. ??? Street
Bed. Bath. Phone
ONDUX (VI)
How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?
Regent Square $228,900 1028 Mifflin Ave.;
6 Bedrooms; 2 Bathrooms. 412-638-7273
Street Price No. ??? Street
Bed. Bath. Phone
ONDUX (VII)
Reinforcement
◦ Review the labeling task performed in the Matching step
  Unmatched blocks must receive a label of a given attribute
  Mismatched blocks must be correctly labeled
◦ How to handle these cases? Using positioning and sequencing information that is obtained On-Demand.
ONDUX (VIII)
Reinforcement
◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.
PSM: Positioning and Sequencing Model.
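The slides do not give the internal form of the PSM. As a rough illustration only, the sketch below assumes it can be approximated by two frequency tables built from the labels produced by the Matching step: how often one attribute directly follows another (sequencing) and how often an attribute occurs at a given block index (positioning). All function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Optional, Tuple

Table = DefaultDict[str, Counter]


def build_psm(labelings: List[List[Optional[str]]]) -> Tuple[Table, Table]:
    """Count sequencing and positioning evidence; None marks an unmatched block."""
    follows: Table = defaultdict(Counter)   # follows[a][b]: b came right after a
    position: Table = defaultdict(Counter)  # position[a][k]: a seen at block index k
    for labels in labelings:
        for k, label in enumerate(labels):
            if label is None:
                continue
            position[label][k] += 1
            if k > 0 and labels[k - 1] is not None:
                follows[labels[k - 1]][label] += 1
    return follows, position


def transition_prob(follows: Table, prev: str, nxt: str) -> float:
    """Relative frequency with which nxt directly follows prev."""
    total = sum(follows[prev].values())
    return follows[prev][nxt] / total if total else 0.0


def suggest_label(follows: Table, prev: str) -> Optional[str]:
    """Most frequent successor of prev, used here to label an unmatched block."""
    candidates = follows.get(prev)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Labels produced by the Matching step for three records (toy data).
labelings = [
    ["Neighborhood", "Price", "No.", "Street"],
    ["Neighborhood", "Price", "No.", None],
    ["Neighborhood", "Price", "No.", "Street", "Phone"],
]
follows, position = build_psm(labelings)
```

In the second record the block after "No." was left unmatched; the counts gathered from the other records suggest "Street" for it, which mirrors how sequencing information can repair the "???" block in the running example.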
ONDUX (IX)
Reinforcement
◦ Extraction Result
Regent Square $228,900 1028 Mifflin Ave.;
6 Bedrooms; 2 Bathrooms. 412-638-7273
Neighborhood (was Street) Price No. Street (was ???) Street
Bed. Bath. Phone
Experiments (I)
Setup
◦ We tested our proposed approach on bibliographic data (CORA, PersonalBib)
◦ Collections are available on the Web
Test Set                          | KB, Reference Table, …
Dataset  #Attributes  #Records    | Source       #Attributes  #Records
CORA     1..13        150         | Cora         1..13        350
CORA     1..13        150         | PersonalBib  7            395
Experiments (II)
Evaluation
◦ Metrics: Precision, Recall and F-Measure
◦ T-Test for the statistical validation of the results
◦ Baselines: Conditional Random Fields (CRF): U-CRF (unsupervised method) and S-CRF (classical supervised method)
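For reference, the three metrics can be computed from sets of (segment, attribute) pairs as in the sketch below. Counting at the level of whole labeled segments is an assumption made here; the slides do not state the granularity used in the experiments, and the toy data is not taken from them.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (segment text, attribute label)


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over exactly matching labeled segments."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ("Regent Square", "Neighborhood"),
    ("$228,900", "Price"),
    ("1028", "No."),
    ("Mifflin Ave.", "Street"),
}
predicted = {
    ("Regent Square", "Street"),  # wrong label
    ("$228,900", "Price"),
    ("1028", "No."),
}
p, r, f = precision_recall_f1(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the four gold pairs are recovered, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.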
Experiments (III)
Extraction Quality
S-CRF achieves higher results than U-CRF due to the hand-labeled training
CORA includes a variety of styles and information (jconference, books)
In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models
Experiments (IV)
Extraction Quality
As discussed earlier, U-CRF is able to deal with different attribute orderings
Due to the Matching and Reinforcement strategies, ONDUX outperforms the CRF models
Conclusions and Future Work (I)
Partial results of our research on unsupervised strategies for information extraction
ONDUX
◦ Flexible: does not assume any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching phase
Conclusions and Future Work (II)
The proposed strategy achieves good precision and recall
◦ Comparison with the state of the art
Future work
◦ Investigate different matching functions
◦ Multi-record extraction
◦ Active learning and feedback
◦ Error detection
◦ Nested structures?
Questions?