Joint Unsupervised Structure Discovery and Information Extraction
  Eli Cortez, Daniel Oliveira, Altigran S. da Silva, Edleno S. de Moura
    Univ. Fed. do Amazonas (UFAM), Brazil
  Alberto H. F. Laender
    Univ. Fed. de Minas Gerais (UFMG), Brazil
  ACM SIGMOD Conference, Athens, Greece - June 2011
  Presented by Eli Cortez

The IETS Problem
  Information Extraction by Text Segmentation
  Goal: to extract attribute values occurring in implicit semi-structured data records
  Current IETS methods are able to accurately predict a sequence of labels to be assigned to a sequence of text segments corresponding to attribute values
    HMM: Borkar et al. (SIGMOD01)
    CRF: Lafferty et al. (ICML01)
    ONDUX: Cortez et al. (SIGMOD10)

Examples: Delimited Records
  Product Descriptions
    Apple iPad 2 Wi-Fi + 3G 64 GB - Apple iOS 4 1 GHz - Black $589
    LG - 32LE " LED-backlit LCD TV p (FullHD) - $400
    Samsung - UN55D " Class ( 54.6" viewable ) LED-backlit LCD... $2,048
    Mixter Max Accessory Plasma TV Rack Tilt Bracket 248-A05 $65
    HP Deskjet 3050 All-in-One Color Ink-jet - Printer / copier / scanner $50
  Bibliographic Citations
    L. Barbosa and J. Freire. Using Latent-structure to Detect In Proc. of the 13th WeDB, pages 16,
    A. Doan et. al. Information Extraction Challenges in Managing.. SIGMOD Record, 37(4):1420,
    J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann,
  Classified Ads
    $1106 / 2br - Luxury 2 BR, 1 BA apartment loaded with amenities - (Bothell)
    $1945 / 2br - Beautiful HighPoint Community "Built Green" 2 BR 2.5 Bth Town Home! - (West Seattle)
    $735 / 1br - Top floor 1 bedroom apt available just minutes from downtown!! - (Seattle,Burien,Highline)
    $820 / 1br - Lovely 1 bedroom 1k sq ft! Nearly a 2 bdrm! - (Federal Way,Edgewood,Milton, Tacoma)
    $895 / 2br - ****Lovely 2-Bedroom/2-Bathroom Condo with a View! FREE RENT!!!**** - (Monroe)

Example: Non-delimited Records
  Chocolate Cake Recipe
    1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce 2 cups all-purpose flour 1/4 cup cocoa powder 2 teaspoons baking soda 1/8 teaspoon salt 1 cup raisins 1/4 cup dark rum
  Extracted records
    Quantity  Unit         Ingredient
    1/2       cup          butter
    2                      eggs
    4         cups         white sugar
                           ground cinnamon
    2         tablespoons  dark rum
    6                      chopped pecans

Current IETS Methods
  Assume input records are already separated, e.g., manually by a user or using HTML-based heuristics
  Infeasible in fully automatic settings
  Input: 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce
  Output: the Quantity / Unit / Ingredient table shown above

JUDIE
  Structure Discovery + Information Extraction
  Jointly carried out in an unsupervised way
  Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc.
  Input: 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce
  Output: the Quantity / Unit / Ingredient table shown above

JUDIE: Joint Unsupervised Structure Discovery and Information Extraction
  Introduces a new Structure Discovery algorithm
    Detects the structure of each individual record being extracted without any user intervention
    Looks for frequent patterns of label repetitions, or cycles
  Integrates this algorithm into the IE process
    Accomplished by successive refinement steps that alternate information extraction and structure discovery

Related Work: IETS Approaches / Methods
  Probabilistic, supervised
    Hidden Markov Models (HMM): Borkar et al.
    Conditional Random Fields (CRF): Lafferty et al.
    Require training instances labeled on each input text
    Example: Regent Square $228, Mifflin Ave, 6 Bedrooms 2 Bathrooms
  Probabilistic, unsupervised
    Rely on previously built datasets
    Unsup. HMM (Agichtein et al., 04)
      Relies on records in reference tables
      Batches of fixed-order records as input
    Unsup. CRF (Zhao et al., ICDM08)
      Also relies on reference tables
      Batches of fixed-order records as input
    ONDUX (Cortez et al.)
      Knowledge base: sets of typical values per attribute, no records
  All of them require one input record at a time
  No structure discovery

JUDIE Overview
  Input: 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla
  1st IE Step, Structure-free Labeling:  Q U I Q U ? I U I Q U I Q I I I
  1st SD Step, Structure Sketching:      Q U I Q U ? I U I Q U I Q I I I
  2nd IE Step, Structure-aware Labeling: Q U I Q U U I U I Q U I Q I U I
  2nd SD Step, Structure Refinement:     Q U I Q U U I U I Q U I Q I Q I

JUDIE: Structure-free Labeling
  What is the best label for each segment? No structural information is available.
  Initially labels potential values with attribute names
  No information on the structure of the data records
  Resorts only to content-related features, learned from the pre-existing KB
  Content-related features considered: Value Format, Value Range, Attribute Vocabulary
  Figure: the candidate value "White sugar" is scored against the KB for the attribute Ingredient; the features are combined with a Bayesian Noisy-OR
  Input:  1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla
  Labels: Q U I Q U ? I U I Q U I Q I I I
  Limitations
    Label fault: Tbsp
    Misassignment: a little

JUDIE: Structure Sketching
  Organizes the labeled candidate values into records
  Induces a structure on the unstructured text input
  Outputs labeled values grouped into records
  Uses a novel algorithm called Structure Discovery (SD)
  Input:  1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla
  Labels: Q U I Q U ? I U I Q U I Q I I I

The SD Algorithm
  Uncovers the structure of implicit records from the input text
  Used in Structure Sketching and Structure Refinement
  Takes as input a sequence of labels and generates the structure of each record
  Assumption: it is possible to identify patterns of sequences by looking for cycles in a graph (the adjacency graph) that models the ordering of labels
  Example input label sequence:
    Title Conference Year Author Author Title Conference Year Author Title Conference Year Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year Author Title Conference Year Author Author Author Title Journal Issue Year
  Adjacency graph nodes: Author, Title, Journal, Issue, Conference, Year
  Exploits the occurrence of cycles in the adjacency graph, e.g.:
    [Author, Title, Conference, Year]
    [Author, Title, Journal, Issue, Year]
    [Title, Conference, Year]
  Figure labels: Coincident Cycles, Viable Cycle
  Dominant cycles
    Among the coincident cycles that are also viable, the dominant cycles are the most frequent ones in the input
    The algorithm works by first identifying all dominant cycles in the adjacency graph and then processing each of these cycles
    In the example above, the dominant cycles are:
      [Author, Title, Journal, Issue, Year]
      [Author, Title, Conference, Year]
      [Author, Journal, Issue, Year]
      [Title, Conference, Year]
      [Title, Year]

JUDIE: Structure-aware Labeling
  Now, what is the best label for each segment? Some structural information is already known.
  Re-labels segments considering both content-related features and structure-based features
  Structure-based features are learned using a graphical model (PSM) that represents the likelihood of attribute transitions within the input text
  Figure: "a little" is relabeled as Quantity
  Input:         1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla
  Labels before: Q U I Q U ? I U I Q U I Q I I I
  Labels after:  Q U I Q U U I U I Q U I Q I Q I

Positioning and Sequencing Model (PSM)
  Built from the Structure Sketching output
  States: attribute labels
  Likelihood of:
    the absolute position of labels within text segments
    the relative position considering other labels
  Figure: state diagram with states START, QUANTITY, UNIT, INGREDIENT, END and transition probabilities 5%, 95%, 80%, 20%, 90%, 10%, 100%
  Figure: PSM features and content-related features are combined with a Bayesian Noisy-OR

JUDIE: Structure Refinement
  Applies the SD algorithm again
  Considers the output of the structure-aware labeling
  Fixes structural problems
  Structure-aware labeling produces more precise results
  Input:  1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla
  Labels: Q U I Q U U I U I Q U I Q I Q I

JUDIE Overview
  Phase 1: Structure-free Labeling, Structure Sketching
  Phase 2: Structure-aware Labeling, Structure Refinement

Experiments
  Datasets previously used in other papers
  Only 3 of the domains are discussed in this presentation; more results are in the paper
  Domain            Dataset   Text Inputs  Attributes  Source         Source Attributes  Source Records
  Cooking Recipes   Recipes   500          3           FreeBase.com   3                  100
  Product Offers    Products  10000        3           Nhemu.com      3                  5000
  Postal Addresses  BigBook   2000         5           BigBook        5                  2000
  Bibliography      CORA      500          3 to 7      PersonalBib    7                  395
  Classified        WebAds    500          5 to 18     Folha On-line  18                 125

Metrics
  F-Measure: harmonic mean of precision and recall
  Attribute-Level: results considering the values of a single attribute in all output records
  Record-Level: results considering all attributes in a single record, averaged over the results of all records
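The F-measure and the record-level averaging described in the Metrics slide can be sketched as follows. This is only an illustration: the function names, the representation of a record as a set of (attribute, value) pairs, and the assumption that extracted and reference records are already aligned one to one are hypothetical and not taken from the presentation.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(extracted: set, reference: set) -> float:
    """F-measure of a set of extracted items against a reference set."""
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return f_measure(precision, recall)


def record_level_f(extracted_records, reference_records) -> float:
    """Average F-measure over records (assumed to be aligned one to one)."""
    scores = [score(e, r) for e, r in zip(extracted_records, reference_records)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: one record, two of three values extracted correctly.
reference = [{("Quantity", "1/2"), ("Unit", "cup"), ("Ingredient", "butter")}]
extracted = [{("Quantity", "1/2"), ("Unit", "cup"), ("Ingredient", "eggs")}]
example_f = record_level_f(extracted, reference)
```

An attribute-level score could reuse `score` by passing only the pairs of a single attribute collected from all output records.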
A t-test is used for the statistical validation of the results.

Evaluation: Attribute Level, Recipes
  High-quality results for all attributes even in Phase 1
  Structural information in Phase 2 led to gains above 5% on average
  Table columns: Attribute, Phase 1, Phase 2, Gain (%)
  Table rows: Quantity, Unit, Ingredient, Average

Evaluation: Attribute Level, CORA
  Title and Journal have a large term overlap
  Phase 2 was able to correct the mismatches from Phase 1
  Table columns: Attribute, Phase 1, Phase 2, Gain (%)
  Table rows: Author, Title, Booktitle, Journal, Volume, Pages, Date, Average

Evaluation: Attribute Level, Web Ads
  Input strings come from several websites
  Still, F = 0.84 on average
  The value range feature was useful for Phone, etc.
  Table columns: Attribute, Phase 1, Phase 2, Gain (%)
  Table rows: Bedroom, Living, Phone, Price, Kitchen, Bathroom, Others, Average

Evaluation: Record Level
  Phase 1: acceptable (F 0.7)
  Phase 2: positive impact (gains > 9%)
  In CORA, gains higher than 19%
  Structural information led to significant improvements
  Table columns: Dataset, Phase 1, Phase 2, Gain (%)
  Table rows: Recipes, CORA, Web Ads

Structure Diversity Impact
  How our method deals with a dataset that is heterogeneous in terms of structure
  In CORA, 33 distinct styles were identified
  Examples:
    L. Barbosa and J. Freire. Using Latent-structure to Detect In Proc. of the 13th WeDB, pages 16,
    A. Doan et. al. Information Extraction Challenges in Managing.. SIGMOD Record, 37(4):1420,
    J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann, 1988.
Structure Diversity Impact
  Perfect Labeling: all segments are correctly labeled

Comparison with Baselines: Attribute Level
  Results very close to ONDUX and even better than U-CRF
  Recall that JUDIE faces a harder task
  CORA table columns: Attribute, JUDIE, ONDUX, U-CRF
  CORA table rows: Author, Title, Booktitle, Journal, Volume, Pages, Date, Average
  Web Ads table columns: Attribute, JUDIE, ONDUX, U-CRF
  Web Ads table rows: Bedroom, Living, Phone, Price, Kitchen, Bathroom, Others, Average

Knowledge Base Impact
  Number of common terms between the KB and the input
  JUDIE is more dependent on the KB: the input does not contain structural information
  Achieves results comparable with the baselines on a considerably harder task

Conclusions
  Novel method for extracting semi-structured data records in the form of continuous text
  Detects the structure of the records being extracted
  Integrates information extraction and structure discovery
  Achieved good results in comparison with state-of-the-art methods while demanding less user effort
  Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc.

Conclusions
  Content-related / domain-dependent features
    Learned from a pre-existing KB on the domain
    Used for executing a structure-free labeling step
  Structure-related / source-dependent features
    Learned from the structure-free labeling over the input text
  Content-related features are used to induce structure-based features through successive refinement steps
  Thus, no manual training for each input is required

Future Work
  Develop methods for automatically generating knowledge bases
  Extend the SD algorithm to deal with nested structures

Acknowledgments
  UFMG

Thank you!
Summary: JUDIE x Previous IETS
  Requires                HMM  CRF  U-HMM  U-CRF  ONDUX  JUDIE
  Labeled Examples        Yes  Yes  No     No     No     No
  Fixed Order             No   No   Yes    Yes    No     No
  Previous Data           No   No   Yes    Yes    Yes    Yes
  Separate Input Records  Yes  Yes  Yes    Yes    Yes    No

Attribute Vocabulary

Value Range

Value Format (Style)
  First, a Markov model is generated for each attribute.
  Then the probability that the input mask sequence represents a path in the Markov model of each attribute is computed.
  Figure: Markov model from Start to End over masks such as [A-Z][a-z]+, [A-Z]. and [a-z][a-z]; the value "White sugar" maps to the mask sequence [A-Z][a-z]+ [a-z][a-z]+

Positioning and Sequencing Model

Combining Features
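The Value Format (Style) feature and the Noisy-OR combination of features named in these slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mask vocabulary in `token_mask`, the path probability computed as a product of transition probabilities, all function names, and the example KB values and the 0.5 score are hypothetical. The Noisy-OR itself is the standard form 1 - prod(1 - p_i).

```python
import re
from collections import defaultdict

START, END = "<start>", "<end>"


def token_mask(token: str) -> str:
    """Map a token to a coarse symbol mask (illustrative mask vocabulary)."""
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "[A-Z][a-z]+"
    if re.fullmatch(r"[a-z]+", token):
        return "[a-z]+"
    if re.fullmatch(r"[0-9]+(/[0-9]+)?", token):
        return "[0-9]+"
    return "other"


def mask_sequence(value: str) -> list:
    """Mask sequence of a value, wrapped in START and END states."""
    return [START] + [token_mask(t) for t in value.split()] + [END]


def train_format_model(values) -> dict:
    """Estimate mask-to-mask transition probabilities from KB values of one attribute."""
    counts = defaultdict(lambda: defaultdict(int))
    for value in values:
        seq = mask_sequence(value)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}


def format_probability(model: dict, value: str) -> float:
    """Probability that the value's mask sequence is a path in the model."""
    prob = 1.0
    seq = mask_sequence(value)
    for a, b in zip(seq, seq[1:]):
        prob *= model.get(a, {}).get(b, 0.0)
    return prob


def noisy_or(scores) -> float:
    """Combine independent feature scores in [0, 1] with a Noisy-OR."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical KB values for the attribute Ingredient.
model = train_format_model(["White sugar", "Cocoa powder", "butter", "Baking soda"])
p_dark_rum = format_probability(model, "Dark rum")   # START -> Cap -> lower -> END
p_fraction = format_probability(model, "1/2")        # unseen transition from START
combined = noisy_or([p_dark_rum, 0.5])               # 0.5 stands in for another feature
```

With these four KB values, three start with a capitalized token and one with a lowercase token, so "Dark rum" follows the most common path while "1/2" has no matching transition.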
