Towards Domain-Independent Information Extraction from Web Tables
![Page 1: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/1.jpg)
Towards Domain-Independent Information Extraction from Web Tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krupl, and Bernhard Pollak
Presented by Aaron Stewart
BYU CS 652
Table Extraction Using Spatial Reasoning in the CSS2 Visual Box Model
Database and Artificial Intelligence Group
Vienna University of Technology, Austria
Wolfgang Gatterbauer and Paul Bohunsky
![Page 2: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/2.jpg)
Contributions
1. Classify visually structured data
2. Non-tree IE formalism
3. Argue to defer semantic interpretation of output
4. Ground truthing method
5. Web table test set
6. Visual results
![Page 3: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/3.jpg)
Introduction
Source: Gatterbauer et al. 2007
![Page 4: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/4.jpg)
Visually Structured Data on the Web
• Tables
• Lists
• Aligned Graphs
![Page 5: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/5.jpg)
Visually Structured Data on the Web
Source: Gatterbauer et al. 2007
![Page 6: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/6.jpg)
Formal Setup
• DOM Tree Representation
• Visual Box Representation
  – Visualized Element Nodes (VENs)
    • DOM nodes with bounding boxes
  – Visualized Words
    • Text words with bounding boxes
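The visual box representation can be pictured with a small data model. This is only a sketch: the class and field names (`Box`, `VisualizedElementNode`, `VisualizedWord`, `contains`) are hypothetical and do not come from the paper, which only states that element nodes and words carry bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels; (left, top) is the upper-left corner."""
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class VisualizedElementNode:
    """A DOM element paired with the box the browser rendered for it (hypothetical type)."""
    tag: str
    box: Box


@dataclass(frozen=True)
class VisualizedWord:
    """A single text word paired with its rendered box (hypothetical type)."""
    text: str
    box: Box


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies inside `outer` or exactly on its edges."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


cell = VisualizedElementNode("td", Box(10, 10, 110, 30))
word = VisualizedWord("Price", Box(14, 12, 50, 28))
print(contains(cell.box, word.box))
```

Working on boxes rather than on tree position is what lets later steps reason about layout independently of how the HTML happens to be nested.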
![Page 7: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/7.jpg)
Formal Setup
Source: Gatterbauer et al. 2007
![Page 8: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/8.jpg)
Information Extraction
• Visualized Element Nodes Table extraction (VENTex)
• Steps:
  – Table location
  – Table recognition
  – Table interpretation
![Page 9: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/9.jpg)
Information Extraction
Source: Gatterbauer et al. 2007
![Page 10: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/10.jpg)
Table Extraction
Source: Gatterbauer et al. 2007
![Page 11: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/11.jpg)
Table Extraction
1. Gather 8 HTML node attributes
2. For text, add link
3. Only accept TH, TD, DIV HTML nodes
4. Tables must form frames
5. Remove duplicate bounding boxes
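Steps 3 and 5 above amount to a filter over the rendered nodes. The following is a minimal sketch, not the authors' implementation: the function name `candidate_cells`, the tuple layout, and the choice to keep the first node seen for a given box are all assumptions made for illustration.

```python
# Tags accepted as potential table cells, per step 3 of the slide.
ACCEPTED_TAGS = {"th", "td", "div"}


def candidate_cells(nodes):
    """Filter rendered nodes down to candidate cells.

    nodes: iterable of (tag, (left, top, right, bottom)) tuples.
    Nodes with other tags are dropped, and when several nodes share an
    identical bounding box only the first one encountered is kept
    (an assumption; the slide does not say which duplicate survives).
    """
    seen_boxes = set()
    kept = []
    for tag, box in nodes:
        tag = tag.lower()
        if tag not in ACCEPTED_TAGS:
            continue
        if box in seen_boxes:
            continue
        seen_boxes.add(box)
        kept.append((tag, box))
    return kept


nodes = [
    ("TD", (0, 0, 50, 20)),
    ("DIV", (0, 0, 50, 20)),   # same box as the TD above: duplicate
    ("SPAN", (0, 0, 10, 10)),  # tag not accepted
    ("th", (50, 0, 100, 20)),
]
cells = candidate_cells(nodes)
```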
![Page 12: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/12.jpg)
Table Extraction
6. Adjacency: 3 pixels
7. LOCATEFRAMES algorithm
8. No overlapping cells
9. Minimum 3 rows, 2 columns
10. Remove empty rows/columns (spacers)
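The numeric constraints in steps 6, 9 and 10 can be sketched as small checks. The helper names below are hypothetical, the grid is assumed to be a rectangular list of rows of strings, and reading "3 pixels" as "a gap of at most 3 pixels" is an assumption; the slide also does not say whether the size check runs before or after spacers are removed.

```python
ADJACENCY_TOLERANCE_PX = 3   # step 6
MIN_ROWS, MIN_COLS = 3, 2    # step 9


def horizontally_adjacent(left_box, right_box, tol=ADJACENCY_TOLERANCE_PX):
    """Boxes are (left, top, right, bottom) tuples.

    True when the right edge of left_box and the left edge of right_box
    are at most `tol` pixels apart.
    """
    return abs(right_box[0] - left_box[2]) <= tol


def drop_spacers(grid):
    """Remove rows and columns in which every cell is empty or whitespace (step 10)."""
    rows = [row for row in grid if any(cell.strip() for cell in row)]
    if not rows:
        return []
    keep_cols = [c for c in range(len(rows[0]))
                 if any(row[c].strip() for row in rows)]
    return [[row[c] for c in keep_cols] for row in rows]


def is_large_enough(grid):
    """True when the grid has at least MIN_ROWS rows and MIN_COLS columns."""
    return len(grid) >= MIN_ROWS and len(grid[0]) >= MIN_COLS


raw = [
    ["Name", "", "Price"],
    ["", "", ""],
    ["Apple", "", "1.20"],
    ["Pear", "", "0.90"],
]
table = drop_spacers(raw)
```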
![Page 13: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/13.jpg)
LOCATE FRAMES Algorithm (earlier paper)
• Visual table model
• Expansion algorithm
![Page 14: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/14.jpg)
Visual Table Model
Source: Gatterbauer et al. 2007
![Page 15: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/15.jpg)
Double Topographical Grid???
• Two origins
  – Upper left corner
  – Lower right corner
• Sorted lists of pixel positions
  – The numbers are indices
  – But pixels remain in regular coordinates
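One possible reading of the bullets above, offered as a guess rather than as the paper's definition: the distinct pixel coordinates of all upper-left corners and of all lower-right corners are each kept in sorted lists, and every box is addressed by its positions (indices) in those lists while its pixel coordinates stay unchanged. The function `build_grid_index` and its return layout are hypothetical.

```python
def build_grid_index(boxes):
    """Index boxes by the rank of their corner coordinates.

    boxes: list of (left, top, right, bottom) tuples in page pixels.
    Returns (axes, index): `axes` holds the four sorted coordinate lists,
    and `index` maps each box to the list positions of its upper-left
    and lower-right corners.
    """
    ul_xs = sorted({b[0] for b in boxes})
    ul_ys = sorted({b[1] for b in boxes})
    lr_xs = sorted({b[2] for b in boxes})
    lr_ys = sorted({b[3] for b in boxes})

    # Map pixel coordinate -> position in its sorted list.
    ul_x_pos = {x: i for i, x in enumerate(ul_xs)}
    ul_y_pos = {y: i for i, y in enumerate(ul_ys)}
    lr_x_pos = {x: i for i, x in enumerate(lr_xs)}
    lr_y_pos = {y: i for i, y in enumerate(lr_ys)}

    index = {
        b: {
            "upper_left": (ul_x_pos[b[0]], ul_y_pos[b[1]]),
            "lower_right": (lr_x_pos[b[2]], lr_y_pos[b[3]]),
        }
        for b in boxes
    }
    return (ul_xs, ul_ys, lr_xs, lr_ys), index


# A 2x2 arrangement of cells with unequal column widths.
boxes = [(0, 0, 50, 20), (50, 0, 120, 20), (0, 20, 50, 40), (50, 20, 120, 40)]
axes, index = build_grid_index(boxes)
```

Under this reading, boxes that share an index along one axis are aligned on that edge, so alignment tests become integer comparisons instead of pixel arithmetic.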
![Page 16: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/16.jpg)
Neighbor Relations
Source: Gatterbauer et al. 2007
![Page 17: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/17.jpg)
Neighbor Relations
• Expand to include neighbors 1, 2, 3, 4
  – Within or equal
  – Not bigger
  – Not outside
  – Not stepped
![Page 18: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/18.jpg)
Expansion Algorithm
Source: Gatterbauer et al. 2007
![Page 19: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/19.jpg)
Basic Algorithm
• http://www.dbai.tuwien.ac.at/staff/gatter/work/AAAI_2006_Presentation_Table_Extraction_Spatial_Reasoning.pdf
![Page 20: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/20.jpg)
Table Interpretation
• Argument
  – Few details about the method actually used
  – Take data as it comes
  – Pass it on to a later semantic processing stage
![Page 21: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/21.jpg)
Table Interpretation
Source: Gatterbauer et al. 2007
![Page 22: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/22.jpg)
Performance
• Load + render: O(n)
• Double topographical grid: O(n sqrt(n))
• About 5 seconds per page
![Page 23: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/23.jpg)
Web Table Ground Truthing
• Tool to copy web pages
  – (not easy!)
  – http://www.dbai.tuwien.ac.at/user/pollak/webpagedump
• Students selected and submitted pages
  – 493 web tables
  – 269 web pages
  – 63 students
  – http://www.dbai.tuwien.ac.at/staff/gatter/ventex/
![Page 24: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/24.jpg)
Experimental Results
Source: Gatterbauer et al. 2007
![Page 25: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/25.jpg)
Future Work
• Table extraction
• Table interpretation
• Nested substructures
• Other visually structured data
• Information integration
Source: Gatterbauer et al. 2007
![Page 26: Towards Domain-Independent Information Extraction from Web Tables](https://reader036.vdocuments.net/reader036/viewer/2022062812/56816456550346895dd62353/html5/thumbnails/26.jpg)
My Conclusions
• Useful table-building algorithm
  – For electronic data only
  – Requires strict alignment
• Could be expanded
  – Other electronic formats (PDF, even ASCII text)
  – Probabilistic model for jitter