bootstrapping information extraction from semi-structured web pages

BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carson and Charles Schafer

Abstract
• A system that requires no human supervision
• Previous work:
  1. Required significant human effort
• Their solution:
  • Requires 2-5 annotated pages from 4-6 web sites to train the model
  • No human supervision for the target web site
• Result:
  • 83.8% and 91.1% for different sites.

Introduction
• Extracting structured records from detail pages of semi-structured web pages

Introduction
• Why semi-structured web pages?
  • Great sources of information
  • Attribute/value structure: useful for downstream learning or querying systems

Related Work
• Problems with previous work
  • No labeling of example pages, but manual labeling of the output
  • Irrelevant fields (20 data fields and 7 schema columns)
• DeLa system (automatically labels extracted data)
• Problems of labeling detected data fields
  • A data field does not have a label
  • Multiple fields of the same data type

Methods
• Terms:
  • Domain schema: a set of attributes
  • Schema column: a single attribute
  • Detail page: a page that corresponds to a single data record
  • Data field: a location within a template for that site
  • Data value: an instance of that data field
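The terms above can be pictured as a small data model. This is only an illustrative sketch: the class names, the `location` string, and the example site and values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DomainSchema:
    """A domain schema is a set of attributes (schema columns)."""
    name: str
    columns: list[str]


@dataclass
class DataField:
    """A location within one site's template.

    `location` is a hypothetical identifier (here an XPath-like string);
    the slides do not say how locations are represented.
    """
    site: str
    location: str
    # Data values: one instance of this field per detail page.
    values: list[str] = field(default_factory=list)


# Hypothetical example: one schema and one detected field on one site.
schema = DomainSchema("vacation_rentals", ["title", "price", "bedrooms"])
price_field = DataField(site="example-rentals.test",
                        location="/html/body/div[2]/span[1]")
price_field.values.extend(["$120", "$95"])  # values seen on two detail pages
```

The labeling task is then to decide which schema column, if any, each `DataField` on a new site corresponds to.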

Methods
• Detecting Data Fields
  • Partial Tree Alignment Algorithm

Methods
• Classifying Data Fields
  • Assign a score to each schema column
    • c: data values => data for training schema column
    • f: data fields => contexts from the training data
  • Compute the score:
    • Use a classifier to map data fields to schema columns
    • Use a model
  • K different feature types

Methods
• Feature Types
  • Precontext character 3-grams
  • Lowercase value tokens
  • Lowercase value character 3-grams
  • Value token types
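A minimal sketch of how the four feature types might be extracted from one data value and the text preceding it. The tokenizer and the coarse token-type categories (`NUM`, `ALPHA`, `ALNUM`, `PUNCT`) are assumptions made for illustration; the slides do not define them.

```python
import re


def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tokenize(text):
    """Assumed tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_type(token):
    """Assumed coarse token types; the actual type inventory is not given."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "ALPHA"
    if token.isalnum():
        return "ALNUM"
    return "PUNCT"


def extract_features(precontext, value):
    """One list of feature values per feature type."""
    lowered = value.lower()
    return {
        "precontext_char_3grams": char_ngrams(precontext),
        "lowercase_value_tokens": tokenize(lowered),
        "lowercase_value_char_3grams": char_ngrams(lowered),
        "value_token_types": [token_type(t) for t in tokenize(value)],
    }


features = extract_features("Price: ", "$120 USD")
```

Collecting these lists over all values of a data field gives one bag of feature values per feature type, which is what gets compared between sites.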

Methods
• Comparing Distributions of Feature Values
  • Advantage
    • Similar data values
    • Avoids over-fitting
      • with high-dimensional feature spaces
      • with a small number of training examples

Methods
• KL-Divergence

• Smoothed version

• Skew Similarity Score
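The formulas on this slide did not survive extraction. As a hedged sketch, the code below uses the standard KL divergence and the standard skew divergence, which smooths the second distribution by mixing in a little of the first so the result stays finite. Whether this matches the exact formulation on the slide is an assumption, as are the `alpha` value and the choice to negate the divergence to obtain a similarity.

```python
import math
from collections import Counter


def to_distribution(feature_values):
    """Relative frequencies of the observed feature values."""
    counts = Counter(feature_values)
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}


def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)).

    Requires q(x) > 0 wherever p(x) > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def skew_divergence(p, q, alpha=0.99):
    """D(p || alpha*q + (1 - alpha)*p), finite even if q misses values of p."""
    mixed = {x: alpha * q.get(x, 0.0) + (1 - alpha) * px for x, px in p.items()}
    return kl_divergence(p, mixed)


def skew_similarity(p, q, alpha=0.99):
    """Assumed similarity: the negated skew divergence (higher is more similar)."""
    return -skew_divergence(p, q, alpha)


p = to_distribution(["a", "a", "b"])
q = to_distribution(["a", "c"])
```

Plain KL divergence would be undefined for `p` and `q` here because `q` assigns zero probability to `"b"`; the mixing step is what avoids that.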

Methods
• Combining Skew Similarity Scores
  • Combine skew similarity scores for the different feature types using a linear regression model
  • Stacked classifier model
• Labeling the Target Site
  • Highest-scoring data field for each schema column c
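A rough sketch of the combination and labeling steps, assuming the per-feature-type scores are combined as a weighted sum and that each schema column is assigned the data field with the highest combined score. The weights, field names, and score values are hypothetical; fitting the regression and the stacked classifier are not shown.

```python
def combined_score(skew_scores, weights, bias=0.0):
    """Linear model: bias + sum over feature types of weight * skew score."""
    return bias + sum(weights[k] * s for k, s in skew_scores.items())


def label_target_site(scores_by_pair, weights):
    """For each schema column, pick the data field with the highest combined score.

    `scores_by_pair` maps (data_field, schema_column) to a dict of
    per-feature-type skew similarity scores.
    """
    best = {}
    for (data_field, column), skew_scores in scores_by_pair.items():
        score = combined_score(skew_scores, weights)
        if column not in best or score > best[column][1]:
            best[column] = (data_field, score)
    return {column: data_field for column, (data_field, _) in best.items()}


# Hypothetical weights (in practice these would be learned from training sites).
weights = {"value_tokens": 0.6, "value_3grams": 0.4}
scores = {
    ("f1", "price"): {"value_tokens": -0.2, "value_3grams": -0.1},
    ("f2", "price"): {"value_tokens": -1.5, "value_3grams": -0.9},
    ("f1", "title"): {"value_tokens": -2.0, "value_3grams": -1.2},
    ("f2", "title"): {"value_tokens": -0.3, "value_3grams": -0.4},
}
labels = label_target_site(scores, weights)
```

With these made-up numbers, `labels` maps `price` to `f1` and `title` to `f2`.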

Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
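One way to picture the evaluation loop, assuming the cross-validation holds out one annotated site at a time and trains on the rest. The slide only says "cross-validation", so the leave-one-site-out split and the accuracy definition below are assumptions.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, held_out_site) for every site in turn."""
    for i, held_out in enumerate(sites):
        yield sites[:i] + sites[i + 1:], held_out


def accuracy(predicted, gold):
    """Fraction of gold schema columns whose predicted data field is correct.

    Both arguments map schema column -> data field for one site.
    """
    if not gold:
        return 0.0
    correct = sum(1 for column, f in gold.items() if predicted.get(column) == f)
    return correct / len(gold)


# Hypothetical site names, for illustration only.
splits = list(leave_one_site_out(["site_a", "site_b", "site_c"]))
example_accuracy = accuracy({"price": "f1", "title": "f3"},
                            {"price": "f1", "title": "f2"})
```

Each split treats the held-out site as the unlabeled target, so no annotation from that site is used when labeling it.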

Results by Site

Results by Schema Column

Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%

Conclusion
