Page 1: Conceptual-Model-Based Web Data Extraction by Example

Conceptual-Model-Based Web Data Extraction by Example

Yuanqiu (Joe) Zhou

Data Extraction Group

Brigham Young University

Sponsored by NSF

Page 2: Conceptual-Model-Based Web Data Extraction by Example

Motivation

Data-rich Websites in abundance

Conceptual-Model-Based Methodology is resilient

“By Example” approach is user-friendly

Page 3: Conceptual-Model-Based Web Data Extraction by Example

“By Example” Approach

Web users specify desired information by creating a form

Users collect sample pages on the Web

An ontology generator learns the task by analyzing the form and the sample pages

Interactions may be needed to improve or complete the ontology

Page 4: Conceptual-Model-Based Web Data Extraction by Example

Architecture

Inputs to the Ontology Generator: Data Frame Libraries, User Created Form (GUI), Sample Pages

Ontology Generator -> Extraction Ontology

Extraction Engine (Extraction Ontology + Target Pages) -> Populated Database

Page 5: Conceptual-Model-Based Web Data Extraction by Example

Sample Web Page and User Created Form

Form: Digital Camera

Brand: Canon
Model: PowerShot G2
CCD Resolution: 4.0
Image Resolution: 2272 x 1074
Optical Zoom: 3
Digital Zoom: 2
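As a rough illustration of how a form could seed an ontology, the sketch below turns the form labels into relationship-set declarations in the notation used on the following slides. The function names (`to_identifier`, `relationship_sets`) are hypothetical, and the sketch covers only the naming step; the generator described in the talk also analyzes the sample pages.

```python
def to_identifier(label: str) -> str:
    """Turn a form label such as 'CCD Resolution' into 'CCDResolution'."""
    return "".join(label.split())


def relationship_sets(primary: str, fields: list[str]) -> list[str]:
    """Emit one declaration for the primary object and one per form field.

    The [0:1] and [1:*] constraints are copied from the slides; a real
    generator might choose them differently per field (assumption).
    """
    obj = to_identifier(primary)
    lines = [f"{obj} [-> object];"]
    for field in fields:
        lines.append(f"{obj} [0:1] has {to_identifier(field)} [1:*];")
    return lines


form_fields = ["Brand", "Model", "CCD Resolution", "Image Resolution",
               "Optical Zoom", "Digital Zoom"]
declarations = relationship_sets("Digital Camera", form_fields)
print("\n".join(declarations))
```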

Page 6: Conceptual-Model-Based Web Data Extraction by Example

Extraction Ontology

Relationship Set and Constraints

Extraction Patterns

Keywords

Context Expressions

Page 7: Conceptual-Model-Based Web Data Extraction by Example

Relationship Set and Constraints

Primary Object Name: DigitalCamera

Other Objects’ Names: Brand, Model, CCDResolution, ImageResolution, OpticalZoom, DigitalZoom

Participation Constraints: the bracketed ranges, e.g. [0:1] and [1:*]

DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
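To make the declaration syntax concrete, here is a minimal sketch that parses one relationship-set line into its two object names and their min:max participation constraints. The regular expression and the `parse` helper are assumptions written for this illustration, not the group's parser, and the sketch does not interpret which side each constraint governs.

```python
import re

# Matches lines of the form: Subject [min:max] has Object [min:max];
DECL = re.compile(
    r"(?P<subject>\w+) \[(?P<smin>\d+):(?P<smax>\d+|\*)\] has "
    r"(?P<object>\w+) \[(?P<omin>\d+):(?P<omax>\d+|\*)\];"
)


def parse(line: str):
    """Return the parts of a relationship-set declaration, or None."""
    m = DECL.fullmatch(line.strip())
    if m is None:
        return None  # e.g. the primary-object line "DigitalCamera [-> object];"
    return {
        "subject": m["subject"],
        "object": m["object"],
        "subject_card": (m["smin"], m["smax"]),
        "object_card": (m["omin"], m["omax"]),
    }


parsed = parse("DigitalCamera [0:1] has Brand [1:*];")
print(parsed)
```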

Page 11: Conceptual-Model-Based Web Data Extraction by Example

Extraction Patterns

From Data Frame Libraries

Data Frame Libraries:
- Lexicons
- Synonym Dictionary
- Regular Expressions

Extraction Pattern:
- Lexicons for Brand and Model
- Regular Expressions for numbers and Image resolution

Page 12: Conceptual-Model-Based Web Data Extraction by Example

Extraction Patterns: Data Frame Libraries

CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";

Sample text:

Features a high-quality 4.0 Megapixel Resolution CCD

The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD

3 effective megapixel
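The extract expression can be tried directly against the three sample strings. The sketch below uses Python's `re` module as a stand-in for the extraction engine's matcher, which is an assumption; the helper name `candidates` is hypothetical.

```python
import re

# The extract pattern from the CCDResolution rule, copied verbatim.
CCD_EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")

samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]


def candidates(text: str) -> list[str]:
    """Return every substring the extract pattern matches."""
    return [m.group(0) for m in CCD_EXTRACT.finditer(text)]


for text in samples:
    print(candidates(text), "<-", text)
```

Note that the model number 995 is not a candidate: the pattern allows a single digit before the optional fraction, and the word boundaries stop it from matching inside a longer number.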

Page 15: Conceptual-Model-Based Web Data Extraction by Example

Keywords

CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";

Sample text:

Features a high-quality 4.0 Megapixel Resolution CCD

The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD

3 effective megapixel
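One plausible way keywords help is by filtering numeric candidates to those that appear near a keyword. The slides do not say how the engine combines the two, so the proximity rule, the `WINDOW` value, case-insensitive keyword matching, and the helper name `ccd_resolution` below are all assumptions made for illustration.

```python
import re

EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [re.compile(p, re.IGNORECASE)
            for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")]
WINDOW = 25  # hypothetical: max characters between a candidate and a keyword


def ccd_resolution(text: str):
    """Return the numeric candidate closest to any keyword, or None."""
    keyword_spans = [m.span() for kw in KEYWORDS for m in kw.finditer(text)]
    best = None  # (gap, value)
    for m in EXTRACT.finditer(text):
        for ks, ke in keyword_spans:
            # Characters separating the candidate from this keyword.
            gap = ks - m.end() if ks >= m.end() else m.start() - ke
            if 0 <= gap <= WINDOW and (best is None or gap < best[0]):
                best = (gap, m.group(0))
    return best[1] if best else None


text = "Includes 2 batteries. Features a high-quality 4.0 Megapixel Resolution CCD"
print(ccd_resolution(text))
```

In this made-up sentence the pattern alone yields two candidates, "2" and "4.0"; only "4.0" sits within the window of a keyword.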

Page 16: Conceptual-Model-Based Web Data Extraction by Example

Context Expressions

OpticalZoom matches [10]
    constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
    keyword "\boptical\b";

Sample text:

3.5x optical zoom (2.5x digital)

a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom

optical 3X /digital 6X zoom
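A context expression locates a larger span (here, a number followed by "x"), and the extract expression then pulls the value out of that span. The sketch below applies the OpticalZoom rule to the three sample strings. Choosing the context match nearest the keyword and matching case-insensitively are assumptions, as is the helper name `optical_zoom`.

```python
import re

CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)
EXTRACT = re.compile(r"\b\d(\.\d)?")
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)


def optical_zoom(text: str):
    """Extract the zoom factor whose context span lies nearest the keyword."""
    kw = KEYWORD.search(text)
    contexts = list(CONTEXT.finditer(text))
    if kw is None or not contexts:
        return None

    def distance(m):
        # Gap between the context span and the keyword, on either side.
        return min(abs(m.start() - kw.end()), abs(kw.start() - m.end()))

    nearest = min(contexts, key=distance)
    # The extract pattern is applied inside the context span only.
    return EXTRACT.match(nearest.group(0)).group(0)


for sample in ("3.5x optical zoom (2.5x digital)",
               "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
               "optical 3X /digital 6X zoom"):
    print(optical_zoom(sample), "<-", sample)
```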

Page 17: Conceptual-Model-Based Web Data Extraction by Example

Extraction Ontology

DigitalCamera [-> object];

DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
    constant
        { extract "\bNikon\b"; },
        { extract "\bCanon\b"; },
        { extract "\bOlympus\b"; },
        { extract "\bMinolta\b"; },
        { extract "\bSony\b"; };
end;

DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;

DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
    constant
        { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
        { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
    keyword "\bResolution\b", "\bImage\b";
end;

DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
    constant { extract "\b\d"; context "\b\d(x)\b"; };
    keyword "\boptical\b";
end;
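As a rough picture of what an extraction engine does with such an ontology, the sketch below stores two of the object sets as lists of extract patterns and fills a record from a page. The dictionary layout, the first-match policy, the function name `extract`, and the sample page text (assembled from the form values shown earlier) are assumptions for illustration only.

```python
import re

# Extract patterns copied from the Brand and ImageResolution rules.
ONTOLOGY = {
    "Brand": [r"\bNikon\b", r"\bCanon\b", r"\bOlympus\b",
              r"\bMinolta\b", r"\bSony\b"],
    "ImageResolution": [r"\b\d{4}(\s)?(x)(\s)?\d{4}\b"],
}


def extract(text: str) -> dict[str, str]:
    """Fill each object set with the first value any of its patterns finds."""
    record = {}
    for name, patterns in ONTOLOGY.items():
        for pattern in patterns:
            m = re.search(pattern, text)
            if m:
                record[name] = m.group(0)
                break
    return record


page = "Canon PowerShot G2, image resolution 2272 x 1074"  # hypothetical page text
record = extract(page)
print(record)
```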

Page 19: Conceptual-Model-Based Web Data Extraction by Example

Extraction Ontology

DigitalCamera [-> object];

DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
    constant
        { extract "\bNikon\b"; },
        { extract "\bCanon\b"; },
        { extract "\bOlympus\b"; },
        { extract "\bMinolta\b"; },
        { extract "\bSony\b"; };
end;

DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;

DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
    constant
        { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
        { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
    keyword "\bResolution\b", "\bImage\b";
end;

DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
    constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
    keyword "\boptical\b";
end;
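The OpticalZoom rule in this version of the ontology allows a fractional part, unlike the earlier version. The sketch below compares the two on one of the sample strings. It uses Python's `re` as a stand-in for the engine, writes the refined extract pattern with the optional fraction as on the Context Expressions slide, and the names `zoom`, `INITIAL`, and `REFINED` are hypothetical.

```python
import re


def zoom(text: str, extract: str, context: str):
    """Find the context span, then apply the extract pattern inside it."""
    ctx = re.search(context, text, re.IGNORECASE)
    if ctx is None:
        return None
    value = re.search(extract, ctx.group(0))
    return value.group(0) if value else None


# (extract, context) pairs for the earlier and the refined OpticalZoom rule.
INITIAL = (r"\b\d", r"\b\d(x)\b")
REFINED = (r"\b\d(\.\d)?", r"\b\d(\.\d)?(x)\b")

text = "3.5x optical zoom (2.5x digital)"
initial_value = zoom(text, *INITIAL)
refined_value = zoom(text, *REFINED)
print(initial_value, refined_value)
```

With the earlier patterns the context expression latches onto "5x" inside "3.5x" and the extracted value is wrong; the refined patterns recover the full factor. Both versions agree on whole-number factors such as "4x".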

Page 21: Conceptual-Model-Based Web Data Extraction by Example

Results (Same Site)

Page 22: Conceptual-Model-Based Web Data Extraction by Example

Results (Different Site)

Page 23: Conceptual-Model-Based Web Data Extraction by Example

Summary and Future Work

The example indicates that the approach is feasible

Some open questions need to be explored