Page 1: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Semiautomatic Generation of Resilient Data-Extraction Ontologies

Yihong Ding

Data Extraction Group

Brigham Young University

Sponsored by NSF

Page 2: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Wrapper-Driven Data Extraction

Web data extraction
– Obtain user-specified information from Web documents

Wrapper
– Convert implicit HTML data into explicit formatted data
– Data-source-specific, high performance

Examples
– SoftMealy, STALKER, WIEN, Omini, ROADRUNNER, …

Page 3: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Common Problem of Wrappers

Example source:

<LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I> and <I>Executive Officer for Computer Science</I>

[Figure: SoftMealy finite-state transducer for the example. State labels: b, U_U, N_N. Transition labels: ? / ε, ? / next_token, s<U,U> / ε, s<b,U> / "U=" + next_token, s<N,N> / ε, s<b,N> / "N=" + next_token, s<U,N> / "N=" + next_token.]

Resiliency
– Fixed domain
– Changeable layout

Scalability
– Unchanged existing wrapper
– Extendable domain and functions
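To illustrate why a layout-specific wrapper is not resilient, here is a minimal Python sketch that hard-codes the tag pattern of the example line. The regular expression, the `wrap` helper, and the changed layout are hypothetical; none of them is taken from SoftMealy or the other systems named in this talk.

```python
import re

# Hypothetical wrapper tied to one layout: name inside <A>, title inside <I>.
WRAPPER = re.compile(
    r'<A HREF="[^"]*">\s*(?P<name>[^<]+?)\s*</A>,\s*<I>(?P<title>[^<]+)</I>'
)

def wrap(html):
    """Return the extracted fields, or None when the layout does not match."""
    match = WRAPPER.search(html)
    return match.groupdict() if match else None

# Layout from the slide (the HREF value is elided in the source).
original_layout = '<LI> <A HREF="…"> Mani Chandy </A>, <I>Professor of Computer Science</I>'
# Hypothetical redesign of the same page: same data, different tags.
changed_layout = '<LI> <B>Mani Chandy</B>: Professor of Computer Science'

print(wrap(original_layout))
print(wrap(changed_layout))
```

The domain is unchanged in the second string, yet the wrapper returns nothing, which is the resiliency problem this slide names.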

Page 4: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Data-Extraction Ontology

Structure
– Object sets
– Relationship sets
– Participation constraints
– Data frames

Pros: resilient and scalable

Cons: hard to create
– Knowledge requirements
– Tedious and error-prone work

Car [-> object];

Car [0:1] has Make [1:*];
Make matches [10] constant
  { extract "\baudi\b"; };
end;

Car [0:1] has Model [1:*];
Model matches [25] constant
  { extract "80"; context "\baudi\S*\s*80\b"; };
end;

Car [0:1] has Mileage [1:*];
Mileage matches [8] constant
  { extract "\b[1-9]\d{0,2}k"; substitute "[kK]" -> "000"; };
end;

Car [0:1] has Price [1:*];
Price matches [8] constant
  { extract "[1-9]\d{3,6}"; context "\$[1-9]\d{3,6}"; };
end;
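A rough Python sketch of how the four data frames above might be applied to an ad. The regular expressions are copied from the slide. The dictionary layout, the case-insensitive matching, the reading of `context` as "search for the extract pattern only inside a context match", and the sample ad text are all assumptions made for illustration.

```python
import re

# Patterns transcribed from the slide; the dict structure is an assumption.
DATA_FRAMES = {
    "Make": {"extract": r"\baudi\b"},
    "Model": {"extract": r"80", "context": r"\baudi\S*\s*80\b"},
    "Mileage": {"extract": r"\b[1-9]\d{0,2}k", "substitute": (r"[kK]", "000")},
    "Price": {"extract": r"[1-9]\d{3,6}", "context": r"\$[1-9]\d{3,6}"},
}

def apply_data_frame(frame, text):
    """Return the values one data frame extracts from the text."""
    flags = re.IGNORECASE  # assumption: matching ignores case
    if "context" in frame:
        # Assumed semantics: extract only within each context match.
        regions = [m.group(0) for m in re.finditer(frame["context"], text, flags)]
    else:
        regions = [text]
    values = []
    for region in regions:
        for m in re.finditer(frame["extract"], region, flags):
            value = m.group(0)
            if "substitute" in frame:
                pattern, replacement = frame["substitute"]
                value = re.sub(pattern, replacement, value)
            values.append(value)
    return values

def extract_record(text):
    """Apply every data frame to one ad."""
    return {name: apply_data_frame(frame, text) for name, frame in DATA_FRAMES.items()}

sample_ad = "Audi 80, 95k miles, $4500, call evenings"  # hypothetical ad text
result = extract_record(sample_ad)
print(result)
```

Nothing in the sketch depends on page layout; only the value patterns and their textual context matter, which is the resiliency argument made for ontologies here.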

Page 5: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Motif of Ontology Generation

[Figure: diagram with the labels Human Brain, Concepts of Interest, Concepts with Relations, Data-Extraction Ontology, Knowledge Base, and Sample Documents.]

Page 6: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Thesis Statement

Given: a knowledge base

Input: sample Web pages of interest

Output: a data-extraction ontology for the domain of interest

The process that turns the input into the output is the work of this thesis.

Page 7: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram. Labels: Knowledge Sources, pre-processing, Integrated Knowledge Base, training documents, pre-processing, clean records, Concept Selection, Relation Retrieval, Constraint Discovery, Data-Extraction Ontology, interact if necessary, test documents, Extraction Processing, Result Evaluation, Results Storage.]

Page 8: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Primary Knowledge Source

Requirements
– Available
– General in coverage
– Rich in meaningful relationships
– Encoded in or easily converted to XML

Mikrokosmos (μK) Ontology
– Developed by NMSU jointly with the U.S. DoD
– Contains over 5000 concepts
– Averages 14 links per concept
– Represented in XML format

Page 9: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Integrated Knowledge Base

[Figure: the knowledge base shown with four components: Data-Frame Library, μK Ontology, Synonym Dictionary (WordNet), and Lexicons.]

Page 10: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 11: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Domain Specification

Training documents
– Data-rich
– Narrow in topic breadth

Preprocessing

Page 12: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Example – Car Advertisement

Record 1:

00 GrandAM SE, Sunfire Red, CD, AC, PW, PL Great Condition, $10,800, Call 798-3446

Record 2:

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250

Record 3:

02 Buick Century, lo mi, mint cond, $11,999. 373-4445 dlr# 2755

Record 4:

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

Page 13: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 14: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Concept Selection

Selection strategies
– Compare a string with the name of a concept
– Compare a string with the values belonging to a concept
– Apply data-frame recognizers to recognize a string

Example record:

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

[Figure: a string in the record is linked to the KB concept <PHONE-NR>.]
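The three selection strategies can be sketched in Python as follows. The miniature knowledge base, its concept entries, and the recognizer patterns are hypothetical stand-ins, not the actual knowledge base or data-frame library.

```python
import re

# Hypothetical miniature knowledge base; entries are illustrative only.
KB = {
    "AUTOMOBILE": {"names": {"automobile", "car"}, "values": {"buick"}, "recognizer": None},
    "CENTURY": {"names": {"century"}, "values": set(), "recognizer": None},
    "PHONE-NR": {"names": {"phone"}, "values": set(),
                 "recognizer": r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b"},
    "PRICE": {"names": {"price"}, "values": set(),
              "recognizer": r"\$\d{1,3}(?:,\d{3})*"},
}

def select_concepts(record):
    """Return each matched concept with the evidence that selected it."""
    tokens = re.findall(r"[a-z]+", record.lower())
    selected = {}
    for concept, entry in KB.items():
        evidence = []
        for token in tokens:
            if token in entry["names"]:      # strategy 1: concept name
                evidence.append(("name", token))
            if token in entry["values"]:     # strategy 2: concept value
                evidence.append(("value", token))
        if entry["recognizer"]:              # strategy 3: data-frame recognizer
            for m in re.finditer(entry["recognizer"], record):
                evidence.append(("data-frame", m.group(0)))
        if evidence:
            selected[concept] = evidence
    return selected

record = ("00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
          "To Apply By Phone, 1-877-228-9486, OREM Utah")
selected = select_concepts(record)
for concept, evidence in selected.items():
    print(concept, evidence)
```

Note that the model name "Century" also selects a CENTURY concept by name, which is the kind of polysemy conflict that has to be resolved afterwards.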

Page 15: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Concept Selection

Reasons for conflict
– Synonymy
– Polysemy

Conflict resolution
– The same string has only one meaning
– Favor longer matches over shorter ones
– Context decides meaning

Example record:

02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250.

[Figure: the KB offers both <PRICE> and <MILEAGE> for the same string; it is resolved to price by keyword identification.]
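A minimal Python sketch of the three resolution rules applied to the record above. The keyword lists, the 20-character context window, the candidate spans, and the extra YEAR candidate are assumptions for illustration; the slides do not specify how the rules are implemented.

```python
import re

# Hypothetical context keywords per concept.
CONTEXT_KEYWORDS = {
    "PRICE": {"retail", "price", "asking"},
    "MILEAGE": {"miles", "mileage", "mi"},
}

def context_score(text, candidate, window=20):
    """Count concept keywords that appear as words near the candidate span."""
    start, end, concept = candidate
    nearby = text[max(0, start - window):end + window].lower()
    words = set(re.findall(r"[a-z]+", nearby))
    return len(words & CONTEXT_KEYWORDS.get(concept, set()))

def resolve_conflicts(text, candidates):
    """candidates: (start, end, concept) spans. Returns non-overlapping winners."""
    # Longer spans first; ties broken by context support.
    ranked = sorted(
        candidates,
        key=lambda c: (c[1] - c[0], context_score(text, c)),
        reverse=True,
    )
    chosen = []
    for cand in ranked:
        # Keep a candidate only if it overlaps nothing already kept,
        # so each string ends up with exactly one meaning.
        if all(cand[1] <= kept[0] or kept[1] <= cand[0] for kept in chosen):
            chosen.append(cand)
    return sorted(chosen)

record = "02 Buick Century Custom, Pwr Seat, Nada Retail 13,695 221-1250."
start = record.index("13,695")
candidates = [
    (start, start + 6, "MILEAGE"),  # "13,695" read as mileage
    (start, start + 6, "PRICE"),    # "13,695" read as price
    (start, start + 2, "YEAR"),     # hypothetical shorter match on "13"
]
winners = resolve_conflicts(record, candidates)
print(winners)
```

The keyword "Retail" sits just before the number, so the PRICE reading wins over MILEAGE, and the shorter YEAR candidate is dropped because it overlaps the longer match.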

Page 16: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 17: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Relationship Retrieval

[Figure: selected concepts located in the KB: <AUTOMOBILE>, <PRICE>, <PHONE-NR>, <YEAR>, <CENTURY>, <MILEAGE>, <AUDIO-MEDIA-ARTIFACT>.]
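One plausible way to retrieve a relationship between two selected concepts is to search for a path of links between them in the knowledge base. The Python sketch below uses breadth-first search over a hypothetical KB fragment; the choice of breadth-first search, the `max_links` cutoff, and the graph contents are assumptions. The two paths in the fragment are modeled on relationship names that appear in the generated ontology later in this talk.

```python
from collections import deque

# Hypothetical KB fragment: concept -> list of (link name, target concept).
KB_LINKS = {
    "AUTOMOBILE": [("IsA", "ARTIFACT")],
    "ARTIFACT": [("CostOfProduction", "PRICE")],
    "PRICE": [("IsA", "SCALARATTRIBUTE")],
    "SCALARATTRIBUTE": [("MeasuredIn", "MEASURINGUNIT")],
    "MEASURINGUNIT": [("Subclasses", "YEAR")],
}

def find_relationship(source, target, max_links=3):
    """Return a dotted path name from source to target, or None if none is found."""
    queue = deque([(source, [], 0)])
    visited = {source}
    while queue:
        concept, labels, depth = queue.popleft()
        if depth == max_links:
            continue  # do not follow more than max_links links
        for link, neighbor in KB_LINKS.get(concept, []):
            if neighbor == target:
                return ".".join(labels + [link])
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, labels + [link, neighbor], depth + 1))
    return None

print(find_relationship("AUTOMOBILE", "PRICE"))
print(find_relationship("PRICE", "YEAR"))
print(find_relationship("AUTOMOBILE", "YEAR"))
```

A path can be structurally valid in the KB and still be semantically wrong for the domain, as the Price-to-Year path suggests.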

Page 18: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 19: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Constraint Discovery

[Figure: <AUTOMOBILE> and <PRICE> nodes, shown with two training records and the resulting constraint.]

02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755

00 Buick Century Stk# HU7159 Green $9,319, 714-2200 To Apply By Phone, 1-877-228-9486, OREM Utah

AUTOMOBILE [0:1] IsA.ARTIFACT.CostOfProduction PRICE [1:1]
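A hedged Python sketch of how a participation constraint such as PRICE [1:1] could be read off training records: count the occurrences of a concept in each record and take the minimum and maximum. The counting rule, the use of `*` for any maximum above one, and the recognizer patterns are assumptions; the slides do not state the actual discovery algorithm, and the [0:1] side for AUTOMOBILE is not modeled here.

```python
import re

def participation_constraint(records, recognizer):
    """Derive a [min:max] constraint from per-record occurrence counts."""
    counts = [len(re.findall(recognizer, record)) for record in records]
    low, high = min(counts), max(counts)
    # Assumption: any maximum above one is written as "*".
    return f"[{low}:{high if high <= 1 else '*'}]"

records = [
    "02 Buick Century, lo mi, mint cond, green, pwr seat, $11,999. 373-4445 dlr# 2755",
    "00 Buick Century Stk# HU7159 Green $9,319, 714-2200 "
    "To Apply By Phone, 1-877-228-9486, OREM Utah",
]

# Hypothetical recognizers for prices and phone numbers.
PRICE_PATTERN = r"\$\d{1,3}(?:,\d{3})*"
PHONE_PATTERN = r"\b(?:1-\d{3}-)?\d{3}-\d{4}\b"

price_constraint = participation_constraint(records, PRICE_PATTERN)
phone_constraint = participation_constraint(records, PHONE_PATTERN)
print("PRICE", price_constraint)
print("PHONE-NR", phone_constraint)
```

With only two records, a constraint like [1:1] is easy to overfit; a record with no price or with two prices would change it.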

Page 20: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 21: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology Generation

– concept nodes → object sets
– paths → relationship sets
– discovered constraints → participation constraints
– concept recognizers → data frames
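The mapping above can be sketched as a small Python renderer that turns concept nodes, path names, and discovered constraints into ontology declarations. The function names and the tuple layout are hypothetical, and data frames are left out for brevity.

```python
def render_relationship(source, source_constraint, name, target, target_constraint):
    """Render one relationship set with its participation constraints."""
    return f"{source} [{source_constraint}] {name} {target} [{target_constraint}];"

def render_ontology(primary_object, relationships):
    """Render the primary object set followed by its relationship sets."""
    lines = [f"{primary_object} [-> object];"]
    lines += [render_relationship(*rel) for rel in relationships]
    return "\n".join(lines)

# (source, source constraint, path name, target, target constraint)
relationships = [
    ("Automobile", "0:1", "has", "Mileage", "1:1"),
    ("Automobile", "0:1", "IsA.ARTIFACT.CostOfProduction", "Price", "1:1"),
]
ontology = render_ontology("Automobile", relationships)
print(ontology)
```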

Page 22: Semiautomatic Generation of Resilient Data-Extraction Ontologies


Automatically Generated Ontology -- Car Advertisement

(01) {Automobile [-> object];}

(02) {Automobile [0:1] has Mileage [1:1];}

(03) {Automobile [0:1] IsA.ARTIFACT.CostOfProduction Price [1:1];}

(12) {Price [1:1] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year [0:*];}

(20) {Automobile [0:1] relatesTo PhoneNr [1:*] relatesTo ArtifactPart [1:*] relatesTo Mileage [1:*] relatesTo Truck [1:*] relatesTo AudioMediaArtifact [1:*] relatesTo CommunicationDevice [1:*] relatesTo ControlEvent [1:*] relatesTo TravelEvent [1:*];}

Page 23: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Ontology-Generation Procedure

[Figure: procedure overview diagram, repeated from Page 7.]

Page 24: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Updating Strategies

Remove all bad relationship sets

Modify remaining incorrect relationship sets
– Substitute incorrect object sets
– Reduce long n-ary relationship sets
– Fix participation constraints

Adjust names or re-arrange sequences

Add new relationship sets

Page 25: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Final Ontology

Car [-> object]
Car [0:1] has Year [1:*]
Car [0:1] has Mileage [1:*]
Car [0:1] has Price [1:*]
PhoneNr [1:*] is for Car [0:1]
PhoneNr [0:1] has Extension [1:*]
Car [0:*] has Feature [1:*]
Car [0:1] has Make [1:*]
Car [0:1] has Model [1:*]

Page 26: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Evaluation Criteria

Basic measures
– POG (Precision of Ontology Generation)
– ROG (Recall of Ontology Generation)

Human constraints
– PROG (Pseudo-ROG)
– Comparison with an expert-created ontology

Knowledge base constraints
– EPROG (Effective-PROG)

Correctness dependency
– DEPROG (Dependent-EPROG)
– For example, relationship sets depend on object sets
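As a rough sketch of the two basic measures, the Python below assumes POG and ROG follow the standard definitions of precision and recall over sets of ontology components. The example sets are hypothetical. PROG, EPROG, and DEPROG are not modeled because the slides do not give their formulas.

```python
def precision_recall(generated, reference):
    """Standard precision and recall of a generated set against a reference set."""
    correct = generated & reference
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical object sets: generated by the system vs. listed by an expert.
generated = {"Car", "Year", "Mileage", "Price", "PhoneNr", "Truck"}
reference = {"Car", "Year", "Mileage", "Price", "PhoneNr",
             "Extension", "Feature", "Make", "Model"}

pog, rog = precision_recall(generated, reference)
print(f"POG = {pog:.2f}, ROG = {rog:.2f}")
```

Here five of the six generated object sets are correct, and five of the nine expert object sets are found.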

Page 27: Semiautomatic Generation of Resilient Data-Extraction Ontologies


Evaluation Results

Page 28: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Discussion of Results

Bottleneck: the system cannot generate what is not in the knowledge base

Object sets
– The concept-selection procedure works well
– Desired concept not shown in training records
  • A rarely occurring concept is not a severe problem, even if the error is not fixed
  • Example: extension
– Aggregation and union
  • USAddressCity, USAddressState, USAddressZipCode → Location
  • CropPlant, AnimalProduct, FruitFoodStuff → AgriculturalProduct
– Close-meaning concepts: FurniturePart → Furnished

Page 29: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Discussion of Results

Relationship sets
– Binary relationship sets: over 95%
– Most errors are due to incorrectly generated object sets
– Semantically incorrect relationship sets
  • Price IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT.Subclasses Year
– n-ary relationship sets (usually huge)

Participation constraints
– Errors due to a lack of training examples
– How much is enough?

Page 30: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Knowledge Base Extensibility

Add SALT, a new knowledge source

Successfully integrated into the existing KB

Sample new relationship set (DOE abstract domain)
– CrudeOil IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation

Page 31: Semiautomatic Generation of Resilient Data-Extraction Ontologies


Conclusion

Experimented with knowledge-base construction and extension

Standardized application domain specification

Generated data-extraction ontologies from a specified domain and an integrated knowledge base

Showed DEPROG results of more than 70% on average and over 90% for well-defined domains

Page 32: Semiautomatic Generation of Resilient Data-Extraction Ontologies

Future Work

Build a general-purpose knowledge source for data-extraction usage

Study more about data frames
– Can a system correctly identify concepts with data frames?
– Can a system update a data frame to fit a special situation?
– Can a system generate a data frame from a collection of information of interest?