Wrapping Semistructured Web Pages with Finite-State Transducers

Chun-Nan Hsu and Ming-Tzung Dung

Department of Computer Science & Engineering

Arizona State University

Tempe, AZ, USA

Information Integration Systems need wrappers

[Figure: architecture of an information integration service. Heterogeneous data sources (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases) hold unprocessed, unintegrated details. Wrappers (e.g., SQL and ORB wrappers) perform translation and wrapping. Mediators perform semantic integration, mediation, and agent/module coordination, and deliver abstracted information to human and computer users. User services: query, monitor, update.]

Web wrappers

Web wrappers wrap...

• “Query-able” or “search-able” Web sites

• Web pages with large itemized lists

The primary issues are:

• How can the contents of a Web page be translated (or extracted) into machine-understandable data?

• How can the extractor be built quickly? Can it be learned?

Free Text Extraction vs. Semistructured Text Extraction

Example: extract the attributes job title, employer, and phone number from a job item

Free text extraction can depend on natural-language (NL) knowledge

• “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.”

Semistructured text extraction depends on appearance and regularity

• “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”

Wrapper representations in previous work

Shopbot (Doorenbos, Etzioni, Weld, AA-97), Ariadne (Ashish, Knoblock, Coopis-97), WIEN (Kushmerick, Weld, IJCAI-97)…

Delimiter-based, linear finite-state transducers

For i = 1 to k

    skip through the input string until the delimiter at the beginning of attribute Ai is located

    extract Ai until the delimiter at the end of attribute Ai is located

[Figure: linear finite-state transducer that alternates skip and extract steps through attributes A1, A2, A3, A4.]
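The loop above can be sketched in Python as follows. This is a minimal illustration, not the code of Shopbot, Ariadne, or WIEN; the function name `extract_linear`, the delimiter pairs, and the simplified sample item are assumptions made for the example.

```python
def extract_linear(text, delimiters):
    """Extract attributes A1..Ak in one fixed order.

    delimiters is a list of (begin, end) string pairs, one per attribute
    (a hypothetical encoding chosen for this sketch).
    Returns the extracted strings, or None if any delimiter is not found.
    """
    values = []
    pos = 0
    for begin, end in delimiters:
        # Skip through the input until the delimiter that begins this attribute.
        start = text.find(begin, pos)
        if start == -1:
            return None
        start += len(begin)
        # Extract until the delimiter that ends this attribute.
        stop = text.find(end, start)
        if stop == -1:
            return None
        values.append(text[start:stop])
        pos = stop
    return values


item = '<LI><A HREF="mani.html">Mani Chandy</A>, <I>Professor of Computer Science</I>'
delims = [('HREF="', '"'), ('>', '</A>'), ('<I>', '</I>')]
print(extract_linear(item, delims))

# An item with a missing attribute defeats the fixed-order loop.
print(extract_linear('<LI><A HREF="x.html">X Y</A>', delims))
```

The second call returns None because the third delimiter pair never matches, which is the kind of failure a single fixed attribute order runs into.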

Situations where previous work fails

Missing attributes

• e.g., a faculty member may not have an administrative title

Multiple attribute values

• e.g., a faculty member may have two administrative titles

Variant attribute permutations

• e.g., (U,N,A,M), (U,N,M,A)…

Exceptions and typos

Why does previous work fail?

One-attribute-permutation assumption

The use of delimiters

• prevents the wrapper from recognizing different attribute permutations in many cases

• How can the state and zip code be extracted from “CA90210”? In such cases there are no delimiters at all.

Example

<LI><A HREF="mani.html">

Mani Chandy</A>, <I>Professor of Computer Science</I> and

<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">

David E. Breen</A>, <I>Assistant Director of Computer Graphics Laboratory</I>

[Figure: the two example items annotated with attribute labels. U = URL, N = Name, A = Academic title, M = Admin title. The first item carries U, N, A, M; the second carries U, N, M.]

SoftMealy wrapper representation

Key features:

Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path (NEW)

Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes (NEW)

Advantages of SoftMealy wrapper representation

Expressive enough to tolerate Web pages with the four troubles:

• missing attributes

• multiple attribute values

• variant attribute permutations

• exceptions and typos

Polynomially learnable

Retains extraction efficiency

Basic building blocks of SoftMealy

Token: segment of the input string

• e.g., HTML tags, punctuation symbols, words

Separator: invisible border line between two tokens

Dummy attribute: substring we want to skip; if it follows attribute k, it is denoted -k

Contextual rules: characterize the context of a class of separators that separate two adjacent attributes (including dummy attributes)

• Consist of a left context and a right context

• Can be disjunctive
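A rough sketch of tokens and separators in Python is shown here. The regular expression and the helper names `tokenize` and `separators` are assumptions for illustration; the slides do not specify the actual tokenizer, which may split HTML tags more finely.

```python
import re

# Assumed token classes: whole HTML tags, words, single punctuation symbols.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def tokenize(text):
    """Split an input string into tokens."""
    return TOKEN_RE.findall(text)


def separators(tokens):
    """List every interior separator as (index, left token, right token).

    Separator i is the invisible border just before tokens[i].
    """
    return [(i, tokens[i - 1], tokens[i]) for i in range(1, len(tokens))]


tokens = tokenize("Mani Chandy</A>, <I>Professor")
print(tokens)
for sep in separators(tokens):
    print(sep)
```

Most of the separators listed this way lie inside an attribute or inside skipped text; only the few that border an attribute matter for extraction.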

Example of tokens and separators

<LI><A HREF="mani.html">

Mani Chandy</A>, <I>Professor of Computer Science</I> and

<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">

David E. Breen</A>, <I>Assistant Director of Computer Graphics Laboratory</I>

[Figure: the separators between tokens in the example above, each marked as a useless separator or a useful separator.]

Example of a contextual rule

<LI><A HREF="mani.html">

Mani Chandy</A>, <I>Professor of Computer Science</I> and

<I>Executive Officer for Computer Science</I>

<LI><A HREF="david.html">

David E. Breen</A>, <I>Assistant Director of Computer Graphics Laboratory</I>

Contextual rule -N, A

left: “</A>, <I>” or “,<I>”

right: any word token with an initial capital
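One way to test a separator against the rule above is sketched below. The token list, the encoding of the left context as alternative token sequences, and the names `matches_rule` and `is_capitalized_word` are assumptions for this example, not the SoftMealy implementation.

```python
def matches_rule(tokens, i, left_alternatives, right_test):
    """Return True if the separator just before tokens[i] satisfies the rule.

    left_alternatives is a disjunction: the tokens immediately to the left
    must end with at least one of the listed token sequences.
    right_test is a predicate applied to the token immediately to the right.
    """
    left_ok = any(
        len(alt) <= i and tokens[i - len(alt):i] == alt
        for alt in left_alternatives
    )
    right_ok = i < len(tokens) and right_test(tokens[i])
    return left_ok and right_ok


def is_capitalized_word(token):
    """True for an alphabetic token whose first letter is upper case."""
    return token.isalpha() and token[0].isupper()


tokens = ["Mani", "Chandy", "</A>", ",", "<I>",
          "Professor", "of", "Computer", "Science", "</I>"]
left = [["</A>", ",", "<I>"], [",", "<I>"]]

hits = [i for i in range(len(tokens) + 1)
        if matches_rule(tokens, i, left, is_capitalized_word)]
print(hits)  # [5]: the border between "<I>" and "Professor"
```

Only one separator in the item satisfies both contexts, so the rule picks out the border between the skipped text after the name and the academic title.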

Finite-state transducer

Input: separator instances

Output: strings

States: initial state b, final state e, and one state for each attribute and each dummy attribute

Edges: (i, r, o, j) is a state transition from i to j, taken when the input separator instance satisfies contextual rule r, with output string o

• o = empty when we want to skip

• o = the next token when we want to extract

• i and j cannot both be dummy attributes
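A small transducer built from (i, r, o, j) edges might be run as in the following sketch. It assumes that, when no edge fires at a separator, the machine stays in its current state and keeps extracting or skipping (an implicit self-loop); the rule helpers `left_is`, `right_is`, and `at_end`, the simplified token list, and the edge set are all hypothetical.

```python
def left_is(tok):
    """Rule: the token left of the separator equals tok."""
    return lambda tokens, pos: pos > 0 and tokens[pos - 1] == tok


def right_is(tok):
    """Rule: the token right of the separator equals tok."""
    return lambda tokens, pos: pos < len(tokens) and tokens[pos] == tok


def at_end(tokens, pos):
    """Rule: the separator is at the end of the input."""
    return pos == len(tokens)


def run_fst(tokens, edges, start="b", end="e"):
    """Feed every separator, left to right, to the transducer.

    Returns (reached_final_state, [(attribute, extracted text), ...]).
    """
    state, emitting, output = start, False, []
    for pos in range(len(tokens) + 1):
        for i, rule, o, j in edges:
            if i == state and rule(tokens, pos):
                state = j
                emitting = (o == "extract")
                if emitting:
                    output.append((j, []))
                break
        # Assumed implicit self-loop: keep emitting tokens while extracting.
        if emitting and pos < len(tokens):
            output[-1][1].append(tokens[pos])
    return state == end, [(attr, " ".join(toks)) for attr, toks in output]


tokens = ["<A>", "Mani", "Chandy", "</A>", ",", "<I>", "Professor", "</I>"]
edges = [
    ("b", left_is("<A>"), "extract", "N"),
    ("N", right_is("</A>"), "skip", "-N"),
    ("-N", left_is("<I>"), "extract", "A"),
    ("A", right_is("</I>"), "skip", "-A"),
    ("-A", at_end, "skip", "e"),
]
print(run_fst(tokens, edges))
```

Adding an edge such as one from N directly to e would let the same machine accept items with no academic title, which is how alternative paths encode alternative attribute permutations.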

Example FST

[Figure: example FST with states b, U, -U, N, -N, A, -A, M, and e; each edge is labeled either extract or skip.]

Expressiveness of SoftMealy

SoftMealy can deal with

• missing attributes

• multiple attribute values

• variant attribute permutations

SoftMealy can deal with exceptions and typos

SoftMealy subsumes the wrapper classes in (Kushmerick, Ph.D. thesis, University of Washington, 1997)

SoftMealy can wrap nested sources

Example of nested sources

Chapter 1 Introduction

Chapter 2 Related Work

2.1 Shopbot

2.2 Ariadne

2.3 WIEN

Chapter 3 SoftMealy Wrapper Representation

3.1 Representation

3.1.1 Tokens and Separators

3.1.2 Contextual Rules

3.2 Expressiveness Analysis

Chapter 4 Learning SoftMealy Wrappers

FST for nested sources

[Figure: FST for nested sources with states b, chapter, section, subsection, and e.]

Learnability of SoftMealy

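How difficult is it (that is, how many example items must be seen) to learn a correct graph structure of a SoftMealy FST that covers all attribute permutations?

PAC model: given k attributes, SoftMealy needs m training items, where ε is the error tolerance and 1 − δ the confidence:

\[ m \ge \frac{1}{\epsilon}\left((2k^2 + 2k)\ln 2 + \ln\frac{1}{\delta}\right) \]

Representing each attribute permutation as a linear FST (multiple attribute values not allowed) needs:

\[ m \ge \frac{1}{\epsilon}\left(k^k \ln 2 + \ln\frac{1}{\delta}\right) \]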
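For a feel of the numbers, the sketch below evaluates the standard PAC sample bound for a finite hypothesis space, m ≥ (1/ε)(ln|H| + ln(1/δ)). The function name `pac_sample_bound` and the two hypothesis-space sizes in the usage lines are illustrative assumptions, not the slide's exact expressions.

```python
import math


def pac_sample_bound(ln_hypotheses, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon.

    ln_hypotheses is the natural log of the hypothesis-space size |H|.
    """
    return math.ceil((ln_hypotheses + math.log(1.0 / delta)) / epsilon)


# Assumed example: a graph structure chosen among 40 possible edges,
# so |H| = 2**40 and ln|H| = 40 * ln 2.
graph_bound = pac_sample_bound(40 * math.log(2), epsilon=0.1, delta=0.05)

# Assumed example: a subset of the 5! full orderings of 5 attributes,
# so |H| = 2**120 and ln|H| = 120 * ln 2.
permutation_bound = pac_sample_bound(math.factorial(5) * math.log(2),
                                     epsilon=0.1, delta=0.05)

print(graph_bound, permutation_bound)
```

The bound grows linearly in ln|H|, so a hypothesis space indexed by edges (polynomial in k) needs far fewer examples than one indexed by whole permutations.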

Learnability of SoftMealy (continued)

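Multinomial model: how many training items do we need so that, with probability more than 0.95, we have at least one instance of each attribute permutation?

• Let ub be the upper bound on the number of items needed

• Let M be the number of attribute permutations

• For each permutation j, let p_j be the probability that the attribute permutation of a randomly selected item is j, and let m_j be the number of training items with permutation j

\[ ub = \min\left\{ m \;\middle|\; \sum_{\substack{m_1 + \cdots + m_M = m \\ \text{all } m_j \ge 1}} \frac{m!}{\prod_{j=1}^{M} m_j!} \prod_{j=1}^{M} p_j^{m_j} > 0.95 \right\} \]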
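The quantity described above can be computed numerically, as in this sketch. Rather than enumerating multinomial terms, it uses inclusion-exclusion over the sets of permutations that are never drawn, which gives the same probability; the names `prob_all_seen` and `items_needed` and the sample probabilities are assumptions. The loop over subsets is exponential in the number of permutations, so it is only practical for small cases.

```python
from itertools import combinations


def prob_all_seen(probs, m):
    """Probability that m random items cover every permutation at least once.

    probs[j] is the probability that a random item has permutation j.
    """
    total = 0.0
    for size in range(len(probs) + 1):
        for subset in combinations(probs, size):
            # Probability that no permutation in `subset` occurs in m draws.
            miss = max(0.0, 1.0 - sum(subset)) ** m
            total += (-1) ** size * miss
    return total


def items_needed(probs, threshold=0.95, limit=10000):
    """Smallest m whose coverage probability exceeds the threshold."""
    for m in range(len(probs), limit + 1):
        if prob_all_seen(probs, m) > threshold:
            return m
    return None


print(items_needed([0.5, 0.5]))
print(items_needed([0.7, 0.2, 0.1]))
```

Skewed distributions push the count up because the rarest permutation dominates: the system must wait until an item with that permutation shows up.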

Learning SoftMealy Wrappers: a simple algorithm

Input: attributes to be extracted, example Web pages where some items are labeled

Output: a SoftMealy wrapper

Algorithm:

1. Create states according to the given attributes

2. Create edges according to the attribute permutation of the example items

3. For each edge, collect the corresponding separator instances (as positive examples)

4. Generalize separator instances into contextual rules
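The four steps can be sketched roughly as follows. The input encoding (token lists with labeled spans), the names `context` and `learn_wrapper`, and in particular step 4 are assumptions: the slides do not say how separator instances are generalized, so this sketch simply keeps the set of observed (left token, right token) pairs as a disjunctive rule.

```python
from collections import defaultdict


def context(tokens, at):
    """Separator instance at position `at`: (left token, right token)."""
    left = tokens[at - 1] if at > 0 else None
    right = tokens[at] if at < len(tokens) else None
    return (left, right)


def learn_wrapper(attributes, labeled_items):
    """labeled_items: list of (tokens, spans); spans is an ordered list of
    (attribute, start, end) with end exclusive (a hypothetical encoding)."""
    # Step 1: one state per attribute and per dummy attribute, plus b and e.
    states = ["b", "e"] + [s for a in attributes for s in (a, "-" + a)]
    # Steps 2 and 3: edges follow each item's attribute permutation, and each
    # edge collects its separator instances as positive examples.
    rules = defaultdict(set)
    for tokens, spans in labeled_items:
        state, prev_end = "b", 0
        for attr, start, end in spans:
            if state != "b" and start > prev_end:
                # Tokens between two attributes are skipped via a dummy state.
                dummy = "-" + state
                rules[(state, dummy)].add(context(tokens, prev_end))
                state = dummy
            rules[(state, attr)].add(context(tokens, start))
            state, prev_end = attr, end
        rules[(state, "e")].add(context(tokens, prev_end))
    # Step 4 (placeholder): the set of observed contexts is the disjunctive rule.
    return states, dict(rules)


items = [
    (["<A>", "Mani", "Chandy", "</A>", ",", "<I>", "Professor", "</I>"],
     [("N", 1, 3), ("A", 6, 7)]),
    (["<A>", "David", "Breen", "</A>"], [("N", 1, 3)]),
]
states, rules = learn_wrapper(["N", "A"], items)
for edge, examples in sorted(rules.items()):
    print(edge, sorted(examples))
```

The second item has no academic title, so it contributes an edge from N straight to e; items with new permutations add edges, while repeated permutations only add separator instances to existing edges.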

Experimental results on expressiveness

Wrap 30 hand-coded CS faculty Web pages, randomly selected from the cra.org list

• SoftMealy successfully wraps all of them

• The number of distinct attribute permutations in the sample pages is up to 13, 2.63 on average

• The number of training items used is about linear in the number of edges (separator classes)

• The number of disjuncts learned is also about linear in the number of edges

Generalizing over unseen pages

ASU directory (www.asu.edu/asuweb/directory): 28 known distinct attribute permutations

Randomly select 11 output pages; the largest one serves as the test page and the other 10 are used for training

• test page contains 69 items, 17 permutations

• training pages: 85 items in total, 18 permutations

• only 7 permutations appear in both

Train the system using the training pages in ascending order of their sizes

• labeled a total of 15 items

• achieves 87% coverage on the test page

Future work

Learning algorithm that uses negative examples

Determinization, disambiguation, and minimization of learned FSTs

Robustness of wrappers

Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules

Chun-Nan Hsu

Institute of Information Science

Academia Sinica

Taipei, Taiwan

Copyright © Chun-Nan Hsu, all rights reserved

Prepared for presentation at the AAAI-98 Workshop on AI and Information Integration, Madison, Wisconsin, USA, July 26, 1998
