Wrapping Semistructured Web Pages with Finite-State Transducers
Chun-Nan Hsu and Ming-Tzung Dung
Department of Computer Science & Engineering
Arizona State University
Tempe, AZ, USA
2
Information Integration Systems need wrappers
[Figure: layered information integration architecture. Bottom: heterogeneous data sources holding unprocessed, unintegrated details (text, images/video, spreadsheets; hierarchical and network databases; relational databases; object and knowledge bases). Middle: an information integration service in which wrappers (e.g., SQL, ORB) perform translation and wrapping, and mediators perform semantic integration, mediation and agent/module coordination. Top: abstracted information delivered to human and computer users through user services (query, monitor, update).]
3
Web wrappers
Web wrappers wrap...
• “Query-able” or “Search-able” Web sites
• Web pages with large itemized lists
The primary issues are:
• How to translate (or extract) the contents of a Web page into machine-understandable data?
• How to build the extractor quickly? Can it be learned?
4
Free Text Extraction vs. Semistructured Text Extraction
Example: extract the attributes job title, employer and phone number from a job item
Free text extraction can depend on NL knowledge
• “The department of computer science at Cranberry Lemon University has a faculty position opening. Please call (555)333-5555 for more details.”
Semistructured text extraction has to depend on appearance and regularity instead
• “Faculty position, department of computer science, Cranberry Lemon University. Call (555)333-5555”
5
Wrapper representations in previous work
Shopbot (Doorenbos, Etzioni, Weld, AA-97), Ariadne (Ashish, Knoblock, CoopIS-97), WIEN (Kushmerick, Weld, IJCAI-97)…
Delimiter-based, linear finite-state transducers:
For i = 1 to k
    skip through the input string until the delimiter at the beginning of attribute Ai is located
    extract Ai until the delimiter at the end of attribute Ai is located
[Figure: linear transducer that alternates skip and extract steps through A1, A2, A3 and A4 in one fixed order]
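The skip/extract loop above can be sketched in a few lines of Python. This is only an illustration of the delimiter-based idea, not the code of Shopbot, Ariadne or WIEN; the function name, the delimiter pairs and the sample strings are assumptions made for the example.

```python
def extract_linear(text, delimiters):
    """Extract attributes in one fixed order from `text`.

    `delimiters` is a list of (begin, end) string pairs, one pair per
    attribute (hypothetical hand-written delimiters).
    Returns the list of extracted values, or None if any delimiter is missing.
    """
    values = []
    pos = 0
    for begin_delim, end_delim in delimiters:
        # Skip until the delimiter at the beginning of the attribute.
        begin = text.find(begin_delim, pos)
        if begin == -1:
            return None
        begin += len(begin_delim)
        # Extract until the delimiter at the end of the attribute.
        end = text.find(end_delim, begin)
        if end == -1:
            return None
        values.append(text[begin:end])
        pos = end
    return values


# Illustrative delimiters for URL, name and one title.
DELIMITERS = [('HREF="', '"'), ('>', '</A>'), ('<I>', '</I>')]

full_item = '<A HREF="mani.html">Mani Chandy</A>, <I>Professor</I>'
short_item = '<A HREF="x.html">Some Name</A>'

print(extract_linear(full_item, DELIMITERS))
print(extract_linear(short_item, DELIMITERS))
```

The second call returns None because the item has no title: a single fixed order of delimiters cannot cope with a missing attribute.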
6
Situations where previous work fails
Missing attributes
• e.g., a faculty member may not have an administrative title
Multiple attribute values
• e.g., a faculty member may have two administrative titles
Variant attribute permutations
• e.g., (U,N,A,M), (U,N,M,A)…
Exceptions and typos
7
Why does previous work fail?
One-attribute-permutation assumption
The use of delimiters
• prevents the wrapper from recognizing different attribute permutations in many cases
• How to extract state and zip code from “CA90210”? There are cases with no delimiters at all.
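A minimal sketch of the “CA90210” case: once the string is cut into tokens by character class, the border between two tokens can be tested by its context even though no delimiter character exists. The token classes and the two-letter/five-digit test are assumptions chosen for this example, not rules from the slides.

```python
import re

# Hypothetical token classes: a run of letters, a run of digits,
# or any other single non-space character.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    return TOKEN.findall(text)


def split_state_zip(text):
    """Return (state, zip) found at a token border, or None."""
    tokens = tokenize(text)
    for i in range(1, len(tokens)):
        left, right = tokens[i - 1], tokens[i]
        # Context test on the border between tokens i-1 and i:
        # a two-letter upper-case word on the left, five digits on the right.
        left_ok = left.isalpha() and left.isupper() and len(left) == 2
        right_ok = right.isdigit() and len(right) == 5
        if left_ok and right_ok:
            return left, right
    return None


print(tokenize("CA90210"))
print(split_state_zip("Beverly Hills, CA90210"))
```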
8
Example
<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>
<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>
9
[Figure: the example items annotated with attribute labels. First item: U (URL), N (Name), A (Academic title), M (Admin title). Second item: U (URL), N (Name), M (Admin title).]
10
SoftMealy wrapper representation
Key features (both new):
• Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path
• Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes
11
Advantages of SoftMealy wrapper representation
Expressive enough to tolerate Web pages with the four troubles:
• missing attributes
• multiple attribute values
• variant attribute permutations
• exceptions and typos
Polynomially learnable
Retains extraction efficiency
12
Basic building blocks of SoftMealy
Token: a segment of the input string
• e.g., HTML tags, punctuation symbols, words
Separator: the invisible border line between two tokens
Dummy attribute: a substring we want to skip; if it follows attribute k, it is denoted -k
Contextual rules: characterize the context of a class of separators that separate two adjacent attributes (including dummy attributes)
• Each rule consists of a left context and a right context
• Can be disjunctive
13
Example of tokens and separators
<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>
<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>
[Figure: separators marked on the example above; one is labeled as a useless separator and one as a useful separator]
14
Example of a contextual rule
<LI><A HREF="mani.html">
Mani Chandy</A>, <I>Professor of Computer Science</I> and
<I>Executive Officer for Computer Science</I>
<LI><A HREF="david.html">
David E. Breen</A>, <I>Assistant Director of Computer Graphics
Laboratory</I>
Contextual rule for the separators between -N and A:
• left context: “</A>, <I>” or “,<I>”
• right context: any word token with an initial capital
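A rough sketch of how the rule above could be checked against token borders. The tokenizer and the rule encoding (a list of alternative left-context token sequences plus a predicate for the right token) are assumptions for illustration; since whitespace is dropped during tokenization, both left alternatives reduce to token sequences ending in a comma followed by <I>.

```python
import re

# Hypothetical tokenizer: an HTML tag, a word, or a single punctuation symbol.
TOKEN = re.compile(r"<[^>]+>|\w+|[^\w\s]")

# Rule (-N, A): disjunctive left context, right context = capitalized word.
LEFT_ALTERNATIVES = [["</A>", ",", "<I>"], [",", "<I>"]]


def is_capitalized_word(token):
    return token[:1].isupper() and token.isalnum()


def matching_separators(tokens):
    """Indices i such that the separator just before tokens[i] satisfies the rule."""
    hits = []
    for i in range(1, len(tokens)):
        left_ok = any(tokens[max(0, i - len(alt)):i] == alt
                      for alt in LEFT_ALTERNATIVES)
        if left_ok and is_capitalized_word(tokens[i]):
            hits.append(i)
    return hits


item = ('<LI><A HREF="mani.html">Mani Chandy</A>, '
        '<I>Professor of Computer Science</I> and '
        '<I>Executive Officer for Computer Science</I>')
tokens = TOKEN.findall(item)
hits = matching_separators(tokens)
print(hits, [tokens[i] for i in hits])
```

Only the separator in front of “Professor” qualifies here; the one in front of “Executive” is preceded by “and <I>”, which matches neither left alternative.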
15
Finite-state transducer
Input: separator instances
Output: strings
States: initial state b, final state e, and one state for each attribute and each dummy attribute
Edges: (i, r, o, j), a state transition from i to j taken when the input separator instance satisfies contextual rule r, producing output string o
• o = empty when we want to skip
• o = the next token when we want to extract
• i and j cannot both be dummy attributes
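The following sketch runs a small transducer of this kind over one list item. It is a simplified illustration under several assumptions: the URL attribute is left out, each rule looks at only one token on either side of the separator, staying in the current state is treated as an implicit self-loop, and the rules themselves (including the use of the literal word “Professor” to recognize an academic title) are hand-written for this example rather than learned.

```python
import re

# Hypothetical tokenizer: HTML tag, word (periods kept, as in "E."), or punctuation.
TOKEN = re.compile(r"<[^>]+>|[\w.]+|[^\w\s]")


def cap(token):
    return token[:1].isupper()


# Edges (i, r, j); rule r takes the tokens left and right of the separator.
# The output o is implicit: the next token is emitted iff j is an attribute state.
EDGES = [
    ("b",  lambda l, r: l is not None and l.startswith("<A "), "N"),
    ("N",  lambda l, r: r == "</A>", "-N"),
    ("-N", lambda l, r: l == "<I>" and r == "Professor", "A"),
    ("-N", lambda l, r: l == "<I>" and cap(r), "M"),
    ("A",  lambda l, r: r == "</I>", "-A"),
    ("-A", lambda l, r: l == "<I>" and cap(r), "M"),
    ("M",  lambda l, r: r == "</I>", "e"),
]
ATTRIBUTE_STATES = {"N", "A", "M"}


def run(item):
    tokens = TOKEN.findall(item)
    state, output = "b", {}
    for s, right in enumerate(tokens):
        left = tokens[s - 1] if s > 0 else None
        for source, rule, target in EDGES:
            if source == state and rule(left, right):
                state = target
                break
        if state in ATTRIBUTE_STATES:  # extract; otherwise skip
            output.setdefault(state, []).append(right)
    return {attr: " ".join(parts) for attr, parts in output.items()}


item1 = ('<LI><A HREF="mani.html">Mani Chandy</A>, '
         '<I>Professor of Computer Science</I> and '
         '<I>Executive Officer for Computer Science</I>')
item2 = ('<LI><A HREF="david.html">David E. Breen</A>, '
         '<I>Assistant Director of Computer Graphics Laboratory</I>')
print(run(item1))
print(run(item2))
```

The two items follow different successful paths, (N, A, M) and (N, M), through the same transducer, which is how a missing attribute is tolerated.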
16
Example FST
[Figure: example FST with states b, U, -U, N, -N, A, -A, M and e; each edge is labeled either extract or skip]
17
Expressiveness of SoftMealy
SoftMealy can deal with
• missing attributes
• multiple attribute values
• variant attribute permutations
SoftMealy can deal with exceptions and typos
SoftMealy subsumes the wrapper classes in (Kushmerick, Ph.D. thesis, U of WA, 1997)
SoftMealy can wrap nested sources
18
Example of nested sources
Chapter 1 Introduction
Chapter 2 Related Work
    2.1 Shopbot
    2.2 Ariadne
    2.3 WIEN
Chapter 3 SoftMealy Wrapper Representation
    3.1 Representation
        3.1.1 Tokens and Separators
        3.1.2 Contextual Rules
    3.2 Expressiveness Analysis
Chapter 4 Learning SoftMealy Wrappers
19
FST for nested sources
[Figure: FST with initial state b, final state e, and one state each for chapter, section and subsection]
20
Learnability of SoftMealy
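How difficult is it (how many example items must be seen) to learn a correct graph structure of a SoftMealy FST that covers all attribute permutations?
PAC model, given k attributes. Two cases are compared:
• the SoftMealy FST
• representing each attribute permutation as a linear FST (multiple attribute values not allowed)
[Equations garbled in extraction: two lower bounds on the number of training items m, one per case. The recoverable terms are k, ln 2 and a logarithmic term of the form ln(1/·); the exact expressions cannot be restored from this copy.]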
21
Learnability of SoftMealy (continued)
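Multinomial model: how many training items do we need so that, with probability greater than 0.95, we have at least one instance of each attribute permutation?
• Let ub be the upper bound on the number of items needed
• Let M be the number of attribute permutations
• For each permutation j, let p_j be the probability that the attribute permutation of a randomly selected item is j, and let m_j be the number of training items with permutation j
\[
ub = \min\left\{\, m \;\middle|\; \sum_{\substack{m_1 + \cdots + m_M = m \\ \text{all } m_j \ge 1}} \frac{m!}{\prod_{j=1}^{M} m_j!} \prod_{j=1}^{M} p_j^{m_j} > 0.95 \right\}
\]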
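A minimal sketch of how ub could be computed numerically. Instead of enumerating every multinomial outcome, it uses inclusion-exclusion, which gives the same probability that each permutation appears at least once among m items. The function names and the example probabilities are assumptions for illustration; the loop over subsets is exponential in M, so this suits only small M.

```python
from itertools import combinations


def prob_all_seen(p, m):
    """Probability that m random items cover every permutation at least once.

    Inclusion-exclusion over the sets of permutations that are missed.
    """
    total = 0.0
    for size in range(len(p) + 1):
        for missed in combinations(p, size):
            total += (-1) ** size * (1.0 - sum(missed)) ** m
    return total


def items_needed(p, threshold=0.95):
    """Smallest m whose coverage probability exceeds the threshold."""
    if any(pj <= 0 for pj in p):
        raise ValueError("every permutation needs a positive probability")
    m = len(p)  # at least one item per permutation is required
    while prob_all_seen(p, m) <= threshold:
        m += 1
    return m


print(items_needed([0.5, 0.5]))
print(items_needed([0.7, 0.2, 0.1]))
```

With two equally likely permutations the coverage probability is 1 - 2(0.5)^m, which first exceeds 0.95 at m = 6.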
22
Learning SoftMealy Wrappers: a simple algorithm
Input: attributes to be extracted, and example Web pages in which some items are labeled
Output: a SoftMealy wrapper
Algorithm:
1. Create states according to the given attributes
2. Create edges according to the attribute permutation of the example items
3. For each edge, collect the corresponding separator instances (as positive examples)
4. Generalize separator instances into contextual rules
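Steps 1 and 2 can be sketched as follows. This is an assumption-laden illustration: it supposes that each labeled item is reduced to its sequence of attribute names and that skipped text (a dummy attribute) always lies between two consecutive attributes. Steps 3 and 4, which collect and generalize separator instances, are not covered.

```python
def build_structure(attributes, labeled_permutations):
    """Return (states, edges) of the FST graph for the labeled items."""
    # Step 1: states b and e, one per attribute, one per dummy attribute.
    states = ["b", "e"] + list(attributes) + ["-" + a for a in attributes]
    # Step 2: every labeled permutation contributes one path from b to e.
    edges = set()
    for permutation in labeled_permutations:
        path = ["b"]
        for position, attribute in enumerate(permutation):
            if position > 0:
                # Assumed: skipped text follows the previous attribute.
                path.append("-" + permutation[position - 1])
            path.append(attribute)
        path.append("e")
        edges.update(zip(path, path[1:]))
    return states, edges


states, edges = build_structure(
    ["U", "N", "A", "M"],
    [("U", "N", "A", "M"), ("U", "N", "M")],
)
print(len(states), sorted(edges))
```

The second permutation adds a single new edge, from -N to M; every other edge on its path is shared with the first permutation.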
23
Experimental results on expressiveness
Wrapped 30 hand-coded CS faculty Web pages, randomly selected from the cra.org list
• SoftMealy successfully wraps all of them
• Number of distinct attribute permutations in the sample pages: up to 13, 2.63 on average
• Number of training items used is roughly linear in the number of edges (separator classes)
• Number of disjuncts learned is also roughly linear in the number of edges
24
Generalizing over unseen pages
ASU directory (www.asu.edu/asuweb/directory): 28 known distinct attribute permutations
Randomly select 11 output pages; the largest serves as the test page and the other 10 are used for training
• test page contains 69 items, 17 permutations
• training pages: 85 items in total, 18 permutations
• only 7 permutations are the same in both
Train the system on the training pages in ascending order of size
• labeled a total of 15 items
• achieves 87% coverage on the test page
25
Future work
Learning algorithm that uses negative examples
Determinization, disambiguation and minimization of learned FSTs
Robustness of wrappers
Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules
Chun-Nan Hsu
Institute of Information Science
Academia Sinica
Taipei, Taiwan
Copyright © Chun-Nan Hsu, all rights reserved
Prepared for presentation at the AAAI-98 Workshop on AI and Information Integration, Madison, Wisconsin, USA, July 26, 1998