Automatic Wrappers for Large Scale Web Extraction
Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)
Task: Learn rules to extract information (e.g. Directors) from structurally similar pages.
VLDB 2011, Seattle, USA
[Figure: DOM tree of a movie page: html > body > div (class='head') and div (class='content'); the content div contains a table (width=80%) whose cells render as "Title: Godfather", "Director: Coppola", "Runtime: 118min".]

We can use the following XPath rule to extract directors:

W1 = /html/body/div[2]/table/td[2]/text()
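To make the rule concrete, here is a minimal sketch (not the authors' code; the page snippet is an assumption, and Python's ElementTree supports only a small XPath subset, so we select the element and read its text instead of calling text()):

```python
# Apply the learned XPath-style rule W1 to a toy movie page.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <div class='head'><title>Godfather</title></div>
  <div class='content'>
    <table width='80%'>
      <td>Title: Godfather</td><td>Coppola</td><td>118min</td>
    </table>
  </div>
</body></html>"""

root = ET.fromstring(page)
# W1 = /html/body/div[2]/table/td[2]/text()
director = root.find("./body/div[2]/table/td[2]").text
print(director)  # -> Coppola
```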
Wrappers
Can be learned with a small amount of supervision.
Very effective for site-level extraction.
Have been extensively studied in the literature.
In This Work:
Objective: learn wrappers without site-level supervision.
Idea
Obtain training data cheaply using dictionaries or automatic labelers.
Make wrapper induction tolerant to noise.
Summary of Approach
Two main problems:
Wrapper Enumeration: How to generate the space of all the possible wrappers efficiently?
Wrapper Ranking: How to rank the enumerated wrappers based on quality?
Example: TABLE wrapper system
n1 a1 z1 p1
n2 a2 z2 p2
n3 a3 z3 p3
n4 a4 z4 p4
n5 a5 z5 p5
Works on a table.
Generates wrappers from the following space: a single cell, a row, a column or the entire table.
L = {n1, n2, n4, a4, z5}
2^5 = 32 possible subsets
8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}
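The count above can be reproduced with a short sketch (an illustration, not the paper's implementation; the labels' cell positions in the 5x4 table are hard-coded):

```python
# TABLE inductor Phi: for a label set S, return the smallest region
# (single cell, row, column, or entire table) that covers S, then
# enumerate Phi over every subset of L.
from itertools import combinations

# Labeled nodes with their (row, col) positions in the 5x4 table.
POS = {"n1": (1, 1), "n2": (2, 1), "n4": (4, 1), "a4": (4, 2), "z5": (5, 3)}

def phi(labels):
    rows = {POS[l][0] for l in labels}
    cols = {POS[l][1] for l in labels}
    if len(labels) == 1:
        return next(iter(labels))   # single cell
    if len(cols) == 1:
        return f"C{cols.pop()}"     # one column
    if len(rows) == 1:
        return f"R{rows.pop()}"     # one row
    return "T"                      # entire table

L = ["n1", "n2", "n4", "a4", "z5"]
wrappers = set()
for k in range(1, len(L) + 1):      # all 31 non-empty subsets
    for S in combinations(L, k):
        wrappers.add(phi(S))
print(sorted(wrappers))  # 8 unique wrappers: the 5 cells, C1, R4, T
```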
Wrapper Enumeration Problem
Input: a wrapper inductor Φ and a set of labels L.
The wrapper space of L is defined as
W(L) = { Φ(S) | S ⊆ L }
Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and the size of L.
Wrapper Inductors
TABLE: the wrapper inductor defined above.
XPATH: learns the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples.
LR: finds the maximal pair of strings preceding and following all the training examples; the wrapper's output is all strings delimited by that pair.
Well-behaved Inductor
A wrapper inductor Φ is well-behaved if it has the following properties:
[Fidelity] L ⊆ Φ(L)
[Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l})
[Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2)

Theorem: TABLE, LR, and XPATH are well-behaved wrapper inductors.
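As a quick sanity check of the fidelity and closure properties, here is a sketch for the TABLE inductor on a small grid (illustrative only; monotonicity is omitted, and the grid size is an assumption):

```python
# Verify fidelity (L is inside Phi(L)'s output) and closure (adding any
# node already extracted by Phi(L) leaves the wrapper unchanged) for the
# TABLE inductor over every small label subset of a 3x2 grid.
from itertools import chain, combinations

ROWS, COLS = 3, 2
CELLS = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]

def phi(S):
    """TABLE inductor: smallest of cell/column/row/table covering S."""
    rows, cols = {r for r, _ in S}, {c for _, c in S}
    if len(S) == 1:
        return ("cell", next(iter(S)))
    if len(cols) == 1:
        return ("col", cols.pop())
    if len(rows) == 1:
        return ("row", rows.pop())
    return ("table", None)

def extract(w):
    """Cells selected by a wrapper."""
    kind, v = w
    if kind == "cell":
        return {v}
    if kind == "col":
        return {(r, c) for r, c in CELLS if c == v}
    if kind == "row":
        return {(r, c) for r, c in CELLS if r == v}
    return set(CELLS)

subsets = chain.from_iterable(combinations(CELLS, k) for k in range(1, 4))
for S in subsets:
    S = set(S)
    out = extract(phi(S))
    assert S <= out                                   # fidelity
    assert all(phi(S) == phi(S | {l}) for l in out)   # closure
print("fidelity and closure hold")
```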
Bottom-up Algorithm
Start with the singleton labels in L as candidate label sets.
Learn wrappers by feeding candidate label sets to Φ.
Incrementally apply one-label extensions to each candidate.
Extend candidates with the closure of wrappers learned by Φ.

Theorem: the bottom-up algorithm is sound and complete.
Theorem: the bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.
Can we do better?
A wrapper inductor is a feature-based inductor if:
Every label is associated with a set of features ((attribute, value) pairs).
Φ(L) = the intersection of all the features of L.
The output of a wrapper w = the text nodes satisfying all the features of w.

E.g., TABLE can be expressed as a feature-based inductor with two features, row and col.
Both LR and XPATH can be expressed as feature-based inductors.
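A feature-based inductor can be sketched in a few lines (the node-to-feature map below is a hypothetical stand-in, using the TABLE features row and col over the five labeled cells only):

```python
# Feature-based inductor: each label carries (attribute, value) features;
# Phi(L) intersects them, and the wrapper returns every node satisfying
# all of the surviving features.

NODES = {  # node -> features, TABLE-style (row, col)
    "n1": {("row", 1), ("col", 1)}, "n2": {("row", 2), ("col", 1)},
    "n4": {("row", 4), ("col", 1)}, "a4": {("row", 4), ("col", 2)},
    "z5": {("row", 5), ("col", 3)},
}

def phi(labels):
    """Phi(L): intersection of the features of all labels in L."""
    return set.intersection(*(NODES[l] for l in labels))

def extract(wrapper):
    """Nodes satisfying all features of the wrapper."""
    return {n for n, f in NODES.items() if wrapper <= f}

w = phi(["n1", "n2"])   # shared feature: ("col", 1)
print(w, extract(w))    # the column-1 wrapper selects n1, n2, n4
```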
Top-down Algorithm
We give a top-down algorithm for feature-based inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.
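The slides do not spell out the top-down algorithm; the following is a hedged reconstruction from the figure at the end of the deck, not the paper's pseudocode: start from the full label set, intersect features, and recursively split into maximal subsets that share one more feature. Note this sketch may make more than k inductor calls, since duplicates are pruned only after calling Φ:

```python
# Top-down enumeration sketch for a feature-based inductor (reconstructed,
# assumed details): one wrapper per tree node, specializing by features.

FEATS = {  # label -> feature set, TABLE-style (row, col)
    "n1": {("row", 1), ("col", 1)}, "n2": {("row", 2), ("col", 1)},
    "n4": {("row", 4), ("col", 1)}, "a4": {("row", 4), ("col", 2)},
    "z5": {("row", 5), ("col", 3)},
}

def phi(S):
    """Phi(S): intersection of the features of all labels in S."""
    return frozenset(set.intersection(*(FEATS[l] for l in S)))

def top_down(S, wrappers):
    w = phi(S)                  # one inductor call per visited node
    if w in wrappers:
        return
    wrappers.add(w)
    # Group labels by each feature not already in w; recurse on every
    # group, which is a strict subset of S by construction.
    for feat in set().union(*(FEATS[l] for l in S)) - w:
        group = frozenset(l for l in S if feat in FEATS[l])
        if group and group < frozenset(S):
            top_down(group, wrappers)

wrappers = set()
top_down(set(FEATS), wrappers)
print(len(wrappers))  # 8: table, C1, R4, and the 5 cells
```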
Wrapper Ranking Problem
Given a set of wrappers, we want to output the one that gives the "best" list.
Let X be the list extracted by a wrapper w.
Choose the wrapper that maximizes P[X | L] or, equivalently by Bayes' rule, P[L | X] · P[X].
Example: Extracting names from business listings
n1 a1 z1 p1
n2 a2 z2 p2
n3 a3 z3 p3
n4 a4 z4 p4
n5 a5 z5 p5
Let us rank the following three lists as candidates for the set of names:
X1 = first column
X2 = entire table
X3 = first two columns
X1 = first column
P[L | X1] : 2 wrong labels, 3 correct labels
P[X1] : nice repeating structure, schema size = 4
X2 = entire table
P[L | X2] : 0 wrong labels, 5 correct labels
P[X2] : nice repeating structure, schema size = 1
X3 = first two columns
P(L | X3) : 1 wrong label, 4 correct labels
P(X3) : poor repeating structure, schema size = 1 or 3
Ranking Model: P[L | X]
Assume a simple annotator with precision p and recall r that labels each node independently:
Each node in X is added to L with probability r.
Each node not in X is added to L with probability 1 - p.
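Under this model, the likelihood of the observed labels is a product of Bernoulli terms over all nodes; here is a sketch (the values of p and r, and this exact factorization, are assumptions made for illustration):

```python
# Likelihood of observed labels L given candidate list X over node
# universe H: nodes in X are labeled with prob. r, nodes outside X with
# prob. 1 - p, independently.

def likelihood(X, L, H, p=0.9, r=0.6):
    score = 1.0
    for node in H:
        if node in X:
            score *= r if node in L else (1 - r)
        else:
            score *= (1 - p) if node in L else p
    return score

# 5x4 business-listing table: names n1..n5 in the first column.
H = {f"{c}{i}" for i in range(1, 6) for c in "nazp"}
L = {"n1", "n2", "n4", "a4", "z5"}        # noisy labels
X1 = {f"n{i}" for i in range(1, 6)}        # first column
X2 = set(H)                                # entire table
for name, X in [("X1", X1), ("X2", X2)]:
    print(name, likelihood(X, L, H))
```

With these illustrative p and r, the two wrong labels hurt X1 far less than the fifteen unlabeled nodes hurt X2; in general the prior P[X] also enters the ranking.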
Ranking Model: P[X]
Define features of the grammar that describes X, e.g. schema size and repeating structure.
Learn distributions over the feature values, or take them as input as part of domain knowledge.
Experiments
Datasets:
DEALERS: used automatic form-filling techniques to obtain dealer listings from 300 store-locator pages.
DISCOGRAPHY: crawled 14 music websites that contain track listings of albums.
Task: automatically learn wrappers to extract business names / track titles for each website.
Summary
A new framework for noise-tolerant wrapper induction.
Two efficient wrapper enumeration algorithms.
A probabilistic wrapper ranking model.
Web-scale information extraction: no site-level supervision, no manual labeling, and tolerance to noise in automatic labeling.
Bottom-up Algorithm
INPUT: Φ, L
Z = all singleton subsets of L
W = Z
while Z is not empty:
    remove the smallest set S from Z
    for each possible single-label expansion S' of S:
        add Φ(S') to W
        add (Φ(S') ∩ L) back to Z
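A runnable sketch of this loop, using the TABLE inductor on the running example and representing each wrapper by its output node set (the representation and the 5x4 grid layout are assumptions of the sketch):

```python
# Bottom-up wrapper enumeration with the TABLE inductor; wrappers are
# frozensets of the nodes they extract, candidates are label subsets.

ALL = {f"{c}{r}" for r in range(1, 6) for c in "nazp"}  # 5x4 grid

def cell(node):
    """(row, col) position of a node like 'n4' in the grid."""
    return int(node[1]), {"n": 1, "a": 2, "z": 3, "p": 4}[node[0]]

def phi(S):
    """TABLE inductor, returned as the wrapper's output node set."""
    rows = {cell(n)[0] for n in S}
    cols = {cell(n)[1] for n in S}
    if len(S) == 1:
        return frozenset(S)
    if len(cols) == 1:
        c = cols.pop()
        return frozenset(n for n in ALL if cell(n)[1] == c)
    if len(rows) == 1:
        r = rows.pop()
        return frozenset(n for n in ALL if cell(n)[0] == r)
    return frozenset(ALL)

def bottom_up(L):
    Z = [frozenset({l}) for l in L]        # singleton candidates
    W = {phi(S) for S in Z}
    seen = set(Z)
    while Z:
        S = min(Z, key=len)                # smallest candidate first
        Z.remove(S)
        for l in set(L) - S:               # one-label expansions
            w = phi(S | {l})
            W.add(w)
            closure = frozenset(w & set(L))
            if closure not in seen:        # closure back onto worklist
                seen.add(closure)
                Z.append(closure)
    return W

L = {"n1", "n2", "n4", "a4", "z5"}
print(len(bottom_up(L)))  # 8 wrappers: the 5 cells, C1, R4, T
```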
Bottom-up Algorithm: trace on the running example
[Figure: the 5x4 table with labeled cells n1, n2, n4, a4, z5; the wrappers C1, R4, and T are learned as the worklist Z evolves.]
Z = {n1, n2, n4, a4, z5}
Z = {n2, n4, a4, z5, {n1, n2, n4}}                          (expanding n1 learns C1)
Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}    (T learned)
Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}  (expanding n4 learns R4)
Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4, a4, z5}}
Z = {}
Top-down Algorithm
[Figure: top-down specialization tree on the running example. The root is the full label set {n1, n2, n4, a4, z5} (the table wrapper); fixing the column feature yields {n1, n2, n4} (C1), fixing the row feature yields {n4, a4} (R4), and further specialization yields the singleton cells.]
Wrapper Ranking
argmax_X P(L | X) · P(X)
The possible values of X are the wrappers computed by Φ.
P(L | X): the probability of observing L given that X is the right wrapper.
The annotator has precision p and recall r (estimated from test labelings).
Independent annotation process: each node is labeled independently; each node in X is added to L with probability r, and each node not in X is added to L with probability 1 - p.
[Figure: Venn diagram over H, the set of all nodes, partitioning it into labeled nodes in X, non-labeled nodes in X, labeled nodes outside X, and non-labeled nodes outside X.]