Automatic Wrappers for Large Scale Web Extraction
Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)
Task: Learn rules to extract information (e.g. Directors) from structurally similar pages.
VLDB 2011, Seattle, USA
[Figure: DOM tree of a movie page: html > body > div (class='head') and div (class='content'); the content div contains a table (width=80%) whose cells render as "Title: Godfather", "Director: Coppola", "Runtime: 118min".]

We can use the following XPath rule to extract directors:

W1 = /html/body/div[2]/table/td[2]/text()
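To make the rule concrete, here is a minimal sketch (not the authors' code; the page snippet is an assumption, and Python's ElementTree supports only a small XPath subset, so we select the element and read its text instead of calling text()):

```python
# Apply the learned XPath-style rule W1 to a toy movie page.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <div class='head'><title>Godfather</title></div>
  <div class='content'>
    <table width='80%'>
      <td>Title: Godfather</td><td>Coppola</td><td>118min</td>
    </table>
  </div>
</body></html>"""

root = ET.fromstring(page)
# W1 = /html/body/div[2]/table/td[2]/text()
director = root.find("./body/div[2]/table/td[2]").text
print(director)  # -> Coppola
```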
Wrappers
Can be learned with a small amount of supervision.
Very effective for site-level extraction.
Have been extensively studied in the literature.
In This Work:
Objective: learn wrappers without site-level supervision.
Idea
Obtain training data cheaply using dictionaries or automatic labelers.
Make wrapper induction tolerant to noise.
Summary of Approach
Two main problems:
Wrapper Enumeration: How to generate the space of all the possible wrappers efficiently?
Wrapper Ranking: How to rank the enumerated wrappers based on quality?
Example: TABLE wrapper system
n1 a1 z1 p1
n2 a2 z2 p2
n3 a3 z3 p3
n4 a4 z4 p4
n5 a5 z5 p5
Works on a table.
Generates wrappers from the following space: a single cell, a row, a column or the entire table.
L = {n1, n2, n4, a4, z5}
2^5 = 32 possible subsets
8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}
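The count above can be reproduced with a short sketch (an illustration, not the paper's implementation; the labels' cell positions in the 5x4 table are hard-coded):

```python
# TABLE inductor Phi: for a label set S, return the smallest region
# (single cell, row, column, or entire table) that covers S, then
# enumerate Phi over every subset of L.
from itertools import combinations

# Labeled nodes with their (row, col) positions in the 5x4 table.
POS = {"n1": (1, 1), "n2": (2, 1), "n4": (4, 1), "a4": (4, 2), "z5": (5, 3)}

def phi(labels):
    rows = {POS[l][0] for l in labels}
    cols = {POS[l][1] for l in labels}
    if len(labels) == 1:
        return next(iter(labels))   # single cell
    if len(cols) == 1:
        return f"C{cols.pop()}"     # one column
    if len(rows) == 1:
        return f"R{rows.pop()}"     # one row
    return "T"                      # entire table

L = ["n1", "n2", "n4", "a4", "z5"]
wrappers = set()
for k in range(1, len(L) + 1):      # all 31 non-empty subsets
    for S in combinations(L, k):
        wrappers.add(phi(S))
print(sorted(wrappers))  # 8 unique wrappers: the 5 cells, C1, R4, T
```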
Wrapper Enumeration Problem
Input: a wrapper inductor Φ and a set of labels L.
The wrapper space of L is defined as
W(L) = { Φ(S) | S ⊆ L }
Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and the size of L.
Wrapper Inductors
TABLE: the wrapper inductor defined above.
XPATH: learns the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples.
LR: finds the maximal pair of strings preceding and following all the training examples; the wrapper's output is all strings delimited by that pair.
Well-behaved Inductor
A wrapper inductor Φ is well-behaved if it has the following properties:
[Fidelity] L ⊆ Φ(L)
[Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l})
[Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2)

Theorem: TABLE, LR, and XPATH are well-behaved wrapper inductors.
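As a quick sanity check of the fidelity and closure properties, here is a sketch for the TABLE inductor on a small grid (illustrative only; monotonicity is omitted, and the grid size is an assumption):

```python
# Verify fidelity (L is inside Phi(L)'s output) and closure (adding any
# node already extracted by Phi(L) leaves the wrapper unchanged) for the
# TABLE inductor over every small label subset of a 3x2 grid.
from itertools import chain, combinations

ROWS, COLS = 3, 2
CELLS = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]

def phi(S):
    """TABLE inductor: smallest of cell/column/row/table covering S."""
    rows, cols = {r for r, _ in S}, {c for _, c in S}
    if len(S) == 1:
        return ("cell", next(iter(S)))
    if len(cols) == 1:
        return ("col", cols.pop())
    if len(rows) == 1:
        return ("row", rows.pop())
    return ("table", None)

def extract(w):
    """Cells selected by a wrapper."""
    kind, v = w
    if kind == "cell":
        return {v}
    if kind == "col":
        return {(r, c) for r, c in CELLS if c == v}
    if kind == "row":
        return {(r, c) for r, c in CELLS if r == v}
    return set(CELLS)

subsets = chain.from_iterable(combinations(CELLS, k) for k in range(1, 4))
for S in subsets:
    S = set(S)
    out = extract(phi(S))
    assert S <= out                                   # fidelity
    assert all(phi(S) == phi(S | {l}) for l in out)   # closure
print("fidelity and closure hold")
```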
Bottom-up Algorithm
Start with the singleton labels in L as candidate label sets.
Learn wrappers by feeding candidate label sets to Φ.
Incrementally apply one-label extensions to each candidate.
Extend candidates with the closure of wrappers learned by Φ.

Theorem: the bottom-up algorithm is sound and complete.
Theorem: the bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.
Can we do better?
A wrapper inductor is a feature-based inductor if:
Every label is associated with a set of features ((attribute, value) pairs).
Φ(L) = the intersection of all the features of L.
The output of a wrapper w = the text nodes satisfying all the features of w.

E.g., TABLE can be expressed as a feature-based inductor with two features, row and col.
Both LR and XPATH can be expressed as feature-based inductors.
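A feature-based inductor can be sketched in a few lines (the node-to-feature map below is a hypothetical stand-in, using the TABLE features row and col over the five labeled cells only):

```python
# Feature-based inductor: each label carries (attribute, value) features;
# Phi(L) intersects them, and the wrapper returns every node satisfying
# all of the surviving features.

NODES = {  # node -> features, TABLE-style (row, col)
    "n1": {("row", 1), ("col", 1)}, "n2": {("row", 2), ("col", 1)},
    "n4": {("row", 4), ("col", 1)}, "a4": {("row", 4), ("col", 2)},
    "z5": {("row", 5), ("col", 3)},
}

def phi(labels):
    """Phi(L): intersection of the features of all labels in L."""
    return set.intersection(*(NODES[l] for l in labels))

def extract(wrapper):
    """Nodes satisfying all features of the wrapper."""
    return {n for n, f in NODES.items() if wrapper <= f}

w = phi(["n1", "n2"])   # shared feature: ("col", 1)
print(w, extract(w))    # the column-1 wrapper selects n1, n2, n4
```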
Top-down Algorithm
We give a top-down algorithm for feature-based inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.
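The slides do not spell out the top-down algorithm; the following is a hedged reconstruction from the figure at the end of the deck, not the paper's pseudocode: start from the full label set, intersect features, and recursively split into maximal subsets that share one more feature. Note this sketch may make more than k inductor calls, since duplicates are pruned only after calling Φ:

```python
# Top-down enumeration sketch for a feature-based inductor (reconstructed,
# assumed details): one wrapper per tree node, specializing by features.

FEATS = {  # label -> feature set, TABLE-style (row, col)
    "n1": {("row", 1), ("col", 1)}, "n2": {("row", 2), ("col", 1)},
    "n4": {("row", 4), ("col", 1)}, "a4": {("row", 4), ("col", 2)},
    "z5": {("row", 5), ("col", 3)},
}

def phi(S):
    """Phi(S): intersection of the features of all labels in S."""
    return frozenset(set.intersection(*(FEATS[l] for l in S)))

def top_down(S, wrappers):
    w = phi(S)                  # one inductor call per visited node
    if w in wrappers:
        return
    wrappers.add(w)
    # Group labels by each feature not already in w; recurse on every
    # group, which is a strict subset of S by construction.
    for feat in set().union(*(FEATS[l] for l in S)) - w:
        group = frozenset(l for l in S if feat in FEATS[l])
        if group and group < frozenset(S):
            top_down(group, wrappers)

wrappers = set()
top_down(set(FEATS), wrappers)
print(len(wrappers))  # 8: table, C1, R4, and the 5 cells
```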
Wrapper Ranking Problem
Given a set of wrappers, we want to output the one that gives the "best" list.
Let X be the list extracted by a wrapper w.
Choose the wrapper that maximizes P[X | L] or, equivalently by Bayes' rule, P[L | X] · P[X].
Example: Extracting names from business listings
n1 a1 z1 p1
n2 a2 z2 p2
n3 a3 z3 p3
n4 a4 z4 p4
n5 a5 z5 p5
Let us rank the following three lists as candidates for the set of names:
X1 = first column
X2 = entire table
X3 = first two columns
X1 = first column
P[L | X1] : 2 wrong labels, 3 correct labels
P[X1] : nice repeating structure, schema size = 4
X2 = entire table
P[L | X2] : 0 wrong labels, 5 correct labels
P[X2] : nice repeating structure, schema size = 1
X3 = first two columns
P(L | X3) : 1 wrong label, 4 correct labels
P(X3) : poor repeating structure, schema size = 1 or 3
Ranking Model: P[L | X]
Assume a simple annotator with precision p and recall r that labels each node independently:
Each node in X is added to L with probability r.
Each node not in X is added to L with probability 1 - p.
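Under this model, the likelihood of the observed labels is a product of Bernoulli terms over all nodes; here is a sketch (the values of p and r, and this exact factorization, are assumptions made for illustration):

```python
# Likelihood of observed labels L given candidate list X over node
# universe H: nodes in X are labeled with prob. r, nodes outside X with
# prob. 1 - p, independently.

def likelihood(X, L, H, p=0.9, r=0.6):
    score = 1.0
    for node in H:
        if node in X:
            score *= r if node in L else (1 - r)
        else:
            score *= (1 - p) if node in L else p
    return score

# 5x4 business-listing table: names n1..n5 in the first column.
H = {f"{c}{i}" for i in range(1, 6) for c in "nazp"}
L = {"n1", "n2", "n4", "a4", "z5"}        # noisy labels
X1 = {f"n{i}" for i in range(1, 6)}        # first column
X2 = set(H)                                # entire table
for name, X in [("X1", X1), ("X2", X2)]:
    print(name, likelihood(X, L, H))
```

With these illustrative p and r, the two wrong labels hurt X1 far less than the fifteen unlabeled nodes hurt X2; in general the prior P[X] also enters the ranking.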
Ranking Model: P[X]
Define features of the grammar that describes X, e.g. schema size and repeating structure.
Learn distributions over the feature values, or take them as input as part of domain knowledge.
Experiments
Datasets:
DEALERS: used automatic form-filling techniques to obtain dealer listings from 300 store-locator pages.
DISCOGRAPHY: crawled 14 music websites that contain track listings of albums.
Task: automatically learn wrappers to extract business names / track titles for each website.
Summary
A new framework for noise-tolerant wrapper induction.
Two efficient wrapper enumeration algorithms.
A probabilistic wrapper ranking model.
Web-scale information extraction: no site-level supervision, no manual labeling, and tolerance to noise in automatic labeling.
Bottom-up Algorithm
INPUT: Φ, L
Z = all singleton subsets of L
W = Z
while Z is not empty:
    remove the smallest set S from Z
    for each possible single-label expansion S' of S:
        add Φ(S') to W
        add (Φ(S') ∩ L) back to Z
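A runnable sketch of this loop, using the TABLE inductor on the running example and representing each wrapper by its output node set (the representation and the 5x4 grid layout are assumptions of the sketch):

```python
# Bottom-up wrapper enumeration with the TABLE inductor; wrappers are
# frozensets of the nodes they extract, candidates are label subsets.

ALL = {f"{c}{r}" for r in range(1, 6) for c in "nazp"}  # 5x4 grid

def cell(node):
    """(row, col) position of a node like 'n4' in the grid."""
    return int(node[1]), {"n": 1, "a": 2, "z": 3, "p": 4}[node[0]]

def phi(S):
    """TABLE inductor, returned as the wrapper's output node set."""
    rows = {cell(n)[0] for n in S}
    cols = {cell(n)[1] for n in S}
    if len(S) == 1:
        return frozenset(S)
    if len(cols) == 1:
        c = cols.pop()
        return frozenset(n for n in ALL if cell(n)[1] == c)
    if len(rows) == 1:
        r = rows.pop()
        return frozenset(n for n in ALL if cell(n)[0] == r)
    return frozenset(ALL)

def bottom_up(L):
    Z = [frozenset({l}) for l in L]        # singleton candidates
    W = {phi(S) for S in Z}
    seen = set(Z)
    while Z:
        S = min(Z, key=len)                # smallest candidate first
        Z.remove(S)
        for l in set(L) - S:               # one-label expansions
            w = phi(S | {l})
            W.add(w)
            closure = frozenset(w & set(L))
            if closure not in seen:        # closure back onto worklist
                seen.add(closure)
                Z.append(closure)
    return W

L = {"n1", "n2", "n4", "a4", "z5"}
print(len(bottom_up(L)))  # 8 wrappers: the 5 cells, C1, R4, T
```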
Bottom-up Algorithm: trace on the running example
[Figure: the 5x4 table with labeled cells n1, n2, n4, a4, z5; the wrappers C1, R4, and T are learned as the worklist Z evolves.]
Z = {n1, n2, n4, a4, z5}
Z = {n2, n4, a4, z5, {n1, n2, n4}}                          (expanding n1 learns C1)
Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}    (T learned)
Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}  (expanding n4 learns R4)
Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4, a4, z5}}
Z = {}
Top-down Algorithm
[Figure: top-down specialization tree on the running example. The root is the full label set {n1, n2, n4, a4, z5} (the table wrapper); fixing the column feature yields {n1, n2, n4} (C1), fixing the row feature yields {n4, a4} (R4), and further specialization yields the singleton cells.]
Wrapper Ranking
argmax_X P(L | X) · P(X)
The possible values of X are the wrappers computed by Φ.
P(L | X): the probability of observing L given that X is the right wrapper.
The annotator has precision p and recall r (estimated from test labelings).
Independent annotation process: each node is labeled independently; each node in X is added to L with probability r, and each node not in X is added to L with probability 1 - p.
[Figure: Venn diagram over H, the set of all nodes, partitioning it into labeled nodes in X, non-labeled nodes in X, labeled nodes outside X, and non-labeled nodes outside X.]