extracting structured data from web pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9....

30
Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina Stanford University arvinda,hector @cs.stanford.edu Abstract Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that uses sets of words that have similar occurrence pattern in the input pages, to construct the template. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases. 1 Introduction The World Wide Webis a vast and rapidly growing source of information. Most of this information is in unstructured HTML pages that are targeted at a human audience. The unstructured nature of these pages makes it hard to do sophisticated querying over the information present in them. There are, however, many web sites that contain a large collection of pages that have more “structure.” These web pages encode data from an underlying structured source, like a relational database, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon [1]. Figure 1(a) shows two example book pages from Amazon. There are two important characteristics of such a collection of pages. First, all the pages in the collection encode the same kind of data. For instance, each page in Figure 1(a) contains the title, authors, and the price of a book. In other words, the “schema” of the data in each page is the same. Second, the data in each page is encoded in a similar fashion. In both pages of Figure 1(a), the title of the book appears in the beginning, followed by the word “ ”, followed by the author(s). In other words, these pages can be generated from a common template by “plugging-in” values for the title, the list of authors and so on. 1

Upload: others

Post on 02-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

ExtractingStructuredDatafrom WebPages

Arvind Arasu HectorGarcia-Molina

StanfordUniversity�arvinda,hector� @cs.stanford.edu

Abstract

Many web sitescontainlargesetsof pagesgeneratedusinga commontemplateor layout. For example,Amazonlays out the author, title, comments,etc. in the sameway in all its book pages. The valuesusedto generatethe pages(e.g., the author, title,...) typically comefrom a database.In this paper, we studytheproblemof automaticallyextractingthedatabasevaluesfrom thewebpageswithout any learningexamplesorothersimilar humaninput. We formally definethe notion of a template,andproposea modelthatdescribeshow valuesareencodedinto pagesusingatemplate.Wepresentanextractionalgorithmthatusessetsof wordsthathave similar occurrencepatternin theinput pages,to constructthetemplate.Theconstructedtemplateisthenusedto extractvaluesfrom thepages.We show experimentallythat theextractedvaluesmake semanticsensein mostcases.

1 Intr oduction

TheWorld Wide Web is a vastandrapidly growing sourceof information. Most of this informationis in

unstructuredHTML pagesthat aretargetedat a humanaudience.The unstructurednatureof thesepages

makesit hardto do sophisticatedqueryingover theinformationpresentin them.Thereare,however, many

websitesthatcontaina largecollectionof pagesthathave more“structure.” Thesewebpagesencodedata

from an underlyingstructuredsource,like a relationaldatabase,andaretypically generateddynamically.

An exampleof sucha collectionis thesetof book pagesin Amazon[1]. Figure1(a)shows two example

bookpagesfrom Amazon.Therearetwo importantcharacteristicsof sucha collectionof pages.First, all

thepagesin thecollectionencodethesamekind of data.For instance,eachpagein Figure1(a)containsthe

title, authors,andtheprice of a book. In otherwords,the “schema”of thedatain eachpageis thesame.

Second,thedatain eachpageis encodedin a similar fashion.In bothpagesof Figure1(a), the title of the

bookappearsin thebeginning,followedby theword “ ��� ”, followedby theauthor(s).In otherwords,these

pagescanbegeneratedfrom acommontemplateby “plugging-in” valuesfor thetitle, thelist of authorsand

soon.

1

Page 2: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

(a)Two bookpagesfrom Amazon[1]

Page A B C �����1 ��� ������������� ������������� (NULL) �����2 ������ ������� ! ��"#��%$�&����� '%()� (�( �����...

......

......

(b) ExtractedData

This paperstudiesthe problemof automaticallyextractingstructureddatafrom a collectionof pages

describedabove, without any humaninput like manuallygeneratedrulesor training sets. By structured

data,we mean“relations” — setsof tuplesof thesamekind thatcanbestoredandprocessedin adatabase.

For instance,from a collection of pageslike thoseshown in Figure 1(a) we would like to extract book

tuples,whereeachtupleconsistsof thetitle, thesetof authors,the(optional)list-price,andotherattributes

(Figure1(b)). Notethat,asFigure1(b) indicates,it is notourgoalto find semanticallymeaningfulattribute

namesfor theextracteddata.

Extractingstructureddatafrom thewebpagesis clearlyvery useful,sinceit enablesusto posecomplex

queriesover thedata.Extractingstructureddatais alsousefulin informationintegrationsystems[9, 17, 15,

11], which integratethedatapresentin differentweb-sites.

It is notsurprising,therefore,thatextractingstructureddatafrom web-pagesis awell-studiedproblemin

thedatabaseandAI communities.However, mostof theexistingwork on thisproblem[14, 13, 16,6, 7, 12]

assumessignificanthumaninput, for example,in form of trainingexamplesof thedatato beextracted.This

is time consuming,especiallyif thetemplateusedto generatethepageschangesoften. In contrast,we aim

for completeautomationin theextractionprocess.To thebestof ourknowledge,ROADRUNNER project[8]

is the only work that tries to solve the sameproblemaswe do. However, ROADRUNNER makesseveral

simplifying assumptionsthat limits its applicability. We defera moretechnicalcomparisonbetweenour

2

Page 3: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

w* ork andtheirs,until Section7.

Thebasicideaof our approachis asfollows. First,we deducethetemplateof theinput setof pages.As

we mentionedearlier, templateis the text of thepagesthat is “independent”of theactualdataencodedin

thepages,andis moreor less“common” to all the input pages.For example,in Figure1(a), the text �+� ,,.-�/102/431576

:, 8 319;:102/4325<6 :, =2>+? 3<@ ?.� 3A@�3.: � areall part of the template.Oncethe templateis deduced,the

remainingtext in eachpagethatis notpartof thetemplateis extractedasthedata.

Although our basicapproachlooks intuitively simple thereareseveral challengesthat have to be ad-

dressedin orderto make it feasible.

1. Complex Schema:The“schema”of theinformationencodedin thewebpagescouldbevery complex

with arbitrarylevelsnesting.For instance,eachbookpagecancontainasetof authors,with eachauthor

having asetof addressesandsoon. Eventhenotionof a templateis not very obviousin thepresenceof

suchcomplex schema.

2. Templatevs Data: Syntactically, thereis nothingthatdistinguishesthetext that is partof the template

andthetext thatis partof thedata.Thismakesthetaskof identifying thetemplatechallenging.

Therestof thepaperis organizedasfollows. Section2 providesthepreliminarydefinitions,proposesa

modelfor pagecreationandformally statestheEXTRACT problem,thatwearetrying to solve in thispaper.

Section3 providesa brief overview of our algorithm,EXALG, for solvingtheEXTRACT problem.EXALG

describedin greaterdetail in sections4 and5. Section6 describesour experiments.Section7 describes

relatedwork, andSection8 ourconcludingremarks.

2 Model and ProblemFormulation

In this sectionwe formally definestructureddata,the kind of datathat we arehopingto extract from the

webpages.Wealsoproposeamodelfor pagecreationthatdescribeshow datais encodedusinga template.

Finally, we formulatetheEXTRACT problemthatwe aretrying to solve in thispaper.

2.1 Structured Data

StructuredData is any set of datavaluesconformingto a commonschemaor type. A type is defined

recursively asfollows [5]:

3

Page 4: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

1. TheBasicType, denotedby B , representsastringof tokens. A tokenis somebasicunit of text. Wedefine

a tokento beawordor a HTML tag.Howevera tokencouldhave beendefinedasabit or acharacteras

well.

2. If CEDFGGG F�CIH aretypes,thentheirorderedlist JKCLD�FGGG F�CIH�M isalsoatype.Wesaythatthetype JKCLD�FGGG F�CIH�Mis constructedfrom thetypesCED�FGGG�F�CIH usinga tupleconstructorof order N .

3. If C is a type, then OCQP is alsoa type. We say that the type OCQP is constructedfrom C usinga set

constructor.

We usethe term typeconstructorto refer to eithera tuple or setconstructor. An instanceof a schemais

definedrecursively asfollows.

1. An instanceof thebasictype, B , is any stringof tokens.

2. An instanceof type JKCLD�F�CIR.FGGG)F�CSH+M is a tuple of the form JUTVD�F�TWR.FGGG F�TXH+M where T�D�F�TWR.FGGG�F�TWH are in-

stancesof typesCED�F�CYR.FGGG�F�CIH , respectively. InstancesT�D�F�TWR.FGGG F�TWH arecalledattributesof thetuple.

3. An instanceof type OCQP is asetof elementsO)Z7D F�Z�R.FGGG F�Z[\P , suchthat Z�]V^�_Q`aTb`dcfe is aninstanceof

type C .

We alsouseterm value to denotean instance. Also, string denotesa string of tokens. Sometimestype

constructorsymbols,O7P and JXM , aresubscriptedto helpusreferto thecorrespondingtypeconstructors.

Example2.1 Considera setof pages,eachcontaininginformationabouta book. Eachpagecontainsthe

title, thesetof authors,andthecostof a book. Further, eachauthorhasa first nameanda lastname.Then

theschemaof thedataencodedin thepagesis gEDihjJkBlF�O2JkBmFVBnMpoWq)P opr FVBiM o�s . SchemagED hastwo tuplecon-

structors,t;D and tu , andonesetconstructor, tR . An instanceof gED is thevalue vwD�hxJUy%F�O2J{z2D�FV|{D%M F J{z.R.FV|KR M�P<FV}�Mwhere,for example,y denotesthetitle of thebook, z2D denotesthefirst nameof anauthorand } thecost. ~

Schemasandvaluescanbe equivalently viewed astrees. Figure1(d) shows the treerepresentationof

schemagED andvalue vwD . A sub-treeof a schematreeis alsoa schema,andis calleda sub-schemaof the

original schema.A sub-valueof avalueis similarly defined.

4

Page 5: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

2.2�

Model of PageCreation

We now describea modelfor pagecreation.Accordingto our model(Figure1(c)),a value v (takenfrom a

databaseshown on the left) is encodedinto a pageusinga templateC . We denotethepageresultingfrom

encodingof v using C by �E^KC�F�vIe .

λ (T,x )

( T )

x

Template

Database

Output Page

(c) Model for PageCreation

< >

{ }

< >

� �

� �

< >

{ }

< > < >

τ

τ

τ1

2

3

t c

f1 l1 f2 l2

(d) ExampleSchemaandInstance

Figure1:

Definition 2.1 (Template) A templateC for a schema� , is definedas a function that mapseachtype

constructor, t of � into anorderedsetof stringsC�^kt�e , suchthat,

1. If t is a tupleconstructorof order N , C�^kt�e is anorderedsetof N���_ strings JX� o D�FGGG F � o;� H7�EDp� M .2. If t is asetconstructor, C�^kt�e is astring g o (trivially anorderedsetof unit size). ~

Optionally, we representtemplateC as C � to denotethat C is definedfor schema� . For case1 (resp.

case2) of Definition 2.1, we saystring � o ]�^�_�`�T�`�N���_ e (resp. string g o ) is associatedwith type

constructort . If astringis associatedwith a typeconstructorin atemplate,any tokenthatoccurswithin the

stringis alsosaidto beassociatedwith thetypeconstructor.

Example2.2 A template, CLD � s , for SchemagLD of Example2.1, is given by the mapping, CED)^kt;D e�hJk�\FV��F ��FV��M , CED)^ktR�e�h�� , CLD)^ktue&hxJk��FV�iF ��M . Eachletter ��� � is astring.

TemplateCED � s tells ushow to encodea pagefrom a value. For example,theencoding�E^KCLD�F�vwD%e is the

string �ly����\z2DV�Q|XD ���f�¡z.R�Q|¢R ���£}�� .

For concreteness,let strings ^k�¤�a��e be asshown in Figure2 ^k¥+e ( ¦ representsan emptystring and

representswhite space). Thewebpagecorrespondingto thebook tuple J C ProgrammingLanguage,O�JBrian,KernighanM , J Dennis,RitchieM§P , $30.00M is shown in Figure2 ^{¨�e . ~

5

Page 6: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

© ª¬« �V�.�­ ª¬® ���� ­ ª¬® ­K¯ ����°l± ª¢²�® ­³ ® ´ ª¬® ­kµ�����i± ª¢²V® ­¶ ª¢²V® � �% ­ ª¢²V« �V�;�­· �{¸ ¹º»and

ª¬« �V�.�­ª¬® � �� ­ª¬® ­K¯ ���V°�± ª¢²V® ­kµ½¼����#���$����.¾V�#§¿ $��# " $ #�® E¯��¾%$��ÁÀ�� �%�)¾�# « $��&$���E!�� ���)¾%�wÂ)¾�� à « ¾ �ª¬® ­kµ����i± ª¢²V® ­ $Ä�( � (%(ª¢²V® � �% ­ª¢²�« �V�.�­ÅÇÆ)È ÅÇÉ�È

Figure2: TemplateandPageof Example2.2

Formally, given a template,C � , the encoding�E^KC�F�vIe of an instancev of � is definedrecursively in

termsof encodingof sub-valuesof v . Sinceit causesno ambiguity, we usethe �½^KClF�vIe notationfor values

v thatareinstancesof sub-schemaof � .

1. If v is of basictype, B , �½^KClF�vIe is definedto be v itself.

2. If v is a tuple of form JUv D FGGG F�v H MpoWÊ , �½^KC�F�vYe is the string � D �E^KC�F�v D e¡� R �E^KC�F�v R e�GGG&�E^KC�F�v H e�bH7�ED . Here, v is an instanceof sub-schemathat is rootedat type constructort�Ë in � , and C¡^kt�ËWe�hJX��D�FGGG�F �bH7�ED%M .

3. If v is a set of the form O)Z.D�FGGG)F�Z[QP opÌ , �E^KC�F�vIe is given by the string �½^KClF�Z.D%e�gÍ�½^KClF�Z�R e�gÎGGGEg�E^KC�F�Z[£e . Here v is aninstanceof sub-schemathatis rootedat typeconstructort�Ï in � , and C�^kt�Ï e§hÐg .

Werepresenta templateusinganinfix notation.For example,thetemplateof Example2.2is represented

as Jk�ÒÑd��O2Jk�ÐÑn�ÍÑn��M�P)Óm�ÍÑb�ÔM . The“ Ñ ” symbolis similar to UNIX wild-card,andindicatespositions

wherevaluesof basictypeappearin anencodingusingthetemplate.Notethatstring � , associatedwith tRof Example2.2 is placedassubscriptof O7P .

Ourmodelcapturestherequirementthatthewebpagesbegeneratedin aconsistentmanner. In particular,

it ensuresthatvaluesfor thesameattribute in a tupleoccurin thesamerelative positionwith respectto the

valuesof otherattributes,in all thepages.In Example2.2above, thebooknamealwaysoccursbeforethe

list of authorsandtheprice.Theencodingof thesetcapturestheintuition thatelementsof a setareusually

listedcontiguously, andthattheelementsof thesetareformattedin asimilar manner.

2.3 Optionals and Disjunctions

As we saw in Section2.1, a schemais built from two kinds of type constructors,tuple andset,and the

basictype B . Therearetwo otherkinds of type constructorsthat occurcommonlyin the schemaof web

pages,namely, optionalsanddisjunctions.For examplethe list-priceof a book in Amazonbook pagesis

6

Page 7: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

optionalÕ sinceonly pagesfor bookssoldat a discountpricehave list-price information. As anexampleof

a disjunction,theaddressinformationin a webpagecouldbein oneof two formats,basedon whetherthe

addressis a US addressor not, in which casetheschemaof theaddressis a disjunctionof theschemafor

US addressesandtheschemafor non-USaddresses.

We view optionalsanddisjunctionsasspecialtypeconstructorsbuilt from setandtupleconstructors.If

C is a type,then ^KC�e�Ö representstheoptionaltype C , andis equivalentto OC�P o with theconstraintthat in

any instantiationt hasa cardinalityof × or _ . Similarly, if CED and CYR aretypes, ^KCED\Ø7CIR e representsa type

which is disjunctionof CLD and CIR , andis equivalentto JpOCLDP o s F�OCYR;P o r M o , wherefor every instantiationof texactlyoneof t;D�FVtR hascardinalityoneandtheother, cardinalityzero.

Theabove view of optionalsanddisjunctionsenablesus to useour modelof pagecreationfor schema

involving optionalsanddisjunctionswithout any modification.

2.4 Problem Statement

Extract Problem: Givenasetof N pages,Ù�]½hÍ�E^KC�F�v�]WeÁ^�_Q`dT&`aNEe , createdfrom someunknown template

C andvaluesO�vwD�FGGG)F�v�H�P , deducethetemplateC andvaluesO�vwD�FGGG�F�v4H�P from thesetof pagesalone.

In its generalform, EXTRACT problemis ill-defined sincethereareseveral templatesandvaluesthat

couldhave createdagivensetof pages,asthefollowing exampleillustrates.

Example2.3 Considerthreeinput pagesÙ D h¤�l¥ D �Ú¨ D �Q} D � , Ù R h¤�l¥ R ��¨ R �£} R � , Ù u hÛ�m¥ u �Ú¨ u �£} u � .

Thesepagescanbe createdfrom the template Jk�ÜÑm�jÑm�ÛÑl�ÔM anda correspondingsetof values1. For

instance,the value usedto createÙ½D is Jk¥�D�F�¨ D�FV} D%M . Thesepagescan also be createdfrom the template

Jk�ÝÑ&�dÑÁ�ÔM andacorrespondingsetof values.For this template,thevalueusedto createÙwD is Jk¥ÞD��Ú¨ DFV} D%M .~

However, givenasetof realwebpagesfrom asitelikeAmazon,ahumanrarelyhasany ambiguityin picking

theright templateandvaluesencodedin thepages.Ourgoalis to solve theEXTRACT problemfor realweb

pages,i.e., producethetemplateandvaluesthatwouldbeconsideredcorrectby ahuman.

Example2.4 We usethe instanceof EXTRACT problemwith thesetof ß pagesàQáQhâO�Ù4á D FkÙ4á R FkÙ4á u FkÙ4ápã.Pshown in Figure3 asa runningexample. Eachpagein à á containsthe title and the setof reviews of a

1In many, but notall, cases,a templateanda pagecreatedfrom thetemplateuniquelyidentifiesthevalueencodedin thepage

7

Page 8: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

book.ä

Eachreview containsthenameof the reviewer, the ratinggivenby the reviewer andthe text of her

comments.The entiretext of commentsis not shown dueto spacelimitations. Arguably, the pageswere

createdfrom templateC á , and values O�v á DF�v á R.F�v á u;F�v ápã P shown in Figure4. The schemaof the values

is � á håJkBlF�O2JkB�FVBlFVBiM oWæps P oWæ{r M oWækq . The correctsolutionof the EXTRACT problemfor the input à á is the

templateC á andvaluesO�v á D�F�v á R;F�v á u.F�v ápã P . ~

ª¬« �V�.�­ s ª¬® � �%�­ rª¬® ­ q ¯����V°)ç�è�$��;��é ª¢²�® ­kê4! $���$ ® $�����ª¬® ­{ëÞÂ�%ì ¾ � í �%î ª¢²V® ­kïª ����­ s¢ðª ��¾%­ s{sª¬® ­ sKr Â��%ì ¾ ��í��%� s¢q è $��)� s ç ª¢²V® ­ s éwñ%� « �ª¬® ­ s ê  $ � ¾V��# s ë ª¢²V® ­ s îSòª¬® ­ s ï�ó�%ô�� rk𠪢²�® ­ rps �����ª¢² �¾%­ r{rª¢² ���%­ rkqª¢²�® � ��)­ r ç ª¢²V« �V�;�­ r é(a:õ æps )ª¬« �V�.�­ s ª¬® � �%�­ rª¬® ­ q ¯����V° ç è�$��;� é ª¢²�® ­ êYö "�%���÷�ø��;�ª¬® ­ ë Â�%ì ¾ � í � î ª¢²V® ­ ïª ����­ s¢ðª ��¾%­ s{sª¬® ­ sKr Â��%ì ¾ ��í��%� s¢q è $��)� s ç ª¢²V® ­ s é ñ%� « �ª¬® ­ s êÞ $ � ¾V��# s ë ª¢²V® ­ s îSùª¬® ­ s ï�ó�%ô�� rk𠪢²�® ­ rps . . .ª¢² �¾%­ r{rª¢² ���%­ rkqª¢²�® � ��)­ r ç ª¢²V« �V�;�­ r é(c:õ ækq )

ª¬« �V�.��­ s ª¬® � �� ­ rª¬® ­ q ¯ ���V°)çÞè $��;��é ª¢²V® ­kê Data �)¾V�)¾V��#ª¬® ­{ë+Â��%ì ¾�� í)�%î ª¢²�® ­kïª ���%­ s¢ðª �¾%­ s{sª¬® ­ sKr Â��%ì�¾ � í��%� s¢q è�$��;� s ç ª¢²�® ­ s éYñ �%ú%úª¬® ­ s ê  $ ��¾V�# s ë ª¢²V® ­ s îY'ª¬® ­ s ï4ó�%ô%� rk𠪢²V® ­ rps . . .ª¢² �¾ ­ r{rª �¾%­ s{sª¬® ­ sKr Â��%ì�¾ � í��%� s¢q è�$��;� s ç ª¢²�® ­ s éYñ%$���ª¬® ­ s ê  $ ��¾V�# s ë ª¢²V® ­ s îIûª¬® ­ s ï ó�%ô%� rk𠪢²V® ­ rps . . .ª¢² �¾ ­ r{rª¢² ����­ rkqª¢²V® � �% ­ r ç ª¢²V« �V�.�­ r é(b:õ æ{r )ª¬« �V�.�­ s ª¬® � �� ­ rª¬® ­ q ¯ ����°)ç4è $��;��é ª¢²V® ­kê�ó���$�� �%$Ã�� ¾ ���)�ª¬® ­{ëÞÂ��%ì�¾ � í)��î ª¢²�® ­kïª �%��­ s¢ðª¢² ���­ rkqª¢²V® ���� ­ r ç ª¢²V« �V�.�­ r é(d:õ æ ç )

Figure3: Inputpagesof EXTRACT problem

ü æps ± ! $ ��$ ® $� ��� ý ª ñ%� « � , ò , . . . ­1þü æ{r ± Data �)¾V�)¾��# ý ª ñ � ú�ú , ' , . . . ­ , ª ñ%$��� , û , . . . ­+þü ækq ± ö "��%��Á÷�ø�;� ý ª ñ%� « � , ù , . . . ­1þü æ çÁ± ó%��$��)�%$�Ã�� ¾%��� � ÿ

ªª¬« ���.�­ s ª¬® � �% ­ rª¬® ­ q ¯����V°)ç�è�$��;��é ª¢²�® ­kê *ª¬® ­{ëÞÂ�%ì ¾ � í �î ª¢²V® ­kïª ����­ s¢ðý ª ª ��¾%­ s{sª¬® ­ sKr Â�%ì ¾ � í�%� s¢q è $��)� s ç ª¢²V® ­ s é *ª¬® ­ s ê Â�$ � ¾V�# s ë ª¢²V® ­ s î *ª¬® ­ s ï ó��%ô�� rk𠪢²�® ­ rps *ª¢² �¾�­ r{r­Xþª¢² ���­ rkqª¢²�® � �� ­ r ç ª¢²�« �V�.�­ r é­Figure4: Thecorrectsolutionto theEXTRACT problem

8

Page 9: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

2.5�

MiscellaneousTerminology, Definitions

An occurrenceof a token in a template(resp. value,page)is calleda template-token (resp. value-token,

page-token). Notethedistinctionbetweena tokenandits occurrence.Accordingto our model,eachpage-

token is createdfrom eithera template-token or a page-token. Eachtemplate-token of CYá in Figure4 is

subscriptedto help us refer to it subsequently. The page-tokensof à á in Figure3 thatarecreatedfrom a

template-token have thesamesubscriptasthetemplate-token. Two page-tokensaresaidto have thesame

role if they have beengeneratedby the sametemplate-token. Therefore,two page-tokensin à á have the

samerole iff they have thesamesubscriptin Figure4.

3 Overview of our Approach

In thispaper, wepresentanalgorithm,EXALG to solvetheEXTRACT problem.Figure5 showsthedifferent

sub-modulesof EXALG. Broadly, EXALG worksin two stages.In thefirst stage(ECGM), it discoverssets

of tokensassociatedwith thesametypeconstructorin the(unknown) templateusedto createtheinputpages.

In thesecondstage(Analysis),it usestheabove setsto deducethetemplate.Thededucedtemplateis then

usedto extractthevaluesencodedin thepages.ThissectionoutlinesEXALG for our runningexample.

(Handle Invalid Equivalence Classes)

HandInv

(Differentiate Roles Using Eq Class)DiffEq

(Construct Template)ConstTemp

(Extract Value)ExVal

Equivalence Class Generation Module

Input Pages

Analysis Module

Template

Values

Schema

(ECGM)

(Differentiate Roles Using Format)

DiffForm

(Find Equivalence Classes)

FindEq

Figure5: Modulesof EXALG

In thefirst stage,EXALG (within Sub-moduleFINDEQ) computes“equivalenceclasses”— setsof to-

kenshaving the samefrequency of occurrencein every pagein à á . An exampleof an equivalenceclass

(call � á D ) is the set of � tokens O2J�� :��S@ M F J¢����A��M F����� 4FGGG)F J���� :��I@ M�P , whereeachtoken occursexactly

oncein every input page. Thereare � other equivalenceclasses.EXALG retainsonly the equivalence

classesthatarelarge andwhosetokensoccurin a large numberof input pages.We call suchequivalence

9

Page 10: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

classes� LFEQs (for Large andFrequentlyoccurringEQuivalenceclasses).For the runningexamplethere

aretwo LFEQs. The first is �½á D shown above. The second,which we call �½á u , consistsof the � tokens:

O2J @+3 M F�� 6 > 3<6��Þ6</ F��Þ? :43���� F�� 6��2: F J�� @+3 M�P . Eachtoken of � á u occursoncein Ù á D , twice in Ù á R andso on.

Thebasicintuition behindLFEQs is that it is veryunlikely for LFEQs to beformedby “c hance”. Almost

always,LFEQs are formedby tokensassociatedwith thesametypeconstructorin the(unknown)template

usedto createtheinput pages. This intuition is easilyverifiedfor therunningexamplewhereall tokensof

� á D (resp. � á u ) areassociatedwith t á D (resp. t á u ) of � á in C á 2.

For this simpleexample,Sub-moduleHANDINV doesnot play any role, but for real pagesHANDINV

detectsandremoves“invalid” LFEQs — thosethatarenot formedby tokensassociatedwith a typecon-

structor.

However, not all thetokensassociatedwith t á D arein � á D . For example,thetoken �Þ? �S6 doesnot occur

in � á D althoughit is associatedwith t á D in C á . This happensbecause�Þ? �46 hasmultiple “roles” — it is

associatedwith two typeconstructors,namely, t á D and t á u . EXALG triesto addmoretokensto LFEQs by

“dif ferentiating”rolesof tokensusingthecontext in which they occur. For example,EXALG, infers(within

Sub-moduleDIFFFORM)3 that the“role” of �Þ? �46 whenit occursin ���� ��Þ? �S6 is differentfrom the“role”

whenit occursin � 6 > 3<6��+6A/ �Þ? �S6 , using the fact that thesetwo occurrencesalwayshave differentpaths

from the root in the html parsetreesof the pages.EXALG alsoinfers (within Sub-moduleDIFFEQ) that

the role of J¢�4M whenit occursin J¢��M����� �+? �46 is different from the role, whenit occursin J¢��M!� 6 > 3<6�� ,

usingthefactthatthesetwo occurin different“positions”with respectto theLFEQ � á D . Theformeralways

occursbetweentokens J¢�"���2�ÞM and���� of � á D , andthelatterbetweentokens �#�� and� 6 > 3<6��S9 . Returning

to token �Þ? �S6 , let us refer to �+? �S6 as �+? �S6�$ whenit occursin ��#�� %�Þ? �46 and �Þ? �46& whenit occursin

� 6 > 376��Þ6</ �Þ? �S6 . We call �Þ? �S6 $ and �Þ? �S6 & dtokens(for differentiatedtokens). Now, EXALG computes

theoccurrencefrequenciesof thedtokens(againwithin FINDEQ) andchecksif they belongto any of the

existing LFEQs or form new ones.In this case,�+? �S6 $ occursexactly oncein every pageandis, therefore,

addedto � á D . Similarly, �Þ? �46 & is addedto � á u . Similarly, thedtokensformedfrom J¢�4M and J����4M areadded

to oneof �½á D and �Lá u . Thereadercanverify that theabove stepof differentiatingtokensandaddingthem

to existing LFEQs increasesthesizeof � á D (seeFigure6) from � to _�' andthesizeof � á u from � to _�( .EXALG entersthesecondstagewhenit cannotgrow LFEQs,or find new ones.In thisstage,it buildsan

2The subscript)+* (resp. )-, ) of . æWs (resp. . ækq ) hasbeenchosento correspondto the subscriptof / æps (resp. / ækq ). This also

explainswhy thereis no . æ{r — thereareno tokensassociatedwith / æ{r in 0 æ3For exposition,thesequenceof executionof EXALG describedhereis slightly differentfrom theactualsequencedescribedin

Section4. Actually, DIFFFORM executesbeforeFINDEQ assuggestedby Figure5.

10

Page 11: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

J�� :��I@ MbJ¢����A�+M&J¢�4M1 �#�� 2�Þ? �46 J����4M3 4+5 67 s{sJ¢�4M8� 6 > 3<6��I9 J����4M�J9� @ M3 4+5 67 sKr

J��:� @ MbJ����"���2��MbJ���� :��I@ M3 4+5 67 s¢qFigure6: � á D atendof ECGM module

outputtemplateC<; ��= usingtheLFEQs constructedin thepreviousstage.In orderto constructg>; , EXALG

first considersthe root LFEQ — the LFEQ whosetokensoccurexactly oncein every input page. In our

runningexample� á D is therootLFEQ. EXALG determinesthepositionsbetweenconsecutive tokensof � á Dthat arenon-empty4. A positionbetweentwo consecutive tokensis emptyif the two tokensalwaysoccur

contiguously, andnon-empty, otherwise.Therearetwo non-emptypositionsin � á D : thepositionbetween

tokens ? ( J�����M ) and @ ( J¢��M ), andbetweentokens _7_ ( J9� @ M ) and _�( ( J��:� @ M ). The positionbetweenthefirst

( J�� ::�I@ M ) and the second( J¢����A�ÞM ) token of �½á D is empty since J¢����A�ÞM always occursimmediatelyafter

J�� ::�I@ M . EXALG generatesa tupleconstructort ;á D of order ( (oneattribute for eachnon-emptypositionof

� á D ) correspondingto � á D . The first non-emptypositiondoesnot have any equivalenceclassesoccurring

within it. EXALG usesthis information to deducethat the type of the first attribute of t ;á D is B . The

secondnon-emptyposition (betweenJ9� @ M and J��:� @ M ) alwayshaszeroor moreoccurrencesof � á u . For

this case,EXALG recursively constructsthe type C á u correspondingto � á u , anddeducesthe type of the

secondattributeof t ;á D to be OC á u;P o =æ{r . It canbeverifiedthat C á u constructedby EXALG is JkBlFVB�FVBiM o =ækq . The

outputschema,g ; , producedby EXALG is the typecorrespondingto root equivalenceclass,�½á D , which is

JkBlF�O2JkBlFVB�FVBnM o =æUq P o =ækr M o =æWs .EXALG constructstheoutputtemplateC<; by generatinga mappingfrom eachtypeconstructorin g>; to

orderedsetof strings. By definition, since t ;á D is a tuple constructorof order ( , C<;{^kt ;á D e is an orderedset

of ' strings, JX��D�D�F ��DWR.F ��DWu M . EXALG constructstheabove ' stringsfrom tokensof � á D . Thestring ��D�D is

theorderedsetof tokensof � á D , thatoccurbeforethefirst non-emptyposition: J�� :��I@ M�J¢�"���2��M4GGG.J�����M . The

string ��DWR is theorderedsetof tokensbetweenthefirst non-emptypositionandthesecond.Thestrings ��DWuis similarly constructed(seeFigure6). EXALG infers that themappingC ; ^kt ;á R e is theemptystring,since

thereis no“separator”betweenconsecutive occurrencesof �½á u . ThemappingC ; ^kt ;á u e is constructedsimilar

to themappingC ; ^kt ;á D e describedearlier.

Thereadercanverify that C<;IhÎC á and g>;whÎ� á . We have not describedhow EXALG extractsthedata

values.But for thiscase,thevaluesareuniquelydefinedgiven C<; and à á , andcanbeverifiedto beequalto

O�v á D�F�v á R;F�v á u.F�v ápã P . Therefore,EXALG producesthecorrectoutputon our runningexample.

Also, we have not describedSub-moduleHANDINV in this section. HANDINV detectsand removes4Thediscussionof this stageof EXALG usesthefactthat . æps and . ækq areordered. Wewill discussthis in Section4

11

Page 12: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

“invalid” LFEQs formedin FINDEQ. It doesnot play any role for our simplerunningexamplesincethere

wereno invalid LFEQs formed.

4 EquivalenceClasses

Thissectiondefinesanequivalenceclass,anddescribeshow equivalenceclassesareusedin EXALG. Except

whenwereferto our runningexample,thediscussionof sections4, 5, and6 is in thecontext of anarbitrary

setof pagesàÛhÎO�Ù D FGGG�FkÙ H P , whereÙ ] hÐ�E^KC � F�v ] e�^�_Q` T&`aNEe . Schema� consistsof typeconstructors

O t;D�FGGG�FVtBAAP . ThepagesO�ÙwD�FGGG)FkÙ�HÞP form theinput to EXALG. Note,however, thatEXALG doesnothave

knowledgeof C , � and O�vwDFGGG F�v4HÞP .

Definition 4.1 (OccurrenceVector) Theoccurrence-vectorof atoken y , isdefinedasthevector J{z2DFGGG F�z;H+M ,where z;] is thenumberof occurrencesof y in Ù�] . ~

Definition 4.2 (EquivalenceClass) An equivalenceclass is a maximal set of tokens having the same

occurrence-vector. ~

The set of equivalenceclassesdefinea partition over the set of tokensthat occur in à . As we saw in

Section3, thereare C equivalenceclasses(including � á D and � á u ) for pagesà á of our runningexample.The

occurrencevectorof tokensin � á D is J�_.F�_.F�_.F�_ M andtheoccurrencevectorof tokensin � á u is J�_.F-(1F�_.FV×AM .We areinterestedin equivalenceclassesbecause,in practice,tokensassociatedwith thesametypecon-

structorin C , tend to occur in the sameequivalenceclass. In our running example, � of the _�' tokens

associatedwith tá D in CYá , occur in �½á D . Observe that all occurrencesof these� tokensaregeneratedby

uniquetemplate-tokens.For example,all occurrencesof token J�� ::�I@ M aregeneratedby by template-token

J�� ::�I@ M�D . Onetheotherhand,a tokenlike �Þ? �46 thatdoesnotoccurin � á D , in spiteof beingassociatedwith

t á D in C á , is generatedby morethanonetemplate-token,namely, �Þ? �S6�D and �Þ? �S6 D ã . A tokenis saidto have

uniquerole, if all theoccurrencesof thetokenin thepages,is generatedby a singletemplate-token.

Observation 4.1 Tokensassociatedwith thesametypeconstructortFE in C that haveunique-rolesoccurin

thesameequivalenceclass. ~

If, asin Observation4.1,all the tokensof anequivalenceclass,� , have uniquerolesandareassociated

with thesametypeconstructortFE of g , we saythat � is derivedfrom tFE . Wecall anequivalenceclassvalid

12

Page 13: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

ifG

it is derived from sometFE , andinvalid, otherwise.For instance,in our runningexample,theequivalence

classOBHÞ? : ? , I 3J�I3���� , K 6�L#L , M , K<? �Þ6 , N+P (with occurrencevector Jk×�F�_.FV×�FV×AM ) is invalid. Notethatthetokens

in thisequivalenceclassoccurvery infrequently— in justasinglepage.Thisobservationis valid in general

for realwebpages.Definesupportof a token, to be thenumberof pagesin which the token occurs.The

supportof anequivalenceclassis thecommonsupportof thetokensin it. Thesizeof anequivalenceclass

is thenumberof tokensin theequivalenceclass.

Observation 4.2 For real pages,an equivalenceclassof large sizeandsupportis usuallyvalid. ~

Wecall suchequivalenceclassesLFEQs(for LargeandFrequentEQuivalenceclass).Observation4.2is

truebecauseLFEQsarerarelyformedby “chance”.Two tokensrarelyhave thesameoccurrencefrequency

in a largenumberof pagesunlessthey occurin thepagesdueto thesame“reason”.Typically, thenumber

of timestwo typeconstructorsareinstantiatedis not thesamein every input page5. Therefore,tokensasso-

ciatedwith differenttypeconstructorsusuallydonotoccurin thesameequivalenceclass.Tokensgenerated

by value-tokens(e.g., HÞ? : ?;�4? 976�9 in our runningexample)usuallyoccurinfrequentlyandthereforedo not

occurin anLFEQ.

Observation4.2formsthecruxof ourextractiontechnique,whichcanbelooselysummarizedasfollows:

sincetypically LFEQs consistonly of tokensassociatedwith the sametypeconstructorin the (unknown)

input template, useLFEQs to deducethetemplateandschema.

Therearetwo main obstaclesthat we mustovercomein order to make the above ideafeasible. First,

notethatObservation 4.2 is heuristic. Thereis no guaranteethatall the LFEQs for a setof pagessatisfy

Observation 4.2. In practice,we have observed that therearealwayssomeinvalid LFEQs. Second,an

LFEQ, even if it is valid, only containstokensthathave uniqueroles,andtherefore,only containspartial

informationaboutthetemplateusedto generatethepages.We addressboththeseobstaclesin this section.

But, in orderto do so,we observe a few propertiesthatvalid equivalenceclassessatisfy.

4.1 Propertiesof Equivalenceclasses

Definition 4.3 (OrderedEquivalenceClasses) An equivalenceclassis ordered, if its tokenscanbeordered

JUy D�FGGG�F�yp[£M , suchthat,for every pageÙ�]E^�_�`aT&`aNEe , andevery pairof tokens yOE<F�yPAQ^�_Q`RQ%SUT `acfe ,5This statementis not valid if SchemaV is not in “canonical” form (e.g.,

ª{ªXW ­OY�Z � ªXW ­OYO[�­ ). However, for any schemathere

alwaysexistsa “structurallyequivalent” schemain canonicalform (e.g.,ªXW � W ­ for theexampleschemaabove).

13

Page 14: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Ù]\^`_Ba�b�c�d+eBfhghiBc-j�jlkmfnfno�b+j�o5 6B3 4y�D GGG3J4+5J6p c�g � Dp�

y�R GGG3J4B5J6p c�g � RV�y�uQGGG

^-_Ba�b�c�dBgno�j�c�b+q"c`j�j�k+fnfho�b+j�o5 6+3 4y DjGGG3J4B5J6p c�g � Dp�

y�R GGG3J4+5J6p c�g � RV�y�u

Figure7: Occurrenceandspanof anoccurrenceof equivalenceclass

1. If y9E occursat least | timesin Ù�] , the | Ësr occurrenceof y9E in Ù�] occursbeforethe | Ësr occurrenceof yPA in

Ù�] , and

2. If y E occursat least ^k|4�Ü_ e timesin Ù ] , the ^k|4�Ü_ e ÏXË occurrenceof y E in Ù ] is after the | Ësr occurrenceof

y�A in Ù4] .

Wedenotetheabove orderedequivalenceclassby JUy�D�FGGG)F�yp[mM . ~

Let �Îh JUy D�FGGG�F�y�[�M be an orderedequivalenceclass,and let the tokensof � occur z timesin pageÙ#E .Then,we saythat � occurs z timesin Ù E . The T Ësr occurrenceof � referscollectively to the T Ësr occurrence

of tokens y D�FGGG�F�yp[ in Ù#E . Thespanof the T Ësr occurrenceof � in Ù#E is the text startingat (andincluding)

T Ësr occurrenceof y�D andendingat (andincluding) T Ësr occurrenceof y�[ in Ù#E . Thespanof eachoccurrence

of � is sub-divided into ^Uc � _ e positions, namely, t>u�v)^�_ e FGGG)FFtwu�v.^Uc�� _ e . t>u�v ^9TÞei^�_�`xTyS�cfe of T Ësroccurrenceof � in Ù E denotesthetext startingat (but not including) T Ësr occurrenceof y�A andendingat (but

not including) T Ësr occurrenceof y�A �ED . Figure7 illustratesthe spanandand twu�v;^UT�e£^�_�` T�`z(<e for two

occurrencesof anequivalenceclass JUy DF�y�R.F�y�u M in apage.

Definition 4.4 (Nestingof Equivalenceclasses) A pair of equivalenceclasses,�w] and �{E is nestedif,

1. Thespanof any occurrenceof �½] doesnotoverlapwith thespanof any occurrenceof �{E , or

2. Thespanof all occurrencesof �{E is within twu�v;^ ÙIe of someoccurrenceof �½] for somefixed Ù ; or vice-

versa.

A setof equivalenceclassesO���DFGGG)F��wHÞP is nestedif every pair of equivalenceclassesof thesetis nested.

~

Observation 4.3 A valid equivalenceclassis orderedanda pair of twovalid equivalenceclassesis nested.

~

It canbeverifiedthat �½á D and �½á u areordered.Theset O��Eá D F��½á u P is nestedsincethespanof eachoccurrence

of � á u is alwayswithin t>u�v)^9�<e of anoccurrenceof � á D .

14

Page 15: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

4.2|

Handling Invalid Equivalenceclasses

As we mentionedearlier, therearealwayssomeinvalid LFEQs thatareformed,for mostinput setsof web

pages.However, typically invalid LFEQsareeithernotorderedor notnestedwith respectto otherLFEQs.

ModuleHANDINV takesasinput asetof LFEQs (determinedby FINDEQ), detectstheexistenceof invalid

LFEQs usingviolationsof orderedandnestingproperties,and“processes”theinvalid LFEQs found— it

discardssomeof theLFEQs completely, andbreaksothersinto smallerLFEQs. Theoutputof HANDINV

is an orderedsetof nested(with high probability valid, seeSection6) LFEQs. A detaileddescriptionof

HANDINV is in AppendixA.

4.3 Differ entiating rolesof tokens

Recall that the fundamentalideaof EXALG is to useLFEQs to discover the templatetokens. However,

typically an LFEQ only containstokensthat have uniqueroles. Therefore,not all template-tokenscan

be discoveredusing LFEQs. This sectionpresentsa powerful technique,called differentiating roles of

tokens, that is usedin EXALG to discover a greaternumber(in practice,almostall) of template-tokens.

Briefly, whenwedifferentiaterolesof tokens,we identify “contexts” suchthattheoccurrencesof a tokenin

differentcontexts above necessarilyhave differentroles. Thenotionof a context shouldbeclearwhenwe

presentthetwo techniquesfor differentiatingrolesusedin EXALG.

The first techniquefor differentiatingroles usesthe html formatting informationof input pages. An

html pagecanbeequivalentlyviewedasa parsetree.An occurrence-pathof a page-token is thepathfrom

the root to the page-token in the parsetree. For instance,the occurrence-pathof the first J����4M in Ù4á D is

J�� ::�I@ M�J¢�"���2�ÞM�J�����M 6.

Observation 4.4 In practice, twopage-tokenswith differentoccurrence-pathshavedifferentroles. ~

Equivalently, Observation 4.4 assertsthat all page-tokens generatedby a template-token have the same

occurrence-path.It canbeverifiedthatObservation4.4 is valid for our runningexample.In thefull version

of thepaper, weusewell-formedpropertiesof html pagesto arguethatObservation4.4is truefor real-world

pagesandtemplates.

Thesecondtechniquefor differentiatingrolesusesvalid equivalenceclasses,andis basedon thefollow-6Thereis a bit of abuseof notationhere. The

ª¬« �V�.�­ in theoccurrence-pathabove doesnot refer to start-tag,but to thehtml

“element”in theparsetree.

15

Page 16: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

ingG

observation.

Observation 4.5 Let � bea valid equivalenceclassderivedfrom t�] . Theroleof anoccurrenceof a token y ,which is outsidethespanof anyoccurrenceof � , is different fromtherole of an occurrencewhich is within

thespanof someoccurrenceof � . Further, the role of an occurrenceof y , which is within twu�v)^k|{e of some

occurrenceof � , is different fromtherole of an occurrenceof y , which is within t>u�v)^Ucfei^Uc~}h�|Xe of some

occurrenceof � . ~

Equivalently, Observation4.5assertsthatall page-tokensgeneratedby atemplate-tokenoccurwithin afixed

t>u�v)^ ÙIe of � , or outsidethespanof any occurrenceof � . Observation4.5canbeprovedin astraight-forward

way basedon thedefinitionof our model. In our runningexample,all page-tokensgeneratedby template-

token J¢��M�u occurin twu�v)^9(<e of someoccurrenceof � á D , andoutsidethespanof any occurrenceof � á u .Wedifferentiaterolesof atokenby identifyingasetof contextsfor thetokenusingObservation4.4or 4.5,

suchthat, eachoccurrenceof the token is within someuniquecontext of the set;and,occurrencesof the

tokenin differentcontextshasdifferentroles.Thesetof contexts is thesetof occurrence-pathsof thetoken,

if we useObservation4.4,andthesetof positionsof � , if we useObservation4.5with a valid equivalence

class� . Weusethetermdtoken(for differentiatedtoken)to jointly referto a tokenandacontext, identified

by differentiation.For example,if wedifferentiatetoken J¢�4M in our runningexampleusingObservation4.4,

( dtokensareformed:onecorrespondingto theoccurrence-path(context) J�� ::�I@ M�J¢�"��A�+M�J¢��M , andtheother

to J�� :��S@ M�J¢����2��M�J9� @ M�J @+3 M�J¢��M . Instead,if we differentiateusingObservation 4.5 with � á D , ' dtokensare

formed: the first correspondsto context definedby t>u�v;^9(<e of occurrencesof �½á D , the secondto context

definedby t>u�v)^9'<e , andthethird to context definedby t>u�v)^9�<e .A dtoken is almostlike a token (a token is a dtoken with no context). We extendthe notationdefined

for tokensto dtokens. The following is a collectionof statementsandnotationrelatedto dtokens: Each

occurrenceof adtokenis generatedby atemplate-tokenor avalue-token;by definition,eachtemplate-token

generatesauniquedtoken;adtokenis saidto haveauniquerole if all occurrencesof thedtokenis generated

by asingletemplatetoken;apagecanbeviewedasastringof dtokens.

4.3.1 EquivalenceClassesand dtokens

For exposition,we have definedequivalenceclassesassetsof tokens. In fact,EXALG workswith equiva-

lenceclassesdefinedusingdtokens.Mostof thediscussionin thissectionadmitsastraightforwardgeneral-

izationfrom tokensto dtokens.Were-statethemainideasin termsof dtokens.

16

Page 17: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

An occurrencevectorof adtokenis thevectorof occurrencefrequenciesof thedtokenin theinputpages.

An equivalenceclassis amaximalsetof dtokenshaving thesameoccurrencevector. Thedtokensgenerated

by tokensassociatedwith the sametype constructortFE in C andhaving uniquerolesoccur in the same

equivalenceclass(generalizationof Observation 4.1). Observation 4.2 andObservation 4.3 arealsovalid

for equivalenceclassesdefinedusingdtokens.Section4.2canalsobegeneralizedto dtokens.Finally, the

rolesof dtokensitself couldbefurtherdifferentiatedusingoneof thetwo techniquesdescribedearlier. As an

illustrationof thelaststatement,considerthethreedtokensformedby differentiatingrolesof token J¢�4M using

Observation 4.5 with � á D . The third dtoken (onewith context t>u�v ^9�<e of � á D ) canbefurtherdifferentiated

into ' new dtokensusingObservation 4.5 with a differentequivalenceclass�½á u . For instance,thefirst of

the ' new dtokenscorrespondsto context definedby t>u�v;^9�<e of �½á D , and twu�v)^�_ e of �½á u .

4.4 EquivalenceClassGeneration Module

The input to ECGM is thesetof input pagesà . Theoutputof ECGM is a setof LFEQs of dtokensand

pagesà representedasstringsof dtokens.

First, Sub-moduleDIFFFORM differentiatesrolesof tokensin à usingObservation4.4,andrepresents

theinput pagesà asstringsof dtokensformedasa resultof thedifferentiation.Thesub-modulesFINDEQ,

HANDINV and DIFFEQ iteratein a loop. In eachiteration, the input pagesarerepresentedasstringsof

dtokens. This representationchangesfrom oneiterationto otherbecausenew dtokensareformedin each

iteration. FINDEQ computesoccurrencevectorsof thedtokensin the input pagesanddeterminesLFEQs.

FINDEQ needstwo parameters,SIZETHRES and SUPTHRES, to determineif an equivalenceclassis an

LFEQ. Equivalenceclasseswith sizeandsupportgreaterthanSIZETHRES andSUPTHRES, respectively,

areconsideredLFEQs. HANDINV processesLFEQs determinedby FINDEQ, asdescribedin Section4.2

andproducesanestedsetof orderedLFEQs. DIFFEQ optimisticallyassumesthateachLFEQ producedby

HANDINV is valid, andusesObservation 4.5 to differentiatedtokens. If any new dtokensareformedasa

result,it modifiestheinputpagesto reflecttheoccurrenceof thenew dtokens,andthecontrolpassesbackto

FINDEQ for anotheriteration.Otherwise,ECGM terminateswith thesetof LFEQs outputby HANDINV,

andthecurrentrepresentationof input pagesastheoutput.

Onour runningexample,with SIZETHRES andSUPTHRES bothsetto ' , ECGM runsfor two iterations,

andproducestwo equivalenceclasses,� �á D and � �á u , of sizes _�' and _�( , respectively. The orderedsetof

tokenscorrespondingto dtokensin � �á D is J�J�� :��S@ M F J¢����A�ÞM F J¢�4M F� �#�� YFGGG�F J����"��A��M F J���� ::�I@ M�M , andthatof

� �á u is J�J @�3 M F J¢�4M F�� 6 > 376��+6A/ FGGG)F J�� @�3 M�M .

17

Page 18: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Weconcludethissectionwith aremarkonrepresentationof dtokens.It mightseemextremelycomplex to

storecontext informationof a dtoken. In fact,it is not necessaryto explicitly storeany context information

of a dtoken. Context informationof a dtoken is implicitly storedin its occurrencesin the pages.In our

prototypeimplementationwe usedintegers to representdtokens, and maintaineda mappingfrom each

dtokenintegerto thetoken(a characterstring)correspondingto thedtoken.

5 Building Templateand Extracting Values

This sectiondescribesANALYSIS moduleof EXALG. The input of ANALYSIS moduleis a setof LFEQs

andasetof pagesrepresentedasstringsof dtokens,andtheoutputatemplateandasetof values.ANALYSIS

moduleconsistsof two sub-modules:CONSTTEMP andEXVAL (Figure5). We do not describeEXVAL in

thispapersinceit is reasonablystraightforward to derive it.

5.1 Notation

We needthe following algebraof templatesto describethe recursive constructionof templatesin CONST-

TEMP: ^k¥+e If C D�� s F�C R�� r FGGGVC [���� are templates, and � D F � R FGGG F � [n�ED are strings, C � hJKCLD�F�CIR.FGGG�F�CI[QM-� 7 s�� 7 r`������� � 7 �8� s�� denotesa template,where gâh JXgED�F gIR;FGGG�gIH�M o . C is definedby map-

pings C¡^kt�e�h JX��D�F ��R;FGGG F �b[n�ED%M , and C¡^ktBA.e�h CS]�^ktBA7e , for all tBA in gS]�^�_ `�T�`�cfe ; ^{¨e If C �J�] is a

template,and � a string, C � h�OCS]X�J��P)Ó denotesa template,where gah�O;gS]pP o . C is definedby mappings

C¡^kt�e�hjJk� M and C¡^ktBA7e&hÍCS]V^ktBA;e , for all tBA in gS] ; ^k}e½C ���] (for schemaXgI]{e�Ö ) and ^KC ���] Ø7C �`�E e (for schema

^XgS]&Ø7g�E e ) aresimilarly defined; ^O��eYC B denotesthetrivial templatefor thebasictype B .

t>u�v)^ ÙIe of an orderedequivalenceclass �Íh JO�+D�FF�2R.FGGG FF�<H+M is definedto be emptyif dtokens ��� and

���)�ED alwaysoccurcontiguously. An equivalenceclassis definedto beemptyif all its positionsareempty. In

our runningexample,both � �á D and � �á u arenon-empty:t>u�v)^9?<e and t>u�v)^�_×Ae of � �á D arenon-empty;t>u�v)^9�<e ,t>u�v)^9�<e and t>u�v ^�_7_ e of � �á u arenon-empty.

For an occurrenceof equivalenceclass � anda non-emptyt>u�v)^ ÙYe of � , t>u�v`�#���-�s���4^n��FkÙIe is the string

formedby concatenatingtokensandequivalenceclasses7 thatoccurin t>u�v)^ ÙIe of thatoccurrenceof � , but

do not occurwithin the spanof someotherequivalenceclass � ; whosespanis alsowithin t>u�v;^ ÙIe of the

aboveoccurrenceof � . As anexample,t>u�v`�#���-������^n� �á D F�_×Ae of theonly occurrenceof � �á D in à\R is thestring7More formally, someuniquesymbolcorrespondingto eachequivalenceclass.We usethenameof theequivalenceclass(e.g.,

. �æps , . �ækq ) asits symbol.

18

Page 19: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

“ � �á u � �á u ”. Althoughadtokenformedfrom token ��? :43J��� occursin t>u�v)^�_×Ae of theabove occurrence� �á D , it

is notpresentin t>u�v`�#���`�s���4^n� �á D F�_×Ae sinceit is within thespanof anoccurrenceof � �á u .

5.2 CONSTTEM P

Let O��&DF��ER;FGGG F��w[�P bethe input setof LFEQs of ANALYSIS module.For every non-emptyequivalence

class�w] , CONSTTEMP recursively constructsa template,� � � , correspondingto �½] , anda template,C � � � � ,correspondingto eachnon-emptyposition Ù of �E] . The output templateof CONSTTEMP is the template

correspondingto therootequivalenceclass— theequivalenceclasswith occurrencevector J�_.F�_.FB�B�B� M 8.

The templateC � � is definedin termsof C � � � � . Let �w]Úh JO�+D�FF�AR<FGGG�FF���kM , and let yp]�^�_a` T ` |Xe be

the token correspondingto dtoken � ] . Let � ] ^�_ ` T `z�<e denotethe non-emptypositionsof � ] . Define

�Q�x_ strings � ] s F � ] r FGGG�F � ]�� � s asfollows: � ] s h�y D GGGVy Ï s , � ] � h y ÏF��E��YDp�¢�ED GGG y Ï � ^�_yS�Qa`z�<e , and

�����EDmh¤y�Ï � GGG�y�� . Thestrings �b] � ^�_�`�Q `��l�Ð_ e just partition thetokens y�DFGGG F�y�� usingthenon-empty

positionsof �w] . ThetemplateC � � is definedas: C � � hxJKC � � � Ï s FGGG�F�C � � � Ï � M-� 7 sF� 7 r-������� � 7 � � sl� .To constructtemplateC � � � � , CONSTTEMP checksif thesetof strings, t>u�v`�#���`�s����^n�½]�FkÙIe , corresponding

to every occurrenceof �½] , hassomerecognizablepattern. Table1 lists somepatternsthat our prototype

implementationof CONSTTEMP used,and the definition of C � � � � for eachpattern,if the set of strings,

t>u�v`�#���-�����4^n�w]�FkÙIe , hasthat pattern. In our runningexample, t>u�v`�#���-�s���4^n� �á D F-?<e is a string of dtokens,for

Pattern 0 .��O  ¡* . � . � ����� ý�0 .�¢ þF£¤ . �m¥ . �m¥ ������. � ý�0 . ¢ þ`¦, . � or .#§ 0 . ¢!¨ 0 .©ª ¹ or . � Å 0 . ¢ È9«¬

stringof dtokensandemptyequivalenceclasses 0 W­Unknown 0 W

Table1: Patternsusedin definitionof C � � � �

every occurrenceof � �á D , which matchesPattern � of Table 1. Therefore, C � �æps � ® is definedto be C B .

t>u�v`�#���-�����4^n� �á D F�_×Ae is alwaysa string of × or moreoccurrencesof “ � �á u ”, which matchesPattern _ , and

henceC � �æWs � Dl¯ is definedto be OC � �ækq P�° . Thereadercanrecursively constructC � �ækq andverify thattheoutput

template,C � �æWs producedby CONSTTEMP is thesameasthecorrecttemplateC á .8We canalwaysensurethatsuchanequivalenceclassexistsbeprependingandappendinggreaterthanSIZETHRES numberof

dummytokensto beginningandendof eachpagerespectively

19

Page 20: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Experiments

EXALG makesseveralassumptionsregardingthetemplateandvaluesusedto generateits input pages.We

summarizetheimportantassumptions:

A1: LFEQs arealwaysformeddueto “deterministic”causes.By this we meanthatLFEQs areeithervalid,

or if they arenot, they areformeddueto oneof thecausesanticipatedby HANDINV, sothat theoutput

LFEQs of HANDINV arevalid.

A2: Thereis a sufficient numberof dtokenswith uniquerolesafterdifferentiationusingObservation 4.4 to

bootstrapthe processof forming LFEQs and differentiatingusing the LFEQs to discover additional

dtokens. Also, asa resultof the iterative differentiation,eventually, all dtokensgeneratedby template-

tokenshave uniquerolesandbecomepartof thevalid equivalenceclasses.

A3: For eachtupleconstructortFE in � , a largenumberof template-tokensis associatedwith tFE in C , and tFE is

instantiatedanon-zeronumberof timesin a largenumberof pages.Therefore,if AssumptionA2 holds,

avalid LFEQ derivedfrom tFE is formed.

A4: Stringsassociatedwith tupleconstructorsarenon-empty.

Westudyexperimentallyk¥�e towhatextenttheassumptionsaresatisfied,and ^{¨e theimpactontheoutput

of EXALG whensomeof theassumptionsabove arenot satisfied.For thepurposeof experimentationwe

have built adataextractionsystembasedon EXALG.

We describethe resultsof our experimentsfor C representative collectionsof input pages.Of these,?collections( _��R? of Table2) wereobtainedfrom theROADRUNNER site[8]. Theremaining' collections

werecrawled from well-known datarich siteslike E-bay[2] andNetflix [4]. Thecrawled web-pageswere

usuallythefirst few searchresultsfor somesearchquery. Theschemaof thesecollectionsis morecomplex

(larger numberof type constructors),and “less-structured”,i.e., hasa large numberof disjunctionsand

optionals.In Section7 we arguethat thetechniquesusedin [8] do not work well for collectionswith such

complex schema.

Recall that EXALG usestwo parameters— SIZETHRES and SUPTHRES. In our experimentswe set

SIZETHRES to ' . We wantedthe SIZETHRES to be assmall aspossible,sinceany type constructorwith

lessthanSIZETHRES template-tokensassociatedwith it fails to bediscoveredby EXALG(AssumptionA3).

Wedid not usethevalue ( sincethis leadsto theformationof a lot of invalid equivalenceclassesinvolving

20

Page 21: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

start-tagsand their matchingend-tags.For SUPTHRES we useda value ×�G²(�� times the numberof input

pages.Thisvaluewasempiricallydeterminedto begood.

For eachcollection � above,we manuallygeneratedtheschemagS[ of thevaluesencodedin eachpage

of thecollection,usingthesemanticsof theapplication.We ignorednon-text valueslike imagesandand

valuesoccurringwithin tagattributes(e.g., urls) whengeneratinggI[ . Let g á denotetheoutputschemaof

ourautomaticsystem.Weconsideredeachleafattribute �l[ in gS[ , andclassifiedit into oneof thefollowing

' categoriesto reflecthow successfulour systemwasin extractingvaluesof �l[ .

³ Correct: ��[ wasclassifiedascorrectif thereexisteda leafattribute � á in g á suchthatfor eachpagein

� , thesetof valuesof � [ in thepageis equalto thesetof valuesof �má in thepage.

³ Partially Correct: ��[ wasclassifiedaspartially correctif it wasnot correctand thereexisteda leaf

attribute � á in theextractedschemasuchthat for eachpagein � , eachvalueof �l[ occurredaspartof

avalueof � á in thatpage.

³ Incorrect: ��[ wasclassifiedasincorrectif it wasneithercorrectnorpartially correct.

As an illustration of how an attribute in gI[ would be classifiedas incorrect,considera hypothetical

collectionof book pages,containingan attribute, book-title. Also assumethat thereis a token (word) ´thatoccursexactly oncein thetitle of every pagein thecollection. In this case,EXALG will push ´ to the

template,andwill extract two attributescorrespondingto the book title, (correspondingto the text of the

booktitle beforeandaftertheword ´ ), makingtheattributebook-titlein ourmanualschemaincorrect.

Weuseaninstancefrom ourexperimentsto illustratepartiallycorrectattributes.Eachmovie pagein our

Netflix collectionhadanattribute, movie-title. Therewasalsoanoptionalattribute, local-name,for some

foreign languagemovies. However, sincethe local-nameappearedin a very smallnumberof input pages

(AssumptionA3 fails), EXALG did not recognizetheoptionalattribute. Instead,it combinedtheoptional

attribute with the local attribute whenever the formeroccurredin a page.For this case,we classifiedboth

theattributes,movie-title andlocal-name,aspartially correct.

The above classificationwas donewith a view on assumptionsA1-A4. It can be shown that if As-

sumptionA1 is satisfiedthennoneof the attributeswould be incorrect,irrespective of whetherthe other

assumptionsaresatisfiedor not. Partially correctattributesresultif assumptionsA2-A4 arenot satisfied.

Wehaveplacedthedetailedresultsof ourexperimentsat theURL [3]. Theabove link contains,for each

collection,the setof input pagesin the collection,the templatediscoveredby our system,andthe values

21

Page 22: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

eµ xtractedfor eachinput page. It alsohasa log of the executionof our systemfor eachinput collection.

Thelog containsthe informationlike thesetof LFEQs formed,thesetof strings twu�vm�#���-�s���S^n�&FkÙIe andthe

patternthatmatchestheset(Section5), for every non-emptypositionÙ of anLFEQ � , andsoon. Finally,

the above link containsdetailsof our evaluation— the manualschemagS[ that we constructedfor each

input collection,andfor eachattribute �l[ in gI[ , thecategory thatwe assigned��[ to, andthereasonfor

doingso.

Index Collection No. pages No. of leaf attributes Correct Partially Correct Incorrect1 AmazonCars 21 13 13 0 02 AmazonPopArtist Lists 19 5 5 0 03 Baseballplayers 10 7 7 0 04 rpmpackages 20 6 6 0 05 UEFA nationalteamsinfo 20 9 9 0 06 UEFA teamplayers 20 2 2 0 07 E-bay 50 22 18 4 08 Netflix 50 29 23 6 09 ATP TennisPlayerProfiles 32 35 33 2 0

Table2: ExperimentalResults

Table2 summarizestheexperimentalresults.Specifically, it shows for eachcollection � , thesizeof the

collection,thetotal numberof leaf attributesin gS[ , andthedistribution of theseattributesinto oneif the 'categoriesdescribedearlier.

The resultsin Table 2 clearly demonstratethat EXALG is very effective in correctly extracting data

from web pagecollections. EXALG correctly extractedthe valuesof most of the attributesencodedin

the input setof pages.For the morecomplex collectionstherewerea few partially correctattributes. As

we mentionedearlier, partially correctattributesresultwhenEXALG extractsdataat a “granularity” less

than the bestpossible. This happens,for example,when EXALG combinestogetheradjacentattributes,

or includessomewordsthatarepartof the templatewithin the extracteddata. Although partially correct

attributesarelessdesirablethancorrectattributes,they arebetterthanincorrectattributes.This is because

onecanpotentiallyconvert partiallycorrectattributesto correctattributesby developingmoresophisticated

techniquesfor (post)processingeachleaf attributeof theschemaextractedby EXALG. We planto develop

suchtechniquesaspartof our futurework. Finally, theabsenceof incorrectattributesindicatesthatall the

input collectionssatisfiedAssumptionA1.

Ourexperimentalresultsalsoillustrateanotherdesirablepropertyof EXALG — theimpactof thefailed

assumptionsis localized. For example,if AssumptionA3 is not valid for sometype constructort E , then

EXALG fails to extracttheattributesof tFE , andsometimesattributesof “surrounding”typeconstructors(see

our exampleof partially correctattributesinvolving Netflix movie-title, local-title above). In thefull paper,

22

Page 23: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

we* provide moredetaileddescriptionwhy theimpactof failedassumptionsis localized.

Therunningtimeof EXALG dependson thenumberof iterationsof theloop involving thesub-modules

FINDEQ, HANDINV andDIFFEQ. Therunningtime of eachsub-moduleis linear in theinput size.For all

the input collectionsEXALG thenumberof iterationswaslessthanten. Hence,for all practicalpurposes,

EXALG is linearin theinputsize.Sofar, EXALG thatwehavedescribed,worksontheentireinputcollection

of pages.Whentheinput collectionis large,we canmodify EXALG to work in a “wrapper-mode.” In this

mode,insteadof usingtheentirecollectionto generatethetemplate,EXALG usesa smallsampleof pages

to generatethetemplate,andthenusesthegeneratedtemplateto extractthedatafrom thefull collection.

7 RelatedWork

Most of the relatedwork usea “wrapper-based”systemfor extractingdata. In a wrapperbasedsystem,

extractionis a two stepprocess.In thefirst step,a wrapperfor thegivensetof pagesis generated.In the

secondstep,thewrapperis usedto extract thedatafrom thewebpages.A wrapperis just a programthat

extractsthedatafrom thesetof pages.Notethatawrapperis specificto thesetof pages— thewrapperfor

thepagesof onewebsitewill bedifferentfrom thewrapperfor thepagesof asecondwebsite. In Hammer

etal. [12] ahumanexpressesthelocationof thedatato beextractedasdeclarative rules.A programconverts

theserulesinto awrapper. In [14, 13, 16,6] trainingexamplesconsistingof thepagesandthedatathatoccur

in thosepagesareusedto “learn” the wrapperusingvariousmachinelearningtechniques.All the above

wrapper-basedsystemclearlyrequiremorehumaninput eitherin theform of rulesor learningexamples.If

thetemplatechanges,ashappensfrequentlyin practice,new rulesof examplesmaybeneeded.

DIPRE[7] is aninstanceof anonwrapper-basedsystemthatuseslearningexamples.But theinteresting

aspectof DIPREis that it tries to extract relationsfrom theentirewebandnot just a particularwebsiteas

in our case.But their techniquesapply only to simple“relations” while we aim to extractdatawith more

complex schema.

Our work is mostcloselyrelatedto theROADRUNNER project[10, 8]. ROADRUNNER usesa modelof

pagecreationusingatemplatethatis verysimilar to ours.ROADRUNNER startsoff with theentirefirst input

pageasits initial template.Then,for eachsubsequentpageit checksif thepagecanbe generatedby the

currenttemplate.If it cannotbe,it modifiesits currenttemplatesothatthemodifiedtemplatecangenerate

all thepagesseensofar. Thereareseveral limitationsto theROADRUNNER approach:

23

Page 24: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

1 ROADRUNNER assumesthatevery HTML tag in theinput pagesis generatedby thetemplate.This as-

sumptionis crucialin ROADRUNNER to checkif aninputpagecanbegeneratedby thecurrenttemplate.

This assumptionis clearly invalid for pagesin many web-sitessinceHTML tagscanalsooccurwithin

datavalues. For example,a book review in Amazon[1] could containtags— the review could be in

severalparagraphs,in whichcaseit containsJ�¶4M tags,or somewordsin thereview couldbehighlighted

using J 3 M tags.WhentheinputpagescontainsuchdatavaluesROADRUNNER will eitherfail to discover

any template,or produceawrongtemplate.

2 ROADRUNNER assumesthat the “grammar” of the templateusedto generatethe pagesis union-free.

This is equivalentto theassumptionthat thereareno disjunctionsin the input schema.Theauthorsof

ROADRUNNER themselveshave pointedin [8] that this assumptiondoesnot hold for many collections

of pages.Moreover, astheexperimentalresultsin [8] suggest,ROADRUNNER might fail to produceany

outputif therearedisjunctionsin theinput schema.

3 WhenROADRUNNER discoversthat thecurrenttemplatedoesnot generateaninput page,it performsa

complicatedheuristicsearchinvolving “backtracking”for a new template.This searchis exponentialin

thesizeof theschemaof thepages.It is, therefore,not clearhow ROADRUNNER would scaleto web

pagecollectionswith a largeandcomplex schema.

8 Conclusion

This paperpresentedanalgorithm,EXALG, for extractingstructureddatafrom a collectionof webpages

generatedfrom acommontemplate.EXALG first discoverstheunknown templatethatgeneratedthepages

andusesthe discoveredtemplateto extract the datafrom the input pages. EXALG usestwo novel con-

cepts,equivalenceclassesanddifferentiatingroles,to discover the template.Our experimentson several

collectionsof webpages,drawn from many well-known datarich sites,indicatethatEXALG is extremely

good in extractingthe datafrom the web pages.Anotherdesirablefeatureof EXALG is that it doesnot

completelyfail to extractany dataevenwhensomeof theassumptionsmadeby EXALG arenot metby the

input collection.In otherwordstheimpactof thefailedassumptionsis limited to a few attributes.

Thereareseveral interestingdirectionsfor futurework. Thefirst directionis to develop techniquesfor

crawling, indexing andproviding queryingsupportfor the “structured”pagesin the web. Clearly, a lot

of information in thesepagesis lost whennaive key word indexing, andsearchingis used. We indicate

two specificproblemsin this direction. First, how do we automaticallylocatecollectionsof pagesthat

24

Page 25: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

are· structured?Second,is it feasibleto generatesomelarge “database”from thesepages?Any technique

for solving the latter problemhasto be much lesssophisticatedthanthe onediscussedhere,possiblyby

sacrificingaccuracy for efficiency. Also whenwe work at the scaleof the entireweb we might be able

to leveragethe redundancy of the dataon the web as in Brin [7]. The seconddirection of work is to

developtechniquesfor automaticallyannotatingtheextracteddata,possiblyusingthewordsthatappearin

thetemplate.

References

[1] Amazon.com.http://www.amazon.com.

[2] ebay.com.http://www.ebay.com.

[3] Experimentalresults.http://www- db.stanford.edu/˜arvind/ex tract/ index .html .

[4] Netflix. http://www.netflix.com.

[5] SergeAbiteboul,RichardHull, andVictor Vianu. Foundationsof Databases. AddisonWesley, Reading,Mas-sachussetts,1995.

[6] G. Barish,Y. S.Chen,D. DiPasquo,andC. A. Knoblock. Theaterloc:Usinginformationintegrationtechnologyto rapidlybuild virtual applications.In Proc.of the2000Intl. Conf. onDataEngineering, pages681–682,2000.

[7] Sergey Brin. Extractingpatternsandrelationsfrom theworld wideweb. In WebDBWorkshopat 6thInternationalConferenceon ExtendingDatabaseTechnology, EDBT’98, 1998.

[8] Valter Crescenzi,GiansalvatoreMecca,andPaolo Merialdo. Roadrunner:Towardsautomaticdataextractionfrom largewebsites.In Proc.of the2001Intl. Conf. on Very LargeData Bases, 2001.

[9] HectorGarcia-Molina,YannisPapakonstantinou,Dallan Quass,AnandRajaraman,YehoshuaSagiv, Jeff Ull-man,andJenniferWidom. The tsimmisproject: Integrationof heterogenousinformationsources.Journal ofIntelligentInformationSystems, 8(2):117–132,1997.

[10] St’ephaneGrumbachandGiansalvatoreMecca.In searchof thelostschema.In Proceedingsof theIntl. Confer-enceof DatabaseTheory(ICDT), 1999.

[11] LauraM. Haas,DonaldKossmann,EdwardL. Wimmers,andJunYang.Optimizingqueriesacrossdiversedatasources.In Proc.of the1997Intl. Conf. on VeryLarge DataBases, pages276–285,1997.

[12] JoachimHammer, HectorGarcia-Molina,JunghooCho, Arturo Crespo,andRohanAranha. Extractingsemistructureinformationfrom theweb. In Proceedingsof the Workshopon Managementof SemistructuredData,1997.

[13] C. N. HsuandM. T. Dung. Generatingfinite-statetransducersfor semi-structureddataextractionfrom theweb.InformationSystemsSpecialIssueon SemistructuredData, 23(8),1998.

[14] N. Kushmerick,D. Weld,andR. Doorenbos.Wrapperinductionfor informationextraction.In Proc.of the1997Intl. Joint Conf. on Artificial Intelligence, 1997.

[15] Alon Levy, AnandRajaraman,andJoannJ.Ordille. Queryingheterogeneousinformationsourcesusingsourcedescriptions.In Proc.of the1996Intl. Conf. onVery LargeData Bases, pages251–262,1996.

[16] I. Muslea,S. Minton, andC. A. Knoblock. A hierarchicalapproachto wrapperinduction. In ProceedingsofThird InternationalConferenceon AutonomousAgents, Seattle,WA, 1999.

[17] Jeffrey D. Ullman. Information integration using logical views. In Proc. of the Internation ConferenceonDatabaseTheory(ICDT), pages19–40,1997.

25

Page 26: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Handling Invalid Equivalenceclasses

We now presenta procedureHANDINV, which is a sub-moduleof EXALG, whosegoal is to eliminate

invalid equivalenceclasses.HANDINV usesthe orderedandnestingpropertiesof equivalenceclassesto

detecttheexistenceof invalid equivalenceclasses.It subsequently“processes”theinvalid LFEQs foundby

sometimesdiscardingthem,andsometimesbreakingtheminto smallerLFEQs.

It follows from Observation 4.3 that any equivalenceclassthat is not orderedis not valid, andfor any

pair of equivalenceclassesthatarenot nested,at leastoneof themis not valid. HANDINV worksunderthe

hypothesisthatinvalid equivalenceclassesexposetheirpresenceby not satisfyingtheorderednessproperty

or thenestingpropertyor both(which is theconverseof Observation4.3).

The input to HANDINV is a setof equivalenceclasses,possiblycontainingequivalenceclassesthatare

notordered,or pairsof equivalenceclassesthatarenotnested.Theoutputis annestedsetof orderedequiv-

alenceclasses.BeforepresentingHANDINV we examinesomecommoncausesfor formationof invalid

equivalenceclasses.Theseprovide theinsightfor theprocedurethatfollows.

A.0.1 Causesfor formation of invalid equivalenceclasses

In practicetherearethreeprimarycausesfor theoccurrenceof invalid equivalenceclasses.

1 Correlationof instantationof twoor moretypeconstructors. Thetokenshaving uniquerolesandassoci-

atedwith typeconstructors(two or more)having thesamefrequency of instantiationin every inputpage

form aninvalid equivalenceclasses.In practice,this happensbecauserelatedinformationis sometimes

physicallydistributed in non-contiguouspartsof an outputpage.For example,an Amazonbook page

hasanoptionalattribute for theratingof thebook,andanoptionalsetof customerreviews of thebook.

Any bookthathasoneor morecustomerreviews hasa rating,andtheratingattribute is absentif theset

of customerreviews doesnotexist.

2 FrequentlyOccuringtuples.In somecases,thesame(sub-value)tuplerepeatedlyoccursin many differ-

entpages.Thetokensoccuringin thetupletendto form anequivalenceclasssinceall thetokensoccur

whenever thetupleoccursandnoneotherwise.For example,aNetflix movie pagehasinformationabout

relatedmovies. Eachrelatedmovie is a tuple consistingof the movie name,rating, andothersimilar

information.A movie couldberelatedto severalothermovies,andthereforethetuplecorrespondingto

themovie appearsin thepagescorrespondingto all themoviesthatit is relatedto.

A-1

Page 27: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

3 Overlapin thetokensassociatedwith different typeconstructors. Considertwo typeconstructors¹»º and

¹B¼ of ½ . Let ¾ ºB¿ ¾P¼ ¿BÀBÀBÀ be thesetof tokensthatareassociatedwith both the type constructors.Theset

of tokens ¾`º ¿ ¾ ¼ ¿BÀBÀBÀ form anequivalenceclassin a setof pages,if they arenot associatedwith any other

typeconstructorin theschema,or generatedby any value-token. Considerour runningexample,assume

thatweuseasizethresholdof Á to determineif anequivalenceclassis anLFEQ. In thiscase,in addition

to theequivalenceclassesÂÄÃ�º and ÂÄÃ�Å , thereis anadditionalequivalenceclassÂ1Æ : ÇÈ�É{Ê ¿ È�Ë�É"Ê-Ì whichhas

anoccurrencevector È9Í ¿-Î#¿ Í ¿ Á�Ê . ÂÄÆ is anexampleof anequivalenceclassformeddueto theoverlapof

thetokens, È�É{Ê and È�Ë�É{Ê , thatareassociatedwith both ¹BÃ�º and ¹BÃ�Å of ½8à .

Wecall invalid equivalenceclassesformedby thefirst two causesinvalid equivalenceclassesof type Ï , and

invalid equivalenceclassesformedby causesÐ andothersinvalid equivalenceclassesof type Ñ .

The invalid equivalenceclassesof type Ï arejust collectionsof valid equivalenceclasses9. They have

easily verifiable characteristicoccurrenceproperties— they are “almost” orderedand “almost” nested.

ProcedureHANDINV tries to partition thesekind of invalid equivalenceclassesinto their constituentvalid

equivalenceclasses.The Type Ñ invalid equivalenceclassescannotbe partitionedto valid equivalence

classes.ProcedureHANDINV discardsType Ñ equivalenceclasseswhenever it detectsone.

A.0.2 HANDI NV

The classificationof invalid equivalenceclassesinto Type Ï andType Ñ equivalenceclassesrequiresa

knowledgeof theschema½ andtemplateÒ , which is notavailableto HANDINV. ProcedureHANDINV just

usestheorderedandnestingpropertiesof theequivalenceclassesto predictif an invalid equivalenceclass

is of Type Ï or Type Ñ .

Definition A.1 (Almost-Ordered) An equivalenceclass isalmost-orderedif thetokensin theequivalence

classcanbepartitionedinto orderedequivalenceclassesthatdonotoverlap. Ó

ExampleA.1 ConsideranequivalenceclassÂÕÔÖÇ�Ï ¿ Ñ ¿`×Ø¿FÙ Ì definedfor aninput of threepages.If the

relativeorderof occurrenceof thetokensin thethreepagesis ÏØÑÚÏÛÑ ×2ÙÜ×2Ù , ÏØÑ ×2Ù , ÏØÑÚÏÛÑÚÏØÑ ×ÝÙÞ×ÝÙÞ×ÝÙ ,

respectively, then  is almost-ordered. canbesplit into equivalenceclassesÇ�Ï ¿ ÑßÌ and Ç ×Ø¿FÙ Ì whichare

orderedanddo notoverlapwith eachother. Ó9This is notentirelytruefor equivalenceclassesformedby causeà . However, it shouldbecomeclearto thereaderafterreading

Section5 thatnoharmis doneby ignoringthis fact

A-2

Page 28: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Definitioná

A.2 (Nearly-Nested) Twoequivalenceclasses,ÂÄâ andÂäã aresaidto benearly-nestedif for every

token ¾�åæ â (andvice versa)thereexistsa position ç , suchthateachoccurrenceof ¾ is eitheroutsidethe

spanof any occurrenceof Â{ã , or within è>é�ê»ëìç8í of someoccurrenceof Â{ã . Ó

ExampleA.2 Considertwo equivalenceclassesÂîºÜÔïÇ�Ï ¿ Ñ ¿`×Ø¿FÙ Ì and  ¼ ÔðÇ�ñ ¿Fòó¿`ôõ¿Fö Ì for an in-

put of threepagesç1º ¿ ç ¼ ¿ ç{Å . If the relative order of occurrencesof the tokens Ïø÷ ö in çĺ ¿ ç ¼ ¿ ç{Å is

ÏØÑ ×2Ù ÏÚëOñ ò íPÑ × ë ôùö í Ù , ÏÚëOñ ò íPÑ × ë ôùö í Ù andñ òúôùö , respectively, theclassesÂûº and ¼ arealmost-

nestedwith respectto eachother. As an example,token ñ of Âw¼ always eitheroccursin è>é�êJë�ü�í of an

occurrenceof  º or doesnotoccurwithin spanof any occurrenceof  º . Ó

Thefollowing lemmais anologueof lemma4.3 for Type Ï equivalenceclasses.

Lemma A.1 An unorderedType Ï equivalenceclassis almost-ordered.Two orderedequivalenceclasses,

eachof which is eithera valid equivalenceclassor an invalid Type Ï equivalenceclassarealmost-nested

with respectto eachother. Ó

Wearenow readyto describeHANDINV.

First, HANDINV checksif eachequivalenceclassis ordered,or almost-ordered.Any equivalenceclass

that is not orderedor almost-orderedis not a valid equivalenceclassor a Type Ï equivalenceclass.Such

equivalenceclassesareremoved from consideration.Any almost-orderedequivalenceclassis partitioned

into orderednon-overlappingequivalenceclasses(Definition A.1). This resultsin a setof orderedequiva-

lenceclasses.

Next, HANDINV checksthe nestingpropertyof every pair of remainingequivalenceclasses.If two

equivalenceclassesÂÄâ and Â{ã areneithernestednoralmost-nestedwith respectto eachotherat leastoneof

themis anType Ñ invalid equivalenceclass.HANDINV greedilypicks theonethat is badlynestedwith a

lot of otherequivalenceclassesanddeletesit from thesetof equivalenceclasses.The intuition is that the

non-Type Ï equivalenceclassin thepair is likely to benot nestedproperlywith a lot of otherequivalence

classes,while thevalid or Type Ï equivalenceclassin thepair is not likely to be. In otherwordsthevalid

andType Ï equivalenceclassesmutually “vote” eachotheras“good”, andthushelp in locatingthe“bad”

equivalenceclasses.Thefollowing exampleillustratesthis idea.

ExampleA.3 Considerourrunningexample.Let thesizethresholdfor determiningwhetheranequivalence

classis an LFEQ or not be Á . For this supportthreshold,therearethreeequivalenceclassesÂÄÃ�º , ÂwÃlÅ and

A-3

Page 29: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Â1Æ<ÔýÇÈ�É{Ê ¿ È�Ë�É"Ê-Ì . With the knowledgeof the schema½þà andtemplateÒþà we know that ÂÄÆ is an invalid

Type Ñ equivalenceclass.

Consideringthe nestingpropertiesof the threepairs formed from the threeequivalenceclasses.The

equivalenceclassesÂÄÃ�º and ÂwÃ�Å arenestedwith respectto eachother. Theotherpairs ÂÄÃ�º ¿ ñ Æ and ÂÄÃ�Å ¿ ñ Æarenotnested.Sinceñ Æ participatesin a largernumberof non-nestingrelationshipsñ Æ , HANDINV predicts

that ñ Æ is aType Ñ equivalenceclassandis discarded. Ó

After the previous step,equivalenceclassesof every pair of remainingequivalenceclassesareeither

nestedor almost-nestedwith respectto eachother. In thefinal step,HANDINV considerseachpairof equiv-

alenceclassesthat is almost-nested,but not nested.Let Â1â and Â{ã be a pair of almost-nestedequivalence

classes.At leastoneof Â1â and Â{ã is an invalid Type Ï equivalenceclass. It is alsopossiblethatbothare

invalid Type Ï equivalenceclasses.HANDINV triesto predicttheinvalid equivalenceclassesandpartition

theminto smallerequivalenceclasses,suchthattheresultingsetof equivalenceclassesarenested.

It canbe easily verified that if Â1â is a valid equivalenceclassand Â{ã an invalid Type Ï equivalence

class,theneachoccurrenceof Â{ã overlapswith an occurrenceof Â1â but not vice versa(otherwise,ÂÄâ and

 ã wouldhave thesameoccurrencevectorandwouldhave belongedto thesameequivalenceclass).In this

case,HANDINV partitions 㠗 eachcontiguoussetof tokensof  ã thatoccurin thesamepositionç of an

occurrenceof ÂÄâ belongto thesamepartition. Eachpartitionof token is madeinto a separateequivalence

class.

If thereexistssomeoccurrenceof Â1â thatdoesnotoverlapwith ÂÄâ andthereexistssomeoccurrenceof Âäãthatdoesnot overlapwith Â1â , thenneitherÂÄâ and Â{ã arevalid, andbotharepartitioned— eachcontiguous

setof tokensof Âäã (resp. ÂÄâ ) thatoccurin thesameposition ç of anoccurrenceof Â1â (resp. Â{ã ) belongto

thesamepartition.

ExampleA.4 Considerthetwo equivalenceclassesÂûº and ¼ from ExampleA.2. HANDINV wouldpredict

thatboth equivalenceclassesareinvalid sincethereexist an occurrence(first occurrencein ç1º ) of Âûº that

doesnotoverlapwith anoccurrenceof  ¼ , andanoccurrenceof  ¼ (pageç{Å ) thatdoesnotoverlapwith an

occurrenceof Âûº . HANDINV would partition Âûº into threeequivalenceclassesÇ�Ï2Ì ¿ Ç�Ñ ¿`× Ì ¿ Ç Ù Ì and  ¼into two equivalenceclassesÇ�ñ ¿Fò Ì and Ç ôõ¿Fö Ì .

On theotherhand,if therelative occurrenceof tokens Ïÿ÷ ö in ç Å wereto be ÏÛÑ ×2Ù , thenHANDINV

wouldpredictthat  º is valid and Âļ is invalid. In thiscase,it wouldpartition Â>¼ alonein to two equivalence

classesÇ�ñ ¿Fò Ì and Ç ôÚ¿Fö Ì . Ó

A-4

Page 30: Extracting Structured Data from Web Pagesilpubs.stanford.edu:8090/548/1/2002-40.pdf · 2008. 9. 17. · Extracting Structured Data from Web Pages Arvind Arasu Hector Garcia-Molina

Thesetof equivalenceclassesafter the last stepis theoutputof HANDINV. It canbeverified that the

outputis a nestedsetof orderedequivalenceclasses.

A-5