Intelligent Crawling of Web Applications for Web Archiving
{muhammad.faheem, pierre.senellart}@telecom-paristech.fr, Télécom ParisTech (http://dbweb.enst.fr)
WWW PhD symposium. April 20, 2012



Intelligent Crawling of Web Applications for Web Archiving

{muhammad.faheem, pierre.senellart}@telecom-paristech.fr

Télécom ParisTech (http://dbweb.enst.fr)

WWW PhD symposium. April 20, 2012


Web Archiving


Archiving the Social Web


- The traditional crawling approach crawls Web sites independently of the nature of the site and its content management system.
- Goal: smart archiving of the Social Web:
  1. Intelligent crawling
  2. Indexing of Web objects


Agenda

Traditional Crawling Approach

Application-Aware Helper

Methodology

Results

Future Work


Traditional Crawling Approach

- A traditional Web crawler (such as Heritrix) crawls the Web in a conceptually very simple way.

[Figure: crawler architecture: Queue Management, Page Fetching, Link Extraction, URL Selection]

- This approach does not take into account the nature of the Web application.
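The four stages named on this slide can be sketched as a single loop. This is only an illustration, not how Heritrix is implemented: the function names, the in-memory FAKE_WEB dictionary standing in for HTTP fetching, and the example.org URLs are all assumptions made for the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Link extraction: collect the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` maps a URL to HTML text or None."""
    queue = deque([seed])              # queue management
    seen = {seed}
    archived = []
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        html = fetch(url)              # page fetching
        if html is None:
            continue
        archived.append(url)
        parser = LinkExtractor()
        parser.feed(html)              # link extraction
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:     # URL selection (here: deduplication only)
                seen.add(target)
                queue.append(target)
    return archived


# Hypothetical in-memory "Web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.org/b": "<p>leaf page</p>",
}

print(crawl("http://example.org/", FAKE_WEB.get))
```

Every link is treated the same way, whatever kind of page it came from; this is the limitation the slide points out.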


Introduction to Application-Aware Helper

- Different crawling techniques for different social Web sites.
- Detect the type of Web application and the kind of Web pages inside this Web application, and decide crawling actions accordingly.
- Our approach does not have the same purpose as focused crawling:
  Focused crawling: crawling based on a topic.
  Application-Aware Helper: crawling optimized for a particular Web application.


Introduction to Application-Aware Helper

- Extended architecture

[Figure: extended crawler architecture: Queue Management, Resource Fetching, Application-Aware Helper, Resource Selection]

- To be implemented in two Web crawlers: the Internet Memory Foundation crawler and Heritrix.


Knowledge base of Web applications

- Knowledge base of Web applications: describes how to crawl a Web site in an intelligent manner.
- Hierarchy: from general categorizations to specific instances (Web sites) of a Web application. The knowledge base:
  1. categorizes the Web applications;
  2. specifies the detection rules;
  3. describes the specific crawling actions.


Knowledge base of Web applications

- Different crawling actions for different kinds of Web pages under a specific Web application.
- Declarative, XML-based format.


Example of the knowledge base

<knowledgebase>
  <cms name="vBulletin" type="webforum">
    <detection-rules>
      <xpath-expression>
        //script/@src[contains(.,'vbulletin_global.js')]
      </xpath-expression>
    </detection-rules>
    <page-level-cat>
      <list-of-forum>
        <detection-rules>
          <xpath-expression type="1">
            //a[@class="forum"]/@href
          </xpath-expression>
          <xpath-expression type="2">
            //h2[@class="forumtitle"]/a/@href
          </xpath-expression>
        </detection-rules>
        <crawling-action>
          <action id="1">
            //a.forum/@href
          </action>
          <action id="2">
            //td.forumtitle/div/a/@href
          </action>
        </crawling-action>
      </list-of-forum>
      <list-of-thread>
        .
      </list-of-thread>
      <thread>
        .
      </thread>
    </page-level-cat>
  </cms>
</knowledgebase>
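A minimal sketch of how a crawler might load such a knowledge base, using Python's standard xml.etree.ElementTree. The load_knowledge_base function and the shape of the returned dictionary are hypothetical. The XML string repeats the slide example, leaving out the elided list-of-thread and thread entries and closing the cms and page-level-cat elements. Pairing a detection expression of type "1" with the action of id "1" is an assumption suggested by the example, not something the slides state.

```python
import xml.etree.ElementTree as ET

KB_XML = """
<knowledgebase>
  <cms name="vBulletin" type="webforum">
    <detection-rules>
      <xpath-expression>//script/@src[contains(.,'vbulletin_global.js')]</xpath-expression>
    </detection-rules>
    <page-level-cat>
      <list-of-forum>
        <detection-rules>
          <xpath-expression type="1">//a[@class="forum"]/@href</xpath-expression>
          <xpath-expression type="2">//h2[@class="forumtitle"]/a/@href</xpath-expression>
        </detection-rules>
        <crawling-action>
          <action id="1">//a.forum/@href</action>
          <action id="2">//td.forumtitle/div/a/@href</action>
        </crawling-action>
      </list-of-forum>
    </page-level-cat>
  </cms>
</knowledgebase>
"""


def load_knowledge_base(xml_text):
    """Return {application name: {"type", "detect", "levels"}}."""
    kb = {}
    for cms in ET.fromstring(xml_text).iter("cms"):
        levels = {}
        # Each child of <page-level-cat> describes one kind of page.
        for level in cms.find("page-level-cat"):
            levels[level.tag] = {
                "detect": {e.get("type"): e.text.strip()
                           for e in level.iter("xpath-expression")},
                "actions": {a.get("id"): a.text.strip()
                            for a in level.iter("action")},
            }
        kb[cms.get("name")] = {
            "type": cms.get("type"),
            # Application-level rules: only the direct <detection-rules> child.
            "detect": [e.text.strip()
                       for e in cms.findall("detection-rules/xpath-expression")],
            "levels": levels,
        }
    return kb


kb = load_knowledge_base(KB_XML)
print(kb["vBulletin"]["levels"]["list-of-forum"]["actions"])
```

The expressions are kept as plain strings here; evaluating them is left to whichever XPath or OXPath engine the crawler uses.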


Web application detection Module

- One main challenge in intelligent crawling and content extraction is to identify the Web application and then apply the best crawling strategy accordingly.
- Detecting the Web application using:
  1. URL patterns,
  2. HTTP metadata,
  3. textual content,
  4. XPath patterns, etc.
- For instance, the vBulletin Web forum content management system can be identified by searching for a reference to the vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
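The vBulletin check can be approximated with the Python standard library alone. This sketch collects script src attributes with html.parser rather than evaluating the XPath expression directly, so it is a stand-in for the detection rule, not the system's implementation. The detect_application function, the SIGNATURES table, and the sample page (including its clientscript/ path) are hypothetical.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def detect_application(html, signatures):
    """Return the first application whose marker occurs in a script src."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    for name, marker in signatures.items():
        if any(marker in src for src in collector.sources):
            return name
    return None


# Hypothetical signature table: application name -> substring of a script URL.
SIGNATURES = {"vBulletin": "vbulletin_global.js"}

# Hypothetical page; the script path is made up for illustration.
page = ('<html><head>'
        '<script src="clientscript/vbulletin_global.js?v=1"></script>'
        '</head><body></body></html>')

print(detect_application(page, SIGNATURES))
```

URL patterns, HTTP metadata, and textual content could be checked in the same function before falling back to structural tests like this one.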


Crawling and extraction

- Next stage: determining the corresponding crawling actions.
- Crawling action: not just a list of URLs; it can be any action that uses a REST API or performs a complicated interaction with an AJAX-based application, and that extracts semantic Web objects.


Crawling and extraction

- More specifically, crawling actions are of two kinds:
  Navigation actions: to navigate to another Web page or Web resource.
  Extraction actions: to extract individual semantic objects from Web pages (e.g., the timestamp, the blog post, the comments).
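The two kinds of action can be sketched as two small classes. Everything here is hypothetical: the class names, the run method, and the sample page. The paths use the limited XPath subset of Python's xml.etree.ElementTree on a well-formed snippet; real pages would need a proper HTML parser and a fuller XPath or OXPath engine.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class NavigationAction:
    """Yields URLs to visit next."""
    path: str                  # ElementTree path selecting link elements
    attribute: str = "href"

    def run(self, page):
        return [e.get(self.attribute) for e in page.iterfind(self.path)
                if e.get(self.attribute)]


@dataclass
class ExtractionAction:
    """Yields named semantic objects found in the page."""
    name: str
    path: str

    def run(self, page):
        return [{self.name: "".join(e.itertext()).strip()}
                for e in page.iterfind(self.path)]


# Hypothetical, well-formed forum page used only for illustration.
PAGE = ET.fromstring("""
<html><body>
  <div class="post"><span class="date">2012-04-20</span>
    <div class="postbody">Hello <b>world</b></div></div>
  <a class="next" href="thread?page=2">Next</a>
</body></html>
""")

print(NavigationAction(".//a[@class='next']").run(PAGE))
print(ExtractionAction("timestamp", ".//span[@class='date']").run(PAGE))
print(ExtractionAction("post", ".//div[@class='postbody']").run(PAGE))
```

Navigation results would go back to the crawl queue, while extraction results would go to the archive index.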


Crawling and extraction

- We similarly want a declarative language for describing all crawling actions.
- We therefore need a navigation and extraction language to access data from the deep Web as well as from regular URLs.
- We will use OXPath, an extension of XPath with added facilities for interacting with Web applications and extracting relevant data.
- It allows the simulation of user actions to interact with the scripted multipage interfaces of a Web application.


Crawling and extraction

- As an example, a simple OXPath action that can be performed on the vBulletin example is:

  //a[@class#='posttitle']/@href/{click/}/descendant::a[@class#='postbody']:<post=string(.)>


Initial Results

- Prototype of the application-aware helper implemented, with recognition for a couple of Web applications.
- System evaluated against the vBulletin application, on a variety of Web sites using this content management system.
- For now, the system is able to detect the type of the Web application and the level in the Web application, and to execute the corresponding crawling actions.


Initial Results

- Recently, we have integrated the YFilter system (an NFA-based filtering system) for efficient indexing of detection patterns, in order to quickly find the relevant Web applications.
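The idea of indexing many detection patterns in one automaton can be illustrated with a much-simplified sketch. This is not YFilter and does not use its API: it handles only child steps and the * wildcard (no // axis, attributes, or predicates), and the PathIndex class, the pattern names, and the sample document are hypothetical. What it shows is that patterns sharing a prefix share states, so one pass over a page tests all patterns at once.

```python
import xml.etree.ElementTree as ET


class PathIndex:
    """Trie-shaped automaton shared by all registered path patterns."""

    def __init__(self):
        self.transitions = [{}]    # state -> {element name or "*": next state}
        self.accepting = [set()]   # state -> names of patterns ending there

    def add(self, name, pattern):
        """Register a pattern such as /html/head/script under `name`."""
        state = 0
        for step in pattern.strip("/").split("/"):
            nxt = self.transitions[state].get(step)
            if nxt is None:        # create a new state only for unseen steps
                nxt = len(self.transitions)
                self.transitions.append({})
                self.accepting.append(set())
                self.transitions[state][step] = nxt
            state = nxt
        self.accepting[state].add(name)

    def match(self, root):
        """Return the names of all patterns matched by some element."""
        matched = set()

        def visit(element, states):
            # Nondeterministic step: follow the name and wildcard edges.
            active = set()
            for s in states:
                for label in (element.tag, "*"):
                    nxt = self.transitions[s].get(label)
                    if nxt is not None:
                        active.add(nxt)
            for s in active:
                matched.update(self.accepting[s])
            if active:
                for child in element:
                    visit(child, active)

        visit(root, {0})
        return matched


index = PathIndex()
index.add("script-in-head", "/html/head/script")
index.add("article-in-body", "/html/body/article")
index.add("div-two-levels-down", "/html/*/div")

doc = ET.fromstring("<html><head><script/></head><body><div/></body></html>")
print(sorted(index.match(doc)))
```

All three patterns share the state reached by the html step, which is where the saving comes from as the number of known Web applications grows.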


Future Work

- Using XPath 1.0 expressions for detection patterns faces some expressiveness limitations; we therefore have the option of switching to XPath 2.0 expressions or adding extension functions for this purpose.
- Automatic, unsupervised learning of new Web applications (by inference of common patterns), and adaptation to slight changes in the templates that render the wrappers unusable.


Future Work

- Crawler and external program (OXPath evaluator, API crawler, etc.) integration.


Thank you
