
Seminar on Navigation

Inma Hernández

Roadmap

Introduction

Conceptual Model

Experiment

Conclusions

Future Work

Integraweb

- Verifier
- Ontologiser
- Knowledge Base
- Extractor
- Information retrieval
- Ontology
- Dataset
- SQL


Web Page Classification

Feature Type
- Content: Hotho02, Pierre01, Selamat04
- Structural: Arasu03, Bar-Yossef02, Blanco07, Crescenzi01, Grumbach99, Reis04, Vieira06, Vidal07
- Hybrid: Caverlee05, Markov08

Feature Location
- On-page: (most)
- Neighbours: Cohen02, Fürnkranz02

Inma Hernández – ZOCO– September 2009

Navigation

Blind
- Crawlers: Ravaghan01
- Recorders: Anupam00, Baumgartner05, Pan02
- Focused Crawlers: Aggarwal01, Assis07, Barbosa05, Batsakis09, Chakrabarti98, Chakrabarti99, Mukherjea04, Pant05, Pant06, Partalas08

Intelligent
- Automated: Liddle02, Blanco05, Palmieri04, Vidal07
- User-Defined: Bertoli08, Blythe08, Davulcu99, Kapow04, Montoto07, Vinod05, Wang08


Conceptual Model

Diagram elements:
- Keywords
- Form 1 … Form N
- Response Pages
- Web Page Classification: Hub, Detail, Error, No results, Others
- Hub: Navigation (Link Classification)
- Detail: Info Extractor
- Structured data
- Our focus (Online)

Conceptual Model

Web Page Classification of Response Pages (Form 1)

Classes
- Hub
- Detail
- Error
- No results
- Others

Actions
- Info Extractor
- Navigation (Link Classification)
- Store
- Discard
- Filter
- Fill in form
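The relationship between classes and actions can be illustrated with a small dispatch table. This is only a sketch: the class and action names come from the slide, but the pairing between them (for example, discarding error pages) is an assumption made for illustration, as is the helper name `choose_action`.

```python
from enum import Enum


class PageClass(Enum):
    HUB = "hub"
    DETAIL = "detail"
    ERROR = "error"
    NO_RESULTS = "no results"
    OTHERS = "others"


# Assumed pairing of page class and action; only the names of the
# classes and actions appear in the slides, not this mapping.
ACTION_FOR_CLASS = {
    PageClass.HUB: "navigation (link classification)",
    PageClass.DETAIL: "info extractor",
    PageClass.ERROR: "discard",
    PageClass.NO_RESULTS: "discard",
    PageClass.OTHERS: "filter",
}


def choose_action(page_class: PageClass) -> str:
    """Return the (assumed) action for a page that has been classified."""
    return ACTION_FOR_CLASS[page_class]


print(choose_action(PageClass.HUB))
```

A table like this keeps the classifier and the navigator decoupled: changing what happens to a class of page is a one-line edit.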

Preliminary results: Promising

[Pie chart: share of sites by percentage of relevant links. Slices: 25%, 50%, 15%, 10%. Legend: < 10% relevant links; between 10% and 27.49% relevant links; between 27.49% and 47.52% relevant links; > 47.52% relevant links.]

Preliminary results: Promising

[Bar chart: % Relevant Links per site, Category: Arts (axis 0 to 90).]

[Bar chart: % Average Relevant Links per category (axis 0 to 50). Categories: Arts, Business, Computers, Games, Health, Home, Kids and …, News, Recreation, Regional, Science, Shopping, Society, Sports, World.]

Roadmap

Introduction

Conceptual Model

Experiment
- Definition
- Statistical Analysis

Conclusions

Future Work

Experiment Definition (Part 1)

- Sample Hub Set (15 categories, 150 sites)
- Keywords
- XPath Locators
- Java application
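To make the role of an XPath locator concrete, here is a minimal sketch of how the percentage of relevant links on a hub page could be measured. It is written in Python rather than Java for brevity and is not the application used in the experiment. The page, the locator, and the function name are all hypothetical, and the page is well-formed XHTML so that the standard-library parser accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical hub page (well-formed so the stdlib XML parser accepts it).
HUB_PAGE = """
<html><body>
  <div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
  <div id="results">
    <a href="/item/1">Item 1</a>
    <a href="/item/2">Item 2</a>
  </div>
  <div id="footer"><a href="/legal">Legal</a></div>
</body></html>
"""

# Hypothetical locator marking the region that holds the relevant links.
RELEVANT_LOCATOR = ".//div[@id='results']//a"


def percent_relevant_links(page: str, locator: str) -> float:
    """Percentage of the page's links that the locator selects."""
    root = ET.fromstring(page)
    all_links = root.findall(".//a")
    relevant_links = root.findall(locator)
    return 100.0 * len(relevant_links) / len(all_links)


print(percent_relevant_links(HUB_PAGE, RELEVANT_LOCATOR))
```

Here two of the five links fall inside the located region, so the function returns 40.0.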

Experiment Definition (Part 2)

- Java Application
- % Relevant links
- Linked Pages
- Relevant Time
- Relevant Size
- Irrelevant Time
- Irrelevant Size
- RC-Downloader, N iterations

Time / Size Results (10 iterations)

                            Average Time    Average Size
Relevant                    4115.5892       145277331.8
Irrelevant                  9915.6803       424132935.3
Relevant share of total     29.33 %         25.51 %
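The two percentages are consistent with the relevant average divided by the sum of the relevant and irrelevant averages. The short check below reproduces them from the figures on the slide; the variable and function names are illustrative, and the units are not stated in the source.

```python
# Averages as shown on the slide (units are not stated in the source).
relevant_time, irrelevant_time = 4115.5892, 9915.6803
relevant_size, irrelevant_size = 145277331.8, 424132935.3


def relevant_share(relevant: float, irrelevant: float) -> float:
    """Percentage of the total (relevant + irrelevant) that is relevant."""
    return 100.0 * relevant / (relevant + irrelevant)


time_share = relevant_share(relevant_time, irrelevant_time)
size_share = relevant_share(relevant_size, irrelevant_size)
print(f"{time_share:.2f} % of time, {size_share:.2f} % of size")
```

Read this way, roughly 70% of the download time and 75% of the downloaded bytes go to irrelevant links.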


Statistical Analysis

0. Statistics

1. Outlier detection

2. Determine Distribution

3. Hypothesis Testing (Chi-square test)

4. Estimate confidence interval for every category

5. Analyse deviations between categories

0. Statistics

Descriptive Statistics

N      Minimum   Maximum   Average   Standard Deviation
150    2         96        22.35     19.854
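A summary of this shape can be computed with Python's standard library, as in the sketch below. The sample values are hypothetical, since the 150-site dataset is not included in the slides, and it is an assumption that the reported standard deviation uses the sample (n - 1) formula.

```python
import statistics

# Hypothetical "% relevant links" values; the real dataset is not
# part of the slides.
sample = [2, 5, 8, 12, 15, 18, 22, 30, 41, 96]

summary = {
    "N": len(sample),
    "Minimum": min(sample),
    "Maximum": max(sample),
    "Average": statistics.mean(sample),
    # Sample standard deviation (n - 1 in the denominator); assumed.
    "Standard Deviation": statistics.stdev(sample),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```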

Significant?

1. Outlier detection

Labelled sites: nba, everyday_health, medhelp, ezine, yahoo_finances, onemanga, gamefaqs, wimbledon, yahoo_health, mediline, jeuxvideo
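The slides do not say which outlier criterion was applied. As one common possibility, the sketch below flags values that lie more than 1.5 interquartile ranges outside the quartiles (Tukey's rule). The rule itself, the site names, and the values are all assumptions made for illustration.

```python
import statistics


def iqr_outliers(values_by_site: dict) -> list:
    """Names of sites whose value lies beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(list(values_by_site.values()), n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [site for site, value in values_by_site.items()
            if value < low or value > high]


# Hypothetical "% relevant links" per site.
data = {
    "site_a": 10, "site_b": 12, "site_c": 15, "site_d": 18,
    "site_e": 20, "site_f": 22, "site_g": 25, "site_h": 96,
}
print(iqr_outliers(data))
```

With these made-up values only `site_h` is flagged.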

2. Determine Distribution

Skewness = 1.727706
Kurtosis = 2.601109
Average = 22.35
St. Dev = 19.854
N = 150

?

Candidate distributions: Chi-square, Snedecor's F, Beta, Gamma
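Skewness and kurtosis can be computed from central moments, as sketched below. Which estimator produced the figures on the slide is not stated, so the moment-based definitions here (with kurtosis reported as excess kurtosis, that is, minus 3) are an assumption, and the sample is hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the sample."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical right-skewed sample of "% relevant links" values.
sample = [2, 5, 8, 12, 15, 18, 22, 30, 41, 96]
print(skewness(sample), excess_kurtosis(sample))
```

A positive skewness, as on the slide, indicates a long right tail, which is one reason to look at right-skewed families rather than the normal distribution.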

3. Hypothesis Testing

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Hypothesis: Observed and Expected distributions are equal

$\chi^2$ follows a chi-square distribution with n - p degrees of freedom (n: number of bins; p: number of parameters + 1)

Reject the hypothesis if the p-value is less than the significance level; otherwise, look for a specific hypothesis test.
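As a worked illustration of the test statistic, the sketch below computes it for hypothetical bin counts and compares it with the tabulated 5% critical value for 3 degrees of freedom (7.815). Comparing the statistic with the critical value is equivalent to comparing the p-value with the significance level. The counts, the uniform expected distribution, and the function names are assumptions.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_bins, n_parameters):
    """n - p, with p = number of estimated parameters + 1."""
    return n_bins - (n_parameters + 1)


# Hypothetical counts in four bins; expected counts assume a uniform
# distribution with no estimated parameters.
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]

stat = chi_square_statistic(observed, expected)
df = degrees_of_freedom(n_bins=4, n_parameters=0)

# Tabulated chi-square critical value for df = 3 at the 5% level.
CRITICAL_5_PERCENT_DF3 = 7.815
reject = stat > CRITICAL_5_PERCENT_DF3
print(stat, df, reject)
```

With these counts the statistic is 4.32 on 3 degrees of freedom, so the hypothesis of equal distributions is not rejected at the 5% level.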

4. Confidence interval

Check maximum values for % relevant links
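The slides do not state how the interval is built. One simple possibility is a normal-approximation interval for the mean, sketched below with the overall summary figures from the deck (N = 150, average 22.35, standard deviation 19.854). The choice of method and the 95% level are assumptions, and given the skew of the data the approximation is rough.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, stdev, n, level=0.95):
    """Normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev / math.sqrt(n)
    return mean - half_width, mean + half_width


low, high = mean_confidence_interval(mean=22.35, stdev=19.854, n=150)
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
```

Under these assumptions the interval is roughly 19.17 to 25.53 % relevant links.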

4. Confidence interval (categories)

[Bar chart: Max. % Links per category (axis 0 to 70). Categories: Arts, Computers, Health, Kids and Teens, Recreation, Science, Society, World.]

Check deviations between categories ->

Categories clustering


Conclusions

- Traditional exhaustive crawlers follow every link on each page, but not all links are relevant.

- Virtual integration systems need fast response times. Visiting irrelevant links increases cost and time, and should be avoided.

- The solution is to decide which links are relevant and must be visited. Relevance should be decided automatically by the navigator, with the support of a classifier.

- This solution is semi-supervised, which relieves the user from low-level web page implementation details and results in a more robust and less error-prone system.
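The difference between an exhaustive crawler and a navigator that consults a classifier can be sketched as follows. The in-memory site, the prefix-based `is_relevant` stand-in for a trained classifier, and the function names are all hypothetical; a real navigator would download pages and use a learned model.

```python
from collections import deque

# Hypothetical site: each page maps to its outgoing links.
SITE = {
    "/hub": ["/item/1", "/item/2", "/about", "/legal"],
    "/item/1": [],
    "/item/2": ["/item/1"],
    "/about": ["/team"],
    "/legal": [],
    "/team": [],
}


def is_relevant(link: str) -> bool:
    """Stand-in for a trained link classifier (assumption)."""
    return link.startswith("/item/")


def navigate(start, site, classifier):
    """Breadth-first traversal that only follows links the classifier accepts."""
    visited, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in site.get(page, []):
            if link not in seen and classifier(link):
                seen.add(link)
                queue.append(link)
    return visited


focused = navigate("/hub", SITE, is_relevant)
exhaustive = navigate("/hub", SITE, lambda link: True)
print(len(focused), "pages visited instead of", len(exhaustive))
```

In this toy site the classifier-guided navigator visits three pages where the exhaustive crawler visits six.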


Future Work

- Experiment
  - Increase dataset size by adding new sites
  - Automate hub selection and link classification
  - Determine the statistical distribution of the data
  - Determine whether it is necessary to split the data into categories
  - Repeat the experiment with new data

- Navigator
  - Choice of web page classifier(s)
  - Link classification
  - Little supervision

Thanks!

Drop by our web site at http://www.tdg-seville.info
