Seminar on Navigation. Inma Hernández. Roadmap: Introduction, Conceptual Model, Experiment, Conclusions...
Integraweb
• Verifier
• Ontologiser
• Knowledge Base
• Extractor
• Information retrieval
• Ontology
• Dataset
• SQL (appears twice in the diagram)
Web Page Classification
Feature Type
Content: Hotho02, Pierre01, Selamat04
Structural: Arasu03, Bar-Yossef02, Blanco07, Crescenzi01, Grumbach99, Reis04, Vieira06, Vidal07
Hybrid: Caverlee05, Markov08
Feature Location
Onpage: (Most)
Neighbours: Cohen02, Fürnkraz02
Inma Hernández – ZOCO– September 2009
Navigation
Blind
Crawlers: Ravaghan01
Recorders: Anupam00, Baumgartner05, Pan02
Focused Crawlers: Aggarwal01, Assis07, Barbosa05, Batsakis09, Chakrabarti98, Chakrabarti99, Mukherjea04, Pant05, Pant06, Partalas08
Intelligent
Automated: Liddle02, Blanco05, Palmieri04, Vidal07
User-Defined: Bertoli08, Blythe08, Davulcu99, Kapow04, Montoto07, Vinod05, Wang08
Conceptual Model
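[Diagram: keywords are submitted through Form 1 … Form N. The Response Pages go to Web Page Classification, which labels them Hub, Detail, Error, No results or Others. Hub pages go to Navigation (Link Classification); Detail pages go to the Info Extractor, which produces structured data. Diagram labels: "Our focus", "(Online)".]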
Conceptual Model
Web Page Classification
Response Pages: Form 1
Classes: Hub, Detail, Error, No results, Others
Actions: Info Extractor, Navigation (Link Classification), Store, Discard, …, Filter, Fill in form
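The class-to-action idea on this slide can be sketched in a few lines of Python. This is only an illustration: the slides pair Hub pages with navigation and Detail pages with information extraction, but the action chosen here for Error, No results and Others pages (discard) is an assumption, and all names (`PageClass`, `ACTION_FOR_CLASS`, `choose_action`) are hypothetical.

```python
from enum import Enum


class PageClass(Enum):
    """Page classes listed on the slide."""
    HUB = "hub"
    DETAIL = "detail"
    ERROR = "error"
    NO_RESULTS = "no results"
    OTHERS = "others"


# Hub -> navigate and Detail -> extract follow the slides.
# The "discard" entries are an assumption made for illustration only.
ACTION_FOR_CLASS = {
    PageClass.HUB: "navigate",
    PageClass.DETAIL: "extract",
    PageClass.ERROR: "discard",
    PageClass.NO_RESULTS: "discard",
    PageClass.OTHERS: "discard",
}


def choose_action(page_class: PageClass) -> str:
    """Return the name of the action to apply to a classified page."""
    return ACTION_FOR_CLASS[page_class]


print(choose_action(PageClass.HUB))
print(choose_action(PageClass.DETAIL))
```

A lookup table keeps the classifier and the actions decoupled, so new classes or actions (Store, Filter, Fill in form) could be added without touching the classification step.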
Preliminary results: Promising
Pie chart slices: 25%, 50%, 15%, 10%
Legend:
< 10% relevant links
Between 10% and 27.49% relevant links
Between 27.49% and 47.52% relevant links
> 47.52% relevant links
Preliminary results: Promising
[Chart: % Relevant Links, Category: Arts (axis 0 to 90)]
[Chart: % Average Relevant Links by category (axis 0 to 50): Arts, Business, Computers, Games, Health, Home, Kids and …, News, Recreation, Regional, Science, Shopping, Society, Sports, World]
Roadmap
Introduction
Conceptual Model
Experiment Definition
Statistical Analysis
Conclusions
Future Work
Experiment Definition (Part 2)
• Java Application
% Relevant links
Linked Pages
Relevant Time
Relevant Size
Irrelevant Time
Irrelevant Size
RC-Downloader
N Iterations
Time / Size Results (10 it.)
                          Average Time    Average Size
Relevant                  4115.5892       145277331.8
Irrelevant                9915.6803       424132935.3
Relevant share of total   29.33 %         25.51 %
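The two percentages follow from the averages in the table: each is the relevant value divided by the sum of the relevant and irrelevant values. A short check in Python, where the function name `relevant_share` is only illustrative and the units of time and size are not stated on the slide:

```python
# Averages from the slide (10 iterations); units are not stated in the source.
relevant_time, irrelevant_time = 4115.5892, 9915.6803
relevant_size, irrelevant_size = 145277331.8, 424132935.3


def relevant_share(relevant: float, irrelevant: float) -> float:
    """Percentage of the total that is due to relevant links."""
    return 100.0 * relevant / (relevant + irrelevant)


time_share = relevant_share(relevant_time, irrelevant_time)
size_share = relevant_share(relevant_size, irrelevant_size)
print(f"time: {time_share:.2f} %, size: {size_share:.2f} %")
```

Read this way, roughly 70% of the download time and 75% of the downloaded volume in the experiment went to irrelevant links.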
Statistical Analysis
0. Statistics
1. Outlier detection
2. Determine Distribution
3. Hypothesis Testing (Chi-square test)
4. Estimate confidence interval for every category
5. Analyse deviations between categories
0. Statistics
Descriptive Statistics
N      Minimum   Maximum   Average   Standard Deviation
150    2         96        22.35     19.854
Significant?
1. Outlier detection
nba, everyday_health, medhelp, ezine, yahoo_finances, onemanga, gamefaqs, wimbledon
yahoo_health, mediline, jeuxvideo
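The slide does not say which outlier criterion was used. One common choice is the interquartile-range (IQR) rule, sketched below as an assumption; the sample values are hypothetical and are not the experiment's data.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical percentages of relevant links for ten sites (not real data).
sample = [5, 8, 12, 15, 18, 20, 22, 25, 30, 96]
print(iqr_outliers(sample))
```

The multiplier `k=1.5` is the conventional box-plot threshold; a larger value flags fewer sites.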
2. Determine Distribution
Skewness = 1.727706
Kurtosis = 2.601109
Average = 22.35
St. Dev = 19.854
N = 150
?
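Skewness and kurtosis are the usual first check on the shape of a distribution: a normal distribution has skewness 0 and excess kurtosis 0, so a clearly positive skewness points to a long right tail. A minimal sketch using plain moment-based definitions follows; the slide does not state which estimator produced its figures, so a bias-corrected formula may give slightly different numbers, and the function names are illustrative.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardised moment (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardised moment minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]        # hypothetical, symmetric sample
right_tailed = [1, 1, 2, 2, 3, 20]  # hypothetical, long right tail
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed) > 0)
```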
3. Hypothesis Testing
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
Hypothesis: Observed and Expected distributions are equal
χ² follows a chi-square distribution with n - p degrees of freedom (n: number of bins; p: number of parameters + 1)
Reject the hypothesis if the p-value is less than the significance level; otherwise, find a specific hypothesis test.
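The test statistic is the sum over bins of the squared difference between observed and expected counts, divided by the expected count. A small sketch with hypothetical bin counts is shown below. The function names are illustrative, and the p-value step is left out because it needs the chi-square distribution function, which the Python standard library does not provide.

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_bins, n_parameters):
    """n - p, with p = number of parameters + 1, as stated on the slide."""
    return n_bins - (n_parameters + 1)


# Hypothetical counts for three bins (not the experiment's data).
observed = [10, 20, 30]
expected = [20, 20, 20]
print(chi_square_statistic(observed, expected))
print(degrees_of_freedom(n_bins=3, n_parameters=0))
```

With these made-up counts the statistic is 5 + 0 + 5 = 10 on 2 degrees of freedom.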
4. Confidence interval (categories)
[Chart: Max. % Links by category (axis 0 to 70): Arts, Computers, Health, Kids and Teens, Recreation, Science, Society, World]
Check deviations between categories -> Categories clustering
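The per-category figures are not recoverable from the transcript, so the sketch below applies a confidence interval to the overall descriptive statistics instead (N = 150, average 22.35, standard deviation 19.854). It assumes a normal approximation at the 95% level (z = 1.96); the slides do not say which method was used, and the skewness of the data makes this approximation rough.

```python
import math


def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Normal-approximation confidence interval for a mean."""
    half_width = z * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# Overall descriptive statistics from the slides.
low, high = mean_confidence_interval(mean=22.35, std_dev=19.854, n=150)
print(f"{low:.2f} .. {high:.2f}")
```

The same function could be applied to each category once its own mean, standard deviation and sample size are known.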
Conclusions
• Traditional exhaustive crawlers follow every link on each page, but not all links are relevant.
• Virtual integration systems need fast response times. Visiting irrelevant links increases cost and time, and should be avoided.
• The solution is to decide which links are relevant and must be visited. The navigator should decide relevancy automatically, with the support of a classifier.
• This solution is semi-supervised, which relieves the user from low-level web page implementation details and results in a more robust and less error-prone system.
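The difference from an exhaustive crawler can be sketched as a filter between link discovery and page download. Everything here is hypothetical: `relevant_links`, the toy rule standing in for a trained link classifier, and the example URLs.

```python
from typing import Callable, Iterable, Iterator


def relevant_links(links: Iterable[str],
                   is_relevant: Callable[[str], bool]) -> Iterator[str]:
    """Yield only the links the classifier marks as relevant."""
    for link in links:
        if is_relevant(link):
            yield link


def toy_classifier(link: str) -> bool:
    """Stand-in for a trained link classifier (assumption for illustration)."""
    return "/product/" in link


# Hypothetical links found on a hub page.
hub_links = [
    "http://example.com/product/1",
    "http://example.com/about",
    "http://example.com/product/2",
    "http://example.com/login",
]
to_visit = list(relevant_links(hub_links, toy_classifier))
print(to_visit)
```

An exhaustive crawler would download all four pages; the filtered navigator downloads only the two the classifier accepts.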
Future Work
• Experiment
  • Increase dataset size by adding new sites
  • Automatise hub selection and link classification
  • Determine statistical distribution of data
  • Determine if it is necessary to split data into categories
  • Repeat experiment with new data
• Navigator
  • Web Page Classifier(s) Choice
  • Link Classification
  • Little Supervision