Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata

Post on 25-Feb-2016

Page 1: Detecting Sequences and Cycles of Web Pages

Detecting Sequences and Cycles of Web Pages

Narayan L. Bhamidipati and Sankar K. Pal

Indian Statistical Institute, Kolkata

Page 2: Detecting Sequences and Cycles of Web Pages

Contents

• Introduction
• Objective
• Significance
• Procedure
• Experiments
• Future directions

Page 3: Detecting Sequences and Cycles of Web Pages

The Web: A Directed Graph

• (V, A)
• Vertices: Web pages
  • V = {v1, v2, …, vN}
• Arcs: Hyperlinks
  • A = {eij : a hyperlink joins vi and vj}
• Path: p1.p2.….pn, with an arc from each pi to p(i+1)
• Cycle: a path with pn = p1
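
The path and cycle definitions above can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a dictionary that maps each page to the set of pages it links to; the page names and the helper names `is_path` and `is_cycle` are hypothetical and do not come from the slides.

```python
# Hypothetical toy graph: each page maps to the set of pages it links to.
links = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A", "D"},
    "D": set(),
}


def is_path(graph, pages):
    """True if every consecutive pair (p_i, p_{i+1}) is joined by an arc."""
    return all(b in graph.get(a, set()) for a, b in zip(pages, pages[1:]))


def is_cycle(graph, pages):
    """True if `pages` is a path that returns to its first page (p_n = p_1)."""
    return len(pages) > 1 and pages[0] == pages[-1] and is_path(graph, pages)


print(is_path(links, ["A", "B", "C", "D"]))   # True
print(is_cycle(links, ["A", "B", "C", "A"]))  # True
print(is_cycle(links, ["A", "B", "C", "D"]))  # False
```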

Page 4: Detecting Sequences and Cycles of Web Pages

Sequences of Web Pages

• Paths consisting of adjacent web pages
• Order sensitive
• A surfer may follow one such sequence when browsing pages

Page 5: Detecting Sequences and Cycles of Web Pages

Cycles of Web Pages

• http://www.stanford.edu/
• http://www.stanford.edu/home/atoz/letterw.html
• http://www.stanford.edu/group/wellspring/
• http://www.stanford.edu/group/wellspring/yahoo_spotlight.html
• http://www.yahoo.com/
• http://dir.yahoo.com/Education/
• http://dir.yahoo.com/Education/Higher_Education/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/
• http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/
• http://www.stanford.edu

Page 6: Detecting Sequences and Cycles of Web Pages

What are we looking for?

• A particular kind of sequence and cycle
  • Regular
  • Consisting of similar units
  • Units having similar relationships
  • Reasonably sized

Page 7: Detecting Sequences and Cycles of Web Pages

Why are these Sequences and Cycles Interesting?

• Individual units form a single object
• These were intended to be together
• They collectively include the complete information
• Despite being part of a collection, individuality is maintained

Page 8: Detecting Sequences and Cycles of Web Pages

Significance of Detecting Such Sequences and Cycles

• Compression
  • Merge groups of pages
  • Fewer pages, hence fewer links
• Pre-fetching
  • Know where the surfer wants to be next
  • Fetch the page(s) before they are requested
  • Saves time
  • Errors: pre-fetching the wrong pages

Page 9: Detecting Sequences and Cycles of Web Pages

Significance of Detecting Such Sequences and Cycles (Contd.)

• Fair comparison
  • Comparison independent of how content is presented
  • Content split across multiple pages should be treated the same as that content on a single page
• Better retrieval
  • Retrieval independent of the presentation
  • Output a set of pages instead of a single one as a match

Page 10: Detecting Sequences and Cycles of Web Pages

Fair Comparison

Page 11: Detecting Sequences and Cycles of Web Pages

Fair Comparison

Page 12: Detecting Sequences and Cycles of Web Pages

Fair Comparison

Page 13: Detecting Sequences and Cycles of Web Pages

Improved Retrieval

• Retrieve only the portions of interest
  • Instead of whole (huge) documents
• Avoid rewarding pages merely for having more content

Page 14: Detecting Sequences and Cycles of Web Pages

How to Detect Sequences and Cycles of Web Pages?

• Find navigational links
• Find consecutive pages
• Define the condition that the elements of a sequence must satisfy
• Identify subsequences (or units)
• Concatenate
• Check for cycles

Page 15: Detecting Sequences and Cycles of Web Pages

Finding Navigational Links: Background

• The purpose of a link may be
  • Navigation
  • Reference
  • Advertisement
• Links between pages on the same server are treated as navigational
• Such links have also been treated as noise

Page 16: Detecting Sequences and Cycles of Web Pages

Finding Navigational Links: Our Method

• Avoid treating links on the same server as navigational links
• Navigational links appear mostly at the top or the bottom of a page
• Navigational links are generally huddled together
• Less text and fewer images around such links
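
The heuristics above (position near the top or bottom, links huddled together, little surrounding text) could be combined roughly as follows. This is only an illustrative sketch: the `Link` record, the function `navigational_links`, and all three thresholds are assumptions made for the example, not values given in the slides.

```python
from dataclasses import dataclass


@dataclass
class Link:
    href: str
    offset: float      # position in the page: 0.0 = top, 1.0 = bottom
    nearby_words: int  # words of running text around the anchor


def navigational_links(links, edge=0.15, max_gap=0.05, max_words=5):
    """Keep links that sit near the top or bottom, have little text around
    them, and lie close to at least one other such link (assumed thresholds)."""
    candidates = sorted(
        (l for l in links
         if (l.offset <= edge or l.offset >= 1 - edge)
         and l.nearby_words <= max_words),
        key=lambda l: l.offset,
    )
    kept = []
    for i, link in enumerate(candidates):
        neighbours = candidates[max(i - 1, 0):i] + candidates[i + 1:i + 2]
        if any(abs(link.offset - n.offset) <= max_gap for n in neighbours):
            kept.append(link)
    return kept


page = [
    Link("prev.html", 0.02, 1),
    Link("next.html", 0.03, 1),
    Link("up.html", 0.04, 1),
    Link("paper.html", 0.50, 40),  # reference link inside running text
    Link("ad.html", 0.97, 0),      # isolated link at the bottom
]
print([l.href for l in navigational_links(page)])
# ['prev.html', 'next.html', 'up.html']
```

The isolated bottom link is dropped only because it has no nearby companion; a real page would need thresholds tuned on data.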

Page 17: Detecting Sequences and Cycles of Web Pages

Advantages and Limitations

• Simple and fast
• Navigational links across servers are also identified
• Heuristics need not always work; fall back on more sophisticated methods

Page 18: Detecting Sequences and Cycles of Web Pages

Units of the Sequences

• ABC is a unit if C is “related” to B in the same way as B is “related” to A
• “related” is defined in terms of how they are linked
• Relation is stored as “position” of the link
• Several ways of defining “position”
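
One possible reading of the unit definition is sketched below. It assumes that the “position” of a link is simply its index in the page's ordered list of outgoing links, which is only one of the several definitions the slide alludes to; the toy pages and the name `find_units` are hypothetical.

```python
# Hypothetical pages with their outgoing links in document order.
outlinks = {
    "A": ["home", "B"],
    "B": ["home", "C"],
    "C": ["home", "D"],
    "D": ["home"],
    "home": [],
}


def find_units(outlinks):
    """Return triples (a, b, c) where the link a->b occupies the same
    position in a as the link b->c occupies in b."""
    units = []
    for a, a_links in outlinks.items():
        for pos, b in enumerate(a_links):
            b_links = outlinks.get(b, [])
            if pos < len(b_links):
                c = b_links[pos]
                if len({a, b, c}) == 3:  # skip back-links such as a->b->a
                    units.append((a, b, c))
    return units


print(find_units(outlinks))
# [('A', 'B', 'C'), ('B', 'C', 'D')]
```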

Page 19: Detecting Sequences and Cycles of Web Pages

Combining the units into sequences

• DEF
• BCD
• ABC
• CDE

• ABCDEF
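
The slide's example (units DEF, BCD, ABC, CDE combining into ABCDEF) can be reproduced with a small sketch. It assumes every page has at most one successor, so that overlapping units can be chained through a successor map; the function name `combine_units` is hypothetical and this is not necessarily the authors' procedure.

```python
def combine_units(units):
    """Chain overlapping three-page units into maximal sequences.

    Assumes each page has at most one successor across all units.
    """
    successor = {}
    for a, b, c in units:
        successor[a] = b
        successor[b] = c

    # A sequence starts at a page that is nobody's successor.
    targets = set(successor.values())
    starts = [page for page in successor if page not in targets]

    sequences = []
    for start in starts:
        seq, seen = [start], {start}
        while seq[-1] in successor and successor[seq[-1]] not in seen:
            seq.append(successor[seq[-1]])
            seen.add(seq[-1])
        sequences.append(seq)
    return sequences


print(["".join(seq) for seq in combine_units(["DEF", "BCD", "ABC", "CDE"])])
# ['ABCDEF']
```

A pure cycle has no starting page under this scheme, so it yields no sequence here and has to be handled by a separate cycle check.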

Page 20: Detecting Sequences and Cycles of Web Pages

Cycle detection

• Existing cycle detection algorithms
  • Cycle detection in number theory
  • Special case of cycle detection in graph theory
• Stack-based algorithm
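
As an illustration of a stack-based check, the sketch below walks a successor map from a starting page, pushing each page on a stack, and reports a cycle when it reaches a page that is already on the stack. The successor-map representation and the name `find_cycle` are assumptions made for this example; the slides do not spell out the algorithm actually used.

```python
def find_cycle(successor, start):
    """Follow successor links from `start`; return the cycle as a list
    ending where it began (p_n = p_1), or None if the walk terminates."""
    stack = []
    position = {}  # page -> index on the stack
    page = start
    while page is not None and page not in position:
        position[page] = len(stack)
        stack.append(page)
        page = successor.get(page)
    if page is None:
        return None
    return stack[position[page]:] + [page]


print(find_cycle({"A": "B", "B": "C", "C": "D", "D": "B"}, "A"))
# ['B', 'C', 'D', 'B']
print(find_cycle({"A": "B", "B": "C"}, "A"))
# None
```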

Page 21: Detecting Sequences and Cycles of Web Pages

Improvements and Speedups

• Trust the “rel” information provided by the (author of the) pages
• Use keywords like “next” and “previous” to perceive the relationships
• Use the information in the naming convention
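
The first two speedups can be sketched with Python's standard `html.parser` module: read `rel` attributes on `<a>` and `<link>` elements, and fall back on anchor text such as “next” or “previous”. The class name `SequenceLinkParser`, the keyword table, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Assumed keyword table: rel values and anchor texts mapped to a direction.
KEYWORDS = {"next": "next", "previous": "prev", "prev": "prev"}


class SequenceLinkParser(HTMLParser):
    """Collect the first 'next' and 'prev' link found on a page."""

    def __init__(self):
        super().__init__()
        self.found = {}    # "next" / "prev" -> href
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Trust the author's rel attribute first.
        for token in (attrs.get("rel") or "").lower().split():
            if token in KEYWORDS:
                self.found.setdefault(KEYWORDS[token], href)
        if tag == "a":
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Fall back on anchor text such as "Next" or "Previous".
            label = "".join(self._text).strip().lower()
            if label in KEYWORDS:
                self.found.setdefault(KEYWORDS[label], self._href)
            self._href = None


html = ('<link rel="next" href="ch3.html">'
        '<a href="ch1.html">Previous</a>'
        '<a href="index.html">Contents</a>')
parser = SequenceLinkParser()
parser.feed(html)
parser.close()
print(parser.found)
# {'next': 'ch3.html', 'prev': 'ch1.html'}
```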

Page 22: Detecting Sequences and Cycles of Web Pages

Experimental Results

• Data
  • Toy data: Python tutorial in HTML
    • Tutorial split into several chapters and sections
    • Several cycles
  • Mutilated data
    • Certain pages deleted (missing links)
• 100% detection in all cases

Page 23: Detecting Sequences and Cycles of Web Pages

Other experiments planned

• Real test: unorganized web pages
• Difficulties:
  • Finding navigational links
  • Noise (advertisements, etc.)
  • Dynamically generated pages
• Will the relationships hold?

Page 24: Detecting Sequences and Cycles of Web Pages

Leads us to …

• Concatenate detected sequences for analysis
• Modify retrieval mechanism
• Return sets of pages as results
• Improve mirror/duplicate detection

Page 25: Detecting Sequences and Cycles of Web Pages

Future Work

• Consider other relations
• Unifying framework?
• Improve identification of navigational links

Page 26: Detecting Sequences and Cycles of Web Pages
Page 27: Detecting Sequences and Cycles of Web Pages