COM1721: Freshman Honors Seminar A Random Walk Through Computing Lecture 2: Structure of the Web October 1, 2002


Structure of the Web

Courtesy: infotoday.com

Why is Web Structure Interesting?

- Design of search engines:
  - Improved crawl strategies
  - Make use of link information to give better ranking, e.g., Google
- Generate good representative structures for simulations
- Relationship to other Internet structures
  - Traffic patterns
  - User access patterns

Why is Web Structure Interesting?

- Understanding the sociology of content creation on the web:
  - Six degrees of separation and the small-world phenomenon [Milgram 67]
  - Is every web page just six clicks away from every other web page?
- Simply because it is out there!

Background for the Study

- Conducted by researchers at AltaVista, Compaq, and IBM
- Analyzed the connectivity of more than 200M web pages and 1.5B links
- AltaVista web crawl, May 1999
  - Start from a large number of sources
  - Follow links in a breadth-first search manner and add pages to the database
- Structure determined by the set of all web pages crawled, together with their in-links and out-links
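The crawl procedure described above can be sketched on a toy example. This is a minimal illustration, not the AltaVista crawler: `TOY_WEB`, `crawl`, and `fetch_links` are hypothetical names, and the "web" is an in-memory dictionary rather than real pages fetched over the network.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "e": ["a"],  # nothing links to "e", so a crawl seeded at "a" never finds it
}

def crawl(seeds, fetch_links):
    """Breadth-first crawl from the seed pages.

    Returns two dictionaries: the out-links of every crawled page and the
    in-links discovered for every page seen during the crawl.
    """
    out_links = {}
    in_links = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue:
        page = queue.popleft()
        targets = fetch_links(page)
        out_links[page] = list(targets)
        in_links.setdefault(page, [])
        for target in targets:
            in_links.setdefault(target, []).append(page)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return out_links, in_links

out_links, in_links = crawl(["a"], lambda page: TOY_WEB.get(page, []))
print(sorted(out_links))  # pages that were crawled
print(in_links["c"])      # pages found to link to "c"
```

Page "e" is never crawled because no crawled page links to it, which is one reason a crawl starts from a large number of sources.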

A More Detailed Look

Broder et al, WWW Conference, 1999

Bowtie Components

- SCC (Core)
  - Largest strongly connected component
  - Every page in the core can reach every other page in the core
  - 56 million
- IN (Origination)
  - All pages outside the core that can reach the core
  - 44 million

Bowtie Components

- OUT (Termination)
  - All pages outside the core that are reachable from the SCC
  - 44 million
- Other pages:
  - Neither reachable from the SCC nor able to reach the SCC
  - Reachable from IN or can reach OUT (Tendrils)
  - Completely disconnected from the rest (Disconnected)
  - Total of 60 million
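The bowtie definitions translate directly into two reachability searches. The sketch below is a small illustration under stated assumptions: the graph `TOY` is invented, the function names are hypothetical, and the caller is assumed to already know one page (`core_node`) that lies in the largest strongly connected component. It is not the method used in the study.

```python
def reachable(graph, start):
    """Set of nodes reachable from start (including start itself)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Graph with every edge flipped."""
    rev = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev

def bowtie(graph, core_node):
    """Split nodes into (SCC, IN, OUT, other) relative to core_node's component."""
    forward = reachable(graph, core_node)             # core_node can reach these
    backward = reachable(reverse(graph), core_node)   # these can reach core_node
    scc = forward & backward
    in_part = backward - scc
    out_part = forward - scc
    other = set(graph) - scc - in_part - out_part     # tendrils and disconnected
    return scc, in_part, out_part, other

# Invented example: c1, c2, c3 form the core.
TOY = {
    "c1": ["c2"], "c2": ["c3"], "c3": ["c1", "o1"],
    "i1": ["c1", "t1"],   # i1 reaches the core; t1 hangs off IN (a tendril)
    "o1": [], "t1": [],
    "d1": ["d2"], "d2": [],  # disconnected from everything else
}
scc, in_part, out_part, other = bowtie(TOY, "c1")
print(sorted(scc), sorted(in_part), sorted(out_part), sorted(other))
```

Separating tendrils from fully disconnected pages would take one more pass (reachability from IN and to OUT), omitted here to keep the sketch short.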

Example Pages: SCC

- CCS! http://www.ccs.neu.edu
  - Links to many communities and other authoritative sites outside CCS
- Authoritative sites such as
  - http://www.ccs.neu.edu/home/rraj/Courses/172x/F02/
  - http://www.northeastern.edu
  - http://www.boston.com
  - http://www.yahoo.com

Example Pages: IN

- Individual home pages on web hosting services:
  - Do not have links from authoritative sources and core pages
  - Have connections to core pages through a series of links
- New or obscure web pages that have not attracted attention

Example Pages: OUT

- Commercial sites
  - Pages point to pages within the site
  - Rarely point to pages outside the site
  - http://www.ibm.com
- Pages that can be reached from a core site but do not have links back to the core
  - http://www.ccs.neu.edu/home/rraj/papers.html

Example Pages: Tendrils

Pages not in OUT or CORE with paths to OUT

Pages not in IN or CORE with paths from IN

A private web page in IN points to a page with links to corporate sites

Example Pages: Disconnected Pages

Temporary set of pages for working on a project

http://www.ccs.neu.edu/home/chenj/rsch/discussions.htm

Pages that were linked to the core, OUT, or IN earlier, with the links now removed

How was the Study Done?

- Crawlers searched from many initial locations:
  - Covered over 200 M webpages
  - With 1.5 billion links among these pages
  - 9.6 GB storage after compression
- Webpage characterized by URL and links to other URLs only
  - Page content not relevant to study
- A view that extracts essential information relevant to the purpose and ignores inessential details
  - Abstraction!
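As a small illustration of this abstraction, the sketch below reduces an HTML page to just its URL and the URLs it links to, discarding all other content. It uses Python's standard `html.parser` and `urllib.parse` modules; the class `LinkExtractor`, the function `abstract_page`, and the sample page are hypothetical, and the study's own tooling is not described in these slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))

def abstract_page(url, html):
    """Return the page as (url, list of linked URLs); everything else is dropped."""
    parser = LinkExtractor(url)
    parser.feed(html)
    return url, parser.links

# Invented sample page: headings and text are ignored, only links survive.
SAMPLE_HTML = (
    '<html><body><h1>Welcome</h1><p>Some text. '
    '<a href="/people">People</a> and '
    '<a href="http://www.example.org/">elsewhere</a>.</p></body></html>'
)
page = abstract_page("http://www.example.com/index.html", SAMPLE_HTML)
print(page)
```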

Finding the Structure

- Got a list of 200 M web pages and 1.5 billion links
- How do we find out:
  - The distance between two pages?
  - Which pages can be reached from a given page?
  - Which is the most popular webpage?
- Represent the web as a graph!

CCS Web as a Graph

[Figure: part of the CCS web site at http://www.ccs.neu.edu drawn as a graph, with nodes CCS, Contact Us, IS, People, Help, Research, NU, Orgns., Alumni, NU ACM, Chapters, Directory, US]

Directed Graphs

A directed graph is a pair G = (V, E)
- V: Set of vertices (nodes)
- E: Set of directed edges (links), each going from one vertex to another

[Figure: directed graph on nodes NU ACM, Chapters, Directory, US]

V = {NUACM, Chapters, Directory, US}
E = {(NUACM, Chapters), (Chapters, Directory), (Directory, US), (US, NUACM)}
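The same example can be written down almost verbatim in code. This is one possible encoding, using a set of vertex names and a set of ordered pairs; the helper names `is_valid_digraph` and `successors` are hypothetical.

```python
# The NU ACM example from the slide: vertices and directed edges.
V = {"NUACM", "Chapters", "Directory", "US"}
E = {
    ("NUACM", "Chapters"),
    ("Chapters", "Directory"),
    ("Directory", "US"),
    ("US", "NUACM"),
}

def is_valid_digraph(vertices, edges):
    """Every edge must start and end at a vertex of the graph."""
    return all(u in vertices and v in vertices for u, v in edges)

def successors(edges, u):
    """Vertices that u links to, in sorted order."""
    return sorted(v for (a, v) in edges if a == u)

print(is_valid_digraph(V, E))
print(successors(E, "US"))
```

Because the edges are ordered pairs, ("US", "NUACM") is in E while ("NUACM", "US") is not: direction matters.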

Graph Terms

In-degree: Number of edges into a node

Out-degree: Number of edges out of a node

Suppose a directed graph has n nodes and m edges:
- Average in-degree?
- Average out-degree?
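One way to explore the question is to count degrees on a small graph. Every edge adds one to the out-degree of its source and one to the in-degree of its target, so both degree totals equal m and both averages come out to m/n. The graph below is invented and `degrees` is a hypothetical helper.

```python
def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries for a directed graph."""
    in_deg = {node: 0 for node in nodes}
    out_deg = {node: 0 for node in nodes}
    for u, v in edges:
        out_deg[u] += 1  # edge leaves u
        in_deg[v] += 1   # edge enters v
    return in_deg, out_deg

# Invented example with n = 4 nodes and m = 5 edges.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 3), (4, 3), (3, 1)]

in_deg, out_deg = degrees(nodes, edges)
avg_in = sum(in_deg.values()) / len(nodes)
avg_out = sum(out_deg.values()) / len(nodes)
print(in_deg, out_deg)
print(avg_in, avg_out)  # both equal m / n
```

Node 3 has the highest in-degree here, which is the simplest notion of "most popular" page: the one with the most links pointing to it.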

More Graph Terms

- Strongly connected graph: There is a directed path from every node to every other node
- Distance from node u to v: Number of links on the shortest path from u to v
- Diameter:
  - Maximum distance between any two nodes
  - Finite for strongly connected graphs only
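These definitions can be checked with breadth-first search, which visits nodes in order of increasing distance from the source. The sketch below assumes the graph is a dictionary in which every node appears as a key; `distances_from` and `diameter` are hypothetical helper names, and `CHAIN` is an invented example.

```python
from collections import deque

def distances_from(graph, source):
    """Distance (number of links on a shortest path) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    """Maximum distance over all ordered pairs; infinite if some pair has no path."""
    best = 0
    for u in graph:
        dist = distances_from(graph, u)
        if len(dist) < len(graph):  # some node is unreachable from u
            return float("inf")
        best = max(best, max(dist.values()))
    return best

# The four-node NU ACM cycle is strongly connected.
CYCLE = {
    "NUACM": ["Chapters"],
    "Chapters": ["Directory"],
    "Directory": ["US"],
    "US": ["NUACM"],
}
# An invented chain a -> b -> c is not: nothing leads back to a.
CHAIN = {"a": ["b"], "b": ["c"], "c": []}

print(distances_from(CYCLE, "NUACM")["US"])
print(diameter(CYCLE))
print(diameter(CHAIN))
```

Note that distance is not symmetric in a directed graph: in `CYCLE`, US is three links from NUACM, but NUACM is one link from US.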

Undirected Graphs

- Edges are undirected: (u,v) is equivalent to (v,u)
- Degree of a node: Number of edges incident on it
- Connected: If there is a path between any two nodes

[Figure: an undirected graph on nodes 1, 2, 3, 4]
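A brief sketch of the undirected case follows. Each edge is stored in both directions so that (u,v) and (v,u) are the same edge. The two edge lists are invented for illustration (the edges of the slide's figure are not recoverable from the transcript), and the function names are hypothetical.

```python
def build_undirected(edges):
    """Adjacency sets for an undirected graph: each edge is recorded both ways."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, node):
    """Number of edges incident on node."""
    return len(adj[node])

def is_connected(adj):
    """True if there is a path between every two nodes."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)

# Invented examples on nodes 1..4.
square = build_undirected([(1, 2), (2, 3), (3, 4), (4, 1)])
two_pairs = build_undirected([(1, 2), (3, 4)])

print(degree(square, 1), is_connected(square))
print(degree(two_pairs, 1), is_connected(two_pairs))
```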

Graphs: Useful Representation Tools

- Social networks
- Transportation networks
- Control flow of a program
- Flowchart of a manufacturing process
- Computer networks
- Bibliography citations
- …

Structure of the Web

Broder et al, WWW Conference, 1999

Structural Properties of the Web

- Diameter of the SCC is at least 28
- Pick a random source page u and a random destination page v: How many links is v away from u?
  - 75% of the time, there is no path!
  - The other 25% of the time, average distance is 16
- Interesting distribution of degrees and sizes of connected components: power laws
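The source/destination experiment can be mimicked on a toy graph. The study picked random pairs from a very large crawl; the sketch below instead enumerates every ordered pair of an invented five-node graph, so the numbers it prints say nothing about the real web. `pair_statistics` and `TOY` are hypothetical names.

```python
from collections import deque

def distances_from(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pair_statistics(graph):
    """Return (fraction of ordered pairs with no path, average distance of the rest)."""
    total_pairs = 0
    connected_pairs = 0
    distance_sum = 0
    for u in graph:
        dist = distances_from(graph, u)
        for v in graph:
            if v == u:
                continue
            total_pairs += 1
            if v in dist:
                connected_pairs += 1
                distance_sum += dist[v]
    no_path_fraction = 1 - connected_pairs / total_pairs
    average = distance_sum / connected_pairs if connected_pairs else float("nan")
    return no_path_fraction, average

# Invented graph: a, b, c form a cycle; d only receives links; e only gives them.
TOY = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["a"]}

no_path_fraction, average_distance = pair_statistics(TOY)
print(no_path_fraction, average_distance)
```

Even in this tiny graph a sizable share of ordered pairs has no path, because links are one-way: nothing leads out of d and nothing leads into e.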

Representations of a Graph

Adjacency matrix

      1  2  3  4
  1   1  1  0  1
  2   0  1  0  1
  3   1  0  1  0
  4   0  0  0  1

[Figure: the corresponding graph on nodes 1, 2, 3, 4]
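A minimal sketch of working with an adjacency matrix follows. It copies the matrix from the slide and assumes that rows and columns are both ordered 1 to 4 and that the entry in row i, column j is 1 when there is an edge from i to j; the row labeling is not legible in the transcript, so treat it as an assumption. The helper names are hypothetical.

```python
# Matrix copied from the slide; row i, column j is 1 if there is an edge i -> j
# (assumed convention, with nodes numbered 1..4 in both directions).
MATRIX = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

def has_edge(matrix, u, v):
    """Constant-time edge lookup; nodes are numbered from 1."""
    return matrix[u - 1][v - 1] == 1

def out_degree(matrix, u):
    """Number of 1s in row u."""
    return sum(matrix[u - 1])

def in_degree(matrix, v):
    """Number of 1s in column v."""
    return sum(row[v - 1] for row in matrix)

print(has_edge(MATRIX, 1, 2), has_edge(MATRIX, 2, 1))
print(out_degree(MATRIX, 1), in_degree(MATRIX, 4))
```

The matrix always takes n × n entries regardless of how many edges exist, which is why it suits small or dense graphs better than a 200 M page web graph.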

Representations of a Graph

Adjacency list

[Figure: adjacency-list representation of a graph on nodes 1, 2, 3, 4, with one list of neighbors per node]
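To connect the two representations, the sketch below converts the matrix from the adjacency-matrix slide into an adjacency list. It assumes rows and columns are numbered 1 to 4 and that a 1 in row i, column j means an edge from i to j, so the diagonal 1s in that matrix show up as links from a node to itself. `matrix_to_adjacency_list` is a hypothetical helper name.

```python
# Matrix copied from the adjacency-matrix slide (assumed row/column order 1..4).
MATRIX = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]

def matrix_to_adjacency_list(matrix):
    """Map each node (numbered from 1) to the list of nodes it has an edge to."""
    return {
        i + 1: [j + 1 for j, entry in enumerate(row) if entry]
        for i, row in enumerate(matrix)
    }

adjacency = matrix_to_adjacency_list(MATRIX)
for node, neighbors in adjacency.items():
    print(node, "->", neighbors)

# Storage comparison: list entries grow with the number of edges,
# matrix entries with the square of the number of nodes.
list_entries = sum(len(neighbors) for neighbors in adjacency.values())
matrix_entries = len(MATRIX) * len(MATRIX)
print(list_entries, matrix_entries)
```

For a sparse graph such as the web, where each page links to only a handful of others, the adjacency list is far more compact.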

References

- Structure of the Web: Broder et al, WWW Conference 1999
- Graphs:
  - Books on elementary discrete math
  - Graph Theory, by F. Harary
- Graph algorithms: Algorithms and data structures books and courses