Basic WWW Technologies 2.1 Web Documents. 2.2 Resource Identifiers: URI, URL, and URN. 2.3 Protocols. 2.4 Log Files. 2.5 Search Engines.

Post on 21-Dec-2015

Basic WWW Technologies

2.1 Web Documents.

2.2 Resource Identifiers: URI, URL, and URN.

2.3 Protocols.

2.4 Log Files.

2.5 Search Engines.

What Is the World Wide Web?

The world wide web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience:

1. A uniform naming scheme for locating resources on the web (e.g., URIs).

2. Protocols, for access to named resources over the web (e.g., HTTP).

3. Hypertext, for easy navigation among resources (e.g., HTML).

Internet vs. Web

Internet:

• Internet is a more general term

• Includes the physical aspects of the underlying networks and mechanisms such as email, FTP, HTTP…

Web:

• Associated with information stored on the Internet

• Also refers to a broader class of networks, e.g. the Web of English Literature

Both Internet and web are networks

Essential Components of WWW

Resources:

• Conceptual mappings to concrete or abstract entities, which do not change in the short term

• Example: ICS website (web pages and other kinds of files)

Resource identifiers (hyperlinks):

• Strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource

• http://www.ics.uci.edu is used to identify the ICS homepage

Transfer protocols:

• Conventions that regulate the communication between a browser (web user agent) and a server

Standard Generalized Markup Language (SGML)

• Based on GML (generalized markup language), developed by IBM in the 1960s

• An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document

• Gave birth to the extensible markup language (XML), a W3C Recommendation since 1998

SGML Components

SGML documents have three parts:

• Declaration: specifies which characters and delimiters may appear in the application

• DTD/style sheet: defines the syntax of markup constructs

• Document instance: actual text (with the tags) of the document

More information can be found at: http://www.W3.Org/markup/SGML

DTD Example One

<!ELEMENT UL - - (LI)+>

• ELEMENT is a keyword that introduces a new element type, here unordered list (UL)

• The two hyphens indicate that both the start tag <UL> and the end tag </UL> for this element type are required

• The content model (LI)+ states that the element must contain one or more list items (LI) between the two tags

DTD Example Two

<!ELEMENT IMG - O EMPTY>

• The element type being declared is IMG

• The hyphen and the following "O" indicate that the end tag can be omitted

• Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted. (no closing tag)

HTML Background

• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.

• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.

• HTML standards are organized by W3C : http://www.w3.org/MarkUp/

HTML Functionalities

HTML gives authors the means to:

• Publish online documents with headings, text, tables, lists, photos, etc.

– Include spreadsheets, video clips, sound clips, and other applications directly in their documents

• Link information via hypertext links, at the click of a button

• Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc

HTML Versions

• HTML 4.01, released as a W3C Recommendation on 24 December 1999, is a revision of the HTML 4.0 Recommendation.

– HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt

• HTML 4.0 was first released as a W3C Recommendation on 18 December 1997

• HTML 3.2 was W3C's first Recommendation for HTML, which represented the consensus on HTML features for 1996

• HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.

Sample Webpage

Sample Webpage HTML Structure

<HTML>
  <HEAD>
    <TITLE>The title of the webpage</TITLE>
  </HEAD>
  <BODY>
    <P>Body of the webpage
  </BODY>
</HTML>

HTML Structure

• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)

• The title of the document appears in the head (along with other information about the document)

• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>
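The head/body split can be made concrete with a few lines of Python. The sketch below is only an illustration: it feeds the sample page to the standard-library `html.parser` module and collects the text found inside the title and inside the body. The class name `HeadBodySplitter` and the `SAMPLE` string are names chosen for this example.

```python
from html.parser import HTMLParser


class HeadBodySplitter(HTMLParser):
    """Collect text that appears inside <TITLE> and inside <BODY>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of the tracked elements currently open
        self.title = ""
        self.body_text = ""

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> arrives as "title"
        if tag in ("title", "body"):
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if not self.open_tags:
            return
        if self.open_tags[-1] == "title":
            self.title += data
        elif self.open_tags[-1] == "body":
            self.body_text += data


SAMPLE = """<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY>
<P>Body of the webpage
</BODY>
</HTML>"""

splitter = HeadBodySplitter()
splitter.feed(SAMPLE)
splitter.close()
print("title:", splitter.title.strip())
print("body: ", splitter.body_text.strip())
```

The `<P>` tag is deliberately not tracked, so the paragraph text is simply counted as body content.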

HTML Hyperlink

<a href="relations/alumni">alumni</a>

• A link is a connection from one Web resource to another

• It has two ends, called anchors, and a direction

• Starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)

Resource Identifiers

URI: Uniform Resource Identifiers

• URL: Uniform Resource Locators

• URN: Uniform Resource Names

Introduction to URIs

Every resource available on the Web has an address that may be encoded by a URI

URIs typically consist of three pieces:

• The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)

• The name of the machine hosting the resource

• The name of the resource itself, given as a path

URI Example

http://www.w3.org/TR

• There is a document available via the HTTP protocol

• Residing on the machines hosting www.w3.org

• Accessible via the path "/TR"
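The three pieces can be pulled apart programmatically. As a small sketch, Python's standard `urllib.parse.urlsplit` splits the example URI into scheme, host, and path:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.w3.org/TR")
print(parts.scheme)    # naming scheme of the access mechanism
print(parts.hostname)  # machine hosting the resource
print(parts.path)      # name of the resource, given as a path
```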

Protocols

Describe how messages are encoded and exchanged

Different Layering Architectures

• ISO OSI 7-Layer Architecture

• TCP/IP 4-Layer Architecture

ISO OSI Layering Architecture

ISO’s Design Principles

• A layer should be created where a different level of abstraction is needed

• Each layer should perform a well-defined function

• The layer boundaries should be chosen to minimize information flow across the interfaces

• The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy

TCP/IP Layering Architecture

• A simplified model that provides an end-to-end reliable connection

• The network layer

– Hosts drop packets into this layer, and the layer routes them towards the destination

– Promises only best-effort delivery (“Try my best”)

• The transport layer

– Reliable byte-oriented stream

Hypertext Transfer Protocol (HTTP)

• An application-layer protocol used to carry WWW traffic between a browser and a server, running over a connection-oriented transport protocol (TCP)

• One of the application layer protocols supported by the Internet

• HTTP communication is established via a TCP connection, by default to server port 80

GET Method in HTTP
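As a rough sketch of what a GET exchange looks like on the wire, the Python below builds the text of a minimal request and parses the status line of a response. The helper names `build_get_request` and `parse_status_line` are made up for this example, and the response is a hand-written sample rather than real server output, so nothing here touches the network.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # method, resource path, protocol version
        f"Host: {host}",          # required header in HTTP/1.1
        "Connection: close",      # ask the server to close after responding
        "",                       # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response: bytes):
    """Split the first response line into (version, status code, reason)."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("www.w3.org", "/TR")
# Hand-written response used only for illustration; no network access here.
canned_response = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
print(request.decode("ascii"))
print(parse_status_line(canned_response))
```

In a real client these bytes would be written to a TCP connection opened to the server, by default on port 80.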

Domain Name System

DNS (Domain Name System): maps domain names to IP addresses

IPv4:

• IPv4 was initially deployed on January 1st, 1983 and is still the most commonly used version.

• 32-bit address, written as a string of 4 decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255.

IPv6:

• Revision of IPv4 with 128-bit addresses
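The address sizes can be checked with Python's standard `ipaddress` module. This is a small illustration only: `dotted_to_int` is a helper written for this example, and 192.0.2.1 is an arbitrary documentation address. An actual name lookup would go through a resolver and needs network access, so it is left out.

```python
import ipaddress

addr = ipaddress.IPv4Address("255.255.255.255")
print(int(addr))                                  # largest IPv4 address as an integer
print(ipaddress.IPv4Address("0.0.0.0").max_prefixlen)   # address width in bits
print(ipaddress.IPv6Address("::1").max_prefixlen)       # address width in bits


def dotted_to_int(dotted: str) -> int:
    """Manual conversion: each of the 4 decimal numbers is one byte."""
    value = 0
    for part in dotted.split("."):
        value = (value << 8) | int(part)
    return value


print(dotted_to_int("192.0.2.1") == int(ipaddress.IPv4Address("192.0.2.1")))
```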

Top Level Domains (TLD)

Top-level domain names: .com, .edu, .gov, and ISO 3166 country codes

There are three types of top-level domains:

• Generic domains were created for use by the Internet public

• Country code domains were created to be used by individual countries

• The .arpa domain (Address and Routing Parameter Area domain) is designated to be used exclusively for Internet-infrastructure purposes

Registrars

• Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another

• InterNIC at http://internic.net

• Registrars Directory: http://www.internic.net/regist.html

Server Log Files

Server Transfer Log: transactions between a browser and server are logged

• IP address, the time of the request

• Method of the request (GET, HEAD, POST…)

• Status code, a response from the server

• Size in bytes of the transaction

Referrer Log: where the request originated

Agent Log: software making the request (e.g., a browser or a spider)

Error Log: requests that resulted in errors (e.g., 404)
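To show how such a transfer log can be read by a program, here is a minimal Python sketch. It assumes the entries follow the widely used Common Log Format, which the slides do not specify, and the three sample lines are invented for the example. It extracts the fields listed above and then counts page views and error requests.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# ip ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented sample entries, for illustration only.
SAMPLE_LOG = [
    '192.0.2.7 - - [21/Dec/2015:10:15:32 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.9 - - [21/Dec/2015:10:15:40 +0000] "GET /missing.html HTTP/1.0" 404 209',
    '192.0.2.7 - - [21/Dec/2015:10:16:02 +0000] "HEAD /index.html HTTP/1.0" 200 -',
]


def parse(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


entries = [e for e in map(parse, SAMPLE_LOG) if e is not None]
page_views = Counter(e["path"] for e in entries)           # requests per page
errors = [e["path"] for e in entries if e["status"] >= 400]  # e.g. broken links
print(page_views.most_common(1))
print(errors)
```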

Server Log Analysis

• Most and least visited web pages

• Entry and exit pages

• Referrals from other sites or search engines

• What are the searched keywords

• How many clicks/page views a page received

• Error reports, like broken links

Search Engines

According to Pew Internet Project Report (2002), search engines are the most popular way to locate information online

• About 33 million U.S. Internet users query on search engines on a typical day.

• More than 80% have used search engines

Search Engines are measured by coverage and recency

Coverage

Overlap analysis used for estimating the size of the indexable web

• W: set of webpages

• Wa, Wb: pages crawled by two independent engines a and b

• P(Wa), P(Wb): probabilities that a page was crawled by a or b

• P(Wa) = |Wa| / |W|

• P(Wb) = |Wb| / |W|

Overlap Analysis

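• $P(W_a \cap W_b \mid W_b) = P(W_a \cap W_b) / P(W_b) = |W_a \cap W_b| / |W_b|$

• If a and b are independent: $P(W_a \cap W_b) = P(W_a) \, P(W_b)$

• $P(W_a \cap W_b \mid W_b) = P(W_a) \, P(W_b) / P(W_b) = \dfrac{|W_a|}{|W|} \cdot \dfrac{|W_b|}{|W|} \Big/ \dfrac{|W_b|}{|W|} = |W_a| / |W| = P(W_a)$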

Overlap Analysis

Using |W| = |Wa|/ P(Wa), the researchers found:

• Web had at least 320 million pages in 1997

• 60% of web was covered by six major engines

• Maximum coverage of a single engine was 1/3 of the web
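The estimate can be sketched in a few lines of Python. Under the independence assumption, P(Wa) is approximated by the fraction of engine b's pages that engine a also crawled, which gives |W| ≈ |Wa| · |Wb| / |Wa ∩ Wb|. The function name and the two toy crawls below are invented for illustration.

```python
def estimate_web_size(pages_a: set, pages_b: set) -> float:
    """Estimate |W| from the crawls of two (assumed independent) engines."""
    overlap = len(pages_a & pages_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    # P(Wa) ~ |Wa ∩ Wb| / |Wb|, and |W| = |Wa| / P(Wa)
    return len(pages_a) * len(pages_b) / overlap


# Toy data: page ids are integers; both crawls are made up for illustration.
crawl_a = set(range(0, 60))     # engine a indexed pages 0..59
crawl_b = set(range(40, 90))    # engine b indexed pages 40..89
print(estimate_web_size(crawl_a, crawl_b))
```

With 60 and 50 pages crawled and 20 in common, the estimate is 150 pages, more than either engine covers on its own.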

How to Improve the Coverage?

• Meta-search engine: dispatches the user query to several engines at the same time, then collects the results and merges them into one list for the user.

• Any suggestions?

Web Crawler

• A crawler is a program that picks up a page and follows all the links on that page

• Crawler = Spider

• Types of crawler:

– Breadth First

– Depth First

Breadth First Crawlers

Use breadth-first search (BFS) algorithm

• Get all links from the starting page, and add them to a queue

• Pick the 1st link from the queue, get all links on the page and add to the queue

• Repeat above step till queue is empty
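The steps above can be sketched in Python on an in-memory link table instead of the live web. The `LINKS` dictionary is a hypothetical five-page site, and `get_links` stands in for fetching a page and extracting its links. The `seen` set is an addition not spelled out in the steps: without it, pages that link back to each other would be queued forever.

```python
from collections import deque

# Hypothetical link structure: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def bfs_crawl(start, get_links):
    """Visit pages level by level, returning them in visiting order."""
    visited = [start]
    seen = {start}
    queue = deque(get_links(start))       # all links from the starting page
    while queue:                          # repeat until the queue is empty
        page = queue.popleft()            # pick the 1st link from the queue
        if page in seen:
            continue                      # already crawled, skip it
        seen.add(page)
        visited.append(page)
        queue.extend(get_links(page))     # add that page's links to the queue
    return visited


print(bfs_crawl("A", lambda page: LINKS.get(page, [])))
```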

Depth First Crawlers

Use depth first search (DFS) algorithm

• Get the 1st non-visited link from the start page

• Visit the link and get its 1st non-visited link

• Repeat the above step until there are no non-visited links

• Go to the next non-visited link in the previous level and repeat the 2nd step
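A minimal Python sketch of the depth-first strategy, again on a hypothetical in-memory link table (`LINKS`) rather than real pages. Recursion is used here for brevity; a crawler working on deep link chains would more likely keep an explicit stack.

```python
# Hypothetical link structure: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["A"],
}


def dfs_crawl(start, get_links):
    """Follow the first non-visited link as deep as possible, then back up."""
    visited = []
    seen = set()

    def visit(page):
        seen.add(page)
        visited.append(page)
        for link in get_links(page):      # links in page order
            if link not in seen:          # 1st non-visited link first
                visit(link)               # go deeper before trying siblings

    visit(start)
    return visited


print(dfs_crawl("A", lambda page: LINKS.get(page, [])))
```

On this table the crawl reaches D through B before it ever looks at C, which is the difference from the breadth-first order.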

WEB GRAPHS

Internet/Web as Graphs

• Graph of the physical layer with routers, computers etc. as nodes and physical connections as edges

– It is limited

– Does not capture the graphical connections associated with the information on the Internet

• Web Graph where nodes represent web pages and edges are associated with hyperlinks

Web Graph

http://www.touchgraph.com/TGGoogleBrowser.html

Web Graph Considerations

• Edges can be directed or undirected

• Graph is highly dynamic

– Nodes and edges are added/deleted often

– Content of existing nodes is also subject to change

– Pages and hyperlinks created on the fly

• Apart from the primary connected component there are also smaller disconnected components

Why the Web Graph?

• Example of a large, dynamic and distributed graph

• Possibly similar to other complex graphs in social, biological and other systems

• Reflects how humans organize information (relevance, ranking) and their societies

• Efficient navigation algorithms

• Study behavior of users as they traverse the web graph (e-commerce)

Statistics of Interest

Size and connectivity of the graph

Number of connected components

Distribution of pages per site

Distribution of incoming and outgoing connections per site

Average and maximal length of the shortest path between any two vertices (diameter)

Properties of Web Graphs

Connectivity follows a power law distribution

The graph is sparse: $|E| = O(n)$, or at least $o(n^2)$

Average number of hyperlinks per page roughly a constant

A small world graph

Power Law Size

• Simple estimates suggest over a billion nodes

• Distribution of site sizes, measured by the number of pages, follows a power law distribution

• Observed over several orders of magnitude with an exponent in the 1.6-1.9 range

Power Law Connectivity

• Distribution of the number of connections per node follows a power law distribution

• Study at Notre Dame University reported a power law exponent of

– γ = 2.45 for outdegree distribution

– γ = 2.1 for indegree distribution

• Random graphs have a Poisson degree distribution if n is large

– Decays exponentially fast to 0 as k increases towards its maximum value n-1

Power Law Distribution - Examples

http://www.pnas.org/cgi/reprint/99/8/5207.pdf

Examples of networks with Power Law Distribution

• Internet at the router and interdomain level

• Citation network

• Collaboration network of actors

• Networks associated with metabolic pathways

• Networks formed by interacting genes and proteins

• Network of nervous system connections in C. elegans

Small World Networks

• It is a ‘small world’

– Millions of people. Yet, separated by “six degrees” of acquaintance relationships

– Popularized by Milgram’s famous experiment

• Mathematically

– Diameter of graph is small (log N) as compared to overall size

• Property seems interesting given ‘sparse’ nature of graph but …

– This property is ‘natural’ in ‘pure’ random graphs

The small world of WWW

• Empirical study of Web-graph reveals small-world property

– Average distance (d) in simulated web: $d = 0.35 + 2.06 \log_{10}(n)$

– e.g. $n = 10^9$, $d \approx 19$

– Graph generated using power-law model

• Diameter properties inferred from sampling

– Calculation of max. diameter computationally demanding for large values of n
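The quoted fit is easy to evaluate. The short Python sketch below assumes the logarithm is base 10, which is the reading that reproduces d ≈ 19 for a billion pages. The function name `average_distance` is chosen for this example.

```python
import math


def average_distance(n: float) -> float:
    """Evaluate d = 0.35 + 2.06 * log10(n), the fit quoted above."""
    return 0.35 + 2.06 * math.log10(n)


for n in (1e8, 1e9, 1e10):
    print(f"n = {n:.0e}: d = {average_distance(n):.2f}")
```

Each tenfold increase in n adds 2.06 to d, so the average distance grows by only about two links per order of magnitude.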

Implications for Web

• Logarithmic scaling of diameter makes future growth of web manageable

– A 10-fold increase in web pages results in only 2 additional ‘clicks’, but …

Users may not take shortest path, may use bookmarks or just get distracted on the way

Therefore search engines play a crucial role

Some theoretical considerations

Classes of small-world networks

• Scale-free: Power-law distribution of connectivity over entire range

• Broad-scale: Power-law over “broad range” + abrupt cut-off

• Single-scale: Connectivity distribution decays exponentially

Power Law of PageRank

• Assess importance of a page relative to a query and rank pages accordingly

– Importance measured by indegree

– Not reliable since it is entirely local

• PageRank: proportion of time a random surfer would spend on that page at steady state

• A random first order Markov surfer at each time step travels from one page to another

PageRank contd

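Page rank r(v) of page v is the steady state distribution obtained by solving the system of linear equations given by

\[ r(v) = \sum_{u \in pa[v]} \frac{r(u)}{|ch[u]|} \]

where pa[v] = set of parent nodes of v (the pages that link to v)

|ch[u]| = out degree of u (the number of pages u links to)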
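One way to approximate the steady state is to apply the random-surfer update repeatedly: each page passes its current rank to its children in equal shares. The Python below is a sketch under assumptions. It takes the update to be r(v) = Σ r(u) / outdegree(u) over the parents u of v, and it uses a made-up three-page web in which every page has an outgoing link. The damping (teleportation) term used in practical PageRank computations is not part of the definition given here and is left out.

```python
def pagerank(children, iterations=100):
    """Iterate r(v) = sum over parents u of r(u) / outdegree(u)."""
    pages = list(children)
    rank = {v: 1.0 / len(pages) for v in pages}    # start from uniform
    for _ in range(iterations):
        new_rank = {v: 0.0 for v in pages}
        for u in pages:
            share = rank[u] / len(children[u])     # r(u) / out degree of u
            for v in children[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical 3-page web: page -> pages it links to. The graph is strongly
# connected and aperiodic, so the plain iteration converges.
WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(WEB)
print({page: round(value, 3) for page, value in ranks.items()})
```

Solving the three equations by hand gives r(A) = r(C) = 0.4 and r(B) = 0.2, which the iteration approaches.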

Examples

Log Plot of PageRank Distribution of Brown Domain (*.brown.edu)

G. Pandurangan, P. Raghavan, E. Upfal, “Using PageRank to Characterize Web Structure”, COCOON 2002

Bow-tie Structure of Web

• A large scale study (Altavista crawls) reveals interesting properties of web

– Study of 200 million nodes & 1.5 billion links

– Small-world property not applicable to entire web: some parts are unreachable, others have long paths

• Power-law connectivity holds though

– Page indegree (γ = 2.1), outdegree (γ = 2.72)

Bow-tie Components

• Strongly Connected Component (SCC)

– Core with small-world property

• Upstream (IN)

– Core can’t reach IN

• Downstream (OUT)

– OUT can’t reach core

• Disconnected (Tendrils)

Component Properties

• Each component is roughly same size

– ~50 million nodes

• Tendrils not connected to SCC

– But reachable from IN and can reach OUT

• Tubes: directed paths IN->Tendrils->OUT

• Disconnected components

– Maximal and average diameter is infinite

Empirical Numbers for Bow-tie

Maximal minimal (?) diameter 28 for SCC, 500 for entire graph

Probability of a path between any 2 nodes: about one quarter (0.24)

Average length 16 (directed path exists), 7 (undirected)

Shortest directed path between 2 nodes in SCC: 16-20 links on average

Models for the Web Graph

• Stochastic models that can explain or at least partially reproduce properties of the web graph

– The model should follow the power law distribution properties

– Represent the connectivity of the web

– Maintain the small world property

Web Page Growth

• Empirical studies observe a power law distribution of site sizes

– Size includes size of the Web, number of IP addresses, number of servers, average size of a page etc.

• A generative model has been proposed to account for this distribution

Component One of the Generative Model

The first component of this model is that “sites have short-term size fluctuations up or down that are proportional to the size of the site”

A site with 100,000 pages may gain or lose a few hundred pages in a day, whereas such a change is rare for a site with only 100 pages

Component Two of the Generative Model

There is an overall growth rate so that the size S(t) satisfies

\[ S(t+1) = (1 + \eta_t b)\, S(t) \]

where

- $\eta_t$ is the realization of a ±1 Bernoulli random variable at time t with probability 0.5

- b is the absolute rate of the daily fluctuations
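The growth rule can be tried out with a short simulation. This Python sketch assumes the rule reads S(t+1) = (1 + η_t b) S(t), with η_t equal to +1 or -1 with probability 0.5 each. The starting size, the rate b = 0.01, and the number of steps are arbitrary values chosen for illustration.

```python
import math
import random


def simulate_site(s0: float, b: float, steps: int, rng: random.Random):
    """Apply S(t+1) = (1 + eta_t * b) * S(t) for `steps` time steps."""
    size = s0
    ups = 0                                   # number of times eta_t = +1
    for _ in range(steps):
        eta = rng.choice((+1, -1))            # +1 or -1, probability 0.5 each
        if eta == 1:
            ups += 1
        size *= 1 + eta * b
    return size, ups


rng = random.Random(0)                        # fixed seed for repeatability
size, ups = simulate_site(s0=1000.0, b=0.01, steps=365, rng=rng)
# log S(T) depends only on how many steps went up, which is a binomial count.
predicted = (math.log(1000.0)
             + ups * math.log(1.01)
             + (365 - ups) * math.log(0.99))
print(size, ups, math.isclose(math.log(size), predicted, rel_tol=1e-9))
```

Because the final size depends only on the count of upward steps, the logarithm of the size is a shifted and scaled binomial variable.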

Component Two of the Generative Model contd

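After T steps, writing $\eta_t$ for the ±1 random variable at time t and b for the fluctuation rate,

\[ S(T) = S(0) \prod_{t=0}^{T-1} (1 + \eta_t b) \]

so that

\[ \log S(T) = \log S(0) + \sum_{t=0}^{T-1} \log(1 + \eta_t b) \]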

Theoretical Considerations

Assuming the fluctuations at different time steps t are independent, by the central limit theorem it is clear that for large values of T, log S(T) is normally distributed

• The central limit theorem states that given a distribution with a mean μ and variance $\sigma^2$, the sampling distribution of the mean approaches a normal distribution with a mean μ and a variance $\sigma^2/N$ as N, the sample size, increases.

http://davidmlane.com/hyperstat/A14043.html

Theoretical Considerations contd

log S(T) can also be associated with a binomial distribution counting the number of times the fluctuation at time t equals +1

Hence S(T) has a log-normal distribution

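The probability density and cumulative distribution functions for the log-normal distribution, where μ and $\sigma^2$ are the mean and variance of the logarithm of the variable:

\[ f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \qquad F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\ln x - \mu}{\sigma\sqrt{2}}\right)\right] \]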

Modified Model

• The model can be modified to obey a power law distribution by including:

– A wide distribution of growth rates across different sites, and/or

– The fact that sites have different ages

Capturing Power Law Property

In order to capture the power law property it is sufficient to consider that:

• Web sites are being continuously created

• Web sites grow at a constant rate during a growth period, after which their size remains approximately constant

• The periods of growth follow an exponential distribution

• This will give a relation = 0.8 between the rate of exponential distribution and the growth rate when power law exponent = 1.08

Lattice Perturbation (LP) Models

Some Terms

• “Organized Networks” (a.k.a. Mafia)

– Each node has same degree k and neighborhoods are entirely local

– Probability of Edge (a,b) = 1 if dist(a,b) = 1, 0 otherwise

• Note: We are talking about graphs that can be mapped to a Cartesian plane

Terms (Cont’d)

• Organized Networks

– Are ‘cliquish’ (subgraph that is fully connected) in local neighborhood

– Probability of edges across neighborhoods is almost nonexistent (p=0 for fully organized)

• “Disorganized” Networks

– ‘Long-range’ edges exist

– Completely Disorganized <=> Fully Random (Erdos Model): p=1

Semi-organized (SO) Networks

• Probability for long-range edge is between zero and one

• Clustered at local level (cliquish)

– But have long-range links as well

• Leads to networks that

– Are locally cliquish

– And have short path lengths

Creating SO Networks

• Step 1: Take a regular network (e.g. lattice)

• Step 2: Shake it up (perturbation)

• Step 2 in detail:

– For each vertex, pick a local edge

– ‘Rewire’ the edge into a long-range edge with a probability (p)

– p=0: organized, p=1: disorganized
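The two steps can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: the regular network is taken to be a ring lattice, the loop visits every edge rather than picking one edge per vertex, and the function names are invented for the example.

```python
import random


def ring_lattice(n: int, k: int):
    """Step 1: regular network, each vertex linked to its k nearest neighbours."""
    edges = set()
    for v in range(n):
        for step in range(1, k // 2 + 1):
            edges.add(frozenset((v, (v + step) % n)))
    return edges


def rewire(edges, n: int, p: float, rng: random.Random):
    """Step 2: with probability p, replace a local edge by a random one."""
    result = set(edges)
    for edge in sorted(edges, key=sorted):     # fixed order for repeatability
        if rng.random() >= p:
            continue                           # keep this local edge
        a, _ = sorted(edge)
        # New endpoints must not create a self-loop or a duplicate edge.
        candidates = [v for v in range(n)
                      if v != a and frozenset((a, v)) not in result]
        if candidates:
            result.remove(edge)
            result.add(frozenset((a, rng.choice(candidates))))
    return result


rng = random.Random(1)
lattice = ring_lattice(20, 4)
organized = rewire(lattice, 20, 0.0, rng)      # p=0: nothing changes
shaken = rewire(lattice, 20, 1.0, rng)         # p=1: every edge is rewired
print(len(lattice), len(shaken), organized == lattice)
```

Rewiring keeps the number of edges fixed, so only the placement of the links changes as p grows.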

Statistics of SO Networks

• Average Diameter (d): Average distance between two nodes

• Average Clique Fraction (c)

– Given a vertex v, k(v): neighbors of v

– Max edges among k(v) = k(k-1)/2

– Clique Fraction (cv): (Edges present) / (Max)

– Average clique fraction: average over all nodes

– Measures: Degree to which “my friends are friends of each other”
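The clique fraction is straightforward to compute from an adjacency structure. The Python sketch below uses a small invented friendship graph. Returning 0 for a vertex with fewer than two neighbours is a convention chosen here, since the definition above leaves that case open.

```python
from itertools import combinations


def clique_fraction(adjacency, v):
    """(edges present among the neighbours of v) / (maximum possible)."""
    neighbours = adjacency[v]
    k = len(neighbours)
    if k < 2:
        return 0.0                       # convention chosen here for k < 2
    present = sum(1 for a, b in combinations(neighbours, 2)
                  if b in adjacency[a])
    return present / (k * (k - 1) / 2)


def average_clique_fraction(adjacency):
    """Average of the clique fraction over all vertices."""
    return sum(clique_fraction(adjacency, v) for v in adjacency) / len(adjacency)


# Hypothetical friendship graph, stored as vertex -> set of neighbours.
FRIENDS = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}

print(clique_fraction(FRIENDS, "ann"))
print(average_clique_fraction(FRIENDS))
```

Of ann's three friends only bob and cat know each other, so one of the three possible edges is present.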

Statistics (Cont’d)

Statistics of common networks:

              n         k       d       c

Actors        225,226   61      3.65    0.79

Power-grid    4,941     2.67    18.7    0.08

C.elegans     282       14      2.65    0.28

Large k = large c?

Small c = large d?

Other Properties

• For graph to be sparse but connected: n >> k >> log(n) >> 1

• As p --> 0 (organized)

– d ~= n/2k >> 1, c ~= 3/4

– Highly clustered & d grows linearly with n

• As p --> 1 (disorganized)

– d ~= log(n)/log(k), c ~= k/n << 1

– Poorly clustered & d grows logarithmically with n

Effect of ‘Shaking it up’

• Small shake (p close to zero)

– High cliquishness AND short path lengths

• Larger shake (p increased further from 0)

– d drops rapidly (increased small world phenomena)

– c remains constant (transition to small world almost undetectable at local level)

• Effect of long-range link:

– Addition: non-linear decrease of d

– Removal: small linear decrease of c

LP and The Web

• LP has severe limitations

– No concept of short or long links in Web: a page in USA and another in Europe can be joined by one hyperlink

– Edge rewiring doesn’t produce power-law connectivity! Degree distribution bounded & strongly concentrated around mean value

• Therefore, we need other models …