
Basic WWW Technologies

2.1 Web Documents.2.2 Resource Identifiers: URI, URL, and URN.

2.3 Protocols.2.4 Log Files.

2.5 Search Engines.

2

What Is the World Wide Web?

The world wide web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience:

1. A uniform naming scheme for locating resources on the web (e.g., URIs).

2. Protocols, for access to named resources over the web (e.g., HTTP).

3. Hypertext, for easy navigation among resources (e.g., HTML).

3

Internet vs. Web

Internet:
• The Internet is the more general term
• Includes the physical aspects of the underlying networks, and mechanisms such as email, FTP, HTTP…

Web:
• Associated with the information stored on the Internet
• Also refers to a broader class of networks, e.g. the web of English literature

Both the Internet and the web are networks

4

Essential Components of WWW

Resources:
• Conceptual mappings to concrete or abstract entities, which do not change in the short term
• Ex: the ICS website (web pages and other kinds of files)

Resource identifiers (hyperlinks):
• Strings of characters representing generalized addresses that may contain instructions for accessing the identified resource
• Ex: http://www.ics.uci.edu is used to identify the ICS homepage

Transfer protocols:
• Conventions that regulate the communication between a browser (web user agent) and a server

5

Standard Generalized Markup Language (SGML)

• Based on GML (generalized markup language), developed by IBM in the 1960s

• An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document

• Gave birth to the extensible markup language (XML), W3C recommendation in 1998

6

SGML Components

SGML documents have three parts:
• Declaration: specifies which characters and delimiters may appear in the application
• DTD / style sheet: defines the syntax of markup constructs
• Document instance: the actual text (with the tags) of the document

More information can be found at: http://www.w3.org/MarkUp/SGML

7

DTD Example One

<!ELEMENT UL - - (LI)+>
• ELEMENT is a keyword that introduces a new element type, here the unordered list (UL)
• The two hyphens indicate that both the start tag <UL> and the end tag </UL> for this element type are required
• The content model (LI)+ indicates that the element must contain one or more list items (LI)

8

DTD Example Two

<!ELEMENT IMG - O EMPTY>

• The element type being declared is IMG

• The hyphen and the following "O" indicate that the end tag can be omitted

• Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted (there is no closing tag)

9

HTML Background

• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.

• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.

• HTML standards are maintained by the W3C: http://www.w3.org/MarkUp/

10

HTML Functionalities

HTML gives authors the means to:
• Publish online documents with headings, text, tables, lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other applications directly in their documents
• Link information via hypertext links, at the click of a button
• Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.

11

HTML Versions

• HTML 4.01 is a revision of the HTML 4.0 Recommendation
– HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18 December 1997
• HTML 3.2 was W3C's first Recommendation for HTML, which represented the consensus on HTML features for 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994

12

Sample Webpage

13

Sample Webpage HTML Structure

<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY>
<P>Body of the webpage
</BODY>
</HTML>

14

HTML Structure

• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)

• The title of the document appears in the head (along with other information about the document)

• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>

15

HTML Hyperlink

<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource to another
• It has two ends, called anchors, and a direction
• It starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)

16

Resource Identifiers

URI: Uniform Resource Identifiers

• URL: Uniform Resource Locators

• URN: Uniform Resource Names

17

Introduction to URIs

Every resource available on the Web has an address that may be encoded by a URI

URIs typically consist of three pieces:
• The naming scheme of the mechanism used to access the resource (e.g. HTTP, FTP)
• The name of the machine hosting the resource
• The name of the resource itself, given as a path

18

URI Example

http://www.w3.org/TR

• There is a document available via the HTTP protocol

• Residing on the machines hosting www.w3.org

• Accessible via the path "/TR"
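As a small illustration, the three pieces above can be separated with Python's standard-library urllib.parse (shown here as a sketch, not part of the original slides):

```python
# Splitting the example URI into the three pieces described above:
# scheme (access mechanism), host machine, and resource path.
from urllib.parse import urlparse

uri = "http://www.w3.org/TR"
parts = urlparse(uri)

print(parts.scheme)  # "http"       - the naming scheme / protocol
print(parts.netloc)  # "www.w3.org" - the host machine
print(parts.path)    # "/TR"        - the resource path
```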

19

Protocols

Describe how messages are encoded and exchanged

Different Layering Architectures

• ISO OSI 7-Layer Architecture

• TCP/IP 4-Layer Architecture

20

ISO OSI Layering Architecture

21

ISO’s Design Principles

• A layer should be created where a different level of abstraction is needed

• Each layer should perform a well-defined function

• The layer boundaries should be chosen to minimize information flow across the interfaces

• The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy

22

TCP/IP Layering Architecture

23

TCP/IP Layering Architecture

• A simplified model that provides end-to-end reliable connections
• The network layer
– Hosts drop packets into this layer; the layer routes them toward the destination
– It only promises "I'll try my best" (best effort)
• The transport layer
– Provides a reliable byte-oriented stream

24

Hypertext Transfer Protocol (HTTP)

• A protocol that uses a connection-oriented transport (TCP) to carry WWW traffic between a browser and a server
• An application-level protocol, carried over TCP, one of the transport-layer protocols of the Internet
• HTTP communication is established via a TCP connection to server port 80

25

GET Method in HTTP
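The slide's figure is not reproduced in this transcript. As a hedged sketch of what it typically shows, the following builds the general shape of an HTTP/1.0-style GET request; the host and path are illustrative, not taken from the slide:

```python
# Sketch of an HTTP GET request as sent over the TCP connection:
# a request line, headers, and a blank line, each terminated by CRLF.
def build_get_request(host: str, path: str) -> str:
    return (
        f"GET {path} HTTP/1.0\r\n"   # request line: method, path, version
        f"Host: {host}\r\n"          # illustrative header
        "\r\n"                       # blank line ends the request
    )

request = build_get_request("www.w3.org", "/TR")
print(request)
```

The server answers with a status line (e.g. "HTTP/1.0 200 OK"), response headers, and the requested document.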

26

Domain Name System

DNS (Domain Name System): maps domain names to IP addresses

IPv4:
• Initially deployed January 1, 1983, and still the most commonly used version
• 32-bit addresses, written as a string of 4 decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255

IPv6:
• A revision of IPv4 with 128-bit addresses
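The two address formats can be checked with Python's standard-library ipaddress module (a small sketch; the sample addresses are illustrative):

```python
# Illustrating the 32-bit vs. 128-bit address formats mentioned above.
import ipaddress

v4 = ipaddress.ip_address("255.255.255.255")  # top of the IPv4 range
v6 = ipaddress.ip_address("2001:db8::1")      # an IPv6 documentation address

print(v4.version, v4.max_prefixlen)  # 4 32   (32-bit address)
print(v6.version, v6.max_prefixlen)  # 6 128  (128-bit address)
```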

27

Top Level Domains (TLD)

Top-level domain names: .com, .edu, .gov, and the ISO 3166 country codes

There are three types of top-level domains:
• Generic domains, created for use by the Internet public
• Country code domains, created for use by individual countries
• The .arpa (Address and Routing Parameter Area) domain, designated to be used exclusively for Internet-infrastructure purposes

28

Registrars

• Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another

• InterNIC at http://internic.net

• Registrars Directory: http://www.internic.net/regist.html

29

Server Log Files

Server Transfer Log: transactions between a browser and the server are logged
• IP address and the time of the request
• Method of the request (GET, HEAD, POST, …)
• Status code, the response from the server
• Size in bytes of the transaction

Referrer Log: where the request originated

Agent Log: the browser software making the request (possibly a spider)

Error Log: requests that resulted in errors (e.g. 404)
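The transfer-log fields above can be pulled out of one line of the common (NCSA) log format with a regular expression. A minimal sketch; the sample log line is invented for illustration:

```python
# Parsing one common-log-format line into the fields listed above:
# IP, time, method, path, status code, and transfer size.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '192.0.2.1 - - [21/Dec/2015:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043'
m = LOG_PATTERN.match(line)
print(m.group("ip"), m.group("method"), m.group("status"), m.group("size"))
```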

30

Server Log Analysis

• Most and least visited web pages

• Entry and exit pages

• Referrals from other sites or search engines

• Which keywords were searched

• How many clicks/page views a page received

• Error reports, like broken links

31

Server Log Analysis

32

Search Engines

According to the Pew Internet Project Report (2002), search engines are the most popular way to locate information online:
• About 33 million U.S. Internet users query search engines on a typical day
• More than 80% have used search engines

Search engines are measured by coverage and recency

33

Coverage

Overlap analysis is used for estimating the size of the indexable web:
• W: the set of web pages
• Wa, Wb: the pages crawled by two independent engines a and b
• P(Wa), P(Wb): the probabilities that a page was crawled by a or b
• P(Wa) = |Wa| / |W|
• P(Wb) = |Wb| / |W|

34

Overlap Analysis

• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb)
                 = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  P(Wa ∩ Wb) = P(Wa) · P(Wb)
• So P(Wa ∩ Wb | Wb) = P(Wa) · P(Wb) / P(Wb)
                    = P(Wa)
                    = |Wa| / |W|

35

Overlap Analysis

Using |W| = |Wa| / P(Wa), with P(Wa) estimated by |Wa ∩ Wb| / |Wb|, the researchers found:

• The web had at least 320 million pages in 1997

• 60% of the web was covered by the six major engines combined

• The maximum coverage of a single engine was 1/3 of the web
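Putting the estimator together, |W| ≈ |Wa| · |Wb| / |Wa ∩ Wb|. A toy sketch with two small invented "crawl" sets standing in for real engine indexes:

```python
# Overlap estimate of the total web size from two independent crawls:
# |W| ~= |Wa| * |Wb| / |Wa ∩ Wb|.
wa = {"p1", "p2", "p3", "p4"}              # pages crawled by engine a
wb = {"p3", "p4", "p5", "p6", "p7", "p8"}  # pages crawled by engine b

overlap = len(wa & wb)                     # |Wa ∩ Wb| = 2
estimated_size = len(wa) * len(wb) / overlap
print(estimated_size)                      # 4 * 6 / 2 = 12.0
```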

36

How to Improve the Coverage?

• Meta-search engine: dispatch the user query to several engines at the same time, then collect and merge the results into one list for the user

• Any suggestions?

37

Web Crawler

• A crawler is a program that picks up a page and follows all the links on that page

• Crawler = spider

• Types of crawlers:
– Breadth-first
– Depth-first

38

Breadth First Crawlers

Uses the breadth-first search (BFS) algorithm:

• Get all links from the starting page and add them to a queue

• Pick the first link from the queue, get all links on that page, and add them to the queue

• Repeat the above step until the queue is empty
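The steps above can be sketched in a few lines of Python. To stay self-contained the example "crawls" an invented in-memory link graph instead of fetching real pages over HTTP:

```python
# Breadth-first crawl: visit pages level by level using a FIFO queue.
from collections import deque

links = {                       # invented page -> outgoing links graph
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def bfs_crawl(start):
    order = []
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)              # "visit" the page
        for link in links.get(page, []):
            if link not in visited:     # enqueue each unseen link
                visited.add(link)
                queue.append(link)
    return order

print(bfs_crawl("start"))  # ['start', 'a', 'b', 'c', 'd']
```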

39

Breadth First Crawlers

40

Depth First Crawlers

Uses the depth-first search (DFS) algorithm:

• Get the first unvisited link from the start page

• Visit that link and get its first unvisited link

• Repeat the above step until there are no unvisited links left

• Go back to the next unvisited link at the previous level and repeat from the second step
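The depth-first variant can be sketched the same way, again over an invented in-memory link graph rather than real pages:

```python
# Depth-first crawl: follow the first unvisited link as deep as
# possible, then backtrack to the previous level.
links = {                       # invented page -> outgoing links graph
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

def dfs_crawl(page, visited=None, order=None):
    if visited is None:
        visited, order = set(), []
    visited.add(page)
    order.append(page)                  # "visit" the page
    for link in links.get(page, []):
        if link not in visited:         # recurse into each unvisited link
            dfs_crawl(link, visited, order)
    return order

print(dfs_crawl("start"))  # ['start', 'a', 'c', 'b', 'd']
```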

41

Depth First Crawlers

WEB GRAPHS

43

Internet/Web as Graphs

A graph of the physical layer, with routers, computers, etc. as nodes and physical connections as edges
• This view is limited: it does not capture the connections associated with the information on the Internet

The Web graph: nodes represent web pages and edges are associated with hyperlinks

44

Web Graph

http://www.touchgraph.com/TGGoogleBrowser.html

45

Web Graph Considerations

Edges can be directed or undirected

The graph is highly dynamic:
• Nodes and edges are added/deleted often
• The content of existing nodes is also subject to change
• Pages and hyperlinks are created on the fly

Apart from the primary connected component, there are also smaller disconnected components

46

Why the Web Graph?

• An example of a large, dynamic, and distributed graph
• Possibly similar to other complex graphs in social, biological, and other systems
• Reflects how humans organize information (relevance, ranking) and their societies
• Enables efficient navigation algorithms
• Supports studying the behavior of users as they traverse the web graph (e-commerce)

47

Statistics of Interest

Size and connectivity of the graph

Number of connected components

Distribution of pages per site

Distribution of incoming and outgoing connections per site

Average and maximal length of the shortest path between any two vertices (diameter)

48

Properties of Web Graphs

Connectivity follows a power law distribution

The graph is sparse: |E| = O(n), or at least o(n^2)

The average number of hyperlinks per page is roughly a constant

It is a small-world graph

49

Power Law Size

Simple estimates suggest over a billion nodes

The distribution of site sizes, measured by the number of pages, follows a power-law distribution

This has been observed over several orders of magnitude, with an exponent in the 1.6-1.9 range

50

Power Law Connectivity

The distribution of the number of connections per node follows a power-law distribution

A study at Notre Dame University reported:
• γ = 2.45 for the outdegree distribution
• γ = 2.1 for the indegree distribution

Random graphs, by contrast, have a Poisson degree distribution, which decays exponentially fast to 0 as k increases toward its maximum value n-1
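Given a sample of node degrees, the power-law exponent γ can be estimated with the standard maximum-likelihood formula γ = 1 + n / Σ ln(k_i / k_min). A minimal sketch; the degree data below is invented for illustration:

```python
# MLE fit of a power-law exponent gamma from a sample of degrees.
import math

degrees = [1, 1, 1, 2, 2, 3, 5, 8, 20]   # invented degree sample
k_min = 1                                 # smallest degree in the fit
n = len(degrees)

# gamma = 1 + n / sum(ln(k / k_min))
gamma = 1 + n / sum(math.log(k / k_min) for k in degrees)
print(round(gamma, 2))  # 1.98
```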

51

Power Law Distribution: Examples

http://www.pnas.org/cgi/reprint/99/8/5207.pdf

52

Examples of networks with Power Law Distribution

• The Internet at the router and interdomain level
• Citation networks
• The collaboration network of actors
• Networks associated with metabolic pathways
• Networks formed by interacting genes and proteins
• The network of nervous-system connections in C. elegans

53

Small World Networks

It is a ‘small world’:
• Millions of people, yet separated by “six degrees” of acquaintance relationships
• Popularized by Milgram’s famous experiment

Mathematically:
• The diameter of the graph is small (log N) compared to its overall size
• This property seems interesting given the ‘sparse’ nature of the graph, but …
• The property is ‘natural’ in ‘pure’ random graphs

54

The small world of WWW

An empirical study of the Web graph reveals the small-world property
• Average distance (d) in a simulated web: d = 0.35 + 2.06 log(n)
• e.g. for n = 10^9, d ≈ 19
• The graph was generated using a power-law model
• Diameter properties were inferred from sampling, since calculating the maximal diameter is computationally demanding for large values of n
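The quoted formula can be checked directly. A base-10 logarithm is assumed here, since that is what reproduces d ≈ 19 for n = 10^9:

```python
# Checking d = 0.35 + 2.06 * log10(n) for a web of a billion pages.
import math

def avg_distance(n):
    return 0.35 + 2.06 * math.log10(n)

d = avg_distance(1e9)    # 0.35 + 2.06 * 9 = 18.89
print(round(d))          # ~19 clicks
```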

55

Implications for Web

Logarithmic scaling of the diameter makes future growth of the web manageable
• A 10-fold increase in web pages results in only about 2 additional ‘clicks’, but …
• Users may not take the shortest path; they may use bookmarks or just get distracted on the way

Therefore search engines play a crucial role

56

Some theoretical considerations

Classes of small-world networks:

• Scale-free: power-law distribution of connectivity over the entire range

• Broad-scale: power-law over a “broad range”, followed by an abrupt cut-off

• Single-scale: connectivity distribution decays exponentially

57

Power Law of PageRank

Goal: assess the importance of a page relative to a query and rank pages accordingly

• Importance could be measured by indegree, but this is not reliable since it is entirely local

• PageRank: the proportion of time a random surfer would spend on the page at steady state

• The random surfer is first-order Markov: at each time step it travels from one page to another

58

PageRank contd

The PageRank r(v) of page v is the steady-state distribution obtained by solving the system of linear equations

r(v) = Σ_{u ∈ pa[v]} r(u) / ch[u]

where pa[v] is the set of parent nodes of v (pages linking to v) and ch[u] is the outdegree of u
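This system can be solved by power iteration. A minimal sketch on an invented 3-page graph, using the plain random-surfer formulation above (no damping factor):

```python
# Power iteration for r(v) = sum over parents u of r(u) / ch(u).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # invented link graph
pages = sorted(links)

r = {p: 1 / len(pages) for p in pages}   # start from the uniform distribution
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for u, outs in links.items():
        for v in outs:
            new[v] += r[u] / len(outs)   # u passes its rank to its children
    r = new

print({p: round(r[p], 3) for p in pages})  # {'a': 0.4, 'b': 0.2, 'c': 0.4}
```

The iteration converges because this small chain is irreducible and aperiodic; real web graphs need the damping/teleportation term to guarantee that.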

59

Examples

Log plot of the PageRank distribution of the Brown domain (*.brown.edu)

G. Pandurangan, P. Raghavan, E. Upfal, "Using PageRank to Characterize Web Structure", COCOON 2002

60

Bow-tie Structure of Web

A large-scale study (of AltaVista crawls) reveals interesting properties of the web
• Studied 200 million nodes & 1.5 billion links

The small-world property is not applicable to the entire web:
• Some parts are unreachable
• Others have long paths

Power-law connectivity holds, though:
• Page indegree (γ = 2.1) and outdegree (γ = 2.72)

61

Bow-tie Components

• Strongly Connected Component (SCC): the core, with the small-world property

• Upstream (IN): can reach the core, but the core can’t reach IN

• Downstream (OUT): reachable from the core, but OUT can’t reach the core

• Disconnected (Tendrils)

62

Component Properties

• Each component is roughly the same size: ~50 million nodes

• Tendrils are not connected to the SCC, but are reachable from IN and can reach OUT

• Tubes: directed paths IN -> Tendrils -> OUT

• Because of the disconnected components, the maximal and average diameters are infinite

63

Empirical Numbers for Bow-tie

• Maximal diameter (longest shortest path): 28 for the SCC, 500 for the entire graph

• Probability that a path exists between any 2 nodes: about one quarter (0.24)

• Average path length: 16 when a directed path exists, 7 for undirected paths

• Shortest directed path between 2 nodes in the SCC: 16-20 links on average

64

Models for the Web Graph

Stochastic models that can explain, or at least partially reproduce, the properties of the web graph. Such a model should:

• Follow the power-law distribution properties

• Represent the connectivity of the web

• Maintain the small-world property

65

Web Page Growth

Empirical studies observe a power-law distribution of site sizes
• "Size" here includes the size of the Web, the number of IP addresses, the number of servers, the average size of a page, etc.

A generative model has been proposed to account for this distribution

66

Component One of the Generative Model

The first component of this model is that "sites have short-term size fluctuations, up or down, that are proportional to the size of the site"

A site with 100,000 pages may gain or lose a few hundred pages in a day, whereas this effect is rare for a site with only 100 pages

67

Component Two of the Generative Model

There is an overall growth rate, so that the size S(t) satisfies

S(t+1) = (1 + b·ε_t) S(t)

where

- ε_t is the realization of a ±1 Bernoulli random variable at time t, each sign with probability 0.5

- b is the absolute rate of the daily fluctuations

68

Component Two of the Generative Model contd

After T steps,

S(T) = S(0) · Π_{t=0}^{T-1} (1 + b·ε_t)

so that

log S(T) = log S(0) + Σ_{t=0}^{T-1} log(1 + b·ε_t)
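The multiplicative growth process can be simulated directly. A sketch with illustrative parameter values (b, the horizon, and the starting size are not taken from the slides):

```python
# Simulating S(t+1) = (1 + b*e_t) * S(t) with e_t = ±1 equiprobable.
import math, random

random.seed(0)

def grow(s0=1000.0, b=0.05, steps=365):
    s = s0
    for _ in range(steps):
        e = random.choice([1, -1])  # the ±1 Bernoulli fluctuation
        s *= 1 + b * e
    return s

# Many independent runs: log S(T) is approximately normal,
# so S(T) itself is approximately log-normally distributed.
sizes = [grow() for _ in range(200)]
logs = [math.log(s) for s in sizes]
print(round(sum(logs) / len(logs), 2))
```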

69

Theoretical Considerations

Assuming the ε_t are independent, by the central limit theorem it is clear that for large values of T, log S(T) is normally distributed

The central limit theorem states that given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N, the sample size, increases

http://davidmlane.com/hyperstat/A14043.html

70

Theoretical Considerations contd

log S(T) can also be associated with a binomial distribution counting the number of times ε_t = +1

Hence S(T) has a log-normal distribution, with probability density function

f(x) = 1 / (x σ √(2π)) · exp( -(ln x - μ)² / (2σ²) )

and cumulative distribution function Φ((ln x - μ) / σ), where Φ is the standard normal CDF

71

Modified Model

The basic model can be modified to obey a power-law distribution by including:

• A wide distribution of growth rates across different sites, and/or

• The fact that sites have different ages

72

Capturing Power Law Property

• In order to capture the power-law property, it is sufficient to assume that web sites are being continuously created

• Web sites grow at a constant rate during a growth period, after which their size remains approximately constant

• The lengths of the growth periods follow an exponential distribution

• This yields a relation (≈ 0.8) between the rate of the exponential distribution and the growth rate when the power-law exponent is 1.08

73

Lattice Perturbation (LP) Models

Some terms:

“Organized networks” (a.k.a. the Mafia)
• Each node has the same degree k and neighborhoods are entirely local
• Probability of edge (a, b) = 1 if dist(a, b) = 1, and 0 otherwise

Note: we are talking about graphs that can be mapped to a Cartesian plane

74

Terms (Cont’d)

Organized networks:
• Are ‘cliquish’ (a clique is a subgraph that is fully connected) in the local neighborhood
• The probability of edges across neighborhoods is almost nonexistent (p = 0 for fully organized)

“Disorganized” networks:
• ‘Long-range’ edges exist
• Completely disorganized <=> fully random (the Erdős model): p = 1

75

Semi-organized (SO) Networks

• The probability of a long-range edge is between zero and one

• Clustered at the local level (cliquish), but with long-range links as well

• This leads to networks that are locally cliquish and have short path lengths

76

Creating SO Networks

Step 1: take a regular network (e.g. a lattice)

Step 2: shake it up (perturbation)

Step 2 in detail:
• For each vertex, pick a local edge
• ‘Rewire’ the edge into a long-range edge with probability p
• p = 0: organized, p = 1: disorganized
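The two steps can be sketched in the style of the Watts-Strogatz procedure. This is a simplified illustration (it ignores the self-loops and duplicate edges a careful implementation would avoid); all parameters are invented:

```python
# Step 1: build a ring lattice; Step 2: rewire local edges with prob. p.
import random

def ring_lattice(n, k):
    # Each node i connects to its k/2 nearest neighbors on one side;
    # over all nodes this gives every node degree k.
    edges = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.add((i, (i + j) % n))
    return edges

def rewire(edges, n, p, seed=0):
    rng = random.Random(seed)
    new_edges = set()
    for (i, j) in edges:
        if rng.random() < p:
            j = rng.randrange(n)   # replace with a random long-range endpoint
        new_edges.add((i, j))
    return new_edges

lattice = ring_lattice(20, 4)
shaken = rewire(lattice, 20, 0.1)
print(len(lattice))   # 40 edges: n * k / 2 = 20 * 4 / 2
```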

77

Statistics of SO Networks

Average diameter (d): the average distance between two nodes

Average clique fraction (c):
• Given a vertex v with k(v) neighbors, the maximum number of edges among those neighbors is k(k-1)/2
• Clique fraction (c_v): (edges present) / (maximum possible)
• Average clique fraction: the average of c_v over all nodes
• Measures the degree to which “my friends are friends of each other”
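The clique fraction (the clustering coefficient) defined above can be computed directly. A minimal sketch on an invented undirected graph:

```python
# Clique fraction c_v = (edges among v's neighbors) / (k(k-1)/2),
# averaged over all nodes.
neighbors = {                 # invented undirected graph (adjacency sets)
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def clique_fraction(v):
    nbrs = list(neighbors[v])
    k = len(nbrs)
    if k < 2:
        return 0.0            # fewer than 2 neighbors: no possible edges
    present = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]   # are these two neighbors linked?
    )
    return present / (k * (k - 1) / 2)

avg_c = sum(clique_fraction(v) for v in neighbors) / len(neighbors)
print(round(avg_c, 3))   # (1/3 + 1 + 1 + 0) / 4 = 0.583
```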

78

Statistics (Cont’d)

Statistics of common networks:

             n         k      d      c
Actors       225,226   61     3.65   0.79
Power-grid   4,941     2.67   18.7   0.08
C. elegans   282       14     2.65   0.28

Does large k imply large c?

Does small c imply large d?

79

Other Properties

For the graph to be sparse but connected: n >> k >> log(n) >> 1

As p --> 0 (organized):
• d ≈ n/(2k) >> 1, c ≈ 3/4
• Highly clustered, and d grows linearly with n

As p --> 1 (disorganized):
• d ≈ log(n)/log(k), c ≈ k/n << 1
• Poorly clustered, and d grows only logarithmically with n

80

Effect of ‘Shaking it up’

Small shake (p close to zero)High cliquishness AND short path lengths

Larger shake (p increased further from 0)d drops rapidly (increased small world

phenomena_c remains constant (transition to small world

almost undetectable at local level)

Effect of long-range link:Addition: non-linear decrease of dRemoval: small linear decrease of c

81

LP and The Web

LP has severe limitations:
• There is no concept of short or long links on the Web: a page in the USA and another in Europe can be joined by a single hyperlink
• Edge rewiring doesn’t produce power-law connectivity! The degree distribution remains bounded and strongly concentrated around its mean value

Therefore, we need other models …
