Basic WWW Technologies
2.1 Web Documents
2.2 Resource Identifiers: URI, URL, and URN
2.3 Protocols
2.4 Log Files
2.5 Search Engines
2
What Is the World Wide Web?
The world wide web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience:
1. A uniform naming scheme for locating resources on the web (e.g., URIs).
2. Protocols, for access to named resources over the web (e.g., HTTP).
3. Hypertext, for easy navigation among resources (e.g., HTML).
3
Internet vs. Web
Internet:
• The Internet is the more general term
• Includes the physical aspect of the underlying networks, and mechanisms such as email, FTP, HTTP, …
Web:
• Associated with information stored on the Internet
• Also refers to a broader class of networks, e.g., the Web of English Literature
Both the Internet and the web are networks.
4
Essential Components of WWW
Resources:
• Conceptual mappings to concrete or abstract entities, which do not change in the short term
• e.g., the ICS website (web pages and other kinds of files)
Resource identifiers (hyperlinks):
• Strings of characters representing generalized addresses that may contain instructions for accessing the identified resource
• e.g., http://www.ics.uci.edu is used to identify the ICS homepage
Transfer protocols:
• Conventions that regulate the communication between a browser (web user agent) and a server
5
Standard Generalized Markup Language (SGML)
• Based on GML (generalized markup language), developed by IBM in the 1960s
• An international standard (ISO 8879:1986) defines how descriptive markup should be embedded in a document
• Gave birth to the extensible markup language (XML), W3C recommendation in 1998
6
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters may appear in the application
• DTD/style sheet: defines the syntax of markup constructs
• Document instance: the actual text (with the tags) of the document
More information can be found at: http://www.w3.org/MarkUp/SGML
7
DTD Example One
<!ELEMENT UL - - (LI)+>
• ELEMENT is a keyword that introduces a new element type, here the unordered list (UL)
• The two hyphens indicate that both the start tag <UL> and the end tag </UL> for this element type are required
• The content model (LI)+ states that the element contains one or more list items (LI)
8
DTD Example Two
<!ELEMENT IMG - O EMPTY>
• The element type being declared is IMG
• The hyphen and the following "O" indicate that the end tag can be omitted
• Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted (there is no closing tag)
9
HTML Background
• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.
• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.
• HTML standards are organized by the W3C: http://www.w3.org/MarkUp/
10
HTML Functionalities
HTML gives authors the means to:
• Publish online documents with headings, text, tables, lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other applications directly in their documents
• Link information via hypertext links, at the click of a button
• Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.
11
HTML Versions
• HTML 4.01 is a revision of the HTML 4.0 Recommendation
– HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18 December 1997
• HTML 3.2 was W3C's first Recommendation for HTML, representing the consensus on HTML features in 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which in 1994 set the standard for core HTML features based upon current practice
12
Sample Webpage
13
Sample Webpage HTML Structure
<HTML>
  <HEAD>
    <TITLE>The title of the webpage</TITLE>
  </HEAD>
  <BODY>
    <P>Body of the webpage
  </BODY>
</HTML>
14
HTML Structure
• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)
• The title of the document appears in the head (along with other information about the document)
• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>
15
HTML Hyperlink
<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource to another
• It has two ends, called anchors, and a direction
• It starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)
16
Resource Identifiers
• URI: Uniform Resource Identifiers
• URL: Uniform Resource Locators
• URN: Uniform Resource Names
17
Introduction to URIs
Every resource available on the Web has an address that may be encoded by a URI
URIs typically consist of three pieces:
• The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)
• The name of the machine hosting the resource
• The name of the resource itself, given as a path
18
URI Example
http://www.w3.org/TR
• There is a document available via the HTTP protocol
• Residing on the machine hosting www.w3.org
• Accessible via the path "/TR"
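These three pieces can be pulled apart programmatically. A minimal sketch using Python's standard library, applied to the URI above:

```python
from urllib.parse import urlparse

# Split the example URI into its three pieces.
parts = urlparse("http://www.w3.org/TR")
print(parts.scheme)   # naming scheme: http
print(parts.netloc)   # machine hosting the resource: www.w3.org
print(parts.path)     # name of the resource, given as a path: /TR
```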
19
Protocols
Describe how messages are encoded and exchanged
Different Layering Architectures
• ISO OSI 7-Layer Architecture
• TCP/IP 4-Layer Architecture
20
ISO OSI Layering Architecture
21
ISO’s Design Principles
• A layer should be created where a different level of abstraction is needed
• Each layer should perform a well-defined function
• The layer boundaries should be chosen to minimize information flow across the interfaces
• The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy
22
TCP/IP Layering Architecture
23
TCP/IP Layering Architecture
• A simplified model that provides an end-to-end reliable connection
• The network layer
– Hosts drop packets into this layer; the layer routes them towards the destination
– It only promises "best effort" delivery
• The transport layer
– Reliable byte-oriented stream
24
Hypertext Transfer Protocol (HTTP)
• A connection-oriented protocol (running over TCP) used to carry WWW traffic between a browser and a server
• An application-layer protocol that relies on TCP, one of the transport-layer protocols supported by the Internet
• HTTP communication is established via a TCP connection to server port 80
25
GET Method in HTTP
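The slide's figure is not reproduced here; as an illustration, a minimal HTTP/1.0 GET request is just a few lines of ASCII text sent over the TCP connection to port 80 (the host and path below are placeholder values):

```python
import socket

host, path = "www.example.com", "/"

# Request line, Host header, and a blank line terminate the request.
request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"

def fetch(host, path, port=80):
    """Send the GET request over a TCP connection to server port 80
    and return the raw response (status line, headers, body)."""
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
        chunks = []
        while chunk := s.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

print(request.splitlines()[0])  # GET / HTTP/1.0
```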
26
Domain Name System
DNS (Domain Name System): mapping from domain names to IP addresses
IPv4:
• Initially deployed on January 1, 1983, and still the most commonly used version
• 32-bit address, written as a string of 4 decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255
IPv6:
• Revision of IPv4 with 128-bit addresses
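The mapping can be exercised from any program through the system's resolver; a small sketch:

```python
import socket

# Ask the system's DNS resolver for the IPv4 address of a hostname.
# "localhost" is used here so the lookup works without network access.
ip = socket.gethostbyname("localhost")
octets = ip.split(".")
print(ip)          # a dotted-decimal IPv4 address, e.g. 127.0.0.1
print(len(octets)) # 4
```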
27
Top Level Domains (TLD)
Top-level domain names: .com, .edu, .gov, and ISO 3166 country codes
There are three types of top-level domains:
• Generic domains were created for use by the Internet public
• Country-code domains were created to be used by individual countries
• The .arpa (Address and Routing Parameter Area) domain is designated to be used exclusively for Internet-infrastructure purposes
28
Registrars
• Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another
• InterNIC at http://internic.net
• Registrars Directory: http://www.internic.net/regist.html
29
Server Log Files
Server Transfer Log: transactions between a browser and a server are logged
• The IP address and the time of the request
• The method of the request (GET, HEAD, POST, …)
• The status code, a response from the server
• The size in bytes of the transaction
Referrer Log: where the request originated
Agent Log: the browser software making the request (possibly a spider)
Error Log: requests that resulted in errors (e.g., 404)
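These fields can be recovered from a transfer-log line with a small parser. A sketch assuming the common Apache log layout (the log line itself is a made-up example):

```python
import re

# One line in the Apache "Common Log Format" (an assumed example entry):
# host ident authuser [timestamp] "request" status bytes
line = '127.0.0.1 - - [21/Dec/2015:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)
m = pattern.match(line)
print(m.group("ip"), m.group("method"), m.group("status"), m.group("size"))
```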
30
Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search engines
• Which keywords were searched for
• How many clicks/page views a page received
• Error reports, like broken links
31
Server Log Analysis
32
Search Engines
According to a Pew Internet Project report (2002), search engines are the most popular way to locate information online:
• About 33 million U.S. Internet users query search engines on a typical day
• More than 80% have used search engines
Search engines are measured by coverage and recency.
33
Coverage
Overlap analysis is used for estimating the size of the indexable web:
• W: the set of all webpages
• Wa, Wb: the pages crawled by two independent engines a and b
• P(Wa), P(Wb): the probabilities that a page was crawled by a or b
• P(Wa) = |Wa| / |W|, P(Wb) = |Wb| / |W|
34
Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb) = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  P(Wa ∩ Wb) = P(Wa) · P(Wb)
• P(Wa ∩ Wb | Wb) = P(Wa) · P(Wb) / P(Wb) = P(Wa) = |Wa| / |W|
35
Overlap Analysis
Using |W| = |Wa|/ P(Wa), the researchers found:
• Web had at least 320 million pages in 1997
• 60% of web was covered by six major engines
• Maximum coverage of a single engine was 1/3 of the web
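The estimate behind these numbers follows from the previous slide: approximate P(Wa) by the overlap fraction |Wa ∩ Wb| / |Wb|, then |W| = |Wa| / P(Wa). A sketch with made-up counts (not the 1997 data):

```python
def estimate_web_size(size_a, size_b, overlap):
    """Capture-recapture estimate of |W|:
    P(Wa) ~ overlap / size_b, so |W| = size_a / P(Wa) = size_a * size_b / overlap."""
    p_a = overlap / size_b   # fraction of b's pages also crawled by a
    return size_a / p_a

# Hypothetical: engines a and b crawled 50M and 40M pages, with 10M in common.
print(estimate_web_size(50_000_000, 40_000_000, 10_000_000))  # 200000000.0
```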
36
How to Improve the Coverage?
• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user
• Any suggestions?
37
Web Crawler
• A crawler is a program that picks up a page and follows all the links on that page
• Crawler = spider
• Types of crawler:
– Breadth-first
– Depth-first
38
Breadth First Crawlers
Uses the breadth-first search (BFS) algorithm:
• Get all links from the starting page and add them to a queue
• Pick the first link from the queue, get all links on that page, and add them to the queue
• Repeat the above step until the queue is empty
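The steps above can be sketched directly; get_links stands in for fetching a page and extracting its hyperlinks (here a dict lookup over a made-up link graph):

```python
from collections import deque

def bfs_crawl(start, get_links):
    """Breadth-first crawl: visit pages level by level using a FIFO queue."""
    visited, queue = {start}, deque([start])
    order = []
    while queue:
        page = queue.popleft()        # take the first link from the queue
        order.append(page)
        for link in get_links(page):  # add all unseen links to the queue
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

# A tiny hypothetical link graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_crawl("A", graph.__getitem__))  # ['A', 'B', 'C', 'D']
```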
39
Breadth First Crawlers
40
Depth First Crawlers
Uses the depth-first search (DFS) algorithm:
• Get the first non-visited link from the start page
• Visit the link and get its first non-visited link
• Repeat the above step until there are no non-visited links left
• Go back to the next non-visited link at the previous level and repeat the second step
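A sketch of the same crawl done depth-first, with an explicit stack for backtracking to the previous level (the link graph is again a made-up example):

```python
def dfs_crawl(start, get_links):
    """Depth-first crawl: follow the first non-visited link on each page;
    backtrack to the previous level when a page has none left."""
    visited, order, stack = {start}, [start], [start]
    while stack:
        page = stack[-1]
        # First non-visited link on the current page, if any.
        nxt = next((l for l in get_links(page) if l not in visited), None)
        if nxt is None:
            stack.pop()        # backtrack
        else:
            visited.add(nxt)
            order.append(nxt)
            stack.append(nxt)  # go one level deeper
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(dfs_crawl("A", graph.__getitem__))  # ['A', 'B', 'D', 'C']
```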
41
Depth First Crawlers
WEB GRAPHS
43
Internet/Web as Graphs
• Graph of the physical layer, with routers, computers, etc. as nodes and physical connections as edges
– This view is limited: it does not capture the graphical connections associated with the information on the Internet
• Web graph, where nodes represent web pages and edges are associated with hyperlinks
44
Web Graph
http://www.touchgraph.com/TGGoogleBrowser.html
45
Web Graph Considerations
• Edges can be directed or undirected
• The graph is highly dynamic
– Nodes and edges are added/deleted often
– The content of existing nodes is also subject to change
– Pages and hyperlinks are created on the fly
• Apart from the primary connected component, there are also smaller disconnected components
46
Why the Web Graph?
• Example of a large, dynamic and distributed graph
• Possibly similar to other complex graphs in social, biological and other systems
• Reflects how humans organize information (relevance, ranking) and their societies
• Efficient navigation algorithms
• Study of the behavior of users as they traverse the web graph (e-commerce)
47
Statistics of Interest
Size and connectivity of the graph
Number of connected components
Distribution of pages per site
Distribution of incoming and outgoing connections per site
Average and maximal length of the shortest path between any two vertices (diameter)
48
Properties of Web Graphs
• Connectivity follows a power-law distribution
• The graph is sparse: |E| = O(n), or at least o(n²)
– The average number of hyperlinks per page is roughly a constant
• A small-world graph
49
Power Law Size
• Simple estimates suggest over a billion nodes
• The distribution of site sizes, measured by the number of pages, follows a power-law distribution
• Observed over several orders of magnitude, with an exponent in the 1.6-1.9 range
50
Power Law Connectivity
• The distribution of the number of connections per node follows a power-law distribution
• A study at Notre Dame University reported
– γ = 2.45 for the outdegree distribution
– γ = 2.1 for the indegree distribution
• Random graphs have a Poisson degree distribution, which (if p is large) decays exponentially fast to 0 as k increases towards its maximum value n−1
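The contrast between the two tails can be made concrete numerically. A small sketch using γ = 2.1 (the reported indegree exponent) and an arbitrary illustrative Poisson mean:

```python
import math

gamma, lam = 2.1, 7.0   # gamma: reported indegree exponent; lam: arbitrary Poisson mean

def power_law(k):
    """Unnormalized power-law tail P(k) ~ k**(-gamma)."""
    return k ** (-gamma)

def poisson(k):
    """Poisson probability P(k) = exp(-lam) * lam**k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# At k = 100 the Poisson term has decayed essentially to zero,
# while the power-law term decays only polynomially.
print(power_law(100) > poisson(100))  # True
```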
51
Power Law Distribution -Examples
http://www.pnas.org/cgi/reprint/99/8/5207.pdf
52
Examples of networks with Power Law Distribution
• Internet at the router and interdomain level
• Citation networks
• Collaboration network of actors
• Networks associated with metabolic pathways
• Networks formed by interacting genes and proteins
• Network of nervous-system connections in C. elegans
53
Small World Networks
• It is a ‘small world’: millions of people, yet separated by “six degrees” of acquaintance relationships
– Popularized by Milgram’s famous experiment
• Mathematically: the diameter of the graph is small (log N) compared to its overall size
• The property seems interesting given the ‘sparse’ nature of the graph, but …
• This property is ‘natural’ in ‘pure’ random graphs
54
The small world of WWW
• An empirical study of the Web graph reveals the small-world property
• Average distance (d) in a simulated web: d = 0.35 + 2.06 log(n)
– e.g., n = 10^9 gives d ≈ 19
• The graph was generated using the power-law model
• Diameter properties were inferred from sampling
– Calculating the maximum diameter is computationally demanding for large values of n
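Reading the fit with a base-10 logarithm reproduces the quoted number:

```python
import math

def avg_distance(n):
    """Empirical fit for the average shortest-path distance:
    d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

d = avg_distance(1e9)
print(round(d))  # 19
```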
55
Implications for Web
• Logarithmic scaling of the diameter makes future growth of the web manageable
– A 10-fold increase in web pages results in only about 2 additional ‘clicks’, but …
• Users may not take the shortest path; they may use bookmarks or just get distracted on the way
• Therefore search engines play a crucial role
56
Some theoretical considerations
Classes of small-world networks:
• Scale-free: power-law distribution of connectivity over the entire range
• Broad-scale: power-law over a “broad range” followed by an abrupt cut-off
• Single-scale: connectivity distribution decays exponentially
57
Power Law of PageRank
• Goal: assess the importance of a page relative to a query and rank pages accordingly
• Importance measured by indegree alone is not reliable, since it is entirely local
• PageRank: the proportion of time a random surfer would spend on that page at steady state
• A random first-order Markov surfer travels from one page to another at each time step
58
PageRank contd
The PageRank r(v) of page v is the steady-state distribution obtained by solving the system of linear equations

  r(v) = Σ_{u ∈ pa[v]} r(u) / ch[u]

where pa[v] is the set of parent nodes of v (the pages linking to v) and ch[u] is the outdegree of page u.
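This system can be solved by power iteration: start from a uniform distribution and repeatedly apply the equation. A minimal sketch on a made-up three-page graph (no damping factor, matching the simple equation above):

```python
def pagerank(out_links, iterations=100):
    """Iterate r(v) = sum over u in pa[v] of r(u) / ch[u]
    until the distribution (nearly) reaches steady state."""
    pages = list(out_links)
    r = {v: 1.0 / len(pages) for v in pages}   # uniform start
    for _ in range(iterations):
        new = {v: 0.0 for v in pages}
        for u, links in out_links.items():
            for v in links:
                new[v] += r[u] / len(links)    # u splits its rank among its children
        r = new
    return r

# Hypothetical link graph: A -> B, A -> C, B -> C, C -> A.
r = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({v: round(x, 3) for v, x in r.items()})  # {'A': 0.4, 'B': 0.2, 'C': 0.4}
```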
59
Examples
Log plot of the PageRank distribution of the Brown domain (*.brown.edu)
G. Pandurangan, P. Raghavan, E. Upfal, "Using PageRank to Characterize Web Structure", COCOON 2002
60
Bow-tie Structure of Web
• A large-scale study (AltaVista crawls) reveals interesting properties of the web
– Study of 200 million nodes & 1.5 billion links
• The small-world property does not apply to the entire web
– Some parts are unreachable
– Others have long paths
• Power-law connectivity holds, though
– Page indegree (γ = 2.1), outdegree (γ = 2.72)
61
Bow-tie Components
• Strongly Connected Component (SCC)
– The core, with the small-world property
• Upstream (IN)
– The core can’t reach IN
• Downstream (OUT)
– OUT can’t reach the core
• Disconnected (Tendrils)
62
Component Properties
• Each component is roughly the same size (~50 million nodes)
• Tendrils are not connected to the SCC, but are reachable from IN and can reach OUT
• Tubes: directed paths IN -> Tendrils -> OUT
• With the disconnected components, the maximal and average diameter of the whole graph is infinite
63
Empirical Numbers for Bow-tie
• Maximal minimal diameter (the longest shortest path): 28 for the SCC, 500 for the entire graph
• Probability of a directed path between any 2 nodes: about one quarter (0.24)
• Average path length: 16 (when a directed path exists), 7 (undirected)
• Shortest directed path between 2 nodes in the SCC: 16-20 links on average
64
Models for the Web Graph
• Stochastic models that can explain, or at least partially reproduce, the properties of the web graph
• The model should:
– Follow the power-law distribution properties
– Represent the connectivity of the web
– Maintain the small-world property
65
Web Page Growth
• Empirical studies observe a power-law distribution of site sizes
– “Size” includes the size of the Web, the number of IP addresses, the number of servers, the average size of a page, etc.
• A generative model has been proposed to account for this distribution
66
Component One of the Generative Model
The first component of this model is that “sites have short-term size fluctuations, up or down, that are proportional to the size of the site”
A site with 100,000 pages may gain or lose a few hundred pages in a day, whereas such a change would be rare for a site with only 100 pages
67
Component Two of the Generative Model
There is an overall growth rate, so that the size S(t) satisfies

  S(t+1) = (1 + b·ε_t) S(t)

where
• ε_t is the realization of a ±1 Bernoulli random variable at time t with probability 0.5
• b is the absolute rate of the daily fluctuations
68
Component Two of the Generative Model contd
After T steps

  S(T) = S(0) · Π_{t=0}^{T−1} (1 + b·ε_t)

so that

  log S(T) = log S(0) + Σ_{t=0}^{T−1} log(1 + b·ε_t)
69
Theoretical Considerations
• Assuming the ε_t are independent, by the central limit theorem it is clear that for large values of T, log S(T) is normally distributed
• The central limit theorem states that given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as the sample size N increases
• http://davidmlane.com/hyperstat/A14043.html
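A quick simulation of the multiplicative model illustrates the claim (S(0), b, T and the sample count are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)

def grow(s0=100.0, b=0.05, T=1000):
    """Multiplicative growth S(t+1) = (1 + b*eps_t) * S(t) with eps_t = +/-1."""
    s = s0
    for _ in range(T):
        s *= 1 + b * random.choice((1, -1))
    return s

# log S(T) is a sum of T i.i.d. terms, so by the CLT it is approximately
# normal, which makes S(T) itself approximately log-normal.
logs = [math.log(grow()) for _ in range(200)]
mean = sum(logs) / len(logs)
print(all(math.isfinite(x) for x in logs))  # True
```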
70
Theoretical Considerations contd
• Log S(T) can also be associated with a binomial distribution counting the number of times ε_t = +1
• Hence S(T) has a log-normal distribution
• [Figure: the probability density and cumulative distribution functions of the log-normal distribution]
71
Modified Model
• The model can be modified to obey a power-law distribution
• To do so, it is modified to include:
– A wide distribution of growth rates across different sites, and/or
– The fact that sites have different ages
72
Capturing Power Law Property
• To capture the power-law property, it is sufficient to consider that web sites are being continuously created
• Web sites grow at a constant rate during a growth period, after which their size remains approximately constant
• The periods of growth follow an exponential distribution
• This gives a relation between the rate of the exponential distribution and the growth rate: the ratio is 0.8 when the power-law exponent is 1.08
73
Lattice Perturbation (LP) Models
Some terms:
• “Organized networks” (a.k.a. Mafia)
– Each node has the same degree k, and neighborhoods are entirely local
– Probability of edge (a,b) = 1 if dist(a,b) = 1, and 0 otherwise
• Note: we are talking about graphs that can be mapped to a Cartesian plane
74
Terms (Cont’d)
• Organized networks
– Are ‘cliquish’ (a clique is a fully connected subgraph) in the local neighborhood
– The probability of edges across neighborhoods is almost non-existent (p = 0 for fully organized)
• “Disorganized” networks
– ‘Long-range’ edges exist
– Completely disorganized <=> fully random (Erdős model): p = 1
75
Semi-organized (SO) Networks
• The probability of a long-range edge is between zero and one
• Clustered at the local level (cliquish), but with long-range links as well
• Leads to networks that are locally cliquish and have short path lengths
76
Creating SO Networks
Step 1: take a regular network (e.g., a lattice)
Step 2: shake it up (perturbation)
Step 2 in detail:
• For each vertex, pick a local edge
• ‘Rewire’ the edge into a long-range edge with probability p
• p = 0: organized; p = 1: disorganized
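The two steps amount to the Watts–Strogatz construction; a minimal sketch on a ring lattice (the n, k, p values are illustrative):

```python
import random

random.seed(42)

def ring_lattice(n, k):
    """Regular ring: each node is linked to its k nearest neighbors (k even)."""
    return {(v, (v + j) % n) for v in range(n) for j in range(1, k // 2 + 1)}

def rewire(edges, n, p):
    """With probability p, replace each local edge (u, v) by a long-range
    edge (u, w) to a random node not already linked to u."""
    edges = set(edges)
    for u, v in list(edges):
        if random.random() < p:
            w = random.randrange(n)
            while w == u or (u, w) in edges or (w, u) in edges:
                w = random.randrange(n)
            edges.remove((u, v))
            edges.add((u, w))
    return edges

g = rewire(ring_lattice(20, 4), 20, 0.2)
print(len(g))  # rewiring preserves the edge count: 40
```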
77
Statistics of SO Networks
• Average diameter (d): the average distance between two nodes
• Average clique fraction (c):
– Given a vertex v, let k(v) be the neighbors of v
– Max edges among k(v) = k(k−1)/2
– Clique fraction (c_v) = (edges present) / (max edges)
– Average clique fraction: the average over all nodes
– Measures the degree to which “my friends are friends of each other”
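The clique fraction of a vertex (now usually called the clustering coefficient) can be computed directly; a sketch on a toy undirected graph:

```python
def clique_fraction(adj, v):
    """Fraction of possible edges among v's neighbors that are present."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    present = sum(1 for a in nbrs for b in nbrs
                  if a < b and b in adj[a])    # count each neighbor pair once
    return present / (k * (k - 1) / 2)

# Toy undirected graph: a triangle {0, 1, 2} plus a pendant node 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clique_fraction(adj, 0))  # 1 of 3 neighbor pairs linked: 0.333...
print(clique_fraction(adj, 1))  # neighbors 0 and 2 are linked: 1.0
```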
78
Statistics (Cont’d)
Statistics of common networks:

  Network      n        k      d      c
  Actors       225,226  61     3.65   0.79
  Power grid   4,941    2.67   18.7   0.08
  C. elegans   282      14     2.65   0.28

Large k = large c?
Small c = large d?
79
Other Properties
• For the graph to be sparse but connected: n >> k >> log(n) >> 1
• As p -> 0 (organized):
– d ≈ n/2k >> 1, c ≈ 3/4
– Highly clustered, and d grows linearly with n
• As p -> 1 (disorganized):
– d ≈ log(n)/log(k), c ≈ k/n << 1
– Poorly clustered, and d grows logarithmically with n
80
Effect of ‘Shaking it up’
• Small shake (p close to zero):
– High cliquishness AND short path lengths
• Larger shake (p increased further from 0):
– d drops rapidly (increased small-world phenomenon)
– c remains constant (the transition to a small world is almost undetectable at the local level)
• Effect of a long-range link:
– Addition: non-linear decrease of d
– Removal: small linear decrease of c
81
LP and The Web
• LP has severe limitations
– There is no concept of short or long links on the Web: a page in the USA and another in Europe can be joined by one hyperlink
– Edge rewiring doesn’t produce power-law connectivity! The degree distribution is bounded and strongly concentrated around the mean value
• Therefore, we need other models …