Basic WWW Technologies & Mathematic Background (Chap 2 & 1, Baldi), Wen-Hsiang Lu (盧文祥), Department of Computer Science and Information Engineering, National Cheng Kung University, 2006/10/5

Post on 21-Dec-2015


Basic WWW Technologies & Mathematic Background

(Chap 2 & 1, Baldi)

Wen-Hsiang Lu (盧文祥)

Department of Computer Science and Information Engineering, National Cheng Kung University

2006/10/5

2

World Wide Web

• The World Wide Web (Web) is a network of information resources.

• The Web relies on three mechanisms to make these resources available:

1. A uniform naming scheme for locating resources on the Web (e.g., URIs).

2. Protocols for accessing named resources over the Web (e.g., HTTP).

3. Hypertext for easy navigation among resources (e.g., HTML).

3

Internet vs. Web

• Internet:

– The more general term

– Includes the physical aspects of the underlying networks and mechanisms such as email, FTP, HTTP…

• Web:

– Associated with the information stored on the Internet

– Can also refer to a broader class of networks, e.g., the web of English literature

• Both the Internet and the Web are networks

4

Essential Components of WWW

• Resources (HTML, HyperText Markup Language)

– Conceptual mappings to concrete or abstract entities, which do not change in the short term

– Tagging support for structuring and laying out documents

• Resource identifiers (hyperlinks):

– Strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource

– For example, http://www.google.com/ identifies the Google homepage

• Transfer protocols (HTTP, Hypertext Transfer Protocol)

– Conventions that regulate the communication between a browser (web user agent) and a server

5

Standard Generalized Markup Language (SGML)

• Based on GML (Generalized Markup Language), developed by IBM in the 1960s

• An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document

– Markup: extra information characterizing the structure of a document

• Gave rise to the Extensible Markup Language (XML), a W3C Recommendation since 1998

6

SGML Components

• SGML documents have three parts:

– Declaration: specifies which characters and delimiters may appear in the application

– DTD (Document Type Definition) / style sheet: defines the syntax of markup constructs

– Document instance: the actual text (with the tags) of the document

• More information: http://www.W3.Org/markup/SGML

7

HTML Background

• HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.

• The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.

• HTML standards are maintained by the W3C: http://www.w3.org/MarkUp/

8

HTML Functionalities

• HTML gives authors the means to:

– Publish online documents with headings, text, tables, lists, photos, etc.

– Include spreadsheets, video clips, sound clips, and other applications directly in their documents

– Link information via hypertext links, at the click of a button

– Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.

9

Sample Webpage

10

Sample Webpage: HTML Structure

<HTML>
  <HEAD>
    <TITLE>The title of the webpage</TITLE>
  </HEAD>
  <BODY>
    <P>Body of the webpage
  </BODY>
</HTML>

11

HTML Structure

• An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)

• The title of the document appears in the head (along with other information about the document)

• The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>

12

HTML Hyperlink

• <a href="relations/alumni">alumni</a>

• A link is a connection from one Web resource to another

• It has two ends, called anchors, and a direction

• A link starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document)

13

Resource Identifiers

• Uniform Resource Identifiers (URIs) include two overlapping subsets of identifiers:

– URL: Uniform Resource Locators

– URN: Uniform Resource Names

14

Introduction to URIs

• Every resource available on the Web has an address that may be encoded by a URI

• URIs typically consist of three pieces:

– The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)

– The name of the machine hosting the resource

– The name of the resource itself, given as a path

15

URI Example

• http://www.w3.org/TR

• There is a document available via the HTTP protocol

• Residing on the machines hosting www.w3.org

• Accessible via the path "/TR"
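As a minimal sketch, Python's standard `urllib.parse.urlsplit` can split the example URI into the three pieces described above. The variable names are illustrative only.

```python
from urllib.parse import urlsplit

# Split the example URI into its three pieces.
parts = urlsplit("http://www.w3.org/TR")

scheme = parts.scheme   # naming scheme of the access mechanism
host = parts.netloc     # machine hosting the resource
path = parts.path       # name of the resource, given as a path

print(scheme, host, path)
```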

16

Protocols

• Describe how messages are encoded and exchanged

• Different Layering Architectures

• ISO OSI 7-Layer Architecture

• TCP/IP 4-Layer Architecture

17

ISO OSI Layering Architecture

18

TCP/IP Layering Architecture

19

TCP/IP Layering Architecture

• A simplified model that provides an end-to-end reliable connection

• The network layer

– Hosts drop packets into this layer, which routes them toward the destination

– Promises only best-effort delivery

• The transport layer

– Provides a reliable byte-oriented stream

20

Hypertext Transfer Protocol (HTTP)

• An application-layer protocol that runs over a connection-oriented transport (TCP) and carries WWW traffic between a browser and a server

• One of the application-layer protocols of the Internet protocol suite

• HTTP communication is established via a TCP connection, by default to server port 80

21

GET Method in HTTP
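A rough sketch of the text a client sends for a GET request, assuming HTTP/1.1 and the example URI http://www.w3.org/TR from the earlier slide. The helper name `build_get_request` is hypothetical, and the sketch only builds the message; it does not open a network connection.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # Host header is required by HTTP/1.1
        "Connection: close",      # ask the server to close after the response
        "",                       # empty line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

request = build_get_request("www.w3.org", "/TR")
print(request.decode("ascii"))
```

To actually send the request, these bytes would be written to a TCP connection opened to the server (port 80 by default), and the response read back from the same connection.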

22

Form

23

Form

<HTML>
<Form action="http://140.116.246.174/cgi-bin/meshdb.cgi" method="post">
  [1] Median Eminence (multiple selections allowed):
  1. <input type="checkbox" name="Median Eminence" value="分泌"> Secretion
  2. <input type="checkbox" name="Median Eminence" value="一般"> General
  3. <input type="checkbox" name="Median Eminence" value="王錫崗"> 王錫崗
  4. <input type="checkbox" name="Median Eminence" value="垂體"> Pituitary
  Other: <input type="text" name="Median Eminence">
  <input type="submit" value="Confirm">
</Form>
</HTML>

24

CGI processing

25

CGI (Common Gateway Interface)

[Diagram components: Web Browser, Web Server, CGI, Database. Flow labels: Service Request, Service Processing, Output, Service Response.]
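The request/response flow of a CGI program can be sketched roughly as follows. `REQUEST_METHOD` and `QUERY_STRING` are standard CGI environment variables; the function name `handle_request` and the sample query are assumptions made for illustration. The environment is passed in as a plain dictionary so the sketch runs without a web server.

```python
from html import escape
from urllib.parse import parse_qs

def handle_request(environ, body=""):
    """Return a CGI-style response: header, blank line, then an HTML page."""
    if environ.get("REQUEST_METHOD", "GET") == "POST":
        raw = body                              # POST: form data is in the request body
    else:
        raw = environ.get("QUERY_STRING", "")   # GET: form data is in the URL
    fields = parse_qs(raw)                      # e.g. {"q": ["web"]}
    items = "".join(
        f"<li>{escape(name)} = {escape(', '.join(values))}</li>"
        for name, values in sorted(fields.items())
    )
    page = f"<html><body><ul>{items}</ul></body></html>"
    return "Content-Type: text/html\r\n\r\n" + page

# Hypothetical request, as a web server would describe it to the CGI program.
response = handle_request({"REQUEST_METHOD": "GET", "QUERY_STRING": "q=web&lang=en"})
print(response)
```

In a real deployment the web server sets the environment, passes any POST body on standard input, and relays whatever the program writes to standard output back to the browser.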

26

HTTP Request Processing

27

GNU Wget

28

CGI: Get query search-results from Google using Wget

29

Homework (1)

• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.

• Homework: Develop a meta-search engine which responds to a user query with combined search results from a few search engines.

30

Domain Name System

• DNS (Domain Name System): maps domain names to IP addresses

• IPv4:

– Initially deployed on January 1, 1983, and still the most commonly used version

– 32-bit address, written as four decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255

• IPv6:

– Revision of IPv4 with 128-bit addresses
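The address sizes above can be checked with Python's standard `ipaddress` module. This is only an illustrative sketch; it performs no DNS lookup, and the IPv6 loopback address `::1` is used just to obtain an IPv6 object.

```python
import ipaddress

low = ipaddress.IPv4Address("0.0.0.0")
high = ipaddress.IPv4Address("255.255.255.255")

# A dotted-quad IPv4 address is just a 32-bit unsigned integer.
print(int(low), int(high))

# Address width in bits for each protocol version.
ipv4_bits = high.max_prefixlen
ipv6_bits = ipaddress.IPv6Address("::1").max_prefixlen
print(ipv4_bits, ipv6_bits)
```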

31

Top Level Domains (TLD)

• Top-level domain names include .com, .edu, .gov, and ISO 3166 country codes such as .de, .fr, .it

• There are three types of top-level domains:

– Generic domains were created for use by the Internet public

– Country code domains were created for use by individual countries

– The .arpa domain (Address and Routing Parameter Area) is designated exclusively for Internet-infrastructure purposes

32

Server Log Files

• Server Transfer Log: transactions between a browser and a server are logged

– IP address and time of the request

– Method of the request (GET, HEAD, POST…)

– Status code, a response from the server

– Size in bytes of the transaction

• Referrer Log: where the request originated

• Agent Log: browser software making the request (including spiders)

• Error Log: requests that resulted in errors (e.g., 404)
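As an illustration of the transfer-log fields listed above, the sketch below parses one line in the widely used Common Log Format. The slides do not say which log format the server uses, so the format, the sample line, and the name `parse_log_line` are assumptions.

```python
import re

# host ident authuser [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the logged fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Made-up sample line (192.0.2.1 is a documentation-only address).
sample = '192.0.2.1 - - [05/Oct/2006:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry)
```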

33

Server Log Analysis

• Most and least visited web pages

• Entry and exit pages

• Referrals from other sites or search engines

• Which keywords were searched

• How many clicks/page views a page received

• Error reports, like broken links
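A few of these statistics can be sketched with `collections.Counter`, assuming the log has already been reduced to (path, status) pairs. The request list below is invented for illustration.

```python
from collections import Counter

# Hypothetical (path, status) pairs extracted from a transfer log.
requests = [
    ("/index.html", 200),
    ("/about.html", 200),
    ("/index.html", 200),
    ("/old.html", 404),
    ("/index.html", 200),
]

# Page views: count only successful requests.
page_views = Counter(path for path, status in requests if status == 200)
most_visited = page_views.most_common(1)[0]
least_visited = min(page_views.items(), key=lambda item: item[1])

# Broken links: paths that produced a 404 error.
broken_links = sorted({path for path, status in requests if status == 404})

print(most_visited, least_visited, broken_links)
```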

34

Server Log Analysis

35

Search Engines

• According to Pew Internet & American Life Project Report (2002), search engines are the most popular way to locate information online

• About 33 million U.S. Internet users query search engines on a typical day.

• More than 80% have used search engines

• Search Engines are measured by coverage and recency

36

Web Crawler

• A crawler is a program that picks up a page and follows all the links on that page

• Crawler = Spider

• Types of crawler:

– Breadth First

– Depth First

37

Breadth First Crawlers

• Use breadth-first search (BFS) algorithm

• Get all links from the starting page, and add them to a queue

• Pick the 1st link from the queue, get all links on the page and add to the queue

• Repeat above step till queue is empty
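The steps above can be sketched as follows. To keep the example runnable without network access, the "Web" is a small in-memory link graph (`TOY_WEB`, an assumption), and `get_links` stands in for fetching a page and extracting its links. The `seen` set is an addition not stated on the slide; without it, cyclic links would keep the queue from ever emptying.

```python
from collections import deque

# Hypothetical link graph: page -> links found on that page.
TOY_WEB = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}

def bfs_crawl(start, get_links):
    """Return pages in the order a breadth-first crawler would fetch them."""
    order = []
    seen = {start}              # pages already queued, to avoid revisiting
    queue = deque([start])
    while queue:
        page = queue.popleft()  # pick the first link from the queue
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

bfs_order = bfs_crawl("A", lambda page: TOY_WEB.get(page, []))
print(bfs_order)
```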

38

Breadth First Crawlers

39

Depth First Crawlers

• Use depth first search (DFS) algorithm

• Get the 1st link not visited from the start page

• Visit link and get 1st non-visited link

• Repeat above step till no non-visited links

• Go to next non-visited link in the previous level and repeat 2nd step
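A comparable sketch of the depth-first strategy, again over a hypothetical in-memory link graph (`TOY_WEB`) rather than real pages. Recursion keeps the example short; a crawler on the real Web would more likely use an explicit stack and a depth limit.

```python
# Hypothetical link graph: page -> links found on that page.
TOY_WEB = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}

def dfs_crawl(start, get_links):
    """Return pages in the order a depth-first crawler would fetch them."""
    order = []
    seen = set()

    def visit(page):
        seen.add(page)
        order.append(page)
        for link in get_links(page):
            if link not in seen:   # follow the first non-visited link
                visit(link)        # go deeper before trying the next link

    visit(start)
    return order

dfs_order = dfs_crawl("A", lambda page: TOY_WEB.get(page, []))
print(dfs_order)
```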

40

Depth First Crawlers

41

Coverage

• Overlap analysis used for estimating the size of the indexable web

• $W$: set of webpages

• $W_a$, $W_b$: pages crawled by two independent engines a and b

• $P(W_a)$, $P(W_b)$: probabilities that a page was crawled by a or b

– $P(W_a) = |W_a| / |W|$

– $P(W_b) = |W_b| / |W|$

42

Overlap Analysis

• $P(W_a \cap W_b \mid W_b) = P(W_a \cap W_b) / P(W_b) = |W_a \cap W_b| / |W_b|$

• If a and b are independent:

– $P(W_a \cap W_b) = P(W_a)\,P(W_b)$

– $P(W_a \cap W_b \mid W_b) = P(W_a)\,P(W_b) / P(W_b) = (|W_a| / |W|)\,(|W_b| / |W|) / (|W_b| / |W|) = |W_a| / |W| = P(W_a)$

43

Overlap Analysis

• Using $|W| = |W_a| / P(W_a)$, with $P(W_a)$ estimated by the overlap fraction $|W_a \cap W_b| / |W_b|$, the researchers found:

– The Web had at least 320 million pages in 1997

– 60% of the Web was covered by six major engines

– The maximum coverage of a single engine was 1/3 of the Web
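The estimate can be sketched numerically. The function name `estimate_web_size` and the counts are made up for illustration; they are not the figures used by the researchers.

```python
def estimate_web_size(size_a, size_b, overlap):
    """Estimate |W| from two independent crawls and the size of their overlap."""
    if overlap <= 0:
        raise ValueError("the two crawls must share at least one page")
    coverage_a = overlap / size_b   # P(Wa) estimated by |Wa ∩ Wb| / |Wb|
    return size_a / coverage_a      # |W| = |Wa| / P(Wa)

# Hypothetical counts: engine a indexed 100 pages, engine b indexed 50,
# and 25 pages appear in both indexes.
estimate = estimate_web_size(size_a=100, size_b=50, overlap=25)
print(estimate)
```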

44

How to Improve the Coverage?

• Meta-search engine: dispatches the user query to several engines at the same time, then collects and merges the results into one list for the user.

• Any suggestions?

• Homework: Develop a meta-search engine which responds to a user query with combined search results from a few search engines.
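One possible way to merge the collected result lists is a round-robin interleave that drops duplicate URLs, sketched below. This is only one merging strategy among many, not a prescribed solution; dispatching the query to real engines is not shown, and the function name and result URLs are invented.

```python
def merge_results(result_lists):
    """Interleave ranked result lists round-robin, keeping each URL once."""
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

# Hypothetical ranked results returned by two engines for the same query.
engine_a = ["http://a.example/1", "http://shared.example/", "http://a.example/2"]
engine_b = ["http://shared.example/", "http://b.example/1"]

merged = merge_results([engine_a, engine_b])
print(merged)
```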

45

Probability

• Model uncertainty: make inferences about events given observed data

• An event e: proposition or statement about the world at large

– “the number of Web pages in existence on 1 January 2003 was greater than five billion”

• A probability P(e): can be viewed as a number that reflects our uncertainty about whether e is true or false in the real world, given whatever information we have available.

46

Learning from a Bayesian Perspective

• A conditional probability P(e | D): represents the degree of belief in e (Bayesian interpretation of probability), where D is the background information (data) on which our belief is based.

• Bayesian approach: probability as being a dynamic entity updated when more data arrive

– Prior probability: P(e) is your belief in the event e before you see any data

– Posterior probability: P(e | D) reflects your updated belief in event e given the observed data D

– Likelihood: P(D | e) is the probability of the data under the assumption that e is true

• How to model P(D | e)?

$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$
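A small numeric sketch of the update from prior to posterior. It assumes P(D) is expanded over the two cases "e is true" and "e is false" (the law of total probability); the function name and the probabilities are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Compute P(e | D) from P(e), P(D | e) and P(D | not e)."""
    # P(D) = P(D | e) P(e) + P(D | not e) P(not e)
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: a 50/50 prior, and data that is four times as
# likely when e is true as when it is false.
p = posterior(prior=0.5, likelihood=0.8, likelihood_if_false=0.2)
print(p)
```

Feeding the resulting posterior back in as the prior for a second data set mirrors the view of probability as a quantity updated when more data arrive.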

47

Standard Probabilistic Distribution

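• Discrete distributions

– Binomial: $P(X = k \mid n, p) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$

– Multinomial: $P(X_1 = k_1, \ldots, X_m = k_m) = \frac{n!}{k_1! \cdots k_m!}\, p_1^{k_1} \cdots p_m^{k_m}$

– Geometric: $P(X = k) = (1-p)^{k}\, p$

– Poisson: $P(X = k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$

• Continuous distributions

– Normal: $N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2 / (2\sigma^2)}$

– Exponential: $f(x \mid \lambda) = \lambda e^{-\lambda x}$

– Gamma: $f(x \mid \alpha, \lambda) = \frac{\lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$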
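As a quick illustration, two of the discrete distributions (binomial and Poisson) can be evaluated with the standard `math` module. The function names are chosen here for illustration, and the formulas are the textbook definitions.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k | n, p): probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k | lambda): probability of k events at average rate lambda."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# The binomial probabilities over k = 0..n should sum to 1.
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(total, binomial_pmf(1, 2, 0.5), poisson_pmf(0, 1.0))
```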

48

Learning from a Bayesian Perspective (cont.)

$$P(e \mid D) = \frac{P(D \mid e)\,P(e)}{P(D)}$$

• Take logarithms for easier operations

$$\log P(e \mid D) = \log P(D \mid e) + \log P(e) - \log P(D)$$

• Obtain more data $D_2$ (second data set)

$$P(e \mid D, D_2) = \frac{P(D_2 \mid e, D)\,P(e \mid D)}{P(D_2 \mid D)}$$

49

Parameter Estimation from Data

• Maximum a posteriori (MAP)

– The objective of parameter estimation is to find or approximate the best set of parameters for a model, i.e., to find the set of parameters $\theta$ maximizing the posterior $P(\theta \mid D)$, or $\log P(\theta \mid D)$. This is called maximum a posteriori (MAP) estimation.

– To deal with positive quantities, we can minimize $-\log P(\theta \mid D)$:

$$\mathcal{E}(\theta) = -\log P(\theta \mid D) = -\log P(D \mid \theta) - \log P(\theta) + \log P(D)$$

– $P(D)$ plays the role of a normalizing constant and is thus irrelevant for the optimization, i.e., the minimization of

$$\mathcal{E}(\theta) = -\log P(D \mid \theta) - \log P(\theta)$$

– If the prior $P(\theta)$ is uniform over the sample space, then the problem reduces to finding the maximum of $P(D \mid \theta)$, or $\log P(D \mid \theta)$. This is known as maximum likelihood (ML) estimation.

– Simpler ML estimation procedure, i.e., the minimization of

$$\mathcal{E}(\theta) = -\log P(D \mid \theta)$$
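The difference between ML and MAP estimation can be sketched on a coin-flip model. Everything in this example is an assumption made for illustration: the parameter (written theta) is the probability of heads, the data are 7 heads and 3 tails, the prior is an unnormalized Beta(2, 2), and the minimum is found by a simple grid search rather than analytically.

```python
import math

def neg_log_likelihood(theta, heads, tails):
    """-log P(D | theta) for independent coin flips."""
    return -(heads * math.log(theta) + tails * math.log(1 - theta))

def neg_log_prior(theta, a=2.0, b=2.0):
    """-log P(theta) for an unnormalized Beta(a, b) prior (assumed here)."""
    return -((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

heads, tails = 7, 3
grid = [i / 1000 for i in range(1, 1000)]   # candidate values of theta

# ML: minimize -log P(D | theta).
theta_ml = min(grid, key=lambda t: neg_log_likelihood(t, heads, tails))

# MAP: minimize -log P(D | theta) - log P(theta); log P(D) is a constant.
theta_map = min(
    grid,
    key=lambda t: neg_log_likelihood(t, heads, tails) + neg_log_prior(t),
)

print(theta_ml, theta_map)
```

With this prior the MAP estimate is pulled from 0.7 toward 0.5, landing near 2/3; with a uniform prior the two estimates would coincide.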

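WMMKS Lab

Basic Formula

$$P(x) = \sum_{h} P(x, h)$$

$$P(x \mid y) = \sum_{h} P(x, h \mid y)$$

$$P(x \mid y) = \sum_{h} P(h \mid y)\,P(x \mid y, h)$$

$$P(x, h \mid y) = P(h \mid y)\,P(x \mid y, h)$$

$$P(x \mid y) = \sum_{h} P(h \mid y)\,P(x \mid h)$$

The last form holds when x is conditionally independent of y given h.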