the world wide web - university of california,...

21
1 1 The World Wide Web EE 122: Intro to Communication Networks Fall 2006 (MW 4-5:30 in Donner 155) Vern Paxson TAs: Dilip Antony Joseph and Sukun Kim http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica, and colleagues at Princeton and UC Berkeley 2 Announcements • Project #2 out Checkpoint due Weds Oct 18 Full project due Thurs Oct 26

Upload: duongtram

Post on 23-May-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

1

1

The World Wide WebEE 122: Intro to Communication Networks

Fall 2006 (MW 4-5:30 in Donner 155)

Vern PaxsonTAs: Dilip Antony Joseph and Sukun Kim

http://inst.eecs.berkeley.edu/~ee122/

Materials with thanks to Jennifer Rexford, Ion Stoica,and colleagues at Princeton and UC Berkeley

2

Announcements

• Project #2 out–Checkpoint due Weds Oct 18– Full project due Thurs Oct 26

Page 2: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

2

3

Goals of Today’s Lecture• (Finish Email - retrieving mail from the server)

• Main ingredients of the Web–URIs, HTML, HTTP

• Key properties of HTTP–Request-response, stateless, and resource meta-data

• Performance of HTTP–Parallel connections, persistent connections, pipelining

• Web components–Clients, proxies, and servers–Caching vs. replication

4

Retrieving E-Mail From the Server• Server stores incoming e-mail by mailbox–Based on the “From” field in the message

• Users need to retrieve e-mail

• Variety of ways to do this:–Directly by same-machine access to the mailbox–Via Interactive Mail Access Protocol (IMAP)

Supports concurrent access by multiple clients, server-sidesearchers, partial MIME fetches, multiple mailboxes

–Via HTTP (Web) E.g., GMail

–Via Post Office Protocol (POP)

Page 3: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

3

5

POP3 Protocol

Authorization phase• Client commands:

– user: declare username– pass: password

• Server responses– +OK– -ERR

Transaction phase, client:• list: list message numbers

• retr: retrieve message bynumber

• dele: delete• quit

S: +OK POP3 server ready C: user bob S: +OK C: pass hungry S: +OK user successfully logged on

C: list S: 1 498 S: 2 912 S: . C: retr 1 S: <message 1 contents> S: . C: dele 1 C: retr 2 S: <message 1 contents> S: . C: dele 2 C: quit S: +OK POP3 server signing off

6

The World Wide Web

Page 4: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

4

7

Main Components: URIs• Uniform Resource Identifier (URI)–Denotes a resource–Could be its name–Could be its location–Which is better?

• Uniform Resource Name (URN)–Names a resource independent of how to get it–E.g., urn:ietf:rfc:2396 is a standard URN for RFC 2396

• Uniform Resource Locator (URL)–Specifies how to access a resource–E.g., ftp://ftp.rfc-editor.org/in-notes/rfc2396.txt

8

URL Syntax

protocol://hostname[:port]/directorypath/resource• Protocol might be http, ftp, https, smtp, rtsp, …• In practice, hostname can instead be an IP address– What does your browser (maybe) show for http://2850372702/ ?

• Port defaults to the standard port associated w/ protocol– E.g., 80/tcp for http, 443/tcp for https

• Directory path is hierarchical, often reflecting file system

• Can extend resource to program executions as well…– http://us.f413.mail.yahoo.com/ym/ShowLetter?box=%40B%40Bulk&

MsgId=2604_1744106_29699_1123_1261_0_28917_3552_1289957100&Search=&Nhead=f&YY=31454&order=down&sort=date&pos=0&view=a&head=b

Page 5: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

5

9

Main Components: HTML• HyperText Markup Language (HTML)–Representation of hypertext documents in ASCII format– Format text, reference images, embed hyperlinks (HREF)– Interpreted by Web browsers when rendering a page

• Straight-forward to learn–Can basically start with a plain text file

Easy to add formatting, references, bullets, etc.–Automatically generated by authoring programs

Tools to aid users in creating HTML files–Your browser likely can show a page’s raw HTML

• Web page–Base HTML file referenced objects (e.g., images)–Each object has its own URL

10

Main Components: HTTP• HyperText Transfer Protocol (HTTP)–Client-server protocol for transferring resources

• Important properties of HTTP–Request-response protocol–Reliance on a global URI namespace–Resource metadata–Stateless–ASCII format % telnet www.icir.org 80

GET /vern/ HTTP/1.0<blank line, i.e., CRLF>

Page 6: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

6

11

HTTP Request Message• Request message sent by a client–Request line: method, resource, and protocol version–Request headers: provide information or modify request–Body: optional data (e.g., to “POST” data to the server)

GET /somedir/page.html HTTP/1.1Host: www.someschool.edu User-agent: Mozilla/4.0Connection: close Accept-language:fr blank line

request line(GET, POST,

HEAD commands)

header lines

Carriage return, line feed

indicates end of message

Not optional

12

HTTP Response Message• Response message sent by a server–Status line: protocol version, status code, status phrase–Response headers: provide information–Body: optional data

HTTP/1.1 200 OK Connection closeDate: Thu, 06 Aug 2006 12:00:15 GMT Server: Apache/1.3.0 (Unix) Last-Modified: Mon, 22 Jun 2006 …... Content-Length: 6821 Content-Type: text/htmlblank line data data data data data ...

status line(protocol

status codestatus phrase)

header lines

data, e.g., requestedHTML file

Page 7: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

7

13

Request Methods and Response Codes

• Request methods include–GET: return current value of resource, run program, …–HEAD: return the meta-data associated with a resource–POST: update a resource, provide input to a program, …

• Response code classes– 1xx: informational (e.g., “100 Continue”)– 2xx: success (e.g., “200 OK”)– 3xx: redirection (e.g., “304 Not Modified”)– 4xx: client error (e.g., “404 Not Found”)– 5xx: server error (e.g., “503 Service Unavailable”)– (Similar to other ASCII app. protocols like SMTP, FTP)

14

HTTP Resource Meta-Data• Meta-data– Information relating to a resource–… but not part of the resource itself

• Examples of meta-data–Size of a resource– Type of the content– Last modification time

• Typing of content borrowed from email–Multipurpose Internet Mail Extensions (MIME)–Data format classification (e.g., Content-Type: text/html)–Enables browsers to automatically launch a viewer

Page 8: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

8

15

Example: Conditional GET Request• Fetch resource only if it has changed at the server

• Server avoids wasting resources to send again–Server inspects the “last modified” time of the resource–… and compares to the “if-modified-since” time–Returns “304 Not Modified” if resource has not changed–…. or a “200 OK” with the latest version otherwise

GET /~ee122/fa06/ HTTP/1.1Host: inst.eecs.berkeley.eduUser-Agent: Mozilla/4.03If-Modified-Since: Sun, 27 Aug 2006 22:25:50 GMT<CRLF>

16

Stateless Operation• Stateless protocol–Each request-response exchange treated independently–Clients and servers not required to retain state

• Statelessness improves scalability–Avoid need for server to retain info across requests–Enable server to handle a higher rate of requests

• However, some applications need persistent state– To uniquely identify the user or store temporary info–E.g., personalize a Web page, compute profiles or

access statistics by user, track a shopping cart ….–Done using “cookies”

Page 9: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

9

17

Cookies• Cookie–Small(?) state stored by client on behalf of server–Included in future requests to the server

Request

ResponseSet-Cookie: XYZ

RequestCookie: XYZ

18

Web Components• Clients–Send requests and receive responses–Browsers, spiders, and agents

• Servers–Receive requests and send responses–Store or generate the responses

• Proxies–Act as a server for the client, and a client to the server–Perform extra functions such as caching, anonymization,

logging, transcoding, filtering access.–Can be explicit or transparent (“interception”)

Page 10: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

10

19

Web Browser• Generating HTTP requests:–User types URL, clicks a hyperlink, or selects bookmark–User clicks “reload”, or “submit” on a Web page–Automatic downloading of embedded images

• Fetches content–Via one or multiple HTTP connections

• Layout of response–Parsing HTML and rendering the Web page– Invoking helper applications (e.g., Acrobat, RealPlayer)

• Maintains cache–Storing recently-viewed objects–Checking that cached objects are fresh

20

Web Server• Web site vs. Web server–Web site: collections of Web pages associated with a

particular host name–Web server: program that satisfies client requests for

Web resources

• Handling a client request:–Accept a TCP connection–Read and parse the HTTP request message– Translate the URL to a filename–Determine whether the request is authorized–Generate and transmit the response

Page 11: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

11

21

Web Server: Generating a Response

• Returning a file–URL corresponds to a file (e.g., /www/index.html)–… the server returns the file as the response–… along with the HTTP response header

• Returning meta-data with no body–Example: client requests object “If-modified-since”–Server checks if the object has been modified–… and simply returns “HTTP/1.1 304 Not Modified”

• Dynamically-generated responses–URL corresponds to a program the server needs to run–Server runs the program and sends the output to client

22

5 Minute Break

Questions Before We Proceed?

Page 12: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

12

23

HTTP Performance

• Most Web pages have multiple objects (“items”)–E.g., HTML file and a bunch of embedded images

• Obvious way to fetch these: one item at a time

• What is that directly analogous to?

24

Fetch HTTP Items: Stop & Wait

Client Server

Request item 1

Transfer item 1

Request item 2

Transfer item 2

Request item 3

Transfer item 3Finish; displaypage

Start fetchingpage Tim

e

Page 13: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

13

25

HTTP Performance

• Most Web pages have multiple objects (“items”)–E.g., HTML file and a bunch of embedded images

• Obvious way to fetch these: one item at a time

• What is that directly analogous to?

• What are the performance implications?

• Where does the analogy possibly break down?

26

Pipelined Requests/Responses

• Batch requests and responsesto reduce the number of packets

• Multiple requests can becontained in one TCP segment

• Small items (common) can alsoshare segments

• Note: maintains order ofresponses– Item 1 always arrives before item 2

• HTTP 1.1 feature (not in 1.0)

Client Server

Request 1Request 2Request 3

Transfer 1

Transfer 2

Transfer 3

Page 14: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

14

27

Concurrent Requests/Responses

• Use multiple connections toissue requests and responses inparallel

• Does not necessarily maintainorder of responses

• Raises question of fairness– Set of N parallel connections

“grabs” bandwidth N times moreaggressively than just one

–What’s a reasonable/fair limit astraffic competes with that of otherusers?

Client Server

Request 1Request 2Request 3

Transfer 1

Transfer 2

Transfer 3

28

Persistent Connections

• Handle multiple transfers per connection– Including transfers subsequent to imaging current page–Maintain TCP connection across multiple requests–Either client or server can tear down the connection

• Performance advantages–Avoid overhead of connection set-up and tear-down–Allow TCP to learn more accurate RTT estimate–Allow the TCP congestion window to increase

I.e, leverage previously discovered bandwidth

Page 15: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

15

29

Performance - Caching• Exploit locality of reference

• Modifier to GET requests:– If-modified-since – returns “not modified” if resource not

modified since specified time

• Response header:– Expires – specifies for how long it’s safe to cache the

resource–Or: No-cache – ignore all caches; always get resource

directly from server

• Features best for use by HTTP proxies– Locality of reference increases if many clients share a

proxy

30

HTTP Without Caching

• Many clients transfer same information–Generates redundant server and network load–Clients experience unnecessary latency

Server

Clients

Backbone ISP

ISP-1 ISP-2

Page 16: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

16

31

Reverse Proxies• Cache documents close to server decrease server load• Typically done by content providers• Only works for

static content

Clients

Backbone ISP

ISP-1 ISP-2

Server

Reverse proxies

32

Forward Proxies• Cache documents close to clients reduce network traffic

and decrease latency

• Typically done by ISPs or corporate LANs

Clients

Backbone ISP

ISP-1 ISP-2

Server

Reverse proxies

Forward proxies

Page 17: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

17

33

Content Distribution Networks (CDNs)• Integrate forward and reverse caching functionality

into one overlay network (usually) administrated byone entity–Example: Akamai

• Documents are cached both …–… as a result of clients’ requests (pull)–… due to being pushed in expectation of high access

rate

• Besides caching, do processing, e.g.,–Handle dynamic web pages– Transcoding

34

CDNs (cont’d)

Clients

ISP-1

Server

Forward proxies

Backbone ISP

ISP-2

CDN

Page 18: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

18

35

Example: Akamai

• Akamai creates new domain names for each clientcontent provider.– e.g., i.a.cnn.net = CNAME custom.i.cnn.net.edgesuite.net

custom.i.cnn.net.edgesuite.net = CNAME a1921.aol.akamai.net

• Customer content provider modifies content sothat embedded URLs reference new domains:–CNN page’s HREF’s refer to i.a.cnn.net

• Customer pushes content out to Akamai as itchanges–Or: Akamai pulls it on demand w/ usual caching mech.–Both CNN & Akamai have control over load distribution

36

Example: Akamai CDN

GET http://cnn.com

a

DNS server forcnn.com

b

c

localDNS server

cnn.com“Akamaizes” its content.

“Akamaized” response object has inlineURLs for secondary content at (afterresolving CNAMEs) a1921.aol.akamai.netand other Akamai-managed DNS names.

akamai.netDNS servers

lookupa1921.aol.akamai.net

Akamai servers store/cachesecondary content for“Akamaized” services.

Page 19: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

19

37

Hosting: Multiple Sites Per Machine• Multiple Web sites on a single machine–Hosting company runs the Web server on behalf of

multiple sites (e.g., www.foo.com and www.bar.com)

• Problem: “GET index.html”– Is it for www.foo.com/index.html or www.bar.com/index.html?

• Solution #1: multiple servers on the same machine–Run multiple Web servers on the machine–Have a separate IP address (or port) for each server

• Solution #2: include site name in the HTTP request–Run a single Web server with a single IP address–… and include “Host” header (e.g., “Host: www.foo.com”)–Required header with HTTP 1.1

38

Hosting: Multiple Machines Per Site• Replicating a popular Web site–Running on multiple machines to handle the load–… and to place content closer to the clients–… particularly if content isn’t cacheable by proxies/CDNs

• Problem: directing client to a particular replica– To balance load across the server replicas– To pair clients with nearby servers

• Solution #1: manual selection by clients–Each replica has its own site name–A Web page lists the replicas (e.g., by name, location)–… and asks clients to click on a hyperlink to pick

Page 20: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

20

39

Hosting: Multiple Machines Per Site• Solution #2: single IP address, multiple machines–Run multiple machines behind a single IP address

–Ensure all packets from a singleTCP connection go to the same replica

Load Balancer 64.236.16.20

40

Hosting: Multiple Machines Per Site• Solution #3: multiple addresses, multiple machines–Same name but different addresses for all of the replicas–Configure DNS server to return different addresses

Internet 64.236.16.20

173.72.54.131

12.1.1.1

Page 21: The World Wide Web - University of California, Berkeleyinst.eecs.berkeley.edu/~ee122/fa06/notes/09-Web-2up.pdfThe World Wide Web. 4 7 Main Components: URIs •Uniform Resource Identifier

21

41

Caching vs. Replication• Motivations for moving content close to users–Reduce latency for the user–Reduce load on the network and the server

• Caching–Replicating the content “on demand” after a request–Storing the response message locally for future use–May need to verify if the response has changed–… and some responses are not cacheable

• Replication–Planned replication of the content in multiple locations–Updating of resources handled outside of HTTP–Can replicate scripts that create dynamic responses

42

Conclusions• Key ideas underlying the Web–Uniform Resource Identifier (URI), Locator (URL)–HyperText Markup Language (HTML)–HyperText Transfer Protocol (HTTP)–Browser helper applications based on content type

• Performance implications–Concurrent connections, pipelining, persistent conns.

• Main Web components–Clients, servers, proxies, CDNs

• Next lecture: drilling down to the link layer–P&D 2.1-2.4, 2.6, 2.8