the world wide web - university of california,...
TRANSCRIPT
1
1
The World Wide WebEE 122: Intro to Communication Networks
Fall 2006 (MW 4-5:30 in Donner 155)
Vern PaxsonTAs: Dilip Antony Joseph and Sukun Kim
http://inst.eecs.berkeley.edu/~ee122/
Materials with thanks to Jennifer Rexford, Ion Stoica,and colleagues at Princeton and UC Berkeley
2
Announcements
• Project #2 out–Checkpoint due Weds Oct 18– Full project due Thurs Oct 26
2
3
Goals of Today’s Lecture• (Finish Email - retrieving mail from the server)
• Main ingredients of the Web–URIs, HTML, HTTP
• Key properties of HTTP–Request-response, stateless, and resource meta-data
• Performance of HTTP–Parallel connections, persistent connections, pipelining
• Web components–Clients, proxies, and servers–Caching vs. replication
4
Retrieving E-Mail From the Server• Server stores incoming e-mail by mailbox–Based on the “From” field in the message
• Users need to retrieve e-mail
• Variety of ways to do this:–Directly by same-machine access to the mailbox–Via Interactive Mail Access Protocol (IMAP)
Supports concurrent access by multiple clients, server-sidesearchers, partial MIME fetches, multiple mailboxes
–Via HTTP (Web) E.g., GMail
–Via Post Office Protocol (POP)
3
5
POP3 Protocol
Authorization phase• Client commands:
– user: declare username– pass: password
• Server responses– +OK– -ERR
Transaction phase, client:• list: list message numbers
• retr: retrieve message bynumber
• dele: delete• quit
S: +OK POP3 server ready C: user bob S: +OK C: pass hungry S: +OK user successfully logged on
C: list S: 1 498 S: 2 912 S: . C: retr 1 S: <message 1 contents> S: . C: dele 1 C: retr 2 S: <message 1 contents> S: . C: dele 2 C: quit S: +OK POP3 server signing off
6
The World Wide Web
4
7
Main Components: URIs• Uniform Resource Identifier (URI)–Denotes a resource–Could be its name–Could be its location–Which is better?
• Uniform Resource Name (URN)–Names a resource independent of how to get it–E.g., urn:ietf:rfc:2396 is a standard URN for RFC 2396
• Uniform Resource Locator (URL)–Specifies how to access a resource–E.g., ftp://ftp.rfc-editor.org/in-notes/rfc2396.txt
8
URL Syntax
protocol://hostname[:port]/directorypath/resource• Protocol might be http, ftp, https, smtp, rtsp, …• In practice, hostname can instead be an IP address– What does your browser (maybe) show for http://2850372702/ ?
• Port defaults to the standard port associated w/ protocol– E.g., 80/tcp for http, 443/tcp for https
• Directory path is hierarchical, often reflecting file system
• Can extend resource to program executions as well…– http://us.f413.mail.yahoo.com/ym/ShowLetter?box=%40B%40Bulk&
MsgId=2604_1744106_29699_1123_1261_0_28917_3552_1289957100&Search=&Nhead=f&YY=31454&order=down&sort=date&pos=0&view=a&head=b
5
9
Main Components: HTML• HyperText Markup Language (HTML)–Representation of hypertext documents in ASCII format– Format text, reference images, embed hyperlinks (HREF)– Interpreted by Web browsers when rendering a page
• Straight-forward to learn–Can basically start with a plain text file
Easy to add formatting, references, bullets, etc.–Automatically generated by authoring programs
Tools to aid users in creating HTML files–Your browser likely can show a page’s raw HTML
• Web page–Base HTML file referenced objects (e.g., images)–Each object has its own URL
10
Main Components: HTTP• HyperText Transfer Protocol (HTTP)–Client-server protocol for transferring resources
• Important properties of HTTP–Request-response protocol–Reliance on a global URI namespace–Resource metadata–Stateless–ASCII format % telnet www.icir.org 80
GET /vern/ HTTP/1.0<blank line, i.e., CRLF>
6
11
HTTP Request Message• Request message sent by a client–Request line: method, resource, and protocol version–Request headers: provide information or modify request–Body: optional data (e.g., to “POST” data to the server)
GET /somedir/page.html HTTP/1.1Host: www.someschool.edu User-agent: Mozilla/4.0Connection: close Accept-language:fr blank line
request line(GET, POST,
HEAD commands)
header lines
Carriage return, line feed
indicates end of message
Not optional
12
HTTP Response Message• Response message sent by a server–Status line: protocol version, status code, status phrase–Response headers: provide information–Body: optional data
HTTP/1.1 200 OK Connection closeDate: Thu, 06 Aug 2006 12:00:15 GMT Server: Apache/1.3.0 (Unix) Last-Modified: Mon, 22 Jun 2006 …... Content-Length: 6821 Content-Type: text/htmlblank line data data data data data ...
status line(protocol
status codestatus phrase)
header lines
data, e.g., requestedHTML file
7
13
Request Methods and Response Codes
• Request methods include–GET: return current value of resource, run program, …–HEAD: return the meta-data associated with a resource–POST: update a resource, provide input to a program, …
• Response code classes– 1xx: informational (e.g., “100 Continue”)– 2xx: success (e.g., “200 OK”)– 3xx: redirection (e.g., “304 Not Modified”)– 4xx: client error (e.g., “404 Not Found”)– 5xx: server error (e.g., “503 Service Unavailable”)– (Similar to other ASCII app. protocols like SMTP, FTP)
14
HTTP Resource Meta-Data• Meta-data– Information relating to a resource–… but not part of the resource itself
• Examples of meta-data–Size of a resource– Type of the content– Last modification time
• Typing of content borrowed from email–Multipurpose Internet Mail Extensions (MIME)–Data format classification (e.g., Content-Type: text/html)–Enables browsers to automatically launch a viewer
8
15
Example: Conditional GET Request• Fetch resource only if it has changed at the server
• Server avoids wasting resources to send again–Server inspects the “last modified” time of the resource–… and compares to the “if-modified-since” time–Returns “304 Not Modified” if resource has not changed–…. or a “200 OK” with the latest version otherwise
GET /~ee122/fa06/ HTTP/1.1Host: inst.eecs.berkeley.eduUser-Agent: Mozilla/4.03If-Modified-Since: Sun, 27 Aug 2006 22:25:50 GMT<CRLF>
16
Stateless Operation• Stateless protocol–Each request-response exchange treated independently–Clients and servers not required to retain state
• Statelessness improves scalability–Avoid need for server to retain info across requests–Enable server to handle a higher rate of requests
• However, some applications need persistent state– To uniquely identify the user or store temporary info–E.g., personalize a Web page, compute profiles or
access statistics by user, track a shopping cart ….–Done using “cookies”
9
17
Cookies• Cookie–Small(?) state stored by client on behalf of server–Included in future requests to the server
Request
ResponseSet-Cookie: XYZ
RequestCookie: XYZ
18
Web Components• Clients–Send requests and receive responses–Browsers, spiders, and agents
• Servers–Receive requests and send responses–Store or generate the responses
• Proxies–Act as a server for the client, and a client to the server–Perform extra functions such as caching, anonymization,
logging, transcoding, filtering access.–Can be explicit or transparent (“interception”)
10
19
Web Browser• Generating HTTP requests:–User types URL, clicks a hyperlink, or selects bookmark–User clicks “reload”, or “submit” on a Web page–Automatic downloading of embedded images
• Fetches content–Via one or multiple HTTP connections
• Layout of response–Parsing HTML and rendering the Web page– Invoking helper applications (e.g., Acrobat, RealPlayer)
• Maintains cache–Storing recently-viewed objects–Checking that cached objects are fresh
20
Web Server• Web site vs. Web server–Web site: collections of Web pages associated with a
particular host name–Web server: program that satisfies client requests for
Web resources
• Handling a client request:–Accept a TCP connection–Read and parse the HTTP request message– Translate the URL to a filename–Determine whether the request is authorized–Generate and transmit the response
11
21
Web Server: Generating a Response
• Returning a file–URL corresponds to a file (e.g., /www/index.html)–… the server returns the file as the response–… along with the HTTP response header
• Returning meta-data with no body–Example: client requests object “If-modified-since”–Server checks if the object has been modified–… and simply returns “HTTP/1.1 304 Not Modified”
• Dynamically-generated responses–URL corresponds to a program the server needs to run–Server runs the program and sends the output to client
22
5 Minute Break
Questions Before We Proceed?
12
23
HTTP Performance
• Most Web pages have multiple objects (“items”)–E.g., HTML file and a bunch of embedded images
• Obvious way to fetch these: one item at a time
• What is that directly analogous to?
24
Fetch HTTP Items: Stop & Wait
Client Server
Request item 1
Transfer item 1
Request item 2
Transfer item 2
Request item 3
Transfer item 3Finish; displaypage
Start fetchingpage Tim
e
13
25
HTTP Performance
• Most Web pages have multiple objects (“items”)–E.g., HTML file and a bunch of embedded images
• Obvious way to fetch these: one item at a time
• What is that directly analogous to?
• What are the performance implications?
• Where does the analogy possibly break down?
26
Pipelined Requests/Responses
• Batch requests and responsesto reduce the number of packets
• Multiple requests can becontained in one TCP segment
• Small items (common) can alsoshare segments
• Note: maintains order ofresponses– Item 1 always arrives before item 2
• HTTP 1.1 feature (not in 1.0)
Client Server
Request 1Request 2Request 3
Transfer 1
Transfer 2
Transfer 3
14
27
Concurrent Requests/Responses
• Use multiple connections toissue requests and responses inparallel
• Does not necessarily maintainorder of responses
• Raises question of fairness– Set of N parallel connections
“grabs” bandwidth N times moreaggressively than just one
–What’s a reasonable/fair limit astraffic competes with that of otherusers?
Client Server
Request 1Request 2Request 3
Transfer 1
Transfer 2
Transfer 3
28
Persistent Connections
• Handle multiple transfers per connection– Including transfers subsequent to imaging current page–Maintain TCP connection across multiple requests–Either client or server can tear down the connection
• Performance advantages–Avoid overhead of connection set-up and tear-down–Allow TCP to learn more accurate RTT estimate–Allow the TCP congestion window to increase
I.e, leverage previously discovered bandwidth
15
29
Performance - Caching• Exploit locality of reference
• Modifier to GET requests:– If-modified-since – returns “not modified” if resource not
modified since specified time
• Response header:– Expires – specifies for how long it’s safe to cache the
resource–Or: No-cache – ignore all caches; always get resource
directly from server
• Features best for use by HTTP proxies– Locality of reference increases if many clients share a
proxy
30
HTTP Without Caching
• Many clients transfer same information–Generates redundant server and network load–Clients experience unnecessary latency
Server
Clients
Backbone ISP
ISP-1 ISP-2
16
31
Reverse Proxies• Cache documents close to server decrease server load• Typically done by content providers• Only works for
static content
Clients
Backbone ISP
ISP-1 ISP-2
Server
Reverse proxies
32
Forward Proxies• Cache documents close to clients reduce network traffic
and decrease latency
• Typically done by ISPs or corporate LANs
Clients
Backbone ISP
ISP-1 ISP-2
Server
Reverse proxies
Forward proxies
17
33
Content Distribution Networks (CDNs)• Integrate forward and reverse caching functionality
into one overlay network (usually) administrated byone entity–Example: Akamai
• Documents are cached both …–… as a result of clients’ requests (pull)–… due to being pushed in expectation of high access
rate
• Besides caching, do processing, e.g.,–Handle dynamic web pages– Transcoding
34
CDNs (cont’d)
Clients
ISP-1
Server
Forward proxies
Backbone ISP
ISP-2
CDN
18
35
Example: Akamai
• Akamai creates new domain names for each clientcontent provider.– e.g., i.a.cnn.net = CNAME custom.i.cnn.net.edgesuite.net
custom.i.cnn.net.edgesuite.net = CNAME a1921.aol.akamai.net
• Customer content provider modifies content sothat embedded URLs reference new domains:–CNN page’s HREF’s refer to i.a.cnn.net
• Customer pushes content out to Akamai as itchanges–Or: Akamai pulls it on demand w/ usual caching mech.–Both CNN & Akamai have control over load distribution
36
Example: Akamai CDN
GET http://cnn.com
a
DNS server forcnn.com
b
c
localDNS server
cnn.com“Akamaizes” its content.
“Akamaized” response object has inlineURLs for secondary content at (afterresolving CNAMEs) a1921.aol.akamai.netand other Akamai-managed DNS names.
akamai.netDNS servers
lookupa1921.aol.akamai.net
Akamai servers store/cachesecondary content for“Akamaized” services.
19
37
Hosting: Multiple Sites Per Machine• Multiple Web sites on a single machine–Hosting company runs the Web server on behalf of
multiple sites (e.g., www.foo.com and www.bar.com)
• Problem: “GET index.html”– Is it for www.foo.com/index.html or www.bar.com/index.html?
• Solution #1: multiple servers on the same machine–Run multiple Web servers on the machine–Have a separate IP address (or port) for each server
• Solution #2: include site name in the HTTP request–Run a single Web server with a single IP address–… and include “Host” header (e.g., “Host: www.foo.com”)–Required header with HTTP 1.1
38
Hosting: Multiple Machines Per Site• Replicating a popular Web site–Running on multiple machines to handle the load–… and to place content closer to the clients–… particularly if content isn’t cacheable by proxies/CDNs
• Problem: directing client to a particular replica– To balance load across the server replicas– To pair clients with nearby servers
• Solution #1: manual selection by clients–Each replica has its own site name–A Web page lists the replicas (e.g., by name, location)–… and asks clients to click on a hyperlink to pick
20
39
Hosting: Multiple Machines Per Site• Solution #2: single IP address, multiple machines–Run multiple machines behind a single IP address
–Ensure all packets from a singleTCP connection go to the same replica
Load Balancer 64.236.16.20
40
Hosting: Multiple Machines Per Site• Solution #3: multiple addresses, multiple machines–Same name but different addresses for all of the replicas–Configure DNS server to return different addresses
Internet 64.236.16.20
173.72.54.131
12.1.1.1
21
41
Caching vs. Replication• Motivations for moving content close to users–Reduce latency for the user–Reduce load on the network and the server
• Caching–Replicating the content “on demand” after a request–Storing the response message locally for future use–May need to verify if the response has changed–… and some responses are not cacheable
• Replication–Planned replication of the content in multiple locations–Updating of resources handled outside of HTTP–Can replicate scripts that create dynamic responses
42
Conclusions• Key ideas underlying the Web–Uniform Resource Identifier (URI), Locator (URL)–HyperText Markup Language (HTML)–HyperText Transfer Protocol (HTTP)–Browser helper applications based on content type
• Performance implications–Concurrent connections, pipelining, persistent conns.
• Main Web components–Clients, servers, proxies, CDNs
• Next lecture: drilling down to the link layer–P&D 2.1-2.4, 2.6, 2.8