web server technologies i: http
Web Server Technologies
Part I: HTTP & Getting Started
Joe Lima, Director of Product Development
Port80 Software, [email protected]
Tutorial Content
Introduction to HTTP
• TCP/IP and application layer protocols
• URLs, resources and MIME Types
• HTTP request/response cycle and proxies
Setup and deployment
• Planning Web server & site deployments
• Site structure and basic server configuration
• Managing users and hosts
Web Server Technologies | Part I: HTTP & Getting Started
Preliminaries - Recommended Texts
• Administrating Web Servers, Security & Maintenance, Larson and Stephens, Prentice Hall
• HTTP: The Definitive Guide, Gourley and Totty, et al., O’Reilly
Online resources are plentiful and will be cited along the way.
The Role of a Web Server
• Web servers serve various resources
- As file (document) servers
- As application front ends
• Other servers also provide services on the Internet, each speaking its own protocol:
- SMTP, POP, IMAP, NNTP, FTP, etc.
• Web server = HTTP server
• HTTP servers serve HTTP clients (browsers and other user agents) with the help of HTTP intermediaries (proxies)
a box or a service?
An Introduction to HTTP
• Hyper Text Transfer Protocol
• One of the application layer protocols that make up the Internet
- HTTP over TCP/IP
- Like SMTP, POP, IMAP, NNTP, FTP, etc.
• The underlying language of the Web
• Three versions have been used; the two in common use have been specified:
- RFC 1945 HTTP 1.0 (1996)
- RFC 2616 HTTP 1.1 (1999)
A Brief Digression on TCP/IP
HTTP sits atop the TCP/IP Protocol Stack
• Application Layer: HTTP
• Transport Layer: TCP
• Network Layer: IP
• Data Link Layer: Network Interfaces
A Brief Digression on TCP/IP, cont.
• IP provides packets that are routed based on source and destination IP addresses
• TCP provides segments that ride inside the IP packets and add connection information based on source and destination ports
The ports let TCP carry multiple protocols that connect services running on default ports:
• HTTP on port 80
• HTTP with SSL (HTTPS) on port 443
• FTP on port 21
• SMTP on port 25
• POP on port 110
• SSH on port 22
A Brief Digression on TCP/IP, cont.
• TCP also provides mechanisms to make the connection a reliable bit pipe
• 3-way handshake, sequence numbers, checksums, control flags
• A data stream is chopped up into chunks that are reassembled, complete and in correct order on the other endpoint of the connection
• TCP segments, riding inside IP packets, carry the chunks of data
• When HTTP is the Application Layer protocol on top of the stack, these chunks of data are the contents of the HTTP Message
A Brief Digression on TCP/IP, cont.
How an HTTP Message is delivered over a TCP/IP connection:
GET /index.html HTTP/1.1<CRLF>Host: www.hostname.com Con…
• The HTTP Message’s data stream is chopped up into chunks small enough to fit in a TCP segment
• The chunks ride inside TCP segments, which are used to reassemble them correctly on the other end of the connection
• The segments are shipped to the right destination inside IP datagrams
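The chopping and reassembly described above can be sketched in a few lines of Python. This is only a toy model, not real TCP: the 16-byte segment size is artificially small, byte offsets stand in for sequence numbers, and the function names are made up for illustration.

```python
import random

def segment(data: bytes, max_size: int):
    """Split a byte stream into (sequence_number, chunk) pairs."""
    return [(offset, data[offset:offset + max_size])
            for offset in range(0, len(data), max_size)]

def reassemble(segments):
    """Rebuild the stream by ordering the chunks on their sequence numbers."""
    return b"".join(chunk for _, chunk in sorted(segments))

message = b"GET /index.html HTTP/1.1\r\nHost: www.hostname.com\r\n\r\n"
segments = segment(message, max_size=16)  # toy size, far below a real segment
random.shuffle(segments)                  # simulate out-of-order arrival
restored = reassemble(segments)           # identical to the original message
```

Real TCP adds the handshake, checksums, acknowledgements and retransmission on top of this ordering step.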
A Brief Digression on TCP/IP, cont.
Other application layer protocols use TCP/IP to provide Internet services often found in company with HTTP:
• HTTPS (HTTP + SSL/TLS): Although a different protocol, service and port, HTTPS is usually integrated with the Web server
• FTP: Often run on the same box as the HTTP server to provide file transfer capabilities
• SMTP: Sometimes run with Web server (email gateways)
• SSH: Widely used instead of telnet for remote admin
Introduction to HTTP - continued
• HTTP and URLs
• URLs used early on by all Internet protocols, including various document retrieval protocols
• More specifications (both from 1994):
- Uniform Resource Locators - RFC 1738
- Universal Resource Identifiers - RFC 1630
• Hypertext came to predominate as the most efficient way of providing access to resources
- Fast, flexible, generic, extensible - Facilitated searching, collaboration, annotation
• HTTP now the central mechanism for requesting and serving URL based resources
Introduction to HTTP - continued
A Digression on MIME Types
– URLs point to resources (“content”)
– Resources are represented using different Media Types (MIME Types)
• Multipurpose Internet Mail Extensions, RFC 2045 and RFC 2046
• Should be registered with IANA (www.iana.org)
– MIME Type tells how content should be handled
• File extensions are mapped to certain MIME Types
– .html usually means a MIME Type of text/html
– .jpg usually means a MIME Type of image/jpeg
• But mapping by file extension is dependent on local software’s conventions and might not be shared across applications or machines
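As a small illustration of extension-to-type mapping, Python’s standard mimetypes module keeps exactly this kind of local table. The helper name and the application/octet-stream fallback are assumptions made for this sketch, and results for less common extensions can differ from machine to machine, which is the caveat raised above.

```python
import mimetypes

def guess_media_type(filename: str) -> str:
    """Map a file name to a MIME type using the local extension table."""
    media_type, _encoding = mimetypes.guess_type(filename)
    # Assumed fallback for this sketch: treat unknown extensions as raw bytes.
    return media_type or "application/octet-stream"

html_type = guess_media_type("index.html")
jpeg_type = guess_media_type("photo.jpg")
```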
Introduction to HTTP - continued
HTTP allows MIME Type info to be passed between client and server so both agree about the media type of the resource
• primary-type/sub-type
The most common MIME Types used on the Web come from the text, image and application top-level groups
• text/html, text/css
• image/gif, image/jpeg, image/png
• application/pdf, application/octet-stream
• application/x-javascript, application/x-shockwave-flash
Introduction to HTTP - continued
HTTP servers turn URLs into resources through a request-response cycle
• User agent (client) issues an HTTP request to a host (server) for a given resource using its URL
• Server “resolves” the URL, acts on the resource
- Retrieves, but also launches, modifies etc.
• Server sends an HTTP response back to the client
- Usually (not always) a representation of the requested resource
- Can also be info about the resource, its state, etc.
• Each request is discontinuous with all previous requests – HTTP is stateless
Basic HTTP Request/Response Cycle
[Figure: an HTTP Client sends an HTTP Request to the HTTP Server at www.foo.com, asking for the Resource /bar by its URL, http://www.foo.com/bar.html; the server returns an HTTP Response.]
An HTTP Request/Response Chain
[Figure: the request/response chain. Labels: HTTP Client, LAN, Egress Proxy, Local DNS, Internet, Transparent Proxies, Root DNS Servers, External DNS, Network at Hosting Provider, DMZ, Reverse Proxy, HTTP Server, Servers.]
Types and Uses of Proxy Servers
• Proxies are HTTP Intermediaries
• All act as both clients and servers
• Major types of proxies can be distinguished by where they live and how they get traffic
- Explicit (e.g., Egress)
- Transparent/Intercepting
- Reverse/Surrogate
• Three primary uses for proxies
- Security
- Performance
- Content Filtering
Looking into HTTP
To really understand Web servers (and clients), study the grammar, syntax and semantics of HTTP requests and responses:
• Look at the parts of the transaction you don’t normally see in a browser
• Issue requests manually to understand how a user agent gets resources from a server
• Use protocol analyzers to “spy” on the HTTP conversation
• Learn to troubleshoot problems by “reading” and “writing” HTTP
Looking into HTTP - continued
HTTP requests and responses are both types of Internet Messages (RFC 822), and share a general format:
– A Start Line, followed by a CRLF
• Request Line for requests
• Status Line for responses
– Zero or more Message Headers
• field-name “:” [field-value] CRLF
– An empty line
• Two CRLFs mark the end of the Headers
– An optional Message Body if there is a payload
• All or part of the “Entity Body” or “Entity”
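The general format above maps directly onto a small parser. The following is a minimal sketch, assuming a complete message held in memory; it ignores folded (multi-line) headers and keeps only the last value of a repeated header, and the function name is illustrative.

```python
def parse_message(raw: bytes):
    """Split an HTTP message into start line, headers and body."""
    # The first empty line (two consecutive CRLFs) ends the headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
start_line, headers, body = parse_message(raw)
```

The same function handles requests, since only the start line differs between the two message types.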
Making a simple HTTP request
• Open a TCP connection to a host
– Can borrow the telnet client to do this, by pointing it at the default HTTP port (80)
– C:\>telnet www.google.com 80
• Ask for a resource using a minimal request syntax:
– GET / HTTP/1.1 <CRLF>
– Host: www.google.com <CRLF><CRLF>
• A Host header is required for HTTP 1.1 connections, though not for HTTP 1.0
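The same manual request can be issued from a script instead of telnet. This is a sketch using Python’s standard socket module; the function names are illustrative, and the Connection: close header is an addition not in the telnet example, included so that the server closes the connection and the read loop ends.

```python
import socket

def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request, including the required Host header."""
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"   # assumption: ask the server to close when done
            "\r\n").encode("ascii")

def fetch(host: str, path: str = "/", port: int = 80) -> bytes:
    """Open a TCP connection, send the request and return the raw response."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:              # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("www.google.com") would return the status line, headers and body as raw bytes, which is the part of the transaction a browser normally hides.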
A Closer Look at the Request Line
Consists of three major parts
– The Request Method followed by a SP
• GET, POST, HEAD, TRACE, OPTIONS, PUT, DELETE and CONNECT
• Extension methods such as those specified by WebDav (RFC 2518)
– The Request URI followed by a SP
• The URL associated with the resource
• By far the most complex part of any Start Line
• Defined by intension rather than extension
– The HTTP Version followed by the CRLF
• 0.9, 1.0, 1.1
A Closer Look at the Request Methods
• GET
– By far the most common method
– Retrieves a resource from the server
– Supports passing of query string arguments
• HEAD
– Retrieves only the Headers associated with a resource but not the entity itself
– Highly useful for protocol analysis, diagnostics
• POST
– Allows passing of data in the entity rather than the URL
– Can transmit far larger arguments than GET
– Arguments not displayed in the URL
More Request Methods
• OPTIONS
– Shows methods available for use on the resource (if given a path) or the host (if given a “*”)
• TRACE
– Diagnostic method for assessing the impact of proxies along the request-response chain
• PUT, DELETE
– Used in HTTP publishing (e.g., WebDav)
• CONNECT
– A common extension method for tunneling other protocols through HTTP
A Closer Look at the Request URI
• Absolute URI vs. Absolute Path
– Explicit Proxies Require Absolute URIs
• Client is connected directly to the proxy
• Protocol and host name needed to resolve request
– Grammar of the Absolute Path
• Like Absolute URI minus the “http://hostname”
• Initial “/” equivalent of the host’s document root
• In HTTP 1.1 with name-based virtual hosting, the Host header directs the request to the appropriate document root
• Subsequent slashes left-to-right imply less “significant” distinctions
• The “*” form used to query entire host
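To make the two Request URI forms concrete, the sketch below derives the start line and Host header for the same URL in both cases. It uses urlsplit from Python’s standard library; the function name and the via_explicit_proxy flag are illustrative assumptions.

```python
from urllib.parse import urlsplit

def request_head(url: str, via_explicit_proxy: bool = False) -> str:
    """Build the Request Line and Host header for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # A client talking to an explicit proxy sends the absolute URI;
    # a client talking to the origin server sends only the absolute path.
    target = f"{parts.scheme}://{parts.netloc}{path}" if via_explicit_proxy else path
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"

direct = request_head("http://www.foo.com/bar.html")
proxied = request_head("http://www.foo.com/bar.html", via_explicit_proxy=True)
```

In the direct case the host name survives only in the Host header, which is why that header is what makes name-based virtual hosting possible.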
A Closer Look at the Status Line
Consists of three major parts
– The HTTP Version followed by a SP
• Just like third part of Request Line
– Status Code followed by a SP
• 5 groups of 3 digit integers indicating the result of the attempt to satisfy the request
• 1xx are informational
• 2xx are success codes
• 3xx are for alternate resource locations (redirects)
• 4xx indicate client side errors
• 5xx indicate server side errors
– The Reason Phrase followed by the CRLF
• Short textual description of the status code
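The three parts of the Status Line, and the five status code groups, can be picked apart as follows. This is a minimal sketch with illustrative names; it assumes a reason phrase is present, which servers normally send.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def parse_status_line(line: str):
    """Split a Status Line into HTTP version, status code and reason phrase."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    return version, int(code), reason

def status_class(code: int) -> str:
    """Classify a status code by its first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
group = status_class(code)
```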
A Closer Look at HTTP Headers
Headers come in four major types, some for requests, some for responses, some for both:
– General Headers
• Provide info about messages of both kinds
– Request Headers
• Provide request-specific info
– Response Headers
• Provide response-specific info
– Entity Headers
• Provide info about request and response entities
– Extension headers are also possible
A Closer Look at General Headers
• Connection – lets clients and servers manage connection state
– Connection: Keep-Alive (HTTP 1.0)
– Connection: close (HTTP 1.1)
• Date – when the message was created
– Date: Sat, 31 May 2003 15:00:00 GMT
• Via – shows proxies that handled message
– Via: 1.1 www.myproxy.com (Squid/1.4)
• Cache-Control – Among the most complex of headers, enables caching directives
– Cache-Control: no-cache
A Closer Look at Request Headers
• Host – The hostname (and optionally port) of server to which request is being sent
– Required for name-based virtual hosting
– Host: www.port80software.com
• Referer – The URL of the resource from which the current request URI came
– Misspelled in the specification, so [sic]
– Referer: http://www.host.com/login.asp
• User-Agent – Name of the requesting application, used in browser sensing
– User-Agent: Mozilla/4.0 (Compatible; MSIE 6.0)
Some More Request Headers
• Accept and its variants – Inform servers of client’s capabilities and preferences
– Enables content negotiation
– Accept: image/gif, image/jpeg;q=0.5
– Accept- variants for Language, Encoding, Charset
• If-Modified-Since and other conditionals
– Frequently used by browsers to manage caches
– If-Modified-Since: Sat, 31 May 2003 15:00:00 GMT
• Cookie – How clients pass cookies back to the servers that set them
– Cookie: id=23432; level=3
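The q values in an Accept header rank the client’s preferences, with a missing q meaning 1.0. A server doing content negotiation might read the header roughly as in the sketch below; the function name is illustrative and wildcard handling (such as */*) is left out.

```python
def parse_accept(value: str):
    """Return (media_type, q) pairs, most preferred first."""
    ranked = []
    for item in value.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0                       # default quality when no q parameter is given
        for param in params:
            name, _, val = param.partition("=")
            if name.strip() == "q":
                q = float(val)
        ranked.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

preferences = parse_accept("image/gif, image/jpeg;q=0.5")
```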
A Closer Look at Response Headers
• Server – The server’s name and version
– Server: Microsoft-IIS/5.0
– Can be problematic for security reasons
• Vary – Tells client & proxy caches which headers were used for content negotiation
– Vary: User-Agent, Accept
• Set-Cookie – This is how a server sets a cookie on a client
– Set-Cookie: id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT; secure
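The round trip from Set-Cookie to Cookie can be sketched with Python’s standard http.cookies module. The helper name is illustrative, and the sketch skips the checks a real client makes before sending a cookie back (path, expiry and the secure flag).

```python
from http.cookies import SimpleCookie

def cookie_header_from(set_cookie_value: str) -> str:
    """Turn the value of a Set-Cookie header into the Cookie header a client returns."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Only the name=value pairs go back to the server; attributes stay with the client.
    return "Cookie: " + "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

jar = SimpleCookie()
jar.load("id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT; secure")
returned = cookie_header_from("id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT; secure")
```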
A Closer Look at Entity Headers
• Allow – Lists the request methods that can be used on the entity
– Allow: GET, HEAD, POST
• Location – Gives the alternate or new location of the entity
– Used with 3xx response codes (redirects)
– Location: http://www.ibm.com/us/
• Content-Encoding – specifies encoding performed on the body of the response
– Used with HTTP compression
– Corresponds to Accept-Encoding request header
– Content-Encoding: gzip
More Entity Headers
• Content-Length – The size of the entity body in bytes
– Value shrinks when compression is applied
– Content-Length: 24000
• Content-Location – The actual URL of the resource if different than its request URL
– Often used to show the index or default page
– Content-Location: http://www.foo.com/home.html
• Content-Type – specifies Media (MIME) type of the entity body
– Corresponds to Accept header
– Content-Type: image/png
More Entity Headers
• ETag – Uniquely identifies a particular instance of a given resource
– Used with conditional request headers to validate cached instances of the resource
• If-Match, If-None-Match
– ETag: "adkskdashjgk07563AF"
• Expires – Gives expiration for the instance of the resource for use in caching
– Expires: Sat, 31 May 2003 19:00:00 GMT
• Last-Modified – Date/time the entity was last changed (or created)
– Last-Modified: Fri, 30 May 2003 09:00:00 GMT
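Cache validation with an entity tag comes down to one comparison on the server. The sketch below is a simplified model with illustrative names: it handles a single tag in If-None-Match and ignores the “*” form, tag lists and weak validators.

```python
def respond(request_headers: dict, current_etag: str, body: bytes):
    """Answer a GET, returning 304 when the client's cached copy is still valid."""
    if request_headers.get("If-None-Match") == current_etag:
        # The cached instance matches: send headers only, no entity body.
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag, "Content-Length": str(len(body))}, body

first = respond({}, '"v1"', b"<html>hello</html>")
revalidated = respond({"If-None-Match": '"v1"'}, '"v1"', b"<html>hello</html>")
changed = respond({"If-None-Match": '"v1"'}, '"v2"', b"<html>new</html>")
```

The 304 response is what saves bandwidth: the browser reuses its cached entity instead of downloading it again.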
Planning Web Server Deployments
• Major issues to consider when planning a Web server or Web site deployment
– What is the appropriate form of Web hosting?
– What type of server software will be used?
– What are the sizing requirements?
– How will DNS be handled?
• There are no fixed answers to any of these questions
• Planning should be guided by the goals of the deployment and should harmonize with the related business processes
Choosing Among the Hosting Options
• Host your own
– Pro: Complete control over the physical box
– Con: Expensive and difficult to maintain well
• Hosting provider schemes
– Dedicated Server
• Pro: Control without the hardware purchase
• Con: Must manage the box – remotely
– Co-located Server
• Pro: Admin control of entire box
• Con: Must purchase box and manage remotely
– Virtual Hosting
• Pro: Cheapest and easiest to maintain solution
• Con: Server is shared, admin access limited
Choosing Server Software
Beware of sectarian quarrels, especially over performance and security
– Apache has the best reputation historically
• OS started out more stable, secure and scalable
• Features rapidly extended & refined via modular and open development model
• Strong administrator ethos = well managed boxes
– IIS formerly favored mainly for ease of use in less demanding environments, but 5.0 on Win2K closed most of the remaining quality gap
– Any modern HTTP server is very solid software that is terribly vulnerable when deployed & used naively
Choosing Server Software, cont.
In the real world, usually a conditioned choice if not a foregone conclusion
– Biggest single factors are type of deployment and prior commitment to an underlying OS
– Apache on UNIX and Linux predominates in universities, research institutes and for virtual hosting setups – has majority of hosted domains
– Netscape/iPlanet used to have large enterprise market almost to itself
– IIS started with smaller companies, often as part of LAN server, but has now taken over Netscape’s leading role in the enterprise
Sizing a Web Server
• Sizing is the process of determining the physical resources required to meet anticipated demand
• Processing power and memory are not typically a problem for the Web server
– Basic HTTP server job of fetching files is not processor intensive
– Resource constraints on the box probably an effect of other server-side mechanisms
• Automated session management by app servers
• Manipulation of large database queries
• Lots of non-optimized code in Web applications
Sizing a Web Server, cont
Network bottlenecks
– Available bandwidth should accommodate max HTTP operations (“hits”) under peak load
– Assuming an average file size of 14,000 bytes
• 56K Modem could handle about 0.5 hits/sec
• T1 line (1.5 Mbps) could handle about 13 hits/sec
• T3 (45 Mbps) could handle about 400 hits/sec
• OC3 (155 Mbps) could handle about 1380 hits/sec
– Bandwidth sizing should be adjusted based on your actual request frequency and size
• Assume peaks at triple the average loads
– Also watch out for collisions and overloading of routers, switches, hubs and NICs on the network
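The hits/sec figures above follow from dividing link speed by the size of an average response in bits (14,000 bytes is 112,000 bits). The sketch below reproduces that arithmetic and inverts it for planning; it ignores protocol overhead, and the function names and the example of 10 hits/sec are illustrative assumptions.

```python
def hits_per_second(link_bits_per_second: float, avg_file_bytes: int = 14_000) -> float:
    """Upper bound on hits/sec a link can carry, ignoring protocol overhead."""
    return link_bits_per_second / (avg_file_bytes * 8)

def required_bandwidth_bps(avg_hits_per_second: float,
                           avg_file_bytes: int = 14_000,
                           peak_factor: int = 3) -> float:
    """Bandwidth needed if peaks run at peak_factor times the average load."""
    return avg_hits_per_second * peak_factor * avg_file_bytes * 8

modem = hits_per_second(56_000)          # 56K modem
t1 = hits_per_second(1_500_000)          # T1 line
t3 = hits_per_second(45_000_000)         # T3
oc3 = hits_per_second(155_000_000)       # OC3
needed = required_bandwidth_bps(10)      # hypothetical site averaging 10 hits/sec
```

With the triple-peak rule, a site averaging 10 hits/sec of 14,000-byte responses needs about 3.4 Mbps, more than two T1 lines.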
Dealing with DNS
Making a site available by domain name requires its registration and use of DNS
– A domain name can be registered with many different registrars
– During registration, a DNS server is designated to maintain the domain’s DNS records
– These records propagate to other DNS servers
– DNS servers use them to resolve a domain such as www.port80software.com to a four-octet IP address such as 66.45.42.237
– ISPs offer DNS services; you can also maintain your own or use a 3rd party service that lets you manage the records without running a DNS box
A Simplistic Model of the DNS System
[Figure: a client, two ISP DNS Servers and the Root DNS Server, connected by arrows numbered 1 to 6 that correspond to the steps below.]
1. Client asks its ISP’s DNS to resolve foo.com
2. That DNS asks root DNS whom to ask about foo.com
3. Root DNS points to 2nd ISP’s DNS
4. 1st ISP’s DNS asks 2nd ISP’s DNS
5. 2nd ISP’s DNS responds with IP
6. 1st ISP’s DNS replies and caches
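The six steps can be modeled as a toy caching resolver. Everything in this sketch is simulated: the two lookup tables, the server name and the address 192.0.2.10 (from a range reserved for documentation) are made up, and real DNS adds record types, intermediate name servers and cache expiry.

```python
# Hypothetical data standing in for the root and the 2nd ISP's DNS server.
ROOT_REFERRALS = {"foo.com": "second-isp-dns"}
AUTHORITATIVE = {"second-isp-dns": {"foo.com": "192.0.2.10"}}

class CachingResolver:
    """Plays the role of the 1st ISP's DNS server."""

    def __init__(self):
        self.cache = {}
        self.upstream_lookups = 0

    def resolve(self, name: str) -> str:
        if name in self.cache:                 # answered from cache, no upstream traffic
            return self.cache[name]
        self.upstream_lookups += 1
        server = ROOT_REFERRALS[name]          # steps 2-3: root says whom to ask
        address = AUTHORITATIVE[server][name]  # steps 4-5: ask that server, get the IP
        self.cache[name] = address             # step 6: reply and cache
        return address

resolver = CachingResolver()
first_answer = resolver.resolve("foo.com")
second_answer = resolver.resolve("foo.com")    # served from the cache
```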
Dealing with DNS, cont.
• You should learn to use nslookup to verify your DNS lookups are working and troubleshoot DNS problems
• Command line utility also built into network analyzers like the free ieHTTPHeaders
– C:\>nslookup google.com
• You can also point nslookup at specific DNS servers to test their ability to resolve
– C:\>nslookup
– >Server 206.13.30.12
– >google.com
Virtual and Physical Site Structure
Think of a site as having not one structure but two – virtual and physical
– Virtual structure is described by the URLs used to request resources from the site
• This is the public view of the site – the site as visitors will see it when they browse to it
– Physical structure is the organization of the files and directories in the file system on the host machine’s hard disk
• This is the private view of the site seen only by you and those users you choose to give access
– It will become obvious why this distinction is necessary to keep things straight
Configuring Virtual-Physical Mappings
The Document Root
– A directory in the file system of the host machine where the Web server looks for the files that constitute the Web site
• Also called the root directory
– Often given an index or default document that serves as the homepage of the site.
– Corresponds to the “/” at the end of the hostname portion of the URL:
• http://www.foo.com/index.html (virtual)
• /var/www/index.html (physical)
• C:\inetpub\wwwroot\index.html (physical)
Configuring Virtual-Physical Mappings
Notice how the hostname portion of the URL maps to the same place pointed to by the physical path that lies to the left of the “/” representing the document root
– The URL is virtual to the left of the document root, but it seems to be physical to the right of the document root
– In fact, a URL is purely virtual – there is no guarantee that the path to the right of the document root looks this way on disk
– In this simple case, virtual and physical paths happen to coincide from the document root down, but such is not always the case
Configuring Virtual-Physical Mappings
• A virtual directory or alias in the URL path preempts the lookup in the document root
• This extends the virtual structure to the right of (or “below”) the root “/” in the URL path
– http://www.foo.com/virtual/index2.html
– /htdocs/physical/index2.html
• Here the virtual directory “virtual” points to a physical directory that is outside of the document root altogether
• Nested virtual directories are also possible
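The lookup order described above (alias first, document root otherwise) can be sketched as a small mapping function. The paths reuse the examples from these slides, but the names and table layout are illustrative assumptions; real servers express this through configuration rather than code.

```python
import posixpath

DOCUMENT_ROOT = "/var/www"
ALIASES = {"/virtual": "/htdocs/physical"}   # virtual directory -> physical directory

def to_physical(url_path: str) -> str:
    """Map the path portion of a URL to a location in the file system."""
    # Normalize first so "../" sequences cannot climb above the root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    for prefix, target in ALIASES.items():
        if clean == prefix or clean.startswith(prefix + "/"):
            return target + clean[len(prefix):]   # the alias preempts the document root
    return DOCUMENT_ROOT + clean

in_root = to_physical("/index.html")
aliased = to_physical("/virtual/index2.html")
```

Because visitors only ever see the left-hand side of the table, the physical directory can move without breaking a single link.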
Configuring Virtual-Physical Mappings
• You can (and should) take advantage of this virtual/physical distinction to:
– Preserve the site’s URL scheme even if the physical structure has to change
• Avoids broken links due to site expansion/revision
– Manage directory and file locations in ways that minimize security risks and facilitate backup procedures
– Reduce redundant physical directories for supporting files
– Allow developers to keep relative URLs in source code simple
Virtual Hosting
• We know the hostname part of the URL is a virtual locator for files that live (physically) in a site’s document root
• The idea of virtual hosting takes this a step further by allowing a single server to host many domains, each with its own document root
• Two methods of virtual hosting
– Old way: multiple IP addresses per server
– New way: name-based using host headers
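Name-based virtual hosting amounts to choosing a document root from the Host header. A minimal sketch follows; the host names, directory paths and the default-site fallback are hypothetical.

```python
VIRTUAL_HOSTS = {
    "www.foo.com": "/var/www/foo",
    "www.bar.com": "/var/www/bar",
}
DEFAULT_ROOT = "/var/www/default"   # assumed fallback for unknown or missing hosts

def document_root_for(headers: dict) -> str:
    """Pick the document root for a request based on its Host header."""
    # Drop an optional ":port" suffix; host names are case-insensitive.
    host = headers.get("Host", "").split(":")[0].lower()
    return VIRTUAL_HOSTS.get(host, DEFAULT_ROOT)

foo_root = document_root_for({"Host": "www.foo.com:80"})
bar_root = document_root_for({"Host": "WWW.BAR.COM"})
```

Both sites can share one IP address and one port, which is what the older one-address-per-site method could not offer.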
Managing Users and Hosts
• Users (developers) will need remote access allowing them to transfer files to and from the site’s physical structure
• FTP (and other file transfer mechanisms) allow the administrator to restrict this access
– to sub-sections of the site
– by user account or client IP
• These restrictions should be backed up by access control lists on the directories that enforce the “principle of least access”
Managing Users and Hosts
• Similar rules apply to managing access to the Web site itself by visitors
– ACLs in the Web site’s physical file structure should be set to the minimum required by the Web server to serve the resources on the site
• This gets tricky with server side programming
– If the Web site (or part of it) does not need to be available for anonymous access from everywhere then users, groups, hosts and IPs should be restricted
– HTTP Authentication can also be employed to make all or part of a site private and require login
Managing Users and Hosts
• Although HTTP authentication now offers safeguards like checksums and password encryption, it is not very secure
– Lack of end-to-end encryption of the entire message transmission makes hijacking, scanning and spoofing easy
• If all or part of the site requires authentication and serious security for users’ login credentials, form based authentication over SSL is the only choice
Basic SSL Configuration
• Initiate an application for a certificate from a recognized Certificate Authority (CA)
– The site (domain) owner will have to prove they are who they say they are
• Create a Certificate Signing Request (CSR)
– Contains the site’s Public Key and matches up with a Private Key that is created simultaneously and stored on the server
• Submit the request to the CA and pay up
• Retrieve the certificate and install it
• Test the certificate with an HTTPS request
About Port80 Software
Solutions for Microsoft IIS Web Servers
Port80 software exposes control to server-side functionality for developers, and streamlines tasks for administrators:
• Increase security by locking down what info you broadcast and blocking intruders with ServerMask and ServerDefender
• Protect your intellectual property by preventing hotlinking with LinkDeny
• Improve performance: compress pages and manage cache controls for faster load time and bandwidth savings with CacheRight, httpZip, and ZipEnable
• Upgrade Web development tools: Negotiate content based on device, language, or other parameters with PageXchanger, and tighten code with w3compiler.
Visit us online @ www.port80software.com