miw chapter 2

Upload: aniket-shetye

Post on 03-Apr-2018


  • 7/28/2019 MIW Chapter 2

    1/32

    Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine

    Basic WWW Technologies

    2.1 Web Documents.

    2.2 Resource Identifiers: URI, URL, and URN.

    2.3 Protocols.

    2.4 Log Files.

    2.5 Search Engines.


    What Is the World Wide Web?

    The World Wide Web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience:

    1. A uniform naming scheme for locating resources on the web (e.g., URIs).

    2. Protocols, for access to named resources over the web (e.g., HTTP).

    3. Hypertext, for easy navigation among resources (e.g., HTML).


    Internet vs. Web

    Internet:

      A more general term

      Includes the physical aspects of the underlying networks and mechanisms such as email, FTP, and HTTP

    Web:

      Associated with information stored on the Internet

      Can also refer to a broader class of networks, e.g., the web of English literature

    Both the Internet and the web are networks


    Essential Components of WWW

    Resources:

      Conceptual mappings to concrete or abstract entities, which do not change in the short term

      Example: the ICS website (web pages and other kinds of files)

    Resource identifiers (hyperlinks):

      Strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource

      Example: http://www.ics.uci.edu is used to identify the ICS homepage

    Transfer protocols:

      Conventions that regulate the communication between a browser (web user agent) and a server


    Standard Generalized Markup Language (SGML)

    Based on GML (Generalized Markup Language), developed by IBM in the 1960s

    An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document

    Gave birth to the Extensible Markup Language (XML), a W3C Recommendation in 1998


    SGML Components

    SGML documents have three parts:

      Declaration: specifies which characters and delimiters may appear in the application

      DTD / style sheet: defines the syntax of markup constructs

      Document instance: the actual text (with the tags) of the document

    More information can be found at: http://www.w3.org/markup/SGML


    DTD Example One

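    The declaration being described, as in the HTML 4 DTD: <!ELEMENT UL - - (LI)+>

    ELEMENT is a keyword that introduces a new element type, here unordered list (UL)

    The two hyphens indicate that both the start tag <UL> and the end tag </UL> for this element type are required

    The content model (LI)+ means that the content between the two tags must be one or more list items (LI)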


    DTD Example Two

    The declaration being described, as in the HTML 4 DTD: <!ELEMENT IMG - O EMPTY>

    The element type being declared is IMG

    The hyphen and the following "O" indicate that the end tag can be omitted

    Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted (no closing tag)


    HTML Background

    HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.

    The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.

    HTML standards are organized by the W3C: http://www.w3.org/MarkUp/


    HTML Functionalities

    HTML gives authors the means to:

      Publish online documents with headings, text, tables, lists, photos, etc.

      Include spreadsheets, video clips, sound clips, and other applications directly in their documents

      Link information via hypertext links, at the click of a button

      Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.


    HTML Versions

    HTML 4.01 is a revision of the HTML 4.0 Recommendation, which was first released on 18 December 1997.

    HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt

    HTML 4.0 was first released as a W3C Recommendation on 18 December 1997.

    HTML 3.2 was W3C's first Recommendation for HTML, which represented the consensus on HTML features for 1996.

    HTML 2.0 (RFC 1866, http://www.rfc-editor.org/rfc/rfc1866.txt) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.

    Sample Webpage


    Sample Webpage HTML Structure

    [Figure: HTML source of the sample webpage, with callouts marking "The title of the webpage" and "Body of the webpage"]


    HTML Structure

    An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>)

    The title of the document appears in the head (along with other information about the document)

    The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>
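The head/body split can be illustrated with a small sketch that uses Python's standard `html.parser` module. The page in `SAMPLE_PAGE` is a hypothetical stand-in for the slide's sample webpage (whose source did not survive in this transcript), and the names `SectionCollector` and `parse_page` are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical minimal page: a head with a title and a body with one paragraph.
SAMPLE_PAGE = """<html>
<head><title>My first HTML document</title></head>
<body><p>Hello world!</p></body>
</html>"""


class SectionCollector(HTMLParser):
    """Collects the text found inside <title> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open element names
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "p":
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        # Only pop when the end tag matches the innermost open element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if not self.open_tags:
            return
        current = self.open_tags[-1]
        if current == "title":
            self.title += data
        elif current == "p":
            self.paragraphs[-1] += data


def parse_page(source):
    """Return (title text, list of paragraph texts) for an HTML string."""
    collector = SectionCollector()
    collector.feed(source)
    collector.close()
    return collector.title, collector.paragraphs


title, paragraphs = parse_page(SAMPLE_PAGE)
print("head/title:", title)
print("body paragraphs:", paragraphs)
```

The parser lowercases tag names, so the same sketch works whether the page is written with `<HEAD>` or `<head>`.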


    Resource Identifiers

    URI: Uniform Resource Identifiers

    URL: Uniform Resource Locators

    URN: Uniform Resource Names


    Introduction to URIs

    Every resource available on the Web has an address that may be encoded by a URI

    URIs typically consist of three pieces:

      The naming scheme of the mechanism used to access the resource (e.g., HTTP, FTP)

      The name of the machine hosting the resource

      The name of the resource itself, given as a path

  • 7/28/2019 MIW Chapter 2

    18/32

    Modeling the Internet and the WebSchool of Information and Computer ScienceUniversity of California, Irvine

    18

    URI Example

    http://www.w3.org/TR

    There is a document available via the HTTP protocol

    Residing on the machine hosting www.w3.org

    Accessible via the path "/TR"
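The three pieces of this URI can be pulled apart with `urlsplit` from Python's standard `urllib.parse` module. This is only a sketch of the decomposition described above; the helper name `uri_pieces` is made up for the example.

```python
from urllib.parse import urlsplit


def uri_pieces(uri):
    """Return (scheme, host, path) for a hierarchical URI."""
    parts = urlsplit(uri)
    return parts.scheme, parts.hostname, parts.path


scheme, host, path = uri_pieces("http://www.w3.org/TR")
print(scheme, host, path)  # prints: http www.w3.org /TR
```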


    Protocols

    Describe how messages are encoded and exchanged

    Different layering architectures:

      ISO OSI 7-layer architecture

      TCP/IP 4-layer architecture


    ISO OSI Layering Architecture


    ISO's Design Principles

    A layer should be created where a different level of abstraction is needed

    Each layer should perform a well-defined function

    The layer boundaries should be chosen to minimize information flow across the interfaces

    The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy


    TCP/IP Layering Architecture

    A simplified model that provides an end-to-end reliable connection

    The network layer:

      Hosts drop packets into this layer, and the layer routes them towards the destination

      Its only promise is "try my best" (best-effort delivery)

    The transport layer:

      Provides a reliable byte-oriented stream
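The division of labour between the two layers can be sketched with a toy simulation. This is not TCP or IP: `best_effort_send` is a made-up stand-in for a network layer that may drop or reorder packets, and `reliable_transfer` is a made-up stand-in for a transport layer that numbers segments, resends whatever is still missing, and reassembles the bytes in order. The loss rate, segment size, and seed are arbitrary assumptions.

```python
import random


def best_effort_send(packets, rng, loss_rate=0.3):
    """Network-layer stand-in: may drop packets and deliver the rest out of order."""
    delivered = [p for p in packets if rng.random() >= loss_rate]
    rng.shuffle(delivered)
    return delivered


def reliable_transfer(data, segment_size=4, seed=0):
    """Transport-layer stand-in built on top of best_effort_send.

    Splits the data into segments keyed by byte offset, keeps resending the
    segments that have not arrived yet, and reassembles them in offset order.
    Returns the reassembled bytes and the number of sending rounds used.
    """
    rng = random.Random(seed)
    segments = {
        offset: data[offset:offset + segment_size]
        for offset in range(0, len(data), segment_size)
    }
    received = {}
    rounds = 0
    while len(received) < len(segments):
        missing = [(off, seg) for off, seg in segments.items() if off not in received]
        for off, seg in best_effort_send(missing, rng):
            received[off] = seg
        rounds += 1
    stream = b"".join(received[off] for off in sorted(received))
    return stream, rounds


stream, rounds = reliable_transfer(b"hello, reliable byte stream")
print(stream, "after", rounds, "round(s)")
```

The receiver sees the original bytes in order even though individual packets were lost or shuffled, which is the contrast the slide draws between the two layers.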


    Hypertext Transfer Protocol (HTTP)

    An application-layer protocol used to carry WWW traffic between a browser and a server

    Runs over TCP, a connection-oriented protocol and one of the transport-layer protocols supported by the Internet

    HTTP communication is established via a TCP connection, by default to server port 80


    GET Method in HTTP
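The figure for this slide did not survive in the transcript. As a rough sketch of what a GET exchange looks like on the wire, the code below builds a minimal HTTP/1.1 request and parses the status line of a response. The helper names and the sample response bytes are assumptions made for illustration; no network connection is opened.

```python
def build_get_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # Host header, required in HTTP/1.1
        "Connection: close",
        "",                       # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response):
    """Split the first line of a response into (version, status code, reason)."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("www.ics.uci.edu", "/")
print(request.decode("ascii"))

# Hypothetical response bytes, standing in for what a server might send back.
sample_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
print(parse_status_line(sample_response))
```

To send the request for real, one would open a TCP connection to port 80 of the host (for example with `socket.create_connection`) and write these bytes to it.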


    Domain Name System

    DNS (Domain Name System): maps domain names to IP addresses

    IPv4:

      IPv4 was initially deployed on January 1st, 1983 and is still the most commonly used version.

      32-bit address, written as a string of 4 decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255.

    IPv6:

      Revision of IPv4 with 128-bit addresses
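The relation between the dotted notation and the underlying 32-bit number can be sketched as follows. The function `dotted_to_int` is a made-up helper that does the arithmetic by hand; Python's standard `ipaddress` module is used only to cross-check it. The address 192.0.2.1 comes from a range reserved for documentation.

```python
import ipaddress


def dotted_to_int(address):
    """Convert dotted-decimal IPv4 text to its 32-bit integer value."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a dotted-decimal IPv4 address: {address!r}")
    value = 0
    for octet in octets:
        value = (value << 8) | octet   # each number fills the next 8 bits
    return value


for text in ("0.0.0.0", "192.0.2.1", "255.255.255.255"):
    by_hand = dotted_to_int(text)
    by_library = int(ipaddress.IPv4Address(text))
    print(text, by_hand, by_hand == by_library)

# Address widths in bits for the two protocol versions.
print(ipaddress.IPv4Address("0.0.0.0").max_prefixlen)
print(ipaddress.IPv6Address("::1").max_prefixlen)
```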


    Top Level Domains (TLD)

    Top-level domain names include .com, .edu, .gov and the ISO 3166 country codes

    There are three types of top-level domains:

      Generic domains (http://www.iana.org/gtld/gtld.htm) were created for use by the Internet public

      Country code domains (http://www.iana.org/cctld) were created to be used by individual countries

      The .arpa domain (Address and Routing Parameter Area domain, http://www.iana.org/arpa-dom/) is designated to be used exclusively for Internet-infrastructure purposes
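A rough sketch of the three-way classification follows. `GENERIC_TLDS` is an illustrative, incomplete set assembled from the domains named in these slides, and treating every two-letter label as a country code is a simplifying assumption (the code does not check the label against ISO 3166). The helper name `classify_tld` is made up for this example.

```python
# Illustrative subset only: the generic domains mentioned in these slides.
GENERIC_TLDS = {
    "aero", "biz", "com", "coop", "edu", "gov",
    "info", "museum", "name", "net", "org", "pro",
}


def classify_tld(hostname):
    """Return (tld, kind) where kind is one of the three types, or 'unknown'."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld == "arpa":
        kind = "infrastructure"
    elif len(tld) == 2 and tld.isalpha():
        kind = "country-code"   # assumption: two letters means a country code
    elif tld in GENERIC_TLDS:
        kind = "generic"
    else:
        kind = "unknown"
    return tld, kind


for name in ("www.ics.uci.edu", "example.co.uk", "1.0.0.127.in-addr.arpa"):
    print(name, classify_tld(name))
```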

    Registrars

    Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another

    InterNIC: http://internic.net

    Registrars Directory: http://www.internic.net/regist.html


    Server Log Files

    Server Transfer Log: transactions between a browser and server are logged

      IP address and the time of the request

      Method of the request (GET, HEAD, POST)

      Status code, a response from the server

      Size in bytes of the transaction

    Referrer Log: where the request originated

    Agent Log: the browser software making the request (which may be a spider)

    Error Log: requests that resulted in errors (e.g., 404)
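The transfer-log fields listed above can be extracted with a regular expression. The slides do not name a log format, so this sketch assumes the Common Log Format written by many web servers; `SAMPLE_LINE` is an invented entry and `parse_transfer_line` is a made-up helper name.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Invented example entry, not taken from a real server.
SAMPLE_LINE = '192.0.2.10 - - [03/Apr/2018:10:15:32 -0700] "GET /index.html HTTP/1.1" 200 5120'


def parse_transfer_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


print(parse_transfer_line(SAMPLE_LINE))
```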


    Server Log Analysis

    Most and least visited web pages

    Entry and exit pages

    Referrals from other sites or search engines

    What the searched keywords are

    How many clicks/page views a page received

    Error reports, such as broken links
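A few of these reports can be sketched with `collections.Counter`. The entries in `ENTRIES` are invented, already-parsed records of the form (path, status, referrer), the host `search.example` is a placeholder, and the assumption that the search keywords travel in a query parameter named `q` is specific to this example.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Invented, already-parsed log entries: (path, status, referrer).
ENTRIES = [
    ("/index.html", 200, "-"),
    ("/courses.html", 200, "/index.html"),
    ("/index.html", 200, "http://search.example/?q=ics"),
    ("/old-page.html", 404, "/courses.html"),
    ("/index.html", 200, "-"),
]


def summarize(entries):
    """Return page-view counts, broken links, external referrers, and keywords."""
    page_views = Counter(path for path, status, _ in entries if status < 400)
    broken_links = sorted({path for path, status, _ in entries if status == 404})
    external = Counter(ref for _, _, ref in entries if ref.startswith("http"))
    keywords = Counter()
    for referrer in external.elements():
        # Assumption: the search terms are carried in a "q" query parameter.
        for term in parse_qs(urlsplit(referrer).query).get("q", []):
            keywords[term] += 1
    return page_views, broken_links, external, keywords


page_views, broken_links, external, keywords = summarize(ENTRIES)
print("most visited:", page_views.most_common(1))
print("broken links:", broken_links)
print("external referrers:", dict(external))
print("search keywords:", dict(keywords))
```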


    Search Engines

    According to a Pew Internet Project report (2002), search engines are the most popular way to locate information online
    (http://www.pewinternet.org/reports/pdfs/PIP_Search_Engine_Data.pdf)

    About 33 million U.S. Internet users query search engines on a typical day.

    More than 80% have used search engines

    Search engines are measured by coverage and recency