contenttypes2011-urs

Upload: shruthi-manjunath

Post on 07-Apr-2018

  • 8/3/2019 ContentTypes2011-Urs

    1/73

    Shalini Urs Open Elective

    2011

    Content and Content Types

    Shalini R. Urs

    International School of Information Management

    University of Mysore

    Mysore

    What is Document Genre?

    Genre - the fusion of content, purpose, and form of communicative actions.

    Greek philosophers and orators recognized that the content of the message is not always its most important aspect; rather, the delivery, the context, and the rhetorical structure all play complementary roles in the subtle but profound act of one human being transferring information to another and thereby creating meaning from

    Document Genre

    A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form.

    Content Types

    It all began with MIME. Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email to support:

    Text in character sets other than ASCII

    Non-text attachments

    Message bodies with multiple parts

    Header information in non-ASCII character sets

    Media Types

    MIME's use, however, has grown beyond describing the content of e-mail to describing content type in general, including for the web.

    A media type is composed of at least two parts: a type, a subtype, and zero or more optional parameters. For example, subtypes of the text type have an optional charset parameter that can be included to indicate the character encoding, and subtypes of
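As a small illustration of the type/subtype/parameter structure, the sketch below uses Python's standard `email.message.Message` class to split a Content-Type value. The helper name `split_media_type` and the sample header values are assumptions made for this example, not part of the original slides.

```python
from email.message import Message


def split_media_type(header_value):
    """Return (type, subtype, charset or None) for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return (
        msg.get_content_maintype(),   # e.g. "text"
        msg.get_content_subtype(),    # e.g. "html"
        msg.get_param("charset"),     # optional parameter, None if absent
    )


print(split_media_type('text/html; charset="utf-8"'))
print(split_media_type("image/png"))
```

The second call shows that the charset parameter is optional: a media type without it is still complete.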

    Character Encoding

    A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets, or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or the storage of text in computers.

    What is a code?

    In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort.

    In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information

    One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively.

    Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great

    Character Encoding

    A character encoding is a code that pairs a set of natural language characters (such as an alphabet or syllabary) with a set of something else, such as numbers or electrical pulses. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols as both integers and 7-bit binary versions of those integers.
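The pairing of characters with integers and their 7-bit binary forms can be checked directly. This is a minimal sketch; the helper name `ascii_table_row` is made up for the example.

```python
def ascii_table_row(ch):
    """Return the character, its ASCII integer, and the 7-bit binary form."""
    code = ch.encode("ascii")[0]      # raises UnicodeEncodeError outside ASCII
    return ch, code, format(code, "07b")


for ch in ("A", "a", "7", "~"):
    print(ascii_table_row(ch))
```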

    What are UCS and ISO 10646?

    The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. No information will be lost if you convert any text string to UCS and then back to the original encoding.
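The round-trip property can be tried out with Python's built-in codecs, whose `str` type holds Unicode code points. The sketch below assumes three legacy encodings (Latin-1, KOI8-R, Shift_JIS) chosen only as examples; the helper `round_trips` is hypothetical.

```python
def round_trips(text, legacy_encoding):
    """Encode to a legacy character set, convert to Unicode, and convert back."""
    raw = text.encode(legacy_encoding)          # bytes in the legacy encoding
    as_unicode = raw.decode(legacy_encoding)    # legacy bytes -> Unicode string
    return as_unicode.encode(legacy_encoding) == raw


for sample, encoding in [("Grüße", "latin-1"), ("Привет", "koi8-r"), ("日本語", "shift_jis")]:
    print(encoding, round_trips(sample, encoding))
```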

    UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese, and Korean Han ideographs, as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and

    For scripts not yet covered, research on how to best encode them for computer usage is still going on, and they will be added eventually. This includes not only Cuneiform, Hieroglyphs, and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth.

    UCS also covers a large number of graphical, typographical, mathematical, and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.

    ISO 10646 formally defines a 31-bit character set. The most commonly used characters, including all those found in older encoding standards, have been placed in one of the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation.

    Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. New characters are still being added on a continuous basis, but the existing characters will not be changed
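The relation between a code point, its plane, and the size of the 21-bit code space is simple arithmetic. The following sketch uses a hypothetical helper `plane_of`; dividing by 0x10000 (a right shift by 16 bits) gives the plane number, with plane 0 being the BMP.

```python
import sys

MAX_CODE_POINT = 0x10FFFF   # upper end of the code space named above


def plane_of(code_point):
    """Return the plane number (0 = BMP) of a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the 0x000000..0x10FFFF code space")
    return code_point >> 16  # each plane holds 0x10000 code points


print(plane_of(ord("A")))      # a BMP character
print(plane_of(0x1F600))       # a character outside the BMP
print(MAX_CODE_POINT + 1)      # total number of code points: 17 planes * 65536
print(sys.maxunicode == MAX_CODE_POINT)
```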

    UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for
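The claim that the first 128 code points match US-ASCII and the first 256 match Latin-1 can be verified byte by byte. This is only a quick check using Python's built-in `ascii` and `latin-1` codecs; the variable names are invented for the example.

```python
# Decode every single byte value with the legacy codec and compare the result
# with the Unicode character that has the same number.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(0x80))
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(0x100))

print(ascii_matches, latin1_matches)
```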

    What are combining characters?

    Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with

    They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows one to add accents and other diacritical marks to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical

    Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters
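The two representations of Ä can be compared with Python's standard `unicodedata` module. This is a sketch only: the normalization forms NFC and NFD used here come from the Unicode Standard and are not discussed in the slides themselves.

```python
import unicodedata

precomposed = "\u00C4"        # LATIN CAPITAL LETTER A WITH DIAERESIS
combined = "A\u0308"          # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# The two strings hold different code points, so a plain comparison fails ...
print(precomposed == combined)

# ... but composing (NFC) or decomposing (NFD) makes them comparable.
print(unicodedata.normalize("NFC", combined) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combined)
```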

    What are UCS implementation levels?

    Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

    Level 1

    Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including

    Level 2

    Like level 1; however, in some scripts, a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.

    Level 3

    All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

    What is Unicode?

    In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium of (initially mostly US) manufacturers of multilingual software.

    Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently; however, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible, and they closely coordinate any further extensions.

    Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 corresponded to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, and Unicode 4.0 corresponds to the forthcoming third version of ISO 10646. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.

    In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language to a single unique integer number, called a code point.

    Despite technical problems and limitations, and criticism of them, Unicode has emerged as the dominant encoding scheme in the internationalization of software and in multilingual environments.

    Microsoft Windows NT and its descendants Windows 2000 and Windows XP make extensive use of Unicode, more specifically UTF-16, as an internal representation of text. UNIX-like operating systems such as Linux, BSD, and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text.

    Unicode and ISO 10646: Differences

    The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards.

    The Unicode Standard additionally defines much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-

    The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the

    What is UTF-8?

    UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4, respectively.

    Unless otherwise specified, the most significant byte comes first in these (big-endian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
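The zero-byte padding described above can be written in a few lines. Python has no codec literally named UCS-2 or UCS-4, so this sketch compares against `utf-16-be` and `utf-32-be`, which produce the same bytes for ASCII and Latin-1 characters; that substitution, and the helper name `pad_bytes`, are assumptions of the example.

```python
def pad_bytes(raw, width):
    """Prefix every input byte with (width - 1) zero bytes, big-endian style."""
    return b"".join(b"\x00" * (width - 1) + bytes([b]) for b in raw)


ascii_file = b"Hi"
print(pad_bytes(ascii_file, 2))   # 2 bytes per character
print(pad_bytes(ascii_file, 4))   # 4 bytes per character

# Cross-check against Python's big-endian UTF-16 and UTF-32 codecs.
print(pad_bytes(ascii_file, 2) == "Hi".encode("utf-16-be"))
print(pad_bytes(ascii_file, 4) == "Hi".encode("utf-32-be"))
```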

    Encoding Forms

    Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.

    The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word, or double-word oriented format (i.e., in 8, 16, or 32 bits per code unit).

    All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

    UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
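Both advantages can be seen in a short sketch. The sample strings are invented for illustration. The second half relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so byte-oriented code that only looks for ASCII delimiters keeps working.

```python
# 1. ASCII text has exactly the same bytes in UTF-8 as in ASCII.
ascii_text = "Content-Type: text/html"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)

# 2. Byte-oriented code that splits on an ASCII comma still works, because
#    bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.
record = "naïve,café,résumé".encode("utf-8")
fields = [field.decode("utf-8") for field in record.split(b",")]
print(fields)
```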

    UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

    UTF-32 is popular where memory space is no concern, but fixed-width, single-code-unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

    All three encoding forms need at most 4 bytes (32 bits) of data for each character.
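To make the trade-off concrete, the sketch below counts code units per character in each encoding form. The helper `code_units` and the three sample characters are chosen for this example only.

```python
def code_units(ch):
    """Count the code units one character needs in each encoding form."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


for ch in ("A", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

In every case the total is at most 4 bytes: up to four 8-bit units, up to two 16-bit units, or exactly one 32-bit unit.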

    Defining Elements of Text

    Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.

    The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".

    To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.
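How software might treat "ll" as one text element while the stored text keeps two separate code elements can be sketched with a custom sort key. This is a simplified illustration, not a full Spanish collation: the function name `traditional_spanish_key` and the word list are assumptions, and other traditional rules (such as "ch" or accented letters) are ignored.

```python
def traditional_spanish_key(word):
    """Build a sort key in which "ll" is one element sorting after plain "l"."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ll", i):
            key.append(("l", 1))      # the digraph: after "l", before "m"
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


words = ["luz", "llama", "lana", "mesa"]
print(sorted(words))                                # plain code point order
print(sorted(words, key=traditional_spanish_key))   # "ll" sorted as one element
```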

    Text Processing

    Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054.

    The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.

    The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input as it is being entered, and display misspellings with a wavy underline. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.

    Interpreting Characters and Rendering Glyphs

    The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE". The mark made on screen or paper -- called a glyph -- is a visual representation of the character.

    The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or style of on-screen characters.

    Character Sequences

    Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences, made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â".

    The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored isn't important if the marks don't interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive non-spacing characters are applied to a base character, and when the order is significant.
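One way to observe when mark order matters is through Unicode normalization, which the Standard defines in terms of canonical combining classes (exposed in Python as `unicodedata.combining`). In this sketch, marks with different classes (one below, one above) are treated as non-interacting and get reordered, while marks of the same class keep their stored order. The helper name and the chosen marks are assumptions for the example.

```python
import unicodedata

DOT_BELOW, DOT_ABOVE = "\u0323", "\u0307"   # attach on opposite sides
ACUTE, DIAERESIS = "\u0301", "\u0308"       # both attach above the base


def same_after_nfd(a, b):
    """Compare two sequences after canonical decomposition and reordering."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Different combining classes: storage order is not significant.
print(unicodedata.combining(DOT_BELOW), unicodedata.combining(DOT_ABOVE))
print(same_after_nfd("a" + DOT_BELOW + DOT_ABOVE, "a" + DOT_ABOVE + DOT_BELOW))

# Same combining class: the marks stack, so order is significant.
print(unicodedata.combining(ACUTE), unicodedata.combining(DIAERESIS))
print(same_after_nfd("a" + ACUTE + DIAERESIS, "a" + DIAERESIS + ACUTE))
```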

    Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin-1, which includes many precomposed characters such as "ü" and "ñ".

    Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters.
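A minimal sketch of decomposition used for alphabetizing, assuming a language where modifiers do not affect order: the hypothetical helper `strip_marks` decomposes each name with NFD and drops the non-spacing marks to build a sort key. Real collation is considerably more involved; this only illustrates the idea.

```python
import unicodedata


def strip_marks(text):
    """Decompose precomposed characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["Müller", "Mutter", "Muller", "Mühle"]
print(sorted(names))                   # code point order puts "ü" after "u"
print(sorted(names, key=strip_marks))  # "ü" handled as "u" with a modifier
```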

    The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:

    Universal repertoire

    Logical order

    Efficiency

    Unification

    Characters, not glyphs

    Dynamic composition

    Semantics

    Equivalent sequence

    Plain text

    Convertibility

    The character sets of many existing international, national, and corporate standards are incorporated within the Unicode Standard. For example, its first 256 characters are taken from the widely used Latin-1 character set.

    Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code for each ideograph that is common to more than one of these languages. This is instead of providing a separate code for the ideograph each time it appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)

    The Unicode Standard specifies an algorithm for the presentation of text with bidirectional behavior, for example, Arabic and English. Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts, Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard.
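Logical order and per-character directionality can be inspected with `unicodedata.bidirectional`. The sketch below does not implement the bidirectional algorithm itself; it only shows the stored order and the direction classes that a renderer would work from. The sample string (Latin letters followed by three Hebrew letters) is an assumption of the example.

```python
import unicodedata

# Stored in logical (typing) order: "abc", a space, then alef, bet, gimel.
text = "abc \u05d0\u05d1\u05d2"

classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)

# One of the characters that exist only to signal direction.
rlm = "\u200f"
print(unicodedata.name(rlm), unicodedata.bidirectional(rlm))
```

In the class list, "L" marks left-to-right characters, "R" right-to-left ones, and "WS" whitespace; the display engine reorders the right-to-left run only when drawing.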

    Assigning Character Codes

    A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.

    Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
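The code point notation and the official names can both be read back from Python's `unicodedata` module. This is a small sketch; the helper name `describe` is invented for the example.

```python
import unicodedata


def describe(code_point):
    """Return the U+ notation, decimal value, and official character name."""
    return f"U+{code_point:04X}", code_point, unicodedata.name(chr(code_point))


print(describe(0x0041))
print(describe(0x0A1B))

# A name identifies exactly one character, so the lookup also works in reverse.
print(unicodedata.lookup("LATIN CAPITAL LETTER A"))
```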

    The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order -- alphabetic order, for example -- the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code points, while the CJK code blocks contain many thousands of code points.

    Codespace

    Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic, and other scripts, followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul.

    A range of code points on the BMP and two very large ranges in the supplementary planes are reserved as private use areas. These code points have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space.
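A program can recognize private use code points either by range or by their general category. In the sketch below, the BMP range U+E000 to U+F8FF is the one given earlier in this deck; the two supplementary ranges (planes 15 and 16) are filled in from the Unicode Standard and are not stated in the slides. The names `PRIVATE_USE_RANGES` and `is_private_use` are hypothetical.

```python
import unicodedata

PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # private use area on the BMP
    (0xF0000, 0xFFFFD),     # plane 15 (supplementary private use area A)
    (0x100000, 0x10FFFD),   # plane 16 (supplementary private use area B)
]


def is_private_use(code_point):
    """Return True if the code point lies in one of the private use ranges."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


for cp in (0x0041, 0xE000, 0xF0000, 0x10FFFD):
    # Category "Co" is the general category for private use characters.
    print(f"U+{cp:04X}", is_private_use(cp), unicodedata.category(chr(cp)))
```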

    A set of page-layout programs may use the same code points as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code points, and reserves them as user space, promising never to assign them meaning in the future.

    Conformance to the Unicode Standard

    The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. A conforming implementation has the following characteristics, as a minimum requirement:

    characters are from the common repertoire;

    characters are encoded according to one of the encoding forms;

    characters are interpreted with Unicode semantics;

    unassigned codes are not used; and,

    unknown characters are not corrupted.

    Stability

    The Unicode Standard has a lot of room to grow, and there are a considerable number of scripts that will be encoded in upcoming versions. This process is strictly additive: while characters may be added or new character properties may be defined, no characters will be removed or reinterpreted in incompatible ways.

    These stability guarantees make it possible to encode data in Unicode and expect that future implementations that conform to a later version of the Unicode Standard will be able to interpret the data in the same way as implementations conforming to The Unicode Standard, Version 3.2.

    The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them.
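How UTF-16 uses the surrogate range can be worked through by hand. The arithmetic below follows the UTF-16 definition (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function name `to_surrogates` is made up for the example, and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point):
    """Split a code point outside the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in encoder.
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(chr(0x1F600).encode("utf-16-be") == expected)
```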

    Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers.

    Unicode characters can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).

    The Unicode Consortium has as its ambitious goal the eventual replacement of existing character encoding schemes with Unicode, as many of the existing schemes are limited in size and scope, and are incompatible with multilingual environments. Its success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming

    Other terms like character encoding, character set (charset), and sometimes character map or code page are used almost interchangeably, but these terms now have related but distinct meanings.

    Common examples of character encoding systems include Morse code, the Baudot code, the American

  • 8/3/2019 ContentTypes2011-Urs

    65/73

    Other codes

    Morse code was introduced in the 1840s and is used to encode each letter of the Latin alphabet and each Hindu-Arabic numeral as a series of long and short presses of a telegraph key. Representations of characters encoded using Morse code varied in length.

    The Baudot code was created by Émile Baudot in 1870 patented in

  • 8/3/2019 ContentTypes2011-Urs

    66/73

    ASCII and other codes

    ASCII was introduced in 1963 and is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers.

    IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated EBCDIC) is an 8-bit encoding scheme developed in 1963.
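
The difference between the two schemes can be seen by encoding the same characters with each. This is a sketch only: it uses Python's `cp037` codec as one representative EBCDIC code page (an assumption, since EBCDIC exists in many variants), and the variable names are illustrative.

```python
# Compare the integer assigned to the same character by ASCII and by one
# EBCDIC code page (cp037; other EBCDIC variants differ in places).
rows = []
for ch in ("A", "a", "0"):
    ascii_value = ch.encode("ascii")[0]
    ebcdic_value = ch.encode("cp037")[0]
    rows.append((ch, ascii_value, ebcdic_value))
    print(f"{ch!r}: ASCII 0x{ascii_value:02X}  EBCDIC 0x{ebcdic_value:02X}")

# Every ASCII value fits in 7 bits, that is, it is below 128.
print(all(ascii_value < 128 for _, ascii_value, _ in rows))
```

The same letter maps to a different number in each scheme, so bytes written under one cannot be read correctly under the other without conversion.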

  • 8/3/2019 ContentTypes2011-Urs

    67/73

    Why UNICODE?

    The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding.

  • 8/3/2019 ContentTypes2011-Urs

    68/73

    Encoding

    Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

    Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers.

    No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
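
The idea that each character is stored as a number can be checked directly in Python, where `ord` returns the Unicode code point of a character and `chr` goes the other way. The sample characters below are an arbitrary choice for illustration.

```python
# Look up the number (Unicode code point) assigned to each character.
samples = ["A", "é", "अ", "中"]
code_points = {ch: ord(ch) for ch in samples}

for ch, number in code_points.items():
    print(f"{ch!r} -> {number} (U+{number:04X})")

# chr() reverses the mapping: from the number back to the character.
round_trip = [chr(number) for number in code_points.values()]
print(round_trip == samples)
```

Letters from Latin, Devanagari, and Chinese scripts all draw their numbers from the same single assignment, which is what the older per-region encodings could not offer.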

  • 8/3/2019 ContentTypes2011-Urs

    69/73

    Encoding

    These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character.

    Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
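
Both kinds of conflict, and the resulting corruption, can be reproduced with Python's built-in legacy codecs. The particular encodings chosen here (Latin-1, Windows-1251, and code page 437) are assumptions picked to illustrate the point; the slide does not name specific encodings.

```python
# Same number, different characters: byte 0xE9 under two legacy encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # Western European
as_cp1251 = raw.decode("cp1251")    # Cyrillic
print(as_latin1, as_cp1251)

# Same character, different numbers: "é" under two legacy encodings.
latin1_bytes = "é".encode("latin-1")
cp437_bytes = "é".encode("cp437")
print(latin1_bytes.hex(), cp437_bytes.hex())

# Corruption: text written as UTF-8 but read back as Latin-1.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)
```

No error is raised in the last step. The bytes are valid in both encodings, so the damage is silent and only shows up when a person reads the text.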

  • 8/3/2019 ContentTypes2011-Urs

    70/73

    Encoding

    In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort.
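
A code in this sense can be modelled as a lookup table plus the rule for applying it. The sketch below uses a four-letter subset of International Morse code as the table; the names `MORSE`, `encode`, and `decode` are assumptions made for the example.

```python
# The "rule": each letter is replaced by a pattern of dots and dashes.
# Only a small subset of International Morse code is included here.
MORSE = {"S": "...", "O": "---", "E": ".", "T": "-"}
REVERSE = {pattern: letter for letter, pattern in MORSE.items()}


def encode(message):
    """Convert letters into Morse patterns separated by single spaces."""
    return " ".join(MORSE[letter] for letter in message)


def decode(signal):
    """Convert space-separated Morse patterns back into letters."""
    return "".join(REVERSE[pattern] for pattern in signal.split(" "))


signal = encode("SOS")
print(signal)
print(decode(signal))
```

The encoded form is a different kind of thing from the input (timed pulses rather than letters), yet the rule is reversible, so the original message can be recovered.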

  • 8/3/2019 ContentTypes2011-Urs

    71/73

    Encoding

    In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information understandable by a

  • 8/3/2019 ContentTypes2011-Urs

    72/73

    Encoding

    One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and, most important, less expensively.
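
The saving from a cable code can be sketched with a toy codebook. The entries below are hypothetical and invented purely for illustration; they are not taken from any real commercial code, and the names `CODEBOOK` and `compress` are likewise assumptions.

```python
# Hypothetical codebook: the short codewords are made up for this example.
CODEBOOK = {"ship": "sp", "invoice": "iv"}


def compress(message):
    """Replace each word found in the codebook with its shorter codeword."""
    return " ".join(CODEBOOK.get(word, word) for word in message.split(" "))


original = "ship invoice today"
shortened = compress(original)
print(shortened)
print(len(original), "->", len(shortened), "characters")
```

Words missing from the codebook pass through unchanged, so the receiver needs the same codebook to expand the message again.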

  • 8/3/2019 ContentTypes2011-Urs

    73/73

    Encoding

    Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.