-
8/3/2019 ContentTypes2011-Urs
1/73
Shalini Urs Open Elective
2011
Content and Content Types
Shalini R. Urs
International School of Information Management
University of Mysore
Mysore
-
What is Document Genre?
Genre: the fusion of content, purpose and form of communicative actions.
Greek philosophers and orators recognized that the content of the message is not always its most important aspect; rather, the delivery, the context, and the rhetorical structure all play complementary roles in the subtle but profound act of one human being transferring information to another and thereby creating meaning from
-
Document Genre
A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form.
-
Content Types
It all began with MIME. Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email to support:
Text in character sets other than ASCII
Non-text attachments
Message bodies with multiple parts
Header information in non-ASCII character sets
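The four MIME extensions listed above can be sketched with Python's standard `email` package. This is only an illustration: the addresses, subject, body text and attachment bytes are made-up values, not taken from the slides.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a small MIME message that touches each extension in the list."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"      # hypothetical address
    msg["To"] = "receiver@example.com"      # hypothetical address
    msg["Subject"] = "Grüße"                # header text outside ASCII
    # Body text in a character set other than ASCII.
    msg.set_content("ನಮಸ್ಕಾರ", charset="utf-8")
    # A non-text attachment; adding it makes the body multipart.
    msg.add_attachment(
        b"\x00\x01\x02",
        maintype="application",
        subtype="octet-stream",
        filename="data.bin",
    )
    return msg


if __name__ == "__main__":
    message = build_message()
    print(message.get_content_type())
    for part in message.iter_parts():
        print(part.get_content_type(), part.get_content_charset())
```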
-
Media Types
MIME's use, however, has grown beyond describing the content of e-mail to describing content types in general, including on the web. A media type is composed of two required parts, a type and a subtype, plus any number of optional parameters. For example, subtypes of the text type have an optional charset parameter that can be included to indicate the character encoding, and subtypes of
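As a rough illustration of the type/subtype/parameter structure, the sketch below splits a media type string with Python's standard `email.message.Message` class. The helper name `parse_media_type` is hypothetical.

```python
from email.message import Message


def parse_media_type(value: str):
    """Return (type, subtype, parameters) for a media type string."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() yields the "type/subtype" entry first, then the parameters.
    params = dict(msg.get_params()[1:])
    return msg.get_content_maintype(), msg.get_content_subtype(), params


if __name__ == "__main__":
    print(parse_media_type("text/html; charset=UTF-8"))
    print(parse_media_type("image/png"))
```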
-
Character Encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers.
-
What is a code?
In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort. In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information
-
One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively. Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.
-
Character Encoding
A character encoding is a code that pairs a set of natural language characters (such as an alphabet or syllabary) with a set of something else, such as numbers or electrical pulses. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols as both integers and 7-bit binary versions of those integers.
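The ASCII pairing of a character with an integer and its 7-bit binary form can be shown in a few lines of Python. The function name `ascii_code` is only illustrative.

```python
def ascii_code(ch: str):
    """Return the ASCII integer and its 7-bit binary form for one character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is outside the 7-bit ASCII range")
    # format(..., "07b") pads the binary digits to exactly seven bits.
    return code, format(code, "07b")


if __name__ == "__main__":
    for ch in "A", "a", "0":
        print(ch, ascii_code(ch))
```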
-
What are UCS and ISO 10646?
The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: no information will be lost if you convert any text string to UCS and then back to the original encoding.
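The round-trip property can be checked informally in Python, whose `str` type holds Unicode code points. The helper `round_trips` and the sample text are illustrative; the codecs named are ones shipped with the standard library.

```python
def round_trips(raw: bytes, encoding: str) -> bool:
    """Decode legacy bytes to Unicode text, re-encode, and compare."""
    text = raw.decode(encoding)          # legacy encoding -> Unicode
    return text.encode(encoding) == raw  # Unicode -> legacy encoding


if __name__ == "__main__":
    latin1_bytes = bytes(range(256))             # every Latin-1 byte value
    koi8_bytes = "Привет".encode("koi8_r")       # sample Cyrillic text
    print(round_trips(latin1_bytes, "latin-1"))
    print(round_trips(koi8_bytes, "koi8_r"))
```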
-
UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and
-
For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only Cuneiform, Hieroglyphs and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth.
-
UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.
-
ISO 10646 formally defines a 31-bit character set. The most commonly used characters, including all those found in older encoding standards, have been placed in one of the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation.
-
Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. New characters are still being added on a continuous basis, but the existing characters will not be changed
-
UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use.
UCS also defines several methods for
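The "U+" notation, the official names, and the ASCII and Latin-1 correspondence can be explored with Python's standard `unicodedata` module. The helper `describe` below is a hypothetical name used only for this sketch.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX OFFICIAL NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    print(describe("A"))
    # Bytes 0x00-0x7F decoded as ASCII land on code points U+0000-U+007F.
    ascii_text = bytes(range(128)).decode("ascii")
    print(all(ord(c) == i for i, c in enumerate(ascii_text)))
    # Bytes 0x00-0xFF decoded as Latin-1 land on code points U+0000-U+00FF.
    latin1_text = bytes(range(256)).decode("latin-1")
    print(all(ord(c) == i for i, c in enumerate(latin1_text)))
```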
-
What are combining characters?
Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with
-
They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows one to add accents and other diacritical marks to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical
-
Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters
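The two representations of Ä can be compared directly in Python. The sketch uses `unicodedata.normalize` from the standard library; the function names are illustrative.

```python
import unicodedata

PRECOMPOSED = "\u00C4"        # LATIN CAPITAL LETTER A WITH DIAERESIS
COMBINING = "\u0041\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def compose(text: str) -> str:
    """Prefer precomposed characters where they exist (form NFC)."""
    return unicodedata.normalize("NFC", text)


def decompose(text: str) -> str:
    """Split precomposed characters into base + combining marks (form NFD)."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    print(PRECOMPOSED == COMBINING)             # different code point sequences
    print(compose(COMBINING) == PRECOMPOSED)    # same text after composition
    print(decompose(PRECOMPOSED) == COMBINING)  # and after decomposition
```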
-
What are UCS implementation levels?
Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:
Level 1
Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including
-
Level 2
Like level 1; however, in some scripts a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.
Level 3
All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.
-
What is Unicode?
In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium of (initially mostly US) manufacturers of multilingual software.
-
Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently; however, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible, and they closely coordinate any further extensions.
-
Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 corresponded to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, and Unicode 4.0 corresponds to the forthcoming third version of ISO 10646. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.
-
In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language to a single unique integer number, called a code point.
Despite technical problems and limitations, and criticism of them, Unicode has emerged as the dominant encoding scheme in the internationalization of software and in multilingual environments.
-
Microsoft Windows NT and its descendants Windows 2000 and Windows XP make extensive use of Unicode, more specifically UTF-16, as an internal representation of text. UNIX-like operating systems such as Linux, BSD and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text.
-
Unicode and ISO 10646: Differences
The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards.
The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-
-
The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the
-
What is UTF-8?
UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4, respectively.
-
Unless otherwise specified, the most significant byte comes first in these (big-endian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
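The zero-byte padding described above is easy to try out. Python has no codec named UCS-2 or UCS-4, so this sketch compares the result with `utf-16-be` and `utf-32-be`, which produce the same bytes for ASCII and Latin-1 input; that substitution is an assumption of the example, and the function name `pad_bytes` is hypothetical.

```python
def pad_bytes(data: bytes, width: int) -> bytes:
    """Prefix every input byte with (width - 1) zero bytes, big-endian."""
    out = bytearray()
    for b in data:
        out += bytes(width - 1)  # the leading 0x00 bytes
        out.append(b)            # the original ASCII / Latin-1 byte
    return bytes(out)


if __name__ == "__main__":
    source = b"Hi"
    print(pad_bytes(source, 2))  # UCS-2 style, two bytes per character
    print(pad_bytes(source, 4))  # UCS-4 style, four bytes per character
    print(pad_bytes(source, 2) == source.decode("latin-1").encode("utf-16-be"))
    print(pad_bytes(source, 4) == source.decode("latin-1").encode("utf-32-be"))
```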
-
Encoding Forms
Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.
The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit).
-
All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.
-
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable-length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
-
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact, and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
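A quick way to see the size trade-offs among the three encoding forms is to count the bytes each one needs for a few sample characters. The sample characters and the helper name `encoded_sizes` are choices made for this sketch; the big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


if __name__ == "__main__":
    for ch in ("A", "é", "ಕ", "😀"):
        print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```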
-
Defining Elements of Text
Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.
-
The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".
-
To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.
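As a rough sketch of what "left to the software" can mean, the sort key below treats "ll" as one text element that sorts after every plain "l". It is a toy illustration only, not a full implementation of traditional Spanish collation (it ignores "ch", accents and other rules), and the function name is hypothetical.

```python
def traditional_spanish_key(word: str) -> list:
    """Build a sort key in which 'll' is a single element placed after 'l'."""
    key = []
    lowered = word.lower()
    i = 0
    while i < len(lowered):
        if lowered.startswith("ll", i):
            key.append(("l", 1))         # the digraph, ranked after plain "l"
            i += 2
        else:
            key.append((lowered[i], 0))  # an ordinary single-letter element
            i += 1
    return key


if __name__ == "__main__":
    words = ["luz", "llama", "lobo"]
    print(sorted(words))                               # code point order
    print(sorted(words, key=traditional_spanish_key))  # "ll" after "l"
```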
-
Text Processing
Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054.
-
The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.
-
The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input as it is being entered, and display misspellings with a wavy underline. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.
-
Interpreting Characters and Rendering Glyphs
The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper, called a glyph, is a visual representation of the character.
-
The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or style of on-screen characters.
-
Character Sequences
Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences, made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â".
-
The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored isn't important if the marks don't interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive non-spacing characters are applied to a base character, and when the order is significant.
-
Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".
-
Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters.
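The decomposition step can be used as a simple sort key: decompose each word, drop the non-spacing marks, and compare the base letters. This is a minimal sketch that assumes a language where the marks do not affect alphabetical order; real collation is more involved, and the helper name `base_letters` is hypothetical.

```python
import unicodedata


def base_letters(text: str) -> str:
    """Decompose text (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    names = ["Zug", "Über", "Apfel"]
    print(sorted(names))                    # raw code point order
    print(sorted(names, key=base_letters))  # ordered by base letters
```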
-
The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:
-
Universal repertoire
Logical order
Efficiency
Unification
Characters, not glyphs
Dynamic composition
Semantics
Equivalent Sequence
Plain Text
Convertibility
-
The character sets of many existing international, national and corporate standards are incorporated within the Unicode Standard. For example, its first 256 characters are taken from the widely used Latin-1 character set.
-
Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code for each ideograph that is common to more than one of these languages. This is instead of providing a separate code for the ideograph each time it appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)
-
The Unicode Standard specifies an algorithm for the presentation of text with bidirectional behavior, for example, Arabic and English. Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts, Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard.
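The directional property that the bidirectional algorithm relies on can be inspected with `unicodedata.bidirectional`. The sketch below only looks up each character's class in stored (logical) order; it does not implement the bidirectional algorithm itself, and the sample string is made up.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    # Latin "ab", a space, then Hebrew alef and bet, stored in typing order.
    mixed = "ab \u05D0\u05D1"
    print(bidi_classes(mixed))
    # U+202B is one of the characters that signal a change of direction.
    print(unicodedata.name("\u202B"), unicodedata.bidirectional("\u202B"))
```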
-
Assigning Character Codes
A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.
-
Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
-
The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order (alphabetic order, for example), the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code points, while the CJK code blocks contain many thousands of code points.
-
Codespace
Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts, followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul.
-
A range of code points on the BMP and two very large ranges in the supplementary planes are reserved as private use areas. These code points have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space.
-
A set of page-layout programs may use the same code points as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code points, and reserves them as user space, promising never to assign them meaning in the future.
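Private use code points can be recognized programmatically: in Python's `unicodedata` module they carry the general category "Co" and have no character name. The helper name `is_private_use` is an illustrative choice.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character lies in a private use area (category 'Co')."""
    return unicodedata.category(ch) == "Co"


if __name__ == "__main__":
    for cp in (0x0041, 0xE000, 0xF8FF, 0xF0000, 0x100000):
        ch = chr(cp)
        # Private use code points have no standard name, so pass a default.
        print(f"U+{cp:04X}", is_private_use(ch), unicodedata.name(ch, "<no name>"))
```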
-
Conformance to the Unicode Standard
The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. A conforming implementation has the following characteristics, as a minimum requirement:
characters are from the common repertoire;
characters are encoded according to one of the encoding forms;
characters are interpreted with Unicode semantics;
unassigned codes are not used; and,
unknown characters are not corrupted.
-
Stability
The Unicode Standard has a lot of room to grow, and there are a considerable number of scripts that will be encoded in upcoming versions. This process is strictly additive; in other words, while characters may be added or new character properties may be defined, no characters will be removed or reinterpreted in incompatible ways.
-
These stability guarantees make it possible to encode data in Unicode and expect that future implementations that conform to a later version of the Unicode Standard will be able to interpret it in the same way as implementations conforming to The Unicode Standard, Version 3.2.
-
The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them.
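How UTF-16 uses the surrogate range can be sketched with a little arithmetic: a code point above U+FFFF is reduced by 0x10000 and its 20 remaining bits are split between a high and a low surrogate. The function name `to_surrogates` is hypothetical; the result is checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point: int):
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(f"{high:04X} {low:04X}")
    encoded = chr(0x1F600).encode("utf-16-be")
    print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```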
-
Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers.
Unicode characters can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).
-
The Unicode Consortium has as its ambitious goal the eventual replacement of existing character encoding schemes with Unicode, as many of the existing schemes are limited in size and scope, and are incompatible with multilingual environments. Its success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming
-
Other terms like character encoding, character set (charset), and sometimes character map or code page are used almost interchangeably, but these terms now have related but distinct meanings.
Common examples of character encoding systems include Morse code, the Baudot code, the American
-
Other codes
Morse code was introduced in the 1840s and is used to encode each letter of the Latin alphabet and each Hindu-Arabic numeral as a series of long and short presses of a telegraph key. Representations of characters encoded using Morse code varied in length.
The Baudot code was created by Émile Baudot in 1870, patented in
-
ASCII and other codes
ASCII was introduced in 1963 and is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers.
IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated EBCDIC) is an 8-bit encoding scheme developed in 1963.
-
Why UNICODE?
The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding.
-
Encoding
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one.
Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers.
No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
-
Encoding
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character.
Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
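Both kinds of conflict can be reproduced with legacy codecs from Python's standard library. The byte value, the sample letter and the choice of codecs below are picked for illustration only.

```python
def decode_with(raw: bytes, encodings) -> dict:
    """Interpret the same bytes under several legacy encodings."""
    return {name: raw.decode(name) for name in encodings}


def encode_with(text: str, encodings) -> dict:
    """Encode the same text under several legacy encodings."""
    return {name: text.encode(name) for name in encodings}


if __name__ == "__main__":
    # Same number, different characters.
    print(decode_with(b"\xe4", ["latin-1", "cp1251", "iso8859_7"]))
    # Same character, different numbers.
    print(encode_with("д", ["cp1251", "iso8859_5"]))
```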
-
Encoding
In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort.
-
Encoding
In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information understandable by a
-
Encoding
One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively.
-
Encoding
Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.