contenttypes2011-urs



Shalini Urs Open Elective

2011

Content and Content Types

Shalini R. Urs

International School of Information Management

University of Mysore

Mysore



What is Document Genre?

Genre - the fusion of content, purpose and form of communicative actions

Greek philosophers and orators recognized that the content of the message is not always its most important aspect; rather, the delivery, the context, and the rhetorical structure all play complementary roles in the subtle but profound act of one human being transferring information to another and thereby creating meaning from



Document Genre

A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form




Content Types

It all began with MIME. Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email to support:

Text in character sets other than ASCII

Non-text attachments

Message bodies with multiple parts

Header information in non-ASCII character sets
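As a rough illustration of these four extensions, the sketch below uses Python's standard `email` package to build one message with a non-ASCII header, a UTF-8 text body, and a binary attachment. The addresses, subject, body text, and attachment bytes are invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"    # hypothetical address
msg["To"] = "reader@example.com"      # hypothetical address
msg["Subject"] = "Grüße"              # header text outside ASCII

# Text body in a character set other than ASCII
msg.set_content("Grüße aus Mysore", charset="utf-8")

# Non-text attachment; adding it turns the message into a multipart body
msg.add_attachment(b"\x00\x01\x02", maintype="application",
                   subtype="octet-stream", filename="data.bin")

part_types = [part.get_content_type() for part in msg.iter_parts()]
print(msg.get_content_type())   # multipart/mixed
print(part_types)               # ['text/plain', 'application/octet-stream']
```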




Media Types

MIME's use, however, has grown beyond describing the content of e-mail to describe content type in general, including for the web.

A media type is composed of at least two parts: a type and a subtype, followed by zero or more optional parameters. For example, subtypes of text type have an optional charset parameter that can be included to indicate the character encoding, and subtypes of
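A small sketch of how a media type breaks down into type, subtype, and parameters, using the header parser in Python's standard `email` package. The header value `text/html; charset=UTF-8` is just an example string.

```python
from email.message import Message

header = Message()
header["Content-Type"] = "text/html; charset=UTF-8"   # example value

media_type = header.get_content_type()      # type/subtype, lowercased
main_type = header.get_content_maintype()   # the type
subtype = header.get_content_subtype()      # the subtype
charset = header.get_param("charset")       # optional parameter

print(media_type, main_type, subtype, charset)
# text/html text html UTF-8
```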




Character Encoding

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers.




What is a code?

In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort. In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information




One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively. Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great




Character Encoding

A character encoding is a code that pairs a set of natural language characters (such as an alphabet or syllabary) with a set of something else, such as numbers or electrical pulses. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols as both integers and 7-bit binary versions of those integers
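The two examples can be sketched in a few lines of Python. The helper names `ascii_code` and `to_morse` are made up for this illustration, and the Morse table covers only the two letters needed for the sample word.

```python
def ascii_code(ch):
    """Return the ASCII integer for ch and its 7-bit binary form."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    return code, format(code, "07b")

# Tiny excerpt of the Morse table: dots are short presses, dashes long ones
MORSE = {"S": "...", "O": "---"}

def to_morse(text):
    """Encode text letter by letter, separating letters with a space."""
    return " ".join(MORSE[ch] for ch in text)

print(ascii_code("A"))    # (65, '1000001')
print(to_morse("SOS"))    # ... --- ...
```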




What are UCS and ISO 10646?

The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. No information will be lost if you convert any text string to UCS and then back to the original encoding.
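A minimal sketch of the round-trip property, assuming ISO 8859-1 as the legacy encoding. Python's `str` type holds Unicode code points, so decoding converts to UCS and encoding converts back.

```python
# Bytes as they might sit in a legacy ISO 8859-1 file
legacy_bytes = "Größe".encode("iso-8859-1")

# Legacy encoding -> UCS (a Python str) -> legacy encoding again
text = legacy_bytes.decode("iso-8859-1")
round_tripped = text.encode("iso-8859-1")

# The same check for every possible byte value in ISO 8859-1
all_bytes_survive = all(
    bytes([b]).decode("iso-8859-1").encode("iso-8859-1") == bytes([b])
    for b in range(256)
)

print(round_tripped == legacy_bytes, all_bytes_survive)   # True True
```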




UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and




For scripts not yet covered, research on how to best encode them for computer usage is still going on and they will be added eventually. This includes not only Cuneiform, Hieroglyphs and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth.




UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.




ISO 10646 formally defines a 31-bit character set. The most commonly used characters, including all those found in older encoding standards, have been placed in one of the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation.




Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. New characters are still being added on a continuous basis, but the existing characters will not be changed
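The arithmetic behind "a bit over one million" can be checked directly. In this sketch the helper name `plane_of` is invented; it relies on each plane holding 65,536 code points, so the plane number is the code point shifted right by 16 bits.

```python
MAX_CODE_POINT = 0x10FFFF

# Number of positions in the 21-bit code space
codespace_size = MAX_CODE_POINT + 1   # 1,114,112 = 17 planes x 65,536

def plane_of(code_point):
    """Return the plane number (0 = BMP) that holds code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the UCS code space")
    return code_point >> 16

print(codespace_size)        # 1114112
print(plane_of(0x0041))      # 0  (Basic Multilingual Plane)
print(plane_of(0x1D11E))     # 1  (a musical symbol outside the BMP)
print(plane_of(0x10FFFF))    # 16 (last plane)
```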




UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+" as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV) and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use.

UCS also defines several methods for
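A quick check of the two identity ranges, sketched with Python's built-in codecs: decoding a single ASCII or Latin-1 byte should give the character whose code point has the same numeric value.

```python
# U+0000..U+007F: each ASCII byte value equals the UCS code point
ascii_matches = all(
    bytes([n]).decode("ascii") == chr(n) for n in range(0x80)
)

# U+0000..U+00FF: each ISO 8859-1 byte value equals the UCS code point
latin1_matches = all(
    bytes([n]).decode("iso-8859-1") == chr(n) for n in range(0x100)
)

# The conventional "U+" notation with at least four hex digits
label = f"U+{ord('A'):04X}"

print(ascii_matches, latin1_matches, label)   # True True U+0041
```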




What are combining characters?

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with




They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows one to add accents and other diacritical marks to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.




Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters
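The two representations of Ä can be compared with Python's standard `unicodedata` module. This is only a sketch; NFC and NFD are the Unicode normalization forms that compose and decompose such sequences.

```python
import unicodedata

precomposed = "\u00C4"     # LATIN CAPITAL LETTER A WITH DIAERESIS
combined = "A\u0308"       # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# The raw strings differ: one code point versus two
print(precomposed == combined, len(precomposed), len(combined))
# False 1 2

# Normalization converts between the two forms
composed_again = unicodedata.normalize("NFC", combined)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, decomposed == combined)
# True True

# A non-zero combining class marks U+0308 as a combining character
is_combining = unicodedata.combining("\u0308") != 0
```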




What are UCS implementation levels?

Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

Level 1

Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including




Level 2

Like level 1, however in some scripts, a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.

Level 3

All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.




What is Unicode?

In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software.




Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently; however, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions.




Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 corresponded to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, and Unicode 4.0 corresponds to the forthcoming third version of ISO 10646. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.




In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language to a single unique integer number, called a code point.

Despite technical problems and limitations, and criticism of them, Unicode has emerged as the dominant encoding scheme in internationalization of software and multilingual environments.




Microsoft Windows NT and its descendants Windows 2000 and Windows XP make extensive use of Unicode, more specifically UTF-16, as an internal representation of text. UNIX-like operating systems such as Linux, BSD and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text.





Unicode and ISO 10646: Differences

The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards.

The Unicode Standard defines in addition much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bi-




The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the




What is UTF-8?

UCS and Unicode are first of all just code tables that assign integer numbers to characters. There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4, respectively.




Unless otherwise specified, the most significant byte comes first in these (big-endian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
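The zero-byte insertion can be sketched as below. The function names are invented, and Python's `utf-16-be` and `utf-32-be` codecs are used only as a cross-check, on the assumption that for Latin-1 characters they produce the same bytes as big-endian UCS-2 and UCS-4.

```python
def latin1_to_ucs2_be(data: bytes) -> bytes:
    """Insert one 0x00 byte in front of every Latin-1 byte."""
    return b"".join(b"\x00" + bytes([b]) for b in data)

def latin1_to_ucs4_be(data: bytes) -> bytes:
    """Insert three 0x00 bytes in front of every Latin-1 byte."""
    return b"".join(b"\x00\x00\x00" + bytes([b]) for b in data)

sample = b"Hi\xe9"                     # "Hié" in ISO 8859-1
text = sample.decode("iso-8859-1")

print(latin1_to_ucs2_be(sample))       # b'\x00H\x00i\x00\xe9'
print(latin1_to_ucs2_be(sample) == text.encode("utf-16-be"))   # True
print(latin1_to_ucs4_be(sample) == text.encode("utf-32-be"))   # True
```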




Encoding Forms

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32 bits per code unit).




All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.




UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.




UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
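To make the size trade-off concrete, the sketch below counts the bytes each encoding form needs for four sample characters. The helper name `byte_lengths` and the choice of samples are illustrative only.

```python
def byte_lengths(ch):
    """Bytes needed for ch in UTF-8, UTF-16 and UTF-32 (big-endian, no BOM)."""
    return tuple(len(ch.encode(codec))
                 for codec in ("utf-8", "utf-16-be", "utf-32-be"))

samples = {
    "A": "A",                       # U+0041, ASCII
    "e-acute": "\u00E9",            # U+00E9, Latin-1 range
    "euro": "\u20AC",               # U+20AC, elsewhere in the BMP
    "g-clef": "\U0001D11E",         # U+1D11E, outside the BMP
}

for name, ch in samples.items():
    print(name, byte_lengths(ch))
# A (1, 2, 4)
# e-acute (2, 2, 4)
# euro (3, 2, 4)
# g-clef (4, 4, 4)
```

No character needs more than 4 bytes in any of the three forms, which matches the claim above.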




Defining Elements of Text

Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.




The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".




To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.
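A toy sketch of how software might treat "ll" as one text element when sorting, even though it is stored as two code elements. The alphabet list, the regular expression, and the name `traditional_key` are assumptions made for this example and are not a full implementation of Spanish collation.

```python
import re

# Traditional alphabet in which "ch" and "ll" count as single letters
ORDER = "a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z".split()
RANK = {element: i for i, element in enumerate(ORDER)}

# Match the two-letter elements first, then any single character
TOKEN = re.compile(r"ch|ll|.")

def traditional_key(word):
    """Split word into text elements and map each one to its rank."""
    return [RANK.get(tok, len(ORDER)) for tok in TOKEN.findall(word.lower())]

words = ["luz", "llama", "lobo"]
print(sorted(words))                        # ['llama', 'lobo', 'luz']
print(sorted(words, key=traditional_key))   # ['lobo', 'luz', 'llama']
```

With plain code point ordering "llama" sorts first; with "ll" as a single element that follows "l", it sorts after every word that begins with a single "l".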




Text Processing

Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054.




The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.




The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input as it is being entered, and display misspellings with a wavy underline. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.




Interpreting Characters and Rendering Glyphs

The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper -- called a glyph -- is a visual representation of the character.




The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or style of on-screen characters.




Character Sequences

Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences, made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â".




The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored isn't important if the marks don't interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive non-spacing characters are applied to a base character, and when the order is significant.
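One way to see when mark order matters is to compare sequences after canonical decomposition (NFD), which reorders marks only when they belong to different combining classes. This sketch uses Python's `unicodedata` module; the choice of marks is illustrative, and the helper name `canonical` is invented.

```python
import unicodedata

DOT_BELOW = "\u0323"    # attaches below the base letter
CIRCUMFLEX = "\u0302"   # attaches above the base letter
ACUTE = "\u0301"        # also attaches above the base letter

def canonical(text):
    """Canonical decomposition, which puts non-interacting marks in a fixed order."""
    return unicodedata.normalize("NFD", text)

# A mark below and a mark above do not interact: either order is equivalent
non_interacting_equal = (
    canonical("a" + CIRCUMFLEX + DOT_BELOW) == canonical("a" + DOT_BELOW + CIRCUMFLEX)
)

# Two marks above stack on each other: the stored order is significant
interacting_equal = (
    canonical("a" + ACUTE + CIRCUMFLEX) == canonical("a" + CIRCUMFLEX + ACUTE)
)

print(non_interacting_equal, interacting_equal)   # True False
```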




Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".




Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters.
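A minimal sketch of that idea: decompose each name, drop the non-spacing marks, and sort on what remains. The helper name `base_letters` and the sample names are made up, and real collation rules are more involved than this.

```python
import unicodedata

def base_letters(text):
    """Decompose text and drop the combining marks, keeping base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Zoe", "Ünal", "Uwe", "Adam"]

# Plain code point order puts the precomposed "Ü" (U+00DC) after "Z"
print(sorted(names))                     # ['Adam', 'Uwe', 'Zoe', 'Ünal']

# Sorting on the decomposed base letters treats "Ü" as a "U"
print(sorted(names, key=base_letters))   # ['Adam', 'Ünal', 'Uwe', 'Zoe']
```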




The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:




Universal repertoire

Logical order

Efficiency

Unification

Characters, not glyphs

Dynamic composition

Semantics

Equivalent Sequence

Plain Text

Convertibility




The character sets of many existing international, national and corporate standards are incorporated within the Unicode Standard. For example, its first 256 characters are taken from the widely used Latin-1 character set.




Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code for each ideograph that is common to more than one of these languages. This is instead of providing a separate code for the ideograph each time it appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)




The Unicode Standard specifies an algorithm for the presentation of text with bidirectional behavior, for example, Arabic and English. Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard.
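The sketch below inspects a short mixed-direction string with Python's `unicodedata` module. It only reads the per-character direction classes that the bidirectional algorithm works from; it does not run the algorithm itself. The sample string is invented.

```python
import unicodedata

# Latin letters, a space, then three Hebrew letters in the order typed
mixed = "abc \u05D0\u05D1\u05D2"

# Each character carries a bidirectional class used at display time
classes = [unicodedata.bidirectional(ch) for ch in mixed]
print(classes)   # ['L', 'L', 'L', 'WS', 'R', 'R', 'R']

# Logical order: the first Hebrew letter typed sits at the lower index,
# even though it is displayed at the right end of the Hebrew run
first_hebrew = unicodedata.name(mixed[4])
print(first_hebrew)   # HEBREW LETTER ALEF

# One of the characters that exist to influence direction
mark_name = unicodedata.name("\u200F")
print(mark_name)      # RIGHT-TO-LEFT MARK
```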




Assigning Character Codes

A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.




Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
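Code points and names can be looked up with Python's `unicodedata` module, as in this sketch. The helper name `describe` is invented for the example.

```python
import unicodedata

def describe(code_point):
    """Return the U+ label, the decimal value and the character name."""
    ch = chr(code_point)
    return f"U+{code_point:04X}", code_point, unicodedata.name(ch)

print(describe(0x0041))   # ('U+0041', 65, 'LATIN CAPITAL LETTER A')
print(describe(0x0A1B))   # ('U+0A1B', 2587, 'GURMUKHI LETTER CHA')

# Names are unique, so a name can be mapped back to its character
same_char = unicodedata.lookup("GURMUKHI LETTER CHA") == "\u0A1B"
```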




The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order -- alphabetic order, for example -- the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code points, while the CJK code blocks contain many thousands of code points.




Codespace

Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts; then followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul.




A range of code points on the BMP and two very large ranges in the supplementary planes are reserved as private use areas. These code points have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space.




A set of page-layout programs may use the same code points as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code points, and reserves them as user space, promising never to assign them meaning in the future.
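A small sketch for recognising private use code points. The BMP range U+E000 to U+F8FF is the one given earlier in these slides; the two supplementary ranges (planes 15 and 16) are filled in from general knowledge of Unicode rather than from the slides, and the helper name `is_private_use` is invented.

```python
import unicodedata

PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),        # BMP private use area
    (0xF0000, 0xFFFFD),      # plane 15 (not stated in the slides)
    (0x100000, 0x10FFFD),    # plane 16 (not stated in the slides)
]

def is_private_use(code_point):
    """True if code_point falls in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)

# Python reports the general category "Co" for private use characters
categories = {unicodedata.category(chr(low)) for low, _ in PRIVATE_USE_RANGES}

print(is_private_use(0xE000), is_private_use(ord("A")), categories)
# True False {'Co'}
```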




Conformance to the Unicode Standard

The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. A conforming implementation has the following characteristics, as a minimum requirement:

characters are from the common repertoire;

characters are encoded according to one of the encoding forms;

characters are interpreted with Unicode semantics;

unassigned codes are not used; and,

unknown characters are not corrupted.




Stability

The Unicode Standard has a lot of room to grow, and there are a considerable number of scripts that will be encoded in upcoming versions. This process is strictly additive: while characters may be added or new character properties may be defined, no characters will be removed or reinterpreted in incompatible ways.




These stability guarantees make it possible to encode data in Unicode and expect that future implementations that conform to a later version of the Unicode Standard will be able to interpret them in the same way as implementations conforming to The Unicode Standard, Version 3.2.




The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them.
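How UTF-16 uses the surrogate range can be sketched with a little arithmetic. The function name `to_surrogates` is invented; the constants 0xD800 and 0xDC00 are the starting points of the high and low surrogate ranges, and the result is cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a code point outside the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)         # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))                 # 0xd834 0xdd1e

# Cross-check against Python's UTF-16 encoder
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(pair_bytes == chr(0x1D11E).encode("utf-16-be"))   # True
```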




Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers.

Unicode characters can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).




The Unicode Consortium has as its ambitious goal the eventual replacement of existing character encoding schemes with Unicode, as many of the existing schemes are limited in size and scope, and are incompatible with multilingual environments. Its success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming




Other terms like character encoding, character set (charset), and sometimes character map or code page are often used almost interchangeably, although these terms now have related but distinct meanings.

Common examples of character encoding systems include Morse code, the Baudot code, the American




Other codes

Morse code was introduced in the 1840s and is used to encode each letter of the Latin alphabet and each Hindu-Arabic numeral as a series of long and short presses of a telegraph key. Representations of characters encoded using Morse code varied in length.

The Baudot code was created by Émile Baudot in 1870, patented in




ASCII and other codes

ASCII was introduced in 1963 and is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers.

IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated EBCDIC) is an 8-bit encoding scheme developed in 1963.




Why Unicode?

The limitations of such sets soon became apparent, and a number of ad-hoc methods were developed to extend them. The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding.




Encoding

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one.

Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers.

No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.




Encoding

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character.

Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
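Both kinds of conflict can be seen with Python's built-in codecs. The particular encodings chosen here (ISO 8859-1, Windows-1251, Windows-1252, ISO 8859-15) are only examples.

```python
# Same number, different characters: byte 0xE9 in two legacy encodings
raw = b"\xe9"
as_latin1 = raw.decode("iso-8859-1")     # 'é' (U+00E9)
as_cyrillic = raw.decode("cp1251")       # 'й' (U+0439)
print(as_latin1, as_cyrillic)

# Same character, different numbers: the euro sign in two encodings
euro = "\u20AC"
in_cp1252 = euro.encode("cp1252")        # b'\x80'
in_latin9 = euro.encode("iso-8859-15")   # b'\xa4'
print(in_cp1252, in_latin9)

# Reading bytes with the wrong encoding silently changes the text
print(raw.decode("cp1251") == raw.decode("iso-8859-1"))   # False
```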




Encoding

In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort.




Encoding

In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information understandable by a




Encoding

One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively.



Encoding

Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.
