unicode & w3c jataayu software c. kumar january 2007

Unicode & W3C
Jataayu Software

C. Kumar
January 2007

Agenda

About Jataayu
Unicode & Encoding
W3C Specification for multi-lingual authoring
Multilingual Web Address
Indian Web Sites: an Overview
W3C Activity

About Jataayu

Jataayu was formed with a clear focus on delivering solutions for wireless data services
Over 60% of the data traffic in Indian mobile networks for WAP, Mobile Web and MMS is handled by Jataayu products
The Mobile Device Solution Division focuses on wireless data applications such as WAP, MMS, SyncML, IMPS, Email, Web Browsing and Download
Active participant in OMA, W3C and MWI
Over 350 people strong, with offices in the UK, Singapore, Korea, Taiwan and the US; headquartered in India with a major development center in Bangalore

Localization - Internationalization

Localization (l10n)
  Adaptation of content to meet the language, cultural and other requirements of a specific target market

Internationalization (i18n)
  Design and development of content that enables easy localization for target audiences that vary in culture, region or language.
  The mission of the W3C i18n Activity is to ensure that the W3C's formats and protocols are usable worldwide, in all languages and in all writing systems.

Need for Unicode

Early character sets were based on 7 bits, which gave 2^7 (i.e. 128) possible characters
Adding the 8th bit gave a total of 256 possible characters, still not enough for all the European languages.
The code page mechanism helped a little by changing the upper cells (0xA0 to 0xFF), but was very complex.
Addressing the needs of other languages requires thousands of ideographic characters at a time.

Unicode & Encoding

Unicode, the universal character set, contains all the characters needed for writing the majority of living languages in use on computers.

Allows for simple display and storage of multilingual content

An encoding refers to the way that the characters of the character set (Unicode code points) are mapped to actual byte values.

Different encodings yield different byte sequences.


UTF-8 (Unicode Transformation Format)
  Variable-length 8-bit character encoding for Unicode
  Able to represent any universal character in the Unicode Standard
  Uses one to four bytes to encode a Unicode symbol
  Only one byte is needed to encode a US-ASCII character

UTF-16 (16-bit Unicode Transformation Format)
  Variable-length 16-bit character encoding for Unicode
  Uses a two- or four-byte sequence to encode a Unicode symbol
  Two bytes are required to encode a US-ASCII character

UCS-2 (2-byte Universal Character Set)
  Fixed-length encoding that always encodes a character into a single 16-bit value
  It can encode characters in the range 0x0000 to 0xFFFF


UCS-4 / UTF-32 (32-bit Unicode Transformation Format)
  Fixed-length 32-bit character encoding for Unicode
  Uses 4 bytes for every character, which makes it very space inefficient
  Little used in practice; UTF-8 and UTF-16 are the normal ways of encoding Unicode text
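The size trade-offs between the encoding forms described above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper name `encoded_sizes` and the two sample strings are assumptions chosen for illustration and do not come from the slides.

```python
# Compare how many bytes each encoding form needs for the same text.
# The helper name and the sample strings are illustrative assumptions.

def encoded_sizes(text):
    """Return the code point count and byte lengths of `text`.

    The big-endian codec variants are used so that no byte order mark
    (BOM) is prepended and the counts reflect the characters only.
    """
    return {
        "code points": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }

# A US-ASCII word and a Hindi word (six Devanagari code points,
# written with escapes so the exact code points are unambiguous).
ascii_word = "hello"
hindi_word = "\u0928\u092e\u0938\u094d\u0924\u0947"

for sample in (ascii_word, hindi_word):
    print(sample, encoded_sizes(sample))
```

For ASCII text UTF-8 is the most compact form, while for Devanagari text UTF-16 needs fewer bytes than UTF-8; UTF-32 is the largest in both cases.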

http://www.unicode.org/


Devanagari (0x0900 – 0x097F)
Bengali (0x0980 – 0x09FF)
Tamil (0x0B80 – 0x0BFF)
Kannada (0x0C80 – 0x0CFF)
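The block ranges listed above can be used as a simple lookup from a character to its script. This is only a sketch: the table `INDIC_BLOCKS` and the function `script_block` are hypothetical names, and only the four blocks from the slide are covered.

```python
from typing import Optional

# Unicode block ranges for the four Indic scripts listed on the slide.
INDIC_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Kannada": (0x0C80, 0x0CFF),
}

def script_block(ch: str) -> Optional[str]:
    """Return the name of the listed block containing `ch`, or None."""
    code_point = ord(ch)
    for name, (first, last) in INDIC_BLOCKS.items():
        if first <= code_point <= last:
            return name
    return None

# One sample letter from each block, plus an ASCII letter that matches none.
for ch in ("\u0915", "\u0995", "\u0ba4", "\u0c95", "A"):
    print(f"U+{ord(ch):04X}", script_block(ch))
```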

Code Point   U+0041        U+05D0        U+597D        U+233B4
UTF-8        41            D7 90         E5 A5 BD      F0 A3 8E B4
UTF-16       00 41         05 D0         59 7D         D8 4C DF B4
UTF-32       00 00 00 41   00 00 05 D0   00 00 59 7D   00 02 33 B4
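The byte values in the table above can be reproduced with Python's standard codecs. The sketch below assumes big-endian byte order (which is what the table shows) and uses a hypothetical helper, `hex_bytes`, to format the output.

```python
# Reproduce the code point table with Python's built-in codecs.
# Big-endian codecs are used so the bytes match the table and no BOM is added.

CODE_POINTS = [0x0041, 0x05D0, 0x597D, 0x233B4]

def hex_bytes(ch, encoding):
    """Encode one character and return its bytes as spaced upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))

for cp in CODE_POINTS:
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        "| UTF-8:", hex_bytes(ch, "utf-8"),
        "| UTF-16:", hex_bytes(ch, "utf-16-be"),
        "| UTF-32:", hex_bytes(ch, "utf-32-be"),
    )
```

U+233B4 lies outside the range 0x0000 to 0xFFFF, so UTF-16 represents it with a surrogate pair (D84C DFB4); UCS-2 cannot represent it at all.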


An alternate way to represent a character is to use an escape value (for example, &#x5D0; for א)
Not all documents have to be encoded as Unicode
But documents can only contain characters defined by the Unicode Standard
Any encoding can be used as long as it is properly declared and its character set is a subset of Unicode
A Unicode encoding also allows many more languages to be mixed on a single page
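The escape mechanism mentioned above corresponds to HTML/XML numeric character references. As a rough illustration, the sketch below converts between a character and its escape using only the Python standard library; the function name `to_ascii_with_escapes` is an assumption made for this example.

```python
import html

def to_ascii_with_escapes(text):
    """Replace every non-ASCII character with a decimal numeric character
    reference, so the text survives in a document whose encoding cannot
    represent those characters directly."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

hebrew_alef = "\u05d0"  # the character U+05D0

# Character -> escape (decimal form: 0x5D0 == 1488)
print(to_ascii_with_escapes(hebrew_alef))

# Escape -> character (hexadecimal and decimal forms both resolve to U+05D0)
print(html.unescape("&#x5D0;") == hebrew_alef)
print(html.unescape("&#1488;") == hebrew_alef)
```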

Other Encoding formats …

Shift_JIS (SJIS), a character encoding for the Japanese language
  Single-byte encoding for the lower-ASCII characters (0x00 – 0x7F)
  Double-byte encoding for Japanese characters such as kanji, introduced by a first byte in the upper range (above 0x7F)

GB2312, a character encoding for simplified Chinese characters
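To see how these legacy encodings differ from the Unicode encodings, the sketch below encodes a few characters with Python's built-in `shift_jis` and `gb2312` codecs. The helper `describe` and the sample characters are assumptions picked for illustration.

```python
def describe(ch, encoding):
    """Encode one character and report its bytes, its length and whether
    decoding the bytes gives the original character back."""
    data = ch.encode(encoding)
    return {
        "bytes": data,
        "length": len(data),
        "round_trip": data.decode(encoding) == ch,
    }

kanji = "\u65e5"  # a Japanese kanji
hanzi = "\u4e2d"  # a simplified Chinese character

print(describe("A", "shift_jis"))    # ASCII letter: a single byte
print(describe(kanji, "shift_jis"))  # kanji: two bytes
print(describe(hanzi, "gb2312"))     # Chinese character: two bytes

# The same character has a different byte sequence in UTF-8,
# which is why the encoding must be declared.
print(kanji.encode("shift_jis") != kanji.encode("utf-8"))
```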

W3C Specification - Encoding

W3C specification for multi-lingual authoring
  The encoding of the document needs to be declared, so that the application that consumes it can interpret it.

Meta tag
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

XML
  <?xml version="1.0" encoding="UTF-8"?>

The Content-Type header returned from the Web server should also contain the character encoding of the document

  Content-Type: text/html; charset=utf-8
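A consuming application has to pull the charset parameter out of the Content-Type header before it can decode the body. One possible way to do that, sketched here with the standard library's `email.message.Message` class (the function name `charset_from_content_type` is hypothetical), is:

```python
from email.message import Message

def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or `default` when the header
    does not declare one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)

print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html; Charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

Parameter names are matched case-insensitively, so `Charset=` and `charset=` are treated alike; a missing declaration is reported as `None` here so that the caller can decide on a fallback.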

W3C Specification - Language

The author needs to specify the language of the document (web page content)

The browser can choose an appropriate font using the lang attribute
Search engines can group or filter results based on the user's linguistic preferences (using meta)
Translation tools use it to recognize sections of text in a particular language

W3C Specification - Language

HTTP Content-Language header
  Content-Language: hi

Language attribute on the html tag
  <html lang="hi">
  <html xml:lang="hi">

Content language in a meta tag
  <meta http-equiv="Content-Language" content="hi" />

Language attribute on embedded content
  <div lang="en" xml:lang="en"> Some English Content </div>
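A tool that consumes these declarations (a search engine or translation tool, for instance) needs to read the language attributes from the markup. The sketch below shows one way this might be done with Python's `html.parser`; the class `LangCollector` and the function `declared_languages` are illustrative names, and the sample markup is assembled from the examples above.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for every element that declares
    a lang or xml:lang attribute."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Prefer lang; fall back to xml:lang when only that is present.
        lang = attr_map.get("lang") or attr_map.get("xml:lang")
        if lang:
            self.langs.append((tag, lang))

def declared_languages(markup):
    """Return the language declarations found in `markup`, in document order."""
    collector = LangCollector()
    collector.feed(markup)
    collector.close()
    return collector.langs

sample = (
    '<html lang="hi"><body>'
    '<div lang="en" xml:lang="en"> Some English Content </div>'
    '</body></html>'
)
print(declared_languages(sample))
```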

What value to use for lang?

IANA (Internet Assigned Numbers Authority)
  Provides a unique value for each language
  It is available as the Subtag value in the new IANA Language Subtag Registry

http://www.iana.org/assignments/language-subtag-registry
Hindi – hi, Kannada – kn, Tamil – ta


Bi-directional text

Additional information is required, beyond the language attribute, to support right-to-left scripts (such as Arabic, Hebrew, Urdu)
In HTML, the dir attribute is used to specify the direction of the text

The title says “<span dir="rtl">פעילות הבינאום, W3C</span>” in Hebrew.
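Whether a run of text needs dir="rtl" depends on the directional properties that Unicode assigns to its characters. As a rough sketch, the function below guesses a base direction from the first strongly directional character; this first-strong heuristic and the name `base_direction` are assumptions for illustration, not the full Unicode bidirectional algorithm.

```python
import unicodedata

def base_direction(text):
    """Guess the base direction of `text` from its first strongly
    directional character.

    Returns "rtl", "ltr", or None when no strong character is found
    (for example, a string of digits only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":          # left-to-right letter
            return "ltr"
    return None

print(base_direction("\u05e4\u05e2\u05d9\u05dc\u05d5\u05ea"))  # Hebrew word
print(base_direction("W3C"))
print(base_direction("2007"))
```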

Multilingual Web Address

A Web address is used to point to a resource on the Web

Web addresses are typically expressed using URIs (Uniform Resource Identifiers)
URIs are restricted to a small number of characters (upper- and lower-case letters of the English alphabet, numerals and a few symbols).

Users' expectations and use of the Internet have outgrown this restriction.

There is a growing need to use characters from any language in Web addresses.

Multilingual Web Address …

A Web address in your own language and alphabet is easier to create, memorize, interpret and relate to. (Ex: http://खोज.com)
Punycode is a way of representing Unicode code points using only ASCII characters. (Ex: http://xn--21bm4l.com)

http://xn--21bm4l.com/
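The conversion between the Devanagari host name and its Punycode form in the example above can be sketched with Python's built-in `idna` codec, which implements the older IDNA 2003 rules. The function names `to_ascii_host` and `to_unicode_host` are assumptions made for this example.

```python
def to_ascii_host(host):
    """Convert an internationalized host name to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")

def to_unicode_host(host):
    """Convert an ASCII (xn--) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")

# The Hindi label from the example, written with escapes so the exact
# code points (U+0916 U+094B U+091C) are unambiguous.
hindi_host = "\u0916\u094b\u091c.com"

print(to_ascii_host(hindi_host))
print(to_unicode_host("xn--21bm4l.com") == hindi_host)
```

Each label of the host name is converted separately, and labels that are already plain ASCII (such as `com`) pass through unchanged.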

Indian Content: an Overview

Most Indian websites are not using Unicode
Content is generated within the ASCII range and rendered with proprietary fonts that map the ASCII character set to Indian-language glyphs.
Visually it looks fine, but no other software will be able to interpret it
For each site, the user may need to download the proprietary fonts, which is not user friendly
Search engines will not be able to interpret the content as the author intended, because it does not follow a standard encoding.


Unicode & W3C Importance

The Web is also moving towards mobile

The W3C Mobile Web Initiative (MWI) defines best practices for mobile browsing

Required fonts cannot be installed at run time on a mobile device, as is done on the desktop
If Unicode characters are used, the required font may already be available on the device

Firefox

Firefox (http://www.getfirefox.com)
  Provides extensive support for Unicode and related fonts
  Provides add-ons for typing Indian languages in web pages on Linux. (Such tools are already available to Windows XP users through the language packs)

https://addons.mozilla.org/firefox/5484/author/

W3C i18n activity

Core Working Group
  Enable universal access to the World Wide Web by providing adequate support to other W3C Working Groups

GEO (Guidelines, Education & Outreach)
  Make the internationalization aspects of W3C technology better understood and more widely and consistently used

ITS (Internationalization Tag Set)
  Develop a set of elements and attributes that can be used with new DTDs/Schemas to support the internationalization and localization of documents

Thanks

[email protected]@jataayusoft.com
