Download - Internationalization Introduction & Overviewfiles.meetup.com/3189882/Java i18n- 25th Feb.pdf · Author: pardesha Subject: Coproate PowerPoint Template Keywords: Java, Java FY15, PowerPoint

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |

Internationalization Introduction & Overview

Arvind

27th Feb 2016

Oracle Confidential – Internal/Restricted/Highly Restricted


Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.



Agenda

1 Why should I care about Internationalization

2 History / Evolution

3 Various encodings

4 I18n & Java

5 References




😁,😂 , 😃 , 😄

✌, ☂, ♫, ♪

☠

☕, 🏥, 🏦, 🏨 , 🏊

Internationalization

First, why should I care?

• Mojibake: text is written in one encoding, but it may actually display as garbled characters when it is read in another

Source: http://www.w3.org/International/questions/qa-what-is-encoding#why


05/06/07

GB-English      5th June 2007

US-English      6th May 2007

Japanese        7th Jun 2005

or

902.300

(Germany)         902.300

(France)          902 300

(United States)   902,300
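The ambiguity above can be reproduced in a few lines of Java. This is only a sketch: the class name AmbiguousFormats and the three date patterns are illustrative assumptions that mirror the readings listed on the slide, not the patterns any particular JDK locale uses by default.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class AmbiguousFormats {

    // Parses the same text with an explicitly chosen day/month/year ordering.
    static LocalDate parse(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern, Locale.ROOT));
    }

    // Formats a whole number with the grouping separator of the given locale.
    static String group(long value, Locale locale) {
        return NumberFormat.getIntegerInstance(locale).format(value);
    }

    public static void main(String[] args) {
        String text = "05/06/07";
        System.out.println("GB-English reading: " + parse(text, "dd/MM/yy")); // 2007-06-05
        System.out.println("US-English reading: " + parse(text, "MM/dd/yy")); // 2007-05-06
        System.out.println("Japanese reading:   " + parse(text, "yy/MM/dd")); // 2005-06-07

        System.out.println("Germany:       " + group(902300, Locale.GERMANY)); // 902.300
        System.out.println("United States: " + group(902300, Locale.US));      // 902,300
    }
}
```

The French grouping separator is left out on purpose: it is a non-breaking space whose exact code point differs between JDK releases.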


Measurement confusion causes $125 Million loss

Source: http://edition.cnn.com/TECH/space/9909/30mars.metric.02/





Percentage of English speakers by country.

Source: https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population



• Why is it important

– Revenue generation

– Survival

– More adoption / Popularity of the software



Isthisavalidtextyouarelookingat

Character word sentence line


German            Swedish

01: Åkersberga    02: Alingsås

02: Alingsås      04: Oskarshamn

03: Äppelbo       07: Utting

04: Oskarshamn    06: Üttfeld

05: Östersund     08: Zwickau

06: Üttfeld       01: Åkersberga

07: Utting        03: Äppelbo

08: Zwickau       05: Östersund

The basic principle to remember is: the position of characters in the Unicode code charts does not specify their sort order. Sort order comes from collation rules (UCA, the Unicode Collation Algorithm).

Common words differ by region, too (lift vs. elevator, the start of the week).

Table 1 shows some examples of cases where sort order differs by language, usage, or another customization.

Language

Swedish: z < ö

German: ö < z
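A minimal sketch of the Swedish/German difference using java.text.Collator. The class name SortOrderByLanguage and the three sample town names are chosen for illustration, and the result assumes the JDK ships its usual Swedish collation rules (they live in the optional locale data module).

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SortOrderByLanguage {

    // Returns a sorted copy of the words using the collation rules of the locale.
    static String[] sorted(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00D6" is the letter O with diaeresis, written as an escape to keep the source ASCII-only.
        String[] towns = { "Zwickau", "\u00D6stersund", "Oskarshamn" };

        // Swedish treats the letter as a separate letter after z.
        System.out.println("Swedish: " + Arrays.toString(sorted(Locale.forLanguageTag("sv-SE"), towns)));
        // German treats it as a variant of o, so it sorts before z.
        System.out.println("German:  " + Arrays.toString(sorted(Locale.GERMANY, towns)));
    }
}
```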




What is Internationalization

• Internationalization (i18n) is the process of designing or extending a software application so that it can be easily adapted to various languages and regions without major engineering changes

• Localization (l10n) is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text

• Globalization: the combination of both i18n & l10n


• Once i18n is enabled, software can be localized quickly

• With the addition of localized data, the same executable can run worldwide

• Textual elements, such as status messages and the GUI component labels, are not hardcoded in the program

• Instead they are stored outside the source code and retrieved dynamically

• Support for new languages does not require recompilation

• Culturally-dependent data, such as dates and currencies, appear in formats that conform to the end user's region and language

• Items to be localized:

• Strings

• Date & time

• Currencies





Evolution

ASCII table. Source: http://www.asciitable.com/

Evolution

EBCDIC table. Source: http://www.lookuptables.com/ebcdic_scancodes.php


Evolution

Character Set / Repertoire

• A character set or repertoire is an unordered collection of characters that can be represented by numeric values.

• A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages). A repertoire becomes a coded character set when each character is assigned a particular number, called a code point. In the coded character set called ISO 8859-1 (also known as Latin1) the decimal code point value for the letter é is 233. However, in ISO 8859-5, the same code point represents the Cyrillic character щ.
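The é / щ example can be checked directly by decoding the single byte 233 with both character sets. This is a small sketch; the class name SameByteDifferentCharacter is made up for illustration, and it assumes the runtime provides the ISO-8859-5 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SameByteDifferentCharacter {

    // Decodes one byte value with the given charset.
    static String decode(int byteValue, Charset charset) {
        return new String(new byte[] { (byte) byteValue }, charset);
    }

    public static void main(String[] args) {
        String latin1 = decode(233, StandardCharsets.ISO_8859_1);
        String cyrillic = decode(233, Charset.forName("ISO-8859-5"));

        // Same byte, two different characters and two different Unicode code points.
        System.out.println(String.format("ISO-8859-1: %s (U+%04X)", latin1, latin1.codePointAt(0)));
        System.out.println(String.format("ISO-8859-5: %s (U+%04X)", cyrillic, cyrillic.codePointAt(0)));
    }
}
```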

Code Point

Code points are just non-negative integers in a certain range. They do not have an implicit binary representation or a width of 21 or 32 bits. Binary representation and unit widths are defined for encoding forms.

Character encoding scheme (Code Page in Windows)

• A character encoding scheme defines the representation of numeric values from one or more coded character sets containing symbols, letters, digits in bits and bytes

https://en.wikipedia.org/wiki/Windows_code_page

https://www.w3.org/International/questions/qa-what-is-encoding-data/233.png

https://www.w3.org/International/questions/qa-what-is-encoding-data/1097.png


Evolution

CJK / ISO-8859-X

• While 256 code points were enough for the English-speaking world, they were not enough for East Asia and other regions, which led to double-byte and regional encodings (DBCS: BIG5, SJIS; ISO-8859-x, ...)

Windows Code page

• ANSI (native apps using the Windows GUI) & OEM (console-based apps)

Information Exchange

• With the advent of the WWW and data being transferred from one system to another, text written in one code page or character encoding and read in another shows up as garbage (^&*$##@!@@), or needs a lot of conversion code. The world needed a common format to address this

• In 1990, there were parallel initiatives by two bodies: ISO, and a group of people from vendors (Apple, early Xerox folks, Sun Microsystems, Microsoft)


Source: http://www.unicode.org/charts/



Unicode



• Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language (this number is also referred to as a code point; as stated by http://www.unicode.org/)

• Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience


• The Unicode Standard contains much more information for implementers, covering in depth topics such as bitwise encoding, bidirectional (bidi) text, and normalization (for comparison, searching, ...)

Unicode

• Unicode was originally designed as a fixed-width 16-bit character encoding (UCS-2). However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been used on planet Earth

• The code space now allows 1,114,112 code points (U+0000 to U+10FFFF)

• Divided into 17 planes (the Basic Multilingual Plane plus 16 supplementary planes)

http://www.unicode.org/



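• The Unicode standard therefore has been extended to allow up to 1,112,064 characters. Those characters that go beyond the original 16-bit limit, i.e. outside the Basic Multilingual Plane (BMP), are called supplementary characters; the first supplementary plane is the Supplementary Multilingual Plane (SMP).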

• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned.

• Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters.




Encoding forms

Encoding forms that are available:

ASCII, EBCDIC, ISO-8859-1, SJIS

UCS-2 (LE, BE), UCS-4 (LE, BE)

Encoding formats in Unicode: UTF-16 (LE, BE), UTF-32 (LE, BE), UTF-8

Problem: When data is transferred from one computer to another and more than 1 byte represents a character, there could be trouble. With code units of 2 or more bytes we have to think about the order in which the bytes are transmitted and stored

BOM: Byte Order Mark, indicates the endianness of the data and is placed at the beginning of a data file or stream

List of BOMs:

– UTF-16BE: FE FF

– UTF-16LE: FF FE

– UTF-32BE: 00 00 FE FF

– UTF-32LE: FF FE 00 00

– UTF-8: EF BB BF (optional; UTF-8 has no byte-order issue, so this only marks the data as UTF-8)
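As a quick check of the UTF-16 entries, the sketch below encodes the letter A with Java's built-in charsets and prints the bytes in hex. The class name ByteOrderMarkDemo and the hex helper are illustrative; the behaviour relied on is that Java's generic UTF-16 encoder writes a big-endian BOM, while the explicit UTF-16BE and UTF-16LE encoders write none.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderMarkDemo {

    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-16:   " + hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println("UTF-16BE: " + hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println("UTF-16LE: " + hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```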


Encoding forms


• UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard

• UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to represent supplementary characters

• Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters



Encoding Forms

Source: http://www.w3.org/


Unicode Planes: BMP, SMP

Why surrogate pairs:

In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

UTF-16 reserves these code points:

High surrogates: U+D800 - U+DBFF

Low surrogates: U+DC00 - U+DFFF

Surrogate pairs

Source: https://en.wikipedia.org/wiki/UTF-16

Example of how surrogate pairs work: U+10437 (𐐷)

1) 0x10437 -0x10000 = 0x00437 ( 0000 0000 0100 0011 0111 )

2) Split high 10-bit value & low 10-bit value: 0000000001 and 0000110111

3) 0xD800 + 0x0001 = 0xD801

4) 0xDC00 + 0x0037 = 0xDC37

Representation of U+10437 (𐐷) - 0xD801 0xDC37 or \ud801\udc37
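The four steps above translate almost line by line into Java. This is a sketch for illustration (the class and method names are made up); in real code Character.toChars does the same job, and the main method compares the two.

```java
import java.util.Arrays;

public class SurrogatePairDemo {

    // Splits a supplementary code point into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // step 1: 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // steps 2 and 3: top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // steps 2 and 4: bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10437);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D801 DC37

        // The JDK's own conversion gives the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x10437))); // true
    }
}
```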



UTF-8

• UTF-8 story

• Source: https://en.wikipedia.org/wiki/UTF-8

• UTF-8 is the dominant character encoding for the World Wide Web, accounting for 85.1% of all Web pages in September 2015 (with the most popular East Asian encoding, GB 2312, at 1.0%). The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.

• UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard). Code points with lower numerical values




UTF-8

Bits = bits of code point; Bytes = bytes in sequence

Bits  First code point  Last code point  Bytes  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6

7     U+0000            U+007F           1      0xxxxxxx

11    U+0080            U+07FF           2      110xxxxx  10xxxxxx

16    U+0800            U+FFFF           3      1110xxxx  10xxxxxx  10xxxxxx

21    U+10000           U+1FFFFF         4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

26    U+200000          U+3FFFFFF        5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

31    U+4000000         U+7FFFFFFF       6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx


UTF-8

0x00..0x7F → 1 byte

0x80..0x7FF → 2 bytes

0x800..0xD7FF, 0xE000..0xFFFF → 3 bytes

0x10000 .. 0x10FFFF → 4 bytes
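The ranges above can be confirmed by encoding one code point from each range and counting the bytes. A small sketch; the class name Utf8Length and the sample code points are picked for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {

    // Number of bytes UTF-8 needs for a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x0041));  // 1  (LATIN CAPITAL LETTER A)
        System.out.println(utf8Length(0x00E9));  // 2  (e with acute)
        System.out.println(utf8Length(0x20AC));  // 3  (EURO SIGN)
        System.out.println(utf8Length(0x10437)); // 4  (supplementary character)
    }
}
```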


UTF-8 & UTF-16

1. UTF-8 and UTF-16 are both used for encoding characters

2. UTF-8 uses a minimum of one byte to encode a character, while UTF-16 uses a minimum of two

3. A UTF-8 encoded file tends to be smaller than a UTF-16 encoded file (When using ASCII only characters, a UTF-16 encoded file would be roughly twice as big as the same file encoded with UTF-8)

4. UTF-8 is compatible with ASCII while UTF-16 is incompatible with ASCII

5. UTF-8 is byte oriented while UTF-16 is not

6. UTF-8 is better in recovering from errors compared to UTF-16


UTF-8 & UTF-16

• UTF-8: no null bytes (other than for U+0000 itself), which allows the use of null-terminated strings and gives a great deal of backwards compatibility

• UTF-16: most common characters (Latin, Cyrillic, most Chinese, most Japanese) can be represented with 2 bytes; note that the PRC made support for some code points outside the BMP mandatory. Unless rarer characters are needed (for example in names), the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.




Java history

JDK 1.3

• Sun took over the Taligent i18n classes maintenance responsibility


Java history linked to i18n

• Java’s char data type holds 16-bit unsigned integers that represent Unicode code points in the Basic Multilingual Plane, encoded with UTF-16; its default value is the null code point ('\u0000')

• ICU was taken from JDK 1.3 and has been released independently since

• The JDK also leverages code from ICU4J, and we thank the ICU committee (IBM) for this

• Unicode is an evolving standard, and the Java platform has tracked the standard so that it now supports Unicode 8.0 in JDK 9


Java history linked to i18n

• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters (JSR 204: Supplementary characters support)

• Use the primitive type int to represent code points in low-level APIs

• Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs.

• Provide APIs to easily convert between various char and code point-based representations.

With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll see that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.
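The difference between char (UTF-16 code unit) and int (code point) shows up as soon as a string contains a supplementary character. A short sketch with standard java.lang APIs; the class name CodePointDemo and the sample string are illustrative, and String.codePoints() assumes Java 8 or later.

```java
public class CodePointDemo {

    // Returns the code points of a string as int values, one per character.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a', then U+10437 written as its surrogate pair, then 'b'.
        String s = "a\uD801\uDC37b";

        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(Integer.toHexString(s.codePointAt(1)));       // 10437
        System.out.println(Character.isSupplementaryCodePoint(0x10437)); // true
        System.out.println(Character.charCount(0x10437));                // 2

        for (int codePoint : codePoints(s)) {
            System.out.println(String.format("U+%04X", codePoint));
        }
    }
}
```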


Unicode & i18n

New Unicode release ≠ new data addition only

– Corrections to existing characters

– Technical reports may be revised/added, too

● Before Unicode 7

– Releases were irregular. Advance notices were unreliable

– Difficult to plan additions to the JDK

● Since Unicode 7

– A new major version is released every year in June

– Very helpful, easy to plan


I18n & Unicode versions

Java version                  Supported Unicode version

Prior to JDK 1.1              1.1.5

JDK 1.1                       2.0

J2SE 1.2                      2.0

J2SE 1.4                      3.2

JDK 5.0                       4.0

JDK 6                         4.0

JDK 7                         6.0

JDK 8                         6.2

JDK 9 (yet to be released)    7.0, 8.0




Locale & i18n

• Locale

• ID representing each cultural region

• It does not have any data

– A locale consists of:

• ISO 639-1 2-letter language code, e.g., “en”

• ISO 3166 2-letter country code, e.g., “US”

• Variant code (any form). JDK-supplied ones are:

• “NY”: Norwegian Nynorsk

• “TH”: Thai digits for Thai Gov.

• “EURO”: Designates “Euro” (obsolete)
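A minimal sketch of building a Locale from its language and country codes and reading the parts back. The class name LocaleDemo and the helper method are illustrative; Locale.Builder is used here as one of several ways to create a Locale (it requires Java 7 or later).

```java
import java.util.Locale;

public class LocaleDemo {

    // Builds a locale identifier from a language code and a country code.
    static Locale of(String language, String country) {
        return new Locale.Builder().setLanguage(language).setRegion(country).build();
    }

    public static void main(String[] args) {
        Locale us = of("en", "US");

        System.out.println(us.getLanguage());    // en
        System.out.println(us.getCountry());     // US
        System.out.println(us);                  // en_US
        System.out.println(us.toLanguageTag());  // en-US

        // The Locale itself carries no data; display names come from locale data.
        System.out.println(us.getDisplayCountry(Locale.ENGLISH)); // United States
    }
}
```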


CLDR

The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in the XML format for use in computer applications.

CLDR standardizes the commonly used locale data. Among the types of data that CLDR includes are the following:

• Translations for language names, territory and country names.

• Translations for currency names, including singular/plural modifications.

• Translations for weekday, month, era, period of day, in full and abbreviated forms.

• Translations for timezones and example cities (or similar) for timezones.

• Translations for calendar fields.

• Patterns for formatting/parsing dates or times of day.

• Exemplar sets of characters used for writing the language.

• Patterns for formatting/parsing numbers.

• Rules for language-adapted collation.



Locale Sensitive Services


• java.text.BreakIterator, *.Collator, *.DateFormat, *.DateFormatSymbols, *.DecimalFormatSymbols, *.NumberFormat, *.Bidi

• java.util.Calendar, *.Currency, *.Locale, *.TimeZone

Support for locale-sensitive behavior in the java.util and java.text packages is entirely platform independent; the only platform-dependent functionality is the setting of the initial default locale and the initial default time zone based on the host operating system's locale and time zone

The Java platform does not require you to use the same Locale throughout your program. If you wish, you can assign a different Locale to every locale-sensitive object in your program. This flexibility allows you to develop multilingual applications, which can display information in multiple languages
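The point about using a different Locale per object can be sketched as follows: each call passes its own Locale instead of relying on the default. The class name PerObjectLocale and the two helper methods are illustrative, chosen because their results do not depend on the host system.

```java
import java.text.DateFormatSymbols;
import java.util.Currency;
import java.util.Locale;

public class PerObjectLocale {

    // Name of the first month of the year in the language of the locale.
    static String firstMonth(Locale locale) {
        return DateFormatSymbols.getInstance(locale).getMonths()[0];
    }

    // ISO 4217 code of the currency used in the country of the locale.
    static String currencyCode(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    public static void main(String[] args) {
        // Three locales used side by side in the same program.
        for (Locale locale : new Locale[] { Locale.US, Locale.FRANCE, Locale.GERMANY }) {
            System.out.println(locale + ": " + firstMonth(locale) + ", " + currencyCode(locale));
        }
        System.out.println(Locale.JAPAN + ": " + currencyCode(Locale.JAPAN)); // ja_JP: JPY
    }
}
```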






BreakIterator

• Used in text editors and multiple applications with text processing

• Supports four types of text-breaking. Example: “If you can dream it, you can do it. Walt Disney”

• Character: “/I/f/ /y/o/u/ /c/a/n/ /d/r/e/a/m/ /i/t/,/ /…/ /W/a/l/t/ /D/i/s/n/e/y/”

• Word: “/If/ /you/ /can/ /dream/ /it/,/ /you/ /can/ /do/ /it/./ /Walt/ /Disney/”

• Sentence: “/If you can dream it, you can do it. /Walt Disney/”

• Line: “/If /you /can /dream /it, /you /can /do /it. /Walt /Disney/”

• Two implementations exist :

• Rule-based

• Specify rules using General_Category in the UCD

• Dictionary-based

• Helpful for languages which don't use a space between words

• We have only one dictionary file for word- & line-breaking of Thai language
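A short sketch of word and sentence breaking on the quote used above. The class name BreakIteratorDemo and its helpers are illustrative; the word helper keeps only segments that start with a letter or digit, since the word iterator also reports spaces and punctuation as segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorDemo {

    // Cuts the text at every boundary reported by the iterator.
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // Word segments only, without the whitespace and punctuation segments.
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        for (String s : segments(BreakIterator.getWordInstance(Locale.US), text)) {
            if (Character.isLetterOrDigit(s.codePointAt(0))) {
                words.add(s);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String quote = "If you can dream it, you can do it. Walt Disney";
        System.out.println(words(quote));
        System.out.println(segments(BreakIterator.getSentenceInstance(Locale.US), quote));
    }
}
```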



Normalizer

Normalizes text for many purposes

(e.g. comparison, sort, search)

• Supports four Normalization forms.

• NFD: Canonical Decomposition

• NFC: Canonical Decomposition, followed by Canonical Composition

• NFKD: Compatibility Decomposition

• NFKC: Compatibility Decomposition, followed by Canonical Composition
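The forms can be tried out with java.text.Normalizer. This sketch uses two sample strings chosen for illustration: é written as one precomposed code point versus e plus a combining accent, and the "fi" ligature U+FB01, which only the compatibility forms take apart.

```java
import java.text.Normalizer;

public class NormalizerDemo {

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute as a single code point
        String decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

        // The two look identical on screen but are different char sequences.
        System.out.println(composed.equals(decomposed)); // false

        // Normalizing both to the same form makes them comparable.
        System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD).equals(decomposed)); // true
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true

        // Compatibility forms also replace characters such as the "fi" ligature.
        String ligature = "\uFB01";
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFC).equals(ligature)); // true
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFKC));                 // fi
    }
}
```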



Collator

• Performs String comparison → sorting

• Our only implementation is the rule-based Collator.

• Example: you can choose either sort order:

– “AA”, “aa”, “BA” or “AA”, “Ba”, “aa”

• UTS #10 Unicode Collation Algorithm: yet to be supported
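A sketch of how the choice of ordering is made in code: plain String ordering compares code unit values, while a Collator applies language rules whose sensitivity is set through its strength. The class name CollatorStrengthDemo and the helpers are illustrative.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorStrengthDemo {

    // True if the collator treats the two strings as equal at the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    // Sorted copy using the default US English collation rules.
    static String[] collated(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.US));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "Ba", "aa", "AA" };

        String[] binary = words.clone();
        Arrays.sort(binary); // code unit order: all upper-case letters come before lower-case ones
        System.out.println(Arrays.toString(binary));          // [AA, Ba, aa]
        System.out.println(Arrays.toString(collated(words))); // both "a" words come before "Ba"

        System.out.println(sameAt(Collator.PRIMARY, "aa", "AA"));  // true: case is ignored
        System.out.println(sameAt(Collator.TERTIARY, "aa", "AA")); // false: case matters
    }
}
```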



BiDi

Provides information on the bidirectional reordering of text.

Bidirectional Character Types in the UCD are used.

Examples:

'0' (Digit, zero): EN

'٠' (Arabic-Indic digit, zero): AN

'A' (Latin capital letter, A): L

'ا' (Arabic letter, Alef): AL

'א' (Hebrew letter, Alef): R

Actually, there are 23 Bidirectional Character Types!

UAX #9 Unicode Bidirectional Algorithm
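The five example types can be looked up with Character.getDirectionality, and java.text.Bidi reports whether a paragraph mixes directions. A minimal sketch: the class name BidiDemo and the bidiClass helper are illustrative and map only the five types listed above.

```java
import java.text.Bidi;

public class BidiDemo {

    // Short bidi class name for the five types shown on the slide.
    static String bidiClass(char c) {
        switch (Character.getDirectionality(c)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT:        return "L";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        return "R";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: return "AL";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER:      return "EN";
            case Character.DIRECTIONALITY_ARABIC_NUMBER:        return "AN";
            default:                                            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(bidiClass('0'));      // EN
        System.out.println(bidiClass('\u0660')); // AN (Arabic-Indic digit zero)
        System.out.println(bidiClass('A'));      // L
        System.out.println(bidiClass('\u0627')); // AL (Arabic letter Alef)
        System.out.println(bidiClass('\u05D0')); // R  (Hebrew letter Alef)

        // Latin text followed by Hebrew letters: the paragraph has runs in both directions.
        Bidi bidi = new Bidi("abc \u05D0\u05D1\u05D2", Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println(bidi.isMixed()); // true
    }
}
```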


Locale Sensitive Services (Contd)

Supported locales as of JDK 8

http://www.oracle.com/technetwork/java/javase/java8locales-2095355.html

~ 71 locales for java.text.*, java.util.*

java.util.spi            java.text.spi

CurrencyNameProvider     BreakIteratorProvider

LocaleServiceProvider    CollatorProvider

TimeZoneNameProvider     DateFormatProvider

CalendarDataProvider     DateFormatSymbolsProvider

                         DecimalFormatSymbolsProvider

                         NumberFormatProvider


FYI, classes related to Unicode

Java version    I18n support in Java classes

1.0             java.lang.Character & java.lang.String

1.1             java.text.*; BreakIterator; Collator

1.2             java.lang.Character.UnicodeBlock

1.4             java.awt.font.NumericShaper; java.text.Bidi *

6               java.text.Normalizer *

6               java.net.IDN

7               java.lang.Character.UnicodeScript


Java i18n History (Apart from Unicode versions)

JDK 1.0

● char as 16-bit Unicode (code unit)

● Implementation supported ISO 8859-1 only

● Leaked ISO 8859-1 into specs (properties)

● java.util.Date: aligned with C library date-time functions (0-based month numbering, opposite time zone offset)

JDK 1.1

● Added I18N classes, java.text.*, java.util.Locale, etc. (came from Taligent OS)

● Originally written in C++ and ported to Java

● Also ported the date-time classes from Taligent OS

○ java.util.Calendar , GregorianCalendar , TimeZone, SimpleTimeZone

JDK 1.2

● Input method support (API)

● Unicode 2.0

○ Added Character.UnicodeBlock


Java i18n History

JDK 1.3

● Sun took over the Taligent i18n classes maintenance responsibility

○ Date-time API maintenance for Y2K

● Reimplemented platform time zone detection code

● Bi-directionality text rendering support in Swing

JDK 1.4

● Added Thai support

○ Text break

○ Collator

○ Thai Buddhist calendar support

○ Input method

● Added java.text.Bidi , java.util.Currency support

JDK 5.0

● JSR 204: Supplementary characters support

● Multilingual Font Configuration support

● BigDecimal support in java.text.DecimalFormat


Java i18n History

JDK 6

● Locale Sensitive Services (a.k.a. pluggable locales)

○ java.text.spi and java.util.spi

● Added java.text.Normalizer

● Added some locales (derived from CLDR)

● Added Japanese calendar support

JDK 7

● Enhanced java.util.Locale

○ Script support (e.g., Hans, Hant)

○ Extensions support


References & Citations used

• http://www.oracle.com/us/technologies/java

• http://userguide.icu-project.org/icudata

• http://unicode.org/

• https://www.w3.org/International/

• http://userguide.icu-project.org/unicode

• https://en.wikipedia.org/wiki/List_of_Unicode_characters

• https://en.wikipedia.org/wiki/UTF-16

• https://en.wikipedia.org/wiki/UTF-8





Quiz

• ASCII – bits ?

• Variable encoding format -- FEFF, FFFE ?

• Can you register domain names with Unicode chars?

• JDK i18n had one JSR; which one is that?


BACKUP


public class NotI18N {
    public static void main(String[] args) {
        // Messages are hardcoded, so supporting another language means editing the source.
        System.out.println("Hello.");
        System.out.println("How are you?");
        System.out.println("Goodbye.");
    }
}


import java.util.Locale;
import java.util.ResourceBundle;

public class I18NSample {
    public static void main(String[] args) {
        // Default to US English unless a language and a country are given.
        String language = "en";
        String country = "US";
        if (args.length == 2) {
            language = args[0];
            country = args[1];
        }

        Locale currentLocale = new Locale(language, country);
        // Looks up the MessagesBundle properties file that best matches the locale.
        ResourceBundle messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);

        System.out.println(messages.getString("greetings"));
        System.out.println(messages.getString("inquiry"));
        System.out.println(messages.getString("farewell"));
    }
}


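Properties (one file per language; the file names are assumed here from the ResourceBundle naming convention for the base name MessagesBundle):

MessagesBundle.properties (default, English)

    greetings = Hello.
    farewell = Goodbye.
    inquiry = How are you?

MessagesBundle_de_DE.properties

    greetings = Hallo.
    farewell = Tschüß.
    inquiry = Wie geht's?

MessagesBundle_fr_FR.properties

    greetings = Bonjour.
    farewell = Au revoir.
    inquiry = Comment allez-vous?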

% java I18NSample fr FR
Bonjour.
Comment allez-vous?
Au revoir.

In the next example the language code is en (English) and the country code is US (United States), so the program displays the messages in English:

% java I18NSample en US
Hello.
How are you?
Goodbye.
