Copyright © 2015, Oracle and/or its affiliates. All rights reserved. |
Internationalization Introduction & Overview
Arvind, 27th Feb 2016
Oracle Confidential – Internal/Restricted/Highly Restricted
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
1. Why should I care about Internationalization
2. History / Evolution
3. Various encodings
4. I18n & Java
5. References
😁,😂 , 😃 , 😄
✌, ☂, ♫, ♪
☠
☕, 🏥, 🏦, 🏨 , 🏊
Internationalization
Internationalization
• Mojibake
First, why should I care?
but it may actually display like this:
05/06/07
or
902.300
Internationalization
Internationalization
GB-English: 5th June 2007
US-English: 6th May 2007
Japanese: 7th Jun 2005
Germany: 902.300
France: 902 300
United States: 902,300
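The locale dependence shown above can be sketched with the JDK's own formatters. This is a minimal sketch, not a definitive implementation: the class name `LocaleFormats` and its helper methods are hypothetical, and the exact date text printed depends on the locale data shipped with the JDK in use.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LocaleFormats {
    // Formats a whole number with the grouping separator of the given locale.
    static String formatNumber(long value, Locale locale) {
        return NumberFormat.getIntegerInstance(locale).format(value);
    }

    // Formats a date in the locale's short style; the exact text varies by JDK locale data.
    static String formatShortDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.SHORT, locale).format(date);
    }

    public static void main(String[] args) {
        Date date = new GregorianCalendar(2007, Calendar.JUNE, 5).getTime();
        Locale[] locales = {Locale.UK, Locale.US, Locale.JAPAN, Locale.GERMANY, Locale.FRANCE};
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag()
                    + "  " + formatShortDate(date, locale)
                    + "  " + formatNumber(902300, locale));
        }
    }
}
```

The same value and the same date come out differently per locale without any change to the program logic.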
Measurement confusion causes $125 Million loss
Source: http://edition.cnn.com/TECH/space/9909/30mars.metric.02/
Internationalization
Internationalization
Percentage of English speakers by country.
Source: https://en.wikipedia.org/wiki/List_of_countries_by_English-speaking_population
Internationalization
• Why is it important
– Revenue generation
– Survival
– More adoption / Popularity of the software
Internationalization
Isthisavalidtextyouarelookingat
Character word sentence line
German            Swedish
01: Åkersberga    02: Alingsås
02: Alingsås      04: Oskarshamn
03: Äppelbo       07: Utting
04: Oskarshamn    06: Üttfeld
05: Östersund     08: Zwickau
06: Üttfeld       01: Åkersberga
07: Utting        03: Äppelbo
08: Zwickau       05: Östersund
The basic principle to remember is: the position of characters in the Unicode code charts does not specify their sort order.
UCA, Collation
Common words (lift vs. elevator, start of the week)
Table 1 shows some examples of cases where sort order differs by language, usage, or another customization.
Language: Swedish: z < ö; German: ö < z
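The Swedish/German difference can be sketched with java.text.Collator. This is a minimal sketch under assumptions: `LocaleSort` and `sortFor` are hypothetical names, and the result relies on the Swedish collation rules that ship with a standard JDK (the jdk.localedata module on modular JDKs).

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LocaleSort {
    // Returns a copy of the words sorted by the collation rules of the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // \u00D6 is the letter O with diaeresis, written as an escape to stay independent of source encoding.
        List<String> towns = Arrays.asList("\u00D6stersund", "Zwickau", "Oskarshamn");
        System.out.println("Swedish: " + sortFor(Locale.forLanguageTag("sv-SE"), towns));
        System.out.println("German:  " + sortFor(Locale.forLanguageTag("de-DE"), towns));
    }
}
```

Sorting by raw code point would give a third order again, which is why a Collator is used instead of String.compareTo.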
Internationalization
Internationalization
• Internationalization (i18n) is the process of designing or extending a software application so that it can be easily adapted to various languages and regions without major engineering changes
• Localization (l10n) is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text
• Globalization: the combination of i18n and l10n
What is Internationalization
• With i18n in place, software can be localized quickly
• With the addition of localized data, the same executable can run worldwide
• Textual elements, such as status messages and GUI component labels, are not hardcoded in the program
• Instead they are stored outside the source code and retrieved dynamically
• Support for new languages does not require recompilation
• Culturally dependent data, such as dates and currencies, appear in formats that conform to the end user's region and language
• Items to be localized:
  • Strings
  • Date & time
  • Currencies
Internationalization
Evolution
Source: http://www.asciitable.com/
Evolution
Source: http://www.lookuptables.com/ebcdic_scancodes.php
Evolution
Character Set / Repertoire
• A character set or repertoire is an unordered collection of characters that can be represented by numeric values.
• A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages).
• A repertoire becomes a coded character set when each character is assigned a particular number, called a code point. In the coded character set ISO 8859-1 (also known as Latin-1) the decimal code point value for the letter é is 233. However, in ISO 8859-5, the same code point represents the Cyrillic character щ.
Code Point
• Code points are just non-negative integers in a certain range. They do not have an implicit binary representation or a width of 21 or 32 bits. Binary representation and unit widths are defined for encoding forms.
Character encoding scheme (Code Page in Windows)
• A character encoding scheme defines how numeric values from one or more coded character sets (symbols, letters, digits) are represented in bits and bytes
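The é / щ example, and the mojibake it leads to, can be reproduced with the JDK charsets. This is a minimal sketch: the class name `CodePageDemo` and the `decode` helper are hypothetical, and characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class CodePageDemo {
    // Interprets the same bytes under the named character encoding.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] single = {(byte) 0xE9}; // decimal 233

        // Same byte, two coded character sets, two different characters.
        System.out.println(decode(single, "ISO-8859-1")); // Latin small letter e with acute
        System.out.println(decode(single, "ISO-8859-5")); // Cyrillic small letter shcha

        // Mojibake: text written as UTF-8 but read back as ISO-8859-1.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8); // two bytes: C3 A9
        System.out.println(decode(utf8, "ISO-8859-1"));          // two wrong characters instead of one
    }
}
```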
Evolution
CJK / ISO-8859-X
• While a set of 256 characters was adequate for the English-speaking world, it was not enough elsewhere: other regions got their own single-byte sets (ISO-8859-x), and East Asia needed double-byte character sets (DBCS) such as Big5 and SJIS
Windows Code page
• ANSI (native applications using the Windows GUI) & OEM (console-based applications)
Information Exchange
• With the advent of the WWW, data is transferred from one system to another. When text in one code page or character encoding is sent to a system that uses another, what you see is garbage (^&*$##@!@@), or a lot of conversion code is needed. The world needed a common format to address this
• In 1990 there were parallel initiatives by two bodies: ISO, and a group of people from vendors (Apple, early Xerox folks, Sun Microsystems, Microsoft)
Source: http://www.unicode.org/charts/
Unicode
• Unicode provides a unique number (a code point) for every character, no matter what the platform, no matter what the program, no matter what the language (as stated by http://www.unicode.org/)
• Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience
• The Unicode Standard contains much more information for implementers, covering in depth topics such as bitwise encoding, bidirectional text (bidi) and normalization (for comparison, searching, ...)
Unicode
• Unicode was originally designed as a fixed-width 16-bit character encoding (UCS-2). However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been used on planet Earth
• The code space now spans 1,114,112 code points (U+0000 to U+10FFFF)
• It is divided into 17 planes: the Basic Multilingual Plane and 16 supplementary planes
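• The Unicode standard has therefore been extended to allow up to 1,112,064 characters. Characters beyond the original 16-bit limit, i.e. outside the Basic Multilingual Plane (BMP), are called supplementary characters; the first supplementary plane is the Supplementary Multilingual Plane (SMP).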
• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned.
• Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters.
Encoding forms
Encoding forms that are available:
– ASCII, EBCDIC, ISO-8859-1, SJIS
– UCS-2 (LE, BE), UCS-4 (LE, BE)
– Encoding forms in Unicode: UTF-16 (LE, BE), UTF-32 (LE, BE), UTF-8
Problem: when data is transferred from one computer to another and more than one byte represents a character, there can be trouble. Any code unit of two or more bytes forces us to think about the order in which its bytes are transmitted and stored.
BOM: Byte Order Mark. It indicates the byte order (endianness) of the data and is placed at the beginning of a data file or stream.
List of BOMs:
– UTF-16BE: FE FF
– UTF-16LE: FF FE
– UTF-32BE: 00 00 FE FF
– UTF-32LE: FF FE 00 00
– UTF-8: EF BB BF ***
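The BOM list above can be turned into a small detector. This is a minimal sketch, assuming the input holds the first bytes of a file or stream; `BomDetector` and `detect` are hypothetical names. The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks because FF FE 00 00 starts with the UTF-16LE mark, so a real detector cannot always tell the two apart.

```java
// Hypothetical helper class, for illustration only.
final class BomDetector {
    // Returns the encoding suggested by a leading byte order mark, or "unknown" if none is present.
    static String detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return "unknown";
    }

    // Compares the leading bytes of data with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM when encoding.
        byte[] encoded = "A".getBytes(java.nio.charset.StandardCharsets.UTF_16);
        System.out.println(detect(encoded));
    }
}
```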
Encoding forms
• UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before UTF-16 was added in Version 2.0 of the standard
• UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to represent supplementary characters
• Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters
Encoding forms
Encoding Forms
Source: http://www.w3.org/
Unicode Planes: BMP, SMP
Why surrogate pairs:
In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only hold values from 0x0 to 0xFFFF, some additional complexity is needed to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
Unicode reserves these code points for surrogates:
– High surrogates: U+D800 to U+DBFF
– Low surrogates: U+DC00 to U+DFFF
Encoding forms
Surrogate Pairs
Source: https://en.wikipedia.org/wiki/UTF-16
Example of how surrogate pairs work: U+10437 (𐐷)
1) 0x10437 - 0x10000 = 0x00437 (0000 0000 0100 0011 0111)
2) Split into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111
3) 0xD800 + 0x0001 = 0xD801
4) 0xDC00 + 0x0037 = 0xDC37
Representation of U+10437 (𐐷): 0xD801 0xDC37, or \ud801\udc37
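The four steps above can be sketched in a few lines and checked against the JDK's own conversion. This is a minimal sketch; `SurrogateDemo` and `toSurrogates` are hypothetical names, while `Character.toChars` is the standard JDK call.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class SurrogateDemo {
    // Splits a supplementary code point into its high and low surrogate code units.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // step 1: 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // steps 2 and 3: top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // steps 2 and 4: bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10437);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // The JDK performs the same split.
        System.out.println(Arrays.equals(pair, Character.toChars(0x10437)));
    }
}
```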
UTF-8
• UTF-8 story
• Source: https://en.wikipedia.org/wiki/UTF-8
• UTF-8 is the dominant character encoding for the World Wide Web, accounting for 85.1% of all Web pages in September 2015 (with the most popular East Asian encoding, GB 2312, at 1.0%). The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and HTML.
• UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an octet in the Unicode Standard). Code points with lower numerical values
UTF-8
Bits of     First       Last        Bytes in
code point  code point  code point  sequence  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7           U+0000      U+007F      1         0xxxxxxx
11          U+0080      U+07FF      2         110xxxxx  10xxxxxx
16          U+0800      U+FFFF      3         1110xxxx  10xxxxxx  10xxxxxx
21          U+10000     U+1FFFFF    4         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26          U+200000    U+3FFFFFF   5         111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31          U+4000000   U+7FFFFFFF  6         1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
UTF-8
0x00..0x7F → 1 byte
0x80..0x7FF → 2 bytes
0x800..0xD7FF, 0xE000..0xFFFF → 3 bytes
0x10000 .. 0x10FFFF → 4 bytes
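The ranges above map directly onto a hand-written encoder. This is a minimal sketch limited to the current Unicode code space (up to U+10FFFF, surrogates excluded); `Utf8Encoder` and `encode` are hypothetical names, and in real code `String.getBytes(StandardCharsets.UTF_8)` does this work.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Encoder {
    // Encodes one Unicode code point as 1 to 4 UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};                       // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[] {(byte) (0xC0 | (cp >> 6)),        // 110xxxxx
                               (byte) (0x80 | (cp & 0x3F))};     // 10xxxxxx
        }
        if (cp <= 0xFFFF) {
            return new byte[] {(byte) (0xE0 | (cp >> 12)),       // 1110xxxx
                               (byte) (0x80 | ((cp >> 6) & 0x3F)),
                               (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {(byte) (0xF0 | (cp >> 18)),           // 11110xxx
                           (byte) (0x80 | ((cp >> 12) & 0x3F)),
                           (byte) (0x80 | ((cp >> 6) & 0x3F)),
                           (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x10437};
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, encode(cp).length, Arrays.equals(encode(cp), expected));
        }
    }
}
```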
UTF-8 & UTF-16
1. UTF-8 and UTF-16 are both used for encoding characters
2. UTF-8 uses at least one byte to encode a character, while UTF-16 uses at least two
3. A UTF-8 encoded file tends to be smaller than a UTF-16 encoded file (with ASCII-only characters, a UTF-16 encoded file is roughly twice as big as the same file encoded with UTF-8)
4. UTF-8 is compatible with ASCII, while UTF-16 is not
5. UTF-8 is byte-oriented, while UTF-16 is not
6. UTF-8 recovers from errors better than UTF-16
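Point 3 is easy to measure. This is a minimal sketch with hypothetical names (`EncodedSize`, `utf8Length`, `utf16Length`); it uses UTF-16BE so that no BOM is counted. The second sample shows why the list says "tends to": three BMP ideographs take more bytes in UTF-8 than in UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodedSize {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-16 (big-endian, no BOM).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        String japanese = "\u65E5\u672C\u8A9E"; // three BMP ideographs
        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii));
        System.out.println(utf8Length(japanese) + " vs " + utf16Length(japanese));
    }
}
```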
UTF-8 & UTF-16
• UTF-8: no null bytes (except for U+0000 itself), which allows the use of null-terminated strings and gives a great deal of backward compatibility
• UTF-16: most commonly used characters, such as Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory) and most Japanese, can be represented with 2 bytes. Unless really exotic characters are needed (for example in names), the 16-bit subset of UTF-16 can be used as a fixed-length encoding, which speeds up indexing.
Java history
JDK 1.3
• Sun took over the Taligent i18n classes maintenance responsibility
Java history linked to i18n
• Java’s char data type holds 16-bit unsigned integers representing Unicode code points in the Basic Multilingual Plane, encoded with UTF-16; its default value is the null code point ('\u0000')
• ICU was taken from JDK 1.3 and has been released independently since
• The JDK also leverages code from ICU4J, and we thank the ICU committee (IBM) for this
• Unicode is an evolving standard, and the Java platform has tracked the standard so that it now supports Unicode 8.0 in JDK 9
Java history linked to i18n
• Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it had to support supplementary characters (JSR 204: Supplementary characters support)
JSR 204 took the following approach:
• Use the primitive type int to represent code points in low-level APIs
• Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs
• Provide APIs to easily convert between various char and code point-based representations
With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll see that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.
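The difference between code units (char) and code points (int) shows up as soon as a string contains a supplementary character. This is a minimal sketch using standard String and Character methods; the class name `CodePointDemo` and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class CodePointDemo {
    // Counts code points rather than UTF-16 code units.
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Walks the string one code point at a time, stepping over surrogate pairs.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            result.add(codePoint);
            i += Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "A" + new String(Character.toChars(0x10437));
        System.out.println("code units:  " + text.length());
        System.out.println("code points: " + countCodePoints(text));
        for (int codePoint : codePointsOf(text)) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```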
Unicode & i18n
New Unicode release ≠ new data addition only
– Corrections to existing characters
– Technical reports may be revised/added, too
● Before Unicode 7
– Releases were irregular. Advance notices were unreliable
– Difficult to plan additions to the JDK
● Since Unicode 7
– A new major version is released every year in June
– Very helpful, easy to plan
I18n & Unicode versions
Java version                  Supported Unicode version
Prior to JDK 1.1              1.1.5
JDK 1.1                       2.0
J2SE 1.2                      2.0
J2SE 1.4                      3.2
JDK 5.0                       4.0
JDK 6                         4.0
JDK 7                         6.0
JDK 8                         6.2
JDK 9 (yet to be released)    7.0, 8.0
Locale & i18n
• Locale
  • An ID representing each cultural region
  • It does not hold any data itself
– A locale consists of:
  • ISO 639-1 2-letter language code, e.g. “en”
  • ISO 3166 2-letter country code, e.g. “US”
  • Variant code (any form). JDK-supplied ones are:
    • “NY”: Norwegian Nynorsk
    • “TH”: Thai digits for the Thai government
    • “EURO”: designates “Euro” (obsolete)
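A locale's parts can be inspected directly. This is a minimal sketch using java.util.Locale as of Java 7 or later; `LocaleDemo` and `build` are hypothetical names. The last line of `main` uses the BCP 47 extension form that the Locale documentation describes as the modern counterpart of the legacy "TH" variant.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class LocaleDemo {
    // Builds a locale from a language code and a country (region) code.
    static Locale build(String language, String region) {
        return new Locale.Builder().setLanguage(language).setRegion(region).build();
    }

    public static void main(String[] args) {
        Locale us = build("en", "US");
        System.out.println(us.getLanguage());      // language code
        System.out.println(us.getCountry());       // country code
        System.out.println(us.toLanguageTag());    // BCP 47 tag
        System.out.println(us.equals(Locale.US));  // same identifier as the predefined constant

        // A locale is only an identifier; a numbering-system preference travels as an extension.
        Locale thai = Locale.forLanguageTag("th-TH-u-nu-thai");
        System.out.println(thai.getUnicodeLocaleType("nu"));
    }
}
```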
CLDR
The Common Locale Data Repository Project, often abbreviated as CLDR, is a project of the Unicode Consortium to provide locale data in XML format for use in computer applications.
It standardizes commonly used locale data. Among the types of data that CLDR includes are the following:
• Translations for language names, territory and country names.
• Translations for currency names, including singular/plural modifications.
• Translations for weekday, month, era, period of day, in full and abbreviated forms.
• Translations for timezones and example cities (or similar) for timezones.
• Translations for calendar fields.
• Patterns for formatting/parsing dates or times of day.
• Exemplar sets of characters used for writing the language.
• Patterns for formatting/parsing numbers.
• Rules for language-adapted collation.
Locale Sensitive Services
• java.text: BreakIterator, Collator, DateFormat, DateFormatSymbols, DecimalFormatSymbols, NumberFormat, Bidi
• java.util: Calendar, Currency, Locale, TimeZone
Support for locale-sensitive behavior in the java.util and java.text packages is entirely platform independent. The only platform-dependent functionality is the setting of the initial default locale and the initial default time zone, based on the host operating system's locale and time zone.
The Java platform does not require you to use the same Locale throughout your program. If you wish, you can assign a different Locale to every locale-sensitive object in your program. This flexibility allows you to develop multilingual applications, which can display information in multiple languages.
Locale Sensitive Services
BreakIterator
• Used in text editors and many other applications that process text
• Supports four types of text-breaking, shown here for “If you can dream it, you can do it. Walt Disney”:
  • Character “/I/f/ /y/o/u/ /c/a/n/ /d/r/e/a/m/ /i/t/,/ /…/ /W/a/l/t/ /D/i/s/n/e/y/”
  • Word “/If/ /you/ /can/ /dream/ /it/,/ /you/ /can/ /do/ /it/./ /Walt/ /Disney/”
  • Sentence “/If you can dream it, you can do it. /Walt Disney/”
  • Line “/If /you /can /dream /it, /you /can /do /it. /Walt /Disney/”
• Two implementations exist:
  • Rule-based
    • Rules are specified using General_Category in the UCD
  • Dictionary-based
    • Helpful for languages which don't use a space between words
    • We have only one dictionary file, used for word- and line-breaking of the Thai language
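The word and sentence cases above can be sketched with java.text.BreakIterator. This is a minimal sketch: `BreakDemo` and its helper methods are hypothetical names, and the word helper keeps only segments that start with a letter or digit, since the iterator also reports spaces and punctuation as segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class BreakDemo {
    // Cuts the text at every boundary reported by the iterator.
    static List<String> segments(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.US), text);
    }

    // Word segments only; whitespace and punctuation segments are dropped.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String segment : segments(BreakIterator.getWordInstance(Locale.US), text)) {
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "If you can dream it, you can do it. Walt Disney";
        System.out.println(sentences(text));
        System.out.println(words(text));
    }
}
```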
Locale Sensitive Services
Normalizer
• Normalizes text for many purposes (e.g. comparison, sort, search)
• Supports four normalization forms:
  • NFD: Canonical Decomposition
  • NFC: Canonical Decomposition, followed by Canonical Composition
  • NFKD: Compatibility Decomposition
  • NFKC: Compatibility Decomposition, followed by Canonical Composition
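The four forms can be tried with java.text.Normalizer (available since JDK 6). This is a minimal sketch; `NormalizeDemo` and `canonicallyEqual` are hypothetical names, and characters are written as \u escapes so the composed and decomposed forms stay visible in the source.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class NormalizeDemo {
    // Two strings are canonically equivalent if their NFC forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00e9";     // e with acute as one code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent

        System.out.println(composed.equals(decomposed));           // plain comparison sees two different strings
        System.out.println(canonicallyEqual(composed, decomposed)); // normalization makes them comparable

        // Compatibility forms also fold presentation variants, such as the "fi" ligature U+FB01.
        System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKC));
    }
}
```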
Locale Sensitive Services
Collator
• Performs String comparison → sorting
• Our only implementation is the rule-based Collator
• Example: you can choose either sorting:
  – “AA”, “aa”, “BA” or “AA”, “Ba”, “aa”
• UTS #10 Unicode Collation Algorithm: yet to be supported
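How case affects the order depends on the comparison used. This is a minimal sketch, not the slide's exact example: `CaseSort` and its methods are hypothetical names, and the collator is set to PRIMARY strength, an assumption made here so that case differences are ignored entirely.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class CaseSort {
    // Sorts by UTF-16 code unit value, so all uppercase ASCII letters come before lowercase ones.
    static List<String> codeUnitOrder(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(null); // natural String ordering
        return copy;
    }

    // Sorts with a collator that only looks at base letters; the sort is stable, so ties keep input order.
    static List<String> primaryOrder(List<String> words) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("BA", "aa", "AA");
        System.out.println(codeUnitOrder(words));
        System.out.println(primaryOrder(words));
    }
}
```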
Locale Sensitive Services
BiDi
Provides information on the bidirectional reordering of text.
Bidirectional Character Types in the UCD are used.
Examples:
'0' (Digit, zero): EN
'٠' (Arabic-Indic digit, zero): AN
'A' (Latin capital letter, A): L
'ا' (Arabic letter, Alef): AL
'א' (Hebrew letter, Alef): R
Actually, there are 23 Bidirectional Character Types!
UAX #9 Unicode Bidirectional Algorithm
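The five example types can be looked up through `Character.getDirectionality`, and java.text.Bidi reports whether a line mixes directions. This is a minimal sketch; `BidiDemo` and its helpers are hypothetical names, and only the five types from the examples are mapped to their short names.

```java
import java.text.Bidi;

// Hypothetical helper class, for illustration only.
final class BidiDemo {
    // Maps the JDK directionality constant to the short UCD name for the five example types.
    static String bidiClass(int codePoint) {
        switch (Character.getDirectionality(codePoint)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT: return "L";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT: return "R";
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: return "AL";
            case Character.DIRECTIONALITY_EUROPEAN_NUMBER: return "EN";
            case Character.DIRECTIONALITY_ARABIC_NUMBER: return "AN";
            default: return "other";
        }
    }

    // True if the text contains both left-to-right and right-to-left runs.
    static boolean isMixed(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    public static void main(String[] args) {
        int[] samples = {'0', 0x0660, 'A', 0x0627, 0x05D0};
        for (int codePoint : samples) {
            System.out.printf("U+%04X: %s%n", codePoint, bidiClass(codePoint));
        }
        System.out.println(isMixed("abc \u05D0\u05D1"));
    }
}
```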
Locale Sensitive Services (Contd)
Supported locales as of JDK 8:
http://www.oracle.com/technetwork/java/javase/java8locales-2095355.html
~71 locales for java.text.*, java.util.*
java.util.spi            java.text.spi
CurrencyNameProvider     BreakIteratorProvider
LocaleServiceProvider    CollatorProvider
TimeZoneNameProvider     DateFormatProvider
CalendarDataProvider     DateFormatSymbolsProvider
                         DecimalFormatSymbolsProvider
                         NumberFormatProvider
FYI, classes related to Unicode
Java version    I18n support in Java classes
1.0             java.lang.Character & java.lang.String
1.1             java.text.*; BreakIterator; Collator
1.2             java.lang.Character.UnicodeBlock
1.4             java.awt.font.NumericShaper, java.text.Bidi *
6               java.text.Normalizer *
6               java.net.IDN
7               java.lang.Character.UnicodeScript
Java i18n History (Apart from Unicode versions)
JDK 1.0
● char as 16-bit Unicode (code unit)
● Implementation supported ISO 8859-1 only
● Leaked ISO 8859-1 into specs (properties)
● java.util.Date: aligned with C library date-time functions (0-based month numbering, opposite time zone offset)
JDK 1.1
● Added I18N classes, java.text.*, java.util.Locale, etc. (came from Taligent OS)
● Originally written in C++ and ported to Java
● Also ported the date-time classes from Taligent OS
○ java.util.Calendar, GregorianCalendar, TimeZone, SimpleTimeZone
JDK 1.2
● Input method support (API)
● Unicode 2.0
○ Added Character.UnicodeBlock
Java i18n History
JDK 1.3
● Sun took over the Taligent i18n classes maintenance responsibility
○ Date-time API maintenance for Y2K
● Reimplemented platform time zone detection code
● Bidirectional text rendering support in Swing
JDK 1.4
● Added Thai support
○ Text break
○ Collator
○ Thai Buddhist calendar support
○ Input method
● Added java.text.Bidi, java.util.Currency support
JDK 5.0
● JSR 204: Supplementary characters support
● Multilingual Font Configuration support
● BigDecimal support in java.text.DecimalFormat
Java i18n History
JDK 6
● Locale Sensitive Services (a.k.a. pluggable locales)
○ java.text.spi and java.util.spi
● Added java.text.Normalizer
● Added some locales (derived from CLDR)
● Added Japanese calendar support
JDK 7
● Enhanced java.util.Locale
○ Script support (e.g., Hans, Hant)
○ Extensions support
References & Citations used
• http://www.oracle.com/us/technologies/java
• http://userguide.icu-project.org/icudata
• http://unicode.org/
• https://www.w3.org/International/
• http://userguide.icu-project.org/unicode
• https://en.wikipedia.org/wiki/List_of_Unicode_characters
• https://en.wikipedia.org/wiki/UTF-16
• https://en.wikipedia.org/wiki/UTF-8
Quiz
• ASCII: how many bits?
• Variable encoding format: FEFF, FFFE?
• Can you register domain names with Unicode characters?
• JDK i18n had one JSR. Which one is it?
Q & A
Thank you
BACKUP
public class NotI18N {
    public static void main(String[] args) {
        // All user-visible text is hardcoded in English.
        System.out.println("Hello.");
        System.out.println("How are you?");
        System.out.println("Goodbye.");
    }
}
import java.util.Locale;
import java.util.ResourceBundle;

public class I18NSample {
    public static void main(String[] args) {
        // Default to en_US unless a language and a country are given on the command line.
        String language = "en";
        String country = "US";
        if (args.length == 2) {
            language = args[0];
            country = args[1];
        }

        // Load the MessagesBundle properties that best match the requested locale.
        Locale currentLocale = new Locale(language, country);
        ResourceBundle messages = ResourceBundle.getBundle("MessagesBundle", currentLocale);

        System.out.println(messages.getString("greetings"));
        System.out.println(messages.getString("inquiry"));
        System.out.println(messages.getString("farewell"));
    }
}
Properties:
# default (English)
greetings = Hello.
farewell = Goodbye.
inquiry = How are you?
# German
greetings = Hallo.
farewell = Tschüß.
inquiry = Wie geht's?
# French
greetings = Bonjour.
farewell = Au revoir.
inquiry = Comment allez-vous?
% java I18NSample fr FR
Bonjour.
Comment allez-vous?
Au revoir.
In the next example the language code is en (English) and the country
code is US (United States) so the program displays the messages in
English:
% java I18NSample en US
Hello.