ICU Overview: The Open Source Unicode Library

George Rhoten, IBM Globalization Center of Competency

28th Internationalization and Unicode Conference, Orlando, Florida, September 2005. © 2005 IBM Corporation

Page 1: ICU Overview: The Open Source Unicode Library

28th Internationalization and Unicode Conference © 2005 IBM Corporation

ICU Overview: The Open Source Unicode Library

George Rhoten

IBM Globalization Center of Competency

Page 2: ICU Overview: The Open Source Unicode Library

28th Internationalization and Unicode Conference, Orlando, Florida, September 2005

Agenda

Background Information

What is ICU?

Architecture Overview

– Significant New ICU Features

References

Q and A

Page 3: ICU Overview: The Open Source Unicode Library

Why Globalization?

Page 4: ICU Overview: The Open Source Unicode Library

Unicode

Handles all modern world languages

Efficient and effective processing

Lossless data exchange

Enables single-binary global software

But… all languages ⇒ large, complex standard

– 1,400 pages + Annexes + additional standards

– 96,000+ characters

– Major update every 3 years

– Minor update about once a year

– 70 character properties, many multi-valued

– Affects many processes: display, line-break, regular expressions, …

Page 5: ICU Overview: The Open Source Unicode Library

Internationalization, Localization & Locales

Requirements vary widely across languages & countries

– Sorting

– Text searching

– Line breaks

– Date/time/number/currency formatting

– Codepage conversion

– …and so on

Performance is key

– It is easy to do the right thing

– It is hard to do it fast

Page 6: ICU Overview: The Open Source Unicode Library

What is ICU?

International Components for Unicode

Globalization / Unicode / Locales

Mature, widely used set of C/C++ and Java libraries

– Basis for Java 1.1 internationalization, but goes far beyond Java 1.1

Very portable – identical results on all platforms / programming languages

– C/C++: 30+ platforms/compilers

– Java: IBM & Sun JDK

– You can use: C/C++ (ICU4C), Java (ICU4J), C/C++ with Java (ICU4JNI)

Full threading model

Customizable

Modular

Open source – but non-restrictive

Page 7: ICU Overview: The Open Source Unicode Library

Who uses ICU?

Products Within IBM

– All 5 major software brands

– Many other related software applications

– Used on all IBM operating systems

Other Companies and Organizations

– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, Caris, CERN, Cognos, Debian Linux, Gentoo Linux, HP, Home Depot, Inktomi, JD Edwards, Macromedia, Mathworks, MKS, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE Linux, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping LLC., Xerox, Yahoo!

– ...and many more

Page 8: ICU Overview: The Open Source Unicode Library

ICU Features

Unicode text handling

Charset conversions (700+)

Collation & Searching

Locales from CLDR (250+)

Resource Bundles

Calendar & Time zones

Complex-text layout engine

Unicode Regular Expressions

Breaks: word, line, …

Formatting

– Date & time

– Messages

– Numbers & currencies

Transforms

– Normalization

– Casing

– Transliterations

Page 9: ICU Overview: The Open Source Unicode Library

Architecture Overview 1

Locale Based Services

– Locale is an identifier, not a container

– Keywords for variants: de@collation=phonebook

Resource inheritance: shared resources

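[Figure: resource inheritance tree with three levels below root. Language: en, de, zh. Script: Hant, Hans (under zh). Region: US, IE (under en); DE, CH (under de); TW, CN (under the zh scripts).]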
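
The inheritance idea on this slide can be sketched as a lookup chain: a resource missing from the most specific locale is looked up in its parent, and finally in root. The class below is a hypothetical illustration, not ICU's API. It assumes underscore-separated locale IDs and simple truncation; ICU's real fallback logic is more elaborate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ICU: lists the bundles consulted for a locale ID.
final class LocaleFallbackSketch {

    static List<String> chain(String localeId) {
        // Keywords such as @collation=phonebook select a variant of a service;
        // this sketch assumes they play no part in choosing the parent bundle.
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;

        List<String> result = new ArrayList<>();
        while (!id.isEmpty()) {
            result.add(id);
            int cut = id.lastIndexOf('_');          // drop the last subtag
            id = cut >= 0 ? id.substring(0, cut) : "";
        }
        result.add("root");                          // every chain ends at root
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chain("zh_Hant_TW"));             // [zh_Hant_TW, zh_Hant, zh, root]
        System.out.println(chain("de@collation=phonebook")); // [de, root]
    }
}
```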

Page 10: ICU Overview: The Open Source Unicode Library

Architecture Overview 2

Open and Close Service Model

– Open a service object, use it many times, close it when done

– Better performance by avoiding setup costs per operation

ICU Threading Model

– Multiple service objects in use simultaneously with same or different attributes

– Large resources shared in read-only cache

– Compatible with Java threading model
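
A minimal sketch of the open, use, close pattern, using the JDK's java.text.NumberFormat as a stand-in for an ICU service object. The class and method names are illustrative. In ICU4C the same shape would be an explicit open/close pair such as unum_open and unum_close; in Java the garbage collector plays the role of close.

```java
import java.text.NumberFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ReuseServiceSketch {

    static List<String> formatAll(Locale locale, double... values) {
        // "Open": pay the locale-data setup cost once.
        NumberFormat format = NumberFormat.getInstance(locale);

        // "Use": the same service object handles every value.
        List<String> out = new ArrayList<>();
        for (double value : values) {
            out.add(format.format(value));
        }

        // "Close": nothing to do in Java; the object is reclaimed by the GC.
        return out;
    }

    public static void main(String[] args) {
        System.out.println(formatAll(Locale.US, 1.5, 1234.0)); // [1.5, 1,234]
    }
}
```

JDK format objects are generally not synchronized, so a typical arrangement is one service object per thread, each reused many times.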

Page 11: ICU Overview: The Open Source Unicode Library

Architecture Overview 3

Data Driven Services

– Customize at build-time or run-time

– Interchange with other platforms;

• same results on each

– Rule-based

• Collation, Word-breaks, Transforms

– Pattern-based

• Date/Time/Number/Message formatting

– Table-based

• Character Conversion

Page 12: ICU Overview: The Open Source Unicode Library

Architecture Overview – ICU4C

Simple Error Handling

– Thread safe

– Works in C and C++

C/C++ subset for portability

Version Management

– Multiple versions of ICU4C in the same process memory space

– Data and library versioning

String Buffer Management

– Preflighting and overflow protection

Flexible

– Allows Loading and Unloading ICU4C libraries

– Runtime settable memory allocation and mutex functions

Page 13: ICU Overview: The Open Source Unicode Library

Architecture Overview – ICU4J

Supplement for Java Core globalization (no character conversion or regular expressions)

– We do supply complex text support for Sun

Modularized: products may add just needed functionality

Usually drop-in replacement for JDK functionality

– Changing the import statements is usually all that is needed
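
As an illustration of the drop-in claim, the sketch below is written against the JDK classes, with comments marking the two imports that would change for ICU4J. The class name and patterns are made up for the example, and the swap is described here rather than tested against ICU4J.

```java
import java.text.SimpleDateFormat;   // ICU4J counterpart: com.ibm.icu.text.SimpleDateFormat
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;           // ICU4J counterpart: com.ibm.icu.util.TimeZone

final class DropInSketch {

    // The body uses only calls that both class families offer.
    static String longDate(Date date, String pattern, Locale locale) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
        format.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone for repeatable output
        return format.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        System.out.println(longDate(epoch, "EEEE, d MMMM yyyy", Locale.US));       // Thursday, 1 January 1970
        System.out.println(longDate(epoch, "EEEE, d. MMMM yyyy", Locale.GERMANY)); // Donnerstag, 1. Januar 1970
    }
}
```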

Page 14: ICU Overview: The Open Source Unicode Library

ICU4J: Supplement for Java

CLDR (Common Locale Data Repository)

– More fully supported locales than Java

Up-to-date globalization: standards-compliant; latest Unicode

– Supplementary character (GB 18030, JIS X 0213, HKSCS)

• Java 5 adds handling of supplementary characters

– Full properties – JDK has only a fraction

– Unicode Collation Algorithm

– Local calendars (Islamic, Japan,…); more time zone localizations

– Currencies, String Search, Internationalized Domain Names

– Transforms: Case, Scripts, Normalization

Much shorter release cycle and quicker support for Unicode standard

Page 15: ICU Overview: The Open Source Unicode Library

Unicode Text Handling

C (UTF-16)

– UChar*: null-terminated or with length

C++ (UTF-16)

– UnicodeString: full featured string class

Java (UTF-16BE)

– Uses java.lang.String and adds utilities

All handle supplementary characters

– Required for GB 18030 and JIS X 0213 repertoire
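
Why supplementary characters need care in UTF-16 can be shown with plain java.lang.String: one character outside the Basic Multilingual Plane occupies two code units. The sketch uses only JDK methods (Java 5 and later); the helper names are illustrative, and the extra utilities ICU4J layers on top are not shown.

```java
final class SupplementarySketch {

    static int codeUnits(String s) {
        return s.length();                         // counts UTF-16 code units
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());    // counts Unicode characters
    }

    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+20000 is a CJK ideograph in the supplementary planes.
        String text = "a" + fromCodePoint(0x20000);
        System.out.println(codeUnits(text));   // 3
        System.out.println(codePoints(text));  // 2
        // Iterate by code point, never by char, to keep surrogate pairs intact.
        text.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```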

Page 16: ICU Overview: The Open Source Unicode Library

Unicode Text Handling 2

All Unicode 4.1 properties

– direct API

• values, names, enumerations

– UnicodeSet

• Fast, compact set operations (union, intersection, …)

• Pattern-based (both Perl & POSIX syntax for properties)

  – \p{greek} vs. [:greek:]

• All properties:

  – [\p{lowercase}-[a-z]]

  – [\p{greek} & \p{uppercase}]
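
The two set patterns above can be approximated with the JDK's java.util.regex character classes, which is what this sketch does. It is an analogy, not UnicodeSet itself: the JDK spells set difference as an intersection with a negated class (&&[^...]) and uses the Is prefix for scripts and binary properties. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

final class PropertySetSketch {

    // Analogue of [\p{lowercase}-[a-z]]: lowercase characters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Analogue of [\p{greek} & \p{uppercase}]: Greek script and uppercase.
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    // True if the single code point is a member of the set.
    static boolean contains(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(LOWER_NOT_ASCII, 'a'));    // false
        System.out.println(contains(LOWER_NOT_ASCII, 0x00E9)); // true  (e with acute)
        System.out.println(contains(GREEK_UPPER, 0x03A9));     // true  (capital omega)
        System.out.println(contains(GREEK_UPPER, 0x03C9));     // false (small omega)
    }
}
```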

Page 17: ICU Overview: The Open Source Unicode Library

Recent Additions

Conforms to CLDR 1.3

– Adds many translated terms for languages, scripts, regions, currencies, and time zones.

– Access to more CLDR items

Support for Unicode interpretation of POSIX properties

Charset detection API (ICU4J only)

Better modularization for memory constrained environments (ICU4C only)

Page 18: ICU Overview: The Open Source Unicode Library

Character Set Conversion

Precise alias information:

– When you ask for “Shift-JIS”, you can request the precise definition by platform (e.g. Windows, IBM, Java, … )

Buffer management

– API automatically handles characters that cross buffers

– Can provide offset mappings between byte buffer and UChar buffer

Runtime customizations allowed for:

– illegal sequences

– undefined characters

Unicode Text Compression – SCSU, BOCU-1

Consistent conversion results across platforms

You can use more character sets at runtime or build time
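
The runtime customizations for illegal sequences and undefined characters can be sketched with the JDK's java.nio.charset coders, which expose a comparable policy choice. This is a stand-in, not ICU's converter API. The class name is illustrative, and wrapping the checked exception in IllegalArgumentException is a choice made for this example only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ConversionPolicySketch {

    // Decode UTF-8, applying the caller's policy to illegal byte sequences.
    static String decodeUtf8(byte[] bytes, CodingErrorAction onIllegal) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onIllegal)
                    .onUnmappableCharacter(onIllegal)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("illegal byte sequence", e);
        }
    }

    // Encode to US-ASCII, applying the caller's policy to characters
    // that have no mapping in the target charset.
    static byte[] encodeAscii(String text, CodingErrorAction onUndefined) {
        try {
            ByteBuffer out = StandardCharsets.US_ASCII.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(onUndefined)
                    .encode(CharBuffer.wrap(text));
            byte[] result = new byte[out.remaining()];
            out.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("unmappable character", e);
        }
    }

    public static void main(String[] args) {
        byte[] broken = {0x41, (byte) 0xFF, 0x42};   // 0xFF never occurs in UTF-8
        // REPLACE substitutes U+FFFD for the bad byte; REPORT would throw instead.
        System.out.println(decodeUtf8(broken, CodingErrorAction.REPLACE));
        byte[] ascii = encodeAscii("caf\u00e9", CodingErrorAction.REPLACE);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?
    }
}
```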

Page 19: ICU Overview: The Open Source Unicode Library

Collation: Sorting, Searching and Matching

Fast international comparison for string search; fully UCA compliant

– Compressed sort keys, optimized string comparison, sublinear string search

– Incremental sortkeys used for radix sorting

Precise binary sortkey stability over time (library versioning)

Fully data driven

– Many common rules provided

Runtime and build time rule customizations

– strength, normalization, upper vs. lowercase first, ignore punctuation, numeric, …

– Only delta from UCA is needed for rule customization
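
A small sketch of the strength attribute and of sort keys, using the JDK's java.text.Collator as a stand-in for the ICU collator (ICU4J offers a class of the same name in com.ibm.icu.text). The helper names are illustrative, and the JDK collator does not claim full UCA compliance.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CollationSketch {

    static Collator collator(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);                               // PRIMARY ignores accents and case
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);  // normalization attribute
        return c;
    }

    static List<String> sorted(Collator c, String... words) {
        List<String> list = new ArrayList<>(Arrays.asList(words));
        list.sort(c);                                          // Collator is a Comparator
        return list;
    }

    public static void main(String[] args) {
        Collator primary = collator(Locale.ENGLISH, Collator.PRIMARY);
        Collator tertiary = collator(Locale.ENGLISH, Collator.TERTIARY);

        System.out.println(primary.compare("resume", "r\u00e9sum\u00e9") == 0);  // true
        System.out.println(tertiary.compare("resume", "r\u00e9sum\u00e9") == 0); // false

        // Binary String order would put "Zebra" first; collation order does not.
        System.out.println(sorted(tertiary, "banana", "Zebra", "apple")); // [apple, banana, Zebra]

        // Sort keys turn repeated comparisons into plain key comparisons.
        System.out.println(tertiary.getCollationKey("apple")
                .compareTo(tertiary.getCollationKey("banana")) < 0);     // true
    }
}
```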

Page 20: ICU Overview: The Open Source Unicode Library

Calendar & Time Zones

International Calendars

– Islamic, Buddhist, Hebrew, Japanese

– Required for correct presentation of dates in some countries

Olson timezone support with localizations

Recent Additions:

– Many more time zone localizations
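
To make the calendar point concrete, the sketch below converts one ISO date into two other calendar systems and looks up Olson zone IDs. It uses the JDK's java.time classes (Java 8 and later) purely as a stand-in; ICU ships its own calendar and time zone classes, which are not shown. The helper names are illustrative.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarSketch {

    // Year in the Buddhist calendar for the given ISO date.
    static int buddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR);
    }

    // Year within the current Japanese imperial era for the given ISO date.
    static int japaneseYearOfEra(LocalDate isoDate) {
        return JapaneseDate.from(isoDate).get(ChronoField.YEAR_OF_ERA);
    }

    // Local hour in an Olson zone (e.g. "Asia/Tokyo") when it is noon UTC.
    static int hourAtUtcNoon(String olsonId, LocalDate isoDate) {
        return ZonedDateTime.of(isoDate, LocalTime.NOON, ZoneOffset.UTC)
                .withZoneSameInstant(ZoneId.of(olsonId))
                .getHour();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2005, 9, 1);
        System.out.println(buddhistYear(date));                      // 2548
        System.out.println(japaneseYearOfEra(date));                 // 17 (Heisei 17)
        System.out.println(hourAtUtcNoon("America/New_York", date)); // 8
        System.out.println(hourAtUtcNoon("Asia/Tokyo", date));       // 21
    }
}
```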

Page 21: ICU Overview: The Open Source Unicode Library

Formatting

Date & time: 8 formats per locale by default

Messages

– Completely localizable, plural support

Numbers & currencies

– Scientific Notation, Spelled-out (checks, etc.)

– Full Orthogonal Currency support

• INR In Hindi:

• INR In English: Rs. 1,234.57

• INR In German: Rs. 1.234,57

Recent Additions

– List available currencies API

– Short and stand-alone month/day names
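
A sketch of pattern-based message and number formatting with the JDK's java.text.MessageFormat, standing in for the ICU class of the same name. The patterns and class name are made up for the example. The choice syntax shown is the JDK's simple numeric selector, which is less capable than locale-aware plural rules, and the "Rs." prefix is a literal in the pattern rather than locale currency data.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class MessageSketch {

    // Selects wording by number; the last branch embeds a formatted number.
    static final String FILES =
            "{0,choice,0#no files|1#one file|1<{0,number,integer} files}";

    // Same pattern for every locale; separators come from the locale.
    static final String AMOUNT = "Rs. {0,number,#,##0.00}";

    static String format(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        System.out.println(format(FILES, Locale.US, 0));            // no files
        System.out.println(format(FILES, Locale.US, 1));            // one file
        System.out.println(format(FILES, Locale.US, 1234));         // 1,234 files
        System.out.println(format(AMOUNT, Locale.US, 1234.57));     // Rs. 1,234.57
        System.out.println(format(AMOUNT, Locale.GERMANY, 1234.57)); // Rs. 1.234,57
    }
}
```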

Page 22: ICU Overview: The Open Source Unicode Library

Transforms

Unicode Normalization

– Highly optimized for performance

– performance utilities: concatenation, detection, comparison

Casing (upper, lower, title, folding)

General Transforms

– Script transliterations

– Half-width/Full-width, Hex, etc.

– Chain transforms together, filter source characters

– Rule-based, customizable at runtime.

String Prep: NFS, Internationalized Domain Names (IDN)
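
Normalization and locale-sensitive casing can be sketched with JDK classes alone (java.text.Normalizer and String.toUpperCase). The helper names are illustrative; ICU's rule-based transliterators and its normalization performance utilities have no JDK counterpart and are not shown.

```java
import java.text.Normalizer;
import java.util.Locale;

final class TransformSketch {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);   // compose
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);   // decompose
    }

    // Two strings are canonically equivalent if their normalized forms match.
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);                          // casing depends on locale
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";   // a + combining acute accent
        String composed = "\u00e1";      // precomposed a with acute
        System.out.println(nfc(decomposed).equals(composed));          // true
        System.out.println(canonicallyEqual(decomposed, composed));    // true
        // Detection: check before paying for a conversion.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        System.out.println(upper("stra\u00dfe", Locale.ROOT));         // STRASSE
        // Turkish uppercases i to dotted capital I (U+0130).
        System.out.println(upper("i", Locale.forLanguageTag("tr")).equals("\u0130")); // true
    }
}
```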

Page 23: ICU Overview: The Open Source Unicode Library

Segmentation: word, line & sentence

Fast state-table implementation

Customizable

– Rule-based – customizable at runtime

– Special customizations, e.g. Thai

Recent Additions:

– Uses new UText API

• Discontinuous text

• Buffering

• Usable with UTF-8, UTF-16 or UTF-32
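
Word segmentation can be sketched with the JDK's java.text.BreakIterator, standing in for the ICU class of the same name. The filtering rule (keep segments that start with a letter or digit) is an assumption of this example, not something the slides specify, and the UText-based input described above is not shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SegmentationSketch {

    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep word-like segments only.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.US)); // [Hello, world]
    }
}
```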

Page 24: ICU Overview: The Open Source Unicode Library

Unicode Regular Expressions

Full Regex Implementation

– C/C++ only: Java 1.4 has own package (though not as powerful)

All Unicode 4.1 Properties

– Supported through UnicodeSet

Good performance

– Competitive with non-Unicode regex

Page 25: ICU Overview: The Open Source Unicode Library

Complex-text layout engine

Glyph processing, positioning & adjustment

– Ligature substitution, contextual forms, kerning, accent placement, bidi scripts, etc.

Support for:

– Information for drawing

– Caret Display

– Hit Testing

– Selection Highlighting

– Caret Movement

– Layout Metrics

– Line Break

– Canonical Equivalence: a + ´ or á

Recent Additions:

– Support for more complex scripts

Page 26: ICU Overview: The Open Source Unicode Library

References

ICU main site:

– http://www.ibm.com/software/globalization/icu/

– Links to

• Download ICU

• User Guide, Technical FAQ, Support, Bug Reports, Demonstrations

ICU support site:

– http://icu.sourceforge.net/

Unicode Consortium

– http://www.unicode.org/

• Unicode glossary, Unicode character database

Page 27: ICU Overview: The Open Source Unicode Library

Questions and Answers