Implementation Issues

Mark Davis, 2003-09-24

Properties

Core: Code Point, Name, Representative_Glyph, Block

General: Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception

Shaping and Rendering: Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width

Identifiers: ID_Continue, ID_Start, XID_Continue, XID_Start

Decomposition and Normalization: Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type

Numeric: Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit

Case: Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted

CJK: Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke

Misc: Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment

Bidi: Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
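A few of these properties can be inspected directly. The sketch below assumes Python's standard `unicodedata` module, which exposes only a subset of the properties listed on this slide; the helper name `describe` is hypothetical and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a handful of Unicode Character Database values for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "canonical_combining_class": unicodedata.combining(ch),
        "decomposition": unicodedata.decomposition(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "numeric_value": unicodedata.numeric(ch, None),
    }


if __name__ == "__main__":
    # A Latin letter, a precomposed letter, the Devanagari virama, a mirrored bracket.
    for ch in ("A", "\u00E0", "\u094D", "("):
        print(describe(ch))
```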

Behavior

Bidirectional Algorithm (Arabic/Hebrew)
Linebreak, User-Character, Word, …
Normalization
Collation
Regular Expressions
Programming Identifiers

…

Scripts, not Languages

a (English, German, Italian)

. (English, Russian, Armenian)

। (Hindi, Gujarati, Marathi)

¨ (English, Russian, Greek)

Size Doesn’t Matter

Text storage size is approximately the same for all languages
In real data, other data dominates
Compression available if needed: ZIP, SCSU, BOCU
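One way to check these points against your own data is to measure it. The sketch below is an illustration only: it uses Python's `zlib` (DEFLATE, the method commonly used inside ZIP files) because the standard library has no SCSU or BOCU implementation, and the helper name `sizes` and the sample sentences are assumptions chosen for the example.

```python
import zlib


def sizes(text: str) -> dict:
    """Report raw and DEFLATE-compressed sizes of text in two encoding forms."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    return {
        "chars": len(text),
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16),
        "utf8_deflated": len(zlib.compress(utf8)),
        "utf16_deflated": len(zlib.compress(utf16)),
    }


if __name__ == "__main__":
    # Sample sentences only; substitute real documents for a meaningful comparison.
    samples = {
        "English": "Unicode provides a unique number for every character. " * 20,
        "Hindi": "यूनिकोड प्रत्येक अक्षर के लिए एक विशेष नंबर प्रदान करता है। " * 20,
    }
    for label, text in samples.items():
        print(label, sizes(text))
```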

Normalization

Produces a unique form
Needed for comparison, matching, counting
Used in: Collation, Internationalized Domain Names, W3C Character Model (Web), Network File System, …
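As a small illustration of why a unique form matters for comparison and counting, the sketch below uses `unicodedata.normalize` from the Python standard library. The helper name `canonically_equal` is hypothetical, and NFC is chosen here only as one of the available normalization forms.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    precomposed = "\u00E0"   # à as one code point
    decomposed = "a\u0300"   # a followed by COMBINING GRAVE ACCENT

    # A raw code point comparison treats the two spellings as different.
    print(precomposed == decomposed)
    # After normalization they compare equal.
    print(canonically_equal(precomposed, decomposed))
    # Counting code points also depends on the form chosen.
    print(len(unicodedata.normalize("NFC", decomposed)),
          len(unicodedata.normalize("NFD", precomposed)))
```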

Transcoding: ISCII - Unicode

ISCII → Unicode
Halant + Halant → Halant + ZWNJ
Halant + Nukta → Halant + ZWJ
INV halant RA → SPACE virama RA
ATR → Not in plain text
EXT → Not required

Unicode = Lingua Franca

Transcoding = converting from one character encoding to another
Many standards / systems are defined in terms of Unicode: C#, Java, XML, …

[Diagram: Unicode at the center, connected to cp1252, SJIS, GB18030, ISCII]
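The hub idea can be sketched as a two-step conversion: decode the source bytes to Unicode, then encode to the target. The function name `transcode` is hypothetical; the codec names `cp1252`, `shift_jis`, and `gb18030` are those of the Python standard library. ISCII is left out because the standard library does not ship a codec for it.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two encodings, using Unicode as the pivot."""
    text = data.decode(source)   # legacy bytes -> Unicode string
    return text.encode(target)   # Unicode string -> target bytes


if __name__ == "__main__":
    # cp1252 -> UTF-8
    print(transcode(b"caf\xe9", "cp1252", "utf-8"))

    # Shift-JIS -> GB18030, without any direct SJIS-to-GB18030 table
    sjis = "日本".encode("shift_jis")
    gb = transcode(sjis, "shift_jis", "gb18030")
    print(gb.decode("gb18030"))
```

With N encodings, this design needs N converters to and from Unicode rather than a separate table for every pair.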

Transliteration

Round-trip Transliterations: श ↔ śa. Ideal published form; unique source sequence → unique target.

Best-Fit Transliterations: श → sa. For limited environments.

Keyboard Transliterations: श ← ssa. Limited to QWERTY keys.

Indic-Indic not simple mapping; “holes”
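The difference between round-trip and best-fit tables can be sketched as a uniqueness check. The tables below are tiny and hypothetical: only the श entries come from the slide, and the स entry is added purely to show how a best-fit table can map two sources to the same target. The function names are also assumptions.

```python
# Hypothetical two-entry tables; a real transliterator needs far more rules.
ROUND_TRIP = {
    "\u0936": "śa",  # श <-> śa (from the slide)
    "\u0938": "sa",  # स <-> sa (illustrative addition)
}
BEST_FIT = {
    "\u0936": "sa",  # श -> sa (from the slide)
    "\u0938": "sa",  # स -> sa (illustrative addition; collides with श)
}


def transliterate(text: str, table: dict) -> str:
    """Replace each character that has a table entry; pass the rest through."""
    return "".join(table.get(ch, ch) for ch in text)


def is_reversible(table: dict) -> bool:
    """A table can round-trip only if no two sources share a target."""
    return len(set(table.values())) == len(table)


if __name__ == "__main__":
    print(transliterate("\u0936", ROUND_TRIP), is_reversible(ROUND_TRIP))
    print(transliterate("\u0936", BEST_FIT), is_reversible(BEST_FIT))
```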

Keyboards

One key → many characters: क्ष = क (0915) + ◌् (094D) + ष (0937)

Many keys → one character: a + ` → à (00E0)
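Both keyboard cases can be sketched with a small input routine. Everything about the layout here is hypothetical: the key label `kssa`, the choice of a grave dead key pressed before the base letter, and the function name `type_keys`. Composition of the dead key with the base letter is done with NFC from the standard `unicodedata` module.

```python
import unicodedata

# Hypothetical layout tables.
ONE_TO_MANY = {"kssa": "\u0915\u094D\u0937"}  # one key -> three code points
DEAD_KEYS = {"`": "\u0300"}                   # dead key -> combining grave accent


def type_keys(keys: list) -> str:
    """Turn a sequence of key presses into text."""
    out = []
    pending_mark = ""
    for key in keys:
        if key in DEAD_KEYS:
            pending_mark = DEAD_KEYS[key]  # hold the mark until a base letter arrives
            continue
        text = ONE_TO_MANY.get(key, key)
        if pending_mark:
            # Many keys -> one character: compose base + mark where possible.
            text = unicodedata.normalize("NFC", text + pending_mark)
            pending_mark = ""
        out.append(text)
    return "".join(out)


if __name__ == "__main__":
    print([f"{ord(c):04X}" for c in type_keys(["kssa"])])
    print([f"{ord(c):04X}" for c in type_keys(["`", "a"])])
```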

Supporting Sequences

Keyboards
Fonts
Selection

Fonts

Required Glyphs, Positioning

Sequences Necessary to produce them

Context (e.g. in OpenType)

क (0915) + ◌् (094D) + ष (0937) → क्ष

Selection

Use appropriate boundaries for user-characters
Arrow keys, mouse selection, etc.
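A deliberately simplified sketch of user-character boundaries follows. It only attaches combining marks (General_Category M*) to the preceding character, which is an approximation and not the full Unicode text segmentation rules; in particular it splits क्ष after the virama, where an implementation tailored for Devanagari may treat the whole conjunct as one user-character. The function names are hypothetical.

```python
import unicodedata


def user_characters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch       # mark: extend the current cluster
        else:
            clusters.append(ch)      # anything else starts a new cluster
    return clusters


def move_right(text: str, index: int) -> int:
    """Return the next cluster boundary after index, as an arrow key would."""
    pos = 0
    for cluster in user_characters(text):
        pos += len(cluster)
        if pos > index:
            return pos
    return len(text)


if __name__ == "__main__":
    text = "e\u0301a"  # e + COMBINING ACUTE ACCENT, then a
    print(user_characters(text))
    print(move_right(text, 0))  # the caret skips over the accent
```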

Unicode Stability

Encoding. Once a character is encoded, it will not be moved or removed.
Name. Once a character is encoded, its character name will not be changed.
Normalization. Once a character is encoded, its canonical combining class and decomposition mapping will not be changed in a way that will destabilize normalization.
Identity. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
Property Value. The structure of certain property values in the Unicode Character Database will not be changed.

Locale Data

(examples)
