Implementation Issues
Mark Davis, 2003-09-24
Properties
Core: Code Point, Name, Representative_Glyph, Block
General: Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception
Shaping and Rendering: Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width
Identifiers: ID_Continue, ID_Start, XID_Continue, XID_Start
Decomposition and Normalization: Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type
Numeric: Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit
Case: Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted
CJK: Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke
Misc: Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment
Bidi: Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
Behavior
Bidirectional Algorithm (Arabic/Hebrew)
Linebreak, User-Character, Word, …
Normalization
Collation
Regular Expressions
Programming Identifiers
…
Scripts, not Languages
a (English, German, Italian)
. (English, Russian, Armenian)
। (Hindi, Gujarati, Marathi)
¨ (English, Russian, Greek)
Size Doesn’t Matter
Text storage size is approximately the same for all languages
In real data, other data dominates
Compression available if needed: ZIP, SCSU, BOCU
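A rough way to look at the storage point is to measure a few strings directly. The sketch below is only illustrative: the sample phrases are assumptions, not taken from the slides, and zlib (the DEFLATE method used by ZIP) stands in for the compression step because SCSU and BOCU are not in the Python standard library. Exact ratios depend on the encoding form chosen.

```python
import zlib

# Hypothetical sample phrases, chosen only for illustration.
samples = {
    "English": "Unicode implementation issues",
    "Russian": "Вопросы реализации Юникода",
    "Hindi": "यूनिकोड कार्यान्वयन",
}

def storage_sizes(text: str) -> dict:
    """Return the length of text in code points, UTF-8 bytes, and UTF-16 bytes (no BOM)."""
    return {
        "code_points": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
    }

def compressed_size(text: str, repeat: int = 50) -> tuple:
    """Return (raw, compressed) byte counts for repeated UTF-8 text under zlib."""
    raw = (text * repeat).encode("utf-8")
    return len(raw), len(zlib.compress(raw))

for language, text in samples.items():
    print(language, storage_sizes(text), compressed_size(text))
```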
Normalization
Produces Unique Form
Comparison, Matching, Counting
Used in: Collation, International Domain Names, W3C Character Model (Web), Network File System, …
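A minimal sketch of why a unique form matters for comparison and counting, using Python's standard unicodedata module. The helper name same_text is hypothetical; the choice of NFC as the default form is an assumption for the example.

```python
import unicodedata

precomposed = "\u00E0"   # à as one code point
decomposed = "a\u0300"   # a followed by COMBINING GRAVE ACCENT

def same_text(left: str, right: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)

# Raw comparison sees two different code point sequences.
print(precomposed == decomposed)

# After normalization both spellings compare equal.
print(same_text(precomposed, decomposed))

# Counting also depends on the form: NFC yields one code point, NFD yields two.
print(len(unicodedata.normalize("NFC", decomposed)),
      len(unicodedata.normalize("NFD", precomposed)))
```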
Transcoding: ISCII - Unicode
ISCII → Unicode
Halant + Halant → Halant + ZWNJ
Halant + Nukta → Halant + ZWJ
INV halant RA → SPACE virama RA
ATR → Not in plain text
EXT → Not required
Unicode = Lingua Franca
Transcoding = converting from one character encoding to another
Many standards and systems are defined in terms of Unicode: C#, Java, XML, …
Unicode ↔ cp1252, SJIS, GB18030, ISCII
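The lingua franca idea can be sketched as transcoding through Unicode as the pivot: decode from the source encoding, then encode to the target. The function name transcode and the sample text are assumptions for illustration; the codec names are the ones Python's standard library provides (it ships no ISCII codec, so ISCII is left out).

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two legacy encodings by pivoting through Unicode."""
    text = data.decode(source)   # legacy bytes -> Unicode string
    return text.encode(target)   # Unicode string -> other legacy bytes

# Shift-JIS to GB18030 without a direct Shift-JIS/GB18030 table.
sjis_bytes = "日本語".encode("shift_jis")
gb_bytes = transcode(sjis_bytes, "shift_jis", "gb18030")
print(gb_bytes.decode("gb18030"))

# cp1252 to UTF-8 uses the same two steps.
print(transcode(b"caf\xe9", "cp1252", "utf-8"))
```

With Unicode in the middle, each encoding needs only one mapping table rather than one table per pair of encodings.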
Transliteration
Round-trip Transliterations: श ↔ śa
Ideal published form
Unique source sequence → unique target
Best-Fit Transliterations: श → sa
For limited environments
Keyboard Transliterations: श ← ssa
Limited to QWERTY keys
Indic-Indic not simple mapping; “holes”
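The difference between round-trip and best-fit transliteration can be sketched with a tiny table. This is an illustration under stated assumptions: the three-entry table (श, ष, स) extends the single slide example with their usual Latin forms, and best-fit is approximated here by stripping diacritics, which is only one possible best-fit strategy.

```python
import unicodedata

# Small illustrative table; only the first entry appears on the slide.
ROUND_TRIP = {
    "\u0936": "\u015Ba",  # श -> śa
    "\u0937": "\u1E63a",  # ष -> ṣa
    "\u0938": "sa",       # स -> sa
}

def best_fit(latin: str) -> str:
    """Approximate a best-fit form by dropping combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", latin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_reversible(table: dict) -> bool:
    """A table can round-trip only if every source maps to a distinct target."""
    return len(set(table.values())) == len(table)

BEST_FIT = {source: best_fit(target) for source, target in ROUND_TRIP.items()}

print(is_reversible(ROUND_TRIP))  # distinct targets, so it can be inverted
print(is_reversible(BEST_FIT))    # targets collide, so the mapping is lossy
```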
Keyboards
One key → many characters
Many keys → one character
क्ष → क (0915) + ◌् (094D) + ष (0937)
` + a → à (00E0)
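Both keyboard cases can be sketched as lookup tables. The key label "KSSA" and the table names are hypothetical; real keyboard layouts are defined by the platform, not by code like this. The code points are the ones shown on the slide.

```python
import unicodedata

# Hypothetical layout data: one key press that emits three code points.
ONE_KEY_TO_MANY = {"KSSA": "\u0915\u094D\u0937"}

# Hypothetical dead-key table: two key presses that emit one code point.
MANY_KEYS_TO_ONE = {("`", "a"): "\u00E0"}

def describe(text: str) -> list:
    """List each code point in text as 'HEX NAME'."""
    return [f"{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(describe(ONE_KEY_TO_MANY["KSSA"]))
print(describe(MANY_KEYS_TO_ONE[("`", "a")]))
```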
Supporting Sequences
Keyboards
Fonts
Selection
Fonts
Required Glyphs, Positioning
Sequences Necessary to produce them
Context (e.g. in OpenType)
क (0915) + ◌् (094D) + ष (0937) → क्ष
Selection
Use appropriate boundaries for user-characters
Arrow keys, mouse selection, etc.
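A simplified sketch of user-character boundaries for cursor movement. It only attaches combining marks (General_Category M*) to the preceding base, which is an assumption made for brevity; it is not the full boundary algorithm of the Unicode text segmentation rules, and it does not treat a Devanagari conjunct such as क्ष as a single unit. The function names are hypothetical.

```python
import unicodedata

def user_characters(text: str) -> list:
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

def next_boundary(text: str, index: int) -> int:
    """Return the offset a right-arrow press should move to from index."""
    position = 0
    for cluster in user_characters(text):
        position += len(cluster)
        if position > index:
            return position
    return len(text)

print(user_characters("a\u0300b"))
print(next_boundary("a\u0300b", 0))
```

Moving by these boundaries keeps the cursor from landing between a base letter and its accent.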
Unicode Stability
Encoding. Once a character is encoded, it will not be moved or removed.
Name. Once a character is encoded, its character name will not be changed.
Normalization. Once a character is encoded, its canonical combining class and decomposition mapping will not be changed in a way that will destabilize normalization.
Identity. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
Property Value. The structure of certain property values in the Unicode Character Database will not be changed.
Locale Data
(examples)
Q & A