Implementation Issues Mark Davis 2003-09-24

Page 1: Implementation Issues

Implementation Issues

Mark Davis

2003-09-24

Page 2: Implementation Issues

Properties

Core: Code Point, Name, Representative_Glyph, Block

General: Age, General_Category, Script, White_Space, Alphabetic, Hangul_Syllable_Type, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Deprecated, Logical_Order_Exception

Shaping and Rendering: Join_Control, Joining_Group, Joining_Type, Line_Break, East_Asian_Width

Identifiers: ID_Continue, ID_Start, XID_Continue, XID_Start

Decomposition and Normalization: Canonical_Combining_Class, Decomposition_Mapping, Composition_Exclusion, Full_Composition_Exclusion, Decomposition_Type

Numeric: Numeric_Value, Numeric_Type, Hex_Digit, ASCII_Hex_Digit

Case: Uppercase, Lowercase, Lowercase_Mapping, Titlecase_Mapping, Uppercase_Mapping, Case_Folding, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping, Simple_Uppercase_Mapping, Simple_Case_Folding, Special_Case_Condition, Soft_Dotted

CJK: Ideographic, Unified_Ideograph, Radical, IDS_Binary_Operator, IDS_Trinary_Operator, Unicode_Radical_Stroke

Misc: Math, Quotation_Mark, Dash, Hyphen, Terminal_Punctuation, Diacritic, Extender, Grapheme_Base, Grapheme_Extend, Grapheme_Link, Unicode_1_Name, ISO_Comment

Bidi: Bidi_Control, Bidi_Mirrored, Bidi_Class, Bidi_Mirroring_Glyph
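Several of the properties named on this slide can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module, which exposes only a subset of the Unicode Character Database; the helper name `describe` is hypothetical and not part of the original slides.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character.

    `describe` is an illustrative helper, not an API from the slides.
    """
    return {
        "Code Point": f"U+{ord(ch):04X}",
        "Name": unicodedata.name(ch, "<unnamed>"),
        "General_Category": unicodedata.category(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Bidi_Mirrored": bool(unicodedata.mirrored(ch)),
        "Decomposition_Mapping": unicodedata.decomposition(ch),
        "East_Asian_Width": unicodedata.east_asian_width(ch),
    }

# A Latin letter, a combining accent, a Hebrew letter, and a precomposed letter.
for ch in ["A", "\u0301", "\u05D0", "\u00E0"]:
    print(describe(ch))

# The version of the Unicode data bundled with this Python build.
print("UCD version:", unicodedata.unidata_version)
```

Properties that `unicodedata` does not expose (Script, Line_Break, the identifier properties, and so on) would need the UCD data files or a library such as ICU.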

Page 3: Implementation Issues

Behavior

Bidirectional Algorithm (Arabic/Hebrew)

Linebreak, User-Character, Word, …

Normalization

Collation

Regular Expressions

Programming Identifiers

Page 4: Implementation Issues

Scripts, not Languages

"a": English, German, Italian

".": English, Russian, Armenian

"।": Hindi, Gujarati, Marathi

"¨": English, Russian, Greek

Page 5: Implementation Issues

Size Doesn’t Matter

Text storage size is approximately the same for all languages

In real data, other data dominates

Compression available if needed: ZIP, SCSU, BOCU
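The storage claim on this slide can be checked on any sample text. The sketch below measures one string in UTF-8 and UTF-16 and after general-purpose compression. It uses `zlib` (the DEFLATE method used by ZIP) because the Python standard library has no SCSU or BOCU codec; the function name `storage_sizes` and the sample text are assumptions made for illustration.

```python
import zlib

def storage_sizes(text: str) -> dict:
    """Byte counts for one string in several storage forms (illustrative helper)."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
    return {
        "code points": len(text),
        "utf-8": len(utf8),
        "utf-16": len(utf16),
        "zlib(utf-8)": len(zlib.compress(utf8)),
    }

# Hypothetical sample: a short Hindi phrase repeated to imitate a larger document.
sample = "नमस्ते दुनिया " * 100
for form, size in storage_sizes(sample).items():
    print(f"{form}: {size}")
```

Repeated sample text compresses far better than real prose would, so the compressed figure here is only a rough indication.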

Page 6: Implementation Issues

Normalization

Produces Unique Form

Comparison, Matching, Counting

Used in: Collation, Internationalized Domain Names, W3C Character Model (Web), Network File System, …
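As a small illustration of why a unique form matters for comparison, the sketch below compares a precomposed character with its decomposed equivalent, using `unicodedata.normalize` from the Python standard library. The helper name `same_text` is hypothetical.

```python
import unicodedata

composed = "\u00E0"     # à as a single code point
decomposed = "a\u0300"  # a followed by COMBINING GRAVE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # raw code point comparison
print(same_text(composed, decomposed))  # comparison after normalization
print(len(unicodedata.normalize("NFD", composed)))  # code points in the decomposed form
```

Either NFC or NFD works for this comparison, provided both sides use the same form.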

Page 7: Implementation Issues

Transcoding: ISCII - Unicode

ISCII Halant + Halant Halant + Nukta INV halant RA ATR EXT

Unicode Halant + ZWJ Halant + ZWNJ SPACE virama RA Not in plain text Not required

Page 8: Implementation Issues

Unicode = Lingua Franca

Transcoding = converting from one character encoding to another

Many standards / systems defined in terms of Unicode: C#, Java, XML, …

[Diagram labels: Unicode, cp1252, SJIS, GB18030, ISCII]
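The lingua franca idea can be sketched as a two-step conversion: decode the source bytes to Unicode, then encode to the target. The example below uses Python's built-in codecs; the function name `transcode` is an assumption, and ISCII is left out because the Python standard library does not appear to ship an ISCII codec.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two legacy encodings by going through Unicode.

    Illustrative helper: raises UnicodeDecodeError or UnicodeEncodeError
    when a byte or character has no mapping on either side.
    """
    text = data.decode(source)  # legacy encoding -> Unicode
    return text.encode(target)  # Unicode -> other legacy encoding

# Shift_JIS -> GB18030 without a direct Shift_JIS/GB18030 table.
sjis_bytes = "日本語".encode("shift_jis")
gb_bytes = transcode(sjis_bytes, "shift_jis", "gb18030")
print(gb_bytes.decode("gb18030"))

# cp1252 -> UTF-8.
print(transcode(b"caf\xe9", "cp1252", "utf-8"))
```

With Unicode as the pivot, each encoding needs only one table to and from Unicode instead of one table per pair of encodings.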

Page 9: Implementation Issues

Transliteration

Round-trip Transliterations: श ↔ śa

Ideal published form

Unique source sequence → unique target

Best-Fit Transliterations: श → sa

For limited environments

Keyboard Transliterations: श ← ssa

Limited to QWERTY keys

Indic-Indic is not a simple mapping; there are “holes”
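The difference between round-trip and best-fit transliteration can be shown with a toy table. The sketch below is an assumption-laden illustration: the two tables cover only five Devanagari letters, the Latin targets follow common IAST-style spellings, and all names (`ROUND_TRIP`, `BEST_FIT`, `is_reversible`, `transliterate`, `reverse`) are hypothetical.

```python
# Toy tables (hypothetical, five letters only). Each letter carries its inherent "a".
ROUND_TRIP = {"क": "ka", "ख": "kha", "श": "śa", "ष": "ṣa", "स": "sa"}
BEST_FIT = {"क": "ka", "ख": "kha", "श": "sa", "ष": "sa", "स": "sa"}

def is_reversible(table: dict) -> bool:
    """A table can round-trip only if no two sources share a target."""
    return len(set(table.values())) == len(table)

def transliterate(text: str, table: dict) -> str:
    """Map each source character; pass unknown characters through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

def reverse(text: str, table: dict) -> str:
    """Invert a reversible table using longest-match-first lookup."""
    inverse = {target: source for source, target in table.items()}
    longest = max(len(target) for target in inverse)
    out, i = [], 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in inverse:
                out.append(inverse[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # no mapping: copy the character as-is
            i += 1
    return "".join(out)

word = "शकस"
latin = transliterate(word, ROUND_TRIP)
print(latin, reverse(latin, ROUND_TRIP) == word)
print(is_reversible(ROUND_TRIP), is_reversible(BEST_FIT))
```

In the best-fit table three source letters collapse to the same target, which is why it cannot be reversed.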

Page 10: Implementation Issues

Keyboards

One key → many characters

क 0915

◌् 094D

ष 0937

Many keys → one character

à 00E0 (entered with the ` key plus a)
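The code points on this slide can be listed programmatically. The sketch below shows that the conjunct क्ष is stored as three code points, while à is stored as one. The constant and function names are hypothetical.

```python
import unicodedata

KSSA = "\u0915\u094D\u0937"  # क + virama + ष, displayed as the conjunct क्ष
A_GRAVE = "\u00E0"           # à as one precomposed code point

def code_points(text: str) -> list:
    """List each code point in a string with its Unicode name (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(len(KSSA), code_points(KSSA))        # one key press could emit all three
print(len(A_GRAVE), code_points(A_GRAVE))  # two key presses could emit this one
```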

Page 11: Implementation Issues

Supporting Sequences

Keyboards

Fonts

Selection

Page 12: Implementation Issues

Fonts

Required Glyphs, Positioning

Sequences Necessary to produce them

Context (e.g. in OpenType)

क 0915

◌् 094D

ष 0937

Page 13: Implementation Issues

Selection

Use appropriate boundaries for user-characters

Arrow keys, mouse selection, etc.
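To illustrate what a user-character boundary means for cursor movement, the sketch below groups each base character with the nonspacing or enclosing marks that follow it. This is a deliberately simplified approximation, not the segmentation rules of the Unicode Standard: it ignores Hangul, spacing marks, joiners, and script-specific conjunct rules. The function name `approximate_clusters` is hypothetical.

```python
import unicodedata

def approximate_clusters(text: str) -> list:
    """Split text so that combining marks stay attached to their base.

    Simplified sketch: only General_Category Mn and Me extend a cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# An arrow key should step over "e + combining acute" as a single unit.
print(approximate_clusters("e\u0301a"))

# Under this approximation the virama stays with क, and ष starts a new cluster.
print(approximate_clusters("\u0915\u094D\u0937"))
```

Production code would normally rely on a library that implements the full boundary rules, such as ICU's break iterators.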

Page 14: Implementation Issues

Unicode Stability

Encoding. Once a character is encoded, it will not be moved or removed.

Name. Once a character is encoded, its character name will not be changed.

Normalization. Once a character is encoded, its canonical combining class and decomposition mapping will not be changed in a way that will destabilize normalization.

Identity. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.

Property Value. The structure of certain property values in the Unicode Character Database will not be changed.

Page 15: Implementation Issues

Locale Data

(examples)

Page 16: Implementation Issues

Q & A