Unicode 4.0
Mark Davis
President, The Unicode Consortium
Note: slides differ from proceedings

Overview
  New Characters
  Conformance
  UAX: Unicode Standard Annexes
  UCD: Unicode Character Database
  UTS: Unicode Technical Standards
    Not part of the Standard, but can claim conformance

Properties and Behavior
  Unicode is not just a list of characters
  Properties and behavior are crucial
  With them, new characters can work “out of the box”
  Some are part of the standard (BIDI, Normalization), others are associated (Collation, Regular Expressions)

New Characters: 1,228
  Modern Scripts
    (additions to) Indic, Khmer, Latin, Greek, Arabic, Syriac
    (minority scripts) Limbu, Tai Le, Osmanya
  Historic Scripts
    Linear B, Cypriot, Ugaritic, Shavian, Aegean Numbers
  Symbols
    Monograms, digrams, tetragrams, other symbols
    modifier & combining characters

New Characters (cont.)
  Special Characters
    additional variation selectors (for future CJK variants), double diacritics for dictionary use
  For a detailed list, see Derived Age in the UCD 4.0, and the beta Charts.
  Character repertoire corresponds to ISO/IEC 10646:2003.

Conformance
  Substantially improved specification of conformance requirements
  Incorporated UTR #17: Character Encoding Model, clearly separating encoding forms and encoding schemes
  Tightened definitions of UTF-8, UTF-16, UTF-32
  Separate definition of Unicode String
  Clarified conformance status of Unicode Standard Annexes
  Formal definitions of properties & algorithms
  Provisional properties

UTF vs. Unicode String
  Important Distinction
  UTF
    Unique representation for each code point
    All else illegal, for example:
      C0 80
      D800 0061
  Unicode String
    Sequence of code units
    Internal processing, not interchange
    Not necessarily valid UTF, for example:
      C0 A0
      D800 0061
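
The distinction above can be checked mechanically. The following is a minimal sketch, assuming Python 3 and its standard strict UTF-8 decoder; the helper names `is_valid_utf8` and `is_valid_utf16` are hypothetical and not part of the slides. It treats the byte pair C0 80 as UTF-8 input and the sequence D800 0061 as 16-bit code units.

```python
def is_valid_utf8(data):
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # the strict decoder rejects non-shortest forms
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf16(units):
    """Return True if every surrogate code unit is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A lead surrogate must be followed by a trail surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A trail surrogate without a preceding lead surrogate.
            return False
        else:
            i += 1
    return True


print(is_valid_utf8(b"\xC0\x80"))                 # False: non-shortest form
print(is_valid_utf16([0xD800, 0x0061]))           # False: unpaired surrogate
print(is_valid_utf16([0xD800, 0xDC00, 0x0061]))   # True
```

A Unicode string in the sense of the slide may hold the sequences that these checks reject; the checks only matter at the point where the data is interchanged as a UTF.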

Conformance (cont.)
  Formalized policies for stability of the standard
  Clarification of semantics of important characters, including BOM
  Revised scope of enclosing combining marks
  Revised semantics of ZWJ for cursive scripts
  Normalization Corrections
    U+2F868; U+2F874; U+2F91F; U+2F95F; U+2F9BF
    All corrections subject to strict stability constraints:
      For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)

Textual Clarifications
  Major changes to Chapters 2, 3, 6, 14 and 15
  Definitive terminology for code points:
    graphic, format, control, private-use = assigned characters
    surrogate, noncharacter, reserved = not characters
  Substantial improvements to many character block descriptions, especially Indic
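
The seven code point types named above can be approximated from the General_Category property. This is a sketch under stated assumptions: it uses Python's `unicodedata` module (which reflects whatever Unicode version the interpreter ships, not 4.0), and the function name `code_point_type` is hypothetical.

```python
import unicodedata


def code_point_type(cp):
    """Classify a code point into one of the seven types (hypothetical helper)."""
    # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        return "reserved"
    if category == "Co":
        return "private-use"
    if category == "Cc":
        return "control"
    if category in ("Cf", "Zl", "Zp"):
        return "format"
    # Letters, marks, numbers, punctuation, symbols, and space separators.
    return "graphic"


for cp in (0x0041, 0x000A, 0x200D, 0xE000, 0xD800, 0xFFFF):
    print("U+%04X" % cp, code_point_type(cp))
```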

Programming language identifiers
  Now backwards-compatible
    Once a Unicode identifier, always a Unicode identifier
  Alternate definition for complete stability
    Fix set of allowed characters
    Allow all reserved code points
    + Complete stability
    - “Odd” characters
  Also see new UTR on Syntax Characters
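
One place to see property-based identifier rules in practice is Python 3, whose identifiers are defined in terms of the XID_Start and XID_Continue properties (plus the underscore) and tested by `str.isidentifier()`. The sketch below shows that rule only; it is Python's definition, not necessarily the exact one presented in the slides, and the helper name `check_identifiers` is hypothetical.

```python
def check_identifiers(names):
    """Map each candidate name to whether Python accepts it as an identifier."""
    return {name: name.isidentifier() for name in names}


candidates = ["count", "na\u00efve", "\u53d8\u91cf", "x2", "2x", "a-b"]
for name, ok in check_identifiers(candidates).items():
    print(repr(name), ok)
```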

Case mappings
  Now normative (but tailorable)
  Clearer definition of string functions:
    isUpper(), isLower(), isTitle(), isFold()
    toUpper(), toLower(), toTitle(), toFold()
  Definition of titlecase uses word boundaries
  Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
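
The note on Turkic mappings can be illustrated with a small sketch. It assumes Python 3, whose `str.upper()`, `str.lower()`, and `str.casefold()` apply the default (untailored) mappings; the function `lower_turkic_sketch` is a hypothetical, simplified tailoring written for this example and is not a complete Turkish or Azeri implementation. Here the "additional processing" is normalizing to NFC before mapping.

```python
import unicodedata

# Default (untailored) full case mappings as exposed by Python.
print("Stra\u00dfe".upper())      # STRASSE
print("Stra\u00dfe".casefold())   # strasse


def lower_turkic_sketch(text, normalize_first=True):
    """Simplified Turkic lowercase: I -> dotless i, dotted capital I -> i."""
    if normalize_first:
        # Compose I + COMBINING DOT ABOVE into U+0130 before mapping.
        text = unicodedata.normalize("NFC", text)
    text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


precomposed = "\u0130"    # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "I\u0307"    # canonically equivalent sequence

# Without normalization the two equivalent inputs map to different results.
print(lower_turkic_sketch(precomposed, normalize_first=False) ==
      lower_turkic_sketch(decomposed, normalize_first=False))   # False
# With normalization first, the results agree.
print(lower_turkic_sketch(precomposed) == lower_turkic_sketch(decomposed))  # True
```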

UAX #9: BIDI
  BIDI: Arabic/Hebrew Display
    HTML, all modern word processors, OSs,…
  New:
    canonical equivalence now preserved
      data change, not algorithm
    shaping is done after reordering, but not across directional boundaries
    clarifications of:
      ZWJ, ZWNJ
      intermediate level processing
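
The bidirectional algorithm is driven by a per-character property, Bidi_Class. The sketch below only looks up that property with Python's `unicodedata.bidirectional()`; it does not implement the reordering algorithm itself, and the helper name `bidi_classes` is hypothetical.

```python
import unicodedata


def bidi_classes(text):
    """Return the Bidi_Class value of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin a, Hebrew alef, Arabic alef, digit one, space
sample = "a\u05d0\u06271 "
print(bidi_classes(sample))   # ['L', 'R', 'AL', 'EN', 'WS']
```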

UAX #15: Normalization
  Unique form for text comparison
    W3C Character Model, International Domain Names, Network File System,…
  New:
    Description of Stable Code Points.
    Notation NFC(x) and isNFC(x), in Notation.
    Added pointer to UTN #5 Canonical Equivalences in Applications
    Rewrote Annex 12: Corrigenda for clarity, and to describe the use of Normalization Corrections.
    Added Annex 13: Canonical Equivalence.
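
The NFC(x) and isNFC(x) notation maps directly onto a small sketch. It assumes Python's `unicodedata.normalize()`; the wrapper names `nfc` and `is_nfc` are hypothetical and simply mirror the notation.

```python
import unicodedata


def nfc(x):
    """NFC(x): the canonical composed form of x."""
    return unicodedata.normalize("NFC", x)


def is_nfc(x):
    """isNFC(x): true when x is already in NFC."""
    return nfc(x) == x


composed = "\u00e9"      # e with acute, one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)             # False: different code point sequences
print(nfc(composed) == nfc(decomposed))   # True: same normalized form
print(is_nfc(composed), is_nfc(decomposed))   # True False
```

Comparing normalized forms, rather than raw strings, is what gives the "unique form for text comparison" named on the slide.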

UAX #14: Line Breaking
  Line-Break (word-wrap) all Unicode text
    Customizable for different languages
  New:
    Negative numbers and dates with hyphens will not break across lines
    Word-Joiner will link any characters (except hard line breaks)
    Behavior of soft hyphen clarified
      marks opportunity for breaking, not specific graphic appearance.
    Rules for GL relaxed: SP and ZW override
    New Property Values: NL, WJ

UAX #29: Text Boundaries
  Default “User Character”, Word, Sentence boundaries
    Customizable for different languages
    Word, sentence: tailoring expected
  New:
    Extracted from 3.0, but significantly revised
    Grapheme cluster (“user character”)
      Hangul Syllable or other Base plus (optionally) any number of NSMs
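
The "base plus any number of NSMs" part of the grapheme cluster definition can be sketched as follows. This is a deliberate simplification and not the UAX #29 algorithm: it treats General_Category Mn and Me as the nonspacing marks and ignores Hangul syllables, CR LF, and every other rule. The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the nonspacing marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


text = "a\u0308\u0323b"   # a + diaeresis + dot below, then b
print(len(text))                    # 4 code points
print(len(simple_clusters(text)))   # 2 user characters
```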

No Sub. Changes
  UAX #11: East Asian Width
    Guidelines for choosing character width
  UAX #24: Script Names
    Default script assignment
    Used in regular expressions
    Now UAX
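
For East Asian Width, the property value of a character can be looked up and turned into a rough column count. The sketch assumes Python's `unicodedata.east_asian_width()`; the function `display_columns` is hypothetical, counts Wide and Fullwidth characters as two columns and everything else (including Ambiguous) as one, and ignores combining marks.

```python
import unicodedata


def display_columns(text):
    """Rough column count: W and F take two columns, everything else one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


# Latin A, CJK ideograph, fullwidth A, halfwidth katakana A
sample = "A\u4e00\uff21\uff71"
print([unicodedata.east_asian_width(ch) for ch in sample])   # ['Na', 'W', 'F', 'H']
print(display_columns(sample))                               # 6
```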

Superseded UAXes
  Incorporated into and thus superseded by Unicode Version 4.0:
    UAX #13: Unicode Newline Guidelines
    UAX #19: UTF-32
    UAX #21: Case Mappings
    UAX #27: Unicode 3.1
    UAX #28: Unicode 3.2

Unicode Character Database
  Crucial Component of Unicode
  Documentation coalesced into UCD.html.
  New properties and values
    Hangul_Syllable_Type, Unicode_Radical_Stroke
    CJK numeric values added.
    PropertyValueAliases adds block names
  UCD fallback props more precisely defined.
    for code points not explicitly in data files
  New Characters
    Appropriate properties assigned

UCD 4.0 (cont.)
  Modifier letters
    The general category of 02B9..02BA, 02C6..02CF changed to Lm.
  Khmer
    Two Khmer characters are deprecated; four others strongly discouraged.
  Decimal Digits
    Numeric_Type=decimal digit now aligned with General_Category=Nd
  Braille
    Added script value
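
Two of these property changes can be observed through Python's `unicodedata` module. This is a sketch that assumes the interpreter's bundled Unicode data, which is a later version than 4.0 but keeps these assignments.

```python
import unicodedata

# Modifier letters now carry General_Category Lm.
print(unicodedata.category("\u02b9"))   # Lm
print(unicodedata.category("\u02c6"))   # Lm

# Decimal digits: characters with General_Category Nd have a decimal value.
for ch in ("7", "\u0663", "\u0969"):    # ASCII, Arabic-Indic, Devanagari digits
    print(unicodedata.category(ch), unicodedata.decimal(ch))

# SUPERSCRIPT TWO is a digit but not a decimal digit, and its category is No.
print(unicodedata.category("\u00b2"))        # No
print(unicodedata.decimal("\u00b2", None))   # None
print(unicodedata.digit("\u00b2"))           # 2
```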

UCD 4.0 (cont. 2)
  Case Mapping
    Fixed for Turkish, Lithuanian
  Default Ignorables
    Hangul Filler characters
    Soft-Hyphen, CGJ, ZWS
    Arabic End of Ayah and Syriac Abbreviation Mark no longer DI, shaping classes fixed.
  Grapheme_Extend
    removes halfwidth katakana marks, most Mc (except as needed for canonical equivalence)

Unicode Technical Standard
  UTS: separate standard
    independent conformance requirements
  UTR: information and guidelines
  Documents may move from UTR status to UTS

UTS #10: Unicode Collation
  Significance:
    String comparison, matching, searching
    Compares all Unicode characters
    Handles linguistic features
      Accents, Case, Punctuation,…
      Contextual weighting,…
    Tailor for different languages
  Version 4.0.0 due Sept. 2003
    From now on, to be sync'ed in repertoire and version with the Unicode Standard.
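
The idea of comparing base letters first, then accents, then case can be sketched with a three-part sort key. This is an illustration only and not the Unicode Collation Algorithm: it has no weight table, no tailoring, and no contextual weighting, and the function name `sketch_sort_key` is hypothetical.

```python
import unicodedata


def sketch_sort_key(s):
    """Three-level key: base letters, then accents, then case (illustration only)."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    primary = base.casefold()                        # letters, ignoring case and accents
    tertiary = tuple(ch.isupper() for ch in base)    # lowercase sorts before uppercase
    return (primary, marks, tertiary)


words = ["role", "R\u00e9sum\u00e9", "resume", "r\u00e9sum\u00e9", "Resume"]
print(sorted(words, key=sketch_sort_key))
# ['resume', 'Resume', 'résumé', 'Résumé', 'role']
```

A plain code point sort would instead group all capitalized words first, which is the kind of result that linguistic collation is meant to avoid.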

UTS #18: Regular Expressions
  Significance:
    Crucial to many applications: web, XML,…
    Unicode adds significant requirements
      Level 1: Basic Support
        Perl
      Level 2: Extended Support
      Level 3: Tailored Support
  New:
    Recently approved as UTS (was UTR)
    Adds clearer conformance requirements
      Flexible list of features
      Partial conformance claims
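
As a small illustration of what Unicode adds to regular expressions, the sketch below uses Python's standard `re` module, where `\w` and `\d` match by Unicode properties unless `re.ASCII` is given. This makes no claim about which UTS #18 level the module conforms to; in particular, the standard `re` module does not support `\p{...}` property syntax.

```python
import re

text = "na\u00efve caf\u00e9 \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"

# Unicode-aware matching: accented Latin and Greek letters count as word characters.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)

# ASCII-only matching splits words at every non-ASCII letter.
ascii_words = re.findall(r"\w+", "na\u00efve caf\u00e9", flags=re.ASCII)
print(ascii_words)   # ['na', 've', 'caf']

# \d matches decimal digits from any script, here Arabic-Indic digits.
print(bool(re.fullmatch(r"\d+", "\u0663\u0664")))   # True
```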

UTS #6: SCSU
  Simple Unicode Compression
  Added suitability for XML
  See also Technical Note on BOCU
    Main difference: preserves binary order
      x < y => BOCU(x) < BOCU(y)
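
The order-preservation property x < y => E(x) < E(y) can be demonstrated with encodings available in the Python standard library. This sketch does not implement BOCU or SCSU; it only contrasts UTF-8, which preserves code point order under byte comparison, with UTF-16, which does not once supplementary characters are involved. The function name `preserves_order` is hypothetical.

```python
def preserves_order(x, y, encoding):
    """Check whether byte-wise comparison agrees with code point comparison."""
    return (x < y) == (x.encode(encoding) < y.encode(encoding))


x = "\uff5e"       # U+FF5E, a BMP character above the surrogate range
y = "\U00010000"   # U+10000, the first supplementary code point

print(x < y)                               # True: code point order
print(preserves_order(x, y, "utf-8"))      # True:  EF BD 9E < F0 90 80 80
print(preserves_order(x, y, "utf-16-be"))  # False: FF 5E > D8 00 DC 00
```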

New UTRs
  Draft UTR #23: Character Properties
    Draft Character Property Model
  Character Folding
    Hiragana-Katakana, Case, …
  Programming Language IDs, Syntax characters

Q & A
  Other talks here:
    Common Locale Data
      interchange of language-specific data for sorting, dates, times, currencies
    ICU
      premier Unicode enablement library
      full-featured, cross-platform C, C++, Java
Background Slides

Unicode 3.2 (March, 2002)
  New Characters: 1,016
    Symbols
      Large collection of mathematical symbols, especially targeted at MathML, recycling symbols, ornamental brackets.
    Special Characters
      combining grapheme joiner, word joiner, invisible operators for math, variation selectors
    Modern Scripts
      minority scripts of the Philippines

Conformance
  Eliminates irregular UTF-8
  Defines variation sequences
  Replaces ZWNBSP with Word Joiner
  Clarifies scope of combining marks (further revised in 4.0)
  Clarifications of conjoining jamo behavior, hangul syllable structure, decomposables,…

Textual Clarifications
  Combined vowels in Khmer, characters discouraged in Khmer
  Use of dingbats

Unicode Standard Annexes
  UAX #21: Case Mappings (was UTR)

Unicode Character Database
  New properties:
    IDS_Binary_Operator, IDS_Trinary_Operator, Radical, Unified_Ideograph,
    Default_Ignorable_Code_Point, Deprecated
    Soft_Dotted, Logical_Order_Exception
    Grapheme_Base, Grapheme_Extend, Grapheme_Link
  DerivedAge
  Normalization Corrections
  Added Property & Property Value Aliases
  Adds StandardizedVariants.html

Related Items
  UTS #10: Unicode Collation Algorithm
    Ignorable character handling, dual versioning, more conditions on well-formed weights, separate weights for CJK and unassigned characters, non-characters
    Note: base version still U3.1
  UTR #26: CESU-8
  Unicode Technical Notes
  Updated Character Encoding Stability Policy
  Added Public Review process
  Updated Glossary

Unicode 3.1 (March, 2001)
  New Characters: 44,946
    First supplementaries encoded!
    Modern scripts
      CJK Ideographs (now totaling 71,039)
    Historic scripts
      Old Italic, Gothic, Deseret, Byzantine Musical Symbols
    Symbols
      Mathematical Alphanumeric Symbols, (Western) Musical Symbols
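
Supplementary characters such as these are represented in UTF-16 by a pair of surrogate code units. The arithmetic is sketched below, assuming Python 3; the function name `to_surrogates` is hypothetical, and the result is cross-checked against the standard `utf-16-be` codec.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (lead, trail) surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    lead = 0xD800 + (offset >> 10)      # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return lead, trail


# U+10300 is in the Old Italic block; U+1D11E is MUSICAL SYMBOL G CLEF.
for cp in (0x10300, 0x1D11E):
    lead, trail = to_surrogates(cp)
    print("U+%X -> %04X %04X" % (cp, lead, trail))
    # Cross-check against the built-in UTF-16 encoder.
    assert chr(cp).encode("utf-16-be") == lead.to_bytes(2, "big") + trail.to_bytes(2, "big")
```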

Conformance
  Non-shortest-form UTF-8 excluded
  Clarification of the stability of the standard, code units vs. code points, non-characters, normative properties, informative properties, normative references
  Revisions of guidelines: wchar_t, unassigned code points, identifiers
  Major revision of Georgian
  Use of ZWNJ and ZWJ for ligatures
  Language tag characters encoded but discouraged

Unicode Standard Annexes
  UAX #19: UTF-32

Unicode Character Database
  Major revision of PropList properties:
    White_Space, Bidi_Control, Join_Control, Hex_Digit
    Alphabetic, Ideographic, Lowercase, Uppercase
    ID_Start, ID_Continue, XID_Start, XID_Continue
    Noncharacter_Code_Point
    Quotation_Mark, Terminal_Punctuation, Math, Dash, Hyphen, Diacritic, Extender
  New properties: Case folding, Scripts
  Added DerivedProperties, NormalizationTest

Related Items
  Documented Character Encoding Stability Policy
  UTS #10: Unicode Collation Algorithm
    Merged data files; updated to base version 3.1
  UTR #18: Unicode Regular Expression Guidelines
  UTR #20: Unicode in XML and other Markup Languages
  UTR #22: Character Mapping Tables
  UTR #24: Script Names

Schedule
  2003, April: UCD/UAXes
    Final data files available
    Implementation can proceed
  2003, September: Book Available