Globalisation & Computer Systems week 5
1. Localisation presentations
2. Character representation and UNICODE
  UNICODE design principles
  UNICODE character semantics
3. Lab session
  Finish code page work
  Creating and browsing UNICODE characters
Representation and UNICODE
What about Chinese?
  Thousands of characters: 256 bit patterns are clearly not enough
Make the bytes bigger…
  Use 16 bits per character, which gives 65536 bit patterns
UNICODE
Representation and UNICODE
Reference: The Unicode Standard, Version 4.1.0.
Online: http://www.unicode.org/unicode/uni2book/
UNICODE – design principles
Principle 1: 16-bit character codes
  For code pages, characters share the 8-bit code points; which character a byte value means is determined by interpretation (the code page in use)
  For UNICODE, each character is assigned a unique code point
  65536 code values available (byte 1: 256 values × byte 2: 256 values)
  63486 are used for character representation; 2048 are reserved for extended 32-bit codes, i.e. pairs of 16-bit values
  The pairs give a further 1,048,544 code values to cover all languages
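As a rough illustration of how two reserved 16-bit values combine into one extended code, the sketch below uses the surrogate ranges D800 to DBFF and DC00 to DFFF from the UTF-16 encoding form. The helper name `to_surrogate_pair` is made up for this example. The raw number of pairs is 1024 × 1024 = 1,048,576; the slightly smaller figure quoted above is assumed here to leave out the 32 noncharacters at the ends of the 16 supplementary planes.

```python
# Surrogate ranges used by UTF-16 (2048 reserved code values in total).
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # first value of a pair
LOW_START, LOW_END = 0xDC00, 0xDFFF    # second value of a pair


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two 16-bit values (hypothetical helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000        # 20-bit offset into the extended space
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


reserved = (HIGH_END - HIGH_START + 1) + (LOW_END - LOW_START + 1)
pair_count = (HIGH_END - HIGH_START + 1) * (LOW_END - LOW_START + 1)

print(reserved)    # 2048
print(pair_count)  # 1048576
print([hex(v) for v in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

Python's own UTF-16 encoder produces the same two values for U+1F600, which is a convenient cross-check.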
UNICODE – design principles
Principle 2: allocation of code space
  General scripts area: alphabetic
  CJK ideographs: 27484 ideographs
  Hangul syllables: 11172 Korean Hangul syllables
  1st 128 code points for Latin
  Punctuation symbols grouped together
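The counts above can be checked from block boundaries. This is only a sketch: it assumes the ideograph figure covers the CJK Unified Ideographs assigned at U+4E00 to U+9FA5 plus Extension A at U+3400 to U+4DB5, and that the Hangul figure covers U+AC00 to U+D7A3. The helper `block_size` is a made-up name.

```python
def block_size(first, last):
    """Number of code points in an inclusive range (hypothetical helper)."""
    return last - first + 1


# Ranges assumed for this check (see the note above).
HANGUL_SYLLABLES = (0xAC00, 0xD7A3)
CJK_UNIFIED = (0x4E00, 0x9FA5)
CJK_EXTENSION_A = (0x3400, 0x4DB5)

hangul_count = block_size(*HANGUL_SYLLABLES)
cjk_count = block_size(*CJK_UNIFIED) + block_size(*CJK_EXTENSION_A)

print(hangul_count)  # 11172
print(cjk_count)     # 27484

# The first 128 code points coincide with ASCII, so ASCII bytes decode
# to characters with exactly the same numeric values.
ascii_text = bytes(range(128)).decode("ascii")
same_code_points = [ord(c) for c in ascii_text] == list(range(128))
print(same_code_points)  # True
```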
UNICODE – design principles
Principle 3: efficiency
  All characters have equal status, i.e. no escape characters
  Characters of a common script grouped together as far as possible
  Common punctuation shared
Design principles
Principle 4: logical and display order
  Logical order: how the code is ordered in memory; it follows the time sequence of input
  …and ‘logically’ that is L-R
  Dynamically composed characters: the base character is ordered ‘before’, i.e. to the left of, the modifying character
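A small sketch of logical order, using Python strings as the "memory". The Hebrew word is written with escapes in the order its letters would be typed; how it is displayed is left to the renderer. The sample word and variable names are chosen for this example only.

```python
import unicodedata

# Hebrew "shalom" in typing order: shin, lamed, vav, final mem.
shalom = "\u05E9\u05DC\u05D5\u05DD"

# Index 0 holds the first letter typed, even though a renderer
# draws it at the right-hand end of the word.
first_typed = unicodedata.name(shalom[0])
last_typed = unicodedata.name(shalom[-1])
print(first_typed)  # HEBREW LETTER SHIN
print(last_typed)   # HEBREW LETTER FINAL MEM

# A dynamically composed character: base first, modifier second.
composed_by_hand = "S\u030C"
order = [unicodedata.name(ch) for ch in composed_by_hand]
print(order)  # ['LATIN CAPITAL LETTER S', 'COMBINING CARON']
```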
Design principles
Principle 5: plain text and rich text
  Unicode encodes unformatted plain text, where the rendering aim is legibility only
  Formatting is extra data, which gives rich text
  To preserve plain text requirements?
    Have layers of plain text representing the characters and how they are formatted
    Use mark-up languages: content + tags
Design principles
Principle 6: unification
  Share characters where you can:
    Mixed writing systems
    Ideographs common to CJK
    Punctuation
Character semantics
  Character name
  Representative glyph
  Properties
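For the lab, character names can be browsed with Python's standard `unicodedata` module. This is one possible tool, not necessarily the one used in the session. The representative glyph is a matter for the code charts and fonts, so it is not available this way.

```python
import unicodedata

# Look up the formal name of a few characters by code point.
samples = ["A", "\u0160", "\u4E94"]
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in samples}
for code_point, name in names.items():
    print(code_point, name)

# The reverse lookup goes from a name back to the character.
found = unicodedata.lookup("LATIN CAPITAL LETTER S WITH CARON")
print(f"U+{ord(found):04X}")  # U+0160
```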
Property 1: Case
A letter in the alphabet has several variants
  UPPERCASE variant
  lowercase variant
Five scripts have case: Latin, Greek, Cyrillic, Armenian, archaic Georgian
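A quick sketch of the case property using Python's built-in string methods. One sample letter is taken from four of the cased scripts; Georgian is left out because its case mappings are less straightforward. The sample letters are this example's choice.

```python
# One uppercase letter per script, written as escapes where not ASCII.
samples = {
    "Latin": "A",
    "Greek": "\u03A9",     # capital omega
    "Cyrillic": "\u0416",  # capital zhe
    "Armenian": "\u0531",  # capital ayb
}

lowered = {script: upper.lower() for script, upper in samples.items()}
for script, upper in samples.items():
    lower = lowered[script]
    # Each letter maps to a different lowercase form and back again.
    print(script, f"U+{ord(upper):04X}", f"U+{ord(lower):04X}", lower.upper() == upper)

# A script without case: the character is unchanged by both mappings.
uncased = "\u4E94"
unchanged = uncased.upper() == uncased and uncased.lower() == uncased
print(unchanged)  # True
```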
Property 2: Decomposition
A character which is equivalent to a sequence of one or more other characters
  Š = S + ˇ
  U+0160 (Latin Extended-A) = U+0053 (Basic Latin) + U+030C (Combining Diacritical Marks)
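The Š example can be reproduced with Python's `unicodedata` module. This is a sketch of one way to inspect the decomposition property; NFD and NFC are the standard normalization forms that decompose and recompose.

```python
import unicodedata

precomposed = "\u0160"  # LATIN CAPITAL LETTER S WITH CARON

# The decomposition property is reported as hex code points.
mapping = unicodedata.decomposition(precomposed)
print(mapping)  # 0053 030C

# NFD splits the character into base + combining mark ...
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+0053', 'U+030C']

# ... and NFC puts the pieces back together.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)  # True
```

The two spellings compare unequal as raw strings, which is why text is usually normalized before comparison.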
Property 3: Combining class
Base character
  i.e. no special graphical combining behaviour when following another character
Combining character
  Some characters change shape or position when combining with other characters
  Non-spacing combining character
    Does not take up space, e.g. diacritics
  Spacing combining character
    Takes up space as though a base character
Property 3: Combining class
Sequence is a convention: base character + combining character
Symbol: dotted circle, representing the space of the base character, with the combining character positioned relative to the circle
Stacking of diacritics follows the convention: move from the base character outwards
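A sketch of how the combining class and the spacing/non-spacing distinction show up in `unicodedata`. The general categories Mn and Mc stand for non-spacing and spacing marks. The sample marks are this example's choice.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"  # stands in for the base character in code charts

marks = {
    "caron": "\u030C",                     # sits above the base
    "dot below": "\u0323",                 # sits below the base
    "Devanagari vowel sign AA": "\u093E",  # takes up its own space
}

info = {}
for label, mark in marks.items():
    info[label] = (unicodedata.category(mark), unicodedata.combining(mark))
    # Show the mark on a dotted circle, as the code charts do.
    print(label, info[label], DOTTED_CIRCLE + mark)

# A base character has combining class 0.
base_class = unicodedata.combining("S")
print(base_class)  # 0

# Stacking: the mark nearest the base comes first in the sequence.
stacked = "a" + "\u0308" + "\u0304"  # diaeresis, then macron outside it
```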
Property 4: Directionality
Two directionality types:
  Left to Right
  Right to Left (Arabic, Hebrew, Syriac, Thaana)
Logical sequence: Left to Right
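Directionality can be inspected per character. In the sketch below, `unicodedata.bidirectional` returns the character's bidirectional class: L for left to right, R or AL for right to left. The helper `is_right_to_left` and the sample letters are assumptions made for this example.

```python
import unicodedata


def is_right_to_left(ch):
    """True for strong right-to-left characters (hypothetical helper)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "Syriac alaph": "\u0710",
    "Thaana haa": "\u0780",
}

classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}
for label, ch in samples.items():
    print(label, classes[label], is_right_to_left(ch))
```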
Property 5: General Category
The full character space is partitioned into several major categories:
  Letters
  Punctuation
  Symbols
  Numbers
Examples of general category codes:
  Lu: letter, uppercase; Ll: letter, lowercase
  Nd: number, decimal digit; No: number, other
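The two-letter category codes can be read with `unicodedata.category`. A short sketch with a handful of sample characters, chosen for this example:

```python
import unicodedata

samples = ["A", "a", "5", "\u00BD", "(", "\u20AC"]  # includes 1/2 and the euro sign

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, code in categories.items():
    # First letter is the major category, second the subdivision.
    print(f"U+{ord(ch):04X}", code)
```

The first letter of each code gives the major category (L, N, P, S), so grouping by `code[0]` recovers the partition listed above.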
Property 6: Numeric value
For characters that represent numbers
  Decimal digits
  Fractions
  Subscripts and superscripts
  Currency numerators
  Portion of the CJK ideographs: e.g. U+4E94
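A sketch of reading the numeric value property with `unicodedata.numeric`, covering a decimal digit, a fraction, a superscript, a subscript and the ideograph U+4E94 mentioned above. The other sample characters are this example's choice.

```python
import unicodedata

samples = ["7", "\u00BD", "\u00B2", "\u2083", "\u4E94"]

values = {f"U+{ord(ch):04X}": unicodedata.numeric(ch) for ch in samples}
for code_point, value in values.items():
    print(code_point, value)

# Characters without a numeric value raise ValueError unless a
# default is supplied as the second argument.
no_value = unicodedata.numeric("A", None)
print(no_value)  # None
```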
Property 7: Mirrored property
For characters that have equivalent mirror-image characters, e.g. ‘(’
Important for directionality
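The mirrored property is a simple yes/no flag. A sketch using `unicodedata.mirrored`, which returns 1 for characters that should be drawn mirrored in right-to-left text and 0 otherwise:

```python
import unicodedata

samples = ["(", ")", "[", "<", "A", "!"]

flags = {ch: unicodedata.mirrored(ch) for ch in samples}
for ch, flag in flags.items():
    print(ch, flag)
```

Brackets and comparison signs are flagged; letters and most punctuation are not.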
Character properties
Summary
1. Case
2. Decomposition
3. Combining class
4. Directionality
5. General category
6. Numeric value
7. Mirrored property