Globalisation & Computer Systems week 5 1. Localisation presentations 2.Character representation and UNICODE UNICODE design principles UNICODE character semantics 3. Lab session finish code page work creating and browsing UNICODE characters

Upload: myron-webster

Post on 11-Jan-2016



Globalisation & Computer Systems week 5

1. Localisation presentations
2. Character representation and UNICODE
   UNICODE design principles
   UNICODE character semantics
3. Lab session
   finish code page work
   creating and browsing UNICODE characters


Representation and UNICODE

What about Chinese?
  Thousands of characters: 256 bit patterns are clearly not enough
Make the character codes bigger…
  With 16 bits (two 8-bit bytes) per code there are 65,536 bit patterns
  This is the approach UNICODE takes


Representation and UNICODE

Reference: The Unicode Standard, Version 4.1.0.

Online: http://www.unicode.org/unicode/uni2book/


UNICODE – design principles

Principle 1: 16-bit character codes
  In code pages, characters share the 8-bit code points; which character a byte stands for is determined by the code page used to interpret it
  In UNICODE each character is assigned its own unique code point
  65,536 code values are available (byte 1: 256 values × byte 2: 256 values)
  63,486 are for representing characters with a single 16-bit code; the remaining 2,048 are reserved as surrogates, used in pairs of 16-bit codes (32 bits in total)
  The pairs give a further 1,048,544 code values to cover all languages
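
The counts on this slide can be checked with a little arithmetic. The following is a minimal sketch in Python, assuming the published totals of 63,486 and 1,048,544 leave out the noncharacter code points; the constant and function names are illustrative and do not come from the slides.

```python
# Sketch of the 16-bit code-value arithmetic; all names here are illustrative.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1,024 code values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1,024 code values

total_16bit = 256 * 256                                   # byte 1 x byte 2
reserved = len(HIGH_SURROGATES) + len(LOW_SURROGATES)     # surrogate code values
pair_count = len(HIGH_SURROGATES) * len(LOW_SURROGATES)   # high x low combinations

# Assumption: the totals exclude noncharacters, i.e. U+FFFE and U+FFFF, plus
# the last two code points of each of the 16 planes reached through pairs.
single_codes = total_16bit - reserved - 2
paired_codes = pair_count - 2 * 16


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(total_16bit, reserved, single_codes, paired_codes)
print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])  # MUSICAL SYMBOL G CLEF
```

Each pair carries 10 bits in the high surrogate and 10 bits in the low surrogate, which is where the factor 1,024 × 1,024 comes from.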


UNICODE – design principles

Principle 2: allocation of code space
  General scripts area: alphabetic scripts
  CJK ideographs: 27,484 ideographs
  Hangul syllables: 11,172 Korean Hangul syllables
  First 128 code points are for Latin (the same as ASCII)
  Punctuation symbols are grouped together
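
Two of these allocations are easy to confirm from Python. This is a small sketch, assuming the Hangul Syllables block runs from U+AC00 to U+D7A3 (the block limits are not given on the slide) and using the standard `unicodedata` module.

```python
import unicodedata

# The first 128 code points coincide with ASCII.
ascii_text = bytes(range(128)).decode("ascii")
first_128 = "".join(chr(cp) for cp in range(128))

# Assumption: Hangul Syllables block limits, not stated on the slide.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
hangul_count = HANGUL_LAST - HANGUL_FIRST + 1

print(ascii_text == first_128)
print(hangul_count)
print(unicodedata.name(chr(HANGUL_FIRST)))
```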


UNICODE – design principles

Principle 3: efficiency
  All characters have equal status, i.e. no escape characters
  Characters of a common script are grouped together as far as possible
  Common punctuation is shared


Design principles

Principle 4: logical and display order
  Logical order: how the codes are ordered in memory; follows the time sequence of input
  …and ‘logically’ that is L-R, whatever the display direction
  Dynamically composed characters: the base character is ordered ‘before’, i.e. to the left of, the modifying character
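
Logical order can be seen by indexing into a string. The sketch below is illustrative only: the Hebrew word and the S + caron sequence are examples chosen here, not taken from the slides.

```python
import unicodedata

# Hebrew "shalom", typed shin, lamed, vav, final mem. It is displayed right
# to left, but index 0 still holds the first letter typed.
word = "\u05E9\u05DC\u05D5\u05DD"
first_typed = unicodedata.name(word[0])

# Dynamically composed S with caron: base letter first, modifier second.
composed = "S\u030C"
roles = ["combining" if unicodedata.combining(ch) else "base" for ch in composed]

print(first_typed)
print(roles)
```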


Design principles

Principle 5: plain text and rich text
  Unicode encodes unformatted plain text, where the rendering aim is legibility only
  Formatting is extra data, which gives rich text
  How to preserve the plain text requirements?
    Have layers: plain text representing the characters, plus data on how they are formatted
    Use mark-up languages: content + tags


Design principles

Principle 6: unification
  Share characters where you can:
    Mixed writing systems
    Ideographs common to CJK
    Punctuation


Character semantics
  Character name
  Representative glyph
  Properties
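
Character names can be browsed programmatically. This is a brief sketch with Python's `unicodedata` module, using Š as an arbitrary example; the representative glyph is not part of this data and depends on the font in use.

```python
import unicodedata

ch = "\u0160"  # Latin capital S with caron, chosen as an example
name = unicodedata.name(ch)

# The mapping also works in the other direction, from name to character.
same = unicodedata.lookup("LATIN CAPITAL LETTER S WITH CARON")

print(f"U+{ord(ch):04X} {name}")
print(same == ch)
```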


Property 1: Case

A letter in the alphabet has several variants
  UPPERCASE variant
  lowercase variant
Five scripts have case:
  Latin, Greek, Cyrillic, Armenian, archaic Georgian
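
The case property is what string case conversion relies on. The sketch below lowercases one capital letter from four of the cased scripts; archaic Georgian is left out because its handling depends on the Unicode version built into the runtime. A Hebrew letter is included as an example of a script without case.

```python
# One capital letter per script; the choice of letters is illustrative.
samples = {
    "Latin": "A",
    "Greek": "\u03A3",     # capital sigma
    "Cyrillic": "\u0414",  # capital de
    "Armenian": "\u0531",  # capital ayb
}
lowered = {script: ch.lower() for script, ch in samples.items()}

# Hebrew has no case, so conversion leaves the letter unchanged.
uncased = "\u05D0"  # alef

for script, ch in samples.items():
    print(script, ch, "->", lowered[script])
print(uncased.upper() == uncased)
```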


Property 2: Decomposition

A character which is equivalent to a sequence of one or more other characters
  Š = S + ˇ
  0160 (Latin Extended-A) = 0053 (Basic Latin) + 030C (Combining Diacritical Marks)
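
The decomposition of Š can be read from the character database and applied through normalisation. This is a minimal sketch using Python's `unicodedata`; NFD and NFC are the standard decomposed and composed normalisation forms, which the slide does not name.

```python
import unicodedata

s_caron = "\u0160"

# The decomposition mapping is stored as a string of hexadecimal code points.
mapping = unicodedata.decomposition(s_caron)

# NFD applies the decomposition; NFC composes the sequence again.
decomposed = unicodedata.normalize("NFD", s_caron)
recomposed = unicodedata.normalize("NFC", decomposed)

print(mapping)
print([f"U+{ord(ch):04X}" for ch in decomposed])
print(recomposed == s_caron)
```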


Property 3: Combining class

Base character
  i.e. no special graphical combining behaviour when following another character
Combining character
  Some characters change shape or position when combining with other characters
  Non-spacing combining character
    Does not take up space, e.g. diacritics
  Spacing combining character
    Takes up space as though a base character
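
One way to tell these kinds of character apart is through the general category (`Mn` for non-spacing marks, `Mc` for spacing marks) together with the canonical combining class. The three characters below are examples chosen for this sketch, not taken from the slides.

```python
import unicodedata

examples = {
    "base": "A",
    "non-spacing": "\u030C",  # COMBINING CARON
    "spacing": "\u0903",      # DEVANAGARI SIGN VISARGA
}

# Pair each example with (general category, canonical combining class).
info = {
    kind: (unicodedata.category(ch), unicodedata.combining(ch))
    for kind, ch in examples.items()
}

for kind, (category, combining_class) in info.items():
    print(kind, category, combining_class)
```

A combining class of 0 does not by itself mean "base character": spacing marks such as the visarga also have class 0, which is why the category is checked as well.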


Property 3: Combining class

Sequence is a convention:
  Base character + combining character
Symbol: dotted circle, representing the space of the base character; the combining character is positioned relative to the circle
Stacking of diacritics follows the convention:
  Move from the base character outwards
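
Because marks stack outwards from the base, the stored order of two marks on the same side is significant, while marks on different sides do not interact. A sketch of how this shows up under NFD normalisation follows; the particular marks are examples chosen here.

```python
import unicodedata

# Diaeresis (above, class 230) followed by dot below (class 220): the marks
# sit on different sides, so normalisation may reorder them by class.
above_then_below = "a\u0308\u0323"
reordered = unicodedata.normalize("NFD", above_then_below)

# Diaeresis then macron, both above (class 230): the order is the stacking
# order, so normalisation leaves it alone and the swapped order stays distinct.
two_above = "a\u0308\u0304"
swapped = "a\u0304\u0308"

print([f"U+{ord(ch):04X}" for ch in reordered])
print(unicodedata.normalize("NFD", two_above) == two_above)
print(unicodedata.normalize("NFD", two_above) == unicodedata.normalize("NFD", swapped))
```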


Property 4: Directionality

Two directionality types:
  Left to Right
  Right to Left (Arabic, Hebrew, Syriac, Thaana)
Logical sequence: Left to Right
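
The directionality of a character can be queried directly. In this sketch the sample letters are chosen here for illustration. Note that the database uses finer-grained class codes than the two types on the slide: both `R` and `AL` (Arabic letter) are right-to-left classes.

```python
import unicodedata

samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "Syriac alaph": "\u0710",
    "Thaana haa": "\u0780",
}
bidi = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in bidi.items():
    direction = "right to left" if bidi_class in ("R", "AL") else "left to right"
    print(label, bidi_class, direction)
```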


Property 5: General Category

The full character space is partitioned into several major categories:
  Letters
  Punctuation
  Symbols
  Numbers
Examples of general category codes:
  Lu: letter, uppercase; Ll: letter, lowercase
  Nd: number, decimal digit; No: number, other
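
The category codes can be looked up per character. This is a short sketch with sample characters chosen here; `Po` (punctuation, other) and `Sm` (symbol, math) are further codes from the same scheme that are not listed on the slide.

```python
import unicodedata

chars = ["A", "a", "5", "\u00BD", "!", "+"]  # U+00BD is the fraction one half
categories = [unicodedata.category(ch) for ch in chars]

for ch, category in zip(chars, categories):
    print(f"U+{ord(ch):04X}", category)
```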


Property 6: Numeric value

For characters that represent numbers
  Decimal digits
  Fractions
  Subscripts and superscripts
  Currency numerators
  Portion of the CJK ideographs, e.g. U+4E94 (五, five)
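
Python's `unicodedata` exposes the numeric value through three functions of increasing generality: `decimal`, `digit` and `numeric`. The sketch below assumes the runtime's character database includes numeric values for CJK ideographs, as CPython's does.

```python
import unicodedata

decimal_five = unicodedata.decimal("5")       # decimal digit
half = unicodedata.numeric("\u00BD")          # vulgar fraction one half
superscript_two = unicodedata.digit("\u00B2") # superscript two
cjk_five = unicodedata.numeric("\u4E94")      # CJK ideograph for five

# Characters with no numeric value return the supplied default.
letter = unicodedata.numeric("A", None)

print(decimal_five, half, superscript_two, cjk_five, letter)
```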


Property 7: Mirrored property

For characters that have equivalent mirror-image characters, e.g. ‘(’ and ‘)’
Important for directionality
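
The mirrored flag can be read for any character. This is a small sketch; note that `unicodedata` only reports whether a character is mirrored in right-to-left text, not which character is its mirror image.

```python
import unicodedata

# 1 means the glyph is mirrored in right-to-left text, 0 means it is not.
flags = {ch: unicodedata.mirrored(ch) for ch in "()[]<A"}

for ch, flag in flags.items():
    print(repr(ch), flag)
```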


Character properties

Summary

1. Case
2. Decomposition
3. Combining class
4. Directionality
5. General category
6. Numeric value
7. Mirrored property