
Page 1: Supplementary Character Support in Microsoft Products

Michael S. Kaplan

Globalization Infrastructure and Font Technology

Windows International

Microsoft

24-26 March 2003, Prague, Czech Republic (IUC23)

Page 2: What are supplementary characters?

"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"

Page 3: High/low surrogate?

High: U+D800 - U+DBFF

Low: U+DC00 - U+DFFF

Terminology:

– "surrogate pair" preferred over "surrogate character"

See http://www.trigeminal.com/16to32AndBack.asp
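The two ranges above can be checked with a few lines of Python. This is only a sketch: the helper names (is_high_surrogate, is_low_surrogate, utf16_units) are made up for illustration and are not part of any Microsoft API.

```python
# Surrogate ranges as given on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit):
    """True if a 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit):
    """True if a 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def utf16_units(text):
    """Return the UTF-16 code units of a Python string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_units("A\U0002040A")
print([hex(u) for u in units])  # ['0x41', '0xd841', '0xdc0a']
print([is_high_surrogate(u) for u in units])  # [False, True, False]
```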

Page 4: Conversion example #1

Example #1:

– The first surrogate pair (D800, DC00), converted to UTF-32:

1. D800: binary 1101100000000000 (lower ten bits: 0000000000)

2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)

3. Concatenate the two ten-bit pieces: 0000000000 + 0000000000 = x00000

4. Add x10000

Result: U+10000. This makes sense, since the first supplementary character follows immediately after the last character in the 16-bit Unicode range (U+FFFF)
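The four steps above can be sketched in Python as follows. The function name surrogate_pair_to_code_point is a hypothetical helper chosen for this illustration.

```python
def surrogate_pair_to_code_point(high, low):
    """Combine a high/low surrogate pair into one UTF-32 code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    high_bits = high & 0x3FF  # steps 1 and 2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits  # step 3: concatenate into 20 bits
    return combined + 0x10000  # step 4: add x10000


print(hex(surrogate_pair_to_code_point(0xD800, 0xDC00)))  # 0x10000
```

The same function applied to the last possible pair (DBFF, DFFF) gives U+10FFFF, the top of the Unicode code space.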

Page 5: Conversion example #2

Example #2:

– You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16

1. Subtract x10000 - Result: x1040A

2. Split into two ten-bit pieces: 0001000001 0000001010

3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001) - Result: 1101100001000001 (D841)

4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010) - Result: 1101110000001010 (DC0A)

Your surrogate pair: D841, DC0A
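The reverse direction can be sketched the same way. The name code_point_to_surrogate_pair is an illustrative helper, not an existing API; the last line cross-checks the result against Python's own UTF-16 encoder.

```python
def code_point_to_surrogate_pair(code_point):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %X" % code_point)
    offset = code_point - 0x10000  # step 1: subtract x10000
    high_bits = offset >> 10  # step 2: upper ten bits
    low_bits = offset & 0x3FF  # step 2: lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = code_point_to_surrogate_pair(0x2040A)
print("%04X, %04X" % (high, low))  # D841, DC0A

# Cross-check against the built-in encoder.
print(chr(0x2040A).encode("utf-16-be").hex())  # d841dc0a
```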

Page 6: UTF-8 conversions

Illegal conversion: six-byte UTF-8 (the two surrogate code units of UTF-16, each converted separately to three bytes)

Legal conversion: four-byte UTF-8 (one UTF-32 code point)

CESU-8 is the inverse of the above: it uses the six-byte form and does not allow the four-byte form
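As a rough illustration, Python's standard codecs show the difference between the two forms. Python has no built-in CESU-8 codec; the "surrogatepass" error handler is used here only as a convenient way to produce the six-byte sequence for comparison. The helper is_strict_utf8 is a name invented for this sketch.

```python
ch = "\U0002040A"

# Legal form: one code point, four bytes.
four_byte = ch.encode("utf-8")
print(four_byte.hex())  # f0a0908a

# Six-byte form: take the two UTF-16 code units and encode each one
# separately as if it were a character in its own right.
units = ch.encode("utf-16-be")
surrogates = "".join(
    chr(int.from_bytes(units[i:i + 2], "big")) for i in range(0, len(units), 2)
)
six_byte = surrogates.encode("utf-8", "surrogatepass")
print(six_byte.hex())  # eda181edb08a


def is_strict_utf8(data):
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(four_byte))  # True
print(is_strict_utf8(six_byte))   # False
```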

Page 7: UTF-8 example

A Unicode surrogate pair:

aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx

becomes incorrect UTF-8 totaling 6 bytes when each code unit is converted separately:

1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx

Instead, you should take a Unicode surrogate pair:

110110wwwwzzzzyy, 110111yyyyxxxxxx

and convert it to UTF-8 totaling 4 bytes (below, uuuuu = wwww + 1):

11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
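The bit layout above can be followed literally in code. This is a minimal sketch that assumes its inputs are already a valid high/low pair; surrogate_pair_to_utf8 is a hypothetical helper name, and the variable names mirror the letters used on the slide.

```python
def surrogate_pair_to_utf8(high, low):
    """Convert a surrogate pair directly to the four-byte UTF-8 form."""
    # high = 110110wwwwzzzzyy, low = 110111yyyyxxxxxx
    wwww = (high >> 6) & 0xF
    zzzz = (high >> 2) & 0xF
    yy = high & 0x3            # top two bits of yyyyyy
    yyyy = (low >> 6) & 0xF    # bottom four bits of yyyyyy
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1
    yyyyyy = (yy << 4) | yyyy
    return bytes([
        0xF0 | (uuuuu >> 2),                  # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,   # 10uuzzzz
        0x80 | yyyyyy,                        # 10yyyyyy
        0x80 | xxxxxx,                        # 10xxxxxx
    ])


print(surrogate_pair_to_utf8(0xD841, 0xDC0A).hex())  # f0a0908a
```

The result matches what Python's built-in encoder produces for U+2040A.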

Page 8: Encoding choices for MS

UTF-16, mostly

Occasionally UTF-8

Even more occasionally, UTF-32

REASONS:

There was an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16.

A completely new API set was not required.

A move to UTF-32 would require twice as much space for every BMP character.

A move to UTF-8 would require more space in many cases (three bytes instead of two for BMP characters from U+0800 upward).
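The space trade-offs between the three encoding forms can be checked directly. The sketch below measures byte counts for three sample characters chosen for this illustration (they are not from the slides); encoded_sizes is a made-up helper name.

```python
def encoded_sizes(text):
    """Return the byte length of text in each encoding form (no byte order mark)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


samples = {
    "U+0041 (Latin A)": "A",
    "U+4E2D (BMP ideograph)": "\u4e2d",
    "U+2040A (Plane 2 ideograph)": "\U0002040A",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
# U+0041 (Latin A) {'UTF-8': 1, 'UTF-16': 2, 'UTF-32': 4}
# U+4E2D (BMP ideograph) {'UTF-8': 3, 'UTF-16': 2, 'UTF-32': 4}
# U+2040A (Plane 2 ideograph) {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```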

Page 9: The products...

Mostly the new generation of products:

– Windows 2000/XP

– Office XP (some support in Office 2000)

– Visual Studio .NET

– .NET’s Common Language Runtime (CLR)

Most (all) of these products supported Unicode already

– a little bit of extra work needed for supplementary characters

– usually just UTF-8 changes were needed

Page 10: Windows 2000

Uniscribe support for rendering

Each surrogate pair is a single grapheme

APIs like CharPrev/CharNext not changed

No specific surrogate font/IME

Must be turned on: http://msdn.microsoft.com/library/en-us/intl/unicode_192r.asp

Page 11: Windows XP

*.* (everything) from Windows 2000

Turned on by default!

GDI+ support for rendering Font CMAP extensions

Lots of UTF-8 issues fixed

No specific surrogate font/IME (yet)

Extensions to fallback fonts [limited]:

HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane1

HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane2

HKLM\Software\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback\Plane3

(etc.)

Page 12: Other system components

MLang

Internet Explorer

– http://i18nWithVB.com/surrogate_ime/

IIS 5.0/6.0

Page 13: The downlevel story

No good support for Unicode, let alone supplementary characters

Uniscribe/RichEdit does improve the downlevel story for display purposes

Officially, no support on Win9x

Page 14: The Office suite

Word

FrontPage

Excel/Access

Outlook

RichEdit 4.0

Page 15: Office - Specific Features

Insertion/Deletion of text - All

Cursor movement - All

Font linking/fallback - All (Word's is best)

UTF-8 issues fixed - All

Enhanced word breaking - All (Word/RichEdit)

Vertical text - Word/PowerPoint/Publisher/RichEdit

Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

Page 16: CHS/CHT/CHP Office

The product and the langpacks support an extended Unicode IME that handles supplementary characters

An Extension B font is also included

Page 17: .NET CLR/Visual Studio .NET

String class and globalization namespace

StringInfo

GetTextElementEnumerator

– Handles supplementary characters

– Also handles composite characters

GDI+

VS IDE support
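To give a feel for what enumerating text elements means, here is a rough Python analogue. It is not the .NET implementation and does not call StringInfo; it assumes a text element is a base character (one code unit or one surrogate pair) followed by any combining marks, which is a simplification. The names utf16_units and text_elements are invented for this sketch.

```python
import unicodedata


def utf16_units(text):
    """Return the UTF-16 code units of a Python string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def text_elements(units):
    """Group UTF-16 code units into lists, one list per text element."""
    elements = []
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            width = 2
            cp = ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00) + 0x10000
        else:
            width = 1
            cp = units[i]
        # Assumption: any character in a Mark category (Mn, Mc, Me)
        # attaches to the preceding element.
        is_mark = unicodedata.category(chr(cp)).startswith("M")
        if is_mark and elements:
            elements[-1].extend(units[i:i + width])
        else:
            elements.append(list(units[i:i + width]))
        i += width
    return elements


units = utf16_units("e\u0301\U0002040Ab")
print(len(units))                 # 5 code units
print(len(text_elements(units)))  # 3 text elements
```

Stepping through the same string one code unit at a time would land in the middle of the surrogate pair, which is the problem this kind of enumeration avoids.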

Page 18: SQL Server

Past - no support (for Unicode, even!)

Present - surrogate "safe" (neutral)

Future - surrogate "aware"

Page 19: Items not [currently] supported

Character Map

Graph 10

Outlook 10 mail headers

Fonts/IMEs

"Collations" for supplementary characters

Page 20: Collation plan for supplementary characters in the UCA?

All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.

All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.

All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters.

Plane 14 language tags will be treated as if they were unassigned.

All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.
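The ordering described above can be pictured as a coarse bucket per character. The sketch below is only an illustration of that ordering, not the UCA algorithm and not any Microsoft collation. It makes several assumptions that the slide does not state: BMP ideographs are taken to be U+3400-U+4DBF and U+4E00-U+9FFF, unassigned planes are placed after Plane 2 and before private use, and ties inside a bucket are broken by code point. The names plane_bucket and sketch_sort_key are hypothetical.

```python
def plane_bucket(ch):
    """Return a coarse ordering bucket for one character (assumptions above)."""
    cp = ord(ch)
    plane = cp >> 16
    if plane == 0:
        is_ideograph = 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF
        return 2 if is_ideograph else 0  # BMP ideographs / other BMP scripts
    if plane == 1:
        return 1  # after other non-ideographic scripts, before ideographs
    if plane == 2:
        return 3  # after the BMP ideographs
    if plane >= 15:
        return 5  # private use planes: after everything else
    return 4      # planes 3-14: treated as unassigned (position assumed)


def sketch_sort_key(text):
    """Sort key: bucket first, then code point, for each character."""
    return [(plane_bucket(ch), ord(ch)) for ch in text]


chars = ["\U000F0000", "\U0002040A", "\u4e2d", "\U00010400", "A"]
ordered = sorted(chars, key=sketch_sort_key)
print(["U+%04X" % ord(c) for c in ordered])
# ['U+0041', 'U+10400', 'U+4E2D', 'U+2040A', 'U+F0000']
```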

Page 21: Questions?

Page 22: Supplementary Character Support in Microsoft Products

Don’t forget to fill out your evals!