Surrogate Support in Microsoft Products

Michael S. Kaplan
Software Design Engineer
Trigeminal Software, Inc.

26 April 2001, IUC 18 (Hong Kong)


What are surrogates?

"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"


High/low surrogate?

High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology:

– "surrogate pair" preferred over "surrogate character"
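As a minimal sketch, the two ranges can be tested on 16-bit code units as shown here. Python is used for illustration only, and the function names are hypothetical; they are not part of any Microsoft API.

```python
# Surrogate ranges as listed on the slide.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a high (leading) surrogate."""
    return HIGH_START <= unit <= HIGH_END


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit is a low (trailing) surrogate."""
    return LOW_START <= unit <= LOW_END


def is_surrogate_pair(first: int, second: int) -> bool:
    """True only for a high surrogate followed by a low surrogate."""
    return is_high_surrogate(first) and is_low_surrogate(second)
```

Order matters: a low surrogate followed by a high surrogate is not a pair.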


Conversion example #1

Example #1:

– The first character in the Surrogate range (D800, DC00) as UTF-32:

1. D800: binary 1101100000000000 (lower ten bits: 0000000000)

2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)

3. Concatenate 0000000000+0000000000 = x00000 (twenty zero bits)

4. Add x10000

Result: U+10000. This makes sense, since the first character encoded with a surrogate pair follows immediately after the last character in the 16-bit Unicode range (U+FFFF).
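The four steps above can be sketched in Python as follows. The function name is an assumption made for illustration; it is not taken from the presentation or from any Windows API.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    high_bits = high & 0x3FF  # steps 1 and 2: keep the lower ten bits of each unit
    low_bits = low & 0x3FF
    combined = (high_bits << 10) | low_bits  # step 3: concatenate into 20 bits
    return combined + 0x10000  # step 4: add x10000


print(f"U+{surrogate_pair_to_code_point(0xD800, 0xDC00):04X}")  # prints U+10000
```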


Conversion example #2

Example #2:

– You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16

1. Subtract x10000 - Result: 1040A

2. Split into two ten-bit pieces: 0001000001 0000001010

3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001) - Result: 1101100001000001 (D841)

4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010) - Result: 1101110000001010 (DC0A)

Your surrogate pair: D841, DC0A
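The same steps, written as a short Python sketch. The function name is hypothetical and chosen only for this illustration.

```python
def code_point_to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000  # step 1: subtract x10000
    high_bits = offset >> 10  # step 2: upper ten bits
    low_bits = offset & 0x3FF  # step 2: lower ten bits
    return 0xD800 + high_bits, 0xDC00 + low_bits  # steps 3 and 4


high, low = code_point_to_surrogate_pair(0x2040A)
print(f"{high:04X}, {low:04X}")  # prints D841, DC0A
```

Python's own encoder gives the same two code units for `chr(0x2040A).encode("utf-16-be")`.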


UTF-8 conversions

Illegal conversions: six-byte UTF-8 (two surrogate code points of UTF-16, converted separately)

Legal conversions: four-byte UTF-8 (one UTF-32 code point)


UTF-8 example

Unicode surrogate pair: aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx

becomes incorrect UTF-8 totaling 6 bytes: 1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx

Instead, you should take a Unicode surrogate pair:

110110wwwwzzzzyy, 110111yyyyxxxxxx

and convert it to UTF-8 totaling 4 bytes (below, uuuuu = wwww + 1):

11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
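A minimal Python sketch of both forms, for comparison. The function names are assumptions made for this illustration. Adding x10000 to the combined 20 bits is equivalent to the "wwww + 1" adjustment in the bit layout above.

```python
def utf8_from_surrogate_pair(high: int, low: int) -> bytes:
    """Correct form: combine the pair first, then emit four UTF-8 bytes."""
    cp = (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
    return bytes([
        0xF0 | (cp >> 18),           # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])


def incorrect_six_byte_form(high: int, low: int) -> bytes:
    """Incorrect form: each surrogate converted separately as a 3-byte sequence."""
    def three_bytes(unit: int) -> bytes:
        return bytes([
            0xE0 | (unit >> 12),
            0x80 | ((unit >> 6) & 0x3F),
            0x80 | (unit & 0x3F),
        ])
    return three_bytes(high) + three_bytes(low)


print(utf8_from_surrogate_pair(0xD841, 0xDC0A).hex(" "))  # prints f0 a0 90 8a
print(incorrect_six_byte_form(0xD841, 0xDC0A).hex(" "))   # prints ed a1 81 ed b0 8a
```

A strict UTF-8 decoder, such as Python's `bytes.decode("utf-8")`, rejects the six-byte form.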


Encoding choices for MS

UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32

REASONS:

– There was an existing, well-tested set of APIs that support UCS-2, which is a subset of UTF-16. A completely new API set was not required.

– A move to UTF-32 would require twice as much space for all BMP characters.

– A move to UTF-8 would require up to 50% more space in many cases (three bytes instead of two for BMP characters from U+0800 up, which includes CJK).
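The space trade-off can be checked by measuring encoded sizes directly. This is a small sketch in Python; the helper name and the sample strings are chosen here for illustration and do not come from the presentation.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in each encoding (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII": "surrogate",          # 9 characters
    "CJK (BMP)": "\u4e2d\u6587",   # 2 characters
    "Plane 2": "\U0002040A",       # 1 supplementary character
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
# ASCII {'UTF-8': 9, 'UTF-16': 18, 'UTF-32': 36}
# CJK (BMP) {'UTF-8': 6, 'UTF-16': 4, 'UTF-32': 8}
# Plane 2 {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```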


The products...

Mostly the new generation of products:
– Windows 2000/XP
– Office XP (some support in Office 2000)

Most of these products supported Unicode already
– a little bit of extra work needed for surrogate pairs
– usually just UTF-8 support needed


Windows 2000/XP

Uniscribe/GDI+ support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
Extensions to fallback fonts in XP
Font CMAP extensions in XP
Lots of UTF-8 issues fixed in XP
No specific surrogate font/IME (yet)


Collation for Supplementary characters

All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.

All Plane 2 (ideographic) characters will be sorted after all the ideographs on the BMP.

All Plane 3-14 characters (currently not assigned) will be treated like any other unassigned characters. This includes the Plane 14 language tags.

All characters encoded in Plane 15-16 (private use) will be sorted after all other characters.
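The plane-based grouping above can be illustrated with a short Python sketch. It only classifies a code point by plane and returns a label paraphrasing the rules on this slide. It is not how Windows implements collation, and both function names are hypothetical.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    return code_point >> 16


def collation_group(code_point: int) -> str:
    """Label the sorting rule from the slide that applies to this code point."""
    plane = plane_of(code_point)
    if plane == 0:
        return "BMP: existing collation rules"
    if plane == 1:
        return "after other non-ideographic scripts, before ideographs"
    if plane == 2:
        return "after all BMP ideographs"
    if plane <= 14:
        return "treated like unassigned characters"
    return "private use: after all other characters"


print(plane_of(0x2040A))  # prints 2
```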


Other system components

MLang
Internet Explorer
IIS 5.0/6.0


The downlevel story

No good support for Unicode, let alone supplementary characters

Uniscribe/RichEdit does improve the downlevel story for display purposes, at least

Officially, no surrogate support on Win9x


The Office suite

Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0


Specific Features

Insertion/Deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit


CHS/CHT/CHP Office

The product and the langpacks support an extended Unicode IME that handles supplementary characters

An Extension B font is also included


Visual Studio[.NET]

String class and globalization namespace
StringInfo
GetTextElementEnumerator
– Handles supplementary characters
– Also handles composite characters

GDI+
IDE support
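To illustrate what enumerating "text elements" means for the GetTextElementEnumerator item above, here is a rough Python sketch that groups UTF-16 code units so that a surrogate pair, or a base character plus its combining marks, comes out as one element. It is not the .NET implementation. The function names are hypothetical, and the grouping rule (attach any character in a Unicode "M" category to the preceding element) is a simplifying assumption.

```python
import unicodedata


def utf16_units(text: str) -> list:
    """Return the text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def text_elements(units: list) -> list:
    """Group UTF-16 code units into user-visible text elements (simplified)."""
    elements = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # A surrogate pair is one supplementary character.
            cp = (((unit & 0x3FF) << 10) | (units[i + 1] & 0x3FF)) + 0x10000
            i += 2
        else:
            cp = unit
            i += 1
        ch = chr(cp)
        # Assumed rule: combining marks join the element that precedes them.
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


units = utf16_units("a\U0002040Ae\u0301")
print(len(units), len(text_elements(units)))  # prints 5 3
```

Five code units become three elements: "a", the Plane 2 character, and "e" with its combining acute accent.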


SQL Server

Past - no support
Present - surrogate "safe" (neutral)
Future - surrogate aware


Items not supported

Character Map
Graph 10
Outlook 10 mail headers
Collations for supplementary characters
Fonts/IMEs


Questions?
