Page 1: Sorting it all out: An introduction to collation Cathy Wissink Michael Kaplan Globalization Infrastructure and Font Technology Windows International Microsoft

Sorting it all out: An introduction to collation

Cathy Wissink

Michael Kaplan

Globalization Infrastructure and Font Technology

Windows International

Microsoft

25 March 2003, Prague, Czech Republic (IUC23)

Who is this talk geared towards?

This is a high-level introduction to the concepts of collation, assuming no prior knowledge.

Audience:
– Developers new to the concept
– People who need to understand collation well enough to “sell” this globalization feature to management
– Not intended to be a “nuts and bolts” talk (see the presentation immediately following!)

Collation: Used Every Day!

It may not be obvious, but you most likely use collation in some form every day:

Finding a mail slot for a colleague

Searching for an author at the bookstore

Library card catalog

Looking up a phone number

Anytime you order or search for data in a logical fashion within a structure, you use collation!

Collation, the definition:

The culturally expected ordering of linguistic characters in a particular language

Often referred to as sorting, ordering, alphabetizing

Informants recognize correct vs. incorrect collation for their language, but often have a hard time explaining the particular collation rules

Great definitions, but what do they mean, really?

Every language (every culture) has an expected result when users search for data in “sorted” order

If the ordering isn’t perfectly correct, users have a very hard time finding data

This ordering can be influenced by a number of linguistic and orthographic elements within a language

Examples of linguistic elements that impact collation

“Character” order

Casing (upper case vs. lower case)

Modifiers (diacritics, Indic matras, vowel marks)

Radicals (CJK)

Stroke counts (CJK)

Syllable structure (SE Asian languages)

Pronunciation

Collation in Action

Latin scripts: English, French, Lithuanian, Swedish, Traditional Spanish

Chinese variants (Taiwanese orders)

Devanagari script: Hindi, Marathi

Tamil script: Tamil

English:

French:

Lithuanian:

Swedish:

Spanish (Traditional):

Devanagari

Hindi: consonants with modifier marks (candrabindu U+0901, anusvara U+0902 or visarga U+0903) sort differently than the consonant alone.

A consonant and one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark.

Devanagari

कँ (Devanagari Ka + candrabindu)

कं (Devanagari Ka + anusvara)

कः (Devanagari Ka + visarga)

क (Devanagari Ka)

Devanagari

Hindi vs. Marathi

Two different languages within the Devanagari script, two different sorts of Lla (U+0933)

Devanagari

Hindi: U+0932 < U+0933 < U+0934; that is: ल < ळ < ऴ

Marathi: U+0939 < U+0933 < U+0915 U+094D U+0937 conjunct; that is: ह < ळ < क्ष
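The two orderings above can be sketched as per-language tables. This is only a toy model in Python: the tables hold just the three collation elements each slide line names, and `make_key`, `hindi_key` and `marathi_key` are hypothetical helpers, not part of any real collation library. The point it illustrates is that the same character (U+0933) sits in a different neighborhood depending on the language, and that a multi-code-point conjunct can act as a single sorting unit.

```python
# Toy per-language orderings, limited to the elements named on the slide.
HINDI_ORDER = ["\u0932", "\u0933", "\u0934"]                # la < lla < llla
MARATHI_ORDER = ["\u0939", "\u0933", "\u0915\u094d\u0937"]  # ha < lla < ksha conjunct


def make_key(elements):
    """Build a sort key function from an ordered list of collation elements."""
    rank = {element: position for position, element in enumerate(elements)}
    longest = max(len(element) for element in elements)

    def key(text):
        weights = []
        i = 0
        while i < len(text):
            # Try the longest element first so a conjunct is matched as one unit.
            for size in range(longest, 0, -1):
                chunk = text[i:i + size]
                if chunk in rank:
                    weights.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                raise ValueError(f"no weight defined for {text[i]!r} in this toy table")
        return weights

    return key


hindi_key = make_key(HINDI_ORDER)
marathi_key = make_key(MARATHI_ORDER)

marathi_sample = ["\u0915\u094d\u0937", "\u0933", "\u0939"]
print(sorted(marathi_sample, key=marathi_key))  # Marathi order from the slide
print(sorted(marathi_sample))                   # raw code point order, for contrast
```

Raw code point order puts the conjunct first because it begins with U+0915, which is the opposite of the Marathi expectation shown on the slide.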

Tamil

Consonant + virama (halant) combination has primary weight lighter than the consonant alone

Tamil

க் (Tamil Ka + virama)

க (Tamil Ka)

ங் (Tamil Nga + virama)

ங (Tamil Nga)

ச் (Tamil Ca + virama)

ச (Tamil Ca)

ஞ் (Tamil Nya + virama)

ஞ (Tamil Nya)
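The ordering listed above can be reproduced with a small sort key. This is a sketch under stated assumptions, not how any production collation engine assigns weights: `tamil_key` is a hypothetical helper, it ranks consonants by their code point (which happens to match the order of the four consonants shown), and it models "lighter" simply as a smaller second tuple field.

```python
VIRAMA = "\u0bcd"  # Tamil sign virama (pulli)


def tamil_key(text):
    """Toy key: consonant + virama sorts just before the bare consonant."""
    weights = []
    i = 0
    while i < len(text):
        consonant = text[i]
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            weights.append((ord(consonant), 0))  # lighter: with virama
            i += 2
        else:
            weights.append((ord(consonant), 1))  # heavier: consonant alone
            i += 1
    return weights


# The eight items from the slide, deliberately out of order.
items = ["\u0b9e", "\u0b95", "\u0b9a\u0bcd", "\u0b99", "\u0b95\u0bcd", "\u0b9e\u0bcd", "\u0b9a", "\u0b99\u0bcd"]
slide_order = sorted(items, key=tamil_key)
code_point_order = sorted(items)
print(slide_order)
print(code_point_order)
```

A plain code point sort places each bare consonant before its virama form, because the shorter string is a prefix of the longer one; the key above reverses that within each pair.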

Myths about collation

“Well, if I localize my product, these kinds of details don’t matter”

Myths about collation

“If I already use Unicode in my product, sorting is covered by this universal encoding”
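A quick way to see why this is a myth is to sort a few strings purely by code point, which is what a plain binary comparison of Unicode strings amounts to. The word list below is made up for illustration.

```python
# Python's default str ordering compares code points, with no linguistic rules.
words = ["zebra", "Zulu", "apple", "école", "Eiffel"]

by_code_point = sorted(words)
print(by_code_point)
# ['Eiffel', 'Zulu', 'apple', 'zebra', 'école']
```

All uppercase letters land before all lowercase ones, and the accented "é" lands after "z". Neither an English nor a French reader would look for the words in those positions, so the encoding by itself does not provide the culturally expected order.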

Myths about collation

“One collation is good enough for Europe*, right?”

* Replace with the market of your choice: Asia, North America, India, etc.

Myths about collation

“One collation is good enough for the Latin* script, right?”

* Replace with the script of your choice: Cyrillic, Han, Devanagari, etc.

Why should I care about all this?

Ideally, a well-globalized product uses culturally correct collation where the users expect it, for example:
– Address book
– Document filing system
– Database
– …

Your users will expect collation in a surprising number of places!

Collation Example

Yet another collation example

How do I make sure my users get the results they expect?

Collation usually needs to address the user’s expected ordering, not the linguistic ordering of the data source (these two can differ!)

Swedish user, German data

Multiple users, multilingual data

The Switzerland example
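The "Swedish user, German data" case above can be sketched with two toy sort keys. Both are simplifications written for this illustration and are assumptions, not the tables any real product uses: `swedish_key` ranks letters by the Swedish alphabet, where å, ä and ö follow z, and `german_key` approximates German dictionary ordering by comparing letters without their diacritics first. The three German words are sample data.

```python
import unicodedata

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(word):
    """Toy Swedish key: position of each letter in the Swedish alphabet."""
    normalized = unicodedata.normalize("NFC", word).lower()
    return [SWEDISH_ALPHABET.index(ch) for ch in normalized]


def german_key(word):
    """Toy German key: ignore diacritics first, use the original as a tie-break."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


german_data = ["Zug", "Öl", "Ofen"]

german_view = sorted(german_data, key=german_key)
swedish_view = sorted(german_data, key=swedish_key)
print(german_view)   # ['Ofen', 'Öl', 'Zug']
print(swedish_view)  # ['Ofen', 'Zug', 'Öl']
```

The data is identical in both cases. What changes is where each user will look for "Öl", which is why the ordering generally follows the user rather than the language of the data.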

How do I make sure my users get the results they expect?

Make sure you’re using a collation-aware mechanism to order data

Windows APIs such as CompareString, LCMapString

SQL Server 2000 collations

The .NET Framework's CompareInfo class

Except when you want non-linguistic collation…
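As a rough analogue of the collation-aware mechanisms listed above, here is a minimal Python sketch that delegates ordering to the platform's locale support through the standard `locale` module. It is not one of the Windows, SQL Server or .NET APIs named on the slide. `sort_for_locale` is a hypothetical helper, and locale names such as `"sv_SE.UTF-8"` are platform-dependent assumptions: they must be installed on the machine, and `setlocale` raises `locale.Error` if they are not.

```python
import locale


def sort_for_locale(words, locale_name=""):
    """Sort words using the platform collation for locale_name.

    An empty name means "the user's default locale". The previous
    collation setting is restored afterwards.
    """
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key that compares in collation order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# The "C" locale is always available and gives plain, non-linguistic order.
print(sort_for_locale(["b", "a", "C"], "C"))
```

Passing a language-specific locale instead of `"C"` lets the operating system supply the ordering rules, so the application does not carry its own tables. Note that `setlocale` changes process-wide state, so this pattern is unsuitable for code that sorts from several threads at once.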

When not to use linguistic collation

When consistency across different cultures is required
– “Case insensitive” file systems
– File extension names (.INF, .GIF, etc.)
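For identifiers such as file extensions, the comparison should give the same answer for every user. A minimal Python sketch of that idea follows; `has_extension` is a hypothetical helper, and it relies on `str.casefold()`, which applies the same locale-independent case folding regardless of the user's language settings.

```python
import os


def has_extension(filename, extension):
    """Compare a file's extension without any culture-specific casing rules."""
    actual = os.path.splitext(filename)[1]
    return actual.casefold() == extension.casefold()


print(has_extension("SETUP.INF", ".inf"))  # True
print(has_extension("photo.gif", ".GIF"))  # True
print(has_extension("notes.txt", ".inf"))  # False
```

A culture-sensitive comparison could answer differently here. Under Turkish casing rules, for example, lowercase "i" uppercases to a dotted "İ" rather than "I", so ".inf" and ".INF" might not match for a Turkish user.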

When not to use linguistic collation

When users expect data in a specific collation other than their own
– Excel column names
– “ASCII” order
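Spreadsheet column names are a handy illustration of an ordering that is neither linguistic nor plain code point order: shorter names come first, and names of equal length are ordered letter by letter. The sketch below is an assumption about how one might express that in Python; `excel_column_order` is a hypothetical helper, not an Excel API.

```python
def excel_column_order(names):
    """Order column names the way spreadsheet columns run: A..Z, then AA, AB, ..."""
    return sorted(names, key=lambda name: (len(name), name.upper()))


columns = ["AA", "B", "Z", "A", "AB"]
print(excel_column_order(columns))  # ['A', 'B', 'Z', 'AA', 'AB']
print(sorted(columns))              # ['A', 'AA', 'AB', 'B', 'Z']
```

Users in every locale expect the first result, so applying their own language's collation here would be a mistake.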

In summary…

Linguistically-aware collation is an important feature of any well-globalized product

Collation needs to be considered at the language level
– Encoding, region, script level not enough!

There are many collation-aware mechanisms out there (within the OS, for example); take advantage of them!

Other applicable IUC talks

Stay tuned for the second half of this tutorial!

Cathy's "Issues in Indic Collation" talk on Thursday afternoon

Other References

This tutorial's corresponding paper

Unicode Technical Note (UTN) #1
http://unicode.org/notes/tn1/

Nadine Kano, Developing International Software (out of print, but still available on the web)
http://microsoft.com/globaldev/dis_v1/disv1.asp

New! Developing International Software, 2nd edition (available now or very soon):
http://microsoft.com/globaldev/dis_v2/disv2.asp

Michael Kaplan, Internationalization with VB
http://i18nWithVB.com/

Questions?

Don't forget to fill out your evals!

Sorting it all out: An introduction to collation