Sorting it all out: An introduction to collation
Cathy Wissink
Michael Kaplan
Globalization Infrastructure and Font Technology
Windows International
Microsoft
25 March 2003, Prague, Czech Republic (IUC23)

Who is this talk geared towards?
This is a high-level introduction to the concepts of collation, assuming no prior knowledge.
Audience:
– Developers new to the concept
– People who need to understand collation enough to “sell” this globalization feature to management
– Not intended to be a “nuts and bolts” talk (see the presentation immediately following!)

Collation: Used Every Day!
It may not be obvious, but you most likely use collation in some form every day:
Finding a mail slot for a colleague
Searching for an author at the bookstore
Library card catalog
Looking up a phone number

Any time you order or search for data in a logical fashion within a structure, you use collation!

Collation, the definition:
The culturally expected ordering of linguistic characters in a particular language
Often referred to as sorting, ordering, alphabetizing
Informants recognize correct vs. incorrect collation for their language, but often have a hard time explaining the particular collation rules

Great definitions, but what do they mean, really?
Every language (every culture) has an expected result when users search for data in “sorted” order
If the ordering isn’t perfectly correct, users have a very hard time finding data
This ordering can be influenced by a number of linguistic and orthographic elements within a language

Examples of linguistic elements that impact collation
“Character” order
Casing (upper case vs. lower case)
Modifiers (diacritics, Indic matras, vowel marks)
Radicals (CJK)
Stroke counts (CJK)
Syllable structure (SE Asian languages)
Pronunciation
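One way to picture how casing and modifiers enter a comparison is a multi-level sort key: base letters are compared first, accents only break ties between otherwise equal letters, and case breaks whatever ties remain. The sketch below is a toy illustration under that assumption. The name `toy_key` and the level layout are hypothetical, and real collation tables are language-specific and far richer.

```python
import unicodedata

def toy_key(word):
    """Build a three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.lower())     # level 1: the letter itself
        secondary.append(marks)          # level 2: combining accents, if any
        tertiary.append(base.isupper())  # level 3: lower case before upper case
    return (primary, secondary, tertiary)

words = ["R\u00e9sum\u00e9", "resume", "resin", "r\u00e9sum\u00e9", "Resume"]
ordered = sorted(words, key=toy_key)
print(ordered)  # resin, resume, Resume, résumé, Résumé
```

Because the levels are compared in turn, an accent or case difference never outweighs a difference in the base letters.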

Collation in Action
Latin scripts: English, French, Lithuanian, Swedish, Traditional Spanish
Chinese variants (Taiwanese orders)
Devanagari script: Hindi, Marathi
Tamil script: Tamil

[Example slides for English, French, Lithuanian, Swedish, and Spanish (Traditional); slide images not preserved]

Devanagari
Hindi: consonants with modifier marks (candrabindu U+0901, anusvara U+0902 or visarga U+0903) sort differently than the consonant alone.
A consonant and one of these modifier marks has a lighter primary sorting weight than the same consonant without a modifier mark.

Devanagari
कँ (Devanagari Ka + candrabindu)
कं (Devanagari Ka + anusvara)
कः (Devanagari Ka + visarga)
क (Devanagari Ka)
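The Hindi ordering of these four forms can be mimicked with a small lookup table. This is only a sketch: the weights in `PRIMARY_WEIGHT` are hypothetical numbers that encode the relative order stated on the slide, not values from any real collation table, and only these four forms are covered.

```python
# Hypothetical primary weights: they only encode relative order.
KA = "\u0915"
PRIMARY_WEIGHT = {
    KA + "\u0901": 1,  # Ka + candrabindu
    KA + "\u0902": 2,  # Ka + anusvara
    KA + "\u0903": 3,  # Ka + visarga
    KA: 4,             # Ka alone: heavier than any Ka + modifier mark
}

forms = [KA, KA + "\u0903", KA + "\u0901", KA + "\u0902"]

by_weight = sorted(forms, key=PRIMARY_WEIGHT.__getitem__)
by_code_point = sorted(forms)  # plain code point comparison, for contrast

print(by_weight)      # marked forms first, bare Ka last
print(by_code_point)  # bare Ka first, because it is a prefix of the others
```

The contrast with the code point sort shows why the expected order cannot be read off the encoding alone.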

Devanagari
Hindi vs. Marathi
Two different languages within the Devanagari script, two different sort positions for Lla (U+0933)

Devanagari
Hindi: 0932 < 0933 < 0934; that is: ल < ळ < ऴ
Marathi: 0939 < 0933 < 0915+094d+0937 conjunct; that is: ह < ळ < क्ष

Tamil
Consonant + virama (halant) combination has primary weight lighter than the consonant alone

Tamil
க் (Tamil Ka + virama)
க (Tamil Ka)
ங் (Tamil Nga + virama)
ங (Tamil Nga)
ச் (Tamil Ca + virama)
ச (Tamil Ca)
ஞ் (Tamil Nya + virama)
ஞ (Tamil Nya)
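The Tamil rule can also be written as a computed weight rather than a table. In this sketch each consonant gets two adjacent slots and the virama form takes the lighter one. The function `toy_primary` and its weights are hypothetical, and only the four consonants from the slide are handled.

```python
VIRAMA = "\u0BCD"  # Tamil sign virama
CONSONANTS = ["\u0B95", "\u0B99", "\u0B9A", "\u0B9E"]  # Ka, Nga, Ca, Nya

def toy_primary(unit):
    """Rank one consonant, with or without a virama (hypothetical weights)."""
    base = CONSONANTS.index(unit[0])
    has_virama = unit.endswith(VIRAMA)
    # Two slots per consonant: the virama form takes the lighter one.
    return 2 * base + (0 if has_virama else 1)

# Start from a deliberately scrambled list of the eight forms.
units = [c + suffix for c in reversed(CONSONANTS) for suffix in ("", VIRAMA)]
ordered = sorted(units, key=toy_primary)
print(ordered)  # same order as the list on the slide
```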

Myths about collation
“Well, if I localize my product, these kinds of details don’t matter”

Myths about collation
“If I already use Unicode in my product, sorting is covered by this universal encoding”
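A quick way to see why this is a myth: Unicode assigns a number to each character, but comparing those numbers is not a linguistic sort. The short example below sorts a few English words by code point, which is what Python's built-in string comparison does. The word list is invented for illustration.

```python
# Plain string comparison in Python compares code points, nothing more.
words = ["zebra", "Zebra", "apple", "Apple", "\u00e9clair", "eclair"]
by_code_point = sorted(words)
print(by_code_point)
# ['Apple', 'Zebra', 'apple', 'eclair', 'zebra', 'éclair']
```

Every upper-case word lands before every lower-case one, and the accented word falls after "zebra", which is not the order a dictionary user would expect.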

Myths about collation
“One collation is good enough for Europe*, right?”
* Replace with the market of your choice: Asia, North America, India, etc.

Myths about collation
“One collation is good enough for the Latin* script, right?”
* Replace with the script of your choice: Cyrillic, Han, Devanagari, etc.
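As a hedged illustration of why one Latin-script collation is not enough, the sketch below contrasts a plain sort with a toy Traditional Spanish sort, in which "ch" and "ll" are treated as single letters that follow "c" and "l". That convention comes from general knowledge of the traditional alphabet rather than from the slides, and the table and function name are hypothetical. The sketch handles unaccented words only.

```python
# Traditional Spanish alphabet, with "ch" and "ll" as single letters.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t",
            "u", "v", "w", "x", "y", "z"]
RANK = {unit: i for i, unit in enumerate(ALPHABET)}

def traditional_spanish_key(word):
    """Split a word into letters, reading "ch" and "ll" as one unit each."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        unit = pair if len(pair) == 2 and pair in RANK else word[i]
        key.append(RANK[unit])  # raises KeyError for letters outside the table
        i += len(unit)
    return key

words = ["cuna", "chico", "dama", "llama", "luz", "curso"]
traditional = sorted(words, key=traditional_spanish_key)
plain = sorted(words)
print(traditional)  # ['cuna', 'curso', 'chico', 'dama', 'luz', 'llama']
print(plain)        # ['chico', 'cuna', 'curso', 'dama', 'llama', 'luz']
```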

Why should I care about all this?
Ideally, a well-globalized product uses culturally correct collation where the users expect it, for example:
– Address book
– Document filing system
– Database
– …
Your users will expect collation in a surprising number of places!

Collation Example

Yet another collation example

How do I make sure my users get the results they expect?
Collation usually needs to address the user’s expected ordering, not the linguistic ordering of the data source (these two can differ!)
Swedish user, German data
Multiple users, multilingual data
The Switzerland example
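The "Swedish user, German data" case can be sketched with two toy sort keys applied to the same list. The assumptions here come from general knowledge, not from the slides: Swedish treats å, ä and ö as separate letters after z, while German dictionary order sorts an umlaut with its base vowel. Both key functions and the sample words are hypothetical simplifications.

```python
import unicodedata

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"  # ends in å ä ö

def swedish_key(word):
    """Toy Swedish order: å, ä, ö are separate letters after z."""
    word = unicodedata.normalize("NFC", word).lower()
    # str.index raises ValueError for characters outside the toy alphabet.
    return [SWEDISH_ALPHABET.index(ch) for ch in word]

def german_key(word):
    """Toy German order: umlauts sort with their base vowel."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)  # the full form only breaks ties

german_data = ["Zimmer", "\u00c4rger", "Apfel", "Ofen", "\u00d6l"]
for_swedish_user = sorted(german_data, key=swedish_key)
for_german_user = sorted(german_data, key=german_key)
print(for_swedish_user)  # Apfel, Ofen, Zimmer, Ärger, Öl
print(for_german_user)   # Apfel, Ärger, Ofen, Öl, Zimmer
```

The data is identical in both calls; only the user's expected ordering changes the result.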

How do I make sure my users get the results they expect?
Make sure you’re using a collation-aware mechanism to order data
Windows APIs such as CompareString, LCMapString
SQL Server 2000 collations
The .NET Framework's CompareInfo class
Except when you want non-linguistic collation…
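As a rough analogue of the mechanisms listed above, the sketch below uses Python's standard `locale` module, which delegates to the collation support of the underlying platform. It is a stand-in chosen for illustration, not one of the Windows APIs named on the slide. The function name is hypothetical, and the resulting order depends on the locale configured on the machine.

```python
import locale

def sort_for_current_user(words):
    """Sort with the collation rules of the user's configured locale."""
    try:
        locale.setlocale(locale.LC_COLLATE, "")  # adopt the user's locale
    except locale.Error:
        pass  # keep whatever collation locale is already active
    # strxfrm turns each string into a key that compares in locale order.
    return sorted(words, key=locale.strxfrm)

print(sort_for_current_user(["zebra", "Apple", "apple", "Zebra"]))
```

Note that `locale.setlocale` changes process-wide state, so a larger program would normally call it once at startup rather than inside a helper.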

When not to use linguistic collation
When consistency across different cultures is required
– “Case insensitive” file systems
– File extension names (.INF, .GIF, etc.)
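For identifiers such as file extensions, a comparison that ignores the user's language settings is usually the safer choice. For example, Turkish casing rules map "i" to a dotted capital "İ", so a linguistic case-insensitive match of ".gif" against ".GIF" could fail. The helper below is a minimal sketch with a hypothetical name. It relies on `str.casefold`, which applies the same Unicode case folding on every machine.

```python
import os

def has_extension(filename, extension):
    """Compare a file extension without consulting any locale."""
    suffix = os.path.splitext(filename)[1]
    # casefold() gives the same result regardless of language settings.
    return suffix.casefold() == extension.casefold()

print(has_extension("SETUP.INF", ".inf"))  # True
print(has_extension("photo.gif", ".GIF"))  # True
print(has_extension("notes.txt", ".inf"))  # False
```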

When not to use linguistic collation
When users expect data in a specific collation other than their own
– Excel column names
– “ASCII” order
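Spreadsheet column names are a case where users expect one fixed order whatever their language: A through Z, then AA, AB, and so on. A minimal sketch of that ordering follows. The helper name `column_key` is hypothetical, and the input is assumed to be upper-case ASCII column names.

```python
def column_key(name):
    """Order spreadsheet column names: shorter names first, then A to Z."""
    return (len(name), name)

columns = ["AA", "B", "Z", "A", "AB", "C"]
ordered = sorted(columns, key=column_key)
print(ordered)  # ['A', 'B', 'C', 'Z', 'AA', 'AB']
```

Within names of equal length the comparison is plain code point ("ASCII") order, which is exactly what is wanted here.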

In summary…
Linguistically aware collation is an important feature of any well-globalized product
Collation needs to be considered at the language level
– Encoding, region, or script level is not enough!
There are many collation-aware mechanisms out there (within the OS, for example); take advantage of them!

Other applicable IUC talks
Stay tuned for the second half of this tutorial!
Cathy's "Issues in Indic Collation" talk on Thursday afternoon

Other References
This tutorial's corresponding paper
Unicode Technical Note (UTN) #1: http://unicode.org/notes/tn1/
Nadine Kano, Developing International Software (out of print, but still available on the web): http://microsoft.com/globaldev/dis_v1/disv1.asp
New! Developing International Software, 2nd edition (available now or very soon): http://microsoft.com/globaldev/dis_v2/disv2.asp
Michael Kaplan, Internationalization with VB: http://i18nWithVB.com/

Questions?

Don't forget to fill out your evals!
Sorting it all out: An introduction to collation